Add cjk_friendly_emphasis extension for CJK underscore emphasis #1599
Open
sotanengel wants to merge 1 commit into Python-Markdown:master from
Python-Markdown's underscore emphasis (`_em_`, `__strong__`) uses `\w` word boundaries, which fail with CJK text: CJK characters match `\w` in Python 3, so emphasis adjacent to CJK characters is not recognized. This extension relaxes the boundary check so that CJK characters are treated as valid emphasis boundaries while preserving ASCII intraword protection (e.g., `foo_bar_baz` remains unaffected).
Usage: `markdown.markdown(text, extensions=['cjk_friendly_emphasis'])`
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds a new `cjk_friendly_emphasis` extension that enables underscore emphasis (`_em_`, `__strong__`) to work correctly adjacent to CJK (Chinese, Japanese, Korean) characters.
Motivation
Python-Markdown's underscore emphasis patterns use `\w` word boundaries (via `(?<!\w)` and `(?!\w)`) to prevent intraword emphasis in ASCII text like `foo_bar_baz`. However, CJK characters are classified as `\w` in Python 3's regex engine, which means underscore emphasis fails when directly adjacent to CJK text.
Note: Asterisk emphasis (`*`/`**`) already works with CJK text because it has no word-boundary check. This extension only needs to fix underscore behavior.
Background
This is part of a broader effort to improve CJK emphasis handling across Markdown implementations. The root issue is documented in commonmark-spec#650. The markdown-cjk-friendly project provides a formal specification and implementations for CommonMark-based parsers.
While Python-Markdown follows Gruber's original Markdown rather than CommonMark, the CJK underscore emphasis issue is the same fundamental problem: word-boundary assumptions designed for space-separated languages fail for CJK text.
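
To make the motivation concrete, here is a minimal standard-library check of the `\w` behavior described above. `GUARDED_EM` and `finds_emphasis` are hypothetical names, and the pattern is a simplified stand-in for a `\w`-guarded underscore rule, not the pattern Python-Markdown actually ships.

```python
import re

# In Python 3, str patterns are Unicode-aware by default,
# so \w matches CJK letters as well as ASCII ones.
assert re.fullmatch(r"\w", "重") is not None
assert re.fullmatch(r"\w", "강") is not None

# Simplified stand-in for underscore emphasis with \w guards on both sides.
# Illustrative only; this is NOT Python-Markdown's real pattern.
GUARDED_EM = re.compile(r"(?<!\w)_(?!_)(.+?)_(?!\w)")


def finds_emphasis(text):
    """Return True if the guarded pattern finds an emphasis span in text."""
    return GUARDED_EM.search(text) is not None


print(finds_emphasis("this is _important_ text"))  # True: spaces surround the underscores
print(finds_emphasis("foo_bar_baz"))               # False: intraword protection works
print(finds_emphasis("これは_重要_です"))            # False: CJK neighbours count as \w
```

The last case is the one this PR targets: the same guard that protects `foo_bar_baz` also blocks emphasis inside an unspaced CJK sentence.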
Changes
New file: `markdown/extensions/cjk_friendly_emphasis.py`
- Uses `(?:(?<!\w)|(?<=CJK_CHAR))` instead of `(?<!\w)`
- Adds `CJKUnderscoreProcessor`, which overrides `UnderscoreProcessor` with CJK-friendly regex patterns
- Follows `legacy_em.py` for extension structure
New file: `tests/test_syntax/extensions/test_cjk_friendly_emphasis.py`
14 test cases covering:
- `__重要__`, `__「異常」__`, `__重要。__`
- `__强调__`
- `__강조__`
- ASCII intraword protection (`foo_bar_baz`, `foo__bar__baz`)
Usage
`markdown.markdown(text, extensions=['cjk_friendly_emphasis'])`
Design decisions
- Opt-in via `extensions=['cjk_friendly_emphasis']`, no change to default behavior
- Only `_`/`__` need fixing
- `foo_bar_baz` remains unaffected because the boundary relaxation only applies to CJK characters
- Follows the `legacy_em.py` pattern: minimal code, subclasses `UnderscoreProcessor`, same registration mechanism
Test plan
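
The boundary relaxation described under Changes can be illustrated with a standalone sketch that uses only the `re` module. It is not the extension's code. The character ranges in `CJK_CHAR` are an assumed, non-exhaustive set chosen for illustration (the ranges the PR uses are not shown here), the trailing guard `RIGHT` is assumed by symmetry with the documented lookbehind, and `RELAXED_EM` and `emphasized` are hypothetical names.

```python
import re

# ASSUMPTION: a small, non-exhaustive set of CJK ranges for illustration
# (Hiragana/Katakana, CJK Unified Ideographs, Hangul Syllables).
CJK_CHAR = r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]"

# Leading guard as described in the PR: either no word character before
# the underscore, or the preceding character is CJK.
LEFT = rf"(?:(?<!\w)|(?<={CJK_CHAR}))"
# ASSUMPTION: the trailing guard mirrors the leading one.
RIGHT = rf"(?:(?!\w)|(?={CJK_CHAR}))"

# Single-underscore emphasis only; a simplified stand-in, not the real rule.
RELAXED_EM = re.compile(rf"{LEFT}_(?!_)(.+?)_{RIGHT}")


def emphasized(text):
    """Return the first emphasized span found in text, or None."""
    match = RELAXED_EM.search(text)
    return match.group(1) if match else None


print(emphasized("これは_重要_です"))      # 重要
print(emphasized("이것은_강조_입니다"))     # 강조
print(emphasized("plain _word_ here"))   # word
print(emphasized("foo_bar_baz"))         # None: ASCII intraword protection kept
```

Because the relaxation is expressed as an extra alternative next to the original `\w` guard, ASCII input takes the same path as before, which is why `foo_bar_baz` is still left alone.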
🤖 Generated with Claude Code