Add cjk_friendly_emphasis extension for CJK underscore emphasis #1599

Open
sotanengel wants to merge 1 commit into Python-Markdown:master from sotanengel:feat/cjk-friendly-emphasis

@sotanengel

Summary

Adds a new cjk_friendly_emphasis extension that enables underscore emphasis (_em_, __strong__) to work correctly adjacent to CJK (Chinese, Japanese, Korean) characters.

Motivation

Python-Markdown's underscore emphasis patterns guard both delimiters with \w lookarounds ((?<!\w) before the opening delimiter and (?!\w) after the closing one) to prevent intraword emphasis in ASCII text like foo_bar_baz. However, Python 3's re module is Unicode-aware for str patterns, so CJK characters also match \w, which means underscore emphasis fails when directly adjacent to CJK text:

>>> import markdown
>>> markdown.markdown('これは__重要__です')
'<p>これは__重要__です</p>'  # Expected: <p>これは<strong>重要</strong>です</p>

Note: Asterisk emphasis (*/**) already works with CJK text because it has no word-boundary check. This extension only needs to fix underscore behavior.
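
The failure can be reproduced with the standard re module alone. The sketch below uses a deliberately simplified guarded pattern (GUARDED_STRONG is a hypothetical name, not the library's exact regex) to show that the same lookarounds that protect foo__bar__baz also block emphasis next to CJK characters:

```python
import re

# Simplified word-boundary-guarded strong pattern. This is an illustrative
# assumption, not the exact regex used by Python-Markdown.
GUARDED_STRONG = re.compile(r'(?<!\w)__(.+?)__(?!\w)')

# str patterns are Unicode-aware by default, so CJK characters count as \w.
cjk_is_word = all(re.fullmatch(r'\w', ch) for ch in '重は강')

# Intraword ASCII: the opener is preceded by 'o', so nothing matches.
ascii_intraword = GUARDED_STRONG.search('foo__bar__baz')

# CJK-adjacent: the opener is preceded by 'は', which also matches \w.
cjk_adjacent = GUARDED_STRONG.search('これは__重要__です')

# With spaces around the delimiters, the guards pass and the pattern matches.
space_separated = GUARDED_STRONG.search('これは __重要__ です')
```

Because CJK text is normally written without spaces, the third case is not a practical workaround for authors.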

Background

This is part of a broader effort to improve CJK emphasis handling across Markdown implementations. The root issue is documented in commonmark-spec#650. The markdown-cjk-friendly project provides a formal specification and implementations for CommonMark-based parsers.

While Python-Markdown follows Gruber's original Markdown rather than CommonMark, the CJK underscore emphasis issue is the same fundamental problem: word-boundary assumptions designed for space-separated languages fail for CJK text.

Changes

New file: markdown/extensions/cjk_friendly_emphasis.py

  • Defines CJK-aware boundary patterns: (?:(?<!\w)|(?<=CJK_CHAR)) instead of (?<!\w)
  • Creates CJKUnderscoreProcessor that overrides UnderscoreProcessor with CJK-friendly regex patterns
  • CJK character class covers: CJK Unified Ideographs, Hiragana, Katakana, Hangul Syllables, fullwidth forms, and related blocks
  • Follows the same pattern as legacy_em.py for extension structure
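
As a rough illustration of the relaxed boundary idea described above, here is a minimal standalone sketch using only the re module. The names CJK_CHAR, OPEN, CLOSE, and RELAXED_STRONG are hypothetical, the Unicode ranges are an assumed subset of the blocks listed, and the trailing-boundary form is inferred by symmetry; the PR's actual character class and patterns may differ:

```python
import re

# Assumed subset of CJK-related blocks: CJK Symbols and Punctuation,
# Hiragana, Katakana, CJK Unified Ideographs, Hangul Syllables, and
# Halfwidth and Fullwidth Forms. Not the PR's actual character class.
CJK_CHAR = (r'[\u3000-\u303F\u3040-\u309F\u30A0-\u30FF'
            r'\u4E00-\u9FFF\uAC00-\uD7AF\uFF00-\uFFEF]')

# Relaxed guards: accept either a non-word neighbor or a CJK neighbor.
OPEN = rf'(?:(?<!\w)|(?<={CJK_CHAR}))'
CLOSE = rf'(?:(?!\w)|(?={CJK_CHAR}))'

# Simplified strong pattern for illustration only.
RELAXED_STRONG = re.compile(rf'{OPEN}__(?!_)(.+?)(?<!_)__{CLOSE}')


def render_strong(text):
    """Replace __x__ with <strong>x</strong> wherever the relaxed guards allow."""
    return RELAXED_STRONG.sub(r'<strong>\1</strong>', text)


japanese = render_strong('これは__重要__です')
korean = render_strong('이것은__강조__입니다')
ascii_intraword = render_strong('foo__bar__baz')  # left unchanged
```

The alternation keeps the original ASCII guard as its first branch, so text such as foo__bar__baz only passes if the neighboring character is in the CJK class, which it never is for Latin letters.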

New file: tests/test_syntax/extensions/test_cjk_friendly_emphasis.py

14 test cases covering:

  • Japanese: __重要__, __「異常」__, __重要。__
  • Chinese: __强调__
  • Korean: __강조__
  • Mixed CJK/Latin
  • ASCII intraword protection preserved (foo_bar_baz, foo__bar__baz)
  • Asterisk emphasis unchanged
  • Without-extension baseline verification

Usage

import markdown
html = markdown.markdown('これは__重要__です', extensions=['cjk_friendly_emphasis'])
# '<p>これは<strong>重要</strong>です</p>'

Design decisions

  1. Extension, not core change: opt-in via extensions=['cjk_friendly_emphasis'], no change to default behavior
  2. Underscore only: asterisk emphasis already works with CJK; only _/__ need fixing
  3. ASCII protection preserved: foo_bar_baz remains unaffected because the boundary relaxation only applies to CJK characters
  4. Follows legacy_em.py pattern: minimal code, subclasses UnderscoreProcessor, same registration mechanism

Test plan

  • All 14 CJK-specific tests pass
  • All 1099 existing tests pass (0 failures, 110 skipped as before)
  • ASCII intraword underscore protection unchanged

🤖 Generated with Claude Code

Python-Markdown's underscore emphasis (`_em_`, `__strong__`) uses `\w` word
boundaries which fail with CJK text because CJK characters match `\w` in
Python 3, preventing emphasis adjacent to CJK characters.

This extension relaxes the boundary check so CJK characters are treated as
valid emphasis boundaries while preserving ASCII intraword protection
(e.g., `foo_bar_baz` remains unaffected).

Usage: `markdown.markdown(text, extensions=['cjk_friendly_emphasis'])`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>