Support CJK and emoji in markdown #3013

ikatyang · 2017-10-12T03:18:59Z

I have a question.
Does this markdown formatter support multi-byte languages such as Japanese, or emoji?

It seems that the current implementation has two problems.

First, the current implementation breaks on Chinese-Japanese-Korean (CJK) text and emoji,
because some of the printing algorithms depend on string.length.

For example, "❤️".length is 2, so the algorithm at the following link falls out of alignment.

https://github.com/prettier/prettier/pull/2943/files#diff-385bd78c43a57ae55923bfea744a5ae3R359
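
To make the mismatch concrete, here is a minimal sketch of what string.length actually counts. The helper name `measure` is made up for illustration and is not part of Prettier; the strings are written as escapes so the exact code points are unambiguous.

```javascript
// Count UTF-16 code units (what .length reports) and Unicode code points.
// Neither number is the display width in columns.
function measure(text) {
  return { codeUnits: text.length, codePoints: [...text].length };
}

const heart = "\u2764\uFE0F"; // ❤️ = HEAVY BLACK HEART + VARIATION SELECTOR-16
const face = "\u{1F600}"; // 😀 = one code point stored as a surrogate pair
const han = "\u4E2D\u6587"; // 中文 = two ideographs, usually two columns each

console.log(measure(heart)); // { codeUnits: 2, codePoints: 2 }
console.log(measure(face)); // { codeUnits: 2, codePoints: 1 }
console.log(measure(han)); // { codeUnits: 2, codePoints: 2 }
```

The two ideographs report a length of 2 but typically occupy four columns, so padding or wrapping computed from .length drifts as soon as wide characters appear.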

This issue is known as the East Asian Width problem, or the Unicode problem.

UAX #11: East Asian Width

JavaScript has a Unicode problem · Mathias Bynens

I know that the East Asian Width problem is very difficult (I don't know of a perfect solution...).
Some libraries compute the display width of a Unicode string:

timoxley/wcwidth: Port of C's wcwidth() and wcswidth()

sindresorhus/string-width: Get the visual width of a string - the number of columns required to display it

power-assert-runtime/packages/power-assert-util-string-width at master · twada/power-assert-runtime

Second, the following tokenizer cannot tokenize non-English text,
because non-English text such as Chinese doesn't put spaces between words.
    .split(/(\s+)/g)
https://github.com/prettier/prettier/pull/2943/files#diff-0f09e16c6ee7c5a40fef83b315027002R149
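
A small sketch of the effect, using the same regular expression as the snippet above. The wrapper name `tokenize` is only for illustration; the linked code may be structured differently.

```javascript
// Split on runs of whitespace, keeping the separators (capture group).
const tokenize = (text) => text.split(/(\s+)/g);

const english = tokenize("hello big world");
const chinese = tokenize("一串很長的中文字");

console.log(english); // [ 'hello', ' ', 'big', ' ', 'world' ]
console.log(chinese); // [ '一串很長的中文字' ]
```

Because the Chinese sentence contains no whitespace, it comes back as a single token and the printer has no place to break the line.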

This could be resolved by using a tokenizer such as nlcst-parse-japanese, parse-english, or rakutenma.
However, that is neither realistic nor perfect, because tokenizers are heavyweight (large file size and slow parsing).

An example tokenizer: Text Tokenizer · Hivemall User Manual

Thanks.

ikatyang · 2017-10-12T03:19:07Z

from @ikatyang #2943 (comment):

Thanks for the suggestion. I've already thought about it and just merged #3003 to fix the printer first. CJK support is a work in progress now, but I think it should go in a separate PR, since this PR is already too large.

I think there's no need to use a tokenizer for CJK, since AFAIK CJK text can be broken at any place, and that's how CJK books are printed, e.g.
一串很長很長很長很長很長很長很長很長很長很長很長的中文字，一串很長很長很長很長很長很
長很長很長很長很長很長的中文字，一串很長很長很長很長很長很長很長很長很長很長很長的中
文字，一串很長很長很長很長很長很長很長很長很長很長很長的中文字。
As for string width, using string-width should be enough, but we have to disable its stripAnsi feature.
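
The "break anywhere" idea can be sketched as below. This is not what #3003 or any Prettier code does: `isWide` is a deliberately simplified, hypothetical stand-in for string-width that only knows a few wide ranges, and `wrapByWidth` ignores the punctuation rules real typesetting applies at line starts and ends.

```javascript
// Assumption: rough approximation of "wide" code points, not the full
// East Asian Width table from UAX #11.
function isWide(codePoint) {
  return (
    (codePoint >= 0x1100 && codePoint <= 0x115f) || // Hangul Jamo
    (codePoint >= 0x2e80 && codePoint <= 0xa4cf) || // CJK radicals through Yi
    (codePoint >= 0xac00 && codePoint <= 0xd7a3) || // Hangul syllables
    (codePoint >= 0xf900 && codePoint <= 0xfaff) || // CJK compatibility ideographs
    (codePoint >= 0xff00 && codePoint <= 0xff60) // fullwidth forms
  );
}

// Break after any character once the next one would exceed maxColumns.
function wrapByWidth(text, maxColumns) {
  const lines = [];
  let line = "";
  let columns = 0;
  for (const char of text) {
    const width = isWide(char.codePointAt(0)) ? 2 : 1;
    if (columns + width > maxColumns && line !== "") {
      lines.push(line);
      line = "";
      columns = 0;
    }
    line += char;
    columns += width;
  }
  if (line !== "") lines.push(line);
  return lines;
}

console.log(wrapByWidth("中文中文中文", 4)); // [ '中文', '中文', '中文' ]
console.log(wrapByWidth("abcdef", 4)); // [ 'abcd', 'ef' ]
```

Measuring in columns rather than characters is what keeps the wrapped CJK lines the same visual length as wrapped Latin lines.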

azz · 2017-10-12T11:34:14Z

Was this done in #3015?

ikatyang · 2017-10-12T11:44:34Z

No, I still have to define how to split CJK text, since it isn't separated by whitespace.

azz · 2017-10-12T11:46:08Z

Gotcha!

azu · 2017-10-12T13:07:04Z

In the future, we will be able to use the Intl.Segmenter API, which is based on CLDR/ICU.
Intl.Segmenter supports text in almost every language, including CJK.
(Currently, this proposal is at Stage 2.)
Also, V8 has the prefixed Intl.v8BreakIterator, a nonstandard segmentation API that is deprecated.

Intl: Consider deprecating Intl.v8BreakIterator · Issue #8865 · nodejs/node

For more details about text wrapping, see CSS Text Module Level 3.
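
For illustration, here is a sketch of how the Intl.Segmenter API mentioned above can be called. It assumes a runtime that ships Intl.Segmenter, which was not the case when this comment was written; the helper name `segment` is hypothetical. Word boundaries for Chinese come from the runtime's ICU data, so the exact word list is not shown.

```javascript
// Assumption: Intl.Segmenter is available in the current runtime.
function segment(text, granularity, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  // Each iteration result has a `segment` string and its `index`.
  return Array.from(segmenter.segment(text), (entry) => entry.segment);
}

// "\u2764\uFE0F" is ❤️: two code units, one user-perceived character.
const graphemes = segment("\u2764\uFE0F", "grapheme", "en");
console.log(graphemes.length); // 1

// Word segmentation of text without spaces; the result depends on ICU data.
const words = segment("中文中文中文", "word", "zh");
console.log(words);
```

Grapheme granularity addresses the emoji length problem, while word granularity offers candidate break points for text that has no whitespace.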

ikatyang · 2017-10-13T07:03:15Z

We may have to add an option for splitting CJK text (--split-cjk-text), since splitting introduces additional whitespace in the rendered content. Some markdown converters handle this properly (natively or via a plugin), but most of them do not.

(markdown)

中文中文中文
中文中文中文
中文中文中文

(html)

<p>中文中文中文
中文中文中文
中文中文中文</p>

(rendered content)

中文中文中文 中文中文中文 中文中文中文

There shouldn't be any whitespace between CJK characters. The expected output is:

<p>中文中文中文中文中文中文中文中文中文</p>

中文中文中文中文中文中文中文中文中文

Related discussion: markdown-it/markdown-it#334
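
The whitespace rule described above could be sketched as a post-processing step like the following. This is a hypothetical helper, not how Prettier, markdown-it, or any converter is known to implement it, and the character class is a rough assumption covering only kana and common Han ideographs.

```javascript
// Assumption: rough test for Han ideographs and kana only. Hangul is left
// out on purpose because Korean separates words with spaces.
const HAN_OR_KANA = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]/;

// Replace each soft line break with nothing when it sits between two
// Han/kana characters, and with a single space otherwise.
function joinSoftBreaks(text) {
  return text.replace(/\n/g, (newline, offset) => {
    const before = text[offset - 1] || "";
    const after = text[offset + 1] || "";
    return HAN_OR_KANA.test(before) && HAN_OR_KANA.test(after) ? "" : " ";
  });
}

console.log(joinSoftBreaks("中文中文中文\n中文中文中文\n中文中文中文"));
// 中文中文中文中文中文中文中文中文中文
console.log(joinSoftBreaks("hello\nworld")); // hello world
```

Latin text keeps the space a renderer would normally insert, while consecutive CJK lines are joined seamlessly.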

ikatyang added the lang:markdown and type:enhancement labels Oct 12, 2017

ikatyang self-assigned this Oct 12, 2017

ikatyang mentioned this issue Oct 14, 2017

feat(markdown): support CJK and emoji #3026

Merged

ikatyang closed this as completed in #3026 Oct 15, 2017

lock bot added the locked-due-to-inactivity label Jul 6, 2018

lock bot locked as resolved and limited conversation to collaborators Jul 6, 2018
