Support CJK and emoji in markdown #3013

Closed
ikatyang opened this issue Oct 12, 2017 · 6 comments
Labels
lang:markdown Issues affecting Markdown locked-due-to-inactivity Please open a new issue and fill out the template instead of commenting. type:enhancement A potential new feature to be added, or an improvement to how we print something

Comments

@ikatyang
Member

from @azu #2943 (comment):

I have a question.
Does this markdown formatter support multi-byte languages like Japanese, or emoji?

It seems that the current implementation has two problems.

First, the current implementation will break on Chinese-Japanese-Korean (CJK) text and emoji,
because some of the printing algorithms depend on string.length.

For example, "❤️".length is 2, so an algorithm that relies on it falls out of alignment.

This issue is known as the East Asian Width problem, or more generally as a Unicode width problem.
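
As a rough illustration of the mismatch described above (this is not the formatter's own code), the snippet below compares `length` with the code point count for a few characters. The characters are written as escapes so that the values are unambiguous; the variable names are made up for the example.

```javascript
// U+2764 (heavy black heart) followed by U+FE0F (variation selector).
const heart = "\u2764\uFE0F";
// U+1F600 is one emoji, stored in JavaScript as a surrogate pair.
const grin = "\u{1F600}";
// U+4E2D is one CJK ideograph, normally drawn two columns wide.
const zhong = "\u4E2D";

console.log(heart.length); // 2 UTF-16 code units
console.log(grin.length); // 2 UTF-16 code units
console.log([...grin].length); // 1 code point
console.log(zhong.length); // 1 code unit, although it usually takes 2 columns
```

In none of these cases does `length` match the number of columns a monospace terminal or editor typically uses.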

I know that the East Asian Width problem is very difficult (I don't know of a perfect solution).
Some libraries compute the Unicode display width.

Second, the following tokenizer cannot tokenize non-English text,
because text in languages like Chinese has no spaces between words.

    .split(/(\s+)/g)
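
A minimal sketch of the effect, using the same split pattern as quoted above. The `tokenize` wrapper and the sample strings are assumptions made for the example.

```javascript
// Same pattern as above; the capture group keeps the whitespace as tokens.
const tokenize = (text) => text.split(/(\s+)/g);

// English text yields one token per word plus the whitespace in between.
const english = tokenize("a long sentence");
console.log(english); // ["a", " ", "long", " ", "sentence"]

// Chinese text without spaces comes back as one unbreakable token.
const chinese = tokenize("一串很長的中文字");
console.log(chinese.length); // 1
```

With only one token, a line-wrapping printer has no place to break the Chinese text.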

This could be resolved by using a tokenizer such as nlcst-parse-japanese, parse-english, or rakutenma.
However, that is neither realistic nor perfect, because tokenizers are heavyweight (large file size and slow parsing).

For a tokenizer example, see Text Tokenizer · Hivemall User Manual.

Thanks.

@ikatyang
Member Author

from @ikatyang #2943 (comment):

Thanks for the suggestion. I had already thought about it and just merged #3003 to fix the printer first. CJK support is a work in progress, but I think it should go in a separate PR, since this one is already rather large.

I don't think a tokenizer is needed for CJK, since AFAIK CJK text can be broken at any position, which is how CJK books are printed, e.g.

一串很長很長很長很長很長很長很長很長很長很長很長的中文字,一串很長很長很長很長很長很
長很長很長很長很長很長的中文字,一串很長很長很長很長很長很長很長很長很長很長很長的中
文字,一串很長很長很長很長很長很長很長很長很長很長很長的中文字。

As for string width, using string-width should be enough, but we have to disable its stripAnsi feature.
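
To make the idea of a display width concrete, here is a deliberately small approximation. It is not the string-width implementation: `displayWidth` and `WIDE_RANGES` are hypothetical names, the range table is an assumption that covers only a few common CJK blocks, and emoji are not treated as wide.

```javascript
// Hypothetical, incomplete table of code point ranges drawn two columns wide.
const WIDE_RANGES = [
  [0x1100, 0x115f], // Hangul Jamo
  [0x3000, 0x303f], // CJK symbols and punctuation
  [0x3040, 0x30ff], // Hiragana and Katakana
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xac00, 0xd7a3], // Hangul syllables
  [0xff01, 0xff60], // Fullwidth forms
];

function displayWidth(text) {
  let width = 0;
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const char of text) {
    const cp = char.codePointAt(0);
    // Variation selector and zero-width joiner take no column.
    if (cp === 0xfe0f || cp === 0x200d) continue;
    const isWide = WIDE_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
    width += isWide ? 2 : 1;
  }
  return width;
}

console.log(displayWidth("abc")); // 3
console.log(displayWidth("中文")); // 4
console.log("中文".length); // 2
```

A real implementation needs the full East Asian Width data from Unicode, which is why relying on an existing library is the more practical route.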

@ikatyang ikatyang added lang:markdown Issues affecting Markdown type:enhancement A potential new feature to be added, or an improvement to how we print something labels Oct 12, 2017
@ikatyang ikatyang self-assigned this Oct 12, 2017
@azz
Member

azz commented Oct 12, 2017

Was this done in #3015?

@ikatyang
Member Author

No, I still have to define how to split CJK text, since it is not separated by whitespace.
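
One possible way to define such a split, sketched under the assumption that every CJK character may be followed by a line break: keep runs of non-CJK characters together as words and emit each CJK character as its own token. `splitText` and the `CJK` pattern are hypothetical, not what was eventually implemented, and the sketch ignores rules about punctuation that must not start or end a line.

```javascript
// Assumed definition of "CJK character", based on Unicode script properties.
const CJK = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;

function splitText(text) {
  const tokens = [];
  for (const chunk of text.split(/\s+/)) {
    if (chunk === "") continue;
    let word = "";
    for (const char of chunk) {
      if (CJK.test(char)) {
        // Flush any pending non-CJK word, then emit the CJK character alone.
        if (word !== "") tokens.push(word);
        word = "";
        tokens.push(char);
      } else {
        word += char;
      }
    }
    if (word !== "") tokens.push(word);
  }
  return tokens;
}

console.log(splitText("use 中文 now")); // ["use", "中", "文", "now"]
console.log(splitText("abc中文def")); // ["abc", "中", "文", "def"]
```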

@azz
Member

azz commented Oct 12, 2017

Gotcha!

@azu
Contributor

azu commented Oct 12, 2017

In the future, we will be able to use the Intl.Segmenter API, which is based on CLDR/ICU.
Intl.Segmenter supports text in almost every language, including CJK text.
(Currently, this proposal is at Stage 2.)
V8 also has the prefixed Intl.v8BreakIterator, a nonstandard segmentation API that is deprecated.

For more details about text wrapping, see CSS Text Module Level 3.
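
A minimal sketch of word segmentation with Intl.Segmenter, assuming a runtime that provides it (the API was still a proposal when this comment was written). `segmentWords` is a hypothetical helper; it returns null when the API is missing. The exact word boundaries depend on the ICU data in the runtime, so no particular output is shown.

```javascript
function segmentWords(text, locale) {
  // Feature-detect, since older runtimes do not provide Intl.Segmenter.
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null;
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // segment() returns an iterable of objects; keep only the text of each one.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const sentence = "今日は良い天気です";
const words = segmentWords(sentence, "ja");
console.log(words);
```

Whatever boundaries are chosen, concatenating the segments gives back the original string, so a printer could treat each boundary as a possible break point.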

@ikatyang
Member Author

We may have to add an option for splitting CJK text (--split-cjk-text), because the added line breaks show up as additional whitespace in the rendered content. Some markdown converters handle this properly (natively or via a plugin), but most do not.

(markdown)

中文中文中文
中文中文中文
中文中文中文

(html)

<p>中文中文中文
中文中文中文
中文中文中文</p>

(rendered content)

中文中文中文 中文中文中文 中文中文中文

There shouldn't be any whitespace between CJK characters. The expected HTML and rendered content are:

<p>中文中文中文中文中文中文中文中文中文</p>
中文中文中文中文中文中文中文中文中文
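
The behaviour wanted from a converter can be sketched as follows: when joining the lines of a paragraph, drop the line break if the characters on both sides are CJK, and otherwise turn it into a space. `joinSoftBreaks` and `CJK_CHAR` are hypothetical names, and the rule for mixed CJK and Latin boundaries is an assumption, not something settled in this thread.

```javascript
// Assumed definition of "CJK character", based on Unicode script properties.
const CJK_CHAR = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;

function joinSoftBreaks(paragraph) {
  const lines = paragraph.split("\n");
  let result = lines[0];
  for (let i = 1; i < lines.length; i++) {
    // Compare the last character before the break with the first one after it.
    const before = Array.from(lines[i - 1]).pop() || "";
    const after = Array.from(lines[i])[0] || "";
    const bothCjk = CJK_CHAR.test(before) && CJK_CHAR.test(after);
    result += (bothCjk ? "" : " ") + lines[i];
  }
  return result;
}

console.log(joinSoftBreaks("中文中文中文\n中文中文中文\n中文中文中文"));
// 中文中文中文中文中文中文中文中文中文
console.log(joinSoftBreaks("hello\nworld")); // hello world
```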

Related discussion: markdown-it/markdown-it#334

@lock lock bot added the locked-due-to-inactivity Please open a new issue and fill out the template instead of commenting. label Jul 6, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Jul 6, 2018