Unify logic to find and replace line terminators #7949

fasttime · 2017-01-20T05:42:39Z

The purpose of this proposal is to unify the logic used to find and replace the line terminator sequences "\r\n", "\n", "\r", "\u2028" and "\u2029" throughout the code, following the suggestions in #7924 (review).

A recursive text search for the most common patterns "\n" and "\r" in the lib folder reveals a
number of similar regular expressions used in similar ways in several rules and other modules.

Here is an overview of the code in question based on the current master snapshot.
I added comments to explain what the code apparently does.

lib/ast-utils.js:294:        const lineEndingPattern = /\r\n|[\r\n\u2028\u2029]/g;

Find index and length of all LT sequences in a string

lib/eslint.js:803:                stripUnicodeBOM(text).replace(/^#!([^\r\n]+)/, (match, captured) => {

Comment out shebang

lib/rules/brace-style.js:65:            const NEWLINE_REGEX = /\r\n|\r|\n|\u2028|\u2029/g;

Strip all LTs from a string

lib/rules/eol-last.js:82:                            const finalEOLs = /(?:\r?\n)+$/,

Strip trailing newlines from a string

lib/rules/func-call-spacing.js:91:            const hasNewline = hasWhitespace && /[\n\r\u2028\u2029]/.test(textBetweenTokens);

Test if a string contains a LT

lib/rules/key-spacing.js:18:    return /[\n\r\u2028\u2029]/.test(str);

Test if a string contains a LT

lib/rules/linebreak-style.js:63:                    pattern = /\r\n|\r|\n|\u2028|\u2029/g;

Find index and length of all LT sequences in a string

lib/rules/newline-after-var.js:204:                        const NEWLINE_REGEX = /\r\n|\r|\n|\u2028|\u2029/;

Split lines

lib/rules/newline-per-chained-call.js:50:            const lines = sourceCode.getText(node.property).split(/\r\n|\r|\n/g);

Split lines (misses \u2028 and \u2029)

lib/rules/no-irregular-whitespace.js:16:const LINE_BREAK = /\r\n|\r|\n|\u2028|\u2029/g;

Find index of all LT sequences in a string (assumes length of sequence is always 1)

lib/rules/no-multi-spaces.js:96:                    pattern = /[^\n\r\u2028\u2029\t ].? {2,}/g;  // note: repeating space

Not sure about this one

lib/rules/no-multi-str.js:42:                const lineBreak = /\n/;

Test if a string contains a LT (misses \r, \u2028 and \u2029)

lib/rules/no-trailing-spaces.js:84:                    linebreaks = sourceCode.getText().match(/\r\n|\r|\n|\u2028|\u2029/g);

Find length of all LT sequences in a string

lib/rules/no-unused-vars.js:543:            const lineInComment = (prefix.match(/\n/g) || []).length;

Count LT sequences in a string (misses \r, \u2028 and \u2029)

lib/rules/no-unused-vars.js:546:                column -= 1 + prefix.lastIndexOf("\n");

Find index of last LT sequences (misses \r, \u2028 and \u2029)

lib/rules/operator-linebreak.js:14:const LINEBREAK_REGEX = /\r\n|\r|\n|\u2028|\u2029/g;

Strip all LT sequences from a string

lib/rules/quotes.js:54:    return newQuote + str.slice(1, -1).replace(/\\(\${|\r\n?|\n|.)|["'`]|\${|(\r\n?|\n)/g, (match, escaped, newline) => {

\r\n?|\n|. should match any character but misses \u2028 and \u2029

lib/rules/quotes.js:265:                const shouldWarn = node.quasis.length === 1 && !/(^|[^\\])(\\\\)*[\r\n\u2028\u2029]/.test(node.quasis[0].value.raw);

Match line continuations

lib/rules/spaced-comment.js:92:        pattern += "(?:$|[\n\r]))";

Match end of line (misses \u2028 and \u2029)

lib/util/source-code.js:148:    return text.split(/\r\n|\r|\n|\u2028|\u2029/g);

Split lines

The comments in bold are what I suspect to be problems in the current implementation. Please, don't kill me if I'm wrong. I just added those comments for reference, so we don't happen to change the current behavior accidentally. Each of those should be discussed in a different issue if necessary.

Besides that, the analysis above shows at least four common use cases.

Strip all LTs from a string: str.replace(/\n|\r|\u2028|\u2029/g, ""), the regular expression here can be safely stored and reused, so that only one object is created.

Split lines: str.split(/\r\n|\r|\n|\u2028|\u2029/g), this regular expression can be also stored
in a static context and reused across function calls. As suggested by @not-an-aardvark in the comment below, to split the contents of a node in its lines, it is be better to use sourceCode.lines with node.loc.start and node.loc.end as start and ending positions respectively.

Find index and length of all LT sequences in a string: This is basically something like

var regExp = /\r\n|\r|\n|\u2028|\u2029/g;
while (match = regExp.exec(str))
{
    var index = match.index;
    var length = match[0].length;
    ...
}

The regexp literal here is identical to the one used to split lines, but its value cannot be stored statically because it is not stateless: it uses the regular property lastIndex to keep track of the progress in scanning the current string.

Test if a string contains a LT: /\n|\r|\u2028|\u2029/.test(str), or equivalently !/^.*$/.test(str).

The text was updated successfully, but these errors were encountered:

not-an-aardvark · 2017-01-20T05:58:42Z

Thanks for looking into this! This analysis is very helpful.

Strip all LTs from a string

It looks like this is only used for removing newlines from text while autofixing, which seems separate from the other two cases. However, it might be a good idea to have a helper function that removes newlines from a string.

Split lines

I suspect we can unify most of these cases to depend on sourceCode.lines and node.loc.(start|end).line.

Find index and length of all LT sequences in a string

It looks like this is only used in two places, and I think the code in linebreak-style can probably be refactored to use the ast-utils function.

Unfortunately, this can't be merged into the sourceCode.lines logic, because the sourceCode.lines logic loses information about which linebreak type was used to separate two lines.

If nothing else, it seems like it might be useful to have a LINEBREAKS constant list somewhere and dynamically build the regexes from the constant, so that we don't have to remember all the linebreak types in many different files.

fasttime · 2017-01-20T20:54:01Z

@not-an-aardvark I really like the idea of using sourceCode.lines to split lines. Much more efficient than working with regexps.

Currently, rules that deal with line-splitting need to include a list of JS linebreaks in some form (e.g. in a regex character class). As a result, splitting a string into JS lines is implemented in 22 different places in the codebase. As described in #7949, not all of these implementations are correct (many of them forget to account for \r, \u2028, and \u2029. This commit updates ast-utils.js with the following properties: * `astUtils.LINEBREAKS`: a Set containing all JS linebreaks * `astUtils.LINEBREAK_MATCHER`: a regular expression to match JS linebreaks * `astUtils.createGlobalLinebreakMatcher`: a function that returns a new global regex to match JS linebreaks Additionally, it updates the other line-splitting code to use these constants. Observable changes: * The error message for `newline-per-chained-call` will handle property names with \u2028 and \u2029 the same way it handles names with \n. * `no-multi-str` will now report an error for strings that contain \r, \u2028, or \u2029. (Previously, it only reported strings containing \n.) * `no-unused-vars` will now report the correct location if a `/* global foo */` comment contains a linebreak other than \n. (Previously, it would report an invalid location.) * `spaced-comment` will no longer report an error if a comment contains tokens in the `exceptions` option followed by \u2028 or \u2029.

eslintbot added the triage An ESLint team member will look at this issue soon label Jan 20, 2017

fasttime changed the title ~~Unify logic~~ Unify logic to find and replace line terminators Jan 20, 2017

fasttime mentioned this issue Jan 20, 2017

Fix: func-call-spacing does not recognize all line terminators (fixes #7923) #7924

Merged

not-an-aardvark added chore This change is not user-facing and removed triage An ESLint team member will look at this issue soon labels Jan 20, 2017

not-an-aardvark mentioned this issue Feb 8, 2017

Update: handle uncommon linebreaks consistently in rules (fixes #7949) #8049

Merged

ilyavolodin closed this as completed in acc3301 Feb 13, 2017

This was referenced Feb 20, 2017

Update dependency eslint to version 3.19.0 singapore/lint-condo#244

Merged

Update dependency eslint to version 3.16.0 renovatebot/renovate#111

Merged

eslint-deprecated bot locked and limited conversation to collaborators Feb 6, 2018

eslint-deprecated bot added the archived due to age This issue has been archived; please open a new issue for any further discussion label Feb 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify logic to find and replace line terminators #7949

Unify logic to find and replace line terminators #7949

fasttime commented Jan 20, 2017 •

edited

not-an-aardvark commented Jan 20, 2017

fasttime commented Jan 20, 2017

Unify logic to find and replace line terminators #7949

Unify logic to find and replace line terminators #7949

Comments

fasttime commented Jan 20, 2017 • edited

not-an-aardvark commented Jan 20, 2017

fasttime commented Jan 20, 2017

fasttime commented Jan 20, 2017 •

edited