Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify logic to find and replace line terminators #7949

Closed
fasttime opened this issue Jan 20, 2017 · 2 comments · Fixed by renovatebot/renovate#111 · May be fixed by iamhunter/teammates#4
Closed

Unify logic to find and replace line terminators #7949

fasttime opened this issue Jan 20, 2017 · 2 comments · Fixed by renovatebot/renovate#111 · May be fixed by iamhunter/teammates#4
Labels
archived due to age This issue has been archived; please open a new issue for any further discussion chore This change is not user-facing

Comments

@fasttime
Copy link
Member

fasttime commented Jan 20, 2017

The purpose of this proposal is to unify the logic used to find and replace the line terminator sequences "\r\n", "\n", "\r", "\u2028" and "\u2029" throughout the code, following the suggestions in #7924 (review).

A recursive text search for the most common patterns "\n" and "\r" in the lib folder reveals a
number of similar regular expressions used in similar ways in several rules and other modules.

Here is an overview of the code in question based on the current master snapshot.
I added comments to explain what the code apparently does.

lib/ast-utils.js:294:        const lineEndingPattern = /\r\n|[\r\n\u2028\u2029]/g;

Find index and length of all LT sequences in a string


lib/eslint.js:803:                stripUnicodeBOM(text).replace(/^#!([^\r\n]+)/, (match, captured) => {

Comment out shebang


lib/rules/brace-style.js:65:            const NEWLINE_REGEX = /\r\n|\r|\n|\u2028|\u2029/g;

Strip all LTs from a string


lib/rules/eol-last.js:82:                            const finalEOLs = /(?:\r?\n)+$/,

Strip trailing newlines from a string


lib/rules/func-call-spacing.js:91:            const hasNewline = hasWhitespace && /[\n\r\u2028\u2029]/.test(textBetweenTokens);

Test if a string contains a LT


lib/rules/key-spacing.js:18:    return /[\n\r\u2028\u2029]/.test(str);

Test if a string contains a LT


lib/rules/linebreak-style.js:63:                    pattern = /\r\n|\r|\n|\u2028|\u2029/g;

Find index and length of all LT sequences in a string


lib/rules/newline-after-var.js:204:                        const NEWLINE_REGEX = /\r\n|\r|\n|\u2028|\u2029/;

Split lines


lib/rules/newline-per-chained-call.js:50:            const lines = sourceCode.getText(node.property).split(/\r\n|\r|\n/g);

Split lines (misses \u2028 and \u2029)


lib/rules/no-irregular-whitespace.js:16:const LINE_BREAK = /\r\n|\r|\n|\u2028|\u2029/g;

Find index of all LT sequences in a string (assumes length of sequence is always 1)


lib/rules/no-multi-spaces.js:96:                    pattern = /[^\n\r\u2028\u2029\t ].? {2,}/g;  // note: repeating space

Not sure about this one


lib/rules/no-multi-str.js:42:                const lineBreak = /\n/;

Test if a string contains a LT (misses \r, \u2028 and \u2029)


lib/rules/no-trailing-spaces.js:84:                    linebreaks = sourceCode.getText().match(/\r\n|\r|\n|\u2028|\u2029/g);

Find length of all LT sequences in a string


lib/rules/no-unused-vars.js:543:            const lineInComment = (prefix.match(/\n/g) || []).length;

Count LT sequences in a string (misses \r, \u2028 and \u2029)


lib/rules/no-unused-vars.js:546:                column -= 1 + prefix.lastIndexOf("\n");

Find index of last LT sequences (misses \r, \u2028 and \u2029)


lib/rules/operator-linebreak.js:14:const LINEBREAK_REGEX = /\r\n|\r|\n|\u2028|\u2029/g;

Strip all LT sequences from a string


lib/rules/quotes.js:54:    return newQuote + str.slice(1, -1).replace(/\\(\${|\r\n?|\n|.)|["'`]|\${|(\r\n?|\n)/g, (match, escaped, newline) => {

\r\n?|\n|. should match any character but misses \u2028 and \u2029


lib/rules/quotes.js:265:                const shouldWarn = node.quasis.length === 1 && !/(^|[^\\])(\\\\)*[\r\n\u2028\u2029]/.test(node.quasis[0].value.raw);

Match line continuations


lib/rules/spaced-comment.js:92:        pattern += "(?:$|[\n\r]))";

Match end of line (misses \u2028 and \u2029)


lib/util/source-code.js:148:    return text.split(/\r\n|\r|\n|\u2028|\u2029/g);

Split lines


The comments in bold are what I suspect to be problems in the current implementation. Please, don't kill me if I'm wrong. I just added those comments for reference, so we don't happen to change the current behavior accidentally. Each of those should be discussed in a different issue if necessary.

Besides that, the analysis above shows at least four common use cases.

Strip all LTs from a string: str.replace(/\n|\r|\u2028|\u2029/g, ""), the regular expression here can be safely stored and reused, so that only one object is created.

Split lines: str.split(/\r\n|\r|\n|\u2028|\u2029/g), this regular expression can be also stored
in a static context and reused across function calls. As suggested by @not-an-aardvark in the comment below, to split the contents of a node in its lines, it is be better to use sourceCode.lines with node.loc.start and node.loc.end as start and ending positions respectively.

Find index and length of all LT sequences in a string: This is basically something like

var regExp = /\r\n|\r|\n|\u2028|\u2029/g;
while (match = regExp.exec(str))
{
    var index = match.index;
    var length = match[0].length;
    ...
}

The regexp literal here is identical to the one used to split lines, but its value cannot be stored statically because it is not stateless: it uses the regular property lastIndex to keep track of the progress in scanning the current string.

Test if a string contains a LT: /\n|\r|\u2028|\u2029/.test(str), or equivalently !/^.*$/.test(str).

@eslintbot eslintbot added the triage An ESLint team member will look at this issue soon label Jan 20, 2017
@fasttime fasttime changed the title Unify logic Unify logic to find and replace line terminators Jan 20, 2017
@not-an-aardvark
Copy link
Member

Thanks for looking into this! This analysis is very helpful.

Strip all LTs from a string

It looks like this is only used for removing newlines from text while autofixing, which seems separate from the other two cases. However, it might be a good idea to have a helper function that removes newlines from a string.

Split lines

I suspect we can unify most of these cases to depend on sourceCode.lines and node.loc.(start|end).line.

Find index and length of all LT sequences in a string

It looks like this is only used in two places, and I think the code in linebreak-style can probably be refactored to use the ast-utils function.

Unfortunately, this can't be merged into the sourceCode.lines logic, because the sourceCode.lines logic loses information about which linebreak type was used to separate two lines.


If nothing else, it seems like it might be useful to have a LINEBREAKS constant list somewhere and dynamically build the regexes from the constant, so that we don't have to remember all the linebreak types in many different files.

@not-an-aardvark not-an-aardvark added chore This change is not user-facing and removed triage An ESLint team member will look at this issue soon labels Jan 20, 2017
@fasttime
Copy link
Member Author

@not-an-aardvark I really like the idea of using sourceCode.lines to split lines. Much more efficient than working with regexps.

not-an-aardvark added a commit that referenced this issue Feb 8, 2017
Currently, rules that deal with line-splitting need to include a list of JS linebreaks in some form (e.g. in a regex character class). As a result, splitting a string into JS lines is implemented in 22 different places in the codebase. As described in #7949, not all of these implementations are correct (many of them forget to account for \r, \u2028, and \u2029.

This commit updates ast-utils.js with the following properties:

* `astUtils.LINEBREAKS`: a Set containing all JS linebreaks
* `astUtils.LINEBREAK_MATCHER`: a regular expression to match JS linebreaks
* `astUtils.createGlobalLinebreakMatcher`: a function that returns a new global regex to match JS linebreaks

Additionally, it updates the other line-splitting code to use these constants.

Observable changes:

* The error message for `newline-per-chained-call` will handle property names with \u2028 and \u2029 the same way it handles names with \n.
* `no-multi-str` will now report an error for strings that contain \r, \u2028, or \u2029. (Previously, it only reported strings containing \n.)
* `no-unused-vars` will now report the correct location if a `/* global foo */` comment contains a linebreak other than \n. (Previously, it would report an invalid location.)
* `spaced-comment` will no longer report an error if a comment contains tokens in the `exceptions` option followed by \u2028 or \u2029.
not-an-aardvark added a commit that referenced this issue Feb 9, 2017
Currently, rules that deal with line-splitting need to include a list of JS linebreaks in some form (e.g. in a regex character class). As a result, splitting a string into JS lines is implemented in 22 different places in the codebase. As described in #7949, not all of these implementations are correct (many of them forget to account for \r, \u2028, and \u2029.

This commit updates ast-utils.js with the following properties:

* `astUtils.LINEBREAKS`: a Set containing all JS linebreaks
* `astUtils.LINEBREAK_MATCHER`: a regular expression to match JS linebreaks
* `astUtils.createGlobalLinebreakMatcher`: a function that returns a new global regex to match JS linebreaks

Additionally, it updates the other line-splitting code to use these constants.

Observable changes:

* The error message for `newline-per-chained-call` will handle property names with \u2028 and \u2029 the same way it handles names with \n.
* `no-multi-str` will now report an error for strings that contain \r, \u2028, or \u2029. (Previously, it only reported strings containing \n.)
* `no-unused-vars` will now report the correct location if a `/* global foo */` comment contains a linebreak other than \n. (Previously, it would report an invalid location.)
* `spaced-comment` will no longer report an error if a comment contains tokens in the `exceptions` option followed by \u2028 or \u2029.
not-an-aardvark added a commit that referenced this issue Feb 12, 2017
Currently, rules that deal with line-splitting need to include a list of JS linebreaks in some form (e.g. in a regex character class). As a result, splitting a string into JS lines is implemented in 22 different places in the codebase. As described in #7949, not all of these implementations are correct (many of them forget to account for \r, \u2028, and \u2029.

This commit updates ast-utils.js with the following properties:

* `astUtils.LINEBREAKS`: a Set containing all JS linebreaks
* `astUtils.LINEBREAK_MATCHER`: a regular expression to match JS linebreaks
* `astUtils.createGlobalLinebreakMatcher`: a function that returns a new global regex to match JS linebreaks

Additionally, it updates the other line-splitting code to use these constants.

Observable changes:

* The error message for `newline-per-chained-call` will handle property names with \u2028 and \u2029 the same way it handles names with \n.
* `no-multi-str` will now report an error for strings that contain \r, \u2028, or \u2029. (Previously, it only reported strings containing \n.)
* `no-unused-vars` will now report the correct location if a `/* global foo */` comment contains a linebreak other than \n. (Previously, it would report an invalid location.)
* `spaced-comment` will no longer report an error if a comment contains tokens in the `exceptions` option followed by \u2028 or \u2029.
@eslint-deprecated eslint-deprecated bot locked and limited conversation to collaborators Feb 6, 2018
@eslint-deprecated eslint-deprecated bot added the archived due to age This issue has been archived; please open a new issue for any further discussion label Feb 6, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
archived due to age This issue has been archived; please open a new issue for any further discussion chore This change is not user-facing
Projects
None yet
3 participants