Overview [![Build Status](https://travis-ci.org/lydell/js-tokens.svg?branch=master)](https://travis-ci.org/lydell/js-tokens)
========

A regex that tokenizes JavaScript.

```js
var jsTokens = require("js-tokens").default

var jsString = "var foo=opts.foo;\n..."

jsString.match(jsTokens)
// ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...]
```

Installation
============

`npm install js-tokens`

```js
import jsTokens from "js-tokens"
// or:
var jsTokens = require("js-tokens").default
```

Usage
=====

### `jsTokens` ###

A regex with the `g` flag that matches JavaScript tokens.

The regex _always_ matches, even invalid JavaScript and the empty string.

The next match is always directly after the previous.

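These properties can be illustrated without the real regex. The sketch below uses `standIn`, a hypothetical and heavily simplified pattern (not the actual `jsTokens` regex), to show why contiguous matches mean that joining them gives back the input:

```js
// Hypothetical, heavily simplified stand-in for the real `jsTokens` regex.
// The final alternatives (`^$` and `[\s\S]`) make sure something always matches.
var standIn = /\s+|[A-Za-z_$][\w$]*|\d+|^$|[\s\S]/g

var input = "var foo=opts.foo;\n"
var tokens = input.match(standIn)

// Every match starts where the previous one ended,
// so joining the matches reproduces the input unchanged.
var roundTrip = tokens.join("") === input

// The empty string still produces a single (empty) match.
var emptyMatch = "".match(standIn)
```

With the real regex the same round-trip check should hold, since no character of the input is ever skipped.
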
### `var token = matchToToken(match)` ###

```js
import {matchToToken} from "js-tokens"
// or:
var matchToToken = require("js-tokens").matchToToken
```

Takes a `match` returned by `jsTokens.exec(string)`, and returns a `{type:
String, value: String}` object. The following types are available:

- string
- comment
- regex
- number
- name
- punctuator
- whitespace
- invalid

Multi-line comments and strings also have a `closed` property indicating if the
token was closed or not (see below).

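One way to picture how a match can be turned into a typed token is to give each token type its own capturing group and check which group took part in the match. The sketch below does that with a hypothetical stand-in regex; `standIn`, `groupTypes`, `toToken` and `tokenize` are illustrative names, not part of js-tokens, and the real regex and group layout differ:

```js
// Hypothetical stand-in regex with one capturing group per token type.
// The real `jsTokens` regex is far more thorough.
var standIn = /(\/\/.*)|(\s+)|([A-Za-z_$][\w$]*)|(\d+)|([=;.,(){}+\-*\/])|([\s\S])/g

// Group index -> token type (index 0 is the whole match, so it is unused).
var groupTypes = [null, "comment", "whitespace", "name", "number", "punctuator", "invalid"]

// Hypothetical counterpart to `matchToToken` for the stand-in regex.
function toToken(match) {
  for (var i = 1; i < groupTypes.length; i++) {
    if (match[i] !== undefined) {
      return {type: groupTypes[i], value: match[0]}
    }
  }
  return {type: "invalid", value: match[0]}
}

function tokenize(source) {
  var tokens = []
  var match
  standIn.lastIndex = 0 // `exec` on a `g` regex resumes from `lastIndex`
  while ((match = standIn.exec(source)) !== null) {
    tokens.push(toToken(match))
  }
  return tokens
}

var types = tokenize("x = 1 // hi").map(function (token) {
  return token.type
})
```

The `exec` loop is the same shape you would use with the real `jsTokens` and `matchToToken`; only the regex and the mapping function are swapped out here.
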
Comments and strings both come in several flavors. To distinguish them, check if
the token starts with `//`, `/*`, `'`, `"` or `` ` ``.

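As a minimal sketch of that check, the hypothetical helper below (`flavor` is not part of js-tokens) takes a `{type, value}` token and looks at how its value starts:

```js
// Hypothetical helper: classify a `{type, value}` token by how its value starts.
function flavor(token) {
  var value = token.value
  if (token.type === "comment") {
    return value.slice(0, 2) === "//" ? "single-line comment" : "multi-line comment"
  }
  if (token.type === "string") {
    if (value[0] === "`") return "template string"
    return value[0] === "'" ? "single-quoted string" : "double-quoted string"
  }
  return null // other token types do not come in flavors
}

var example = flavor({type: "comment", value: "/* note */"})
```
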
Names are ECMAScript IdentifierNames, that is, including both identifiers and
keywords. You may use [is-keyword-js] to tell them apart.

Whitespace includes both line terminators and other whitespace.

[is-keyword-js]: https://github.com/crissdev/is-keyword-js

ECMAScript support
==================

The intention is to always support the latest ECMAScript version whose feature
set has been finalized.

If adding support for a newer version requires changes, a new version with a
major version bump will be released.

Currently, ECMAScript 2018 is supported.

Invalid code handling
=====================

Unterminated strings are still matched as strings. JavaScript strings cannot
contain (unescaped) newlines, so unterminated strings simply end at the end of
the line. Unterminated template strings can contain unescaped newlines, though,
so they go on to the end of input.

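The difference between the two kinds of unterminated strings can be sketched with two hypothetical stand-in patterns. They are not taken from the js-tokens source, and the template pattern ignores interpolation entirely; the optional last group plays the role of the `closed` property:

```js
// Hypothetical stand-in patterns (not taken from the js-tokens source).
// Group 1 captures the closing quote, so it is undefined for unterminated strings.
var doubleQuoted = /"(?:[^"\\\r\n]|\\[\s\S])*(")?/
// Simplified: ignores `${...}` interpolation.
var template = /`(?:[^`\\]|\\[\s\S])*(`)?/

function describe(regex, source) {
  var match = regex.exec(source)
  if (match === null) return null
  return {value: match[0], closed: match[1] !== undefined}
}

// Stops at the end of the line because the newline is not escaped.
var unterminated = describe(doubleQuoted, '"abc\nvar x = 1')
var terminated = describe(doubleQuoted, '"abc" + x')
// Runs to the end of the input because templates may contain newlines.
var unterminatedTemplate = describe(template, "`abc\nvar x = 1")
```
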
Unterminated multi-line comments are also still matched as comments. They
simply go on to the end of the input.

Unterminated regex literals are likely matched as division and whatever is
inside the regex.

Invalid ASCII characters have their own capturing group.

Invalid non-ASCII characters are treated as names, to simplify the matching of
names (except unicode spaces which are treated as whitespace). Note: See also
the [ES2018](#es2018) section.

Regex literals may contain invalid regex syntax. They are still matched as
regex literals. They may also contain repeated regex flags, to keep the regex
simple.

Strings may contain invalid escape sequences.

Limitations
===========

Tokenizing JavaScript using regexes (in fact, _one single regex_) won’t be
perfect. But that’s not the point either.

You may compare jsTokens with [esprima] by using `esprima-compare.js`.
See `npm run esprima-compare`!

[esprima]: http://esprima.org/

### Template string interpolation ###

Template strings are matched as single tokens, from the starting `` ` `` to the
ending `` ` ``, including interpolations (whose tokens are not matched
individually).

Matching template string interpolations requires recursive balancing of `{` and
`}`, something that JavaScript regexes cannot do. Only one level of nesting is
supported.

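To see what "one level of nesting" means in practice, here is a minimal sketch using a hypothetical pattern for just the `${...}` part. It is an illustration of the limitation, not the pattern js-tokens uses:

```js
// Hypothetical stand-in for the interpolation part of a template string pattern.
// Inside `${...}` it accepts non-brace characters or one nested `{...}` pair.
var interpolation = /\$\{(?:[^{}]|\{[^{}]*\})*\}/

var oneLevel = "${a + {b: 1}.b}"
var twoLevels = "${ {a: {b: 1}} }"

// One nested pair of braces is balanced correctly.
var oneLevelMatch = interpolation.exec(oneLevel)
// Two nested pairs cannot be balanced by this pattern, so there is no match.
var twoLevelsMatch = interpolation.exec(twoLevels)
```

Supporting a second level would mean repeating the nested part of the pattern once more, and so on for every further level, which is why a fixed depth has to be chosen.
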
### Division and regex literals collision ###

Consider this example:

```js
var g = 9.82
var number = bar / 2/g

var regex = / 2/g
```

A human can easily understand that in the `number` line we’re dealing with
division, and in the `regex` line we’re dealing with a regex literal. How come?
Because humans can look at the whole code to put the `/` characters in context.
A JavaScript regex cannot. It only sees forwards. (Well, ES2018 regexes can also
look backwards. See the [ES2018](#es2018) section.)

When the `jsTokens` regex scans through the above, it will see the following
at the end of both the `number` and `regex` lines:

```js
/ 2/g
```

It is then impossible to know if that is a regex literal, or part of an
expression dealing with division.

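The ambiguity is easy to reproduce with a naive pattern. The sketch below uses a hypothetical regex-literal pattern (much cruder than what js-tokens does) and shows that it finds exactly the same text in both lines:

```js
// Hypothetical, naive regex-literal pattern (not the one js-tokens uses).
var naiveRegexLiteral = /\/(?:[^\/\\\n]|\\.)+\/[gmiyus]*/

var numberLine = "var number = bar / 2/g"
var regexLine = "var regex = / 2/g"

// Both lines yield the same candidate, although only one is a regex literal.
var inNumberLine = naiveRegexLiteral.exec(numberLine)[0]
var inRegexLine = naiveRegexLiteral.exec(regexLine)[0]
```
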
Here is a similar case:

```js
foo /= 2/g
foo(/= 2/g)
```

The first line divides the `foo` variable by `2/g`. The second line calls the
`foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only
sees forwards, it cannot tell the two cases apart.

There are some cases where we _can_ tell division and regex literals apart,
though.

First off, we have the simple cases where there’s only one slash in the line:

```js
var foo = 2/g
foo /= 2
```

Regex literals cannot contain newlines, so the above cases are correctly
identified as division. Things are only problematic when there is more than
one non-comment slash in a single line.

Secondly, not every character is a valid regex flag.

```js
var number = bar / 2/e
```

The above example is also correctly identified as division, because `e` is not a
valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*`
(any letter) as flags, but it is not worth it since it increases the number of
ambiguous cases. So only the standard `g`, `m`, `i`, `y`, `u` and `s` flags are
allowed. This means that the above example will be identified as division as
long as you don’t rename the `e` variable to some permutation of `gmiyus` 1 to 6
characters long.

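A quick way to check whether a variable name would fall into the ambiguous group is sketched below. The pattern is an assumption based on the description above (the letters of `gmiyus`, 1 to 6 characters, repeats allowed), and `couldBeFlags` is a hypothetical helper, not part of js-tokens:

```js
// Assumed flag set: the letters of `gmiyus`, 1 to 6 characters long.
// Repeated flags are accepted, which keeps the pattern simple.
var flagsLike = /^[gmiyus]{1,6}$/

// Hypothetical helper: could this name be mistaken for regex flags?
function couldBeFlags(name) {
  return flagsLike.test(name)
}

var safeName = couldBeFlags("e")    // false: `bar / 2/e` stays division
var riskyName = couldBeFlags("sum") // true: `bar / 2/sum` looks like a regex literal
```
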
Lastly, we can look _forward_ for information.

- If the token following what looks like a regex literal is not valid after a
  regex literal, but is valid in a division expression, then the regex literal
  is treated as division instead. For example, a flagless regex cannot be
  followed by a string, number or name, but all of those three can be the
  denominator of a division.
- Generally, if what looks like a regex literal is followed by an operator, the
  regex literal is treated as division instead. This is because regexes are
  seldom used with operators (such as `+`, `*`, `&&` and `==`), but division
  is likely to be part of such an expression.

Please consult the regex source and the test cases for precise information on
when regex or division is matched (should you need to know). In short, you
could sum it up as:

If the end of a statement looks like a regex literal (even if it isn’t), it
will be treated as one. Otherwise it should work as expected (if you write sane
code).

### ES2018 ###

ES2018 added some nice regex improvements to the language.

- [Unicode property escapes] should allow telling names and invalid non-ASCII
  characters apart without blowing up the regex size.
- [Lookbehind assertions] should allow telling division and regex
  literals apart in more cases.
- [Named capture groups] might simplify some things.

These things would be nice to do, but are not critical. They probably have to
wait until the oldest maintained Node.js LTS release supports those features.

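For readers unfamiliar with these three features, here is a small sketch of each, assuming a runtime with ES2018 regex support. The patterns are illustrative only and say nothing about how js-tokens would actually use the features:

```js
// Unicode property escapes (these require the `u` flag).
var identifierStart = /^[$_\p{ID_Start}]$/u
var letterIsStart = identifierStart.test("é")
var arrowIsStart = identifierStart.test("→")

// Lookbehind assertion: a slash directly preceded by "(".
var slashAfterParen = /(?<=\()\//
var inCall = slashAfterParen.test("foo(/= 2/g)")
var inAssignment = slashAfterParen.test("foo /= 2/g")

// Named capture groups: read the result by name instead of by index.
var numberOrName = /(?<number>\d+)|(?<name>[A-Za-z_$][\w$]*)/
var groups = numberOrName.exec("42").groups
```

The lookbehind example reuses the `foo /= 2/g` and `foo(/= 2/g)` lines from the previous section: looking backwards is what separates the two.
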
[Unicode property escapes]: http://2ality.com/2017/07/regexp-unicode-property-escapes.html
[Lookbehind assertions]: http://2ality.com/2017/05/regexp-lookbehind-assertions.html
[Named capture groups]: http://2ality.com/2017/05/regexp-named-capture-groups.html

License
=======

[MIT](LICENSE).