Overview [![Build Status](https://travis-ci.org/lydell/js-tokens.svg?branch=master)](https://travis-ci.org/lydell/js-tokens)
========

A regex that tokenizes JavaScript.

```js
var jsTokens = require("js-tokens").default

var jsString = "var foo=opts.foo;\n..."

jsString.match(jsTokens)
// ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...]
```


Installation
============

`npm install js-tokens`

```js
import jsTokens from "js-tokens"
// or:
var jsTokens = require("js-tokens").default
```


Usage
=====

### `jsTokens` ###

A regex with the `g` flag that matches JavaScript tokens.

The regex _always_ matches, even invalid JavaScript and the empty string.

The next match is always directly after the previous.
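
The properties above can be illustrated with a small sketch. It does not use the real `jsTokens` regex: `standIn` is a hypothetical, heavily simplified pattern that only shares the same guarantees (`g` flag, always matches, contiguous matches), and `collectMatches` is an illustrative helper, not part of this package.

```js
// Hypothetical, heavily simplified stand-in for the real `jsTokens` regex.
// It shares only the properties described above: it has the `g` flag, it
// matches any input (even the empty string), and its matches are contiguous.
var standIn = /[$\w]+|\s+|^$|[\s\S]/g

// Collects every match with `exec`, checking that each match starts exactly
// where the previous one ended.
function collectMatches(regex, string) {
  var values = []
  var expectedIndex = 0
  var match
  regex.lastIndex = 0
  while ((match = regex.exec(string)) !== null) {
    if (match.index !== expectedIndex) {
      throw new Error("Gap before index " + match.index)
    }
    values.push(match[0])
    expectedIndex = regex.lastIndex
    // An empty match does not advance `lastIndex`, so stop here instead of
    // looping forever.
    if (match[0] === "") break
  }
  return values
}

var input = "var foo=opts.foo;\n"
var values = collectMatches(standIn, input)
console.log(values)
// Contiguous matches mean that joining them reproduces the input.
console.log(values.join("") === input) // true
```

The guard against empty matches matters for any regex that can match the empty string: a plain `while (regex.exec(string))` loop would never terminate on such a match.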

### `var token = matchToToken(match)` ###

```js
import {matchToToken} from "js-tokens"
// or:
var matchToToken = require("js-tokens").matchToToken
```

Takes a `match` returned by `jsTokens.exec(string)`, and returns a `{type:
String, value: String}` object. The following types are available:

- string
- comment
- regex
- number
- name
- punctuator
- whitespace
- invalid

Multi-line comments and strings also have a `closed` property indicating if the
token was closed or not (see below).

Comments and strings both come in several flavors. To distinguish them, check if
the token starts with `//`, `/*`, `'`, `"` or `` ` ``.
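
As a rough sketch of that check, the helper below names the flavor of a token. `tokenFlavor` and its return strings are hypothetical (they are not part of this package), and the token objects are written by hand in the `{type, value}` shape described above.

```js
// Hypothetical helper, not part of js-tokens: names the flavor of a string
// or comment token by looking at how its value starts.
function tokenFlavor(token) {
  var value = token.value
  if (token.type === "comment") {
    return value.slice(0, 2) === "//" ? "single-line comment" : "multi-line comment"
  }
  if (token.type === "string") {
    if (value[0] === "'") return "single-quoted string"
    if (value[0] === '"') return "double-quoted string"
    return "template string"
  }
  // Other token types do not come in flavors.
  return null
}

console.log(tokenFlavor({type: "comment", value: "// note"}))
console.log(tokenFlavor({type: "comment", value: "/* note */"}))
console.log(tokenFlavor({type: "string", value: "`a${b}`"}))
```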

Names are ECMAScript IdentifierNames, that is, including both identifiers and
keywords. You may use [is-keyword-js] to tell them apart.

Whitespace includes both line terminators and other whitespace.

[is-keyword-js]: https://github.com/crissdev/is-keyword-js


ECMAScript support
==================

The intention is to always support the latest ECMAScript version whose feature
set has been finalized.

If adding support for a newer version requires changes, a new version with a
major version bump will be released.

Currently, ECMAScript 2018 is supported.


Invalid code handling
=====================

Unterminated strings are still matched as strings. JavaScript strings cannot
contain (unescaped) newlines, so unterminated strings simply end at the end of
the line. Unterminated template strings can contain unescaped newlines, though,
so they go on to the end of input.
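
The sketch below shows one way such behavior can be expressed for double-quoted strings only. The `doubleQuoted` pattern and the `matchString` helper are hypothetical stand-ins written for this example, not the pattern or API used by this package: the closing quote is optional, and whether it was captured determines `closed`.

```js
// Hypothetical stand-in, not the real js-tokens pattern: a double-quoted
// string whose closing quote is optional. Unescaped line breaks are not
// allowed inside, so an unterminated string stops at the end of the line.
var doubleQuoted = /"(?:[^"\\\r\n]|\\[\s\S])*(")?/

// Hypothetical helper: returns a token-like object for the first
// double-quoted string in `source`, or null if there is none.
function matchString(source) {
  var match = doubleQuoted.exec(source)
  if (match === null) return null
  return {type: "string", value: match[0], closed: match[1] !== undefined}
}

console.log(matchString('"abc" + 1'))  // value is '"abc"', closed is true
console.log(matchString('"abc\n+ 1'))  // value is '"abc', closed is false
```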

Unterminated multi-line comments are also still matched as comments. They
simply go on to the end of the input.

Unterminated regex literals are likely matched as division and whatever is
inside the regex.

Invalid ASCII characters have their own capturing group.

Invalid non-ASCII characters are treated as names, to simplify the matching of
names (except Unicode spaces, which are treated as whitespace). Note: See also
the [ES2018](#es2018) section.

Regex literals may contain invalid regex syntax. They are still matched as
regex literals. They may also contain repeated regex flags, to keep the regex
simple.

Strings may contain invalid escape sequences.


Limitations
===========

Tokenizing JavaScript using regexes (in fact, _one single regex_) won’t be
perfect. But that’s not the point either.

You may compare jsTokens with [esprima] by using `esprima-compare.js`.
See `npm run esprima-compare`!

[esprima]: http://esprima.org/

### Template string interpolation ###

Template strings are matched as single tokens, from the starting `` ` `` to the
ending `` ` ``, including interpolations (whose tokens are not matched
individually).

Matching template string interpolations requires recursive balancing of `{` and
`}`, something that JavaScript regexes cannot do. Only one level of nesting is
supported.
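
To show what "balancing" means here, the following sketch finds the end of an interpolation with an explicit depth counter, which is the kind of state a single regex cannot keep. `findInterpolationEnd` is a hypothetical helper, not part of this package, and it deliberately ignores the fact that braces may also occur inside strings, comments and nested templates.

```js
// Hypothetical helper, not part of js-tokens. `start` is the index just
// after a `${`. Returns the index of the matching `}`, or -1 if there is none.
// Simplification: every brace is counted, even inside strings or comments.
function findInterpolationEnd(source, start) {
  var depth = 1
  for (var i = start; i < source.length; i++) {
    if (source[i] === "{") {
      depth++
    } else if (source[i] === "}") {
      depth--
      if (depth === 0) return i
    }
  }
  return -1
}

// The interpolation contains two levels of nested braces.
var source = "`a${ {b: {c: 1}}.b.c }d`"
var start = source.indexOf("${") + 2
var end = findInterpolationEnd(source, start)
console.log(source.slice(start, end)) // " {b: {c: 1}}.b.c "
```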

### Division and regex literals collision ###

Consider this example:

```js
var g = 9.82
var number = bar / 2/g

var regex = / 2/g
```

A human can easily understand that in the `number` line we’re dealing with
division, and in the `regex` line we’re dealing with a regex literal. How come?
Because humans can look at the whole code to put the `/` characters in context.
A JavaScript regex cannot. It only sees forwards. (Well, ES2018 regexes can also
look backwards. See the [ES2018](#es2018) section.)

When the `jsTokens` regex scans through the above, it will see the following
at the end of both the `number` and `regex` lines:

```js
/ 2/g
```

It is then impossible to know if that is a regex literal, or part of an
expression dealing with division.
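
To make the two readings concrete, here is a runnable variant of the example. The value given to `bar` is an arbitrary assumption for this sketch (the example above leaves it undefined). A full JavaScript parser gives the same characters two different meanings depending on what precedes them:

```js
var g = 9.82
var bar = 19.64 // arbitrary value, assumed for this sketch

var number = bar / 2/g // parsed as (bar / 2) / g
var regex = / 2/g      // parsed as a regex literal with the `g` flag

console.log(typeof number)
console.log(regex instanceof RegExp, regex.source, regex.flags)
```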

Here is a similar case:

```js
foo /= 2/g
foo(/= 2/g)
```

The first line divides the `foo` variable by `2/g`. The second line calls the
`foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only
sees forwards, it cannot tell the two cases apart.

There are some cases where we _can_ tell division and regex literals apart,
though.

First off, we have the simple cases where there’s only one slash in the line:

```js
var foo = 2/g
foo /= 2
```

Regex literals cannot contain newlines, so the above cases are correctly
identified as division. Things are only problematic when there is more than
one non-comment slash in a single line.

Secondly, not every character is a valid regex flag.

```js
var number = bar / 2/e
```

The above example is also correctly identified as division, because `e` is not a
valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*`
(any letter) as flags, but it is not worth it since it increases the number of
ambiguous cases. So only the standard `g`, `m`, `i`, `y`, `u` and `s` flags are
allowed. This means that the above example will be identified as division as
long as you don’t rename the `e` variable to some permutation of `gmiyus` 1 to 6
characters long.
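
If you want to check whether a variable name falls into that risky set, a minimal sketch could look like this. `couldBeRegexFlags` is a hypothetical helper, not part of this package. It also accepts repeated letters, on the assumption that repeated flags are matched too (see "Invalid code handling").

```js
// Hypothetical helper, not part of js-tokens: true if `name` consists only
// of the letters in `gmiyus` and is 1 to 6 characters long, so that it could
// be read as regex flags in an ambiguous position.
function couldBeRegexFlags(name) {
  return /^[gmiyus]{1,6}$/.test(name)
}

console.log(couldBeRegexFlags("e"))       // false
console.log(couldBeRegexFlags("g"))       // true
console.log(couldBeRegexFlags("gi"))      // true
console.log(couldBeRegexFlags("gmiyusg")) // false (7 characters)
```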

Lastly, we can look _forward_ for information.

- If the token following what looks like a regex literal is not valid after a
  regex literal, but is valid in a division expression, then the regex literal
  is treated as division instead. For example, a flagless regex cannot be
  followed by a string, number or name, but all of those three can be the
  denominator of a division.
- Generally, if what looks like a regex literal is followed by an operator, the
  regex literal is treated as division instead. This is because regexes are
  seldom used with operators (such as `+`, `*`, `&&` and `==`), but division
  could likely be part of such an expression.

Please consult the regex source and the test cases for precise information on
when regex or division is matched (should you need to know). In short, you
could sum it up as:

If the end of a statement looks like a regex literal (even if it isn’t), it
will be treated as one. Otherwise it should work as expected (if you write sane
code).

### ES2018 ###

ES2018 added some nice regex improvements to the language.

- [Unicode property escapes] should allow telling names and invalid non-ASCII
  characters apart without blowing up the regex size.
- [Lookbehind assertions] should allow telling division and regex literals
  apart in more cases.
- [Named capture groups] might simplify some things.

These things would be nice to do, but are not critical. They probably have to
wait until the oldest maintained Node.js LTS release supports those features.

[Unicode property escapes]: http://2ality.com/2017/07/regexp-unicode-property-escapes.html
[Lookbehind assertions]: http://2ality.com/2017/05/regexp-lookbehind-assertions.html
[Named capture groups]: http://2ality.com/2017/05/regexp-named-capture-groups.html


License
=======

[MIT](LICENSE).