# Regenerate [![Build status](https://travis-ci.org/mathiasbynens/regenerate.svg?branch=master)](https://travis-ci.org/mathiasbynens/regenerate) [![Code coverage status](https://img.shields.io/codecov/c/github/mathiasbynens/regenerate.svg)](https://codecov.io/gh/mathiasbynens/regenerate) [![Dependency status](https://gemnasium.com/mathiasbynens/regenerate.svg)](https://gemnasium.com/mathiasbynens/regenerate)

_Regenerate_ is a Unicode-aware regex generator for JavaScript. It allows you to easily generate ES5-compatible regular expressions based on a given set of Unicode symbols or code points. (This is trickier than you might think, because of [how JavaScript deals with astral symbols](https://mathiasbynens.be/notes/javascript-unicode).)
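
To make the “trickier than you might think” part concrete, here is a minimal sketch in plain JavaScript (no Regenerate involved) of how an astral code point ends up as two UTF-16 code units. The helper name `toSurrogatePair` is made up for this illustration and is not part of Regenerate’s API.

```js
// Split an astral code point (above U+FFFF) into its UTF-16 surrogate halves.
// `toSurrogatePair` is a hypothetical helper, used here for illustration only.
function toSurrogatePair(codePoint) {
	var offset = codePoint - 0x10000;
	var high = 0xD800 + (offset >> 10); // top 10 bits
	var low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
	return [high, low];
}

var pair = toSurrogatePair(0x1D306);
// → [0xD834, 0xDF06]

var symbol = String.fromCharCode(pair[0], pair[1]);
var length = symbol.length;
// → 2, even though it is a single symbol

var matchesDot = /^.$/.test(symbol);
// → false, because without the `u` flag `.` matches one code unit at a time
```

This is why the patterns shown throughout this document spell astral symbols out as `\uD834\uDF06`-style pairs.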

## Installation

Via [npm](https://npmjs.org/):

```bash
npm install regenerate
```

Via [Bower](http://bower.io/):

```bash
bower install regenerate
```

Via [Component](https://github.com/component/component):

```bash
component install mathiasbynens/regenerate
```

In a browser:

```html
<script src="regenerate.js"></script>
```

In [Node.js](https://nodejs.org/), [io.js](https://iojs.org/), and [RingoJS ≥ v0.8.0](http://ringojs.org/):

```js
var regenerate = require('regenerate');
```

In [Narwhal](http://narwhaljs.org/) and [RingoJS ≤ v0.7.0](http://ringojs.org/):

```js
var regenerate = require('regenerate').regenerate;
```

In [Rhino](http://www.mozilla.org/rhino/):

```js
load('regenerate.js');
```

Using an AMD loader like [RequireJS](http://requirejs.org/):

```js
require(
	{
		'paths': {
			'regenerate': 'path/to/regenerate'
		}
	},
	['regenerate'],
	function(regenerate) {
		console.log(regenerate);
	}
);
```

## API

### `regenerate(value1, value2, value3, ...)`

The main Regenerate function. Calling this function creates a new set that gets a chainable API.

```js
var set = regenerate()
	.addRange(0x60, 0x69) // add U+0060 to U+0069
	.remove(0x62, 0x64) // remove U+0062 and U+0064
	.add(0x1D306); // add U+1D306
set.valueOf();
// → [0x60, 0x61, 0x63, 0x65, 0x66, 0x67, 0x68, 0x69, 0x1D306]
set.toString();
// → '[`ace-i]|\\uD834\\uDF06'
set.toRegExp();
// → /[`ace-i]|\uD834\uDF06/
```

Any arguments passed to `regenerate()` will be added to the set right away. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.

```js
regenerate(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'

var items = [0x1D306, 'A', '©', 0x2603];
regenerate(items).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
```

### `regenerate.prototype.add(value1, value2, value3, ...)`

Any arguments passed to `add()` are added to the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.

```js
regenerate().add(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'

var items = [0x1D306, 'A', '©', 0x2603];
regenerate().add(items).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
```

It’s also possible to pass in a Regenerate instance. Doing so adds all code points in that instance to the current set.

```js
var set = regenerate(0x1D306, 'A');
regenerate().add('©', 0x2603).add(set).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
```

Note that the initial call to `regenerate()` acts like `add()`. This allows you to create a new Regenerate instance and add some code points to it in one go:

```js
regenerate(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
```

### `regenerate.prototype.remove(value1, value2, value3, ...)`

Any arguments passed to `remove()` are removed from the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.

```js
regenerate(0x1D306, 'A', '©', 0x2603).remove('☃').toString();
// → '[A\\xA9]|\\uD834\\uDF06'
```

It’s also possible to pass in a Regenerate instance. Doing so removes all code points in that instance from the current set.

```js
var set = regenerate('☃');
regenerate(0x1D306, 'A', '©', 0x2603).remove(set).toString();
// → '[A\\xA9]|\\uD834\\uDF06'
```

### `regenerate.prototype.addRange(start, end)`

Adds a range of code points from `start` to `end` (inclusive) to the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.

```js
regenerate(0x1D306).addRange(0x00, 0xFF).toString();
// → '[\\0-\\xFF]|\\uD834\\uDF06'

regenerate().addRange('A', 'z').toString();
// → '[A-z]'
```
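
Keep in mind that a range is defined by code points, not by alphabet: `'A'` to `'z'` spans U+0041 to U+007A, which includes six non-letter symbols between `Z` and `a`. The sketch below checks this with a plain `RegExp` built from the pattern string shown above; it does not call Regenerate itself, and the variable names are only illustrative.

```js
// Pattern string copied from the `addRange('A', 'z')` output above.
var pattern = '[A-z]';
var regex = new RegExp('^' + pattern + '$');

var matchesLetter = regex.test('q');
// → true
var matchesUnderscore = regex.test('_');
// → true, since U+005F lies between `Z` (U+005A) and `a` (U+0061)

// Collect every symbol in U+0041..U+007A that is not an ASCII letter.
var extras = [];
for (var codePoint = 0x41; codePoint <= 0x7A; codePoint++) {
	var symbol = String.fromCharCode(codePoint);
	if (!/[A-Za-z]/.test(symbol)) {
		extras.push(symbol);
	}
}
var extraSymbols = extras.join('');
// → '[\\]^_`'
```

If only letters are wanted, adding the two ranges `'A'` to `'Z'` and `'a'` to `'z'` separately avoids these extra symbols.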

### `regenerate.prototype.removeRange(start, end)`

Removes a range of code points from `start` to `end` (inclusive) from the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.

```js
regenerate()
	.addRange(0x000000, 0x10FFFF) // add all Unicode code points
	.removeRange('A', 'z') // remove all symbols from `A` to `z`
	.toString();
// → '[\\0-@\\{-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'

regenerate()
	.addRange(0x000000, 0x10FFFF) // add all Unicode code points
	.removeRange(0x0041, 0x007A) // remove all code points from U+0041 to U+007A
	.toString();
// → '[\\0-@\\{-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'
```

### `regenerate.prototype.intersection(codePoints)`

Removes any code points from the set that are not present in both the set and the given `codePoints` array. `codePoints` must be an array of numeric code point values, i.e. numbers.

```js
regenerate()
	.addRange(0x00, 0xFF) // add extended ASCII code points
	.intersection([0x61, 0x69]) // remove all code points from the set except for these
	.toString();
// → '[ai]'
```

Instead of the `codePoints` array, it’s also possible to pass in a Regenerate instance.

```js
var whitelist = regenerate(0x61, 0x69);

regenerate()
	.addRange(0x00, 0xFF) // add extended ASCII code points
	.intersection(whitelist) // remove all code points from the set except for those in the `whitelist` set
	.toString();
// → '[ai]'
```

### `regenerate.prototype.contains(value)`

Returns `true` if the given value is part of the set, and `false` otherwise. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.

```js
var set = regenerate().addRange(0x00, 0xFF);
set.contains('A');
// → true
set.contains(0x1D306);
// → false
```

### `regenerate.prototype.clone()`

Returns a clone of the current code point set. Any actions performed on the clone won’t mutate the original set.

```js
var setA = regenerate(0x1D306);
var setB = setA.clone().add(0x1F4A9);
setA.toArray();
// → [0x1D306]
setB.toArray();
// → [0x1D306, 0x1F4A9]
```

### `regenerate.prototype.toString(options)`

Returns a string representing (part of) a regular expression that matches all the symbols mapped to the code points within the set.

```js
regenerate(0x1D306, 0x1F4A9).toString();
// → '\\uD834\\uDF06|\\uD83D\\uDCA9'
```

If the `bmpOnly` property of the optional `options` object is set to `true`, the output matches surrogates individually, regardless of whether they’re lone surrogates or just part of a surrogate pair. This simplifies the output, but it can only be used in case you’re certain the strings it will be used on don’t contain any astral symbols.

```js
var highSurrogates = regenerate().addRange(0xD800, 0xDBFF);
highSurrogates.toString();
// → '[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])'
highSurrogates.toString({ 'bmpOnly': true });
// → '[\\uD800-\\uDBFF]'

var lowSurrogates = regenerate().addRange(0xDC00, 0xDFFF);
lowSurrogates.toString();
// → '(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'
lowSurrogates.toString({ 'bmpOnly': true });
// → '[\\uDC00-\\uDFFF]'
```

Note that lone low surrogates cannot be matched accurately using regular expressions in JavaScript. Regenerate’s output makes a best-effort approach but [there can be false negatives in this regard](https://github.com/mathiasbynens/regenerate/issues/28#issuecomment-72224808).
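
The following sketch exercises the patterns from the example above as plain regular expression literals, to show what the default output buys you over `bmpOnly` and where the lone low surrogate pattern falls short. It does not call Regenerate. The two-lone-surrogates case at the end is one way a miss can occur; it is offered as an illustration, not as a summary of the linked issue.

```js
// Patterns copied from the `toString()` outputs above, written as literals.
var loneHigh = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
var bmpOnlyHigh = /[\uD800-\uDBFF]/;
var loneLowGlobal = /(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/g;

var astral = '\uD834\uDF06'; // U+1D306, a proper surrogate pair

var defaultMatchesPair = loneHigh.test(astral);
// → false: the high surrogate is followed by a low surrogate, so it is not lone
var bmpOnlyMatchesPair = bmpOnlyHigh.test(astral);
// → true: the `bmpOnly` pattern matches half of the astral symbol
var defaultMatchesLone = loneHigh.test('\uD834A');
// → true

// ES5 regular expressions have no lookbehind, so the low surrogate pattern
// has to consume the preceding code unit. Two consecutive lone low
// surrogates are therefore reported as a single match.
var matches = '\uDF06\uDF06'.match(loneLowGlobal);
var matchCount = matches.length;
// → 1
```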

If the `hasUnicodeFlag` property of the optional `options` object is set to `true`, the output makes use of Unicode code point escapes (`\u{…}`) where applicable. This simplifies the output at the cost of compatibility and portability, since it means the output can only be used as a pattern in a regular expression with [the ES6 `u` flag](https://mathiasbynens.be/notes/es6-unicode-regex) enabled.

```js
var set = regenerate().addRange(0x0, 0x10FFFF);

set.toString();
// → '[\\0-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'

set.toString({ 'hasUnicodeFlag': true });
// → '[\\0-\\u{10FFFF}]'
```
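
To see why the `u` flag is required for the shorter output, the sketch below compiles both pattern strings from the example above with `new RegExp`. It assumes an ES6-capable engine and uses hard-coded copies of the strings rather than calling Regenerate; the variable names are illustrative.

```js
// Pattern strings copied from the two `toString()` outputs above.
var es5Pattern = '[\\0-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]';
var es6Pattern = '[\\0-\\u{10FFFF}]';

var astral = '\uD834\uDF06'; // U+1D306

// The ES5-compatible pattern works without any flags.
var es5Result = new RegExp('^(?:' + es5Pattern + ')$').test(astral);
// → true

// The short pattern works when compiled with the `u` flag.
var es6Result = new RegExp('^' + es6Pattern + '$', 'u').test(astral);
// → true

// Without the `u` flag, `\u{10FFFF}` is not read as a code point escape,
// so the same pattern no longer matches an astral symbol.
var es6WithoutFlag = new RegExp('^' + es6Pattern + '$').test(astral);
// → false
```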

### `regenerate.prototype.toRegExp(flags = '')`

Returns a regular expression that matches all the symbols mapped to the code points within the set. Optionally, you can pass [flags](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#Parameters) to be added to the regular expression.

```js
var regex = regenerate(0x1D306, 0x1F4A9).toRegExp();
// → /\uD834\uDF06|\uD83D\uDCA9/
regex.test('𝌆');
// → true
regex.test('A');
// → false

// With flags:
var regex = regenerate(0x1D306, 0x1F4A9).toRegExp('g');
// → /\uD834\uDF06|\uD83D\uDCA9/g
```

**Note:** This probably shouldn’t be used. Regenerate is intended as a tool that is used as part of a build process, not at runtime.
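
One possible shape for such a build step is sketched below: the pattern string is embedded as a regular expression literal in generated source code, so the runtime code never loads Regenerate. This is only an illustration under stated assumptions. The pattern is hard-coded here (copied from the example above) where a real script would call `toString()`, the generated module is evaluated in memory instead of being written to disk, and the sketch assumes the pattern contains no unescaped `/`.

```js
// Build time: in a real build script this string would come from
// `regenerate(...).toString()`. It is hard-coded here to keep the sketch
// self-contained.
var pattern = '\\uD834\\uDF06|\\uD83D\\uDCA9';

// Emit JavaScript source that embeds the pattern as a regex literal.
// Assumption: `pattern` contains no unescaped `/`.
var generatedSource = 'module.exports = /' + pattern + '/;\n';
// A real script would now write `generatedSource` to a file of its choosing.

// Runtime: evaluate the generated source in memory to stand in for loading
// the generated file.
var generatedModule = { 'exports': {} };
new Function('module', generatedSource)(generatedModule);
var regex = generatedModule.exports;

var matchesAstral = regex.test('\uD83D\uDCA9');
// → true
```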

### `regenerate.prototype.valueOf()` or `regenerate.prototype.toArray()`

Returns a sorted array of unique code points in the set.

```js
regenerate(0x1D306)
	.addRange(0x60, 0x65)
	.add(0x59, 0x60) // note: 0x59 is added after 0x65, and 0x60 is a duplicate
	.valueOf();
// → [0x59, 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x1D306]
```

### `regenerate.version`

A string representing the semantic version number.

## Combine Regenerate with other libraries

Regenerate gets even better when combined with other libraries such as [Punycode.js](https://mths.be/punycode). Here’s an example where [Punycode.js](https://mths.be/punycode) is used to convert a string into an array of code points, that is then passed on to Regenerate:

```js
var regenerate = require('regenerate');
var punycode = require('punycode');

var string = 'Lorem ipsum dolor sit amet.';
// Get an array of all code points used in the string:
var codePoints = punycode.ucs2.decode(string);

// Generate a regular expression that matches any of the symbols used in the string:
regenerate(codePoints).toString();
// → '[ \\.Ladeilmopr-u]'
```

In ES6 you can do something similar with [`Array.from`](https://mths.be/array-from) which uses [the string’s iterator](https://mathiasbynens.be/notes/javascript-unicode#iterating-over-symbols) to split the given string into an array of strings that each contain a single symbol. [`regenerate()`](#regenerateprototypeaddvalue1-value2-value3-) accepts both strings and code points, remember?

```js
var regenerate = require('regenerate');

var string = 'Lorem ipsum dolor sit amet.';
// Get an array of all symbols used in the string:
var symbols = Array.from(string);

// Generate a regular expression that matches any of the symbols used in the string:
regenerate(symbols).toString();
// → '[ \\.Ladeilmopr-u]'
```
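
Both approaches amount to walking the string’s UTF-16 code units and joining surrogate pairs into single code points. If neither Punycode.js nor `Array.from` is available, a small ES5 helper along the following lines could produce the array of code points. `decodeCodePoints` is a hypothetical name for this sketch and is not part of Regenerate or Punycode.js.

```js
// Convert a string into an array of code points, joining surrogate pairs.
// Lone surrogates are kept as-is.
function decodeCodePoints(string) {
	var output = [];
	var index = 0;
	var length = string.length;
	while (index < length) {
		var value = string.charCodeAt(index++);
		if (value >= 0xD800 && value <= 0xDBFF && index < length) {
			// High surrogate: check whether a low surrogate follows.
			var extra = string.charCodeAt(index);
			if ((extra & 0xFC00) == 0xDC00) {
				index++;
				value = ((value - 0xD800) << 10) + (extra - 0xDC00) + 0x10000;
			}
		}
		output.push(value);
	}
	return output;
}

var codePoints = decodeCodePoints('a\uD834\uDF06b');
// → [0x61, 0x1D306, 0x62]

var withLoneSurrogate = decodeCodePoints('a\uD834');
// → [0x61, 0xD834]
```

The resulting array of numbers can be passed to `regenerate()` in the same way as `codePoints` in the Punycode.js example.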

## Support

Regenerate supports at least Chrome 27+, Firefox 3+, Safari 4+, Opera 10+, IE 6+, Node.js v0.10.0+, io.js v1.0.0+, Narwhal 0.3.2+, RingoJS 0.8+, PhantomJS 1.9.0+, and Rhino 1.7RC4+.

## Unit tests & code coverage

After cloning this repository, run `npm install` to install the dependencies needed for Regenerate development and testing. You may want to install Istanbul _globally_ using `npm install istanbul -g`.

Once that’s done, you can run the unit tests in Node using `npm test` or `node tests/tests.js`. To run the tests in Rhino, Ringo, Narwhal, and web browsers as well, use `grunt test`.

To generate the code coverage report, use `grunt cover`.

## Author

| [![twitter/mathias](https://gravatar.com/avatar/24e08a9ea84deb17ae121074d0f17125?s=70)](https://twitter.com/mathias "Follow @mathias on Twitter") |
|---|
| [Mathias Bynens](https://mathiasbynens.be/) |

## License

Regenerate is available under the [MIT](https://mths.be/mit) license.