What you should know about JavaScript regular expressions

Original link: bjorn.tipling.com

Regular expressions in JavaScript are not always intuitive. I aim to provide some clarity with examples of things I have found myself getting stuck on. This post covers a few topics, including state in regular expressions, regular expression performance, and various things that have tripped me up.


 regular expressions are stateful

Regular expression objects maintain state. For example, the exec method is not idempotent: successive calls may return different results. Calls to exec behave this way because, when the global flag is set, the regular expression object remembers where the next search should start.

As an example, examine the following code:

var res, text = "foo1 bar1 foo2 bar \n foo3 bar2",
    regexp = /foo\d (bar\d?)/g;


console.log("regexp.lastIndex:", regexp.lastIndex);
while (res = regexp.exec(text)) {
    console.log("regexp.lastIndex:", regexp.lastIndex, 
                "index:", res.index,
                "res[0]:", res[0],
                "res[1]:", res[1]);
}
console.log("regexp.lastIndex:", regexp.lastIndex);

jsfiddle

The log prints:

regexp.lastIndex: 0
regexp.lastIndex: 9 index: 0 res[0]: foo1 bar1 res[1]: bar1
regexp.lastIndex: 18 index: 10 res[0]: foo2 bar res[1]: bar
regexp.lastIndex: 30 index: 21 res[0]: foo3 bar2 res[1]: bar2
regexp.lastIndex: 0 

The most important thing in this bit of code is that we are calling regexp.exec in a while loop multiple times, and on each call it returns a different result.

The variable text contains the string that we want to search. The variable regexp contains a regular expression literal, /foo\d (bar\d?)/g, that searches for the characters foo followed by a numeric digit \d, a space, and the characters bar, which may or may not be followed by a digit \d?. Also note the parentheses around the (bar\d?) portion: they create a capture group, so the matched substring is returned separately in the result. Finally, the regular expression is terminated by a g, the global flag, which lets us continue searching for more results after we find the first.

The state in a regular expression is captured by the lastIndex property, which is the index of the character where the search starts on the next call to exec. lastIndex starts at 0 and is reset to 0 when exec finishes searching the text, at which point exec returns null. As the log indicates, the returned result array res contains important information about what was found, or is null if nothing was found. The returned array is special because, in addition to the results, it has the properties index and input, which provide the position of the match and the string that was searched.
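
The properties described above can be inspected directly. The following is a minimal sketch; the search string and variable names are made up for illustration:

```javascript
// Illustrative only: the string here is made up for this sketch.
var sampleText = "xx foo1 bar1",
    sampleRegexp = /foo\d (bar\d?)/g,
    first = sampleRegexp.exec(sampleText),
    second;

console.log(first[0]);                   // the full match
console.log(first[1]);                   // the text captured by (bar\d?)
console.log(first.index);                // where the match starts in sampleText
console.log(first.input === sampleText); // input is the string that was searched
console.log(sampleRegexp.lastIndex);     // where the next call to exec will start

second = sampleRegexp.exec(sampleText);  // nothing left to find
console.log(second, sampleRegexp.lastIndex);
```

The second call finds nothing, so it returns null and lastIndex goes back to 0.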

If a while loop is used to search with a regular expression that does not have the global flag, the loop will never finish once a match is found: exec returns that first match, forever. If you are not sure whether the regular expression you are searching with is global, you can check its global property:

while (res = regexp.exec(text)) {
  // do something with res
  if (!regexp.global) {
   break;
  }
}

jsfiddle
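
The check on the global property can be wrapped in a small helper. This is only a sketch: collectMatches is a hypothetical name, not something from a library, and the guard for empty matches is an extra precaution beyond what the text above describes.

```javascript
// Hypothetical helper: collect every match, whether or not the
// regular expression has the global flag.
function collectMatches(regexp, text) {
    var res, results = [];
    regexp.lastIndex = 0; // start from the beginning, ignoring earlier state
    while (res = regexp.exec(text)) {
        results.push(res[0]);
        if (!regexp.global) {
            break; // a non-global search would return the same match forever
        }
        if (res[0] === "") {
            regexp.lastIndex += 1; // an empty match does not advance lastIndex
        }
    }
    return results;
}

console.log(collectMatches(/foo\d/g, "foo1 foo2 foo3")); // all three matches
console.log(collectMatches(/foo\d/, "foo1 foo2 foo3"));  // only the first match
```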

Using the test method on a global regular expression also makes use of state. It will advance lastIndex the same as exec. test returns true or false depending on whether a match was found.

test is faster than exec, so you might be tempted to check a string with test first and call exec only if something is found. With a global regular expression this leads to a mistake, because test will already have advanced lastIndex.

var res, text = "foo1 bar1 foo2 bar \n foo3 bar2",
    regexp = /foo\d (bar\d?)/g;


while (regexp.test(text)) {
    res = regexp.exec(text);
    console.log("regexp.lastIndex:", regexp.lastIndex, 
                "index:", res.index,
                "res[0]:", res[0],
                "res[1]:", res[1]);
}

jsfiddle

The result is missed results and finally a JavaScript error because we attempted to access properties on a null object:

regexp.lastIndex: 18 index: 10 res[0]: foo2 bar res[1]: bar
Uncaught TypeError: Cannot read property 'index' of null 

You could reset lastIndex to avoid this issue:

var res, text = "foo1 bar1 foo2 bar \n foo3 bar2",
    regexp = /foo\d (bar\d?)/g,
    lastIndex = 0;


while (regexp.test(text)) {
    regexp.lastIndex = lastIndex;
    res = regexp.exec(text);
    console.log("regexp.lastIndex:", regexp.lastIndex, 
                "index:", res.index,
                "res[0]:", res[0],
                "res[1]:", res[1]);
    lastIndex = regexp.lastIndex;
}

jsfiddle

The extra cost of calling test is not worth it, as explained in the performance section at the end of this post. I only meant to demonstrate that test and exec both advance lastIndex.

In the following bit of code you might expect only one result for each search, since I have set lastIndex to begin at 3. The second search finds 2, however, because lastIndex was reset to 0 at the end of the first while loop, when test returned false. lastIndex is not set back to 3 when you test a new string; the regular expression carries on from whatever state the previous search left behind.

var texts = [
    "foo foo",
    "foo foo"
];
var regex = /foo/g;
regex.lastIndex = 3;

texts.forEach(function (text) {
    var count, res;
    count = 0;
    while (res = regex.test(text)) {
        count += 1;
    } 
    console.log("number of results found:", count);
});

jsfiddle

Result:

"number of results found:" 1 
"number of results found:" 2

To fix this, set regex.lastIndex = 3; before searching each string:

var texts = [
    "foo foo",
    "foo foo"
];
var regex = /foo/g;

texts.forEach(function (text) {
    var count, res;   
    regex.lastIndex = 3;
    count = 0;
    while (res = regex.test(text)) {
        count += 1;
    } 
    console.log("number of results found:", count);
});

jsfiddle

Result:

"number of results found:" 1 
"number of results found:" 1 

ECMAScript 6 (Harmony) adds a new stateful flag for regular expressions, the sticky flag. As of August 2014 it is only implemented in Firefox. The sticky flag is enabled with the y flag, like so: /foo/y. It advances lastIndex like g, but only if a match is found starting exactly at lastIndex; there is no forward search. The sticky flag was added to improve the performance of lexical analyzers written in JavaScript but, as the MDN documentation indicates, it can also be used to require a match starting at position n, where n is the value of lastIndex. For a non-multiline regular expression, a lastIndex of 0 with the sticky flag is in effect the same as starting the regular expression with ^, which requires the match to start at the beginning of the text searched.

The following code demonstrates the sticky flag. Note that it may only work in Firefox:

var searchStrings, stickyRegexp;

stickyRegexp = /foo/y;

searchStrings = [
    "foo",
    " foo",
    "  foo",
];
searchStrings.forEach(function(text, index) {
    stickyRegexp.lastIndex = 1;
    console.log("found a match at", index, ":", stickyRegexp.test(text));
});

jsfiddle (firefox only)

Result:

"found a match at" 0 ":" false
"found a match at" 1 ":" true
"found a match at" 2 ":" false
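
Since the sticky flag was motivated by lexical analyzers, here is a rough sketch of a toy tokenizer built on it. The token names, patterns and the tokenize function are all invented for this example, and it needs an engine that supports the y flag:

```javascript
// A toy tokenizer; the token types and patterns are made up for this sketch.
var tokenTypes = [
    { type: "number", regexp: /\d+/y },
    { type: "word",   regexp: /[a-z]+/y },
    { type: "space",  regexp: /\s+/y }
];

function tokenize(text) {
    var tokens = [], position = 0, i, res, matched;
    while (position < text.length) {
        matched = false;
        for (i = 0; i < tokenTypes.length; i += 1) {
            // Sticky: the match must start exactly at position.
            tokenTypes[i].regexp.lastIndex = position;
            res = tokenTypes[i].regexp.exec(text);
            if (res) {
                if (tokenTypes[i].type !== "space") {
                    tokens.push(tokenTypes[i].type + ":" + res[0]);
                }
                position = tokenTypes[i].regexp.lastIndex;
                matched = true;
                break;
            }
        }
        if (!matched) {
            throw new Error("Unexpected character at position " + position);
        }
    }
    return tokens;
}

console.log(tokenize("foo 42 bar")); // word, number and word tokens
```

Because there is no forward search, each pattern either matches at the current position or fails immediately, which is what a tokenizer wants.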

 multiline and global search

The confusion between multiline and global might just be something that afflicted me, so if you already understand the difference feel free to skip this section; it’s pretty straightforward. Multiline searches affect the way that the special regular expression characters ^ and $ behave when the search string includes newlines. Multiline searches are enabled with the m flag like so: /foo/m. I can’t explain it better than the MDN documentation:

The m flag is used to specify that a multiline input string should be treated as multiple lines. If the m flag is used, ^ and $ match at the start or end of any line within the input string instead of the start or end of the entire string.

Global searches on the other hand allow you to continue searching the entire search string after you have found the first result, to find additional results. It has nothing to do with ^ and $ and it does not treat newlines as special.

Consider the following example:

var text, searches;

text = "foo \nfoo \nfoo ";
searches = [
 /^foo $/g , // finds 0 results
 /^foo $/m,  // finds only 1 result and stops
 /^foo $/mg, // the only regexp that finds 3 
 /foo \n/g,  // finds 2, but not the last
 /^foo \n/g, // finds only 1, the first
];

searches.forEach(function (search) {
  printSearches(search);
});

function printSearches(regex) {
    var res;
    console.log("searching with", regex);
    while (res = regex.exec(text)) {
        console.log(regex, "->", res);
        if (!regex.global) {
            break;
        }
    }
}

jsfiddle

The function printSearches uses a while loop to search text with exec and logs the result. The search string text has multiple new lines.

The first search, /^foo $/g, would only match if the entire string were "foo ", so it finds no results.

searching with /^foo $/g

/^foo $/m prints only one result as it is not a global search and thus the while loop breaks on !regex.global. Otherwise it would loop forever, printing the first result again and again.

searching with /^foo $/m
/^foo $/m "->" ["foo ", index: 0, input: "foo ↵foo ↵foo "]

/^foo $/gm finds each instance of “foo ” in the search string as it is a global search. It continues to search the search string until it can find no more results.

searching with /^foo $/gm
/^foo $/gm "->" ["foo ", index: 0, input: "foo ↵foo ↵foo "]
/^foo $/gm "->" ["foo ", index: 5, input: "foo ↵foo ↵foo "]
/^foo $/gm "->" ["foo ", index: 10, input: "foo ↵foo ↵foo "]

/foo \n/g is not a substitute for a multiline search. It assumes a particular form of newline and so misses \r\n, it doesn’t find the last “foo ” at the end of the string, and the lack of a ^ could lead to other problems.

searching with /foo \n/g
/foo \n/g "->" ["foo ↵", index: 0, input: "foo ↵foo ↵foo "]
/foo \n/g "->" ["foo ↵", index: 5, input: "foo ↵foo ↵foo "]

/^foo \n/g only finds the first foo because, in a non-multiline search, ^ only matches the start of the string, not the start of a line.

searching with /^foo \n/g
/^foo \n/g "->" ["foo ↵", index: 0, input: "foo ↵foo ↵foo "] 
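
To illustrate the point about \r\n, here is a small sketch with a made-up string that uses Windows-style line endings. countMatches is a hypothetical helper and only makes sense for global regular expressions:

```javascript
// Made-up input that uses \r\n line endings.
var windowsText = "foo \r\nfoo \r\nfoo ";

// Hypothetical helper; pass only global regular expressions,
// otherwise the loop would never end.
function countMatches(regexp, text) {
    var count = 0;
    regexp.lastIndex = 0;
    while (regexp.exec(text)) {
        count += 1;
    }
    return count;
}

console.log(countMatches(/^foo $/gm, windowsText)); // 3, \r also ends a line
console.log(countMatches(/foo \n/g, windowsText));  // 0, "foo " is followed by \r
```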

 greedy vs non-greedy

Greediness in regular expressions is not really a JavaScript-specific topic, but I thought I would mention it here anyway because knowing about greedy versus non-greedy is really useful. The greediness of a regular expression affects how the “zero or more” * and “one or more” + quantifiers behave. The default behavior for these is greedy. To make a quantifier non-greedy, put a question mark after it, as in *? or +?. To understand how these behave, look at this code, which reuses the printSearches function defined above:

var text = "foo bar foo bar foo bar",
    greedyRegexp = /foo.*bar/g,
    nonGreedyRegexp = /foo.*?bar/g;

printSearches(greedyRegexp);
printSearches(nonGreedyRegexp);

jsfiddle

The text we are searching has a repeated pattern for “foo bar”. greedyRegexp will find only a single result, the entire string as .* will capture until the last “bar” is encountered.

searching with /foo.*bar/g
/foo.*bar/g "->" ["foo bar foo bar foo bar", index: 0, input: "foo bar foo bar foo bar"] 

nonGreedyRegexp instead finds each instance of “foo bar”: a non-greedy match ends as soon as it has found the first “bar” that satisfies the regular expression.

searching with /foo.*?bar/g
/foo.*?bar/g "->" ["foo bar", index: 0, input: "foo bar foo bar foo bar"] 
/foo.*?bar/g "->" ["foo bar", index: 8, input: "foo bar foo bar foo bar"] 
/foo.*?bar/g "->" ["foo bar", index: 16, input: "foo bar foo bar foo bar"] 

I think an image might help to demonstrate the difference between greedy and non-greedy:

[Image: /foo.*bar/ is greedy and matches the entire string “foo bar foo bar foo bar”, while /foo.*?bar/ is not greedy and its first match is only the leading “foo bar”.]

This StackOverflow question has some pretty good answers with more information: how to make Regular expression into non-greedy?
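
A common place where the difference matters is matching delimited text. The following sketch uses a made-up snippet of markup; it is only meant to show greedy versus non-greedy behavior, not a way to parse HTML:

```javascript
// Made-up markup for illustration.
var html = "<b>bold</b> and <i>italic</i>",
    tagRegexp = /<.+?>/g,
    tags = [],
    res;

console.log(/<.+>/.exec(html)[0]);  // greedy: runs to the last ">"
console.log(/<.+?>/.exec(html)[0]); // non-greedy: stops at the first ">"

while (res = tagRegexp.exec(html)) {
    tags.push(res[0]);
}
console.log(tags); // each tag on its own
```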


 constructors versus literals

Regular expressions can be created with the literal syntax such as:

var regexp = /foo.*bar/gi;

Or you can create one with the constructor RegExp:

var regexp = new RegExp("foo.*bar", "gi");
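
One detail to watch with the constructor form is that the pattern is an ordinary string, so backslashes have to be doubled. A small sketch, with made-up patterns, comparing the two forms:

```javascript
var literal = /foo\d+\.bar/gi,
    constructed = new RegExp("foo\\d+\\.bar", "gi"),
    // Single backslashes are consumed by the string, leaving "food+.bar".
    mistaken = new RegExp("foo\d+\.bar");

console.log(literal.source === constructed.source); // true
console.log(constructed.global, constructed.ignoreCase);
console.log(mistaken.test("foo1.bar"));             // false
```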

Literal regular expressions are compiled once, when the script is evaluated, while a regular expression created with the constructor is compiled at runtime. Unless you need to create a regular expression dynamically, use the literal syntax. A dynamic regular expression can be useful, though. Here’s an example where I use both a literal and a dynamic regular expression to pull out brands and heroes from a string of comic book heroes:

var text = "DC: Batgirl, Marvel: Rogue, Image: Spawn, Image: Celestine, Marvel: Iron Man, DC: Superman, Image: Ant",
    brandRegexp = /(^|, )(\S+):/g, 
    regexp, brands, res, heroes, brand;

brands = {};
while (res = brandRegexp.exec(text)) {
    brand = res[2];
    brands[brand] = (brands[brand] || 0) + 1;
}
for (brand in brands) { 
    regexp = new RegExp(brand + ": ([\\S ]+?)(,|$)","g");
    heroes = [];
    while (res = regexp.exec(text)) {
        heroes.push(res[1]);
    }
    console.log("Found", brands[brand], "heroes for", brand, "comics:", heroes.join(", "));
}

jsfiddle

Result:

Found 2 heroes for DC comics: Batgirl, Superman
Found 2 heroes for Marvel comics: Rogue, Iron Man
Found 3 heroes for Image comics: Spawn, Celestine, Ant

You could just use a single literal regular expression here, which would make the code cleaner and more efficient, but I wanted to demonstrate both ways of creating regular expressions.
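
For comparison, here is one way the single-literal version might look. This is a sketch, not the author's code: heroRegexp and heroesByBrand are names chosen for illustration, and the pattern assumes hero names never contain a comma.

```javascript
var text = "DC: Batgirl, Marvel: Rogue, Image: Spawn, Image: Celestine, Marvel: Iron Man, DC: Superman, Image: Ant",
    heroRegexp = /(\S+): ([^,]+)/g, // brand, then everything up to the next comma
    heroesByBrand = {},
    res, brand;

while (res = heroRegexp.exec(text)) {
    brand = res[1];
    if (!heroesByBrand[brand]) {
        heroesByBrand[brand] = [];
    }
    heroesByBrand[brand].push(res[2]);
}
for (brand in heroesByBrand) {
    console.log("Found", heroesByBrand[brand].length, "heroes for", brand,
                "comics:", heroesByBrand[brand].join(", "));
}
```

A single pass over the string collects both the brands and their heroes, so no regular expression has to be built at runtime.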


 split, search, match and replace

Various methods are available on strings that can take regular expressions as arguments. For example the split method can take a string or a regular expression as an argument. This might be useful if you wanted to split a string on any kind of whitespace:

var text = "apples oranges      bananas\n\n\nstrawberries \tplums";

console.log(text.split(/\s+/));

jsfiddle

This finds all the fruit and trims the whitespace:

["apples", "oranges", "bananas", "strawberries", "plums"]

The search string method finds the index for the first match or -1 if a match wasn’t found. It’s similar to indexOf except you can use a regular expression for a more powerful search.

var text = "foo bar";

console.log(text.indexOf("bar")); //logs 4
console.log(text.search(/bar/));  //also logs 4

jsfiddle

The match string method returns an array with results. It can return all of the matches for a regular expression if you use the g global flag. This might be an alternative to iterative calls with exec.

var text = "foo bar foo bar foo bar";

console.log(text.match(/foo.*bar/));   //greedy
console.log(text.match(/foo.*?bar/));  //non-greedy, only 1 result
console.log(text.match(/foo.*?bar/g)); //non-greedy, all results

jsfiddle

The results are:

["foo bar foo bar foo bar", index: 0, input: "foo bar foo bar foo bar"] 
["foo bar", index: 0, input: "foo bar foo bar foo bar"]
["foo bar", "foo bar", "foo bar"] 

Unlike the regular expression methods exec and test, the string methods match and search are not stateful and do not advance lastIndex on the regular expression.

var text = "foo bar foo bar foo bar",
    globalRegexp = /foo.*?bar/g;

console.log(text.match(globalRegexp)); //non-greedy, all results
console.log(globalRegexp.lastIndex); //logs 0, lastIndex not advanced

jsfiddle

The replace method on strings takes a string or a regular expression as its first argument. It returns a new string with the text replaced; it does not modify the existing string. Use a global flag if you want all instances replaced:

var text = "foo bar foo bar foo bar";

console.log(text.replace("bar", "BAR")); //only replaces first
console.log(text.replace(/bar/, "BAR")); //only replaces first
console.log(text.replace(/bar/g, "BAR")); //replaces all

jsfiddle

Results:

foo BAR foo bar foo bar
foo BAR foo bar foo bar
foo BAR foo BAR foo BAR

Firefox’s implementation of replace allows specifying flags as a third argument, but this is non-standard and will not work in other browsers. Set the flags on the regular expression instead.


 regular expressions and application performance

Creating a regular expression once, up front, and reusing it is faster than creating it each time it is needed, regardless of whether a constructor or a regular expression literal is used. In these tests, a literal regular expression was also always faster than one created with a constructor.

jsperf

The MDN documentation might lead one to assume that it is OK to define regular expressions in a loop since they’re created at evaluation time and won’t be recompiled:

The literal notation provides compilation of the regular expression when the expression is evaluated. Use literal notation when the regular expression will remain constant. For example, if you use literal notation to construct a regular expression used in a loop, the regular expression won’t be recompiled on each iteration.

But defining the regular expression before the loop is much faster in Firefox and a little bit faster in Chrome.

jsperf
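
The two variants being compared look roughly like the following. This is a sketch with made-up data; it shows the two shapes of code and makes no claim about timings:

```javascript
// Made-up input for illustration.
var lines = ["foo1", "bar", "foo22", "baz"],
    digitsRegexp = /\d+$/, // created once, before the loop
    insideLoop = 0,
    beforeLoop = 0,
    i;

// Variant 1: the literal sits inside the loop body.
for (i = 0; i < lines.length; i += 1) {
    if (/\d+$/.test(lines[i])) {
        insideLoop += 1;
    }
}

// Variant 2: the regular expression is defined before the loop and reused.
// It is not global, so no lastIndex state carries over between strings.
for (i = 0; i < lines.length; i += 1) {
    if (digitsRegexp.test(lines[i])) {
        beforeLoop += 1;
    }
}

console.log(insideLoop, beforeLoop); // both variants find the same lines
```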

Using test to check a string before exec is probably not worth it.

jsperf

test is faster than exec, however.

jsperf

Using search to check a string before match is also not worth it.

jsperf

search is faster than match.

jsperf

match and exec perform differently in different engines. In Chrome exec seems to be faster than match, and in Firefox the reverse is true. If you want to get the number of results, split is your best bet in Firefox and exec wins in Chrome.

jsperf
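
For reference, the three ways of counting results mentioned above might be written like this. It is a sketch with a made-up string, and the variable names are arbitrary:

```javascript
var countText = "foo bar foo bar foo bar",
    countRegexp = /foo/g,
    execCount = 0,
    matchResult, matchCount, splitCount;

// exec: loop until it returns null.
while (countRegexp.exec(countText)) {
    execCount += 1;
}

// match: returns null rather than an empty array when nothing is found.
matchResult = countText.match(/foo/g);
matchCount = matchResult ? matchResult.length : 0;

// split: the number of pieces is one more than the number of separators.
splitCount = countText.split(/foo/).length - 1;

console.log(execCount, matchCount, splitCount); // 3 3 3
```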

If you want to replace text in a string, replace seems to be much faster than using split and join, which makes sense since split creates a new array. This is true in both Firefox and Chrome.

jsperf
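
The two approaches being compared produce the same string. A minimal sketch, with made-up variable names:

```javascript
var original = "foo bar foo bar foo bar",
    viaReplace = original.replace(/bar/g, "BAR"),
    viaSplitJoin = original.split("bar").join("BAR"); // builds an intermediate array

console.log(viaReplace);                  // foo BAR foo BAR foo BAR
console.log(viaReplace === viaSplitJoin); // true
```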


That rounds out an exploration of regular expressions in JavaScript. Some of these things never seemed intuitive to me. Ultimately, in most scenarios it is probably best to avoid regular expressions and solve the problem some other way, unless you want to replace text in a string, that is.
