Irregexp, Google Chrome's New Regexp Implementation

Wednesday, February 04, 2009

One of the new features in the most recent dev-channel release of Google Chrome (2.0.160.0) is Irregexp, a completely new implementation of regular expressions (regexps) in the V8 JavaScript engine. Irregexp builds on V8's existing infrastructure for memory management and native code generation and is tailored to work well for the kinds of regexps used by JavaScript programs on the web. The result is a considerable improvement in V8's regexp performance.

While the V8 team has been working hard to improve JavaScript performance, one part of the language that we have so far not given much attention is regexps. Our previous implementation was based on the widely used PCRE library developed by Philip Hazel at the University of Cambridge. The version we used, known as JSCRE, was adapted and improved by the WebKit project for use with JavaScript. Using JSCRE gave us a regular expression implementation that was compatible with industry standards and has served us well. However, as we've improved other parts of the language, regexps started to stand out as being slower than the rest. We felt it should be possible to improve performance by integrating with our existing infrastructure rather than using an external library. The SquirrelFish team is following a similar approach with their JavaScript engine.

A fundamental decision we made early in the design of Irregexp was that we would be willing to spend extra time compiling a regular expression if that would make running it faster. During compilation Irregexp first converts a regexp into an intermediate automaton representation. This is in many ways the "natural" and most accessible representation and makes it much easier to analyze and optimize the regexp. For instance, when compiling /Sun|Mon/ the automaton representation lets us recognize that both alternatives have an 'n' as their third character. We can quickly scan the input until we find an 'n' and then start to match the regexp two characters earlier. Irregexp looks up to four characters ahead and matches up to four characters at a time.

After optimization we generate native machine code which uses backtracking to try different alternatives. Backtracking can be time-consuming so we use optimizations to avoid as much of it as we can. There are techniques to avoid backtracking altogether but the nature of regexps in JavaScript makes it difficult to apply them in our case, though it is something we may implement in the future.

During development we have tested Irregexp against one million of the most popular webpages to ensure that the new implementation stays compatible with our previous implementation and the web. We have also used this data to create a new benchmark which is included in version 3 of the V8 Benchmark Suite. We feel this is a good reflection of what is found on the web.

If you want to try this out, and help us test it in the process, you can subscribe to the dev-channel and if you see problems that might be related to Irregexp consider filing a bug.

And BTW, we'll have sessions on V8 and other Chrome-related topics in May at Google I/O, Google's largest developer conference.

17 comments:

Reid Kleckner said...

With regard to backtracking vs. automata, would it be possible to use the automata in the absence of non-regular regexp features like backreferences, and then even when there are backreferences, use the automata to match everything up to the backreference and everything after it? I feel like this would be the ideal approach, but maybe the native code backtracking is fast enough.

Christian Plesner Hansen said...

Reid: That is certainly something we've considered. The problem we ran into is that not only backreferences but also basic operators like | and * are defined in terms of backtracking. To get the right behavior you may need backtracking even for seemingly simple regexps. Based on the data we have for how regexps are used on the web and considering the optimizations we already had in place we decided that the subset of regexps that would benefit from this was too small.

Chris Ryland said...

http://swtch.com/~rsc/regexp/regexp1.html just popped up on Hacker News recently.

I hope you're using some form of Thompson's NFA/DFA idea.

TuringTest said...

I want to emphasize that there are two kinds of regular expressions: the left-biased kind that Perl and Javascript and other use, and the leftmost-longest that is the Posix standard used by sed and grep.

The left-biased kind matches "a|ab" against "ab" by choosing to match just "a" since "a" comes on the left of the "|" and is tried first. The backtracking implementation is obvious. And handling backreferences in the pattern is easy.

The leftmost-longest kind matches "a|ab" against "ab" by choosing the longest possible match of "ab". To try all branches in parallel is required and thus a finite automaton is needed. But automaton cannot usually handle backreferences inside the pattern.

sheweeeeee said...

Tracemonkey=V10???

divilex said...

No, V8 is faster than Tracemonkey.

Darky said...

yup, the latest V8 from Chromium builds are consistently faster than the latest TraceMonkey from Minefield builds by about 10%

foof said...

Hey! They stole my name!

http://synthcode.com/scheme/irregex/

المعلم حمادة said...

v8 faster than ,,

thanks ..

http://ebdaa.yoo7.com

tronadave said...

I am not a techie..I love all this info on the net...My problem is there has to be this computer in the way and an operating system made for ???? After loading Chrome I found many files tucked into all my folders.....not acceptable!!!! I dumped chrome then searched for all files with chrome or google....found 4 and erased them but now I must search manually through all my folders and erase each one........and this is your idea of user-friendly?

r0dnt said...

I work with web based text logs all day, and I have found Chrome to be the fastest and easiest to use, especially with the extra real estate. It would be super nifty if I was able to use regexp in the Find feature.

Apologies if this is not the right place for an enhancement.

Erik Corry said...

You should probably file it as a feature request on http://code.google.com/p/chromium/issues/list

It might be doable as an extension with the new extensions framework.

Ian said...

Before you look at improving you should have made sure the existing regexp engine was working. We found that when matching the charset [6-9] it matches any number.

Erik Corry said...

Thanks for spotting this bug.

The comments on our blogs are not necessarily monitored for bug reports. http://code.google.com/p/v8/issues is a better place for this. But I saw this bug report and I'm fixing it in the latest V8. It should make its way to Chromium and Google Chrome presently. http://codereview.chromium.org/140021

Addicting Games said...

I've been looking forward to discover new development of Google Chrome. And I finally find this blog that by independently reviewing Chrome development. I think it's very great. Thanks. Anyway, if you have time, take a visit to my Free Games website.

Qy said...

I really like google chrome, because of the speed, lightness and ability in a very good browser. especially with the addition of new features and Irregexp this course is very nice to add the ability google chrome. i like google chrome, and I also use it.
internet security software

web hosting said...

This is good post!
[url=http://www.vpswebserver.com/]vps hosting[/url]