regular expressions benchmark

This page discusses the overhead of regular expression matching in Java, as measured by a simple benchmark.

Regular expressions are available in the Java language, as provided by the standard library module java.util.regex. Regexes are represented as strings.

example

Condider using a regex to recognize messages of the following forms and to return the length of the matched file names:
File /usr/lib/mypatterns not readable, please use chmod!!!
File /usr/lib/mypatterns.so not found, please create it!!!
The following Java code performs the above pattern matching:
Pattern pattern = Pattern.compile("File ([^ ]*) not (found|readable).*!!!");
Matcher matcher = pattern.matcher(str);
if(matcher.matches()) {
  return matcher.end(1) - matcher.start(1);
}
Starting from this relatively naive code sequence, some standard techniques can be applied to improve the regular expression: These combined techniques lead to the optimized pattern:
"File ([^ ]*+) not (?:found|readable).*+!!!"

For ultimate efficiency, the above code can be further hand-optimized not to use regular expressions at all, by replacing it with carefully written string operations such as extractions and comparisons:

int pos;
if(str.startsWith("File ") && 
   ((pos = str.indexOf(' ', 5)) >= 0) && 
   str.startsWith("not ", ++pos) && 
   (str.startsWith("found", (pos += 4)) || 
    str.startsWith("readable", pos)) && 
   str.endsWith("!!!")) {
  return pos - 10;
}
The hand-optimized version is less concise, less convenient to write, and much less maintainable, so probably very few Java programmers would ever bother rewriting the code this way, except maybe for a handful of really critical cases. The optimization may be even harder to perform for more complex regular expressions.

Still, this optimization is interesting to estimate the cost of regex matching.

results

The following table shows the times (in milli-seconds) we obtained by matching one million strings. Of course, the regexp compile operation is outside the loop (the complete code of the benchmark is here). As it can be seen, using regular expressions incurs a severe overhead: 10.6 to 12.9 times slower than the hand-crafted code.

hand crafted optimized regex regex
time (msec) 400 4225 5162
slowdown - 10.6 12.9

conclusion

Regular expression matching is used everywhere because it is extremely convenient and powerful. Programmers are rarely aware of the exact price to pay for this programming comfort. While very significant, the slowdown is clearly justified by advantages such as readability and maintainability.