Monday, September 04, 2006

Java Regex Broken for Alternation over Large Strings

This is a serious bummer.

I've been working through some of Rails's test cases this evening. Since I've got Depot running reasonably well now and I know I can easily wire together a simple Rails app that calls Hibernate or EJBs, I figured I'd try to get a better handle on the true state of Rails support.

As I'm running test cases--the majority of which appear to be passing, btw--I came across a couple goofy regex issues. The first is one we'd seen before, where dot refuses to match \r. In this case, adding DOT_ALL to the Pattern's flags seems to resolve it. I think we just hadn't noticed that flag before, so many other similar bugs might be resolved the same way. Not an easy issue to narrow down, but I found it and moved on.

So then I started seeing SystemStackErrors popping up within the CGI library. When I see stack overflows in JRuby, I usually figure it's one of two things:

  1. Something's wrong with JRuby that's breaking a termination condition in Ruby code, so a recursion is never ending
  2. JRuby itself is missing a termination and recursing endlessly
Pretty simple issues to track down...find where the recursion is happening, and figure out why it's not terminating. Except in this case, the recursion wasn't in either Ruby or JRuby code. It was in Java's regex library.

Blowing The Stack

O Java, how I love Thee. Every library I could want, all in one place. You care for me so well. It pains me to find your faults.

So it appears that the Java regex library's compiled patterns execute recursively. This alone wouldn't be particularly unusual, especially if the recursion was limited by the complexity of the regex itself. However in this case it seems the recursion is a factor of the string to match, rather than of the expression complexity. With a large enough input string, I can get Java's regex to blow the stack every time.
regex = /\A((?:.|\n)*?)?([\r\n]{1,2}|--)/n
long_string = ""
76.times { long_string += "xxxxxxxxxxxxxxxxxx" }

long_string.sub(r) {""}

This bit of Ruby code will cause java.util.regex.Pattern to blow the stack during matching. I'm sure this regex could be reduced to a simpler case, but I don't really need to go through the exercise of doing so; others have reported the same issue:
So the issue stands, unfortunately. I'm not sure I'd characterize it as an RFE...I'm no regex expert, but I would hope my regex engine wouldn't overflow based on large input, especially if that input was completely bogus as in my example above. I don't think that's unreasonable, especially considering that Ruby's regex engine appears to handle this case (and considerably larger cases) just fine. Correct me if I'm expecting too much.

So what's the damage? Well, since we currently use java.util.regex exclusively, and since Ruby libraries and apps very frequently use regex to match against very large strings, we'll have to migrate to a different regex library. Until that happens, however, Rails under JRuby will not support large multipart posts, at a minimum. There may be other places this has an effect, but multipart posting was the area I investigated.

So for now...I guess it's just busted. We'll begin evaluating alternative regex libraries posthaste. Any help from you folks would be appreciated...just take that regex or one from the bugs and try to match against extremely large input.

Rewrite the Regex?

Some may say that the regex should be rewritten to avoid the use of alternation. Perhaps so, perhaps there's a better way. However Ruby handles it fine, so requesting that the Rails folks or any other Ruby devs start tailoring their regex so they will work under JRuby is quite out of the question. If we can't run the same regex against the same input, we need a different regex library...that's all there is to it.

Besides, the bugs posted have additional regex that are even simpler. It just seems to be a limitation of the Java regex library that certain regex will blow the stack on large input.

For the record, here's the relevant bit of the stack. I get pages and pages of this, under both Java 5 and Java 6 beta 2...both 64-bit under Linux:
...
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
...

What a downer.

8 comments:

Johannes Rudolph said...

Hi,
did you know: the JDK contains the source code of all the java classes? If you read Pattern.java you see that Pattern.compile does not compile the Pattern like you would expect it to do (as in 'compile source code'). I would have expected Pattern to compile the regexp into an automaton(DSA). But what it does is only parsing the expression and building a syntax tree out of Pattern.Nodes which contain the match method which matches part of an input string (in a recursive way).
So it's actually a regexp interpreter. Pattern.compile=parse, Pattern.match=run/interpret.

I found that one:
http://weblogs.java.net/blog/tomwhite/archive/2006/03/a_faster_java_r.html

That explains why it is build the way it is: Perl and therefore Java regexps are no regular regular expressions, but a more complex expression language. So if you build a classic automaton you won't have all those features like look-ahead and backtracking.

To make things work you would have to use this other regexp library and do without the missing features

> Pattern.matches throws > StackOverFlow Error - Closed, will > not be fixed (reported in 2004)
It's that lack of support from sun for the developers that is really annoying. (Ten years of Java but almost only bug fixes, argg)

Ruby uses a gnu regexp library (see regex.c in ruby source) which compiles into some kind of virtual machine byte code which is then executed.

If I understand their (Java) workaround correctly, don't use alternations(|) in combination with quantifiers but use [] bracketed character groups or possessive quantifiers.

The question is: Is the stack space necessarily allocated with this regexp or is it not? If it is, Java is probably using more memory per call than the regex.c version of ruby. After thinking about this I think in your case the memory consumption lies in the nature of your regexp (only the memory per character consumption is different for java and ruby/c).
If you use alternations in your regexp and they are *-quantified the regexp engine has to save backtracking information for every character consumed, java does save this information in the stack. To work around this, avoid alternation or use possessive quantifiers ('*+') which don't collect any backtracking information.

Sorry, this comment is a little bit confusing because I read all those sources while writing... ;)

Sum-up:
Your regexp needs backtracking information wether you use java or ruby. The java version consumes more memory per character than the Ruby/C-version (That's a guess from your observations) so it runs out of memory before the ruby one would. Probably the java implementation is not as efficient as it could be (I can't judge this because I'm no expert in regexps)

So see...
http://tusker.org/regex/regex_benchmark.html
... and perhaps try other engines, but dont expect to much ;)

Stephen Viles said...

"dot refuses to match \r. In this case, adding DOT_ALL to the Pattern's flags seems to resolve it"

Yep, that's what you need. \r is one of the six line terminators that dot will not match unless you specify either DOTALL or UNIX_LINES (which recognises only \n as a line terminator). You may also find MULTILINE useful.

Of course, if you switch to another regex lib, it may have different behaviour.

Charles Oliver Nutter said...

Johannes: Yeah, I didn't really expect that it's "compiling" the regex in a classic sense, especially since the blown stack trace shows all Pattern methods in the stack. It would have been nice if it compiled it down to something more efficient, as you say, but that's the way things are.

My hope in trying other engines is that they'll be more efficient with stack and memory or based upon the same algorithms that Ruby's gnu regexp library uses. It will take a bit of research, but we don't really have another choice; we can't ask Ruby developers to custom-tailor their regex to suit JRuby or they'll just avoid using JRuby.

Matt said...

Hi Charles, I'm sure you would have discovered this anyway, but the Jakarta Project's ORO library is very good and does Perl 5 compatible regex'es (not sure exactly what Ruby requires in that regard).

We've been using it for some years now and not run into any showstoppers.

Juri said...

I've tested Jakarta ORO, it doesnt seem to have the StackOverflow issue (while Jakarta Regexp does)

Steven Brandt said...

I recently wrote a java regular expression library that should be immune to stack overflows. You can take a look at it over at http://stevenrbrandt.com.

Maybe it will solve your problem. Please try it out.

Gregory said...

The regex engine in native Ruby is just as broken as Java's, in that they both use backtracking. Here's an article that explains the problem: http://swtch.com/~rsc/regexp/regexp1.html

Charles Oliver Nutter said...

Gregory: Java's engine recurses, which limits the size of data it can handle. Ruby's isn't great, but it doesn't blow out the call stack for large input.