Monday, September 04, 2006

More ActiveRecord Information

Supporting More Databases

A number of folks have asked about support for database X. We want to support all possible databases, of course, and JDBC gives us a huge head start on that. Many databases will only need a bit of tweaking to support all the basic CRUD operations. The additional work...small though it may be...is implementing the schema-manipulating methods of ActiveRecord. Here's a complete list...if you can provide examples for each operation, it would speed our supporting them:

  • add_column
  • add_index
  • change_column
  • change_column_default
  • columns
  • create_table
  • drop_table
  • initialize_schema_information
  • native_database_types
  • remove_column
  • remove_index
  • rename_column
  • rename_table
  • structure_dump
  • table_alias_for
  • table_alias_length
Check available docs for ActiveRecord or look at the existing adapters in ActiveRecord in SVN for examples on each.

Avoiding Modifying active_record.rb

There's now a thread going on the Rails core mailing list and a bug to address the issue of adding adapters to Rails. Hopefully one of two things will come out of this:
  • "jdbc" will be added to the list of adapters available in core Rails
  • a mechanism will be added (or fixed) to allow adding to the list of adapters from outside the core source
The former would be ideal, since no additional, special config changes would be needed for jdbc versus MySQL, for example. The latter would be an "ok" compromise, but would require modifying all apps to add "jdbc" to the adapter list. I'll report back when something comes of it.

Java Regex Broken for Alternation over Large Strings

This is a serious bummer.

I've been working through some of Rails's test cases this evening. Since I've got Depot running reasonably well now and I know I can easily wire together a simple Rails app that calls Hibernate or EJBs, I figured I'd try to get a better handle on the true state of Rails support.

As I'm running test cases--the majority of which appear to be passing, btw--I came across a couple goofy regex issues. The first is one we'd seen before, where dot refuses to match \r. In this case, adding DOT_ALL to the Pattern's flags seems to resolve it. I think we just hadn't noticed that flag before, so many other similar bugs might be resolved the same way. Not an easy issue to narrow down, but I found it and moved on.

So then I started seeing SystemStackErrors popping up within the CGI library. When I see stack overflows in JRuby, I usually figure it's one of two things:

  1. Something's wrong with JRuby that's breaking a termination condition in Ruby code, so a recursion is never ending
  2. JRuby itself is missing a termination and recursing endlessly
Pretty simple issues to track down...find where the recursion is happening, and figure out why it's not terminating. Except in this case, the recursion wasn't in either Ruby or JRuby code. It was in Java's regex library.

Blowing The Stack

O Java, how I love Thee. Every library I could want, all in one place. You care for me so well. It pains me to find your faults.

So it appears that the Java regex library's compiled patterns execute recursively. This alone wouldn't be particularly unusual, especially if the recursion was limited by the complexity of the regex itself. However in this case it seems the recursion is a factor of the string to match, rather than of the expression complexity. With a large enough input string, I can get Java's regex to blow the stack every time.
regex = /\A((?:.|\n)*?)?([\r\n]{1,2}|--)/n
long_string = ""
76.times { long_string += "xxxxxxxxxxxxxxxxxx" }

long_string.sub(r) {""}

This bit of Ruby code will cause java.util.regex.Pattern to blow the stack during matching. I'm sure this regex could be reduced to a simpler case, but I don't really need to go through the exercise of doing so; others have reported the same issue:
So the issue stands, unfortunately. I'm not sure I'd characterize it as an RFE...I'm no regex expert, but I would hope my regex engine wouldn't overflow based on large input, especially if that input was completely bogus as in my example above. I don't think that's unreasonable, especially considering that Ruby's regex engine appears to handle this case (and considerably larger cases) just fine. Correct me if I'm expecting too much.

So what's the damage? Well, since we currently use java.util.regex exclusively, and since Ruby libraries and apps very frequently use regex to match against very large strings, we'll have to migrate to a different regex library. Until that happens, however, Rails under JRuby will not support large multipart posts, at a minimum. There may be other places this has an effect, but multipart posting was the area I investigated.

So for now...I guess it's just busted. We'll begin evaluating alternative regex libraries posthaste. Any help from you folks would be appreciated...just take that regex or one from the bugs and try to match against extremely large input.

Rewrite the Regex?

Some may say that the regex should be rewritten to avoid the use of alternation. Perhaps so, perhaps there's a better way. However Ruby handles it fine, so requesting that the Rails folks or any other Ruby devs start tailoring their regex so they will work under JRuby is quite out of the question. If we can't run the same regex against the same input, we need a different regex library...that's all there is to it.

Besides, the bugs posted have additional regex that are even simpler. It just seems to be a limitation of the Java regex library that certain regex will blow the stack on large input.

For the record, here's the relevant bit of the stack. I get pages and pages of this, under both Java 5 and Java 6 beta 2...both 64-bit under Linux:
...
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
...

What a downer.