Friday, April 28, 2006

In The Beginning: Early Returns on JRuby Compilation

I'll put a big fat disclaimer here stating that these results are extremely preliminary, and mean practically nothing. I'm just very tired and very excited that after seven hours of work, I've already made some progress...I sorta, kinda compiled some Ruby code to Java bytecodes. Yes, I've been up all night. Yes, I have to work tomorrow. Yes...I wish I could work on this during the day. C'est la vie!

The Contenders

In order to soften any growing excitement you may be feeling, I must first introduce the scripts in question. The script to be compiled is quite a whopper:

def foo; 'bar'; end

Hey now, everything has to have a beginning. Even the above script is not the whole story; the only code actually being compiled is the four AST nodes that make up the method's body: a Newline (to start the line), a DStr (dynamic string), another DStr, and a Str (the actual string), all nested one inside the other.

Now as anti-climactic as that may sound, this is early proof-of-concept work here...I don't intend to write a compiler for all of Ruby at the same time. This was the smallest non-trivial script I thought I could handle in a first attempt. I do not compile the nodes that define the foo method, nor do I compile any dispatches to the foo method. I just compile the body.

In order to have an uncompiled version, we have another identical method:

def bar; 'baz'; end

The bar method functions exactly as the foo method, but we will not compile this one. It will be the control.

Now since only a portion of the AST nodes are being compiled, it seemed appropriate to measure a method that does nothing, so that the cost of dispatching could be isolated, further narrowing the test. For that, we have baz:

def baz; end

By running the same tests against baz, we can get a better idea how much compilation is helping. There are also other interpreter bits and pieces that will skew the results a bit, but early results are early results; we want to know whether it will be worth the effort.

The Tools

For this round, I chose to use ObjectWeb's ASM bytecode generation library. It seemed best suited to the visitor pattern we use to traverse the AST (since it follows the visitor pattern itself), and it's itty bitty.

Now for the fun part: I wrote the compiler in Ruby using JRuby's Java integration support. Some time ago, Tom added the ability to parse a string from within a Ruby script under JRuby and get back a reference to the resulting AST. Since we already have visitor interfaces for the interpreter, and since JRuby does such a bang-up job of implementing Java interfaces, I figured I'd do the whole thing in Ruby. Naturally, if we continue down this route, we would hope to eventually compile the compiler and add "self-hosting" to our list of buzzwords. However, I digress.
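
To give a flavor of what that looks like, here's a minimal sketch of driving ASM from Ruby under JRuby. It assumes an ASM 2.x jar (the org.objectweb.asm package) is on the classpath; the class and method it emits are made up for illustration, and the real compiler naturally targets our Instruction interface and a lot more plumbing.

require 'java'
include_class "org.objectweb.asm.ClassWriter"
include_class "org.objectweb.asm.Opcodes"

# emit a class roughly equivalent to:
#   public class CompiledBody { public String run() { return "bar"; } }
cw = ClassWriter.new(true)     # true tells ASM to compute max stack/locals for us
cw.visit(Opcodes::V1_4, Opcodes::ACC_PUBLIC, "CompiledBody", nil,
         "java/lang/Object", nil)

# default constructor: just call super()
init = cw.visitMethod(Opcodes::ACC_PUBLIC, "<init>", "()V", nil, nil)
init.visitCode
init.visitVarInsn(Opcodes::ALOAD, 0)
init.visitMethodInsn(Opcodes::INVOKESPECIAL, "java/lang/Object", "<init>", "()V")
init.visitInsn(Opcodes::RETURN)
init.visitMaxs(0, 0)
init.visitEnd

# public String run() { return "bar"; }
run = cw.visitMethod(Opcodes::ACC_PUBLIC, "run", "()Ljava/lang/String;", nil, nil)
run.visitCode
run.visitLdcInsn("bar")
run.visitInsn(Opcodes::ARETURN)
run.visitMaxs(0, 0)
run.visitEnd

cw.visitEnd
bytes = cw.toByteArray         # raw .class bytes, ready to hand to a class loader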

The basic model was simple: parse foo, retrieve the body node from the AST, convert the body to bytecodes, and replace the body with the newly compiled version. This allows our interpreter to continue doing its normal interpretation, but execute this one method's body natively. During my interpreter redesign last Fall we hoped to make this legerdemain invisible...and it seems we've succeeded thus far.

The foo method's body was transformed into a class with one method, execute, implementing the same Instruction interface we use in the interpreter. The difference, of course, is that we would now have one bytecode-based instruction rather than four.
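
For orientation, here's a hand-written Ruby stand-in for what that generated instruction effectively does. It's only a conceptual sketch: the real thing is a Java class emitted as bytecode, and its execute method takes JRuby's internal interpreter state, which I've simplified away to no arguments here.

require 'java'
include_class "java.lang.StringBuffer"

# conceptual stand-in for the generated class; the real one implements our
# internal Instruction interface and pushes its result onto the accumulator
class CompiledFooBody
  def execute
    sb = StringBuffer.new      # the DStr nodes build into a StringBuffer
    sb.append("bar")           # the Str node supplies the literal piece
    sb.toString                # the interpreter wraps this as a RubyString
  end
end

CompiledFooBody.new.execute    # returns "bar" (as a java.lang.String)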

It's also worth pointing out that this model does not turn a Ruby class or a Ruby method into an accessible Java class or method. It turns snippets of Ruby code into tidbits of Java code...and that Java code continues to live as part of the interpreter. This is the most digestible model of compilation for Ruby, and has been followed by most other projects of merit. We may be able to take things further in the future (especially with the potential dynamifying of the JVM), but this level of compilation has massive potential right now.

The Test

Given that the scripts are so simple, the test should be equally simple:

1_000_000.times { foo }

...and equivalent tests for bar and baz. These tests are parsed, processed, and timed in the exact same way (other than the compilation step):

require 'java'
require 'jruby'
include_class "org.jruby.Ruby"

my_ruby = Ruby.default_instance

# parse the method definition and the driver loop up front
foocode = JRuby.parse("def foo; 'bar'; end")
# compilation done here: foocode's body node gets swapped for the compiled version
footest = JRuby.parse("1_000_000.times {foo}")

# define the method, then time only the million dispatches
my_ruby.eval(foocode)
t = Time.now
my_ruby.eval(footest)
p Time.now - t


And again, equivalent tests for bar and baz. That's all folks! Have a nice weekend!

...

Ok, ok. You'd like to know what the times are. That's fair. I've led you this far, and you'd like some payoff for reading my technojargon gobbledygook.

Naturally I wouldn't be writing this if the results were poor.

The Results

Now did I mention these are very early, very preliminary numbers? Don't go off half-cocked talking about JRuby's new compiler, ok?

The bare method baz demonstrated that a large amount of time is spent dispatching in JRuby; perhaps an inordinate amount of time. We have plans to correct this, and have a few optimizations being tossed about.

The baz method took 3.8s to invoke one million times on this machine (Opteron 2.6GHz, Gentoo Linux 2.6.15). That may not seem bad, but of course it wasn't doing anything. Also, C Ruby does the same million dispatches in about 3 tenths of a second. An order of magnitude worse, we are.

The bar method, which represents the uncompiled foo, clocked in at 12.5s to complete a million calls. Now you start to see the real cost of traversing all those AST nodes. Adding only four nodes to the mix (and the side effects that result, of course) more than tripled the total time. Minus the method invocations, the body took roughly 8.7s, about 70% of the total run.

As you might expect, foo did better than bar. One million invocations of foo, our bar-with-a-compiled-body, took only 6.5s, almost 50% faster than bar. Subtract the method invocation hit and we're looking at around 2.7 seconds, a 60% improvement over bar.

baz: 3.8 seconds
bar: 12.5 seconds, or 8.7 seconds excluding method invocation (12.5 minus baz's 3.8)
foo: 6.5 seconds, or 2.7 seconds excluding method invocation (6.5 minus baz's 3.8)

So there you have it.

If we only get a 60% improvement I would be very pleased. However, given that Newline, DStr, and Str nodes are some of the least interpreter- and processor-intensive nodes in the AST, I'm certain we'll do far better.

9 comments:

  1. You guys are making great progress!

    About generating Java bytecode, have you thought about wrapping your bytecode generation library in Ruby? This is exactly what I do when I generate CIL code for my Ruby <-> CLR bridge. You should download the 3rd drop of RubyCLR (just google for it) and take a look at rbdynamicmethod.rb for some ideas.

  2. That's exactly what I did, probably for the same reasons as you. The ASM library is a very low-level bytecode generator; you physically specify each and every instruction, method, push and pop. Running this all through Java--with its typical code, compile, test cycle--would be a very slow and agonizing process. I chose to do it in Ruby, using JRuby's excellent integration features, to help preserve my sanity.

    I also think it might be useful to abstract the bytecode generation behind a nice Ruby library. It would be possible to create a language-agnostic code generation library with very elegant semantics. Too many projects, too little time!
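
    Purely as a thought experiment, such a wrapper might read something like the following. This is a hypothetical sketch only; ClassBuilder and its methods don't exist anywhere, it's just the flavor of API I have in mind:

    # hypothetical sketch of a Ruby-friendly wrapper over ASM; ClassBuilder and
    # its methods are invented here purely to show the flavor of such an API
    ClassBuilder.build("CompiledBody", :extends => "java/lang/Object") do |c|
      c.public_method("run", :returns => "java/lang/String") do |m|
        m.ldc "bar"       # push the string constant
        m.areturn         # and return it
      end
    end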

  3. In your loop tests, particularly when you're doing the compiled method dispatch, I'm wondering how much of the time is getting burned in the "1_000_000.times". It shouldn't take 2.7 seconds to execute a few bytecodes a million times.

    -Tim Bray

  4. Tim: A very astute observation. The fact is that 1_000_000 is a Fixnum, times is a method, and that method's implementation yields to another piece of code: the block. So the overhead we see from dispatching to foo in the first place is also seen during the actual yield.

    However, I would expect that the baz case should include this dispatch hit. baz is a bare method, but its test is otherwise identical to those of foo and bar. It has the yield and the method dispatch, and everything up to the method's body should be about the same.

    Here's a basic summary of what the compiled bytecodes for foo's body must do (these are the big items; obviously pushes and pops are omitted):

    1. create a StringBuffer
    2. append the "bar" string to it
    3. create a RubyString instance (our internal representation of a Ruby string)
    4. put that RubyString into the interpreter's accumulator

    So there's a little more than meets the eye going on. I loaded "bar" as a constant, but there's still a StringBuffer and a RubyString to be instantiated, plus a number of method calls. One optimization might be to skip the StringBuffer for simple cases like this, but I wanted something very generic for now.
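
    To sketch that shortcut: given an ASM MethodVisitor mv for the method being generated, the simple case could load the literal and hand it straight to the runtime. The local-variable slot and the exact newString descriptor here are assumptions for illustration:

    # hypothetical fragment: for a plain literal, skip the StringBuffer entirely
    mv.visitVarInsn(Opcodes::ALOAD, 1)     # assuming the Ruby runtime sits in local slot 1
    mv.visitLdcInsn("bar")                 # push the literal string constant
    mv.visitMethodInsn(Opcodes::INVOKEVIRTUAL, "org/jruby/Ruby", "newString",
                       "(Ljava/lang/String;)Lorg/jruby/RubyString;")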

    There's more testing to be done...we shall see what shakes loose.

  5. Tim: Your comments spurred me to run another little test, with normal Java code.

    Other than a final placing of the RubyString on the JRuby accumulator, the following code is basically the same as what foo compiles into:

    // roughly what foo's compiled body does, minus placing the result on the accumulator
    Ruby r = (Ruby)Ruby.getDefaultInstance();
    StringBuffer sb = new StringBuffer();
    sb.append("bar");
    r.newString(sb.toString());

    I ran this inside a for loop for a million cycles, and after multiple in-process runs, it was still taking about four seconds to complete. The down side is that there may simply be other overhead in the newString call that we'll need to address, but it doesn't appear directly related to the interpreter or the compilation.

  6. mortench: Please stop by the JRuby home page at jruby.sf.net to find out more about the JRuby project. JRuby is a very mature Ruby implementation, recently able to run many of the killer Ruby apps like IRB and parts of Ruby on Rails. The current effort to compile Ruby code is to be built on top of the existing JRuby interpreter, allowing selective compilation for performance gains. Thank you for your interest!

  7. mortench: Not at all, thank you for your comments! We would love to have more of your ideas and feedback. If you have a chance, please join the JRuby mailing lists...we're very chatty, and there's always some design discussion going on.

  8. Do you need to use StringBuffer() instead of StringBuilder()? What level is concurrency being handled at? -Tim

  9. I considered using StringBuilder, but so far JRuby has stayed compatible with Java 1.4 and up. Using StringBuilder would start to tie it to Java 5, which would mean some existing users could not benefit from the compiler.

    We could consider using more Java 5 features as time goes on, however...since 6 is right around the corner and most shops are now switching to 5. It's a tightrope act knowing when the time is right, but we'll probably only make the move when there's a compelling reason to do so.
