Friday, April 28, 2006

In The Beginning: Early Returns on JRuby Compilation

I'll put a big fat disclaimer here stating that these results are extremely preliminary, and mean practically nothing. I'm just very tired and very excited that after seven hours of work, I've already made some progress...I sorta, kinda compiled some Ruby code to Java bytecodes. Yes, I've been up all night. Yes, I have to work tomorrow. Yes...I wish I could work on this during the day. C'est la vie!

The Contenders

In order to soften any growing excitement you may be feeling, I must first introduce the scripts in question. The script to be compiled is quite a whopper:

def foo; 'bar'; end

Hey now, everything has to have a beginning. Even the above script is not the whole story; the only code actually being compiled are the four AST nodes that make up the method's body: a Newline (to start the line), a DStr (dynamic string), another DStr, and a Str (the actual string), all nested one inside the other.

Now as anti-climactic as that may sound, this is early proof-of-concept work here...I don't intend to write a compiler for all of Ruby at the same time. This was the smallest non-trivial script I thought I could handle in a first attempt. I do not compile the nodes that define the foo method, nor do I compile any dispatches to the foo method. I just compile the body.

In order to have an uncompiled version, we have another identical method:

def bar; 'baz'; end

The bar method functions exactly as the foo method, but we will not compile this one. It will be the control.

Now since only a portion of the AST nodes are being compiled, it seemed appropriate to measure a method that does nothing, so that the cost of dispatching could be isolated, further narrowing the test. For that, we have baz:

def baz; end

By running the same tests against baz, we can get a better idea how much compilation is helping. There are also other interpreter bits and pieces that will skew the results a bit, but early results are early results; we want to know whether it will be worth the effort.

The Tools

For this round, I chose to use ObjectWeb's ASM bytecode generation library. It seemed best suited to the visitor pattern we use to traverse the AST (since it folows the visitor pattern itself), and it's itty bitty.

Now for the fun part: I wrote the compiler in Ruby using JRuby's Java integration support. Some time ago, Tom added the ability to parse a string and get back a reference to the AST within a Ruby script under JRuby. Since we already have visitor interfaces for the interpreter, and since JRuby does such a bangup job of implementing interfaces, I figured I'd do the whole thing in Ruby. Naturally, if we continue down this route, we would hope to eventually compile the compiler and add "self-hosting" to our list of buzzwords. However, I digress.

The basic model was simple: parse foo, retrieve the body node from the AST, convert the body to bytecodes, and replace the body with the newly-compiled . This allows our interpreter to continue doing its normal interpretation, but execute this one method's body natively. During my interpreter redesign last Fall we hoped to make this legerdemain invisible...and it seems we've succeeded thusfar.

The foo method's body was transformed into a class with one method, execute, implementing the same Instruction interface we use in the interpreter. The difference, of course, is that we would now have one bytecode-based instruction rather than four.

It's also worth pointing out that this model does not turn a Ruby class or a Ruby method into an accessible Java class or method. It turns snippits of Ruby code into tidbits of Java code...and that Java code continues to live as part of the interpreter. This is the most digestable model of compilation for Ruby, and has been followed by most other projects of merit. We may be able to take things further in the future (especially with the potential dynamifying of the JVM), but this level of compilation has massive potential right now.

The Test

Given that the scripts are so simple, the test should be equally simple:

1_000_000.times { foo }

...and equivalent tests for bar and baz. These tests are parsed, processed, and timed in the exact same way (other than the compilation step):

require 'java'
require 'jruby'
include_class "org.jruby.Ruby"

my_ruby = Ruby.default_instance

foocode = JRuby.parse("def foo; 'bar'; end")
#compilation done here
footest = JRuby.parse("1_000_000.times {foo}")

my_ruby.eval(foocode)
t = Time.now
my_ruby.eval(footest)
p Time.now - t


And again, equivalent tests for bar and baz. That's all folks! Have a nice weekend!

...

Ok, ok. You'd like to know what the times are. That's fair. I've led you this far, and you'd like some payoff for reading my technojargon gobbledygook.

Naturally I wouldn't be writing this if the results were poor.

The Results

Now did I mention these are very early, very preliminary numbers? Don't go off half-cocked talking about JRuby's new compiler, ok?

The bare method baz demonstrated that a large amount of time is spent dispatching in JRuby; perhaps an inordinate amount of time. We have plans to correct this, and have a few optimizations being tossed about.

The baz method took 3.8s to invoke one million times on this machine (Opteron 2.6GHz, Gentoo Linux 2.6.15). That may not seem bad, but of course it wasn't doing anything. Also, C Ruby does the same million dispatches in about 3 tenths of a second. An order of magnitude worse, we are.

The bar method, which represents the uncompiled foo, clocked in at 12.5s to complete a million calls. Now you start to see the real cost from traversing all those AST nodes. Adding only four nodes to the mix (and the side effects that result, of course) quadrupled the amount of time. Minus the method invocations, the body took roughly 8.7s, about 70% of the total run.

As you might expect, foo did better than bar. One million invocations of foo, our bar-with-a-compiled-body, took only 6.5s, almost 50% faster than bar. Subtract the method invocation hit and we're looking at around 2.7 seconds, a 60% improvement over bar.

baz: 3.8 seconds
bar: 12.5 seconds, or 8.7 seconds excluding method invocation
foo: 6.5 seconds, or 2.7 seconds excluding method invocation

So there you have it.

If we only get a 60% improvement I would be very pleased. However, given that Newline, DStr, and Str nodes are some of the least interpreter- and processor-intensive nodes in the AST, I'm certain we'll do far better.