Sunday, July 23, 2006

The Fastest Ruby Platform? or Hey, Isn't Java Supposed to be Slow?

I love stirring up trouble. In working on the compiler for JRuby, it's become apparent that a few targetted areas could yield tremendous performance benefit even in pure-interpreted mode. I describe a bit of this evening's study here.

As an experiment, I cut out a bunch of the stuff in our ThreadContext that caused a lot of overhead for Java-implemented methods. This isn't a safe thing to do for general cases, since most of these items are important, but I wanted to see what bare invocation might do for speed by reducing some of these areas to theoretical "zero cost".

Cut from per-method overhead:
- block/iter stack manipulation
- rubyclass stack manipulation

Just two pieces of our overall overhead, but two reasonably expensive pieces not needed for fib().

Explanation and wild theories follow the numbers below.

Recall that the original compiled fib was only twice as fast as the interpreted version. The new numbers put it closer to 2/3 faster:

Time for bi-recursive, interpreted: 18.37
Time for bi-recursive, compiled: 6.6160000000000005
Time for bi-recursive, interpreted: 17.837
Time for bi-recursive, compiled: 6.878
Time for iterative, interpreted: 25.222
Time for iterative, compiled: 24.885

So with the unnecessary overhead removed (simulating zero-cost for those bits of the method call process) we're down to mid 6-seconds for fib(30). The iterative version is calculating fib(500000), but I'll come back to that.

Now consider Ruby's own numbers for both of these same tests:

Time for bi-recursive, interpreted: 2.001974
Time for iterative, interpreted: 9.015137

Hmm, now we're not looking so bad anymore, are we? For the recursive version, we're only about 3.5x slower with the reduced overhead. For iterative, only about 2.5x slower. So there's a few other things to consider:

- Our benchmark still creates and pushes/pops a frame per invocation
- Our benchmark still has fairly costly overhead for method arguments, both on the caller and callee sides (think multiple arrays allocated and copied on both ends)
- Our benchmark is still using reflected invocation

Yes, the bits I removed simulate zero cost, which we'll never achieve. However, if we assume we'll get close (or at least minimize overhead for cases like this where those removed bits are obviously not needed), these numbers are not unreasonable. If we further assume we can trim more time off each call by simplifying and speeding up argument/param processing, we're even better. If we eliminate framing or reduce its cost in any way, we're better still. However, the third item above is perhaps the most compelling.

You should all have seen my microbenchmarks for reflected versus direct invocation. Even in the worst comparison, direct invocation (via INVOKEINTERFACE) took at most 1/8 as much time as reflected. The above fib invocation and all the methods it calls are currently using reflected invocation, just like most stuff in JRuby.

So what does performance hypothetically look like for 6.5s times 1/8? How does around 0.8s sound? A full 50-60% faster than C Ruby! What about for iterative...24s / 8 = 3s, a solid 66% boost over C Ruby again. Add in the fact that we're missing a bunch of optimizations, and things are looking pretty damn good. Granted, the actual cost of invoking all those reflected methods is somewhat less than the total, but it's very promising. Even if we assume that the cost of the unoptimized bits is 50% of the total time, leaving 50% cost for reflection, we'd basically be on par with C Ruby.

It's also interesting to note that the interpreted and compiled times for the iterative version are almost identical. Interpretation is expensive for many things, but not for a simple while loop. The iterative version's code is below:

def fib_iter_ruby(n)
i = 0
j = 1
cur = 1
while cur <= n
k = i
i = j
j = k + j
cur = cur + 1
end
i
end

This is a good example of code that's very light on interpretation. While loops in Ruby and JRuby boil down in both cases to little more than while loops in the underlying language, interpretation or not. The expense of this method is almost entirely in two areas: variable assignment and method calls, neither of which are sped up by compilation. The similarity of the compiled and interpreted numbers for this iterative algorithm show one thing extremely clearly: our method call overhead really, really stinks. It is here we should focus all our efforts in the short term.

Given these new numbers and the fact that we have many optimizations left to do, I think it's now very reasonable to say we could beat C Ruby performance by the end of the year.

Side Note: The compiler work has gone very well, and now supports several types of variables and while loops. This compiler is mostly educational, since it is heavily dependent on the current ThreadContext semantics and primitives. As we attack the call pipeline, the current compiler will break and be left behind, but it has certainly paved the way, showing us what we need to do to make JRuby the fastest Ruby interpreter available.