Thursday, July 19, 2007

Alioth Numbers for JRuby 1.0

Someone pointed out to me the other day that the Alioth "Computer Language Benchmarks Game" (as they call it now) had started to include JRuby 1.0 in the results. And how do we fare? We're slower than Ruby 1.8.6! Hooray!

But not by much. Depending on your definition of "on par", I'd say we're safely in the range of being on par with Ruby 1.8.6 performance, at least for these too-short, too-small measurements.

Alioth benchmarks are regularly panned by language enthusiasts...or at least by enthusiasts on the "slow" end of the benchmarks. In the case of JVM-based language implementations, the problem is the same old excuse: not enough time for the JVM JIT to work its magic. This case isn't all that different, but it's fair to say that Alioth isn't testing straight-line performance in a hot application; it's testing how fast you get from point A to point B without a running start. And in that domain, it provides a reasonable window into implementation performance.

So then, the JRuby 1.0 numbers. You can click the link to see them, but they break down about like this:

Performance:

  • Four tests are equal or faster in JRuby--usually not more than 2x faster, but "4.8x" faster in one case. That fastest one involves "concurrency", and I haven't studied it enough to know whether it's meaningful or not.
  • Eight tests are less than 2x slower in JRuby.
  • The remaining four tests are greater than 2x slower in JRuby.
  • Startup time is considerably worse; it's listed as 305x slower for JRuby (!!!) but it's not a particularly useful ratio. We take a couple seconds to get going compared to Ruby's hundredths. That's life.
Memory:
  • All except one worse in JRuby
  • Most more than 2x worse in JRuby
  • Several more than 10x worse in JRuby
  • Does anyone care?
(Update: Commenters have made it clear that they do care...but of course I was being a little bit glib here :) With every release we continue to do what we can to improve performance AND memory usage, and other than some additional overhead from the JVM itself, our memory usage is pretty reasonable.)

So I guess that's all I've got to point out. We've been working on performance since 1.0, and there are a number of major improvements planned for the 1.1 release. And considering that "beating Ruby performance" wasn't a primary goal for JRuby 1.0, I think our "roughly on par" numbers here are pretty damn good. Granted, Ruby 1.8.x isn't the fastest implementation in the world, but we're pretty happy to have improved performance by an order of magnitude in the past year.

Now, onward to the future!

Wednesday, July 18, 2007

Groovy and JRuby Cooperating?

Can it be true? Of course it can!

Graeme Rocher blogs about a new feature in upcoming Groovy 1.1 beta 3: support for Ruby-style "method missing" hooks!

In heavily metaprogrammed applications, it's very common to define methods on the fly in response to calls made. For example, in Rails ActiveRecord, no "find by" methods are defined out of the box, but you may still call "find_by_x_and_y_and_z" to run your specific query. In order to reduce the overhead of handling these "missing methods" (usually through parsing the called name), it's typical to generate new methods on demand, binding them to the called class. So if you call find_by_name and it doesn't exist, two things happen: the query is generated and run, and a new find_by_name method is dynamically created for future queries.
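
Here's a minimal sketch of that pattern in modern Ruby. This is not ActiveRecord's actual implementation, and run_query is a made-up stand-in for the real query machinery; it just shows the define-on-first-call trick:

class Record
  def self.method_missing(name, *args)
    if name.to_s =~ /\Afind_by_(.+)\z/
      attributes = $1.split("_and_")
      # define a real method so future calls bypass method_missing entirely
      define_singleton_method(name) do |*values|
        run_query(attributes.zip(values))
      end
      send(name, *args)
    else
      super
    end
  end

  # stand-in for the real query generation and execution
  def self.run_query(conditions)
    conditions.map { |attr, value| "#{attr} = #{value.inspect}" }.join(" AND ")
  end
end

Record.find_by_name("Charlie")            # slow path: parses the name, defines find_by_name
Record.find_by_name_and_city("Ola", "MN") # each new pattern is defined on first use

The first call pays the method_missing and name-parsing cost; every call after that is a plain method dispatch.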

After Graeme blogged about the typical way to dynamically create methods in Groovy, I tapped him on IM and we started to talk about how Ruby handles the same use case. We agreed that the overhead of overriding invokeMethod (Groovy's entry point for all dynamic calls) was problematic...it's already got enough to do without adding the method-generation logic. As it turns out, "method missing" was a perfect remedy, and pretty easy to add into Groovy. Graeme's already started work to modify Grails to take advantage of methodMissing and I'm sure it will catch on like wildfire in the rest of Groovydom. And as a result you, the JVM dynamic language community, gain the benefit of a little Groovy/JRuby cross-pollination.

So for all you haters out there who just love to pit language teams against one another I say this:

Nyah.

Sunday, July 15, 2007

The Ever-Evolving JVM

John Rose, lead of the "invokedynamic" effort (Java Specification Request 292), has posted some exciting articles about the future of the JVM and a number of changes potentially coming in the next Java version. Among these are, of course, the dynamic invocation efforts, but the entries also include information on non-local returns from closures, tail calls, and tuple support. I'm excited to have John as a co-worker and to be helping out the invokedynamic effort in my own small way by working on JRuby's compiler. John has also provided guidance on how to make a dynamic language fast on current JVMs, which has informed much of my compiler's design.

Check out his articles:

Longjumps Considered Inexpensive
tail calls in the VM
tuples in the VM

It's going to be a fun year for language developers on the JVM!

More Compiler Strategy: Call Adapters and Stack-based Methods

Compilers are hard. But not so hard as people would have you believe.

I've committed an update that installs a CallAdapter for every compiled call site. CallAdapter is basically a small object that stores the following:

  • method name
  • method index
  • call type (normal, functional, variable)
It also provides overloaded call() implementations for 1, 2, 3, or n arguments, each with or without a block. The basic goal with this class is to provide a call adapter (heh) that makes calling a Ruby method in compiled code as similar to (and as simple as) calling any Java method.
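
Conceptually, a call adapter looks something like the following Ruby sketch. The real class is Java code inside JRuby, and since Ruby can't overload a single call() name, the arity-numbered names and the send-based dispatch here are purely illustrative:

class CallAdapter
  def initialize(method_name, method_index, call_type)
    @method_name  = method_name
    @method_index = method_index   # index into the fast-dispatch table
    @call_type    = call_type      # :normal, :functional, or :variable
  end

  # one entry point per arity, so common calls never box args into an array
  def call1(receiver, a); receiver.send(@method_name, a); end
  def call2(receiver, a, b); receiver.send(@method_name, a, b); end
  def call3(receiver, a, b, c); receiver.send(@method_name, a, b, c); end
  def callN(receiver, args); receiver.send(@method_name, *args); end
end

adapter = CallAdapter.new(:+, 0, :normal) # the index 0 is arbitrary here
adapter.call1(5, 3) # => 8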

The end result is that while compiled class init is a bit larger (needs to load adapters for all call sites), compiled method size has dropped substantially; in compiling bench_method_dispatch.rb, the two main tests went from 4000 and 3500 bytes of code down to 1500 and 1000 bytes (roughly). And simpler code means HotSpot has a better chance to optimize.

Here are the latest numbers for the bench_method_dispatch_only test, which just measures time to call a Ruby-implemented method a bunch of times:
Test interpreted: 100k loops calling self's foo 100 times
2.383000 0.000000 2.383000 ( 2.383000)
2.691000 0.000000 2.691000 ( 2.691000)
1.775000 0.000000 1.775000 ( 1.775000)
1.812000 0.000000 1.812000 ( 1.812000)
1.789000 0.000000 1.789000 ( 1.789000)
1.776000 0.000000 1.776000 ( 1.777000)
1.809000 0.000000 1.809000 ( 1.809000)
1.779000 0.000000 1.779000 ( 1.781000)
1.784000 0.000000 1.784000 ( 1.784000)
1.830000 0.000000 1.830000 ( 1.830000)
And Ruby 1.8.6 for reference:
Test interpreted: 100k loops calling self's foo 100 times
2.160000 0.000000 2.160000 ( 2.188087)
2.220000 0.010000 2.230000 ( 2.237414)
2.230000 0.010000 2.240000 ( 2.248185)
2.180000 0.010000 2.190000 ( 2.218540)
2.240000 0.010000 2.250000 ( 2.259535)
2.220000 0.010000 2.230000 ( 2.241170)
2.150000 0.010000 2.160000 ( 2.178414)
2.240000 0.010000 2.250000 ( 2.259772)
2.260000 0.000000 2.260000 ( 2.285141)
2.230000 0.010000 2.240000 ( 2.252396)
Note that these are JIT numbers rather than fully precompiled numbers, so this is 100% real-world safe. Fully precompiled is just a bit faster, since there's no interpreted step or DefaultMethod wrapper to go through.

I have also made a lot of progress on adapting the compiler to create stack-based methods when possible. Basically, this involved inspecting the code for anything that would require access to local variables from outside the method body: things like eval, closures, etc. (there's an example after the numbers below). At the moment it works well and passes all tests, but I know methods similar to gsub, which modify $~ or $_, are not working right. It's disabled at the moment, pending more work, but here are the method dispatch numbers with stack-based method compilation enabled:
Test interpreted: 100k loops calling self's foo 100 times
1.735000 0.000000 1.735000 ( 1.738000)
1.902000 0.000000 1.902000 ( 1.902000)
1.078000 0.000000 1.078000 ( 1.078000)
1.076000 0.000000 1.076000 ( 1.076000)
1.077000 0.000000 1.077000 ( 1.077000)
1.086000 0.000000 1.086000 ( 1.086000)
1.077000 0.000000 1.077000 ( 1.077000)
1.084000 0.000000 1.084000 ( 1.084000)
1.090000 0.000000 1.090000 ( 1.090000)
1.083000 0.000000 1.083000 ( 1.083000)
It seems very promising work. I hope I'll be able to turn it on soon.
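
To illustrate what the analysis looks for, here are two examples of my own (not from the benchmark suite):

# Eligible for stack-based compilation: no closures, no eval, no access to
# locals from outside the method body, so 'a' and 'b' can live as plain
# Java locals on the JVM stack.
def foo(a, b)
  a + b
end

# Not eligible: the block captures 'sum', so that variable has to live in a
# heap-allocated scope shared between the method and the closure.
def total(list)
  sum = 0
  list.each { |x| sum += x }
  sum
end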

Oh, and for those who always need a fib fix, here's fib with both optimizations turned on:
~ $ jruby -J-server bench_fib_recursive.rb      
1.258000 0.000000 1.258000 ( 1.258000)
0.990000 0.000000 0.990000 ( 0.989000)
0.925000 0.000000 0.925000 ( 0.926000)
0.927000 0.000000 0.927000 ( 0.928000)
0.924000 0.000000 0.924000 ( 0.925000)
0.923000 0.000000 0.923000 ( 0.923000)
0.927000 0.000000 0.927000 ( 0.926000)
0.928000 0.000000 0.928000 ( 0.929000)
And MRI:
~ $ ruby bench_fib_recursive.rb
1.760000 0.010000 1.770000 ( 1.775660)
1.760000 0.010000 1.770000 ( 1.776360)
1.760000 0.000000 1.760000 ( 1.778413)
1.760000 0.010000 1.770000 ( 1.776767)
1.760000 0.010000 1.770000 ( 1.777361)
1.760000 0.000000 1.760000 ( 1.782798)
1.770000 0.010000 1.780000 ( 1.794562)
1.760000 0.010000 1.770000 ( 1.777396)
These numbers went down a bit because the call adapter is currently just generic code, and generic code that calls lots of different methods causes HotSpot to stumble a bit. The next step for the compiler is to generate custom call adapters for each call site that handle arity correctly (avoiding IRubyObject[] all the time) and call directly to the most-likely target methods.

Friday, July 13, 2007

To Keyword Or Not To Keyword

One of the most attractive aspects of Ruby is the fact that it has relatively few sacred keywords. In most cases, things you'd expect to be keywords are actually methods, and you can wrap or hook their behavior and create amazing potential.

One perfect example of this is require. Because require is just a method, you can define your own version that wraps its behavior. This is exactly how RubyGems does its magic...rather than immediately calling the default require, it can modify load paths based on your installed gems, allowing for a dynamically-expanding load path and the pluggability we've all come to know and love.
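
In simplified form (this is not RubyGems' actual code), the wrapping looks like this:

module Kernel
  alias_method :original_require, :require

  def require(path)
    original_require(path)
  rescue LoadError
    # RubyGems would search installed gems here, add the right directories
    # to $LOAD_PATH, and then retry the require; this sketch just re-raises
    raise
  end
end

Because require is an ordinary method, this redefinition is all it takes to hook every require in the system.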

But not all such keyword-like methods are so well behaved. Many methods make runtime changes that are otherwise impossible to do from normal Ruby code. Most of these live on Kernel. I propose that several of these methods should actually be keywords.

Update: Evan Phoenix of Rubinius (EngineYard), Wayne Kelly of Ruby.NET (Queensland University), and John Lam of IronRuby (Microsoft) have voiced their agreement on this interesting ruby-core mailing list thread. Have you shared your thoughts?

Justifying Keywords

There are a number of (in my opinion, very strong) justifications for this:

  1. Many Kernel methods manipulate runtime state in ways no other methods can. For example: local_variables requires access to the caller's variable scope; block_given? requires access to the block/iter stacks (in MRI code); eval requires access to just about everything having to do with a call; and there are others (see below).
  2. Because many of these methods manipulate normally-inaccessible runtime state, it is not possible to implement them in Ruby code. Therefore, even if someone wanted to override them (the primary reason for them to be methods) they could not duplicate their behavior in the overridden version. Overriding only destroys their utility.
  3. These methods are exactly the ones that complicate optimizing Ruby in all implementations, including Ruby 1.9, Rubinius, JRuby, Ruby.NET, and others. They confound a compiler's efforts to optimize calls by always leaving open questions about the behavior of a method. Will it need access to a heap-allocated scope? Will it save off a binding or the current call frame? No way to know for sure, since they're methods.
In short, there appears to be no good reason to keep them as methods, and many reasons to make them keywords. What follows is a short list of such methods and why they ought to be keywords:
  • *eval - requires implicit access to the caller's binding
  • block_given?/iterator? - requires access to block/iter information
  • local_variables - requires access to caller's scope
  • public/private/protected - requires access to current frame's visibility
There may be others, but these are definitely the biggest offenders. The three points above were used to compile this list, but my criteria for a keyword could be reduced to the following more straightforward points. A feature should be implemented as (or converted to) a keyword if it fits either of the following criteria:
  • It manipulates runtime state in ways impossible from user-created code
  • It can't be implemented in user-created code, and therefore could not reasonably be overridden or hooked to provide additional behavior
As an alternative, if modifications could be made to ensure these methods were not overridable, Ruby implementations could safely treat them as keywords; searching for calls to "eval" in a given context would be guaranteed to mean an eval would take place in that context.
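
To make the overriding point concrete, here's a quick demonstration of my own showing why a wrapped eval can't preserve the original behavior: the implicit binding becomes the wrapper's, not the caller's.

module Kernel
  alias_method :original_eval, :eval

  def eval(string, *args)
    original_eval(string, *args) # the implicit binding is now *this* method's scope
  end
end

x = 10
begin
  eval("x + 1") # would have returned 11 before the override
rescue NameError
  puts "the wrapper can't see the caller's x"
end

(Passing an explicit Binding still works, but the common no-binding form is broken by any override.)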

What do we gain from doing all this?

I can at least give a JRuby perspective. I expect others can give their perspectives.

In JRuby, we could greatly optimize method invocations if, for example, we knew we could just use Java's local variables (on Java's stack) rather than always heap-allocating a scoping structure. We could also avoid allocating a frame or binding when they are not needed, just allowing Java's call frame to be "enough" for us. We can already detect if there are closures in a given context, which helps us learn that a heap-allocated scope will be necessary, but we can't safely detect eval, block_given?, etc. As a result of these methods-that-would-be-keywords, we're forced to set up and tear down every method in the most expensive manner.
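
For example (an illustration of mine, not JRuby code), this method looks like it has one humble local, but eval forces the whole scope onto the heap:

# 'x' can't live as a plain Java local: the eval may read or write any local
# in this method at runtime, so the scope must be heap-allocated.
def run(code)
  x = 42
  eval(code)
end

run("x * 2")    # => 84
run("x = 0; x") # eval can even assign to the method's locals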

Other implementations, including Ruby 1.9/2.0 and Rubinius, would probably be able to make similar optimizations if we could calculate ahead of time whether these keyword operations would occur.

For what it's worth, I may go ahead and implement JRuby's compiler to treat these methods as keywords, only falling back on the "method" behavior when we detect in the rest of the system that the keyword has been overridden. But that situation is far from ideal...we'd like to see all implementations adopt this behavior and so benefit equally.

As an example, here's an early demonstration of the performance change in our old friend fib() when we can know ahead of time if any of these keywords are called (fib calls none of them). This example shows the performance today and the performance when we can safely just use Java local variables and scoping constructs. We could additionally omit heap-allocated frames for each call, giving a further boost.

I've included Ruby 1.8.6 to provide a reference value.


Current JRuby:
~ $ jruby -J-server bench_fib_recursive.rb
1.323000 0.000000 1.323000 ( 1.323000)
1.118000 0.000000 1.118000 ( 1.119000)
1.055000 0.000000 1.055000 ( 1.056000)
1.054000 0.000000 1.054000 ( 1.054000)
1.055000 0.000000 1.055000 ( 1.054000)
1.055000 0.000000 1.055000 ( 1.055000)
1.055000 0.000000 1.055000 ( 1.055000)
1.049000 0.000000 1.049000 ( 1.049000)

~ $ jruby -J-server bench_method_dispatch_only.rb
Test interpreted: 100k loops calling self's foo 100 times
3.901000 0.000000 3.901000 ( 3.901000)
4.468000 0.000000 4.468000 ( 4.468000)
2.446000 0.000000 2.446000 ( 2.446000)
2.400000 0.000000 2.400000 ( 2.400000)
2.423000 0.000000 2.423000 ( 2.423000)
2.397000 0.000000 2.397000 ( 2.397000)
2.399000 0.000000 2.399000 ( 2.399000)
2.401000 0.000000 2.401000 ( 2.401000)
2.427000 0.000000 2.427000 ( 2.428000)
2.403000 0.000000 2.403000 ( 2.403000)
Using Java's local variables instead of a heap-allocated scope:
~ $ jruby -J-server bench_fib_recursive.rb
2.360000 0.000000 2.360000 ( 2.360000)
0.818000 0.000000 0.818000 ( 0.818000)
0.775000 0.000000 0.775000 ( 0.775000)
0.773000 0.000000 0.773000 ( 0.773000)
0.799000 0.000000 0.799000 ( 0.799000)
0.771000 0.000000 0.771000 ( 0.771000)
0.776000 0.000000 0.776000 ( 0.776000)
0.770000 0.000000 0.770000 ( 0.769000)

~ $ jruby -J-server bench_method_dispatch_only.rb
Test interpreted: 100k loops calling self's foo 100 times
3.100000 0.000000 3.100000 ( 3.100000)
3.487000 0.000000 3.487000 ( 3.487000)
1.705000 0.000000 1.705000 ( 1.706000)
1.684000 0.000000 1.684000 ( 1.684000)
1.678000 0.000000 1.678000 ( 1.678000)
1.683000 0.000000 1.683000 ( 1.683000)
1.679000 0.000000 1.679000 ( 1.679000)
1.679000 0.000000 1.679000 ( 1.679000)
1.681000 0.000000 1.681000 ( 1.681000)
1.679000 0.000000 1.679000 ( 1.679000)
Ruby 1.8.6:
~ $ ruby bench_fib_recursive.rb     
1.760000 0.010000 1.770000 ( 1.775304)
1.750000 0.000000 1.750000 ( 1.770101)
1.760000 0.010000 1.770000 ( 1.768833)
1.750000 0.010000 1.760000 ( 1.782908)
1.750000 0.010000 1.760000 ( 1.774193)
1.750000 0.000000 1.750000 ( 1.766951)
1.750000 0.010000 1.760000 ( 1.777814)
1.750000 0.010000 1.760000 ( 1.782449)

~ $ ruby bench_method_dispatch_only.rb
Test interpreted: 100k loops calling self's foo 100 times
2.240000 0.000000 2.240000 ( 2.268611)
2.160000 0.010000 2.170000 ( 2.187729)
2.280000 0.010000 2.290000 ( 2.292342)
2.210000 0.010000 2.220000 ( 2.250331)
2.190000 0.010000 2.200000 ( 2.210965)
2.230000 0.000000 2.230000 ( 2.260737)
2.240000 0.010000 2.250000 ( 2.256210)
2.150000 0.010000 2.160000 ( 2.173298)
2.250000 0.010000 2.260000 ( 2.271438)
2.160000 0.000000 2.160000 ( 2.183670)

What do you think? Is it worth it?

Wednesday, July 11, 2007

Finding a JVM compilation strategy for Ruby's dynamic nature

In JRuby, we have a number of things we "decorate" the Java stack with for Ruby execution purposes. Put simply, we pass a bunch of extra context on the call stack for most method calls. At its most descriptive, making a method call passes the following along:

  • a ThreadContext object, for accessing JRuby call frames and variable scopes
  • the receiver object
  • the metaclass for the receiver object
  • the name of the method
  • a numeric index for the method, used for a fast dispatch mechanism
  • an array of arguments to the method
  • the type of call being performed (functional, normal, or variable)
  • any block/closure being passed to the method
Additionally there are a few places where we also pass the calling object, to use for visibility checks.
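
Rendered as a hypothetical Ruby signature (the real one lives in Java, and these parameter names are purely illustrative), a single dynamic call hands all of this off at once:

def call_method(context, receiver, meta_class, name, method_index,
                args, call_type, block)
  # real dispatch happens in Java; this stub exists only to show the payload
end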

The problem arises when compiling Ruby code into Java bytecode. The case I'm looking at involves one of our benchmarks where a local variable is accessed and "to_i" is invoked on it a large number of times:
require 'benchmark'

puts Benchmark.measure {
  a = 5;
  i = 0;
  while i < 1000000
    a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
    a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
    a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
    a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
    a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
    a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
    a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
    a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
    a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
    a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
    i += 1;
  end
}
(that's 100 accesses and calls on each of 1,000,000 loop iterations)

The block being passed to Benchmark.measure gets compiled into its own Java method on the resulting class, called something like closure0. This gets further bound into a CompiledBlock adapter which is what's eventually called when the block gets invoked.

Unfortunately, all the additional context and overhead required in the compiled Ruby code seems to be causing trouble for HotSpot.

In this case, the pieces causing the most trouble are obviously the "a.to_i" bits. I'll break that down.

"a" is a local variable in the same lexical scope, so we go to a local variable in closure0 that holds an array of local variable values.
 aload(SCOPE_INDEX)
ldc([index of a])
aaload
But for Ruby purposes we must also make sure a Java null is replaced with "nil" so we have an actual Ruby object:
dup                  # copy the value for the null check
ifnonnull(ok)        # already a Ruby object? skip the substitution
pop                  # discard the Java null
aload(NIL_INDEX)     # load Ruby nil (stored immediately when the method is invoked)
label(ok)
So every variable access is at least seven bytecodes, since we need to access them from an object that can be shared with contained closures.

Then there's the to_i call. This is where it starts to get a little ugly. to_i is basically a "toInteger" method, and in this case, calling against a Ruby Fixnum, it doesn't do anything but return "self". So it's a no-arg noop for the most part.
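
A quick illustration:

5.to_i            # => 5
5.to_i.equal?(5)  # => true: to_i on a Fixnum returns the receiver itself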

The resulting bytecode to do the call ends up being uncomfortably long:

(assumes we already have the receiver, a Fixnum, on the stack)
dup                                 # dup receiver
invokevirtual "getMetaClass"
invokevirtual "getDispatcher"       # a fast switch-based dispatcher
swap                                # dispatcher under receiver
aload(THREADCONTEXT)
swap                                # threadcontext under receiver
dup                                 # dup receiver again
invokevirtual "getMetaClass"        # for call purposes
ldc(methodName)
ldc(methodIndex)
getstatic(IRubyObject.EMPTY_ARRAY)  # no args
ldc(call type)
getstatic(Block.NULL_BLOCK)         # no closure
invokevirtual "Dispatcher.callMethod..."
So we're looking at roughly 15 operations to do a single no-arg call. If we were processing argument lists, it would obviously be more, especially since all argument lists eventually get stuffed into an IRubyObject[]. Summed up, this means:

100 a.to_i calls * (7 + 15 ops) = 2200 ops

That's 2200 operations to do 100 variable accesses and calls, where in Java code it would be more like 200 ops (aload + invokevirtual). An order of magnitude more work being done.

The closure above when run through my current compiler generates a Java method of something like 4000 bytes. That may not sound like a lot, but it seems to be hitting a limit in HotSpot that prevents it being JITed quickly (or sometimes, at all). And the size and complexity of this closure are certainly reasonable, if not common in Ruby code.

There are a few questions that come out of this, and I'm looking for more ideas too.
  1. How bad is it to be generating large Java methods and how much impact does it have on HotSpot's ability to optimize?
  2. This code obviously isn't optimal (two calls to getMetaClass, for example), but the size of the callMethod signature means even optimal code will still have a lot of argument loading to do. Any ideas on how to get around this in a general way? I'm thinking my only real chance is to find simpler signatures to invoke, such as arity-specific (so there's no requirement for an array of args), avoiding passing context that usually isn't needed (an object knows its metaclass already), and reverting back to a ThreadLocal to get the ThreadContext (though that was a big bottleneck for us before...).
  3. Is the naive approach of breaking methods in two when possible "good enough"?
It should be noted that HotSpot eventually does JIT this code, and once it does, the result is substantially faster than the current general release of Ruby 1.8. But I'm worried about the complexity of the bytecode, and I'm actively looking for ways to simplify it.

Friday, July 06, 2007

Understanding the JVM JIT and helping it along

I must apologize to my readers. I have been remiss in my blogging duties. I will be posting updates on the various events of the past month or so along with updates on JRuby progress and future events very soon. But for now, a technical divergence after a night of hacking.

--

I finally understand what we should be going for in our compiled code, and how we can really kick JRuby into the next level of performance.

The JVM, at least in HotSpot, gets a lot of its performance from its ability to inline code at runtime, and ultimately compile a method plus its inlined calls as a whole down to machine code. The benefit in doing this is the ability to do compiler optimizations across a much larger call path, essentially compiling all the logic for a method and its calls (and possibly their calls, ad infinitum) into a single optimized segment of machine code.

HotSpot is able to do this in two main ways:

  1. If it's obvious there's only ever one implementation of a given signature on a given type hierarchy
  2. If it can determine at runtime that one (or a few) implementations are the only ones ever being called
The first one allows code to be optimized fairly quickly, because HotSpot can discover early on that there's only one implementation. In general, if there's a single implementation of a given signature, it will get inlined pretty quickly.

The second one is trickier. HotSpot tracks the actual types being called against for the various calls, and eventually can come up with a best guess at the method or methods to inline. It also can include a slow path for the rare future cases where the receiver does not match the target types, and it can deoptimize later to back down optimizations when situations change, such as when a new class is loaded into the system.

So in the end, inlining is one of the most powerful optimizations. Unfortunately in JRuby (and most other dynamic language implementations on the JVM), we're making inlining difficult or impossible in the most performance-sensitive areas. I believe this is a large part of our performance woes.

Consider that all method calls against any object must pass through an implementation of IRubyObject.callMethod. There aren't too many callMethod implementations, and actually now there's only one implementation of each specific signature. So callMethod gets inlined pretty fast.

Consider also that almost all method calls within callMethod are to very specific methods and will also be inlined quickly. So callMethod is looking pretty good so far.

Now we look at the last step in callMethod...DynamicMethod.call. DynamicMethod is the top-level type for all our method objects in the system. The call method has numerous implementations, all of them different. And no one implementation stands out as the most frequently called. So we're already complicating matters for HotSpot, even though we know (based on the incoming method name) exactly the piece of code we *want* to call.

Let's continue on, assuming HotSpot is smart enough to work around our half-dozen or so DynamicMethod.call implementations.

DefaultMethod is the DynamicMethod implementation for interpreted Ruby code, so it calls directly into the evaluator. So at that point, DefaultMethod.call will inline the evaluator code and that looks pretty good. But there's also the JIT located in DefaultMethod. It generates a JVM bytecode version of the Ruby code and from then on DefaultMethod calls that. Now that's certainly a good thing on one hand, since we've eliminated the interpreter, but on the other hand we've essentially made it impossible for HotSpot to inline that generated code. Why? Because we generate a Java method for every JITable Ruby method. Hundreds, and eventually thousands, of possible implementations. Making a decision to inline any of them into DefaultMethod.call is basically impossible. We've broken the chain.

To make matters worse, we also have the set of Java-wrapping DynamicMethod implementations, *CallbackMethod (used for binding Java code to Ruby method names) and CompiledMethod (used in AOT-compiled code).

The CallbackMethods all wrap another piece of generated code that implements Callback and calls the Java method in question. So we generate nice little wrappers for all the pre-existing methods we want to call, but we also make it impossible for the *CallbackMethod.call implementations to inline any of those calls. Broken chain again.

CompiledMethod is slightly better in this regard, since there's a new CompiledMethod subclass for every AOT-compiled Ruby method, but we still have a single implementation of DynamicMethod.call that all of those subclasses share. To make matters worse, even if we had separate DynamicMethod.call implementations, that may actually *hurt* our ability to inline code way back in IRubyObject.callMethod, since we've now added N possible DynamicMethod.call implementations to the system. And the chain gets broken even earlier.

So the bottom line here is that in order to continue improving performance, we need to do everything possible to move the call site and the call target closer together. There are a couple standard ways to do it:
  1. Hard-coded special-case code for specific situations, much like YARV does for simple ops (+, -, <, >, etc). In these cases, the compiler would check that the target implements an appropriate type to do a direct call to the operation in question. In Fixnum's case, we'd first confirm it's a RubyFixnum, and then invoke e.g. RubyFixnum.plus directly. That skips all the chain breakage, and allows the compiled code to inline RubyFixnum.plus straight into the call site.
  2. Dynamic generated method adapters that can be swapped out and that learn from previous calls to make direct invocations earlier in the chain. Basically, this would involve preparing call site caches that point at call adapters. Initially, the call adapters would be of some generic type that can use the slow path. But as more and more calls come in, more and more of the call sites would be replaced with specialized implementations that invoke the appropriate target code directly, allowing HotSpot a direct line from call site to call target.
The second version is obviously the ultimate goal, and essentially would mimic what the state-of-the-art JITs do (i.e. this is how HotSpot works under the covers). The first version is easily testable with some simple hackery.
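
To make the second approach concrete, here's a toy monomorphic inline cache sketched in Ruby (the real adapters would be generated Java code; this just shows the idea): the call site remembers the last receiver class and its resolved method, and only falls back to a full lookup when the class changes.

class CallSite
  def initialize(name)
    @name = name
    @cached_class  = nil
    @cached_method = nil
  end

  def call(receiver, *args)
    klass = receiver.class
    unless klass.equal?(@cached_class)            # cache miss: slow path
      @cached_method = klass.instance_method(@name)
      @cached_class  = klass
    end
    @cached_method.bind(receiver).call(*args)     # fast path thereafter
  end
end

site = CallSite.new(:to_s)
site.call(42)  # miss: resolves to_s on the receiver's class and caches it
site.call(99)  # hit: dispatches straight through the cached method

In the JVM version, each specialized adapter gives HotSpot that direct line from call site to call target.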

I created a small patch that includes a trivial, unsafe change to the compiler to make Fixnum#+, Fixnum#-, and Fixnum#< direct calls when possible. They're unsafe because they don't check to see if any of those operations have been overridden...but of course you'd have to be a mad fool to override them anyway.

To demonstrate a bit of the potential performance gains, here are some numbers for JRuby trunk and trunk + patch. Note that Fixnum#+, Fixnum#-, and Fixnum#< are all already STI methods, which does a lot to speed up their invocation (STI uses a table of switch values to bypass dynamic method lookup). But this simple change of compiling direct calls completely blows the STI performance out of the water, and that's without similar direct calls to the fib_ruby method itself.

test/bench/bench_fib_recursive.rb

JRuby trunk without patch:
1.675000 0.000000 1.675000 ( 1.675000)
1.244000 0.000000 1.244000 ( 1.244000)
1.183000 0.000000 1.183000 ( 1.183000)
1.173000 0.000000 1.173000 ( 1.173000)
1.171000 0.000000 1.171000 ( 1.170000)
1.178000 0.000000 1.178000 ( 1.178000)
1.170000 0.000000 1.170000 ( 1.170000)
1.169000 0.000000 1.169000 ( 1.169000)

JRuby trunk with patch:
1.133000 0.000000 1.133000 ( 1.133000)
0.922000 0.000000 0.922000 ( 0.922000)
0.865000 0.000000 0.865000 ( 0.865000)
0.862000 0.000000 0.862000 ( 0.863000)
0.859000 0.000000 0.859000 ( 0.859000)
0.859000 0.000000 0.859000 ( 0.859000)
0.864000 0.000000 0.864000 ( 0.863000)
0.859000 0.000000 0.859000 ( 0.860000)

Ruby 1.8.6:
1.750000 0.010000 1.760000 ( 1.760206)
1.760000 0.000000 1.760000 ( 1.764561)
1.760000 0.000000 1.760000 ( 1.762009)
1.750000 0.010000 1.760000 ( 1.760286)
1.760000 0.000000 1.760000 ( 1.759367)
1.750000 0.000000 1.750000 ( 1.761763)
1.760000 0.010000 1.770000 ( 1.798113)
1.760000 0.000000 1.760000 ( 1.760355)

That's an improvement of over 25%, with about 20 lines of code. It would be even higher with a dynamic adapter for the fib_ruby call. And we can take this further...modify our Java integration code to do direct calls to Java types, modify compiled code to adapt to methods as they are redefined or added to the system, and so on and so forth. There's a ton of potential here.

I will continue working along this path.