Saturday, May 03, 2008

The Power of the JVM

In the past couple of days, a new project release was announced that has once again shown the potential of the Java platform. Shown how the awesome JVM has not yet begun to flex its muscles and really hit its stride in this project's domain. Made clear that even projects with serious issues can correct them, harnessing much more of the JVM with only a modest amount of rework. And demonstrated there's a lot more around the corner.

That project wasn't JRuby this time. It was Groovy.

Groovy's Problem

Groovy 1.6 beta 1 was released a couple of days ago. This release focused largely on performance, rather than on fixing bugs and adding features as the 1.5 series did. You see, by 1.5 Groovy had become basically feature-complete and was starting to hit its stride. Most of the capabilities the team wanted were in the language and working. Groovy's oft-touted Java integration had caught up to most Java 5 features. And Grails had recently shipped its 1.0 release, so there was finally a framework that could show Groovy at its best. But there was a problem: Groovy was still slow, one of the slowest languages on the JVM.

This doesn't make a lot of sense, especially compared to a language like JRuby, which has a more complicated feature set to support. JRuby's performance regularly exceeded Groovy's, even though several Ruby features require us, for example, to allocate a synthetic call frame for *every* Ruby method invocation and most block invocations. And JRuby had only received serious performance work for about a year and a half. The problem was not that Groovy is an inherently slow language...the problem was the huge amount of code every call had to pass through to reach its target. Groovy's call path was fat.

A few months back I measured the number of stack frames between a call and the actual receiver code in Groovy and JRuby. JRuby, which had received a lot of work to shorten and simplify that call path, put only about four frames between calls. Groovy, on the other hand, took nearly fifteen. Some of those frames were due to Groovy still using Java reflection to hold "method objects", but the majority were Groovy internals: calls had to dig through several layers of dispatch logic before they reached a reflected method object, and then through a few more layers before the target method actually executed. Oh, and next time you call that method? Start over from scratch.
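
If you want a feel for the difference yourself, here's a quick-and-dirty harness (purely illustrative; not the rig I actually measured with) that counts how many frames a direct call and a reflected call put between the caller and the target:

    import java.lang.reflect.Method;

    public class FrameDepth {
        static int depthAtTarget;

        public static void target() {
            // record how deep the stack is when the target finally runs
            depthAtTarget = new Exception().getStackTrace().length;
        }

        public static void main(String[] args) throws Exception {
            int here = new Exception().getStackTrace().length;

            target();                                    // direct invocation
            System.out.println("direct:    " + (depthAtTarget - here));

            Method m = FrameDepth.class.getMethod("target");
            m.invoke(null);                              // reflected invocation
            System.out.println("reflected: " + (depthAtTarget - here));
        }
    }

On a typical JVM the reflected call reports several extra frames (the Method.invoke plumbing); a dynamic language then stacks its own dispatch layers on top of that.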

A Standard Solution

Early in the JRuby 1.1 dev cycle, we shortened the call path in two ways:

  • Rather than use reflection for the core Ruby classes' methods, we generate small stub methods ("method handles") that invoke the target directly for us. This avoids all the argument boxing and overhead of reflection entirely. It's only applicable to the core classes, but a very high percentage of any JRuby app--even one that calls Java classes--depends on the core classes being fast. So it made a big difference.
  • When compiling Ruby code to Java bytecode, we employ what's called a call site cache: a tiny slot in the calling method where the previously looked-up method handle can be stored. When we return to that call site, if we're invoking against the same class as before and that class's methods haven't changed, we can skip the lookup. That drastically reduces the overhead of making dynamic calls, since most of the time we don't have to start over (see the sketch below).

It is the call site mechanism that gave us our largest performance boost back in November (though I blogged a bit about the technique way back in June and July of 2007...boy was I naïve back then!).
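
To make that concrete, here's a minimal sketch of both ideas in plain Java. Every name here (DynObject, DynClass, DynMethod, CachingCallSite) is made up for illustration...JRuby's real classes are hairier, but the shape is the same:

    import java.util.HashMap;
    import java.util.Map;

    // The "method handle" role: a generated stub implements this and calls
    // its target directly, e.g. ((MyString) self).upcase(), with no
    // reflection and no argument boxing.
    interface DynMethod {
        Object call(DynObject self, Object[] args);
    }

    // A dynamic language's class object, mapping method names to handles.
    class DynClass {
        final Map<String, DynMethod> methods = new HashMap<String, DynMethod>();
        DynMethod search(String name) { return methods.get(name); }
    }

    class DynObject {
        final DynClass metaClass;
        DynObject(DynClass metaClass) { this.metaClass = metaClass; }
    }

    // A monomorphic call site cache: one slot remembering the last
    // receiver class and the handle we found for it.
    class CachingCallSite {
        private final String name;
        private DynClass cachedClass;
        private DynMethod cachedMethod;

        CachingCallSite(String name) { this.name = name; }

        Object call(DynObject self, Object... args) {
            DynClass klass = self.metaClass;
            if (klass != cachedClass) {           // miss: slow lookup, then cache
                cachedMethod = klass.search(name);
                cachedClass = klass;
            }
            return cachedMethod.call(self, args); // hit: straight to the stub
        }
    }

A real implementation also has to invalidate the cache when methods are added or removed at runtime (JRuby checks a per-class serial number for this), but the fast path is exactly that one pointer comparison.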

It's certainly not a new technique. There are scads of papers out there (some really old) about how to build call site caches, either monomorphic (like JRuby's and Groovy's) or polymorphic (like those inside most high-performance VMs). But until we put them in place in JRuby, they weren't commonly used for languages built on top of the JVM. That's all changing...now Groovy 1.6 has the same optimizations in place.
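
For contrast, a polymorphic cache simply keeps a few entries instead of one. A toy version, reusing the hypothetical types from the sketch above (real polymorphic inline caches in optimizing VMs are generated code rather than a loop, but the idea is the same):

    // A tiny polymorphic inline cache: up to four receiver classes are
    // remembered before we give up and fall back to a full lookup.
    class PolymorphicCallSite {
        private final String name;
        private final DynClass[] classes = new DynClass[4];
        private final DynMethod[] methods = new DynMethod[4];

        PolymorphicCallSite(String name) { this.name = name; }

        Object call(DynObject self, Object... args) {
            DynClass klass = self.metaClass;
            for (int i = 0; i < classes.length; i++) {
                if (classes[i] == klass) {
                    return methods[i].call(self, args);  // cache hit
                }
                if (classes[i] == null) {                // free slot: fill it
                    classes[i] = klass;
                    methods[i] = klass.search(name);
                    return methods[i].call(self, args);
                }
            }
            // megamorphic: more classes than slots, do the slow lookup
            return klass.search(name).call(self, args);
        }
    }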

What's the result? A tremendous improvement in performance, similar to what we saw in JRuby last fall. According to Guillaume Laforge, Groovy project lead, the boost on the Alioth benchmarks ranges anywhere from 150% to 560%. And the latest Benchmarks Game results bear it out: Groovy 1.6 has improved drastically, and even surpasses JRuby on most of those benchmarks. And while JRuby and Groovy will probably spend the next few months one-upping each other, we've both proven something far more important: the JVM is an *excellent* platform for dynamic languages. Don't let anyone tell you it's not.

Why It Works

The reason call site optimizations work so well for both JRuby and Groovy is twofold.

First, eliminating all that extra dispatch logic whenever possible reduces overhead and speeds up method calls. That's a no-brainer, and any dynamic language can get that boost with the simplest of caches.

But it's the second reason that not only shows the benefit of running on the JVM but gives us a direction to take the JVM in the future. Call site optimizations allow the JVM to actually inline dynamic invocations into the calling method.

The JVM is basically a dynamic language runtime. Because instance method calls in Java are virtual by default (meaning subclass methods with the same name and parameters override parent class methods), and because new code can be loaded into the system at any time, the JVM must deal with nearly-dynamic call paths all the time. To make this perform, the JVM always runs code through an interpreter for a short time, very much like JRuby does. While interpreting, it gathers information about the calls being made, 'try' blocks that immediately wrap throws, null checks that never fail, and so on. When it finally decides to JIT that bytecode into native machine code, it makes a bunch of guesses based on that profile: methods can be inlined, throws can be turned into jumps, null checks can be eliminated (with appropriate guards elsewhere)...on and on the list of optimizations goes (and I've heard from JVM engineers that they've only started to scratch the surface).

This is where the call site optimizations get their second boost. Because JRuby's and Groovy's call sites now move the target of the invocation much closer to the site where it's being invoked, the JVM can actually inline a dynamic call right into the calling method. Or in Groovy's case, it can inline much of the reflected call path, maybe right up to the actual target. So because Groovy has now added the same call site optimization we use in JRuby, it gets a double boost from both eliminating the dispatch overhead and making it easier for the JVM to optimize.

Of course there's a catch. Even if you call a given method on type A a thousand times, somewhere down the road you may get passed an instance of type B that extends A and overrides its methods. What happens if you've already inlined A's method when B comes along? Here again the JVM shines. Because the JVM is essentially a dynamic language runtime under the covers, it remains ever-vigilant, watching for exactly these sorts of events. And here's the really cool part: when the situation changes, the JVM can deoptimize.

This is a crucial detail. Many other runtimes can only do their optimization once. C compilers must do it all ahead of time, during the build. Some allow you to profile your application and feed that into subsequent builds, but once you've released a piece of code it's essentially as optimized as it will ever get. Other VM-like systems like the CLR do have a JIT phase, but it happens early in execution (maybe before the system even starts executing) and doesn't ever happen again. The JVM's ability to deoptimize and return to interpretation gives it room to be optimistic...room to make ambitious guesses and gracefully fall back to a safe state, to try again later.
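
You can watch this happen with nothing but HotSpot's standard flags. Here's a small, self-contained experiment (the class names are hypothetical, of course): run it with -XX:+PrintCompilation and you should see the hot method get compiled during warmup, then marked "made not entrant" once an instance of B shows up:

    public class DeoptDemo {
        static class A { int value() { return 1; } }
        static class B extends A { int value() { return 2; } }

        // While only A instances are ever seen here, HotSpot can
        // devirtualize and inline A.value() right into this loop.
        static long sum(A[] items) {
            long total = 0;
            for (int i = 0; i < items.length; i++) total += items[i].value();
            return total;
        }

        public static void main(String[] args) {
            A[] onlyAs = new A[1000];
            for (int i = 0; i < onlyAs.length; i++) onlyAs[i] = new A();
            for (int i = 0; i < 20000; i++) sum(onlyAs);  // warm up: monomorphic

            A[] mixed = onlyAs.clone();
            mixed[0] = new B();             // loading B invalidates the guess...
            System.out.println(sum(mixed)); // ...the JVM deoptimizes, recompiles
        }
    }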

Only The Beginning

So where do we go from here? Well, ask me or the Groovy guys about putting these optimizations in place and we'll tell you the same thing: it's hard. Maybe too hard, but I managed to do it and I don't really know anything. It took the Groovy guys quite a while too. At any rate, it's not easy enough, and because we have to wire it all together by hand (meaning we can only present a finite set of call paths), we're still not giving the JVM enough opportunity to optimize. Sure, we'll all continue to improve what we have for existing JVMs, and our performance will get better and better (probably a lot better than it is now). But we're also looking to the future. And the future holds another key to making the JVM an even better dynamic language runtime: JSR-292.

JSR-292 is better known as the "invokedynamic" JSR. The original idea behind 292 was that a new bytecode could be added to the JVM for invoking methods dynamically against a target object, without knowing the type of the object or the signature of the target method at compile time. And though that sounds like it might be useful, it turns out to be worthless in practice. Most dynamic languages don't even use standard Java class structures to represent types, so invokedynamic against a target object wouldn't accomplish anything: the methods don't live there. And it turns out there's a political side too: getting a new bytecode added to the JVM is *super hard*. So we needed a better way.

John Rose is in charge of the HotSpot optimizing compiler (the "server" compiler) at the heart of Sun's JVM. HotSpot is an amazing piece of software...it does all the optimizations I listed above plus hundreds of others that may or may not make your ears bleed. It has two different JIT compilers for different needs (soon to be merged into a single three-stage optimization pipeline), probably half a dozen different garbage collectors (a few weeks ago I met a guy in charge of one generation of one collector...crazy), and probably a thousand tweakable execution and optimization flags. It can make most Java run as fast as equivalent C++, even while the HotSpot engineers recommend you "just write normal code". In short, HotSpot has balls of steel.

John took over JSR-292 about this time last year. Not much work had been done on it, and it looked like it was moving toward a dead-end; most of the dynamic language projects agreed it wouldn't help them. Around that time, it was becoming apparent that JRuby would be able to make Ruby run really well (aka "fast") on the JVM, but it was taking a lot of work to do it. Tom and I talked with John a few times about strategies, many of which we've put in place over the past year, and they were all rather tricky to implement. Largely, they moved toward making the call path as fast as possible, by both shortening it and making the number and type of parameters match the target all the way through.

In order to reduce this workload for language implementers, John has been working on several features leading up to "invokedynamic". Here's a rough overview of how they will fit together.
  1. The first feature is already working in John's multi-language VM "Da Vinci Machine" project: anonymous classloading. JRuby first improved invocation performance by avoiding reflection and generating little wrapper classes, but those classes incur a very high cost. Each one has to be generated, classloaded, named, stored, and eventually dereferenced and garbage-collected independently. You can't do that with a single class or a single classloader, so we had a class per method and a classloader per class. That's a crapload of memory used just to get around the JVM's bent toward plain old Java types. Anonymous classloading aims to eliminate that overhead in two ways: first, it will not require hard references or names for these tiny loaded classes, allowing them to be garbage-collected easily when the code is no longer in use; and second, it will allow you to generate a template class once, then create duplicates of it with only small constant pool changes. Lost? Keep up with me...it leads into the next one.
  2. The second feature John hopes to have done real soon now: lightweight method handles. Method handles are essentially like java.lang.reflect.Method objects, except that they exactly represent the target method's parameter list and they take up far less memory...about 1/10 that of Method, by John's estimate (see the sketch after this list for what invoking one looks like). Here's where anonymous classloading comes in. Because all methods with a given signature can be invoked by basically the same code, we only need to generate that handle code once. To support the broad range of classes and method names we'll want to invoke with that handle, we just patch the handle's constant pool. It's like saying "now I want a handle that invokes the same way, but against the 'bar' method in type B". Ahh, now anonymous classloading starts to make sense: one copy of the code, with several patched instances. It makes me giddy just to think about it, because of how it would help JRuby. Because all our core class methods just accept IRubyObject as arguments, we'd have to generate exactly ten primary handles instead of the thousand or more we generate now. And that means we can get even more specific.
  3. Method handles feed into the big daddy itself: dynamic invocation. Because handles are so close to the metal, and because the JVM understands what the hell they are (rather than having to perform lots of nasty tricks to optimize reflection), we can start to feed handles straight back into the JVM's optimization logic. So once we present our dynamic types to the JVM's dynamic lookup logic, we simply have to toss it method handles. And because the JVM can now connect the caller with the callee using standard mechanisms, our hand-rolled call site optimizations get chucked in the bin. The JVM can treat our dynamic call like any other virtual call. All we need to do is provide the trigger that tells the JVM the old handle is no longer correct, and it will come back for a new one. And we get to delete half the JRuby codebase that deals with making dynamic invocation fast. WOW.
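
To give a flavor of where this is headed, here's roughly what invoking through a method handle looks like. A caveat: the API is still in flux under the Da Vinci Machine project as I write this; the names below follow the java.lang.invoke forms JSR-292 eventually settled on for JDK7, so treat the specifics as illustrative:

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    public class HandleDemo {
        public static void main(String[] args) throws Throwable {
            MethodHandles.Lookup lookup = MethodHandles.lookup();

            // A handle is typed by the exact signature it invokes: here, a
            // virtual method on String taking nothing and returning String.
            MethodHandle upcase = lookup.findVirtual(
                    String.class, "toUpperCase",
                    MethodType.methodType(String.class));

            // invokeExact is as close to a direct call as you can get:
            // no Object[] boxing, no per-call access checks, and the JVM
            // can inline straight through it.
            String result = (String) upcase.invokeExact("jruby");
            System.out.println(result); // JRUBY
        }
    }

A language runtime would hand handles like this to the JVM's dynamic invocation machinery instead of maintaining all of its own call site plumbing.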
Of course this is not there yet and won't be until JDK7 (fingers crossed!). We want to continue to support pre-JDK7 JVMs with languages like JRuby and Groovy, so an important component of this work will be backported libraries to do it "as well as possible" without the above features. That work will probably grow out of JRuby, Groovy, Jython, Rhino, and any other dynamic JVM languages, since we're the primary consumers right now and we're making it happen today. But I'll tell you, friends...you don't know what you've been missing on the JVM. Groovy's performance improvement from simply adding call site caches amazes me, even though we received the same boost in JRuby last year. The techniques we're both planning for our next versions will keep performance steadily increasing. And we've got invokedynamic right around the corner to really take us the last mile.

The future is definitely looking awesome for dynamic languages on the JVM. And languages like Groovy and JRuby are proving it.