Wednesday, December 27, 2006

Holiday Fun: Interpretation, Loading, Dispatching Work

This week (what's left of it) I'm spending on performance again. It's officially a holiday for Sun, so I'm not technically on the clock. I could even sleep the entire week and pick things back up on Tuesday.

Yeah, right.

One of the oddities of working on JRuby while working at Sun stems from the fact that I was working on JRuby for fun for the past two years. Now that this is the full-time job, it's even harder to pull myself away from it. I thought perhaps making JRuby my job might take away some of the attraction. In reality, it's just made a good thing better, since I can spend long hours on the hard problems I would never have tackled before.

So instead of stopping work completely, I'm shifting gears a bit. Instead of the heavy Rails focus we've had over the past month, I'm hitting other "fun" stuff instead. Compilation, interpretation performance, and so on. We know there's still lots of fruit to be plucked from the performance tree, and of course any additional work/research on compilation will help that eventual goal.

Compiler Refactoring Underway

As far as compilation goes, I've started committing a refactored compiler; it separates AST-node-walking from code generation, so the backend could be swapped out with a YARV generator or future compiler revisions. I think that's needed to allow the compiler and the AST to evolve independently, since I believe there will be many possible compile targets and potentially many AST changes in the future. Nothing too fancy there, but it's a bit more readable so hopefully others can also contribute.

Interpreter Enhancements and Fixes

On the interpretation front, there are a few items I've been working on.

1. Speeding up method invocation and block management (committed)

Ruby's AST represents method calls that take a block by putting the block first in the AST. You encounter an "iter node" that points at the "call" with which it is associated. So the typical way you evaluate those nodes is to create the block object and then visit the call. Unfortunately, the call itself may lead to other calls with their own blocks, for evaluating the arguments or receiver.

The end result of this node ordering is that every time a method is called, the evaluation of its args and receiver has to juggle the current "block available" status around, so the block doesn't get consumed before it's needed. Because the block gets created early, we have to push it and the "block available" status onto a stack. This is the primary reason the block logic in the interpreter and ThreadContext is so complicated.

An example would help illustrate what's happening. For the following code:

foo("hello") { 1 }
The parser generates the following hierarchy of AST nodes:
IterNode[]        <= this is the block
NewlineNode[]
FixnumNode[] 1 <= this is the Fixnum 1 inside the block
FCallNode[] foo <= this is the call to foo
ArrayNode[]: {StrNode[]}
StrNode[]"hello"
The node associated with the block is encountered first, so we construct the block then. We move on to the "foo" call, and the block is consumed. No problem, right? However, here's a more complicated example that illustrates the trouble with this AST ordering:
foo { 1 }.bar { 2 }.baz(hello { 3 }) { 4 }
Here things are a bit more interesting. The parser produces the following output:
IterNode[]            <= the "4" block
NewlineNode[]
FixnumNode[] 4
CallNode[] baz
IterNode[] <= the "2" block
NewlineNode[]
FixnumNode[] 2
CallNode[] bar
IterNode[] <= the "1" block
NewlineNode[]
FixnumNode[] 1
FCallNode[] foo
ArrayNode[]: {IterNode[]}
IterNode[] <= the "3" block
NewlineNode[]
FixnumNode[] 3
FCallNode[] hello
So what actually happens here? First the "4" block is encountered, instantiated, and pushed onto the block stack. Then we proceed to the "baz" call. Unfortunately, the "baz" call has both a receiver and arguments, so we have to hide the "4" block to prevent it being consumed. We proceed to evaluate the receiver for "baz", encountering the "2" block. The "2" block is instantiated and pushed down, and we move on to the "bar" call. The "bar" call has another receiver; we evaluate that, encountering the "1" block and the "foo" call it's associated with. The "foo" call consumes the "1" block and returns a receiver for "bar". The "bar" call consumes the "2" block and returns a receiver for "baz". Now the "baz" call has to evaluate its arguments, so the "3" block is created and consumed by the "hello" call. Finally, with a receiver and args, we can call "baz" and consume the "4" block.

Confused? Me too. Why would you order it this way? Perhaps it's to ease parsing, or perhaps there's some reason I don't know. However, I believe the following ordering is much simpler (and I know it makes interpretation easier):

(extraneous nodes omitted)
CallNode[] baz
CallNode[] bar <= the receiver for "baz", a call to "bar"
FCallNode[] foo <= the receiver for "bar", a call to "foo"
IterNode[] <= the "1" block associated with "foo"
IterNode[] <= the "2" block associated with "bar"
ArrayNode[] <= args to "baz"
FCallNode[] hello <= the call to "hello"
IterNode[] <= ...and its "3" block
IterNode[] <= finally the "4" block
The advantages here should be obvious. We encounter the blocks in order, so there's no stack juggling involved. Because there's no stack juggling, we don't have to "hide" blocks as we evaluate receivers and arguments. Finally, because we know we'll only encounter blocks for methods that require them, there's no additional overhead for methods that don't need blocks. It's a good change, and I would love to understand why Ruby uses the more complicated AST structure instead of this.

I have already committed a change to reorder the way these AST nodes are handled. The AST itself is unchanged, but the visit to a given block (IterNode) just sets that node into the associated call and proceeds. The calls themselves are now responsible for creating an associated block (if necessary)...*after* receiver and args have been dealt with. This means two things: calls that don't accept blocks don't pay any block-manipulation penalty; and calls with peripheral blocks (for finding args or receivers) don't pay any block-manipulation penalty either. Only the calls that need blocks have to deal with them.

Eventually this will become an AST change, but this short-term fix resolves 90% of the interpreter goofiness right now. This change will also eventually mean the iter stack (and potentially the block stack) disappear too. Huzzah!

2. LoadService fixes, enhancements, and optimizations (committed)

LoadService cleanup and improvements are well under way. LoadService is responsible for "load" and "require" calls and does all the searching for files and management of loaded extensions and libraries. Unfortunately the existing heuristic was both broken and terribly inefficient.

Ruby's 'load' behavior is easy enough...just look for the exact file and execute it in the current runtime. Ruby's 'require' however has a bit more magic to it.

If you specify a full filename to 'require' it will use that filename to load either a source file or an extension, depending on whether you specify ".rb" or ".[so|o|dll|etc..]". If you do not specify an extension, it will search for .rb, .so, etc in turn until it finds something, If it finds nothing, that's a load error. Here's the "ri" doc for MRI's "require":
--------------------------------------------------------- Kernel#require
require(string) => true or false
------------------------------------------------------------------------
Ruby tries to load the library named _string_, returning +true+ if
successful. If the filename does not resolve to an absolute path,
it will be searched for in the directories listed in +$:+. If the
file has the extension ``.rb'', it is loaded as a source file; if
the extension is ``.so'', ``.o'', or ``.dll'', or whatever the
default shared library extension is on the current platform, Ruby
loads the shared library as a Ruby extension. Otherwise, Ruby tries
adding ``.rb'', ``.so'', and so on to the name. The name of the
loaded feature is added to the array in +$"+. A feature will not be
loaded if it's name already appears in +$"+. However, the file name
is not converted to an absolute path, so that ``+require
'a';require './a'+'' will load +a.rb+ twice.
There's also the issue of what paths to search. Ruby searches the current directory first, followed by custom load paths, site_ruby dirs, and ruby/1.8 dirs. This allows a number of mechanisms for overriding more general locations with more specific ones for particular uses.

JRuby adds a new wrinkle here: classloader resources. JRuby supports JARing up source files and loading them through Java's classloader mechanisms. This is how the JRuby applet and the "complete" JRuby JAR work: they simply include all Ruby source into the archive. So then for us the classloader/classpath represents an additional path to be searched.

The primary problem with JRuby's load heuristic is that it ended up searching the classloader far too frequently; and in many cases it searched it multiple times for files that could not exist, such as for complete absolute paths (starting with '/', which *never* works for classloader resources). A second problem with the heuristic is that it would try all locations for all extensions, so for example it would search for xxx.rb everywhere possible, then xxx.rb.ast.ser (our serialized AST format) everywhere possible, then xxx.so (used internally for extensions...though that may change), and so on. The result of this is that any extensions or serialized scripts went through the full monty of searches before being found on a second or third pass.

These two issues were compounded when running JRuby with a very large classpath. Because classloader resources can be expensive to search, our loading became linearly slower in relation to the number of JARs (or perhaps the size of jars) included into the JVM. More JARs, slower classloader resource searching, slower startup.

The fix for these issues is twofold: search a given load location for all filename extensions before moving on, and only search the classloader as a last resort for filenames likely to be found there. It's fairly simple to explain, but the LoadService code had been endlessly hacked and rejiggered over the years. The cleanup was 90% of the battle; the fixes were considerably easier.

A final issue with LoadService, which is now fixed, was that it allowed require to include files with no extensions at all. This is not correct Ruby behavior, and so now those files can't be loaded. This actually caused a bug a long time ago where the extension-free "rake" startup script was being loaded before the "rake.rb" library file it tried to locate. The result was an eventual stack overflow as the file tried to continually load itself. Goofy behavior we should never see again.

3. Speeding up dynamic dispatch

I'm also doing some experimental work to speed up dynamic dispatch. Currently, methods are located by name, looking up a callable object out of a big hash on a per-class basis. This works reasonably well, and there are caches to speed the process, but with all the recent performance work this search has started to become the new bottleneck.

The eventual callable objects looked up have another flaw: they make Hotspot optimization more difficult. Because they're behind an ICallable interface, because they have multiple levels of logic as well as pre/post-call setup and teardown, and because there are many different implementations of ICallable, they end up slowing execution down significantly.

Hotspot is really an amazing piece of work. For the vast majority of Java code, it's able to unroll loops, inline invocations, dynamically optimize conditionals and switches, and generally improve the speed of code by a drastic amount. When running in "server" mode, where Hotspot lets code run longer in interpreted mode before optimizing it, even greater improvements can be seen. For example, A fib benchmark--recursive and iterative--under Java 6 client and server VMs:

(best times shown)
client recursive:
8.551000

client iterative:
17.995000

server recursive:
5.008000

server iterative:
13.191000

MRI recursive:
1.670000

MRI iterative:
16.964403
These numbers aren't bad, really. We even beat MRI for this trivial benchmark when we start hitting Bignums heavily (Java's BigInteger implementation is quite a bit faster than Ruby's). However, we can certainly do better.

There are two experimental changes I'be been working on. The first eliminates the pre/post method setup for core libraries when that setup is not necessary. I call it "fast invocation", and it's applicable to a large majority of core class methods. For example, with just fast invocation, the recursive fib numbers above drop to the sub-5s range. This change is perfectly safe, since it's just eliminating interpreter overhead that would otherwise be wasted cycles. You can expect to see it included in JRuby soon.

The second change is the holy grail of dynamic invocation: eliminate, to the greatest extent possible, the overhead of looking up and dispatching to a given method. In short, make it as close to a simple static dispatch as possible. This is where the real speed gains in JRuby will start to show up.

I have some experimental code right now, focused on the fib benchmark, that is both safe and drastically improves performance. It's sub-par code at the moment, but it does produce results like this:
recursive before:
5.008000

recursive after:
3.864000
Now of course, this is still interpreted. The same change when applied to my experimental Ruby compiler produces a much more drastic effect:
compiled recursive after:
1.550100
Now we start to see the value of eliminating dynamic-dispatch overhead. This is actually *faster* than Ruby's recursive fib, a feat that hasn't been accomplished by JRuby at any time in the past.

The trick to this is fairly simple. For common core methods which are known to be simple Java code, such as for Fixnum's +, -, and < implementations, I provide integer IDs. Within the Fixnum implementation there's a new implementation of "callMethod", our dynamic-dispatcher, which switches on these IDs. For methods it knows, such as the aforementioned +, -, and <, it dispatches directly to op_plus, op_minus, or op_lt, the Java implementations. This skips the lookup phase, the ICallable implementation, the ThreadContext manipulation, and the pre/post-method setup code completely. It's also perfectly safe, again, because all those pieces only waste cycles for simple methods like this.

Now one problem with a simple approach like this is that if you redefine Fixnum#+, that change won't be picked up. The simple Fixnum callMethod won't ever try a method search for methods it knows can be fast-dispatched. I resolved this in my experimental code by adding a "clean" flag to the Fixnum class. If any any point after its initial definition the Fixnum class becomes "dirty", e.g. if you add or redefine a method, the old, slow dispatch will come back into play. My simple version is too coarse-grained, killing fast dispatching for all methods if any of them are changed, but the principal is sound. I'm going to be exploring this the rest of the week, trying to find a more complete solution...but some good things are around the corner.

--

All told, it's been a productive few days since JavaPolis. I'm hoping to write up a JavaPolis recap soon, but I'm keen to use this holiday time to get some cool stuff done. You'll hear more after the first of the year...hopefully with committed code and additional benchmarks against JRuby trunk :)

1 comment:

Patrick said...

Hi Charles--thanks for the update, these techy pieces are always interesting to read. Have you ever been in contact with Tomatsu of the Pnuts language? There was one blog of his from awhile back about method caching they use: http://www.jroller.com/page/tomatsu?entry=i_don_t_need_invokedynamic
Wondering if there are points to learn from each other. I am eager to see language implementors on the JVM learn from each other's work to improve the whole development space. Looks like nice work on JR, keep it up. Patrick