Wednesday, December 27, 2006

Making Dynamic Invocation Fast: An Idea?

Evan Phoenix (of Rubinius fame) and I were discussing dynamic dispatch today on #rubinius, sharing our caching strategies and our dispatch woes. We talked at length about various strategies for speeding up dispatch, cache invalidation mechanisms, compilation techniques, and so on. All gloriously fun stuff.

So at one point I related to him the extended version of my plans for speeding up dynamic dispatch. I relate it now to you to hopefully spur some discussion. There are no original ideas anymore, right? Or are there?

The initial experiment with Fixnum was basically the static version of my fuzzy vision. During parse time, a very trivial static table mapped method names to numbers. I only handled three method names: +, -, and <. Then those numbers were passed to Fixnum during method dispatch, where a very trivial static switch statement used them to "fast dispatch" to the appropriate methods.

The ultimate vision, however, is much grander.

The general idea is that during dynamic dispatch, we want to use the method name and the class as keys to locate a specific method instance. The typical way this is done is by keeping a hashtable on every class and looking up methods by name. This works reasonably well, but each class must have its own table and we must pay some hashtable cost for every search. This is compounded when we consider class hierarchies, where the method we're looking for might be in a super class or a super class's super class, ad infinitum.

Retrieving the class is the easy part; every method we want to dispatch will have a receiver, and receivers have a reference to their class.

So then, how do we perform this magic mapping from M method names and N classes to a given method? Well hell, that's a matrix!

So then the great idea: build a giant matrix mapping method names (method IDs) and classes (class IDs) to the appropriate switchable value. Then when we dispatch to the class, we use that switchable value in a neatly dense switch statement.

Let me repeat that in another way to make it clear...the values in the matrix are simple "int" values, ideally as closely-numbered (dense) as possible for a given class's set of methods. In order to dispatch to a named method on a given object, we have the following steps (a rough sketch in code follows the list):

  1. Get the method ID from the method (likely calculated at parse or compile time)
  2. Get the class ID from the class (calculated sequentially at runtime)
  3. Index into the matrix using method ID and class ID to get the class-specific method switch value
  4. Use the switch value to dispatch (via a switch statement, of course) to the appropriate method in the target class
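To make those steps concrete, here's a minimal sketch in Java. Every name here (DispatchMatrix, the ID numbering, the hardcoded Fixnum entries) is invented for illustration; this is not actual JRuby code, and a real implementation would generate a dense switch per class rather than the single global one shown:

    // Toy sketch of matrix dispatch; all names and values are hypothetical.
    public class DispatchMatrix {
        static final int MAX_METHODS = 8;  // method IDs assigned at parse time
        static final int MAX_CLASSES = 8;  // class IDs assigned sequentially at runtime

        // matrix[methodId][classId] -> switch value; 0 means method_missing
        static final int[][] matrix = new int[MAX_METHODS][MAX_CLASSES];

        static {
            // Say Fixnum received class ID 1, and +, -, < received method IDs 1, 2, 3.
            matrix[1][1] = 1; // Fixnum#+
            matrix[2][1] = 2; // Fixnum#-
            matrix[3][1] = 3; // Fixnum#<
        }

        // Steps 1-4 from the list above: IDs in, matrix lookup, dense switch out.
        static Object dispatch(int methodId, int classId, long self, long arg) {
            switch (matrix[methodId][classId]) {
                case 1:  return self + arg; // Fixnum#+
                case 2:  return self - arg; // Fixnum#-
                case 3:  return self < arg; // Fixnum#<
                default: return methodMissing(self, methodId);
            }
        }

        static Object methodMissing(long self, int methodId) {
            throw new UnsupportedOperationException("method_missing for method ID " + methodId);
        }

        public static void main(String[] args) {
            System.out.println(dispatch(1, 1, 40, 2)); // 42
            System.out.println(dispatch(3, 1, 40, 2)); // false
        }
    }

Note how the default case gives us method_missing for free: any method/class pair we never filled in still maps to zero.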
Ignoring the size of this matrix for a moment, there are a number of benefits to this data structure:
  • We have no need for per-class or inline method caches
  • Repopulating the switch table for a given class's methods is just a matter of refilling its column in the matrix with appropriate values
  • Since the matrix represents all classes, we can avoid searching hierarchies once parents' methods are known; we just represent those methods in the matrix under the child's column and fast-dispatch to them as normal
  • We can save off and reconstitute the mapping to avoid having to regenerate it
...and there's also the obvious benefit: using the switch values stored in the matrix, we can do a fast dispatch on the target object with only the minimal cost of looking up a value in the matrix. An additional benefit for C code would be simply storing the address of the appropriate code in the matrix, avoiding any switch values completely.

Now naturally we're talking about a big matrix here, I'll admit that. A fellow by the name of "Defiler" piped up and said an incomplete app of his showed 1147 unique method names across 1258 classes. That works out to a matrix with 1.4M elements. In the Java world, where everything is 32 bits, that's around 6MB of memory consumed just for method mapping. Acceptable? Perhaps. But what about using a sparse matrix?

A sparse matrix attempts to be as efficient as possible when there are large numbers of "zeros" in the matrix elements. That is, interestingly enough, exactly what we have here. The vast majority of entries in our method-map would contain a zero, since most methods would not exist on most classes. Zero, as it turns out, would neatly map to our good friend "method_missing", and so dispatching to a method on a target class that doesn't implement it would continue to work as it does today. Even better, method_missing dispatch would be just as fast as a regular method dispatch, excluding the actual method_missing implementation code, of course.
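For illustration, here's one classic way to lay out such a sparse table: row displacement, where each class's set of switch values is slid into a single flat array so that its non-zero entries land in the holes left by other classes. This is a hypothetical sketch (storing per-class rows for convenience), not JRuby code, and real systems are more sophisticated; the owners check is what makes a hole in one class's row read as zero, i.e. method_missing, instead of another class's entry:

    import java.util.Arrays;

    // Hypothetical sketch of row-displacement compression for the dispatch matrix.
    public class RowDisplacedMatrix {
        private final int[] values;   // compressed switch values; 0 = method_missing
        private final int[] owners;   // which class ID occupies each slot; -1 = free
        private final int[] offsets;  // per-class displacement into values[]

        // rows[classId][methodId] = switch value, 0 = empty (method_missing)
        public RowDisplacedMatrix(int[][] rows) {
            int numClasses = rows.length;
            int numMethods = rows[0].length;
            values = new int[numClasses * numMethods + numMethods]; // worst case
            owners = new int[values.length];
            Arrays.fill(owners, -1);
            offsets = new int[numClasses];

            for (int c = 0; c < numClasses; c++) {
                // First fit: the smallest offset where this row's non-zero
                // entries land only in still-unoccupied slots.
                int off = 0;
                search:
                while (true) {
                    for (int m = 0; m < numMethods; m++) {
                        if (rows[c][m] != 0 && owners[off + m] != -1) {
                            off++;
                            continue search;
                        }
                    }
                    break;
                }
                offsets[c] = off;
                for (int m = 0; m < numMethods; m++) {
                    if (rows[c][m] != 0) {
                        values[off + m] = rows[c][m];
                        owners[off + m] = c;
                    }
                }
            }
        }

        // Constant-time lookup: two array reads and a compare.
        public int lookup(int classId, int methodId) {
            int slot = offsets[classId] + methodId;
            return owners[slot] == classId ? values[slot] : 0; // 0 -> method_missing
        }
    }

Packed well, the table shrinks toward the number of actual method definitions rather than methods times classes, at the cost of the parallel owners array.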

So then, there it is...a potential plan for fast dynamic method invocation in JRuby (and perhaps in Rubinius, if Evan agrees the idea has merit). Any of you out there heard of such a thing? Is this an old idea? Are there problems with it? Share!

20 comments:

Anonymous said...

How would you map method names to ints? A translation table?

Anonymous said...

A lookup matrix of classes vs. method names sounds a lot like the way earlier Eiffel implementations worked. I believe they also went the sparse-table route. I'll see if I can dig up some documentation on that.

komu said...

First of all, I'd say that several per-class lookup tables will be faster than one huge lookup table: if you reference the row of your matrix directly from the associated class, you get one level of lookup for free.

Now, to avoid the rows becoming huge, you need to compress them by removing the empty slots, as you said. Since this implies that you can't perform a direct lookup with the statically calculated method number, you really have just two good choices: either use binary search or perform a hash-lookup. Good hashing with custom collision handling is probably the fastest here.

So hash-tables are not a problem if you are clever with them. If you initialize them correctly, you never have to fall back to the parent class. (Just using the same idea as you had with the matrix.) Moreover, since you can calculate the hash-code statically at compilation time, you don't have that overhead either.

What's best, this approach generalizes easily to instances that want to override some methods: you can put the hash-table directly on instances. (But this has some subtle cache-invalidation problems when the instance's class is changed after the instance is created, so it might not be worth it.)

The hash-table cost is not that big once you factor most of it into compile time. The constant-time lookup for the matrix looks even better, but as you'll notice, it won't work for sparse matrices that are likely to be implemented using trees or hash-tables anyway.

Finally, I'd better note that using hashed class descriptors with hash-codes calculated at compile time is not my original idea, but is a relatively well-known technique for implementing dynamic languages and/or multiple inheritance. I believe Appel mentions somewhere that this technique has an average overhead of 4 instructions compared to a normal call. Of course, I don't know JVM well enough to tell if the idiom translates easily.
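To sketch the idea in Java (all names invented; the point is that the call site can bake in both the method ID and its hash as compile-time constants):

    // Per-class open-addressed table with linear probing; hypothetical names.
    // Capacity stays comfortably above the method count, so probes always
    // terminate at an empty slot.
    class MethodTable {
        interface Method { Object call(Object self, Object[] args); }

        private final int mask;          // capacity - 1; capacity is a power of two
        private final int[] methodIds;   // 0 marks an empty slot
        private final Method[] methods;  // parallel array of method bodies

        MethodTable(int capacity) {
            mask = capacity - 1;
            methodIds = new int[capacity];
            methods = new Method[capacity];
        }

        // Pre-populated with the class's own and inherited methods, so lookup
        // never climbs the superclass chain.
        void put(int methodId, int methodHash, Method m) {
            int slot = methodHash & mask;
            while (methodIds[slot] != 0 && methodIds[slot] != methodId)
                slot = (slot + 1) & mask;
            methodIds[slot] = methodId;
            methods[slot] = m;
        }

        Method lookup(int methodId, int methodHash) {
            int slot = methodHash & mask;
            while (methodIds[slot] != 0) {
                if (methodIds[slot] == methodId) return methods[slot];
                slot = (slot + 1) & mask;
            }
            return null; // caller falls back to method_missing
        }
    }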

Anonymous said...

Similar techniques are common in Smalltalk implementations, which have this problem in spades. Another, more aggressive technique that they sometimes use is caching the method invoked at each *call site*. Similar techniques were also used by Jim Hugunin in IronPython.

Unfortunately, googling around doesn't show a quick, official-looking writeup, just stuff like this:

http://cow.physics.wisc.edu/~craigm/idl/archive/msg00077.html

which describes similar techniques in email archives. I'm sure they've been formally published somewhere (and in any case, source code for Squeak --- a Smalltalk-in-Smalltalk similar in spirit to Rubinius --- and for IronPython is available).

Anonymous said...

Paper on the Self techniques (spelled right this time) here.

Self was capable of recompiling the same method more than once, based on observed types. So, if a particular call always resolved to the same method accessor on the same class, it would eventually inline the thing; the generated code would check that the type was as expected (to guard against, say, humans making typos in the debugger), and then just read the field, avoiding method dispatch altogether.
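In Java-flavored pseudocode (invented names; the real systems emit native code), the generated code amounts to something like:

    // Hypothetical rendering of Self-style inlining for a call site that has
    // only ever seen Point receivers.
    class InlinedAccessor {
        static class RClass {}
        static final RClass POINT_CLASS = new RClass();
        static class RObject { RClass klass; }
        static class RPoint extends RObject { Object x; }

        static Object getX(RObject obj) {
            if (obj.klass == POINT_CLASS) { // cheap type guard
                return ((RPoint) obj).x;    // inlined field read, no dispatch at all
            }
            return slowDispatch(obj, "x");  // rare path: full dynamic lookup
        }

        static Object slowDispatch(RObject obj, String name) {
            throw new UnsupportedOperationException("full lookup for " + name);
        }
    }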

Anonymous said...

This one is also interesting:
http://research.sun.com/self/papers/pics.html

They differentiate method calls into mono-, poly-, and megamorphic. Identifying and optimizing the monomorphic calls gives the best performance.

Does it make sense to apply some statistics gathering to identify and optimize monomorphic (and polymorphic, with 2-4 potential receivers) calls in JRuby?

Anonymous said...

Charles,
this reminds me strongly of JVM optimizations for invoke_interface, i.e. for invoking interface methods. These are similar to method invocation in dynamic languages (Ruby, Smalltalk) in that they dispatch only via name (and, in Java, argument types).

Some research:
SableVM, an open source JVM:
there's a PhD thesis by the main developer that describes sparse method tables for improving the speed of invoke_interface.
http://citeseer.ist.psu.edu/gagnon00sablevm.html
The full PhD thesis is available at:
http://sablevm.org/people/egagnon/gagnon-phd.pdf
This contains the details (sorry, it seems the page isn't responding at the moment; here's an alternative PDF link to the full thesis:
http://www.sable.mcgill.ca/publications/thesis/phd-gagnon/sable-thesis-2002-phd-gagnon.pdf
)
Look in chapter 4, "Sparse Interface Virtual Tables".
Basically, they reuse the big gaps between actual entries to reduce the overhead.

And of course, there's the treasure trove for JVM optimization: the JikesRVM publications page:
http://jikesrvm.sourceforge.net/wiki/index.php/Publications
There's at least one on the topic:
http://jikesrvm.sourceforge.net/wiki/index.php/Publications#Efficient_Implementation_of_Java_Interfaces:_Invokeinterface_Considered_Harmless

Logan Capaldo said...

What about singleton classes? Would they appear in this matrix? And if they don't, then what do you do?

Unknown said...

See this paper.

Anonymous said...

This sounds good; it should be easier to maintain a single matrix than to keep a lot of polymorphic inline caches (PICs) in the AST up to date.

I'm not sure, but I think PICs would have the faster lookup (a linear search of very short lists), while the sparse matrix would probably win when it comes to memory consumption.
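For comparison, a PIC at a call site is essentially just this (a hypothetical sketch):

    // Sketch of a polymorphic inline cache: each call site remembers the last
    // few receiver classes it saw and the methods they resolved to.
    class CallSite {
        static final int PIC_SIZE = 4;
        final Object[] classes = new Object[PIC_SIZE]; // receiver classes seen here
        final Object[] methods = new Object[PIC_SIZE]; // what each resolved to

        Object lookup(Object receiverClass) {
            for (int i = 0; i < PIC_SIZE && classes[i] != null; i++) {
                if (classes[i] == receiverClass) return methods[i]; // hit
            }
            return null; // miss: do the slow lookup, then record the result here
        }
    }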

Anonymous said...

I'm not a big PIC fan here either, but having some weights associated with receivers would keep the search away from the worst-case scenario - guards would do the rest of the job. Memory consumption is another factor, of course.

kjk said...

Or you can just steal all the ideas (and maybe some code) from Strongtalk (http://code.google.com/p/strongtalk/), a Smalltalk implementation that, they say, beats any other Smalltalk implementation by order(s) of magnitude. Their code has recently been released under a BSD license. They use techniques that originated in Self (same people involved): PICs, feedback-based optimization, and aggressive inlining. The company that did Strongtalk was bought by Sun, and their people used similar techniques to improve Java's HotSpot VM.

Anonymous said...

It's always amusing to see how ignorant Ruby people are of the fact that Smalltalk solved their problems decades ago.

Unknown said...

Sorry if this is a dumb question, but...

Wouldn't a big switch/case statement perform at O(n)? I tried testing this with a Java test program and that seems to be the case. Is there some bytecode trick to this?

ruby test code:
-------
(10..25).each do |step|
  size = step * 100

  File.open('BigSwitch.java', 'w') do |javafile|
    javafile.puts "public class BigSwitch {
  // do something almost interesting
  public static void dispatch(String x) {
    int y = x.length() * 100 / 10;
  }
  public static void main(String[] args) {
    long t1, t2;
    System.out.print(\"\" + #{size} + \",\");
    t1 = System.nanoTime();
    for (int i = 0; i < #{size}; i++) {
      switch (i) {
"
    size.times { |i|
      javafile.puts "case #{i}: dispatch(\"#{i}\"); break;"
    }
    javafile.puts "
      }
    }
    t2 = System.nanoTime();
    System.out.println(\"\" + (t2 - t1));
  }
}
"
  end
  system 'javac BigSwitch.java'
  system 'java BigSwitch'
end

Unknown said...

oops, never mind...the switch would only contain switch values for the methods of a single class, wouldn't it? Sorry...

Anonymous said...

Charles,
are you going to switch to ASM 3.0 in the near future? (There are some minor API changes.) I'm playing with a Ruby DSL for bytecode compilation/generation and I'd like to rely on as few dependencies as possible. Should I be prepared for it?

Erik van Oosten said...

I am missing one important subject in the discussion: parallelism. Programs are becoming more threaded, so locking the entire mapping for every update is not very acceptable.

Anonymous said...

Strongtalk is Sun's version of Smalltalk. So why not ask the Strongtalk team, if they are still there?

zheng said...

C Ruby implements a global method cache.

When you call a method for the first time, the method is stored in the cache. When you call the method again, it can be fetched from the cache.

The cache is just an array; its index is calculated like this:
#define EXPR1(c,m) ((((c)>>3)^(m))&CACHE_MASK)

where c is the VALUE of the class and m is the ID of the method name.

The cache is implemented in eval.c.

I've read the JRuby 0.9.1 source. Two points could be improved. First, there is no cache. Second, String is the key of the hashtable.

In the XRuby newruntime branch, RubyID is the key of the hashtable; it is a wrapper around a long and reduces the cost of calculating the hash key. Of course, a global method cache is also implemented in XRuby newruntime.
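For JRuby, the same idea could be a small Java class along these lines (a sketch with invented names; the index computation mirrors the EXPR1 macro above):

    // Sketch of a CRuby-style global method cache in Java; hypothetical names.
    class GlobalMethodCache {
        static final int CACHE_SIZE = 0x800;          // a power of two
        static final int CACHE_MASK = CACHE_SIZE - 1;

        static class Entry { Object klass; long methodId; Object method; }
        static final Entry[] cache = new Entry[CACHE_SIZE];

        // Same shape as EXPR1: mix class identity with method ID, then mask.
        static int index(Object klass, long methodId) {
            return (int) ((System.identityHashCode(klass) >> 3) ^ methodId) & CACHE_MASK;
        }

        static Object lookup(Object klass, long methodId) {
            Entry e = cache[index(klass, methodId)];
            if (e != null && e.klass == klass && e.methodId == methodId)
                return e.method; // hit
            return null;         // miss: do the full lookup, then store it here
        }
    }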

OSRJ said...

Hi friends, I'm Sunil from India. I'm a fresher in software development, so I have some doubts regarding Java, like dynamic method dispatch and factory classes. Will you please help me with my problems?

Thanks,
sunil
sunil.langeh@gmail.com