Sunday, July 09, 2006

Is Reflection Really as Fast as Direct Invocation?

This was originally posted to the jruby-devel mailing list, but I am desperate to be proven wrong here. We use reflection extensively to bind Ruby methods to Java impls in JRuby, and the rumors of how fast reflection is have always bothered me. What is the truth? Certainly there are optimizations that make reflection very fast, but as fast as INVOKEINTERFACE and friends? Show me the numbers! Prove me wrong!!

--

It has long been assumed that reflection is fast, and that much is true. The JVM has done some amazing things to make reflected calls really f'n fast these days, and for most scenarios they're as fast as you'd ever want them to be. I certainly don't know the details, but the rumors are that there's code generation going on, reflection calls are actually doing direct calls, the devil and souls are involved, and so on. Many stories, but not a lot of concrete evidence.

A while back, I started playing around with a "direct invocation method" in JRuby. Basically, it's an interface that provides an "invoke" method. The idea is that for every Ruby method we provide in Java code you would create an implementation of this interface; then when the time comes to invoke those methods, we are doing an INVOKEINTERFACE bytecode rather than a call through reflection code.
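As a rough illustration of the idea described above, the sketch below shows one small interface with an "invoke" method, a hand-written implementation for a single method, and a reflection-backed implementation for comparison. All names here (`Invoker`, `StringLengthInvoker`, `ReflectedInvoker`, `reflected`) are hypothetical and are not taken from the JRuby codebase.

```java
import java.lang.reflect.Method;

// Hypothetical names throughout; this is not JRuby's actual API.
class DirectInvocationSketch {
    /** The single interface every bound method implements. */
    interface Invoker {
        Object invoke(Object receiver, Object[] args);
    }

    /** Hand-written invoker: the body is an ordinary, statically bound Java call. */
    static final class StringLengthInvoker implements Invoker {
        public Object invoke(Object receiver, Object[] args) {
            return Integer.valueOf(((String) receiver).length());
        }
    }

    /** Reflection-backed invoker, shown only for comparison. */
    static final class ReflectedInvoker implements Invoker {
        private final Method method;

        ReflectedInvoker(Method method) {
            this.method = method;
        }

        public Object invoke(Object receiver, Object[] args) {
            try {
                return method.invoke(receiver, args);
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }
    }

    /** Looks up a public no-argument method and wraps it in a ReflectedInvoker. */
    static Invoker reflected(Class<?> type, String name) {
        try {
            return new ReflectedInvoker(type.getMethod(name));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Invoker direct = new StringLengthInvoker();
        Invoker viaReflection = reflected(String.class, "length");
        // Both calls go through the same interface method; only the body differs.
        System.out.println(direct.invoke("hello", new Object[0]));
        System.out.println(viaReflection.invoke("hello", new Object[0]));
    }
}
```

Callers only ever see `Invoker`, so a reflective binding could be swapped for a hand-written one method by method.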

The down side is that this would create a class for every Ruby method, which amounts to probably several hundred classes. That's certainly not ideal, but perhaps manageable considering you'd have JRuby loaded once in a whole JVM for all uses of it. It could also be mitigated by only doing this for heavily-hit methods. Still, requiring lots of punky little classes is a big deal. [OT: Oh what I would give for delegates right about now...]

The up side, or so I hoped, would be that a straight INVOKEINTERFACE would be faster than a reflected call, regardless of any optimization going on, and we wouldn't have to do any wacked-out code generation.

Initial results seemed to agree with the upside, but in the long term nothing seemed to speed up all that much. There are actually a number of these "direct invocation methods" still in the codebase, specifically for a few heavily-hit String methods like hash, [], and so on.

So I figured I'd resolve this question once and for all in my mind. Is a reflected call as fast as this "direct invocation"?

A test case is attached. I ran the loops for ten million invocations...then ran them again timed, so that hotspot could do its thing. The results are below for both pure interpreter and hotspotted runs (times are in ms).

Hotspotted:
first time reflected: 293
second time reflected: 211
total invocations: 20000000
first time direct: 16
second time direct: 8
total invocations: 20000000

Interpreted:
first time reflected: 9247
second time reflected: 9237
total invocations: 20000000
first time direct: 899
second time direct: 893
total invocations: 20000000

I would really love for someone to prove me wrong, but according to this simple benchmark, direct invocation is faster--way, way faster--in all cases. It's obviously way faster when we're purely interpreting or before hotspot kicks in, but it's even faster after hotspot. I made both invocations increment a static variable, which I'm hoping prevented hotspot from optimizing code into oblivion. However even if hotspot IS optimizing something away, it's apparent that it does a better job on direct invocations. I know hotspot does some inlining of code when it's appropriate to do so...perhaps reflected code is impossible to inline?

Anyone care to comment? I wouldn't mind speeding up Java-native method invocations by a factor of ten, even if it did mean a bunch of extra classes. We could even selectively "directify" methods, like do everything in Kernel and Object and specific methods elsewhere.

--

The test case was attached to my email...I include the test case contents here for your consumption.

private static interface DirectCall {
    public void call();
}

public static class DirectCallImpl implements DirectCall {
    public static int callCount = 0;
    public void call() { callCount += 1; }
}

public static DirectCall dci = new DirectCallImpl();

public static int callCount = 0;
public static void call() { callCount += 1; }

public void testReflected() {
    try {
        Method callMethod = getClass().getMethod("call", new Class[0]);

        long time = System.currentTimeMillis();
        for (int i = 0; i < 10000000; i++) {
            callMethod.invoke(null, null);
        }
        System.out.println("first time reflected: " + (System.currentTimeMillis() - time));
        time = System.currentTimeMillis();
        for (int i = 0; i < 10000000; i++) {
            callMethod.invoke(null, null);
        }
        System.out.println("second time reflected: " + (System.currentTimeMillis() - time));
        System.out.println("total invocations: " + callCount);
    } catch (Exception e) {
        e.printStackTrace();
        assertTrue(false);
    }
}

public void testDirect() {
    long time = System.currentTimeMillis();
    for (int i = 0; i < 10000000; i++) {
        dci.call();
    }
    System.out.println("first time direct: " + (System.currentTimeMillis() - time));
    time = System.currentTimeMillis();
    for (int i = 0; i < 10000000; i++) {
        dci.call();
    }
    System.out.println("second time direct: " + (System.currentTimeMillis() - time));
    System.out.println("total invocations: " + DirectCallImpl.callCount);
}


Update: A commenter noticed that the original code was allocating a new Object[0] for every call to the reflected method; that was a rather dumb mistake on my part. The commenter also noted that I was doing a direct call to the impl rather than a call to the interface, which was also true. I updated the above code and re-ran the numbers, and reflection does much better as a result...but still not as fast as the direct call:

Hotspotted:

first time reflected: 146
second time reflected: 109
total invocations: 20000000
first time direct: 15
second time direct: 8
total invocations: 20000000

Interpreted:

first time reflected: 6560
second time reflected: 6565
total invocations: 20000000
first time direct: 912
second time direct: 920
total invocations: 20000000

18 comments:

Juha Komulainen said...

It seems that quite a bit of execution time in your tests is consumed by allocation of the empty argument lists. I managed to reduce execution time by over 30% just by converting:

callMethod.invoke(null, new Object[0])

into:

callMethod.invoke(null, null)

That said, I think that most of the difference between your test cases is caused by the fact that JVM can inline the first call, whereas it can't inline the reflected call.

Moreover, I didn't bother to test, but I'm quite sure that your test case will not compile into INVOKEINTERFACE, but into the faster INVOKEVIRTUAL, since you call the method through a reference of the implementation class, rather than the interface. But anyway, if the VM inlines the call, it's not going to make a difference.

Anyway, I think that you're not going to get the performance of normal calls with reflection. :(
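To make the INVOKEINTERFACE versus INVOKEVIRTUAL point concrete, here is a small sketch in which the same object is called through an interface-typed reference and through a reference of the implementation class. The class and method names (`CallSiteSketch`, `viaInterface`, `viaClass`) are made up for illustration; running `javap -c` on the compiled class should show the two different call instructions.

```java
// Hypothetical names; mirrors the shape of the test case in the post.
class CallSiteSketch {
    interface DirectCall {
        void call();
    }

    static final class DirectCallImpl implements DirectCall {
        int callCount = 0;

        public void call() {
            callCount += 1;
        }
    }

    /** Static type is the interface, so javac emits invokeinterface here. */
    static void viaInterface(DirectCall target) {
        target.call();
    }

    /** Static type is the implementation class, so javac emits invokevirtual here. */
    static void viaClass(DirectCallImpl target) {
        target.call();
    }

    public static void main(String[] args) {
        DirectCallImpl impl = new DirectCallImpl();
        viaInterface(impl);
        viaClass(impl);
        System.out.println("calls: " + impl.callCount);
    }
}
```

The bytecode chosen depends only on the declared type of the reference at the call site, not on the runtime class of the object.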

mortench said...

Yes, I have also heard that in the way reflection is implemented in the Sun JVM environment, internal classes are generated on the fly for reflection. However, this does NOT mean that reflective calls are as fast as normal calls. Far from it.

I can't find the reference right now but unfortunately I have also heard of potential problems with garbage collection if you do a lot of new reflective calls to new classes... As far as I recall the problem is that the generated reflective classes can't be garbage collected even if they are not used anymore. It has been a while since I read this. Don't know if this problem has been solved in recent JVMs.

Charles Oliver Nutter said...

Juha: My goodness, I can't believe I didn't see that...of course the allocation of the empty Object array would slow things down. I will post updated numbers based on a more correct version.

I am thinking more and more that inlining has a lot to do with this; I think also that optimizing one layer of indirection (interface call) is always going to be faster than optimizing multiple layers of indirection (3-4 levels of reflection classes), regardless of the magic performed.

Direct invocation should be an option for us, but inlining may not play as much of a role since we'll keep all such "invokers" in a hash. That said, I think there's still a good performance gain to be had.
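A minimal sketch of what keeping such "invokers" in a hash could look like follows. The names (`InvokerTableSketch`, `bind`, `send`, `stringMethods`) are assumptions for illustration, not JRuby's real method table. Every dispatch goes through one shared call site, which is part of why inlining may help less than in the micro-benchmark.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical method table: method name -> invoker object.
class InvokerTableSketch {
    interface Invoker {
        Object invoke(Object receiver, Object[] args);
    }

    private final Map<String, Invoker> invokers = new HashMap<>();

    void bind(String name, Invoker invoker) {
        invokers.put(name, invoker);
    }

    /** Hash lookup followed by a single interface call; no reflection involved. */
    Object send(Object receiver, String name, Object... args) {
        Invoker invoker = invokers.get(name);
        if (invoker == null) {
            throw new IllegalArgumentException("undefined method: " + name);
        }
        return invoker.invoke(receiver, args);
    }

    /** Builds a tiny table with two String methods bound by hand. */
    static InvokerTableSketch stringMethods() {
        InvokerTableSketch table = new InvokerTableSketch();
        table.bind("length", new Invoker() {
            public Object invoke(Object receiver, Object[] args) {
                return Integer.valueOf(((String) receiver).length());
            }
        });
        table.bind("upcase", new Invoker() {
            public Object invoke(Object receiver, Object[] args) {
                return ((String) receiver).toUpperCase(Locale.ROOT);
            }
        });
        return table;
    }

    public static void main(String[] args) {
        InvokerTableSketch table = stringMethods();
        System.out.println(table.send("ruby", "length"));
        System.out.println(table.send("ruby", "upcase"));
    }
}
```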

Chris Nokleberg said...

If you use the Server JVM (-server flag) reflection does better relative to the direct call.

Charles Oliver Nutter said...

Chris: Running with -server didn't change the results much for me. Here's results with -server looping 100 million times:

first time reflected: 1115
second time reflected: 1036
total invocations: 200000000
first time direct: 25
second time direct: 17
total invocations: 200000000

Chris Nokleberg said...

This is on Linux. I'm not sure why yours isn't faster, maybe -server isn't getting picked up.

$ java Test
first time reflected: 1428
second time reflected: 1458
total invocations: 20000000
first time direct: 40
second time direct: 38
total invocations: 20000000

$ java -server Test
first time reflected: 171
second time reflected: 133
total invocations: 20000000
first time direct: 31
second time direct: 48
total invocations: 20000000

Chris Nokleberg said...

(BTW, this is 1.5.0_06, if it matters)

Charles Oliver Nutter said...

Chris: I think I misunderstood...I read your comment to mean that with -server reflected was faster than direct...but I see now you meant that it's faster than it is without server, but still slower than direct.

I am also using 1.5 on Linux, and the server VM is not very much faster than the client; I also run the loops twice to ensure the code has been JITted.

At any rate, it's still a significant boost, and we'll be investigating it further.

Chris Nokleberg said...

Sorry for the confusion. Safe to say reflection will never be faster than direct :-)

Anonymous said...

Reflection can't really be faster than direct calls, unless you have a really, really clever JIT engine that instantiates your reflective invocations into direct ones. One of the issues causing the slowdown is that upon reflection, the class library/VM need to do a lot of fun accessibility & security checks to make sure that you've got the permission to invoke the method from your context. That penalty is paid once up front for a direct call in the verifier, which ensures that you're calling the right method with the right parameters in the right object, but can hardly be generalized for a reflective invocation.

What one can do to trim the invocation time down with reflection is to use caching, or to use a trampoline per invocation to translate reflective calls into direct ones, which I assume is what Sun does with their bytecode generation scheme. That shaves off the dynamic part of the lookup, and error checking, but still leaves you with having to go through argument marshalling and a few invocations in the trampoline to actually invoke the method, and get the result back.
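The caching half of this suggestion might look roughly like the sketch below: the `Method` lookup is done once per (class, name) pair and reused, so only the `invoke` call itself remains on the hot path. The class and helper names (`MethodCacheSketch`, `lookup`, `callNoArg`) are hypothetical, and the sketch only handles public no-argument methods.

```java
import java.lang.reflect.Method;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical cache of reflective lookups for public no-argument methods.
class MethodCacheSketch {
    private static final ConcurrentMap<String, Method> CACHE = new ConcurrentHashMap<>();

    /** Returns the cached Method, performing the reflective lookup only on a miss. */
    static Method lookup(Class<?> type, String name) {
        String key = type.getName() + "#" + name;
        Method method = CACHE.get(key);
        if (method == null) {
            try {
                method = type.getMethod(name);
            } catch (NoSuchMethodException e) {
                throw new IllegalArgumentException(e);
            }
            // If another thread won the race, keep its entry so callers share one Method.
            Method prior = CACHE.putIfAbsent(key, method);
            if (prior != null) {
                method = prior;
            }
        }
        return method;
    }

    /** Invokes a public no-argument method by name using the cached lookup. */
    static Object callNoArg(Class<?> type, Object receiver, String name) {
        try {
            return lookup(type, name).invoke(receiver);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callNoArg(String.class, "hello", "length"));
        // The second lookup hits the cache and returns the very same Method object.
        System.out.println(lookup(String.class, "length") == lookup(String.class, "length"));
    }
}
```

This removes the name-based lookup from each call, but the argument marshalling and the reflective `invoke` path described above are still there.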

In a nice language, like MetaOCaml, this sort of stuff becomes nicer to implement, as you can direct the compiler to generate dynamic code that's specialised for the invocation. See http://www.cs.rice.edu/~taha/publications/journal/dspg04a.pdf for a nice read on multi-stage programming.

cheers,
dalibor topic

Anonymous said...

Maybe you can use class generation on the fly (CGLib, etc.) to generate all these java-to-ruby classes? Of course, with some kind of cache to avoid the penalty of generation ...

Anonymous said...

Ah, yes, in that case I'd suggest taking a good look at Jumbo, which looks very neat for generating dynamic (byte)code in Java. See the very interesting paper at http://loome.cs.uiuc.edu/pubs/marshalling.pdf for their paper on performance improvements in Kaffe's ObjectOutputStream by using Jumbo to specialize marshalling, bringing it close to Sun's performance without the need to use Unsafe and native code.

cheers,
dalibor topic

Chris Nokleberg said...

FWIW CGLIB has some classes for improving reflection performance (you swap out your use of Class, Method with FastClass and FastMethod). However we have considered removing them in future versions because reflection is much better nowadays.

The downside to dynamic generation (and to a lesser degree pre-generation) is slower startup when generating/verifying the classes and running out of PermGen memory. At the very least you should try to combine multiple method stubs into a single class, instead of one class per method.
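One possible reading of "multiple method stubs in a single class" is an index-based dispatcher: one class per receiver type, with a switch selecting the target method. The sketch below is written by hand rather than generated, and its names (`CombinedStubsSketch`, `StringInvoker`, the index constants) are invented for illustration.

```java
import java.util.Locale;

// Hypothetical: one stub class serves several String methods, selected by index.
class CombinedStubsSketch {
    static final int LENGTH = 0;
    static final int UPCASE = 1;
    static final int REVERSE = 2;

    static final class StringInvoker {
        private final int index;

        StringInvoker(int index) {
            this.index = index;
        }

        /** A single class is loaded; the switch picks which direct call to make. */
        Object invoke(String receiver) {
            switch (index) {
                case LENGTH:
                    return Integer.valueOf(receiver.length());
                case UPCASE:
                    return receiver.toUpperCase(Locale.ROOT);
                case REVERSE:
                    return new StringBuilder(receiver).reverse().toString();
                default:
                    throw new IllegalStateException("unknown method index: " + index);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(new StringInvoker(LENGTH).invoke("abc"));
        System.out.println(new StringInvoker(UPCASE).invoke("abc"));
        System.out.println(new StringInvoker(REVERSE).invoke("abc"));
    }
}
```

The trade-off is fewer loaded classes in exchange for an extra branch on each call.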

I'm happy to help with any of this, just let me know.

Raphael Valyi said...

Hi,

First of all, I praise the JRuby team for what you are doing; a lot of us are waiting to put Rails on JVM rails...

Second, I think you are right with your idea of casting objects to an interface to call the methods you need rather than using reflection.

I can't remember exact figures, but what I can tell you is that while optimizing the "JGraphpad Community Edition" startup time, I chose that solution to load plugins rather than using reflection, which proved to be WAY slower (it was a context of intensive use, since the whole GUI is generated from properties files that may refer to plugins).

I also read the quite old study from IBM on that subject claiming direct is 10 times faster (on jvm 1.4 however)
http://www-128.ibm.com/developerworks/library/j-dyn0603/

which tends to the same conclusion.

So finally what about casting to interfaces for most used methods?
(Also I wonder, is there no way to dynamically generate the bytecode of those interfaces on the fly? This sounds like very simple bytecode manipulation...)

Also, I wonder, what about JRuby on Java Mustang vs. C Ruby? (The Mustang client JVM is way faster than the previous JVM; 58% faster has been claimed, but still not as fast as the previous server JVM.)

Best regards,

Raphael Valyi, Sophia Antipolis.

Charles Oliver Nutter said...

Raphael: Yes, we could do some code generation, and yes we could target our "direct invocation" to heavily-hit core methods. Both options are on the table...we just need to decide on the best way to go and do some performance testing.

And JRuby does run great under Mustang; on my primitive benchmarks it's something like 20-30% faster than 1.5. Really amazing work from the Sun JVM team.

Tom Palmer said...

You might also want to talk with the pnuts folks. I hear they do some code generation for dodging reflection, too, and they supposedly get good performance out of it. Maybe they can provide some advice from their experience.

Patrick said...

Have you been in touch with the developers of Pnuts? It's one of the older, and one of the fastest, languages for the JVM. There's a note on the use of reflection in this blog entry: http://jroller.com/page/tomatsu/20060328

Regards, Patrick

mihai007 said...

This is very old, but you could do a callMethod.setAccessible(true); to cut the time taken in reflection in half.
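For reference, applying that suggestion to a loop like the one in the post could look like the sketch below. The names `SetAccessibleSketch` and `invokeRepeatedly` are hypothetical, and the size of any speedup is the commenter's claim, not something this sketch measures.

```java
import java.lang.reflect.Method;

// Hypothetical harness: same static "call" target as in the post's test case.
class SetAccessibleSketch {
    public static int callCount = 0;

    public static void call() {
        callCount += 1;
    }

    /** Invokes call() reflectively the given number of times and returns the running count. */
    static int invokeRepeatedly(int times) {
        try {
            Method callMethod = SetAccessibleSketch.class.getMethod("call");
            // Ask the runtime to suppress Java language access checks on this Method.
            callMethod.setAccessible(true);
            for (int i = 0; i < times; i++) {
                callMethod.invoke(null);
            }
            return callCount;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("total invocations: " + invokeRepeatedly(1000));
    }
}
```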