Monday, July 12, 2010

Finding Leaks in Ruby Apps with Eclipse Memory Analyzer

After my post on Browsing Memory the JRuby Way, one commenter and several other folks suggested I actually show using Eclipse MAT with JRuby. So without further ado...

The Eclipse Memory Analyzer, like many Eclipse-based applications, starts up with a "for dummies" page linking to various actions.



The most interesting use of MAT is to analyze a heap dump in a bit more interactive way than with the "jhat" tool. The analysis supports the "jmap" dump format, so we'll proceed to get a jmap dump of a "leaky" Rails application.

I've added this controller to a simple application:

class LeakyController < ApplicationController
class MyData
def initialize(params)
@params = params
end
end

LEAKING_ARRAY = {}
def index
LEAKING_ARRAY[Time.now] = MyData.new(params)
render :text => "There are #{LEAKING_ARRAY.size} elements now!"
end
end


Some genius has decided to save all recent request parameters into a constant on the LeakyController, keyed by time, wrapped in a custom type, and never cleaned out. Perhaps this was done temporarily for debugging, or perhaps we have a moron on staff. Either way, we need to find this problem and fix it.

We'll run this application and crank 10000 requests through the /leaky index, so the final request should output "There are 10000 elements now!"

~ ➔ ab -n 10000 http://localhost:3000/leaky
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
...


After 10000 requests have completed, we notice this application seems to grow and grow until it maxes out the heap (JRuby, being on the JVM, automatically limits heap sizes for you). Let's start by using jmap to investigate the problem.

~ ➔ jps -l
61976 org/jruby/Main
61999 sun.tools.jps.Jps
61837

~ ➔ jmap -histo 61976 | grep " ruby\." | head -5
37: 11685 280440 ruby.TZInfo.TimezoneTransitionInfo
40: 10000 240000 ruby.LeakyController.MyData
133: 970 23280 ruby.Gem.Version
137: 914 21936 ruby.Gem.Requirement
170: 592 14208 ruby.TZInfo.TimezoneOffsetInfo


We can see our old friend TimezoneTransitionInfo in there, but of course we've learned to accept that one. But what's this LeakyController::MyData object we've apparently got 10000 instances of? Where are they coming from? Who's holding on to them?

At this point, we can proceed to get a memory dump and move over to MAT, or have MAT acquire and open the dump in one shot, similar to VisualVM. Let's have MAT do it for us.

Getting Our Heap Into MAT

(Caveat: While preparing this post, I discovered that the jmap tool for the current OS X Java 6 (build 1.6.0_20-b02-279-10M3065) is not properly dumping all information. As a result, many fields and objects don't show up in dump analysis tools like MAT. Fortunately, there's a way out; on OS X, you can grab Soylatte or OpenJDK builds from various sources that work properly. In my case, I'm using a local build of OpenJDK 7.)

From the File menu, we select Acquire Heap Dump.



The resulting dialog should be familiar, since it lists the same JVM processes the "jps" command listed above. (If you had to specify a specific JDK home, like me, you'll need to click the "Configure" button and set the "jdkhome" flag" for "HPROF jmap dump provider".)



We'll pick our Rails instance (pid 61976) and proceed.

MAT connects to the process, pulls a heap dump to disk, and immediately proceeds to parse and open it.



Once it has completed parsing, we're presented with a few different paths to follow.



On other days, we might be interested in doing some component-by-component browsing to look for fat objects or minor leaks, or we might want to revisit the results of previous analyses against this heap. But today, we really need to figure out this MyData leak, so we'll run the Leak Suspects Report.

Leak Suspects?



Are you kidding? A tool that can search out and report possible leaks in a running system? Yes, Virginia, there is a Santa Claus!

This is the good side of the "plague of choices" we have on the JVM. Because there's so many tools for almost every basic purpose (like the dozen – at least – memory inspection tools), tool developers have moved on to more specific needs like leak analysis. MAT is my favorite tool for leak-hunting (and it uses less memory than jhat for heap-browsing, which is great for larger dumps).

Once MAT has finished chewing on our heap, it presents a pie chart of possible leak suspects. The logic used essentially seeks out data structures whose accumulated size is large in comparison to the rest of the heap. In this case, MAT has identified three suspects that in total comprise over half of the live heap data.



Scrolling down we start to get details about these leak candidates.



So there's a Hash, a Module, and 711 Class objects in our list of suspects. The Class objects are probably just loaded classes, since the JRuby core classes and additional classes loaded from Rails and its dependent libraries will easily number in the hundreds. We'll ignore that one for now. There's also an unusually large Module taking up almost 4MB of memory. We'll come back to that.

The Hash seems like the most likely candidate. Let's expand that.

The first new block of information gives us a list of "shortest paths" to the "accumulation point", or the point at which all this potentially-leaking data is gathering. There's more to this in the actual application, but I'm showing the top of the "path" here.



At the top of this list, we see the RubyHash object originally reported as a suspect, and a tree of objects that lead to it. In this case, we go from the Hash itself into a ConcurrentHashMap (note that we're hiding nothing here; you can literally browse anything in memory) which in turn is referenced by the "constants" field of a Class. So already we know that this hash is being referenced in some class's constant table. Pretty cool, eh?

Let's make sure we've got the right Hash and not some harmless data structure inside Rails. If we scroll down a bit more, we see a listing of all the objects this Hash has accumulated. Let's see what's in there.



Ok, so it's a hashtable structure with a table of entries. Can we get more out of this?

Of course like most of these tools, just about everything is clickable. We can dive into one of the hash entries and see what's in there. Clicking on an entry gives us several new ways to display the downstream objects we've managed to aggregate. In this case, we'll just do "List Objects", and the suboption "With Outgoing References" for downstream data.



Now finally in the resulting view of this particular RubyHashEntry, we can see that our MyData object is happily tucked away inside.



Ok, so we definitely have the right data structure. Not only that, but we can see that the entry's "key" is a Time object (org.jruby.RubyTime). Let's go back to the "Shortest Paths" view and examine the ConcurrentHashMap entry that's holding this Hash object. Each entry in this hash maps a constant name to a value, so we should be able to see which constant is holding the leaking references.

(At this point you'll see the side effects of my switch to OpenJDK 7; the memory addresses have changed, but the structure is the same.)



We'll do another "List Objects" "with outgoing references" against the the HashEntry object immediately referencing our RubyHash.



And there it is! In the "key" field of the HashEntry, we see our constant name "LEAKING_ARRAY".

What About That Module?

Oh yeah, what about that Module that showed up in the leak suspects? It was responsible for almost 4MB of the heap. Let's go back and check it out.



A-ha! Eclipse MAT has flagged the Gem module as being a potential leak suspect. But why? Let's go back to the suspect report and look at the Accumulated Objects by Class table, toward the bottom.



Ok, so the Gem module eventually references nearly 6000 Gem::Specification objects, which makes up the bulk of our 3.8MB. I guarantee I don't have 6000 gem versions installed. Perhaps that's something that RubyGems should endeavor to fix? Perhaps we've just used JRuby and Eclipse MAT to discover either a leak or wasteful memory use in RubyGems?

Evan Phoenix pointed out that I misread the columns. It's actually 249 Specification objects, their "self" size is almost 6000 bytes, and their "retained" size is 3.8MB. But that gives me an opportunity to show off another feature of MAT: Customized Retained Set calculation.

In this case, the retained size seems a bit suspect. Could there really be 3.8MB of data kept alive by Gem::Specification objects? It seems like a bit much, to be sure, but digging through the tree of references from the Gem module down shows there's several references to classes and modules, which in turn reference constant tables, method tables, and so on. How can we filter out that extra noise?

First we'll return to the view of the Gem module (two screenshots up) by going back to leak suspect #2, expanding "Shortest Paths". The topmost RubyModule in that list is the Gem module, so we're all set to calculate a Customized Retained Set.



The resulting dialog provides a list of options through which you can specify classes or fields to ignore when calculating the retained set from a given starting point. In this case, it's simple enough to filter out org.jruby.RubyClass and org.jruby.RubyModule, so that references from Gem::Specification back into the class/module hierarchy don't get included in calculations.



Which results in a similar view to those we've seen, but with objects sorted by retained heap.



Well what the heck? It looks like it's all String data?

JRuby's String implementation is an org.jruby.RubyString object, aggregating an org.jruby.util.ByteList object, aggregating a byte array, so the top three entries there in total are essentially all String memory. The best way to investigate where they're coming from is to do "List Objects" on RubyString, but instead of "with outgoing references" we'll use "with incoming references" to show where all those Strings are coming from.



Finally we have a view that lets us hunt through all these strings and see where they're coming from. Poking at the first few shows they're stored in constant tables of the Gem module (that last RubyModule I haven't expanded in). That's probably not a big deal. But if we sort the the list of RubyString objects by their retained sizes, we get a different picture of the system.



If we dig into the *largest* String objects, they appear to be referenced by Gem::Specification instance variables! So there's probably something worth investigating here.

It's also worth noting that any Ruby application is going to have a lot of Strings in it, so this isn't all that unusual to see. But it's nice to have a tool that lets you investigate potential inefficiencies (even down to the raw bytes!), and it's nice to know that at least some of that retained data for the Gem module is "real" and not just references back into the class hierarchy.

(And I'm not convinced all those Strings really *need* to be alive...but you're welcome to take it from here!)

Your Turn

Eclipse MAT is probably one of the nicest of the free tools. In addition to object browsing, leak detection, GC root analysis, and object query language support, there's a ton of other features, both in the main distribution and available from third parties. If you're hunting for memory leaks, or just want to investigate the memory usage of your (J)Ruby application, MAT is a tool worth playing with (and as always, I hope you will blog and report your experiences!)

Friday, July 09, 2010

Browsing Memory the JRuby Way

There's been a lot of fuss made lately over memory inspection and profiling tools for Ruby implementations. And it's not without reason; inspecting a Ruby application's memory profile, much less diagnosing problems, has traditionally been very difficult. At least, difficult if you don't use JRuby.


Because JRuby runs on the JVM, we benefit from the dozens of tools that have been written for the JVM. Among these tools are numerous memory inspection, profiling, and reporting tools, some built into the JDK itself. Want a heap dump? Check out the jmap (Java memory map) and jhat (Java heap analysis tool) shipped with Hotspot-based JVMs (Sun, OpenJDK). Looking for a bit more? There's the Memory Analysis Tool based on Eclipse, the YourKit memory and CPU profiling app, VisualVM, now also shipped with Hotspot JVMs...and many more. There's literally dozens of these tools, and they provide just about everything you can imagine for investigating memory.

In this post, I'll show how you can use two of these tools: VisualVM, a simple, graphical tool for exploring a running JVM; and the jmap/jhat combination, which allows you to dump the memory heap to disk for inspection offline.

Getting JRuby Prepared

All these tools work with any version of JRuby, but as part of JRuby 1.6 development I've been adding some enhancements. Specifically, I've made some modifications that allow Ruby objects to show up side-by-side with Java objects in memory profiles. A little explanation is in order.

In JRuby, all the core classes are represented by "native" Java classes. Object is represented by org.jruby.RubyObject, String is org.jruby.RubyString, and so on. Normally, if you extend one of the core classes, we don't actually create a new "native" class to represent it; instead, all user-created classes that extend Object simply show up as RubyObject in memory. This is still incredibly useful; you can look into RubyObject and see the metaClass field, which indicates the actual Ruby type.

Let's see what that looks like, so we know where we're starting from. We'll run a simple script that creates a custom class, instantiates and saves 10000 instances of it, and then sleeps.


~/projects/jruby ➔ cat foo_heap_example.rb 
class Foo
end

ary = []
10000.times { ary << Foo.new }

puts "ready for analysis!"
sleep

~/projects/jruby ➔ jruby foo_heap_example.rb
ready for analysis!


So we have our test subject ready to go. To use the jmap tool, we need the pid of this process. Of course we can use the usual shell tricks to get it, but the JDK comes with a nice tool for finding all JVM pids active on the system: jps

~/projects/jruby ➔ jps -l
52862 sun.tools.jps.Jps
52857 org/jruby/Main
48716 com.sun.enterprise.glassfish.bootstrap.ASMain


From this, you can see I have three JVMs running on my system right now: jps itself; our JRuby instance; and a GlassFish server I used for testing earlier today. We're interested in the JRuby instance, pid 52857. Let's see what jmap can do with that.

~/projects/jruby ➔ jmap
Usage:
jmap [option] <pid>
(to connect to running process)
jmap [option] <executable <core>
(to connect to a core file)
jmap [option] [server_id@]<remote server IP or hostname>
(to connect to remote debug server)

where <option> is one of:
<none> to print same info as Solaris pmap
-heap to print java heap summary
-histo[:live] to print histogram of java object heap; if the "live"
suboption is specified, only count live objects
-permstat to print permanent generation statistics
-finalizerinfo to print information on objects awaiting finalization
-dump:<dump-options> to dump java heap in hprof binary format
dump-options:
live dump only live objects; if not specified,
all objects in the heap are dumped.
format=b binary format
file=<file> dump heap to <file>
Example: jmap -dump:live,format=b,file=heap.bin <pid>
-F force. Use with -dump:<dump-options> <pid> or -histo
to force a heap dump or histogram when <pid> does not
respond. The "live" suboption is not supported
in this mode.
-h | -help to print this help message
-J<flag> to pass <flag> directly to the runtime system

<


The simplest option here is -histo, to print out a histogram of the objects on the heap. Let's run that against our JRuby instance.

~/projects/jruby ➔ jmap -histo:live 52857

num #instances #bytes class name
----------------------------------------------
1: 22677 3192816 <constMethodKlass>
2: 22677 1816952 <methodKlass>
3: 35089 1492992 <symbolKlass>
4: 2860 1389352 <instanceKlassKlass>
5: 2860 1193536 <constantPoolKlass>
6: 2798 739264 <constantPoolCacheKlass>
7: 5861 465408 [B
8: 5399 298120 [C
9: 3042 292032 java.lang.Class
10: 4037 261712 [S
11: 10002 240048 org.jruby.RubyObject
12: 3994 179928 [[I
13: 5474 131376 java.lang.String
14: 1661 95912 [I
...


The resulting output is a listing of literally every object in the system...not just Ruby objects even! The value of this should be apparent; not only can you start to investigate the memory overhead of code you've written, you'll also be able to investigate the memory overhead of every library and every piece of code running in the same process, right down to byte arrays (the "[B" above) and "native" Java strings ("java.lang.String" above). And so far we haven't had to do anything special to JRuby. Nice, eh?

So, back to the matter at hand: the Foo class from our example. Where is it?

Well, the answer is that it's right there; 10000 of those 10002 org.jruby.RubyObject instances are our Foo objects; the other two are probably objects constructed for JRuby runtime purposes. But obviously, there's nothing in this output that tells us how to find our Foo instances. This is what I'm remedying in JRuby 1.6.

On JRuby master, there's now a flag you can pass that will stand up a JVM class for every user-created Ruby class. Among the many benefits of doing this, we also get a more useful profile. Let's see how to use the flag (which will either be default or very easy to access by the time we release JRuby 1.6).

~/projects/jruby ➔ jruby -J-Djruby.reify.classes=true foo_heap_example.rb 
ready for analysis!


If we run jmap against this new instance, we see a more interesting result.

 num     #instances         #bytes  class name
----------------------------------------------
1: 22677 3192816 <constMethodKlass>
2: 22677 1816952 <methodKlass>
3: 35089 1492992 <symbolKlass>
4: 2860 1389352 <instanceKlassKlass>
5: 2860 1193536 <constantPoolKlass>
6: 2798 739264 <constantPoolCacheKlass>
7: 5863 465456 [B
8: 5401 298208 [C
9: 3042 292032 java.lang.Class
10: 4037 261712 [S
11: 10000 240000 ruby.Foo
12: 3994 179928 [[I
13: 5476 131424 java.lang.String
14: 1661 95912 [I


A-ha! There's our Foo instances! The "reify classes" option generates a JVM class of the same name as the Ruby class, prefixed by "ruby." to separate it from other JVM classes. Now we can start to see the real power of the tools, and we're just at the beginning. Let's see what a simple Rails application looks like.

~/projects/jruby ➔ jmap -histo:live 52926 | grep " ruby."
29: 11685 280440 ruby.TZInfo.TimezoneTransitionInfo
97: 970 23280 ruby.Gem.Version
98: 914 21936 ruby.Gem.Requirement
122: 592 14208 ruby.TZInfo.TimezoneOffsetInfo
138: 382 9168 ruby.Gem.Dependency
159: 265 6360 ruby.Gem.Specification
201: 142 3408 ruby.ActiveSupport.TimeZone
205: 118 2832 ruby.TZInfo.DataTimezoneInfo
206: 118 2832 ruby.TZInfo.DataTimezone
273: 41 984 ruby.Gem.Platform
383: 14 336 ruby.Mime.Type
403: 13 312 ruby.Set
467: 8 192 ruby.ActionController.MiddlewareStack.Middleware
476: 8 192 ruby.ActionView.Template
487: 7 168 ruby.ActionController.Routing.DividerSegment
508: 6 144 ruby.TZInfo.LinkedTimezoneInfo
523: 6 144 ruby.TZInfo.LinkedTimezone
810: 4 96 ruby.ActionController.Routing.DynamicSegment
2291: 2 48 ruby.ActionController.Routing.Route
2292: 2 48 ruby.I18n.Config
2293: 2 48 ruby.ActiveSupport.Deprecation.DeprecatedConstantProxy
2298: 2 48 ruby.ActionController.Routing.ControllerSegment
...


This time I've opted to grep out just the "ruby." items in the histogram, and the results are pretty impressive! We can see the baffling fact that there's 970 instance of Gem::Version, using at least 23280 bytes of memory. We can see the even more depressing fact that there's 11685 live instances of TZInfo::TimezoneTransitionInfo, using at least 280440 bytes.

Now that we're getting useful data, let's look at the first of our tools in more detail: jmap and jhat.

jmap and jhat

As you might guess, I do a lot of profiling in the process of developing JRuby. I've used probably a dozen different tools at different times. But the first tool I always reach for is the jmap/jhat combination.

You've seen the simple case of using jmap above, generating a histogram of the live heap. Let's take a look at an offline heap dump.

~/projects/jruby ➔ jmap -dump:live,format=b,file=heap.bin 52926
Dumping heap to /Users/headius/projects/jruby/heap.bin ...
Heap dump file created


That's how easy it is! The binary dump in heap.bin is supported by several tools: jhat (obviously), VisualVM, the Eclipse Memory Analysis Tool, and others. It's not officially a "standard" format, but it hasn't changed in a long time. Let's have a look at jhat options.

~/projects/jruby ➔ jhat
ERROR: No arguments supplied
Usage: jhat [-stack <bool>] [-refs <bool>] [-port <port>] [-baseline <file>] [-debug <int>] [-version] [-h|-help] <file>

-J<flag> Pass <flag> directly to the runtime system. For
example, -J-mx512m to use a maximum heap size of 512MB
-stack false: Turn off tracking object allocation call stack.
-refs false: Turn off tracking of references to objects
-port <port>: Set the port for the HTTP server. Defaults to 7000
-exclude <file>: Specify a file that lists data members that should
be excluded from the reachableFrom query.
-baseline <file>: Specify a baseline object dump. Objects in
both heap dumps with the same ID and same class will
be marked as not being "new".
-debug <int>: Set debug level.
0: No debug output
1: Debug hprof file parsing
2: Debug hprof file parsing, no server
-version Report version number
-h|-help Print this help and exit
<file> The file to read

For a dump file that contains multiple heap dumps,
you may specify which dump in the file
by appending "#<number>" to the file name, i.e. "foo.hprof#3".

All boolean options default to "true"


Generally you can just point jhat at a heap dump and away it goes. Occasionally if the heap is large, you may need to use the -J option to increase the maximum heap size of the JVM jhat runs in. Since we're running a Rails app, we'll bump the heap up a little bit.

~/projects/jruby ➔ jhat -J-Xmx200M heap.bin
Reading from heap.bin...
Dump file created Fri Jul 09 02:07:46 CDT 2010
Snapshot read, resolving...
Resolving 604115 objects...
[much verbose logging elided for brevity]

Chasing references, expect 120 dots........................................................................................................................
Eliminating duplicate references........................................................................................................................
Snapshot resolved.
Started HTTP server on port 7000
Server is ready.


"Server is ready"? Damn you Java people! Does everything have to be a server with you?

In this case, it's actually an incredibly useful tool. jhat starts up a small web application on port 7000 that allows you to click through the dump file. Let's see what that looks like.



Here's the front page of the tool. We see a listing of all JVM classes in the system. If you scroll to the bottom, there's a few more general functions.



Let's go with what we know and view the heap histogram again.



Here we can see that there's lots of objects taking up memory, and they're a mix of JVM-native types, JRuby implementation classes, and actual Ruby classes. In fact, here we can see our friend TZInfo::TimezoneTransitionInfo again. Let's click through.



Pretty mundane stuff so far; basically just information about the class itself. But you see at the bottom of this screenshot that we can go from here to viewing all instances of TimezoneTransitionInfo. Let's try that.



Ahh, that's more like it! Now we can see that there's a heck of a lot of these things floating around. Let's investigate a bit more and click through the first instance.



Now this is some cool stuff!

We can see that the JVM class generated for TimezoneTransitionInfo has three fields: metaClass, which points at the Ruby Class object; varTable, which is an array of Object references used for instance variables and other "internal" variables; and a flags field containing runtime flags for the object, like whether it's frozen, tainted, and so on. We can see that this object has no special flags set, and we can dig deeper into those fields if we like. We'll skip that today.

Moving further down, we see a few more amazing links. First, there's a list of all references to this object. Ahh, now we can start to investigate why they're staying in memory, even though we're not using them. We can even have jhat show us the full chains of references keeping these objects alive; a series of objects leading all the way back to one "rooted" by a thread or by global JVM state. And we can explore the other direction as well, walking all objects reachable from this one.

This is only a small part of what you can do with jmap and jhat, and they're so simple to use it feels almost criminal. But what if we want to inspect an application while it's running? Dumping heaps and analyzing them offline can tell you much of the story, but sometimes you just want to see the objects coming and going yourself. Let's move on to VisualVM.

VisualVM

VisualVM spawned out of the NetBeans profiling tools. One of the biggest complaints about the JVMs of old were that all the built-in tooling seemed to be designed for JVM engineers alone. Because Sun had the foresight to build and own their own IDE and related modules, it eventually became a natural fit to pull out the profiling tools for use by everyone. And so VisualVM was born.

On most systems with Java 6 installed, you should have a "jvisualvm" command. Let's run it now.



When you start up VisualVM, you're presented with a list of running JVMs, similar to using the 'jps' command. You can also connect to remote machines, browse offline heap and core dump files, and look through memory and CPU profiling snapshots from previous runs. Today, we'll just open up our running Rails app and see what we can see.



VisualVM connects to the running process and brings up a basic information pane with process information, JVM information, and so on. We're interested in monitoring heap usage, so let's move to the "Monitor" tab.



Already we're getting some useful information. This view shows CPU usage (currently zero, since it's an idle Rails app), Heap usage over time, and the number of JVM classes and threads that are active. We can trigger a full GC, if we'd like to tidy things up before we start poking around. But most importantly, we can do the jmap/jhat dance in one step, by clicking the Heap Dump button. Tantalizing, isn't it?



Initially, we see a basic summary of the heap: total size, number of classes and GC roots, and so on. We're looking for our friend TimezoneTransitionInfo, so let's look for it in the "Classes" pane.



Ahh, there it is, just a little ways down the list. The counts are as we expect, so let's double-click and dig a bit deeper.



Here we have a lot of the same information about object instances that we did with jhat, but presented in a much richer format. Almost everything is active; you can jump around the heap and do analysis that would take a lot of manual work very easily. Let's try another tool: the Retained Size calculator.

Because our JVM tools see all objects equally, the reported size for a Ruby object on the heap is only part of the story. There's also the variable table, the object's instance variables, and objects they reference to consider. Let's jump to a different object now, Gem::Version.

We don't want to have to scroll through the list of classes to find ruby.Gem.Version, so let's make use of the Object Query Language console. With the OQL console, you can write SQL-like queries to retrieve listings of objects in the heap. We'll search for all instances of ruby.Gem.Version.



The query runs and we get a listing of Gem::Version objects. Let's dig deeper and see how much retained memory each Version object is keeping alive.



Clicking on the "Compute Retained Sizes" link in the "Instances" pane prompts us with this dialog. We're tough...we can take it.



Reticulating splines...



So it looks like each of the Version objects take from 125 to 190 bytes for a total of 19400 bytes, most of which is from the variable table. What's in there?



Ahh...looks like there's a String and an Array. And of course we can poke around the heap ad infinatum, into and out of "native" JRuby and JVM classes, and truly get a complete picture of what our running applications look like. Now you're playing with power.

Your Turn

This is obviously only the tip of the iceberg. Tools like Eclipse Memory Analysis Tool include features for detecting leaks; VisualVM and NetBeans both allow you to turn on allocation tracing, to show where in your code all those objects are being created. There's tools for monitoring live GC behavior, and many of these tools even allow you to dig into a running heap and modify live objects. If you can dream it, there's a tool that can do it. And you get all that for free by using JRuby.

If you'd like to play with this, it all works with JRuby 1.5.1 but you won't get the nice JVM classes for Ruby classes. For that, you can pull and build JRuby master, download a 1.6.0.dev snapshot, or just wait for JRuby 1.6. And if you do play with these or other tools, I hope you'll let us know and blog about your experience!

In the future, I'll try to show some of the other tools plus some of the CPU profiling capabilities they bring to the table. For now, rest assured that if you're using JRuby, you really do have the best tools available to you.

Wednesday, June 16, 2010

My Short List of Key Missing JVM Features

I mused today on Twitter that there's just a few small things that the JVM/JDK need to become a truly awesome platform for all sorts of development. Since so many people asked for more details, I'm posting a quick list here. There's obviously other things, but these are the ones on my mind today.


Cold Performance

Current JVMs start up pretty fast, and there's changes coming in Hotspot in Java 7 that will make them even better. Usually this comes from combinations of pre-verifying bytecode (or providing verification hints), sharing class data across processes, and run-of-the-mill tweaks to make the loading and linking processes more efficient. But for many apps, this doesn't do anything to solve the biggest startup hit of all: cold execution performance. Because the JVM doesn't save off the jitted products of each run, it must start "cold" every time, running everything in the bytecode interpreter until it gets hot enough to compile. Even on JVMs that don't have an interpreter, the initial cost of compiling everything to not-particularly-optimized assembly also causes a major startup hit (try running command-line stuff on JRockit or J9).

There's a few things people have suggested, and they're all hard:
  • Tiered compilation - compile earlier using the fastest-possible, least-optimizing compiler, but decorate the compiled code with appropriate profiling logic to do a better job later. Hotspot in Java 7 may ship a tiered compiler, but there have been some resource setbacks that delayed its development.
  • Save off compilation or optimization artifacts - this is theoretically possible, but the deeper you go the harder it is to save it. Usually the in-memory results of optimization and compilation depend on the layout of things...in memory. Saving them to disk means you need to scrub out anything that might be different in a new process like memory addresses and class identities. But .NET can do this, though it largely *just* does static compilation. Happy medium?
  • Keep a JVM process running and toss it new work. We do this in JRuby with the Nailgun library, but it has some problems. First off, it can leave various aspects of the JVM in a dirty state, like system properties and memory footprint. Second, it can't kill off rogue threads that don't terminate, so they can collect over time. And third...it's not actually running at the console, so a lot of console things you'd do normally don't work.
This is probably the biggest unsolvable problem for JRuby right now, and the one we most often have to apologize for. JRuby is fast...at times, very fast...and getting faster every day. But not during the first 5 seconds, and so everyone gets the same bad impression.

Better Console/Terminal Support

There's endless blogs out there complaining about how the standard IO streams you get from the JVM are crippled in various ways. You can't select on them, for example, which is the source of a few unfixable bugs in JRuby. You can't pass them along to subprocesses, which is perhaps more a failing of the process-launching APIs in the JDK than standard IO itself. There's no direct terminal support in any of Java's APIs, so people end up yanking in libraries like jline just to support line editing. If the JDK shipped with some nice terminal and process APIs, a lot of the hassles developers have writing command-line tools in Java would melt away.

There's some light at the end of the tunnel. NIO2, scheduled to be part of Java 7, will bring better process launching APIs (with inherited standard IO streams, if you desire), a broader range of selectable channels, and much more. Hopefully it will be enough.

Fix the Busted APIs

JDBC is broken. Why? Because you have to register your driver in a global hard-referencing hash, and have to unregister it from the same classloader or it will leak. That means that if you're loading JDBC drivers from within a webapp or EE application, *your entire application remains in memory* because the driver references it and that map references the driver. This is the primary reason why most Java web application servers leak memory on undeploy, and it's another "unfixable" from JRuby's perspective.

Object serialization is broken. Why? Because it plays all sorts of tricks to get your classloader, reflectively access fields (if you're going to reflectively access them anyway, why not just break encapsulation if security allows it), and construct object instances without allowing you the opportunity to initialize them appropriately yourself. You have to provide no-arg constructors, have to un-final fields so they can be set up outside of construction, and heaven forbid you use default serialization: it's dead slow.

Reflection is too slow and there's no way around it. Not only do you end up calling through many extra levels of logic for reflective invocation, you have to box your argument lists, box your numerics, and wrap everything in exception-handling. And it doesn't have to be this way. The invokedynamic work brings along with it method handles, which are fast, direct pointers to methods. This should have been added long ago, but thankfully it's on the way in Java 7. Until then, projects like JRuby will have to continue eating the cost of reflection...or generate method handles by hand. We do both.

Regular expressions are broken. Why? Because simple alternations can blow the Java stack when fed especially large input. The current Sun-created regex implementation recurses for things like alternation, making it easy for it to fail to match on large input. The problem is so bad that we've actually switched regular expression engines in JRuby *four times*, including two implementations we wrote ourselves. Nobody can say we haven't bled for our users.

And there's numerous other examples. Some are relics of Java 1.0 that never got corrected (because old APIs don't die, they just get deprecated...or ignored). Some are relics of the idea that gigantic monolithic servers hosting dozens of apps (and leaking memory when they undeploy, or else contending for basic resources that separate processes would not) are a good idea, when in actuality running multiple JVMs that each only host one or a few apps works far better. Making a real effort to smooth these bad APIs would go a long way.

Better Support for Native Libraries and POSIX Features

As of Java 6, there's still no support for working with symlinks and only limited support for setting file permissions. Process launching is absolutely terrible. You can't select on all channels...only on sockets. If you want to use a native library, you have to write JNI code to do it, even though there are libraries like JNA and JFFI in the wild that do an outstanding job of dynamically loading and binding those libraries.

Missing the POSIXy features is basically inexcusable today. Most of the system-level APIs in the JDK are still based on a lowest common denominator somewhere near Windows 95, even though all modern operating systems provide at least a substantial subset of those APIs. NIO2 will bring many improvements, but it's almost certain that some parts of POSIX won't be exposed, either because there's not enough resources to spec out the Java APIs for them or because they end up being too system-specific.

As for loading native libraries...this is again something that should have been rolled into the JDK a long time ago. Many people will cry foul..."pure Java!" they'll shout...and I agree. But there are times when some functionality simply doesn't exist in a Java library, or doesn't scale well as an out-of-process call. For these times, you just have to use the native library...and the barrier to entry for doing that on the Java platform is just too high. Rolle JNA or JFFI or something similar into the JDK, so we grown-ups can choose when we want to use native code.

The Punchline

The punchline is that for most of these things, we've solved or worked around them in JRuby...at *great expense*. I'd go so far as to say we've done more to work around the JVM and the JDK's lackings than any other project. We've gone out of our way to improve startup by any means possible. We ship beautiful, solidly-performing libraries for binding native libs without writing a line of C code. We generate a large amount of code at compile and runtime to avoid using reflection as much. We maintain and ship native POSIX layers for a dozen platforms. We've (i.e. one of our champions, Marcin Mielzynski) implemented our own regular expression engine (i.e. a port of Oniguruma). We've pulled every trick in the book to get process launching to work nicely and behave like Ruby users expect. And so on and so forth.

But it's not sustainable. We can't continue to patch around the JVM and JDK forever, even if we've done a great job so far. Hopefully this will serve as a wake-up call for JVM and JDK implementers around the world: If you don't want the Java platform to be a server-only, large-app-only, long-running-only, headless-only world...it's time to fix these things. I'm standing by to help coordinate those efforts :)

Tuesday, June 01, 2010

Restful Services in Ruby using JRuby and Jersey

There's lots of ways to present RESTful web services these days, and REST has obviously become the new "it's IPC no it's not" hotness. And of course Rubyists have been helping to lead the way, building restfulness into just about everything they write. Rails itself is built around REST, with most controllers doubling as RESTful interfaces, and it even provides extra tools to help you transparently make RESTful calls from your application. If you're doing RESTful services for a typical Ruby application, Rails is the way to go (and even if you're not using Ruby in general...you should be considering JRuby + Rails).

In the Java world the options aren't quite as clear, but one API has at least attempted to standardize the idea of doing RESTful services on the Java platform: JSR-311, otherwise known as JAX-RS.

JAX-RS in theory makes it easy for you to simply mark up a piece of Java code with annotations and have it automatically be presented as a RESTful service. Of the available options, it may be the simplest, quickest way to get a Java-based service published and running.

So I figured I'd try to use it from JRuby, and when doing it in Ruby it's actually surprisingly clean, even compared to Ruby options.

The Service

I followed the Jersey Getting Started tutorial using Ruby for everything (and not using Maven in this case).

My version of their HelloWorldResource looks like this in Ruby:

require 'java'java_import 'javax.ws.rs.Path'
java_import 'javax.ws.rs.GET'
java_import 'javax.ws.rs.Produces'

java_package 'com.headius.demo.jersey'
java_annotation 'Path("/helloworld")'
class HelloWorld
java_annotation 'GET'
java_annotation 'Produces("text/plain")'
def cliched_message
"Hello World"
end
end
Notice that we're using the new features in JRuby 1.5 for producing "real" Java classes: java_package to specify a target package for the Java class, java_annotation to specify class and method annotations.

We compile it using jrubyc from JRuby 1.5 like this:
~/projects/jruby ➔ jrubyc -c ../jersey-archive-1.2/lib/jsr311-api-1.1.1.jar --javac restful_service.rb
Generating Java class HelloWorld to /Users/headius/projects/jruby/com/headius/demo/jersey/HelloWorld.java
javac -d /Users/headius/projects/jruby -cp /Users/headius/projects/jruby/lib/jruby.jar:../jersey-archive-1.2/lib/jsr311-api-1.1.1.jar /Users/headius/projects/jruby/com/headius/demo/jersey/HelloWorld.java
The new --java(c) flags in jrubyc examine your source for any classes, spitting out .java source for each one in turn. Along the way, if you have marked it up with signatures, annotations, imports, and so on, it will emit those into the Java source as well. If you specify --javac (as opposed to --java), it will also compile the resulting sources for you with jruby and your user-specified jars in the classpath, as I've done here.

The result looks like this to Java:
~/projects/jruby ➔ javap com.headius.demo.jersey.HelloWorld
Compiled from "HelloWorld.java"
public class com.headius.demo.jersey.HelloWorld extends org.jruby.RubyObject{
public static org.jruby.runtime.builtin.IRubyObject __allocate__(org.jruby.Ruby, org.jruby.RubyClass);
public com.headius.demo.jersey.HelloWorld();
public java.lang.Object cliched_message();
static {};
}
Under the covers, this class will load in the source of our restful_service.rb file and wire up all the Ruby and Java pieces so that both sides see HelloWorld as the in-memory representation of the Ruby HelloWorld class. Method calls are dispatched to the Ruby code, constructors dispatch to initialize, and so on. It's truly living in both worlds.

With the service in hand, we now need a server script to start it up.

The Server Script

Continuing with the tutorial, I've taken their simple Java-based server script and ported it directly to Ruby:
require 'java'
java_import com.sun.jersey.api.container.grizzly.GrizzlyWebContainerFactory

base_uri = "http://localhost:9998/"
init_params = {
"com.sun.jersey.config.property.packages" => "com.headius.demo.jersey"
}

puts "Starting grizzly"
thread_selector = GrizzlyWebContainerFactory.create(
base_uri, init_params.to_java)

puts <<EOS
Jersey app started with WADL available at #{base_uri}application.wadl
Try out #{base_uri}helloworld
Hit enter to stop it...
EOS

gets

thread_selector.stop_endpoint

exit(0)
It's somewhat cleaner and shorter, but it wasn't particularly large to begin with. At any rate, it shows how simple it is to launch a Grizzly server and how nice annotation-based APIs can be for auto-configuring our JAX-RS service.

The CLASSPATH

Ahh CLASSPATH. You are so maligned when all you hope to do is make it explicit where libraries are coming from. The world should learn from your successes and your failures.

There's five jars required for this Jersey example to run. I've tossed them into my CLASSPATH env var, but you're free to do it however you like
jersey-core-1.2.jar
jersey-server-1.2.jar
jsr311-api-1.1.1.jar
asm-3.1.jar
grizzly-servlet-webserver-1.9.9.jar
The first four are available in the jersey-archive download, and you can fetch the Grizzly jar from Maven or other places.

Testing It Out

The lovely bit about this is that it's essentially a four-step process to do the entire thing: write and compile the service, write the server script, set up CLASSPATH, and run the server. Here's the server output, finding my HelloWorld service right where it should:
~/projects/jruby ➔ jruby main.rb
Starting grizzly
Jersey app started with WADL available at http://localhost:9998/application.wadl
Try out http://localhost:9998/helloworld
Hit enter to stop it...
Jun 1, 2010 6:36:55 PM com.sun.jersey.api.core.PackagesResourceConfig init
INFO: Scanning for root resource and provider classes in the packages:
com.headius.demo.jersey
Jun 1, 2010 6:36:55 PM com.sun.jersey.api.core.ScanningResourceConfig logClasses
INFO: Root resource classes found:
class com.headius.demo.jersey.HelloWorld
Jun 1, 2010 6:36:55 PM com.sun.jersey.api.core.ScanningResourceConfig init
INFO: No provider classes found.
Jun 1, 2010 6:36:55 PM com.sun.jersey.server.impl.application.WebApplicationImpl initiate
INFO: Initiating Jersey application, version 'Jersey: 1.2 05/07/2010 02:04 PM'
And perhaps the most anti-climactic climax ever, curling our newly-deployed service:
~ ➔ curl http://localhost:9998/helloworld
Hello World
It works!

What We've Learned

This was a simple example, but we demonstrated several things:
  • JRuby's new jrubyc --java(c) support for generating "real" Java classes
  • Tagging a Ruby class with Java annotations, so it can be seen by a Java framework
  • Booting a Grizzly server from Ruby code
  • Implementing a JAX-RS service with Jersey in Ruby code
Do you have any examples of other nice annotation-based APIs we could test out with JRuby?

Monday, May 31, 2010

Kicking JRuby Performance Up a Notch

We've often talked about the benefit of running on the JVM, and while our performance numbers have been *good*, they've never been as incredibly awesome as a lot of people expected. After all, regardless of how well we performed when compared to other Ruby implementations, we weren't as fast as statically-typed JVM languages.

It's time for that to change.

I've started playing with performing more optimizations based on runtime information in JRuby. You may know that JRuby has always had a JIT (just-in-time compiler), which lazily compiled Ruby AST into JVM bytecode. What you may not know is that unlike most other JIT-based systems, we did not gather any information at runtime that might help the eventual compilation produces a better result. All we applied were the same static optimizations we could safely do for AOT compilation (ahead-of-time, like jrubyc), and ultimately the JIT mode just deferred compilation to reduce the startup cost of compiling everything before it runs.

This was a pragmatic decision; the JVM itself does a lot to boost performance, even with our very naïve compiler and our limited static optimizations. Because of that, our performance has been perfectly fine for most users; we haven't even made a concentrated effort to improve execution speed for almost 18 months, since the JRuby 1.1.6 release. We've spent a lot more time handling the issues users actually wanted us to work on: better Java integration, specific Ruby incompatibilities, memory reduction (and occasional leaks), peripheral libraries like jruby-rack and activerecord-jdbc, and general system stability. When we asked users what we should focus on, performance always came up as a "nice to have" but rarely as something people had a problem with. We performed very nicely.

These days, however, we're recognizing a few immutable truths:

  • You can never be fast enough, and if your performance doesn't steadily improve people will feel like you're moving backward.
  • People eventually *do* want Ruby code to run as fast as C or Java, and if we can make it happen we should.
  • JRuby users like to write in Ruby, obviously, so we should make an effort to allow writing more of JRuby in Ruby, which requires that we suffer no performance penalty in the process.
  • Working on performance can be a dreadful time sink, but succeeding in improving it can be a tremendous (albeit shallow) ego boost.
So along with other work for JRuby 1.6, I'm back in the compiler.

I'm just starting to play with this stuff, so take all my results as highly preliminary.

A JRuby Call Site Primer

At each location in Ruby code where a dynamic call happens, JRuby installs what's called a call site. Other VMs may simply refer to the location of the call as the "call site", but in JRuby, each call site has an implementation of org.jruby.runtime.CallSite associated with it. So given the following code:

def fib(a)
if a < 2
a
else
fib(a - 1) + fib(a - 2)
end
end

There are six call sites: the "<" call for "a < 2", the two "-" calls for "a - 1" and "a - 2", the two "fib" calls, and the "+" call for "fib(a - 1) + fib(a - 2)". The naïve approach to doing a dynamic call would be to query the object for the named method every time and then invoke it, like the dumbest possible reflection code might do. In JRuby (like most dynamic language VMs) we instead have a call site cache at each call site to blunt that lookup cost. And in implementation terms, that means each caching CallSite is an instance of org.jruby.runtime.callsite.CachingCallSite. Easy, right?

The simplest way to take advantage of runtime information is to use these caching call sites as hints for what method we've been calling at a given call site. Let's take the example of "+". The "+" method here is being called against a Fixnum object every time it's encountered, since our particular run of "fib" never overflows into Bignum and never works with Float objects. In JRuby, Fixnum is implemented with org.jruby.RubyFixnum, and the "+" method is bound to RubyFixnum.op_plus. Here's the implementation of op_plus in JRuby:
    @JRubyMethod(name = "+")
public IRubyObject op_plus(ThreadContext context, IRubyObject other) {
if (other instanceof RubyFixnum) {
return addFixnum(context, (RubyFixnum)other);
}
return addOther(context, other);
}

The details of the actual addition in addFixnum are left as an exercise for the reader.

Notice that we have an annotation on the method: @JRubyMethod(name = "+"). All the Ruby core classes implemented in JRuby are tagged with a similar annotation, where we can specify the minimum and maximum numbers of arguments, the Ruby-visible method name (e.g. "+" here, which is not a valid Java method name), and other details of how the method should be called. Notice also that the method takes two arguments: the current ThreadContext (thread-local Ruby-specific runtime structures), and the "other" argument being added.

When we build JRuby, we scan the JRuby codebase for JRubyMethod annotations and generate what we call invokers, one for every unique class+name combination in the core classes. These invokers are all instance of org.jruby.internal.runtime.methods.DynamicMethod, and they carry various informational details about the method as well as implementing various "call" signatures. The reason we generate an invoker class per method is simple: most JVMs will not inline through code used for many different code paths, so if we only had a single invoker for all methods in the system...nothing would ever inline.

So, putting it all together. At runtime, we have a CachingCallSite instance for every method call in the code. The first time we call "+" for example, we go out and fetch the method object and cache it in place for future calls. That cached method object is a subclass of DynamicMethod, which knows how to do the eventual invocation of RubyFixnum.op_plus and return the desired result. Simple!

First Steps Toward Runtime Optimization

The changes I'm working on locally will finally take advantage of the fact that after running for a while, we have a darn good idea what methods are being called. And if we know what methods are being called, those invocations no longer have to be dynamic, provided our assumptions hold (assumptions being things like "we're always calling against Fixnums" or "it's always the core implementation of "+"). Put differently: because JRuby defers its eventual compilation to JVM bytecode, we can potentially turn dynamic calls into static calls.

The process is actually very simple, and since I just started playing with it two days ago, I'll describe the steps I took.

Step One: Turn Dynamic Into Static

First, I needed the generated methods to bring along enough information for me to do a direct call. Specifically, I needed to know what target Java class they pointed at (RubyFixnum in our "+" example), what the Java name of the method was (op_plus), what arguments the method signature expected to receive (returning IRubyObject and receiving ThreadContext, IRubyObject), and whether the implementation was static (as are most module-based methods...but our "op_plus" is not). So I bundled this information into what I called a "NativeCall" data structure, populated on any invokers for which a native call might be possible.
    public static class NativeCall {
private final Class nativeTarget;
private final String nativeName;
private final Class nativeReturn;
private final Class[] nativeSignature;
private final boolean statik;
...

In the compiler, I added a similar bit of logic. When compiling a dynamic call, it now will look to see whether the call site has already cached some method. If so, it checks whether that method has a corresponding NativeCall data structure associated with it. If it does, then instead of emitting the dynamic call logic it would normally emit, it compiles a normal, direct JVM invocation. So where we used to have a call like this:
INVOKEVIRTUAL org/jruby/runtime/CallSite.call

We now have a call like this:
INVOKEVIRTUAL org/jruby/RubyFixnum.op_plus

This call to "+" is now essentially equivalent to writing the same code in Java against our RubyFixnum class.

This comprises the first step: get the compiler to recognize and "make static" dynamic calls we've seen before. My experimental code is able to do this for most core class methods right now, and it will not be difficult to extend it to any call.

(The astute VM implementer will notice I made no mention of inserting a guard or test before the direct call, in case we later need to actually call a different method. This is, for the moment, intentional. I'll come back to it).

Step Two: Reduce Some Fixnum Use

The second step I took was to allow some call paths to use primitive values rather than boxed RubyFixnum objects. RubyFixnum in JRuby always represents a 64-bit long value, but since our call path only supports invocation with IRubyObject, we generally have to construct a new RubyFixnum for all dynamic calls. These objects are very small and usually very short-lived, so they don't impact GC times much. But they do impact allocation rates; we still have to grab the memory space for every RubyFixnum, and so we're constantly chewing up memory bandwidth to do so.

There are various ways in which VMs can eliminate or reduce object allocations like this: stack allocation, value types, true fixnums, escape analysis, and more. But of these, only escape analysis is available on current JVMs, and it's a very fragile optimization: all paths that would consume an object must be completely inlined, or else the object can't be elided.

So to help reduce the burden on the JVM, we have a couple call paths that can receive a single long or double argument where it's possible for us to prove that we're passing a boxed long or double value (i.e. a RubyFixnum or RubyFloat). The second step I took was to make the compiler aware of several of these methods on RubyFixnum, such as the long version of op_plus:
    public IRubyObject op_plus(ThreadContext context, long other) {
return addFixnum(context, other);
}

Since we have not yet implemented more advanced means of proving a given object is always a Fixnum (like doing local type propagation or per-variable type profiling at runtime), the compiler currently can only see that we're making a call with a literal value like "100" or "12.5". As luck would have it, there's three such cases in the "fib" implementation above: one for the "<" call and two for the "-" calls. By combining the compiler's knowledge of which actual method is being called in each case and how to make a call without standing up a RubyFixnum object, our actual call in the bytecode gets simplified further:
INVOKEVIRTUAL org/jruby/RubyFixnum.op_minus (Lorg/jruby/runtime/ThreadContext;J)

(For the uninitiated, that "J" at the end of the signature means this call is passing a primitive long for the second argument to op_minus.)

We now have the potential to insert additional compiler smarts about how to make a call more efficiently. This is a simple case, of course, but consider the potential for another case: calling from Ruby to arbitrary Java code. Instead of encumbering that call with all our own multi-method dispatch logic *plus* the multiple layers of Java's reflection API, we can instead make the call *directly*, going straight from Ruby code to Java code with no intervening code. And with no intervening code, the JVM will be able to inline arbitrary Java code straight into Ruby code, optimize it as a whole. Are we getting excited yet?

Step Three: Steal a Micro-optimization

There's two pesky calls remaining in "fib": the recursive invocations of "fib" itself. Now the smart thing to do would be to proceed with adding logic to the compiler to be aware of calls from currently-jitting Ruby code to not-yet jitted Ruby code and handle that accordingly. For example, we might also force those methods to compile, and the methods they call to compile, and so on, allowing us to optimize from a "hot" method outward to potentially "colder" methods within some threshold distance. And of course, that's probably what we'll do; it's trivial to trigger any Ruby method in JRuby to JIT at any time, so adding this to the compiler will be a few minutes work. But since I've only been playing with this since Thursday, I figured I'd do an even smarter thing: cheat.

Instead of adding the additional compiler smarts, I instead threw in a dirt-simple single-optimization subset of those smarts: detect self-recursion and optimize that case alone. In JRuby terms, this meant simply seeing that the "fib" CachingCallSite objects pointed back to the same "fib" method we were currently compiling. Instead of plumbing them through the dynamic pipeline, we make them be direct recursive calls, similar to the core method calls to "<" and "-" and "+". So our fib calls turn into the following bytecode-level invocation:
INVOKESTATIC ruby/jit/fib_ruby_B52DB2843EB6D56226F26C559F5510210E9E330D.__file__

Where "B52DB28..." is the SHA1 hash of the "fib" method, and __file__ is the default entry point for jitted calls.

Everything Else I'm Not Telling You

As with any experiment, I'm bending compatibility a bit here to show how far we can move JRuby's "upper bound" on performance. So these optimizations currently include some caveats.

They currently damage Ruby backtraces, since by skipping the invokers we're no longer tracking Ruby-specific backtrace information (current method, file, line, etc) in a heap-based structure. This is one area where JRuby loses some performance, since the JVM is actually double-tracking both the Java backtrace information and the Ruby backtrace information. We will need to do a bit more work toward making the Java backtrace actually show the relevant Ruby information, or else generate our Ruby backtraces by mining the Java backtrace (and the existing "--fast" flag actually does this already, matching normal Ruby traces pretty well).

The changes also don't track the data necessary for managing "backref" and "lastline" data (the $~ and $_ pseudo-globals), which means that regular expression matches or line reads might have reduced functionality in the presence of these optimizations (though you might never notice; those variables are usually discouraged). This behavior can be restored by clever use of thread-local stacks (only pushing down the stack when in the presence of a method that might use those pseudo-globals), or by simply easing back the optimizations if those variables are likely to be used. Ideally we'll be able to use runtime profiling to make a smart decision, since we'll know whether a given call site has ever encountered a method that reads or writes $~ or $_.

Finally, I have not inserted the necessary guards before these direct invocations that would branch to doing a normal dynamic call. This was again a pragmatic decision to speed the progress of my experiment...but it also raises an interesting question. What if we knew, without a doubt, that our code had seen all the types and methods it would ever see. Wouldn't it be nice to say "optimize yourself to death" and know it's doing so? While I doubt we'll see a JRuby ship without some guard in place (ideally a simple "method ID" comparison, which I wouldn't expect to impact performance much), I think it's almost assured that we'll allow users to "opt in" to completely "statickifying" specific of code. And it's likely that if we started moving more of JRuby's implementation into Ruby code, we would take advantage of "fully" optimizing as well.

With all these caveats in mind, let's take a look at some numbers.

And Now, Numbers Meaningless to Anyone Not Using JRuby To Calculate Fibonacci Numbers

The "fib" method is dreadfully abused in benchmarking. It's largely a method call benchmark, since most calls spawn at least two more recursive calls and potentially several others if numeric operations are calls as well. In JRuby, it doubles as an allocation-rate or memory-bandwidth benchmark, since the bulk of our time is spent constructing Fixnum objects or updating per-call runtime data structures.

But since it's a simple result to show, I'm going to abuse it again. Please keep in mind these numbers are meaningless except in comparison to each other; JRuby is still cranking through a lot more objects than implementations with true Fixnums or value types, and the JVM is almost certainly over-optimizing portions of this benchmark. But of course, that's part of the point; we're making it possible for the JVM to accelerate and potentially eliminate unnecessary computation, and it's all becoming possible because we're using runtime data to improve runtime compilation.

I'll be using JRuby 1.6.dev (master plus my hacks) on OS X Java 6 (1.6.0_20) 64-bit Server VM on a MacBook Pro Core 2 Duo at 2.66GHz.

First, the basic JRuby "fib" numbers with full Ruby backtraces and dynamic calls:
  0.357000   0.000000   0.357000 (  0.290000)
0.176000 0.000000 0.176000 ( 0.176000)
0.174000 0.000000 0.174000 ( 0.174000)
0.175000 0.000000 0.175000 ( 0.175000)
0.174000 0.000000 0.174000 ( 0.174000)

Now, with backtraces calculated from Java backtraces, dynamic calls restructured slightly to be more inlinable (but still dynamic and via invokers), and limited use of primitive call paths. Essentially the fastest we can get in JRuby 1.5 without totally breaking compatibility (and with some minor breakages, in fact):
  0.383000   0.000000   0.383000 (  0.330000)
0.109000 0.000000 0.109000 ( 0.108000)
0.111000 0.000000 0.111000 ( 0.112000)
0.101000 0.000000 0.101000 ( 0.101000)
0.101000 0.000000 0.101000 ( 0.101000)

Now, using all the runtime optimizations described above. Keep in mind these numbers will probably drop a bit with guards in place (but they're still pretty fantastic):
  0.443000   0.000000   0.443000 (  0.376000)
0.035000 0.000000 0.035000 ( 0.035000)
0.036000 0.000000 0.036000 ( 0.036000)
0.036000 0.000000 0.036000 ( 0.036000)
0.036000 0.000000 0.036000 ( 0.036000)


And a couple comparisons of another mostly-recursive, largely-abused benchmark target: the tak function...

Static optimizations only, as fast as we can make it:
  1.750000   0.000000   1.750000 (  1.695000)
0.779000 0.000000 0.779000 ( 0.779000)
0.764000 0.000000 0.764000 ( 0.764000)
0.775000 0.000000 0.775000 ( 0.775000)
0.763000 0.000000 0.763000 ( 0.763000)

And with runtime optimizations:
  0.899000   0.000000   0.899000 (  0.832000)
0.331000 0.000000 0.331000 ( 0.331000)
0.332000 0.000000 0.332000 ( 0.332000)
0.329000 0.000000 0.329000 ( 0.329000)
0.331000 0.000000 0.331000 ( 0.332000)

So for these trivial benchmarks, a couple hours hacking in JRuby's compiler has produced numbers 2-3x faster than the fastest JRuby performance we've achieved thusfar. Understanding why requires one last section.

Hooray for the JVM

There's a missing trick here I haven't explained: the JVM is still doing most of the work for us.

In the plain old dynamic-calling, static-optimized case, the JVM is dutifully optimizing everything we throw at it. It's taking our Ruby calls, inlining the invokers and their eventual calls, inlining core class methods (yes, this means we've always been able to inline Ruby into Ruby or Java into Ruby or Ruby into Java, given the appropriate tweaks), and optimizing things very well. But here's the problem: the JVM's default settings are tuned for optimizing *Java*, not for optimizing Ruby. In Java, a call from one method to another has exactly one hop. In JRuby's dynamic calls, it may take three or four hops, bouncing through CallSites and DynamicMethods to eventually get to the target. Since Hotspot only inlines up to 9 levels of calls by default, you can see we're eating up that budget very quickly. Add to that the fact that Java to Java calls have no intervening code to eat up bytecode size thresholds, and we're essentially only getting a fraction of the optimization potential out of Hotspot that a "simpler" language like Java does.

Now consider the dynamically-optimized case. We've eliminated both the CallSite and the DynamicMethod from most calls, even if we have to insert a bit of guard code to do it. That means nine levels of Ruby calls can inline, or nine levels of logic from Ruby to core methods or Ruby to Java. We've now given Hotspot a much better picture of the system to optimize. We've also eliminated some of the background noise of doing a Ruby call, like updating rarely-used data structures. We'll need to ensure backtraces come out in some usable form, but at least we're not double-tracking them.

The bottom line here is that instead of doing all the much harder work of adding inlining to our own compiler, all we need to do is let the JVM do it, and we benefit from the years of work that have gone into current JVMs' optimizing compilers. The best way to solve a hard problem is to get someone else to solve it for you, and the JVM does an excellent job of solving this particular hard problem. Instead of giving the JVM a fish *or* teaching it how to fish, we're just giving it a map of the lake; it's already an expert fisherman.

There's also another side to these optimizations: our continued maintenance of an interpreter is starting to pay off. Other JVM languages that don't have an interpreted mode have a much more difficult time doing any runtime optimization; simply put, it's really hard to swap out bytecode you've already loaded unless you're explicitly abstracting how that bytecode gets generated in the first place. In JRuby, where methods have always had to switch from interpreted to jitted at runtime, we're can can hop back and forth much more freely, optimizing along the way.

That's all for now. Don't expect to download JRuby 1.6 in a few months and see everything be three times (or ten times! or 100 times!) faster. There's a lot of details we need to work out about when these optimizations will be safe, how users might be able to opt-in to "full" optimization if necessary, and whether we'll get things fast enough to start replacing core code with Ruby. But early results are very promising, and it's assured we'll ship some of this very soon.