Sunday, March 29, 2009

On Benchmarking

Sigh. It must be that time of year again. Another partially-completed Ruby implementation has started to get overhyped because of early performance numbers.

MacRuby has been mentioned on this blog before. It's a reimplementation of Ruby 1.9 targeting the Objective-C runtime--and now, targeting LLVM for immediately compiling Ruby code to native code. Initial performance results running some of my benchmarks show an interesting mixed bag. For some, MacRuby's new "experimental" branch performs very well, in some cases a few times faster than JRuby. For others, performance is slow enough there must be something wrong. And there's a large number of my benchmarks that don't even run, due to broken features they'll be fixing over the next several months.

And yet, at least one Rubyist has already seen fit to declare MacRuby "the fastest Ruby implementation around". Really? When it's crashing for about half the scripts I ran and extremely slow for many others?

He bases this assertion on running the benchmarks MacRuby includes in its own repository. Because MacRuby usually performs much better on those benchmarks than Ruby 1.9, he has decided it's now "the fastest Ruby". Do we have to do this hype dance every year?

Look, I know I'm biased. I want JRuby to be the best Ruby implementation possible. I want it to be fast, and if possible, the fastest. I also want it to run existing Ruby applications and integrate well with Java libraries and applications and continue to be one of the best choices for running Ruby. So I can understand that it sounds like I'm throwing stones by pouring water on such a breathless proclamation as "fastest Ruby implementation around". But seriously guys...haven't we learned anything?

MacRuby's experimental branch is just that: experimental. Lots of stuff is fast, but lots of stuff is broken or slow. I'm sure the MacRuby guys are going to get everything resolved and working, and I'll admit these early results drive me to work on JRuby performance even harder. But I also know from experience that many of the missing features are exactly those that make Ruby performance a really difficult problem. That's why we've always focused on compatibility first (almost to a fault); it's really easy to paint yourself into a corner.

But this post isn't about MacRuby. They're doing awesome work, and I have no doubt at least some of the performance numbers will stick. This post is about the evils of benchmarking, especially prematurely.

Around this time last year, MagLev (Ruby based on the Gemstone VM) posted some crazy benchmarks and shocked the Ruby world at RailsConf. They had numbers even more stunning than MacRuby, running some simple numerical benchmarks orders of magnitude faster than either Ruby 1.8 or 1.9. Several Ruby bloggers immediately posted not just their enthusiasm, but their belief that MagLev had won the performance battle without ever firing a shot.

And I believe it was a great disservice to the MagLev team.

MagLev was, last spring, a very primitive and early implementation. It could run some useful Ruby code, but the majority of the core classes had not yet been implemented and very little work had been done on compatibility. Now we're approaching a year later, and MagLev is still in development, still closed source, still at a private alpha stage of life. Again, I'll admit I'm biased, so I need to state that I believe MagLev is also a really cool technology, at least as cool as MacRuby or JRuby. In many ways and for many domains both of them are going to be more compelling than JRuby, and I have no illusions that JRuby will never get leapfrogged in performance. But we need to remember a really important fact: these implementations are not done.

I could post blog entries with every experimental branch of JRuby I've ever tested. I could show you "fib" numbers 3-5 times faster than current JRuby and 10 times faster than Ruby 1.9. But honestly, what would be the point? I know it's experimental, I know we need to get there in a careful, measured way, and I know that my best experiments may never be reflected in real-world, real-application performance. And yet it seems like people just love to latch on to these early contenders, hyping them to death almost before they're out of the starting gate.
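For anyone who hasn't seen one, a "fib" microbenchmark is usually little more than the sketch below. This is an illustration, not the script behind any numbers mentioned here: the `time_it` helper and the input size are assumptions, and it presumes a Ruby recent enough to provide `Process.clock_gettime`.

```ruby
# Naive recursive Fibonacci: mostly method dispatch and small-integer math.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

# Hypothetical helper: runs the block and returns elapsed seconds,
# measured with a monotonic clock.
def time_it
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

elapsed = time_it { fib(25) }
puts format("fib(25) took %.3f s", elapsed)
```

Note how little of Ruby this touches: no blocks escaping their scope, no `eval`, no core class library to speak of. A fast result here says almost nothing about a real application.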

Listen, people: Ruby is hard to implement. Oh, it may look easy at a glance, and you can probably get 70, 80, or even 90% of the way pretty quickly. But there's some crazy stuff in that last 10% or 5% that totally blindsides you if you're not looking for it. An early Ruby implementation has not run that last mile of Ruby implementation, and it takes almost as much work to get there as it does to run the first 90%.

So let's try to be adults about this and give new implementations time to actually finish before we whip the community into a frenzy. Every time we go overboard in our declarations, we look like amateurs. And as certain as I am that MacRuby is going to be a major contender for the "fastest Ruby" crown, I think we'd be wise to hold judgment until it and other young Ruby implementations are actually finished.


knowtheory said...

Another good read, Headius.

One of the things missing from these discussions is an explicit description of 1) the degree to which benchmarks of partial implementations are representative of something meaningful and 2) the manner in which implementation completeness and benchmark speeds are related.

Particularly what i'd want to know is whether performance decreases linearly as completeness increases (your posts would lead me to believe it's not), and whether the 5-10% of completeness that makes life complicated is the same for each ruby implementation.

To some degree i'm not sure these questions have answers at the moment, given the fact that there are really only a handful of implementations that have made it to what we can call completion. Although perhaps that makes these questions all the more salient and relevant.

One can imagine it'd be possible to build a suite of tests that cover both speed and completeness, but i'm not sure what it would consist of. Stress-testing the ruby way i guess! ;)

McLovin said...

If this hype happens every year, then as a corollary won't it die down soon as well? This constant 'taking-offence-that-other-people-are-hyping-their-stuff' is getting a little old.

I agree that JRuby is an excellent interpreter and I'm having a lot of fun playing with it. But I think this constant bickering is doing JRuby a great disservice because it comes off as whining.

IMHO, the people who are truly secure in the knowledge that their software is the best in the world, do not have to go around calling out others' bullshit. It will get uncovered on its own.

Paul Brannan said...

I think the same logic can be applied to "unladen swallow" (a reimplementation of the Python eval loop on top of LLVM). Python is probably not as hard to implement as Ruby, but it's still nontrivial. I wish them the best, but it's not what I expect.

Herve said...

I think that no developers "in their right mind" will use anything other than MRI or JRuby for now, except for experimenting. All the other implementations are still trying to implement Ruby IMHO.

In that regard, benchmarking implementations which are far from being complete is maybe interesting, but their results can greatly change over time as they evolve.

James Moore said...

Can't agree with you, McLovin. This is the way things are made to die down - someone points out their flaws.

Hubris is definitely a flaw - people should have learned their lessons _decades_ ago about making performance claims for stuff that isn't finished. It's not like this is a big mystery.

Nimrod said...

Would you care to elaborate on exactly what makes Ruby so difficult to implement -- relative to, say, Smalltalk?

[With the exception of 'eval', perhaps]

Frank Wierzbicki said...

Paul Brannan: unladen swallow is a branch of CPython more than a re-implementation, as the main implementers are core CPython developers as well. They are pushing patches back to CPython already. Of course when they start seriously targeting LLVM, remove the GIL, etc all bets are off...

Dan said...

I always wondered if releasing early benchmarks actually *hurts* an implementation's ability to pick up new contributors. Benchmarks imply a level of maturity, and perhaps a higher barrier to entry before you learn enough to meaningfully contribute to the project. That could be a turn-off to someone who may be interested in helping but has never contributed to MRI, JRuby or Rubinius before.

If there's going to be hype, I think it would probably be better to outline what makes your approach different from the others, and why you think it *could* become faster. Be honest that you're just getting started, and I bet you'll have a lot more people pitching in because they want to learn as the project matures.

atog said...

Nice read indeed.

Greg said...

For those folks questioning Headius's objectiveness on this issue due to his relationship with JRuby, you're missing a vital point: he's right. This "Oooh, shiny!" mentality towards things that are new, not because they are better, but because they are new, has to stop.

It permeates the entire Ruby community, as people jump head first into new testing tools/techniques, web frameworks, templating languages, distributed databases, key value stores, runtimes, source control systems, hosting services, HTTP stacks, and so on. Hype-filled blog posts about alpha level software of course strap rocket boosters onto this already common "obsession with the new and shiny" phenomenon. I'd like to see more skepticism and less hype, and when someone writes a blog post about the Next Big Thing and Why It Matters, it'd be nice if they had some compelling, production scale evidence why this New Thing is so Great.

It'd also be refreshing every once in a while to see someone post something about how they've built a successful business on tools used in the ancient past of more than 6 months ago.

Matthew King said...

Headius, if you have the time, could you make a list of the RubySpec items that exercise this last ten percent?

Perhaps someone could then make a benchmark suite for the hard stuff.

Anonymous said...

Uh oh. The "MacRuby is 3x faster than everyone else" article just made it to slashdot.

Markus Kohler said...

Smalltalk is built on a very clean and small core set of ideas.
That makes it easier to optimize.
It also stayed away from too much native code integration, which is a limitation for scripting but makes optimizations easier.

shevegen said...

Perhaps we should all ask the people who come up with great benchmarks - AND THEN BE SILENT FOR A LONG TIME - to shut up until their work is so complete that indeed all of ruby really works.

What gain do we get from a +50% speed increase when only 70% of Ruby is supported?

Anonymous said...

Making a blanket claim like "fastest Ruby implementation" is indeed a bit foolish. However, from an emotional point of view, which Ruby implementer hasn't posted some benchmarks and said they want to be one of the fastest? It's very understandable. We should enjoy ourselves and be emotional when we're making software...

I see exactly the same pattern happening on this blog

1. Every time a new or different Ruby implementation posts its progress or benchmarks (Rubinius/IronRuby/YARV/MagLev/MacRuby)
2. Headius usually goes like

> Congratulations on ruby implementation XXX, But jruby is ...

I know Headius has the character of being very vocal and open about his opinion and i kinda like that. However, if your congrats comment could cut out the "But jruby is ... " part, it would be more welcoming and a lot easier for other implementers to digest.

Some quotes from:

The Joker: Why so serious?

Anonymous said...


Are you happy now Charles?

knowtheory said...

Wow, guys, wtf. What's with the bile?

This isn't about JRuby vs MacRuby, or any other cage match.

This post, more than anything, is about Bad Science. Antonio Cangiano's post draws false conclusions from incomplete data. You'll also note that the title of the post is "On Benchmarking".

People need to get over the idea that Ruby implementation is a zero sum game. Each implementation has its own unique feature set, while aspiring to be a fully qualified Ruby. The work that gets put into implementing a Ruby should be put to use enhancing ALL Rubies, since the main differentiation between the Rubies tends to be things that aren't core to being a Ruby.

This is particularly true since most of the apples to apples comparisons (i.e. same target Ruby, 1.8.6 or 1.9) are at least in the same ball park for speed (yes if JRuby is 2x faster than MRI, that's great, but it's not like an order of magnitude or anything).

The real point is that arguing about benchmarks adds just about nothing to meaningful discussions about the benefits or detriments of each particular Ruby implementation.

So first, quit being jerks. Second, if you want to talk about the implementations, let's talk about them on the merits rather than numbers that don't really correspond with anything in the real world (yet). If you want to do speed tests, let's see some non-trivial non-microbench code.

Yehuda Katz said...

>> Are you happy now Charles?

Based on the post above, I have to assume that the answer to that question would be a very strong "no".

Robert Dober said...

Well I guess that JRuby is the fastest implementation of Ruby, but that is not important; what is important is that it works.
I really have come to know Charles as someone very open, very tolerant and I would be very surprised if this blog was about defending JRuby. IMHO it was about defending Ruby. That said, he should indeed be proud of what they have accomplished. And when he points out why and how they did it that way, I believe it is to share, not to lecture.