Thursday, January 29, 2009

My Favorite Hotspot JVM Flags

I probably start up a JVM a thousand times a day: test runs, benchmark runs, bug confirmation, API exploration, or running actual apps. And in many of these runs, I use various JVM switches to tweak performance or investigate runtime metrics. Here's a short list of my favorite JVM switches. Note these are Hotspot/OpenJDK/SunJDK switches and may not work on other JVMs; Apple's JVM is based on Hotspot, so they should work there too.

The Basics

Most runs will want to tweak a few simple flags:

  • -server turns on the optimizing JIT along with a few other "server-class" settings. Generally you get the best performance out of this setting. The default VM is -client, unless you're on 64-bit (it only has -server).
  • -Xms and -Xmx set the minimum and maximum sizes for the heap. Hotspot caps the heap size, which is touted as a feature: once you figure out the maximum memory your app needs, you cap it there to keep rogue code from impacting other apps on the system. Use these flags like -Xmx512M, where M stands for megabytes; if you omit the suffix, you're specifying bytes (several flags use this format). You can also get a minor startup performance boost by setting the minimum higher, since the JVM doesn't have to grow the heap right away.
  • -Xshare:dump can help improve startup performance on some installations. When run as root (or whatever user owns the JVM installation), it dumps a shared-memory file to disk containing all of the core class data. This file is much faster to load than re-verifying and re-loading all the individual classes, and once in memory it's shared by all JVMs on the system. Note that -Xshare:off, -Xshare:on, and -Xshare:auto set whether "Class Data Sharing" is enabled; it's not available in the -server VM or on 64-bit systems. Mac users: you're already using Apple's version of this feature, upon which Hotspot's version is based.
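Putting the basics together, a typical invocation might look like the sketch below (MyApp and the heap sizes are placeholders, not recommendations):

```shell
# Optimizing JIT, 256MB initial heap, capped at 512MB
java -server -Xms256M -Xmx512M MyApp

# Regenerate the shared class-data archive; run as the user that
# owns the JVM installation, and only where CDS is available
# (the -client VM on 32-bit systems)
java -Xshare:dump
```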
There are also some basic flags for logging runtime information:
  • -verbose:gc logs garbage collector runs and how long they're taking. I generally use this as my first tool to investigate if GC is a bottleneck for a given application.
  • -Xprof turns on a low-impact sampling profiler. I've had Hotspot engineers recommend I "don't use this" but I still think it's a decent (albeit very blunt) tool for finding bottlenecks. Just don't use the results as anything more than a guide.
  • -Xrunhprof turns on a higher-impact instrumenting profiler. The default invocation with no extra parameters records object allocations and high-allocation sites, which is useful for finding excess object creation. -Xrunhprof:cpu=times instruments all Java code in the JVM and records the actual CPU time calls take. I generally only use this to profile JRuby internals because it's extremely slow, but it's also much more accurate than -Xprof.
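As a quick reference, the logging and profiling flags above are used like this (MyApp is a placeholder; hprof writes its report to java.hprof.txt in the working directory by default):

```shell
# Log each GC run and its duration to stdout
java -verbose:gc MyApp

# Low-impact sampling profile, printed when the JVM exits
java -Xprof MyApp

# Heavyweight allocation profile (hprof's default mode)
java -Xrunhprof MyApp

# Full CPU-time instrumentation: very slow, but much more accurate
java -Xrunhprof:cpu=times MyApp
```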
Deeper Magic

Eventually you may want to tweak deeper details of the JVM:
  • -XX:+UseParallelGC turns on the parallel young-generation garbage collector. This is a stop-the-world collector that uses several threads to reduce pause times. There's also -XX:+UseParallelOldGC to use a parallel collector for the old generation, but it's generally only useful if you often have large numbers of old objects getting collected.
  • -XX:+UseConcMarkSweepGC turns on the concurrent mark-sweep collector. This one runs most GC operations concurrently with your application's execution, reducing pauses significantly. It still stops the world briefly for its initial-mark and remark phases, but those pauses are usually much shorter than a full stop-the-world collection. This is useful if you need to reduce the impact GC has on an application and don't mind that overall throughput is a little lower than with the fully stop-the-world collectors. You'll also obviously need multiple processors to see the full effect. (Incidentally, if you're interested in GC tuning, you should look at Java SE 6 HotSpot Virtual Machine Garbage Collection Tuning. There's a lot more there.)
  • -XX:NewRatio=# sets the desired ratio of "new" to "old" generations in the heap. The defaults are 1:12 in the -client VM and 1:8 in the -server VM. You often want a higher ratio if you have a lot more transient data flowing through your application than long-lived data. For example, Ruby's high object churn often means a lower NewRatio (i.e. larger "new" versus "old") helps performance, since it prevents transient objects from getting promoted to old generations.
  • -XX:MaxPermSize=###M sets the maximum "permanent generation" size. Hotspot is unusual in that several types of data get stored in the "permanent generation", a separate area of the heap that is only rarely (or never) garbage-collected. The list of perm-gen hosted data is a little fuzzy, but it generally contains things like class metadata, bytecode, interned strings, and so on (and this certainly varies across Hotspot versions). Because this generation is rarely or never collected, you may need to increase its size (or turn on perm-gen sweeping with a couple other flags). In JRuby especially we generate a lot of adapter bytecode, which usually demands more perm gen space.
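For example, a couple of GC-tuning invocations might look like this. The sizes and ratios are purely illustrative, and the two CMS class-unloading flags in the second command are my best guess at the "couple other flags" for perm-gen sweeping on Hotspot of this vintage; verify them against your JVM version:

```shell
# Parallel young-gen collection with a relatively large young generation
java -server -Xmx1G -XX:+UseParallelGC -XX:NewRatio=4 MyApp

# Concurrent mark-sweep with a bigger perm gen, plus the flags that
# (on this era's Hotspot) enable collecting the perm gen under CMS
java -server -Xmx1G -XX:+UseConcMarkSweepGC \
     -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled \
     -XX:MaxPermSize=256M MyApp
```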
And there are a few more advanced logging and profiling options as well:
  • -XX:+PrintCompilation prints out the name of each Java method Hotspot decides to JIT compile. The list will usually show a bunch of core Java class methods initially, and then turn to methods in your application. In JRuby, it eventually starts to show Ruby methods as well.
  • -XX:+PrintGCDetails includes the data from -verbose:gc but also adds information about the size of the new generation and more accurate timings.
  • -XX:+TraceClassLoading and -XX:+TraceClassUnloading print information about class loads and unloads. Useful for investigating whether you have a class leak or whether old classes (like JITed Ruby methods in JRuby) are getting collected.
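These flags can be stacked in one run; a sketch (MyApp is a placeholder):

```shell
# Watch which methods Hotspot decides to JIT compile
java -XX:+PrintCompilation MyApp

# GC logging with generation sizes and more accurate timings
java -XX:+PrintGCDetails MyApp

# Follow class loading and unloading to hunt down a class leak
java -XX:+TraceClassLoading -XX:+TraceClassUnloading MyApp
```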
Into The Belly

Finally, here's a list of the deepest options we use to investigate performance. Some of these require a debug build of the JVM, which you can download from java.net.

Also, some of these may require you also pass -XX:+UnlockDiagnosticVMOptions to enable them.
  • -XX:MaxInlineSize=# sets the maximum size of method Hotspot will consider for inlining. By default it's set at 35 *bytes* of bytecode (i.e. pretty small). This is largely why Hotspot really likes lots of small methods; it can then decide the best way to inline them based on runtime profiling. You can bump it up, and sometimes that produces better performance, but at some point the compilation units get large enough that many of Hotspot's optimizations are skipped. Fun to play with, though.
  • -XX:CompileThreshold=# sets the number of method invocations before Hotspot will compile a method to native code. The -server VM defaults to 10000 and -client defaults to 1500. Large numbers allow Hotspot to gather more profile data and make better decisions about inlining and optimizations. Smaller numbers reduce "warm up" time.
  • -XX:+LogCompilation is like -XX:+PrintCompilation on steroids. It not only prints out methods that are being JITed, it also prints out why methods may be deoptimized (like if new code is loaded or a new call target is discovered) and information about which methods are being inlined. There's a caveat though: the output is seriously nasty XML without any real structure to it. I use a Sun-internal tool for rendering it in a nicer format, which I'm trying to get open-sourced. Hopefully that will happen soon. Note, this option requires -XX:+UnlockDiagnosticVMOptions.
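A sketch of how these fit together on the command line (the threshold and inline-size numbers are just experiments, not tuned values; MyApp is a placeholder):

```shell
# Experiment with inlining and compilation thresholds
java -server -XX:MaxInlineSize=70 -XX:CompileThreshold=5000 MyApp

# Inlining decisions and deoptimization events, dumped as (ugly) XML;
# LogCompilation needs the diagnostic options unlocked first
java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation MyApp
```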
And finally, my current absolute favorite option, which requires a debug build of the JVM:
  • -XX:+PrintOptoAssembly dumps to the console a log of all assembly being generated for JITed methods. The instructions are basically x86 assembly with a few Hotspot-specific instruction names that get replaced with hardware-specific instructions during the final assembly phase. In addition to the JITed assembly, this flag also shows how registers are being allocated, the probability of various branches being followed (along with multiple assembly blocks for the different paths), and information about calls back into the JVM. Outside the logging options for the final generated assembly (which requires a separate plugin) this is the best tool for discovering what optimizations are actually happening. I use this at least a couple times a week to investigate JRuby performance enhancements.
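Since the output is voluminous, I find it easiest to capture it to a file; a sketch (remember this only works on a debug build of the JVM):

```shell
# Dump the JIT-generated assembly, register allocation, and branch
# probabilities for offline reading (debug JVM build required)
java -server -XX:+PrintOptoAssembly MyApp > opto.log 2>&1
```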
And So Much More

Hotspot has literally hundreds of flags (and here's another list specific to Java 6), dozens of which might be useful to you. I may add a few more to this post as I remember them, but this list includes all those I use on a regular basis. If you're using JRuby, you can use the -J flag to pass any of these through to the JVM, as in -J-XX:+PrintCompilation.

What are some of your favorite Hotspot JVM flags?

Update: A few more that commenters added or reminded me of:
  • Markus Kohler commented on -XX:+HeapDumpOnOutOfMemoryError, useful if you have a slow-leaking application you can't pin down. It will dump heap information to disk whenever there's an OutOfMemoryError, allowing you to do offline analysis.
  • j6wbs mentioned that you can send SIGQUIT (or hit Ctrl+Backslash or Ctrl+Break in the console) to dump the current execution stack of all running threads. This is especially nice if you have a runaway app or if an app appears to have frozen.
  • karld offers up -XX:OnOutOfMemoryError="mail -s 'OOM on `hostname` at `date`' whoever@example.com <<< ''" as a way to send out email when there's an OutOfMemoryError. Poor-man's monitoring!
  • I also remembered a very important option for JRuby: -Xbootclasspath specifies classpath entries you want loaded without verification. The JVM verifies all classes it loads to ensure they don't try to dereference an object with an int, pop extra entries off the stack or push too many, and so on. This verification is part of the reason why the JVM is very stable, but it's also rather costly, and responsible for a large part of startup delay. Putting classes on the bootclasspath skips this cost, but should only be used when you know the classes have been verified many times before. In JRuby, this reduced startup time by half or more for a simple script. Use -Xbootclasspath/a: and -Xbootclasspath/p: to append and prepend to the default bootclasspath or -Xbootclasspath: to completely set your own.
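A sketch of the last two tricks in action (the pid and jar names are placeholders):

```shell
# Thread dump of a running JVM: SIGQUIT prints all thread stacks
# to the JVM's console without killing the process
kill -QUIT 1234

# Append a jar to the boot classpath so its classes skip verification
java -Xbootclasspath/a:myapp.jar MyApp

# Prepend instead, e.g. to override a core class (use with care)
java -Xbootclasspath/p:patched-core.jar MyApp
```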

25 comments:

Markus Kohler said...

-XX:+HeapDumpOnOutOfMemoryError

- dump a heap on an out of memory Error.
- no cost at runtime

Charles Oliver Nutter said...

Markus: Good one. And also related to that, the 'jmap' tool for remotely examining or dumping the heap, and the 'jhat' tool for analyzing that heap and serving up (via localhost http) a set of pages for browsing heap information.

And of course jconsole, which gives you a remote management console for the JVM with threading, memory, GC, and other management information and tools.

j6wbs said...

Not really a startup flag, more a shutdown flag, but killing the JVM with the SIGQUIT (-3) flag has debugged mysterious hanging JVMs quite a few times...
http://tinyurl.com/jvmsigquit

jez.

karld said...

I also like -XX:OnOutOfMemoryError="mail -s 'OOM on `hostname` at `date`' whoever@example.com <<< ''" if you don't have other fancy monitoring for this condition.

This page also rocks as a reference: http://blogs.sun.com/watt/resource/jvm-options-list.html

Taylor said...

Great post, except the entry for -XX:+UseConcMarkSweepGC should have read:

DO NOT USE.

Really. It's completely broken for a real app.

Charles Oliver Nutter said...

Taylor: Can you elaborate? I know several folks running with CMS in production without problems.

Markus Kohler said...

Agreed
-XX:+UseConcMarkSweepGC

works pretty well at least on the SAP JVM Hotspot:)

Behrang Saeedzadeh said...

My favorite one is: -XX:++ForGodsSakeStartupThisAppAsFastAsNativeApss!!!111!!

;-)

foot prints in the sand said...

Nice insight to good options.

-- Satish Patruni
http://mynotesday2day.googlepages.com

Bill Robertson said...

Have you talked to any of your colleagues about dtrace? Maybe you're not using Solaris by default, but it might be worth it.

Charles Oliver Nutter said...

Behrang: Actually there's a magic flag that's almost as good as that. Once you're using shared class data (-Xshare) if you have a large application that can run out of the bootstrap classloader, you can use -Xbootclasspath to load the app without costly classloader verification. In JRuby, this cut our startup time in half. I think I'll add it to the main article.

Bill: We'd love to use DTrace, but I think we really just need someone to sit down and show us how. Or perhaps a book/tutorial recommendation?

Taylor said...

@Charles,

We've seen that in production high load usage, ConcMarkSweepGC works fine for about two weeks as advertised (give or take considering load and actual use case), and then goes into a pathological stop-the-world GC that can last anywhere from 30 minutes to 2 hours depending on heap size.

We suspect this is due to accumulated heap fragmentation that ultimately results in the need for the stop the world collection to clean everything out.

Also although we cannot confirm it, anecdotally we seem to have run into more JVM kernel crashes with it on than off.

Bill Robertson said...

"Or perhaps a book/tutorial recommendation?"

You work for Sun right? I would suggest that you offer beers to the guys who wrote it for some of their time. ;-)

Behrang Saeedzadeh said...

Charles, I used to use -Xverify:none to skip the verification and I am not still satisfied with that. Does -Xbootclasspath reduce the startup time even more?

Behrang Saeedzadeh said...

@Taylor,

Does the 2 hour pause occur for a 4gig heap size or does it occur for larger (smaller?) heap sizes?

Stephen said...

@taylor .. Did you open a case on the cms behavior and the other crashes? Let me know.

James Abley said...

server versus client, GC algorithm used and heap sizes all vary by default depending on the class of machine that you're using as well.

See here for more details.

Peter Runge said...

The Tiered compiler is a pretty cool combination of both the client and server VMs. To turn it on, use -server and -XX:+TieredCompilation. This one's mostly useful for client apps.

You can tune the tier 2 compilation (server compiler) threshold independently of the client compiler by using -XX:Tier2CompileThreshold=[value]. This page recommends 35000.

Charles Oliver Nutter said...

Peter: Thanks for that, I had not tried tiered compilation yet myself. This is only in recent OpenJDK 7, yes?

William Louth said...

For performance investigations you might want to check out JXInsight's JRuby-to-Ruby profiling (multi-resource metering) solution.

http://williamlouth.wordpress.com/2008/10/14/cross-language-profiling-with-jxinsight-jruby-to-ruby/

Markus Kohler said...

Note that
there's
-noverify to disable verifying the bytecodes.
Always use this for Eclipse

jrose said...

Unlike PrintOptoAssembly, PrintAssembly works in product mode also. The catch is you need the hsdis.so plugin. Today you must build your own (from public sources) but someday we'll post prebuilt ones.

http://wikis.sun.com/display/HotSpotInternals/PrintAssembly

Peter Runge said...

Charles: Tiered compiler is in the later JDK6 (at least since u10) as well as OpenJDK7.

Anonymous said...

How to do data cache misses for a Java program? Any idea?
Thanks a lot

Johan said...

Hi! Late answer but there is an intersting article about DTrace and Java Profiling, at http://java.dzone.com/articles/java-profiling-dtrace

There is also some good documentation about DTrace here as well: http://wikis.sun.com/display/DTrace/Documentation

I found Dtrace very helpful when debugging JVM HotSpot in realtime. Thanks for a great article!