Wednesday, October 5, 2011

JavaOne 2011: Every Last Nanosecond: Tuning the JVM for Extreme Low Latency

Sunny Chan's presentation "Every nanosecond counts" (20262) was offered in Hilton Golden Gate 6/7/8 beginning at 1 pm on Wednesday. This was another very well attended session: it was standing-room only (with attendees sitting on the floor in the aisles) from the beginning and throughout his presentation (not all presentations that start well attended end well attended). Chan stated that "Java is extensively used within Goldman Sachs" (his employer) and that he works with Java SE and Java EE. He previously worked on IBM JVM development.

Chan stated that the challenges Goldman Sachs has faced using Java include achieving low latency and high throughput. He said he would be talking about "reducing of start-up latency" and "improving performance by tuning" (memory access and large pages support). Chan also pitched the session at a more advanced level than simply running profiling tools.

Chan said that many developers say things like "I want it really fast" when what they really want is "high throughput." He talked about "low latency" meaning "reducing the time between receiving the information and response to the information" (time to receive, time to process, and time to send data). He defined "high throughput" as focusing on "the amount of the data you can process in a given amount of time." Chan had a bullet that noted that high throughput does "not necessarily mean reducing the latency." Chan cited some causes of increased latency as garbage collection and inefficient code or use of resources and listed some others as well. Developers need to ask themselves if they care more about request/response times or amount of data transferred. Chan's experience is that most developers really want higher throughput.
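To make the distinction concrete, here is a minimal sketch of my own (not from Chan's slides). The class name, method names, and the numbers in main are all hypothetical; the point is only that a batching system can post higher throughput while every individual message waits longer.

```java
import java.util.Arrays;

// Hypothetical helper (not from the talk) that contrasts the two metrics.
final class LatencyVsThroughput {

    /** Throughput: items completed per second over a wall-clock window. */
    static double throughputPerSecond(long itemsProcessed, long windowNanos) {
        return itemsProcessed / (windowNanos / 1_000_000_000.0);
    }

    /** Latency percentile (nearest-rank method) over per-request durations. */
    static long percentileNanos(long[] latenciesNanos, double percentile) {
        long[] sorted = latenciesNanos.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long oneSecond = 1_000_000_000L;

        // Assumed system A: handles messages one at a time, 100 us each.
        long[] latenciesA = new long[10_000];
        Arrays.fill(latenciesA, 100_000L);

        // Assumed system B: waits to fill batches, so each message takes 5 ms,
        // but twice as many messages finish in the same second.
        long[] latenciesB = new long[20_000];
        Arrays.fill(latenciesB, 5_000_000L);

        System.out.println("A: " + throughputPerSecond(latenciesA.length, oneSecond)
                + " msg/s, p99 latency " + percentileNanos(latenciesA, 99) + " ns");
        System.out.println("B: " + throughputPerSecond(latenciesB.length, oneSecond)
                + " msg/s, p99 latency " + percentileNanos(latenciesB, 99) + " ns");
    }
}
```

In this made-up scenario, B wins on throughput and loses badly on latency, which is the trade-off behind Chan's bullet that high throughput does "not necessarily mean reducing the latency."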

Chan talked about ways to tune Java to reduce latency. These include understanding your application by using profiling tools such as hprof and VisualVM. He also recommends knowing your operating system (including the operating system characteristics and tools) and your hardware (including interaction of memory, CPUs, and caches) as a beginning point. Chan also reminded attendees to benchmark their own software rather than base decisions on others' benchmarks of other products.

In his slide "Start up latency," Chan stated that "JVM is a complex piece of software" and he talked about the relatively lengthy start-up time to get it up and running. This leads to a condition he describes in another bullet: "Component start-up could cause significant latency." He then talked about "biased locking." The Java SE 6 Performance Whitepaper describes biased locking:

Biased Locking is a class of optimizations that improves uncontended synchronization performance by eliminating atomic operations associated with the Java language’s synchronization primitives. These optimizations rely on the property that not only are most monitors uncontended, they are locked by at most one thread during their lifetime.

An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations can be performed by that thread without using atomic operations resulting in much better performance, particularly on multiprocessor machines.

See the Java Tuning White Paper for an example of using biased locking. See also related command-line options. Chan mentioned an under-documented command-line option for HotSpot called BiasedLockingStartupDelay (one of a set of options recommended for running high performance server applications).
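As a rough illustration (my own sketch, not from Chan's slides), the following is the kind of uncontended synchronization that biased locking targets: one thread repeatedly entering the same monitor. The class and method names are hypothetical. The flags in the comment are the HotSpot options as I understand them for the JDK 6/7 era; newer JDKs have deprecated and removed biased locking, so they may be ignored or rejected there.

```java
// Hypothetical example of a monitor that is only ever locked by one thread.
// Assumed way to run on a HotSpot JVM that still supports biased locking:
//   java -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 UncontendedCounter
final class UncontendedCounter {
    private long value;

    // Each call acquires and releases this object's monitor.
    synchronized void increment() {
        value++;
    }

    synchronized long get() {
        return value;
    }

    /** Increments a fresh counter from the calling thread only. */
    static long runSingleThreaded(int iterations) {
        UncontendedCounter counter = new UncontendedCounter();
        for (int i = 0; i < iterations; i++) {
            counter.increment();
        }
        return counter.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long result = runSingleThreaded(50_000_000);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("count=" + result + " in " + elapsedMillis + " ms");
    }
}
```

This is not a rigorous benchmark: the JIT compiler may coarsen or eliminate the lock entirely, so any timing difference between runs with and without the start-up delay should be treated as a hint rather than a measurement.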

Chan moved on to a discussion of the JIT compiler. He talked about using -Xcomp to "force JIT to compile everything" in HotSpot rather than its normal behavior of "initially Java bytecodes are interpreted by the JVM." He also referenced Ahead-of-Time (AOT) compilation in some JVMs. He referenced the HotSpot options -XX:+PrintCompilation and -XX:+PrintOptoAssembly for seeing what the JVM is doing. Chan explained that hsdis, the HotSpot disassembler plugin, must be compiled from OpenJDK's source code to view the generated assembly (PrintCompilation itself works without it). Chan showed how printing the goings-on of the JIT compiler can help identify latency issues.
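A small way to see this interpret-then-compile behavior for yourself is sketched below (my own example, with hypothetical class and method names). It times the same work over several rounds; on a typical HotSpot run the early rounds tend to be slower because the method starts out interpreted. Running it with -XX:+PrintCompilation should show when sumOfSquares gets compiled, and -Xint or -Xcomp can be used for comparison.

```java
// Hypothetical warm-up demo. Suggested runs:
//   java JitWarmup
//   java -XX:+PrintCompilation JitWarmup
//   java -Xint JitWarmup      (interpreter only)
//   java -Xcomp JitWarmup     (compile methods on first invocation)
final class JitWarmup {

    /** Simple hot method: sum of i*i for i in 1..n. */
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    public static void main(String[] args) {
        long sink = 0; // accumulated so the calls cannot be discarded as dead code
        for (int round = 1; round <= 10; round++) {
            long start = System.nanoTime();
            for (int call = 0; call < 2_000; call++) {
                sink += sumOfSquares(1_000);
            }
            long elapsedMicros = (System.nanoTime() - start) / 1_000;
            System.out.println("round " + round + ": " + elapsedMicros + " us");
        }
        System.out.println("checksum " + sink);
    }
}
```

The per-round timings vary by machine and JVM version, so treat them as a qualitative picture of start-up latency rather than a benchmark.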

Chan recommended that "most of the time you should leave tuning to the JVM developers!" and enjoy "Java abstracting the programmer from the hardware." He would later have this as part of his conclusion as well ("You can trust them").

Chan then moved on to the Linux "standard profiling tools" OProfile and perf. He stated that some of the commercial tools are even nicer.

Chan talked about the "performance benefit of large pages," a fairly common theme that I've heard in multiple presentations and keynotes at JavaOne 2011. Chan described this concept as "changing the page size to 2M [from 4k on x86], so the number of translations is reduced." This leads to faster memory access. Chan stated that applications with large heaps are likely to benefit from large page sizes and that benefits of large page sizes depend on hardware being used.
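The arithmetic behind "the number of translations is reduced" can be sketched as follows. This is my own back-of-the-envelope example, not Chan's; the class name and the 4 GiB heap size are assumptions chosen for illustration. On HotSpot, large pages are typically requested with -XX:+UseLargePages, and the operating system must be configured to provide them.

```java
// Hypothetical calculation: how many pages are needed to map a heap.
final class LargePageMath {

    /** Number of pages needed to cover heapBytes, rounding up. */
    static long pagesNeeded(long heapBytes, long pageBytes) {
        return (heapBytes + pageBytes - 1) / pageBytes;
    }

    public static void main(String[] args) {
        long heapBytes = 4L << 30;   // assumed 4 GiB heap
        long smallPage = 4L << 10;   // 4 KiB, the usual x86 default
        long largePage = 2L << 20;   // 2 MiB large page

        long smallCount = pagesNeeded(heapBytes, smallPage);
        long largeCount = pagesNeeded(heapBytes, largePage);

        System.out.println("4 KiB pages: " + smallCount);
        System.out.println("2 MiB pages: " + largeCount);
        System.out.println("reduction factor: " + (smallCount / largeCount));
    }
}
```

Fewer pages means fewer distinct address translations for the processor to cache, which is why large heaps are the most likely to benefit.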

Chan ended with a slide with several references. These references include What Every Programmer Should Know About Memory, OProfile results with JIT samples, Perf JIT Interface, and Java and Large Page Support.

I really enjoyed Sunny Chan's presentation, but he covered so much material that I'd like to see it again. His slides are good (and he said he plans to make them available), but he added a lot of value in his verbal descriptions, so the talk would be best experienced with both his slides and his narration. Unfortunately, I don't think it was recorded, but hopefully I'm just not aware of a recording.


Dustin said...

The PDF version of this presentation is now available online.

Ashwin Jayaprakash said...

I should've downloaded those slides. All the JavaOne 2011 slides are gone.