Wednesday, December 11, 2013

Software sympathy, or how I learned to stop worrying and love the Garbage Collector

Mechanical sympathy is a term originating from Formula 1 where it was used to describe a driving style that takes into consideration the mechanical properties of the engine and the car as a whole, leading to better performance of the vehicle-driver combo. Martin Thompson has taken this term and applied it in software engineering, demonstrating how understanding the architecture of the various components of a computing machine and the way they interact can lead to writing vastly more efficient code.
Starting from this idea, I wondered how it could be applied in a purely software world: writing code that takes advantage of the way other software works. Being a Java programmer, I found the answer quite obvious - look at how the JVM performs a task on behalf of your code and structure things so the two operate in harmony.

Which component to focus on, though? My choice ended up being the Garbage Collector, and in particular the Garbage First (G1) implementation available since Java 7. Specifically, I wanted to understand the performance impact of setting references in Java objects, an effect associated with the write barrier and the way it is implemented in G1.


The Write Barrier you say?

The write barrier is a piece of code executed by the garbage collector whenever a reference to an object is set. It is part of the bookkeeping done by the collector and allows the garbage in the heap to be traced and collected when the time comes. The implementation and use are of course specific to each garbage collector; here we'll look at the case of G1.
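To make the idea concrete, here is a toy model of a card-marking barrier. This is a sketch of the concept, not HotSpot code - the JVM emits the equivalent machine instructions inline after every reference store, and Java code cannot observe real object addresses, so addresses are modeled as plain longs:

// Toy model of a card-marking post-write barrier; a sketch of the concept,
// not actual HotSpot code. HotSpot keeps one byte per 512-byte "card" of
// heap; a reference store dirties the card covering the updated object, so
// the collector later knows which small slices of the heap to re-examine.
final class CardTableModel {
    static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards
    static final byte DIRTY = 0;     // HotSpot marks dirty cards with 0
    final byte[] cards;

    CardTableModel(long heapBytes) {
        cards = new byte[(int) (heapBytes >>> CARD_SHIFT)];
    }

    // What conceptually runs right after `holder.field = target;`
    void postWriteBarrier(long holderAddress) {
        cards[(int) (holderAddress >>> CARD_SHIFT)] = DIRTY;
    }
}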

Garbage First

Roughly, G1 works by slicing the heap into equal-sized pieces, or regions, and satisfies requests for memory from one region at a time, moving through regions as they fill up. When heap memory gets low, G1 mostly-concurrently collects the regions that are mostly filled with garbage, until it reaches a target size for available heap (or a time limit is reached, allowing for soft real-time behaviour). The regions are useful because, since each one is collected as a whole, it is not necessary to keep information about objects pointing to other objects in the same region - the scan during the mark phase will go through everything anyway. This in turn means that the write barrier can be a no-op when the reference being set points to an object in the same region as the reference holder, while setting cross-region references will cost somewhat extra.
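The same-region test itself is cheap. Conceptually (again a sketch, not the actual VM code), the barrier's filter boils down to a single XOR and shift:

// Sketch of G1's cross-region filter, not actual HotSpot code. XOR-ing the
// two addresses and shifting away the offset-within-region bits leaves zero
// exactly when both objects sit in the same region, in which case the
// barrier has nothing to record.
final class G1BarrierModel {
    static boolean crossesRegions(long holderAddr, long targetAddr, int logRegionSize) {
        return ((holderAddr ^ targetAddr) >>> logRegionSize) != 0;
    }
}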

The above is something we can test for. We can devise an experiment to benchmark the difference between setting intra-region and cross-region references. If the difference turns out to be significant, we can prefer, when writing code, to allocate objects that are expected to point to each other close together, buying better throughput at the cost of structuring our allocation strategy accordingly.


Moving on to measuring things

In this case the benchmark is quite simple - we need to perform both same-region and cross-region reference setting and see how they compare. To do that we'll need to lay out our objects in memory in a predictable way, so that the only difference between the comparative runs is the relative position of the referenced objects. As it turns out, that is not hard to do.
If we allocate a fixed-size heap (by setting -Xms and -Xmx to the same value) and we know the size of the Java objects we'll be allocating, then, given the fixed number of regions created, we can calculate how many objects fit in each region and in turn figure out which object needs to point to which in order to create intra- or cross-region references.

A region's size is the average heap size ((min size + max size) / 2) divided by 2048. That means that for a heap of 1 GB, each region is 2^30 / 2^11 = 2^19 bytes big. If Ballast objects are 32 (2^5) bytes each, we require 2^19 / 2^5 = 16384 objects of that class to fill a region.
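The same arithmetic, spelled out in code for the heap settings used below:

// The region arithmetic above, for -Xms1024m -Xmx1024m:
public class RegionMath {
    public static void main(String[] args) {
        long avgHeapSize = ((1L << 30) + (1L << 30)) / 2; // (min + max) / 2 = 1 GB
        long regionSize = avgHeapSize / 2048;             // 2^30 / 2^11 = 2^19 bytes
        long ballastSize = 1L << 5;                       // 32 bytes per Ballast
        long objectsPerRegion = regionSize / ballastSize; // 2^19 / 2^5 = 16384
        System.out.println(regionSize + " bytes per region, "
                + objectsPerRegion + " objects per region");
    }
}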

We'll do two runs, both allocating 32768 Ballast objects. One run will set references from the first quarter to the second quarter and from the third quarter to the fourth (and back, for each couple), keeping both members of every couple within a single region. The other run will set references from the first half to the second half (and back), so every couple straddles the region boundary. We expect the first run to have much higher throughput, even though the number of allocations and reference sets is exactly the same. All of this is much better explained if you read the code.

A note about the Ballast objects: besides the reference they hold, they also contain a long field. The reason for that is twofold - one is padding to get the object to an even 32 bytes, and the other is the plot twist at the end of this post. A sketch of the whole setup follows.
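Since the layout is the crux of the experiment, here is a minimal sketch of the setup as described above. Class, field, and method names are mine, not necessarily the original source's, and it relies on the assumption that a single thread's consecutive allocations land consecutively inside a region (bump-pointer allocation):

// Sketch of the benchmark's object layout, reconstructed from the
// description above; illustrative, not the original source.
class Ballast {
    long payload;      // pads the instance to an even 32 bytes (and see below)
    Ballast reference; // the field whose assignment triggers the write barrier
}

class Layout {
    static final int OBJECTS_PER_REGION = 16384; // 2^19-byte region / 2^5-byte object

    // Allocate two regions' worth of objects back to back.
    static Ballast[] allocate() {
        Ballast[] all = new Ballast[2 * OBJECTS_PER_REGION];
        for (int i = 0; i < all.length; i++) {
            all[i] = new Ballast();
        }
        return all;
    }

    // Same-region run: couple the 1st quarter with the 2nd and the 3rd
    // with the 4th, so both members of a couple share a region.
    static void wireSameRegion(Ballast[] all) {
        int q = all.length / 4;
        for (int i = 0; i < q; i++) {
            all[i].reference = all[i + q];
            all[i + q].reference = all[i];
            all[i + 2 * q].reference = all[i + 3 * q];
            all[i + 3 * q].reference = all[i + 2 * q];
        }
    }

    // Cross-region run: couple the first half with the second half, so
    // every couple straddles the region boundary.
    static void wireCrossRegion(Ballast[] all) {
        int h = all.length / 2;
        for (int i = 0; i < h; i++) {
            all[i].reference = all[i + h];
            all[i + h].reference = all[i];
        }
    }
}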

That's it, really. On my computer - a MacBook Pro with a 2.3 GHz Intel Core i7 CPU and 8 GB of RAM - running the above code on an Oracle HotSpot JVM, version 7u45, with the command line options

-Xms1024m -Xmx1024m -XX:+UseG1GC

I get the following results:


Setting same region references took 80640ms

Setting cross region references took 84140ms

The time is how long it took to perform 5000 repetitions of allocating 2 regions' worth of objects and setting the references accordingly.

Results

What this shows is that there is little difference between the two methods of setting references. Surprising, no? On to the G1 source code, then, to try and understand why. As it turns out, the cost of marking each card as dirty is there, but it is actually pretty small - a put in a queue when the card is dirtied, with the queue processed later, during collection. Both operations cost relatively little, and that cost is also split between reference-setting time and collection time. Alternatively, one might say that the cost is dominated by the write barrier's check rather than by the actions it triggers.

An extra step and a twist

The write barrier cost - I wonder how much that is. What should we compare it against? Well, the smallest comparable cost I could think of was that of setting an integer or a long. Since we already have a long field in our Ballast objects, we can use that. So we'll alter the test: instead of setting references in two different ways, one run will set the reference to a fixed, known object and the other will set the long field to a given value. The new source code is here.
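In sketch form, reusing the Ballast sketch from above (again a reconstruction from the description, not the original source):

// Sketch of the altered benchmark loops.
class FieldWriteBench {
    // Plain long store: no write barrier is emitted for primitive fields.
    static void setLongs(Ballast[] all, long value) {
        for (Ballast b : all) b.payload = value;
    }

    // Reference store: the write barrier runs on every assignment.
    static void setReferences(Ballast[] all, Ballast target) {
        for (Ballast b : all) b.reference = target;
    }
}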

On the same setup as the previous experiment (changing the GC implementation each time), I get the following numbers:

G1 (-XX:+UseG1GC)

Setting the long value took 79601ms
Setting the reference took 90197ms


ParNew (no arguments)

Setting the long value took 85628ms 
Setting the reference took 104894ms

CMS (-XX:+UseConcMarkSweepGC)

Setting the long value took 91407ms
Setting the reference took 108954ms

The difference in cost lies strictly with the write barrier, which is of course not triggered when storing a long. An interesting note here is that if you make the long field and the reference field in Ballast volatile, both actions take about the same time, at roughly twice the time of setting the long value shown above, demonstrating the cost of volatile variables.
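For reference, the volatile variant amounts to nothing more than this; on x86, HotSpot follows each volatile store with a full fence, and that fence is what dominates here:

// The volatile variant of Ballast: each store now carries memory-ordering
// guarantees, and the fence emitted for them costs more than either the
// plain store or the write barrier.
class VolatileBallast {
    volatile long payload;
    volatile VolatileBallast reference;
}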

Closing remarks and future work

While the first result is a negative one when it comes to the driving assumption, I still wanted to discuss it, for two main reasons. One is its educational content, demonstrating in a hands-on, high-level way how G1 operates. The other is that negative results are still results: it is noteworthy that we can ignore the placement of objects when it comes to single-threaded programs. This last qualifier is quite important, and it is going to drive the next piece of work I'll undertake on this track. In particular, how does card marking affect multithreaded programs, and how does G1's no-op barrier for same-region assignment fare under such conditions?
As for the long vs reference setting comparison, its explanation is quite simple, but a way of taking direct advantage of the fact is not obvious. If, however, you are familiar with off-heap memory management, you will immediately see one reason why performance in such scenarios is substantially better - staying outside the garbage collector's control not only improves collection times, it also removes the bookkeeping overhead during normal program execution. But that is a discussion for another time.
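Just for a taste, here is a minimal sketch using a direct ByteBuffer, one of the standard ways to keep data off-heap:

import java.nio.ByteBuffer;

// Writes to a direct buffer are plain memory stores into memory the
// collector never scans or moves, so no write barrier is involved.
public class OffHeapSketch {
    public static void main(String[] args) {
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1 << 20); // 1 MB off-heap
        offHeap.putLong(0, 42L); // no card marking here
        System.out.println(offHeap.getLong(0));
    }
}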



A parting word

As you can see, all source code is available, as well as the setup I used. If you think that information is not enough to recreate the results I got, please say so and I'll improve the article. And if you think I got the results wrong, say so and I'll review the methodology. There is no reason why results such as these should not be peer reviewed.


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.