Wednesday, December 11, 2013

Software sympathy, or how I learned to stop worrying and love the Garbage Collector

Mechanical sympathy is a term originating in Formula 1, where it describes a driving style that takes into account the mechanical properties of the engine and the car as a whole, leading to better performance of the driver-vehicle combination. Martin Thompson has taken this term and applied it to software engineering, demonstrating how understanding the architecture of the various components of a computing machine and the way they interact can lead to writing vastly more efficient code.
Embarking from this idea, I wondered how it could be applied in a purely software world: how to write code that takes advantage of the way other software works. Being a Java programmer, the answer is actually quite obvious - look at how the JVM performs a task on behalf of your code and optimize things so that they operate in symphony.

Which component to choose, though? My focus ended up being on the Garbage Collector, and in particular on the Garbage First (G1) implementation that is available with Java 7. Specifically, I wanted to understand what impact setting references in Java objects has on performance, an effect associated with the write barrier and the way it is implemented in G1.

 

The Write Barrier you say?

The write barrier is a piece of code executed by the garbage collector whenever a reference to an object is set. It is part of the bookkeeping done by the collector and allows the garbage in the heap to be traced and collected when the time comes. The implementation and its use are of course specific to each garbage collector, and here we'll look at the case of G1.

Garbage First

Roughly, G1 works by slicing the heap into equal-sized pieces, or regions, and satisfies requests for memory from one region at a time, moving through regions as they fill up. When heap memory gets low, G1 mostly-concurrently collects regions that are mostly filled with garbage, until it reaches a target size for available heap (or a time limit is reached, allowing for soft real-time behaviour). The regions are useful because, since each one is collected as a whole, it is not necessary to keep information about objects pointing to other objects in the same region - the scan during the mark phase will go through everything anyway. This in turn means that the write barrier can be left as a no-op when setting a reference to an object in the same region as the reference holder, while setting cross-region references will cost somewhat extra.
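To make that shortcut concrete, here is a minimal illustrative sketch of the filter at the front of G1's post-write barrier. This is not the real HotSpot code (the JIT emits the barrier as inline assembly) and the names and constant are approximations; the region size matches the 1GB heap configuration used further down:

    // Illustrative only: the same-region filter of G1's post-write barrier.
    static final int LOG_REGION_SIZE = 19; // log2 of the region size (512KB regions)

    // fieldAddr is the address of the reference field being written,
    // newValueAddr the address of the object being stored into it.
    static boolean needsTracking(long fieldAddr, long newValueAddr) {
        // For a same-region store the xor leaves only low-order bits set,
        // so the shift yields zero and the barrier becomes a no-op.
        return ((fieldAddr ^ newValueAddr) >>> LOG_REGION_SIZE) != 0
                && newValueAddr != 0;   // storing null needs no tracking either
    }

Only when needsTracking holds does the barrier go on to do any actual bookkeeping (card marking, which we'll meet below).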

The above is something we can test for. We can devise an experiment to benchmark the difference between setting intra-region and cross-region references. If the difference turns out to be significant, we can be aware of it and prefer, when writing code, to allocate together objects that are expected to point to each other, giving us much better throughput at the cost of properly structuring our allocation strategy.


Moving on to measuring things

In this case the benchmark is quite simple - we need to do both same-region and cross-region reference setting and see how they compare. To do that we'll need to lay out our objects in memory in a predictable way, so that we know the only difference between our comparative runs is the relative position of the referenced objects. As it turns out, that is not that hard to do.
If we allocate a fixed-size heap (by setting -Xms and -Xmx to the same value) and we know the size of the Java objects we'll be allocating, then, given the fixed number of regions created, we can calculate how many objects fit in each region and in turn figure out which object needs to point to which in order to create cross-region or intra-region references.

A region's size is the average heap size ((min size + max size) divided by 2) divided by 2048. That means that for a heap of 1GB, each region is 2^30 / 2^11 = 2^19 bytes (512KB) big. If Ballast objects are 32 (2^5) bytes, that means we need 2^19 / 2^5 = 16384 objects of that class to fill a region.
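A minimal sketch of that arithmetic, using the numbers from the text (the 2048 target region count and the 32-byte Ballast size are the assumptions stated above):

    // Back-of-the-envelope region math for a fixed 1GB heap (-Xms1024m -Xmx1024m).
    public class RegionMath {
        public static void main(String[] args) {
            long heapBytes = 1024L * 1024 * 1024;               // 2^30
            long regionBytes = heapBytes / 2048;                 // 2^19 = 512KB per region
            long ballastBytes = 32;                              // assumed Ballast object size
            long objectsPerRegion = regionBytes / ballastBytes;  // 2^14 = 16384
            System.out.println("region size: " + regionBytes + " bytes");
            System.out.println("Ballast objects per region: " + objectsPerRegion);
        }
    }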

We'll do two runs. Both will allocate 32768 Ballast objects. One run will set references from the first quarter to the second quarter and from the third quarter to the fourth quarter (and back, for each pair). The other run will set references from the first half to the second half (and back). We expect the first run to have much higher throughput even though the number of allocations and reference sets will be exactly the same. All is much better explained if you read the code.
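In case the linked code isn't at hand, here is a rough reconstruction of what the two runs do; the Ballast class and all names here are mine, not the original's:

    // Hypothetical reconstruction of the benchmark's data layout and wiring step.
    public class ReferenceWiringSketch {

        // Matches the 32-byte estimate above: object header plus a long plus a reference.
        static final class Ballast {
            long value;    // the long field discussed in the note below
            Ballast ref;   // the reference we keep re-pointing
        }

        // objects[] holds 32768 Ballast instances allocated back to back, so indices
        // 0..16383 should land in one region and 16384..32767 in the next.
        static void wireSameRegion(Ballast[] objects) {
            int quarter = objects.length / 4;
            for (int i = 0; i < quarter; i++) {
                objects[i].ref = objects[i + quarter];                    // 1st quarter -> 2nd
                objects[i + quarter].ref = objects[i];                    // and back
                objects[i + 2 * quarter].ref = objects[i + 3 * quarter];  // 3rd quarter -> 4th
                objects[i + 3 * quarter].ref = objects[i + 2 * quarter];  // and back
            }
        }

        static void wireCrossRegion(Ballast[] objects) {
            int half = objects.length / 2;
            for (int i = 0; i < half; i++) {
                objects[i].ref = objects[i + half];   // 1st half -> 2nd half (other region)
                objects[i + half].ref = objects[i];   // and back
            }
        }
    }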

A note about the ballast objects: the Ballast objects, which are used to hold the references, also contain a long field. The reason for that is twofold - one is padding to get the object to an even 32 bytes, the other is the plot twist at the end of the blog post.

That's it really. On my computer, a MacBook Pro with a 2.3GHz Intel Core i7 CPU and 8GB of RAM, running the above code on an Oracle HotSpot JVM version 7u45 with the command line options

-Xms1024m -Xmx1024m -XX:+UseG1GC

I get the following results:


Setting same region references took 80640ms

Setting cross region references took 84140ms

The time reported is how long it took to do 5000 repetitions of allocating two regions' worth of objects and setting the references as described.

Results

What this shows is that there is little difference between the two methods of setting references. Surprising, no? On to the source code then, to try and understand why that happens. As it turns out, the cost of marking each card as dirty is there, but it is actually pretty small - it costs a put into a queue when the card is dirtied, and that queue is processed later, while doing the collection. Both operations cost relatively little, and this cost is also split between reference-set time and collection time. Alternatively, one might say that the cost is dominated by the write barrier check itself rather than the operations that follow from it.
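As a hedged illustration of that bookkeeping (names and constants are approximations, not the actual HotSpot implementation), the card marking and queueing described above amounts to something like this:

    // Illustrative model of card dirtying for a cross-region reference store.
    import java.util.ArrayDeque;

    final class CardMarkingSketch {
        static final int LOG_CARD_SIZE = 9;        // HotSpot uses 512-byte cards
        static final byte CLEAN = 0, DIRTY = 1;
        final byte[] cardTable;                    // one byte per 512-byte chunk of heap
        final ArrayDeque<Long> dirtyCardQueue = new ArrayDeque<>();

        CardMarkingSketch(long heapBytes) {
            cardTable = new byte[(int) (heapBytes >>> LOG_CARD_SIZE)];
        }

        // Called from the write barrier when a cross-region reference is stored.
        // fieldAddr is treated here as an offset from the heap base.
        void dirtyCardAndEnqueue(long fieldAddr) {
            int card = (int) (fieldAddr >>> LOG_CARD_SIZE);
            if (cardTable[card] != DIRTY) {
                cardTable[card] = DIRTY;            // cheap: a byte store...
                dirtyCardQueue.add((long) card);    // ...plus a put into a queue that is
                                                    // drained later, during collection
            }
        }
    }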

An extra step and a twist

The write barrier cost - I wonder how much that is. What should we compare it against? Well, the minimal cost I could think of was that of setting an integer or a long. Since we already have a long field in our Ballast objects, we can use that. So we'll alter the test: instead of setting references in two different ways, one variant will set the reference to a fixed, known object and the other will set the long field to a given value. The new source code is here.
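Again as a reconstruction rather than the original code (reusing the hypothetical Ballast class from the earlier sketch), the two variants of the new test boil down to:

    // One run exercises plain long stores, the other reference stores.
    static void setLongs(Ballast[] objects, long value) {
        for (Ballast b : objects) {
            b.value = value;      // plain long store: no write barrier involved
        }
    }

    static void setReferences(Ballast[] objects, Ballast target) {
        for (Ballast b : objects) {
            b.ref = target;       // reference store: the write barrier runs every time
        }
    }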

On the same setup as the previous experiment (changing the GC implementation every time), I get the following numbers:

G1 (-XX:+UseG1GC)

Setting the long value took 79601ms
Setting the reference took 90197ms


ParNew (no arguments)

Setting the long value took 85628ms 
Setting the reference took 104894ms

CMS (-XX:+UseConcMarkSweepGC)

Setting the long value took 91407ms
Setting the reference took 108954ms

The difference in cost lies strictly with the write barrier, which of course is not triggered when storing a long. An interesting note here: if you make both the long field and the reference field in Ballast volatile, both actions take about the same time - roughly twice the time of setting the long value shown above - demonstrating the cost of volatile variables.
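For reference, the volatile variant mentioned here only changes the field declarations (again a reconstruction, not the original code):

    // Hypothetical volatile variant of the Ballast fields; every store now also pays
    // for the ordering guarantees that volatile demands.
    static final class VolatileBallast {
        volatile long value;
        volatile VolatileBallast ref;
    }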

Closing remarks and future work

While the first result is a negative one as far as the driving assumption goes, I still wanted to discuss it, for two main reasons. One is the educational content, demonstrating in a hands-on, high-level way how G1 operates. The other is that negative results are still results, and it's noteworthy that we can ignore the placement of objects when it comes to single-threaded programs. This last part is quite important and it's going to be the next piece of work I'll undertake in this track: in particular, how does the card marking affect multithreaded programs, and how does the no-op barrier for same-region assignment in G1 fare under such conditions?
As for the long vs reference setting comparison, its explanation is quite easy, but a way of taking direct advantage of the fact is not obvious. If, however, you are familiar with off-heap memory management, you will immediately see a reason why performance in such scenarios is substantially better - being outside the control of the garbage collector not only leads to collection improvements, it also removes the bookkeeping overhead during program runtime. But that is a discussion for another time.



A parting word

As you can see, all source code is available, as well as the setup I used. If you think that information is not enough to recreate the results I got, please say so and I'll improve the article. And if you think I got the results wrong, say so and I'll review the methodology. There is no reason why results such as the above should not be peer reviewed.


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

2 comments:

  1. I like the experimental approach (so keep at it), but I think the model still differs from reality in enough places that more refinement is needed before you can draw comparative conclusions about the G1 cross-region write barrier tracking costs.

    Here are some specifics:

    1. The write barrier in G1 serves three different purposes, only 1 of which is specific to cross-region tracking:

    1.1 As in all generational collectors, a cross-generation (as opposed to cross-region) remembered set is updated by the write barrier. In HotSpot this is done by means of card marking, and is not a no-op on any of the collectors involved in the tests above.

    1.2 As with *some* mostly-concurrent multi-pass markers, the write barrier is used to track mutations during the GC mark phase, so that marking remains safe. The main cost of this is usually borne only during the actual mark phase of the collector.

    1.3 G1 (as do some other incremental stop-the-world compactors) also uses the write barrier to track cross-regional references in the oldgen. [This is the cost you are trying to focus on in your measurements.]

    One of the problems with the current experiment is that your reference setting is being done on newly allocated objects. Since the cross-region reference information between newgen objects is useless information for G1, it can be discarded early and its cost can [at least in theory] be dramatically reduced compared to reference updates in oldgen objects (that also point to oldgen objects).

    Another problem is that G1 uses multiple tiers for tracking cross-region references. Without getting into too much detail, you want to make sure that each region has references tracked into it from at least 5 other regions if you want to see the full cost occur (I'd be conservative and go with at least 20).

    And last, G1 has a minimum region size of 1MB, so in your model right now, only half the reference stores that you think are cross-regional actually would be.

    To remedy this, I'd recommend allocating your ballast objects in two multi-MB sets (as in 32MB each), separated by enough useless allocation (several GB) to guarantee that the sets are separately promoted into the oldgen and therefore likely to reside in separate regions. Then [with no allocation pressure or GC activity] time the work of setting references between the two sets, making sure to stripe enough such that each region will have many references to all other regions in the other set (not just to 1 or 2). This will at least ensure that cross-region references are being updated in a way that would need to be accounted for.

    It's harder to make the measurement happen during the oldgen mark phase, which makes for a potentially much more expensive write barrier for what could be many seconds at a time. Other than forcing back-to-back oldgen GCs somehow, I'm not sure how you'd make that happen.

    Replies
    1. Thank you for this high-quality feedback. I don't have much to say in defense of my thesis, except that I will have to do more experiments and try to measure the things you mention. An additional thing that interests me is multithreaded access and the effect of cache line sharing through the cards, for which your comments will prove valuable.
