Tuesday, June 22, 2010

A state-based explanation of the ConcurrentLinkedQueue algorithm by Michael and Scott

Drawing inspiration from Cliff Click's wait-free ConcurrentHashMap I decided to see how the same idea can be applied to other existing wait-free data structures and later on apply it to the construction of new ones. I started off from the 1.6 Java SDK and my first target was the relatively easy java.util.concurrent.ConcurrentLinkedQueue.

Before we begin
You should be aware of a few things to be able to easily follow the upcoming discussion. First off, I will assume that you know what lock-free, wait-free, non-blocking algorithms and data structures are. Knowledge of Java is not required but if you can read the source, so much the better. Also, a familiarity with the Java Memory Model and the concurrent programming culprits it tries to address will prove very useful but is not necessary. Finally, the facilities provided by j.u.c.atomic.* are supposed to be known, especially what a Compare-And-Swap is.

A few words about state-based reasoning
Traditional concurrent programming (i.e with locks, semaphores etc) can be tricky, especially if you want to do it right. But, designing data structures to be operated upon without any kind of synchronization and still be able to provide guarantees about liveness and correctness seems kind of magic. I begun studying this stuff when I came across Dr. Cliff Click's presentations on the subject of a wait-free ConcurrentHashMap. There he demonstrates how a wait-free implementation of a HashMap is possible to retain correctness and liveness under multiple concurrent read and write operations using only a minimum of memory barriers. The invariants of the algorithm are demonstrated using a state machine where it can be shown that every possible state is reachable and moreover valid, in the sense that it is clear from the combination of the current state and the operation a thread wants to carry out what that thread must do to both keep the structure valid and advance its operation. Of course, for all these nice things to happen, proper state and transitions must be defined and that's where it gets tricky.

Enter the ConcurrentLinkedQueue
A single linked queue is simplicity itself. A synchronized queue is no more complicated, even if you strip the lock in two parts, one for the head and one for the tail, to provide truly concurrent offer() and poll() call combinations. A wait-free queue on the other hand is not so straightforward, since in the traditional implementations there is no apparent way to keep two concurrent offer()s (or poll()s) to step on each other's toes, damaging the structure. In 1996, Maged M. Michael and Michael L. Scott from University of Rochester showed how it is possible to augment the state of the queue to include enough information to allow threads to not only complete their operations but also to help other threads towards completing theirs as well.

Let's talk about state
So, what magic did they do? Well, in a traditional queue implementation we have single linked nodes, where each has a value field
that is a pointer to the stored item and next,
a pointer to the next node in the queue,
NULL
for the tail. Actually, only two elements comprise the state of our structure: the head node and the tail node. The tail and head fields of the structure are just a convenience, existing for O(1) operations and are atomically dependent on the first and last node, that is the offer()
operation updates atomically the last node and the tail field (similarly for poll() and the head field). So, in a synchronized implementation, the operation in figure 1 is linearizable. There is no possibility to see the state in figure 2, because of synchronization.


Before solving this, let us define some notation to help us. I will refer to state as "<element1, element2, ..., elementN>", where each element is a valid value for a template of the state, notated the same way but defined as such. For example, the template for the simple queue state is <*
head, *tail>, with the initial value of <NULL, NULL>, after offer(A)ing <&A, &A>, after offer(B)ing <&A, &B> etc.

Lack of locking however, can lead to states that cannot be described by this model, hence are invalid. We have to include information that can make the state in figure 2 valid but of course, with the trade off of more code to deal with the situation. The state template that can solve our problems is
<*head, *tail, tail->next>, an augmentation of the previous state by the element that could not be previously described: the actual tail node. Since synchronization blocks are out of reach now, the only possible atomic operations are CAS on AtomicReference fields and this is the only way to transition the structure from one state to another and that means one field at a time, each one with a possible failure. So, from each state, we have to make sure that both successful and failed CAS on any state element will lead to another valid state.

The edges
The CAS operations will be on AtomicReference fields, namely
head, tail and tail->next. Both successful and failed operations are informative so we can decide what action to take in either case. Suppose the state we start at is <&A, &B, NULL> and the operation is offer(C). First we CAS(tail->next, NULL, &C). On failure we know we must be in state <&K,&L,&M>, that is we have no idea what has happened while we were preempted and we have to try again. On success we reached state <&N, &B, &C>. Note that in both cases, tail and tail->next should be rearranged to be concise, i.e. to have tail actually point to the last node. So, we simply demand that every thread, when faced with the state <&K, &L, &M> to do a CAS(tail->next, &L, &M). Success means the tail pointer has advanced a position, failure means that someone else got there first. Either way, if every thread respects that convention, we are assured that eventually the tail pointer will be at most one position behind the actual last node. The last remaining operation is poll(), which is a single CAS on the head field which in success returns the value, on failure simply retries. Putting all this together yields figure 3.



This diagram is not a flow chart for each thread. Instead it shows the possible states and transitions from them. The equivalence edges are simple shortcuts that denote states that are indistinguishable. Every thread should check at which state it finds the structure and either fix it or do its operation. Even if it is preempted between check and action, it will end up again in a recognizable state where it can retry its operation.

Final words
Note that this diagram is not complete, in that it does not check for an empty queue. This was left aside on purpose, since it is not that interesting a case and falls more in the realm of engineering. The main purpose of this presentation was to show what lies underneath the description of the M&S algorithm and at the same time demonstrate the applicability of Dr. Click's mode of thinking. The SDK implementation is almost a literal implementation of the Michael and Scott algorithm and I urge you to enjoy it.