Wednesday, October 20, 2010

Neo4j Internals: Transactions (Part 1) - Write Ahead Log and Deadlock Detection

It took some time to get this post live, but transactions are a tricky thing to implement. There are many things to discuss, so to keep the size of the posts manageable, I will break it down into parts. In this post I will write a bit about two different components that can be explained somewhat in isolation and upon which higher level components are built. The first is the Write Ahead Log (WAL) and the other is an implementation of a Wait-For graph that is used to detect deadlocks in Neo before they happen.
As always, some assumptions about your grasp of certain concepts are in place. You should already be familiar with my previous posts on the workings of Neo, and as a result you know about Java I/O and the NIO package, so I will not write about that - the source should be obvious to you. Also, you know what transactions are (and that they are abbreviated to txs), what they are used for and the guarantees a transactional system provides. You can probably get away without knowing a lot about JTA/JTS, but if these acronyms are new to you, read up on them a little - it will help. Finally, since you know about txs, you should have at least some idea of what a Write Ahead Log is. Oh, and since we will discuss deadlock detection, it is implied that you have used threads (preferably in Java) and you know what deadlocks are.
So, are you set? Open up the source and enjoy.

So, where is that WAL again?

In tx systems, atomicity and durability (from the ACID guarantees) are provided through the use of a WAL. All modifications during a tx are persisted on disk as they are requested (but are not performed on the actual store). Only after the tx is committed are the tx-encapsulated modifications performed upon the db store, and the tx is removed from its temporary storage place. This ensures that even if the system crashes during a tx commit, by reading the place where the txs are stored we can recreate them and replay them, making sure that their operations are performed on the store. So there you have it - atomicity, i.e. all of it happens or none of it happens, and durability, since after I call commit() I can be certain that my changes will be applied.
Before going any further, a bit of nomenclature. A tx goes through states in response to user requests. When you ask for beginTx(), a tx is created (and bound to the current thread, but more on that in the next post) and placed in the TX_STARTED state. Similarly, there are states for prepared txs (again, in the next post), committed txs, rolled back txs and so on. Each tx holds the modifications the user has requested during it, and in Neo parlance these are called Commands. For every operation you perform on the graph db there is a corresponding Command, such as a node creation command, a property deletion command etc. All these together, the phases and the commands, define what a tx is.
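To make the nomenclature concrete, here is a minimal sketch of a tx as a state plus a list of commands. All names here (TxSketch, TxState, the Command interface) are illustrative stand-ins, not the actual Neo classes:

```java
import java.util.ArrayList;
import java.util.List;

public class TxSketch {
    // The lifecycle phases mentioned above; the real set is richer.
    enum TxState { TX_STARTED, PREPARED, COMMITTED, ROLLED_BACK, DONE }

    // Stand-in for Neo's Command: one graph modification, e.g. node creation.
    interface Command { void execute(); }

    static class Tx {
        TxState state = TxState.TX_STARTED;   // new txs start here
        final List<Command> commands = new ArrayList<>();
        void addCommand(Command c) { commands.add(c); }
    }

    public static void main(String[] args) {
        Tx tx = new Tx();
        tx.addCommand(() -> System.out.println("create node"));
        tx.state = TxState.COMMITTED;
        System.out.println(tx.state + " with " + tx.commands.size() + " command(s)");
    }
}
```

The phases and the accumulated commands together are what gets written to the WAL, as described next.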
In Neo, the WAL is implemented by XaLogicalLog. This class manages a set of files that act as an intermediate store for all the commands and lifecycle changes of a tx. It has an accompanying class, LogEntry, that helps define and abstract the form of the records XaLogicalLog uses to store the information it needs. In there, entries are defined for all the phases a tx goes through, plus a LogEntry.Command that stores the commands a tx is comprised of. Every time the txManager (again, next post) decides that it is time for a running tx to change phase (from started to rolled back, for example) or when the tx realizes a command has been added (e.g., the user asked for createNode()), the XaLogicalLog is informed and a proper entry is written to its file.

Yeah, but which file?

The WAL stores its data in 3 basic files, in the same directory where the primitive store keeps its own (in other words, the directory where you asked the db to be stored). The first is a marker file (the .active file), keeping track of which of the other two is the current active log file. It stores one character (4 bytes: the first two are the char, the other two 0), with a value of 'C', '1' or '2'. 'C' stands for clean, meaning that there is currently no active log file; the other two mean that nioneo_logical.log.1 or nioneo_logical.log.2 is active, respectively. Only one of these two is the real log at any time; the other shouldn't exist (a condition temporarily violated during log rotation - see below). The currently active log file is written to and read from using either a heap or a memory mapped buffer, based on configuration (the "use_memory_mapped_buffers" setting - the default is memory mapped).
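The 4-byte marker layout described above (a 2-byte char followed by two zero bytes) can be decoded with a few lines of NIO. This is a sketch of the format as described, not the actual Neo code:

```java
import java.nio.ByteBuffer;

public class ActiveMarker {
    // Decode the 4 raw bytes of the .active file into 'C', '1' or '2'.
    static char decode(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw);
        return buf.getChar();  // first two bytes hold the char (big-endian); last two are 0
    }

    public static void main(String[] args) {
        byte[] clean = { 0, 'C', 0, 0 };  // a "clean" marker: no active log file
        System.out.println("active marker: " + decode(clean));
    }
}
```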
There are also backup versioned files, if the configuration is set to maintain backups (the "keep_logical_logs" setting). There is the notion of the current logical log version, an integer incremented on every rotation and persisted in the neostore file by the NeoStore persistence store (recall that this class maintains 3 records in total - this is one of them). Backup files are kept with a filename of nioneo_logical.log.v<version>, where <version> is the version of that log file. Backup files are created if the configuration says so, and in that case they are always used instead of deleting any file. So, from now on, when I write that a file is removed, you are to understand that if the configuration says so it is renamed with a versioned filename, otherwise it is deleted. OK?

So, let's store some txs

As I mentioned above, LogEntry defines the entries that XaLogicalLog writes to the logical log. Each entry has an identifier field, an integer that stands as the identifier (duh!) for a tx. Identifiers are assigned by XaLogicalLog whenever a tx is created, an event marked by the creation of a LogEntry.Start and its persistence on disk. XaLogicalLog.start(Xid) has all the details. Note that the generated identifier is returned to the caller so that the tx can be identified. This identifier is also stored in a Map from integer to LogEntry.Start kept by XaLogicalLog (in the field xidIdentMap), indicating the currently active txs that are managed.
A useful observation here is that all write operations on the log are appends. This means that after a Start entry is written, all info about that tx will be available after that file offset. This makes an optimization possible: given a tx's identifier, if we store the offset of its Start entry in the xidIdentMap, we don't have to scan the log file from the beginning to find info about that tx. Instead, we can seek to the stored offset and read from there. This can (and does) save some time.
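The offset optimization can be sketched as a map from tx identifier to the byte position of its Start entry. Names here (StartOffsetIndex, StartRecord) are illustrative, not the actual source:

```java
import java.util.HashMap;
import java.util.Map;

public class StartOffsetIndex {
    static class StartRecord {
        final long position;  // byte offset of the Start entry in the log file
        StartRecord(long position) { this.position = position; }
    }

    // Plays the role of xidIdentMap: identifier -> Start entry (with its offset).
    private final Map<Integer, StartRecord> xidIdentMap = new HashMap<>();

    void txStarted(int identifier, long logOffset) {
        xidIdentMap.put(identifier, new StartRecord(logOffset));
    }

    // Where to begin reading for this tx: its Start entry, not offset 0.
    long seekPositionFor(int identifier) {
        StartRecord start = xidIdentMap.get(identifier);
        return start == null ? -1L : start.position;
    }

    public static void main(String[] args) {
        StartOffsetIndex index = new StartOffsetIndex();
        index.txStarted(7, 4096L);
        System.out.println("seek to " + index.seekPositionFor(7));
    }
}
```

Since entries are append-only, every entry for tx 7 is guaranteed to lie at or after the stored offset, which is what makes the seek safe.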
The rest of the state changes are also straightforward. LogEntry.Prepare means that the transaction is in the prepare state, a stage where it is being prepared to be committed. LogEntry.OnePhaseCommit and LogEntry.TwoPhaseCommit are written when the tx has started its commit process, either a 1PC or a 2PC. 1PC is the standard for an EmbeddedGraphDatabase while 2PC is for distributed txs in a JTA/JTS environment.
LogEntry.Done is a special one. Writing this entry for a tx means that the tx is no longer active. It has done its job and has ceased to be. It is an ex-tx. Apart from writing the bits on disk, the identifier-to-Start entry is removed from the xidIdentMap. XaLogicalLog.done() does all that.
Only LogEntry.Command remains. This is actually not a state of a tx; it encapsulates the commands that are part of one. XaLogicalLog.writeCommand(), apart from the ever-present identifier, accepts a XaCommand to write to disk. XaLogicalLog, being logical and all, does not know how to store a command, so it delegates to the implementation (Command) to write itself to the file.
So, summing up: every state transition of a tx is persisted on disk as a LogEntry, with a flag indicating the state and an identifier linking the LogEntry with the tx. Nothing is ever deleted, only appended. When a tx passes on, a special Done entry is written to mark the fact. Apart from state transitions, the commands that constitute the activity of the tx are also written to disk.

Round and round

So, if entries are never deleted, how large will the file become? Fear not, for there is log rotation. The default maximum size is 10MB, above which a rotation is triggered, as implemented in XaLogicalLog.rotate(). It is triggered if, during a command entry insertion, the size of the log exceeds the maximum, as long as there is no huge tx running (XaLogicalLog.checkLogRotation() defines a huge tx as one that is live and takes up more than 5MB). The rotate() method does the following:

  • Finds, looking at the .active file, which is the current log and what the next should be.
  • Forces the log file buffer out to disk, creates the new log file and writes the header: one long that is the version and one long that is the id of the last committed tx.
  • Finds the offset of the first live Start entry in the current log file and starts reading entries from there.
  • For every LogEntry that is part of a live tx (the entry's identifier exists in xidIdentMap) the entry is copied to the new log file. If the entry is a Start entry, the position field (the offset in the log file) is updated to the offset in the new log file.
  • The old log file is removed and the .active file is updated to the new log. We are done.
The rotate() method is synchronized on this, so while it runs, all txs that need to update something are put on hold, making the whole thing safe.
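The heart of the steps above, copying only entries of live txs into the new log, can be condensed into a few lines. LogEntry here is a stand-in and the real rotate() works on files, not lists, so treat this as a schematic only:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RotateSketch {
    static class LogEntry {
        final int identifier;  // which tx this entry belongs to
        LogEntry(int identifier) { this.identifier = identifier; }
    }

    // Entries of dead txs are dropped; entries of live txs (those whose
    // identifier is still in xidIdentMap) survive into the new log file.
    static List<LogEntry> rotate(List<LogEntry> oldLog, Set<Integer> liveIdents) {
        List<LogEntry> newLog = new ArrayList<>();
        for (LogEntry entry : oldLog) {
            if (liveIdents.contains(entry.identifier)) {
                newLog.add(entry);
            }
        }
        return newLog;
    }

    public static void main(String[] args) {
        List<LogEntry> old = List.of(new LogEntry(1), new LogEntry(2), new LogEntry(1));
        Set<Integer> live = new HashSet<>(Set.of(1));  // tx 2 is already done
        System.out.println("entries kept: " + rotate(old, live).size());
    }
}
```

This is why rotation keeps the log small: everything belonging to completed txs is garbage by definition and never makes it into the new file.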

So, how is all this useful?

This being a WAL and all, it has to be able to help with recovery, right? When XaLogicalLog is close()d, if there are no pending txs (the xidIdentMap is empty), the .active file is marked with 'C', the current active log (.1 or .2) is removed and we are happy. Otherwise, there are pending txs and on restart we must recover them, a fact indicated by a non-clean .active file. In that case, when the XaLogicalLog is first open()ed, the last active logical log is found by looking at the .active file and XaLogicalLog.doInternalRecovery() is started. The job to be done there is kind of obvious: read the log file, recreate the dangling txs and pass them to something that can do something useful with them. Then clear the log file and get ready for standard operations. Step by step, the work done is:

  • Open the active log file, check the header for version and restore the last committed tx id.
  • Start reading entries via readEntry().
  • For every entry, if it is a LogEntry.Start, create a new tx for the stored identifier (via a TxFactory) and store that in xidIdentMap. Pass that to the XaResourceManager. For Prepare, OnePhaseCommit, TwoPhaseCommit and Done, inform interested parties (XaResourceManager) of the fact. For Command entries, retrieve the tx and inject the command. All these are implemented in the XaLogicalLog.apply*() methods. When passed to the XaResourceManager, if necessary (on injection of OnePhaseCommit or TwoPhaseCommit entries), the tx (which, since all entries are appended in order, should be completely recreated by now) is commit()ted. This removes it from the xidIdentMap in the XaLogicalLog.
  • After all entries have been read (which means that everything that could be recovered has been recovered), overwrite the file portion from the last correctly read LogEntry to the end of the file with 0s, removing garbage that a nasty crash could have left there. We are done.
When a tx is recovered, it is marked as such (via setRecovered()). This prevents it from being re-inserted in the WAL during a recovery run.
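The replay loop above boils down to a dispatch on entry type: Start creates a tx, Command adds to it, a commit or Done retires it. What remains at the end is exactly the set of txs that never completed. All names here are illustrative stand-ins for the apply*() machinery, not the real classes:

```java
import java.util.HashMap;
import java.util.Map;

public class RecoverySketch {
    enum Kind { START, COMMAND, ONE_PHASE_COMMIT, DONE }

    static class Entry {
        final Kind kind; final int ident;
        Entry(Kind kind, int ident) { this.kind = kind; this.ident = ident; }
    }

    // Replay entries in log order; returns identifier -> command count for
    // txs still dangling (never committed or marked done) after the crash.
    static Map<Integer, Integer> replay(Entry[] entries) {
        Map<Integer, Integer> commandsPerTx = new HashMap<>();  // stand-in for xidIdentMap
        for (Entry e : entries) {
            switch (e.kind) {
                case START -> commandsPerTx.put(e.ident, 0);
                case COMMAND -> commandsPerTx.merge(e.ident, 1, Integer::sum);
                case ONE_PHASE_COMMIT, DONE -> commandsPerTx.remove(e.ident);
            }
        }
        return commandsPerTx;
    }

    public static void main(String[] args) {
        Entry[] log = {
            new Entry(Kind.START, 1), new Entry(Kind.COMMAND, 1),
            new Entry(Kind.ONE_PHASE_COMMIT, 1),                 // tx 1 fully committed
            new Entry(Kind.START, 2), new Entry(Kind.COMMAND, 2) // tx 2 crashed mid-flight
        };
        System.out.println("dangling txs: " + replay(log).keySet());
    }
}
```

Because appends preserve order, by the time a commit entry is reached the tx in the map is fully rebuilt, which is what makes re-committing it during recovery safe.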

Enough about the WAL, for now

You should have a mental picture by now of the basic operations XaLogicalLog performs during a run of Neo. Of course there is a ton of error handling and weird cases left unmentioned, plus some administrative functions that do not directly influence the core operations of the class as exposed above. But at least you know how your txs are recovered and what I will mean when, in the next posts, I write in passing that "the command is written in the log and the tx is marked as prepared". At least, I hope I managed that much.

Change of subject: Deadlock Detection

To provide ACID guarantees, a database must be able to ensure that all transactions, one way or another, end (otherwise, where is the Atomicity?). This means that all threads must eventually stop, which means that, at the very least, deadlocks must be avoided. So far so good. So, how does Neo avoid deadlocks? Enter LockManager, RagManager and RWLock.


RWLock is Neo's implementation of a reentrant read-write lock. This means that for the resource it locks, it allows many concurrent readers, but the write lock is handed to only one thread, to the exclusion of any other, reader or writer. Also, it is reentrant, meaning that a holder of the lock can re-acquire it. Cute, but since the java.util.concurrent.locks package provides that, why re-invent the wheel? Well, because we don't. Apart from being a simpler implementation that knows about Neo's needs, these locks are also aware of the existence of a Resource Allocation Graph manager (RagManager to its friends) that checks whether a wait() on a locked resource would result in a deadlock. But more on that later.
RWLock does its job by keeping a reference to the resource it locks, a list of waiting threads, a list of locking threads and two counts, one for read locks and one for write locks. A count for both kinds of locks is also held for each locking thread. When a thread requests a read lock, if its write count is equal to the global (for the lock) write count (meaning that if any write locks exist, they are all held by this thread), the lock is granted. Otherwise, the ragManager is consulted as to whether waiting would result in a deadlock, and if things check out, wait() is called. The same goes for write locks, but both the read and the write counts must match. Whenever a thread is granted a lock, the thread-specific lock count and the global lock count are incremented. That pretty much does it, except for releases, which are straightforward: reduce the count and wake eligible threads. Threads are woken up in FIFO order, making this lock implementation basically a fair lock.
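The count-matching rule above is the whole admission test, so it is worth seeing in isolation. This is a toy, single-threaded model of just the counting logic (no actual wait() or wake-up); the class and field names are mine, not Neo's:

```java
import java.util.HashMap;
import java.util.Map;

public class RWLockCounts {
    int totalReadCount, totalWriteCount;          // global counts for this lock
    final Map<Thread, int[]> perThread = new HashMap<>();  // per-thread [reads, writes]

    int[] countsFor(Thread t) { return perThread.computeIfAbsent(t, k -> new int[2]); }

    // Read lock is grantable if every outstanding write lock is mine (or there are none).
    boolean readGrantable(Thread t) {
        return countsFor(t)[1] == totalWriteCount;
    }

    // Write lock additionally requires every outstanding read lock to be mine too.
    boolean writeGrantable(Thread t) {
        int[] mine = countsFor(t);
        return mine[0] == totalReadCount && mine[1] == totalWriteCount;
    }

    void grantRead(Thread t)  { countsFor(t)[0]++; totalReadCount++;  }
    void grantWrite(Thread t) { countsFor(t)[1]++; totalWriteCount++; }

    public static void main(String[] args) {
        RWLockCounts lock = new RWLockCounts();
        Thread me = Thread.currentThread();
        lock.grantWrite(me);
        System.out.println("my reentrant read ok: " + lock.readGrantable(me));
        System.out.println("other thread's read ok: " + lock.readGrantable(new Thread()));
    }
}
```

Note how reentrancy falls out of the comparison for free: my own write locks count toward both my total and the global total, so they never block me, only everyone else.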

Keeping track of the wait-lock graph: The RagManager

RagManager does its job in practically two methods, checkWaitOn() and checkWaitOnRecursive(). It works in tandem with RWLock, with the contract that whenever a lock is acquired or released on a resource, or a thread enters a wait state for one, RWLock informs the ragManager of the fact. RWLock can then ask the ragManager before each wait() whether it would result in a deadlock.
RagManager is an implementation of a Wait-for graph, keeping a graph of threads and resources. When a thread waits on a resource, an edge is drawn from the thread to the resource (implemented as a Map<Transaction, Object>), and when a thread locks a resource, an edge is drawn from the resource to the thread (a Map<Object, List<Transaction>>). Obviously, a thread can wait() on only one resource, but many threads can hold a lock on a resource if it is a read lock.
When a thread is about to wait() on a resource, checkWaitOn() is called for that thread and resource, and a traversal of the graph begins from that resource. If the traversal finds a path back to the thread that is the candidate for wait(), a deadlock would occur, and the thread is killed with a DeadlockDetectedException.
The traversal is a recursive DFS, with a Stack passed along that stores the currently explored path. If a cycle is detected, the Stack is unwound, providing a dump of the offending cycle that would lead to the deadlock. Terminating the thread with an exception (unless you catch it, but why would you?) ensures that the tx will not complete, preserving its atomicity, while leaving the rest of the threads intact to do their job. No one will be stuck somewhere forever, so all txs will terminate, one way or another.
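A minimal wait-for graph with the recursive check described above might look like this. Threads and resources are plain strings for brevity, and the names are illustrative (the real RagManager works with Transaction objects and keeps the explored path in a Stack for the cycle dump):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WaitForGraph {
    final Map<String, String> waitsOn = new HashMap<>();       // thread -> resource it waits on
    final Map<String, List<String>> heldBy = new HashMap<>();  // resource -> threads holding it

    void lock(String resource, String thread) {
        heldBy.computeIfAbsent(resource, k -> new ArrayList<>()).add(thread);
    }

    // True if letting `waiter` wait on `resource` would close a cycle:
    // follow resource -> holder -> (resource the holder waits on) -> ...
    boolean wouldDeadlock(String waiter, String resource) {
        for (String holder : heldBy.getOrDefault(resource, List.of())) {
            if (holder.equals(waiter)) return true;       // path led back to the waiter
            String next = waitsOn.get(holder);            // what is this holder stuck on?
            if (next != null && wouldDeadlock(waiter, next)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        WaitForGraph g = new WaitForGraph();
        g.lock("A", "t1");
        g.lock("B", "t2");
        g.waitsOn.put("t2", "A");  // t2 already waits for A, which t1 holds
        // t1 now wants to wait on B, held by t2: classic two-party deadlock.
        System.out.println("deadlock: " + g.wouldDeadlock("t1", "B"));
    }
}
```

In the real class, instead of returning true, this is the point where the DeadlockDetectedException is thrown and the unwound path is reported.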

Managing all that

Finally, there is LockManager, which is exactly that: a manager, meaning it does almost no work itself, delegating to RWLock and, through that, to RagManager. Its job is to synchronize access to RWLocks, creating them if necessary, passing them a RagManager instance and the resource to lock, and removing them when no longer needed. All the higher levels of Neo do their locking through this class.
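The create-on-demand, hand-out-and-reap pattern can be sketched in a few lines. For brevity this toy uses the JDK's ReentrantReadWriteLock as the per-resource lock, where the real LockManager hands out Neo's own RWLock wired to a RagManager; the class name is mine:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockManagerSketch {
    private final Map<Object, ReentrantReadWriteLock> locks = new HashMap<>();

    // Synchronized so concurrent callers asking for the same resource
    // always see the same lock instance.
    synchronized ReentrantReadWriteLock lockFor(Object resource) {
        return locks.computeIfAbsent(resource, r -> new ReentrantReadWriteLock());
    }

    public static void main(String[] args) {
        LockManagerSketch mgr = new LockManagerSketch();
        Object node = new Object();  // stand-in for a graph primitive
        ReentrantReadWriteLock lock = mgr.lockFor(node);
        lock.readLock().lock();
        try {
            System.out.println("same lock for same resource: " + (mgr.lockFor(node) == lock));
        } finally {
            lock.readLock().unlock();
        }
    }
}
```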

Beware, for there may be dragons

This locking scheme and the resulting deadlock detection facility do not protect you from all deadlocks in your application - that problem is undecidable in its generality. It simply makes sure that transactions do not lock graph primitives in an order that can result in a deadlock, which pretty much means such a deadlock is not your fault. However, for the locking done directly by your code, you are on your own, so the usual level of alertness must be maintained while dealing with multithreaded apps.

To come

This was a very high level description of the operation of two base classes in the tx framework in Neo. With these out of the way, we can now focus on the way txs are created, maintained and committed or rolled back, already knowing how the locking works (ensuring isolation) and how the WAL/recovery mechanism works. Lots of code was left untouched, but I am already at 3000 words (again) and it is only 5 classes. Next time I will write about the tx implementation, commands and data sources/resources (probably).

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
