Wednesday, October 27, 2010

Neo4j Internals: Transactions (Part 2) - XaResources, Transactions and TransactionManagers

We continue ploughing through the Neo codebase, peeling away the layers that make up the transaction handling mechanism. If you came here hoping for a walkthrough of the code paths that lead from a beginTx() to a tx.finish() you will be disappointed, as there are some more encapsulated behaviours we must discuss first. This time we will look one level higher than last time, discussing the Transaction class and its implementations, Commands and TransactionManagers, touching first on the subject of XAResources. As prerequisites go, everything that was assumed last time applies here also. If you do not know what that is, I suggest you read at least the previous post. Having said that, let's dig in.

What was it that txs manage again?

Neo's tx handling model is heavily influenced by the XA specification, being designed to fit into a JTA environment (although it does not do that yet). Describing the XA specification is something I do not intend to do here, it would go way out of scope, although I hope to revisit the subject some time later. I advise you to go and look it up if you are completely unfamiliar with it, especially this article.
Txs in Neo deal primarily in terms of XAResources. An XAResource is an abstraction of a connection to a transactional resource, such as an RDBMS, or in our case, the Neo persistence store (the Lucene indexer also supports operations in a transactional context but we do not discuss indexing here). XAResources are enlisted with a tx during its lifetime to perform whatever operations are needed on whichever persistence backend they refer to. Since we discuss the Neo kernel here we concern ourselves only with resources over the NeoStore and conveniently choose to ignore the Indexer component, even though indexing has been integrated into the kernel since 1.2M2.
The above leads to a layered approach: a high-level tx that manages the XAResources participating in an operation, and one or more lower-level txs, each encapsulating the operations on a single XAResource. This is a miniaturization of the 2PC protocol found in distributed tx environments, implemented in Neo's case with a resource-encompassing TransactionImpl and a XAResource-specific XaTransaction. We now look at each one in turn.

XaTransaction, the resource specific flavour and WriteTransaction, the NeoStore specific implementation

An XaTransaction is a Neo specific class, meaning that it does not belong to the XA framework, despite its name. It defines the basic structure of a transaction over a Neo resource, holding a XaLogicalLog, the state of the tx, its identifier and providing the expected operations of a tx (commit(), rollback() etc) plus addCommand() and injectCommand(), the first used during normal operations and the second - you guessed it - used during recovery. It is extended by WriteTransaction, the core tx implementing class over a NeoStore.
WriteTransaction exposes the expected interface: all operations that can be performed on a NeoStore are there. Its fields fall into two categories, Records and Commands. Records are stored in Maps from Integer to the primitive's Record, the Integer being the primitive's id. Commands are stored in lists of Command objects. During normal (i.e., non-recovery) operation, every action on a primitive of the database is recorded as a proper Record, which is a Record object obtained from the backing store manager (NodeStore for nodes, RelationshipStore for relationships etc). The Record is altered as the operation requires and is inserted into the proper map. On prepare, all stored records are turned into proper Commands and stored in the corresponding list. A walkthrough of a common operation will be enlightening here, so let's create a Relationship between two Nodes.
The method for that is WriteTransaction.relationshipCreate(). It accepts the id of the relationship, the ids of the nodes and the type of the relationship. First, the two nodes, if not already present in the nodeRecord map, are fetched from the backing store and added there. Then, a new RelationshipRecord is created, marked as inUse() and added to the relRecord map. Next, the relationship pointers of the two Node records must be updated (recall that the relationships a node participates in are kept in a doubly linked list threaded through the relationship records). Although the new RelationshipRecord is thread confined and does not need to be locked, the rest of the nodes' relationships are potentially shared, so they must be locked. For that purpose a dummy implementation of the Relationship interface that only knows the RelationshipRecord's id is created, so that the LockManager can obtain the lock on it. For each participating Node, if it already has a Relationship, that Relationship is locked in this way, its RelationshipRecord is obtained, the proper nodeNextRel field is updated to include the new Relationship and the record is added to the relRecord map. We are done with the changes; note that the WRITE lock has not been released, as expected.
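A minimal sketch of that flow follows, with hypothetical record classes, field names and map names; the real WriteTransaction also handles locking, store lookups and the back-pointers of the doubly linked list.

```java
// Illustrative only: simplified create-relationship flow, not the kernel's actual code.
import java.util.HashMap;
import java.util.Map;

class RelationshipCreateSketch {
    static class NodeRecord { long id; long nextRelId = -1; }
    static class RelationshipRecord {
        long id, firstNode, secondNode, type;
        long firstNodeNextRel = -1, secondNodeNextRel = -1;
        boolean inUse;
    }

    final Map<Long, NodeRecord> nodeRecords = new HashMap<>();
    final Map<Long, RelationshipRecord> relRecords = new HashMap<>();

    // Stand-in for "fetch from the backing store if not already cached in this tx".
    NodeRecord getNodeRecord(long nodeId) {
        return nodeRecords.computeIfAbsent(nodeId, id -> {
            NodeRecord r = new NodeRecord();
            r.id = id;          // the real code loads the record from the NodeStore here
            return r;
        });
    }

    void relationshipCreate(long relId, long firstNodeId, long secondNodeId, long typeId) {
        NodeRecord first = getNodeRecord(firstNodeId);
        NodeRecord second = getNodeRecord(secondNodeId);

        RelationshipRecord rel = new RelationshipRecord();
        rel.id = relId;
        rel.inUse = true;
        rel.firstNode = firstNodeId;
        rel.secondNode = secondNodeId;
        rel.type = typeId;

        // Splice the new relationship at the head of each node's relationship chain.
        // The real code WRITE-locks the existing head relationships before touching them.
        rel.firstNodeNextRel = first.nextRelId;
        rel.secondNodeNextRel = second.nextRelId;
        first.nextRelId = relId;
        second.nextRelId = relId;

        relRecords.put(relId, rel);
    }
}
```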
When doPrepare() is called on this tx, all record maps are scanned and the records they hold are transformed into Command objects. During each such creation, a proper LogEntry.Command entry is written in the referenced XaLogicalLog. Upon doCommit(), the created commands are execute()d one by one, then the held locks are released (via the LockReleaser) and the record and command collections are cleared. If a doRollback() is requested instead, what must be undone is essentially the allocation of entity ids: for every record present in the maps that is marked as created, the proper backing store is asked to free() the record's id. Then we clear the record and command collections and we are done.
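As a hedged sketch of that prepare/commit split (the names below are mine, not the real XaTransaction API): on prepare every cached record becomes a command and is logged, on commit the commands are executed against the stores.

```java
// Sketch only: record-to-command prepare phase followed by a commit phase.
import java.util.ArrayList;
import java.util.List;

class TxPhasesSketch {
    interface Command { void writeToLog(); void execute(); }

    private final List<Command> commands = new ArrayList<>();

    void doPrepare(List<Command> commandsBuiltFromRecords) {
        for (Command c : commandsBuiltFromRecords) {
            c.writeToLog();   // a LogEntry.Command entry goes to the XaLogicalLog
            commands.add(c);
        }
    }

    void doCommit() {
        for (Command c : commands) {
            c.execute();      // persist the change in the backing store
        }
        commands.clear();     // in the real code the LockReleaser also drops the held locks here
    }
}
```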
During recovery, we do not hold records. Recall that LogEntry.Command entries are written in the XaLogicalLog, so, when the WAL finds a dangling tx, it reads in the Commands and calls injectCommand() on the tx. There they are added directly to the Command lists and executed during the subsequent doCommit(). This means that if a tx fails before the doPrepare() call, it is implicitly rolled back, with the unfreed ids recovered from the IdGenerator of each backing store when the database restarts.

Commands

It is worth taking a look at how commands are implemented, since they are such an integral part of Neo tx operations. Just as WriteTransaction extends XaTransaction, Command extends XaCommand, since NeoStore is not the only thing that knows about commands (again, indexing makes use of this framework). Implementation-wise, Command is pretty simple - it needs to define a way to persist itself to a LogBuffer, to read itself back from a ReadableByteChannel and to execute itself. Obviously a NodeCommand will be different from a RelationshipCommand, for instance, so for every primitive operation a different command class must be implemented. Also, DynamicRecords are primitive-agnostic, so a utility static method is provided for use by everyone interested. The implementations of the Command operations are conceptually pretty much the same across them, and if you are familiar with the different persistent store operations explained elsewhere (you are, aren't you?) you will have no problem understanding what is going on there. As an example, look at Command.NodeCommand. Every such instance holds the two obvious things it needs to do its work: a NodeRecord, which represents the changes to be performed on the store, and a NodeStore, since this is where changes will be persisted. execute() simply asks the store to update() the NodeRecord. writeToFile() persists the Command, first writing a flag marking the entry as a NodeCommand and then writing out the fields of the Record. readCommand() assumes that the NodeCommand identifying flag has been read (and thus, the bits that follow are known to be a NodeCommand) and reads in the following bytes, reconstructing a NodeCommand. The same pattern holds for the remaining 4 Command types.
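A hedged sketch of that pattern follows; the method names mirror the text, but the byte layout, field names and the NodeStoreLike interface are illustrative stand-ins rather than the kernel's actual classes.

```java
// Sketch of a NodeCommand-like class: a flag byte identifies the command type in the log,
// followed by the record's fields; execute() hands the record to the store.
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class NodeCommandSketch {
    static final byte NODE_COMMAND = 1;       // type flag written before the payload

    interface NodeStoreLike { void update(long id, boolean inUse, long nextRel, long nextProp); }

    final long nodeId;
    final boolean inUse;
    final long nextRelId;
    final long nextPropId;

    NodeCommandSketch(long nodeId, boolean inUse, long nextRelId, long nextPropId) {
        this.nodeId = nodeId;
        this.inUse = inUse;
        this.nextRelId = nextRelId;
        this.nextPropId = nextPropId;
    }

    // execute(): ask the store to update the record.
    void execute(NodeStoreLike store) {
        store.update(nodeId, inUse, nextRelId, nextPropId);
    }

    // writeToFile(): flag byte first, then the record fields.
    void writeToFile(DataOutputStream out) throws IOException {
        out.writeByte(NODE_COMMAND);
        out.writeLong(nodeId);
        out.writeBoolean(inUse);
        out.writeLong(nextRelId);
        out.writeLong(nextPropId);
    }

    // readCommand(): the flag byte has already been consumed by the caller.
    static NodeCommandSketch readCommand(DataInputStream in) throws IOException {
        return new NodeCommandSketch(in.readLong(), in.readBoolean(), in.readLong(), in.readLong());
    }
}
```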

TransactionImpl, the glue between XaTransactions

Ok, we have XaTransactions over a NeoStore (more precisely over a NeoStoreXaDataSource, but that will be explained in the next post) and we can commit that. We may also have transactions over a Lucene index (via the same framework) and we can commit that as well. But these are connected txs, meaning that if the WriteTransaction fails, we must make sure that the LuceneTransaction also fails and vice versa. You should recognise this as the essence of 2PC and this is what TransactionImpl takes care of. Instances of this class hold a list of XaResources, each of which is added to this top-level tx via a call to enlistResource() (and removed via delistResource()). Adding a resource such as NeoStoreXaConnection.NeoStoreXaResource (which is the NeoStore specific implementation of XaResource and will be discussed in the next post) to a TransactionImpl causes a record to be written to the txLog (discussed next) and binds it to the scope of this tx. All XaResources enlisted with a TransactionImpl are bound together in a sort of informal 2PC protocol. When TransactionImpl.doCommit() is requested, all enlisted resources are asked to prepare() their operation, returning an indicator of whether they succeeded in their changes or not. If all resources return XAResource.XA_OK then they can be committed, so each one is asked in turn to commit(). There are two different paths here, in essence distinguishing between the 1PC and 2PC cases. If there is more than one distinct XaResource, a full 2PC protocol is followed, with voting and everything. If there is only one XaResource enlisted, however, all this is bypassed and the resource is asked directly to commit(), with a boolean flag informing the resource that this is a 1PC, the expectation being that the implementation will take care of the prepare() step itself. doRollback() is simpler, iterating over all enlisted resources and asking them to rollback().
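A rough sketch of that 1PC/2PC decision against the standard javax.transaction.xa API (error handling, status transitions and the XA_RDONLY vote are omitted; this is not the actual TransactionImpl code):

```java
// Sketch: one-phase commit for a single resource, full two-phase commit otherwise.
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;
import java.util.List;

class CommitSketch {
    void doCommit(List<XAResource> resources, List<Xid> xids) throws XAException {
        if (resources.size() == 1) {
            // Single resource: skip the voting phase, ask for a one-phase commit.
            resources.get(0).commit(xids.get(0), true /* onePhase */);
            return;
        }
        // Full 2PC: everyone prepares (votes), then everyone commits.
        for (int i = 0; i < resources.size(); i++) {
            int vote = resources.get(i).prepare(xids.get(i));
            if (vote != XAResource.XA_OK) {      // a real implementation also accepts XA_RDONLY
                throw new XAException(XAException.XA_RBROLLBACK);
            }
        }
        for (int i = 0; i < resources.size(); i++) {
            resources.get(i).commit(xids.get(i), false);
        }
    }
}
```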
There are some more things TransactionImpl does that are worth noting. First, every time there is a change in the tx status (for example, from ACTIVE to PREPARING, as defined in Status), the txManager is notified and a proper record is added to the txLog. This way, a Write Ahead Log of these high-level txs is kept for recovering from failures. This bookkeeping procedure is explained below. Also, there is the management of synchronization hooks, as they are defined in the JTA spec. A list of Synchronizations is maintained and the doBeforeCompletion() and doAfterCompletion() take care to notify them properly when required. To fully understand TransactionImpl and high-level txs, however, we must look also at the TxManager.

The TxManager and its TxLog

The existence of these classes should come as no surprise. Since we have these so-called high-level txs, we must provide a WAL and the means to recover them in case of failures. In essence, we have a mechanism similar to XaLogicalLog, but TransactionManager is agnostic of specific XaResources and as such has no interest in Commands. Instead, it is interested in XaResources in the abstract, so it needs a means of distinguishing them. Say hello to globalId and branchId. Each TransactionImpl is identified by a globalId (a byte[]) and every XaResource that it has enlisted is identified by a branchId (again, a byte[]). This corresponds to the XA specification, which binds these two together in an abstraction known as Xid. The first time a resource is enlisted with a TransactionImpl, a writeStartRecord() call is made to the txManager, writing in the txLog a record indicating the beginning of a new tx. Then, after each enlisting of a resource that has not been enlisted before (for instance, a resource could have been suspended and then re-enlisted to make it active again), a new branch is added in the txLog for this tx via a call to txManager.addBranch(). This way, when reconstructing a tx from the log after a failure, it is possible to reassociate a recovered tx with the resources it was managing.
TxManager "hides" the TransactionImpl objects from the rest of the code. Whenever a new tx must start, the begin() method is called and a new TransactionImpl is created and mapped to the current thread. This map is kept in txThreadMap, a Map<Thread,TransactionImpl>. All public methods of TxManager for managing the lifecycle of txs do not accept a TransactionImpl argument - they get the currentThread() and retrieve the corresponding TransactionImpl. If a begin() call is made from a thread with a mapped tx, an exception is thrown, reminding you that there is no support for nested txs. commit(), rollback(), suspend(), getStatus() and the rest follow the same pattern. An exception is resume(), which does accept a TransactionImpl - after all, which tx should be resumed?
The main operations in TxManager may seem a little complicated but there is not much to them. There is a lot of error handling, since there are so many things that can go wrong, and managing the actual state in terms of flags requires a lot of if-else code. Let's walk through the commit() method, which is probably the most challenging.
As described before, the current TransactionImpl is first obtained from the txThreadMap. If it is STATUS_ACTIVE, beforeCompletion hooks are called and then commit() is called on the tx (we are now in the private commit() method). If the tx did not change its status to STATUS_COMMITTED, doRollback() is asked of it, afterCompletion hooks are called, it is removed from the txThreadMap, a done entry is written in the log, its status is changed to STATUS_NO_TRANSACTION and the proper exception is thrown (failure during a commit leads to a rollback attempt and an exception is always thrown, you know that, right?). Else, if the commit() succeeded, afterCompletion hooks are called, the tx is removed from the txThreadMap, a done entry is written in the log and the status is changed to STATUS_NO_TRANSACTION. If the status of the tx was STATUS_MARKED_ROLLBACK, then doRollback() is asked of the tx and the rest proceeds as before.
Finally, a few words about the TxLog. Its philosophy is the same as the XaLogicalLog's, writing entries formatted as per the Record inner class, flagged for their kind by the public static final byte fields of the class. It also rotates between files (named tm_tx_log.1 or .2, the active one marked by the contents of active_tx_log), but this rotation is triggered and managed by the TxManager in the getTxLog() and changeActiveLog() methods and performed by TxLog.switchToLogFile(). Of note is the getDanglingRecords() method, used during initialization of the TxManager and during log rotation. It goes over the contents of the log, reading Record entries, and keeps a map from Xid to List<Record>. If a TX_START Record is read, a mapping from its Xid to an empty list is added to the map; BRANCH_ADD and MARK_COMMIT records are appended to the list of the corresponding Xid. For TX_DONE records, the corresponding Xid entry is removed from the map, since that tx has successfully committed or rolled back.
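The bookkeeping in getDanglingRecords() amounts to roughly the sketch below; the record kinds follow the text, but the Record shape and the use of a String in place of a real Xid are simplifications of mine.

```java
// Sketch: scan the tx log and keep only the txs that never reached TX_DONE.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DanglingTxSketch {
    static final byte TX_START = 1, BRANCH_ADD = 2, MARK_COMMIT = 3, TX_DONE = 4;

    static class Record { byte kind; String xid; /* plus branch and other fields */ }

    static Map<String, List<Record>> danglingRecords(Iterable<Record> logRecords) {
        Map<String, List<Record>> dangling = new HashMap<>();
        for (Record r : logRecords) {
            switch (r.kind) {
                case TX_START:
                    dangling.put(r.xid, new ArrayList<>());
                    break;
                case BRANCH_ADD:
                case MARK_COMMIT:
                    dangling.get(r.xid).add(r);
                    break;
                case TX_DONE:
                    dangling.remove(r.xid);      // completed tx, nothing to recover
                    break;
            }
        }
        return dangling;                         // whatever is left was in flight at crash time
    }
}
```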

Replaying half-committed txs

An interesting question is what happens if, during the execution of a tx's Commands, the system crashes without having a chance to roll back. In that case, the whole tx will be replayed, meaning that the commands that were executed before the failure will be executed again. The answer to this is: so what? Since the commands are composed of Records, every Command execution is the persistence of the changed record in the persistence store. Writing a record again is simply wasted I/O - nothing changes but the timestamps on the files. Also, since the recovered txs are sorted and executed serially before normal operation resumes, no locks are necessary - these are not concurrent txs.

We are done for now

The above went a level higher in the hierarchy of the transaction handling code than the previous article. It also made a small excursion into the realm of the XA framework, most of which will be discussed next time. As the posts go by, I tend to focus more on the mechanics rather than analysing the code line-by-line - I feel that if you have followed me during my writings on this subject you should have developed a similar "feel" for the actual implementations. Before closing, I want to thank the people at Neo for their kind words concerning my work and especially Peter Neubauer, whose one-line answer to a question I asked helped me more than he can imagine.


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Wednesday, October 20, 2010

Neo4j Internals: Transactions (Part 1) - Write Ahead Log and Deadlock Detection

It took some time to get this post live but transactions are a tricky thing to implement. There are many things to discuss, so to keep the size of the posts manageable, I will break it down into parts. In this post I will write a bit about two different components that can be explained somewhat in isolation and upon which higher level components are built. The first is the Write Ahead Log (WAL) and the other is an implementation of a Wait-For graph that is used to detect deadlocks in Neo before they happen.
As always, assumptions for your grasp of some concepts are in place. You should already be familiar with my previous posts on the workings of Neo and as a result you know about Java I/O and the NIO package, so I will not write about that - the source should be obvious to you. Also, you know what transactions are (and that they are abbreviated to txs), what they are used for and the guarantees of a transactional system. You can probably get away without knowing a lot about JTA/JTS but if these acronyms are new to you, read up on them a little - it will help. Finally, since you know about txs, you should have some idea at least what a Write Ahead Log is. Oh, and since we will discuss deadlock detection, it is implied that you have used threads (preferably in Java) and you know what deadlocks are.
So, are you set? Open up the source and enjoy.

So, where is that WAL again?

In tx systems, atomicity and durability (from the ACID guarantees) are provided through the use of a WAL. All modifications during a tx are persisted on disk as they are requested (but are not performed on the actual store). Only after the tx is committed are the tx-encapsulated modifications performed upon the db store, and the tx is then removed from its temporary storing place. This ensures that even if the system crashes during a tx commit, by reading the place where the txs are stored we can recreate them and replay them, making sure that their operations are performed on the store. So there you have it - atomicity, i.e. either all of it happens or none of it happens, and durability, since after I call commit() I can be certain that my changes will be applied.
Before going any further, a bit of nomenclature. A tx goes through states in response to user requests. When you ask for beginTx(), a tx is created (and bound to the current thread, but more on that in the next post) and placed in the TX_STARTED state. Similarly, there are states for prepared txs (again, in the next post), committed txs, rolled back txs and so on. Each tx holds the modifications the user has requested during this tx, and in neo parlance they are called Commands. For every operation you perform on the graph db there is a corresponding Command, such as a node creation command, a property deletion command etc. All these together, the phases and the commands, define what a tx is.
In Neo, the WAL is implemented by XaLogicalLog. This class manages a set of files that act as an intermediate store for all the commands and lifecycle changes of a tx. It has an accompanying class, LogEntry, that helps define and abstract the form of the records XaLogicalLog uses to store the information it needs. In there, entries are defined for all the phases a tx goes through, plus a LogEntry.Command that stores the commands a tx is comprised of. Every time the txManager (again, next post) decides that it is time for a running tx to change phase (from started to rolled back, for example) or when the tx learns that a command has been added (e.g., the user asked for createNode()), the XaLogicalLog is informed and a proper entry is written in its file.

Yeah, but which file?

The WAL stores its data in 3 basic files, in the same directory where the primitive store keeps its files (in other words, the directory where you asked the db to be stored). The first is nioneo_logical.log.active, a marker file keeping track of which of the other two is the current active log file. It stores one character (4 bytes, the first two being the char, the other two 0), with a value of 'C', '1' or '2'. 'C' stands for clean, meaning that there is currently no active log file; the other two mean that either nioneo_logical.log.1 or nioneo_logical.log.2 is active, respectively. Only one of these two is the real log at any time; the other shouldn't exist (a condition temporarily violated during log rotation - see below). The currently active log file is written to and read from using either a heap or a memory mapped buffer, based on configuration settings (setting "use_memory_mapped_buffers" - the default is memory mapped).
There are also backup versioned files, if the configuration is set to maintain backups (the "keep_logical_logs" setting). There is the notion of the current logical log version, which is an integer incremented on every rotation and persisted in the neostore file by the NeoStore persistence store (recall that this class maintains 3 records in total - this is one of them). Backup files are kept with a filename of nioneo_logical.log.v<version>, where <version> is an integer, the version of that log file. Backup files are created only if the configuration says so, in which case they always take the place of deleting a file. So, from now on, when I write that a file is removed, you are to understand that if the configuration says so it is renamed with a versioned filename, otherwise it is deleted. OK?

So, let's store some txs

As I mentioned above, LogEntry defines what entries are written in the logical log by the XaLogicalLog. Each entry has an identifier field, which is an integer that stands as the identifier (duh!) for a tx. They are assigned by the XaLogicalLog whenever a tx is created, an event marked by the creation of a LogEntry.Start and its persistence on disk. XaLogicalLog.start(Xid) has all the details. Note that the generated identifier is returned to the caller so that the tx can be identified. This identifier is also stored in a Map from integer to LogEntry.Start kept by XaLogicalLog (in the field xidIdentMap), indicating the currently active txs that are managed.
A useful observation here is that all write operations on the log are appends. This means that after a Start entry is written, all info about that tx will be available after that file offset. This makes an optimization possible: if we store the offset of the Start entry in the xidIdentMap, then given a tx's identifier we don't have to scan the log file from the beginning to find info about that tx. Instead we can seek to the indicated offset and start from there. This can (and does) save some time.
The rest of the state changes are also straightforward. LogEntry.Prepare means that the transaction is in the prepare state, a stage where it is being prepared to be committed. LogEntry.OnePhaseCommit and LogEntry.TwoPhaseCommit are written when the tx has started its commit process, either a 1PC or a 2PC. 1PC is the standard for an EmbeddedGraphDatabase while 2PC is for distributed txs in a JTA/JTS environment.
LogEntry.Done is a special one. Writing this entry for a tx means that the tx is no longer active. It has done its job and has ceased to be. It is an ex-tx. Apart from writing the bits on disk, the identifier-to-Start entry from the xidIdentMap is removed. XaLogicalLog.done() does all that.
Only LogEntry.Command remains. This is actually not a state of a tx but it encapsulates the commands that are part of a tx. XaLogicalLog.writeCommand(), apart from the ever present identifier, accepts a XaCommand to write to disk. XaLogicalLog, being logical and all, does not know how to store a command, so it delegates to the implementation (Command) to write itself to the file.
So, summing up: every state transition of a tx is persisted on disk as a LogEntry, with a flag indicating the state and an identifier linking the LogEntry with the tx. Nothing is ever deleted, only appended. When a tx passes on, a special Done entry is written to mark the fact. Apart from state transitions, the commands that constitute the activity of the tx are also written to disk.
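In sketch form, every entry therefore carries at least a kind flag and the owning tx's identifier (the real LogEntry classes carry more, e.g. the Xid in the Start entry):

```java
// Illustrative shape of a logical log entry: a kind flag plus the tx identifier.
class LogEntrySketch {
    static final byte START = 1, PREPARE = 2, ONE_PHASE_COMMIT = 3,
                      TWO_PHASE_COMMIT = 4, COMMAND = 5, DONE = 6;

    final byte kind;
    final int identifier;   // ties the entry back to one tx

    LogEntrySketch(byte kind, int identifier) {
        this.kind = kind;
        this.identifier = identifier;
    }
}
```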

Round and round

So, if entries are not deleted, how large will the file become? Fear not, for there is log rotation. The default maximum size is 10MB; a rotation is triggered, as implemented in XaLogicalLog.rotate(), if during a command entry insertion the size of the log exceeds that maximum, as long as there is no huge tx running (XaLogicalLog.checkLogRotation() defines a huge tx as one that is live and takes up more than 5MB). The rotate() method does the following:

  • Finds, looking at the .active file, which is the current log and what the next should be.
  • Forces the log file buffer out to disk, creates the new log file and writes the header: one long that is the version and one long that is the id of the last committed tx.
  • Finds the offset of the first live Start entry in the current log file and starts reading entries from there.
  • For every LogEntry that is part of a live tx (the entry's identifier exists in xidIdentMap) the entry is copied to the new log file. If the entry is a Start entry, the position field (the offset in the log file) is updated to the offset in the new log file.
  • The old log file is removed and the .active file is updated to the new log. We are done.
The rotate() method is synchronized on this, so while it runs all txs that need to update something are put on hold, making this whole thing safe.
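A very rough outline of those steps in code form follows; the file naming, header layout and entry copying are simplified placeholders, not the actual rotate() implementation.

```java
// Sketch: flush the old log, write the new log's header, copy live entries, switch the .active marker.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class LogRotationSketch {
    synchronized void rotate(FileChannel currentLog, long nextVersion, long lastCommittedTxId)
            throws IOException {
        currentLog.force(false);                        // flush the old log to disk

        Path newLogPath = Paths.get("nioneo_logical.log." + pickInactiveSuffix());
        try (FileChannel newLog = FileChannel.open(newLogPath,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer header = ByteBuffer.allocate(16);
            header.putLong(nextVersion).putLong(lastCommittedTxId);
            header.flip();
            newLog.write(header);                       // header: version + last committed tx id

            copyLiveEntries(currentLog, newLog);        // only entries of still-active txs survive
        }
        switchActiveFileMarker();                       // update .active, then remove the old log
    }

    String pickInactiveSuffix() { return "2"; }         // placeholder for the .active file lookup
    void copyLiveEntries(FileChannel from, FileChannel to) { /* omitted */ }
    void switchActiveFileMarker() { /* omitted */ }
}
```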

So, how is all this useful?

This being a WAL and all, it has to be able to help with recovery, right? When XaLogicalLog is close()d, if there are no txs pending (the xidIdentMap is empty), the .active file is marked with 'C', the current active log (.1 or .2) is removed and we are happy. Otherwise, there are pending txs and on restart we must recover them, a fact indicated by a non-clean .active file. In that case, when the XaLogicalLog is first open()ed, the last active logical log is found by looking at the .active file and XaLogicalLog.doInternalRecovery() is started. The job that must be done there is kind of obvious: read the log file, recreate the dangling txs, and pass them to something that can do something useful with them. Then clear the log file and get ready for standard operations. Step by step, the work done is:

  • Open the active log file, check the header for version and restore the last committed tx id.
  • Start to readEntry()
  • For every entry, if it is a LogEntry.Start, create a new tx for the stored identifier (via a TxFactory), store that in xidIdentMap and pass it to the XaResourceManager. For Prepare, OnePhaseCommit, TwoPhaseCommit and Done, inform interested parties (the XaResourceManager) of the fact. For Command entries, retrieve the tx and inject the command. All these are implemented in the XaLogicalLog.apply*() methods. When passed to the XaResourceManager, if necessary (on injection of OnePhaseCommit or TwoPhaseCommit entries) the tx (which, since all entries are appended, should be completely recreated by now) is commit()ted. This removes it from the xidIdentMap in the XaLogicalLog.
  • After reading all entries (which means that all that could be recovered has been recovered), overwrite the file portion from the last correctly formatted LogEntry read to the end of the file with 0s, deleting garbage that could have gotten there from a nasty crash. We are done.
When a tx is recovered, it is marked as such (via setRecovered()). This prevents it from being re-inserted in the WAL during a recovery run.
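Condensed into a sketch, the recovery loop looks roughly like this (the entry kinds, the factory and the resource-manager callbacks are placeholders for the real LogEntry/XaResourceManager types):

```java
// Sketch: rebuild dangling txs from log entries and commit the ones that had reached a commit entry.
import java.util.HashMap;
import java.util.Map;

class RecoverySketch {
    interface Tx { void inject(Object command); void markRecovered(); }
    interface TxFactory { Tx create(int identifier); }
    interface ResourceManager { void commit(Tx tx); }

    // Each entry is modeled here as { kind, identifier, payload }.
    void recover(Iterable<Object[]> entries, TxFactory factory, ResourceManager rm) {
        Map<Integer, Tx> active = new HashMap<>();
        for (Object[] e : entries) {
            String kind = (String) e[0];
            int identifier = (Integer) e[1];
            switch (kind) {
                case "START":
                    Tx tx = factory.create(identifier);
                    tx.markRecovered();                  // so it is not re-written to the WAL
                    active.put(identifier, tx);
                    break;
                case "COMMAND":
                    active.get(identifier).inject(e[2]); // commands go straight into the tx
                    break;
                case "ONE_PHASE_COMMIT":
                case "TWO_PHASE_COMMIT":
                    rm.commit(active.remove(identifier));
                    break;
                case "DONE":
                    active.remove(identifier);           // nothing left to do for this tx
                    break;
            }
        }
    }
}
```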

Enough about the WAL, for now

You should have a mental picture by now of the basic operations the XaLogicalLog performs during a run of Neo. Of course there is a ton of error handling and weird cases left unmentioned, plus some administrative functions that do not directly influence the core operations of the class as exposed above. But at least you know how your txs are recovered and what I will mean when in the next posts I will write in passing that "the command is written in the log and the tx is marked as prepared". At least, I hope I managed that much.

Change of subject: Deadlock Detection

To provide ACID guarantees, a database must be able to ensure that all transactions, one way or another, end (otherwise where is the Atomicity?). This means that all threads must eventually stop, which means that, at the very least, deadlocks must be avoided. So far so good. So, how does Neo avoid deadlocks? Enter LockManager, RagManager and RWLock.

RWLock

RWLock is an implementation of a ReentrantReadWriteLock for Neo. This means that for the resource it locks, it allows many concurrent readers but the write lock is handed to only one thread, to the exclusion of any other, reader or writer. Also, it is reentrant, meaning that a holder of the lock can re-acquire it. Cute, but since the java.util.concurrent.locks package provides that, why re-invent the wheel? Well, we don't, not exactly. Apart from being a simpler implementation that knows about Neo's needs, these locks are also aware of the existence of a Resource Allocation Graph manager (RagManager to its friends) that checks whether a wait() on a locked resource will result in a deadlock. But more on that later.
RWLock does its job by keeping a reference to the resource it locks, a list of waiting threads, a list of locking threads and two counts, one for read locks and one for write locks. A count for both kinds of locks is also held for each locking thread. When a thread requests a read lock, if its write count is equal to the global (for the lock) write count (meaning that if any write locks exist, they are all held by this thread), the lock is granted. Otherwise, the ragManager is consulted on whether waiting would result in a deadlock and, if things check out, wait() is called. The same goes for write locks, but both the read and the write counts must match. Whenever a thread is granted a lock, the thread-specific lock count and the global lock count are incremented. That pretty much does it, except for releases, which are straightforward: reduce the count and wake eligible threads. Threads are woken up in FIFO order, making this lock implementation basically a fair lock.
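The read-lock check reduces to something like the sketch below; the real RWLock also tracks waiting threads explicitly, wakes them in FIFO order and asks the RagManager before every wait() (the names here are mine).

```java
// Sketch: grant a read lock only if all outstanding write locks belong to the requesting thread.
class ReadLockSketch {
    static class ThreadCounts { int readCount; int writeCount; }

    private int totalReadCount = 0;
    private int totalWriteCount = 0;

    synchronized void acquireReadLock(ThreadCounts me) throws InterruptedException {
        while (totalWriteCount > me.writeCount) {
            // ragManager.checkWaitOn(...) would run here and may throw
            // DeadlockDetectedException instead of letting this thread wait.
            wait();
        }
        me.readCount++;
        totalReadCount++;
    }

    synchronized void releaseReadLock(ThreadCounts me) {
        me.readCount--;
        totalReadCount--;
        notifyAll();   // the real lock wakes waiters in FIFO order instead
    }
}
```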

Keeping track of the wait-lock graph: The RagManager

RagManager practically does its job in two methods, checkWaitOn() and checkWaitOnRecursive(). It works in tandem with RWLock, with the contract that whenever a lock is acquired or released on a resource, or a thread enters a wait state for it, RWLock informs the ragManager of the fact. RWLock can then ask the ragManager before each wait() whether it would result in a deadlock.
RagManager is an implementation of a Wait-for graph, keeping a graph of threads and resources. When a thread waits on a resource, an edge is drawn from the thread to the resource (implemented as a Map<Transaction, Object>) and when a thread locks a resource an edge is drawn from the resource to the thread (a Map<Object, List<Transaction>>). Obviously, a thread can wait() on only one resource, but many threads can have a lock on a resource if it is a read lock.
When a thread is required to wait() on a resource, checkWaitOn() is called for that thread and resource and a traversal of the graph begins from that resource. If the traversal finds a path back to the candidate for wait() thread, it means a deadlock will happen and the thread is killed with a DeadlockDetectedException.
The traversal is a recursive DFS, with a Stack also passed along that stores the current path explored. If a cycle is detected, the Stack is unwound, providing a dump of the offending cycle that leads to the deadlock. Terminating the thread with an exception (unless you catch it, but why would you?) guarantees that the tx will not complete, ensuring its atomicity, while leaving the rest of the threads intact to do their job. No one will be stuck somewhere forever, so all txs will terminate, one way or another.
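As a hedged sketch, the cycle check amounts to a DFS from the resource we are about to wait on, following resource-to-holder and holder-to-awaited-resource edges; the types and the exception below are simplified stand-ins for the real RagManager and DeadlockDetectedException.

```java
// Sketch: detect whether waiting on 'resource' would close a cycle back to the candidate tx.
import java.util.List;
import java.util.Map;
import java.util.Stack;

class RagSketch {
    static class DeadlockDetected extends RuntimeException {
        DeadlockDetected(String msg) { super(msg); }
    }

    static void checkWaitOn(Object candidateTx, Object resource,
                            Map<Object, List<Object>> lockedBy,   // resource -> txs holding it
                            Map<Object, Object> waitsOn) {        // tx -> resource it waits on
        dfs(candidateTx, resource, lockedBy, waitsOn, new Stack<>());
    }

    private static void dfs(Object candidateTx, Object resource,
                            Map<Object, List<Object>> lockedBy,
                            Map<Object, Object> waitsOn, Stack<Object> path) {
        path.push(resource);
        for (Object holder : lockedBy.getOrDefault(resource, List.of())) {
            if (holder.equals(candidateTx)) {
                throw new DeadlockDetected("deadlock cycle: " + path);  // the unwound path shows the cycle
            }
            Object next = waitsOn.get(holder);      // what is this holder itself waiting for?
            if (next != null) {
                path.push(holder);
                dfs(candidateTx, next, lockedBy, waitsOn, path);
                path.pop();
            }
        }
        path.pop();
    }
}
```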

Managing all that

Finally, there is LockManager, which is exactly that: a manager, meaning it does almost no work, delegating to RWLock and through that, to RagManager. Its job is to synchronize access to RWLocks, creating them if necessary, passing them a RagManager instance and the resource to lock, and removing them when not needed. All the higher levels of Neo do their locking through this class.

Beware, for there may be dragons

This locking scheme and the resulting deadlock detection facility do not protect you from all deadlocks in your application - this problem is undecidable in its generality. It simply makes sure that transactions do not lock graph primitives in an order that can result in a deadlock, which pretty much is not your fault. However, for the locking done directly by your code, you are on your own, so the usual level of alert must be maintained while dealing with multithreaded apps.

To come

This was a very high level description of the operations of two base classes in the tx framework in Neo. With these out of the way, we can now focus on the way txs are created, maintained and committed or rolled back, already knowing how the locking works (ensuring isolation) and how the WAL/recovery mechanism operates. Lots of code was left untouched but I am already at 3000 words (again) and it is only 5 classes. Next time I will write about the tx implementation, commands and data sources/resources (probably).


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Sunday, October 10, 2010

Neo4j Internals: Persistence and Memory Mapping

Last time I gave a more or less detailed description of the storage file format of neo. Less than half of the org.neo4j.kernel.impl.nioneo.store package was covered, leaving aside most of the implementation details that give meaning to the bit buckets used to store data. In this post I will conclude the low level details and show how useful, high level objects are passed to and from the persistent store. As always, there are some assumptions on what you already should know. In this case I will gloss over the details of the Java NIO package, take for granted that you know what a memory mapped file is and of course I will build on the previous post, so you should be aware of the implementation of ids and what they mean in neo. In fact, simply by mentioning "memory mapped files" and "file offsets" you should be creating a mental image of what is going to happen. So buckle up, fire up some tabs with the NIO API (especially primitive buffers and file channels), of course refill your coffee mug and enjoy.

Windows with a useful view

The first layer of abstraction over the contents of the files comes from the PersistenceWindow interface. It encapsulates a Buffer, which is a decorator over a NIO Buffer. The point of all this is to gain a uniform interface over the two main methods of accessing the storage. One is the more mundane PlainPersistenceWindow and the other is the more exciting (and default) MappedPersistenceWindow. The first uses a simple, heap based buffer to store data while the second is built on top of a memory mapped portion of a file (of course, again wrapped in a Buffer). They both extend LockableWindow, which provides a ReentrantLock-like implementation of locking, keeping a LinkedList of waiting threads and, after the current thread is finished (i.e. calls unlock()), interrupt()ing them in a FIFO fashion.
There are three classes extending LockableWindow: one is the MappedPersistenceWindow, one is the AbstractPersistenceWindow (which provides all the machinery required to load and flush the contents of a file in conjunction with a heap based buffer), which PlainPersistenceWindow simply renames, and the last is PersistenceRow, which is used when only one record must be brought into memory. All these implementations require at construction a position in the file stream, a recordSize, a totalSize and of course the file channel. During construction, an appropriate buffer is created (either mapped or allocated on the heap) covering the file from offset position*recordSize and for totalSize bytes after that. From that multiplication you should have guessed that position is actually the id of a record or block and recordSize is the record's or block's size. All accesses to that area of the file are from now on performed via this PersistenceWindow, which takes care of the translation from record ids to file (or rather, buffer) offsets. This is hidden from users by the harmless looking getOffsetedBuffer(int) : Buffer method, which is where the (admittedly simple) magic happens. The PersistenceWindow implementation that wraps the file portion translates the position argument of this method to a buffer offset and returns the underlying buffer positioned at that offset. This way, the caller can start reading bytes right away, certain that what is returned starts right at the beginning of a record.
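The translation is a one-liner in spirit; here is a toy version of it (field and class names are mine, not the kernel's):

```java
// Sketch: a window covering part of a file, handing out its buffer positioned at a record's start.
import java.nio.ByteBuffer;

class OffsetedBufferSketch {
    private final ByteBuffer buffer;     // covers the file region [windowPosition * recordSize, +totalSize)
    private final long windowPosition;   // id of the first record covered by this window
    private final int recordSize;

    OffsetedBufferSketch(ByteBuffer buffer, long windowPosition, int recordSize) {
        this.buffer = buffer;
        this.windowPosition = windowPosition;
        this.recordSize = recordSize;
    }

    // Return the underlying buffer positioned at the start of record 'id'.
    ByteBuffer getOffsetedBuffer(long id) {
        int offsetInWindow = (int) ((id - windowPosition) * recordSize);
        buffer.position(offsetInWindow);
        return buffer;
    }
}
```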

Pool it together, brick it over

The window abstraction alleviates the need for explicitly managing buffers, whether they are mmap()ed or not, translations of file offsets, flush()ing and the rest of the low level details. However, there is still the need to maintain multiple windows over each file, since we typically read from multiple locations at once. Moreover, there is precious little memory available to us, so we must maintain a careful count of how much memory we have been allocated for our store (remember, each user-visible store - NodeStore, RelationshipStore and the PropertyStores all have individual configuration parameters defining how much memory they can consume for mapping their files' contents) and, if we feel like optimizing stuff, allocate it in hotspots. Finally, the logic for constructing a new window over a non-memory resident portion of our file could be abstracted to some kind of manager that would do this transparently for us upon request of a record id. All these operations are useful and well defined for all storage managers, so since we claim to be good engineers we should provide a reusable implementation for this concept. Enter the PersistenceWindowPool.
A PersistenceWindowPool does all that in only 600 lines of code. It functions as an LFU cache for the buffers over a file, maintains the memory constraints imposed by the configuration and abstracts over the creation and expiration of windows over the file. The primary storage unit is a decorator over a PersistenceWindow called BrickElement (a static private inner class) that maintains a reference to a LockableWindow, an index and a hitCounter. The setupBricks() method (called at construction time) determines the count and size of the BrickElements that should exist and, if the available memory is enough, stores them in an array. There is a lower limit on the number and size of the bricks, however, and when it is not satisfied PersistenceWindowPool reverts to using PersistenceRows, in essence getting rid of any memory caching whatsoever. Note that the whole expanse of the file is covered by bricks, regardless of whether they actually have memory reserved. As the file grows, more bricks are created to cover it.
Whenever a store manager requires a buffer of a record for a read or write, it passes its id to the underlying PersistenceWindowPool. The pool translates the id to an index in the Brick array (the id*recordSize is the file offset and the offset/brickSize - the window size actually - is the brick index). If the returned brick has a PersistenceWindow in place, then it is returned. Else, a PersistenceRow is created and kept for reference in a map (or if already in the map, it is retrieved) and returned. Either way, the window is mark()ed and lock()ed. At the same time, statistics on brick hits and misses are updated, both in the target brick and overall.
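Spelled out, the id-to-brick arithmetic from the parenthesis above is just this (brickSize playing the role of the window size):

```java
// Worked example of the brick lookup: file offset = id * recordSize, brick index = offset / brickSize.
class BrickIndexSketch {
    static int brickIndexFor(long recordId, int recordSize, int brickSize) {
        long fileOffset = recordId * recordSize;
        return (int) (fileOffset / brickSize);
    }
}
```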
When the brick misses pass a certain threshold value (50k at the time of this writing), a procedure to refresh the bricks is started. The bricks are separated in mapped and unmapped (based on whether they currently encapsulate a mapped window over the file or not) and each list is sorted based on their elements' hit count. Then, for the most hit-upon unmapped bricks, and while there is available memory, new PersistenceWindows are allocated and assigned (the choice of mmap()ed over heap-based is class wide, set upon construction). After that, the least hit-upon mapped bricks are stripped of their buffers (after flushing them to file) and their space is assigned to unmapped bricks as long as there are unmapped bricks with more hits than them. Obviously, lock()ed or mark()ed windows are left in peace, someone is using them after all. There, that takes care of LFU cache semantics.
The last important piece of code is the release() method. When the store manager is done using a window, it must return it to the pool for unLock()ing and possible recycling. In the case of PersistenceRows, it is immediately written out to disk and if no one else is using it, it is removed from the cache.
That pretty much covers the basic operations of the PersistenceWindowPool. Now, store managers can retrieve buffers from the disk based on record ids without worrying about management, including locking. That is something of a relief, wouldn't you say?

Higher up: The store managers

With the details out of the way, we can now look at the storage managers.
There are as many as there are data files so it would be kind of boring to enumerate them all here, more so since I mentioned them all in the previous post. Instead I will write about the core operations they expose and how they pass from bits to objects.
The structure of every record in the storage has a corresponding object in the org.neo4j.kernel.impl.nioneo.store package (node records have NodeRecord, relationship type records have RelationshipTypeRecord, all dynamic records have DynamicRecord and so on). Every one of these objects is a simple holder of the values of the corresponding record, with two exceptions. First, DynamicRecord is a generic holder for dynamic record data, meaning that it holds either array data or char[] (aka String) data (but not both), in addition to the next and previous block id values. The other is that for records that refer to a dynamic store (e.g. the string value of a property), the corresponding field is not an integer (as an id normally would be). Instead it is a Collection (List or Map) of DynamicRecords that store in their byte[] or char[] arrays the corresponding value (each block is represented by a different DynamicRecord, hence the need for a Collection). When are they initialized, you may ask. Well, patience, we will get to that.
Now that we know about records and their implementation in Java, it should be apparent that every store manager manages its corresponding record type. NodeStore manages NodeRecords, DynamicArrayStore manages DynamicRecords that store data in their byte[] field, etc. Each Store knows the size and structure of the records it manages. At creation time it allocates a PersistenceWindowPool that manages its file, passing it its recordSize (block size for DynamicStores), so all it has to do now is translate bytes from the file into values for the fields of its record objects and write out any changed records it receives.
Reading a record from a store requires an id. The store implementation requests from its PersistenceWindowPool a window that covers that id, asks the returned window for a byte buffer offsetted at the id and starts reading primitives as the record structure dictates. When writing a record, the same procedure is followed but the record is written into the buffer. The flushing of the contents is something the store manager is not concerned with, except at closing time.

Heavy or light?

PropertyIndexStore, PropertyStore and RelationshipTypeStore maintain, apart from their corresponding record stores, dynamic stores for keeping variable length data. How and when is that data fetched into memory? Well, by default it isn't. This is called a light record: a record whose collection of DynamicRecords is uninitialized. If you want the data, you must first call makeHeavy() on the corresponding store manager to have it fill the DynamicRecord collection that belongs to that record. These DynamicRecords will themselves be light, meaning that only the block chain ids will be retrieved, not the contents. After making the record heavy, you ask the record's store to retrieve the object kept in the DynamicRecord collection. This is necessary, since you do not have access to the underlying dynamic store and, even if you did, only the record's manager knows where and how things are stored. The store manager iterates over the DynamicRecord collection, makes it heavy and then asks the dynamic store to translate the resulting byte[] into something meaningful.
This mechanism allows for lazily fetching stuff from the disk, actually hitting the platters only when you need something. The traversal of the graph does not require fetching all the stuff a Node or Relationship might have hanging under it, so we can dispense with I/O until we actually need it.
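The light/heavy idea boils down to a lazily populated field, roughly like the sketch below (the DynamicStore interface and field names are illustrative, not the real API):

```java
// Sketch: a "light" record knows only the id of its first dynamic block;
// makeHeavy() pulls the block chain in from the dynamic store on demand.
import java.util.ArrayList;
import java.util.List;

class LazyPropertySketch {
    interface DynamicStore { List<byte[]> readBlocks(long firstBlockId); }

    static class PropertyRecordSketch {
        long firstBlockId;
        private List<byte[]> blocks;             // null while the record is light

        boolean isLight() { return blocks == null; }

        void makeHeavy(DynamicStore store) {
            if (isLight()) {
                blocks = new ArrayList<>(store.readBlocks(firstBlockId));
            }
        }

        List<byte[]> blocks() { return blocks; } // only meaningful after makeHeavy()
    }
}
```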

Enough building for now

All the storage managers are kept by a NeoStore instance that does not do much: it creates them, holds them and also keeps its own small file. It is not exposed to the user; instead it is encapsulated inside the transaction handling code. But that is another complicated topic, which will be discussed in my next post.


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Saturday, October 9, 2010

Neo4j Internals: File Storage

NOTE: This post is quite outdated, stuff has changed since i wrote this. While you can somewhat safely ignore the alterations for increased address space of entities, the Property store has changed in a fundamental way. Please find the new implementation here.

Ah, the physical layer! Storing bits and bytes on spinning metal, away from the security and comfort of objects and high-level abstractions. This is the realization of any database system, the sole purpose for which it is built. In this post I will try to explain the storage subsystem of Neo4j, exposing the structure and purpose of all those "neostore.*" files that your graph lives in. The management code and all those layers that give meaning to, and make usable, these buckets of bits will be discussed in my next post. I should make clear, however, that although I will try not to assume much, you should have some idea of what it means to manage data files, a vague idea of database storage schemes (like, what an index is) and, in general, a familiarity with I/O. Also, this is a large post, so grab a coffee before you start.

Which files again?

By now you should be aware that your graph lives in a bunch of files under the directory in which you instructed your instance to store them. All their filenames start with "neostore." and these are the ones we discuss today. There are more files and directories there, for the lucene indexer and the logs, but they are not of interest to us here.
The first thing to note, just from the filenames, is that they somewhat go in twos. One is the ".db" filename, for example "neostore.nodestore.db" (except for "neostore") and the other is the ".id" file, "neostore.nodestore.db.id". Also, there is an apparent hierarchy, especially in the "propertystore" branch, as we see "neostore.propertystore.db", "neostore.propertystore.db.index" etc. There is of course a reason for all these, but the details will become apparent later. For now, let's talk about ids.

Recycling Ids

I will tell a lie now but I have to start somewhere. Let's assume for a moment that neo's data is stored in equally sized chunks of known size and let's call these chunks records. There, that was easy to believe, wasn't it? The structure and size of a record can vary between files (this is obvious - a relationship record cannot possibly be the same as a node record) but they are fixed, at least per storage version scheme. Each record has an id (this id is known to neo users as the id of the primitive, available via getId()) so that it can be referred to. However, given an id, how do you find where in the file its data is? If you have an idea about databases, various access schemes must have popped into your mind by now, and the most obvious is the correct answer. Since the record is of known size and we have the liberty to assign ids as we see fit (this is our code after all), why not go with long ids, incrementing the last assigned id by one for every new record? Then the offset in the file is simply id*RECORD_SIZE. Brilliant. But a problem arises, called fragmentation. Think about deleting a node. Its storage space, possibly in the middle of the file, is available to store a new node record, but the file offset and the id have a one-to-one correspondence. The only way to reuse the hole we created is to assign its id to the next request for a new node. This is why ids are recycled in neo. This scheme calls for a manager of the available ids, which will keep a list of recycled ids and the next highest available id in case those run out. The idea is that when we request an id, if there are recycled ones we get one of those, otherwise a new one is created.
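In miniature, the scheme described above looks like this (a toy sketch, not the real IdGeneratorImpl; the RECORD_SIZE of 9 is just the node record size used as an example):

```java
// Sketch: offset = id * RECORD_SIZE; freed ids are handed out again before new ones are minted.
import java.util.ArrayDeque;
import java.util.Deque;

class IdRecyclingSketch {
    static final int RECORD_SIZE = 9;               // e.g. a node record

    private long highestId = 0;
    private final Deque<Long> freeIds = new ArrayDeque<>();

    long acquireId() {
        return freeIds.isEmpty() ? highestId++ : freeIds.pop();
    }

    void free(long id) {
        freeIds.push(id);                           // a deleted record's id becomes reusable
    }

    static long fileOffset(long id) {
        return id * RECORD_SIZE;                    // fixed-size records: id maps straight to an offset
    }
}
```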

The IdGenerator and its data file

Of course, a neo instance is not expected to run forever, so this list of recycled ids must also be kept in storage, since the ids must be available between restarts so that space does not become fragmented (aka wasted). This is where org.neo4j.kernel.impl.nioneo.store.IdGenerator and its implementation IdGeneratorImpl come into play. Every manager of every data file keeps an IdGenerator at hand to manage ids for it. IdGenerator abstracts the low level details by managing the id store (the ".id" file that goes hand in hand with every ".db" file), occasionally flushing out and reading in batches of recycled ids, accepting freed ids and removing used ones. Its file format is actually very simple. It keeps at the beginning of its file a byte that marks a clean shutdown, called "sticky". After open()ing a channel over the id file, it sets the sticky byte to 1 and writes it out, signifying a "dirty" file, meaning that it is not up to date with the actual defragmented ids. This byte is set to 0 only after a successful flush of the recycled ids and right before the close() of the file. So, if something crashes and the file is inconsistent, at restart this is known and a rebuild process is initiated, but more on recovery later.
A parameter called grabSize determines both the maximum number of unflushed recycled ids and the size of the batches in which stored ids are read. When this size is reached (for write operations, that is when the storage manager that uses this IdGenerator has free()d that many ids), the list is written to disk. When there are no more recycled ids in memory, grabSize ids are read from disk.
Finally, there is always at least one long value stored in the id file and that is the next highest available id. This is stored always right after the sticky bit, making the minimum file size 9 bytes (remember, ids are longs, which is 8 bytes in Java), making these 9 bytes a kind of header for the id file.
Go on, run the "Hello, World" example and open the var/base/neostore.nodestore.db.id file in a hex editor. You should see 9 bytes, the first being 0 (clean shutdown) and the next 8 a long equal to 3, which is the next available node id, since 1 and 2 are already taken. There were no deletes, so there are no recycled ids. There, in one sitting, the mystery of ".id" files and of why ids are recycled in neo is dissolved. Not bad.

Restoring the truth

Remember when I said before that neo organizes its storage in equally sized records? Well, that was not the whole truth. Actually, there are two kinds of storage engines in neo, one is org.neo4j.kernel.impl.nioneo.store.AbstractStore and the other is org.neo4j.kernel.impl.nioneo.store.AbstractDynamicStore. They both inherit from CommonAbstractStore which provides common operations on the IdGenerators, opening and closing, keeping version information and the rest. However, apart from these common mechanisms, they serve different storage needs. AbstractStores manage records of fixed size while DynamicStores manage data of arbitrary length, storing them in blocks and chaining them together, kind of like a LinkedList but on disk. As far as the contents of their files go, there are some common things between them. Both storage mechanisms make use of ids to refer to their building blocks, but whereas AbstractStores can communicate their ids to the outside world, DynamicStores' ids make sense only within the class, so they are not passed outside. Both implementations define a version string (this is part of the implementation of CommonAbstractStore) that is written at the end of the file and is checked at load time, so that the runtime is certain that the file passed is of the right format and version (running strings(1) on any ".db" file will reveal this version string). Also, every building block of these implementations, be it a record or a block, has a first byte that marks it as used or not. Recall that although the id manager knows which ids (and hence file offsets) are in use and which are not, there is the possibility that the application fails to write out the recycled ids. Having a used flag makes recovery possible, since by scanning the file we can re-discover which ids need to be added to the free list. Finally, all stores have an ".id" file, so I will never mention them explicitly again.
Common details aside, let's look at each implementation in isolation, explaining the file it creates and its format.

Record - oriented storage : The primitives

Node store

We start with the most basic of files, the node store. It is implemented at org.neo4j.kernel.impl.nioneo.store.NodeStore and creates the file "neostore.nodestore.db". Every node you create is stored here, in a simple, fixed size record of 9 bytes in total. The first byte is the use flag. The next 4 are an integer that is the id of the first of the relationships this node participates in (more on relationships in a bit). The final 4 bytes are the integer id of the first of the properties this node has. That's it.
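Decoding such a record from a buffer is about as simple as it sounds; here is an illustrative reader (the field names are mine, the 9-byte layout is the one just described):

```java
// Sketch: read a 9-byte node record -- in-use flag, first relationship id, first property id.
import java.nio.ByteBuffer;

class NodeRecordSketch {
    boolean inUse;
    int nextRelId;
    int nextPropId;

    static NodeRecordSketch read(ByteBuffer buffer) {
        NodeRecordSketch r = new NodeRecordSketch();
        r.inUse = buffer.get() != 0;     // byte 0: in-use flag
        r.nextRelId = buffer.getInt();   // bytes 1-4: id of the first relationship
        r.nextPropId = buffer.getInt();  // bytes 5-8: id of the first property
        return r;
    }
}
```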
A word about ids before we go any further, although this should be clear by now. Records in an AbstractStore do not store ids. Their offset in the file defines their id and their id defines their offset in the file. There is no need to store it anywhere.

Relationship store

The relationship store is implemented by org.neo4j.kernel.impl.nioneo.store.RelationshipStore and creates the file "neostore.relationshipstore.db". Obviously it stores relationships, in a record of 33 bytes. First comes the used byte, with the second least significant bit used as a flag to indicate whether this relationship is directed (1) or not (0). Then 4 bytes, the id of the first node and another 4 bytes as the id of the second node. Next is another 4 bytes, an integer that is the id of the record that represents the type of this relationship (more on relationship type storage later). The next 4 integers (16 bytes for those that do not keep count) are in order: The id of the previous relationship of the first node, the id of the next relationship of the first node, the id of the previous relationship of the second node and finally the id of the next relationship of the second node. The last 4 bytes are the id of the first property of this relationship.

Relationships (as well as properties) form a doubly linked list on disk. They are part of a chain that contains all there is to know about a primitive and the means to reach it. A special value (-1) is used to indicate the absence of another entry in the list in either direction. The links are ids of records in the same or a different store, which are translated by the proper storage manager into proper offsets. Who requests the accesses and how they are performed and transformed into useful objects will be the subject of the next post. For now, on to...

RelationshipType store

The relationship type store is implemented by org.neo4j.kernel.impl.nioneo.store.RelationshipTypeStore and creates "neostore.relationshiptypestore.db". It also needs a place to store the relationship type name (which is a String and therefore of variable size), so it creates a private DynamicStore which stores its data in "neostore.relationshiptypestore.db.names". The record size for the type is 5 bytes, the first being, as usual, the in_use byte and the remaining 4 the id of the block that stores the String that is the name of the type.

Property store

A main feature of any graph database is the ability to store properties on its primitives. A property in neo, for both nodes and relationships, is a name-value pair, the name being a String and the value being any Java primitive (int, long etc), a String, or an array of those. A thing of note is that many neo primitives can have the same name for a property (for instance, both the "Trinity" and "Morpheus" nodes will have a "rank" property) but each primitive may have a different value for that property (the "Trinity" node has a different "rank" value than "Morpheus"). However, we also want to be able to find nodes with a specific property, regardless of its value. These requirements lay out a specific path: property names should be shared between primitives, but the values should be private to each instance. This calls for an indirection in the storage, an observation that will come in handy in understanding how properties are stored.

The property storage is primarily managed by org.neo4j.kernel.impl.nioneo.store.PropertyStore, which stores its data in "neostore.propertystore.db". Each property record begins with the in_use byte, followed by an integer that stores the type of the property (int, String, long[] etc, as defined in org.neo4j.kernel.impl.nioneo.store.PropertyType), an integer that is the id of the property index (see below), a long that is either the id of a block in a DynamicStore (a DynamicStringStore or DynamicArrayStore, depending on the property type, stored in "neostore.propertystore.db.strings" or "neostore.propertystore.db.arrays" respectively) holding the value OR the actual value itself (there is no need to redirect to another file if the type is, say, an integer that fits right there), and finally two integers that are of course the ids of the previous and next property of the owning primitive, for a total of 25 bytes.

The property index is stored in "neostore.propertystore.db.index" and is managed by an org.neo4j.kernel.impl.nioneo.store.PropertyIndexStore created and held by PropertyStore. Its record starts with an in_use byte, then an integer that keeps a property count and lastly another integer that is the id of a block in a DynamicStore (a DynamicStringStore that keeps its data in "neostore.propertystore.db.index.keys") holding the property name. Yeap, that's 9 whole bytes. This is what makes it possible for different neo primitives to have a common property. The key_index_id of the Property record (the integer right after the Type field) points to a record in the PropertyIndexStore. This record is common to all properties that have the same name. Voila! Property sharing and space preservation.
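To tie the two files together, here is a rough sketch of how a property record could be resolved into a name-value pair by following the key index. The 25-byte layout is the one just described; the three helper methods are purely hypothetical stand-ins for reads of the index store and the dynamic stores.

// Hypothetical resolution of a property record into a (name, value) pair.
import java.nio.ByteBuffer;

class PropertyPeek {
    // record = the 25 bytes of one property record
    static void decode(ByteBuffer record) {
        byte inUse        = record.get();
        int type          = record.getInt();   // index into PropertyType (int, String, long[], ...)
        int keyIndexId    = record.getInt();   // record id in the property index store
        long valueOrBlock = record.getLong();  // inlined primitive value OR first block id
        int prevProp      = record.getInt();   // previous property of the owning primitive
        int nextProp      = record.getInt();   // next property of the owning primitive

        String name = lookupName(keyIndexId);          // index record -> .index.keys blocks
        Object value = isInlined(type)
                ? valueOrBlock                         // e.g. an int/long held in the record
                : loadFromDynamicStore(valueOrBlock);  // String/array in .strings or .arrays
        System.out.println(name + " = " + value);
    }

    // Placeholders: a real implementation would read the corresponding stores.
    static String lookupName(int keyIndexId) { return "rank"; }
    static boolean isInlined(int type) { return true; }
    static Object loadFromDynamicStore(long firstBlockId) { return null; }
}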

Block-oriented storage: Strings and arrays

The previous storage engines had fixed size records, composed mainly of pointers to other records and holding little actual data. The storage of your precious, variable size String or array property values must obviously be handled in a different way. This is the job DynamicStores were made for. We have already seen them in use, as part of PropertyStore and RelationshipTypeStore, both of which need to store Strings or arrays of Java primitives. Now we look at their structure.

The general block structure

The variable size nature of arbitrary data is accommodated by a doubly linked block structure within the dynamic store file. What you need, of course, is an in_use byte, a pointer to the previous block and one to the next block, the size of the data and finally, the data itself. This lot is bounded by a maximum block size to make finding offsets in the file possible. If a piece of data requires more space than blockSize, more blocks are added to its chain to fit it in. The trick here is that all blocks have the same size, namely blockSize; however, the last block in each chain (or the first, if there is only one) can use less than the allocated amount, a fact indicated in the length field.
The actual layout of each block header is the in_use byte, the int id of the previous block (-1 if it is the first or only block), the length as an int and the int id of the next block (-1 if it is the last block in the chain or the only block). After these 13 bytes comes the data, with max length blockSize, for a total actual block length of blockSize+13. Users of DynamicStores are aware only of the id of the first block in a chain, meaning that in the records discussed previously, all ids pointing to DynamicStores refer to the first block of a chain.
Finally, the first block of every DynamicStore is reserved and, instead of the above format, stores in its first 4 bytes an integer that describes the block size used by this store. This way, when a DynamicStore is opened by a new instance of neo, the block size can be read and compared with the expected one (as defined in the Configuration), saving us a lot of trouble in case of mistakes.
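Putting the pieces together, reading a value back out of a dynamic store could look roughly like the sketch below: seek to the first block, keep its used bytes, follow the next pointer until it is -1. Only the 13-byte header layout comes from the text; the names and the assumption that a block's file offset is its id times (blockSize + 13) are mine.

// Hypothetical traversal of a dynamic store block chain into one byte[] value.
import java.io.ByteArrayOutputStream;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

class DynamicChainPeek {
    static final int HEADER = 13;                       // in_use + prev + length + next

    static byte[] readChain(RandomAccessFile file, int blockSize, int firstBlockId)
            throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int blockId = firstBlockId;
        while (blockId != -1) {
            file.seek((long) blockId * (blockSize + HEADER));
            byte[] raw = new byte[blockSize + HEADER];
            file.readFully(raw);
            ByteBuffer buf = ByteBuffer.wrap(raw);
            buf.get();                                  // in_use flag
            buf.getInt();                               // previous block id (unused here)
            int length = buf.getInt();                  // bytes actually used in this block
            int next = buf.getInt();                    // -1 marks the end of the chain
            out.write(raw, HEADER, length);             // append only the used portion
            blockId = next;
        }
        return out.toByteArray();
    }
}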

Strings vs Arrays

There is no use for a dynamic store when storing a single int or long. That is apparent in the property store, where properties that are single Java primitives are stored directly in the record (and that explains why the property_block_id is a long and not an int). So, what we need to store here is Strings, primitive arrays and String arrays. This is done in somewhat different ways.

org.neo4j.kernel.impl.nioneo.store.DynamicArrayStore is responsible for storing arrays. In general, arrays of Java primitives are no trouble: just read all the data as bytes and split them into bytes, shorts, chars, longs, whatever. The type of the primitive held is the first byte of the data held in the first block, as defined in the private enum DynamicArrayStore.ArrayType. Arrays of Strings are a little bit more complicated, as they are actually arrays of arrays. What happens is that the first byte flags the data as String[], the next integer is the size of the array and the next integer is the size of the first String. After the first String is read in, the next integer is the size of the next String and so on.
For example, to store {"Hello", "World"}, you store a (byte) 2, which is the ArrayType for String[]. Then you store an (int) 2, which is the number of Strings. Then you store an (int) 10, which is the size of the first String in bytes (remember, chars are 2 bytes). Then goes the "Hello" string, then another (int) 10 and then "World". This series of 33 bytes may span more than one block but this is abstracted away by the store manager.
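A tiny sketch of that encoding in code. The (byte) 2 marker for String[] is taken from the text; the buffer size and the use of UTF-16BE to get 2 bytes per character are illustrative assumptions, not necessarily what Neo does internally.

// Hypothetical encoding of {"Hello", "World"} in the layout described above.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class StringArrayEncode {
    static byte[] encode(String[] strings) {
        ByteBuffer buf = ByteBuffer.allocate(1024);
        buf.put((byte) 2);                              // ArrayType marker for String[]
        buf.putInt(strings.length);                     // number of Strings in the array
        for (String s : strings) {
            byte[] chars = s.getBytes(StandardCharsets.UTF_16BE); // 2 bytes per char
            buf.putInt(chars.length);                   // size of this String in bytes
            buf.put(chars);                             // the characters themselves
        }
        byte[] result = new byte[buf.position()];       // 33 bytes for {"Hello", "World"}
        buf.flip();
        buf.get(result);
        return result;
    }
}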

org.neo4j.kernel.impl.nioneo.store.DynamicStringStore does not impose any kind of additional structure on its data (unlike DynamicArrayStore), so the implementation of AbstractDynamicStore and its block structure are enough to just throw the String bytes in there and be done with it. That was easy.

Almost there

18 of the 20 "neostore." files created by standard EmbeddedGraphDatabase usage have been covered. What remains is the top level "neostore" file. This is managed by org.neo4j.kernel.impl.nioneo.store.NeoStore and stores just three records in all. Each is 9 bytes, the in_use byte and a long. The first record stores the creation time of the database, the second a random number and the third the database version. That's it.
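For completeness, a throwaway sketch that dumps those three records, assuming the 9-byte in_use + long layout just described (names and I/O handling are mine).

// Hypothetical dump of the three 9-byte records in the top level "neostore" file.
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

class NeoStorePeek {
    public static void main(String[] args) throws Exception {
        String[] names = { "creation time", "random number", "version" };
        RandomAccessFile file = new RandomAccessFile(args[0], "r");  // path to "neostore"
        try {
            for (int i = 0; i < names.length; i++) {
                byte[] raw = new byte[9];
                file.seek(i * 9L);
                file.readFully(raw);
                ByteBuffer buf = ByteBuffer.wrap(raw);
                buf.get();                              // the in_use byte
                System.out.println(names[i] + ": " + buf.getLong());
            }
        } finally {
            file.close();
        }
    }
}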

On a more personal note

Fiddling with the details of the on disk storage is a pain and I am not sure I have done my job of explaining them correctly. The best way to see these for yourself is to create tiny sample databases and look at the resulting files with a hex editor (I know I did). Also, despite my dislike for creating graphics, I have a table that demonstrates the structure of the records and blocks as we discussed them above. Enjoy.



Finally, the end

Whew, a long post finally at an end. We only talked about the files however, not at all about the code. In the next post we shall see how all this structure is used and what optimizations are in place to make touching the disk as rare as possible.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Thursday, October 7, 2010

Neo4j Internals: Caching

Although Neo utilizes the NIO package with memory mapping (meaning that at least part of the db file store is in memory), it features a caching mechanism that is mainly useful for keeping primitives (nodes and relationships) in memory as internal, useful objects (recall NodeImpl and RelationshipImpl). Here I will take a shot at explaining how that stuff works, diverging a little from the Hello, World line-by-line walkthrough.

But first a little background

One of the most common cache management algorithms is the venerable Least Recently Used (LRU to its friends). It combines simplicity of implementation, fast operations and pretty good caching characteristics, although there are some (me, for instance) who swear by more advanced solutions (ARC anyone?). Another advantage of LRU is that it is almost already implemented in the java.util package, as LinkedHashMap. All that is missing is a proper implementation of removeEldestEntry(), which with a single line of code will give LRU characteristics to the class. But no more details; if you cannot follow, go do your homework and then come back.
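For the record, the trick looks roughly like this. It is a generic sketch, not Neo's LruCache (which layers naming and adaptability on top of the same idea).

// A minimal LRU cache on top of LinkedHashMap; evicts the eldest entry
// once the map grows past maxSize.
import java.util.LinkedHashMap;
import java.util.Map;

class SimpleLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    SimpleLruCache(int maxSize) {
        super(16, 0.75f, true);                  // accessOrder=true gives LRU ordering
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;                 // the single line that does the work
    }
}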
Another way to approach LRU-like behaviour specific to Java is with Weak/Soft references. These two posts provide a good introduction to the subject of non-strong references and discuss Maps with Soft and Weak references. In simple words, both WeakReference and SoftReference objects (actually, the values they refer to) are treated specially by the garbage collector, which reclaims their memory more aggressively. This behaviour is used in "poor man's caches" that expand quickly to cover the heap but, the moment the memory is required for more important work, are reclaimed. As a consequence, we have a way to greedily cache objects without hard, non-adaptable upper limits while maintaining availability of memory for other tasks. This is the path chosen in the default implementation of the cache in neo.
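To make the idea concrete, here is a bare-bones sketch of a soft reference based value cache, in the same spirit as the ReferenceCaches discussed below but not Neo's actual SoftLruCache: values are wrapped in SoftReferences registered with a ReferenceQueue, and cleared entries are later polled off the queue and removed from the map. All names here are mine.

// Minimal soft-reference cache sketch: the GC may clear values under memory
// pressure, and pollClearedValues() drops the dead map entries afterwards.
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;

class SoftValueCache<K, V> {
    private static class SoftValue<K, V> extends SoftReference<V> {
        final K key;
        SoftValue(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final ConcurrentHashMap<K, SoftValue<K, V>> map = new ConcurrentHashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    void put(K key, V value) {
        pollClearedValues();
        map.put(key, new SoftValue<>(key, value, queue));
    }

    V get(K key) {
        SoftValue<K, V> ref = map.get(key);
        return ref == null ? null : ref.get();   // null if never cached or already cleared
    }

    @SuppressWarnings("unchecked")
    void pollClearedValues() {
        SoftValue<K, V> cleared;
        while ((cleared = (SoftValue<K, V>) queue.poll()) != null) {
            map.remove(cleared.key, cleared);    // remove only if still the same reference
        }
    }
}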

Before the main event

First note that there are two caches, one for nodes and one for relationships, which, apart from storing different classes, can have different maximum sizes (via the "max_node_cache_size" and "max_relationship_cache_size" configuration parameters), albeit not different implementations. There are 4 different kinds of cache implementations, all in the package org.neo4j.kernel.impl.cache. There is the degenerate NoCache, which stores nothing. LruCache uses a good old LinkedHashMap with an overridden removeEldestEntry(), which makes it an LRU cache and provides adaptability (more on that later). Finally, {Soft,Weak}LruCache provide the exact same implementation, with Weak and Soft interchanged, utilizing {Soft,Weak}Values to store references and ReferenceQueues to store garbage collected references. They both extend org.neo4j.kernel.impl.cache.ReferenceCache, which defines a pollClearedValues() method for removing "dead" entries from the map. By the way, that is the extent of the adaptability code for ReferenceCaches, meaning that their adaptation to current memory utilization is whatever the Weak/SoftReferences plus the gc can offer (which is a lot). All management of the contents of the caches is performed exclusively by NodeManager, while the actual memory consumption is determined by the AdaptiveCacheManager and is influenced by the user-supplied configuration.

The AdaptiveCacheManager

The main work of AdaptiveCacheManager is to maintain the set of caches, parse and keep the configuration and provide adaptability of their sizes. This is achieved by spawning a worker thread that has a very simple life. Every "adaptive_cache_worker_sleep_time" milliseconds (3000 by default) it will wake up and re-adjust the sizes of the caches. For the ReferenceCaches it simply calls pollClearedValues(), while for the LruCache things are a little bit more complicated. Since the collection of over-the-quota entries must be done by hand, the adaptCache() method is run on every LruCache: its target size is recalculated based on the memory statistics reported by the JVM and passed to its resize() method, which drops elements in LRU fashion until it reaches the target size. The exact heuristic and the configuration parameters that influence it are documented precisely here.
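Conceptually, the worker's loop boils down to something like the sketch below. The real heuristic, the resize() call and the configuration handling live in AdaptiveCacheManager and LruCache; the sleep time parameter, the 80% threshold and the brutal clear() here are placeholders of my own.

// Hypothetical sketch of a periodically adapting cache worker (not Neo's code).
import java.util.Map;

class CacheAdapterWorker implements Runnable {
    private final Map<Long, Object> cache;              // stands in for an LruCache
    private final long sleepTimeMillis;                 // cf. "adaptive_cache_worker_sleep_time"

    CacheAdapterWorker(Map<Long, Object> cache, long sleepTimeMillis) {
        this.cache = cache;
        this.sleepTimeMillis = sleepTimeMillis;
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(sleepTimeMillis);           // 3000 ms by default in Neo
            } catch (InterruptedException e) {
                return;                                  // shut down quietly
            }
            Runtime rt = Runtime.getRuntime();
            long used = rt.totalMemory() - rt.freeMemory();
            // Placeholder heuristic: if the heap looks crowded, shrink the cache.
            // The real code recalculates a target size and calls resize() instead.
            if (used > rt.maxMemory() * 0.8) {
                cache.clear();
            }
        }
    }
}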


CacheManager Startup

One last thing remains: where in the startup chain the cache manager is created and started. The Config object creates and stores the AdaptiveCacheManager during its construction and passes it to the constructor of GraphDbModule. There it is passed to the constructor of NodeManager, whose start() method registers the caches with it and then calls its start() method. There the adaptive cache worker thread is started and all the magic happens.


There you have it

This is the whole of the caching mechanism used in neo. It is mainly used to cache node and relationship implementations as they are retrieved from the persistent store, and the caches themselves are completely hidden inside the NodeManager. It is wholly contained in its own package and is indeed a simple piece of code to understand.


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.