Sunday, October 10, 2010

Neo4j Internals: Persistence and Memory Mapping

Last time I gave a more or less detailed description of the storage file format of neo. Less than half of the package was covered, leaving aside most of the implementation details that give meaning to the bit buckets used to store data. In this post I will conclude the low level details and show how useful, high level objects are passed to and from the persistent store. As always, I make some assumptions about what you should already know. In this case I will gloss over the details of the Java NIO package, take for granted that you know what a memory mapped file is and, of course, build on the previous post, so you should be aware of the implementation of ids and what they mean in neo. In fact, simply from the mention of "memory mapped files" and "file offsets" you should be forming a mental image of what is going to happen. So buckle up, fire up some tabs with the NIO API (especially primitive buffers and file channels), refill your coffee mug and enjoy.

Windows with a useful view

The first layer of abstraction over the contents of the files comes from the PersistenceWindow interface. It encapsulates a Buffer, which is a decorator over a NIO Buffer. The reason for all this is that in this manner we gain a uniform interface over the two main methods of accessing the storage. One is the more mundane PlainPersistenceWindow and the other is the more exciting (and default) MappedPersistenceWindow. The first uses a simple, heap based buffer to store data while the second is built on top of a memory mapped portion of a file (again wrapped in a Buffer, of course). Both extend LockableWindow, which provides a ReentrantLock-like implementation of locking: it keeps a LinkedList of waiting threads and, after the current thread is finished (i.e. calls unlock()), interrupt()s them in FIFO fashion.
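The FIFO wake-up scheme just described can be sketched roughly as follows. This is a simplified, hypothetical reconstruction for illustration only (class and method names are mine, not the actual Neo4j source): the owner holds the lock re-entrantly, waiters queue in a LinkedList, and the final unlock() interrupt()s the head of the queue.

```java
import java.util.LinkedList;

// Illustrative sketch of the LockableWindow locking idea: re-entrant
// ownership, a FIFO LinkedList of waiters, and interrupt() as the wake-up
// signal on unlock(). Not the actual Neo4j implementation.
class FifoLock {
    private Thread owner;
    private int holds;
    private final LinkedList<Thread> waiters = new LinkedList<>();

    synchronized void lock() {
        Thread me = Thread.currentThread();
        if (owner == me) { holds++; return; } // re-entrant acquisition
        while (owner != null) {
            waiters.addLast(me);
            try { wait(); } catch (InterruptedException e) { /* woken by unlock() */ }
            waiters.remove(me);
        }
        owner = me;
        holds = 1;
    }

    synchronized void unlock() {
        if (--holds == 0) {
            owner = null;
            if (!waiters.isEmpty()) {
                waiters.getFirst().interrupt(); // FIFO wake-up
            }
        }
    }

    synchronized boolean heldByCurrentThread() {
        return owner == Thread.currentThread();
    }
}
```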
There are three classes extending LockableWindow: one is the MappedPersistenceWindow, one is the AbstractPersistenceWindow (which provides all the mechanism required to load and flush the contents of a file in conjunction with a heap based buffer, and which PlainPersistenceWindow simply renames) and the last is PersistenceRow, which is used when only one record must be brought into memory. All these implementations require at construction a position in the file stream, a recordSize, a totalSize and, of course, the file channel. During construction, an appropriate buffer is created (either mapped or allocated in the heap) covering the file from offset position*recordSize and for totalSize bytes after that. From that multiplication you may have guessed that position is actually the id of a record or block and recordSize is the record's or block's size. All accesses to that area of the file are from now on performed via this PersistenceWindow, which takes care of the translation from record ids to file (or rather, buffer) offsets. This is hidden from users by the harmless looking getOffsetedBuffer(int) : Buffer method, which is where the (admittedly simple) magic happens. The PersistenceWindow implementation that wraps the file portion translates the position argument of this method to a buffer offset and returns the underlying buffer positioned at that offset. This way, the caller can start reading bytes right away, certain that what is returned starts right at the beginning of a record.
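The offset arithmetic behind getOffsetedBuffer(int) can be sketched like this. Again, a hypothetical simplification with illustrative names; the real class wraps a mapped or heap NIO buffer, here I just allocate on the heap:

```java
import java.nio.ByteBuffer;

// Illustrative sketch of a persistence window's record-id-to-offset
// translation. 'windowStartRecord' plays the role of the 'position'
// constructor argument described in the text.
class OffsetedWindow {
    private final long windowStartRecord; // first record id covered by this window
    private final int recordSize;
    private final ByteBuffer buffer;      // covers totalSize bytes of the file

    OffsetedWindow(long windowStartRecord, int recordSize, int totalSize) {
        this.windowStartRecord = windowStartRecord;
        this.recordSize = recordSize;
        this.buffer = ByteBuffer.allocate(totalSize); // memory mapped in the real thing
    }

    // Translate a record id into a buffer offset and position the buffer
    // there, so the caller can start reading the record's bytes immediately.
    ByteBuffer getOffsetedBuffer(long recordId) {
        int offset = (int) ((recordId - windowStartRecord) * recordSize);
        buffer.position(offset);
        return buffer;
    }
}
```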

Pool it together, brick it over

The window abstraction alleviates the need for explicitly managing buffers, whether they are mmap()ed or not, translations of file offsets, flush()ing and the rest of the low level details. However, there is still the need to maintain multiple windows over each file, since we typically read from multiple locations at once. Moreover, there is precious little memory available to us, so we must keep a careful count of how much memory we have been allocated for our store (remember, each user-visible store, be it NodeStore, RelationshipStore or one of the PropertyStores, has individual configuration parameters defining how much memory it can consume for mapping its file's contents) and, if we feel like optimizing, spend it on hotspots. Finally, the logic for constructing a new window over a non-memory-resident portion of our file could be abstracted to some kind of manager that would do this transparently for us upon request of a record id. All these operations are useful and well defined for all storage managers, so since we claim to be good engineers we should provide a reusable implementation of this concept. Enter the PersistenceWindowPool.
A PersistenceWindowPool does all that in only 600 lines of code. It functions as an LFU cache for the buffers over a file, maintains the memory constraints imposed by the configuration and abstracts over the creation and expiration of windows over the file. The primary storage unit is a decorator over a PersistenceWindow called BrickElement (a static private inner class) that maintains a reference to a LockableWindow, an index and a hitCounter. The setupBricks() method (called at construction time) determines the count and size of the BrickElements that should exist and, if the available memory is enough, stores them in an array. There is a lower limit on the number and size of the bricks, however, and when it is not satisfied PersistenceWindowPool reverts to using PersistenceRows, in essence getting rid of any memory caching whatsoever. Note that the whole expanse of the file is covered by bricks, regardless of whether they actually have memory reserved. As the file grows, more bricks are created to cover it.
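The fallback decision in setupBricks() boils down to a budget computation along these lines. This is a loose sketch under my own assumptions (the MIN_BRICKS constant and all names are invented for illustration; the real method also sizes the bricks themselves):

```java
// Sketch of the setupBricks() decision: given a memory budget and a brick
// (window) size, how many bricks can hold a mapped buffer at once? If even
// a minimal number cannot be afforded, fall back to per-record
// PersistenceRows, i.e. no caching. Numbers and names are illustrative.
class BrickSetupSketch {
    static final int MIN_BRICKS = 16; // hypothetical lower limit

    // returns the number of simultaneously mapped bricks, or 0 for row mode
    static int mappedBrickBudget(long availableMem, int brickSize) {
        long count = availableMem / brickSize;
        return count < MIN_BRICKS ? 0 : (int) count;
    }
}
```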
Whenever a store manager requires a buffer of a record for a read or write, it passes the record's id to the underlying PersistenceWindowPool. The pool translates the id to an index in the brick array (id*recordSize is the file offset, and offset/brickSize, brickSize being the window size, is the brick index). If the resulting brick has a PersistenceWindow in place, it is returned. Otherwise, a PersistenceRow is created and kept for reference in a map (or, if already in the map, it is retrieved) and returned. Either way, the window is mark()ed and lock()ed. At the same time, statistics on brick hits and misses are updated, both in the target brick and overall.
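The translation in the parenthesis above is a two-step computation, sketched here as a hypothetical helper (illustrative names, not the actual pool code):

```java
// Sketch of the pool's id-to-brick translation: multiply the record id by
// the record size to get a file offset, then divide by the brick (window)
// size to find which brick covers that offset.
class BrickMath {
    static int brickIndexFor(long recordId, int recordSize, int brickSize) {
        long fileOffset = recordId * recordSize; // byte position in the file
        return (int) (fileOffset / brickSize);   // index into the brick array
    }
}
```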
When the brick misses pass a certain threshold value (50k at the time of this writing), a procedure to refresh the bricks is started. The bricks are separated into mapped and unmapped (based on whether they currently encapsulate a mapped window over the file or not) and each list is sorted on its elements' hit count. Then, for the most hit-upon unmapped bricks, and while there is available memory, new PersistenceWindows are allocated and assigned (the choice of mmap()ed over heap-based is class wide, set at construction). After that, the least hit-upon mapped bricks are stripped of their buffers (after flushing them to file) and their space is assigned to unmapped bricks, as long as those unmapped bricks have more hits than they do. Obviously, lock()ed or mark()ed windows are left in peace; someone is using them, after all. There, that takes care of LFU cache semantics.
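The core of that refresh pass can be sketched as a sort-and-swap over the two lists. A deliberately simplified illustration under my own names (the real code also flushes buffers, checks the memory budget and skips locked windows):

```java
import java.util.*;

// Sketch of the brick-refresh idea: sort unmapped bricks by hits descending
// and mapped bricks ascending, then move memory from the coldest mapped
// bricks to hotter unmapped ones, stopping once the hottest remaining
// unmapped brick is no hotter than the coldest remaining mapped one.
class Brick {
    int index; int hits; boolean mapped;
    Brick(int index, int hits, boolean mapped) {
        this.index = index; this.hits = hits; this.mapped = mapped;
    }
}

class RefreshSketch {
    static void refresh(List<Brick> bricks) {
        List<Brick> mapped = new ArrayList<>(), unmapped = new ArrayList<>();
        for (Brick b : bricks) (b.mapped ? mapped : unmapped).add(b);
        unmapped.sort((a, b) -> b.hits - a.hits); // hottest first
        mapped.sort((a, b) -> a.hits - b.hits);   // coldest first
        int i = 0, j = 0;
        while (i < unmapped.size() && j < mapped.size()
                && unmapped.get(i).hits > mapped.get(j).hits) {
            mapped.get(j).mapped = false;   // flush + drop buffer in the real code
            unmapped.get(i).mapped = true;  // gets the freed memory
            i++; j++;
        }
    }
}
```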
The last important piece of code is the release() method. When the store manager is done using a window, it must return it to the pool for unLock()ing and possible recycling. In the case of PersistenceRows, it is immediately written out to disk and if no one else is using it, it is removed from the cache.
That pretty much covers the basic operations of the PersistenceWindowPool. Now, store managers can retrieve buffers from the disk based on record ids without worrying about management, including locking. That is something of a relief, wouldn't you say?

Higher up: The store managers

With the details out of the way, we can now look at the storage managers.
There are as many as there are data files so it would be kind of boring to enumerate them all here, more so since I mentioned them all in the previous post. Instead I will write about the core operations they expose and how they pass from bits to objects.
The structure of every record in the storage has a corresponding object in the package (node records have NodeRecord, relationship type records have RelationshipType, all dynamic records have DynamicRecord and so on). Every one of these objects is a simple holder of the values of the corresponding record, with two exceptions. First, DynamicRecord is a generic holder for dynamic record data, meaning that it holds either byte[] data or char[] (aka String) data (but not both), in addition to the next and previous block id values. The other is that for records that refer to a dynamic store (e.g. the string value of a property), the corresponding field is not an integer (as an id normally would be). Instead it is a Collection (List or Map) of DynamicRecords that store in their byte[] or char[] arrays the corresponding value (each block is represented by a different DynamicRecord, hence the need for a Collection). When are they initialized, you may ask? Well, patience, we will get to that.
Now that we know about records and their implementation in Java, it should be apparent that every store manager manages its corresponding record type: NodeStore manages NodeRecords, DynamicArrayStore manages DynamicRecords that store data in their byte[] field, and so on. Each store knows the size and structure of the records it manages. At creation time it allocates a PersistenceWindowPool that manages its file, passing it its recordSize (block size for dynamic stores), so all that remains is to translate bytes from the file into values for the fields of its record objects and write out any changed records it receives.
Reading a record from a store requires an id. The store implementation requests a window from its PersistenceWindowPool that covers that id, asks the returned window for a byte buffer offsetted at the id and starts reading primitives as the record structure dictates. Writing a record follows the same procedure, except the record is written into the buffer. The flushing of the contents is not something the store manager is concerned with, except at closing time.
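To make "reading primitives as the record structure dictates" concrete, here is a sketch of deserializing a 9-byte node record (an in-use byte followed by two int pointers, per the layout from the previous post). The class and field names are illustrative, not the actual Neo4j source:

```java
import java.nio.ByteBuffer;

// Illustrative deserialization of a node record from an offsetted buffer:
// read the in-use flag, then the first-relationship and first-property ids,
// in the order the on-disk record structure dictates.
class NodeRecordSketch {
    boolean inUse;
    int nextRel;
    int nextProp;

    static NodeRecordSketch readFrom(ByteBuffer buf) {
        NodeRecordSketch r = new NodeRecordSketch();
        r.inUse = buf.get() == 1;  // first byte: in-use flag
        r.nextRel = buf.getInt();  // id of the first relationship
        r.nextProp = buf.getInt(); // id of the first property
        return r;
    }
}
```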

Heavy or light?

PropertyIndexStore, PropertyStore and RelationshipTypeStore maintain, apart from their corresponding record stores, dynamic stores for keeping variable length data. How and when is that data fetched into memory? Well, by default it isn't. The result is called a light record: a record whose collection of DynamicRecords is uninitialized. If you want the data, you must first call makeHeavy() on the corresponding store manager to have it fill the DynamicRecord collection that belongs to that record. These DynamicRecords will themselves be light, meaning that only the block chain ids are retrieved, not the contents. After making the record heavy, you ask the record's store to retrieve the object kept in the DynamicRecord collection. This is necessary, since you do not have access to the underlying dynamic store, and even if you did, only the record's manager knows where and how things are stored. The store manager iterates over the DynamicRecord collection, makes each one heavy and then asks the dynamic store to translate the resulting byte[] into something meaningful.
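The light/heavy dance can be sketched as follows, with the dynamic store reduced to a stub. All names here are mine and the block fetching is faked; the point is only the shape of the laziness:

```java
import java.util.*;

// Sketch of the light/heavy idea: a record starts light (its DynamicRecord
// collection uninitialized), makeHeavy() fills in the block chain, and only
// then can the store decode the blocks into a usable value.
class PropertyRecordSketch {
    List<byte[]> dynamicBlocks; // null while the record is light

    boolean isLight() { return dynamicBlocks == null; }
}

class PropertyStoreSketch {
    void makeHeavy(PropertyRecordSketch r) {
        // the real code walks the block chain in the dynamic store;
        // here we just simulate fetching two blocks of a string value
        r.dynamicBlocks = Arrays.asList("hel".getBytes(), "lo".getBytes());
    }

    // translate the block contents into something meaningful (a String)
    String getStringFor(PropertyRecordSketch r) {
        if (r.isLight()) makeHeavy(r); // lazily hit the disk only now
        StringBuilder sb = new StringBuilder();
        for (byte[] block : r.dynamicBlocks) sb.append(new String(block));
        return sb.toString();
    }
}
```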
This mechanism allows for lazily fetching stuff from the disk, actually hitting the platters only when you need something. The traversal of the graph does not require fetching all the stuff a Node or Relationship might have hanging under it, so we can dispense with I/O until we actually need it.

Enough building for now

All the storage managers are kept by a NeoStore instance that does not do much: it creates them, holds them and also keeps its own small file. It is not exposed to the user; instead it is encapsulated inside the transaction handling code. But that is another complicated topic, one that will be discussed in my next post.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.


  1. I wish you'd explain the Txn handling and multi-threading. But before that, pls fix the font colors :)


  2. These posts are really helpful as I dive into the world of neo4j. It makes a difference to understand the inner working so that it is clear how to avoid going against the grain of you and your colleague's mental model.