Thursday, November 10, 2011

Rooting out redundancy - The new Neo4j Property Store


So, for the last 2 months we've been working diligently, trying to create the 1.5 release of Neo4j. While on the surface it may look like little has changed, under the hood a huge amount of work has gone into a far more stable and usable HA implementation and rewriting the property storage layer to use far less disk space while maintaining all its features and providing a speed boost at the same time. In this post I will deal exclusively with the latter.

Departing from the old model: A new way to store things

So far, the properties were stored on disk in a doubly linked list, where each of its nodes contained some necessary administrative/structural overhead and the actual property data. More specifically, the layout was:
byte  0     : 4 high bits of previous pointer, inUse flag
byte  1     : unused<
byte  2     : 4 high bits of next pointer
bytes 3-4   : property type
bytes 5-8   : property index
bytes 9-12  : previous pointer 32 low bits
bytes 13-16 : next pointer 32 low bits
bytes 17-24 : property data

The last 8 bytes where the value stored, enough to accommodate all primitive values, a short string or a pointer to the dynamic store, where a dynamic record chain would store a long string, an array of primitives or String[].

There is some waste here, in part because the full 8 bytes are used in the (rare) cases of storing doubles and longs or for short strings but mostly because this pointers are repeated for each property, making the impact of the structural overhead felt. On the flip side, the Short String optimization was a great success, proving the value in inlining more property types. So we decided to highlight the good parts and lowlight the bad, ending up with a PropertyRecord structure that is no longer equivalent to one property but acts as a container for a variable number of variable length properties. The current layout is:
byte  0    : 4 high bits of previous, 4 high bits of next pointers
bytes 1-4  : previous property record
bytes 5-8  : next property record
bytes 9-40 : payload
Yes, that is correct, no inUse flag, explained by the payload structure.

First, let's call the 4 8-byte-blocks in payload just blocks, to have a simple name for them. Each of these blocks is used in various ways, depending on the property data type. Starting off, every property needs to have the property index and the property type. These are common and always present, with the property index taking up the first 3 bytes of the block and the type taking up the 4 high bits of the 4th byte. Now, after that comes the property value. If it is a primitive that fits in 4 bytes, then the 4 low bits of the 4th byte are skipped and the remaining 4 bytes of the block are used to store the value and we are done. When storing a pointer into the DynamicStore for non-short strings and for arrays, the 36 bits required find home to the second half of the 4th byte and the low order 4 bytes. This means that each PropertyRecord can store up to 4 such properties - a huge saving in space.
For longs and doubles which require 8 bytes, the 4 1/2 trailing bytes are skipped and instead the next block is used as a whole to store the value. This leads to some waste but it is still more efficient than the previous method and it is a relatively rare use case.

What remains is ShortStrings and the brand new ShortArray. Since we saved all that space and I/O calls with ShortString, why not expand on the idea? We now have LongerShortString, which is like ShortString but on crack. It operates on the same principle - it scans a string, sees if it falls within an encoding, encodes it and stores a header with the length and the encoding table id and then the actual data, encoded in longs that take up blocks right after the property info. If it doesn't fit in the max of 3 1/2 blocks of a property record, it is instead encoded as UTF8 and stored in the DynamicStringStore. A similar idea is applied to arrays. When passed a primitive array we first determine the minimum number of bits required to store its values, effectively shaving off all the leftmost zeros we can while keeping all array members the same size. This means that if we are asked to store new int[] {1,2,3,4,5}, the entries will take up not 32 but 3 bits each. boolean[] for example costs 1 bit per entry. Obviously, mixing in even a single negative value gives immediately a maximum number of bits per entry. So, to store an array we first determine this number and then the header becomes:

   4 bits, an enum value identifying the primitive type
   6 bits, the length of the array
   6 bits, the number of bits per item

and then follow the "bit shaved" array entries. The same algo is used for dynamic arrays as well, but the length is actualy stored in the length field of the dynamic record (as usual), not the ShortArray header and we just keep how many bits of the last byte are used. That, along with the bits per entry  number are enough to reconstruct the value. Of course, in this case as well, if the array does not fit in the PropertyRecord even after this "compression", it is stored in the DynamicArrayStore as usual, though now in its bit-shaved form as byte[], meaning less DynamicRecords are used so less waste. This comes at the price of reconstructing the array when reading it in, but the reduced I/O more than makes up for it. A more exact description of the new ShortString, including all the ShortString classes and size limits, as well as the new ShortArray, is available in the manual.

What about the mystery of the missing inUse flag? Well, that is a combination of 2 things. One is that the blocks are marked individually as in use or not, since the API allows for a property to be deleted, and now a property is no longer a record but a collection of blocks. So we folded that into the property type, with 0 signifying not in use. The second is that the blocks are written out defragmented on disk, meaning that if from 3 properties in a record we delete the middle one (set its type to deleted), then only the remaining two will be written. This leads to a simple method of marking "no more properties in this record" by writing a 0 for the 4th byte of the first not-used block (the implementation just writes a whole long). A corollary of this is that a property record that has the 4th byte of the first block 0 is actually not used.

Code walkthrough

I was going to outline the changes/additions at a source code level here, but this post is getting too long. Besides, from the above the code becomes straightforward to follow. If you have any questions, suggestions or would like to small talk about the implementation, drop by our mailing list.

Just a tweaking note here - the logic of when and how allocation of blocks happens and the defragmentation strategy is held in WriteTransaction. Go ahead and experiment with what best suites your use case - feedback on these code paths will be greeted with much rejoice!

Out with the old, in with the new: Migrating the store

Unlike the 4+ billion changes for extended address space changes a while ago, this store upgrade cannot happen in place over an old database. We need to do a true migration, meaning recreating the store from scratch and replacing your existing data files with the new ones. This process is extremely safe: It never writes in your existing data files, it is crash resistant (so if it fails mid-way nothing bad happens) and keeps a backup of your data (under upgrade-backup/ in the database directory). However, better safe than sorry, so it is considered good practice to keep an independent backup of your data.

The store migration process is relatively straightforward - it goes over the node and relationship stores, copying them over as they are and, for each primitive, it reads in the property chains, transforms them in the new format and stores them. That has the side benefit of compacting the property store, skipping over deleted entries, so you should notice a significant reduction in disk usage if you happen to delete lots of properties and not restart often.

All the migration code is bundled in the kernel source in package org.neo4j.kernel.impl.storemigration and can be run both as a standalone tool and as part of normal startup - so no matter if you use the server scripts or just the kernel library, just set the config option "allow_store_upgrade"="true" and you are set to go.

Onwards and upwards

There are more stuff in this release that can fit in a blog post. Long discussions in the community have ended up providing inspiration for substantial changes which not only provide robustness in the current offering but pave the way for more exciting features to come. So, maybe "Boden Bord" is not filled to the brim with obvious new features, but rest assured, we are in for a wild next year.

Thank you for making Neo4j what it is.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.


  1. Hi Chris,

    Thanks for sharing the insight! I'm using Neo4j in storing social graphs. It's great to know the underlying storage method.

    I wonder how this new storage method handles the case where only 2 out of the 4 blocks are used. Is there a possibility that misinterpretation happens because the empty 2 blocks have random bits? Or the empty blocks are marked by special property index and types?

    Also, one thing I find is that usually I will set several properties of a node at a time. Is it possible to further reduce "structural overhead",i.e. pointers, by making the number of blocks in the "payload" variable? So that one property record could contain several properties and no space will be wasted, and the linkedlist would look like linked chunks (of variable size).

  2. Thanks for sharing. I read this twice and I am still confused. I need to read this many times to clearly understand. May be a pictorial representation will help.

    I guess I understand the idea, at high level its a double linked list, but you guys have done some deep thinking into making best use of each bit in storing the data.