Saturday, October 9, 2010

Neo4j Internals: File Storage

NOTE: This post is quite outdated, stuff has changed since i wrote this. While you can somewhat safely ignore the alterations for increased address space of entities, the Property store has changed in a fundamental way. Please find the new implementation here.

Ah, the physical layer! Storing bits and bytes on spinning metal, away from the security and comfort of objects and high-level abstractions. This is the realization of any database system, the sole purpose for which it is build. In this post I will try to explain the storage subsystem of Neo4j, exposing the structure and purpose of all those "neostore.*" files that your graph lives in. The management code and all those layers that give meaning to and make usable theses buckets of bits will be discussed at my next post. I should make clear however that, although I will try not to assume much, you should have some idea of what it means to manage data files, a vague idea of database storage schemes (like, what an index is), in general a familiarity with I/O. Also, this is a large post, so grab a coffee before you start.

Which files again?

By now you should be aware that your graph lives in a bunch of files under the directory which you instructed your instance to store them. All their filenames start with "neostore." and these are the ones we discuss today. There are more files and directories there, for the lucene indexer and the logs but they are not of interest to us here.
The first thing to note, just from the filenames is that they somewhat go in twos. One is the ".db" filename, for example "neostore.nodestore.db" (except for "neostore") and the other is the ".id" file, "neostore.nodestore.db.id". Also, there is an apparent hierarchy, especially in the "propertystore" branch as we see "neostore.propertystore.db", "neostore.propertystore.db.index" etc. There is of course a reason for all these, but the details will become apparent later. For now, lets talk about ids.

Recycling Ids

I will tell a lie now but I have to start somewhere. Lets assume for a moment that neo's data is stored in equally sized chunks of known size and lets call these chunks records. There, that was easy to believe, wasn't it? The structure and size of a record can vary between files (this is obvious - a relationship record cannot possibly be the same as a node record) but they are fixed, at least per storage version scheme. Each record has an id (this id is known to neo users as the id of the primitive, available via getId()) to be able to refer to it. However, given an id, how do you find where in the file are its data? If you have an idea about databases, various access schemes must have popped in your mind now and the most obvious is the correct answer. Since the record is of known size and we have the liberty to assign ids as we see fit (this is our code after all), then why not go with long ids, incrementing the last assigned id by one for every new record? Then the offset in the file is simply the id*RECORD_SIZE. Brilliant. But, a problem arises, called fragmentation. Think about deleting a node. Its storage space, possibly in the middle of the file, is available to store a new node record, but the file offset and the id have a one-to-one correspondence. The only way to reuse the hole we created is to assign its id to the next request for a new node. This is why ids are recycled in neo. This scheme calls for a manager of the available ids, which will keep a list of recycled ids and the next highest available id in case those run out. The idea is that when we request an id, if there are recycled ones we get one of those, otherwise a new one is created.

The IdGenerator and its data file

Of course, a neo instance is not expected to run for ever, so this list of recycled ids must also be kept in storage, since they must be available between restarts so that space does not become fragmented (aka wasted). This is where the org.neo4j.kernel.impl.nioneo.store.IdGenerator and its implementation IdGeneratorImpl come into play. Every manager of every datafile keeps an IdGenerator at hand to manage ids for it. IdGenerator abstracts the low level details by managing the id store (the ".id" file that goes hand in hand with every ".db" file), flushing out occasionally and reading in batches of recycled ids and accepting freed and removing used ids. Its file format is very simple actually. It keeps at the beginning of its file a byte that marks a clean shutdown, called "sticky". After open()ing a channel over the id file, it sets the sticky bit to 1 and writes it out, signifying a "dirty" file, meaning that it is not up to date to the actual defragmented ids. This byte is set to 0 only after a successful flush of the recycled ids and right before the close() of the file. So, if something crashed and the file is inconsistent, at restart this is known and a rebuild process is initiated, but more on recovery later.
A parameter called grabSize determines the both the maximum number of unflushed recycled ids and the batch read of stored ids. When this size is reached (for write operations, that is when the storage manager that uses this IdGenerator has free()d that many ids), the list is written to disk. When there are no more recycled ids in memory, grabSize ids are read from disk.
Finally, there is always at least one long value stored in the id file and that is the next highest available id. This is stored always right after the sticky bit, making the minimum file size 9 bytes (remember, ids are longs, which is 8 bytes in Java), making these 9 bytes a kind of header for the id file.
Go on, run the "Hello, World" example and open the var/base/neostore.nodestore.db.id file in a hex editor. You should see 9 bytes, the first being 0 (clean shutdown) and the next 8 are a long equal to 3, which is the next available node id, since 1 and 2 are already taken.  There where no deletes, so there are no recycled ids. There, in one sitting the mystery of ".id" files and why there are recycled ids in neo dissolved. Not bad.

Restoring the truth

Remember when I said before that neo organizes its storage in equally sized records? Well, that was not the whole truth. Actually, there are two kinds of storage engines in neo, one is org.neo4j.kernel.impl.nioneo.store.AbstractStore and the other is org.neo4j.kernel.impl.nioneo.store.AbstractDynamicStore. They both inherit from CommonAbstractStore which provides common operations on the IdGenerators, opening and closing, keeping version information and the rest. However, apart from these common mechanisms, they serve different storage needs. AbstractStores manage records of fixed size while DynamicStores manage data of arbitrary length, storing them in blocks and chaining them together, kind of like a LinkedList but on disk. As far as the contents of their files go, there are some common things between them. Both storage mechanisms make use of ids to refer to their building blocks, but whereas AbstractStores can communicate their ids to the outside world, DynamicStores' ids make sense only within the class, so they are not passed outside. Both implementations define a version string (this is part of the implementation of CommonAbstractStore) that is written at the end of the file and is checked at load time, so that the runtime is certain that the file passed is of the right format and version (running strings(1) on any ".db" file will reveal this version string). Also, every building block of these implementations, be it a record or a block, has a first byte that marks it as used or not. Recall that although the id manager knows which ids (and hence file offsets) are in use and which are not, there is the possibility that the application fails to write out the recycled ids. Having a used flag makes recovery possible, since by scanning the file we can re-discover which ids need to be added to the free list. Finally, all stores have an ".id" file, so I will never mention them explicitly again.
Common details aside, lets look at every implementation in isolation, explaining the file it creates and its format.

Record - oriented storage : The primitives

Node store

We start with the most basic of files, the node store. It is implemented at org.neo4j.kernel.impl.nioneo.store.NodeStore and creates the file "neostore.nodestore.db". Every node you create is stored here, in a simple, fixed size record of 9 bytes in total. The first byte is the use flag. The next 4 are an integer that is the id of the first of the relationships this node participates in (more on relationships in a bit). The final 4 bytes are the integer id of the first of the properties this node has. That's it.
A word about ids before we go any further, although this should be clear by now. Records in an AbstractStore do not store ids. Their offset in the file defines their id and their id is defines their offset in the file. There is no need to store it anywhere.

Relationship store

The relationship store is implemented by org.neo4j.kernel.impl.nioneo.store.RelationshipStore and creates the file "neostore.relationshipstore.db". Obviously it stores relationships, in a record of 33 bytes. First comes the used byte, with the second least significant bit used as a flag to indicate whether this relationship is directed (1) or not (0). Then 4 bytes, the id of the first node and another 4 bytes as the id of the second node. Next is another 4 bytes, an integer that is the id of the record that represents the type of this relationship (more on relationship type storage later). The next 4 integers (16 bytes for those that do not keep count) are in order: The id of the previous relationship of the first node, the id of the next relationship of the first node, the id of the previous relationship of the second node and finally the id of the next relationship of the second node. The last 4 bytes are the id of the first property of this relationship.

Relationships (as well as properties) form a doubly linked list on disk. They are part of a chain that contains all that is to know about a primitive and the means to reach it. A special value (-1) is used to indicate the absence of another entry in the list in either direction. The links are ids of records in the same or a different store, which are translated by the proper storage manager into proper offsets. Who requests the accesses, how they are performed and transformed into useful objects will be the subject of the next post. For now, on to...

RelationshipType store

The relationship type store is implemented by org.neo4j.kernel.impl.nioneo.store.RelationshipTypeStore and creates the "neostore.relationshiptypestore.db". It also needs a place to store the relationship type name (which is a String and therefore of variable size), so it creates a private DynamicStore which stores its data in "neostore.relatioshiptypestore.db.names". The record size for the type is 5 bytes, the first as usual the in_use byte and the 4 remaining the id of the block that stores the String that is the name of the type.

Property store

A main feature of any graph database is the ability to store properties at its primitives. A property in neo, both in nodes and relationships, is a name-value pair, the name being a string and the value being any Java primitive (int, long etc) or Strings or arrays of those. A thing of note is that many neo primitives can have the same name for a property (for instance, both "Trinity" and "Morpheus" nodes will have a "rank" property) but every primitive will have possibly a different value for its property (the "Trinity" node has a different "rank" value than "Morpheus"). However, we want to also be able to find nodes with a specific property, regardless of its value. These requirements lay out a specific path. Primitive property names should actually be commonly shared between them but the values should be private to each instance. This calls for an indirection in the storage, an observation that will come handy in understanding how properties are stored.

The property storage is primarily managed by org.neo4j.kernel.impl.nioneo.store.PropertyStore, which stores its data in "neostore.propertystore.db". Each property record begins with the in_use byte, an integer that stores the type of the property (int, String, long[] etc, as defined in org.neo4j.kernel.impl.nioneo.store.PropertyType), an integer that is the id of the property index (see below), a long that is an id to a DynamicStore (a DynamicStringStore or DynamicArrayStore, depending on they property Type, stored either in "neostore.propertystore.db.strings" or "neostore.propertystore.db.arrays")  that stores the value OR (depending on the Type field) the actual value (there is no need to redirect to another file if the Type is integer and can be held there) and finally two integers that are of course ids of the previous and next property of the owning Primitive, for a total of 25 bytes.

The property index is stored in "neostore.propertystore.db.index" and is managed by an org.neo4j.kernel.impl.nioneo.store.PropertyIndexStore created and held by PropertyStore. Its record starts with an in_use byte, an integer that keeps a property count and lastly another integer that is the id in a DynamicStore (a DynamicStringStore that keeps data in "neostore.propertystore.db.index.keys") that keeps the property name. Yeap, that's 9 whole bytes. This is what makes possible for different neo primitives to have a common property. The key_index_id of the Property record (the integer at offset 0x8, after the Type field), points to a record in the PropertyIndexStore. This is common for all properties that have the same name. Voila! Property sharing and space preservation.

Block - oriented storage: Strings and arrays

The previous storage engines had fixed size records, composed mainly of pointers to other records and holding little actual data. The storage of your precious, variable size String or array property values must obviously be handled in a different way. This is the job DynamicStores where made. We already saw their use, as part of PropertyStore and RelationshipTypeStore, both of which need to store Strings or arrays of Java primitives. Now we look at their structure.

The general block structure

The variable size nature of arbitrary data is accommodated by a doubly linked block structure within the dynamic store file. What you need of course is an in_use byte, a pointer to the previous block and one to the next block, the size of the data and finally, the data. This lot is bounded by a maximum block size to make finding offsets in the file possible. If a piece of data requires more space than blockSize, more blocks are added to its chain to fit it in. There is a trick at play here, meaning that all blocks have the same size, that is blockSize. However, the last block in each chain (or the first, if it is only one) can use less than the allocated amount, a fact indicated in the length field.
The actual layout of each block header is the in_use byte, the int id of the previous block (-1 if it is the first or only block), the length as an int and the int id of the next block (-1 if it is the last block in the chain or the only block). After these 13 bytes comes the data, with max length blockSize, for a total actual block length of blockSize+13. Users of DynamicStores are aware only of the id of the first block in a chain, meaning that at the records discussed previously, all ids to DynamicStores refer to the id of the first block.
Finally, the first block of every DynamicStore is reserved and instead of the above format stores in its first 4 bytes an integer that describes the block size used by this store. This way, when a DynamicStore is opened from a new instance of neo the size of the block can be found and compared with the expected (as defined in the Configuration), saving us a lot of trouble in case of mistakes.

Strings vs Arrays

A dynamic store has no use in storing a single int or long. That is apparent in the property store, where properties that are single Java primitives are stored directly in the record (and that explains why the property_block_id is a long and not an int). So, we need to store Strings, primitive arrays and String arrays. This is done in somewhat different ways.

org.neo4j.kernel.impl.nioneo.store.DynamicArrayStore is responsible for storing arrays. In general, arrays of Java primitives are no trouble. Just read all data as bytes and separate them in bytes, shorts, chars, longs, whatever. The type of the primitive held is the first byte of the data held in the first block, as defined in the private enum DynamicArrayStore.ArrayType. Arrays of Strings are a little bit more complicated, as they are actually arrays of arrays. What happens is that the first byte flags the data as String[], the next integer is the size of the array and the next integer is the size of the first String. After the first String is read in, the next integer is the size of the next String and so on.
For example, to store {"Hello", "World"}, you store a (byte) 2 which is the ArrayType for String[]. Then you store a (int) 2, which is the number of Strings. Then you store a (int) 10, which is the size of the first String (remember, chars are 2 bytes). Then goes the "Hello" string, then another (int) 10 and then "World". This series of 33 bytes may span more than one block but this is abstracted by the store manager.

org.neo4j.kernel.impl.nioneo.store.DynamicStringStore does not impose any kind of additional structure on its data (unlike DynamicArrayStore), so the implementation of AbstractDynamicStore and its block structure are enough to just throw the String bytes in there and be done with it. That was easy.

Almost there

18 of the 20 "neostore." files created by a standard EmbeddedGraphDatabase usage have been covered. What remains is the top level "neostore" file. This is managed by org.neo4j.kernel.impl.nioneo.store.NeoStore and stores all in all three records. Each is 9 bytes, the in_use byte and a long. The first record stores the creation time for the database, the second a random number and the third the database version. That's it.

On a more personal note

Fiddling with the details of the on disk storage is a pain and I am not sure I have done my job of explaining them correctly. The best way to see these for yourself is to create tiny sample databases and look at the resulting files with a hex editor (I know I did). Also, regardless of my dislike for creating graphics, I have a table that demonstrates the structure of the records and blocks as we discussed them above. Enjoy.



Finally, the end

Whew, a long post at last to an end. We only talked about the files however, not at all about the code. In the next post we shall see how all this structure is used and what optimizations are in place to make touching the disk are rare as possible.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

11 comments:

  1. Very nice! However I found the text very hard to read - the white text on the black background is not easy to read.

    ReplyDelete
  2. Agreed - great content, v painful to read.

    ReplyDelete
  3. Thanks for sharing! Hope you enjoyed reading the code for the benefit of the readers!

    ReplyDelete
  4. Wow! This was a great read. I'd already started making some correct assumptions about the storage, like the fixed record sizes for nodes, so that id's would map directly to disk locations, but you really filled in the gaps for the more complex structures, like the strings and arrays and string[] property stores. Very nice indeed.

    Next up, I want to read about caching.

    ReplyDelete
  5. Thanks for nice blog.We can use data for file storage.We should have some idea of what it means to manage data files and overview idea of database storage schemes.You are right that this list of recycled ids should also be kept in storage that can be used if needed.

    ReplyDelete
  6. Great post! I hope you will share more with us. You have discussed all the issues very clearly regarding file storage. I am sure my visitors will find that very useful.

    ReplyDelete
  7. This is great stuff--the dumps I was seeing from the block style files didn't make sense until I read this. In particular, the "reserved" one at the beginning threw me off quite a bit. I'm still confused by one thing I'm seeing in my hex dumps all around. It seems like null pointer "ids" are specified by 0xFFFFFFFF, but the high order bits aren't set. Does this mean that there's one inaccessible block that needs a special case at ~4billion somewhere, or what am I missing?

    ReplyDelete
  8. Mystery solved, after Mr. Hunger showed me how to insert by node id:
    java.lang.IllegalArgumentException: id 4294967295 is reserved for internal use

    ReplyDelete
  9. The id based offsets are great for a single storage file, but how is storage handled when it is distributed over a cluster?

    ReplyDelete
  10. Thanks so much for this post! I've been looking for this information for a while now. Keep up the great work on this blog!

    ReplyDelete