Thursday, November 10, 2011

Rooting out redundancy - The new Neo4j Property Store


So, for the last 2 months we've been working diligently, trying to create the 1.5 release of Neo4j. While on the surface it may look like little has changed, under the hood a huge amount of work has gone into a far more stable and usable HA implementation and rewriting the property storage layer to use far less disk space while maintaining all its features and providing a speed boost at the same time. In this post I will deal exclusively with the latter.

Departing from the old model: A new way to store things

So far, the properties were stored on disk in a doubly linked list, where each of its nodes contained some necessary administrative/structural overhead and the actual property data. More specifically, the layout was:
byte  0     : 4 high bits of previous pointer, inUse flag
byte  1     : unused<
byte  2     : 4 high bits of next pointer
bytes 3-4   : property type
bytes 5-8   : property index
bytes 9-12  : previous pointer 32 low bits
bytes 13-16 : next pointer 32 low bits
bytes 17-24 : property data

The last 8 bytes where the value stored, enough to accommodate all primitive values, a short string or a pointer to the dynamic store, where a dynamic record chain would store a long string, an array of primitives or String[].

There is some waste here, in part because the full 8 bytes are used in the (rare) cases of storing doubles and longs or for short strings but mostly because this pointers are repeated for each property, making the impact of the structural overhead felt. On the flip side, the Short String optimization was a great success, proving the value in inlining more property types. So we decided to highlight the good parts and lowlight the bad, ending up with a PropertyRecord structure that is no longer equivalent to one property but acts as a container for a variable number of variable length properties. The current layout is:
byte  0    : 4 high bits of previous, 4 high bits of next pointers
bytes 1-4  : previous property record
bytes 5-8  : next property record
bytes 9-40 : payload
Yes, that is correct, no inUse flag, explained by the payload structure.

First, let's call the 4 8-byte-blocks in payload just blocks, to have a simple name for them. Each of these blocks is used in various ways, depending on the property data type. Starting off, every property needs to have the property index and the property type. These are common and always present, with the property index taking up the first 3 bytes of the block and the type taking up the 4 high bits of the 4th byte. Now, after that comes the property value. If it is a primitive that fits in 4 bytes, then the 4 low bits of the 4th byte are skipped and the remaining 4 bytes of the block are used to store the value and we are done. When storing a pointer into the DynamicStore for non-short strings and for arrays, the 36 bits required find home to the second half of the 4th byte and the low order 4 bytes. This means that each PropertyRecord can store up to 4 such properties - a huge saving in space.
For longs and doubles which require 8 bytes, the 4 1/2 trailing bytes are skipped and instead the next block is used as a whole to store the value. This leads to some waste but it is still more efficient than the previous method and it is a relatively rare use case.

What remains is ShortStrings and the brand new ShortArray. Since we saved all that space and I/O calls with ShortString, why not expand on the idea? We now have LongerShortString, which is like ShortString but on crack. It operates on the same principle - it scans a string, sees if it falls within an encoding, encodes it and stores a header with the length and the encoding table id and then the actual data, encoded in longs that take up blocks right after the property info. If it doesn't fit in the max of 3 1/2 blocks of a property record, it is instead encoded as UTF8 and stored in the DynamicStringStore. A similar idea is applied to arrays. When passed a primitive array we first determine the minimum number of bits required to store its values, effectively shaving off all the leftmost zeros we can while keeping all array members the same size. This means that if we are asked to store new int[] {1,2,3,4,5}, the entries will take up not 32 but 3 bits each. boolean[] for example costs 1 bit per entry. Obviously, mixing in even a single negative value gives immediately a maximum number of bits per entry. So, to store an array we first determine this number and then the header becomes:

   4 bits, an enum value identifying the primitive type
   6 bits, the length of the array
   6 bits, the number of bits per item

and then follow the "bit shaved" array entries. The same algo is used for dynamic arrays as well, but the length is actualy stored in the length field of the dynamic record (as usual), not the ShortArray header and we just keep how many bits of the last byte are used. That, along with the bits per entry  number are enough to reconstruct the value. Of course, in this case as well, if the array does not fit in the PropertyRecord even after this "compression", it is stored in the DynamicArrayStore as usual, though now in its bit-shaved form as byte[], meaning less DynamicRecords are used so less waste. This comes at the price of reconstructing the array when reading it in, but the reduced I/O more than makes up for it. A more exact description of the new ShortString, including all the ShortString classes and size limits, as well as the new ShortArray, is available in the manual.

What about the mystery of the missing inUse flag? Well, that is a combination of 2 things. One is that the blocks are marked individually as in use or not, since the API allows for a property to be deleted, and now a property is no longer a record but a collection of blocks. So we folded that into the property type, with 0 signifying not in use. The second is that the blocks are written out defragmented on disk, meaning that if from 3 properties in a record we delete the middle one (set its type to deleted), then only the remaining two will be written. This leads to a simple method of marking "no more properties in this record" by writing a 0 for the 4th byte of the first not-used block (the implementation just writes a whole long). A corollary of this is that a property record that has the 4th byte of the first block 0 is actually not used.

Code walkthrough

I was going to outline the changes/additions at a source code level here, but this post is getting too long. Besides, from the above the code becomes straightforward to follow. If you have any questions, suggestions or would like to small talk about the implementation, drop by our mailing list.

Just a tweaking note here - the logic of when and how allocation of blocks happens and the defragmentation strategy is held in WriteTransaction. Go ahead and experiment with what best suites your use case - feedback on these code paths will be greeted with much rejoice!

Out with the old, in with the new: Migrating the store

Unlike the 4+ billion changes for extended address space changes a while ago, this store upgrade cannot happen in place over an old database. We need to do a true migration, meaning recreating the store from scratch and replacing your existing data files with the new ones. This process is extremely safe: It never writes in your existing data files, it is crash resistant (so if it fails mid-way nothing bad happens) and keeps a backup of your data (under upgrade-backup/ in the database directory). However, better safe than sorry, so it is considered good practice to keep an independent backup of your data.

The store migration process is relatively straightforward - it goes over the node and relationship stores, copying them over as they are and, for each primitive, it reads in the property chains, transforms them in the new format and stores them. That has the side benefit of compacting the property store, skipping over deleted entries, so you should notice a significant reduction in disk usage if you happen to delete lots of properties and not restart often.

All the migration code is bundled in the kernel source in package org.neo4j.kernel.impl.storemigration and can be run both as a standalone tool and as part of normal startup - so no matter if you use the server scripts or just the kernel library, just set the config option "allow_store_upgrade"="true" and you are set to go.

Onwards and upwards

There are more stuff in this release that can fit in a blog post. Long discussions in the community have ended up providing inspiration for substantial changes which not only provide robustness in the current offering but pave the way for more exciting features to come. So, maybe "Boden Bord" is not filled to the brim with obvious new features, but rest assured, we are in for a wild next year.

Thank you for making Neo4j what it is.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Wednesday, July 27, 2011

A new cache for Neo4j, part 1 : The hash function

Since my last post a lot of things have happened in Neo4j, some of which have made some of the material in this blog outdated. Imagine that! Just half a year later, the kernel crackdown (specifically, the storage layer walkthrough) is pretty much out of date. How awesome is that? My wonderful colleagues at Neo Technology however have shown me that staying still is not an option, so here I am, trying to make another post outdated. This time is the cache's turn, which is about time it received an overhaul. Challenging the current implementation will start at an unexpected place (or is it?), the hash function.

The hash function you say?

Well, makes sense, doesn't it? At least, I hope it does. You see, in general, hash functions are developed with a very specific goal in mind: uniformity over the hash values, which reduces conflicts, the main quality metric of a hash function. The rationale behind this is the more random the output appears in relation to the input, the smaller the probability for two "related" values (what that means is domain dependent) to collide. This however completely ignores some things that could lead to better utilization of the co-domain (the set of result values from the hash function). In particular, if we know that some class of values is going to be more frequently polled that others, it would make sense to have that class mapped to a larger interval of the co-domain so that collisions in this class (and, consequently, throughout) are reduced even more.

Quick example

Say I have the interval [1,100] as the hash function's domain and the interval [1,10] as the co-domain. A typical hash function will try to allocate k*10/100 values for any collection of k values from the co-domain, a goal best reached by uniformity, as discussed. If, however, we knew that the interval [1,20] would be referenced twice as often as the (20,100], wouldn't it be more logical to allocate a bigger chunk than [1,2] to it? Let's see:
First, for a co-domain of arity N of a perfectly uniform hash function, the possibility of a collision for two inputs is 1/N. This is kind of obvious, selecting the first number does not matter and a collision will happen if we select that number and only that number again, thus 1/N. Having said that:
A uniform selection from [1,100] to [1,10] will lead to collision probability 1/10.
A selection with probability 1/2 from [1,20] mapped to 20% of the co-domain and 1/2 from [21,100] mapped to the proportional 80% will lead to 1/2*0.5 + 1/2*0.125=0.3125 collision probability.
A selection with probability 1/2 from [1,20] mapped to 50% of the co-domain and 1/2
from [21,100] mapped to the remaining 50% will lead to 1/2*0.2 + 1/2*0.2 = 0.2 collision probability.
A selection with probability 1/2 from [1,20] mapped to 80% of the co-domain and 1/2
from [21,100] mapped to the remaining 20% will lead to 1/2*0.125 + 1/2*0.5 = 0.3125 collision probability.

The above kind of demonstrate that uniform hashing is not always the best choice and that a sweet spot exists when trying to find an alternate distribution. So let's try that.

Practical matters

When we start operations on our data set it is usually not known beforehand what the value distribution will be but we might have a vague idea, such that there is some locality or other underlying trend. We can develop therefore an adapting hash function, that after it looks at some data can form some idea about their nature and start allocating chunks of the co-domain hopefully in a more efficient manner. Keeping statistics will be key here.
Beforehand, this appears to have two problems. One is performance - hash functions have to be really fast and here we talk about keeping statistics. This is somewhat problematic and actually the only thing we can do is to be relaxed on the accuracy of our statistics, keep them simple and fast and of course, optimize the hell out of it, hoping that the reduction of collisions will increase performance more that this additional code path slows things down.
The other problem is changing hash values for the domain. Since the hash function adapts, by definition this means that the values it gives for a specific input will change. This is not a hash function. The way around that can be either to keep the statistics and actually change the hash function to a better version when we can (such as cache table resize) or keep all versions in memory and hash them all - if even one matches, we have a hit, otherwise we have a miss. Hybrids are possible and is probably what we will discuss at part 2. But for now let's talk about

What statistics to keep

Our main goal is to find out how skewed our domain distribution is. Model fitting data is an open ended problem so we have to resort to heuristics. A key observation is that a uniform distribution over the domain will also be reflected as a uniformity in the distribution of bits in the input values. What I mean by that is that if the input values are truly random, then the frequency of 1's and 0's at a specific bit position will also tend to be equal, otherwise the values would not be selected truly uniformly. So if we keep the numbers of times we see 1 vs the times we see 0 at every bit position, we can begin forming a picture about how values are diverging from a truly uniform distribution - roughly equal 0 and 1 count for a bit will mean that this bit position is not part of a trend. This intuition is simple enough to implement and test and has the potential of being quite fast. What remains is to see the actual collision rate that it gives, which, as with all heuristics, requires testing.

Another example

Assume we want to hash 8 bit integers to 4 bit hashes and the hash function is value modulo 16, aka the 4 LSBs. A serial scan over the whole range would give equal counts of 1 and 0 for each of the 8 positions, which is expected - this is a uniform distribution and the chosen hash is the best possible. But if we were to choose the values {16,32,48,64}  twice as often as the rest, the counters of the bits 5 and 6 would be roughly equal while the rest would be more far apart. So if we were to include these bits in the extraction and not the 4 LSBs then maybe the resulting hash value would demonstrate less collisions since it would vary more.

The implementation

The implementation is actually pretty straightforward. We want the hashes to be from longs to integers. We therefore keep 63 counters initialized at 0 (longs in Java are signed, the MSB is always zero for positives so it is meaningless to track). When the hash of a long is requested, first its bits are looped over - for each position, the corresponding counter is increased by 1 if the bit is 1, decreased by 1 if it is 0. After a number of such hashes we order the counters based on absolute value - the closer to 0 the more uniform the distribution of that bit and the more important it becomes for the hash value. This is the new ordering with which bits are extracted and concatenated to form the hash. In the above example, bits 5 and 6 would be the 2 LSBs in the 4-bit hash.

What are the consequences of this?

The fact that the hash function is an order makes it persistable between restarts of the database, so statistics gathered during a run can be used to optimize subsequent operations. It also means that, if need be, knowledge can be injected from the beginning, getting rid of any warmup period that is needed for the knowledge of the distribution to build up. Finally, multiple hash functions can be stored side by side, providing multiple hashes, either for different caches, for collision resolution, transition periods (eg resizing) or weighting among them for better hash distribution.
The downsides are, besides the slightly increased runtime cost, that if there is no fixed access pattern, then the added overhead is of no benefit (though it should cause at most as many collisions as any similar bit-extracting hash). Even worse, if it is trained and the access pattern changes, collisions can increase significantly until the new pattern dominates. The latter issue can be dealt with simply by signaling that the pattern changes so we should dump all gathered statistics and start from scratch or something more sophisticated such as keeping the old version along side a new one with a weighting factor between them or an additional level of hashing that can choose the hash function to use.

A sample run

Of course this is not all - we have experiments to run. I have done so in an ad hoc fashion so far, so there is no pretty graphic, but i have the first example run that had me giggling for some time after it printed out the result.
The metric is collision count for a hash function with co-domain [0,1023].
The data set is a loop over 0 to 1000 with increment of 1 and over 2048 to 1048576 (2^19) in increments of 1024 (this gives another 1022 iterations with only the bits 10 to 19 changing). The test was two such sets of values - once with no data (just building up the statistics while returning the 10 LSBs) and the other was with the training data gathered from the first. Note that the minimum number of collisions is 512, since we project 2022 points to 1024.
The first run gave a collision count of 1022 and the second gave a collision count of 999. That is a significant save for something as trivial as bit projection and I have already a couple of ideas for weighted XOR operations that will widen significantly that margin.

From here

Since we are building a cache, the hash function is actually a small part of the story, even if it still needs tuning. We have to build the store, come up with an efficient eviction policy and see how the GC will mess with us. There will be plenty of stuff to discuss down the line, so keep an eye on our mailing list and our github repositories if you want to keep up with the changes.

Sunday, February 27, 2011

Springy outside, Graphy inside

It's been a long time since I presented some thing about the greatest graph database out there, and with good reason. A lot of things happened, the most important of which is that I officially joined the crew behind Neo4j. After helping out with the short strings patch, I went back to more familiar territory, providing support for external transaction managers, this time with integration though Spring. So, in this first of two parts I will show you how to get Neo to work in Spring, in standalone mode, with your external transaction manager of choice. The second part will throw a JDBC DataSource in the mix and provide some guidelines on how to perform crash recovery. So, of we go.

As always, some ground rules

This post will focus on the integration of the three frameworks, and as such you should be already familiar with them. Also, you will not see a full blown application here - the main artifacts are JUnit test cases that demonstrate use of the API. With that said, you can go and take a look at this post where the pluggable transaction manager solution for Neo is introduced and you can now find in the official distribution since version 1.3.M01. We will cover some of that here as well however. All setup that follows is Maven based, so classpaths and dependencies are automatically taken care of.

What is the problem that we are trying to solve again?

Say you want to use Neo4j from within the Spring framework, either through Spring Data Graph or as a raw component. Since Neo4j cannot work without explicit transaction boundaries there has to be a way to start, commit and rollback transactions through Spring. If this is your use case fear not, because that is already taken care of, with very clear cut instructions here. This is the simplest use case and it is all you need to get things working with full ACID guarantees, provided that the only participating resource is Neo4j. However, if you need distributed transactions in an XA setting, then you have to take the steps outlined below.

First things first

To better demonstrate the setup and provide you with a starting configuration for your projects, there is a github repository available with an already configured environment. So point your git at Neo4j-Spring-Integration and clone. Usage is demonstrated through JUnit test cases and they are ready to run, with some additional legwork needed if you want Spring's own transaction manager implementation, described below. Note also that the latest Neo4j SNAPSHOT is required, the latest milestone as of this writing (1.3.M03) misses some needed functionality. So, let's go through the various components, in order of significance.

Enter the players: The txModule

In the txModule project you can find an implementation of a @Service module that is needed to enable support for Spring managed transactions in Neo. Since Neo is a fully transactional, XA compliant data source, it is mandatory to work under the supervision of a Transaction Manager even in standalone mode. Actually, even when using "just" Neo4j and you do some indexing, you have full 2PC happening under the hood with a Neo XAResource and a Lucene XAResource. But that is another story. Therefore, if we want the kernel to participate in 2PC with other XA resources (a JDBC connection, for instance) we have to cut out the internal transaction manager and delegate all calls to the externally provided one. The aforementioned module does exactly that, if enabled with a configuration parameter passed in the EmbeddedGraphDatabase constructor. So, mvn install that and you should be ready for the next step. Note that as far as Neo4j is concerned, you are done. The rest is configuring Spring with an external TM and getting it to work, including exposing it to all the XA Resources of your project.

Choosing the Transaction Manager

Spring does not provide its own transaction manager, preferring instead to delegate to a backing implementation, usually in the form of a JNDI-available instance in an Application Server. This is not mandatory however and instead we will opt for something more elaborate - use a (JVM-local) instance of either Atomikos, JOTM or SpringSource's Transaction Manager, some of the most widely used standalone transaction manager implementations.
For JOTM and Atomikos it is easy to get access to and it is already in the pom of the demo.
Spring's TM do not provide a maven repository so you will have to download and install it yourself, a pretty easy task but I want to keep crossovers from the actual project contents to a minimum, so if you want the exact steps for the installation I will refer you to the top level README file.

Also, some notes of caution. JOTM does not provide a bean implementation for retrieving the TransactionManager through Spring and an analogous solution is not available anymore from Spring. For this reason a utility class JotmFactoryBean is provided which facilitates things. Keep this in mind if you want to base your real world project on these examples. Atomikos provides comprehensive information on how to configure it under Spring - the details are already taken care of in the demo project under src/test/resources/spring-atomikos-tx-test-context.xml
Finally, for the Spring Source's JTS implementation to work some custom bean factories must also be present, one for the TransactionManager and one for the TransactionService interface to enable recovery. Both are available in /sampleCode/src/main/java/org/neo4j/integration/spring and the corresponding configuration is in /sampleCode/src/test/resources/spring-native-tx-test-context.xml

The Springy glue

The file sampleCode/src/test/resources/base-tx-test-context.xml contains the basic configuration that binds all things together. The EmbeddedGraphDatabase bean is configured here and the parameter tx_manager_impl is passed in the constructor with a value of spring-jta. Take a look at org.neo4j.jta.spring.SpringProvider. See what i did there? When Neo starts up and finds the @Service for the transaction manager provider with this name it will load it, bypassing the native implementation. The provider gets the Spring's JtaTransactionManager injected and delegates all calls there. Voila! transactions are now handled externally and the graph kernel is unaware of the situation. This in turns enables programmatic transaction demarcation from the Spring API, including the @Transactional annotation. Of course, all facilities of Spring will work, including Spring Graph Data, leading to a practically complete integration of Neo4j with Spring.

Finally, the code

The demo project shows how to actually work with your TM of choice in its test cases. Most of the work is performed in, where all the tests are held. It expects subclasses to provide a valid configuration as a classpath XML file via getConfigName() and then extracts the jtaTransactionManager and dataSource beans for having programmatic control over transaction demarcation and dataSource management for JDBC calls. You will note that the test cases that work with the graph database do not have a graphDb.beginTx() as usual. Instead, the relevant call is tm.getUserTransaction.begin() and commit() or rollback(), as required.

Let's take a closer look at the implementation of the testIndexDependencies() test case:

public void testIndexDependencies() throws Exception {

UserTransaction tx = tm.getUserTransaction();
Node node = null;
Index index = gds.index().forNodes("node");
IndexHits indexHits = index.get("field", "value");
node = gds.createNode();
assertEquals(false, indexHits.hasNext());
Node readBackOutsideOfTx = gds.getNodeById(node.getId());
assertEquals(node, readBackOutsideOfTx);

tx = tm.getUserTransaction();
index.add(node, "field", "value"+node.getId());

tx = tm.getUserTransaction();
Node readBackInsideOfTx = gds.getNodeById(node.getId());
assertEquals(node, readBackInsideOfTx);
indexHits = gds.index().forNodes("node").get("field", "value"+node.getId());


Note that all work is done through interfaces, so the actual implementation can be pushed down to the concrete test case classes. First we get the UserTransaction object from the TM implementation and ask for a transaction to begin(). Now we begin using our Neo4j instance as if a transaction was started with graphDb.beginTx() - only now, if another XA resource was to run in the same thread (and thus, in the same tx context) it would register as an XAResource making this a full 2PC transaction. After some sanity checks we tx.commit(), a method call on the UserTransaction instance, committing the global tx. This triggers behind the scenes a 2PC commit, since there is the neostore and the lucene index XAResources in there. The next chunk of code does a read outside of a transaction, ensuring that the "reads-can-happen-ouside-of-a-transaction" feature of Neo4j still holds. A new transaction sees an addition to the index only. At the end a read back sanity check is performed but from within a tx.

If you have done EJB development you will note here that this is a widely used idiom. This is called bean-managed transactions and is a finer form of control over annotation-based transaction demarcation (or container-managed transactions). This is entirely on purpose and a natural consequence of using the TransactionManager interface.

The concrete test cases do two boring things. One is to provide the actual TM implementation via a proper spring config file. The other is to do any implementation specific testing and take any actions required for things to work - the notable example here is NativeTMIntegrationTest.recover() which calls the recover() method, a required step for normal work to start in the case of Spring's TM.

That was it

Not much to write because it is that easy. With the txModule in place and the Spring abstractions on top it is very easy to plug in your tx manager of choice and have Neo4j participate in full 2PC. In the sample project there is already some groundwork to show how to also get JDBC calls to work with our setup - they require a mySQL database running though. Feel free to experiment with this part also, but wait a bit for an official walkthrough. In the next post we will see how to get the @Ignore tests to work and what is required for recovery in the case of a system crash. All the above would have been far more difficult to create without the help of Michael Hunger, a fellow coder at Neo Technology with knowledge of the ins and outs of both Neo4j and Spring.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License