Wednesday, July 27, 2011

A new cache for Neo4j, part 1 : The hash function

Since my last post a lot of things have happened in Neo4j, some of which have made some of the material in this blog outdated. Imagine that! Just half a year later, the kernel crackdown (specifically, the storage layer walkthrough) is pretty much out of date. How awesome is that? My wonderful colleagues at Neo Technology however have shown me that staying still is not an option, so here I am, trying to make another post outdated. This time is the cache's turn, which is about time it received an overhaul. Challenging the current implementation will start at an unexpected place (or is it?), the hash function.

The hash function you say?

Well, makes sense, doesn't it? At least, I hope it does. You see, in general, hash functions are developed with a very specific goal in mind: uniformity over the hash values, which reduces conflicts, the main quality metric of a hash function. The rationale behind this is the more random the output appears in relation to the input, the smaller the probability for two "related" values (what that means is domain dependent) to collide. This however completely ignores some things that could lead to better utilization of the co-domain (the set of result values from the hash function). In particular, if we know that some class of values is going to be more frequently polled that others, it would make sense to have that class mapped to a larger interval of the co-domain so that collisions in this class (and, consequently, throughout) are reduced even more.

Quick example

Say I have the interval [1,100] as the hash function's domain and the interval [1,10] as the co-domain. A typical hash function will try to allocate k*10/100 values for any collection of k values from the co-domain, a goal best reached by uniformity, as discussed. If, however, we knew that the interval [1,20] would be referenced twice as often as the (20,100], wouldn't it be more logical to allocate a bigger chunk than [1,2] to it? Let's see:
First, for a co-domain of arity N of a perfectly uniform hash function, the possibility of a collision for two inputs is 1/N. This is kind of obvious, selecting the first number does not matter and a collision will happen if we select that number and only that number again, thus 1/N. Having said that:
A uniform selection from [1,100] to [1,10] will lead to collision probability 1/10.
A selection with probability 1/2 from [1,20] mapped to 20% of the co-domain and 1/2 from [21,100] mapped to the proportional 80% will lead to 1/2*0.5 + 1/2*0.125=0.3125 collision probability.
A selection with probability 1/2 from [1,20] mapped to 50% of the co-domain and 1/2
from [21,100] mapped to the remaining 50% will lead to 1/2*0.2 + 1/2*0.2 = 0.2 collision probability.
A selection with probability 1/2 from [1,20] mapped to 80% of the co-domain and 1/2
from [21,100] mapped to the remaining 20% will lead to 1/2*0.125 + 1/2*0.5 = 0.3125 collision probability.

The above kind of demonstrate that uniform hashing is not always the best choice and that a sweet spot exists when trying to find an alternate distribution. So let's try that.

Practical matters

When we start operations on our data set it is usually not known beforehand what the value distribution will be but we might have a vague idea, such that there is some locality or other underlying trend. We can develop therefore an adapting hash function, that after it looks at some data can form some idea about their nature and start allocating chunks of the co-domain hopefully in a more efficient manner. Keeping statistics will be key here.
Beforehand, this appears to have two problems. One is performance - hash functions have to be really fast and here we talk about keeping statistics. This is somewhat problematic and actually the only thing we can do is to be relaxed on the accuracy of our statistics, keep them simple and fast and of course, optimize the hell out of it, hoping that the reduction of collisions will increase performance more that this additional code path slows things down.
The other problem is changing hash values for the domain. Since the hash function adapts, by definition this means that the values it gives for a specific input will change. This is not a hash function. The way around that can be either to keep the statistics and actually change the hash function to a better version when we can (such as cache table resize) or keep all versions in memory and hash them all - if even one matches, we have a hit, otherwise we have a miss. Hybrids are possible and is probably what we will discuss at part 2. But for now let's talk about

What statistics to keep

Our main goal is to find out how skewed our domain distribution is. Model fitting data is an open ended problem so we have to resort to heuristics. A key observation is that a uniform distribution over the domain will also be reflected as a uniformity in the distribution of bits in the input values. What I mean by that is that if the input values are truly random, then the frequency of 1's and 0's at a specific bit position will also tend to be equal, otherwise the values would not be selected truly uniformly. So if we keep the numbers of times we see 1 vs the times we see 0 at every bit position, we can begin forming a picture about how values are diverging from a truly uniform distribution - roughly equal 0 and 1 count for a bit will mean that this bit position is not part of a trend. This intuition is simple enough to implement and test and has the potential of being quite fast. What remains is to see the actual collision rate that it gives, which, as with all heuristics, requires testing.

Another example

Assume we want to hash 8 bit integers to 4 bit hashes and the hash function is value modulo 16, aka the 4 LSBs. A serial scan over the whole range would give equal counts of 1 and 0 for each of the 8 positions, which is expected - this is a uniform distribution and the chosen hash is the best possible. But if we were to choose the values {16,32,48,64}  twice as often as the rest, the counters of the bits 5 and 6 would be roughly equal while the rest would be more far apart. So if we were to include these bits in the extraction and not the 4 LSBs then maybe the resulting hash value would demonstrate less collisions since it would vary more.

The implementation

The implementation is actually pretty straightforward. We want the hashes to be from longs to integers. We therefore keep 63 counters initialized at 0 (longs in Java are signed, the MSB is always zero for positives so it is meaningless to track). When the hash of a long is requested, first its bits are looped over - for each position, the corresponding counter is increased by 1 if the bit is 1, decreased by 1 if it is 0. After a number of such hashes we order the counters based on absolute value - the closer to 0 the more uniform the distribution of that bit and the more important it becomes for the hash value. This is the new ordering with which bits are extracted and concatenated to form the hash. In the above example, bits 5 and 6 would be the 2 LSBs in the 4-bit hash.

What are the consequences of this?

The fact that the hash function is an order makes it persistable between restarts of the database, so statistics gathered during a run can be used to optimize subsequent operations. It also means that, if need be, knowledge can be injected from the beginning, getting rid of any warmup period that is needed for the knowledge of the distribution to build up. Finally, multiple hash functions can be stored side by side, providing multiple hashes, either for different caches, for collision resolution, transition periods (eg resizing) or weighting among them for better hash distribution.
The downsides are, besides the slightly increased runtime cost, that if there is no fixed access pattern, then the added overhead is of no benefit (though it should cause at most as many collisions as any similar bit-extracting hash). Even worse, if it is trained and the access pattern changes, collisions can increase significantly until the new pattern dominates. The latter issue can be dealt with simply by signaling that the pattern changes so we should dump all gathered statistics and start from scratch or something more sophisticated such as keeping the old version along side a new one with a weighting factor between them or an additional level of hashing that can choose the hash function to use.

A sample run

Of course this is not all - we have experiments to run. I have done so in an ad hoc fashion so far, so there is no pretty graphic, but i have the first example run that had me giggling for some time after it printed out the result.
The metric is collision count for a hash function with co-domain [0,1023].
The data set is a loop over 0 to 1000 with increment of 1 and over 2048 to 1048576 (2^19) in increments of 1024 (this gives another 1022 iterations with only the bits 10 to 19 changing). The test was two such sets of values - once with no data (just building up the statistics while returning the 10 LSBs) and the other was with the training data gathered from the first. Note that the minimum number of collisions is 512, since we project 2022 points to 1024.
The first run gave a collision count of 1022 and the second gave a collision count of 999. That is a significant save for something as trivial as bit projection and I have already a couple of ideas for weighted XOR operations that will widen significantly that margin.

From here

Since we are building a cache, the hash function is actually a small part of the story, even if it still needs tuning. We have to build the store, come up with an efficient eviction policy and see how the GC will mess with us. There will be plenty of stuff to discuss down the line, so keep an eye on our mailing list and our github repositories if you want to keep up with the changes.

1 comment:

  1. News Listing - Gehirndoping
    Gehirndoping wird zumeist angewendet, um eine der Steigerung der Intelligenz zu erreichen.
    Hirn doping