Friday, September 12, 2014

Excursions in Javascript land: How the Libraries app for was built

Javascript has always been somewhat of a black box to me, posing in my imagination more like an arcane art rather than a transparent language. Of course, given that it dominates the Web, i assumed it was more a lack of practice than an innate incomprehensibility of the platform. It was quite fortunate therefore that last week i got the opportunity to create my first web application, which also allowed me to use a Javascript library that i have been meaning to check out for quite some time - the Google Data API.

The app lives here. It provides a comprehensive list of public libraries in Greece, searchable by area. The site has been created by Thodoris Georgakopoulos with the purpose of promoting literature and readership in Greece, making the Libraries application a motivator for visitors to engage in reading suggested titles without going through the expense of acquiring the books.

The requirements for Libraries were quite straightforward. It had to make available a list of all known libraries in Greece, ordered alphabetically by location. It had to be searchable, again on the location field, with some ease of use features, like filtering by first letter. The single script that makes up the application can be found here.

Overall there were few, if any, software engineering challenges. The main requirement was that data should be available from and edited in a Google spreadsheet, which immediately made clear the need for access through the Google Data API. You can see the simple call to fetch the spreadsheet data in function fetchTable() which passes the result to the main rendering method. The Google Spreadsheet is a simple Google Drive document that has its permissions set so that anyone with the link can view it, but only specific users can edit it. This allows for embedding the link directly in the script and having no further hosting or processing requirements.

After the DataTable is received, it is iterated over and the data extracted and placed in a new table element with cells setup as the presentation layer expects them, including CSS classes and element onclick() methods. Unfortunately, there are some assumptions in place about the structure of the DataTable object returned because the API does not make available a method for retrieving the backing table or iterating over the data.

In order for search to be intuitive, the search box had to ignore accents and case of the input string. For that purpose, the search query and every text element in the database is normalized through the normalizeGreek() function, which simply does a regex replace of a list of characters with the corresponding non accented, lowercase counterpart. This also revealed another issue - the webpage, including the scripts, have to be served with the encoding explicitly set to UTF-8, otherwise the Greek characters contained are not rendered properly and the search function returns no results.

As for the rest of the application, the UI design is inspired by Thodoris and implemented by nevma, and the data was curated by Stella Kasdagli.

That's all it took, really, The script is evidently small and can probably be made smaller, including reducing the number of required loops over the dataset on every user event. Regardless, it performs reasonably well and utilizes hardly any bandwidth as is. It also proved a quite useful exercise in Javascript and JQuery, which hopefully will allow me to build better UIs for my upcoming data projects.

Wednesday, December 11, 2013

Software sympathy, or how i learned to stop worrying and love the Garbage Collector

Mechanical sympathy is a term originating from Formula 1 where it was used to describe a driving style that takes into consideration the mechanical properties of the engine and the car as a whole, leading to better performance of the vehicle-driver combo. Martin Thompson has taken this term and applied it in software engineering, demonstrating how understanding the architecture of the various components of a computing machine and the way they interact can lead to writing vastly more efficient code.
Embarking from this idea, i wondered how this can be applied in a purely software world and write code that can take advantage of the way other software works. Being a Java programmer the answer is actually quite obvious - look at how the JVM performs a task on behalf of your code and optimize things so they operate in symphony.

Which component to choose from, though? My focus ended up being on the Garbage Collector and in particular on the Garbage First GC implementation that is available with Java7. Specifically, i wanted to understand what the impact of setting references in Java objects is on performance, an effect associated with the write barrier and the way it is implemented in G1.


The Write Barrier you say?

The write barrier is a piece of code executed by garbage collectors whenever a reference to an object is set. This is part of the book keeping done by the collector and allows for the garbage in the heap to be traced and collected when the time comes. The implementation and use is specific to each garbage collector of course and here we'll look at the case of G1.

Garbage First

Roughly, G1 works by slicing the heap into equal sized pieces or regions and satisfies requests for memory from one region at a time, moving through regions as they fill up. When heap memory gets low, G1 mostly-concurrently collects regions that are mostly filled with garbage, until it reaches a target size for available heap (or a time limit is reached, allowing for soft real time behaviour). The regions are useful because, since they are collected as a whole, it is not necessary to keep information about objects pointing to other objects in the same region - the scan during the mark phase will go through everything anyway. This in turn means that the write barrier when setting a reference to an object in the same region as the reference holder will leave the write barrier as a no op, while setting cross region references will cost somewhat extra.

The above is something that we can test for. We can devise an experiment to benchmark the difference between setting intra vs extra region references. If it turns out to be significant, we can be aware and prefer, when writing code, to allocate together objects that are expected to point to each other, giving us much better throughput at the cost of properly structuring our allocation strategy.

Moving on to measuring things

In this case the benchmark is quite simple - we need to do both same region and cross region reference setting and see how they compare. To do that we'll need to layout our objects in memory in a predictable way so that we know that the only difference between our comparative runs is only the relative position of the referenced objects. As it turns out, that is not that hard to do.
If we allocate a fixed size for heap (by setting -Xms and -Xmx to be the same) and we know the size of the Java objects we'll be allocating, we can, given the fixed number of regions created, to calculate how many objects fit in each region and in turn figure out which object needs to point to which in order to have extra or intra region references.

A region is the heap average size (min size + max size divided by 2) divided by 2048. That means that for a heap of 1Gb, each region is 2^30/2^11 = 2^19 bytes big. If Ballast objects are 32 (2^5) bytes then that means we require 2^19 / 2^5 = 16384 objects of that class to fill a region.

We'll do two runs. Both will allocate 32768 Ballast objects. One run will set references from the first 1/4th to the second 1/4th and from the third 1/4th to the fourth 1/4th (and back, for each couple). The other run will set references from the first half to the second half (and back). We expect the first run to have much bigger throughput even though the number of allocations and reference sets will be exactly the same. All is much better explained if you read the code.

A note about the ballast objects: The Ballast objects which are used to contain the references also contain a long field. The reason for that is twofold - one is padding to get the object to an even 32 bytes and the other is for the plot twist at the end of the blog post.

That's it really. On my computer, a MacBook Pro with a 2.3Ghz Intel Core i7 CPU and 8GB of RAM, running the above code on an Oracle HotSpot JVM version 7u45 and the command line options being

-Xms1024m -Xmx1024m -XX:+UseG1GC

i get the following results:

Setting same region references took 80640ms

Setting cross region references took 84140ms

The time is how long it took for 5000 repetitions of allocating 2 new regions of objects and set the references properly.


What this shows is that there is little difference between the two methods of setting references. Surprising, no? On to the source code then, to try and understand why that happens. As it turns out, the cost of marking each card as dirty is there but it actually is pretty small - it costs a put in a queue when the card is dirtied and that queue is processed afterwards while doing the collection. Both operations cost relatively little and this cost is also split between reference set time and collection time. Alternatively, one might say that the cost is dominated by the write barrier check rather than the operations taken from it.

An extra step and a twist

The write barrier cost - i wonder how much that is. What should we compare it against? Well, the minimal cost i could think of was that of setting an integer or a long. Since we already have a long field in our Ballast objects, we can use that. So we'll alter the test to instead of setting references in two different ways, it'll set once the reference to a fixed, known object and the other it'll set the long field to a given value. The new source code is here.

On the same setup as the previous experiment
(changing the GC implementation every time), i get the following numbers:

G1 (-XX:+UseG1GC)

Setting the long value took 79601ms
Setting the reference took 90197ms

ParNew (no arguments)

Setting the long value took 85628ms 
Setting the reference took 104894ms

CMS (-XX:+UseConcMarkSweep) 
Setting the long value took 91407ms
Setting the reference took 108954ms

The difference in cost lies strictly with the write barrier, which is of course not triggered for the case of storing a long. An interesting note here is that if you set the long field and the reference field in Ballast to be volatile, both actions take about the same time and at twice the time of setting the long value as shown above, demonstrating the cost of volatile variables.

Closing remarks and future work

While the first result is a negative one when it comes to the driving assumption, i still wanted to discuss it for two main reasons: One is the educational content from it, demonstrating in a hands on, high level way how G1 operates. The other is that negative results are still results and it's noteworthy that we can ignore the placement of objects when it comes to single threaded programs. This last part is quite important and it's going to be the next piece of work I'll undertake in this track. In particular, how does the card marking affect multithreaded programs and how does the no-op barrier for same-region assignment in the case of G1 fare under such conditions?
As for the long vs reference setting comparison, its explanation is quite easy but a way of taking advantage of the fact directly is not obvious. If however you are familiar with off heap memory management, you will immediately see a reason why performance in such scenarios is substantially better - being away from the control of the garbage collector does not only lead to collection improvements, but it also removes the bookkeeping overhead during program runtime. But that is a discussion for another time.

A parting word

As you can see, all source code is available, as well as the setup i used. If you think that information is not enough to recreate the results i got, please say so and I'll improve the article. But, if you think i got the results wrong, say so and I'll review the methodology. There is no reason why results such as the above should not be peer reviewed.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Thursday, October 4, 2012

Evolution of a cyclist

Don't really remember how it all started. You know how it goes. It usually starts at Sheldon's page. You put it aside as something exotic. Then it pops up as suggested videos in YouTube and Vimeo. You see someone on the street. Next thing you know, you are admiring the simplicity and the elegance that follows. You feel your road bike is too fancy, too "heavy", not gram wise but in every other respect. Eventually you start looking, drooling over manufacturer sites, component vendors and web albums.
How can you avoid the unavoidable? I didn't even try. Last week found me at 48x17 bikes in the company of Agis, Yiannis and many bike enthusiasts that came and went, building my new fixed gear bike. Built around a Tokyo Fixed Gear S1 frame, envisioned by me and manifested by Agis, it is one of the simplest, most elegant machines i have laid my eyes on. One for the road and one for the living room!

The frame naked, ready to be dressed up by Agis and Yiannis' capable hands

Omnium: check. Funny how you put on the crankset and you are half done. How about that for minimalism?

But all that is secondary to on-the-road performance and feel. It is hard to be accurate about that, i am afraid, since i have to learn a new bike and learn how to ride a fixed gear bike at the same time. So far i've done few small routes, the longest at around 50km (to Rafina and back), getting to know the machine. I think i should do a separate piece on the process of learning to ride a fixie, so for the time being i'll just comment on the whole experience as "learning to cycle all over again". Which is the best thing i can think of when it comes to cycling really.

Parts list for the curious:
Tokyo Fixed Gear S1 frame
SRAM Omnium Crankset
Phil Wood Track Hubs
Mavic OpenPro Wheel rims
Fi'zi:k Arione CX saddle
Nitto Handlebar

See you on the road.

Monday, July 23, 2012

Master election in Neo4j High Availability Clusters

Let me take you on a journey through one of the most challenging parts of Neo4j's current HA architecture - the master election algorithm. We will start by exploring the reason for a master existing, pass through the high level algorithm that master election is based on and we'll end up understanding how it is implemented. After we arrive at our destination, we'll reflect upon some recent changes that don't alter the base algorithm but significantly improve its performance characteristics.

Getting in the mindset: What we need the master for

The HA implementation of Neo4j is a way to consistently replicate the graph across multiple machines, eventually propagating writes through the whole cluster. The key point here is replication - HA is not about decentralizing the graph (a solution known as sharding). This replication is based on an authoritative source that receives all updates and hands them out to the rest of the database instances as needed to keep them up to date. That machine is also needed for keeping track of some other things, such as locks and transactional context. So, you guessed it, that machine is the master - the provider of the ground truth about what the graph and the current mutating operations are.

Filling up our backpack: Getting hold of the basic concepts.

When a Neo4j cluster is setup it consists of a set of pretty much equivalent, random machines - there is no special requirement in place for designating beforehand a machine as master. Additionally, it wouldn't be that much Highly Available if only one particular machine could be master - if that machine fails, what then? So, we need a way to elect a master dynamically and on demand.
To do that we depend on a number of primitives, the implementation of which i will not share right now - it is a mechanical thing, definitely interesting but not directly relevant. We'll talk about them later. What they are though - now that is extremely relevant. The basis for all operations and actually the only primitive needed for implementing any distributed system is Atomic Broadcast (AB) - essentially a method for any instance to reliably send to the whole cluster a specific message. Reliably in our case means no message loss, duplication or corruption and guaranteed message ordering. But I'll agree if you say that this is too generic to be useful, so let's narrow it down a bit. First we need a heartbeat facility, a way for each instance participating in an HA cluster to broadcast its existence to the rest of the cluster. Not broadcasting this heartbeat is equivalent to being in a failed state and not currently participating in cluster operations. Second, we need a way to manage cluster membership - finding out which machines currently comprise the cluster. While very similar to the heartbeat mechanism it is not the same concern, though the final implementation may be the same. Indeed, in Neo4j HA we use the same solution for both and that is how we'll treat it for the rest of this discussion. Third, we need an event notification mechanism that allows instances to broadcast alerts to the rest of the cluster. Finally, we require a method for instances to distribute key pieces of information that are needed by all other instances to make decisions.
Let me repeat - all the above are specialized cases of AB. Having them categorized like this however makes it easier to discuss about implementation specifics later on. Also, since the implementation is a detail we tend to refer to this service as the Coordinator - that is what neo4j-coordinator start provides

Taking the first steps: How the master is elected

Like already mentioned, the master is the provider of ground truth - the keeper of the knowledge of the real graph. Among other things, that means the master must have the most up to date version of the database. In Neo4j land the means to determine graph version is the latest transaction id executed on it. You see, since transaction application has to be serialized either way we can, at no additional cost, assign to it a monotonically increasing id. The id of the latest transaction that was committed on the graph essentially characterizes the version of a database. Of course, that information is generated internally from the system and as such cannot be used alone for election purposes. An externally provided, priority enforcing piece of information must be provided, for at least breaking ties in case two machines have the same latest TxID, a very common occurrence since most of the time the instances are not lagging behind the master. This information can be anything, from each instance's IP address to any guaranteed unique number present on the machine (CPUID for example). In Neo4j's case we chose to have this as a configuration option, known as ha.server_id. The lower the server id of an instance, the higher it's priority in being elected as a master.
By now the master election algorithm should be clear - let me just fill in the gaps. When an instance realizes that a master is needed it gathers the latest transaction id for each machine in the cluster. The machine that has the largest is elected master, with ties broken by the server id value. After the machine chooses the master, it notifies the cluster of the result and the same algorithm is executed on all other machines, including the new master. If all machines reach the same conclusion no new notification takes place (since that would lead to infinite election rounds).
Notice how have taken for granted and integrated in the above the primitives outlined in the previous section. The need for a master election is communicated by lack of heartbeats from the current master. The cluster membership mechanism is used to answer the question which machines participate in the cluster and should be considered for election. The event notification is used to communicate the result of the master election a machine performed. Finally, the information distribution channel is used to gather the latest transaction id for each cluster member prior to doing the election.

Deeper in the jungle: Using ZooKeeper as a guide

It's that time when we need stop being abstract and get down and dirty. In particular, we need to talk about how the required primitives (or actually the one primitive, AB) is implemented in an HA cluster. The answer is Apache ZooKeeper. ZooKeeper is an implementation of Zookeeper Atomic Broadcast, an AB protocol that (unlike Paxos, for example) guarantees that messages from a client are delivered in the order they are issued. A very interesting concept ZooKeeper implements is a filesystem abstraction for its operations. A ZooKeeper quorum is viewed as a distributed, highly available hierarchical filesystem with some additional perks. The most interesting of those perks is the notion of watches. A client can set a watch on a file in the ZooKeeper filesystem and receive a notification when that node's contents change or one of it's children is deleted or a node is added as a child (ZK nodes can have both contents and child nodes, making them a file/directory hybrid, a concept not encountered in everyday filesystems). Also, ZooKeeper has the notion of ephemeral nodes, which are bound to the lifetime of a client. If the ZooKeeper quorum looses contact with a client, all nodes that client created with the ephemeral flag are automatically removed.
Having said that, let's see how Neo4j structures its file layout in ZooKeeper and how all primitives are implemented. If a machine does not find any structure in place (it is the first one) it creates a node that is the cluster root. Then it adds an ephemeral node as the root's child with a node name that is the server id of the instance and sets as contents the latest transaction id its database has. It also creates some more administrative nodes, the most interesting of which is master-notify, to which it adds a watch. A watch is also added to the root node. Each subsequent machine will find the root and hence do the same steps apart from the root and administrative nodes creation. All nodes and their contents are read and write-able from all cluster instances.
We are set. While an instance is available, its ephemeral node will be present. When it fails, its ephemeral node will be removed from root's children and all instances will receive a node deleted event. If that instance was the master a master election will be triggered  and the first one to complete it will broadcast its result by writing it to the master-notify node. Note that since the latest transaction id is shared via ephemeral nodes, all machines that share their latest transaction id are also assumed reachable and eligible for election.

Almost home: The brutalities of real world

We are actually there. We have explained with little handwaving how master election is performed. One detail we glossed over is the fact that a master election may happen not because the master became unreachable but because an instance joins a running cluster and actually has to make a decision on which machine is the master. This means that all machines must keep their Coordinator node data up to date. In particular, the master must keep its latest transaction id really up to date, lest it loses the election and branch data occurs. To save against this scenario we must update from the master synchronously with each committed transaction. So, HA deployments (until 1.8M05) bottleneck transaction commits on writes to the Coordinator. M06 (and of course 1.8GA) removes this bottleneck by adding an extra watched node which is written to by any instance that wants to do master election. When the watch is triggered each instance starts flushing synchronously its transaction id. When the master election round ends the watch is triggered again and all instances stop flushing their TxIDs. That limits the Coordinator writes only during periods where it is actually necessary without any loss in cluster consistency or change in master election semantics. The performance increase is highly site dependent and affects only write operations, but it can easily surpass 2-3x the previous performance.

Putting our feet up: Lessons learned and future travel

You should now have a mental picture of how Neo4j HA works and what to expect if you venture into the HA codebase, at least its master-related parts. You should also have understood why you need to configure, run and maintain the Coordinator service that comes bundled with Neo4j Enterprise edition. It is expected to have more questions now than what you had when you started reading - that is good, it means that understanding has been achieved - keep in mind i have purposefully glossed over some large chunks.
Our understanding of how distributed systems must work and be built continuously evolves and we want to stop depending on third party components for such critical infrastructure. We are in the process therefore of implementing our own AB solution based on some well studied and battle proven algorithms, hopefully coming soon in a milestone release. Until then keep tending to your HA clusters, direct questions to Neo4j forums and keep an eye on the horizon for future travels in the Neo4j land.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Thursday, November 10, 2011

Rooting out redundancy - The new Neo4j Property Store


So, for the last 2 months we've been working diligently, trying to create the 1.5 release of Neo4j. While on the surface it may look like little has changed, under the hood a huge amount of work has gone into a far more stable and usable HA implementation and rewriting the property storage layer to use far less disk space while maintaining all its features and providing a speed boost at the same time. In this post I will deal exclusively with the latter.

Departing from the old model: A new way to store things

So far, the properties were stored on disk in a doubly linked list, where each of its nodes contained some necessary administrative/structural overhead and the actual property data. More specifically, the layout was:
byte  0     : 4 high bits of previous pointer, inUse flag
byte  1     : unused<
byte  2     : 4 high bits of next pointer
bytes 3-4   : property type
bytes 5-8   : property index
bytes 9-12  : previous pointer 32 low bits
bytes 13-16 : next pointer 32 low bits
bytes 17-24 : property data

The last 8 bytes where the value stored, enough to accommodate all primitive values, a short string or a pointer to the dynamic store, where a dynamic record chain would store a long string, an array of primitives or String[].

There is some waste here, in part because the full 8 bytes are used in the (rare) cases of storing doubles and longs or for short strings but mostly because this pointers are repeated for each property, making the impact of the structural overhead felt. On the flip side, the Short String optimization was a great success, proving the value in inlining more property types. So we decided to highlight the good parts and lowlight the bad, ending up with a PropertyRecord structure that is no longer equivalent to one property but acts as a container for a variable number of variable length properties. The current layout is:
byte  0    : 4 high bits of previous, 4 high bits of next pointers
bytes 1-4  : previous property record
bytes 5-8  : next property record
bytes 9-40 : payload
Yes, that is correct, no inUse flag, explained by the payload structure.

First, let's call the 4 8-byte-blocks in payload just blocks, to have a simple name for them. Each of these blocks is used in various ways, depending on the property data type. Starting off, every property needs to have the property index and the property type. These are common and always present, with the property index taking up the first 3 bytes of the block and the type taking up the 4 high bits of the 4th byte. Now, after that comes the property value. If it is a primitive that fits in 4 bytes, then the 4 low bits of the 4th byte are skipped and the remaining 4 bytes of the block are used to store the value and we are done. When storing a pointer into the DynamicStore for non-short strings and for arrays, the 36 bits required find home to the second half of the 4th byte and the low order 4 bytes. This means that each PropertyRecord can store up to 4 such properties - a huge saving in space.
For longs and doubles which require 8 bytes, the 4 1/2 trailing bytes are skipped and instead the next block is used as a whole to store the value. This leads to some waste but it is still more efficient than the previous method and it is a relatively rare use case.

What remains is ShortStrings and the brand new ShortArray. Since we saved all that space and I/O calls with ShortString, why not expand on the idea? We now have LongerShortString, which is like ShortString but on crack. It operates on the same principle - it scans a string, sees if it falls within an encoding, encodes it and stores a header with the length and the encoding table id and then the actual data, encoded in longs that take up blocks right after the property info. If it doesn't fit in the max of 3 1/2 blocks of a property record, it is instead encoded as UTF8 and stored in the DynamicStringStore. A similar idea is applied to arrays. When passed a primitive array we first determine the minimum number of bits required to store its values, effectively shaving off all the leftmost zeros we can while keeping all array members the same size. This means that if we are asked to store new int[] {1,2,3,4,5}, the entries will take up not 32 but 3 bits each. boolean[] for example costs 1 bit per entry. Obviously, mixing in even a single negative value gives immediately a maximum number of bits per entry. So, to store an array we first determine this number and then the header becomes:

   4 bits, an enum value identifying the primitive type
   6 bits, the length of the array
   6 bits, the number of bits per item

and then follow the "bit shaved" array entries. The same algo is used for dynamic arrays as well, but the length is actualy stored in the length field of the dynamic record (as usual), not the ShortArray header and we just keep how many bits of the last byte are used. That, along with the bits per entry  number are enough to reconstruct the value. Of course, in this case as well, if the array does not fit in the PropertyRecord even after this "compression", it is stored in the DynamicArrayStore as usual, though now in its bit-shaved form as byte[], meaning less DynamicRecords are used so less waste. This comes at the price of reconstructing the array when reading it in, but the reduced I/O more than makes up for it. A more exact description of the new ShortString, including all the ShortString classes and size limits, as well as the new ShortArray, is available in the manual.

What about the mystery of the missing inUse flag? Well, that is a combination of 2 things. One is that the blocks are marked individually as in use or not, since the API allows for a property to be deleted, and now a property is no longer a record but a collection of blocks. So we folded that into the property type, with 0 signifying not in use. The second is that the blocks are written out defragmented on disk, meaning that if from 3 properties in a record we delete the middle one (set its type to deleted), then only the remaining two will be written. This leads to a simple method of marking "no more properties in this record" by writing a 0 for the 4th byte of the first not-used block (the implementation just writes a whole long). A corollary of this is that a property record that has the 4th byte of the first block 0 is actually not used.

Code walkthrough

I was going to outline the changes/additions at a source code level here, but this post is getting too long. Besides, from the above the code becomes straightforward to follow. If you have any questions, suggestions or would like to small talk about the implementation, drop by our mailing list.

Just a tweaking note here - the logic of when and how allocation of blocks happens and the defragmentation strategy is held in WriteTransaction. Go ahead and experiment with what best suites your use case - feedback on these code paths will be greeted with much rejoice!

Out with the old, in with the new: Migrating the store

Unlike the 4+ billion changes for extended address space changes a while ago, this store upgrade cannot happen in place over an old database. We need to do a true migration, meaning recreating the store from scratch and replacing your existing data files with the new ones. This process is extremely safe: It never writes in your existing data files, it is crash resistant (so if it fails mid-way nothing bad happens) and keeps a backup of your data (under upgrade-backup/ in the database directory). However, better safe than sorry, so it is considered good practice to keep an independent backup of your data.

The store migration process is relatively straightforward - it goes over the node and relationship stores, copying them over as they are and, for each primitive, it reads in the property chains, transforms them in the new format and stores them. That has the side benefit of compacting the property store, skipping over deleted entries, so you should notice a significant reduction in disk usage if you happen to delete lots of properties and not restart often.

All the migration code is bundled in the kernel source in package org.neo4j.kernel.impl.storemigration and can be run both as a standalone tool and as part of normal startup - so no matter if you use the server scripts or just the kernel library, just set the config option "allow_store_upgrade"="true" and you are set to go.

Onwards and upwards

There are more stuff in this release that can fit in a blog post. Long discussions in the community have ended up providing inspiration for substantial changes which not only provide robustness in the current offering but pave the way for more exciting features to come. So, maybe "Boden Bord" is not filled to the brim with obvious new features, but rest assured, we are in for a wild next year.

Thank you for making Neo4j what it is.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Wednesday, July 27, 2011

A new cache for Neo4j, part 1 : The hash function

Since my last post a lot of things have happened in Neo4j, some of which have made some of the material in this blog outdated. Imagine that! Just half a year later, the kernel crackdown (specifically, the storage layer walkthrough) is pretty much out of date. How awesome is that? My wonderful colleagues at Neo Technology however have shown me that staying still is not an option, so here I am, trying to make another post outdated. This time is the cache's turn, which is about time it received an overhaul. Challenging the current implementation will start at an unexpected place (or is it?), the hash function.

The hash function you say?

Well, makes sense, doesn't it? At least, I hope it does. You see, in general, hash functions are developed with a very specific goal in mind: uniformity over the hash values, which reduces conflicts, the main quality metric of a hash function. The rationale behind this is the more random the output appears in relation to the input, the smaller the probability for two "related" values (what that means is domain dependent) to collide. This however completely ignores some things that could lead to better utilization of the co-domain (the set of result values from the hash function). In particular, if we know that some class of values is going to be more frequently polled that others, it would make sense to have that class mapped to a larger interval of the co-domain so that collisions in this class (and, consequently, throughout) are reduced even more.

Quick example

Say I have the interval [1,100] as the hash function's domain and the interval [1,10] as the co-domain. A typical hash function will try to allocate k*10/100 values for any collection of k values from the co-domain, a goal best reached by uniformity, as discussed. If, however, we knew that the interval [1,20] would be referenced twice as often as the (20,100], wouldn't it be more logical to allocate a bigger chunk than [1,2] to it? Let's see:
First, for a co-domain of arity N of a perfectly uniform hash function, the possibility of a collision for two inputs is 1/N. This is kind of obvious, selecting the first number does not matter and a collision will happen if we select that number and only that number again, thus 1/N. Having said that:
A uniform selection from [1,100] to [1,10] will lead to collision probability 1/10.
A selection with probability 1/2 from [1,20] mapped to 20% of the co-domain and 1/2 from [21,100] mapped to the proportional 80% will lead to 1/2*0.5 + 1/2*0.125=0.3125 collision probability.
A selection with probability 1/2 from [1,20] mapped to 50% of the co-domain and 1/2
from [21,100] mapped to the remaining 50% will lead to 1/2*0.2 + 1/2*0.2 = 0.2 collision probability.
A selection with probability 1/2 from [1,20] mapped to 80% of the co-domain and 1/2
from [21,100] mapped to the remaining 20% will lead to 1/2*0.125 + 1/2*0.5 = 0.3125 collision probability.

The above kind of demonstrate that uniform hashing is not always the best choice and that a sweet spot exists when trying to find an alternate distribution. So let's try that.

Practical matters

When we start operations on our data set it is usually not known beforehand what the value distribution will be but we might have a vague idea, such that there is some locality or other underlying trend. We can develop therefore an adapting hash function, that after it looks at some data can form some idea about their nature and start allocating chunks of the co-domain hopefully in a more efficient manner. Keeping statistics will be key here.
Beforehand, this appears to have two problems. One is performance - hash functions have to be really fast and here we talk about keeping statistics. This is somewhat problematic and actually the only thing we can do is to be relaxed on the accuracy of our statistics, keep them simple and fast and of course, optimize the hell out of it, hoping that the reduction of collisions will increase performance more that this additional code path slows things down.
The other problem is changing hash values for the domain. Since the hash function adapts, by definition this means that the values it gives for a specific input will change. This is not a hash function. The way around that can be either to keep the statistics and actually change the hash function to a better version when we can (such as cache table resize) or keep all versions in memory and hash them all - if even one matches, we have a hit, otherwise we have a miss. Hybrids are possible and is probably what we will discuss at part 2. But for now let's talk about

What statistics to keep

Our main goal is to find out how skewed our domain distribution is. Model fitting data is an open ended problem so we have to resort to heuristics. A key observation is that a uniform distribution over the domain will also be reflected as a uniformity in the distribution of bits in the input values. What I mean by that is that if the input values are truly random, then the frequency of 1's and 0's at a specific bit position will also tend to be equal, otherwise the values would not be selected truly uniformly. So if we keep the numbers of times we see 1 vs the times we see 0 at every bit position, we can begin forming a picture about how values are diverging from a truly uniform distribution - roughly equal 0 and 1 count for a bit will mean that this bit position is not part of a trend. This intuition is simple enough to implement and test and has the potential of being quite fast. What remains is to see the actual collision rate that it gives, which, as with all heuristics, requires testing.

Another example

Assume we want to hash 8 bit integers to 4 bit hashes and the hash function is value modulo 16, aka the 4 LSBs. A serial scan over the whole range would give equal counts of 1 and 0 for each of the 8 positions, which is expected - this is a uniform distribution and the chosen hash is the best possible. But if we were to choose the values {16,32,48,64}  twice as often as the rest, the counters of the bits 5 and 6 would be roughly equal while the rest would be more far apart. So if we were to include these bits in the extraction and not the 4 LSBs then maybe the resulting hash value would demonstrate less collisions since it would vary more.

The implementation

The implementation is actually pretty straightforward. We want the hashes to be from longs to integers. We therefore keep 63 counters initialized at 0 (longs in Java are signed, the MSB is always zero for positives so it is meaningless to track). When the hash of a long is requested, first its bits are looped over - for each position, the corresponding counter is increased by 1 if the bit is 1, decreased by 1 if it is 0. After a number of such hashes we order the counters based on absolute value - the closer to 0 the more uniform the distribution of that bit and the more important it becomes for the hash value. This is the new ordering with which bits are extracted and concatenated to form the hash. In the above example, bits 5 and 6 would be the 2 LSBs in the 4-bit hash.

What are the consequences of this?

The fact that the hash function is an order makes it persistable between restarts of the database, so statistics gathered during a run can be used to optimize subsequent operations. It also means that, if need be, knowledge can be injected from the beginning, getting rid of any warmup period that is needed for the knowledge of the distribution to build up. Finally, multiple hash functions can be stored side by side, providing multiple hashes, either for different caches, for collision resolution, transition periods (eg resizing) or weighting among them for better hash distribution.
The downsides are, besides the slightly increased runtime cost, that if there is no fixed access pattern, then the added overhead is of no benefit (though it should cause at most as many collisions as any similar bit-extracting hash). Even worse, if it is trained and the access pattern changes, collisions can increase significantly until the new pattern dominates. The latter issue can be dealt with simply by signaling that the pattern changes so we should dump all gathered statistics and start from scratch or something more sophisticated such as keeping the old version along side a new one with a weighting factor between them or an additional level of hashing that can choose the hash function to use.

A sample run

Of course this is not all - we have experiments to run. I have done so in an ad hoc fashion so far, so there is no pretty graphic, but i have the first example run that had me giggling for some time after it printed out the result.
The metric is collision count for a hash function with co-domain [0,1023].
The data set is a loop over 0 to 1000 with increment of 1 and over 2048 to 1048576 (2^19) in increments of 1024 (this gives another 1022 iterations with only the bits 10 to 19 changing). The test was two such sets of values - once with no data (just building up the statistics while returning the 10 LSBs) and the other was with the training data gathered from the first. Note that the minimum number of collisions is 512, since we project 2022 points to 1024.
The first run gave a collision count of 1022 and the second gave a collision count of 999. That is a significant save for something as trivial as bit projection and I have already a couple of ideas for weighted XOR operations that will widen significantly that margin.

From here

Since we are building a cache, the hash function is actually a small part of the story, even if it still needs tuning. We have to build the store, come up with an efficient eviction policy and see how the GC will mess with us. There will be plenty of stuff to discuss down the line, so keep an eye on our mailing list and our github repositories if you want to keep up with the changes.