Wednesday, October 6, 2010

Inside Neo4j: Hello world

The code

The neo4j site offers a getting started document on its wiki that demonstrates the basic usage of the framework through a "Hello, World" type application. This code is enough to dive into the neo4j kernel and understand its internals. First, here is the code:

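import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

/**
 * Example class that constructs a simple graph with
 * message attributes and then prints them.
 */
public class NeoOneMinute {
    public enum MyRelationshipTypes implements RelationshipType {
        KNOWS
    }

    public static void main(String[] args) {
        GraphDatabaseService graphDb = new EmbeddedGraphDatabase("var/base");
        Transaction tx = graphDb.beginTx();
        try {
            Node firstNode = graphDb.createNode();
            Node secondNode = graphDb.createNode();
            Relationship relationship =
                    firstNode.createRelationshipTo(secondNode, MyRelationshipTypes.KNOWS);

            firstNode.setProperty("message", "Hello, ");
            secondNode.setProperty("message", "world!");
            relationship.setProperty("message", "brave Neo4j ");
            tx.success(); // mark the transaction as successful

            System.out.print(firstNode.getProperty("message"));
            System.out.print(relationship.getProperty("message"));
            System.out.print(secondNode.getProperty("message"));
        } finally {
            tx.finish();         // commit (or roll back if success() was not called)
            graphDb.shutdown();  // release the store files
        }
    }
}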

Practically every line of this example is a different post. Let's start with the bootup code.

The GraphDatabaseService

org.neo4j.graphdb.GraphDatabaseService is the entry point to neo. It is an interface that exposes the core functionality expected of any provider of graph database services, such as createNode(), beginTx() and getNodeById(). In other words, if you want to create a replacement for neo, this is the first interface to implement. In our example, and in all probability in your code, the chosen implementation is org.neo4j.kernel.EmbeddedGraphDatabase. It is a simple wrapper over org.neo4j.kernel.EmbeddedGraphDbImpl, which in turn delegates most of its heavy lifting to an instance of org.neo4j.kernel.GraphDbInstance (it also performs some work for nested transactions and registration of kernel event handlers, but that is not important right now). One of the interesting points in this class is the first appearance of org.neo4j.kernel.impl.core.NodeManager. But first let's go to org.neo4j.kernel.GraphDbInstance.

Getting Dirty: The Configuration

GraphDbInstance, after construction from EmbeddedGraphDbImpl, is start()ed with the configuration map provided by the user (in our example this is empty). This is passed to an org.neo4j.kernel.AutoConfigurator object which augments the parameters with the amount of memory that will be used by the database for its memory mapped buffers, calculating it based on the reported values from the JVM. This, of course, does not happen if your configuration sets "use_memory_mapped_buffers" to "false". This ends the life of the AutoConfigurator and the improved configuration map is passed to an org.neo4j.kernel.Config object. This is where the CacheManager, LockManager/LockReleaser, PersistenceModule, TxModule and GraphDbModule are created and held for the rest of the application's lifetime. The caching mechanism, the transaction management and the locking and persistence modules will be explained in later posts. For now, let's see the NodeManager created in GraphDbModule. By the way, after this setup, if all goes well, your embedded database has started.
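To make the configuration step a bit more concrete, here is a minimal sketch of the idea of augmenting a user-supplied map with a memory budget. It is not the real AutoConfigurator: the class name, the "sketch.mapped_memory_bytes" key and the "half of the JVM maximum" policy are all assumptions made up for illustration. Only the "use_memory_mapped_buffers" key comes from the text above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the real org.neo4j.kernel.AutoConfigurator.
class AutoConfigSketch {
    // Key quoted in the post.
    static final String USE_MMAP = "use_memory_mapped_buffers";
    // Hypothetical key name, invented for this illustration.
    static final String MMAP_BYTES = "sketch.mapped_memory_bytes";

    /**
     * Returns a copy of the user's configuration, augmented with a memory
     * budget unless memory mapping is disabled or a budget is already set.
     */
    static Map<String, String> augment(Map<String, String> userConfig, long jvmMaxBytes) {
        Map<String, String> config = new HashMap<>(userConfig);
        if ("false".equalsIgnoreCase(config.get(USE_MMAP))) {
            return config; // user opted out, nothing to compute
        }
        // Assumed policy: reserve half of the reported JVM maximum.
        config.putIfAbsent(MMAP_BYTES, Long.toString(jvmMaxBytes / 2));
        return config;
    }

    public static void main(String[] args) {
        Map<String, String> empty = new HashMap<>(); // empty, like the post's example
        System.out.println(augment(empty, Runtime.getRuntime().maxMemory()));
    }
}
```

The point of the design is that the helper is throwaway: it produces an improved map and is not needed afterwards, which matches the short life of the AutoConfigurator described above.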

Managing Nodes and Relationships

org.neo4j.kernel.impl.core.NodeManager is one of the largest classes in the source tree. It abstracts over the persistent store and the caching mechanism, providing the required methods for manipulating nodes and relationships in the database (maybe it should be named EntityManager or something more representative of its role). Its first job is to parse the configuration and initialize the caching subsystem, determining the cache sizes (there are two caches, one for the nodes and one for the relationships). The cache size management is abstracted by the AdaptiveCacheManager class, which we will look into next time.
The NodeManager uses lock striping for its lock operations. It maintains an array of ReentrantLocks and, given the id (an integer value) of the requested entity (a Node or a Relationship), hashes the id to pick one lock from the array and acquires it. This strikes a balance between concurrency level and memory use/performance. The general cycle of the getNodeById()/getRelationshipById() methods is roughly:
  1. Check the cache for it, if it exists, return it.
  2. Acquire the lock that corresponds to the requested entity id.
  3. Recheck cache (this is multi-threaded, people!), if it has appeared, return it.
  4. Ask the persistent store manager for the entity, throw NotFoundException if not existent.
  5. Wrap the value in a proper implementation class (more on that later on).
  6. Put the value in the cache.
  7. Release the lock.
  8. Return the value to the caller.
From our high-level approach, this is the main work of the NodeManager. There are many more operations there, of course, but it is pointless to explain them all here, although some will be discussed later.
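The eight-step cycle above can be sketched roughly as follows. This is an illustration of lock striping with a re-check under the lock, not the actual NodeManager code: the class name, the ConcurrentHashMap cache, the IntFunction standing in for the persistent store and the use of NoSuchElementException in place of neo4j's NotFoundException are all assumptions. The wrapping of step 5 is left out.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.IntFunction;

// Hypothetical sketch of the lookup cycle; not the real NodeManager.
class StripedLookupSketch<T> {
    private final ReentrantLock[] stripes;
    private final Map<Integer, T> cache = new ConcurrentHashMap<>();
    // Stands in for the persistent store; returns null when the id is unknown.
    private final IntFunction<T> store;

    StripedLookupSketch(int stripeCount, IntFunction<T> store) {
        this.stripes = new ReentrantLock[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ReentrantLock();
        }
        this.store = store;
    }

    T getById(int id) {
        T cached = cache.get(id);            // 1. check the cache
        if (cached != null) {
            return cached;
        }
        // 2. pick the stripe for this id and lock it
        ReentrantLock lock = stripes[Math.floorMod(id, stripes.length)];
        lock.lock();
        try {
            cached = cache.get(id);          // 3. recheck under the lock
            if (cached != null) {
                return cached;
            }
            T loaded = store.apply(id);      // 4. ask the store
            if (loaded == null) {
                throw new NoSuchElementException("No entity with id " + id);
            }
            cache.put(id, loaded);           // 6. cache it (step 5, wrapping, omitted)
            return loaded;                   // 8. return to the caller
        } finally {
            lock.unlock();                   // 7. release the lock
        }
    }

    public static void main(String[] args) {
        StripedLookupSketch<String> nodes =
                new StripedLookupSketch<>(32, id -> id < 100 ? "node-" + id : null);
        System.out.println(nodes.getById(7));
    }
}
```

With a fixed number of stripes, two threads asking for ids that land on different stripes never block each other, while two threads asking for the same id are serialized and only the first one hits the store.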

Implementations of Node and Relationship

There is one implementation for each of the org.neo4j.graphdb.{Node, Relationship} interfaces, NodeProxy and RelationshipProxy, respectively. They are mere shells that hold the unique identifier of the primitive they represent and forward every call to the interface methods to the NodeManager reference they are given at construction (this is obviously the GoF Proxy pattern, hence the name). This is what EmbeddedGraphDatabase returns every time you request or create a Node or Relationship. In essence, all you are handed is a thin wrapper around an integer id.
There are two more classes related to Nodes and Relationships, org.neo4j.kernel.impl.core.{NodeImpl,RelationshipImpl}. They are internally used classes that both extend org.neo4j.kernel.impl.core.Primitive. This abstract class implements properties as they are used by neo4j, delegating the loading and storing of property values to the NodeManager. It also holds the id value. NodeImpl and RelationshipImpl each implement its abstract methods as forwarding calls to the proper NodeManager method and add support for operations specific to the primitive they represent (so NodeImpl has a getRelationships() method while RelationshipImpl has a getStartNode(), and so on).
These are the implementations handled by the NodeManager and further "down" (meaning the cache and the persistent store, among others), while "above" that, what you receive are proxies.
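The split between proxies "above" and implementation objects "below" might look something like the following sketch. All class names here (Manager, NodeRecord, NodeHandle) are invented for illustration and the in-memory HashMap is an assumption that stands in for the cache and store. The real NodeProxy, NodeImpl and NodeManager are considerably more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proxy/impl split; class names are invented.
class ProxySketch {
    /** Internal record, in the spirit of NodeImpl: id plus property storage. */
    static final class NodeRecord {
        final int id;
        final Map<String, Object> properties = new HashMap<>();
        NodeRecord(int id) { this.id = id; }
    }

    /** Stands in for NodeManager: owns the records and does the real work. */
    static final class Manager {
        private final Map<Integer, NodeRecord> records = new HashMap<>();
        private int nextId = 0;

        NodeHandle createNode() {
            NodeRecord record = new NodeRecord(nextId++);
            records.put(record.id, record);
            return new NodeHandle(record.id, this); // callers only ever see the handle
        }
        void setProperty(int id, String key, Object value) {
            records.get(id).properties.put(key, value);
        }
        Object getProperty(int id, String key) {
            return records.get(id).properties.get(key);
        }
    }

    /** In the spirit of NodeProxy: holds only an id and forwards every call. */
    static final class NodeHandle {
        private final int id;
        private final Manager manager;
        NodeHandle(int id, Manager manager) { this.id = id; this.manager = manager; }
        int getId() { return id; }
        void setProperty(String key, Object value) { manager.setProperty(id, key, value); }
        Object getProperty(String key) { return manager.getProperty(id, key); }
    }

    public static void main(String[] args) {
        Manager manager = new Manager();
        NodeHandle node = manager.createNode();
        node.setProperty("message", "Hello, ");
        System.out.println(node.getProperty("message"));
    }
}
```

Because the handle carries no state besides the id, two handles with the same id are interchangeable, and the manager is free to evict or reload the underlying record without the caller noticing.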

Until next time

This is a simple overview of some core classes in the neo4j implementation. Practically no mechanics were discussed, although that will change next time when I will write about the cache mechanism.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
