Tuesday, September 14, 2010

When data locality isn't

I am working on releasing, as an open source project, a data structure that uses two parallel arrays as its store. I tried a performance optimization: merging the two arrays into one, with the (naive, as it proved) reasoning that since the arrays are processed mostly in parallel (that is, accesses happen at the same indexes), it would be beneficial to interleave them into a single array, using even indexes for the first and odd ones for the second. Obviously I was "thinking" that data accesses would be faster because of caching and prefetching of chunks of the now common store. Boy, was I wrong! First, take a look at the API so that we have a common vocabulary:

package org.digitalstain.structures;

/**
 * Operations that are supposed to be present on a store
 * supported by two arrays that are parallel. That means
 * they are of the same size and in general the elements
 * that correspond to the same index are accessed together.
 *
 * @param <E> The type of element to store.
 * @author Chris Gioran
 *
 */
public interface JointArray<E> {

    /**
     * Returns the size of the arrays.
     * @return The size of the arrays.
     */
    public int size();

    /**
     * Returns the element that is at position <code>index</code>
     * in the first array.
     *
     * @param index The index requested
     * @return The element present at the requested index in the first
     *         array
     */
    public E getFirst(int index);

    /**
     * Returns the element that is at position <code>index</code>
     * in the second array.
     *
     * @param index The index requested
     * @return The element present at the requested index in the second
     *         array
     */
    public E getSecond(int index);

    /**
     * Sets the value at <code>index</code> of the first array
     * to the value <code>element</code>.
     * @param index The position to set.
     * @param element The value to set.
     */
    public void setFirst(int index, E element);

    /**
     * Sets the value at <code>index</code> of the second array
     * to the value <code>element</code>.
     * @param index The position to set.
     * @param element The value to set.
     */
    public void setSecond(int index, E element);

    /**
     * Appends to both arrays, increasing their size if necessary.
     * @param first The element to append to the first array
     * @param second The element to append to the second array
     */
    public void add(E first, E second);

    /**
     * Removes the element at <code>index</code> from both arrays.
     * @param index The index to remove
     */
    public void remove(int index);
}


The setup

Since the details of the actual structure are irrelevant, I implemented both storage solutions bare-bones: two classes, one backed by two arrays where corresponding indexes act as a pair, and the other backed by a single array where two consecutive indexes act as a pair (a sketch of both layouts follows). Before you go any further, which one do you think is faster?
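
The actual implementations are not part of this post, but a minimal sketch of the two layouts could look like the following (class names, initial capacities and growth policy are my own choices here; bounds checking is omitted):

package org.digitalstain.structures;

import java.util.Arrays;

// Two parallel arrays; logical pair i lives at first[i] and second[i].
class DisjointArrays<E> implements JointArray<E> {
    private Object[] first = new Object[16];
    private Object[] second = new Object[16];
    private int size;

    public int size() { return size; }

    @SuppressWarnings("unchecked")
    public E getFirst(int index) { return (E) first[index]; }

    @SuppressWarnings("unchecked")
    public E getSecond(int index) { return (E) second[index]; }

    public void setFirst(int index, E element) { first[index] = element; }

    public void setSecond(int index, E element) { second[index] = element; }

    public void add(E f, E s) {
        if (size == first.length) {            // grow both arrays together
            first = Arrays.copyOf(first, size * 2);
            second = Arrays.copyOf(second, size * 2);
        }
        first[size] = f;
        second[size] = s;
        size++;
    }

    public void remove(int index) {            // shift both tails left by one slot
        System.arraycopy(first, index + 1, first, index, size - index - 1);
        System.arraycopy(second, index + 1, second, index, size - index - 1);
        size--;
    }
}

// One interleaved array; logical pair i lives at store[2*i] and store[2*i + 1].
class InterleavedArray<E> implements JointArray<E> {
    private Object[] store = new Object[32];
    private int size;

    public int size() { return size; }

    @SuppressWarnings("unchecked")
    public E getFirst(int index) { return (E) store[2 * index]; }

    @SuppressWarnings("unchecked")
    public E getSecond(int index) { return (E) store[2 * index + 1]; }

    public void setFirst(int index, E element) { store[2 * index] = element; }

    public void setSecond(int index, E element) { store[2 * index + 1] = element; }

    public void add(E f, E s) {
        if (2 * size == store.length) {        // two slots per logical pair
            store = Arrays.copyOf(store, store.length * 2);
        }
        store[2 * size] = f;
        store[2 * size + 1] = s;
        size++;
    }

    public void remove(int index) {            // close the two-slot gap of the removed pair
        System.arraycopy(store, 2 * (index + 1), store, 2 * index, 2 * (size - index - 1));
        size--;
    }
}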

The trials

Obviously it depends on the access pattern, so I performed a series of experiments with different access scenarios.

I did serial writes:
for (int i = 0; i < theArray.size(); i++) {
    theArray.setFirst(i, i);
    theArray.setSecond(i, i);
}


serial reads:
for (int i = 0; i < INIT_SIZE * 10; i++) {
    sum += theArray.getFirst(i % INIT_SIZE);
    sum += theArray.getSecond(i % INIT_SIZE);
}

random reads:
Random rng = new Random(startTime);
for (int i = 0; i < RAND_ITERS; i++) {
    theArray.getFirst(rng.nextInt(theArray.size()));
    theArray.getSecond(rng.nextInt(theArray.size()));
}


and the crown jewel, random mixed access modes:
Random rng = new Random(startTime);
Normal normal = new Normal(0.5, 0.0001, new MersenneTwister64((int) System.currentTimeMillis()));
for (int i = 0; i < RAND_ITERS; i++) {
    size = theArray.size();
    switch (rng.nextInt(7)) {
        case 0: theArray.getFirst(nextRandomIndex(normal, size)); break;
        case 1: theArray.getSecond(nextRandomIndex(normal, size)); break;
        case 2: {
            // a short serial scan starting from a random index
            int start = nextRandomIndex(normal, size);
            int length = rng.nextInt(size - start);
            for (int j = 0; j < length; j++) {
                theArray.getFirst(j + start);
                theArray.getSecond(j + start);
            }
            break;
        }
        case 3: theArray.setFirst(nextRandomIndex(normal, size), rng.nextInt()); break;
        case 4: theArray.setSecond(nextRandomIndex(normal, size), rng.nextInt()); break;
        case 5: theArray.add(rng.nextInt(), rng.nextInt()); break;
        case 6: theArray.remove(rng.nextInt(size)); break;
        default: throw new IllegalArgumentException("Error from RNG in mixed case");
    }
}


The last one was performed with the help of the colt library, with three distributions: ~Uni(0, theArray.size()), ~N(0.5, 0.1) and ~N(0.5, 0.0001), the last two normalized to return a valid index. The host PC was a laptop with an Intel Core 2 Duo P9500 processor (2 cores, 6 MB L2 cache) and 4 GB of RAM, 1 GB of which was dedicated to the heap of the Sun JDK 1.6.0_21 JVM.
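
The nextRandomIndex helper is not shown in this post; a plausible version, assuming the sample is simply scaled by the array size and clamped to a valid index, would be:

// Hypothetical: not the exact code used. Scales a sample from the (mostly [0,1])
// distribution to the array's index range and clamps the tails of the normal.
private static int nextRandomIndex(Normal normal, int size) {
    int index = (int) (normal.nextDouble() * size);
    if (index < 0) {
        return 0;
    }
    if (index >= size) {
        return size - 1;
    }
    return index;
}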

The numbers

I did 20 iterations of every test and took the mean and standard deviation from these (a rough sketch of the kind of harness I mean is shown below). The raw results are tedious to reproduce here (if you need them, ping me), but the resulting charts are interesting.
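
This is only a sketch of such a harness, not the actual code; createFreshArray() and runScenario() are hypothetical placeholders for setting up an implementation and running one of the loops above:

// Sketch only: 20 timed runs per scenario, then mean and standard deviation.
long[] samples = new long[20];
for (int run = 0; run < samples.length; run++) {
    JointArray<Integer> theArray = createFreshArray();   // placeholder factory
    long start = System.nanoTime();
    runScenario(theArray);                               // placeholder for one scenario
    samples[run] = System.nanoTime() - start;
}
double mean = 0;
for (long sample : samples) {
    mean += sample;
}
mean /= samples.length;
double variance = 0;
for (long sample : samples) {
    variance += (sample - mean) * (sample - mean);
}
double stdDev = Math.sqrt(variance / samples.length);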
Behold:

[Charts: average time per element vs. the number of elements, for each type of operation.]

The first two are simply sanity checks that my setup is correct. For the serial reads, things are as expected: the SingleArray implementation dominates the other as a result of prefetching, especially around the point where cache evictions start to become noticeable (around 4M elements). For the random reads test there is no noticeable difference between the two implementations, since the test essentially thrashes the memory.

The three randomized, mixed-operations tests, however, are not all that predictable. Apart from a small non-linear increase around the 4M mark (where cache evictions become more frequent and memory access costs start to accumulate) and a more or less constant time after that, the ordering is the reverse of the one I originally expected. Upon closer inspection, however, it is obvious that this is the way things should be (as if reality would get it wrong this time!). The serial access case in the mixed-ops test is a very small part of the overall operations, and even then the random accesses destroy any chance that prefetching will have put the correct portion of the array in the cache. Cumulatively, these effects make the disjoint-arrays solution perform measurably better.

Conclusion

The assumption that data accessed together should be located near each other in memory is correct, but it is far more sensitive to access patterns than I originally thought. In my case I accessed the elements of the two parallel arrays together often, but not always, so I thought it prudent to interleave them and take advantage of prefetching and caching. Not only did this bring no improvement, but because of the less common random accesses within the arrays, performance suffered to a noticeable degree. It seems that to take advantage of the large caches, it is necessary to co-locate data that is strongly coupled in the vast majority of the code paths. Even then I would run an experiment or two, just to be on the safe side. I think the last test needs a bit more refinement, to see how far deviations from the coupled access pattern can go before they stop making a difference in performance.
