Sunday, 29 April 2012

An experiment in Van Emde Boas layouts

Given an arbitrarily large sorted, balanced binary tree of values, the obvious way to search for elements is binary search. In undergraduate classes I learned that a binary search requires log(n) operations and is thus as efficient as you can get without throwing additional storage space at the problem. One thing my lecturer didn't dive into though was the effect of CPU cache on this otherwise simple little log(n) algorithm.



Jumping around in memory comes with significant costs. How significant? In the case of my Intel i5 quad-core 3Ghz CPU, an L1 cache miss costs 4ns (~12 cycles). An L2 cache miss costs 10ns (~30 cycles). An L3, 39ns (~117 cycles). RAM 39-60ns (117-180 cycles).

While binary sort sounds great on paper, it turns out that repeatedly jumping back and forth in memory could really clock up a tonne of wasted cycles if you're doing it a lot. What makes matters worse is the fact that different architectures and even generations of CPU's will have different sized caches with different performance characteristics.

Enter the van Emde Boas layout - a data layout designed to minimise cache misses without regard for the size of the cache. (I have a coded up visual comparison in JavaScript here.)

The layout breaks a tree into sqrt(n) sub trees, each of sqrt(n) nodes. This continues recursively until the trees contain a single node. The idea is to try to locate related data close to each other in memory to minimise cache misses. The recursive nature of the layout means that it works relatively well regardless of cache size.

This layout complexity obviously adds additional computational complexity. The question I wanted answered was whether or not the benefits outweigh the costs. Enter experiment time!
  1. Single templatized C++ class for binary searching
  2. A "Traverser" class for each of: in-order (sorted array), tree-order (think breadth-first-search), and van Emde Boas order
  3. CacheGrind
My vEB Traverser class:
class vEBTraverser {
 public:
  uint64_t root() {
    d = 0;
    cix = 1;
    return 1;
  }
  uint64_t left() { 
    d++;
    cix <<= 1;
    return vEBIndex();
  }
  uint64_t right() { 
    d++;
    cix = (cix << 1) + 1;
    return vEBIndex();
  }
 private:
  uint64_t cix;
  uint64_t d;

  uint64_t vEBIndex() {
    // Start with largest sub-tree, work down to smallest.
    uint64_t ix = 1;
    uint32_t new_d = d;
    for (char b = 4; b >= 0; --b) {
      const uint64_t b_val = 1L << b;
      if (d & b_val) {
        // Determine sub triangle and add start offset to index.
        const uint64_t masked_d =      d & (b_val - 1);
        const uint64_t new_node_size = (1L << b_val) - 1;
        uint64_t subtri_ix =           (cix >> masked_d) & new_node_size;
        ix += new_node_size * (1L + subtri_ix);
      }
    }
    return ix;
  }
};

Results

Cache Misses

$ for i in inorder bfs vEB; do echo; echo; echo ======= $i =======; valgrind --tool=cachegrind --D1=32768,8,64 --LL=6291456,12,64 ./a.out $i; done


======= inorder =======
==23833== Cachegrind, a cache and branch-prediction profiler
==23833== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==23833== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==23833== Command: ./a.out inorder
==23833== 
--23833-- warning: Unknown Intel cache config value (0x76), ignoring
--23833-- warning: Unknown Intel cache config value (0xff), ignoring
--23833-- warning: L2 cache not installed, ignore LL results.
In Order
Searched first half of keyspace in 674778 usec.
Searched last half of keyspace in 678757 usec.
Searched whole keyspace in 680336 usec.
==23833== 
==23833== I   refs:      216,654,084
==23833== I1  misses:          1,013
==23833== LLi misses:          1,011
==23833== I1  miss rate:        0.00%
==23833== LLi miss rate:        0.00%
==23833== 
==23833== D   refs:       56,351,206  (37,579,957 rd   + 18,771,249 wr)
==23833== D1  misses:      6,822,090  ( 6,808,199 rd   +     13,891 wr)
==23833== LLd misses:         11,053  (     5,878 rd   +      5,175 wr)
==23833== D1  miss rate:        12.1% (      18.1%     +        0.0%  )
==23833== LLd miss rate:         0.0% (       0.0%     +        0.0%  )
==23833== 
==23833== LL refs:         6,823,103  ( 6,809,212 rd   +     13,891 wr)
==23833== LL misses:          12,064  (     6,889 rd   +      5,175 wr)
==23833== LL miss rate:          0.0% (       0.0%     +        0.0%  )


======= bfs =======
==23836== Cachegrind, a cache and branch-prediction profiler
==23836== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==23836== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==23836== Command: ./a.out bfs
==23836== 
--23836-- warning: Unknown Intel cache config value (0x76), ignoring
--23836-- warning: Unknown Intel cache config value (0xff), ignoring
--23836-- warning: L2 cache not installed, ignore LL results.
Breadth-First Order
Searched first half of keyspace in 492395 usec.
Searched last half of keyspace in 558116 usec.
Searched whole keyspace in 513091 usec.
==23836== 
==23836== I   refs:      169,892,772
==23836== I1  misses:          1,018
==23836== LLi misses:          1,016
==23836== I1  miss rate:        0.00%
==23836== LLi miss rate:        0.00%
==23836== 
==23836== D   refs:       45,135,648  (37,489,692 rd   + 7,645,956 wr)
==23836== D1  misses:      2,196,869  ( 2,182,966 rd   +    13,903 wr)
==23836== LLd misses:         11,055  (     5,880 rd   +     5,175 wr)
==23836== D1  miss rate:         4.8% (       5.8%     +       0.1%  )
==23836== LLd miss rate:         0.0% (       0.0%     +       0.0%  )
==23836== 
==23836== LL refs:         2,197,887  ( 2,183,984 rd   +    13,903 wr)
==23836== LL misses:          12,071  (     6,896 rd   +     5,175 wr)
==23836== LL miss rate:          0.0% (       0.0%     +       0.0%  )


======= vEB =======
==23839== Cachegrind, a cache and branch-prediction profiler
==23839== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==23839== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==23839== Command: ./a.out vEB
==23839== 
--23839-- warning: Unknown Intel cache config value (0x76), ignoring
--23839-- warning: Unknown Intel cache config value (0xff), ignoring
--23839-- warning: L2 cache not installed, ignore LL results.
vEB Order (traverser 24)
Searched first half of keyspace in 1354290 usec.
Searched last half of keyspace in 1370509 usec.
Searched whole keyspace in 1367261 usec.
==23839== 
==23839== I   refs:      488,281,602
==23839== I1  misses:          1,008
==23839== LLi misses:          1,006
==23839== I1  miss rate:        0.00%
==23839== LLi miss rate:        0.00%
==23839== 
==23839== D   refs:       68,705,251  (37,868,809 rd   + 30,836,442 wr)
==23839== D1  misses:      1,416,027  ( 1,402,129 rd   +     13,898 wr)
==23839== LLd misses:         11,055  (     5,880 rd   +      5,175 wr)
==23839== D1  miss rate:         2.0% (       3.7%     +        0.0%  )
==23839== LLd miss rate:         0.0% (       0.0%     +        0.0%  )
==23839== 
==23839== LL refs:         1,417,035  ( 1,403,137 rd   +     13,898 wr)
==23839== LL misses:          12,061  (     6,886 rd   +      5,175 wr)
==23839== LL miss rate:          0.0% (       0.0%     +        0.0%  )
As expected, we see the D1 miss rate where vEB < tree-order < in-order.

Real-world performance


Its almost impossible to make generic claims about real-world performance here. The actual cost-benefit argument will depend entirely on your combination of CPU, RAM, data set size and data access patterns. Nevertheless, in my case I'm testing with 64k of packed entries of 4 bytes of key and 1 byte of data. (Note that due to a limitation of my implementation of the vEB layout, I require tree depths d = 2^(2^n) and it would involve significantly more effort to run these benchmarks with 4G of entries.)
$ for i in inorder bfs vEB; do echo; echo; echo ======= $i =======; ./a.out $i; done


======= inorder =======
In Order
Searched first half of keyspace in 37566 usec.
Searched last half of keyspace in 32375 usec.
Searched whole keyspace in 32109 usec.


======= bfs =======
Breadth-First Order
Searched first half of keyspace in 16411 usec.
Searched last half of keyspace in 17368 usec.
Searched whole keyspace in 17736 usec.


======= vEB =======
vEB Order (traverser 24)
Searched first half of keyspace in 36477 usec.
Searched last half of keyspace in 38805 usec.
Searched whole keyspace in 38378 usec.
So it looks as though despite vEB being a more cache-friendly layout, the cost of determining the index of nodes in a vEB layout tends to outweigh the benefits.

Conclusion

I've been investigating vEB layouts for potentially turbo-charging a bunch of static lookup service code so this is a pretty disappointing result in that regard but still, all is not lost. BFS ordering is clearly a very big performance win for minimal computational cost. Also, the focus of my efforts so far have been exclusively on the CPU cache benefits vEB might provide but even if these are nullified by the extra computational overheads, slower storage technologies such as CD, flash and disk should still benefit significantly from the vEB layout. This is doubly true for media with expensive seeks. I might have to run another set of similar benchmarks on various storage media. I imagine the results would be much more in vEB's favour if applied to spinning media.

FYI: If you're interested in the cache configuration of your machine and you run linux, this is a great little trick I discovered in my Googling travels:
$ grep . /sys/devices/system/cpu/cpu0/cache/index*/*