Smolderingly fast b-trees

Published 2024-10-06

(This is part of a series on the design of a language. See the list of posts here.)

Many 'scripting' languages use a hashmap for their default associative data-structure (javascript objects, python dicts, etc). Hashtables have a lot of annoying properties.

Ordered data-structures like b-trees don't have any of these disadvantages. They are typically slower than hashmaps, but I was surprised to find fairly wide variation in people's expectations of how much slower. So let's compare:

microbenchmarks are hard

Let's start with the dumbest possible benchmark - fill a map with uniformly distributed random integers, measure how many cycles it takes to look up all those integers, and take the mean over many iterations.

const before = rdtscp();
for (keys) |key| {
    const value_found = map.get(key);
    if (value_found == null) {
        panic("Value not found", .{});
    }
}
const after = rdtscp();
record("lookup_hit_all", map.count(), @divTrunc(after - before, map.count()));
lookup_hit_all / uniform u64 (cycles per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
rust btree               46   26   19   54   78   94  106  133  152  166  193  215  236  262  290  316  367
rust hashmap siphash    102   67   49   40   37   34   33   32   32   32   32   33   33   34   37   43   51
zig b+tree               46   26   37   56   64   87  100  110  123  143  165  180  197  220  235  269  294
zig hashmap siphash      99   84   64   69   70   68   68   68   68   68   68   68   69   68   69   73   77
zig hashmap wyhash       80   35   29   35   34   35   34   33   34   32   33   31   31   32   32   33   34

That makes btrees look pretty bad. Much worse, in fact, than the other benchmark that several people pointed me at, where at similar sizes btree lookups are only ~2x slower than hashmaps. But don't worry, we can reproduce that too:

for (keys) |key| {
    const before = rdtscp();
    const value_found = map.get(key);
    const after = rdtscp();
    record("lookup_hit_one", map.count(), after - before);
    if (value_found == null) {
        panic("Value not found", .{});
    }
}
lookup_hit_one / uniform u64 (cycles per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
rust btree               47   48   50   82  107  123  135  163  182  200  227  256  279  309  328  358  405
rust hashmap siphash    101  101  101  100  105  103  102  103  103  103  103  108  112  116  124  140  166
zig b+tree               45   45   59   72   85  107  126  135  148  170  188  203  223  246  264  292  319
zig hashmap siphash     106  108  116  117  120  121  123  124  124  124  123  124  127  130  132  137  147
zig hashmap wyhash       53   53   57   63   68   69   72   72   73   71   72   69   76   81   78   87   92

All we did differently was average over one lookup at a time instead of over many, but somehow that made the hashmaps 2-3x slower!

Both of these benchmarks are pretty bad, so let's make better versions of both before trying to explain the differences. I only did this for the zig data-structures, because I am lazy.

First, rather than looking the keys up in the same order we inserted them, we'll precompute a random sample (with replacement):

const keys_hitting = try map.allocator.alloc(Key, @max(batch_size, count));
var permutation = XorShift64{};
for (keys_hitting) |*key| {
    const i = permutation.next() % count;
    key.* = keys[i];
}
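
XorShift64 isn't shown in these snippets - it's just a small PRNG helper in the benchmark harness. Assuming a standard Marsaglia xorshift64, it would look something like this sketch:

const XorShift64 = struct {
    // Marsaglia's suggested starting state; the real harness may seed differently.
    state: u64 = 88172645463325252,

    fn next(self: *XorShift64) u64 {
        // Standard xorshift64 step: three shift-xors of the 64-bit state.
        var x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        return x;
    }
};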

Then rather than measuring a single lookup at a time, we'll measure a batch of 256 lookups to amortize out the overhead of the rdtscp instruction and pull the panics out of the measurement:

for (0..@divTrunc(keys_hitting.len, batch_size)) |batch| {
    const keys_batch = keys_hitting[batch * batch_size ..][0..batch_size];
    var values_found: [batch_size]?Key = undefined;

    const before = rdtscp();
    var key: Key = keys_batch[0];
    for (&values_found, 0..) |*value, i| {
        value.* = map.get(key);
        key = keys_batch[(i + 1) % keys_batch.len];
    }
    const after = rdtscp();
    report("lookup_hit_batch", map.count(), @divTrunc(after - before, batch_size));

    for (values_found) |value_found| {
        if (value_found == null) {
            panic("Value not found", .{});
        }
    }
}

The results look similar to the results for lookup_hit_all:

lookup_hit_batch / uniform u64 (cycles per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
zig b+tree                6    9   31   47   60   82  102  112  126  147  174  186  204  230  245  276  299
zig hashmap siphash      52   53   61   68   71   72   75   76   76   75   75   76   78   80   81   88   95
zig hashmap wyhash       29   29   31   35   38   38   40   41   42   40   40   42   41   39   42   46   43

Now we make one tiny tweak. Instead of iterating through the batch in order, we'll use the value we just looked up to pick the next key.

for (0..@divTrunc(keys_hitting.len, batch_size)) |batch| {
    const keys_batch = keys_hitting[batch * batch_size ..][0..batch_size];
    var values_found: [batch_size]?Key = undefined;

    const before = rdtscp();
    var key: Key = keys_batch[0];
    for (&values_found, 0..) |*value, i| {
        value.* = map.get(key);
        key = keys_batch[(i + value.*.?) % keys_batch.len]; // <-- we changed this line
    }
    const after = rdtscp();
    report("lookup_hit_chain", map.count(), @divTrunc(after - before, batch_size));

    for (values_found) |value_found| {
        if (value_found == null) {
            panic("Value not found", .{});
        }
    }
}

Now we have two benchmarks with the exact same batch size and the exact same instructions to execute. The only difference is that in lookup_hit_chain we've introduced a data dependency between iterations of the loop - we need to know value before we know which key to look up next. This prevents successful speculative execution of the next loop iteration.

lookup_hit_chain / uniform u64 (cycles per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
zig b+tree                7   25   37   48   59   82  105  115  126  148  172  186  198  219  240  273  300
zig hashmap siphash      77   79   88   88   90   91   93   94   94   93   95   99  102  103  106  115  130
zig hashmap wyhash       35   37   44   48   52   52   55   57   55   54   57   63   64   63   70   76   88

The wyhash hashmap halves its lookup time in lookup_hit_batch vs lookup_hit_chain, but the btree doesn't benefit at all. I don't fully understand the limits of speculative execution, but let's make some mildly informed guesses.

At 2^16 keys the whole dataset is about 1MB, which fits comfortably in L2 but is much bigger than L1. Each hashmap lookup costs 1 or maybe 2 L2 cache lookups, each of which has ~20 cycles of latency. The wyhash fast path for u64 is only a handful of instructions with no branches:

bench[0x1014710] <+0>:  rorxq  $0x20, %rdi, %rdx
bench[0x1014716] <+6>:  movabsq $-0x18fc812e5f4bd725, %rax ; imm = 0xE7037ED1A0B428DB
bench[0x1014720] <+16>: xorq   %rax, %rdx
bench[0x1014723] <+19>: movabsq $0x1ff5c2923a788d2c, %rcx ; imm = 0x1FF5C2923A788D2C
bench[0x101472d] <+29>: xorq   %rdi, %rcx
bench[0x1014730] <+32>: mulxq  %rcx, %rcx, %rdx
bench[0x1014735] <+37>: movabsq $-0x5f89e29b87429bd9, %rsi ; imm = 0xA0761D6478BD6427
bench[0x101473f] <+47>: xorq   %rcx, %rsi
bench[0x1014742] <+50>: xorq   %rax, %rdx
bench[0x1014745] <+53>: mulxq  %rsi, %rcx, %rax
bench[0x101474a] <+58>: xorq   %rcx, %rax
bench[0x101474d] <+61>: retq

So while we're waiting for one cache lookup we can start hashing the next key (if we can predict what it is) and maybe even get started on the next cache lookup.
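
As an aside, choosing the hash function for the zig maps is just a matter of the hash-map context. A wyhash-keyed u64 map can be declared along these lines (a sketch - the benchmark's actual wiring may differ):

const std = @import("std");

// Sketch: a u64 -> u64 std.HashMap hashed with wyhash via a custom context.
const WyhashContext = struct {
    pub fn hash(_: WyhashContext, key: u64) u64 {
        return std.hash.Wyhash.hash(0, std.mem.asBytes(&key));
    }
    pub fn eql(_: WyhashContext, a: u64, b: u64) bool {
        return a == b;
    }
};
const WyhashMap = std.HashMap(u64, u64, WyhashContext, std.hash_map.default_max_load_percentage);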

The btree at 2^16 keys has 4 levels. There are typically 15-16 keys per node when inserting random keys, and a linear search stops about halfway through a node on average, so we'll expect around 8 comparisons per node, or around 32 comparisons for the whole lookup. The body of that search loop looks like:

bench[0x10121b0] <+16>: cmpq   %rdx, (%rdi,%rax,8)
bench[0x10121b4] <+20>: jae    0x10121c2                 ; <+34> at bptree.zig
bench[0x10121b6] <+22>: addq   $0x1, %rax
bench[0x10121ba] <+26>: cmpq   %rax, %rsi
bench[0x10121bd] <+29>: jne    0x10121b0                 ; <+16> [inlined] bench.less_than_u64 + 9 at bench.zig:159:5

So 5 instructions, including two branches, per comparison. At least 160 instructions and 64 branches for the whole lookup.
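
That loop is presumably compiled from a linear search shaped something like this sketch (not the exact source) - it returns the index of the first key that is not less than search_key:

// Scan left to right until we find a key that is >= search_key.
fn searchLinear(keys: []const u64, search_key: u64) usize {
    var i: usize = 0;
    while (i < keys.len) : (i += 1) {
        if (!less_than_u64(keys[i], search_key)) break;
    }
    return i;
}

fn less_than_u64(a: u64, b: u64) bool {
    return a < b;
}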

The 16 keys take up 2 cachelines, so we'll average 1.5 cache lookups for each linear key search, plus 1 cache lookup to hit the child/values array - 2.5 cache lookups per level, or 10 in total for the whole 4-level btree lookup. Each of those cache lookups depends on the previous one, so it's hard to speculate them correctly, though prefetching might help within a single node. If every cache lookup hit L2 in strict series we'd expect ~200 cycles, but probably some of the earlier nodes are in L1.

Anyway, let's leave that rabbit hole alone for now. It's enough to notice that hashmaps benefit from speculative execution between multiple lookups and btrees don't.

We can roughly think of lookup_hit_batch as measuring throughput and lookup_hit_chain as measuring latency. All the other benchmarks I've seen only measure one or the other, which starts to explain the disagreement over btree vs hashmap performance.

string keys

Btrees do a lot of key comparisons. If the keys are integers then they are stored inline in the btree node, so it doesn't matter too much. But for string keys we potentially pay for a cache lookup on every comparison.

lookup_hit_chain / random strings (cycles per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
zig b+tree              101  124  144  160  184  219  255  279  310  366  427  470  512  577  700  826  965
zig hashmap siphash     145  147  154  159  162  163  165  167  168  173  181  186  196  211  247  287  312
zig hashmap wyhash       49   50   59   65   68   69   72   74   75   81   88   93  103  119  154  188  219

Completely random strings are actually a pretty unlikely workload and unfairly benefit the btree, which typically only has to compare the first character of each string to find that they are not equal. But real datasets (eg urls in a log) often have large common prefixes. We can see the difference if we make the first 16 characters constant:

lookup_hit_chain / random-ish strings (cycles per lookup)
log2(#keys)                0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
zig b+tree               102   150   221   338   546   618   807   911  1077  1393  1507  1641  1812  2095  2628  2848  3302
zig hashmap siphash      145   147   153   159   162   163   165   167   168   174   182   187   197   210   245   282   313
zig hashmap wyhash        49    50    59    66    69    69    72    74    74    81    88    93   102   117   154   188   214

Hashmaps don't care, but the btree is really hurting.

If we're just supporting strings we can optimize for this case by storing the length of the common prefix within each node. But it's hard to imagine how to generalize that to arbitrary types. What if the key is a tuple of strings? Or a small set?
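
As a sketch of that string-only idea (not something the benchmarked btree does): if a node records how many leading bytes all of its keys share, comparisons inside that node only need to look at the tails.

const std = @import("std");

// Hypothetical: skip the bytes that every key in this node is known to share.
// Assumes both keys really do agree on the first shared_prefix_len bytes.
fn lessThanSkippingPrefix(shared_prefix_len: usize, a: []const u8, b: []const u8) bool {
    return std.mem.lessThan(u8, a[shared_prefix_len..], b[shared_prefix_len..]);
}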

wasm hashes

Let's try the same in wasm, where hash functions have less access to fast vector instructions. Wasm also doesn't have rdtscp so times here are in nanoseconds rather than cycles.
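
(A hypothetical sketch of the timing fallback: a wasm build can time each batch with a monotonic nanosecond clock such as Zig's std.time.Timer instead of rdtscp. The actual wasm harness may do this differently.)

const std = @import("std");

// Hypothetical: time a batch of lookups in nanoseconds instead of cycles.
fn timeBatchNs(batch: *const fn () void) !u64 {
    var timer = try std.time.Timer.start();
    batch();
    return timer.read();
}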

lookup_hit_chain / uniform u64 / wasm (ns per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
zig b+tree                5   14   19   22   27   38   45   49   53   59   70   75   80   87   98  111  120
zig hashmap siphash      33   34   37   38   40   40   41   42   41   41   42   43   44   45   46   52   58
zig hashmap wyhash       29   30   33   35   36   36   37   38   38   37   38   40   40   42   43   47   51

lookup_hit_chain / random strings / wasm (ns per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
zig b+tree               64   77   88   97  117  141  154  167  192  216  269  287  294  308  369  436  500
zig hashmap siphash      65   64   68   70   73   71   72   73   73   75   78   81   84   88  102  118  135
zig hashmap wyhash       45   45   48   50   52   52   53   54   54   56   59   62   65   69   79   93  105

The overall ratios are fairly similar, although wyhash seems to have been penalized a little relative to siphash.

Neither the hash functions nor the table scan generates vector instructions at all, even when comparing strings. And, uh, now that I think to look, they don't generate vector instructions on x86 either. So I guess that's a non-issue.

btree tuning

I implemented both btrees and b+trees. I didn't see much difference between them for insert/lookup, so I preferred the b+tree for the easier/faster implementation of scans and range queries.

The rust btreemap fixes the max key count per node to 11. For all the workloads I've tried the sweet spot seems to be to fix the node size to 512 bytes, which is 31 keys for u64 and 20 keys for strings.

Allowing leaves and branches to have different sizes didn't help.

I gained a small speedup by manually laying out the node as key_count, then keys, then values/children. Zig by default prefers to put key_count at the end of the struct to avoid padding, but we always read key_count first, so it's nice to have some keys land on the same cacheline. Maintaining this optimization across different architectures was annoying though, so I rolled it back and it's not reflected in the tests above.
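
For illustration, one way to pin that field order in Zig is an extern struct, roughly like this sketch (sizes assume u64 keys and values and a 31-key node; the real nodes also need a branch variant with children instead of values):

// Sketch: extern struct keeps the declared field order, so key_count and the
// first few keys share a cacheline. A plain struct would be free to reorder.
const max_keys = 31;
const LeafNode = extern struct {
    key_count: u8,
    keys: [max_keys]u64,
    values: [max_keys]u64,
};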

The rust btreemap switches on Ordering. I got a small boost from instead using less_than during the search and calling equal afterwards. For a lookup at 2^16 keys we'll expect to call less_than 32 times and equal once, so it's worth paying for the extra call to equal in exchange for tightening up the inner search loop.
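
In code the shape is roughly this sketch, assuming u64 keys and using the binary search shown just below, which returns the index of the first key that is not less than search_key - equality is then tested exactly once:

// Sketch: the search uses only less_than; 'equal' is paid for once, at the end.
fn leafGet(keys: []u64, values: []u64, search_key: u64) ?u64 {
    const index = searchBinary(keys, search_key);
    if (index < keys.len and keys[index] == search_key) return values[index];
    return null;
}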

I tried various binary searches. The best was this 'branchless' variant:

fn searchBinary(keys: []Key, search_key: Key) usize {
    if (keys.len == 0) return 0;
    var offset: usize = 0;
    var length: usize = keys.len;
    while (length > 1) {
        const half = length / 2;
        const mid = offset + half;
        if (less_than(keys[mid], search_key)) {
            @branchHint(.unpredictable);
            offset = mid;
        }
        length -= half;
    }
    offset += @intFromBool(less_than(keys[offset], search_key));
    return offset;
}

The while (length > 1) check requires a branch, but it's easily predictable. @branchHint(.unpredictable) causes llvm to emit a conditional move for offset = mid.

bench[0x101f5e0] <+0>:  movq   %rsi, %rax
bench[0x101f5e3] <+3>:  testq  %rsi, %rsi
bench[0x101f5e6] <+6>:  je     0x101f616                 ; <+54> at bptree.zig
bench[0x101f5e8] <+8>:  xorl   %ecx, %ecx
bench[0x101f5ea] <+10>: cmpq   $0x1, %rax
bench[0x101f5ee] <+14>: je     0x101f60b                 ; <+43> [inlined] bench.less_than_u64 at bench.zig:185:5
bench[0x101f5f0] <+16>: movq   %rax, %rsi
bench[0x101f5f3] <+19>: shrq   %rsi
bench[0x101f5f6] <+22>: leaq   (%rcx,%rsi), %r8
bench[0x101f5fa] <+26>: cmpq   %rdx, (%rdi,%r8,8)
bench[0x101f5fe] <+30>: cmovbq %r8, %rcx
bench[0x101f602] <+34>: subq   %rsi, %rax
bench[0x101f605] <+37>: cmpq   $0x1, %rax
bench[0x101f609] <+41>: ja     0x101f5f0                 ; <+16> at bptree.zig:334:37
bench[0x101f60b] <+43>: cmpq   %rdx, (%rdi,%rcx,8)
bench[0x101f60f] <+47>: adcq   $0x0, %rcx
bench[0x101f613] <+51>: movq   %rcx, %rax
bench[0x101f616] <+54>: retq

Linear search was still faster in most benchmarks though, even with ludicrously large node sizes.

I also tried a btree variant I saw in some paper where leaves are left unsorted and keys are always inserted at the end of the leaf. This saves some memcopying in the common case, but having to sort the leaf before splitting negates the benefit.

outcome

All the benchmarks above are basically best-case scenarios, where we're doing a lot of lookups in a row. If we were just doing one lookup in the middle of some other work then the btree might not be in cache at all, and each of those 2.5 cache lookups per level would go all the way to main memory. That's catastrophic. Whereas an open-addressed hashmap will typically only hit 1 or 2 cachelines per lookup, regardless of size.

And while btrees avoid many of the performance edge cases of hashmaps, they also have some cliffs of their own to fall off as comparisons get more expensive and touch more memory, as we saw with the random-ish strings above.

I haven't measured space usage yet, but we can expect it to be worse for btrees. For random keys the typical node occupancy is 50%, minus per-node overhead like key_count, whereas I've been running the zig hashmaps at 80% load. So we can guesstimate that the btrees will use >60% more memory (80% / 50% = 1.6x).

Space usage is really bad for small maps too. I'd need to add some extra tweaks to allow the btree root node to start small and grow, rather than paying the full 512 bytes for a map with only 1 key.

Overall, I'm unconvinced that it's worth exploring btrees further. I'll stick to hashmaps and I'll either iterate in insertion order or I'll require sorting entries before iterating. But if anyone else likes the look of this rabbit hole I left some other ideas untouched:

miscellanea

All the experiments here are using:

Running on an Intel i7-1260P with:

Usually I would also use cset to pin experiments to a shielded core, but it seems that recent systemd versions have broken that workflow and I haven't figured out a replacement yet.

The rest of the benchmark setup can be found in jamii/maps. I measured more than just lookups, but btrees have similar performance for every operation due to having to find a key first.

Cache latency numbers came from mlc and roughly match claims I found online.

Thanks to Dan Luu for help thinking through cache behaviour.