Smolderingly fast b-trees

Published 2024-10-06

(This is part of a series on the design of a language. See the list of posts here.)

Many 'scripting' languages use a hashmap for their default associative data-structure (javascript objects, python dicts, etc). Hashtables have a lot of annoying properties.

Ordered data-structures like b-trees don't have any of these disadvantages. They are typically slower than hashmaps, but I was surprised to find fairly wide variation in people's expectations of how much slower. So let's compare:

microbenchmarks are hard

Let's start with the dumbest possible benchmark - fill a map with uniformly distributed random integers, measure how many cycles it takes to look up all those integers, and take the mean over many iterations.

const before = rdtscp();
for (keys) |key| {
    const value_found = map.get(key);
    if (value_found == null) {
        panic("Value not found", .{});
    }
}
const after = rdtscp();
record("lookup_hit_all", map.count(), @divTrunc(after - before, map.count()));
lookup_hit_all / uniform u64 (cycles per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
rust btree               46   26   19   54   78   94  106  133  152  166  193  215  236  262  290  316  367
rust hashmap siphash    102   67   49   40   37   34   33   32   32   32   32   33   33   34   37   43   51
zig b+tree               46   26   37   56   64   87  100  110  123  143  165  180  197  220  235  269  294
zig hashmap siphash      99   84   64   69   70   68   68   68   68   68   68   68   69   68   69   73   77
zig hashmap wyhash       80   35   29   35   34   35   34   33   34   32   33   31   31   32   32   33   34

That makes btrees look pretty bad. Much worse, in fact, than the other benchmark that several people pointed me at, where at similar sizes btree lookups are only ~2x slower than hashmaps. But don't worry, we can reproduce that too:

for (keys) |key| {
    const before = rdtscp();
    const value_found = map.get(key);
    const after = rdtscp();
    record("lookup_hit_one", map.count(), after - before);
    if (value_found == null) {
        panic("Value not found", .{});
    }
}
lookup_hit_one / uniform u64 (cycles per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
rust btree               47   48   50   82  107  123  135  163  182  200  227  256  279  309  328  358  405
rust hashmap siphash    101  101  101  100  105  103  102  103  103  103  103  108  112  116  124  140  166
zig b+tree               45   45   59   72   85  107  126  135  148  170  188  203  223  246  264  292  319
zig hashmap siphash     106  108  116  117  120  121  123  124  124  124  123  124  127  130  132  137  147
zig hashmap wyhash       53   53   57   63   68   69   72   72   73   71   72   69   76   81   78   87   92

All we did differently was average over one lookup at a time instead of over many, but somehow that made the hashmaps 2-3x slower!

Both of these benchmarks are pretty bad, so let's make better versions of both before trying to explain the differences. I only did this for the zig data-structures, because I am lazy.

First, rather than looking the keys up in the same order we inserted them, we'll precompute a random sample (with replacement):

const keys_hitting = try map.allocator.alloc(Key, @max(batch_size, count));
var permutation = XorShift64{};
for (keys_hitting) |*key| {
    const i = permutation.next() % count;
    key.* = keys[i];
}
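
XorShift64 isn't shown in these snippets - it's just a small PRNG helper in the benchmark harness. Assuming a standard Marsaglia xorshift64, it would look something like this sketch:

const XorShift64 = struct {
    // Marsaglia's suggested starting state; the real harness may seed differently.
    state: u64 = 88172645463325252,

    fn next(self: *XorShift64) u64 {
        // Standard xorshift64 step: three shift-xors of the 64-bit state.
        var x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        return x;
    }
};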

Then rather than measuring a single lookup at a time, we'll measure a batch of 256 lookups to amortize out the overhead of the rdtscp instruction and pull the panics out of the measurement:

for (0..@divTrunc(keys_hitting.len, batch_size)) |batch| {
    const keys_batch = keys_hitting[batch * batch_size ..][0..batch_size];
    var values_found: [batch_size]?Key = undefined;

    const before = rdtscp();
    var key: Key = keys_batch[0];
    for (&values_found, 0..) |*value, i| {
        value.* = map.get(key);
        key = keys_batch[(i + 1) % keys_batch.len];
    }
    const after = rdtscp();
    report("lookup_hit_batch", map.count(), @divTrunc(after - before, batch_size));

    for (values_found) |value_found| {
        if (value_found == null) {
            panic("Value not found", .{});
        }
    }
}

The results look similar to the results for lookup_hit_all:

lookup_hit_batch / uniform u64 (cycles per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
zig b+tree                6    9   31   47   60   82  102  112  126  147  174  186  204  230  245  276  299
zig hashmap siphash      52   53   61   68   71   72   75   76   76   75   75   76   78   80   81   88   95
zig hashmap wyhash       29   29   31   35   38   38   40   41   42   40   40   42   41   39   42   46   43

Now we make one tiny tweak. Instead of iterating through the batch in order, we'll use the value we just looked up to pick the next key.

for (0..@divTrunc(keys_hitting.len, batch_size)) |batch| {
    const keys_batch = keys_hitting[batch * batch_size ..][0..batch_size];
    var values_found: [batch_size]?Key = undefined;

    const before = rdtscp();
    var key: Key = keys_batch[0];
    for (&values_found, 0..) |*value, i| {
        value.* = map.get(key);
        key = keys_batch[(i + value.*.?) % keys_batch.len]; // <-- we changed this line
    }
    const after = rdtscp();
    report("lookup_hit_chain", map.count(), @divTrunc(after - before, batch_size));

    for (values_found) |value_found| {
        if (value_found == null) {
            panic("Value not found", .{});
        }
    }
}

Now we have two benchmarks with the exact same batch size and the exact same instructions to execute. The only difference is that in lookup_hit_chain we've introduced a data dependency between iterations of the loop - we need to know value before we know which key to look up next. This prevents successful speculative execution of the next loop iteration.

lookup_hit_chain / uniform u64 (cycles per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
zig b+tree                7   25   37   48   59   82  105  115  126  148  172  186  198  219  240  273  300
zig hashmap siphash      77   79   88   88   90   91   93   94   94   93   95   99  102  103  106  115  130
zig hashmap wyhash       35   37   44   48   52   52   55   57   55   54   57   63   64   63   70   76   88

The wyhash hashmap halves its lookup time in lookup_hit_batch vs lookup_hit_chain, but the btree doesn't benefit at all. I don't fully understand the limits of speculative execution, but let's make some mildly informed guesses.

At 2^16 keys the whole dataset is about 1MB, which fits comfortably in L2 but is much bigger than L1. Each hashmap lookup costs 1 or maybe 2 L2 cache lookups, each of which has ~20 cycles of latency. The wyhash fast path for u64 is only a handful of instructions with no branches:

bench[0x1014710] <+0>:  rorxq  $0x20, %rdi, %rdx
bench[0x1014716] <+6>:  movabsq $-0x18fc812e5f4bd725, %rax ; imm = 0xE7037ED1A0B428DB
bench[0x1014720] <+16>: xorq   %rax, %rdx
bench[0x1014723] <+19>: movabsq $0x1ff5c2923a788d2c, %rcx ; imm = 0x1FF5C2923A788D2C
bench[0x101472d] <+29>: xorq   %rdi, %rcx
bench[0x1014730] <+32>: mulxq  %rcx, %rcx, %rdx
bench[0x1014735] <+37>: movabsq $-0x5f89e29b87429bd9, %rsi ; imm = 0xA0761D6478BD6427
bench[0x101473f] <+47>: xorq   %rcx, %rsi
bench[0x1014742] <+50>: xorq   %rax, %rdx
bench[0x1014745] <+53>: mulxq  %rsi, %rcx, %rax
bench[0x101474a] <+58>: xorq   %rcx, %rax
bench[0x101474d] <+61>: retq

So while we're waiting for one cache lookup we can start hashing the next key (if we can predict what it is) and maybe even get started on the next cache lookup.
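
As an aside, choosing the hash function for the zig maps is just a matter of the hash-map context. A wyhash-keyed u64 map can be declared along these lines (a sketch - the benchmark's actual wiring may differ):

const std = @import("std");

// Sketch: a u64 -> u64 std.HashMap hashed with wyhash via a custom context.
const WyhashContext = struct {
    pub fn hash(_: WyhashContext, key: u64) u64 {
        return std.hash.Wyhash.hash(0, std.mem.asBytes(&key));
    }
    pub fn eql(_: WyhashContext, a: u64, b: u64) bool {
        return a == b;
    }
};
const WyhashMap = std.HashMap(u64, u64, WyhashContext, std.hash_map.default_max_load_percentage);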

The btree at 2^16 keys has 4 levels. There are typically 15-16 keys per node when inserting random keys, and a linear search stops about halfway through a node on average, so we'll expect around 8 comparisons per node, or around 32 comparisons for the whole lookup. The body of that search loop looks like:

bench[0x10121b0] <+16>: cmpq   %rdx, (%rdi,%rax,8)
bench[0x10121b4] <+20>: jae    0x10121c2                 ; <+34> at bptree.zig
bench[0x10121b6] <+22>: addq   $0x1, %rax
bench[0x10121ba] <+26>: cmpq   %rax, %rsi
bench[0x10121bd] <+29>: jne    0x10121b0                 ; <+16> [inlined] bench.less_than_u64 + 9 at bench.zig:159:5

So 5 instructions, including two branches, per comparison. At least 160 instructions and 64 branches for the whole lookup.
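
That loop is presumably compiled from a linear search shaped something like this sketch (not the exact source) - it returns the index of the first key that is not less than search_key:

// Scan left to right until we find a key that is >= search_key.
fn searchLinear(keys: []const u64, search_key: u64) usize {
    var i: usize = 0;
    while (i < keys.len) : (i += 1) {
        if (!less_than_u64(keys[i], search_key)) break;
    }
    return i;
}

fn less_than_u64(a: u64, b: u64) bool {
    return a < b;
}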

The 16 keys take up 2 cachelines, so we'll average 1.5 cache lookups for each linear key search, plus 1 cache lookup to hit the child/values array - 2.5 cache lookups per level, or 10 in total for the whole 4-level btree lookup. Each of those cache lookups depends on the previous one, so it's hard to speculate them correctly, though prefetching might help within a single node. If every cache lookup hit L2 in strict series we'd expect ~200 cycles, but probably some of the earlier nodes are in L1.

Anyway, let's leave that rabbit hole alone for now. It's enough to notice that hashmaps benefit from speculative execution between multiple lookups and btrees don't.

We can roughly think of lookup_hit_batch as measuring throughput and lookup_hit_chain as measuring latency. All the other benchmarks I've seen only measure one or the other, which starts to explain the disagreement over btree vs hashmap performance.

string keys

Btrees do a lot of key comparisons. If the keys are integers then they are stored inline in the btree node, so it doesn't matter too much. But for string keys we potentially pay for a cache lookup on every comparison.

lookup_hit_chain / random strings (cycles per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
zig b+tree              101  124  144  160  184  219  255  279  310  366  427  470  512  577  700  826  965
zig hashmap siphash     145  147  154  159  162  163  165  167  168  173  181  186  196  211  247  287  312
zig hashmap wyhash       49   50   59   65   68   69   72   74   75   81   88   93  103  119  154  188  219

Completely random strings are actually a pretty unlikely workload and unfairly benefit the btree, which typically only has to compare the first character of each string to find that they are not equal. But real datasets (eg urls in a log) often have large common prefixes. We can see the difference if we make the first 16 characters constant:

lookup_hit_chain / random-ish strings (cycles per lookup)
log2(#keys)                0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
zig b+tree               102   150   221   338   546   618   807   911  1077  1393  1507  1641  1812  2095  2628  2848  3302
zig hashmap siphash      145   147   153   159   162   163   165   167   168   174   182   187   197   210   245   282   313
zig hashmap wyhash        49    50    59    66    69    69    72    74    74    81    88    93   102   117   154   188   214

Hashmaps don't care, but the btree is really hurting.

If we're just supporting strings we can optimize for this case by storing the length of the common prefix within each node. But it's hard to imagine how to generalize that to arbitrary types. What if the key is a tuple of strings? Or a small set?
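
As a sketch of that string-only idea (not something the benchmarked btree does): if a node records how many leading bytes all of its keys share, comparisons inside that node only need to look at the tails.

const std = @import("std");

// Hypothetical: skip the bytes that every key in this node is known to share.
// Assumes both keys really do agree on the first shared_prefix_len bytes.
fn lessThanSkippingPrefix(shared_prefix_len: usize, a: []const u8, b: []const u8) bool {
    return std.mem.lessThan(u8, a[shared_prefix_len..], b[shared_prefix_len..]);
}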

wasm hashes

Let's try the same in wasm, where hash functions have less access to fast vector instructions. Wasm also doesn't have rdtscp so times here are in nanoseconds rather than cycles.
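
(A hypothetical sketch of the timing fallback: a wasm build can time each batch with a monotonic nanosecond clock such as Zig's std.time.Timer instead of rdtscp. The actual wasm harness may do this differently.)

const std = @import("std");

// Hypothetical: time a batch of lookups in nanoseconds instead of cycles.
fn timeBatchNs(batch: *const fn () void) !u64 {
    var timer = try std.time.Timer.start();
    batch();
    return timer.read();
}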

lookup_hit_chain / uniform u64 / wasm (ns per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
zig b+tree                5   14   19   22   27   38   45   49   53   59   70   75   80   87   98  111  120
zig hashmap siphash      33   34   37   38   40   40   41   42   41   41   42   43   44   45   46   52   58
zig hashmap wyhash       29   30   33   35   36   36   37   38   38   37   38   40   40   42   43   47   51

lookup_hit_chain / random strings / wasm (ns per lookup)
log2(#keys)               0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
zig b+tree               64   77   88   97  117  141  154  167  192  216  269  287  294  308  369  436  500
zig hashmap siphash      65   64   68   70   73   71   72   73   73   75   78   81   84   88  102  118  135
zig hashmap wyhash       45   45   48   50   52   52   53   54   54   56   59   62   65   69   79   93  105

The overall ratios are fairly similar, although wyhash seems to have been penalized a little relative to siphash.

Neither the hash functions nor the table scan generates vector instructions at all, even when comparing strings. And, uh, now that I think to look, they don't generate vector instructions on x86 either. So I guess that's a non-issue.

btree tuning

I implemented both btrees and b+trees. I didn't see much difference between them for insert/lookup, so I preferred the b+tree for the easier/faster implementation of scans and range queries.

The rust btreemap fixes the max key count per node to 11. For all the workloads I've tried the sweet spot seems to be to fix the node size to 512 bytes, which is 31 keys for u64 and 20 keys for strings.

Allowing leaves and branches to have different sizes didn't help.

I gained a small speedup by manually laying out the node as key_count, then keys, then values/children. Zig by default prefers to put key_count at the end of the struct to avoid padding, but we always read key_count first, so it's nice to have some keys land on the same cacheline. Maintaining this optimization across different architectures was annoying though, so I rolled it back and it's not reflected in the tests above.
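
For illustration, one way to pin that field order in Zig is an extern struct, roughly like this sketch (sizes assume u64 keys and values and a 31-key node; the real nodes also need a branch variant with children instead of values):

// Sketch: extern struct keeps the declared field order, so key_count and the
// first few keys share a cacheline. A plain struct would be free to reorder.
const max_keys = 31;
const LeafNode = extern struct {
    key_count: u8,
    keys: [max_keys]u64,
    values: [max_keys]u64,
};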

The rust btreemap switches on Ordering. I got a small boost from instead using less_than during the search and calling equal afterwards. For a lookup at 2^16 keys we'll expect to call less_than 32 times and equal once, so it's worth paying for the extra call to equal in exchange for tightening up the inner search loop.
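
In code the shape is roughly this sketch, assuming u64 keys and using the binary search shown just below, which returns the index of the first key that is not less than search_key - equality is then tested exactly once:

// Sketch: the search uses only less_than; 'equal' is paid for once, at the end.
fn leafGet(keys: []u64, values: []u64, search_key: u64) ?u64 {
    const index = searchBinary(keys, search_key);
    if (index < keys.len and keys[index] == search_key) return values[index];
    return null;
}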

I tried various binary searches. The best was this 'branchless' variant:

fn searchBinary(keys: []Key, search_key: Key) usize {
    if (keys.len == 0) return 0;
    var offset: usize = 0;
    var length: usize = keys.len;
    while (length > 1) {
        const half = length / 2;
        const mid = offset + half;
        if (less_than(keys[mid], search_key)) {
            @branchHint(.unpredictable);
            offset = mid;
        }
        length -= half;
    }
    offset += @intFromBool(less_than(keys[offset], search_key));
    return offset;
}

The while (length > 1) check requires a branch, but it's easily predictable. @branchHint(.unpredictable) causes llvm to emit a conditional move for offset = mid.

bench[0x101f5e0] <+0>:  movq   %rsi, %rax
bench[0x101f5e3] <+3>:  testq  %rsi, %rsi
bench[0x101f5e6] <+6>:  je     0x101f616                 ; <+54> at bptree.zig
bench[0x101f5e8] <+8>:  xorl   %ecx, %ecx
bench[0x101f5ea] <+10>: cmpq   $0x1, %rax
bench[0x101f5ee] <+14>: je     0x101f60b                 ; <+43> [inlined] bench.less_than_u64 at bench.zig:185:5
bench[0x101f5f0] <+16>: movq   %rax, %rsi
bench[0x101f5f3] <+19>: shrq   %rsi
bench[0x101f5f6] <+22>: leaq   (%rcx,%rsi), %r8
bench[0x101f5fa] <+26>: cmpq   %rdx, (%rdi,%r8,8)
bench[0x101f5fe] <+30>: cmovbq %r8, %rcx
bench[0x101f602] <+34>: subq   %rsi, %rax
bench[0x101f605] <+37>: cmpq   $0x1, %rax
bench[0x101f609] <+41>: ja     0x101f5f0                 ; <+16> at bptree.zig:334:37
bench[0x101f60b] <+43>: cmpq   %rdx, (%rdi,%rcx,8)
bench[0x101f60f] <+47>: adcq   $0x0, %rcx
bench[0x101f613] <+51>: movq   %rcx, %rax
bench[0x101f616] <+54>: retq

Linear search was still faster in most benchmarks though, even with ludicrously large node sizes.

I also tried a btree variant I saw in some paper where leaves are left unsorted and keys are always inserted at the end of the leaf. This saves some memcopying in the common case, but having to sort the leaf before splitting negates the benefit.

outcome

All the benchmarks above are basically best-case scenarios, where we're doing a lot of lookups in a row. If we were just doing one lookup in the middle of some other work then the btree might not be in cache at all, and each of those 2.5 cache lookups per level would go all the way to main memory. That's catastrophic. Whereas an open-addressed hashmap will typically only hit 1 or 2 cachelines per lookup, regardless of size.

And while btrees avoid many of the performance edge cases of hashmaps, they also have some cliffs of their own to fall off as comparisons get more expensive and touch more memory, as we saw with the random-ish strings above.

I haven't measured space usage yet, but we can expect it to be worse for btrees. For random keys the typical node occupancy is 50%, minus per-node overhead like key_count, whereas I've been running the zig hashmaps at 80% load. So we can guesstimate that the btrees will use >60% more memory (80% / 50% = 1.6x).

Space usage is really bad for small maps too. I'd need to add some extra tweaks to allow the btree root node to start small and grow, rather than paying the full 512 bytes for a map with only 1 key.

Overall, I'm unconvinced that it's worth exploring btrees further. I'll stick to hashmaps and I'll either iterate in insertion order or I'll require sorting entries before iterating. But if anyone else likes the look of this rabbit hole I left some other ideas untouched:

miscellanea

All the experiments here are using:

Running on an Intel i7-1260P with:

Usually I would also use cset to pin experiments to a shielded core, but it seems that recent systemd versions have broken that workflow and I haven't figured out a replacement yet.

The rest of the benchmark setup can be found in jamii/maps. I measured more than just lookups, but btrees have similar performance for every operation due to having to find a key first.

Cache latency numbers came from mlc and roughly match claims I found online.

Thanks to Dan Luu for help thinking through cache behaviour.