(This is part of a series on the design of a language. See the list of posts here.)
Many 'scripting' languages use a hashmap for their default associative data-structure (javascript objects, python dicts, etc). Hashtables have a lot of annoying properties:
- Vulnerable to hash flooding.
- If protected against hash flooding by random seeds, the iteration order becomes non-deterministic which is annoying for snapshot testing, reproducible builds, etc.
- May have to rehash on insert, which produces terrible worst-case latencies for large hashmaps.
- Repeatedly growing large allocations without fragmentation is difficult on wasm targets, because virtual memory tricks are not available and pages can't be unmapped.
- Vector instructions on wasm are limited and there are no AES instructions. This makes many hash functions much slower.
Ordered data-structures like b-trees don't have any of these disadvantages. They are typically slower than hashmaps, but I was surprised to find fairly wide variation in people's expectations of how much slower. So let's compare:
- rust's std::collections::HashMap with siphash
- rust's std::collections::BTreeMap
- zig's std.HashMap with siphash
- zig's std.HashMap with wyhash
- My own bptree with various compile-time options for different experiments.
microbenchmarks are hard
Let's start with the dumbest possible benchmark - fill a map with uniformly distributed random integers, measure how many cycles it takes to look up all those integers, and take the mean over many iterations.
const before = rdtscp();
for (keys) |key| {
const value_found = map.get(key);
if (value_found == null) {
panic("Value not found", .{});
}
}
const after = rdtscp();
record("lookup_hit_all", map.count(), @divTrunc(after - before, map.count()));
log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rust btree | 46 | 26 | 19 | 54 | 78 | 94 | 106 | 133 | 152 | 166 | 193 | 215 | 236 | 262 | 290 | 316 | 367 |
rust hashmap siphash | 102 | 67 | 49 | 40 | 37 | 34 | 33 | 32 | 32 | 32 | 32 | 33 | 33 | 34 | 37 | 43 | 51 |
zig b+tree | 46 | 26 | 37 | 56 | 64 | 87 | 100 | 110 | 123 | 143 | 165 | 180 | 197 | 220 | 235 | 269 | 294 |
zig hashmap siphash | 99 | 84 | 64 | 69 | 70 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 69 | 68 | 69 | 73 | 77 |
zig hashmap wyhash | 80 | 35 | 29 | 35 | 34 | 35 | 34 | 33 | 34 | 32 | 33 | 31 | 31 | 32 | 32 | 33 | 34 |
That makes btrees look pretty bad. Much worse, in fact, than the other benchmark that several people pointed me at, where at similar sizes btree lookups are only ~2x slower than hashmaps. But don't worry, we can reproduce that too:
for (keys) |key| {
const before = rdtscp();
const value_found = map.get(key);
const after = rdtscp();
record("lookup_hit_one", map.count(), after - before);
if (value_found == null) {
panic("Value not found", .{});
}
}
log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rust btree | 47 | 48 | 50 | 82 | 107 | 123 | 135 | 163 | 182 | 200 | 227 | 256 | 279 | 309 | 328 | 358 | 405 |
rust hashmap siphash | 101 | 101 | 101 | 100 | 105 | 103 | 102 | 103 | 103 | 103 | 103 | 108 | 112 | 116 | 124 | 140 | 166 |
zig b+tree | 45 | 45 | 59 | 72 | 85 | 107 | 126 | 135 | 148 | 170 | 188 | 203 | 223 | 246 | 264 | 292 | 319 |
zig hashmap siphash | 106 | 108 | 116 | 117 | 120 | 121 | 123 | 124 | 124 | 124 | 123 | 124 | 127 | 130 | 132 | 137 | 147 |
zig hashmap wyhash | 53 | 53 | 57 | 63 | 68 | 69 | 72 | 72 | 73 | 71 | 72 | 69 | 76 | 81 | 78 | 87 | 92 |
All we did differently was average over one lookup at a time instead of over many, but somehow that made the hashmaps 2-3x slower!
Both of these benchmarks are pretty bad, so let's make better versions of both before trying to explain the differences. I only did this for the zig data-structures, because I am lazy.
First, rather than looking the keys up in the same order we inserted them, we'll precompute a random sample (with replacement):
const keys_hitting = try map.allocator.alloc(Key, @max(batch_size, count));
var permutation = XorShift64{};
for (keys_hitting) |*key| {
const i = permutation.next() % count;
key.* = keys[i];
}
Then rather than measuring a single lookup at a time, we'll measure a batch of 256 lookups to amortize out the overhead of the rdtscp instruction and pull the panics out of the measurement:
for (0..@divTrunc(keys_hitting.len, batch_size)) |batch| {
const keys_batch = keys_hitting[batch * batch_size ..][0..batch_size];
var values_found: [batch_size]?Key = undefined;
const before = rdtscp();
var key: Key = keys_batch[0];
for (&values_found, 0..) |*value, i| {
value.* = map.get(key);
key = keys_batch[(i + 1) % keys_batch.len];
}
const after = rdtscp();
report("lookup_hit_batch", map.count(), @divTrunc(after - before, batch_size));
for (values_found) |value_found| {
if (value_found == null) {
panic("Value not found", .{});
}
}
}
The results look similar to the results for lookup_hit_all:
log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zig b+tree | 6 | 9 | 31 | 47 | 60 | 82 | 102 | 112 | 126 | 147 | 174 | 186 | 204 | 230 | 245 | 276 | 299 |
zig hashmap siphash | 52 | 53 | 61 | 68 | 71 | 72 | 75 | 76 | 76 | 75 | 75 | 76 | 78 | 80 | 81 | 88 | 95 |
zig hashmap wyhash | 29 | 29 | 31 | 35 | 38 | 38 | 40 | 41 | 42 | 40 | 40 | 42 | 41 | 39 | 42 | 46 | 43 |
Now we make one tiny tweak. Instead of iterating through the batch in order, we'll use the value we just looked up to pick the next key.
for (0..@divTrunc(keys_hitting.len, batch_size)) |batch| {
const keys_batch = keys_hitting[batch * batch_size ..][0..batch_size];
var values_found: [batch_size]?Key = undefined;
const before = rdtscp();
var key: Key = keys_batch[0];
for (&values_found, 0..) |*value, i| {
value.* = map.get(key);
key = keys_batch[(i + value.*.?) % keys_batch.len]; // <-- we changed this line
}
const after = rdtscp();
report("lookup_hit_chain", map.count(), @divTrunc(after - before, batch_size));
for (values_found) |value_found| {
if (value_found == null) {
panic("Value not found", .{});
}
}
}
Now we have two benchmarks with the exact same batch size and the exact same instructions to execute. The only difference is that in lookup_hit_chain we've introduced a data dependency between each iteration of the loop - we need to know value before we know which key to look up next. This prevents successful speculative execution of the next loop iteration.
log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zig b+tree | 7 | 25 | 37 | 48 | 59 | 82 | 105 | 115 | 126 | 148 | 172 | 186 | 198 | 219 | 240 | 273 | 300 |
zig hashmap siphash | 77 | 79 | 88 | 88 | 90 | 91 | 93 | 94 | 94 | 93 | 95 | 99 | 102 | 103 | 106 | 115 | 130 |
zig hashmap wyhash | 35 | 37 | 44 | 48 | 52 | 52 | 55 | 57 | 55 | 54 | 57 | 63 | 64 | 63 | 70 | 76 | 88 |
The wyhash hashmap halves its lookup time in lookup_hit_batch vs lookup_hit_chain, but the btree doesn't benefit at all. I don't fully understand the limits of speculative execution, but let's make some mildly informed guesses.
At 2^16 keys the whole dataset is about 1mb, which fits comfortably in L2 but is much bigger than L1. Each hashmap lookup costs 1 or maybe 2 L2 cache lookups, each of which has ~20 cycle latency. The wyhash fast path for u64 is only a handful of instructions with no branches:
bench[0x1014710] <+0>: rorxq $0x20, %rdi, %rdx
bench[0x1014716] <+6>: movabsq $-0x18fc812e5f4bd725, %rax ; imm = 0xE7037ED1A0B428DB
bench[0x1014720] <+16>: xorq %rax, %rdx
bench[0x1014723] <+19>: movabsq $0x1ff5c2923a788d2c, %rcx ; imm = 0x1FF5C2923A788D2C
bench[0x101472d] <+29>: xorq %rdi, %rcx
bench[0x1014730] <+32>: mulxq %rcx, %rcx, %rdx
bench[0x1014735] <+37>: movabsq $-0x5f89e29b87429bd9, %rsi ; imm = 0xA0761D6478BD6427
bench[0x101473f] <+47>: xorq %rcx, %rsi
bench[0x1014742] <+50>: xorq %rax, %rdx
bench[0x1014745] <+53>: mulxq %rsi, %rcx, %rax
bench[0x101474a] <+58>: xorq %rcx, %rax
bench[0x101474d] <+61>: retq
So while we're waiting for one cache lookup we can start hashing the next key (if we can predict what it is) and maybe even get started on the next cache lookup.
The btree at 2^16 keys has 4 levels. There are typically 15-16 keys per node when inserting random keys, so we'll expect to do around 32 comparisons on average before finding our key. The body of that search loop looks like:
bench[0x10121b0] <+16>: cmpq %rdx, (%rdi,%rax,8)
bench[0x10121b4] <+20>: jae 0x10121c2 ; <+34> at bptree.zig
bench[0x10121b6] <+22>: addq $0x1, %rax
bench[0x10121ba] <+26>: cmpq %rax, %rsi
bench[0x10121bd] <+29>: jne 0x10121b0 ; <+16> [inlined] bench.less_than_u64 + 9 at bench.zig:159:5
So 5 instructions, including two branches, per comparison. At least 160 instructions and 64 branches for the whole lookup.
The 16 keys take up 2 cachelines, so we'll average 1.5 cache lookups for each linear key search, plus 1 cache lookup to hit the child/values array - with 4 levels, that's around 10 cache lookups in total for the whole btree lookup. Each of those cache lookups depends on the previous lookup so it'll be hard to speculate correctly, but prefetching might help within a single node. If every cache lookup hit L2 in strict series we'd expect ~200 cycles, but probably some of the earlier nodes are in L1.
Anyway, let's leave that rabbit hole alone for now. It's enough to notice that hashmaps benefit from speculative execution between multiple lookups and btrees don't.
We can roughly think of lookup_hit_batch as measuring throughput and lookup_hit_chain as measuring latency. All the other benchmarks I've seen only measure one or the other, which starts to explain the disagreement over btree vs hashmap performance.
string keys
Btrees do a lot of key comparisons. If the keys are integers then they are stored in the btree node, so it doesn't matter too much. But for string keys we potentially pay for a cache lookup for every comparison.
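To make that concrete, the comparison the btree runs on every probe is roughly the following (a sketch - less_than_str is an assumed name, not the exact function from the benchmark), and each call chases the key's pointer to wherever the string bytes live:
const std = @import("std");
fn less_than_str(a: []const u8, b: []const u8) bool {
    // Dereferences the string bytes - potentially a cache miss per comparison.
    return std.mem.order(u8, a, b) == .lt;
}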
log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zig b+tree | 101 | 124 | 144 | 160 | 184 | 219 | 255 | 279 | 310 | 366 | 427 | 470 | 512 | 577 | 700 | 826 | 965 |
zig hashmap siphash | 145 | 147 | 154 | 159 | 162 | 163 | 165 | 167 | 168 | 173 | 181 | 186 | 196 | 211 | 247 | 287 | 312 |
zig hashmap wyhash | 49 | 50 | 59 | 65 | 68 | 69 | 72 | 74 | 75 | 81 | 88 | 93 | 103 | 119 | 154 | 188 | 219 |
Completely random strings are actually pretty unlikely and unfairly benefit the btree, which typically only has to compare the first character of each string to find that they are not equal. But real datasets (eg urls in a log) often have large common prefixes. We can see the difference if we make the first 16 characters constant:
log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zig b+tree | 102 | 150 | 221 | 338 | 546 | 618 | 807 | 911 | 1077 | 1393 | 1507 | 1641 | 1812 | 2095 | 2628 | 2848 | 3302 |
zig hashmap siphash | 145 | 147 | 153 | 159 | 162 | 163 | 165 | 167 | 168 | 174 | 182 | 187 | 197 | 210 | 245 | 282 | 313 |
zig hashmap wyhash | 49 | 50 | 59 | 66 | 69 | 69 | 72 | 74 | 74 | 81 | 88 | 93 | 102 | 117 | 154 | 188 | 214 |
Hashmaps don't care, but the btree is really hurting.
If we're just supporting strings we can optimize for this case by storing the length of the common prefix within each node. But it's hard to imagine how to generalize that to arbitrary types. What if the key is a tuple of strings? Or a small set?
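For strings alone, a minimal sketch of that trick, assuming each node stores the length of the prefix shared by all of its keys and the search key has already been matched against that stored prefix (less_than_with_prefix is a hypothetical helper, not from my implementation):
fn less_than_with_prefix(common_prefix_len: usize, key_in_node: []const u8, search_key: []const u8) bool {
    // Both keys are known to share their first common_prefix_len bytes,
    // so the comparison can skip straight past them.
    return std.mem.order(u8, key_in_node[common_prefix_len..], search_key[common_prefix_len..]) == .lt;
}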
wasm hashes
Let's try the same in wasm, where hash functions have less access to fast vector instructions. Wasm also doesn't have rdtscp so times here are in nanoseconds rather than cycles.
log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zig b+tree | 5 | 14 | 19 | 22 | 27 | 38 | 45 | 49 | 53 | 59 | 70 | 75 | 80 | 87 | 98 | 111 | 120 |
zig hashmap siphash | 33 | 34 | 37 | 38 | 40 | 40 | 41 | 42 | 41 | 41 | 42 | 43 | 44 | 45 | 46 | 52 | 58 |
zig hashmap wyhash | 29 | 30 | 33 | 35 | 36 | 36 | 37 | 38 | 38 | 37 | 38 | 40 | 40 | 42 | 43 | 47 | 51 |
log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zig b+tree | 64 | 77 | 88 | 97 | 117 | 141 | 154 | 167 | 192 | 216 | 269 | 287 | 294 | 308 | 369 | 436 | 500 |
zig hashmap siphash | 65 | 64 | 68 | 70 | 73 | 71 | 72 | 73 | 73 | 75 | 78 | 81 | 84 | 88 | 102 | 118 | 135 |
zig hashmap wyhash | 45 | 45 | 48 | 50 | 52 | 52 | 53 | 54 | 54 | 56 | 59 | 62 | 65 | 69 | 79 | 93 | 105 |
The overall ratios are fairly similar, although wyhash seems to have been penalized a little relative to siphash.
Neither the hash functions nor the table scan generates vector instructions at all, even when comparing strings. And, uh, now that I think to look, they don't generate vector instructions on x86 either. So I guess that's a non-issue.
btree tuning
I implemented both btrees and b+trees. I didn't see much difference between them for insert/lookup, so I preferred the b+tree for the easier/faster implementation of scans and range queries.
The rust btreemap fixes the max key count per node to 11. For all the workloads I've tried the sweet spot seems to be to fix the node size to 512 bytes, which is 31 keys for u64 and 20 keys for strings.
Allowing leaves and branches to have different sizes didn't help.
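The key counts just fall out of the byte budget. A sketch of the comptime arithmetic, assuming a node layout of a key_count plus parallel keys/values arrays (the real node has a bit more overhead, so the exact counts differ slightly by key/value type):
fn maxKeysPerNode(comptime Key: type, comptime Value: type, comptime node_bytes: usize) usize {
    const overhead = @sizeOf(u16); // key_count
    return (node_bytes - overhead) / (@sizeOf(Key) + @sizeOf(Value));
}

// eg maxKeysPerNode(u64, u64, 512) == 31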
I gained a small speedup by manually laying out the node like key_count, keys, values/children. Zig by default prefers to put key_count at the end of the struct to avoid padding, but we always read the key_count first so it's nice to get some keys on the same cacheline. Maintaining this optimization across different architectures was annoying though, so I rolled it back and it's not reflected in the tests above.
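For reference, the layout I mean looks something like this for u64 keys (a sketch - the field names and the extern struct are just one way to pin the order, not necessarily how my node is written):
const max_keys = 31;

const Leaf = extern struct {
    key_count: u8, // read first; shares a cacheline with the first few keys
    keys: [max_keys]u64,
    values: [max_keys]u64,
};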
The rust btreemap switches on Ordering. I got a small boost from instead using less_than during the search and calling equal afterwards. For a lookup at 2^16 keys we'll expect to call less_than 32 times and equal once, so it's worth paying for the extra call to equal in exchange for tightening up the inner search loop.
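In sketch form, reusing the hypothetical Leaf above and a lower-bound search that only calls less_than (like the searchBinary below):
fn leafGet(leaf: *Leaf, search_key: u64) ?u64 {
    // The search loop never tests for equality...
    const i = searchBinary(leaf.keys[0..leaf.key_count], search_key);
    // ...so we pay for exactly one equality check at the end.
    if (i < leaf.key_count and leaf.keys[i] == search_key) return leaf.values[i];
    return null;
}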
I tried various binary searches. The best was this 'branchless' variant:
fn searchBinary(keys: []Key, search_key: Key) usize {
if (keys.len == 0) return 0;
var offset: usize = 0;
var length: usize = keys.len;
while (length > 1) {
const half = length / 2;
const mid = offset + half;
if (less_than(keys[mid], search_key)) {
@branchHint(.unpredictable);
offset = mid;
}
length -= half;
}
offset += @intFromBool(less_than(keys[offset], search_key));
return offset;
}
while (length > 1) requires a branch but it's easily predictable. branchHint(.unpredictable) causes llvm to emit a conditional move for offset = mid.
bench[0x101f5e0] <+0>: movq %rsi, %rax
bench[0x101f5e3] <+3>: testq %rsi, %rsi
bench[0x101f5e6] <+6>: je 0x101f616 ; <+54> at bptree.zig
bench[0x101f5e8] <+8>: xorl %ecx, %ecx
bench[0x101f5ea] <+10>: cmpq $0x1, %rax
bench[0x101f5ee] <+14>: je 0x101f60b ; <+43> [inlined] bench.less_than_u64 at bench.zig:185:5
bench[0x101f5f0] <+16>: movq %rax, %rsi
bench[0x101f5f3] <+19>: shrq %rsi
bench[0x101f5f6] <+22>: leaq (%rcx,%rsi), %r8
bench[0x101f5fa] <+26>: cmpq %rdx, (%rdi,%r8,8)
bench[0x101f5fe] <+30>: cmovbq %r8, %rcx
bench[0x101f602] <+34>: subq %rsi, %rax
bench[0x101f605] <+37>: cmpq $0x1, %rax
bench[0x101f609] <+41>: ja 0x101f5f0 ; <+16> at bptree.zig:334:37
bench[0x101f60b] <+43>: cmpq %rdx, (%rdi,%rcx,8)
bench[0x101f60f] <+47>: adcq $0x0, %rcx
bench[0x101f613] <+51>: movq %rcx, %rax
bench[0x101f616] <+54>: retq
Linear search was still faster in most benchmarks though, even with ludicrously large node sizes.
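For comparison, the linear search is just a lower-bound loop along these lines (a sketch - roughly the loop whose assembly appeared earlier):
fn searchLinear(keys: []Key, search_key: Key) usize {
    for (keys, 0..) |key, i| {
        // Stop at the first key that is not less than the search key.
        if (!less_than(key, search_key)) return i;
    }
    return keys.len;
}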
I also tried a btree variant I saw in some paper where leaves are left unsorted and keys are always inserted at the end of the leaf. This saves some memcopying in the common case, but having to sort the leaf before splitting negates the benefit.
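The appeal is that a leaf insert becomes a plain append - a sketch, reusing the hypothetical Leaf above and ignoring the full-leaf case:
fn insertUnsorted(leaf: *Leaf, key: u64, value: u64) void {
    // No memcopy needed to keep keys sorted - just append at the end...
    leaf.keys[leaf.key_count] = key;
    leaf.values[leaf.key_count] = value;
    leaf.key_count += 1;
    // ...but the leaf still has to be sorted before it can split,
    // which is where the saving disappeared.
}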
outcome
All the benchmarks above are basically best case scenarios, where we're doing a lot of lookups in a row. If we were just doing one lookup in the middle of some other work then the btree might not be in cache at all, and each of those 2.5 cache lookups per level goes all the way to main memory. That's catastrophic. Whereas an open-addressed hashmap will typically only hit 1 or 2 cachelines per lookup, regardless of size.
And while btrees avoid many of the performance edge cases of hashmaps, they also have some cliffs of their own to fall off as comparisons get more expensive and touch more memory, as we saw with the random-ish strings above.
I haven't measured space usage yet, but we can expect it to be worse for btrees. For random keys the typical node occupancy is 50%, minus per-node overhead like key_count, whereas I've been running the zig hashmaps at 80%. So we can guesstimate the btrees will use >60% more memory (0.8/0.5 = 1.6x). EDIT Duh, hashmaps have low occupancy after doubling, so their average occupancy is actually more like 57%. Whereas the btree with steady churn over time will approach 67%.
Space usage is really bad for small maps too. I'd need to add some extra tweaks to allow the btree root node to start small and grow, rather than paying the full 512 bytes for a map with only 1 key.
Overall, I'm unconvinced that it's worth exploring btrees further. I'll stick to hashmaps and I'll either iterate in insertion order or I'll require sorting entries before iterating. But if anyone else likes the look of this rabbit hole I left some other ideas untouched:
- Consider the hash function part of the stable api for the compiler. Use an open-addressed hashtable that preserves hash order. Solve hash flooding without seeding the hash eg by falling back to a tree.
- Hash-array mapped tries have similar O(log(n)) cacheline behaviour to btrees. But if we used open-addressed hashtables at the leaves we could keep the nesting depth pretty low.
- In-memory LSMs with incremental merging work pretty well in differential dataflow, but still have this O(log(n)) cacheline behaviour. But maybe with a hashmap as secondary index lookups might be reasonable, and we can consider slow inserts the price to pay for not rehashing.
miscellanea
All the experiments here are using:
- rustc 1.77.2
- zig 0.14.0-dev.1694+3b465ebec
- wasmtime-cli 20.0.2
Running on an intel i7-1260P with:
- efficiency cores disabled
- scaling governor set to performance
- turbo mode disabled
- aslr disabled
Usually I would also use cset to pin experiments to a shielded core, but it seems that recent systemd versions have broken that workflow and I haven't figured out a replacement yet.
The rest of the benchmark setup can be found in jamii/maps. I measured more than just lookups, but btrees have similar performance for every operation due to having to find a key first.
Cache latency numbers came from mlc and roughly match claims I found online.
Thanks to Dan Luu for help thinking through cache behaviour.