0048: zest progress, zest ordering, wasm alignment, umbra papers, future of fast code, new internet, books, other stuff

Published 2024-08-31

zest progress

I've started working on the runtime. Many of the features of zest are going to be implemented by the runtime rather than by the compiler, but the runtime is itself written in zest. I'm slowly unpicking the dependency graph of features to make that work, so the last month saw a lot of tiny changes:

It's funny that printing manages to exercise so many language features. Ideally it will look something like:

print = (thing) {
  thing/repr-of/match(
    (u32) %print(thing),
    (i64) %print(thing),
    (string) {
      %print('\'')
      print-escaped(thing/bytes)
      %print('\'')
    },
    (struct[.._]) print-object(thing),
    (union[.._]) print-object(thing),
  )
}

print-escaped = (thing) {
  lo mut = 0
  hi mut = 0
  while true {
    if {hi >= thing/count} {
      %print(thing/slice(lo, hi))
      break
    } 
    thing/get(hi)/escaped/match(
      ([some: escaped]) {
        %print(thing/slice(lo, hi))
        %print(escaped(byte))
        hi@ += 1
        lo@ = hi
      },
      ([:none]) {
        hi@ += 1
      }
    )
  }
}

print-object = (thing) {
  %print('[')
  need-comma mut = false
  thing/each((k,v) {
    if need-comma %print(', ') else {}
    need-comma@ = true
    print(k)
    %print(': ')
    print(v)
  })
  %print(']')
}

But I don't yet have:

I started working on each but it tipped my annoyance with my ir over the fixit threshold.

All the compiler passes want data in the ir to be on the branch begin tags, so that they can make good decisions about how to handle children. But the interpreter wants data on the branch end tags so it can operate as a pure stack machine.

Type inference is also currently written as a stack machine to avoid the possibility of stack overflows from deep call trees. But this is increasingly difficult to work with, eg if I want to do bidirectional type inference then I need two more stacks for passing types downwards and sideways. I originally wrote the codegen in this style, but after rewriting it in recursive style the code was much more readable and flexible, and I immediately spotted several bugs.

So I have two changes planned:

  1. Store all ir in post-order (easier to generate and interpret), but convert to pre-order on the fly before the next pass (easier to process recursively). The conversion is extra work that I didn't need to do before, but it removes the need for the indirect trick that I was using in some passes and which has a similar cost. (There's a rough sketch of this conversion after the list.)
  2. Require top-level functions to have type annotations. This means that type-inference doesn't have to descend recursively from caller to callee, but can instead just use the callee's type annotation when inferring the caller's body, and queue the callee's body for type checking later. Then the size of the inference call stack is bounded by the nesting depth of the source code rather than by the depth of the call-graph, leaving me free to use recursive functions in inference instead of manually converting to a stack machine.
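
To make change 1 concrete, here's a rough sketch of the conversion in rust rather than zest (the Node type and the subtree-size bookkeeping are invented for illustration, not how the zest ir actually looks):

// Hypothetical flat ir node: a tag plus the size of its subtree
// (including itself), which is cheap to track while emitting post-order.
#[derive(Clone, Copy, Debug)]
struct Node {
    tag: &'static str,
    subtree: usize,
}

// Emit the subtree whose root sits at index `end` of the post-order
// buffer, in pre-order (parent first, then children left to right).
fn emit(post: &[Node], end: usize, pre: &mut Vec<Node>) {
    let node = post[end];
    pre.push(node);
    // In post-order the rightmost child ends just before its parent, so
    // collect child roots right-to-left and reverse to get source order.
    let first = end + 1 - node.subtree;
    let mut child_roots = Vec::new();
    let mut cursor = end;
    while cursor > first {
        let child_root = cursor - 1;
        child_roots.push(child_root);
        cursor -= post[child_root].subtree;
    }
    child_roots.reverse();
    for child_root in child_roots {
        emit(post, child_root, pre);
    }
}

fn post_to_pre(post: &[Node]) -> Vec<Node> {
    let mut pre = Vec::with_capacity(post.len());
    if !post.is_empty() {
        // The root of a post-order buffer is its last node.
        emit(post, post.len() - 1, &mut pre);
    }
    pre
}

fn main() {
    // add(1, mul(2, 3)) in post-order: leaves first, root last.
    let post = [
        Node { tag: "1", subtree: 1 },
        Node { tag: "2", subtree: 1 },
        Node { tag: "3", subtree: 1 },
        Node { tag: "mul", subtree: 3 },
        Node { tag: "add", subtree: 5 },
    ];
    // Prints add, 1, mul, 2, 3.
    for node in post_to_pre(&post) {
        println!("{}", node.tag);
    }
}

The recursion here is bounded by the nesting depth of the ir, which is the same property change 2 is trying to get for type inference.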

I think that will all fit into september, and probably still leave time to keep chipping away at the feature list for the print function.

zest ordering

Zest has a split between data notation (numbers, strings, objects) and representation (u32, i64, utf8-string, struct, union, list, map). Objects are collections of key-value pairs. Structs, unions, lists, and maps are specialized representations of those objects. Eg lists can only represent objects where the keys are consecutive integers starting at 0.

One big design question: are objects ordered collections of key-value pairs, or unordered collections of key-value pairs?

If order doesn't matter then we can choose the internal representation for structs that packs most efficiently. But the programmer can't control what order struct fields print in (because types are structural we can't store the field order in the type either, otherwise order would affect equality again).

If order doesn't matter then f([a: 0, b: 1]) and f([b: 1, a: 0]) generate the same specialization. If order does matter then those need to be two different specializations. But how often are struct types likely to collide like that?

If order does matter then iterating over a struct or map has clearly defined semantics (we can use compact hashmaps to preserve insertion order). If order doesn't matter then we have to choose an order for iteration. Choosing hash order makes programs non-deterministic. Choosing sorted order imposes a big performance cost (we have to sort hashmaps, either on insert or on iteration).
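
As an aside on that parenthetical: here's a minimal sketch of why insertion order comes cheap with that layout (in rust, with a std HashMap standing in for the open-addressed integer index that a real compact hashmap would use):

use std::collections::HashMap;
use std::hash::Hash;

// Entries live in a dense Vec in insertion order; the hash index only
// maps keys to slots in that Vec.
struct OrderedMap<K, V> {
    entries: Vec<(K, V)>,
    index: HashMap<K, usize>,
}

impl<K: Eq + Hash + Clone, V> OrderedMap<K, V> {
    fn new() -> Self {
        OrderedMap { entries: Vec::new(), index: HashMap::new() }
    }

    fn insert(&mut self, key: K, value: V) {
        match self.index.get(&key).copied() {
            Some(slot) => self.entries[slot].1 = value,
            None => {
                self.index.insert(key.clone(), self.entries.len());
                self.entries.push((key, value));
            }
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.index.get(key).map(|&slot| &self.entries[slot].1)
    }

    // Iteration is just a walk over the dense Vec, so it is deterministic
    // and follows insertion order rather than hash order.
    fn iter(&self) -> std::slice::Iter<'_, (K, V)> {
        self.entries.iter()
    }
}

fn main() {
    let mut m = OrderedMap::new();
    m.insert("zero", 0);
    m.insert("one", 1);
    for (k, v) in m.iter() {
        println!("{k}: {v}");
    }
    assert_eq!(m.get(&"one"), Some(&1));
}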

If order does matter then does it affect equality too? If it doesn't, then we have observable differences between equal values, which is subtly weird. But if it does then set[string]['zero', 'one'] != set[string]['one', 'zero']. That would also mean that set[set[string]][['zero', 'one'], ['one', 'zero']] has two elements, which is almost certainly going to be surprising. I don't plan to make the equality function used by maps/sets/etc parameterizable, so any code that wants non-ordered equality would have to sort collections before using them as keys.

Similarly, does ordering affect type conversions or pattern matching? Does struct[a: i64, b: i64][b: 0, a: 1] work? What about [:a, :b] = [b: 0, a: 1]? Ergonomically, probably both conversions and patterns should ignore order, but that does introduce a subtle asymmetry to patterns vs constructors. It also means that in thing/match(([:a, :b]) true, ([:b, :a]) false) the second branch is unreachable, even though [a: 0, b: 1] != [b: 1, a: 0].

The decision tree looks something like:

wasm alignment

I found the answer to my alignment questions. The spec says that the alignment given on load/store instructions is a hint and the wasm backend is not allowed to trap if the hint is wrong. But if the backend can't trust the hint, does it have to just always emit fully unaligned loads/stores?

It turns out that the intent is that the backend does emit aligned loads/stores, but also installs a signal handler to handle incorrect hints by emulating the misaligned load/store in software. The idea is that a bad hint still results in the same output on every platform, instead of working on x86 and faulting on arm. But it will potentially be many orders of magnitude slower.
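
To make 'emulating the misaligned load/store in software' concrete, here's roughly what the slow path boils down to (a rust sketch over a plain byte slice standing in for linear memory, not anything from the spec):

// The fast path is just an ordinary aligned load. This is what the signal
// handler falls back to when the alignment hint turns out to be wrong:
// reassemble the value byte by byte so the access can't fault.
fn emulate_unaligned_load_u32(memory: &[u8], addr: usize) -> u32 {
    let bytes: [u8; 4] = memory[addr..addr + 4].try_into().unwrap();
    // wasm linear memory is little-endian.
    u32::from_le_bytes(bytes)
}

fn main() {
    let memory = [0u8, 1, 2, 3, 4, 5, 6, 7];
    // addr = 1 is misaligned for a 4-byte access, but the emulated load
    // still produces the same result on every platform.
    assert_eq!(emulate_unaligned_load_u32(&memory, 1), u32::from_le_bytes([1, 2, 3, 4]));
}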

Also wasm's atomic loads/stores don't have the same behaviour - they trap on misaligned addresses.

So I do actually need to care about alignment in zest. Humbug.

compile-time analysis of compiler frameworks for query compilation

More details on the umbra query compiler and comparisons to other backends.

Their own backend has very impressive compile-time vs runtime tradeoffs. For tpc-h sf=10 it's always faster than any of the other backends, and for tpc-h sf=100 it's still sometimes the best choice.

Cranelift just barely dominates llvm debug builds. I'm surprised that the compile times are so similar.

They see ~20% runtime improvement between llvm debug and release, whereas for zig/rust I often see 2-10x difference. That makes me suspect that the ir they are emitting is already pretty good, with few abstractions to boil off, in which case adapting their compilation strategy for zig/rust might not see such impressive results. Or alternatively maybe their queries are mostly memory-bound, so suboptimal codegen gets hidden in the stalls?

robust join processing with diamond hardened joins

The umbra folks have previously noted that their best implementation of worst-case optimal joins is still much slower than a regular binary join in the average case, and it's also hard to give them a cost function during planning.

Even Yannakakis' algorithm (which is only worst-case optimal for acyclic queries) can vary in performance depending on the join order chosen, which implies that cost-based optimization is still important.

If we go back to classic binary join planning, it's hard to decide whether it's better to push a join down the tree (good if it produces less output than input) or hoist it up the tree (good if it produces more output than input). So they split the probe side of a join into two suboperators: lookup returns the matching bucket in the hashtable (like the way a wcoj will find a range of matching attributes) and expand returns all the rows in that bucket. When lookup and expand are next to each other this behaves like a regular join, but the optimizer is free to push lookups down and hoist the expands up, because lookups never produce more output than input and expands never produce less output than input.
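
Here's a toy rust sketch of that split, just to show why the two halves get opposite cardinality guarantees (the names and types are made up, and it ignores everything else about real suboperator plans):

use std::collections::HashMap;

type Key = u64;
type Row = Vec<i64>;

// Build side of a hash join, grouped by join key.
struct JoinIndex {
    buckets: HashMap<Key, Vec<Row>>,
}

impl JoinIndex {
    // lookup: maps a probe key to at most one bucket handle, so it never
    // produces more output than input (cheap to push down the plan).
    fn lookup(&self, key: Key) -> Option<&[Row]> {
        self.buckets.get(&key).map(|bucket| bucket.as_slice())
    }
}

// expand: flattens a bucket into all of its matching build rows, so it
// never produces less output than input (safe to hoist up the plan).
fn expand(bucket: &[Row]) -> std::slice::Iter<'_, Row> {
    bucket.iter()
}

// Keeping lookup and expand adjacent gives back an ordinary hash join;
// the optimizer is what gets to pull them apart.
fn hash_join<'a>(index: &'a JoinIndex, probe: &'a [(Key, Row)]) -> Vec<(&'a Row, &'a Row)> {
    let mut out = Vec::new();
    for (key, left) in probe {
        if let Some(bucket) = index.lookup(*key) {
            for right in expand(bucket) {
                out.push((left, right));
            }
        }
    }
    out
}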

This is already enough flexibility for the query optimizer to produce a Yannakakis-like plan, while still being subject to the full flexibility of cost-based optimization.

Adding a ternary expand operator allows them to do wcoj on three relations at a time, which they prove is sufficient to be worst-case optimal on any cyclic queries with only binary edges (ie most graph queries).

They additionally cut down duplicate values by eagerly aggregating on the build side of joins.

For outer joins, they make lookup produce the nulls, allowing the expand to be pushed up.

To avoid blowing up the number of possible plans, they limit the optimizer to choosing the lookup tree and then place each expand at the highest branch possible, taking note of equivalence classes of attributes to avoid early expands, and introducing expand3 whenever a lookup closes a cycle in the query graph.

On the build side, they build the hash-tables in order of ascending (predicted) size and then use bloom filters to filter later tables, avoiding building hash tables for rows that can never be emitted anyway.
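
Here's a toy sketch of that build-side trick, under the unrealistic simplification that every side joins on the same single key column, and with a deliberately crude single-hash 'bloom filter':

// A real bloom filter packs bits and uses several hash functions; this is
// just enough structure to show where the filtering happens.
struct Bloom {
    bits: Vec<bool>,
}

impl Bloom {
    fn new(size: usize) -> Self {
        Bloom { bits: vec![false; size] }
    }
    fn insert(&mut self, key: u64) {
        let i = (key as usize) % self.bits.len();
        self.bits[i] = true;
    }
    fn may_contain(&self, key: u64) -> bool {
        self.bits[(key as usize) % self.bits.len()]
    }
}

// Build smallest-first; rows whose key already fails an earlier (smaller)
// side's filter can never produce a join result, so they are dropped
// before any hash table is built for them.
fn build_sides(mut sides: Vec<Vec<u64>>) -> Vec<Vec<u64>> {
    sides.sort_by_key(|rows| rows.len()); // the paper uses predicted size
    let mut filters: Vec<Bloom> = Vec::new();
    let mut built = Vec::new();
    for rows in sides {
        let kept: Vec<u64> = rows
            .into_iter()
            .filter(|key| filters.iter().all(|f| f.may_contain(*key)))
            .collect();
        let mut filter = Bloom::new(1 << 16);
        for &key in &kept {
            filter.insert(key);
        }
        filters.push(filter);
        built.push(kept); // stands in for building the actual hash table
    }
    built
}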

Weirdly the evaluation also covers their previous paper on hashtables, and most of the benefits come from that change. So, uh, maybe I'll read that paper next month. The lookup/expand separation reins in some of the gnarlier cyclic queries, while the eager aggregation mostly benefits acyclic queries.

the future of fast code

For a given amount of silicon we can make one fast-ish out-of-order core or a bazillion dumb cores. Most consumer computers contain examples of both, but we've barely invested any effort in programming models for the latter. For suitable problems the difference in performance can be many orders of magnitude.

When writing eg c the programmer is responsible for safety and the compiler takes a lot of responsibility for optimization. They describe several research projects where they flip those tradeoffs, creating dsls where the programmer can specify optimizations to apply and the compiler checks that those optimizations don't change the result of the program.

the new internet

Even a mid-range Macbook can do 10x or 100x more transactions per second on its SSD than a supposedly fast cloud local disk, because cloud providers sell that disk to 10 or 100 people at once while charging you full price. Why would you pay exorbitant fees instead of hosting your mission-critical website on your super fast Macbook?

You pay exorbitant rents to cloud providers for their computing power because your own computer isn't in the right place to be a decent server. It's behind a firewall and a NAT and a dynamic IP address and probably an asymmetric network link that drops out just often enough to make you nervous.

They wanted to make software development in general simpler, and decided that the root problem was centralization caused by us accidentally making it impossible to do p2p connections over the internet. So they made tailscale which handles nat traversal and peer certificates etc.

I don't think aws would cease to exist if p2p connections were easy. But this does seem very applicable to barefoot developers and situated software. Anything that is serving a small and specific community could easily be running on a mac mini in a cupboard, if it were possible to connect to it.

books

The age of insecurity. I found it agreeable, but it's much more of a rallying cry than a detailed argument. A few interesting points though:

Peak mind. Could have been a blog post.

The hard truth. Literally just a series of blog posts, and often contradicts itself to boot.

Space below my feet. Mixed feelings. I loved the adventure stories at the start, and I'm blown away by the kinds of climbs they pulled off with just a hemp rope on hip belay. But the latter two thirds of the book is mostly about how miserable mountaineering is. It seems to be all getting lost, benighted, frostbitten, and barely surviving sketchy glissades. I'm sure Gwen loved something about it, but it doesn't come through at all.

other stuff

Wasm has a caniuse-style page. For more detail on upcoming features, I discovered that subscribing to the meetings repo is pretty useful. Eg I learned that stack-switching is headed for stage 2 now but not likely to hit stage 3 until some time next year. Which means that it likely won't be widely supported for at least another two years, in the best case.

Spidermonkey has a neat bytecode encoding. Rather than making each operand variable-width, they switch between 1 byte operands and 4 byte operands per op. Whenever an op uses 4 byte operands it is preceded by a one-byte op_wide tag. This occasionally costs more memory but it makes decoding much simpler - for each op, they generate one handler for narrow operands and one handler for wide operands. Much less branchy.
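
Here's a rough sketch of that decoding scheme in rust (the opcodes and their encodings are invented, not spidermonkey's actual bytecode):

// A one-byte OP_WIDE prefix lets every op have exactly two fixed-width
// decoders instead of per-operand variable-length decoding.
const OP_WIDE: u8 = 0xff;
const OP_PUSH: u8 = 0x01; // hypothetical op with a single operand

fn decode(bytecode: &[u8]) {
    let mut pc = 0;
    while pc < bytecode.len() {
        let mut wide = false;
        let mut op = bytecode[pc];
        pc += 1;
        if op == OP_WIDE {
            wide = true;
            op = bytecode[pc];
            pc += 1;
        }
        match op {
            OP_PUSH => {
                // One narrow handler and one wide handler per op.
                let operand = if wide {
                    let bytes = [bytecode[pc], bytecode[pc + 1], bytecode[pc + 2], bytecode[pc + 3]];
                    pc += 4;
                    u32::from_le_bytes(bytes)
                } else {
                    let byte = bytecode[pc];
                    pc += 1;
                    u32::from(byte)
                };
                println!("push {operand}");
            }
            _ => panic!("unknown op {op}"),
        }
    }
}

fn main() {
    // push 7 (narrow operand), then push 0x12345 (needs the wide form).
    let code = [OP_PUSH, 7, OP_WIDE, OP_PUSH, 0x45, 0x23, 0x01, 0x00];
    decode(&code);
}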

Zig gained a fuzzer. I'm not sold on the benefits of writing a fuzzer from scratch, as opposed to providing the same coverage info that llvm provides for libfuzzer/honggfuzz. But the surrounding tooling is nice.

Gibbon is a language that operates on mostly-serialized data. It has a neat trick for dealing with code that is hard to serialize - it emits an indirection tag (similar to what I'm using in zest ir) when necessary, but when the garbage collector copies values it undoes the indirection and produces fully-serialized data.

I finally got around to watching this classic talk on composable allocators. It didn't end up being useful for writing the zest allocator, but I still enjoyed it.

Google added pipe syntax to their sql dialect. This is definitely a big improvement. They also claim that it removes the need for subqueries, but I disagree. Queries like 'newest 10 employees per department' are still awkward in sql-plus-pipes, whereas in prql, which has sane subqueries, they're straightforward:

from employees
group department (
  sort join_date
  take 10
)