0046: zest syntax, zest progress, sponsors-only repos, future compilers, error-handling implementations, suboperators, why we drive

Minimal writing this month:

Zest syntax

But maximum coding!

zest progress

Last month I mentioned that I had a good chunk of the language working in a single-pass compiler in destination-passing style, but it was becoming increasingly gnarly as I added more language features. Adding any kind of optimization on top of that seemed gnarly.

So I spent a bunch of time studying other compilers to try to come up with reasonable IR and division of passes. I ended up with SSA, structured control flow (like zig and mlir), and alloca instructions (like pretty much everything except cranelift). It worked, it was understandable, but the codegen was terrible.

I compared to other compilers output in debug builds (ie without optimizations) and their codegen is also terrible. Eg here's some simple rust:

pub fn add(left: usize, right: usize) -> usize {
    left.wrapping_add(right)
}

pub fn flip(p: *mut [[u32; 2]; 2]) {
    let p = unsafe{&mut *p};
    *p = [
        [p[0][0], p[1][0]],
        [p[0][1], p[1][1]],
    ];
}

And the wasm from a debug build:

(func $add (type 0) (param i32 i32) (result i32)
  (local i32 i32 i32 i32)
  (local.set 2
    (global.get $__stack_pointer))
  (local.set 3
    (i32.const 16))
  (local.set 4
    (i32.sub
      (local.get 2)
      (local.get 3)))
  (i32.store
    (local.get 4)
    (local.get 0))
  (i32.store offset=4
    (local.get 4)
    (local.get 1))
  (i32.store offset=8
    (local.get 4)
    (local.get 0))
  (i32.store offset=12
    (local.get 4)
    (local.get 1))
  (local.set 5
    (i32.add
      (local.get 0)
      (local.get 1)))
  (return
    (local.get 5)))

(func $flip (type 1) (param i32)
  (local i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
  (local.set 1
    (global.get $__stack_pointer))
  (local.set 2
    (i32.const 32))
  (local.set 3
    (i32.sub
      (local.get 1)
      (local.get 2)))
  (i32.store offset=24
    (local.get 3)
    (local.get 0))
  (i32.store offset=28
    (local.get 3)
    (local.get 0))
  (local.set 4
    (i32.load
      (local.get 0)))
  (local.set 5
    (i32.load offset=8
      (local.get 0)))
  (i32.store offset=8
    (local.get 3)
    (local.get 4))
  (i32.store offset=12
    (local.get 3)
    (local.get 5))
  (local.set 6
    (i32.load offset=4
      (local.get 0)))
  (local.set 7
    (i32.const 12))
  (local.set 8
    (i32.add
      (local.get 0)
      (local.get 7)))
  (local.set 9
    (i32.load
      (local.get 8)))
  (i32.store offset=16
    (local.get 3)
    (local.get 6))
  (i32.store offset=20
    (local.get 3)
    (local.get 9))
  (local.set 10
    (i32.load offset=8
      (local.get 3)))
  (local.set 11
    (i32.load offset=12
      (local.get 3)))
  (i32.store offset=4
    (local.get 0)
    (local.get 11))
  (i32.store
    (local.get 0)
    (local.get 10))
  (local.set 12
    (i32.load offset=16
      (local.get 3)))
  (local.set 13
    (i32.load offset=20
      (local.get 3)))
  (i32.store
    (local.get 8)
    (local.get 13))
  (i32.store offset=8
    (local.get 0)
    (local.get 12))
  (return))

Here is a release build for comparison:

(func $add (type 0) (param i32 i32) (result i32)
  (i32.add
    (local.get 1)
    (local.get 0)))

(func $flip (type 1) (param i32)
  (i64.store offset=4 align=4
    (local.get 0)
    (i64.rotl
      (i64.load offset=4 align=4
        (local.get 0))
      (i64.const 32))))

There are two main issues with the debug builds:

Shuffling values between local variables instead of making good use of the stack. Wasm backends can optimize this away in higher tiers, but it will still hurt performance during validation and in the interpreter tier.
Pointless shadow stack traffic. As far as the wasm backend is concerned this is all observable side effects, so it's not allowed to optimize this away even if it could.

These seem intrinsic to the style of the compilers I've been copying. The typical architecture is to have the frontend emit IR that is obviously correct and then rely on the backend to perform miracles (which LLVM does). This means that it's ok if all your internal representations add lots of local and stack traffic because the optimizer will be able to remove that traffic.

By contrast, my original one-pass destination-passing compiler was able to produce pretty reasonable code without any optimization passes. So this month I've been trying to keep the advantages of destination-passing while still splitting the compiler into multiple passes to maintain my sanity. Currently I have:

tokenize
parse
- LL(1) but doesn't validate patterns/destructuring
desugar
- desugar patterns
- insert staging annotations
- resolve names
- replace closures with an explicit struct
infer
- infer/check types
- specialize functions
- evaluate staged expressions
- drop unused functions
lower (WIP)
- assign slots on the shadow stack (using destination-passing to reduce redundant stack traffic)
- replace struct values with load/store instructions from the shadow stack
- propagate constants and remove dead branches
- replace constant values with copy instructions from the data section
generate (WIP)

The interpreter that evaluates staged expressions can also run entire programs under the dynamic semantics, which gives me a nice target to fuzz the compiler against.

After the desugar pass the IRs look a lot like wasm - effectively an expression tree but encoded as a stack machine. This makes the IR much denser (or will once I get around to packing it properly) and makes the interpreter much easier to write. I'm also able to avoid recursion in all the passes that operate on it (although in the case of destination-passing it took some head-scratching to manage the dataflow both up and down the tree). I'm tempted to remove the recursion from parsing too once I have the rest finished.

I don't quite have enough working to compare the codegen yet, but I'm optimistic. The stackful IR makes it explicit when I'm emitting code that will require shuffling locals, and the destination-passing can remove most redundant shadow stack traffic. The IR isn't very friendly to dataflow analysis and optimization (especially any kind of code motion), but on the other hand it doesn't create all the low-hanging fruit that makes optimization so necessary in debug builds for other languages.

Hopefully next month I can show some actual outputs!

zest sponsors-only repo?

I've been thinking (eg in how to trade software for small money) about ways to work on zest long-term. The option I'm favoring at the moment is an open-source but closed-development model, where code is released under an open-source license but the development branches, issues, roadmap, discussion etc are in a private sponsors-only repo. (At some point in the future when there is something worth looking at.)

All else being equal, open-source code creates far more value than proprietary code. It's not just the monetary cost but the overhead of licensing and distribution - being able to drop in a dependency and share your code with having to call up a salesperson - and the ability to fork the code if the maintainer disappears. It's much easier to capture value from proprietary code but that comes at the cost of creating less overall value. Even in an open-core model, I'd have to spend a bunch of my time working on the part of the project that produces less value. So I really want to find a solution that allows the entire project to be released as open-source.

The idea of a sponsors-only repr came from reading working in public, which points out that while the marginal cost of distributing code is zero, the marginal cost of maintaining code and supporting users is not. Much of the burnout in open-source can be attributed to treating maintenance and support as a zero-cost good. So separate the two - distribute code for free but only offer time and attention to people with some skin in the game (either sponsors or downstream maintainers).

At the peak in 2022, before I went to work for tigerbeetle for a while, github sponsors was covering half of my spending. That was without any incentive at all. So it seems plausible that I could get enough people interested in a programming language to double that.

The closed-development part is inspired by sqlite's model. They don't take pull requests and the bulk of the commits were authored by 3-5 people, but those people have working on this project for almost 25 years, producing an incredibly focused and cohesive tool. They make a living by offering support and access to maintainers via the sqlite consortium. (Maybe I can do something like that one day, but individual hobbyists seem like a better target audience for a newborn programming language.)

Clojure had a similar development model and a similarly focused and cohesive result. But they also marketed themselves like a standard open-source project, and I think much of the drama came from not recognizing the associations that come with that. If you're going to do something that is not the default, it probably helps to be really explicit about it up-front.

At present, githubs support for sponsors-only repos is limited to fixed tiers on one repo and I would really like to keep the pay-what-you-want button. But I could probably just do things by hand at first, and maybe add a webhook later.

Future Directions for Optimizing Compilers

Most of this is concerned with specifying more parts of a compiler declaratively, so that they can be verified and optimized more easily than hand-written imperative code. For example, once an IR has been given formal semantics it's possible to automatically generate peephole optimizations using an smt solver.

But there's also this fun tidbit about pointerful IR in compilers:

Informal measurements show that on a modern core with exclusive use of a 25 MB cache, an optimizing C++ compiler using GCC or LLVM still spends 30-35% of its runtime stalled on memory operations.

Safe Native Code

Some history of the midori project. I was interested in the error handling design, and most notably:

A nice accident of our model was that we could have compiled it with either return codes or exceptions. Thanks to this, we actually did the experiment, to see what the impact was to our system's size and speed. The exceptions-based system ended up being roughly 7% smaller and 4% faster on some key benchmarks.

But earlier they also mention:

From a code quality perspective, exceptions can be nice. First, you can organize code segments so that the 'cold' handlers aren't dirtying your ICACHE on successful pathways. Second, you don't need to perform any extra work during the normal calling convention. There's no wrapping of values - so no extra register or stack pressure - and there's no branching in callers.

This makes me think that their result abi was a straightforward sum type, which is what rust and zig currently do. But I wonder if the musings on better calling conventions here would impact the results of the experiment:

Results are often passed through several layers of functions via ?, which can result in a lot of redundant register moves. Often, a Result is large enough that it doesn't fit in registers, so each call in the ? stack has to inspect an ok bit by loading it from memory. Instead, a Result return might be implemented as an out-parameter pointer for the error, with the ok variant's payload, and the is ok bit, returned as an Option.

It would also be interesting to see how this plays out in wasm, where unwinding can't be implemented directly and each backend might have very different implementations of the exceptions extension.

suboperators

There are a whole bunch of papers with similar ideas, but here are a few good ones:

Building Advanced SQL Analytics From Low-Level Plan Operators. Building efficient implementations of many different kinds of relational aggregation is far too much work. Relational aggregates can be decomposed into multiple smaller operations (materialization, partitioning, hashing, sorting etc) to allow reusing code between different implementations. If this decomposition is done explicitly during planning then new optimizations open up. Eg often some of those operations can be shared between different aggregates, saving redundant work.

Incremental Fusion: Unifying Compiled and Vectorized Query Execution. Query compilation requires very lower compilation times, which is difficult. Some databases hide the latency by using an interpreter to start the query and then switching over when compilation is finished, but this requires writing both an interpreter and compiler. The authors instead create a vectorized interpreter by compiling all possible query plan nodes. Doing this for regular relational plan nodes wouldn't work because there are a non-finite number of possible parameters - the generated code would end up making too many decisions at runtime. But those relational nodes can be broken down into suboperators, each of which has much simpler parameters. Eg a filter with a complex condition can be broken down into suboperators which each mutably update a selection vector.

Declarative Sub-Operators for Universal Data Processing. SQL databases are supporting more and more varied workloads. The way we do this is by just adding more and more operations to the relational algebra, because most of these operations can't be expressed efficiently (or at all) in the relational algebra itself. Efficiently implementing all these operators is a huge burden for database developers. The authors propose instead exposing a suboperator language in which these operations can be implemented efficiently by the database users instead of just hard-coding ever more operators into the database.

So there are a couple of different databases headed in this direction. Their suboperator languages vary a lot, but there are some commonalities:

Plans are dag-structured.
Plan output nodes have richer types - not just tuple streams but also tuple buffers, hashtables, ordered sets, nested groups/windows etc.
Nodes can access shared mutable state.
More control-flow than sql, but still restricted for easy plan optimization. Eg nestedMap for expressing subquery-like operations.

None of this is totally fleshed out yet, but it feels like there is some momentum here towards moving the boundary between database and client.

books

Feel-good productivity. Blah.

Why we drive. I wanted to like this more - I really liked the authors previous book and this book shores up some of the weakenesses in the previous argument. But it's also lazier and often devolves into smug culture war. There is an important argument to be made here about how 'safety' often means moving agency from individuals into opaque bureaucracies which are immune to dissent. But he's not doing it well.