The main thing I've been working on is a port of differential dataflow. I have most of the basics sketched out with naive implementations and I'm currently working on implementing progress tracking correctly, at which point simple examples like graph reachability should start working.
Github doesn't have any way of saying "give all sponsors read-only access to this private repo" short of making a brand new github organization and manually spamming everyone with invites. I will figure something out eventually. But for now here are the contents of the readme:
Dida is a (WIP) library for streaming, incremental, iterative, internally-consistent computation on time-varying collections.
The jargon-free version: You write code that manipulates collections using familiar operations like `map`, `join` and `loop`. You run the code on some input and get some output. Then when the input changes, you get changes to the output, much faster than recomputing the whole thing from scratch. (And the outputs will be correct!)
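To make the "changes in, changes out" idea concrete, here is a minimal sketch in Rust. This is not dida's actual API - the names and types are made up for illustration - but it shows the core shape: a collection is a multiset, an update is a (value, diff) pair, and operators like map transform updates rather than whole collections.

```rust
use std::collections::HashMap;

type Diff = i64;

/// Apply a batch of updates to a materialized collection,
/// dropping entries whose multiplicity falls to zero.
fn apply(collection: &mut HashMap<String, Diff>, updates: &[(String, Diff)]) {
    for (value, diff) in updates {
        *collection.entry(value.clone()).or_insert(0) += *diff;
    }
    collection.retain(|_, count| *count != 0);
}

/// Incremental map: transform only the updates, not the whole collection.
fn map_updates(updates: &[(String, Diff)], f: impl Fn(&str) -> String) -> Vec<(String, Diff)> {
    updates.iter().map(|(v, d)| (f(v.as_str()), *d)).collect()
}

fn main() {
    let mut output: HashMap<String, Diff> = HashMap::new();

    // The initial input produces the initial output.
    let input_changes = vec![("alice".to_string(), 1), ("bob".to_string(), 1)];
    apply(&mut output, &map_updates(&input_changes, |s| s.to_uppercase()));

    // A later input change only does work proportional to the change.
    let more_changes = vec![("bob".to_string(), -1), ("carol".to_string(), 1)];
    apply(&mut output, &map_updates(&more_changes, |s| s.to_uppercase()));

    println!("{:?}", output); // {"ALICE": 1, "CAROL": 1} (in some order)
}
```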
Dida is heavily based on differential dataflow. Why start a new implementation? The goals are (in roughly priority order):
- Be easier to use and understand. This is naturally very subjective, but some guiding principles are:
- Prefer readability and forkability over modularity and flexibility (and especially preserve jump-to-definition in the source rather than using extension traits for everything).
- Make it clear where state resides.
- Provide well-documented default implementations for common tasks (eg writing output to a file).
- Provide interactive graphical debuggers for every component. Many of the complaints I've heard about differential dataflow are about struggling to understand where state is stored, when things happen, how different ways of writing a computation affect performance etc. Debuggers can answer these questions directly, but I suspect they will also help by teaching useful mental representations.
- Write a short book that uses the debuggers to explain both the theory of differential dataflow and the details of this implementation. Differential dataflow suffers from having the theory spread across myriad papers, each with limited space and each describing a different version of the algorithms.
- Better support use as an interpreter backend and for binding to other languages.
- Don't rely on specialization for performance, since that requires compilation and also doesn't work well cross-language. This will require rethinking eg how functions are lifted over batches (see the first sketch after this list).
- Support storing data inline in indexes when the size is only known at runtime (see the second sketch after this list). (Differential dataflow can support this, but Materialize currently stores each row in a separate heap allocation even if the row is all fixed-width types.)
- Support reference-counted values without adding overhead for non-reference-counted values. This is needed for pinning eg javascript objects. (Materialize could reference-count eg strings but would then pay for the row-by-row Drop impl on all columns, not just string columns.)
- Support being embedded in another event loop. (Differential dataflow can be run a step at a time, but the amount of work per step is not bounded).
- Support cpu and memory limits. This makes it much easier to safely support live evaluation eg embedding a query repl in a tracing dashboard.
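To illustrate the specialization point: differential dataflow relies on Rust generics to monomorphize each operator around its closure, which an interpreter backend or a foreign-language binding can't do. Here is a rough sketch - hypothetical, not dida's design - of the alternative: values are a runtime enum and operators take a dynamically dispatched function over whole batches, so no compile-time specialization is needed.

```rust
/// A runtime value type, since an interpreter backend can't monomorphize
/// operators for each compile-time row type. (Hypothetical, for illustration.)
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

/// A batch of updates: each value paired with a multiplicity diff.
type Batch = Vec<(Value, i64)>;

/// A "lifted" map: the operator works on whole batches and takes a
/// dynamically dispatched function, so no specialization is required.
fn map_batch(input: &Batch, f: &dyn Fn(&Value) -> Value) -> Batch {
    input.iter().map(|(v, diff)| (f(v), *diff)).collect()
}

fn main() {
    let batch: Batch = vec![(Value::Int(1), 1), (Value::Text("a".to_string()), 1)];
    let doubled = map_batch(&batch, &|v| match v {
        Value::Int(i) => Value::Int(i * 2),
        other => other.clone(),
    });
    println!("{:?}", doubled); // [(Int(2), 1), (Text("a"), 1)]
}
```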
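And for the inline-storage point, the gist is to pack runtime-sized rows end-to-end in one buffer with an offsets array, instead of giving each row its own heap allocation. A minimal sketch (again hypothetical, and over untyped bytes only):

```rust
/// Rows whose width is only known at runtime, stored inline in a single
/// buffer with an offsets array, rather than one heap allocation per row.
struct RowStore {
    bytes: Vec<u8>,
    offsets: Vec<usize>, // row i lives at bytes[offsets[i]..offsets[i + 1]]
}

impl RowStore {
    fn new() -> Self {
        RowStore { bytes: Vec::new(), offsets: vec![0] }
    }

    fn push(&mut self, row: &[u8]) {
        self.bytes.extend_from_slice(row);
        self.offsets.push(self.bytes.len());
    }

    fn get(&self, i: usize) -> &[u8] {
        &self.bytes[self.offsets[i]..self.offsets[i + 1]]
    }
}

fn main() {
    let mut store = RowStore::new();
    store.push(&[1, 2, 3, 4]);     // a short fixed-width row
    store.push(b"variable-width"); // a longer row, same buffer
    assert_eq!(store.get(0), &[1, 2, 3, 4]);
    assert_eq!(store.get(1), b"variable-width");
}
```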
On a related note, I wrote up a draft of some recent experiments on making live repls behave. This is something I need to figure out in order to have interactive dida examples and visualizations on a web page.
Both the async version and the threaded version work in native code, but both have issues in wasm:
- The async version hits stack overflows for as-yet-unknown reasons, probably related to wasm's control-flow integrity.
- The threaded version won't work on most mobile browsers due to limited support for SharedArrayBuffer.
For small examples in wasm I can always fall back to copying all the data to a webworker every time the interpreter resets, but it would be nice to figure out something that works for real datasets and uses the same code for both native and wasm.
I wrote up some notes on why query planning for streaming systems is hard.
I don't plan to work on this any time soon - this post is more of a "note to self" so that when I do get around to working on this I remember all the problems I've already thought of.
In my assorted thoughts on zig and rust post I mentioned the Rust Allocators Working Group which is working on adding custom allocators to the rust standard library.
After reading this recent discussion on the rust internals mailing list, it seems like the kind of tricks I'm using for the live repls will still not be possible in rust, because they require passing around not just custom allocators but non-global allocators.
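For context, here is a hand-rolled sketch of why a non-global allocator matters for this. This is not the Allocators WG API and not the code from the live-repls post, just an illustration: the allocation budget is a value that gets threaded explicitly through the evaluator, so each evaluation can have its own limit and fail gracefully instead of taking down the process.

```rust
use std::cell::Cell;

/// Not the Allocators WG design: just a stand-in for a non-global allocator,
/// ie an allocation budget passed around explicitly rather than installed
/// process-wide.
struct Arena {
    used: Cell<usize>,
    limit: usize,
}

impl Arena {
    fn new(limit: usize) -> Self {
        Arena { used: Cell::new(0), limit }
    }

    /// Account the allocation against this arena's budget; return None when
    /// the limit would be exceeded, instead of aborting the process.
    fn alloc_bytes(&self, n: usize) -> Option<Vec<u8>> {
        if self.used.get() + n > self.limit {
            return None;
        }
        self.used.set(self.used.get() + n);
        Some(vec![0u8; n])
    }
}

/// An evaluator that threads its allocator through every allocation - the
/// part that a purely global allocator API can't express.
fn evaluate(arena: &Arena) -> Result<Vec<Vec<u8>>, &'static str> {
    let mut rows = Vec::new();
    for _ in 0..10 {
        rows.push(arena.alloc_bytes(1024).ok_or("memory limit exceeded")?);
    }
    Ok(rows)
}

fn main() {
    // Each repl evaluation could get its own arena with its own limit.
    let arena = Arena::new(8 * 1024);
    match evaluate(&arena) {
        Ok(rows) => println!("allocated {} rows", rows.len()),
        Err(e) => println!("evaluation stopped: {}", e),
    }
}
```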
The zig goto saga continues. Someone had great success writing an interpreter using tail calls instead of the traditional goto pattern, and zig already supports tail calls.
I have no expertise on this subject at all, but I'm fascinated.
I occasionally get asked for good resources on how databases work. Andy Pavlo's course is my favourite, and the red book is also pretty good.
I had assumed that guix's lack of support for mac was for ideological reasons, but it seems that there might be legal obstacles too.
Some of my work (eg) is strongly critical of other people's work. I think this kind of work is important - it helps provide the gradient that software quality climbs. But it's also easy to do a lot of damage.
A trick I've found useful lately for striking the right balance is to imagine reading the article aloud to the author of the work I'm criticising. I'll still point out things that I think are wrong, but it strips out most of the negative adjectives and interpretation, and leaves the evidence to stand by itself.
Github launched pay-what-you-want for sponsors, so I've enabled that with a default value of $20 and removed all the fixed tiers.
> They are designed to focus attention but don't train you to overcome the obstacles to being focused. They are fun but don't tend to make a person more interesting.

Talking about video games, but it also applies to much of modern consumer tech.

From *You* by Austin Grossman.