Dan Luu has been trying for months to persuade me that I should start a paid newsletter so that I can spend more time researching and writing interesting things. He hardly ever even mentions the existence of his own private patreon posts and still manages to make decent pocket money out of them, so this doesn't seem like an unreasonable idea.
I produce a lot of writing that mostly just gets sent to friends or left to rot on my hard drive. When I get around to turning something into a blog post it tends to end up on the Hacker News front page. It still seems kind of crazy to me that anyone might want to pay for my writing, but it also used to seem crazy to me that people would pay for my code. Maybe I'm just not very good at capitalism.
So I'm going to give this a go - github sponsors at any tier will get access to all the new posts. Using github instead of substack is maybe a weird choice, but I want to be able to embed katex and interactive demos in the posts, and to provide accompanying code for examples and case studies. I also expect to be releasing more open-source software this year as a side-effect of doing less consulting.
Here are the kinds of things that I might be working on and writing about this year:
- Big projects
  - Past examples - A practical relational query compiler in 500 lines of code, A UI library for a relational language, Julia as a platform for language development, A UI for exploring relational databases
  - Designing a database query language that supports abstraction. An ongoing project which is 80% done and just needs a sustained slog to finish the last 80% and produce a clear writeup and demo.
  - Building a text editor from scratch. An existing project, already in daily use. Posts will start with absolute basics (eg how to render simple UI in opengl) and eventually catch up to ongoing development (currently working on editing 1gb files at 60 fps without dropping frames).
  - Benchmarking streaming systems. A new project for 2021. Existing streaming benchmarks suffer from problems like measuring how well a system can parallelize the overhead the benchmark itself added to the original problem, mostly measuring distributed json de/serialization, or even just quietly ignoring pauses. I want to figure out how to measure these systems correctly, begin to map out the space of engineering tradeoffs, and try to extract some lessons for designers of future systems.
  - Building an embeddable relational database with incrementally maintained views. Another new project for 2021. An attempt to put the ideas from differential dataflow into a more accessible form.
- Little projects
  - Past examples - Staged interpreters in Rust, Zero-copy deserialization in Julia, Open multiple dispatch in zig, Writing a simple app for the pinephone, Vive experiments
  - Ideas for 2021:
    - See if the IJON fuzzer can discover new planning bugs in SQL databases
    - Record my screen for an entire week-long coding project. Find out where the time goes and what the bottlenecks are.
    - Figure out how to measure the battery usage of desktop programs on linux, and find the biggest offenders.
    - Figure out how to execute arbitrary zig expressions when debugging a running program. Maybe frida can inject them?
    - There's some folk wisdom that multiway join algorithms are more robust to planning errors than traditional binary joins. Is this true? Why?
    - Experiment with using embedded wasm sandboxes to prevent memory safety bugs from escalating into vulnerabilities.
    - Figure out how to systematically avoid common benchmarking mistakes, eg is it possible to combine coz and stabilizer or will they interfere with each other?
    - Can differential dataflow handle late-arriving data by using bitemporal timestamps?
    - Document how to use the tracy profiler via C bindings.
- Distilling existing research in my own field eg:
  - Why don't existing query planners work well for streaming systems?
  - How can we optimize query plans for worst-case performance over a range of input sizes, rather than best-case performance over the current input sizes?
  - Why do modern databases contain compilers?
  - How does differential dataflow work? How does it differ from Incremental? Which approach is better for which problems?
- Reading
  - Literature reviews, mostly focused on databases, query languages, query planning (especially in streaming systems), compilers, incremental view maintenance, self-adjusting computation
  - Book summaries. I used to publish summaries of everything I read. Even though they aren't linked from anywhere anymore, google still sends a few thousand people to them each month, so I guess they must be useful. I typically read ~100 books per year, of which ~25 are non-fiction, mostly tech, cognitive science, economics, sociology etc.
  - Other curiosity-driven binges eg reading every paper ever published in the Psychology of Programming Interest Group
- Random musings eg Small tech, Frugality is non-linear
I've spent the last 7 years building database engines, query planners, compilers, developer tools and interfaces for Materialize, RelationalAI, LogicBlox and Eve, as well as on various smaller consulting gigs and personal research projects. I expect to be informative when talking about those subjects, and at least entertaining when talking about other subjects.
Some of the posts will also be published on this blog. Others will be exclusive to the newsletter. I expect the schedule to consist of bursts of activity interspersed with weeks of frustrated silence.
It may seem strange, but this is one of the most frightening things I've done in years. I can only see that as a good sign.