It's alive! Internal consistency in streaming systems. Tell your friends.
Major changes since the last update:
- A bigger dataset with better delay distribution.
- A second dataset which keeps the values in balance between -1 and 1. This removes some noise and makes the strange dynamics in flink much clearer.
- Graphs!
- I added differential dataflow, which has similar behavior to materialize but made it easy to demonstrate watermarks in the output.
- Spark structured streaming turned out to have major limitations buried in its documentation, including "inputs to joins can't use any operations except map" and "aggregations can't be chained", so I removed it from the post.
- The flink datastream api also can't express the example because it doesn't have a concept of retraction. I tested enough to establish that it isn't internally consistent though. Also its non-windowed aggregates ignore event-time and process inputs as they arrive, which was a surprise.
- Still no output from kafka streams, but the data loss issue turned out to be me holding it wrong - the default configuration for kafka allows non-deterministically garbage-collecting data with event-times older than 7 days (see the config sketch after this list).
- The ksqldb docker image worked fine, so that replaced kafka streams in the post.
- Most of the rest of the post was rewritten.
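For reference, here is a minimal sketch of where that 7-day window comes from, as I understand the kafka docs (property names and defaults are my reading of the docs, so verify against your version):

```properties
# Broker defaults (server.properties). Log segments whose records are older
# than the retention window become eligible for deletion, regardless of
# whether any consumer or downstream job has processed them yet.
log.retention.hours=168      # 7 days
log.cleanup.policy=delete    # expired segments are deleted, not compacted

# Per-topic override if the data should be kept indefinitely:
# retention.ms=-1
```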
I also edited the map to reflect my discovery that flink and spark structured streaming are not internally consistent.
This is the first significant piece of work to come out of this sponsorship. I'm happy with where it ended up, but I think I could have been a lot more efficient. It started out as a pretty rushed explanation of internal consistency, morphed into a survey of consistency mechanisms, added some demonstrations to be more convincing, then when I belatedly realized that the documentation for most systems was inaccurate the demonstrations turned into experiments to figure out what I should actually be writing. The examples went through several iterations. Some of the systems had to be dropped because they couldn't express any interesting examples.
Some of this was learning pains and couldn't be avoided. But if I had to go back and do this again I would:
- Ignore the documentation and rely primarily on experiments.
- Focus just on popular systems rather than trying to be exhaustive.
- Give up on kafka streams way earlier - it ate 3 of the 6 weeks I spent on this and I got nothing out of it.
Geoffrey Litt started a newsletter. He's done a lot of work on developer tools/UX, end-user programming and most recently compatibility across versions for local-first software.
In small tech I was trying to capture some elusive ideas about business models and quality. Two recent (and unrelated) posts provide much better angles:
Loris Cro on the open source game
It's not easy to come up with a full fledged description of whatever this thing should be, but as a first approximation I came up with 'software you can love'. It's very vague, but it perfectly captures the good parts about open source and free software, and filters out many of their flaws.
There's a limit to how much you can love software with terrible UX, just as much as there's a limit to how much you can love software that has good UX, but that keeps nagging you about enabling notifications because it really needs more engagement, or software that is bloated, janky and that has short shelf life because of bad engineering choices. It also captures the fact that having the source code available is nice for learning and 'right to repair' purposes, but that there is more to software you can love and that sometimes a reasonably priced, rocksolid, proprietary tool can be preferable to a janky OSS project connected to a murky business model.
Martin Kleppmann on saying goodbye to the GPL.
If the company providing the cloud software goes out of business or decides to discontinue a product, the software stops working, and you are locked out of the documents and data you created with that software. This is an especially common problem with software made by a startup, which may get acquired by a bigger company that has no interest in continuing to maintain the startup's product.
Google and other cloud services may suddenly suspend your account with no warning and no recourse, for example if an automated system thinks you have violated its terms of service. Even if your own behaviour has been faultless, someone else may have hacked into your account and used it to send malware or phishing emails without your knowledge, triggering a terms of service violation. Thus, you could suddenly find yourself permanently locked out of every document you ever created on Google Docs or another app.
With software that runs on your own computer, even if the software vendor goes bust, you can continue running it forever (in a VM/emulator if it's no longer compatible with your OS, and assuming it doesn't need to contact a server for a license check). For example, the Internet Archive has a collection of over 100,000 historical software titles that you can run in an emulator inside your web browser! In contrast, if cloud software gets shut down, there is no way for you to preserve it, because you never had a copy of the server-side software, neither as source code nor in compiled form.
The 1990s problem of not being able to customise or extend software you use is aggravated further in cloud software. With closed-source software that runs on your own computer, at least someone could reverse-engineer the file format it uses to store its data, so that you could load it into alternative software (think pre-OOXML Microsoft Office file formats, or Photoshop files before the spec was published). With cloud software, not even that is possible, since the data is only stored in the cloud, not in files on your own computer.
Cloud software, not closed-source software, is the real threat to software freedom, because the harm from being suddenly locked out of all of your data at the whim of a cloud provider is much greater than the harm from not being able to view and modify the source code of your software. For that reason, it is much more important and pressing that we make local-first software ubiquitous. If, in that process, we can also make more software open-source, then that would be nice, but that is less critical. Focus on the biggest and most urgent challenges first.
This 2018 post on scaling sqlite has a surprising twist - they were able to scale vertically on their own hardware but not on AWS, because the AWS machines were configured to balance performance and power consumption.
This is yet another item on the list of cloud performance complexities, joining sporadically slow disk access, noisy networks, cache pollution from co-tenants etc. I wonder how much of our wisdom about scaling vertically vs horizontally is distorted by the fact that most of our tests are running on mildly crippled machines.
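The linked post doesn't pin down exactly which knob was responsible, but one common place a "balance performance and power" policy shows up on Linux is the cpufreq governor. Here's a minimal sketch for checking it - my illustration, not something from the post:

```python
# Print the frequency-scaling governor for each CPU on a Linux box.
# "powersave" or "ondemand" can throttle sustained workloads; benchmarking
# guides usually recommend "performance"
# (e.g. `cpupower frequency-set -g performance`).
from pathlib import Path

for gov in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor")):
    print(gov.parent.parent.name, gov.read_text().strip())
```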
Principles and Practice of Consistency for Distributed Data has announced its accepted papers and it looks like most of them have preprints online. Lots of promising titles focused around CRDTs and local-first software.
I've learned a lot over the years from the team behind Our Machinery and Bitsquid (for the bitsquid blog skip back to 2014, before they got acquired). The topics covered include performance-friendly architecture, various case studies of data-oriented design, advanced immediate-mode guis, using universal data models for real-time collaboration between disparate tools etc. They have great design taste, most of the posts are grounded in real war stories and their software actually works so their advice isn't useless.
I was recently reminded of their work when they added a live editor for C plugins that uses zig cc as the compiler. I love seeing crossovers between projects I admire.
I had to make a twitter account to join a twitter spaces thing and I'm tentatively keeping it, in exchange for no longer visiting hn/lobsters. I spent a month with both and recorded where I discovered interesting new things that I actually followed up on and twitter won hands down. They're all tire fires, but at least on twitter I can decide who to listen to. I still overwhelmingly focus time on books, papers, blogs and email, but a little serendipity is useful.
I don't expect to post anything there that isn't also included in way more detail in this newsletter, but if you like to retweet things then twitter.com/sc13ts is the place to get them from.
I mildly injured my wrist and typing seems to aggravate it so I'll probably spend the next couple of weeks catching up on reading, especially conference papers that I missed in the last year or two.