An opinionated map of incremental and streaming systems

Last updated 2021-04-18

Disclaimer: I used to work for Materialize which is one of the systems being compared in this post.

This is a space of systems for which we don't have clear terminology yet. The core idea uniting them is this:

The function in question could be anything from a simple budget calculation (input = bank transactions, output = remaining balance) to an entire website (input = user actions, output = html).

Most of these systems are equally expressive and can emulate each other, so the most useful way to characterise them is by which workloads they handle well, and especially by which workloads they handle better than repeatedly running a reasonably competent batch system.

This is similar to how we characterise SQL databases into OLTP, OLAP, time-series etc and test those characterisations using widely-implemented, well-understood benchmarks suites like TPC. Unfortunately there are no such suites for incremental/streaming systems and even the benchmarks that do exist are typically ill-conceived.

For now, we can organise them by design choices, and then later in the year we'll test how those design choices actually affect the range of workloads that they can handle.

Let's start with a high-level map:

Unstructured vs structured

Any incremental system has to track the dependency graph of the computation in order to reuse parts of it when the input changes. Unstructured systems allow completely arbitrary graphs, where each node is a single value in the computation and each edge is some arbitrary operation. In structured systems, each node is a collection of values and the edges in the graph are restricted to some set of operations for which the system can efficiently propagate small changes to the collections.

The main advantage of unstructured systems is that they allow making regular code incremental (hence their popularity in UI) whereas structured systems require rewriting code against some restricted interface.

The disadvantage is that unstructured systems tend to struggle with controlling the granularity of incremental computation. Make the values in each node too big and you do a lot of unnecessary recomputation. Make them too small and the overhead of the graph metadata dominates the actual computation. Incremental eventually ended up growing a semi-structured extension to deal with these kinds of problems. The endless waves of state management libraries for React also seem to be a symptom of this difficulty in finding the right granularity for mapping changes in state to changes in UI.

Structured systems are able to scale to much larger computations and much finer-grained changes because of the reduced metadata overhead and the ability to take advantage of algebraic properties of their limited operator set. They're also usually more amenable to automatic parallelization because the granularity of the work can be made much smaller - it's easier to pack sand than rocks.

Build systems a la carte and Towards a unified theory of reactive UI provide good overviews of the state of the art in unstructured systems. I won't discuss them further here.

(A very closely aligned axis is value-oriented vs collection-oriented interfaces. All structured systems necessarily have collection-oriented interfaces, and almost all unstructured systems have value-oriented interfaces. The only exception that comes to mind is salsa which has a collection-oriented interface but still allows arbitrary edges in the graph and tracks those edges individually per-value.)

High temporal locality vs low temporal locality

In high temporal locality workloads, inputs are typically typically grouped by window or session before futher processing (eg analytics, monitoring or sensor fusion). This enables a kind of forgetfulness - once we're reasonably sure that no new inputs will arrive in a given window or session we can export the results to some external systems and throw away the working state, safe in the knowledge that those past inputs will never influence any future work. Systems in this part of the map are usually referred to as 'streaming' systems.

In low temporal locality workloads, the processing of a given input event can typically be affected by some other input event long in the past (eg a bank can't stop tracking a customer's balance just because they haven't spent anything recently). These workloads often involve unwindowed aggregates and joins over time-varying collections, so a key feature is being able to propagate changes to inputs rather than just appending new inputs. Systems in this part of the map don't have a common terminology, but 'incremental view maintenance' is often used by database folks.

This division is very soft. Most serious workloads have a mixture of high-temporal-locality and low-temporal-locality computations, and systems on either side of the division are converging towards handling both kinds of workloads, so this division is less about capability and more about the workloads that were foremost in the minds of the designers and that appear most in the documentation.

It is notable that the systems in the low-temporal-locality side of this map are generally younger. The high-temporal-locality side has received most of the attention so far, likely because of the needs of the advertising industry.

Internally consistent vs internally inconsistent

(This division is covered in much more detail here, including demonstrations of inconsistency in flink and ksqldb)

Internal consistency means always returning an output that is the correct result for some present or past input. An internally inconsistent system might return outputs that are not the correct result for any of the past inputs. Perhaps they processed some update multiple times, or dropped an update on the floor, or combined part of one input with part of another, or finished processed some update down one path in the graph but not yet down another. An internally inconsistent system can generate unbounded errors and impossible outputs and as long as new inputs keep arriving might never converge to a correct answer.

(Eventual consistency and internal consistency overlap but neither implies the other - an eventually consistent system is allowed to produce internally inconsistent answers before converging, and an internally consistent system does not have to converge to the latest result. We generally want both.)

The easiest way to be consistent is to process inputs through the entire system one at a time. This is easy at a small scale which is why there aren't any inconsistent unstructured systems. But when trying to scale to larger volumes of fine-grained changes, processing one change at a time requires too much waiting around. Instead we want to try to do a lot of work in parallel. Doing this correctly typically requires having an explicit model of time which is tracked alongside incoming data to allow the system to reason about which versions of various intermediate results go together and when it is safe to produce an output.

Many popular systems allow internal inconsistency under various scenarios. This information is usually buried in the documentation rather than being clearly advertised. As a result people using these systems often believe that they will produce consistent results. To compound this confusion, it's common that such systems will behave well during development and only later in production start to quietly produce incorrect results under pressure.

It's notable that all the high-temporal-locality systems are also internally inconsistent. When all your inputs are windowed the effects of internal inconsistency are not as pernicious - you can simply wait until the window closes before looking at the results. In low-temporal-locality workloads there is no convenient stopping point so we have to be much more careful.

Internally inconsistent systems require the user to reason about all the possible interleavings of stream events. Internally consistent systems allow the user to pretend they are just operating a very fast batch system. I would like to ignore internally inconsistent systems simply because the cognitive complexity is so high, but that would unfortunately rule out almost all of the options.

Eager vs lazy

An eager system actively computes new outputs whenever the inputs change. A lazy system waits until the output is needed.

Being lazy makes sense when most of the output is not read most of the time. Many website backends fall into this category - the list of possible pages might even be infinite so it makes sense to compute results on the fly rather than up-front.

On the other hand, lazy systems can't provide notifications when a certain output appears or changes because they won't even know about the output until asked to calculate it. This makes them poorly suited to eg monitoring.

In practice, it seems that people tend to use eager systems and approximate laziness where needed by either running ad-hoc queries against the output or by adding a list of outputs-to-maintain into the inputs. The only actually lazy structured system I'm aware of is noria, which is why this division doesn't appear on the map.

I want to focus on systems which are structured and which can handle both low-temporal-locality and high-temporal-locality workloads. That gives us differential dataflow, flink and kafka streams.

I'll also include noria for the sake of studying laziness, but as far as I can tell it doesn't support windows so it will only be able to handle the low-temporal-locality workloads.

And then maybe much later we can also look at materialize, flink sql and ksqldb to see how query planning complicates matters.