An opinionated map of incremental and streaming systems

Last updated 2021-03-20

Disclaimer: I used to work for Materialize which is one of the systems being compared in this post.

This is a space of systems for which we don't have clear terminology yet. The core idea uniting them is this:

The function in question could be anything from a simple budget calculation (input = bank transactions, output = remaining balance) to an entire website (input = user actions, output = html).

Most of these systems are equally expressive and can emulate each other, so the most useful way to characterise them is by which workloads they handle well, and especially by which workloads they handle better than repeatedly running a reasonably competent batch system.

This is similar to how we characterise SQL databases into OLTP, OLAP, time-series etc and test those characterisations using widely-implemented, well-understood benchmarks suites like TPC. Unfortunately there are no such suites for incremental/streaming systems and even the benchmarks that do exist are typically ill-conceived.

For now, we can organise them by design choices, and then later in the year we'll test how those design choices actually affect the range of workloads that they can handle.

Let's start with a high-level map:

Unstructured vs structured

Any incremental system has to track the dependency graph of the computation in order to reuse parts of it when the input changes. Unstructured systems allow completely arbitrary graphs, where each node is a single value in the computation and each edge is some arbitrary operation. In structured systems, each node is a collection of values and the edges in the graph are restricted to some set of operations for which the system can efficiently propagate small changes to the collections.

The main advantage of unstructured systems is that they allow making regular code incremental (hence their popularity in UI) whereas structured systems require rewriting code against some restricted interface.

The disadvantage is that unstructured systems tend to struggle with controlling the granularity of incremental computation. Make the values in each node too big and you do a lot of unnecessary recomputation. Make them too small and the overhead of the graph metadata dominates the actual computation. Incremental eventually ended up growing a semi-structured extension to deal with these kinds of problems. The endless waves of state management libraries for React also seem to be a symptom of this difficulty in finding the right granularity for mapping changes in state to changes in UI.

Structured systems are able to scale to much larger computations and much finer-grained changes because of the reduced metadata overhead and the ability to take advantage of algebraic properties of their limited operator set. They're also usually more amenable to automatic parallelization because the granularity of the work can be made much smaller - it's easier to pack sand than rocks.

Build systems a la carte and Towards a unified theory of reactive UI provide good overviews of the state of the art in unstructured systems. I won't discuss them further here.

(A very closely aligned axis is value-oriented vs collection-oriented interfaces. All structured systems necessarily have collection-oriented interfaces, and almost all unstructured systems have value-oriented interfaces. The only exception that comes to mind is salsa which has a collection-oriented interface but still allows arbitrary edges in the graph and tracks those edges individually per-value.)

Internally consistent vs internally inconsistent

(This distinction will be covered in more detail in an upcoming post)

Internal consistency means always returning an output that is the correct result for some present or past input. An internally inconsistent system might return outputs that are not the correct result for any of the past inputs. Perhaps they processed some update multiple times, or dropped an update on the floor, or combined part of one input with part of another, or finished processed some update down one path in the graph but not yet down another.

Vendors prefer terms like 'eventually consistent' over 'inconsistent', but trying to compose incremental/streaming operations which are individually eventually consistent can lead to unbounded error and impossible outputs.

To make this more concrete, imagine an accounting system where the input is a list of transactions moving money from one account to another. In an internally consistent system the total amount of money summed across all accounts should never change because money is never created or destroyed, only moved. In an internally inconsistent system the reported total amount might vary over time and in the worst cases might never converge to the correct amount.

The easiest way to be consistent is to process changes one at a time. This is easy at a small scale which is why there aren't any inconsistent unstructured systems. But when trying to scale to larger volumes of fine-grained changes, processing one change at a time requires too much waiting around.

Instead we want to try to do a lot of work in parallel. Doing this correctly typically requires having an explicit model of time which is tracked alongside incoming data to allow the system to reason about which versions of various intermediate results go together and when it is safe to produce an output.

(An alternative option is to only use those operations which do compose under eventual consistency, as in bloom).

This is what is missing in the inconsistent systems - when combining two streams they don't guarantee that they are combining values that were produced at the same point in time in the original input stream, so the system as a whole might produce outputs that could not possibly have been produced from any input, past or present.

Many popular systems allow internal inconsistency under various scenarios (more detail in an upcoming post). This information is usually buried in the documentation rather than being clearly advertised. As a result people using these systems often believe that they will produce consistent results. To compound this confusion, it's common that such systems will behave well during development and only later in production start to quietly produce incorrect results under pressure.

Given that we now have multiple good options for completely consistent computation I don't think there is any reason in the overwhelming majority of cases to take on the massive cognitive and maintenance overhead of an inconsistent system, so I won't discuss inconsistent systems further here.

Eager vs lazy

An eager system actively computes new outputs whenever the inputs change. A lazy system waits until the output is needed.

Being lazy makes sense when most of the output is not read most of the time. Many website backends fall into this category - the list of possible pages might even be infinite so it makes sense to compute results on the fly rather than up-front.

On the other hand, lazy systems can't provide notifications when a certain output appears or changes because they won't even know about the output until asked to calculate it. This makes them poorly suited to eg monitoring.

The lazy branch of this map might be unpopulated? Salsa is lazy but unstructured. Noria is lazy and has a credible claim to being eventually consistent, but is not internally consistent. (Eventual consistency and internal consistency overlap but neither implies the other - an eventually consistent system is allowed to produce internally inconsistent answers before converging, and an internally consistent system does not have to converge to the latest result).

In practice, it seems that people tend to use eager systems and approximate laziness where needed by either running ad-hoc queries against the output or by adding a list of outputs-to-maintain into the inputs. Building a structured consistent system that can transparently switch between eager and lazy processing might be very useful.

High temporal locality vs low temporal locality

(This distinction will be covered in more detail in an upcoming post)

In some workloads, inputs are processed together only if they were created at similar times eg analytics, monitoring or sensor fusion where events are typically grouped by window or session before summary. This enables a kind of forgetfulness - once we're reasonably sure that no new inputs will arrive in a given window or session we can export the results to some external systems and throw away the working state, safe in the knowledge that those past inputs will never influence any future work. Systems in this part of the map are usually referred to as 'streaming' systems.

By contrast, in eg OLTP workloads it's often the case that the processing of a given input event can be affected by some other input event long in the past. A bank can't stop tracking a customer's balance just because they haven't spent anything recently. A key feature for this kind of workload is being able to handle joins between time-varying tables without windowing. Systems in this part of the map don't have a common terminology, but 'incremental view maintenance' is often used by database folks.

This division is very soft. Most serious workloads have a mixture of high-temporal-locality and low-temporal-locality computations, and systems on either side of the division have converged towards handling both kinds of workloads, so this division is less about capability and more about the workloads that were foremost in the minds of the designers and that appear most in the documentation.

It is notable though that the low-temporal-locality side of this map consists almost solely of research projects. The high-temporal-locality side has received most of the attention so far, likely because of the needs of the advertising industry.

So we're not going to talk about:

That leaves us with structured, consistent, eager systems and this interesting soft division between systems designed around high temporal locality vs low temporal locality.

Of these remaining systems:

That leaves us with differential dataflow, spark structured streaming and flink as the systems worth exploring in depth in this series.

(And then maybe in a later series we can look materialize, flink sql, ksqldb etc to see how query planning complicates matters.)