This is a part of a series - start at the beginning.
It's hard to be Turing-complete without some form of iteration or recursion. There is a tricky design problem here - I'm trying to steer between two extremes:
- Programming languages usually allow arbitrary user-defined iteration via loops or recursion, but this introduces false data dependencies that inhibit parallelization and incremental maintenance (eg it's very difficult for a compiler to prove that the order of iteration in a for loop doesn't matter).
- SQL specifies the data dependencies of every set-valued function precisely, but those functions can't be defined in the language itself - which is why chapter 2 of the spec is 1732 pages long, why each SQL implementation has to hardcode all of those functions, and why every SQL database ends up inventing its own ad-hoc sublanguage for user-defined functions.
The main piece of information I have at the moment is that differential dataflow can incrementally maintain the fixpoint of an arbitrary dataflow, with performance roughly proportional to the amount of churn per iteration. Anything that I can compile down into such a fixpoint should be ok.
It seems like it should be possible to allow recursive functions in imp and convert them to a differential dataflow fixpoint, where the state per iteration is:
- the set of function inputs for which we want outputs
- the set of function inputs for which we have outputs
The main problem with this is that recursion isn't all that useful if all your inputs must be scalars. Using boxes in recursive functions doesn't work out either with the current type-system, because each successive recursive call would have a box type that is nested one level deeper than the last. And even if that did work, I can't see how to avoid duplicating the boxed data between successive recursive calls. So that seems like a dead end.
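To make the two-set state above concrete, here is a minimal Python sketch of driving a recursive function (fibonacci, as a stand-in example) to a fixpoint over `wanted` (inputs we want outputs for) and `have` (inputs we already have outputs for). This is my reading of the idea, not imp's implementation:

```python
# Sketch: a recursive function evaluated as a fixpoint whose
# per-iteration state is (wanted inputs, computed outputs).

def fib_step(n, have):
    """One attempt to compute fib(n); returns (value, missing inputs)."""
    if n < 2:
        return n, set()
    missing = {m for m in (n - 1, n - 2) if m not in have}
    if missing:
        return None, missing
    return have[n - 1] + have[n - 2], set()

def recursive_fixpoint(roots):
    wanted, have = set(roots), {}
    while True:
        new_wanted, new_have = set(wanted), dict(have)
        for n in wanted:
            value, missing = fib_step(n, have)
            if missing:
                # demand the inputs we were missing
                new_wanted |= missing
            else:
                new_have[n] = value
        if (new_wanted, new_have) == (wanted, have):
            return have
        wanted, have = new_wanted, new_have

print(recursive_fixpoint({10})[10])  # 55
```

Each iteration only grows the two sets, so an incremental engine would mostly be reprocessing the churn: the newly-demanded inputs and newly-computed outputs.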
A simpler option is just to directly expose fixpoints.
`fix init next` computes `next [next [next [... [init]]]]`, applying `next` repeatedly until the result stops changing. Together with boxes, this allows abstracting over inductive computations like transitive closure:
```
let closure = ?[edge] . (
  fix edge (?[reach] . (reach | (reach (?a . ?b . a . (b edge)))))
) in
closure [1 . 2 | 2 . 3 | 3 . 4 | 3 . 1]
---
type: number . number
value: 1 . 1 | 1 . 2 | 1 . 3 | 1 . 4 | 2 . 1 | 2 . 2 | 2 . 3 | 2 . 4 | 3 . 1 | 3 . 2 | 3 . 3 | 3 . 4
```
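The same computation can be sketched in Python, with sets of pairs standing in for imp relations (a minimal illustration of how I read `fix`, not imp's actual evaluator):

```python
# Sketch of `fix init next`: apply next until the result stops changing.

def fix(init, next_fn):
    current = init
    while True:
        new = next_fn(current)
        if new == current:
            return current
        current = new

edge = {(1, 2), (2, 3), (3, 4), (3, 1)}

def step(reach):
    # reach | (reach joined with edge): extend each reachable path by one edge
    return reach | {(a, c) for (a, b) in reach for (b2, c) in edge if b == b2}

closure = fix(edge, step)
```

Because `3 . 1` closes a cycle, the fixpoint reaches every node from 1, 2 and 3, matching the imp output above.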
Unlike datalog, this doesn't require that `next` is a monotonic function. The reasons for this restriction in datalog are:
- it guarantees termination
- it guarantees a single least satisfying model
But 1 is lost as soon as you add eg arithmetic anyway. And 2 doesn't matter if you have explicit fixpoints rather than an implicit global fixpoint over the whole program. Besides, every good incremental implementation I know of tracks iterations anyway, and most production datalog-derived languages offer non-monotonic recursion with yolo semantics. So if we're going to expose this we might as well explicitly define it.
fix is limited to fixpoints over a single set at the moment, but that will be fixed very naturally once I get to structs/modules.
I have a rough idea how non-termination should be expressed in the denotational semantics. I think errors will have similar semantics too. But I haven't written it down yet so there are probably some icky corners.
For fast incremental maintenance it's important to avoid specifying any kind of ordering on data or operations whenever possible. But sometimes you actually do care about ordering, eg for pagination. So we need some way to talk about it.
```
enumerate ("a" | "b" | "c" | "d")
---
type: number . text
value: 1 . "a" | 2 . "b" | 3 . "c" | 4 . "d"

range 2 4
---
type: number
value: 2 | 3 | 4

enumerate ("a" | "b" | "c" | "d") (range 2 4)
---
type: text
value: "b" | "c" | "d"
```
Maintaining the enumeration of a large set will be really inefficient, but we can special case the last example above into a lookup on an ordered index which is not too expensive.
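A Python sketch of how I read these semantics (the function names here are mine, not imp's):

```python
# Sketch: enumerate pairs each value with its 1-based position in
# sorted order; range produces an inclusive interval of numbers.

def enumerate_set(s):
    return {(i, x) for i, x in enumerate(sorted(s), start=1)}

def range_set(lo, hi):
    return set(range(lo, hi + 1))

items = {"a", "b", "c", "d"}

# Applying the enumeration to a set of positions is a windowed lookup.
# An ordered index can answer this directly, without materializing
# the whole enumeration.
window = {x for (i, x) in enumerate_set(items) if i in range_set(2, 4)}
```

The naive version here rebuilds the full enumeration on every call, which is exactly the inefficiency the ordered-index special case avoids.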
`enumerate` sorts by the natural ordering of the values. I imagine introducing wrapper types to allow asking for different sort orders - something like:

```
enumerate (foo (?a . ?b . (descending a) . a . b))
```
Allowing user-defined aggregates without breaking incremental maintenance is tricky. There are different maintenance strategies depending on the algebraic properties of the aggregate, and different storage strategies depending on the amount of intermediate state. I don't want to bake different operators for each of these choices into the language, but they have to be exposed somehow.
My current plan is something something compiler hints, but I don't want to stay blocked on this, so for now I've just added a placeholder:
```
let sum = ?[things] . (
  reduce things 0 (?[sum] . ?[thing] . (sum + thing))
) in
sum (3 | 4 | 5)
---
type: number
value: 12
```
I'm not really happy with this, but once I've implemented incremental maintenance and written a decent amount of imp code I'll have much more information about where the tradeoffs lie and I can come back to fix it.
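For reference, here is what the placeholder `reduce` corresponds to in Python terms (a sketch only; an incremental implementation would instead exploit algebraic properties, eg sum being commutative and invertible, rather than folding in order):

```python
# Sketch: reduce as a left fold over a set, in a fixed order so the
# result is well-defined for non-commutative reducers too.

def reduce_set(things, init, f):
    acc = init
    for thing in sorted(things):
        acc = f(acc, thing)
    return acc

total = reduce_set({3, 4, 5}, 0, lambda s, t: s + t)  # 12
```

Pinning the fold order is what makes `reduce` deterministic, but it's also what forces the implementation to either track that order or prove it doesn't matter.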
The existence of `reduce` means that storing the closed-over scope is no longer enough to reconstruct the value of an expression. If a box contains a reference to a value that depends on the arguments to `next` then we also need to store the iteration number for the loop. It's effectively a hidden variable that the program can't refer to directly but that boxes might close over.
The stack of iteration numbers in scope will map down to the time lattice for differential dataflow.