November 01, 2023

This post outlines some ideas for control flow features that I would like to add to the Eqlog Datalog engine.

I believe all of these features make sense for a general Datalog engine, but some aspects I’ll discuss are about the interaction with native equality and function symbols, which Eqlog adds on top of basic Datalog. If you haven’t heard of Eqlog, I encourage you to have a look at the README or some of the posts I wrote on implementing a Hindley-Milner style type checker using Eqlog starting here.

Eqlog currently supports only one very basic kind of rule, which I
will refer to as an *implication* in this post. I’ve changed
Eqlog’s syntax recently (more on this below), but such implications used
to look something like this:

```
Axiom
LessThan(a, b)
& LessThan(b, c)
=>
LessThan(a, c)
& GreaterThan(c, a)
;
```

The part before the implication symbol `=>` is the *premise*, *query* or *body* of the implication, and everything after is the *conclusion* or *head*. Both premise and conclusion are given by conjunctions of *clauses* or *atoms*. Eqlog evaluates such an implication by repeatedly enumerating matches of the premise, and then adding data corresponding to the conclusion.
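
To make this execution model concrete, here is a minimal sketch in plain Rust (with hypothetical relations stored as sets of pairs, not Eqlog's actual generated code) of the fixpoint loop that evaluating the `LessThan` implication above amounts to:

```rust
use std::collections::BTreeSet;

// Naive fixpoint evaluation of the implication
//   LessThan(a, b) & LessThan(b, c) => LessThan(a, c) & GreaterThan(c, a)
// over relations stored as sets of pairs.
fn evaluate(
    less_than: &mut BTreeSet<(u32, u32)>,
    greater_than: &mut BTreeSet<(u32, u32)>,
) {
    loop {
        // Enumerate all matches of the premise.
        let mut matches = Vec::new();
        for &(a, b) in less_than.iter() {
            for &(b2, c) in less_than.iter() {
                if b == b2 {
                    matches.push((a, c));
                }
            }
        }
        // Add data corresponding to the conclusion; stop at a fixed point.
        let mut changed = false;
        for (a, c) in matches {
            changed |= less_than.insert((a, c));
            changed |= greater_than.insert((c, a));
        }
        if !changed {
            break;
        }
    }
}
```

A real engine would use indices and semi-naive evaluation instead of re-enumerating all matches in every iteration, but the overall shape is the same.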

I don’t want to change Eqlog’s fundamental execution model, but I would like for it to be less verbose and less error-prone to specify certain sets of implications. As an example, consider the following implications taken from the Hindley-Milner type checker I mentioned above, which has to do with instantiating types of variables with polymorphic type. The definition of the predicates and functions in these implications are not relevant here, it’s the general shape of those implications that I want to draw attention to:

```
Axiom
VariableExprNode(expr, var)
& VarTypeInExpr(var, expr) = PolyType(_)
=>
ExprInstantiation(expr)!
;
Axiom
VariableExprNode(expr, var)
& VarTypeInExpr(var, expr) = PolyType(_)
& instance = ExprInstantiation(expr)
& ctx = ExprTypeContext(expr)
=>
InstantiationTarget(instance) = ctx
;
```

Taken together, these two implications assert the following:

- If `VariableExprNode(expr, var) & VarTypeInExpr(var, expr) = PolyType(_)` holds,
- then `ExprInstantiation(expr)` is defined. We’ll refer to this element by `instance` henceforth.
- Furthermore, if `ctx = ExprTypeContext(expr)`,
- then `InstantiationTarget(instance) = ctx`.

Abstracting away from the concrete formulas, these two implications are of the form

`A => B`
`A & B & C => D`

for formulas *A*, *B*, *C* and *D*. With Eqlog’s current syntax,
we’re forced to state *A* and
*B* twice, in each implication.
That’s the first problem we’re going to tackle: Allow going back and
forth between specifying premises and conclusions, so that implications
can be specified similarly to the list above, without duplication.
Another way to look at this is that Eqlog should understand something to
the effect of the English “furthermore” as used in *C*.

Unfortunately, I couldn’t come up with a natural way of extending Eqlog’s original syntax to accommodate for this. But since Eqlog programs are supposed to be embedded into Rust projects, it would probably be better in any case if Eqlog looked more similar to Rust than the original syntax allowed it to. So this was a good opportunity for a major revision of Eqlog’s syntax.

With the new syntax I came up with, the two implications above can be
written as a single *rule* as follows:

```
rule expr_instantiation_target {
    if variable_expr_node(expr, var);
    if var_type_in_expr(var, expr) = poly_type(_);
    then instance := expr_instantiation(expr)!;
    if ctx = expr_type_context(expr);
    then instantiation_target(instance) = ctx;
}
```

Thus:

- The `rule` keyword replaces the old `Axiom` keyword. Rules have an optional name, and are given by a list of statements: `rule <optional_name> { stmt1; stmt2; ... stmtn; }`
- Each statement is either an `if` statement of the form `if <atom>`, corresponding to what was previously a clause in a premise, or a `then` statement of the form `then <atom>`, corresponding to what was previously a clause in a conclusion.
- Crucially, `if` and `then` statements can be freely interleaved.
- Semantically, every rule is equivalent to a *set* of implications, one for each `then` statement in the rule. The implication corresponding to a statement `then <atom_k>` occurring as *k*th statement in a rule is as follows: If `<atom_1>, <atom_2>, ..., <atom_{k - 1}>` are the atoms of the statements before statement *k* (regardless of whether they’re `if` or `then` statements!), then the implication is given by `<atom_1> & ... & <atom_{k - 1}> => <atom_k>;`
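
This prefix semantics can be made precise with a small sketch (a hypothetical `Stmt` type, with atoms modeled as plain strings):

```rust
// A rule statement: either `if <atom>` or `then <atom>`. Atoms are
// modeled as plain strings for the sake of this sketch.
#[derive(Clone, Copy)]
enum Stmt {
    If(&'static str),
    Then(&'static str),
}

// Desugar a rule into one implication (premise, conclusion) per `then`
// statement: the premise consists of *all* atoms occurring earlier in
// the rule, regardless of whether they come from `if` or `then`.
fn desugar(stmts: &[Stmt]) -> Vec<(Vec<&'static str>, &'static str)> {
    let mut implications = Vec::new();
    let mut prefix: Vec<&'static str> = Vec::new();
    for stmt in stmts {
        match *stmt {
            Stmt::If(atom) => prefix.push(atom),
            Stmt::Then(atom) => {
                implications.push((prefix.clone(), atom));
                prefix.push(atom);
            }
        }
    }
    implications
}
```

For the shape `A => B` and `A & B & C => D` discussed above, the statement list `if A; then B; if C; then D;` desugars to exactly those two implications.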

A naive Rust implementation of the example rule above might look like this:

```
// if variable_expr_node(expr, var);
for (expr, var) in iter_variable_expr_node() {
    // if var_type_in_expr(var, expr) = poly_type(_);
    for (var0, expr0, res0) in iter_poly_type() {
        if var0 != var || expr != expr0 {
            continue;
        }
        // then instance := expr_instantiation(expr)!;
        let instance = match expr_instantiation(expr) {
            Some(instance) => instance,
            None => {
                let instance = new_instantiation();
                insert_expr_instantiation(expr, instance);
                instance
            }
        };
        // if ctx = expr_type_context(expr);
        if let Some(ctx) = expr_type_context(expr) {
            // then instantiation_target(instance) = ctx;
            insert_instantiation_target(instance, ctx);
        }
    }
}
```

Thus, we don’t actually have to emit the equivalent of separate nested loops for each `then` statement and can instead reuse outer loops.

I’ve already implemented the new syntax, but so far only the original Eqlog semantics (i.e., implications) are supported. Thus, every rule must be given by a list of `if` statements, followed by a list of `then` statements, and interleaving the two doesn’t work yet.

The Hindley-Milner type checker that the example implications above are taken from defines a third, related implication:

```
Axiom
VariableExprNode(expr, var)
& VarTypeInExpr(var, expr) = PolyType(sigma)
& instance = ExprInstantiation(expr)
& expr_ty = ExprType(expr)
=>
Instantiate(instance, sigma) = expr_ty
;
```

The first three clauses of this implication are the same as in one of the implications we’ve already seen, but the atom `expr_ty = ExprType(expr)` in the premise and the conclusion are specific to this implication. Even with the ability to interleave `if` and `then` statements, we’d need to add a second rule for this implication that would duplicate the three initial clauses/statements. What we would need to avoid this duplication is a mechanism to fork a rule after a number of statements. That’s the purpose of the `fork` keyword.

Using `fork`, this third implication can be incorporated into the existing `expr_instantiation_target` rule we’ve seen in the previous section as follows:

```
rule expr_instantiation_target {
    if variable_expr_node(expr, var);
    if var_type_in_expr(var, expr) = poly_type(sigma);
    then instance := expr_instantiation(expr)!;
    fork {
        if ctx = expr_type_context(expr);
        then instantiation_target(instance) = ctx;
    } or {
        if expr_ty = expr_type(expr);
        then instantiate(instance, sigma) = expr_ty;
    }
}
```

Thus:

- Rules can now contain statements of the form `fork { <block_1> } or { <block_2> } ... or { <block_n> }`. Each block is a list of statements, and can recursively contain other `fork` statements.
- Semantically, usage of the `fork` statement in a rule can be desugared by duplicating the rule *n* times, once for each block of the `fork` statement. In the *i*th copy of the rule, the `fork` statement is replaced by `<block_i>`.
- In particular, `fork` statements need not necessarily occur at the end of the rule.
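
The desugaring described above can be sketched as follows (hypothetical types; atoms are modeled as strings, and the `if`/`then` distinction is omitted since it does not matter for the duplication itself):

```rust
// A rule body is a list of statements; a statement is either a plain
// atom or a `fork` over several alternative blocks.
#[derive(Clone)]
enum RuleStmt {
    Atom(&'static str),
    Fork(Vec<Vec<RuleStmt>>),
}

// Desugar `fork` statements by duplicating the rule once per block: in
// the i-th copy, the fork statement is replaced by its i-th block.
fn desugar_forks(stmts: &[RuleStmt]) -> Vec<Vec<&'static str>> {
    let mut rules: Vec<Vec<&'static str>> = vec![Vec::new()];
    for stmt in stmts {
        match stmt {
            RuleStmt::Atom(atom) => {
                for rule in &mut rules {
                    rule.push(*atom);
                }
            }
            RuleStmt::Fork(blocks) => {
                let mut expanded = Vec::new();
                for block in blocks {
                    // Forks can be nested, so a block can itself
                    // desugar into several alternative suffixes.
                    for suffix in desugar_forks(block) {
                        for rule in &rules {
                            let mut copy = rule.clone();
                            copy.extend(suffix.iter().copied());
                            expanded.push(copy);
                        }
                    }
                }
                rules = expanded;
            }
        }
    }
    rules
}
```

Note that statements after the `fork` end up in every copy, which is why forks need not occur at the end of a rule.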

A good implementation will probably not desugar a `fork` statement into several rules though: Instead, results of queries for the shared part of a rule can be reused for each block of a subsequent `fork` statement.

I’m not completely sure about the `fork` keyword itself though, and especially whether `or` is the right keyword to separate the individual blocks. I’ve also considered “`allof`” instead of `fork` and “`and`” instead of `or`, since *all* of the following blocks apply after the statements before the `fork` have matched. But statements after the `fork` apply if *any* of the blocks of the fork statement have matched, suggesting the keywords “`anyof`” and “`or`”. So perhaps `fork` is the best we can do here, although I don’t like that it emphasizes the operational aspects of the mechanism instead of the semantics. The `or` keyword is somewhat misleading though, and a good alternative eludes me at the moment. One option might be to just not have a keyword between blocks, so that only closing and opening curly braces `} {` would separate blocks.

Sum types are probably among the most impactful features that have made it from the academic programming language ivory tower into mainstream languages in recent years. Once you’re used to sum types, you see them crop up almost everywhere, and they make many classes of problems much easier to deal with.

One such problem class is algorithms that have to traverse abstract syntax tree (AST) nodes. For example, here’s how we might define a type of AST nodes representing (a subset of) boolean expressions in Rust:

```
enum BoolExpr {
    True(),
    False(),
    Or(Box<BoolExpr>, Box<BoolExpr>),
    And(Box<BoolExpr>, Box<BoolExpr>),
}
```

The main benefit of sum types is that they enable destructuring pattern matching. Continuing with the `BoolExpr` example, here’s a Rust function that evaluates a `BoolExpr`:

```
fn eval(e: &BoolExpr) -> bool {
    use BoolExpr::*;
    match e {
        True() => { true }
        False() => { false }
        Or(lhs, rhs) => { eval(lhs) || eval(rhs) }
        And(lhs, rhs) => { eval(lhs) && eval(rhs) }
    }
}
```

Crucially, the Rust compiler checks statically that there is a match arm for every possible value of `e`. If we forget one of the constructors of `BoolExpr`, for example `BoolExpr::Or`, then the compiler will immediately point us to the bug.

Standard Datalog doesn’t have sum types, and neither does Eqlog. Instead, Datalog programs that deal with ASTs typically define a separate relation/predicate for each different constructor of a given kind of AST node. In Eqlog, we can use functions to encode AST nodes (unless we want to distinguish structurally equal AST nodes), for example like so:

```
type BoolExpr;
func true_expr() -> BoolExpr;
func false_expr() -> BoolExpr;
func or_expr(BoolExpr, BoolExpr) -> BoolExpr;
func and_expr(BoolExpr, BoolExpr) -> BoolExpr;
```

We can then define an analogue of the `eval` function above in Eqlog as follows:

```
// We'll add rules such that this predicate holds
// for expressions that evaluate to true.
pred evals_to_true(BoolExpr);

rule true_eval {
    if expr = true_expr();
    then evals_to_true(expr);
}

// No rule for false_expr -- it doesn't evaluate to true.

rule or_eval {
    if expr = or_expr(lhs, rhs);
    fork {
        if evals_to_true(lhs);
    } or {
        if evals_to_true(rhs);
    }
    then evals_to_true(expr);
}

rule and_eval {
    if expr = and_expr(lhs, rhs);
    if evals_to_true(lhs);
    if evals_to_true(rhs);
    then evals_to_true(expr);
}
```

Note that nothing prevents us from removing or forgetting to add a rule governing `evals_to_true`, say the `or_eval` rule. What we’d need to catch such an error is an equivalent of Rust’s exhaustive match statement in Eqlog, hence sum types.

So here’s what sum types could look like in Eqlog:

```
enum BoolExpr {
    True(),
    False(),
    Or(BoolExpr, BoolExpr),
    And(BoolExpr, BoolExpr),
}

pred evals_to_true(BoolExpr);

rule evals_to_true_laws {
    if e: BoolExpr;
    match e {
        True() => {
            then evals_to_true(e);
        }
        False() => {}
        Or(lhs, rhs) => {
            fork {
                if evals_to_true(lhs);
            } or {
                if evals_to_true(rhs);
            }
            then evals_to_true(e);
        }
        And(lhs, rhs) => {
            if evals_to_true(lhs);
            if evals_to_true(rhs);
            then evals_to_true(e);
        }
    }
}
```

Thus, every constructor of the `BoolExpr` enum implicitly defines a function, for example `Or(BoolExpr, BoolExpr) -> BoolExpr`. If the type of a variable in a rule is an `enum` type, then the variable can be destructured with a `match` statement. Such `match` statements desugar into `fork` statements, with each arm in the `match` statement corresponding to an alternative block in the `fork`. The first statement in each block of the resulting `fork` statement corresponds to the constructor of the corresponding match arm, and the rest of the block is given by the statements in the match arm. Thus, the `evals_to_true_laws` rule above would desugar into the following rule:

```
rule evals_to_true_laws_desugared {
    if e: BoolExpr;
    fork {
        if e = True();
        then evals_to_true(e);
    } or {
        if e = False();
    } or {
        if e = Or(lhs, rhs);
        fork {
            if evals_to_true(lhs);
        } or {
            if evals_to_true(rhs);
        }
        then evals_to_true(e);
    } or {
        if e = And(lhs, rhs);
        if evals_to_true(lhs);
        if evals_to_true(rhs);
        then evals_to_true(e);
    }
}
```

So far I think there’s not too much to argue about. However, there are some considerations regarding sum types that are specific to Datalog’s (or Eqlog’s) evaluation model.

Does every element of an `enum` type need to be equal to the result of applying a constructor? This is what you would expect from other languages with sum types. However, it’s not obvious how Eqlog can guarantee this property.

That’s because Eqlog allows enforcing that a given function is defined on some element. For example, the `expr_instantiation_target` rule above enforces that the `expr_instantiation` function is defined on an `expr` under certain conditions:

```
rule expr_instantiation_target {
    ...
    then instance := expr_instantiation(expr)!;
    ...
}
```

When Eqlog infers during evaluation that a function should be defined on some arguments, it creates a new element (that is, Eqlog chooses a previously unused integer ID to represent the element), and records it as the result of applying that function. But what if the result type of the function was declared as an `enum`? We now have an element of an enum type that is not given by one of the constructors.
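
A minimal sketch of this mechanism, assuming a hypothetical `PartialFunc` table that stores the graph of a unary function over integer element IDs:

```rust
use std::collections::HashMap;

// A partial unary function whose argument and result are elements
// represented by integer IDs, as in Eqlog's evaluation model.
struct PartialFunc {
    graph: HashMap<u32, u32>,
    next_id: u32,
}

impl PartialFunc {
    // Corresponds to `then result := f(arg)!`: if f(arg) is already
    // defined, return the recorded result; otherwise create a fresh
    // element (a previously unused ID) and record it as the result.
    fn define(&mut self, arg: u32) -> u32 {
        if let Some(&result) = self.graph.get(&arg) {
            return result;
        }
        let fresh = self.next_id;
        self.next_id += 1;
        self.graph.insert(arg, fresh);
        fresh
    }
}
```

Nothing in this mechanism knows or cares whether the result type was declared as an `enum`, which is precisely the problem.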

One option here is to prevent such situations from arising in the first place using a static compile time check. For example, we could require that every new element of an enum type that is introduced by a rule is later in that rule equated with the result of applying a constructor. Such a check cannot take arbitrary applications of other rules into account though, since that would mean that the check does not necessarily terminate. Thus, some generality would be lost.

In some cases it can even be desirable to have elements of an `enum`-like type that are not equal to any of the constructors. One example is the type `Type` that the Hindley-Milner type checker I mentioned above uses to represent the types of its input language. This type might be declared as an Eqlog enum like so:

```
enum Type {
    Boolean(),
    String(),
    Function(arg: Type, result: Type),
    ...
}
```

To account for generics in a Hindley-Milner type system, `Type` must be able to represent type variables. The type checker represents such type variables as elements of `Type` which are not given by any of the constructors, which is an intentional violation of surjectivity of constructors. So it might seem that `Type` should just not be an `enum` to begin with. But on the other hand, the type checker defines some rules that consider which, if any, of the constructors a type was obtained with and conclude accordingly, and there the `match` statement is useful.

The other option is to simply drop the requirement that every element of an `enum` type must be obtained via one of the constructors. The desugaring of `match` to `fork` outlined above works perfectly fine; the `match` statement simply does nothing for elements that are not obtained via a constructor. But still, this situation will be surprising at first and not intended in the majority of cases. So a synthesis could be to make Eqlog recognize `enum` types declared as either `open` or `closed`, and then to let the user choose one or the other approach separately for each `enum`.

When we speak of sum types, we usually mean a type representing the
*disjoint* union of the summand types. That the union is
“disjoint” means that elements obtained by one constructor are never
equal to the result of applying a different constructor. As a
consequence, there is a unique way in which a given element can be
destructured in terms of the constructors of the enum type.

What distinguishes Eqlog from other Datalog engines is that Eqlog rules can equate elements. When Eqlog infers during evaluation that an equation `lhs = rhs` of elements should hold, then `lhs` and `rhs` are completely interchangeable in all other contexts henceforth.

But then what should happen when a rule equates elements of an enum type that were obtained from different constructors, or from different arguments of the same constructor? Here are two (nonsensical but valid) examples using the `Type` enum above:

```
rule bool_is_string {
    if b = Boolean();
    if s = String();
    then b = s;
}

rule function_is_commutative {
    if f = Function(arg, result);
    then f = Function(result, arg);
}
```
```

One option is again to just allow this: This would mean that more than one arm of a match statement can apply to an element (for the example above, this could be both the `Boolean()` and `String()` arm of a match statement), or that one arm applies in more than one way (if `Function(arg, result)` matches, then also `Function(result, arg)` matches).

Another option would be to stop evaluation and report an error as soon as an equality among different constructors is found. Such a dynamic check is perhaps the most pragmatic option, which is why I’m not very fond of it.

And a third option would be to statically ensure that no rule can be used to infer any equality whatsoever of elements of an enum type. This is more difficult than it might seem at first, because many equalities stem from implicit functionality axioms and not explicitly enforced equalities in rules.

As an example, consider types `A` and `B` where `B` is declared as an enum

```
enum B {
    First(),
    Second(),
}
```

and suppose there is a function `foo(A) -> B` such that `foo(a0) = First()` and `foo(a1) = Second()` for elements `a0` and `a1` of `A`. Now, if we infer that `a0 = a1` holds, then the functionality axiom for `foo` lets us deduce `First() = Second()`. Thus, we sometimes infer equalities in a type even if no rule explicitly enforces that equality. So what would be needed here is that we forbid rules from equating elements of enum types and of all types for which there exists (transitively) a function into an enum type. That would rule out usage of `enum` types in many instances though.
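
The deduction in this example can be reproduced with a small union-find sketch (hypothetical; `foo` is modeled as a list of (argument, result) pairs over integer element IDs):

```rust
// A minimal union-find over elements represented by integer IDs.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        UnionFind { parent: (0..n).collect() }
    }

    fn find(&self, mut x: usize) -> usize {
        while self.parent[x] != x {
            x = self.parent[x];
        }
        x
    }

    fn union(&mut self, x: usize, y: usize) {
        let (rx, ry) = (self.find(x), self.find(y));
        self.parent[rx] = ry;
    }
}

// The functionality axiom of a unary function `foo`, given by its graph
// as (argument, result) pairs: equal arguments force equal results.
fn propagate_functionality(uf: &mut UnionFind, foo: &[(usize, usize)]) {
    for i in 0..foo.len() {
        for j in (i + 1)..foo.len() {
            let ((a, r), (b, s)) = (foo[i], foo[j]);
            if uf.find(a) == uf.find(b) {
                uf.union(r, s);
            }
        }
    }
}
```

After equating `a0 = a1`, one round of functionality propagation already merges `First()` and `Second()`, even though no rule mentions `B` at all.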

From what I can tell, the most straightforward option for now is to just not enforce injectivity of constructors. Obfuscated in category theoretic language, this would mean that Eqlog’s `enum` types are *amalgamated sum* or *pushout* types. As before, it could make sense for Eqlog to support an optional annotation on `enum` declarations that would enable a check that constructors must be injective.

Rust and other languages typically support wildcard patterns `_` in match statements:

```
fn is_literal(e: BoolExpr) -> bool {
    use BoolExpr::*;
    match e {
        True() => { true }
        False() => { true }
        _ => { false }
    }
}
```

Wildcard patterns apply when all other arms have *not* matched. That’s a negation, and Datalog’s fixed point semantics are incompatible with arbitrary negations. To see why, consider this hypothetical Eqlog program:

```
pred p();

rule paradox {
    if not(p());
    then p();
}
```

After evaluation, should `p()` hold?

So Eqlog can’t support arbitrary negations, but there’s no fundamental reason why it could not support *stratified* negation in the same way that other Datalog engines do. Stratification prevents the paradox above by requiring that if a rule concludes from `p` that `q` holds, then there must be no rules that conclude (transitively) from `q` that `p` holds. In other words, there must be no recursion involving negations.
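
A stratification check of this kind boils down to cycle detection in the rule dependency graph. Here is a sketch, assuming predicates are numbered and dependencies are given as edges that record whether the premise occurrence is negated:

```rust
// Dependency edges between predicates, numbered 0..num_preds. An edge
// (p, q, true) means some rule concludes q from a negated premise on p.
// The program is stratified if no negated edge participates in a cycle.
fn is_stratified(num_preds: usize, edges: &[(usize, usize, bool)]) -> bool {
    for &(p, q, negated) in edges {
        if negated && reaches(num_preds, edges, q, p) {
            return false;
        }
    }
    true
}

// Depth-first search: can `to` be reached from `from` along edges?
fn reaches(num_preds: usize, edges: &[(usize, usize, bool)], from: usize, to: usize) -> bool {
    let mut visited = vec![false; num_preds];
    let mut stack = vec![from];
    while let Some(x) = stack.pop() {
        if x == to {
            return true;
        }
        if visited[x] {
            continue;
        }
        visited[x] = true;
        for &(a, b, _) in edges {
            if a == x {
                stack.push(b);
            }
        }
    }
    false
}
```

The `paradox` rule corresponds to a negated self-edge on `p`, which this check rejects.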

Stratification could also be used to implement wildcards. Eqlog would evaluate the program until the enum constructors stabilize, i.e., until no rule can fire anymore that would change anything about the graphs of the enum constructors. At this point, Eqlog would apply the wildcard match arm (if any) to elements that no other arm applies to.

An alternative to this that doesn’t require stratification could be for Eqlog to replace the wildcard with the complementary patterns to the ones that are explicitly given. For example, the match statement in

```
if e: BoolExpr;
match e {
    True() => { <A> }
    False() => { <B> }
    _ => { <C> }
}
```

would desugar to

```
match e {
    True() => { <A> }
    False() => { <B> }
    Or(_, _) => { <C> }
    And(_, _) => { <C> }
}
```
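
Computing the complementary patterns is straightforward given the list of constructors of the enum. A sketch (hypothetical representation; constructors are (name, arity) pairs and patterns are rendered as strings):

```rust
// Replace a wildcard arm by one arm per constructor that is not
// explicitly matched. The complementary arms bind no variables, so
// their arguments are all wildcards `_`.
fn expand_wildcard(
    all_constructors: &[(&str, usize)],
    explicit: &[&str],
) -> Vec<String> {
    all_constructors
        .iter()
        .filter(|(name, _)| !explicit.contains(name))
        .map(|(name, arity)| {
            let args = vec!["_"; *arity].join(", ");
            format!("{name}({args})")
        })
        .collect()
}
```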

If we enforce that enum constructors are surjective and injective, then this second approach is strictly more powerful than the first, because it gives the same results but does not require stratification. But the first approach using stratification would still be useful for enums with non-surjective constructors, where the wildcard pattern would apply to elements that are not given by a constructor. And in the case of non-injective constructors, the second approach would mean that the wildcard pattern can match even when there is already some other pattern that matches, which is not the case with the approach based on stratified negation. I don’t know which alternative would be preferable here, but luckily there’s no need to make a decision before sum types and `match` statements without wildcards are actually implemented :)