September 29, 2024

*This is a high-level description of a Datalog extension that
would allow instantiating multiple models of one Datalog program in a
larger ambient Datalog program. These instances are represented as
elements in the ambient Datalog program, and types of the instantiated
Datalog program are dependent on these elements.*

*The main purpose of this language feature is to make it possible
to compose small self-contained Datalog programs into larger programs,
similarly to how classes in object oriented languages enable multiple
instantiations of part of a program and its associated data. A second
benefit is that it enables the Datalog compiler to detect more
programming errors. Finally, dependent models allow informing the
Datalog compiler about locality in data and rules, which enables it to
generate more efficient code.*

*Dependent models are a further generalization of Cartmell’s
framework of generalized
algebraic theories (GATs): Where GATs have a notion of dependent
type or set, dependent models exist for every user-supplied Datalog
theory, in particular the theory of a single type.*

This will be a rather conceptual post, and unfortunately I haven’t even started to implement this. But there’s a good (bad?) chance I’ll be a bit busier soon, so I figured it might be better to write down some of those ideas for when I can get back to them.

In case you haven’t followed the work on the Eqlog Datalog compiler or the similar Egglog engine, let me try to briefly summarize. Eqlog implements an extension of Datalog with support for (partial) functions and equality. And while dependent models probably make sense without equality or functions, dependent models interact non-trivally with equality, so I decided to write directly about dependent models in Eqlog.

Types, functions and predicates in Eqlog are declared as follows:

```
type A;
type B;
func f(x: A) -> B;
pred p(x_0: A, x_1: A);
```

Rules consist of a sequence of `if`

and `then`

statements. For example, the following rule enforces transitivity for
`p`

:

```
rule p_transitivity {
if p(x_0, x_1);
if p(x_1, x_2);
then p(x_0, x_2);
}
```

In standard Datalog syntax this would be written as follows:

`, x_2) :- p(x_0, x_1), p(x_1, x_2). p(x_0`

Compiled Eqlog programs represent elements of types as integer IDs,
predicates as sets of tuples and functions via their graphs. The graph
of the function `f(a : A) -> B`

is a predicate
`graph_f : (a : A, b: B)`

where `graph_f(x, y)`

means that `f(x) = y`

. Eqlog’s support for equality means
that it can enforce that `graph_f`

satisfies the
functionality axiom, i.e. the following holds:

```
rule functionality {
if graph_f(x, y_0);
if graph_f(x, y_1);
then y_0 = y_1;
}
```

During evaluation, when an Eqlog program infers an equality among
elements `x`

and `y`

, it replaces all occurrences
of `y`

in predicates and functions by `x`

. Another
extension over standard Datalog is the ability to quantify over all
elements of a given type, and to enforce that a function is defined on
given arguments via the exclamation mark operator:

```
func f(x: A) -> B;
rule {
// if x is an element of type A
if x: A;
// then the function f is defined on x
then f(x)!;
}
```

The exclamation mark operator makes Eqlog Turing complete. To prevent
Eqlog programmers (so: me) from inadvertedly introducing new elements in
rules, leading to non-termination, Eqlog enforces *surjectivity*
restrictions in rule definitions: Unless the exclamation mark operator
is used, `then`

statements can only mention elements that
have been introduced earlier in the same rule by an `if`

statement or via the exclamation mark operator.

The Eqlog compiler does not generate standalone executables. Instead,
the generated code is meant to be imported into a host program written
in Rust. This generated code provides a data structure to represent the
data declared in the Eqlog program, and it provides a `close`

method that operates on this data structure which adds data according to
the rules of the Eqlog program until all rules are satisfied.

Consider reachability analysis for the control flow graph (CFG) of a
function in a given programming language. This is a common subtask
during compilation and static analysis of programming languages. A CFG
is a representation of a function’s execution flow as directed graph:
The nodes of the graph represent basic code blocks without branches, and
edges represent possible direct execution flows between basic blocks.
The CFG of a function has a dedicated entry node, and we call a node
`n`

*reachable* if there is a path from the entry node
to `n`

.

Here’s a straightforward implementation of the CFG data structure and reachability in Eqlog:

```
type Node;
pred edge(from: Node, to: Node);
pred reachable(node: Node);
func entry_node(): Node;
rule entry_reachable {
if n = entry_node();
then reachable(n);
}
rule reachable_step {
if reachable(n);
if edge(n, m);
then reachable(m);
}
```

If we compile this Eqlog program, we get a Rust module that provides
a `CFG`

data structure. Our host program can instantiate this
data structure, populate `Node`

, `edge`

and
`entry_node`

data based on the abstract syntax tree (AST) of
a given function, and then run the `close`

method to compute
reachability in the CFG.

While this works, it’s somewhat unsatisfactory: Ideally, we’d like to move more logic into the Eqlog program and only instantiate the Eqlog program once for the whole AST, which will contain the ASTs of many functions. However, we can’t easily use the Eqlog program above as part of a larger Eqlog program that analyzes multiple functions at once. The problem is that we need to instantiate separate CFGs for each function in our Eqlog program, but Eqlog does not have a notion of “instance” of a subprogram, for example the CFG program.

We can work around this by parametrizing types, predicates and functions in our CFG program with elements of a new type to encode different instances of the original CFG program:

```
// A type representing instances of CFGs.
type CFG;
type Node;
func node_owner(node: Node) -> CFG;
pred edge(cfg: CFG, from: Node, to: Node);
pred reachable(cfg: CFG, node: Node);
func entry_node(cfg: CFG): Node;
rule entry_reachable {
if n = entry_node(cfg);
then reachable(cfg, n);
}
rule reachable_step {
if reachable(cfg, n);
if edge(cfg, n, m);
then reachable(cfg, m);
}
```

Instantiating CFGs based on AST nodes in Eqlog, without involvement of the host program, will now look something like this:

```
// Types representing AST nodes.
type FunctionAST;
type BasicBlock;
pred basic_block_in_function(block: BasicBlick, function: FunctionAST);
func function_cfg(function: FunctionAST) -> CFG;
rule function_cfg_total {
if function: FunctionAST;
then function_cfg(function)!;
}
func basic_block_node(block: BasicBlock, function: Function) -> Node;
rule basic_block_node_total {
if basic_block_in_function(block, function);
then node := basic_block_node(block, function)!;
then node_onwer(node) = function;
}
...
```

This accomplishes the goal of moving more logic from the host program
into the Eqlog program, but at a cost: We’ve had to clutter our original
code with mentions of a `CFG`

element everywhere. Our Eqlog
program also got more brittle, since it relies on the unenforced
convention that we never create an edge between nodes of different
CFGs.

Finally, evaluation of the rules governing the `reachable`

predicate will now likely be slower than with the original approach. In
terms of runtime behavior, what we’ve done as we went from the first to
the second solution is that we’ve combined many small graphs into a
single large graph with many isolated subgraphs, and now we compute
reachability in the larger graph.

The runtime of reachability computation is dominated by join computations on the respective edge tables. And while the total number of joins required is roughly the same with both approaches, each join will be more costly with the second approach than with the first one because the edge table is much larger.

Is there way to retain the advantages of our first CFG program while making it compose well into larger Eqlog programs like our second version? With dependent Datalog models, which I describe in the rest of this post, the answer is yes.

The `model`

keyword is used to designate a fragment of an
Eqlog program for future instantiation:

```
model CFG {
// Same code as in the original CFG program, but now the
// definitions are *members* of the CFG model.
type Node;
pred edge(from: Node, to: Node);
pred reachable(node: Node);
func entry_node(): Node;
rule entry_reachable {
if n = entry_node();
then reachable(n);
}
rule reachable_step {
if reachable(n);
if edge(n, m);
then reachable(m);
}
}
```

Each `model`

definition introduces a type (here:
`CFG`

) whose elements represent separate instances of the
Eqlog program in the model definition body. Similarly to classes in
object oriented orientation, we’ll refer to the definitions in the body
as *members* (here:
`Node, edge, reachable, entry_node`

).

Members are only accessible via an element of the model type.
Whenever an expression `t`

has model type, we expect a way to
access members of the model instance associated to `t`

.
Following the typical convention, I think the dot character is
appropriate here, so that `t.<member>`

refers to the
member.

With the `CFG`

model definition above, this allows us to
write code like this:

```
type BasicBlock;
func basic_block_node(cfg: CFG, bb: BasicBlock) -> cfg.Node;
func node_basic_block(cfg: CFG, node: cfg.Node) -> BasicBlock;
rule node_basic_block_inverse {
if cfg: CFG;
if n_0: cfg.Node;
if bb = node_basic_block(cfg, n_0);
if n_1 = basic_block_node(cfg, bb);
then n_0 = n_1;
}
pred is_entry_block(bb: BasicBlock);
rule cfg_entry_block {
if n = basic_block_node(cfg, bb);
if is_entry_block(bb);
then cfg.entry_block() = n;
}
pred basic_block_reachable(bb: BasicBlock);
rule cfg_reachability {
if n = basic_block_node(cfg, bb);
if cfg.reachable(n);
then reachable(bb);
}
```

This snippet defines function `basic_block_node`

that
associates a node in a given CFG to a basic block. Observe that the
result type `cfg.Node`

depends on the parameter
`cfg: CFG`

. The `node_basic_block`

function is
meant to represent an inverse function to `basic_block_node`

.
Here the type of the second argument `node: cfg.Node`

depends
on the first argument `cfg: CFG`

.

The rule `basic_block_node_inverse`

formalizes the
equality
`basic_block_node(cfg, node_basic_block(cfg, n)) = n`

. Here
the type `cfg.Node`

of `n_0`

and `n_1`

depends on the element `cfg`

that was introduced earlier in
the rule. The rules `cfg_entry_block`

and
`cfg_reachability`

demonstrate access of member functions and
member predicates, respectively.

Member types depend on the expressions they are accessed with. Since type checking requires comparing types, we now have to decide expression equality during type checking. Consider the following example:

```
rule bad_argument {
if cfg_0: CFG;
if cfg_1: CFG;
if n: cfg_0.Node;
// Type error: Expected argument of type cfg_1.Node, got cfg_0.Node.
then cfg_1.reachable(n);
}
```

This program should not type check because `cfg_1.Node`

is
potentially different from `cfg_0.Node`

. But if we added the
assumption `if cfg_0 = cfg_1`

, then we would expect the
program to type check. This means that type checking Eqlog programs now
involves deciding equality among expressions.

To enforce the surjectivity restrictions I alluded to earlier, the
Eqlog type checker already needs to decide equality of expressions in
rules, so I don’t think this would add significant additional
complexity. Note that the equality decision procedure takes only
implicit functionality rules (i.e. `x = y => f(x) = f(y)`

)
into account but not user-supplied rules in the Eqlog program, as this
would make the type checker run indefinitely on some input programs.

How should model definitions and elements of model types be represented at runtime?

**Desugaring.** One way would be to desugar models into
non-dependent Eqlog, similarly to how we turned our first CFG
implementation into a parametrized version. So the Eqlog compiler would
internally add the following definitions for each model definition:

- A global type
`M`

representing instances of the model. - For each type member, a global type
`T`

and a function`belongs_to(x: T) -> M`

. - For each member predicate, a global predicate
`p`

that takes the arguments of the member and furthermore an argument of type`M`

. Similarly for member functions. - For every member rule, a desugared global rule in which a new
argument
`m: M`

is added to all member function and member predicate applications. Quantifications`x: T`

over member types desugar to equalities`belongs_to(x) = m`

.

For the `CFG`

model example this would ultimately compile
into something like this Rust code:

```
struct CFG(u32);
struct Node(u32);
struct CompiledProgramData {
: Set<CFG>,
cfgs: Map<Node, CFG>,
node_belongs_to: Set<(CFG, Node, Node)>,
edge: Set<(CFG, Node)>,
reachable: Map<CFG, Node>,
entry_node...
}
```

**Maps to model instances.** Alternatively, we could
first compile the model body in isolation, which would give us a data
structure representing just one instance of the model:

```
struct Node(u32);
struct CFGMembers {
: Set<(Node, Node)>,
edge: Set<Node>,
reachable: Option<Node>,
entry_node}
```

We would then use this data structure to represent instances in the ambient Eqlog program:

```
struct CFG(u32);
struct CompiledProgramData {
: Map<CFG, CFGMembers>,
cfgs...
}
```

This second representation leads to faster evaluation of member rules, since those can be evaluated in isolation for each instance. But it also takes up more memory if the members of model instances are mostly empty, and it will probably make it more costly to access members when evaluating rules of the ambient program involving those members.

Eqlog and other Datalog engines often maintain indices on relations, i.e. multiple representations of the same data that are suitable for computing different types of joins efficiently. So perhaps that’s also the way to go here: Simply use both representations to improve runtime performance at the cost of memory usage.

Originally I intended to also write at length about *morphisms*
between instances, but decided to postpone a proper discussion to some
other time because I haven’t worked out some relevant details. I do want
to outline the rough idea though, which is to add a built-in type of
morphisms between two elements of the same model. Semantically,
morphisms should represent maps of data in the domain instance to data
in the codomain instance. So if the model formalizes graphs, then
morphisms would be edge-preserving maps of nodes, and for groups you’d
get group morphisms.

The point of having a built-in notion of morphism is that the Datalog
engine can implement those morphisms efficiently via *shallow*
instead of deep copies of tables in the domain instance. For example, if
there’s a morphisms `f : x -> y`

and no other rules
asserting facts about the codomain instance `y`

, then tables
in `y`

can be represented simply as pointers to the tables of
`x`

. And if we use persistent
insertion operations into tables, then we can share chunks of tables
among `x`

and `y`

even if the tables in
`y`

have some more entries.

The concrete problem this is supposed to solve is the maintenance of
*scopes* when implementing programming languages in Eqlog. A
scope captures the set of variables that are accessible at a given AST
node and possibly its type. You could represent scopes like so:

`func var_type(node: ASTNode, var: Identifier) -> Type;`

The problem with this is that the program has to deal with a large
number of AST nodes, and so the table representing `var_type`

will be prohibitively large.

Since many nodes share the same scope, we can improve this by
introducing a `Scope`

model:

```
model Scope {
func var_type(var: Identifier) -> Type;
}
func scope(node: ASTNode) -> Scope;
```

Now we can at least express when two nodes have the same scope. For example, expressions (as opposed to statements) can usually not introduce new variables, so you can add a rule like this:

```
rule {
if is_child_expression(child, parent);
then scope(child) = scope(parent);
}
```

This way, there’s only one `var_type`

table for both
`child`

and `parent`

.

But a problem remains: Every variable declaration, for example by a
`let`

statement, increases the `var_type`

table of
the associated scope. We could implement this like so:

```
pred scope_extension(smaller: Scope: larger: Scope);
rule scope_extension_var_type {
if scope_extension(s_1, s_2);
if s_1.var_type(var) = ty;
then s_2.var_type(var) = ty;
}
rule {
if second_stmt = successor_statement(first_stmt);
if stmt_introduces_variable(first_stmt, var, ty);
then scope(second_stmt).var_type(var) = typ
then scope_extension(scope(first_stmt), scope(second_stmt));
}
```

But this would mean that whenever a new variable is introduced, we
take a deep copy of the current scope and insert the new variable into
it. For a sequence of *n*
statements that each introduce a new variable, this would result in
𝒪(*n*^{2}) memory
usage.

With morphisms, we could express that one scope is an extension of
another. If the compiler implements morphisms via shallow copies, then
we’d cut the memory required for storing scope tables down to 𝒪(*n*⋅log(*n*)), which is
probably the best you can hope for in a language without mutation.