An evaluator for a language L is a program that takes as input an expression of L and returns an output value, possibly executing side effects along the way. In some sense the evaluator defines the meaning of the expressions of L. A compiler for L, on the other hand, is a program that takes as input an expression of L and transforms it into some other representation, typically in a lower-level language.
The most straightforward compiler would be one that simply outputs the exact sequence of steps that the evaluator takes while evaluating the expression. Most compilers, however, go a step further and optimize the output code in terms of size or speed or whatever, yielding better code than what the evaluator would have done.
The evaluator sets the standard for the language, and it is expected that the code generated by the compiler for some expression e of L will, when run, execute exactly the same side effects as the evaluator would when evaluating e and will return the same result. But with increasingly arcane optimizations, it can be hard to tell that the compiler output really is the same. How can it be ensured that the compiler is correct with respect to the evaluator?
In dependently-typed languages like Coq or Idris, it’s possible to write proofs of properties like compiler correctness. If you write a proof and it compiles, then you can be assured that the property holds, assuming the correctness of the compiler (of the language in which the proof is written). The rest of this post will develop 1) a simple language, 2) an evaluator for that language, 3) a simple stack machine, 4) a compiler for the language targeting that machine, and 5) a proof that the compiler is correct with respect to the evaluator.1 Idris is the language that will be used, because that’s the one I know.
Here’s a simple language for addition. It’s pretty boring, but notice that the
Plus expression is recursive.
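In Idris it might look like the following (the type name Expr is my choice; Const and Plus come from the text):

```idris
data Expr = Const Integer    -- a literal value
          | Plus Expr Expr   -- the sum of two subexpressions
```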
Evaluating expressions in this language is easy; just return the argument of a
Const expression, and recursively evaluate the subexpressions of an
Plus expression and add them together.
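A sketch, following that description directly:

```idris
eval : Expr -> Integer
eval (Const n)  = n
eval (Plus a b) = eval a + eval b
```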
Here’s an example expression2:
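The original expression was machine-generated (see footnote 2); any stand-in that sums to 25 will do for what follows, e.g.:

```idris
example : Expr
example = Plus (Plus (Const 5) (Const 7))
               (Plus (Const 6) (Const 7))
```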
eval example evaluates to 25, as expected3:
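At the REPL (output shown for the stand-in expression above):

```
Idris> eval example
25 : Integer
```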
Here’s a simple stack machine, along with a language for “programming” it:
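One possible encoding (the names Instr, Program, and Stack are my own; Push and Add come from the text):

```idris
Stack : Type
Stack = List Integer

data Instr = Push Integer   -- push a value onto the stack
           | Add            -- replace the top two values with their sum

Program : Type
Program = List Instr
```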
The meaning of the stack machine instructions is defined as follows. A
Push instruction adds its argument to the top of the stack. An
Add instruction does nothing if the stack holds fewer than two items; otherwise, it pops off the top two items and pushes back their sum.
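A sketch of the single-instruction semantics. The operand order in the Add case is a choice of mine: putting the earlier-pushed operand first in the sum lets the correctness proof below go through by computation alone.

```idris
exec : Instr -> Stack -> Stack
exec (Push n) s              = n :: s
exec Add      (x :: y :: s') = (y + x) :: s'  -- y was pushed before x
exec Add      s              = s              -- fewer than two items: no-op
```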
Running a program against a stack consists of executing each instruction against the stack in turn.
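That is, a fold of exec over the instruction list:

```idris
run : Program -> Stack -> Stack
run []        s = s
run (i :: is) s = run is (exec i s)
```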
Here’s a compiler that converts expressions of the addition language to programs for the stack machine:
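Along these lines: a constant becomes a Push, and a sum compiles both subexpressions and then adds the results.

```idris
comp : Expr -> Program
comp (Const n)  = [Push n]
comp (Plus a b) = comp a ++ comp b ++ [Add]
```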
Running the program generated by comp example against the empty stack leaves the stack [25], just as eval example gave 25, so evaluating and compiling that expression give the same result.4
The program up to this point could be easily converted to Haskell; indeed, parts of it are just a few characters away from being syntactically valid Haskell. But now we’re at the correctness proof, and that can only be done with dependent types. Specifically, we need to write a program to implement the following type:
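The type in question, assuming the definitions sketched above:

```idris
compCorrect : (e : Expr) -> run (comp e) [] = [eval e]
```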
compCorrect is a “function” that takes an expression
e as an argument and returns the claim that compiling and evaluating e work out to be the same. Note that the type of the return value depends on the value of the argument, hence the name dependent type.5
Proving facts like this can be tricky. The first thing to notice is that
compCorrect is an instance of a slightly more general theorem, namely that running the compilation of
e against any stack, not just the empty one, is the same as evaluating
e and pushing the result onto the stack.
Theorems like these are typically proved using induction; that is, the problem is solved by assuming some subproblems are already solved. Adding this extra parameter provides some flexibility in using various inductive hypotheses.
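A sketch of the generalized statement (I've called it compCorrect'; the text only implies a name):

```idris
compCorrect' : (e : Expr) -> (s : Stack) ->
               run (comp e) s = [eval e] ++ s
```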
The next step is case-splitting on e. The Const case is trivial, but the Plus case will take some work.
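A sketch of the split; the hole name ?goal follows the text, and the Const case is dispatched by Refl because both sides compute to n :: s:

```idris
compCorrect' (Const n)  s = Refl
compCorrect' (Plus a b) s = ?goal
```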
?goal is the demonstrandum, and its type is
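With the definitions sketched above, its type comes out roughly as:

```idris
?goal : run (comp a ++ comp b ++ [Add]) s = [eval a + eval b] ++ s
```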
An obvious first inductive hypothesis is
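In the proof this is just a recursive call on the subexpression; a sketch:

```idris
hypa : run (comp a) s = [eval a] ++ s
hypa = compCorrect' a s
```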
That is, assume that the theorem holds for the subexpression
a, that running the compilation of the subexpression
a against the initial stack
s is the same as evaluating
a and pushing the result onto
s. The same can also be assumed for running the compilation of the subexpression
b against the input stack with the result of evaluating
a on top:
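Sketched the same way:

```idris
hypb : run (comp b) ([eval a] ++ s) = [eval b] ++ [eval a] ++ s
hypb = compCorrect' b ([eval a] ++ s)
```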
At this point the proof looks like this:
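A snapshot under the assumptions above:

```idris
compCorrect' : (e : Expr) -> (s : Stack) ->
               run (comp e) s = [eval e] ++ s
compCorrect' (Const n)  s = Refl
compCorrect' (Plus a b) s =
  let hypa = compCorrect' a s
      hypb = compCorrect' b ([eval a] ++ s)
  in ?goal
```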
The problem at this point is that the types of the inductive hypotheses don't fit into the type of ?goal: ?goal deals with the whole program, while the hypotheses deal with fragments of it. This suggests a lemma for running a program partially:
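A statement and proof sketch (the name runPartial is mine); the inductive step is definitional, since run (i :: p') q-style programs just thread the stack through exec:

```idris
runPartial : (p : Program) -> (q : Program) -> (s : Stack) ->
             run (p ++ q) s = run q (run p s)
runPartial []        q s = Refl
runPartial (i :: p') q s = runPartial p' q (exec i s)
```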
That is, running program
p ++ q against stack
s is the same as running program
q against the stack that results from running program
p against s. The program on the left side of
?goal can be split into
comp a and
comp b ++ [Add].
?goal has the same left side as
runa, so it can be rewritten:
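In the let block, runa instantiates the lemma at that split, and the rewrite replaces its left side with its right side in the goal (a sketch):

```idris
compCorrect' (Plus a b) s =
  let hypa = compCorrect' a s
      hypb = compCorrect' b ([eval a] ++ s)
      runa = runPartial (comp a) (comp b ++ [Add]) s
  in rewrite runa in ?goal
```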
The new type of ?goal:
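Up to normalization it reads:

```idris
?goal : run (comp b ++ [Add]) (run (comp a) s) = [eval a + eval b] ++ s
```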
The program on the left side of
?goal can be split up further:
?goal has the same left side as
runb, so it can be rewritten:
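Adding the second instance of the lemma and rewriting with it (sketch):

```idris
compCorrect' (Plus a b) s =
  let hypa = compCorrect' a s
      hypb = compCorrect' b ([eval a] ++ s)
      runa = runPartial (comp a) (comp b ++ [Add]) s
      runb = runPartial (comp b) [Add] (run (comp a) s)
  in rewrite runa in
     rewrite runb in
     ?goal
```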
The new type of ?goal:
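Up to normalization it reads:

```idris
?goal : run [Add] (run (comp b) (run (comp a) s)) = [eval a + eval b] ++ s
```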
run (comp a) s is on the left side of
hypa, so the right side can be substituted in:
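Sketch of the next rewrite in the chain:

```idris
  in rewrite runa in
     rewrite runb in
     rewrite hypa in
     ?goal
```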
The new type of ?goal:
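Up to normalization it reads:

```idris
?goal : run [Add] (run (comp b) ([eval a] ++ s)) = [eval a + eval b] ++ s
```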
run (comp b) ([eval a] ++ s) is on the left side of
hypb, so the right side can be swapped in:
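Sketch of the final rewrite:

```idris
  in rewrite runa in
     rewrite runb in
     rewrite hypa in
     rewrite hypb in
     ?goal
```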
The type of ?goal is now [eval a + eval b] ++ s = [eval a + eval b] ++ s, which can be finished off with a simple Refl.
With all the junk variables cleaned up, the whole proof looks like this:
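One possible final form, assuming the definitions sketched earlier; the hypotheses are inlined directly into the rewrite steps, and the rewrite order matters so that each equation's left side is actually present in the goal when it fires:

```idris
runPartial : (p : Program) -> (q : Program) -> (s : Stack) ->
             run (p ++ q) s = run q (run p s)
runPartial []        q s = Refl
runPartial (i :: p') q s = runPartial p' q (exec i s)

compCorrect' : (e : Expr) -> (s : Stack) ->
               run (comp e) s = [eval e] ++ s
compCorrect' (Const n)  s = Refl
compCorrect' (Plus a b) s =
  rewrite runPartial (comp a) (comp b ++ [Add]) s in
  rewrite runPartial (comp b) [Add] (run (comp a) s) in
  rewrite compCorrect' a s in
  rewrite compCorrect' b ([eval a] ++ s) in
  Refl

compCorrect : (e : Expr) -> run (comp e) [] = [eval e]
compCorrect e = compCorrect' e []
```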
- Add multiplication to the language.
- Suppose the stack machine is designed in such a way that it’s cheaper to execute all
Add instructions together, rather than in separate blocks. Write a new compiler for the addition language to make this optimization. Prove its equivalence to the evaluator, or to the old compiler.
- Figure out Jekyll syntax highlighting for Idris.
1 I got this example from a list of induction exercises in Coq. That author in turn got it from someone else, and apparently it ultimately goes back to John McCarthy. It seems to be a pretty common toy problem; see, for instance, this post, which I found after this one was already nearly finished.
2 Writing out these expressions by hand is a pain; this example was generated with a bit of Elisp.
3 Actually, we don’t even need to “run” the code. With dependent types, it’s possible to “prove” simple test cases at compile time:
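For instance, with the stand-in expression used above:

```idris
evalTest : eval example = 25
evalTest = Refl
```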
These functions are pure, so if the input is known at compile time, the compiler can unwind the functions down as far as it needs to verify the assertion.
4 As with the evaluator, we can run this “test” at compile time:
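Again for the stand-in expression:

```idris
compTest : run (comp example) [] = [25]
compTest = Refl
```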
5 Dependent types are distinct from parameterized types of the kind found in Haskell or Java generics: there, a type can depend only on other types, whereas a dependent type, like the return type of compCorrect, can depend on a value.