Control Flow Graphs in Lean

The Whats, the Hows, and a lot of How Nots.

Table of Contents:

Introduction
What’s a CFG
Dataflow Analysis
What Proofs?

Introduction

In the context of my internship at JetBrains, I’m working on a very particular and fun problem. My team (the Theorists) are drafting a proposal for uniqueness types in Kotlin, defining type system and semantics using a simpler core language. In parallel, another team (the Engineers) is writing the actual compiler pass to typecheck this system, implemented as static analysis over the Kotlin IR (which is, you guessed it, a control flow graph).

My task lies somewhere in the middle: I’m paid to think very hard and try to come up with a good definition of equivalence between the semantics and the dataflow approach. The goal is to derive a general theorem of soundness of dataflow analysis! Having little experience with CFGs and abstract interpretation in general, the following is a reordered collection of the notes I wrote over the month, and the decisions I took while formalizing.

The following document is provided without guarantee, but a lot of good will and pure intentions. All bad decisions are my own and do not reflect the capabilities of my company. If you have any input, comment, concern, or remark, feel free to reach out to me by Slack, email, or on bsky.

In the following writeup, I’ll be considering a small imperative language, with variable declaration and assignment, if and while constructs.

What’s a CFG

The field of static program analysis concerns itself with knowing everything there is to know about a given program without daring to execute it. For a given program p, this can manifest in multiple ways, depending on the kind of analysis put in place.

Flow-insensitive analyses are purely syntactic, meaning they do not care about control flow, and simply use the AST of the language to derive some guarantees on program behavior. While fast, this is usually an imprecise way to go about analysis; less sensitivity means coarser results, so the derived invariants are usually less refined than we would want.

Control flow graphs are to flow-sensitive analysis what ASTs are to its flow-insensitive counterpart, An AST represents the syntax of a program (every node corresponding to a language construct, forming a nested tree), while a CFG represents the program’s execution flow via basic blocks connected by runtime paths:

Definition. The Control Flow Graph (CFG) of a program p is a graph where every node represents a “code point”, aka a snapshot in the execution of p, and every edge represents program flow from one code point to the next.

These snapshots can be more or less granular depending on the necessary “closeness” of the analysis: a node can encompass anything, from a full block of statements to individual entry and exit nodes per statement or expression. Additionally, the edges can be tagged with what “choice” they represent: true/false branch for an if, loop back for the end of a while, and so on and so forth. The generality of the technique means there exist many different CFG constructions in the wild, seldom formally defined or described.

CFGs ≈? Program semantics

Recent-ish research [1] has presented the correspondence between program semantics and CFGs. In particular, this sort of graph arises as an abstraction over abstract machine semantics. A short presentation of the notion follows, though for a full discussion the reader can refer to [2].

Abstract machines are efficient constructions that model execution states and the relationships between them. An exemplary variant of this approach is the CEK machine, which defines states as triples $(C, E, K)$ , where $C$ is the current “construct” being evaluated, $E$ is the execution environment, and $K$ represents the continuation, the “next execution step” to jump to once we’re done with computing $C$ . The nature of the execution environment depends on the language being considered, as do the exact continuations.

This sort of construction differs from other semantics in that it focuses on states that describe their next step, rather than discussing the steps directly. This allows for a very straightforward formal definition of “codepoints”, which in turn makes the connection to CFG nodes evident. While any abstract machine shows this correspondence, we shall focus on CEK machines throughout this writeup because that’s what I am (most) familiar with.

Dataflow Analysis

The theoretical backing for the analysis on control flow graphs is given by lattice theory. This write-up will not be a full discussion of the math behind it, but the interested reader can refer to [3] for a more in-depth exploration.

Control Flow Graphs can allow efficient analysis of many program properties, be it in compiler optimizations or other contexts. These include but aren’t limited to, for example:

dead code detection (code that has no execution path to it);
memory mismanagement (potentially null values, use-after-free);
various compiler optimizations (constant folding).

The goal of these analyses is to obtain a relevant state answer per code point (“in this statement, variable x might be null”), taking into account every potential execution path that leads to that point. As it would be impractical to actually run every path, we have to resort to an abstract interpretation of the code.

A quick review of the theory

Consider a program p, written in a programming language manipulating concrete values from a set $C$ through basic operations $\bullet_i : C^{ar(i)} \to C$ , where $ar$ gives the arity for this operation. For our analysis, we define a set $A$ of abstract values, an abstraction function for values that takes every concrete value $c$ to an abstract value $a \in A$ , and transfer functions that compute a new abstract state based on a concrete operation to approximate and abstract operands to apply it to.

The key idea is that the abstract domain doesn’t just contain abstract values, but also “meta” states for analysis: it doesn’t just represent “values”, but “knowledge”. For example, consider the question of sign analysis: at every node, for every variable, we want to know whether it is positive, negative, or zero. We need to have states $\text{Pos}$ , $\text{Neg}$ , $\text{Zero} \in A$ , but that’s not enough. To have proper analysis responses, we also need a way to state combinations of these: if a branch yields $x < 0$ , and another $x > 0$ , the abstract state after the branching can’t and should not conclude either way, but have a state to represent “can’t conclude”.

To save years of mathematical development, we skip to the conclusion of this thought pattern: the most general way to represent this sort of thinking is using a structure of this sort:

    ℤ
 /  |  \
+   0   -
 \  |  /
    ∅

This would allow us to define a more “precise” abstract state pretty intuitively, by exploiting the inherent partial order:

    ℤ
  /   \
 ≥0   ≤0
/  \ /  \
+   0   -
 \  |  /
    ∅

Mathematicians, in their infinite wisdom, already have a name for this sort of structure: a lattice.

Definition. A lattice is a partially ordered set structured in such a way that, for every two elements, there exists a unique “closest ancestor” that is above both in the order. We call this ancestor the least upper bound, or the join, of two elements, denoted a ⊔ b.

Consider the following example program:

y = 0;
if (x == 0):
  y = y + 1
else:
  y = y - 2

In the true branch, we can easily see that we obtain the mapping y -> Pos, and in the false branch, y -> Neg. The state at the exit of the if statement should be a join of the two states: looking at our lattice diagram, the answer can be found to be y -> ℤ, simply using lattice properties.

Automating this work

This approach lends itself very nicely to automated reasoning, and automate this reasoning we will! The result we’re looking for is the result of iteratively applying, at every node $n$ , the abstract behavior described in $n$ to the “incoming” abstract state. This gives us natural equations on the flow of data depending on the incoming states operations. Solving the analysis therefore becomes solving these dataflow equations for every node. Abstractly:

while <stuff changes>:
  for every node n:
    inc n := pred(n).reduce(· ⊔ ·)
    out n := transfer n inc

Due to the potential presence of loops in the construct, changing the outgoing state at node $n_1$ may change the incoming state at node $n_2$ , so we have to argue for the termination of this algorithm. To that end, notice two facts:

the algorithm only keeps running as long as there’s state changes.
the state changes can only move “upwards” in the lattice order, because of the properties of the join operator.

This means that the state at every node follows a monotone ascending chain. If we can ensure that this lattice has finite height, then we can conclude that every node’s state must stabilize after some number of steps, terminating the algorithm. This requirement on the abstract state lattice unfortunately “limits the expressiveness” of this sort of technique: for some kinds of analysis, we might want lattices of non-finite height, and clamping them to a finite amount of states inherently introduces some approximations. However, finite height is a reasonable requirement for the analyses that interest us, so we won’t discuss this in detail.

Given an abstract state lattice $A$ of finite height, we can show that any finite mapping of variable names to values of $A$ is also a finite height lattice, which allows us to extend all of the reasoning on a single value to reasoning on arbitrary environments.

What kind of proofs do you get?

Before stating formal theorems, we should state formal definitions. Consider a language $L$ with proper formal semantics defined as a CEK machine.

Let p be a program in $L$ , and $A$ be an abstract domain lattice with abstraction function $\alpha$ and transfer function $\beta$ . To perform dataflow analysis on p, we first compute the control flow graph g of p, yielding a node per codepoint and a representation of every execution path. To that, we apply the abstract interpretation algorithm of your choosing with $\alpha$ and $\beta$ : we therefore obtain a dataflow result $r : \textsf{Node} \to \textsf{Var} \to A$ , giving us the final abstract state of every variable at every codepoint.

It is easy to state a soundness theorem for this procedure in natural language:

Theorem. (Dataflow Soundness, informal) An analysis pipeline on $L$ is sound if for every well-formed program of $L$ and every CEK state $\sigma$ in its execution, the result computed by the analysis at the node $n$ that corresponds to $\sigma$ is the abstraction of the state of the environment at $\sigma$ .

This statement, although simple, hides its complexity behind the load-bearing notion of “correspondence”. What is the “corresponding node”? We shall investigate that while devising a formal proof strategy for this property.

The goal

Given a language whose AST and semantics are formally defined in Lean, we want to define

an appropriate CFG format, that allows for reasonable and efficient representations of the control flow of the language’s constructs;
a builder of such control flow graphs $b : \textsf{Prog} \to CFG$ ;
a program-indexed relation between CEK states and CFG nodes, equipped with correspondence theorems proving the soundness of this relation (the specifics of which are yet to be discussed);
a framework for performing analyses: ways to define abstract domains, and abstraction and transfer functions, and to actually run analysis on CFGs;
a proof of termination of the dataflow algorithm;
a proof of correctness of the dataflow algorithm.

Once this is in place, it is feasible to prove the soundness of the dataflow analysis from the two intermediate guarantees:

CEK ≈ CFG \to CFG ≈ Dataflow \to CEK ≈ Dataflow

With that being said, we can now move on to describing what I just spent the past month working on, and how that may have been a waste of time.

`Lean` on me

The reader is expected to have some familiarity with the Lean theorem prover, as well as properties of theorem proving in general. The “introduction to theorem proving” blogpost isn’t due for another few weeks.

The CFG construction

The CFG is a very standard construction, mirroring the structure of the Kotlin IR: it is comprised of entry and exit nodes for every statement and expression in the language, allowing for a very straightforward definition of the dataflow analysis algorithm. Additionally, edges are tagged with the flow construct they abstract, either “normal” flow, true/false branching, loop back and break. For example, the following program:

var x : Nat := 1;
x = x + 2;

is transformed into the following control flow graph:

Figure 1: Verbose, sure, but optimizations are beyond the scope for now.

Node and edge kinds are implemented as ADTs to make case analysis simpler, both in code and in proofs. On this CFG construction is built the analysis framework, which uses an abstract typeclass LatticeLike to ensure that the abstract state is well-defined with respect to the properties we care about:

class LatticeLike
    (A : Type) [Max A] [Bot A] [FiniteHeight A]
    (nodeTransfer : CFGNode -> A -> A)
    (edgeTransfer : CFGEdge -> A -> A) where
  -- regular lattice structure
  join_comm : ∀ a b : A, a ⊔ b = b ⊔ a
  join_assoc : ∀ a b c : A, (a ⊔ b) ⊔ c = a ⊔ (b ⊔ c)
  join_idem : ∀ a : A, a ⊔ a = a
  bot_le : ∀ a : A, a ⊔ ⊥ = a

  -- edge and node monotoncity
  node_mono : ∀ n, monotoneD (nodeTransfer n)
  edge_mono : ∀ e, monotoneD (edgeTransfer e)

With this framework, it’s possible to show that the dataflow algorithm terminates properly, and that it computes a proper fixpoint (i.e. that it finds the correct final abstract state).

Abstractly formulating correspondence

The formulation of the correspondence is the really tricky part in this formal proof, and what I suspect to be the portion of this project that’s in a difficult spot. As stated earlier, this notion is fundamental to state dataflow soundness, as we require a state and a node to be related in order to establish proper abstraction. We therefore need sensical and natural requirements on such a relation, that are both reasonable to prove for regular relations and tight enough to help us prove soundness later down the line. To that end, I chose to define a general relation type, and a typeclass to characterize all “appropriate” relations:

def StateRel := CEK -> CFGNode -> Prop

class TranslationReq (s : Lang .Stmt) (R : StateRel) : Prop where
  init_related : R (initState s) (stmtCFG s).entry
  terminal_related : ∀ E, R (terminalState E) (stmtCFG s).exit

  init_uniq : ∀ {n}, R (initState s) n -> n = (stmtCFG s).entry

  step_sound :
    ∀ {σ σ' n}, R σ n -> Eval σ σ' ->
      ∃ n', R σ' n' ∧ CFGReach (stmtCFG s) n n'

  step_complete :
    ∀ {σ σ' n'}, Eval σ σ' -> R σ' n' ->
      ∃ n, R σ n /\ CFGReach (stmtCFG s) n n'

We state that, for a program s, a relation R between execution states and CFG nodes is “sensible” if:

the initial state and the entry node are R-related;
the terminal state and the exit node are R-related;
for any states σ, σ' and node n, if σ and n are R-related and σ steps to σ', then there’s a node n' that’s R-related to σ' that’s reachable from n.
for any states σ, σ' and node n', if σ' and n' are R-related and σ steps to σ', then there’s a node n that’s R-related to σ', from which n' is reachable.

This feels like a reasonable approach, modulo some design questions which felt minor at the time (does this need to be indexed by the program? what would be the implications of formulating this over a general graph instead?). With this correspondence stated, we can move on to formulating the other building block for our proof, a framework of requirements on the analysis statement: given an abstract, lattice-like domain A, transfer functions for nodes and edges and an abstraction relation, we define the following requirements:

class Framework
    (A : Type) [Bot A] [Max A] [FiniteHeight A]
    (edgeTransfer : CFGEdge -> A -> A)
    (nodeTransfer : CFGNode -> A -> A)
    (s : Lang .Stmt) (R : StateRel)
    (abs : CEK -> A -> Prop)
    extends LatticeLike A nodeTransfer edgeTransfer where

  abs_mono : ∀ {σ : CEK} {a b : A}, a ⊔ b = b -> abs σ a -> abs σ b
  abs_preserves_init : ∀ {a : A}, abs (initState s) a ->
    abs (initState s) (nodeTransfer (stmtCFG s).entry a)
  step_sound : ∀ {σ σ' : CEK} {n n' : CFGNode} {entryInit : A},
    R σ n -> R σ' n' -> Eval σ σ' -> CFGReach (stmtCFG s) n n' ->
    ∀ (f : fact A), IsForwardPostFixpoint (stmtCFG s) nodeTransfer edgeTransfer entryInit f ->
    abs σ (f n) -> abs σ' (f n')

This encodes the requirements that:

The abstract domain A forms a lattice according to the LatticeLike requirements.
The abstraction relation abs is monotonic.
The abstraction is preserved at the initial node.
The transfer functions are sound: given a sound abstract state and a concrete Eval step, the next abstract state computed by the analysis soundly approximates the next concrete state.

From this, we can make sense of the statement of the main soundness proof.

Assuming the existence of a sensical relation R that allows us to interpret the CFG construction as an abstraction of the program’s semantics (an instance of TranslationReq s R), and the definition of a lattice-like, finite height abstract domain A, then it is the case that:

theorem dataflow_soundness {s : Lang .Stmt}
      {A : Type} [Bot A] [Max A] [DecidableEq A] [FiniteHeight A]
      {edgeTransfer : CFGEdge -> A -> A} {nodeTransfer : CFGNode -> A -> A}
      {abs : CEK -> A -> Prop} {R : StateRel} [tr : TranslationReq s R]
      [fr : Framework A edgeTransfer nodeTransfer s R abs]
      (entryInit : A) (out0 : fact A)
      (wl0 : List CFGNode) (hwl0 : ∀ x ∈ wl0, x ∈ (stmtCFG s).nodes)
      (h_inv0 : ∀ m, m ∉ wl0 ->
          nodeTransfer m (expectedIn (stmtCFG s) edgeTransfer entryInit out0 m) ⊔ out0 m = out0 m)
      (hinit_sound : abs (initState s) entryInit)
      (σ : CEK) (n : CFGNode) (hreach : CEKReach (initState s) σ) (hrel : R σ n) :
    let res := worklistForwardEdge (stmtCFG s) nodeTransfer edgeTransfer entryInit out0 wl0 hwl0
    abs σ (res n)

The analysis result res n at node n is a sound over-approximation of the concrete state σ (i.e., abs σ (res n) holds), provided that σ is reachable, and related to n via the translation relation R.

This characterisation of the dataflow soundness theorem makes sense, and it does allow for a complete proof: should a relation between CEK states and CFG nodes satisfy the TranslationReq requirements, and the analysis you’re carrying through satisfies the Framework requirements, then you can prove that the dataflow analysis carried by the code is sound. That’s great!

Why it’s not great

The abstract formulation of correspondence was a great way to formulate the theoretical requirements, and to reason about the problem on paper. However, it proved hard to establish in practice.

I tried defining a correspondence between the CFG and the semantics as a simple inductive relation, with four constructors:

inductive cfgcekRel (s : Lang .Stmt) : StateRel where
| exprEntry (e : Lang .Expr) (E : Environment)
    (J : JStackCtx) (K : List Cont) (n ex : CFGNode) :
    n.kind = .exprEntry e ->
    ... ->
    cfgcekRel s ⟨.sourceExpr e, E, J, K⟩ n
| exprExit
| stmtEntry
| stmtExit

Roughly, a CEK state σ is related to a node n when the kind of n matches the control component of σ. While initially meant to be a very lightweight relation, it soon became very clear that to derive the necessary TranslationReq proofs, more invariants needed to be carried by each constructor to certify well-formedness. These include (and hopefully are limited to):

Expr- and StmtEntryEdgeInv, two inductive relations that store all mandatory outgoing edges from that node, with edges, kinds and destinations. For example, for binary operators Binop e₁ e₂ op, we log that there must be an edge to e₁.entry, but also an edge e₁.exit -> e₂.entry and an edge e₂.exit -> parent.exit.
ContCFGInv, an invariant relating a continuation stack K to the remaining path from the current node to the graph’s exit. Each frame in the stack corresponds to a constructor recording the relevant CFG edges and nodes:
```
inductive ContCFGInv (g : CFG) : List Cont -> List CFGNode -> CFGNode -> Prop where
| halt : x = g.exit -> ContCFGInv g [] breakTargets x
| step : CFGStep g x y ->
  ContCFGInv g K breakTargets y ->
  ContCFGInv g K breakTargets x
| ... -- one constructor per continuation frame kind
```
The step constructor is the only one that does not correspond to a continuation frame, but it acts as a “meta” rule that lets us skip over intermediate CFG nodes that have no semantic content (purely structural transitions).
JCFGInv relates the jump context (saved break states) to the list of break target nodes carried through the builder. Each entry in J stores a saved continuation K, and JCFGInv asserts that this is a valid K (using ContCFGInv) at the corresponding while exit node.

This results in a veeeeeery big definition for every necessary inductive relation we require, but a correct proof of most of the requirements on TranslationReq. Unfortunately, that most is loadbearing and bothersome. When this definition was drafted, the step_complete requirement wasn’t yet in TranslationReq, and the proof of step_sound required adding the “step” constructor in ContCFGInv in order to properly account for the purely structural transitions. However, that rule means that step_complete no longer holds: it is the case that the correspondence relation is, on paper at least, no longer functional. We can easily prove a weaker version of soundness:

theorem dataflow_soundness_exists {s : Lang .Stmt}
    {A : Type} [Bot A] [Max A] [DecidableEq A] [FiniteHeight A]
    {edgeTransfer : CFGEdge -> A -> A} {nodeTransfer : CFGNode -> A -> A}
    {abs : CEK -> A -> Prop} {R : StateRel} [tr : TranslationReq s R]
    [fr : Framework A edgeTransfer nodeTransfer s R abs]
    (entryInit : A) (out0 : fact A)
    (wl0 : List CFGNode) (hwl0 : ∀ x ∈ wl0, x ∈ (stmtCFG s).nodes)
    (h_inv0 : ∀ m, m ∉ wl0 ->
        nodeTransfer m (expectedIn (stmtCFG s) edgeTransfer entryInit out0 m) ⊔ out0 m = out0 m)
    (hinit_sound : abs (initState s) entryInit)
    (σ : CEK) (hreach : CEKReach (initState s) σ) :
    let res := worklistForwardEdge (stmtCFG s) nodeTransfer edgeTransfer entryInit out0 wl0 hwl0
    ∃ n, R σ n ∧ abs σ (res n)

This changes the statement, from the universal statement:

for every reachable execution state σ, for every n, if σ and n are R-related, then the dataflow result at n abstracts σ.

to a more narrow existential:

for every reachable execution state σ, there exists an n that is R-related to it where the dataflow result abstracts σ.

The problem is now clear: the current proofs of soundness proceed by induction on the reachability derivation, and the inductive case for the universal formulation requires deriving the backwards step through step_complete, which, in turn, requires a functional relation to be derived. And cfgcekRel cannot be functional as it stands.

This does raise a few questions about the next steps, based on the degrees of liberty that are available to the project right now, still in its (admittedly relative) infancy:

Let’s assume the abstract definitions for TranslationReq and Framework are sound: they represent actual requirements on the CFG construction and analysis procedure for the process to yield a correct result. Is there a way to build a functional correspondence relation between the semantics and this CFG formulation to achieve a proof? If not, is there a way to state a relaxed functionality requirement to properly encode the “meta” status of the step constructor?
If these definitions aren’t sound, is there a way to relax them to remove the requirement of functionality, so that we can use our relation directly?
Does this approach even make any sense at all? Is there some other approach to the construction or the proof architecture that could be used instead?
What’s the meaurable, appreciable difference between the existential and universal formulations of soundness? The former is clearly weaker than the latter, but what exactly does it entail?

My answer paragraph to all of the above is: I believe this approach makes some sense and I believe these abstract requirements to be sound, and in some way the step constructor feels excessively general. Intuitively, there should be a way to state that it’s only applied for certain nodes and states, but even then, that wouldn’t really fix functionality, which makes me think the answer is slightly more sophisticated.

If you have any insights, readings, or whatnot, I’d be more than happy to discuss with you. I am not going crazy yet, but I am slowly exhausting all of the related readings on the topic, and feel like this type of soundness development is a sorely underexplored area in formal methods: most discussions I could find of control flow graphs came from a compiler-engineering background, or followed a completely different approach. While I would be happy to derive a full Lean formalization of 1, I do believe it would require a large amount of work beyond the scope of what JetBrains is paying me to do. It’s definitely in mind for a master’s thesis in the future, though…

That aside, I think that covers all of the background and current status of the project, as well as the issues and next steps. If you made it this far in the post, thank you! I hope if could be of partial interest. For any question, concern, input, typo, feel free to reach out to me: I’d be glad to hear any input, even if it’s just to say this doesn’t make any sense.

Footnotes

[1] Koppel et al., 2022, Automatically Deriving Control-Flow Graph Generators from Operational Semantics.↩ ↩

[2] Guo, CPSC 509 Lecture 13: Abstract Machines. ↩

[3] Møller and Schwartzbach, Static Program Analysis. ↩