Sometimes, writing a large program is conceptually as simple as translating from a big unstructured input into a more and more structured output. A good example is a single-pass compiler:

  • First we perform lexical analysis on a big string, this gives us a list of tokens.
  • Then we parse the tokens into a tree.
  • Then we do some semantic analysis on the tree, usually this involves some typechecking and error reporting
  • Once we know what the program is, we generate code for it.

Introducing the pipeline

Pipeline illustration

This structure is purely linear, and functional programmers will be delighted to point out that you can implement an entire compiler pipeline using nothing but function composition:

codegen :: Sema -> String
typecheck :: Tree -> Sema
parsing :: [Token] -> Tree
lexing :: String -> [Token]

compiler :: String -> String
compiler = codegen . typecheck . parsing . lexing

Of course this is pure fantasy, there is no error reporting, no way to debug it, and no command line tool. But the core idea is compelling: what if there was a way to implement a compiler as a single pipeline of operation that clearly describes what each step achieves? For this, we’re going to define pipelined programs using Idris, because it uses dependent types, and it is reasonably fast and production ready. We define a pipeline as a list of Types, each type represents an intermediate layer.

Pipeline : Nat -> Type
Pipeline length = Vect length Type

CompilerPipeline : Pipeline ?
CompilerPipeline =
  [ String
  , List Token
  , Tree
  , Sema
  , String
  ]

With it, we define a compiler pipeline as the list of types String, List Token, Tree, Sema, and String, representing all the stages of our compiler.

The pipeline is indexed by it length, to ensure that our pipeline is valid this length needs to be at least 2, since we need a type from which we start, and a type to which we arrive. Using this fact we can define the implementation for a pipeline.

-- Well formed pipelines have at least 2 elements, the start and the end type
Impl : Pipeline (2 + n) -> Type
Impl (x :: y :: []) = x -> y
Impl (x :: y :: z :: xs) = Pair (x -> y) (Impl (y :: z :: xs))

We can test our implementation by asking in the REPL what is a valid implementation of out compiler pipeline.

> Impl CompilerPipeline
< (String -> List Token, (List Token -> Tree, (Tree -> Sema, Sema -> String)))

This is a 4-tuple of functions String -> List Token, List Token -> Tree, Tree -> Sema and Sema -> String, exactly what we wanted, each of those function represents a stage in our compiler.

Now that Impl correctly gives us the type of an appropriate implementation for a pipeline, we still need to run the pipeline. Running the pipeline should amount to using function composition on each of the functions. Unfortunately it’s not that easy because each intermediate step uses a different type and we can’t just blindly apply a function without knowing its type.

Like before, to run a Pipeline, it needs at least two elements. Running a pipeline should result in a simple function from the first layer of the pipeline to the last one. To write this we use the function head and last from Data.Vect those functions do exactly what you expect from their name but here we call them to define the type of the function that will result from running the pipeline.

-- When we run the pipeline `p` we expect to get out a function `head p -> last `p`,
-- that is, the first stage of the pipeline as the argument of the function and the
-- last stage as the result of it.
Run : (p : Pipeline (2 + n)) -> Impl p -> head p -> last p
-- In the base case, the pipeline contains only one stage x -> y so we return it
Run [x, y] f = f
-- In the inductive case, we compose `f` with the rest of the pipeline
Run (s :: t :: x :: xs) (f, cont) = Run (t :: x :: xs) cont . f

The base case runs the single stage we have with the argument we’re given. The inductive case runs the remainder of the pipeline after running the current stage represented by the function f.

We can test this idea by assuming we have functions lex, parse, typecheck and codegen like in the haskell example and see what happens:

lex : String -> List Token
parse : List Token -> Tree
typecheck : Tree -> Sema
codegen : Sema -> String

runCompiler : String -> String
runCompiler = Run CompilerPipeline (lex, parse, typecheck, codegen)

And it typechecks! We cannot run this code because we don’t have an implementation for lex, parse, etc, but we’ve at least reached our goal: to build a datastructure that help us keep track of what are the stages in our pipeline.

Handling effects

This is quite cool, but it’s not the end. Remember that compiler need to perform a lot of side effects, returning errors, sometimes print out intermediate trees for debugging, how does the pipeline help for that?

Well the great benefit from the pipeline is that we’ve separated the information about the stages from the information about the runtime implementation of the stages. This information is created by Impl which creates the type of an implementation given a pipeline, but this current version only places the types end-to-end without changes. What if we add an extra Monad around the types such that instead of functions a -> b, each stage is now an effectful function a -> m b.

First we define m : Type -> Type, our monad, we use a parameter block because we will use it for running our programs too.

parameters {0 m : Type -> Type}

Then, in the implementation, we use the same type signature, but in the base-case, instead of returning x -> y, we return x -> m y. And similarly with the inductive case.

  0
  ImplM : Pipeline (2 + n) -> Type
  ImplM [x, y] = x -> m y
  ImplM (x :: y :: z :: xs) = Pair (x -> m y) (ImplM (y :: z :: xs))

Like before, we now test ImplM in the repl with a monad Maybe and we get:

> ImplM {m = Maybe} CompilerPipeline
< (String -> Maybe (List Token), (List Token -> Maybe Tree, (Tree -> Maybe Sema, Sema -> Maybe String)))

Now every stage runs in a Maybe monad, and we could replace this monad by anything else and obtain all sorts of effects. It remains to run a pipeline with effects using our effectful implementation. We implement a function just like Run but we make use of kleisli composition instead of function composition.

  RunM : (mon : Monad m) => (p : Pipeline (2 + n)) -> ImplM p -> Vect.head p -> m (Vect.last p)
  RunM [x, y]  f = f
  RunM (x :: y :: z :: xs) (f, c) = f >=> RunM (y :: z :: xs) c

To run the compiler, each stage needs to perform an effect, here we’re using Either String to represent errors:

namespace EffectfulCompiler
  lex : String -> Either String (List Token)
  parse : List Token -> Either String Tree
  typecheck : Tree -> Either String Sema
  codegen : Sema -> Either String String

  runCompiler : String -> Either String String
  runCompiler = RunM CompilerPipeline (lex, parse, typecheck, codegen)

Crucially, we’ve not changed the pipeline at all! We’ve only changed how to interpret it as a runtime program.

Debugging programs

Finally, using the same datastructure we can run our pipeline in debug mode. This mode will print in the terminal all the intermediate steps so that you can see what is happening at each stage. For this to work, we need to make sure that, for all layers in the pipeline, we have a way to print the result of the computation, we achieve this with the All : (ty -> Type) -> Vect n ty -> Type type which takes a list of elements and a predicate over that list, and describes a list of proofs that ensure the predicate hold for every value of ty given. In our case, the predicate given is the Show interface, and so for each type there must exist a Show instance.

The rest of the function is the same as before, except we compose each f with pure . traceValBy show to print the value we just computed.

RunTraceM : Monad m => (p : Pipeline (2 + n)) -> (print : All Show p) => ImplM {m} p -> Vect.head p -> m (Vect.last p)
RunTraceM [x, y] {print = [p1, p2]} f = f >=> pure . traceValBy show
RunTraceM (x :: y :: z :: xs) {print = p :: q :: ps} (f, c) =
  f >=> pure . traceValBy show >=> RunTraceM (y :: z :: xs) c

Calling our pipeline with RunTraceM will print out every intermediate value produced by the pipeline, It won’t print the first argument given to it but since we already have it, we can print it when we call the pipeline.

Conclusion

There is much more to say about this but is already more than we can usually do without dependent types. We can take the pipeline and build multiple runtimes for it, and run it in multiple modes as well. I’ve not shown it here but we can also implement operations on the pipeline like concatenation, splicing, etc. Those operations should reflect what happens at runtime: Concatenating two pipeline should compose two programs that run them. This framework can be extended in many other ways but that will serve as a solid base for now, see you in the next one.

The code is available as a library: https://gitlab.com/glaive-research/pipelines.