The Guts of a Spineless Machine

Posted on October 28, 2014
Tags: haskell, c

It’s fairly well known that Haskell is a bit um.. different from how stock hardware sees the world. I’m not aware of too many processors that have decided that immutability and higher order functions are the right way to go.

Compiling Haskell and its ilk, however, does have one interesting wrinkle on top of the normal problem: laziness. Laziness stands completely at odds with how most everything else works. Moreover, whether or not you think it’s the right default, it’s an interesting question of how to efficiently compile some evaluation strategy other than call by value or name.

To this end, people have built a lot of abstract machines that lazy languages could target. These machines can be mapped easily to what the hardware wants and transitively, we can get our compiler. Most of these work by “graph reduction” (that’s the G in STG) and the latest incarnation of these graph machines is the spineless tagless graph machine which lies at the heart of GHC and a few other compilers.

In this post, I’d like to go over how exactly the STG machine actually works. Turns out it’s pretty cool!

Core Concepts

The basic idea behind a compiler intent on going the STG route is something like

  1. .. front end stuff ..
  2. Translate IL to STG language
  3. Compile STG language to C/ASM/LLVM/Javascript

In GHC case I understand the pipeline is something like

  1. Parsing
  2. Type checking
  3. Desugaring + a few bobs and bits
  4. Translation to core
  5. Lion share of optimization
  6. Translation to STG language
  7. STG language to C–
  8. C– to assembly or llvm

We’re really concerned with parts 6 and 7 here. First things first, let’s lay out what’s exactly in the STG language. It’s a tiny functional language that looks a bit like Haskell or Core, with a few restrictions. A program is simply a series of bindings, much like Haskell. The top levels look something like

f = {x y z} flag {a b c} -> ...

You should read this for now as f = \a b c -> .... The first set of variables and the flag correspond to some stuff we’ll discuss later.

Inside the ... we can write most of what you would expect from Haskell. We have let[rec] bindings, case expressions, application, constructors, literals, and primitives. There is a caveat though: first off all, constructor applications must be fully saturated. This isn’t unlike OCaml or something where you can’t just treat a constructor as a function with an arbitrary name. We would write

\a -> Just a

instead of just Just. Another bit of trickiness: our language has no lambdas! So we can’t even write the above. Instead if we had something like

 map Just [1, 2, 3]

We’d have to write

 let f   = \a -> Just a
     l'' = 3 : nil
     l'  = 2 : l''
     l   = 1 : l'
 in map f l

The reason for the awkward l'' series is that we’re only allowed to apply constructors and functions to atoms (literals and variables).

One other noteworthy feature of STG is that we have primitive operations. They need to be fully saturated, just like constructors, but they work across unboxed things. For example there would probably be something like +# which adds to unboxed integers. To work with these we also have unboxed literals, 1#, 2#, so on and so on.

Now, despite all these limitations placed on STG, it’s still a pretty stinking high level language. There’s letrec, higher order functions, a lot of the normal stuff we’d expect in a functional language. This means it’s not actually to hard to compile something like Haskell or Core to STG (I didn’t say “compile efficiently”).

As an example, let’s look at translating factorial into STG language. We start with

f :: Int -> Int
f i = case i of
  0 -> 1
  i -> i * (f (i - 1))

Now the first step is we change the binding form

f = {} n {i} -> ...

The case expressions clause can remain the same, we’re already casing on an atom

case i of
  (MkInt# i#) -> ...

Now comes the first big change, our boxed integers are going to get in the way here, so the case expression strips away the constructor leaving us with an unboxed integer. We can similarly refactor the body to make evaluation order explicit

 case i of
   MkInt i# ->
     case i# -# 1# of
       dec# ->
         let dec = \{dec#} u {} -> MkInt dec#
         in case fact dec of
              MkInt rest# ->
                case i# * rest# of
                  result# -> MkInt result#

Notice how the case expressions here are used to make the evaluation of various expressions explicit and let was used to create a new thing to evaluate.

Now we can see what those extra {}’s were for. They notate the free variables for a thunk. Remember how we can have all sorts of closures and it can make for some really nice code? Well the machine doesn’t exactly support those naively. What we need to do and note the variables that we close over explicitly and then generate code that will store these free variables with the value that closes over them. This pair is more or less what is called a “closure” for the rest of this post.

Actually, I’ll sometimes use “thunk” as another descriptor for this pair. This is because closures in STG land do quite a lot! In particular, they are used to represent the fundamental unit of lazy code, not just closing over variables. We’ll have closures that actually don’t close over anything! This would be a bit strange, but each “thunk” in Haskell land is going to become a closure in STG-ville. The notion of forcing a thunk in Haskell is analogous to evaluating an STG closure and creating a thunk is creating a new closure. This is helpful to keep in mind as we examine the rest of the machine.

dec for example has a free variable dec# and it exists to box that result for the recursive call to factorial. We use case expressions to get evaluation. Most programs thus become chains of case’s and let alternating between creating thunks and actually doing work.

That u in between the {}’s in dec was also important. It’s the update flag. Remember how in Haskell we don’t want to force the same thunk twice. If I say

let i = 1 + 1 in i + i

We should only evaluate 1 + 1 once. That means that the thunk i will have to be mutated to not evaluate 1 + 1 twice. The update flag signifies the difference between thunks that we want to update and thunks that we don’t. For example, if we replaced the thunk for + with the first result it returned, we’d be mighty surprised. Suddenly 1 + 1 + 1 is just 2!

The u flag says “yes, I’m just a normal expression that should be updated” and the n flag says the opposite.

That about wraps up our discussion of the STG language, let’s talk about how to implement it now.

Semantics

This language wouldn’t be much good if it didn’t lend itself to an easy implementation, indeed we find that the restrictions we placed upon the language prove to be invaluable for its compilation (almost like they were designed that way!).

In order to decide how best to implement it, we first define the formal semantics for our language, which operates on a tuple of 6 things:

  1. The code - the instruction we’re currently executing
  2. The argument stack - A stack of integers or pointers to closures
  3. The return stack - A stack of continuations
  4. The update stack - A stack of update frames
  5. The heap - A map from addresses to closures
  6. The environment - A map from names to addresses of toplevel closures

A code is more or less the current thing we’re attempting to do. It’s either

  1. Eval e p - evaluate an expression in an environment (p)
  2. Enter a - Enter a closure
  3. ReturnCon c ws - Return a constructor applied to some arguments
  4. ReturnInt - Return an integer

Now the idea is we’re going to “unroll” our computations into pushing things onto the continuation stack and entering closures. We start with the code Eval main {}. That is to say, we start by running main. Then if we’re looking at a case we do something really clever

 EVAL(case expr of {pat1 -> expr1; ...}, p) as rs us h o

becomes

EVAL (expr, p) as ({pat1 -> expr1; ...} : rs) us h o

That is to say, we just push the pattern matching on to the continuation stack and evaluate the expression.

At some point we’ll get to a “leaf” in our expression. That is random literal (a number) or constructor. At this point we make use of our continuation stack

EVAL (C ws, p) as ((...; c vs -> expr; ...) : rs) us h o
ReturnCon (C ws) as ((...; c vs -> expr; ...) : rs) us h o
EVAL (expr, p[vs -> ws]) as rs us h o

So our pattern matching is rolled into ReturnCon. ReturnCon will just look on top of the return stack looking for a continuation which wants its constructor and evaluate its expression, mapping the constructor’s variables to the pattern’s variables.

The story is similar for literals

EVAL (Int i, p) as ((...; c vs -> expr; ...) : rs) us h o
ReturnInt i as ((...; i -> expr; ...) : rs) us h o
EVAL (expr, p) as rs us h o

Another phase is how we handle let’s and letrec’s. In this phase instead of dealing with continuations, we allocate more thunks onto the heap.

EVAL ((let x = {fs} f {xs} -> e; ... in expr), p) as rs us h o
EVAL e p' as us h' o

So as we’d expect, evaluating a let expression does indeed go and evaluate the body of the let expression, but changes up the environment in which we evaluate them. We have

p' = p[x -> Addr a, ...]
h' = h[a -> ({fs} f {xs} -> e) p fs, ...]

In words “the new environment contains a binding for x to some address a. The heap is extended with an address a with a closure {fs} f {xs} -> ... where the free variables come from p”. The definition for letrec is identical except the free variables come from p' allowing for recursion.

So the STG machine allocates things in lets, adds continuations with case, and jumps to continuation on values.

Now we also have to figure out applications.

EVAL (f xs, p) as rs us h o
ENTER a (values of xs ++ as) rs us h o

where the value of f is Addr a. So we push all the arguments (remember they’re atoms and therefore trivial to evaluate) on to the argument stack and enter the closure of the function.

How do we actually enter a closure? Well we know that our closures are of the form

({fs} f {vs} -> expr) frees

If we have enough arguments to run the closure (length vs > length of argument stack), then we can just EVAL expr [vs -> take (length vs) as, fs -> frees]. This might not be the case in something like Haskell though, we have partial application. So what do we do in this case?

What we want is to somehow get something that’s our closure but also knows about however many arguments we actually supplied it. Something like

({fs ++ supplied} f {notSupplied} -> expr) frees ++ as

where supplied ++ notSupplied = vs. This updating of a closure is half of what our update stack us is for. The other case is when we do actually enter the closure, but f = u so we’re going to want to update it. If this is the case we add an update from to the stack (as, rs, a) where as is the argument stack, rs is the return stack, and a is the closure which should be updated. Once we’ve pushed this frame, we promptly empty the argument stack and return stack.

We then add the following rules to the definition of ReturnCon

ReturnCon c ws {} {} (as, rs, a) : us h o
ReturnCon c ws as rs us h' o

where h' is the new heap that’s replaced our old closure at a with our new, spiffy, updated closure

h' = h[a -> ({vs} n {} -> c vs) ws]

So that’s what happens when we go to update a closure. But what about partial application?

Enter a as {} (asU, rs, aU) : us h o
Enter a (as ++ asU) rs us h' o

where

h a = ({vs} n {xs} -> expr) frees
h' = h [aU -> ((vs ++ bound) n xs -> e) (frees ++ as)]

This is a simplified rule from what’s actually used, but gives some intuition to what’s happening: we’re minting a new closure in which we use the arguments we’ve just bound and that’s what the result of our update is.

Compiling This

Now that we have some idea of how this is going to work, what does this actually become on the machine?

The original paper by SPJ suggests an “interpreter” approach to compilation. In other words, we actually almost directly map the semantics to C and call it compiled. There’s a catch though, we’d like to represent the body of closures as C functions since they’re well.. functions. However, since all we do is enter closures and jump around to things till the cows come home, it had damn well better be fast. C function calls aren’t built to be that fast. Instead the paper advocates a tiny trampolining-esque approach.

When something wants to enter a closure, it merely returns it and our main loop becomes

 while(1){cont = (*cont)();}

Which won’t stackoverflow. In reality, more underhanded tricks are applied to make the performance suck less, but for we’ll ignore such things.

In our compiled results there will be 2 stacks, not the 3 found in our abstract machine. In the first stack (A-stack) there are pointer things and the B-stack has non-pointers. This are monitored by two variables/registers SpA and SpBwhich keep track of the heights of the two stacks. Then compilation becomes reasonably straightforward.

An application pushes the arguments onto appropriate stacks, adjusts Sp*, and enters the function. A let block allocates each of the bound variables, then the body. Entering a closure simply jumps to the closure’s code pointer. This is actually quite nifty. All the work of figuring out exactly what Enter will do (updates, continuation jiggering) is left to the closure itself.

A case expression is a bit more complicated since a continuation’s representation involves boxing up the local environment for each branch. Once that’s bundled away, we represent a continuation as a simple code pointer. It is in charge of scrutinizing the argument stack and selecting an alternative and then running the appropriate code. This is a lot of work, and, unless I’m crazy, we’ll need two types of bound variables for each branch (really just ptr/non-ptr). The selection of an alternative would be represented as a C switch, letting all sorts of trickery with jump tables be done by the C compiler.

In order to return a value, we do something clever. We take a constructor and point a global variable at its constructor closure, containing its values and jump to the continuation. The continuation can then peek and poke at this global variable to bind things as needed for the alternatives. There is potentially a massive speedup by returning through registers, but this is dangerously close to work.

From here, primitive operations can be compiled to statements/instructions in whatever environment we’re targeting. In C for example we’d just use the normal + to add our unboxed integers.

The last beast to slay is updates. We represent update frames by pointers to argument stacks and a pointer to a closure. That means that the act of updating is merely saving Sp* in an update form, clobbering them, and then jumping into the appropriate closure. We push the update form onto stack B and keep on going.

I realize that this is a glancing overview and I’m eliding a lot of the tricky details, but hopefully this is sufficient to understand a bit about what’s going on at an intuitive level.

Wrap Up

So now that you’ve put all the effort to get through this post, I get to tell you it’s all lies! In reality GHC has applied all manner of tricks and hacks to get fast performance out of the STG model. To be honest I’m not sure where I should point to that explains these tricks because well… I have no idea what they are.

I can point to

If you have any suggestions for other links I’d love to add them!

Thanks Chris Ganas for proof reading

comments powered by Disqus