Compilers – Code Generation & Operational Semantics: my study notes

These notes are from the Week 7 of the Compilers course by Stanford. This focuses on the discussion about code generation. The architecture chosen for this class is a Stack Machine architecture on top of the MIPS architecture.

Code Generation

The MIPS processor supports the following instructions:

  • lw reg1 offset reg2: Load a 32-bit word from the address reg2 + offset into reg1
  • add reg1 reg2 reg3: reg1 = reg2 + reg3
  • sub reg1 reg2 reg3: reg1 = reg2 - reg3
  • beq reg1 reg2 label: branch to label if req1 and req2
  • sw reg1 offset(reg2): Store 32-bit word in reg1 at address reg2 + offset
  • addiu reg1 reg2 imm: reg1 = reg2 + imm. Where imm is an immediate value expressed in code. Differs from add as it doesn’t check overflows
  • li reg imm: reg = imm
  • jal label: Jump to label and save the address of the next instruction in $ra
  • jr reg: Jump to address in register

A stack machine can be represented as follows in MIPS:

li $a0 7                       # acc = 7

sw $a0 0($sp)                  # push acc
addiu $sp $sp -4

li $a0 5                       # acc = 5

lw $t1 4($sp)                  # acc = acc + top_of_stack
add $a0 $a0 $t1

pop                            # addiu $sp $sp 4

The register $a0 is the accumulator and $sp is the stack pointer. Also, it is worth noting that the stack grows from high addresses to low addresses.

We’ll look at generating code for the following language:

P    := D; P | D
D    := def id(ARGS) = E;
ARGS := id, ARGS | id
E    := int | id 
     | E1 + E2
     | E1 - E2
     | if E1 = E2 then E3 else E4
     | id(E1, ..., En) 

The code generation for an identifier can be expressed as follows:

cgen(i) = 
    emmit(li $a0 i)

Also, the code generation for the addition and subtraction can be built on top of that:

cgen(e1 + e2) =
    cgen(e1)
    emmit(sw $a0 0($sp))
    emmit(addiu $sp $sp -4)
    cgen(e2)
    emmit(lw $t1 4($sp))
    emmit(add $a0 $t1 $a0)
    emmit(addiu $sp $sp 4)

cgen(e1 - e2) =
    cgen(e1)
    emmit(sw $a0 0($sp))
    emmit(addiu $sp $sp -4)
    cgen(e2)
    emmit(lw $t1 4($sp))
    emmit(sub $a0 $t1 $a0)
    emmit(addiu $sp $sp 4)

Besides those operations, we can define an if-else expression as follows:

cgen(if e1 = e2 then e3 else e4) =
    cgen(e1)
    emmit(sw $a0 0($sp))
    emmit(addiu $sp $sp -4)
    cgen(e2)
    emmit(lw $t1 4($sp))
    emmit(addiu $sp $sp 4)
    emmit($a0 $t1 true_branch)
    emmit(false_branch:)
        cgen(e4)
        emmit(b end_if)
    emmit(true_branch:)
        cgen(e3)
    emmit(endif)

Furthermore, we have the following code for function definitions, function calls, and variable references. For this, we’ll need a new register $fp that will hold the frame pointer of the current activation.

The design chosen for the activation record, or frame, is the following:

old $fp
x1
xn
Activation Record with n variables

So next, we are going to examine how to generate code for a function call. It’s worth noting, that the following code snippet is executed on the caller-side.

cgen(f(e1, ..., en)) =
    emmit(sw $fp 0($sp))
    emmit(addiu $sp $sp -4)
    cgen(en)
    emmit(sw $fp 0($sp))
    emmit(addiu $sp $sp -4)
    ...
    cgen(e1)
    emmit(sw $fp 0($sp))
    emmit(addiu $sp $sp -4)
    emmit(jal f_entry)

Now, to generate code for the function definitions, we can do the following:

cgen(def(x1, ..., xn) = e) =
    emmit(f_entry:)
    emmit(move $fp $sp)
    emmit(addiu $sp $sp -4)
    cgen(e)
    emmit(lw $ra 4($sp))
    emmit(addiu $sp $sp z)
    emmit(lw $fp 0($sp))
    emmit(jr $ra)

Note that in the previous example we used the $ra register, which holds the value of the next address after the jal instruction. Also, the z variable is defined as 4 * n + 8, which is space enough for the n variables, the return address, and the old $fp.

Finally, the variables can be defined as an offset from the current $fp:

cgen(xi) =
    emmit(lw $a0 z($fp))

In this case, z can be defined as 4 * i.

In closing, the developers of production compilers usually try to keep as much data as possible in the registers for performance purposes, differently from the previous implementation of the stack machine.

One way to solving this is the use of temporaries, which are a pre-allocated region in the activation record that can be used to store intermediate results of operations and avoid back and forth with the stack.

Semantics

Semantics is a way to describe a programming language without the details of the low-level language the compiler is going to be written with. For example, the assembly language needs many details to work properly such as how the stack grows or how the frame is implemented.

One way to specify programs without implementation details is what is called Operational Semantics, which is basically to describe how the code would work in an abstract machine.

Leave a Reply

Your email address will not be published. Required fields are marked *