Compilers – Bottom-Up Parsing: my study notes

These notes are from Week 4 of Stanford's Compilers course.

Predictive Parsing

Predictive parsing is a form of recursive-descent parsing that does not require backtracking. It decides which production to apply by looking at the next few tokens of input, and it accepts LL(k) grammars – left-to-right scan, leftmost derivation, k tokens of lookahead.

The predictive parsing algorithm takes the following steps to parse the input:

  • Eliminate left recursion and left factor the grammar: rewrite every production whose right-hand side starts with the non-terminal being defined. For example, A ::= Aa | B can be transformed into A ::= BA' and A' ::= aA' | ε (see the sketch after this list).
  • Build the predictive parsing table: a table that, for each non-terminal and lookahead token, says which production to apply.
  • Use the table to parse the input string, following the table-driven algorithm described below.
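
As a minimal sketch of the first step, this Python snippet (the function name and the grammar representation are my own, not from the course) eliminates immediate left recursion for a single non-terminal:

def eliminate_left_recursion(head, productions):
    # A ::= A a | b   becomes   A ::= b A'  and  A' ::= a A' | ε
    # Productions are lists of symbols; [] stands for ε.
    recursive = [p[1:] for p in productions if p and p[0] == head]
    others = [p for p in productions if not p or p[0] != head]
    if not recursive:
        return {head: productions}
    fresh = head + "'"
    return {
        head: [p + [fresh] for p in others],
        fresh: [p + [fresh] for p in recursive] + [[]],
    }

# A ::= Aa | B   ->   A ::= BA'  and  A' ::= aA' | ε
print(eliminate_left_recursion("A", [["A", "a"], ["B"]]))
# {'A': [['B', "A'"]], "A'": [['a', "A'"], []]}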

Given the grammar:

E ::= T + E | T
T ::= int | int * T | (E)

This grammar can be left factored as:

E ::= TX
X ::= + E | ε
T ::= (E) | int Y
Y ::= * T | ε
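
A rough sketch of left factoring itself, under the assumption that factoring out a single leading symbol is enough (it is for this grammar; the sketch uses primed names such as E' where the notes use X and Y):

from collections import defaultdict

def left_factor(head, productions):
    # A ::= a b1 | a b2   becomes   A ::= a A'  and  A' ::= b1 | b2
    # Productions are lists of symbols; [] stands for ε.
    # (Assumes at most one group of alternatives needs factoring.)
    groups = defaultdict(list)
    for p in productions:
        groups[p[0]].append(p)
    result = {head: []}
    for first, group in groups.items():
        if len(group) == 1:
            result[head].append(group[0])
        else:
            fresh = head + "'"
            result[head].append([first, fresh])
            result[fresh] = [p[1:] for p in group]
    return result

print(left_factor("E", [["T", "+", "E"], ["T"]]))
# {'E': [['T', "E'"]], "E'": [['+', 'E'], []]}
print(left_factor("T", [["int"], ["int", "*", "T"], ["(", "E", ")"]]))
# {'T': [['int', "T'"], ['(', 'E', ')']], "T'": [[], ['*', 'T']]}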

To move on to the next step, it is necessary to define two functions, First and Follow. First(X) is the set of terminals that can appear as the first symbol of a string derived from X (plus ε if X can derive the empty string). Follow(X) is the set of terminals that can appear immediately after X in some derivation from the start symbol. For example:

S ::= Sa | b

First(S) = {b}
Follow(S) = {$, a}
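
Working these out by hand for the left-factored grammar above (my own computation, not from the course notes):

First(E) = First(T) = {(, int}
First(X) = {+, ε}
First(Y) = {*, ε}

Follow(E) = Follow(X) = {$, )}
Follow(T) = Follow(Y) = {+, ), $}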

The First and Follow sets are then used to build the parsing table, which can be pictured as follows:

[Figure: Parsing Table]
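
Worked out from the First and Follow sets above, the table for this grammar comes out as (rows are non-terminals, columns are the lookahead token, blank cells mean error):

        int      *       +       (       )       $
E       TX                       TX
X                        +E              ε       ε
T       int Y                    (E)
Y                *T      ε               ε       ε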

In pseudocode, the table-driven parsing algorithm looks like this:

stack = [S, $]
repeat
  case stack of
    [non-terminal?(X) | rest]:
      if T[X, *next] = Y1...Yn then
        stack = [Y1, ..., Yn | rest]
      else
        error()
    [terminal?(t) | rest]:
      if t == *next++ then
        stack = rest
      else
        error()
until empty?(stack)
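
A minimal runnable version of the same loop in Python, using the table worked out above (the representation and names are my own, not the course's):

# LL(1) table for the left-factored grammar; values are production right-hand sides.
TABLE = {
    ("E", "("): ["T", "X"],      ("E", "int"): ["T", "X"],
    ("X", "+"): ["+", "E"],      ("X", ")"): [],      ("X", "$"): [],
    ("T", "("): ["(", "E", ")"], ("T", "int"): ["int", "Y"],
    ("Y", "*"): ["*", "T"],      ("Y", "+"): [],      ("Y", ")"): [], ("Y", "$"): [],
}
NON_TERMINALS = {"E", "X", "T", "Y"}

def parse(tokens):
    tokens = tokens + ["$"]      # append the end-of-input marker
    stack = ["E", "$"]           # start symbol on top, $ at the bottom
    pos = 0
    while stack:
        top = stack.pop(0)
        if top in NON_TERMINALS:
            rhs = TABLE.get((top, tokens[pos]))
            if rhs is None:
                return False     # no table entry: parse error
            stack = rhs + stack  # replace the non-terminal with its right-hand side
        elif top == tokens[pos]:
            pos += 1             # terminal matched: consume the token
        else:
            return False         # terminal mismatch
    return pos == len(tokens)

print(parse(["int", "*", "int", "+", "int"]))  # True
print(parse(["int", "+", "*"]))                # False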

It is important to note that the grammars of most programming languages are not LL(1). LL(1) parsing is still worth studying because it is a building block for more powerful parsing techniques.

Bottom-Up Parsing

Bottom-up parsing algorithms don't need the grammar to be left-factored. This family of algorithms starts from the string of terminal symbols and works its way back up to the start symbol.

Given the grammar:

T ::= int
T ::= int * T
E ::= T
E ::= T + E

The process of checking whether a string belongs to this grammar goes like this:

int * int + int
int * T + int
T + int
T + T
T + E
E

It's worth noting that, read from bottom to top, this trace is a rightmost derivation: at each step the rightmost non-terminal is the one expanded.

If we read the trace above from bottom to top, each step applies a production, exactly as written in the grammar. Read from top to bottom, each step is instead a reduction: a right-hand side appearing in the string is replaced by the non-terminal that produces it.

Shift-Reduce Parsing

As its name suggests, a shift-reduce parser makes two kinds of moves: Shift, which reads the next token of the input, and Reduce, which replaces the right-hand side of a production with its non-terminal. A reduction can only be applied to symbols that have already been read, i.e. already shifted. In the notation below, | marks how far the parser has read, and the Reduce example assumes a production A ::= xy.

Shift: ABC|xyz → ABCx|yz
Reduce: Cbxy|ijk → CbA|ijk

The left part of the string is usually implemented as a stack: Shift pushes the next terminal onto the stack, and Reduce pops the symbols of a production's right-hand side (RHS) off the stack and pushes the corresponding non-terminal (LHS) back on.
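
A minimal sketch of these two moves in Python, replaying by hand the sequence of actions that parses int * int + int with the grammar above (a real parser decides the moves itself, for example with the SLR rule described later; the action list and names here are my own):

# Grammar:  T ::= int | int * T    E ::= T | T + E
GRAMMAR = {
    "T": [["int"], ["int", "*", "T"]],
    "E": [["T"], ["T", "+", "E"]],
}

def shift(stack, tokens):
    stack.append(tokens.pop(0))          # move the next input token onto the stack

def reduce(stack, lhs, rhs):
    assert rhs in GRAMMAR[lhs] and stack[-len(rhs):] == rhs
    del stack[-len(rhs):]                # pop the right-hand side...
    stack.append(lhs)                    # ...and push the left-hand side

stack, tokens = [], ["int", "*", "int", "+", "int"]
moves = [
    ("shift",), ("shift",), ("shift",),
    ("reduce", "T", ["int"]),            # int * int|+ int  ->  int * T|+ int
    ("reduce", "T", ["int", "*", "T"]),  # int * T|+ int    ->  T|+ int
    ("shift",), ("shift",),
    ("reduce", "T", ["int"]),            # T + int|         ->  T + T|
    ("reduce", "E", ["T"]),              # T + T|           ->  T + E|
    ("reduce", "E", ["T", "+", "E"]),    # T + E|           ->  E|
]
for move in moves:
    if move[0] == "shift":
        shift(stack, tokens)
    else:
        reduce(stack, *move[1:])
print(stack, tokens)  # ['E'] []  ->  the whole input reduces to the start symbol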

If, in a given state, more than one action could be taken, there is either a shift-reduce or a reduce-reduce conflict. The first means it is legal to either shift or to reduce; the second means two different reductions are possible, which usually indicates a real problem with the grammar.

One piece is still missing to describe the whole algorithm: we have to define what is called a handle. We only ever want to reduce at handles, and it is important to note that a handle always appears at the top of the stack, never below it.

A handle is a reduction that also allows further reductions back to the start symbol.

The next question is how to recognize handles. There is no efficient general algorithm that works for every CFG, but there are heuristics that work for a large subset of CFGs, such as the LALR(k) grammars.

The idea behind these algorithms is that the stack only ever holds prefixes of right-hand sides of productions. Seen as a sequence of such partial right-hand sides, each prefix on the stack will eventually reduce to part of the one below it.

SLR Parsing

SLR parsing is a shift-reduce algorithm that uses the next input symbol to resolve shift-reduce conflicts. When both a shift and a reduction by a production X ::= β are possible, it reduces only if the next symbol is in Follow(X); otherwise it shifts.
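
A tiny sketch of that decision rule in Python, using the Follow sets of the bottom-up grammar above (Follow(E) = {$} and Follow(T) = {+, $}; the function name is my own):

# Follow sets for:  T ::= int | int * T    E ::= T | T + E
FOLLOW = {"E": {"$"}, "T": {"+", "$"}}

def slr_decide(lhs, lookahead):
    # On a shift/reduce conflict for a production lhs ::= rhs,
    # SLR reduces only if the lookahead is in Follow(lhs).
    return "reduce" if lookahead in FOLLOW[lhs] else "shift"

# A handle T is on top of the stack; the candidate reduction is E ::= T.
print(slr_decide("E", "+"))  # shift  -> keep building T + E
print(slr_decide("E", "$"))  # reduce -> the T completes an E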

Although more powerful, SLR still doesn't work for ambiguous grammars. To handle them, precedence declarations (such as operator precedence and associativity) are used to resolve the remaining conflicts.
