Parsing (Parser)

Regular Languages
- Fast, with many applications (check digits, valid identifiers, phone numbers, etc.)
- But rather limited, e.g., cannot recognize the balanced-parentheses language \(\{(^i)^i \mid i \ge 0\}\): (), (()), ((())), ...
  - formulas/expressions, e.g., ((1+2) & 3)
  - nested structure: if ... then ... else ... fi
- In short, a DFA/RE cannot "remember" or "count" things, which languages such as \(\{(^i)^i \mid i \ge 0\}\) require.
Parser
- Input: sequence of tokens from the lexer
- Output: parse tree of the program

Example:
- Program (Cool):
  if x = y then 1 else 2 fi
- Lexer output / parser input:
  IF ID = ID THEN INT ELSE INT FI
- Parser output:
```
     if-then-else
    /     |      \
   =     int     int
  / \
 id  id
```
| Phase  | Input                | Output           |
|--------|----------------------|------------------|
| Lexer  | String of characters | String of tokens |
| Parser | String of tokens     | Parse tree       |
Context Free Grammar
- Parser needs to distinguish between valid and invalid strings of tokens
- We need
- A lang for describing valid strings of tokens
- A method for distinguishing valid from invalid strings of tokens
- Programming languages often have recursive structures. E.g.,
  EXPR -> if EXPR then EXPR else EXPR fi
  EXPR -> while EXPR loop EXPR pool
Context Free Grammar: a natural way to express these recursive structures
- A set of Terminals \(T\)
- A set of non-terminals \(N\)
- A start symbol \(s\in N\)
- A set of productions (rules)
- \(x \rightarrow y_1 \dots y_n, (x\in N, y_i \in N \cup T\cup \{\epsilon \} )\)
Example: balanced parentheses
  S -> (S)
  S -> e
  # N = {S}, T = {(, )}, 2 productions
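Since every derivation applies S -> (S) some number of times and then S -> e, the language is exactly the strings \((^i)^i\). A quick enumeration sketch in Python (the function name `gen` is my own):

```python
# A sketch (not from the notes): enumerate the strings generated by
# S -> (S) | e up to nesting depth n. Each derivation that applies
# S -> (S) exactly i times yields "("*i + ")"*i.
def gen(n):
    return ["(" * i + ")" * i for i in range(n + 1)]

print(gen(3))  # ['', '()', '(())', '((()))']
```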
- Terminals are so called because there are no rules for replacing them
- Once generated, terminals are permanent
- Terminals are the tokens of the language
Another example:
  EXPR -> if EXPR then EXPR else EXPR fi
  EXPR -> while EXPR loop EXPR pool
  EXPR -> id
  # non-terminal: EXPR; terminals: all other symbols (if, then, else, fi, while, loop, pool, id)
  # shortcut using |:
  EXPR -> if EXPR then EXPR else EXPR fi | while EXPR loop EXPR pool | id
Some example strings of the above EXPR CFG:
  id
  if id then id else id fi
  while id loop id pool
  if while id loop id pool then id else id fi   # syntactically correct, but what does it mean semantically?
  if if id then id else id fi then id else id fi
- Another CFG example, for arithmetic:
  E -> E + E | E * E | (E) | id
  # example strings: id, id + id, (id + id) * id
Given the CFG
  S -> aXa
  X -> e | bY
  Y -> e | cXc
which of the below strings are in its language?
- abcba
- acca
- aba   # Sol: only this one
- abcbcba
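The quiz answer can be checked mechanically with a small hand-written matcher (a sketch, not from the notes; the generator-based helpers S, X, Y mirror the non-terminals and try every production):

```python
# A minimal sketch of a matcher for the CFG
#   S -> aXa,  X -> e | bY,  Y -> e | cXc
# Each helper yields every input position reachable after deriving its
# non-terminal starting at position i; a string matches if S can consume it all.
def matches(s: str) -> bool:
    def S(i):
        # S -> a X a
        if i < len(s) and s[i] == 'a':
            for j in X(i + 1):
                if j < len(s) and s[j] == 'a':
                    yield j + 1
    def X(i):
        yield i                      # X -> e
        if i < len(s) and s[i] == 'b':
            yield from Y(i + 1)      # X -> b Y
    def Y(i):
        yield i                      # Y -> e
        if i < len(s) and s[i] == 'c':
            for j in X(i + 1):       # Y -> c X c
                if j < len(s) and s[j] == 'c':
                    yield j + 1
    return any(j == len(s) for j in S(0))

print([w for w in ["abcba", "acca", "aba", "abcbcba"] if matches(w)])  # ['aba']
```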
The idea of a CFG is a big step. But:
- Membership in a language is "yes" or "no"; also need parse tree of the input
- Must handle errors gracefully
- Need an implementation of CFGs (e.g., bison, yacc, ocamlyacc)
- Form of the grammar is important
- Many grammars generate the same language
- Tools are sensitive to the grammar
Derivations
- A derivation is a sequence of productions
  S -> ... -> ... -> ...
- A derivation can be drawn as a tree
  - The start symbol is the tree's root
  - For a production X -> Y1 ... Yn, add children Y1, ..., Yn to node X
Example
- Grammar
  E -> E + E | E * E | (E) | id
- Input string
  id * id + id
- Derivation and parse tree

  # Derivation
  E -> E + E
    -> E * E + E
    -> id * E + E
    -> id * id + E
    -> id * id + id

  # Parse Tree
          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id
- A parse tree has
  - Terminals at the leaves
  - Non-terminals at the interior nodes
- An in-order traversal of the leaves is the original input
- The parse tree shows the association of operations; the input string does not
- Left-most derivation
  - At each step, replace the left-most non-terminal
- Right-most derivation
  - At each step, replace the right-most non-terminal
- Both the right-most and left-most derivations have the same parse tree
Summary
- We are not just interested in whether \(s \in L(G)\)
- We need a parse tree for s
- A derivation defines a parse tree
- But one parse tree may have many derivations (e.g., left/right-most derivation)
- Left-most and right-most derivations are important in parser implementation
Ambiguity
- Grammar
E -> E + E | E * E | (E) | id
- Input string
  id * id + id
- This string has two parse trees

  # Tree 1
          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id

  # Tree 2
          E
        / | \
       E  *  E
       |   / | \
      id  E  +  E
          |     |
          id    id
- A grammar is ambiguous if it has more than one parse tree for some string
- Equivalently, there is more than one right-most or left-most derivation for some string
- Ambiguity is BAD
- Leaves meaning of some programs ill-defined
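A small illustration of why ambiguity is bad (the example numbers are my own, not from the notes): the two parse trees of 2 * 3 + 4 evaluate to different values, so the program's meaning is ill-defined.

```python
# Evaluate a parse tree encoded as nested tuples: a tree is either an
# int (leaf) or a triple (op, left, right).
def ev(t):
    if isinstance(t, int):
        return t
    op, left, right = t
    return ev(left) + ev(right) if op == "+" else ev(left) * ev(right)

tree1 = ("+", ("*", 2, 3), 4)   # Tree 1 shape: (2 * 3) + 4
tree2 = ("*", 2, ("+", 3, 4))   # Tree 2 shape: 2 * (3 + 4)
print(ev(tree1), ev(tree2))     # 10 14
```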
Ways to handle ambiguity
- The most direct method is to rewrite the grammar to make it unambiguous. For example, the grammar below for arithmetic expressions enforces the precedence of * over +
  E  -> E' + E | E'
  E' -> id * E' | id | (E) * E' | (E)
Another Example
- Consider the grammar
E -> if E then E
| if E then E else E
| OTHER
- The expression (input string)
  if E1 then if E2 then E3 else E4
  has two parse trees

  # Tree 1: else matches the outer then
          if
        / |  \
      E1  if  E4
         /  \
        E2   E3

  # Tree 2: else matches the closest then
          if
         /  \
       E1    if
           / |  \
         E2  E3  E4
- One way to handle this ambiguity is to rewrite the grammar to enforce that else matches the closest unmatched then
E -> MIF /* all then are matched */
| UIF /* some then is unmatched */
MIF -> if E then MIF else MIF
| OTHER
UIF -> if E then E
| if E then MIF else UIF
Additional info
- It is impossible to automatically convert an ambiguous grammar to an unambiguous one
- Used with care, ambiguity can simplify the grammar
  - Sometimes allows more natural definitions
  - We then need disambiguation mechanisms
- Instead of rewriting the grammar
  - Use the more natural (ambiguous) grammar
  - Along with disambiguating declarations
- Most tools allow precedence and associativity declarations to disambiguate grammars
- Consider the grammar
  E -> E + E | int
  - Ambiguous: two parse trees of int + int + int
  - Can use a left-associativity declaration: %left + (e.g., in bison)
- Consider the grammar
  E -> E + E | E * E | int
  - Ambiguous: two parse trees for int + int * int
  - Can use precedence declarations supported by the parser generator tool (e.g., bison or yacc)
  - The below example (in bison syntax) says that both + and * are left-associative and that * has higher precedence than +
    %left +
    %left *
Abstract Syntax Tree (AST)
- Like parse trees but ignore (abstract) some details
- Example: Consider the grammar
  E -> int | ( E ) | E + E
  and the input string 5 + (2 + 3)
- After lexical analysis, we have a list of tokens:
int_5
+
(
int_2
+
int_3
)
- During parsing we build a parse tree

            E
          / | \
         E  +  E
         |   / | \
       int5 (  E  )
             / | \
            E  +  E
            |     |
          int2  int3

- But the parse tree contains unnecessary info
  - e.g., based on its structure, we already know 2 + 3 is nested together, so we don't need the parentheses
  - we also don't need the single-successor node E -> int5; we can just have int5
  - so we can remove this info to get an AST

        +
       / \
      5   +
         / \
        2   3

- In general the AST is a very commonly used data structure in compilers and program analysis
  - It also captures the nested structure
  - But abstracts away from the concrete syntax (e.g., no parentheses)
  - More compact and easier to use
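The AST for 5 + (2 + 3) can be written down and used directly; a small sketch (the `Add` node name is my own, not from the notes) showing that parentheses and single-successor chains are gone, yet the value is still recoverable:

```python
# A sketch of the AST for 5 + (2 + 3): only the nesting structure remains,
# and it can be evaluated by a simple recursive walk.
from dataclasses import dataclass

@dataclass
class Add:
    left: object    # Add or int
    right: object   # Add or int

def ev(e):
    # Leaves are plain ints; interior nodes are Add
    if isinstance(e, Add):
        return ev(e.left) + ev(e.right)
    return e

ast = Add(5, Add(2, 3))   # AST of 5 + (2 + 3)
print(ev(ast))            # 10
```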
## Recursive Descent Parsing
- Parse tree is constructed from the grammar
  - From the top
  - From left to right
- Terminals are seen in order of appearance in the token stream
- Example: token stream t2 t5 t6 t8 t9

        1
      / | \
    t2  3  t9
       / \
      4   7
     / \  |
    t5 t6 t8
Example
- Consider the grammar
  E -> T | T + E
  T -> int | int * T | ( E )
- And the token stream input
  ( int5 )
- Recursive descent algorithm:
  - Start with the top-level non-terminal E
  - Try the rules for E in order
    - E -> T -> int: mismatch, int does not match the current token (
    - Backtrack
    - E -> T -> int * T: also a mismatch (int does not match ()
    - E -> T -> ( E ): ( matches; advance the current pointer to int5
    - E -> T -> ( int ): int matches; advance the current pointer to )
    - ) also matches; advance the pointer: done
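The walk above can be sketched as a backtracking recursive-descent parser (a hedged sketch, not the lecture's code): tokens are plain strings, each helper returns the position after a successful match or None, and the sketch tries T + E before T so that this simple first-success scheme handles the examples.

```python
# Backtracking recursive descent for the grammar
#   E -> T + E | T        T -> int | int * T | ( E )
# (production order for E reordered to try the longer alternative first)
def parse(tokens):
    def term(i, tok):                    # match one terminal at position i
        return i + 1 if i < len(tokens) and tokens[i] == tok else None

    def E(i):
        j = T(i)                         # E -> T + E
        if j is not None:
            k = term(j, "+")
            if k is not None:
                m = E(k)
                if m is not None:
                    return m
        return T(i)                      # E -> T (backtrack, retry from i)

    def T(i):
        j = term(i, "int")
        if j is not None:
            k = term(j, "*")             # T -> int * T
            if k is not None:
                m = T(k)
                if m is not None:
                    return m
            return j                     # T -> int
        j = term(i, "(")
        if j is not None:                # T -> ( E )
            k = E(j)
            if k is not None:
                return term(k, ")")
        return None

    return E(0) == len(tokens)

print(parse(["(", "int", ")"]))                 # True
print(parse(["int", "*", "int", "+", "int"]))   # True
print(parse(["int", "*"]))                      # False
```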
Summary
- Recursive descent parsing is a simple and general parsing strategy
  - Used in production compilers, e.g., gcc
- Two problems:
  - Won't work with certain kinds of grammars (e.g., need to eliminate left-recursive CFGs, discussed next)
  - Backtracking is slow (use predictive parsing, discussed next)
Left Recursion
- Consider a production
  S -> S a
  - RD parsing would go into an infinite loop (it keeps attempting to expand S and never gets to a)
- A left-recursive grammar has a non-terminal S and a rule S ->+ S f for some f (which can contain terminals or non-terminals)
  - Recursive descent does not work for such a grammar
- Consider the left-recursive grammar
  S -> S a | b
  - S generates all strings starting with a b followed by any number of a's
    S -> S a -> S a a -> ... -> S a ... a -> b a ... a
Rewriting left-recursive grammars
- Left-recursive grammars can be fixed/rewritten automatically
- To fix a left-recursive grammar, rewrite it using right recursion
  S  -> b S'
  S' -> a S' | e
- In general, left recursion has the form
  S -> S a1 | ... | S an | b1 | ... | bm
  - All strings of S start with one of b1, ..., bm and continue with several instances of a1, ..., an
- This can be rewritten as
  S  -> b1 S' | ... | bm S'
  S' -> a1 S' | ... | an S' | e
- Consider the grammar
  S -> A a | c
  A -> S b
  - This is also left-recursive because S -> A a -> S b a -> ... (the S repeats)
  - This left recursion can also be eliminated; there is a general algorithm to deal with left recursion (details in the Dragon book)
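The right-recursive rewrite S -> b S', S' -> a S' | e can be parsed by plain recursive descent without looping; a minimal Python sketch (token names are illustrative, and S' is implemented as a loop since it only ever consumes a's):

```python
# Recursive descent for the right-recursive grammar S -> b S', S' -> a S' | e.
# Unlike S -> S a | b, expanding S here always consumes a token first,
# so the parser cannot loop forever.
def parse(tokens):
    pos = 0

    def S():
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "b":   # S -> b S'
            pos += 1
            S_prime()
        else:
            raise SyntaxError("expected b")

    def S_prime():
        nonlocal pos
        # S' -> a S' | e, written iteratively: consume as many a's as present
        while pos < len(tokens) and tokens[pos] == "a":
            pos += 1

    S()
    return pos == len(tokens)

print(parse(list("baaa")))   # True
```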
Given the left-recursive grammar
  E -> E + T | T
  T -> id | (E)
which of the below correctly eliminates the left recursion?
- this one?
  E -> E + id | E + (E) | id | (E)
- this one?
  E  -> TE'
  E' -> +TE' | e
  T  -> id | (E)
- this one?
  E  -> E' + T | T
  E' -> id | (E)
  T  -> id | (E)
- this one?
  E -> id + E | E + T | T
  T -> id | (E)
Predictive Parsing
- Limitations of recursive descent (other than left-recursive grammars)
  - Backtracking is expensive
- Predictive parsing algorithm
  - Like recursive descent, but the parser can "predict" which production to use
    - By looking at the next few tokens (lookahead) and using a restricted class of grammars (LL, below)
    - No backtracking
- Predictive parsers accept LL(k) grammars
  - L: left-to-right scan of the input
  - L: left-most derivation
  - k: k tokens of lookahead; in practice, k = 1
LL(1) Grammar
- The problem with recursive descent parsing
  - At each step, many choices of production to use, e.g.,
    S -> T1 | T2 | T3 | ...
  - Has to use backtracking to undo bad choices
- LL(1) grammar
  - At each step, only one choice of production
- Recall the grammar
  E -> T + E | T               # 2 choices
  T -> int | int * T | ( E )   # 2 choices starting with int
  - Hard to predict because
    - For T, two productions start with int
    - For E, both productions start with T, so it is not clear which to use
- We need to left-factor the grammar
  E -> T X
  X -> + E | e
  T -> int Y | ( E )
  Y -> * T | e
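With the left-factored grammar, one token of lookahead selects the production and no backtracking is needed. A minimal LL(1) sketch in Python (token names are illustrative, not the lecture's code):

```python
# Predictive (LL(1)) parser for the left-factored grammar
#   E -> T X      X -> + E | e
#   T -> int Y | ( E )      Y -> * T | e
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        pos += 1

    def E():
        T(); X()

    def X():
        if peek() == "+":        # X -> + E
            eat("+"); E()
        # else X -> e (match nothing)

    def T():
        if peek() == "int":      # T -> int Y
            eat("int"); Y()
        elif peek() == "(":      # T -> ( E )
            eat("("); E(); eat(")")
        else:
            raise SyntaxError(f"unexpected {peek()!r}")

    def Y():
        if peek() == "*":        # Y -> * T
            eat("*"); T()
        # else Y -> e

    E()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return True

print(parse(["int", "*", "int", "+", "int"]))   # True
```

Note that each helper commits to a production by inspecting only `peek()`, which is exactly the k = 1 lookahead.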
- Self practice: Consider the grammar
  EXPR -> if BOOL then { EXPR }
        | if BOOL then { EXPR } else { EXPR }
  BOOL -> true | false
  Choose the option below that correctly left-factors EXPR:
  1. EXPR -> if true then { EXPR } | if false then { EXPR }
          | if true then { EXPR } else { EXPR } | if false then { EXPR } else { EXPR }
  2. EXPR  -> EXPR' | EXPR' else { EXPR }
     EXPR' -> if BOOL then { EXPR }
     BOOL  -> true | false
  3. EXPR  -> if BOOL EXPR'
     EXPR' -> then { EXPR } | then { EXPR } else { EXPR }
     BOOL  -> true | false
  4. EXPR  -> if BOOL then { EXPR } EXPR'
     EXPR' -> else { EXPR } | e
     BOOL  -> true | false