Skip to content

Parsing

Parsing (Parser)

  • Regular Languages

    • Fast and has many applications (checking digit, valid identifier (variable names, phone numbers, etc)
    • But rather limited, e.g., cannot recognize the balanced parentheses \(\{(^i)^i \mid i \ge 0\}\), (), (()), ((()))
      • formulas/expressions ((1+2) & 3 )
      • nested structure: if then else fi
    • In short, DFA/RE doesn't "remember" or "count" things, required by languages such as \(\{(^i)^i \mid i \ge 0\}\).
  • Parser

    • Input: sequence of tokens from lexer
    • Output: parse tree of the program
    • Ex:

      • Program (Cool): if x = y then 1 else 2 fi
      • Lexer output / parser input: IF ID = ID THEN INT ELSE INT FI
      • Parser output:

        ``` if-then-else = int int id id

      ```

Phase Input Output
Lexer String of Characters String of tokens
Parser String of tokens Parse tree

Context Free Grammar

  • Parser needs to distinguish btw valid and invalid strings of tokens
  • We need
    • A lang for describing valid strings of tokens
    • A method for distinguishing valid from invalid strings of tokens
  • Programming Language often has recrusive structures. E.g., Expr is :
    • if Expr then Expr else Expr fi
    • while EXPR loop EXPR pool
  • Context Free Grammar: a natural way to express these recursive structures

    • A set of Terminals \(T\)
    • A set of non-terminals \(N\)
    • A start symbols \(s\in N\)
    • a set of productions (rules)
      • \(x \rightarrow y_1 \dots y_n, (x\in N, y_i \in N \cup T\cup \{\epsilon \} )\)
  • Example:

        #balance parentheses
        S -> (S)   
        S -> e
    
        # N ={S}, T ={(,)}, 2 productions
    
  • Terminals are so called bc no rules for replacing them

  • Once generated, terminals are permanent

  • Terminals are the tokens of the language

  • Another example:

    EXPR -> if EXPR then EXPR else EXPR fi 
    #non-term :  EXPR  , term: ow
    EXPR -> while EXPR loop EXPR pool
    EXPR -> id
    
    #short cut using |
    
    EXPR -> if EXPR ..
         |  while ..
         |  id
    

  • Some examples of the above EXPR CFG

    id      
    if id then id else id fi   
    while id loop id pool  
    if while id loop id pool then id else id       #correct syntanctically, what does it mean semantically?
    if if id then id else id fi then id else id fi 
    

  • Another CFG example for Arithmetic:
    E -> E + E        id 
      |  E * E     id + id 
      | (E)        (id | id ) * id 
      | id 
    

Given the CFG

S -> aXa
X -> e | bY
Y -> e | cXc
which of the following strings are in its language?

  1. abcba
  2. acca
  3. aba # Sol: only this one
  4. abcbcba

The idea of a CFG is a big step. But:

  • Membership in a language is "yes" or "no"; also need parse tree of the input
  • Must handle errors gracefully
  • Need an implementation of CFG's (e.g., bison, yacc, ocamlyacc)
  • Form of the grammar is important
    • Many grammars generate the same language
    • Tools are sensitive to the grammar

Derivations

  • A derivation is a sequence of productions
    • S -> ... -> ...-> ...
  • A derivation can be drawn as a tree
    • Start symbol is the tree's root
    • For a production X->Y1...Yn add children Y1,...,Yn to node X

Example

  • Grammar
    • E -> E + E | E * E | (E) | id
  • Input String
    • id *id + id
  • Derivation and Parse Tree

    #Derivation
    
       E
    -> E       + E
    -> E  * E  + E
    -> id * E  + E
    -> id * id + E
    -> id * id + id
    
    #Parse Tree
                E
        E       +        E
     E  *   E             id
    id     id
    
  • A parse tree has

    • Terminals at the leaves
    • Non-terminals at the interior nodes
  • An in-order traversal of the leaves is the original input

  • The parse tree shows the association of operations, the input string does not

  • Left-most derivation

    • At each step, replace the left-most non-terminal
  • Right-most derivation

  • Both right/left-most derivations have the same parse tree

Summary

  • We are not just interested in whether \(s \in L(G)\)
    • We need a parse tree for s
  • A derivation defines a parse tree
    • But one parse tree may have many derivations (e.g., left/right-most derivation)
  • Left-most and right-most derivations are important in parser implementation

Ambiguity

  • Grammar
    • E -> E + E | E * E | (E) | id
  • Input String
    • id *id + id
  • This string has two parse trees
    # Tree 1
                E
        E       +        E
    E  *   E             id
    id     id

    # Tree 2
                E
        E       *       E
        id           E  +   E
                     id     id
  • A grammar is ambiguous if it has more than one parse tree for some string
  • Equivalently, there is more than one right-most or left-most derivation for some string
  • Ambiguity is BAD
    • Leaves meaning of some programs ill-defined

Ways to handle ambiguity

  • Most direct method is to rewrite grammar to make it unambiguous. For example, the below grammar for arithmetic expression enforces precedence of * over +
    E -> E’ + E | E’
    E’ -> id * E' | id | (E) * E’| (E)

Another Example

  • Consider the grammar
    E -> if E then E
       | if E then E else E
       | OTHER
  • The expression (input string) if E1 then if E2 then E3 else E 4 has two parse trees
          if
      /   |   \
    E1    if    E4
        /   \
       E2    E3


          if
        /    \
      E1       if
            /  |  \
           E2  E3  E4
  • One way to handle this ambiguity is to rewrite it to enforce that else matches the closest unmatched then
    E -> MIF   /* all then are matched */
       | UIF   /* some then is unmatched */

    MIF -> if E then MIF else MIF
          | OTHER

    UIF -> if E then E
         | if E then MIF else UIF

Additional info

  • Impossible to automatically convert an ambiguous grammar to an unambiguous one

  • Used with care, ambiguity can simplify the grammar

    • Sometimes allows more natural definitions
    • We need disambiguation mechanisms
  • Instead of rewriting the grammar

    • Use the more natural (ambiguous) grammar
    • Along with disambiguating declarations
  • Most tools allow precedence* and *associativity declarations to disambiguate grammars

    • Consider the grammar E -> E + E | int Ambiguous: two parse trees of int + int + int

      • Can use Left associativity declaration: %left + (e.g., in bison)

      • Consider the grammar E -> E + E | E * E | int

        • Ambiguous: two parse trees for int + int * int
      • Can use Precedence declaration supported by the parser generator tool (e.g., bison or yacc)

        • Below example (in bison syntax) says that both * and * are left associativity and that * has higher precedence than + %left + %left *

Abstract Syntax Tree (AST)

  • Like parse trees but ignore (abstract) some details
  • Example: Consider the grammar E -> int | ( E ) | E + E and the input string 5 + (2 + 3)

    • After lexical analysis, we have a list of tokens: int_5 + ( int_2 + int_3 )
    • During parsing we build a parse tree
                  E
          E       +          E
         int5         (      E       )
                         E   +   E
                       int2     int3
      
    • But the parse tree contains unnessary info

      • e.g., based on it structure, we already know 2 + 3 is nested together, so we don't need parenthenses
      • we also don't need single-successor node E -> int5 , can just have int5
      • so we can remove these info to get an AST
              + 
        5         +
               2     3
        
    • In general AST is a very commonly used data structure in compilers and program analysis

      • Also capture nested structure
      • But abstract from concrete syntax (e.g., no parantheses)
      • More compact and Easier to use

## Recursive Descent Parsing - Parse tree is constructed from grammar - From the top - From left to right - Terminals are seen in-order of appearance of the token stream - Example: t2 t5 t6 t8 t9

              1
         /    |        \
      t2      3         t9
          /       \
        4          7
     /    \        |
   t5    t6       t8 

Example

  • Consider the grammar
    E -> T   | T + E
    T -> int | int * T | ( E )
    
  • And the token stream input (int5 )

  • Recursive Descent Algorithm:

    • Start with top-level non-terminal E
    • Try the rules for E in order
      • E -> T -> int: mismatch: int does not match curr ptr (
    • Backtrack
      • E -> T -> int * T : also mismatch (* does not match ()
      • E -> T -> ( E ) : ( match, advance curr ptr to int5
      • E -> T -> ( int ) : int match, advance curr ptr to ), also match, advance ctr: done

Summary

  • Recursive descent paraising is a simple and general parsing strategy
  • Used in production compilers e.g., gcc
  • Two problems:
    • Won't work with certain kind of grammar (e.g., need to elliminate left-recursion cfg's, discussed next)
    • Backtracking is slow (use predictive parsing, discussed next)

Left Recursion

  • Consider a production S -> S a
    • RD parsing would goes into an infinite loop (keep attempt to expand S and never get to a)
  • A left-recursive grammar has a non-terminal S and the rule S->+ Sf for some f (can contain terminals or non-terminals)
  • Recursive descent does not work for this grammar.

  • Consider the left-recursive grammar

    • S -> S a | b
    • S generates all strings starting with a b and followed by any number of a
      • S a -> S aa -> ... -> S a ... a -> b a ... a

Rewriting left-recursive grammar

  • Left-recursive grammar can be fixed/rewritten automatically

  • To fix left-recursive grammar, can rewrite using right-recursion

    S -> b S’
    S' -> a S' | e
    
  • In general, left recursion has the form

    S -> S a1 | ... | S an | b1 | ... | bm
    
    • All strings of S starts with one of the b1,...,bm continue with several instances of a1,...,an
  • This can be rewriten as

    S -> b1 S’ | ... | bm S’
    S’ -> a1 S’ | ... | an S' | e
    
  • Consider the grammar

    S -> A a | c
    A -> S b
    
    • This is also left-recursive because =S -> Aa -> S b a -> .. = (the S repeats) This left-recursion can also be eliminated, there is a general algorithm to deal with left recursion (details in Dragon book)

Given the left recursive grammar

E -> E + T | T
T -> id | (E)
Choose the grammar that eliminates this grammar

  1. E -> E + id | E + (E) | id | (E)
  2. this one ?
    E -> TE’
    E’ -> +TE' | e
    T  -> id | (E)
    
  3. this one?

    E -> E’ + T | T 
    E’-> id | (E) 
    T -> id | (E)
    

  4. this one?

    E -> id + E | E + T | T 
    T -> id | (E)
    

Predictive Parsing

  • Limitation of Recursive Descent (other than Left Recursive grammar)
    • Backtracking is expensive
  • Predictive Parsing Algorithm
    • Like recursive-descent but parser can "predict" which production to use
    • By looking at the next few tokens (lookahead and using a restricted grammar LL below)
    • No backtracking
  • Predictive parsers accept LL(k) grammars
    • L: left-to-right
    • L: left-most derivation
    • k: k tokens to look ahead, in practice, k=1

LL(1) Grammar

  • The problem with Recursive Descent parsing

    • At each step, many choices of production to use, e..,g S -> T1 | T2 | T3 | ..
    • Has to use backtracking to undo bad choices
  • LL(1) grammar

    • At each step, only one choice of production
  • Recall the grammar

    E -> T + E | T     # 2 choices 
    T -> int | int * T | ( E )   # 2 choices for int ..
    
    • Hard to predict because
      • For T two productions start with int
      • For E two productions
  • We need to left-factor the grammar

    E -> TX
    X -> +E | e
    T -> intY | ( E )
    Y -> e | * T
    
  • Self practice:

    • Consider the grammar
      EXPR -> if BOOL then { EXPR }
           |  if BOOL then { EXPR } else { EXPR } 
      BOOL -> true | false
      
    • Choose the below that correctly left-factor the IF:
      1. EXPR -> if true then { EXPR }
              |  if false then { EXPR }
              |  if true then { EXPR } else { EXPR } 
              |  if false then { EXPR } else { EXPR }
      
      2. EXPR -> EXPR’ | EXPR’ else { EXPR }
         EXPR’ -> if BOOL then { EXPR }
         BOOL -> true | false
      
      3. EXPR -> if BOOL EXPR’
         EXPR’ -> then { EXPR }
                | then { EXPR } else { EXPR }
         BOOL -> true | false
      
      4. EXPR -> if BOOL then { EXPR } EXPR’ 
         EXPR’ -> else { EXPR } | e 
         BOOL -> true | false