Parsing (Parser)

Regular Languages
- Fast, with many applications (check digits, valid identifiers, phone numbers, etc.)
- But rather limited, e.g., cannot recognize the balanced-parentheses language \(\{(^i)^i \mid i \ge 0\}\): (), (()), ((())), ...
  - formulas/expressions, e.g., ((1+2) & 3)
  - nested structure: if ... then ... else ... fi
- In short, a DFA/RE cannot "remember" or "count" things, which languages such as \(\{(^i)^i \mid i \ge 0\}\) require.
Parser
- Input: sequence of tokens from the lexer
- Output: parse tree of the program

Example:
- Program (Cool):
  if x = y then 1 else 2 fi
- Lexer output / parser input:
  IF ID = ID THEN INT ELSE INT FI
- Parser output:
```
     if-then-else
    /     |      \
   =     int     int
  / \
 id  id
```
| Phase  | Input                | Output           |
|--------|----------------------|------------------|
| Lexer  | String of characters | String of tokens |
| Parser | String of tokens     | Parse tree       |
Context Free Grammar
- Parser needs to distinguish between valid and invalid strings of tokens
- We need
- A lang for describing valid strings of tokens
- A method for distinguishing valid from invalid strings of tokens
- Programming languages often have recursive structures. E.g.,
  EXPR -> if EXPR then EXPR else EXPR fi
  EXPR -> while EXPR loop EXPR pool
Context Free Grammar: a natural way to express these recursive structures
- A set of Terminals \(T\)
- A set of non-terminals \(N\)
- A start symbol \(s\in N\)
- A set of productions (rules)
- \(x \rightarrow y_1 \dots y_n, (x\in N, y_i \in N \cup T\cup \{\epsilon \} )\)
Example: balanced parentheses
  S -> (S)
  S -> e
  # N = {S}, T = {(, )}, 2 productions
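Since every derivation applies S -> (S) some number of times and then S -> e, the language is exactly the strings \((^i)^i\). A quick enumeration sketch in Python (the function name `gen` is my own):

```python
# A sketch (not from the notes): enumerate the strings generated by
# S -> (S) | e up to nesting depth n. Each derivation that applies
# S -> (S) exactly i times yields "("*i + ")"*i.
def gen(n):
    return ["(" * i + ")" * i for i in range(n + 1)]

print(gen(3))  # ['', '()', '(())', '((()))']
```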
- Terminals are so called because there are no rules for replacing them
- Once generated, terminals are permanent
- Terminals are the tokens of the language
Another example:
  EXPR -> if EXPR then EXPR else EXPR fi
  EXPR -> while EXPR loop EXPR pool
  EXPR -> id
  # non-terminal: EXPR; terminals: all other symbols (if, then, else, fi, while, loop, pool, id)
  # shortcut using |:
  EXPR -> if EXPR then EXPR else EXPR fi | while EXPR loop EXPR pool | id
Some example strings of the above EXPR CFG:
  id
  if id then id else id fi
  while id loop id pool
  if while id loop id pool then id else id fi   # syntactically correct, but what does it mean semantically?
  if if id then id else id fi then id else id fi
- Another CFG example, for arithmetic:
  E -> E + E | E * E | (E) | id
  # example strings: id, id + id, (id + id) * id
Given the CFG
  S -> aXa
  X -> e | bY
  Y -> e | cXc
which of the below strings are in its language?
- abcba
- acca
- aba   # Sol: only this one
- abcbcba
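The quiz answer can be checked mechanically with a small hand-written matcher (a sketch, not from the notes; the generator-based helpers S, X, Y mirror the non-terminals and try every production):

```python
# A minimal sketch of a matcher for the CFG
#   S -> aXa,  X -> e | bY,  Y -> e | cXc
# Each helper yields every input position reachable after deriving its
# non-terminal starting at position i; a string matches if S can consume it all.
def matches(s: str) -> bool:
    def S(i):
        # S -> a X a
        if i < len(s) and s[i] == 'a':
            for j in X(i + 1):
                if j < len(s) and s[j] == 'a':
                    yield j + 1
    def X(i):
        yield i                      # X -> e
        if i < len(s) and s[i] == 'b':
            yield from Y(i + 1)      # X -> b Y
    def Y(i):
        yield i                      # Y -> e
        if i < len(s) and s[i] == 'c':
            for j in X(i + 1):       # Y -> c X c
                if j < len(s) and s[j] == 'c':
                    yield j + 1
    return any(j == len(s) for j in S(0))

print([w for w in ["abcba", "acca", "aba", "abcbcba"] if matches(w)])  # ['aba']
```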
The idea of a CFG is a big step. But:
- Membership in a language is "yes" or "no"; also need parse tree of the input
- Must handle errors gracefully
- Need an implementation of CFGs (e.g., bison, yacc, ocamlyacc)
- Form of the grammar is important
- Many grammars generate the same language
- Tools are sensitive to the grammar
Derivations
- A derivation is a sequence of productions
  S -> ... -> ... -> ...
- A derivation can be drawn as a tree
  - The start symbol is the tree's root
  - For a production X -> Y1 ... Yn, add children Y1, ..., Yn to node X
Example
- Grammar
  E -> E + E | E * E | (E) | id
- Input string
  id * id + id
- Derivation and parse tree

  # Derivation
  E -> E + E
    -> E * E + E
    -> id * E + E
    -> id * id + E
    -> id * id + id

  # Parse Tree
          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id
- A parse tree has
  - Terminals at the leaves
  - Non-terminals at the interior nodes
- An in-order traversal of the leaves is the original input
- The parse tree shows the association of operations; the input string does not
- Left-most derivation
  - At each step, replace the left-most non-terminal
- Right-most derivation
  - At each step, replace the right-most non-terminal
- Both the right-most and left-most derivations have the same parse tree
Summary
- We are not just interested in whether \(s \in L(G)\)
- We need a parse tree for s
- A derivation defines a parse tree
- But one parse tree may have many derivations (e.g., left/right-most derivation)
- Left-most and right-most derivations are important in parser implementation
Ambiguity
- Grammar
E -> E + E | E * E | (E) | id
- Input string
  id * id + id
- This string has two parse trees

  # Tree 1
          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id

  # Tree 2
          E
        / | \
       E  *  E
       |   / | \
      id  E  +  E
          |     |
          id    id
- A grammar is ambiguous if it has more than one parse tree for some string
- Equivalently, there is more than one right-most or left-most derivation for some string
- Ambiguity is BAD
- Leaves meaning of some programs ill-defined
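A small illustration of why ambiguity is bad (the example numbers are my own, not from the notes): the two parse trees of 2 * 3 + 4 evaluate to different values, so the program's meaning is ill-defined.

```python
# Evaluate a parse tree encoded as nested tuples: a tree is either an
# int (leaf) or a triple (op, left, right).
def ev(t):
    if isinstance(t, int):
        return t
    op, left, right = t
    return ev(left) + ev(right) if op == "+" else ev(left) * ev(right)

tree1 = ("+", ("*", 2, 3), 4)   # Tree 1 shape: (2 * 3) + 4
tree2 = ("*", 2, ("+", 3, 4))   # Tree 2 shape: 2 * (3 + 4)
print(ev(tree1), ev(tree2))     # 10 14
```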
Ways to handle ambiguity
- The most direct method is to rewrite the grammar to make it unambiguous. For example, the grammar below for arithmetic expressions enforces the precedence of * over +
  E  -> E' + E | E'
  E' -> id * E' | id | (E) * E' | (E)
Another Example
- Consider the grammar
E -> if E then E
| if E then E else E
| OTHER
- The expression (input string)
  if E1 then if E2 then E3 else E4
  has two parse trees

  # Tree 1: else matches the outer then
          if
        / |  \
      E1  if  E4
         /  \
        E2   E3

  # Tree 2: else matches the closest then
          if
         /  \
       E1    if
           / |  \
         E2  E3  E4
- One way to handle this ambiguity is to rewrite the grammar to enforce that else matches the closest unmatched then
E -> MIF /* all then are matched */
| UIF /* some then is unmatched */
MIF -> if E then MIF else MIF
| OTHER
UIF -> if E then E
| if E then MIF else UIF
Additional info
- It is impossible to automatically convert an ambiguous grammar to an unambiguous one
- Used with care, ambiguity can simplify the grammar
  - Sometimes allows more natural definitions
  - We then need disambiguation mechanisms
- Instead of rewriting the grammar
  - Use the more natural (ambiguous) grammar
  - Along with disambiguating declarations
- Most tools allow precedence and associativity declarations to disambiguate grammars
- Consider the grammar
  E -> E + E | int
  - Ambiguous: two parse trees of int + int + int
  - Can use a left-associativity declaration: %left + (e.g., in bison)
- Consider the grammar
  E -> E + E | E * E | int
  - Ambiguous: two parse trees for int + int * int
  - Can use precedence declarations supported by the parser generator tool (e.g., bison or yacc)
  - The below example (in bison syntax) says that both + and * are left-associative and that * has higher precedence than +
    %left +
    %left *
Abstract Syntax Tree (AST)
- Like parse trees but ignore (abstract) some details
- Example: Consider the grammar
  E -> int | ( E ) | E + E
  and the input string 5 + (2 + 3)
- After lexical analysis, we have a list of tokens:
int_5
+
(
int_2
+
int_3
)
- During parsing we build a parse tree

            E
          / | \
         E  +  E
         |   / | \
       int5 (  E  )
             / | \
            E  +  E
            |     |
          int2  int3

- But the parse tree contains unnecessary info
  - e.g., based on its structure, we already know 2 + 3 is nested together, so we don't need the parentheses
  - we also don't need the single-successor node E -> int5; we can just have int5
  - so we can remove this info to get an AST

        +
       / \
      5   +
         / \
        2   3

- In general the AST is a very commonly used data structure in compilers and program analysis
  - It also captures the nested structure
  - But abstracts away from the concrete syntax (e.g., no parentheses)
  - More compact and easier to use
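The AST for 5 + (2 + 3) can be written down and used directly; a small sketch (the `Add` node name is my own, not from the notes) showing that parentheses and single-successor chains are gone, yet the value is still recoverable:

```python
# A sketch of the AST for 5 + (2 + 3): only the nesting structure remains,
# and it can be evaluated by a simple recursive walk.
from dataclasses import dataclass

@dataclass
class Add:
    left: object    # Add or int
    right: object   # Add or int

def ev(e):
    # Leaves are plain ints; interior nodes are Add
    if isinstance(e, Add):
        return ev(e.left) + ev(e.right)
    return e

ast = Add(5, Add(2, 3))   # AST of 5 + (2 + 3)
print(ev(ast))            # 10
```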
## Recursive Descent Parsing
- Parse tree is constructed from the grammar
  - From the top
  - From left to right
- Terminals are seen in order of appearance in the token stream
- Example: token stream t2 t5 t6 t8 t9

        1
      / | \
    t2  3  t9
       / \
      4   7
     / \  |
    t5 t6 t8
Example
- Consider the grammar
  E -> T | T + E
  T -> int | int * T | ( E )
- And the token stream input
  ( int5 )
- Recursive descent algorithm:
  - Start with the top-level non-terminal E
  - Try the rules for E in order
    - E -> T -> int: mismatch, int does not match the current token (
    - Backtrack
    - E -> T -> int * T: also a mismatch (int does not match ()
    - E -> T -> ( E ): ( matches; advance the current pointer to int5
    - E -> T -> ( int ): int matches; advance the current pointer to )
    - ) also matches; advance the pointer: done
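The walk above can be sketched as a backtracking recursive-descent parser (a hedged sketch, not the lecture's code): tokens are plain strings, each helper returns the position after a successful match or None, and the sketch tries T + E before T so that this simple first-success scheme handles the examples.

```python
# Backtracking recursive descent for the grammar
#   E -> T + E | T        T -> int | int * T | ( E )
# (production order for E reordered to try the longer alternative first)
def parse(tokens):
    def term(i, tok):                    # match one terminal at position i
        return i + 1 if i < len(tokens) and tokens[i] == tok else None

    def E(i):
        j = T(i)                         # E -> T + E
        if j is not None:
            k = term(j, "+")
            if k is not None:
                m = E(k)
                if m is not None:
                    return m
        return T(i)                      # E -> T (backtrack, retry from i)

    def T(i):
        j = term(i, "int")
        if j is not None:
            k = term(j, "*")             # T -> int * T
            if k is not None:
                m = T(k)
                if m is not None:
                    return m
            return j                     # T -> int
        j = term(i, "(")
        if j is not None:                # T -> ( E )
            k = E(j)
            if k is not None:
                return term(k, ")")
        return None

    return E(0) == len(tokens)

print(parse(["(", "int", ")"]))                 # True
print(parse(["int", "*", "int", "+", "int"]))   # True
print(parse(["int", "*"]))                      # False
```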
Summary
- Recursive descent parsing is a simple and general parsing strategy
  - Used in production compilers, e.g., gcc
- Two problems:
  - Won't work with certain kinds of grammars (e.g., need to eliminate left-recursive CFGs, discussed next)
  - Backtracking is slow (use predictive parsing, discussed next)
Left Recursion
- Consider a production
  S -> S a
  - RD parsing would go into an infinite loop (it keeps attempting to expand S and never gets to a)
- A left-recursive grammar has a non-terminal S and a rule S ->+ S f for some f (which can contain terminals or non-terminals)
  - Recursive descent does not work for such a grammar
- Consider the left-recursive grammar
  S -> S a | b
  - S generates all strings starting with a b followed by any number of a's
    S -> S a -> S a a -> ... -> S a ... a -> b a ... a
Rewriting left-recursive grammars
- Left-recursive grammars can be fixed/rewritten automatically
- To fix a left-recursive grammar, rewrite it using right recursion
  S  -> b S'
  S' -> a S' | e
- In general, left recursion has the form
  S -> S a1 | ... | S an | b1 | ... | bm
  - All strings of S start with one of b1, ..., bm and continue with several instances of a1, ..., an
- This can be rewritten as
  S  -> b1 S' | ... | bm S'
  S' -> a1 S' | ... | an S' | e
- Consider the grammar
  S -> A a | c
  A -> S b
  - This is also left-recursive because S -> A a -> S b a -> ... (the S repeats)
  - This left recursion can also be eliminated; there is a general algorithm to deal with left recursion (details in the Dragon book)
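The right-recursive rewrite S -> b S', S' -> a S' | e can be parsed by plain recursive descent without looping; a minimal Python sketch (token names are illustrative, and S' is implemented as a loop since it only ever consumes a's):

```python
# Recursive descent for the right-recursive grammar S -> b S', S' -> a S' | e.
# Unlike S -> S a | b, expanding S here always consumes a token first,
# so the parser cannot loop forever.
def parse(tokens):
    pos = 0

    def S():
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "b":   # S -> b S'
            pos += 1
            S_prime()
        else:
            raise SyntaxError("expected b")

    def S_prime():
        nonlocal pos
        # S' -> a S' | e, written iteratively: consume as many a's as present
        while pos < len(tokens) and tokens[pos] == "a":
            pos += 1

    S()
    return pos == len(tokens)

print(parse(list("baaa")))   # True
```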
Given the left-recursive grammar
  E -> E + T | T
  T -> id | (E)
which of the below correctly eliminates the left recursion?
- this one?
  E -> E + id | E + (E) | id | (E)
- this one?
  E  -> TE'
  E' -> +TE' | e
  T  -> id | (E)
- this one?
  E  -> E' + T | T
  E' -> id | (E)
  T  -> id | (E)
- this one?
  E -> id + E | E + T | T
  T -> id | (E)
Predictive Parsing
- Limitations of recursive descent (other than left-recursive grammars)
  - Backtracking is expensive
- Predictive parsing algorithm
  - Like recursive descent, but the parser can "predict" which production to use
    - By looking at the next few tokens (lookahead) and using a restricted class of grammars (LL, below)
    - No backtracking
- Predictive parsers accept LL(k) grammars
  - L: left-to-right scan of the input
  - L: left-most derivation
  - k: k tokens of lookahead; in practice, k = 1
LL(1) Grammar
- The problem with recursive descent parsing
  - At each step, many choices of production to use, e.g.,
    S -> T1 | T2 | T3 | ...
  - Has to use backtracking to undo bad choices
- LL(1) grammar
  - At each step, only one choice of production
- Recall the grammar
  E -> T + E | T               # 2 choices
  T -> int | int * T | ( E )   # 2 choices starting with int
  - Hard to predict because
    - For T, two productions start with int
    - For E, both productions start with T, so it is not clear which to use
- We need to left-factor the grammar
  E -> T X
  X -> + E | e
  T -> int Y | ( E )
  Y -> * T | e
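With the left-factored grammar, one token of lookahead selects the production and no backtracking is needed. A minimal LL(1) sketch in Python (token names are illustrative, not the lecture's code):

```python
# Predictive (LL(1)) parser for the left-factored grammar
#   E -> T X      X -> + E | e
#   T -> int Y | ( E )      Y -> * T | e
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        pos += 1

    def E():
        T(); X()

    def X():
        if peek() == "+":        # X -> + E
            eat("+"); E()
        # else X -> e (match nothing)

    def T():
        if peek() == "int":      # T -> int Y
            eat("int"); Y()
        elif peek() == "(":      # T -> ( E )
            eat("("); E(); eat(")")
        else:
            raise SyntaxError(f"unexpected {peek()!r}")

    def Y():
        if peek() == "*":        # Y -> * T
            eat("*"); T()
        # else Y -> e

    E()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return True

print(parse(["int", "*", "int", "+", "int"]))   # True
```

Note that each helper commits to a production by inspecting only `peek()`, which is exactly the k = 1 lookahead.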
- Self practice: Consider the grammar
  EXPR -> if BOOL then { EXPR }
        | if BOOL then { EXPR } else { EXPR }
  BOOL -> true | false
  Choose the option below that correctly left-factors EXPR:
  1. EXPR -> if true then { EXPR } | if false then { EXPR }
          | if true then { EXPR } else { EXPR } | if false then { EXPR } else { EXPR }
  2. EXPR  -> EXPR' | EXPR' else { EXPR }
     EXPR' -> if BOOL then { EXPR }
     BOOL  -> true | false
  3. EXPR  -> if BOOL EXPR'
     EXPR' -> then { EXPR } | then { EXPR } else { EXPR }
     BOOL  -> true | false
  4. EXPR  -> if BOOL then { EXPR } EXPR'
     EXPR' -> else { EXPR } | e
     BOOL  -> true | false