Skip to content

Intro

Intro to Compilers

  • Two main approaches to implement a programming lang: Compilers and Interpreters
  • Examples: Python or Ruby interpreters,
  • Input: a program (e.g., add_2_nums.py) and data (e.g., integer inputs 5 and 7)
  • Output: results (e.g., shown on the screen, python add_2_nums.py 5 7 shows 12)
  • "On-line"
  • Examples: Java javac , C compiler gcc, clang, Ocaml compiler
  • Input: a program (e.g., add_2_nums.c)
  • Output: another program (e.g., in executable format gcc add_2_nums.c -o add_2_nums.exe), which can then be executed with data to obtain output (e.g., ./add_2_numbers.exe 2 3 shows 5
  • "Off-line"
  • Compiler has the Code Generation phase that an Interpreter does not

History

  • In the 50's, computer hardware (e.g., IBM's) are expensive (and big), but software is even more expensive and difficult to use.

  • John Backus:

  • Speedcoding: early form of an interpreter
    • Advantages: much faster to do development, increase productivity
    • Disadvantage:
    • Slow : 10-20x slower than program written in low level
    • Space: ..
  • FORTRAN 1
    • Formula Translation
    • 1954-1957 : time to write compiler for Fortran
    • 1958: 50% written Fortran
    • Require lots of good engineering skills

Structure of Compiler: 5 main phases

  1. Lexical Analysis (Syntax)
  2. Parsing (Syntax)
  3. Semantic Analysis (Semantics)
  4. Optimization (makes it runs faster, use less memory, etc)
  5. Code Generation (not needed for interpreters)

Phase 1: Lexer

  • Goal: classify provide text into words or tokens
  • Technique: use Regular Expressions and Finite Automata to recognize tokens (e.g., tokens x and var5 are variable names while tokens 1.2 and -0.5 are constants (numbers))

Example

if x==y then z=1; else z=2;: multiple tokens such as if, space, x, ==, y, space, then ..

  • Reserve kws: if, then, else, ..
  • Constant, digit: 1,2
  • Variable names: x, y, z
  • Operators: ==, =
  • Others: ;, blank

Phase 2: Parser

  • Goal: check if obtained tokens form a valid program (i.e., conform to a grammar/structure specific to the language)
  • Technique: use a Context Free Grammar and parsing algorithm to build a parse tree

Example

  • Matching parenthesis, brackets: (x*(x))) would have a parsing error
  • Statements do not end with semi-colons
  • A boolean expression follows if, while , e.g., if (exp)

Phase 3: Semantic Analyzer

  • Goal: Understand meaning or semantics of the program
  • Technique: Compilers include various techniques that perform static analysis to catch errors such as type checking, determining scopes, ambiguities, etc

Example

  • Type checking: int x = 5; x = "a string" would be a type mismatch
  • PL's have rules to avoid certain kind of ambiguities
    {
      int Jack = 3; 
      {
        int Jack = 4;
        cout << Jack; // print 4
      }
    }
    

Phase 4: Optimization

  • Goal: Automatically modify the program to another one that has the same meaning as the original one, but is "better" in some way, e.g.,:
    • Run faster
    • Use less memory
    • Use less power
  • Techniques: various ways (compile time evaluation, constant propagation, dead code elimination, etc)

Could be tricky: really has to preserve the original semantics

  • X = Y * 0 is the same as X = 0 ?
    • No! only true for integers, not true for floating points, NaN * 0 = NaN

Phase 5: Code Generation

  • Goal: produce assembly code (usually)
    • Essentially a translation from a language to another
    • Not needed in an interpreter, i.e., only in a compiler

Classical vs Modern Compilers

  • Every compiler has these five phases, but their proportions have changed over the years
    In the past: [ l ][  p ][s][ o ][ cg ]
    Now        : [l][p][ s ][   o  ][ cg ]   # more effort in optimization