PA2: Lexer
Goal
For this assignment you will write a lexical analyzer for Cool, also called a scanner, using a lexical analyzer generator. You will describe the set of tokens for Cool in an appropriate input format and the analyzer generator will generate actual code. You will then write additional code to serialize the tokens for use by later compiler/interpreter stages.
Specification
You will turn in an Ocaml program that takes a single command-line argument (e.g.,
file.cl
). That argument will be an ASCII text Cool
source file.
Your program must either indicate that there is an error in the
input (e.g., a malformed string) or emit file.cl-lex
, a serialized
list of Cool tokens. Your program's main lexer component must be
constructed by a lexical analyzer generator. The "glue code" for
processing command-line arguments and serializing tokens should be
written by hand. If your program is called lexer
, invoking
lexer file.cl
should yield the same output as
cool --lex file.cl
.
You will not write the lexer but instead use ocamllex
to generate one.
You will just provide regular expressions describing valid Cool tokens and ocamllex
will generate the appropriate lexer.
Line Numbers
The first line in a file is line 1
. Each successive '\n'
newline
character increments the line count. Your lexer is responsible for
keeping track of the current line number.
Error Reporting
To report an error, write the string:
ERROR: line_number: Lexer: message
Backslash not allowed \
Example error report output:
ERROR: 1: Lexer: invalid character: \
The .cl-lex
File Format
If there are no errors in file.cl
your program should create
file.cl-lex
and serialize the tokens to it. Each token is represented
by a pair (or triplet) of lines:
- The first line holds the line number.
- The second line gives the name of the token.
- The optional third line holds additional information (i.e., the lexeme) for identifiers, integers, strings and types.
For example, for an integer token the third line should contain the decimal integer value.
Example input
Backslash not
allowed
Corresponding .cl-lex
output
1
type
Backslash
1
not
2
identifier
allowed
Hint
This .cl-lex
file format is exactly the same as the one generated by
the reference COOL compiler when you specify the --lex
argument.
So you can check what the expected output would be by running the reference compiler with --lex
and check the resulting .cl-lex
file.
In
addition, the reference compiler (and your upcoming PA3 parser!)
will read .cl-lex
files instead of .cl
files.
Tokens
Official list of token names
at case class comma colon divide dot else equals esac false fi
identifier if in inherits integer isvoid larrow lbrace le let
loop lparen lt minus new not of plus pool rarrow rbrace rparen
semi string then tilde times true type while
In general the intended token is evident. For the more exotic names:
larrow = <-, lbrace = {, le = <=, lparen = (, lt = <,
rarrow = =>, rbrace = }, semi = ;, tilde = ~, at = @.
Lexical Analyzer Generators
For this assignment, you must use a lexcical analyzer generator.
For example, every OCaml distribution has an ocamllex
lexical analyzer generator.
Commentary
Tests
- You can use the following tests to check your implementation.
-
You can do basic testing with something like the following:
$ cool --out reference --lex file.cl $ my-lexer file.cl $ diff -b -B -E -w file.cl-lex reference.cl-lex
For example, if you used OCaml:
$ cool --out reference --lex file.cl $ ocamllex my-lexer.mll $ ocaml my-lexer.ml file.cl $ diff file.cl-lex reference.cl-lex
You may find the reference compiler's --unlex
option useful for debugging your .cl-lex
files.
Need more testcases? Any Cool file you have works fine. The contents of cool-examples should be a good start. There's also one among the PA1 hints. You'll want to make more complicated test cases --- in particular, you'll want to make negative testcases (e.g., testcases with malformed string constants).
Video Guides
FAQs
- String length: A Cool string has at most 1024 characters. See Constants under Expressions in the Cool manual.
Turn-In and Grading
What To Turn In For PA2
PA2
You must turn in a zip file containing these files:
readme.txt
: a plain ASCII text file describing your design decisions In addition, answer the following questions in yourreadme.txt
:- What are some challenging parts for this assignment? What did you do to solve them?
- Suggestions for both the instructor and future students for this assignment?
- source files :
lexer.ml
lexer.mll
- See an example of a
readme.txt
file - If you work in a team, list the name of your other team member in the
readme.txt
.
Grading Rubic
PA2 Grading (out of 50 points)
- 41 points : for autograder tests (-1 point per failed test,
minimum score of 0)
- Each missed test removes points, to a minimum of 0, even if there are more tests than total points.
- 5 points : for a clear description in your README
- 5 : thorough discussion of design decisions (including handling of strings and comments) and answering given qeustions; a few paragraphs of coherent English sentences should be fine
- 2 : vague or hard to understand; omits important details
- 0 : little to no effort cases
- 4 point : for code cleanliness
- 4 : code is mostly clean and well-commented
- 2 : code is sloppy and/or poorly commented in places
- 0 : little to no effort to organize and document code