PA2: Lexer

Goal

For this assignment you will write a lexical analyzer for Cool, also called a scanner, using a lexical analyzer generator. You will describe the set of tokens for Cool in an appropriate input format and the analyzer generator will generate actual code. You will then write additional code to serialize the tokens for use by later compiler/interpreter stages.

Specification

You will turn in an Ocaml program that takes a single command-line argument (e.g., file.cl). That argument will be an ASCII text Cool source file.

Your program must either indicate that there is an error in the input (e.g., a malformed string) or emit file.cl-lex, a serialized list of Cool tokens. Your program's main lexer component must be constructed by a lexical analyzer generator. The "glue code" for processing command-line arguments and serializing tokens should be written by hand. If your program is called lexer, invoking lexer file.cl should yield the same output as cool --lex file.cl.

You will not write the lexer but instead use ocamllex to generate one.
You will just provide regular expressions describing valid Cool tokens and ocamllex will generate the appropriate lexer.

Line Numbers

The first line in a file is line 1. Each successive '\n' newline character increments the line count. Your lexer is responsible for keeping track of the current line number.

Error Reporting

To report an error, write the string:

ERROR: line_number: Lexer: message

to standard output and terminate the program. You may write whatever you want in the message, but it should be fairly indicative. Example erroneous input:

Backslash not allowed \

Example error report output:

ERROR: 1: Lexer: invalid character: \

The `.cl-lex` File Format

If there are no errors in file.cl your program should create file.cl-lex and serialize the tokens to it. Each token is represented by a pair (or triplet) of lines:

The first line holds the line number.
The second line gives the name of the token.
The optional third line holds additional information (i.e., the lexeme) for identifiers, integers, strings and types.

For example, for an integer token the third line should contain the decimal integer value.

Example input

Backslash not 
  allowed

Corresponding .cl-lex output

1 
type 
Backslash 
1 
not 
2 
identifier 
allowed

Hint

This .cl-lex file format is exactly the same as the one generated by the reference COOL compiler when you specify the --lex argument. So you can check what the expected output would be by running the reference compiler with --lex and check the resulting .cl-lex file. In addition, the reference compiler (and your upcoming PA3 parser!) will read .cl-lex files instead of .cl files.

Tokens

Official list of token names

at case class comma colon divide dot else equals esac false fi 
identifier if in inherits integer isvoid larrow lbrace le let 
loop lparen lt minus new not of plus pool rarrow rbrace rparen 
semi string then tilde times true type while

In general the intended token is evident. For the more exotic names:

larrow = <-, lbrace = {, le = <=, lparen = (, lt = <, 
rarrow = =>, rbrace = }, semi = ;, tilde = ~, at = @.

Lexical Analyzer Generators

For this assignment, you must use a lexcical analyzer generator. For example, every OCaml distribution has an ocamllex lexical analyzer generator.

Commentary

Tests

You can use the following tests to check your implementation.

You can do basic testing with something like the following:

$ cool --out reference --lex file.cl
$ my-lexer file.cl 
$ diff -b -B -E -w file.cl-lex reference.cl-lex

For example, if you used OCaml:

$ cool --out reference --lex file.cl 
$ ocamllex my-lexer.mll 
$ ocaml my-lexer.ml file.cl 
$ diff file.cl-lex reference.cl-lex

You may find the reference compiler's --unlex option useful for debugging your .cl-lex files.

Need more testcases? Any Cool file you have works fine. The contents of cool-examples should be a good start. There's also one among the PA1 hints. You'll want to make more complicated test cases --- in particular, you'll want to make negative testcases (e.g., testcases with malformed string constants).

Video Guides

OCaml + ocamllex

FAQs

String length: A Cool string has at most 1024 characters. See Constants under Expressions in the Cool manual.

Turn-In and Grading

What To Turn In For PA2

PA2

You must turn in a zip file containing these files:

readme.txt: a plain ASCII text file describing your design decisions In addition, answer the following questions in your readme.txt:
1. What are some challenging parts for this assignment? What did you do to solve them?
2. Suggestions for both the instructor and future students for this assignment?
source files :
- lexer.ml
- lexer.mll

See an example of a readme.txt file
If you work in a team, list the name of your other team member in the readme.txt.

Grading Rubic

PA2 Grading (out of 50 points)

41 points : for autograder tests (-1 point per failed test, minimum score of 0)
- Each missed test removes points, to a minimum of 0, even if there are more tests than total points.
5 points : for a clear description in your README
- 5 : thorough discussion of design decisions (including handling of strings and comments) and answering given qeustions; a few paragraphs of coherent English sentences should be fine
- 2 : vague or hard to understand; omits important details
- 0 : little to no effort cases
4 point : for code cleanliness
- 4 : code is mostly clean and well-commented
- 2 : code is sloppy and/or poorly commented in places
- 0 : little to no effort to organize and document code