Milestone #3: Front-End (Scanner & Parser)

Due: Friday February 7th at 11:59pm

This milestone will require you to get started implementing the front-end for the compiler.

Working Individually or as a Pair

You are allowed to work alone or in teams of two for the compiler project. Before creating your repository via the invitation, you will first need to

  1. Decide if you want to work alone or find a partner in the class.

  2. Come up with a fun team name or if you are working alone a name for your compiler. Be creative (if you wish)!

  3. The first person (or yourself if you are working alone) to accept the Github classroom invite will enter in the team name. If working in a pair, the second group member will need to click on their team name in-order to join the repository.

Once you have done these two steps, then you can move on to Getting started section.

Getting started

A Git repository will be created for you on GitHub. However, before that repository can be created for you, you need to have a GitHub account. If you do not yet have one, you can get an account here: https://github.com/join.

To actually get your private repository, you will need this invitation URL:

  • Compiler Project Invitation (Please check the Post “Compiler Project is ready” Ed)

When you click on an invitation URL, you will have to complete the following steps:

  1. See step 3 in “Working Individually or as a Pair” section for the first step.

  2. You now need to clone your repository (i.e., download it to your machine).
    • Make sure you’ve set up SSH access on your GitHub account.

    • For each repository, you will need to get the SSH URL of the repository. To get this URL, log into GitHub and navigate to your project repository (take into account that you will have a different repository per project). Then, click on the green “Code” button, and make sure the “SSH” tab is selected. Your repository URL should look something like this: git@github.com:mpcs51300-win24/proj-TEAM_NAME-GITHUB-USERNAME.git.

    • If you do not know how to use git clone to clone your repository then follow this guide that Github provides: Cloning a Repository

If you run into any issues, or need us to make any manual adjustments to your registration, please let us know via Ed Discussion.

Language Overview: GoLite

Please make sure you read over the document that describes the language we will be implementing this quarter: Language Overview

Compiler Structure

To help with code modularity, we will provided a specific repository structure throughout the implementation of the compiler. You are able to include additional directories and files to help with your implementation. You are not allowed to use any external third-party libraries unless Professor Samuels specifically gives you permission to do so. You may only use the libraries provided in the standard library of your chosen language. If you think you need to use an external library then ask before using it; not adhering to this requirement will result in a major penalty on this milestone and the final submission of the compiler.

The initial contents of the repository includes directories and files needed to help you get started implementing the front-end of the compiler. Lets review these directories now:

  • ast - This directory is currently empty. You will place your AST node files in this directory.

  • benchmarks - Throughout the implementation of the compiler, you can place sample Golite programs in this directory to help test the various phases. Professor Samuels will provide additional more robust benchmark tests.

  • golite - The directory that includes the main driver for the compiler (i.e., the file that includes the main function or entry point code into the compiler).

  • grammars - The directory that will contain the ANTLR grammar files and the full ANTLR runtime jar file (i.e., antlr-4.11.1-complete.jar). Please note, we are aware there is a more updated version of the ANTLR jar; however, you will not need this updated version for this course. If you are running ANTLR locally then you will need to install Java and Java Runtime on your local machine: Install Java Runtime . The Java is already installed on the linux servers. You will also see a file called generate.sh. This is a bash script used to generate the lexer and parser. Right now, it’s preset to produce Go code but you can change the Dlanguage=YOUR_LANGUAGE flag to your specific language. When ran, the generate.sh file will automatically place the generated lexer and parser files into the parser and lexer directories.

  • lexer - The directory contains the autogenerated lexer files. Additionally, you will need to write code to connect the autogenerated code into the your compiler implementation.

  • parser - The directory contains the autogenerated parser files. Additionally, you will need to write code to connect the autogenerated code into the your compiler implementation.

  • build.sh - You are required to write a bash script that fully builds the compiler into an executable. This script must also call any other scripts that generate/build any additional components of the compiler. If you using a language that does not build an executable (i.e., Python), then just generating/building all the required components is fine. You will provide instructions in the README file that explains how to run your compiler.

  • README - This document provide specific instructions to run your compiler on the CS Linux Servers. You must be detailed and explicit with these instructions. Please break these instructions down into specific steps (ie.. Step 1, 2, 3, etc.). Also provide instructions on how to run a a program and specify flags. Failing to provide a straight-forward way of executing your compiles will result in significant deductions to your score.

You are not allowed to change this default structure. However, you can add additional and files and directories as you wish. If you need to change this default structure for your language of choice then please reach out to Professor Samuels for permission. The reason for this structure is that it makes all projects consistent and easier to grade and provide feedback. As we build each phase of the compiler, we will state the additional required directories.

CS Linux Servers and Golang

Always make sure that your compiler runs correctly on the CS Linux Servers. If it doesn’t run on those servers then you will receive an automatic zero for your completeness score for the project. We need to be able to grade them specifically on those machines. Feel free to include additional directories/files as needed.

The default version of Go on the CS Linux servers is 1.13, which is very old. You need to use the module environment command to specify an updated version. The CS Linux Servers do have Go 1.21.7 now installed using the module command. You can remotely run Go programs on a CS Department Linux computer by following the below steps (note do not actually enter in $:):

$ module load golang/1.21.7

Note You will always need to do this step each time you login to a CS Linux machine; otherwise, you will be using the old version of Golang or include in your .bashrc file.

Verify the correct version was loaded by running the version Go command:

$ go version go1.21.07

linux/amd64

Compiler Implementation Languages

You are required to implement the entire compiler in one of the following languages:

  1. Java

  2. C++

  3. Java (or one of its dialects)

  4. Python 3

  5. Go

You are not allowed to implement the compiler in any other language. We recommend you implement the compiler in Go since Professor Samuels and TA will be able to help you faster. Implementing, the compiler in other languages may result in the course staff not fully be able to help quickly fix problems. Please keep in mind that building and fixing errors in your compiler will be your sole responsibility but the course staff will help in any way possible.

Milestone Requirement: Scanner & Parser

The main requirements for this milestone is the following

  1. Working ANTLR generated lexer.

  2. Working ANTLR generated parser.

  3. AST Construction from the ANTLR generated parse tree.

You must use ANTLR (ANother Tool for Language Recognition) to build your scanner (i.e., lexer) and parser. The grammar files must be placed inside the grammars directory. We already have created this directory for you and have placed the ANTLR jar file inside this directory.

Task 0: Learning ANTLR

Your first task is to understand ANTLR system by reading over it’s documentation:

ANTLR4 Docs

You are required to read over all the sections under the Sections area, with the exception of these sections:

  • Unicode U+FFFF, U+10FFFF character streams

  • Parsing binary streams

  • Parser and lexer interpreters

  • Resources

Professor Samuels will shortly post a video explaining how you can implement the Cal language using ANTLR and connecting it to a compiler implementation. However, We want you to first read over these docs before watching the video so you understand better what’s happening in the video.

Task 1: Implementing the Lexer using ANTLR

Assume the following Golite program is defined in a file simple.golite inside the benchmarks/simple/ directory:

1  func main() {
2
3    var a int;
4
5    a = 3 + 4 + 5;
6
7   printf("%d", a);
8
9 }

For the lexer, the compiler must read in a Golite program and produce a token stream that will be provided to the parser. The default execution of the compiler at this point is to just run the lexer and parser. However, add a flag -l to command line arguments such that it will print out each token on a separate line for a given program. With the -l command the execution of the compiler stops at the end of lexing.

Note

Make sure to think about your function decomposition inside driver file because you will be adding more command line arguments to the compiler as the quarter progresses.

Assuming you are inside the driver golite directory, a sample run using simple.golite is as follows ($: is just mimicking the command line)

$: go run golite.go -l ../benchmarks/simple/simple.golite
Token.FUNC(1)
Token.IDENT(1, "main")
Token.LPAREN(1)
Token.RPAREN(1)
Token.LBRACE(1)
Token.VAR(3)
Token.IDENT(3, "a")
Token.INT(3)
Token.SEMICOLON(3)
... (I'm not printing them all out but you get the idea)

Your output does not have to look like the above output. We just want to see a series of tokens being printed out to verify that your scanner is working. You can print and format the output as you wish. Your implementation must go under a specific directory dedicated to the lexer (e.g. lexer directory) along with the generated lexer code. Please note you will need to write more code to incorporate the generated lexer into your compiler.

Task 2: Implement the Parser and the Abstract Syntax Tree (AST)

For the parser, the compiler must read in a Golite program and parse through the input source code. This will require that you connect your lexer code to your parser code. For this milestone, you must also design and construct an abstract syntax tree based on the language specification and formal grammar. You must determine the appropriate representations for your nodes based on the formal grammar and the AST will be the final output for this milestone. To help you with this construction, here are few nodes you should think about defining:

  • Statement nodes: Create a node for each type of statement in the grammar. Additionally, we recommend that you define a Statement interface that all statement nodes can inherit methods unique to statements.

  • Expression nodes: Create a BinaryExpr, UnaryExpr and FieldExpr nodes to handle binary expressions (i.e., integer and boolean), unary expressions, and field expressions (e.g., mystruct.coords.x). We also recommend that you define an Expression interface that all expression nodes can inherit methods unique to expressions. Also define a Variable (i.e., identifiers) node, which is also considered an expression. DO NOT define a node for each type of term (e.g., BoolTerm and SimpleTerm etc.) seem in the grammar. These all represent the forming of binary expressions so you can handle them easily by generating binary expression nodes.

  • Literal nodes: For each type of literal value (i.e., integer, boolean, etc.), define a node that represents that type of literal value. Since literals are expressions, they should also inherit from your Expression interface.

  • Left hand-side value: Every assignment statement has a left-hand side (i.e., an l-value). You should define a node that can handle l-values that are just identifiers (e.g., x and y) and field l-values (e.g., point.x and point.y).

  • I/O and allocation nodes: Define nodes for reading (i.e., 'scan') and printing (i.e., 'printf') along with nodes to handle memory allocation/deallocation (i.e., new and delete).

  • Top-level nodes: Define specific node for a type declaration, **function*, and the all encompassing program node.

You may define a few more nodes then this but these suggestions should cover most node definitions you’ll need in your AST. Please refer back to the lecture slides and videos for tips and suggestions about the AST construction. Place all node definitions inside the ast directory. We recommend having a single file for each node definition. For example, if you are implementing your language in Go then your ast will have files such as: program.go, function.go, type_decl.go, etc.

At this point, the default output of the compiler will print out the AST. We are not requiring a specific printing format. You are free to print the AST as you wish. However, you must define String() methods for each AST node to print out its representation as string. If there is an error in the lexing or parsing of a program then the compiler must produce a formatted syntax error(s). The syntax error message must be formatted as follows

syntax error(LINE_NUMBER, COLUMN_NUMER): MESSAGE

where LINE_NUMBER is the line number, COLUMN_NUMBER is the column number where the error occurred. The MESSAGE is a helpful error message. You can either collect all possible errors in the program or exit early if you find one syntax error (the latter is easier to implement).

Assume the following Golite program is defined in a file bad.golite inside the benchmarks/bad/ directory:

1  func main() {
2
3    var a bool;
4
5    a = +;
6
7   printf("%d", a);
8
9 }

Assuming you are inside the driver golite directory, a sample run using simple.golite and bad.golite is as follows ($: is just mimicking the command line)

$: go run golite.go ../benchmarks/simple/simple.golite
  func main() {
    var a int;
    a = 3 + 4 + 5;
    printf("%d", a);
  }

$: go run golite.go ../benchmarks/bad/bad.golite
syntax error(5,3): expected expression on right-hand side but found '+'

The message does not have to match the one above. You can just use the message you get back from ANTLR. Notice the default execution of the compiler at this point is just running through both the lexer then the parser. Additionally, the print out of the AST looks very similar to the original source code. You should try to do the same in your implementation; however, it does not to be exact but close to the original source code.

Grading and What to Submit

You must provide the following to get full credit for this milestone:

  1. Submit a few Golite programs (e.g., simple1.golite, simple2.golite, etc.) that verifies your lexer and parser are running correctly.

  2. Make sure to explicitly fill out the README file as explained above.

The milestone is 6% of your grade and the exact weights for grading are:

  • 6% credit: Parses everything correctly with a few minor errors.

  • 5% credit: 75% and above of all tokens are being identified and parsed correctly.

  • 4%-3% credit: 74%-50% of all tokens are being identified and parsed correctly.

  • 3%-0% credit: No solution provided or less than 50% of the tokens in the language are being identified.

We will verify “correctness” by checking the output of your AST from parsing.

Submission

Before submitting, make sure you’ve added, committed, and pushed all your code to GitHub. You must submit your final work through Gradescope (linked from our Canvas site) in the “Milestone #3” assignment page via two ways,

  1. Uploading from Github directly (recommended way): You can link your Github account to your Gradescope account and upload the correct repository based on the homework assignment. When you submit your homework, a pop window will appear. Click on “Github” and then “Connect to Github” to connect your Github account to Gradescope. Once you connect (you will only need to do this once), then you can select the repsotiory you wish to upload and the branch (which should always be “main” or “master”) for this course.

  2. Uploading via a Zip file: You can also upload a zip file of the assignment directory.