Recall: a parser converts an input stream to its syntactic structure. We created readers of the type:

    type ('a, 'b) reader = 'b -> ('a * 'b) option

That turned an input stream of characters into a stream of tokens. Today, we will turn those tokens into an expression tree. Let's start with some examples:

<<< move to the SML session >>>

    CM.make "fun.cm";
    use "test-lexer.sml";
    use "test-parser.sml";

So, how did the tokenizer work?

    tokenize "1+2*3";

Now, let's try parsing it. Notice that precedence is correct.

    parse "1+2*3";

And just so you know we don't have it hardcoded to do the right thing:

    parse "1*2+3";

This parser also supports more complicated expressions, covering the full mini-ML language.

    tokenize "fun f(i)=i+1";
    parse "fun f(i)=i+1";
    tokenize "let x = 1 in x*37+1";
    parse "let x = 1 in x*37+1";

Finally, we also report errors:

    parse "^";

And we even handle parsing a partial expression and then generating an error:

    parse "1+2^";

Let's start by looking at the datatypes that define our expression tree.

<<< open syntax.sml >>>

Unit, num, and bool should be straightforward. Var is a variable, represented as a string, and the primops are operations like plus; the name stands for "primitive operations." The rest of the constructors are recursive. That is, each of them contains one or more expressions inside of it. This is the point where we transition from something shaped like our stream of tokens to an expression tree, and figuring out how to turn the stream of tokens into these structures is where most of the complexity will appear in our model of the language.

So, what is this language we're parsing in the first place?

<<< open README, scroll to the "here is the full grammar" sentence >>>

This specification format is the one that we use to represent the definition of a language. Formally, it is known as EBNF, or Extended Backus-Naur Form, and it is widely used not just in the compilers field but also, for example, in specifications of protocols and data formats. The general form is a nonterminal on the left and a sequence of nonterminals and terminals on the right. You can read this first rule as, "a nat is defined as the terminal Nat". In this context (and in most compiler-related uses, such as in SML/NJ or Manticore), the terminals correspond exactly to the tokens that come out of the lexer. So, this shorthand defines the language that our parser accepts.

Let's start with the atomic expressions. The nat, bool, and id nonterminals should be unsurprising, as they simply correspond to the underlying terminals/tokens. The definition of atom itself is the first interesting one. The pipe symbols denote "or". So, an atom is either a nat, a bool, an identifier, or a parenthesized expression.

An application is two or more atoms adjacent to one another. So, f 3 is one example, but so is f g h i 3. The latter contains multiple chained applications and corresponds to the expression ((((f g) h) i) 3), because application binds to the left.

Arithmetic expressions are the first place where things get interesting. So, does anybody know why we can't use the following definition?

    aexpr ::= aexpr "+" aexpr
            | aexpr "-" aexpr
            | aexpr "*" aexpr
            | aexpr "/" aexpr

It's because we need to account for _precedence_, namely, that multiplication and division bind more tightly than addition and subtraction. With this definition, it would be ambiguous in our language whether 1+2*3 should be read as ((1+2)*3) or as (1+(2*3)). The fix, which you can see in the README, is to split the rule into layers, one layer per precedence level. Note that even with the split, we're still ambiguous about (1*3/2).
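To make the split concrete: judging from what we'll see in the parser in a moment, the additive layer has this shape (my transcription; check the README for the exact rules):

    aexpr ::= aterm "+" aexpr
            | aterm "-" aexpr
            | aterm

with a second, analogous layer deriving aterm from "*" and "/". Because "+" and "-" can now only appear above an aterm, 1+2*3 has to put the multiplication deeper in the tree.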
These split rules do not distinguish between the two readings, ((1*3)/2) and (1*(3/2)). In practice, these decisions are usually specified and implemented deterministically (e.g., left-to-right), but that can be handled at the level of the parser implementation rather than in the grammar.

Relation expressions are simpler than arithmetic ones, but boolean expressions have a similar problem: && binds more tightly than ||. Finally, we have conditionals, let expressions, and function expressions, which make up the toplevel expressions of the grammar.

So, let's go over to the parser and see how we handle implementing all of these rules!

<<< open parserfn.sml >>>

After a few definitions and basic functions similar to those in lexerfn from yesterday, we get to the type definition for parser. Note that it is simply a reader, but we call it a parser because it will actually go from a string all the way to a parse tree.

The first interesting function is isToken. This function takes a token and returns a parser (reader) that uses the underlying lexer to retrieve the next token from the input stream and then checks it against the provided token. Glancing down at boolexpr, you can see two uses of the isToken function, once to check if TRUE is in the stream and once for FALSE. Notice that the number, identifier, and variable parsers cannot use isToken, because they need to not only check the token but also pull out the integer or string associated with that token class.

With arithmetic expressions, we finally get to the more interesting part. This set of mutually recursive functions defines all of the rules from the EBNF grammar that we just looked at. The biggest thing to notice here is how mechanical the translation is from that EBNF to this monadic parsing style. Let's look at the aexpr function and the EBNF for aexpr in particular.

<<< split screen horizontal, bring up README and show the EBNF for aexpr >>>

In the EBNF, we first parse an aterm, then one of the three clauses. Looking at the monadic parsing version, you see a chain of an aterm with three clauses, connected by the orelse combinators we went over yesterday. The first one chains a plus and another arithmetic expression, the second chains a minus and another arithmetic expression, and the final one falls through and just returns if neither of those two was available. (There is a self-contained sketch of this shape below, before we move on to testing.) The biggest non-syntactic difference between the two is the additional code in the monadic parser that calls return. These are referred to as the _actions_ and are the pieces of code that actually build the tree. As you can see, we build up nodes from the Syntax.expr datatype.

<<< close the README pane >>>

The rest of these are all pretty straightforward until we get to appexpr. The only reason that it is interesting is that application is _left_ associative.

<<< inc-search back to fixApp >>>

So, we build up the list of atoms and then use the helper function to build the App nodes by munching them from the left (also sketched below).

<<< return to appexpr >>>

Starting with cexpr, we have conditional, let, case, and fn expressions. Each of them is unsurprising: parse each of the constituent subpieces and then return the constructed datatype. The final bits with declarations and statements mirror the Standard ML handling at the top level of the interpreter: you can create function definitions, create val bindings, or have a single expression. It also adds a QUIT keyword, which is present for the interpreter. I don't know if Prof. MacQueen will be covering the evaluator/interpreter portion, but it's there in case we build an interactive loop, so that you know when to stop taking and executing commands.
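As promised, here is a self-contained sketch of the aexpr shape just described. So that it runs on its own, it defines its own tiny reader type, combinators, token type, and tree type; all of these names are my stand-ins and will not match parserfn.sml exactly, and the atoms below aterm are elided down to bare numbers.

    (* A reader, as before: consume a prefix of 'b, produce an 'a. *)
    type ('a, 'b) reader = 'b -> ('a * 'b) option

    (* Succeed immediately, consuming nothing (monadic return). *)
    fun return x = fn strm => SOME (x, strm)

    (* Fail immediately. *)
    fun fail _ = NONE

    (* Sequence: run p, then pass its result to f (monadic bind). *)
    fun bind (p, f) = fn strm =>
          case p strm of
              NONE => NONE
            | SOME (x, strm') => f x strm'

    infix >>=
    val op >>= = bind

    (* Biased choice: try p; if it fails, try q (the "orelse" combinator). *)
    fun choice (p, q) = fn strm =>
          case p strm of
              NONE => q strm
            | res => res

    (* Stand-ins for the real token stream and the Syntax.expr tree. *)
    datatype token = NAT of int | PLUS | MINUS | TIMES | DIV
    datatype expr  = Num of int | BinOp of string * expr * expr

    (* Pull the next token off a plain token list. *)
    fun next [] = NONE
      | next (t :: ts) = SOME (t, ts)

    (* Match one specific token, as isToken does. *)
    fun isToken t = next >>= (fn t' => if t' = t then return t else fail)

    (* nat ::= Nat, with an action building the leaf node. *)
    fun nat strm = (next >>= (fn NAT n => return (Num n) | _ => fail)) strm

    (* aexpr ::= aterm "+" aexpr | aterm "-" aexpr | aterm
       aterm ::= nat "*" aterm | nat "/" aterm | nat *)
    fun aexpr strm =
          (aterm >>= (fn e1 =>
             choice (isToken PLUS >>= (fn _ =>
                       aexpr >>= (fn e2 => return (BinOp ("+", e1, e2)))),
             choice (isToken MINUS >>= (fn _ =>
                       aexpr >>= (fn e2 => return (BinOp ("-", e1, e2)))),
                     return e1)))) strm
    and aterm strm =
          (nat >>= (fn e1 =>
             choice (isToken TIMES >>= (fn _ =>
                       aterm >>= (fn e2 => return (BinOp ("*", e1, e2)))),
             choice (isToken DIV >>= (fn _ =>
                       aterm >>= (fn e2 => return (BinOp ("/", e1, e2)))),
                     return e1)))) strm

    (* aexpr [NAT 1, PLUS, NAT 2, TIMES, NAT 3] evaluates to
       SOME (BinOp ("+", Num 1, BinOp ("*", Num 2, Num 3)), []) *)

Note how each EBNF clause becomes one arm of a choice, and each return is the action that builds the tree node for that clause.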
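And here is the idea behind fixApp, again with my own stand-in names (a fresh mini datatype, unrelated to the sketch above): collect the adjacent atoms into a list, then fold App nodes over it from the left so that application associates the way it should.

    (* Stand-in syntax; the real constructors live in Syntax. *)
    datatype expr = Var of string | App of expr * expr

    (* Munch the argument list from the left, exactly so that
       f g h i x becomes ((((f g) h) i) x). *)
    fun fixApp (e, []) = e
      | fixApp (e, arg :: args) = fixApp (App (e, arg), args)

    (* Equivalently: foldl (fn (a, f) => App (f, a)) e args *)

    (* fixApp (Var "f", [Var "g", Var "h", Var "i", Var "x"]) evaluates to
       App (App (App (App (Var "f", Var "g"), Var "h"), Var "i"), Var "x") *)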
So, how do we use what we've built? It's similar to the test lexer from Monday.

<<< open test-parser.sml >>>

The parser functor is instantiated with StringStream, just like the last example. The parse function even looks somewhat similar, but this time it is configured to parse all of the expressions in the string in sequence. If we get a statement (which wraps the expressions), we print it and then loop over the rest of the input. When we get NONE back from the parser, it can happen for two reasons: either the stream is empty and there is nothing left to read, or there is more in the stream that the lexer or parser could not handle. In order to print that leftover input, I created a small reader called readAllTheThings that munches any remaining input (using the starPlus combinator) and then returns those characters in a string for error reporting. And that's how we generated the error messages you saw in the original examples.
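For reference, the loop has roughly the following shape. I've parameterized the sketch over the statement parser, the printer, and the leftovers reader so that it stands alone; in test-parser.sml those come from the instantiated functor, and readAllTheThings is the real name while the rest are my stand-ins.

    type ('a, 'b) reader = 'b -> ('a * 'b) option

    (* Parse statements one after another until the parser gives up,
       then decide whether that was a clean end of input or an error. *)
    fun parseAll (parseStmt : ('s, 'strm) reader,
                  toString  : 's -> string,
                  readRest  : (string, 'strm) reader) =
      let
        fun loop strm =
              case parseStmt strm of
                  SOME (stmt, rest) =>
                    (print (toString stmt ^ "\n");  (* got a statement: print it *)
                     loop rest)                     (* and keep going *)
                | NONE =>
                    (* NONE is either end-of-input or a lexer/parser failure;
                       slurping the leftovers tells the two apart *)
                    (case readRest strm of
                         SOME ("", _) => ()         (* nothing left: done *)
                       | SOME (junk, _) =>
                           print ("could not parse: " ^ junk ^ "\n")
                       | NONE => ())
      in
        loop
      end

The real parse function presumably also wraps the input string up as a StringStream before starting the loop; the important part is the NONE case, which is where readAllTheThings earns its keep.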