Based on Chapter 8 of Graham Hutton's Programming in Haskell.

What is a parser? A parser is a program that converts an input stream into a representation of its syntactic structure. Consider a file containing the string "3+4*5". A parser is the program which converts that string into an expression tree: with the usual precedence, (3+(4*5)), or Plus(3,Times(4,5)). Or consider the typechecker example. It had a datatype for expressions. How did we get those expressions in the first place?

Parsers are used all over. The command line in UNIX has a parser. SQL queries you send to a server go through one. The web browser uses one to turn HTML pages into pictures on screen. And, of course, every compiler has one. This set of lectures is all about explaining that process, and doing so in a monadic style. This type of parser is very easy to write and understand in the context of a functional programming class, though it is a bit underpowered for a real compiler. Those techniques are covered in more detail in the Compilers I class, which will hopefully be offered next year.

So, what is a parser? The more general version of a parser is called a reader in this code. You could imagine that it was:

type reader = string -> tree

But this might fail, so we'd extend it to:

type reader = string -> tree option

And we might not use up all of the input data:

type reader = string -> (tree * string) option

Of course, we look at that as functional programmers and have the instinct to generalize it. Our final type is:

type ('a, 'b) reader = 'b -> ('a * 'b) option

So a reader is a function that consumes some source (of type 'b) and, if successful, returns a result of type 'a and the unused remainder of the source (also of type 'b). Parsing is the process of applying these functions to inputs to produce either an output datatype or NONE, indicating failure.

<< switch to computer, bring up reader.sml >>

Here's the definition of the reader type, just as we derived above. Now, let's go over seven readers.
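The derivation above can be made concrete with a small sketch. The charRdr function below is a hypothetical example, not part of reader.sml; it assumes a plain character list as the source type.

```sml
(* The generalized reader type from above. *)
type ('a, 'b) reader = 'b -> ('a * 'b) option

(* Hypothetical example: a (char, char list) reader that consumes one
   character from the source, failing on empty input. *)
fun charRdr []        = NONE
  | charRdr (c :: cs) = SOME (c, cs)

(* charRdr (explode "abc") evaluates to SOME (#"a", [#"b", #"c"]) *)
```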
After we go over them, I will walk through an example that uses many of them, to help you get an idea of how they work. Then, on Wednesday, we will go over a second example that uses these same combinators.

First is the return reader. It passes along the current input unchanged (called instream) and returns the provided value. Next up is fail, which simply returns NONE.

Chain takes a reader p and a function f. If the reader p succeeds, then f is applied to the value it returns, yielding a second reader, which in turn is applied to the remainder of the source produced when p was run. Chain is used to perform parsing actions in sequence, where the second action depends on the value produced by the first.

Choice takes two readers, p and q. It returns the result of p if it succeeds, and otherwise returns the result of reading with q, starting from the same source that p was given.

The star and starPlus operators correspond to the Kleene star and Kleene plus operators. They run a reader as many times as it succeeds, until it fails, returning the accumulated results. First, notice that their types are slightly more interesting: they take an ('a, 'b) reader but return an ('a list, 'b) reader. This change handles the difference in functionality from what we described earlier. If you remember, "readers return NONE to indicate they failed to parse." But the star operator always succeeds, even when it was able to do zero reads. So in that case we need to be able to return SOME (v, s) to indicate that we "successfully read nothing". The returned value is the list nil, indicating a successful parse with zero read actions.

So, how do they work? Star uses the choice operator, which, you will recall, tries the first reader and returns its result if it works, and otherwise runs the second reader. In this case the second, (return []), does what we just discussed.
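As a reference while reading reader.sml, here is one way the first four combinators could be written. This is a hedged sketch over a bare source value; the real code works on the instream abstraction and may differ in details.

```sml
(* return: succeed with v, leaving the source untouched. *)
fun return v = fn src => SOME (v, src)

(* fail: always return NONE. *)
fun fail _ = NONE

(* chain: run p; on success, feed its value to f to get a second
   reader, then run that reader on the leftover source. *)
fun chain p f = fn src =>
      case p src of
          NONE           => NONE
        | SOME (v, rest) => f v rest

(* choice: try p; if it fails, try q on the same source. *)
fun choice (p, q) = fn src =>
      case p src of
          NONE   => q src
        | result => result
```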
The first handles the case where one or more iterations of the reader action were successful, returning a list accumulating the values they returned. The starPlus reader first performs a read, then chains that value forward to a new reader, defined inline here. This second reader takes the returned result and then calls the original star function on the reader to get more parsed values. Finally, it combines its own result with the result returned from the call to star (which, remember, will never be NONE). This pair of functions is by far the most complicated, and you should be prepared to spend some time looking at them before the details of how they work really fall into place.

The final reader function, invert, allows conditional parsing. It takes a preliminary result and then tries to apply another reader. If that reader fails, invert succeeds with the preliminary result; if that reader succeeds, invert fails.

We'll see some examples, but since it's lunchtime, let's imagine we're trying to classify pastries. There's some stream of pastries and we have a set of readers. In a choice, our donut reader returns the fact that we have found a donut. But our implementation also cares about whether it is a cruller, old-fashioned, or jelly. So we use invert to check whether it is any of those three (in a choice) and return donut only if it is not. You will see where this feature comes in handy in the lexical analysis example to come.

Questions about the reader or parsing before we move on?

===

So I lied a little bit earlier. We won't be going from a string to an expression tree directly. First, we will go from a string to a list of tokens. Then, we will turn those tokens into an expression tree. Today, we'll finish off by going through the lexical analysis example. Wednesday, we'll cover the parser, which takes those tokens and produces an expression tree.

So, what is a token? Let's open up token.sml and see.
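Following the description above, star and starPlus can be sketched as a mutually recursive pair, and invert as a success/failure flip. These are assumptions about the shape of reader.sml, built on the return, chain, and choice combinators just discussed.

```sml
(* star: zero or more reads; always succeeds, possibly with nil. *)
fun star p = choice (starPlus p, return [])

(* starPlus: one read via chain, then star for the rest, consing the
   first value onto the accumulated list. *)
and starPlus p =
      chain p (fn v =>
        chain (star p) (fn vs =>
          return (v :: vs)))

(* invert: succeed with the preliminary value v only when p fails. *)
fun invert v p = fn src =>
      case p src of
          NONE   => SOME (v, src)
        | SOME _ => NONE
```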
There's a datatype that defines all of the individual atomic elements of the string stream. In this case, we have identifiers (ID), natural numbers, a bunch of symbols, and the keywords of our language.

Before we get too far into the language, let's see everything working.

<<< switch to the command prompt >>>

rlwrap sml
CM.make "fun.cm";
use "test-lexer.sml";
tokenize "donut";
tokenize "let";
tokenize "1 + *donut";
tokenize "^";
tokenize "1+2^";

So, how does this function work? Basically, it takes a string as input, wraps it into one of those instream abstractions that you saw used in reader.sml, and then calls lexerfn.tokenize.

Let's look at the lexer, which is in lexerfn.sml. As you can see from its interface, it exposes two functions. The first is a reader that takes an input stream and returns either NONE or a token and the rest of the stream. The other is the one used in the test session, which reads as many tokens from the head of the input as possible, returning them in a list along with the rest of the stream. Let's hop down and look at the definition of that first. If you remember the starPlus operator --- which matches one or more occurrences --- you'll see that this is what we wrap around tokenRdr to produce the high-level function.

Back up to the top. Since the input stream is basically just a bunch of characters, it shouldn't be too surprising that the first few functions are all higher-order functions that check whether the character at the beginning of the stream has a given property. For example, look at isAlpha. We use the sat function paired with the Char.isAlpha predicate to build a reader that, when given an input stream, returns the character and the rest of the input stream if the character is alphabetic, and otherwise returns NONE. The others are similar. The wrapper functions checkChar and checkChars check for exact identity; checkChars is used to determine whether a particular string is at the head.
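To make the discussion concrete, here is a hypothetical slice of the token datatype together with a sketch of sat. The real token.sml has more constructors, and the real lexer reads from an instream rather than the plain character list assumed here.

```sml
(* A few representative token constructors (the real datatype has more). *)
datatype token = ID of string | NAT of int | PLUS | TIMES | LET | QUIT

(* sat: read one character if it satisfies the predicate. *)
fun sat pred []        = NONE
  | sat pred (c :: cs) = if pred c then SOME (c, cs) else NONE

(* An isAlpha-style reader built from sat and a Char predicate. *)
val isAlpha = sat Char.isAlpha
```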
For example, you can pass the characters "l" "e" "t" to checkChars to generate a reader that determines whether the keyword "let" is at the head of the stream.

Identifiers are a chain that begins with an alphabetic character, followed by some number of alphanumerics; implode is a function that takes a list of characters and turns it into a string. Nat reads a natural number. Space eats up any number of whitespace characters, returning unit, since we don't want to produce a token for whitespace. As you can see in the token function, we use the space function to gobble up any spaces before a token.

Finally, we create the real readers that we will use in tokenRdr by using the token wrapper to build readers from the predicates provided. Notice the keyword function, which uses invert to ensure that we don't match "let" from the stream when it is followed by more alphanumeric characters --- it's not the keyword let if the string is "letter". The +++ infix operator (pronounced "orelse") is mapped to the choice operator, which you will see used in the next function (otherwise, we would have a hugely-nested chain of operations slowly moving further and further right on the screen).

The tokenRdr is now a pretty readable function. It starts out using chain to check whether there is a natural number and, if so, return the NAT token. Otherwise (the +++, which is choice), it checks for the quit keyword, and so on until it gets to identifier. Note that order matters here! All of the keywords have to come before the identifier option. Can anybody tell me why that is? We finish off with the symbols... and that's where we'll be stopping for today.
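Putting the pieces together, ident, nat, and keyword might look roughly like this. These definitions are a sketch assuming the combinators described above (sat, chain, star, starPlus, return, invert, checkChars), not the literal contents of lexerfn.sml.

```sml
(* ident: an alphabetic character followed by any alphanumerics,
   imploded into a string. *)
val ident =
      chain (sat Char.isAlpha) (fn c =>
        chain (star (sat Char.isAlphaNum)) (fn cs =>
          return (implode (c :: cs))))

(* nat: one or more digits, converted to an int. *)
val nat =
      chain (starPlus (sat Char.isDigit)) (fn ds =>
        return (valOf (Int.fromString (implode ds))))

(* keyword: match the given characters, then use invert to fail if an
   alphanumeric follows --- so "letter" lexes as an ID, not as LET. *)
fun keyword s tok =
      chain (checkChars (explode s)) (fn _ =>
        invert tok (sat Char.isAlphaNum))
```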