Based on Chapter 8 of Graham Hutton's Programming in Haskell.

What is a parser? A parser is a program that converts an input stream into a representation of its syntactic structure. Consider a file containing the string "3+4*5". A parser is the program which converts that string into an expression tree: with the usual precedence, (3+(4*5)), or Plus(3,Times(4,5)). Or consider the typechecker example. It had a datatype for expressions. How did we get those expressions in the first place?

Parsers are used all over. The command line in UNIX has a parser. SQL queries you send to a server go through one. The web browser uses one to turn HTML pages into pictures on screen. And, of course, every compiler has one. This set of lectures is all about explaining that process, and doing so in a monadic style. This type of parser is very easy to write and understand in the context of a functional programming class, though it is a bit underpowered for a real compiler. Those techniques are covered in more detail in the Compilers I class, which will hopefully be offered next year.

So, what is a parser? The more general version of a parser is called a reader in this code. You could imagine that it was:

type reader = string -> tree

But this might fail, so we'd extend it to:

type reader = string -> tree option

And we might not use up all of the input data:

type reader = string -> (tree * string) option

Of course, we look at that as functional programmers and have the instinct to generalize it. Our final type is:

type ('a, 'b) reader = 'b -> ('a * 'b) option

So a reader is a function that consumes some source (of type 'b) and, if successful, returns a result of type 'a and the unused remainder of the source (also of type 'b). Parsing is the process of applying these functions to inputs to produce either an output datatype or NONE, indicating failure.

<< switch to computer, bring up reader.sml >>

Here's the definition of the reader type, just as we derived above. Now, let's go over seven readers.
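The derivation above can be made concrete with a small sketch. The charRdr function below is a hypothetical example, not part of reader.sml; it assumes a plain character list as the source type.

```sml
(* The generalized reader type from above. *)
type ('a, 'b) reader = 'b -> ('a * 'b) option

(* Hypothetical example: a (char, char list) reader that consumes one
   character from the source, failing on empty input. *)
fun charRdr []        = NONE
  | charRdr (c :: cs) = SOME (c, cs)

(* charRdr (explode "abc") evaluates to SOME (#"a", [#"b", #"c"]) *)
```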
After we go over them, I will walk through an example that uses many of them, to help you get an idea of how they work. Then, on Wednesday, we will go over a second example that uses these same combinators.

First is the return reader. It passes along the current input unchanged (called instream) and returns the provided value. Next up is fail, which simply returns NONE.

Chain takes a reader p and a function f. If the reader p succeeds, then f is applied to the value it returns, yielding a second reader, which in turn is applied to the remainder of the source produced when p was run. Chain is used to perform parsing actions in sequence, where the second action depends on the value produced by the first.

Choice takes two readers, p and q. It returns the result of p if it succeeds, and otherwise returns the result of reading with q, starting from the same source that p was given.

The star and starPlus operators correspond to the Kleene star and Kleene plus operators. They run a reader as many times as it succeeds, until it fails, returning the accumulated results. First, notice that their types are slightly more interesting: they take an ('a, 'b) reader but return an ('a list, 'b) reader. This change handles the difference in functionality from what we described earlier. If you remember, "readers return NONE to indicate they failed to parse." But the star operator always succeeds, even when it was able to do zero reads. So in that case we need to be able to return SOME (v, s) to indicate that we "successfully read nothing". The returned value is the list nil, indicating a successful parse with zero read actions.

So, how do they work? Star uses the choice operator, which, you will recall, tries the first reader and returns its result if it works, and otherwise runs the second reader. In this case the second, (return []), does what we just discussed.
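As a reference while reading reader.sml, here is one way the first four combinators could be written. This is a hedged sketch over a bare source value; the real code works on the instream abstraction and may differ in details.

```sml
(* return: succeed with v, leaving the source untouched. *)
fun return v = fn src => SOME (v, src)

(* fail: always return NONE. *)
fun fail _ = NONE

(* chain: run p; on success, feed its value to f to get a second
   reader, then run that reader on the leftover source. *)
fun chain p f = fn src =>
      case p src of
          NONE           => NONE
        | SOME (v, rest) => f v rest

(* choice: try p; if it fails, try q on the same source. *)
fun choice (p, q) = fn src =>
      case p src of
          NONE   => q src
        | result => result
```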
The first handles the case where one or more iterations of the reader action were successful, returning a list accumulating the values they returned. The starPlus reader first performs a read, then chains that value forward to a new reader, defined inline here. This second reader takes the returned result and then calls the original star function on the reader to get more parsed values. Finally, it combines its own result with the result returned from the call to star (which, remember, will never be NONE). This pair of functions is by far the most complicated, and you should be prepared to spend some time looking at them before the details of how they work really fall into place.

The final reader function, invert, allows conditional parsing. It takes a preliminary result and then tries to apply another reader. If that reader fails, invert succeeds with the preliminary result; if that reader succeeds, invert fails.

We'll see some examples, but since it's lunchtime, let's imagine we're trying to classify pastries. There's some stream of pastries and we have a set of readers. In a choice, our donut reader returns the fact that we have found a donut. But our implementation also cares about whether it is a cruller, old-fashioned, or jelly. So we use invert to check whether it is any of those three (in a choice) and return donut only if it is not. You will see where this feature comes in handy in the lexical analysis example to come.

Questions about the reader or parsing before we move on?

===

So I lied a little bit earlier. We won't be going from a string to an expression tree directly. First, we will go from a string to a list of tokens. Then, we will turn those tokens into an expression tree. Today, we'll finish off by going through the lexical analysis example. Wednesday, we'll cover the parser, which takes those tokens and produces an expression tree.

So, what is a token? Let's open up token.sml and see.
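Following the description above, star and starPlus can be sketched as a mutually recursive pair, and invert as a success/failure flip. These are assumptions about the shape of reader.sml, built on the return, chain, and choice combinators just discussed.

```sml
(* star: zero or more reads; always succeeds, possibly with nil. *)
fun star p = choice (starPlus p, return [])

(* starPlus: one read via chain, then star for the rest, consing the
   first value onto the accumulated list. *)
and starPlus p =
      chain p (fn v =>
        chain (star p) (fn vs =>
          return (v :: vs)))

(* invert: succeed with the preliminary value v only when p fails. *)
fun invert v p = fn src =>
      case p src of
          NONE   => SOME (v, src)
        | SOME _ => NONE
```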
There's a datatype that defines all of the individual atomic elements of the string stream. In this case, we have identifiers (ID), natural numbers, a bunch of symbols, and the keywords of our language.

Before we get too far into the language, let's see everything working.

<<< switch to the command prompt >>>

rlwrap sml
CM.make "fun.cm";
use "test-lexer.sml";
tokenize "donut";
tokenize "let";
tokenize "1 + *donut";
tokenize "^";
tokenize "1+2^";

So, how does this function work? Basically, it takes a string as input, wraps it into one of those instream abstractions that you saw used in reader.sml, and then calls lexerfn.tokenize.

Let's look at the lexer, which is in lexerfn.sml. As you can see from its interface, it exposes two functions. The first is a reader that takes an input stream and returns either NONE or a token and the rest of the stream. The other is the one used in the test session, which reads as many tokens from the head of the input as possible, returning them in a list along with the rest of the stream. Let's hop down and look at the definition of that first. If you remember the starPlus operator --- which matches one or more occurrences --- you'll see that this is what we wrap around tokenRdr to produce the high-level function.

Back up to the top. Since the input stream is basically just a bunch of characters, it shouldn't be too surprising that the first few functions are all higher-order functions that check whether the character at the beginning of the stream has a given property. For example, look at isAlpha. We use the sat function paired with the Char.isAlpha predicate to build a reader that, when given an input stream, returns the character and the rest of the input stream if the character is alphabetic, and otherwise returns NONE. The others are similar. The wrapper functions checkChar and checkChars check for exact identity; checkChars is used to determine whether a particular string is at the head.
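To make the discussion concrete, here is a hypothetical slice of the token datatype together with a sketch of sat. The real token.sml has more constructors, and the real lexer reads from an instream rather than the plain character list assumed here.

```sml
(* A few representative token constructors (the real datatype has more). *)
datatype token = ID of string | NAT of int | PLUS | TIMES | LET | QUIT

(* sat: read one character if it satisfies the predicate. *)
fun sat pred []        = NONE
  | sat pred (c :: cs) = if pred c then SOME (c, cs) else NONE

(* An isAlpha-style reader built from sat and a Char predicate. *)
val isAlpha = sat Char.isAlpha
```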
For example, you can pass the characters "l" "e" "t" to checkChars to generate a reader that determines whether the keyword "let" is at the head of the stream.

Identifiers are a chain that begins with an alphabetic character, followed by some number of alphanumerics; implode is a function that takes a list of characters and turns it into a string. Nat reads a natural number. Space eats up any number of whitespace characters, returning unit, since we don't want to produce a token for whitespace. As you can see in the token function, we use the space function to gobble up any spaces before a token.

Finally, we create the real readers that we will use in tokenRdr by using the token wrapper to build readers from the predicates provided. Notice the keyword function, which uses invert to ensure that we don't match "let" from the stream when it is followed by more alphanumeric characters --- it's not the keyword let if the string is "letter". The +++ infix operator (pronounced "orelse") is mapped to the choice operator, which you will see used in the next function (otherwise, we would have a hugely-nested chain of operations slowly moving further and further right on the screen).

The tokenRdr is now a pretty readable function. It starts out using chain to check whether there is a natural number and, if so, return the NAT token. Otherwise (the +++, which is choice), it checks for the quit keyword, and so on until it gets to identifier. Note that order matters here! All of the keywords have to come before the identifier option. Can anybody tell me why that is? We finish off with the symbols... and that's where we'll be stopping for today.
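Putting the pieces together, ident, nat, and keyword might look roughly like this. These definitions are a sketch assuming the combinators described above (sat, chain, star, starPlus, return, invert, checkChars), not the literal contents of lexerfn.sml.

```sml
(* ident: an alphabetic character followed by any alphanumerics,
   imploded into a string. *)
val ident =
      chain (sat Char.isAlpha) (fn c =>
        chain (star (sat Char.isAlphaNum)) (fn cs =>
          return (implode (c :: cs))))

(* nat: one or more digits, converted to an int. *)
val nat =
      chain (starPlus (sat Char.isDigit)) (fn ds =>
        return (valOf (Int.fromString (implode ds))))

(* keyword: match the given characters, then use invert to fail if an
   alphanumeric follows --- so "letter" lexes as an ID, not as LET. *)
fun keyword s tok =
      chain (checkChars (explode s)) (fn _ =>
        invert tok (sat Char.isAlphaNum))
```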