
Specifying Programming Language Syntax

Lecture Notes for Com Sci 221, Programming Languages

Last modified: Mon Oct 16 20:00:03 CDT 2000


Continue Wednesday 4 October 2000

Syntax is a kind of structure

Definition time again, using Webster:

syntax:
the arrangement of words as elements in a sentence to show their relationship; sentence structure
The Oxford English Dictionary is a tad more abstract:
syntax:
orderly or systematic arrangement of parts or elements; constitution (of body); a connected order or system of things
In any case, syntax properly refers to the structure of an utterance in some language (it might be natural language or programming language, spoken or written, etc.) in terms of the way that it is constructed to carry meaning. In written language, syntax lies in between orthography or typography, which are concerned with the way a text is laid out on paper or a computer display, and semantics, which is concerned directly with meaning. In English, the interesting syntactic concepts include sentence, subject, predicate, object, prepositional phrase, noun, verb, adjective, etc. In programming languages they typically include program, header, declaration, body, command, expression, identifier, etc.

Programming language syntax has been formalized very successfully into a cookbook discipline. So, we can deal with it briefly and efficiently, and get on to our informal concern with the much more difficult issues of semantics.

Formal notations for programming language syntax

There are three famous methods for specifying programming language syntax:

  1. Context-Free Grammars (CFGs)
  2. Backus-Naur Form (BNF)
  3. syntax charts

The good news is that all three are essentially the same: they differ only in minor notational ways. Even if you think that you haven't heard of these, you have almost certainly used at least one of them informally. If you ever diagrammed sentences in grammar school, you should see that the rules for sentence diagrams are essentially the same as CFGs, BNFs, and syntax charts, while the diagrams themselves are essentially the same as the derivations that are made using CFGs, BNFs, and syntax charts. Figures 2.10 and 2.11 on pages 46 and 48 in the text show an example of the same information presented in BNF and in syntax charts. The CFG version is shown in my Figure 2 below.
E --> E + T
E --> E - T
E --> T
T --> T * F
T --> T div F
T --> F
F --> C
F --> V
F --> ( E )
Figure 2: Example of a Context-Free Grammar

In CFG jargon, each line is a production. E, T, F, V, and C are called nonterminal symbols. +, -, *, and div are called terminal symbols. The terminal symbols are those that are actually used in the language. The nonterminals are only intended to be replaced by terminals, following the rules given in the productions. The example in Figure 2 is incomplete: there should be more productions to allow Vs and Cs to be replaced by terminals.

There is no essential difference between a CFG and a BNF. In a BNF, the derivation symbol in each production is written ::=, instead of -->. The nonterminals are written as words or phrases inside pointy brackets (< ... >), instead of capital letters. The terminals are quoted. None of this should be very exciting. Syntax charts look quite different at first glance, but Problems 2.17 and 2.18 on p. 52 of the text show that they are also essentially the same as CFGs and BNFs. From now on, I will refer to CFGs when I discuss the qualities shared by all three notations. In the language of the text, a CFG is an abstract structure, while a BNF or a syntax chart is a concrete notation for that structure. The notation in my Figure 2 uses another concrete notation for CFGs---the one that is used in theoretical articles and in general discussions of grammars.
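
For instance (the spelling of the nonterminal names here is my own choice, not the text's), the first production of Figure 2 might appear in BNF as

  <expression> ::= <expression> '+' <term>

which carries exactly the same information as E --> E + T.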

In one way, syntax charts appear to be more powerful than CFGs and BNFs: they allow loops in their graphs. But, these loops just encode the same information as an additional recursion in the CFG or BNF. For example, the recursion in E --> E + T produces a sequence of 2 or more Ts joined by +s. The same thing is represented by a loop in the syntax chart. There is a notation called extended BNF (EBNF) with constructs similar to the loops of syntax charts. In EBNF, curly braces { ... } indicate 0 or more repetitions, square brackets [ ... ] indicate an optional element, vertical bar | indicates a choice, and parentheses ( ... ) are used for grouping within the EBNF. When these symbols are used for grammatical constructs, the corresponding terminal symbols are quoted: e.g., an actual left parenthesis in the language is shown as `('. Different articles and books may have slightly different extensions to BNF.

There are some serious notational conveniences, such as the iterative and conditional notations in EBNF, that do not affect the basic power of CFGs and similar systems. Make sure that you understand how each extended notation may be translated into pure CFG notation.
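
As a made-up illustration (the particular rule is mine, not the text's), the EBNF rule

  E --> T { ('+' | '-') T }

describes a T followed by zero or more occurrences of + or - and another T. Reintroducing the recursion gives the pure CFG productions

  E --> T
  E --> E + T
  E --> E - T

which generate exactly the same strings.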

What's so great about Context-Free Grammars?

Books and articles in theoretical computer science, and even in programming languages and linguistics, emphasize the use of CFGs to define which strings of symbols are and are not syntactically correct members of a given language. That is not really the important value of CFGs. The truly important use of CFGs is to define a way to parse syntactically correct strings: that is, to associate with a string a tree structure (called a derivation tree or parse tree, or sometimes just a parse) presenting the syntax of the string. The parse of a program is a much better starting point for interpreting or compiling the program than the plain source text.


Make sure that you understand precisely how parse trees are associated with strings by a CFG. See Sections 2.4 and 2.5 of the text for a more careful description.

CFGs are wonderful because they are simultaneously readable by humans, and suitable as the basis for completely automatic parsing. In effect, CFGs are a sort of highly self-documenting programming language for parsers. They are included in programming language manuals as the last resort documentation of syntactic issues. And, they are processed by parser generators, such as Yacc and Bison, which compile them into parsing code.
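
To make the connection between a CFG and parsing code concrete, here is a minimal hand-written sketch (in Python, which is of course not a notation used in the text; all the names are mine) of a recursive-descent parser for the grammar of Figure 2. The left-recursive productions are handled with loops, in the style of EBNF and syntax charts, and the tree it returns already collapses the T and F steps, so it is closer to the syntax tree discussed below than to the full parse tree. A parser produced by Yacc or Bison works differently inside, but delivers the same kind of structure.

# A sketch of a recursive-descent parser for the grammar of Figure 2.
# Left recursion (E --> E + T) is handled by a loop, as in EBNF.

def parse_expression(tokens, i):
    node, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] in ('+', '-'):
        op = tokens[i]
        right, i = parse_term(tokens, i + 1)
        node = (op, node, right)        # build the tree bottom-up, left associative
    return node, i

def parse_term(tokens, i):
    node, i = parse_factor(tokens, i)
    while i < len(tokens) and tokens[i] in ('*', 'div'):
        op = tokens[i]
        right, i = parse_factor(tokens, i + 1)
        node = (op, node, right)
    return node, i

def parse_factor(tokens, i):
    if tokens[i] == '(':
        node, i = parse_expression(tokens, i + 1)
        return node, i + 1              # skip the closing ')'
    return tokens[i], i + 1             # a variable or a constant

tree, _ = parse_expression(['x', '+', 'y', '*', '3'], 0)
# tree is now ('+', 'x', ('*', 'y', '3'))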

CFGs represent an incredible success story in computer science. In the olden days, when FORTRAN was just being invented, the problem of parsing a program was the subject of Ph.D. dissertations. Now, the automatic processing of CFG specifications allows college students in a compiler writing course to solve parsing problems routinely. The first automatic parser generators were so exciting that people called them ``compiler compilers.'' Of course, a parser generator merely frees the implementor of a compiler to spend her time on the really hard part: generating good code.

The marvellously self-documenting quality of CFGs arises because, when they are constructed wisely, the nonterminal symbols of CFGs represent sensible syntactic categories. For example, in the CFG for arithmetic expressions in Figure 2 above, the nonterminal symbol E represents the category of expressions, V represents variables, and C represents constants. Similarly meaningful categories, such as statements, declarations, etc., also correspond to particular nonterminal symbols in a complete CFG for a whole programming language. But, there are also nonterminal symbols, such as T and F in Figure 2 above, that do not correspond to grammatically meaningful and useful categories. Sure, T is supposed to stand for ``term,'' and F for ``factor,'' but those are not particularly important grammatical categories in a program. Rather, T and F are gimmicks, added to the grammar to enforce the normal rules giving * and div precedence over + and -.

You must learn to distinguish, based on intuition and common sense, the grammatically meaningful parts of the structure determined by a CFG from the gimmicks. There are some extended notations for CFG that reduce the dependence on gimmicks, but a lot of programming language manuals still give the gimmicks equal status with the meaningful symbols.

What is programming language syntax, precisely?

I refuse to be absolutely precise, but I'll come much closer than with other definitions. For almost all purposes, almost all of the syntactic qualities of almost all programming languages may be regarded as a syntax tree. Most people in computer science say ``abstract syntax tree'' instead of ``syntax tree,'' because they never looked up the definition of syntax, and they think that ``syntax'' by itself (or ``concrete syntax'' when they want to be more pedantic) means typography, rather than structure.

So, syntax is a tree. But what tree? Well, given the right CFG for a language, the syntax tree of a program is almost the parse tree, except that the terminal symbols and gimmicks are taken out, and the natural conceptual operators are put in. This is best understood by example. Consider the expression

x + y * 3 + z * (w - 4)
The parse tree, using the grammar in Figure 2 (with some obvious additional productions to get rid of Vs and Cs), is shown in Figure 3. The most usual idea of the syntax tree is shown in Figure 4. Notice that the extra steps involving T and F have been omitted, since they are really just gimmicks to enforce precedence. At each node of the tree, instead of the nonterminal from the parse tree, we have the operator that is being applied. Reasonable people may disagree over fine points in the construction of syntax trees. For example, if + and * are understood as operations combining more than 2 operands (which is suggested by the EBNF version on p. 19 of the text), then we might prefer the syntax tree of Figure 5, which treats the iteration of the production E --> T + E as a gimmick, rather than a structural step.

I wrote out ``add,'' ``mult,'' etc. in this example to emphasize that the operation is not the same thing as the terminal symbol (+ or *) that corresponds to it so naturally. In the future, I will use the most convenient and mnemonic symbols in syntax trees, which will often be the same as the symbols in the ``concrete syntax.'' In other examples, such as the if ... then ... else ... fi example in Figure 6, there is no clear 1-1 correspondence between ``concrete'' symbols and ``abstract'' operators. Notice that in popular mathematical notation, there is often no terminal symbol at all to denote multiplication. Also, parentheses do not correspond to anything in the syntax tree. Rather, they are a gimmick involving terminal symbols, used to control the shape of the syntax tree.
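
Since Figures 3 through 5 are not reproduced in these notes, here is a rough sketch of the syntax tree of Figure 4 written as nested Python tuples (the encoding is mine; 'add', 'mult', and 'sub' name the operations, not the terminal symbols):

tree = ('add',
        ('add', 'x',
                ('mult', 'y', 3)),
        ('mult', 'z',
                ('sub', 'w', 4)))
# Note that the parentheses of the written expression, and the nonterminals
# T and F, leave no trace at all in the tree.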


End Wednesday 4 October 2000
%<----------------------------------------------
Begin Friday 6 October 2000

Entire programs, as well as expressions, have natural syntax trees. In principle, there is nothing at all subtle about associating a syntax tree with a program, but many students confuse syntax trees with the similar looking, but very different, flow charts. A syntax tree shows the structure of the program as it is constructed from its parts. A flow chart shows the structure of the execution of the program. The best way to understand syntax trees for programs is to study carefully the example in Figure 7, which gives a syntax tree for the program in Figure 8 below.


read(i);
while i>1 do
   if odd(i) then
      i := 3i+1
   else
      i := i/2
   fi
od
Figure 8
Puzzle: does the program above halt for all inputs?
Notice that the syntax tree is more closely connected to the typical indentation of the program than to a flow chart.
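
Figure 7 is not reproduced here either, so here is a rough sketch of the same syntax tree as nested Python tuples (again my own encoding). Its nesting follows the indentation of Figure 8, not the loops and branches of a flow chart:

program = ('sequence',
           ('read', 'i'),
           ('while', ('>', 'i', 1),
                     ('if', ('odd', 'i'),
                           ('assign', 'i', ('add', ('mult', 3, 'i'), 1)),
                           ('assign', 'i', ('div', 'i', 2)))))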

Prefix, infix, postfix, mixfix

CFGs provide a very flexible way to describe lots of different program notations, and they are fairly readable. But, there are also some general notational ideas that are often discussed in English.

In principle, an abstract syntax tree may be presented in lots of different concrete notations. For binary ASTs, there are three particularly common forms of representation: prefix, infix, and postfix. We pick a concrete symbol for each binary operation in the tree (e.g., ``+'' for addition). To represent a tree that combines T1 with T2 by addition, we represent T1 by C1 and T2 by C2, then place the ``+'' symbol alongside C1 and C2. It seems unnatural to change the order of T1 and T2, so there are three sensible possibilities:

prefix: + C1 C2
infix: C1 + C2
postfix: C1 C2 +
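
As a small sketch (in Python; purely illustrative, not a notation from the text), the three concrete forms are just three ways of walking the same binary tree:

def prefix(t):
    if isinstance(t, str):
        return t
    op, left, right = t
    return op + ' ' + prefix(left) + ' ' + prefix(right)

def postfix(t):
    if isinstance(t, str):
        return t
    op, left, right = t
    return postfix(left) + ' ' + postfix(right) + ' ' + op

def infix(t):
    if isinstance(t, str):
        return t
    op, left, right = t
    return '(' + infix(left) + ' ' + op + ' ' + infix(right) + ')'   # parentheses keep infix unambiguous

t1 = ('*', ('+', 'A', 'B'), 'C')    # the tree for (A+B)*C
t2 = ('+', 'A', ('*', 'B', 'C'))    # the tree for A+(B*C)
# prefix(t1)  == '* + A B C'        prefix(t2)  == '+ A * B C'
# postfix(t1) == 'A B + C *'        postfix(t2) == 'A B C * +'
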
Infix is the form that we are most accustomed to seeing in math books, but it has the disadvantage of ambiguity: there are two ways to form an AST for A+B*C. So, infix notation is often used with parentheses, and the two interpretations of A+B*C are given as (A+B)*C and A+(B*C). In the absence of parentheses, precedence rules are often applied. * usually has precedence over +, so A+B*C is interpreted as A+(B*C).

Prefix and postfix notations have the nice property that the concrete notation describes the AST unambiguously (as long as every operator has a different concrete symbol). The two different interpretations of the infix A+B*C are given as *+ABC vs. +A*BC in prefix, AB+C* vs. ABC*+ in postfix. Notice that the operators appear in different orders in the different notations, as well as being interleaved differently with their operands. Old HP hand calculators used to use postfix notation to allow expressions to be typed without parentheses. Postfix notation for expressions also translates naturally into machine language for evaluating the expression. Notice that we may calculate AB+C* by the following sequence of operations:
Instruction    Symbol
Fetch A        A
Fetch B        B
D := A+B       +
Fetch C        C
E := D*C       *
Because of this correspondence between postfix notation and evaluation code, the later stages of compilers sometimes represent expressions in postfix form.
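
Here is a sketch of the stack discipline behind that correspondence (in Python; the function and names are mine, purely for illustration):

def eval_postfix(tokens, env):
    stack = []
    for tok in tokens:
        if tok == '+':
            b = stack.pop(); a = stack.pop(); stack.append(a + b)
        elif tok == '*':
            b = stack.pop(); a = stack.pop(); stack.append(a * b)
        else:
            stack.append(env[tok])      # the "Fetch" of a variable's value
    return stack.pop()

# eval_postfix(['A', 'B', '+', 'C', '*'], {'A': 2, 'B': 3, 'C': 4}) gives 20,
# i.e., (A+B)*C, with no parentheses anywhere in the input.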

Often the same language uses prefix, infix, and postfix for different operators. E.g., in C the concrete symbol ``*'' is used for multiplication in infix form, for pointer dereferencing in prefix form, and for pointer types in postfix form. Mixed uses of pre/in/postfix may easily be ambiguous without parentheses.

Even when we don't use pre/in/postfix, we usually like to represent each subtree of an AST contiguously within the representation of the whole tree. These other contiguous representations are called mixfix (not the same as mixed use of pre/in/postfix). For example, the application of a function f to arguments x and y in math books is usually written f(x,y). If we think of f as a binary operator, then we have a mixfix representation using ``f('' in a prefix position, ``,'' in an infix position, and ``)'' in a postfix position.

Prefix, postfix, and mixfix can be used with operators that are not binary, but it isn't clear what infix would mean except for a binary operator. As long as each operation symbol is distinguishable from all others, and each operation symbol has a fixed number of operands, prefix and postfix notations are unambiguous. Control structures in conventional programming languages are typically represented by mixfix: for example a conditional form is typically given by

if A then B else C fi
where A, B, C represent the subtrees, and if ... then ... else ... fi are the symbols associated with the conditional operator.

It might seem that mixfix is so flexible that it covers all of the possibilities for notation, other than pre/in/postfix. But, there's no inherent reason that arguments of operators must be denoted by contiguous adjacent text. For example, some people favor a simultaneous assignment command, where all right-hand sides are evaluated before the assignments are executed. In conventional notation, to rotate the values of X, Y, and Z, we need an additional temporary variable T:

T := X
X := Y
Y := Z
Z := T
Since T has nothing to do with the idea of exchange, this is annoying. With simultaneous assignment, we might denote exchange by
(X, Y, Z) := (Y, Z, X)
But, we might argue that the natural abstract syntax of the simultaneous assignment is something like
simultaneous(assign(X, Y), assign(Y, Z), assign(Z, X))
The notation above cannot be explained as a mixfix notation for this abstract syntax.
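
As an aside, this is not purely hypothetical notation: Python's tuple assignment, for instance, evaluates the whole right-hand side before assigning, so the rotation needs no temporary variable:

x, y, z = 1, 2, 3
x, y, z = y, z, x
# x, y, z are now 2, 3, 1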

Precedence and Associativity

Ambiguous notations, particularly (but not solely) infix notations for binary operations, often use precedence rules to resolve some ambiguities, reducing the need for explicit notations, such as parentheses. Notice how the sequence of nonterminals E, T, F in Figure 2 (equivalently, Expression, Term, Factor in the BNF and syntax chart) enforces precedence. + and - are introduced by productions replacing E, while * and div are introduced by productions replacing T. Since E is the start symbol for expressions, T occurs closer to the end of a derivation (equivalently, closer to the leaves of a parse tree), so the symbols introduced from T represent operations that are done earlier, with higher precedence. F (Factor) is not strictly necessary for enforcing precedence in this grammar, but it gives the grammar a regular structure, and we would need it in order to introduce, e.g., exponentiation at a higher precedence than multiplication and division. Remember that symbols occurring earlier in a derivation have lower precedence than symbols occurring later.

A binary operation symbol is called left associative if its occurrences are processed from left to right, and right associative in the opposite case. Notice that this sort of associativity of a symbol is essentially a precedence relation between the different occurrences of the same symbol. Associativity restrictions are also applied between different symbols of the same precedence, such as + and - in our example. Notice that the production E --> E + T makes + left associative, while the alternative E --> T + E would make it right associative. Think on your own about the relationship between the grammatical forms for precedence and associativity restrictions. Left associativity is much more common in programming languages than right associativity, since we have a prejudice for processing text from left to right, and it is easier to do the operations in the same order that we see them.
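
For a concrete (made-up) illustration: with the left-associative production E --> E - T, the string A - B - C parses as (A - B) - C, while with the right-associative E --> T - E it would parse as A - (B - C); for subtraction those two trees stand for different values.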

The phrases ``left/right associative'' are confusing in conjunction with the use of the word ``associative'' in algebra. An operation is associative if it doesn't matter whether we evaluate sequences of the operation from left to right, or right to left. + and * are algebraically associative, while - and div are not. In many programming languages the left-to-right order of evaluation may affect the point at which a program crashes due to overflow, or the precise amount of roundoff error, even when the operation being evaluated is algebraically associative. So, the syntactical ``associativity'' of an operator may make a difference, even for an algebraically ``associative'' operator.
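
A quick way to see the roundoff point (a Python one-liner, illustrative only):

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False with IEEE double precision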

Notice that the trick of representing repetition by a graphical loop in a syntax chart, or by an iteration operator in EBNF, erases the distinction between left and right associativity. Presentations of syntax through EBNF and syntax charts often come with additional notes describing associativity. For documentation purposes, it is sometimes better to do the same with precedence. For example, we can use the ambiguous grammar for expressions in Figure 9 below, and add a note that * and div take precedence over + and -, and that all operations are left associative. But, for our automatic parser generator, we may need to give the more complicated grammar with E, T, and F.


E --> E + E
E --> E - E
E --> E * E
E --> E div E
E --> C
E --> V
E --> ( E )
Figure 9: Ambiguous CFG for Expressions



End Friday 6 October 2000
%<----------------------------------------------
Begin Monday 9 October 2000

Ambiguity

I mentioned ambiguity several times. Now, let's consider precisely what makes a grammar ambiguous. Officially, we say that a CFG is ambiguous if there is some string of terminal symbols with two or more different parse trees. If we were more careful, it would probably be better to call a notational system ambiguous only when it assigns two or more different abstract syntax trees to the same string of terminal symbols, but nobody has worked out the details of this idea. Usually, two different parse trees lead to two different abstract syntax trees.

The most famous case of ambiguity in programming languages is the dangling else ambiguity, described on pp. 39-40 of the text. This problem arises when we allow conditional constructs of both the forms

if C then A


if C then A else B
with no special symbol to mark the end of the conditional. It's rather tricky to write a grammar that generates nested, unterminated conditionals unambiguously. Such nested conditionals are also rather difficult for humans to read. So, it's surely better to avoid the dangling else ambiguity by terminating each conditional construct with a special marker, such as end, endif, or fi.
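
To see the ambiguity concretely, consider the unterminated string if C1 then if C2 then A else B. It may be read as if C1 then (if C2 then A else B), attaching the else to the nearer if as most languages specify, or as if C1 then (if C2 then A) else B, and a grammar that naively allows both conditional forms generates both parse trees.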


End Monday 9 October 2000
%<----------------------------------------------
Begin Wednesday 11 October 2000

Should Parsing be Obsolete?

The need to parse strings of symbols into abstract syntax trees arises because

  1. the abstract syntax is the right structure in which to manipulate a program, particularly to compile machine code, and
  2. we are stuck typing in programs as strings of symbols.
Point 2 was true in the days of paper tape, punched cards, and teletype terminals. But, it really doesn't hold today with graphical displays, cursors, and pointing devices. So, it seems sensible to build programming language systems in which we edit the abstract syntax directly, instead of typing in text and having it parsed. A lot of the detailed considerations about ambiguity, precedence, associativity, and other annoyances would disappear if we edited syntax directly. We could still display programs textually, with nice indenting and formatting, just as WYSIWYG (What You See Is What You Get) editors display their internal representations of documents as formatted text.

Gottlob Frege, in the late 19th century, actually proposed to write expressions of mathematical logic in tree form. Typesetters killed off the idea, because it was too much trouble for the old typesetting techniques.

Tom Reps and Tim Teitelbaum at Cornell built a structure editor system (sometimes called a syntax editor) to edit abstract syntax trees directly. The Gandalf project at Carnegie-Mellon also had a structure editor, and there were other similar experiments. Around 1980, I decided that parsing was dead, and predicted that by 1985 we'd all be using structure editors. I was way wrong. Why? Probably because textual editors were already very mature, and they can simulate a lot of the value of structure editors by clever automatic indenting and other formatting (check out programming modes in the Emacs editor to see how nice this can be). Also, for short expressions, it's hard to beat the efficiency of typing in text, since most of us are well trained to do that.

Structure editors need to incorporate small amounts of interactive parsing in the user interface, even though they focus on abstract syntax. Both the Cornell and CMU projects included some parsing, but it's tricky to get the interface just right for human efficiency. Finally, most structure editors have maintained some of the confusion between syntax and parse trees, so they haven't realized the full potential of their basic idea.

There are ``visual'' versions of programming languages now, competing with the more conventional textual versions. These ``visual'' programming languages use a graphical notation to display programs, but the underlying syntax is the same as older textual languages. In some ways they free us radically from the constraints of sequential text, but not radically enough for my taste. Unfortunately, these radically different sorts of notation so far haven't stimulated enough realization of the possibility for developing semantics, syntax, and notation independently, so that we can enjoy the best versions of each. In principle (and I hope in actuality some day), different people could choose to view the same program syntax in completely different concrete notations for different purposes. A few research articles have anticipated this separation of syntax from notation by treating a program as a database, whose contents determine the variables, expressions, statements, etc. in the program, and which may be displayed in lots of different ways by querying the database.

Beyond Context-Free Syntax

Success is often the enemy of further progress. CFG parsing technology has been so successful that it has eliminated the incentive to go beyond context-free syntax. In the programming language community, ``syntax'' is treated as a synonym for ``context-free syntax.'' Everything else is called ``semantics,'' whether or not it has much to do with meaning. In particular, the identification of variables by names (called ``identifiers'' in the CS jargon), and checking of type consistency, cannot be described by CFGs, so they are relegated to an area called ``static semantics.'' But intuitively, the use of identifiers, and static type checking have to do with notation rather than meaning, and ought to be classified as syntax. Only recently, Gopalan Nadathur has shown how much of ``static semantics'' can be handled by structures that he calls ``higher-order abstract syntax.'' We may be able to see some of this when we study Prolog at the end of the class.

To justify the idea that the specifics of variable names have to do with notation, rather than meaning, notice that the effect of a program depends on the equivalence or difference between occurrences of variables (that is, their identities), but not on the particular names that we attach to them (that is, their identifiers). But, the current state of the CFG art treats the textual declaration of variables and their types as part of the abstract syntax tree of a program, and the equivalence of different variable occurrences must be discovered through additional ``semantic'' processing with a symbol table. I foresee a better world where textual declarations of variables, and the variable names themselves, disappear from abstract syntax, just like the grammatical ``gimmicks'' that determine precedence. In this better syntactic world of the future, though, abstract syntax will no longer be a tree.

Figure 10 below shows a silly little program, in a language that I just made up.


Program

  Variable i,j: integer

  function f(i: integer): integer
    return i*i
  end function

begin
  j := 0
  for i := 1 to 10 do
    j := j + f(i)
  end for
end program
Figure 10: Program to Illustrate Non-Tree Syntax


Figure 11 shows the conventional sort of context-free abstract syntax for the program of Figure 10. Each occurrence of the identifiers i, j, and f in Figure 10 occurs separately in the abstract syntax tree. Now, look at Figure 12 for my idea of what the abstract syntax of the program should be. Each variable and function name is mentioned precisely once, with multiple pointers to the occurrence. The syntax is no longer a tree. Notice that there are two different instances of the variable i. Also, notice that we don't really need the identifiers f, i, and j at all in order to execute the program. They are only there to help produce a human-readable printout.
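
Figures 11 and 12 are not reproduced in these notes, but the difference can be sketched in Python (my own encoding, purely illustrative). In the conventional tree, every use of a variable is a separate leaf carrying an identifier; in the graph version, every use points at a single shared variable object, and the name survives only as a hint for printing:

# Conventional tree-style abstract syntax: identifiers repeated at every use.
tree_style = ('assign', 'j', ('add', 'j', ('call', 'f', 'i')))

# Graph-style abstract syntax: one object per variable, shared by reference.
class Var:
    def __init__(self, name_hint):
        self.name_hint = name_hint      # needed only to print the program

i_loop  = Var('i')                      # the loop variable of the main program
i_param = Var('i')                      # the distinct parameter of function f
j       = Var('j')

graph_style = ('assign', j, ('add', j, ('call', 'f', i_loop)))
# Both occurrences of j above are the same object, so their identity is
# settled by pointer equality, with no symbol table and no need for the name.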

What Makes Good Notation

We have discussed a number of technical qualities of programming language syntax and notation, and we have seen how many of these qualities may be described with great flexibility and precision using CFGs and their variants. The actual success of a programming language depends critically on the practical efficiency of its notation when people write and read programs. So, what qualities make a notation good or bad? Nobody knows how to evaluate notations with any sort of objective precision, but there are some principles worth considering.

These principles are highly subjective, and their application is often recognized only after a notation succeeds or fails. In practice, the restrictions of context-free parsing technology have a huge impact---not always a good one---on the design of program notation.