Lecture Notes
for
Com Sci 221,
Programming Languages
Last modified: Mon Nov 1 10:37:04 CST 1999
Chapter 4 in the text covers the nuts and bolts of this topic very well. You need to read it especially carefully. I will concentrate on odd points of view, and on ideas that connect the material in Chapter 4 with later material in Chapters 9 and 14, and with material that goes beyond the text. These notes were written rather hurriedly, and are probably not as clean as earlier sections.
The word ``type'' is one of many words that are used in different, but related, ways by those who study programming languages. These uses of ``type'' all start with a pre-CS mathematical notion of type from set theory. But, the computational objects associated with types in programming languages have qualities that don't arise in the normal study of set theory, and the original idea of type has been extended in several different ways to deal with them. Actually, mathematics and logic had already used ``type'' in somewhat different ways in different contexts, but programming languages introduced differences in new dimensions.
The simplest concept of type derives directly from set theory. It is insufficient for the full understanding of most uses of types in programming languages. But, all concepts of type relate back to set theory in some way, so it is a good starting point.
I made up the following definition. I think it's a good one, but it's not standard in the literature.
Types in current procedural programming languages have sets of values that can mostly be defined from the basic types above, using the set-theoretic type-constructing operations above. The following table shows corresponding types in the notations of Pascal, C, and set theory. Both Pascal and C allow cross products and marked unions to have arbitrary arities, as well as the binary forms from the set-theoretic notation. Also, the two programming languages have different restrictions on the use of some type operators. The following table gives a pretty good idea of the correspondence.
    Pascal                           C                                Set theory
    name / form                      name / form                      name / form

    enumeration                      enumeration                      enumeration
      (a, b, c, d)                     enum {a, b, c, d}                {a, b, c, d}

    record                           structure                        cross product
      record a: A; b: B end            struct { A a; B b; }             A x B

    subset                           -                                power set
      set of A                                                          P(A)

    array                            array                            function
      array[0..n] of A                 A[n]                             {0,...,n} -> A

    function                         function                         function
      function f(a: A): B              B f(A a)                         A -> B

    variant record                   union (with a lot of             marked union
      record case t of                 restrictions and caveats)        A + B
        a: A; b: B end                 union { A a; B b; };
                                       more carefully:
                                       struct { enum {a, b} tag;
                                         union { A a; B b; } content; }

Even at the set-theoretical level of abstraction, the correspondence is tricky. Both C and Pascal require names to be introduced explicitly for some components that are named implicitly in the mathematical notation. Pascal only allows unions (variant cases) to occur inside of cross products (records). While C separates union from cross product more cleanly than Pascal, the C union is not exactly a marked union, nor is it a normal set-theoretic union (which just combines two sets into one that contains the members of both). Every member of the C union type union {A a; B b;} may be treated freely as either type A or type B. Most of the time, C union is used to simulate marked union, but the programmer is responsible for ensuring that each use of an object in a union uses the same component type as the most recent previous assignment. Essentially, the programmer keeps track of the tag in the flow of control. But, sometimes C unions (and even Pascal variant records, with a bit of trickery) are used to convert between two types by just interpreting a bit pattern from one type as if it were intended for the other. This doesn't happen very often in C, because there are so many other ways to trick the type system. C also confuses the correspondence by introducing variable names after, or in the middle of, the types that are assigned to them.
void in C is not the empty type

The type void in C is a bit subtle. It's a good exercise to figure out its set-theoretic meaning, even though the use of void is not problematic. C uses the type void to indicate that a function does not return a value. So, it seems that void might be the empty type. Actually, void is the type with no information, but that makes it a type with one value, not zero. In computation, information is always carried by the distinction between two or more possibilities. The distinction between two possibilities (such as 0 and 1) is called a bit of information, since it's the smallest distinction that can be made independently of context. The value of a type with one member carries no information, since one always gets the same value.
Pointer types in programming languages don't seem to be describable nicely in the set-theoretic style. Although the type of pointers to type A (^A in Pascal, A * in C) has a set of values, that set doesn't seem to be derived from the set A in any sensible way. Also, set-theoretic types don't seem to explain the use of expressions on the left-hand sides of assignments. To deal with these programming concepts, we use the idea of l values and r values.
The usual sorts of values inhabiting types in the previous section are r values. l values are references to, or locations of, r values. Concretely, at the machine-language level, they are machine addresses. Slightly more abstractly, they are pointers. The l and r come from the use of l values on the left-hand sides of assignment statements, and r values on the right-hand sides. It's easy to understand simple assignments, such as
x := y+z
without thinking explicitly of l values. But, assignments to
components of structures, such as
A[n].angle := B[m].angle + 90
are quite hard to explain purely in terms of the r values
involved (presumably real numbers in this case). So, we think of the
expression on the left-hand side of an assignment evaluating to an
l value, which determines how the r value of the
expression on the right-hand side is to be stored.
In programming languages that use structured objects (arrays, records), or that have local variables with different scopes, the instantaneous state of computation is usually decomposed using the concept of l and r values. The simple view of computation that we suggested at the beginning of the course has instantaneous states that associate with each variable its r value. In principle, that's enough for explaining computation, but the description of structured assignments and entering/exiting local scopes gets complicated. So, we usually decompose this simple notion of state into an environment that maps variables to l values, and a state (alas, we reuse the same name) that maps l values to r values. The environment is usually determined entirely by the local scopes containing a control point in a program, plus the stack of procedure calls. The state varies completely dynamically, and may be different each time we reach the same control point.
Different programming languages vary a lot in the generality of their treatment of l values. In some languages, l values occur only implicitly through the use of expressions on the left-hand sides of assignments. Pointers may be understood as l values converted explicitly into r values so they can be used on right-hand sides. It's confusing to talk about pointers, because the l value of a pointer is the address at which the pointer value is stored, while the r value of a pointer is the address of the thing that it points to.
The dereferencing operator applied to a pointer value p (p^ in Pascal, *p in C) takes the r value of p (which is a pointer value), and uses it as an l value. So, p^ := 1 and *p = 1 store 1 at the address p points to. On the right-hand side of an assignment, everything is automatically mapped to an r value, so p^ and *p refer to the value at the address pointed to by p. In Pascal, the r values of pointers are all allocated in a special area of memory called the heap using the new procedure, so the r value of a pointer is never the same as the l value of a variable. C uses the malloc function to create new r values of pointers on the heap, but it also allows a pointer to point to other addresses, including the l values of variables, using the address operator &. &x is the l value of the variable x, turned into an r value---that is, the address of x turned into the value of a pointer.
The address operator is logically peculiar, and can't be understood in the same way as other operators. On the right-hand side of assignments, all variables are implicitly dereferenced to their r values. Given only the r value of a variable, there's no way to find its l value. Unlike arithmetic and other functions, the & operator does not apply a function to the value. Rather, it cancels the implicit dereferencing of the variable symbol that it precedes.
Actual parameters to procedures and functions may be passed as l values (call by reference) or r values (call by value). Pascal lets the programmer choose. C uses only call by value, but since & converts an arbitrary l value into an r value, it is easy to simulate call by reference. That's why you see so many ampersands in C function calls.
Even without pointers, the idea of l vs. r values allows us to describe variations in the flexibility of assignments to different types. Scalar variables are assignable as entire values, but there is no provision for breaking them into components and assigning those components individually. That is, a scalar variable has an l value for the whole object, but there is no way to derive l values for components. Arrays and records/structures, on the other hand, have individually assignable components. That is, given the l value of an array A, we may derive an l value for its nth component, by indexing into the array on the left-hand side of an assignment (A[n]). Although arrays and functions both correspond to set-theoretic function types, they have very different assignabilities. Arrays are assignable by components, and functions aren't assignable at all---they get permanent values from their declarations.
Perhaps the most important refinement of the set-theoretic concept of types is the concept of types as sets plus some operations. In procedural languages, we must indicate the operations on l values associated with a type---including the kinds of assignments allowed, as well as on its r values. This sort of structure---a set plus a list of operations on it---is called an abstract datatype. Classes in C++ provide one way of using abstract datatypes in an actual programming language. Abstract data types allow us in many cases to separate the specification of what a piece of program needs to accomplish from the implementation that determines how to accomplish it. That separation has proved extremely valuable in complex programming tasks, where we need to change implementations, and to distribute responsibility for different implementations to different people and teams. In mathematics, an algebra is a lot like the implementation of an abstract datatype, and the signature of the algebra is the abstract datatype that it implements.
The oldest concept of datatype is probably the one used in discussions of computer architecture, and early Fortran compilers. There, a datatype is usually identified as a particular layout of bits in memory, and a way of interpreting those bits as a value. The text has a nice discussion of the layout of arrays. Even scalar types need to have their bit layouts defined somewhere. There are two different representations of integers, called 1's complement and 2's complement due to the different ways of implementing negation. Floating point layouts are susceptible to even more variations. Since the interpretation of a layout of bits is determined by the operations (e.g. +, -, * on integers) that we perform on those bits, there is a connection between the layout/interpretation and abstract datatype concepts. But, computer architecture normally provides the operations that interpret bunches of bits as values in the form of tools that may be used in any combination. Compilers usually use the abstract datatype concept to restrict programs so that their uses of types are consistent in some sense.
Many intuitively appealing types are naturally defined in a recursive fashion. For example, we would like to define the type of list of integer in the following set-theoretic notation:

list = {nil} + (integer x list)

Neither Pascal nor C allows this sort of recursive definition of type. Because of the influence of the layout view of types, both of these languages insist that the memory footprint of a type is fixed by the type declaration. In a recursive type definition, Pascal and C require that every recursion be mediated by a pointer type. For example, in Pascal:

list = ^record first: int; tail: list end

The marked union with {nil} is included in the use of the pointer type, since all pointer types in Pascal (and C) contain the special value nil that doesn't point to anything. Pascal and C only require that the pointer type appear somewhere between the type being defined and the recursive invocation of that type. For example, the following declaration is also legal Pascal, but it defines a slightly different type omitting the empty list:

list = record first: int; tail: ^list end
Some programming languages, such as Scheme, have no built-in concept of type---equivalently, all objects have the same type. One of the most important benefits of formal type systems in programming languages is that automated type checking can find lots of errors that would otherwise be left to tedious debugging. Type checking may be done ``statically'' at compile time or ``dynamically'' at run time. Compile-time type checking is more common, and more valuable, since it uncovers errors sooner. The impact of type checking on the structure of a programming language is a bit ironic. Type checking improves structure by manipulating the relationship between correct and incorrect programs: a slight error in typing a program is likely to produce a type-incorrect program, rather than a program that runs and produces incorrect results. C is usually called a weakly typed language, since the use of types is mostly voluntary, and it's easy for a programmer to assign a value in one type and then pretend later that it has another type. Pascal is usually called a strongly typed language, since the designers tried to make type correctness absolutely mandatory, and almost succeeded.
Unfortunately, many widely used programming languages still suffer from the confusion of two different sorts of relationship between type symbols. Consider the following two type declarations:

type Cartesian_triple = record x: real; y: real; z: real end

type momentum = record x: real; y: real; z: real end

If we just consider the technical meanings of these declarations in Pascal (or similar ones in C and other languages), they appear to be equivalent. But, the names of the types suggest very different attitudes. It's likely that the programmer intended to define Cartesian_triple as a synonym for a record containing three real fields. Probably, there was no intention to distinguish Cartesian_triple from other triples of real numbers. On the contrary, the programmer probably intended momentum to represent a triple of real numbers used to represent, or implement, a physical quantity of momentum. Dimensional analysis in physics is essentially a kind of type checking, which uncovers a lot of errors in formulae in a particularly simple way. For example, although momentum and velocity are both represented by triples of reals, a velocity multiplied by a mass yields a momentum, and not another velocity. C and Pascal don't support dimensional analysis very well, largely because of the failure to distinguish between the two sorts of declarations above.
With an understanding of the difference between type definition and type representation/implementation, modern programming languages are starting to provide two different notations for associating names with types. C++ classes may be used for this purpose. A muddy appreciation that there was such a problem led to the distinction between ``structural equivalence'' and ``name equivalence'' of types treated on pages 139-141 of the text.
The precise behavior of type checking depends critically on the way in which type equivalence is defined. Some programming languages use an asymmetric relation of type compatibility to determine when the type of an expression is ``compatible'' with the type of a formal parameter.
It seems that a lot of the variety in concepts and uses of types can be understood in terms of a 2 by 2 diagram:
On the horizontal dimension, I distinguish concepts in which the type of a value is a quality of the value (inherent type), from concepts in which type is a way of interpreting representations as values (interpretive type). On the vertical dimension, I distinguish concepts in which we declare the type that we intend an object to have (intensional/prescriptive type) from concepts in which we describe the type that we observe an object to have (extensional/descriptive). I don't have high confidence in this diagram, but it should at least provoke some thought about the relations between the variety of concepts associated with ``type.''
                                Inherent                        Interpretive

    Intensional/Prescriptive    compile-time type checking;     abstract datatypes;
                                proof theory                    category theory

    Extensional/Descriptive     run-time type checking;         bit representation;
                                set theory                      machine ops; algebra