Lecture Notes
for
Com Sci 221,
Programming Languages
Last modified: Mon Nov 1 10:37:04 CST 1999
Chapter 4 in the text covers the nuts and bolts of this topic very well. You need to read it especially carefully. I will concentrate on odd points of view, and on ideas that connect the material in Chapter 4 with later material in Chapters 9 and 14, and with material that goes beyond the text. These notes were written rather hurriedly, and are probably not as clean as earlier sections.
The word ``type'' is one of many words that are used in different, but related, ways by those who study programming languages. These uses of ``type'' all start with a pre-CS mathematical notion of type from set theory. But, the computational objects associated with types in programming languages have qualities that don't arise in the normal study of set theory, and the original idea of type has been extended in several different ways to deal with them. Actually, mathematics and logic had already used ``type'' in somewhat different ways in different contexts, but programming languages introduced differences in new dimensions.
The simplest concept of type derives directly from set theory. It is insufficient for the full understanding of most uses of types in programming languages. But, all concepts of type relate back to set theory in some way, so it is a good starting point.
I made up the following definition. I think it's a good one, but it's not standard in the literature.
Types in current procedural programming languages have sets of values that can mostly be defined from the basic types above, using the set-theoretic type-constructing operations above. The following table shows corresponding types in the notations of Pascal, C, and set theory. Both Pascal and C allow cross products and marked unions to have arbitrary arities, as well as the binary forms from the set-theoretic notation. Also, the two programming languages have different restrictions on the use of some type operators. The following table gives a pretty good idea of the correspondence.
    Pascal                           C                                Set theory
    name / form                      name / form                      name / form

    enumeration                      enumeration                      enumeration
      (a, b, c, d)                     enum {a, b, c, d}                {a, b, c, d}

    record                           structure                        cross product
      record a: A; b: B end            struct { A a; B b; }             A x B

    subset                           -                                power set
      set of A                                                          P(A)

    array                            array                            function
      array[0..n] of A                 A[n]                             {0,...,n} -> A

    function                         function                         function
      function f(a: A): B              B f(A a)                         A -> B

    variant record                   union (with a lot of             marked union
      record case t of                 restrictions and caveats)        A + B
        a: A; b: B end                 union { A a; B b; };
                                       more carefully:
                                       struct { enum {a, b} tag;
                                         union { A a; B b; } content; }

Even at the set-theoretical level of abstraction, the correspondence is tricky. Both C and Pascal require names to be introduced explicitly for some components that are named implicitly in the mathematical notation. Pascal only allows unions (variant cases) to occur inside of cross products (records). While C separates union from cross product more cleanly than Pascal, the C union is not exactly a marked union, nor is it a normal set-theoretic union (which just combines two sets into one that contains the members of both). Every member of the C union type union {A a; B b;} may be treated freely as either type A or type B. Most of the time, C union is used to simulate marked union, but the programmer is responsible for ensuring that each use of an object in a union uses the same component type as the most recent previous assignment. Essentially, the programmer keeps track of the tag in the flow of control. But, sometimes C unions (and even Pascal variant records, with a bit of trickery) are used to convert between two types by just interpreting a bit pattern from one type as if it were intended for the other. This doesn't happen very often in C, because there are so many other ways to trick the type system. C also confuses the correspondence by introducing variable names after, or in the middle of, the types that are assigned to them.
void in C is not the empty type

The type void in C is a bit subtle. It's a good exercise to figure out its set-theoretic meaning, even though the use of void is not problematic. C uses the type void to indicate that a function does not return a value. So, it seems that void might be the empty type. Actually, void is the type with no information, but that makes it a type with one value, not zero. In computation, information is always carried by the distinction between two or more possibilities. The distinction between two possibilities (such as 0 and 1) is called a bit of information, since it's the smallest distinction that can be made independently of context. The value of a type with one member carries no information, since one always gets the same value.
Pointer types in programming languages don't seem to be describable nicely in the set-theoretic style. Although the type of pointers to type A (^A in Pascal, A * in C) has a set of values, that set doesn't seem to be derived from the set A in any sensible way. Also, set-theoretic types don't seem to explain the use of expressions on the left-hand sides of assignments. To deal with these programming concepts, we use the idea of l values and r values.
The usual sorts of values inhabiting types in the previous section are r values. l values are references to, or locations of, r values. Concretely, at the machine-language level, they are machine addresses. Slightly more abstractly, they are pointers. The l and r come from the use of l values on the left-hand sides of assignment statements, and r values on the right-hand sides. It's easy to understand simple assignments, such as
x := y+z
without thinking explicitly of l values. But, assignments to
components of structures, such as
A[n].angle := B[m].angle + 90
are quite hard to explain purely in terms of the r values
involved (presumably real numbers in this case). So, we think of the
expression on the left-hand side of an assignment evaluating to an
l value, which determines how the r value of the
expression on the right-hand side is to be stored.
In programming languages that use structured objects (arrays, records), or that have local variables with different scopes, the instantaneous state of computation is usually decomposed using the concept of l and r values. The simple view of computation that we suggested at the beginning of the course has instantaneous states that associate with each variable its r value. In principle, that's enough for explaining computation, but the description of structured assignments and entering/exiting local scopes gets complicated. So, we usually decompose this simple notion of state into an environment that maps variables to l values, and a state (alas, we reuse the same name) that maps l values to r values. The environment is usually determined entirely by the local scopes containing a control point in a program, plus the stack of procedure calls. The state varies completely dynamically, and may be different each time we reach the same control point.
Different programming languages vary a lot in the generality of their treatment of l values. In some languages, l values occur only implicitly through the use of expressions on the left-hand sides of assignments. Pointers may be understood as l values converted explicitly into r values so they can be used on right-hand sides. It's confusing to talk about pointers, because the l value of a pointer is the address at which the pointer value is stored, while the r value of a pointer is the address of the thing that it points to.
The dereferencing operator applied to a pointer value p (p^ in Pascal, *p in C) takes the r value of p (which is a pointer value), and uses it as an l value. So, p^ := 1 and *p = 1 store 1 at the address p points to. On the right-hand side of an assignment, everything is automatically mapped to an r value, so p^ and *p refer to the value at the address pointed to by p. In Pascal, the r values of pointers are all allocated in a special area of memory called the heap using the new procedure, so the r value of a pointer is never the same as the l value of a variable. C uses the malloc function to create new r values of pointers on the heap, but it also allows a pointer to point to other addresses, including the l values of variables, using the address operator &. &x is the l value of the variable x, turned into an r value---that is, the address of x turned into the value of a pointer.
The address operator is logically peculiar, and can't be understood in the same way as other operators. On the right-hand side of assignments, all variables are implicitly dereferenced to their r values. Given only the r value of a variable, there's no way to find its l value. Unlike arithmetic and other functions, the & operator does not apply a function to the value. Rather, it cancels the implicit dereferencing of the variable symbol that it precedes.
Actual parameters to procedures and functions may be passed as l values (call by reference) or r values (call by value). Pascal lets the programmer choose. C uses only call by value, but since & converts an arbitrary l value into an r value, it is easy to simulate call by reference. That's why you see so many ampersands in C function calls.
Even without pointers, the idea of l vs. r values allows us to describe variations in the flexibility of assignments to different types. Scalar variables are assignable as entire values, but there is no provision for breaking them into components and assigning those components individually. That is, a scalar variable has an l value for the whole object, but there is no way to derive l values for components. Arrays and records/structures, on the other hand, have individually assignable components. That is, given the l value of an array A, we may derive an l value for its nth component, by indexing into the array on the left-hand side of an assignment (A[n]). Although arrays and functions both correspond to set-theoretic function types, they have very different assignabilities. Arrays are assignable by components, and functions aren't assignable at all---they get permanent values from their declarations.
Perhaps the most important refinement of the set-theoretic concept of types is the concept of types as sets plus some operations. In procedural languages, we must indicate the operations on l values associated with a type---including the kinds of assignments allowed, as well as on its r values. This sort of structure---a set plus a list of operations on it---is called an abstract datatype. Classes in C++ provide one way of using abstract datatypes in an actual programming language. Abstract data types allow us in many cases to separate the specification of what a piece of program needs to accomplish from the implementation that determines how to accomplish it. That separation has proved extremely valuable in complex programming tasks, where we need to change implementations, and to distribute responsibility for different implementations to different people and teams. In mathematics, an algebra is a lot like the implementation of an abstract datatype, and the signature of the algebra is the abstract datatype that it implements.
The oldest concept of datatype is probably the one used in discussions of computer architecture, and early Fortran compilers. There, a datatype is usually identified as a particular layout of bits in memory, and a way of interpreting those bits as a value. The text has a nice discussion of the layout of arrays. Even scalar types need to have their bit layouts defined somewhere. There are two different representations of integers, called 1's complement and 2's complement due to the different ways of implementing negation. Floating point layouts are susceptible to even more variations. Since the interpretation of a layout of bits is determined by the operations (e.g. +, -, * on integers) that we perform on those bits, there is a connection between the layout/interpretation and abstract datatype concepts. But, computer architecture normally provides the operations that interpret bunches of bits as values in the form of tools that may be used in any combination. Compilers usually use the abstract datatype concept to restrict programs so that their uses of types are consistent in some sense.
Many intuitively appealing types are naturally defined in a recursive fashion. For example, we would like to define the type of list of integer in the following set-theoretic notation:

list = {nil} + (integer x list)

Neither Pascal nor C allows this sort of recursive definition of type. Because of the influence of the layout view of types, both of these languages insist that the memory footprint of a type is fixed by the type declaration. In a recursive type definition, Pascal and C require that every recursion be mediated by a pointer type. For example, in Pascal:

list = ^record first: int; tail: list end

The marked union with {nil} is included in the use of the pointer type, since all pointer types in Pascal (and C) contain the special value nil that doesn't point to anything. Pascal and C only require that the pointer type appear somewhere between the type being defined and the recursive invocation of that type. For example, the following declaration is also legal Pascal, but it defines a slightly different type omitting the empty list:

list = record first: int; tail: ^list end
Some programming languages, such as Scheme, have no built-in concept of type---equivalently, all objects have the same type. One of the most important benefits of formal type systems in programming languages is that automated type checking can find lots of errors that would otherwise be left to tedious debugging. Type checking may be done ``statically'' at compile time or ``dynamically'' at run time. Compile-time type checking is more common, and more valuable, since it uncovers errors sooner. The impact of type checking on the structure of a programming language is a bit ironic. Type checking improves structure by manipulating the relationship between correct and incorrect programs: a slight error in typing a program is likely to produce a type-incorrect program, rather than a program that runs and produces incorrect results. C is usually called a weakly typed language, since the use of types is mostly voluntary, and it's easy for a programmer to assign a value in one type and then pretend later that it has another type. Pascal is usually called a strongly typed language, since the designers tried to make type correctness absolutely mandatory, and almost succeeded.
Unfortunately, many widely used programming languages still suffer from the confusion of two different sorts of relationship between type symbols. Consider the following two type declarations:

type Cartesian_triple = record x: real; y: real; z: real end

type momentum = record x: real; y: real; z: real end

If we just consider the technical meanings of these declarations in Pascal (or similar ones in C and other languages), they appear to be equivalent. But, the names of the types suggest very different attitudes. It's likely that the programmer intended to define Cartesian_triple as a synonym for a record containing three real fields. Probably, there was no intention to distinguish Cartesian_triple from other triples of real numbers. On the contrary, the programmer probably intended momentum to represent a triple of real numbers used to represent, or implement, a physical quantity of momentum. Dimensional analysis in physics is essentially a kind of type checking, which uncovers a lot of errors in formulae in a particularly simple way. For example, although momentum and velocity are both represented by triples of reals, a velocity multiplied by a mass yields a momentum, and not another velocity. C and Pascal don't support dimensional analysis very well, largely because of the failure to distinguish between the two sorts of declarations above.
With an understanding of the difference between type definition and type representation/implementation, modern programming languages are starting to provide two different notations for associating names with types. C++ classes may be used for this purpose. A muddy appreciation that there was such a problem led to the distinction between ``structural equivalence'' and ``name equivalence'' of types treated on pages 139-141 of the text.
The precise behavior of type checking depends critically on the way in which type equivalence is defined. Some programming languages use an asymmetric relation of type compatibility to determine when the type of an expression is ``compatible'' with the type of a formal parameter.
It seems that a lot of the variety in concepts and uses of types can be understood in terms of a 2 by 2 diagram:
On the horizontal dimension, I distinguish concepts in which the type of a value is a quality of the value (inherent type), from concepts in which type is a way of interpreting representations as values (interpretive type). On the vertical dimension, I distinguish concepts in which we declare the type that we intend an object to have (intensional/prescriptive type) from concepts in which we describe the type that we observe an object to have (extensional/descriptive). I don't have high confidence in this diagram, but it should at least provoke some thought about the relations between the variety of concepts associated with ``type.''
                                Inherent                        Interpretive

    Intensional/Prescriptive    compile-time type checking;     abstract datatypes;
                                proof theory                    category theory

    Extensional/Descriptive     run-time type checking;         bit representation;
                                set theory                      machine ops; algebra