Immutability and Persistence

What makes implementing efficient data structures harder in a purely functional setting compared to an imperative one?

  1. The lack of imperative, or destructive, updates leads to copying possibly large parts of the data structure, and

  2. All versions of a data structure are expected to be persistent and can be used even after updates.

Any practical compiler for a purely functional language aggressively looks for opportunities to share the representation of data that is common to multiple data structures. Nevertheless, space overhead is unavoidable when compared to similar data structures in imperative settings. However, as we will see later in the course, with clever design and a dash of lazy evaluation, many efficient data structures can be designed for purely functional settings. But first, some basics...

Lists

The following function updates the element in xs at index i, if one exists, with the value y:

update : List a -> Int -> a -> List a
update xs i y =
  case (xs, i) of
    ([],     _) -> []
    (x::xs', 0) -> y :: xs'
    (x::xs', _) -> x :: update xs' (i-1) y

As with the find function that walks through a list one element at a time, so too does update by keeping track of how many more elements i to traverse. The worst-case running time of update is O(n), where n is the size of xs. Why isn't the worst-case running time infinite, when i is negative?

In addition to running times, we also need to pay attention to the space overheads incurred. That is, how much additional data needs to be represented during and after a call to the function.

In the second case, notice that the output list y :: xs' refers to xs', which is identical to the tail of the input list. This is an opportunity for the language implementation, which will of course use pointers and pointer manipulation, to share the underlying representation of xs' among the representations of the input and output lists.

In the third case, on the other hand, there is no such opportunity for sharing; the resulting list is built from x and some unknown list that results from the recursive call to update'. As a result, a new memory cell must be allocated for x in the output list and concatenated to the list that results from the recursive call. This case embodies the worst case with respect to space (in addition to time). If update needs to update the last element in the list, then the first n-1 elements of xs are copied and cons-ed on to the singleton list y :: []. Thus, update allocates O(n) space in the worst case.

Another list operation to consider is append:

append : List a -> List a -> List a
append xs ys =
  case xs of
    []     -> ys
    x::xs' -> x :: append xs' ys

Let n be the size of xs and m be the size of ys. Notice that the elements of ys are not traversed at all (so its number of elements cannot contribute to execution time) and ys is referred to by name in the first case, where the language implementation can share its representation in the input and output lists. Thus, the worst-case time and space cost of append is O(n).

In practice, it is common to use a single variable n to describe the combined size of the lists, or the size of the larger list. In either case, the worst-case time and space costs are still O(n).

Binary Search Trees

Recall the findBST function that searches for an element in a binary-search-ordered tree, making a recursive call only in the subtree where the element can possibly appear. The insertion algorithm that preserves the binary search property is similar:

insert : comparable -> Tree comparable -> Tree comparable
insert x t = case t of
  Empty -> Node x Empty Empty
  Node y left right ->
    if x == y then      t
    else if x < y then  Node y (insert x left) right
    else {- x > y -}    Node y left (insert x right)

As with findBST, insert runs in O(h) time, where h is the height of the tree. The height h of a tree t is equal to its size n in the worst case, where every Node has only one element (that is, t is a crooked list). And h is O(n┬Ělog(n)) in the best case, when t is (roughly) balanced.

Regarding space, the insert function copies into the output tree all of the node values y that are observed until an Empty is reached. But the right subtree in the second case and the left subtree in the third case are common to both the input and output trees, so the implementation has the opportunity to share this representation. Because at most h nodes are traversed, insert allocates O(h) space in the worst case.

Persistence.elm contains the List and Tree examples above.


Reading

Required

  • Okasaki, Chapters 1 and 2