« My Research | Main | The Utility of PUT »

The Annotations Problem

Here is a problem that occurs in statically-typed functional languages, which doesn't come up in dynamic languages (the tradeoff, of course, is static safety). I'll state the problem and give a partial solution, but I'm interested in other approaches.

Frequently we're faced with a complex recursive data structure, which ought to have a rigid type. For example, here is a tree structure that you might define in ML:

type 'a tree = [`Leaf of 'a | `Unary of 'a tree 
                | `Binary of 'a tree * 'a tree]

This forces us to give the right number of sub-trees when constructing a binary node, a unary node, or a leaf. It is good that the type system enforces this.

The problem occurs when we have some functions that wish to add information to this tree, perhaps in a sloppier way. For example, a function that assigns a unique number to each node of a tree, in preparation for some other algorithm that will consume these numbers. It's hard to write such a function in ML: we can either construct a new, parallel, data structure, which has fudged constructor names, such as:

type 'a ntree  = int * [`Leaf of 'a | `Unary of 'a ntree 
                | `Binary of 'a ntree * 'a ntree]

O'Caml's polymorphic variants allow us to re-use the same constructor names, but we still have to define a new type in order to add the annotations.

The problem becomes worse if my program uses a second kind of annotation—for example, a function may take a raw tree and assign type information to it. Now we need four ntree types, since each kind of annotation might be present or not. And we need hideous functions that convert between these functions, throwing away and restoring data just so that you can pass the trees from one utility function to another, depending on which annotations are expected by each one.

This is a place where dynamic languages win big. In Perl and Ruby, for example, almost any value is in fact a dictionary, and so it can be extended with arbitrary annotations at will. Of course, in these languages, there is no way to ensure (statically) that a function receives the structure with the annotations that it expects.

The "Annotations Problem," then, is to give an elegant way of constructing a recursive type which (implicitly) allows arbitrary annotations at every node of its recursive self.

This should be doable with known technology. In a language like OCaml but with extensible record types, you could define the type as follows:

type 'content 'annot atree =
    'annot * 
      [ `Leaf 'content
      | `Unary of 'content 'annot atree 
      | `Binary of 'content 'annot atree * 'content 'annot atree]

The 'annot type parameter can be used to supply the type of annotations that you are interested in. You can do this today in O'Caml; to solve the problem, though, you need to instantiate'annot as an extensible record type. Then, any given function will use only certain labels in the record. If I have a richer annotated-tree, I can easily pass it into a utility function that needs only a subset of my annotations (maybe none of them). At the same time, if my function uses a label foo from the annotation the type system will ensure that any tree given to the function has a label foo on every node—which is generally the guarantee that I want.

I'm not sure whether any current language supports polymorphic variants, extensible records, and recursive types, which allows this solution. Even then, the solution is a bit cumbersome, because a function which cares not about the annotations still needs to mention the annotations in its pattern matching, and on its result constructions, in order to ensure that the annotations are preserved (what to do if a function changes the structure of the tree is out of scope!). A further, and more serious problem, is that any type needs to be declared in this special way in order to admit annotations: if I want to annotate a library-provided structure, I need to ask the library author to rebuild with this in mind, and that will break all other consumers.

Is it sensible to speak of annotations always being available for any recursive data structure? It exists in dynamically-typed languages.

The statement of the problem calls for elegance. My solution would be an improvement, but it's not perfect. More elegant solutions are welcome.