
Nesting into SQL

It's perfectly plain in Buneman et al.'s "Comprehension Syntax", but I didn't realize it until now: Links's list/bag/set comprehensions don't correspond to SQL. There are simple expressions, using only comprehensions and no fancy language features, that cannot be compiled directly into SQL, because SQL's relations are always flat, whereas the structures our comprehensions build can be nested. To wit:

for x <- t in
  [(x.name, for y <- s where y.t_id = x.id in [y.id])]

This should give us a list of pairs, the first of which is the name attribute of a thing in t and the second of which is a list of corresponding id values taken from s. But SQL has no structure directly corresponding to a "list of tuples of lists."
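
To make that concrete, suppose t contains rows with ids 1 and 2 and names "widget" and "gadget", while s contains ids 1 and 2 pointing at the first row of t and id 3 pointing at the second (hypothetical data, used throughout below). The comprehension should then evaluate to

[("widget", [1, 2]), ("gadget", [3])]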

Currently, the Links interpreter will simply decline to optimize these comprehensions, leading to a single query against the table t and as many queries against s as there are rows in t. This is abysmal; it's distinctly worse than coding in PHP where at least you can write one efficient query.
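
On that data, the interpreter would issue something like the following (a sketch; the exact SQL Links emits may differ):

select name, id from t;
select id from s where s.t_id = 1;
select id from s where s.t_id = 2;

with one query against s per row of t.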

I see two plausible solutions. One is to modify Links to bring it into line with SQL's model. I'm not too fond of this: ruling out nested structures across the board would be unacceptable, and creating a query sub-language restricted to flat relations would diminish the Links goal of unifying the algorithmic language with the query language.

Alternatively, and much better, we could do the join, and then post-process the results to group them into the proper structure. In this case, the query would be

select t.name, s.id from s, t where s.t_id = t.id order by t.name
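
On the hypothetical data, this returns the flat rows (gadget, 3), (widget, 1), (widget, 2), sorted by name so that each group is contiguous.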

And the post-processing might be

# Fold adjacent rows with the same name into one record, collecting
# their ids; this relies on the rows arriving sorted by name.
fun grouper(row, accum)
{
    if accum <> [] && row.name == hd(accum).name
    then [{name = row.name; id = [row.id] ++ hd(accum).id}] ++ tl(accum)
    else [{name = row.name; id = [row.id]}] ++ accum
}
fold(grouper, [], results)

This just takes the records with a common name field and folds them into a single record, with all the corresponding id fields listed therein. I'm relying on the ordering by t.name in order to do this in linear time, but a more general/robust solution could probably work in n log n time, by keeping an indexed structure to look up the record into which any given row should be folded.
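
Here, for instance, is a minimal sketch of that n log n variant in Haskell (a notation I'll fall back on later anyway), assuming the joined rows arrive as (name, id) pairs in no particular order; Data.Map plays the part of the indexed structure, and the names here are invented for illustration:

import qualified Data.Map as Map

-- Group rows without relying on ordering: each insertion is a
-- logarithmic-time map operation, so the whole pass is O(n log n).
groupRows :: [(String, Int)] -> [(String, [Int])]
groupRows = Map.toList . foldr insertRow Map.empty
  where insertRow (name, i) = Map.insertWith (++) name [i]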

It will take a bit of tinkering to come up with the grouper for a given nested comprehension, in general. Some thoughts toward that:

In SQL, only the so-called aggregate functions (max, avg, sum, count) may be applied to the values being grouped together; essentially, only functions that yield an atomic datum. The operation we need might be equivalent to a "group by" whose aggregation operation is list/bag/set construction. Looking at it this way might allow us to apply algorithms and transformations that already exist for SQL's "group by." But how do we recognize which columns are being grouped by and which ones are being grouped together?

Well, a condition that makes the example problematic is that we have nested queries, and there is a "bare" piece of the top query: a value from t is used outside of the inner query. I think this condition implies the sort of nested structure I'm talking about. Furthermore, it is a smoking gun as to how we should structure the groupings: the variable used in the outer comprehension indicates the columns to group by, and the inner comprehension itself becomes the aggregation operation.

This needs to be refined, of course, but I think it is a lead. Essentially, the idea is to take a nested comprehension like this:

for x <- t in
  [(x.name, for y <- s where y.t_id = x.id in [y.id])]

and convert it to a faux-SQL query like so:

select t.name, makebag(s.id) from s, t
where s.t_id = t.id group by t.name

where makebag() is the "aggregate function" that constructs a bag from its arguments. (In fact, this should fall out as a comprehension over the grouped elements.) The faux query comes apart into a real query and a post-processing step:

select t.name, s.id from s, t where s.t_id = t.id

group_by [name] makebag results

I can't see writing a general post-processor within Links, since it would need something like reflection to dissect an arbitrary row on given fields. Let's use a Haskell-like meta-language notation instead. Suppose we want to group by some list of fields g, and that project g is a function that projects those fields out to form a new record. We also have an aggregation function, f. The needed post-processor should behave like this (though more efficient algorithms should be used):

group_by g f rows = map (onSecond f) (collate g rows)
    where collate g list = [let ys = whereEq (project g) x list in
                                (x, map (project g') ys)
                            | x <- uniq (map (project g) list)]
              where g' = complement g       -- fudging here a bit
          whereEq f x = filter (\y -> f y == x)
          onSecond f (x, y) = (x, f y)
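
For what it's worth, here is one way to make that sketch concrete in actual Haskell, representing rows as association lists from field names to values so that project can be written without reflection (the row representation and helper names are invented for illustration, not part of Links):

import Data.List (nub)

type Field = String
type Row   = [(Field, String)]

-- project g keeps only the fields named in g.
project :: [Field] -> Row -> Row
project g row = [ (f, v) | (f, v) <- row, f `elem` g ]

-- The "complement" of g, taken relative to a concrete row.
dropFields :: [Field] -> Row -> Row
dropFields g row = [ (f, v) | (f, v) <- row, f `notElem` g ]

group_by :: [Field] -> ([Row] -> b) -> [Row] -> [(Row, b)]
group_by g f rows = map (onSecond f) (collate rows)
  where
    collate list = [ (x, map (dropFields g) ys)
                   | x <- nub (map (project g) list)
                   , let ys = whereEq (project g) x list ]
    whereEq p x = filter (\y -> p y == x)
    onSecond h (x, y) = (x, h y)

-- makebag collects the remaining fields' values; a list stands in for a bag.
makebag :: [Row] -> [String]
makebag = concatMap (map snd)

example :: [(Row, [String])]
example = group_by ["name"] makebag
    [ [("name", "widget"), ("id", "1")]
    , [("name", "widget"), ("id", "2")]
    , [("name", "gadget"), ("id", "3")] ]
-- => [([("name","widget")],["1","2"]), ([("name","gadget")],["3"])]

On the running example, group_by ["name"] makebag recovers exactly the nested result we wanted from the comprehension.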

I aim to set all this in a more gentle theoretical framework. I'd like to have a formal definition of the NRC, SQL, and our query-optimizable language; then we can give the translation between them, and then it's a simple matter of programming. Shouldn't be too hard.
