
March 9, 2009

Grust & Schreiber's Ferry: Database Queries in a Sensible Language

Last week we had some really cool guys come visit us from Germany: Torsten Grust and Tom Schreiber of Universität Tübingen. I was aware of Grust's work on functional database query languages from some years ago, but was surprised when he emailed us (the Links team) and let us know about their new work with Ferry [1], essentially a nice, first-order, functional programming language which they compile to bundles of SQL queries.

This overlaps a lot with what I've been doing with compiling Links to SQL. I had recently discovered a way of compiling expressions in a pure, non-recursive, higher-order fragment of Links to single SQL queries, given the condition that the source expression has the same type as an SQL query: a multiset of tuples of base type.
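To make that last condition concrete, here's the flavor of expression I mean, sketched in Haskell rather than Links syntax (the table and its fields are invented for the example):

type Employee = (String, String, Int)        -- (name, dept, salary)

-- A higher-order helper: the predicate is a first-class function.
richIn :: (Int -> Bool) -> [Employee] -> [(String, String)]
richIn wealthy employees =
  [ (name, dept) | (name, dept, salary) <- employees, wealthy salary ]

-- Once the function argument is supplied, the whole expression has the type
-- of an SQL result (a flat collection of base-typed fields) and can be
-- normalized down to a single query, roughly:
--   select name, dept from employees where salary > 50000
example :: [Employee] -> [(String, String)]
example = richIn (> 50000)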

What Torsten and Tom have figured out how to do is to compile expressions of nested data type to a small set of queries having flat type, such that you can easily assemble the nested result from the flat queries. This is something I had thought about but hadn't cracked, so I was very pleased when the Tübingen guys turned up with a way to do it. In fact Jan Van den Bussche had figured out a similar encoding (of nested to flat algebra) in 2001 [2], in a purely theoretical setting and with (unordered) sets as the only container type. Torsten and Tom can actually handle unordered bags and ordered lists, which is another big step forward.
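To give the flavor of the flat-to-nested assembly (this is just the general shape, not Ferry's actual translation; the names here are invented), think of the nested result as two flat query results linked by a surrogate key, which the host program zips back together:

import qualified Data.Map as M

type Key = Int

-- outer: one row per outer element, carrying a surrogate key and its fields.
-- inner: one row per inner element, tagged with its parent's surrogate key.
-- Both are flat tables that a relational engine can produce directly.
assemble :: [(Key, String)] -> [(Key, Int)] -> [(String, [Int])]
assemble outer inner =
  [ (name, M.findWithDefault [] k innerByKey) | (k, name) <- outer ]
  where
    innerByKey = M.fromListWith (flip (++)) [ (k, [v]) | (k, v) <- inner ]

Only the final zipping-together happens outside the database.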

Their work and my work overlap, but we each had things that the other didn't have. They had nested data structures and ordered collections (lists), while my compilation works for a higher-order language, where functions are first-class values. Also I'm working in the setting of an impure language with side-effects; we can statically identify database-compilable expressions, which must be pure, by using a type-and-effect analysis. The two systems seem to fit together very well, and we're looking at integrating them.

Torsten, Tom, Phil, Sam, and I had a nice visit together in Edinburgh, where we pried into the details. Three cheers for collaborative visits!

[1] "Ferry: Database-Supported Program Execution." Torsten Grust, Manuel Mayr, Jan Rittinger, Tom Schreiber. Proc. 28th ACM SIGMOD Int'l Conference on Management of Data (SIGMOD 2009), June 2009. To appear.
[2] "Simulation of the nested relational algebra by the flat relational algebra, with an application to the complexity of evaluating powerset algebra expressions." Jan Van den Bussche. Theoretical Computer Science, 254(1-2). p. 363-377. 2001.

August 27, 2008

Sensible SQL: The second typing for grouped columns

SQL keeps turning out to be more sensible than I thought it was.

If you're like me, you thought that GROUP BY clauses partitioned your columns into "grouping columns" and "non-grouping columns", that the non-grouping ones MUST be processed with aggregate functions (count, sum, max, min, average), and that the grouping ones MUST NOT be processed that way. This would have been a problem for converting Peyton Jones and Wadler's Comprehensive Comprehensions to SQL, since that technique uses a nice uniform treatment of group-by clauses: they transform all the query's bound variables to list type for the remainder of the query. Example—here's a simple query that takes a table, groups it on the values of column a, and returns pairs of the value from a and the sum of the corresponding column b values:

select a, sum(b) from t group by a

Fine and simple. And as you may know, the following produces an error:

select a, b from t group by a

Because we've grouped on a, we can't just select column b within each grouping because there are multiple column b values. The DBMS will insist that we use an aggregate function to convert this collection to a primitive-type value. Such are the restrictions of the relational model.

But Peyton Jones and Wadler's proposal simply treats all columns, when grouped, as lists. So

[(a, b) | (a, b) <- t, group by a]

is a perfectly legal expression of type [([A], [B])] (letting A and B be the raw column types of a and b, resp.). What they expect you to do, to make SQLizable expressions, is to apply some other function to reduce each column, a typical example being

[(the a, sum b) | (a, b) <- t, group by a]

which corresponds to the first SQL query above: it collects the values from a and matches these with the sums of corresponding values from b. But the above syntax allows you to do things like this:

[(sum a, sum b) | (a, b) <- t, group by a]

which seems to correspond to SQL that applies an aggregate function to a grouped column as well as an ungrouped column. "Whither SQL!?" I worried! Yet all was not lost. SQL treats grouped columns as being alternately bag-typed or primitive-typed, depending on the context. So the following is a valid query:

select sum(a), sum(b) from s group by a

Huzzah! This shows SQL to be in closer accordance with a sensible, uniform, orthogonal language such as Peyton Jones and Wadler's than I'd imagined. In other ways, too, I keep finding it easier and easier to make SQL queries out of difficult FP expressions by using little-known SQL features. Cheers, guys.
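For my own benefit, here's how I picture the first of those grouped comprehensions in plain Haskell (my rendering, not their desugaring, and with both columns assumed to be Int):

import Data.List (sortOn, groupBy)
import Data.Function (on)

-- [(the a, sum b) | (a, b) <- t, group by a], roughly:
groupedSum :: [(Int, Int)] -> [(Int, Int)]
groupedSum t =
  [ (head as, sum bs)                  -- 'the a' and 'sum b'
  | grp <- groupBy ((==) `on` fst) (sortOn fst t)
  , let (as, bs) = unzip grp ]

Replacing head as with sum as gives the [(sum a, sum b)] version, which, as it turns out, SQL is happy to accept too.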

February 14, 2006

Idiots are a Lot Smarter

Simon Willison (et al.) took some nice notes on a talk by Josh Schachter, developer of del.icio.us. Speaking from my experience developing web applications at Amazon.com and Six Apart, I agree with every one of these points. A lot of these are easy no-brainers that lots of web apps don't get right—like dropping users back where they were after registering.

Some particularly nice tidbits:

Scaling: avoid early optimization. SQL doesn't map well to these problems—think about how to split up data over multiple machines. Understand indexing strategies, profile every SQL statement. Nagios or similar for monitoring.

There's no explanation of what "these" problems are. Still, I think it's the right sentiment. The relational model is a weird model that was foisted on computing decades ago. It's become a crutch that developers—and particularly web developers—rely on excessively.

"Idiots are a lot smarter than you"—wait to see what breaks before you fix it.

There's little I hate more in engineering than attempts to fix unbroken stuff. Christopher Alexander's best line is, "Every act of building is an act of repair." We're here to make the world better, only.

January 30, 2006

Multi-way Joins in Links

Multi-way joins are now working (transformed from simple list comprehensions, that is) in the Links interpreter. There were a number of subtleties in the rewriting code that prevented this from working.

Currently, my cohort Jeremy and I are not sure whether it's better to use a deep-searching algorithm, like the one we have now, or to use simple, local rewrite rules with a fixed depth. The advantage of simple rules would be the ease of showing correctness. But it might be hard to show that such rules produce results as good as the deep search—local rewriting amounts to hill-climbing, and can reach a local optimum while missing the global one. The deep-searching rewriter we have now is rather hard to keep track of, as it relies on simultaneous knowledge of several different kinds of AST nodes, whereas we suspect that a shallow-rule system could be built with only about one rule for each way of combining two node constructors.
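To be concrete about what a shallow rule would look like, here is a toy version in Haskell (an invented miniature AST, not the real Links one):

data Expr
  = Table String                          -- a database table
  | For String Expr Expr                  -- for v <- src in body
  | Where Expr Expr                       -- where cond in body
  | JoinFor (String, String) (Expr, Expr) Expr Expr
                                          -- for (v, w) <- src1 * src2 where cond in body
  | Record [(String, Expr)]
  | Var String
  deriving Show

-- The rule inspects only a window of fixed depth: two directly nested
-- comprehensions with the condition straight inside become a join.
joinRule :: Expr -> Maybe Expr
joinRule (For x s (For y t (Where c body))) = Just (JoinFor (x, y) (s, t) c body)
joinRule _                                  = Nothing

A generic bottom-up traversal would apply a handful of rules like this to a fixed point; each rule can be argued correct in isolation, which is the attraction. The open question is whether such a fixed set of rules always finds the rewrites that the deep search finds.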

To decide this question, I think we need to flesh out the shallow-rule system, and in parallel we need to pursue some kind of assertions about the deep-search system. I still haven't been able to find any tidy results about this in the literature. There are scads of papers by Libkin & Wong that sound related, but I haven't yet done a thorough search to see what's there.

Some things I discovered: Codd's 1970 paper, A Relational Model of Data for Large Shared Data Banks, is the original paper for (flat) 'relational algebra', which is the formalism ours needs to compile into. Libkin & Wong's 1994 paper New Techniques for Studying Set Languages, Bag Languages and Aggregate Functions argues about the comparative power of various nested calculi, including those called BQL and NRCaggr.

January 3, 2006

Better Comprehension Through Memo(r)ization

I've been thinking about how to deal with these problems in optimizing query-backed comprehensions.

A reasonably general example of the problem is as follows. I include a join condition for the sake of discussion, but keep in mind that the condition could also be always-true.

for x <- s in
    let y = f(x) in
        for z <- t where x.id == z.id in
            (x, y, z)

First let's characterize the running time of this algorithm as it stands. Abusing notation, let s also stand for the number of rows in the table s, and let F(n) be the cost of running f on an input n. Also suppose that there is a cost q for each query, and that this cost is non-trivial, since a query involves marshalling data to and from an external process, which may be on another machine. Finally, let k be the average number of rows in t that match the join condition; that is, the k such that the number of tuples in the result is sk. The above algorithm should run in time O(sF(n) + sk + sq).

Now how is it affected by optimization? We have a rewrite rule which would rewrite the above as

for (x, z) <- s * t where x.id == z.id in
    let y = f(x) in
        (x, y, z)

This performs fewer queries but evaluates f(x) more times: once per result tuple rather than once per row of s. This version should take time like O(skF(n)). In lots of cases, this is worse.

It might be possible to estimate these costs, yielding a heuristic optimization rule that the compiler would employ.

But if we know that f is pure, we can memoize it, and that would give a much better optimization.

Supposing we can do the memoization lookups in some kind of O(log m) time, and noting that the combined loop does one lookup per result tuple, this gives a program that runs in time upper-bounded by O(sF(n) + sk log m): we've saved the O(sq) cost of doing multiple queries, and incurred the cost of the lookups, O(sk log m), in its place.
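Here is a sketch of the memoized, single-query version, written as plain Haskell over in-memory lists rather than real tables (joinMemo and the row representation are made up for illustration; in the real thing, the list comprehension below would be the one combined query sent to the database):

import qualified Data.Map as M

joinMemo :: (Ord k, Ord a)
         => ((k, a) -> c)                      -- the pure, costly f
         -> [(k, a)]                           -- table s, keyed on id
         -> [(k, b)]                           -- table t, keyed on id
         -> [((k, a), c, (k, b))]
joinMemo f s t = go M.empty [ (x, z) | x <- s, z <- t, fst x == fst z ]
  where
    go _    []              = []
    go memo ((x, z) : rest) =
      case M.lookup x memo of                  -- one O(log m) lookup per result tuple
        Just y  -> (x, y, z) : go memo rest
        Nothing -> let y = f x                 -- f runs at most once per distinct row of s
                   in  (x, y, z) : go (M.insert x y memo) rest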

Tricky Query Optimizations

Optimizing language-integrated queries is trickier than it looks.

Besides the structural problem I talked about yesterday, there are cases where a naive "optimization" can easily make the result run much slower than the direct version. This will happen whenever some time-consuming operation occurs between an inner and an outer comprehension. Suppose t is a table with loads of big numbers in the number column; consider this Links algorithm:

for x <- t in
    factors = factors_of(x.number);  # (costly)
    for y <- s in
        (y, factors)

A naive optimizer would say, "I should combine these two queries rather than re-running the inner query over and over again." It would yield an algorithm that works like this:

for (x, y) <- (t, s) in
    factors = factors_of(x.number);  # (costly)
    (y, factors)

But this algorithm might evaluate factors_of many more times than the original, since there may be more than one pair (x, y) for any given value of x. The optimization improved performance with respect to the database, but worsened the performance of the algorithm itself.
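One way out, if we can tell which variable the costly computation depends on, is to group the joined result on that variable and compute once per group, hoisting the computation rather than memoizing it. A sketch (the function and row shapes are invented):

import Data.List (sortOn, groupBy)
import Data.Function (on)

-- 'joined' stands for the result of the combined query, keyed on x.number.
perGroup :: Ord k => (k -> v) -> [(k, row)] -> [(row, v)]
perGroup factorsOf joined =
  [ (y, fs)
  | grp <- groupBy ((==) `on` fst) (sortOn fst joined)
  , let fs = factorsOf (fst (head grp))        -- evaluated once per distinct x
  , (_, y) <- grp ]

Whether this beats memoization presumably depends on whether the joined result comes back from the database already grouped.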

January 2, 2006

Nesting into SQL

It's perfectly plain in Buneman et al.'s "Comprehension Syntax", but I didn't realize it until now: Links's list/bag/set comprehensions don't correspond to SQL. There are simple expressions, using only comprehensions and no fancy language features, that cannot be compiled directly into SQL, because SQL's relations are always flat, whereas the structures our comprehensions build can be nested. To wit:

for x <- t in
  [(x.name, for y <- s where y.t_id = x.id in [y.id])]

This should give us a list of pairs, the first of which is the name attribute of a thing in t and the second of which is a list of corresponding id values taken from s. But SQL has no structure directly corresponding to a "list of tuples of lists."

Currently, the Links interpreter will simply decline to optimize these comprehensions, leading to a single query against the table t and as many queries against s as there are rows in t. This is abysmal; it's distinctly worse than coding in PHP where at least you can write one efficient query.

I see two plausible solutions; one is to modify Links to bring it into line with SQL's model. I'm not too fond of this; ruling out nested structures across the board would be unacceptable, and creating a sub-language for queries, which contained only flat relations, would diminish the Links goal of unifying the algorithmic language with the query language.

Alternatively, and much better, we could do the join, and then post-process the results to group them into the proper structure. In this case, the query would be

select t.name, s.id from s, t where s.t_id = t.id order by t.name

And the post-processing might be

fun grouper(row, accum)
{
    # If this row has the same name as the group at the head of accum,
    # cons its id onto that group; otherwise start a new group.
    if accum <> [] && row.name == hd(accum).name
    then [{name = row.name; id = [row.id]++hd(accum).id}]++tl(accum)
    else [{name = row.name; id = [row.id]}]++accum
}
fold(grouper, [], results)

This just takes the records with a common name field and folds them into a single record, with all the corresponding id fields listed therein. I'm relying on the ordering by t.name in order to do this in linear time, but a more general/robust solution could probably work in n log n time, by keeping an indexed structure to look up the record into which any given row should be folded.
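For the record, the indexed-structure version is easy to sketch, here in Haskell though the real thing would have to be written in Links: keep a map from name to the ids collected so far, so no ordering from the database is needed.

import qualified Data.Map as M

regroup :: [(String, Int)]           -- joined rows: (t.name, s.id)
        -> [(String, [Int])]
regroup rows = M.toList (foldr insert M.empty rows)
  where
    insert (name, i) m = M.insertWith (++) name [i] m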

It will take a bit of tinkering to come up with the grouper for a given nested comprehension, in general. Some thoughts toward that:

In SQL, only the so-called aggregate functions (max, average, sum, count) may be applied to the values being grouped together—essentially, only functions that yield an atomic datum. The operation we need might be equivalent to a "group by" whose operation is list/bag/set construction. Looking at it this way might allow us to easily apply algorithms and transformations that already exist for SQL's "group by." But how to recognize what columns are being grouped by and which ones are being grouped together?

Well, a condition that makes the example problematic is that we have nested queries, and there is a "bare" piece of the top query: a value from t is used outside of the inner query. I think this condition implies the sort of nested structure I'm talking about. Furthermore, it is a smoking gun as to how we should structure the groupings. The variable used in the outer comprehension represents the one to group by; the inner comprehension, in itself, becomes the aggregation operation.

This needs to be refined, of course, but I think it is a lead. Essentially, the idea is to take a nested comprehension like this:

for x <- t in
  [(x.name, for y <- s where y.t_id = x.id in [y.id])]

and convert it to a faux-SQL query like so:

select t.name, makebag(s.id) from s, t
where s.t_id = t.id group by t.name

where makebag( ) is the "aggregate function" that constructs a bag from its arguments. (In fact, this should fall out as a comprehension over the grouped elements.) The faux query comes apart into a real query and a post-processing step:

select t.name, s.id from s, t where s.t_id = t.id

group_by [name] makebag results

I can't see writing a general post-processor within Links, since it would need something like reflection, to dissect an arbitrary row on given fields. Let's use a Haskell-like meta-language notation. Suppose we want to group by some list of fields g and that project g is a function that projects those fields out to form a new record. We also have an aggregation function, f. The needed post-processor should behave like this (though more efficient algorithms should be used):

group_by g f rows = map (onSecond f) (collate g rows)
    where collate g list = [ let ys = whereEq (project g) x list in
                                 (x, map (project g') ys)
                           | x <- uniq (map (project g) list) ]
              where g' = complement g       -- fudging here a bit
          whereEq f x = filter (\y -> f y == x)
          onSecond f (x, y) = (x, f y)
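Instantiated for the name/id example, with rows as plain pairs rather than reflected records (so project g is just fst, project g' is snd, and makebag simply collects a list), the post-processor comes out to something like:

import Data.List (nub)

groupNameIds :: Eq a => [(a, b)] -> [(a, [b])]
groupNameIds rows =
  [ (x, map snd (filter ((== x) . fst) rows))   -- collate on the name field
  | x <- nub (map fst rows) ]                   -- one entry per distinct name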

I aim to set all this in a more gentle theoretical framework. I'd like to have a formal definition of the NRC, SQL, and our query-optimizable language; then we can give the translation between them, and then it's a simple matter of programming. Shouldn't be too hard.
