« Slow blog year chez Ezra | Main | Discovery! \qedhere »

Sensible SQL: The second typing for grouped columns

SQL keeps turning out to be more sensible than I thought it was.

If you're like me, you thought that GROUP BY clauses partitioned your columns into "grouping columns" and "non-grouping columns" and that the MUST to be processed with aggregate functions (count, sum, max, min, average) and that the second MUST NOT be processed that way. This would have been a problem for converting Peyton Jones and Wadler's Comprehensive Comprehensions to SQL, since that technique uses a ncie uniform treatment of group-by clauses: they transform all the query's bound variables to list type for the remainder of the query. Example—here's a simple query that takes a table, groups it on values from columna a, and returns pairs of the value from a and the sum of the list of corresponding column b values:

select a, sum(b) from t group by a

Fine and simple. And as you may know, the following produces an error:

select a, b from t group by a

Because we've grouped on a, we can't just select column b within each grouping because there are multiple column b values. The DBMS will insist that we use an aggregate function to convert this collection to a primitive-type value. Such are the restrictions of the relational model.

But Peyton Jones and Wadler's proposal simply treats all columns, when grouped, as lists. So

[(a, b) | (a, b) <- t, group by a]

is a perfectly legal expression of type [([A], [B])] (letting A and B be the raw column types of a and b, resp.). What they expect you to do, to make SQLizable expressions, is to apply some other function to reduce each column, a typical example being

[(the a, sum b) | (a, b) <- t, group by a]

which corresponds to the first SQL query above: it collects the values from a and matches these with the sums of corresponding values from b. But the above syntax allows you to do things like this:

[(sum a, sum b) | (a, b) <- t, group by a]

which seems to correspond to SQL that applies an aggregate function to a grouped column as well as an ungrouped column. "Whither SQL!?" I worried! Yet all was not lost. SQL treats grouped columns as being alternately bag-typed or primitive-typed, depending on the context. So the following is a valid query:

select sum(a), sum(b) from s group by a

Huzzah! This shows SQL to be in closer accordance with a sensible, uniform, orthogonal languge such as Peyton Jones and Wadler's than I'd imagined. In other ways, too, I keep finding it easier and easier to make SQL queries out of difficult FP expressions by using little-known SQL features. Cheers, guys.

Post a comment