" /> Ezra's Research: January 2006 Archives

« December 2005 | Main | February 2006 »

January 30, 2006

Multi-way Joins in Links

Multi-way joins are now working (transformed from simple list comprehensions, that is) in the Links interpreter. There were a number of subtleties in the rewriting code that prevented this from working.

Currently, my cohort Jeremy and I are not sure whether it's better to use a deep-searching algorithm, like the one we have now, or rather to use simple, local rewrite rules with a fixed depth. The advantage of simple rules would be the ease of showing correctness. But it might be hard to show that such rules must produce more optimal behavior—there is the possibility of hill-climbing, reaching a local optimum and thereby missing a global optimum. The deep-searching rewriter we have now is rather hard to keep track of, as it relyies on simultaneous knowledge of several different kinds of AST nodes, whereas we suspect that a shallow-rule system could be built with only about one rule for each way of combining two node constructors.

To decide this question, I think we need to flesh out the shallow-rule system, and in parallel we need to pursue some kind of assertions about the deep-search system. I still haven't been able t find any tidy results about this in the literature. There are scads of papers by Libkin & Wong that sound related, but I haven't yet done a thorough search to see what's there.

Some things I discovered: Codd, 1970, A relational model for large shared data banks is the original paper for (flat) 'relational algebra' which is the formalism ours needs to compile into. Libkin & Wong's 1994 paper New Techniques for Studying Set Languages, Bag Languages and Aggregate Functions argues about the comparative power of various nested calculi, including those called BQL and NRCaggr.

January 29, 2006

SeaFunc: Stumbling Monk

Neat! The Seattle functional programming group meets at The Stumbling Monk on Olive Way.

The Stumbling Monk is a great little hole-in-the-wall with essentially no decor, the bartenders are approximately college kids with an intricate knowledge of beer varieties, and all they have is Belgian beer. When I was there they had a stack of very obscure board games in the corner. It used to be in the storefront next door, and it used to be a brewery-paraphernalia shop, rather than a bar. At some point they moved into the abandoned typewriter shop on the corner, and didn't bother to freshen up the awning, which still has faded, partially-fallen words ("Typewriter") from the shop that had been there before. I'm glad they didn't change it; it's a venerable, recognizable piece of recent Seattle history, mundane to the outsider, but familiar and sweet to the Seattle resident.

January 23, 2006

Message-passing and Entanglement

Also today, I pondered some issues around concurrency. I had the realization that, although message-passing may be a useful language mechanism (as a sole concurrency mechanism) for structuring a lot of programs, it's probably not good for applications that have a lot of data interdependence and need strong atomicity properties. Here was my insight:

Suppose you're building software for a complex system—like a simulator, say—and there are discrete units in the system that seem like they are largely independent from one another. It may be attractive to implement these units as distinct processes (threads) in the way you structure the software. You can write a single process description that sends messages whenever the real, modelled object has an interaction with other objects. This might seem appealing, but it's not necessarily the right thing to do.

The critical thing is that if, as in Erlang, message-passing is the sole means of concurrency-control, then passing a message is the only kind of atomic operation. That means that if variables x and y are updated together in some atomic operation, they have to be managed by the same process. And if y and z are also part of some atomic operation, they too have to be managed by the same process.

So the natural division of your program into threads, mirroring the division of modelled objects, may not be correct, since there may be some operation which must be atomic, and must involve two or more "natural" objects. Furthermore, it's not hard to imagine most or all of the variables in a program becoming entangled in this way.

This is not to say message-passing is not useful, or that there aren't lots of programs that could profitably be written that way. Many applications, after all, don't require strict atomicity in every of the operations that are in principle atomic. But, I think some real safety-critical applications are going to have data that is so entangled that, if they had to use the message-passing approach, they'd lose out on the benefits of concurrency.

So I continue my search for other concurrency abstractions that the language can usefully support.

Term-Rewriting Combinators

Progress on the Links interpreter.

Jeremy checked in his completely-overhauled optimizer combinators. We now have ways of defining term-rewriters by combining rewriters; and we can compose a structurally-recursive function just by giving one case and specifying a strategy, such as "bottom-up" or "top-down." This approach is pretty clearly laid out in the papers of Limsoon Wong, but it took us a while to rewrite out existing optimizer functinos this way and to find just the right combinators. Hopefully we'll have a chance to publish the specific techniques that we're finding useful and the pitfalls.

Now, we've cut our optimizer code from ~1200 lines to about 400, a rather good improvement. It's far more readable and approximately as correct! We also spent some time pairing to create unit tests, which was fruitful and makes me feel much more confident that these things will keep working in the future.

January 20, 2006

Languages and Multithreading

Some references on the state of mainstream programming languages for multithreading:
Software and the Concurrency Revolution (in ACM Queue) has some really good observations about what problems exist and why current techniques don't solve them, or don't solve them yet. The author argues that the problems need to solved for imperatiev languages, not just for functional languages.

Tim Bray's On Threads describes the problems that exist in the Java Standard Libraries, with lots of redundant locking at different levels—and you can still have concurrency problems because it doesn't help the programmer address safety and deadlock at the application level. Until I read this I didn't realize the situation was so bad in that arena.

Some cutting-edge (avant-garde?) languages are marginally better at this stuff, but I agree with the first author that the problem should be solved with new abstractions that can be applied in any langauge; it's not sufficient to solve this just in functional languages, as there's a lot of code out there that's not going to be rewritten in another language anytime soon. Even the best functional approaches (I'm thinking of STM in Concurrent Haskell and Erlang) don't seem to have the problem locked down—Haskell's STM moots referential transparency; and Erlang's message-passing model hasn't been shown to be powerful enough for all applications. This is a pretty interesting research area and I want to learn more.

January 19, 2006

Raw Data Lacking on the Web

Why isn't there more data on the web?

There seems to be a lot of news stories and opinion articles, but comparatively little data.

For example, I was just reading a piece asserting that book sales were up over the last decade, and provided a link. At the end of the link, I was hoping to find a nice graph showing this supposed spike in book sales. Instead, it was a USA Today article with a bunch of, you know, text. Why would I want to read a bunch of text with nothing more behind it than the editorial voice of USA Today?

Surely some reputable institute has counted the total number of books sold in each of the last ten years. Why isn't that study available for free on the web?

I'd expect to see the raw numbers, a graph displaying them, and ideally the full study which puts those numbers in context and describes the methods, etc. Sometimes you can get this kind of data; for example, in government studies like these power industry statistics. But most of the time, bloggers and others are pointing at news reports, which has at least two problems: first, it has too many layers of editorialism (adding the news editor and journalist besides just me, the blogger, and the original scholars), and second, it breaks down valuable raw data into a journalist's tag lines and pull quotes.

Why isn't more of this available?

One issue is probably that lots of data is owned; it's gathered at some cost and the institute wants to sell it to people who can get some business use out of it. I'd like to see more intellectual property owners open up the access to at least some of their property. I get a lot out of the New York Times online even as a non-subscriber, and I'm glad that they show as much as they do, though I wish there wer more.

Leaving that aside, another issue is that it's hard to search for raw data. It's easy to search the web for a topic, but hard to specify that you're interested in a certain level of detail. Words like "data," "table" and "figures" aren't likely to help. Perhaps a search engine could specialize in recognizing stuff that looks like tabular numerical data, and offer a way to say you want that kind of stuff.

January 11, 2006

Browsers are [not] quick; more on functional GUIs

UPDATE: This was complete rot. My test code was buggy, which caused the flicker. I patched it up and my draggable list moves smoothly now. Jeremy reports that he once tried an app that would re-create an entire (small) DOM tree on every keypress, and it worked reasonably fast. How do browsers manage to redraw pages as smoothly as they do?

An observation: web browsers move very fast when executing calls that modify the DOM.

For example, suppose list is a list of items, of which x is one, and you delete the item, then immediathttp://homepages.inf.ed.ac.uk/s0456219/ely insert it in the same place:

y = x.nextSibling;
list.insertBefore(y, x);

This will typically cause the whole list to be redrawn in the intermediate state, so the list gets shorter for a fraction of a second before the element is reinserted.

However, if you delete an element and reinsert it in a different place, the browser seems to redraw it smoothly, as if the element just moved:

y = x.previousSibling;
list.insertBefore(y, x);

That seems to be a difference of behavior, which I'm at a loss to explain.

In any event, this seems to rule out the "whole component replacement" that I talked about before, in implementing a purely-functional web GUI.

How to work around this difficulty? I still like the scoped, conditionalized structure that I outlined in that post. I suppose we could still have an inner function return a complete component, and have the runtime system compare it with the existing state of the component, executing a minimal series of DOM instructions to make the two agree.

Another approach, which I'm prefering, is to have the inner function return a "delta"—an expression representing how the component should change, something like a list of DOM instructions. I wouldn't want to allow the instructions in such an object to depend on one another, so I'd like to cook up a set of "deltas" which would be mutually independent, if possible. This would amount to a new mini-language for manipulating DOM trees, one that had no notion of sequential execution.

January 10, 2006

On Not Designing XML languages

Tim Bray discusses what he calls the "big five" XML languages and argues that whatever you're trying to do, you probably don't need to design a new one, particularly if your application falls anywhere near one of these. I'm inclined to agree.

Language design is hard work, and new languages almost never get any uptake (cf. Esperanto). The same seems to be true for spoken languages, programming languages, and data-description languages.

Tim says that "any non-trivial" XML language is going to have constraints that can't be checked with existing schema languages. That sounds like an interesting problem. Is there a list of such constraints somewhere?

January 9, 2006

Query optimization limitations: projection

Another problem for Links' current query optimization algorithm. It should be possible to create a proper query out of something like this (supposing y is a value that might or might not be in the table one or more times):

for x <- t in
where x == y

Yet currently the algorithm requires the loop variable to be used only through projections (e.g. x.age), giving up if it sees it used raw.

To compile this into SQL we'd probably need to expand it to a fieldwise comparison (x.a == y.a && x.b == y.b && ...) but that should be no sweat.

The slightly deeper problem is that the current algorithm works by tracking which variables correspond to which table fields, rather than a more general approach which would associate variables with table rows or fields. This might be a modest fix or it might require more digging.

January 6, 2006

Haskell Activity

I'm relatively new to the Haskell community; before diving in this fall, I'd thought that it was probably a really innovative language with lots of neat features, but also suffering from quirks that prevent it from being widely used except by language nerds.

Instead, I'm impressed how much activity there is in the Haskell community, making real libraries like modular database abstraction layers, a distributed source-control system, and apps like an mp3 player, a diagram editor, and even, apparently, a start at a first-person shooter using OpenGL.

I'm leaving out all of its applications in interpreting and compiling languages, since that's its forte. It's quite good at those things, but these wider applications should make it interesting to programmers in general, as opposed to language nerds alone.

January 5, 2006

Reference Draggable List

Worked up a reference implementation of a draggable list in the raw JavaScript today. This reminded me of the language hurdles to doing such a simple thing, and gives me something that's as close to the functional style as I can presently imagine. It does use global data and mutable locations a bunch, so it will be fun to tear it apart and give a truly functional way of writing it. In all, it's about a 100 lines, though it's not terribly flexible, so it will want some smithy.

January 3, 2006

Better Comprehension Through Memo(r)ization

I've been thinking about how to deal with these problems optimizing query-backed comprehensions.

A reasonably general example of the problem is as follows. I include a join condition for conversation sake, but also keep in mind the possibility that the condition could be always-true.

for x <- s in
    let y = f(x) in
        for z <- t where x.id == z.id in
            (x, y, z)

First let's characterize the running time of this algorithm, as is. Let s be the size of the table s, let F(n) be the cost of running f on an input n. Also suppose that there is a cost q for each query, and that this is non-trivial since it involves marshalling data to and from an external process, which may be on another machine. Finally, let k be the average number of rows in t which match the join condition; i.e. that k such that the number of tuples in the result is sk. The above algorithm should run in time O(sF(n) + sk + sq).

Now how is it affected by optimization? We have a rewrite rule which would rewrite the above as

for (x, z) <- s * t where x.id == z.id in
    let y = f(x) in
        (x, y, z)

Which performs fewer queries but evaluates f(x) more times. This version should take time like O(skF(n)). In lots of cases, this is worse.

It might be possible to estimate these costs, yielding a heuristic optimization rule that the compiler would employ.

But if we know that f is pure, we can memoize it, and that would give a much better optimization.

Supposing we can do the memoization lookups in some kind of O(log m) time, this would give a program that runs in time upper-bounded at O(sF(n) + sk + s log m)); we've saved the O(sq) cost of doing multiple queries but we've incurred the cost of the lookups at O(s log m).

Tricky Query Optimizations

Optimizing language-integrated queries is trickier than it looks.

Besides the structural problem I talked about yesterday, there are cases where a naive "optimization" can easily make the result run much slower than the direct version. This will happen whenever some time-consuming operation occurs between an inner and an outer comprehension. Suppose t is a table with loads of big numbers in the number column; consider this Links algorithm:

for x <- t in
    factors = factors_of(x.number);  # (costly)
    for y <- s in
        (y, factors)

A naive optimizer would say, "I should combine these two queries rather than re-running the inner query over and over again." It would yield an algorithm that works like this:

for (x, y) <- (t, s) in
    factors = factors_of(x.number);  # (costly)
    (y, factors)

But this algorithm might evaluate factors_of many more times than the original, since there may be more than one pair (x, y) for any given value of x. The optimization improved performance with respect to the database, but worsened the performance of the algorithm itself.

January 2, 2006

Nesting into SQL

It's perfectly plain in Buneman et al.'s "Comprehension Syntax", but I didn't realize it until now: Links's list/bag/set comprehensions don't correspond to SQL. There are simple expressions, using only comprehensions and no fancy language features, that cannot be compiled directly into SQL, because SQL's relations are always flat, whereas the structures our comprehensions build can be nested. To wit:

for x <- t in
  [(x.name, for y <- s where y.t_id = x.id in [y.id])]

This should give us a list of pairs, the first of which is the name attribute of a thing in t and the second of which is a list of corresponding id values taken from s. But SQL has no structure directly corresponding to a "list of tuples of lists."

Currently, the Links interpreter will simply decline to optimize these comprehensions, leading to a single query against the table t and as many queries against s as there are rows in t. This is abysmal; it's distinctly worse than coding in PHP where at least you can write one efficient query.

I see two plausible solutions; one is to modify Links to bring it into line with SQL's model. I'm not too fond of this; ruling out nested structures across the board would be unacceptable, and creating a sub-language for queries, which contained only flat relations, would diminish the Links goal of unifying the algorithmic language with the query language.

Alternatively, and much better, we could do the join, and then post-process the results to group them into the proper structure. In this case, the query would be

select t.name, s.id from s, t where s.t_id = t.id order by t.name

And the post-processing might be

fun grouper(row, accum)
    if accum <> [] && row.name == hd(accum).name 
    then [{name = row.name; id = [row.id]++hd(accum).id}]++tl(accum)
    else [{name = row.name; id = [row.id]}]++accum
fold(grouper, [], results)

This just takes the records with a common name field and folds them into a single record, with all the corresponding id fields listed therein. I'm relying on the ordering by t.name in order to do this in linear time, but a more general/robust solution could probably work in n log n time, by keeping an indexed structure to look up the record into which any given row should be folded.

It will take a bit of tinkering to come up with the grouper for a given nested comprehension, in general. Some thoughts toward that:

In SQL, only the so-called aggregate functions (max, average, sum, count) may be applied to the values being grouped together—essentially, only functions that yield an atomic datum. The operation we need might be equivalent to a "group by" whose operation is list/bag/set construction. Looking at it this way might allow us to easily apply algorithms and transformations that already exist for SQL's "group by." But how to recognize what columns are being grouped by and which ones are being grouped together?

Well, a condition that makes the example problematic is that we have nested queries, and there is a "bare" piece of the top query: a value from t is used outside of the inner query. I think this condition implies the sort of nested structure I'm talking about. Furthermore, it is a smoking gun as to how we should structure the groupings. The variable used in the outer comprehension represents the one to group by, the other comprehension, in itself, becomes the aggregation operation.

This needs to be refined, of course, but I think it is a lead. Essentially, the idea is to take a nested comprehension like this:

for x <- t in
  [(x.name, for y <- s where y.t_id = x.id in [y.id])]

and convert it to a faux-SQL query like so:

select t.name, makebag(s.id) from s, t
where s.t_id = t.id group by t.name

where makebag( ) is the "aggregate function" that constructs a bag from its arguments. (In fact, this should fall out as a comprehension over the grouped elements.) The faux query comes apart into a real query and a post-processing step:

select t.name, s.id from s, t where s.t_id = t.id

group_by [name] makebag results

I can't see writing a general post-processor within Links, since it would need something like reflection, to dissect an arbitrary row on given fields. Let's use a Haskell-like meta-language notation. Suppose we want to group by some list of fields g and that project g is a function that projects those fields out to form a new record. We also have an aggregation function, f. The needed post-processor should behave like this (though more efficient algorithms should be used):

group_by g f rows = map (onSecond f) (collate g rows)
    where collate g list = [let ys = whereEq (project g) x list) in
                                (x, map (project g') ys)
                            | x <- uniq (map (project g) list)]
              where g' = complement g       -- fudging here a bit
          whereEq f x = filter (\y -> f y == x)
         onSecond f (x, y) = (x, f y)

I aim to set all this in a more gentle theoretical framework. I'd like to have a formal definition of the NRC, SQL, and our query-optimizable language; then we can give the translation between them, and then it's a simple matter of programming. Shouldn't be too hard.