
Better Comprehension Through Memo(r)ization

I've been thinking about how to deal with these problems in optimizing query-backed comprehensions.

A reasonably general example of the problem is as follows. I include a join condition for conversation's sake, but keep in mind the possibility that the condition could be always-true.

for x <- s in
    let y = f(x) in
        for z <- t where x.id == z.id in
            (x, y, z)

First, let's characterize the running time of this algorithm as it stands. Let s be the size of the table s (overloading the name), and let F(n) be the cost of running f on an input n. Also suppose that there is a cost q for each query, and that this is non-trivial, since it involves marshalling data to and from an external process, which may be on another machine. Finally, let k be the average number of rows in t which match the join condition, i.e. the k such that the number of tuples in the result is sk. The above algorithm should run in time O(sF(n) + sk + sq).
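
To make that accounting concrete, here is a minimal Python sketch of the nested version, assuming the tables are just lists of named tuples and that a hypothetical query function stands in for the round trip to the database (each call to it is where the cost q is paid):

from collections import namedtuple

SRow = namedtuple("SRow", ["id", "payload"])
TRow = namedtuple("TRow", ["id", "detail"])

def query(table, pred):
    # Placeholder for shipping a query to an external process; each call costs q.
    return [row for row in table if pred(row)]

def nested(s, t, f):
    out = []
    for x in s:                                      # s iterations
        y = f(x)                                     # f evaluated once per row of s: s * F(n)
        for z in query(t, lambda z: z.id == x.id):   # one query per row of s: s * q
            out.append((x, y, z))                    # s * k result tuples overall
    return out

For example, nested([SRow(1, "a")], [TRow(1, "x"), TRow(1, "y")], lambda x: x.payload.upper()) yields two tuples while evaluating f only once.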

Now, how is it affected by optimization? We have a rewrite rule that would transform the above into

for (x, z) <- s * t where x.id == z.id in
    let y = f(x) in
        (x, y, z)

This performs fewer queries (just one, in fact) but evaluates f(x) more times: once per result tuple, i.e. sk times rather than s. This version should take time like O(skF(n) + q). In lots of cases, this is worse.
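
Under the same assumptions as the earlier sketch, the rewritten form looks like this; the join is fetched in one query, but f is applied once per result tuple:

def joined(s, t, f):
    # Stands in for a single query over the join s * t: q is paid once.
    pairs = [(x, z) for x in s for z in t if x.id == z.id]
    return [(x, f(x), z) for (x, z) in pairs]        # f evaluated s * k times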

It might be possible to estimate these costs, yielding a heuristic optimization rule that the compiler would employ.
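
A toy version of such a rule, assuming estimates for F(n), q and k were somehow available (all of the names below are made up for illustration), would just compare the two estimated totals:

def prefer_joined(s_size, k_estimate, f_cost, q_cost):
    nested_cost = s_size * (f_cost + q_cost + k_estimate)   # s F(n) + s q + s k
    joined_cost = s_size * k_estimate * f_cost               # s k F(n)
    return joined_cost < nested_cost

Whether trustworthy estimates are actually obtainable is, of course, the real question.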

But if we know that f is pure, we can memoize it, and that would give a much better optimization.
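
A sketch of what that memoization could look like, assuming the argument to f is hashable (as a named-tuple row would be); a plain dict cache is enough, and functools.lru_cache would serve the same purpose:

def memoize(f):
    cache = {}
    def memo_f(x):
        if x not in cache:
            cache[x] = f(x)      # at most one real evaluation per distinct argument
        return cache[x]          # every other call is just a lookup
    return memo_f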

Supposing we can do the memoization lookups in some kind of O(log m) time, where m is the number of distinct arguments cached (at most s here), this would give a program that runs in time O(sF(n) + sk log m + q): f is actually evaluated at most s times, and each of the sk result tuples pays for one lookup. We've saved the O(sq) cost of issuing a query per row of s, but we've incurred the cost of the lookups at O(sk log m).
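
Putting the pieces together, again only as a sketch under the assumptions above (rows as named tuples, memoize as defined earlier): one query, at most s real evaluations of f, and one cache lookup per result tuple.

def joined_memo(s, t, f):
    mf = memoize(f)                                          # fresh cache per comprehension
    pairs = [(x, z) for x in s for z in t if x.id == z.id]   # one query's worth of data
    return [(x, mf(x), z) for (x, z) in pairs]               # s * k lookups, at most s evaluations of f

A hash-table cache like this actually gives expected O(1) lookups rather than O(log m), which only improves the bound.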