### Better Comprehension Through Memo(r)ization

I've been thinking about how to deal with these problems optimizing query-backed comprehensions.

A reasonably general example of the problem is as follows. I include a join condition for conversation sake, but also keep in mind the possibility that the condition could be always-true.

``````for x <- s in
let y = f(x) in
for z <- t where x.id == z.id in
(x, y, z)
``````

First let's characterize the running time of this algorithm, as is. Let s be the size of the table `s`, let F(n) be the cost of running `f` on an input `n`. Also suppose that there is a cost q for each query, and that this is non-trivial since it involves marshalling data to and from an external process, which may be on another machine. Finally, let k be the average number of rows in t which match the join condition; i.e. that k such that the number of tuples in the result is sk. The above algorithm should run in time O(sF(n) + sk + sq).

Now how is it affected by optimization? We have a rewrite rule which would rewrite the above as

``````for (x, z) <- s * t where x.id == z.id in
let y = f(x) in
(x, y, z)
``````

Which performs fewer queries but evaluates `f(x)` more times. This version should take time like O(skF(n)). In lots of cases, this is worse.

It might be possible to estimate these costs, yielding a heuristic optimization rule that the compiler would employ.

But if we know that `f` is pure, we can memoize it, and that would give a much better optimization.

Supposing we can do the memoization lookups in some kind of O(log m) time, this would give a program that runs in time upper-bounded at O(sF(n) + sk + s log m)); we've saved the O(sq) cost of doing multiple queries but we've incurred the cost of the lookups at O(s log m).