### Better Comprehension Through Memo(r)ization

I've been thinking about how to deal with these problems when optimizing query-backed comprehensions.

A reasonably general example of the problem is as follows. I include a join condition for conversation's sake, but keep in mind the possibility that the condition could be always-true.

```
for x <- s in
let y = f(x) in
for z <- t where x.id == z.id in
(x, y, z)
```

First let's characterize the running time of this algorithm, as is. Let *s* be the size of the table `s`, and let F(n) be the cost of running `f` on an input `n`. Also suppose that there is a cost *q* for each query, and that this cost is non-trivial, since each query involves marshalling data to and from an external process, which may be on another machine. Finally, let *k* be the average number of rows in `t` which match the join condition; i.e. the *k* such that the number of tuples in the result is *sk*. The above algorithm should run in time *O(sF(n) + sk + sq)*.
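To make those terms concrete, here's a rough Python model of the naive evaluation strategy. The `query` helper, the SQL string, and the dict-style rows are all stand-ins I've made up for illustration, not part of the original language:

```
# Rough model of the naive evaluation: one query round trip per row of s.
def nested_plan(s_rows, query, f):
    results = []
    for x in s_rows:                      # s iterations
        y = f(x)                          # s calls to f: the O(sF(n)) term
        # one round trip per outer row: the O(sq) term
        for z in query("SELECT * FROM t WHERE t.id = :id", x["id"]):
            results.append((x, y, z))     # sk tuples overall: the O(sk) term
    return results
```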

Now how is it affected by optimization? We have a rewrite rule which would rewrite the above as

```
for (x, z) <- s * t where x.id == z.id in
let y = f(x) in
(x, y, z)
```

This version performs fewer queries but evaluates `f(x)` more times; it should take time like *O(skF(n))*. In lots of cases, this is worse.
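The same kind of rough Python model, with the same stand-in `query` helper, shows where the extra evaluations of `f` come from:

```
# Rough model of the rewritten evaluation: a single joined query, but f now
# runs once per output tuple instead of once per row of s.
def joined_plan(query, f):
    results = []
    for x, z in query("SELECT * FROM s JOIN t ON s.id = t.id"):  # one round trip: O(q)
        y = f(x)                          # sk calls to f: the O(skF(n)) term
        results.append((x, y, z))
    return results
```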

It might be possible to estimate these costs, yielding a heuristic optimization rule that the compiler would employ.
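Purely as an illustration of the shape such a rule might take (estimating *s*, *k*, *q* and the cost of `f` is exactly the hard part, and these names are mine, not the compiler's):

```
# Hypothetical cost comparison a compiler might consult before rewriting.
# All of the inputs would have to come from table statistics and profiling.
def should_rewrite(s, k, q, cost_f):
    nested_cost = s * cost_f + s * k + s * q   # O(sF(n) + sk + sq)
    joined_cost = q + s * k * cost_f + s * k   # O(skF(n))
    return joined_cost < nested_cost
```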

But if we know that `f` is pure, we can memoize it, and that would give a much better optimization.

Supposing we can do the memoization lookups in some kind of *O*(log *m*) time, this would give a program that runs in time upper-bounded by *O(sF(n) + sk + s log m)*; we've saved the *O(sq)* cost of doing multiple queries but we've incurred the cost of the lookups at *O(s log m)*.
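Here's a minimal sketch of the memoized version in Python. It uses a plain hash table rather than the log-time structure assumed above, and it keys the cache on `x.id`, which is an assumption about what identifies an argument to `f`:

```
# Minimal memoization sketch, assuming f is pure and that x's id identifies
# its argument (both are assumptions for illustration).
def memoize(f):
    cache = {}
    def memoized(x):
        key = x["id"]
        if key not in cache:
            cache[key] = f(x)     # at most one genuine evaluation per distinct x
        return cache[key]         # otherwise a cheap lookup
    return memoized

def joined_plan_memoized(query, f):
    mf = memoize(f)
    return [(x, mf(x), z)
            for x, z in query("SELECT * FROM s JOIN t ON s.id = t.id")]
```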