
Against Databases

It is common wisdom in web-programming circles that a web application must be backed by a database. Sometimes the application fundamentally relies on a database per se: for example, flight-booking web sites interface with a database of flight data that is useful as a database even outside of the web context.

Other times an RDBMS is employed simply as a no-think, heavyweight solution to the myriad problems that come up in engineering for the web.

For example, the tricky problem of sharing data between concurrent web requests (call this one the "sharing" problem) is often handled by creating short-lived rows in a database table. To see why this is heavyweight, consider this: it may be that at any given time only two processes need to exchange a particular piece of data, and only fleetingly, but the RDBMS insists on writing it to disk and indexing it into a table to ensure full ACID properties.

For another: some data is generated on one web request and is needed by a subsequent request from the same client. Since the client can connect to any one of many identical web servers, and since even the same server could be stopped and started between requests, it would be no good to store this data in local memory. Call this the "persistence" problem in web engineering. Conventionally, web programmers just keep a table for such data, indexed, for example, by a user's session key.

In both of these cases, an RDBMS is a heavyweight solution. A few of the costs incurred by using an RDBMS include:

  1. the operational overhead of guaranteeing durability when it is not absolutely required,
  2. the awkwardness of writing relational queries for data that is not inherently relational,
  3. the possibility of locks (e.g. on indexes) causing undue delays,
  4. the need to keep the DB schema in sync with what the application needs.

Oftentimes, you see, the data that needs to be stored is not fundamentally relational; more often its character is like that of some other data structure that one uses in programming: a list, a tree, a polymorphic record, a graph.
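To make the mismatch concrete, here is a minimal sketch in Python (chosen only for illustration; the post names no language). It takes a small comment thread, which is naturally a tree, and stores it in a hypothetical `node` table using the standard-library `sqlite3` module. The table layout and every name in the code are assumptions, not anyone's actual schema.

```python
import sqlite3

# A comment thread as the program naturally sees it: a tree of nested dicts.
thread = {
    "text": "root",
    "replies": [
        {"text": "a", "replies": [{"text": "a1", "replies": []}]},
        {"text": "b", "replies": []},
    ],
}

def save_tree(conn, node, parent_id=None):
    """Flatten the tree into (id, parent_id, text) rows, one row per node."""
    cur = conn.execute(
        "INSERT INTO node (parent_id, text) VALUES (?, ?)",
        (parent_id, node["text"]),
    )
    node_id = cur.lastrowid
    for child in node["replies"]:
        save_tree(conn, child, node_id)
    return node_id

def load_tree(conn, node_id):
    """Rebuild the nested structure; this issues two queries per node."""
    (text,) = conn.execute(
        "SELECT text FROM node WHERE id = ?", (node_id,)
    ).fetchone()
    child_ids = conn.execute(
        "SELECT id FROM node WHERE parent_id = ? ORDER BY id", (node_id,)
    ).fetchall()
    return {"text": text, "replies": [load_tree(conn, cid) for (cid,) in child_ids]}

# Hypothetical schema: an adjacency list, the usual relational encoding of a tree.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, text TEXT)"
)
root_id = save_tree(conn, thread)
restored = load_tree(conn, root_id)
```

The round trip works, but the program has to carry two translation functions and a schema purely to get back the structure it started with.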

Really big web operations (e.g. Amazon, Google, LiveJournal) always end up building a lighter-weight system to provide persistence and sharing. The engineers of these systems all discovered, upon reaching a certain scale, that the conventional RDBMS leads to problems with slowness & contention, not to mention the impediments of converting data to a relational form. At the same time, the strong guarantees of an RDBMS are often unnecessary.

Amazon has built a middle-ware infrastructure for communications & caching, and for its massive catalog data uses lightweight, read-only B-Trees. Google is building a huge, fast, highly-distributed map service. LiveJournal has built a distributed lightweight cache. Yahoo! Stores, to judge by Paul Graham's quote, never used an RDBMS in the first place. The fact that so many of the biggest systems on the web are building alternatives is strong evidence that alternatives are needed.

My proposal: a good web language ought to support easy, flexible sharing and easy, flexible persistence of that language's own data objects, without forcing a conversion to a different form.
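As a rough illustration of what such persistence could look like, here is a minimal sketch in Python that keeps each object in its own file, keyed by something like a session key. `FileStore`, its methods, and the sample data are hypothetical names invented for this example. It assumes keys are safe to use as file names and that the store directory is on a file system every web server can reach.

```python
import os
import pickle
import tempfile

class FileStore:
    """Hypothetical store: one pickled object per key, one file per object."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Assumption: keys are safe file names (e.g. hex session keys).
        return os.path.join(self.directory, key + ".pickle")

    def put(self, key, value):
        # Write to a temporary file in the same directory, then rename it
        # over the target, so a reader never sees a half-written file.
        fd, tmp = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(value, f)
        os.replace(tmp, self._path(key))

    def get(self, key, default=None):
        try:
            with open(self._path(key), "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return default

# The object is stored as-is: nested lists, tuples and None survive unchanged.
store = FileStore(tempfile.mkdtemp())
cart = {"user": "alice", "items": [("book", 2), ("lamp", 1)], "coupon": None}
store.put("3f9a", cart)
```

A later request, possibly on a different server, would call `store.get("3f9a")` and receive an equal object with no schema and no query. Two caveats: `pickle` must only be used on files the application itself wrote, and the rename in `put` is atomic on a local POSIX file system but may behave differently on some network file systems.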

What database did you use?

We didn't use one. We just stored everything in files. The Unix file system is pretty good at not losing your data, especially if you put the files on a Netapp.

It is a common mistake to think of Web-based apps as interfaces to databases. Desktop apps aren't just interfaces to databases; why should Web-based apps be any different? The hard part is not where you store the data, but what the software does.

Viaweb FAQ, Paul Graham

Comments

Another example against databases: Francesco Cesarini has implemented an Erlang system used to handle text messages sent to a reality TV show. They use an in-memory database (Erlang's Mnesia) with no transaction properties, because full ACID transactions would be too slow.

Links should be good for tackling this sort of thing. We should be able to specify a number of different data sources, one of which is SQL tables, and access all of these identically.

I was thinking that our first example of this should be to support both SQL and XQuery. Ezra, maybe you want to suggest some other, more lightweight source? Perhaps the file system, as in Viaweb?

I think the filesystem is a good starting point. It's easier than XQuery, for example.
