This page last updated 1997-11-01. This paragraph added 1999-02-26.
Big databases are a big challenge to fund and manage. In the normal business instance, a big database is accountable to a single person or group of people, usually the CEO of a corporation. In this case, it is usually best -- in terms of time and material expenses, and in terms of risk -- to centralize the database in one place, under a single control structure, and under a single budget item.
However, in the real world, there are often needs for databases requiring great time and material expenditures but without enough benefit for any single potential user to justify their construction. A comprehensive WWW search engine is an example; even AltaVista is falling farther and farther behind the current state of the Web, and their query engine is currently running on ``a set of seven AlphaServer 8400 5/300 systems, each with ten processors, 6 GB of RAM, and 210 GB of hard disk in a RAID array.'', while their spider runs on ``an AlphaServer 4100 5/300, with 1.5 GB memory and 30 GB RAID of hard disk space.'', and indexes six million pages a day. (See http://altavista.digital.com/av/content/about_our_technology_2.htm.) I estimated the list-price cost of this equipment; it came to $5 million or so, and that's not including any of the networking equipment, or the other five AltaVista sites in the world.
For the first time in history, grassroots parallel-processing efforts have become feasible. Some of the greatest computational challenges have been dealt with this way; RSA-129 in 1994, and the RSA DES challenge in 1997 were successfully attacked by this method.
There are a lot of spare CPU cycles and disk space around the Internet. It would be nice if we could put together a sort of Vista Populi, distributing the computation of query results, the storage of indexed pages, and the spidering of new pages around the Internet.
There are a certain number of contributing nodes around the world. Each node can afford to devote a small, non-negotiable amount of resources to this task. The nodes will not have much reliable connectivity to one another.
The resources that are expected to limit this effort are:
For this thing to work, it has to scale well. That means that if we double the size of the index, double the number of queries per second, and quadruple the number of nodes in the effort, the response time should stay the same, and the load on individual contributing nodes should stay the same.
There is one technique for scalable distributed databases that I know about. It's the technique used in DNS, enabling millions of retrievals per second from a worldwide database with millions of records. This is the technique of hierarchical keys. The keyspace is subdivided hierarchically; a single set of interchangeable authoritative servers are responsible for all the keys with a certain prefix. Either they know the data for any given key in this zone, or they know of other servers that are authoritative for some longer prefix of that key, and refer the querier to those other servers.
This makes it possible to store an index for an arbitrarily large keyspace, with a bounded amount of information needed on each server, and it also makes it possible to restructure part of the index, or add and delete keys, by changing data on a small number of servers.
In DNS, the keyspace is structured so that there are natural break points between the levels of the hierarchy. If we're constructing an index for full-text searching, we do not have this option. We have to accept the keys as they are given.
Hierarchical keyspaces by themselves don't give us everything we need. We also need the root servers not to get pounded to death; for this we need hierarchical caching, which is more difficult. The alternative is to double the number of root servers with every doubling of the query load.
The idea behind hierarchical caching is to limit the number of queries to places high in the authority tree. It also, as a beneficial side effect, reduces the effect of repeated queries for the same index entry -- something that's tremendously important in DNS, but might not be so important here.
For hierarchical caching to be effective in reducing the load on root and near-root servers, the branching factor of the keyspace needs to be high near the root. The alternative is the current situation with DNS: there's a very high branching factor for .com., and so there's a very low chance that some randomly chosen subdomain of .com. will already be in your local nameserver's cache, and so you usually have to consult a root nameserver to resolve any
How do we do compound queries? As an example, the query `+criteria +autism +eye-contact' on AltaVista finds ten thousand occurrences of ``eye contact'', 31420 occurrences of ``autism'', and 602214 occurrences of ``criteria''. In this proposed distributed scheme, where the servers that know where to find ``autism'' are not likely to be the same servers that know where to find ``criteria'', how do we find the intersection of the two? It is impractical to download 602,214 URLs -- even compressed -- and I don't know how else to solve this, let alone rank them in order of relevance. Yet, on AltaVista, the query only returns 32 documents, of which the tenth was the one I wanted.