Suppose you want a Patricia tree that has good locality of reference. Why might you want this? Maybe you have a trillion items in your tree, and thus a trillion tree nodes, each containing three pointers and a count. Each pointer, in the general case, must be 40 bits long, so you have 120 bits plus the count; say 128 bits, 16 bytes. So your tree is 16 terabytes. That probably means you’re going to store it on disk, unless the query rate is extremely high. But a lookup in this tree will involve accessing 40 nodes, on average, and possibly more, plus an access to wherever the key is stored at the end. If every separate access involves a separate disk access, at, say, 10 ms each, this will be slow, like 400 milliseconds.

Suppose, instead, you divide the tree into “bubbles” or “pages” of, say, one mebibyte, each containing a subtree rooted at a particular node. There are only about sixteen million such bubbles, so you can represent a pointer to the root node of any one of them in 24 bits. And each bubble can surely contain at most 8 million nodes (one per bit), and probably far fewer; but it can contain at least 65536 (one per 16 bytes). So you can probably get by with, say, 18-bit pointers for nodes within a bubble, and 24-bit pointers to other bubbles. (The idea is that the first N pointer values in the bubble are allocated to nodes, and the next N to external pointers, so that each internal pointer can point either to an external pointer or to another node in the same bubble. A bubble may ultimately have just as many pointers to other bubbles as it has nodes inside of it.)

So, for each node, you need 3×18 = 54 bits of internal pointers, 24 bits for its external-pointer slot, and, say, 8 bits for the count field. All in all that’s 86 bits; call it 11 bytes. That gives room for 95325 nodes and 95325 external bubble pointers in a bubble. If the bubbles grow randomly, then a path through a bubble will consume about 17 bits of entropy from the key, so an average path will traverse two or three bubbles, taking thus 25 ms instead of 400. Better yet, the root bubble will surely be in RAM, perhaps even in L2 cache, so it’s two or, usually, just one disk access instead of two or three. This approach, then, increases query throughput by a factor of 20 to 40. As an added bonus, you only need 11 bytes instead of 16 per node, so the database has been reduced from 16 terabytes to 11!

The size of one mebibyte was chosen to be close to the bandwidth-latency product of spinning-rust disks. The bandwidth-latency product of main memory seems to be around 40 bytes. Can we make something bubble-like on a scale close to 40 bytes and get savings on CPU as well as on disk access? We can already fit almost four 11-byte nodes in there. But suppose, instead, that we use the binary-heap approach and make our bubbles out of “supernodes” of, say, seven branchings and eight outgoing pointers. Each branching needs only a count field (say, 8 bits) and a data pointer (say, 18 bits); then you need eight 18-bit pointers to other supernodes or to bubble-external pointers, a total of 326 bits, or 41 bytes. Inserting into a supernode might involve shoving three branchings over, to the left or the right, which is an acceptable cost, particularly considering that it’s all in L1D. It might also involve allocating a new supernode and splitting the old one in two. For that reason, we shouldn’t expect our supernodes to actually be much more than about half full. So what does a bubble of supernodes look like? Each supernode also carries seven external pointers, 168 bits in all, for a total of 494 bits, 62 bytes.
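Here’s a minimal sketch in C of what such a bubble might look like. The names are all mine, and the field widths (18-bit internal pointers, 24-bit bubble numbers, 8-bit counts) are just the estimates above; the fields are left unpacked for clarity, so a real implementation would bit-pack the nodes to hit the 11-byte figure. The point is the pointer-resolution rule: values below N name nodes, values from N up name external-pointer slots.

    #include <stdint.h>

    enum { MAX_NODES = 95325 };

    typedef struct {
        uint32_t left, right;   /* 18-bit internal pointers */
        uint32_t data;          /* 18-bit data pointer (but see the XXX below) */
        uint8_t  count;         /* Patricia skip count */
    } node;

    typedef struct {
        uint32_t n_nodes;             /* N: pointer values 0..N-1 are nodes */
        node     nodes[MAX_NODES];
        uint32_t external[MAX_NODES]; /* values N..2N-1: 24-bit bubble numbers */
    } bubble;

    /* Resolve an internal pointer: small values name nodes in this
       bubble; larger ones name slots in the external-pointer table,
       i.e. edges leading out of the bubble. */
    typedef struct {
        int is_external;
        union { node *n; uint32_t bubble_id; } u;
    } edge;

    static edge follow(bubble *b, uint32_t p)
    {
        edge e;
        if (p < b->n_nodes) {
            e.is_external = 0;
            e.u.n = &b->nodes[p];
        } else {
            e.is_external = 1;
            e.u.bubble_id = b->external[p - b->n_nodes];
        }
        return e;
    }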
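And here, under the same caveats (hypothetical names, unpacked fields), is what a supernode might look like, together with the binary-heap descent: the seven branchings are stored in heap order, so the children of branching i live at 2i+1 and 2i+2, and indices 7 through 14 stand for the eight outgoing pointers. This assumes a full supernode; a half-full one would need some way of marking empty branching slots, which I’ve left out, and keybit(key, i) is assumed to return bit i of the key as 0 or 1.

    #include <stdint.h>

    /* Packed, the 7×8 count bits + 7×18 data bits + 8×18 outgoing bits
       + 7×24 external bits are the 494 bits (62 bytes, about one cache
       line) computed above. */
    typedef struct {
        uint8_t  count[7];    /* skip counts, branchings in heap order  */
        uint32_t data[7];     /* 18-bit data pointers (see the XXX below) */
        uint32_t child[8];    /* 18-bit pointers to other supernodes or
                                 to external-pointer slots              */
        uint32_t external[7]; /* this supernode's share of the bubble's
                                 24-bit external bubble pointers        */
    } supernode;

    /* Descend through one (full) supernode, consuming up to three bits
       of the key plus whatever the skip counts eat, and return the
       outgoing pointer chosen. */
    static uint32_t descend(const supernode *s, const uint8_t *key,
                            uint32_t *bitpos,
                            int keybit(const uint8_t *, uint32_t))
    {
        unsigned i = 0;                          /* root branching */
        while (i < 7) {
            *bitpos += s->count[i];              /* Patricia skip */
            i = 2*i + 1 + keybit(key, (*bitpos)++);
        }
        return s->child[i - 7];                  /* one of 8 outgoing edges */
    }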
At 62 bytes each, then, you have room for 16912 supernodes (689164 bytes of branchings and outgoing pointers) and their 118384 external pointers (355152 bytes) in a bubble, with a few kilobytes to spare. So this is only marginally better in its usage of space (20% or so), though you could perhaps cut the internal pointers down to 17 bits; that’s marginal, since 16912 supernodes plus 118384 external slots make 135296 possible pointer targets, just over the 131072 that 17 bits can address. XXX am I fucking up with this data pointer thing? It needs to be more than 18 or even 24 bits, doesn’t it? Presumably it either has to be wide enough to name any of a trillion records, or else an index into a per-bubble table of full-width data pointers, which would eat into the space savings. But for cache-line fills, it should be a big improvement: a path through a bubble previously needed about 17 cache-line fills, one per node; now it needs about 7, one per 62-byte supernode. (It’s much more likely to be disk-bound, though.) It looks like there has been quite a bit of work on this kind of thing already: the “HAT-trie”, for example, uses hash tables for the strings below a certain level of the trie.
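As a sanity check on the cost claims above, here is the arithmetic spelled out; every number is one of the assumptions from the text, nothing is measured.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double key_bits    = 40;           /* ~lg(1e12): average path length  */
        double bubble_bits = log2(95325);  /* ~16.5 bits resolved per bubble  */
        double super_bits  = 3;            /* bits resolved per full supernode */
        double seek_ms     = 10;           /* one spinning-disk access        */

        printf("bubbles per lookup:     %4.1f\n", key_bits / bubble_bits);
        printf("cache fills per bubble: %4.1f\n", bubble_bits / super_bits);
        printf("old cost: %3.0f ms  new: %2.0f ms\n",
               (key_bits + 1) * seek_ms,                /* ~410: 40 nodes + data  */
               (key_bits / bubble_bits) * seek_ms);     /* ~24: root bubble free,
                                                           2-3 bubbles + data    */
        return 0;
    }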