Suppose you want a Patricia tree that has good locality of reference. Why might you want this? Maybe you have a trillion items in your tree, and thus a trillion tree nodes, each containing three pointers and a count. Each pointer, in the general case, must be 40 bits long, so you have 120 bits plus the count; say 128 bits, 16 bytes. So your tree is 16 terabytes. That probably means you’re going to store it on disk, unless the query rate is extremely high. But a lookup in this tree will involve accessing 40 nodes, on average, and possibly more, plus an access to wherever the key is stored at the end. If every separate access involves a separate disk access, at, say, 10 ms each, this will be slow, like 400 milliseconds.

Suppose, instead, you divide the tree into “bubbles” or “pages” of, say, one mebibyte, each containing a subtree rooted at a particular node. There are only about sixteen million such bubbles, so you can represent a pointer to the root node of any one of them in 24 bits. And each bubble can surely contain at most 8 million nodes (one per bit), and probably far fewer; but it can contain at least 65536 (one per 16 bytes). So you can probably get by with, say, 18-bit pointers for nodes within a bubble, and 24-bit pointers to other bubbles. (The idea is that the first N pointer values in the bubble are allocated to nodes, and the next N to external pointers, so that each internal pointer can point either to an external pointer or to another node in the same bubble. A bubble may ultimately have just as many pointers to other bubbles as it has nodes inside of it.)

So, for each node, you need 3×18 = 54 bits of internal pointers, 24 bits for its external-pointer slot, and, say, 8 bits for the count field. All in all that’s 86 bits; call it 11 bytes. That gives room for 95325 nodes and 95325 external bubble pointers in a bubble. If the bubbles grow randomly, then a path through a bubble will consume about 17 bits of entropy from the key, so an average path will traverse two or three bubbles, taking thus 25 ms instead of 400. Better yet, the root bubble will surely be in RAM, perhaps even in L2 cache, so it’s two or, usually, just one disk access instead of two or three. This approach, then, increases query throughput by a factor of 20 to 40. As an added bonus, you only need 11 bytes instead of 16 per node, so the database has been reduced from 16 terabytes to 11!

The size of one mebibyte was chosen to be close to the bandwidth-latency product of spinning-rust disks. The bandwidth-latency product of main memory seems to be around 40 bytes. Can we make something bubble-like on a scale close to 40 bytes and get savings on CPU as well as on disk access? We can already fit almost four 11-byte nodes in there. But suppose, instead, that we use the binary-heap approach and make our bubbles out of “supernodes” of, say, seven branchings and eight outgoing pointers. Each branching needs only a count field (say, 8 bits) and a data pointer (say, 18 bits); then you need eight 18-bit pointers to other supernodes or to bubble-external pointers, a total of 326 bits, or 41 bytes. Inserting into a supernode might involve shoving three branchings over, to the left or the right, which is an acceptable cost, particularly considering that it’s all in L1D. It might also involve allocating a new supernode and splitting the old one in two. For that reason, we shouldn’t expect our supernodes to actually be much more than about half full. So what does a bubble of supernodes look like? Each supernode also carries seven external pointers, 168 bits in all, for a total of 494 bits, 62 bytes.
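Here’s a minimal sketch in C of what such a bubble might look like. The names are all mine, and the field widths (18-bit internal pointers, 24-bit bubble numbers, 8-bit counts) are just the estimates above; the fields are left unpacked for clarity, so a real implementation would bit-pack the nodes to hit the 11-byte figure. The point is the pointer-resolution rule: values below N name nodes, values from N up name external-pointer slots.

    #include <stdint.h>

    enum { MAX_NODES = 95325 };

    typedef struct {
        uint32_t left, right;   /* 18-bit internal pointers */
        uint32_t data;          /* 18-bit data pointer (but see the XXX below) */
        uint8_t  count;         /* Patricia skip count */
    } node;

    typedef struct {
        uint32_t n_nodes;             /* N: pointer values 0..N-1 are nodes */
        node     nodes[MAX_NODES];
        uint32_t external[MAX_NODES]; /* values N..2N-1: 24-bit bubble numbers */
    } bubble;

    /* Resolve an internal pointer: small values name nodes in this
       bubble; larger ones name slots in the external-pointer table,
       i.e. edges leading out of the bubble. */
    typedef struct {
        int is_external;
        union { node *n; uint32_t bubble_id; } u;
    } edge;

    static edge follow(bubble *b, uint32_t p)
    {
        edge e;
        if (p < b->n_nodes) {
            e.is_external = 0;
            e.u.n = &b->nodes[p];
        } else {
            e.is_external = 1;
            e.u.bubble_id = b->external[p - b->n_nodes];
        }
        return e;
    }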
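And here, under the same caveats (hypothetical names, unpacked fields), is what a supernode might look like, together with the binary-heap descent: the seven branchings are stored in heap order, so the children of branching i live at 2i+1 and 2i+2, and indices 7 through 14 stand for the eight outgoing pointers. This assumes a full supernode; a half-full one would need some way of marking empty branching slots, which I’ve left out, and keybit(key, i) is assumed to return bit i of the key as 0 or 1.

    #include <stdint.h>

    /* Packed, the 7×8 count bits + 7×18 data bits + 8×18 outgoing bits
       + 7×24 external bits are the 494 bits (62 bytes, about one cache
       line) computed above. */
    typedef struct {
        uint8_t  count[7];    /* skip counts, branchings in heap order  */
        uint32_t data[7];     /* 18-bit data pointers (see the XXX below) */
        uint32_t child[8];    /* 18-bit pointers to other supernodes or
                                 to external-pointer slots              */
        uint32_t external[7]; /* this supernode's share of the bubble's
                                 24-bit external bubble pointers        */
    } supernode;

    /* Descend through one (full) supernode, consuming up to three bits
       of the key plus whatever the skip counts eat, and return the
       outgoing pointer chosen. */
    static uint32_t descend(const supernode *s, const uint8_t *key,
                            uint32_t *bitpos,
                            int keybit(const uint8_t *, uint32_t))
    {
        unsigned i = 0;                          /* root branching */
        while (i < 7) {
            *bitpos += s->count[i];              /* Patricia skip */
            i = 2*i + 1 + keybit(key, (*bitpos)++);
        }
        return s->child[i - 7];                  /* one of 8 outgoing edges */
    }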
At 62 bytes each, then, you have room for 16912 supernodes (689164 bytes of branchings and outgoing pointers) and their 118384 external pointers (355152 bytes) in a bubble, with a few kilobytes to spare. So this is only marginally better in its usage of space (20% or so), though you could perhaps cut the internal pointers down to 17 bits; that’s marginal, since 16912 supernodes plus 118384 external slots make 135296 possible pointer targets, just over the 131072 that 17 bits can address. XXX am I fucking up with this data pointer thing? It needs to be more than 18 or even 24 bits, doesn’t it? Presumably it either has to be wide enough to name any of a trillion records, or else an index into a per-bubble table of full-width data pointers, which would eat into the space savings. But for cache-line fills, it should be a big improvement: a path through a bubble previously needed about 17 cache-line fills, one per node; now it needs about 7, one per 62-byte supernode. (It’s much more likely to be disk-bound, though.) It looks like there has been quite a bit of work on this kind of thing already: the “HAT-trie”, for example, uses hash tables for the strings below a certain level of the trie.
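As a sanity check on the cost claims above, here is the arithmetic spelled out; every number is one of the assumptions from the text, nothing is measured.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double key_bits    = 40;           /* ~lg(1e12): average path length  */
        double bubble_bits = log2(95325);  /* ~16.5 bits resolved per bubble  */
        double super_bits  = 3;            /* bits resolved per full supernode */
        double seek_ms     = 10;           /* one spinning-disk access        */

        printf("bubbles per lookup:     %4.1f\n", key_bits / bubble_bits);
        printf("cache fills per bubble: %4.1f\n", bubble_bits / super_bits);
        printf("old cost: %3.0f ms  new: %2.0f ms\n",
               (key_bits + 1) * seek_ms,                /* ~410: 40 nodes + data  */
               (key_bits / bubble_bits) * seek_ms);     /* ~24: root bubble free,
                                                           2-3 bubbles + data    */
        return 0;
    }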