// A memory-tiny associative array.
// The idea is that you store a series of sorted, binary-searchable
// table segments of exponentially decreasing sizes. New entries
// always go into the smallest table. When a table segment gets too
// big, you merge it with larger tables.
// The exponential decrease factor is an adjustable tradeoff between
// insertion time and lookup time. Here I’ll pretend it’s always 2.
// The average time per successful lookup is a little less than 2(lg N
// - 1) probes. The biggest segment contains half the items, so half
// the lookups terminate there; it requires lg N - 1 probes in the
// worst case, and almost that in the normal case. The next 25% of the
// lookups terminate in the second-biggest segment, needing an
// additional lg N - 2 probes there. The next 12.5% terminate in the
// third-biggest, needing an additional lg N - 3 probes, or 3 lg N - 6
// probes in all. The infinite series 1/2 + 1/4 * 2 + 1/8 * 3 + ...
// sums to exactly 2 (any finite prefix to a bit less); if we ignore
// the -1, -2, -3, etc., which only become significant improvements
// once we get to the very small tables, we would have 2 lg N. So if
// you have 64k entries, it’s about 30 probes on average, or a little
// less.
// The time per unsuccessful lookup is worse: you need about (lg N -
// 1)^2/2 probes, because you always have to search all the
// segments. So if you have 64k entries, it’s about 112 probes. This
// is also the worst case for successful lookup.
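// The lookup half can be sketched very briefly. Nothing below is the
// real tinytable API; it assumes each segment is just a sorted int
// array, searched biggest-first:

```c
#include <stddef.h>

/* Hypothetical segment: a sorted array of keys. */
typedef struct { const int *keys; size_t len; } seg;

/* Ordinary binary search; returns 1 iff key is present in s. */
static int seg_contains(const seg *s, int key) {
  size_t lo = 0, hi = s->len;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (s->keys[mid] < key) lo = mid + 1;
    else if (s->keys[mid] > key) hi = mid;
    else return 1;
  }
  return 0;
}

/* Search every segment, biggest first; half of all successful
   lookups stop in the first (biggest) segment, a quarter in the
   second, and so on, giving the ~2 lg N average described above. */
int table_contains(const seg *segs, size_t nsegs, int key) {
  for (size_t i = 0; i < nsegs; i++)
    if (seg_contains(&segs[i], key)) return 1;
  return 0;
}
```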
// Inserting an item without segment merging is very quick: you just
// allocate space by bumping a pointer and write a single-entry
// segment into place. But you do need to merge segments. Rather than
// summing the finite series, let’s just point out that each of the N
// elements will need to go through lg N merges to reach its final
// resting place after being inserted, and most of them will reach it
// or almost reach it; so inserting requires lg N steps of merging on
// average. For an element to go through a merge, it undergoes on
// average half a copy to temporary space, one comparison (well, two,
// but each one compares two items), and one copy to its final
// location. So these lg N steps are larger than the ones in
// lookup. Inserting 64k items, then, will involve a million and a
// half copies and a million comparisons.
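// The insert-and-merge discipline above is the binary-counter
// pattern: slot k holds a sorted run of 2^k items or is empty, and a
// new item carries merges upward until it finds an empty slot. A
// minimal sketch under that assumption (the names here are mine, not
// the tinytable API):

```c
#include <stdlib.h>

#define MAXLEVELS 32

/* slot[k] holds either NULL or a sorted run of 1<<k ints. */
typedef struct { int *slot[MAXLEVELS]; } tinysketch;

/* Merge two sorted runs of n ints each into a fresh run of 2n. */
static int *merge_runs(const int *a, const int *b, size_t n) {
  int *out = malloc(2 * n * sizeof *out);
  size_t i = 0, j = 0, k = 0;
  while (i < n && j < n) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
  while (i < n) out[k++] = a[i++];
  while (j < n) out[k++] = b[j++];
  return out;
}

/* Insert one key: a one-element run goes in slot 0; each occupied
   slot forces a merge and a carry to the next slot, so a key passes
   through at most lg N merges on its way to its final resting
   place. */
void tinysketch_insert(tinysketch *t, int key) {
  int *run = malloc(sizeof *run);
  size_t n = 1;
  run[0] = key;
  for (int k = 0; k < MAXLEVELS; k++, n *= 2) {
    if (!t->slot[k]) { t->slot[k] = run; return; }
    int *merged = merge_runs(t->slot[k], run, n);
    free(t->slot[k]);
    free(run);
    t->slot[k] = NULL;
    run = merged;
  }
}
```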
// In practice, it’s probably better to cheap out on both insertion
// and search on small intervals by just going sequential. Once the
// interval being searched gets to be, say, 8 items or less, you might
// as well just sequentially search it; and you’re probably better off
// accumulating newly-inserted items in a sequential “side file” and
// then, when it gets too big, sorting it with insertion or selection
// sort rather than mergesort. Selection sort of 16 items takes
// exactly 120 comparisons and at most 15 exchanges; mergesort takes 64
// comparisons and 92 copies, which is almost the same amount of
// traffic but with more complicated control logic and in a bigger
// working set. Below that point, selection sort is clearly faster;
// above it, mergesort is clearly faster.
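// To check the crossover arithmetic above: selection sort of n items
// always does n(n-1)/2 comparisons and at most n-1 exchanges. An
// instrumented sketch:

```c
#include <stddef.h>

/* Selection sort with a comparison counter: n(n-1)/2 comparisons
   always, at most n-1 exchanges. Returns the comparison count. */
size_t selection_sort(int *a, size_t n) {
  size_t comparisons = 0;
  for (size_t i = 0; i + 1 < n; i++) {
    size_t min = i;
    for (size_t j = i + 1; j < n; j++) {
      comparisons++;
      if (a[j] < a[min]) min = j;
    }
    if (min != i) { int tmp = a[i]; a[i] = a[min]; a[min] = tmp; }
  }
  return comparisons;
}
```

// For n = 16 this yields 16*15/2 = 120 comparisons, as claimed.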
// Increasing the worst allowed segment ratio from 2:1 to, say, 8:1
// would allow fewer segments, at the cost of more merges. The number
// of probes for a lookup converges toward a limit of almost lg N,
// while the number of copies an item goes through to reach its final
// resting place explodes.
// All of this can be done in a fixed block of memory one and a half
// times the size of the maximum supported table size (or smaller, if
// the segment ratio is more extreme). You start by allocating lg N
// table-start pointers at the beginning. Then you build segments
// starting after that, growing toward the end of the block. Appending
// to the latest segments just involves bumping a pointer; merging two
// segments begins by moving the second segment to a new space past
// the end of where the merged segment will be, and then building the
// merged segment backward from its end, eventually overwriting the
// segment that wasn’t moved.
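// The backward merge just described might look like this on plain
// int arrays: move the second run out to scratch, then build the
// merged run from its top end down, so the writes never overtake the
// unread part of the run that wasn’t moved.

```c
#include <stddef.h>
#include <string.h>

/* Merge sorted runs a[0..n) and a[n..n+m) in place, using m ints of
   scratch: first move the second run out of the way, then build the
   merged run backward from its top end. The write position can never
   overtake the unread part of the first run, which is what lets the
   output overwrite it safely. */
void merge_backward(int *a, size_t n, size_t m, int *scratch) {
  memcpy(scratch, a + n, m * sizeof *a);  /* move the second segment */
  size_t i = n, j = m, k = n + m;         /* read and write from the ends */
  while (i > 0 && j > 0)
    a[--k] = (a[i-1] > scratch[j-1]) ? a[--i] : scratch[--j];
  while (j > 0) a[--k] = scratch[--j];
  /* whatever remains of a[0..i) is already in its final place */
}
```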
// It can also be easily adapted to a resizable block of memory.
// Only the last couple of merges need the block of memory to be so
// large. They could instead be conducted using heapsort rather than a
// merge, at the cost of an extra factor of lg N in time for those
// merges, but
// allowing the amount of auxiliary “wasted” memory to be very
// small. For example, if you do this for the last two merges, you
// need 9/8 the space for the total stored data. For 64k items, the
// last merge goes from 64k comparisons and 96k copies to, I think, a
// couple million comparisons and a couple million copies, more than
// all the steps before that point; and the 32k-item merge before that
// is hardly better.
// There’s a variant of this I’ve thought about that could have
// guaranteed worst-case bounds on insertion as well.
// You have fixed-size segments, say, 512 segments of 128 elements
// each. You have a directory of the segments, listing the first and
// last key in each segment. To do a lookup, you perform an interval
// search on this directory (e.g. using an interval tree) to find all
// the segments spanning the key you’re searching for, and search each
// of them individually. Insertion inserts into a “current” segment,
// which may be organized differently from the other segments
// (e.g. maintained sorted, or partly sorted, or whatever) and which
// in general will tend to turn up in the interval search results
// fairly often; insertion also runs the merge algorithm some number
// of steps.
// Merging in this model is always merging two segments to form two
// other segments; when it completes, the two non-overlapping output
// segments replace the input segments in the directory. In this way,
// you only ever need two free segments to carry out a merge. For
// bounded worst-case real-time performance, you would want to run the
// merge algorithm for some number of steps upon each insertion.
// Once the merge is done, the degree of overlap in the segment
// directory is somewhat reduced: two segments that overlapped
// previously are replaced by two equivalent segments that do not
// overlap.
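// A two-segments-in, two-segments-out merge step, as described
// above, might look like this on plain int arrays (SEGSIZE would be
// 128 in the scheme above; it’s tiny here for illustration):

```c
#include <stddef.h>

#define SEGSIZE 4  /* 128 in the scheme above; tiny for illustration */

/* Merge two sorted, possibly overlapping segments into two sorted,
   non-overlapping ones: lo receives the SEGSIZE smallest keys and hi
   the SEGSIZE largest, so lo’s last key <= hi’s first key. */
void merge_two_for_two(const int *a, const int *b, int *lo, int *hi) {
  size_t i = 0, j = 0;
  for (size_t k = 0; k < 2 * SEGSIZE; k++) {
    int v;
    if (i < SEGSIZE && (j == SEGSIZE || a[i] <= b[j])) v = a[i++];
    else v = b[j++];
    if (k < SEGSIZE) lo[k] = v; else hi[k - SEGSIZE] = v;
  }
}
```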
// The question, then, is whether there is an algorithm for choosing
// segments to merge that can guarantee both a limited amount of merge
// work on insert, and a limited degree of overlap as time goes
// on. It’s clear that in the *average* case, everything is nice and
// logarithmic, and that this ought to be isomorphic to the
// huge-contiguous-segment case mentioned previously, which gives you
// guaranteed amortized performance but not guaranteed
// responsiveness. Ideally you want no more than lg N segments
// spanning any particular key, and no more than lg N merges to get a
// particular key to its final resting place.
// The thing that could make a key take longer to get to its final
// resting place would be a badly-matched choice of segments to
// merge. If the segments barely overlap, for example, then after
// merging, their span will have diminished very little; and if a very
// dense segment (one covering a small range of keys) is merged with a
// very sparse segment, the keys from the dense segment will discover
// themselves being consulted much more often afterwards, because they
// will have been copied into a much sparser segment.
// But these concepts are very vague. “Sparse” and “dense” are defined
// relative to the distribution of the inserted keys, which may be
// very nonuniform! So it’s somewhat more difficult to determine the
// “level” to which a segment has been promoted.
// You could store a merge count explicitly in the segment directory
// for each segment, indicating the maximum number of merges its keys
// had been through, but I’m not sure that will help.
// Intuitively, it seems appealing to focus on the area with the
// deepest segment stacking, i.e. where the largest number of
// intervals overlap a single point. This is, after all, one of the
// variables we want to keep under control...
// Aha! You can make it almost exactly isomorphic to the original
// plan. Each segment has a “level”. The current output segment is
// level 0. You only ever merge segments of the same level together;
// this produces segments of the next higher level. You can proceed
// simply by merging the two segments of a given level with the lowest
// starting keys...
// But then you still might find that one of the segments is much
// denser than the other in its part of the keyspace. Solution: only
// output a merged segment at level k+1 if it doesn’t overlap any
// segment at level k; otherwise, output it at level k. Alternatively,
// output the first output segment at level k+1 and the second at
// level k.
// So the segment selection algorithm for merging is something like:
// Find the lowest level that contains two overlapping segments.
// On that level, find the two overlapping segments with the earliest
// starting keys.
// Merge them.
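// That selection rule might be sketched like this, assuming the
// directory is kept sorted by (level, starting key); the segdesc
// layout here is hypothetical:

```c
#include <stddef.h>

/* Hypothetical directory entry: key range covered plus merge level. */
typedef struct { int lo, hi, level; } segdesc;

static int seg_overlap(const segdesc *a, const segdesc *b) {
  return a->lo <= b->hi && b->lo <= a->hi;
}

/* Assuming dir[] is sorted by (level, lo): scan upward from the
   lowest level, and within the first level that has any overlapping
   pair, pick the pair with the earliest starting keys. Returns 0 if
   no two same-level segments overlap, i.e. nothing needs merging. */
int choose_merge(const segdesc *dir, size_t n, size_t *x, size_t *y) {
  for (size_t i = 0; i < n; i++)
    for (size_t j = i + 1; j < n && dir[j].level == dir[i].level; j++)
      if (seg_overlap(&dir[i], &dir[j])) { *x = i; *y = j; return 1; }
  return 0;
}
```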
// I’m not totally sure about this approach. A more direct translation
// of the original algorithm with its variable-sized segments might
// work better. In that algorithm, once merging was initiated for a
// particular level, it would propagate its way up the tree to the
// bigger segments until it ran out of steam.
// Time log:
// 2010-04-20: from 22:22 to 22:54, writing most of the example usage.
// 2010-04-21: from 01:06 to 01:09
// 2010-04-24: from 02:13 to 04:51: writing notes and reading research
// Example usage: a word-count program.
#include <stdio.h>
// Note: the tinytable type, tinytable_find, create_if_not_found, and
// the iterator functions used below are assumed to be declared
// elsewhere; they aren’t defined in this file.
enum { wordmax = 32 };
static inline int is_ascii_alnum(int c) {
  switch(c) {
  /* case list generated in Perl (and then folded):
  print " case '$_':\n" for (('a'..'z'), ('A'..'Z'), ('0'..'9'), '_');
  */
  case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g':
  case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n':
  case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u':
  case 'v': case 'w': case 'x': case 'y': case 'z':
  case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G':
  case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N':
  case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U':
  case 'V': case 'W': case 'X': case 'Y': case 'Z':
  case '0': case '1': case '2': case '3': case '4':
  case '5': case '6': case '7': case '8': case '9':
  case '_':
    return 1;
  default:
    return 0;
  }
}
FILE *determine_input(int argc, char **argv) {
  if (argc > 1) {
    FILE *input = fopen(argv[1], "r");
    if (!input) perror(argv[1]);
    return input;  // NULL on failure; the caller checks
  }
  return stdin;
}
void count_words(tinytable *t, FILE *input) {
  char wordbuf[wordmax];
  int c;
  char *wp = wordbuf;
  for (;;) {
    c = getc(input);
    if (is_ascii_alnum(c)) {
      // add to word; once it fills the buffer, stop storing
      // (the overlong word is dropped below)
      if (wp != wordbuf + wordmax) *wp++ = c;
      continue;
    }
    // at end of word; skip empty and overlong words
    if (wp != wordbuf && wp != wordbuf + wordmax) {
      tinytable_entry e = tinytable_find(t, wordbuf, wp - wordbuf,
                                         create_if_not_found);
      e->data.int_val++;
    }
    wp = wordbuf;  // reset for the next word
    if (c == EOF) break;
  }
}
int display_results(tinytable *t) {
  tinytable_iterator ti = tinytable_iterate(t);
  while (!tinytable_iterate_done(&ti)) {
    tinytable_entry e = tinytable_iterate_next(&ti);
    // the key/key_len field names here are assumed; adjust to the
    // real entry layout
    if (printf("'") < 0) return 0;
    if (fwrite(e->key, 1, e->key_len, stdout) != (size_t)e->key_len) return 0;
    if (printf("': %d\n", e->data.int_val) < 0) return 0;
  }
  return 1;
}
int main(int argc, char **argv) {
  FILE *input = determine_input(argc, argv);
  if (!input) return 1;
  tinytable t = {0};
  count_words(&t, input);
  if (!display_results(&t)) return 1;
  return 0;
}