#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Prepare a hand-drawn hash table of words.

What this program does
----------------------

This program shows a method that you should have used if you were an
early medieval scribe drawing up a glossary, rather than alphabetizing
your glossary and wasting precious parchment.  The lookup efficiency
is dramatically better than actual early medieval glossaries achieved
with their use of first-letter order, and not far from that of full
alphabetization.

To be concrete, I compiled an 8672-word glossary of relatively
uncommon English words.  Fully alphabetized, a binary search in it
would require you to examine some 13 words to find a word.  The Épinal
glossary, which was about the same size but in first-letter order,
required searching through about 225 words to find an "A-" word, or
550 if it wasn't present; a first-letter-order glossary of these 8672
words would require searching through 263 words on average for a
successful search, or 333 if not.  This hash-table-based approach, by
contrast, requires searching through only 31 words for successful
searches, or 42 for unsuccessful searches, and despite what you'd
expect of hashing algorithms, it is easily usable without a computer,
because the hash value is just the first and fourth letters of the
word, collapsed into alphabetically contiguous equivalence classes to
save space on the paper.  Adding entries to the hash table does not
require the ability to erase.

If we're limited to using Roman numerals, we need to use a smaller
number of equivalence classes to save space, and so the glossary
requires searching through XLVI words on average for successful
searches, or LXVII words for unsuccessful searches.

This is how I organize my address book in my notebook.  I use
Hindu-Arabic numerals, though.
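Concretely, the hand-computable hash amounts to something like the
following (a minimal sketch; `hand_hash` and `class_of` are
illustrative names, and the classes are the Hindu-Arabic set the
program uses):

```python
# The hand-computable hash: the first and fourth letters of the word,
# each collapsed into an alphabetically contiguous equivalence class.
classes = 'ab c d e fgh ij kl mn op qr s t uv wxyz'.split()

def class_of(letter):
    for c in classes:
        if letter.lower() in c:
            return c
    return ' '  # digits, punctuation, and other non-letters

def hand_hash(word):
    # Words shorter than four letters hash on the first letter alone.
    fourth = class_of(word[3]) if len(word) > 3 else ' '
    return (class_of(word[0]), fourth)
```

For example, hand_hash('pursued') is ('op', 's'), the bucket used in
the worked lookup later in this docstring.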
How it works
------------

The traditional way to organize dictionaries or glossaries is by
alphabetizing the words in lexicographical order, but as James Murray
writes in [The Evolution of English Lexicography][0], glossaries and
dictionaries went through several centuries before being fully
alphabetized, passing through intermediate stages of first-letter
order and second-letter order.

[0]: http://www.gutenberg.org/files/11694/11694-h/11694-h.htm

Murray speculated that perhaps the progressively more alphabetical
orders were a result of gradual alphabetization by different scribes.
If you're copying down a 1000-word glossary onto a parchment scroll,
alphabetizing it without using any auxiliary parchment or leaving any
extra space will require you to make 1000 passes through the glossary,
each time finding the alphabetically first remaining word.  Meanwhile,
you can do the same job in only 26 passes if you're just putting the
words in first-letter order.

It turns out that you can do a very great deal better than this by
using 20th-century algorithmic techniques: by writing two columns of
numbers to the left of the words, with a short table at the beginning
of the list, you can reduce lookup time dramatically, without the need
for costly auxiliary parchment or intermediate copying.  Furthermore,
you can add words to this glossary incrementally in the future,
without degrading its performance much, and you can do it all by hand,
in a single pass, easily.  If you have some marginal space, you may be
able to do it on the existing glossary manuscript, without having to
make an entire copy of it.

This program demonstrates the technique by doing it by machine.  It
takes as an input file an unordered glossary, with one word definition
per line; the output file, which is in the same order, allows rapid
lookups.
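The two-columns-of-numbers scheme is ordinary hash-table chaining; a
minimal sketch of the single-pass construction (`build_chains` and
`bucket_of` are illustrative names — in the program itself the bucket
is the first-and-fourth-letter hash):

```python
# Single-pass chain construction: each word is appended to the chain
# for its bucket by writing a "next-entry" number next to the previous
# chain member.  Nothing already written ever needs to be erased.
def build_chains(words, bucket_of):
    heads = {}      # bucket -> entry number of the chain's first word
    following = {}  # entry number -> entry number of the next word
    for number, word in enumerate(words, 1):
        bucket = bucket_of(word)
        if bucket not in heads:
            heads[bucket] = number
        else:
            # Walk to the current end of the chain and append there.
            tail = heads[bucket]
            while tail in following:
                tail = following[tail]
            following[tail] = number
    return heads, following
```

A lookup starts at heads[bucket] and follows the `following` pointers
until the word is found or the chain runs out, which is exactly what
the head table and the second number column encode on paper.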
The Épinal–Erfurt Glossary had some 550 entries, mostly Latin,
categorized under A, so a successful search for a word beginning with
A involves the examination of an average of some 225 entries, which is
rather wasteful.

Experimental procedure and results
----------------------------------

For input, I prepared an 8672-word glossary of somewhat uncommon
English words (things it's plausible that an English speaker might not
know) as follows, with dictd and dict-wn installed on my Debian
netbook:

    < ~/devel/wordlist tail -n +5000 | head -13000 |
    while read count word
    do
        if dict -d wn "$word" > tmp.$$
        then
            echo $(< tmp.$$ grep -iA 2 "^ $word"$ | head -3)
        fi
    done > glossary

This uses the British National Corpus frequency list to find the most
common 13000 words in English, excepting the 4999 most common words,
and gets some kind of reasonable definition for each of them from
WordNet.  Because the BNC words aren't stemmed, many of them are
things like "encounters" or "pints", which don't have entries of their
own in WordNet; and many placenames such as "Yarmouth" and "Titford"
are also missing; so the resulting unordered glossary is only 8672
words.  It contains 601 "A-" words and 33 "Ab-" words, which is
roughly comparable to the Épinal–Erfurt Glossary's 550 "A-" words, 95
"Ab-" words, and 78 "Ac-" words, according to Murray.

Binary search on 8672 alphabetized words requires, theoretically,
log₂ 8672 ≈ 13 or so word probes to find a word; but typically once
you're down to 128 or 64 candidates, you just read through them in
order, so you really only need about 6 probes.
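The probe counts quoted above can be sanity-checked mechanically (a
sketch; the function names are mine, and the N²/2 step mirrors the
reasoning in the analyze() function below):

```python
import math

# Binary search over n sorted entries examines about log2(n) of them.
def expected_binary_probes(n):
    return round(math.log2(n))

# An unsuccessful search scans a whole chain, so its expected cost is
# the average chain length.
def avg_unsuccessful_scan(chain_lengths):
    return sum(chain_lengths) / len(chain_lengths)

# A successful search is equally likely to stop at any entry of its
# chain, so a chain of N entries contributes N**2/2 total probes
# (counting the final word as 1/2).
def avg_successful_scan(chain_lengths):
    total_entries = sum(chain_lengths)
    return sum(n * n for n in chain_lengths) / 2 / total_entries
```

With the real chain lengths, these give the 13, 42, and 31 figures
quoted in the introduction.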
The definitions extracted from WordNet by the above shell script are
not that great, in part because they sometimes elevate secondary
senses of words to primary importance, but they're still useful:

    outbreak n 1: a sudden violent spontaneous occurrence (usually of
        some undesirable condition); "the outbreak of hostilities"
        [syn:
    greenhouse adj 1: of or relating to or caused by the greenhouse
        effect; "greenhouse gases"
    assisted adj 1: having help; often used as a combining form [syn:
        {assisted}, {aided}] [ant: {unassisted}]

The output from an earlier version of this program begins as follows:

    210 hash chains, max search length 139 entries (from ('ab', 'ef')).
    Average chain length 42 entries.
    Average successful search length 31 entries.
    Find column of first letter and row of fourth letter to get N.
    If word N isn't your word, retry with the other number on line N.
            ab    c    d   ef   gh   ij   kl   mn   op   qr    s    t   uv wxyz
       ab   33   96   57  116  646  411  250  290  134  122  215  232 1033  462
        c  572   27   12   84 3038  388  682  503  149  536   63   67  556 1893
        d  391  478 1802   17  110   64  195   45  188   52   93  208  118 1238
       ef   40   21  157   59   91  128   11  135   90   29  371   55  356 1475
       gh  228 1247  480  125   37 1045  104   15  671   25  659  489  576   31
       ij   62  175  105  320    7  465  845   73  200   56  235  658  244 1909
       kl  127  166  111  177  284    2  332  390   14  247   28  255  334   30
       mn   22    4   78   34  131 1196  154   77    6  899    1   32  364  606
       op  224   39   85    8  619  124  351   92  141    5  100   24  163  167
       qr   23   74  947  302  336   16  274  156   53   20  133   49 1576  918
        s  184   75  782  237  541  894 2545  396  315  733  130  447  148  121
        t   58   61  430   54   80  367 1300  205   18  298   65  114 1139  160
       uv  117  507  660   89   13   47 1009  292   69  574  265  932   10 4404
     wxyz   35  923  172  795  712  686   70 1180  640 3342 1265   76  233 1796
           253  113  679  102  227    3   26  217   43  602  335  751  321  754

        1   50 swimming adj 1: filled or brimming with tears; "swimming
               eyes"; "sorrow made the eyes of many grow liquid" [syn:
               {liquid},
        2  261 imply v 1: express or state indirectly [syn: {imply},
               {connote}] 2: suggest as a logically necessary
               consequence; in logic
        3  123 ie
               adv 1: that is to say; in other words [syn: {i.e.},
               {ie}, {id est}]
        4  159 clinic n 1: a medical establishment run by a group of
               medical specialists
        5   42 respects n 1: (often used with `pay') a formal
               expression of esteem; "he paid his respects to the
               mayor"

Example lookup
--------------

As an example lookup, suppose you're seeking the word "pursued".  The
relevant letters are "p" and "s"; in the "op" column and the "s" row,
the number is 315.  Entry 315 is:

      315  319 prospective adj 1: of or concerned with or related to
               the future; "prospective earnings"; "a prospective
               mother"; "a

This is not "pursued", so we continue to entry 319:

      319  420 possessed adj 1: influenced or controlled by a powerful
               force such as a strong emotion; "by love possessed"
               [syn: {obsessed},

Nor is this, so we continue to 420:

      420  457 pursued adj 1: followed with enmity as if to harm;
               "running and leaping like a herd of pursued antelopes"

And we have found it.  If we were to continue following the chain, we
would eventually find ourselves here, where it ends:

     8416      pristine adj 1: completely free from dirt or
               contamination; "pristine mountain snow"

The longest such chain has 139 words in it, but on average you have to
examine only 31 words to find the word you're looking for.

Copying into sequential form
----------------------------

If you should at some point find yourself copying a glossary in this
format from one manuscript to another, you can copy each such chain
into a contiguous block in a single pass.  This will enable readers of
your new reordered copy to turn immediately to the page containing
your word and scan down the page for it, rather than having to
binary-search through several pages, as in alphabetized reference
books; and the block of related words can still have a next-pointer at
the end of it for appending later words.

In Roman-numeral mode, the result looks like this:

    Cxxxii hash chains, max search length cxcix entries (from ('cd', 'qrs')).
    Average chain length lxvii entries.
    Average successful search length xlvi entries.
    Find column of first letter and row of fourth letter to get N.
    If word N isn't your word, retry with the other number on line N.
           ab       cd       ef       gh       ij       kl       mn       op       qrs      t        uvwxyz
    ab     xxxiii   lvii     cxvi     dcxlvi   cdxi     ccl      ccxc     cxxxiv   cxxii    ccxxxii  cdlxii
    cd     cccxci   xii      xvii     cx       lxiv     cxcv     xlv      cxlix    lii      lxvii    cxviii
    ef     xl       xxi      lix      xci      cxxviii  xi       cxxxv    xc       xxix     lv       ccclvi
    gh     ccxxviii cdlxxx   cxxv     xxxvii   mxlv     civ      xv       dclxxi   xxv      cdlxxxix xxxi
    ij     lxii     cv       cccxx    vii      cdlxv    dcccxlv  lxxiii   cc       lvi      dclviii  ccxliv
    kl     cxxvii   cxi      clxxvii  cclxxxiv ii       cccxxxii cccxc    xiv      xxviii   cclv     xxx
    mn     xxii     iv       xxxiv    cxxxi    mcxcvi   cliv     lxxvii   vi       i        xxxii    ccclxiv
    op     ccxxiv   xxxix    viii     dcxix    cxxiv    cccli    xcii     cxli     v        xxiv     clxiii
    qrs    xxiii    lxxiv    ccxxxvii cccxxxvi xvi      cclxxiv  clvi     liii     xx       xlix     cxxi
    t      lviii    lxi      liv      lxxx     ccclxvii mccc     ccv      xviii    lxv      cxiv     clx
    uvwxyz xxxv     clxxii   lxxxix   xiii     xlvii    lxx      ccxcii   lxix     cclxv    lxxvi    x
           ccliii   cxiii    cii      ccxxvii  iii      xxvi     ccxvii   xliii    cccxxxv  dccli    cccxxi

    i               l               swimming adj 1: filled or brimming with tears; "swimming
                                    eyes"; "sorrow made the eyes of many grow liquid" [syn:
                                    {liquid},
    ii              cclxi           imply v 1: express or state indirectly [syn: {imply},
                                    {connote}] 2: suggest as a logically necessary consequence;
                                    in logic
    iii             cxxiii          ie adv 1: that is to say; in other words [syn: {i.e.},
                                    {ie}, {id est}]
    iv              lxxviii         clinic n 1: a medical establishment run by a group of
                                    medical specialists
    v               xlii            respects n 1: (often used with `pay') a formal expression
                                    of esteem; "he paid his respects to the mayor"
"""

import sys
import textwrap

roman = False


def main(input_file):
    # These work well, but Roman numerals make the table too wide.
    # Separating c from d and s from t more than makes up for the loss
    # of lumping wxyz together.
    equivalence_classes = 'ab c d e fgh ij kl mn op qr s t uv wxyz'.split()
    # So with Roman numerals, I use these:
    if roman:
        equivalence_classes = 'ab cd ef gh ij kl mn op qrs t uvwxyz'.split()
        max_width = 112
    else:
        max_width = 80

    heads = {}
    following = {}
    lines = []

    for number, line in enumerate(input_file, 1):
        word = line.split()[0]
        hash_value = ((class_of(equivalence_classes, word[0]),
                       class_of(equivalence_classes, word[3]))
                      if len(word) > 3 else
                      (class_of(equivalence_classes, word[0]), ' '))
        if hash_value not in heads:
            heads[hash_value] = number
        else:
            to_follow = heads[hash_value]
            while to_follow in following:
                to_follow = following[to_follow]
            following[to_follow] = number
        lines.append((number, line))

    width = max(len(represent(n)) for n in range(number+1))
    class_width = max(len(c) for c in equivalence_classes)
    table_column_width = max(class_width,
                             max(len(represent(number))
                                 for number in heads.values()))

    for nonsense in analyze(heads, following):
        yield nonsense

    yield "Find column of first letter and row of fourth letter to get N.\n"
    yield "If word N isn't your word, retry with the other number on line N.\n"

    # Table column header line.
    yield justify('', class_width+1)
    for c in equivalence_classes + [' ']:
        yield justify(c, table_column_width+1)
    yield '\n'

    # Table of heads.
    for line_class in equivalence_classes + [' ']:
        yield justify(line_class, class_width+1)
        for c in equivalence_classes + [' ']:
            yield justify(represent(heads.get((c, line_class))),
                          table_column_width+1)
        yield '\n'
    yield '\n'

    indent = 2*width + 3
    for number, line in lines:
        yield justify(represent(number), width+1)
        yield justify(represent(following.get(number)), width+1)
        yield ' '
        yield textwrap.fill(line, max_width,
                            initial_indent=' ' * indent,
                            subsequent_indent=' ' * indent)[indent:]
        yield '\n'


def analyze(heads, following):
    max_chain_length = 0
    max_chain = None
    total_chain_length = 0  # Also, total number of entries.
    total_squared_chain_lengths = 0
    n_chains = 0

    for key, start in heads.items():
        # chain_length counts the entries in the chain, head included.
        chain_length = 1
        current = start
        while current in following:
            current = following[current]
            chain_length += 1
        if chain_length > max_chain_length:
            max_chain_length = chain_length
            max_chain = key
        total_chain_length += chain_length
        total_squared_chain_lengths += chain_length**2
        n_chains += 1

    yield "%s hash chain%s, " % (represent(n_chains).capitalize(),
                                 plural(n_chains))
    entry_plural = lambda n: plural(n, ("entry", "entries"))
    yield "max search length %s %s (from %s).\n" % (
        represent(max_chain_length), entry_plural(max_chain_length),
        max_chain)

    avg = int(round(total_chain_length / float(n_chains)))
    yield "Average chain length %s %s.\n" % (represent(avg),
                                             entry_plural(avg))

    # The reasoning here is that successful searches are equally
    # likely to start on any of the entries in a chain of length N,
    # and the total of all of those searches is N²/2, if you count the
    # final word as ½.
    avg_s = int(round(total_squared_chain_lengths / 2.0
                      / total_chain_length))
    yield "Average successful search length %s %s.\n" % (
        represent(avg_s), entry_plural(avg_s))


def plural(n, possibilities=("", "s")):
    return possibilities[0] if n == 1 else possibilities[1]


def justify(s, width):
    return (s + ' ' * (width - len(s)) if roman
            else ' ' * (width - len(s)) + s)


def class_of(equivalence_classes, letter):
    for possibility in equivalence_classes:
        if letter.lower() in possibility:
            return possibility
    return ' '


def represent(number):
    if number is None:
        return ''
    if roman:
        return ''.join(represent_roman(number))
    else:
        return str(number)


def represent_roman(n):
    "Convert to Roman numerals. After Mark Pilgrim in Dive Into Python."
    for num, integer in [('ↁ', 5000), ('mↁ', 4000), ('m', 1000),
                         ('cm', 900), ('d', 500), ('cd', 400),
                         ('c', 100), ('xc', 90), ('l', 50),
                         ('xl', 40), ('x', 10), ('ix', 9),
                         ('v', 5), ('iv', 4), ('i', 1)]:
        while n >= integer:
            yield num
            n -= integer


if __name__ == '__main__':
    if '--roman' in sys.argv:
        roman = True
    sys.stdout.writelines(main(sys.stdin))