#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Prepare a hand-drawn hash table of words.

What this program does
----------------------

This program shows a method that you should have used if you were an
early medieval scribe drawing up a glossary, rather than alphabetizing
your glossary and wasting precious parchment.  The lookup efficiency
is dramatically better than actual early medieval glossaries achieved
with their use of first-letter order, and not far from that of full
alphabetization.

To be concrete, I compiled an 8672-word glossary of relatively
uncommon English words.  Fully alphabetized, a binary search in it
would require you to examine some 13 words to find a word.  The Épinal
glossary, which was about the same size but in first-letter order,
required searching through about 225 words to find an "A-" word, or
550 if it wasn't present; a first-letter-order glossary of these 8672
words would require searching through 263 words on average for a
successful search, or 333 if not.  This hash-table-based approach, by
contrast, requires searching through only 31 words for successful
searches, or 42 for unsuccessful searches, and despite what you'd
expect of hashing algorithms, it is easily usable without a computer,
because the hash value is just the first and fourth letters of the
word, collapsed into alphabetically contiguous equivalence classes to
save space on the paper.  Adding entries to the hash table does not
require the ability to erase.

If we're limited to using Roman numerals, we need to use a smaller
number of equivalence classes to save space, and so the glossary
requires searching through XLVI words on average for successful
searches, or LXVII words for unsuccessful searches.

This is how I organize my address book in my notebook.  I use
Hindu-Arabic numerals, though.
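Concretely, the hand-computable hash amounts to something like the
following (a minimal sketch; `hand_hash` and `class_of` are
illustrative names, and the classes are the Hindu-Arabic set the
program uses):

```python
# The hand-computable hash: the first and fourth letters of the word,
# each collapsed into an alphabetically contiguous equivalence class.
classes = 'ab c d e fgh ij kl mn op qr s t uv wxyz'.split()

def class_of(letter):
    for c in classes:
        if letter.lower() in c:
            return c
    return ' '  # digits, punctuation, and other non-letters

def hand_hash(word):
    # Words shorter than four letters hash on the first letter alone.
    fourth = class_of(word[3]) if len(word) > 3 else ' '
    return (class_of(word[0]), fourth)
```

For example, hand_hash('pursued') is ('op', 's'), the bucket used in
the worked lookup later in this docstring.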
How it works
------------

The traditional way to organize dictionaries or glossaries is by
alphabetizing the words in lexicographical order, but as James Murray
writes in [The Evolution of English Lexicography][0], glossaries and
dictionaries went through several centuries before being fully
alphabetized, passing through intermediate stages of first-letter
order and second-letter order.

[0]: http://www.gutenberg.org/files/11694/11694-h/11694-h.htm

Murray speculated that perhaps the progressively more alphabetical
orders were a result of gradual alphabetization by different scribes.
If you're copying down a 1000-word glossary onto a parchment scroll,
alphabetizing it without using any auxiliary parchment or leaving any
extra space will require you to make 1000 passes through the glossary,
each time finding the alphabetically first remaining word.  Meanwhile,
you can do the same job in only 26 passes if you're just putting the
words in first-letter order.

It turns out that you can do a very great deal better than this by
using 20th-century algorithmic techniques: by writing two columns of
numbers to the left of the words, with a short table at the beginning
of the list, you can reduce lookup time dramatically, without the need
for costly auxiliary parchment or intermediate copying.  Furthermore,
you can add words to this glossary incrementally in the future,
without degrading its performance much, and you can do it all by hand,
in a single pass, easily.  If you have some marginal space, you may be
able to do it on the existing glossary manuscript, without having to
make an entire copy of it.

This program demonstrates the technique by doing it by machine.  It
takes as an input file an unordered glossary, with one word definition
per line; the output file, which is in the same order, allows rapid
lookups.
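The two-columns-of-numbers scheme is ordinary hash-table chaining; a
minimal sketch of the single-pass construction (`build_chains` and
`bucket_of` are illustrative names — in the program itself the bucket
is the first-and-fourth-letter hash):

```python
# Single-pass chain construction: each word is appended to the chain
# for its bucket by writing a "next-entry" number next to the previous
# chain member.  Nothing already written ever needs to be erased.
def build_chains(words, bucket_of):
    heads = {}      # bucket -> entry number of the chain's first word
    following = {}  # entry number -> entry number of the next word
    for number, word in enumerate(words, 1):
        bucket = bucket_of(word)
        if bucket not in heads:
            heads[bucket] = number
        else:
            # Walk to the current end of the chain and append there.
            tail = heads[bucket]
            while tail in following:
                tail = following[tail]
            following[tail] = number
    return heads, following
```

A lookup starts at heads[bucket] and follows the `following` pointers
until the word is found or the chain runs out, which is exactly what
the head table and the second number column encode on paper.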
The Épinal–Erfurt Glossary had some 550 entries, mostly Latin,
categorized under A, so a successful search for a word beginning with
A involves the examination of an average of some 225 entries, which is
rather wasteful.

Experimental procedure and results
----------------------------------

For input, I prepared an 8672-word glossary of somewhat uncommon
English words (things it's plausible that an English speaker might not
know) as follows, with dictd and dict-wn installed on my Debian
netbook:

    < ~/devel/wordlist tail -n +5000 | head -13000 |
    while read count word
    do
        if dict -d wn "$word" > tmp.$$
        then
            echo $(< tmp.$$ grep -iA 2 "^ $word"$ | head -3)
        fi
    done > glossary

This uses the British National Corpus frequency list to find the most
common 13000 words in English, excepting the 4999 most common words,
and gets some kind of reasonable definition for each of them from
WordNet.  Because the BNC words aren't stemmed, many of them are
things like "encounters" or "pints", which don't have entries of their
own in WordNet; and many placenames such as "Yarmouth" and "Titford"
are also missing; so the resulting unordered glossary is only 8672
words.  It contains 601 "A-" words and 33 "Ab-" words, which is
roughly comparable to the Épinal–Erfurt Glossary's 550 "A-" words, 95
"Ab-" words, and 78 "Ac-" words, according to Murray.

Binary search on 8672 alphabetized words requires, theoretically,
log₂ 8672 ≈ 13 or so word probes to find a word; but typically once
you're down to 128 or 64 candidates, you just read through them in
order, so you really only need about 6 probes.
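The probe counts quoted above can be sanity-checked mechanically (a
sketch; the function names are mine, and the N²/2 step mirrors the
reasoning in the analyze() function below):

```python
import math

# Binary search over n sorted entries examines about log2(n) of them.
def expected_binary_probes(n):
    return round(math.log2(n))

# An unsuccessful search scans a whole chain, so its expected cost is
# the average chain length.
def avg_unsuccessful_scan(chain_lengths):
    return sum(chain_lengths) / len(chain_lengths)

# A successful search is equally likely to stop at any entry of its
# chain, so a chain of N entries contributes N**2/2 total probes
# (counting the final word as 1/2).
def avg_successful_scan(chain_lengths):
    total_entries = sum(chain_lengths)
    return sum(n * n for n in chain_lengths) / 2 / total_entries
```

With the real chain lengths, these give the 13, 42, and 31 figures
quoted in the introduction.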
The definitions extracted from WordNet by the above shell script are
not that great, in part because they sometimes elevate secondary
senses of words to primary importance, but they're still useful:

    outbreak n 1: a sudden violent spontaneous occurrence (usually of
        some undesirable condition); "the outbreak of hostilities"
        [syn:
    greenhouse adj 1: of or relating to or caused by the greenhouse
        effect; "greenhouse gases"
    assisted adj 1: having help; often used as a combining form [syn:
        {assisted}, {aided}] [ant: {unassisted}]

The output from an earlier version of this program begins as follows:

    210 hash chains, max search length 139 entries (from ('ab', 'ef')).
    Average chain length 42 entries.
    Average successful search length 31 entries.
    Find column of first letter and row of fourth letter to get N.
    If word N isn't your word, retry with the other number on line N.
            ab    c    d   ef   gh   ij   kl   mn   op   qr    s    t   uv wxyz
       ab   33   96   57  116  646  411  250  290  134  122  215  232 1033  462
        c  572   27   12   84 3038  388  682  503  149  536   63   67  556 1893
        d  391  478 1802   17  110   64  195   45  188   52   93  208  118 1238
       ef   40   21  157   59   91  128   11  135   90   29  371   55  356 1475
       gh  228 1247  480  125   37 1045  104   15  671   25  659  489  576   31
       ij   62  175  105  320    7  465  845   73  200   56  235  658  244 1909
       kl  127  166  111  177  284    2  332  390   14  247   28  255  334   30
       mn   22    4   78   34  131 1196  154   77    6  899    1   32  364  606
       op  224   39   85    8  619  124  351   92  141    5  100   24  163  167
       qr   23   74  947  302  336   16  274  156   53   20  133   49 1576  918
        s  184   75  782  237  541  894 2545  396  315  733  130  447  148  121
        t   58   61  430   54   80  367 1300  205   18  298   65  114 1139  160
       uv  117  507  660   89   13   47 1009  292   69  574  265  932   10 4404
     wxyz   35  923  172  795  712  686   70 1180  640 3342 1265   76  233 1796
           253  113  679  102  227    3   26  217   43  602  335  751  321  754

        1   50 swimming adj 1: filled or brimming with tears; "swimming
               eyes"; "sorrow made the eyes of many grow liquid" [syn:
               {liquid},
        2  261 imply v 1: express or state indirectly [syn: {imply},
               {connote}] 2: suggest as a logically necessary
               consequence; in logic
        3  123 ie
               adv 1: that is to say; in other words [syn: {i.e.},
               {ie}, {id est}]
        4  159 clinic n 1: a medical establishment run by a group of
               medical specialists
        5   42 respects n 1: (often used with `pay') a formal
               expression of esteem; "he paid his respects to the
               mayor"

Example lookup
--------------

As an example lookup, suppose you're seeking the word "pursued".  The
relevant letters are "p" and "s"; in the "op" column and the "s" row,
the number is 315.  Entry 315 is:

      315  319 prospective adj 1: of or concerned with or related to
               the future; "prospective earnings"; "a prospective
               mother"; "a

This is not "pursued", so we continue to entry 319:

      319  420 possessed adj 1: influenced or controlled by a powerful
               force such as a strong emotion; "by love possessed"
               [syn: {obsessed},

Nor is this, so we continue to 420:

      420  457 pursued adj 1: followed with enmity as if to harm;
               "running and leaping like a herd of pursued antelopes"

And we have found it.  If we were to continue following the chain, we
would eventually find ourselves here, where it ends:

     8416      pristine adj 1: completely free from dirt or
               contamination; "pristine mountain snow"

The longest such chain has 139 words in it, but on average you have to
examine only 31 words to find the word you're looking for.

Copying into sequential form
----------------------------

If you should at some point find yourself copying a glossary in this
format from one manuscript to another, you can copy each such chain
into a contiguous block in a single pass.  This will enable readers of
your new reordered copy to turn immediately to the page containing
your word and scan down the page for it, rather than having to
binary-search through several pages, as in alphabetized reference
books; and the block of related words can still have a next-pointer at
the end of it for appending later words.

In Roman-numeral mode, the result looks like this:

    Cxxxii hash chains, max search length cxcix entries (from ('cd', 'qrs')).
    Average chain length lxvii entries.
    Average successful search length xlvi entries.
    Find column of first letter and row of fourth letter to get N.
    If word N isn't your word, retry with the other number on line N.
           ab       cd       ef       gh       ij       kl       mn       op       qrs      t        uvwxyz
    ab     xxxiii   lvii     cxvi     dcxlvi   cdxi     ccl      ccxc     cxxxiv   cxxii    ccxxxii  cdlxii
    cd     cccxci   xii      xvii     cx       lxiv     cxcv     xlv      cxlix    lii      lxvii    cxviii
    ef     xl       xxi      lix      xci      cxxviii  xi       cxxxv    xc       xxix     lv       ccclvi
    gh     ccxxviii cdlxxx   cxxv     xxxvii   mxlv     civ      xv       dclxxi   xxv      cdlxxxix xxxi
    ij     lxii     cv       cccxx    vii      cdlxv    dcccxlv  lxxiii   cc       lvi      dclviii  ccxliv
    kl     cxxvii   cxi      clxxvii  cclxxxiv ii       cccxxxii cccxc    xiv      xxviii   cclv     xxx
    mn     xxii     iv       xxxiv    cxxxi    mcxcvi   cliv     lxxvii   vi       i        xxxii    ccclxiv
    op     ccxxiv   xxxix    viii     dcxix    cxxiv    cccli    xcii     cxli     v        xxiv     clxiii
    qrs    xxiii    lxxiv    ccxxxvii cccxxxvi xvi      cclxxiv  clvi     liii     xx       xlix     cxxi
    t      lviii    lxi      liv      lxxx     ccclxvii mccc     ccv      xviii    lxv      cxiv     clx
    uvwxyz xxxv     clxxii   lxxxix   xiii     xlvii    lxx      ccxcii   lxix     cclxv    lxxvi    x
           ccliii   cxiii    cii      ccxxvii  iii      xxvi     ccxvii   xliii    cccxxxv  dccli    cccxxi

    i               l               swimming adj 1: filled or brimming with tears; "swimming
                                    eyes"; "sorrow made the eyes of many grow liquid" [syn:
                                    {liquid},
    ii              cclxi           imply v 1: express or state indirectly [syn: {imply},
                                    {connote}] 2: suggest as a logically necessary consequence;
                                    in logic
    iii             cxxiii          ie adv 1: that is to say; in other words [syn: {i.e.},
                                    {ie}, {id est}]
    iv              lxxviii         clinic n 1: a medical establishment run by a group of
                                    medical specialists
    v               xlii            respects n 1: (often used with `pay') a formal expression
                                    of esteem; "he paid his respects to the mayor"
"""

import sys
import textwrap

roman = False


def main(input_file):
    # These work well, but Roman numerals make the table too wide.
    # Separating c from d and s from t more than makes up for the loss
    # of lumping wxyz together.
    equivalence_classes = 'ab c d e fgh ij kl mn op qr s t uv wxyz'.split()
    # So with Roman numerals, I use these:
    if roman:
        equivalence_classes = 'ab cd ef gh ij kl mn op qrs t uvwxyz'.split()
        max_width = 112
    else:
        max_width = 80

    heads = {}
    following = {}
    lines = []

    for number, line in enumerate(input_file, 1):
        word = line.split()[0]
        hash_value = ((class_of(equivalence_classes, word[0]),
                       class_of(equivalence_classes, word[3]))
                      if len(word) > 3 else
                      (class_of(equivalence_classes, word[0]), ' '))
        if hash_value not in heads:
            heads[hash_value] = number
        else:
            to_follow = heads[hash_value]
            while to_follow in following:
                to_follow = following[to_follow]
            following[to_follow] = number
        lines.append((number, line))

    width = max(len(represent(n)) for n in range(number+1))
    class_width = max(len(c) for c in equivalence_classes)
    table_column_width = max(class_width,
                             max(len(represent(number))
                                 for number in heads.values()))

    for nonsense in analyze(heads, following):
        yield nonsense

    yield "Find column of first letter and row of fourth letter to get N.\n"
    yield "If word N isn't your word, retry with the other number on line N.\n"

    # Table column header line.
    yield justify('', class_width+1)
    for c in equivalence_classes + [' ']:
        yield justify(c, table_column_width+1)
    yield '\n'

    # Table of heads.
    for line_class in equivalence_classes + [' ']:
        yield justify(line_class, class_width+1)
        for c in equivalence_classes + [' ']:
            yield justify(represent(heads.get((c, line_class))),
                          table_column_width+1)
        yield '\n'
    yield '\n'

    indent = 2*width + 3
    for number, line in lines:
        yield justify(represent(number), width+1)
        yield justify(represent(following.get(number)), width+1)
        yield ' '
        yield textwrap.fill(line, max_width,
                            initial_indent=' ' * indent,
                            subsequent_indent=' ' * indent)[indent:]
        yield '\n'


def analyze(heads, following):
    max_chain_length = 0
    max_chain = None
    total_chain_length = 0  # Also, total number of entries.
    total_squared_chain_lengths = 0
    n_chains = 0

    for key, start in heads.items():
        # chain_length counts the entries in the chain, head included.
        chain_length = 1
        current = start
        while current in following:
            current = following[current]
            chain_length += 1
        if chain_length > max_chain_length:
            max_chain_length = chain_length
            max_chain = key
        total_chain_length += chain_length
        total_squared_chain_lengths += chain_length**2
        n_chains += 1

    yield "%s hash chain%s, " % (represent(n_chains).capitalize(),
                                 plural(n_chains))
    entry_plural = lambda n: plural(n, ("entry", "entries"))
    yield "max search length %s %s (from %s).\n" % (
        represent(max_chain_length), entry_plural(max_chain_length),
        max_chain)

    avg = int(round(total_chain_length / float(n_chains)))
    yield "Average chain length %s %s.\n" % (represent(avg),
                                             entry_plural(avg))

    # The reasoning here is that successful searches are equally
    # likely to start on any of the entries in a chain of length N,
    # and the total of all of those searches is N²/2, if you count the
    # final word as ½.
    avg_s = int(round(total_squared_chain_lengths / 2.0
                      / total_chain_length))
    yield "Average successful search length %s %s.\n" % (
        represent(avg_s), entry_plural(avg_s))


def plural(n, possibilities=("", "s")):
    return possibilities[0] if n == 1 else possibilities[1]


def justify(s, width):
    return (s + ' ' * (width - len(s)) if roman
            else ' ' * (width - len(s)) + s)


def class_of(equivalence_classes, letter):
    for possibility in equivalence_classes:
        if letter.lower() in possibility:
            return possibility
    return ' '


def represent(number):
    if number is None:
        return ''
    if roman:
        return ''.join(represent_roman(number))
    else:
        return str(number)


def represent_roman(n):
    "Convert to Roman numerals. After Mark Pilgrim in Dive Into Python."
    for num, integer in [('ↁ', 5000), ('mↁ', 4000), ('m', 1000),
                         ('cm', 900), ('d', 500), ('cd', 400),
                         ('c', 100), ('xc', 90), ('l', 50),
                         ('xl', 40), ('x', 10), ('ix', 9),
                         ('v', 5), ('iv', 4), ('i', 1)]:
        while n >= integer:
            yield num
            n -= integer


if __name__ == '__main__':
    if '--roman' in sys.argv:
        roman = True
    sys.stdout.writelines(main(sys.stdin))