#!/usr/bin/python3
"""Find minimal-pair sets for practicing English phonetics.

Minimal pairs are words that differ only by a single phoneme.  I’m
defining a “minimal-pair set” as a set of words in which every pair is
a minimal pair; this may be an existing term or may not be.

This program uses a file `pronunciation-dictionary`, which can be
created as follows with eSpeak, and is also found at
<http://canonical.org/~kragen/sw/dev3/pronunciation-dictionary>:

    $ espeak -v en-us+f2 --ipa -q < /usr/share/dict/words > words-ipa-2
    $ paste /usr/share/dict/words words-ipa-2 > pronunciation-dictionary

This is not perfect, because eSpeak’s pronunciation isn’t perfect, but
it’s good enough to be useful as a starting point for recording sets
of minimal pairs for ear training.

By default it generates sets of minimal pairs that differ by any
sequence of vowels (including diphthongs), since English’s insane
proliferation of vowels is the trickiest thing for ESL learners from
many backgrounds, including but not limited to native speakers of
Spanish:

    $ ./minpairsets.py
    lˈVsVz
            lˈuːsᵻz looses
            lˈæsᵻz lasses
            lˈuːɪsəz Louisa's
            lˈiːsɪz lease's
            lˈæsɪz lass's/lassies
            lˈæsəz Lassa's
            lˈɛsəz Lesa's
            lˈɛsiz Lessie's
            lˈɔsɪz loss's
            lˈeɪsiz Lacey's/Lacy's
            lˈɛsiːz lessee's/lessees
            lˈæsiz Lassie's/lassie's
            lˈoʊɪsɪz Lois's
            lˈiːsəz Lisa's
            lˈɔsᵻz losses
            lˈiːsᵻz leases
            lˈaʊsᵻz louses
            lˈɑːsəz Lhasa's
            lˈɛsɪz less's
            lˈeɪsᵻz lace's/laces
            lˈuːɪsɪz Luis's
            lˈuːsɪz Luce's
            lˈaʊsɪz louse's
            lˈæsoʊz lasso's/lassoes/lassos
            lˈuːsiz Lucy's
    mˈVsVz
            mˈeɪsəz mesa's/mesas
            mˈɪsiz Missy's
            mˈɛsᵻz messes
            mˈʌsɪz muss's
            mˈæsᵻz Masses/masses
            mˈæsaɪz Masai's
            mˈaʊsᵻz mouses/mousses
            mˈɔsɪz Moss's/moss's
            mˈɪsɪz Mrs/miss's/misses
            mˈuːsɪz moose's/mousse's
            mˈʌsᵻz musses
            mˈoʊsɪz moseys
            mˈeɪsiz Macy's/Maisie's
            mˈɔsᵻz mosses
            mˈɛsɪz mess's
            mˈæsɪz Mass's/mass's
            mˈaʊsɪz mouse's
            mˈeɪsᵻz mace's/maces
    pˈVsVz
            pˈaɪsiːz Pisces
            pˈɪsᵻz pisses
            pˈæsɪz pass's
            pˈʊsiz pussy's
            pˈiːsoʊz peso's/pesos
            pˈɪəsɪz Pius's
            pˈiːsᵻz peaces/pieces
            pˈɑːsᵻz posses
            pˈiːsɪz Peace's/peace's/piece's
            pˈʊsᵻz pusses
            pˈeɪsᵻz Pace's/pace's/paces
            pˈʊsɪz puss's/pussies
            pˈʌsɪz pus's
            pˈoʊsiz poesy's
            pˈæsᵻz passes
            pˈɪsɪz piss's
    …
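
The default grouping key can be tried in isolation.  The following is a
minimal sketch that copies the two regular expressions defined later in
this file; the sample transcriptions are taken from the listing above.

```python
import re

# Same patterns as the module-level definitions further down in this file.
vowels = re.compile('(?:a|e|i|o|u|ɪ|ɛ|ʊ|ɑ|ː|ə|æ|ɔ|ʌ|ɐ|ᵻ)+')
rhovowels = re.compile('ɜː|ɚ')

def collapse_vowels(ipa):
    # Rhotic vowels become V plus ɹ first; then every run of vowel
    # symbols (including the length mark) becomes a single V.
    return vowels.sub('V', rhovowels.sub('Vɹ', ipa))

# "looses", "lassos", and "Lois's" from the first set above.
keys = [collapse_vowels(ipa) for ipa in ['lˈuːsᵻz', 'lˈæsoʊz', 'lˈoʊɪsɪz']]
print(keys)
```

All three transcriptions map to the same key, which is why they are
listed under the one heading.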

For example, “maces” and “messes” are in the same set here, forming a
minimal [eɪ]/[ɛ] pair: definitely contrastive, but quite similar in
sound to the Spanish-speaking ear.  Not every pairing is real, though.
I think it’s erroneous to claim that “mesas” [mˈeɪsəz] and “maces”
[mˈeɪsᵻz] form a minimal pair, on the grounds that I’m a native English
speaker and I can’t hear the difference; and “mousses” is simply
mistranscribed, since it is not pronounced [mˈaʊsᵻz].

But you can specify a particular set of phonemes to work on, such as
this set of low to low-mid vowels that all sound pretty much like [a]
to speakers of many languages (including Spanish):

    $ ./minpairsets.py a/æ/ɐ/ʌ/ɔ/ɑ
    kˈaf
            kˈɔf cough
            kˈʌf cuff
            kˈæf Caph/calf
    lˈaɡ
            lˈæɡ lag
            lˈʌɡ lug
            lˈɔɡ log
    mˈas
            mˈɔs Moss/moss
            mˈæs Mass/mas/mass
            mˈʌs muss
    sˈaŋ
            sˈæŋ Sang/sang
            sˈʌŋ Sung/sung
            sˈɔŋ song
    …
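
With an explicit phoneme set, the key is formed by rewriting every
listed phoneme to the first one in the list, which is why the headings
above read kˈaf, lˈaɡ, and so on even though none of the listed words
contains [a].  A minimal sketch of that idea, using a hypothetical
helper name `make_key` that is not part of this program:

```python
import re

def make_key(phonemes):
    # Hypothetical helper: build a function that rewrites any of the
    # given phoneme strings to the first one in the list.
    regex = re.compile('|'.join(re.escape(p) for p in phonemes))
    return lambda ipa: regex.sub(phonemes[0], ipa)

key = make_key('a/æ/ɐ/ʌ/ɔ/ɑ'.split('/'))
cough, cuff, calf = key('kˈɔf'), key('kˈʌf'), key('kˈæf')
print(cough, cuff, calf)
```

Words whose transcriptions produce the same key end up in the same
minimal-pair set.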

Here’s a set made up of the palatal approximant and the postalveolar
fricatives and affricate, which are allophones of a single phoneme in
Rioplatense Spanish and thus pose problems for learners of English:

    $ ./minpairsets.py j/ʒ/dʒ/ʃ
    jˈæk
            jˈæk Yacc/yack/yak
            dʒˈæk Jack/jack
            ʒˈæk Jacques
            ʃˈæk shack
    jˈiː
            jˈiː ye
            ʃˈiː Shea/she
            dʒˈiː G/GE/Ge/g/gee
    jˈoʊ
            dʒˈoʊ Jo/Joe
            jˈoʊ yo
            ʃˈoʊ show
    jˈuː
            dʒˈuː Jew
            jˈuː U/ewe/u/yew/you
            ʃˈuː shoe/shoo
    jˈæm
            dʒˈæm jam/jamb
            jˈæm yam
            ʃˈæm sham
    …

Here’s a voiced/unvoiced pair that poses problems for native speakers
of Micronesian and Melanesian languages, where that feature is
generally not contrastive:

    $ ./minpairsets.py p/b
    pˈiːp
            pˈiːp peep
            bˈiːp beep
            bˈiːb Beebe
    pˈɑːp
            bˈɑːb Bob/bob
            bˈɑːp bop
            pˈɑːp pop
    pˈɑːpɪŋ
            bˈɑːpɪŋ bopping
            pˈɑːpɪŋ popping
            bˈɑːbɪŋ bobbing
    dˈɛp
            dˈɛp Depp
            dˈɛb deb
    …

Here’s the l/ɹ pair that famously causes difficulty for native speakers
of Sinitic languages; this list is probably missing a lot of entries
because of eSpeak’s concessions to non-rhotic dialects:

    $ ./minpairsets.py l/ɹ
    lˈɪl
            ɹˈɪɹ rear
            lˈɪl Lille
            ɹˈɪl rill
            lˈɪɹ Lear/leer
    lˈɪlz
            ɹˈɪɹz rear's/rears
            lˈɪɹz Lear's/leer's/leers
            ɹˈɪlz rill's/rills
    plˈɛzəntli
            pɹˈɛzəntli presently
            plˈɛzəntli pleasantly
            plˈɛzəntɹi pleasantry
    …

Here’s the [ɪn]/[ɛn] distinction that doesn’t exist in English
dialects with the pin/pen merger:

    $ ./minpairsets.py ɪn/ɛn
    ˈɪn
            ˈɪn In/in/inn
            ˈɛn N/n
    bˈɪn
            bˈɪn been/bin
            bˈɛn Ben
    dˈɪn
            dˈɛn den
            dˈɪn din
    fˈɪn
            fˈɪn Finn/fin
            fˈɛn fen
    kˈɪn
            kˈɛn Ken/ken
            kˈɪn kin
    …

Unfortunately we can’t do the same trick with the which/witch merger
because eSpeak doesn’t make the [ʍ]/[w] distinction necessary to find
the minimal pairs!  The only [hw] sequences it produces are in words
borrowed from Spanish:

    $ ./minpairsets.py hw/w
    hwˈɑːnə
            hwˈɑːnə Juana
            wˈɑːnə wanna

To find minimal pairs for [u]/[ʊ], we need to include the length mark,
because the dictionary writes the long vowel as [uː]; only some of
these are real:

    $ ./minpairsets.py ʊ/uː
    bˈʊl
            bˈuːl Boole
            bˈʊl bull
    fˈʊl
            fˈuːl fool
            fˈʊl full
    hˈʊd
            hˈuːd who'd
            hˈʊd Hood/hood
    kˈʊd
            kˈuːd cooed
            kˈʊd could
    kˈʊk
            kˈuːk kook
            kˈʊk Cook/Cooke/cook
    lˈʊk
            lˈuːk Luke
            lˈʊk look
    nˈʊk
            nˈuːk nuke
            nˈʊk nook
    pˈʊl
            pˈuːl Poole/pool
            pˈʊl pull
    …

It finds a few for the rarely contrastive pair [ð]/[θ], but a number of
them, including the first one, are wrong:

    $ ./minpairsets.py ð/θ
    bˈɛð
            bˈɛð Bethe
            bˈɛθ Beth
    ðˈaɪ
            ðˈaɪ thy
            θˈaɪ thigh
    ðˈiː
            ðˈiː thee
            θˈiː Thea
    lˈoʊð
            lˈoʊð loathe
            lˈoʊθ loath
    mˈaʊð
            mˈaʊð Mouthe
            mˈaʊθ mouth
    tˈiːð
            tˈiːð teethe
            tˈiːθ teeth

Because you have to enter IPA on the command line, be careful with
[ɡ] (U+0261); that’s not the ASCII “g” (U+0067), even though they look
the same in some fonts.  The [ɡw]/[w] contrast below is another one
that’s difficult for Spanish-speakers to hear.  The first command here,
typed with the ASCII letter, gets no output:

    $ ./minpairsets.py gw/w
    $ ./minpairsets.py ɡw/w
    ɡwˈɛn
            ɡwˈɛn Gwen
            wˈɛn wen/when
    ɡwˈɪn
            wˈɪn Wynn/win
            ɡwˈɪn Gwyn
    ɡwˈɛnz
            wˈɛnz wen's/wens/when's/whens
            ɡwˈɛnz Gwen's
    ɡwˈɪnz
            ɡwˈɪnz Gwyn's
            wˈɪnz Wynn's/win's/wins
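
One way to catch the mistake before running a search is to look for the
ASCII letter in the argument.  This is only a sketch; the helper name
`suspicious_phonemes` is hypothetical and the program itself performs
no such check.

```python
ASCII_G = 'g'          # U+0067, what most keyboards produce
IPA_G = chr(0x0261)    # U+0261, the symbol used in the dictionary

def suspicious_phonemes(phoneme_set):
    # Hypothetical helper: list the phonemes in a /-separated set
    # that contain the ASCII letter rather than the IPA symbol.
    return [p for p in phoneme_set.split('/') if ASCII_G in p]

typed_ascii = suspicious_phonemes('gw/w')
typed_ipa = suspicious_phonemes(IPA_G + 'w/w')
print(typed_ascii, typed_ipa)
```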

You can specify more than one such set of phonemes in order to come up
with minimal-pair sets that are more treacherous for a given audience.
Here we are using four different sets of consonants, each of which
contains allophones of a single Spanish consonant, in order to find sets
of contrasting words that are maximally difficult for Spanish-speakers
to distinguish.  Note that the second set has the erroneous
pronunciation [zˈɪŋkz] for “zinc’s” and an analogous error for “sync’s”,
making that set appear larger than it really is.  The first set has only
the noun pronunciation [jˈuːs] for “use”, omitting the verb [jˈuːz].

    $ ./minpairsets.py d/ð s/z/θ b/v j/ʒ/dʒ/ʃ
    jˈuːs
            dʒˈuːs juice
            dʒˈuːz Jew's/Jews
            jˈuːs use
            jˈuːz Eu's/U's/ewe's/ewes/yew's/yews/you's/yous
            jˈuːθ youth
            ʃˈuːz shoe's/shoes/shoos
    sˈɪŋks
            sˈɪŋks sink's/sinks/syncs
            sˈɪŋkz sync's
            zˈɪŋks zincs
            zˈɪŋkz zinc's
            θˈɪŋks thinks
    jˈæk
            dʒˈæk Jack/jack
            jˈæk Yacc/yack/yak
            ʃˈæk shack
            ʒˈæk Jacques
    dˈɛns
            dˈɛns dense
            dˈɛnz den's/dens
            ðˈɛns thence
            ðˈɛnz then's
    …
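
When several sets are given, each set’s rewrite is applied in turn, in
command-line order, and the final string is the grouping key.  A
minimal sketch under the same assumptions as before; `make_key` and
`combined_key` are hypothetical names used only for illustration:

```python
import re
from functools import reduce

def make_key(phonemes):
    # Rewrite any of the given phoneme strings to the first one.
    regex = re.compile('|'.join(re.escape(p) for p in phonemes))
    return lambda ipa: regex.sub(phonemes[0], ipa)

def combined_key(phoneme_sets):
    # Apply each set's rewrite in the order the sets were given.
    keys = [make_key(s.split('/')) for s in phoneme_sets]
    return lambda ipa: reduce(lambda text, k: k(text), keys, ipa)

key = combined_key(['d/ð', 's/z/θ', 'b/v', 'j/ʒ/dʒ/ʃ'])
jews, shoes, youth = key('dʒˈuːz'), key('ʃˈuːz'), key('jˈuːθ')
thens = key('ðˈɛnz')
print(jews, shoes, youth, thens)
```

The first three transcriptions collapse to the heading of the first set
above, and the last one to the heading of the fourth.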

"""


from __future__ import print_function
import os
import re
import argparse


vowels = re.compile(u'(?:a|e|i|o|u|ɪ|ɛ|ʊ|ɑ|ː|ə|æ|ɔ|ʌ|ɐ|ᵻ)+')
rhovowels = re.compile(u'ɜː|ɚ')
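def collapse_vowels(ipa):
    """Replace each run of vowel symbols in `ipa` with a single V."""
    # Rhotic vowels are rewritten to V plus ɹ first, so that they group
    # with explicit vowel + ɹ sequences.
    return vowels.sub('V', rhovowels.sub('Vɹ', ipa))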


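def main(dictionary, key=collapse_vowels):
    """Group the words in `dictionary` by `key` and print the largest groups.

    `dictionary` is an iterable of lines of the form word<TAB>ipa, and
    `key` maps an IPA string to the string that words in the same
    minimal-pair set have in common.
    """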
    by_key = {}
    for line in dictionary:
        word, ipa = line.split('\t')

        word = word.strip()
        ipa = ipa.strip()
        k = key(ipa)

        #print(word, ipa, k)

        if k not in by_key:
            by_key[k] = {}
        if ipa not in by_key[k]:
            by_key[k][ipa] = set()
        by_key[k][ipa].add(word)

    # Largest sets first; ties are broken by shorter key, then key order.
    top_keys = sorted(by_key, key=lambda k: (-len(by_key[k]), len(k), k))

    for i, k in enumerate(top_keys):
        if len(by_key[k]) == 1:
            break        # A set containing one item contains no pairs

        print(k)
        for ipa in sorted(by_key[k]):
            print("\t%s %s" % (ipa, '/'.join(sorted(by_key[k][ipa]))))

        # Cap the output at roughly the hundred largest sets.
        if i > 100:
            break


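def collapse(things):
    """Return a key function that rewrites each of `things` to things[0]."""
    # Try longer alternatives first so that, for example, dʒ is not
    # shadowed by a plain d listed earlier in the same set.
    alternatives = sorted(things, key=len, reverse=True)
    regex = re.compile('|'.join(re.escape(thing) for thing in alternatives))
    def key(ipa):
        return regex.sub(things[0], ipa)
    return key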

def compose(f1, f2):
    return lambda *args: f1(f2(*args))

def parse_args():
    parser = argparse.ArgumentParser(description="Generate phonetic minimal-pair sets of words.")
    parser.add_argument('phoneme_sets', nargs='*',
                        help='/-separated list of phoneme strings to look for contrasts in, such as ð/d/θ')
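    # os.path.join avoids producing '/pronunciation-dictionary' (a path in
    # the filesystem root) when dirname(__file__) is the empty string.
    parser.add_argument('--dictionary',
                        default=os.path.join(os.path.dirname(__file__),
                                             'pronunciation-dictionary'),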
                        help='TSV-format pronunciation dictionary file, such as'
                        + ' <http://canonical.org/~kragen/sw/dev3/pronunciation-dictionary>'
                        + ' (default %(default)s)')
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_args()

    if args.phoneme_sets:
        key = collapse(args.phoneme_sets[0].split('/'))
        for pset in args.phoneme_sets[1:]:
            key = compose(collapse(pset.split('/')), key)
    else:
        key = collapse_vowels

    # The dictionary contains IPA, so read it as UTF-8 regardless of locale.
    with open(args.dictionary, encoding='utf-8') as f:
        main(f, key)