#!/usr/bin/python3
"""Find minimal-pair sets for practicing English phonetics.

Minimal pairs are words that differ only by a single phoneme.  I’m
defining a “minimal-pair set” as a set of words in which every pair is
a minimal pair; this may or may not be an existing term.

This program uses a file `pronunciation-dictionary`, which can be
created as follows with eSpeak, and is also found at :

    $ espeak -v en-us+f2 --ipa -q < /usr/share/dict/words > words-ipa-2
    $ paste /usr/share/dict/words words-ipa-2 > pronunciation-dictionary

This is not perfect, because eSpeak’s pronunciation isn’t perfect, but
it’s good enough to be useful as a starting point for recording sets
of minimal pairs for ear training.

By default it generates sets of minimal pairs that differ by any
sequence of vowels (including diphthongs), since English’s insane
proliferation of vowels is the trickiest thing for ESL learners from
many backgrounds, including but not limited to native speakers of
Spanish:

    $ ./minpairsets.py
    lˈVsVz
            lˈuːsᵻz looses
            lˈæsᵻz lasses
            lˈuːɪsəz Louisa's
            lˈiːsɪz lease's
            lˈæsɪz lass's/lassies
            lˈæsəz Lassa's
            lˈɛsəz Lesa's
            lˈɛsiz Lessie's
            lˈɔsɪz loss's
            lˈeɪsiz Lacey's/Lacy's
            lˈɛsiːz lessee's/lessees
            lˈæsiz Lassie's/lassie's
            lˈoʊɪsɪz Lois's
            lˈiːsəz Lisa's
            lˈɔsᵻz losses
            lˈiːsᵻz leases
            lˈaʊsᵻz louses
            lˈɑːsəz Lhasa's
            lˈɛsɪz less's
            lˈeɪsᵻz lace's/laces
            lˈuːɪsɪz Luis's
            lˈuːsɪz Luce's
            lˈaʊsɪz louse's
            lˈæsoʊz lasso's/lassoes/lassos
            lˈuːsiz Lucy's
    mˈVsVz
            mˈeɪsəz mesa's/mesas
            mˈɪsiz Missy's
            mˈɛsᵻz messes
            mˈʌsɪz muss's
            mˈæsᵻz Masses/masses
            mˈæsaɪz Masai's
            mˈaʊsᵻz mouses/mousses
            mˈɔsɪz Moss's/moss's
            mˈɪsɪz Mrs/miss's/misses
            mˈuːsɪz moose's/mousse's
            mˈʌsᵻz musses
            mˈoʊsɪz moseys
            mˈeɪsiz Macy's/Maisie's
            mˈɔsᵻz mosses
            mˈɛsɪz mess's
            mˈæsɪz Mass's/mass's
            mˈaʊsɪz mouse's
            mˈeɪsᵻz mace's/maces
    pˈVsVz
            pˈaɪsiːz Pisces
            pˈɪsᵻz pisses
            pˈæsɪz pass's
            pˈʊsiz pussy's
            pˈiːsoʊz peso's/pesos
            pˈɪəsɪz Pius's
            pˈiːsᵻz peaces/pieces
            pˈɑːsᵻz posses
            pˈiːsɪz Peace's/peace's/piece's
            pˈʊsᵻz pusses
            pˈeɪsᵻz Pace's/pace's/paces
            pˈʊsɪz puss's/pussies
            pˈʌsɪz pus's
            pˈoʊsiz poesy's
            pˈæsᵻz passes
            pˈɪsɪz piss's
    …

For example, “maces” and “messes” are in the same set here, forming a
minimal [ɛ]/[eɪ] pair, which is definitely contrastive but quite
similar in sound to the Spanish-speaking ear.  However, I think it’s
erroneous to claim that “mesas” [meɪsəz] and “maces” [meɪsᵻz] form a
minimal pair, on the grounds that I’m a native English speaker and I
can’t hear the difference, and of course “mousses” is not pronounced
[mˈaʊsᵻz].

But you can specify a particular set of phonemes to work on, such as
this set of low to low-mid vowels that all sound pretty much like [a]
to speakers of many languages (including Spanish):

    $ ./minpairsets.py a/æ/ɐ/ʌ/ɔ/ɑ
    kˈaf
            kˈɔf cough
            kˈʌf cuff
            kˈæf Caph/calf
    lˈaɡ
            lˈæɡ lag
            lˈʌɡ lug
            lˈɔɡ log
    mˈas
            mˈɔs Moss/moss
            mˈæs Mass/mas/mass
            mˈʌs muss
    sˈaŋ
            sˈæŋ Sang/sang
            sˈʌŋ Sung/sung
            sˈɔŋ song
    …

Here’s a set of palatal and postalveolar approximants, fricatives, and
affricates that are allophones of a single phoneme in Rioplatense
Spanish and thus pose problems for learners of English:

    $ ./minpairsets.py j/ʒ/dʒ/ʃ
    jˈæk
            jˈæk Yacc/yack/yak
            dʒˈæk Jack/jack
            ʒˈæk Jacques
            ʃˈæk shack
    jˈiː
            jˈiː ye
            ʃˈiː Shea/she
            dʒˈiː G/GE/Ge/g/gee
    jˈoʊ
            dʒˈoʊ Jo/Joe
            jˈoʊ yo
            ʃˈoʊ show
    jˈuː
            dʒˈuː Jew
            jˈuː U/ewe/u/yew/you
            ʃˈuː shoe/shoo
    jˈæm
            dʒˈæm jam/jamb
            jˈæm yam
            ʃˈæm sham
    …

Here’s a voiced/unvoiced pair that poses problems for native speakers
of Micronesian and Melanesian languages, where that feature is
generally not contrastive:

    $ ./minpairsets.py p/b
    pˈiːp
            pˈiːp peep
            bˈiːp beep
            bˈiːb Beebe
    pˈɑːp
            bˈɑːb Bob/bob
            bˈɑːp bop
            pˈɑːp pop
    pˈɑːpɪŋ
            bˈɑːpɪŋ bopping
            pˈɑːpɪŋ popping
            bˈɑːbɪŋ bobbing
    dˈɛp
            dˈɛp Depp
            dˈɛb deb
    …

Here’s the l/ɹ pair that famously causes difficulty for native
speakers of Sinitic languages; this list is probably missing a lot of
entries because of eSpeak’s concessions to non-rhotic dialects:

    $ ./minpairsets.py l/ɹ
    lˈɪl
            ɹˈɪɹ rear
            lˈɪl Lille
            ɹˈɪl rill
            lˈɪɹ Lear/leer
    lˈɪlz
            ɹˈɪɹz rear's/rears
            lˈɪɹz Lear's/leer's/leers
            ɹˈɪlz rill's/rills
    plˈɛzəntli
            pɹˈɛzəntli presently
            plˈɛzəntli pleasantly
            plˈɛzəntɹi pleasantry
    …

Here’s the [ɪn]/[ɛn] distinction that doesn’t exist in English
dialects with the pin/pen merger:

    $ ./minpairsets.py ɪn/ɛn
    ˈɪn
            ˈɪn In/in/inn
            ˈɛn N/n
    bˈɪn
            bˈɪn been/bin
            bˈɛn Ben
    dˈɪn
            dˈɛn den
            dˈɪn din
    fˈɪn
            fˈɪn Finn/fin
            fˈɛn fen
    kˈɪn
            kˈɛn Ken/ken
            kˈɪn kin
    …

Unfortunately we can’t do the same trick with the which/witch merger
because eSpeak doesn’t make the [ʍ]/[w] distinction necessary to find
the minimal pairs!  The only [hw] are in Spanish loanwords:

    $ ./minpairsets.py hw/w
    hwˈɑːnə
            hwˈɑːnə Juana
            wˈɑːnə wanna

To find minimal pairs for [u]/[ʊ], we need to specify vowel length;
some of these are real:

    $ ./minpairsets.py ʊ/uː
    bˈʊl
            bˈuːl Boole
            bˈʊl bull
    fˈʊl
            fˈuːl fool
            fˈʊl full
    hˈʊd
            hˈuːd who'd
            hˈʊd Hood/hood
    kˈʊd
            kˈuːd cooed
            kˈʊd could
    kˈʊk
            kˈuːk kook
            kˈʊk Cook/Cooke/cook
    lˈʊk
            lˈuːk Luke
            lˈʊk look
    nˈʊk
            nˈuːk nuke
            nˈʊk nook
    pˈʊl
            pˈuːl Poole/pool
            pˈʊl pull
    …

It finds a few for the usually-not-contrastive pair [ð]/[θ], but a
number of them, including the first one, are wrong:

    $ ./minpairsets.py ð/θ
    bˈɛð
            bˈɛð Bethe
            bˈɛθ Beth
    ðˈaɪ
            ðˈaɪ thy
            θˈaɪ thigh
    ðˈiː
            ðˈiː thee
            θˈiː Thea
    lˈoʊð
            lˈoʊð loathe
            lˈoʊθ loath
    mˈaʊð
            mˈaʊð Mouthe
            mˈaʊθ mouth
    tˈiːð
            tˈiːð teethe
            tˈiːθ teeth

Because you have to enter IPA on the command line, be careful with
[ɡ]; that’s not the ASCII “g”, even though they look the same in some
fonts.  The first command here gets no output.
This is another contrast that's difficult for Spanish-speakers to hear:

    $ ./minpairsets.py gw/w
    $ ./minpairsets.py ɡw/w
    ɡwˈɛn
            ɡwˈɛn Gwen
            wˈɛn wen/when
    ɡwˈɪn
            wˈɪn Wynn/win
            ɡwˈɪn Gwyn
    ɡwˈɛnz
            wˈɛnz wen's/wens/when's/whens
            ɡwˈɛnz Gwen's
    ɡwˈɪnz
            ɡwˈɪnz Gwyn's
            wˈɪnz Wynn's/win's/wins

You can specify more than one such set of phonemes in order to come up
with minimal-pair sets that are more treacherous for a given audience;
here we are using four different sets of consonants that are each
allophones of a single Spanish consonant in order to find sets of
contrasting words that are maximally difficult for Spanish-speakers to
distinguish.  Note that the second set has the erroneous pronunciation
[zˈɪŋkz] for “zinc’s” and an analogous error for “sync’s”, making that
set appear larger than it really is; and the first set has only the
[jˈuːs] pronunciation for “use”.

    $ ./minpairsets.py d/ð s/z/θ b/v j/ʒ/dʒ/ʃ
    jˈuːs
            dʒˈuːs juice
            dʒˈuːz Jew's/Jews
            jˈuːs use
            jˈuːz Eu's/U's/ewe's/ewes/yew's/yews/you's/yous
            jˈuːθ youth
            ʃˈuːz shoe's/shoes/shoos
    sˈɪŋks
            sˈɪŋks sink's/sinks/syncs
            sˈɪŋkz sync's
            zˈɪŋks zincs
            zˈɪŋkz zinc's
            θˈɪŋks thinks
    jˈæk
            dʒˈæk Jack/jack
            jˈæk Yacc/yack/yak
            ʃˈæk shack
            ʒˈæk Jacques
    dˈɛns
            dˈɛns dense
            dˈɛnz den's/dens
            ðˈɛns thence
            ðˈɛnz then's
    …

"""
from __future__ import print_function
import sys
import os
import re
import argparse


# Any run of vowel symbols (including the length mark) counts as one vowel.
vowels = re.compile(u'(?:a|e|i|o|u|ɪ|ɛ|ʊ|ɑ|ː|ə|æ|ɔ|ʌ|ɐ|ᵻ)+')
# Rhotic vowels are treated as a vowel followed by ɹ.
rhovowels = re.compile(u'ɜː|ɚ')

def collapse_vowels(ipa):
    return vowels.sub('V', rhovowels.sub('Vɹ', ipa))

def main(dictionary, key=collapse_vowels):
    # by_key maps collapsed key -> {IPA pronunciation -> set of spellings}.
    by_key = {}
    for line in dictionary:
        word, ipa = line.split('\t')
        word = word.strip()
        ipa = ipa.strip()
        k = key(ipa)
        #print(word, ipa, k)
        if k not in by_key:
            by_key[k] = {}
        if ipa not in by_key[k]:
            by_key[k][ipa] = set()
        by_key[k][ipa].add(word)

    # Largest sets first; ties broken by shorter key, then alphabetically.
    top_keys = sorted(by_key, key=lambda k: (-len(by_key[k]), len(k), k))
    for i, k in enumerate(top_keys):
        if len(by_key[k]) == 1:
            break           # A set containing one item contains no pairs

        print(k)
        for ipa in sorted(by_key[k]):
            print("\t%s %s" % (ipa, '/'.join(sorted(by_key[k][ipa]))))

        if i > 100:
            break

def collapse(things):
    # Build a key function that rewrites every listed phoneme string
    # to the first one, so words differing only in them share a key.
    regex = re.compile('|'.join(re.escape(thing) for thing in things))
    def key(ipa):
        return regex.sub(things[0], ipa)
    return key

def compose(f1, f2):
    return lambda *args: f1(f2(*args))

def parse_args():
    parser = argparse.ArgumentParser(description="Generate phonetic minimal-pair sets of words.")
    parser.add_argument('phoneme_sets', nargs='*',
                        help='/-separated list of phoneme strings to look for contrasts in, such as ð/d/θ')
    # os.path.join avoids producing '/pronunciation-dictionary' when
    # the script is run from its own directory and dirname is empty.
    parser.add_argument('--dictionary',
                        default=os.path.join(os.path.dirname(__file__),
                                             'pronunciation-dictionary'),
                        help='TSV-format pronunciation dictionary file, such as'
                        + ' '
                        + ' (default %(default)s)')
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_args()
    if args.phoneme_sets:
        key = collapse(args.phoneme_sets[0].split('/'))
        for pset in args.phoneme_sets[1:]:
            key = compose(collapse(pset.split('/')), key)
    else:
        key = collapse_vowels

    with open(args.dictionary) as f:
        main(f, key)