argmax_{c ∈ candidates} P(c|w)

By Bayes' Theorem this is equivalent to:

argmax_{c ∈ candidates} P(c) P(w|c) / P(w)

Since P(w) is the same for every possible candidate c, we can factor it out, giving:

argmax_{c ∈ candidates} P(c) P(w|c)
The four parts of this expression are:
- Selection Mechanism: argmax
  We choose the candidate with the highest combined probability.
- Candidate Model: c ∈ candidates
  This tells us which candidate corrections, c, to consider.
- Language Model: P(c)
  The probability that c appears as a word of English text. For example, occurrences of "the" make up about 7% of English text, so we should have P(the) = 0.07.
- Error Model: P(w|c)
  The probability that w would be typed in a text when the author meant c. For example, P(teh|the) is relatively high, but P(theeexyz|the) would be very low.
How It Works in Python

The four parts of the program are:
1. Selection Mechanism: In Python, max with a key argument does 'argmax'.
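As a sketch of where the program ends up, the top-level entry point can be a single max call, assuming the candidates and P functions developed in the steps below:

```python
def correction(word):
    "Most probable spelling correction for `word`."
    # candidates() and P() are defined in steps 2-4 below;
    # max picks the candidate c that maximizes P(c): an argmax.
    return max(candidates(word), key=P)
```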
2. Candidate Model: First a new concept: a simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter). The function edits1 returns a set of all the edited strings (whether words or not) that can be made with one simple edit:
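One possible implementation is the sketch below: split the word at every position into a prefix and suffix, then apply each kind of edit at the split point:

```python
def edits1(word):
    "All strings that are one simple edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    # Every way to split `word` into a prefix L and a suffix R.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```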
This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates). For example, the 8-letter misspelling "somthing" yields 54·8+25 = 457 edits, 442 of them distinct.
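A quick check with the edits1 sketch above:

```python
>>> len(edits1('somthing'))
442
```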
However, if we restrict ourselves to words that are known (that is, in the dictionary), then the set is much smaller:
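A small helper can do the filtering; a sketch, assuming the WORDS counter built in step 3 below:

```python
def known(words):
    "The subset of `words` that appear in the WORDS dictionary (built in step 3)."
    return set(w for w in words if w in WORDS)

# With big.txt as the corpus, only two of the 442 one-edit strings are words:
# >>> known(edits1('somthing'))
# {'something', 'soothing'}
```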
We'll also consider corrections that require two simple edits. This generates a much bigger set of possibilities, but usually only a few of them are known words.
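One way to get there is to apply edits1 to every result of edits1; a sketch that returns a generator, so the big set of possibilities is never held in memory all at once:

```python
def edits2(word):
    "All strings that are two simple edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
```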
We say that the results of edits2(w) have an edit distance of 2 from w.
3. Language Model: We can estimate the probability of a word, P(word), by counting the number of times each word appears in a text file of about a million words, big.txt. It is a concatenation of public domain book excerpts from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. The function words breaks text into words, then the variable WORDS holds a Counter of how often each word appears, and P estimates the probability of each word, based on this Counter.
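In code, that might look like the following sketch; the \w+ tokenizer and computing the total count N once as a default argument are implementation choices, not requirements:

```python
import re
from collections import Counter

def words(text):
    "List all lowercase word tokens in `text`."
    return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Estimated probability of `word`: its relative frequency in big.txt."
    return WORDS[word] / N
```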
4. Error Model: When I started to write this program, sitting on a plane in 2007, I had no data on spelling errors, and no internet connection (I know that may be hard to imagine today). Without data I couldn't build a good spelling error model, so I took a shortcut: I defined a trivial, flawed error model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. So we can make candidates(word) produce the first non-empty list of candidates in order of priority:
- The original word, if it is known; otherwise
- The list of known words at edit distance one away, if there are any; otherwise
- The list of known words at edit distance two away, if there are any; otherwise
- The original word, even though it is not known.
Then we don't need to multiply by a P(w|c) factor, because every candidate at the chosen priority will have the same probability (according to our flawed model). That gives us:
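In code, Python's `or` does the work: it returns its first non-empty argument, which implements exactly this priority order. A sketch, reusing known, edits1, and edits2 from step 2:

```python
def candidates(word):
    "Generate possible spelling corrections for `word`."
    # `or` short-circuits to the first non-empty set,
    # implementing the priority order listed above.
    return (known([word]) or
            known(edits1(word)) or
            known(edits2(word)) or
            [word])
```

With all four parts in place, correction('speling') should return 'spelling' when trained on big.txt.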