A Next-Word Prediction Language Model
Get the application here:
https://gianbalsamo.shinyapps.io/calcoli/
date: June 17, 2016
What does the Calcoli App do?
When queried about an incomplete sentence, the Calcoli App predicts the sentence’s likely next word; in other words, it predicts the word that someone writing that sentence would probably type next.
When you are about to start typing in the left window of the Calcoli App, the right window declares itself “idling.”
If you type type less that the minimal requirement of two words, the right window gives an error.
As you keep typing, the right window keeps predicting the next single word after a suitable delay. (I have opted not to insert a “submit” button.)
The lexicon for this exercise consists of a collection of phrases from Twitter and from news articles. Ideally, one should simply scan the lexicon, find the queried-about incomplete sentence, and predict as the next typed word the one effectively coming after the last word of this sentence; and if the queried-about sentence appears more than once in the lexicon, one should predict as the next word the one coming most frequently after the last word of the sentence. Here we are of course in the realm of conditional probabilities.
Two obstacles to this approach are the size of any significant lexicon of natural language (even the small one adopted for this exercise) and the length of the queried-about sentences. Several language models solve these two problems by reducing the lexicon to a corpus, reducing the queried-about sentence to an n-gram, and searching for that n-gram within the corpus via the backoff algorithm. The corpus for the Calcoli App consists of a random sample of 20% of the lexicon’s elements.
N-gram is the sequence of n words at the end of the queried-about sentence. Instead of scanning the corpus for a long sentence, one scans it for, say, the 3-gram at the end of this sentence, and predicts as the next typed word the one most likely to follow this 3-gram; notice that this implies that, in the various data frames computed from one’s corpus, this 3-gram must figure as the first 3 words of a 4-gram. Backoff algorithm: if no 3-gram gives a plausible result, one regresses to the 2-gram at the end of the sentence; and so on.
The longer is the n-gram under examination, and the larger is the difference in size between the original lexicon and the corpus, the more likely it is that this n-gram's frequency in the corpus be close or even equal to zero. Language models fix this problem with smoothing algorithms of various sorts. They algorithms are designed to attribute some probability to words that, based on the corpus, have zero probability of appearing in the tail of a given n-gram. If the Calcoli App has a claim to originality, it comes from its smoothing procedure.
Some of the most effective smoothing algorithms are identified in the literature as “modified Kneser-Ney smoothing.” I have adopted a somewhat original version of these algorithms by the insertion of wild cards. Within an n-gram, a wild card, conventionally identified as * , is inserted in substitution of a word; for instance, the insertion of one wild card into the n-gram “w,w,w,w” may turn it into the four following configurations: “*,w,w,w” or “w,*,w,w” or “w,w,*,w” or “w,w,w,*”. One can of course insert up to three wild cards into a 4-gram.
1.Compute the frequencies of the 4-grams, 3-grams, and 2-grams found in the corpus;
2. Compute an additional list of 4-gram* and 4-gram** (and their frequencies) by adding respectively one and two wild cards to the original 4-grams (thus: “w,*,w,w” and thus: “w,*,*,w”);
3. If the queried-about sentence’s closing 3-gram coincides with the first three words in one or more from the list of 4-grams, predict as the next typed word the last word from the most likely of these 4-grams;
4. Backoff principle: if no 4-gram does the trick, apply the same approach to the list of 3-grams; if no 3-gram does the trick, apply the same approach to the list of 2-grams; if no 2-gram does the trick, apply the same approach to the list of 4-gram*; if no 4-gram* does the trick, apply the same approach to the list of 4-gram**.
That’s it.
Try it out: https://gianbalsamo.shinyapps.io/calcoli/
I hope you enjoy it.