Leila Hadj-Chikh
17 May 2016
Text prediction algorithms identify words that are likely to follow a set of preceding words. Many of these algorithms are based on relative frequencies of n-grams — sequences of n number of words that are found together in a text.
Some algorithms only consider the highest-order n-grams in a language model that match the input text. Other algorithms weight the probabilities for n-grams of different sizes. Such “smoothing” leads to more accurate predictions, but it is also more computationally intensive. This can be a drawback for applications that require timely results, such as autocompletion.
For autocompletion to be useful, it must allow a user to select a word in less time than it would take to type it. Thus, the challenge is to provide good options as quickly as possible. To provide rapid responses to user input, I developed an algorithm based on Stupid Backoff, a simple technique that has proven nearly as effective as Kneser-Ney Smoothing when used with large datasets (Brants et al., 2007).
I built my language model using the ~1 million most frequent 2-, 3-, and 4-grams from the Corpus of Contemporary American English (COCA). For unigrams, I used the 1/3 million most frequent words derived from the Google Web Trillion Word Corpus.
Stupid Backoff does not use a probability distribution, so its accuracy cannot be measured using perplexity. However, I estimated the quality of my language model by comparing it to n-grams generated from a test corpus of 135 sentences from Twitter, blogs, and new articles. The model contained 46% of the unigrams and 62% of the bigrams found in the test set (Table 1).
Total n-grams Matching n-grams Percent
1-grams 1589 734 0.46
2-grams 1588 986 0.62
3-grams 1587 438 0.28
4-grams 1586 120 0.08
The algorithm takes the last n words of input (up to n=3) and searches for n+1-grams that begin with those words. If at least one match is found, it returns a list of the last words of all matches in descending frequency. Otherwise, the algorithm backs off to a lower-order n-gram and recurses. If no matches are found for any n-gram, the algorithm simply returns the most frequent unigrams in the model.
Pseudocode of algorithm logic:
# get.predictions (ngram)
# get the size of the n-gram
# if it's at least a unigram
# search for matches among n+1-grams
# if at least one match is found
# return the last words of the most frequent matches
# else if no matches are found
# get the lower-order ngram
# recurse using the lower-order ngram
# else if there are no more n-grams to try
# return the most frequent words in the model
Predictionary is an app that allows users to enter text and get suggestions for the next word. Suggestions are based on top matches from the prediction algorithm.
App Features:
Results are returned relatively quickly, so users can have fun building new sentences!