Predictionary: A Text Prediction App

Leila Hadj-Chikh
17 May 2016

Predictionary

Text Prediction

Text prediction algorithms identify words that are likely to follow a set of preceding words. Many of these algorithms are based on relative frequencies of n-grams — sequences of n number of words that are found together in a text.

Strategies and Tradeoffs

Some algorithms only consider the highest-order n-grams in a language model that match the input text. Other algorithms weight the probabilities for n-grams of different sizes. Such “smoothing” leads to more accurate predictions, but it is also more computationally intensive. This can be a drawback for applications that require timely results, such as autocompletion.

The Problem

For autocompletion to be useful, it must allow a user to select a word in less time than it would take to type it. Thus, the challenge is to provide good options as quickly as possible. To provide rapid responses to user input, I developed an algorithm based on Stupid Backoff, a simple technique that has proven nearly as effective as Kneser-Ney Smoothing when used with large datasets (Brants et al., 2007).

Language Model

Modeling Approach

I built my language model using the ~1 million most frequent 2-, 3-, and 4-grams from the Corpus of Contemporary American English (COCA). For unigrams, I used the 1/3 million most frequent words derived from the Google Web Trillion Word Corpus.

Quantifying Performance

Stupid Backoff does not use a probability distribution, so its accuracy cannot be measured using perplexity. However, I estimated the quality of my language model by comparing it to n-grams generated from a test corpus of 135 sentences from Twitter, blogs, and new articles. The model contained 46% of the unigrams and 62% of the bigrams found in the test set (Table 1).

Table 1. Summary of n-gram coverage.

        Total n-grams Matching n-grams Percent
1-grams          1589              734    0.46
2-grams          1588              986    0.62
3-grams          1587              438    0.28
4-grams          1586              120    0.08

Prediction Algorithm

The algorithm takes the last n words of input (up to n=3) and searches for n+1-grams that begin with those words. If at least one match is found, it returns a list of the last words of all matches in descending frequency. Otherwise, the algorithm backs off to a lower-order n-gram and recurses. If no matches are found for any n-gram, the algorithm simply returns the most frequent unigrams in the model.

Pseudocode of algorithm logic:

# get.predictions (ngram)
    # get the size of the n-gram
    # if it's at least a unigram
        # search for matches among n+1-grams
        # if at least one match is found
            # return the last words of the most frequent matches
        # else if no matches are found
            # get the lower-order ngram
            # recurse using the lower-order ngram
    # else if there are no more n-grams to try
        # return the most frequent words in the model

Predictionary

Predictionary is an app that allows users to enter text and get suggestions for the next word. Suggestions are based on top matches from the prediction algorithm.

App Features:

User can adjust the maximum number of suggestions shown
Stopwords can be filtered from suggestions if at least one non-stopword is returned
User can click a suggested word to add it to the text input

Results are returned relatively quickly, so users can have fun building new sentences!

Screenshot of Predictionary App.