Our goal for this Data Science Capstone was to create a Shiny web application that, taking a meaningful sequence of words as input, could predict the word most likely to follow or complete such a sequence. This functionality could be used to speed up typing by suggesting the next word to the user and letting them select it rather than type it themselves. Another potential application is search query autocomplete.
To provide this functionality, we built a 5-gram probabilistic language model and used Stupid Backoff to rank next-word candidates. Our language model built its understanding of the English language from three different corpora: a list of blog posts, a list of news articles, and a list of tweets (none of which is the user’s own output).
Our application, WordPredictR, achieves 14.1% top-1 and 22.7% top-3 precision on an independent benchmark and generates predictions in less than 20ms. Its language model uses 62MB of disk space. The web app can be found at https://philferriere.shinyapps.io/WordPredictR.
As explained in [1], N-gram probabilistic language models (N-gram models, for short) allow for the assignment of probabilities to sequences of words. An N-gram is simply a sequence of words of length N. Using N-gram models, we can estimate the probability of the last word of an N-gram given the previous words. This makes N-gram models an ideal tool for generating and ranking next-word predictions, which is exactly the intent of our Shiny application. But first, we had to build our language model.
For this application, we built a 5-gram language model from a list of blog posts, a list of news articles, and a list of tweets downloaded from [2]. To get a better understanding of the content of these files, we collected, for each one, the number of lines, the number of words, the length of the longest line (in characters), and the average line length (also in characters):
From the above, it may be surprising to see that the longest tweet is 213 characters, when the Twitter platform does not allow for messages longer than 140 characters. Here is the longest tweet:
If there was any doubt, the above reaffirmed the need to perform thorough cleaning of the text data before attempting to use it to build a predictive model.
To clean up the text data, we performed the following (ordered) list of operations:
Note that we did not remove stopwords (the most common words of the English language). Since our goal is to create a next-word predictor driven by the words actually typed by the user, removing stopwords would have resulted in the creation of a large number of meaningless N-grams.
Similarly, if we were to simply remove profane words (instead of removing the whole sentence), words that were previously separated by one or more profane words could become adjacent. Down the line, we would then end up generating N-grams that simply weren’t there in the original data. For this reason, we believed it was safer to remove sentences containing profane words entirely. The list of profane words we used contained 373 entries [3].
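As a minimal sketch of this sentence-level filter (the two-entry word list below is only a placeholder for the 373-entry list referenced in [3], and the function name is ours):

# Sketch: drop entire sentences that contain any profane word.
profaneWords <- c("badword1", "badword2")   # placeholder for the 373-entry list [3]
profanePattern <- paste0("\\b(", paste(profaneWords, collapse = "|"), ")\\b")

removeProfaneSentences <- function(sentences) {
  sentences[!grepl(profanePattern, sentences, ignore.case = TRUE)]
}

# Example: only the first sentence survives
removeProfaneSentences(c("a perfectly clean sentence",
                         "a sentence with badword1 in it"))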
After data cleaning, we built term-frequency matrices for 5-grams, all the way down to unigrams, using the textcnt() function from the {tau} R package [4].
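As an illustration (the tiny in-line corpus and variable names below are ours, not the actual capstone corpus), word N-gram counts can be obtained with textcnt() as follows:

library(tau)

# cleanedText: a character vector of cleaned sentences (illustrative sample)
cleanedText <- c("hello how are you doing", "hello how are you today")

# Count word 5-grams ("string" method counts word n-grams of length n);
# 4-grams down to unigrams are obtained the same way with n = 4L, ..., 1L.
counts5 <- textcnt(cleanedText, method = "string", n = 5L)

# Convert the result into a term-frequency table sorted by count
tf5 <- data.frame(ngram = names(counts5), count = as.integer(counts5),
                  stringsAsFactors = FALSE)
tf5 <- tf5[order(-tf5$count), ]
head(tf5)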
The following tables and plots show the most common N-grams mined from our corpus:
Because term frequencies follow a Zipfian distribution, term-frequency matrices contain few terms with high frequency counts and many terms with low frequency counts. Ultra-low-frequency terms use a significant amount of memory while adding little value to our language model (their probability of occurring is so low that our predictor is very unlikely to ever show them as predictions to the user). Hence, we can save a significant amount of memory if we only store terms with some minimum count. Here, we set this threshold at 4: any N-gram with a count of less than four was pruned.
Below are the reductions in size achieved by pruning (column names are meant to be self-explanatory):
Note, however, that we did not prune the term-frequency matrices at this stage: we only pruned them after computing all the N-gram probabilities used by our two next-word prediction algorithms.
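Continuing the illustrative tf5 table from the previous sketch, such a pruning pass could look like this:

# Keep only N-grams observed at least 4 times
pruneThreshold <- 4
tf5Pruned <- tf5[tf5$count >= pruneThreshold, ]

# Fraction of the original object size retained after pruning
as.numeric(object.size(tf5Pruned)) / as.numeric(object.size(tf5))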
Whenever the user of our Shiny web app keys in something, the application suggests up to 5 words that it thinks would most likely follow what the user has typed in so far. It does so by searching its 5-gram language model for words that most often follow the user’s input words.
The algorithm first uses the last four words typed in and tries to find 5-grams that complete those four words. If fewer than 5 candidates are found, the app then uses the last three words typed in and searches for 4-grams that complete those three words, and so on, until it has found at least 5 matches. If the app is unable to find any suitable candidates, it simply returns its top-5 most likely unigrams, based on their maximum-likelihood estimates.
Once enough candidates have been identified, the app stops searching for matches and next assigns a score to each candidate. The 5 candidates that achieve the highest scores are the ones the application presents to the user.
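The sketch below illustrates this backoff search, assuming the term-frequency tables are stored as data frames with ngram and count columns (the function and object names are ours, not the application’s actual code):

# tfList: a list of term-frequency tables indexed by N-gram order (1..5),
# each with columns "ngram" and "count". input: the cleaned words typed so far.
findCandidates <- function(input, tfList, maxCandidates = 5) {
  candidates <- character(0)
  for (k in 5:2) {
    context <- tail(input, k - 1)
    if (length(context) < k - 1) next            # not enough input words yet
    prefix <- paste0(paste(context, collapse = " "), " ")
    tf <- tfList[[k]]
    hits <- tf[startsWith(tf$ngram, prefix), ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$count), ]
      lastWord <- vapply(strsplit(hits$ngram, " "),
                         function(w) w[length(w)], character(1))
      candidates <- unique(c(candidates, lastWord))
    }
    if (length(candidates) >= maxCandidates) break   # enough matches found
  }
  if (length(candidates) == 0) {
    # fall back to the most frequent unigrams
    tf1 <- tfList[[1]]
    candidates <- head(tf1$ngram[order(-tf1$count)], maxCandidates)
  }
  candidates
}

# Example call (assuming tf1 ... tf5 were built as above):
# findCandidates(c("hello", "how", "are", "you"), list(tf1, tf2, tf3, tf4, tf5))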
To rank next-word candidates, we use a mechanism first described in [5] as Stupid Backoff. In their seminal paper, the authors define the following scoring function:
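$$S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[1.5ex]
\lambda \; S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}$$

where $f(\cdot)$ denotes a raw count observed in the corpus,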
with a recommended value of 0.4 for λ (note: if any notation used above isn’t clear, please refer to [1]).
Our use of this scoring approach can be summarized as follows:
if (candidateIs5gram) {
    score = matched5gramCount / input4gramCount
} else if (candidateIs4gram) {
    score = 0.4 * matched4gramCount / input3gramCount
} else if (candidateIs3gram) {
    score = 0.4 * 0.4 * matched3gramCount / input2gramCount
} else if (candidateIs2gram) {
    score = 0.4 * 0.4 * 0.4 * matched2gramCount / input1gramCount
}
Let’s walk through an example. If you enter “Hello, how are you” (without the quotes) in WordPredictR’s Live Predictor mode, you will see the following five predictions:
   Word         Score
1: doing        0.100
2: today        0.015
3: feeling      0.012
4: going        0.011
5: celebrating  0.008
In this case, our 5-gram language model contains only one 5-gram that starts with “hello how are you”. That 5-gram is “hello how are you doing”. It appeared 5 times in our corpus, while the 4-gram “hello how are you” appeared 50 times. This candidate’s score is computed as follows:
S("hello how are you doing") = 5 / 20 = 0.1
Our 5-gram language model also contains seven 4-grams that start with “how are you”. One of them is “how are you doing”, but since it is already a known candidate, there is no point in adding it to our list a second time. Of the six remaining 4-grams, the four with the highest counts are “how are you today” (127 times), “how are you feeling” (102 times), “how are you going” (96 times), and “how are you celebrating” (65 times). The 3-gram “how are you” appears 3442 times in our corpus. Hence, the 4-gram candidates’ scores are:
S("how are you today") = 0.4 * 127 / 3442 = 0.015
S("how are you feeling") = 0.4 * 102 / 3442 = 0.012
S("how are you going") = 0.4 * 96 / 3442 = 0.011
S("how are you celebrating") = 0.4 * 65 / 3442 = 0.008
In the implementation above, scores were computed using only the match at the highest-order N-gram. In the previous example, to compute a score for “doing” we only used the “hello how are you doing” 5-gram count, without paying any attention to what the “how are you doing” 4-gram count would have been. Using Kneser-Ney interpolation (a.k.a. Kneser-Ney smoothing), we can instead compute a score that takes lower-order N-gram matches into consideration as well. The general recursive formulation of Kneser-Ney smoothing is defined as follows:
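$$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) =
\frac{\max\!\big(c_{KN}(w_{i-n+1}^{i}) - d,\ 0\big)}{c_{KN}(w_{i-n+1}^{i-1})}
+ \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$$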
where the definition of the count $c_{KN}$ depends on whether we are counting the highest-order N-gram being interpolated or one of the lower-order N-grams:
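$$c_{KN}(\cdot) =
\begin{cases}
\text{count}(\cdot) & \text{for the highest-order N-gram} \\
\text{continuation count}(\cdot) & \text{for all lower-order N-grams}
\end{cases}$$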
The continuation count of a lower-order N-gram is the number of distinct single-word contexts in which it appears (i.e., the number of distinct words that precede it in the corpus). Again, for notation clarifications, one may refer to [1]. The following illustrates how we implemented Kneser-Ney interpolation using a fictional 5-gram:
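Written with placeholder words $w_1 \dots w_5$, the highest-order step of the recursion is:

$$P_{KN}(w_5 \mid w_1 w_2 w_3 w_4) =
\frac{\max\!\big(c(w_1 w_2 w_3 w_4 w_5) - d,\ 0\big)}{c(w_1 w_2 w_3 w_4)}
+ \lambda(w_1 w_2 w_3 w_4)\, P_{KN}(w_5 \mid w_2 w_3 w_4)$$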
with
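$$\lambda(w_1 w_2 w_3 w_4) =
\frac{d}{c(w_1 w_2 w_3 w_4)}\,
\big|\{\, w : c(w_1 w_2 w_3 w_4\, w) > 0 \,\}\big|$$

where $d$ is the discount and the rightmost factor is the number of distinct words observed after the four-word context.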
In order to evaluate the performance of each scoring algorithm, we ran benchmark.R [6], a benchmarking tool relied upon by past and current Data Science Capstone students. It uses a sample of blog posts and tweets as its test dataset. Sliding a fixed-size word window over a test sentence in its dataset, benchmark.R calls our prediction function to guess the word that follows that window. After the window has slid over the entire sentence, it moves on to the next sentence in its dataset.
In our case, the benchmark was run over 599 blog sentences and 793 tweets, for a total of 28,445 predictions for each scoring technique.
The tool measures top-1 precision (the fraction of predictions in which our top prediction correctly matched the word following the test window) and top-3 precision (the fraction of predictions in which one of our top-3 predictions matched that word). Here are the results obtained with Stupid Backoff:
Below are the results obtained using our Kneser-Ney Interpolation alternate scoring method:
At the time of this writing, these were the highest scores publicly reported on the Data Science Capstone forums.
From the results above (assuming our implementation is correct), the slight improvement in precision would not seem to justify switching to a scoring function as complex as the one used by Kneser-Ney interpolation. However, as can also be gathered from the numbers above, with proper optimizations both scoring techniques can be implemented to generate predictions in well under 20ms on a 5-year-old workstation.
By aggressively leveraging precomputation and tailoring memory usage to the specifics of WordPredictR’s top-5 prediction task, we can achieve both high-speed execution and low memory usage.
The most expensive computations performed by our two prediction algorithms do not depend directly on the user’s input; they depend mostly on the content of the corpus used to build the 5-gram language model. The user’s input only determines which conditional branches need to be executed (backoff). Hence, we can precompute all the scores based on the counts maintained in our term-frequency matrices, and use fast lookup-table (LUT) operations to retrieve those scores. Ordering the top-5 scores before presenting them can then be done very quickly.
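As an illustration of this idea (the table layout, column names, and function below are assumptions, not WordPredictR’s actual data structures), a keyed data.table can serve as the LUT:

library(data.table)

# Illustrative precomputed LUT: one row per (prefix, candidate word), with the
# Stupid Backoff score already computed offline.
sbLUT <- data.table(
  prefix = c("hello how are you", "how are you", "how are you"),
  word   = c("doing",             "today",       "feeling"),
  score  = c(0.100,               0.015,         0.012)
)
setkey(sbLUT, prefix)   # index the table so lookups use a fast binary search

# At prediction time, scoring reduces to a keyed lookup plus a small sort.
predictTop5 <- function(p) {
  hits <- sbLUT[.(p), nomatch = 0L]
  head(hits[order(-score)], 5)
}

predictTop5("how are you")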
This technique applies to both algorithms. Yes, Kneser-Ney interpolation may use counts at all N-gram orders, from 5-grams down to unigrams, but, again, those counts are all based on the corpus we used; they, too, can be computed ahead of time. This is why our two scoring algorithms execute at about the same speed (less than 20ms on a quad-core 5500-series 64-bit Xeon).
Note that the delay one experiences after hitting the spacebar in Live Predictor mode, or after clicking the Click me... button in Canned Sentences mode, does not come from the execution of the prediction functions but from Shiny’s reactive timer.
As we mentioned earlier, ultra-low-frequency terms are, by definition, very unlikely to ever show up as predictions to the user. As we have also shown, pruning those terms can result in huge memory savings. Now that we have built our LUTs, we can get rid of all the N-grams with a count of less than four (except for unigrams, see below).
Since we’re only interested in returning the top-5 predictions, there is very little value in keeping more than 5 N-grams (per algorithm) that start with the same words. Now that we have built our LUTs, we know which of them score the highest with each scoring function. Hence, for each shared prefix, we can discard every N-gram that is not in its respective top-5.
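Continuing the illustrative sbLUT layout from the previous sketch, this per-prefix pruning could look as follows:

# Keep only the 5 best-scoring candidates for each prefix
sbLUT <- sbLUT[order(prefix, -score)]
sbLUT <- sbLUT[, head(.SD, 5), by = prefix]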
We can also save memory by replacing N-gram string representations, in all term-frequency matrices where N > 1, with lists of N integer indices into the unigram term-frequency matrix. Each index is used to look up the actual word it represents. At this stage, we have optimized all N-gram term-frequency matrices where N > 1, so we can do one last pass to remove, from the unigram term-frequency matrix, all terms that are not referenced by the higher-order term-frequency matrices.
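A minimal sketch of this index encoding, using a toy vocabulary:

# Encode each N-gram as integer indices into the unigram vocabulary
vocab <- c("hello", "how", "are", "you", "doing")       # unigram terms
ngram <- "hello how are you doing"

encoded <- match(strsplit(ngram, " ")[[1]], vocab)      # e.g. 1 2 3 4 5
decoded <- paste(vocab[encoded], collapse = " ")        # back to the string
identical(decoded, ngram)                               # TRUE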
The final LUTs require 62.8MB of disk storage and use 210MB of RAM once loaded (for two algorithms).
To launch WordPredictR, go to https://philferriere.shinyapps.io/WordPredictR:
This Shiny web app can operate in two different modes:
In Live Predictor mode, the application ingests user-typed text and predicts the next word.
In Canned Sentences mode, the application runs the predictor on three randomly selected sentences.
To choose between the two modes, click the left sidebar.
In this mode, type a few words at the top of the main body of the page, and watch the algorithm perform its magic every time you hit the spacebar. The application will use the text entered so far and issue a prediction for it.
If the input mostly contains unintelligible content and the predictor cannot make any “useful” prediction, the application will simply return a list of its highest-frequency unigrams. So, if you see “the”, “to”, and “and” as predictions and those are unlikely to be correct, please check the spelling of the last word(s) you have just typed in.
When using this mode, click on the Click me... button to get predictions for the final word of three (out of a set of 120) randomly chosen sentences. You can repeat the operation as many times as you want.
Note that the page displayed in Canned Sentences mode is not static. Even though the sentences are loaded from a predetermined set, the resulting predictions are not: each time a new trio of sentences is selected, the application runs the prediction algorithms on the server for each sentence.
In both modes, the application reports its top 5 predictions, with the most likely prediction listed first and less likely ones after it.
The application runs two different algorithms (see the Prediction Algorithms tab for details) and reports results for both. Each prediction comes with a score that either represents the actual probability of the predicted word or is a function of it. Scores reflect the confidence of the prediction. They are, however, only meaningful within each algorithm’s results, not across different algorithms.
If you have any questions about this application (or simply want to reach out!) the author can be found at https://www.linkedin.com/in/philferriere
{tau}: Text Analysis Utilities R package, https://cran.r-project.org/web/packages/tau/index.html