Word Prediction App

Cameron G
17 Nov 2017

What does it do?

The app takes a phrase entered by the user and generates up to five predictions for the next word in the phrase. Click here to try it out.

[App screenshot]

What data does it use?

  • The app uses a corpus of tweets, blog posts, and news articles provided by SwiftKey. The full data set is available here.
    • Blogs: 899,288 documents
    • Twitter: 2,360,148 documents
    • News: 1,010,242 documents
  • To keep processing times manageable, the model was trained on a cleaned sample of 20% of these documents.
  • The documents were tokenized with the quanteda package and saved into five data tables (unigrams, bigrams, trigrams, 4-grams, and 5-grams) along with each n-gram's frequency in the corpus; a sketch of this step follows.
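
As a rough sketch of this preprocessing step (the helper build_ngram_table and its tokenization options are illustrative assumptions, not the app's actual code):

    library(quanteda)
    library(data.table)

    # Build a frequency table of n-grams from a character vector of documents.
    build_ngram_table <- function(docs, n) {
      toks  <- tokens(docs, remove_punct = TRUE, remove_numbers = TRUE)
      grams <- tokens_ngrams(toks, n = n, concatenator = " ")
      freq  <- colSums(dfm(grams))           # n-gram counts across the corpus
      dt    <- data.table(ngram = names(freq), count = as.numeric(freq))
      setorder(dt, -count)                   # most frequent n-grams first
      dt
    }

    bigrams <- build_ngram_table(c("the cat sat", "the cat ran"), n = 2)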

How does it work?

  • The app uses the “Stupid Backoff” (SBO) model. According to Brants et al., this method “is inexpensive to calculate in a distributed environment while approaching the quality of Kneser-Ney smoothing for large amounts of data.”
  • It works by taking the last four words entered by the user and searching for 5-grams that begin with those four words. If any are found, it scores each candidate next word (the 5-gram's count divided by the count of its 4-word prefix) and ranks the candidates. When more predictions are needed, it backs off to matching 4-grams, trigrams, bigrams, and finally unigrams, multiplying each score by 0.4 per backoff step (0.4, 0.16, 0.064, and 0.0256, respectively); see the sketch after this list.
  • The model correctly predicted 15.7% of next words in a test set of unseen n-grams. Accuracy rises to around 26.4% when a prediction counts as correct if the true next word appears among the top three candidates.
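
A minimal sketch of the Stupid Backoff score, assuming for illustration that all counts live in a single named vector keyed by space-separated n-grams (the app itself stores separate data tables per n-gram order):

    # Score a candidate next word given the preceding words.
    # counts: named numeric vector of n-gram frequencies, e.g. counts["the cat"].
    sbo_score <- function(prefix_words, word, counts, lambda = 0.4) {
      if (length(prefix_words) == 0) {
        # Base case: the unigram's relative frequency (0 for unseen words).
        total <- sum(counts[!grepl(" ", names(counts))])
        freq  <- counts[word]
        return(if (is.na(freq)) 0 else unname(freq) / total)
      }
      ngram  <- paste(c(prefix_words, word), collapse = " ")
      prefix <- paste(prefix_words, collapse = " ")
      if (!is.na(counts[ngram]) && !is.na(counts[prefix])) {
        unname(counts[ngram]) / unname(counts[prefix])
      } else {
        # Back off to a shorter context, discounting by lambda = 0.4;
        # four backoffs give 0.4^4 = 0.0256, matching the factors above.
        lambda * sbo_score(prefix_words[-1], word, counts, lambda)
      }
    }

    counts <- c("the" = 5, "cat" = 3, "sat" = 2,
                "the cat" = 2, "cat sat" = 1, "the cat sat" = 1)
    sbo_score(c("the", "cat"), "sat", counts)   # 0.5, from the trigram counts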

Is it user friendly?

The app includes several features that keep predictions fast for the user.

  • Pruning. N-grams that occur fewer than four times in the corpus have been removed from the n-gram tables loaded into the app, since they have little predictive value.
  • Data tables. The app takes advantage of the fast keyed indexing provided by the data.table package and its setkey function, yielding predictions faster than equivalent lookups on data frames; a sketch follows this list.
  • Output length. The algorithm stops and returns the list of predicted words once it has generated the number of next-word predictions the user specifies with the slider input. The model only “backs off” to lower-order n-grams if it needs to add more items to the list.
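
As an illustration of the keyed-lookup pattern (the table and column names here are assumptions, not the app's actual schema):

    library(data.table)

    bigrams <- data.table(
      prefix = c("the", "the", "a"),
      word   = c("cat", "dog", "cat"),
      count  = c(10, 7, 3)
    )
    setkey(bigrams, prefix)        # sort and index on the bigram's first word

    # Keyed join: binary search for all bigrams starting with "the",
    # then rank candidate next words by frequency.
    bigrams["the"][order(-count)]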