Word Prediction App

Jean-Paul Courneya
29 May 2018

Executive Overview

The goal of this final assignment was to develop a word prediction app.

  • Students were provided a corpus that served as the basis for developing the back-end algorithm that auto-completes phrases. An exploratory analysis was performed to get a sense of the features of the corpus.
  • An algorithm was developed using an N-gram probabilistic language model with a back-off approach, which is efficient and accurate.
  • A data product was then created: a Shiny app running the prediction algorithm, which accepts a phrase and predicts the next word.

Understanding the Corpus

  • Source of the data: The corpus used for building the model was provided through a collaboration between the course instructors and SwiftKey. It is a collection of blog posts, news articles, and tweets downloaded from the repository provided in the course.
  • Decide on a text processing package for R: Quantitative text analysis in R can be done with a variety of packages (base R, tm, tau, tidytext, openNLP). The app required a package that can create a corpus, tokenize text, build N-grams, and generate document-feature matrices; quanteda, a quantitative text analysis package for R, was chosen because it is cross-platform, robust, and fast.
  • Prepare the text for the language model: combine the corpus sources; remove non-ASCII characters; convert text to lowercase; tokenize; remove numbers, punctuation, symbols, Twitter handles, and separators; filter profanity; then build the N-grams and document-feature matrix (DFM). A sketch of this pipeline follows below.
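
The following minimal sketch shows how such a pipeline can be written with quanteda. The file names and the placeholder profanity list are assumptions for illustration, not the exact course code:

library(quanteda)

# Combine the three sources into one character vector (file names assumed)
raw <- c(readLines("en_US.blogs.txt",   skipNul = TRUE),
         readLines("en_US.news.txt",    skipNul = TRUE),
         readLines("en_US.twitter.txt", skipNul = TRUE))
raw <- iconv(raw, "UTF-8", "ASCII", sub = "")    # drop non-ASCII characters

profanity <- c("badword1", "badword2")           # placeholder profanity list

toks <- tokens(corpus(raw),
               remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE, remove_separators = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = "@*")      # strip Twitter handles
toks <- tokens_remove(toks, pattern = profanity)

ngrams2 <- tokens_ngrams(toks, n = 2, concatenator = " ")
dfm2    <- dfm(ngrams2)                          # document-feature matrix of bigrams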

Developing the predictive algorithm

  • The Stupid Backoff approach to N-gram probabilistic language modeling used here is described in detail in Brants et al. (2007).
  • Calculate probabilities for phrase completion using a maximum likelihood estimate, which relies on the chain rule and the Markov assumption. These probabilities are used to rank predictions in lookup tables (see the worked example below).
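
Concretely, under the Markov assumption the probability of the next word depends only on the last N-1 words, so the trigram estimate is P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2). The sketch below uses made-up counts for illustration, not values from the actual corpus:

# Maximum likelihood estimate of the word following "at the"
#   P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
trigram_counts <- c("at the end" = 57, "at the same" = 102, "at the moment" = 41)
bigram_count   <- 450                      # count("at the"); illustrative only

mle <- trigram_counts / bigram_count
sort(mle, decreasing = TRUE)               # candidates ranked most to least likely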

Speeding up the predictive algorithm

  • Remove results with a next-word frequency of less than 4.
  • For each back-off step, a discount lambda of 0.4 is applied to predictions that are not found in the higher-order N-gram table:
# Stupid Backoff: try the highest-order table first, then back off to
# shorter histories, multiplying by the 0.4 discount at each step
if (phraseIsTrigram) {
  score <- matched4gramCount / input3gramCount               # match in the 4-gram table
} else if (phraseIsBigram) {
  score <- 0.4 * matched3gramCount / input2gramCount         # backed off once
} else if (phraseIsUnigram) {
  score <- 0.4 * 0.4 * matched2gramCount / input1gramCount   # backed off twice
}
  • Create fast lookup tables for each N-gram level, with the scores precomputed (illustrated below).
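
As an illustration, a keyed data.table gives the kind of fast lookup described above. The table shape, column names, and scores are assumptions for this sketch, not the app's actual tables:

library(data.table)

# Example 4-gram lookup table: trigram history -> candidate next words,
# with scores precomputed and low-frequency rows already pruned
ngram4 <- data.table(
  history  = c("thanks for the", "thanks for the", "at the end of"),
  nextWord = c("follow", "rt", "day"),
  score    = c(0.21, 0.13, 0.56)
)
setkey(ngram4, history)         # key enables fast binary-search subsetting

predictNext <- function(phrase) {
  hits <- ngram4[.(phrase)]     # keyed join on the trigram history
  hits[order(-score)]           # most likely completion first
}

predictNext("thanks for the")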

Using the App

  • Try out the Shiny app here.
  • Type your desired phrase into the text box.
  • The application will then try to predict the next word(s), displaying them in a table with scores sorted from most to least likely completion.

References

GitHub repo with scripts to build the app
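Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large Language Models in Machine Translation. Proceedings of EMNLP-CoNLL 2007.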