Word Prediction App

Jean-Paul Courneya
29 May 2018

Executive Overview

The goal of this final assignment was to develop a word prediction app.

  • Students were provided a corpus that served as the basis for developing the back-end algorithm that auto-completes phrases. An exploratory analysis was performed to get a sense of the features of the corpus.
  • An algorithm was developed using an N-gram probabilistic language model with a back-off approach, which is efficient and accurate.
  • A data product was then created: a Shiny app running the prediction algorithm, which accepts a phrase and predicts the next word.

Understanding the Corpus

  • Source of the data: The corpus used for building the model was provided through a collaboration between the course instructors and SwiftKey. It is a collection of blog posts, news articles, and tweets downloaded from the repository provided in the course.
  • Decide on a text processing package for R: Quantitative text analysis in R can be done with a variety of packages (base R, tm, tau, tidytext, openNLP). The app required a package that can create a corpus, tokenize text, build N-grams, and generate document-feature matrices; quanteda, a quantitative text analysis package for R, was chosen because it is cross-platform, robust, and fast.
  • Prepare the text for the language model: combine the corpus sources; remove non-ASCII characters; convert text to lowercase; tokenize; remove numbers, punctuation, symbols, Twitter handles, and separators; filter profanity; then build the N-grams and document-feature matrix (DFM). A sketch of this pipeline follows below.
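
The following minimal sketch shows how such a pipeline can be written with quanteda. The file names and the placeholder profanity list are assumptions for illustration, not the exact course code:

library(quanteda)

# Combine the three sources into one character vector (file names assumed)
raw <- c(readLines("en_US.blogs.txt",   skipNul = TRUE),
         readLines("en_US.news.txt",    skipNul = TRUE),
         readLines("en_US.twitter.txt", skipNul = TRUE))
raw <- iconv(raw, "UTF-8", "ASCII", sub = "")    # drop non-ASCII characters

profanity <- c("badword1", "badword2")           # placeholder profanity list

toks <- tokens(corpus(raw),
               remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE, remove_separators = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = "@*")      # strip Twitter handles
toks <- tokens_remove(toks, pattern = profanity)

ngrams2 <- tokens_ngrams(toks, n = 2, concatenator = " ")
dfm2    <- dfm(ngrams2)                          # document-feature matrix of bigrams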

Developing the predictive algorithm

  • The Stupid Backoff approach to N-gram probabilistic language modeling used here is described in detail in Brants et al. (2007).
  • Calculate probabilities for phrase completion using a maximum likelihood estimate, which relies on the chain rule and the Markov assumption. These probabilities are used to rank predictions in lookup tables (see the worked example below).
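
Concretely, under the Markov assumption the probability of the next word depends only on the last N-1 words, so the trigram estimate is P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2). The sketch below uses made-up counts for illustration, not values from the actual corpus:

# Maximum likelihood estimate of the word following "at the"
#   P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
trigram_counts <- c("at the end" = 57, "at the same" = 102, "at the moment" = 41)
bigram_count   <- 450                      # count("at the"); illustrative only

mle <- trigram_counts / bigram_count
sort(mle, decreasing = TRUE)               # candidates ranked most to least likely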

Speeding up the predictive algorithm

  • Remove results with a next-word frequency of less than 4.
  • For each back-off step, a discount lambda of 0.4 is applied to predictions that are not found in the higher-order N-gram table:
# Stupid Backoff: try the highest-order table first, then back off to
# shorter histories, multiplying by the 0.4 discount at each step
if (phraseIsTrigram) {
  score <- matched4gramCount / input3gramCount               # match in the 4-gram table
} else if (phraseIsBigram) {
  score <- 0.4 * matched3gramCount / input2gramCount         # backed off once
} else if (phraseIsUnigram) {
  score <- 0.4 * 0.4 * matched2gramCount / input1gramCount   # backed off twice
}
  • Create fast lookup tables for each N-gram level, with the scores precomputed (illustrated below).
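
As an illustration, a keyed data.table gives the kind of fast lookup described above. The table shape, column names, and scores are assumptions for this sketch, not the app's actual tables:

library(data.table)

# Example 4-gram lookup table: trigram history -> candidate next words,
# with scores precomputed and low-frequency rows already pruned
ngram4 <- data.table(
  history  = c("thanks for the", "thanks for the", "at the end of"),
  nextWord = c("follow", "rt", "day"),
  score    = c(0.21, 0.13, 0.56)
)
setkey(ngram4, history)         # key enables fast binary-search subsetting

predictNext <- function(phrase) {
  hits <- ngram4[.(phrase)]     # keyed join on the trigram history
  hits[order(-score)]           # most likely completion first
}

predictNext("thanks for the")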

Using the App

  • Try out the Shiny app here.
  • Type your desired phrase into the text box.
  • The application will then try to predict the next word(s), displaying them in a table with scores sorted from most to least likely completion.

References

GitHub repo with scripts to build the app
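Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large Language Models in Machine Translation. Proceedings of EMNLP-CoNLL 2007.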