Next Word Prediction App

Kayode John Olusola
14th January, 2016

Overview

  • The main objective of this project is to build a Next Word Predictor delivered as a Shiny app.

  • The Next Word Predictor App was developed and is available at https://jkayode.shinyapps.io/appDSSCapstone

  • The prediction app uses an N-gram language model with the Stupid Backoff strategy.

  • Relative frequencies of N-grams built from the processed text corpus were computed and used to predict the next word from only the N-1 preceding words of context, following the Markov assumption (see the worked example below).
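
As a worked illustration of this relative-frequency (maximum likelihood) estimate, the trigram case reads:

    P(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2} w_{i-1} w_i) / count(w_{i-2} w_{i-1})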

Using the App

To use the Next Word Prediction App:

  • Load the app (see previous slide for link)
  • Enter your text in the text box provided
  • Click the "Predict Next Word" button
  • Wait a few seconds for the prediction algorithm to run
  • View the predicted next word, along with three other likely alternatives

These steps are also displayed in the app as user guidance.

Describing the Algorithm

  • A 25% sample of the provided text corpus was cleaned (removing profanity, numbers, punctuation, etc.) and tokenized into unigrams, bigrams, trigrams, quadgrams and pentagrams (see the sketch below)
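
A minimal sketch of this preprocessing step is shown below. It assumes the quanteda package, a hypothetical sample file and a hypothetical `profanity` word list; the project itself may have used different tools.

    library(quanteda)

    corpus_sample <- readLines("corpus_sample.txt")     # hypothetical 25% sample file
    profanity     <- readLines("profanity.txt")         # hypothetical list of words to remove

    toks <- tokens(corpus_sample,
                   remove_punct   = TRUE,
                   remove_numbers = TRUE,
                   remove_symbols = TRUE)
    toks <- tokens_tolower(toks)
    toks <- tokens_remove(toks, pattern = profanity)    # drop profane words

    # Unigrams (n = 1) through pentagrams (n = 5)
    ngrams <- lapply(1:5, function(n) tokens_ngrams(toks, n = n, concatenator = " "))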

  • The frequencies of distinct n-grams were computed, sorted in descending order and stored as RData files. These files are loaded and referenced when the app runs (see the sketch below).
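
Continuing the sketch above, the frequency table for one n-gram order (bigrams here; all names are illustrative) could be built and persisted like this:

    bigram_dfm  <- dfm(ngrams[[2]])                     # document-feature matrix of bigrams
    bigram_freq <- sort(colSums(bigram_dfm), decreasing = TRUE)
    bigram_freq <- data.frame(ngram = names(bigram_freq),
                              freq  = as.integer(bigram_freq),
                              stringsAsFactors = FALSE)

    save(bigram_freq, file = "bigram_freq.RData")       # shipped alongside the app
    # load("bigram_freq.RData")                         # run when the app starts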

  • The next word is predicted by cleaning the input text, checking the input words against the appropriate n-gram table, and backing off from higher-order to lower-order n-grams when no match is found. If no match is found at any level, the most frequent unigram is returned (see the simplified sketch below).
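
A simplified sketch of this backoff lookup follows. It assumes a list `freq_tables` of data frames (unigrams in position 1 up to pentagrams in position 5), each with a space-separated `ngram` column and a `freq` column sorted in descending order; the app's actual implementation may differ, and the Stupid Backoff weighting is omitted because only the top candidate along a single backoff path is returned.

    predict_next_word <- function(input, freq_tables) {
      # Clean the input roughly the same way the corpus was cleaned
      words <- unlist(strsplit(gsub("[^a-z' ]", " ", tolower(input)), "\\s+"))
      words <- words[words != ""]

      # Try the highest-order table first, backing off to lower orders
      for (n in 5:2) {
        if (length(words) >= n - 1) {
          context <- paste(tail(words, n - 1), collapse = " ")
          hits <- freq_tables[[n]][startsWith(freq_tables[[n]]$ngram,
                                              paste0(context, " ")), ]
          if (nrow(hits) > 0) {
            # Last word of the most frequent matching n-gram
            return(tail(strsplit(hits$ngram[1], " ")[[1]], 1))
          }
        }
      }
      freq_tables[[1]]$ngram[1]    # no match anywhere: most frequent unigram
    }

    # Example (hypothetical): predict_next_word("thanks for the", freq_tables)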

Final Remarks

  • To keep the app lightweight, the n-gram files were filtered to remove n-grams with very low frequencies (for example, as sketched below). Although this reduces prediction accuracy somewhat, the trade-off was necessary for the app to run fast enough on the Shiny platform.
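
For instance, with the illustrative `bigram_freq` table from the earlier sketch, pruning could be as simple as the following (the threshold is an assumption, not the value used in the app):

    # Drop n-grams observed only once before saving, to shrink the RData file
    bigram_freq <- bigram_freq[bigram_freq$freq > 1, ]
    save(bigram_freq, file = "bigram_freq.RData")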

  • The Stupid Backoff Model was chosen for this project because of its simplicity and because it approaches the quality of more complex smoothing methods, as shown by Brants et al. (2007); its scoring scheme is recalled below.
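
For reference, the Stupid Backoff score from Brants et al. (2007) replaces smoothed probabilities with relative frequencies plus a fixed backoff penalty (0.4 in the paper), terminating at the unigram relative frequency:

    S(w_i | w_{i-k+1} ... w_{i-1}) = count(w_{i-k+1} ... w_i) / count(w_{i-k+1} ... w_{i-1})   if the count is positive
                                   = 0.4 * S(w_i | w_{i-k+2} ... w_{i-1})                      otherwise (back off one order)
    S(w_i) = count(w_i) / N                                                                    at the unigram level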

  • Further improvements would explore larger n-gram tables together with more efficient lookup methods for better predictions.