Kayode John Olusola
14th January, 2016
The main objective of this project is to build a Next Word Predictor delivered as a Shiny app.
The Next Word Predictor App was developed and is available at https://jkayode.shinyapps.io/appDSSCapstone
The app is built on an n-gram language model that uses the Stupid Backoff scheme.
Relative frequencies of n-grams built from the processed text corpus were computed and used to predict the next word from only the preceding N-1 words of context, following the Markov assumption.
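For reference, the Stupid Backoff score is usually written as below, following Brants et al. (2007). Here f(.) is an n-gram count, N is the total number of tokens, and α is a fixed backoff factor; that paper suggests α = 0.4, and whether the app uses the same value is not stated here.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

The scores are relative frequencies rather than normalized probabilities, which is what keeps the scheme cheap to compute.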
To use the Next Word Prediction App:
These steps are also shown in the app as user guidance.
A 25% sample of the provided text corpus was cleaned (removing profanity, numbers, punctuation, etc.) and tokenized into unigrams, bigrams, trigrams, quadgrams and pentagrams (4-grams and 5-grams).
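The cleaning and tokenizing step can be sketched roughly as follows. This is an illustrative Python sketch, not the app's actual code: the `PROFANITY` set, the regular expression, and the function names `clean` and `ngrams` are all assumptions made for the example.

```python
import re

# Hypothetical stand-in for a real profanity list.
PROFANITY = {"darn", "heck"}

def clean(text):
    """Lowercase, drop digits and punctuation, and remove listed profane words."""
    text = text.lower()
    # Replace anything that is not a letter, apostrophe, or space with a space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    words = [w.strip("'") for w in text.split()]
    return [w for w in words if w and w not in PROFANITY]

def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = clean("Heck, I have 2 dogs!")
bigrams = ngrams(words, 2)
```

Calling `ngrams` with n from 1 to 5 would give the five token sets named above; a text shorter than n words simply yields no n-grams of that order.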
The frequencies of the distinct n-grams were computed, sorted in descending order and stored as RData files. These files are loaded and referenced when the app runs.
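A minimal sketch of the counting and sorting step, written in Python for illustration. The function name `ngram_table` and the toy sentences are assumptions; the app itself stores its tables as RData files, which this sketch does not reproduce.

```python
from collections import Counter

def ngram_table(sentences, n):
    """Count distinct n-grams across tokenized sentences, most frequent first."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    # most_common() returns (n-gram, count) pairs sorted by descending count.
    return counts.most_common()

# Toy corpus, already cleaned and split into words.
sentences = [
    ["i", "love", "you"],
    ["i", "love", "cake"],
    ["you", "love", "cake"],
]
bigram_table = ngram_table(sentences, 2)
```

Sorting once at build time means the app only needs to read the top rows for a given context at prediction time.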
The next word is predicted by cleaning the entered text, checking the last words against the matching n-gram table and backing off from higher-order to lower-order n-grams whenever no match is found. If no match is found at any order, the most frequent unigram is returned.
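The backoff lookup described here can be sketched as below. This is a simplified Python illustration under stated assumptions: the `tables` layout (one dictionary per order, mapping a context tuple to its most frequent next word) and the name `predict_next` are hypothetical, and the sketch returns only the top match at the highest order that has one rather than computing weighted scores.

```python
def predict_next(words, tables, max_order=5):
    """Return the most frequent continuation, backing off to shorter contexts.

    tables[n] maps an (n-1)-word context tuple to its best next word;
    tables[1] maps the empty tuple to the most frequent unigram.
    """
    # Start with the longest context available and shorten it step by step.
    for n in range(min(max_order, len(words) + 1), 1, -1):
        context = tuple(words[-(n - 1):])
        match = tables.get(n, {}).get(context)
        if match is not None:
            return match
    # No context matched at any order: fall back to the top unigram.
    return tables[1][()]

# Toy lookup tables, assumed to be built from sorted n-gram counts.
tables = {
    1: {(): "the"},
    2: {("love",): "you", ("of",): "the"},
    3: {("i", "love"): "cake"},
}
```

With these toy tables, the input "i love" is answered from the trigram table, "we love" backs off to the bigram table, and an unseen word such as "hello" falls through to the top unigram.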
To keep the app lightweight, the n-gram files were filtered to remove n-grams with very low frequencies. Although this reduces prediction accuracy, the trade-off was necessary for the app to run fast enough on the Shiny platform.
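The frequency filter can be illustrated with a short Python sketch. The cutoff `min_count=2` and the name `prune` are assumptions for the example; the threshold actually used for the app is not given here.

```python
def prune(counts, min_count=2):
    """Keep only the n-grams seen at least min_count times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy trigram counts before filtering.
trigram_counts = {
    ("i", "love", "you"): 7,
    ("i", "love", "cake"): 2,
    ("i", "love", "taxes"): 1,
}
pruned = prune(trigram_counts)
```

Because most distinct n-grams in a corpus occur only once, even a low cutoff tends to shrink the tables substantially, at the cost of losing rare continuations.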
The Stupid Backoff model was chosen for this project because it is simple yet approaches the quality of more complex models, as explained by Brants et al. (2007).
Further improvements would explore larger n-gram sizes combined with more efficient lookup methods for better predictions.