Rui Hu
September 12, 2018
The Predict Next Word Shiny App is built for the Capstone Project of the Coursera Data Science Specilization from John Hopkins University.
The goal of this app is to predict next word based off typed in words. The Markov assumption and ngram are the main theoratical foundation of this app.
The large dataset that was used to train this prediction model was from here: Capstone Dataset, provided by SwiftKey. For this project, only the text files under the folder 'en_US' within the dataset were used.
Markov assumption: The probability of a word showing up in a context depends only on the previous word or words.
ngram: These words(or letters) that appear in a certain order, are called “grams.”
A sequence of n words is called “ngram.”
In this step, 50% of the dataset was sampled.
Non-ascii characters are converted to ascii charagers.
1-5 grams are created and then stored in a data table. The size of the end result of the data table for the 1-5 ngrams is 34.4MB.
For easy search, it seems efficient to generate 1-(n-1) ngrams for the input text for search. If the typed words are less than 4, then use all the words. If not, then use the last four words.
What do we do when there is a word that appears after a word that they never appeared in the training set?
There are many approaches to do this, such as Add-k smoothing, various Backoff and Interpolation approaches, and Kneser-Ney Smoothing. After researching, the Stupid Backoff algorithm was selected for this app, for its efficiency and effectiveness.
If n-1 gram (here is 4gram) can't be found, then fall back to the lower level to search 3gram. If the match can't be found in 3 gram, then fall back to 2gram. So on and so forth. Each time of falling back, the weight of \( \lambda \) will be reduced by *0.4. Each matched word was then stored along with their weight (“power”) in the data table for the final selecting.
Performance test was done using the “Benchmark” scripts, courtesy of the Coursera Course alumini: Andrew Braddick.
The result is as following:
Overall top-3 score: 14.41 %
Overall top-1 precision: 10.42 %
Overall top-3 precision: 17.97 %
Average runtime: 148.99 msec
Number of predictions: 28464
Total memory used: 48.57 MB
The performance of this app is good enough for the users to find the word in the top 10 predicted words. However, further improvements can be made, given more time, by:
The app allows the user to type in words in the textbox and then makes real-time suggestions for the next word by giving top 10 candidates or whatever prediciton it finds (when the results are less than 10).
The results are displayed in an interactive table, sorted by their weight (“power”). The user can click on the word that they like, to make it appear in the text box.