Predict Next Word Shiny App

Rui Hu
September 12, 2018

The Predict Next Word Shiny App is built for the Capstone Project of the Coursera Data Science Specilization from John Hopkins University.

The goal of this app is to predict next word based off typed in words. The Markov assumption and ngram are the main theoratical foundation of this app.

The large dataset that was used to train this prediction model was from here: Capstone Dataset, provided by SwiftKey. For this project, only the text files under the folder 'en_US' within the dataset were used.

Basic Idea of Using N-gram to Predict the Next Word

Markov assumption: The probability of a word showing up in a context depends only on the previous word or words.
ngram: These words(or letters) that appear in a certain order, are called “grams.” A sequence of n words is called “ngram.”

Step 1: Getting and sampling the training dataset

In this step, 50% of the dataset was sampled.

Step 2: Initial cleaning the sample data

Non-ascii characters are converted to ascii charagers.

Step 3: Creating ngrams from the dataset

1-5 grams are created and then stored in a data table. The size of the end result of the data table for the 1-5 ngrams is 34.4MB.

Step 4: Processing the input text

For easy search, it seems efficient to generate 1-(n-1) ngrams for the input text for search. If the typed words are less than 4, then use all the words. If not, then use the last four words.

Step 5: Search the genrated (n-1):1 grams in the ngram data table.

Smoothing with Stupid Backoff

What do we do when there is a word that appears after a word that they never appeared in the training set?

There are many approaches to do this, such as Add-k smoothing, various Backoff and Interpolation approaches, and Kneser-Ney Smoothing. After researching, the Stupid Backoff algorithm was selected for this app, for its efficiency and effectiveness. stupid backoff equation

If n-1 gram (here is 4gram) can't be found, then fall back to the lower level to search 3gram. If the match can't be found in 3 gram, then fall back to 2gram. So on and so forth. Each time of falling back, the weight of \( \lambda \) will be reduced by *0.4. Each matched word was then stored along with their weight (“power”) in the data table for the final selecting.

Performance Benchmark

Performance test was done using the “Benchmark” scripts, courtesy of the Coursera Course alumini: Andrew Braddick.

The result is as following:

Overall top-3 score: 14.41 %
Overall top-1 precision: 10.42 %
Overall top-3 precision: 17.97 %
Average runtime: 148.99 msec
Number of predictions: 28464
Total memory used: 48.57 MB

The performance of this app is good enough for the users to find the word in the top 10 predicted words. However, further improvements can be made, given more time, by:

Adding semantic correlation between candidate words and the existing input text;
Improvement on the text cleaning appoach by considering names, locations, and idoms.

The Shiny App

The app allows the user to type in words in the textbox and then makes real-time suggestions for the next word by giving top 10 candidates or whatever prediciton it finds (when the results are less than 10).
The results are displayed in an interactive table, sorted by their weight (“power”). The user can click on the word that they like, to make it appear in the text box. screen shot of the shiny app