Next Word Prediction

Yan Yu
September 25, 2016

This Shiny application is the capstone project for the Coursera Data Science specialization. The goal of the capstone is to understand and build a predictive text model. The presentation is organized as follows:

Model Algorithm
ShinyApp Instructions
Accuracy Evaluation
Discussion and References

Model Algorithm - Preparation and Prediction

Preparation. We start by drawing a 10% sample from a large corpus called HC Corpora. We then remove sentences containing profane words, convert the text to lower case, and remove numbers, punctuation, separators, and Twitter-specific content. The cleaned sample is tokenized into N-grams (N = 1, 2, 3, 4), and all terms with fewer than 3 occurrences are dropped from the N-gram matrices.
Prediction. As soon as the user types some text, the app uses the last three words to search the 4-grams. If fewer than three candidates are found, the app uses the last two words to search the 3-grams. If there are still fewer than three candidates, it goes on to search the 2-grams using the last word. If no match is found at any level, it returns the top three unigrams. The Stupid Backoff model (alpha = 0.4) is then used to score the matching candidates [1]. The same candidate may be found at more than one N-gram level; in that case we sum its scores.
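
The preparation and prediction steps above can be sketched as follows. This is a minimal Python illustration, not the app's actual code (the app itself is built with Shiny). The function names `clean`, `build_counts`, and `predict` are hypothetical, the cleaning rules are a simplified guess at what "Twitter-specific content" covers, and the profanity filter is omitted because the word list is not given here.

```python
import re
from collections import Counter, defaultdict

ALPHA = 0.4     # Stupid Backoff factor stated in the text
MIN_COUNT = 3   # n-grams seen fewer than 3 times are dropped


def clean(line):
    """Lower-case, then strip URLs, @handles, #hashtags, digits, punctuation."""
    line = line.lower()
    line = re.sub(r"https?://\S+|[@#]\w+", " ", line)
    line = re.sub(r"[^a-z' ]+", " ", line)
    return line.split()


def build_counts(lines, max_n=4, min_count=MIN_COUNT):
    """Return {n: Counter of n-gram tuples}, keeping counts >= min_count."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return {n: Counter({g: c for g, c in ctr.items() if c >= min_count})
            for n, ctr in counts.items()}


def predict(text, counts, k=3, alpha=ALPHA, max_n=4):
    """Back off from 4-grams to 2-grams until k candidates are collected."""
    tokens = clean(text)
    scores = defaultdict(float)
    penalty = 1.0
    for n in range(max_n, 1, -1):
        if len(scores) >= k:
            break
        if len(tokens) < n - 1:
            continue  # not enough typed words for this order
        context = tuple(tokens[-(n - 1):])
        context_count = counts[n - 1][context]
        # Linear scan for clarity; a real app would index n-grams by context.
        for gram, c in counts[n].items():
            if gram[:-1] == context:
                # Scores of a word found at several orders are summed.
                scores[gram[-1]] += penalty * c / context_count
        penalty *= alpha  # each back-off step is discounted by alpha
    if not scores:
        # Nothing matched at any order: fall back to the top unigrams.
        return [gram[0] for gram, _ in counts[1].most_common(k)]
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

For example, with counts built from a toy corpus, `predict("I love new", counts)` first looks up the 4-grams starting with "i love new", then the 3-grams starting with "love new", and finally the 2-grams starting with "new", discounting each lower level by a further factor of 0.4.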

ShinyApp Instructions

Our Shiny app has two tabs.

The first tab is the Next Word Prediction page. The user just needs to type in some words, phrases, or sentences, and the app instantly returns the ranked predictions for the next word. Only English is supported in this app.
The second tab is the Documents page. It loads the md file for this presentation, where the user can find all the information about the application.
The Shiny app link is https://yanyu6.shinyapps.io/wordapp

Accuracy Evaluation

To evaluate our model's performance, we ran a benchmark test supplied by a past Data Science student [2]. In our case, the benchmark was run over 599 blog lines and 793 tweet lines, for a total of 28464 predictions.
The accuracy of our model is as follows:
- Overall top-3 score: 16.39 %
- Overall top-1 precision: 12.22 %
- Overall top-3 precision: 19.90 %
- Average runtime: 70.67 msec
- Number of predictions: 28464
- Total memory used: 33.03 MB
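
As a rough illustration of what a top-k precision figure measures, the sketch below counts how often the true next word appears among the first k predictions. It is not the benchmark's code [2], and it does not reproduce the benchmark's "top-3 score", whose definition is not described here. The function name `top_k_precision` and the `predict_fn` interface (text in, ranked list of words out) are assumptions made for this example.

```python
def top_k_precision(predict_fn, lines, k=3):
    """Fraction of positions where the true next word is in the top k guesses."""
    hits = 0
    total = 0
    for line in lines:
        words = line.lower().split()
        # Predict every word after the first from the words that precede it.
        for i in range(1, len(words)):
            guesses = predict_fn(" ".join(words[:i]))[:k]
            hits += words[i] in guesses
            total += 1
    return hits / total if total else 0.0
```

With a predictor that always returns the same three words, for instance, the result is simply the share of target words that fall in that fixed set.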

Discussion and References

Discussion. We may be able to further improve the model's accuracy with the following modifications. First, we could generate higher-order N-grams (N > 4). Second, we could sample more data from the original corpus, for example 15% instead of 10%. Third, we could add a smoothing method to better predict the next word.
References
1. https://rpubs.com/pferriere/dscapreport
2. https://github.com/hfoffani/dsci-benchmark