Coursera JHU Data Science Specialization

Capstone Project Data Product – Next Word Predictor

Yue Li
4/29/2018

Overview of App

This app predicts the next word(s) for a user's input text. In the left side panel, users enter any text to get a next-word prediction. Below the text input box, a slider lets users choose the number of candidate words shown in the prediction panel. On the right side, the predicted word with the highest probability appears in bold. The other candidates and their respective probabilities are listed in order in the data table below it.

To use the app, please visit https://cranberryly.shinyapps.io/capstone1/.

Data Source

Behind the scenes, the app uses a predictive model developed from datasets provided by Coursera. The original data include news articles, blog articles, and Twitter data, with a total file size exceeding 500 MB. Rather than using all the data, a random 10% sample was drawn from each source to create the training set. The selected records were then cleaned as follows:

  • Numbers, punctuation, extra white space, and special extended ASCII characters were removed.
  • Profane words were searched for and removed from the corpus.
  • All letters in the corpus were converted to lower case.
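
The sampling and cleaning steps above can be sketched roughly as follows. This is a minimal Python illustration, not the app's actual code: the function names, the fixed seed, the placeholder profanity list, and the choice to delete apostrophes (so contractions stay one token) are all assumptions.

```python
import random
import re

# Hypothetical placeholder list; the profanity list used by the app is not given here.
PROFANITY = {"darn", "heck"}

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line independently with probability `fraction` (about 10%)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean_line(text, banned=PROFANITY):
    """Lower-case a line and strip non-ASCII characters, digits, punctuation, and banned words."""
    text = text.lower()
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    text = text.replace("'", "")                           # assumption: "don't" -> "dont"
    text = re.sub(r"[^a-z\s]", " ", text)                  # digits and punctuation -> space
    words = [w for w in text.split() if w not in banned]   # split() also collapses white space
    return " ".join(words)
```

For example, `clean_line("Hello, World! 123")` yields `"hello world"`.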

The Algorithm

To predict word candidates, the app applies the Stupid Backoff model. The algorithm first takes the last three words typed in and tries to find 4-grams that complete those three words. If no candidates are found, the app takes the last two words typed in and searches for 3-grams that match those two words, and so on, until it finds matches. If the app is unable to find any suitable candidates, it simply returns the most likely unigrams.
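
The back-off search described above might look like the following minimal Python sketch. The names `build_index` and `candidates` are hypothetical, and the index layout (a dictionary from context tuples to next-word counts) is an assumption made for illustration, not the app's actual data structure.

```python
from collections import Counter, defaultdict

def build_index(sentences, max_n=4):
    """Map each context tuple (0 to max_n - 1 words) to a Counter of next words."""
    index = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for k in range(max_n):          # k = number of context words
                if i - k < 0:
                    break
                index[tuple(words[i - k:i])][word] += 1
    return index

def candidates(index, text, top=3, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    words = text.lower().split()
    for k in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])
        if context in index:                # the empty context holds the unigrams
            return [w for w, _ in index[context].most_common(top)]
    return []
```

With an index built from a few sentences, `candidates(index, "i love new")` first looks for 4-grams starting with "i love new", and only falls back to shorter contexts when that lookup finds nothing.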

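The scoring formula is defined as follows:

$$
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{\mathrm{count}(w_{i-k+1}^{i})}{\mathrm{count}(w_{i-k+1}^{i-1})} & \text{if } \mathrm{count}(w_{i-k+1}^{i}) > 0 \\[2ex]
\lambda \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
$$

with a recommended value of 0.4 for $\lambda$. The recursion ends at unigrams, where $S(w_i) = \mathrm{count}(w_i) / N$ and $N$ is the total number of words in the training set.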
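
As a worked illustration of the score, here is a small Python sketch of Stupid Backoff scoring with lambda = 0.4. The helper names `count_ngrams` and `score`, and the plain `Counter` of n-gram tuples, are assumptions for this example rather than the app's implementation.

```python
from collections import Counter

LAMBDA = 0.4  # back-off factor recommended for Stupid Backoff

def count_ngrams(sentences, max_n=4):
    """Count every 1-gram up to max_n-gram, keyed by a tuple of words."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def score(counts, total_words, context, word, lam=LAMBDA):
    """Relative frequency at the longest matching order, times lam per back-off step."""
    context = tuple(context)
    penalty = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return penalty * counts[ngram] / counts[context]
        context = context[1:]   # drop the earliest context word
        penalty *= lam
    return penalty * counts[(word,)] / total_words  # unigram base case
```

For instance, if "we love new" was never seen but "love new" occurs 2 times and "love" occurs 3 times, the score of "new" after "we love" is 0.4 × 2/3 ≈ 0.27. Note that these scores are relative frequencies with a discount, not normalized probabilities.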

Challenges and Future Work

  • To get a fast response time, reduce the n-gram data size by filtering out n-grams with frequency = 1.

  • Convert n-grams into data tables (much faster lookup than data frames).

  • In the future, more advanced data cleaning will be considered in the first step, such as removing stopwords or single words, and removing incomplete or duplicate sentences.

  • A smoothing algorithm such as Kneser-Ney smoothing could be incorporated into the model to improve accuracy.
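
The first two points above (pruning singletons and using a keyed table for lookup) can be illustrated with a short Python sketch. A Python dictionary stands in here for the keyed data table, which is an assumption for illustration; `prune` and `build_lookup` are hypothetical names.

```python
from collections import Counter, defaultdict

def prune(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times (frequency = 1 by default)."""
    return Counter({ng: c for ng, c in counts.items() if c >= min_count})

def build_lookup(counts):
    """Index n-grams by context so a prediction is a single keyed lookup."""
    table = defaultdict(list)
    for ngram, c in counts.items():
        table[ngram[:-1]].append((ngram[-1], c))
    for context in table:
        # most frequent first; ties broken alphabetically for stable output
        table[context].sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(table)
```

A keyed lookup avoids scanning every row for a matching context, which is the same reason a keyed table tends to be faster than filtering a plain data frame.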

Appendix

The Shiny app is available at https://cranberryly.shinyapps.io/capstone1/.

The code is available in the GitHub repository at https://github.com/cranberryly/Capstone.

Thank you!