Predict the next word...

Niels Hanson
April 2015

Capstone Project for the Johns Hopkins Coursera Data Science Specialization

  • Summary:
    • Naive Bayes Prediction Model
    • Shiny App
    • Future Work

Naive Bayes Prediction Model

  • The model is a tri-gram naive Bayes model that predicts the word \( C_k \) maximizing \[ p(C_k | x_1, \ldots, x_n) \propto p(C_k) \prod_{i=1}^n p(x_i|C_k) \]
    • where \( p(x_1 | C_k) \) and \( p(x_2 | C_k) \) are estimated from the bi-gram and tri-gram word frequencies observed in the News, Blogs, and Twitter datasets
  • The model is implemented using the naiveBayes() function of the e1071 package:
tri_nb <- naiveBayes(Y ~ X1 + X2, data = df_trigram)
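
A minimal, self-contained sketch of fitting and querying such a model with e1071 is shown below. It assumes the tri-gram data frame df_trigram has factor columns X1 and X2 (the two preceding words) and Y (the word that followed); the example data are illustrative only.

library(e1071)

# Illustrative tri-gram table: X1 and X2 are the two preceding words,
# Y is the word that followed them in the corpus (one row per observed tri-gram)
df_trigram <- data.frame(
  X1 = factor(c("i", "i", "thanks", "at", "i")),
  X2 = factor(c("love", "love", "for", "the", "want")),
  Y  = factor(c("you", "it", "help", "end", "to"))
)

# Fit the model: class priors come from the frequencies of Y,
# conditional tables p(X1 | Y) and p(X2 | Y) from the observed counts
tri_nb <- naiveBayes(Y ~ X1 + X2, data = df_trigram)

# Predict the most likely next word for a new two-word context
new_context <- data.frame(X1 = factor("i", levels = levels(df_trigram$X1)),
                          X2 = factor("love", levels = levels(df_trigram$X2)))
predict(tri_nb, new_context)               # most likely word
predict(tri_nb, new_context, type = "raw") # posterior probability of each word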

Shiny App

  • Given some input text, the Shiny app predicts and displays the most likely next word using the model
    • The input text must contain at least one word to give a valid prediction

Predict the next word Shiny App
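
A minimal sketch of how the app could be wired together is shown below, assuming the fitted tri_nb model from the previous slide has been saved as tri_nb.rds and that the last two words of the input are used as the X1/X2 context; the file and input names are illustrative, not the deployed app's actual code.

library(shiny)
library(e1071)

# Assumes a previously fitted model (see the previous slide) saved to disk
tri_nb <- readRDS("tri_nb.rds")

ui <- fluidPage(
  titlePanel("Predict the next word"),
  textInput("phrase", "Enter some text:", value = ""),
  h3(textOutput("prediction"))
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- strsplit(tolower(trimws(input$phrase)), "\\s+")[[1]]
    if (length(words) == 0 || words[1] == "") {
      return("Please enter at least one word.")
    }
    # Use the last two words as the tri-gram context; if only one word was
    # given, repeat it so both predictors are filled
    if (length(words) == 1) words <- c(words, words)
    ctx <- tail(words, 2)
    # Note: words unseen during training would need separate handling here
    as.character(predict(tri_nb, data.frame(X1 = ctx[1], X2 = ctx[2])))
  })
}

shinyApp(ui = ui, server = server)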

Visualization

  • Two dynamic visualizations based on the top-10 predicted words were implemented (sketched below):
  • Word Cloud: words are scaled by their model probability, with the top prediction in red (Package: wordcloud)
  • Bar Plot: shows the calculated probability of each of the top-10 predictions (Package: ggplot2)
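
A sketch of both plots, assuming the top-10 words and their probabilities are available in a small data frame (for instance from predict(..., type = "raw")); the words and probabilities below are made up for illustration.

library(wordcloud)
library(ggplot2)

# Illustrative top-10 predictions and their model probabilities
top10 <- data.frame(
  word = c("you", "it", "the", "to", "me", "be", "that", "a", "them", "him"),
  prob = c(0.22, 0.17, 0.13, 0.11, 0.09, 0.08, 0.07, 0.05, 0.04, 0.04)
)

# Word cloud: words scaled by probability, top prediction highlighted in red
wordcloud(words = top10$word, freq = top10$prob, min.freq = 0,
          colors = c("red", rep("grey30", nrow(top10) - 1)),
          ordered.colors = TRUE, random.order = FALSE)

# Bar plot of the top-10 probabilities
ggplot(top10, aes(x = reorder(word, prob), y = prob)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Predicted word", y = "Model probability")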

Future Work

  • The app's model could be improved by using backoff or interpolation models for n-grams (see the sketch after this list)
    • Backoff: use the trigram, then bigram, then unigram estimate, depending on availability
    • Interpolation: a linear combination of the trigram, bigram, and unigram probabilities
  • The app needs to handle unknown (previously unseen) words more gracefully as they are encountered
  • Better smoothing methods could use low counts to estimate probabilities for never-seen words
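
A minimal sketch of what backoff and interpolation could look like, assuming maximum-likelihood probabilities have been precomputed into named lookup vectors tri_prob, bi_prob, and uni_prob; the tables, keys, and interpolation weights below are illustrative only, not part of the current app.

# Illustrative n-gram probability lookups, keyed by the full n-gram
uni_prob <- c("you" = 0.03, "it" = 0.04, "the" = 0.06)
bi_prob  <- c("love you" = 0.30, "love it" = 0.20)
tri_prob <- c("i love you" = 0.45, "i love it" = 0.25)

lookup <- function(tbl, key) if (key %in% names(tbl)) tbl[[key]] else NA_real_

# Backoff: use the trigram estimate when available, otherwise fall back
# to the bigram, then the unigram estimate
backoff_prob <- function(w1, w2, w) {
  p <- lookup(tri_prob, paste(w1, w2, w))
  if (!is.na(p)) return(p)
  p <- lookup(bi_prob, paste(w2, w))
  if (!is.na(p)) return(p)
  lookup(uni_prob, w)
}

# Interpolation: weighted linear combination of all three estimates
# (the lambda weights would normally be tuned on held-out data)
interp_prob <- function(w1, w2, w, lambdas = c(0.6, 0.3, 0.1)) {
  ps <- c(lookup(tri_prob, paste(w1, w2, w)),
          lookup(bi_prob, paste(w2, w)),
          lookup(uni_prob, w))
  ps[is.na(ps)] <- 0
  sum(lambdas * ps)
}

backoff_prob("i", "love", "you")  # 0.45, straight from the trigram table
interp_prob("i", "love", "it")    # mixes the trigram, bigram, and unigram estimates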