Capstone Project of the Johns Hopkins Coursera Data Science Specialization
Summary:
Naive Bayes Prediction Model
Shiny App
Future Work
Naive Bayes Prediction Model
The model is a tri-gram Naive Bayes model that predicts the word \( C_k \) maximizing the posterior probability
\[
p(C_k| x_1, \ldots, x_n) \propto p(C_k) \prod_{i=1}^n p(x_i|C_k)
\]
where \( p(x_1 | C_k) \) and \( p(x_2 | C_k) \) are estimated from the bi-gram and tri-gram word frequencies observed in the News, Blogs, and Twitter datasets
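For categorical predictors, e1071 estimates such conditionals as relative frequencies; assuming \( x_1 \) denotes the word immediately preceding the prediction, this amounts to a ratio of bi-gram to unigram counts:
\[
p(x_1 | C_k) \approx \frac{\mathrm{count}(x_1, C_k)}{\mathrm{count}(C_k)}
\]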
The model is implemented using the naiveBayes() function of the e1071 package
# fit Naive Bayes on the trigram table: (two preceding words X1, X2) -> next word Y
tri_nb <- naiveBayes(Y ~ X1 + X2, data = df_trigram)
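As a usage sketch (the column names X1 and X2 for the two preceding words carry over from the fit above; the specific input words are hypothetical), the fitted model can be queried with predict() for posterior probabilities:

# posterior probabilities for every candidate next word,
# given the two preceding words "one" and "of" (hypothetical input)
probs <- predict(tri_nb, newdata = data.frame(X1 = "one", X2 = "of"), type = "raw")
head(sort(probs[1, ], decreasing = TRUE), 10)  # top-10 candidate words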
Shiny App
Given some input text, the Shiny app uses the model to predict and display the most likely next word
Input text must contain at least one word for a valid prediction
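A minimal sketch of the app skeleton, assuming the fitted tri_nb model from the previous slide; the preprocessing helper last_two_words() is hypothetical, not the deployed code:

library(shiny)

# hypothetical helper: lower-case the input and keep its last two words
last_two_words <- function(txt) {
  words <- tolower(strsplit(trimws(txt), "\\s+")[[1]])
  tail(words, 2)
}

ui <- fluidPage(
  textInput("text", "Enter some text:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(nchar(trimws(input$text)) > 0)  # at least one word is required
    w <- last_two_words(input$text)
    # single-word input would need a bigram fallback (see Future Work)
    req(length(w) == 2)
    as.character(predict(tri_nb, newdata = data.frame(X1 = w[1], X2 = w[2])))
  })
}

shinyApp(ui = ui, server = server)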
Visualization
Two dynamic visualizations based on the top-10 predicted words were implemented (sketched below):
Word Cloud: words are scaled by their model probability, with the top prediction in red (Package: wordcloud)
Bar Plot: shows the calculated probabilities of the top-10 predictions (Package: ggplot2)
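Both plots can be driven by a named probability vector; a minimal sketch, assuming top10 holds the sorted top-10 posterior probabilities from predict() above (names = words):

library(wordcloud)
library(ggplot2)

# top10: named numeric vector of the ten highest posterior probabilities
cols <- c("red", rep("grey30", length(top10) - 1))  # top prediction in red
wordcloud(words = names(top10), freq = top10,
          min.freq = 0, random.order = FALSE,
          colors = cols, ordered.colors = TRUE, scale = c(4, 0.5))

df <- data.frame(word = reorder(names(top10), top10),
                 prob = as.numeric(top10))
ggplot(df, aes(x = word, y = prob)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Predicted probability")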
Future Work
The app's model could be improved by using backoff or interpolation models for n-grams (a sketch follows this list):
Backoff: use the trigram, then bigram, then unigram prediction, depending on availability
Interpolation: a linear combination of the trigram, bigram, and unigram probabilities
Unknown words need to be handled better as they arrive.
Better smoothing methods that use small counts to estimate the probability of never-seen words.
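A minimal sketch of both ideas, assuming hypothetical lookup tables p3, p2, p1 (lists mapping n-gram keys to probabilities; not the app's actual data structures):

# hypothetical probability lookup: returns NA when the n-gram is unseen
lookup <- function(tbl, key) if (!is.null(tbl[[key]])) tbl[[key]] else NA

# backoff: fall through from trigram to bigram to unigram
predict_backoff <- function(w1, w2, cand, p3, p2, p1) {
  p <- lookup(p3, paste(w1, w2, cand))
  if (is.na(p)) p <- lookup(p2, paste(w2, cand))
  if (is.na(p)) p <- lookup(p1, cand)
  p
}

# interpolation: mix all three estimates with fixed weights
predict_interp <- function(w1, w2, cand, p3, p2, p1,
                           lambda = c(0.6, 0.3, 0.1)) {
  probs <- c(lookup(p3, paste(w1, w2, cand)),
             lookup(p2, paste(w2, cand)),
             lookup(p1, cand))
  probs[is.na(probs)] <- 0  # unseen n-grams contribute nothing
  sum(lambda * probs)
}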