Predict the next word...

Niels Hanson
April 2015

Capstone Project for the Johns Hopkins Coursera Data Science Specialization

  • Summary:
    • Naive Bayes Prediction Model
    • Shiny App
    • Future Work

Naive Bayes Prediction Model

  • The model is a tri-gram naive Bayes model that predicts the word \( C_k \) maximizing \[ p(C_k | x_1, \ldots, x_n) \propto p(C_k) \prod_{i=1}^n p(x_i|C_k) \]
    • where \( p(x_1 | C_k) \) and \( p(x_2 | C_k) \) are estimated from the bi-gram and tri-gram word frequencies observed in the News, Blogs, and Twitter datasets
  • The model is implemented using the naiveBayes() function of the e1071 package:
tri_nb <- naiveBayes(Y ~ X1 + X2, data = df_trigram)
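
A minimal, self-contained sketch of fitting and querying such a model with e1071 is shown below. It assumes the tri-gram data frame df_trigram has factor columns X1 and X2 (the two preceding words) and Y (the word that followed); the example data are illustrative only.

library(e1071)

# Illustrative tri-gram table: X1 and X2 are the two preceding words,
# Y is the word that followed them in the corpus (one row per observed tri-gram)
df_trigram <- data.frame(
  X1 = factor(c("i", "i", "thanks", "at", "i")),
  X2 = factor(c("love", "love", "for", "the", "want")),
  Y  = factor(c("you", "it", "help", "end", "to"))
)

# Fit the model: class priors come from the frequencies of Y,
# conditional tables p(X1 | Y) and p(X2 | Y) from the observed counts
tri_nb <- naiveBayes(Y ~ X1 + X2, data = df_trigram)

# Predict the most likely next word for a new two-word context
new_context <- data.frame(X1 = factor("i", levels = levels(df_trigram$X1)),
                          X2 = factor("love", levels = levels(df_trigram$X2)))
predict(tri_nb, new_context)               # most likely word
predict(tri_nb, new_context, type = "raw") # posterior probability of each word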

Shiny App

  • Given some input text, the Shiny app predicts and displays the most likely next word using the model
    • The input text must contain at least one word to give a valid prediction

Predict the next word Shiny App
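
A minimal sketch of how the app could be wired together is shown below, assuming the fitted tri_nb model from the previous slide has been saved as tri_nb.rds and that the last two words of the input are used as the X1/X2 context; the file and input names are illustrative, not the deployed app's actual code.

library(shiny)
library(e1071)

# Assumes a previously fitted model (see the previous slide) saved to disk
tri_nb <- readRDS("tri_nb.rds")

ui <- fluidPage(
  titlePanel("Predict the next word"),
  textInput("phrase", "Enter some text:", value = ""),
  h3(textOutput("prediction"))
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- strsplit(tolower(trimws(input$phrase)), "\\s+")[[1]]
    if (length(words) == 0 || words[1] == "") {
      return("Please enter at least one word.")
    }
    # Use the last two words as the tri-gram context; if only one word was
    # given, repeat it so both predictors are filled
    if (length(words) == 1) words <- c(words, words)
    ctx <- tail(words, 2)
    # Note: words unseen during training would need separate handling here
    as.character(predict(tri_nb, data.frame(X1 = ctx[1], X2 = ctx[2])))
  })
}

shinyApp(ui = ui, server = server)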

Visualization

  • Two dynamic visualizations based on the top-10 predicted words were implemented (sketched below):
  • Word Cloud: words are scaled by their model probability, with the top prediction in red (Package: wordcloud)
  • Bar Plot: shows the calculated probability of each of the top-10 predictions (Package: ggplot2)
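
A sketch of both plots, assuming the top-10 words and their probabilities are available in a small data frame (for instance from predict(..., type = "raw")); the words and probabilities below are made up for illustration.

library(wordcloud)
library(ggplot2)

# Illustrative top-10 predictions and their model probabilities
top10 <- data.frame(
  word = c("you", "it", "the", "to", "me", "be", "that", "a", "them", "him"),
  prob = c(0.22, 0.17, 0.13, 0.11, 0.09, 0.08, 0.07, 0.05, 0.04, 0.04)
)

# Word cloud: words scaled by probability, top prediction highlighted in red
wordcloud(words = top10$word, freq = top10$prob, min.freq = 0,
          colors = c("red", rep("grey30", nrow(top10) - 1)),
          ordered.colors = TRUE, random.order = FALSE)

# Bar plot of the top-10 probabilities
ggplot(top10, aes(x = reorder(word, prob), y = prob)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Predicted word", y = "Model probability")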

Future Work

  • The app's model could be improved by using backoff or interpolation models for n-grams (see the sketch after this list)
    • Backoff: use the trigram, then bigram, then unigram estimate, depending on availability
    • Interpolation: a linear combination of the trigram, bigram, and unigram probabilities
  • The app needs to handle unknown (previously unseen) words more gracefully as they are encountered
  • Better smoothing methods could use low counts to estimate probabilities for never-seen words
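
A minimal sketch of what backoff and interpolation could look like, assuming maximum-likelihood probabilities have been precomputed into named lookup vectors tri_prob, bi_prob, and uni_prob; the tables, keys, and interpolation weights below are illustrative only, not part of the current app.

# Illustrative n-gram probability lookups, keyed by the full n-gram
uni_prob <- c("you" = 0.03, "it" = 0.04, "the" = 0.06)
bi_prob  <- c("love you" = 0.30, "love it" = 0.20)
tri_prob <- c("i love you" = 0.45, "i love it" = 0.25)

lookup <- function(tbl, key) if (key %in% names(tbl)) tbl[[key]] else NA_real_

# Backoff: use the trigram estimate when available, otherwise fall back
# to the bigram, then the unigram estimate
backoff_prob <- function(w1, w2, w) {
  p <- lookup(tri_prob, paste(w1, w2, w))
  if (!is.na(p)) return(p)
  p <- lookup(bi_prob, paste(w2, w))
  if (!is.na(p)) return(p)
  lookup(uni_prob, w)
}

# Interpolation: weighted linear combination of all three estimates
# (the lambda weights would normally be tuned on held-out data)
interp_prob <- function(w1, w2, w, lambdas = c(0.6, 0.3, 0.1)) {
  ps <- c(lookup(tri_prob, paste(w1, w2, w)),
          lookup(bi_prob, paste(w2, w)),
          lookup(uni_prob, w))
  ps[is.na(ps)] <- 0
  sum(lambdas * ps)
}

backoff_prob("i", "love", "you")  # 0.45, straight from the trigram table
interp_prob("i", "love", "it")    # mixes the trigram, bigram, and unigram estimates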