RG Capstone Presentation for Data Science Certification

R Gehring

Background of Project

Problem: What is the next word?

Data: Samples from Twitter, blogs, and news articles

Solution: Using natural language processing and the supplied data, predict the third word in a sequence of words.

Data Cleaning

Before any modeling, I cleaned the data using R's tm package.

  • Removed punctuation
  • Removed swear words
  • Removed numbers
  • Removed stopwords such as “the” and “a”
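
As a rough illustration of the cleaning steps above, here is a minimal Python sketch. The project itself used R's tm package, so this is not the original code. The word lists are hypothetical and abbreviated, and lowercasing is an added assumption that is not in the list above.

```python
import re
import string

# Hypothetical, abbreviated word lists, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}
PROFANITY = {"darn", "heck"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Return a list of cleaned tokens from a raw string."""
    text = text.lower()                # assumption: normalise case first
    text = re.sub(r"\d+", " ", text)   # remove numbers
    text = text.translate(PUNCT_TABLE) # remove punctuation
    tokens = text.split()
    # Drop swear words and stopwords.
    return [t for t in tokens if t not in STOPWORDS and t not in PROFANITY]


cleaned = clean_text("The 3 leaders, around the country, said: darn!")
print(cleaned)  # ['leaders', 'around', 'country', 'said']
```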

I created a milestone report in which I looked at word clouds, word associations, and n-grams (n = 2, 3, and 4).
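
For readers unfamiliar with n-grams, the sketch below shows one way to count them in Python. It is an illustration only, not the code behind the milestone report; the token list is made up and the helper name `ngrams` is hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up, already-cleaned tokens.
tokens = ["leaders", "around", "country", "met", "leaders", "around", "world"]

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(1))  # [(('leaders', 'around'), 2)]
```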

I tried the model with and without stopwords; it produced more meaningful results with stopwords removed.

My Model

Given two words, my model predicts the third word. The accuracy of the model fluctuates between 50% and 70% depending on the sample.

How does the model work?

  • I took my n-grams (3 words) and expanded them with respect to frequency. I parsed out the words and called the first word X1, the second word X2, and the third word Y.
  • The main idea behind the model is: Given two words, X1 and X2, what is the third word, Y?
  • My algorithm is Naive Bayes from the e1071 library. This algorithm computes the conditional probabilities of a categorical class variable given independent predictor variables using Bayes' rule.
  • The Shiny application is at: https://renugehring.shinyapps.io/DataProducts/
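
The idea in the bullets above can be sketched in a few lines of Python. This is a hand-rolled categorical Naive Bayes, not the e1071 implementation used in the project. The trigram counts are invented toy data, the function names are hypothetical, and the add-one smoothing (`alpha`) is an assumption made so that unseen words do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train(trigram_counts):
    """Collect the counts needed for P(Y), P(X1 | Y) and P(X2 | Y)."""
    y_counts = Counter()
    x1_counts = defaultdict(Counter)
    x2_counts = defaultdict(Counter)
    vocab1, vocab2 = set(), set()
    for (x1, x2, y), freq in trigram_counts.items():
        # Adding freq is equivalent to repeating the row freq times.
        y_counts[y] += freq
        x1_counts[y][x1] += freq
        x2_counts[y][x2] += freq
        vocab1.add(x1)
        vocab2.add(x2)
    return y_counts, x1_counts, x2_counts, vocab1, vocab2


def predict(model, x1, x2, alpha=1.0):
    """Return the Y that maximises P(Y) * P(X1 | Y) * P(X2 | Y)."""
    y_counts, x1_counts, x2_counts, vocab1, vocab2 = model
    total = sum(y_counts.values())
    best_word, best_score = None, -math.inf
    for y, n_y in y_counts.items():
        # Work in log space; alpha is the smoothing constant (an assumption).
        score = (
            math.log(n_y / total)
            + math.log((x1_counts[y][x1] + alpha) / (n_y + alpha * len(vocab1)))
            + math.log((x2_counts[y][x2] + alpha) / (n_y + alpha * len(vocab2)))
        )
        if score > best_score:
            best_word, best_score = y, score
    return best_word


# Invented toy counts: (X1, X2, Y) -> frequency.
toy_trigrams = {
    ("leaders", "around", "country"): 3,
    ("leaders", "around", "world"): 1,
    ("people", "around", "world"): 1,
    ("happy", "new", "year"): 4,
}

model = train(toy_trigrams)
print(predict(model, "leaders", "around"))  # country
print(predict(model, "happy", "new"))       # year
```

Because Naive Bayes treats X1 and X2 as independent given Y, it can still score word pairs that never appeared together in the training trigrams, which a plain trigram lookup table cannot do.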

Try it

I typed in “leaders around” and got a prediction of “country”.

To test-drive the model, enter two words (no stopwords, please) and you will get a third word as the prediction.
