Text-based Sentiment Analysis and Classifier

John Slough
November 20, 2015

Capstone Project for Coursera Data Science Specialization

The Data

The data comes from the Yelp Dataset Challenge

  • 1.6M reviews and 500K tips by 366K users for 61K businesses
  • Each text-based review also has a star rating (1-5)
  • 481K business attributes, e.g., hours, parking availability, ambience
  • Social network of 366K users for a total of 2.9M social edges
  • Aggregated check-ins over time for each of the 61K businesses

The task: Identify a question or problem that you are interested in addressing with the data.

The question: Can we predict the sentiment of restaurant reviews (positive or negative) from the words in the text?

This kind of analysis applies well beyond restaurant reviews, for example to product feedback and social media posts.

The Model and Results

Natural Language Processing methods

  • Data processed with R
  • Bag of words model
  • Including emoticon analysis
  • Logistic regression using Dato's GraphLab module in Python
  • Accuracy of about 94% achieved on the testing dataset
  • Confusion Matrix (1 = positive sentiment)
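A minimal sketch of the pipeline listed above, assuming whitespace tokenization and a small hand-picked emoticon list. The emoticon sets, the feature names, the toy reviews, and the helper functions are all hypothetical, and the training loop is plain stochastic gradient descent standing in for GraphLab's solver rather than reproducing it.

```python
import math
import re
from collections import Counter

# Hypothetical emoticon lists; the original analysis may use different ones.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", ";)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":/"}


def extract_features(text):
    """Return bag-of-words counts plus two emoticon counts for one review."""
    features = Counter()
    for token in text.split():
        if token in POSITIVE_EMOTICONS:
            features["__pos_emoticon__"] += 1
        elif token in NEGATIVE_EMOTICONS:
            features["__neg_emoticon__"] += 1
        else:
            # Keep lowercase alphabetic words; punctuation and digits are dropped.
            features.update(re.findall(r"[a-z']+", token.lower()))
    return features


def predict_proba(weights, bias, features):
    """Probability that a review is positive (label 1)."""
    z = bias + sum(weights.get(k, 0.0) * v for k, v in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, learning_rate=0.1):
    """Fit logistic regression weights by stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict_proba(weights, bias, features) - label
            bias -= learning_rate * error
            for k, v in features.items():
                weights[k] = weights.get(k, 0.0) - learning_rate * error * v
    return weights, bias


# Toy reviews written for this sketch; they are not from the Yelp data.
reviews = [
    ("The food was great and the staff was friendly :)", 1),
    ("Amazing pizza , will come back :D", 1),
    ("Terrible service and cold food :(", 0),
    ("Rude staff , awful experience :(", 0),
]
examples = [(extract_features(text), label) for text, label in reviews]
weights, bias = train(examples)
```

Calling `predict_proba(weights, bias, extract_features(new_text))` scores an unseen review; a value above 0.5 would be read as positive sentiment.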

The answer: Yes, we can.

Going further: Can we predict the number of stars of restaurant reviews (1-5) from the words in the text?

The Multi-Class Model and Results

Natural Language Processing methods

  • Same features as Sentiment Analysis
  • Including emoticon analysis
  • Multinomial logistic regression using Dato's GraphLab module in Python
  • Accuracy of about 55% achieved on the testing dataset
  • Evidence of overfitting
  • Confusion Matrix
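As an illustration of how the accuracy and confusion matrix in the list above can be computed for five star classes, here is a short sketch. The ratings are made up for the example and are not results from the project; `confusion_matrix` and `accuracy` are hypothetical helper names.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up star ratings for illustration only.
actual = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
predicted = [1, 1, 3, 5, 5, 5, 4, 4, 3, 1]
stars = [1, 2, 3, 4, 5]

matrix = confusion_matrix(actual, predicted, stars)
for star, row in zip(stars, matrix):
    print(star, row)
print("accuracy:", accuracy(actual, predicted))  # prints "accuracy: 0.6"
```

Running the same `accuracy` function on the training set and on the testing set gives two numbers to compare; a training accuracy well above the testing accuracy is one common sign of overfitting.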

The answer: Yes, we can, but not as well.


Emoticon Analysis

Coefficients for sentiment analysis logistic regression model:

  • positive emoticons = 0.969
  • negative emoticons = -0.585
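  • signs are in the expected direction (positive emoticons raise the predicted odds of positive sentiment, negative emoticons lower them)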
  • similar results for multi-class model
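To make the coefficients above easier to read, they can be converted to odds ratios, since a logistic regression coefficient is a change in log-odds per unit of the feature. The sketch below assumes each emoticon feature is an unscaled per-review count, which the slides do not state.

```python
import math

# Coefficients reported above for the sentiment model.
coefficients = {"positive emoticons": 0.969, "negative emoticons": -0.585}

# exp(coefficient) is the factor by which the odds of positive sentiment
# change for a one-unit increase in the feature (assumed: one more emoticon).
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.2f}")
# positive emoticons: odds multiplied by 2.64
# negative emoticons: odds multiplied by 0.56
```

Under that assumption, one extra positive emoticon multiplies the odds of a positive review by about 2.6, and one extra negative emoticon roughly halves them, all other words held fixed.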

For more information on this part of the analysis, check out the R code and the IPython notebook. Also be sure to check out the full report.

Conclusion

  • Successfully predicted the sentiment of Yelp reviews from the text.
  • Less successfully predicted star ratings of Yelp reviews from the text.

For a more detailed look at the code and analysis, go to the GitHub repository, and check out the most frequent n-grams.