Text-based Sentiment Analysis and Classifier

John Slough
November 20, 2015

Capstone Project for Coursera Data Science Specialization

The data comes from the Yelp Dataset Challenge

1.6M reviews and 500K tips by 366K users for 61K businesses

Each text-based review also has a star rating (1-5)

481K business attributes, e.g., hours, parking availability, ambience

Social network of 366K users for a total of 2.9M social edges

The task: Identify a question or problem that you are interested in addressing with the data.

The question: Can we predict the sentiment of restaurant reviews (positive or negative) from the words in the text?

This kind of analysis applies well beyond restaurants, for example to product reviews, social media posts, and customer feedback.

Natural Language Processing methods

The answer: Yes, we can.
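
To make the idea concrete, here is a minimal sketch of a bag-of-words logistic regression classifier, the model family named later in this deck. It is not the project's actual code: the toy reviews, the whitespace tokenizer, the function names, and the training settings (epochs, learning rate) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy reviews invented for illustration; they are NOT from the Yelp data.
# Label 1 = positive, 0 = negative.
TRAIN = [
    ("great food and friendly service", 1),
    ("delicious pizza great atmosphere", 1),
    ("friendly staff and delicious desserts", 1),
    ("terrible food and rude service", 0),
    ("awful pizza terrible atmosphere", 0),
    ("rude staff and awful desserts", 0),
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, text):
    """Probability that a review is positive under the bag-of-words model."""
    counts = Counter(tokenize(text))
    z = bias + sum(weights.get(w, 0.0) * c for w, c in counts.items())
    return sigmoid(z)

def train(examples, epochs=200, lr=0.5):
    """Fit one weight per word by stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            error = predict_proba(weights, bias, text) - label
            for w, c in Counter(tokenize(text)).items():
                weights[w] = weights.get(w, 0.0) - lr * error * c
            bias -= lr * error
    return weights, bias

weights, bias = train(TRAIN)
```

Each learned weight plays the role of a model coefficient: words seen only in positive reviews (such as "great" here) end up with positive weights, and words seen only in negative reviews end up with negative ones. Calling `predict_proba(weights, bias, "great friendly delicious")` gives the estimated probability that a new review is positive.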

Going further: Can we predict the number of stars of restaurant reviews (1-5) from the words in the text?

Natural Language Processing methods

The answer: Yes, we can, but not as well.

Emoticon Analysis

[Figure: coefficients of the logistic regression model for sentiment analysis]

For more information on this part of the analysis, see the R code and the IPython notebook. The full report covers it in more detail.
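
As a rough illustration of the first step in an emoticon analysis, the sketch below pulls emoticons out of review text with a regular expression and counts them. The pattern covers only a small hand-picked set of emoticons, and the example reviews are invented; the list used in the original analysis is not given here, so treat both as assumptions.

```python
import re
from collections import Counter

# Hypothetical pattern: eyes (: ; =), an optional nose (-), and a few mouths.
# The emoticon list in the original analysis may well differ.
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]")

def count_emoticons(text):
    """Return a Counter of the emoticons found in one review."""
    return Counter(EMOTICON_RE.findall(text))

# Invented example reviews, not taken from the Yelp data.
reviews = [
    "Loved the tacos :) will be back :-)",
    "Cold fries and slow service :(",
]

for review in reviews:
    print(count_emoticons(review))
```

Counts like these could then be used as extra features alongside word counts. A pattern this simple will also match some non-emoticon text (for example "Note:Pizza" contains ":P"), so a real pipeline would likely need a stricter rule.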

Conclusion

For a more detailed look at the code and analysis, go to the GitHub repository, and check out the most frequent n-grams.
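
For readers unfamiliar with n-grams, the sketch below shows one simple way to count the most frequent ones in a set of texts. The function names, the whitespace tokenizer, and the sample texts are assumptions for illustration and are not taken from the project's repository.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def most_frequent_ngrams(texts, n, k):
    """Count n-grams across all texts and return the k most common."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts.most_common(k)

# Invented sample texts, not from the Yelp data.
sample = ["great food", "great food here", "bad food"]
print(most_frequent_ngrams(sample, n=2, k=1))
```

With `n=1` the same function counts single words, and with `n=2` or `n=3` it counts two-word and three-word phrases.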