Sentiment Analysis of Restaurant Reviews

This is my Capstone Project for the Coursera Data Science Specialization from Johns Hopkins University.

For this project, we used the Yelp Dataset Challenge data to choose a question of interest and try to answer it with our data science skills.

Can we predict the sentiment of a restaurant review?

Because the Yelp reviews data provides both the star rating and the text of each review, I wanted to know whether a machine learning model trained on this data could identify a business review as positive or negative based only on its text.
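To make the idea of turning star ratings into training labels concrete, here is a minimal sketch in Python. The thresholds are an assumption for illustration only (4 to 5 stars counted as positive, 1 to 2 stars as negative, 3 stars dropped as ambiguous); the project does not state which cutoff was actually used, and the sample reviews are made up.

```python
def label_from_stars(stars):
    """Map a star rating to a sentiment label.

    ASSUMPTION: these cutoffs are illustrative, not the project's
    documented rule.
    """
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # 3-star reviews are treated as ambiguous in this sketch


# Hypothetical sample records shaped like (stars, text) pairs.
sample_reviews = [
    {"stars": 5, "text": "Amazing tacos and friendly staff."},
    {"stars": 1, "text": "Cold food and rude service."},
    {"stars": 3, "text": "It was okay, nothing special."},
]

# Keep only reviews that received a clear label.
labeled = [
    (review["text"], label_from_stars(review["stars"]))
    for review in sample_reviews
    if label_from_stars(review["stars"]) is not None
]

for text, label in labeled:
    print(label, "|", text)
```

With labels derived this way, the star rating is used only to build the training target, and the model itself sees nothing but the review text.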

The Dataset

The Yelp Dataset Challenge reviews dataset contains 1,569,264 business reviews.
The most prominent category for reviews is Restaurants with 990,627 restaurant reviews. I used this category for my model.
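As a rough illustration of how the restaurant subset could be selected, the sketch below joins reviews to businesses by ID and keeps those whose categories include "Restaurants". The field names (`business_id`, `categories`) and the in-memory sample records are assumptions made for the example; the real dataset files would need to be loaded first. The last lines simply work out the restaurant share from the two counts quoted above.

```python
# Hypothetical sample records; the field names are assumed for illustration.
businesses = [
    {"business_id": "b1", "categories": ["Restaurants", "Mexican"]},
    {"business_id": "b2", "categories": ["Hair Salons"]},
    {"business_id": "b3", "categories": None},
]
reviews = [
    {"business_id": "b1", "stars": 5, "text": "Great burritos."},
    {"business_id": "b2", "stars": 4, "text": "Nice haircut."},
    {"business_id": "b1", "stars": 2, "text": "Too salty."},
]

# Collect the IDs of businesses tagged as restaurants.
restaurant_ids = {
    b["business_id"]
    for b in businesses
    if "Restaurants" in (b.get("categories") or [])
}

# Keep only the reviews that belong to those businesses.
restaurant_reviews = [r for r in reviews if r["business_id"] in restaurant_ids]
print(len(restaurant_reviews), "restaurant reviews in the sample")

# Share of restaurant reviews, using the counts reported in the text.
share = 990_627 / 1_569_264
print(f"Restaurant share of all reviews: {share:.1%}")  # 63.1%
```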

Methodology used

To train a model that predicts the sentiment of a restaurant review, I:

Used a Naive Bayes algorithm to build a binary classification model.
Created a document-term matrix with all the words used in the reviews.
Removed the least frequently used words.
Trained several models with different n-grams.
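The steps above can be sketched in a few dozen lines of plain Python. This is not the project's original code: it is a minimal multinomial Naive Bayes with add-one smoothing, written under the assumptions that tokens are lowercase words, that "least frequently used" means terms appearing in fewer than `min_docs` documents, and that the four toy reviews stand in for real data. All function names here are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def ngrams(text, n=1):
    """Split text into lowercase word n-grams (n=1 gives unigrams)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dtm(texts, n=1, min_docs=2):
    """Build a sparse document-term matrix and drop rare terms.

    Each row is a Counter of term counts for one document. Terms that
    appear in fewer than min_docs documents are removed (assumed rule).
    """
    rows = [Counter(ngrams(t, n)) for t in texts]
    doc_freq = Counter(term for row in rows for term in row)
    vocab = {term for term, df in doc_freq.items() if df >= min_docs}
    rows = [Counter({t: c for t, c in row.items() if t in vocab}) for row in rows]
    return vocab, rows


def train_nb(rows, labels, vocab):
    """Estimate log priors and add-one-smoothed log likelihoods per class."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        term_counts[label].update(row)
    model = {}
    for label in class_docs:
        total = sum(term_counts[label].values())
        model[label] = {
            "prior": math.log(class_docs[label] / len(labels)),
            "loglik": {
                t: math.log((term_counts[label][t] + 1) / (total + len(vocab)))
                for t in vocab
            },
        }
    return model


def predict(model, text, n=1):
    """Return the class with the highest log posterior for the text."""
    counts = Counter(ngrams(text, n))
    scores = {}
    for label, params in model.items():
        score = params["prior"]
        for term, count in counts.items():
            if term in params["loglik"]:  # terms outside the vocabulary are ignored
                score += count * params["loglik"][term]
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data, invented for the example.
texts = [
    "great food and great service",
    "the food was great",
    "terrible food and slow service",
    "the service was terrible",
]
labels = ["positive", "positive", "negative", "negative"]

vocab, rows = build_dtm(texts, n=1, min_docs=2)
model = train_nb(rows, labels, vocab)
print(predict(model, "great food"))
print(predict(model, "terrible service"))
```

Passing `n=2` to both `build_dtm` and `predict` would give a bigram variant, which is one way to compare models built on different n-grams.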

Results

The model with the best performance was a unigram model, based on a subset of 4,954 rows from the original reviews dataset. This model had an out-of-sample accuracy of 87%.
It is possible to train a model to determine the sentiment of a restaurant review with relatively high accuracy using the Yelp reviews dataset as a training set.
A restaurant could use this model to interpret the thousands of tweets, Facebook posts, blog posts, or articles posted online and get a sense of whether people are saying positive or negative things about its establishment.
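Out-of-sample accuracy, as reported above, is the share of correct predictions on reviews that were held out of training. The sketch below shows one common way to compute it, assuming a simple random holdout split; the 30% test fraction, the seed, and the function names are illustrative choices, not the project's documented setup.

```python
import random


def train_test_split(items, test_fraction=0.3, seed=42):
    """Shuffle items and split them into (train, test) lists.

    ASSUMPTION: a single random holdout split; the test fraction and
    seed are illustrative defaults.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Split ten placeholder rows into training and held-out sets.
train, test = train_test_split(range(10))
print(len(train), "training rows,", len(test), "held-out rows")

# Compare hypothetical predictions against hypothetical true labels.
predicted = ["positive", "negative", "positive", "positive"]
actual = ["positive", "negative", "negative", "positive"]
print(accuracy(predicted, actual))  # 0.75
```

Because the held-out rows never influence training, this figure estimates how the model would behave on reviews it has not seen.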