Sentiment Analysis of the Yelp Reviews Dataset

David Manero
22nd November 2015

Introduction

The goal of this project is to study whether the words a restaurant customer uses in a review can predict the score (in stars) that he or she gives to the restaurant.

  • Study the language used in the reviews
  • Build a model to predict the score from a sentiment analysis of the words
  • Predict a score from a review

Method

This project uses several methods to study the reviews and the words used in them.

  1. Subset the reviews by score (number of stars).
  2. Study the most frequently used words in each category (number of stars).
  3. Build a dictionary of the most positive and most negative words.
  4. Compare several prediction methods to find the most accurate one for this case.
  5. Build a predictive algorithm to guess the score of a new review.
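
Steps 1 and 2 above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code used in this project; the sample reviews, the `tokenize` helper and the function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample data: (stars, review text) pairs, invented for illustration.
SAMPLE_REVIEWS = [
    (5, "Great food and great service"),
    (5, "Delicious food, friendly staff"),
    (1, "Terrible service and cold food"),
    (1, "Rude staff, terrible experience"),
]


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts_by_stars(reviews):
    """Group reviews by star score (step 1) and count words per group (step 2)."""
    counts = defaultdict(Counter)
    for stars, text in reviews:
        counts[stars].update(tokenize(text))
    return counts


counts = word_counts_by_stars(SAMPLE_REVIEWS)
for stars in sorted(counts):
    # Show the three most frequent words for each star category.
    print(stars, counts[stars].most_common(3))
```

On real data a stop-word filter would probably be needed before counting, since words such as "and" or "the" dominate every category.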

Results

The best algorithm found for the prediction is Random Forest.

The models studied are:

  • Generalized Linear Model (“glm”). Accuracy = 0.72
  • Random Forest (“rf”). Accuracy = 0.87
  • Linear Model (“lm”). Accuracy = 0.73
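
This report does not state how the accuracy values were computed. One common definition is the fraction of reviews whose predicted score exactly matches the real one. The sketch below assumes that definition, and also assumes that a model producing continuous output (such as a linear model) is rounded to the nearest whole star before comparison. The function names and the sample values are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that exactly match the true star score."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


def to_stars(value):
    """Round a continuous prediction to the nearest whole star in 1..5."""
    return min(5, max(1, int(value + 0.5)))


# Hypothetical values, invented for illustration only.
actual = [5, 1, 4, 2]
continuous = [4.6, 1.2, 3.4, 2.1]
predicted = [to_stars(v) for v in continuous]
print(predicted, accuracy(predicted, actual))
```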

Discussion

In this study, word frequency analysis must be used with care: the most frequent words are common to both positive and negative reviews, so only the words unique to each group should be counted (our bestword analysis).
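
The idea of keeping only the unique words can be sketched as below. This is an assumed reconstruction of the approach in Python, not the project's actual code; the word counts, `unique_words` and `sentiment_features` are hypothetical names and values chosen for illustration.

```python
from collections import Counter

# Hypothetical word counts for low-score and high-score reviews.
negative_counts = Counter({"food": 40, "service": 35, "terrible": 12, "rude": 9})
positive_counts = Counter({"food": 50, "service": 30, "delicious": 15, "friendly": 11})


def unique_words(own, other):
    """Return the words that appear in `own` but never in `other`."""
    return {word for word in own if word not in other}


positive_words = unique_words(positive_counts, negative_counts)
negative_words = unique_words(negative_counts, positive_counts)


def sentiment_features(tokens):
    """Count how many tokens fall in the positive and negative word lists."""
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    return pos, neg


print(sorted(positive_words))   # ['delicious', 'friendly']
print(sorted(negative_words))   # ['rude', 'terrible']
print(sentiment_features(["the", "food", "was", "delicious"]))  # (1, 0)
```

Shared words such as "food" and "service" drop out of both lists, so they no longer influence the features passed to the prediction model.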

More prediction algorithms could have been compared, but the size of the dataset made this impractical on my computer.

Nevertheless, the accuracy of the selected final model (Random Forest, 0.87) is sufficient for the purpose of this study.