Realigning the Yelp Stars

Shawn Estes


November 11, 2015


Coursera Capstone Project, Oct 7, 2015

Introduction

Star ratings are inherently flawed. Yelp suffers from the same problem as movie critics who give both Citizen Kane and Ghostbusters 5 stars.

When a Yelper looks to Yelp for ratings, what is useful data?

  • High star ratings are commonly referenced and used for justification. “It has 5 stars!”
  • Users rely on the star rating to make snap judgment calls about trying a new place.
  • This relationship matters as much to Yelp as it does to the businesses being reviewed.

Can we realign the star ratings based on how positive (or negative) a review is?

The Problem - Can we realign the stars?

Users would expect the average review to be 3.5 stars, but that's not the case at all.

Method and Data

Using the Yelp Academic Dataset, I preprocessed the data by reducing it to just the review text, removing punctuation, converting to lowercase, and removing stop words taken from a public corpus. After that I used the Porter stemming algorithm to reduce the remaining words to their stems.
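
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration and not the code used for the project: the stop-word list is a tiny hypothetical stand-in for the public corpus, and `crude_stem` is a placeholder suffix stripper, not the Porter algorithm. A real Porter stemmer could be passed in through the `stem` parameter.

```python
import string

# Tiny illustrative stop-word list (an assumption; the project used a public corpus).
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of", "in", "i", "this"}


def crude_stem(word):
    """Placeholder stemmer: strips a few common suffixes. NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=crude_stem, stop_words=STOP_WORDS):
    """Turn raw review text into a list of stemmed, stop-word-free tokens."""
    # Lowercase, then delete every ASCII punctuation character.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop stop words, and stem what remains.
    return [stem(tok) for tok in cleaned.split() if tok not in stop_words]
```

For example, `preprocess("The food was AMAZING, and I loved it!")` keeps only the content words of the review, lowercased and stemmed.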

Naive Bayes was the primary model

  • It was trained on the popular Opinion Lexicon, extended with the Porter-stemmed form of each word
  • The model is 70% accurate at identifying negative and positive words in a “bag of words”*

*70% sounds much worse than it is: human raters agree with each other on sentiment only 82% of the time.
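
To make the idea concrete, here is a minimal sketch of a Naive Bayes classifier trained on a sentiment lexicon and applied to a bag of words. It is an illustration under assumptions, not the project's model: each lexicon entry is treated as a one-word training document, add-one (Laplace) smoothing is used, words outside the lexicon are ignored, and the function names and the six-word lexicon in the usage note are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(lexicon):
    """Train on a dict mapping word -> 'pos' or 'neg'.

    Each entry is treated as a one-word training document (an assumption).
    Both labels must appear at least once in the lexicon.
    """
    word_counts = {"pos": Counter(), "neg": Counter()}
    for word, label in lexicon.items():
        word_counts[label][word] += 1

    vocab = set(lexicon)
    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for label, counts in word_counts.items():
        n = sum(counts.values())
        model["log_prior"][label] = math.log(n / len(lexicon))
        # Add-one smoothing so unseen (word, label) pairs keep a nonzero probability.
        denom = n + len(vocab)
        model["log_likelihood"][label] = {
            w: math.log((counts[w] + 1) / denom) for w in vocab
        }
    return model


def classify(model, tokens):
    """Return 'pos' or 'neg' for a bag of words; out-of-lexicon tokens are skipped."""
    scores = dict(model["log_prior"])
    for tok in tokens:
        if tok not in model["vocab"]:
            continue
        for label in scores:
            scores[label] += model["log_likelihood"][label][tok]
    return max(scores, key=scores.get)
```

With a toy lexicon such as `{"great": "pos", "love": "pos", "amaz": "pos", "bad": "neg", "terribl": "neg", "rude": "neg"}`, a bag like `["great", "food", "love"]` scores higher for the positive class, while `["rude", "bad", "great"]` scores higher for the negative class.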

Results

  • It is possible to realign the star ratings
  • The positive nature of the dataset pushes our potential outcomes to the extremes.
  • The realignment would risk alienating users: most 5-star ratings would move down to 2, 3 or 4 stars, and the rest of the reviews would drop to 1 star.
  • A better approach is possible by combining the public lexicon with internal training of the model on reviews that humans have rated 5 stars
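
As a purely hypothetical illustration of what "realigning" could look like, the sketch below maps the share of positive sentiment words in a review to a 1 to 5 star value using equal-width bins. The binning rule, the neutral fallback of 3 stars, and the function name are all assumptions made for this example; the slides do not specify the mapping that was actually used.

```python
def realigned_stars(tokens, positive_words, negative_words):
    """Map a review's share of positive sentiment words to a 1-5 star value.

    Hypothetical rule: equal-width bins over the positive share.
    """
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    if pos + neg == 0:
        # No sentiment evidence: fall back to the midpoint (an assumption).
        return 3
    share = pos / (pos + neg)
    # [0, 0.2) -> 1 star, [0.2, 0.4) -> 2, ..., [0.8, 1.0] -> 5.
    return min(5, int(share * 5) + 1)
```

Under this rule, a review whose sentiment words are all positive gets 5 stars, an evenly mixed review gets 3, and a review with only negative sentiment words gets 1.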