Realigning the Yelp stars

Shawn Estes
11/1/2015

Introduction

Star ratings are inherently flawed. Yelp suffers from the same problem as movie critics who give both Citizen Kane and Ghostbusters 5 stars.

When a Yelper looks to Yelp for ratings, what is useful data?

  • High star ratings are commonly referenced and used for justification. “It has 5 stars!”
  • Users rely on the star rating to make snap judgments about trying a new place.
  • This relationship is as important for Yelp as it is for the businesses being reviewed.

Can we realign the star ratings based on how positive (or negative) a review is?

The Problem - Can we realign the stars?

Users would expect the average review to be 3.5 stars, but that is not the case at all.

Method and Data

Using the Yelp Academic Dataset, I preprocessed the data by reducing it to just the review text, removing punctuation, converting to lowercase, and removing stop words taken from a public corpus. I then used the Porter stemming algorithm to reduce the remaining words to their stems.
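The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the original code: the `preprocess` function and the tiny `STOP_WORDS` set are hypothetical stand-ins (the actual stop-word corpus is not named here), and stemming is left as a pluggable `stem` callable that defaults to doing nothing.

```python
import string

# Hypothetical, tiny stop-word list for illustration only; the original
# analysis used a public stop-word corpus that is not named in the text.
STOP_WORDS = frozenset({"a", "an", "and", "the", "is", "was", "it", "to", "of"})

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stop_words=STOP_WORDS, stem=lambda word: word):
    """Return the cleaned, stemmed tokens of one review.

    Steps: strip punctuation, lowercase, split on whitespace,
    drop stop words, then apply `stem` to each remaining token.
    `stem` defaults to the identity function (an assumption made so
    the sketch has no third-party dependencies).
    """
    cleaned = text.translate(_PUNCT_TABLE).lower()
    return [stem(token) for token in cleaned.split() if token not in stop_words]


tokens = preprocess("The food was GREAT, and the service is fast!")
print(tokens)
```

To apply Porter stemming, a stemmer can be passed in, for example `stem=nltk.stem.PorterStemmer().stem` if NLTK is installed.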

Naive Bayes was the primary model

  • It was trained on the popular Opinion Lexicon, extended with the Porter-stemmed forms of its words.
  • The model is 70% accurate at identifying negative and positive words in a “bag of words”.*

*70% sounds much worse than it is. Humans agree with one another on sentiment only 82% of the time.
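As a rough illustration of how a Naive Bayes classifier can be trained from a sentiment lexicon, the sketch below treats each lexicon word as a one-word training example for its class and scores a bag of words with Laplace smoothing. The class name `LexiconNaiveBayes`, the four-word toy lexicons, and the choice to ignore out-of-lexicon tokens are all assumptions made for the example, not details of the original model.

```python
import math
from collections import Counter


class LexiconNaiveBayes:
    """Toy multinomial Naive Bayes trained on positive/negative word lists."""

    def __init__(self, positive_words, negative_words, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant
        self.counts = {
            "positive": Counter(positive_words),
            "negative": Counter(negative_words),
        }
        self.vocab = set(positive_words) | set(negative_words)
        total = sum(sum(c.values()) for c in self.counts.values())
        # Class prior: share of lexicon entries belonging to each class.
        self.log_prior = {
            label: math.log(sum(c.values()) / total)
            for label, c in self.counts.items()
        }

    def _log_likelihood(self, word, label):
        counts = self.counts[label]
        numerator = counts[word] + self.alpha
        denominator = sum(counts.values()) + self.alpha * len(self.vocab)
        return math.log(numerator / denominator)

    def log_scores(self, tokens):
        # Tokens outside the lexicon carry no evidence and are skipped
        # (an assumption of this sketch).
        known = [t for t in tokens if t in self.vocab]
        return {
            label: self.log_prior[label]
            + sum(self._log_likelihood(t, label) for t in known)
            for label in self.counts
        }

    def classify(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical miniature lexicons, for illustration only.
model = LexiconNaiveBayes(
    positive_words=["good", "great", "love", "tasty"],
    negative_words=["bad", "awful", "rude", "slow"],
)
print(model.classify(["great", "tasty", "slow"]))
```

In this toy setup a review is labelled by whichever class its lexicon words favour overall, so a single negative word does not outweigh two positive ones.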

Results

  • It is possible to realign the star ratings.
  • The positive nature of the dataset pushes our potential outcomes to the extremes.
  • The realignment would come at the risk of alienating users: most 5-star ratings would move down to 2, 3, or 4 stars, and the rest of the reviews would become 1-star ratings.
  • A better approach would combine the public lexicon with internal training of the model on reviews that humans have rated as 5 stars.
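One simple way to picture a realignment is to map the share of positive words in a review onto the 1 to 5 star scale. The sketch below is purely illustrative: the function `stars_from_positive_share` and its equal-width binning are assumptions, not the mapping used to produce the results above.

```python
def stars_from_positive_share(positive_hits, negative_hits):
    """Map positive/negative word counts to a 1-5 star rating.

    Uses equal-width bins over the share of positive words
    (an assumed rule for illustration). Returns None when the
    review contains no sentiment-bearing words at all.
    """
    total = positive_hits + negative_hits
    if total == 0:
        return None
    share = positive_hits / total
    # Five bins of width 0.2; a share of exactly 1.0 is capped at 5 stars.
    return min(5, int(share * 5) + 1)


print(stars_from_positive_share(3, 1))
```

Under a rule like this, a review needs a clear majority of positive words to keep a high rating, which is consistent with many enthusiastic reviews sliding down the scale.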