Introduction

Star-rating reviews are inherently flawed. This is notable in the movie review community, where critics have given Citizen Kane and Ghostbusters the same rating. Yelp suffers from some of the same problems because it relies on humans to interpret what a star rating means.

When a Yelper looks to Yelp for ratings, what data is actually useful? More often than not, Yelpers justify their choices by saying “It has 5 stars!” A quick glance at the star rating can make the difference between a user giving a new restaurant a try or making an appointment with a new doctor, and it sets a certain level of expectation for the experience. Jeopardizing this relationship with a user could be economically critical to both Yelp and the businesses it serves.

Can we realign the star ratings based on how positive (or negative) a review is?

Methods and Data

The exploratory analysis focused solely on the Yelp academic review data. My goal was to figure out why users’ ratings skew above average at all. The figure below shows the distribution of star ratings:

On a 5-star rating system, one would expect an average review to land at 2.5 stars. Based on users’ own ratings, the average star rating in the academic dataset is 3.74 stars, more than a full star above the expected midpoint.

Preparing the data

To prepare the review text for processing, I ran a few preliminary tests to find sources of inaccuracy. The volume of stop words and the potential issues from misspellings drove me to make the following preprocessing changes to the review data (a sketch of these steps follows the list):

  1. I removed all punctuation and converted everything to lowercase.

  2. Using the Natural Language Toolkit, I took this preprocessed data, removed common stop words, and applied the Porter stemming algorithm to reduce words to their roots. I also prepended “NEG_” to any word that follows “never,” “no,” or “not” – the goal being to distinguish words that are negated by the word before them.
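Concretely, the two steps might look something like this sketch using NLTK’s stop word list and PorterStemmer (the function name and the exact negation handling are assumptions of mine; the original pipeline may differ in the details):

    import string

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    # Assumes the stop word list was fetched via nltk.download('stopwords').
    STOP_WORDS = set(stopwords.words('english'))
    NEGATORS = {'never', 'no', 'not'}
    stemmer = PorterStemmer()

    def preprocess(review_text):
        # 1. Lowercase and strip punctuation.
        text = review_text.lower().translate(
            str.maketrans('', '', string.punctuation))

        # 2. Drop stop words, stem, and prefix words that follow a negator.
        #    Negators are checked first, since 'no' and 'not' are also
        #    on NLTK's stop word list.
        processed = []
        negate_next = False
        for token in text.split():
            if token in NEGATORS:
                negate_next = True
                continue
            if token in STOP_WORDS:
                negate_next = False
                continue
            stem = stemmer.stem(token)
            processed.append('NEG_' + stem if negate_next else stem)
            negate_next = False
        return processed

    print(preprocess("The service was not good and never friendly"))
    # ['servic', 'NEG_good', 'NEG_friendli']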

Modeling Methods

Since the Yelp data is biased to begin with, the best potential outcome is to find an external source against which to score the review text.

My Naive Bayes classifier was trained using the widely available Opinion Lexicon from Bing Liu, Minqing Hu, and Junsheng Cheng’s work on review text. This lexicon includes commonly misspelled words and distinguishes positive from negative words specifically in the context of expressing an opinion. To increase accuracy, I also added the Porter-stemmed version of each word to both the positive and negative lists.
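A minimal sketch of how such a classifier might be trained from the lexicon, assuming the file names from the published lexicon distribution and NLTK’s NaiveBayesClassifier (the feature representation and loading details are my assumptions):

    from nltk.classify import NaiveBayesClassifier
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def load_lexicon(path):
        # The published lexicon files begin with a ';'-prefixed header
        # and are not valid UTF-8, hence latin-1 here.
        with open(path, encoding='latin-1') as f:
            return {line.strip() for line in f
                    if line.strip() and not line.startswith(';')}

    def word_features(words):
        # Standard NLTK-style bag-of-words feature dict.
        return {w: True for w in words}

    positive = load_lexicon('positive-words.txt')
    negative = load_lexicon('negative-words.txt')

    # Add the Porter-stemmed form of every word to the same class,
    # so stemmed review tokens still match the lexicon.
    positive |= {stemmer.stem(w) for w in positive}
    negative |= {stemmer.stem(w) for w in negative}

    train_set = ([(word_features([w]), 'pos') for w in positive] +
                 [(word_features([w]), 'neg') for w in negative])
    classifier = NaiveBayesClassifier.train(train_set)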

The results of tests from the model:

    accuracy: 0.7049469964664311
    positive precision: 1.0
    negative precision: 0.704773129051267
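These numbers presumably come from something like the standard NLTK evaluation pattern sketched below, where test_set is a held-out list of (features, label) pairs; the variable names are illustrative:

    import collections

    from nltk import classify
    from nltk.metrics import precision

    # test_set: held-out (feature_dict, label) pairs not used in training.
    print('accuracy:', classify.accuracy(classifier, test_set))

    # Group item indices by true and predicted label for per-class precision.
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(test_set):
        refsets[label].add(i)
        testsets[classifier.classify(feats)].add(i)

    print('positive precision:', precision(refsets['pos'], testsets['pos']))
    print('negative precision:', precision(refsets['neg'], testsets['neg']))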

70% accuracy might seem low, but what counts as good accuracy is debated in the community. Researchers at the University of Pittsburgh found that only 82% of humans agree on sentiment. 70% is acceptable for this dataset, along with the precision, given that we already know the dataset is biased toward the positive.

The data was run against this model, gathering three factors (a sketch follows the list):

  1. Does the model think this review as a whole is positive or negative?
  2. How does the model classify the “bag of words” for each review?
  3. How confident is the model that this is positive or negative?
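Assuming the word_features helper and classifier from the earlier sketches, the three factors might be gathered per review like this; the polarity definition, P(pos) − P(neg), is my assumption about how a score from -1 to 1 would be derived:

    def review_factors(tokens, classifier):
        # tokens: the preprocessed word list for one review.
        features = word_features(tokens)

        # 1. Whole-review label: positive or negative.
        overall = classifier.classify(features)

        # 2. Per-word ("bag of words") labels.
        word_labels = {w: classifier.classify(word_features([w]))
                       for w in tokens}

        # 3. Confidence, folded into a polarity in [-1, 1]:
        #    P(pos) - P(neg); 1 is confidently positive, -1 confidently negative.
        dist = classifier.prob_classify(features)
        polarity = dist.prob('pos') - dist.prob('neg')

        return overall, word_labels, polarity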

Results

To generate results, I ran the full review dataset against the trained Naive Bayes model.

Looking at each review as whole text, the model classified a large majority as positive.

Classifying word by word, the bag of words surfaces far more negative (or neutral) words than the whole-text view does.

Mapping the polarity scores from -1 to 1 into five buckets corresponding to the 5 star ratings shows the model leaning toward the negative.

From there, I attempted to recenter the star ratings around the mean polarity of -0.919897. The outcome was slightly better, but not drastically different, when it came to shifting negative values up the scale.
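The exact mapping isn’t spelled out above, so here is one plausible scheme: equal-width polarity buckets, plus a recentering pass that shifts every score by the dataset mean. The bucket widths and the clamping are assumptions:

    def polarity_to_stars(polarity):
        # Five equal-width buckets: [-1, -0.6) -> 1 star, ..., [0.6, 1] -> 5.
        return min(5, int((polarity + 1) / 0.4) + 1)

    MEAN_POLARITY = -0.919897  # dataset mean from the run above

    def recentered_stars(polarity):
        # Shift by the mean so the average review lands mid-scale,
        # then clamp back into [-1, 1] before bucketing.
        shifted = max(-1.0, min(1.0, polarity - MEAN_POLARITY))
        return polarity_to_stars(shifted)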

In actuality, it moves more of the positively rated reviews into the 1-star range. What I’ve learned is that Yelp users may think they are raving about a restaurant, but they aren’t using academically sanctioned words to do so. The star rating is treated almost entirely independently of the review text.

While a change this drastic would be more accurate based solely on the sentiment analysis of how much a user talks up a business, it’s not an improvement over the current structure of taking the user’s word for it.

Discussion

To implement changes based on this modeling, Yelp would need to combine the lexicons I used with specific models trained per category and then per user. Over time, these models would need to be adapted to discount older reviews. The academic dataset itself is flawed in that there is little middle-ground data: because the reviews skew to the positive or negative extremes, any model based on the data will inherit the same skew.
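As one example of discounting older reviews, an exponential half-life weight could be applied when aggregating ratings; the one-year half-life here is an arbitrary assumption:

    HALF_LIFE_DAYS = 365.0  # assumed half-life; tuning this is an open question

    def review_weight(age_days):
        # A review HALF_LIFE_DAYS old counts half as much as one posted today.
        return 0.5 ** (age_days / HALF_LIFE_DAYS)

    def weighted_star_rating(reviews):
        # reviews: iterable of (stars, age_in_days) pairs.
        total = sum(review_weight(age) * stars for stars, age in reviews)
        norm = sum(review_weight(age) for _, age in reviews)
        return total / norm if norm else None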

It appears that a combination of algorithmic training and human self-rating will help to ensure that businesses are neither over-rated nor under-rated. The balance between users feeling their ratings have been taken into consideration and those ratings being a fair representation of the reviews they’ve submitted is important.

Potential Biases

This dataset is limited to what Yelp releases as its academic dataset. Changes in text patterns or trends can influence the decisions of the model and the sentiment analysis. There is not an even spread of 1-5 star ratings from which to effectively predict the star ratings.

There are many potential influences on why a user would rate a restaurant a certain way. It might be a single bad experience that elicits a 1-star rating, or a great experience that elicits a 5-star rating. Users could also be tempted to rate a restaurant based on its existing star rating. Unfortunately, the dataset doesn’t include enough data to explore this further.

Yelpers may review different types of businesses (a doctor vs. a restaurant) using a different corpus. Each main category should have its own model, evaluated to see whether its results are more relevant.

Highlighting uncharacteristic behavior

Moving beyond the star rating alone, reviews should be weighted for display regardless of the star rating assigned by the user or via predictive analytics.

Consider the following: a user is historically known for posting positive reviews but, for some reason, posts a negative one. Since it’s uncharacteristic, Yelpers should have it brought to their attention.

A 5-star review from a person who only posts 5-star reviews (or only positive reviews) means less than a 5-star review from a person who usually posts negative ones.
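One way such a flag might be computed is a simple z-score against the user’s rating history; the threshold and minimum history length are arbitrary assumptions:

    from statistics import mean, stdev

    def is_uncharacteristic(star_history, new_stars, threshold=2.0):
        # Flag a rating more than `threshold` standard deviations
        # from the user's historical mean.
        if len(star_history) < 5:
            return False  # not enough history to judge
        mu = mean(star_history)
        sigma = stdev(star_history)
        if sigma == 0:
            # A user who always gives the same rating: any deviation counts.
            return new_stars != mu
        return abs(new_stars - mu) / sigma > threshold

    # A habitual 5-star reviewer posting a 1-star review gets flagged:
    print(is_uncharacteristic([5, 5, 5, 4, 5], 1))  # True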

While this shouldn’t necessarily have a bearing on the star rating, it would be an additional benefit to users and would drum up additional engagement from users who agree or disagree with such an extreme assertion.

References and Attributions:

Minqing Hu and Bing Liu. “Mining and Summarizing Customer Reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.

Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing Opinions on the Web.” Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.