Data Science Capstone: Yelp Dataset Analysis

Marowen Ng
Sunday, November 22, 2015

Introduction

  • Yelp lets users rate businesses (restaurants, dentists, and so on), but are those ratings always fair?

  • Are there people who give only 1-star or 5-star reviews?

  • How do we tell whether these users write reviews only to complain about or compliment a business?

  • Can we reasonably predict what rating a business will get based on the users' rating behavior?

Method

  • Step 1: simplify the raw dataset into one data frame
  • Step 2: plot a histogram of users' average ratings and their relationship to the number of reviews
  • Step 3: build a simple model that predicts businesses' ratings based on users' average stars and review count
  • Step 4: validate this model on a testing dataset
  • Step 5: cross-check this validation and draw a conclusion
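
As a rough illustration of Step 1, the sketch below joins each review with its author's summary statistics to form one flat table. It is not the code used in this project: the toy records are invented, and the field names (`user_id`, `business_id`, `stars`, `average_stars`, `review_count`) are assumed to mirror the Yelp JSON files.

```python
# Toy records; values are invented and field names are assumptions.
reviews = [
    {"user_id": "u1", "business_id": "b1", "stars": 5},
    {"user_id": "u2", "business_id": "b1", "stars": 3},
    {"user_id": "u2", "business_id": "b2", "stars": 4},
]
users = {
    "u1": {"average_stars": 5.0, "review_count": 1},
    "u2": {"average_stars": 3.5, "review_count": 2},
}


def flatten(reviews, users):
    """Join each review with its author's average stars and review count."""
    rows = []
    for review in reviews:
        user = users.get(review["user_id"])
        if user is None:
            continue  # skip reviews whose author record is missing
        rows.append({
            "business_id": review["business_id"],
            "stars": review["stars"],
            "user_average_stars": user["average_stars"],
            "user_review_count": user["review_count"],
        })
    return rows


flat_rows = flatten(reviews, users)
```

Each row of `flat_rows` then carries the outcome (`stars`) alongside the two predictors named in Step 3.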

Users' Rating Behavior

  • Some users write only 1-star reviews, and many users write only 5-star reviews
  • These users write very few reviews, likely for the sole purpose of strongly complaining or complimenting
  • Users with an average rating of 3-4 stars have the widest range of review counts; the range narrows as the average moves toward either extreme
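
One way to check the pattern above is to group users by their rounded average star and look at the spread of review counts in each group. The sketch below does this on invented toy data; the function name and the sample values are hypothetical and are not taken from the Yelp dataset.

```python
from collections import defaultdict

# Toy (average_stars, review_count) pairs; values are invented for illustration.
toy_users = [(1.0, 1), (1.0, 2), (3.4, 120), (3.6, 8), (4.2, 45), (5.0, 1), (5.0, 3)]


def review_count_range_by_star(users):
    """Group users by rounded average star; return (min, max) review count per group."""
    buckets = defaultdict(list)
    for average, count in users:
        buckets[int(average + 0.5)].append(count)  # round half up to the nearest star
    return {star: (min(counts), max(counts)) for star, counts in sorted(buckets.items())}


ranges = review_count_range_by_star(toy_users)
```

On real data, a wide (min, max) range in the 3 and 4 star groups and a narrow one at 1 and 5 stars would match the behavior described in the bullets.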


Prediction Model and Accuracy

  • Based on the confusion matrix, the model has only ~38% accuracy
  • That figure applies when the criterion is an exact match between predicted and actual stars
  • Because the model is deliberately simple, we relax the criterion to, say, +/- 1 star
  • With this relaxed criterion, ~82% of predicted values are either exactly right or off by only 1 star
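
The two accuracy figures can both be read off a confusion matrix: exact accuracy sums the diagonal, while the relaxed version also counts cells one step off the diagonal. The sketch below shows the arithmetic on an invented 5x5 matrix; the counts are not the project's results, and `accuracy_within` is a hypothetical helper name.

```python
def accuracy_within(confusion, tolerance=0):
    """Share of cases where |predicted - actual| <= tolerance.

    confusion[i][j] counts cases with actual rating i + 1 and predicted rating j + 1.
    """
    total = sum(sum(row) for row in confusion)
    hits = sum(
        count
        for i, row in enumerate(confusion)
        for j, count in enumerate(row)
        if abs(i - j) <= tolerance
    )
    return hits / total


# Toy confusion matrix (rows: actual 1-5 stars, columns: predicted 1-5 stars).
# Counts are invented for illustration only.
toy_confusion = [
    [2, 1, 1, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 1, 3, 1, 1],
    [0, 0, 2, 4, 2],
    [0, 1, 0, 2, 5],
]

exact = accuracy_within(toy_confusion, tolerance=0)    # diagonal only
relaxed = accuracy_within(toy_confusion, tolerance=1)  # diagonal plus adjacent cells
```

The relaxed figure can never be lower than the exact one, since it counts every exact hit plus the near misses.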


Summary

  • Many users write reviews on Yelp only to give either 1 star or 5 stars
  • For everyone else, a simple model based on users' average stars and review count can predict how many stars they give a business
  • The model predicts the exact rating with only ~38% accuracy, but ~82% of its predictions fall within +/- 1 star
  • Given the subjectivity of ratings and the simplified data, the model works reasonably well as a rough predictor
  • A much more robust model could potentially be built using all aspects of the raw dataset