Data Science Capstone: Yelp Dataset Analysis
Marowen Ng
Sunday, November 22, 2015
Introduction
Yelp lets users rate businesses (restaurants, dentists, and so on), but are those ratings always fair?
Are there people who give only 1-star or 5-star reviews?
How do we tell whether these users write reviews only to complain about or compliment a business?
Can we reasonably predict what kind of rating a business will get based on the users' rating behaviour?
Method
- Step 1: simplify the raw dataset into one data frame
- Step 2: plot a histogram of users' average ratings and their relationship with the number of reviews
- Step 3: build a simple model that predicts businesses' ratings based on users' average stars and review count
- Step 4: validate this model on a testing dataset
- Step 5: cross-check this validation and derive a conclusion
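Steps 1 and 4 could look roughly like the sketch below. It is only an illustration: the field names (`user_id`, `business_id`, `stars`, `average_stars`, `review_count`) are assumed to match the raw records, and the helper names `build_rows` and `split_rows`, the 70/30 split, and the seed are hypothetical choices, not the ones used in this analysis.

```python
import random


def build_rows(reviews, users):
    """Join each review with its author's average stars and review count.

    Field names are assumptions about the raw records.
    """
    users_by_id = {u["user_id"]: u for u in users}
    rows = []
    for review in reviews:
        user = users_by_id.get(review["user_id"])
        if user is None:
            continue  # skip reviews whose author record is missing
        rows.append({
            "business_id": review["business_id"],
            "stars": review["stars"],
            "average_stars": user["average_stars"],
            "review_count": user["review_count"],
        })
    return rows


def split_rows(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]
```

Keeping the join and the split as separate functions means the same flattened rows can feed both the plots in Step 2 and the model in Step 3.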
Users' Rating Behavior
- Some users write only 1-star reviews, and many users write only 5-star reviews
- These users write very few reviews, likely for the sole purpose of strongly complaining about or complimenting a business
- Users with an average rating of 3-4 stars have the widest range of review counts; review counts shrink as the average rating moves lower or higher
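One way to surface the users described above is to summarize each user's reviews and flag those whose ratings are all 1 star or all 5 stars. The sketch below assumes review records with `user_id` and `stars` fields; the function name `summarize_users` and the `only_extreme` flag are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def summarize_users(reviews):
    """Return per-user review count, average stars, and an 'only extreme' flag."""
    stars_by_user = defaultdict(list)
    for review in reviews:
        stars_by_user[review["user_id"]].append(review["stars"])

    summary = {}
    for user_id, stars in stars_by_user.items():
        summary[user_id] = {
            "review_count": len(stars),
            "average_stars": sum(stars) / len(stars),
            # True when every review is 1 star, or every review is 5 stars
            "only_extreme": all(s == 1 for s in stars) or all(s == 5 for s in stars),
        }
    return summary
```

Plotting `average_stars` against `review_count` from this summary would give the kind of view the bullets describe.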
Prediction Model and Accuracy
- Based on the confusion matrix, the model has only ~38% accuracy
- That figure holds when a prediction counts as correct only if it matches the actual rating exactly
- Because the model is deliberately simple, we relax the criterion to, say, +/- 1 star
- With this relaxed criterion, ~82% of predicted values are either exactly right or off by only 1 star
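The two accuracy figures differ only in how far a prediction may be from the actual rating and still count as a hit. A minimal sketch of that calculation follows, assuming the confusion matrix is stored as nested dictionaries keyed by actual and then predicted stars. The function name `accuracy_within` and the toy counts are made up for illustration and are not the results reported here.

```python
def accuracy_within(confusion, tolerance=0):
    """Share of predictions within `tolerance` stars of the actual rating.

    `confusion[actual][predicted]` holds the number of cases.
    tolerance=0 gives exact accuracy; tolerance=1 gives the +/- 1 star figure.
    """
    total = 0
    hits = 0
    for actual, row in confusion.items():
        for predicted, count in row.items():
            total += count
            if abs(actual - predicted) <= tolerance:
                hits += count
    return hits / total if total else 0.0


# Toy counts for illustration only
toy = {
    1: {1: 2, 2: 1, 5: 1},
    3: {1: 2, 3: 1, 4: 3},
}
print(accuracy_within(toy, tolerance=0))  # 0.3
print(accuracy_within(toy, tolerance=1))  # 0.7
```

Only the diagonal cells count toward exact accuracy, while the relaxed figure also counts the cells immediately next to the diagonal.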
Summary
- Many users write reviews on Yelp just to give either 1 star or 5 stars
- For everyone else, a simple model based on the users' average stars and review count can predict how many stars they give a business
- The model predicts the exact rating with only ~38% accuracy, but ~82% of its predictions fall within +/- 1 star of the actual rating
- Given the subjectivity and oversimplification of the data, the model is quite good as a rough predictor
- A much more robust model can potentially be created using all aspects of the raw dataset