The review website Yelp not only connects customers with businesses but also lets customers rate their experiences. The resulting data comprise millions of records of user and business information, reviews, votes, and more. For the Johns Hopkins University Coursera capstone project, a specific Yelp dataset was provided for analysis.
From this dataset, I will attempt to answer the following questions:
1) Are there people who only give 1-star or 5-star reviews?
2) Do these people tend to keep giving 1-star or 5-star ratings; in other words, do they write reviews only to complain about or compliment a business? Is this behavior apparent in the dataset?
3) Is it possible to reasonably predict what rating a business will get based on users' rating behavior?
The dataset was downloaded directly from the capstone project page on Coursera into a local folder. The raw files are in JSON format and were converted into R data frames using the code below.
setwd("~/R/capstone/yelp_dataset_challenge_academic_dataset")
tip <- stream_in(file("yelp_academic_dataset_tip.json"))
checkin <- stream_in(file("yelp_academic_dataset_checkin.json"))
user <- stream_in(file("yelp_academic_dataset_user.json"))
review <- stream_in(file("yelp_academic_dataset_review.json"))
business <- stream_in(file("yelp_academic_dataset_business.json"))
The raw files were cleaned, subset to the columns of interest, and merged. The resulting final dataset contains only the user ID, the star rating given to a business, the business ID, the user's review count, and the user's average stars.
## user_id stars business_id review_count average_stars
## 182925 5hBsVMJOc8sXMPt8S3NjmA 5 --1emggGHgoG6ipd_RMb-g 309 3.92
## 205651 6DuBPBgCN6YMowFwrp_P5g 1 --1emggGHgoG6ipd_RMb-g 227 3.60
## 606413 GI_0tXuL7dWll1m1i48eQw 5 --1emggGHgoG6ipd_RMb-g 455 3.42
## 774662 jt4zoa6tk5q-4mWtuKHEPg 4 --1emggGHgoG6ipd_RMb-g 27 4.28
## 523137 epN0_Z-MWmx1DxO0FC7k9w 5 --4Pe8BZ6gj57VFL5mUE8g 2 3.00
## 833180 ky9pQ8AKufvZ63IxR41jGg 2 --4Pe8BZ6gj57VFL5mUE8g 504 3.90
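The cleaning and merging code is not reproduced here; a minimal sketch of the join using dplyr, assuming the review and user data frames keep their original Yelp column names and that combo1 is the merged data frame used in the models below, would look roughly like this:
library(dplyr)
# Keep only the columns of interest from each raw data frame,
# then attach each reviewer's summary statistics to their reviews.
combo1 <- review %>%
        select(user_id, stars, business_id) %>%
        inner_join(select(user, user_id, review_count, average_stars),
                   by = "user_id")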
To answer the first two questions, several plots are generated from the user data. To see whether there are people who only give 1-star or 5-star reviews, a histogram of users' average ratings is plotted.
library(ggplot2)
# newuser1 is the cleaned user-level data frame prepared during the cleanup step
ggplot(newuser1, aes(x = average_stars)) +
        geom_histogram(alpha = .50, binwidth = .1, colour = "black")
To address the next question, whether these users tend to write only 1-star or 5-star reviews, the relationship between users' average stars and their review counts is plotted.
ggplot(newuser1, aes(average_stars, review_count)) + geom_point()
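With hundreds of thousands of users, a plain scatterplot suffers from heavy over-plotting; a variant with semi-transparent points (not part of the original analysis) makes the density easier to judge:
ggplot(newuser1, aes(average_stars, review_count)) +
        geom_point(alpha = 0.05)   # faint points so dense regions show up darker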
The last question of interest for this project is whether it is possible to predict how many stars a user will give a business based on the user's rating behavior, which is highly subjective. Predicting a rating with 100% accuracy from past rating history alone is impossible, but we may be able to produce a model with reasonable predictive power (for example, accurate at least half of the time, to be conservative).
We will keep the model as simple as possible: only the users' average stars and review count are used as predictors. Three linear regression fits are compared with anova(): stars modeled on users' average stars, on users' review count, and on both predictors together.
fit1 <- lm(stars ~ average_stars, data = combo1)                 # average stars only
fit2 <- lm(stars ~ review_count, data = combo1)                  # review count only
fit3 <- lm(stars ~ average_stars + review_count, data = combo1)  # both predictors
anova(fit1, fit2, fit3)                                          # compare the three fits
The comparison shows that neither users' average stars nor review count alone is a good predictor of a user's rating, but the fit is decent when both are included, so both are used as the basis for the model. To test the model, the dataset was split into training and testing sets; because the sample size is large, the split was made 50/50.
library(caret)
## Loading required package: lattice
inTrain = createDataPartition(combo1$stars, p = 0.5, list=FALSE)
training = combo1[ inTrain,]
testing = combo1[-inTrain,]
dim(training)
## [1] 784633 5
dim(testing)
## [1] 784631 5
The model is built on the training dataset and then used to predict on the testing dataset. The output is a confusion matrix along with the overall accuracy.
modFit <- train(stars ~ average_stars + review_count, data = training, method = "lm")
predictions <- round(predict(modFit, testing))   # round continuous predictions to whole stars
u <- union(predictions, testing$stars)           # common set of levels for both factors
t <- table(factor(predictions, u), factor(testing$stars, u))
confusionMatrix(t)
## Confusion Matrix and Statistics
##
##
## 4 3 5 1 2 0
## 4 179339 76037 180434 26291 36615 0
## 3 45729 32927 41321 32005 27275 0
## 5 6797 930 65708 347 335 0
## 1 5 31 12 12325 320 0
## 2 1428 1501 2277 8997 5619 0
## 0 1 2 11 10 2 0
##
## Overall Statistics
##
## Accuracy : 0.3771
## 95% CI : (0.3761, 0.3782)
## No Information Rate : 0.3693
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.158
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 4 Class: 3 Class: 5 Class: 1 Class: 2 Class: 0
## Sensitivity 0.7687 0.29550 0.22676 0.15411 0.080082 NA
## Specificity 0.4207 0.78264 0.98301 0.99948 0.980121 1.000e+00
## Pos Pred Value 0.3596 0.18369 0.88654 0.97101 0.283473 NA
## Neg Pred Value 0.8113 0.87033 0.68466 0.91236 0.915604 NA
## Prevalence 0.2973 0.14201 0.36930 0.10193 0.089425 0.000e+00
## Detection Rate 0.2286 0.04196 0.08374 0.01571 0.007161 0.000e+00
## Detection Prevalence 0.6356 0.22846 0.09446 0.01618 0.025263 3.314e-05
## Balanced Accuracy 0.5947 0.53907 0.60489 0.57679 0.530101 NA
The first histogram shows that a substantial number of users have an average rating of either 1 or 5 stars (in fact, an average of exactly 5 stars is especially common).
The second plot looks at the relationship between the number of reviews and the average star ratings of the users in this dataset. It shows that users whose average rating is 1 or 5 stars write only a small number of reviews, while review counts grow for users whose averages fall in between.
To validate the prediction model, a confusion matrix comparing correctly and incorrectly predicted ratings was generated. The accuracy calculated from this matrix is roughly 38%.
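For reference, the overall accuracy reported by confusionMatrix() can also be recovered directly from the table t built earlier:
sum(diag(t)) / sum(t)   # proportion of predictions that exactly match the actual rating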
The first histogram readily answers the first question: there are users who only write 1-star or 5-star reviews. The second plot suggests that these users likely write reviews for the sole purpose of either strongly complaining about or strongly complimenting a business (which would explain why there are not many of them). It answers the second question: some people do tend to write only strongly positive or strongly negative reviews.
However, note that the plot is also densely populated across all average-star values, meaning there are also users who write only a few reviews averaging anywhere between 1 and 5 stars, so infrequent reviewing is not reserved for strongly negative or positive reviewers.
We can therefore state that there are users who write reviews on Yelp solely to give either 1 or 5 stars, as is apparent from the histogram and their low review counts, but we cannot make the converse statement that infrequent reviewers tend to give only 1 or 5 stars.
The confusion matrix shows that the oversimplified model, which takes into account only the users' average stars and review count (omitting all other factors), is unfortunately far from perfect at predicting the rating a user will give a business.
However, if we are willing to relax the accuracy requirement, the model may be good enough as a rough predictor. To check this, the difference between the predicted and actual values is calculated and plotted as a histogram. It is apparent that the majority of predicted values are within 1 star of the actual values.
testing$predictions <- predictions
testing$difference  <- testing$stars - testing$predictions   # positive values mean the model under-predicted
hist(testing$difference, breaks = 10, main = "Difference between Predicted and Actual")
axis(side = 1, at = seq(-4, 4, 1), labels = seq(-4, 4, 1))    # label every whole-star difference
To be more precise, roughly 82% of predicted values are off by at most one star in either direction, as shown by the calculation below.
difference <- as.data.frame(table(testing$difference))
# Rows 4 to 6 of the frequency table correspond to differences of -1, 0 and +1 here
(difference[4, 2] + difference[5, 2] + difference[6, 2]) / sum(difference[, 2])
## [1] 0.8195164
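Equivalently, the same proportion can be computed directly from the per-row differences, without relying on row positions in the frequency table:
mean(abs(testing$difference) <= 1)   # fraction of predictions within one star of the actual rating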
Considering the subjectivity of the ratings and the deliberate oversimplification of the predictors, the model is acceptable as a rough estimating tool. The answer to the third question is that it is possible to reasonably predict the rating a business will receive based on users' rating behavior, which here involves only the users' average stars and review count. A much more robust model could potentially be built by taking into account other aspects of the dataset.
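As a pointer in that direction, one could, for example, fold business-level information that is already available in the business data frame into the model (a sketch only, assuming the standard Yelp business fields stars and review_count):
library(dplyr)
# Sketch: attach each business's own average rating and review volume,
# renaming them to avoid clashing with the user-level columns already in combo1.
combo2 <- combo1 %>%
        inner_join(select(business, business_id,
                          business_stars = stars,
                          business_reviews = review_count),
                   by = "business_id")
fit4 <- lm(stars ~ average_stars + review_count + business_stars + business_reviews,
           data = combo2)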