1. Introduction

I am interested in predicting the number of stars a business is likely to receive from a user on Yelp, based on the characteristics of the business and of the user. In particular, which characteristics predict how many stars a user is likely to give a local business? This question should interest businesses that want to learn about their customers’ behavior: given the user data for a customer and their own business data, they should be able to predict how likely that customer is to give good reviews and ratings.

2. Methodology and Data

2.1. Background on the Data

The dataset used here is part of the Yelp Dataset Challenge; this capstone uses the data from Round 6 of the challenge (the documentation mentions Round 5, but the datasets for Rounds 5 and 6 are identical). The data is made up of five data sets (review, tip, checkin, user, and business), each related to the others via user_id or business_id (much like tables in an RDBMS).

2.2. Exploring the Data

First, we explore the cities where people are most likely to leave reviews on Yelp. The following figure shows the cities with more than 10,000 reviews, ordered from highest to lowest. I chose 10,000 reviews as the threshold because the cities above it account for 94% of all reviews in the data set.
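For reference, here is a minimal sketch of how these city-level counts could be tabulated; the data frame names review_df and business_df are my own placeholders, not names from the original code.

# Tabulate reviews per city, assuming the review and business data sets
# are loaded as data frames named review_df and business_df (hypothetical)
library(dplyr)

city_counts <- review_df %>%
  inner_join(business_df[, c("business_id", "city")], by = "business_id") %>%
  count(city, sort = TRUE)                  # reviews per city, highest first

# Cities above the 10,000-review threshold and their share of all reviews
top_cities <- city_counts %>% filter(n > 10000)
sum(top_cities$n) / sum(city_counts$n)      # about 0.94 for this data set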

2.3. Preparing the Data

Now that the question is specified, I need to clean the data and gather the relevant predictors. First, I check each of the five data sets to see which variables I need and drop all the irrelevant ones. After that, I merge the data sets to create the final clean data set, “Yelp_10_clean_df”.
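The merge itself might look something like the following sketch, assuming the trimmed data sets are loaded as review_df, user_df, and business_df (hypothetical names).

# Join the trimmed data sets on their shared keys, RDBMS-style
library(dplyr)

Yelp_10_clean_df <- review_df %>%
  inner_join(user_df,     by = "user_id",     suffix = c("", "_user")) %>%
  inner_join(business_df, by = "business_id", suffix = c("", "_business"))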

Since there are both numeric and non-numeric variables, I create categorical (dummy) variables for the non-numeric ones, such as city names. Given the capacity of my computer’s processor and the 1.4 million observations in the data, it is not feasible to analyze the data in its current state, so I narrow the analysis to businesses in Las Vegas. As discussed above, local businesses would be interested in the results of this analysis, since it can yield insights into consumer behavior.
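As a sketch of the dummy-variable step (assuming the merged data has a factor column named city), caret’s dummyVars() can expand city names into 0/1 indicator columns like the `Las Vegas` column used below; the resulting column names may need cleaning to match.

# Expand city names into 0/1 indicator (dummy) variables
library(caret)

dummies <- dummyVars(~ city, data = Yelp_10_clean_df)
city_dummies <- predict(dummies, newdata = Yelp_10_clean_df)

# Bind the indicator columns back onto the data; names such as
# "cityLas Vegas" may need renaming to `Las Vegas`
Yelp_10_clean_df <- cbind(Yelp_10_clean_df, city_dummies)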

2.4. Methodology

I use a standard classification approach, since I am interested in how well the predictors can forecast the number of stars a user is likely to give a business. I start by partitioning the data into training and testing sets, using an 80/20 training/test split.

2.4.1. Cross-Validation

I use the trainControl() function from the caret package to set up cross-validation. The idea is that, having already split the main data into training and testing subsets, we build the model on the training subset only. Cross-validation further splits the training subset into folds, builds the model on some folds, tests it on the remainder, and repeats the process the specified number of times.

2.4.2. Model

Since the outcome variable is categorical (stars from 1 to 5), it does not make much sense to use regression models. I therefore use a classification tree (rpart) model.

Here I apply the centering and scaling pre-processing functions as part of the model, since PCA pre-processing makes less sense for a tree-based classifier.

load("Yelp_10_clean_df_v2.rdata")
Yelp_10_clean_df_v3 <- subset(Yelp_10_clean_df_v2, `Las Vegas` ==1)
Yelp_10_clean_df_v4<- Yelp_10_clean_df_v3[,c(4:5,7:10,28:33)]
Yelp_10_clean_df_v4$stars <- as.factor(Yelp_10_clean_df_v4$stars)

library(caret)

# Reproducible, stratified 80/20 split of the data
set.seed(1022)
inTrain <- createDataPartition(Yelp_10_clean_df_v4$stars, p = 0.80, list = FALSE)

training <- Yelp_10_clean_df_v4[inTrain, ]
testing  <- Yelp_10_clean_df_v4[-inTrain, ]


# 10-fold cross-validation on the training subset
# (the p argument is ignored by method = "cv", so it is dropped here)
control <- trainControl(method = "cv", number = 10)

mfit <- train(stars ~ ., method = "rpart",
              preProc = c("center", "scale"),
              trControl = control, data = training)
mfit
## CART 
## 
## 493890 samples
##     11 predictor
##      5 classes: '1', '2', '3', '4', '5' 
## 
## Pre-processing: centered (11), scaled (11) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 444502, 444502, 444500, 444500, 444501, 444501, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa       Accuracy SD  Kappa SD  
##   0.01065731  0.4555225  0.20556404  0.003836637  0.01085453
##   0.04391816  0.4394114  0.16818652  0.015743959  0.03011324
##   0.04772774  0.3739094  0.02660215  0.024655143  0.05608235
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.01065731.

3. Results

3.1. Plots

The cross-validation process allows us to pick the optimal model based on the highest accuracy; the final complexity parameter chosen was cp = 0.011. The following graph shows how cross-validated accuracy changes with the complexity parameter.
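The accuracy-versus-cp plot can be reproduced with caret’s plot method for train objects:

# Cross-validated accuracy as a function of the complexity parameter
plot(mfit)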

Here is the classification tree plot for the final model. The probability values in the nodes show how likely an observation reaching that node is to fall in each class.

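Since the plotting call is not shown in the original output, here is one way the tree could be drawn; rattle’s fancyRpartPlot() is an assumption on my part.

# Draw the final classification tree, with class probabilities in the nodes
library(rattle)
fancyRpartPlot(mfit$finalModel)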

3.2. Out of Sample Prediction

Now we can use the final model on the testing subset. We use the predict() function and a confusion matrix to summarize the results, which give the out-of-sample error rates.
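The Conf_matrix object summarized below can be produced along these lines:

# Predict star ratings on the held-out testing subset and summarize
pred <- predict(mfit, newdata = testing)
Conf_matrix <- confusionMatrix(pred, testing$stars)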

Here are the prediction results on the testing subset:

Conf_matrix$table
##           Reference
## Prediction     1     2     3     4     5
##          1  4032     0     0  5740  3187
##          2  1220     0     0  6446  3706
##          3   677     0     0 10713  7064
##          4   649     0     0 14745 20566
##          5   482     0     0  7710 36531

Here are the overall statistics:

Conf_matrix$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.4479541      0.1864430      0.4451775      0.4507332      0.5754851 
## AccuracyPValue  McnemarPValue 
##      1.0000000            NaN

And finally, the statistics by class:

t(Conf_matrix$byClass)
##                        Class: 1   Class: 2  Class: 3  Class: 4  Class: 5
## Sensitivity          0.57110482         NA        NA 0.3251091 0.5141301
## Specificity          0.92331283 0.90789516 0.8505362 0.7284098 0.8437059
## Pos Pred Value       0.31113512         NA        NA 0.4100389 0.8168280
## Neg Pred Value       0.97259952         NA        NA 0.6502148 0.5615849
## Prevalence           0.05718081 0.00000000 0.0000000 0.3673340 0.5754851
## Detection Rate       0.03265623 0.00000000 0.0000000 0.1194237 0.2958742
## Detection Prevalence 0.10495837 0.09210484 0.1494638 0.2912496 0.3622234
## Balanced Accuracy    0.74720882         NA        NA 0.5267595 0.6789180

4. Discussion

My results suggest that users tend to avoid giving 2 or 3 stars and choose among 1, 4, and 5. With an out-of-sample accuracy of only 0.448, the model predicts the correct outcome just 44.8% of the time when predicting how many stars a user is likely to give a local business in Las Vegas.

Since I narrowed the analysis to a single city, the results might improve if businesses from other cities were included. I also tried a random forest model, but its results were no better than those from rpart, so I kept rpart, which is computationally much faster.
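For completeness, the random forest comparison might have looked something like the sketch below; the exact tuning settings (such as ntree) are assumptions, since the original run is not shown.

# Random forest fit for comparison; ntree kept small for run time
set.seed(1022)
rf_fit <- train(stars ~ ., method = "rf",
                trControl = control, data = training, ntree = 100)
rf_fit$results   # compare accuracy with mfit above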