I am interested in predicting the number of stars a business is likely to get from a user on Yelp, based on the characteristics of the business and of the user. In particular, which characteristics predict how many stars a user is likely to give a local business? This question may be of interest to businesses that want to learn about their customers’ behavior. If they can obtain user data for their customers along with their own business data, they should be able to predict how likely a given customer is to leave good reviews and ratings.
The dataset used here is part of the Yelp Dataset Challenge; the specific dataset used in this capstone corresponds to Round 6 of the challenge (the documentation mentions Round 5, but the datasets for Rounds 5 and 6 are identical). The data consists of five data sets (review, tip, checkin, user, and business), which are related to one another via user_id or business_id, much like tables in an RDBMS.
First, we need to explore the cities where people are most likely to leave reviews on Yelp. The following figure shows the cities with more than 10,000 reviews, ordered from highest to lowest. I chose 10,000 reviews as the threshold because cities above it account for 94% of all reviews in the data set.
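The 94% figure is a simple coverage calculation: add up the review counts of the cities above the threshold and divide by the total. A minimal sketch in base R, assuming a named vector of per-city counts (the `review_counts` vector, city names, and numbers below are made up for illustration and are not the real Yelp figures):

```r
# Hypothetical review counts per city (illustrative values only).
review_counts <- c(CityA = 50000, CityB = 30000, CityC = 14000,
                   CityD = 4000, CityE = 2000)

threshold <- 10000

# Cities above the threshold, sorted from highest to lowest.
big_cities <- sort(review_counts[review_counts > threshold], decreasing = TRUE)

# Share of all reviews that come from those cities.
share <- sum(big_cities) / sum(review_counts)
print(big_cities)
print(share)
```

With the real data, the same calculation would run on the review counts obtained after grouping reviews by city.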
Now that I have specified my question, I need to clean the data and gather the relevant predictors. First, I check each of the five data sets to see which variables I need and drop the irrelevant ones. After that, I merge the data sets to create the final clean data set, “Yelp_10_clean_df”.
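Since the data sets are linked by user_id and business_id, the merge step can be sketched with base R's merge(). The three small data frames below are hypothetical stand-ins; only the key names user_id and business_id come from the data description, and the other column names and values are assumptions for illustration:

```r
# Hypothetical miniature versions of three of the Yelp data sets.
review <- data.frame(user_id     = c("u1", "u2", "u1"),
                     business_id = c("b1", "b1", "b2"),
                     stars       = c(5, 3, 4))
user <- data.frame(user_id       = c("u1", "u2"),
                   average_stars = c(4.2, 3.1))
business <- data.frame(business_id = c("b1", "b2"),
                       city        = c("Las Vegas", "Phoenix"))

# Attach user columns to each review, then business columns.
review_user <- merge(review, user, by = "user_id")
merged      <- merge(review_user, business, by = "business_id")
print(merged)
```

Each review row keeps its own stars value and gains the matching user and business columns, which is the shape needed for modelling one rating per row.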
Since there are both numeric and non-numeric variables, I create categorical (dummy) variables for the non-numeric ones, such as city names. Given the capacity of my computer’s processor and the 1.4 million observations in the data, it would not be feasible to analyze the data in its current state, so I narrow my analysis down to businesses in Las Vegas. As discussed before, local businesses should be interested in the results of my analysis, since it can help them gain insight into consumer behavior.
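One way to build such dummy variables in base R is model.matrix(), which expands a categorical column into one 0/1 column per level. This is only a sketch under assumed data: the `df` data frame and its city values are hypothetical, and the source does not say which function was actually used.

```r
# Hypothetical data with a non-numeric city column.
df <- data.frame(city  = c("Las Vegas", "Phoenix", "Las Vegas", "Charlotte"),
                 stars = c(5, 3, 4, 2))

# One 0/1 indicator column per city ("- 1" drops the intercept column).
dummies <- model.matrix(~ city - 1, data = df)

# Strip the "city" prefix so that columns are named after the cities.
colnames(dummies) <- sub("^city", "", colnames(dummies))

# Keep only the rows whose Las Vegas indicator equals 1.
vegas_rows <- dummies[, "Las Vegas"] == 1
vegas <- df[vegas_rows, , drop = FALSE]
print(dummies)
print(vegas)
```

Filtering on the indicator column mirrors the idea of narrowing the analysis to a single city.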
I use a standard classification approach, since I am interested in how well the predictors can predict the number of stars a user is likely to give a business. I start by partitioning the data into training and testing sets, using an 80/20 split (80% for training, 20% for testing).
I use the trainControl() function from the caret package to set up cross-validation. We have already split the main data into training and testing subsets, and we build the model on the training subset. Cross-validation splits the training subset further: it fits the model on part of it, tests it on the held-out part, and repeats that process as many times as specified (here, 10 folds).
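To make the fold mechanics concrete, here is a minimal base R sketch of 10-fold cross-validation. It is not what caret does internally line for line: the outcome is random, hypothetical data, and the "model" is a stand-in that always predicts the most common class in the training folds.

```r
set.seed(1022)

# Hypothetical outcome: 200 star ratings drawn at random (not the Yelp data).
stars <- factor(sample(1:5, 200, replace = TRUE), levels = 1:5)

k <- 10
# Randomly assign each row to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = length(stars)))

fold_accuracy <- numeric(k)
for (i in seq_len(k)) {
  train_y <- stars[fold_id != i]  # k - 1 folds used to fit
  test_y  <- stars[fold_id == i]  # held-out fold used to score

  # Stand-in "model": predict the most common class in the training folds.
  majority <- names(which.max(table(train_y)))
  fold_accuracy[i] <- mean(test_y == majority)
}

# Cross-validated accuracy is the average over the k held-out folds.
cv_accuracy <- mean(fold_accuracy)
print(cv_accuracy)
```

Every row is held out exactly once, so the averaged accuracy uses all of the training subset without ever scoring a row that was used to fit the model for that fold.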
Since the outcome variable is categorical (stars from 1 to 5), it doesn’t make much sense to use regression models. So I decided to use prediction with trees (the rpart model), which is a classification method.
Here I use centering and scaling as the pre-processing step in the model. I did not use PCA because I did not think it made sense for this classification model.
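Centering and scaling means subtracting each predictor's mean and dividing by its standard deviation. A small base R sketch of that transformation, assuming two hypothetical numeric predictors (the column names and values are illustrative only):

```r
# Hypothetical numeric predictors (illustrative names and values).
predictors <- data.frame(review_count  = c(10, 250, 40, 1200),
                         average_stars = c(3.5, 4.1, 2.8, 4.6))

# Column means and standard deviations used for centering and scaling.
means <- colMeans(predictors)
sds   <- sapply(predictors, sd)

# Subtract the mean and divide by the standard deviation, column by column.
scaled <- scale(predictors, center = means, scale = sds)
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

After the transformation every column has mean 0 and standard deviation 1. Note that tree splits depend only on the ordering of each predictor's values, so this step should not change which splits an rpart tree chooses.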
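# Load the cleaned data set and keep only the Las Vegas businesses
load("Yelp_10_clean_df_v2.rdata")
Yelp_10_clean_df_v3 <- subset(Yelp_10_clean_df_v2, `Las Vegas` == 1)
# Keep the outcome (stars) and the 11 predictors
Yelp_10_clean_df_v4 <- Yelp_10_clean_df_v3[, c(4:5, 7:10, 28:33)]
# Treat stars as a categorical outcome
Yelp_10_clean_df_v4$stars <- as.factor(Yelp_10_clean_df_v4$stars)
library(caret)
set.seed(1022)
# 80/20 split into training and testing subsets, stratified by stars
inTrain <- createDataPartition(Yelp_10_clean_df_v4$stars, p = 0.80, list = FALSE)
training <- Yelp_10_clean_df_v4[inTrain, ]
testing <- Yelp_10_clean_df_v4[-inTrain, ]
# 10-fold cross-validation on the training subset
control <- trainControl(method = "cv", number = 10, p = 0.8)
# Classification tree (rpart) with centered and scaled predictors
mfit <- train(stars ~ ., data = training, method = "rpart",
              preProc = c("center", "scale"), trControl = control)
mfit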
## CART
##
## 493890 samples
## 11 predictor
## 5 classes: '1', '2', '3', '4', '5'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 444502, 444502, 444500, 444500, 444501, 444501, ...
## Resampling results across tuning parameters:
##
##   cp          Accuracy   Kappa       Accuracy SD  Kappa SD
##   0.01065731  0.4555225  0.20556404  0.003836637  0.01085453
##   0.04391816  0.4394114  0.16818652  0.015743959  0.03011324
##   0.04772774  0.3739094  0.02660215  0.024655143  0.05608235
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01065731.
Cross-validation lets us pick the optimal model based on the highest accuracy. The final complexity parameter (cp) used for the model was 0.011. The following graph shows how the cross-validated accuracy changes with the complexity parameter.
Here’s the classification dendrogram plot for the final model. The probability values in the nodes indicate how likely an observation in that node is to belong to each class.
Now we can use the final model to make predictions on the testing subset. We use the predict() function and a confusion matrix to summarize the results. These results give the out-of-sample error rates.
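The call that produced the confusion matrix is not shown. With caret it would presumably look something like confusionMatrix(data = predict(mfit, newdata = testing), reference = testing$stars), where the predictions go in the first argument and the true labels in the second; that exact call is an assumption. The underlying cross-tabulation can be sketched in base R with hypothetical labels:

```r
# Hypothetical true and predicted star ratings for six test reviews.
star_levels <- as.character(1:5)
actual    <- factor(c("5", "4", "1", "5", "3", "4"), levels = star_levels)
predicted <- factor(c("5", "5", "1", "4", "4", "4"), levels = star_levels)

# Rows are predictions, columns are the true (reference) classes.
conf_table <- table(Prediction = predicted, Reference = actual)

# Out-of-sample accuracy: share of test reviews on the diagonal.
accuracy <- sum(diag(conf_table)) / sum(conf_table)
print(conf_table)
print(accuracy)
```

Keeping the argument order straight matters: if predictions and true labels are swapped, the accuracy is unchanged but the row and column labels, and every per-class statistic, refer to the wrong thing.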
Here are the prediction results on the testing subset:
Conf_matrix$table
##           Reference
## Prediction     1     2     3     4     5
##          1  4032     0     0  5740  3187
##          2  1220     0     0  6446  3706
##          3   677     0     0 10713  7064
##          4   649     0     0 14745 20566
##          5   482     0     0  7710 36531
Here are the overall statistics:
Conf_matrix$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull
##      0.4479541      0.1864430      0.4451775      0.4507332      0.5754851
## AccuracyPValue  McnemarPValue
##      1.0000000            NaN
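These overall figures can be reproduced by hand from the confusion table. The sketch below re-enters the counts from the table above and recomputes Accuracy, AccuracyNull (the share of the largest Reference class), and Kappa using the standard definitions. The variable names are mine, and this is a check on the arithmetic rather than caret's own code.

```r
# Counts copied from the confusion table (rows = Prediction, columns = Reference).
counts <- matrix(c(4032, 0, 0,  5740,  3187,
                   1220, 0, 0,  6446,  3706,
                    677, 0, 0, 10713,  7064,
                    649, 0, 0, 14745, 20566,
                    482, 0, 0,  7710, 36531),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(Prediction = as.character(1:5),
                                 Reference  = as.character(1:5)))

n <- sum(counts)

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(counts)) / n

# No-information rate: share of the largest Reference class.
accuracy_null <- max(colSums(counts)) / n

# Cohen's kappa: agreement beyond what the marginal totals alone would give.
expected    <- sum(rowSums(counts) * colSums(counts)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(c(accuracy = accuracy, accuracy_null = accuracy_null, kappa = cohen_kappa))
```

Because the accuracy is below the reported no-information rate, the one-sided test of accuracy against that rate gives a p-value of 1.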
And finally the statistics by class:
t(Conf_matrix$byClass)
##                        Class: 1   Class: 2  Class: 3  Class: 4  Class: 5
## Sensitivity          0.57110482         NA        NA 0.3251091 0.5141301
## Specificity          0.92331283 0.90789516 0.8505362 0.7284098 0.8437059
## Pos Pred Value       0.31113512         NA        NA 0.4100389 0.8168280
## Neg Pred Value       0.97259952         NA        NA 0.6502148 0.5615849
## Prevalence           0.05718081 0.00000000 0.0000000 0.3673340 0.5754851
## Detection Rate       0.03265623 0.00000000 0.0000000 0.1194237 0.2958742
## Detection Prevalence 0.10495837 0.09210484 0.1494638 0.2912496 0.3622234
## Balanced Accuracy    0.74720882         NA        NA 0.5267595 0.6789180
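The NA entries for classes 2 and 3 come from dividing by zero: sensitivity is the diagonal count divided by the Reference column total, and those two column totals are 0. A short base R sketch, re-entering the counts from the confusion table and using the standard one-versus-rest definitions (the variable names are mine):

```r
# Counts copied from the confusion table (rows = Prediction, columns = Reference).
counts <- matrix(c(4032, 0, 0,  5740,  3187,
                   1220, 0, 0,  6446,  3706,
                    677, 0, 0, 10713,  7064,
                    649, 0, 0, 14745, 20566,
                    482, 0, 0,  7710, 36531),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(Prediction = as.character(1:5),
                                 Reference  = as.character(1:5)))

n        <- sum(counts)
true_pos <- diag(counts)

# Sensitivity: true positives over the Reference column total (0/0 gives NaN).
sensitivity <- true_pos / colSums(counts)

# Specificity: true negatives over everything outside the Reference class.
true_neg    <- n - rowSums(counts) - colSums(counts) + true_pos
specificity <- true_neg / (n - colSums(counts))

print(rbind(sensitivity, specificity))
```

Specificity stays defined for classes 2 and 3 because its denominator is the number of observations outside the class, which is the full table when the column total is 0.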
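At first glance, my results seem to suggest that users avoid giving 2 or 3 stars and choose between 1, 4 and 5, because the columns for 2 and 3 stars in the confusion table are all zero. However, createDataPartition() stratifies on stars, so the testing subset should contain 2- and 3-star reviews. The more likely reading is that the tree never predicts 2 or 3 stars, and that the Prediction and Reference axes may be swapped in the table. Either way, the accuracy is 0.448, which means that my model predicts the correct number of stars only 44.8% of the time for a user rating a local business in Las Vegas.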
Since I narrowed my analysis down to a specific city, we might get better results by including businesses from other cities. I also tried a random forest model, but its results were no better than those from rpart, so I chose rpart, which is much faster computationally.