Is a restaurant’s open/closed status (the response variable) correlated with its reviews and with the users who wrote them? If so, we can fit a regression model to predict whether a restaurant is heading towards closure. Such a model could offer insights that help restaurants stay in business, and Yelp could use its predictions to improve its service to consumers.
The accuracy and objectivity of the review and user data are important to this investigation. Although individual reviews and users are likely subjective, I assume that any bias averages out across a sufficiently large number of reviews and users. A quantitative treatment of reviews (aggregate features computed as means) should therefore give a reasonably objective measure of a restaurant’s quality.
I cleaned the raw data and derived aggregate features from it. I then chose variables to fit a logistic regression model with a binary response variable based on the open attribute in the business file, and examined the accuracy of the model.
Please see the full code in the RMD chunk “prep data”.
## business_id open review_count.biz name
## Length:990720 Mode :logical Min. : 3.0 Length:990720
## Class :character FALSE:106892 1st Qu.: 56.0 Class :character
## Mode :character TRUE :883828 Median : 143.0 Mode :character
## NA's :0 Mean : 348.1
## 3rd Qu.: 339.0
## Max. :4578.0
##
## stars.biz user_id review_id stars.review
## Min. :1.000 Length:990720 Length:990720 Min. :1.00
## 1st Qu.:3.500 Class :character Class :character 1st Qu.:3.00
## Median :4.000 Mode :character Mode :character Median :4.00
## Mean :3.712 Mean :3.72
## 3rd Qu.:4.000 3rd Qu.:5.00
## Max. :5.000 Max. :5.00
## NA's :93
## date.review total_votes.review age.review review_count.user
## Length:990720 Min. : 0.000 Min. : 318 Min. : 0.0
## Class :character 1st Qu.: 0.000 1st Qu.: 606 1st Qu.: 10.0
## Mode :character Median : 1.000 Median :1006 Median : 35.0
## Mean : 2.026 Mean :1160 Mean : 130.4
## 3rd Qu.: 2.000 3rd Qu.:1596 3rd Qu.: 137.0
## Max. :444.000 Max. :4058 Max. :4573.0
## NA's :93 NA's :93 NA's :93
## fans.user average_stars.user total_votes.user
## Min. : 0.000 Min. :0.000 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.:3.480 1st Qu.: 11.0
## Median : 1.000 Median :3.790 Median : 58.0
## Mean : 8.004 Mean :3.743 Mean : 751.9
## 3rd Qu.: 5.000 3rd Qu.:4.080 3rd Qu.: 338.0
## Max. :1298.000 Max. :5.000 Max. :100319.0
## NA's :93 NA's :93 NA's :93
## total_friends.user years_elite.user yelping_age.user
## Min. : 0.00 Min. : 0.000 Min. : 325
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.:1239
## Median : 5.00 Median : 0.000 Median :1755
## Mean : 44.78 Mean : 1.031 Mean :1783
## 3rd Qu.: 26.00 3rd Qu.: 1.000 3rd Qu.:2304
## Max. :3830.00 Max. :11.000 Max. :4069
## NA's :93 NA's :93 NA's :93
## total_compliments.user total_checkins.biz
## Min. : 0.0 Min. : 3
## 1st Qu.: 0.0 1st Qu.: 133
## Median : 3.0 Median : 400
## Mean : 179.3 Mean : 1002
## 3rd Qu.: 32.0 3rd Qu.: 1071
## Max. :49151.0 Max. :14203
## NA's :93 NA's :16674
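The summary of the superset data frame shows 990,720 rows, of which 883,828 belong to open restaurants and 106,892 to closed ones. Several review and user columns each have 93 missing values, and total_checkins.biz has 16,674.
I grouped the review, user and check-in data by restaurant, then derived summary statistics for each restaurant: the mean of each review and user variable (for example mean_stars.review and mean_average_stars.user).
Please see the full code in the RMD chunk “summarize data”.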
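The per-restaurant aggregation can be illustrated with a minimal sketch. It is not the code from the RMD chunk: the toy `superset` table and its values are made up for illustration, and base R `aggregate()` is used here as an assumption in place of whatever grouping code the chunk contains.

```r
# Toy stand-in for the merged review/user table (values are invented).
superset <- data.frame(
  business_id = c("b1", "b1", "b2", "b2", "b2"),
  open = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  stars.review = c(4, 5, 2, 3, 1),
  total_votes.review = c(0, 2, 1, NA, 3)
)

# One row per restaurant: the mean of each review-level variable.
# na.action = na.pass keeps rows containing NA so that
# mean(na.rm = TRUE) can skip only the missing cells.
biz <- aggregate(
  cbind(stars.review, total_votes.review) ~ business_id + open,
  data = superset,
  FUN = function(x) mean(x, na.rm = TRUE),
  na.action = na.pass
)

# Prefix the aggregated columns with "mean_", matching the report's naming.
names(biz)[3:4] <- paste0("mean_", names(biz)[3:4])
print(biz)
```

Handling missing values inside the mean, rather than dropping whole rows first, keeps a review in the average for every variable it does have.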
There were no near-zero-variance predictors to remove. The cleaned data was then partitioned into training and test sets with a 70/30 split. On the training set, I fitted a logistic regression model and refined it with bidirectional stepwise AIC model selection, which produced the best-fit model below.
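The split-and-select procedure can be sketched roughly as follows. The data is simulated, the predictor names `x1` to `x3` are hypothetical, and base R `sample()` and `step()` are used as stand-ins; the report itself may well have used the caret and MASS packages listed in the session information.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the per-restaurant table; x3 is pure noise.
biz <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
biz$open <- rbinom(n, 1, plogis(1 + 1.5 * biz$x1 - biz$x2)) == 1

# 70/30 split on a random row index.
train_idx <- sample(n, size = round(0.7 * n))
training <- biz[train_idx, ]
testing <- biz[-train_idx, ]

# Fit the full logistic model, then let stepwise AIC selection
# add or drop terms in both directions.
full_fit <- glm(open ~ x1 + x2 + x3, family = binomial(logit), data = training)
best_fit <- step(full_fit, direction = "both", trace = 0)
print(formula(best_fit))
```

Selection is run on the training rows only, so the held-out 30% remains untouched for the accuracy checks.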
##
## Call:
## glm(formula = open ~ review_count.biz + total_checkins.biz +
## mean_stars.review + mean_total_votes.review + mean_age.review +
## mean_review_count.user + mean_average_stars.user + mean_total_votes.user +
## mean_total_friends.user + mean_yelping_age.user + mean_total_compliments.user,
## family = binomial(logit), data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7629 0.1778 0.3400 0.5221 3.1792
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.638e+00 4.288e-01 20.143 < 2e-16 ***
## review_count.biz 3.800e-03 8.358e-04 4.547 5.45e-06 ***
## total_checkins.biz 1.114e-03 2.736e-04 4.072 4.67e-05 ***
## mean_stars.review -8.164e-02 4.588e-02 -1.779 0.075189 .
## mean_total_votes.review -5.626e-02 1.741e-02 -3.232 0.001228 **
## mean_age.review -1.955e-03 8.261e-05 -23.669 < 2e-16 ***
## mean_review_count.user 3.497e-03 3.118e-04 11.214 < 2e-16 ***
## mean_average_stars.user -4.297e-01 1.274e-01 -3.373 0.000743 ***
## mean_total_votes.user -1.177e-04 5.295e-05 -2.224 0.026164 *
## mean_total_friends.user -1.694e-03 5.293e-04 -3.200 0.001372 **
## mean_yelping_age.user -1.835e-03 1.005e-04 -18.259 < 2e-16 ***
## mean_total_compliments.user 3.087e-04 1.292e-04 2.389 0.016876 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15194 on 15259 degrees of freedom
## Residual deviance: 10140 on 15248 degrees of freedom
## AIC: 10164
##
## Number of Fisher Scoring iterations: 6
The model summary shows that every coefficient is significant at the 10% level or better (p < 0.1). All except mean_stars.review (p = 0.075) are also significant at the 5% level.
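One way to read the estimates is to convert them from log-odds to odds ratios. The sketch below copies two estimates from the model summary; the vector name `coefs` is only for illustration.

```r
# Two estimates copied from the model summary (log-odds scale).
coefs <- c(
  mean_average_stars.user = -4.297e-01,
  review_count.biz = 3.800e-03
)

# exp() turns a log-odds coefficient into a multiplicative
# change in the odds of being open per one-unit increase.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))
```

An odds ratio below 1 corresponds to a negative coefficient. For mean_average_stars.user it is roughly 0.65, meaning each additional unit multiplies the odds of being open by about that factor with the other predictors held fixed.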
The ROC plot has an AUC (area under the curve) of 0.858, which indicates that the model performs better than a random guess (AUC = 0.5) but is not a perfect classifier (AUC = 1.0).
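For intuition, AUC equals the probability that a randomly chosen open restaurant receives a higher predicted probability than a randomly chosen closed one. The sketch below computes it in base R from ranks; the scores and labels are invented, and the helper `auc()` is hypothetical rather than part of the report, which loads the ROCR package.

```r
# Invented predicted probabilities and true open/closed labels.
prob <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
open <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Rank-based (Mann-Whitney) AUC: the share of open/closed pairs
# in which the open restaurant has the higher score.
auc <- function(score, label) {
  r <- rank(score)  # average ranks, so ties count as half
  n_pos <- sum(label)
  n_neg <- sum(!label)
  (sum(r[label]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

print(auc(prob, open))
```

In this toy data 8 of the 9 open/closed pairs are ordered correctly, so the function returns 8/9.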
The Optimal Cutoff plot shows that the model’s best prediction accuracy is 0.878, reached at a probability cutoff of 0.512.
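The idea behind an optimal cutoff can be sketched as a simple grid search: classify a restaurant as open when its predicted probability reaches the cutoff, then keep the cutoff with the highest accuracy. The probabilities, labels and the helper `accuracy_at()` below are all hypothetical.

```r
# Invented predicted probabilities and true open/closed labels.
prob <- c(0.91, 0.82, 0.68, 0.57, 0.41, 0.27)
open <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Accuracy when "open" is predicted for prob >= cutoff.
accuracy_at <- function(cutoff, prob, label) {
  mean((prob >= cutoff) == label)
}

# Evaluate a grid of cutoffs and keep the first one with the best accuracy.
cutoffs <- seq(0.05, 0.95, by = 0.05)
acc <- sapply(cutoffs, accuracy_at, prob = prob, label = open)
best_cutoff <- cutoffs[which.max(acc)]

print(data.frame(cutoff = best_cutoff, accuracy = max(acc)))
```

Because accuracy depends on the class balance, a cutoff chosen this way is best read alongside the share of open restaurants in the data.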
The results show an association between the review and user data and a restaurant’s open/closed status. Further investigation of the model selection is needed, including a check for interactions between predictors, because it is not apparent why some model coefficients are negative, such as mean_stars.review, mean_total_votes.review, mean_average_stars.user, mean_total_votes.user and mean_total_friends.user.
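A check for interaction between two predictors could look like the sketch below, which compares nested logistic models with a likelihood-ratio test. The data is simulated and the predictors `x1` and `x2` are placeholders, not variables from the report.

```r
set.seed(2)
n <- 400

# Simulated data with a built-in x1:x2 interaction.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$open <- rbinom(n, 1, plogis(0.5 + d$x1 - d$x2 + 0.8 * d$x1 * d$x2)) == 1

# Main-effects model versus the same model plus the interaction term.
main_fit <- glm(open ~ x1 + x2, family = binomial, data = d)
int_fit <- update(main_fit, . ~ . + x1:x2)

# Likelihood-ratio (chi-squared) test on the drop in deviance.
lrt <- anova(main_fit, int_fit, test = "Chisq")
print(lrt)
```

A small p-value in the second row would suggest that the effect of one predictor depends on the level of the other, which is one possible reason for a coefficient carrying an unexpected sign.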
This classifier should be applicable to other types of business too. The model is quick to run and has very little performance impact, so Yelp could generate such summary statistics for each business and provide insights to support business decisions.
— END of REPORT —
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ROCR_1.0-7 gplots_2.17.0 MASS_7.3-43 caret_6.0-58
## [5] ggplot2_1.0.1 lattice_0.20-33 dplyr_0.4.3 plyr_1.8.3
## [9] jsonlite_0.9.17 doSNOW_1.0.14 snow_0.4-1 iterators_1.0.7
## [13] foreach_1.4.2 cluster_2.0.3
##
## loaded via a namespace (and not attached):
## [1] gtools_3.5.0 reshape2_1.4.1 splines_3.2.2
## [4] colorspace_1.2-6 htmltools_0.2.6 stats4_3.2.2
## [7] yaml_2.1.13 mgcv_1.8-7 nloptr_1.0.4
## [10] DBI_0.3.1 stringr_1.0.0 MatrixModels_0.4-1
## [13] munsell_0.4.2 gtable_0.1.2 caTools_1.17.1
## [16] codetools_0.2-14 evaluate_0.7.2 labeling_0.3
## [19] knitr_1.11 SparseM_1.7 quantreg_5.19
## [22] pbkrtest_0.4-2 parallel_3.2.2 proto_0.3-10
## [25] Rcpp_0.12.0 KernSmooth_2.23-15 scales_0.3.0
## [28] formatR_1.2 gdata_2.17.0 lme4_1.1-9
## [31] digest_0.6.8 stringi_0.5-5 grid_3.2.2
## [34] tools_3.2.2 bitops_1.0-6 magrittr_1.5
## [37] car_2.1-0 Matrix_1.2-2 assertthat_0.1
## [40] minqa_1.2.4 rmarkdown_0.8 rstudioapi_0.3.1
## [43] R6_2.1.1 nnet_7.3-10 nlme_3.1-121