Introduction

Is there a correlation between a restaurant’s open/closed status (the response variable) and the restaurant’s reviews and the users who wrote them? If there is, we can fit a regression model and predict whether a restaurant is heading towards closure. This gives insights that could help restaurants stay in business. Yelp could also use the prediction model to improve its service to consumers.

Methods and Data

The accuracy and objectivity of the review and user data are important to this investigation. Although individual reviews and users are likely subjective, I assume that any bias averages out across a sufficiently large number of reviews and users. Under that assumption, a quantitative treatment of reviews (aggregate features computed as means) gives an objective measure of the quality of the restaurant.

I cleaned the raw data and derived aggregate features from it. Variables were then chosen to fit a logistic regression model, with a binary response variable based on the open attribute in the business file. I then examined the accuracy of the model.

Prepare data

  1. We first import the Yelp dataset into R.
  2. Select only restaurants out of the business file.
  3. Select reviews for the restaurants.
  4. Select users who wrote those reviews.
  5. Tally all checkins for each of those restaurants.
  6. For the review, user and checkin data, we then compute aggregate features from the raw data.
  7. Finally, we merge the restaurant, review, user and checkin data into one superset dataframe. In other words, we append the restaurant, user and checkin data to each review.
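
The selection and merge steps above can be sketched on toy data frames. This is a minimal illustration in base R, not the code from the “prep data” chunk: the `categories` column, the simplified column names and all values are assumptions made up for the example.

```r
# Toy stand-ins for the Yelp files; every value here is invented for illustration.
business <- data.frame(
  business_id = c("b1", "b2", "b3"),
  categories  = c("Restaurants", "Shopping", "Restaurants"),  # assumed column
  open        = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
review <- data.frame(
  review_id   = c("r1", "r2", "r3", "r4"),
  business_id = c("b1", "b1", "b2", "b3"),
  user_id     = c("u1", "u2", "u1", "u3"),
  stars       = c(4, 5, 3, 2),
  stringsAsFactors = FALSE
)
user <- data.frame(
  user_id = c("u1", "u2", "u3", "u4"),
  fans    = c(0, 12, 3, 1),
  stringsAsFactors = FALSE
)
checkin <- data.frame(
  business_id    = c("b1", "b2"),
  total_checkins = c(40, 7),
  stringsAsFactors = FALSE
)

# Step 2: keep only restaurants.
restaurants <- business[grepl("Restaurants", business$categories), ]

# Steps 3, 4 and 7: one row per review, with restaurant, user and checkin
# columns attached. all.x = TRUE keeps rows that have no match (filled with NA).
superset <- merge(restaurants, review, by = "business_id", all.x = TRUE)
superset <- merge(superset, user, by = "user_id", all.x = TRUE)
superset <- merge(superset, checkin, by = "business_id", all.x = TRUE)

print(superset)
```

Left joins are used here so that a restaurant without checkins stays in the result with an NA, which matches the NA pattern visible in the data summary.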

Please see the full code in the RMD chunk “prep data”.

Explore Data

##  business_id           open         review_count.biz     name          
##  Length:990720      Mode :logical   Min.   :   3.0   Length:990720     
##  Class :character   FALSE:106892    1st Qu.:  56.0   Class :character  
##  Mode  :character   TRUE :883828    Median : 143.0   Mode  :character  
##                     NA's :0         Mean   : 348.1                     
##                                     3rd Qu.: 339.0                     
##                                     Max.   :4578.0                     
##                                                                        
##    stars.biz       user_id           review_id          stars.review 
##  Min.   :1.000   Length:990720      Length:990720      Min.   :1.00  
##  1st Qu.:3.500   Class :character   Class :character   1st Qu.:3.00  
##  Median :4.000   Mode  :character   Mode  :character   Median :4.00  
##  Mean   :3.712                                         Mean   :3.72  
##  3rd Qu.:4.000                                         3rd Qu.:5.00  
##  Max.   :5.000                                         Max.   :5.00  
##                                                        NA's   :93    
##  date.review        total_votes.review   age.review   review_count.user
##  Length:990720      Min.   :  0.000    Min.   : 318   Min.   :   0.0   
##  Class :character   1st Qu.:  0.000    1st Qu.: 606   1st Qu.:  10.0   
##  Mode  :character   Median :  1.000    Median :1006   Median :  35.0   
##                     Mean   :  2.026    Mean   :1160   Mean   : 130.4   
##                     3rd Qu.:  2.000    3rd Qu.:1596   3rd Qu.: 137.0   
##                     Max.   :444.000    Max.   :4058   Max.   :4573.0   
##                     NA's   :93         NA's   :93     NA's   :93       
##    fans.user        average_stars.user total_votes.user  
##  Min.   :   0.000   Min.   :0.000      Min.   :     0.0  
##  1st Qu.:   0.000   1st Qu.:3.480      1st Qu.:    11.0  
##  Median :   1.000   Median :3.790      Median :    58.0  
##  Mean   :   8.004   Mean   :3.743      Mean   :   751.9  
##  3rd Qu.:   5.000   3rd Qu.:4.080      3rd Qu.:   338.0  
##  Max.   :1298.000   Max.   :5.000      Max.   :100319.0  
##  NA's   :93         NA's   :93         NA's   :93        
##  total_friends.user years_elite.user yelping_age.user
##  Min.   :   0.00    Min.   : 0.000   Min.   : 325    
##  1st Qu.:   0.00    1st Qu.: 0.000   1st Qu.:1239    
##  Median :   5.00    Median : 0.000   Median :1755    
##  Mean   :  44.78    Mean   : 1.031   Mean   :1783    
##  3rd Qu.:  26.00    3rd Qu.: 1.000   3rd Qu.:2304    
##  Max.   :3830.00    Max.   :11.000   Max.   :4069    
##  NA's   :93         NA's   :93       NA's   :93      
##  total_compliments.user total_checkins.biz
##  Min.   :    0.0        Min.   :    3     
##  1st Qu.:    0.0        1st Qu.:  133     
##  Median :    3.0        Median :  400     
##  Mean   :  179.3        Mean   : 1002     
##  3rd Qu.:   32.0        3rd Qu.: 1071     
##  Max.   :49151.0        Max.   :14203     
##  NA's   :93             NA's   :16674

The summary of the superset dataframe shows:

  1. 93 NA’s in the review and user variables, because the corresponding restaurants have no reviews. We omit these rows, as we need at least 1 review to proceed.
  2. NA’s in total_checkins.biz, which we replace with 0.
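
A minimal sketch of these two cleaning rules on a toy dataframe follows. The column names echo the summary above, but the values are invented and the code is an assumption about how the rules could be applied, not the report’s own chunk.

```r
# Toy superset: row 2 has no review, row 3 has no checkin record.
superset <- data.frame(
  business_id        = c("b1", "b2", "b3"),
  review_id          = c("r1", NA, "r2"),
  stars.review       = c(4, NA, 2),
  total_checkins.biz = c(40, 7, NA),
  stringsAsFactors = FALSE
)

# Rule 1: drop rows without a review; at least one review is needed downstream.
superset <- superset[!is.na(superset$review_id), ]

# Rule 2: a restaurant with no checkin record is treated as having zero checkins.
superset$total_checkins.biz[is.na(superset$total_checkins.biz)] <- 0

print(superset)
```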

I group the review, user and checkin data by restaurant, then derive summary statistics for each restaurant as follows:

  • Take the mean of these review attributes from all reviews of the restaurant:
    • star rating
    • total votes
    • review age in days
  • Take the mean of these user attributes from all users who reviewed the restaurant:
    • no. of reviews written by the user
    • no. of fans
    • average star rating
    • total no. of votes
    • total no. of friends
    • no. of years as elite user
    • no. of days user has been a Yelp user
    • total no. of compliments
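
The per-restaurant means described above can be sketched with base R’s `aggregate`. The toy values and the choice of `aggregate` (rather than whatever the “summarize data” chunk uses) are assumptions; the `mean_` prefix follows the variable names that appear in the fitted model.

```r
# Toy review-level rows; values are invented for illustration.
superset <- data.frame(
  business_id  = c("b1", "b1", "b3", "b3", "b3"),
  stars.review = c(4, 5, 2, 3, 1),
  fans.user    = c(0, 12, 3, 3, 0)
)

# One row per restaurant, holding the mean of each review and user attribute.
per_restaurant <- aggregate(
  cbind(stars.review, fans.user) ~ business_id,
  data = superset, FUN = mean
)

# Prefix the aggregated columns with "mean_", e.g. mean_stars.review.
names(per_restaurant)[-1] <- paste0("mean_", names(per_restaurant)[-1])

print(per_restaurant)
```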

Please see the full code in the RMD chunk “summarize data”.

Perform logistic regression

There were no near-zero-variance predictors to remove. The cleaned data was then partitioned into training and test sets with a 70/30 split. On the training set, a logistic regression model was fitted and refined with bidirectional stepwise AIC model selection, resulting in a best-fit model.
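
The split, fit and stepwise selection can be sketched as follows. This is an illustration on simulated data, not the report’s code: the data-generating process is invented, the split uses base R `sample` (the report may have used a different partitioning helper), and only `MASS::stepAIC` is taken from the packages listed in the session info.

```r
library(MASS)  # provides stepAIC

set.seed(1)
n <- 500

# Simulated per-restaurant data; names echo the report, values are synthetic.
dat <- data.frame(
  review_count.biz  = rpois(n, 50),
  mean_age.review   = runif(n, 300, 3000),
  mean_stars.review = runif(n, 1, 5)
)
# Assumed relationship for the simulation: older reviews lower the odds of being open.
p <- plogis(3 - 0.002 * dat$mean_age.review + 0.01 * dat$review_count.biz)
dat$open <- rbinom(n, 1, p) == 1

# 70/30 split by random row index.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
training  <- dat[train_idx, ]
testing   <- dat[-train_idx, ]

# Full logistic model, then bidirectional stepwise selection by AIC.
full <- glm(open ~ ., family = binomial(logit), data = training)
best <- stepAIC(full, direction = "both", trace = FALSE)

summary(best)
```

Starting from the full model, `stepAIC` only accepts a step when it lowers the AIC, so the selected model never has a higher AIC than the full one.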

Results

## 
## Call:
## glm(formula = open ~ review_count.biz + total_checkins.biz + 
##     mean_stars.review + mean_total_votes.review + mean_age.review + 
##     mean_review_count.user + mean_average_stars.user + mean_total_votes.user + 
##     mean_total_friends.user + mean_yelping_age.user + mean_total_compliments.user, 
##     family = binomial(logit), data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7629   0.1778   0.3400   0.5221   3.1792  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  8.638e+00  4.288e-01  20.143  < 2e-16 ***
## review_count.biz             3.800e-03  8.358e-04   4.547 5.45e-06 ***
## total_checkins.biz           1.114e-03  2.736e-04   4.072 4.67e-05 ***
## mean_stars.review           -8.164e-02  4.588e-02  -1.779 0.075189 .  
## mean_total_votes.review     -5.626e-02  1.741e-02  -3.232 0.001228 ** 
## mean_age.review             -1.955e-03  8.261e-05 -23.669  < 2e-16 ***
## mean_review_count.user       3.497e-03  3.118e-04  11.214  < 2e-16 ***
## mean_average_stars.user     -4.297e-01  1.274e-01  -3.373 0.000743 ***
## mean_total_votes.user       -1.177e-04  5.295e-05  -2.224 0.026164 *  
## mean_total_friends.user     -1.694e-03  5.293e-04  -3.200 0.001372 ** 
## mean_yelping_age.user       -1.835e-03  1.005e-04 -18.259  < 2e-16 ***
## mean_total_compliments.user  3.087e-04  1.292e-04   2.389 0.016876 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15194  on 15259  degrees of freedom
## Residual deviance: 10140  on 15248  degrees of freedom
## AIC: 10164
## 
## Number of Fisher Scoring iterations: 6

The model summary shows that all coefficients have small p-values and are significant at the 10% level (p < 0.1); all except mean_stars.review (p = 0.075) are also significant at the 5% level.
The ROC plot has an AUC (area under the curve) of 0.858, which indicates that the model performs better than a random guess (AUC = 0.5) but is not a perfect classifier (AUC = 1.0).
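
To make the AUC figure concrete, here is a small base R sketch that computes AUC through the rank-sum identity: the AUC equals the probability that a randomly chosen open restaurant gets a higher predicted score than a randomly chosen closed one. The helper `auc` and the toy scores are hypothetical; the report itself appears to have used the ROCR package for its plot.

```r
# AUC via the rank-sum (Mann-Whitney) identity. `label` is a logical vector
# (TRUE = open), `score` is the predicted probability of being open.
auc <- function(score, label) {
  r <- rank(score)          # average ranks, so tied scores count as half
  n_pos <- sum(label)
  n_neg <- sum(!label)
  (sum(r[label]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 5 of the 6 positive/negative pairs are ordered correctly.
label <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
score <- c(0.9, 0.8, 0.4, 0.5, 0.1)
print(auc(score, label))
```

A classifier that assigns the same score to every restaurant yields 0.5 under this formula, which is the random-guess baseline mentioned above.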

The Optimal Cutoff plot shows that the model’s best prediction accuracy is 0.878, at a probability cutoff of 0.512.
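
The idea behind the optimal cutoff can be sketched as a simple grid search: classify a restaurant as open when its predicted probability is at or above the cutoff, and keep the cutoff with the highest accuracy. The helper `accuracy_at`, the grid and the toy values are assumptions for illustration only.

```r
# Accuracy when predicting "open" for probabilities at or above the cutoff.
accuracy_at <- function(prob, label, cutoff) mean((prob >= cutoff) == label)

# Toy predictions; values are invented for illustration.
label <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
prob  <- c(0.93, 0.81, 0.62, 0.57, 0.48, 0.43, 0.22, 0.12)

# Evaluate accuracy on a grid of cutoffs and keep the best one.
cutoffs <- seq(0.05, 0.95, by = 0.05)
acc <- sapply(cutoffs, function(k) accuracy_at(prob, label, k))
best_cutoff <- cutoffs[which.max(acc)]

print(data.frame(cutoff = cutoffs, accuracy = acc))
print(best_cutoff)
```

When several cutoffs tie for the best accuracy, `which.max` returns the lowest of them; a finer grid or a different tie-breaking rule may be preferable in practice.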

Discussion

The results show a correlation between the review and user data and the restaurant’s open/closed status. Further investigation of the model selection is needed, including a check for interactions between predictors, because it is not apparent why some model coefficients are negative, for example mean_stars.review, mean_total_votes.review, mean_average_stars.user, mean_total_votes.user and mean_total_friends.user.
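
One way to read the sign and size of a logistic regression coefficient is to convert it to an odds ratio. The sketch below uses two estimates copied from the model summary in the Results section; the 100-unit step is an arbitrary choice for illustration.

```r
# Estimates copied from the model summary in the Results section.
coefs <- c(mean_age.review = -1.955e-03, review_count.biz = 3.800e-03)

# Multiplicative change in the odds of being open for a 100-unit increase
# in each predictor, holding the other predictors fixed.
odds_ratio_100 <- exp(100 * coefs)
print(round(odds_ratio_100, 3))
```

A negative coefficient gives an odds ratio below 1 (about 0.82 for 100 extra days of mean review age), while a positive one gives a ratio above 1 (about 1.46 for 100 extra reviews). This describes association within the fitted model only and does not explain why a given sign occurs.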

This classifier should be applicable to other types of business too. Because the model is quick to run and has very little performance impact, Yelp could generate such summary statistics for each business and provide insights to support business decisions.
— END of REPORT —

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ROCR_1.0-7      gplots_2.17.0   MASS_7.3-43     caret_6.0-58   
##  [5] ggplot2_1.0.1   lattice_0.20-33 dplyr_0.4.3     plyr_1.8.3     
##  [9] jsonlite_0.9.17 doSNOW_1.0.14   snow_0.4-1      iterators_1.0.7
## [13] foreach_1.4.2   cluster_2.0.3  
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0       reshape2_1.4.1     splines_3.2.2     
##  [4] colorspace_1.2-6   htmltools_0.2.6    stats4_3.2.2      
##  [7] yaml_2.1.13        mgcv_1.8-7         nloptr_1.0.4      
## [10] DBI_0.3.1          stringr_1.0.0      MatrixModels_0.4-1
## [13] munsell_0.4.2      gtable_0.1.2       caTools_1.17.1    
## [16] codetools_0.2-14   evaluate_0.7.2     labeling_0.3      
## [19] knitr_1.11         SparseM_1.7        quantreg_5.19     
## [22] pbkrtest_0.4-2     parallel_3.2.2     proto_0.3-10      
## [25] Rcpp_0.12.0        KernSmooth_2.23-15 scales_0.3.0      
## [28] formatR_1.2        gdata_2.17.0       lme4_1.1-9        
## [31] digest_0.6.8       stringi_0.5-5      grid_3.2.2        
## [34] tools_3.2.2        bitops_1.0-6       magrittr_1.5      
## [37] car_2.1-0          Matrix_1.2-2       assertthat_0.1    
## [40] minqa_1.2.4        rmarkdown_0.8      rstudioapi_0.3.1  
## [43] R6_2.1.1           nnet_7.3-10        nlme_3.1-121