Can we predict if a restaurant is open or closed based on Yelp's reviews & reviewers?

LYE Keng Fook
22 Nov 2015

View the R code for this analysis here

Question of Interest

Yelp's data challenge provides data on the businesses registered on Yelp and on the reviews of those businesses written by users.

I examine the datasets for a correlation between whether a restaurant is open or closed (the response variable) and the restaurant's reviews and the users who wrote those reviews (the predictors).

With such a correlation, we can predict whether a restaurant is open or closed based on the review and reviewer data. Such predictions could provide insights to help restaurants stay in business. Yelp could also use the prediction model to improve its service to consumers.

Methodology

  1. Import the Yelp dataset into R.
  2. From the business file, select only the restaurants, then keep their reviews and the users who wrote those reviews.
  3. Explore and clean the data as necessary (omit or replace missing values).
  4. Compute summary statistics from the cleaned data.
  5. Partition the data into 2 sets: 1 for training, 1 for testing.
  6. Using the training set, fit a logistic regression model to learn the patterns/correlation between the summary statistics (predictors) and the restaurant's open/closed status (response).
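
Steps 3 to 6 could look roughly like the sketch below. It uses base R on synthetic data; the column names (`stars`, `reviewer_avg_stars`, `review_count`), the 70/30 split and the choice of predictors are assumptions for illustration, not the code behind this analysis.

```r
# Minimal sketch on synthetic data. All column names are hypothetical and are
# not taken from the Yelp files or from the original analysis code.
set.seed(42)
n_restaurants <- 400
n_reviews <- 8000

# Synthetic review-level data: one row per review.
reviews <- data.frame(
  business_id = sample(seq_len(n_restaurants), n_reviews, replace = TRUE),
  stars = sample(1:5, n_reviews, replace = TRUE),
  reviewer_avg_stars = round(runif(n_reviews, 1, 5), 1)
)
# Inject missing values so the cleaning step has something to do.
reviews$stars[sample(n_reviews, 50)] <- NA

# Step 3: clean the data (here: drop reviews with missing values).
reviews <- na.omit(reviews)

# Step 4: summary statistics per restaurant.
stats <- aggregate(cbind(stars, reviewer_avg_stars) ~ business_id,
                   data = reviews, FUN = mean)
names(stats) <- c("business_id", "mean_stars", "mean_reviewer_stars")
counts <- aggregate(stars ~ business_id, data = reviews, FUN = length)
names(counts) <- c("business_id", "review_count")
stats <- merge(stats, counts, by = "business_id")

# Synthetic response: a higher mean rating makes "open" more likely.
stats$open <- rbinom(nrow(stats), 1, plogis(-10 + 4 * stats$mean_stars)) == 1

# Step 5: partition into training and test sets (70/30 is an assumed split).
train_idx <- sample(nrow(stats), size = floor(0.7 * nrow(stats)))
train <- stats[train_idx, ]
test <- stats[-train_idx, ]

# Step 6: logistic regression fitted on the training set only.
model <- glm(open ~ mean_stars + mean_reviewer_stars + review_count,
             data = train, family = binomial)

# Predicted probability of being open for the held-out restaurants.
test$p_open <- predict(model, newdata = test, type = "response")
summary(test$p_open)
```

Fitting on `train` and predicting on `test` keeps the evaluation honest: the model never sees the restaurants it is scored on.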

Results

[Figure: plot of chunk unnamed-chunk-1]

The logistic regression model (from the previous slide) gives us a probability function which, when applied to the predictors, returns the probability that a restaurant is open.

The left plot shows a cutoff probability, above which we classify the restaurant as open.
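
One way to choose such a cutoff is to scan a grid of candidate values and keep the one with the highest accuracy. The sketch below assumes accuracy is the criterion (the slides do not say how "optimal" was defined) and uses synthetic stand-ins, `p_open` and `actual`, for the predicted probabilities and the true labels.

```r
# Sketch of cutoff selection. 'p_open' and 'actual' are synthetic stand-ins
# for the model's predicted probabilities and the true open/closed labels.
set.seed(1)
actual <- rbinom(1000, 1, 0.8) == 1
p_open <- ifelse(actual, rbeta(1000, 5, 2), rbeta(1000, 2, 3))

# Candidate cutoffs to evaluate.
cutoffs <- seq(0.05, 0.95, by = 0.01)

# Accuracy at each cutoff: predict "open" when p_open exceeds the cutoff.
accuracy <- sapply(cutoffs, function(cut) mean((p_open > cut) == actual))

# Keep the cutoff with the highest accuracy (assumed criterion).
best_cutoff <- cutoffs[which.max(accuracy)]
predicted <- p_open > best_cutoff

# Cross-tabulate predictions against the true labels.
table(Prediction = predicted, Reference = actual)
```

Other criteria, such as balancing the error rates of the two classes, would generally lead to a different cutoff.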

Using the optimal cutoff, we get the prediction results tabulated below (Reference is the actual status in the test dataset; TRUE means open):

          Reference
Prediction FALSE TRUE
     FALSE   664  213
     TRUE    631 5031
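
The headline figures can be recomputed directly from this table. The sketch below assumes TRUE means "open"; the baseline of always predicting "open" is included only as a point of comparison and is not part of the original analysis.

```r
# Confusion matrix as reported above (rows: prediction, columns: reference).
# matrix() fills column by column, so the first two values are Reference FALSE.
cm <- matrix(c(664, 631, 213, 5031), nrow = 2,
             dimnames = list(Prediction = c("FALSE", "TRUE"),
                             Reference = c("FALSE", "TRUE")))

total <- sum(cm)

# Share of all test restaurants classified correctly: (664 + 5031) / 6539.
accuracy <- sum(diag(cm)) / total

# Share of truly open restaurants that were predicted open.
open_recall <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])

# Share of truly closed restaurants that were predicted closed.
closed_recall <- cm["FALSE", "FALSE"] / sum(cm[, "FALSE"])

# Accuracy of always predicting "open", for comparison.
baseline <- sum(cm[, "TRUE"]) / total

round(c(accuracy = accuracy, open_recall = open_recall,
        closed_recall = closed_recall, baseline = baseline), 3)
```

This gives an accuracy of about 0.871, against about 0.802 for the always-open baseline. About 96% of the open restaurants and about 51% of the closed ones are classified correctly.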

Conclusion

The results show a correlation between the review and user data and the restaurant's open/closed status, upon which a prediction model was built with an accuracy of 87% on the test set.

The methodology is applicable to other businesses too. The prediction model is quick to run, with very little performance impact, so Yelp could generate such summary statistics for each business and provide insights to support business decisions.

Accuracy and objectivity of the review and user data are important in this investigation. Although reviews and users are likely subjective, I assume any bias is averaged out across a sufficiently large number of reviews and users. Under that assumption, a quantitative treatment of reviews (summary statistics such as the mean) gives an objective measure of the quality of the restaurant.
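
The averaging assumption can be illustrated with a small simulation. It is a sketch under a hypothetical model, not a result from the Yelp data: each review is taken to be the restaurant's true quality plus independent, zero-mean reviewer noise. A deviation shared by all reviewers would not shrink this way.

```r
# Hypothetical model: review = true quality + independent zero-mean noise.
set.seed(7)
true_quality <- 3.5

# Mean rating of one simulated restaurant with n_reviews reviews.
simulate_mean_rating <- function(n_reviews) {
  mean(true_quality + rnorm(n_reviews, mean = 0, sd = 1))
}

# Spread of the mean rating across 2000 simulated restaurants,
# for increasing numbers of reviews per restaurant.
sizes <- c(5, 50, 500)
spread <- sapply(sizes, function(n) sd(replicate(2000, simulate_mean_rating(n))))
names(spread) <- sizes
spread
```

The spread shrinks roughly in proportion to 1/sqrt(n), so the mean of a few hundred reviews sits much closer to the underlying quality than the mean of a handful.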