Maximising Restaurant Success with YELP Data

G Royde
Saturday, November 21, 2015

Introduction

The restaurant business is notoriously difficult, with 60% of restaurants failing in the first year and 80% within five. It is therefore useful to understand which attributes drive success in this complex industry. For the analysis we use the YELP Dataset Challenge dataset.
The analysis is in two parts.

1. Statistical testing - can we identify any attributes which seem to result in better performance? Are they statistically significant?

2. Prediction - can we create a prediction algorithm that will allow us to predict how successful a restaurant is/would be based on its attributes?

Exploratory Analysis

[Figure: exploratory analysis plot]

Results - Statistical tests

For the statistical tests, a combination of t-tests and correlation t-tests was used to determine whether relationships/differences were statistically significant.

  • Seasonality: Variation through the year is small & not statistically significant.
  • Price Range: The highest price band has a rating 0.7 higher.
  • Distance from Centre: The location of a restaurant did not have a significant impact on review score.
  • Chains: A significant (p-value 0.02) difference in average review score (CI -0.41 to -0.04) was found between chain restaurants and non-chain restaurants.
  • Clusters: The number of co-located restaurants does not have an impact on review score.
  • Categories: Category drives a large amount of variation. The lowest statistically significant score was 3.389 for Mexican restaurants & the highest was 4.33 for Juice Bars.
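
As a rough illustration of the two kinds of test named above, the sketch below computes a two-sample t statistic and a correlation t statistic using only the Python standard library. It is not the code behind the results: the report does not say which t-test variant was used, so Welch's unequal-variance form is an assumption, and the function names and all ratings and distances are made up for the example.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def correlation_t(x, y):
    """Return (Pearson r, t statistic) for testing H0: no linear correlation."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    r = sxy / math.sqrt(sxx * syy)
    # Under H0 this statistic follows a t distribution with n - 2 degrees of freedom.
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return r, t


# Made-up star ratings, for illustration only (not taken from the dataset).
chain = [3.0, 3.5, 3.0, 4.0, 3.5, 3.0]
independent = [4.0, 3.5, 4.5, 4.0, 3.5, 4.0]
t_stat, df = welch_t(chain, independent)

# Made-up distances from the centre (km) paired with made-up ratings.
distance_km = [0.5, 1.0, 2.0, 3.5, 5.0, 8.0]
rating = [4.0, 3.5, 4.0, 3.0, 4.5, 3.5]
r, t_corr = correlation_t(distance_km, rating)

print(f"Welch t = {t_stat:.2f} on about {df:.1f} degrees of freedom")
print(f"r = {r:.2f}, correlation t = {t_corr:.2f}")
```

A p-value or confidence interval then comes from comparing each statistic with the t distribution for the stated degrees of freedom, which a statistics package would normally handle.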

Results - Prediction

The table shows the RMSE for each of the prediction algorithms (Decision Tree [DT], Random Forest [RF], Linear Model [LM], Generalised Linear Model [GLM] and Support Vector Machine [SVM]), for the benchmark of guessing the mean score every time [M], and for the final combined prediction model [C].


----------------------------------------
 DT    RF    LM   GLM   SVM    M     C  
----- ----- ---- ----- ----- ----- -----
0.624 0.611 0.64 0.64  0.618 0.607 0.577
----------------------------------------
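
For reference, RMSE and the mean benchmark [M] can be computed as in the minimal sketch below. The ratings are made up, and taking the mean from a training split is an assumption, since the report does not say which mean was used.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up review scores, for illustration only.
train_ratings = [3.5, 4.0, 3.0, 4.5, 3.5]
test_ratings = [4.0, 3.0, 3.5, 4.5]

# Benchmark: predict the training mean for every test restaurant.
baseline = sum(train_ratings) / len(train_ratings)
baseline_rmse = rmse(test_ratings, [baseline] * len(test_ratings))

print(f"baseline prediction = {baseline:.2f}, RMSE = {baseline_rmse:.3f}")
```

Any model whose RMSE is not below this benchmark is doing no better than ignoring the restaurant's attributes altogether.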

The combined prediction algorithm, with an RMSE of 0.58, outperforms the benchmark of predicting the average every time (RMSE 0.61) by only a small margin. That this holds even for an ensemble of models suggests that predicting review scores with any real accuracy is not possible from this data. The likely reason is that the number of factors contributing to a restaurant's success far exceeds those captured in the dataset.
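
The report does not describe how the individual models were combined. One common approach, shown here purely as an assumed illustration, is to average their predictions element-wise. The function `combine` and all the numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def combine(predictions):
    """Average several models' predictions element-wise (assumed scheme)."""
    return [sum(values) / len(values) for values in zip(*predictions)]


# Made-up true scores and model outputs, for illustration only.
actual = [4.0, 3.0, 3.5, 4.5]
model_a = [3.6, 3.4, 3.9, 4.0]
model_b = [4.3, 2.8, 3.2, 4.7]

combined = combine([model_a, model_b])

for name, preds in [("A", model_a), ("B", model_b), ("combined", combined)]:
    print(f"{name}: RMSE = {rmse(actual, preds):.3f}")
```

In this toy case the two models err in opposite directions, so their errors partly cancel when averaged. That is not guaranteed in general, which is consistent with the modest gain seen in the table.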