The restaurant business is notoriously difficult: according to a study by Ohio State University, 60% of restaurants fail in the first year and 80% within the first five years. It is therefore of great interest to prospective (and current) business owners to better understand which attributes drive success in this complex industry. To investigate this, we will analyse the YELP Dataset Challenge dataset.
This analysis is formed of two parts.
Statistical testing - can we identify any attributes which seem to result in better performance? Are they statistically significant?
Prediction - can we create a prediction algorithm that will allow us to predict how successful a restaurant is, or would be, based on its attributes?
For the purpose of this piece of work, the success or performance of a restaurant is considered to be the average number of stars the business has received from YELP reviewers.
The YELP datasets were the only data sources used for this project, as there were no other readily available datasets that would enhance the work being done. The YELP dataset contains data for 10 cities across North America and Europe; however, for this study it was decided to restrict the dataset to the city of Edinburgh. This was done for several reasons:
As with any data project, the first step was to explore the data to understand what features were available and what processing would be required to perform the analysis.
It was identified that the majority of the analysis would be performed on the “business” dataset, as this holds the information about the characteristics of restaurants. However, the first piece of exploration performed was to investigate time dependence in reviews, which required the “reviews” dataset. The graph below shows that at the start of the dataset, reviews were on average much more positive. Over time the scores given settled, and by 2011 they had become stable; further analysis was therefore restricted to the data after 2011.
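The time-dependence check described above can be sketched in a few lines. This is only an illustration: the review records are made up, the function name `mean_stars_by_year` is hypothetical, and "after 2011" is assumed here to mean years strictly later than 2011.

```python
from collections import defaultdict
from datetime import date

# Hypothetical review records: (review date, stars). The real "reviews"
# dataset has more fields; only these two matter for this check.
reviews = [
    (date(2008, 5, 1), 5), (date(2008, 9, 3), 5),
    (date(2010, 2, 14), 4), (date(2010, 7, 30), 5),
    (date(2012, 3, 8), 4), (date(2012, 11, 2), 3),
    (date(2013, 6, 21), 4), (date(2013, 8, 9), 4),
]

def mean_stars_by_year(records):
    """Return {year: mean star rating} for an iterable of (date, stars)."""
    totals = defaultdict(lambda: [0, 0])  # year -> [sum of stars, count]
    for day, stars in records:
        totals[day.year][0] += stars
        totals[day.year][1] += 1
    return {year: s / n for year, (s, n) in sorted(totals.items())}

yearly = mean_stars_by_year(reviews)

# Assumption: "after 2011" means years strictly later than 2011.
CUTOFF_YEAR = 2011
recent = [(d, s) for d, s in reviews if d.year > CUTOFF_YEAR]
```

Plotting `yearly` would give the kind of graph referred to above, and `recent` is the subset carried forward.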
When exploring the “business” dataset it was discovered that its structure did not lend itself to analysis. Therefore a number of procedures were required:
The data was then refined to the records of interest:
Then data enrichment was performed - creating new fields that might be of interest:
Finally preparation and testing of prediction algorithms was performed:
This section presents the results of the analysis in each of the target areas. A useful figure to keep in mind during this section is the mean star rating across all data: 3.78.
Seasonality: Whilst there was a range (3.797:3.906) in the average ratings in different months, no single month had an average rating that was significantly different from the others (min p-value 0.175).
Price Range: Whilst there was no statistically significant correlation between price range and rating (correlation 0.07, p-value 0.068), inspection of the exploratory plot suggested that there may in fact be a relationship. The average rating at each price range was therefore calculated and can be seen in the table below:
| Price Range | Average Star Rating | T-Test P-Value (95% confidence level) |
|---|---|---|
| 1 | 3.766 | 0.6778 |
| 2 | 3.777 | 0.661 |
| 3 | 3.74 | 0.4722 |
| 4 | 4.464 | 6.795e-07 |
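
The p-values in a table like this can come from a t-test per price band. The sketch below assumes a one-sample test of each band's ratings against the overall mean (the text does not state which variant was used), uses made-up ratings, and the helper name `one_sample_t` is hypothetical. It computes only the t statistic and degrees of freedom using the standard library.

```python
import math
import statistics

def one_sample_t(sample, population_mean):
    """Return (t statistic, degrees of freedom) for H0: mean == population_mean."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    sample_mean = statistics.fmean(sample)
    # Standard error of the mean, using the sample standard deviation.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error, n - 1

# Hypothetical star ratings for restaurants in one price band.
band_ratings = [4.5, 4.0, 5.0, 4.5, 4.5, 4.0, 5.0, 4.0]
OVERALL_MEAN = 3.78  # mean star rating across all data, from the text

t_stat, dof = one_sample_t(band_ratings, OVERALL_MEAN)
```

The two-sided p-value then follows from the t distribution with `dof` degrees of freedom; if SciPy is available, `2 * scipy.stats.t.sf(abs(t_stat), dof)` is one way to obtain it.
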
Distance from centre: There were no statistically significant correlations between the review score and either the Euclidean distance from the centre of the city or the latitude. There was, however, a small but significant correlation (0.089, p-value 0.02) between longitude and review score.
Chains: A significant (p-value 0.02) difference in average review score (CI -0.41:-0.04) was found between chain restaurants and non-chain restaurants.
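As an illustration of the location checks, the following sketch computes a Euclidean distance from a centre point and the Pearson correlation with star rating. The centre coordinates and restaurant records are hypothetical, and the distance is taken in raw degrees, which ignores the fact that a degree of longitude is shorter than a degree of latitude at Edinburgh's latitude.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical centre point and restaurants: (latitude, longitude, stars).
CENTRE = (55.953, -3.189)
restaurants = [
    (55.950, -3.190, 4.0),
    (55.960, -3.200, 3.5),
    (55.940, -3.170, 4.5),
    (55.955, -3.210, 3.0),
    (55.948, -3.180, 4.0),
]

# Euclidean distance in raw degrees (an assumption; no projection applied).
distances = [math.hypot(lat - CENTRE[0], lon - CENTRE[1])
             for lat, lon, _ in restaurants]
stars = [s for _, _, s in restaurants]
longitudes = [lon for _, lon, _ in restaurants]

r_distance = pearson(distances, stars)
r_longitude = pearson(longitudes, stars)
```

A coefficient near zero, as reported for distance and latitude, indicates little linear association; the significance of each coefficient would still need a separate test.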
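A comparison of this kind can be sketched as follows. Two things here are assumptions rather than the method used in this analysis: a business is flagged as a chain when its name appears more than once, and the significance check is a permutation test, swapped in for the t-test so that the example needs only the standard library. All data is made up.

```python
import random
from collections import Counter
from statistics import fmean

# Hypothetical businesses: (name, average stars).
businesses = [
    ("Pizza Place", 3.0), ("Pizza Place", 3.5), ("Burger Co", 3.0),
    ("Burger Co", 3.5), ("Burger Co", 3.0), ("The Wee Bistro", 4.5),
    ("Harbour Kitchen", 4.0), ("Spice Route", 4.5), ("Old Town Cafe", 4.0),
    ("Leith Larder", 3.5),
]

# Assumption: a name that appears more than once marks a chain.
name_counts = Counter(name for name, _ in businesses)
chain = [stars for name, stars in businesses if name_counts[name] > 1]
independent = [stars for name, stars in businesses if name_counts[name] == 1]

observed_diff = fmean(chain) - fmean(independent)

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)

p_value = permutation_p_value(chain, independent)
```

A negative `observed_diff` corresponds to chains being rated lower, matching the sign of the confidence interval reported above.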
Clusters of restaurants: There were no statistically significant correlations between the number of restaurants in a cluster and review score.
Categories of restaurants: The table below shows the categories with the top 3 and bottom 3 average ratings among those that were statistically significantly different (p-value < 0.05) from the mean rating (3.78):
| categories | N | AverageStars | pvalue |
|---|---|---|---|
| JuiceBars | 6 | 4.333 | 0.002808 |
| Smoothies | 6 | 4.333 | 0.002808 |
| African | 5 | 4.3 | 0.01219 |
| American | 17 | 3.447 | 0.0075 |
| Mex | 3 | 3.4 | 0.02522 |
| Mexican | 9 | 3.389 | 0.03865 |
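
The N and AverageStars columns can be produced by grouping businesses by category label. The sketch below assumes each business carries a list of labels, which would explain why JuiceBars and Smoothies share identical figures; the data, the helper name `category_summary` and the `min_n` threshold are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical businesses: (average stars, list of category labels).
businesses = [
    (4.5, ["JuiceBars", "Smoothies"]),
    (4.0, ["JuiceBars", "Smoothies", "Cafes"]),
    (3.5, ["American", "Burgers"]),
    (3.0, ["American"]),
    (3.5, ["Mexican"]),
    (4.0, ["Cafes"]),
]

def category_summary(records, min_n=2):
    """Return {category: (N, mean stars)} for categories with >= min_n businesses."""
    by_category = defaultdict(list)
    for stars, categories in records:
        # dict.fromkeys removes duplicate labels while keeping their order.
        for category in dict.fromkeys(categories):
            by_category[category].append(stars)
    return {
        category: (len(values), fmean(values))
        for category, values in by_category.items()
        if len(values) >= min_n
    }

summary = category_summary(businesses)
# Highest average rating first.
ranked = sorted(summary.items(), key=lambda item: item[1][1], reverse=True)
```

Each category's ratings could then be tested against the overall mean to obtain the pvalue column.
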
Prediction algorithm: The table below shows the root mean squared error (RMSE) for each of the individual prediction algorithms: Decision Tree [DT], Random Forest [RF], Linear Model [LM], Generalised Linear Model [GLM] and Support Vector Machine [SVM]. The table also includes the RMSE of guessing the mean score every time [M] and the RMSE of the final combined prediction model [C].
| DT | RF | LM | GLM | SVM | M | C |
|---|---|---|---|---|---|---|
| 0.624 | 0.611 | 0.64 | 0.64 | 0.618 | 0.607 | 0.577 |
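
For reference, the RMSE comparison can be sketched as below. The ratings and model predictions are made up, the model names are placeholders, and the combination rule (an unweighted average of the individual predictions) is an assumption, since the text does not say how the combined model was built.

```python
import math
from statistics import fmean

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return math.sqrt(fmean((a - p) ** 2 for a, p in zip(actual, predicted)))

# Hypothetical training ratings, held-out ratings and model predictions.
train_stars = [4.0, 3.5, 4.0, 3.0, 4.5]
test_stars = [4.0, 3.0, 4.5, 3.5]
model_predictions = {
    "model_a": [3.8, 3.4, 4.0, 3.9],
    "model_b": [4.2, 3.2, 4.4, 3.3],
}

# Baseline [M]: always guess the mean rating seen in training.
baseline = [fmean(train_stars)] * len(test_stars)

# Assumed combination rule [C]: unweighted average of the model predictions.
combined = [fmean(preds) for preds in zip(*model_predictions.values())]

scores = {name: rmse(test_stars, preds)
          for name, preds in model_predictions.items()}
scores["M"] = rmse(test_stars, baseline)
scores["C"] = rmse(test_stars, combined)
```

Comparing every entry in `scores` against `scores["M"]` shows whether a model adds anything beyond guessing the mean.
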
Seasonality: Since there is no significant relationship between month and review score, we can infer that the time of year at which reviewers visit a restaurant does not have a significant impact on the score. From this we can further infer that the time of year we open a restaurant will not have an impact on our initial reviews.
Price Range: From the table in results we are able to see that there is a significant improvement in average star rating for restaurants in the highest price band (4), suggesting that we might want to keep our establishment on the pricier side.
Distance from centre: The Euclidean distance from the city centre showed no significant correlation with average star rating, and neither did latitude; longitude showed only a small correlation (0.089). From these measures, together with inspection of the map of restaurants with their scores, we can see that in Edinburgh there are no locations that result in substantially better or worse review scores.
Chains: As we might expect, chains are rated significantly lower than non-chains. However, it would probably be useful to do further work in this area, as there may well be a useful distinction to draw between large multi-city or international chains and restaurants with a couple of locations. It is also possible that the chains are confounding other relationships - e.g. a larger proportion than average of Italian restaurants might be chains, and it could be this that is causing the lower than average rating.
Clusters of restaurants: The lack of significant correlation between cluster size and rating indicates that a restaurateur should not worry about the number of restaurants close to a prospective site when considering the impact on the review scores received by the business.
Categories of restaurants: The analysis identified 22 categories of restaurant for which the average star rating was significantly different from the mean (3.785), of which the table in the results section shows the top 3 and bottom 3. Although that is too many to discuss each category individually, a couple were a surprise: that juice & smoothie bars have the highest (statistically significant) rating was unforeseen, as was the strong performance of African restaurants. Finally, the poor performance of the anecdotally popular Italian restaurants & Gastropubs was also a surprise. For types of restaurant so prevalent to perform below average was not an expected result.
Prediction algorithm: Although the combined prediction algorithm, with an RMSE of 0.577, did outperform the benchmark of selecting the average every time (RMSE 0.607), it did so by only a slight margin. That this is true across a variety of models, from which an ensemble was created, suggests that predicting the performance of restaurants with any real accuracy based on the YELP dataset is not possible. The expected reason for this is that the number of factors that contribute to a restaurant's success far exceeds those captured by the dataset. Additionally, a very large number of features in the dataset had so many missing values as to render them inappropriate for analysis, reducing the characteristics available for learning.