Introduction

The restaurant business is notoriously difficult: according to a study by Ohio State University, 60% of restaurants fail in the first year and 80% within the first five years. It is therefore of great interest for prospective (and current) business owners to better understand what attributes drive success in this complex industry. To investigate this we will analyse the YELP Dataset Challenge dataset.

This analysis is formed of two parts.

  1. Statistical testing - can we identify any attributes which seem to result in better performance? Are they statistically significant?

  2. Prediction - can we create a prediction algorithm that will allow us to predict how successful a restaurant is/would be based on its attributes?

For the purpose of this piece of work, the success or performance of a restaurant is considered to be the average number of stars the business has received from YELP reviewers.

Methods and Data

The YELP datasets were the only data sources used for this project, as no other readily available datasets would have enhanced the work being done. The YELP dataset contains data for 10 cities across North America and Europe; however, for this study it was decided to restrict the dataset to the city of Edinburgh. This was done for several reasons:

Data Exploration & Enrichment

As with any data project, the first step was to explore the data to understand what features were available and what processing would be required to perform analysis.

It was identified that the majority of the analysis would be performed on the “business” dataset, as this holds the information about the characteristics of restaurants. However, the first piece of exploration investigated time dependence in reviews, which required the “reviews” dataset. The graph below shows that at the start of the dataset, reviews were on average much more positive. Over time the scores given settled, and by 2011 they had become stable; further analysis was therefore restricted to the data after 2011.

When exploring the “business” dataset it was discovered that its structure did not lend itself to analysis. Therefore a number of procedures were required:

  1. Partitioning into Training & Test data sets - since a prediction algorithm was to be created, the data needed to be partitioned;
  2. Flattening - the attribute columns had all been compressed into a single column, so these needed to be extracted as their own fields;
  3. Flatten Categories - the category field is a list; to allow easier analysis it was converted to character and had whitespace removed;
  4. Check which fields have too many missing values to be useful for analysis - 49 of the columns were more than 40% NAs, which was especially common in the “Good For” columns. This lack of data integrity, combined with these tags being added by the business (and therefore not necessarily matching reality), meant that the “Good For” columns were excluded from the analysis;
  5. Pivot out the categories field into 135 different boolean fields for use in machine learning algorithms.
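
Steps 3 to 5 above can be illustrated with a minimal sketch. It is not the code used for this project: the function names, the `cat_` prefix and the toy records are assumptions made for illustration, and only the 40% missing-value threshold comes from the text.

```python
def missing_fraction(rows, column):
    """Share of rows where `column` is absent or None."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def drop_sparse_columns(rows, threshold=0.4):
    """Keep only columns whose missing share does not exceed `threshold`."""
    columns = sorted({c for r in rows for c in r})
    keep = [c for c in columns if missing_fraction(rows, c) <= threshold]
    return [{c: r.get(c) for c in keep} for r in rows]


def pivot_categories(rows, field="categories"):
    """Replace the list-valued field with one boolean column per category."""
    # Whitespace is removed from labels, as in step 3.
    labels = sorted({c.replace(" ", "") for r in rows for c in r[field]})
    out = []
    for r in rows:
        present = {c.replace(" ", "") for c in r[field]}
        flat = {k: v for k, v in r.items() if k != field}
        for label in labels:
            flat["cat_" + label] = label in present
        out.append(flat)
    return out


# Hypothetical records, not taken from the YELP data.
businesses = [
    {"name": "A", "categories": ["Juice Bars", "Restaurants"],
     "good_for_lunch": None, "price_range": 1},
    {"name": "B", "categories": ["Mexican", "Restaurants"],
     "good_for_lunch": None, "price_range": 2},
    {"name": "C", "categories": ["Restaurants"],
     "good_for_lunch": True, "price_range": None},
]

dense = drop_sparse_columns(businesses)  # drops the mostly-missing column
flat = pivot_categories(dense)           # one boolean field per category
```

In this toy data the `good_for_lunch` column is missing in two of three records, so it is dropped, while `price_range` (one of three missing) is kept.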

The data was then refined to the records of interest:

  1. City Tagging - In order to subset the data to inspect Edinburgh, each record needed to be tagged with the city it belonged to. To do this, k-means clustering was run on the latitudes & longitudes of restaurants, using the latitudes & longitudes of the city centres as the initial cluster centres;
  2. Subset to only businesses that have listed “Restaurant” as a category;
  3. Subset to Edinburgh restaurants only;
  4. Visualise the restaurants and their scores on a map of Edinburgh (see below).
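
The city tagging step can be sketched as k-means (Lloyd's algorithm) started from known city centres rather than random points. This is an illustrative sketch only: the coordinates below are approximate, the second city is hypothetical, and treating latitude and longitude as plain Euclidean coordinates is a simplification.

```python
import math


def kmeans(points, centres, iterations=20):
    """Lloyd's algorithm started from the supplied initial centres."""
    centres = [tuple(c) for c in centres]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for k, old in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centres.append(old)  # an empty cluster keeps its centre
        if new_centres == centres:
            break  # converged
        centres = new_centres
    return labels, centres


# (latitude, longitude) pairs; illustrative values only.
city_centres = [(55.95, -3.19), (33.45, -112.07)]  # Edinburgh, another city
restaurants = [(55.94, -3.20), (55.96, -3.18), (33.50, -112.00), (33.40, -112.10)]

labels, centres = kmeans(restaurants, city_centres)
edinburgh_only = [p for p, lab in zip(restaurants, labels) if lab == 0]
```

Because the initial centres are fixed, the cluster index identifies the city directly (index 0 is Edinburgh here), which is what makes the subsequent subsetting straightforward.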

Then data enrichment was performed - creating new fields that might be of interest:

  1. Distance from Cluster centre - the (Euclidean) distance from the final cluster centre was calculated for each business;
  2. Tag Chains - enrich the dataset by adding a field that tags a restaurant as a chain (if there are more than 2 businesses with the same name they are considered part of a chain);
  3. Analyse the number of clusters of restaurants - using a scree plot;
  4. Tag Size of cluster - enrich the dataset by adding a field that tags a restaurant with the number of restaurants belonging to the same cluster.
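
As a rough illustration of the first two enrichment steps, the sketch below tags chains by counting names and adds a Euclidean distance field. The field names (`name`, `lat`, `lon`, `is_chain`, `dist_to_centre`) and the records are hypothetical; the only rule taken from the text is that more than 2 businesses sharing a name count as a chain.

```python
import math
from collections import Counter


def tag_chains(restaurants, min_locations=3):
    """Flag a restaurant as a chain if its name occurs at least `min_locations` times."""
    counts = Counter(r["name"] for r in restaurants)
    return [{**r, "is_chain": counts[r["name"]] >= min_locations} for r in restaurants]


def add_distance(restaurants, centre):
    """Add the Euclidean distance from each restaurant to `centre`."""
    return [
        {**r, "dist_to_centre": math.dist((r["lat"], r["lon"]), centre)}
        for r in restaurants
    ]


# Hypothetical records on a toy coordinate grid.
records = [
    {"name": "Pizza Co", "lat": 3.0, "lon": 4.0},
    {"name": "Pizza Co", "lat": 0.0, "lon": 1.0},
    {"name": "Pizza Co", "lat": 1.0, "lon": 0.0},
    {"name": "Cafe X", "lat": 0.0, "lon": 0.0},
    {"name": "Cafe X", "lat": 6.0, "lon": 8.0},
]

enriched = add_distance(tag_chains(records), centre=(0.0, 0.0))
```

Whether names should be counted across the whole dataset or within one city is not specified in the text, so the sketch simply counts within whatever list it is given.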

Finally, preparation and testing of prediction algorithms was performed:

  1. Split the training data into training & validation sets for model selection & initial testing;
  2. Run initial testing of different algorithms, using in-sample and out-of-sample RMSE to evaluate performance;
  3. Train a Decision Tree, Random Forest, Linear Model, Generalised Linear Model and Support Vector Machine;
  4. Train an ensemble model based on the combination of all of the models;
  5. Test the model on the Test dataset.
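
The evaluation logic in the steps above can be sketched as follows. How the individual models were combined is not stated here, so a simple average of predictions is assumed; the function names and numbers are illustrative and do not reproduce the project's models or results.

```python
import math
from statistics import fmean


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(fmean((a - p) ** 2 for a, p in zip(actual, predicted)))


def mean_baseline(train_targets, n):
    """Predict the training mean for every one of n test cases."""
    return [fmean(train_targets)] * n


def average_ensemble(*model_predictions):
    """Combine models by averaging their predictions case by case (an assumption)."""
    return [fmean(preds) for preds in zip(*model_predictions)]


# Hypothetical star ratings and predictions from two hypothetical models.
train_stars = [3.5, 4.0, 4.0, 3.5]
test_stars = [4.0, 3.0, 5.0, 3.5]
model_a = [3.5, 3.5, 4.5, 3.5]
model_b = [4.5, 3.0, 4.0, 4.0]

combined = average_ensemble(model_a, model_b)
baseline = mean_baseline(train_stars, len(test_stars))

scores = {
    "A": rmse(test_stars, model_a),
    "B": rmse(test_stars, model_b),
    "C": rmse(test_stars, combined),
    "M": rmse(test_stars, baseline),
}
```

Comparing every model against the mean baseline on the same held-out data is what gives the RMSE figures a meaningful reference point.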

Results

This section presents the results of the analysis in each of the target areas. A useful figure to keep in mind during this section is the mean star rating across all data, 3.78.

Seasonality: Whilst the average rating varied between months (range 3.797 to 3.906), no single month had an average rating that was statistically significantly different (minimum p-value 0.175).

Price Range: Whilst the correlation between price range and rating was not statistically significant (0.07, p-value 0.068), inspection of the exploratory plot suggested that there may in fact be a relationship. The average rating at each price range was therefore calculated and can be seen below:

Price Range    Average Star Rating    T-Test P-Value (95% confidence)
1              3.766                  0.6778
2              3.777                  0.661
3              3.74                   0.4722
4              4.464                  6.795e-07

Distance from centre: There were no statistically significant correlations between review score and either the Euclidean distance from the centre of the city or the latitude. There was, however, a small (0.089) but significant (p-value 0.02) correlation between longitude and review score.
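
A correlation check of this kind can be sketched as below. The significance test behind the reported p-values is not stated here, so a permutation test is swapped in as a stand-in that needs only the standard library; the function names and data are hypothetical.

```python
import math
import random
from statistics import fmean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Two-sided p-value: how often shuffled data correlates at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Hypothetical longitudes and star ratings.
longitudes = [-3.25, -3.22, -3.20, -3.19, -3.17, -3.15]
stars = [3.5, 4.0, 3.5, 4.0, 4.5, 4.0]

r = pearson_r(longitudes, stars)
p = permutation_p_value(longitudes, stars)
```

A small coefficient can still be statistically significant when the sample is large, which is why the size of the correlation and its p-value are reported separately.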

Chains: A significant (p-value 0.02) difference in average review score (CI -0.41:-0.04) was found between chain restaurants and non-chain restaurants.
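
A two-group comparison like this one can be sketched with Welch's unequal-variance t-test. Which variant of the t-test was used is not stated here, so Welch's form is an assumption, and the ratings below are hypothetical. The sketch stops at the test statistic and degrees of freedom, since turning them into a p-value needs the t distribution, which the Python standard library does not provide.

```python
import math
from statistics import fmean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (fmean(sample_a) - fmean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical average star ratings.
chain_stars = [3.0, 3.5, 3.0, 3.5]
independent_stars = [4.0, 4.5, 4.0, 3.5, 4.0]

t_stat, dof = welch_t(chain_stars, independent_stars)
```

A negative statistic here means the first group (chains) has the lower mean, matching the sign of the confidence interval quoted above.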

Clusters of restaurants: There were no statistically significant correlations between the number of restaurants in a cluster and review score.

Categories of restaurants: The table below shows the categories with the top 3 and bottom 3 average ratings among those statistically significantly different (p-value < 0.05) from the mean rating (3.78):

Category     N     Average Stars    P-Value
JuiceBars    6     4.333            0.002808
Smoothies    6     4.333            0.002808
African      5     4.3              0.01219
American     17    3.447            0.0075
Mex          3     3.4              0.02522
Mexican      9     3.389            0.03865

Prediction algorithm: For each of the individual prediction algorithms (Decision Tree [DT], Random Forest [RF], Linear Model [LM], Generalised Linear Model [GLM] and Support Vector Machine [SVM]) the table below shows the root mean squared error (RMSE). The table also includes the RMSE of guessing the mean score every time [M] and the RMSE of the final combined prediction model [C].

DT       RF       LM      GLM     SVM      M        C
0.624    0.611    0.64    0.64    0.618    0.607    0.577

Discussion

Seasonality: Since there is no statistically significant relationship between month & review score, we can infer that when reviewers visit a restaurant does not have a significant impact on the score. From this we can further infer that the time of year we open a restaurant will not have an impact on our initial reviews.

Price Range: From the table in results we are able to see that there is a significant improvement in average star rating for restaurants in the highest price band (4), suggesting that we might want to keep our establishment on the pricier side.

Distance from centre: The Euclidean distance from the cluster centre showed no significant correlation with average star rating, and neither did latitude; longitude showed only a small correlation (0.089). Taking these measures together with inspection of the map of restaurants and their scores, we can see that in Edinburgh there are no locations that result in substantially better or worse review scores.

Chains: As we might expect, chains are rated significantly lower than non-chains. However, it would probably be useful to do further work in this area, as there may well be a useful distinction to draw between large multicity/international chains and restaurants with a couple of locations. It is also possible that the chains are confounding other relationships - e.g. a larger proportion than average of Italian restaurants might be chains and it could be this that is causing the lower than average rating.

Clusters of restaurants: The lack of significant correlation between cluster size and rating indicates that a restaurateur should not worry about the number of restaurants close to a prospective site when considering the impact on the review scores received by the business.

Categories of restaurants: There were 22 categories of restaurant for which the average star rating was significantly different from the mean (3.785); the table in the results section shows the top 3 and bottom 3 of these. Although that is too many to discuss each category individually, a couple were a surprise: that juice & smoothie bars have the highest (statistically significant) rating was unforeseen, as was the strong performance of African restaurants. Finally, the poor performance of the anecdotally popular Italian restaurants & Gastropubs was also a surprise. For types of restaurant so prevalent to perform below average was not an expected result.

Prediction algorithm: Although the combined prediction algorithm, with an RMSE of 0.577, did outperform the test benchmark of selecting the average every time (RMSE 0.607), it did so by only a very slight margin (0.030, or roughly 5% of the benchmark RMSE). That this is true given a variety of models, from which an ensemble was created, suggests that predicting the performance of restaurants with any real accuracy based on the YELP dataset is not possible. The expected reason for this is that the sheer number of factors that contribute to a restaurant's success far exceeds those captured by the dataset. Additionally, a very large number of features of the dataset had so many missing values as to render them inappropriate for analysis, reducing the characteristics available for learning.