Data Science Specialization - Analysis of Yelp data

Massimo Zanetti
November 14th, 2015

Introduction

This is the Capstone Project of Coursera's Data Science Specialization, which requires participants to take part in the Yelp Dataset Challenge. Yelp is a company that publishes user-generated reviews of local businesses.

We analyzed whether the trend in reviews can explain the trend in the popularity of a certain type of cuisine in a city. Our hypothesis is that an increase in the share of reviews for a certain cuisine in a given year will lead entrepreneurs to react by increasing the number of restaurants for that cuisine in the following year.

We collected the data from the Coursera website. For reproducibility purposes, the actual processing code is available at the following GitHub page.

Methods

We studied the relationship using linear and polynomial regression. The predictor was the ratio between the share of reviews and the share of businesses (SRvsSB) per type of cuisine and per city/year. In particular, we studied the SRvsSB index relative to its average historical value in the previous years. The response variable was the ratio between the percentage of restaurants and its historical average.
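As an illustration of how such a predictor could be built, here is a minimal sketch in Python. It assumes that the share of reviews and the share of businesses are both computed within each city/year, and that "versus the historical average" means dividing by the mean over earlier years for the same city and cuisine. The toy counts and the function names (`sr_vs_sb`, `relative_to_history`) are hypothetical and are not taken from the project code.

```python
from collections import defaultdict

# Hypothetical toy counts: (city, year, cuisine) -> (n_reviews, n_businesses).
counts = {
    ("Phoenix", 2012, "Italian"): (100, 10),
    ("Phoenix", 2012, "Mexican"): (300, 30),
    ("Phoenix", 2013, "Italian"): (300, 10),
    ("Phoenix", 2013, "Mexican"): (300, 30),
}

def sr_vs_sb(counts):
    """Share of reviews divided by share of businesses, per city/year/cuisine."""
    totals = defaultdict(lambda: [0, 0])
    for (city, year, _), (reviews, businesses) in counts.items():
        totals[(city, year)][0] += reviews
        totals[(city, year)][1] += businesses
    ratios = {}
    for (city, year, cuisine), (reviews, businesses) in counts.items():
        total_reviews, total_businesses = totals[(city, year)]
        share_reviews = reviews / total_reviews
        share_businesses = businesses / total_businesses
        ratios[(city, year, cuisine)] = share_reviews / share_businesses
    return ratios

def relative_to_history(values):
    """Divide each value by the mean of earlier years (same city and cuisine).

    Entries with no earlier year are skipped, since no history exists for them.
    """
    index = {}
    for (city, year, cuisine), value in values.items():
        past = [v for (c, y, k), v in values.items()
                if c == city and k == cuisine and y < year]
        if past:
            index[(city, year, cuisine)] = value / (sum(past) / len(past))
    return index

ratios = sr_vs_sb(counts)
index = relative_to_history(ratios)
```

In this toy data, Italian restaurants hold a quarter of the businesses in both years but their share of reviews rises from a quarter to a half, so the index for 2013 is above 1. The same `relative_to_history` helper could be applied to the share of businesses to obtain the response variable.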
Data were separated into training and test sets in a 75/25 proportion. Model complexity was chosen using 10-fold cross-validation on the training set, with root mean squared error (RMSE) as the metric. The best fitting model was then trained on the entire dataset and the RMSE on the test set was used as a proxy for the generalization error.
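The selection procedure can be sketched as follows. This is a standard-library Python illustration on synthetic data, not the project's code. It assumes that model complexity means polynomial degree, that degree 0 (intercept only) plays the role of the null model, and that the 1-SE rule picks the simplest model whose cross-validated RMSE is within one standard error of the best one. For simplicity the chosen model is refit on the training portion only, so that the test RMSE stays out-of-sample. All function names are hypothetical.

```python
import math
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first), normal equations."""
    n = degree + 1
    # Augmented matrix [X'X | X'y].
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def rmse(coefs, xs, ys):
    preds = [sum(c * x ** k for k, c in enumerate(coefs)) for x in xs]
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys))

def cv_rmse(xs, ys, degree, k=10):
    """Mean and standard error of the held-out RMSE over k folds."""
    scores = []
    for fold in range(k):
        held_out = set(range(fold, len(xs), k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in held_out]
        valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in held_out]
        coefs = fit_poly(*zip(*train), degree)
        scores.append(rmse(coefs, *zip(*valid)))
    mean = sum(scores) / k
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (k - 1))
    return mean, sd / math.sqrt(k)

def one_se_choice(results):
    """results: {degree: (mean, se)}. Smallest degree within 1 SE of the best mean."""
    best = min(results, key=lambda d: results[d][0])
    threshold = results[best][0] + results[best][1]
    return min(d for d in results if results[d][0] <= threshold)

# Synthetic data for illustration only (not Yelp data).
random.seed(0)
xs = [random.uniform(0.5, 1.5) for _ in range(200)]
ys = [1.0 + random.gauss(0, 0.2) for _ in xs]

# 75/25 split (the points are already in random order).
cut = int(0.75 * len(xs))
x_train, y_train, x_test, y_test = xs[:cut], ys[:cut], xs[cut:], ys[cut:]

# 10-fold CV on the training set for degrees 0 (null model) to 3.
results = {d: cv_rmse(x_train, y_train, d) for d in range(4)}
chosen = one_se_choice(results)

# Refit the chosen model and estimate the generalization error on the test set.
test_rmse = rmse(fit_poly(x_train, y_train, chosen), x_test, y_test)
```

Under the 1-SE rule, a more complex model is preferred over the intercept-only model only when its cross-validated error is lower by more than one standard error.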

Results

This is the relationship between the variation in the share of reviews and the share of businesses in the following year. The lines show the different fitted models.

[Figure: variation in share of reviews versus share of businesses in the following year, with the fitted models]

Comments

It is not possible to explain the variation in the share of businesses per category of cuisine using the share of reviews ratio as predictor. The relationship shown by the regression models is slightly negative, meaning that a share of reviews vs. businesses higher than the historical average at time t is associated with a percentage of businesses slightly lower than the historical average at time t+1. However, such conclusions are not very robust because the null model was not rejected under the 1-SE rule.
There are some possible problems with this kind of analysis:

  • Yelpers are not a representative sample of restaurant customers
  • The quantity of reviews is not necessarily related to the number of customers
  • Particular types of cuisine may have smaller restaurants, producing a low reviews-per-business predictor