Yelp Data and Analysis

Yelp is an online company (2004- present) that publishes user reviews of various business - including restaurants, bars, food joints, shopping locations and various other business. To date Yelp has published about 127 million reviews of local business across 32 countries.

The data that we have used here is part of the Yelp Dataset Challenge which Yelp publishes in their website. This year the data set includes details about

144K business
4.1M user reviews
Information of 947K tips
1.1M business attributes (cuisine, hours of operation etc.)
Data is from cities across US, UK, Canada, Germany and China

Our aim was to pick restaurants in 5 cities and identifying which cuisine is most praised in that city. We do this using the R programming and our learning from Introduction to Statistics.

Understanding the Data formats and structures

Before we load the Yelp data set we must look at the data structures from the Yelp webpage. The Entity Relationship diagram of the data set is given below:

Loading and Enriching the Business Data

First, we load the yelp business data to try and understand more details about the businesses included. We then filter this data to pick only restaurants that have more than 100 reviews.

To make this interesting we pick non-local cuisines for this analysis.

The cuisines that we picked for our analysis are Italian,Mexican,Japanese,Chinese,Thai.

Yelp data set does not have country of the business - only has lat/long codes so use the library rworldxtra to identify the country. If country cant be found, filter out that data.

coords2country = function(points) #function to identify countries
{  
  countriesSP = getMap(resolution='high')
  pointsSP = SpatialPoints(points, proj4string=CRS(proj4string(countriesSP)))  
  indices = over(pointsSP, countriesSP)
  indices$ADMIN  # gives country names
}
rest_for_analysis$country = coords2country(data.frame(rest_for_analysis$longitude,rest_for_analysis$latitude))
# remove business where we cant find country names
rest_for_analysis = rest_for_analysis %>% filter(!is.na(country))

From this data, we pick the top 10 cities based on number of business in each city.

#pick top 10 cities for analysis. We can later change this to a diff variable
top_cities = rest_for_analysis %>% group_by(city) %>% summarise(bus_count = n() ) %>% top_n(n=total_cities_covered) %>% arrange(-bus_count)

rest_in_top_cities = rest_for_analysis[is.element(rest_for_analysis$city,top_cities$city),]
rm(rest_for_analysis)

What did we pick?

Based on these filtering we have 1705 restaurant across 10 cities. The cities and cuisines selected are plotted below

rest_map = get_map(location = c('America'),zoom=4)
mapPoints = ggmap(rest_map) +
  geom_point(data =rest_in_top_cities, aes(x = longitude, y = latitude),col = "red", cex = .6) +
  guides(fill=FALSE, alpha=FALSE, size=FALSE)+
  scale_shape_identity()

City vs Cuisines
	Chinese	Italian	Japanese	Mexican	Thai
Chandler	20	15	11	23	9
Charlotte	9	30	14	23	7
Henderson	6	21	17	21	8
Las Vegas	102	159	156	133	54
Mesa	9	14	12	23	8
Phoenix	33	90	28	125	22
Pittsburgh	8	22	13	20	14
Scottsdale	15	59	22	53	12
Tempe	12	15	11	28	9
Toronto	33	50	65	22	20

Filtering the Restaurant Review

After choosing the cities and the top 5 popular cuisines, we then read all the reviews for the restaurants that meet the requirement. For this analysis, we pick only the 5 star reviews. To cut down the file volume we remove the actual text in the review file.

csv_out_file = paste(out_path,"yelp_academic_dataset_review.csv",sep="/")

#only run this step one time to save execution times
if (!file.exists(csv_out_file)) {
  con_in = file(paste(data_path,"yelp_academic_dataset_review.json",sep="/"))
  con_out = file(csv_out_file, open = "wb")
  stream_in(con_in, handler = function(df){
    # Read without actual text data
    df = df[is.element(df$business_id,rest_in_top_cities$business_id),
            c("review_id","user_id","business_id","stars","date","type")]
    df = dplyr::filter(df, stars > 4)
    stream_out(df,con_out, pagesize = 100000)
  }, pagesize = 100000, verbose = FALSE)
  close(con_out)
}
review_5 = stream_in(file(paste(out_path,"yelp_academic_dataset_review.csv",sep="/")), pagesize = 200000, verbose = FALSE)

Analysis: Why are these restaurants famous?

From all data of the 5-star restaurants for selected cuisines, we also want to determine the probability of restaurants that have a bar or at least serve alcohol. This would also help us see if having a bar or serve alcohol is a determining factor for restaurants gaining 5 star.

alcohol_count_table = rest_in_top_cities %>% 
                        group_by( cuisine, serve_alcohol) %>% summarise(count = n())

The table below gives the summary

Cuisines and Alcohol
Cuisines	Serve Alcohol	Count
Chinese	FALSE	90
Chinese	TRUE	157
Italian	FALSE	52
Italian	TRUE	423
Japanese	FALSE	44
Japanese	TRUE	305
Mexican	FALSE	123
Mexican	TRUE	348
Thai	FALSE	52
Thai	TRUE	111

Probability that a restaurant serves alcohol is 0.788 and that of a restaurant having a bar is 0.528

This shows our subset of business have a large proportion of them selling alcohol. This could be one of the deciding factor for them being popular.

Visualising the Reviews of cuisines across Cities

The review data that we extracted is combined with the business information and then we try and plot this data to see if we can observe any visual trends.

barplot(cus_city_table_prop,ylab="Proportion Within City",xlab="City",las=2,
          ylim=c(0,0.65),beside=TRUE,legend=rownames(cus_city_table_prop),
          args.legend=c(x = "top",horiz=T,title="Cuisine"),cex.lab = 0.85,
          col = c("grey19","darkgoldenrod1","tan3","orangered","cyan"), 
          main = "Plotting proportion of 5 star reviews of 5 cuisines in 10 cities")

We can see visually observed several trends like Mexican cuisine being famous in Mesa, Japanese in Las Vegas etc.

Hypothesis Testing

To ensure the results are statistically significant we perform Chi-Squared Test.

Hypotheses for the Chi-square test of association:
- Null Hypothesis: \(H_0\) : There is no association between Cuisine and City in the review data
- Alternative Hypothesis: \(H_A\) : There is an association between Cuisine and City in the review data
Assumptions:
- That no more than 25% of the cells have expected counts below 5
Decision Rules:
- Reject \(H_0\) : if \(p\) value < 0.05 (\(\alpha\) significance level)
- else fail to reject \(H_0\)

Chi-Squared Test : Assumptions Check

We perform a visual check to see if there are any cells that have counts below 5

knitr::kable(cus_city_table,caption = "City vs Cuisines")

City vs Cuisines
	Chandler	Charlotte	Henderson	Las Vegas	Mesa	Phoenix	Pittsburgh	Scottsdale	Tempe	Toronto
Chinese	2193	574	449	14201	751	3520	810	1218	883	2934
Italian	1111	2304	2015	24088	1046	12864	1939	6636	2062	3951
Japanese	948	1299	1895	26922	731	3782	1108	2985	1403	5674
Mexican	1914	2119	2082	21349	1927	16019	1917	6828	2414	2166
Thai	715	605	774	8148	667	2043	1165	1142	895	2114

Since the assumptions on cell count are all met we can safely perform Chi Squared Test.

Chi-Squared Test

We perform a Chi-Squared test on the data set.

chi2 = chisq.test(cus_city_table)
chi2 #display chi-sq test results

## 
##  Pearson's Chi-squared test
## 
## data:  cus_city_table
## X-squared = 19081, df = 36, p-value < 2.2e-16

From this analysis, we can see that \(p\) < 0.001 which shows a significant association between cuisine and cities. We now plot the residual values to see if there are any significant findings. WE plot the residual plot to identify an glaring over-representation.

knitr::kable(chi2$stdres %>% round(0),caption = "City vs Cuisines")

City vs Cuisines
	Chandler	Charlotte	Henderson	Las Vegas	Mesa	Phoenix	Pittsburgh	Scottsdale	Tempe	Toronto
Chinese	47	-12	-18	23	3	-25	-4	-28	-4	17
Italian	-22	11	0	-21	-12	29	0	24	-2	-13
Japanese	-17	-7	8	61	-14	-65	-13	-22	-9	37
Mexican	0	5	2	-51	15	67	-1	26	7	-46
Thai	5	0	6	-2	11	-26	24	-14	9	18

This table shows that for almost all cities there is at least one over-representation of 5-star review. This shows that of the selected non-local cuisines there is a clear favorite in that city.

For an \(\alpha\) =0.05 and df = 36 we calculate the \(\chi^{2}\) using the below formula.

qchisq(p = .95,df = 36) # Find the Chi - Crital Value

## [1] 50.99846

We reject \(H_0\) when \(\chi^{2}\) > \(\chi^{2}_{\text{crit}}\). In our example the \(\chi^{2}\) is 20449 so we can safely reject Null Hypothesis.

Summary of Chi-Squared Test

From the Residuals plot, we see that each of the city is famous for a specific cuisine. The cuisine with the highest residual value is chosen as the most-famous of the selected cuisine type in that city.

res_col = colnames(chi2$stdres)
res_row = rownames(chi2$stdres)[apply(chi2$stdres,2,which.max)]
max_val = apply(chi2$stdres,2,max) %>% round(2)
max_res = data.frame(res_col,res_row,max_val,row.names = NULL)
colnames(max_res)=c("City","Fav Cuisine","Chi Sq Res")
knitr::kable(max_res,caption = "City vs Cuisines")

City vs Cuisines
City	Fav Cuisine	Chi Sq Res
Chandler	Chinese	46.71
Charlotte	Italian	10.70
Henderson	Japanese	8.16
Las Vegas	Japanese	60.83
Mesa	Mexican	15.42
Phoenix	Mexican	66.62
Pittsburgh	Thai	24.20
Scottsdale	Mexican	26.36
Tempe	Thai	9.35
Toronto	Japanese	36.91

A Chi-square test of association was used to test for a statistically significant association between cities and the most liked cuisine. The results of the test found a statistically significant association, \(\chi^{2}\) = 20449 , p < .001 thus suggesting that the people in these cities are more likely to prefer the above non-local cuisine.

Alternative Approach using Check-In data

Assuming we only have the check-in data we wanted to see if we can deduce the number of 5 star reviews that these non-local cuisines would get thus enabling us to gauge the popularity of the restaurant.

To do this we first load the Check-In data set and then slice the data to identify the number of check-ins and then we map it back to identify the business and the city and cuisine.

We can visually see a linear relationship when we do a log transformation of the data. But we need to do a statistical test to see if there are any significant relationship between the 2 variables.

Hypothesis Testing for Linear Regression Model

We perform the hypothesis testing for the overall linear regression model:

Hypotheses for overall Lienar Regression:
- Null Hypothesis: \(H_0\) : Data does not fit the linear regression model
- Alternative Hypothesis: \(H_A\) : Data fits the linear regression model
- We test the overall model using F Test
Assumptions:
- Independence (based on our understanding the data is independent)
- Linearity (based on scatter plot we can see that the variables are linear)
- Normality of Residuals
- Homoscedasticity
Decision Rules:
- Reject \(H_0\) : if \(p\) < 0.001
- Otherwise fail to reject \(H_0\)
Decision Rules:
- Reject \(H_0\) : if \(p\) value < 0.05 (\(\alpha\) significance level)
- else fail to reject \(H_0\)

Modelling the Linear Regression

The linear regression model is drawn up in R

five_star_checkin = lm(log(five_star_count) ~ log(total_checkin), data = check_in_5_star)
five_star_checkin %>% summary()

## 
## Call:
## lm(formula = log(five_star_count) ~ log(total_checkin), data = check_in_5_star)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46666 -0.26643 -0.01482  0.25374  2.04293 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.30749    0.07281   17.96   <2e-16 ***
## log(total_checkin)  0.51479    0.01134   45.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4141 on 1702 degrees of freedom
## Multiple R-squared:  0.5477, Adjusted R-squared:  0.5474 
## F-statistic:  2061 on 1 and 1702 DF,  p-value: < 2.2e-16

We can see that the \(p\) value for the F test is < 0.001 and the F(1,1702) = 1059. The model shows statistically significant results. However we next check for the validity of assumptions and perform test on model parameters.

Hypothesis Testing of Linear Regression Model Parameters

We then perform hypothesis testing of the model parameters

Hypotheses for overall Lienar Regression:
- Intercept
- Null Hypothesis: \(H_0\) : \(\alpha\) = 0
- Alternative Hypothesis: \(H_A\) : \(\alpha \neq 0\) Data does not fit the linear regression model
- Slope
- Null Hypothesis: \(H_0\) : \(\beta\) = 0
- Alternative Hypothesis: \(H_A\) : \(\beta \neq 0\) Data does not fit the linear regression model
Assumptions:
- Same as the overall regression model
Decision Rules:
- Reject \(H_0\) : if \(p\) < 0.001 and if 95% CI of the parameter does not capture \(H_0\) : \(\alpha/\beta = 0\)
- Otherwise fail to reject \(H_0\)
Decision Rules:
- Reject \(H_0\) : if \(p\) value < 0.05 (\(\alpha\) significance level)
- else fail to reject \(H_0\)

In our model we see that the intercept \(\alpha\) = 1.307 and \(\beta\) = 0.515 and hence they are both non-zero. The CI is identified by

five_star_checkin %>% confint()

##                        2.5 %    97.5 %
## (Intercept)        1.1646819 1.4502970
## log(total_checkin) 0.4925462 0.5370261

This also falls above our threshold and hence all our model parameters are meeting the \(H_A\) rejection criteria.

The next steps is to test for residuals, normality and test for influential cases

Testing Linear Regression Assumptions

Correlation Co-Efficient

We calculate the correlation co-efficient to understand the strength and relation of the 2 variables.

r= cor(log(check_in_5_star$five_star_count) , log(check_in_5_star$total_checkin), 
       use = "complete.obs") 
print(paste("The r value is : ",round(r,3)),quote = F)

## [1] The r value is :  0.74

suppressMessages(library(psychometric))

## Warning: package 'psychometric' was built under R version 3.3.3

## Warning: package 'multilevel' was built under R version 3.3.3

print(paste("The CI (95%) for this value of r is : ",
            round(CIr(r,n = nrow(check_in_5_star), level = 0.95 )[1],3), " - ",
            round(CIr(r,n = nrow(check_in_5_star), level = 0.95 )[2],3)),
            quote = F)

## [1] The CI (95%) for this value of r is :  0.718  -  0.761

Summary of Findings using the Check-In Data

From all the tests we observe

No curvature in the Residuals vs Fitted Plot,
Most of data following the line in the Q-Q plot,
No heteroscedasticity in the standardized residual plot and
lastly no evidence of influential cases (cook’s line is not visible in the graph).
value of \(r\) is positive and shows a strong positive relationship
\(\alpha\) = 1.307, \(p\) < 0.001 and
\(\beta\) = 0.515, \(p\) < 0.001
F(1,1702) = 1059

Based on all these findings we reject the \(H_0\).

All of these show that there is a statistically significant relationship between the log(# of check-ins) and log(# of five star reviews).Thus if we know the number of check-ins for a business we can calculate the number of five star reviews that the business will get based on the below formula.

Nbr of 5 star reviews = exp (1.307 + 0.515 * log (Nbr of Check-Ins))

Please note that \(\alpha\) and \(\beta\) can fall within their respective CI’s as well.

From this we can infer that the business with the most check-ins is more likely the most popular one in that particular city

Summary , Next Steps and Appendix

Summary : We performed 2 analsyis of the yelp data set and observed the following
- We used the number of 5 star reviews that each non-local cuisine type received in a sample of 10 cities to identify the most preferred cuisine for that city.
- We also found a linear relationship between the number of checkins for a particular city and the number of 5 star reviews that the business gets. This can be used to identify the most likeable cuisine.
Next Steps
- We have not performed regression on multiple variables so we cant choose the best combination
- We could perform a lot of text based analysis to see if the review text shows any meaningful insights
Useful Links

Identifying the Most Liked Non-Local Cuisine based on Yelp Reviews

Analysis of Restaurant Reviews from Yelp to understand most liked non-local cuisine in a sample of cities