Identifying the Most Liked Non-Local Cuisine based on Yelp Reviews

Analysis of Restaurant Reviews from Yelp to understand most liked non-local cuisine in a sample of cities

Li Xie (s3308052), Rupesh Papneja (s3637387), Sarun Sankarankutty Menon (s3642783)

Last updated: 04 June, 2017

Yelp Data and Analysis

Yelp is an online company (2004- present) that publishes user reviews of various business - including restaurants, bars, food joints, shopping locations and various other business. To date Yelp has published about 127 million reviews of local business across 32 countries.

The data that we have used here is part of the Yelp Dataset Challenge which Yelp publishes in their website. This year the data set includes details about

Our aim was to pick restaurants in 5 cities and identifying which cuisine is most praised in that city. We do this using the R programming and our learning from Introduction to Statistics.

Understanding the Data formats and structures

Before we load the Yelp data set we must look at the data structures from the Yelp webpage. The Entity Relationship diagram of the data set is given below:

Loading and Enriching the Business Data

First, we load the yelp business data to try and understand more details about the businesses included. We then filter this data to pick only restaurants that have more than 100 reviews.

To make this interesting we pick non-local cuisines for this analysis.

The cuisines that we picked for our analysis are Italian,Mexican,Japanese,Chinese,Thai.

coords2country = function(points) #function to identify countries
{  
  countriesSP = getMap(resolution='high')
  pointsSP = SpatialPoints(points, proj4string=CRS(proj4string(countriesSP)))  
  indices = over(pointsSP, countriesSP)
  indices$ADMIN  # gives country names
}
rest_for_analysis$country = coords2country(data.frame(rest_for_analysis$longitude,rest_for_analysis$latitude))
# remove business where we cant find country names
rest_for_analysis = rest_for_analysis %>% filter(!is.na(country))

From this data, we pick the top 10 cities based on number of business in each city.

#pick top 10 cities for analysis. We can later change this to a diff variable
top_cities = rest_for_analysis %>% group_by(city) %>% summarise(bus_count = n() ) %>% top_n(n=total_cities_covered) %>% arrange(-bus_count)

rest_in_top_cities = rest_for_analysis[is.element(rest_for_analysis$city,top_cities$city),]
rm(rest_for_analysis)

What did we pick?

Based on these filtering we have 1705 restaurant across 10 cities. The cities and cuisines selected are plotted below

rest_map = get_map(location = c('America'),zoom=4)
mapPoints = ggmap(rest_map) +
  geom_point(data =rest_in_top_cities, aes(x = longitude, y = latitude),col = "red", cex = .6) +
  guides(fill=FALSE, alpha=FALSE, size=FALSE)+
  scale_shape_identity()
City vs Cuisines
Chinese Italian Japanese Mexican Thai
Chandler 20 15 11 23 9
Charlotte 9 30 14 23 7
Henderson 6 21 17 21 8
Las Vegas 102 159 156 133 54
Mesa 9 14 12 23 8
Phoenix 33 90 28 125 22
Pittsburgh 8 22 13 20 14
Scottsdale 15 59 22 53 12
Tempe 12 15 11 28 9
Toronto 33 50 65 22 20

Filtering the Restaurant Review

After choosing the cities and the top 5 popular cuisines, we then read all the reviews for the restaurants that meet the requirement. For this analysis, we pick only the 5 star reviews. To cut down the file volume we remove the actual text in the review file.

csv_out_file = paste(out_path,"yelp_academic_dataset_review.csv",sep="/")

#only run this step one time to save execution times
if (!file.exists(csv_out_file)) {
  con_in = file(paste(data_path,"yelp_academic_dataset_review.json",sep="/"))
  con_out = file(csv_out_file, open = "wb")
  stream_in(con_in, handler = function(df){
    # Read without actual text data
    df = df[is.element(df$business_id,rest_in_top_cities$business_id),
            c("review_id","user_id","business_id","stars","date","type")]
    df = dplyr::filter(df, stars > 4)
    stream_out(df,con_out, pagesize = 100000)
  }, pagesize = 100000, verbose = FALSE)
  close(con_out)
}
review_5 = stream_in(file(paste(out_path,"yelp_academic_dataset_review.csv",sep="/")), pagesize = 200000, verbose = FALSE)

Analysis: Why are these restaurants famous?

From all data of the 5-star restaurants for selected cuisines, we also want to determine the probability of restaurants that have a bar or at least serve alcohol. This would also help us see if having a bar or serve alcohol is a determining factor for restaurants gaining 5 star.

alcohol_count_table = rest_in_top_cities %>% 
                        group_by( cuisine, serve_alcohol) %>% summarise(count = n())

The table below gives the summary

Cuisines and Alcohol
Cuisines Serve Alcohol Count
Chinese FALSE 90
Chinese TRUE 157
Italian FALSE 52
Italian TRUE 423
Japanese FALSE 44
Japanese TRUE 305
Mexican FALSE 123
Mexican TRUE 348
Thai FALSE 52
Thai TRUE 111

Probability that a restaurant serves alcohol is 0.788 and that of a restaurant having a bar is 0.528

This shows our subset of business have a large proportion of them selling alcohol. This could be one of the deciding factor for them being popular.

Visualising the Reviews of cuisines across Cities

The review data that we extracted is combined with the business information and then we try and plot this data to see if we can observe any visual trends.

barplot(cus_city_table_prop,ylab="Proportion Within City",xlab="City",las=2,
          ylim=c(0,0.65),beside=TRUE,legend=rownames(cus_city_table_prop),
          args.legend=c(x = "top",horiz=T,title="Cuisine"),cex.lab = 0.85,
          col = c("grey19","darkgoldenrod1","tan3","orangered","cyan"), 
          main = "Plotting proportion of 5 star reviews of 5 cuisines in 10 cities")

We can see visually observed several trends like Mexican cuisine being famous in Mesa, Japanese in Las Vegas etc.

Hypothesis Testing

To ensure the results are statistically significant we perform Chi-Squared Test.

Chi-Squared Test : Assumptions Check

We perform a visual check to see if there are any cells that have counts below 5

knitr::kable(cus_city_table,caption = "City vs Cuisines") 
City vs Cuisines
Chandler Charlotte Henderson Las Vegas Mesa Phoenix Pittsburgh Scottsdale Tempe Toronto
Chinese 2193 574 449 14201 751 3520 810 1218 883 2934
Italian 1111 2304 2015 24088 1046 12864 1939 6636 2062 3951
Japanese 948 1299 1895 26922 731 3782 1108 2985 1403 5674
Mexican 1914 2119 2082 21349 1927 16019 1917 6828 2414 2166
Thai 715 605 774 8148 667 2043 1165 1142 895 2114

Since the assumptions on cell count are all met we can safely perform Chi Squared Test.

Chi-Squared Test

We perform a Chi-Squared test on the data set.

chi2 = chisq.test(cus_city_table)
chi2 #display chi-sq test results
## 
##  Pearson's Chi-squared test
## 
## data:  cus_city_table
## X-squared = 19081, df = 36, p-value < 2.2e-16

From this analysis, we can see that \(p\) < 0.001 which shows a significant association between cuisine and cities. We now plot the residual values to see if there are any significant findings. WE plot the residual plot to identify an glaring over-representation.

knitr::kable(chi2$stdres %>% round(0),caption = "City vs Cuisines") 
City vs Cuisines
Chandler Charlotte Henderson Las Vegas Mesa Phoenix Pittsburgh Scottsdale Tempe Toronto
Chinese 47 -12 -18 23 3 -25 -4 -28 -4 17
Italian -22 11 0 -21 -12 29 0 24 -2 -13
Japanese -17 -7 8 61 -14 -65 -13 -22 -9 37
Mexican 0 5 2 -51 15 67 -1 26 7 -46
Thai 5 0 6 -2 11 -26 24 -14 9 18

This table shows that for almost all cities there is at least one over-representation of 5-star review. This shows that of the selected non-local cuisines there is a clear favorite in that city.

For an \(\alpha\) =0.05 and df = 36 we calculate the \(\chi^{2}\) using the below formula.

qchisq(p = .95,df = 36) # Find the Chi - Crital Value
## [1] 50.99846

We reject \(H_0\) when \(\chi^{2}\) > \(\chi^{2}_{\text{crit}}\). In our example the \(\chi^{2}\) is 20449 so we can safely reject Null Hypothesis.

Summary of Chi-Squared Test

From the Residuals plot, we see that each of the city is famous for a specific cuisine. The cuisine with the highest residual value is chosen as the most-famous of the selected cuisine type in that city.

res_col = colnames(chi2$stdres)
res_row = rownames(chi2$stdres)[apply(chi2$stdres,2,which.max)]
max_val = apply(chi2$stdres,2,max) %>% round(2)
max_res = data.frame(res_col,res_row,max_val,row.names = NULL)
colnames(max_res)=c("City","Fav Cuisine","Chi Sq Res")
knitr::kable(max_res,caption = "City vs Cuisines") 
City vs Cuisines
City Fav Cuisine Chi Sq Res
Chandler Chinese 46.71
Charlotte Italian 10.70
Henderson Japanese 8.16
Las Vegas Japanese 60.83
Mesa Mexican 15.42
Phoenix Mexican 66.62
Pittsburgh Thai 24.20
Scottsdale Mexican 26.36
Tempe Thai 9.35
Toronto Japanese 36.91

A Chi-square test of association was used to test for a statistically significant association between cities and the most liked cuisine. The results of the test found a statistically significant association, \(\chi^{2}\) = 20449 , p < .001 thus suggesting that the people in these cities are more likely to prefer the above non-local cuisine.

Alternative Approach using Check-In data

Assuming we only have the check-in data we wanted to see if we can deduce the number of 5 star reviews that these non-local cuisines would get thus enabling us to gauge the popularity of the restaurant.

To do this we first load the Check-In data set and then slice the data to identify the number of check-ins and then we map it back to identify the business and the city and cuisine.

We can visually see a linear relationship when we do a log transformation of the data. But we need to do a statistical test to see if there are any significant relationship between the 2 variables.

Hypothesis Testing for Linear Regression Model

We perform the hypothesis testing for the overall linear regression model:

Modelling the Linear Regression

The linear regression model is drawn up in R

five_star_checkin = lm(log(five_star_count) ~ log(total_checkin), data = check_in_5_star)
five_star_checkin %>% summary()
## 
## Call:
## lm(formula = log(five_star_count) ~ log(total_checkin), data = check_in_5_star)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46666 -0.26643 -0.01482  0.25374  2.04293 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.30749    0.07281   17.96   <2e-16 ***
## log(total_checkin)  0.51479    0.01134   45.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4141 on 1702 degrees of freedom
## Multiple R-squared:  0.5477, Adjusted R-squared:  0.5474 
## F-statistic:  2061 on 1 and 1702 DF,  p-value: < 2.2e-16

We can see that the \(p\) value for the F test is < 0.001 and the F(1,1702) = 1059. The model shows statistically significant results. However we next check for the validity of assumptions and perform test on model parameters.

Hypothesis Testing of Linear Regression Model Parameters

We then perform hypothesis testing of the model parameters

In our model we see that the intercept \(\alpha\) = 1.307 and \(\beta\) = 0.515 and hence they are both non-zero. The CI is identified by

five_star_checkin %>% confint()
##                        2.5 %    97.5 %
## (Intercept)        1.1646819 1.4502970
## log(total_checkin) 0.4925462 0.5370261

This also falls above our threshold and hence all our model parameters are meeting the \(H_A\) rejection criteria.

The next steps is to test for residuals, normality and test for influential cases

Testing Linear Regression Assumptions

Correlation Co-Efficient

We calculate the correlation co-efficient to understand the strength and relation of the 2 variables.

r= cor(log(check_in_5_star$five_star_count) , log(check_in_5_star$total_checkin), 
       use = "complete.obs") 
print(paste("The r value is : ",round(r,3)),quote = F)
## [1] The r value is :  0.74
suppressMessages(library(psychometric))
## Warning: package 'psychometric' was built under R version 3.3.3
## Warning: package 'multilevel' was built under R version 3.3.3
print(paste("The CI (95%) for this value of r is : ",
            round(CIr(r,n = nrow(check_in_5_star), level = 0.95 )[1],3), " - ",
            round(CIr(r,n = nrow(check_in_5_star), level = 0.95 )[2],3)),
            quote = F)
## [1] The CI (95%) for this value of r is :  0.718  -  0.761

Summary of Findings using the Check-In Data

From all the tests we observe

Based on all these findings we reject the \(H_0\).

All of these show that there is a statistically significant relationship between the log(# of check-ins) and log(# of five star reviews).Thus if we know the number of check-ins for a business we can calculate the number of five star reviews that the business will get based on the below formula.

Nbr of 5 star reviews = exp (1.307 + 0.515 * log (Nbr of Check-Ins))

Please note that \(\alpha\) and \(\beta\) can fall within their respective CI’s as well.

From this we can infer that the business with the most check-ins is more likely the most popular one in that particular city

Summary , Next Steps and Appendix