Li Xie (s3308052), Rupesh Papneja (s3637387), Sarun Sankarankutty Menon (s3642783)
Last updated: 04 June, 2017
Yelp is an online company (2004- present) that publishes user reviews of various business - including restaurants, bars, food joints, shopping locations and various other business. To date Yelp has published about 127 million reviews of local business across 32 countries.
The data that we have used here is part of the Yelp Dataset Challenge which Yelp publishes in their website. This year the data set includes details about
Our aim was to pick restaurants in 5 cities and identifying which cuisine is most praised in that city. We do this using the R programming and our learning from Introduction to Statistics.
Before we load the Yelp data set we must look at the data structures from the Yelp webpage. The Entity Relationship diagram of the data set is given below:
First, we load the yelp business data to try and understand more details about the businesses included. We then filter this data to pick only restaurants that have more than 100 reviews.
To make this interesting we pick non-local cuisines for this analysis.
The cuisines that we picked for our analysis are Italian,Mexican,Japanese,Chinese,Thai.
coords2country = function(points) #function to identify countries
{
countriesSP = getMap(resolution='high')
pointsSP = SpatialPoints(points, proj4string=CRS(proj4string(countriesSP)))
indices = over(pointsSP, countriesSP)
indices$ADMIN # gives country names
}
rest_for_analysis$country = coords2country(data.frame(rest_for_analysis$longitude,rest_for_analysis$latitude))
# remove business where we cant find country names
rest_for_analysis = rest_for_analysis %>% filter(!is.na(country))From this data, we pick the top 10 cities based on number of business in each city.
#pick top 10 cities for analysis. We can later change this to a diff variable
top_cities = rest_for_analysis %>% group_by(city) %>% summarise(bus_count = n() ) %>% top_n(n=total_cities_covered) %>% arrange(-bus_count)
rest_in_top_cities = rest_for_analysis[is.element(rest_for_analysis$city,top_cities$city),]
rm(rest_for_analysis)Based on these filtering we have 1705 restaurant across 10 cities. The cities and cuisines selected are plotted below
rest_map = get_map(location = c('America'),zoom=4)
mapPoints = ggmap(rest_map) +
geom_point(data =rest_in_top_cities, aes(x = longitude, y = latitude),col = "red", cex = .6) +
guides(fill=FALSE, alpha=FALSE, size=FALSE)+
scale_shape_identity()|
|
|
After choosing the cities and the top 5 popular cuisines, we then read all the reviews for the restaurants that meet the requirement. For this analysis, we pick only the 5 star reviews. To cut down the file volume we remove the actual text in the review file.
csv_out_file = paste(out_path,"yelp_academic_dataset_review.csv",sep="/")
#only run this step one time to save execution times
if (!file.exists(csv_out_file)) {
con_in = file(paste(data_path,"yelp_academic_dataset_review.json",sep="/"))
con_out = file(csv_out_file, open = "wb")
stream_in(con_in, handler = function(df){
# Read without actual text data
df = df[is.element(df$business_id,rest_in_top_cities$business_id),
c("review_id","user_id","business_id","stars","date","type")]
df = dplyr::filter(df, stars > 4)
stream_out(df,con_out, pagesize = 100000)
}, pagesize = 100000, verbose = FALSE)
close(con_out)
}
review_5 = stream_in(file(paste(out_path,"yelp_academic_dataset_review.csv",sep="/")), pagesize = 200000, verbose = FALSE)From all data of the 5-star restaurants for selected cuisines, we also want to determine the probability of restaurants that have a bar or at least serve alcohol. This would also help us see if having a bar or serve alcohol is a determining factor for restaurants gaining 5 star.
alcohol_count_table = rest_in_top_cities %>%
group_by( cuisine, serve_alcohol) %>% summarise(count = n())The table below gives the summary
|
|
Probability that a restaurant serves alcohol is 0.788 and that of a restaurant having a bar is 0.528
This shows our subset of business have a large proportion of them selling alcohol. This could be one of the deciding factor for them being popular.
The review data that we extracted is combined with the business information and then we try and plot this data to see if we can observe any visual trends.
barplot(cus_city_table_prop,ylab="Proportion Within City",xlab="City",las=2,
ylim=c(0,0.65),beside=TRUE,legend=rownames(cus_city_table_prop),
args.legend=c(x = "top",horiz=T,title="Cuisine"),cex.lab = 0.85,
col = c("grey19","darkgoldenrod1","tan3","orangered","cyan"),
main = "Plotting proportion of 5 star reviews of 5 cuisines in 10 cities")We can see visually observed several trends like Mexican cuisine being famous in Mesa, Japanese in Las Vegas etc.
To ensure the results are statistically significant we perform Chi-Squared Test.
We perform a visual check to see if there are any cells that have counts below 5
knitr::kable(cus_city_table,caption = "City vs Cuisines") | Chandler | Charlotte | Henderson | Las Vegas | Mesa | Phoenix | Pittsburgh | Scottsdale | Tempe | Toronto | |
|---|---|---|---|---|---|---|---|---|---|---|
| Chinese | 2193 | 574 | 449 | 14201 | 751 | 3520 | 810 | 1218 | 883 | 2934 |
| Italian | 1111 | 2304 | 2015 | 24088 | 1046 | 12864 | 1939 | 6636 | 2062 | 3951 |
| Japanese | 948 | 1299 | 1895 | 26922 | 731 | 3782 | 1108 | 2985 | 1403 | 5674 |
| Mexican | 1914 | 2119 | 2082 | 21349 | 1927 | 16019 | 1917 | 6828 | 2414 | 2166 |
| Thai | 715 | 605 | 774 | 8148 | 667 | 2043 | 1165 | 1142 | 895 | 2114 |
Since the assumptions on cell count are all met we can safely perform Chi Squared Test.
We perform a Chi-Squared test on the data set.
chi2 = chisq.test(cus_city_table)
chi2 #display chi-sq test results##
## Pearson's Chi-squared test
##
## data: cus_city_table
## X-squared = 19081, df = 36, p-value < 2.2e-16
From this analysis, we can see that \(p\) < 0.001 which shows a significant association between cuisine and cities. We now plot the residual values to see if there are any significant findings. WE plot the residual plot to identify an glaring over-representation.
knitr::kable(chi2$stdres %>% round(0),caption = "City vs Cuisines") | Chandler | Charlotte | Henderson | Las Vegas | Mesa | Phoenix | Pittsburgh | Scottsdale | Tempe | Toronto | |
|---|---|---|---|---|---|---|---|---|---|---|
| Chinese | 47 | -12 | -18 | 23 | 3 | -25 | -4 | -28 | -4 | 17 |
| Italian | -22 | 11 | 0 | -21 | -12 | 29 | 0 | 24 | -2 | -13 |
| Japanese | -17 | -7 | 8 | 61 | -14 | -65 | -13 | -22 | -9 | 37 |
| Mexican | 0 | 5 | 2 | -51 | 15 | 67 | -1 | 26 | 7 | -46 |
| Thai | 5 | 0 | 6 | -2 | 11 | -26 | 24 | -14 | 9 | 18 |
This table shows that for almost all cities there is at least one over-representation of 5-star review. This shows that of the selected non-local cuisines there is a clear favorite in that city.
For an \(\alpha\) =0.05 and df = 36 we calculate the \(\chi^{2}\) using the below formula.
qchisq(p = .95,df = 36) # Find the Chi - Crital Value## [1] 50.99846
We reject \(H_0\) when \(\chi^{2}\) > \(\chi^{2}_{\text{crit}}\). In our example the \(\chi^{2}\) is 20449 so we can safely reject Null Hypothesis.
From the Residuals plot, we see that each of the city is famous for a specific cuisine. The cuisine with the highest residual value is chosen as the most-famous of the selected cuisine type in that city.
res_col = colnames(chi2$stdres)
res_row = rownames(chi2$stdres)[apply(chi2$stdres,2,which.max)]
max_val = apply(chi2$stdres,2,max) %>% round(2)
max_res = data.frame(res_col,res_row,max_val,row.names = NULL)
colnames(max_res)=c("City","Fav Cuisine","Chi Sq Res")
knitr::kable(max_res,caption = "City vs Cuisines") | City | Fav Cuisine | Chi Sq Res |
|---|---|---|
| Chandler | Chinese | 46.71 |
| Charlotte | Italian | 10.70 |
| Henderson | Japanese | 8.16 |
| Las Vegas | Japanese | 60.83 |
| Mesa | Mexican | 15.42 |
| Phoenix | Mexican | 66.62 |
| Pittsburgh | Thai | 24.20 |
| Scottsdale | Mexican | 26.36 |
| Tempe | Thai | 9.35 |
| Toronto | Japanese | 36.91 |
A Chi-square test of association was used to test for a statistically significant association between cities and the most liked cuisine. The results of the test found a statistically significant association, \(\chi^{2}\) = 20449 , p < .001 thus suggesting that the people in these cities are more likely to prefer the above non-local cuisine.
Assuming we only have the check-in data we wanted to see if we can deduce the number of 5 star reviews that these non-local cuisines would get thus enabling us to gauge the popularity of the restaurant.
To do this we first load the Check-In data set and then slice the data to identify the number of check-ins and then we map it back to identify the business and the city and cuisine.
|
|
|
We can visually see a linear relationship when we do a log transformation of the data. But we need to do a statistical test to see if there are any significant relationship between the 2 variables.
We perform the hypothesis testing for the overall linear regression model:
The linear regression model is drawn up in R
five_star_checkin = lm(log(five_star_count) ~ log(total_checkin), data = check_in_5_star)
five_star_checkin %>% summary()##
## Call:
## lm(formula = log(five_star_count) ~ log(total_checkin), data = check_in_5_star)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.46666 -0.26643 -0.01482 0.25374 2.04293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.30749 0.07281 17.96 <2e-16 ***
## log(total_checkin) 0.51479 0.01134 45.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4141 on 1702 degrees of freedom
## Multiple R-squared: 0.5477, Adjusted R-squared: 0.5474
## F-statistic: 2061 on 1 and 1702 DF, p-value: < 2.2e-16
We can see that the \(p\) value for the F test is < 0.001 and the F(1,1702) = 1059. The model shows statistically significant results. However we next check for the validity of assumptions and perform test on model parameters.
We then perform hypothesis testing of the model parameters
In our model we see that the intercept \(\alpha\) = 1.307 and \(\beta\) = 0.515 and hence they are both non-zero. The CI is identified by
five_star_checkin %>% confint()## 2.5 % 97.5 %
## (Intercept) 1.1646819 1.4502970
## log(total_checkin) 0.4925462 0.5370261
This also falls above our threshold and hence all our model parameters are meeting the \(H_A\) rejection criteria.
The next steps is to test for residuals, normality and test for influential cases
|
|
|
|
|
|
We calculate the correlation co-efficient to understand the strength and relation of the 2 variables.
r= cor(log(check_in_5_star$five_star_count) , log(check_in_5_star$total_checkin),
use = "complete.obs")
print(paste("The r value is : ",round(r,3)),quote = F)## [1] The r value is : 0.74
suppressMessages(library(psychometric))## Warning: package 'psychometric' was built under R version 3.3.3
## Warning: package 'multilevel' was built under R version 3.3.3
print(paste("The CI (95%) for this value of r is : ",
round(CIr(r,n = nrow(check_in_5_star), level = 0.95 )[1],3), " - ",
round(CIr(r,n = nrow(check_in_5_star), level = 0.95 )[2],3)),
quote = F)## [1] The CI (95%) for this value of r is : 0.718 - 0.761
From all the tests we observe
Based on all these findings we reject the \(H_0\).
All of these show that there is a statistically significant relationship between the log(# of check-ins) and log(# of five star reviews).Thus if we know the number of check-ins for a business we can calculate the number of five star reviews that the business will get based on the below formula.
Nbr of 5 star reviews = exp (1.307 + 0.515 * log (Nbr of Check-Ins))
Please note that \(\alpha\) and \(\beta\) can fall within their respective CI’s as well.
From this we can infer that the business with the most check-ins is more likely the most popular one in that particular city
Summary : We performed 2 analsyis of the yelp data set and observed the following
We used the number of 5 star reviews that each non-local cuisine type received in a sample of 10 cities to identify the most preferred cuisine for that city.
We also found a linear relationship between the number of checkins for a particular city and the number of 5 star reviews that the business gets. This can be used to identify the most likeable cuisine.
Useful Links