Objective
This project is driven by Yelp Academic Data Set, which is offered via Yelp Data Challenge. Yelp is one of the leading online recommendation sites for Restaurants and other local businesses, and it is very popular in North America and Europe.
Our objective with the project is two-fold-
Sentiment Analysis: Analysing if Yelp Reviews for a Restaurants correlate with Health Inspection Grades issued by the city.
- Exploratory analysis: Explore the Yelp Dataset to spot
- Locational concentrations of high quality restaurants
- Variation in Cultural Trends and Preferences across Major Cities
Approach
The method used in this project in order to answer the primary questions was an OSEMN workflow approach. In this approach, there are five distinct stages: Obtain Data, Scrub Data, Explore, Model and iNterpret.
Introduction to the Data
The data set comprises of following files -
It is a large dataset with details of about 4 million reviews, 1 million users and 144 thousand businesses.
The schema of the json files is as follows -
More details about the challenge, dataset an its schema can be seen at the Yelp Data Challenge website.
Maintaining the Data on Remote Server
The data was maintained on SQL Server in Microsoft Azure Cloud. The data upload was an extensive exercise which took multiple days due to sheer volume of records. The R packages used to help in this were RODBC and jsonlite. RODBC was chosen because RSQLServer has been deprecated by CRAN-R and is no longer actively supported.
Example of how connection was established -
library(RODBC)
connectionString <- "Driver=ODBC Driver 13 for SQL Server;Server=tcp:data607yelp1.database.windows.net,1433;Database=DATA607_FinalProject_Yelp1;Uid=data607@data607yelp1;Pwd=XXXXX;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
#conn <- odbcDriverConnect(connectionString)
The tables uploaded were, which correspnd to yelp files mentioned above -
YelpBusiness
YelpUser
YelpReview
Once the data was uploaded multiple tables were created on server side as well to help the analysis -
YelpLVBusiness - Restaurants in Las Vegas
YelpLVReview - Reviews for Restaurants in Las Vegas
YelpLVReviewTokens - Tokens from Sample Las Vegas Reviews to be used in Sentiment Analysis
YelpRestaurants - All Yelp Restaurants
YelpRestaurantCategory - Restaurants and their Category (~Cuisine) table
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
library(stringr)
library(ggplot2)
library(RODBC)
library(ggrepel)
library(knitr)
library(tidyr)
library(jsonlite)
library(leaflet)
library(dplyr)
Analysing the Health Inspection
Obtain Data The first step was to retrieve data regarding health inspection grades, restaurant/food/drink businesses and reviews/ratings of said businesses. The two sources were the Yelp Dataset Challenge data (can be found here: https://www.yelp.com/dataset_challenge) and, under the US City Open Data Census website, the Las Vegas Health Inspection Data, whose location was chosen for its larger data pool in Yelp reviews, (can be found here: http://us-city.census.okfn.org/entry/vegas/food-safety). The Health Inspection data was in a csv format, while the Yelp Dataset came in the form of json files (User, Business and Reviews), all of which were entered into an SQL database.
Scrub Data The primary goal when scrubbing was to first combine multiple datasets while retaining as much information as possible, to ensure that there were no duplicate entries, that entries were formatted correctly, that missing data was removed (to the capacity that it was allowed without making the data pool too small) and that headings and data entries were consistant for future analysis.
Explore Even though there were questions created before combining the datasets, many of our approaches and goals had to be readjusted when finally working with the combined data. After getting a sense of what kind of information we would be working with and the value of each variable in regard to our overall project, we were allowed to shift and narrow down our focus to determining the relationship between Yelp Reviews/Ratings (with an importance on hygiene) and Health Inspection Grades, in addition to analyzing what cuisines were more prominent across major cities within the U.S. and outside it.
Model In order to analyse the Yelp Reviews, a sentiment analysis was run, using a Health Lexicon to determine if the users’ opinion regarding the health status of the establishment matched that of the Health Inspectors. In addition, a regression model was used to determine the importance of the variables against the establishments’ overall rating in order to determine if any factors contributed more to a higher rating. Visual correlations were also made to enhance the conclusion formed by the regression.
Interpret The conclusions made were based on the models and visual analyses created during the “Model” portion of the data science workflow.
It should be noted that, due to the nature of the project, in addition to a vast number of unforseen occurences when handling the data, this workflow structure was not followed as rigidly as stated above. In times when a portion of a dataset that was deemed unecessary (and thus discarded earlier in the project) proved to be important, this workflow was used (within already another step of the primary analysis method timeline) until all necessary data was available for handling.
Yelp Data
############### YELP
connectionString = "driver=ODBC Driver 13 for SQL Server;Server=tcp:data607yelp1.database.windows.net,1433;Database=DATA607_FinalProject_Yelp1;Uid=data607@data607yelp1;Pwd=Ozymandias123;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
conn = odbcDriverConnect(connectionString)
# YelpUser
User = sqlQuery(conn, "SELECT * FROM YelpUser;")
# YelpReview
Review = sqlQuery(conn, "SELECT * FROM YelpLVReview;")
# YelpBusiness
LasVBus = sqlQuery(conn, "SELECT * FROM YelpLVBusiness;")
User = User[ c(1,3)]
Review = Review[ c(1:6)]
LasVBus = LasVBus[ c(1:11)]
kable(head(LasVBus))
| 3rkxTx8DoZSl7_FryhXCVQ |
Jugoo Chinese Restaurant |
University |
4747 S Maryland Pkwy |
Las Vegas |
NV |
89119 |
36.10413 |
-115.1377 |
3.5 |
34 |
| xkVMIk_Vqh17f48ZQ_6b0w |
Tom Colicchio’s Craftsteak |
The Strip |
3799 Las Vegas Blvd S |
Las Vegas |
NV |
89109 |
36.10292 |
-115.1698 |
4.0 |
1915 |
| n8XlV8E3BUBBF1gCe6mvEg |
House of Vino |
Centennial |
8053 N Durango Dr, Ste 160 |
Las Vegas |
NV |
89131 |
36.30763 |
-115.2786 |
3.5 |
34 |
| 2BbFeotL85cIaBjSq1SWiA |
Springs Cafe by Wolfgang Puck |
Westside |
333 S Valley View Blvd |
Las Vegas |
NV |
89107 |
36.16810 |
-115.1922 |
2.5 |
23 |
| w_A1qFeQ3vOl7o4kdBxPXA |
Muse Asian Cuisine & Lounge |
Chinatown |
5115 W Spring Mountain Rd, Ste 107 |
Las Vegas |
NV |
89146 |
36.12604 |
-115.2114 |
3.5 |
14 |
| XlUuMGMvkzy_QLAUTEe-fw |
Dumpling King |
Chinatown |
5740 Spring Mountain Rd |
Las Vegas |
NV |
89146 |
36.12608 |
-115.2198 |
3.5 |
120 |
Health Inspection Data
############### INSPECTIONS ###############
LVInspection = read.csv("https://raw.githubusercontent.com/Galanopoulog/Final-Project/master/LVInspections.csv", header = T, stringsAsFactors = T)
levels(LVInspection$Inspection.Grade)[2] = "A"
levels(LVInspection$Inspection.Grade)[3] = "B"
levels(LVInspection$Current.Grade)[2] = "A"
kable(head(LVInspection))
| DA0001131 |
PR0000411 |
Black Mountain Clubhouse Restaurant |
Black Mountain Clubhouse |
Restaurant |
500 Greenway Rd |
Henderson |
Nevada |
89015-7464 |
3 |
A |
10/29/2015 0:00 |
3/30/2010 0:00 |
3/30/2010 11:30 |
EE7000392 |
Routine Inspection |
18 |
B |
|
‘B’ Downgrade |
49122731112 |
2/21/2013 22:26 |
(36.019531, -114.9739838) |
| DA0001132 |
PR0000411 |
Black Mountain Clubhouse Restaurant |
Black Mountain Clubhouse |
Restaurant |
500 Greenway Rd |
Henderson |
Nevada |
89015-7464 |
3 |
A |
10/29/2015 0:00 |
4/13/2010 0:00 |
4/13/2010 11:15 |
EE7000392 |
Routine Inspection |
1 |
A |
|
Compliant |
37 |
2/21/2013 22:26 |
(36.019531, -114.9739838) |
| DA0001133 |
PR0000411 |
Black Mountain Clubhouse Restaurant |
Black Mountain Clubhouse |
Restaurant |
500 Greenway Rd |
Henderson |
Nevada |
89015-7464 |
3 |
A |
10/29/2015 0:00 |
3/10/2011 0:00 |
3/10/2011 10:40 |
EE7000392 |
Routine Inspection |
9 |
A |
|
Compliant |
NA |
2/21/2013 22:26 |
(36.019531, -114.9739838) |
| DA0001173 |
PR0000412 |
Black Mountain Clubhouse Bar |
Black Mountain Clubhouse |
Bar / Tavern |
500 Greenway Rd |
Henderson |
Nevada |
89015-7464 |
0 |
A |
10/29/2015 0:00 |
3/30/2010 0:00 |
3/30/2010 12:45 |
EE7000392 |
Routine Inspection |
3 |
A |
|
Compliant |
313637 |
2/21/2013 22:26 |
(36.019531, -114.9739838) |
| DA0001174 |
PR0000412 |
Black Mountain Clubhouse Bar |
Black Mountain Clubhouse |
Bar / Tavern |
500 Greenway Rd |
Henderson |
Nevada |
89015-7464 |
0 |
A |
10/29/2015 0:00 |
3/10/2011 0:00 |
3/10/2011 11:40 |
EE7000392 |
Routine Inspection |
4 |
A |
|
Compliant |
NA |
2/21/2013 22:26 |
(36.019531, -114.9739838) |
| DA0001185 |
PR0000414 |
Black Mountain Country Club - Golf Cart #1 |
Black Mountain Clubhouse |
Self-Service Food Truck |
500 Greenway Rd |
Henderson |
Nevada |
89015-7464 |
0 |
A |
10/29/2015 0:00 |
3/30/2010 0:00 |
3/30/2010 12:40 |
EE7000392 |
Routine Inspection |
0 |
A |
|
Compliant |
NA |
2/21/2013 22:26 |
(36.019531, -114.9739838) |
Combining Health Inspection data with LasVegas Business data
names(LasVBus)[2]= "Restaurant.Name"
names(LasVBus)[4]= "Address"
nondupl = LVInspection[!duplicated(LVInspection[,c('Restaurant.Name','Address')]),]
nondupl= nondupl[ c(3,6,10,11:13,17:22)]
Combined = merge(LasVBus, nondupl, by = c('Restaurant.Name','Address'))
Combined = Combined[,c(3,1,2,4,5,6,7,8,9,10,11,13,17,12,16,14,15,21,18:20)]
kable(head(Combined))
| 0IbboCQAEQlROAv5PKKO_Q |
Aloha Island Grill |
2267 N Rampart Blvd |
Summerlin |
Las Vegas |
NV |
89128 |
36.20120 |
-115.2811 |
4.5 |
15 |
A |
A |
10 |
10 |
6/16/2010 0:00 |
6/16/2010 0:00 |
2/21/2013 22:26 |
|
Compliant |
1.418223e+09 |
| JgkfjCL1NfMBC_M_hax_3Q |
Aloha Kitchen |
4466 E Charleston Blvd |
Sunrise |
Las Vegas |
NV |
89104 |
36.15928 |
-115.0782 |
4.0 |
116 |
A |
A |
6 |
10 |
1/20/2016 0:00 |
2/22/2010 0:00 |
2/21/2013 22:26 |
|
Compliant |
1.420000e+13 |
| 21sGRVR7HEs_t6PdB9tGMw |
Angelina’s Pizzeria |
5750 S Decatur Blvd |
|
Las Vegas |
NV |
89118 |
36.08530 |
-115.2074 |
2.5 |
131 |
A |
A |
8 |
8 |
12/3/2015 0:00 |
2/12/2010 0:00 |
2/21/2013 22:26 |
|
Compliant |
1.428314e+09 |
| Oyo5CWthu2upxxaf69MfIg |
Anna’s Garden |
5740 Spring Mountain Rd |
Chinatown |
Las Vegas |
NV |
89146 |
36.12602 |
-115.2198 |
3.5 |
29 |
A |
C |
10 |
32 |
10/10/2011 0:00 |
2/2/2010 0:00 |
2/21/2013 22:26 |
|
‘C’ Downgrade |
1.210000e+23 |
| Rhz38lsptoxX9IWS641LFA |
Annie’s Kitchen |
1212 D St |
|
Las Vegas |
NV |
89106 |
36.18430 |
-115.1470 |
4.0 |
10 |
A |
A |
0 |
5 |
11/3/2015 0:00 |
5/3/2010 0:00 |
2/21/2013 22:26 |
|
Compliant |
7.353637e+06 |
| 7NvIgf7XW55OKa5ZU7gw8Q |
Arandas Taqueria |
2040 N Nellis Blvd |
Sunrise |
Las Vegas |
NV |
89115 |
36.19695 |
-115.0617 |
4.0 |
20 |
A |
X |
0 |
55 |
1/6/2015 0:00 |
3/10/2010 0:00 |
2/21/2013 22:26 |
|
Closed with Fees |
4.710000e+51 |
Combining User data with Review data based on the Business data
UserReview = merge(User, Review, by = c('user_id'))
UserbyBusiness = semi_join(UserReview, Combined, by = "business_id")
Analyzing the Health Inspection data
# looking at grades and their demerit points
# Histograms
par(mfrow = c(1,2))
ggplot(LVInspection, aes(x=Inspection.Grade, y=Inspection.Demerits)) + geom_bar(stat="identity") +
labs(x="Inspection.Grade", y="Inspection.Demerits")

ggplot(LVInspection, aes(x=Current.Grade, y=Current.Demerits)) + geom_bar(stat="identity") +
labs(x="Current.Grade", y="Current.Demerits")

# Boxplots
par(mfrow = c(1,2))
boxplot(Inspection.Demerits ~ Inspection.Grade,data=LVInspection, main="Inspection Demerit Points by Grade",
xlab="Inspection Grade", ylab="Inspection Demerits", ylim = c(0,120))
boxplot(Current.Demerits ~ Current.Grade,data=LVInspection, main="Demerit Points by Grade",
xlab="Current Grade", ylab="Current Demerits",ylim = c(0,120))

We notice that during the Inspection, the majority of the business were classified as “A” grade, though it might have been expected to see a normal distribution in which, perhaps a “B” grade was more common. However, after the inspection, there was a dramatic shift in the grades in that the majority of businesses were upgraded to “A” grade, which may imply that after their grading, the businesses focused to reslove the issues that caused them demerit points. This seems more evident with the boxplots, where the upper outliers for demerits in each grades have radically decreased.
Adressing Questions
1. Relationship exists between Yelp ratings and Inspection Ratings
pairs(Combined[,c("stars","review_count","Current.Grade","Current.Demerits","Inspection.Grade","Inspection.Demerits")])

We see here that there doesn’t visually appear to exist a correlation between these variables. Nonetheless, we run a multiple regression.
# Regression
regstar = lm(stars ~ review_count + Current.Grade + Current.Demerits + Inspection.Grade + Inspection.Demerits, data = Combined)
summary(regstar)
##
## Call:
## lm(formula = stars ~ review_count + Current.Grade + Current.Demerits +
## Inspection.Grade + Inspection.Demerits, data = Combined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89014 -0.39232 0.05882 0.51465 1.16622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4106715 0.7067370 4.826 3.38e-06 ***
## review_count 0.0007989 0.0003249 2.459 0.0151 *
## Current.GradeB -0.0262220 0.4404514 -0.060 0.9526
## Current.GradeC 0.3585021 0.7468499 0.480 0.6319
## Current.GradeO -1.4592595 0.6681818 -2.184 0.0305 *
## Current.GradeX -1.5123761 0.9769126 -1.548 0.1237
## Current.Demerits -0.0095496 0.0157630 -0.606 0.5455
## Inspection.GradeA -0.0654856 0.6841899 -0.096 0.9239
## Inspection.GradeB -0.2742315 0.7018883 -0.391 0.6966
## Inspection.GradeC -0.1997069 0.7555434 -0.264 0.7919
## Inspection.GradeX -0.3885972 0.8807551 -0.441 0.6597
## Inspection.Demerits 0.0158040 0.0130400 1.212 0.2274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6506 on 151 degrees of freedom
## Multiple R-squared: 0.1506, Adjusted R-squared: 0.08868
## F-statistic: 2.433 on 11 and 151 DF, p-value: 0.008076
hist(regstar$residuals, main = "Regression Residuals")

# normal probability plot
qqnorm(regstar$residuals)
qqline(regstar$residuals)

# constant variability
plot(regstar$residuals ~ Combined$review_count, main = "Variability")
abline(h = 0, lty = 3)

This multiple regression is not fit to be run. The residuals histogram is somewhat left skewed, the data along the qq plot is off the line and there exists a pattern when testing the residuals for constand variability. The regression itself only primarily the “reviews_count” variable to have any significant effect on the “stars” variable. So, we can run another regression with only that variable to take into account.
regstarrev = lm(stars ~ review_count, data = Combined)
summary(regstarrev)
##
## Call:
## lm(formula = stars ~ review_count, data = Combined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8994 -0.4140 0.0768 0.5406 1.1117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3855991 0.0635339 53.288 <2e-16 ***
## review_count 0.0009172 0.0003159 2.904 0.0042 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6664 on 161 degrees of freedom
## Multiple R-squared: 0.04977, Adjusted R-squared: 0.04387
## F-statistic: 8.433 on 1 and 161 DF, p-value: 0.004203
hist(regstarrev$residuals, main = "Regression Residuals")

# normal probability plot
qqnorm(regstarrev$residuals)
qqline(regstarrev$residuals)

# constant variability
plot(regstarrev$residuals ~ Combined$review_count, main = "Variability")
abline(h = 0, lty = 3)

This, too, was a failed attempt at explaining the star ratings. The R-squared of this regression showed that very little variability was explained this and the previous model.
Sentiment analysis
UserbyBusiness = UserbyBusiness[!duplicated(UserbyBusiness), ]
ReviewsGrades = merge(UserbyBusiness, Combined, by = "business_id")
groupings <- ReviewsGrades %>% select(business_id, Inspection.Grade) %>% filter(Inspection.Grade=='C') %>% group_by(business_id, Inspection.Grade) %>% summarise()
groupings$business_id
## [1] 9f68hfHK6gNyHWmt9guiJw dblGQtHAzNy1ACis1m2shw dFmaYj3_cZpmEaL4WSOkPg
## [4] eRQwgSQD1KyCmJk_PAGOXQ FvC3W9E6rXUYx-cTxbnUiA iRFaq249LVEFGeXLzPULnQ
## [7] lOQPeguczAueJnLmDUTfGg MXC9pwIxovWUc9yu1F8OxA Oyo5CWthu2upxxaf69MfIg
## [10] PSRZaGGxXmOmabL2si8pKw q7LG_2s2urLJcD2IsxgTWg Qc-bULSBAyVh_qaKRWkF9A
## [13] So132GP_uy3XbGs0KNyzyw suQugTjSMYMIrK5fAabC_Q t3AWGr757V4Fvcun6sDsuw
## [16] T5UBAJt5Kod_wQGbH6m9hA w0ArxcJ0XsyK1ZwrIf3gpA zfNKdsuT7bWWN-f3v5Rj_g
## 5238 Levels: --q7kSBRb0vWC8lSkXFByA ... Zzl4ffV9kFY3yOSSgyRUBQ
#%>% rapply(function(x)length(unique(x)))
# Sentiment analysis
reviews = as.list(as.character(ReviewsGrades$text))
analysis <- sqlQuery(conn, "Select * from YelpLVReviewTokens")
# using lexicon to score words
#lexicon = data.frame(get_sentiments("afinn"))
lexicon = read.csv("https://raw.githubusercontent.com/jmehta753/MSDA_JM/master/DATA607/Final%20Project/HygieneLexicon.csv")
af.sent = data.frame(merge(analysis, lexicon, by.x="word", by.y="word") %>%
group_by(review)%>%
mutate(mean.af.sent = mean(score)))
names(af.sent)[2] = "review_id"
# combining scores to business data
review.analysis = merge(af.sent, ReviewsGrades, by = "review_id")
Plots
MEAN SCORE BY STARS
These plots are separated into two sections: Star scores each user alloted to the restaurant and the overall score Yelp gave the restaurant based on the users’ reviews. It will be interesting to see what the connection is between these two variables and the mean sentiment score, in order to determine if, firstly, the sentiment analysis was somewhat accurate, and, second, to see if the overall rating of the business matches the sentiment scores.
# Users' stars for the restaurant
par(mfrow = c(1,2))
boxplot(mean.af.sent~ stars.x, data=review.analysis[!duplicated(review.analysis), ],
main="Mean Sentiment Score by Users' Star Scores",xlab="Users' Stars", ylab="Mean Sentiment Score")
abline(h = 0, lty = 3)
# Overall star rating for the restaurant
boxplot(mean.af.sent~ stars.y, data=review.analysis[!duplicated(review.analysis), ],
main="Mean Sentiment Score by Restaurants' Star Scores",xlab="Restaurant Stars", ylab="Mean Sentiment Score")
abline(h = 0, lty = 3)

MEAN SCORE BY GRADE
The boxplot is based on Current Inspection Grade. What would be interesting to note here is if there is a relationship between the inspection grades and the review sentiments (meaning, if the users’ opinion match or are related in some way to the inspection grades).
# Current Grade
boxplot(mean.af.sent~ Current.Grade, data=review.analysis[!duplicated(review.analysis), ],
main="Mean Sentiment Score by Current Grade",xlab="Current Grade", ylab="Mean Sentiment Score")

# Inspection Grade
boxplot(mean.af.sent~ Inspection.Grade, data=review.analysis[!duplicated(review.analysis), ],
main="Mean Sentiment Score by Inspection Grade",xlab="Current Grade", ylab="Mean Sentiment Score")
abline(h = 0, lty = 3)

STARS BY GRADE
The boxplot is based on Current Inspection Grade. What would be interesting to note here is if there is a relationship between the inspection grades and the review sentiments (meaning, if the users’ opinion match or are related in some way to the inspection grades).
boxplot(stars.x ~ Current.Grade, ReviewsGrades,
main="Users' Star Scores by Current Inspection Grades",xlab="Current Grade", ylab="Stars")
abline(h = 0, lty = 3)

WORDS BY USERS’ STAR SCORE
Plotting the words that the afinn lexicon determined as positive/negative
The detail to pay attention to here is not only the words that 5-star reviews earned, but whether the words for higher star scores are related to health/cleanliness. At first glance, the majority of the words have more to do with taste than hygiene.
# Plots of the words scores vs the users' star score for each restaurant.
ggplot(review.analysis, aes(score, stars.x, group = score)) +
geom_point(aes(colour = factor(score))) +
geom_label_repel(aes(label=word)) +
xlab("Word Score") +
ylab("Users' Stars Scores")
Location Analysis through Visualization
We are interested in knowing how the restaurants are concentrated in Las Vegas and if we can spot where some of its great restaurants are located through visual analysis. It is well known that Las Vegas strip is home to multiple high quality restaurants but are there any other surprises?
Plot all the restaruants on the map to view the concentrations
colorDomain <- c(0.0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5)
pal <- colorNumeric("YlGn", colorDomain, na.color = "transparent")
leaflet(vegasFrame) %>% addTiles() %>% setView(lng = -115.1398,
lat = 36.1699, zoom = 12) %>% addCircles(lng = ~longitude, lat = ~latitude,
weight = 0, radius = ~exp(stars*1.2) , fillOpacity = 0.5,
color = ~pal(stars),
popup = ~content(name, address, neighborhood))
Expectedly there are multiple areas of concentrations within las vegas.
Plot only the high ranked restaurants on the map
As our above map was bit crowded we use a different approach to plot the map by clustering the spots together.
vegasTopFrame = vegasFrame %>% filter(stars>=4.0)
leaflet(vegasTopFrame) %>% addTiles() %>% setView(lng = -115.1398,
lat = 36.1699, zoom = 12) %>% addCircleMarkers(lng = ~longitude, lat = ~latitude,
popup = ~content(name, address, neighborhood), clusterOptions = markerClusterOptions())
Conclusion :
While the Strip expectedly has highest number of high quality restaurants (at 268), it came as a surprise that Chinatown (with 166) and Downtown (with 157) has very large number of such places too.
Analysing the cultural trends and preferences across various cities
The approach taken here is to measure the popularity of various types of cuisines in major cities by counting the number of reviews for restaurants belonging to such category.
First lets load the restaurant and category data from database
YelpRestaurants <- sqlQuery(conn, "Select * from YelpRestaurants")
YelpRestaurantsCategory <- sqlQuery(conn, "Select * from YelpRestaurantCategory")
names(YelpRestaurants)
## [1] "business_id" "name" "neighborhood" "address"
## [5] "city" "state" "postal_code" "latitude"
## [9] "longitude" "stars" "review_count" "is_open"
## [13] "attributes" "categories" "hours" "type"
names(YelpRestaurantsCategory)
## [1] "business_id" "category"
Classify Reviews by Year
AllReviews = sqlQuery(conn, "Select business_id, stars, date from YelpReview")
ReviewsByYear <- AllReviews %>% select(business_id, stars, date) %>% mutate(year = substring(date, 1,4))
ReviewsByYear <- ReviewsByYear %>% group_by(business_id, year) %>% summarise(meanStar = mean(stars), count=n())
kable(head(ReviewsByYear))
| –6MefnULPED_I942VcFNA |
2008 |
5.000000 |
1 |
| –6MefnULPED_I942VcFNA |
2010 |
2.333333 |
3 |
| –6MefnULPED_I942VcFNA |
2011 |
3.000000 |
2 |
| –6MefnULPED_I942VcFNA |
2012 |
3.000000 |
1 |
| –6MefnULPED_I942VcFNA |
2013 |
3.000000 |
1 |
| –6MefnULPED_I942VcFNA |
2014 |
3.000000 |
3 |
Combine Reviews with City (from Restaurants data) and Category data
We also compute the total number of reviews in each city-category combination
BusinessByCity <- YelpRestaurants %>% select(business_id, city)
BusinessByCityAndCategory <- inner_join(BusinessByCity, YelpRestaurantsCategory, by.x="business_id", by.y="business_id")
## Joining, by = "business_id"
## Warning in inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
BusinessYearOnYearGrid <- inner_join(BusinessByCityAndCategory, ReviewsByYear, by.x="business_id", by.y="business_id") %>% group_by(city, category, year) %>% summarise(count=n())
## Joining, by = "business_id"
## Warning in inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## character vector and factor, coercing into character vector
BusinessYearOnYearGrid1 <- BusinessYearOnYearGrid %>% spread(key=year, value=count)
BusinessYearOnYearGrid1[is.na(BusinessYearOnYearGrid1)] <- 0
BusinessYearOnYearGrid1 <- BusinessYearOnYearGrid1 %>% group_by(city, category) %>% mutate(total=`2004`+`2005` + `2006`+`2007` +`2008`+`2009` + `2010`+`2011` + `2012`+`2013` + `2014`+`2015` + `2016`+`2017`)
BusinessYearOnYearGrid1<- BusinessYearOnYearGrid1 %>% arrange(desc(total))
kable(head(BusinessYearOnYearGrid1))
| Las Vegas |
American (Traditional) |
0 |
14 |
58 |
148 |
210 |
257 |
340 |
381 |
402 |
434 |
481 |
504 |
512 |
318 |
4059 |
| Las Vegas |
Fast Food |
0 |
5 |
15 |
54 |
94 |
130 |
277 |
378 |
420 |
529 |
591 |
655 |
669 |
222 |
4039 |
| Las Vegas |
Nightlife |
0 |
14 |
60 |
137 |
175 |
217 |
290 |
321 |
354 |
419 |
477 |
514 |
552 |
342 |
3872 |
| Toronto |
Nightlife |
0 |
0 |
0 |
0 |
149 |
223 |
279 |
331 |
390 |
444 |
513 |
576 |
609 |
244 |
3758 |
| Las Vegas |
Bars |
0 |
12 |
56 |
131 |
165 |
203 |
272 |
304 |
336 |
397 |
449 |
490 |
524 |
330 |
3669 |
| Toronto |
Bars |
0 |
0 |
0 |
0 |
139 |
206 |
262 |
310 |
368 |
423 |
494 |
557 |
592 |
239 |
3590 |
Spotting Cultural Trends
Comparing American cities
We don’t consider Las Vegas here as it is not a representative city due to its unusual concentration of restaurants and atypical culture.
## Categories which we think are redundant
redundantCat <- c('Bars', 'Nightlife', 'Event Planning & Services', 'Delis', 'Caterers', 'Pubs')
## Plot American cities
AmerBusinessYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Charlotte','Pittsburgh', 'Phoenix')) %>% arrange(desc(total))
# Filter out redundant categories and keep only where number of reviews is atleast 300
AmerBusinessYearOnYearGrid <- AmerBusinessYearOnYearGrid %>% filter(!category %in% redundantCat) %>% filter(total > 300) %>% select(total)
## Adding missing grouping variables: `city`, `category`
ggplot(AmerBusinessYearOnYearGrid, aes(x = city, y = category, label = `total`)) +
geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) +
geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "American Cities Cuisine Preferneces", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:
- Phoenix appears to be more diverse than Pittsburgh and Charoltte. Unsurprisingly the Mexican count is much higher there due to its high Latino influence.
- Charolette has a number of reviews for Southern cuisine, which is completely lacking in other two. Surprsingly its Cafe culture is yet to develop here.
- Pittsburgh is least diverse of all, and surprisingly not enough liking for Steakhouses is seen.
1. Comparing Canadian cities
## Plot Canadian cities
CanadianBusinessYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Toronto', 'Montréal')) %>% arrange(desc(total))
# Filter out redundant categories and keep only where number of reviews is atleast 300
CanadianBusinessYearOnYearGrid <- CanadianBusinessYearOnYearGrid %>% filter(!category %in% redundantCat) %>% filter(total > 400) %>% select(total)
## Adding missing grouping variables: `city`, `category`
CanadianBusinessYearOnYearGrid
## Source: local data frame [54 x 3]
## Groups: city, category [54]
##
## city category total
## <fctr> <fctr> <dbl>
## 1 Toronto Canadian (New) 2758
## 2 Toronto Italian 2433
## 3 Toronto Breakfast & Brunch 2419
## 4 Toronto Sandwiches 2025
## 5 Toronto Japanese 2005
## 6 Toronto Chinese 1986
## 7 Toronto Pizza 1820
## 8 Toronto Cafes 1655
## 9 Toronto Coffee & Tea 1575
## 10 Montréal French 1560
## # ... with 44 more rows
ggplot(CanadianBusinessYearOnYearGrid, aes(x = city, y = category, label = `total`)) +
geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) +
geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "Canadian Cities Cuisine Preferneces", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:
- Toronto comes out to be a signifcantly more diverse city. So in this way its hold its reputation of being a world city, much like New york.
- Montreal’s absence of anything but french and traditional food is very stark. For examples, we don’t see Middle Eastern, Mexican or Indian counts.
- Only cuisine which stands out as non-traditional for Montreal is Sushi bars. Is it because it is trendy and healthy which appeals to Montreal’s europe-like tastes?
## Plot Euro cities
EuroBusinessYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Edinburgh', 'Stuttgart')) %>% arrange(desc(total))
EuroBusinessYearOnYearGrid <- EuroBusinessYearOnYearGrid %>% filter(!category %in% redundantCat) %>% filter(total > 200) %>% select(total)
## Adding missing grouping variables: `city`, `category`
ggplot(EuroBusinessYearOnYearGrid, aes(x = city, y = category, label = `total`)) +
geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) +
geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "European Cities Cuisine Preferneces", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:
- What stands out for Scotland is Gastropubs and for Stuttgart is Beer Gardens.
- Edinburgh seems to have a large presence of Indian cuisine which is in line with Curry influence in UK.
Comparing the continents through representative cities
## Plot cities
AcrossBusinessYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Phoenix', 'Toronto', 'Stuttgart')) %>% arrange(desc(total))
AcrossBusinessYearOnYearGrid <- AcrossBusinessYearOnYearGrid %>% filter(!category %in% redundantCat) %>% filter(total > 400) %>% select(total)
## Adding missing grouping variables: `city`, `category`
ggplot(AcrossBusinessYearOnYearGrid, aes(x = city, y = category, label = `total`)) +
geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) +
geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "Comparative Analysis", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:
What is most striking here is Toronto’s amazing diversity of cuisines, as we noted above as well.
New Age and Trends Food in various cities
We want to see how cities fare for various new trends.
newAgeFood <- c('Asian Fusion', 'Gastropubs', 'Salad', 'Vegan', 'Vegetarian', 'Wine Bars', 'Cafes', 'Juice Bars & Smoothies', 'Tea Rooms', 'Sushi Bars')
AllBusinessYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Charoltte', 'Toronto', 'Edinburgh', 'Phoenix', 'Montréal', 'Pittsburgh', 'Stuttgart')) %>% arrange(desc(total))
AllBusinessYearOnYearGrid <- AllBusinessYearOnYearGrid %>% filter(category %in% newAgeFood)
AllBusinessYearOnYearGrid <- AllBusinessYearOnYearGrid %>% filter(!category %in% redundantCat) %>% filter(total > 0) %>% select(total)
## Adding missing grouping variables: `city`, `category`
ggplot(AllBusinessYearOnYearGrid, aes(x = city, y = category, label = `total`)) +
geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) +
geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "Trends and Latest", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:
- The Vegan and Vegetarian culture is already mature in Montreal, Toronto and Phoenix, but is yet to catch up in Edinburgh and Stuttgart. Same can be said for Salad too.
- There seems no place for Juice bars in Stuttgart. Is it because of strong beer culture?
Pubs and Bars Across the cities
nightlife <- c('Bars', 'Gastropubs', 'Beer', 'Wine Bars', 'Pubs', 'Beer Garden', 'Nightlife', 'Sports Bars')
NightlifeYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Charoltte', 'Toronto', 'Edinburgh', 'Phoenix', 'Montréal', 'Pittsburgh', 'Stuttgart')) %>% arrange(desc(total))
NightlifeYearOnYearGrid <- NightlifeYearOnYearGrid %>% filter(category %in% nightlife) %>% select(total)
## Adding missing grouping variables: `city`, `category`
ggplot(NightlifeYearOnYearGrid, aes(x = city, y = category, label = `total`)) +
geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) +
geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "Nightlife and Alcohol across cities", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:
- The surprising thing here is simple similarity of all places. Bars and Pubs seems a category which remains unaffected by location.
- Only thing to note is Stuttgart has replaced Sports Bars with Beer gardens.