Final Project

Objective

This project is driven by Yelp Academic Data Set, which is offered via Yelp Data Challenge. Yelp is one of the leading online recommendation sites for Restaurants and other local businesses, and it is very popular in North America and Europe.

Our objective with the project is two-fold-

Sentiment Analysis: Analysing if Yelp Reviews for a Restaurants correlate with Health Inspection Grades issued by the city.
Exploratory analysis: Explore the Yelp Dataset to spot

Locational concentrations of high quality restaurants
Variation in Cultural Trends and Preferences across Major Cities

Approach

The method used in this project in order to answer the primary questions was an OSEMN workflow approach. In this approach, there are five distinct stages: Obtain Data, Scrub Data, Explore, Model and iNterpret.

Introduction to the Data

The data set comprises of following files -

YelpFiles

It is a large dataset with details of about 4 million reviews, 1 million users and 144 thousand businesses.

YelpNumRecords

The schema of the json files is as follows -

YelpDataSchema

More details about the challenge, dataset an its schema can be seen at the Yelp Data Challenge website.

Maintaining the Data on Remote Server

The data was maintained on SQL Server in Microsoft Azure Cloud. The data upload was an extensive exercise which took multiple days due to sheer volume of records. The R packages used to help in this were RODBC and jsonlite. RODBC was chosen because RSQLServer has been deprecated by CRAN-R and is no longer actively supported.

Example of how connection was established -

library(RODBC)
connectionString <- "Driver=ODBC Driver 13 for SQL Server;Server=tcp:data607yelp1.database.windows.net,1433;Database=DATA607_FinalProject_Yelp1;Uid=data607@data607yelp1;Pwd=XXXXX;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
#conn <- odbcDriverConnect(connectionString)

The tables uploaded were, which correspnd to yelp files mentioned above -

YelpBusiness

YelpUser

YelpReview

Once the data was uploaded multiple tables were created on server side as well to help the analysis -

YelpLVBusiness - Restaurants in Las Vegas

YelpLVReview - Reviews for Restaurants in Las Vegas

YelpLVReviewTokens - Tokens from Sample Las Vegas Reviews to be used in Sentiment Analysis

YelpRestaurants - All Yelp Restaurants

YelpRestaurantCategory - Restaurants and their Category (~Cuisine) table

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidytext)
library(stringr)
library(ggplot2)
library(RODBC)
library(ggrepel)
library(knitr)
library(tidyr)
library(jsonlite)
library(leaflet)
library(dplyr)

Analysing the Health Inspection

Obtain Data The first step was to retrieve data regarding health inspection grades, restaurant/food/drink businesses and reviews/ratings of said businesses. The two sources were the Yelp Dataset Challenge data (can be found here: https://www.yelp.com/dataset_challenge) and, under the US City Open Data Census website, the Las Vegas Health Inspection Data, whose location was chosen for its larger data pool in Yelp reviews, (can be found here: http://us-city.census.okfn.org/entry/vegas/food-safety). The Health Inspection data was in a csv format, while the Yelp Dataset came in the form of json files (User, Business and Reviews), all of which were entered into an SQL database.

Scrub Data The primary goal when scrubbing was to first combine multiple datasets while retaining as much information as possible, to ensure that there were no duplicate entries, that entries were formatted correctly, that missing data was removed (to the capacity that it was allowed without making the data pool too small) and that headings and data entries were consistant for future analysis.

Explore Even though there were questions created before combining the datasets, many of our approaches and goals had to be readjusted when finally working with the combined data. After getting a sense of what kind of information we would be working with and the value of each variable in regard to our overall project, we were allowed to shift and narrow down our focus to determining the relationship between Yelp Reviews/Ratings (with an importance on hygiene) and Health Inspection Grades, in addition to analyzing what cuisines were more prominent across major cities within the U.S. and outside it.

Model In order to analyse the Yelp Reviews, a sentiment analysis was run, using a Health Lexicon to determine if the users’ opinion regarding the health status of the establishment matched that of the Health Inspectors. In addition, a regression model was used to determine the importance of the variables against the establishments’ overall rating in order to determine if any factors contributed more to a higher rating. Visual correlations were also made to enhance the conclusion formed by the regression.

Interpret The conclusions made were based on the models and visual analyses created during the “Model” portion of the data science workflow.

It should be noted that, due to the nature of the project, in addition to a vast number of unforseen occurences when handling the data, this workflow structure was not followed as rigidly as stated above. In times when a portion of a dataset that was deemed unecessary (and thus discarded earlier in the project) proved to be important, this workflow was used (within already another step of the primary analysis method timeline) until all necessary data was available for handling.

Yelp Data

###############  YELP   
connectionString = "driver=ODBC Driver 13 for SQL Server;Server=tcp:data607yelp1.database.windows.net,1433;Database=DATA607_FinalProject_Yelp1;Uid=data607@data607yelp1;Pwd=Ozymandias123;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
conn = odbcDriverConnect(connectionString)

# YelpUser
User = sqlQuery(conn, "SELECT * FROM YelpUser;")
# YelpReview
Review = sqlQuery(conn, "SELECT * FROM YelpLVReview;")
# YelpBusiness
LasVBus = sqlQuery(conn, "SELECT * FROM YelpLVBusiness;")

User = User[ c(1,3)]
Review = Review[ c(1:6)]
LasVBus = LasVBus[ c(1:11)]

kable(head(LasVBus))

business_id	name	neighborhood	address	city	state	postal_code	latitude	longitude	stars	review_count
3rkxTx8DoZSl7_FryhXCVQ	Jugoo Chinese Restaurant	University	4747 S Maryland Pkwy	Las Vegas	NV	89119	36.10413	-115.1377	3.5	34
xkVMIk_Vqh17f48ZQ_6b0w	Tom Colicchio’s Craftsteak	The Strip	3799 Las Vegas Blvd S	Las Vegas	NV	89109	36.10292	-115.1698	4.0	1915
n8XlV8E3BUBBF1gCe6mvEg	House of Vino	Centennial	8053 N Durango Dr, Ste 160	Las Vegas	NV	89131	36.30763	-115.2786	3.5	34
2BbFeotL85cIaBjSq1SWiA	Springs Cafe by Wolfgang Puck	Westside	333 S Valley View Blvd	Las Vegas	NV	89107	36.16810	-115.1922	2.5	23
w_A1qFeQ3vOl7o4kdBxPXA	Muse Asian Cuisine & Lounge	Chinatown	5115 W Spring Mountain Rd, Ste 107	Las Vegas	NV	89146	36.12604	-115.2114	3.5	14
XlUuMGMvkzy_QLAUTEe-fw	Dumpling King	Chinatown	5740 Spring Mountain Rd	Las Vegas	NV	89146	36.12608	-115.2198	3.5	120

Health Inspection Data

###############  INSPECTIONS   ###############  
LVInspection = read.csv("https://raw.githubusercontent.com/Galanopoulog/Final-Project/master/LVInspections.csv", header = T, stringsAsFactors = T)

levels(LVInspection$Inspection.Grade)[2] = "A"
levels(LVInspection$Inspection.Grade)[3] = "B"
levels(LVInspection$Current.Grade)[2] = "A"

kable(head(LVInspection))

Serial.Number	Permit.Number	Restaurant.Name	Location.Name	Category.Name	Address	City	State	Zip	Current.Demerits	Current.Grade	Date.Current	Inspection.Date	Inspection.Time	Employee.ID	Inspection.Type	Inspection.Demerits	Inspection.Grade	Inspection.Result	Violations	Record.Updated	Location.1
DA0001131	PR0000411	Black Mountain Clubhouse Restaurant	Black Mountain Clubhouse	Restaurant	500 Greenway Rd	Henderson	Nevada	89015-7464	3	A	10/29/2015 0:00	3/30/2010 0:00	3/30/2010 11:30	EE7000392	Routine Inspection	18	B	‘B’ Downgrade	49122731112	2/21/2013 22:26	(36.019531, -114.9739838)
DA0001132	PR0000411	Black Mountain Clubhouse Restaurant	Black Mountain Clubhouse	Restaurant	500 Greenway Rd	Henderson	Nevada	89015-7464	3	A	10/29/2015 0:00	4/13/2010 0:00	4/13/2010 11:15	EE7000392	Routine Inspection	1	A	Compliant	37	2/21/2013 22:26	(36.019531, -114.9739838)
DA0001133	PR0000411	Black Mountain Clubhouse Restaurant	Black Mountain Clubhouse	Restaurant	500 Greenway Rd	Henderson	Nevada	89015-7464	3	A	10/29/2015 0:00	3/10/2011 0:00	3/10/2011 10:40	EE7000392	Routine Inspection	9	A	Compliant	NA	2/21/2013 22:26	(36.019531, -114.9739838)
DA0001173	PR0000412	Black Mountain Clubhouse Bar	Black Mountain Clubhouse	Bar / Tavern	500 Greenway Rd	Henderson	Nevada	89015-7464	0	A	10/29/2015 0:00	3/30/2010 0:00	3/30/2010 12:45	EE7000392	Routine Inspection	3	A	Compliant	313637	2/21/2013 22:26	(36.019531, -114.9739838)
DA0001174	PR0000412	Black Mountain Clubhouse Bar	Black Mountain Clubhouse	Bar / Tavern	500 Greenway Rd	Henderson	Nevada	89015-7464	0	A	10/29/2015 0:00	3/10/2011 0:00	3/10/2011 11:40	EE7000392	Routine Inspection	4	A	Compliant	NA	2/21/2013 22:26	(36.019531, -114.9739838)
DA0001185	PR0000414	Black Mountain Country Club - Golf Cart #1	Black Mountain Clubhouse	Self-Service Food Truck	500 Greenway Rd	Henderson	Nevada	89015-7464	0	A	10/29/2015 0:00	3/30/2010 0:00	3/30/2010 12:40	EE7000392	Routine Inspection	0	A	Compliant	NA	2/21/2013 22:26	(36.019531, -114.9739838)

Combining Health Inspection data with LasVegas Business data

names(LasVBus)[2]= "Restaurant.Name"
names(LasVBus)[4]= "Address"

nondupl = LVInspection[!duplicated(LVInspection[,c('Restaurant.Name','Address')]),]
nondupl= nondupl[ c(3,6,10,11:13,17:22)]

Combined = merge(LasVBus, nondupl, by = c('Restaurant.Name','Address'))
Combined = Combined[,c(3,1,2,4,5,6,7,8,9,10,11,13,17,12,16,14,15,21,18:20)]
kable(head(Combined))

business_id	Restaurant.Name	Address	neighborhood	city	state	postal_code	latitude	longitude	stars	review_count	Current.Grade	Inspection.Grade	Current.Demerits	Inspection.Demerits	Date.Current	Inspection.Date	Record.Updated	Inspection.Result	Violations
0IbboCQAEQlROAv5PKKO_Q	Aloha Island Grill	2267 N Rampart Blvd	Summerlin	Las Vegas	NV	89128	36.20120	-115.2811	4.5	15	A	A	10	10	6/16/2010 0:00	6/16/2010 0:00	2/21/2013 22:26	Compliant	1.418223e+09
JgkfjCL1NfMBC_M_hax_3Q	Aloha Kitchen	4466 E Charleston Blvd	Sunrise	Las Vegas	NV	89104	36.15928	-115.0782	4.0	116	A	A	6	10	1/20/2016 0:00	2/22/2010 0:00	2/21/2013 22:26	Compliant	1.420000e+13
21sGRVR7HEs_t6PdB9tGMw	Angelina’s Pizzeria	5750 S Decatur Blvd		Las Vegas	NV	89118	36.08530	-115.2074	2.5	131	A	A	8	8	12/3/2015 0:00	2/12/2010 0:00	2/21/2013 22:26	Compliant	1.428314e+09
Oyo5CWthu2upxxaf69MfIg	Anna’s Garden	5740 Spring Mountain Rd	Chinatown	Las Vegas	NV	89146	36.12602	-115.2198	3.5	29	A	C	10	32	10/10/2011 0:00	2/2/2010 0:00	2/21/2013 22:26	‘C’ Downgrade	1.210000e+23
Rhz38lsptoxX9IWS641LFA	Annie’s Kitchen	1212 D St		Las Vegas	NV	89106	36.18430	-115.1470	4.0	10	A	A	0	5	11/3/2015 0:00	5/3/2010 0:00	2/21/2013 22:26	Compliant	7.353637e+06
7NvIgf7XW55OKa5ZU7gw8Q	Arandas Taqueria	2040 N Nellis Blvd	Sunrise	Las Vegas	NV	89115	36.19695	-115.0617	4.0	20	A	X	0	55	1/6/2015 0:00	3/10/2010 0:00	2/21/2013 22:26	Closed with Fees	4.710000e+51

Combining User data with Review data based on the Business data

UserReview = merge(User, Review, by = c('user_id'))
UserbyBusiness = semi_join(UserReview, Combined, by = "business_id")

Analyzing the Health Inspection data

# looking at grades and their demerit points
# Histograms
par(mfrow = c(1,2))
ggplot(LVInspection, aes(x=Inspection.Grade, y=Inspection.Demerits)) + geom_bar(stat="identity") + 
  labs(x="Inspection.Grade", y="Inspection.Demerits")

ggplot(LVInspection, aes(x=Current.Grade, y=Current.Demerits)) + geom_bar(stat="identity") + 
  labs(x="Current.Grade", y="Current.Demerits")

# Boxplots
par(mfrow = c(1,2))
boxplot(Inspection.Demerits ~ Inspection.Grade,data=LVInspection, main="Inspection Demerit Points by Grade",
        xlab="Inspection Grade", ylab="Inspection Demerits", ylim = c(0,120))

boxplot(Current.Demerits ~ Current.Grade,data=LVInspection, main="Demerit Points by Grade",
        xlab="Current Grade", ylab="Current Demerits",ylim = c(0,120))

We notice that during the Inspection, the majority of the business were classified as “A” grade, though it might have been expected to see a normal distribution in which, perhaps a “B” grade was more common. However, after the inspection, there was a dramatic shift in the grades in that the majority of businesses were upgraded to “A” grade, which may imply that after their grading, the businesses focused to reslove the issues that caused them demerit points. This seems more evident with the boxplots, where the upper outliers for demerits in each grades have radically decreased.

Adressing Questions

1. Relationship exists between Yelp ratings and Inspection Ratings

pairs(Combined[,c("stars","review_count","Current.Grade","Current.Demerits","Inspection.Grade","Inspection.Demerits")])

We see here that there doesn’t visually appear to exist a correlation between these variables. Nonetheless, we run a multiple regression.

# Regression
regstar = lm(stars ~ review_count + Current.Grade + Current.Demerits + Inspection.Grade + Inspection.Demerits, data = Combined)

summary(regstar)

## 
## Call:
## lm(formula = stars ~ review_count + Current.Grade + Current.Demerits + 
##     Inspection.Grade + Inspection.Demerits, data = Combined)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.89014 -0.39232  0.05882  0.51465  1.16622 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.4106715  0.7067370   4.826 3.38e-06 ***
## review_count         0.0007989  0.0003249   2.459   0.0151 *  
## Current.GradeB      -0.0262220  0.4404514  -0.060   0.9526    
## Current.GradeC       0.3585021  0.7468499   0.480   0.6319    
## Current.GradeO      -1.4592595  0.6681818  -2.184   0.0305 *  
## Current.GradeX      -1.5123761  0.9769126  -1.548   0.1237    
## Current.Demerits    -0.0095496  0.0157630  -0.606   0.5455    
## Inspection.GradeA   -0.0654856  0.6841899  -0.096   0.9239    
## Inspection.GradeB   -0.2742315  0.7018883  -0.391   0.6966    
## Inspection.GradeC   -0.1997069  0.7555434  -0.264   0.7919    
## Inspection.GradeX   -0.3885972  0.8807551  -0.441   0.6597    
## Inspection.Demerits  0.0158040  0.0130400   1.212   0.2274    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6506 on 151 degrees of freedom
## Multiple R-squared:  0.1506, Adjusted R-squared:  0.08868 
## F-statistic: 2.433 on 11 and 151 DF,  p-value: 0.008076

hist(regstar$residuals, main = "Regression Residuals")

# normal probability plot
qqnorm(regstar$residuals)
qqline(regstar$residuals)

# constant variability
plot(regstar$residuals ~ Combined$review_count, main = "Variability")
abline(h = 0, lty = 3)

This multiple regression is not fit to be run. The residuals histogram is somewhat left skewed, the data along the qq plot is off the line and there exists a pattern when testing the residuals for constand variability. The regression itself only primarily the “reviews_count” variable to have any significant effect on the “stars” variable. So, we can run another regression with only that variable to take into account.

regstarrev = lm(stars ~ review_count, data = Combined)

summary(regstarrev)

## 
## Call:
## lm(formula = stars ~ review_count, data = Combined)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8994 -0.4140  0.0768  0.5406  1.1117 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.3855991  0.0635339  53.288   <2e-16 ***
## review_count 0.0009172  0.0003159   2.904   0.0042 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6664 on 161 degrees of freedom
## Multiple R-squared:  0.04977,    Adjusted R-squared:  0.04387 
## F-statistic: 8.433 on 1 and 161 DF,  p-value: 0.004203

hist(regstarrev$residuals, main = "Regression Residuals")

# normal probability plot
qqnorm(regstarrev$residuals)
qqline(regstarrev$residuals)

# constant variability
plot(regstarrev$residuals ~ Combined$review_count, main = "Variability")
abline(h = 0, lty = 3)

This, too, was a failed attempt at explaining the star ratings. The R-squared of this regression showed that very little variability was explained this and the previous model.

Sentiment analysis

UserbyBusiness = UserbyBusiness[!duplicated(UserbyBusiness), ]
ReviewsGrades = merge(UserbyBusiness, Combined, by = "business_id")

groupings <- ReviewsGrades %>% select(business_id, Inspection.Grade) %>% filter(Inspection.Grade=='C') %>% group_by(business_id, Inspection.Grade) %>% summarise()
groupings$business_id

##  [1] 9f68hfHK6gNyHWmt9guiJw dblGQtHAzNy1ACis1m2shw dFmaYj3_cZpmEaL4WSOkPg
##  [4] eRQwgSQD1KyCmJk_PAGOXQ FvC3W9E6rXUYx-cTxbnUiA iRFaq249LVEFGeXLzPULnQ
##  [7] lOQPeguczAueJnLmDUTfGg MXC9pwIxovWUc9yu1F8OxA Oyo5CWthu2upxxaf69MfIg
## [10] PSRZaGGxXmOmabL2si8pKw q7LG_2s2urLJcD2IsxgTWg Qc-bULSBAyVh_qaKRWkF9A
## [13] So132GP_uy3XbGs0KNyzyw suQugTjSMYMIrK5fAabC_Q t3AWGr757V4Fvcun6sDsuw
## [16] T5UBAJt5Kod_wQGbH6m9hA w0ArxcJ0XsyK1ZwrIf3gpA zfNKdsuT7bWWN-f3v5Rj_g
## 5238 Levels: --q7kSBRb0vWC8lSkXFByA ... Zzl4ffV9kFY3yOSSgyRUBQ

#%>% rapply(function(x)length(unique(x))) 

# Sentiment analysis
reviews = as.list(as.character(ReviewsGrades$text))


analysis <- sqlQuery(conn, "Select * from YelpLVReviewTokens")

# using lexicon to score words
#lexicon = data.frame(get_sentiments("afinn"))
lexicon = read.csv("https://raw.githubusercontent.com/jmehta753/MSDA_JM/master/DATA607/Final%20Project/HygieneLexicon.csv")
af.sent = data.frame(merge(analysis, lexicon, by.x="word", by.y="word") %>% 
                      group_by(review)%>% 
                       mutate(mean.af.sent = mean(score)))

names(af.sent)[2] = "review_id"


# combining scores to business data
review.analysis = merge(af.sent, ReviewsGrades, by = "review_id")

Plots

MEAN SCORE BY STARS

These plots are separated into two sections: Star scores each user alloted to the restaurant and the overall score Yelp gave the restaurant based on the users’ reviews. It will be interesting to see what the connection is between these two variables and the mean sentiment score, in order to determine if, firstly, the sentiment analysis was somewhat accurate, and, second, to see if the overall rating of the business matches the sentiment scores.

# Users' stars for the restaurant
par(mfrow = c(1,2))
boxplot(mean.af.sent~ stars.x, data=review.analysis[!duplicated(review.analysis), ], 
        main="Mean Sentiment Score by Users' Star Scores",xlab="Users' Stars", ylab="Mean Sentiment Score")
abline(h = 0, lty = 3)
# Overall star rating for the restaurant
boxplot(mean.af.sent~ stars.y, data=review.analysis[!duplicated(review.analysis), ], 
        main="Mean Sentiment Score by Restaurants' Star Scores",xlab="Restaurant Stars", ylab="Mean Sentiment Score")
abline(h = 0, lty = 3)

MEAN SCORE BY GRADE

The boxplot is based on Current Inspection Grade. What would be interesting to note here is if there is a relationship between the inspection grades and the review sentiments (meaning, if the users’ opinion match or are related in some way to the inspection grades).

# Current Grade
boxplot(mean.af.sent~ Current.Grade, data=review.analysis[!duplicated(review.analysis), ], 
        main="Mean Sentiment Score by Current Grade",xlab="Current Grade", ylab="Mean Sentiment Score")

# Inspection Grade
boxplot(mean.af.sent~ Inspection.Grade, data=review.analysis[!duplicated(review.analysis), ], 
        main="Mean Sentiment Score by Inspection Grade",xlab="Current Grade", ylab="Mean Sentiment Score")

abline(h = 0, lty = 3)

STARS BY GRADE

boxplot(stars.x ~ Current.Grade, ReviewsGrades, 
        main="Users' Star Scores by Current Inspection Grades",xlab="Current Grade", ylab="Stars")
abline(h = 0, lty = 3)

WORDS BY USERS’ STAR SCORE

Plotting the words that the afinn lexicon determined as positive/negative

The detail to pay attention to here is not only the words that 5-star reviews earned, but whether the words for higher star scores are related to health/cleanliness. At first glance, the majority of the words have more to do with taste than hygiene.

# Plots of the words scores vs the users' star score for each restaurant.
ggplot(review.analysis, aes(score, stars.x, group = score)) + 
  geom_point(aes(colour = factor(score))) +
  geom_label_repel(aes(label=word)) +
  xlab("Word Score") +
  ylab("Users' Stars Scores")

Location Analysis through Visualization

We are interested in knowing how the restaurants are concentrated in Las Vegas and if we can spot where some of its great restaurants are located through visual analysis. It is well known that Las Vegas strip is home to multiple high quality restaurants but are there any other surprises?

Extract the Yelp Restaurants data from database

vegasFrame <- sqlQuery(conn, "select * from YelpLVBusiness")
head(vegasFrame)

##              business_id                          name neighborhood
## 1 3rkxTx8DoZSl7_FryhXCVQ      Jugoo Chinese Restaurant   University
## 2 xkVMIk_Vqh17f48ZQ_6b0w    Tom Colicchio's Craftsteak    The Strip
## 3 n8XlV8E3BUBBF1gCe6mvEg                 House of Vino   Centennial
## 4 2BbFeotL85cIaBjSq1SWiA Springs Cafe by Wolfgang Puck     Westside
## 5 w_A1qFeQ3vOl7o4kdBxPXA   Muse Asian Cuisine & Lounge    Chinatown
## 6 XlUuMGMvkzy_QLAUTEe-fw                 Dumpling King    Chinatown
##                              address      city state postal_code latitude
## 1               4747 S Maryland Pkwy Las Vegas    NV       89119 36.10413
## 2              3799 Las Vegas Blvd S Las Vegas    NV       89109 36.10292
## 3         8053 N Durango Dr, Ste 160 Las Vegas    NV       89131 36.30763
## 4             333 S Valley View Blvd Las Vegas    NV       89107 36.16810
## 5 5115 W Spring Mountain Rd, Ste 107 Las Vegas    NV       89146 36.12604
## 6            5740 Spring Mountain Rd Las Vegas    NV       89146 36.12608
##   longitude stars review_count is_open
## 1 -115.1377   3.5           34       0
## 2 -115.1698   4.0         1915       1
## 3 -115.2786   3.5           34       0
## 4 -115.1922   2.5           23       0
## 5 -115.2114   3.5           14       0
## 6 -115.2198   3.5          120       0
##                                                                                                                                                                                                                                                        attributes
## 1 c("Alcohol: none", "Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}", "BusinessAcceptsCreditCards: True", "BusinessParking: {'garage'
## 2 c("Alcohol: full_bar", "Ambience: {'romantic': False, 'intimate': False, 'classy': True, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False}", "BYOB: False", "BYOBCorkage: no", "BikeParking: False", "Bu
## 3 c("Alcohol: full_bar", "Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': True}", "BikeParking: True", "BusinessAcceptsBitcoin: False", "Busi
## 4 c("Alcohol: beer_and_wine", "BusinessAcceptsCreditCards: True", "BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}", "GoodForKids: True", "GoodForMeal: {'dessert': False, 'latenight': False, 'lunch': True
## 5 c("Alcohol: full_bar", "Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False}", "BikeParking: True", "BusinessAcceptsBitcoin: False", "Bus
## 6 c("Alcohol: none", "Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': True, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False}", "BikeParking: False", "BusinessAcceptsBitcoin: False", "Busines
##                                                                                                     categories
## 1                                                                                  c("Restaurants", "Chinese")
## 2                        c("Steakhouses", "Restaurants", "Cheesesteaks", "Food", "American (New)", "Desserts")
## 3                                       c("Restaurants", "Beer", "Wine & Spirits", "Delis", "Food", "Grocery")
## 4                                                                           c("Restaurants", "American (New)")
## 5 c("Karaoke", "Coffee & Tea", "Bars", "Restaurants", "Lounges", "Asian Fusion", "Nightlife", "Food", "Cafes")
## 6                                                  c("Restaurants", "Szechuan", "Chinese", "Halal", "Dim Sum")
##                                                                                                                                                     hours
## 1             c("Monday 10:30-21:30", "Tuesday 10:30-21:30", "Wednesday 10:30-21:30", "Thursday 10:30-21:30", "Friday 10:30-21:30", "Sunday 10:30-21:30")
## 2 c("Monday 17:0-22:0", "Tuesday 17:0-22:0", "Wednesday 17:0-22:0", "Thursday 17:0-22:0", "Friday 17:0-22:30", "Saturday 17:0-22:30", "Sunday 17:0-22:0")
## 3          c("Monday 9:0-21:0", "Tuesday 9:0-21:0", "Wednesday 9:0-21:0", "Thursday 9:0-21:0", "Friday 9:0-21:0", "Saturday 9:0-21:0", "Sunday 9:0-21:0")
## 4   c("Monday 11:0-15:0", "Tuesday 11:0-15:0", "Wednesday 11:0-15:0", "Thursday 11:0-15:0", "Friday 11:0-15:0", "Saturday 11:0-18:0", "Sunday 11:0-18:0")
## 5          c("Monday 19:0-4:0", "Tuesday 19:0-4:0", "Wednesday 19:0-4:0", "Thursday 19:0-4:0", "Friday 13:0-4:0", "Saturday 13:0-4:0", "Sunday 13:0-3:0")
## 6    c("Monday 11:0-0:0", "Tuesday 11:0-22:0", "Wednesday 11:0-22:0", "Thursday 11:0-22:0", "Friday 11:0-22:0", "Saturday 11:0-22:0", "Sunday 11:0-22:0")
##       type
## 1 business
## 2 business
## 3 business
## 4 business
## 5 business
## 6 business

vegasFrame %>% select(stars) %>% group_by(stars)

## Source: local data frame [5,433 x 1]
## Groups: stars [9]
## 
##    stars
## *  <dbl>
## 1    3.5
## 2    4.0
## 3    3.5
## 4    2.5
## 5    3.5
## 6    3.5
## 7    3.5
## 8    2.5
## 9    3.5
## 10   4.0
## # ... with 5,423 more rows

content <- function(name, address, neighbourhood) {
    paste(sep = "<br/>",
    "<b>", name,"</b>",
    address, neighbourhood
  )
}

Plot all the restaruants on the map to view the concentrations

colorDomain <- c(0.0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5)
pal <- colorNumeric("YlGn", colorDomain, na.color = "transparent")
leaflet(vegasFrame) %>% addTiles() %>% setView(lng = -115.1398, 
    lat = 36.1699, zoom = 12) %>% addCircles(lng = ~longitude, lat = ~latitude, 
    weight = 0, radius = ~exp(stars*1.2) , fillOpacity = 0.5,  
    color = ~pal(stars),
    popup = ~content(name, address, neighborhood))

Expectedly there are multiple areas of concentrations within las vegas.

Plot only the high ranked restaurants on the map

As our above map was bit crowded we use a different approach to plot the map by clustering the spots together.

vegasTopFrame = vegasFrame %>% filter(stars>=4.0)
leaflet(vegasTopFrame) %>% addTiles() %>% setView(lng = -115.1398, 
    lat = 36.1699, zoom = 12) %>% addCircleMarkers(lng = ~longitude, lat = ~latitude, 
    popup = ~content(name, address, neighborhood), clusterOptions = markerClusterOptions())

Conclusion :

While the Strip expectedly has highest number of high quality restaurants (at 268), it came as a surprise that Chinatown (with 166) and Downtown (with 157) has very large number of such places too.

Analysing the cultural trends and preferences across various cities

The approach taken here is to measure the popularity of various types of cuisines in major cities by counting the number of reviews for restaurants belonging to such category.

First lets load the restaurant and category data from database

YelpRestaurants <- sqlQuery(conn, "Select * from YelpRestaurants")
YelpRestaurantsCategory <- sqlQuery(conn, "Select * from YelpRestaurantCategory")
names(YelpRestaurants)

##  [1] "business_id"  "name"         "neighborhood" "address"     
##  [5] "city"         "state"        "postal_code"  "latitude"    
##  [9] "longitude"    "stars"        "review_count" "is_open"     
## [13] "attributes"   "categories"   "hours"        "type"

names(YelpRestaurantsCategory)

## [1] "business_id" "category"

Classify Reviews by Year

AllReviews = sqlQuery(conn, "Select business_id, stars, date from YelpReview")
ReviewsByYear <- AllReviews %>% select(business_id, stars, date) %>% mutate(year = substring(date, 1,4)) 
ReviewsByYear <- ReviewsByYear %>% group_by(business_id, year) %>% summarise(meanStar = mean(stars), count=n())
kable(head(ReviewsByYear))

business_id	year	meanStar	count
–6MefnULPED_I942VcFNA	2008	5.000000	1
–6MefnULPED_I942VcFNA	2010	2.333333	3
–6MefnULPED_I942VcFNA	2011	3.000000	2
–6MefnULPED_I942VcFNA	2012	3.000000	1
–6MefnULPED_I942VcFNA	2013	3.000000	1
–6MefnULPED_I942VcFNA	2014	3.000000	3

Combine Reviews with City (from Restaurants data) and Category data

We also compute the total number of reviews in each city-category combination

BusinessByCity <- YelpRestaurants %>% select(business_id, city)
BusinessByCityAndCategory <- inner_join(BusinessByCity, YelpRestaurantsCategory, by.x="business_id", by.y="business_id")

## Joining, by = "business_id"

## Warning in inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector

BusinessYearOnYearGrid <- inner_join(BusinessByCityAndCategory, ReviewsByYear, by.x="business_id", by.y="business_id") %>% group_by(city, category, year) %>% summarise(count=n())

## Joining, by = "business_id"

## Warning in inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## character vector and factor, coercing into character vector

BusinessYearOnYearGrid1 <- BusinessYearOnYearGrid %>% spread(key=year, value=count)
BusinessYearOnYearGrid1[is.na(BusinessYearOnYearGrid1)] <- 0
BusinessYearOnYearGrid1 <- BusinessYearOnYearGrid1 %>% group_by(city, category) %>% mutate(total=`2004`+`2005` + `2006`+`2007` +`2008`+`2009` + `2010`+`2011` + `2012`+`2013` + `2014`+`2015` + `2016`+`2017`)
BusinessYearOnYearGrid1<- BusinessYearOnYearGrid1 %>% arrange(desc(total))
kable(head(BusinessYearOnYearGrid1))

city	category	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	total
Las Vegas	American (Traditional)	14	58	148	210	257	340	381	402	434	481	504	512	318	4059
Las Vegas	Fast Food	5	15	54	94	130	277	378	420	529	591	655	669	222	4039
Las Vegas	Nightlife	14	60	137	175	217	290	321	354	419	477	514	552	342	3872
Toronto	Nightlife	0	0	0	149	223	279	331	390	444	513	576	609	244	3758
Las Vegas	Bars	12	56	131	165	203	272	304	336	397	449	490	524	330	3669
Toronto	Bars	0	0	0	139	206	262	310	368	423	494	557	592	239	3590

Spotting Cultural Trends

Comparing American cities

We don’t consider Las Vegas here as it is not a representative city due to its unusual concentration of restaurants and atypical culture.

## Categories which we think are redundant
redundantCat <- c('Bars', 'Nightlife', 'Event Planning & Services', 'Delis', 'Caterers', 'Pubs')

## Plot American cities
AmerBusinessYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Charlotte','Pittsburgh', 'Phoenix')) %>% arrange(desc(total))

# Filter out redundant categories and keep only where number of reviews is atleast 300
AmerBusinessYearOnYearGrid <- AmerBusinessYearOnYearGrid %>% filter(!category %in%  redundantCat) %>% filter(total > 300) %>% select(total)

## Adding missing grouping variables: `city`, `category`

ggplot(AmerBusinessYearOnYearGrid, aes(x = city, y = category, label = `total`)) + 
    geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) + 
    geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "American Cities Cuisine Preferneces", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:

Phoenix appears to be more diverse than Pittsburgh and Charoltte. Unsurprisingly the Mexican count is much higher there due to its high Latino influence.
Charolette has a number of reviews for Southern cuisine, which is completely lacking in other two. Surprsingly its Cafe culture is yet to develop here.
Pittsburgh is least diverse of all, and surprisingly not enough liking for Steakhouses is seen.

1. Comparing Canadian cities

## Plot Canadian cities
CanadianBusinessYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Toronto', 'Montréal')) %>% arrange(desc(total))

# Filter out redundant categories and keep only where number of reviews is atleast 300 
CanadianBusinessYearOnYearGrid <- CanadianBusinessYearOnYearGrid %>% filter(!category %in%  redundantCat) %>% filter(total > 400) %>% select(total)

## Adding missing grouping variables: `city`, `category`

CanadianBusinessYearOnYearGrid

## Source: local data frame [54 x 3]
## Groups: city, category [54]
## 
##        city           category total
##      <fctr>             <fctr> <dbl>
## 1   Toronto     Canadian (New)  2758
## 2   Toronto            Italian  2433
## 3   Toronto Breakfast & Brunch  2419
## 4   Toronto         Sandwiches  2025
## 5   Toronto           Japanese  2005
## 6   Toronto            Chinese  1986
## 7   Toronto              Pizza  1820
## 8   Toronto              Cafes  1655
## 9   Toronto       Coffee & Tea  1575
## 10 Montréal             French  1560
## # ... with 44 more rows

ggplot(CanadianBusinessYearOnYearGrid, aes(x = city, y = category, label = `total`)) + 
    geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) + 
    geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "Canadian Cities Cuisine Preferneces", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:

Toronto comes out to be a signifcantly more diverse city. So in this way its hold its reputation of being a world city, much like New york.
Montreal’s absence of anything but french and traditional food is very stark. For examples, we don’t see Middle Eastern, Mexican or Indian counts.
Only cuisine which stands out as non-traditional for Montreal is Sushi bars. Is it because it is trendy and healthy which appeals to Montreal’s europe-like tastes?

## Plot Euro cities
EuroBusinessYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Edinburgh', 'Stuttgart')) %>% arrange(desc(total))
EuroBusinessYearOnYearGrid <- EuroBusinessYearOnYearGrid %>% filter(!category %in%  redundantCat) %>% filter(total > 200) %>% select(total)

## Adding missing grouping variables: `city`, `category`

ggplot(EuroBusinessYearOnYearGrid, aes(x = city, y = category, label = `total`)) + 
    geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) + 
    geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "European Cities Cuisine Preferneces", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:

What stands out for Scotland is Gastropubs and for Stuttgart is Beer Gardens.
Edinburgh seems to have a large presence of Indian cuisine which is in line with Curry influence in UK.

Comparing the continents through representative cities

## Plot cities
AcrossBusinessYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Phoenix', 'Toronto', 'Stuttgart')) %>% arrange(desc(total))

AcrossBusinessYearOnYearGrid <- AcrossBusinessYearOnYearGrid %>% filter(!category %in%  redundantCat) %>% filter(total > 400) %>% select(total)

## Adding missing grouping variables: `city`, `category`

ggplot(AcrossBusinessYearOnYearGrid, aes(x = city, y = category, label = `total`)) + 
    geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) + 
    geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "Comparative Analysis", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:

What is most striking here is Toronto’s amazing diversity of cuisines, as we noted above as well.

New Age and Trends Food in various cities

We want to see how cities fare for various new trends.

newAgeFood <- c('Asian Fusion', 'Gastropubs', 'Salad', 'Vegan', 'Vegetarian', 'Wine Bars', 'Cafes', 'Juice Bars & Smoothies', 'Tea Rooms', 'Sushi Bars')

AllBusinessYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Charoltte', 'Toronto', 'Edinburgh', 'Phoenix', 'Montréal', 'Pittsburgh', 'Stuttgart')) %>% arrange(desc(total))
AllBusinessYearOnYearGrid <- AllBusinessYearOnYearGrid %>% filter(category %in% newAgeFood)
AllBusinessYearOnYearGrid <- AllBusinessYearOnYearGrid %>% filter(!category %in%  redundantCat) %>% filter(total > 0) %>% select(total)

## Adding missing grouping variables: `city`, `category`

ggplot(AllBusinessYearOnYearGrid, aes(x = city, y = category, label = `total`)) + 
    geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) + 
    geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "Trends and Latest", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:

The Vegan and Vegetarian culture is already mature in Montreal, Toronto and Phoenix, but is yet to catch up in Edinburgh and Stuttgart. Same can be said for Salad too.
There seems no place for Juice bars in Stuttgart. Is it because of strong beer culture?

Pubs and Bars Across the cities

nightlife <- c('Bars', 'Gastropubs', 'Beer', 'Wine Bars', 'Pubs', 'Beer Garden', 'Nightlife', 'Sports Bars')

NightlifeYearOnYearGrid <- BusinessYearOnYearGrid1 %>% filter(city %in% c('Charoltte', 'Toronto', 'Edinburgh', 'Phoenix', 'Montréal', 'Pittsburgh', 'Stuttgart')) %>% arrange(desc(total))

NightlifeYearOnYearGrid <- NightlifeYearOnYearGrid %>% filter(category %in% nightlife) %>% select(total)

## Adding missing grouping variables: `city`, `category`

ggplot(NightlifeYearOnYearGrid, aes(x = city, y = category, label = `total`)) + 
    geom_point(aes(size = `total` * 2, colour = category, alpha = 0.05)) + 
    geom_text(hjust = 0.4, size = 3) + scale_size(range = c(1, 30), guide = "none") + labs(title = "Nightlife and Alcohol across cities", x = "City", y = "Category") + theme(plot.margin = unit(c(0.02,0.02,0.02,0.02),"cm"))

Conclusions:

The surprising thing here is simple similarity of all places. Bars and Pubs seems a category which remains unaffected by location.
Only thing to note is Stuttgart has replaced Sports Bars with Beer gardens.

Citations:

Yelp Data Set

Final Project

Georgia Galanopoulos & Jaideep Mehta

May 14, 2017

Objective

Approach

Introduction to the Data

Maintaining the Data on Remote Server

Analysing the Health Inspection

Yelp Data

Health Inspection Data

Combining Health Inspection data with LasVegas Business data

Combining User data with Review data based on the Business data

Analyzing the Health Inspection data

Adressing Questions

1. Relationship exists between Yelp ratings and Inspection Ratings

Sentiment analysis

Plots

Location Analysis through Visualization

Extract the Yelp Restaurants data from database

Plot all the restaruants on the map to view the concentrations

Plot only the high ranked restaurants on the map

Conclusion :

Analysing the cultural trends and preferences across various cities

First lets load the restaurant and category data from database

Classify Reviews by Year

Combine Reviews with City (from Restaurants data) and Category data

Spotting Cultural Trends

Comparing American cities

Conclusions:

1. Comparing Canadian cities

Conclusions:

Conclusions:

Comparing the continents through representative cities

Conclusions:

New Age and Trends Food in various cities

Conclusions:

Pubs and Bars Across the cities

Conclusions:

Citations: