Nathan Lim
May 22, 2015
Many people think that ZAGAT scores are believable. I also often check ZAGAT score before I reserve restaurants. I was curious if the ZAGAT scores correlated to cleanliness or healthy food.
This project is to look relations between ZAGAT scores and NYC restaurant inspection scores. While ZAGAT scores are given by critics and focus on food decoration and service of restaurants, the inspection conducted by Heath department focuses on safety and health of food.
I would like to check if ZAGAT high rated restaurants are detected less by NYC restaurant inspection. While a high ZAGAT score is positive, a high score in the NYC inspection is negative.
library(rjson) #data scrape
library(XML) #data scrape
library(RCurl) #data scrape
library(tidyr) #cleaning data
library(dplyr) #cleaning data
library(ggplot2) #data analysis and visualization
library(ggmap) #data analysis and visualization
library(GGally) #data analysis and visualization
library(ggvis) #data analysis and visualization
library(knitr) #presentation“https://data.cityofnewyork.us”
insp <- read.csv(file="/users/seoungyoonlim/documents/cuny/is607final/nyc_restaurant_inspection_results.csv", header=TRUE)
str(insp)## 'data.frame': 493770 obs. of 18 variables:
## $ CAMIS : int 30075445 30075445 30075445 30075445 30075445 30075445 30075445 30075445 30075445 30075445 ...
## $ DBA : Factor w/ 20418 levels "","''W'' CAFE",..: 12327 12327 12327 12327 12327 12327 12327 12327 12327 12327 ...
## $ BORO : Factor w/ 6 levels "BRONX","BROOKLYN",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BUILDING : Factor w/ 9949 levels ""," ",..: 57 57 57 57 57 57 57 57 57 57 ...
## $ STREET : Factor w/ 4566 levels " 1 AVENUE ",..: 2982 2982 2982 2982 2982 2982 2982 2982 2982 2982 ...
## $ ZIPCODE : int 10462 10462 10462 10462 10462 10462 10462 10462 10462 10462 ...
## $ PHONE : Factor w/ 24124 levels "","__________",..: 21925 21925 21925 21925 21925 21925 21925 21925 21925 21925 ...
## $ CUISINE.DESCRIPTION : Factor w/ 85 levels "Afghan","African",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ INSPECTION.DATE : Factor w/ 1315 levels "01/01/1900","01/02/2013",..: 142 221 1039 931 931 831 831 831 831 831 ...
## $ ACTION : Factor w/ 6 levels "","Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addres"| __truncated__,..: 6 6 5 6 6 6 6 6 6 6 ...
## $ VIOLATION.CODE : Factor w/ 98 levels "","02A","02B",..: 44 63 1 30 32 21 30 42 44 52 ...
## $ VIOLATION.DESCRIPTION: Factor w/ 110 levels "","''''No Smoking\032 and/or 'Smoking Permitted\032 sign not conspicuously posted. Health warning not present on 'Smoking Permitte"| __truncated__,..: 40 69 1 23 27 45 23 77 40 25 ...
## $ CRITICAL.FLAG : Factor w/ 3 levels "Critical","Not Applicable",..: 1 3 2 1 1 1 1 1 1 3 ...
## $ SCORE : int 6 2 NA 6 6 32 32 32 32 32 ...
## $ GRADE : Factor w/ 7 levels "","A","B","C",..: 2 2 1 2 2 1 1 1 1 1 ...
## $ GRADE.DATE : Factor w/ 1207 levels "","01/02/2013",..: 128 201 1 855 855 1 1 1 1 1 ...
## $ RECORD.DATE : Factor w/ 1 level "05/12/2015": 1 1 1 1 1 1 1 1 1 1 ...
## $ INSPECTION.TYPE : Factor w/ 34 levels "","Administrative Miscellaneous / Compliance Inspection",..: 12 12 34 13 13 12 12 12 12 12 ...
the_url="https://www.zagat.com/best-restaurants/new-york"
SOURCE <- getURL(the_url)
PARSED <- htmlParse(SOURCE)
restaurant=xpathSApply(PARSED, "//p[@class='title']",xmlValue)
score_food=xpathSApply(PARSED, "//li[@class='score food']",xmlValue)
score_decor=xpathSApply(PARSED, "//li[@class='score decor']",xmlValue)[2:101]
score_service=xpathSApply(PARSED, "//li[@class='score service']",xmlValue)[2:101]
cost=xpathSApply(PARSED, "//li[@class='score cost']",xmlValue)[2:101]
nyc100 <- cbind(restaurant, score_food, score_decor, score_service, cost)
head(nyc100)## restaurant score_food score_decor score_service cost
## [1,] "Le Bernardin" "29" "28" "29" "$162"
## [2,] "Bouley" "29" "28" "28" "$132"
## [3,] "Jean-Georges" "28" "28" "28" "$166"
## [4,] "Gotham Bar & Grill" "28" "27" "27" "$86"
## [5,] "Eleven Madison Park" "28" "28" "28" "$220"
## [6,] "Daniel" "28" "28" "28" "$167"
write.table(nyc100,file="zagat_nyc100.csv",sep=",",row.names=F)
the_url2="https://www.zagat.com/best-restaurants/new-york/downtown-and-brooklyn"
SOURCE2 <- getURL(the_url2)
PARSED2 <- htmlParse(SOURCE2)
restaurant2=xpathSApply(PARSED2, "//p[@class='title']",xmlValue)
score_food2=xpathSApply(PARSED2, "//li[@class='score food']",xmlValue)
score_decor2=xpathSApply(PARSED2, "//li[@class='score decor']",xmlValue)[2:101]
score_service2=xpathSApply(PARSED2, "//li[@class='score service']",xmlValue)[2:101]
cost2=xpathSApply(PARSED2, "//li[@class='score cost']",xmlValue)[2:101]
dtown100=cbind(restaurant2, score_food2, score_decor2, score_service2, cost2)
write.table(dtown100,file="zagat_dtown100.csv",sep=",",row.names=F)
names(dtown100) <- names(nyc100)
z_nyc=rbind(nyc100, dtown100)names(insp)## [1] "CAMIS" "DBA"
## [3] "BORO" "BUILDING"
## [5] "STREET" "ZIPCODE"
## [7] "PHONE" "CUISINE.DESCRIPTION"
## [9] "INSPECTION.DATE" "ACTION"
## [11] "VIOLATION.CODE" "VIOLATION.DESCRIPTION"
## [13] "CRITICAL.FLAG" "SCORE"
## [15] "GRADE" "GRADE.DATE"
## [17] "RECORD.DATE" "INSPECTION.TYPE"
new_insp <- insp[,c(2,3,5,6,8:12,14)]
#convert numeric to Date type
new_insp$INSPECTION.DATE <-as.Date(new_insp$INSPECTION.DATE,format='%m/%d/%Y')
#delete space at the end
new_insp$DBA <- gsub(" $","", new_insp$DBA, perl=T)
new_insp$STREET <- gsub("^\\s+|\\s+$", "", new_insp$STREET)
new_insp$STREET <- gsub(" "," ", new_insp$STREET)
new_insp$ZIPCODE <- gsub(" $","", new_insp$ZIPCODE, perl=T)
new_insp$CUISINE.DESCRIPTION <- gsub(" $","", new_insp$CUISINE.DESCRIPTION, perl=T)A: Inspection score 0 ~ 13
B: Inspection score 14 ~ 27
C: Inspection score 28 ~ more
new_insp["grade"] <- NA
#grade the restarants
new_insp[which(new_insp$SCORE<=13),]$grade <- "A"
new_insp[which(new_insp$SCORE<=27 & new_insp$SCORE>13),]$grade <- "B"
new_insp[which(new_insp$SCORE>=28),]$grade <- "C"
arranged <- new_insp %>%
arrange(desc(DBA), desc(INSPECTION.DATE))
# to get the most recent insp SCORE
insp_sub <- subset(arranged,!duplicated(arranged$DBA)) z_nyc=data.frame(z_nyc, stringsAsFactors=FALSE)
# delete space at the end
z_nyc$restaurant <- gsub(" $","", z_nyc$restaurant, perl=T)
# change ` to '
z_nyc$restaurant <- gsub("’","'", z_nyc$restaurant)
# delete $ from cost column
z_nyc$cost <- sub(pattern = "\\$", replacement = "", z_nyc$cost)
#change data type to numeric
z_nyc$score_food <- as.numeric(z_nyc$score_food)
z_nyc$score_decor <- as.numeric(z_nyc$score_decor)
z_nyc$score_service <- as.numeric(z_nyc$score_service)
z_nyc$cost <- as.numeric(z_nyc$cost)
#clean restaurant names
z_nyc$restaurant<- sub(pattern = "-", replacement = " ", z_nyc$restaurant)
# delete dulpicate
z_nyc <- subset(z_nyc,!duplicated(z_nyc$restaurant))
z_nyc <- z_nyc %>% arrange(restaurant)
# add mean_score column
z_nyc$mean_score <- round(rowMeans(z_nyc[,2:4]),2)
z_nyc$restaurant <- toupper(z_nyc$restaurant)newdf <- merge(z_nyc, insp_sub, by.x="restaurant", by.y="DBA", all.x=TRUE)
names(newdf)## [1] "restaurant" "score_food"
## [3] "score_decor" "score_service"
## [5] "cost" "mean_score"
## [7] "BORO" "STREET"
## [9] "ZIPCODE" "CUISINE.DESCRIPTION"
## [11] "INSPECTION.DATE" "ACTION"
## [13] "VIOLATION.CODE" "VIOLATION.DESCRIPTION"
## [15] "SCORE" "grade"
data <- newdf[c(1:10,15,16)] # select columns for analysis
zagat <- data[!is.na(data$SCORE),] #delete NA rows
# change column names for convinient reason
names(zagat)[names(zagat)=="CUISINE.DESCRIPTION"] <- "cuisine"
names(zagat)[names(zagat)=="SCORE"] <- "insp_score"
head(zagat)## restaurant score_food score_decor score_service cost mean_score
## 1 15 EAST RESTAURANT 26 22 24 102 24.00
## 2 ABC KITCHEN 25 24 23 63 24.00
## 3 AI FIORI 26 25 25 93 25.33
## 5 ALTA 25 23 20 62 22.67
## 8 AQUAVIT 26 25 26 86 25.67
## 10 ATERA 26 24 26 228 25.33
## BORO STREET ZIPCODE cuisine insp_score grade
## 1 MANHATTAN EAST 15 STREET 10003 Japanese 21 B
## 2 MANHATTAN EAST 18 STREET 10003 American 12 A
## 3 MANHATTAN 5 AVENUE 10018 Italian 11 A
## 5 MANHATTAN WEST 10 STREET 10011 American 23 B
## 8 MANHATTAN EAST 55 STREET 10022 Scandinavian 9 A
## 10 MANHATTAN WORTH STREET 10013 American 9 A
summary(zagat$insp_score)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 11.00 12.62 13.00 36.00
summary(insp$SCORE)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.00 11.00 18.00 20.85 26.00 156.00 31892
aggregate(zagat$insp_score, by=list(zagat$grade), mean)## Group.1 x
## 1 A 10.01149
## 2 B 22.41176
## 3 C 32.66667
aggregate(zagat$mean_score, by=list(zagat$grade), mean)## Group.1 x
## 1 A 23.68552
## 2 B 23.29353
## 3 C 26.44333
zagat$geo <- geocode(paste(zagat$STREET, "New York", ","))
NYC <- get_map(location=c(lon=-73.95, lat=40.75), zoom = 11, maptype = "roadmap", source="google")
ggmap(NYC, extent = 'normal') +
geom_point(aes(x = geo$lon, y = geo$lat, colour = mean_score), data = zagat, alpha = .8)ggplot(data=zagat, aes(grade, mean_score))+geom_boxplot()table(zagat$grade)##
## A B C
## 87 17 3
Most of “ZAGAT Best Restaurants” get A or B. Only 3 restaurants out of 107 got C grades at Health department Inspection.
A: Inspection score 0 ~ 13
B: Inspection score 14 ~ 27
C: Inspection score 28 ~ more
#new york inspection score distribution
ggplot(insp, aes(SCORE)) +
xlab("Inspection score") +
ylab("Number of Restaurants") +
geom_histogram(binwidth=1, color="black", fill="darkgoldenrod1") +
geom_vline(aes(xintercept=mean(SCORE, na.rm=T)),
color="red", alpha=0.5, size=2)+
geom_vline(aes(xintercept=mean(zagat$insp_score, na.rm=T)),
color="green", alpha=0.5, size=2)+
ggtitle("NYC INSPECTION SCORES")+
theme(plot.title = element_text(size=15, face="bold"))+
annotate("text", x = 22, y = 24000, label = "20.85 \n NYC mean of inspection score", col="darkblue", size=4)+
annotate("text", x = 13, y = 35000, label = "12.62 \n ZAGAT inspection mean score",col="darkblue", size=4)#rate of zagat restaurants having insp score more than NYC mean.
sum(zagat$insp_score>20.85)/nrow(zagat) ## [1] 0.1588785
ggplot(zagat, aes(insp_score))+geom_histogram(color="black",fill="brown3",binwidth=1)+
xlab("Inspection score") +
ylab("Number of Restaurants") +
geom_vline(aes(xintercept=mean(insp_score, na.rm=T)),
color="green", alpha=0.5, size=2)+
geom_vline(aes(xintercept=mean(insp$SCORE, na.rm=T)),
color="red", alpha=0.5, size=2)+
geom_vline(aes(xintercept=median(zagat$insp_score, na.rm=T)),
color="blue", alpha=0.5, size=2)+
annotate("text", x = 21, y = 15, label = "20.85 \n NYC inspection mean score", col="darkblue", size=4)+
annotate("text", x = 13, y = 18, label = "12.62 \n ZAGAT inspection mean score",col="darkblue", size=4)+
annotate("text", x = 7, y = 20, label = "11.0 \n ZAGAT inspection median score",col="darkblue", size=4)+
ggtitle("ZAGAT Restaurants' Inspection Scores")+theme(plot.title = element_text(size=15, face="bold"))H0: ZAGAT Best restaurants’ Inspection score = General NYC restaurants’ Inspection score
H1: ZAGAT Best restaurants’ Inspection score != General NYC restaurants’ Inspection score
t.test(zagat$insp_score, insp$SCORE, paired=FALSE)##
## Welch Two Sample t-test
##
## data: zagat$insp_score and insp$SCORE
## t = -13.7993, df = 106.254, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.412079 -7.047351
## sample estimates:
## mean of x mean of y
## 12.61682 20.84654
The analysis shows that only 15.89% of restaurants in “ZAGAT Best Restaurants” list got higher inspection scores than the mean of general NYC restaurants. This means that about 84% of ZAGAT restaurants are cleaner or healthier than an average NYC restaurant. Also, by the T-test, we got a very small p-value (2.2e-16). So this difference is statistically significant.
pairs(~score_food+score_decor+score_service+cost+insp_score, data=zagat,
panel=panel.smooth, main="ZAGAT Scatterplot Matrix")ggpairs(data=zagat, columns = c(2:6,11), title = "ZAGAT Scatterplot Matrix",
upper=list(continuous='cor'),
lower=list(continuous = 'smooth'),
diag=list(continuous='density'),
params=c(colour="blue", alpha=6/10),
axisLabels='show'
)The two graphs show that scores of decoration, service, and food increase along with the cost increase. However the rate of increase slowed as the restaurant cost got higher. Especially, a correlation between decoration score and sevice score is noticible, at 0.812. The correlation between NYC inspection score and others (cost, scores of decoration, sevice, food) is not apparent.
zagat %>% ggvis(~mean_score, ~insp_score, opacity := 0.7) %>% layer_points() %>% layer_smooths(se=T, opacity := 0.5, span=0.5, stroke:="red")When we see the above graph, we cannot find relations between ZAGAT score and Inspection score. We cannot tell if among ZAGAT restaurants, higer ZAGAT rated restaurants are cleaner than lower ZAGAT rated restaurants, or vice versa.
zagat$grade = factor(zagat$grade)
zagat %>% ggvis(~mean_score, ~cost, fill= ~grade, opacity := 0.8) %>% layer_points() %>% layer_smooths(span=0.5, stroke:="red")This shows that ZAGAT scores(here ‘mean_score’) tend to get higher along with increase of its cost. But the inspenction score grades (A,B,C) are scattered between low to high cost and ZAGAT score, so it appears to be unrelated.
Based on the above analysis, restaurants included in the “ZAGAT Best-Restaurant” list tend to be cleaner and healthier than NYC general restaurants. Only 16.04% restaurants in the ZAGAT list have a higher inspection score than the mean of NYC restaurant inspection score. However, among the restaurants in the list, there is no specific correlation between inspection score and ZAGAT score. This means that even if a restaurant has high ZAGAT score, we can not tell that the restaurant is cleaner or healthier than other restaurants that have lower ZAGAT score. This makes sense since ZAGAT scores focus on taste, service, and decoration of restaurants, while Inspection score focuses on health laws. This anlysis also shows a noticible correlation(0.812) between decoration score and service score in ZAGAT ratings. However, we still cannot tell that nice interior restaurants actually have nice service. It is possible that the critic gave better service score under better atmoshpere. Because of the lack of cost data of non ZAGAT restaurant, I couldn’t figure out if ZAGAT listed restaurants are better per cost. Given that ZAGAT list restaurants tend to be expensive, the better inspection score can relate to this price difference.