Idea of this project is to study the relationship between New York City restaurant inspection data to Zagat rating data. As Zagat scoring is based on food ratings, appearance and service, CIty score is based on Sanitation and Health.We would like to investigate if Zagat best rated restaurants has low City Inspection score.
Study of the correlation between New York City Inspection score and Zagat scores and Cost of the meals reveals that the inspection score doesn’t good correlation with Zagat score and Cost of the meals with a drink and tip. Distribution of the Inspection score against Cost and Grade reveals that some restaurants with high Zagat rating (High cost) have both Grade A and Grade B, hence it is concluded that with Inspection score we cannot decide whether the restaurant will have high Zagat rating or not.
library(rjson)
library(XML)
library(RCurl)
library(tidyr)
library(dplyr)
library(ggplot2)
library(ggmap)
library(GGally)
library(knitr)
library(rmarkdown)
library(grid)
New_York_City_Res_Insp <- read.csv("DOHMH_New_York_City_Restaurant_Inspection_Results.csv",na.strings= c(""," ","NA"))
As part of this study we are considering 2015 data from both City Inspection and Zagat ratings
city_inspection<-New_York_City_Res_Insp[grep("2015", New_York_City_Res_Insp$INSPECTION.DATE), ]
Zagat has two files for New York.(i) Top 100 new york restaurants (ii) Top 100 Downton and Broklyn restaurants. We have considered two files to cover maximum restaurants across New York.
zagat_url="https://www.zagat.com/best-restaurants/new-york"
SOURCE <- getURL(zagat_url)
PARSED <- htmlParse(SOURCE)
restaurant=xpathSApply(PARSED, "//p[@class='title']",xmlValue)
score_food=xpathSApply(PARSED, "//li[@class='score food']",xmlValue)
score_decor=xpathSApply(PARSED, "//li[@class='score decor']",xmlValue)[2:101]
score_service=xpathSApply(PARSED, "//li[@class='score service']",xmlValue)[2:101]
cost=xpathSApply(PARSED, "//li[@class='score cost']",xmlValue)[2:101]
Nyu_Downtown_url="https://www.zagat.com/best-restaurants/new-york/downtown-and-brooklyn"
SOURCE1 <- getURL(Nyu_Downtown_url)
PARSED1 <- htmlParse(SOURCE1)
restaurant1=xpathSApply(PARSED1, "//p[@class='title']",xmlValue)
score_food1=xpathSApply(PARSED1, "//li[@class='score food']",xmlValue)
score_decor1=xpathSApply(PARSED1, "//li[@class='score decor']",xmlValue)[2:101]
score_service1=xpathSApply(PARSED1, "//li[@class='score service']",xmlValue)[2:101]
cost1=xpathSApply(PARSED1, "//li[@class='score cost']",xmlValue)[2:101]
As we have two file, we will create two data tables and load data in corresponding table, then will create a single data table by merging two data tables using rbind.
zagat_nyc100 <- cbind(restaurant, score_food, score_decor,score_service, cost)
dtown_top_rest=cbind(restaurant1, score_food1, score_decor1,score_service1, cost1)
write.table(zagat_nyc100,file="zagat_nyc_top100.csv",sep=",",row.names=F)
write.table(dtown_top_rest,file="zagat_nyc_dtown100.csv",sep=",",row.names=F)
names(dtown_top_rest) <- names(zagat_nyc100)
nyc_rest=rbind(zagat_nyc100, dtown_top_rest)
nyc_rest <- data.frame(nyc_rest, stringsAsFactors = FALSE)
city_inspection <- city_inspection[,c(2,3,4,5,6,8,9,10,11,12,14)]
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
city_inspection$STREET <- trim(city_inspection$STREET)
city_inspection$DBA <- trim(city_inspection$DBA)
city_inspection$BORO <- trim(city_inspection$BORO)
city_inspection$CUISINE.DESCRIPTION <- trim(city_inspection$CUISINE.DESCRIPTION)
city_inspection$SCORE <- trim(city_inspection$SCORE)
nyc_rest$restaurant <- trim(nyc_rest$restaurant)
nyc_rest$score_food <- trim(nyc_rest$score_food)
nyc_rest$score_decor <- trim(nyc_rest$score_decor)
nyc_rest$score_service <- trim(nyc_rest$score_service)
nyc_rest$cost <- trim(nyc_rest$cost)
nyc_rest$restaurant <- sub(pattern = "-", replacement = " ", nyc_rest$restaurant)
nyc_rest$cost <- gsub('^.', '', nyc_rest$cost)
nyc_rest$score_food <- as.numeric(nyc_rest$score_food)
nyc_rest$score_decor <- as.numeric(nyc_rest$score_decor)
nyc_rest$score_service <- as.numeric(nyc_rest$score_service)
nyc_rest$cost <- as.numeric(nyc_rest$cost)
city_inspection$SCORE <- as.numeric(city_inspection$SCORE)
city_inspection<-subset(city_inspection,!(is.na(city_inspection["SCORE"])))
nyc_rest <- subset(nyc_rest, !duplicated(nyc_rest$restaurant))
As grade data in city inspection file is inconsistent, we have calculated the Grade based on the score given by the City. City has the guidelines to allocate Grade as per the score. -If Inspection score bewteen 0 and 13, Grade is “A” -If score is between 14 and 27 then grade is “B” -If the score is more than or equal to 28 then grade is “C”
city_inspection$grade <- NA
city_inspection$grade[city_inspection$SCORE < 14] <- 'A'
city_inspection$grade[city_inspection$SCORE < 28 & city_inspection$SCORE >13] <- 'B'
city_inspection$grade[city_inspection$SCORE > 27] <- 'C'
Inspection can be done multiple times in the same year on a Restaurant, our interest for this project is to get the Latest Score for each Restaurant. As we merged two Zagat files, there may be a chance to have same restaurant twice, we will delete duplicate restaurants from Zagat data
city_inspection$INSPECTION.DATE<-as.Date(city_inspection$INSPECTION.DATE,format='%m/%d/%Y')
city_inspection_upd <- city_inspection %>% arrange(desc(DBA),desc(INSPECTION.DATE))
city_inspection <- subset(city_inspection_upd,!duplicated(city_inspection_upd$DBA))
nyc_rest <- subset(nyc_rest, !duplicated(nyc_rest$restaurant))
nyc_rest$restaurant <- toupper(nyc_rest$restaurant)
restaurant_nyc <- merge(nyc_rest, city_inspection, by.x="restaurant", by.y="DBA")
restaurant_nyc$mean_zagat_score <- round(rowMeans(restaurant_nyc[,2:4]),1)
summary(restaurant_nyc$SCORE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 10.00 10.19 12.00 23.00
aggregate(restaurant_nyc$SCORE,by = list(restaurant_nyc$grade), mean)
## Group.1 x
## 1 A 9.556818
## 2 B 19.500000
It appears that the inspection score(SCORE) has little impact on Zagat ratings and cost of the meal in the restaurant. Score_decor and Score_service has the highest correlation(0.798) with score_service.
ggpairs(restaurant_nyc, mapping = aes(color= grade), columns = c(2,3,4,5,15,17,16))
ggplot(restaurant_nyc, aes(x=restaurant_nyc$SCORE))+geom_histogram(binwidth=.5)
This shows that both Grades(A,B) are scattered between Low cost to hight cost, which means inspection score doesn’t have impact on Cost(in turn Zagat rating). Restuarants with high rating and cost have bad inspection score(Grade=B) as well as Good Grade(A).
qplot(restaurant_nyc$SCORE, restaurant_nyc$cost, data = restaurant_nyc, color= grade)