Objective:

Idea of this project is to study the relationship between New York City restaurant inspection data to Zagat rating data. As Zagat scoring is based on food ratings, appearance and service, CIty score is based on Sanitation and Health.We would like to investigate if Zagat best rated restaurants has low City Inspection score.

Conclusion:

Study of the correlation between New York City Inspection score and Zagat scores and Cost of the meals reveals that the inspection score doesn’t good correlation with Zagat score and Cost of the meals with a drink and tip. Distribution of the Inspection score against Cost and Grade reveals that some restaurants with high Zagat rating (High cost) have both Grade A and Grade B, hence it is concluded that with Inspection score we cannot decide whether the restaurant will have high Zagat rating or not.

Packages needed for the project

library(rjson)    
library(XML)      
library(RCurl)    
library(tidyr)    
library(dplyr)    
library(ggplot2)  
library(ggmap)    
library(GGally)
library(knitr)    
library(rmarkdown)
library(grid)

Load New York inspection data into R

New_York_City_Res_Insp <- read.csv("DOHMH_New_York_City_Restaurant_Inspection_Results.csv",na.strings= c(""," ","NA"))

Filter Inspection data data to load only 2015 data

As part of this study we are considering 2015 data from both City Inspection and Zagat ratings

city_inspection<-New_York_City_Res_Insp[grep("2015", New_York_City_Res_Insp$INSPECTION.DATE), ]

Import Zagat data

Zagat has two files for New York.(i) Top 100 new york restaurants (ii) Top 100 Downton and Broklyn restaurants. We have considered two files to cover maximum restaurants across New York.

Scrapping top 100 New York restaurants

zagat_url="https://www.zagat.com/best-restaurants/new-york"

SOURCE <- getURL(zagat_url)
PARSED <- htmlParse(SOURCE)
restaurant=xpathSApply(PARSED, "//p[@class='title']",xmlValue)
score_food=xpathSApply(PARSED, "//li[@class='score food']",xmlValue)
score_decor=xpathSApply(PARSED, "//li[@class='score decor']",xmlValue)[2:101]
score_service=xpathSApply(PARSED, "//li[@class='score service']",xmlValue)[2:101]
cost=xpathSApply(PARSED, "//li[@class='score cost']",xmlValue)[2:101]

Scrapping top 100 Brooklyn restaurants

Nyu_Downtown_url="https://www.zagat.com/best-restaurants/new-york/downtown-and-brooklyn"
SOURCE1 <- getURL(Nyu_Downtown_url)
PARSED1 <- htmlParse(SOURCE1)
restaurant1=xpathSApply(PARSED1, "//p[@class='title']",xmlValue)
score_food1=xpathSApply(PARSED1, "//li[@class='score food']",xmlValue)
score_decor1=xpathSApply(PARSED1, "//li[@class='score decor']",xmlValue)[2:101]
score_service1=xpathSApply(PARSED1, "//li[@class='score service']",xmlValue)[2:101]
cost1=xpathSApply(PARSED1, "//li[@class='score cost']",xmlValue)[2:101]

Load Zagat data into tables

As we have two file, we will create two data tables and load data in corresponding table, then will create a single data table by merging two data tables using rbind.

zagat_nyc100 <- cbind(restaurant, score_food, score_decor,score_service, cost)
dtown_top_rest=cbind(restaurant1, score_food1, score_decor1,score_service1, cost1)
write.table(zagat_nyc100,file="zagat_nyc_top100.csv",sep=",",row.names=F)
write.table(dtown_top_rest,file="zagat_nyc_dtown100.csv",sep=",",row.names=F)
names(dtown_top_rest) <- names(zagat_nyc100) 
nyc_rest=rbind(zagat_nyc100, dtown_top_rest)
nyc_rest <- data.frame(nyc_rest, stringsAsFactors = FALSE)

Data Cleaning tasks

Select subset of columns from City Inspection data

city_inspection <- city_inspection[,c(2,3,4,5,6,8,9,10,11,12,14)]

Deleting whitespace from start and end of the string

trim <- function (x) gsub("^\\s+|\\s+$", "", x)
city_inspection$STREET <- trim(city_inspection$STREET)
city_inspection$DBA <- trim(city_inspection$DBA)
city_inspection$BORO <- trim(city_inspection$BORO)
city_inspection$CUISINE.DESCRIPTION <- trim(city_inspection$CUISINE.DESCRIPTION)
city_inspection$SCORE <- trim(city_inspection$SCORE)
nyc_rest$restaurant <- trim(nyc_rest$restaurant)
nyc_rest$score_food <- trim(nyc_rest$score_food)
nyc_rest$score_decor <- trim(nyc_rest$score_decor)
nyc_rest$score_service <- trim(nyc_rest$score_service)
nyc_rest$cost <- trim(nyc_rest$cost)
nyc_rest$restaurant <- sub(pattern = "-", replacement = " ", nyc_rest$restaurant)

Remove ‘$’ from Cost column of zagat data

nyc_rest$cost <-   gsub('^.', '', nyc_rest$cost)

Convert all scores to numeric

nyc_rest$score_food <- as.numeric(nyc_rest$score_food)
nyc_rest$score_decor <- as.numeric(nyc_rest$score_decor) 
nyc_rest$score_service <- as.numeric(nyc_rest$score_service)
nyc_rest$cost <- as.numeric(nyc_rest$cost)
city_inspection$SCORE <- as.numeric(city_inspection$SCORE)

Remove all the rows with no Score from City Inspection data

city_inspection<-subset(city_inspection,!(is.na(city_inspection["SCORE"])))

delete duplicate records from zagat data

nyc_rest <- subset(nyc_rest, !duplicated(nyc_rest$restaurant))

Data Processing

Add “GRADE” column to City Inspection data

As grade data in city inspection file is inconsistent, we have calculated the Grade based on the score given by the City. City has the guidelines to allocate Grade as per the score. -If Inspection score bewteen 0 and 13, Grade is “A” -If score is between 14 and 27 then grade is “B” -If the score is more than or equal to 28 then grade is “C”

city_inspection$grade <- NA
city_inspection$grade[city_inspection$SCORE < 14] <- 'A'
city_inspection$grade[city_inspection$SCORE < 28 & city_inspection$SCORE >13] <- 'B'
city_inspection$grade[city_inspection$SCORE > 27] <- 'C'

Remove duplicate data from Inspection and Zagat ratings

Inspection can be done multiple times in the same year on a Restaurant, our interest for this project is to get the Latest Score for each Restaurant. As we merged two Zagat files, there may be a chance to have same restaurant twice, we will delete duplicate restaurants from Zagat data

city_inspection$INSPECTION.DATE<-as.Date(city_inspection$INSPECTION.DATE,format='%m/%d/%Y')
city_inspection_upd <- city_inspection %>%                    arrange(desc(DBA),desc(INSPECTION.DATE))
city_inspection <- subset(city_inspection_upd,!duplicated(city_inspection_upd$DBA))
nyc_rest <- subset(nyc_rest, !duplicated(nyc_rest$restaurant))

Keep column(Restaurant name) name same in Zagat file and City Inspection file as Restaurant name will be Key to merge both the files

nyc_rest$restaurant <- toupper(nyc_rest$restaurant)

Join Zagat data and City inspection data on restaurant name as key

restaurant_nyc <- merge(nyc_rest, city_inspection, by.x="restaurant", by.y="DBA")

Create a new column to add Mean score of Zagat ratings

restaurant_nyc$mean_zagat_score <- round(rowMeans(restaurant_nyc[,2:4]),1)

Analysis and Presentation

Summary stats of inspection score

summary(restaurant_nyc$SCORE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   10.00   10.19   12.00   23.00

Mean of the inspection score by each Grade

aggregate(restaurant_nyc$SCORE,by = list(restaurant_nyc$grade), mean)
##   Group.1         x
## 1       A  9.556818
## 2       B 19.500000

scatterplot matrix of all scores and grade

It appears that the inspection score(SCORE) has little impact on Zagat ratings and cost of the meal in the restaurant. Score_decor and Score_service has the highest correlation(0.798) with score_service.

ggpairs(restaurant_nyc, mapping = aes(color= grade), columns = c(2,3,4,5,15,17,16))

Distribution of inspection score accross resturants

ggplot(restaurant_nyc, aes(x=restaurant_nyc$SCORE))+geom_histogram(binwidth=.5)

Distribution of the Inspection score against Meal Cost and Grades

This shows that both Grades(A,B) are scattered between Low cost to hight cost, which means inspection score doesn’t have impact on Cost(in turn Zagat rating). Restuarants with high rating and cost have bad inspection score(Grade=B) as well as Good Grade(A).

qplot(restaurant_nyc$SCORE, restaurant_nyc$cost, data = restaurant_nyc, color= grade)