Through this project, I intend to analyze the restaurants in NYC to identify how they stack up in terms of quality and health safety. If time permits, I also want to check whether there is any correlation between the quality of these restaurants and the real estate valuations of their location/neighborhood.
In the current circumstances, with a public health emergency underway, restaurants play a key role in protecting public health by maintaining proper code and adhering to the standards set by the City. We will also review whether these violations are related to neighborhood factors, i.e., does being in a good neighborhood with high property valuations guarantee access to a better maintained restaurant?
To answer this, I intend to use the two datasets below:
1. Restaurant data and their violations: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
2. Neighborhood property sales: https://data.cityofnewyork.us/City-Government/DOF-Summary-of-Neighborhood-Sales-by-Neighborhood-/5ebm-myj7
The inspection dataset contains roughly 402K observations covering the years 2015 to 2020, with 26 variables per observation.
library(dplyr); library(ggplot2); library(stringr)
library(tm); library(wordcloud); library(RColorBrewer)
# fileEncoding = "UTF-8-BOM" strips the byte-order mark so the first column is read as CAMIS
restaurantRatings <- read.csv("data/Restaurant_ratings.csv", sep = ';', stringsAsFactors = F, fileEncoding = "UTF-8-BOM")
summary(restaurantRatings)
## CAMIS DBA BORO BUILDING
## Min. :30075445 Length:402061 Length:402061 Length:402061
## 1st Qu.:41403222 Class :character Class :character Class :character
## Median :50008729 Mode :character Mode :character Mode :character
## Mean :46211177
## 3rd Qu.:50059343
## Max. :50104547
##
## STREET ZIPCODE PHONE CUISINE.DESCRIPTION
## Length:402061 Length:402061 Length:402061 Length:402061
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## INSPECTION.DATE ACTION VIOLATION.CODE VIOLATION.DESCRIPTION
## Length:402061 Length:402061 Length:402061 Length:402061
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## CRITICAL.FLAG SCORE GRADE GRADE.DATE
## Length:402061 Min. : -1.00 Length:402061 Length:402061
## Class :character 1st Qu.: 11.00 Class :character Class :character
## Mode :character Median : 15.00 Mode :character Mode :character
## Mean : 20.41
## 3rd Qu.: 26.00
## Max. :164.00
## NA's :16759
## RECORD.DATE INSPECTION.TYPE Latitude Longitude
## Length:402061 Length:402061 Min. : 0.00 Min. :-74.25
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.99
## Mode :character Mode :character Median :40.73 Median :-73.96
## Mean :40.17 Mean :-72.93
## 3rd Qu.:40.76 3rd Qu.:-73.90
## Max. :40.91 Max. : 0.00
## NA's :422 NA's :422
## Community.Board Council.District Census.Tract BIN
## Min. :101.0 Min. : 1.00 Min. : 100 Min. :1000000
## 1st Qu.:105.0 1st Qu.: 4.00 1st Qu.: 7800 1st Qu.:1042861
## Median :301.0 Median :20.00 Median : 16200 Median :3007498
## Mean :248.4 Mean :20.02 Mean : 28746 Mean :2507528
## 3rd Qu.:401.0 3rd Qu.:34.00 3rd Qu.: 40300 3rd Qu.:4001558
## Max. :595.0 Max. :51.00 Max. :162100 Max. :5799501
## NA's :5904 NA's :5892 NA's :5892 NA's :7647
## BBL NTA
## Min. :1.000e+00 Length:402061
## 1st Qu.:1.010e+09 Class :character
## Median :3.002e+09 Mode :character
## Mean :2.402e+09
## 3rd Qu.:4.002e+09
## Max. :5.270e+09
## NA's :422
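A quick dimension check (a minimal sanity check, not part of the original pipeline) confirms the row and column counts quoted above.
dim(restaurantRatings)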
#propertySales <- read.csv("data/Neighborhood_sales.csv",sep = ';', stringsAsFactors = F)
#str(propertySales)
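Since the property-sales piece is still pending, below is a rough sketch of how it might start, kept commented out like the placeholder above. The column names NEIGHBORHOOD and AVERAGE.SALE.PRICE are assumptions about the DOF extract and would need to be verified against the actual file; linking to the restaurant data would also likely need a crosswalk between the NTA codes in the inspection file and the DOF neighborhood names.
# Sketch (not run): average sale price per neighborhood; column names below are assumptions
#propertySales <- read.csv("data/Neighborhood_sales.csv", sep = ';', stringsAsFactors = F)
#neighborhoodPrices <- propertySales %>% group_by(NEIGHBORHOOD) %>% summarise(avg.sale.price = mean(AVERAGE.SALE.PRICE, na.rm = TRUE))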
This data is provided by DOHMH and lists restaurants inspected from 2014 onward, but the dataset only supplies a good quantity of data starting in 2015. Since the data was acquired from different administrative systems, some missing data elements required a clean-up; for example, fields such as BORO and GRADE were sometimes incorrect or absent. The dataset also includes only partial data for 2020, so we exclude it to keep a full-year picture.
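Here is a minimal pre-cleaning check (not part of the original write-up) that tallies the problem fields directly:
# Blank or '0' borough values, the GRADE distribution (including blanks), and observations per inspection year
table(restaurantRatings$BORO == '' | restaurantRatings$BORO == '0')
table(restaurantRatings$GRADE, useNA = "ifany")
table(format(as.Date(restaurantRatings$INSPECTION.DATE, format = "%m/%d/%Y"), "%Y"))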
#Data Cleaning
restaurantRatings <- filter(restaurantRatings, (BORO != '0' & BORO != '') & (GRADE == 'A' | GRADE =='B' | GRADE =='C'))
#Data Wrangling
# Flag rows where a violation was recorded
restaurantRatings$hasViolation <- restaurantRatings$ACTION != 'No violations were recorded at the time of this inspection.'
# Extract the inspection year and keep only the full years 2016-2019
restaurantRatings$inspection.year <- format(as.Date(restaurantRatings$INSPECTION.DATE, format="%m/%d/%Y"),"%Y")
restaurantRatings <- filter(restaurantRatings, (inspection.year > 2015 & inspection.year < 2020))
# Add a per-cuisine observation count
restaurantRatings <- restaurantRatings %>% group_by(CUISINE.DESCRIPTION) %>% mutate(count=n())
Every restaurant is registered with the NYC Health Department under a cuisine, so here we are trying to understand the make-up of restaurants in NYC. The graph below displays the top 25 cuisines in NYC.
top25Cuisines <- unique(restaurantRatings[, c("CUISINE.DESCRIPTION", "count")])
top25Cuisines <- top25Cuisines[order(-top25Cuisines$count), ][1:25, ]
ggplot(top25Cuisines, aes(x=reorder(CUISINE.DESCRIPTION, -count), y=count, label=count)) + geom_bar(stat="identity", position = position_dodge()) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Using the ‘inspection.year’ variable we created earlier during data wrangling, we find the number of observations made each year across NYC restaurants. This helps us understand how much focus has been placed on inspecting food establishments.
ggplot(data=restaurantRatings, aes(x=inspection.year)) + geom_bar() + geom_text(stat='count', aes(label=..count..), vjust=-1)
The chart below shows the count of restaurants that had at least one violation in a given year. This helps us understand how many restaurants in total fail to get a clean pass.
restaurantRatingsWithViolations <- filter(restaurantRatings, hasViolation)
restaurantRatingsWithViolations <- unique(restaurantRatingsWithViolations[, c("DBA", "inspection.year")])
ggplot(restaurantRatingsWithViolations, aes(x=inspection.year)) + geom_bar() + geom_text(stat='count', aes(label=..count..), vjust=-1)
Whenever a violation is recorded against a restaurant, its current GRADE is also registered. So this chart shows what GRADE the restaurants were holding when a violation was observed.
violatingRest <- subset(restaurantRatings, restaurantRatings$hasViolation)
ggplot(violatingRest, aes(x=GRADE, fill=inspection.year)) + geom_bar(position = 'dodge')
df2019 <- filter(restaurantRatings, inspection.year=='2019')
The chart below displays the 2019 restaurants by GRADE and borough. This helps us understand which borough has the highest number of graded restaurants.
ggplot(df2019, aes(x=BORO, fill=GRADE)) + geom_bar(position = 'dodge') + theme(axis.text.x = element_text(angle = 45, hjust = 1))
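Because the boroughs differ in size, the raw counts can be complemented with the share of each grade within a borough. A minimal sketch using the already-created df2019 (not part of the original write-up):
# Share of each GRADE within each borough for 2019 inspections
df2019 %>% ungroup() %>% count(BORO, GRADE) %>% group_by(BORO) %>% mutate(share = n / sum(n))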
I tried to collect the worst restaurants/brands by violations recorded in 2019, and surprisingly many chain restaurants appeared at the top of the list. One possible reason could be the number of locations registered under the same name, but even so the count is quite high for corporate-managed restaurants; the sketch after the chart normalizes for the number of locations to check this.
df2019 <- df2019 %>% group_by(DBA) %>% mutate(violation.count=n())
worst25Restaurants <- unique(df2019[, c("DBA", "violation.count")])
worst25Restaurants <- worst25Restaurants[order(-worst25Restaurants$violation.count), ][1:25, ]
ggplot(worst25Restaurants, aes(x=reorder(DBA, -violation.count), y = violation.count, label=violation.count)) + geom_bar(stat="identity", position = position_dodge()) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(x = "Restaurant", y = "Violation Count")
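As a rough check on the chain-restaurant caveat above, the totals can be normalized by the number of distinct locations (CAMIS IDs) registered under each name. This is a minimal sketch, not part of the original analysis:
# Violations per distinct location for each restaurant name in 2019
df2019 %>%
  ungroup() %>%
  group_by(DBA) %>%
  summarise(violations = n(), locations = n_distinct(CAMIS), per.location = violations / locations) %>%
  arrange(desc(violations)) %>%
  head(25)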
Since each violation comes with a description, I collated them and found the most commonly occurring words using the wordcloud package.
# Collect the violation descriptions and strip non-printable characters
text <- restaurantRatings$VIOLATION.DESCRIPTION
text <- str_replace_all(text, "[^[:graph:]]", " ")
# Drop the frequent "and/or" connector so it does not dominate the cloud
text <- str_replace_all(text, "and/or", " ")
docs <- Corpus(VectorSource(text))
docs <- docs %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix),decreasing=TRUE)
words_df <- data.frame(word = names(words),freq=words)
set.seed(1234)
wordcloud(words = words_df$word, freq = words_df$freq, min.freq = 1, max.words=50, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
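For reference alongside the cloud, the same frequency table can be inspected directly:
# Ten most frequent words across all violation descriptions
head(words_df, 10)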