As more and more people in the cities eat out more often, health and hygiene of the public has come to the center stage. It is important that the food that is served to the patron as well as the hygiene standards of the restaurants themselves are top-notch. In an attempt to address this concern NYC’s Department of Health and Mental Hygiene (DOHMH) conducts unannounced inspections of restaurants every year. Since 2010, NYC restaurants have to prominently post their Grade (e.g. A/B/C) which empowers diners with decision-making information and incentivizes establishments to improve their hygiene.
This analysis looks at various aspects health and hygiene issues. Some of the questions answered are: * How do different boroughs stack up against one another- are there more issues in some places? * What are some of the key issues that are reported by the officers? * Has the inspections had any positive impact? * What are some of the top restaurants that have a particular issue?
The data is available at : https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59
library(dplyr) ## used to manupulate data
library(tibble) ## used to create tibbles
library(stringr) ## using regx and other string functions
library(tidyverse)#group of packages used to summarise and visualize data
library(tidyr) ## functions which can be used to make data tidy
library(purrr) ## used for iteration functionns like map
library(RSocrata) ## used for reading socrata API datasets
library(ggplot2) ## used for plotting graphs
library(lubridate) # used for date manipulations
library(tm) # text Mining Package in R
library(SnowballC) # text Mining Package in R
library(wordcloud) # Package to Build word Clould
library(RColorBrewer) # Package to add color
The Codebook is available in the attachments section at : https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59/about
The variables available are:
| Variable Name | Description | Data Type | Example | # of Missing Values |
|---|---|---|---|---|
| CAMIS | Unique Identifier for each restaurant | int | 30075445, | 0 |
| DBA | Restaurant Name | factor | AUNTIE ANNE’S PRETZELS | 0 |
| BORO | Borough (Area) | factor | BROOKLYN, BRONX | 0 |
| BUILDING | Building Name or Number | character | TERMINAL 1 | 39 |
| STREET | Street address | factor | W 154TH ST | 0 |
| ZIPCODE | Reataurant ZIP code | int (converted to char) | 11692 | 0 |
| PHONE | Restaurant Phone Number | character | 9738089525 | 1 |
| CUISINE.DESCRIPTION | Type of Cuisine | factor | Chinese, Bakery | 0 |
| INSPECTION.DATE | Inspection Date | Posixlt (converted to data) | Date Range: 2011-08-23 to 2016-11-16 (There are some dates from 1900 that are dropped) | 0 |
| ACTION | Action taken by DOHMH | factor | Establishment Closed by DOHMH, Violations were cited in the following area(s) | 0 |
| VIOLATION.CODE | DOHMH Violation Code | factor | 02A, 02B | 119 |
| VIOLATION.DESCRIPTION | Violation Description as per the officer | factor (converted to character) | Wash hands sign not posted at hand wash facility, Accurate thermometer not provided in refrigerated or hot holding equipment | 119 |
| CRITICAL.FLAG | Critical or Not | factor | Critical, Not Critical, Not Applicable | 0 |
| SCORE | Score Awarded by DOHMH | int | 116, 117 | 0 |
| GRADE | Grade Awarded by DOHMH | factor | A, B, Not Yet Graded | 0 |
| GRADE.DATE | Date when grade was awarded | Posixlt (converted to date) | 2011-10-07, 2016-11-16 | 200885 |
| RECORD.DATE | Date when record was created/updated | Posixlt (converted to date) | 2016-11-18 | 0 |
| INSPECTION.TYPE | Type of inspection (replaced by 2 new columns) | character | Administrative Miscellaneous / Initial Inspection, Cycle Inspection / Compliance Inspection | 0 |
| INSPECTION.CLASS | Similar to Type of inspection | character | Administrative Miscellaneous / Initial Inspection, Cycle Inspection / Compliance Inspection | 0 |
| INSPECTION.PHASE | Phase of the inspection | character | Initial Inspection, Re-Inspection | 0 |
The data is imported using the RSocrata package. The below code is used import data.
url<-"https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59"
nyrestaurant<-read.socrata(url)
# write the file to the desktop for further use
#readr::write_rds(nyrestaurant,"nycrestaurant.rds")
# read local copy - comment out the code above when using this
#nyrestaurant<-readr::read_rds("nycrestaurant.rds")
The GRADE.DATE, INSPECTION.DATE and RECORD.DATE were in Posixlt format which could not be read into a tibble. I converted these to a date format using a function
See code below:
# function to change POSIXlt to date
date_conv<-function(x){
if (identical(unlist(class(x))[1],"POSIXlt")){
x<-as.Date(x)
} else(x)
}
## created a tible from data frame
nyc_rest_tib<-as_tibble(map(nyrestaurant,date_conv))
Some additional data manipulations and cleaning was applied to the dataset
A Sample from the dataset is shown below:
head(nyc_rest_tib[1:20,])
The GRADEDATE variable has 200885 NAs. This is over 50% missing values in that column. However, these observations are not deleted. I will use filters to remove these values on the fly when needed.
ACTION: This variable stores values of the action taken by the inspection authorities. Thease values were recoded to make them simple for further use. Here is a list of values: violations, reopened, closed, reclosed, no violations
BORO: This contains the names of the 5 major NY boroughs. There are some missing data here. MANHATTAN, BROOKLYN, STATEN ISLAND, QUEENS, BRONX, Missing
VIOLATION.DESCRIPTION: This is a free text column that contains the actual commments from the inspection officer.
GRADE: The GRADE Column contains the grade that was awarded buring the inspection. It is important to note that grades can change over the year. There is also a possiblilty of grades being changed during re-inspection. Some restaurants have not been inspected/graded yet or the grading is pending. Sample values are: A, , B, C, P, Z, Not Yet Graded
#Unique inspections results for 2016
inspections = unique(select(nyc_rest_tib, CAMIS, BORO, ZIPCODE, CUISINE.DESCRIPTION, INSPECTION.DATE, ACTION, SCORE, GRADE))
inspections=filter(inspections,lubridate::year(INSPECTION.DATE)=='2016' & (GRADE =='A'| GRADE=='B'| GRADE=='C') & ACTION !='closed' & BORO !='Missing')
# since there can be multiple inspections. select only the latest for each restaurant
latest = merge( aggregate(INSPECTION.DATE ~ CAMIS, inspections, max), inspections) #Unique restaurants and their latest score/grade
# unique list of restaurants and their grade for 2016
restaurants = unique(select(latest, CAMIS, BORO, ZIPCODE, CUISINE.DESCRIPTION, SCORE, GRADE))
ggplot(data=restaurants, aes(x=reorder(BORO, desc(BORO)))) +
geom_bar(aes(fill=GRADE), position='dodge') +
labs(title='Grades in each NY Borough for 2016', x='Borough', y='# of Restaurants') + scale_y_continuous(limits=c(0,9000), breaks=seq(0,10000,1000)) +
theme_bw()
This graph shows the number of New York restaurants by grades in each borough. It is important to note that this analysis is only for 2016 and excludes new or un-graded restaurants. It also excludes Restaurants for which location data was missing.
It can be seen that Manhattan has the highest number of Grade A restaunats. While that is not surprising is that the number of Grade B and C are very low. Compared to that restaurants in the suburbs of Queens and Brooklyn that have a much higher proportion of Grade B for the smaller number of Grade A’s
# filter 2016 data for valid restaurants in a vector
textdata<-nyc_rest_tib%>%
filter(lubridate::year(INSPECTION.DATE)=='2016' & (GRADE =='A'| GRADE=='B'| GRADE=='C') & ACTION !='closed' & BORO !='Missing')%>%
select(VIOLATION.DESCRIPTION)
# Convert the vector into a corpus
text <- Corpus(VectorSource(textdata$VIOLATION.DESCRIPTION))
# Convert to lower case
text <- tm_map(text, tolower)
#tdm <- TermDocumentMatrix(text) # turn it into a term document matrix
text1 <- tm_map(text, removeWords, stopwords('english')) # remove stop words
text1 <- tm_map(text1, removePunctuation) # remove punctuations
text1 <- tm_map(text1, PlainTextDocument) # make plain text
text1 <- tm_map(text1, removeNumbers) # remove numbers
text1 <- tm_map(text1, stripWhitespace) # remove white spaces
dtm <- TermDocumentMatrix(text1)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10) # Top 10 words
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 200,
max.words=70, random.order=FALSE, rot.per=0.5,
colors=brewer.pal(8, "Dark2")
)
The word cloud shows some of the key words in the inspection reports. Some of the key words that pop out from the analysis are flies, vermin. These words are indicative of the fact that restaurants are not taking adequante steps to protect food and their premises from these sources of infection and diseases. Some other words mice, filth and contact are some of the other words which again clearly shows the presence of rodents in the premises. Words like Non Food, equipment and surface also hint at other potential sources of problems that have more to do with the physical location itself.
# Filter all data for Grade ABC and add a year variable
nyc_rest_modified<-nyc_rest_tib%>%
mutate(Year=format(INSPECTION.DATE,"%Y"))%>%
filter((GRADE =='A'| GRADE=='B'| GRADE=='C') & BORO !='Missing')
## Aggregate the final data for 2013-2016
## Older data was not included since there was not a lot of records for first few years of program
finalData<-rbind(restaurants_2013,restaurants_2014,restaurants_2015,restaurants_2016)
Agg_data<-finalData %>%
unique() %>%
mutate(Gradechar = as.character(GRADE)) %>%
group_by(Year,Gradechar) %>%
summarise(count=n())
# Plot the data
ggplot(data=Agg_data, aes(x=Year, y=count, group = Gradechar, colour = Gradechar)) +
geom_line(size=2) +
geom_point( size=2, shape=21, fill="white")+
labs(title='Change in # of Restaurants by grade') +
theme_bw()
Overall, it appears that there has been a positive impact of the inspections. the number of Grade A restaurants has consistent increased since 2013 while Grade B and Grade C have remained the same. It will be interesting to see if this is an overall trend or confined to one or two geographies. However, it has not been done for now.
top_restaurant<-function(df,x,y){
dt<-df%>%
filter(format(INSPECTION.DATE, "%Y") == x)%>%
filter(str_detect(VIOLATION.DESCRIPTION, regex("y", ignore_case = TRUE)))%>%
group_by(CAMIS,DBA)%>%
summarise(count=n())%>%
arrange(desc(count))%>%
head(10)
print(dt)
m<-ggplot(data=dt,aes(x = reorder(DBA, (count)), y = count, fill=DBA)) +
geom_bar(stat = "identity",width = .25)+
geom_text(aes(label= count), na.rm = TRUE, hjust = 0.3, vjust = -0.8)+
ggtitle("Restaurants with most violations of contamination") +
# scale_x_continuous(limits=c(0,50))
coord_flip()+
theme_bw()+
ylab("# of Violations")+ xlab("Restaurant")
print(m)
}
# function call
top_restaurant(nyc_rest_tib,2016,"contamination")
## Source: local data frame [10 x 3]
## Groups: CAMIS [10]
##
## CAMIS DBA count
## <int> <chr> <int>
## 1 50044942 FOGON LATINO BAR RESTAURANT 37
## 2 50045602 POLKA DOT 32
## 3 50043402 ROYAL FRIED CHICKEN 29
## 4 41429788 EL AGUILA 27
## 5 41683816 MAX BAKERY & RESTAURANT 25
## 6 50010805 CAFE AU LEE 25
## 7 40861669 NEW VICTORY RESTAURANT 24
## 8 40965177 LA GATA GOLOSA #2 24
## 9 41250025 YUNG SUN SEAFOOD RESTAURANT 24
## 10 41711975 SZECHUAN GOURMET 24
This is an extension of the word cloud that was done earlier. Since the cloud pointed out many potential sources of contamination like flies and vermin, I decided to look at the top 20 restaurants that have had these violations in 2016. From the analysis we can see that FOGON LATINO BAR RESTAURANT has had the mopst violations about contamination in 2016 followed by Polka Dot.
The analysis done above provides a high level view of the hygine at New York restaurants. With the charts shown above one can determine what are some of the most critical issues plaguing restauramts in New York and where.
This analysis was done in R and made extensive use of R packages for manipulations and plotting. wordcloud, ggplot, dplyr and tidyverse are some of the critical packages used in this analysis.