Objective

An analysis of inspection of New York City restaurants since 2011.

As more and more people in the cities eat out more often, health and hygiene of the public has come to the center stage. It is important that the food that is served to the patron as well as the hygiene standards of the restaurants themselves are top-notch. In an attempt to address this concern NYC’s Department of Health and Mental Hygiene (DOHMH) conducts unannounced inspections of restaurants every year. Since 2010, NYC restaurants have to prominently post their Grade (e.g. A/B/C) which empowers diners with decision-making information and incentivizes establishments to improve their hygiene.

The Analysis

This analysis looks at various aspects health and hygiene issues. Some of the questions answered are: * How do different boroughs stack up against one another- are there more issues in some places? * What are some of the key issues that are reported by the officers? * Has the inspections had any positive impact? * What are some of the top restaurants that have a particular issue?

High Level summary

  • Manhattan, despite having the largest number of restaurants and high population and customer base does well in terms of hygiene standards
  • Overall, number of Grade A restaurants have gone up over the years.

Data Source

The data is available at : https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59

Packages Required

library(dplyr) ## used to manupulate data
library(tibble) ## used to create tibbles
library(stringr) ## using regx and other string functions
library(tidyverse)#group of packages used to summarise and visualize data
library(tidyr) ## functions which can be used to make data tidy
library(purrr) ## used for iteration functionns like map
library(RSocrata) ## used for reading socrata API datasets
library(ggplot2)  ## used for plotting graphs
library(lubridate) # used for date manipulations
library(tm) # text Mining Package in R
library(SnowballC) # text Mining Package in R
library(wordcloud) # Package to Build word Clould
library(RColorBrewer) # Package to add color 

Data Description

The Codebook is available in the attachments section at : https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59/about

The variables available are:

Variable Name Description Data Type Example # of Missing Values
CAMIS Unique Identifier for each restaurant int 30075445, 0
DBA Restaurant Name factor AUNTIE ANNE’S PRETZELS 0
BORO Borough (Area) factor BROOKLYN, BRONX 0
BUILDING Building Name or Number character TERMINAL 1 39
STREET Street address factor W 154TH ST 0
ZIPCODE Reataurant ZIP code int (converted to char) 11692 0
PHONE Restaurant Phone Number character 9738089525 1
CUISINE.DESCRIPTION Type of Cuisine factor Chinese, Bakery 0
INSPECTION.DATE Inspection Date Posixlt (converted to data) Date Range: 2011-08-23 to 2016-11-16 (There are some dates from 1900 that are dropped) 0
ACTION Action taken by DOHMH factor Establishment Closed by DOHMH, Violations were cited in the following area(s) 0
VIOLATION.CODE DOHMH Violation Code factor 02A, 02B 119
VIOLATION.DESCRIPTION Violation Description as per the officer factor (converted to character) Wash hands sign not posted at hand wash facility, Accurate thermometer not provided in refrigerated or hot holding equipment 119
CRITICAL.FLAG Critical or Not factor Critical, Not Critical, Not Applicable 0
SCORE Score Awarded by DOHMH int 116, 117 0
GRADE Grade Awarded by DOHMH factor A, B, Not Yet Graded 0
GRADE.DATE Date when grade was awarded Posixlt (converted to date) 2011-10-07, 2016-11-16 200885
RECORD.DATE Date when record was created/updated Posixlt (converted to date) 2016-11-18 0
INSPECTION.TYPE Type of inspection (replaced by 2 new columns) character Administrative Miscellaneous / Initial Inspection, Cycle Inspection / Compliance Inspection 0
INSPECTION.CLASS Similar to Type of inspection character Administrative Miscellaneous / Initial Inspection, Cycle Inspection / Compliance Inspection 0
INSPECTION.PHASE Phase of the inspection character Initial Inspection, Re-Inspection 0

Importing data

The data is imported using the RSocrata package. The below code is used import data.

url<-"https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59"
nyrestaurant<-read.socrata(url)

# write the file to the desktop for further use
#readr::write_rds(nyrestaurant,"nycrestaurant.rds")

# read local copy - comment out the code above when using this
#nyrestaurant<-readr::read_rds("nycrestaurant.rds")

Cleaning data

The GRADE.DATE, INSPECTION.DATE and RECORD.DATE were in Posixlt format which could not be read into a tibble. I converted these to a date format using a function

See code below:

# function to change POSIXlt to date
date_conv<-function(x){
    if (identical(unlist(class(x))[1],"POSIXlt")){
        x<-as.Date(x)
    } else(x)
}

##  created a tible from data frame
nyc_rest_tib<-as_tibble(map(nyrestaurant,date_conv))

Some additional data manipulations and cleaning was applied to the dataset

A Sample from the dataset is shown below:

head(nyc_rest_tib[1:20,])

The GRADEDATE variable has 200885 NAs. This is over 50% missing values in that column. However, these observations are not deleted. I will use filters to remove these values on the fly when needed.

Summary of some of the key variables used in analysis

Exploratory Data Analysis

Analysis 1- Grades in each Borough for 2016

#Unique inspections results for 2016
inspections = unique(select(nyc_rest_tib, CAMIS, BORO, ZIPCODE, CUISINE.DESCRIPTION, INSPECTION.DATE, ACTION, SCORE, GRADE)) 
inspections=filter(inspections,lubridate::year(INSPECTION.DATE)=='2016' & (GRADE =='A'| GRADE=='B'| GRADE=='C') & ACTION !='closed' & BORO !='Missing')

# since there can be multiple inspections. select only the latest for each restaurant
latest = merge( aggregate(INSPECTION.DATE ~ CAMIS, inspections, max), inspections) #Unique restaurants and their latest score/grade 

# unique list of restaurants and their grade for 2016
restaurants = unique(select(latest, CAMIS, BORO, ZIPCODE, CUISINE.DESCRIPTION, SCORE, GRADE))
ggplot(data=restaurants, aes(x=reorder(BORO, desc(BORO)))) + 
    geom_bar(aes(fill=GRADE), position='dodge') + 
    labs(title='Grades in each NY Borough for 2016', x='Borough', y='# of Restaurants') + scale_y_continuous(limits=c(0,9000), breaks=seq(0,10000,1000)) + 
    theme_bw()  

This graph shows the number of New York restaurants by grades in each borough. It is important to note that this analysis is only for 2016 and excludes new or un-graded restaurants. It also excludes Restaurants for which location data was missing.

It can be seen that Manhattan has the highest number of Grade A restaunats. While that is not surprising is that the number of Grade B and C are very low. Compared to that restaurants in the suburbs of Queens and Brooklyn that have a much higher proportion of Grade B for the smaller number of Grade A’s

Analysis 2- Word Cloud of the Most Frequent Violations

# filter 2016 data for valid restaurants in a vector
textdata<-nyc_rest_tib%>%
        filter(lubridate::year(INSPECTION.DATE)=='2016' & (GRADE =='A'| GRADE=='B'|         GRADE=='C') & ACTION !='closed' & BORO !='Missing')%>%
        select(VIOLATION.DESCRIPTION)

# Convert the vector into a corpus
text <- Corpus(VectorSource(textdata$VIOLATION.DESCRIPTION))

# Convert to lower case
text <- tm_map(text, tolower)
    
#tdm <- TermDocumentMatrix(text) # turn it into a term document matrix

text1 <- tm_map(text, removeWords, stopwords('english')) # remove stop words
text1 <- tm_map(text1, removePunctuation) # remove punctuations 
text1 <- tm_map(text1, PlainTextDocument) # make plain text
text1 <- tm_map(text1, removeNumbers) # remove numbers
text1 <- tm_map(text1, stripWhitespace) # remove white spaces

dtm <- TermDocumentMatrix(text1)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10) # Top 10 words 
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 200,
        max.words=70, random.order=FALSE, rot.per=0.5, 
        colors=brewer.pal(8, "Dark2")
        )

The word cloud shows some of the key words in the inspection reports. Some of the key words that pop out from the analysis are flies, vermin. These words are indicative of the fact that restaurants are not taking adequante steps to protect food and their premises from these sources of infection and diseases. Some other words mice, filth and contact are some of the other words which again clearly shows the presence of rodents in the premises. Words like Non Food, equipment and surface also hint at other potential sources of problems that have more to do with the physical location itself.

Analysis 4- Identify Top 20 Contamination violation using regex and functions

top_restaurant<-function(df,x,y){
    dt<-df%>%
    filter(format(INSPECTION.DATE, "%Y") == x)%>%
    filter(str_detect(VIOLATION.DESCRIPTION, regex("y", ignore_case = TRUE)))%>%
    group_by(CAMIS,DBA)%>%
    summarise(count=n())%>%
    arrange(desc(count))%>%
    head(10)
    print(dt)
    m<-ggplot(data=dt,aes(x = reorder(DBA, (count)), y = count, fill=DBA)) + 
    geom_bar(stat = "identity",width = .25)+ 
    geom_text(aes(label= count), na.rm = TRUE, hjust = 0.3, vjust = -0.8)+
    ggtitle("Restaurants with most violations of contamination") +
    # scale_x_continuous(limits=c(0,50))
    coord_flip()+
    theme_bw()+
    ylab("# of Violations")+ xlab("Restaurant")
    print(m)
}

# function call
top_restaurant(nyc_rest_tib,2016,"contamination")
## Source: local data frame [10 x 3]
## Groups: CAMIS [10]
## 
##       CAMIS                         DBA count
##       <int>                       <chr> <int>
## 1  50044942 FOGON LATINO BAR RESTAURANT    37
## 2  50045602                   POLKA DOT    32
## 3  50043402         ROYAL FRIED CHICKEN    29
## 4  41429788                   EL AGUILA    27
## 5  41683816     MAX BAKERY & RESTAURANT    25
## 6  50010805                 CAFE AU LEE    25
## 7  40861669      NEW VICTORY RESTAURANT    24
## 8  40965177           LA GATA GOLOSA #2    24
## 9  41250025 YUNG SUN SEAFOOD RESTAURANT    24
## 10 41711975            SZECHUAN GOURMET    24

This is an extension of the word cloud that was done earlier. Since the cloud pointed out many potential sources of contamination like flies and vermin, I decided to look at the top 20 restaurants that have had these violations in 2016. From the analysis we can see that FOGON LATINO BAR RESTAURANT has had the mopst violations about contamination in 2016 followed by Polka Dot.

Key Insights

Summary

The analysis done above provides a high level view of the hygine at New York restaurants. With the charts shown above one can determine what are some of the most critical issues plaguing restauramts in New York and where.

This analysis was done in R and made extensive use of R packages for manipulations and plotting. wordcloud, ggplot, dplyr and tidyverse are some of the critical packages used in this analysis.