CUNY SPS DATA 698 - RESEARCH PROPOSAL

Research Draft

Health Inspection in NYC

Introduction

The New York City Department of Health and Mental Hygiene (DOHMH) is the department of government of New York City responsible for public health along with issuing birth certificates, dog licenses, and conducting restaurant inspection and enforcement. The department was initially set up as the New York City Board of Health in 1805 to combat an outbreak of yellow fever. Later in July 2002 it was renamed to DOHMH.

Data Source

The NYC DOHMH Restaurant Inspection Results has a dataset that contains information from several large administrative data systems. It contians some illogical values that could be a data entry or transfer errors and some data might be missing as well.The dataset inspection records three years data prior to the most recent inspection for restaurants and college cafeterias in an active status. The dataset is last updated on March 15, 2020. It contains 402k instances and 26 attributes. But I will be only focusing on the restaurants in Manhattan only because the dataset is too huge. And in this dataset, there are 159574 instances and 18 attributes.

#importing required libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(purrr)
library(tidyr)
library(knitr)

#importing data from github. I have only used data for Manhattan
data <- read.csv("https://raw.githubusercontent.com/maharjansudhan/DATA698/master/DOHMH.csv")
knitr::kable(head(data))%>%
                kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
CAMIS DBA BORO BUILDING STREET ZIPCODE PHONE CUISINE.DESCRIPTION INSPECTION.DATE ACTION VIOLATION.CODE VIOLATION.DESCRIPTION CRITICAL.FLAG SCORE GRADE GRADE.DATE RECORD.DATE INSPECTION.TYPE
40511702 NOTARO RESTAURANT MANHATTAN 635 SECOND AVENUE 10016 2126863400 Italian 6/15/15 Violations were cited in the following area(s). 02B Hot food item not held at or above 140º F. Critical 30 8/28/17 Cycle Inspection / Initial Inspection
40511702 NOTARO RESTAURANT MANHATTAN 635 SECOND AVENUE 10016 2126863400 Italian 11/25/14 Violations were cited in the following area(s). 20F Current letter grade card not posted. Not Critical NA 8/28/17 Administrative Miscellaneous / Initial Inspection
40376944 TOMOE SUSHI MANHATTAN 172 THOMPSON STREET 10012 2127779346 Japanese 10/6/15 Violations were cited in the following area(s). 02G Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation. Critical 13 A 10/6/15 8/28/17 Cycle Inspection / Re-inspection
40560896 YAKITORI TAISHO MANHATTAN 5 ST MARKS PLACE 10003 2122285086 Japanese 8/13/15 Violations were cited in the following area(s). 10B Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly. Not Critical 9 8/28/17 Cycle Inspection / Initial Inspection
41552184 NOM WAH TEA/DIM SUM PALOR MANHATTAN 13 DOYERS STREET 10013 2129626047 Chinese 9/2/14 Violations were cited in the following area(s). 02G Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation. Critical 18 B 9/2/14 8/28/17 Cycle Inspection / Re-inspection
41575449 MIDNIGHT EXPRESS MANHATTAN 1715 2 AVENUE 10128 2128602320 American 12/19/14 Violations were cited in the following area(s). 04M Live roaches present in facility’s food and/or non-food areas. Critical 25 8/28/17 Cycle Inspection / Initial Inspection

Every restaurant in NYC is scheduled a health inspection which is conducted at least once a year. This inspection will be unannounced. The inspector checks for compliance with city and state food safety regulations and marks points for any condition that violates these rules. The inspections are conducted by Department of Health and Mental Hygiene (DOHMH).

The New York City Health Department inspects all food service establishments to make sure they meet Health Code requirements, which helps prevent foodborne illness. How often a restaurant is inspected depends on its inspection score. Restaurants that receive a low score on the initial or first inspection in the inspection cycle are inspected less often than those that receive a high score.

According to the score the restaurants receive, they get the rating of A, B or C in which A is the highest score that a restaurant can get.

These health inspections can occur anytime a year in the New York City (Bronx, Brooklyn, Queens, Manhattan, Staten Island). The inspector may visit anytime the restaurant is receiving or preparing food or drink or is open to the public. It is a crime to offer money, gifts or services of any kind. Legal action will be taken against anyone who offers or accepts a bribe.

The inspector reports witnessed violations in a handheld computer during the inspection. Each violation is associated with a range of points depending on the type and extent of the violation, and the risk it poses to the public or consumers. At the end of the inspection, the points are added together for an inspection score. Lower inspection scores indicate better compliance with the Health Code.

To have an idea what our data looks like some of the very important tools are glimpse, summary, dim, colnames, etc. These helps you to overall look at the column names of the data, see the mean, median and standard deviation of the dataset.

You can also see the properties of the variables like if it is an integer, character, boolean, etc.

To make the data easier for manipulation, I converted data file into a dplyr table.

#convert to dplyr table
data <- tbl_df(data)
#let's have a look at the data
glimpse(data)
## Observations: 159,574
## Variables: 18
## $ CAMIS                 <int> 40511702, 40511702, 40376944, 40560896, 41…
## $ DBA                   <fct> NOTARO RESTAURANT, NOTARO RESTAURANT, TOMO…
## $ BORO                  <fct> MANHATTAN, MANHATTAN, MANHATTAN, MANHATTAN…
## $ BUILDING              <fct> 635, 635, 172, 5, 13, 1715, 6, 10, 672, 65…
## $ STREET                <fct> SECOND AVENUE, SECOND AVENUE, THOMPSON STR…
## $ ZIPCODE               <int> 10016, 10016, 10012, 10003, 10013, 10128, …
## $ PHONE                 <fct> 2126863400, 2126863400, 2127779346, 212228…
## $ CUISINE.DESCRIPTION   <fct> Italian, Italian, Japanese, Japanese, Chin…
## $ INSPECTION.DATE       <fct> 6/15/15, 11/25/14, 10/6/15, 8/13/15, 9/2/1…
## $ ACTION                <fct> Violations were cited in the following are…
## $ VIOLATION.CODE        <fct> 02B, 20F, 02G, 10B, 02G, 04M, 10B, 04N, 08…
## $ VIOLATION.DESCRIPTION <fct> "Hot food item not held at or above 140√Ǭ…
## $ CRITICAL.FLAG         <fct> Critical, Not Critical, Critical, Not Crit…
## $ SCORE                 <int> 30, NA, 13, 9, 18, 25, 12, 12, 9, 19, 11, …
## $ GRADE                 <fct> , , A, , B, , A, A, A, , A, B, , A, , , A,…
## $ GRADE.DATE            <fct> , , 10/6/15, , 9/2/14, , 9/26/16, 12/16/14…
## $ RECORD.DATE           <fct> 8/28/17, 8/28/17, 8/28/17, 8/28/17, 8/28/1…
## $ INSPECTION.TYPE       <fct> Cycle Inspection / Initial Inspection, Adm…
#to see the column names
colnames(data)
##  [1] "CAMIS"                 "DBA"                  
##  [3] "BORO"                  "BUILDING"             
##  [5] "STREET"                "ZIPCODE"              
##  [7] "PHONE"                 "CUISINE.DESCRIPTION"  
##  [9] "INSPECTION.DATE"       "ACTION"               
## [11] "VIOLATION.CODE"        "VIOLATION.DESCRIPTION"
## [13] "CRITICAL.FLAG"         "SCORE"                
## [15] "GRADE"                 "GRADE.DATE"           
## [17] "RECORD.DATE"           "INSPECTION.TYPE"
#to see the total numbers records and attributes
dim(data)
## [1] 159574     18
#to see different types of violations
summary(data$CRITICAL.FLAG)
##       Critical Not Applicable   Not Critical 
##          88496           2672          68406

Literature review

The idea of this project came after searching for the project that I will work on for as the capstone project. I have been in the hospitality industry for many years. Hospitality and health hygeine is considered one of the important part of New York City lifestyle. There are lots of data gathered by the city and the health inspector on the yearly basis but I couldn’t find much of research that has been done.

There are some works that can be found on Kaggle but these are not done in a fruitful way.You can find some desciptive statistics but there has not been any kinds of predictive analysis done yet.

There are huge numbers of data gathered by government on semi-yearly basis and yearly basis. Health inspection is done or initial establisment, inspection and re-inspection. So, huge amount of data is collected but theses data are just stored somewhere which is of no value.

#to see different forms of the violation in a plot
violation_letter <- ggplot(data, aes(x=factor(CRITICAL.FLAG))) + geom_bar() +xlab("Different Violations")

violation_letter

#convert the grades into a factors 
data$GRADE <- factor(data$GRADE, levels = c("A", "B", "C", "Z", "Not Yet Graded"), ordered = TRUE)
(levels(data$GRADE))
## [1] "A"              "B"              "C"              "Z"             
## [5] "Not Yet Graded"
#let's find out the total number of distinct 
n_distinct(data$CAMIS)
## [1] 10643
data %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9260 rows containing non-finite values (stat_bin).

Tools

Different tools are used to perform this project like Bash, R, etc. Since the dataset was too huge, I used the data for only Manhattan. Still the dataset was huge to upload in my github account. I used youtube to learn how upload a huge file in github. Using bash we are able to upload a file in github that is 50 MB.

Missing Values

In order to deal with missing values, I tried to count the distinct establishments to make sure if any records are duplicates or if there is an ID missing for them.

#display data group by actions
data %>% group_by(ACTION) %>% count()
## # A tibble: 6 x 2
## # Groups:   ACTION [6]
##   ACTION                                                                  n
##   <fct>                                                               <int>
## 1 ""                                                                    475
## 2 Establishment Closed by DOHMH.  Violations were cited in the foll…   3252
## 3 Establishment re-closed by DOHMH                                      283
## 4 Establishment re-opened by DOHMH                                      928
## 5 No violations were recorded at the time of this inspection.          2004
## 6 Violations were cited in the following area(s).                    152632
#display data group by violation description
data %>% group_by(VIOLATION.DESCRIPTION) %>% count() %>% arrange(desc(n)) %>% head(10)
## # A tibble: 10 x 2
## # Groups:   VIOLATION.DESCRIPTION [10]
##    VIOLATION.DESCRIPTION                                                  n
##    <fct>                                                              <int>
##  1 Non-food contact surface improperly constructed. Unacceptable mat… 22053
##  2 Facility not vermin proof. Harborage or conditions conducive to a… 15145
##  3 Cold food item held above 41√Ǭ∫ F (smoked fish and reduced oxyge… 12160
##  4 Food contact surface not properly washed, rinsed and sanitized af… 11491
##  5 Food not protected from potential source of contamination during … 10646
##  6 Evidence of mice or live mice present in facility's food and/or n… 10127
##  7 Plumbing not properly installed or maintained; anti-siphonage or …  9444
##  8 "Filth flies or food/refuse/sewage-associated (FRSA) flies presen…  7942
##  9 Hot food item not held at or above 140º F.                       7237
## 10 Raw, cooked or prepared food is adulterated, contaminated, cross-…  3980

Records are also included for each restaurant that has applied for a permit but has not yet been inspected and for inspections resulting in no violations.Establishments with inspection date of 1/1/1900 are new establishments that have not yet received an inspection. Restaurants that received no violations are represented by a single row and coded as having no violations using the ACTION field.

So, in order to make date accessible for inspection, grade and record, I converted it to date class object.

data$INSPECTION.DATE<-(as.Date(data$INSPECTION.DATE, "%m/%d/%Y"))
data$GRADE.DATE<-(as.Date(data$GRADE.DATE, "%m/%d/%Y"))
data$RECORD.DATE<-(as.Date(data$RECORD.DATE, "%m/%d/%Y"))
class(data$RECORD.DATE)
## [1] "Date"

Now let’s see if we are able to see the range of the inspections.

range(data$`INSPECTION.DATE`)
## [1] "0000-01-01" "0017-08-25"

Prediction

For the prediction, i will be using initial modelling such as linear regression and stepwise methods for feature selection.