The New York City Department of Health and Mental Hygiene (DOHMH) is the department of government of New York City responsible for public health along with issuing birth certificates, dog licenses, and conducting restaurant inspection and enforcement. The department was initially set up as the New York City Board of Health in 1805 to combat an outbreak of yellow fever. Later in July 2002 it was renamed to DOHMH.
The NYC DOHMH Restaurant Inspection Results has a dataset that contains information from several large administrative data systems. It contians some illogical values that could be a data entry or transfer errors and some data might be missing as well.The dataset inspection records three years data prior to the most recent inspection for restaurants and college cafeterias in an active status. The dataset is last updated on March 15, 2020. It contains 402k instances and 26 attributes. But I will be only focusing on the restaurants in Manhattan only because the dataset is too huge. And in this dataset, there are 159574 instances and 18 attributes.
#importing required libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(purrr)
library(tidyr)
library(knitr)
#importing data from github. I have only used data for Manhattan
data <- read.csv("https://raw.githubusercontent.com/maharjansudhan/DATA698/master/DOHMH.csv")
knitr::kable(head(data))%>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
CAMIS | DBA | BORO | BUILDING | STREET | ZIPCODE | PHONE | CUISINE.DESCRIPTION | INSPECTION.DATE | ACTION | VIOLATION.CODE | VIOLATION.DESCRIPTION | CRITICAL.FLAG | SCORE | GRADE | GRADE.DATE | RECORD.DATE | INSPECTION.TYPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
40511702 | NOTARO RESTAURANT | MANHATTAN | 635 | SECOND AVENUE | 10016 | 2126863400 | Italian | 6/15/15 | Violations were cited in the following area(s). | 02B | Hot food item not held at or above 140º F. | Critical | 30 | 8/28/17 | Cycle Inspection / Initial Inspection | ||
40511702 | NOTARO RESTAURANT | MANHATTAN | 635 | SECOND AVENUE | 10016 | 2126863400 | Italian | 11/25/14 | Violations were cited in the following area(s). | 20F | Current letter grade card not posted. | Not Critical | NA | 8/28/17 | Administrative Miscellaneous / Initial Inspection | ||
40376944 | TOMOE SUSHI | MANHATTAN | 172 | THOMPSON STREET | 10012 | 2127779346 | Japanese | 10/6/15 | Violations were cited in the following area(s). | 02G | Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation. | Critical | 13 | A | 10/6/15 | 8/28/17 | Cycle Inspection / Re-inspection |
40560896 | YAKITORI TAISHO | MANHATTAN | 5 | ST MARKS PLACE | 10003 | 2122285086 | Japanese | 8/13/15 | Violations were cited in the following area(s). | 10B | Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly. | Not Critical | 9 | 8/28/17 | Cycle Inspection / Initial Inspection | ||
41552184 | NOM WAH TEA/DIM SUM PALOR | MANHATTAN | 13 | DOYERS STREET | 10013 | 2129626047 | Chinese | 9/2/14 | Violations were cited in the following area(s). | 02G | Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation. | Critical | 18 | B | 9/2/14 | 8/28/17 | Cycle Inspection / Re-inspection |
41575449 | MIDNIGHT EXPRESS | MANHATTAN | 1715 | 2 AVENUE | 10128 | 2128602320 | American | 12/19/14 | Violations were cited in the following area(s). | 04M | Live roaches present in facility’s food and/or non-food areas. | Critical | 25 | 8/28/17 | Cycle Inspection / Initial Inspection |
Every restaurant in NYC is scheduled a health inspection which is conducted at least once a year. This inspection will be unannounced. The inspector checks for compliance with city and state food safety regulations and marks points for any condition that violates these rules. The inspections are conducted by Department of Health and Mental Hygiene (DOHMH).
The New York City Health Department inspects all food service establishments to make sure they meet Health Code requirements, which helps prevent foodborne illness. How often a restaurant is inspected depends on its inspection score. Restaurants that receive a low score on the initial or first inspection in the inspection cycle are inspected less often than those that receive a high score.
According to the score the restaurants receive, they get the rating of A, B or C in which A is the highest score that a restaurant can get.
These health inspections can occur anytime a year in the New York City (Bronx, Brooklyn, Queens, Manhattan, Staten Island). The inspector may visit anytime the restaurant is receiving or preparing food or drink or is open to the public. It is a crime to offer money, gifts or services of any kind. Legal action will be taken against anyone who offers or accepts a bribe.
The inspector reports witnessed violations in a handheld computer during the inspection. Each violation is associated with a range of points depending on the type and extent of the violation, and the risk it poses to the public or consumers. At the end of the inspection, the points are added together for an inspection score. Lower inspection scores indicate better compliance with the Health Code.
To have an idea what our data looks like some of the very important tools are glimpse, summary, dim, colnames, etc. These helps you to overall look at the column names of the data, see the mean, median and standard deviation of the dataset.
You can also see the properties of the variables like if it is an integer, character, boolean, etc.
To make the data easier for manipulation, I converted data file into a dplyr table.
#convert to dplyr table
data <- tbl_df(data)
#let's have a look at the data
glimpse(data)
## Observations: 159,574
## Variables: 18
## $ CAMIS <int> 40511702, 40511702, 40376944, 40560896, 41…
## $ DBA <fct> NOTARO RESTAURANT, NOTARO RESTAURANT, TOMO…
## $ BORO <fct> MANHATTAN, MANHATTAN, MANHATTAN, MANHATTAN…
## $ BUILDING <fct> 635, 635, 172, 5, 13, 1715, 6, 10, 672, 65…
## $ STREET <fct> SECOND AVENUE, SECOND AVENUE, THOMPSON STR…
## $ ZIPCODE <int> 10016, 10016, 10012, 10003, 10013, 10128, …
## $ PHONE <fct> 2126863400, 2126863400, 2127779346, 212228…
## $ CUISINE.DESCRIPTION <fct> Italian, Italian, Japanese, Japanese, Chin…
## $ INSPECTION.DATE <fct> 6/15/15, 11/25/14, 10/6/15, 8/13/15, 9/2/1…
## $ ACTION <fct> Violations were cited in the following are…
## $ VIOLATION.CODE <fct> 02B, 20F, 02G, 10B, 02G, 04M, 10B, 04N, 08…
## $ VIOLATION.DESCRIPTION <fct> "Hot food item not held at or above 140√Ǭ…
## $ CRITICAL.FLAG <fct> Critical, Not Critical, Critical, Not Crit…
## $ SCORE <int> 30, NA, 13, 9, 18, 25, 12, 12, 9, 19, 11, …
## $ GRADE <fct> , , A, , B, , A, A, A, , A, B, , A, , , A,…
## $ GRADE.DATE <fct> , , 10/6/15, , 9/2/14, , 9/26/16, 12/16/14…
## $ RECORD.DATE <fct> 8/28/17, 8/28/17, 8/28/17, 8/28/17, 8/28/1…
## $ INSPECTION.TYPE <fct> Cycle Inspection / Initial Inspection, Adm…
#to see the column names
colnames(data)
## [1] "CAMIS" "DBA"
## [3] "BORO" "BUILDING"
## [5] "STREET" "ZIPCODE"
## [7] "PHONE" "CUISINE.DESCRIPTION"
## [9] "INSPECTION.DATE" "ACTION"
## [11] "VIOLATION.CODE" "VIOLATION.DESCRIPTION"
## [13] "CRITICAL.FLAG" "SCORE"
## [15] "GRADE" "GRADE.DATE"
## [17] "RECORD.DATE" "INSPECTION.TYPE"
#to see the total numbers records and attributes
dim(data)
## [1] 159574 18
#to see different types of violations
summary(data$CRITICAL.FLAG)
## Critical Not Applicable Not Critical
## 88496 2672 68406
The idea of this project came after searching for the project that I will work on for as the capstone project. I have been in the hospitality industry for many years. Hospitality and health hygeine is considered one of the important part of New York City lifestyle. There are lots of data gathered by the city and the health inspector on the yearly basis but I couldn’t find much of research that has been done.
There are some works that can be found on Kaggle but these are not done in a fruitful way.You can find some desciptive statistics but there has not been any kinds of predictive analysis done yet.
There are huge numbers of data gathered by government on semi-yearly basis and yearly basis. Health inspection is done or initial establisment, inspection and re-inspection. So, huge amount of data is collected but theses data are just stored somewhere which is of no value.
#to see different forms of the violation in a plot
violation_letter <- ggplot(data, aes(x=factor(CRITICAL.FLAG))) + geom_bar() +xlab("Different Violations")
violation_letter
#convert the grades into a factors
data$GRADE <- factor(data$GRADE, levels = c("A", "B", "C", "Z", "Not Yet Graded"), ordered = TRUE)
(levels(data$GRADE))
## [1] "A" "B" "C" "Z"
## [5] "Not Yet Graded"
#let's find out the total number of distinct
n_distinct(data$CAMIS)
## [1] 10643
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9260 rows containing non-finite values (stat_bin).
Different tools are used to perform this project like Bash, R, etc. Since the dataset was too huge, I used the data for only Manhattan. Still the dataset was huge to upload in my github account. I used youtube to learn how upload a huge file in github. Using bash we are able to upload a file in github that is 50 MB.
In order to deal with missing values, I tried to count the distinct establishments to make sure if any records are duplicates or if there is an ID missing for them.
#display data group by actions
data %>% group_by(ACTION) %>% count()
## # A tibble: 6 x 2
## # Groups: ACTION [6]
## ACTION n
## <fct> <int>
## 1 "" 475
## 2 Establishment Closed by DOHMH. Violations were cited in the foll… 3252
## 3 Establishment re-closed by DOHMH 283
## 4 Establishment re-opened by DOHMH 928
## 5 No violations were recorded at the time of this inspection. 2004
## 6 Violations were cited in the following area(s). 152632
#display data group by violation description
data %>% group_by(VIOLATION.DESCRIPTION) %>% count() %>% arrange(desc(n)) %>% head(10)
## # A tibble: 10 x 2
## # Groups: VIOLATION.DESCRIPTION [10]
## VIOLATION.DESCRIPTION n
## <fct> <int>
## 1 Non-food contact surface improperly constructed. Unacceptable mat… 22053
## 2 Facility not vermin proof. Harborage or conditions conducive to a… 15145
## 3 Cold food item held above 41√Ǭ∫ F (smoked fish and reduced oxyge… 12160
## 4 Food contact surface not properly washed, rinsed and sanitized af… 11491
## 5 Food not protected from potential source of contamination during … 10646
## 6 Evidence of mice or live mice present in facility's food and/or n… 10127
## 7 Plumbing not properly installed or maintained; anti-siphonage or … 9444
## 8 "Filth flies or food/refuse/sewage-associated (FRSA) flies presen… 7942
## 9 Hot food item not held at or above 140º F. 7237
## 10 Raw, cooked or prepared food is adulterated, contaminated, cross-… 3980
Records are also included for each restaurant that has applied for a permit but has not yet been inspected and for inspections resulting in no violations.Establishments with inspection date of 1/1/1900 are new establishments that have not yet received an inspection. Restaurants that received no violations are represented by a single row and coded as having no violations using the ACTION field.
So, in order to make date accessible for inspection, grade and record, I converted it to date class object.
data$INSPECTION.DATE<-(as.Date(data$INSPECTION.DATE, "%m/%d/%Y"))
data$GRADE.DATE<-(as.Date(data$GRADE.DATE, "%m/%d/%Y"))
data$RECORD.DATE<-(as.Date(data$RECORD.DATE, "%m/%d/%Y"))
class(data$RECORD.DATE)
## [1] "Date"
Now let’s see if we are able to see the range of the inspections.
range(data$`INSPECTION.DATE`)
## [1] "0000-01-01" "0017-08-25"
For the prediction, i will be using initial modelling such as linear regression and stepwise methods for feature selection.