NYC’s Department of Health and Mental Hygiene (DOHMH) conducts unannounced inspections of restaurants every year. Since 2010, NYC restaurants have to prominently post their Grade (e.g. A/B/C) which empowers diners with decision-making information and incentivizes establishments to improve their hygiene.
The data is available at : https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59
While I could not find a codebook online, I did my best to understand the data. Below is my unerstanding of the variables in the dataset.
The variables available are:
| Variable Name | Description | Data Type | Example | # of Missing Values |
|---|---|---|---|---|
| CAMIS | I am not entirely sure what this column does but it appears to be some kind of a code for restaurants | int | 30075445, | 0 |
| DBA | Restaurant Name | factor | AUNTIE ANNE’S PRETZELS | 0 |
| BORO | Borough (Area) | factor | BROOKLYN, BRONX | 0 |
| BUILDING | Building Name or Number | character | TERMINAL 1 | 104 |
| STREET | Street address | factor | W 154TH ST | 0 |
| ZIPCODE | Reataurant ZIP code | int | 11692 | 0 |
| PHONE | Restaurant Phone Number | character | 9738089525 | 9 |
| CUISINE.DESCRIPTION | Type of Cuisine | factor | Chinese, Bakery | 0 |
| INSPECTION.DATE | Inspection Date | Posixlt | Date Range: 2011-08-23 to 2016-11-16 (There are some dates from 1900 that are dropped) | 0 |
| ACTION | Action taken by DOHMH | factor | Establishment Closed by DOHMH, Violations were cited in the following area(s) | 0 |
| VIOLATION.CODE | DOHMH Violation Code | factor | 02A, 02B | 119 |
| VIOLATION.DESCRIPTION | Violation Reason | factor | Wash hands sign not posted at hand wash facility, Accurate thermometer not provided in refrigerated or hot holding equipment | 119 |
| CRITICAL.FLAG | Critical or Not | factor | Critical, Not Critical, Not Applicable | 0 |
| SCORE | Score Awarded by DOHMH | int | 116, 117 | 27687 |
| GRADE | Grade Awarded by DOHMH | factor | A, B, Not Yet Graded | 0 |
| GRADE.DATE | Date when grade was awarded | Posixlt | 2011-10-07, 2016-11-16 | 230262 |
| RECORD.DATE | Date when record was created/updated | Posixlt | 2016-11-18 | 0 |
| INSPECTION.TYPE | Type of inspection | character | Administrative Miscellaneous / Initial Inspection, Cycle Inspection / Compliance Inspection | 0 |
The data is imported using the RSocrata package. The below code is used import data.
library(RSocrata)
url<-"https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59"
nyrestaurant<-read.socrata(url)
The GRADE.DATE, INSPECTION.DATE and RECORD.DATE were in Posixlt format which could not be read into a tibble. To bypass this, I first imported data as a data.frame and then converted these fields into a Posixct format to save the dataset as a tibble.
See code below:
# convert posixlt to posixct
GRADEDATE<-as.POSIXct(nyrestaurant$GRADE.DATE)
INSPECTIONDATE<-as.POSIXct(nyrestaurant$INSPECTION.DATE)
RECORDDATE<-as.POSIXct(nyrestaurant$RECORD.DATE)
library(tidyverse)
library(dplyr)
nyrestaurant_tib<-as_tibble(cbind(select(nyrestaurant,-c(GRADE.DATE,INSPECTION.DATE,RECORD.DATE)),GRADEDATE,INSPECTIONDATE,RECORDDATE))
A Sample from the dataset is shown below:
library(tidyverse)
head(nyrestaurant_tib)
The first step of the cleaning process was to convert the Posixlt time to Posixct. This has already been done as described in the above section.
The next step is to remove rowd that correspond to 1900 data
nyrestaurant_tib<-nyrestaurant_tib %>%
filter(INSPECTIONDATE!="1900-01-01" )
The GRADEDATE variable has 229344 NAs. This is over 50% missing values in that column. This column may need to be dropped but I am holding off for now.
The dataset requires further refining. There are multiple observations of the same restaurant on the same day because of multiple violations. This can cause problems in the analysis so it should be accouted for during the analysis phase.
Some of the questions that I hope to answer: