Proposal

An analysis of inspection of New York City restaurants since 2011.

NYC’s Department of Health and Mental Hygiene (DOHMH) conducts unannounced inspections of restaurants every year. Since 2010, NYC restaurants have to prominently post their Grade (e.g. A/B/C) which empowers diners with decision-making information and incentivizes establishments to improve their hygiene.

Data Source

The data is available at : https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59

Data Description

While I could not find a codebook online, I did my best to understand the data. Below is my unerstanding of the variables in the dataset.

The variables available are:

Variable Name Description Data Type Example # of Missing Values
CAMIS I am not entirely sure what this column does but it appears to be some kind of a code for restaurants int 30075445, 0
DBA Restaurant Name factor AUNTIE ANNE’S PRETZELS 0
BORO Borough (Area) factor BROOKLYN, BRONX 0
BUILDING Building Name or Number character TERMINAL 1 104
STREET Street address factor W 154TH ST 0
ZIPCODE Reataurant ZIP code int 11692 0
PHONE Restaurant Phone Number character 9738089525 9
CUISINE.DESCRIPTION Type of Cuisine factor Chinese, Bakery 0
INSPECTION.DATE Inspection Date Posixlt Date Range: 2011-08-23 to 2016-11-16 (There are some dates from 1900 that are dropped) 0
ACTION Action taken by DOHMH factor Establishment Closed by DOHMH, Violations were cited in the following area(s) 0
VIOLATION.CODE DOHMH Violation Code factor 02A, 02B 119
VIOLATION.DESCRIPTION Violation Reason factor Wash hands sign not posted at hand wash facility, Accurate thermometer not provided in refrigerated or hot holding equipment 119
CRITICAL.FLAG Critical or Not factor Critical, Not Critical, Not Applicable 0
SCORE Score Awarded by DOHMH int 116, 117 27687
GRADE Grade Awarded by DOHMH factor A, B, Not Yet Graded 0
GRADE.DATE Date when grade was awarded Posixlt 2011-10-07, 2016-11-16 230262
RECORD.DATE Date when record was created/updated Posixlt 2016-11-18 0
INSPECTION.TYPE Type of inspection character Administrative Miscellaneous / Initial Inspection, Cycle Inspection / Compliance Inspection 0

Importing data

The data is imported using the RSocrata package. The below code is used import data.

library(RSocrata)
url<-"https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59"
nyrestaurant<-read.socrata(url)

The GRADE.DATE, INSPECTION.DATE and RECORD.DATE were in Posixlt format which could not be read into a tibble. To bypass this, I first imported data as a data.frame and then converted these fields into a Posixct format to save the dataset as a tibble.

See code below:

# convert posixlt to posixct
GRADEDATE<-as.POSIXct(nyrestaurant$GRADE.DATE)
INSPECTIONDATE<-as.POSIXct(nyrestaurant$INSPECTION.DATE)
RECORDDATE<-as.POSIXct(nyrestaurant$RECORD.DATE)

library(tidyverse)
library(dplyr)
nyrestaurant_tib<-as_tibble(cbind(select(nyrestaurant,-c(GRADE.DATE,INSPECTION.DATE,RECORD.DATE)),GRADEDATE,INSPECTIONDATE,RECORDDATE))

A Sample from the dataset is shown below:

library(tidyverse)
head(nyrestaurant_tib)

Cleaning data

The first step of the cleaning process was to convert the Posixlt time to Posixct. This has already been done as described in the above section.

The next step is to remove rowd that correspond to 1900 data

nyrestaurant_tib<-nyrestaurant_tib %>%
    filter(INSPECTIONDATE!="1900-01-01" )

The GRADEDATE variable has 229344 NAs. This is over 50% missing values in that column. This column may need to be dropped but I am holding off for now.

The dataset requires further refining. There are multiple observations of the same restaurant on the same day because of multiple violations. This can cause problems in the analysis so it should be accouted for during the analysis phase.

Planned Analysis

Some of the questions that I hope to answer: