Data Wrangling in R - Final Project

Objective

An analysis of inspection of New York City restaurants since 2011.

As more and more people in the cities eat out more often, health and hygiene of the public has come to the center stage. It is important that the food that is served to the patron as well as the hygiene standards of the restaurants themselves are top-notch. In an attempt to address this concern NYC’s Department of Health and Mental Hygiene (DOHMH) conducts unannounced inspections of restaurants every year. Since 2010, NYC restaurants have to prominently post their Grade (e.g. A/B/C) which empowers diners with decision-making information and incentivizes establishments to improve their hygiene.

The Analysis

This analysis looks at various aspects health and hygiene issues. Some of the questions answered are: * How do different boroughs stack up against one another- are there more issues in some places? * What are some of the key issues that are reported by the officers? * Has the inspections had any positive impact? * What are some of the top restaurants that have a particular issue?

High Level summary

Manhattan, despite having the largest number of restaurants and high population and customer base does well in terms of hygiene standards
Overall, number of Grade A restaurants have gone up over the years.

Data Source

The data is available at : https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59

Date of data collection: 8th December, 2016 - see RECORD.DATE Column
There were 18 variable, some more were added/dropped
Restaurants that were not graded were coded with a INSPECTION.DATE =‘1900-01-01’

Packages Required

library(dplyr) ## used to manupulate data
library(tibble) ## used to create tibbles
library(stringr) ## using regx and other string functions
library(tidyverse)#group of packages used to summarise and visualize data
library(tidyr) ## functions which can be used to make data tidy
library(purrr) ## used for iteration functionns like map
library(RSocrata) ## used for reading socrata API datasets
library(ggplot2)  ## used for plotting graphs
library(lubridate) # used for date manipulations
library(tm) # text Mining Package in R
library(SnowballC) # text Mining Package in R
library(wordcloud) # Package to Build word Clould
library(RColorBrewer) # Package to add color

Data Description

The Codebook is available in the attachments section at : https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59/about

The variables available are:

Variable Name	Description	Data Type	Example	# of Missing Values
CAMIS	Unique Identifier for each restaurant	int	30075445,	0
DBA	Restaurant Name	factor	AUNTIE ANNE’S PRETZELS	0
BORO	Borough (Area)	factor	BROOKLYN, BRONX	0
BUILDING	Building Name or Number	character	TERMINAL 1	39
STREET	Street address	factor	W 154TH ST	0
ZIPCODE	Reataurant ZIP code	int (converted to char)	11692	0
PHONE	Restaurant Phone Number	character	9738089525	1
CUISINE.DESCRIPTION	Type of Cuisine	factor	Chinese, Bakery	0
INSPECTION.DATE	Inspection Date	Posixlt (converted to data)	Date Range: 2011-08-23 to 2016-11-16 (There are some dates from 1900 that are dropped)	0
ACTION	Action taken by DOHMH	factor	Establishment Closed by DOHMH, Violations were cited in the following area(s)	0
VIOLATION.CODE	DOHMH Violation Code	factor	02A, 02B	119
VIOLATION.DESCRIPTION	Violation Description as per the officer	factor (converted to character)	Wash hands sign not posted at hand wash facility, Accurate thermometer not provided in refrigerated or hot holding equipment	119
CRITICAL.FLAG	Critical or Not	factor	Critical, Not Critical, Not Applicable	0
SCORE	Score Awarded by DOHMH	int	116, 117	0
GRADE	Grade Awarded by DOHMH	factor	A, B, Not Yet Graded	0
GRADE.DATE	Date when grade was awarded	Posixlt (converted to date)	2011-10-07, 2016-11-16	200885
RECORD.DATE	Date when record was created/updated	Posixlt (converted to date)	2016-11-18	0
INSPECTION.TYPE	Type of inspection (replaced by 2 new columns)	character	Administrative Miscellaneous / Initial Inspection, Cycle Inspection / Compliance Inspection	0
INSPECTION.CLASS	Similar to Type of inspection	character	Administrative Miscellaneous / Initial Inspection, Cycle Inspection / Compliance Inspection	0
INSPECTION.PHASE	Phase of the inspection	character	Initial Inspection, Re-Inspection	0

Importing data

The data is imported using the RSocrata package. The below code is used import data.

url<-"https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59"
nyrestaurant<-read.socrata(url)

# write the file to the desktop for further use
#readr::write_rds(nyrestaurant,"nycrestaurant.rds")

# read local copy - comment out the code above when using this
#nyrestaurant<-readr::read_rds("nycrestaurant.rds")

Cleaning data

The GRADE.DATE, INSPECTION.DATE and RECORD.DATE were in Posixlt format which could not be read into a tibble. I converted these to a date format using a function

See code below:

# function to change POSIXlt to date
date_conv<-function(x){
    if (identical(unlist(class(x))[1],"POSIXlt")){
        x<-as.Date(x)
    } else(x)
}

##  created a tible from data frame
nyc_rest_tib<-as_tibble(map(nyrestaurant,date_conv))

Some additional data manipulations and cleaning was applied to the dataset

Convert VIOLATION.DESCRIPTION to character
Split INSPECTION.TYPE into INSPECTION.CLASS and INSPECTION.PHASE
Remove data from ‘1900-01-01’ since this is a junk value for restarurants have not been inspected
Replace ACTION data with shorter descriptions

A Sample from the dataset is shown below:

head(nyc_rest_tib[1:20,])

The GRADEDATE variable has 200885 NAs. This is over 50% missing values in that column. However, these observations are not deleted. I will use filters to remove these values on the fly when needed.

Summary of some of the key variables used in analysis

ACTION: This variable stores values of the action taken by the inspection authorities. Thease values were recoded to make them simple for further use. Here is a list of values: violations, reopened, closed, reclosed, no violations
BORO: This contains the names of the 5 major NY boroughs. There are some missing data here. MANHATTAN, BROOKLYN, STATEN ISLAND, QUEENS, BRONX, Missing
VIOLATION.DESCRIPTION: This is a free text column that contains the actual commments from the inspection officer.
GRADE: The GRADE Column contains the grade that was awarded buring the inspection. It is important to note that grades can change over the year. There is also a possiblilty of grades being changed during re-inspection. Some restaurants have not been inspected/graded yet or the grading is pending. Sample values are: A, , B, C, P, Z, Not Yet Graded

Exploratory Data Analysis

Analysis 1- Grades in each Borough for 2016

#Unique inspections results for 2016
inspections = unique(select(nyc_rest_tib, CAMIS, BORO, ZIPCODE, CUISINE.DESCRIPTION, INSPECTION.DATE, ACTION, SCORE, GRADE)) 
inspections=filter(inspections,lubridate::year(INSPECTION.DATE)=='2016' & (GRADE =='A'| GRADE=='B'| GRADE=='C') & ACTION !='closed' & BORO !='Missing')

# since there can be multiple inspections. select only the latest for each restaurant
latest = merge( aggregate(INSPECTION.DATE ~ CAMIS, inspections, max), inspections) #Unique restaurants and their latest score/grade 

# unique list of restaurants and their grade for 2016
restaurants = unique(select(latest, CAMIS, BORO, ZIPCODE, CUISINE.DESCRIPTION, SCORE, GRADE))

ggplot(data=restaurants, aes(x=reorder(BORO, desc(BORO)))) + 
    geom_bar(aes(fill=GRADE), position='dodge') + 
    labs(title='Grades in each NY Borough for 2016', x='Borough', y='# of Restaurants') + scale_y_continuous(limits=c(0,9000), breaks=seq(0,10000,1000)) + 
    theme_bw()

This graph shows the number of New York restaurants by grades in each borough. It is important to note that this analysis is only for 2016 and excludes new or un-graded restaurants. It also excludes Restaurants for which location data was missing.

It can be seen that Manhattan has the highest number of Grade A restaunats. While that is not surprising is that the number of Grade B and C are very low. Compared to that restaurants in the suburbs of Queens and Brooklyn that have a much higher proportion of Grade B for the smaller number of Grade A’s

Analysis 2- Word Cloud of the Most Frequent Violations

# filter 2016 data for valid restaurants in a vector
textdata<-nyc_rest_tib%>%
        filter(lubridate::year(INSPECTION.DATE)=='2016' & (GRADE =='A'| GRADE=='B'|         GRADE=='C') & ACTION !='closed' & BORO !='Missing')%>%
        select(VIOLATION.DESCRIPTION)

# Convert the vector into a corpus
text <- Corpus(VectorSource(textdata$VIOLATION.DESCRIPTION))

# Convert to lower case
text <- tm_map(text, tolower)
    
#tdm <- TermDocumentMatrix(text) # turn it into a term document matrix

text1 <- tm_map(text, removeWords, stopwords('english')) # remove stop words
text1 <- tm_map(text1, removePunctuation) # remove punctuations 
text1 <- tm_map(text1, PlainTextDocument) # make plain text
text1 <- tm_map(text1, removeNumbers) # remove numbers
text1 <- tm_map(text1, stripWhitespace) # remove white spaces

dtm <- TermDocumentMatrix(text1)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10) # Top 10 words

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 200,
        max.words=70, random.order=FALSE, rot.per=0.5, 
        colors=brewer.pal(8, "Dark2")
        )

The word cloud shows some of the key words in the inspection reports. Some of the key words that pop out from the analysis are flies, vermin. These words are indicative of the fact that restaurants are not taking adequante steps to protect food and their premises from these sources of infection and diseases. Some other words mice, filth and contact are some of the other words which again clearly shows the presence of rodents in the premises. Words like Non Food, equipment and surface also hint at other potential sources of problems that have more to do with the physical location itself.

Analysis 3 - Trends in Grading over the years- Does the program have a positive impact

# Filter all data for Grade ABC and add a year variable
nyc_rest_modified<-nyc_rest_tib%>%
    mutate(Year=format(INSPECTION.DATE,"%Y"))%>%
    filter((GRADE =='A'| GRADE=='B'| GRADE=='C')  & BORO !='Missing')

## Aggregate the final data for 2013-2016
## Older data was not included since there was not a lot of records for first few years of program

finalData<-rbind(restaurants_2013,restaurants_2014,restaurants_2015,restaurants_2016)
Agg_data<-finalData %>%
    unique() %>%
    mutate(Gradechar = as.character(GRADE)) %>%
    group_by(Year,Gradechar) %>%
    summarise(count=n())

# Plot the data
ggplot(data=Agg_data, aes(x=Year, y=count, group = Gradechar, colour = Gradechar)) +
    geom_line(size=2) +
    geom_point( size=2, shape=21, fill="white")+
    labs(title='Change in # of Restaurants by grade') + 
    theme_bw()

Overall, it appears that there has been a positive impact of the inspections. the number of Grade A restaurants has consistent increased since 2013 while Grade B and Grade C have remained the same. It will be interesting to see if this is an overall trend or confined to one or two geographies. However, it has not been done for now.

Analysis 4- Identify Top 20 Contamination violation using regex and functions

top_restaurant<-function(df,x,y){
    dt<-df%>%
    filter(format(INSPECTION.DATE, "%Y") == x)%>%
    filter(str_detect(VIOLATION.DESCRIPTION, regex("y", ignore_case = TRUE)))%>%
    group_by(CAMIS,DBA)%>%
    summarise(count=n())%>%
    arrange(desc(count))%>%
    head(10)
    print(dt)
    m<-ggplot(data=dt,aes(x = reorder(DBA, (count)), y = count, fill=DBA)) + 
    geom_bar(stat = "identity",width = .25)+ 
    geom_text(aes(label= count), na.rm = TRUE, hjust = 0.3, vjust = -0.8)+
    ggtitle("Restaurants with most violations of contamination") +
    # scale_x_continuous(limits=c(0,50))
    coord_flip()+
    theme_bw()+
    ylab("# of Violations")+ xlab("Restaurant")
    print(m)
}

# function call
top_restaurant(nyc_rest_tib,2016,"contamination")

## Source: local data frame [10 x 3]
## Groups: CAMIS [10]
## 
##       CAMIS                         DBA count
##       <int>                       <chr> <int>
## 1  50044942 FOGON LATINO BAR RESTAURANT    37
## 2  50045602                   POLKA DOT    32
## 3  50043402         ROYAL FRIED CHICKEN    29
## 4  41429788                   EL AGUILA    27
## 5  41683816     MAX BAKERY & RESTAURANT    25
## 6  50010805                 CAFE AU LEE    25
## 7  40861669      NEW VICTORY RESTAURANT    24
## 8  40965177           LA GATA GOLOSA #2    24
## 9  41250025 YUNG SUN SEAFOOD RESTAURANT    24
## 10 41711975            SZECHUAN GOURMET    24

This is an extension of the word cloud that was done earlier. Since the cloud pointed out many potential sources of contamination like flies and vermin, I decided to look at the top 20 restaurants that have had these violations in 2016. From the analysis we can see that FOGON LATINO BAR RESTAURANT has had the mopst violations about contamination in 2016 followed by Polka Dot.

Key Insights

Overall high standards of hygiene for Manhattan restaurants.
There has been a steady increase in Grade A restaurants.
Flies and Vermin are two of the most common forms of contamination at NY restaurants.

Summary

The analysis done above provides a high level view of the hygine at New York restaurants. With the charts shown above one can determine what are some of the most critical issues plaguing restauramts in New York and where.

This analysis was done in R and made extensive use of R packages for manipulations and plotting. wordcloud, ggplot, dplyr and tidyverse are some of the critical packages used in this analysis.