For this round, we'll be looking at Chicago food inspection data. Our analysis question is as follows:
Is there a relationship between the type of facility and the risk level assigned to it by the inspector?
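The chunks below lean on a handful of packages, inferred here from the functions they call (fread() and := from data.table, kable() from knitr, kable_styling() from kableExtra, the ggplot2/ggthemes/plotly plotting stack, tidy() from broom, and CramerV() from DescTools). A minimal setup chunk, assumed rather than shown in the original, would look like this:
library(data.table)  # fread(), :=, dcast()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(ggplot2)     # base plotting
library(ggthemes)    # scale_color_canva()
library(plotly)      # ggplotly()
library(broom)       # tidy()
library(DescTools)   # CramerV()
library(magrittr)    # %>% (also re-exported by several of the packages above)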
setwd('~/git/CUNY.MDS/DATA607')
dataset2 = fread('food-inspections.csv')
The first step we'll take here is eliminating unnecessary columns:
writeLines(names(dataset2))
## Inspection ID
## DBA Name
## AKA Name
## License #
## Facility Type
## Risk
## Address
## City
## State
## Zip
## Inspection Date
## Inspection Type
## Results
## Violations
## Latitude
## Longitude
## Location
## Historical Wards 2003-2015
## Zip Codes
## Community Areas
## Census Tracts
## Wards
Several of these columns are almost entirely empty.
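Before deleting anything, we can double-check that with a quick sketch (assumed, not part of the original) of the share of missing or blank values per column; the columns about to be dropped should come out near 1:
# Fraction of NA or empty-string values in each column
dataset2[, lapply(.SD, function(x) round(mean(is.na(x) | x == ""), 2))]
With the near-empty columns confirmed, we drop them: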
dataset2[, `:=` (`Historical Wards 2003-2015` = NULL, `Zip Codes` = NULL, `Census Tracts` = NULL, `Community Areas` = NULL, `Wards` = NULL, `Location` = NULL)]
We'll start by collapsing the many raw inspection types into a smaller set of broader buckets.
dataset2[, .N, by = "Inspection Type"][order(N, decreasing = T),][1:10, ] %>% kable %>% kable_styling( full_width = F, position = "left")
| Inspection Type | N |
|---|---|
| Canvass | 103995 |
| License | 26054 |
| Canvass Re-Inspection | 20722 |
| Complaint | 18223 |
| License Re-Inspection | 9018 |
| Complaint Re-Inspection | 7571 |
| Short Form Complaint | 6771 |
| Suspected Food Poisoning | 852 |
| Consultation | 671 |
| License-Task Force | 605 |
dataset2[grepl(pattern = "Canvass(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'CANVASS']
dataset2[grepl(pattern = "Complaint(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'COMPLAINT']
dataset2[grepl(pattern = "License(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'LICENSE']
dataset2[grepl(pattern = "Food(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'FOOD']
dataset2[grepl(pattern = "Fire(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'FIRE']
dataset2[is.na(Inspection.Recoded), Inspection.Recoded := 'OTHER']
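For reference, the same bucketing can be written as a single fcase() call (a sketch, not the author's code; fcase() ships with recent versions of data.table). One caveat: fcase() keeps the first matching condition, while the chain of assignments above lets the last matching pattern win, so the conditions are listed in reverse order here to preserve that precedence:
dataset2[, Inspection.Recoded := fcase(
  grepl("Fire",      `Inspection Type`, ignore.case = TRUE), "FIRE",
  grepl("Food",      `Inspection Type`, ignore.case = TRUE), "FOOD",
  grepl("License",   `Inspection Type`, ignore.case = TRUE), "LICENSE",
  grepl("Complaint", `Inspection Type`, ignore.case = TRUE), "COMPLAINT",
  grepl("Canvass",   `Inspection Type`, ignore.case = TRUE), "CANVASS",
  default = "OTHER"
)]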
dataset2[, .N, by = "Inspection.Recoded"][order(N, decreasing = T),]%>% kable %>% kable_styling( full_width = F, position = "left")
| Inspection.Recoded | N |
|---|---|
| CANVASS | 124723 |
| LICENSE | 35730 |
| COMPLAINT | 32575 |
| OTHER | 2432 |
| FOOD | 1043 |
| FIRE | 322 |
Risk level also needs to be cleaned up a bit and releveled:
dataset2[, .N, by = "Risk"][order(N, decreasing = T),]%>% kable %>% kable_styling( full_width = F, position = "left")
| Risk | N |
|---|---|
| Risk 1 (High) | 141129 |
| Risk 2 (Medium) | 38431 |
| Risk 3 (Low) | 17165 |
| | 69 |
| All | 31 |
We'll recode here:
dataset2[grepl("Risk\\s*1", Risk), Risk.Recoded := "High"]
dataset2[grepl("Risk\\s*2", Risk), Risk.Recoded := "Medium"]
dataset2[grepl("Risk\\s*3", Risk), Risk.Recoded := "Low"]
dataset2[(grepl("All", Risk) | Risk == ""), Risk.Recoded := "Other"]
dataset2[, Risk.Recoded := factor(Risk.Recoded, levels = c("Low", "Medium", "High", "Other"))]
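A quick cross-tabulation of the raw and recoded values is a cheap way to confirm nothing slipped through the recoding (a sketch; output not shown):
# Every raw Risk string should map to exactly one recoded level, with no NA left over
dataset2[, .N, by = .(Risk, Risk.Recoded)][order(-N)]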
dataset2[, .N, by = "Risk.Recoded"][order(N, decreasing = T),]%>% kable %>% kable_styling( full_width = F, position = "left")
| Risk.Recoded | N |
|---|---|
| High | 141129 |
| Medium | 38431 |
| Low | 17165 |
| Other | 100 |
The last variable that needs recoding is facility type. We can aggregate it into buckets, going from this:
# Note: this takes the first 20 distinct facility types in order of appearance, then sorts those 20 by count
dataset2[, .N, by = (`Facility Type`)][1:20,][order(N, decreasing = T)] %>% kable %>% kable_styling( full_width = F, position = "left")
| Facility Type | N |
|---|---|
| Restaurant | 131254 |
| Grocery Store | 25034 |
| School | 12132 |
| | 4768 |
| Children's Services Facility | 3115 |
| Bakery | 2884 |
| Daycare (2 - 6 Years) | 2703 |
| Daycare Above and Under 2 Years | 2400 |
| Catering | 1201 |
| Liquor | 861 |
| Daycare Combo 1586 | 758 |
| Mobile Food Preparer | 639 |
| TAVERN | 286 |
| Shared Kitchen User (Long Term) | 201 |
| STORE | 41 |
| tavern | 24 |
| PALETERIA | 22 |
| LIQUOR STORE | 7 |
| HERBALIFE/ZUMBA | 5 |
| JUICE BAR/GROCERY | 2 |
To this:
dataset2[grepl("restaurant", `Facility Type`, ignore.case = T), Type.Recoded := "Restaurant"]
dataset2[grepl("grocery\\s*|groceries\\s*", `Facility Type`, ignore.case = T), Type.Recoded := "Grocery"]
dataset2[grepl("school", `Facility Type`, ignore.case = T), Type.Recoded := "School"]
dataset2[grepl("bakery", `Facility Type`, ignore.case = T), Type.Recoded := "Bakery"]
dataset2[grepl("daycase", `Facility Type`, ignore.case = T), Type.Recoded := "Daycare"]
dataset2[grepl("liquor|tavern|alcohol", `Facility Type`, ignore.case = T), Type.Recoded := "Liquor"]
dataset2[grepl("mobile\\s*food", `Facility Type`, ignore.case = T), Type.Recoded := "Mobile Food"]
dataset2[is.na(Type.Recoded), Type.Recoded := "Other"]
dataset2[, table(Type.Recoded, useNA = "always")] %>% kable %>% kable_styling( full_width = F, position = "left")
| Type.Recoded | Freq |
|---|---|
| Bakery | 2936 |
| Grocery | 25372 |
| Liquor | 1238 |
| Mobile Food | 1535 |
| Other | 21955 |
| Restaurant | 131368 |
| School | 12421 |
| NA | 0 |
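Other is still a sizeable bucket, so it is worth peeking at what lands in it (a sketch; output not shown):
# Most common facility types that none of the patterns above captured
dataset2[Type.Recoded == "Other", .N, by = "Facility Type"][order(-N)][1:10]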
This won't give us a fancy map, but it will be helpful for observing the geographic spread:
#geom.points = dataset2[, list(`Inspection ID` = sample(`Inspection ID`, 1)), by = "Zip"]
geom.points = dataset2[sample(1:nrow(dataset2), 10000, replace = F),]
#geom.points = merge(geom.points, dataset2, by = "Inspection ID")
geo.plot = ggplot(geom.points, aes(Longitude, Latitude, color = Risk.Recoded)) +
geom_point(size = .25, show.legend = TRUE) +
coord_quickmap() +
theme_minimal() +
scale_color_canva()
ggplotly(geo.plot)
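If any of the sampled inspections are missing coordinates, ggplot drops them with a warning; a small guard placed before the plotting call (assumed, not in the original) keeps the output clean:
# Keep only sampled rows with usable coordinates
geom.points = geom.points[!is.na(Latitude) & !is.na(Longitude)]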
From here, we'll cast the data based on the inspection type:
wide.data = dataset2[, .(`Inspection ID`, Inspection.Recoded, Results)]
# One row per inspection, one column per inspection bucket, holding that inspection's Results value
wide.data = dcast(wide.data, formula = `Inspection ID` ~ Inspection.Recoded, fun.aggregate = max, value.var = "Results")
head(wide.data) %>% kable %>% kable_styling( full_width = F, position = "left")
| Inspection ID | CANVASS | COMPLAINT | FIRE | FOOD | LICENSE | OTHER |
|---|---|---|---|---|---|---|
| 44247 | NA | Pass | NA | NA | NA | NA |
| 44248 | Pass | NA | NA | NA | NA | NA |
| 44249 | Pass | NA | NA | NA | NA | NA |
| 44250 | Pass | NA | NA | NA | NA | NA |
| 44251 | Pass | NA | NA | NA | NA | NA |
| 44252 | NA | NA | NA | Pass | NA | NA |
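Because Inspection ID alone sits on the left-hand side of the formula, max() only ever has to aggregate if an ID appears more than once in the source; a quick assumed check that the cast was lossless:
# If every Inspection ID is unique, casting should not change the row count
stopifnot(nrow(wide.data) == nrow(dataset2))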
dataset2.wide = merge(dataset2[, -c("Inspection Type", "Results", "Inspection.Recoded", "Risk")], wide.data, by = "Inspection ID")
write.csv(dataset2.wide, file = "dataset2_wide.csv", row.names = F)
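Since the result is already a data.table, fwrite() writes the same CSV considerably faster (and skips row names by default):
fwrite(dataset2.wide, file = "dataset2_wide.csv")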
Question: Is there a relationship between the type of facility and the risk level assigned to it by the inspector?
We can start with some exploratory data analysis and produce a table showing, for each risk level, the facility type that accounts for the largest share of inspections at that level.
# Count inspections per risk level, then per (risk level, facility type), and compute each type's share
risk.total = dataset2[, list(total.risk.N = .N), by = "Risk.Recoded"]
risk.total = merge(dataset2[, .N, by = c("Risk.Recoded", "Type.Recoded")], risk.total, by = "Risk.Recoded")
risk.total[, percent := round(N/total.risk.N, 2)]
risk.total[, .SD[which.max(percent)], by = c("Risk.Recoded")] %>% kable %>% kable_styling( full_width = F, position = "left")
| Risk.Recoded | Type.Recoded | N | total.risk.N | percent |
|---|---|---|---|---|
| Low | Grocery | 8979 | 17165 | 0.52 |
| Medium | Restaurant | 24570 | 38431 | 0.64 |
| High | Restaurant | 105452 | 141129 | 0.75 |
| Other | Other | 82 | 100 | 0.82 |
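The same pattern is easier to see as a proportional bar chart, splitting each facility type's inspections by risk level (a sketch reusing the plotting setup from above):
# Share of each risk level within each facility type, excluding the catch-all bucket
risk.plot = ggplot(dataset2[Type.Recoded != "Other"], aes(x = Type.Recoded, fill = Risk.Recoded)) +
  geom_bar(position = "fill") +
  labs(x = "Facility type", y = "Share of inspections") +
  theme_minimal()
risk.plot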
The next step is a chi-squared test of independence between facility type and risk level:
table(dataset2[Type.Recoded != "Other"]$Type.Recoded, dataset2[Type.Recoded != "Other"]$Risk.Recoded) %>% kable %>% kable_styling( full_width = F, position = "left")
| | Low | Medium | High | Other |
|---|---|---|---|---|
| Bakery | 22 | 1763 | 1151 | 0 |
| Grocery | 8979 | 8546 | 7845 | 2 |
| Liquor | 1159 | 62 | 17 | 0 |
| Mobile Food | 785 | 672 | 77 | 1 |
| Restaurant | 1332 | 24570 | 105452 | 14 |
| School | 273 | 1007 | 11140 | 1 |
chisq.test(table(dataset2[Type.Recoded != "Other"]$Type.Recoded, dataset2[Type.Recoded != "Other"]$Risk.Recoded)) %>% tidy() %>% kable %>% kable_styling( full_width = F, position = "left")
## Warning in chisq.test(table(dataset2[Type.Recoded != "Other"]$Type.Recoded, :
## Chi-squared approximation may be incorrect
| statistic | p.value | parameter | method |
|---|---|---|---|
| 68658.2 | 0 | 15 | Pearson's Chi-squared test |
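The warning comes from the nearly empty "Other" risk column, whose expected counts are tiny; dropping that column or calling chisq.test() with simulate.p.value = TRUE would be the usual remedies. To see which facility/risk combinations drive the large statistic, we can also inspect the standardized residuals of the same test (a sketch; output not shown):
# Large positive residuals mark over-represented facility/risk combinations
risk.table = table(dataset2[Type.Recoded != "Other"]$Type.Recoded,
                   dataset2[Type.Recoded != "Other"]$Risk.Recoded)
round(chisq.test(risk.table)$stdres, 1)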
The chi-squared statistic tells us an association exists but not how strong it is, so we'll also calculate Cramér's V as an effect-size measure:
CramerV(dataset2$Type.Recoded, dataset2$Risk.Recoded)
## [1] 0.3347896
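One caveat: the contingency table above excluded the "Other" facility bucket, while this CramerV() call uses every row; computing it on the same subset keeps the two statistics comparable (a sketch; output not shown):
# Cramér's V on the same rows that went into the chi-squared test
CramerV(dataset2[Type.Recoded != "Other"]$Type.Recoded,
        dataset2[Type.Recoded != "Other"]$Risk.Recoded)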
Based on these results, we can conclude the following: