Dataset 2: Chicago Food Inspections

For this round, we'll be looking at Chicago food inspection data. Our analysis question is as follows:

Find out if there is a trend of type of facilities and their risk level given by the inspector.

setwd('~/git/CUNY.MDS/DATA607')
dataset2 = fread('food-inspections.csv')

CSV Export

The first step we'll take here is eliminating unecessary columns:

writeLines(names(dataset2)) %>% kable()
## Inspection ID
## DBA Name
## AKA Name
## License #
## Facility Type
## Risk
## Address
## City
## State
## Zip
## Inspection Date
## Inspection Type
## Results
## Violations
## Latitude
## Longitude
## Location
## Historical Wards 2003-2015
## Zip Codes
## Community Areas
## Census Tracts
## Wards

A bunch of these columns are empty, so we'll drop them:

dataset2[, `:=` (`Historical Wards 2003-2015` = NULL, `Zip Codes` = NULL, `Census Tracts` = NULL, `Community Areas` = NULL, `Wards` = NULL, `Location` = NULL)]

Recoding variables

We'll start by grouping the inspection types by buckets into more granular categories.

dataset2[, .N, by = "Inspection Type"][order(N, decreasing = T),][1:10, ] %>% kable %>% kable_styling( full_width = F, position = "left") 
Inspection Type N
Canvass 103995
License 26054
Canvass Re-Inspection 20722
Complaint 18223
License Re-Inspection 9018
Complaint Re-Inspection 7571
Short Form Complaint 6771
Suspected Food Poisoning 852
Consultation 671
License-Task Force 605
dataset2[grepl(pattern = "Canvass(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'CANVASS']
dataset2[grepl(pattern = "Complaint(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'COMPLAINT']
dataset2[grepl(pattern = "License(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'LICENSE']
dataset2[grepl(pattern = "Food(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'FOOD']
dataset2[grepl(pattern = "Fire(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'FIRE']
dataset2[is.na(Inspection.Recoded), Inspection.Recoded := 'OTHER']

dataset2[, .N, by = "Inspection.Recoded"][order(N, decreasing = T),]%>% kable %>% kable_styling( full_width = F, position = "left") 
Inspection.Recoded N
CANVASS 124723
LICENSE 35730
COMPLAINT 32575
OTHER 2432
FOOD 1043
FIRE 322

Risk level also needs to be cleaned up a bit and releveled:

dataset2[, .N, by = "Risk"][order(N, decreasing = T),]%>% kable %>% kable_styling( full_width = F, position = "left") 
Risk N
Risk 1 (High) 141129
Risk 2 (Medium) 38431
Risk 3 (Low) 17165
69
All 31

We'll recode here:

dataset2[grepl("Risk\\s*1", Risk), Risk.Recoded := "High"]
dataset2[grepl("Risk\\s*2", Risk), Risk.Recoded := "Medium"]
dataset2[grepl("Risk\\s*3", Risk), Risk.Recoded := "Low"]
dataset2[(grepl("All", Risk) | Risk == ""), Risk.Recoded := "Other"]
dataset2[, Risk.Recoded := factor(Risk.Recoded, levels = c("Low", "Medium", "High", "Other"))]

dataset2[, .N, by = "Risk.Recoded"][order(N, decreasing = T),]%>% kable %>% kable_styling( full_width = F, position = "left") 
Risk.Recoded N
High 141129
Medium 38431
Low 17165
Other 100

The last variable that needs recoding is facility type. We can aggregate it into buckets

From this:

dataset2[, .N, by = (`Facility Type`)][1:20,][order(N, decreasing = T)]  %>% kable %>% kable_styling( full_width = F, position = "left") 
Facility Type N
Restaurant 131254
Grocery Store 25034
School 12132
4768
Children's Services Facility 3115
Bakery 2884
Daycare (2 - 6 Years) 2703
Daycare Above and Under 2 Years 2400
Catering 1201
Liquor 861
Daycare Combo 1586 758
Mobile Food Preparer 639
TAVERN 286
Shared Kitchen User (Long Term) 201
STORE 41
tavern 24
PALETERIA 22
LIQUOR STORE 7
HERBALIFE/ZUMBA 5
JUICE BAR/GROCERY 2

To this:

dataset2[grepl("restaurant", `Facility Type`, ignore.case = T), Type.Recoded := "Restaurant"]
dataset2[grepl("grocery\\s*|groceries\\s*", `Facility Type`, ignore.case = T), Type.Recoded := "Grocery"]
dataset2[grepl("school", `Facility Type`, ignore.case = T), Type.Recoded := "School"]
dataset2[grepl("bakery", `Facility Type`, ignore.case = T), Type.Recoded := "Bakery"]
dataset2[grepl("daycase", `Facility Type`, ignore.case = T), Type.Recoded := "Daycare"]

dataset2[grepl("liquor|tavern|alcohol", `Facility Type`, ignore.case = T), Type.Recoded := "Liquor"]
dataset2[grepl("mobile\\s*food", `Facility Type`, ignore.case = T), Type.Recoded := "Mobile Food"]
dataset2[is.na(Type.Recoded), Type.Recoded := "Other"]

dataset2[, table(Type.Recoded, useNA = "always")] %>% kable %>% kable_styling( full_width = F, position = "left") 
Type.Recoded Freq
Bakery 2936
Grocery 25372
Liquor 1238
Mobile Food 1535
Other 21955
Restaurant 131368
School 12421
NA 0

Geographic Visual

This won't give us a fancy map, but will be helpful in observing geographical spread:

#geom.points = dataset2[, list(`Inspection ID` = sample(`Inspection ID`, 1)), by = "Zip"]
geom.points = dataset2[sample(1:nrow(dataset2), 10000, replace = F),]

#geom.points = merge(geom.points, dataset2, by = "Inspection ID")

geo.plot = ggplot(geom.points, aes(Longitude, Latitude, color =  Risk.Recoded)) +
  geom_point(size = .25, show.legend = TRUE) +
  coord_quickmap() +
  theme_minimal() +
  scale_color_canva()

ggplotly(geo.plot)

Data Casting

From here, we'll cast the data based on the inspection type:

wide.data = dataset2[, .(`Inspection ID`, Inspection.Recoded, Results)]

wide.data = dcast(wide.data, formula = `Inspection ID` ~  Inspection.Recoded , fun = max, value.var = "Results")

head(wide.data)  %>% kable %>% kable_styling( full_width = F, position = "left") 
Inspection ID CANVASS COMPLAINT FIRE FOOD LICENSE OTHER
44247 NA Pass NA NA NA NA
44248 Pass NA NA NA NA NA
44249 Pass NA NA NA NA NA
44250 Pass NA NA NA NA NA
44251 Pass NA NA NA NA NA
44252 NA NA NA Pass NA NA

Export

dataset2.wide = merge(dataset2[, -c("Inspection Type", "Results", "Inspection.Recoded", "Risk")], wide.data, by = "Inspection ID")

write.csv(dataset2.wide, file = "dataset2_wide.csv", row.names = F)

Analysis

Question: Find out if there is a trend of type of facilities and their risk level given by the inspector.

We can start with some exploratory data analysis and produce a table with the highest risk percentage per risk strata.

risk.total = dataset2[, list(total.risk.N = .N), by = "Risk.Recoded"]
risk.total = merge(dataset2[, .N, by = c("Risk.Recoded", "Type.Recoded")], risk.total, by = "Risk.Recoded")
risk.total[, percent := round(N/total.risk.N, 2)]
risk.total[, .SD[which.max(percent)], by = c("Risk.Recoded")] %>% kable %>% kable_styling( full_width = F, position = "left") 
Risk.Recoded Type.Recoded N total.risk.N percent
Low Grocery 8979 17165 0.52
Medium Restaurant 24570 38431 0.64
High Restaurant 105452 141129 0.75
Other Other 82 100 0.82

The next step is using a statistical test for categorical variables:

table(dataset2[Type.Recoded != "Other"]$Type.Recoded, dataset2[Type.Recoded != "Other"]$Risk.Recoded) %>% kable %>% kable_styling( full_width = F, position = "left") 
Low Medium High Other
Bakery 22 1763 1151 0
Grocery 8979 8546 7845 2
Liquor 1159 62 17 0
Mobile Food 785 672 77 1
Restaurant 1332 24570 105452 14
School 273 1007 11140 1
chisq.test(table(dataset2[Type.Recoded != "Other"]$Type.Recoded, dataset2[Type.Recoded != "Other"]$Risk.Recoded)) %>% tidy() %>% kable %>% kable_styling( full_width = F, position = "left") 
## Warning in chisq.test(table(dataset2[Type.Recoded != "Other"]$Type.Recoded, :
## Chi-squared approximation may be incorrect
statistic p.value parameter method
68658.2 0 15 Pearson's Chi-squared test

Since risk has ordered levels, we should also calculate the Cramer's correlation coefficient

CramerV(dataset2$Type.Recoded, dataset2$Risk.Recoded)
## [1] 0.3347896

Key Takeaways

Based on results, we can conlude the following