For this round, we'll be looking at Chicago food inspection data. Our analysis question is as follows:
Is there a relationship between the type of facility and the risk level assigned to it by the inspector?
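The chunks below lean on a handful of packages, inferred here from the functions they call (fread() and := from data.table, kable() from knitr, kable_styling() from kableExtra, the ggplot2/ggthemes/plotly plotting stack, tidy() from broom, and CramerV() from DescTools). A minimal setup chunk, assumed rather than shown in the original, would look like this:
library(data.table)  # fread(), :=, dcast()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(ggplot2)     # base plotting
library(ggthemes)    # scale_color_canva()
library(plotly)      # ggplotly()
library(broom)       # tidy()
library(DescTools)   # CramerV()
library(magrittr)    # %>% (also re-exported by several of the packages above)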
setwd('~/git/CUNY.MDS/DATA607')
dataset2 = fread('food-inspections.csv')
The first step we'll take here is eliminating unnecessary columns:
writeLines(names(dataset2))
## Inspection ID
## DBA Name
## AKA Name
## License #
## Facility Type
## Risk
## Address
## City
## State
## Zip
## Inspection Date
## Inspection Type
## Results
## Violations
## Latitude
## Longitude
## Location
## Historical Wards 2003-2015
## Zip Codes
## Community Areas
## Census Tracts
## Wards
Several of these columns are almost entirely empty.
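Before deleting anything, we can double-check that with a quick sketch (assumed, not part of the original) of the share of missing or blank values per column; the columns about to be dropped should come out near 1:
# Fraction of NA or empty-string values in each column
dataset2[, lapply(.SD, function(x) round(mean(is.na(x) | x == ""), 2))]
With the near-empty columns confirmed, we drop them: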
dataset2[, `:=` (`Historical Wards 2003-2015` = NULL, `Zip Codes` = NULL, `Census Tracts` = NULL, `Community Areas` = NULL, `Wards` = NULL, `Location` = NULL)]
We'll start by collapsing the many raw inspection types into a smaller set of broader buckets.
dataset2[, .N, by = "Inspection Type"][order(N, decreasing = T),][1:10, ] %>% kable %>% kable_styling( full_width = F, position = "left")
| Inspection Type | N |
|---|---|
| Canvass | 103995 |
| License | 26054 |
| Canvass Re-Inspection | 20722 |
| Complaint | 18223 |
| License Re-Inspection | 9018 |
| Complaint Re-Inspection | 7571 |
| Short Form Complaint | 6771 |
| Suspected Food Poisoning | 852 |
| Consultation | 671 |
| License-Task Force | 605 |
dataset2[grepl(pattern = "Canvass(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'CANVASS']
dataset2[grepl(pattern = "Complaint(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'COMPLAINT']
dataset2[grepl(pattern = "License(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'LICENSE']
dataset2[grepl(pattern = "Food(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'FOOD']
dataset2[grepl(pattern = "Fire(.)?", `Inspection Type`, ignore.case = T), Inspection.Recoded := 'FIRE']
dataset2[is.na(Inspection.Recoded), Inspection.Recoded := 'OTHER']
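For reference, the same bucketing can be written as a single fcase() call (a sketch, not the author's code; fcase() ships with recent versions of data.table). One caveat: fcase() keeps the first matching condition, while the chain of assignments above lets the last matching pattern win, so the conditions are listed in reverse order here to preserve that precedence:
dataset2[, Inspection.Recoded := fcase(
  grepl("Fire",      `Inspection Type`, ignore.case = TRUE), "FIRE",
  grepl("Food",      `Inspection Type`, ignore.case = TRUE), "FOOD",
  grepl("License",   `Inspection Type`, ignore.case = TRUE), "LICENSE",
  grepl("Complaint", `Inspection Type`, ignore.case = TRUE), "COMPLAINT",
  grepl("Canvass",   `Inspection Type`, ignore.case = TRUE), "CANVASS",
  default = "OTHER"
)]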
dataset2[, .N, by = "Inspection.Recoded"][order(N, decreasing = T),]%>% kable %>% kable_styling( full_width = F, position = "left")
| Inspection.Recoded | N |
|---|---|
| CANVASS | 124723 |
| LICENSE | 35730 |
| COMPLAINT | 32575 |
| OTHER | 2432 |
| FOOD | 1043 |
| FIRE | 322 |
Risk level also needs to be cleaned up a bit and releveled:
dataset2[, .N, by = "Risk"][order(N, decreasing = T),]%>% kable %>% kable_styling( full_width = F, position = "left")
| Risk | N |
|---|---|
| Risk 1 (High) | 141129 |
| Risk 2 (Medium) | 38431 |
| Risk 3 (Low) | 17165 |
| | 69 |
| All | 31 |
We'll recode here:
dataset2[grepl("Risk\\s*1", Risk), Risk.Recoded := "High"]
dataset2[grepl("Risk\\s*2", Risk), Risk.Recoded := "Medium"]
dataset2[grepl("Risk\\s*3", Risk), Risk.Recoded := "Low"]
dataset2[(grepl("All", Risk) | Risk == ""), Risk.Recoded := "Other"]
dataset2[, Risk.Recoded := factor(Risk.Recoded, levels = c("Low", "Medium", "High", "Other"))]
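A quick cross-tabulation of the raw and recoded values is a cheap way to confirm nothing slipped through the recoding (a sketch; output not shown):
# Every raw Risk string should map to exactly one recoded level, with no NA left over
dataset2[, .N, by = .(Risk, Risk.Recoded)][order(-N)]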
dataset2[, .N, by = "Risk.Recoded"][order(N, decreasing = T),]%>% kable %>% kable_styling( full_width = F, position = "left")
| Risk.Recoded | N |
|---|---|
| High | 141129 |
| Medium | 38431 |
| Low | 17165 |
| Other | 100 |
The last variable that needs recoding is facility type. We can aggregate it into buckets, going from this:
# Note: this takes the first 20 distinct facility types in order of appearance, then sorts those 20 by count
dataset2[, .N, by = (`Facility Type`)][1:20,][order(N, decreasing = T)] %>% kable %>% kable_styling( full_width = F, position = "left")
| Facility Type | N |
|---|---|
| Restaurant | 131254 |
| Grocery Store | 25034 |
| School | 12132 |
| | 4768 |
| Children's Services Facility | 3115 |
| Bakery | 2884 |
| Daycare (2 - 6 Years) | 2703 |
| Daycare Above and Under 2 Years | 2400 |
| Catering | 1201 |
| Liquor | 861 |
| Daycare Combo 1586 | 758 |
| Mobile Food Preparer | 639 |
| TAVERN | 286 |
| Shared Kitchen User (Long Term) | 201 |
| STORE | 41 |
| tavern | 24 |
| PALETERIA | 22 |
| LIQUOR STORE | 7 |
| HERBALIFE/ZUMBA | 5 |
| JUICE BAR/GROCERY | 2 |
To this:
dataset2[grepl("restaurant", `Facility Type`, ignore.case = T), Type.Recoded := "Restaurant"]
dataset2[grepl("grocery\\s*|groceries\\s*", `Facility Type`, ignore.case = T), Type.Recoded := "Grocery"]
dataset2[grepl("school", `Facility Type`, ignore.case = T), Type.Recoded := "School"]
dataset2[grepl("bakery", `Facility Type`, ignore.case = T), Type.Recoded := "Bakery"]
dataset2[grepl("daycase", `Facility Type`, ignore.case = T), Type.Recoded := "Daycare"]
dataset2[grepl("liquor|tavern|alcohol", `Facility Type`, ignore.case = T), Type.Recoded := "Liquor"]
dataset2[grepl("mobile\\s*food", `Facility Type`, ignore.case = T), Type.Recoded := "Mobile Food"]
dataset2[is.na(Type.Recoded), Type.Recoded := "Other"]
dataset2[, table(Type.Recoded, useNA = "always")] %>% kable %>% kable_styling( full_width = F, position = "left")
| Type.Recoded | Freq |
|---|---|
| Bakery | 2936 |
| Grocery | 25372 |
| Liquor | 1238 |
| Mobile Food | 1535 |
| Other | 21955 |
| Restaurant | 131368 |
| School | 12421 |
| NA | 0 |
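Other is still a sizeable bucket, so it is worth peeking at what lands in it (a sketch; output not shown):
# Most common facility types that none of the patterns above captured
dataset2[Type.Recoded == "Other", .N, by = "Facility Type"][order(-N)][1:10]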
This won't give us a fancy map, but it will be helpful for observing the geographic spread:
#geom.points = dataset2[, list(`Inspection ID` = sample(`Inspection ID`, 1)), by = "Zip"]
geom.points = dataset2[sample(1:nrow(dataset2), 10000, replace = F),]
#geom.points = merge(geom.points, dataset2, by = "Inspection ID")
geo.plot = ggplot(geom.points, aes(Longitude, Latitude, color = Risk.Recoded)) +
geom_point(size = .25, show.legend = TRUE) +
coord_quickmap() +
theme_minimal() +
scale_color_canva()
ggplotly(geo.plot)
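If any of the sampled inspections are missing coordinates, ggplot drops them with a warning; a small guard placed before the plotting call (assumed, not in the original) keeps the output clean:
# Keep only sampled rows with usable coordinates
geom.points = geom.points[!is.na(Latitude) & !is.na(Longitude)]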
From here, we'll cast the data based on the inspection type:
wide.data = dataset2[, .(`Inspection ID`, Inspection.Recoded, Results)]
# One row per inspection, one column per inspection bucket, holding that inspection's Results value
wide.data = dcast(wide.data, formula = `Inspection ID` ~ Inspection.Recoded, fun.aggregate = max, value.var = "Results")
head(wide.data) %>% kable %>% kable_styling( full_width = F, position = "left")
| Inspection ID | CANVASS | COMPLAINT | FIRE | FOOD | LICENSE | OTHER |
|---|---|---|---|---|---|---|
| 44247 | NA | Pass | NA | NA | NA | NA |
| 44248 | Pass | NA | NA | NA | NA | NA |
| 44249 | Pass | NA | NA | NA | NA | NA |
| 44250 | Pass | NA | NA | NA | NA | NA |
| 44251 | Pass | NA | NA | NA | NA | NA |
| 44252 | NA | NA | NA | Pass | NA | NA |
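Because Inspection ID alone sits on the left-hand side of the formula, max() only ever has to aggregate if an ID appears more than once in the source; a quick assumed check that the cast was lossless:
# If every Inspection ID is unique, casting should not change the row count
stopifnot(nrow(wide.data) == nrow(dataset2))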
dataset2.wide = merge(dataset2[, -c("Inspection Type", "Results", "Inspection.Recoded", "Risk")], wide.data, by = "Inspection ID")
write.csv(dataset2.wide, file = "dataset2_wide.csv", row.names = F)
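Since the result is already a data.table, fwrite() writes the same CSV considerably faster (and skips row names by default):
fwrite(dataset2.wide, file = "dataset2_wide.csv")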
Question: Is there a relationship between the type of facility and the risk level assigned to it by the inspector?
We can start with some exploratory data analysis and produce a table showing, for each risk level, the facility type that accounts for the largest share of inspections at that level.
# Count inspections per risk level, then per (risk level, facility type), and compute each type's share
risk.total = dataset2[, list(total.risk.N = .N), by = "Risk.Recoded"]
risk.total = merge(dataset2[, .N, by = c("Risk.Recoded", "Type.Recoded")], risk.total, by = "Risk.Recoded")
risk.total[, percent := round(N/total.risk.N, 2)]
risk.total[, .SD[which.max(percent)], by = c("Risk.Recoded")] %>% kable %>% kable_styling( full_width = F, position = "left")
| Risk.Recoded | Type.Recoded | N | total.risk.N | percent |
|---|---|---|---|---|
| Low | Grocery | 8979 | 17165 | 0.52 |
| Medium | Restaurant | 24570 | 38431 | 0.64 |
| High | Restaurant | 105452 | 141129 | 0.75 |
| Other | Other | 82 | 100 | 0.82 |
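The same pattern is easier to see as a proportional bar chart, splitting each facility type's inspections by risk level (a sketch reusing the plotting setup from above):
# Share of each risk level within each facility type, excluding the catch-all bucket
risk.plot = ggplot(dataset2[Type.Recoded != "Other"], aes(x = Type.Recoded, fill = Risk.Recoded)) +
  geom_bar(position = "fill") +
  labs(x = "Facility type", y = "Share of inspections") +
  theme_minimal()
risk.plot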
The next step is a chi-squared test of independence between facility type and risk level:
table(dataset2[Type.Recoded != "Other"]$Type.Recoded, dataset2[Type.Recoded != "Other"]$Risk.Recoded) %>% kable %>% kable_styling( full_width = F, position = "left")
| | Low | Medium | High | Other |
|---|---|---|---|---|
| Bakery | 22 | 1763 | 1151 | 0 |
| Grocery | 8979 | 8546 | 7845 | 2 |
| Liquor | 1159 | 62 | 17 | 0 |
| Mobile Food | 785 | 672 | 77 | 1 |
| Restaurant | 1332 | 24570 | 105452 | 14 |
| School | 273 | 1007 | 11140 | 1 |
chisq.test(table(dataset2[Type.Recoded != "Other"]$Type.Recoded, dataset2[Type.Recoded != "Other"]$Risk.Recoded)) %>% tidy() %>% kable %>% kable_styling( full_width = F, position = "left")
## Warning in chisq.test(table(dataset2[Type.Recoded != "Other"]$Type.Recoded, :
## Chi-squared approximation may be incorrect
| statistic | p.value | parameter | method |
|---|---|---|---|
| 68658.2 | 0 | 15 | Pearson's Chi-squared test |
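The warning comes from the nearly empty "Other" risk column, whose expected counts are tiny; dropping that column or calling chisq.test() with simulate.p.value = TRUE would be the usual remedies. To see which facility/risk combinations drive the large statistic, we can also inspect the standardized residuals of the same test (a sketch; output not shown):
# Large positive residuals mark over-represented facility/risk combinations
risk.table = table(dataset2[Type.Recoded != "Other"]$Type.Recoded,
                   dataset2[Type.Recoded != "Other"]$Risk.Recoded)
round(chisq.test(risk.table)$stdres, 1)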
The chi-squared statistic tells us an association exists but not how strong it is, so we'll also calculate Cramér's V as an effect-size measure:
CramerV(dataset2$Type.Recoded, dataset2$Risk.Recoded)
## [1] 0.3347896
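One caveat: the contingency table above excluded the "Other" facility bucket, while this CramerV() call uses every row; computing it on the same subset keeps the two statistics comparable (a sketch; output not shown):
# Cramér's V on the same rows that went into the chi-squared test
CramerV(dataset2[Type.Recoded != "Other"]$Type.Recoded,
        dataset2[Type.Recoded != "Other"]$Risk.Recoded)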
Based on these results, we can conclude the following: