library(tidyverse)
library(ggplot2)
library(tidycensus)
library(data.table)
library(readr)
library(dplyr)
library(ggpubr)
library(sf)
census_api_key("3d3549857b43fb9f12d359e3822a007cb6ad8ca9")
Introduction
For this project, I looked into the long-standing problem of food inequality in low-income communities. Residents of low-income neighborhoods have complained about the lack of access to healthier food options and the questionable quality of the food that is available.
The data for this observational study come from the City of Chicago’s Food Inspection results. The City of Chicago has been cited as having a large income disparity problem, and citizens have described the long-term discrimination faced in their communities [1]. For this observational study, I will look into the differences among the available food facilities and their ratings, and whether a community’s household income is related to the quality of its food services.
Chicago_Data<-read.csv("food-inspections.csv")
#Remove empty columns and location as we will use lat.& long.
Chicago_Data<-Chicago_Data%>%select(-c("Historical.Wards.2003.2015","Zip.Codes","Community.Areas","Census.Tracts","Wards","Location"))
#Clean up dates just in case for later analysis
Chicago_Data$Inspection.Date<-str_sub(Chicago_Data$Inspection.Date,start=1,end=-14)
Chicago_Data$Inspection.Date<-Chicago_Data$Inspection.Date%>%as.Date()
#Only look at results with findings by inspector. Take out invalid results
Clean_Chi<-Chicago_Data%>%filter(Results=="Pass"|Results=="Fail"|Results=="Pass w/ Conditions")
Clean_Chi<-Clean_Chi%>%filter(Risk=="Risk 1 (High)"|Risk=="Risk 3 (Low)"|Risk=="Risk 2 (Medium)")
summary(Clean_Chi)
## Inspection.ID DBA.Name AKA.Name License..
## Min. : 44247 Length:171562 Length:171562 Min. : 0
## 1st Qu.:1114516 Class :character Class :character 1st Qu.:1194648
## Median :1482753 Mode :character Mode :character Median :1979986
## Mean :1428682 Mean :1591752
## 3rd Qu.:1995410 3rd Qu.:2232951
## Max. :2352738 Max. :9999999
## NA's :16
## Facility.Type Risk Address City
## Length:171562 Length:171562 Length:171562 Length:171562
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## State Zip Inspection.Date Inspection.Type
## Length:171562 Min. :10014 Min. :2010-01-04 Length:171562
## Class :character 1st Qu.:60614 1st Qu.:2012-06-26 Class :character
## Mode :character Median :60625 Median :2014-12-10 Mode :character
## Mean :60629 Mean :2014-11-25
## 3rd Qu.:60643 3rd Qu.:2017-03-24
## Max. :60827 Max. :2019-12-04
## NA's :31
## Results Violations Latitude Longitude
## Length:171562 Length:171562 Min. :41.64 Min. :-87.91
## Class :character Class :character 1st Qu.:41.83 1st Qu.:-87.71
## Mode :character Mode :character Median :41.89 Median :-87.67
## Mean :41.88 Mean :-87.68
## 3rd Qu.:41.94 3rd Qu.:-87.63
## Max. :42.02 Max. :-87.53
## NA's :636 NA's :636
Data Analysis
Before the analysis, let’s understand the variables above. Risk follows the Chicago site’s definition: the risk of the selected food service “affecting the public’s health” [2]. Establishments must comply with the health code through proper maintenance and proper food preparation. Results holds the health code outcome for a particular place: Pass, Fail, or Pass with Conditions. A Pass with Conditions result means that significant violations were found during the inspection but were corrected immediately.
Correlation between the type of facilities and their likelihood of fair food grade
For this analysis, I looked into the types of facilities listed in the report. There were 475 distinct facility types in the data set, because inspectors did not use a fixed set of terms. This was the first challenge of the question, as the large variety of terms could not easily be grouped for simplicity.
Each facility was grouped by type to view its total count and its share of all inspections. The top facility types are restaurants, grocery stores, and schools, which account for about 67.5%, 12.9%, and 6.9% of the total count.
Next, let’s turn to these facilities and their associated risks. Given the limited granularity of the data, I have to assume that the risk to public health is based on the size of the daily foot traffic. The table Chi_Rank shows the total for each facility type and its risk breakdown. Restaurants rated Risk 1 have the highest count of inspections.
Now we need a system to check whether a facility type at least passes on average. The score is quantified with a point system: one point for a pass, half a point for a pass with conditions, and no points for a fail. If the total score exceeds half the number of inspections, then the majority of the facilities in that group are considered up to code.
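As a minimal sketch of the point system described above, here is the same scoring applied to a small made-up table. The facility types `Cafe` and `Kiosk` and the names `point_values` and `toy_scores` are hypothetical and only serve the illustration; the result labels match the ones used in the inspection data.

```r
library(dplyr)

# Hypothetical inspections for two made-up facility types
toy <- data.frame(
  Facility.Type = c("Cafe", "Cafe", "Cafe", "Cafe", "Kiosk", "Kiosk", "Kiosk"),
  Results = c("Pass", "Pass w/ Conditions", "Pass", "Fail",
              "Fail", "Pass w/ Conditions", "Fail")
)

# Point system from the text: Pass = 1, Pass w/ Conditions = 0.5, Fail = 0
point_values <- c("Pass" = 1, "Pass w/ Conditions" = 0.5, "Fail" = 0)

toy_scores <- toy %>%
  mutate(pts = unname(point_values[Results])) %>%   # look up points by result label
  group_by(Facility.Type) %>%
  summarise(n = n(), score = sum(pts), .groups = "drop") %>%
  mutate(
    benchmark = n / 2,                               # half of the inspections
    passMark = ifelse(score > benchmark,
                      "Majority Pass inspection",
                      "Majority Failed inspection")
  )

print(toy_scores)
```

Comparing `score` with `n / 2` is the same as asking whether the average points per inspection exceed 0.5. Because a pass with conditions counts as half a point, a group made up only of conditional passes would sit exactly on the benchmark and be labeled as failing under the strict `>` comparison.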
Once the score was computed, the table showed that the majority of high-risk restaurants passed their inspections. A closer look at the map shows a concentration of failed inspections in the middle of the city. This will be explored later. For now, there is not a strong correlation between the facility type and its results.
To double-check my findings, I tested whether the relationship between the health code scores and the risk and facility type was statistically significant. A linear regression was applied, and the p-values were above .05, which means the relationship is not statistically significant. I discounted the p-values that did fall below .05, as the totals are too small for confirmation.
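With hundreds of facility types, the full regression summary is long. One way to read it, sketched here on simulated data rather than the inspection data, is to pull the coefficient table out of the summary and keep only the rows below the .05 threshold. The data frame `toy` and the names `fit`, `coefs`, and `below_05` are assumptions made for this illustration.

```r
# Simulated scores for three risk levels (not the real inspection data)
set.seed(1)
toy <- data.frame(
  Risk = rep(c("Risk 1 (High)", "Risk 2 (Medium)", "Risk 3 (Low)"), each = 10),
  score = rnorm(30, mean = 50, sd = 10)
)

fit <- lm(score ~ Risk, data = toy)

# The coefficient table is a matrix with one row per term;
# the p-values live in the column named "Pr(>|t|)"
coefs <- coef(summary(fit))

# Keep only the terms whose p-value is below .05
below_05 <- coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
print(below_05)
```

The same filter applied to `chi_lm` would shorten the output to the terms that pass the threshold, which can then be checked against their inspection counts.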
#Let's see the possible facilities and their averages overall
Facility<-Clean_Chi%>%group_by(Facility.Type)%>%summarise(n=n())%>%mutate(Avg=n/sum(n))%>%arrange(desc(Avg))
as_tibble(Facility)
## # A tibble: 475 x 3
## Facility.Type n Avg
## <chr> <int> <dbl>
## 1 Restaurant 115764 0.675
## 2 Grocery Store 22121 0.129
## 3 School 11761 0.0686
## 4 Children's Services Facility 2945 0.0172
## 5 Bakery 2532 0.0148
## 6 Daycare (2 - 6 Years) 2386 0.0139
## 7 Daycare Above and Under 2 Years 2251 0.0131
## 8 Long Term Care 1293 0.00754
## 9 Catering 997 0.00581
## 10 Mobile Food Dispenser 800 0.00466
## # ... with 465 more rows
#Rank the Risk system-> Can cause risk to public health
Chi_Score<-Clean_Chi%>%select(c("Facility.Type","Risk","Results"))
Chi_Rank<-Chi_Score%>%group_by(Facility.Type)%>%count(Risk)%>%arrange(desc(n))
Chi_Rank<-Chi_Rank%>%mutate(freq=n/sum(n))
as_tibble(Chi_Rank)
## # A tibble: 579 x 4
## Facility.Type Risk n freq
## <chr> <chr> <int> <dbl>
## 1 Restaurant Risk 1 (High) 93330 0.806
## 2 Restaurant Risk 2 (Medium) 21288 0.184
## 3 School Risk 1 (High) 10564 0.898
## 4 Grocery Store Risk 3 (Low) 7539 0.341
## 5 Grocery Store Risk 2 (Medium) 7301 0.330
## 6 Grocery Store Risk 1 (High) 7281 0.329
## 7 Children's Services Facility Risk 1 (High) 2931 0.995
## 8 Daycare (2 - 6 Years) Risk 1 (High) 2362 0.990
## 9 Daycare Above and Under 2 Years Risk 1 (High) 2240 0.995
## 10 Bakery Risk 2 (Medium) 1500 0.592
## # ... with 569 more rows
#Now, categorize the pass/fail system into a score. Fail=0, Pass w/ Conditions=0.5, Pass=1
Chi_Score$Results<-str_replace_all(Chi_Score$Results, c("Pass w/ Conditions"="0.5","Pass"="1.0","Fail"="0.0"))
Chi_Score$Results<-as.numeric(Chi_Score$Results)
Chi_Score<-Chi_Score%>%group_by(Facility.Type,Risk)%>%summarise(score=sum(Results))%>%arrange(desc(score))
## `summarise()` has grouped output by 'Facility.Type'. You can override using the
## `.groups` argument.
#Join the health score scores and risk tables together with a inner join
Chi_Combine<-inner_join(Chi_Rank,Chi_Score,by=c("Facility.Type"="Facility.Type","Risk"="Risk"))
Chi_Combine<-Chi_Combine%>%mutate(benchmark=n/2)
#If the score is higher than half the total count == majority passed inspection
Chi_Combine<-Chi_Combine%>%mutate(passMark=ifelse(score>benchmark,"Majority Pass inspection","Majority Failed inspection"))
#Majority of Facility types passed inspection
as_tibble(Chi_Combine)
## # A tibble: 579 x 7
## Facility.Type Risk n freq score benchmark passMark
## <chr> <chr> <int> <dbl> <dbl> <dbl> <chr>
## 1 Restaurant Risk 1~ 93330 0.806 64838 46665 Majorit~
## 2 Restaurant Risk 2~ 21288 0.184 15329 10644 Majorit~
## 3 School Risk 1~ 10564 0.898 7940. 5282 Majorit~
## 4 Grocery Store Risk 3~ 7539 0.341 4964. 3770. Majorit~
## 5 Grocery Store Risk 2~ 7301 0.330 4685 3650. Majorit~
## 6 Grocery Store Risk 1~ 7281 0.329 5058 3640. Majorit~
## 7 Children's Services Facility Risk 1~ 2931 0.995 2144. 1466. Majorit~
## 8 Daycare (2 - 6 Years) Risk 1~ 2362 0.990 1743 1181 Majorit~
## 9 Daycare Above and Under 2 Years Risk 1~ 2240 0.995 1674. 1120 Majorit~
## 10 Bakery Risk 2~ 1500 0.592 1030 750 Majorit~
## # ... with 569 more rows
#Observe whether the passes are spread evenly overall. Some congregation of fails in the middle of the map
#size and alpha are fixed values, so they are set outside aes()
ggplot()+geom_sf()+geom_point(data=Clean_Chi,aes(x = Longitude, y = Latitude,color=Results),size=0.1,alpha=0.05)+theme_minimal()+labs(color="Inspection Results",title = "Chicago's Facility Types vs Health Inspection Results")
## Warning: Removed 636 rows containing missing values (geom_point).
#Let's see with our top three facility types their scores
cc_short<-Chi_Combine%>%filter(Facility.Type=="Restaurant"|Facility.Type=="Grocery Store"|Facility.Type=="School")
as_tibble(cc_short)
## # A tibble: 9 x 7
## Facility.Type Risk n freq score benchmark passMark
## <chr> <chr> <int> <dbl> <dbl> <dbl> <chr>
## 1 Restaurant Risk 1 (High) 93330 0.806 64838 46665 Majority Pass in~
## 2 Restaurant Risk 2 (Medium) 21288 0.184 15329 10644 Majority Pass in~
## 3 School Risk 1 (High) 10564 0.898 7940. 5282 Majority Pass in~
## 4 Grocery Store Risk 3 (Low) 7539 0.341 4964. 3770. Majority Pass in~
## 5 Grocery Store Risk 2 (Medium) 7301 0.330 4685 3650. Majority Pass in~
## 6 Grocery Store Risk 1 (High) 7281 0.329 5058 3640. Majority Pass in~
## 7 Restaurant Risk 3 (Low) 1146 0.00990 752. 573 Majority Pass in~
## 8 School Risk 2 (Medium) 950 0.0808 757 475 Majority Pass in~
## 9 School Risk 3 (Low) 247 0.0210 185 124. Majority Pass in~
## Is there a statistically significant relationship between facility type and risk and the scores? Not statistically significant, as p-values are greater than .05
chi_lm<-lm(score~Facility.Type+Risk,data = Chi_Combine)
# Summary : The main three facilities have p-values<.05
#I commented out the linear regression summary as it lists all 475 facility types
#summary(chi_lm)
A community’s average household income vs the quality of the food services provided
For this analysis, I used tidycensus to retrieve the median household income (variable B19013_001) from the 2020 census data. The only available variable that matched our needs reports the data by tract. A tract is a region that the census uses to map its surveys; the data are not available at the zip code level, which would have made mapping these incomes easier. So, I used an additional source to filter the tidycensus data to only include tracts in the Chicago area. The Chicago Data Portal provides the CensusTracts file, which includes all of the current tracts of Chicago with their GEOIDs [3]. I then used CensusTracts as a filter list, matching the Chicago GEOIDs against those in the census income data for Illinois. As a result, there is an approximate match between the stores and their tracts.
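The GEOID filter described above can be sketched on a toy example. The GEOID values and income figures below are made up, and `census_income`, `chicago_tracts`, and `chicago_income` are hypothetical names. The sketch assumes the lookup file may be read with a numeric GEOID column while the census table stores GEOID as text, so the lookup values are converted to character before matching.

```r
library(dplyr)

# Hypothetical statewide income table (GEOID stored as text)
census_income <- data.frame(
  GEOID = c("17031010100", "17031010201", "17043840000"),
  estimate = c(41000, 58000, 97000)
)

# Hypothetical Chicago lookup table (GEOID10 read in as numeric)
chicago_tracts <- data.frame(GEOID10 = c(17031010100, 17031010201))

# Keep only the tracts whose GEOID appears in the Chicago lookup
chicago_income <- census_income %>%
  filter(GEOID %in% as.character(chicago_tracts$GEOID10))

print(chicago_income)
```

Converting explicitly keeps the comparison from depending on implicit type coercion inside `%in%`.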
For the mapping, there is a side-by-side comparison of the median household income and the inspection results. In the middle of Chicago, the tract communities have household incomes lower than 50K annually. The food quality map has a grouping of mostly failed inspections in the same geographic area. There also appears to be a trend in which denser red clusters of failed inspections coincide with low-income areas, as shown in the bottom left corner of the map.
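The visual comparison could be backed up by attaching each inspection to the tract that contains it. The following is only a sketch under stated assumptions, not the method used in this report: the two square "tracts", their GEOIDs `A` and `B`, the incomes, and the four inspection points are invented, and the helper `square()` is hypothetical.

```r
library(sf)

# Hypothetical helper: a square polygon with its lower-left corner at (xmin, ymin)
square <- function(xmin, ymin, size = 0.1) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin),
    c(xmin + size, ymin + size), c(xmin, ymin + size),
    c(xmin, ymin)
  )))
}

# Two made-up tracts with made-up median incomes
tracts <- st_sf(
  GEOID = c("A", "B"),
  estimate = c(42000, 95000),
  geometry = st_sfc(square(-87.8, 41.8), square(-87.7, 41.8), crs = 4326)
)

# Made-up inspections with coordinates
inspections <- data.frame(
  Results = c("Fail", "Pass", "Pass", "Fail"),
  Longitude = c(-87.75, -87.74, -87.65, -87.76),
  Latitude = c(41.85, 41.86, 41.85, 41.84)
)
pts <- st_as_sf(inspections, coords = c("Longitude", "Latitude"), crs = 4326)

# Attach the containing tract (and its income) to every inspection point
joined <- st_join(pts, tracts, join = st_within)

# Share of failed inspections per tract
fail_share <- tapply(joined$Results == "Fail", joined$GEOID, mean)
print(fail_share)
```

With the real data, rows with missing coordinates would likely need to be dropped before building the point layer, and the fail share per tract could then be compared against the income estimate.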
#tidycensus holds the household (HH) income; geometry holds the lat/long coordinates
area_HH<-get_acs(geography = "tract",
                 variables = "B19013_001",state = "IL", geometry = TRUE)
#Need a lookup table to match GEOIDs within Chicago boundaries (Chicago Data Portal, updated 2019)
geo_lookup<-read_csv("CensusTracts.csv",col_select="GEOID10")
area_HH<-area_HH%>%filter(GEOID %in% geo_lookup$GEOID10)
#Will not get exact matches with the Chicago data, as a tract is a blocked area. Let's layer both data sets onto each other
#Mapping risk to location
# mapping HH income to risk
g1<-area_HH%>%
ggplot(aes(fill = estimate)) +
geom_sf(color = NA) +
scale_fill_viridis_c(option = "viridis")+labs(fill="Household Income",title = "Chicago's median Household Income vs Food rating")+theme_minimal()
#Using sf to map inspector results to Chicago map
g2<-ggplot(data=area_HH)+geom_sf()+geom_point(data=Clean_Chi,aes(x = Longitude, y = Latitude,color=Results),size=0.1,alpha=0.05)+theme_minimal()
#Note: G2 takes a few minutes to load!
g1
g2
#ggarrange(g1,g2,nrow = 2)
Takeaways
Returning to the facility type and its relationship to a lower health code score, there is a trend of red and blue points in the middle of the Chicago map. From both maps, there appears to be a stark difference in household income that coincides with the quality of the food services provided. It appears that the likelihood of a food facility having a health code violation increases in lower-income areas.
[1] https://www.chicagotribune.com/living/health/ct-life-inequity-data-policy-roots-chicago-20200726-r3c7qykvvbfm5bdjm4fpb6g5k4-story.html
[2] https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF
[3] https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Census-Tracts-2010/5jrd-6zik