Life expectancy is the average age that a person is estimated to live. Predicting life expectancy at the community level can be an ethical use of machine learning as we can identify communities/census tracts that need extra attention in terms of distribution of health resources and educational campaigns urging residents to make healthier lifestyle choices. Unfortunately, in Chicago, there are life expectancy gaps based on race. WWTW reported that social conditions create gaps in disease and premature mortality as a result of structural racism, lack of investment in communities, and poverty.
Similarly, the Red Line Project also found that Chicago’s African American community have the lowest life expectancy, and that cardiovascular disease can be the reason why since 60% of Black males end up developing some form of heart disease. Food deserts, which are low-income areas or neighborhoods that have a lack of supermarkets/healthy food options, are also a reason why low-income communities have higher rates of heart disease. Red Line Project also found that these low-income communities don’t have access to good quality healthcare. Moreover, education is also another important factor because people with lower education tend to work who work high-stress jobs, and hard laborers are at risk for anxiety and burn out.
ABC news reported that the life expectancy gap in Chicago is growing among Blacks and Whites. This same article also attributed diabetes, infant mortality, HIV prevalence, and homicide rate as factors that explain the life expectancy gap.
Chicago Sun-times cites 5 main factors for lower life expectancy: chronic diseases, homicide, infant mortality, opioid overdoses, and infections (flu, HIV, etc). In fact, the diabetes rate is 70% higher for Blacks, homicide rate are 9 times higher in Black communities, Black infant mortality rate is 9 times higher, and the rate of opioid-related overdoses are also 3 times higher.
Lastly, in a media press release, reported by Mayor Lightfoot and Commissioner of Chicago Department of Public Health (CDPH) Allison Arwady 2 days ago, they reported that the life expectancy for Blacks dropped below 70 years for the first time in decades. The main drivers of the racial life expectancy gap, according to this article, are chronic disease (heart disease, cancer, diabetes); homicide, infant mortality, HIV, flu; and opioid overdose. Other outcomes include social issues including lack of access to stable housing, food, childcare, and stable income.
The CDPH is working to address this gap through the Healthy Chicago 2025, a 5 year action plan to close this life expectancy gap. Thus, this model would be considered an ethical use by CDPH in predicting life expectancy and addressing this gap.
Health data at the census tract level provided by the Centers for Disease Control and Prevention (CDC), Division of Population Health.
Total crimes in 2015 provided by the Chicago Police Department.
Total grocery stores in 2013, and only one available in the Open Data Portal.
4)Life Expectancy 2010-2015 at the Census Tract level
Only data I found at the census tract level detailing life expectancy.
5)Chicago Boundaries Shapefile of all of the 77 Chicago Community Areas.
6)Package CMAPGEO Shapefile of Chicago boundaries to create the maps
7)American Community Survey 2015 Population, Education and Income estimates obtained through the tidycensus package.
1)PLACES Health Data at the census tract level in 2019 provided by CDC, National Center for Chronic Disease Prevention and Health Promotion, Division of Population Health
2)Crimes- 2019 Total crimes in 2019 as reported by CPD.
4)Chicago Health Atlas Used to get 2019 actual life expectancy values by community area to compare the predictions. Data source: Illinois Department of Public Health, Death Certificate Data Files
5)American Community Survey: 2019 Population, Education and Income estimates for 2019 obtained through the tidycensus package.
#lifeexpc <- read.socrata("https://data.cdc.gov/resource/5h56-n989.json")
#lifeexpc = lifeexpc %>%
#filter(state_name == "Illinois", county_name == "Cook County, IL")
#crimes2015 = read.socrata("https://data.cityofchicago.org/resource/vwwp-7yr9.json")
#homicides2015 = crimes2015 %>% filter(primary_type == "HOMICIDE")
#variables <- load_variables(2015,"acs5")
#il15 <- get_acs(geography = "tract",
# variables = c(medincome = "B19013_001",
# totalpop = "B02001_001",
# white_alone = "B02001_002",
# black_alone = "B02001_003",
# bachelorsdegree = "B06009_005",
# foodstamps = "B09010_001"),
# state = "IL",
# year = 2015,
# output='wide')
# il15 <- il15 %>% mutate(
# pct_black = black_aloneE / totalpopE,
# pct_college = bachelorsdegreeE/totalpopE,
# pct_foodstamps = foodstampsE/totalpopE
#)
# iltracts <- tigris::tracts(state = "IL", year = 2019, cb = TRUE)
# cookcounty_tracts <- iltracts %>%
# left_join(il15 %>% select(GEOID, pct_black, medincomeE, pct_college, pct_foodstamps)
#) %>% filter(COUNTYFP == '031') #merging il tracts with il15 census data, filtering it to cook county
#ilcountycook <- tigris::county_subdivisions(state=17, county=31, cb=TRUE)
# chicago <- ilcountycook %>% filter(NAME == "Chicago") %>% select(region=NAME)
#chicago_tracts <- cookcounty_tracts %>% st_intersection(
# chicago
# )
#find tract level data for the crimes
# homicides2015 <- homicides2015 %>%
# st_as_sf(coords=c("longitude", "latitude"))
# st_crs(homicides2015) <- st_crs(chicago_tracts)
# homicides2015 <- homicides2015 %>% st_join(chicago_tracts, join = st_intersects) %>%
# as.data.frame()
# homicides2015 = homicides2015 %>% group_by(TRACTCE) %>% summarize(Total.Homicides = n())
#grocery stores so we can explore food desserts in Chicago. Which census tracts don't have any grocery stores in the community?
#grocery_stores <- read.socrata("https://data.cityofchicago.org/resource/53t8-wyrc.json") #data is from 2013
# grocery_stores = grocery_stores %>% group_by(census_tract) %>% summarize(total.grocery.stores = n())
#grocery_stores$census_tract <- substring(grocery_stores$census_tract, 6)
#length(unique(grocery_stores$census_tract)) #there is a duplicate somewhere
#grocery_stores = grocery_stores %>% distinct() #remove the duplicate
# census_tracts <- read.socrata("https://data.cityofchicago.org/resource/74p9-q2aq.json")
#comarea <- read.socrata("https://data.cityofchicago.org/resource/igwz-8jzy.json")
#tracts_comarea <- census_tracts %>% select(tractce10, commarea)
# comarea <- comarea %>% select(community, area_numbe)
# tracts_comarea <- left_join(tracts_comarea, comarea, by = c("commarea" = "area_numbe"))
# cities <- read.socrata("https://chronicdata.cdc.gov/resource/kucs-wizg.json")
# cities = cities %>% filter(stateabbr == "IL", placename == "Chicago")
#remove 17031 for cities
# cities$tractfips <- substring(cities$tractfips, 6)
#remove period for lifeexp
#lifeexpc$full_ct_num <- sub("[.]", "", lifeexpc$full_ct_num)
#merge with LE
#cities = left_join(cities, lifeexpc, by = c("tractfips" = "full_ct_num"))
#sum(is.na(cities$le)) #96 missing values for LE
#merge cities with homicide in 2015
# cities <- left_join(cities, homicides2015, by = c("tractfips" = "TRACTCE"))
# cities <- cities %>% mutate(Total.Homicides = replace_na(Total.Homicides,0)) # replace 0 homicides with 0
#merge cities data with census data
#census_chicago <- chicago_tracts %>% select(GEOID, pct_black, medincomeE, pct_college, pct_foodstamps) %>% as.data.frame()
#census_chicago$GEOID <- substring(census_chicago$GEOID,6)
#cities <- left_join(cities, census_chicago, by = c("tractfips" = "GEOID"))
#finally, merge cities data with grocery stores data
#cities <- left_join(cities, grocery_stores, by = c("tractfips" = "census_tract"))
#cities <- cities %>% mutate(total.grocery.stores = replace_na(total.grocery.stores, 0))
#merge with community area
#cities <- left_join(cities, tracts_comarea, by = c("tractfips" = "tractce10"))
#cities
#cities_final <- cities %>% select(tractfips, community, commarea, access2_crudeprev, bpmed_crudeprev, chd_crudeprev, copd_crudeprev, csmoking_crudeprev , diabetes_crudeprev, highchol_crudeprev, kidney_crudeprev, obesity_crudeprev, sleep_crudeprev , stroke_crudeprev, Total.Homicides, pct_black, medincomeE, pct_college, pct_foodstamps, total.grocery.stores, le)
#make numeric
# cities_final[, c(3:ncol(cities_final))] <- sapply(cities_final[, c(3:ncol(cities_final))], as.numeric)
# sum(is.na(cities_final$le)) #remove missing values of LE
# cities_final = cities_final %>% filter(
# !is.na(le)
# )
#cities_final
#write.csv(cities_final, "/Users/macbookair/Library/Mobile Documents/com~apple~CloudDocs/PA 470 AI & ML/Final Project Life Expectancy/PA 470 Final project/cities_final.csv", row.names = FALSE)
cities_final <- read.csv("cities_final.csv")
#crimes2019 = read.socrata("https://data.cityofchicago.org/resource/w98m-zvie.json")
#homicides2019 = crimes2019 %>% filter(primary_type == "HOMICIDE")
# tract data
#variables <- load_variables(2019,"acs5")
#view(variables)
#il19 <- get_acs(geography = "tract",
#variables = c(medincome = "B19013_001",
# totalpop = "B02001_001",
# white_alone = "B02001_002",
# black_alone = "B02001_003",
# bachelorsdegree = "B06009_005",
# foodstamps = "B09010_001"),
# state = "IL",
# year = 2019,
# output='wide')
#il19 <- il19 %>% mutate(
# pct_black = black_aloneE / totalpopE,
# pct_college = bachelorsdegreeE/totalpopE,
# pct_foodstamps = foodstampsE/totalpopE
#)
#iltracts <- tigris::tracts(state = "IL", year = 2019, cb = TRUE)
#cookcounty_tracts <- iltracts %>%
# left_join(il19 %>% select(GEOID, pct_black, medincomeE, pct_college, pct_foodstamps)
#) %>% filter(COUNTYFP == '031') # merging il tracts with il15 census data, filtering it to cook county
#ilcountycook <- tigris::county_subdivisions(state=17, county=31, cb=TRUE)
#chicago <- ilcountycook %>% filter(NAME == "Chicago") %>% select(region=NAME)
#chicago_tracts <- cookcounty_tracts %>% st_intersection(
# chicago
#)
## find tract level data for the crimes
#homicides2019 <- homicides2019 %>%
# st_as_sf(coords=c("longitude", "latitude"))
#st_crs(homicides2019) <- st_crs(chicago_tracts)
#homicides2019 <- homicides2019 %>% st_join(chicago_tracts, join = st_intersects) %>%
# as.data.frame()
#homicides2019 = homicides2019 %>% group_by(TRACTCE) %>% summarize(Total.Homicides = n())
#grocery stores. No new data from 2019, it is the same
#census tracts
#tracts_comarea
#### PLACES data
#PLACES_IL <- read.socrata("https://chronicdata.cdc.gov/resource/yjkw-uj5s.json")
#PLACES_IL = PLACES_IL %>% filter(stateabbr == "IL", countyname == "Cook")
# remove 17031 for PLACES_IL
#PLACES_IL$tractfips <- substring(PLACES_IL$tractfips, 6)
# merge PLACES_IL with homicide in 2019
#PLACES_IL <- left_join(PLACES_IL, homicides2019, by = c("tractfips" = "TRACTCE"))
#PLACES_IL <- PLACES_IL %>% mutate(Total.Homicides = replace_na(Total.Homicides,0)) # replace 0 homicides with 0
# merge PLACES_IL data with census data
#census_chicago2 <- chicago_tracts %>% select(GEOID, pct_black, medincomeE, pct_college, #pct_foodstamps) %>% as.data.frame()
#census_chicago2$GEOID <- substring(census_chicago2$GEOID,6)
#PLACES_IL <- left_join(PLACES_IL, census_chicago2, by = c("tractfips" = "GEOID"))
# finally, merge PLACES_IL data with grocery stores data
#PLACES_IL <- left_join(PLACES_IL, grocery_stores, by = c("tractfips" = "census_tract"))
#PLACES_IL <- PLACES_IL %>% mutate(total.grocery.stores = replace_na(total.grocery.stores, 0))
# merge with community area
#PLACES_IL <- left_join(PLACES_IL, tracts_comarea, by = c("tractfips" = "tractce10"))
#PLACES_IL = PLACES_IL %>% select(tractfips, community, commarea, access2_crudeprev, bpmed_crudeprev, chd_crudeprev, copd_crudeprev, csmoking_crudeprev , diabetes_crudeprev, highchol_crudeprev, kidney_crudeprev, obesity_crudeprev, sleep_crudeprev , stroke_crudeprev, Total.Homicides, pct_black, medincomeE, pct_college, pct_foodstamps, total.grocery.stores)
#PLACES_IL[, c(3:ncol(PLACES_IL))] <- sapply(PLACES_IL[, c(3:ncol(PLACES_IL))], as.numeric)
#write.csv(PLACES_IL, "/Users/macbookair/Library/Mobile Documents/com~apple~CloudDocs/PA 470 AI & ML/Final Project Life Expectancy/PA 470 Final project/PLACES_IL.csv", row.names = FALSE)
PLACES_IL <- read.csv("PLACES_IL.csv")
sex_le <- read.socrata("https://data.cityofchicago.org/resource/3qdj-cqb8.json")
sex_le = sex_le %>% filter(sex == "All")
sex_le = sex_le %>% filter(!race_ethnicity == "All")
sex_le$X_2010_life_expectancy <- as.numeric(sex_le$X_2010_life_expectancy)
sex_le %>% ggplot(mapping = aes(x = race_ethnicity,
y= X_2010_life_expectancy, fill= race_ethnicity)) +
scale_fill_manual(values = c("#e09f3e", "#540b0e", "#56B4E9")) +
scale_y_continuous(breaks = seq(0, 350, by = 50)) +
labs( x= "Race",
y= "Number of Years Expected to Live",
title = "Average Life Expectancy in 2010",
subtitle = "Significant health disparities") +
geom_col() + geom_text(aes(x = race_ethnicity, y = X_2010_life_expectancy,
label = round(X_2010_life_expectancy, 0)), size = 3, hjust=0 , vjust= -0.5 , family= "Arial") +
theme_classic() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 14),
axis.title = element_text(size = 14, face = "bold"),
axis.text = element_text(size = 12, face = "bold"),
axis.line = element_blank(),
axis.ticks = element_blank(),
legend.position = "none")
African Americans in 2010 lived 7 years less on average than whites, 13 years less than Hispanic/Latinx
line_graph = sex_le %>%
select(race_ethnicity,X_1990_life_expectancy, X_2000_life_expectancy, X_2010_life_expectancy)
line_graph$X_1990_life_expectancy <- as.double(line_graph$X_1990_life_expectancy)
line_graph$X_2000_life_expectancy <- as.double(line_graph$X_2000_life_expectancy)
line_graph = line_graph %>%
pivot_longer(!race_ethnicity,
names_to = "Year",
values_to = "Life_Expectancy")
line_graph$Year <- gsub("[^0-9.-]", "", line_graph$Year)
line_graph$Year <- as.factor(line_graph$Year)
p <- ggplot(data=line_graph, aes(x=Year, y= Life_Expectancy, group=race_ethnicity, colour=race_ethnicity)) +
geom_line() +
geom_point() +
labs(x='Year', y='Average Years expected to Live', title='Average Life Expectancy by Race/Ethnicity', subtitle = 'Source: Chicago Open Data Portal') + theme_classic() + theme(
plot.title = element_text(face = "bold", size = 12), plot.subtitle = element_text(face = "italic", size = 12),
axis.title = element_text(size = 10, face = "bold"), axis.text = element_text(size = 8, face = "bold"), axis.line = element_blank(),
axis.ticks = element_blank())
fig <- ggplotly(p)
fig
Trend of Life Expectancy by group has been going up steadily for all Whites, Blacks, and Hispanic
filtered_correlation <- cities_final
filtered_correlation <- rename(filtered_correlation, c( "heart_dis" = "chd_crudeprev", "smoking" = "csmoking_crudeprev", "diabetes" = "diabetes_crudeprev",
"kidney_dis" = "kidney_crudeprev",
"obesity" = "obesity_crudeprev",
"homicides" = "Total.Homicides",
"Black" = "pct_black",
"income" = "medincomeE",
"college" = "pct_college",
"grocery_stores" = "total.grocery.stores",))
filtered_correlation = filtered_correlation %>%
select(le , heart_dis, smoking, diabetes,kidney_dis, obesity, homicides, Black , income, college, grocery_stores)
correlation <-cor(filtered_correlation)
corrplot(correlation, type="upper", order="hclust",
col=brewer.pal(n=8, name="RdYlBu"))
Life Expectancy has a negative correlation with the following variables: homicides, % Black, heart disease, diabetes, kidney disease, smoking, obesity, grocery stores (weak association) In fact, the strongest, negative correlation is with the % Black and obesity variables. Surprisingly, it has a weak relationship with grocery stores.
Life Expectancy has a strong, positive correlation with income and % with college education.
15 predictors at the census tract level, crude prevalence rates(total # divided by the population)
Chronic Health Diseases
diabetes_crudeprev = % with diabetes
CHD_CrudePrev = % with heart disease
copd_crudeprev = % with chronic obstructive pulmonary disease, a type of lung disease
kidney_crudeprev = % with chronic kidney disease
stroke_crudeprev = % who has had stroke in the year the data was collected(2015 and 2019)
Risk factors for Health Diseases
obesity_crudeprev = % considered obese
csmoking_crudeprev = % who smoke frequently
sleep_crudeprev = % who sleep less than 7 hrs
Social and economic factors
access2_crudeprev = % lack of health insurance among adults 18-65yrs
pct_foodstamps = % in community with SNAP benefits (food stamps)
medincomeE = median income
pct_black = % of African Americans in census tract
pct_college = % with a Bachelor’s degree
total.grocery.stores = total grocery stores
Total. Homicides = total homicides in the census tract(2015 and 2019)
DV is le = average life expectancy
Unit of analysis is tractfips = Chicago Census Tract
missing_values = cities_final %>%
select(everything()) %>%
summarise_all(funs(sum(is.na(.))))# no missing values
group <- rsample::initial_split(cities_final)
train <- training(group)
test <- testing(group)
le_folds <- vfold_cv(train, repeats=1, v=3)
tree_model <- parsnip::decision_tree(tree_depth=5) %>%
set_engine("rpart") %>%
set_mode("regression")
le_rec <- recipe (le ~ access2_crudeprev + chd_crudeprev+ copd_crudeprev + csmoking_crudeprev + diabetes_crudeprev + kidney_crudeprev + obesity_crudeprev + sleep_crudeprev + stroke_crudeprev + Total.Homicides + pct_black + medincomeE + pct_college + pct_foodstamps + total.grocery.stores, data = train) %>%
step_corr(all_predictors()) %>% # drops variables that correlate with each other
step_normalize(all_numeric_predictors()) # normalizes numeric variable to have SD of 1 and mean of 0
le_workflow <-
workflow() %>%
add_model(tree_model) %>%
add_recipe(le_rec)
le_fit <- le_workflow %>%
fit(data= train)
le_preds <- augment(le_fit, test)
DT::datatable(yardstick::rmse(
le_preds,
truth = le,
estimate = .pred
))
tree_fit <- le_fit %>%
extract_fit_engine()
rpart.plot::rpart.plot(tree_fit)
Regression Decision tree was used for the first model with a tree depth of 5. Important thing to note here is that the most important variable identified in the model is the % of African Americans in the census tract.
linear_reg_spec <-
linear_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet")
xgb_spec <-
boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(),
min_n = tune(), sample_size = tune(), trees = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
#boosted = xgb_spec ✓
cart_spec <-
decision_tree(cost_complexity = tune(), min_n = tune()) %>%
set_engine("rpart") %>%
set_mode("regression")
#decision_tree= cart_spec
knn_spec <-
nearest_neighbor(neighbors = tune(), dist_power = tune(), weight_func = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
nnet_spec <-
mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>%
set_engine("nnet", MaxNWts = 2600) %>%
set_mode("regression")
my_set <- workflow_set(
preproc = list(normalized = le_rec),
models = list(
linear_reg = linear_reg_spec, boosted = xgb_spec, decision_tree= cart_spec, KNN = knn_spec, neural_network = nnet_spec)
)
grid_ctrl <-
control_grid(
save_pred = FALSE,
save_workflow = FALSE
)
grid_results <-
my_set %>%
workflow_map(
seed = 1503,
resamples = le_folds,
grid = 15,
control = grid_ctrl,
verbose = FALSE
)
nrow(collect_metrics(grid_results, summarize = FALSE)) # 450
## [1] 450
grid_results %>%
rank_results() %>%
filter(.metric == "rmse") %>%
select(model, .config, rmse = mean, rank) %>%
kableExtra::kable() %>%
kableExtra::kable_material(c("striped", "hover"))
| model | .config | rmse | rank |
|---|---|---|---|
| linear_reg | Preprocessor1_Model10 | 2.578095 | 1 |
| linear_reg | Preprocessor1_Model14 | 2.592534 | 2 |
| linear_reg | Preprocessor1_Model03 | 2.608995 | 3 |
| linear_reg | Preprocessor1_Model12 | 2.610349 | 4 |
| linear_reg | Preprocessor1_Model08 | 2.610371 | 5 |
| linear_reg | Preprocessor1_Model13 | 2.610386 | 6 |
| linear_reg | Preprocessor1_Model09 | 2.610406 | 7 |
| linear_reg | Preprocessor1_Model07 | 2.610407 | 8 |
| linear_reg | Preprocessor1_Model15 | 2.610412 | 9 |
| linear_reg | Preprocessor1_Model11 | 2.610479 | 10 |
| linear_reg | Preprocessor1_Model05 | 2.610606 | 11 |
| linear_reg | Preprocessor1_Model06 | 2.610635 | 12 |
| linear_reg | Preprocessor1_Model04 | 2.610859 | 13 |
| linear_reg | Preprocessor1_Model02 | 2.611098 | 14 |
| linear_reg | Preprocessor1_Model01 | 2.611615 | 15 |
| mlp | Preprocessor1_Model12 | 2.612178 | 16 |
| nearest_neighbor | Preprocessor1_Model06 | 2.657645 | 17 |
| nearest_neighbor | Preprocessor1_Model05 | 2.691113 | 18 |
| boost_tree | Preprocessor1_Model15 | 2.699301 | 19 |
| boost_tree | Preprocessor1_Model06 | 2.711654 | 20 |
| mlp | Preprocessor1_Model03 | 2.714290 | 21 |
| nearest_neighbor | Preprocessor1_Model08 | 2.716003 | 22 |
| nearest_neighbor | Preprocessor1_Model12 | 2.732779 | 23 |
| nearest_neighbor | Preprocessor1_Model11 | 2.740400 | 24 |
| boost_tree | Preprocessor1_Model03 | 2.744069 | 25 |
| nearest_neighbor | Preprocessor1_Model09 | 2.760300 | 26 |
| decision_tree | Preprocessor1_Model11 | 2.799309 | 27 |
| decision_tree | Preprocessor1_Model02 | 2.804476 | 28 |
| boost_tree | Preprocessor1_Model14 | 2.807765 | 29 |
| nearest_neighbor | Preprocessor1_Model04 | 2.825890 | 30 |
| boost_tree | Preprocessor1_Model01 | 2.844425 | 31 |
| nearest_neighbor | Preprocessor1_Model02 | 2.844712 | 32 |
| boost_tree | Preprocessor1_Model05 | 2.846701 | 33 |
| boost_tree | Preprocessor1_Model02 | 2.889060 | 34 |
| decision_tree | Preprocessor1_Model01 | 2.912858 | 35 |
| decision_tree | Preprocessor1_Model05 | 2.927119 | 36 |
| nearest_neighbor | Preprocessor1_Model13 | 2.929557 | 37 |
| decision_tree | Preprocessor1_Model15 | 2.933926 | 38 |
| decision_tree | Preprocessor1_Model12 | 2.933926 | 39 |
| nearest_neighbor | Preprocessor1_Model15 | 2.935942 | 40 |
| boost_tree | Preprocessor1_Model11 | 2.950718 | 41 |
| nearest_neighbor | Preprocessor1_Model14 | 2.973112 | 42 |
| nearest_neighbor | Preprocessor1_Model03 | 2.999706 | 43 |
| mlp | Preprocessor1_Model04 | 3.025198 | 44 |
| decision_tree | Preprocessor1_Model03 | 3.026930 | 45 |
| boost_tree | Preprocessor1_Model12 | 3.041542 | 46 |
| nearest_neighbor | Preprocessor1_Model10 | 3.062527 | 47 |
| decision_tree | Preprocessor1_Model13 | 3.071267 | 48 |
| decision_tree | Preprocessor1_Model09 | 3.083321 | 49 |
| decision_tree | Preprocessor1_Model08 | 3.090028 | 50 |
| mlp | Preprocessor1_Model07 | 3.107423 | 51 |
| boost_tree | Preprocessor1_Model09 | 3.119737 | 52 |
| decision_tree | Preprocessor1_Model10 | 3.127145 | 53 |
| decision_tree | Preprocessor1_Model04 | 3.173886 | 54 |
| mlp | Preprocessor1_Model11 | 3.196364 | 55 |
| decision_tree | Preprocessor1_Model06 | 3.227234 | 56 |
| mlp | Preprocessor1_Model05 | 3.231370 | 57 |
| mlp | Preprocessor1_Model06 | 3.318100 | 58 |
| nearest_neighbor | Preprocessor1_Model01 | 3.410332 | 59 |
| decision_tree | Preprocessor1_Model07 | 3.469856 | 60 |
| decision_tree | Preprocessor1_Model14 | 3.584462 | 61 |
| nearest_neighbor | Preprocessor1_Model07 | 3.639251 | 62 |
| mlp | Preprocessor1_Model01 | 3.722567 | 63 |
| mlp | Preprocessor1_Model14 | 3.873417 | 64 |
| mlp | Preprocessor1_Model15 | 3.993735 | 65 |
| mlp | Preprocessor1_Model09 | 4.280478 | 66 |
| mlp | Preprocessor1_Model02 | 4.430005 | 67 |
| boost_tree | Preprocessor1_Model13 | 4.747433 | 68 |
| boost_tree | Preprocessor1_Model10 | 4.891170 | 69 |
| boost_tree | Preprocessor1_Model07 | 7.826343 | 70 |
| mlp | Preprocessor1_Model10 | 8.916352 | 71 |
| mlp | Preprocessor1_Model08 | 9.712742 | 72 |
| mlp | Preprocessor1_Model13 | 14.248312 | 73 |
| boost_tree | Preprocessor1_Model08 | 14.589793 | 74 |
| boost_tree | Preprocessor1_Model04 | 35.185166 | 75 |
autoplot(
grid_results,
rank_metric = "rmse", # <- how to order models
metric = "rmse", # <- which metric to visualize
select_best = TRUE # <- one point per workflow
) + geom_text(aes(y = mean - 1/2, label = wflow_id), angle = 90, hjust =-0.6, size = 3.5)
After doing a grid search of 5 models ( Linear Regression, Boosted Tree, Decision Tree,K-Nearest Neighbor, Multi-layer Perceptron Neural Network) with 3 re-samples and 15 grids, the best model with the lowest RMSE is a Normalized Linear Regression.
best_results <-
grid_results %>%
extract_workflow_set_result("normalized_linear_reg") %>% # specify the set result
select_best(metric = "rmse")
best_results_fit <-
grid_results %>%
extract_workflow("normalized_linear_reg") %>% # specify
finalize_workflow(best_results) %>%
last_fit(split = group)
best_results_fit %>%
collect_predictions() %>%
ggplot(aes(x = le, y = .pred)) +
geom_abline(color = "gray50", lty = 2) +
geom_point(alpha = 0.5) +
coord_obs_pred() +
labs(x = "observed LE", y = "predicted LE")
Once we apply the best model to actual values for life expectancy in 2015, we see that it does an OK job in predicting life expectancy.
best_results
## # A tibble: 1 × 3
## penalty mixture .config
## <dbl> <dbl> <chr>
## 1 0.120 0.678 Preprocessor1_Model10
linear_param <-
tibble(
penalty = 0.01196087,
mixture = 0.6784387
)
le_workflow_final <-
workflow() %>%
add_model(linear_reg_spec) %>%
add_recipe(le_rec)
final_workflow <-
le_workflow_final %>%
finalize_workflow(linear_param)
final_fit = final_workflow %>%
fit(data= cities_final) # predict future data
le_preds <- augment(final_fit, cities_final)
le_preds$.pred <- round(le_preds$.pred, digits = 1)
subset(le_preds , le_preds$le == le_preds$.pred) %>%
select(tractfips, le, .pred)
## # A tibble: 12 × 3
## tractfips le .pred
## <int> <dbl> <dbl>
## 1 10400 79.7 79.7
## 2 30604 79 79
## 3 50900 81.3 81.3
## 4 161300 78.6 78.6
## 5 170400 79 79
## 6 210100 79.4 79.4
## 7 390400 73.4 73.4
## 8 570500 79.5 79.5
## 9 620300 78.6 78.6
## 10 680600 68.9 68.9
## 11 770902 79 79
## 12 836600 77.3 77.3
Out of the 698 census tracts in the data, my model accurately predicted the life expectancy of 12 census tracts
I conducted an out-of sample analysis for 2019 health data using the same variables in the 2015 model. After aggregating the predictions to community area, I downloaded life expectancy by community area from the Chicago Health Atlas to compare my out of sample predictions
le_preds.2019 <- augment(final_fit, PLACES_IL) # out of sample prediction
le_preds.2019 = le_preds.2019 %>% group_by(community, commarea) %>% summarize(predicted_life.expectancy = mean(.pred, na.rm=TRUE))
CDPH_2019 <- read.csv("CDPH_2019.csv")
CDPH_2019$Name <- toupper(CDPH_2019$Name)
CDPH_2019$Name <- gsub("'", "", CDPH_2019$Name)
le_preds.2019 = left_join(le_preds.2019, CDPH_2019, by = c("commarea" = "GEOID"))
le_preds.2019 = rename(le_preds.2019, Actual_Life.Expectancy2019 = VRLE_2019)
le_preds.2019 <- le_preds.2019 %>% select(-c("Layer","Name"))
le_preds.2019 <- le_preds.2019 %>% filter(!is.na(community)) # remove na value
le_preds.2019$predicted_life.expectancy <- round(le_preds.2019$predicted_life.expectancy, digits = 1) # round by 1 digit
chicago_map <- cmapgeo::cca_sf
chicago_map$cca_name <- toupper(chicago_map$cca_name)
chicago_map$cca_name <- gsub("'", "", chicago_map$cca_name)
chicago_map = left_join(chicago_map, le_preds.2019, by = c("cca_name" = "community"))
#chicago_map %>% filter(cca_name == "NEW CITY")
tm_shape(chicago_map) +
tm_polygons(col = "predicted_life.expectancy", breaks = c(65, 70,75, 80, 85), id = "cca_name", zindex = 401, title = "Predicted Life Expectancy in 2019", popup.vars = c( "Predicted Life Expectancy: " = "predicted_life.expectancy"),
legend.format=list(fun=function(x) paste0(formatC(x, digits=0, format="f"), " years"))) +
tm_layout(legend.format = list(fun = function(x) { paste0(formatC(x, digits = 1, format = "f"), " years") })) + tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(chicago_map) +
tm_polygons(col = "Actual_Life.Expectancy2019", id = "cca_name", zindex = 401, title = "Actual Life Expectancy in 2019", popup.vars = c( "Actual Life Expectancy: " = "Actual_Life.Expectancy2019"),
legend.format=list(fun=function(x) paste0(formatC(x, digits=0, format="f"), " years"))) +
tm_layout(legend.format = list(fun = function(x) { paste0(formatC(x, digits = 1, format = "f"), " years") })) + tmap_mode("view")
## tmap mode set to interactive viewing
subset(le_preds.2019 , le_preds.2019 $predicted_life.expectancy == le_preds.2019 $Actual_Life.Expectancy2019) %>%
select(community, predicted_life.expectancy, Actual_Life.Expectancy2019)
## # A tibble: 2 × 3
## # Groups: community [2]
## community predicted_life.expectancy Actual_Life.Expectancy2019
## <chr> <dbl> <dbl>
## 1 BEVERLY 79.5 79.5
## 2 NEAR SOUTH SIDE 79.4 79.4
Out of the 77 community areas, my model accurately predicted life expectancy for Beverly and Near South Side! Overall, the two maps look almost identical, with life expectancy higher in the North-side and lower in the south-side regions!
The main benefit of using machine learning models to predict life expectancy is determining which communities health agencies like CDPH should prioritize to improve the life expectancy numbers, particularly with the Healthy Chicago 2025 goal of reducing the life expectancy gap between various communities. For example, the life expectancy for residents in Austin is 71 years while the life expectancy in Near North Side is 82.1 years. This is a difference of 12 years and this is completely unacceptable. Intervention efforts can be directed towards treating diabetes and helping educate people on what to eat. CDPH could also work on strategies to help improve access to healthy food, particularly around communities with a low number of grocery stores. Ultimately, more funding for Black and Latinx communities would improve health equity, and machine learning can help assess which communities to prioritize. City agencies can also target individuals without health insurance and expand their access to health services like Medicaid so they can get better healthcare coverage.
However, as we saw from the decision tree, the most important variable that the model identified in predicting life expectancy was the % of Black Residents in a census tract. This type of rationale can be completely racist if city agencies want to explain these models to the public. It would be really disrespectful and racist to present at a public health agency meeting that the reason why a community’s life expectancy is low is because there was a high % of African Americans in that community. Therefore, the linear regression model is the best fit for these kinds of predictions. Moreover, it is also important to keep in mind that machine learning models are meant to mimic real-world data. We cannot completely eliminate health equities with these models because these models are mimicking real-world data. There are large, systemic issues that need to be taken into consideration first. We are not trying to solve the life expectancy gap, we are solely trying to predict life expectancy so city agencies can understand which communities they need to prioritize. Additionally, this model cannot be used to predict future life expectancies past 2019 because the 2014-2015 variables does not include Covid-case, and as I mentioned in the introduction, CDPH reported two days ago that life expectancy in 2020 for African Americans in Chicago dropped below 70 years for the first time in decades due to the impact of the pandemic. This model also does not explain the impact of genetic factors on disease risk among different ethnicities. For example, African Americans carry a higher risk of developing diabetes due to genetic factors. These factors are not taken into consideration in the model, and can lead to bias.
Given the low RMSE, I would say that the variables in the model do a pretty good job in predicting life expectancy but there are areas of improvement. I would recommend CDPH to use variables in this model but not exactly this model, as the RMSE can be further decreased with additional variables and additional hyperparameter tuning. Moreover, the model predicted life expectancy for 698 census tracts for the 2015 data, but there are 866 census tracts in Chicago. As a result, due to missing data, 168 census tracts are unaccounted for in the model.