Introduction

Life expectancy is the average age that a person is estimated to live. Predicting life expectancy at the community level can be an ethical use of machine learning as we can identify communities/census tracts that need extra attention in terms of distribution of health resources and educational campaigns urging residents to make healthier lifestyle choices. Unfortunately, in Chicago, there are life expectancy gaps based on race. WWTW reported that social conditions create gaps in disease and premature mortality as a result of structural racism, lack of investment in communities, and poverty.

Similarly, the Red Line Project also found that Chicago’s African American community have the lowest life expectancy, and that cardiovascular disease can be the reason why since 60% of Black males end up developing some form of heart disease. Food deserts, which are low-income areas or neighborhoods that have a lack of supermarkets/healthy food options, are also a reason why low-income communities have higher rates of heart disease. Red Line Project also found that these low-income communities don’t have access to good quality healthcare. Moreover, education is also another important factor because people with lower education tend to work who work high-stress jobs, and hard laborers are at risk for anxiety and burn out.

ABC news reported that the life expectancy gap in Chicago is growing among Blacks and Whites. This same article also attributed diabetes, infant mortality, HIV prevalence, and homicide rate as factors that explain the life expectancy gap.

Chicago Sun-times cites 5 main factors for lower life expectancy: chronic diseases, homicide, infant mortality, opioid overdoses, and infections (flu, HIV, etc). In fact, the diabetes rate is 70% higher for Blacks, homicide rate are 9 times higher in Black communities, Black infant mortality rate is 9 times higher, and the rate of opioid-related overdoses are also 3 times higher.

Lastly, in a media press release, reported by Mayor Lightfoot and Commissioner of Chicago Department of Public Health (CDPH) Allison Arwady 2 days ago, they reported that the life expectancy for Blacks dropped below 70 years for the first time in decades. The main drivers of the racial life expectancy gap, according to this article, are chronic disease (heart disease, cancer, diabetes); homicide, infant mortality, HIV, flu; and opioid overdose. Other outcomes include social issues including lack of access to stable housing, food, childcare, and stable income.

The CDPH is working to address this gap through the Healthy Chicago 2025, a 5 year action plan to close this life expectancy gap. Thus, this model would be considered an ethical use by CDPH in predicting life expectancy and addressing this gap.

Data Sources

2015

1)500 Cities 2017 Release

Health data at the census tract level provided by the Centers for Disease Control and Prevention (CDC), Division of Population Health.

2)Crimes-2015

Total crimes in 2015 provided by the Chicago Police Department.

3)Grocery Stores- 2013

Total grocery stores in 2013, and only one available in the Open Data Portal.

4)Life Expectancy 2010-2015 at the Census Tract level

Only data I found at the census tract level detailing life expectancy.

5)Chicago Boundaries Shapefile of all of the 77 Chicago Community Areas.

6)Package CMAPGEO Shapefile of Chicago boundaries to create the maps

7)American Community Survey 2015 Population, Education and Income estimates obtained through the tidycensus package.

2019(for out of sample analysis)

1)PLACES Health Data at the census tract level in 2019 provided by CDC, National Center for Chronic Disease Prevention and Health Promotion, Division of Population Health

2)Crimes- 2019 Total crimes in 2019 as reported by CPD.

3)Grocery Stores- 2013

4)Chicago Health Atlas Used to get 2019 actual life expectancy values by community area to compare the predictions. Data source: Illinois Department of Public Health, Death Certificate Data Files

5)American Community Survey: 2019 Population, Education and Income estimates for 2019 obtained through the tidycensus package.

  #lifeexpc <- read.socrata("https://data.cdc.gov/resource/5h56-n989.json")

  #lifeexpc = lifeexpc %>%
   #filter(state_name == "Illinois", county_name == "Cook County, IL")






  #crimes2015 = read.socrata("https://data.cityofchicago.org/resource/vwwp-7yr9.json")

  #homicides2015 = crimes2015 %>% filter(primary_type == "HOMICIDE")



  #variables <- load_variables(2015,"acs5")



  #il15 <- get_acs(geography = "tract", 
  #              variables = c(medincome = "B19013_001",
  #                            totalpop = "B02001_001",
   #                           white_alone = "B02001_002",
     #                         black_alone = "B02001_003",
       #                       bachelorsdegree = "B06009_005",
         #                     foodstamps = "B09010_001"),
           #     state = "IL", 
             #   year = 2015,
               # output='wide')



 # il15 <- il15 %>% mutate(
   # pct_black = black_aloneE / totalpopE,
   # pct_college = bachelorsdegreeE/totalpopE,
 #   pct_foodstamps  = foodstampsE/totalpopE
  #)

  # iltracts <- tigris::tracts(state = "IL", year = 2019, cb = TRUE)


 # cookcounty_tracts <- iltracts %>%
 #   left_join(il15 %>% select(GEOID, pct_black, medincomeE, pct_college, pct_foodstamps)
  #) %>% filter(COUNTYFP == '031')  #merging il tracts with il15 census data, filtering it to cook county

  #ilcountycook <- tigris::county_subdivisions(state=17, county=31, cb=TRUE) 

  # chicago <- ilcountycook %>% filter(NAME == "Chicago") %>% select(region=NAME)

  #chicago_tracts <- cookcounty_tracts %>% st_intersection(
  #  chicago
 # )

 #find tract level data for the crimes

 # homicides2015 <- homicides2015 %>%
 #   st_as_sf(coords=c("longitude", "latitude"))
 # st_crs(homicides2015) <- st_crs(chicago_tracts)



 # homicides2015 <- homicides2015 %>% st_join(chicago_tracts, join = st_intersects) %>%
  #  as.data.frame() 

 # homicides2015 = homicides2015 %>% group_by(TRACTCE) %>% summarize(Total.Homicides = n())




#grocery stores so we can explore food desserts in Chicago. Which census tracts don't have any grocery stores in the community?


#grocery_stores <- read.socrata("https://data.cityofchicago.org/resource/53t8-wyrc.json")  #data is from 2013

 # grocery_stores = grocery_stores %>% group_by(census_tract) %>% summarize(total.grocery.stores = n())
  #grocery_stores$census_tract <- substring(grocery_stores$census_tract, 6)

  #length(unique(grocery_stores$census_tract))  #there is a duplicate somewhere

   #grocery_stores = grocery_stores %>% distinct()  #remove the duplicate




 # census_tracts <- read.socrata("https://data.cityofchicago.org/resource/74p9-q2aq.json")
  #comarea <- read.socrata("https://data.cityofchicago.org/resource/igwz-8jzy.json")


  #tracts_comarea <- census_tracts %>% select(tractce10, commarea)
 # comarea <- comarea %>% select(community, area_numbe)


 #  tracts_comarea  <- left_join(tracts_comarea, comarea, by = c("commarea" = "area_numbe"))



  # cities <- read.socrata("https://chronicdata.cdc.gov/resource/kucs-wizg.json")
  # cities = cities %>% filter(stateabbr == "IL", placename == "Chicago")


 #remove 17031 for cities
  # cities$tractfips <- substring(cities$tractfips, 6)

#remove period for lifeexp
  #lifeexpc$full_ct_num <- sub("[.]", "", lifeexpc$full_ct_num)

 #merge with LE
  #cities = left_join(cities, lifeexpc, by = c("tractfips" = "full_ct_num"))

  #sum(is.na(cities$le))  #96 missing values for LE

 #merge cities with homicide in 2015
  # cities <- left_join(cities, homicides2015, by = c("tractfips" = "TRACTCE"))

  # cities <- cities %>% mutate(Total.Homicides = replace_na(Total.Homicides,0)) # replace 0 homicides with 0

 #merge cities data with census data 
  #census_chicago <- chicago_tracts %>% select(GEOID, pct_black, medincomeE, pct_college, pct_foodstamps) %>% as.data.frame()
  #census_chicago$GEOID <- substring(census_chicago$GEOID,6)

  #cities <- left_join(cities, census_chicago, by = c("tractfips" = "GEOID"))

 #finally, merge cities data with grocery stores data

  #cities <- left_join(cities, grocery_stores, by = c("tractfips" = "census_tract"))
  #cities <- cities %>% mutate(total.grocery.stores = replace_na(total.grocery.stores, 0))


 #merge with community area
  #cities <- left_join(cities, tracts_comarea, by = c("tractfips" = "tractce10"))




  #cities

  #cities_final <-  cities %>% select(tractfips, community, commarea, access2_crudeprev, bpmed_crudeprev, chd_crudeprev, copd_crudeprev, csmoking_crudeprev , diabetes_crudeprev, highchol_crudeprev, kidney_crudeprev, obesity_crudeprev, sleep_crudeprev , stroke_crudeprev, Total.Homicides, pct_black, medincomeE, pct_college, pct_foodstamps, total.grocery.stores, le)


 #make numeric
 # cities_final[, c(3:ncol(cities_final))] <- sapply(cities_final[, c(3:ncol(cities_final))], as.numeric)


 # sum(is.na(cities_final$le))  #remove missing values of LE

 # cities_final =  cities_final %>%  filter(
 #   !is.na(le)
  # )


   #cities_final

 #write.csv(cities_final, "/Users/macbookair/Library/Mobile Documents/com~apple~CloudDocs/PA 470 AI & ML/Final Project Life Expectancy/PA 470 Final project/cities_final.csv", row.names = FALSE)



cities_final <- read.csv("cities_final.csv")
#crimes2019 = read.socrata("https://data.cityofchicago.org/resource/w98m-zvie.json")



#homicides2019 = crimes2019 %>% filter(primary_type == "HOMICIDE")



# tract data
#variables <- load_variables(2019,"acs5")

#view(variables)

#il19 <- get_acs(geography = "tract", 
           #variables = c(medincome = "B19013_001",
#                         totalpop = "B02001_001",
#                        white_alone = "B02001_002",
#                          black_alone = "B02001_003",
#                          bachelorsdegree = "B06009_005",
#                          foodstamps = "B09010_001"),
#           state = "IL", 
#           year = 2019,
#            output='wide')



#il19 <- il19 %>% mutate(
 # pct_black = black_aloneE / totalpopE,
 # pct_college = bachelorsdegreeE/totalpopE,
 # pct_foodstamps  = foodstampsE/totalpopE
#)

#iltracts <- tigris::tracts(state = "IL", year = 2019, cb = TRUE)


#cookcounty_tracts <- iltracts %>%
 # left_join(il19 %>% select(GEOID, pct_black, medincomeE, pct_college, pct_foodstamps)
#) %>% filter(COUNTYFP == '031') # merging il tracts with il15 census data, filtering it to cook county

#ilcountycook <- tigris::county_subdivisions(state=17, county=31, cb=TRUE) 

#chicago <- ilcountycook %>% filter(NAME == "Chicago") %>% select(region=NAME)

#chicago_tracts <- cookcounty_tracts %>% st_intersection(
 # chicago
#)

## find tract level data for the crimes

#homicides2019 <- homicides2019 %>%
# st_as_sf(coords=c("longitude", "latitude"))
#st_crs(homicides2019) <- st_crs(chicago_tracts)



#homicides2019 <- homicides2019 %>% st_join(chicago_tracts, join = st_intersects) %>%
 # as.data.frame() 

#homicides2019 = homicides2019 %>% group_by(TRACTCE) %>% summarize(Total.Homicides = n())


#grocery stores. No new data from 2019, it is the same

#census tracts
#tracts_comarea 


#### PLACES data
#PLACES_IL <- read.socrata("https://chronicdata.cdc.gov/resource/yjkw-uj5s.json")



#PLACES_IL = PLACES_IL %>% filter(stateabbr == "IL", countyname == "Cook")


# remove 17031 for PLACES_IL
#PLACES_IL$tractfips <- substring(PLACES_IL$tractfips, 6)


# merge PLACES_IL with homicide in 2019
#PLACES_IL <- left_join(PLACES_IL, homicides2019, by = c("tractfips" = "TRACTCE"))

#PLACES_IL <- PLACES_IL %>% mutate(Total.Homicides = replace_na(Total.Homicides,0)) # replace 0 homicides with 0





# merge PLACES_IL data with census data 
 #census_chicago2 <- chicago_tracts %>% select(GEOID, pct_black, medincomeE, pct_college, #pct_foodstamps) %>% as.data.frame()
  #census_chicago2$GEOID <- substring(census_chicago2$GEOID,6)

#PLACES_IL <- left_join(PLACES_IL, census_chicago2, by = c("tractfips" = "GEOID"))

# finally, merge PLACES_IL data with grocery stores data

#PLACES_IL <- left_join(PLACES_IL, grocery_stores, by = c("tractfips" = "census_tract"))
#PLACES_IL <- PLACES_IL %>% mutate(total.grocery.stores = replace_na(total.grocery.stores, 0))


# merge with community area
#PLACES_IL <- left_join(PLACES_IL, tracts_comarea, by = c("tractfips" = "tractce10"))




#PLACES_IL  = PLACES_IL %>% select(tractfips, community, commarea, access2_crudeprev, bpmed_crudeprev, chd_crudeprev, copd_crudeprev, csmoking_crudeprev , diabetes_crudeprev, highchol_crudeprev, kidney_crudeprev, obesity_crudeprev, sleep_crudeprev , stroke_crudeprev, Total.Homicides, pct_black, medincomeE, pct_college, pct_foodstamps, total.grocery.stores)

#PLACES_IL[, c(3:ncol(PLACES_IL))] <- sapply(PLACES_IL[, c(3:ncol(PLACES_IL))], as.numeric)

#write.csv(PLACES_IL, "/Users/macbookair/Library/Mobile Documents/com~apple~CloudDocs/PA 470 AI & ML/Final Project Life Expectancy/PA 470 Final project/PLACES_IL.csv", row.names = FALSE)


PLACES_IL <- read.csv("PLACES_IL.csv")

Exploratory Data Analysis

sex_le <- read.socrata("https://data.cityofchicago.org/resource/3qdj-cqb8.json")

sex_le = sex_le %>% filter(sex == "All")
sex_le = sex_le %>% filter(!race_ethnicity == "All")
sex_le$X_2010_life_expectancy <- as.numeric(sex_le$X_2010_life_expectancy)

sex_le %>% ggplot(mapping = aes(x = race_ethnicity,  
y= X_2010_life_expectancy, fill= race_ethnicity)) + 
  scale_fill_manual(values = c("#e09f3e", "#540b0e", "#56B4E9")) +
  scale_y_continuous(breaks = seq(0, 350, by = 50)) +
  
    labs( x= "Race", 
        y= "Number of Years Expected to Live",
        title = "Average Life Expectancy in 2010",
        subtitle = "Significant health disparities") +
  
geom_col() + geom_text(aes(x = race_ethnicity, y = X_2010_life_expectancy,
label = round(X_2010_life_expectancy, 0)), size = 3, hjust=0 , vjust= -0.5 , family= "Arial") +
  
    theme_classic() +
  theme( 
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "italic", size = 14),
    axis.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12, face = "bold"),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    legend.position = "none")

African Americans in 2010 lived 7 years less on average than whites, 13 years less than Hispanic/Latinx

line_graph = sex_le %>%
  select(race_ethnicity,X_1990_life_expectancy,     X_2000_life_expectancy, X_2010_life_expectancy)

line_graph$X_1990_life_expectancy <- as.double(line_graph$X_1990_life_expectancy)
line_graph$X_2000_life_expectancy <- as.double(line_graph$X_2000_life_expectancy)

line_graph = line_graph %>%
  pivot_longer(!race_ethnicity, 
               names_to = "Year",
               values_to = "Life_Expectancy")
               
line_graph$Year <- gsub("[^0-9.-]", "", line_graph$Year)
line_graph$Year <- as.factor(line_graph$Year)





p <- ggplot(data=line_graph, aes(x=Year, y= Life_Expectancy, group=race_ethnicity, colour=race_ethnicity)) +
    geom_line() +
    geom_point() +
  labs(x='Year', y='Average Years expected to Live', title='Average Life Expectancy by Race/Ethnicity', subtitle = 'Source: Chicago Open Data Portal')  + theme_classic() + theme(
plot.title = element_text(face = "bold", size = 12), plot.subtitle = element_text(face = "italic", size = 12),
axis.title = element_text(size = 10, face = "bold"), axis.text = element_text(size = 8, face = "bold"), axis.line = element_blank(),
axis.ticks = element_blank())

fig <- ggplotly(p)

fig

Trend of Life Expectancy by group has been going up steadily for all Whites, Blacks, and Hispanic

filtered_correlation <- cities_final 

filtered_correlation <-  rename(filtered_correlation,  c( "heart_dis" = "chd_crudeprev", "smoking" = "csmoking_crudeprev", "diabetes" = "diabetes_crudeprev",
                                 "kidney_dis" = "kidney_crudeprev",
                                 "obesity" = "obesity_crudeprev",
                                 "homicides" = "Total.Homicides",
                                 "Black" = "pct_black",
                                 "income" = "medincomeE",
                                 "college" = "pct_college",
                                 "grocery_stores" = "total.grocery.stores",))


filtered_correlation = filtered_correlation %>%
  select(le , heart_dis, smoking, diabetes,kidney_dis, obesity, homicides, Black , income, college, grocery_stores)


correlation <-cor(filtered_correlation)




corrplot(correlation, type="upper", order="hclust",
         col=brewer.pal(n=8, name="RdYlBu"))

Life Expectancy has a negative correlation with the following variables: homicides, % Black, heart disease, diabetes, kidney disease, smoking, obesity, grocery stores (weak association) In fact, the strongest, negative correlation is with the % Black and obesity variables. Surprisingly, it has a weak relationship with grocery stores.

Life Expectancy has a strong, positive correlation with income and % with college education.

Predictor Variables

15 predictors at the census tract level, crude prevalence rates(total # divided by the population)

Chronic Health Diseases

  1. diabetes_crudeprev = % with diabetes

  2. CHD_CrudePrev = % with heart disease

  3. copd_crudeprev = % with chronic obstructive pulmonary disease, a type of lung disease

  4. kidney_crudeprev = % with chronic kidney disease

  5. stroke_crudeprev = % who has had stroke in the year the data was collected(2015 and 2019)

Risk factors for Health Diseases

  1. obesity_crudeprev = % considered obese

  2. csmoking_crudeprev = % who smoke frequently

  3. sleep_crudeprev = % who sleep less than 7 hrs

Social and economic factors

  1. access2_crudeprev = % lack of health insurance among adults 18-65yrs

  2. pct_foodstamps = % in community with SNAP benefits (food stamps)

  3. medincomeE = median income

  4. pct_black = % of African Americans in census tract

  5. pct_college = % with a Bachelor’s degree

  6. total.grocery.stores = total grocery stores

  7. Total. Homicides = total homicides in the census tract(2015 and 2019)

DV is le = average life expectancy

Unit of analysis is tractfips = Chicago Census Tract

missing_values = cities_final %>%
  select(everything()) %>%  
  summarise_all(funs(sum(is.na(.))))# no missing values

Initial Model: Decision Tree

group <- rsample::initial_split(cities_final)
train <- training(group)
test <- testing(group)


le_folds <- vfold_cv(train, repeats=1, v=3)
  
 tree_model <- parsnip::decision_tree(tree_depth=5) %>%
   set_engine("rpart") %>%
   set_mode("regression")


 le_rec <- recipe (le ~  access2_crudeprev + chd_crudeprev+ copd_crudeprev + csmoking_crudeprev +  diabetes_crudeprev  +   kidney_crudeprev + obesity_crudeprev   + sleep_crudeprev +  stroke_crudeprev + Total.Homicides + pct_black + medincomeE + pct_college + pct_foodstamps   + total.grocery.stores, data = train) %>% 
     
   step_corr(all_predictors()) %>% # drops variables that correlate with each other
   
   
  step_normalize(all_numeric_predictors())  # normalizes numeric variable to have SD of 1 and mean of 0
   



le_workflow <-
  workflow() %>%
  add_model(tree_model) %>%
  add_recipe(le_rec)


le_fit <- le_workflow %>%
  fit(data= train)


le_preds <- augment(le_fit, test)


DT::datatable(yardstick::rmse(
  le_preds,
  truth = le,
  estimate = .pred
))
tree_fit <- le_fit  %>% 
  extract_fit_engine()


rpart.plot::rpart.plot(tree_fit)

Regression Decision tree was used for the first model with a tree depth of 5. Important thing to note here is that the most important variable identified in the model is the % of African Americans in the census tract.

Assessing 5 models with different tuning and parameters

linear_reg_spec <- 
   linear_reg(penalty = tune(), mixture = tune()) %>% 
   set_engine("glmnet")

xgb_spec <- 
   boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             min_n = tune(), sample_size = tune(), trees = tune()) %>% 
   set_engine("xgboost") %>% 
   set_mode("regression")

   #boosted = xgb_spec   ✓


 cart_spec <- 
   decision_tree(cost_complexity = tune(), min_n = tune()) %>% 
   set_engine("rpart") %>% 
   set_mode("regression")
 #decision_tree= cart_spec
 
 
knn_spec <- 
   nearest_neighbor(neighbors = tune(), dist_power = tune(), weight_func = tune()) %>% 
   set_engine("kknn") %>% 
   set_mode("regression")

nnet_spec <- 
   mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>% 
   set_engine("nnet", MaxNWts = 2600) %>% 
   set_mode("regression")




my_set <- workflow_set(
  preproc =  list(normalized = le_rec), 
  models = list(
   linear_reg = linear_reg_spec, boosted = xgb_spec, decision_tree= cart_spec, KNN = knn_spec, neural_network = nnet_spec)
)
grid_ctrl <-
   control_grid(
      save_pred = FALSE,
      save_workflow = FALSE
   )



grid_results <-
   my_set %>%
   workflow_map(
      seed = 1503,
      resamples = le_folds,
      grid = 15,
      control = grid_ctrl,
      verbose = FALSE
   )
nrow(collect_metrics(grid_results, summarize = FALSE)) # 450
## [1] 450
grid_results %>% 
   rank_results() %>% 
   filter(.metric == "rmse") %>% 
   select(model, .config, rmse = mean, rank) %>%
    kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))
model .config rmse rank
linear_reg Preprocessor1_Model10 2.578095 1
linear_reg Preprocessor1_Model14 2.592534 2
linear_reg Preprocessor1_Model03 2.608995 3
linear_reg Preprocessor1_Model12 2.610349 4
linear_reg Preprocessor1_Model08 2.610371 5
linear_reg Preprocessor1_Model13 2.610386 6
linear_reg Preprocessor1_Model09 2.610406 7
linear_reg Preprocessor1_Model07 2.610407 8
linear_reg Preprocessor1_Model15 2.610412 9
linear_reg Preprocessor1_Model11 2.610479 10
linear_reg Preprocessor1_Model05 2.610606 11
linear_reg Preprocessor1_Model06 2.610635 12
linear_reg Preprocessor1_Model04 2.610859 13
linear_reg Preprocessor1_Model02 2.611098 14
linear_reg Preprocessor1_Model01 2.611615 15
mlp Preprocessor1_Model12 2.612178 16
nearest_neighbor Preprocessor1_Model06 2.657645 17
nearest_neighbor Preprocessor1_Model05 2.691113 18
boost_tree Preprocessor1_Model15 2.699301 19
boost_tree Preprocessor1_Model06 2.711654 20
mlp Preprocessor1_Model03 2.714290 21
nearest_neighbor Preprocessor1_Model08 2.716003 22
nearest_neighbor Preprocessor1_Model12 2.732779 23
nearest_neighbor Preprocessor1_Model11 2.740400 24
boost_tree Preprocessor1_Model03 2.744069 25
nearest_neighbor Preprocessor1_Model09 2.760300 26
decision_tree Preprocessor1_Model11 2.799309 27
decision_tree Preprocessor1_Model02 2.804476 28
boost_tree Preprocessor1_Model14 2.807765 29
nearest_neighbor Preprocessor1_Model04 2.825890 30
boost_tree Preprocessor1_Model01 2.844425 31
nearest_neighbor Preprocessor1_Model02 2.844712 32
boost_tree Preprocessor1_Model05 2.846701 33
boost_tree Preprocessor1_Model02 2.889060 34
decision_tree Preprocessor1_Model01 2.912858 35
decision_tree Preprocessor1_Model05 2.927119 36
nearest_neighbor Preprocessor1_Model13 2.929557 37
decision_tree Preprocessor1_Model15 2.933926 38
decision_tree Preprocessor1_Model12 2.933926 39
nearest_neighbor Preprocessor1_Model15 2.935942 40
boost_tree Preprocessor1_Model11 2.950718 41
nearest_neighbor Preprocessor1_Model14 2.973112 42
nearest_neighbor Preprocessor1_Model03 2.999706 43
mlp Preprocessor1_Model04 3.025198 44
decision_tree Preprocessor1_Model03 3.026930 45
boost_tree Preprocessor1_Model12 3.041542 46
nearest_neighbor Preprocessor1_Model10 3.062527 47
decision_tree Preprocessor1_Model13 3.071267 48
decision_tree Preprocessor1_Model09 3.083321 49
decision_tree Preprocessor1_Model08 3.090028 50
mlp Preprocessor1_Model07 3.107423 51
boost_tree Preprocessor1_Model09 3.119737 52
decision_tree Preprocessor1_Model10 3.127145 53
decision_tree Preprocessor1_Model04 3.173886 54
mlp Preprocessor1_Model11 3.196364 55
decision_tree Preprocessor1_Model06 3.227234 56
mlp Preprocessor1_Model05 3.231370 57
mlp Preprocessor1_Model06 3.318100 58
nearest_neighbor Preprocessor1_Model01 3.410332 59
decision_tree Preprocessor1_Model07 3.469856 60
decision_tree Preprocessor1_Model14 3.584462 61
nearest_neighbor Preprocessor1_Model07 3.639251 62
mlp Preprocessor1_Model01 3.722567 63
mlp Preprocessor1_Model14 3.873417 64
mlp Preprocessor1_Model15 3.993735 65
mlp Preprocessor1_Model09 4.280478 66
mlp Preprocessor1_Model02 4.430005 67
boost_tree Preprocessor1_Model13 4.747433 68
boost_tree Preprocessor1_Model10 4.891170 69
boost_tree Preprocessor1_Model07 7.826343 70
mlp Preprocessor1_Model10 8.916352 71
mlp Preprocessor1_Model08 9.712742 72
mlp Preprocessor1_Model13 14.248312 73
boost_tree Preprocessor1_Model08 14.589793 74
boost_tree Preprocessor1_Model04 35.185166 75
autoplot(
   grid_results,
   rank_metric = "rmse",  # <- how to order models
   metric = "rmse",       # <- which metric to visualize
   select_best = TRUE     # <- one point per workflow
) + geom_text(aes(y = mean - 1/2, label = wflow_id), angle = 90, hjust =-0.6, size = 3.5) 

After doing a grid search of 5 models ( Linear Regression, Boosted Tree, Decision Tree,K-Nearest Neighbor, Multi-layer Perceptron Neural Network) with 3 re-samples and 15 grids, the best model with the lowest RMSE is a Normalized Linear Regression.

best_results <- 
   grid_results %>% 
   extract_workflow_set_result("normalized_linear_reg") %>%  # specify the set result
   select_best(metric = "rmse")

best_results_fit <- 
   grid_results %>% 
   extract_workflow("normalized_linear_reg") %>%  # specify
   finalize_workflow(best_results) %>% 
   last_fit(split = group)


best_results_fit %>% 
   collect_predictions() %>% 
   ggplot(aes(x = le, y = .pred)) + 
   geom_abline(color = "gray50", lty = 2) + 
   geom_point(alpha = 0.5) + 
   coord_obs_pred() + 
   labs(x = "observed LE", y = "predicted LE")

Once we apply the best model to actual values for life expectancy in 2015, we see that it does an OK job in predicting life expectancy.

best_results
## # A tibble: 1 × 3
##   penalty mixture .config              
##     <dbl>   <dbl> <chr>                
## 1   0.120   0.678 Preprocessor1_Model10
linear_param <- 
  tibble(
    penalty = 0.01196087,
    mixture = 0.6784387
  )

le_workflow_final <-
  workflow() %>%
  add_model(linear_reg_spec) %>%
  add_recipe(le_rec)


final_workflow <- 
  le_workflow_final  %>%
  finalize_workflow(linear_param)

final_fit = final_workflow %>%
  fit(data= cities_final) # predict future data

le_preds <- augment(final_fit, cities_final)

le_preds$.pred <- round(le_preds$.pred, digits = 1)
subset(le_preds , le_preds$le ==  le_preds$.pred)  %>%
  select(tractfips, le, .pred)
## # A tibble: 12 × 3
##    tractfips    le .pred
##        <int> <dbl> <dbl>
##  1     10400  79.7  79.7
##  2     30604  79    79  
##  3     50900  81.3  81.3
##  4    161300  78.6  78.6
##  5    170400  79    79  
##  6    210100  79.4  79.4
##  7    390400  73.4  73.4
##  8    570500  79.5  79.5
##  9    620300  78.6  78.6
## 10    680600  68.9  68.9
## 11    770902  79    79  
## 12    836600  77.3  77.3

Out of the 698 census tracts in the data, my model accurately predicted the life expectancy of 12 census tracts

Out of sample prediction for 2019 Life Expectancy

I conducted an out-of sample analysis for 2019 health data using the same variables in the 2015 model. After aggregating the predictions to community area, I downloaded life expectancy by community area from the Chicago Health Atlas to compare my out of sample predictions

le_preds.2019 <- augment(final_fit, PLACES_IL) # out of sample prediction

le_preds.2019 = le_preds.2019  %>% group_by(community, commarea) %>% summarize(predicted_life.expectancy = mean(.pred, na.rm=TRUE))


CDPH_2019 <- read.csv("CDPH_2019.csv") 


CDPH_2019$Name <- toupper(CDPH_2019$Name)
CDPH_2019$Name <- gsub("'", "", CDPH_2019$Name)

                                          

le_preds.2019 = left_join(le_preds.2019, CDPH_2019, by = c("commarea" = "GEOID"))

le_preds.2019 = rename(le_preds.2019, Actual_Life.Expectancy2019 = VRLE_2019)
le_preds.2019 <- le_preds.2019 %>% select(-c("Layer","Name"))

le_preds.2019 <- le_preds.2019   %>% filter(!is.na(community)) # remove na value




le_preds.2019$predicted_life.expectancy <- round(le_preds.2019$predicted_life.expectancy, digits = 1) # round by 1 digit

Final Maps

chicago_map <- cmapgeo::cca_sf
chicago_map$cca_name <- toupper(chicago_map$cca_name)

chicago_map$cca_name <- gsub("'", "", chicago_map$cca_name)


chicago_map = left_join(chicago_map, le_preds.2019, by = c("cca_name" = "community"))
#chicago_map %>% filter(cca_name == "NEW CITY")

Predicted 2019 Life Expectancy Map

 tm_shape(chicago_map) + 
          tm_polygons(col = "predicted_life.expectancy", breaks = c(65, 70,75, 80, 85),  id = "cca_name", zindex = 401, title = "Predicted Life Expectancy in 2019", popup.vars = c( "Predicted Life Expectancy: " = "predicted_life.expectancy"),
                      legend.format=list(fun=function(x) paste0(formatC(x, digits=0, format="f"), " years"))) + 
          tm_layout(legend.format = list(fun = function(x) { paste0(formatC(x, digits = 1, format = "f"), " years") }))  + tmap_mode("view")
## tmap mode set to interactive viewing

Actual 2019 Life Expectancy Map

tm_shape(chicago_map) + 
          tm_polygons(col = "Actual_Life.Expectancy2019", id = "cca_name", zindex = 401, title = "Actual Life Expectancy in 2019", popup.vars = c( "Actual Life Expectancy: " = "Actual_Life.Expectancy2019"),
                      legend.format=list(fun=function(x) paste0(formatC(x, digits=0, format="f"), " years"))) + 
          tm_layout(legend.format = list(fun = function(x) { paste0(formatC(x, digits = 1, format = "f"), " years") }))  + tmap_mode("view")
## tmap mode set to interactive viewing
subset(le_preds.2019 , le_preds.2019 $predicted_life.expectancy ==  le_preds.2019 $Actual_Life.Expectancy2019)  %>%
  select(community, predicted_life.expectancy, Actual_Life.Expectancy2019)
## # A tibble: 2 × 3
## # Groups:   community [2]
##   community       predicted_life.expectancy Actual_Life.Expectancy2019
##   <chr>                               <dbl>                      <dbl>
## 1 BEVERLY                              79.5                       79.5
## 2 NEAR SOUTH SIDE                      79.4                       79.4

Out of the 77 community areas, my model accurately predicted life expectancy for Beverly and Near South Side! Overall, the two maps look almost identical, with life expectancy higher in the North-side and lower in the south-side regions!

Conclusion

The main benefit of using machine learning models to predict life expectancy is determining which communities health agencies like CDPH should prioritize to improve the life expectancy numbers, particularly with the Healthy Chicago 2025 goal of reducing the life expectancy gap between various communities. For example, the life expectancy for residents in Austin is 71 years while the life expectancy in Near North Side is 82.1 years. This is a difference of 12 years and this is completely unacceptable. Intervention efforts can be directed towards treating diabetes and helping educate people on what to eat. CDPH could also work on strategies to help improve access to healthy food, particularly around communities with a low number of grocery stores. Ultimately, more funding for Black and Latinx communities would improve health equity, and machine learning can help assess which communities to prioritize. City agencies can also target individuals without health insurance and expand their access to health services like Medicaid so they can get better healthcare coverage.

However, as we saw from the decision tree, the most important variable that the model identified in predicting life expectancy was the % of Black Residents in a census tract. This type of rationale can be completely racist if city agencies want to explain these models to the public. It would be really disrespectful and racist to present at a public health agency meeting that the reason why a community’s life expectancy is low is because there was a high % of African Americans in that community. Therefore, the linear regression model is the best fit for these kinds of predictions. Moreover, it is also important to keep in mind that machine learning models are meant to mimic real-world data. We cannot completely eliminate health equities with these models because these models are mimicking real-world data. There are large, systemic issues that need to be taken into consideration first. We are not trying to solve the life expectancy gap, we are solely trying to predict life expectancy so city agencies can understand which communities they need to prioritize. Additionally, this model cannot be used to predict future life expectancies past 2019 because the 2014-2015 variables does not include Covid-case, and as I mentioned in the introduction, CDPH reported two days ago that life expectancy in 2020 for African Americans in Chicago dropped below 70 years for the first time in decades due to the impact of the pandemic. This model also does not explain the impact of genetic factors on disease risk among different ethnicities. For example, African Americans carry a higher risk of developing diabetes due to genetic factors. These factors are not taken into consideration in the model, and can lead to bias.

Given the low RMSE, I would say that the variables in the model do a pretty good job in predicting life expectancy but there are areas of improvement. I would recommend CDPH to use variables in this model but not exactly this model, as the RMSE can be further decreased with additional variables and additional hyperparameter tuning. Moreover, the model predicted life expectancy for 698 census tracts for the 2015 data, but there are 866 census tracts in Chicago. As a result, due to missing data, 168 census tracts are unaccounted for in the model.