Introduction
Data Sources
- 2015
- 2019(for out of sample analysis)
Exploratory Data Analysis
Predictor Variables
Initial Model: Decision Tree
Assessing 5 models with different tuning and parameters
Out of sample prediction for 2019 Life Expectancy
Final Maps
- Predicted 2019 Life Expectancy Map
- Actual 2019 Life Expectancy Map
Conclusion

Introduction

Life expectancy is the average age that a person is estimated to live. Predicting life expectancy at the community level can be an ethical use of machine learning as we can identify communities/census tracts that need extra attention in terms of distribution of health resources and educational campaigns urging residents to make healthier lifestyle choices. Unfortunately, in Chicago, there are life expectancy gaps based on race. WWTW reported that social conditions create gaps in disease and premature mortality as a result of structural racism, lack of investment in communities, and poverty.

Similarly, the Red Line Project also found that Chicago’s African American community have the lowest life expectancy, and that cardiovascular disease can be the reason why since 60% of Black males end up developing some form of heart disease. Food deserts, which are low-income areas or neighborhoods that have a lack of supermarkets/healthy food options, are also a reason why low-income communities have higher rates of heart disease. Red Line Project also found that these low-income communities don’t have access to good quality healthcare. Moreover, education is also another important factor because people with lower education tend to work who work high-stress jobs, and hard laborers are at risk for anxiety and burn out.

ABC news reported that the life expectancy gap in Chicago is growing among Blacks and Whites. This same article also attributed diabetes, infant mortality, HIV prevalence, and homicide rate as factors that explain the life expectancy gap.

Chicago Sun-times cites 5 main factors for lower life expectancy: chronic diseases, homicide, infant mortality, opioid overdoses, and infections (flu, HIV, etc). In fact, the diabetes rate is 70% higher for Blacks, homicide rate are 9 times higher in Black communities, Black infant mortality rate is 9 times higher, and the rate of opioid-related overdoses are also 3 times higher.

Lastly, in a media press release, reported by Mayor Lightfoot and Commissioner of Chicago Department of Public Health (CDPH) Allison Arwady 2 days ago, they reported that the life expectancy for Blacks dropped below 70 years for the first time in decades. The main drivers of the racial life expectancy gap, according to this article, are chronic disease (heart disease, cancer, diabetes); homicide, infant mortality, HIV, flu; and opioid overdose. Other outcomes include social issues including lack of access to stable housing, food, childcare, and stable income.

The CDPH is working to address this gap through the Healthy Chicago 2025, a 5 year action plan to close this life expectancy gap. Thus, this model would be considered an ethical use by CDPH in predicting life expectancy and addressing this gap.

Data Sources

2015

1)500 Cities 2017 Release

Health data at the census tract level provided by the Centers for Disease Control and Prevention (CDC), Division of Population Health.

2)Crimes-2015

Total crimes in 2015 provided by the Chicago Police Department.

3)Grocery Stores- 2013

Total grocery stores in 2013, and only one available in the Open Data Portal.

4)Life Expectancy 2010-2015 at the Census Tract level

Only data I found at the census tract level detailing life expectancy.

5)Chicago Boundaries Shapefile of all of the 77 Chicago Community Areas.

6)Package CMAPGEO Shapefile of Chicago boundaries to create the maps

7)American Community Survey 2015 Population, Education and Income estimates obtained through the tidycensus package.

2019(for out of sample analysis)

1)PLACES Health Data at the census tract level in 2019 provided by CDC, National Center for Chronic Disease Prevention and Health Promotion, Division of Population Health

2)Crimes- 2019 Total crimes in 2019 as reported by CPD.

3)Grocery Stores- 2013

4)Chicago Health Atlas Used to get 2019 actual life expectancy values by community area to compare the predictions. Data source: Illinois Department of Public Health, Death Certificate Data Files

5)American Community Survey: 2019 Population, Education and Income estimates for 2019 obtained through the tidycensus package.

  #lifeexpc <- read.socrata("https://data.cdc.gov/resource/5h56-n989.json")

  #lifeexpc = lifeexpc %>%
   #filter(state_name == "Illinois", county_name == "Cook County, IL")






  #crimes2015 = read.socrata("https://data.cityofchicago.org/resource/vwwp-7yr9.json")

  #homicides2015 = crimes2015 %>% filter(primary_type == "HOMICIDE")



  #variables <- load_variables(2015,"acs5")



  #il15 <- get_acs(geography = "tract", 
  #              variables = c(medincome = "B19013_001",
  #                            totalpop = "B02001_001",
   #                           white_alone = "B02001_002",
     #                         black_alone = "B02001_003",
       #                       bachelorsdegree = "B06009_005",
         #                     foodstamps = "B09010_001"),
           #     state = "IL", 
             #   year = 2015,
               # output='wide')



 # il15 <- il15 %>% mutate(
   # pct_black = black_aloneE / totalpopE,
   # pct_college = bachelorsdegreeE/totalpopE,
 #   pct_foodstamps  = foodstampsE/totalpopE
  #)

  # iltracts <- tigris::tracts(state = "IL", year = 2019, cb = TRUE)


 # cookcounty_tracts <- iltracts %>%
 #   left_join(il15 %>% select(GEOID, pct_black, medincomeE, pct_college, pct_foodstamps)
  #) %>% filter(COUNTYFP == '031')  #merging il tracts with il15 census data, filtering it to cook county

  #ilcountycook <- tigris::county_subdivisions(state=17, county=31, cb=TRUE) 

  # chicago <- ilcountycook %>% filter(NAME == "Chicago") %>% select(region=NAME)

  #chicago_tracts <- cookcounty_tracts %>% st_intersection(
  #  chicago
 # )

 #find tract level data for the crimes

 # homicides2015 <- homicides2015 %>%
 #   st_as_sf(coords=c("longitude", "latitude"))
 # st_crs(homicides2015) <- st_crs(chicago_tracts)



 # homicides2015 <- homicides2015 %>% st_join(chicago_tracts, join = st_intersects) %>%
  #  as.data.frame() 

 # homicides2015 = homicides2015 %>% group_by(TRACTCE) %>% summarize(Total.Homicides = n())




#grocery stores so we can explore food desserts in Chicago. Which census tracts don't have any grocery stores in the community?


#grocery_stores <- read.socrata("https://data.cityofchicago.org/resource/53t8-wyrc.json")  #data is from 2013

 # grocery_stores = grocery_stores %>% group_by(census_tract) %>% summarize(total.grocery.stores = n())
  #grocery_stores$census_tract <- substring(grocery_stores$census_tract, 6)

  #length(unique(grocery_stores$census_tract))  #there is a duplicate somewhere

   #grocery_stores = grocery_stores %>% distinct()  #remove the duplicate




 # census_tracts <- read.socrata("https://data.cityofchicago.org/resource/74p9-q2aq.json")
  #comarea <- read.socrata("https://data.cityofchicago.org/resource/igwz-8jzy.json")


  #tracts_comarea <- census_tracts %>% select(tractce10, commarea)
 # comarea <- comarea %>% select(community, area_numbe)


 #  tracts_comarea  <- left_join(tracts_comarea, comarea, by = c("commarea" = "area_numbe"))



  # cities <- read.socrata("https://chronicdata.cdc.gov/resource/kucs-wizg.json")
  # cities = cities %>% filter(stateabbr == "IL", placename == "Chicago")


 #remove 17031 for cities
  # cities$tractfips <- substring(cities$tractfips, 6)

#remove period for lifeexp
  #lifeexpc$full_ct_num <- sub("[.]", "", lifeexpc$full_ct_num)

 #merge with LE
  #cities = left_join(cities, lifeexpc, by = c("tractfips" = "full_ct_num"))

  #sum(is.na(cities$le))  #96 missing values for LE

 #merge cities with homicide in 2015
  # cities <- left_join(cities, homicides2015, by = c("tractfips" = "TRACTCE"))

  # cities <- cities %>% mutate(Total.Homicides = replace_na(Total.Homicides,0)) # replace 0 homicides with 0

 #merge cities data with census data 
  #census_chicago <- chicago_tracts %>% select(GEOID, pct_black, medincomeE, pct_college, pct_foodstamps) %>% as.data.frame()
  #census_chicago$GEOID <- substring(census_chicago$GEOID,6)

  #cities <- left_join(cities, census_chicago, by = c("tractfips" = "GEOID"))

 #finally, merge cities data with grocery stores data

  #cities <- left_join(cities, grocery_stores, by = c("tractfips" = "census_tract"))
  #cities <- cities %>% mutate(total.grocery.stores = replace_na(total.grocery.stores, 0))


 #merge with community area
  #cities <- left_join(cities, tracts_comarea, by = c("tractfips" = "tractce10"))




  #cities

  #cities_final <-  cities %>% select(tractfips, community, commarea, access2_crudeprev, bpmed_crudeprev, chd_crudeprev, copd_crudeprev, csmoking_crudeprev , diabetes_crudeprev, highchol_crudeprev, kidney_crudeprev, obesity_crudeprev, sleep_crudeprev , stroke_crudeprev, Total.Homicides, pct_black, medincomeE, pct_college, pct_foodstamps, total.grocery.stores, le)


 #make numeric
 # cities_final[, c(3:ncol(cities_final))] <- sapply(cities_final[, c(3:ncol(cities_final))], as.numeric)


 # sum(is.na(cities_final$le))  #remove missing values of LE

 # cities_final =  cities_final %>%  filter(
 #   !is.na(le)
  # )


   #cities_final

 #write.csv(cities_final, "/Users/macbookair/Library/Mobile Documents/com~apple~CloudDocs/PA 470 AI & ML/Final Project Life Expectancy/PA 470 Final project/cities_final.csv", row.names = FALSE)



cities_final <- read.csv("cities_final.csv")

#crimes2019 = read.socrata("https://data.cityofchicago.org/resource/w98m-zvie.json")



#homicides2019 = crimes2019 %>% filter(primary_type == "HOMICIDE")



# tract data
#variables <- load_variables(2019,"acs5")

#view(variables)

#il19 <- get_acs(geography = "tract", 
           #variables = c(medincome = "B19013_001",
#                         totalpop = "B02001_001",
#                        white_alone = "B02001_002",
#                          black_alone = "B02001_003",
#                          bachelorsdegree = "B06009_005",
#                          foodstamps = "B09010_001"),
#           state = "IL", 
#           year = 2019,
#            output='wide')



#il19 <- il19 %>% mutate(
 # pct_black = black_aloneE / totalpopE,
 # pct_college = bachelorsdegreeE/totalpopE,
 # pct_foodstamps  = foodstampsE/totalpopE
#)

#iltracts <- tigris::tracts(state = "IL", year = 2019, cb = TRUE)


#cookcounty_tracts <- iltracts %>%
 # left_join(il19 %>% select(GEOID, pct_black, medincomeE, pct_college, pct_foodstamps)
#) %>% filter(COUNTYFP == '031') # merging il tracts with il15 census data, filtering it to cook county

#ilcountycook <- tigris::county_subdivisions(state=17, county=31, cb=TRUE) 

#chicago <- ilcountycook %>% filter(NAME == "Chicago") %>% select(region=NAME)

#chicago_tracts <- cookcounty_tracts %>% st_intersection(
 # chicago
#)

## find tract level data for the crimes

#homicides2019 <- homicides2019 %>%
# st_as_sf(coords=c("longitude", "latitude"))
#st_crs(homicides2019) <- st_crs(chicago_tracts)



#homicides2019 <- homicides2019 %>% st_join(chicago_tracts, join = st_intersects) %>%
 # as.data.frame() 

#homicides2019 = homicides2019 %>% group_by(TRACTCE) %>% summarize(Total.Homicides = n())


#grocery stores. No new data from 2019, it is the same

#census tracts
#tracts_comarea 


#### PLACES data
#PLACES_IL <- read.socrata("https://chronicdata.cdc.gov/resource/yjkw-uj5s.json")



#PLACES_IL = PLACES_IL %>% filter(stateabbr == "IL", countyname == "Cook")


# remove 17031 for PLACES_IL
#PLACES_IL$tractfips <- substring(PLACES_IL$tractfips, 6)


# merge PLACES_IL with homicide in 2019
#PLACES_IL <- left_join(PLACES_IL, homicides2019, by = c("tractfips" = "TRACTCE"))

#PLACES_IL <- PLACES_IL %>% mutate(Total.Homicides = replace_na(Total.Homicides,0)) # replace 0 homicides with 0





# merge PLACES_IL data with census data 
 #census_chicago2 <- chicago_tracts %>% select(GEOID, pct_black, medincomeE, pct_college, #pct_foodstamps) %>% as.data.frame()
  #census_chicago2$GEOID <- substring(census_chicago2$GEOID,6)

#PLACES_IL <- left_join(PLACES_IL, census_chicago2, by = c("tractfips" = "GEOID"))

# finally, merge PLACES_IL data with grocery stores data

#PLACES_IL <- left_join(PLACES_IL, grocery_stores, by = c("tractfips" = "census_tract"))
#PLACES_IL <- PLACES_IL %>% mutate(total.grocery.stores = replace_na(total.grocery.stores, 0))


# merge with community area
#PLACES_IL <- left_join(PLACES_IL, tracts_comarea, by = c("tractfips" = "tractce10"))




#PLACES_IL  = PLACES_IL %>% select(tractfips, community, commarea, access2_crudeprev, bpmed_crudeprev, chd_crudeprev, copd_crudeprev, csmoking_crudeprev , diabetes_crudeprev, highchol_crudeprev, kidney_crudeprev, obesity_crudeprev, sleep_crudeprev , stroke_crudeprev, Total.Homicides, pct_black, medincomeE, pct_college, pct_foodstamps, total.grocery.stores)

#PLACES_IL[, c(3:ncol(PLACES_IL))] <- sapply(PLACES_IL[, c(3:ncol(PLACES_IL))], as.numeric)

#write.csv(PLACES_IL, "/Users/macbookair/Library/Mobile Documents/com~apple~CloudDocs/PA 470 AI & ML/Final Project Life Expectancy/PA 470 Final project/PLACES_IL.csv", row.names = FALSE)


PLACES_IL <- read.csv("PLACES_IL.csv")

Exploratory Data Analysis

sex_le <- read.socrata("https://data.cityofchicago.org/resource/3qdj-cqb8.json")

sex_le = sex_le %>% filter(sex == "All")
sex_le = sex_le %>% filter(!race_ethnicity == "All")
sex_le$X_2010_life_expectancy <- as.numeric(sex_le$X_2010_life_expectancy)

sex_le %>% ggplot(mapping = aes(x = race_ethnicity,  
y= X_2010_life_expectancy, fill= race_ethnicity)) + 
  scale_fill_manual(values = c("#e09f3e", "#540b0e", "#56B4E9")) +
  scale_y_continuous(breaks = seq(0, 350, by = 50)) +
  
    labs( x= "Race", 
        y= "Number of Years Expected to Live",
        title = "Average Life Expectancy in 2010",
        subtitle = "Significant health disparities") +
  
geom_col() + geom_text(aes(x = race_ethnicity, y = X_2010_life_expectancy,
label = round(X_2010_life_expectancy, 0)), size = 3, hjust=0 , vjust= -0.5 , family= "Arial") +
  
    theme_classic() +
  theme( 
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "italic", size = 14),
    axis.title = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12, face = "bold"),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    legend.position = "none")

African Americans in 2010 lived 7 years less on average than whites, 13 years less than Hispanic/Latinx

line_graph = sex_le %>%
  select(race_ethnicity,X_1990_life_expectancy,     X_2000_life_expectancy, X_2010_life_expectancy)

line_graph$X_1990_life_expectancy <- as.double(line_graph$X_1990_life_expectancy)
line_graph$X_2000_life_expectancy <- as.double(line_graph$X_2000_life_expectancy)

line_graph = line_graph %>%
  pivot_longer(!race_ethnicity, 
               names_to = "Year",
               values_to = "Life_Expectancy")
               
line_graph$Year <- gsub("[^0-9.-]", "", line_graph$Year)
line_graph$Year <- as.factor(line_graph$Year)





p <- ggplot(data=line_graph, aes(x=Year, y= Life_Expectancy, group=race_ethnicity, colour=race_ethnicity)) +
    geom_line() +
    geom_point() +
  labs(x='Year', y='Average Years expected to Live', title='Average Life Expectancy by Race/Ethnicity', subtitle = 'Source: Chicago Open Data Portal')  + theme_classic() + theme(
plot.title = element_text(face = "bold", size = 12), plot.subtitle = element_text(face = "italic", size = 12),
axis.title = element_text(size = 10, face = "bold"), axis.text = element_text(size = 8, face = "bold"), axis.line = element_blank(),
axis.ticks = element_blank())

fig <- ggplotly(p)

fig

Trend of Life Expectancy by group has been going up steadily for all Whites, Blacks, and Hispanic

filtered_correlation <- cities_final 

filtered_correlation <-  rename(filtered_correlation,  c( "heart_dis" = "chd_crudeprev", "smoking" = "csmoking_crudeprev", "diabetes" = "diabetes_crudeprev",
                                 "kidney_dis" = "kidney_crudeprev",
                                 "obesity" = "obesity_crudeprev",
                                 "homicides" = "Total.Homicides",
                                 "Black" = "pct_black",
                                 "income" = "medincomeE",
                                 "college" = "pct_college",
                                 "grocery_stores" = "total.grocery.stores",))


filtered_correlation = filtered_correlation %>%
  select(le , heart_dis, smoking, diabetes,kidney_dis, obesity, homicides, Black , income, college, grocery_stores)


correlation <-cor(filtered_correlation)




corrplot(correlation, type="upper", order="hclust",
         col=brewer.pal(n=8, name="RdYlBu"))

Life Expectancy has a negative correlation with the following variables: homicides, % Black, heart disease, diabetes, kidney disease, smoking, obesity, grocery stores (weak association) In fact, the strongest, negative correlation is with the % Black and obesity variables. Surprisingly, it has a weak relationship with grocery stores.

Life Expectancy has a strong, positive correlation with income and % with college education.

Predictor Variables

15 predictors at the census tract level, crude prevalence rates(total # divided by the population)

Chronic Health Diseases

diabetes_crudeprev = % with diabetes
CHD_CrudePrev = % with heart disease
copd_crudeprev = % with chronic obstructive pulmonary disease, a type of lung disease
kidney_crudeprev = % with chronic kidney disease
stroke_crudeprev = % who has had stroke in the year the data was collected(2015 and 2019)

Risk factors for Health Diseases

obesity_crudeprev = % considered obese
csmoking_crudeprev = % who smoke frequently
sleep_crudeprev = % who sleep less than 7 hrs

Social and economic factors

access2_crudeprev = % lack of health insurance among adults 18-65yrs
pct_foodstamps = % in community with SNAP benefits (food stamps)
medincomeE = median income
pct_black = % of African Americans in census tract
pct_college = % with a Bachelor’s degree
total.grocery.stores = total grocery stores
Total. Homicides = total homicides in the census tract(2015 and 2019)

DV is le = average life expectancy

Unit of analysis is tractfips = Chicago Census Tract

missing_values = cities_final %>%
  select(everything()) %>%  
  summarise_all(funs(sum(is.na(.))))# no missing values

Initial Model: Decision Tree

group <- rsample::initial_split(cities_final)
train <- training(group)
test <- testing(group)


le_folds <- vfold_cv(train, repeats=1, v=3)
  
 tree_model <- parsnip::decision_tree(tree_depth=5) %>%
   set_engine("rpart") %>%
   set_mode("regression")


 le_rec <- recipe (le ~  access2_crudeprev + chd_crudeprev+ copd_crudeprev + csmoking_crudeprev +  diabetes_crudeprev  +   kidney_crudeprev + obesity_crudeprev   + sleep_crudeprev +  stroke_crudeprev + Total.Homicides + pct_black + medincomeE + pct_college + pct_foodstamps   + total.grocery.stores, data = train) %>% 
     
   step_corr(all_predictors()) %>% # drops variables that correlate with each other
   
   
  step_normalize(all_numeric_predictors())  # normalizes numeric variable to have SD of 1 and mean of 0
   



le_workflow <-
  workflow() %>%
  add_model(tree_model) %>%
  add_recipe(le_rec)


le_fit <- le_workflow %>%
  fit(data= train)


le_preds <- augment(le_fit, test)


DT::datatable(yardstick::rmse(
  le_preds,
  truth = le,
  estimate = .pred
))

tree_fit <- le_fit  %>% 
  extract_fit_engine()


rpart.plot::rpart.plot(tree_fit)

Regression Decision tree was used for the first model with a tree depth of 5. Important thing to note here is that the most important variable identified in the model is the % of African Americans in the census tract.

Assessing 5 models with different tuning and parameters

linear_reg_spec <- 
   linear_reg(penalty = tune(), mixture = tune()) %>% 
   set_engine("glmnet")

xgb_spec <- 
   boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             min_n = tune(), sample_size = tune(), trees = tune()) %>% 
   set_engine("xgboost") %>% 
   set_mode("regression")

   #boosted = xgb_spec   ✓


 cart_spec <- 
   decision_tree(cost_complexity = tune(), min_n = tune()) %>% 
   set_engine("rpart") %>% 
   set_mode("regression")
 #decision_tree= cart_spec
 
 
knn_spec <- 
   nearest_neighbor(neighbors = tune(), dist_power = tune(), weight_func = tune()) %>% 
   set_engine("kknn") %>% 
   set_mode("regression")

nnet_spec <- 
   mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>% 
   set_engine("nnet", MaxNWts = 2600) %>% 
   set_mode("regression")




my_set <- workflow_set(
  preproc =  list(normalized = le_rec), 
  models = list(
   linear_reg = linear_reg_spec, boosted = xgb_spec, decision_tree= cart_spec, KNN = knn_spec, neural_network = nnet_spec)
)

grid_ctrl <-
   control_grid(
      save_pred = FALSE,
      save_workflow = FALSE
   )



grid_results <-
   my_set %>%
   workflow_map(
      seed = 1503,
      resamples = le_folds,
      grid = 15,
      control = grid_ctrl,
      verbose = FALSE
   )

nrow(collect_metrics(grid_results, summarize = FALSE)) # 450

## [1] 450

grid_results %>% 
   rank_results() %>% 
   filter(.metric == "rmse") %>% 
   select(model, .config, rmse = mean, rank) %>%
    kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))

model	.config	rmse	rank
linear_reg	Preprocessor1_Model10	2.578095	1
linear_reg	Preprocessor1_Model14	2.592534	2
linear_reg	Preprocessor1_Model03	2.608995	3
linear_reg	Preprocessor1_Model12	2.610349	4
linear_reg	Preprocessor1_Model08	2.610371	5
linear_reg	Preprocessor1_Model13	2.610386	6
linear_reg	Preprocessor1_Model09	2.610406	7
linear_reg	Preprocessor1_Model07	2.610407	8
linear_reg	Preprocessor1_Model15	2.610412	9
linear_reg	Preprocessor1_Model11	2.610479	10
linear_reg	Preprocessor1_Model05	2.610606	11
linear_reg	Preprocessor1_Model06	2.610635	12
linear_reg	Preprocessor1_Model04	2.610859	13
linear_reg	Preprocessor1_Model02	2.611098	14
linear_reg	Preprocessor1_Model01	2.611615	15
mlp	Preprocessor1_Model12	2.612178	16
nearest_neighbor	Preprocessor1_Model06	2.657645	17
nearest_neighbor	Preprocessor1_Model05	2.691113	18
boost_tree	Preprocessor1_Model15	2.699301	19
boost_tree	Preprocessor1_Model06	2.711654	20
mlp	Preprocessor1_Model03	2.714290	21
nearest_neighbor	Preprocessor1_Model08	2.716003	22
nearest_neighbor	Preprocessor1_Model12	2.732779	23
nearest_neighbor	Preprocessor1_Model11	2.740400	24
boost_tree	Preprocessor1_Model03	2.744069	25
nearest_neighbor	Preprocessor1_Model09	2.760300	26
decision_tree	Preprocessor1_Model11	2.799309	27
decision_tree	Preprocessor1_Model02	2.804476	28
boost_tree	Preprocessor1_Model14	2.807765	29
nearest_neighbor	Preprocessor1_Model04	2.825890	30
boost_tree	Preprocessor1_Model01	2.844425	31
nearest_neighbor	Preprocessor1_Model02	2.844712	32
boost_tree	Preprocessor1_Model05	2.846701	33
boost_tree	Preprocessor1_Model02	2.889060	34
decision_tree	Preprocessor1_Model01	2.912858	35
decision_tree	Preprocessor1_Model05	2.927119	36
nearest_neighbor	Preprocessor1_Model13	2.929557	37
decision_tree	Preprocessor1_Model15	2.933926	38
decision_tree	Preprocessor1_Model12	2.933926	39
nearest_neighbor	Preprocessor1_Model15	2.935942	40
boost_tree	Preprocessor1_Model11	2.950718	41
nearest_neighbor	Preprocessor1_Model14	2.973112	42
nearest_neighbor	Preprocessor1_Model03	2.999706	43
mlp	Preprocessor1_Model04	3.025198	44
decision_tree	Preprocessor1_Model03	3.026930	45
boost_tree	Preprocessor1_Model12	3.041542	46
nearest_neighbor	Preprocessor1_Model10	3.062527	47
decision_tree	Preprocessor1_Model13	3.071267	48
decision_tree	Preprocessor1_Model09	3.083321	49
decision_tree	Preprocessor1_Model08	3.090028	50
mlp	Preprocessor1_Model07	3.107423	51
boost_tree	Preprocessor1_Model09	3.119737	52
decision_tree	Preprocessor1_Model10	3.127145	53
decision_tree	Preprocessor1_Model04	3.173886	54
mlp	Preprocessor1_Model11	3.196364	55
decision_tree	Preprocessor1_Model06	3.227234	56
mlp	Preprocessor1_Model05	3.231370	57
mlp	Preprocessor1_Model06	3.318100	58
nearest_neighbor	Preprocessor1_Model01	3.410332	59
decision_tree	Preprocessor1_Model07	3.469856	60
decision_tree	Preprocessor1_Model14	3.584462	61
nearest_neighbor	Preprocessor1_Model07	3.639251	62
mlp	Preprocessor1_Model01	3.722567	63
mlp	Preprocessor1_Model14	3.873417	64
mlp	Preprocessor1_Model15	3.993735	65
mlp	Preprocessor1_Model09	4.280478	66
mlp	Preprocessor1_Model02	4.430005	67
boost_tree	Preprocessor1_Model13	4.747433	68
boost_tree	Preprocessor1_Model10	4.891170	69
boost_tree	Preprocessor1_Model07	7.826343	70
mlp	Preprocessor1_Model10	8.916352	71
mlp	Preprocessor1_Model08	9.712742	72
mlp	Preprocessor1_Model13	14.248312	73
boost_tree	Preprocessor1_Model08	14.589793	74
boost_tree	Preprocessor1_Model04	35.185166	75

autoplot(
   grid_results,
   rank_metric = "rmse",  # <- how to order models
   metric = "rmse",       # <- which metric to visualize
   select_best = TRUE     # <- one point per workflow
) + geom_text(aes(y = mean - 1/2, label = wflow_id), angle = 90, hjust =-0.6, size = 3.5)

After doing a grid search of 5 models ( Linear Regression, Boosted Tree, Decision Tree,K-Nearest Neighbor, Multi-layer Perceptron Neural Network) with 3 re-samples and 15 grids, the best model with the lowest RMSE is a Normalized Linear Regression.

best_results <- 
   grid_results %>% 
   extract_workflow_set_result("normalized_linear_reg") %>%  # specify the set result
   select_best(metric = "rmse")

best_results_fit <- 
   grid_results %>% 
   extract_workflow("normalized_linear_reg") %>%  # specify
   finalize_workflow(best_results) %>% 
   last_fit(split = group)


best_results_fit %>% 
   collect_predictions() %>% 
   ggplot(aes(x = le, y = .pred)) + 
   geom_abline(color = "gray50", lty = 2) + 
   geom_point(alpha = 0.5) + 
   coord_obs_pred() + 
   labs(x = "observed LE", y = "predicted LE")

Once we apply the best model to actual values for life expectancy in 2015, we see that it does an OK job in predicting life expectancy.

best_results

## # A tibble: 1 × 3
##   penalty mixture .config              
##     <dbl>   <dbl> <chr>                
## 1   0.120   0.678 Preprocessor1_Model10

linear_param <- 
  tibble(
    penalty = 0.01196087,
    mixture = 0.6784387
  )

le_workflow_final <-
  workflow() %>%
  add_model(linear_reg_spec) %>%
  add_recipe(le_rec)


final_workflow <- 
  le_workflow_final  %>%
  finalize_workflow(linear_param)

final_fit = final_workflow %>%
  fit(data= cities_final) # predict future data

le_preds <- augment(final_fit, cities_final)

le_preds$.pred <- round(le_preds$.pred, digits = 1)

subset(le_preds , le_preds$le ==  le_preds$.pred)  %>%
  select(tractfips, le, .pred)

## # A tibble: 12 × 3
##    tractfips    le .pred
##        <int> <dbl> <dbl>
##  1     10400  79.7  79.7
##  2     30604  79    79  
##  3     50900  81.3  81.3
##  4    161300  78.6  78.6
##  5    170400  79    79  
##  6    210100  79.4  79.4
##  7    390400  73.4  73.4
##  8    570500  79.5  79.5
##  9    620300  78.6  78.6
## 10    680600  68.9  68.9
## 11    770902  79    79  
## 12    836600  77.3  77.3

Out of the 698 census tracts in the data, my model accurately predicted the life expectancy of 12 census tracts

Out of sample prediction for 2019 Life Expectancy

I conducted an out-of sample analysis for 2019 health data using the same variables in the 2015 model. After aggregating the predictions to community area, I downloaded life expectancy by community area from the Chicago Health Atlas to compare my out of sample predictions

le_preds.2019 <- augment(final_fit, PLACES_IL) # out of sample prediction

le_preds.2019 = le_preds.2019  %>% group_by(community, commarea) %>% summarize(predicted_life.expectancy = mean(.pred, na.rm=TRUE))


CDPH_2019 <- read.csv("CDPH_2019.csv") 


CDPH_2019$Name <- toupper(CDPH_2019$Name)
CDPH_2019$Name <- gsub("'", "", CDPH_2019$Name)

                                          

le_preds.2019 = left_join(le_preds.2019, CDPH_2019, by = c("commarea" = "GEOID"))

le_preds.2019 = rename(le_preds.2019, Actual_Life.Expectancy2019 = VRLE_2019)
le_preds.2019 <- le_preds.2019 %>% select(-c("Layer","Name"))

le_preds.2019 <- le_preds.2019   %>% filter(!is.na(community)) # remove na value




le_preds.2019$predicted_life.expectancy <- round(le_preds.2019$predicted_life.expectancy, digits = 1) # round by 1 digit

Final Maps

chicago_map <- cmapgeo::cca_sf
chicago_map$cca_name <- toupper(chicago_map$cca_name)

chicago_map$cca_name <- gsub("'", "", chicago_map$cca_name)


chicago_map = left_join(chicago_map, le_preds.2019, by = c("cca_name" = "community"))
#chicago_map %>% filter(cca_name == "NEW CITY")

Predicted 2019 Life Expectancy Map

 tm_shape(chicago_map) + 
          tm_polygons(col = "predicted_life.expectancy", breaks = c(65, 70,75, 80, 85),  id = "cca_name", zindex = 401, title = "Predicted Life Expectancy in 2019", popup.vars = c( "Predicted Life Expectancy: " = "predicted_life.expectancy"),
                      legend.format=list(fun=function(x) paste0(formatC(x, digits=0, format="f"), " years"))) + 
          tm_layout(legend.format = list(fun = function(x) { paste0(formatC(x, digits = 1, format = "f"), " years") }))  + tmap_mode("view")

## tmap mode set to interactive viewing

Actual 2019 Life Expectancy Map

tm_shape(chicago_map) + 
          tm_polygons(col = "Actual_Life.Expectancy2019", id = "cca_name", zindex = 401, title = "Actual Life Expectancy in 2019", popup.vars = c( "Actual Life Expectancy: " = "Actual_Life.Expectancy2019"),
                      legend.format=list(fun=function(x) paste0(formatC(x, digits=0, format="f"), " years"))) + 
          tm_layout(legend.format = list(fun = function(x) { paste0(formatC(x, digits = 1, format = "f"), " years") }))  + tmap_mode("view")

## tmap mode set to interactive viewing

subset(le_preds.2019 , le_preds.2019 $predicted_life.expectancy ==  le_preds.2019 $Actual_Life.Expectancy2019)  %>%
  select(community, predicted_life.expectancy, Actual_Life.Expectancy2019)

## # A tibble: 2 × 3
## # Groups:   community [2]
##   community       predicted_life.expectancy Actual_Life.Expectancy2019
##   <chr>                               <dbl>                      <dbl>
## 1 BEVERLY                              79.5                       79.5
## 2 NEAR SOUTH SIDE                      79.4                       79.4

Out of the 77 community areas, my model accurately predicted life expectancy for Beverly and Near South Side! Overall, the two maps look almost identical, with life expectancy higher in the North-side and lower in the south-side regions!

Conclusion

The main benefit of using machine learning models to predict life expectancy is determining which communities health agencies like CDPH should prioritize to improve the life expectancy numbers, particularly with the Healthy Chicago 2025 goal of reducing the life expectancy gap between various communities. For example, the life expectancy for residents in Austin is 71 years while the life expectancy in Near North Side is 82.1 years. This is a difference of 12 years and this is completely unacceptable. Intervention efforts can be directed towards treating diabetes and helping educate people on what to eat. CDPH could also work on strategies to help improve access to healthy food, particularly around communities with a low number of grocery stores. Ultimately, more funding for Black and Latinx communities would improve health equity, and machine learning can help assess which communities to prioritize. City agencies can also target individuals without health insurance and expand their access to health services like Medicaid so they can get better healthcare coverage.

However, as we saw from the decision tree, the most important variable that the model identified in predicting life expectancy was the % of Black Residents in a census tract. This type of rationale can be completely racist if city agencies want to explain these models to the public. It would be really disrespectful and racist to present at a public health agency meeting that the reason why a community’s life expectancy is low is because there was a high % of African Americans in that community. Therefore, the linear regression model is the best fit for these kinds of predictions. Moreover, it is also important to keep in mind that machine learning models are meant to mimic real-world data. We cannot completely eliminate health equities with these models because these models are mimicking real-world data. There are large, systemic issues that need to be taken into consideration first. We are not trying to solve the life expectancy gap, we are solely trying to predict life expectancy so city agencies can understand which communities they need to prioritize. Additionally, this model cannot be used to predict future life expectancies past 2019 because the 2014-2015 variables does not include Covid-case, and as I mentioned in the introduction, CDPH reported two days ago that life expectancy in 2020 for African Americans in Chicago dropped below 70 years for the first time in decades due to the impact of the pandemic. This model also does not explain the impact of genetic factors on disease risk among different ethnicities. For example, African Americans carry a higher risk of developing diabetes due to genetic factors. These factors are not taken into consideration in the model, and can lead to bias.

Given the low RMSE, I would say that the variables in the model do a pretty good job in predicting life expectancy but there are areas of improvement. I would recommend CDPH to use variables in this model but not exactly this model, as the RMSE can be further decreased with additional variables and additional hyperparameter tuning. Moreover, the model predicted life expectancy for 698 census tracts for the 2015 data, but there are 866 census tracts in Chicago. As a result, due to missing data, 168 census tracts are unaccounted for in the model.

Predicting Life Expectancy in Chicago

David Segovia

4/28/2022