Depression is a serious mental health condition that affects around 8% of the American population (or 21 million people!) on an annual basis. A wide range of symptoms are associated with the condition, including persistent feelings of sadness, hopelessness, and emptiness, increased irritability, decreased energy, disrupted sleep, and changes in appetite. Those with the condition experience a marked decrease in quality of life, productivity, and happiness, all of which have social and economic consequences. Moreover, in the most severe cases, the condition can lead to death by suicide. All of this makes depression a pressing public health concern at both the individual and societal level.
In the following analysis, I will work with a variety of socioeconomic indicators to build a suite of machine learning models that predict the county-level prevalence of depression in the US. The best performing model, judged by the root mean square error (RMSE), will be selected as the final model. The final model will then be analyzed to gain insight into the relationship between the socioeconomic indicators used in the analysis and the prevalence of depression at a county level.
The data used in this analysis are taken from the following 2023 CDC study—National, State-Level, and County-Level Prevalence Estimates of Adults Aged ≥18 Years Self-Reporting a Lifetime Diagnosis of Depression1—as well as from the US Census Bureau. Specifically, we will use the following Census Bureau tables for our predictors: Selected Social Characteristics in the United States2, Selected Economic Characteristics3, Selected Housing Characteristics4, and ACS Demographic and Housing Estimates5.
We’ll begin by loading the required packages and data into our environment.
library(ggplot2)
library(tidyverse)
library(tidymodels)
library(readxl)
library(corrplot)
library(vip)
library(tidypredict)
library(visdat)
tidymodels_prefer()
depression <- read_excel("cdc_129404_DS1.xlsx", range = cell_rows(c(2,NA))) %>%
rename(FIPS = `CountyFIPS code`,
DEP = `Crude Prevalence Estimate`) %>%
select(FIPS, DEP) %>%
mutate_if(is.character, as.numeric)
dp02 <- read_csv("ACSDP5Y2022.DP02_2024-03-15T154358/ACSDP5Y2022.DP02-Data.csv",
show_col_types = FALSE) %>%
rename(AHS = `DP02_0016E`,
AFS = `DP02_0017E`,
FER = `DP02_0040E`,
BOH = `DP02_0068PE`,
VET = `DP02_0070PE`,
DIS = `DP02_0072PE`,
FDC = `DP02_0084PE`,
FDS = `DP02_0086PE`,
FBP = `DP02_0094PE`,
LOE = `DP02_0114PE`,
COM = `DP02_0153PE`,
INT = `DP02_0154PE`,
FIPS = `GEO_ID`) %>%
select(FIPS, AHS, AFS, FER, BOH, VET, DIS, FDC, FDS, FBP, LOE, COM, INT) %>%
filter(!row_number() %in% c(1)) %>%
mutate(FIPS = substr(FIPS, nchar(FIPS)-4, nchar(FIPS))) %>%
mutate_if(is.character, as.numeric)
dp03 <- read_csv("ACSDP5Y2022.DP03_2024-03-15T231540/ACSDP5Y2022.DP03-Data.csv",
show_col_types = FALSE) %>%
rename(UER = `DP03_0009PE`,
ACT = `DP03_0025E`,
MHI = `DP03_0062E`,
NHC = `DP03_0099PE`,
POV = `DP03_0128PE`,
FIPS = `GEO_ID`) %>%
select(FIPS, UER, ACT, MHI, NHC, POV) %>%
filter(!row_number() %in% c(1)) %>%
mutate(FIPS = substr(FIPS, nchar(FIPS)-4, nchar(FIPS))) %>%
mutate_if(is.character, as.numeric)
dp04 <- read_csv("ACSDP5Y2022.DP04_2024-03-16T144811/ACSDP5Y2022.DP04-Data.csv",
show_col_types = FALSE) %>%
rename(HVR = `DP04_0004E`,
RVR = `DP04_0005E`,
MHV = `DP04_0089E`,
MGR = `DP04_0134E`,
HOR = `DP04_0046PE`,
FIPS = `GEO_ID`) %>%
select(FIPS, HVR, RVR, MHV, MGR, HOR) %>%
filter(!row_number() %in% c(1)) %>%
mutate(FIPS = substr(FIPS, nchar(FIPS)-4, nchar(FIPS))) %>%
mutate_if(is.character, as.numeric)
dp05 <- read_csv("ACSDP5Y2022.DP05_2024-03-16T150451/ACSDP5Y2022.DP05-Data.csv",
show_col_types = FALSE) %>%
rename(POP = `DP05_0001E`,
MEA = `DP05_0018E`,
SER = `DP05_0004E`,
PUE = `DP05_0019PE`,
POS = `DP05_0024PE`,
PEW = `DP05_0066PE`,
PHL = `DP05_0073PE`,
HOU = `DP05_0088E`,
FIPS = `GEO_ID`) %>%
select(FIPS, POP, MEA, SER, PUE, POS, PEW, PHL, HOU) %>%
filter(!row_number() %in% c(1)) %>%
mutate(FIPS = substr(FIPS, nchar(FIPS)-4, nchar(FIPS))) %>%
mutate_if(is.character, as.numeric) %>%
mutate(HOU = HOU/POP)
depression <- left_join(depression, dp02, by = join_by(FIPS)) %>%
left_join(y = dp03, by = join_by(FIPS)) %>%
left_join(y = dp04, by = join_by(FIPS)) %>%
left_join(y = dp05, by = join_by(FIPS))
In 2022, the Census Bureau began reporting data for each of Connecticut’s nine planning regions instead of its eight historical counties, which ceased to function as local government entities in 1960. As such, to obtain county-level data for Connecticut, we must independently aggregate the county-subdivision-level data into county-level metrics using population-weighted averages.
ct_fips <- read_csv("ct-town-county-fips-list.csv",
show_col_types = FALSE) %>%
mutate(FIPS = substr(FIPS, 1, 4))
dp02_ct <- read_csv("ACSDP5Y2022.DP02_2024-03-19T172051/ACSDP5Y2022.DP02-Data.csv",
show_col_types = FALSE) %>%
rename(AHS = `DP02_0016E`,
AFS = `DP02_0017E`,
FER = `DP02_0040E`,
BOH = `DP02_0068PE`,
VET = `DP02_0070PE`,
DIS = `DP02_0072PE`,
FDC = `DP02_0084PE`,
FDS = `DP02_0086PE`,
FBP = `DP02_0094PE`,
LOE = `DP02_0114PE`,
COM = `DP02_0153PE`,
INT = `DP02_0154PE`) %>%
select(GEO_ID, AHS, AFS, FER, BOH, VET, DIS, FDC, FDS, FBP, LOE, COM, INT) %>%
filter(!row_number() %in% c(1)) %>%
mutate(across(!GEO_ID, as.numeric))
dp03_ct <- read_csv("ACSDP5Y2022.DP03_2024-03-19T182634/ACSDP5Y2022.DP03-Data.csv",
show_col_types = FALSE) %>%
rename(UER = `DP03_0009PE`,
ACT = `DP03_0025E`,
MHI = `DP03_0062E`,
NHC = `DP03_0099PE`,
POV = `DP03_0128PE`) %>%
select(GEO_ID, UER, ACT, MHI, NHC, POV) %>%
filter(!row_number() %in% c(1)) %>%
mutate(across(!GEO_ID, as.numeric))
dp04_ct <- read_csv("ACSDP5Y2022.DP04_2024-03-19T183028/ACSDP5Y2022.DP04-Data.csv",
show_col_types = FALSE) %>%
rename(HVR = `DP04_0004E`,
RVR = `DP04_0005E`,
MHV = `DP04_0089E`,
MGR = `DP04_0134E`,
HOR = `DP04_0046PE`) %>%
select(GEO_ID, HVR, RVR, MHV, MGR, HOR) %>%
filter(!row_number() %in% c(1)) %>%
mutate(across(!GEO_ID, as.numeric))
dp05_ct <- read_csv("ACSDP5Y2022.DP05_2024-03-19T183304/ACSDP5Y2022.DP05-Data.csv",
show_col_types = FALSE) %>%
rename(POP = `DP05_0001E`,
MEA = `DP05_0018E`,
SER = `DP05_0004E`,
PUE = `DP05_0019PE`,
POS = `DP05_0024PE`,
PEW = `DP05_0066PE`,
PHL = `DP05_0073PE`,
HOU = `DP05_0088E`) %>%
select(GEO_ID, POP, MEA, SER, PUE, POS, PEW, PHL, HOU) %>%
filter(!row_number() %in% c(1)) %>%
mutate(across(!GEO_ID, as.numeric))
ct_data <- full_join(dp02_ct, dp03_ct, join_by(GEO_ID)) %>%
full_join(dp04_ct, join_by(GEO_ID)) %>%
full_join(dp05_ct, join_by(GEO_ID)) %>%
inner_join(y = ct_fips, by = join_by(GEO_ID)) %>%
select(-c(Town, County, GEO_ID)) %>%
relocate(FIPS)
ct_data <- ct_data %>%
group_by(FIPS) %>%
summarize(across(c("AHS", "AFS", "FER", "BOH", "VET", "DIS", "FDC", "FDS", "FBP", "LOE", "COM", "INT", "UER", "ACT", "MHI", "NHC", "POV", "HVR", "RVR", "MHV", "MGR", "HOR", "MEA", "SER", "PUE", "POS", "PEW", "PHL"), ~ weighted.mean(.x, POP, na.rm = TRUE)),
POP = sum(POP),
HOU = sum(HOU)/sum(POP)) %>%
mutate(FIPS = as.numeric(FIPS)) %>%
relocate(POP, .before = MEA)
depression <- depression %>%
rows_patch(ct_data, by = c("FIPS"))
As a first step to exploring the data, let’s check for any missing data using the vis_miss() function. As can be seen below, only a minimal amount of data is missing. Since the number of missing data points is small, it is easiest to simply drop the incomplete rows from our dataset.
vis_miss(depression)
depression <- depression %>%
drop_na()
Next, let’s have a look at our outcome variable DEP, the
prevalence of depression in each US county.
depression %>%
ggplot() +
geom_histogram(aes(x = DEP), binwidth = 1)
From the histogram, we can see that the prevalence of depression ranges from as low as 10% to as high as 30%, and appears to be roughly normally distributed with a mean around 20%.
Next, let’s have a look at our predictors as well. We’ll start by creating a correlation plot of the outcome and all the predictors.
depression %>%
select(-FIPS) %>%
cor() %>%
corrplot(method = "circle", type = "lower", diag = FALSE)
From the correlation plot, we can pick out several relationships between the predictors and the outcome, as well as among the predictors themselves. For instance, the poverty rate POV and the percent of the population that is disabled DIS are moderately positively correlated with the prevalence of depression DEP, while predictors like the median household income MHI and the percent of the population with a bachelor’s degree or higher BOH are moderately negatively correlated with the prevalence of depression. Additionally, we see certain predictors with strong positive correlations with each other, including but not limited to: the percent of the population born in a foreign country FBP and the percent speaking a language other than English at home LOE; the median home value MHV and median gross rent MGR with the percent of the population with a bachelor’s degree or higher BOH; and the median age MEA with the percent of the population that is 65 or older, POS. None of these relationships is particularly surprising, since each pair should logically be related, though it is reassuring that the data agrees with that intuition. There are also certain predictors with strong negative correlations with each other; again, these are mostly logical and unsurprising, and include the median household income MHI with the poverty rate POV and the percent disabled DIS, as well as the median age MEA with the percent of the population under 18, PUE.
Let’s have a closer look at some of these relationships. First, let’s plot the depression prevalence DEP against the poverty rate POV:
depression %>%
ggplot() +
geom_point(aes(x = POV, y = DEP))
Just as indicated by the correlation plot, we can see a clear
positive correlation between the two variables. On average, as the
poverty rate POV increases, the depression prevalence
DEP also increases. Similarly, let’s again plot the
depression prevalence DEP, this time against the median
household income MHI:
depression %>%
ggplot() +
geom_point(aes(x = MHI, y = DEP))
Once again, the plot is in agreement with what we have observed from
the correlation plot. This time, when the median household income
MHI increases, the prevalence of depression
DEP decreases: a negatively correlated relationship, as
expected.
Finally, let’s have a look at the correlation between the product of two predictors and the prevalence of depression.
n <- dim(depression)[2]
inter_cor_mat <- matrix(nrow = n - 2,
ncol = n - 2,
dimnames = list(names(depression)[3:n],
names(depression)[3:n]))
for (i in 3:n) {
for (j in i:n) {
inter_cor_mat[i-2,j-2] <- cor(depression[,i]*depression[,j], depression$DEP, use = "complete.obs")
}
}
corrplot(inter_cor_mat, type = "upper")
From the above plot, we can pick out many interactions that are positively or negatively correlated with the depression prevalence. The positively correlated interactions include POV:PEW (the poverty rate and the percent of the population that is white), AHS:DIS (average household size and disability rate), and PUE:DIS (percent under eighteen and disability rate). The negatively correlated interactions include MHI:SER (median household income and sex ratio), INT:MHI (percent of the population with internet at home and median household income), and BOH:LOE (percent of the population with a bachelor’s degree or higher and percent of the population that speaks a language other than English at home).
Next, we’ll pick out the interactions that have a correlation with the depression prevalence above a specific magnitude. These will be used in our models to enhance their predictive abilities.
inter_terms <- which(abs(inter_cor_mat) > 0.35, arr.ind = TRUE)
for (i in 1:dim(inter_terms)[1]) {
inter_terms[i,1] <- names(depression)[as.numeric(inter_terms[i,1]) + 2]
inter_terms[i,2] <- names(depression)[as.numeric(inter_terms[i,2]) + 2]
}
inter_terms
## row col
## AHS "AHS" "DIS"
## AFS "AFS" "DIS"
## DIS "DIS" "DIS"
## DIS "DIS" "COM"
## DIS "DIS" "INT"
## DIS "DIS" "ACT"
## AFS "AFS" "MHI"
## COM "COM" "MHI"
## INT "INT" "MHI"
## MHI "MHI" "MHI"
## DIS "DIS" "POV"
## INT "INT" "POV"
## ACT "ACT" "POV"
## DIS "DIS" "HOR"
## POV "POV" "HOR"
## MHI "MHI" "MEA"
## MHI "MHI" "SER"
## DIS "DIS" "PUE"
## MHI "MHI" "POS"
## DIS "DIS" "PEW"
## UER "UER" "PEW"
## POV "POV" "PEW"
## MHI "MHI" "HOU"
Before setting up the machine learning models, the data must first be split into a training and a testing set. Just as the names suggest, the training set is used to train the models, while the testing set is used to evaluate the final model’s performance. While the distinction may seem basic, splitting the data is important because we want to avoid using the same data to both fit and test the models. For this particular analysis, I have chosen to place 70% of the data in the training set and 30% in the testing set, maximizing the amount of information used to train the models while still retaining a sizable amount of data to test with. Additionally, the split is stratified on the outcome variable DEP. This ensures that the distribution of the outcome in both the training and testing sets mirrors the original distribution in the data.
set.seed(2468)
depression_split <- initial_split(depression, prop = 0.7, strata = DEP)
depression_train <- training(depression_split)
depression_test <- testing(depression_split)
To confirm that the data was properly split, we divide the number of rows in both the training and testing sets by the number of rows in our original data.
dim(depression_train)[1]/dim(depression)[1]
## [1] 0.6998723
dim(depression_test)[1]/dim(depression)[1]
## [1] 0.3001277
The training set has approximately 70% of the data and the testing set has around 30% of the data, so we have successfully split the data.
Next, we must specify a recipe for our models to use. For the most part, we can use the same recipe for each model, making sure to remove the FIPS variable, as it is an identifier rather than a predictor. Here, we also specify the interaction terms selected earlier for inclusion in our models. Finally, we must center and scale all the predictors (that is, adjust the values so that each predictor has mean 0 and standard deviation 1). This last step is particularly important for models like the \(k\)-nearest neighbors model, which relies on computing distances between data points.
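For concreteness, centering and scaling (as done by step_normalize()) replaces each value of a predictor \(x\) with its \(z\)-score, \[\begin{equation*} z=\frac{x-\bar{x}}{s}, \end{equation*}\] where \(\bar{x}\) and \(s\) are the predictor’s sample mean and standard deviation, so that every predictor contributes to distance calculations on a comparable scale.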
depression_recipe <- recipe(DEP ~ ., data = depression_train) %>%
step_rm(FIPS) %>%
step_interact(terms = ~ AHS:DIS + AFS:DIS + DIS:DIS + DIS:COM) %>%
step_interact(terms = ~ DIS:INT + DIS:UER + DIS:ACT + AFS:MHI) %>%
step_interact(terms = ~ COM:MHI + INT:MHI + MHI:MHI + DIS:POV) %>%
step_interact(terms = ~ INT:POV + ACT:POV + DIS:HOR + POV:HOR) %>%
step_interact(terms = ~ MHI:SER + DIS:PUE + MHI:POS + DIS:PEW) %>%
step_interact(terms = ~ UER:PEW + POV:PEW + MHI:HOU) %>%
step_normalize(all_predictors())
There is one model for which we must specify a separate recipe: the polynomial model, which requires us to tune the degree of the polynomial being fit. Just as before, we will remove the FIPS variable and normalize all predictors, but this time we add a step that allows us to tune the degree of the polynomial model.
depression_poly_recipe <- recipe(DEP ~ ., data = depression_train) %>%
step_rm(FIPS) %>%
step_poly(all_predictors(), degree = tune()) %>%
step_normalize(all_predictors())
As the final step in preparing to build our models, we will split the training data into five folds using \(k\)-fold cross-validation. This process takes our training data and randomly splits it into \(k\) different groups (or folds). Then, when we train our models, each individual model is fit \(k\) times, each time leaving out one of the folds; the held-out fold is then used to evaluate the model’s performance using metrics like RMSE (root mean square error) for regression problems or accuracy for classification problems. The \(k\) metrics from the \(k\) fits can then be averaged to estimate the test error for that particular model, providing us with a metric with which to compare and select the best performing model.
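In other words, the cross-validated estimate of the test error for a regression model is simply the average of the per-fold errors, \[\begin{equation*} \text{CV}_{(k)}=\frac{1}{k}\sum_{i=1}^{k}\text{RMSE}_i, \end{equation*}\] where \(\text{RMSE}_i\) is the root mean square error computed on the \(i\)th held-out fold.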
For this particular analysis, we will use \(k=5\), so our training data will be split
into 5 different groups, once again stratifying on the outcome variable,
DEP.
set.seed(2468)
depression_folds <- vfold_cv(depression_train, v = 5, strata = DEP)
We are now ready to start building and fitting our models. Machine learning models can be broadly split into two categories: parametric and non-parametric models. In the coming steps, we will build and fit several of each type.
The general process for building and fitting the models will be quite similar across all models. In general, we will adhere to the following steps:
1. Specify the model and its parameters (and whether they are to be tuned), the particular engine to be used, and whether the model will be used for regression or classification.
2. Set up a workflow for the model and add the recipe and model to the workflow.
3. If there are model parameters to be tuned, set up a tuning grid by specifying the range of values to be considered and the number of levels (i.e., how many different values) for each parameter.
4. Tune the model with the workflow using the \(k\)-folds obtained earlier and the tuning grid.
Parametric models are those that assume a specific functional form for the outcome as a function of the predictors. These models require less data to fit and are generally easier to interpret. However, since they must adhere to a specific functional form, they are less flexible and generally have a harder time modeling more complex data.
The first parametric model we will build and fit is the classic linear regression model, which assumes that the outcome \(Y\) is a linear function of the predictors \(x_1,\dots,x_n\): \[\begin{equation*} E\left[Y\right]=\beta_0+\beta_1x_1+\dots+\beta_nx_n. \end{equation*}\] Here the goal is to determine the parameters \(\beta_0,\dots,\beta_n\) that best predict the outcome \(Y\) according to this functional form.
lm_model <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
lm_wflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(depression_recipe)
tune_res_lm <- lm_wflow %>%
tune_grid(resamples = depression_folds)
The next parametric model we will fit is the elastic net model, which is similar to the linear regression model, except that a penalty term \[\begin{equation*} \lambda\left[\left(1-\alpha\right)\left|\left|\beta\right|\right|_2^2/2+\alpha\left|\left|\beta\right|\right|_1\right] \end{equation*}\] is added to the residual sum of squares. This introduces some bias into our estimates of \(\beta\), but also reduces the variance, thereby possibly reducing the mean square error and improving model performance. However, the elastic net requires us to tune two parameters: the mixture \(\alpha\), which controls how much weight we give to the lasso penalty term relative to the ridge penalty term, and the penalty \(\lambda\).
net_model <- linear_reg(mixture = tune(),
penalty = tune()) %>%
set_engine("glmnet") %>%
set_mode("regression")
net_wflow <- workflow() %>%
add_model(net_model) %>%
add_recipe(depression_recipe)
net_grid <- grid_regular(penalty(),
mixture(),
levels = 20)
tune_res_net <- net_wflow %>%
tune_grid(resamples = depression_folds,
grid = net_grid)
The third and final parametric model we will fit is the polynomial model. Just as with the elastic net, the polynomial model is similar to the linear regression model (in fact, it technically is a linear regression model), except that we now include predictor terms of degree greater than 1 in our functional form for the outcome. This means that we have one parameter to tune in this model: the degree \(n\) of the polynomial that we fit. This concludes the parametric models that we will fit.
poly_wflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(depression_poly_recipe)
poly_grid <- grid_regular(degree(range = c(1,5)),
levels = 5)
tune_res_poly <- poly_wflow %>%
tune_grid(resamples = depression_folds,
grid = poly_grid)
Next, we will build and fit several non-parametric models. Unlike their parametric counterparts, non-parametric models do not specify a functional form for the outcome, and thus exhibit greater flexibility. This flexibility does not come without cost however, as it does take more data to train these models, and they can be prone to overfitting. Despite these drawbacks, non-parametric models can perform just as well or even better than their parametric counterparts, especially when trying to model complex relationships.
The first non-parametric model we will fit is the \(k\)-nearest neighbors model. Just as the name implies, this model predicts the outcome for a particular point by looking at its \(k\) nearest neighbors. In a regression setting, this means averaging the outcomes of the \(k\) nearest neighbors of a particular point and using that average as the prediction. For this model, we have one parameter to tune: the number of neighbors \(k\) to consider.
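In its simplest, unweighted form (the kknn engine used below also supports distance-weighted variants), the prediction at a point \(x_0\) is \[\begin{equation*} \hat{y}\left(x_0\right)=\frac{1}{k}\sum_{x_i\in N_k\left(x_0\right)}y_i, \end{equation*}\] where \(N_k\left(x_0\right)\) denotes the set of the \(k\) training points closest to \(x_0\).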
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
knn_wflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(depression_recipe)
knn_grid <- grid_regular(neighbors(range = c(1,50)),
levels = 25)
tune_res_knn <- knn_wflow %>%
tune_grid(resamples = depression_folds,
grid = knn_grid)
The next non-parametric model we will build and fit is the pruned decision tree model. Put simply, a decision tree splits the predictor space into \(J\) distinct regions \(R_1,\dots,R_J\), and assigns to all points in a region \(R_j\) the average of the observed outcomes in \(R_j\). A pruned decision tree utilizes a penalty term (similar to the elastic net) \(\alpha\left|T\right|\) to prevent more complex trees from overfitting. So we have one parameter to tune for this model: the cost_complexity \(\alpha\).
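Concretely, cost-complexity pruning selects the subtree \(T\) that minimizes \[\begin{equation*} \sum_{m=1}^{\left|T\right|}\sum_{x_i\in R_m}\left(y_i-\hat{y}_{R_m}\right)^2+\alpha\left|T\right|, \end{equation*}\] where \(\left|T\right|\) is the number of terminal regions of the tree and \(\hat{y}_{R_m}\) is the mean outcome in region \(R_m\); larger values of \(\alpha\) therefore favor smaller trees.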
ptr_model <- decision_tree(cost_complexity = tune()) %>%
set_engine("rpart") %>%
set_mode("regression")
ptr_wflow <- workflow() %>%
add_model(ptr_model) %>%
add_recipe(depression_recipe)
ptr_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
tune_res_ptr <- ptr_wflow %>%
tune_grid(resamples = depression_folds,
grid = ptr_grid)
Next, we’ll have another look at a tree-based model: the random forest model. Random forest models build an ensemble of decision trees and average the results from each tree to come up with a prediction for a particular point, thereby reducing the variance of the model. There is one caveat though: rather than considering every single predictor as a candidate each time the predictor space is split, a random subset of the predictors is chosen as candidates (hence the name, random forest!). To fit the random forest model, then, there are three parameters that we must tune: the number of predictors to consider at each split mtry, the total number of trees trees, and the minimum number of observations needed to split a node, min_n.
Note that random forests tend to take significant time to train, so I will be loading a previously saved fit for this analysis.
ran_for_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
ran_for_wflow <- workflow() %>%
add_recipe(depression_recipe) %>%
add_model(ran_for_model)
ran_for_grid <- grid_regular(mtry(range = c(20, 50)),
trees(range = c(500, 1500)),
min_n(range = c(2,20)),
levels = 5)
# tune_res_ran_for <- ran_for_wflow %>%
# tune_grid(resamples = depression_folds,
# grid = ran_for_grid)
# save(tune_res_ran_for, file = "tune_res_ran_for.rda")
load("tune_res_ran_for.rda")
Our final non-parametric model is yet another tree-based model: the boosted tree model. The model works by sequentially fitting trees to the residuals (the differences between the observed and estimated values) and updating the fitted model with a shrunken version of each newly fitted tree. We again have multiple parameters to tune for the boosted tree model: the number of predictors to consider at each split mtry, the total number of trees trees, and the shrinkage parameter learn_rate.
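To make the sequential fitting explicit: in the textbook description of boosting for regression, with learning rate \(\lambda\) (learn_rate), each new tree \(\hat{f}^b\) is fit to the current residuals, and the model and residuals are updated as \[\begin{equation*} \hat{f}\left(x\right)\leftarrow\hat{f}\left(x\right)+\lambda\hat{f}^b\left(x\right),\qquad r_i\leftarrow r_i-\lambda\hat{f}^b\left(x_i\right). \end{equation*}\] Smaller values of \(\lambda\) slow the learning down, typically requiring more trees but reducing the risk of overfitting. (The xgboost engine used below implements gradient boosting with further refinements, but this is the basic picture.)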
Note that boosted trees also tend to take significant time to train, so I will be loading a previously saved fit for this analysis as well.
btr_model <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
btr_wflow <- workflow() %>%
add_recipe(depression_recipe) %>%
add_model(btr_model)
btr_grid <- grid_regular(mtry(range = c(8,40)),
trees(range = c(500,1500)),
learn_rate(range = c(0.01,0.1), trans = identity_trans()),
levels = 5)
# tune_res_btr <- btr_wflow %>%
# tune_grid(resamples = depression_folds,
# grid = btr_grid)
# save(tune_res_btr, file = "tune_res_btr.rda")
load("tune_res_btr.rda")
Now that we’ve built and fit our machine learning models, let’s have a look at how they performed and the optimal parameter values for those that required tuning.
There were no parameters that required tuning for the linear regression model, so we’ll just have a look at the metrics of the fit.
collect_metrics(tune_res_lm)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 2.19 5 0.0361 Preprocessor1_Model1
## 2 rsq standard 0.517 5 0.0114 Preprocessor1_Model1
best_lm <- select_best(tune_res_lm, metric = "rmse")
From the table above, we see that the RMSE for the fit is 2.19 and the \(R^2\) is 0.517. These values indicate that the model performed moderately well, though not spectacularly.
Next, let’s have a look at the results from the elastic net model.
We’ll first have a look at how the tuned parameters mixture
and penalty affected the model’s performance.
autoplot(tune_res_net)
From the plot, we see that the model performance is relatively constant as the amount of regularization (the penalty parameter) increases, until around 0.01, where the performance quickly drops off. We also see that lower values of the mixture parameter tend to perform better as the amount of regularization increases. Now, let’s have a look at the best performing sets of parameters and their associated metrics:
show_best(tune_res_net, metric = "rmse")
## # A tibble: 5 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00785 0.0526 rmse standard 2.18 5 0.0317 Preprocessor1_Model036
## 2 0.00785 0.158 rmse standard 2.18 5 0.0312 Preprocessor1_Model076
## 3 0.00785 0.105 rmse standard 2.18 5 0.0316 Preprocessor1_Model056
## 4 0.00785 0.263 rmse standard 2.18 5 0.0306 Preprocessor1_Model116
## 5 0.0264 0.0526 rmse standard 2.18 5 0.0309 Preprocessor1_Model037
best_net <- select_best(tune_res_net, metric = "rmse")
From the table, we can observe that the best performing models have smaller values of both the penalty and mixture parameters, in accordance with our previous observations. The RMSE values for these models are all around 2.18, indicating performance roughly on par with the linear regression model.
Next, let’s see how the last parametric model—the polynomial
model—performed. There was only one tuned parameter in this model: the
degree parameter.
autoplot(tune_res_poly)
The plot of model performance as a function of polynomial degree
indicates that the polynomial model performs optimally with
degree=2, and the performance rapidly drops off as
degree increases. The large dropoff in performance is
likely caused by overfitting by higher degree polynomials.
show_best(tune_res_poly, metric = "rmse")
## # A tibble: 5 × 7
## degree .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 rmse standard 2.17 5 0.0355 Preprocessor2_Model1
## 2 1 rmse standard 2.20 5 0.0322 Preprocessor1_Model1
## 3 3 rmse standard 2.40 5 0.124 Preprocessor3_Model1
## 4 4 rmse standard 11.4 5 6.76 Preprocessor4_Model1
## 5 5 rmse standard 18.0 5 12.2 Preprocessor5_Model1
Just as observed above, the polynomial model performed best when the degree parameter was set to 2. With an RMSE of 2.17, the polynomial model with degree=2 performed at roughly the same level as the linear regression and elastic net models.
Now, we’ll take a look at how the first of our non-parametric models performed.
autoplot(tune_res_knn)
As can be seen in the above plot, the \(k\)-nearest neighbors model performed best with a value of \(k\) around 20. Model performance quickly drops off as the neighbors parameter decreases below this value and slowly declines as it increases beyond it.
show_best(tune_res_knn, metric = "rmse")
## # A tibble: 5 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 21 rmse standard 2.20 5 0.0376 Preprocessor1_Model11
## 2 23 rmse standard 2.20 5 0.0384 Preprocessor1_Model12
## 3 19 rmse standard 2.20 5 0.0369 Preprocessor1_Model10
## 4 25 rmse standard 2.20 5 0.0391 Preprocessor1_Model13
## 5 27 rmse standard 2.20 5 0.0397 Preprocessor1_Model14
The best performing \(k\)-nearest neighbors models all have a value for the neighbors parameter around 20, in agreement with our observations above. The RMSE of the best performing models is around 2.20, indicating performance on par with both the linear regression and elastic net models, but slightly behind the best polynomial model.
Next, we’ll look at the results from the pruned tree models.
autoplot(tune_res_ptr)
From the above plot, we can see that the pruned tree performed best
for a value of the cost_complexity parameter between 0.001
and 0.01, with the performance dropping off for larger values.
show_best(tune_res_ptr, metric = "rmse")
## # A tibble: 5 × 7
## cost_complexity .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00464 rmse standard 2.44 5 0.0522 Preprocessor1_Model04
## 2 0.00774 rmse standard 2.47 5 0.0407 Preprocessor1_Model05
## 3 0.00278 rmse standard 2.49 5 0.0565 Preprocessor1_Model03
## 4 0.00167 rmse standard 2.54 5 0.0383 Preprocessor1_Model02
## 5 0.0129 rmse standard 2.56 5 0.0330 Preprocessor1_Model06
The table indicates that the best performing pruned tree had a
cost_complexity parameter equal to 0.00464 with an RMSE of
2.44. This is the worst performance of any of the models we have looked
at so far by a significant margin.
Now, we’ll look at the performance of the random forest models.
autoplot(tune_res_ran_for)
The plot above indicates that the model performance increased for
larger values of mtry and trees and smaller
values of min_n.
show_best(tune_res_ran_for, metric = "rmse")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 50 750 6 rmse standard 2.01 5 0.0201 Preprocessor1_Model0…
## 2 50 1000 6 rmse standard 2.01 5 0.0206 Preprocessor1_Model0…
## 3 42 1500 6 rmse standard 2.01 5 0.0206 Preprocessor1_Model0…
## 4 50 500 2 rmse standard 2.01 5 0.0218 Preprocessor1_Model0…
## 5 50 750 2 rmse standard 2.01 5 0.0210 Preprocessor1_Model0…
The best performing random forest models all had an RMSE around 2.01, making it the best performing model that we have looked at so far, beating out the regression, elastic net, polynomial, and \(k\)-nearest neighbors models by a significant margin.
Finally, we’ll look at the results of last model that we fit: the boosted tree model.
autoplot(tune_res_btr)
From the above plot, we see that performance tends to be best for a learn_rate around 0.03, with mtry around 15, and a larger number of trees. Overall, model performance tends to increase as learn_rate and trees increase, while the effect of mtry varies depending on the other parameters.
show_best(tune_res_btr, metric = "rmse")
## # A tibble: 5 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 16 1500 0.0325 rmse standard 1.91 5 0.0187 Preprocessor1_M…
## 2 16 1250 0.0325 rmse standard 1.91 5 0.0187 Preprocessor1_M…
## 3 16 1000 0.0325 rmse standard 1.91 5 0.0186 Preprocessor1_M…
## 4 16 750 0.0325 rmse standard 1.91 5 0.0189 Preprocessor1_M…
## 5 32 1500 0.01 rmse standard 1.91 5 0.0277 Preprocessor1_M…
From the above table, we see that the best performing boosted trees have an RMSE around 1.91, making the boosted tree our best performing model.
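Before moving on to the final model selection, it can be helpful to gather the best cross-validated RMSE from each tuning result into a single table for a side-by-side comparison. The sketch below was not part of the original analysis, but uses only the tuning objects created above:
# Collect the single best cross-validated RMSE from each tuning result
# and label it with the model name for a side-by-side comparison.
best_rmse <- bind_rows(
  show_best(tune_res_lm, metric = "rmse", n = 1) %>% mutate(model = "linear regression"),
  show_best(tune_res_net, metric = "rmse", n = 1) %>% mutate(model = "elastic net"),
  show_best(tune_res_poly, metric = "rmse", n = 1) %>% mutate(model = "polynomial"),
  show_best(tune_res_knn, metric = "rmse", n = 1) %>% mutate(model = "k-nearest neighbors"),
  show_best(tune_res_ptr, metric = "rmse", n = 1) %>% mutate(model = "pruned tree"),
  show_best(tune_res_ran_for, metric = "rmse", n = 1) %>% mutate(model = "random forest"),
  show_best(tune_res_btr, metric = "rmse", n = 1) %>% mutate(model = "boosted tree")
) %>%
  select(model, mean, std_err) %>%  # "mean" is the cross-validated RMSE
  arrange(mean)
best_rmse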
Based on the analysis in the previous section, we concluded that the boosted tree was the best performing model. Thus, we’ll select the best performing boosted tree as our final model and fit it to the entirety of the training data. The parameters of this model are shown below:
best_model <- select_best(tune_res_btr, metric = "rmse")
best_model
## # A tibble: 1 × 4
## mtry trees learn_rate .config
## <int> <int> <dbl> <chr>
## 1 16 1500 0.0325 Preprocessor1_Model035
final_wflow <- finalize_workflow(btr_wflow, best_model)
final_fit <- fit(final_wflow, depression_train)
Now that we have fit our final model, we will use the testing data to evaluate its performance.
augment(final_fit, new_data = depression_test) %>%
rmse(truth = DEP, .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.92
With an RMSE of 1.92, the final model performed on par with expectations, mirroring the \(k\)-fold cross-validation estimates of the testing error. Based on this result, we can conclude that the model performed moderately well at predicting the county-level prevalence of depression in the US.
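If we also wanted to report other test-set metrics alongside the RMSE, the same augment() output can be passed to a yardstick metric set. A minimal sketch (these values were not computed as part of the original analysis):
# Evaluate the final fit on the test set with RMSE, R-squared, and MAE.
depression_metrics <- metric_set(rmse, rsq, mae)
augment(final_fit, new_data = depression_test) %>%
  depression_metrics(truth = DEP, estimate = .pred)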
To visualize our model performance, we can plot the observed rate of depression versus the predicted rate by the model.
augment(final_fit, new_data = depression_test) %>%
ggplot() +
geom_point(aes(x = .pred, y = DEP)) +
geom_abline(slope = 1, color = "red")
The red line indicates where the points would fall if the model predicted with perfect accuracy (an RMSE of 0). Overall, the points are clustered quite close to the line, once again indicating that the final model did quite well in predicting the county-level prevalence of depression.
We can also make a variable importance plot of the final model, which shows which variables are the most important in predicting the outcome variable.
vip(final_fit)
From the plot, we see that several of the interaction terms were the
most important, including MHI:POS (the median household
income with the percent of the population over the age of 65),
DIS:ACT (the disability rate with the average commute
time), and AHS:DIS (the average household size with the
disability rate). Of the non-interaction terms, the most important are
POP (the total population), PHL (the percent
of the population that is Hispanic or Latino), and PEW (the
percent of population that is white).
After fitting a multitude of models to the data, we found that the boosted tree model performed the best. This is not surprising, as the relationship between the various socioeconomic predictors and the observed rate of depression in each US county is likely quite complex and difficult to capture with parametric models like linear regression and elastic net. Moreover, the final boosted tree model did quite well at predicting the rate of depression in each US county, as evidenced by the RMSE value of 1.92 and the plot of observed versus predicted values.
While the two worst performing models were also non-parametric models, they should not detract from the conclusion that the relationship between the depression rate and the predictors used in this analysis is likely highly complex and nonlinear. The (relatively) poor performance of the \(k\)-nearest neighbors model can be attributed to the large number of predictors used in this analysis, which makes it difficult to identify truly similar neighbors in such a high-dimensional space. The poor performance of the pruned tree (the worst performing model overall) can be attributed to the high-variance nature of single decision trees. Nonetheless, the strong performance of the random forest and boosted tree models indicates the presence of a complex nonlinear relationship between the rate of depression and the predictors used.
One avenue for further exploration beyond this analysis is the interplay between the racial demographics of a region and the rate of depression. In the variable importance plot of the final model, two of the top three non-interaction variables were related to racial demographics (PEW, the percent of the population that is white, and PHL, the percent of the population that is Hispanic or Latino). This suggests that racial demographics play an important role in predicting the rate of depression at a county level, and the relationship between the racial demographics of a region and the rate of depression deserves further investigation.
Lee B, Wang Y, Carlson SA, et al. National, State-Level, and County-Level Prevalence Estimates of Adults Aged ≥18 Years Self-Reporting a Lifetime Diagnosis of Depression — United States, 2020. MMWR Morb Mortal Wkly Rep 2023;72:644–650. DOI: http://dx.doi.org/10.15585/mmwr.mm7224a1
U.S. Census Bureau. (2022). Selected Social Characteristics in the United States. American Community Survey, ACS 5-Year Estimates Data Profiles, Table DP02. Retrieved March 23, 2024, from https://data.census.gov/table/ACSDP5Y2022.DP02?g=010XX00US$0500000&tp=false.
U.S. Census Bureau. (2022). Selected Economic Characteristics. American Community Survey, ACS 5-Year Estimates Data Profiles, Table DP03. Retrieved March 23, 2024, from https://data.census.gov/table/ACSDP5Y2022.DP03?g=010XX00US$0500000.
U.S. Census Bureau. (2022). Selected Housing Characteristics. American Community Survey, ACS 5-Year Estimates Data Profiles, Table DP04. Retrieved March 23, 2024, from https://data.census.gov/table/ACSDP5Y2022.DP04?g=010XX00US$0500000.
U.S. Census Bureau. (2022). ACS Demographic and Housing Estimates. American Community Survey, ACS 5-Year Estimates Data Profiles, Table DP05. Retrieved March 23, 2024, from https://data.census.gov/table/ACSDP5Y2022.DP05?g=010XX00US$0500000.