Depression is a serious mental health condition that affects around 8% of the American population (or 21 million people!) on an annual basis. A wide range of symptoms are associated with the condition, including persistent feelings of sadness, hopelessness, and emptiness, increased irritability, decreased energy, disrupted sleep, and changes in appetite. Those with the condition experience a marked decrease in quality of life, productivity, and happiness, all of which have social and economic consequences. Moreover, in the most severe cases, the condition can lead to death by suicide. All of this makes depression a pressing public health concern at both the individual and societal level.
In the following analysis, I will work with a variety of socioeconomic indicators to build a suite of machine learning models that predict the county-level prevalence of depression in the US. The best performing model, judged by the root mean square error (RMSE), will be selected as the final model. The final model will then be analyzed to gain insight into the relationship between the socioeconomic indicators used in the analysis and the prevalence of depression at a county level.
The data used in this analysis are taken from the following 2023 CDC study—National, State-Level, and County-Level Prevalence Estimates of Adults Aged ≥18 Years Self-Reporting a Lifetime Diagnosis of Depression1—as well as from the US Census Bureau. Specifically, we will use the following Census Bureau tables for our predictors: Selected Social Characteristics in the United States2, Selected Economic Characteristics3, Selected Housing Characteristics4, and ACS Demographic and Housing Estimates5.
We’ll begin by loading the required packages and data into our environment.
library(ggplot2)
library(tidyverse)
library(tidymodels)
library(readxl)
library(corrplot)
library(vip)
library(tidypredict)
library(visdat)
tidymodels_prefer()
depression <- read_excel("cdc_129404_DS1.xlsx", range = cell_rows(c(2,NA))) %>%
rename(FIPS = `CountyFIPS code`,
DEP = `Crude Prevalence Estimate`) %>%
select(FIPS, DEP) %>%
mutate_if(is.character, as.numeric)
dp02 <- read_csv("ACSDP5Y2022.DP02_2024-03-15T154358/ACSDP5Y2022.DP02-Data.csv",
show_col_types = FALSE) %>%
rename(AHS = `DP02_0016E`,
AFS = `DP02_0017E`,
FER = `DP02_0040E`,
BOH = `DP02_0068PE`,
VET = `DP02_0070PE`,
DIS = `DP02_0072PE`,
FDC = `DP02_0084PE`,
FDS = `DP02_0086PE`,
FBP = `DP02_0094PE`,
LOE = `DP02_0114PE`,
COM = `DP02_0153PE`,
INT = `DP02_0154PE`,
FIPS = `GEO_ID`) %>%
select(FIPS, AHS, AFS, FER, BOH, VET, DIS, FDC, FDS, FBP, LOE, COM, INT) %>%
filter(!row_number() %in% c(1)) %>%
mutate(FIPS = substr(FIPS, nchar(FIPS)-4, nchar(FIPS))) %>%
mutate_if(is.character, as.numeric)
dp03 <- read_csv("ACSDP5Y2022.DP03_2024-03-15T231540/ACSDP5Y2022.DP03-Data.csv",
show_col_types = FALSE) %>%
rename(UER = `DP03_0009PE`,
ACT = `DP03_0025E`,
MHI = `DP03_0062E`,
NHC = `DP03_0099PE`,
POV = `DP03_0128PE`,
FIPS = `GEO_ID`) %>%
select(FIPS, UER, ACT, MHI, NHC, POV) %>%
filter(!row_number() %in% c(1)) %>%
mutate(FIPS = substr(FIPS, nchar(FIPS)-4, nchar(FIPS))) %>%
mutate_if(is.character, as.numeric)
dp04 <- read_csv("ACSDP5Y2022.DP04_2024-03-16T144811/ACSDP5Y2022.DP04-Data.csv",
show_col_types = FALSE) %>%
rename(HVR = `DP04_0004E`,
RVR = `DP04_0005E`,
MHV = `DP04_0089E`,
MGR = `DP04_0134E`,
HOR = `DP04_0046PE`,
FIPS = `GEO_ID`) %>%
select(FIPS, HVR, RVR, MHV, MGR, HOR) %>%
filter(!row_number() %in% c(1)) %>%
mutate(FIPS = substr(FIPS, nchar(FIPS)-4, nchar(FIPS))) %>%
mutate_if(is.character, as.numeric)
dp05 <- read_csv("ACSDP5Y2022.DP05_2024-03-16T150451/ACSDP5Y2022.DP05-Data.csv",
show_col_types = FALSE) %>%
rename(POP = `DP05_0001E`,
MEA = `DP05_0018E`,
SER = `DP05_0004E`,
PUE = `DP05_0019PE`,
POS = `DP05_0024PE`,
PEW = `DP05_0066PE`,
PHL = `DP05_0073PE`,
HOU = `DP05_0088E`,
FIPS = `GEO_ID`) %>%
select(FIPS, POP, MEA, SER, PUE, POS, PEW, PHL, HOU) %>%
filter(!row_number() %in% c(1)) %>%
mutate(FIPS = substr(FIPS, nchar(FIPS)-4, nchar(FIPS))) %>%
mutate_if(is.character, as.numeric) %>%
mutate(HOU = HOU/POP)
depression <- left_join(depression, dp02, by = join_by(FIPS)) %>%
left_join(y = dp03, by = join_by(FIPS)) %>%
left_join(y = dp04, by = join_by(FIPS)) %>%
left_join(y = dp05, by = join_by(FIPS))
In 2022, the Census Bureau began reporting data for each of Connecticut’s nine planning regions instead of its eight historical counties, which ceased to function as local government entities in 1960. As such, to obtain county-level data for Connecticut, we must independently aggregate the county-subdivision-level data into county-level metrics using population-weighted averages.
ct_fips <- read_csv("ct-town-county-fips-list.csv",
show_col_types = FALSE) %>%
mutate(FIPS = substr(FIPS, 1, 4))
dp02_ct <- read_csv("ACSDP5Y2022.DP02_2024-03-19T172051/ACSDP5Y2022.DP02-Data.csv",
show_col_types = FALSE) %>%
rename(AHS = `DP02_0016E`,
AFS = `DP02_0017E`,
FER = `DP02_0040E`,
BOH = `DP02_0068PE`,
VET = `DP02_0070PE`,
DIS = `DP02_0072PE`,
FDC = `DP02_0084PE`,
FDS = `DP02_0086PE`,
FBP = `DP02_0094PE`,
LOE = `DP02_0114PE`,
COM = `DP02_0153PE`,
INT = `DP02_0154PE`) %>%
select(GEO_ID, AHS, AFS, FER, BOH, VET, DIS, FDC, FDS, FBP, LOE, COM, INT) %>%
filter(!row_number() %in% c(1)) %>%
mutate(across(!GEO_ID, as.numeric))
dp03_ct <- read_csv("ACSDP5Y2022.DP03_2024-03-19T182634/ACSDP5Y2022.DP03-Data.csv",
show_col_types = FALSE) %>%
rename(UER = `DP03_0009PE`,
ACT = `DP03_0025E`,
MHI = `DP03_0062E`,
NHC = `DP03_0099PE`,
POV = `DP03_0128PE`) %>%
select(GEO_ID, UER, ACT, MHI, NHC, POV) %>%
filter(!row_number() %in% c(1)) %>%
mutate(across(!GEO_ID, as.numeric))
dp04_ct <- read_csv("ACSDP5Y2022.DP04_2024-03-19T183028/ACSDP5Y2022.DP04-Data.csv",
show_col_types = FALSE) %>%
rename(HVR = `DP04_0004E`,
RVR = `DP04_0005E`,
MHV = `DP04_0089E`,
MGR = `DP04_0134E`,
HOR = `DP04_0046PE`) %>%
select(GEO_ID, HVR, RVR, MHV, MGR, HOR) %>%
filter(!row_number() %in% c(1)) %>%
mutate(across(!GEO_ID, as.numeric))
dp05_ct <- read_csv("ACSDP5Y2022.DP05_2024-03-19T183304/ACSDP5Y2022.DP05-Data.csv",
show_col_types = FALSE) %>%
rename(POP = `DP05_0001E`,
MEA = `DP05_0018E`,
SER = `DP05_0004E`,
PUE = `DP05_0019PE`,
POS = `DP05_0024PE`,
PEW = `DP05_0066PE`,
PHL = `DP05_0073PE`,
HOU = `DP05_0088E`) %>%
select(GEO_ID, POP, MEA, SER, PUE, POS, PEW, PHL, HOU) %>%
filter(!row_number() %in% c(1)) %>%
mutate(across(!GEO_ID, as.numeric))
ct_data <- full_join(dp02_ct, dp03_ct, join_by(GEO_ID)) %>%
full_join(dp04_ct, join_by(GEO_ID)) %>%
full_join(dp05_ct, join_by(GEO_ID)) %>%
inner_join(y = ct_fips, by = join_by(GEO_ID)) %>%
select(-c(Town, County, GEO_ID)) %>%
relocate(FIPS)
ct_data <- ct_data %>%
group_by(FIPS) %>%
summarize(across(c("AHS", "AFS", "FER", "BOH", "VET", "DIS", "FDC", "FDS", "FBP", "LOE", "COM", "INT", "UER", "ACT", "MHI", "NHC", "POV", "HVR", "RVR", "MHV", "MGR", "HOR", "MEA", "SER", "PUE", "POS", "PEW", "PHL"), ~ weighted.mean(.x, POP, na.rm = TRUE)),
POP = sum(POP),
HOU = sum(HOU)/sum(POP)) %>%
mutate(FIPS = as.numeric(FIPS)) %>%
relocate(POP, .before = MEA)
depression <- depression %>%
rows_patch(ct_data, by = c("FIPS"))
As a first step to exploring the data, let’s check for any missing data using the vis_miss() function. As can be seen below, only a minimal amount of data is missing. Since the number of missing data points is small, it is easiest to simply drop the incomplete rows from our dataset.
vis_miss(depression)
depression <- depression %>%
drop_na()
Next, let’s have a look at our outcome variable DEP, the
prevalence of depression in each US county.
depression %>%
ggplot() +
geom_histogram(aes(x = DEP), binwidth = 1)
From the histogram, we can see that the prevalence of depression ranges from as low as 10% to as high as 30%, and appears to be roughly normally distributed with a mean around 20%.
Next, let’s have a look at our predictors as well. We’ll start by creating a correlation plot of the outcome and all the predictors.
depression %>%
select(-FIPS) %>%
cor() %>%
corrplot(method = "circle", type = "lower", diag = FALSE)
From the correlation plot, we can pick out several relationships between the predictors and the outcome, as well as among the predictors themselves. For instance, the poverty rate POV and the percent of the population that is disabled DIS are moderately positively correlated with the prevalence of depression DEP, while predictors like the median household income MHI and the percent of the population with a bachelor’s degree or higher BOH are moderately negatively correlated with the prevalence of depression. Additionally, we see certain predictors with strong positive correlations with each other, including but not limited to: the percent of the population born in a foreign country FBP and the percent speaking a language other than English at home LOE; the median home value MHV and median gross rent MGR with the percent of the population with a bachelor’s degree or higher BOH; and the median age MEA with the percent of the population that is 65 or older, POS. None of these relationships is particularly surprising, since each pair should logically be related, though it is reassuring that the data agrees with that intuition. There are also certain predictors with strong negative correlations with each other; again, these are mostly logical and unsurprising, and include the median household income MHI with the poverty rate POV and the percent disabled DIS, as well as the median age MEA with the percent of the population under 18, PUE.
Let’s have a closer look at some of these relationships. First, let’s plot the depression prevalence DEP against the poverty rate POV:
depression %>%
ggplot() +
geom_point(aes(x = POV, y = DEP))
Just as indicated by the correlation plot, we can see a clear
positive correlation between the two variables. On average, as the
poverty rate POV increases, the depression prevalence
DEP also increases. Similarly, let’s again plot the
depression prevalence DEP, this time against the median
household income MHI:
depression %>%
ggplot() +
geom_point(aes(x = MHI, y = DEP))
Once again, the plot is in agreement with what we have observed from
the correlation plot. This time, when the median household income
MHI increases, the prevalence of depression
DEP decreases: a negatively correlated relationship, as
expected.
Finally, let’s have a look at the correlation between the product of two predictors and the prevalence of depression.
n <- dim(depression)[2]
inter_cor_mat <- matrix(nrow = n - 2,
ncol = n - 2,
dimnames = list(names(depression)[3:n],
names(depression)[3:n]))
for (i in 3:n) {
for (j in i:n) {
inter_cor_mat[i-2,j-2] <- cor(depression[,i]*depression[,j], depression$DEP, use = "complete.obs")
}
}
corrplot(inter_cor_mat, type = "upper")
From the above plot, we can pick out many interactions that are positively or negatively correlated with the depression prevalence. The positively correlated interactions include POV:PEW (the poverty rate and the percent of the population that is white), AHS:DIS (average household size and disability rate), and PUE:DIS (percent under eighteen and disability rate). The negatively correlated interactions include MHI:SER (median household income and sex ratio), INT:MHI (percent of the population with internet at home and median household income), and BOH:LOE (percent of the population with a bachelor’s degree or higher and percent of the population that speaks a language other than English at home).
Next, we’ll pick out the interactions that have a correlation with the depression prevalence above a specific magnitude. These will be used in our models to enhance their predictive abilities.
inter_terms <- which(abs(inter_cor_mat) > 0.35, arr.ind = TRUE)
for (i in 1:dim(inter_terms)[1]) {
inter_terms[i,1] <- names(depression)[as.numeric(inter_terms[i,1]) + 2]
inter_terms[i,2] <- names(depression)[as.numeric(inter_terms[i,2]) + 2]
}
inter_terms
## row col
## AHS "AHS" "DIS"
## AFS "AFS" "DIS"
## DIS "DIS" "DIS"
## DIS "DIS" "COM"
## DIS "DIS" "INT"
## DIS "DIS" "ACT"
## AFS "AFS" "MHI"
## COM "COM" "MHI"
## INT "INT" "MHI"
## MHI "MHI" "MHI"
## DIS "DIS" "POV"
## INT "INT" "POV"
## ACT "ACT" "POV"
## DIS "DIS" "HOR"
## POV "POV" "HOR"
## MHI "MHI" "MEA"
## MHI "MHI" "SER"
## DIS "DIS" "PUE"
## MHI "MHI" "POS"
## DIS "DIS" "PEW"
## UER "UER" "PEW"
## POV "POV" "PEW"
## MHI "MHI" "HOU"
Before setting up the machine learning models, the data must first be split into a training and a testing set. Just as the names suggest, the training set is used to train the models, while the testing set is used to evaluate the final model’s performance. While the distinction may seem basic, splitting the data is important because we want to avoid using the same data to both fit and test the models. For this particular analysis, I have chosen to place 70% of the data in the training set and 30% in the testing set, maximizing the amount of information used to train the models while still retaining a sizable amount of data to test with. Additionally, the split is stratified on the outcome variable DEP. This ensures that the distribution of the outcome in both the training and testing sets mirrors the original distribution in the data.
set.seed(2468)
depression_split <- initial_split(depression, prop = 0.7, strata = DEP)
depression_train <- training(depression_split)
depression_test <- testing(depression_split)
To confirm that the data was properly split, we divide the number of rows in both the training and testing sets by the number of rows in our original data.
dim(depression_train)[1]/dim(depression)[1]
## [1] 0.6998723
dim(depression_test)[1]/dim(depression)[1]
## [1] 0.3001277
The training set has approximately 70% of the data and the testing set has around 30% of the data, so we have successfully split the data.
Next, we must specify a recipe for our models to use. For the most part, we can use the same recipe for each model, making sure to remove the FIPS variable, as it is an identifier rather than a predictor. Here, we also specify the interaction terms selected earlier for inclusion in our models. Finally, we must center and scale all the predictors (that is, adjust the values so that each predictor has mean 0 and standard deviation 1). This last step is particularly important for models like the \(k\)-nearest neighbors model, which relies on computing distances between data points.
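For concreteness, centering and scaling (as done by step_normalize()) replaces each value of a predictor \(x\) with its \(z\)-score, \[\begin{equation*} z=\frac{x-\bar{x}}{s}, \end{equation*}\] where \(\bar{x}\) and \(s\) are the predictor’s sample mean and standard deviation, so that every predictor contributes to distance calculations on a comparable scale.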
depression_recipe <- recipe(DEP ~ ., data = depression_train) %>%
step_rm(FIPS) %>%
step_interact(terms = ~ AHS:DIS + AFS:DIS + DIS:DIS + DIS:COM) %>%
step_interact(terms = ~ DIS:INT + DIS:UER + DIS:ACT + AFS:MHI) %>%
step_interact(terms = ~ COM:MHI + INT:MHI + MHI:MHI + DIS:POV) %>%
step_interact(terms = ~ INT:POV + ACT:POV + DIS:HOR + POV:HOR) %>%
step_interact(terms = ~ MHI:SER + DIS:PUE + MHI:POS + DIS:PEW) %>%
step_interact(terms = ~ UER:PEW + POV:PEW + MHI:HOU) %>%
step_normalize(all_predictors())
There is one model for which we must specify a separate recipe: the polynomial model, which requires us to tune the degree of the polynomial being fit. Just as before, we will remove the FIPS variable and normalize all predictors, but this time we add a step that allows us to tune the degree of the polynomial model.
depression_poly_recipe <- recipe(DEP ~ ., data = depression_train) %>%
step_rm(FIPS) %>%
step_poly(all_predictors(), degree = tune()) %>%
step_normalize(all_predictors())
As the final step in preparing to build our models, we will split the training data into five folds using \(k\)-fold cross-validation. This process takes our training data and randomly splits it into \(k\) different groups (or folds). Then, when we train our models, each individual model is fit \(k\) times, each time leaving out one of the folds; the held-out fold is then used to evaluate the model’s performance using metrics like RMSE (root mean square error) for regression problems or accuracy for classification problems. The \(k\) metrics from the \(k\) fits can then be averaged to estimate the test error for that particular model, providing us with a metric with which to compare and select the best performing model.
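In other words, the cross-validated estimate of the test error for a regression model is simply the average of the per-fold errors, \[\begin{equation*} \text{CV}_{(k)}=\frac{1}{k}\sum_{i=1}^{k}\text{RMSE}_i, \end{equation*}\] where \(\text{RMSE}_i\) is the root mean square error computed on the \(i\)th held-out fold.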
For this particular analysis, we will use \(k=5\), so our training data will be split
into 5 different groups, once again stratifying on the outcome variable,
DEP.
set.seed(2468)
depression_folds <- vfold_cv(depression_train, v = 5, strata = DEP)
We are now ready to start building and fitting our models. Machine learning models can be broadly split into two categories: parametric and non-parametric models. In the coming steps, we will build and fit several of each type.
The general process for building and fitting the models will be quite similar across all models. In general, we will adhere to the following steps:
1. Specify the model and its parameters (and whether they are to be tuned), the particular engine to be used, and whether the model will be used for regression or classification.
2. Set up a workflow for the model and add the recipe and model to the workflow.
3. If there are model parameters to be tuned, set up a tuning grid by specifying the range of values to be considered and the number of levels (i.e., how many different values) for each parameter.
4. Tune the model with the workflow using the \(k\)-folds obtained earlier and the tuning grid.
Parametric models are those that assume a specific functional form for the outcome as a function of the predictors. These models require less data to fit and are generally easier to interpret. However, since they must adhere to a specific functional form, they are less flexible and generally have a harder time modeling more complex data.
The first parametric model we will build and fit is the classic linear regression model, which assumes that the outcome \(Y\) is a linear function of the predictors \(x_1,\dots,x_n\): \[\begin{equation*} E\left[Y\right]=\beta_0+\beta_1x_1+\dots+\beta_nx_n. \end{equation*}\] Here the goal is to determine the parameters \(\beta_0,\dots,\beta_n\) that best predict the outcome \(Y\) according to this functional form.
lm_model <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
lm_wflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(depression_recipe)
tune_res_lm <- lm_wflow %>%
tune_grid(resamples = depression_folds)
The next parametric model we will fit is the elastic net model, which is similar to the linear regression model, except that a penalty term \[\begin{equation*} \lambda\left[\left(1-\alpha\right)\left|\left|\beta\right|\right|_2^2/2+\alpha\left|\left|\beta\right|\right|_1\right] \end{equation*}\] is added to the residual sum of squares. This introduces some bias into our estimates of \(\beta\), but also reduces the variance, thereby possibly reducing the mean square error and improving model performance. However, the elastic net requires us to tune two parameters: the mixture \(\alpha\), which controls how much weight we give to the lasso penalty term relative to the ridge penalty term, and the penalty \(\lambda\).
net_model <- linear_reg(mixture = tune(),
penalty = tune()) %>%
set_engine("glmnet") %>%
set_mode("regression")
net_wflow <- workflow() %>%
add_model(net_model) %>%
add_recipe(depression_recipe)
net_grid <- grid_regular(penalty(),
mixture(),
levels = 20)
tune_res_net <- net_wflow %>%
tune_grid(resamples = depression_folds,
grid = net_grid)
The third and final parametric model we will fit is the polynomial model. Just as with the elastic net, the polynomial model is similar to the linear regression model (in fact, it technically is a linear regression model), except that we now include predictor terms of degree greater than 1 in our functional form for the outcome. This means that we have one parameter to tune in this model: the degree \(n\) of the polynomial that we fit. This concludes the parametric models that we will fit.
poly_wflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(depression_poly_recipe)
poly_grid <- grid_regular(degree(range = c(1,5)),
levels = 5)
tune_res_poly <- poly_wflow %>%
tune_grid(resamples = depression_folds,
grid = poly_grid)
Next, we will build and fit several non-parametric models. Unlike their parametric counterparts, non-parametric models do not specify a functional form for the outcome, and thus exhibit greater flexibility. This flexibility does not come without cost however, as it does take more data to train these models, and they can be prone to overfitting. Despite these drawbacks, non-parametric models can perform just as well or even better than their parametric counterparts, especially when trying to model complex relationships.
The first non-parametric model we will fit is the \(k\)-nearest neighbors model. Just as the name implies, this model predicts the outcome for a particular point by looking at its \(k\) nearest neighbors. In a regression setting, this means averaging the outcomes of the \(k\) nearest neighbors of a particular point and using that average as the prediction. For this model, we have one parameter to tune: the number of neighbors \(k\) to consider.
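In its simplest, unweighted form (the kknn engine used below also supports distance-weighted variants), the prediction at a point \(x_0\) is \[\begin{equation*} \hat{y}\left(x_0\right)=\frac{1}{k}\sum_{x_i\in N_k\left(x_0\right)}y_i, \end{equation*}\] where \(N_k\left(x_0\right)\) denotes the set of the \(k\) training points closest to \(x_0\).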
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
knn_wflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(depression_recipe)
knn_grid <- grid_regular(neighbors(range = c(1,50)),
levels = 25)
tune_res_knn <- knn_wflow %>%
tune_grid(resamples = depression_folds,
grid = knn_grid)
The next non-parametric model we will build and fit is the pruned decision tree model. Put simply, a decision tree splits the predictor space into \(J\) distinct regions \(R_1,\dots,R_J\), and assigns to all points in a region \(R_j\) the average of the observed outcomes in \(R_j\). A pruned decision tree utilizes a penalty term (similar to the elastic net) \(\alpha\left|T\right|\) to prevent more complex trees from overfitting. So we have one parameter to tune for this model: the cost_complexity \(\alpha\).
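Concretely, cost-complexity pruning selects the subtree \(T\) that minimizes \[\begin{equation*} \sum_{m=1}^{\left|T\right|}\sum_{x_i\in R_m}\left(y_i-\hat{y}_{R_m}\right)^2+\alpha\left|T\right|, \end{equation*}\] where \(\left|T\right|\) is the number of terminal regions of the tree and \(\hat{y}_{R_m}\) is the mean outcome in region \(R_m\); larger values of \(\alpha\) therefore favor smaller trees.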
ptr_model <- decision_tree(cost_complexity = tune()) %>%
set_engine("rpart") %>%
set_mode("regression")
ptr_wflow <- workflow() %>%
add_model(ptr_model) %>%
add_recipe(depression_recipe)
ptr_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
tune_res_ptr <- ptr_wflow %>%
tune_grid(resamples = depression_folds,
grid = ptr_grid)
Next, we’ll have another look at a tree-based model: the random forest model. Random forest models build an ensemble of decision trees and average the results from each tree to come up with a prediction for a particular point, thereby reducing the variance of the model. There is one caveat though: rather than considering every single predictor as a candidate each time the predictor space is split, a random subset of the predictors is chosen as candidates (hence the name, random forest!). To fit the random forest model, then, there are three parameters that we must tune: the number of predictors to consider at each split mtry, the total number of trees trees, and the minimum number of observations needed to split a node, min_n.
Note that random forests tend to take significant time to train, so I will be loading a previously saved fit for this analysis.
ran_for_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
ran_for_wflow <- workflow() %>%
add_recipe(depression_recipe) %>%
add_model(ran_for_model)
ran_for_grid <- grid_regular(mtry(range = c(20, 50)),
trees(range = c(500, 1500)),
min_n(range = c(2,20)),
levels = 5)
# tune_res_ran_for <- ran_for_wflow %>%
# tune_grid(resamples = depression_folds,
# grid = ran_for_grid)
# save(tune_res_ran_for, file = "tune_res_ran_for.rda")
load("tune_res_ran_for.rda")
Our final non-parametric model is yet another tree-based model: the boosted tree model. The model works by sequentially fitting trees to the residuals (the differences between the observed and estimated values) and updating the fitted model with a shrunken version of each newly fitted tree. We again have multiple parameters to tune for the boosted tree model: the number of predictors to consider at each split mtry, the total number of trees trees, and the shrinkage parameter learn_rate.
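To make the sequential fitting explicit: in the textbook description of boosting for regression, with learning rate \(\lambda\) (learn_rate), each new tree \(\hat{f}^b\) is fit to the current residuals, and the model and residuals are updated as \[\begin{equation*} \hat{f}\left(x\right)\leftarrow\hat{f}\left(x\right)+\lambda\hat{f}^b\left(x\right),\qquad r_i\leftarrow r_i-\lambda\hat{f}^b\left(x_i\right). \end{equation*}\] Smaller values of \(\lambda\) slow the learning down, typically requiring more trees but reducing the risk of overfitting. (The xgboost engine used below implements gradient boosting with further refinements, but this is the basic picture.)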
Note that boosted trees also tend to take significant time to train, so I will be loading a previously saved fit for this analysis as well.
btr_model <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
btr_wflow <- workflow() %>%
add_recipe(depression_recipe) %>%
add_model(btr_model)
btr_grid <- grid_regular(mtry(range = c(8,40)),
trees(range = c(500,1500)),
learn_rate(range = c(0.01,0.1), trans = identity_trans()),
levels = 5)
# tune_res_btr <- btr_wflow %>%
# tune_grid(resamples = depression_folds,
# grid = btr_grid)
# save(tune_res_btr, file = "tune_res_btr.rda")
load("tune_res_btr.rda")
Now that we’ve built and fit our machine learning models, let’s have a look at how they performed and the optimal parameter values for those that required tuning.
There were no parameters that required tuning for the linear regression model, so we’ll just have a look at the metrics of the fit.
collect_metrics(tune_res_lm)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 2.19 5 0.0361 Preprocessor1_Model1
## 2 rsq standard 0.517 5 0.0114 Preprocessor1_Model1
best_lm <- select_best(tune_res_lm, metric = "rmse")
From the table above, we see that the RMSE for the fit is 2.19 and the \(R^2\) is 0.517. These values indicate that the model performed moderately well, though not spectacularly.
Next, let’s have a look at the results from the elastic net model.
We’ll first have a look at how the tuned parameters mixture
and penalty affected the model’s performance.
autoplot(tune_res_net)
From the plot, we see that the model performance is relatively constant as the amount of regularization (the penalty parameter) increases, until around 0.01, where the performance quickly drops off. We also see that lower values of the mixture parameter tend to perform better as the amount of regularization increases. Now, let’s have a look at the best performing sets of parameters and their associated metrics:
show_best(tune_res_net, metric = "rmse")
## # A tibble: 5 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00785 0.0526 rmse standard 2.18 5 0.0317 Preprocessor1_Model036
## 2 0.00785 0.158 rmse standard 2.18 5 0.0312 Preprocessor1_Model076
## 3 0.00785 0.105 rmse standard 2.18 5 0.0316 Preprocessor1_Model056
## 4 0.00785 0.263 rmse standard 2.18 5 0.0306 Preprocessor1_Model116
## 5 0.0264 0.0526 rmse standard 2.18 5 0.0309 Preprocessor1_Model037
best_net <- select_best(tune_res_net, metric = "rmse")
From the table, we can observe that the best performing models have smaller values of both the penalty and mixture parameters, in accordance with our previous observations. The RMSE values for these models are all around 2.18, indicating performance roughly on par with the linear regression model.
Next, let’s see how the last parametric model—the polynomial
model—performed. There was only one tuned parameter in this model: the
degree parameter.
autoplot(tune_res_poly)
The plot of model performance as a function of polynomial degree
indicates that the polynomial model performs optimally with
degree=2, and the performance rapidly drops off as
degree increases. The large dropoff in performance is
likely caused by overfitting by higher degree polynomials.
show_best(tune_res_poly, metric = "rmse")
## # A tibble: 5 × 7
## degree .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 rmse standard 2.17 5 0.0355 Preprocessor2_Model1
## 2 1 rmse standard 2.20 5 0.0322 Preprocessor1_Model1
## 3 3 rmse standard 2.40 5 0.124 Preprocessor3_Model1
## 4 4 rmse standard 11.4 5 6.76 Preprocessor4_Model1
## 5 5 rmse standard 18.0 5 12.2 Preprocessor5_Model1
Just as observed above, the polynomial model performed best when the degree parameter was set to 2. With an RMSE of 2.17, the polynomial model with degree=2 performed at roughly the same level as the linear regression and elastic net models.
Now, we’ll take a look at how the first of our non-parametric models performed.
autoplot(tune_res_knn)
As can be seen in the above plot, the \(k\)-nearest neighbors model performed best with a value of \(k\) around 20. Model performance quickly drops off as the neighbors parameter decreases below this value and slowly declines as it increases beyond it.
show_best(tune_res_knn, metric = "rmse")
## # A tibble: 5 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 21 rmse standard 2.20 5 0.0376 Preprocessor1_Model11
## 2 23 rmse standard 2.20 5 0.0384 Preprocessor1_Model12
## 3 19 rmse standard 2.20 5 0.0369 Preprocessor1_Model10
## 4 25 rmse standard 2.20 5 0.0391 Preprocessor1_Model13
## 5 27 rmse standard 2.20 5 0.0397 Preprocessor1_Model14
The best performing \(k\)-nearest neighbors models all have a value for the neighbors parameter around 20, in agreement with our observations above. The RMSE of the best performing models is around 2.20, indicating performance on par with both the linear regression and elastic net models, but slightly behind the best polynomial model.
Next, we’ll look at the results from the pruned tree models.
autoplot(tune_res_ptr)
From the above plot, we can see that the pruned tree performed best
for a value of the cost_complexity parameter between 0.001
and 0.01, with the performance dropping off for larger values.
show_best(tune_res_ptr, metric = "rmse")
## # A tibble: 5 × 7
## cost_complexity .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00464 rmse standard 2.44 5 0.0522 Preprocessor1_Model04
## 2 0.00774 rmse standard 2.47 5 0.0407 Preprocessor1_Model05
## 3 0.00278 rmse standard 2.49 5 0.0565 Preprocessor1_Model03
## 4 0.00167 rmse standard 2.54 5 0.0383 Preprocessor1_Model02
## 5 0.0129 rmse standard 2.56 5 0.0330 Preprocessor1_Model06
The table indicates that the best performing pruned tree had a
cost_complexity parameter equal to 0.00464 with an RMSE of
2.44. This is the worst performance of any of the models we have looked
at so far by a significant margin.
Now, we’ll look at the performance of the random forest models.
autoplot(tune_res_ran_for)
The plot above indicates that the model performance increased for
larger values of mtry and trees and smaller
values of min_n.
show_best(tune_res_ran_for, metric = "rmse")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 50 750 6 rmse standard 2.01 5 0.0201 Preprocessor1_Model0…
## 2 50 1000 6 rmse standard 2.01 5 0.0206 Preprocessor1_Model0…
## 3 42 1500 6 rmse standard 2.01 5 0.0206 Preprocessor1_Model0…
## 4 50 500 2 rmse standard 2.01 5 0.0218 Preprocessor1_Model0…
## 5 50 750 2 rmse standard 2.01 5 0.0210 Preprocessor1_Model0…
The best performing random forest models all had an RMSE around 2.01, making it the best performing model that we have looked at so far, beating out the regression, elastic net, polynomial, and \(k\)-nearest neighbors models by a significant margin.
Finally, we’ll look at the results of last model that we fit: the boosted tree model.
autoplot(tune_res_btr)
From the above plot, we see that performance tends to be best for a learn_rate around 0.03, with mtry around 15, and a larger number of trees. Overall, model performance tends to increase as learn_rate and trees increase, while the effect of mtry varies depending on the other parameters.
show_best(tune_res_btr, metric = "rmse")
## # A tibble: 5 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 16 1500 0.0325 rmse standard 1.91 5 0.0187 Preprocessor1_M…
## 2 16 1250 0.0325 rmse standard 1.91 5 0.0187 Preprocessor1_M…
## 3 16 1000 0.0325 rmse standard 1.91 5 0.0186 Preprocessor1_M…
## 4 16 750 0.0325 rmse standard 1.91 5 0.0189 Preprocessor1_M…
## 5 32 1500 0.01 rmse standard 1.91 5 0.0277 Preprocessor1_M…
From the above table, we see that the best performing boosted trees have an RMSE around 1.91, making the boosted tree our best performing model.
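Before moving on to the final model selection, it can be helpful to gather the best cross-validated RMSE from each tuning result into a single table for a side-by-side comparison. The sketch below was not part of the original analysis, but uses only the tuning objects created above:
# Collect the single best cross-validated RMSE from each tuning result
# and label it with the model name for a side-by-side comparison.
best_rmse <- bind_rows(
  show_best(tune_res_lm, metric = "rmse", n = 1) %>% mutate(model = "linear regression"),
  show_best(tune_res_net, metric = "rmse", n = 1) %>% mutate(model = "elastic net"),
  show_best(tune_res_poly, metric = "rmse", n = 1) %>% mutate(model = "polynomial"),
  show_best(tune_res_knn, metric = "rmse", n = 1) %>% mutate(model = "k-nearest neighbors"),
  show_best(tune_res_ptr, metric = "rmse", n = 1) %>% mutate(model = "pruned tree"),
  show_best(tune_res_ran_for, metric = "rmse", n = 1) %>% mutate(model = "random forest"),
  show_best(tune_res_btr, metric = "rmse", n = 1) %>% mutate(model = "boosted tree")
) %>%
  select(model, mean, std_err) %>%  # "mean" is the cross-validated RMSE
  arrange(mean)
best_rmse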
Based on the analysis in the previous section, we concluded that the boosted tree was the best performing model. Thus, we’ll select the best performing boosted tree as our final model and fit it to the entirety of the training data. The parameters of this model are shown below:
best_model <- select_best(tune_res_btr, metric = "rmse")
best_model
## # A tibble: 1 × 4
## mtry trees learn_rate .config
## <int> <int> <dbl> <chr>
## 1 16 1500 0.0325 Preprocessor1_Model035
final_wflow <- finalize_workflow(btr_wflow, best_model)
final_fit <- fit(final_wflow, depression_train)
Now that we have fit our final model, we will use the testing data to evaluate its performance.
augment(final_fit, new_data = depression_test) %>%
rmse(truth = DEP, .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.92
With an RMSE of 1.92, the final model performed on par with expectations, mirroring the \(k\)-fold cross-validation estimates of the testing error. Based on this result, we can conclude that the model performed moderately well at predicting the county-level prevalence of depression in the US.
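If we also wanted to report other test-set metrics alongside the RMSE, the same augment() output can be passed to a yardstick metric set. A minimal sketch (these values were not computed as part of the original analysis):
# Evaluate the final fit on the test set with RMSE, R-squared, and MAE.
depression_metrics <- metric_set(rmse, rsq, mae)
augment(final_fit, new_data = depression_test) %>%
  depression_metrics(truth = DEP, estimate = .pred)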
To visualize our model performance, we can plot the observed rate of depression versus the predicted rate by the model.
augment(final_fit, new_data = depression_test) %>%
ggplot() +
geom_point(aes(x = .pred, y = DEP)) +
geom_abline(slope = 1, color = "red")
The red line indicates where the points would fall if the model predicted with perfect accuracy (an RMSE of 0). Overall, the points are clustered quite close to the line, once again indicating that the final model did quite well in predicting the county-level prevalence of depression.
We can also make a variable importance plot of the final model, which shows which variables are the most important in predicting the outcome variable.
vip(final_fit)
From the plot, we see that several of the interaction terms were the
most important, including MHI:POS (the median household
income with the percent of the population over the age of 65),
DIS:ACT (the disability rate with the average commute
time), and AHS:DIS (the average household size with the
disability rate). Of the non-interaction terms, the most important are
POP (the total population), PHL (the percent
of the population that is Hispanic or Latino), and PEW (the
percent of population that is white).
After fitting a multitude of models to the data, we found that the boosted tree model performed the best. This is not surprising, as the relationship between the various socioeconomic predictors and the observed rate of depression in each US county is likely quite complex and difficult to capture with parametric models like linear regression and elastic net. Moreover, the final boosted tree model did quite well at predicting the rate of depression in each US county, as evidenced by the RMSE value of 1.92 and the plot of observed versus predicted values.
While the two worst performing models were also non-parametric models, they should not detract from the conclusion that the relationship between the depression rate and the predictors used in this analysis is likely highly complex and nonlinear. The (relatively) poor performance of the \(k\)-nearest neighbors model can be attributed to the large number of predictors used in this analysis, which makes it difficult to identify truly similar neighbors in such a high-dimensional space. The poor performance of the pruned tree (the worst performing model overall) can be attributed to the high-variance nature of single decision trees. Nonetheless, the strong performance of the random forest and boosted tree models indicates the presence of a complex nonlinear relationship between the rate of depression and the predictors used.
One avenue for further exploration beyond this analysis is the interplay between the racial demographics of a region and the rate of depression. In the variable importance plot of the final model, two of the top three non-interaction variables were related to racial demographics (PEW, the percent of the population that is white, and PHL, the percent of the population that is Hispanic or Latino). This suggests that racial demographics play an important role in predicting the rate of depression at a county level, and the relationship between the racial demographics of a region and the rate of depression deserves further investigation.
Lee B, Wang Y, Carlson SA, et al. National, State-Level, and County-Level Prevalence Estimates of Adults Aged ≥18 Years Self-Reporting a Lifetime Diagnosis of Depression — United States, 2020. MMWR Morb Mortal Wkly Rep 2023;72:644–650. DOI: http://dx.doi.org/10.15585/mmwr.mm7224a1
U.S. Census Bureau. (2022). Selected Social Characteristics in the United States. American Community Survey, ACS 5-Year Estimates Data Profiles, Table DP02. Retrieved March 23, 2024, from https://data.census.gov/table/ACSDP5Y2022.DP02?g=010XX00US$0500000&tp=false.
U.S. Census Bureau. (2022). Selected Economic Characteristics. American Community Survey, ACS 5-Year Estimates Data Profiles, Table DP03. Retrieved March 23, 2024, from https://data.census.gov/table/ACSDP5Y2022.DP03?g=010XX00US$0500000.
U.S. Census Bureau. (2022). Selected Housing Characteristics. American Community Survey, ACS 5-Year Estimates Data Profiles, Table DP04. Retrieved March 23, 2024, from https://data.census.gov/table/ACSDP5Y2022.DP04?g=010XX00US$0500000.
U.S. Census Bureau. (2022). ACS Demographic and Housing Estimates. American Community Survey, ACS 5-Year Estimates Data Profiles, Table DP05. Retrieved March 23, 2024, from https://data.census.gov/table/ACSDP5Y2022.DP05?g=010XX00US$0500000.