#The following code reads in the data and reduces it to only those who
#participated in the surveys and those who responded to at least one survey question
full_data <- read_tsv("NDACAN tab delimited original data")
all_participation <- full_data[full_data$OutcmRpt %in% 1,]
all_participation <- all_participation[all_participation$Responded %in% 1,]
#Then, I replaced "Wave" information (which period of surveying this particular survey was given) to the corresponding age for which the survey was given
by_case <- all_participation
by_case$Wave[by_case$Wave %in% 1] <- 17
by_case$Wave[by_case$Wave %in% 2] <- 19
by_case$Wave[by_case$Wave %in% 3] <- 21
#Creating NA values (instead of 77s and 99s)
by_case_clean <- by_case |>
mutate(across(c(8:16, 19:41, 49, 50), ~ na_if(., 77)))
by_case_clean <- by_case_clean |>
mutate(across(c(8:16, 19:41, 49, 50), ~ na_if(., 99)))
#Converting all variables to factor typed variables in R
by_case_factor <- by_case_clean |>
mutate(
across(
c(`Wave`, `State`, 8:16, 19:50),
factor
)
)
#creating AnyPubAid column, discussed above
by_case_factor2 <- by_case_factor |>
mutate(AnyPubAid = if_else(if_any(c(23:28), ~ . == 1), 1, 0))
by_case_factor2$AnyPubAid <- as.factor(by_case_factor2$AnyPubAid)Predicting Transitioning Foster Teen’s Use of External Financial Assistance
Introduction
Inspired by a research paper for a public policy class during the spring semester, I wanted to understand more about Foster Care outcomes on a quantitative level. I sourced data from the National Data Archive on Child Abuse and Neglect’s “National Youth in Transition Database,” and this project uses a file of survey responses from youth who reached the age of 17 and began to transition out of foster care in the government fiscal year 2017. The file also includes follow-up surveys of the same teens in 2019 and 2021. Using this data and statistical modeling techniques learned in my course, I attempt to answer the research questions: What are some predictors of transitioning foster youth’s reliance on external financial support— specifically, reliance on public aid? And, how can these variables help us predict future need for aid?
Modeling this data proved to be an exciting challenge. The file I used has seven categories of public aid and other forms of financial aid: Social Security Disability Insurance (or Supplemental Security Income), Education Aid, Public Financial Assistance (cash welfare payments for basic needs), Public Food Assistance (food stamps of any kind), Public Housing Assistance, Other Financial Support (from a family member, legal settlement, or child support), and Medicaid. In order to assess all of these factors simultaneously, I created a binary variable (AnyPubAid) that assigned 1 to a case where the youth was receiving any one of the seven categories of aid. When medicaid was included in the overall count of individuals receiving public aid, 85.6% of the youth in the surveys are marked as receiving public aid. However, when Medicaid is not included, only about 35% of youth in the surveys are marked for receipt of public aid. I decided to omit Medicaid from calculations due to its high prevalence, but because the proportions were still imbalanced, I had to decide how to increase the accuracy of my model when predicting the minority class (receipt of aid). For my supervised learning methods, I decided to employ Synthetic Minority Over-sampling Technique (SMOTE), which uses k-nearest-neighbors techniques to create new minority class data points to balance the class proportions. In other words—and in this context—SMOTE took information about the cases where individuals were receiving aid and used it to create more cases to have a more equal proportion of aid recipients to non-aid recipients. As you will see below, this greatly increased the accuracy of my model.
Lastly, even after reducing the data to those that submitted responses to at least 1 survey question, there were many rows that had large amounts of NA values (coded in the data file as “77” or “99” meaning the value is unknown or the survey respondent declined to answer) and some services that the survey asked about that were not applicable to the individual being surveyed (coded in data file as “88”). This created challenges around both reducing the data and imputing new values, as patterns showed that if there were some unknown values or some inapplicable variables, there was likely to be others. To remedy this, I decided to remove rows with unknown values in the public aid categories, while keeping any other unknown values. This way, I could still see if there were patterns of receipt of public aid among individuals who were not applicable for services or who declined to respond to demographic and lifestyle questions, but NA would not be a category that could be predicted.
With all of these data use choices in mind, the following models, code, and visualizations helped answer my research question.
table(by_case_factor2$AnyPubAid)
0 1
21256 11947
Hierarchical Clustering
The following code clusters the data together by case. One smaller data set is created to visualize clusters on a dendrogram, and another larger data set is created to assign cluster values to more data points and visualize cluster characteristics on a tile graph
set.seed(5214)
hc_subset_small <- by_case_factor2 |>
select(c(State, 8:16, 19:22, 29:36, 38:41, AnyPubAid)) |>
slice_sample(n = 100)
gower_dist_small <- daisy(hc_subset_small %>% select(-AnyPubAid), metric = "gower")
hc_small <- hclust(gower_dist_small)
small_dendrogram <- hc_small |>
as.dendrogram()
my_colors2 <- ifelse(hc_subset_small$AnyPubAid == 1,
'deepskyblue2',
'darkgoldenrod4')
small_dendrogram |>
color_branches(col = my_colors2[order.dendrogram(small_dendrogram)]) |>
color_labels(col = my_colors2[order.dendrogram(small_dendrogram)]) |>
plot()This dendrogram does not show particularly clear clustering of aid vs no-aid cases. Still, I’d like to see if the clusters pick up on any characteristics of those with and without aid.
set.seed(7432)
with_all_aid <- by_case_factor2 |>
slice_sample(n = 10000)
hc_subset <- with_all_aid |>
select(c(`State`, 8:16, 19:22, 29:36, 38:41, `AnyPubAid`))
gower_dist <- daisy(hc_subset %>% select(-AnyPubAid), metric = "gower")
hc <- hclust(gower_dist)
my_clusters <- cutree(hc, k=10)
get_mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
cluster_summary_stats <- hc_subset %>%
mutate(cluster = my_clusters) %>%
mutate(SocSecrty = with_all_aid$SocSecrty) %>%
mutate(PubFinAs = with_all_aid$PubFinAs) %>%
mutate(PubFoodAs = with_all_aid$PubFoodAs) %>%
mutate(PubHousAs = with_all_aid$PubHousAs) %>%
mutate(OthrFinAs = with_all_aid$OthrFinAs) %>%
mutate(Medicaid = with_all_aid$Medicaid) %>%
group_by(cluster) %>%
summarize(across(everything(), get_mode))cluster_summary_stats |>
select(-`State`) |>
pivot_longer(-cluster,
names_to = "predictor",
values_to = "value") |>
group_by(predictor) |>
ggplot() +
geom_tile(aes(x = factor(cluster), y = predictor, fill = value), color = "black") +
labs(title = "Heirarchical Clustering: Cluster Characteristics",
fill = "Mode Characteristic") +
scale_fill_manual(values = c(
"0" = "darkgoldenrod4", #no
"1" = "deepskyblue2", #yes
"2" = "lightpink1", #declined to respond OR for sex: female
"88" = "cornsilk2", #does not apply
"7" = "darkgoldenrod4", #only for higher ed column, 1 for high school/GED, 7 for none
"3" = "cornsilk2" #only for insurance related questions, 3 = do not know
),
labels = c(
"0" = "Brown: No OR",
"1" = str_wrap("Blue: Yes OR for Higher Education Variable: Finished High School", width = 22),
"2" = str_wrap("Pink: Declined to Respond OR for Sex Variable: Female", width = 20),
"88" = "Cream: Does not apply",
"7" = str_wrap("For Higher Education Variable: Did not finish High School", width = 20),
"3" = str_wrap("For Insurance-related Variables: Do not know", width = 20),
"NA" = "Gray: NA"
)) +
theme_classic()In this visualization, we can see that there are two of ten clusters that have a mode of “Yes” for “AnyPubAid.” Other notable qualities in both of these clusters are white, female, significant health insurance, high school diplomas, connection to adults, and children out of marriage. However, one can see that the columns for which AnyPubAid is based on all have modes of “No.” This is discouraging, and is not easily fixed by removing NAs or by making more or less clusters—techniques that I tried and discarded after no discernible differences. Therefore, this hierarchical clustering algorithm should be taken lightly and other findings in this project can help answer the research question more accurately than this.
Random Forests
As my first supervised learning technique, I chose to run some random forest models. I used the function ranger due to the large amount of variables in my data set. As discussed before, I also used SMOTE to even out the class proportions. This allowed for a more accurate model (seen below) than the one developed under the observed proportions in the data (not included).
by_case_predictors <- by_case |>
mutate(AnyPubAid = as.factor(if_else(if_any(c(23:28), ~ . == 1), 1, 0))) |>
select(c(3, 8:16, 19:22, 29:36, 38:41, 52))
set.seed(3950)
by_case_predictors_numeric <- by_case_predictors |>
select(-AnyPubAid)
by_case_predictors_numeric[] <- lapply(by_case_predictors_numeric, function(x) as.numeric(x))
by_case_predictors_numeric <- by_case_predictors_numeric |>
mutate(AnyPubAid = by_case_predictors$AnyPubAid)
ranger_smote_recipe <- recipe(AnyPubAid ~ ., data = by_case_predictors_numeric) |>
step_smote(AnyPubAid, over_ratio = 1)
ranger_prep_recipe <- prep(ranger_smote_recipe)
ranger_smote_training_data <- bake(ranger_prep_recipe, new_data = NULL)
smote_ranger_forest <- ranger(AnyPubAid ~ .,
data = ranger_smote_training_data,
importance = "impurity")
smote_ranger_forestsmote_ranger_forest$confusion.matrix predicted
true 0 1
0 18581 3406
1 6293 15694
This confusion matrix shows an 85% accuracy rate for predicting that an individual is not on aid, and a 71% accuracy rate for predicting that an individual is on aid. This is much better compared to the unbalanced accuracy rates of models trained on non-synthetic-minority-oversampled data.
Now, I’d like to see what variables this model used to achieve a high accuracy!
sort(smote_ranger_forest$variable.importance) HawaiiPI Asian RaceDcln RaceUnkn AmIAKN PrescripIn
42.68297 74.57674 79.86565 110.41911 144.98506 160.16192
MedicalIn MentlHlthIn Marriage Children BlkAfrAm White
170.88541 180.75853 215.33607 215.91086 354.82399 364.33289
CnctAdult OthrHlthIn HisOrgin Homeless CurrFTE CurrPTE
370.81451 373.96960 471.70611 525.43251 526.73435 528.73043
SubAbuse EmplySklls Incarc Sex CurrenRoll OutcmFCS
574.58717 619.97632 730.96859 877.51696 1008.22507 1361.20415
HighEdCert State
1427.16323 2255.30446
From this list, we see that the top 11 most important variables are as follows:
- State
- Education Level (HighEdCert)
- Foster Care Status (at the time of the particular outcome survey - OutcmFCS)
- Current Enrollment (in high school or other education program)
- Sex
- Incarceration Status (ever been or currently incarcerated)
- Employment Skills (participation in one or more job training programs)
- Substance Abuse Status (ever been or currently referred to a substance abuse counselor/program)
- Current Part Time Employment
- Current Full Time Employment
- Homelessness Status (ever been or currently experiencing instability in adequate and regular housing)
Note: I include variable 11 because there is a very slight difference in importance score between variable 10 and 11, and a much greater difference in importance score between variable 11 and 12.
In order to visualize the directionality of these variables relationship to receipt of financial assistance, I created a series of segmented bar charts.
important_variables <- c("State", "HighEdCert", "OutcmFCS", "CurrenRoll", "Sex", "Incarc", "EmplySklls", "SubAbuse", "CurrPTE", "CurrFTE", "Homeless")
by_case_predictors_for_graphs <- by_case_predictors |>
mutate(Race = by_case_factor2$Race)
by_case_predictors_for_graphs$State <- factor(by_case_predictors_for_graphs$State, levels = c(1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 53, 54, 55, 56, 72), labels = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Lousiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana" , "Nebraska" , "Nevada" , "New Hampsire" , "New Jersey" , "New Mexico" , "New York", "North Carolina" , "North Dakota" , "Ohio" , "Oklahoma" , "Oregon" , "Pennsylvania" , "Rhode Island" , "South Carolina" , "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont" , "Virginia" , "Washington" , "West Virginia" , "Wisconsin" , "Wyoming" , "Puerto Rico"))
by_case_predictors_for_graphs$Sex <- factor(by_case_predictors_for_graphs$Sex, levels = c(1, 2), labels = c("Male" , "Female"))
by_case_predictors_for_graphs$CurrenRoll <- factor(by_case_predictors_for_graphs$CurrenRoll, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))
by_case_predictors_for_graphs$OutcmFCS <- factor(by_case_predictors_for_graphs$OutcmFCS, levels = c(0, 1), labels = c("Not Currently in Foster Care" , "Currently in Foster Care"))
by_case_predictors_for_graphs$EmplySklls <- factor(by_case_predictors_for_graphs$EmplySklls, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))
by_case_predictors_for_graphs$CurrPTE <- factor(by_case_predictors_for_graphs$CurrPTE, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))
by_case_predictors_for_graphs$CurrFTE <- factor(by_case_predictors_for_graphs$CurrFTE, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))
by_case_predictors_for_graphs$Incarc <- factor(by_case_predictors_for_graphs$Incarc, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))
by_case_predictors_for_graphs$SubAbuse <- factor(by_case_predictors_for_graphs$SubAbuse, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))
by_case_predictors_for_graphs$Homeless <- factor(by_case_predictors_for_graphs$Homeless, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))
by_case_predictors_for_graphs$HighEdCert <- factor(by_case_predictors_for_graphs$HighEdCert, levels = c(1, 2, 3, 4, 5, 6, 7, 8, 77), labels = c("High School/GED" , "Vocational Certificate", "Vocational License" , "Associate Degree" , "Bachelor Degree" , "Higher Degree", "None of the Above", "Declined To Answer, Unknown, or Not Applicable", "Declined To Answer, Unknown, or Not Applicable"))
by_case_predictors_for_graphs$Race <- factor(by_case_predictors_for_graphs$Race, levels = c(1, 2, 3, 4, 5, 6, 99), labels = c("White Only", "Black Only", "American Indian or Alaskan Native Only", "Asian Only", "Native Hawaiian or Other Pacific Islander Only", "More Than One Race", "Race Unknown"))
general_things <- by_case_predictors_for_graphs |>
select(AnyPubAid, Incarc, SubAbuse, Homeless) |>
pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")
sex_graph <- by_case_predictors_for_graphs |>
select(AnyPubAid, Sex) |>
pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")
fcs_graph <- by_case_predictors_for_graphs |>
select(AnyPubAid, OutcmFCS) |>
pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")
state_graph <- by_case_predictors_for_graphs |>
select(AnyPubAid, State) |>
pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")
education_level <- by_case_predictors_for_graphs |>
select(AnyPubAid, HighEdCert, CurrenRoll) |>
pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")
employment_status <- by_case_predictors_for_graphs |>
select(AnyPubAid, EmplySklls, CurrFTE, CurrPTE) |>
pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")
new_race_graph <- by_case_predictors_for_graphs |>
select(AnyPubAid, Race) |>
pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")opts_chunk$set(fig.width=12, fig.height=8)
label_names1 <- c(
Incarc = "Incarceration",
Homeless = "Homelessness",
SubAbuse = "Substance Abuse"
)
general_things <- general_things %>%
mutate(value = factor(value, levels = c("Yes", "No", "Declined to Answer")))
ggplot(general_things, aes(x = value, fill = AnyPubAid)) +
geom_bar(position = "fill") +
facet_wrap(~ variable, scales = "free_x", labeller = labeller(variable = label_names1)) +
ylab("Proportion") +
xlab(" ") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Incarceration, Homelessness, and Substance Abuse Effects on Aid Receipt",
fill = "Aid Received") +
scale_fill_manual(values = c(
"0" = "cornsilk2",
"1" = "deepskyblue2"
),
labels = c(
"0" = "No",
"1" = "Yes"
))opts_chunk$set(fig.width=12, fig.height=8)
ggplot(sex_graph, aes(x = value, fill = AnyPubAid)) +
geom_bar(position = "fill") +
ylab("Proportion") +
xlab(" ") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Aid Receipt by Sex",
fill = "Aid Received") +
scale_fill_manual(values = c(
"0" = "cornsilk2",
"1" = "deepskyblue2"
),
labels = c(
"0" = "No",
"1" = "Yes"
))opts_chunk$set(fig.width=12, fig.height=8)
fcs_graph <- fcs_graph %>%
mutate(value = factor(value, levels = c("Currently in Foster Care", "Not Currently in Foster Care")))
ggplot(fcs_graph, aes(x = value, fill = AnyPubAid)) +
geom_bar(position = "fill") +
ylab("Proportion") +
xlab(" ") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Aid Receipt by Foster Care Status",
subtitle = "*Foster Care Status as measured by at the time of the survey",
fill = "Aid Received") +
scale_fill_manual(values = c(
"0" = "cornsilk2",
"1" = "deepskyblue2"
),
labels = c(
"0" = "No",
"1" = "Yes"
))opts_chunk$set(fig.width=12, fig.height=8)
label_names2 <- c(
HighEdCert = "Education Level",
CurrenRoll = "Currently Enrolled in School"
)
ggplot(education_level, aes(x = value, fill = AnyPubAid)) +
geom_bar(position = "fill") +
facet_wrap(~ variable, scales = "free_x", labeller = labeller(variable = label_names2)) +
ylab("Proportion") +
xlab(" ") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Aid Receipt by Education Level",
fill = "Aid Received") +
scale_fill_manual(values = c(
"0" = "cornsilk2",
"1" = "deepskyblue2"
),
labels = c(
"0" = "No",
"1" = "Yes"
))An important note about this graph is that “AnyPubAid” includes Education Aid. Therefore, it makes sense that those who have gone to higher levels of school may be receiving aid as they may be benefiting from education aid. It is unclear from this visualization whether they are also more likely to receive other kinds of public aid as well.
opts_chunk$set(fig.width=12, fig.height=8)
label_names3 <- c(
CurrFTE = str_wrap("Current Full Time Employment", width = 20),
CurrPTE = str_wrap("Current Part Time Employment", width = 20),
EmplySklls = str_wrap("Employment Skills Program Participation", width = 20)
)
ggplot(employment_status, aes(x = value, fill = AnyPubAid)) +
geom_bar(position = "fill") +
facet_wrap(~ variable, scales = "free_x", labeller = labeller(variable = label_names3)) +
ylab("Proportion") +
xlab(" ") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Aid Receipt by Employment Status",
fill = "Aid Received") +
scale_fill_manual(values = c(
"0" = "cornsilk2",
"1" = "deepskyblue2"
),
labels = c(
"0" = "No",
"1" = "Yes"
))opts_chunk$set(fig.width=12, fig.height=10)
ggplot(state_graph, aes(y = value, fill = AnyPubAid)) +
geom_bar(position = "fill") +
ylab(" ") +
xlab("Proportion") +
theme_minimal() +
labs(title = "Aid Receipt by Residential State",
fill = "Aid Received") +
scale_fill_manual(values = c(
"0" = "cornsilk2",
"1" = "deepskyblue2"
),
labels = c(
"0" = "No",
"1" = "Yes"
))opts_chunk$set(fig.width=12, fig.height=8)
ggplot(new_race_graph, aes(y = value, fill = AnyPubAid)) +
geom_bar(position = "fill") +
ylab(" ") +
xlab("Proportion") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Aid Receipt by Race",
fill = "Aid Received") +
scale_fill_manual(values = c(
"0" = "cornsilk2",
"1" = "deepskyblue2"
),
labels = c(
"0" = "No",
"1" = "Yes"
))Logistic Regression
For my second supervised learning algorithm, I chose to learn about Logistic Regression. This is a classification technique that is a form of regression and ensures probabilities remain between the values of 0-1. This way, rather than predicting different values, the model uses predictors to calculate a probability, which is then applied to a classification threshold. In most cases, models use a classification threshold of 0.5, meaning that if the probability of a certain case is higher than 0.5, the positive case is assigned (or, for a probability of less than 0.5, the negative case is assigned). Thank you to https://www.ml4publicpolicy.com/ for the in depth materials to apply logistic regression to policy analysis.
For the context of this problem, I run a logistic regression model that uses multiple factor variables to predict aid receipt. However, instead of manually inputting a classification threshold of 0.5, I employ cross validation, to ensure that the threshold is the optimal threshold for this data. Ultimately, this threshold ends up being 0.53, so the model does not change much compared to if I had manually input 0.5. However, it is important to cross-validate and find the optimal threshold to maximize correct predictions and minimize false predictions.
First, I run logistic regression on the data as it is.
#Split training and testing data to assess model accuracy
set.seed(1350)
logreg_predictors <- by_case_predictors
logreg_predictors$AnyPubAid <- factor(logreg_predictors$AnyPubAid, levels = c(0, 1), labels = c("no_aid", "aid"))
training_data_rows <- sample(1:nrow(logreg_predictors),
size = nrow(logreg_predictors)/2)
training_data <- logreg_predictors[training_data_rows, ]
testing_data <- logreg_predictors[-training_data_rows, ]TrControl <- trainControl(
method = "cv",
number = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = FALSE
)
logreg <- train(
AnyPubAid ~ .,
training_data,
method = "glm",
family = "binomial",
trControl = TrControl,
preProcess = c("center", "scale")
)print(logreg)Generalized Linear Model
16967 samples
26 predictor
2 classes: 'no_aid', 'aid'
Pre-processing: centered (26), scaled (26)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 13573, 13574, 13573, 13574, 13574
Resampling results:
ROC Sens Spec
0.6891743 0.8836957 0.3251248
However, this produces a high rate of sensitivity (the amount of correct non-aid predictions relative to the true amount of non-aid recipients) but a low rate of specificity (the amount of correct aid predictions relative to the true amount of aid recipients).
Thus, I performed SMOTE again to balance the classes, as I know this will help improve the specificity.
set.seed(123)
training_data_partial <- training_data |>
select(-27)
training_data_partial[] <- lapply(training_data_partial, function(x) as.numeric(x))
training_data_full <- training_data_partial |>
mutate(AnyPubAid = training_data$AnyPubAid)
smote_recipe <- recipe(AnyPubAid ~ ., data = training_data_full) |>
step_smote(AnyPubAid, over_ratio = 1)
prep_recipe <- prep(smote_recipe)
smote_training_data <- bake(prep_recipe, new_data = NULL)
logreg2 <- train(
AnyPubAid ~ .,
smote_training_data,
method = "glm",
family = "binomial",
trControl = TrControl,
preProcess = c("center", "scale")
)print(logreg2)Generalized Linear Model
22080 samples
26 predictor
2 classes: 'no_aid', 'aid'
Pre-processing: centered (26), scaled (26)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 17664, 17664, 17664, 17664, 17664
Resampling results:
ROC Sens Spec
0.6653036 0.6326087 0.6347826
This logistic regression model has a much better specificity. Though the sensitivity went down, this trade-off is much better than the uneven one above. Additionally, the ROC (receiver operating characteristic) is roughly as good as the one above, indicating that changing the specificity rate did not affect the statistical power.
To assess how my model does on predicting classes of new data, I can use the testing data that I set aside earlier.
set.seed(12345)
testing_data_partial <- testing_data |>
select(-27)
testing_data_partial[] <- lapply(testing_data_partial, function(x) as.numeric(x))
smote_testing_data <- testing_data_partial |>
mutate(AnyPubAid = testing_data$AnyPubAid)
prediction <- predict(logreg2, smote_testing_data, type = "raw")confusionMatrix(prediction, smote_testing_data[["AnyPubAid"]], positive = "aid")Confusion Matrix and Statistics
Reference
Prediction no_aid aid
no_aid 7818 2454
aid 3129 3566
Accuracy : 0.6709
95% CI : (0.6638, 0.678)
No Information Rate : 0.6452
P-Value [Acc > NIR] : 9.375e-13
Kappa : 0.299
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.5924
Specificity : 0.7142
Pos Pred Value : 0.5326
Neg Pred Value : 0.7611
Prevalence : 0.3548
Detection Rate : 0.2102
Detection Prevalence : 0.3946
Balanced Accuracy : 0.6533
'Positive' Class : aid
The Kappa value tells us how much better the model is at predicting compared to if it were to just predict randomly. The closer to zero the kappa value is, the worse at predicting the model is; so, this model performs fair enough but is not actually particularly good. Additionally, the sensitivity (ability to correctly classify aid cases) and specificity (ability to correctly classify no-aid cases) have changed a bit, but not to the point of concern for the model’s accuracy.
Now, I’d like to visualize how my model that was trained on SMOTEd data assessed the optimal classification threshold, as well as the model’s ROC (receiver operating characteristic) curve.
Finding the Classification Threshold
The video supplied on https://www.ml4publicpolicy.com/classification.html#conclusion introduces the concept of this type of visualization, and I wanted to create my own using the data I have.
probs <- predict(logreg2, newdata = smote_testing_data, type = "prob")[, "aid"]
truth <- smote_testing_data$AnyPubAid
truth <- ifelse(truth == "aid", 1, 0)
thresholds <- seq(0, 1, by = 0.01)
pr_data <- sapply(thresholds, function(t) {
pred <- ifelse(probs >= t, 1, 0)
tp <- sum(pred == 1 & truth == 1)
fp <- sum(pred == 1 & truth == 0)
fn <- sum(pred == 0 & truth == 1)
precision <- ifelse(tp + fp == 0, NA, tp / (tp + fp))
recall <- ifelse(tp + fn == 0, NA, tp / (tp + fn))
c(precision = precision, recall = recall)
}) |> t() |> as.data.frame()
pr_data$threshold <- thresholds
pr_data_long <- pr_data |>
pivot_longer(cols = c("precision", "recall"), names_to = "metric", values_to = "value")
pr_data$diff <- abs(pr_data$precision - pr_data$recall)
intersect_idx <- which.min(pr_data$diff)
intersect_point <- pr_data[intersect_idx, ]ggplot(pr_data_long, aes(x = threshold, y = value, color = metric)) +
geom_line() +
geom_point(data = intersect_point, aes(x = threshold, y = precision),
color = "darkgoldenrod4", size = 1) +
annotate("text",
x = intersect_point$threshold - 0.07,
y = intersect_point$precision - 0.11,
label = paste0("Threshold = ", round(intersect_point$threshold, 2),
"\nPrecision = Recall = ", round(intersect_point$precision, 2)),
hjust = 0.5, size = 3.5) +
ylim(0, 1) +
labs(
title = "Precision-Recall Curve by Classification Threshold",
x = "Threshold", y = "Score"
) +
theme_minimal() +
scale_color_manual(values = c("precision" = "deepskyblue2", "recall" = "lightpink2"))As noted above, we see here that the optimum threshold is identified to be 0.53, where the precision and recall (synonyms of sensitivity and specificity) are roughly equal at 0.54.
ROC Curve
Lastly, https://www.ml4publicpolicy.com/classification.html#conclusion recommends making a ROC plot to show how the model performs relative to random classification.
pr_numeric <- as.numeric(as.factor(prediction))
AnyPubAid_numeric <- as.numeric(as.factor(smote_testing_data$AnyPubAid))
roc_df <- data.frame(Observed = AnyPubAid_numeric, Predicted = pr_numeric)ggplot(roc_df, aes (d = Observed, m = Predicted)) +
geom_roc(labels = FALSE, color='deepskyblue3') +
style_roc(theme = theme_bw, guide = TRUE)The shape of this line, relative to the constant line spanning from point (0,0) to point (1,1), shows that this model is better than random classification. This is because the line from (0,0) to (1,1) represents the trade-offs between false positives and true positives if classification was randomly assigned, and the line representing my model’s trade-offs shows steeper incline and therefore better sensitivity and specificity.
Conclusion
Through use of hierarchical clustering, random forest algorithms, and logistic regression models, I have shown how the survey data of transitioning foster youth from 2017, 2019, and 2021 may be helpful for prediction of future youth’s financial needs.
However, the results of this project should be used carefully. Because the data had relatively low predictive power, use of the models may negatively influence aid distribution and reinforce racial, economic, gender, or other biases. Though this was an interesting research project, this report does not adequately answer the research question in a way that is meaningfully productive for application to policy work. Further research and analysis is required to create a real-world-ready algorithm for modeling foster youth’s public aid usage.
Consulted Sources
Coding and Machine Learning Algorithms:
https://www.ml4publicpolicy.com/classification.html#conclusion
https://quantifyinghealth.com/logistic-regression-in-r-with-categorical-variables/
https://quantifyinghealth.com/assess-variable-importance-in-regression/
https://posit.co/resources/cheatsheets/
https://r-graph-gallery.com/ggplot2-color.html
https://stackoverflow.com/questions
https://www.geeksforgeeks.org/k-fold-cross-validation-in-r-programming/
https://www.geeksforgeeks.org/how-to-use-smote-for-imbalanced-data-in-r/#
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
https://www.iguazio.com/glossary/classification-threshold/
OpenAI. (2024). ChatGPT.4o [Large language model]. https://chat.openai.com/chat
Data:
The data used in this publication, Dataset 297, National Youth in Transition Database (NYTD) - Outcomes File, Cohort Age 17 in FY2017, Waves 1-3 (Complete), were obtained from the National Data Archive on Child Abuse and Neglect and have been used in accordance with its Terms of Use Agreement license. The Administration on Children, Youth and Families, the Children’s Bureau, the original dataset collection personnel or funding source, NDACAN, Duke University, Cornell University, and their agents or employees bear no responsibility for the analyses or interpretations presented here.
https://www.ndacan.acf.hhs.gov/datasets/dataset-details.cfm?ID=297