Predicting Transitioning Foster Teen’s Use of External Financial Assistance

Author

Evie Happel

Introduction

Inspired by a research paper for a public policy class during the spring semester, I wanted to understand more about Foster Care outcomes on a quantitative level. I sourced data from the National Data Archive on Child Abuse and Neglect’s “National Youth in Transition Database,” and this project uses a file of survey responses from youth who reached the age of 17 and began to transition out of foster care in the government fiscal year 2017. The file also includes follow-up surveys of the same teens in 2019 and 2021. Using this data and statistical modeling techniques learned in my course, I attempt to answer the research questions: What are some predictors of transitioning foster youth’s reliance on external financial support— specifically, reliance on public aid? And, how can these variables help us predict future need for aid?

Modeling this data proved to be an exciting challenge. The file I used has seven categories of public aid and other forms of financial aid: Social Security Disability Insurance (or Supplemental Security Income), Education Aid, Public Financial Assistance (cash welfare payments for basic needs), Public Food Assistance (food stamps of any kind), Public Housing Assistance, Other Financial Support (from a family member, legal settlement, or child support), and Medicaid. In order to assess all of these factors simultaneously, I created a binary variable (AnyPubAid) that assigned 1 to a case where the youth was receiving any one of the seven categories of aid. When medicaid was included in the overall count of individuals receiving public aid, 85.6% of the youth in the surveys are marked as receiving public aid. However, when Medicaid is not included, only about 35% of youth in the surveys are marked for receipt of public aid. I decided to omit Medicaid from calculations due to its high prevalence, but because the proportions were still imbalanced, I had to decide how to increase the accuracy of my model when predicting the minority class (receipt of aid). For my supervised learning methods, I decided to employ Synthetic Minority Over-sampling Technique (SMOTE), which uses k-nearest-neighbors techniques to create new minority class data points to balance the class proportions. In other words—and in this context—SMOTE took information about the cases where individuals were receiving aid and used it to create more cases to have a more equal proportion of aid recipients to non-aid recipients. As you will see below, this greatly increased the accuracy of my model.

Lastly, even after reducing the data to those that submitted responses to at least 1 survey question, there were many rows that had large amounts of NA values (coded in the data file as “77” or “99” meaning the value is unknown or the survey respondent declined to answer) and some services that the survey asked about that were not applicable to the individual being surveyed (coded in data file as “88”). This created challenges around both reducing the data and imputing new values, as patterns showed that if there were some unknown values or some inapplicable variables, there was likely to be others. To remedy this, I decided to remove rows with unknown values in the public aid categories, while keeping any other unknown values. This way, I could still see if there were patterns of receipt of public aid among individuals who were not applicable for services or who declined to respond to demographic and lifestyle questions, but NA would not be a category that could be predicted.

With all of these data use choices in mind, the following models, code, and visualizations helped answer my research question.

#The following code reads in the data and reduces it to only those who 
#participated in the surveys and those who responded to at least one survey question
full_data <- read_tsv("NDACAN tab delimited original data")
all_participation <- full_data[full_data$OutcmRpt %in% 1,]
all_participation <- all_participation[all_participation$Responded %in% 1,]
#Then, I replaced "Wave" information (which period of surveying this particular survey was given) to the corresponding age for which the survey was given
by_case <- all_participation
by_case$Wave[by_case$Wave %in% 1] <- 17
by_case$Wave[by_case$Wave %in% 2] <- 19
by_case$Wave[by_case$Wave %in% 3] <- 21
#Creating NA values (instead of 77s and 99s)
by_case_clean <- by_case |>
  mutate(across(c(8:16, 19:41, 49, 50), ~ na_if(., 77)))
by_case_clean <- by_case_clean |>
  mutate(across(c(8:16, 19:41, 49, 50), ~ na_if(., 99)))
#Converting all variables to factor typed variables in R
by_case_factor <- by_case_clean |>
  mutate(
    across(
      c(`Wave`, `State`, 8:16, 19:50),
      factor
      )
    )
#creating AnyPubAid column, discussed above
by_case_factor2 <- by_case_factor |>
  mutate(AnyPubAid = if_else(if_any(c(23:28), ~ . == 1), 1, 0))
by_case_factor2$AnyPubAid <- as.factor(by_case_factor2$AnyPubAid)

table(by_case_factor2$AnyPubAid)


    0     1 
21256 11947

Hierarchical Clustering

The following code clusters the data together by case. One smaller data set is created to visualize clusters on a dendrogram, and another larger data set is created to assign cluster values to more data points and visualize cluster characteristics on a tile graph

set.seed(5214)
hc_subset_small <- by_case_factor2 |>
  select(c(State, 8:16, 19:22, 29:36, 38:41, AnyPubAid)) |>
  slice_sample(n = 100)

gower_dist_small <- daisy(hc_subset_small %>% select(-AnyPubAid), metric = "gower")
hc_small <- hclust(gower_dist_small)

small_dendrogram <- hc_small |>
  as.dendrogram()

my_colors2 <- ifelse(hc_subset_small$AnyPubAid == 1,
                    'deepskyblue2',
                    'darkgoldenrod4')

small_dendrogram |> 
  color_branches(col = my_colors2[order.dendrogram(small_dendrogram)]) |>
  color_labels(col = my_colors2[order.dendrogram(small_dendrogram)]) |>
  plot()

This dendrogram does not show particularly clear clustering of aid vs no-aid cases. Still, I’d like to see if the clusters pick up on any characteristics of those with and without aid.

set.seed(7432)

with_all_aid <- by_case_factor2 |>
  slice_sample(n = 10000)

hc_subset <- with_all_aid |>
  select(c(`State`, 8:16, 19:22, 29:36, 38:41, `AnyPubAid`))

gower_dist <- daisy(hc_subset %>% select(-AnyPubAid), metric = "gower")
hc <- hclust(gower_dist)

my_clusters <- cutree(hc, k=10)

get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

cluster_summary_stats <- hc_subset %>%
  mutate(cluster = my_clusters) %>%
  mutate(SocSecrty = with_all_aid$SocSecrty) %>%
  mutate(PubFinAs = with_all_aid$PubFinAs) %>%
  mutate(PubFoodAs = with_all_aid$PubFoodAs) %>%
  mutate(PubHousAs = with_all_aid$PubHousAs) %>%
  mutate(OthrFinAs = with_all_aid$OthrFinAs) %>%
  mutate(Medicaid = with_all_aid$Medicaid) %>%
  group_by(cluster) %>%
  summarize(across(everything(), get_mode))

cluster_summary_stats |>
  select(-`State`) |>
  pivot_longer(-cluster,
               names_to = "predictor",
               values_to = "value") |>
  group_by(predictor) |>
  ggplot() +
  geom_tile(aes(x = factor(cluster), y = predictor, fill = value), color = "black") +
  labs(title = "Heirarchical Clustering: Cluster Characteristics",
       fill = "Mode Characteristic") +
  scale_fill_manual(values = c(
    "0" = "darkgoldenrod4", #no
    "1" = "deepskyblue2", #yes
    "2" = "lightpink1", #declined to respond OR for sex: female
    "88" = "cornsilk2", #does not apply
    "7" = "darkgoldenrod4", #only for higher ed column, 1 for high school/GED, 7 for none
    "3" = "cornsilk2" #only for insurance related questions, 3 = do not know
  ),
  labels = c(
    "0" = "Brown: No OR",
    "1" = str_wrap("Blue: Yes OR for Higher Education Variable: Finished High School", width = 22),
    "2" = str_wrap("Pink: Declined to Respond OR for Sex Variable: Female", width = 20),
    "88" = "Cream: Does not apply",
    "7" = str_wrap("For Higher Education Variable: Did not finish High School", width = 20),
    "3" = str_wrap("For Insurance-related Variables: Do not know", width = 20),
    "NA" = "Gray: NA"
  )) +
  theme_classic()

In this visualization, we can see that there are two of ten clusters that have a mode of “Yes” for “AnyPubAid.” Other notable qualities in both of these clusters are white, female, significant health insurance, high school diplomas, connection to adults, and children out of marriage. However, one can see that the columns for which AnyPubAid is based on all have modes of “No.” This is discouraging, and is not easily fixed by removing NAs or by making more or less clusters—techniques that I tried and discarded after no discernible differences. Therefore, this hierarchical clustering algorithm should be taken lightly and other findings in this project can help answer the research question more accurately than this.

Random Forests

As my first supervised learning technique, I chose to run some random forest models. I used the function ranger due to the large amount of variables in my data set. As discussed before, I also used SMOTE to even out the class proportions. This allowed for a more accurate model (seen below) than the one developed under the observed proportions in the data (not included).

by_case_predictors <- by_case |>
  mutate(AnyPubAid = as.factor(if_else(if_any(c(23:28), ~ . == 1), 1, 0))) |>
  select(c(3, 8:16, 19:22, 29:36, 38:41, 52))

set.seed(3950)

by_case_predictors_numeric <- by_case_predictors |>
  select(-AnyPubAid)

by_case_predictors_numeric[] <- lapply(by_case_predictors_numeric, function(x) as.numeric(x))

by_case_predictors_numeric <- by_case_predictors_numeric |>
  mutate(AnyPubAid = by_case_predictors$AnyPubAid)

ranger_smote_recipe <- recipe(AnyPubAid ~ ., data = by_case_predictors_numeric) |>
  step_smote(AnyPubAid, over_ratio = 1)

ranger_prep_recipe <- prep(ranger_smote_recipe)
ranger_smote_training_data <- bake(ranger_prep_recipe, new_data = NULL)

smote_ranger_forest <- ranger(AnyPubAid ~ .,
                     data = ranger_smote_training_data,
                     importance = "impurity")
smote_ranger_forest

smote_ranger_forest$confusion.matrix

    predicted
true     0     1
   0 18581  3406
   1  6293 15694

This confusion matrix shows an 85% accuracy rate for predicting that an individual is not on aid, and a 71% accuracy rate for predicting that an individual is on aid. This is much better compared to the unbalanced accuracy rates of models trained on non-synthetic-minority-oversampled data.

Now, I’d like to see what variables this model used to achieve a high accuracy!

sort(smote_ranger_forest$variable.importance)

   HawaiiPI       Asian    RaceDcln    RaceUnkn      AmIAKN  PrescripIn 
   42.68297    74.57674    79.86565   110.41911   144.98506   160.16192 
  MedicalIn MentlHlthIn    Marriage    Children    BlkAfrAm       White 
  170.88541   180.75853   215.33607   215.91086   354.82399   364.33289 
  CnctAdult  OthrHlthIn    HisOrgin    Homeless     CurrFTE     CurrPTE 
  370.81451   373.96960   471.70611   525.43251   526.73435   528.73043 
   SubAbuse  EmplySklls      Incarc         Sex  CurrenRoll    OutcmFCS 
  574.58717   619.97632   730.96859   877.51696  1008.22507  1361.20415 
 HighEdCert       State 
 1427.16323  2255.30446

From this list, we see that the top 11 most important variables are as follows:

State
Education Level (HighEdCert)
Foster Care Status (at the time of the particular outcome survey - OutcmFCS)
Current Enrollment (in high school or other education program)
Sex
Incarceration Status (ever been or currently incarcerated)
Employment Skills (participation in one or more job training programs)
Substance Abuse Status (ever been or currently referred to a substance abuse counselor/program)
Current Part Time Employment
Current Full Time Employment
Homelessness Status (ever been or currently experiencing instability in adequate and regular housing)

Note: I include variable 11 because there is a very slight difference in importance score between variable 10 and 11, and a much greater difference in importance score between variable 11 and 12.

In order to visualize the directionality of these variables relationship to receipt of financial assistance, I created a series of segmented bar charts.

important_variables <- c("State", "HighEdCert", "OutcmFCS", "CurrenRoll", "Sex", "Incarc", "EmplySklls", "SubAbuse", "CurrPTE", "CurrFTE", "Homeless")

by_case_predictors_for_graphs <- by_case_predictors |>
  mutate(Race = by_case_factor2$Race)

by_case_predictors_for_graphs$State <- factor(by_case_predictors_for_graphs$State, levels = c(1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 53, 54, 55, 56, 72), labels = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Lousiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana" , "Nebraska" , "Nevada" , "New Hampsire" , "New Jersey" , "New Mexico" , "New York", "North Carolina" , "North Dakota" , "Ohio" , "Oklahoma" , "Oregon" , "Pennsylvania" , "Rhode Island" , "South Carolina" , "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont" , "Virginia" , "Washington" , "West Virginia" , "Wisconsin" , "Wyoming" , "Puerto Rico"))

by_case_predictors_for_graphs$Sex <- factor(by_case_predictors_for_graphs$Sex, levels = c(1, 2), labels = c("Male" , "Female"))

by_case_predictors_for_graphs$CurrenRoll <- factor(by_case_predictors_for_graphs$CurrenRoll, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))

by_case_predictors_for_graphs$OutcmFCS <- factor(by_case_predictors_for_graphs$OutcmFCS, levels = c(0, 1), labels = c("Not Currently in Foster Care" , "Currently in Foster Care"))

by_case_predictors_for_graphs$EmplySklls <- factor(by_case_predictors_for_graphs$EmplySklls, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))

by_case_predictors_for_graphs$CurrPTE <- factor(by_case_predictors_for_graphs$CurrPTE, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))

by_case_predictors_for_graphs$CurrFTE <- factor(by_case_predictors_for_graphs$CurrFTE, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))

by_case_predictors_for_graphs$Incarc <- factor(by_case_predictors_for_graphs$Incarc, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))

by_case_predictors_for_graphs$SubAbuse <- factor(by_case_predictors_for_graphs$SubAbuse, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))

by_case_predictors_for_graphs$Homeless <- factor(by_case_predictors_for_graphs$Homeless, levels = c(0, 1, 2, 77), labels = c("No" , "Yes", "Declined To Answer, Unknown, or Not Applicable" , "Declined To Answer, Unknown, or Not Applicable"))

by_case_predictors_for_graphs$HighEdCert <- factor(by_case_predictors_for_graphs$HighEdCert, levels = c(1, 2, 3, 4, 5, 6, 7, 8, 77), labels = c("High School/GED" , "Vocational Certificate", "Vocational License" , "Associate Degree" , "Bachelor Degree" , "Higher Degree", "None of the Above", "Declined To Answer, Unknown, or Not Applicable", "Declined To Answer, Unknown, or Not Applicable"))

by_case_predictors_for_graphs$Race <- factor(by_case_predictors_for_graphs$Race, levels = c(1, 2, 3, 4, 5, 6, 99), labels = c("White Only", "Black Only", "American Indian or Alaskan Native Only", "Asian Only", "Native Hawaiian or Other Pacific Islander Only", "More Than One Race", "Race Unknown"))

general_things <- by_case_predictors_for_graphs |>
  select(AnyPubAid, Incarc, SubAbuse, Homeless) |>
  pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")

sex_graph <- by_case_predictors_for_graphs |>
  select(AnyPubAid, Sex) |>
  pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")

fcs_graph <- by_case_predictors_for_graphs |>
  select(AnyPubAid, OutcmFCS) |>
  pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")

state_graph <- by_case_predictors_for_graphs |>
  select(AnyPubAid, State) |>
  pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")

education_level <- by_case_predictors_for_graphs |>
  select(AnyPubAid, HighEdCert, CurrenRoll) |>
  pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")

employment_status <- by_case_predictors_for_graphs |>
  select(AnyPubAid, EmplySklls, CurrFTE, CurrPTE) |>
  pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")

new_race_graph <- by_case_predictors_for_graphs |>
  select(AnyPubAid, Race) |>
  pivot_longer(cols = -AnyPubAid, names_to = "variable", values_to = "value")

opts_chunk$set(fig.width=12, fig.height=8) 
label_names1 <- c(
  Incarc = "Incarceration",
  Homeless = "Homelessness",
  SubAbuse = "Substance Abuse"
)

general_things <- general_things %>%
  mutate(value = factor(value, levels = c("Yes", "No", "Declined to Answer")))

ggplot(general_things, aes(x = value, fill = AnyPubAid)) +
  geom_bar(position = "fill") +
  facet_wrap(~ variable, scales = "free_x", labeller = labeller(variable = label_names1)) +
  ylab("Proportion") +
  xlab(" ") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Incarceration, Homelessness, and Substance Abuse Effects on Aid Receipt",
       fill = "Aid Received") +
 scale_fill_manual(values = c(
     "0" = "cornsilk2", 
    "1" = "deepskyblue2"
  ),
  labels = c(
    "0" = "No",
    "1" = "Yes"
  ))

opts_chunk$set(fig.width=12, fig.height=8)
ggplot(sex_graph, aes(x = value, fill = AnyPubAid)) +
  geom_bar(position = "fill") +
  ylab("Proportion") +
  xlab(" ") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Aid Receipt by Sex",
       fill = "Aid Received") +
 scale_fill_manual(values = c(
     "0" = "cornsilk2", 
    "1" = "deepskyblue2"
  ),
  labels = c(
    "0" = "No",
    "1" = "Yes"
  ))

opts_chunk$set(fig.width=12, fig.height=8)
fcs_graph <- fcs_graph %>%
  mutate(value = factor(value, levels = c("Currently in Foster Care", "Not Currently in Foster Care")))
  
ggplot(fcs_graph, aes(x = value, fill = AnyPubAid)) +
  geom_bar(position = "fill") +
  ylab("Proportion") +
  xlab(" ") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Aid Receipt by Foster Care Status",
       subtitle = "*Foster Care Status as measured by at the time of the survey",
       fill = "Aid Received") +
 scale_fill_manual(values = c(
     "0" = "cornsilk2", 
    "1" = "deepskyblue2"
  ),
  labels = c(
    "0" = "No",
    "1" = "Yes"
  ))

opts_chunk$set(fig.width=12, fig.height=8)
label_names2 <- c(
  HighEdCert = "Education Level",
  CurrenRoll = "Currently Enrolled in School"
)

ggplot(education_level, aes(x = value, fill = AnyPubAid)) +
  geom_bar(position = "fill") +
  facet_wrap(~ variable, scales = "free_x", labeller = labeller(variable = label_names2)) +
  ylab("Proportion") +
  xlab(" ") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Aid Receipt by Education Level",
       fill = "Aid Received") +
 scale_fill_manual(values = c(
     "0" = "cornsilk2", 
    "1" = "deepskyblue2"
  ),
  labels = c(
    "0" = "No",
    "1" = "Yes"
  ))

An important note about this graph is that “AnyPubAid” includes Education Aid. Therefore, it makes sense that those who have gone to higher levels of school may be receiving aid as they may be benefiting from education aid. It is unclear from this visualization whether they are also more likely to receive other kinds of public aid as well.

opts_chunk$set(fig.width=12, fig.height=8)
label_names3 <- c(
  CurrFTE = str_wrap("Current Full Time Employment", width = 20),
  CurrPTE = str_wrap("Current Part Time Employment", width = 20),
  EmplySklls = str_wrap("Employment Skills Program Participation", width = 20)
)

ggplot(employment_status, aes(x = value, fill = AnyPubAid)) +
  geom_bar(position = "fill") +
  facet_wrap(~ variable, scales = "free_x", labeller = labeller(variable = label_names3)) +
  ylab("Proportion") +
  xlab(" ") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Aid Receipt by Employment Status",
       fill = "Aid Received") +
 scale_fill_manual(values = c(
     "0" = "cornsilk2", 
    "1" = "deepskyblue2"
  ),
  labels = c(
    "0" = "No",
    "1" = "Yes"
  ))

opts_chunk$set(fig.width=12, fig.height=10)
ggplot(state_graph, aes(y = value, fill = AnyPubAid)) +
  geom_bar(position = "fill") +
  ylab(" ") +
  xlab("Proportion") +
  theme_minimal() +
  labs(title = "Aid Receipt by Residential State",
       fill = "Aid Received") +
 scale_fill_manual(values = c(
     "0" = "cornsilk2", 
    "1" = "deepskyblue2"
  ),
  labels = c(
    "0" = "No",
    "1" = "Yes"
  ))

opts_chunk$set(fig.width=12, fig.height=8)
ggplot(new_race_graph, aes(y = value, fill = AnyPubAid)) +
  geom_bar(position = "fill") +
  ylab(" ") +
  xlab("Proportion") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Aid Receipt by Race",
       fill = "Aid Received") +
 scale_fill_manual(values = c(
     "0" = "cornsilk2", 
    "1" = "deepskyblue2"
  ),
  labels = c(
    "0" = "No",
    "1" = "Yes"
  ))

Logistic Regression

For my second supervised learning algorithm, I chose to learn about Logistic Regression. This is a classification technique that is a form of regression and ensures probabilities remain between the values of 0-1. This way, rather than predicting different values, the model uses predictors to calculate a probability, which is then applied to a classification threshold. In most cases, models use a classification threshold of 0.5, meaning that if the probability of a certain case is higher than 0.5, the positive case is assigned (or, for a probability of less than 0.5, the negative case is assigned). Thank you to https://www.ml4publicpolicy.com/ for the in depth materials to apply logistic regression to policy analysis.

For the context of this problem, I run a logistic regression model that uses multiple factor variables to predict aid receipt. However, instead of manually inputting a classification threshold of 0.5, I employ cross validation, to ensure that the threshold is the optimal threshold for this data. Ultimately, this threshold ends up being 0.53, so the model does not change much compared to if I had manually input 0.5. However, it is important to cross-validate and find the optimal threshold to maximize correct predictions and minimize false predictions.

First, I run logistic regression on the data as it is.

#Split training and testing data to assess model accuracy
set.seed(1350)
logreg_predictors <- by_case_predictors
logreg_predictors$AnyPubAid <- factor(logreg_predictors$AnyPubAid, levels = c(0, 1), labels = c("no_aid", "aid"))
training_data_rows <- sample(1:nrow(logreg_predictors), 
                             size = nrow(logreg_predictors)/2)
training_data <- logreg_predictors[training_data_rows, ]
testing_data <- logreg_predictors[-training_data_rows, ]

TrControl <- trainControl(
    method = "cv",
    number = 5,
    summaryFunction = twoClassSummary,
    classProbs = TRUE,
    verboseIter = FALSE
)
logreg <- train(
    AnyPubAid ~ ., 
    training_data, 
    method = "glm",
    family = "binomial",
    trControl = TrControl,
    preProcess = c("center", "scale")
)

print(logreg)

Generalized Linear Model 

16967 samples
   26 predictor
    2 classes: 'no_aid', 'aid' 

Pre-processing: centered (26), scaled (26) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 13573, 13574, 13573, 13574, 13574 
Resampling results:

  ROC        Sens       Spec     
  0.6891743  0.8836957  0.3251248

However, this produces a high rate of sensitivity (the amount of correct non-aid predictions relative to the true amount of non-aid recipients) but a low rate of specificity (the amount of correct aid predictions relative to the true amount of aid recipients).

Thus, I performed SMOTE again to balance the classes, as I know this will help improve the specificity.

set.seed(123)

training_data_partial <- training_data |>
  select(-27)

training_data_partial[] <- lapply(training_data_partial, function(x) as.numeric(x))

training_data_full <- training_data_partial |>
  mutate(AnyPubAid = training_data$AnyPubAid)

smote_recipe <- recipe(AnyPubAid ~ ., data = training_data_full) |>
  step_smote(AnyPubAid, over_ratio = 1)

prep_recipe <- prep(smote_recipe)
smote_training_data <- bake(prep_recipe, new_data = NULL)

logreg2 <- train(
    AnyPubAid ~ ., 
    smote_training_data, 
    method = "glm",
    family = "binomial",
    trControl = TrControl,
    preProcess = c("center", "scale")
)

print(logreg2)

Generalized Linear Model 

22080 samples
   26 predictor
    2 classes: 'no_aid', 'aid' 

Pre-processing: centered (26), scaled (26) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 17664, 17664, 17664, 17664, 17664 
Resampling results:

  ROC        Sens       Spec     
  0.6653036  0.6326087  0.6347826

This logistic regression model has a much better specificity. Though the sensitivity went down, this trade-off is much better than the uneven one above. Additionally, the ROC (receiver operating characteristic) is roughly as good as the one above, indicating that changing the specificity rate did not affect the statistical power.

To assess how my model does on predicting classes of new data, I can use the testing data that I set aside earlier.

set.seed(12345)

testing_data_partial <- testing_data |>
  select(-27)

testing_data_partial[] <- lapply(testing_data_partial, function(x) as.numeric(x))

smote_testing_data <- testing_data_partial |>
  mutate(AnyPubAid = testing_data$AnyPubAid)

prediction <- predict(logreg2, smote_testing_data, type = "raw")

confusionMatrix(prediction, smote_testing_data[["AnyPubAid"]], positive = "aid")

Confusion Matrix and Statistics

          Reference
Prediction no_aid  aid
    no_aid   7818 2454
    aid      3129 3566
                                         
               Accuracy : 0.6709         
                 95% CI : (0.6638, 0.678)
    No Information Rate : 0.6452         
    P-Value [Acc > NIR] : 9.375e-13      
                                         
                  Kappa : 0.299          
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.5924         
            Specificity : 0.7142         
         Pos Pred Value : 0.5326         
         Neg Pred Value : 0.7611         
             Prevalence : 0.3548         
         Detection Rate : 0.2102         
   Detection Prevalence : 0.3946         
      Balanced Accuracy : 0.6533         
                                         
       'Positive' Class : aid

The Kappa value tells us how much better the model is at predicting compared to if it were to just predict randomly. The closer to zero the kappa value is, the worse at predicting the model is; so, this model performs fair enough but is not actually particularly good. Additionally, the sensitivity (ability to correctly classify aid cases) and specificity (ability to correctly classify no-aid cases) have changed a bit, but not to the point of concern for the model’s accuracy.

Now, I’d like to visualize how my model that was trained on SMOTEd data assessed the optimal classification threshold, as well as the model’s ROC (receiver operating characteristic) curve.

Finding the Classification Threshold

The video supplied on https://www.ml4publicpolicy.com/classification.html#conclusion introduces the concept of this type of visualization, and I wanted to create my own using the data I have.

probs <- predict(logreg2, newdata = smote_testing_data, type = "prob")[, "aid"]

truth <- smote_testing_data$AnyPubAid

truth <- ifelse(truth == "aid", 1, 0)

thresholds <- seq(0, 1, by = 0.01)

pr_data <- sapply(thresholds, function(t) {
  pred <- ifelse(probs >= t, 1, 0)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)

  precision <- ifelse(tp + fp == 0, NA, tp / (tp + fp))
  recall <- ifelse(tp + fn == 0, NA, tp / (tp + fn))

  c(precision = precision, recall = recall)
}) |> t() |> as.data.frame()

pr_data$threshold <- thresholds

pr_data_long <- pr_data |> 
  pivot_longer(cols = c("precision", "recall"), names_to = "metric", values_to = "value")

pr_data$diff <- abs(pr_data$precision - pr_data$recall)
intersect_idx <- which.min(pr_data$diff)

intersect_point <- pr_data[intersect_idx, ]

ggplot(pr_data_long, aes(x = threshold, y = value, color = metric)) +
  geom_line() +
  geom_point(data = intersect_point, aes(x = threshold, y = precision), 
             color = "darkgoldenrod4", size = 1) +
  annotate("text", 
           x = intersect_point$threshold - 0.07, 
           y = intersect_point$precision - 0.11, 
           label = paste0("Threshold = ", round(intersect_point$threshold, 2),
                          "\nPrecision = Recall = ", round(intersect_point$precision, 2)),
           hjust = 0.5, size = 3.5) +
  ylim(0, 1) +
  labs(
    title = "Precision-Recall Curve by Classification Threshold",
    x = "Threshold", y = "Score"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("precision" = "deepskyblue2", "recall" = "lightpink2"))

As noted above, we see here that the optimum threshold is identified to be 0.53, where the precision and recall (synonyms of sensitivity and specificity) are roughly equal at 0.54.

ROC Curve

Lastly, https://www.ml4publicpolicy.com/classification.html#conclusion recommends making a ROC plot to show how the model performs relative to random classification.

pr_numeric <- as.numeric(as.factor(prediction))

AnyPubAid_numeric <- as.numeric(as.factor(smote_testing_data$AnyPubAid))

roc_df <- data.frame(Observed = AnyPubAid_numeric, Predicted = pr_numeric)

ggplot(roc_df, aes (d = Observed, m = Predicted)) + 
            geom_roc(labels = FALSE, color='deepskyblue3') +
            style_roc(theme = theme_bw, guide = TRUE)

The shape of this line, relative to the constant line spanning from point (0,0) to point (1,1), shows that this model is better than random classification. This is because the line from (0,0) to (1,1) represents the trade-offs between false positives and true positives if classification was randomly assigned, and the line representing my model’s trade-offs shows steeper incline and therefore better sensitivity and specificity.

Conclusion

Through use of hierarchical clustering, random forest algorithms, and logistic regression models, I have shown how the survey data of transitioning foster youth from 2017, 2019, and 2021 may be helpful for prediction of future youth’s financial needs.

However, the results of this project should be used carefully. Because the data had relatively low predictive power, use of the models may negatively influence aid distribution and reinforce racial, economic, gender, or other biases. Though this was an interesting research project, this report does not adequately answer the research question in a way that is meaningfully productive for application to policy work. Further research and analysis is required to create a real-world-ready algorithm for modeling foster youth’s public aid usage.

Consulted Sources

Coding and Machine Learning Algorithms:

https://www.ml4publicpolicy.com/classification.html#conclusion

https://quantifyinghealth.com/logistic-regression-in-r-with-categorical-variables/

https://quantifyinghealth.com/assess-variable-importance-in-regression/

https://posit.co/resources/cheatsheets/

https://r-graph-gallery.com/ggplot2-color.html

https://stackoverflow.com/questions

https://www.geeksforgeeks.org/k-fold-cross-validation-in-r-programming/

https://www.geeksforgeeks.org/how-to-use-smote-for-imbalanced-data-in-r/#

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

https://www.iguazio.com/glossary/classification-threshold/

https://www.lyzr.ai/glossaries/smote/#:~:text=Generating%20Synthetic%20Instances%3A%20SMOTE%20generates,a%20combination%20of%20the%20two.

OpenAI. (2024). ChatGPT.4o [Large language model]. https://chat.openai.com/chat

Data:

The data used in this publication, Dataset 297, National Youth in Transition Database (NYTD) - Outcomes File, Cohort Age 17 in FY2017, Waves 1-3 (Complete), were obtained from the National Data Archive on Child Abuse and Neglect and have been used in accordance with its Terms of Use Agreement license. The Administration on Children, Youth and Families, the Children’s Bureau, the original dataset collection personnel or funding source, NDACAN, Duke University, Cornell University, and their agents or employees bear no responsibility for the analyses or interpretations presented here.

https://www.ndacan.acf.hhs.gov/datasets/dataset-details.cfm?ID=297