Abstract

This project analyzes multi-semester College Now student survey data at LaGuardia Community College to identify factors associated with students’ self-reported college readiness. Using structured survey responses related to learning, instructional support, and course satisfaction, several predictive models were developed, including logistic regression, Random Forest, and XGBoost. Model performance was evaluated with an emphasis on recall and specificity, reflecting the importance of accurately identifying both students who feel college ready and those who may need additional academic support. A tuned logistic regression model using structured survey predictors achieved the most balanced and stable performance. Natural Language Processing (NLP) features from open-ended responses were examined separately to avoid data leakage and provided additional context but did not outperform the structured-only model. Overall, the findings demonstrate that structured student feedback can meaningfully support early identification efforts and inform targeted interventions within the College Now program.


Introduction

College Now is one of the City University of New York’s (CUNY) largest pre-college initiatives, designed to provide high school students with early exposure to college-level coursework while earning transferable credits. Through partnerships between CUNY colleges and New York City high schools, the program aims to strengthen academic preparedness, build student confidence, and support a smoother transition to higher education. Each year, thousands of students participate across a wide range of disciplines, making College Now a central component of CUNY’s broader K–16 strategy to improve college access and success.

Given the scale and mission of the program, understanding how students experience College Now—and whether those experiences translate into a stronger sense of college readiness—is essential for data-informed program improvement. Post-course student surveys offer a valuable lens into learning, instructional support, and satisfaction across multiple semesters. This project leverages multi-semester College Now survey data to identify which aspects of the student experience most strongly predict students’ self-reported college readiness, with the goal of informing instructors, advisors, and program staff as they design targeted supports to improve student outcomes.

Objective

The primary objective of this analysis is to identify which student survey responses, collected across multiple semesters, best predict self-reported college readiness among College Now participants. Using structured survey questions, along with supplementary analysis of open-ended responses, the project develops predictive models to understand readiness patterns and to highlight opportunities for earlier and more targeted student support.

Because College Now serves a diverse student population and seeks to promote equitable access to college preparation, particular attention is given to model performance in identifying students who may feel less prepared. Emphasis is placed on recall and balanced classification to reduce the risk of overlooking students who may benefit from additional academic or advising support.

Literature Review

Prior research on dual enrollment and early college programs consistently highlights the role of academic exposure, non-cognitive supports, and institutional context in shaping students’ readiness for college. This review synthesizes existing findings across five thematic areas to situate the present study and clarify how it builds on prior work.

Academic Preparation and the Impact of Dual Enrollment

A substantial body of research demonstrates that dual-enrollment participation positively influences students’ academic preparation and transition to college-level work. Studies by Kurlaender et al. (2019) and Ryu et al. (2024) show that early exposure to college coursework strengthens academic momentum and confidence, particularly during the transition into postsecondary education. Phelps and Chan (2016) further extend this evidence by linking dual-credit completion not only to improved college performance but also to longer-term labor market outcomes. Collectively, these findings establish dual enrollment as an effective mechanism for promoting academic readiness, providing a foundation for examining how students perceive their preparedness through survey responses.

College Readiness as a Multidimensional Construct

College readiness is increasingly understood as a multidimensional concept that extends beyond academic skills alone. Westrick et al. (2024) emphasize that readiness includes students’ attitudes, persistence, and self-efficacy, while Kurlaender et al. (2019) highlight the importance of students’ perceptions of readiness alongside objective measures. This broader framework aligns closely with College Now’s emphasis on confidence-building and authentic college experiences and motivates the use of self-reported readiness as a meaningful outcome in the present study.

Advising, Mentorship, and Non-Cognitive Supports

Research consistently underscores the importance of advising and mentoring in supporting dual-enrollment students. Cribb (2021) finds that structured advising and consistent communication improve student engagement and academic planning, while Abel and Oliver (2018) demonstrate that counseling and mentorship strengthen both college and career readiness. These studies highlight the role of non-cognitive supports—such as motivation, time management, and perceived support—which are directly reflected in several of the survey measures analyzed in this project.

Equity and Access in Dual Enrollment

Equitable access remains a central concern in the expansion of dual-enrollment programs. Taylor et al. (2022) identify the need for policies that broaden participation while maintaining adequate academic and institutional supports. Similarly, Austin et al. (2024) show that automatic enrollment policies can increase advanced course participation without lowering academic performance, provided that appropriate safeguards are in place. These findings underscore the importance of identifying students who may struggle within dual-enrollment settings so that additional support can be offered proactively—an objective directly aligned with this study’s focus on identifying “Not Ready” students.

Institutional and Regional Context

Institutional design and local context play a significant role in shaping dual-enrollment outcomes. Roland and Herman (2020) situate College Now within New York City’s broader K–16 initiatives, emphasizing coordinated efforts between high schools and colleges. Liu, Minaya, and Xu (2022) further demonstrate that program structure and partnerships influence students’ college application behavior and admissions outcomes. These studies highlight the value of institution-specific analyses, reinforcing the relevance of examining College Now survey data from LaGuardia Community College.

Summary and Research Gap

Together, prior research establishes that dual enrollment supports college readiness through academic exposure, advising, and institutional alignment. However, much of the existing literature relies on administrative outcomes such as enrollment, persistence, or completion. This study extends prior work by leveraging multi-semester student survey data to examine how students’ perceived learning, support, and engagement predict self-reported college readiness. By focusing on student voice and perception, this analysis complements existing outcome-based research and provides actionable insights for program improvement within College Now.


Data and Preparation

This study uses a combined, multi-semester dataset derived from College Now student surveys administered after course completion. Each survey captures students’ perceptions of their learning experience, instructional support, and overall satisfaction, along with a key outcome measure indicating whether the course improved their sense of college readiness. Survey responses from multiple academic terms were merged into a single analytic dataset to enable cross-semester comparison and modeling.

While several core questions remained consistent across semesters, additional items—particularly open-ended questions—were introduced in later survey versions. As a result, not all variables are available for every student. Rather than discarding these observations, the analysis preserves valid responses and treats questions that were not included in earlier survey versions as “Not Asked,” allowing the models to retain information without introducing bias from uneven survey design. This approach ensures that students are not penalized for missing responses that reflect survey timing rather than non-response behavior.

The dataset was cleaned and re-coded to standardize response formats, align factor levels across semesters, and prepare variables for exploratory analysis and modeling. This project is also personally meaningful, as I am both a College Now alumna and currently employed with the program. This dual perspective reinforces the importance of using student feedback not only for analysis, but also for informing practical, data-driven decisions that can strengthen program support and student outcomes.

Packages and Load Data

We begin by importing the relevant packages and loading the cleaned dataset for analysis.

# Core data manipulation & plotting
library(tidyverse)   
library(janitor)
library(broom)
library(pROC)

# Tables & visualization helpers
library(kableExtra)
library(scales)
library(treemapify)
library(ggrepel)

# Text mining / NLP
library(tidytext)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)

# Modeling
library(caret)
library(randomForest)
library(xgboost)
library(vcd)   # assocstats(): Chi-square + Cramér's V
library(car)   # regression diagnostics

Load Final Dataset

The individual semester CSV exports from Qualtrics were merged, cleaned, and saved as a single file all_data_project_final.csv. We load that final dataset here.
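
For reference, the merge step can be sketched as shown below; the file names are hypothetical, since the combined file was created once and is simply loaded in the next chunk.

# Illustrative merge of per-semester Qualtrics exports (hypothetical file names);
# the combined result was saved as all_data_project_final.csv.
semester_files <- c("fall_2020_export.csv", "spring_2021_export.csv")

all_semesters <- semester_files |>
  purrr::map(readr::read_csv, col_types = readr::cols(.default = "c")) |>
  dplyr::bind_rows()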

# Load final dataset
df <- readr::read_csv("all_data_project_final.csv", show_col_types = FALSE)
head(df)
#> # A tibble: 6 × 22
#>   semester  finished q1    q2    q3    q4    q5    q6    q7    q8    q9    q10  
#>   <chr>     <lgl>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Fall 2020 TRUE     BTM1… No    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 2 Fall 2020 TRUE     MAT2… No    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 3 Fall 2020 TRUE     BTM1… No    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 4 Fall 2020 TRUE     MAT1… No    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 5 Fall 2020 TRUE     HUC1… No    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 6 Fall 2020 TRUE     BTM1… Yes   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> # ℹ 10 more variables: q11 <chr>, q12 <chr>, q13 <chr>, q14 <chr>, q15 <chr>,
#> #   q16 <chr>, q17 <chr>, q18 <chr>, q19 <chr>, q20 <chr>

Load Survey Questions for Reference

ref <- read.csv("question_dictionary.csv")

ref %>%
kable(caption = "Question Reference Table") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
)
Question Reference Table
var question_text
q1 Which College Now course(s) did you take this semester?
q2 Was this your first time taking a class with College Now?
q3 Which College Now course(s) did you take in the past?
q4 Was your College Now course held in-person, hybrid, or fully online?
q5 Do you feel that your Instructor supported you throughout this course?
q6 What helped you most?
q7 Is there any help you needed but did not get during the course?
q8 If yes, please explain.
q9 How could they have supported you better?
q10 How much do you feel you learned in this course?
q11 Do you feel you learned a lot in this course?
q12 Would you want to take another College Now course?
q13 Why would you want to take another College Now course?
q14 Why wouldn’t you want to take a College Now course online?
q15 Would you want to take another ONLINE College Now course? Please explain.
q16 Based on your experience, how is a college class different from a high school class?
q17 Do you feel this course improved your college readiness?
q18 Has this course made you more college-ready? Please explain.
q19 Why do you feel that the College Now class did not help you become more college ready?
q20 Why do you feel your College Now course experience has made you more college ready?

Exploratory Data Analysis (EDA)

This section explores response completeness, participation patterns, and key student experience measures prior to modeling.

glimpse(df)  
#> Rows: 6,432
#> Columns: 22
#> $ semester <chr> "Fall 2020", "Fall 2020", "Fall 2020", "Fall 2020", "Fall 202…
#> $ finished <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
#> $ q1       <chr> "BTM101 - Introduction to Business", "MAT200 - Precalculus", …
#> $ q2       <chr> "No", "No", "No", "No", "No", "Yes", "No", "Yes", "Yes", "No"…
#> $ q3       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q4       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q5       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q6       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q7       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q8       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q9       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q10      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q11      <chr> "Yes", "Yes", "yes!", "Yes", "Yes definitely!", "Yes", "Yes",…
#> $ q12      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q13      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q14      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q15      <chr> "Yes I think it's easier to be able to take classes from home…
#> $ q16      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q17      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
#> $ q18      <chr> "Yes because it taught me how to be more independent when doi…
#> $ q19      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ q20      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
colSums(is.na(df))
#> semester finished       q1       q2       q3       q4       q5       q6 
#>        0        0      176      206     4667     3765     1954     3149 
#>       q7       q8       q9      q10      q11      q12      q13      q14 
#>     2190     6320     6346     2203     4869     2316     4490     6027 
#>      q15      q16      q17      q18      q19      q20 
#>     4877     3010      808     4878     6220     3441

Above, we examine missing values across columns. Some survey questions were not asked in all semesters, which contributes to structural non-response.

# Helper: order semesters chronologically (Spring then Fall within each year)
order_semesters <- function(semester) {
  # semester is a character or factor like "Spring 2020", "Fall 2023", etc.
  
  sem_date <- dplyr::case_when(
    stringr::str_detect(semester, "Spring") ~ paste0(stringr::str_extract(semester, "\\d{4}"), "-03-01"),
    stringr::str_detect(semester, "Fall")   ~ paste0(stringr::str_extract(semester, "\\d{4}"), "-10-01"),
    TRUE ~ NA_character_
  ) |> as.Date()
  
  forcats::fct_reorder(as.factor(semester), sem_date)
}
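
A quick illustrative check of the helper, assuming every label follows the “Spring YYYY” / “Fall YYYY” pattern used in this dataset:

# Illustrative check: levels of the reordered factor should run chronologically,
# e.g. "Spring 2020", "Fall 2020", "Spring 2021", ...
semester_levels <- levels(order_semesters(unique(df$semester)))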

In the next section, we summarize how often students responded to each question, how responses are distributed across semesters, and how students describe their course experiences.

Response Completeness by Question

Goal: Evaluate how complete the dataset is by survey question.

Why: Helps assess response quality and identify patterns of missingness before modeling.

# Count non-missing responses for each column (excluding semester and finished)
response_summary <- df %>%
  select(-semester, -finished) %>%
  summarise(across(everything(), ~ sum(!is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "response_count") %>%
  mutate(response_percent = response_count / nrow(df))

# Plot response completeness
ggplot(response_summary, aes(x = reorder(Column, -response_count), y = response_count)) +
  geom_col(fill = "#4682B4") +
  geom_text(
    aes(label = paste0(round(response_percent * 100, 1), "%")),
    vjust = -0.5, size = 2.6
  ) +
  labs(
    title = "Response Completeness per Survey Question",
    x = "Survey Question",
    y = "Number of Responses"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Core survey questions such as q1 and q2 have the highest completeness (above 95%), while response rates gradually decline for later items. This reflects both structural missingness (questions added in later semesters) and item non-response, which I account for when selecting predictors for the models.

Having looked at missingness by question, I next examine how responses are distributed across semesters.

Response Distribution by Semester

Goal: Understand how survey participation varies across semesters.

Why: Contextualizes later analyses, especially readiness trends and model training distribution.

semester_counts <- df %>%
  count(semester) %>%
  mutate(
    response_percent = n / sum(n),
    semester = reorder(semester, -response_percent)
  )

ggplot(semester_counts, aes(x = semester, y = response_percent)) +
  geom_col(fill = "#6A5ACD") +
  geom_text(
    aes(label = percent(response_percent, accuracy = 0.1)),
    vjust = -0.5, size = 3.2
  ) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Responses per Semester",
    x = "Semester",
    y = "Share of Total Responses"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Participation peaked in Fall 2021 (17.4%), with strong engagement also seen in Spring 2021 and Fall 2020. Response volume dips slightly in mid-years but rises again by Spring 2025, suggesting renewed survey engagement in recent semesters. This distribution ensures that readiness modeling draws from a broad, multi-year student population.

With this context on when data were collected, we now turn to which courses students actually took during these semesters.

Course Participation (Q1)

Survey item: “Which College Now course(s) did you take this semester?”

Goal: Identify which College Now courses are most frequently taken.

Why: Highlights participation patterns and reveals which academic areas students most often engage with.

df %>%
  filter(!is.na(q1)) %>%
  count(q1, sort = TRUE) %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = reorder(q1, n), y = n)) +
  geom_col(fill = "#6495ED") +
  coord_flip() +
  labs(
    title = "Top 10 Courses Taken",
    x = "Course",
    y = "Number of Students"
  ) +
  theme_minimal()

The most frequently taken courses were:

  • ENG101 – Composition I
  • HUP102 – Critical Thinking
  • CSE110 – Literacy and Propaganda

These top courses are reading- and writing-intensive, which aligns with College Now’s goal of strengthening academic foundations essential for college readiness. Math and business courses also appear—indicating a broad representation of subject areas across the program.

After understanding which courses students are taking, it is also useful to examine who these students are—specifically, whether they are new to the program or returning participants.

First-Time vs. Returning Students (Q2)

Survey item: “Was this your first time taking a class with College Now?”

Goal: Compare the proportion of first-time versus returning students.

Why: Helps understand program reach and whether students continue engaging with College Now beyond their initial course.

df %>%
  filter(q2 %in% c("Yes", "No")) %>%    
  count(q2) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(x = q2, y = percent)) +
  geom_col(fill = "#5D8AA8", show.legend = FALSE) +
  geom_text(
    aes(label = scales::percent(percent, accuracy = 0.1)),
    vjust = -0.5
  ) +
  labs(
    title = "First-Time vs Returning College Now Students (Q2)",
    x = NULL,
    y = "Percent of Students"
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme_minimal()

About 57% of students were first-time participants, while 40% had taken at least one College Now course before. This indicates strong recruitment of new students each semester while still retaining a substantial returning population.

Among those returning students, certain courses appear particularly popular, which I explore next using Q3.

Top 10 past courses for returning students (Q3)

Survey item: “Which College Now course(s) did you take in the past?”

Goal: Identify which courses returning students had previously taken.

Why: Helps highlight subjects that tend to attract repeat participation and may reflect course popularity, accessibility, or perceived value.

df %>%
  filter(!is.na(q3)) %>%        
  count(q3, sort = TRUE) %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = reorder(q3, n), y = n)) +
  geom_col(fill = "#E07A5F") +
  coord_flip() +
  labs(
    title = "Top 10 Previously Taken Courses (Q3)",
    x = "Course",
    y = "Number of Students"
  ) +
  theme_minimal()

Returning students most frequently reported taking CSE110 (Literacy and Propaganda), HUP102 (Critical Thinking), ENG101 (Composition I), and MAT115 (College Algebra and Trigonometry). This pattern suggests that students who return to the program tend to re-enroll in courses that strengthen core academic skills such as reading, writing, critical thinking, and foundational mathematics—areas closely connected to overall college preparedness.

Following this, we examine how students felt about the support they received from their instructors, another key component of readiness and overall satisfaction with the program.

Instructor Support (Q5)

Survey item: “Do you feel that your instructor supported you throughout this course?”

Goal: Summarize students’ perceptions of instructor support.

Why: Instructor support is a key contributor to student satisfaction and may influence readiness outcomes.

df %>%
  filter(q5 %in% c("Yes", "No")) %>%    
  count(q5) %>%
  ggplot(aes(x = reorder(q5, -n), y = n)) +
  geom_col(fill = "#20B2AA") +
  labs(
    title = "Perceived Instructor Support (Q5)",
    x = "Response",
    y = "Count"
  ) +
  theme_minimal()

The overwhelming majority of students reported feeling supported by their instructors, with only a small number indicating “No.” This suggests that instructor engagement is a strong and consistent positive element of College Now courses, contributing to students’ overall sense of preparedness and satisfaction.

High levels of instructor support may relate to whether students want to continue in the program. Next, we examine their interest in taking another College Now course.

Intent to Take Another Course (Q12)

Survey item: “Would you want to take another College Now course?”

Goal: Measure students’ interest in continuing with College Now.

Why: High interest in taking another course is a strong indicator of program satisfaction and perceived value.

df %>%
  filter(q12 %in% c("Yes", "No")) %>%      
  count(q12) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(
    area = percent,
    fill = q12,
    label = paste0(q12, "\n", scales::percent(percent, accuracy = 0.1))
  )) +
  geom_treemap() +
  geom_treemap_text(
    color = "white",
    place = "centre",
    grow = TRUE
  ) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Would Students Take Another College Now Course? (Q12)"
  ) +
  theme(legend.position = "none")

A large majority of students—about 84%—indicated they would take another College Now course. Only 16% said they would not. This suggests that most students have a positive experience and see continued value in participating in the program.

Because the central focus of this project is college readiness, I now bring together the core readiness question (Q17) with students’ open-ended explanations (Q18 and Q20).

College-Readiness (Q17, Q18, Q20)

Overall Readiness (Q17)

Survey item: “Do you feel this course improved your college readiness?”

Goal: Capture students’ overall perception of whether the course helped them feel more college-ready.

Why: This is the primary outcome variable for the capstone and the foundation for modeling readiness.

# Standardize q17 into Yes/No responses only
q17_clean <- df %>%
  transmute(
    semester,
    q17 = str_squish(tolower(as.character(q17))),
    q17 = case_when(
      q17 %in% c("yes", "yes!", "yes definitely!", "yeah", "y", "yep", "sure", "of course") ~ "Yes",
      q17 %in% c("no", "nope", "n") ~ "No",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(q17))

# Overall summary
q17_sum <- q17_clean %>%
  count(q17) %>%
  mutate(percent = n / sum(n))

ggplot(q17_sum, aes(x = q17, y = n, fill = q17)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = percent(percent, accuracy = 0.1)), vjust = -0.4) +
  labs(
    title = "Did This Course Improve Your College Readiness? (Q17)",
    x = NULL,
    y = "Number of Students"
  ) +
  theme_minimal()

An overwhelming 92% of students reported that their College Now course improved their college readiness. This high level of agreement suggests that students consistently perceive the program as supporting their transition into college-level expectations and academic behaviors.

After examining overall readiness, it is also important to understand how students’ perceptions evolved over time. I therefore look at how readiness (Q17) varies across semesters.

Readiness by Semester (Q17)

Goal: Explore whether the share of students who feel more college-ready differs across semesters.

Why: Understanding semester-to-semester variation helps determine whether readiness outcomes remained stable or changed for different cohorts.

# Clean and standardize Q17 responses
q17_clean <- df %>%
  transmute(
    semester,
    q17 = str_squish(tolower(as.character(q17))),
    q17 = case_when(
      q17 %in% c("yes","yes!","yes definitely!","yeah","y","yep","sure","of course") ~ "Yes",
      q17 %in% c("no","nope","n") ~ "No",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(q17))

# Percent yes/no by semester, ordered chronologically
q17_line <- q17_clean %>%
  count(semester, q17) %>%
  group_by(semester) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  mutate(semester = order_semesters(semester))

ggplot(q17_line, aes(x = semester, y = percent, color = q17, group = q17)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "College Readiness by Semester (Q17)",
    x = "Semester",
    y = "Percent of Students",
    color = "Response"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

Across semesters, more than 90% of students consistently reported that their course improved their college readiness. Only slight fluctuations appear from term to term, suggesting that the program’s impact on readiness has been strong, stable, and sustained over several years.

To better understand why students feel more or less college-ready, I next examine themes from their open-ended explanations in Q18 and Q20.

Open-Ended Themes (Q18 & Q20)

Goal: Identify the most common ideas students mention when explaining their readiness.

Why: Provides qualitative insight into the skills, experiences, or course elements students associate with becoming more college-ready.

q_cols <- intersect(c("q18", "q20"), names(df))

if (length(q_cols) > 0) {

  bigram_df <- df %>%
    select(all_of(q_cols)) %>%
    pivot_longer(everything(), names_to = "question", values_to = "text") %>%
    mutate(text = tolower(str_squish(as.character(text)))) %>%
    filter(!is.na(text), text != "") %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    separate(bigram, into = c("word1", "word2"), sep = " ") %>%
    filter(
      !word1 %in% stop_words$word,
      !word2 %in% stop_words$word,
      str_detect(word1, "^[a-z]+$"),
      str_detect(word2, "^[a-z]+$")
    ) %>%
    unite(bigram, word1, word2, sep = " ") %>%
    count(question, bigram, sort = TRUE) %>%
    group_by(question) %>%
    slice_max(n, n = 12) %>%
    ungroup() %>%
    mutate(bigram = tidytext::reorder_within(bigram, n, question))

  ggplot(bigram_df, aes(x = bigram, y = n)) +
    geom_col(fill = "#6A5ACD") +
    coord_flip() +
    facet_wrap(~ question, scales = "free_y") +
    tidytext::scale_x_reordered() +
    labs(
      title = "How College Now Courses Helped Students Feel More College-Ready",
      subtitle = "Top two-word themes from open-ended explanations (Q18 & Q20)",
      x = "Common Themes",
      y = "Frequency"
    ) +
    theme_minimal()
}

Students frequently used phrases such as “college ready,” “college class,” “college classes,” and “time management.” These themes suggest that students associate readiness with exposure to real college expectations and the development of practical academic habits—especially managing time effectively.

Beyond common themes, it is also useful to understand the overall tone of students’ explanations. I therefore examine the sentiment expressed in their open-ended responses.

Sentiment Snapshot (Q18 & Q20)

Goal: Assess whether students describe their readiness experience using positive or negative language.

Why: Sentiment highlights the emotional tone behind students’ explanations and provides another perspective on their readiness perceptions.

# Select open-ended text columns
q_cols <- intersect(c("q18", "q20"), names(df))

if (length(q_cols) > 0) {
  
  # Tokenize into individual words
  sentiment_df <- df %>%
    select(all_of(q_cols)) %>%
    pivot_longer(everything(), names_to = "question", values_to = "text") %>%
    mutate(text = tolower(str_squish(as.character(text)))) %>%
    filter(!is.na(text), text != "") %>%
    unnest_tokens(word, text)
  
  # Join with Bing sentiment lexicon
  sentiment_summary <- sentiment_df %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    count(question, sentiment) %>%
    group_by(question) %>%
    mutate(percent = n / sum(n)) %>%
    ungroup()
  
  # Plot sentiment distribution
  ggplot(sentiment_summary, aes(x = sentiment, y = percent, fill = sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ question) +
    scale_y_continuous(labels = percent_format(accuracy = 1)) +
    labs(
      title = "Sentiment Distribution for College Readiness Explanations",
      x = "Sentiment",
      y = "Share of Words"
    ) +
    theme_minimal()
}

Open-ended explanations contain overwhelmingly positive language. For both Q18 and Q20, more than 90% of sentiment-bearing words were classified as positive. This reinforces that students generally view their College Now experience as beneficial for improving their college readiness.

Why Not More Ready? (Q19)

Survey item: “Why do you feel that the College Now class did not help you become more college ready?”

Goal: Identify common explanations among students who did not perceive readiness gains.

Why: Understanding negative or uncertain feedback highlights areas where the program may strengthen support or clarify expectations.

# Trigram version (negation-aware)
q19_trigrams <- df %>%
  filter(!is.na(q19)) %>%
  mutate(q19 = tolower(str_squish(as.character(q19)))) %>%
  unnest_tokens(trigram, q19, token = "ngrams", n = 3) %>%
  filter(str_detect(trigram, "not |didn['’]?t |don['’]?t ")) %>%
  count(trigram, sort = TRUE) %>%
  filter(n >= 2) %>%
  mutate(trigram = fct_reorder(trigram, n))

ggplot(q19_trigrams, aes(x = trigram, y = n)) +
  geom_col(fill = "#E07A5F") +
  coord_flip() +
  labs(
    title = "Common Negative Phrases in College Readiness Responses (q19)",
    subtitle = "Top three-word phrases with negations (not / didn’t / don’t)",
    x = "Phrase (3 words)",
    y = "Count"
  ) +
  theme_minimal()

Among the small number of students who felt the course did not improve their readiness, common expressions included “I don’t think,” “I didn’t learn,” and “I’m not sure.” These statements suggest uncertainty, limited perceived learning, or a mismatch between expectations and course experience. Although this group is small, their feedback highlights areas where clearer expectations or stronger skill alignment may be helpful.

To complement the thematic and sentiment analyses, we also generate a word cloud to visualize the most frequent words students used when describing their College Now experience.

Word Cloud of Open-Ended Responses (Q18–Q20)

Goal: Display the most common words students used when explaining their readiness.

Why: A word cloud provides an intuitive visual summary of dominant ideas and recurring themes across open-ended responses.


# Open-ended text items
open_text_vars <- c("q18", "q19", "q20")

# Build combined text field
df_wc <- df %>%
  mutate(
    text_all = select(., all_of(open_text_vars)) |>
      apply(1, function(x) paste(x, collapse = " ")) |>
      str_squish()
  )

# Remove empty rows
df_clean_wc <- df_wc %>%
  mutate(text_all = if_else(is.na(text_all), "", text_all)) %>%
  filter(text_all != "")

# Build corpus
corpus <- Corpus(VectorSource(df_clean_wc$text_all))

# Text cleaning
corpus <- corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stripWhitespace)

# Remove any fully empty documents
non_empty <- sapply(corpus, function(doc) nchar(as.character(doc)) > 0)
corpus <- corpus[non_empty]

# Term-document matrix and frequency table
tdm <- TermDocumentMatrix(corpus)
tdm_matrix <- as.matrix(tdm)

word_freq <- sort(rowSums(tdm_matrix), decreasing = TRUE)
df_words <- data.frame(word = names(word_freq), freq = word_freq)

# Word cloud
set.seed(1234)
wordcloud(
  words = df_words$word,
  freq  = df_words$freq,
  min.freq = 5,
  max.words = 150,
  random.order = FALSE,
  rot.per = 0.25,
  colors = brewer.pal(8, "Dark2")
)

The word cloud highlights the most prominent terms students used when describing their readiness experience. Frequent words such as “college,” “ready,” “class,” “learned,” and “skills” reinforce earlier findings: students commonly associate College Now with gaining exposure to college expectations and developing academic skills necessary for their transition to higher education.


Data Preprocessing

Creating target variable from q17

# Q17 has been pre-cleaned in the final CSV to simple "Yes"/"No"
df <- df |>
  mutate(
    college_readiness = if_else(q17 == "Yes", 1, 0)
  )

# Check distribution of the original Q17 responses
table(df$q17, useNA = "ifany")
#> 
#>   No  Yes <NA> 
#>  442 5182  808

We define ‘college_readiness’ as a binary variable where “Yes” = 1 and “No” = 0 based on Q17 (“Do you feel this course improved your college readiness?”).

Handling missingness

df |>
  group_by(semester) |>
  summarise(across(q1:q20, ~ sum(is.na(.x))))
#> # A tibble: 10 × 21
#>    semester       q1    q2    q3    q4    q5    q6    q7    q8    q9   q10   q11
#>    <chr>       <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#>  1 Fall 2020      51    50   918   918   918   918   918   918   918   918    56
#>  2 Fall 2021       0    23   779  1118    83   362   139  1097  1106   143  1118
#>  3 Fall 2022       0     3    79     0     5    35     9   112   113     9   114
#>  4 Fall 2023       0     0     3     0     0     1     0    10    10     0    10
#>  5 Fall 2024      36    19   549    16    29   206    67   725   726    69   739
#>  6 Spring 2020    51    52   758   758   758   758   758   758   758   758    57
#>  7 Spring 2021     0    22   548   942    54   276    95   920   926    90   942
#>  8 Spring 2022     0     8   261     0    38   167    50   514   522    53   532
#>  9 Spring 2024     0    13   356     0    38   193    70   525   528    74   539
#> 10 Spring 2025    38    16   416    13    31   233    84   741   739    89   762
#> # ℹ 9 more variables: q12 <int>, q13 <int>, q14 <int>, q15 <int>, q16 <int>,
#> #   q17 <int>, q18 <int>, q19 <int>, q20 <int>

This table confirms that earlier semesters did not include many of the later questions (Q3–Q20), which explains the large number of NA values for those items. These structural NAs are considered when selecting predictors for modeling.

Feature Engineering for Modeling

In this section, we derive the main outcome variable and re-code selected survey questions into clean categorical predictors that can be used in the modeling stage.

Make the target a factor

df <- df |>
  mutate(
    college_readiness = factor(
      college_readiness,
      levels = c(0, 1),
      labels = c("Not Ready", "Ready")
    )
  )

We convert college_readiness to a factor with levels “Not Ready” and “Ready” to prepare it for classification models.

Recoding Yes/No Variables and Handling “Not Asked” Responses

Re-coding Q11 Using a Lightweight NLP Approach

Q11 asks whether students felt they learned a lot in the course, but responses appear in many written variations such as “Yes definitely!”, “I learned so much,” or “not really.” To convert this into a consistent categorical predictor, I apply a simple NLP-based approach that standardizes the text, detects strongly positive phrases, evaluates overall sentiment, and classifies each response into four categories: Yes, No, Mixed, and Not Asked.

## 1. Strong positive patterns
positive_patterns <- c(
  "^yes",
  "\\byes\\b",
  "learned a lot",
  "learned alot",
  "learned so much",
  "learn so much",
  "i learned",
  "i feel i learned",
  "definitely",
  "absolutely",
  "a lot",
  "alot",
  "\\byeah\\b",
  "\\byea\\b"
)

## 2. Create clean q11 text
df_q11 <- df |>
  mutate(
    q11_id    = dplyr::row_number(),
    q11_clean = stringr::str_to_lower(q11)
  ) |>
  dplyr::select(q11_id, q11_clean)

## 3. Lexicon scoring
q11_sent <- df_q11 |>
  dplyr::filter(!is.na(q11_clean)) |>
  tidytext::unnest_tokens(word, q11_clean) |>
  dplyr::inner_join(tidytext::get_sentiments("bing"), by = "word") |>
  dplyr::count(q11_id, sentiment) |>
  tidyr::complete(
    q11_id,
    sentiment = c("positive", "negative"),
    fill = list(n = 0)
  ) |>
  tidyr::pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = 0
  )

## 4. Merge first
df <- df |>
  mutate(q11_id = dplyr::row_number()) |>
  dplyr::left_join(q11_sent, by = "q11_id")

## 5. Make sure positive/negative columns exist
if (!"positive" %in% names(df)) df <- df |> mutate(positive = 0)
if (!"negative" %in% names(df)) df <- df |> mutate(negative = 0)

## 6. Classify q11 into q11_learned
df <- df |>
  mutate(
    positive = tidyr::replace_na(positive, 0),
    negative = tidyr::replace_na(negative, 0),

    # Step A: Strong positive override
    strong_yes = dplyr::if_else(
      !is.na(q11) & stringr::str_detect(
        stringr::str_to_lower(q11),
        stringr::str_c(positive_patterns, collapse = "|")
      ),
      TRUE, FALSE
    ),

    # Step B: Final classification
    q11_learned = dplyr::case_when(
      is.na(q11) ~ "Not Asked",
      strong_yes ~ "Yes",
      positive > negative ~ "Yes",
      negative > positive ~ "No",
      TRUE ~ "Mixed"
    ),

    q11_learned = factor(
      q11_learned,
      levels = c("No", "Mixed", "Yes", "Not Asked")
    )
  ) |>
  dplyr::select(-q11_id, -strong_yes)
df |> count(q11_learned)
#> # A tibble: 4 × 2
#>   q11_learned     n
#>   <fct>       <int>
#> 1 No              8
#> 2 Mixed         104
#> 3 Yes          1451
#> 4 Not Asked    4869

This NLP-enhanced recoding converts free-text responses to Q11 into a clean categorical variable (‘q11_learned’). Strong affirmative statements (e.g., “I learned so much”) are classified as “Yes,” while negative or uncertain wording is classified as “No” or “Mixed.” Responses missing due to structural differences across semesters are labeled “Not Asked.” Because Q11 was not included in most early semesters, the majority of responses fall under “Not Asked,” and this feature is therefore excluded from the final predictive models to avoid structural missingness and unintended bias.

Recode Yes/No Questions (Q2, Q7, Q12)

These three survey items use Yes/No responses but contain inconsistencies such as lowercase entries, single-letter responses (“y”, “n”), and some missing values. To standardize them for modeling, I recode:

  • “Yes”, “Y”, “yes”, “y”, etc. = Yes

  • “No”, “N”, “no”, “n”, etc. = No

  • Missing or nonstandard responses = Not Asked

## Recode Yes/No questions (Q2, Q7, Q12)
yn_vars <- c("q2", "q7", "q12")

df <- df |>
  mutate(
    across(
      all_of(yn_vars),
      ~ case_when(
        str_to_lower(.) %in% c("yes", "y") ~ "Yes",
        str_to_lower(.) %in% c("no", "n")  ~ "No",
        is.na(.)                           ~ "Not Asked",
        TRUE                               ~ "Not Asked"  # anything else -> Not Asked
      )
    )
  ) |>
  mutate(
    across(
      all_of(yn_vars),
      ~ factor(., levels = c("No", "Yes", "Not Asked"))
    )
  )

These three Yes/No items were standardized by converting all variations of affirmative and negative responses into “Yes” or “No,” with missing or irregular entries labeled as “Not Asked.” This ensures consistent factor levels across semesters and prepares the variables for classification modeling.

Convert Other Closed-Ended Variables to Factors

To prepare the dataset for modeling, I convert the remaining structured survey items into factors. These include semester (term information), q4 (in-person, hybrid, or online format), q5 (perceived instructor support), q10 (“How much do you feel you learned?”, treated as an unordered Likert-type factor), and q11_learned (the NLP-derived learning indicator). Converting these variables ensures consistent handling of categorical predictors in downstream modeling.

closed_cat <- c("semester", "q4", "q5", "q10", "q11_learned")

df <- df |>
  mutate(
    across(
      all_of(closed_cat),
      as.factor
    )
  )

Create a modeling dataset with non-missing target

Before fitting predictive models, we create a modeling dataset that includes only rows where the target variable (college_readiness) is observed. This removes structurally missing cases from early semesters where Q17 was not asked and ensures that all modeling algorithms receive complete outcome information.

## Create modeling dataset with non-missing target
model_df <- df |>
  filter(!is.na(college_readiness))

This filtered dataset (model_df) serves as the foundation for all subsequent association tests and predictive modeling.

Association Checks

Identifying Candidate Predictors

To identify which survey questions are most strongly related to college readiness, we conduct association tests between each categorical predictor and the target variable (college_readiness). Only structured, closed-ended variables with sufficient response coverage are included. Open-ended items and variables with substantial structural missingness (e.g., Q11 in early semesters) are excluded to avoid bias.

The following predictors were selected for association testing:

  • semester
  • q2 (first-time vs returning)
  • q5 (instructor support)
  • q7 (help needed but not received)
  • q10 (how much the student learned)
  • q11_learned (NLP-derived indicator; included here for completeness but not for modeling)
  • q12 (willingness to take another College Now course)

Optional high-cardinality predictors:

  • q1 (course taken)
  • q3 (previous course taken)

Variables with very high cardinality (e.g., Q1, Q3) are not suitable for logistic regression and are evaluated separately as optional descriptive features.

Crosstabs + Chi-square Tests

Association Testing Using Chi-square and Cramér’s V

To assess relationships between categorical survey predictors and students’ self-reported college readiness, Chi-square tests of independence were applied. Because statistical significance alone does not convey the strength of an association, Cramér’s V was reported alongside each test as a standardized effect size measure (Real Statistics, n.d.; Statology, n.d.). Cramér’s V ranges from 0 (weak association) to 1 (strong association) and allows comparison across predictors with different numbers of categories.
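
For reference, Cramér’s V can be computed directly from the Pearson Chi-square statistic; the helper below is a minimal sketch that should agree with the assocstats() values reported later, up to rounding.

# Cramér's V by hand: V = sqrt( X^2 / (n * (min(rows, cols) - 1)) )
cramers_v_manual <- function(tab) {
  x2 <- suppressWarnings(stats::chisq.test(tab, correct = FALSE)$statistic)
  n  <- sum(tab)
  k  <- min(nrow(tab), ncol(tab)) - 1
  as.numeric(sqrt(x2 / (n * k)))
}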

The following code computes contingency tables, Chi-square statistics, p-values, and Cramér’s V effect sizes for each predictor:

cat_vars <- c("semester", "q2", "q5", "q7", "q10", "q11_learned", "q12")

assoc_results <- list()

for (var in cat_vars) {
  
  cat("\n\n============================================\n")
  cat("Association stats for:", var, "\n")
  cat("============================================\n")
  
  # Crosstab using modeling dataset
  tab <- table(model_df[[var]], model_df$college_readiness)
  print(tab)
  
  # Skip if table is degenerate (only 1 row or 1 column)
  if (nrow(tab) < 2 || ncol(tab) < 2) {
    cat("Not enough variation to compute Chi-square / Cramér's V.\n")
    next
  }
  
  # Assocstats: Chi-square + Cramér's V
  as_out <- assocstats(tab)
  print(as_out)
  
  # Extract *Pearson* Chi-square row
  ct <- as_out$chisq_tests
  
  chisq_val <- ct["Pearson", "X^2"]
  df_val    <- ct["Pearson", "df"]
  p_val     <- ct["Pearson", "P(> X^2)"]
  cv        <- as_out$cramer
  
  assoc_results[[var]] <- tibble(
    variable  = var,
    chisq     = chisq_val,
    df        = df_val,
    p_value   = p_val,
    cramers_v = cv
  )
}
#> 
#> 
#> ============================================
#> Association stats for: semester 
#> ============================================
#>              
#>               Not Ready Ready
#>   Fall 2020          72   785
#>   Fall 2021          83   853
#>   Fall 2022           7    91
#>   Fall 2023           0    10
#>   Fall 2024          48   601
#>   Spring 2020        61   636
#>   Spring 2021        71   751
#>   Spring 2022        30   430
#>   Spring 2024        33   411
#>   Spring 2025        37   614
#>                      X^2 df P(> X^2)
#> Likelihood Ratio 10.9097  9  0.28195
#> Pearson           9.7344  9  0.37241
#> 
#> Phi-Coefficient   : NA 
#> Contingency Coeff.: 0.042 
#> Cramer's V        : 0.042 
#> 
#> 
#> ============================================
#> Association stats for: q2 
#> ============================================
#>            
#>             Not Ready Ready
#>   No              155  2159
#>   Yes             287  3021
#>   Not Asked         0     2
#>                     X^2 df P(> X^2)
#> Likelihood Ratio 7.7986  2 0.020256
#> Pearson          7.5239  2 0.023239
#> 
#> Phi-Coefficient   : NA 
#> Contingency Coeff.: 0.037 
#> Cramer's V        : 0.037 
#> 
#> 
#> ============================================
#> Association stats for: q5 
#> ============================================
#>      
#>       Not Ready Ready
#>   No         41    66
#>   Yes       268  3695
#>                     X^2 df P(> X^2)
#> Likelihood Ratio  83.41  1        0
#> Pearson          147.87  1        0
#> 
#> Phi-Coefficient   : 0.191 
#> Contingency Coeff.: 0.187 
#> Cramer's V        : 0.191 
#> 
#> 
#> ============================================
#> Association stats for: q7 
#> ============================================
#>            
#>             Not Ready Ready
#>   No              259  3612
#>   Yes              47   138
#>   Not Asked       136  1432
#>                     X^2 df   P(> X^2)
#> Likelihood Ratio 61.032  2 5.5844e-14
#> Pearson          87.385  2 0.0000e+00
#> 
#> Phi-Coefficient   : NA 
#> Contingency Coeff.: 0.124 
#> Cramer's V        : 0.125 
#> 
#> 
#> ============================================
#> Association stats for: q10 
#> ============================================
#>                                  
#>                                   Not Ready Ready
#>   Far too little                         18    29
#>   Far too much                           13   351
#>   Neither too much nor too little       203  2562
#>   Slightly too little                    50   106
#>   Slightly too much                      25   710
#>                     X^2 df P(> X^2)
#> Likelihood Ratio 147.05  4        0
#> Pearson          223.07  4        0
#> 
#> Phi-Coefficient   : NA 
#> Contingency Coeff.: 0.228 
#> Cramer's V        : 0.234 
#> 
#> 
#> ============================================
#> Association stats for: q11_learned 
#> ============================================
#>            
#>             Not Ready Ready
#>   No                1     7
#>   Mixed            28    75
#>   Yes             104  1339
#>   Not Asked       309  3761
#>                     X^2 df   P(> X^2)
#> Likelihood Ratio 35.665  3 8.8162e-08
#> Pearson          54.606  3 8.3323e-12
#> 
#> Phi-Coefficient   : NA 
#> Contingency Coeff.: 0.098 
#> Cramer's V        : 0.099 
#> 
#> 
#> ============================================
#> Association stats for: q12 
#> ============================================
#>            
#>             Not Ready Ready
#>   No              107   532
#>   Yes             201  3152
#>   Not Asked       134  1498
#>                     X^2 df   P(> X^2)
#> Likelihood Ratio 71.673  2 2.2204e-16
#> Pearson          86.048  2 0.0000e+00
#> 
#> Phi-Coefficient   : NA 
#> Contingency Coeff.: 0.123 
#> Cramer's V        : 0.124

# Tidy summary table
assoc_summary <- bind_rows(assoc_results)
assoc_summary
#> # A tibble: 7 × 5
#>   variable     chisq    df  p_value cramers_v
#>   <chr>        <dbl> <dbl>    <dbl>     <dbl>
#> 1 semester      9.73     9 3.72e- 1    0.0416
#> 2 q2            7.52     2 2.32e- 2    0.0366
#> 3 q5          148.       1 0           0.191 
#> 4 q7           87.4      2 0           0.125 
#> 5 q10         223.       4 0           0.234 
#> 6 q11_learned  54.6      3 8.33e-12    0.0985
#> 7 q12          86.0      2 0           0.124

Results indicate that several survey items are meaningfully associated with college readiness. Instructor support (q5), perceived amount learned (q10), unmet help needs (q7), and willingness to take another College Now course (q12) all show strong statistical significance (p < .001) with small-to-moderate effect sizes (Cramér’s V ≈ 0.12–0.23). Among these, perceived learning (q10) exhibits the strongest association with readiness, followed by instructor support (q5).

In contrast, first-time versus returning status (q2) shows a statistically significant but very small effect, and semester demonstrates no meaningful association, suggesting stability in readiness patterns across cohorts. Although the NLP-derived variable q11_learned is conceptually relevant, its substantial structural missingness makes it unsuitable for inclusion in the primary predictive models.

These findings guided feature selection for modeling, resulting in the inclusion of q5, q7, q10, q12, and q2 as the core structured predictors.

Direction Plots

To better understand how each categorical predictor relates to the target outcome, we visualize the proportion of students classified as Ready vs. Not Ready across response categories. These direction plots help reveal whether certain survey responses consistently correspond to higher or lower readiness levels, which supports both feature selection and later model interpretation.

Because these plots aim to reflect students’ direct opinions, we exclude Not Asked responses to avoid distorting patterns created by structural missingness in early semesters.

# Function to plot direction for each categorical predictor
# Excludes "Not Asked" to focus on meaningful student responses.

plot_direction <- function(var, title_text) {
  df %>%
    filter(
      !is.na(college_readiness),
      !is.na(.data[[var]]),
      .data[[var]] != "Not Asked"
    ) %>%
    mutate(college_readiness = factor(college_readiness)) %>%   # ensure the outcome is a factor for the fill aesthetic
    count(.data[[var]], college_readiness) %>%
    group_by(.data[[var]]) %>%
    mutate(percent = n / sum(n)) %>%
    ggplot(aes(x = .data[[var]], y = percent, fill = college_readiness)) +
    geom_col(position = "fill") +
    scale_y_continuous(labels = percent_format()) +
    labs(
      title = title_text,
      x = NULL,
      y = "Percent Readiness"
    ) +
    scale_fill_brewer(palette = "Set2") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

Instructor Support (Q5)

plot_direction("q5", "College Readiness by Instructor Support") +
  labs(subtitle = "Instructor support is strongly associated with college readiness.")

Help Needed but Not Received (Q7)

plot_direction("q7", "College Readiness by Help Needed But Not Received") +
  labs(subtitle = "Unmet help needs are associated with lower readiness.")

Amount Learned in the Course (Q10)

plot_direction("q10", "College Readiness by Amount Learned in the Course") +
  labs(subtitle = "Higher perceived learning corresponds to higher readiness.")

Willingness to Take Another Course (Q12)

plot_direction("q12", "College Readiness by Willingness to Take Another Course") +
  labs(subtitle = "Willingness to continue is strongly associated with readiness.")

Across all predictors, clear directional patterns emerge. Students who reported strong instructor support, high levels of learning, or a desire to take another College Now course were far more likely to feel Ready for college. Conversely, students who said they needed help but did not receive it were more likely to report being Not Ready. These consistent directional trends reinforce the predictors selected for modeling and provide early insight into how student experiences relate to perceived readiness.

Feature Selection for Structured (Closed-Ended) Questions

Based on the association tests, I identified a subset of survey questions that are both conceptually meaningful and statistically useful for predicting college readiness. The strongest predictors were those with significant Chi-square associations and non-trivial Cramér’s V effect sizes. These items also appeared consistently across semesters and had minimal structural missingness, making them appropriate for modeling.

The selected predictors include:

  • q5: Instructor support

  • q7: Help needed but not received

  • q10: How much the student learned

  • q12: Willingness to take another College Now course

  • q2: First-time vs. returning student (included despite weak effect size because it is conceptually relevant and fully observed)

Predictors with very high cardinality (e.g., q1, q3) or heavy structural missingness across semesters (e.g., q11_learned) were excluded to ensure model stability. Open-ended text responses were also excluded from the main predictive models to avoid data leakage and preserve interpretability.

The final modeling dataset includes only the selected predictors and the binary target variable.

model_vars <- c(
  "college_readiness",   # target
  "q5",
  "q7",
  "q10",
  "q12",
  "q2"       
)

final_model_df <- model_df |> 
  select(all_of(model_vars)) |> 
  filter(!is.na(college_readiness))

These selected variables form the core feature set for classification modeling. They represent student perceptions of support, learning, unmet needs, satisfaction, and prior exposure to the program—factors shown in the association analysis to be meaningfully related to students’ reported college readiness.

Helper: Compute Binary Classification Metrics

This helper function computes standard binary classification metrics (accuracy, precision, recall, specificity, F1, and AUC) for models that predict college_readiness (positive class = “Ready” by default). It takes the true labels, predicted classes, and (optionally) predicted probabilities as inputs and returns a one-row tibble.

## Helper: compute binary classification metrics
## Positive class = "Ready" by default

compute_binary_metrics <- function(actual,
                                   predicted_class,
                                   predicted_prob = NULL,
                                   positive = "Ready",
                                   model_name = NA_character_) {
  # 1. Coerce to factor and check that the positive class exists
  actual_f <- as.factor(actual)
  
  if (!positive %in% levels(actual_f)) {
    stop("Positive class not found in 'actual' labels.")
  }
  
  # Identify negative class as "the other level"
  negative <- setdiff(levels(actual_f), positive)[1]
  
  # Relevel factors so that [negative, positive] is the explicit order
  actual_f <- factor(actual_f, levels = c(negative, positive))
  pred_f   <- factor(predicted_class, levels = c(negative, positive))
  
  # 2. Confusion matrix
  tab <- table(Actual = actual_f, Predicted = pred_f)
  
  # Guard: ensure both levels are present
  if (!all(c(negative, positive) %in% rownames(tab)) ||
      !all(c(negative, positive) %in% colnames(tab))) {
    stop("Confusion matrix does not contain both positive and negative levels.")
  }
  
  TP <- tab[positive,  positive]
  TN <- tab[negative,  negative]
  FP <- tab[negative,  positive]
  FN <- tab[positive,  negative]
  
  # 3. Metrics
  accuracy    <- (TP + TN) / (TP + TN + FP + FN)
  precision   <- if ((TP + FP) > 0) TP / (TP + FP) else NA_real_
  recall      <- if ((TP + FN) > 0) TP / (TP + FN) else NA_real_
  specificity <- if ((TN + FP) > 0) TN / (TN + FP) else NA_real_
  f1          <- if (!is.na(precision) && !is.na(recall) &&
                     (precision + recall) > 0) {
    2 * (precision * recall) / (precision + recall)
  } else {
    NA_real_
  }
  
  # 4. AUC (optional; requires predicted probabilities)
  auc_val <- NA_real_
  if (!is.null(predicted_prob)) {
    roc_obj <- pROC::roc(
      response  = actual_f,
      predictor = predicted_prob,
      levels    = c(negative, positive),
      direction = "<"
    )
    auc_val <- as.numeric(pROC::auc(roc_obj))
  }
  
  # 5. Return as one-row tibble
  tibble::tibble(
    model       = model_name,
    positive    = positive,
    TP          = as.integer(TP),
    FP          = as.integer(FP),
    TN          = as.integer(TN),
    FN          = as.integer(FN),
    accuracy    = accuracy,
    precision   = precision,
    recall      = recall,
    specificity = specificity,
    f1          = f1,
    auc         = auc_val
  )
}

Train/Test Split

set.seed(2025)  

train_index <- sample(seq_len(nrow(final_model_df)), size = 0.8 * nrow(final_model_df))

train_df <- final_model_df[train_index, ]
test_df  <- final_model_df[-train_index, ]

# Check the split balance
prop.table(table(train_df$college_readiness))
#> 
#>  Not Ready      Ready 
#> 0.07735052 0.92264948
prop.table(table(test_df$college_readiness))
#> 
#>  Not Ready      Ready 
#> 0.08355556 0.91644444

The train/test split shows an imbalanced outcome distribution, with approximately 7–9% “Not Ready” and 91–93% “Ready.” This imbalance is expected because the original survey responses were similarly skewed, with the large majority of students reporting that their course improved their college readiness. The split correctly preserves this natural distribution, ensuring that the model is trained and evaluated under realistic conditions.
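
Because the split uses simple random sampling, the class proportions are only approximately preserved. If an exactly stratified split were preferred, caret::createDataPartition (the same helper used later in the NLP pipeline) could be used instead; the sketch below is illustrative only and is not the split used for the models that follow.

# Sketch: stratified alternative to the simple random split (illustrative only)
set.seed(2025)
idx_strat <- caret::createDataPartition(
  final_model_df$college_readiness,
  p = 0.8, list = FALSE
)
train_strat <- final_model_df[idx_strat, ]
test_strat  <- final_model_df[-idx_strat, ]

# Class proportions should now be nearly identical in both subsets
prop.table(table(train_strat$college_readiness))
prop.table(table(test_strat$college_readiness))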

Experimentation & Model Training (Structured Survey Models)

Logistic Regression

Logistic Regression (Baseline)

# Fit LR model
model_logit <- glm(
  college_readiness ~ q5 + q7 + q10 + q12 + q2,
  data = train_df,
  family = binomial
)
summary(model_logit)
#> 
#> Call:
#> glm(formula = college_readiness ~ q5 + q7 + q10 + q12 + q2, family = binomial, 
#>     data = train_df)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.8201   0.2564   0.3101   0.3549   1.8469  
#> 
#> Coefficients:
#>                                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)                         -0.3324     0.4773  -0.696  0.48626    
#> q5Yes                                1.1035     0.2808   3.930 8.50e-05 ***
#> q7Yes                               -0.7605     0.2394  -3.177  0.00149 ** 
#> q7Not Asked                         -1.1854     0.7231  -1.639  0.10117    
#> q10Far too much                      2.1063     0.5351   3.936 8.29e-05 ***
#> q10Neither too much nor too little   1.1596     0.4001   2.898  0.00375 ** 
#> q10Slightly too little              -0.1346     0.4367  -0.308  0.75785    
#> q10Slightly too much                 1.8247     0.4500   4.055 5.01e-05 ***
#> q12Yes                               1.0801     0.1515   7.129 1.01e-12 ***
#> q12Not Asked                         2.4132     1.0240   2.357  0.01844 *  
#> q2Yes                               -0.2775     0.1454  -1.909  0.05632 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1742.9  on 3250  degrees of freedom
#> Residual deviance: 1546.2  on 3240  degrees of freedom
#>   (1248 observations deleted due to missingness)
#> AIC: 1568.2
#> 
#> Number of Fisher Scoring iterations: 6

# VIF
vif(model_logit)
#>         GVIF Df GVIF^(1/(2*Df))
#> q5  1.165778  1        1.079712
#> q7  1.202939  2        1.047275
#> q10 1.148451  4        1.017452
#> q12 1.003774  2        1.000942
#> q2  1.001697  1        1.000848

# Predicted prob
test_df$prob_baseline <- predict(
  model_logit,
  newdata = test_df,
  type = "response"
)

# Convert to Predicted class
test_df$pred_baseline <- ifelse(test_df$prob_baseline > 0.5, "Ready", "Not Ready")

# Confusion matrix
print(table(Actual = test_df$college_readiness, Predicted = test_df$pred_baseline))
#>            Predicted
#> Actual      Not Ready Ready
#>   Not Ready         5    58
#>   Ready             5   748

# Metrics function 
metrics_lr_baseline <- compute_binary_metrics(
  actual = test_df$college_readiness,
  predicted_class = test_df$pred_baseline,
  predicted_prob = test_df$prob_baseline,
  model_name = "LR_baseline"
)
metrics_lr_baseline
#> # A tibble: 1 × 12
#>   model   positive    TP    FP    TN    FN accuracy precision recall specificity
#>   <chr>   <chr>    <int> <int> <int> <int>    <dbl>     <dbl>  <dbl>       <dbl>
#> 1 LR_bas… Ready      748    58     5     5    0.923     0.928  0.993      0.0794
#> # ℹ 2 more variables: f1 <dbl>, auc <dbl>

The baseline logistic regression model shows that several predictors are significantly associated with college readiness. Instructor support (Q5), perceived learning (Q10), unmet help needs (Q7), and willingness to take another course (Q12) all have strong and statistically significant effects in the expected directions. Multicollinearity is not a concern, as all VIF values are close to 1.
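
To make these estimates easier to interpret, the coefficients can be expressed as odds ratios. The sketch below is a supplementary view (not part of the original output) that exponentiates the fitted coefficients of model_logit together with Wald confidence intervals.

# Sketch: odds ratios with 95% Wald confidence intervals for the baseline LR
exp(cbind(
  OddsRatio = coef(model_logit),
  confint.default(model_logit)   # Wald intervals computed on the log-odds scale
))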

On the test set, the model achieves 92.3% accuracy, but this mainly reflects the underlying class imbalance. The model correctly identifies almost all “Ready” students (recall = 0.993), but struggles to detect “Not Ready” students (specificity = 0.079), which is a common issue when the minority class is small. The AUC of 0.76 indicates moderate overall discrimination ability.

While the baseline logistic regression performs well for the majority class, its low specificity highlights the need for improvement. In the next step, I fit a tuned logistic regression model to evaluate whether adjustments to model complexity and class handling can enhance predictive performance.

Logistic Regression (Tuned, Weighted)

# Observation weights: upweight the minority class ("Not Ready") by a factor of 10

w <- ifelse(train_df$college_readiness == "Not Ready", 10, 1)

# Fit LR tuned model
model_weighted <- glm(
  college_readiness ~ q5 + q7 + q10 + q12 + q2,
  data = train_df,
  family = binomial,
  weights = w
)
summary(model_weighted)
#> 
#> Call:
#> glm(formula = college_readiness ~ q5 + q7 + q10 + q12 + q2, family = binomial, 
#>     data = train_df, weights = w)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -6.1043   0.7789   0.8684   0.9848   2.9057  
#> 
#> Coefficients:
#>                                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)                        -2.24233    0.27212  -8.240  < 2e-16 ***
#> q5Yes                               1.01333    0.17014   5.956 2.59e-09 ***
#> q7Yes                              -1.04008    0.12464  -8.345  < 2e-16 ***
#> q7Not Asked                        -1.65555    0.39016  -4.243 2.20e-05 ***
#> q10Far too much                     1.83872    0.27653   6.649 2.95e-11 ***
#> q10Neither too much nor too little  0.92531    0.24431   3.787 0.000152 ***
#> q10Slightly too little             -0.61483    0.27033  -2.274 0.022944 *  
#> q10Slightly too much                1.49122    0.25569   5.832 5.47e-09 ***
#> q12Yes                              1.08474    0.07223  15.018  < 2e-16 ***
#> q12Not Asked                        2.52542    0.36055   7.004 2.48e-12 ***
#> q2Yes                              -0.30959    0.06193  -4.999 5.76e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 7521.7  on 3250  degrees of freedom
#> Residual deviance: 6497.3  on 3240  degrees of freedom
#>   (1248 observations deleted due to missingness)
#> AIC: 6519.3
#> 
#> Number of Fisher Scoring iterations: 5

# VIF
vif(model_weighted)
#>         GVIF Df GVIF^(1/(2*Df))
#> q5  1.127683  1        1.061924
#> q7  1.106486  2        1.025620
#> q10 1.126083  4        1.014954
#> q12 1.020747  2        1.005147
#> q2  1.005292  1        1.002643

#  Predicted probabilities
test_df$prob_weighted <- predict(
  model_weighted,
  newdata = test_df,
  type = "response"
)

# Convert to predicted classes 
test_df$pred_weighted <- ifelse(test_df$prob_weighted > 0.4, "Ready", "Not Ready")

# Confusion matrix
print(table(Actual = test_df$college_readiness, Predicted = test_df$pred_weighted))
#>            Predicted
#> Actual      Not Ready Ready
#>   Not Ready        29    34
#>   Ready            91   662

# Metrics function
metrics_lr_tuned <- compute_binary_metrics(
  actual = test_df$college_readiness,
  predicted_class = test_df$pred_weighted,
  predicted_prob = test_df$prob_weighted,
  model_name = "LR_tuned"
)
metrics_lr_tuned
#> # A tibble: 1 × 12
#>   model   positive    TP    FP    TN    FN accuracy precision recall specificity
#>   <chr>   <chr>    <int> <int> <int> <int>    <dbl>     <dbl>  <dbl>       <dbl>
#> 1 LR_tun… Ready      662    34    29    91    0.847     0.951  0.879       0.460
#> # ℹ 2 more variables: f1 <dbl>, auc <dbl>

The tuned logistic regression model applies class weights and a lowered decision threshold (0.4) to better identify “Not Ready” students. All key predictors remain statistically significant and show the same pattern as before with the baseline model: strong instructor support and high perceived learning increase readiness, while unmet help needs decrease it. Multicollinearity remains negligible, with VIF values close to 1.

Model performance improves meaningfully for the minority class. The model now correctly identifies 29 “Not Ready” students, compared to only 5 in the baseline model, and specificity increases from 0.08 to 0.46, a substantial gain. This improvement comes with some reduction in overall accuracy and in recall for the “Ready” class, an expected trade-off when optimizing for minority-class detection. Even so, recall for the majority class remains high at 0.88, and the AUC holds at 0.76, essentially unchanged from the baseline.

Overall, the weighted model offers a better balance for this project’s goal: identifying at-risk, not-ready students early, even if that means accepting more false positives. This tuned model provides a stronger foundation for downstream comparison with Random Forest and XGBoost.
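
As a quick check on the chosen cutoff, the recall/specificity trade-off can be examined across nearby thresholds. The sketch below reuses the predicted probabilities already stored in test_df and is illustrative only.

# Sketch: recall and specificity at several decision thresholds (tuned LR)
sapply(c(0.30, 0.40, 0.50), function(th) {
  pred <- factor(
    ifelse(test_df$prob_weighted > th, "Ready", "Not Ready"),
    levels = c("Not Ready", "Ready")
  )
  tab <- table(Actual = test_df$college_readiness, Predicted = pred)
  c(
    threshold   = th,
    recall      = tab["Ready", "Ready"] / sum(tab["Ready", ]),
    specificity = tab["Not Ready", "Not Ready"] / sum(tab["Not Ready", ])
  )
})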

Next, we fit a Random Forest model to assess whether a tree-based approach improves classification, particularly for the “Not Ready” group.

Random Forest

Random Forest (Baseline)

# Check missingness in training data
colSums(is.na(train_df))  # q5 and q10 have NAs due to being added later
#> college_readiness                q5                q7               q10 
#>                 0              1247                 0              1248 
#>               q12                q2 
#>                 0                 0

# Replace NA with "Not Asked" for q5 and q10 in both train and test
train_clean <- train_df |> 
  mutate(
    q5  = fct_na_value_to_level(q5,  "Not Asked"),
    q10 = fct_na_value_to_level(q10, "Not Asked")
  )

test_clean <- test_df |> 
  mutate(
    q5  = fct_na_value_to_level(q5,  "Not Asked"),
    q10 = fct_na_value_to_level(q10, "Not Asked")
  )

# Fit Baseline Random Forest
set.seed(2026)
model_rf <- randomForest(
  college_readiness ~ q5 + q7 + q10 + q12 + q2,
  data = train_clean
)
summary(model_rf)
#>                 Length Class  Mode     
#> call               3   -none- call     
#> type               1   -none- character
#> predicted       4499   factor numeric  
#> err.rate        1500   -none- numeric  
#> confusion          6   -none- numeric  
#> votes           8998   matrix numeric  
#> oob.times       4499   -none- numeric  
#> classes            2   -none- character
#> importance         5   -none- numeric  
#> importanceSD       0   -none- NULL     
#> localImportance    0   -none- NULL     
#> proximity          0   -none- NULL     
#> ntree              1   -none- numeric  
#> mtry               1   -none- numeric  
#> forest            14   -none- list     
#> y               4499   factor numeric  
#> test               0   -none- NULL     
#> inbag              0   -none- NULL     
#> terms              3   terms  call
# Predictions on test_clean
#    - class predictions
test_clean$pred_rf <- predict(
  model_rf,
  newdata = test_clean,
  type = "class"
)

#    - predicted probabilities for Ready (needed for AUC)
rf_prob <- predict(
  model_rf,
  newdata = test_clean,
  type = "prob"
)
test_clean$prob_rf <- rf_prob[, "Ready"]

# Confusion matrix
print(table(
  Actual   = test_clean$college_readiness,
  Predicted = test_clean$pred_rf
))
#>            Predicted
#> Actual      Not Ready Ready
#>   Not Ready         4    90
#>   Ready             2  1029

# Metrics via helper function
metrics_rf_baseline <- compute_binary_metrics(
  actual          = test_clean$college_readiness,
  predicted_class = test_clean$pred_rf,
  predicted_prob  = test_clean$prob_rf,
  model_name      = "RF_baseline"
)

metrics_rf_baseline
#> # A tibble: 1 × 12
#>   model   positive    TP    FP    TN    FN accuracy precision recall specificity
#>   <chr>   <chr>    <int> <int> <int> <int>    <dbl>     <dbl>  <dbl>       <dbl>
#> 1 RF_bas… Ready     1029    90     4     2    0.918     0.920  0.998      0.0426
#> # ℹ 2 more variables: f1 <dbl>, auc <dbl>

Before fitting the model, we recoded missing values in Q5 and Q10 as “Not Asked” in both the training and test sets so that the Random Forest could use all records without dropping rows. The baseline Random Forest model uses the same predictors as logistic regression (Q5, Q7, Q10, Q12, Q2).

On the test set, the model achieves 91.8% accuracy and very high recall for “Ready” (0.998), but, like the baseline logistic regression, it performs poorly for the “Not Ready” group (specificity = 0.043, only 4 of 94 correctly identified). The AUC of 0.60 is lower than the logistic regression models, suggesting that, under default settings, the Random Forest does not improve overall discrimination and still struggles to detect at-risk students in this imbalanced setting.

To improve its ability to detect “Not Ready” students, we next tune the Random Forest by adjusting key hyperparameters and applying class weights.

Random Forest (Tuned)

# Class distribution in training data
class_counts <- table(train_clean$college_readiness)
class_counts
#> 
#> Not Ready     Ready 
#>       348      4151

# Inverse-frequency class weights
class_weights <- as.numeric(sum(class_counts) / (length(class_counts) * class_counts))
names(class_weights) <- names(class_counts)
class_weights
#> Not Ready     Ready 
#> 6.4640805 0.5419176

# Tune mtry using OOB error
set.seed(2027)
mtry_grid   <- 2:4
oob_results <- data.frame(mtry = integer(), OOB_error = numeric())

for (m in mtry_grid) {
  rf_tmp <- randomForest(
    college_readiness ~ q5 + q7 + q10 + q12 + q2,
    data       = train_clean,
    ntree      = 200,
    mtry       = m,
    classwt    = class_weights,   # <- weights only
    importance = TRUE
  )
  
  # Get final OOB error
  oob_err <- tail(rf_tmp$err.rate[, "OOB"], 1)
  
  oob_results <- rbind(
    oob_results,
    data.frame(mtry = m, OOB_error = oob_err)
  )
}

oob_results   # best mtry ≈ 4
#>   mtry OOB_error
#> 1    2 0.9019782
#> 2    3 0.8975328
#> 3    4 0.8959769

# Fit tuned RF model with best mtry
set.seed(2028)
model_rf_tuned <- randomForest(
  college_readiness ~ q5 + q7 + q10 + q12 + q2,
  data       = train_clean,
  ntree      = 50,
  mtry       = 4,
  classwt    = class_weights,
  importance = TRUE
)

summary(model_rf_tuned)
#>                 Length Class  Mode     
#> call               7   -none- call     
#> type               1   -none- character
#> predicted       4499   factor numeric  
#> err.rate         150   -none- numeric  
#> confusion          6   -none- numeric  
#> votes           8998   matrix numeric  
#> oob.times       4499   -none- numeric  
#> classes            2   -none- character
#> importance        20   -none- numeric  
#> importanceSD      15   -none- numeric  
#> localImportance    0   -none- NULL     
#> proximity          0   -none- NULL     
#> ntree              1   -none- numeric  
#> mtry               1   -none- numeric  
#> forest            14   -none- list     
#> y               4499   factor numeric  
#> test               0   -none- NULL     
#> inbag              0   -none- NULL     
#> terms              3   terms  call

# Predictions on test_clean
# Class predictions
test_clean$pred_rf_tuned <- predict(
  model_rf_tuned,
  newdata = test_clean,
  type = "class"
)

# Probabilities for Ready (needed for AUC)
rf_prob_tuned <- predict(
  model_rf_tuned,
  newdata = test_clean,
  type = "prob"
)
test_clean$prob_rf_tuned <- rf_prob_tuned[, "Ready"]

# Confusion matrix
print(table(
  Actual   = test_clean$college_readiness,
  Predicted = test_clean$pred_rf_tuned
))
#>            Predicted
#> Actual      Not Ready Ready
#>   Not Ready        89     5
#>   Ready           998    33

# Metrics via helper function
metrics_rf_tuned <- compute_binary_metrics(
  actual          = test_clean$college_readiness,
  predicted_class = test_clean$pred_rf_tuned,
  predicted_prob  = test_clean$prob_rf_tuned,
  model_name      = "RF_tuned"
)

metrics_rf_tuned
#> # A tibble: 1 × 12
#>   model   positive    TP    FP    TN    FN accuracy precision recall specificity
#>   <chr>   <chr>    <int> <int> <int> <int>    <dbl>     <dbl>  <dbl>       <dbl>
#> 1 RF_tun… Ready       33     5    89   998    0.108     0.868 0.0320       0.947
#> # ℹ 2 more variables: f1 <dbl>, auc <dbl>

The tuned Random Forest applies class weights and optimized mtry to address the strong class imbalance in the dataset. The OOB error was lowest when mtry = 4, and this hyperparameter was used to fit the final tuned model. Class weighting substantially changed model behavior: the model now predicts significantly more “Not Ready” cases, increasing sensitivity to the minority class.

On the test set, the model correctly identifies 89 “Not Ready” students, a large improvement over the baseline Random Forest (which identified only 4). However, this comes at the cost of misclassifying many “Ready” students as “Not Ready,” leading to very low overall accuracy (10.8%) and very low recall for the “Ready” class (3.2%). The high specificity (0.95) means that nearly all truly “Not Ready” students are flagged, but the model over-predicts this class so heavily that only 89 of its 1,087 “Not Ready” predictions are correct.

This tuned Random Forest demonstrates the trade-off created by strong class weighting: enhanced minority-class detection but reduced stability overall. In the next step, we evaluate ‘XGBoost’ to determine whether a boosted tree method can better balance sensitivity and overall predictive performance.
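
Because the tuned forest was fitted with importance = TRUE, its variable importance can also be inspected. The sketch below is a supplementary diagnostic rather than part of the model comparison.

# Sketch: which predictors the tuned Random Forest relies on most
importance(model_rf_tuned)   # class-specific and overall measures (MeanDecreaseAccuracy, MeanDecreaseGini)
varImpPlot(model_rf_tuned, main = "Tuned Random Forest: Variable Importance")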

XGBoost

XGBoost (Baseline)

# Prepare TRAIN matrix
train_x <- train_clean %>%
  select(q5, q7, q10, q12, q2) %>% 
  mutate(across(everything(), as.factor))

train_matrix <- model.matrix(~ . - 1, data = train_x)
train_y <- ifelse(train_clean$college_readiness == "Ready", 1, 0)

# Prepare TEST matrix
test_x <- test_clean %>%
  select(q5, q7, q10, q12, q2) %>%
  mutate(across(everything(), as.factor))

test_matrix <- model.matrix(~ . - 1, data = test_x)
test_y <- ifelse(test_clean$college_readiness == "Ready", 1, 0)

# Convert to DMatrix objects
dtrain <- xgb.DMatrix(data = train_matrix, label = train_y)
dtest  <- xgb.DMatrix(data = test_matrix, label = test_y)

# Fit baseline XGBoost model
set.seed(2029)
params <- list(
  objective   = "binary:logistic",
  eval_metric = "logloss",
  max_depth   = 3,
  eta         = 0.3
)

model_xgb_base <- xgb.train(
  params  = params,
  data    = dtrain,
  nrounds = 50
)

summary(model_xgb_base)
#>               Length Class              Mode       
#> handle            1  xgb.Booster.handle externalptr
#> raw           59730  -none-             raw        
#> niter             1  -none-             numeric    
#> call              4  -none-             call       
#> params            5  -none-             list       
#> callbacks         1  -none-             list       
#> feature_names    14  -none-             character  
#> nfeatures         1  -none-             numeric

# Predictions
pred_prob_base  <- predict(model_xgb_base, dtest)
pred_class_base <- ifelse(pred_prob_base > 0.5, 1, 0)

# Confusion matrix
print(table(
  Actual   = test_y,
  Predicted = pred_class_base
))
#>       Predicted
#> Actual    0    1
#>      0    4   90
#>      1    2 1029

# Metrics using helper function
metrics_xgb_baseline <- compute_binary_metrics(
  actual          = factor(ifelse(test_y == 1, "Ready", "Not Ready"),
                           levels = c("Not Ready", "Ready")),
  predicted_class = factor(ifelse(pred_class_base == 1, "Ready", "Not Ready"),
                           levels = c("Not Ready", "Ready")),
  predicted_prob  = pred_prob_base,
  model_name      = "XGB_baseline"
)

metrics_xgb_baseline
#> # A tibble: 1 × 12
#>   model   positive    TP    FP    TN    FN accuracy precision recall specificity
#>   <chr>   <chr>    <int> <int> <int> <int>    <dbl>     <dbl>  <dbl>       <dbl>
#> 1 XGB_ba… Ready     1029    90     4     2    0.918     0.920  0.998      0.0426
#> # ℹ 2 more variables: f1 <dbl>, auc <dbl>

The baseline XGBoost model shows performance patterns very similar to the Random Forest baseline. Accuracy is high (91.8%), but this is largely driven by the dominant “Ready” class. XGBoost identifies nearly all Ready students correctly (recall = 0.998), yet it struggles to detect “Not Ready” students (specificity = 0.043). This difficulty is expected because the minority class represents fewer than 10% of responses and no weighting or tuning has been applied yet. The AUC of 0.72 indicates moderate discrimination ability, well above the Random Forest baseline (0.60) but somewhat below the logistic regression models, and it is no better at recovering the minority class.

These results suggest that, like Random Forest, XGBoost requires class weighting and hyperparameter tuning before it can effectively identify students who are Not Ready.

XGBoost (Tuned)

Next, we fit a tuned XGBoost model that still uses train_clean / test_clean and keeps 1 = Ready as the positive class for metrics, but upweights the minority class (Not Ready = 0) using per-row weights.

# Compute class weights for imbalance
n_neg <- sum(train_y == 0)   # Not Ready
n_pos <- sum(train_y == 1)   # Ready

w_not_ready <- n_pos / n_neg
w_ready     <- 1

weights_train <- ifelse(train_y == 0, w_not_ready, w_ready)

# Weighted training DMatrix
dtrain_w <- xgb.DMatrix(
  data   = train_matrix,
  label  = train_y,
  weight = weights_train
)

# Fit tuned XGBoost model
set.seed(2030)
params_tuned <- list(
  objective        = "binary:logistic",
  eval_metric      = "logloss",
  max_depth        = 4,
  eta              = 0.2,
  subsample        = 0.8,
  colsample_bytree = 0.8
)

model_xgb_tuned <- xgb.train(
  params  = params_tuned,
  data    = dtrain_w,
  nrounds = 100
)
summary(model_xgb_tuned)
#>               Length Class              Mode       
#> handle             1 xgb.Booster.handle externalptr
#> raw           146293 -none-             raw        
#> niter              1 -none-             numeric    
#> call               4 -none-             call       
#> params             7 -none-             list       
#> callbacks          1 -none-             list       
#> feature_names     14 -none-             character  
#> nfeatures          1 -none-             numeric

# Predicted probabilities + classes (threshold = 0.5)
pred_prob_tuned  <- predict(model_xgb_tuned, dtest)
pred_class_tuned <- ifelse(pred_prob_tuned > 0.5, 1, 0)

# Confusion matrix
print(table(
  Actual   = test_y,
  Predicted = pred_class_tuned
))
#>       Predicted
#> Actual   0   1
#>      0  49  45
#>      1 273 758

# Metrics at threshold = 0.5
metrics_xgb_tuned <- compute_binary_metrics(
  actual          = factor(ifelse(test_y == 1, "Ready", "Not Ready"),
                           levels = c("Not Ready", "Ready")),
  predicted_class = factor(ifelse(pred_class_tuned == 1, "Ready", "Not Ready"),
                           levels = c("Not Ready", "Ready")),
  predicted_prob  = pred_prob_tuned,
  model_name      = "XGB_tuned"
)
metrics_xgb_tuned
#> # A tibble: 1 × 12
#>   model   positive    TP    FP    TN    FN accuracy precision recall specificity
#>   <chr>   <chr>    <int> <int> <int> <int>    <dbl>     <dbl>  <dbl>       <dbl>
#> 1 XGB_tu… Ready      758    45    49   273    0.717     0.944  0.735       0.521
#> # ℹ 2 more variables: f1 <dbl>, auc <dbl>

The tuned XGBoost model incorporates class weights and more flexible hyperparameters, enabling it to better handle the strong class imbalance in the data. Compared to the baseline version, the model substantially increases its ability to detect “Not Ready” students: specificity rises from 0.04 to 0.52, meaning XGBoost correctly identifies over half of the minority-class cases. This improvement comes with a decrease in recall for “Ready” students (0.998 → 0.735), reflecting the expected trade-off when the model becomes more sensitive to the minority class.

Precision remains high at 0.94, showing that when the model predicts a student is Ready, it is usually correct. Overall accuracy drops to 71.7%, but this is expected because accuracy is no longer dominated by the majority class. The model’s AUC (0.72) indicates moderate discrimination performance, comparable to the tuned logistic regression model.

This tuned XGBoost model provides a more balanced profile than earlier models, making it useful when correctly identifying Not Ready students is a priority, even at the cost of reduced performance on Ready students.
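
For additional context on what drives the boosted model, xgboost’s built-in importance utilities can be applied to the tuned model. This is a supplementary sketch that uses the feature names from the training design matrix.

# Sketch: feature importance for the tuned XGBoost model
imp_xgb <- xgb.importance(
  feature_names = colnames(train_matrix),
  model         = model_xgb_tuned
)
head(imp_xgb, 10)                        # gain, cover, and frequency per dummy-coded feature
xgb.plot.importance(imp_xgb, top_n = 10)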

With all structured-survey models fitted and evaluated, we now summarize their performance and compare them to identify the most balanced and reliable approach.

Results and Visualization (Structured Models)

We use three visuals—a performance table, a heatmap, and a recall-specificity scatter plot—to summarize how all structured-survey models performed. Together, these figures highlight overall accuracy patterns, reveal strengths and weaknesses across key metrics, and show which models achieve the best balance between identifying Ready and Not Ready students.

Metrics for structured-only models (Table 1)

# =============================================
# Table 1 – Structured-only Models (No NLP)
# =============================================

metrics_structured <- bind_rows(
  metrics_lr_baseline,
  metrics_lr_tuned,
  metrics_rf_baseline,
  metrics_rf_tuned,
  metrics_xgb_baseline,
  metrics_xgb_tuned
)

metrics_structured_rounded <- metrics_structured %>%
  mutate(
    across(
      c(accuracy, precision, recall, specificity, f1, auc),
      ~ round(.x, 3)
    )
  )

metrics_structured_rounded %>%
  kable(
    format  = "html",
    caption = "Model Performance Comparison (Structured-Survey Predictors Only)"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(
    which(metrics_structured_rounded$model == "LR_tuned"),
    bold       = TRUE,
    background = "#FFF2CC"
  )
Model Performance Comparison (Structured-Survey Predictors Only)

model         positive    TP   FP   TN   FN  accuracy  precision  recall  specificity     f1    auc
LR_baseline   Ready      748   58    5    5     0.923      0.928   0.993        0.079  0.960  0.756
LR_tuned      Ready      662   34   29   91     0.847      0.951   0.879        0.460  0.914  0.761
RF_baseline   Ready     1029   90    4    2     0.918      0.920   0.998        0.043  0.957  0.600
RF_tuned      Ready       33    5   89  998     0.108      0.868   0.032        0.947  0.062  0.459
XGB_baseline  Ready     1029   90    4    2     0.918      0.920   0.998        0.043  0.957  0.721
XGB_tuned     Ready      758   45   49  273     0.717      0.944   0.735        0.521  0.827  0.719

The comparison table shows that baseline models across LR, RF, and XGBoost achieve very high recall but extremely low specificity, meaning they predict nearly all students as “Ready.” The tuned Logistic Regression model (LR_tuned) stands out by improving specificity to 46% while maintaining strong recall (88%), offering the best balance between detecting “Ready” students and identifying those who may need additional support. Tuned Random Forest and tuned XGBoost shift too far in opposite directions—one overwhelmingly favors “Not Ready,” while the other sacrifices recall—making LR_tuned the most stable and reliable performer overall.

Heatmap for structured-only models

# =============================================
# Heatmap – Structured-only Models
# =============================================

metrics_heat <- metrics_structured_rounded %>%
  select(model, accuracy, recall, specificity, f1, auc) %>%   # metrics to show
  pivot_longer(
    cols      = c(accuracy, recall, specificity, f1, auc),
    names_to  = "metric",
    values_to = "value"
  ) %>%
  mutate(
    metric = factor(
      metric,
      levels = c("accuracy", "recall", "specificity", "f1", "auc"),
      labels = c("Accuracy", "Recall", "Specificity", "F1-score", "AUC")
    ),
    model = factor(
      model,
      levels = c(
        "LR_baseline", "LR_tuned",
        "RF_baseline", "RF_tuned",
        "XGB_baseline", "XGB_tuned"
      )
    )
  )

ggplot(metrics_heat, aes(x = metric, y = model, fill = value)) +
  geom_tile(color = "white") +
  geom_text(
    aes(label = scales::percent(value, accuracy = 1)),
    color = "white", size = 3
  ) +
  scale_fill_gradient(
    name   = "Score",
    low    = "#dceefd",
    high   = "#08306b",
    limits = c(0, 1),
    labels = scales::percent_format(accuracy = 1)
  ) +
  labs(
    title    = "Model Performance Heatmap",
    subtitle = "Accuracy, Recall, Specificity, F1-score, and AUC across LR, RF, and XGBoost",
    x        = NULL,
    y        = NULL,
    caption  = "High recall paired with low specificity indicates models predicting almost all students as 'Ready'."
  ) +
  theme_minimal() +
  theme(
    axis.text.x  = element_text(angle = 45, hjust = 1),
    axis.text.y  = element_text(size = 9),
    plot.title   = element_text(face = "bold"),
    plot.caption = element_text(
      size  = 8.5,
      face  = "italic",
      color = "gray50",
      hjust = 0.5
    ),
    panel.grid   = element_blank()
  )

The heatmap visually reinforces the table results: baseline models cluster in the high-recall/low-specificity region, indicating over-prediction of “Ready.” RF_tuned achieves high specificity but performs poorly on recall. XGB_tuned shows a more balanced profile, but not as strong as LR_tuned. Overall, LR_tuned demonstrates the most even distribution across all metrics—especially recall, specificity, and F1—highlighting its superior balance among structured-survey models.

Scatter plot for structured-only models

# =============================================
# Scatter – Recall vs Specificity (Structured-only)
# =============================================

scatter_df <- metrics_structured_rounded %>%
  mutate(
    model = factor(
      model,
      levels = c(
        "LR_baseline", "LR_tuned",
        "RF_baseline", "RF_tuned",
        "XGB_baseline", "XGB_tuned"
      )
    ),
    highlight = ifelse(model == "LR_tuned", "Best (LR_tuned)", "Other")
  )

ggplot(
  scatter_df,
  aes(
    x     = specificity,
    y     = recall,
    color = f1,
    size  = auc
  )
) +
  # reference lines for "good" region 
  geom_vline(xintercept = 0.30, color = "gray80", linetype = "dashed", size = 0.4) +
  geom_hline(yintercept = 0.85, color = "gray80", linetype = "dashed", size = 0.4) +
  
  geom_point(aes(alpha = highlight == "Best (LR_tuned)")) +
  
  ggrepel::geom_text_repel(
    aes(label = model),
    size         = 3,
    max.overlaps = 20,
    show.legend  = FALSE
  ) +
  
  scale_alpha_manual(values = c(`TRUE` = 1, `FALSE` = 0.5), guide = "none") +
  scale_color_gradient(
    name   = "F1-score",
    low    = "#c7e9b4",
    high   = "#00441b",
    limits = c(0, 1),
    breaks = seq(0, 1, by = 0.2),
    labels = scales::percent_format(accuracy = 1)
  ) +
  scale_size_continuous(
    name  = "AUC",
    range = c(3, 8)
  ) +
  scale_x_continuous(
    name   = "Specificity (Not Ready)",
    limits = c(0, 1),
    breaks = seq(0, 1, by = 0.1),
    labels = scales::percent_format(accuracy = 1)
  ) +
  scale_y_continuous(
    name   = "Recall (Ready)",
    limits = c(0, 1),
    breaks = seq(0, 1, by = 0.1),
    labels = scales::percent_format(accuracy = 1)
  ) +
  labs(
    title    = "Balanced Performance Trade-off Across Structured-Survey Models",
    subtitle = "LR_tuned highlighted as the most balanced model (Recall vs. Specificity)",
    caption  = "Dashed lines mark the region of high Recall (≥ 85%) and reasonable Specificity (≥ 30%)."
  ) +
  theme_minimal() +
  theme(
    plot.caption = element_text(
      size  = 9,
      color = "gray40",
      face  = "italic",
      hjust = 0   # left-align caption
    )
  )

The scatter plot illustrates the performance trade-off between identifying “Ready” students (recall) and detecting “Not Ready” students (specificity). Most models cluster in the upper-left, showing very high recall but minimal specificity. RF_tuned falls in the lower-right, indicating the opposite issue. LR_tuned is the only model that lands inside the preferred region (recall ≥ 85%, specificity ≥ 30%), confirming that it offers the most practical and balanced performance for real-world decision-making within the College Now program.


Natural Language Processing (NLP)

In the main modeling section, only structured survey questions were used to avoid data leakage, since students’ open-ended responses (Q18–Q20) may indirectly reveal their feelings about college readiness (Q17), the target variable. To maintain a fair comparison, NLP features are added in a separate modeling pipeline, where topic features and word-count features are incorporated after proper text preprocessing. The following code builds the NLP dataset, extracts LDA topics, computes word counts, and prepares train/test splits for models that include textual information.

NLP Feature Construction

# ---------------------------------------------------------
# Build a modeling dataset with NLP features
# ---------------------------------------------------------
# Steps:
# 1. Attach doc_id to every row.
# 2. Clean and combine open-ended text (q18–q20).
# 3. Tokenize → build DTM → extract LDA topic.
# 4. Compute NLP features: topic + word_count.
# 5. Merge NLP features into modeling dataset.
# 6. Train/test split for NLP models.


df_with_id <- df %>%
  mutate(doc_id = dplyr::row_number())

open_text_vars <- c("q18", "q19", "q20")

df_clean <- df_with_id %>%
  mutate(
    text_all = select(., all_of(open_text_vars)) |>
      apply(1, function(x) paste(x, collapse = " ")) |>
      str_squish(),
    text_all = str_to_lower(text_all),
    text_all = str_replace_all(text_all, "[^a-z\\s]", " "),
    text_all = str_replace_all(text_all, "\\s+", " ")
  )

data(stop_words)

tokens <- df_clean %>%
  select(doc_id, text_all) %>%
  unnest_tokens(word, text_all) %>%
  anti_join(stop_words, by = "word")

dtm <- tokens %>%
  count(doc_id, word) %>%
  cast_dtm(document = doc_id, term = word, value = n)

set.seed(2031)
lda_model <- LDA(dtm, k = 3, control = list(seed = 2028))

doc_topics <- tidy(lda_model, matrix = "gamma") %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  mutate(doc_id = as.integer(document)) %>%
  select(doc_id, topic)


NLP_features <- df_clean %>%
  select(doc_id, text_all) %>%
  left_join(doc_topics, by = "doc_id") %>%
  mutate(
    word_count = if_else(
      is.na(text_all) | text_all == "",
      0L,
      str_count(text_all, "\\w+")
    ),
    topic = factor(topic)
  ) %>%
  select(doc_id, word_count, topic)

# Build NLP modeling dataset

model_df_nlp <- df_with_id %>%
  select(
    doc_id,
    college_readiness,
    q2, q5, q7, q10, q12,
    semester
  ) %>%
  left_join(NLP_features, by = "doc_id")

# Keep only rows with non-missing outcome
model_df_complete <- model_df_nlp %>%
  filter(!is.na(college_readiness))

set.seed(2032)
idx <- createDataPartition(model_df_complete$college_readiness,
                           p = 0.7, list = FALSE)

train_df_nlp <- model_df_complete[idx, ]
test_df_nlp  <- model_df_complete[-idx, ]
## Harmonize factor levels (q2, q5, q7, q10, topic) across train & test
combined_nlp <- bind_rows(
  train_df_nlp %>% mutate(dataset = "train"),
  test_df_nlp  %>% mutate(dataset = "test")
) %>%
  mutate(
    q2  = fct_na_value_to_level(q2,  "Not Asked"),
    q5  = fct_na_value_to_level(q5,  "Not Asked"),
    q10 = fct_na_value_to_level(q10, "Not Asked"),
    topic = fct_na_value_to_level(topic, "No Topic"),
    
    q2    = as.factor(q2),
    q5    = as.factor(q5),
    q7    = as.factor(q7),
    q10   = as.factor(q10),
    topic = as.factor(topic),
    college_readiness = as.factor(college_readiness)
  )

train_df_nlp <- combined_nlp %>%
  filter(dataset == "train") %>%
  select(-dataset)

test_df_nlp <- combined_nlp %>%
  filter(dataset == "test") %>%
  select(-dataset)

With the NLP features prepared and factor levels harmonized, we now fit logistic regression, Random Forest, and XGBoost models that include both structured survey predictors and text-based features.

Logistic Regression (Tuned + NLP)

# Class weights: upweight Not Ready
w_nlp <- ifelse(train_df_nlp$college_readiness == "Not Ready", 10, 1)

# Fit weighted logistic regression with NLP features
model_weighted_nlp <- glm(
  college_readiness ~ q5 + q7 + q10 + q12 + q2 + word_count + topic,
  data   = train_df_nlp,
  family = binomial,
  weights = w_nlp
)

summary(model_weighted_nlp)  
#> 
#> Call:
#> glm(formula = college_readiness ~ q5 + q7 + q10 + q12 + q2 + 
#>     word_count + topic, family = binomial, data = train_df_nlp, 
#>     weights = w_nlp)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -7.3319   0.7839   0.9680   1.1282   2.8091  
#> 
#> Coefficients:
#>                                      Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)                        -2.767e+00  3.067e-01  -9.020  < 2e-16 ***
#> q5Yes                               9.202e-01  1.819e-01   5.060 4.19e-07 ***
#> q5Not Asked                        -2.988e+01  8.125e+02  -0.037   0.9707    
#> q7Yes                              -1.016e+00  1.330e-01  -7.640 2.17e-14 ***
#> q7Not Asked                        -6.915e-01  4.160e-01  -1.662   0.0965 .  
#> q10Far too much                     1.914e+00  2.970e-01   6.444 1.16e-10 ***
#> q10Neither too much nor too little  1.351e+00  2.708e-01   4.990 6.03e-07 ***
#> q10Slightly too little             -2.248e-01  2.953e-01  -0.761   0.4465    
#> q10Slightly too much                1.890e+00  2.821e-01   6.702 2.06e-11 ***
#> q10Not Asked                        1.715e+01  7.894e+02   0.022   0.9827    
#> q12Yes                              9.263e-01  7.674e-02  12.071  < 2e-16 ***
#> q12Not Asked                        1.600e+01  1.927e+02   0.083   0.9338    
#> q2Yes                              -3.013e-01  5.451e-02  -5.529 3.23e-08 ***
#> q2Not Asked                         1.529e+01  1.455e+03   0.011   0.9916    
#> word_count                          3.080e-02  2.752e-03  11.192  < 2e-16 ***
#> topic2                             -4.154e-01  7.338e-02  -5.661 1.50e-08 ***
#> topic3                             -4.090e-02  7.469e-02  -0.548   0.5840    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 9285.5  on 3937  degrees of freedom
#> Residual deviance: 8231.0  on 3921  degrees of freedom
#> AIC: 8265
#> 
#> Number of Fisher Scoring iterations: 14

# Predicted probabilities on test set
test_df_nlp$prob_weighted_nlp <- predict(
  model_weighted_nlp,
  newdata = test_df_nlp,
  type = "response"
)

# Predicted classes at threshold 0.4 (same as tuned LR without NLP)
test_df_nlp$pred_weighted_nlp <- ifelse(
  test_df_nlp$prob_weighted_nlp > 0.4,
  "Ready",
  "Not Ready"
)

# Confusion matrix
print(table(
  Actual   = test_df_nlp$college_readiness,
  Predicted = test_df_nlp$pred_weighted_nlp
))
#>            Predicted
#> Actual      Not Ready Ready
#>   Not Ready        41    91
#>   Ready           131  1423

# Metrics via helper function
metrics_lr_tuned_nlp <- compute_binary_metrics(
  actual          = test_df_nlp$college_readiness,
  predicted_class = test_df_nlp$pred_weighted_nlp,
  predicted_prob  = test_df_nlp$prob_weighted_nlp,
  positive        = "Ready",
  model_name      = "LR_tuned_NLP"
)

metrics_lr_tuned_nlp
#> # A tibble: 1 × 12
#>   model   positive    TP    FP    TN    FN accuracy precision recall specificity
#>   <chr>   <chr>    <int> <int> <int> <int>    <dbl>     <dbl>  <dbl>       <dbl>
#> 1 LR_tun… Ready     1423    91    41   131    0.868     0.940  0.916       0.311
#> # ℹ 2 more variables: f1 <dbl>, auc <dbl>

Adding NLP features (word_count and LDA topics) contributes useful signal to the weighted logistic regression. Students who wrote more in the open-ended questions were more likely to be classified as Ready, while membership in one topic (topic 2) slightly reduced the predicted likelihood of readiness. On its own test set, recall remains high (0.916) and specificity reaches 0.311, well above the unweighted baselines but below the 0.460 achieved by the structured-only tuned model; because the NLP pipeline uses a different train/test split, these figures are only approximately comparable. The F1-score (0.928) shows that the text features preserve strong performance on the Ready class without severely harming minority-class detection.
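
To make the topic coefficients easier to interpret, the most probable terms in each LDA topic can be examined. The sketch below uses the lda_model fitted earlier; topic labels are not assigned in this analysis.

# Sketch: top terms per LDA topic (helps interpret the topic2 / topic3 effects)
tidy(lda_model, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 8) %>%
  ungroup() %>%
  arrange(topic, desc(beta))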

Next, to evaluate whether a nonlinear tree-based approach can capture additional structure from the textual features, we fit a tuned Random Forest model that incorporates both the survey predictors and the NLP variables.

Random Forest (Tuned + NLP)

## Random Forest (Tuned) + NLP

# Compute class weights (same formula as before)
class_counts_nlp  <- table(train_df_nlp$college_readiness)
class_weights_nlp <- as.numeric(sum(class_counts_nlp) /
                                  (length(class_counts_nlp) * class_counts_nlp))
names(class_weights_nlp) <- names(class_counts_nlp)

class_weights_nlp  # optional: inspect
#> Not Ready     Ready 
#> 6.3516129 0.5427233

# Fit tuned RF with NLP features
set.seed(2033)

model_rf_tuned_nlp <- randomForest(
  college_readiness ~ q5 + q7 + q10 + q12 + q2 + word_count + topic,
  data       = train_df_nlp,
  ntree      = 50,
  mtry       = 3,
  classwt    = class_weights_nlp,
  importance = TRUE
)
summary(model_rf_tuned_nlp)
#>                 Length Class  Mode     
#> call               7   -none- call     
#> type               1   -none- character
#> predicted       3938   factor numeric  
#> err.rate         150   -none- numeric  
#> confusion          6   -none- numeric  
#> votes           7876   matrix numeric  
#> oob.times       3938   -none- numeric  
#> classes            2   -none- character
#> importance        28   -none- numeric  
#> importanceSD      21   -none- numeric  
#> localImportance    0   -none- NULL     
#> proximity          0   -none- NULL     
#> ntree              1   -none- numeric  
#> mtry               1   -none- numeric  
#> forest            14   -none- list     
#> y               3938   factor numeric  
#> test               0   -none- NULL     
#> inbag              0   -none- NULL     
#> terms              3   terms  call

# Predict on test set – classes
test_df_nlp$pred_rf_tuned_nlp <- predict(
  model_rf_tuned_nlp,
  newdata = test_df_nlp,
  type   = "class"
)

# Predict on test set – probabilities (for AUC)
rf_prob_nlp <- predict(
  model_rf_tuned_nlp,
  newdata = test_df_nlp,
  type   = "prob"
)

# we need the probability of Ready = positive class
test_df_nlp$prob_rf_tuned_nlp <- rf_prob_nlp[, "Ready"]

# Confusion matrix
print(table(
  Actual   = test_df_nlp$college_readiness,
  Predicted = test_df_nlp$pred_rf_tuned_nlp
))
#>            Predicted
#> Actual      Not Ready Ready
#>   Not Ready       103    29
#>   Ready          1011   543

# Metrics via helper function
metrics_rf_tuned_nlp <- compute_binary_metrics(
  actual          = test_df_nlp$college_readiness,
  predicted_class = test_df_nlp$pred_rf_tuned_nlp,
  predicted_prob  = test_df_nlp$prob_rf_tuned_nlp,
  positive        = "Ready",
  model_name      = "RF_tuned_NLP"
)

metrics_rf_tuned_nlp
#> # A tibble: 1 × 12
#>   model   positive    TP    FP    TN    FN accuracy precision recall specificity
#>   <chr>   <chr>    <int> <int> <int> <int>    <dbl>     <dbl>  <dbl>       <dbl>
#> 1 RF_tun… Ready      543    29   103  1011    0.383     0.949  0.349       0.780
#> # ℹ 2 more variables: f1 <dbl>, auc <dbl>

The tuned Random Forest model with NLP features shows mixed performance. Adding word_count and topic information helps the model identify more Not Ready students (Specificity = 0.780), which is a substantial improvement over RF_baseline. However, this comes at the cost of a large drop in Recall for Ready students (0.349), meaning the model misclassifies many Ready students as Not Ready. Performance overall is imbalanced (Accuracy = 0.383), suggesting that the Random Forest tends to overcorrect toward the minority class when NLP features and class weights are combined. While the model captures useful signals from the text, it does not outperform logistic regression in achieving a balanced trade-off.

Next, to assess whether a more flexible boosting-based method can better integrate the NLP features, I fit a tuned XGBoost model using the same structured and text-based predictors.

XGBoost (Tuned + NLP)

# XGBoost + NLP matrices
# TRAIN matrix with NLP (from train_df_nlp)
train_x_nlp <- train_df_nlp %>%
  select(q5, q7, q10, q12, q2, topic, word_count) %>%
  mutate(
    across(c(q5, q7, q10, q12, q2, topic), as.factor)
    # word_count stays numeric
  )

train_matrix_nlp <- model.matrix(~ . - 1, data = train_x_nlp)

train_y_nlp <- ifelse(train_df_nlp$college_readiness == "Ready", 1, 0)

# TEST matrix with NLP (from test_df_nlp)
test_x_nlp <- test_df_nlp %>%
  select(q5, q7, q10, q12, q2, topic, word_count) %>%
  mutate(
    across(c(q5, q7, q10, q12, q2, topic), as.factor)
  )

test_matrix_nlp <- model.matrix(~ . - 1, data = test_x_nlp)

test_y_nlp <- ifelse(test_df_nlp$college_readiness == "Ready", 1, 0)

# DMatrix objects
dtrain_nlp <- xgb.DMatrix(data = train_matrix_nlp, label = train_y_nlp)
dtest_nlp  <- xgb.DMatrix(data = test_matrix_nlp, label = test_y_nlp)
## Fit XGBoost tuned + NLP (topic + word_count) and get confusion matrix
# Class weights (0 = Not Ready, 1 = Ready)
n_neg_nlp <- sum(train_y_nlp == 0)  # Not Ready
n_pos_nlp <- sum(train_y_nlp == 1)  # Ready

w_not_ready_nlp <- n_pos_nlp / n_neg_nlp
w_ready_nlp     <- 1

weights_train_nlp <- ifelse(train_y_nlp == 0, w_not_ready_nlp, w_ready_nlp)

# Weighted training DMatrix
dtrain_w_nlp <- xgb.DMatrix(
  data   = train_matrix_nlp,
  label  = train_y_nlp,
  weight = weights_train_nlp
)

# Tuned XGBoost model with NLP features
params_tuned_nlp <- list(
  objective        = "binary:logistic",
  eval_metric      = "logloss",
  max_depth        = 4,
  eta              = 0.2,
  subsample        = 0.8,
  colsample_bytree = 0.8
)

set.seed(2034)
model_xgb_tuned_nlp <- xgb.train(
  params  = params_tuned_nlp,
  data    = dtrain_w_nlp,
  nrounds = 100
)

summary(model_xgb_tuned_nlp)
#>               Length Class              Mode       
#> handle             1 xgb.Booster.handle externalptr
#> raw           162681 -none-             raw        
#> niter              1 -none-             numeric    
#> call               4 -none-             call       
#> params             7 -none-             list       
#> callbacks          1 -none-             list       
#> feature_names     18 -none-             character  
#> nfeatures          1 -none-             numeric

# Predictions on test set (probabilities for Ready = 1)
pred_prob_tuned_nlp  <- predict(model_xgb_tuned_nlp, dtest_nlp)

# Predicted classes at threshold 0.5
test_df_nlp$pred_xgb_tuned_nlp <- ifelse(
  pred_prob_tuned_nlp > 0.5,
  "Ready",
  "Not Ready"
)

# Store probabilities in the test_df_nlp (for AUC)
test_df_nlp$prob_xgb_tuned_nlp <- pred_prob_tuned_nlp

# Confusion matrix (using factor labels)
print(table(
  Actual   = test_df_nlp$college_readiness,
  Predicted = test_df_nlp$pred_xgb_tuned_nlp
))
#>            Predicted
#> Actual      Not Ready Ready
#>   Not Ready        73    59
#>   Ready           455  1099

# Metrics via helper function
metrics_xgb_tuned_nlp <- compute_binary_metrics(
  actual          = test_df_nlp$college_readiness,
  predicted_class = test_df_nlp$pred_xgb_tuned_nlp,
  predicted_prob  = test_df_nlp$prob_xgb_tuned_nlp,
  positive        = "Ready",
  model_name      = "XGB_tuned_NLP"
)

metrics_xgb_tuned_nlp
#> # A tibble: 1 × 12
#>   model   positive    TP    FP    TN    FN accuracy precision recall specificity
#>   <chr>   <chr>    <int> <int> <int> <int>    <dbl>     <dbl>  <dbl>       <dbl>
#> 1 XGB_tu… Ready     1099    59    73   455    0.695     0.949  0.707       0.553
#> # ℹ 2 more variables: f1 <dbl>, auc <dbl>

The tuned XGBoost model incorporating NLP features achieves a more balanced profile than the Random Forest NLP model. It improves Recall for Ready students (0.707) compared to RF_tuned_NLP while still maintaining moderate Specificity (0.553) for identifying Not Ready students. This model benefits from both boosting and the added text features, which help it distinguish signals in the open-ended responses. However, although XGBoost shows stronger balance than RF_tuned_NLP, it still does not outperform the non-NLP LR_tuned model, which remains the most stable and interpretable model across all performance dimensions.

Taken together, the NLP models add useful signal from the text responses, but none of them exceed the balanced performance achieved by the tuned logistic regression model using structured survey data alone. While informative, the text features introduce additional variability and reduce stability without meaningfully improving the model’s ability to detect both Ready and Not Ready students.

Results and Visualization (NLP Models)

The NLP models are evaluated separately from the structured-only models to avoid data leakage and to keep the comparison fair. Here, we summarize how adding topic and word-count features from Q18–Q20 changes model performance, while still keeping the main structured-only pipeline (LR_tuned) as the primary basis for model selection.

NLP-only metrics table (Table 2)

# =============================================
# Table 2 – Models with NLP Features (Q18–Q20)
# =============================================

# Combine only NLP-augmented models
nlp_metrics <- bind_rows(
  metrics_lr_tuned_nlp,
  metrics_rf_tuned_nlp,
  metrics_xgb_tuned_nlp
)

# Round numeric metrics
nlp_metrics_rounded <- nlp_metrics %>%
  mutate(
    across(
      c(accuracy, precision, recall, specificity, f1, auc),
      ~ round(.x, 3)
    )
  )

# Build kable for NLP models
nlp_metrics_rounded %>%
  kable(
    format  = "html",
    caption = "Model Performance with NLP Features (Evaluated Separately)"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(
    which(nlp_metrics_rounded$model == "LR_tuned_NLP"),
    bold = TRUE,
    background = "#FFF2CC"
  )
Model Performance with NLP Features (Evaluated Separately)

model          positive     TP   FP   TN    FN  accuracy  precision  recall  specificity     f1    auc
LR_tuned_NLP   Ready      1423   91   41   131     0.868      0.940   0.916        0.311  0.928  0.723
RF_tuned_NLP   Ready       543   29  103  1011     0.383      0.949   0.349        0.780  0.511  0.599
XGB_tuned_NLP  Ready      1099   59   73   455     0.695      0.949   0.707        0.553  0.810  0.702

Because the NLP models are evaluated separately to avoid data leakage, their results are shown only in Table 2, and additional plots would not add meaningful insight. Among these models, LR_tuned_NLP performs best—showing high recall and the strongest overall balance—although it still does not outperform the structured-only tuned logistic regression model.


Final Model Selection

Across all models, the tuned logistic regression (LR_tuned) remains the strongest and most stable choice. It provides the best balance between identifying students who feel Ready and detecting those who may be Not Ready, while remaining interpretable and methodologically transparent. Although NLP-enhanced models add useful contextual insight, they introduce additional complexity and potential leakage risk and are therefore treated as supplementary rather than candidates for final deployment. Given the goal of identifying both Ready and Not Ready students for targeted support, the tuned logistic regression model provides the best combination of predictive balance, interpretability, and actionable insight.
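
As a practical illustration of how the selected model could be used each semester, the sketch below scores a hypothetical new batch of survey responses (new_semester_df, not part of this analysis) with the tuned weighted logistic regression and flags students below the 0.4 cutoff for follow-up. The new data would need the same q2/q5/q7/q10/q12 coding and factor levels used in training.

# Sketch: scoring a hypothetical new semester of responses with LR_tuned
new_semester_df$prob_ready <- predict(
  model_weighted,              # tuned (weighted) logistic regression
  newdata = new_semester_df,   # hypothetical data frame; same factor levels as training
  type    = "response"
)

# Same 0.4 threshold used in evaluation; students below it are flagged for outreach
new_semester_df$flag_not_ready <- new_semester_df$prob_ready <= 0.4
dplyr::filter(new_semester_df, flag_not_ready)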


Conclusion

This project analyzed multi-semester College Now survey data to identify patterns that shape students’ perceptions of college readiness. Results consistently show that students who report strong instructional support, meaningful learning gains, and positive course experiences are significantly more likely to feel prepared for college-level work.

From a practical standpoint, these findings offer clear opportunities to strengthen student support within the College Now program. Because the final model relies on structured survey questions already collected each semester, it can be used to help instructors and advisors identify students who may benefit from additional academic guidance, tutoring, or advising early in the term. Program administrators can also use these insights to monitor where instructional support or course design may be falling short and to prioritize interventions that reinforce skills tied to readiness, such as critical thinking, writing, and time management.

Overall, the analysis demonstrates that routinely collected survey data can serve not only as an evaluation tool, but also as an early-warning signal to inform targeted, data-driven support strategies for College Now students. These results can help advisors and instructors target support where it is most needed (e.g., students reporting lower instructional support or unmet help needs) and can guide decisions about advising interventions earlier in the term.


Future Work

Future work can extend this analysis in several practical directions. More advanced NLP features—such as sentiment scores or embedding-based representations—could be explored to better capture nuance in student reflections, provided they are carefully separated from outcome variables to avoid leakage. Additional strategies for handling class imbalance and threshold optimization may further improve detection of students who are Not Ready. Longitudinal analyses could also examine how readiness perceptions change across semesters or differ by course and instructional context. Finally, translating the model into a simple dashboard or reporting tool would allow instructors, advisors, and administrators to interact with results directly and use them to support timely, student-centered interventions.
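
As one concrete example of such a feature, a simple lexicon-based sentiment score could be derived from the tokens object built in the NLP pipeline. The sketch below uses the Bing lexicon bundled with tidytext and is exploratory only; it is not part of the current models.

# Sketch: lexicon-based sentiment score per response (possible future NLP feature)
sentiment_scores <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(value = if_else(sentiment == "positive", 1L, -1L)) %>%
  group_by(doc_id) %>%
  summarise(sentiment_score = sum(value), .groups = "drop")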


References

Abel, N., & Oliver, B. (2018). Innovative school counseling approaches to improving college and career readiness. Butler University Digital Commons. https://digitalcommons.butler.edu/cgi/viewcontent.cgi?article=1142&context=coe_papers

Austin, M., Backes, B., Goldhaber, D., Li, D., & Streich, F. (2024). Leveling up: An academic acceleration policy to increase advanced high school course taking [Policy report]. American Institutes for Research. https://www.air.org/resource/report/leveling-academic-acceleration-policy-increase-advanced-high-school-course-taking

Cribb, D. V. (2021). Dual enrollment programs: Advising policies and practices for high school students in post-secondary institutions [Doctoral dissertation, University of New England]. DUNE: DigitalUNE. https://dune.une.edu/cgi/viewcontent.cgi?article=1358&context=theses

City University of New York. (n.d.). College Now: A CUNY K–16 initiative. https://www.cuny.edu/academics/current-initiatives/k16/college-now/

Kurlaender, M., Reed, S., & Hurtt, A. (2019). Improving college readiness: A research summary and implications for practice. Policy Analysis for California Education (PACE). https://edpolicyinca.org/sites/default/files/R_Kurlaender_Aug19.pdf

Liu, Y., Minaya, V., & Xu, D. (2022). The impact of dual enrollment on college application choice and admission success (CCRC Working Paper No. 129). Community College Research Center, Teachers College, Columbia University. https://ccrc.tc.columbia.edu/wp-content/uploads/2022/12/CCRC_Working_Paper_No._129.pdf

Phelps, L. A., & Chan, H.-Y. (2016). Optimizing technical education pathways: Does dual-credit course completion predict students’ college and labor market success? Journal of Career and Technical Education, 31(1), 9–27. https://doi.org/10.21061/jcte.v31i1.1496

Real Statistics. (n.d.). Effect size for chi-square tests (Cramér’s V). https://real-statistics.com/chi-square-and-f-distributions/effect-size-chi-square/

Roland, A., & Herman, M. (2020). The state of college readiness and degree completion in New York City. GraduateNYC! (in collaboration with CUNY K–16 Initiatives and NYC DOE Office of Postsecondary Readiness). https://k16.cuny.edu/ccif/wp-content/uploads/sites/21/2023/07/GNYC-Public-Report-2020-FINAL.pdf

Ryu, W., Schudde, L., & Pack-Cosme, K. (2024). Dually noted: Examining the implications of dual enrollment course structure for students’ course and college enrollment outcomes. American Educational Research Journal, 61(4), 803–841. https://journals.sagepub.com/doi/10.3102/00028312241257453

Statology. (n.d.). How to interpret Cramer’s V. https://www.statology.org/interpret-cramers-v/

Taylor, J. L., Ozuna Allen, T., An, B. P., Denecker, C., Edmunds, J. A., Fink, J., Giani, M. S., Hodara, M., Hu, X., Tobolowsky, B. F., & Chen, W. (2022). Research priorities for advancing equitable dual enrollment policy and practice. Center for Higher Education Research and Policy, University of Utah. https://cherp.utah.edu/_resources/documents/publications/research_priorities_for_advancing_equitable_dual_enrollment_policy_and_practice.pdf

Westrick, P. A., Angehr, E. L., Shaw, E. J., & Marin, J. P. (2024). Recent trends in college readiness and subsequent college performance: With faculty perspectives on student readiness. College Board Research. https://research.collegeboard.org/media/pdf/Recent-Trends-in-College-Readiness-and-Subsequent-College-Performance.pdf