Adding Additional Predictors to Improve Accuracy

Badge - J.Carhart

Published

July 18, 2025

The final activity for each lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, there are two parts to complete:

Part I: Extending our model

In this part of the badge activity, please add another variable: the number of days before the start of the module that students registered. This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model becomes (if at all, as this variable might not have much predictive power). Note that this variable is numeric, so no pre-processing is necessary.
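Because the variable is numeric, including it is just a matter of adding it to the recipe formula. A minimal sketch, where `days_registered_early` and the other names are hypothetical placeholders for the actual columns in your data:

# Sketch: add a numeric predictor to a recipe. All variable names here
# (outcome, predictor_1, predictor_2, days_registered_early) are
# placeholders; substitute the columns from your own data.
my_rec <- recipe(outcome ~ predictor_1 + predictor_2 + days_registered_early,
                 data = data_train)
# Numeric predictors need no step_dummy() or other encoding step.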

In doing so, please move all of the code needed to run the analysis from your case study file to this document. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in each chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps); a sketch of one possible chunk layout follows.
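For example, the .Rmd source might be organized like this (the chunk names are illustrative, not required):

```{r load-packages}
library(tidyverse)
library(tidymodels)
library(janitor)
library(skimr)
```

```{r wrangle-data}
# import, clean_names(), select(), and filter() steps
```

```{r fit-model}
# split, recipe, model specification, workflow, and last_fit()
```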

#loading packages and data
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.8     ✔ rsample      1.3.0
✔ dials        1.4.0     ✔ tune         1.3.0
✔ infer        1.0.9     ✔ workflows    1.2.0
✔ modeldata    1.4.0     ✔ workflowsets 1.1.1
✔ parsnip      1.3.2     ✔ yardstick    1.3.2
✔ recipes      1.3.1     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(skimr)
#wrangling data
#2a. Import and inspect data
ipeds <- read_csv("/cloud/project/module-2/data/ipeds-all-title-9-2022-data.csv")
Rows: 5988 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): institution name, HD2022.Postsecondary and Title IV institution in...
dbl (15): unitid, year, DRVADM2022.Percent admitted - total, DRVIC2022.Tuiti...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
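# As the message above suggests, this note can be silenced by setting
# show_col_types = FALSE (a sketch; the path is the same as above):
# ipeds <- read_csv("/cloud/project/module-2/data/ipeds-all-title-9-2022-data.csv",
#                   show_col_types = FALSE)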
#use same data cleaning steps
ipeds <- clean_names(ipeds)

#Clean up further
ipeds <- ipeds %>% 
    select(name = institution_name, 
           title_iv = hd2022_postsecondary_and_title_iv_institution_indicator, # is the university a title IV university?
           carnegie_class = hd2022_carnegie_classification_2021_basic, # which carnegie classification
           state = hd2022_state_abbreviation, # state
           total_enroll = drvef2022_total_enrollment, # total enrollment
           pct_admitted = drvadm2022_percent_admitted_total, # percentage of applicants admitted
           n_bach = drvc2022_bachelors_degree, # number of students receiving a bachelor's degree
           n_mast = drvc2022_masters_degree, # number receiving a master's
           n_doc = drvc2022_doctors_degree_research_scholarship, # number receiving a doctoral degree
           tuition_fees = drvic2022_tuition_and_fees_2021_22, # total cost of tuition and fees
           grad_rate = drvgr2022_graduation_rate_total_cohort, # graduation rate
           percent_fin_aid = sfa2122_percent_of_full_time_first_time_undergraduates_awarded_any_financial_aid, # percent of students receiving financial aid
           avg_salary = drvhr2022_average_salary_equated_to_9_months_of_full_time_instructional_staff_all_ranks) # average salary of instructional staff

#keep only Title IV institutions
ipeds <- ipeds %>% 
    filter(title_iv == "Title IV postsecondary institution")

#keep only institutions in the Carnegie universe
ipeds <- ipeds %>%
  filter(carnegie_class != "Not applicable, not in Carnegie universe (not accredited or nondegree-granting)")

#inspect data
glimpse(ipeds)
Rows: 3,818
Columns: 13
$ name            <chr> "Alabama A & M University", "University of Alabama at …
$ title_iv        <chr> "Title IV postsecondary institution", "Title IV postse…
$ carnegie_class  <chr> "Master's Colleges & Universities: Larger Programs", "…
$ state           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",…
$ total_enroll    <dbl> 6007, 21639, 647, 9237, 3828, 38644, 1777, 2894, 5109,…
$ pct_admitted    <dbl> 68, 87, NA, 78, 97, 80, NA, NA, 92, 44, 57, NA, NA, NA…
$ n_bach          <dbl> 511, 2785, 54, 1624, 480, 6740, NA, 738, 672, 5653, 26…
$ n_mast          <dbl> 249, 2512, 96, 570, 119, 2180, NA, 80, 300, 1415, 0, N…
$ n_doc           <dbl> 9, 166, 20, 41, 2, 215, NA, 0, 0, 284, 0, NA, 0, NA, N…
$ tuition_fees    <dbl> 10024, 8568, NA, 11488, 11068, 11620, 4930, NA, 8860, …
$ grad_rate       <dbl> 27, 64, 50, 63, 28, 73, 22, NA, 36, 81, 65, 26, 9, 29,…
$ percent_fin_aid <dbl> 87, 96, NA, 96, 97, 87, 87, NA, 99, 79, 100, 96, 92, 8…
$ avg_salary      <dbl> 77824, 106434, 36637, 92561, 72635, 97394, 63494, 8140…
#2b. Transform variables
#Create a binary variable indicating whether the institution has a good graduation rate, and coerce it to a factor
ipeds <- ipeds %>%
  mutate(
    good_grad_rate = ifelse(grad_rate > 60, 1, 0),
    good_grad_rate = as.factor(good_grad_rate)
  )
#Redo with the error fixed: recode the outcome as labeled "High"/"Low" levels instead of 1/0

# ───────────────────────────────
# 1. Load Required Packages
# ───────────────────────────────
library(tidyverse)
library(tidymodels)
library(janitor)
library(skimr)

# ───────────────────────────────
# 2. Explore and Clean the Data
# ───────────────────────────────

# Create good_grad_rate variable
ipeds <- ipeds %>%
  mutate(
    good_grad_rate = ifelse(grad_rate > 60, "High", "Low"),
    good_grad_rate = as.factor(good_grad_rate)
  )

# Tabulate outcomes
ipeds %>% tabyl(good_grad_rate) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting()
 good_grad_rate    n percent valid_percent
           High  982   25.7%         28.9%
            Low 2418   63.3%         71.1%
           <NA>  418   10.9%             -
          Total 3818  100.0%        100.0%
#Output from a previous run (n = 3704), kept for comparison:
 #good_grad_rate    n percent valid_percent
           #High  964   26.0%         28.9%
            #Low 2375   64.1%         71.1%
           #<NA>  365    9.9%             -
          #Total 3704  100.0%        100.0%

# Visualize Graduation Rate Distribution
ipeds %>% 
  ggplot(aes(x = grad_rate)) +
  geom_histogram(binwidth = 5, fill = "cadetblue2", color = "darksalmon") +
  labs(title = "Distribution of Graduation Rates",
       x = "Graduation Rate (%)",
       y = "Number of Institutions")
Warning: Removed 418 rows containing non-finite outside the scale range
(`stat_bin()`).

# Second look at the graduation rate distribution (note: this is still a
# histogram of grad_rate; comparing tuition to graduation rate requires
# mapping both variables, as in the scatter plot sketch after these plots)
ipeds %>% 
  ggplot(aes(x = grad_rate)) +
  geom_histogram(binwidth = 5, fill = "darkseagreen1", color = "forestgreen") +
  labs(title = "Distribution of Graduation Rates",
       x = "Graduation Rate (%)",
       y = "Number of Institutions")
Warning: Removed 418 rows containing non-finite outside the scale range
(`stat_bin()`).

# Third look at the graduation rate distribution (again a histogram of
# grad_rate only; relating it to total enrollment would likewise require
# both variables in the aesthetic mapping)
ipeds %>% 
  ggplot(aes(x = grad_rate)) +
  geom_histogram(binwidth = 5, fill = "aliceblue", color = "cornflowerblue") +
  labs(title = "Distribution of Graduation Rates",
       x = "Graduation Rate (%)",
       y = "Number of Institutions")
Warning: Removed 418 rows containing non-finite outside the scale range
(`stat_bin()`).

# Skim summary of the dataset
skim(ipeds)
Data summary
Name ipeds
Number of rows 3818
Number of columns 14
_______________________
Column type frequency:
character 4
factor 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 6 91 0 3775 0
title_iv 0 1 34 34 0 1 0
carnegie_class 0 1 15 88 0 33 0
state 0 1 4 30 0 59 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
good_grad_rate 418 0.89 FALSE 2 Low: 2418, Hig: 982

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_enroll 0 1.00 4884.67 9327.97 1 532 1796.5 5058.00 164091 ▇▁▁▁▁
pct_admitted 2040 0.47 71.86 22.51 0 60 77.0 89.00 100 ▁▁▃▆▇
n_bach 1192 0.69 771.19 1672.82 0 25 209.0 649.75 24230 ▇▁▁▁▁
n_mast 1195 0.69 336.29 922.00 0 0 47.0 273.50 19615 ▇▁▁▁▁
n_doc 1195 0.69 30.38 112.18 0 0 0.0 4.00 1595 ▇▁▁▁▁
tuition_fees 599 0.84 16801.79 15347.31 480 5039 10830.0 25045.00 66064 ▇▂▂▁▁
grad_rate 418 0.89 48.11 21.62 0 32 47.0 63.00 100 ▂▇▇▅▂
percent_fin_aid 405 0.89 89.88 15.56 0 86 97.0 100.00 100 ▁▁▁▁▇
avg_salary 113 0.97 67775.88 24347.24 1 52170 64822.0 80244.00 215232 ▂▇▂▁▁
#Data summary from a previous run (n = 3704), kept for comparison:
#── Data Summary ────────────────────────
                           #Values
#Name                       ipeds 
#Number of rows             3704  
#Number of columns          15    
#_______________________          
#Column type frequency:           
  #character                4     
  #factor                   1     
  #numeric                  10    
#________________________         
#Group variables            None  

# ───────────────────────────────
# 3. Feature Engineering
# ───────────────────────────────

# Create tuition_to_salary_ratio
ipeds <- ipeds %>%
  mutate(tuition_to_salary_ratio = tuition_fees / avg_salary)

# Filter out problematic cases (e.g., the implausible minimum reported
# avg_salary of $1); note that filter() also drops rows where avg_salary is NA
ipeds <- ipeds %>%
  filter(avg_salary > 1)
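As a quick sanity check (run before the filter above), we can count how many rows the filter will remove; a minimal sketch:

# Count implausibly low and missing salaries (run before filtering, since
# filter(avg_salary > 1) drops both kinds of rows).
ipeds %>%
  summarise(n_implausible = sum(avg_salary <= 1, na.rm = TRUE),
            n_missing = sum(is.na(avg_salary)))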

# Visualize new feature
ipeds %>%
  ggplot(aes(x = tuition_to_salary_ratio, fill = good_grad_rate)) +
  geom_density(alpha = 0.6) +
  labs(title = "Tuition-to-Salary Ratio by Graduation Rate",
       x = "Tuition to Salary Ratio",
       y = "Density")
Warning: Removed 543 rows containing non-finite outside the scale range
(`stat_density()`).

# ───────────────────────────────
# 4. Modeling Steps
# ───────────────────────────────
# Split the data
set.seed(20250712)
train_test_split <- initial_split(ipeds, prop = 0.8, strata = "good_grad_rate")
data_train <- training(train_test_split)
data_test  <- testing(train_test_split)

# Create recipe with additional predictors (defined on the training data,
# the idiomatic tidymodels pattern, so no information leaks from the test set)
my_rec <- recipe(good_grad_rate ~ 
                     tuition_fees +
                     avg_salary +
                     tuition_to_salary_ratio +
                     percent_fin_aid +
                     total_enroll,
                 data = data_train)
 
# Specify logistic regression model
my_mod <- logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification")

# Combine model + recipe into workflow
my_wf <- workflow() %>%
    add_model(my_mod) %>%
    add_recipe(my_rec)

# Fit the model and evaluate on test data
final_fit <- last_fit(my_wf, train_test_split)

# ───────────────────────────────
# 5. Interpret Results
# ───────────────────────────────

# View prediction accuracy
final_fit %>%
    collect_predictions() %>%
    mutate(correct = .pred_class == good_grad_rate) %>%
    tabyl(correct)
 correct   n   percent valid_percent
   FALSE 118 0.1590296     0.1870048
    TRUE 513 0.6913747     0.8129952
      NA 111 0.1495957            NA

#Steps: split the data into training (80%) and test (20%) sets, stratified on good_grad_rate so that the outcome distribution is preserved; built a recipe (a data preprocessing plan); specified a logistic regression model; used last_fit() to train on the training data and evaluate on the test data.

#The model correctly predicted good_grad_rate about 81% of the time on non-missing test cases. Around 19% of predictions were incorrect. 111 cases had missing predictions, likely due to NA values in one or more predictor variables in the test set.
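One way to recover those 111 unscored cases would be to impute missing predictor values inside the recipe. A sketch, assuming median imputation (one of several reasonable choices):

# Sketch: impute missing numeric predictors so the model can score the
# test cases that currently return NA predictions.
my_rec_imputed <- my_rec %>%
  step_impute_median(all_numeric_predictors())

my_wf_imputed <- my_wf %>%
  update_recipe(my_rec_imputed)

final_fit_imputed <- last_fit(my_wf_imputed, train_test_split)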

# Quick model summary
final_fit %>%
    collect_metrics()
# A tibble: 3 × 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.813 Preprocessor1_Model1
2 roc_auc     binary         0.823 Preprocessor1_Model1
3 brier_class binary         0.133 Preprocessor1_Model1

How does the accuracy of this new model compare? Add a few reflections below:

  • Each row represents a college or university. The model predicts whether a school has a “High” or “Low” graduation rate based on: tuition and fees, average salary of instructional staff, percent of students receiving financial aid, total enrollment, and the tuition-to-salary ratio.

    The model was trained on 80% of the schools and tested on the remaining 20%.

Metric     Result   What It Means
Accuracy   0.813    ~81.3% of predictions were correct
ROC AUC    0.823    The model separates high vs. low graduation rates well
Brier      0.133    Predicted probabilities were fairly well calibrated (lower = better)

The model is doing well overall, especially in distinguishing between high and low graduation rate institutions. The accuracy and ROC AUC suggest a good starting point with just a few predictors; the confusion matrix sketched below shows where the errors fall.
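To see which kinds of errors the model makes (false “High” vs. false “Low”), a minimal sketch cross-tabulating the predictions already collected:

# Sketch: confusion matrix of predicted vs. actual classes on the test set,
# after dropping the cases with missing truth or missing predictions.
final_fit %>%
  collect_predictions() %>%
  filter(!is.na(good_grad_rate), !is.na(.pred_class)) %>%
  conf_mat(truth = good_grad_rate, estimate = .pred_class)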

Part II: Reflect and Plan

Part A: Please refer back to Breiman’s (2001) article for these three questions.

  1. Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper?
  • The Data Modeling Culture assumes that data are generated by a given stochastic data model; analysts presume a specific underlying mechanism for how the data were produced. Breiman argues that the statistical community’s almost exclusive commitment to data models has led to irrelevant theory and questionable scientific conclusions, and has prevented statisticians from engaging with a large range of interesting current problems. He points out that goodness-of-fit tests and residual analysis often lack power in high-dimensional data, that there can be a multiplicity of equally good models that offer different “pictures of nature”, and that predictive accuracy is rarely published as a measure of model fit.

  • The Algorithmic Modeling Culture, by contrast, views the data-generating mechanism (the “black box” of nature) as complex and unknown. The goal is to find a function (an algorithm) that operates on the input variables (x) to accurately predict the response variables (y). This often involves “showing” the model only some of the data (training data) and then testing its performance on new, unseen data (testing data). This culture has developed rapidly in fields outside statistics, with examples including decision trees, neural networks, and random forests. Random forest modeling, an extension of decision tree modeling, is particularly well-suited for exploring complex, non-linear relationships without specific directional hypotheses, allowing the computer to find connections that may not be immediately visible to humans. Algorithmic models can be more accurate and informative alternatives to data models, especially with large, complex datasets: they allow for nonlinear modeling and can handle a large number of variables without overfitting. Breiman emphasizes that even complex models can still provide reliable information about the structure of relationships between inputs and outputs; for instance, random forests can identify the variables that contribute most strongly to predictions (variable importance).

  2. How has the advent of big data and machine learning affected or reinforced Breiman’s argument since the article was published?
  • Machine learning is now routinely used to identify non-obvious predictors of persistence or graduation, prioritizing what works for identifying disparities or patterns even when the mechanism is complex or unknown, and surfacing important features (e.g., financial aid %, early registration, institutional size) without overfitting. The growth of educational data mining, predictive analytics in higher education, and AI in advising reinforces Breiman’s argument, particularly in applied contexts like mine.
  3. Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?
  • In educational equity research, predictive accuracy is crucial for identifying patterns, flagging at-risk students, or evaluating interventions. However, explanation still matters when working with policymakers, institutional leaders, or students themselves. Stakeholders need to understand why a student might be flagged, not just that they are.

Part B:

  1. How good was the machine learning model we developed in the badge activity? What would you think if, as a reviewer of research, you read about someone using such a model? Please add your thoughts and reflections following the bullet point below.
  • Our model performed well, with ~81% accuracy and an AUC of 0.82. That’s a strong signal that the predictors we chose (e.g., tuition fees, financial aid, staff salaries) help distinguish between high and low graduation rate institutions. This model would be suitable in exploratory or decision-support research, especially if paired with variable importance plots or fairness checks. It would not yet be appropriate for policy decisions without additional validation or interpretability work.
  2. How might the model be improved? Share any ideas you have at this time below:
  • To improve the model, we could incorporate additional institutional predictors like carnegie_class or HBCU/HSI status, address missing values to reduce noise, explore tree-based models such as random forests to better capture interactions and non-linearities (see the sketch below), and consider including behavioral features like early_registration_days to enhance predictive accuracy and contextual relevance.
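A hedged sketch of the random forest idea, swapping only the model specification in the existing workflow (the ranger engine is assumed to be installed; note that ranger cannot handle missing values, so this would pair with the imputation recipe sketched earlier):

# Sketch: replace logistic regression with a random forest in the same workflow.
rf_mod <- rand_forest(trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_wf <- my_wf %>%
  update_model(rf_mod)

rf_fit <- last_fit(rf_wf, train_test_split)
rf_fit %>% collect_metrics()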

Part C: Use your institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

  1. Provide an APA citation for your selected study.

    • I need to explore more in the field of higher education. I have not had any luck finding an article yet.
  2. What research questions were the authors of this study trying to address and why did they consider these questions important?

  3. What were the results of these analyses?

Knit and Publish

Complete the following steps to knit and publish your work:

  1. First, change the name in the author: field of the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.

  3. Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

Your Second Machine Learning Badge

Congratulations, you’ve completed your second badge activity! To receive credit, again, please share the link to your published webpage under the next incomplete badge artifact column on the LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet. We recommend bookmarking this spreadsheet as we’ll be using it throughout the year to keep track of your progress.

Once your instructor has checked your link, you will be provided a physical version of the badge below!