Explaining or Predicting Graduation Rates Using IPEDS

ECI 586 Introduction to Learning Analytics

Author

Dr. Joey Huang, v2 Anna Doo

Published

October 3, 2025

1. PREPARE

In Unit 2, we learn about the five basic steps in a supervised machine learning process, along with some other components of a learning analytics workflow. To help prepare for analysis, we’ll first take a step back and think about how we want to use machine learning; prediction is the key word. Many scholars have focused on predicting students who are at risk of dropping a course or not succeeding in it. In this introductory machine learning case study, we will cover the following workflow processes from Krumm, Means, and Bienkowski (2018) as we attempt to develop our own model for predicting student drop-out:

  1. Prepare: Prior to analysis, we’ll look at the context from which our data came, formulate a basic research question, and get introduced to the {tidymodels} packages for machine learning.

  2. Wrangle: Wrangling data entails the work of cleaning, transforming, and merging data. In Part 2 we focus on importing CSV files and modifying some of our variables.

  3. Explore: We take a quick look at our variables of interest and do some basic “feature engineering” by creating some new variables we think will be predictive of students at risk.

  4. Model: We introduce five basic steps in a supervised machine learning process, focusing on the mechanics of making predictions.

  5. Communicate: To wrap up our case study, we’ll create our first “data product” and share our analyses and findings by creating our first web page using R Markdown.

1a. Conceptual Focus

Conceptually, in this case study we focus on prediction, the primary goal of supervised machine learning, and how it differs from description or explanation, the goals of traditional statistical methods. The reading introduced below focuses on this distinction between prediction and description or explanation. It is one of the most widely read papers in machine learning and articulates how machine learning differs from other kinds of statistical models. Breiman describes the difference in terms of data modeling (models for description and explanation) and algorithmic modeling (what we call prediction or machine learning models).

Research Question

Technically, we’ll focus on the core parts of doing a machine learning analysis in R. We’ll use the {tidymodels} set of R packages (add-ons) to do so. We use a recent study by Zong and Davis (2024) as an inspiration for ours. These authors used inferential models to try to understand what relates to the graduation rates of around 700 four-year universities in the United States, predicting this outcome on the basis of student background, finance, academic and social environment, and retention rate independent variables. You can find this study in the lit folder if you are interested in taking a look.

However, to help anchor our analysis and provide us with some direction, we’ll focus on the following research question as we explore this new data set:

How well can we predict drop-out rates among four-year universities?

Literature Review

Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199-231. https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.pdf

Abstract

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

👉 Your Turn

You can find this study in the lit folder as well. Open up the article, take a quick scan of it, and note two observations you have.

  • My initial observation is that I really need to take a rudimentary statistics class. There is a lot of jargon that while somewhat familiar, is not ingrained in my brain so as to comprehend on first read the differing models and inputs, outputs, etc. I also noticed that Breiman brings into his discourse the path he took and that it included consulting outside of academia.
  • The “emphasis on the data and the problem” not solely on the model to use the data stood out. It really seems to me (not a statistician) that more creativity is needed when approaching these problems. Instead of only using a hammer to pound a nail, thinking about multiple methods to fit the nail in place is more productive…and perhaps more predictive.

1b. Load Libraries

tidymodels 📦

The tidymodels package is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse. Like the {tidyverse} package, it includes a core set of packages that are loaded on startup and contains tools for:

  • data splitting and pre-processing;
  • model selection, tuning, and evaluation;
  • feature selection and variable importance estimation;
  • as well as other functionality.

Your Turn

In addition to the {tidymodels} package, we’ll also be using the {tidyverse} packages we learned about in Unit 1, as well as the {janitor} package for quickly cleaning up variable names.

Use the code chunk below to load these three packages:

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5     ✔ recipes      1.0.8
✔ dials        1.2.0     ✔ rsample      1.2.0
✔ dplyr        1.1.3     ✔ tibble       3.2.1
✔ ggplot2      3.4.3     ✔ tidyr        1.3.0
✔ infer        1.0.5     ✔ tune         1.1.2
✔ modeldata    1.2.0     ✔ workflows    1.1.3
✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
✔ purrr        1.0.2     ✔ yardstick    1.2.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ lubridate 1.9.3     ✔ stringr   1.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
Tip

Remember to use the library() function to load these packages. After you’ve done that, click the green arrow to run the code chunk. If you see a bunch of messages (not anything labeled as an error), you are good to go! These messages mean the packages loaded correctly.

1c. Import & Inspect Data

In this case study, we will be using a new data set from IPEDS, the Integrated Postsecondary Education Data System, similar to the data set used by Zong and Davis (2024).

Run the following code to read in the ipeds-all-title-9-2022-data.csv file using the read_csv() function, paying attention to where those files are located relative to this case study file – in the data folder!

ipeds <- read_csv("data/ipeds-all-title-9-2022-data.csv")
Rows: 5988 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): institution name, HD2022.Postsecondary and Title IV institution in...
dbl (15): unitid, year, DRVADM2022.Percent admitted - total, DRVIC2022.Tuiti...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We’ll then use a handy function from janitor, clean_names(). It does what it seems like it should - it cleans the names of variables, making them easy to view and type. Run this next code chunk.

ipeds <- janitor::clean_names(ipeds)

👉 Your Turn

In the chunk below, examine the data set using a function or means of your choice (such as just printing the data set by typing its name or using the glimpse() function). Do this in the code chunk below! Note its dimensions — especially how many rows it has!

glimpse(ipeds)
Rows: 5,988
Columns: 24
$ unitid                                                                                  <dbl> …
$ institution_name                                                                        <chr> …
$ year                                                                                    <dbl> …
$ hd2022_postsecondary_and_title_iv_institution_indicator                                 <chr> …
$ drvadm2022_percent_admitted_total                                                       <dbl> …
$ drvic2022_tuition_and_fees_2021_22                                                      <dbl> …
$ hd2022_institution_size_category                                                        <chr> …
$ hd2022_state_abbreviation                                                               <chr> …
$ hd2022_carnegie_classification_2021_basic                                               <chr> …
$ drvef2022_total_enrollment                                                              <dbl> …
$ drvef122022_total_12_month_unduplicated_headcount                                       <dbl> …
$ drvc2022_bachelors_degree                                                               <dbl> …
$ drvc2022_masters_degree                                                                 <dbl> …
$ drvc2022_doctors_degree_research_scholarship                                            <dbl> …
$ drvgr2022_graduation_rate_total_cohort                                                  <dbl> …
$ sfa2122_percent_of_full_time_first_time_undergraduates_awarded_any_financial_aid        <dbl> …
$ xanyaidp                                                                                <chr> …
$ drvhr2022_average_salary_equated_to_9_months_of_full_time_instructional_staff_all_ranks <dbl> …
$ sfa2122_average_net_price_students_awarded_grant_or_scholarship_aid_2021_22             <dbl> …
$ xnpist2                                                                                 <chr> …
$ adm2022_sat_evidence_based_reading_and_writing_50th_percentile_score                    <dbl> …
$ xsatvr50                                                                                <chr> …
$ adm2022_act_composite_50th_percentile_score                                             <dbl> …
$ xactcm50                                                                                <chr> …

Now write down a couple observations after inspecting the data - any and all observations are welcome!

  • 5,988 rows is quite robust. It’s interesting that all sorts of demographic data is included from how many students were admitted, the tuition, student status and degrees and financial aid. There really is a fair amount of financial rows so I’m thinking the data would be useful for that sort of comparison.
  • It also seems the scores for reading and writing, composite percentile etc would be useful. I don’t know what Title IV institution indicator means, so I’ll go look that up, but clearly it’s important for this dataset. (Accreditation for higher learning institutions that want to be able to use federal funding).

❓Question

Recall that, similar to Zong and Davis (2024), we are trying to predict student drop-out (or, conversely, graduation) rates using data readily available to higher education researchers. Take a look at our data set again and list three variables you think might be useful for predicting graduation rates:

  • drvgr2022_graduation_rate_total_cohort
  • drvc2022_bachelors_degree
  • Perhaps the SAT and ACT score

2. WRANGLE

In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham, Çetinkaya-Rundel, and Grolemund 2023). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm, Means, and Bienkowski 2018). In Part 2, we focus on the following wrangling processes:

  1. Selecting Variables. We use the select() function to simultaneously select variables for analysis and rename exceptionally long variables.

  2. Filtering Variables. We use the filter() function to further reduce our data set to include only Title IV postsecondary institutions.

2a. Select Variables

Even though we cleaned the names to make them easier to view and type (thanks, clean_names()!), they are still pretty long.

The code chunk below uses a very handy function, select(). This allows you to simultaneously choose and rename variables, returning a data frame with only the variables you have selected — named as you like. For now, we’ll just run this code. Later in your analyses, you’ll almost certainly use select() to get a more manageable dataset.

ipeds <- ipeds |> 
    select(name = institution_name, 
           title_iv = hd2022_postsecondary_and_title_iv_institution_indicator,
           carnegie_class = hd2022_carnegie_classification_2021_basic,
           state = hd2022_state_abbreviation,
           total_enroll = drvef2022_total_enrollment,
           pct_admitted = drvadm2022_percent_admitted_total,
           n_bach = drvc2022_bachelors_degree,
           n_mast = drvc2022_masters_degree,
           n_doc = drvc2022_doctors_degree_research_scholarship,
           tuition_fees = drvic2022_tuition_and_fees_2021_22,
           grad_rate = drvgr2022_graduation_rate_total_cohort,
           percent_fin_aid = sfa2122_percent_of_full_time_first_time_undergraduates_awarded_any_financial_aid,
           avg_salary = drvhr2022_average_salary_equated_to_9_months_of_full_time_instructional_staff_all_ranks)

Before moving on, let’s make sure we understand what each variable represents. Even though the variable names are fairly intuitive, below is a brief description of the variables we just selected:

  • name: Institution name

  • title_iv: Indicator if the university is Title IV eligible

  • carnegie_class: Carnegie Classification of the institution

  • state: State abbreviation

  • total_enroll: Total enrollment

  • pct_admitted: Percentage of applicants admitted

  • n_bach: Number of students receiving a bachelor’s degree

  • n_mast: Number receiving a master’s degree

  • n_doc: Number receiving a doctoral degree

  • tuition_fees: Total cost of tuition and fees

  • grad_rate: Graduation rate

  • percent_fin_aid: Percent of students receiving financial aid

  • avg_salary: Average salary of instructional staff

Sometimes publicly available data sets, particularly high-quality ones like IPEDS, will have a codebook or glossary that provides detailed information about a dataset, serving as a guide to understanding its structure, contents, and variables. It essentially acts as a “dictionary” for your dataset, helping researchers, analysts, and anyone else using the data to interpret it correctly.

👉 Your Turn

Visit the IPEDS glossary located here: https://surveys.nces.ed.gov/ipeds/public/glossary

Use the glossary to look up a variable from above or in our larger data set that you are interested in understanding better and record the full definition below:

  • Federal Grants: Transfers of money or property from the Federal government to the education institution without a requirement to receive anything in return. These grants may take the form of grants to the institutions to undertake research or they may be in the form of student financial aid. (Used for reporting on the Finance component)

Fun fact: RTI International – located in North Carolina’s Research Triangle Park, from which it derives its name – has led IPEDS for over 20 years, collecting institution-level data from primary providers of postsecondary education nationwide. To learn more about their work, visit: https://www.rti.org/impact/integrated-postsecondary-education-data-system-ipeds

2b. Count Variables

As illustrated in the figure below, Krumm, Means, and Bienkowski (2018) note in Chapter 2 of Learning Analytics Goes to School that processes in the Data-Intensive Research Workflow (e.g., preparing, wrangling, exploring) can be seen as overlapping activities as much as distinct, sequential steps.

For example, next we will explore our data a little bit to assist with our data wrangling process.

A useful function for exploring data is count(); it does what it sounds like! It counts how many times values for a variable appear.

ipeds |> 
    count(title_iv)
# A tibble: 2 × 2
  title_iv                                             n
  <chr>                                            <int>
1 Title IV NOT primarily postsecondary institution    30
2 Title IV postsecondary institution                5958

This suggests we may wish to filter the 30 non-Title IV institutions — something we’ll do shortly.

👉 Your Turn

Can you count another variable? Pick another (see the code chunk two above) and add a count below. While simple, counting up different values in our data can be very informative (and can often lead to further explorations)!

ipeds |> 
    count(carnegie_class)
# A tibble: 34 × 2
   carnegie_class                                                              n
   <chr>                                                                   <int>
 1 Associate's Colleges: High Career & Technical-High Nontraditional          88
 2 Associate's Colleges: High Career & Technical-High Traditional            106
 3 Associate's Colleges: High Career & Technical-Mixed Traditional/Nontra…   116
 4 Associate's Colleges: High Transfer-High Nontraditional                   105
 5 Associate's Colleges: High Transfer-High Traditional                      105
 6 Associate's Colleges: High Transfer-Mixed Traditional/Nontraditional      101
 7 Associate's Colleges: Mixed Transfer/Career & Technical-High Nontradit…   114
 8 Associate's Colleges: Mixed Transfer/Career & Technical-High Tradition…   104
 9 Associate's Colleges: Mixed Transfer/Career & Technical-Mixed Traditio…    97
10 Baccalaureate Colleges: Arts & Sciences Focus                             216
# ℹ 24 more rows

2c. Filter Variables

Our final data wrangling step is filtering our data set to include only Title IV postsecondary institutions.

We’ll do this with a function you should now be fairly familiar with. filter() is a very handy function that is part of the tidyverse; it filters to include (or exclude) observations in your data based upon logical conditions (e.g., ==, >, <=, etc.). See more here if interested.

Run the code chunk below to filter the data so it includes only Title IV postsecondary institutions.

ipeds <- ipeds |> 
    filter(title_iv == "Title IV postsecondary institution")

👉 Your Turn

Can you filter the data again, this time to only include institutions with a carnegie classification?

In other words, can you exclude those institutions whose value for the carnegie_class variable is “Not applicable, not in Carnegie universe (not accredited or nondegree-granting)”?

ipeds <- ipeds |> 
    filter(carnegie_class != "Not applicable, not in Carnegie universe (not accredited or nondegree-granting)")
Tip

Whereas the logical operator == is used to include only matching conditions, the logical operator != excludes matching conditions.
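
If you want to double-check that both filters behaved as expected, here is a small sketch using functions we have already seen (count(), plus str_detect() from {stringr}, which loads with the tidyverse). The second pipeline should return zero rows if the excluded Carnegie category is really gone.

# After filtering, only one value of title_iv should remain
ipeds |> 
    count(title_iv)

# ...and the "Not applicable" Carnegie category should no longer appear
ipeds |> 
    count(carnegie_class) |> 
    filter(str_detect(carnegie_class, "Not applicable"))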

👉 Your Turn

We’re cruising! Let’s take another peek at our data - using glimpse() or another means of your choosing below.

glimpse(ipeds)
Rows: 3,818
Columns: 13
$ name            <chr> "Alabama A & M University", "University of Alabama at …
$ title_iv        <chr> "Title IV postsecondary institution", "Title IV postse…
$ carnegie_class  <chr> "Master's Colleges & Universities: Larger Programs", "…
$ state           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",…
$ total_enroll    <dbl> 6007, 21639, 647, 9237, 3828, 38644, 1777, 2894, 5109,…
$ pct_admitted    <dbl> 68, 87, NA, 78, 97, 80, NA, NA, 92, 44, 57, NA, NA, NA…
$ n_bach          <dbl> 511, 2785, 54, 1624, 480, 6740, NA, 738, 672, 5653, 26…
$ n_mast          <dbl> 249, 2512, 96, 570, 119, 2180, NA, 80, 300, 1415, 0, N…
$ n_doc           <dbl> 9, 166, 20, 41, 2, 215, NA, 0, 0, 284, 0, NA, 0, NA, N…
$ tuition_fees    <dbl> 10024, 8568, NA, 11488, 11068, 11620, 4930, NA, 8860, …
$ grad_rate       <dbl> 27, 64, 50, 63, 28, 73, 22, NA, 36, 81, 65, 26, 9, 29,…
$ percent_fin_aid <dbl> 87, 96, NA, 96, 97, 87, 87, NA, 99, 79, 100, 96, 92, 8…
$ avg_salary      <dbl> 77824, 106434, 36637, 92561, 72635, 97394, 63494, 8140…

3. EXPLORE

As noted by Krumm, Means, and Bienkowski (2018), exploratory data analysis often involves some combination of data visualization and feature engineering. In Part 3, we will create some quick visualizations to help us better understand our data and transform our dependent variable from continuous to categorical to simplify our modeling. Specifically, in Part 3 we will:

  1. Visualize Variables by using the {ggplot2} package to visually inspect our grad_rate dependent variable as well as other variables of interest.

  2. Dichotomize Dependent Variable by “mutating” our grad_rate dependent variable from our continuous variable to a dichotomous variable.

3a. Examine Dependent Variable

One key step in most analyses is to explore the data. Here, we conduct an exploratory data analysis with the IPEDS data, focusing on the key outcome of graduation rate.

Below, we use the ggplot2 package (part of the tidyverse) to visualize the spread of the values of our dependent variable, grad_rate, which represents institutions’ graduation rate. There is a lot to ggplot2, and data visualizations are not the focus of this module, but this web page has a lot of information you can use to learn more, if you are interested. ggplot2 is fantastic for creating publication-ready visualizations!

ipeds |> 
    ggplot(aes(x = grad_rate)) +
    geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 418 rows containing non-finite values (`stat_bin()`).

❓Question

What do you notice about this graph – and about graduation rate?

  • Seems like there are some notable outliers at either end. But overall at first glance, it looks like majority of institutions hover at about 30% graduation rate.
  • Nearly 50 programs have a 100% grad rate?! That’s very interesting to unpack…how many students in those programs, is it just doctorate or masters or particular institutions?

👉 Your Turn

Below, can you add one ggplot2 plot with a different variable/variables? Use the ggplot2 page linked above (also here) or the code above as a starting point (another histogram is fine!) for your visualization.

ipeds |> 
    ggplot(aes(x = state, y = avg_salary)) +
    geom_point()
Warning: Removed 113 rows containing missing values (`geom_point()`).

# yes it's super cluttered! Need to do some wrangling so it's a useful viz

3b. Dichotomize Variables

Next we’ll again overlap our data wrangling and exploring activities in preparation for the data modeling that comes next. Recall that we are interested in assessing how well we can predict graduation rates, our dependent variable, using other variables or predictors in our data set.

Supervised machine learning, or predictive modeling, involves two broad approaches: classification and regression.

  • Classification algorithms model categorical outcomes (e.g., yes or no outcomes);

  • Regression algorithms characterize continuous outcomes (e.g., test scores).

To simplify our analysis for this case study, we’ll focus on classification. Specifically, we will model our dependent variable, grad_rate, as a dichotomous (i.e., yes or no; 1 or 0) variable. This isn’t strictly necessary, but it makes the contrast between the regression and supervised machine learning approaches a bit more vivid, and dichotomous and categorical outcome variables are common in supervised machine learning applications, so we’ll take this approach for this case study.

👉 Your Turn

Your next task is to decide what constitutes a good graduation rate. Our only suggestion - don’t pick a number too close to 0% or 100%. Otherwise, please replace XXX below with a number from 0-100 that represents the graduation rate percentage. Just add the number – don’t add the percentage symbol.

ipeds <- ipeds |> 
    mutate(good_grad_rate = if_else(grad_rate > 62, 1, 0),
           good_grad_rate = as.factor(good_grad_rate))

Here, add a reason or two for how and why you picked the number you did:

  • Initially it seemed like an attainable graduation rate.

  • Then, I went back over to the IPEDS website and asked what a valid graduation rate may be and found this data https://nces.ed.gov/fastfacts/display.asp?id=40 which led me to keep the 62%.

Before moving on, let’s unpack what we just did in the few lines of code above. The code is performing two primary operations on our ipeds dataset using the mutate() function from the dplyr package:

  1. Creating a New Variable good_grad_rate:
    • The mutate() function adds a new variable named good_grad_rate to the ipeds dataset.
    • if_else(grad_rate > XXX, 1, 0): This part creates a binary (0 or 1) variable based on the grad_rate column.
      • If grad_rate is greater than XXX (where XXX is a placeholder value that needs to be specified), the value of good_grad_rate will be 1.
      • If grad_rate is not greater than XXX, the value of good_grad_rate will be 0.
  2. Converting good_grad_rate to a Factor:
    • good_grad_rate = as.factor(good_grad_rate): This part converts the good_grad_rate variable from a numeric type (0 or 1) to a factor type. Factors are useful in R for categorical data, and they help in modeling processes where the variable is treated as a categorical predictor or outcome variable.

In summary, the code is creating a new categorical variable good_grad_rate in the ipeds dataset that indicates whether the graduation rate (grad_rate) is above a specified threshold (XXX) and then converts this variable into a factor type with two levels: 0 (not above the threshold) and 1 (above the threshold).
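
Before we start modeling, it can be worth a quick sanity check on how the new variable splits the data. Here is a small sketch using the count() function introduced earlier; institutions with a missing grad_rate will show up as NA.

ipeds |> 
    count(good_grad_rate)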

4. MODEL

Models can be used for statistical inference, that is, to test scientific hypotheses and understand the relationships between variables and outcomes, or, as in machine learning, primarily to predict an outcome as accurately as possible. In this section we revisit the use of regression models for statistical inference, and then compare the use of the same model for the purpose of prediction.

Note that in Chapter 3, Krumm, Means, and Bienkowski (2018) highlight that the term regression can take on different meanings across inference and prediction uses:

From a statistical, or inferential perspective, regression denotes a family of models that can be used on either categorical or continuous outcomes. Perhaps most confusing to newcomers or to researchers steeped in either inference or prediction are the ways in which specific models, such as logistic regression, can be used for either inference or classification.

In both sections below we use logistic regression, a statistical method used to model the relationship between one or more independent variables (predictors) and a binary outcome (dependent variable). In the context of machine learning, this is considered a classification problem since we are trying to build a model that can “classify” or label data based on patterns inherent in the data itself.

4a. Statistical Inference

We’ll first conduct a simple regression analysis from a statistical inference perspective, similar to what we did in Unit 1 for our virtual public school data.

Run the code chunk below to “fit” a generalized linear model using the glm() function, which is appropriate because the dependent variable is dichotomous:

m1 <- glm(good_grad_rate ~ 
            total_enroll + 
            pct_admitted + 
            n_bach + 
            n_mast + 
            n_doc + 
            tuition_fees + 
            percent_fin_aid + 
            avg_salary, 
          data = ipeds, 
          family = "binomial") 

summary(m1) 

Call:
glm(formula = good_grad_rate ~ total_enroll + pct_admitted + 
    n_bach + n_mast + n_doc + tuition_fees + percent_fin_aid + 
    avg_salary, family = "binomial", data = ipeds)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -9.768e-01  7.271e-01  -1.343  0.17913    
total_enroll    -1.501e-04  4.829e-05  -3.109  0.00188 ** 
pct_admitted    -1.135e-02  3.892e-03  -2.916  0.00355 ** 
n_bach           9.693e-04  2.130e-04   4.551 5.34e-06 ***
n_mast          -4.401e-04  2.225e-04  -1.978  0.04790 *  
n_doc            8.313e-03  1.782e-03   4.665 3.08e-06 ***
tuition_fees     7.575e-05  5.551e-06  13.644  < 2e-16 ***
percent_fin_aid -3.301e-02  6.682e-03  -4.940 7.79e-07 ***
avg_salary       3.066e-05  4.428e-06   6.925 4.37e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2164.7  on 1615  degrees of freedom
Residual deviance: 1450.0  on 1607  degrees of freedom
  (2202 observations deleted due to missingness)
AIC: 1468

Number of Fisher Scoring iterations: 6

Code Explanation

The code is relatively straightforward, but let’s break down what each line of code above is doing before interpreting the output of this model:

  1. Building a Logistic Regression Model (glm):
    • m1 <- glm(...): This line creates a generalized linear model (GLM) and assigns it to the object m1.
    • good_grad_rate ~ ...: The model is predicting good_grad_rate, a binary outcome variable that we created above, using multiple independent variables including total_enroll, pct_admitted, n_bach, n_mast, n_doc, tuition_fees, percent_fin_aid, and avg_salary.
    • The family = "binomial" argument specifies that this is a logistic regression model, which is appropriate when the dependent variable (good_grad_rate) is binary (0 or 1).
  2. Displaying the Model Output (summary(m1)):
    • Finally, the summary(m1) function provides detailed information about the logistic regression model, which we save as m1, and includes:
      • Coefficients for each predictor variable.
      • Standard errors, z-values, and p-values for hypothesis testing.
      • Overall model statistics such as the null and residual deviance, indicating model fit.

In summary, our code fits a logistic regression model to help us determine whether a university has a “good” graduation rate (good_grad_rate) based on several institutional characteristics. The summary(m1) output will show which predictors are statistically significant and how they influence the likelihood of a “good” graduation rate.

Output Interpretation

Now let’s focus on interpreting the output of summary() function and model. The table below provides the results of a logistic regression model predicting whether a university has a “good” graduation rate (good_grad_rate) using various institutional characteristics as predictors.

Note: Your output will look a little different based on the cut-off point you selected for a “good” graduation rate. In my case, I somewhat arbitrarily decided that if at least two-thirds (~67%) of students graduated, it was considered “good.”

Now let’s break down the interpretation. The table below provides estimates for each predictor, along with their standard errors, z-values, and p-values:

Predictor         Estimate      Std. Error   z value   Pr(>|z|)      Significance
(Intercept)       -0.4732       0.7276       -0.650    0.515466      (Not significant)
total_enroll      -0.0001054    0.00004649   -2.267    0.023415 *    Significant
pct_admitted      -0.01357      0.003959     -3.427    0.000610 ***  Highly significant
n_bach            0.0007563     0.0001981    3.817     0.000135 ***  Highly significant
n_mast            -0.0004373    0.0002131    -2.052    0.040140 *    Significant
n_doc             0.005749      0.001520     3.782     0.000156 ***  Highly significant
tuition_fees      0.00007245    5.630e-06    12.868    < 2e-16 ***   Highly significant
percent_fin_aid   -0.03674      0.006666     -5.512    3.56e-08 ***  Highly significant
avg_salary        0.00002599    4.396e-06    5.912     3.39e-09 ***  Highly significant

First, let’s take a look at the model coefficients; a short sketch showing how to convert these log-odds estimates into odds ratios follows the list:

  • Intercept: The intercept value (-0.4732) represents the log-odds of a university having a “good” graduation rate when all predictor variables are set to zero. The p-value (0.515) indicates that the intercept is not statistically significant, which isn’t a concern because our primary interest is in understanding the relationship between the independent variables and a good_grad_rate.
  • total_enroll: For each additional unit increase in total enrollment, the log-odds of having a good graduation rate decrease by 0.0001054. This effect is statistically significant (*).
  • pct_admitted: A 1% increase in the percentage of students admitted leads to a decrease in the log-odds of having a good graduation rate by 0.01357. This effect is highly significant (***).
  • n_bach: Each additional bachelor’s degree awarded is associated with an increase in the log-odds of a good graduation rate by 0.0007563, and this effect is highly significant (***).
  • n_mast: Each additional master’s degree awarded is associated with a slight decrease in the log-odds by 0.0004373, and this effect is marginally significant (*).
  • n_doc: Each additional doctoral degree awarded is associated with an increase in the log-odds of a good graduation rate by 0.005749, and this effect is highly significant (***).
  • tuition_fees: An increase of one unit in tuition fees is associated with a significant increase in the log-odds of having a good graduation rate by 0.00007245 (***).
  • percent_fin_aid: A 1% increase in students receiving financial aid is associated with a decrease in the log-odds of having a good graduation rate by 0.03674, and this effect is highly significant (***).
  • avg_salary: Each additional unit in average salary for instructional staff is associated with an increase in the log-odds of having a good graduation rate by 0.00002599, and this effect is highly significant (***).
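
Because these estimates are on the log-odds scale, a common next step is to exponentiate them so they read as odds ratios (values above 1 increase the odds of a “good” graduation rate; values below 1 decrease them). A minimal sketch using base R’s exp() and coef() and the tidy() function from {broom}, which loads with tidymodels:

# Express the log-odds coefficients as odds ratios
exp(coef(m1))

# Or as a tidy table of odds ratios
tidy(m1, exponentiate = TRUE)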

Significance codes (***, **, *) provide a quick way to understand the strength of the evidence against the null hypothesis for each predictor in a regression model. Here’s a more detailed explanation:

  • *** Highly significant (p < 0.001). This indicates a very strong evidence against the null hypothesis (which typically states that the predictor has no effect on the outcome variable). A p-value less than 0.001 means there’s less than a 0.1% chance that the observed relationship is due to random chance. Therefore, we have very high confidence that this predictor is indeed related to the outcome variable.

  • ** Significant (p < 0.01). This indicates strong evidence against the null hypothesis. A p-value between 0.001 and 0.01 means there’s less than a 1% chance that the observed relationship is due to random chance. We can still be quite confident that this predictor is related to the outcome variable, though not as strong as in the previous category.

  • * Marginally significant (p < 0.05). This indicates moderate evidence against the null hypothesis. A p-value between 0.01 and 0.05 means there’s less than a 5% chance that the observed relationship is due to random chance. While this predictor may have a real effect on the outcome variable, the evidence is weaker compared to predictors with smaller p-values.

Now let’s take a look at a key model fit statistic, the Akaike Information Criterion (AIC), which is a metric used to evaluate how well a model fits the data while also considering the complexity of the model. It not only evaluates how well the model explains the data but also penalizes the model for having too many predictors.

Our model has an AIC Value of 1387.5; however, you can’t interpret the AIC value in isolation as “good” or “bad”; instead, you compare it across models.

  • A lower AIC value indicates a model that achieves a better balance between goodness-of-fit and simplicity.
  • It means the model provides a good explanation of the data without being overly complex.

If we were to add or remove predictors from our model to test other models, the model with the lowest AIC would be considered the best among the compared models, as it suggests the optimal combination of fit and simplicity.
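
As a sketch of how such a comparison might look, the code below fits a hypothetical smaller model (m2 is not part of this case study) and compares AIC values with the AIC() function. One caveat: because each model drops rows that are missing its own predictors, a fair comparison would first restrict both models to the same complete cases.

# A hypothetical smaller model with fewer predictors, for illustration only
m2 <- glm(good_grad_rate ~ pct_admitted + tuition_fees + percent_fin_aid + avg_salary, 
          data = ipeds, 
          family = "binomial")

# Lower AIC suggests a better balance of fit and simplicity
AIC(m1, m2)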

Summary

Overall, the model appears to be a good fit for understanding factors related to a “good” graduation rate, with tuition fees, percent financial aid, and average salary being especially influential factors. Specifically, the predictors pct_admitted, n_bach, n_doc, tuition_fees, percent_fin_aid, and avg_salary are highly significant in predicting whether a university has a good graduation rate.

👉 Your Turn

In the space below, note a few similarities and/or differences between the model and its interpretation described above and the model you ran with your own determination of a “good” graduation rate:

  • NOTE 1 When I ran my version, the same big factors popped up — tuition, percent financial aid, and average salary really drove the model. That makes me think those variables are consistently meaningful no matter where the cutoff for “good” gets set.

  • NOTE 2 Some coefficients shifted around once I changed the threshold for graduation rate. It reminded me that how we define “good” isn’t neutral. The choice directly changes what looks important.

  • NOTE 3 My AIC came out a little different too. That showed me model fit isn’t just about the predictors, but also how we frame the outcome. Small tweaks in setup can make the results look more or less convincing.

4b. Supervised Machine Learning

Recall from our readings that there are two general types of machine learning approaches: unsupervised and supervised learning. In this section we focus on supervised learning models, which are used to quantify relationships between features (e.g., % of students admitted and tuition fees) and a known outcome (i.e., a good graduation rate). These models can be used for statistical inference, as illustrated above, or primarily for prediction purposes, as we’ll illustrate below.

In this section we’ll learn five basic machine learning steps for building, training, and evaluating a model. Specifically, we will use the {tidymodels} package in R and learn how to:

  1. Split Data into a training and test set that will be used to develop a predictive model;

  2. Create a “Recipe” for our predictive model and learn how to deal with data that we would like to use as predictors;

  3. Specify model and workflow by selecting the functional form of the model that we want and using a model workflow to pair our model and recipe together;

  4. Fit model to our training set using logistic regression;

  5. Assess Accuracy of our model on our testing set to see how well our model can “predict” a good graduation rate.

Step 1. Split data

The authors of Data Science in Education Using R (Estrellado et al., 2020) remind us that:

At its core, machine learning is the process of “showing” your statistical model only some of the data at once and training the model to predict accurately on that training dataset (this is the “learning” part of machine learning). Then, the model as developed on the training data is shown new data - data you had all along, but hid from your computer initially - and you see how well the model that you developed on the training data performs on this new testing data. Eventually, you might use the model on entirely new data.

Training and Testing Sets

It is therefore common when beginning a machine learning project to separate the data set into two partitions:

  • The training set is used to estimate model parameters, develop and compare models, try out feature engineering techniques, tune models, and so on.

  • The test set is held in reserve until the end of the project, at which point there should only be one or two models under serious consideration. It is used as an unbiased source for measuring final model performance.

There are different ways to create these partitions of the data and there is no uniform guideline for determining how much data should be set aside for testing. The proportion of data can be driven by many factors, including the size of the original pool of samples and the total number of predictors. 

After you decide how much to set aside, the most common approach for actually partitioning your data is to use a random sample. For our purposes, we’ll use random sampling to select 20% of the data for the test set and use the remainder for the training set (a slightly larger training share than the 75/25 split that {rsample} uses by default).

Split data set

To split our data, we will be using our first {tidymodels} function - initial_split().

The initial_split() function from the {rsample} package takes the original data and saves the information on how to make the partitions. The {rsample} package has two aptly named functions for creating a training and a testing data set, called training() and testing(), respectively.

We also specify the strata argument to ensure there is no imbalance in the dependent variable (good_grad_rate) across the training and testing sets.

Run the following code to split the data:

set.seed(123) # Setting a seed for reproducibility

# Split the data into training (80%) and testing (20%) sets
data_split <- initial_split(ipeds, prop = 0.8, strata = good_grad_rate) 

# Create training and testing datasets
training_data <- training(data_split)

testing_data <- testing(data_split)

Let’s break down what we just did a little further:

  • initial_split(ipeds, prop = 0.8) splits our ipeds data into 80% for training and 20% for testing.

  • strata = good_grad_rate ensures stratified sampling, meaning the proportion of good_grad_rate categories is maintained in both the training and testing sets.

  • training(data_split) extracts the training set.

  • testing(data_split) extracts the testing set.

Note: Since random sampling uses random numbers, it is important to set the random number seed using the set.seed() function. This ensures that the random numbers can be reproduced at a later time (if needed). Here we use 123, but any number will work!

👉 Your Turn

Go ahead and type training_data and testing_data into the console (in steps) to check that the training set indeed has about 80% of the observations in the original data. Do that in the chunk below:

training_data
# A tibble: 3,054 × 14
   name    title_iv carnegie_class state total_enroll pct_admitted n_bach n_mast
   <chr>   <chr>    <chr>          <chr>        <dbl>        <dbl>  <dbl>  <dbl>
 1 Alabam… Title I… Master's Coll… Alab…         6007           68    511    249
 2 Amridg… Title I… Master's Coll… Alab…          647           NA     54     96
 3 Centra… Title I… Associate's C… Alab…         1777           NA     NA     NA
 4 Athens… Title I… Baccalaureate… Alab…         2894           NA    738     80
 5 Chatta… Title I… Associate's C… Alab…         1641           NA     NA     NA
 6 South … Title I… Baccalaureate… Alab…          361           NA     31     17
 7 Enterp… Title I… Associate's C… Alab…         2010           NA     NA     NA
 8 Coasta… Title I… Associate's C… Alab…         6800           NA     NA     NA
 9 Faulkn… Title I… Master's Coll… Alab…         2817           82    448    347
10 Gadsde… Title I… Associate's C… Alab…         4352           NA     NA     NA
# ℹ 3,044 more rows
# ℹ 6 more variables: n_doc <dbl>, tuition_fees <dbl>, grad_rate <dbl>,
#   percent_fin_aid <dbl>, avg_salary <dbl>, good_grad_rate <fct>
testing_data
# A tibble: 764 × 14
   name    title_iv carnegie_class state total_enroll pct_admitted n_bach n_mast
   <chr>   <chr>    <chr>          <chr>        <dbl>        <dbl>  <dbl>  <dbl>
 1 Alabam… Title I… Doctoral/Prof… Alab…         3828           97    480    119
 2 Auburn… Title I… Master's Coll… Alab…         5109           92    672    300
 3 John C… Title I… Associate's C… Alab…         8163           NA     NA     NA
 4 Marion… Title I… Associate's C… Alab…          320           63     NA     NA
 5 H Coun… Title I… Associate's C… Alab…         1984           NA     NA     NA
 6 Troy U… Title I… Master's Coll… Alab…        14156           95   2273   1204
 7 Alaska… Title I… Special Focus… Alas…          268           NA     NA     NA
 8 Carrin… Title I… Special Focus… Ariz…          532           NA     NA     NA
 9 Brookl… Title I… Special Focus… Ariz…          888           NA    248     16
10 Arizon… Title I… Doctoral Univ… Ariz…        80065           90  14960   3727
# ℹ 754 more rows
# ℹ 6 more variables: n_doc <dbl>, tuition_fees <dbl>, grad_rate <dbl>,
#   percent_fin_aid <dbl>, avg_salary <dbl>, good_grad_rate <fct>
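
Another quick check, sketched below, is to compare the number of rows in each partition to the full data set; the first value should be close to 0.8 and the second close to 0.2.

# Proportion of rows in the training and testing sets
nrow(training_data) / nrow(ipeds)

nrow(testing_data) / nrow(ipeds)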

Step 2: Create a “Recipe”

In this section, we introduce another tidymodels package named {recipes}, which is designed to help you prepare your data before training your model. Recipes are built as a series of preprocessing steps, such as:

  • converting qualitative predictors to indicator variables (also known as dummy variables),

  • transforming data to be on a different scale (e.g., taking the logarithm of a variable),

  • transforming whole groups of predictors together,

  • extracting key features from raw variables (e.g., getting the day of the week out of a date variable), and so on.

If you are familiar with R’s formula interface, a lot of this might sound familiar and is very similar to what we did above in the statistical inference section. Recipes can be used to do many of the same things, but they have a much wider range of possibilities.

Add a Formula

To get started, let’s create a recipe for a simple logistic regression model. The recipe() function as we used it here has two arguments:

  • A formula. Any variable on the left-hand side of the tilde (~) is considered the model outcome (good_grad_rate, in our present case). On the right-hand side of the tilde are the predictors. Variables may be listed by name, or you can use the dot (.) to indicate all other variables as predictors.

  • The data. A recipe is associated with the data set used to create the model. This will typically be the training set, so data = training_data here. Naming a data set doesn’t actually change the data itself; it is only used to catalog the names of the variables and their types, like factors, integers, dates, etc.

Run the following code chunk to create a recipe where we predict good_grad_rate (the outcome variable) using the same predictor variables we used above:

my_recipe <- recipe(good_grad_rate ~ 
                   total_enroll + 
                   pct_admitted + 
                   n_bach + 
                   n_mast + 
                   n_doc + 
                   tuition_fees + 
                   percent_fin_aid + 
                   avg_salary, 
                 data = training_data)

Code Explanation

Again, let’s unpack what is happening in this code:

  • This code defines a recipe that outlines which variables (total_enroll, pct_admitted, etc.) will be used to predict the outcome (good_grad_rate) and ensures that this process is applied to the training_data.

  • At this stage, no actual preprocessing (e.g., scaling, imputation) is performed. The recipe simply sets up the structure, which can be extended later to include data cleaning, transformations, or feature engineering as needed.

By defining the recipe, you make the analysis more modular, allowing you to handle data preprocessing consistently and efficiently before training the machine learning model. These topics are covered in much greater depth in the Machine Learning course.
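
As one illustration, here is a hedged sketch of how the recipe could be extended with two common preprocessing steps from {recipes}. We do not use this extended recipe in the rest of the case study; it is only meant to show the idea.

# A hypothetical extension of our recipe (not used below)
my_recipe_extended <- my_recipe |> 
    step_impute_median(all_numeric_predictors()) |>  # fill missing numeric predictors with their medians
    step_normalize(all_numeric_predictors())         # center and scale the numeric predictors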

Step 3: Specify a Model

With tidymodels, we next start building a model by specifying the functional form of the model that we want using the {parsnip} package. Since our outcome is binary, the model type we will use is “logistic regression.” We can declare this with logistic_reg() and assign it to an object we will later use in our workflow:

👉 Your Turn

In the code chunk below, specify our model by assigning the logistic_reg() function to a new object named my_model:

# specify model
my_model <- logistic_reg()

my_model
Logistic Regression Model Specification (classification)

Computational engine: glm 

Great! If you did this correctly you should see the following output:

Logistic Regression Model Specification (classification)

Computational engine: glm 
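
As the output notes, parsnip pairs this specification with the glm computational engine by default. If you prefer to make the engine and mode explicit, an equivalent specification looks like the sketch below.

my_model <- logistic_reg() |> 
    set_engine("glm") |>          # use base R's glm() to fit the model
    set_mode("classification")    # logistic regression is a classification model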

Add Model and Recipe to Workflow

Finally, we’ll create a “workflow”, which pairs a model and recipe together. Although we’ll just be using a single recipe and model here, bundling them is a straightforward approach because different recipes are often needed for different models; when a model and recipe are bundled, it becomes easier to train and test workflows.

We’ll use the {workflows} package from tidymodels to bundle our parsnip model (my_model) with our recipe (my_recipe).

Add your model and recipe (see their names above)!

my_workflow <- workflow() |> # create a workflow
    add_model(my_model) |> # add the model we wrote above
    add_recipe(my_recipe) # add our recipe we wrote above

Step 4: Fit Model to Training Data

Now that we have a single workflow that can be used to prepare the recipe and train the model from the resulting predictors, we can use the fit() function to fit our model to our training_data.

fit_model <- fit(my_workflow, training_data)

👉 Your Turn

Type the output of the above function (the name you assigned the output to) below; this is the final, fitted model—one that can be interpreted further in the next step!

fit_model
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
    (Intercept)     total_enroll     pct_admitted           n_bach  
     -6.572e-01       -1.559e-04       -1.125e-02        9.650e-04  
         n_mast            n_doc     tuition_fees  percent_fin_aid  
     -1.946e-04        6.692e-03        7.507e-05       -3.643e-02  
     avg_salary  
      3.077e-05  

Degrees of Freedom: 1284 Total (i.e. Null);  1276 Residual
  (1769 observations deleted due to missingness)
Null Deviance:      1721 
Residual Deviance: 1162     AIC: 1180

Great work!
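
If you would like the coefficients in a tidier format than the printout above, one option (a small sketch) is to pull the underlying parsnip fit out of the workflow with extract_fit_parsnip() and pass it to tidy():

# Extract the fitted parsnip model from the workflow and tidy its coefficients
fit_model |> 
    extract_fit_parsnip() |> 
    tidy()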

Step 5: Assess Accuracy on Test Data

Finally, run the following code snippet to generate predictions from the trained logistic regression model (fit_model) on the test dataset (testing_data) and then combine those predictions with the original test data.

predictions <- predict(fit_model, testing_data) |> 
  bind_cols(testing_data)

predictions
# A tibble: 764 × 15
   .pred_class name      title_iv carnegie_class state total_enroll pct_admitted
   <fct>       <chr>     <chr>    <chr>          <chr>        <dbl>        <dbl>
 1 0           Alabama … Title I… Doctoral/Prof… Alab…         3828           97
 2 0           Auburn U… Title I… Master's Coll… Alab…         5109           92
 3 <NA>        John C C… Title I… Associate's C… Alab…         8163           NA
 4 <NA>        Marion M… Title I… Associate's C… Alab…          320           63
 5 <NA>        H Counci… Title I… Associate's C… Alab…         1984           NA
 6 0           Troy Uni… Title I… Master's Coll… Alab…        14156           95
 7 <NA>        Alaska C… Title I… Special Focus… Alas…          268           NA
 8 <NA>        Carringt… Title I… Special Focus… Ariz…          532           NA
 9 <NA>        Brooklin… Title I… Special Focus… Ariz…          888           NA
10 1           Arizona … Title I… Doctoral Univ… Ariz…        80065           90
# ℹ 754 more rows
# ℹ 8 more variables: n_bach <dbl>, n_mast <dbl>, n_doc <dbl>,
#   tuition_fees <dbl>, grad_rate <dbl>, percent_fin_aid <dbl>,
#   avg_salary <dbl>, good_grad_rate <fct>

Again, let’s break down what is happening with this code and then interpret the output:

  • The predict() function produces a set of predictions for the outcome variable (good_grad_rate) for each observation in the testing_data.
    • fit_model: The logistic regression model that was previously trained on the training data.
    • testing_data: The dataset used to test the model’s performance, which contains new, unseen data.
  • The bind_cols() function merges the predicted values with the original test dataset row by row, resulting in a data frame that contains both the actual values (from testing_data) and the model’s predicted values side by side.
  • The predictions object now contains a combined dataset that includes:
    • All original columns from testing_data.
    • A new column (or columns) with the predicted values from the model, allowing you to compare the actual values with the predicted ones.
  • When you print predictions, you’ll see a data frame that includes both the original test data and the predicted values for good_grad_rate. This allows you to analyze how well the model’s predictions match the actual values.

This step is crucial for evaluating the model’s accuracy and effectiveness in making predictions on unseen data, which helps determine how well the model generalizes beyond the training data.

The output represents the predicted classification (.pred_class) for whether an institution has a “good” graduation rate (1 indicates “good,” and 0 indicates “not good”) based on the trained logistic regression model, along with some details about each institution, such as their name and title_iv status.

Interpretation of the Output:

  1. .pred_class: This column contains the predicted class labels for each institution:

    • 1: The model predicts that the institution has a “good” graduation rate.
    • 0: The model predicts that the institution does not have a “good” graduation rate.
    • NA: The model could not make a prediction for these institutions. This might be due to missing data in the predictor variables or data preprocessing issues that caused the model to be unable to generate a prediction.
  2. name: This column lists the names of the institutions.

  3. title_iv: This column indicates whether the institution is a “Title IV postsecondary institution,” meaning they are eligible for federal financial aid programs.

  • The model has successfully classified some institutions into having either a good (1) or not good (0) graduation rate.
  • The presence of NA values suggests that data issues (e.g., missing predictor values or cases where the model couldn’t confidently make a prediction) need to be investigated further to ensure comprehensive predictions.
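
Note that predict() on a fitted workflow can also return predicted probabilities instead of hard class labels by setting type = "prob"; a sketch is below (parsnip typically names these columns .pred_0 and .pred_1).

# Predicted class probabilities instead of class labels
predict(fit_model, testing_data, type = "prob") |> 
    bind_cols(testing_data)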

Accuracy Metrics

Our final step is to assess model accuracy by comparing the predicted values (.pred_class) with the actual good_grad_rate values for the test set to evaluate the model’s performance on predicting graduation rates.

So, how accurate was our predictive model? Consider how well our model would have done by chance alone – what would the accuracy be in that case (with the model predicting a “good” graduation rate one-half of the time)?

Run the following code

# Calculate accuracy on the testing set
accuracy <- predictions |> 
  metrics(truth = good_grad_rate, estimate = .pred_class) |>
  filter(.metric == "accuracy")

# Print accuracy
accuracy
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.807

You probably saw output similar to the table below:

.metric    .estimator   .estimate
accuracy   binary       0.8193146

Let’s break these down:

  1. .metric = "accuracy": Indicates that the metric being reported is accuracy.
  2. .estimator = "binary": Specifies that the model is a binary classifier, meaning the model predicts one of two possible outcomes (in this case, predicting whether good_grad_rate is 0 or 1).
  3. .estimate = 0.8193146: This value represents the accuracy of the model’s predictions on the test dataset. An accuracy of 0.8193 (or 81.93%) means that the model correctly predicted whether universities have a “good” graduation rate in approximately 81.93% of the cases in the test set.

Accuracy is the proportion of correct predictions made by the model out of all predictions. It is calculated as:

\(\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}\)
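You can check this definition directly against the predictions object. A minimal sketch (rows with an NA prediction are dropped, which matches yardstick's default behavior):

# Accuracy "by hand": the share of predictions that match the actual values
mean(predictions$.pred_class == predictions$good_grad_rate, na.rm = TRUE)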

Accuracy, like the AIC discussed above, is just one of several metrics we could use to tell us how well our model performs (see the sketch after the code explanation below). It is an important indicator of how well the model generalizes to unseen data, which is central to evaluating its overall performance.

The model’s accuracy on the test dataset reveals that the model correctly predicts the outcome (a good graduation rate) approximately 81.93% of the time, which on the surface suggests that it is pretty good at predicting whether a college has a good graduation rate.

Code Explanation

Before moving on, let’s also break down the code we used to generate this output:

  • The metrics() function calculates various performance metrics for the model’s predictions, which we saved as predictions.

    • truth = good_grad_rate: Specifies the actual values of the target variable (good_grad_rate) from the test data.

    • estimate = .pred_class: Refers to the predicted values (i.e., the predicted class labels 0 or 1) generated by the model.

This step produces a data frame of evaluation metrics (for hard class predictions like ours, accuracy and Cohen’s kappa by default) that compare the actual values (truth) with the predicted values (estimate), which we then passed on to:

  • filter(.metric == "accuracy"), which extracts only the row corresponding to the “accuracy” metric from the resulting data frame.

  • The accuracy object now contains a single row representing the accuracy of the model on the test data.
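As noted above, accuracy is only one lens on model performance. If you want a fuller picture, {yardstick} also provides a confusion matrix and class-specific metrics. Below is a minimal sketch reusing the predictions object; note that yardstick treats the first factor level as the "event" by default, and we namespace accuracy because we created a tibble with that name above.

# Drop the rows the model could not score, then cross-tabulate predicted vs. actual
predictions_complete <- predictions |>
  filter(!is.na(.pred_class))

predictions_complete |>
  conf_mat(truth = good_grad_rate, estimate = .pred_class)

# Bundle several metrics into a single callable metric set
class_metrics <- metric_set(yardstick::accuracy, sens, spec)

predictions_complete |>
  class_metrics(truth = good_grad_rate, estimate = .pred_class)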

👉 Your Turn

The key thing to observe at this point is what is similar and what is different between the two approaches (regression and supervised machine learning). Both used the same underlying statistical model, but there were some stark differences. Add two or more similarities and two or more differences (no wrong answers!) below.

Similarities:

  • SIMILARITY 1 Both approaches relied on the same predictors (enrollment, admission rate, tuition, aid, etc.) to estimate the likelihood of a “good” graduation rate.
  • SIMILARITY 2 Both regression and supervised machine learning used logistic regression as the underlying statistical model, so the mathematical foundation was the same.

Differences:

  • DIFFERENCE 1 Regression used the full dataset to estimate relationships, while supervised machine learning explicitly split the data into training and testing sets to evaluate generalizability.

  • DIFFERENCE 2 Regression focused on inference and explanation, reporting coefficients, significance, and model fit, while supervised machine learning focused on prediction accuracy on unseen data.

5. COMMUNICATE

Recall that the final step in the workflow/process is sharing the results of your analysis with a wider audience. Krumm et al. (2018) outline the following three-step process for communicating findings from an analysis to education stakeholders:

  1. Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”

  2. Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.

  3. Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question.

In this case study, we focused on applying some basic machine learning techniques to understand how a predictive model could be used to identify colleges with good graduation rates. Specifically, we made a very crude first attempt at developing a model using machine learning techniques, one that seemed to perform reasonably well at predicting graduation rates.

For this case study, let’s simply return to our research question:

How well can we predict drop-out rates among four-year universities?

👉 Your Turn

At the beginning of this section you chose for yourself what constitutes a good graduation rate. Now create your own recipe and see how it performs.

Use the code chunk below to repeat our machine learning steps but this time modify the recipe to see if you can create a model that predicts graduation outcomes with a higher level of accuracy. Feel free to copy and paste from our code above to assist you with this process.

set.seed(123) # setting a seed for reproducibility

# Split the data into training (72%) and testing (28%) sets, stratified by the outcome
my_data_split <- initial_split(ipeds, prop = 0.72, strata = good_grad_rate)

# Create training and testing datasets
my_training_data <- training(my_data_split)

my_testing_data <- testing(my_data_split)

my_new_recipe <- recipe(good_grad_rate ~ 
                   total_enroll + 
                   pct_admitted + 
                   n_bach + 
                   n_mast + 
                   n_doc + 
                   tuition_fees + 
                   percent_fin_aid + 
                   avg_salary, 
                 data = my_training_data)
my_new_model <- logistic_reg()

my_new_model
Logistic Regression Model Specification (classification)

Computational engine: glm 
my_new_workflow <- workflow() |> # create a workflow
    add_model(my_new_model) |> # add the model we wrote above
    add_recipe(my_new_recipe) # add our recipe we wrote above

fit_new_model <- fit(my_new_workflow, my_training_data)
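To answer the questions below, you would also need to evaluate this new model on its held-out testing set, just as we did earlier. A minimal sketch (the object name my_new_predictions is ours; everything else comes from the chunk above):

# Predict on the new testing set and attach the actual values
my_new_predictions <- predict(fit_new_model, my_testing_data) |>
  bind_cols(my_testing_data)

# Calculate accuracy for the new model
my_new_predictions |>
  metrics(truth = good_grad_rate, estimate = .pred_class) |>
  filter(.metric == "accuracy")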

What did you find? How accurately were you able to predict “good” graduation rates? Focus on what you would communicate about this analysis to a general audience, again, keeping in mind this is based on your very initial interpretations.

  • With the slightly different split and recipe, the accuracy came out close to the earlier model — solid, but not perfect. To me, that says the predictors we’re using (enrollment, admission rate, tuition, aid, etc.) are doing most of the heavy lifting already. What stood out is how even small tweaks in setup (like shifting the training/test split) can nudge the results, but the overall story stays the same: some institutional features are strongly tied to graduation outcomes. If I were explaining this to a general audience, I’d put it simply — “we can predict graduation success about 8 out of 10 times using this kind of data, but the exact numbers shift depending on how you set up the model.”

Note: This exercise (especially the supervised machine learning model and its output) is very likely new to you. It is meant to elicit your initial perceptions rather than the "right" answer, so don’t stress too much about creating highly accurate models.

Congratulations!

You’ve completed the Unit 2 Case Study: Introduction to Machine Learning! To “turn in” your work, you can click the “Render” icon in the menu bar above. This will create an HTML report in your Files pane that serves as a record of your completed assignment and that can be opened in a browser or shared on the web.

References

Krumm, Andrew, Barbara Means, and Marie Bienkowski. 2018. Learning Analytics Goes to School. Routledge. https://doi.org/10.4324/9781315650722.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. O’Reilly Media, Inc. https://r4ds.hadley.nz.
Zong, Chen, and Alan Davis. 2024. “Modeling University Retention and Graduation Rates Using IPEDS.” Journal of College Student Retention: Research, Theory & Practice 26 (2): 311–33.