Explaining or Predicting Graduation Rates Using IPEDS

Case Study

Author

LASER Institute - J.Carhart

Published

July 16, 2025

1. PREPARE

The LASER workflow

Each supervised machine learning “case study” is designed to illustrate how supervised machine learning methods and techniques can be applied to address a research question of interest, create useful data products, and conduct reproducible research. Each case study is structured around a basic analytics workflow modeled after the Data-Intensive Research Workflow from Learning Analytics Goes to School (Krumm et al., 2018):

Figure 2.2 Steps of Data-Intensive Research Workflow

In the overview presentation for this learning lab, we considered five steps in our supervised machine learning process. Those steps are mirrored here in this case study, with the addition of some other components of this workflow. For example, to help prepare for analysis, we’ll first take a step back and think about how we want to use machine learning, and predicting is a key word. Many scholars have focused on predicting which students are at risk of dropping a course or not succeeding in it. This ML Learning Lab 1 case study will cover the following workflow topics as we attempt to develop our own model for predicting student drop-out:

  1. Prepare: Prior to analysis, we’ll look at the context from which our data came, formulate a basic research question, and get introduced to the {tidymodels} packages for machine learning.

  2. Wrangle: Wrangling data entails the work of cleaning, transforming, and merging data. In Part 2 we focus on importing CSV files and modifying some of our variables.

  3. Explore: We take a quick look at our variables of interest and do some basic “feature engineering” by creating some new variables we think will be predictive of students at risk.

  4. Model: We dive deeper into the five steps in our supervised machine learning process, focusing on the mechanics of making predictions.

  5. Communicate: To wrap up our case study, we’ll create our first “data product” and share our analyses and findings by creating our first web page using R Markdown.

In this module, we will be using data from the IPEDS, the Integrated Postsecondary Education Data System. We use Zong and Davis’ (2022) study as an inspiration for ours. These authors used inferential models to try to understand what relates to the graduation rates of around 700 four-year universities in the United States, predicting this outcome on the basis of student background, finance, academic and social environment, and retention rate independent variables. You can find this in the lit folder (with an elaboration and discussion questions in the Readings file for this module).

Zong, C., & Davis, A. (2022). Modeling university retention and graduation rates using IPEDS. Journal of College Student Retention: Research, Theory & Practice, 15210251221074379.

Loading packages

As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR), one of the first steps of every workflow should be to set up your “Project” within RStudio. Recall that:

A Project is the home for all of the files, images, reports, and code that are used in any given project

Since we are working from an R project cloned from GitHub, a Project has already been set up for you as indicated by the .Rproj file in your main directory in the Files pane. Instead, we will focus on getting our project set up with the requisite packages we’ll need for analysis.

Packages, sometimes called libraries, are shareable collections of R code that can contain functions, data, and/or documentation and extend the functionality of R. You can always check to see which packages have already been installed and loaded into RStudio Cloud by looking at the Files, Plots, & Packages Pane in the lower right-hand corner.

Two packages we’ll use extensively throughout these learning labs are the {tidyverse} and {tidymodels} packages.

tidyverse 📦

One package that we’ll be using extensively throughout LASER is the {tidyverse} package. Recall from earlier tutorials that the {tidyverse} package is actually a collection of R packages designed for reading, wrangling, and exploring data, all of which share an underlying design philosophy, grammar, and data structures. These shared features are sometimes called “tidy data principles.”

Click the green arrow in the right corner of the “code chunk” that follows to load the {tidyverse} library.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

tidymodels

The tidymodels package is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse. It includes a core set of packages that are loaded on startup and contains tools for:

  • data splitting and pre-processing;

  • model selection, tuning, and evaluation;

  • feature selection and variable importance estimation;

  • as well as other functionality.

👉 Your Turn

In addition to the {tidyverse} package, we’ll also be using {tidymodels} and the lightweight but highly useful {janitor} package to help with some data cleaning tasks. Use the code chunk below to load both tidymodels and janitor.

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.8     ✔ rsample      1.3.0
✔ dials        1.4.0     ✔ tune         1.3.0
✔ infer        1.0.9     ✔ workflows    1.2.0
✔ modeldata    1.4.0     ✔ workflowsets 1.1.1
✔ parsnip      1.3.2     ✔ yardstick    1.3.2
✔ recipes      1.3.1     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test

As a tip, remember to use the library() function to load these packages. After you’ve done that, click the green arrow to run the code chunk. If you see a bunch of messages (not anything labeled as an error), you are good to go! These messages mean the packages loaded correctly.

Loading data

Next, we’ll read in data. We’ll use the read_csv() function to load the data file.

For now, please read in the `ipeds-all-title-9-2022-data.csv` file. Use the read_csv() function to do this, paying attention to where that file is located relative to this case study file: in the data folder!

ipeds <- read_csv("data/ipeds-all-title-9-2022-data.csv")
Rows: 5988 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): institution name, HD2022.Postsecondary and Title IV institution in...
dbl (15): unitid, year, DRVADM2022.Percent admitted - total, DRVIC2022.Tuiti...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
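The message above suggests two ways to quiet the column specification: declare the column types or set `show_col_types = FALSE`. The following is a minimal sketch of the first option, assuming a small made-up CSV passed as literal text (wrapped in I()) rather than the real IPEDS file; the rows and the `grad rate` column name are hypothetical.

```r
library(readr)

# Hypothetical three-column CSV supplied as literal text instead of a file path
csv_text <- I("unitid,institution name,grad rate\n1,Example University,27\n2,Sample College,64\n")

# Declaring col_types up front means read_csv() does not need to guess,
# so no column specification message is printed
toy <- read_csv(
  csv_text,
  col_types = cols(
    unitid = col_double(),
    `institution name` = col_character(),
    `grad rate` = col_double()
  )
)

toy
```

Declaring types also protects against a numeric column being silently read as text when a file contains unexpected values.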

We’ll then use a handy function from janitor, clean_names(). It does what it seems like it should - it cleans the names of variables, making them easy to view and type. Run this next code chunk.

ipeds <- janitor::clean_names(ipeds)
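To see what the cleaning does to individual names, here is a small sketch using janitor's make_clean_names(), which applies the same kind of cleaning to a plain character vector. The two raw names are taken from the read_csv() message above; the expected results are the cleaned names that appear in this case study's glimpse() output.

```r
library(janitor)

# Two raw column names as they appear in the CSV header
raw_names <- c("institution name", "DRVADM2022.Percent admitted - total")

# Spaces, periods, and dashes become underscores; letters become lower case
cleaned <- make_clean_names(raw_names)

cleaned
```

Names cleaned this way can be typed without backticks, which is what makes the later select() call manageable.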

👉 Your Turn

In the chunk below, examine the data set using a function or means of your choice (such as just printing the data set by typing its name or using the glimpse() function). Do this in the code chunk below! Note its dimensions — especially how many rows it has!

glimpse(ipeds)
Rows: 5,988
Columns: 24
$ unitid                                                                                  <dbl> …
$ institution_name                                                                        <chr> …
$ year                                                                                    <dbl> …
$ hd2022_postsecondary_and_title_iv_institution_indicator                                 <chr> …
$ drvadm2022_percent_admitted_total                                                       <dbl> …
$ drvic2022_tuition_and_fees_2021_22                                                      <dbl> …
$ hd2022_institution_size_category                                                        <chr> …
$ hd2022_state_abbreviation                                                               <chr> …
$ hd2022_carnegie_classification_2021_basic                                               <chr> …
$ drvef2022_total_enrollment                                                              <dbl> …
$ drvef122022_total_12_month_unduplicated_headcount                                       <dbl> …
$ drvc2022_bachelors_degree                                                               <dbl> …
$ drvc2022_masters_degree                                                                 <dbl> …
$ drvc2022_doctors_degree_research_scholarship                                            <dbl> …
$ drvgr2022_graduation_rate_total_cohort                                                  <dbl> …
$ sfa2122_percent_of_full_time_first_time_undergraduates_awarded_any_financial_aid        <dbl> …
$ xanyaidp                                                                                <chr> …
$ drvhr2022_average_salary_equated_to_9_months_of_full_time_instructional_staff_all_ranks <dbl> …
$ sfa2122_average_net_price_students_awarded_grant_or_scholarship_aid_2021_22             <dbl> …
$ xnpist2                                                                                 <chr> …
$ adm2022_sat_evidence_based_reading_and_writing_50th_percentile_score                    <dbl> …
$ xsatvr50                                                                                <chr> …
$ adm2022_act_composite_50th_percentile_score                                             <dbl> …
$ xactcm50                                                                                <chr> …

👉 Your Turn

Write down a few observations after inspecting the data. Any and all observations are welcome!

  • There is missing data; for inferential analysis, it is often emphasized that missing values (NA) should be removed.

  • The variable names can be shortened.

  • I expected about 700 institutions (rows), not 5,988.

2. WRANGLE

Even though we cleaned the names to make them easier to view and type (thanks, clean_names()), they are still pretty long.

The code chunk below uses a very handy function, select(). This allows you to simultaneously choose and rename variables, returning a data frame with only the variables you have selected — named as you like. For now, we’ll just run this code. Later in your analyses, you’ll almost certainly use select() to get a more manageable dataset.

ipeds <- ipeds %>% 
    select(name = institution_name, 
           title_iv = hd2022_postsecondary_and_title_iv_institution_indicator, # is the university a title IV university?
           carnegie_class = hd2022_carnegie_classification_2021_basic, # which carnegie classification
           state = hd2022_state_abbreviation, # state
           total_enroll = drvef2022_total_enrollment, # total enrollment
           pct_admitted = drvadm2022_percent_admitted_total, # percentage of applicants admitted
           n_bach = drvc2022_bachelors_degree, # number of students receiving a bachelor's degree
           n_mast = drvc2022_masters_degree, # number receiving a master's
           n_doc = drvc2022_doctors_degree_research_scholarship, # number receiving a doctoral degree
           tuition_fees = drvic2022_tuition_and_fees_2021_22, # total cost of tuition and fees
           grad_rate = drvgr2022_graduation_rate_total_cohort, # graduation rate
           percent_fin_aid = sfa2122_percent_of_full_time_first_time_undergraduates_awarded_any_financial_aid, # percent of students receiving financial aid
           avg_salary = drvhr2022_average_salary_equated_to_9_months_of_full_time_instructional_staff_all_ranks) # average salary of instructional staff
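The choose-and-rename pattern used in the chunk above can be seen in isolation with a toy example. This is only a sketch: the tibble and its values are made up, although the two long column names match ones in the IPEDS data.

```r
library(dplyr)

# Made-up data with two long IPEDS-style names and one column we do not need
toy <- tibble(
  unitid = c(1, 2),
  institution_name = c("Example University", "Sample College"),
  drvgr2022_graduation_rate_total_cohort = c(27, 64)
)

# new_name = old_name renames while selecting; unlisted columns are dropped
toy_small <- toy %>%
  select(name = institution_name,
         grad_rate = drvgr2022_graduation_rate_total_cohort)

names(toy_small)
```

Because unitid was not listed, it is dropped from the result, just as the columns not named in the chunk above are dropped from ipeds.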

A useful function for exploring data is count(); it does what it sounds like! It counts how many times values for a variable appear.

ipeds %>% 
    count(title_iv)
# A tibble: 2 × 2
  title_iv                                             n
  <chr>                                            <int>
1 Title IV NOT primarily postsecondary institution    30
2 Title IV postsecondary institution                5958

This suggests we may wish to filter out the 30 Title IV institutions that are not primarily postsecondary, something we’ll do shortly.
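When a variable has many distinct values, it can help to sort the counts so the most common values come first. A minimal sketch on made-up data follows; the state values here are hypothetical and are not drawn from the IPEDS file.

```r
library(dplyr)

# Hypothetical data: six institutions across three states
toy <- tibble(
  state = c("Alabama", "Alabama", "Alaska", "Arizona", "Arizona", "Arizona")
)

# sort = TRUE orders the result from the most to the least frequent value
state_counts <- toy %>%
  count(state, sort = TRUE)

state_counts
```

The same argument could be added to the count() calls in this case study, for example to see which Carnegie classifications are most common.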

👉 Your Turn

Can you count another variable? Pick another (see the code chunk two above) and add a count below. While simple, counting up different values in our data can be very informative (and can often lead to further explorations)!

ipeds %>% 
    count(carnegie_class)
# A tibble: 34 × 2
   carnegie_class                                                              n
   <chr>                                                                   <int>
 1 Associate's Colleges: High Career & Technical-High Nontraditional          88
 2 Associate's Colleges: High Career & Technical-High Traditional            106
 3 Associate's Colleges: High Career & Technical-Mixed Traditional/Nontra…   116
 4 Associate's Colleges: High Transfer-High Nontraditional                   105
 5 Associate's Colleges: High Transfer-High Traditional                      105
 6 Associate's Colleges: High Transfer-Mixed Traditional/Nontraditional      101
 7 Associate's Colleges: Mixed Transfer/Career & Technical-High Nontradit…   114
 8 Associate's Colleges: Mixed Transfer/Career & Technical-High Tradition…   104
 9 Associate's Colleges: Mixed Transfer/Career & Technical-Mixed Traditio…    97
10 Baccalaureate Colleges: Arts & Sciences Focus                             216
# ℹ 24 more rows

Filtering

filter() is a very handy function that is part of the tidyverse; it filters to include (or exclude) observations in your data based upon logical conditions (e.g., ==, >, <=, etc.). See more here if interested.

Below, we filter the data to include only Title IV postsecondary institutions.

ipeds <- ipeds %>% 
    filter(title_iv == "Title IV postsecondary institution")

👉 Your Turn

Can you filter the data again, this time to only include institutions with a Carnegie classification?

In other words, can you exclude those institutions with a value for the carnegie_class variable that is “Not applicable, not in Carnegie universe (not accredited or nondegree-granting)”? A little hint: whereas the logical operator == is used to include only matching conditions, the logical operator != excludes matching conditions.

ipeds <- ipeds %>% 
    filter(carnegie_class != "Not applicable, not in Carnegie universe (not accredited or nondegree-granting)")
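One detail worth knowing about filter() is that it keeps only rows where the condition is TRUE, so rows where the comparison evaluates to NA are dropped as well. The sketch below illustrates this on made-up data; the shortened "Not applicable" label is a stand-in for the full Carnegie label used above.

```r
library(dplyr)

# Made-up data: one excluded label and one missing classification
toy <- tibble(
  name = c("A", "B", "C", "D"),
  carnegie_class = c("Baccalaureate Colleges: Arts & Sciences Focus",
                     "Not applicable",
                     NA,
                     "Baccalaureate Colleges: Arts & Sciences Focus")
)

# != drops the matching row, and also row C, because NA != "..." is NA
kept <- toy %>%
  filter(carnegie_class != "Not applicable")

# To keep rows with a missing classification, say so explicitly
kept_with_na <- toy %>%
  filter(carnegie_class != "Not applicable" | is.na(carnegie_class))

kept$name
kept_with_na$name
```

Whether rows with a missing classification should be kept depends on the analysis; the point is only that the default behavior removes them silently.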

👉 Your Turn

We’re cruising! Let’s take another peek at our data, using glimpse() or another means of your choosing below.

glimpse(ipeds)
Rows: 3,818
Columns: 13
$ name            <chr> "Alabama A & M University", "University of Alabama at …
$ title_iv        <chr> "Title IV postsecondary institution", "Title IV postse…
$ carnegie_class  <chr> "Master's Colleges & Universities: Larger Programs", "…
$ state           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",…
$ total_enroll    <dbl> 6007, 21639, 647, 9237, 3828, 38644, 1777, 2894, 5109,…
$ pct_admitted    <dbl> 68, 87, NA, 78, 97, 80, NA, NA, 92, 44, 57, NA, NA, NA…
$ n_bach          <dbl> 511, 2785, 54, 1624, 480, 6740, NA, 738, 672, 5653, 26…
$ n_mast          <dbl> 249, 2512, 96, 570, 119, 2180, NA, 80, 300, 1415, 0, N…
$ n_doc           <dbl> 9, 166, 20, 41, 2, 215, NA, 0, 0, 284, 0, NA, 0, NA, N…
$ tuition_fees    <dbl> 10024, 8568, NA, 11488, 11068, 11620, 4930, NA, 8860, …
$ grad_rate       <dbl> 27, 64, 50, 63, 28, 73, 22, NA, 36, 81, 65, 26, 9, 29,…
$ percent_fin_aid <dbl> 87, 96, NA, 96, 97, 87, 87, NA, 99, 79, 100, 96, 92, 8…
$ avg_salary      <dbl> 77824, 106434, 36637, 92561, 72635, 97394, 63494, 8140…

3. EXPLORE

One key step in most analyses is to explore the data. Here, we conduct an exploratory data analysis with the IPEDS data, focusing on the key outcome of graduation rate.

Below, we use the ggplot2 package (part of the tidyverse) to visualize the spread of the values of our dependent variable, grad_rate, which represents institutions’ graduation rate. There is a lot to ggplot2, and data visualizations are not the focus of this module, but this web page has a lot of information you can use to learn more, if you are interested. ggplot2 is fantastic for creating publication-ready visualizations!

ipeds %>% 
    ggplot(aes(x = grad_rate)) +
    geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 418 rows containing non-finite outside the scale range
(`stat_bin()`).

What do you notice about this graph – and about graduation rate?

  • The distribution is slightly right-skewed: more data points appear at the lower graduation rates than at the higher ones.
  • The warning shows that 418 rows were removed because grad_rate was non-finite (NA, NaN, or Inf). I am not sure how to check for missing or invalid values to understand whether they are meaningful or should be cleaned.
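One way to check for missing or non-finite values is to count them per column. The following is a minimal sketch on made-up data; the column names mirror this case study, but the values are hypothetical.

```r
library(dplyr)

# Made-up data with a few missing values
toy <- tibble(
  grad_rate = c(27, 64, NA, 63),
  pct_admitted = c(68, NA, NA, 78),
  avg_salary = c(77824, 106434, 36637, 92561)
)

# Number of NA values in each column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Non-finite values (NA, NaN, Inf, -Inf) in one column, which is
# what the ggplot2 warning is counting
n_non_finite <- sum(!is.finite(toy$grad_rate))

na_counts
n_non_finite
```

Running the same summarise() call on ipeds would show which variables account for most of the missingness before deciding how to handle it.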

👉 Your Turn

Below, can you add one ggplot2 plot with a different variable/variables? Use the ggplot2 page linked above (also here) or the code above as a starting point (another histogram is fine!) for your visualization.

ipeds %>% 
    ggplot(aes(x = avg_salary, y = grad_rate, color = percent_fin_aid)) +
    geom_point()
Warning: Removed 478 rows containing missing values or values outside the scale range
(`geom_point()`).

# Schools with higher graduation rates tend to have higher average instructional staff salaries (avg_salary is the average salary of instructional staff, not of graduates). Schools with low graduation rates (<25%) mostly cluster at lower average salaries. Most points are light-to-medium blue, meaning most schools have a relatively high percentage of students receiving aid. There is no sharp color gradient driving the trend, so the positive link between graduation rate and salary holds across different aid levels. However, some darker points (lower aid %) appear among the highest graduation rates and salaries; these might be schools with wealthier student populations or large endowments.

We’ll next do a little additional data wrangling. For now, we’ll model our dependent variable, grad_rate, as a dichotomous (i.e., yes or no; 1 or 0) variable. This isn’t necessary, but it makes the contrast between the regression and the supervised machine learning model a bit more vivid. Dichotomous and categorical outcome variables are also common in supervised machine learning applications, so we’ll use one for this case study.

👉 Your Turn

Your next task is to decide what constitutes a good graduation rate. Our only suggestion - don’t pick a number too close to 0% or 100%. Otherwise, please replace XXX below with the number from 0-100 that represents the graduation rate percentage. Just add the number — don’t add the percentage symbol.

ipeds <- ipeds %>% 
    mutate(good_grad_rate = if_else(grad_rate > 70, 1, 0),
           good_grad_rate = as.factor(good_grad_rate))
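To see how the cutoff behaves at the boundary and with missing values, here is a small sketch that applies the same mutate() step to four made-up graduation rates, using the cutoff of 70 chosen above.

```r
library(dplyr)

# Made-up rates: below the cutoff, exactly at it, above it, and missing
toy <- tibble(grad_rate = c(27, 70, 81, NA))

toy <- toy %>%
  mutate(good_grad_rate = if_else(grad_rate > 70, 1, 0),
         good_grad_rate = as.factor(good_grad_rate))

toy
```

A rate of exactly 70 is coded 0 because the comparison is strictly greater than, and a missing rate stays NA rather than being coded 0, so those institutions drop out of any model that uses this outcome.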

Here, add a reason or two for how and why you picked the number you did:

  • From the histogram and scatterplot, schools with graduation rates above 70% tend to have higher average instructional staff salaries.

4. MODEL

Now, we reach a fork in the road. Recall from our first reading that there are two general types of modeling approaches: unsupervised and supervised machine learning. In Part 4, we focus on supervised learning models, which are used to quantify relationships between features (e.g., motivation and performance) and a known outcome (e.g., student drop out). These models can be used for classification of binary or categorical outcomes, as we’ll illustrate in this section, or regression as we’ll demonstrate in modules 2 and 3.

Please write out preliminary, draft research questions for both a regression (RQ A) and supervised machine learning (RQ B) analysis. It may help to review the readings for this module; you can find them in the lit folder; they are listed in sml-1-readings.qmd, too. There aren’t right or wrong answers here; the point is to try to draw out what question might accompany these different analyses (or vice versa - what research questions are feasible to answer using different analyses).

👉 Your Turn

RQ A - Regression Research Question

  • To what extent do institutional characteristics (graduation rate, percentage of students receiving financial aid, admission rate, tuition, and total enrollment) predict the average salary of instructional staff (avg_salary, a continuous variable)? This would use multiple regression.

RQ B - Supervised Machine Learning Research Question

  • Can institutional and student factors (percentage receiving aid, graduation rate, admission rate, tuition and fees, total enrollment) classify institutions as having high or low average instructional staff salaries?

Now, we will proceed to the analyses.

We’ll first conduct a regression analysis, like in the code-along. We use a generalized linear model due to the dependent variable being dichotomous. The code is relatively straightforward; the comments explain each step.

m1 <- glm(good_grad_rate ~ total_enroll + pct_admitted + n_bach + n_mast + n_doc + tuition_fees + percent_fin_aid + avg_salary, data = ipeds, family = "binomial") # to the left of the ~ is our dependent variable; to the right are the independent variables
# family = "binomial" is to specify the correct model type for the dichotomous dependent variable

summary(m1) # summary table of model output 

Call:
glm(formula = good_grad_rate ~ total_enroll + pct_admitted + 
    n_bach + n_mast + n_doc + tuition_fees + percent_fin_aid + 
    avg_salary, family = "binomial", data = ipeds)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -2.076e-01  7.558e-01  -0.275 0.783530    
total_enroll    -1.361e-04  5.479e-05  -2.483 0.013020 *  
pct_admitted    -1.900e-02  4.244e-03  -4.477 7.59e-06 ***
n_bach           7.607e-04  2.201e-04   3.457 0.000546 ***
n_mast          -4.135e-04  2.330e-04  -1.775 0.075899 .  
n_doc            7.092e-03  1.636e-03   4.336 1.45e-05 ***
tuition_fees     6.829e-05  6.063e-06  11.263  < 2e-16 ***
percent_fin_aid -4.042e-02  6.864e-03  -5.888 3.91e-09 ***
avg_salary       2.768e-05  4.742e-06   5.837 5.30e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1813.1  on 1615  degrees of freedom
Residual deviance: 1154.6  on 1607  degrees of freedom
  (2202 observations deleted due to missingness)
AIC: 1172.6

Number of Fisher Scoring iterations: 6
# Schools that admit a smaller percentage of applicants tend to have better graduation rates. The numbers of bachelor's and doctoral degrees awarded are significant positive predictors. Higher tuition and higher average instructional staff salaries are both associated with higher odds of a good graduation rate. A higher percentage of students receiving aid is linked to lower odds of a good graduation rate; serving higher-need populations can require more support.
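Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to communicate. The sketch below copies four estimates from the summary table above and rescales them to more meaningful units. The unit choices (10 percentage points, $1,000, $10,000) are assumptions made for illustration; with the fitted model in memory, exp(coef(m1)) would give the per-unit odds ratios directly.

```r
# Estimates copied from the summary(m1) table above (log-odds per one unit)
coefs <- c(pct_admitted    = -1.900e-02,
           tuition_fees    =  6.829e-05,
           percent_fin_aid = -4.042e-02,
           avg_salary      =  2.768e-05)

# Odds ratio for a one-unit increase in each predictor
odds_ratio_per_unit <- exp(coefs)

# Illustrative unit sizes (assumed, not from the source): 10 percentage
# points for the two percentages, $1,000 of tuition, $10,000 of salary
unit_size <- c(pct_admitted = 10, tuition_fees = 1000,
               percent_fin_aid = 10, avg_salary = 10000)

odds_ratio_rescaled <- exp(coefs * unit_size[names(coefs)])

round(odds_ratio_rescaled, 2)
```

Read this way, holding the other predictors constant, each additional 10 percentage points admitted multiplies the odds of a good graduation rate by roughly 0.83, and each additional $1,000 in tuition and fees multiplies them by roughly 1.07.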

Then, we’ll conduct a supervised machine learning analysis (with a simple but still commonly used model - in fact, the same model we used for the regression, a generalized linear model!). Again, for now, you’ll run this code; later, you’ll work through each step in detail.

my_rec <- recipe(good_grad_rate ~ total_enroll + pct_admitted + n_bach + n_mast + n_doc + tuition_fees + percent_fin_aid + avg_salary, data = ipeds) # same as above; this sets up what predicts the outcome

# specify model
my_mod <-
    logistic_reg() %>% # specifies a logistic regression
    set_engine("glm") %>% # specifies the specific package used to estimate the logistic regression
    set_mode("classification") # specifies that our outcome is a dichotomous or categorical variable

my_wf <- workflow() %>% # this starts a workflow, which will stitch together the steps in our analysis
    add_model(my_mod) %>% # adds the model
    add_recipe(my_rec) # adds the recipe

fit_model <- fit(my_wf, data = ipeds) # fits the model

ipeds_preds <- predict(fit_model, ipeds, type = "class") %>% # makes predictions
  bind_cols(ipeds %>% select(good_grad_rate)) # binds the known outcomes

metrics(ipeds_preds, truth = good_grad_rate, estimate = .pred_class) # calculates metrics
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.867
2 kap      binary         0.597
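The two metrics reported above can be computed by hand from a confusion matrix, which may help in interpreting them. The sketch below uses ten made-up predictions rather than the IPEDS results. Accuracy is the share of correct predictions, and kap is Cohen's kappa, which compares that accuracy with the agreement expected by chance.

```r
# Ten made-up true outcomes and predicted classes
truth <- factor(c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
pred  <- factor(c(1, 1, 0, 0, 0, 0, 0, 0, 0, 1), levels = c(0, 1))

# Confusion matrix: rows are true classes, columns are predicted classes
tab <- table(truth, pred)

# Observed agreement: the proportion on the diagonal (this is accuracy)
p_observed <- sum(diag(tab)) / sum(tab)

# Agreement expected by chance, from the row and column totals
p_expected <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2

# Cohen's kappa: how far accuracy exceeds chance, scaled by the maximum possible
kappa <- (p_observed - p_expected) / (1 - p_expected)

c(accuracy = p_observed, kappa = kappa)
```

In this toy example accuracy is 0.8 but kappa is only about 0.52, because a fair amount of agreement would be expected by chance when one class is much more common than the other.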

The key to observe at this point is what is similar and different between the two approaches (regression and supervised machine learning). Both used the same underlying statistical model, but had some stark differences. Add two or more similarities and two or more differences (no wrong answers!) below.

👉 Your Turn

Similarities:

  • Both use the same statistical method of logistic regression which models the probability of a binary outcome.
  • Both used the same independent variables (predictors).

Differences:

  • Regression focuses on interpreting and understanding the coefficient relationships/associations.

  • SML focuses on predictive performance: how accurately the model classifies the outcome (here, good_grad_rate), summarized by metrics such as accuracy and kappa.

5. COMMUNICATE

The final step in the workflow/process is sharing the results of your analysis with a wider audience. Krumm et al. (2018) have outlined the following 3-step process for communicating findings from an analysis to education stakeholders:

  1. Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”

  2. Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.

  3. Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question.

Now, let’s return to our research questions. What did we find? This (especially the supervised machine learning model and its output) is very likely new, and this is meant to elicit initial perceptions, and not the right answer. What did we find for each of your RQs? Add a few thoughts below for each. Focus on what you would communicate about this analysis to a general audience, again, keeping in mind this is based on your very initial interpretations.

RQ A - Regression Research Question

  • How do institutional factors, including average instructional staff salary, predict whether institutions have a good graduation rate? Schools that are more selective (lower percent admitted) and charge higher tuition tend to have higher odds of a good graduation rate. Schools that award more bachelor’s and doctoral degrees also have higher odds of a good graduation rate, possibly because they have more resources and academic supports. Higher average instructional staff salaries are also linked to better graduation rates.

  • The regression shows that a school’s selectivity, tuition, degrees awarded, and average instructional staff salary all relate to its graduation rate.

RQ B - Supervised Machine Learning Research Question

  • How well can we use these institutional factors to predict whether an institution will have a “good graduation rate”?
  • The supervised machine learning model (logistic regression classifier) was able to correctly classify schools as ‘good grad rate’ or not about 87% of the time, with a moderate to strong Kappa value (~0.60) — showing that the prediction is a lot better than guessing by chance. This means that while there is variation, the combination of features like admission rate, tuition, enrollment, and financial aid gives us useful information for predicting success. It also shows that if a state system or policymakers wanted to identify schools that might need more support to raise graduation rates, this kind of model could be a helpful tool — as long as it’s combined with real human context.
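One caveat when communicating the accuracy above is that it was computed on the same rows used to fit the model. A common check is to hold out part of the data for evaluation and to compare accuracy with the base rate of the most common class. The following is a minimal base R sketch on simulated data; the variable names, the 80/20 split, and the 0.5 probability cutoff are all assumptions made for illustration, and this is not the workflow used in this case study.

```r
set.seed(123)

# Simulated data: one predictor and a binary outcome related to it
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- factor(rbinom(n, size = 1, prob = plogis(1.5 * toy$x)))

# Hold out 20% of rows for testing; fit only on the remaining 80%
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit <- glm(y ~ x, data = train, family = "binomial")

# Predicted probabilities for unseen rows, converted to classes at 0.5
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "1", "0"), levels = levels(test$y))

# Accuracy on the held-out rows, and the accuracy of always guessing
# the most common class in the test set
test_accuracy <- mean(pred == test$y)
base_rate <- max(prop.table(table(test$y)))

c(test_accuracy = test_accuracy, base_rate = base_rate)
```

Reporting held-out accuracy alongside the base rate gives a general audience a clearer sense of how much the model adds beyond always guessing the majority class.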

🧶 Knit & Check ✅

For your SML Module 1 Badge, you will further reflect on and interpret these models, and their distinctions.

Rendered HTML files can be published online in a variety of ways, including Posit Cloud, RPubs, GitHub Pages, Quarto Pub, or other methods. The easiest way to quickly publish your file online is to publish directly from RStudio. You can do so by clicking the “Publish” button located in the Viewer Pane after you render your document as illustrated in the screenshot below.

Congratulations - you’ve completed this case study! Move on to the badge activity next.