Predicting 4-year graduates

Final Project for INSC 592

Sam Kilgore

Background

In October of 2022, I took a job as a data analyst in the Division of Student Success (DSS) at the University of Tennessee, Knoxville (UTK). Three years later, the division did some restructuring, hiring an Executive Director Analytics, a Director of Assessment, and promoting me to Assistant Director of Analytics. Together, we became the new Office Student Success Analytics, and have since been tasked with providing Student Success and university leadership with datasets, dashboards, surveys, and other analyses. Often these projects are requested to support a specific initiative or goal of the university. In this case, that goal is to increase our 4-year graduation rate.

The question I have been tasked with is this: how can we as university staff intervene with students who may not graduate on time (within 4 years of starting)? My role is to address what happens before that “how” question. If we want to intervene with students at risk of graduating late or not at all, we must first know who they are before we can answer how we can help.

Questions

How accurately can I expect to predict successful and unsuccessful 4-year graduates?
Who are the current students who are not expected to graduate within 4 years?

Definitions

Cohorts: Specific groups of students established for tracking purposes based on specific identification criteria and enrolled for credit hours on Day 14 of a fall or spring term.
Graduation rates: The percentage of students who begin a degree program and complete it within a specified time frame.
First-time Freshman: A student who has no prior postsecondary experience attending the University of Tennessee for the first time at the undergraduate level. Includes students enrolled in the fall term who attended college for the first time in the prior summer term, and students who entered with advanced standing (college credit or postsecondary credit earned before high school graduation).
Full-time undergraduate: An undergraduate student enrolled for 12 or more semester credits.
¹

Graduation rates are reported once per year, after the Summer term. Because a majority of undergraduate students start at UTK in the Fall semester following their high school graduations, graduation rates refer to Fall first-time full-time cohort (FTFT) students by default. At the time of this writing, UTK’s 4-year graduation rate is 66.8%, which is based on the Fall 2021 FTFT cohort of undergraduate students. This will update in August or September 2026, when the Fall 2022 FTFT cohort has had 4 years to finish their degrees. The students this project is concerned with belong to the Fall 2022, 2023, 2024, and 2025 FTFT cohorts, as these are the students who started at UTK less than 4 years ago.

Libraries

library(readr)
library(aod)
library(caret)
library(readxl)
library(dplyr)
library(readr)
library(mice)
library(ggplot2)
library(RODBC)
library(data.table)
library(tidyr)
library(lubridate)
library(keyring)
library(scales)

Part 1: Feasability Study

Before I can evaluate the graduation chances of each current student, I need to check how well I can make a model that guesses 4-year graduation for past students. The following model will train on the Fall 2019 and 2020 FTFT cohorts and test on the Fall 2021 FTFT cohort.

Data

These datasets are all derived from SQL and R scripts that I wrote to populate a dashboard for a separate project called Vol View. In working on that project, I inadvertently created a data warehouse of standardized and relatable datasets, which I now use for most of my data requests and projects. The only dataset I did not have constructed prior to starting this project is CREDITS, which I wrote a new SQL script for that executes in this chunk. These datasets all contain many more variables than are necessary or useful for this project, so the next few chunks will be spent widdling them down to what is needed.

# TERM contains retention, graduation, major, and other data that changes over time for students
TERM <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/demographics keys/demographics key by term PIT more minute.csv")

# STUDENT contains static student data, such as First Generation status and starting majors
STUDENT <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/demographics keys/demographics key by student.csv")

# CREDITS shows GPA and credit accumulation over time for students
credits_path <- "C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/grad4 regression/credits.sql"
sqltext <- read_file(credits_path)
conn <- odbcConnect("Test_DSN", uid = "skilgor4", pwd = key_get("sql_password"), believeNRows = F)
CREDITS <- sqlQuery(conn, sqltext)

# one row per student, TESTSCORES provides ACT scores
TESTSCORES <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/test scores/test scores.csv")

# permanent address of students, including ZIP codes
ADDRESS <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/addresses/addresses.csv")

# ZIP provides the geodesic distance in miles each ZIP code in the United States is from 37996, the ZIP code of UTK
ZIP <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/addresses/zip centers.csv")

# SPEC_POPS says which students are in select special populations, such as Veterans or Honors
SPEC_POPS <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/special populations/special populations.csv")

After each dataset has been imported, I copy them to new objects and clean up the copies for use in this analysis. Copying like this enables me to make changes as needed to the data without having to start over with the initial data import, which can be slow for my larger datasets.

TERM1 <- TERM

GRAD <- TERM1[which(TERM1$SEMESTER == "S12"),]
GRAD <- GRAD[,which(colnames(GRAD) %in% c("PIDM", "GRADUATED"))]
GRAD <- GRAD[which(!is.na(GRAD$GRADUATED)),]

TERM1 <- TERM
TERM1 <- TERM1[which(TERM1$PIDM %in% GRAD$PIDM),]
TERM1 <- TERM1[,which(colnames(TERM1) != "GRADUATED")]

# new for Haslam
HASLAM <- STUDENT$PIDM[which(STUDENT$COLLEGE_DESC == "Haslam College of Business")]

STUDENT1 <- STUDENT
STUDENT1 <- STUDENT1[which(STUDENT1$PIDM %in% GRAD$PIDM),]
STUDENT1$AP_CREDITS[which(is.na(STUDENT1$AP_CREDITS))] <- 0

STUDENT1 <- STUDENT1[,which(!(colnames(STUDENT1) %in% c("SUB_COLLEGE_DESC", "UTK_CONCENTRATION_DESC", "NETID", "STUDENT_ID")))]
colnames(STUDENT1)[which(colnames(STUDENT1) == "COLLEGE_DESC")] <- "FIRST_COLLEGE_DESC"
colnames(STUDENT1)[which(colnames(STUDENT1) == "DEPARTMENT_DESC")] <- "FIRST_DEPARTMENT_DESC"
colnames(STUDENT1)[which(colnames(STUDENT1) == "UTK_MAJOR_DESC")] <- "FIRST_UTK_MAJOR_DESC"

ADDRESS1 <- ADDRESS
colnames(ADDRESS1)[which(colnames(ADDRESS1) == "ENTITY_UID")] <- "PIDM"
ADDRESS1$ZIP_CODE <- as.numeric(substr(ADDRESS1$POSTAL_CODE, 1, 5))

ZIP1 <- ZIP
ZIP1 <- ZIP1[,which(colnames(ZIP1) %in% c("ZIP_CODE", "NEAR_DIST"))]
ZIP1$NEAR_DIST[which(ZIP1$NEAR_DIST < 0)] <- 0

SPEC_POPS1 <- SPEC_POPS
SPEC_POPS1 <- SPEC_POPS1[,which(colnames(SPEC_POPS) != "POPULATION_DESC")]

Once the separate datasets are mostly cleaned, I begin constructing the primary dataset for this project, called YEAR1.

YEAR1 <- merge(STUDENT1, GRAD, by = "PIDM", all.x = T)
YEAR1 <- merge(YEAR1, TESTSCORES, by = "PIDM", all.x = T)
YEAR1 <- merge(YEAR1, ADDRESS1, by = "PIDM", all.x = T)
YEAR1 <- merge(YEAR1, ZIP1, by = "ZIP_CODE", all.x = T)

grad_term <- max(YEAR1$ENTRY_TERM[which(!is.na(YEAR1$GRADUATED))])
YEAR1 <- YEAR1[which(YEAR1$ENTRY_TERM <= grad_term & YEAR1$ENTRY_TERM >= grad_term - 200),]

# adding special populations
YEAR1$Veteran <- "Not veteran"
YEAR1$Veteran[which(YEAR1$PIDM %in% SPEC_POPS1$PIDM[which(SPEC_POPS1$POPULATION == "Veterans")])] <- "Veteran"

# add Honors
YEAR1$Honors <- "Not honors"
YEAR1$Honors[which(YEAR1$PIDM %in% SPEC_POPS1$PIDM[which(SPEC_POPS1$POPULATION == "Chancellor's Honors Program")])] <- "Chancellor's Honors Program"
YEAR1$Honors[which(YEAR1$PIDM %in% SPEC_POPS1$PIDM[which(SPEC_POPS1$POPULATION == "Haslam Leadership Scholars")])] <- "Haslam Leadership Scholars"

# add year data and create intentional duplicates
YEAR1$Year <- NA
YEAR1_temp <- YEAR1
for (i in 1:7) {
  if(i %% 2 == 1) { # odds
    YEAR1_temp$Year <- (i-1)*50
  } else { # evens
    YEAR1_temp$Year <- (i-2)*50+80
  }
  YEAR1 <- rbind(YEAR1, YEAR1_temp)
}
YEAR1 <- YEAR1[which(!is.na(YEAR1$Year)),]

# need to add most up-to-date credit data as of end of student's first year
STUDENT_TERM <- YEAR1[,which(colnames(YEAR1) %in% c("PIDM", "ENTRY_TERM", "Year"))]
CREDITS_temp <- CREDITS[which(CREDITS$PIDM %in% STUDENT_TERM$PIDM),
                        which(colnames(CREDITS) %in% c("PIDM", "ACADEMIC_PERIOD"))]
JOIN_temp <- merge(STUDENT_TERM, CREDITS_temp, by = "PIDM", all.x = T)
JOIN_temp <- JOIN_temp[which(JOIN_temp$ENTRY_TERM <= JOIN_temp$ACADEMIC_PERIOD # after start date
                           & JOIN_temp$ENTRY_TERM + JOIN_temp$Year >= JOIN_temp$ACADEMIC_PERIOD),]
AGG1 <- aggregate(ACADEMIC_PERIOD ~ PIDM + Year, data = JOIN_temp, FUN = "max")

# add academic period of most recent credits term for join key
YEAR1 <- merge(YEAR1, AGG1, by = c("PIDM", "Year"))

# join most recent credit history
YEAR1 <- merge(YEAR1, CREDITS, by = c("PIDM", "ACADEMIC_PERIOD"))
YEAR1 <- merge(YEAR1, TERM1, by = c("PIDM", "ACADEMIC_PERIOD"))

# replace blanks with zeroes
YEAR1$CUM_HRS_ATT_GPA[which(is.na(YEAR1$CUM_HRS_ATT_GPA))] <- 0
YEAR1$CUM_GPA[which(is.na(YEAR1$CUM_GPA))] <- 0

# create categorical version of Year column
YEAR1$Year_desc <- paste("Year ", as.character(YEAR1$Year/100), sep = "")

By this point, my dataset has become unwieldy, as all the datasets I have joined together have many unneeded columns for this analysis. I like to do this cutting down step late in the construction, as I may change my mind on which variables I can use later, and this order makes it easier to add them back in if I so choose.

# remove unneeded columns
YEAR1 <- YEAR1[,which(!(colnames(YEAR1) %in% c("inforow", "STUDENT_LEVEL_CODE", "marker")))]

# keep only useful columns
keep_list <- c("GRADUATED", "PIDM", "AP_CREDITS", "ENTRY_TERM",
               "CALC_HIGH_SCHOOL_GPA_1", "COLLEGE_DESC", "GENDER", "PELL", 
               "ACT_SUPER_MATH","CUM_HOURS_EARN", "CUM_GPA", "IPEDS_RACE_DESC", "RESIDENCY_DESC", 
               "NEAR_DIST", "Honors", "Veteran",
               "Year", "Year_desc", "ENTRY_TERM", "ACADEMIC_PERIOD", "retained_N", "Honors"
              )
YEAR1 <- YEAR1[,which(colnames(YEAR1) %in% keep_list)]

Data overview

Now that the dataset is cleaned up, this is what it contains. This example only shows the data for students after their first Fall semester, but the whole dataset has the same data representing the end of each student’s first 7 Fall and Spring semesters.

summary(YEAR1[which(YEAR1$Year == 0),which(colnames(YEAR1) %in% c("CALC_HIGH_SCHOOL_GPA_1", "GRADUATED", "ACT_SUPER_MATH", "NEAR_DIST", "CUM_GPA", "CUM_HOURS_EARN"))])

##  CALC_HIGH_SCHOOL_GPA_1   GRADUATED     ACT_SUPER_MATH    NEAR_DIST      
##  Min.   :2.069          Min.   :0.000   Min.   :14.00   Min.   :   0.00  
##  1st Qu.:3.656          1st Qu.:0.000   1st Qu.:24.00   1st Qu.:  67.88  
##  Median :4.062          Median :1.000   Median :26.00   Median : 157.76  
##  Mean   :3.982          Mean   :0.623   Mean   :26.33   Mean   : 234.44  
##  3rd Qu.:4.367          3rd Qu.:1.000   3rd Qu.:29.00   3rd Qu.: 328.08  
##  Max.   :5.000          Max.   :1.000   Max.   :36.00   Max.   :7569.02  
##  NA's   :52                             NA's   :3699    NA's   :68       
##     CUM_GPA      CUM_HOURS_EARN  
##  Min.   :0.000   Min.   :  0.00  
##  1st Qu.:2.847   1st Qu.: 14.00  
##  Median :3.435   Median : 16.00  
##  Mean   :3.162   Mean   : 20.13  
##  3rd Qu.:3.800   3rd Qu.: 24.00  
##  Max.   :4.000   Max.   :133.00  
##

table(YEAR1$COLLEGE_DESC[which(YEAR1$Year == 0)])

## 
##        Architecture and Design              Arts and Sciences 
##                            481                           4863 
##  Communication and Information   Educ, Health, Human Sciences 
##                            493                           1610 
##     Haslam College of Business Herbert College of Agriculture 
##                           3905                            696 
##  N. L. Haslam College of Music                        Nursing 
##                            140                            632 
##                    Social Work  Tickle College of Engineering 
##                             85                           2635 
##                     University 
##                           1109

table(YEAR1$GENDER[which(YEAR1$Year == 0)])

## 
## Female   Male 
##   9338   7311

table(YEAR1$RESIDENCY_DESC[which(YEAR1$Year == 0)])

## 
##      In-State International  Out-of-State 
##         11295            89          5265

table(YEAR1$Veteran[which(YEAR1$Year == 0)])

## 
## Not veteran     Veteran 
##       16529         120

table(YEAR1$Honors[which(YEAR1$Year == 0)])

## 
## Chancellor's Honors Program  Haslam Leadership Scholars 
##                        1037                          45 
##                  Not honors 
##                       15567

table(YEAR1$PELL[which(YEAR1$Year == 0)])

## 
## Not Pell     Pell 
##    12988     3661

A glaring issue I dealt with early in the planning stage of this project is blank values. Specifically, absences of ACT scores. UTK requires that students self-report their ACT scores in their applications, though international students generally do not take the ACT or the SAT. Logistic regressions automatically exclude rows containing any null values, and I want all students included, so I had to fill these NAs with a number. However, I wanted that action to be accounted for in the regression training. The following chunk combs through each column in the dataset and checks whether that column contains any blanks. If it does, those blanks are filled with a 0 and a new column is created that says whether each row was blank in the original column. For example, ACT_SUPER_MATH_blank shows which rows of ACT_SUPER_MATH were blank. The columns originally containing nulls will be interacted with their corresponding blank indictor columns in the final regressions.

# find which columns have NAs and give them accompanying blank/not blank columns
for (j in 1:ncol(YEAR1)) {
  nas <- which(is.na(YEAR1[,j]))
  if(length(nas) > 0) {
    # initialize new column
    YEAR1$Temp <- NA
    #populate new column
    YEAR1$Temp[which(is.na(YEAR1[,j]))] <- "Blank"
    YEAR1$Temp[which(!is.na(YEAR1[,j]))] <- "Not blank"
    # name new column
    colnames(YEAR1)[ncol(YEAR1)] <- paste(colnames(YEAR1)[j], "_blank", sep = "")
  }
}

# fill in all blanks with zeroes
for (j in 1:ncol(YEAR1)) {
  data_type <- class(YEAR1[,j])
  rows <- which(is.na(YEAR1[,j]))
  YEAR1[rows,j] <- eval(parse(
    text = paste("as.", data_type, "(0)", sep = "")
    ))
}

Initial Regression

Now that the data was been assembled, the model can be built. First, split the data into training and testing data. I chose to split this data by cohort and not randomly to simulate the way this model will be working in practice. I will only ever be able to model future graduation trends based on past graduation data, so this model will be built in the same fashion.

# training data
YEAR1_logit <- YEAR1[which(YEAR1$ENTRY_TERM < grad_term),]

# testing data
YEAR1_test <- YEAR1[which(YEAR1$ENTRY_TERM == grad_term),]

Independent variables

Now that the dataset is standardized and the data is split, the independent variables are collected into a vector. The only interaction considered is between Residency and Distance From Home. My initial suspicion with these was that living far from home would have a negative impact for in-state students but a positive one for out-of-state students.

# create vector of independent variable names with interactions
independents_int <- 
  c(# lists number of AP credits students had upon admission, static
    "AP_CREDITS",
    # high school GPA, static
    "CALC_HIGH_SCHOOL_GPA_1",
    # college at a point in time, dynamic
    "COLLEGE_DESC",
    # gender reported at admission, static
    "GENDER",
    # Pell status upon 14th day of first Fall, binary, static
    "PELL",
    # highest reported ACT Math subscore, static
    "ACT_SUPER_MATH",
    # cumulative hours earned at that time, dynamic
    "CUM_HOURS_EARN", 
    # overall GPA, dynamic
    "CUM_GPA", 
    # in-state, out-of-state, or international interacted with distance from home, static
    "RESIDENCY_DESC*NEAR_DIST", 
    # Veteran status, binary, static
    "Veteran",
    # involvement in Haslam Leadership Scholars, Chancellor's Honors Program, or neither, static
    "Honors" 
  )

Creating the model(s)

Up to this point, I have referred to this process as preparing to create a model that predicts graduation in 4 years. This is an oversimplification, as I am actually creating 7 different logistic regressions independently of each other, one for each completed semester before a student reaches the 4 year mark. My goal is to be able to say whether I think a student will graduate on time based on the most up-to-date data possible. These models are how I accomplish that, as much of the data in my independent variables changes over time, like CUM_HOURS_EARN and COLLEGE_DESC. This is also why I wrote the independent variables into a vector earlier; if I decide I want to add a variable, it is much easier to add it to one string of objects than into 7 different glm() functions.

LOGITS <- data.frame(matrix(NA, ncol = 2, nrow = length(unique(YEAR1$Year))))
colnames(LOGITS) <- c("Year", "model")
LOGITS$Year <- unique(YEAR1$Year)

for (i in 1:nrow(LOGITS)) {
  ind_string <- ""
  for (j in 1:length(independents_int)) {
    column <- which(colnames(YEAR1_logit) == independents_int[j])
    if(length(column) == 1) {
      levels <- unique(YEAR1_logit[which(YEAR1_logit$Year == LOGITS$Year[i]),column])
      if(length(levels) == 1) {
        next
      }
    }
    
    blank_version <- paste(independents_int[j], "_blank", sep = "")
    if(blank_version %in% colnames(YEAR1_logit)) {
      temp <- paste(independents_int[j], "*", independents_int[j], "_blank + ",
                    independents_int[j], "_blank", sep = "")
    } else {
      temp <- independents_int[j]
    }
    ind_string <- paste(ind_string, " + ", temp, sep = "")
  }
  ind_string <- substr(ind_string, 4, nchar(ind_string))
  
  # piece together the command string for the regression
  logit_string <- paste(
    "logit", as.character(i), " <- ",
    "glm(GRADUATED ~ ",
    ind_string,
    ', data = YEAR1_logit[which(YEAR1_logit$Year == LOGITS$Year[i]),], family = "binomial")',
    sep = ""
  )
  # execute the logistic regression
  eval(parse(text = logit_string))
  # save the model to my computer for later use
  save_string <- paste(
    "saveRDS(logit", as.character(i), 
    ', file = "C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/grad4 regression/logits from R/logit',
    as.character(i), '.rds")',
    sep = ""
  )
  eval(parse(text = save_string))
  
  LOGITS$model[i] <- paste("logit", as.character(i), sep = "")
}

# save the directory of regressions
write.csv(LOGITS, "C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/grad4 regression/logits from R.csv", na = "", row.names = F)

With these models created, they can now be applied to the test data.

# get model directory
LOGITS <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/grad4 regression/logits from R.csv")

# apply logistic model
YEAR1_test$prediction <- NA
folder <- "C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/grad4 regression/logits from R"

for (i in unique(YEAR1_test$Year)) {
  row <- which(LOGITS$Year == i)
  logit_file <- paste(folder, "/", LOGITS$model[row], ".rds", sep = "")
  logit <- readRDS(logit_file)
  
  rows <- which(YEAR1_test$Year == i & YEAR1_test$COLLEGE_DESC %in% YEAR1_logit$COLLEGE_DESC)
  YEAR1_test$prediction[rows] <- predict(logit, newdata = YEAR1_test[rows,], type = "response")
}

Recall that I am trying to predict a binary variables: yes or no, will a student graduate within 4 years? The output of each logistic regression provides me with a number between 0 and 1 for each instance, which is the probability the model assigns a student to graduate on time based on the information available. From there, students with a probability less than .5 are predicted to not graduate on time, and the rest are predicted to graduate on time.

YEAR1_test$prediction_binary <- round(YEAR1_test$prediction)
YEAR1_test$prediction_bin <- YEAR1_test$prediction - (YEAR1_test$prediction %% .05)

YEAR1_test$accurate <- 0
YEAR1_test$accurate[which(YEAR1_test$GRADUATED == YEAR1_test$prediction_binary)] <- 1

YEAR1_test$correct <- "Not correct"
YEAR1_test$correct[which(YEAR1_test$accurate == 1)] <- "Correct"

Accuracy

YEAR1_viz <- YEAR1_test[which(YEAR1_test$Year == 0),]

ggplot(YEAR1_viz, aes(fill=correct, x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width = .045) +
  scale_fill_manual(values = c("#3f704d", "#c21807")) +
  geom_vline(xintercept = .5) +
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste("After 1 semester, accuracy = ", round(mean(YEAR1_viz$accurate)*100, 1), "%", sep = ""))

YEAR1_viz <- YEAR1_test[which(YEAR1_test$Year == 80),]

ggplot(YEAR1_viz, aes(fill=correct, x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width = .045) +
  scale_fill_manual(values = c("#3f704d", "#c21807")) +
  geom_vline(xintercept = .5) + 
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste("After 2 semesters, accuracy = ", round(mean(YEAR1_viz$accurate)*100, 1), "%", sep = ""))

YEAR1_viz <- YEAR1_test[which(YEAR1_test$Year == 100),]

ggplot(YEAR1_viz, aes(fill=correct, x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width = .045) +
  scale_fill_manual(values = c("#3f704d", "#c21807")) +
  geom_vline(xintercept = .5) +
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste("After 3 semesters, accuracy = ", round(mean(YEAR1_viz$accurate)*100, 1), "%", sep = ""))

YEAR1_viz <- YEAR1_test[which(YEAR1_test$Year == 180),]

ggplot(YEAR1_viz, aes(fill=correct, x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width = .045) +
  scale_fill_manual(values = c("#3f704d", "#c21807")) +
  geom_vline(xintercept = .5) +
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste("After 4 semesters, accuracy = ", round(mean(YEAR1_viz$accurate)*100, 1), "%", sep = ""))

YEAR1_viz <- YEAR1_test[which(YEAR1_test$Year == 200),]

ggplot(YEAR1_viz, aes(fill=correct, x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width = .045) +
  scale_fill_manual(values = c("#3f704d", "#c21807")) +
  geom_vline(xintercept = .5) +
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste("After 5 semesters, accuracy = ", round(mean(YEAR1_viz$accurate)*100, 1), "%", sep = ""))

YEAR1_viz <- YEAR1_test[which(YEAR1_test$Year == 280),]

ggplot(YEAR1_viz, aes(fill=correct, x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width=.045) +
  scale_fill_manual(values = c("#3f704d", "#c21807")) +
  geom_vline(xintercept = .5) +
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste("After 6 semesters, accuracy = ", round(mean(YEAR1_viz$accurate)*100, 1), "%", sep = ""))

YEAR1_viz <- YEAR1_test[which(YEAR1_test$Year == 300),]

ggplot(YEAR1_viz, aes(fill=correct, x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width=.045) +
  scale_fill_manual(values = c("#3f704d", "#c21807")) +
  geom_vline(xintercept = .5) +
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle(paste("After 7 semesters, accuracy = ", round(mean(YEAR1_viz$accurate)*100, 1), "%", sep = ""))

Evidently, the models increase in accuracy as they collect more data and the students grow closer to graduation day. After 1 semester of data, a majority of students are given a probability around 80% of graduating within 4 years and the distribution of probabilities is almost normal. However, as time passes, normality gives way to bimodality. The models get increasingly accurate with each passed semester, and students find themselves with more extreme probabilities. By the end of the 7th semester, the vast majority of students are found to be almost certain to graduate after 4 years, or almost certain not to. Very few students lie in the middle and the model has far fewer inaccuracies.

Rebuild models

Now that I have estimated the accuracy of each model, using the Fall 2021 FTFT cohort as a test, I want to know the graduation probability of all current students. However, the models I just created do not quite have the most up-to-date data, so I will rebuild them here to be based on the Fall 2020 and Fall 2021 FTFT cohorts, which can then be applied to the Fall 2022 and onward cohorts.

YEAR1_logit <- YEAR1[which(YEAR1$ENTRY_TERM <= grad_term & YEAR1$ENTRY_TERM > grad_term - 200),]

LOGITS <- data.frame(matrix(NA, ncol = 2, nrow = length(unique(YEAR1$Year))))
colnames(LOGITS) <- c("Year", "model")
LOGITS$Year <- unique(YEAR1$Year)

# save logit
for (i in 1:nrow(LOGITS)) {
  logit_string <- paste(
    "logit", as.character(i), " <- ",
    "glm(GRADUATED ~ ",
    ind_string,
    ', data = YEAR1_logit[which(YEAR1_logit$Year == LOGITS$Year[i]),], family = "binomial")',
    sep = ""
  )
  eval(parse(text = logit_string))
  save_string <- paste(
    "saveRDS(logit", as.character(i), 
    ', file = "C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/grad4 regression/logits for use/logit',
    as.character(i), '.rds")',
    sep = ""
  )
  eval(parse(text = save_string))
  
  LOGITS$model[i] <- paste("logit", as.character(i), sep = "")
}

write.csv(LOGITS, "C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/grad4 regression/logits for use.csv", na = "", row.names = F)

Part 2: Application to Current Students

The following chunks were initially written in a different Rmd file, so there is significant repetition to the data cleaning process shown earlier. The difference, however, is that this data includes 4 cohorts and each student is listed once, whereas the data used to build the models had students from 3 cohorts each listed 7 times, one for each semester before 4 years had passed. As a result, the data construction is slightly different.

library(readr)
library(aod)
library(caret)
library(readxl)
library(readr)
library(mice)
library(ggplot2)
library(RODBC)
library(data.table)
library(tidyr)
library(lubridate)
library(keyring)

TERM <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/demographics keys/demographics key by term PIT more minute.csv")
STUDENT <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/demographics keys/demographics key by student.csv")

credits_path <- "C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/grad4 regression/credits.sql"
sqltext <- read_file(credits_path)
conn <- odbcConnect("Test_DSN", uid = "skilgor4", pwd = key_get("sql_password"), believeNRows = F)
#CREDITS <- sqlQuery(conn, sqltext)

TESTSCORES <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/test scores/test scores.csv")

ADDRESS <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/addresses/addresses.csv")

ZIP <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/addresses/zip centers.csv")

SPEC_POPS <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/special populations/special populations.csv")

TERM1 <- TERM

GRAD <- TERM1[which(TERM1$SEMESTER == "S12"),]
GRAD <- GRAD[,which(colnames(GRAD) %in% c("PIDM", "GRADUATED"))]
GRAD <- GRAD[which(!is.na(GRAD$GRADUATED)),]

TERM1 <- TERM
TERM1 <- TERM1[,which(colnames(TERM1) != "GRADUATED")]

# new for Haslam
HASLAM <- STUDENT$PIDM[which(STUDENT$COLLEGE_DESC == "Haslam College of Business")]

STUDENT1 <- STUDENT
STUDENT1$AP_CREDITS[which(is.na(STUDENT1$AP_CREDITS))] <- 0
STUDENT1 <- STUDENT1[,which(!(colnames(STUDENT1) 
                              %in% c("COLLEGE_DESC", "SUB_COLLEGE_DESC", "DEPARTMENT_DESC",   
                                     "UTK_MAJOR_DESC", "UTK_CONCENTRATION_DESC", "NETID", "STUDENT_ID")))]

ADDRESS1 <- ADDRESS
colnames(ADDRESS1)[which(colnames(ADDRESS1) == "ENTITY_UID")] <- "PIDM"
ADDRESS1$ZIP_CODE <- as.numeric(substr(ADDRESS1$POSTAL_CODE, 1, 5))

ZIP1 <- ZIP
ZIP1 <- ZIP1[,which(colnames(ZIP1) %in% c("ZIP_CODE", "NEAR_DIST"))]

SPEC_POPS1 <- SPEC_POPS
SPEC_POPS1 <- SPEC_POPS1[,which(colnames(SPEC_POPS) != "POPULATION_DESC")]

YEAR1 <- merge(STUDENT1, GRAD, by = "PIDM", all.x = T)
YEAR1 <- merge(YEAR1, TESTSCORES, by = "PIDM", all.x = T)
YEAR1 <- merge(YEAR1, ADDRESS1, by = "PIDM", all.x = T)
YEAR1 <- merge(YEAR1, ZIP1, by = "ZIP_CODE", all.x = T)

YEAR1 <- YEAR1[which(YEAR1$ENTRY_TERM %% 100 == 40),]
YEAR1 <- YEAR1[which(YEAR1$ENTRY_TYPE == "First-time College Students"),]

grad_term <- max(YEAR1$ENTRY_TERM[which(!is.na(YEAR1$GRADUATED))])
YEAR1 <- YEAR1[which(YEAR1$ENTRY_TERM > grad_term),]

# adding special populations
YEAR1$Veteran <- "Not veteran"
YEAR1$Veteran[which(YEAR1$PIDM %in% SPEC_POPS1$PIDM[which(SPEC_POPS1$POPULATION == "Veterans")])] <- "Veteran"

# add Honors
YEAR1$Honors <- "Not honors"
YEAR1$Honors[which(YEAR1$PIDM %in% SPEC_POPS1$PIDM[which(SPEC_POPS1$POPULATION == "Chancellor's Honors Program")])] <- "Chancellor's Honors Program"
YEAR1$Honors[which(YEAR1$PIDM %in% SPEC_POPS1$PIDM[which(SPEC_POPS1$POPULATION == "Haslam Leadership Scholars")])] <- "Haslam Leadership Scholars"

# add year data
current_term <- max(TERM$ACADEMIC_PERIOD[which(TERM$ACADEMIC_PERIOD_CODED == "Current term")])
last_major_term <- current_term-1 # needs to be before current term
while(!(last_major_term %% 100) %in% c(20,40)) {
  last_major_term <- last_major_term - 1
}
YEAR1$Year <- last_major_term - YEAR1$ENTRY_TERM

# need to add most up-to-date credit data as of end of student's first year
STUDENT_TERM <- YEAR1[,which(colnames(YEAR1) %in% c("PIDM", "ENTRY_TERM", "Year"))]
CREDITS_temp <- CREDITS[which(CREDITS$PIDM %in% STUDENT_TERM$PIDM),
                        which(colnames(CREDITS) %in% c("PIDM", "ACADEMIC_PERIOD"))]
JOIN_temp <- merge(STUDENT_TERM, CREDITS_temp, by = "PIDM", all.x = T)
JOIN_temp <- JOIN_temp[which(JOIN_temp$ENTRY_TERM <= JOIN_temp$ACADEMIC_PERIOD # after start date
                           & JOIN_temp$ENTRY_TERM + JOIN_temp$Year >= JOIN_temp$ACADEMIC_PERIOD),]
AGG1 <- aggregate(ACADEMIC_PERIOD ~ PIDM + Year, data = JOIN_temp, FUN = "max")

# add academic period of most recent credits term for join key
YEAR1 <- merge(YEAR1, AGG1, by = c("PIDM", "Year"))

# join most recent credit history
YEAR1 <- merge(YEAR1, CREDITS, by = c("PIDM", "ACADEMIC_PERIOD"))
YEAR1 <- merge(YEAR1, TERM1, by = c("PIDM", "ACADEMIC_PERIOD"))

# replace blanks with zeroes
YEAR1$CUM_HRS_ATT_GPA[which(is.na(YEAR1$CUM_HRS_ATT_GPA))] <- 0
YEAR1$CUM_GPA[which(is.na(YEAR1$CUM_GPA))] <- 0

# create categorical version of Year column
YEAR1$Year_desc <- paste("Year ", as.character(YEAR1$Year/100), sep = "")

# remove unneeded columns
YEAR1 <- YEAR1[,which(!(colnames(YEAR1) %in% c("inforow", "STUDENT_LEVEL_CODE", "marker")))]

# keep only useful columns
keep_list <- c("GRADUATED", "PIDM", "AP_CREDITS", "ENTRY_TERM",
               "CALC_HIGH_SCHOOL_GPA_1", "COLLEGE_DESC", "GENDER", "PELL", 
               "ACT_SUPER_MATH","CUM_HOURS_EARN", "CUM_GPA", "IPEDS_RACE_DESC", "RESIDENCY_DESC", 
               "NEAR_DIST", "Honors", "Veteran",
               "Year", "Year_desc", "ENTRY_TERM"
              )
YEAR1 <- YEAR1[,which(colnames(YEAR1) %in% keep_list)]

# find which columns have NAs and give them accompanying blank/not blank columns
for (j in 1:ncol(YEAR1)) {
  nas <- which(is.na(YEAR1[,j]))
  if(length(nas) > 0) {
    # initialize new column
    YEAR1$Temp <- NA
    #populate new column
    YEAR1$Temp[which(is.na(YEAR1[,j]))] <- "Blank"
    YEAR1$Temp[which(!is.na(YEAR1[,j]))] <- "Not blank"
    # name new column
    colnames(YEAR1)[ncol(YEAR1)] <- paste(colnames(YEAR1)[j], "_blank", sep = "")
  }
}

# fill in all blanks with zeroes
for (j in 1:ncol(YEAR1)) {
  data_type <- class(YEAR1[,j])
  rows <- which(is.na(YEAR1[,j]))
  YEAR1[rows,j] <- eval(parse(
    text = paste("as.", data_type, "(0)", sep = "")
    ))
}

Now that the data for current students has been gathered and structured, the logistic regressions created in the first part of this script need to be retrieved.

LOGITS <- read.csv("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/grad4 regression/logits for use.csv")
for (i in 1:nrow(LOGITS)) {
  get_logit <- paste(LOGITS$model[i], ' <- readRDS("C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/grad4 regression/logits for use/', 
                             LOGITS$model[i], '.rds")', sep = "")
  eval(parse(text = get_logit))
}

With the 7 logistic regression models gathered, they can be applied to current students. Recall that this writing is from just before the end of the Spring 2026 academic period, so the most recent available data remains to be from the end of last Fall. With that, not all 7 models are needed to evaluate the graduation favor of current students, only the ones that are based on 1, 3, 5, and 7 semesters of data.

YEAR1$prediction <- NA
for (i in 1:nrow(YEAR1)) {
  row <- which(LOGITS$Year == YEAR1$Year[i])
  if(length(row) == 1) {
    tryCatch({
      p_string <- paste("YEAR1$prediction[i] <- predict(", LOGITS$model[row], 
                        ', newdata = YEAR1[i,], type = "response")')
      eval(parse(text = p_string))
    }, error = function(e) {})
  }
}

Results

Here is how the predicted 4-year graduation probabilities look for the current undergraduate cohorts.

YEAR1$prediction_binary <- round(YEAR1$prediction)
YEAR1$prediction_bin <- YEAR1$prediction - (YEAR1$prediction %% .05)

YEAR1_viz <- YEAR1[which(YEAR1$ENTRY_TERM == 202240),]

ggplot(YEAR1_viz, aes(x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width = .045) +
  geom_vline(xintercept = .5) +
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Spread of 4-year graduation probabilities for Fall 2022 FTFT cohort")

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_count()`).

YEAR1_viz <- YEAR1[which(YEAR1$ENTRY_TERM == 202340),]

ggplot(YEAR1_viz, aes(x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width = .045) +
  geom_vline(xintercept = .5) +
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Spread of 4-year graduation probabilities for Fall 2023 FTFT cohort")

## Warning: Removed 33 rows containing non-finite outside the scale range
## (`stat_count()`).

YEAR1_viz <- YEAR1[which(YEAR1$ENTRY_TERM == 202440),]

ggplot(YEAR1_viz, aes(x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width = .045) +
  geom_vline(xintercept = .5) +
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Spread of 4-year graduation probabilities for Fall 2024 FTFT cohort")

## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_count()`).

YEAR1_viz <- YEAR1[which(YEAR1$ENTRY_TERM == 202540),]

ggplot(YEAR1_viz, aes(x=prediction_bin+.025)) +
  geom_bar(position="stack", stat="count", width = .045) +
  geom_vline(xintercept = .5) +
  labs(x = "prediction bin", y = "students") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Spread of 4-year graduation probabilities for Fall 2025 FTFT cohort")

## Warning: Removed 99 rows containing non-finite outside the scale range
## (`stat_count()`).

Note the warnings on each rendering. These warnings result from new categories being added to some of the variables used in the regression, mostly the introduction of the Baker School for Public Policy and the College of Emerging and Collaborative Studies in the COLLEGE_DESC column. Because no students who have been in these colleges started at UTK more than 4 years ago, the models are not trained to know the effect of enrollment in these colleges on whether a student graduates on time.

Final Steps

Now that I have assigned each current Fall FTFT cohort student with a probability of graduating on time, I can pass these predictions on to university leadership. My hope is that they can use these predictions for targeted interventions with students, perhaps through advising or financial aid.

YEAR1 <- YEAR1[,which(colnames(YEAR1) %in% c("PIDM", "prediction"))]
write.csv(YEAR1, "C:/Users/skilgor4/OneDrive - University of Tennessee/Desktop/Office of Analytics/VolView/predictions/grad4 predictions.csv", na = "", row.names = F)

These definitions were provided to me by Amber Rountree, an Institutional Research Analyst in IRSA at UTK.↩︎