title: “Predicting Student Retention at Lakeland College”
author: “Jessica Bode”
date: “2024-04-27”
output: html_document

Project Objective

To investigate the relationship between student GPA and attending orientation program on retention at Lakeland College.

Question 1 and 2: Develop the Model & Assess Predictor Significance

Step 1: Install and load required libraries

# Uncomment the lines below to install the required packages if needed.
# install.packages("readxl")
# install.packages("Hmisc")
# install.packages("pscl")
# if(!require(pROC)) install.packages("pROC")

library(readxl)
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

###Step 2: Import and clean the data

college_df <- read_excel("/Users/jb/Documents/R Files/Lakeland College Data.xlsx")
coll_df <- subset(college_df, select = -c(Student))

###Step 3: Summarize the data

head(coll_df)
## # A tibble: 6 × 3
##     GPA Program Return
##   <dbl>   <dbl>  <dbl>
## 1  3.78       1      1
## 2  2.38       0      1
## 3  1.3        0      0
## 4  2.19       1      0
## 5  3.22       1      1
## 6  2.68       1      1
summary(coll_df)
##       GPA           Program         Return    
##  Min.   :1.210   Min.   :0.00   Min.   :0.00  
##  1st Qu.:2.377   1st Qu.:0.00   1st Qu.:0.00  
##  Median :2.735   Median :1.00   Median :1.00  
##  Mean   :2.740   Mean   :0.64   Mean   :0.66  
##  3rd Qu.:3.120   3rd Qu.:1.00   3rd Qu.:1.00  
##  Max.   :4.000   Max.   :1.00   Max.   :1.00
#Data Description: A description of some of the features is presented in the table below.

###Variable Definition GPA A student’s grade point average after the first year of college Program A student attending a voluntary one-week orientation program (1: yes, 0: no) Return A student returning to college for their sophomore semester (1: yes, 0: no) Step 4: Feature selection (i.e., correlation analysis)

corr <- rcorr(as.matrix(coll_df))
corr
##          GPA Program Return
## GPA     1.00    0.50   0.58
## Program 0.50    1.00   0.52
## Return  0.58    0.52   1.00
## 
## n= 100 
## 
## 
## P
##         GPA Program Return
## GPA          0       0    
## Program  0           0    
## Return   0   0

###Step 5: Build the logistic regression

model <- glm(Return ~ GPA + Program, data = coll_df, family = binomial)
summary(model)
## 
## Call:
## glm(formula = Return ~ GPA + Program, family = binomial, data = coll_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -6.8926     1.7472  -3.945 7.98e-05 ***
## GPA           2.5388     0.6729   3.773 0.000161 ***
## Program       1.5608     0.5631   2.772 0.005579 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 128.207  on 99  degrees of freedom
## Residual deviance:  80.338  on 97  degrees of freedom
## AIC: 86.338
## 
## Number of Fisher Scoring iterations: 5

###Question 3: Likelihood Ratio Test

null_model <- glm(Return ~1, data = coll_df, family = binomial)
anova(null_model, model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Return ~ 1
## Model 2: Return ~ GPA + Program
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1        99    128.207                         
## 2        97     80.338  2   47.869 4.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

###Pseudo-R-squared

pR2(model)
## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -40.1688662 -64.1035478  47.8693631   0.3733753   0.3804077   0.5264883

###Area Under the Curve (AUC)

roc_curve <- roc(coll_df$Return, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.8841

###Question 4 and 5: Predicting with new data

new_data1 <- data.frame(GPA = 2.5, Program = 0) # did not attend orientation
new_data2 <- data.frame(GPA = 2.5, Program = 1) # attend orientation

prob1 <- predict(model, newdata = new_data1, type = "response")
round((prob1 * 100), 2)
##     1 
## 36.69
prob2 <- predict(model, newdata = new_data2, type = "response")
round((prob2 * 100), 2)
##    1 
## 73.4

There is a 36.69% probability that the student will not return for sophomore year if they did not attend the orientation. There’s a 73.40% probability that the student will return if they attend the orientation program. ###Question 6: Odds Ratio

coefficients <- summary(model)$coefficients
odds_ratio_program <- exp(coefficients["Program", "Estimate"])
odds_ratio_program
## [1] 4.762413
#The odd ratio is higher than 1, which indicates that attending the orientation program is associated with higher odds of students returningtheir sophomore year compared to not attending the orientation at Lakeland College.