Predicting Student Retention at Lakeland College

Project Objective

To investigate the relationship between student GPA and attending orientation program at Lakeland College.

Question 1 & 2: Develope the Model & Access Predictor Significance

Step 1: Install and load required libraries

#installed.packages("Hmisc")
#installed.packages("pscl")
#if(!require(pROC)) install.packages("pROC")

library(readxl) #allows us to import excel files 
library(Hmisc) #allows us to call the correlation function

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(pscl) #allows us ti cakk the pseudo R-square package to evaluate our model

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

library(pROC) #allows us to run the area under the curve(AUC) package to get the plot and AUC score

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2: Import & Clean the data

college_df <- read_excel("Lakeland College Data.xlsx")
coll_df <- subset(college_df, select = -c(Student))

Step 3: Summarize the data

head(coll_df)

## # A tibble: 6 × 3
##     GPA Program Return
##   <dbl>   <dbl>  <dbl>
## 1  3.78       1      1
## 2  2.38       0      1
## 3  1.3        0      0
## 4  2.19       1      0
## 5  3.22       1      1
## 6  2.68       1      1

Data Description: A description of some of the features are presented in the
table below.
Variable      | Definition 
--------------|--------------
1. GPA        | A students grade point average after the first year of college 
2. Program    | A student attending a voluntary one-week orientation (1:yes and 0:no)
3. Return     | A student returning to college for their sophomore year(1:yes and 0:no)

summary(coll_df)

##       GPA           Program         Return    
##  Min.   :1.210   Min.   :0.00   Min.   :0.00  
##  1st Qu.:2.377   1st Qu.:0.00   1st Qu.:0.00  
##  Median :2.735   Median :1.00   Median :1.00  
##  Mean   :2.740   Mean   :0.64   Mean   :0.66  
##  3rd Qu.:3.120   3rd Qu.:1.00   3rd Qu.:1.00  
##  Max.   :4.000   Max.   :1.00   Max.   :1.00

Interpretation: The median GPA is 2.7, with a median of 1 meaning students 
attended the orientation and returned for the sophomore year.

Step 4: Feature selection (i.e.,correlation analysis)

corr <- rcorr(as.matrix(coll_df))
corr

##          GPA Program Return
## GPA     1.00    0.50   0.58
## Program 0.50    1.00   0.52
## Return  0.58    0.52   1.00
## 
## n= 100 
## 
## 
## P
##         GPA Program Return
## GPA          0       0    
## Program  0           0    
## Return   0   0

Interpretation: All the predictors are significant with the target variable.
There's no multicollinearity in the data.

Step 5: Build the logistic regression model

model <- glm(Return ~ GPA + Program, data = coll_df, family = binomial)
summary(model)

## 
## Call:
## glm(formula = Return ~ GPA + Program, family = binomial, data = coll_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -6.8926     1.7472  -3.945 7.98e-05 ***
## GPA           2.5388     0.6729   3.773 0.000161 ***
## Program       1.5608     0.5631   2.772 0.005579 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 128.207  on 99  degrees of freedom
## Residual deviance:  80.338  on 97  degrees of freedom
## AIC: 86.338
## 
## Number of Fisher Scoring iterations: 5

Interpretation: All the independent variables were significant(p-value < 0.05)

Question 3: Overall Model Signnificance

Likelihood Ratio Test

# Fit a null model 
null_model <- glm(Return ~ 1, data = coll_df, family = binomial)

# Perform likelihood ratio test 
anova(null_model, model, test ="Chisq")

## Analysis of Deviance Table
## 
## Model 1: Return ~ 1
## Model 2: Return ~ GPA + Program
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1        99    128.207                         
## 2        97     80.338  2   47.869 4.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Interpretation: The inclusion of GPA and Program as predictictors in our LR model does indeed significantly predictthe likelihood of students returning to Lakeland College for their sophomore year, relative to a model that predicts based solely on the mean of observed outcomes.

Pseudo -R-squared

pR2(model)

## fitting null model for pseudo-r2

##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -40.1688662 -64.1035478  47.8693631   0.3733753   0.3804077   0.5264883

Interpretation: A McFadden R-squared of 0.373 means that our LR model explains about 37.3% of the variability in the outcome relative to a model with no predictors. This is considered a moderate to good fit, where values above 0.2 to 0.4 are often seen as indicative of a useful model.

Area Under the Curve(AUC)

# Compute ROC Curve and the AUC score 
roc_curve <- roc(coll_df$Return, fitted(model))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_curve)

auc(roc_curve)

## Area under the curve: 0.8841

Interpretation: An AUC score of 0.88 indicates that the LR model has a high level of accuracy in predicting student retention.

Question 4 & 5: Predicting with New Information

new_data1 <- data.frame(GPA = 2.5, Program = 0) # did not attend orientation
new_data2 <- data.frame(GPA = 2.5, Program = 1) # attended orientation 

# Predict the probability 
prob1 <- predict(model, newdata = new_data1, type = "response") #probability that the student did not attend
prob1

##         1 
## 0.3668944

prob2 <- predict(model, newdata = new_data2, type = "response")
prob2

##         1 
## 0.7340349

Interpretation: 
(1) There's a 36.69% chance that the student will not return for the sophomore 
year if they did not attended the orientation. 
(2) There's a 73.40% chance that the student will return if they attend the
orientation program.

Question 6: Odds Ratio

# Extract the coefficients
model_summary <- summary(model)
coefficients <- model_summary$coefficients

# Calculate the odds ratio for 'Program'
odds_ratio_program <- exp(coefficients["Program", "Estimate"])
odds_ratio_program

## [1] 4.762413

Interpretation: The odd ratio is greater than 1 and this indicates that attending 
the orientation program is associated with higher odds of students returning 
for sophomore year compared to not attendiing the orientation at Lakeland college.