# Predicting Student Retention at Lakeland College
This analysis investigates the factors that influence student retention at Lakeland College and develops a predictive model to identify students who are at risk of dropping out.
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.3
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.3.3
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(pscl)
## Warning: package 'pscl' was built under R version 4.3.3
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
college = read_excel(file.choose()) # Loading the dataset
head(college) # quick snapshot of the dataset
## # A tibble: 6 × 4
## Student GPA Program Return
## <dbl> <dbl> <dbl> <dbl>
## 1 1 3.78 1 1
## 2 2 2.38 0 1
## 3 3 1.3 0 0
## 4 4 2.19 1 0
## 5 5 3.22 1 1
## 6 6 2.68 1 1
colleger = subset(college, select = -c(Student)) # Dropping the 'Student' column
summary(colleger) # Summarizing the new dataset that does not have the 'Student' column
## GPA Program Return
## Min. :1.210 Min. :0.00 Min. :0.00
## 1st Qu.:2.377 1st Qu.:0.00 1st Qu.:0.00
## Median :2.735 Median :1.00 Median :1.00
## Mean :2.740 Mean :0.64 Mean :0.66
## 3rd Qu.:3.120 3rd Qu.:1.00 3rd Qu.:1.00
## Max. :4.000 Max. :1.00 Max. :1.00
# Interpretation: The median GPA is about 2.74 (mean 2.74). Program and Return are 0/1 indicators, so their means are proportions: 64% of students attended the orientation program and 66% returned for their sophomore year.
corr = rcorr(as.matrix(colleger))
corr
## GPA Program Return
## GPA 1.00 0.50 0.58
## Program 0.50 1.00 0.52
## Return 0.58 0.52 1.00
##
## n= 100
##
##
## P
##         GPA Program Return
## GPA         0       0
## Program 0           0
## Return  0   0
# Interpretation: Both predictors are moderately and positively correlated with the target variable: GPA (r = 0.58) and Program (r = 0.52). The p-values print as 0 because they are rounded; they are very small rather than exactly zero, so all three correlations are statistically significant.
# Multicollinearity concerns the correlation between the predictors themselves. GPA and Program correlate at 0.50, which is below the 0.7 threshold used here, so both can stay in the model. Avoid using predictors together if they are highly correlated with each other.
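With only two predictors, the multicollinearity check can also be expressed as a variance inflation factor (VIF). The sketch below is not part of the original analysis. It assumes the rounded GPA-Program correlation of 0.50 printed above, and the variable names are illustrative.

```r
# Correlation between the two predictors, as printed by rcorr() (rounded)
r_gpa_program <- 0.50

# With exactly two predictors, the R-squared from regressing one predictor
# on the other equals the squared correlation, so VIF = 1 / (1 - r^2)
vif <- 1 / (1 - r_gpa_program^2)
print(round(vif, 2))  # about 1.33
```

A VIF this close to 1 is consistent with the conclusion that multicollinearity is not a concern here; common rules of thumb only flag values above roughly 5 or 10.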
model = glm(Return ~ GPA + Program, data = colleger, family = binomial) # binomial because return variable is binary
summary(model)
##
## Call:
## glm(formula = Return ~ GPA + Program, family = binomial, data = colleger)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.8926 1.7472 -3.945 7.98e-05 ***
## GPA 2.5388 0.6729 3.773 0.000161 ***
## Program 1.5608 0.5631 2.772 0.005579 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 128.207 on 99 degrees of freedom
## Residual deviance: 80.338 on 97 degrees of freedom
## AIC: 86.338
##
## Number of Fisher Scoring iterations: 5
# GPA and Program are both significant predictors of Return (p-values below 0.05).
# The coefficients are on the log-odds scale. The intercept (b0 = -6.89) is the log-odds of returning for a student with a GPA of 0 who did not attend orientation; it is a baseline rather than an effect, and it lies outside the observed GPA range (minimum 1.21).
# GPA (b1 = 2.54) and Program (b2 = 1.56) are positive: holding the other predictor constant, a higher GPA and attending the orientation program each increase the probability of returning.
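Written out with the estimates from the summary above, the fitted model can be expressed as follows. The symbol \hat{p} (the estimated probability that Return = 1) is introduced here for illustration and does not appear in the original output.

```latex
\log\frac{\hat{p}}{1-\hat{p}}
  = -6.8926 + 2.5388\,\mathrm{GPA} + 1.5608\,\mathrm{Program}
\qquad\Longleftrightarrow\qquad
\hat{p}
  = \frac{1}{1+\exp\!\bigl(6.8926 - 2.5388\,\mathrm{GPA} - 1.5608\,\mathrm{Program}\bigr)}
```

The left-hand form is the scale on which the coefficients are reported; the right-hand form is what `predict(..., type = "response")` evaluates.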
null_model = glm(Return ~ 1, data=colleger, family = binomial)
null_model
##
## Call: glm(formula = Return ~ 1, family = binomial, data = colleger)
##
## Coefficients:
## (Intercept)
## 0.6633
##
## Degrees of Freedom: 99 Total (i.e. Null); 99 Residual
## Null Deviance: 128.2
## Residual Deviance: 128.2 AIC: 130.2
The logistic regression model as a whole is statistically significant. The likelihood-ratio statistic is the drop in deviance from the null model to the fitted model: 128.207 - 80.338 = 47.869 on 2 degrees of freedom, which gives a p-value far below the significance level (alpha = 0.05).
Including GPA and Program as predictors therefore significantly improves the prediction of whether students return to Lakeland College for their sophomore year.
The null model predicts return from the overall return rate alone (its intercept, 0.6633, is the log-odds of the 66% observed return rate), so the fitted model is a significant improvement over it.
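The overall significance test can be reproduced from the deviances reported by `summary(model)`. The following is a minimal sketch that hard-codes those printed values instead of using the fitted objects; the variable names are illustrative.

```r
# Deviances copied from summary(model)
null_deviance  <- 128.207  # on 99 degrees of freedom
resid_deviance <- 80.338   # on 97 degrees of freedom

# Likelihood-ratio statistic and its degrees of freedom
g2 <- null_deviance - resid_deviance
df <- 99 - 97

# Upper-tail chi-square probability = p-value of the overall model test
p_value <- pchisq(g2, df = df, lower.tail = FALSE)
print(g2)
print(p_value)

# The null model's intercept is the log-odds of the overall return rate (0.66)
null_intercept <- qlogis(0.66)
print(round(null_intercept, 4))
```

With the fitted objects in the session, `anova(null_model, model, test = "Chisq")` should report the same comparison.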
pR2(model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -40.1688662 -64.1035478 47.8693631 0.3733753 0.3804077 0.5264883
Interpretation:
McFadden's pseudo R2 is 0.3734. It is not a share of variance explained, as R2 is in linear regression; it means the fitted model's log-likelihood is about 37% closer to zero than that of the null model with no predictors.
Values between 0.2 and 0.4 are commonly taken to indicate a good fit for this measure, and our model falls at the upper end of that range.
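The pseudo R2 values printed by `pR2()` can be recomputed by hand from the two log-likelihoods, which makes their meaning more concrete. This sketch copies `llh` and `llhNull` from the output above and takes n = 100 from the correlation output; the variable names are illustrative.

```r
# Log-likelihoods copied from the pR2(model) output
llh      <- -40.1688662  # fitted model
llh_null <- -64.1035478  # intercept-only model
n        <- 100          # number of students

# Likelihood-ratio statistic
g2 <- -2 * (llh_null - llh)

# McFadden: proportional reduction in log-likelihood
mcfadden <- 1 - llh / llh_null

# Maximum-likelihood (Cox-Snell) and Cragg-Uhler (Nagelkerke) versions
r2_ml <- 1 - exp(-g2 / n)
r2_cu <- r2_ml / (1 - exp(2 * llh_null / n))

print(round(c(G2 = g2, McFadden = mcfadden, r2ML = r2_ml, r2CU = r2_cu), 4))
```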
roc_curve = roc(colleger$Return, fitted(model)) # stored as roc_curve to avoid reusing the name of pROC's roc() function
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)
auc(roc_curve)
## Area under the curve: 0.8841
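An AUC (Area Under the Curve) of 0.8841 indicates that the logistic regression (LR) model discriminates well between students who return and students who do not: a randomly chosen returning student receives a higher predicted probability than a randomly chosen non-returning student about 88% of the time. AUC ranges from 0 to 1, where 0.5 corresponds to random guessing and values closer to 1 indicate better discrimination. Because this AUC is computed on the same data used to fit the model, it is likely to be somewhat optimistic.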
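To make the AUC less of a black box, the sketch below computes it with the rank (Mann-Whitney) formula. The labels and scores are hypothetical toy values, not the Lakeland data, and `auc_rank` is an illustrative helper, not a pROC function.

```r
# AUC as the share of (case, control) pairs in which the case scores higher,
# computed from ranks; ties receive average ranks and count as half
auc_rank <- function(labels, scores) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)  # cases (returned)
  n0 <- sum(labels == 0)  # controls (did not return)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical example: 3 controls and 3 cases
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.90)

auc_value <- auc_rank(labels, scores)
print(auc_value)  # 6 of the 9 case-control pairs are ordered correctly
```

Applied to `colleger$Return` and `fitted(model)`, this helper would be expected to agree with `auc()` from pROC.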
# Estimate the probability that a student with a 2.5 GPA who did NOT attend the orientation program will return to Lakeland for their sophomore year
newdatanot = data.frame(GPA = 2.5, Program = 0) # did not attend orientation
# Predicted probability of returning for a student who did not attend orientation
prob1 = predict(model, newdata = newdatanot, type = "response") # type = "response" returns a probability rather than log-odds
prob1
## 1
## 0.3668944
# There is a 36.7% probability that this student will return to Lakeland for their sophomore year (a 63.3% probability of not returning)
# Estimate the probability that a student with a 2.5 GPA who DID attend the orientation program will return to Lakeland for their sophomore year
newdatayes = data.frame(GPA = 2.5, Program = 1) # attended orientation
prob2 = predict(model, newdata = newdatayes, type = "response")
prob2
## 1
## 0.7340349
# There is a 73.4% probability that this student will return to Lakeland for their sophomore year
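Both predicted probabilities can be checked by hand from the rounded coefficients in the model summary. This is a minimal sketch that does not need the fitted object; the variable names are illustrative, and the results match `predict()` only up to rounding of the coefficients.

```r
# Coefficients copied (rounded) from summary(model)
b0        <- -6.8926
b_gpa     <- 2.5388
b_program <- 1.5608

# Linear predictor (log-odds) for a 2.5 GPA, without and with orientation
eta_no  <- b0 + b_gpa * 2.5
eta_yes <- eta_no + b_program

# plogis() is the inverse logit: 1 / (1 + exp(-eta))
p_no  <- plogis(eta_no)
p_yes <- plogis(eta_yes)
print(round(c(no_orientation = p_no, orientation = p_yes), 3))

# The ratio of the two odds equals exp(b_program), the Program odds ratio
odds_ratio_check <- (p_yes / (1 - p_yes)) / (p_no / (1 - p_no))
print(round(odds_ratio_check, 2))
```

Attending orientation roughly doubles the predicted probability of returning at this GPA, while the odds ratio stays the same at every GPA because the model has no interaction term.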
# Extract the coefficient table (b0, b1, b2)
coefficient = summary(model)$coefficients
# Calculate the odds ratio for 'Program' by exponentiating its coefficient
odds_ratio_program = exp(coefficient["Program", "Estimate"])
odds_ratio_program
## [1] 4.762413
# The odds ratio of 4.76 is greater than 1: holding GPA constant, students who attended the orientation program
# have about 4.76 times the odds of returning for their sophomore year
# compared with students who did not attend the orientation program at Lakeland College.
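A point estimate of the odds ratio says nothing about its precision. As a hedged addition that is not part of the original analysis, the sketch below builds an approximate 95% Wald confidence interval from the rounded Program estimate and standard error in the model summary; the variable names are illustrative.

```r
# Estimate and standard error for Program, copied from summary(model)
b_program  <- 1.5608
se_program <- 0.5631

# 97.5th percentile of the standard normal (about 1.96)
z <- qnorm(0.975)

# Exponentiate the log-odds estimate and its Wald limits
odds_ratio <- exp(b_program)
ci         <- exp(b_program + c(-1, 1) * z * se_program)

print(round(odds_ratio, 2))
print(round(ci, 2))  # roughly 1.6 to 14.4
```

The interval excludes 1, which agrees with the significant p-value for Program, but it is wide, so the size of the orientation effect is estimated imprecisely with 100 students. With the fitted object available, `exp(confint.default(model))` should give the same Wald limits.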