
# Predicting Student Retention at Lakeland College

Project Objective:

To investigate the factors that influence student retention at Lakeland College and to develop a predictive model that can be used to identify students who are at risk of dropping out.

Questions 1 & 2: Develop the Model & Assess Predictor Significance

Step 1: Install and load required libraries

library(readxl)

## Warning: package 'readxl' was built under R version 4.3.3

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.3.3

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(pscl)

## Warning: package 'pscl' was built under R version 4.3.3

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

library(pROC)

## Warning: package 'pROC' was built under R version 4.3.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Steps 2 & 3: Explore the dataset

college = read_excel(file.choose()) # Loading the dataset
head(college) # quick snapshot of the dataset

## # A tibble: 6 × 4
##   Student   GPA Program Return
##     <dbl> <dbl>   <dbl>  <dbl>
## 1       1  3.78       1      1
## 2       2  2.38       0      1
## 3       3  1.3        0      0
## 4       4  2.19       1      0
## 5       5  3.22       1      1
## 6       6  2.68       1      1

colleger = subset(college, select = -c(Student)) # Dropping the 'Student' column

summary(colleger) # Summarizing the new dataset that does not have the 'Student' column

##       GPA           Program         Return    
##  Min.   :1.210   Min.   :0.00   Min.   :0.00  
##  1st Qu.:2.377   1st Qu.:0.00   1st Qu.:0.00  
##  Median :2.735   Median :1.00   Median :1.00  
##  Mean   :2.740   Mean   :0.64   Mean   :0.66  
##  3rd Qu.:3.120   3rd Qu.:1.00   3rd Qu.:1.00  
##  Max.   :4.000   Max.   :1.00   Max.   :1.00

# Interpretation: The median GPA is about 2.74. Program and Return are binary (0/1) variables, so their means are proportions: 64% of students attended the orientation program and 66% returned for their sophomore year.

Step 4: Feature selection (i.e., Correlation Analysis)

corr = rcorr(as.matrix(colleger))
corr

##          GPA Program Return
## GPA     1.00    0.50   0.58
## Program 0.50    1.00   0.52
## Return  0.58    0.52   1.00
## 
## n= 100 
## 
## 
## P
##         GPA Program Return
## GPA          0       0    
## Program  0           0    
## Return   0   0

# Interpretation: Both predictors are moderately and positively correlated with the target: GPA with Return (r = 0.58) and Program with Return (r = 0.52). The p-values print as 0 because they are rounded; they are very small rather than exactly zero, so each correlation is statistically significant.
# The two predictors are correlated with each other at r = 0.50, below the 0.7 rule-of-thumb threshold, so multicollinearity is not a concern here. Avoid using predictors together if they are highly correlated with each other.
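As a rough cross-check on the multicollinearity reading, with only two predictors the variance inflation factor follows directly from their pairwise correlation, VIF = 1 / (1 - r^2). The sketch below plugs in the r = 0.50 reported by rcorr; the cutoff of 5 is a commonly used rule of thumb and an assumption here, not something taken from this analysis.

```r
# Correlation between the two predictors (GPA and Program), from the rcorr output
r_gpa_program <- 0.50

# With exactly two predictors, each predictor's VIF is 1 / (1 - r^2)
vif <- 1 / (1 - r_gpa_program^2)
vif

# Assumed rule of thumb: a VIF below 5 is usually treated as unproblematic
vif < 5
```

A VIF this close to 1 is consistent with keeping both GPA and Program in the model.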

Step 5: Build logistic regression model and assess predictor significance

model = glm(Return ~ GPA + Program, data = colleger, family = binomial) # binomial because return variable is binary
summary(model)

## 
## Call:
## glm(formula = Return ~ GPA + Program, family = binomial, data = colleger)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -6.8926     1.7472  -3.945 7.98e-05 ***
## GPA           2.5388     0.6729   3.773 0.000161 ***
## Program       1.5608     0.5631   2.772 0.005579 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 128.207  on 99  degrees of freedom
## Residual deviance:  80.338  on 97  degrees of freedom
## AIC: 86.338
## 
## Number of Fisher Scoring iterations: 5

# GPA and Program are both significant predictors of Return, with p-values below 0.05.
# The coefficients are on the log-odds scale. The positive coefficients for GPA (b1 = 2.54) and Program (b2 = 1.56) mean that a higher GPA and attending orientation each raise the odds, and therefore the probability, of returning, holding the other predictor constant.
# The intercept (b0 = -6.89) is the log-odds of returning for a student with a GPA of 0 who did not attend orientation. No student in the data has a GPA that low (the minimum is 1.21), so the intercept should not be read as a baseline effect on its own.
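The significance columns in the summary table can be reproduced by hand, which makes it clearer where the p-values come from. The sketch below copies the rounded estimates and standard errors from the output above (so the results agree only up to rounding) and recomputes the Wald z statistics, the two-sided p-values, and the AIC.

```r
# Estimates and standard errors copied (rounded) from summary(model)
estimate  <- c(Intercept = -6.8926, GPA = 2.5388, Program = 1.5608)
std_error <- c(Intercept =  1.7472, GPA = 0.6729, Program = 0.5631)

# Wald z statistic and two-sided p-value for each coefficient
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))
round(cbind(estimate, std_error, z_value, p_value), 6)

# AIC = residual deviance + 2 * (number of estimated coefficients)
aic <- 80.338 + 2 * length(estimate)
aic
```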

Question 3: Overall Model Significance

Likelihood Ratio Test

null_model = glm(Return ~ 1, data=colleger, family = binomial)
null_model

## 
## Call:  glm(formula = Return ~ 1, family = binomial, data = colleger)
## 
## Coefficients:
## (Intercept)  
##      0.6633  
## 
## Degrees of Freedom: 99 Total (i.e. Null);  99 Residual
## Null Deviance:       128.2 
## Residual Deviance: 128.2     AIC: 130.2

The likelihood ratio test compares the fitted model with the null model through the drop in deviance: 128.207 - 80.338 = 47.869 on 2 degrees of freedom (99 - 97). Under a chi-square distribution with 2 degrees of freedom this corresponds to a p-value of roughly 4e-11, far below the significance level (alpha = 0.05), so the logistic regression model is statistically significant overall.
Including GPA and Program as predictors therefore significantly improves the prediction of whether students return to Lakeland College for their sophomore year.
The null model, by comparison, assigns every student the same return probability (the observed return rate of 66%, which is the inverse logit of its intercept, 0.6633).
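The chunk above fits the null model but does not show the test itself. A minimal sketch of the likelihood ratio test, using only the deviances and degrees of freedom printed in the model summaries, might look like this. If the fitted objects are still in the session, `anova(null_model, model, test = "Chisq")` should report the same comparison.

```r
# Deviances and residual degrees of freedom copied from the model output
null_deviance     <- 128.207  # intercept-only model, 99 df
residual_deviance <- 80.338   # model with GPA + Program, 97 df

# Likelihood ratio statistic: the drop in deviance between the two models
lr_stat <- null_deviance - residual_deviance
lr_df   <- 99 - 97

# Upper-tail chi-square probability gives the p-value of the test
lr_p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p_value = lr_p_value)
```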

Pseudo R-squared

pR2(model)

## fitting null model for pseudo-r2

##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -40.1688662 -64.1035478  47.8693631   0.3733753   0.3804077   0.5264883

Interpretation:
McFadden's pseudo R-squared is 0.3734. Unlike the R-squared of a linear regression, it is not a share of variance explained; it measures the proportional improvement in log-likelihood over a model with no predictors. Here the fitted model's log-likelihood (-40.17) is about 37% closer to zero than the null model's (-64.10).
This indicates a good fit. For McFadden's measure, values between 0.2 and 0.4 are commonly taken to indicate a very good fit, and our model falls within this range.
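To make the definition concrete, the pseudo R-squared values can be rebuilt from the two log-likelihoods in the pR2 output. This is a sketch using the standard formulas for McFadden, maximum-likelihood (Cox and Snell), and Cragg-Uhler (Nagelkerke) measures, with n = 100 taken from the correlation output; it is meant as a check on the reported numbers, not as a replacement for pR2().

```r
# Log-likelihoods copied from the pR2(model) output
llh      <- -40.1688662  # fitted model
llh_null <- -64.1035478  # intercept-only model
n        <- 100          # number of students

# G2: likelihood ratio statistic (same value as the drop in deviance)
g2 <- 2 * (llh - llh_null)

# McFadden: proportional improvement in log-likelihood over the null model
mcfadden <- 1 - llh / llh_null

# Maximum-likelihood (Cox and Snell) pseudo R-squared
r2_ml <- 1 - exp(-g2 / n)

# Cragg-Uhler (Nagelkerke): r2_ml rescaled by its maximum attainable value
r2_cu <- r2_ml / (1 - exp(2 * llh_null / n))

round(c(G2 = g2, McFadden = mcfadden, r2ML = r2_ml, r2CU = r2_cu), 4)
```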

Area Under the Curve (AUC)

roc = roc(colleger$Return, fitted(model))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc)

auc(roc)

## Area under the curve: 0.8841

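An AUC (Area Under the Curve) of 0.8841 indicates that the logistic regression (LR) model discriminates well between students who return and those who do not: a randomly chosen returning student receives a higher predicted probability than a randomly chosen non-returning student about 88% of the time. AUC ranges from 0 to 1, where 0.5 corresponds to random guessing and values closer to 1 indicate better discrimination. Because it is computed on the same data used to fit the model, it is likely to be somewhat optimistic.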
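The pairwise reading of AUC can be illustrated with a small helper. The function name `auc_by_pairs` and the toy labels and scores below are hypothetical and are not taken from the Lakeland data; the sketch only shows that AUC equals the share of (returner, non-returner) pairs that the scores rank correctly, counting ties as half.

```r
# Hypothetical helper: AUC as the fraction of correctly ranked positive/negative pairs
auc_by_pairs <- function(labels, scores) {
  pos <- scores[labels == 1]       # scores of the positive class (returned)
  neg <- scores[labels == 0]       # scores of the negative class (did not return)
  diffs <- outer(pos, neg, "-")    # every positive score minus every negative score
  mean((diffs > 0) + 0.5 * (diffs == 0))  # wins count 1, ties count 0.5
}

# Toy example (made-up values): 3 of the 4 pairs are ranked correctly
toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
auc_by_pairs(toy_labels, toy_scores)  # 0.75
```

Applied to `colleger$Return` and `fitted(model)`, the same helper should agree with the value reported by `auc()`.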

Questions 4 & 5: Predicting with New Information

# Estimate the probability that a student with a 2.5 GPA who did NOT attend the orientation program will return to Lakeland for their sophomore year
newdatanot = data.frame(GPA = 2.5, Program = 0) # did not attend orientation
# Predicted probability of returning for a student who did not attend orientation
prob1 = predict(model, newdata = newdatanot, type = "response")
prob1

##         1 
## 0.3668944

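# A student with a 2.5 GPA who did not attend orientation has an estimated 36.7% probability of returning to Lakeland for their sophomore year (and so a 63.3% probability of not returning)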


# Estimate the probability that a student with a 2.5 GPA who DID attend the orientation program will return to Lakeland for their sophomore year
newdatayes = data.frame(GPA = 2.5, Program = 1) # attended orientation
prob2 = predict(model, newdata = newdatayes, type = "response")
prob2

##         1 
## 0.7340349

# A student with a 2.5 GPA who attended orientation has an estimated 73.4% probability of returning to Lakeland for their sophomore year
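Both predictions can be traced back to the fitted equation: the linear predictor b0 + b1 * GPA + b2 * Program is a log-odds value, and the inverse logit (plogis in R) converts it to a probability. The sketch below uses the rounded coefficients from the summary table, so the results match predict() only to about three decimal places.

```r
# Coefficients copied (rounded) from summary(model)
b0 <- -6.8926  # intercept
b1 <-  2.5388  # GPA
b2 <-  1.5608  # Program (1 = attended orientation)

gpa <- 2.5

# Linear predictor (log-odds) for each scenario
logit_no  <- b0 + b1 * gpa + b2 * 0  # did not attend orientation
logit_yes <- b0 + b1 * gpa + b2 * 1  # attended orientation

# Inverse logit turns log-odds into a probability of returning
p_no  <- plogis(logit_no)
p_yes <- plogis(logit_yes)

round(c(no_orientation = p_no, orientation = p_yes), 3)
```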

Odds Ratio

# Extract the coefficient table (b0, b1, b2)
coefficient = summary(model)$coefficients

# Odds ratio for 'Program': exponentiate its logistic regression coefficient
odds_ratio_program = exp(coefficient["Program", "Estimate"])
odds_ratio_program

## [1] 4.762413

# An odds ratio of 4.76 is greater than 1: holding GPA constant, students who attended the
# orientation program have about 4.76 times the odds of returning for their sophomore year
# compared with students who did not attend. This is a ratio of odds, not of probabilities.
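A point estimate of the odds ratio says nothing about its precision. As a sketch, an approximate 95% Wald interval can be built from the rounded Program estimate and standard error in the summary table; this is an assumption-laden shortcut, and `confint(model)` (profile likelihood) would generally give somewhat different limits. The last lines check that the same odds ratio falls out of the two predicted probabilities at GPA = 2.5.

```r
# Program coefficient and standard error copied (rounded) from summary(model)
b_program  <- 1.5608
se_program <- 0.5631

# Odds ratio and approximate 95% Wald confidence interval
or_program <- exp(b_program)
ci_95 <- exp(b_program + c(-1, 1) * qnorm(0.975) * se_program)
round(c(odds_ratio = or_program, lower = ci_95[1], upper = ci_95[2]), 2)

# Cross-check: ratio of the odds implied by the two predicted probabilities
prob_no  <- 0.3668944  # GPA 2.5, did not attend orientation
prob_yes <- 0.7340349  # GPA 2.5, attended orientation
or_from_probs <- (prob_yes / (1 - prob_yes)) / (prob_no / (1 - prob_no))
or_from_probs
```

The interval is wide, which is expected with 100 students, but its lower limit stays above 1, in line with the significant Program coefficient.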