Predicting Student Retention at Lakeland College

Project Objectives

To investigate the relationship between student GPA and attending orientation program on retention at Lakeland 
College.

Question 1 & 2 : Develop the Model & Asses Predictor Significance

Step 1: Install and load required libraries

#install.packages("readexcel)
#install.packages("Hmisc")
#install.packages("pscl")
#if(!require(pROC)) install.packages("pROC")
  
## step 1: load the libraries 
library(readxl) #allows us to import excel files
library(Hmisc) #allows us to call the correlation function
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(pscl) #allows us to call the pseudo R-square package to evaluate our model
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC) #allows us to run the area under the curve (AUC) package to get the plot and AUC score
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2: Import & clean the data

college_df <-read_excel("Lakeland College Data.xlsx")
coll_df <- subset(college_df,select = -c(Student))

Step 3: Summarize the data

head(coll_df)
## # A tibble: 6 × 3
##     GPA Program Return
##   <dbl>   <dbl>  <dbl>
## 1  3.78       1      1
## 2  2.38       0      1
## 3  1.3        0      0
## 4  2.19       1      0
## 5  3.22       1      1
## 6  2.68       1      1
A Description of the features are presented in the table below:

Variable     | Definition
-------------|------------
1. GPA       | A Student's grade point average after the first year of college
2. Program   | A student attending a voluntary one-week orientation program (1: yes and 0: no )
3. Return    | A student returning to college for their sophmore semester (1: yes and 0: no )
summary(coll_df)
##       GPA           Program         Return    
##  Min.   :1.210   Min.   :0.00   Min.   :0.00  
##  1st Qu.:2.377   1st Qu.:0.00   1st Qu.:0.00  
##  Median :2.735   Median :1.00   Median :1.00  
##  Mean   :2.740   Mean   :0.64   Mean   :0.66  
##  3rd Qu.:3.120   3rd Qu.:1.00   3rd Qu.:1.00  
##  Max.   :4.000   Max.   :1.00   Max.   :1.00
Interpretation: The median GPA is 2.74, with a median of 1 meaning students attended the orientation and would 
likely return for the sophomore year.

Step 4: Feature selection (i.e., correlation analysis)

corr <- rcorr(as.matrix(coll_df))
corr
##          GPA Program Return
## GPA     1.00    0.50   0.58
## Program 0.50    1.00   0.52
## Return  0.58    0.52   1.00
## 
## n= 100 
## 
## 
## P
##         GPA Program Return
## GPA          0       0    
## Program  0           0    
## Return   0   0
# Interpretation: All the variables are significant with the target variable.

Step 5: Build the logistic regression model

model <- glm(Return ~ GPA + Program, data = coll_df, family = binomial)
summary(model)
## 
## Call:
## glm(formula = Return ~ GPA + Program, family = binomial, data = coll_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -6.8926     1.7472  -3.945 7.98e-05 ***
## GPA           2.5388     0.6729   3.773 0.000161 ***
## Program       1.5608     0.5631   2.772 0.005579 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 128.207  on 99  degrees of freedom
## Residual deviance:  80.338  on 97  degrees of freedom
## AIC: 86.338
## 
## Number of Fisher Scoring iterations: 5
Interpretation: All of the independent variables are significant (p-value < 0.05).

Question 3: Overall Model Significance

likelihood test

## Fit a null model
null_model <- glm(Return ~ 1, data = coll_df, family = binomial)

#Perform likelihood ratio test 
anova(null_model, model, model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Return ~ 1
## Model 2: Return ~ GPA + Program
## Model 3: Return ~ GPA + Program
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1        99    128.207                         
## 2        97     80.338  2   47.869 4.03e-11 ***
## 3        97     80.338  0    0.000             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The inclusion of GPA and Program as predictors in our LR model does indeed significantly predict
the likelihood of students returning to Lakehood College for their sophomore year, relative to a model that 
predicts returns based solely on the mean of observed outcomes (i.e., null model). 

Pseudo R-squared

pR2(model)
## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -40.1688662 -64.1035478  47.8693631   0.3733753   0.3804077   0.5264883
Interpretation: A McFadden R-squared of 0.373 means that our LR model explains about 37.3% of the variability in 
the outcome relative to a model with no predictors. This is considered a moderate to good fit where values above 0.2 to 0.4 are often seen as indicative of a useful model.

Area Under the Curve (AUC)

The Area Under the Curve Score (AUC) score represents the ability of the model to correctly classify student who 
will return to Lakeland college and those who will not.
# Compute the ROC Curve and the AUC score
roc_curve <- roc(coll_df$Return, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.8841
Interpretation: An AUC score of 0.88 indicated that the LR model has a high level of accuracy in predicting 
student retention. 

Question 4 & 5 Predicting with New Information

# Given the following new student information 
new_data1 <- data.frame(GPA = 2.5, Program = 0) # did not attend orientation
new_data2 <- data.frame(GPA = 2.5, Program = 1) # attened orientation 

# Predict the probability 
# (1) Probability that the student did not attend orientation
prob1 <- predict(model, newdata = new_data1, type = "response") 
round((prob1 * 100),2) 
##     1 
## 36.69
# (2) Probability that the student attended orientation
prob2 <- predict(model, newdata = new_data2, type = "response") 
round((prob2 * 100),2)
##    1 
## 73.4
Interpretation:
(1) There is a 36.69% chance that the student will not return if they do not attend the orientation.
program
(2) There is a 73.4% chance that the student will return if they attend the orientation program.

Question 6: Odds Ratio

# Extract the coefficients
coefficients <- summary(model)$coefficients

odds_ratio_program <- exp(coefficients["Program", "Estimate"])
odds_ratio_program 
## [1] 4.762413
Interpretation: the odd ratio = 4.76 > 1. This indicates that attending the orientation program is associated 
with the higher odds of students returning for the sophomore year compared to not attending the orientation at 
Lakeland college.