Prediction Student Retention at Lakeland College

Project Objective

To investigate the relationship between student GPA and attending orientation program on retention at Lakeland Collage

Question 1 & 2: Develop the Model & Asses Predictor Significance

Step 1: Install and load the required libraries

#install.packages("Hmisc")
#install.packages("pscl")
#if(!require(pROC)) install.packages("pROC")

library(readxl) #allows us to import excel files 
library(Hmisc) #allows us to call the correlation function 
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(pscl) #helps us call the pseudo R-Square package to eval our model 
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC) # aloows us to run the area under the curve (AUC) package to get the plot and AUC score
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Step 2: Import data

college_df <- read_excel("Lakeland College Data.xlsx")
coll_df <-subset(college_df, select = -c(Student))

Step 3: Summarize the data (descriptive stats)

head(coll_df)
## # A tibble: 6 × 3
##     GPA Program Return
##   <dbl>   <dbl>  <dbl>
## 1  3.78       1      1
## 2  2.38       0      1
## 3  1.3        0      0
## 4  2.19       1      0
## 5  3.22       1      1
## 6  2.68       1      1
Data Description: A description of the features are presented in the table below.
Vatriable | Definition
----------|-----------
1. GPA    | A student point average after first year of college
2. Program| A student attending a voluntary one week program (1 : yes and 0: no)
3. Return | A student returning tto college for their sophmore semester (1 : yes and 0: no)
summary(coll_df)
##       GPA           Program         Return    
##  Min.   :1.210   Min.   :0.00   Min.   :0.00  
##  1st Qu.:2.377   1st Qu.:0.00   1st Qu.:0.00  
##  Median :2.735   Median :1.00   Median :1.00  
##  Mean   :2.740   Mean   :0.64   Mean   :0.66  
##  3rd Qu.:3.120   3rd Qu.:1.00   3rd Qu.:1.00  
##  Max.   :4.000   Max.   :1.00   Max.   :1.00
Interpretation: The median GPA is 'r median(coll_dt$GPA)' with a median 1 meaning students attend the orientation and retturn for the sophmore year

Step 4: feature selection (corrolation analysis)

corr <- rcorr(as.matrix(coll_df))
corr
##          GPA Program Return
## GPA     1.00    0.50   0.58
## Program 0.50    1.00   0.52
## Return  0.58    0.52   1.00
## 
## n= 100 
## 
## 
## P
##         GPA Program Return
## GPA          0       0    
## Program  0           0    
## Return   0   0
Interpretation:  all the predictor are significant with the target. There's no multicollinearity in the data.

Step 5: build the logistics regression

model <- glm(Return ~ GPA + Program, data = coll_df, family = binomial())
summary(model)
## 
## Call:
## glm(formula = Return ~ GPA + Program, family = binomial(), data = coll_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -6.8926     1.7472  -3.945 7.98e-05 ***
## GPA           2.5388     0.6729   3.773 0.000161 ***
## Program       1.5608     0.5631   2.772 0.005579 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 128.207  on 99  degrees of freedom
## Residual deviance:  80.338  on 97  degrees of freedom
## AIC: 86.338
## 
## Number of Fisher Scoring iterations: 5
Interpretation:All the independent variables were significant (p-value<0.05)

Question 3:Test Overall Model Significance

Likelihood ratio Test

null_model <- glm(Return ~ 1, data = coll_df, family = binomial)
#perform likelihood ratio test 
anova(null_model, model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Return ~ 1
## Model 2: Return ~ GPA + Program
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1        99    128.207                         
## 2        97     80.338  2   47.869 4.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation:The inclustion of GPA and Program as predictor in our LR model does indeed 
#significantly predict the likelihood of students returning to Lakeland College for their sophomore year, relative to a model that predicts return soley on the mean of observed outcomes

Pseudo R-squared

pR2(model)
## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -40.1688662 -64.1035478  47.8693631   0.3733753   0.3804077   0.5264883
Interpretation: A McFadden R-squarred of 0.373 means that our LR model
explains about 37.3% of the variability in the outcome relatibve to a model
with no predictor . this is considered a moderate to goof fit, where values abour 0.02-0.4 are often seen 

###Area under the curve (AUC)

The area under the curve score represents teh abilibity of the model to correctly classify students who will return to lakeland college and those who won't
roc_curve <- roc(coll_df$Return, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.8841
#interpretation: an AUC score of 0.88 indicates that the LR model has a high level of accuracy in predicting student retention

##Question 4 & 5 : predicting with new info

new_data1 <- data.frame(GPA = 2.5, Program = 0) #did not attend
new_data2 <- data.frame(GPA = 2.5, Program = 1) #did attend


#predict the probability 
prob1 <- predict(model, newdata = new_data1, type = "response") #prob that student didn't attend orientation
prob1 #36.7% that the student will not return
##         1 
## 0.3668944
prob2 <- predict(model, newdata = new_data2, type = "response") #prob that student didn't attend orientation
prob2 #73.4% that the student will return
##         1 
## 0.7340349
Interpretation:36.7% that the student will not return if they didn't attend the orientation, and 73.4% that the student will return if they attended the orientation

##Question 6: Odds ratio

#extract the coefficients 
coefficients <-summary(model)$coefficients

#calc the odd ratio for 'program'
odds_ratio_program <- exp(coefficients["Program", "Estimate"])
odds_ratio_program
## [1] 4.762413
Interpretation: the odds ratio is greater than 1 and this indicated that attending the orientation program is associated with higher odds of students returning for the sophmore yr compared to not attending the orientation at lakeland college