To investigate the relationship between student GPA and attending orientation program on retention at Lakeland Collage
#install.packages("Hmisc")
#install.packages("pscl")
#if(!require(pROC)) install.packages("pROC")
library(readxl) #allows us to import excel files
library(Hmisc) #allows us to call the correlation function
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(pscl) #helps us call the pseudo R-Square package to eval our model
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(pROC) # aloows us to run the area under the curve (AUC) package to get the plot and AUC score
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
college_df <- read_excel("Lakeland College Data.xlsx")
coll_df <-subset(college_df, select = -c(Student))
head(coll_df)
## # A tibble: 6 × 3
## GPA Program Return
## <dbl> <dbl> <dbl>
## 1 3.78 1 1
## 2 2.38 0 1
## 3 1.3 0 0
## 4 2.19 1 0
## 5 3.22 1 1
## 6 2.68 1 1
Data Description: A description of the features are presented in the table below.
Vatriable | Definition
----------|-----------
1. GPA | A student point average after first year of college
2. Program| A student attending a voluntary one week program (1 : yes and 0: no)
3. Return | A student returning tto college for their sophmore semester (1 : yes and 0: no)
summary(coll_df)
## GPA Program Return
## Min. :1.210 Min. :0.00 Min. :0.00
## 1st Qu.:2.377 1st Qu.:0.00 1st Qu.:0.00
## Median :2.735 Median :1.00 Median :1.00
## Mean :2.740 Mean :0.64 Mean :0.66
## 3rd Qu.:3.120 3rd Qu.:1.00 3rd Qu.:1.00
## Max. :4.000 Max. :1.00 Max. :1.00
Interpretation: The median GPA is 'r median(coll_dt$GPA)' with a median 1 meaning students attend the orientation and retturn for the sophmore year
corr <- rcorr(as.matrix(coll_df))
corr
## GPA Program Return
## GPA 1.00 0.50 0.58
## Program 0.50 1.00 0.52
## Return 0.58 0.52 1.00
##
## n= 100
##
##
## P
## GPA Program Return
## GPA 0 0
## Program 0 0
## Return 0 0
Interpretation: all the predictor are significant with the target. There's no multicollinearity in the data.
model <- glm(Return ~ GPA + Program, data = coll_df, family = binomial())
summary(model)
##
## Call:
## glm(formula = Return ~ GPA + Program, family = binomial(), data = coll_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.8926 1.7472 -3.945 7.98e-05 ***
## GPA 2.5388 0.6729 3.773 0.000161 ***
## Program 1.5608 0.5631 2.772 0.005579 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 128.207 on 99 degrees of freedom
## Residual deviance: 80.338 on 97 degrees of freedom
## AIC: 86.338
##
## Number of Fisher Scoring iterations: 5
Interpretation:All the independent variables were significant (p-value<0.05)
null_model <- glm(Return ~ 1, data = coll_df, family = binomial)
#perform likelihood ratio test
anova(null_model, model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Return ~ 1
## Model 2: Return ~ GPA + Program
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 99 128.207
## 2 97 80.338 2 47.869 4.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation:The inclustion of GPA and Program as predictor in our LR model does indeed
#significantly predict the likelihood of students returning to Lakeland College for their sophomore year, relative to a model that predicts return soley on the mean of observed outcomes
pR2(model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -40.1688662 -64.1035478 47.8693631 0.3733753 0.3804077 0.5264883
Interpretation: A McFadden R-squarred of 0.373 means that our LR model
explains about 37.3% of the variability in the outcome relatibve to a model
with no predictor . this is considered a moderate to goof fit, where values abour 0.02-0.4 are often seen
###Area under the curve (AUC)
The area under the curve score represents teh abilibity of the model to correctly classify students who will return to lakeland college and those who won't
roc_curve <- roc(coll_df$Return, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)
auc(roc_curve)
## Area under the curve: 0.8841
#interpretation: an AUC score of 0.88 indicates that the LR model has a high level of accuracy in predicting student retention
##Question 4 & 5 : predicting with new info
new_data1 <- data.frame(GPA = 2.5, Program = 0) #did not attend
new_data2 <- data.frame(GPA = 2.5, Program = 1) #did attend
#predict the probability
prob1 <- predict(model, newdata = new_data1, type = "response") #prob that student didn't attend orientation
prob1 #36.7% that the student will not return
## 1
## 0.3668944
prob2 <- predict(model, newdata = new_data2, type = "response") #prob that student didn't attend orientation
prob2 #73.4% that the student will return
## 1
## 0.7340349
Interpretation:36.7% that the student will not return if they didn't attend the orientation, and 73.4% that the student will return if they attended the orientation
##Question 6: Odds ratio
#extract the coefficients
coefficients <-summary(model)$coefficients
#calc the odd ratio for 'program'
odds_ratio_program <- exp(coefficients["Program", "Estimate"])
odds_ratio_program
## [1] 4.762413
Interpretation: the odds ratio is greater than 1 and this indicated that attending the orientation program is associated with higher odds of students returning for the sophmore yr compared to not attending the orientation at lakeland college