Question 1 & 2: Model and Predictor Significance
Step 1: Install and load required libraries
#install.packages("readxl")
#install.packages("Hmisc")
#install.packages("pscl")
#if(!require(pROC)) install.packages("pROC")
library(readxl) #allows us to import excel files
library(Hmisc) #allows us to call the correlation function
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(pscl) #allows us to call the pseudo R-square package to evaluate our model
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(pROC) #allows us to run the area under the curve (AUC) package to get the plot and AUC score
## Warning: package 'pROC' was built under R version 4.3.1
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
Step 2: Import & clean the data
college_df <- read_excel("Lakeland College Data.xlsx")
coll_df <- subset(college_df, select = -c(Student)) #drop irrelevant column
Step 3: Summarize the data
head(coll_df)
## # A tibble: 6 × 3
## GPA Program Return
## <dbl> <dbl> <dbl>
## 1 3.78 1 1
## 2 2.38 0 1
## 3 1.3 0 0
## 4 2.19 1 0
## 5 3.22 1 1
## 6 2.68 1 1
summary(coll_df)
## GPA Program Return
## Min. :1.210 Min. :0.00 Min. :0.00
## 1st Qu.:2.377 1st Qu.:0.00 1st Qu.:0.00
## Median :2.735 Median :1.00 Median :1.00
## Mean :2.740 Mean :0.64 Mean :0.66
## 3rd Qu.:3.120 3rd Qu.:1.00 3rd Qu.:1.00
## Max. :4.000 Max. :1.00 Max. :1.00
Data Description: A description of some of the features are presented in the table below.
Variable | Definition
------------- | -------------
1. GPA | A students grade point average after the first year of college
2. Program | A student attending a voluntary one-week orientation program (1: yes and 0: no)
3. Return | A student returning to college for their sophomore semester (1: yes and 0: no)
Interpretation: The median GPA is 2.7, with a median of 1 meaning students attended the orientation and returned for the sophomore year.
Step 4: Feature selection (i.e., correlation analysis)
corr <- rcorr(as.matrix(coll_df))
corr
## GPA Program Return
## GPA 1.00 0.50 0.58
## Program 0.50 1.00 0.52
## Return 0.58 0.52 1.00
##
## n= 100
##
##
## P
## GPA Program Return
## GPA 0 0
## Program 0 0
## Return 0 0
Interpretation: All the predictors are significant with the target variable (i.e., Return). There's no multicollinearity in the data.
Step 6: Build the logistic regression model
model <- glm(Return ~ GPA + Program, data = coll_df, family = binomial)
summary(model)
##
## Call:
## glm(formula = Return ~ GPA + Program, family = binomial, data = coll_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.8926 1.7472 -3.945 7.98e-05 ***
## GPA 2.5388 0.6729 3.773 0.000161 ***
## Program 1.5608 0.5631 2.772 0.005579 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 128.207 on 99 degrees of freedom
## Residual deviance: 80.338 on 97 degrees of freedom
## AIC: 86.338
##
## Number of Fisher Scoring iterations: 5
Question 3: Overall Model Significance
Likelihood Ratio Test
# Fit a null model
null_model <- glm(Return ~ 1, data = coll_df, family = binomial)
# Perform likelihood ratio test
anova(null_model, model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Return ~ 1
## Model 2: Return ~ GPA + Program
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 99 128.207
## 2 97 80.338 2 47.869 4.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The inclusion of GPA and Program as predictors in our LR model does indeed significantly predict the likelihood of students returning to Lakeland College for their sophomore year, relative to a model that predicts return based solely on the mean of observed outcomes (i.e., null model).
Pseudo-R-squared
pR2(model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -40.1688662 -64.1035478 47.8693631 0.3733753 0.3804077 0.5264883
Interpretation: A McFadden R-squared of 0.373 means that our LR model explains about 37.3% of the variability in the outcome relative to a mdoel with no predictors. This is considered a moderate to good fit, where values above 0.2 to 0.4 are often seen as indicative of a useful model.
Area Under the Curve (AUC)
The Area Under the Curve (AUC) score represents the ability of the model to correctly classify students who will return to Lakeland College and those who will not.
# Compute ROC Curve and the AUC score
roc_curve <- roc(coll_df$Return, fitted(model))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.8841
Interpretation: An AUC score of 0.88 indicates that the LR model has a high level of accuracy in predicting student retention.
Question 6: Odds Ratio
# Extract the coefficients
coefficients <- summary(model)$coefficients
# Calculate the odd ratio for 'Program'
odds_ratio_program <- exp(coefficients["Program", "Estimate"])
odds_ratio_program
## [1] 4.762413
Interpretation: The odd ratio is greater than 1 and this indicates that attending the orientation program is associated with higher odds of students returning for the sophomore year compared to not attending the orientation at Lakeland college.