Introduction

One of the best-known binary classification methods is logistic regression. The goal here is to predict whether a student will be admitted to a program based on three input features: GRE score, GPA, and school rank.
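For reference, logistic regression models the probability of admission as p = 1 / (1 + exp(-(b0 + b1*gre + b2*gpa + rank effects))), so the coefficients fitted below live on the log-odds scale.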

The data set is available at: https://github.com/finnstats/finnstats/blob/a9164a541ee2f4a830bfb66363ac3b7edc8b3449/binary.csv

Load Libraries

library(rio)        # import() for reading the CSV
library(tidyverse)  # dplyr verbs; also attaches ggplot2
library(caret)      # confusionMatrix()

Data-set snapshot

raw_data=import("binary.csv")  # rio::import() infers the format from the file extension
str(raw_data)
## 'data.frame':    400 obs. of  4 variables:
##  $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

There are no NA cells, so only simple cleaning is needed. After that, I create a contingency table by cross-classifying the factors to check for any unusual distribution.
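A quick way to verify the no-NA claim (a minimal base-R sketch, not part of the original output):

colSums(is.na(raw_data))  # expect a zero for every column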

Cleaning

# Recode the 0/1 outcome into labelled factor levels
raw_data$admit=as.factor(case_when(raw_data$admit==1~"Success",
                                   raw_data$admit==0~"Fail"))
# rank is categorical, so treat it as a factor rather than a number
raw_data$rank=as.factor(raw_data$rank)
xtabs(~admit+rank,raw_data)  # cross-tabulate outcome against rank
##          rank
## admit      1  2  3  4
##   Fail    28 97 93 55
##   Success 33 54 28 12

The table looks normal: every admit-by-rank combination is populated, so no unusual distribution is observed and each rank level can receive its own coefficient.
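To see the rates behind these counts, the table can be converted to column proportions (a small sketch, not in the original post):

round(prop.table(xtabs(~admit+rank,raw_data), margin=2), 2)  # success rate within each rank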

Logistic Regression

model=glm(admit~gre+gpa+rank,data=raw_data,family="binomial")  # binomial family = logistic regression
summary(model)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = raw_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

The AIC is 470.52 and every coefficient's p-value is below 0.05, so all predictors are statistically significant. Let's continue.
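Since the estimates are on the log-odds scale, exponentiating them turns them into odds ratios, which are easier to read (a small follow-up sketch; this output was not part of the original post):

exp(coef(model))  # e.g. exp(0.804) ≈ 2.23: each extra GPA point roughly doubles the odds of admission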

Graphical Visualization

result=data.frame(predict_pct=model$fitted.values,actual=raw_data$admit)
result=result %>% arrange(predict_pct)  # sort by predicted probability
result$idx=1:nrow(result)               # plotting index ("idx" avoids clashing with the rank predictor)

ggplot(result)+
  geom_point(aes(x=idx,y=predict_pct,color=actual,shape=actual),alpha=0.5)+
  labs(title="Logistic Regression Model for Student Admission")+
  xlab("No.")+ylab("Probability")+
  theme_classic()

The plot is hard to interpret directly, so I will turn the predictions into a confusion matrix. Here I threshold the model output at 0.5: a predicted probability above 0.5 means “Success”, otherwise “Fail”.

Confusion Matrix

result$predict=if_else(result$predict_pct>0.5,"Success","Fail")  # apply the 0.5 cut-off
confusionMatrix(as.factor(result$predict), result$actual, positive="Success")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fail Success
##    Fail     254      97
##    Success   19      30
##                                          
##                Accuracy : 0.71           
##                  95% CI : (0.6628, 0.754)
##     No Information Rate : 0.6825         
##     P-Value [Acc > NIR] : 0.1293         
##                                          
##                   Kappa : 0.1994         
##                                          
##  Mcnemar's Test P-Value : 8.724e-13      
##                                          
##             Sensitivity : 0.2362         
##             Specificity : 0.9304         
##          Pos Pred Value : 0.6122         
##          Neg Pred Value : 0.7236         
##              Prevalence : 0.3175         
##          Detection Rate : 0.0750         
##    Detection Prevalence : 0.1225         
##       Balanced Accuracy : 0.5833         
##                                          
##        'Positive' Class : Success        
## 

Accuracy is 71% with a 95% CI of (66%, 75%), which is fairly acceptable. Note, however, that the false negatives (97) far outnumber the false positives (19): the model misses most actual successes, which matches the low sensitivity of 23.6%.
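If missing a successful applicant is costlier than a false alarm, the 0.5 cut-off can be lowered to trade specificity for sensitivity. A hedged sketch (the 0.32 threshold simply mirrors the 31.75% prevalence reported above; its resulting matrix is not shown here):

result$predict_low=if_else(result$predict_pct>0.32,"Success","Fail")  # lower cut-off near the base rate
confusionMatrix(as.factor(result$predict_low), result$actual, positive="Success")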