One of the best-known binary classification is Logistic Regression. The goal is to predict whether the student can success admit to a program based on input features: GRE, GPA and Rank.
Data-set are provided at: https://github.com/finnstats/finnstats/blob/a9164a541ee2f4a830bfb66363ac3b7edc8b3449/binary.csv
library(rio)
library(tidyverse)
library(ggplot2)
library(caret)
raw_data=import("binary.csv")
str(raw_data)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
There is no NA cell. So that I can just do simple cleaning. Then create a contingency table from cross-classifying factors, to see if there are any special distribution.
raw_data$admit=as.factor(case_when(raw_data$admit==1~"Success",raw_data$admit==0~"Fail"))
raw_data$rank=as.factor(raw_data$rank)
xtabs(~admit+rank,raw_data)
## rank
## admit 1 2 3 4
## Fail 28 97 93 55
## Success 33 54 28 12
The table looks normal and no special distribution observed.
model=glm(admit~gre+gpa+rank,data=raw_data,family="binomial")
summary(model)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = raw_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
AIC is 470.52 and P-value are all less than 0.05. They are statistically significant. So lets continue.
result=data.frame(predict_pct=model$fitted.values,actual=raw_data$admit)
result=result %>% arrange(predict_pct)
result$rank=1:nrow(result)
ggplot(result)+geom_point(aes(x=rank,y=predict_pct,color=actual,shape=actual),alpha=0.5)+labs(title="Logistic Regression Model for Student Admission")+xlab("No.")+ylab("Probability")+theme_classic()
The plot look complicated to interpret. I will turn it into a confusion matrix. Here I assume the model prediction, with a probability > 0.5 mean “Success”, while <0.5 mean “Fail”.
result$predict= if_else(result$predict_pct>0.5,"Success","Fail")
confusionMatrix(as.factor(result$predict), result$actual,positive ="Success")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Success
## Fail 254 97
## Success 19 30
##
## Accuracy : 0.71
## 95% CI : (0.6628, 0.754)
## No Information Rate : 0.6825
## P-Value [Acc > NIR] : 0.1293
##
## Kappa : 0.1994
##
## Mcnemar's Test P-Value : 8.724e-13
##
## Sensitivity : 0.2362
## Specificity : 0.9304
## Pos Pred Value : 0.6122
## Neg Pred Value : 0.7236
## Prevalence : 0.3175
## Detection Rate : 0.0750
## Detection Prevalence : 0.1225
## Balanced Accuracy : 0.5833
##
## 'Positive' Class : Success
##
The Accuracy is 71% with 95% CI (66%, 75%). The result is fairly acceptable. But note that false negative number (97) is much greater than false positive number (19).