Performance evaluation is an important part of the machine learning process: we need to know how well a model works on unseen data. The ROC curve is one of the most widely used tools for assessing a classification model's performance.
ROC curves do more than tell us how well the model works; they also help us choose the most appropriate cut-off point for a test. The best cut-off point combines the highest possible true positive rate with the lowest possible false positive rate.
True Positive: If we reject a false H0, we call it a true positive; in other words, the model predicts positive and the case actually is positive.
True Negative: If we fail to reject a true H0, we call it a true negative; in other words, the model predicts negative and the case actually is negative.
False Positive: If we reject a true H0, we call it a false positive; in other words, the model predicts positive but the case actually is negative.
False Negative: If we fail to reject a false H0, we call it a false negative; in other words, the model predicts negative but the case actually is positive.
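As a quick illustration of the four cells, here is a minimal sketch in R using made-up predicted and actual labels (not the admission data, which is introduced below):
# Toy label vectors, purely hypothetical, to show how each cell is counted
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 1, 0, 1, 0)
TP <- sum(predicted == 1 & actual == 1)  # predicted positive, actually positive
TN <- sum(predicted == 0 & actual == 0)  # predicted negative, actually negative
FP <- sum(predicted == 1 & actual == 0)  # predicted positive, actually negative
FN <- sum(predicted == 0 & actual == 1)  # predicted negative, actually positive
c(TPR = TP / (TP + FN), FPR = FP / (FP + TN))  # the two rates a ROC curve plots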
Data:
The admission data set has 400 observations of 4 variables. The admit variable is binary: 1 means admitted and 0 means not admitted. GRE is the student's GRE score and GPA is the student's GPA. Rank is the rank of the school the student graduated from; schools with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
Objective:
Carry out a logistic regression, evaluate the model, and choose the best cut-off point using the ROC curve.
Data Preparation
#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(caret) # confusion matrix
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(e1071)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(nnet)
library(ROCR)
# Preparing the workspace
# Clear the workspace:
rm(list = ls())
# Load data
df <- read.csv("Admission.csv", header = TRUE)
head(df)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
nrow(df)
## [1] 400
colnames(df)
## [1] "admit" "gre" "gpa" "rank"
#Summary Statistics
summary(df)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
sapply(df, sd)
## admit gre gpa rank
## 0.4660867 115.5165364 0.3805668 0.9444602
#Data Structure
str(df)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
All variables are integers except gpa, which is numeric because it has decimal values. We need to convert admit and rank to categorical (factor) variables.
#Convert to Categorical Variable
newdf <- df %>%
  mutate(across(c(admit, rank), factor))
str(newdf)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
#Pairs plot
ggpairs(newdf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that GPA and GRE scores decrease slightly from schools with rank 1 to schools with rank 4. There are also a few outliers, which can be disregarded. The number of students who were rejected is significantly higher than the number who were admitted.
# GPA and GRE average by Rank
newdf %>%
group_by(rank) %>%
summarise_at(vars(c(gpa,gre)), list(name = mean))
## # A tibble: 4 x 3
## rank gpa_name gre_name
## <fct> <dbl> <dbl>
## 1 1 3.45 612.
## 2 2 3.36 596.
## 3 3 3.43 575.
## 4 4 3.32 570.
This table indicates that, on average, students who attended a rank 1 school have higher GPA and GRE scores than the other students.
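The same pattern can also be inspected graphically. Here is a minimal sketch using ggplot2 (already loaded via tidyverse); swap gpa for gre in the y aesthetic to see the GRE version:
# GPA distribution by school rank; outliers show up as individual points
ggplot(newdf, aes(x = rank, y = gpa)) +
  geom_boxplot()
# Class balance: proportion not admitted vs. admitted
prop.table(table(newdf$admit))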
Fit Model
glm
set.seed(111)
fit_glm <- multinom(admit~., data=newdf)
## # weights: 7 (6 variable)
## initial value 277.258872
## iter 10 value 229.258748
## iter 10 value 229.258746
## iter 10 value 229.258746
## final value 229.258746
## converged
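Since admit has only two levels, the same logistic regression can also be fit with glm() and a binomial family; the coefficients should agree with the multinom() fit above up to numerical precision. A minimal sketch (fit_logit is a name introduced here for illustration, not part of the original workflow):
# Equivalent binomial logistic regression for the binary response
fit_logit <- glm(admit ~ ., data = newdf, family = binomial)
summary(fit_logit)$coefficients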
Prediction
p <- predict(fit_glm, newdata = newdf)
# Confusion Matrix:
#By hand
table1 <- table(Predicted=p, Actual=newdf$admit)
table1
## Actual
## Predicted 0 1
## 0 254 97
## 1 19 30
# Accuracy Rate
sum(diag(table1)) / sum(table1)
## [1] 0.71
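caret (loaded above for exactly this purpose) can build the same matrix and report accuracy, sensitivity, and specificity in one call. A sketch, treating admitted ("1") as the positive class:
# Confusion matrix with summary statistics via caret
confusionMatrix(table1, positive = "1")
Its output also includes the No Information Rate, which is the baseline accuracy discussed next.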
#Accuracy Rate for random guess
table(newdf$admit)
##
## 0 1
## 273 127
273/400
## [1] 0.6825
As you can see, if we simply labeled every student as not admitted we would still be correct about 68% of the time, so an accuracy of 71% is not a big improvement over that baseline.
Model Performance Evaluation
pred1<- predict(fit_glm, newdf, type='prob')
hist(pred1)
pred2<- prediction(pred1,newdf$admit)
eval <- performance(pred2,'acc')
plot(eval)
#Choosing the best cut off value
abline(v=0.47, col="red")
abline(v=0.485, col="green")
max <- which.max(slot(eval, "y.values")[[1]])
max
## [1] 55
acc <- slot(eval, "y.values")[[1]][max]
cutoff <- slot(eval,"x.values")[[1]][max]
cutoff
## 271
## 0.489942
print(c(Accuracy=acc, Cutoff=cutoff))
## Accuracy Cutoff.271
## 0.717500 0.489942
As you can see, we increased the accuracy rate a little (from 71% to about 71.8%) by choosing the best cut-off value.
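To make that gain concrete, the chosen cut-off can be applied to the fitted probabilities directly; a sketch using the cutoff value found above (p_cut and table2 are names introduced here for illustration):
# Re-classify using the best cut-off instead of the default 0.5
p_cut  <- ifelse(pred1 >= cutoff, "1", "0")
table2 <- table(Predicted = p_cut, Actual = newdf$admit)
table2
sum(diag(table2)) / sum(table2)  # should be close to the accuracy reported above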
ROC Curve
roc <- performance(pred2, "tpr", "fpr")
plot(roc,
colorize=T,
lwd = 3,
main="ROC Curve")
abline(a=0, b=1)
ROC Construction
• The ROC curve plots the true positive fraction against the false positive fraction for each value of the significance level between 0 and 1.
• The 45° diagonal corresponds to classification based on a coin toss: the rate of correct classification equals the rate of misclassification.
• A minimally sufficient testing procedure should outperform a fair coin!
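The same curve can be built by hand from the fitted probabilities, which makes the construction explicit; a minimal sketch that sweeps over cut-offs and computes the true and false positive rates at each one:
# Manual ROC construction: one (FPR, TPR) point per cut-off
cutoffs <- seq(0, 1, by = 0.01)
tpr <- sapply(cutoffs, function(thr) mean(pred1[newdf$admit == "1"] >= thr))
fpr <- sapply(cutoffs, function(thr) mean(pred1[newdf$admit == "0"] >= thr))
plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC Curve (by hand)")
abline(a = 0, b = 1)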
AUC (Area Under Curve)
auc <- performance(pred2, "auc")
area <- unlist(slot(auc,"y.values"))
area
## [1] 0.6928413
AUC = area under the ROC curve
• AUC of 1 -> perfect test
• AUC of 0 -> the test misclassifies every case, for every value of α
• AUC of 0.5 -> performance equivalent to classification based on a coin toss
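As a sanity check, the area can be approximated numerically from the ROC points themselves, using the trapezoidal rule on the fpr/tpr values stored in the ROCR performance object created above; a sketch:
# Trapezoidal approximation of the area under the ROC curve
x <- unlist(slot(roc, "x.values"))  # false positive rates
y <- unlist(slot(roc, "y.values"))  # true positive rates
sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)  # should be close to `area`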
References:
CSU Applied Statistics Course Notes
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc