Introduction

Performance evaluation is an important part of the machine learning process. We need to know how well a model works on unseen data. The ROC curve, alongside other methods, is one of the most important evaluation tools for checking a classification model's performance.

ROC curves not only tell us how well the model works, they also help us choose the most appropriate cut-off point for a test. The best cut-off point combines the highest possible true positive rate with the lowest possible false positive rate.

True Positive:
If we reject a false H0, we call it a true positive; in other words, the model predicts positive and the case really is positive.

True Negative:
If we fail to reject a true H0, we call it a true negative; in other words, the model predicts negative and the case really is negative.

False Positive:
If we reject a true H0, we call it a false positive; in other words, the model predicts positive but the case is actually negative.

False Negative:
If we fail to reject a false H0, we call it a false negative; in other words, the model predicts negative but the case is actually positive.
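As a small illustration (with hypothetical actual and predicted label vectors, not the admission data), the four counts and the rates built from them can be computed directly in R:

# Hypothetical example: actual labels and model predictions (1 = positive)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 1, 0, 1, 0)

tp <- sum(predicted == 1 & actual == 1)  # true positives
tn <- sum(predicted == 0 & actual == 0)  # true negatives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives

c(TPR = tp / (tp + fn), FPR = fp / (fp + tn))  # true and false positive rates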

Data:

The Admission data set has 400 observations of 4 variables. The admit variable is binary: 1 means admitted and 0 means not admitted. GRE is the student's GRE score and GPA is the student's GPA. Rank is the rank of the school the student graduated from; schools with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.

Objective:

Carry out a logistic regression, evaluate the model, and choose the best cut-off point using the ROC curve.



Data Preparation

#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr) 
library(caret)  # confusion matrix
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(e1071)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(nnet)
library(ROCR)
#PREPARING WORK SPACE
# Clear the workspace: 
rm(list = ls())
# Load data
df <- read.csv("Admission.csv", header = TRUE)

head(df)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
nrow(df)
## [1] 400
colnames(df)
## [1] "admit" "gre"   "gpa"   "rank"
#Summary Statistics
summary(df)
##      admit             gre             gpa             rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
sapply(df, sd)
##       admit         gre         gpa        rank 
##   0.4660867 115.5165364   0.3805668   0.9444602
#Data Structure
str(df)
## 'data.frame':    400 obs. of  4 variables:
##  $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

All variables are stored as integers except gpa, which is numeric because it has decimal values. We need to convert admit and rank into categorical variables (factors).

#Convert to Categorical Variable
newdf <- df %>% 
  mutate_at(vars(admit,rank), funs(factor))

str(newdf)
## 'data.frame':    400 obs. of  4 variables:
##  $ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
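Note that funs() is deprecated in recent dplyr releases; with the dplyr 1.0.7 loaded above, an equivalent and more current form of the same conversion (a sketch) is:

# Same conversion using across() instead of the deprecated funs()
newdf <- df %>% 
  mutate(across(c(admit, rank), factor))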
#Pairs plot
ggpairs(newdf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can notice that GPA and GRE scores tend to decrease slightly from schools with rank 1 to schools with rank 4. There are also a few outliers, which can be disregarded. The number of students who were rejected is significantly higher than the number who were admitted.

# GPA and GRE average by Rank
newdf %>%
  group_by(rank) %>%
  summarise_at(vars(c(gpa,gre)), list(name = mean))
## # A tibble: 4 x 3
##   rank  gpa_name gre_name
##   <fct>    <dbl>    <dbl>
## 1 1         3.45     612.
## 2 2         3.36     596.
## 3 3         3.43     575.
## 4 4         3.32     570.

This table indicates that students who attend a school with rank 1 have, on average, higher GPA and GRE scores than the other students.

Fit Model

glm

set.seed(111)

fit_glm <- multinom(admit~., data=newdf)
## # weights:  7 (6 variable)
## initial  value 277.258872 
## iter  10 value 229.258748
## iter  10 value 229.258746
## iter  10 value 229.258746
## final  value 229.258746 
## converged
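Since admit has only two levels, the same model can also be fit with base R's glm(); as a sketch, the coefficient estimates should essentially match the multinom() fit above:

# Equivalent binary logistic regression with base R
fit_logit <- glm(admit ~ ., data = newdf, family = binomial(link = "logit"))
summary(fit_logit)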

Prediction

p <- predict(fit_glm, newdata = newdf)
#Confusion Matrix:

#By hand
table1 <- table(Predicted=p, Actual=newdf$admit)

table1
##          Actual
## Predicted   0   1
##         0 254  97
##         1  19  30
# Accuracy Rate
sum(diag(table1)/sum(table1))
## [1] 0.71
#Accuracy Rate for random guess
table(newdf$admit)
## 
##   0   1 
## 273 127
273/400
## [1] 0.6825
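The caret package loaded earlier provides confusionMatrix(), which reports this same table together with the accuracy and the no-information rate in a single call (a sketch, treating admitted students as the positive class):

# Confusion matrix with caret (positive class = admitted)
confusionMatrix(data = p, reference = newdf$admit, positive = "1")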

As you can see, if we simply predicted every student as not admitted, we would already be correct for about 68% of the students. So an accuracy of 71% is not much of an improvement.

Model Performance Evaluation

pred1 <- predict(fit_glm, newdf, type = 'probs')

hist(pred1)

pred2<- prediction(pred1,newdf$admit)

eval <- performance(pred2,'acc')
plot(eval)


#Choosing the best cut off value

abline(v=0.47, col="red")
abline(v=0.485, col="green")

max <- which.max(slot(eval, "y.values")[[1]])
max
## [1] 55
acc <- slot(eval, "y.values")[[1]][max]

cutoff <- slot(eval,"x.values")[[1]][max]
cutoff
##      271 
## 0.489942
print(c(Accuracy=acc, Cutoff=cutoff))
##   Accuracy Cutoff.271 
##   0.717500   0.489942

You can see that we increased our accuracy rate a little bit, from 71% to about 71.75%, by choosing the best cut-off value.
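To check this, we can re-classify the fitted probabilities at the cut-off found above and recompute the accuracy by hand (a sketch; it should reproduce the 0.7175 reported by performance()):

# Re-classify at the chosen cut-off and recompute accuracy
p_best <- factor(ifelse(pred1 >= cutoff, 1, 0), levels = c(0, 1))
table2 <- table(Predicted = p_best, Actual = newdf$admit)
sum(diag(table2)) / sum(table2)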

ROC Curve

roc <- performance(pred2, "tpr", "fpr")
plot(roc,
     colorize=T,
     lwd = 3,
     main="ROC Curve")
abline(a=0, b=1)
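The introduction noted that a good cut-off combines a high true positive rate with a low false positive rate. One common way to formalize that trade-off (a sketch using Youden's J = TPR - FPR, which is a different criterion from the accuracy-based cut-off chosen above) is to read the values straight out of the ROCR performance object:

# Cut-off that maximizes TPR - FPR (Youden's J) from the ROC object
tpr  <- slot(roc, "y.values")[[1]]
fpr  <- slot(roc, "x.values")[[1]]
cuts <- slot(roc, "alpha.values")[[1]]
best <- which.max(tpr - fpr)
c(Cutoff = cuts[best], TPR = tpr[best], FPR = fpr[best])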

ROC Construction
• The ROC curve plots the true positive fraction against the false positive fraction for each value of the classification cut-off (significance level) between 0 and 1; a manual sweep is sketched after this list.
• The 45° diagonal corresponds to classification based upon a coin toss: rate of correct classification = rate of misclassification
• A minimally sufficient testing procedure should outperform a fair coin!
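To make the construction concrete, here is a sketch that traces an approximate ROC curve by hand, sweeping the cut-off over a grid and computing the two fractions at each value; it should roughly reproduce the ROCR curve above:

# Manual sweep: one (FPR, TPR) point per cut-off value
thresholds <- seq(0, 1, by = 0.01)
roc_points <- t(sapply(thresholds, function(thr) {
  pred_thr <- ifelse(pred1 >= thr, 1, 0)
  tpr <- sum(pred_thr == 1 & newdf$admit == "1") / sum(newdf$admit == "1")
  fpr <- sum(pred_thr == 1 & newdf$admit == "0") / sum(newdf$admit == "0")
  c(FPR = fpr, TPR = tpr)
}))
plot(roc_points, type = "l", main = "ROC Curve (manual sweep)")
abline(a = 0, b = 1)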

AUC (Area Under Curve)

auc <- performance(pred2, "auc")
area <- unlist(slot(auc,"y.values"))
area
## [1] 0.6928413

AUC = area under the ROC curve

• AUC of 1 –> perfect test

• AUC of 0 –> test misclassifies in every case for every value of 𝛼

• AUC of .5 –> performance equivalent to classification based on a coin toss
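As a quick check (a sketch), the AUC can also be computed directly from its probabilistic interpretation: it is the probability that a randomly chosen admitted student receives a higher predicted probability than a randomly chosen non-admitted student, with ties counted as one half. The result should agree with the ROCR value above.

# AUC as the concordance probability (ties count as 1/2)
pos <- pred1[newdf$admit == "1"]  # predicted probabilities, admitted students
neg <- pred1[newdf$admit == "0"]  # predicted probabilities, non-admitted students
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))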

References:

  1. CSU Applied Statistics Course Notes

  2. https://www.youtube.com/watch?v=ypO1DPEKYFo

  3. https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

  4. https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc


