Introduction

On a recent project I was using logistic regression and, while testing model accuracy, I found myself repeatedly adjusting the classification threshold and creating many confusion matrices. I later found that using a ROC curve was a better approach to finding the optimal threshold.

library(caTools)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(readr)
library(tidyverse)
## -- Attaching packages ------------------------------------ tidyverse 1.3.0 --
## v tibble  3.0.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v purrr   0.3.3     v forcats 0.4.0
## -- Conflicts --------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
library(ROCR)


set.seed(42)

The ROC curve

ROC stands for receiver operating characteristic.

The ROC curve plots the true positive rate (the proportion of actual positives our model classifies correctly) against the false positive rate (the proportion of actual negatives our model classifies incorrectly). In the diagram, the diagonal line represents a model that is no better than random guessing; the area under that line is 0.5. Our ideal model would sit at the red dot, where the true positive rate (TPR) is 1, the false positive rate (FPR) is 0 and the area under the curve is 1. There we would have 100% true positives and zero false positives.
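
As a quick illustration of those two rates, here is a minimal sketch using made-up counts (these numbers are purely illustrative and not from any model in this post):

# Purely illustrative counts, not from our data
TP <- 40; FN <- 10   # actual positives, split by prediction
FP <- 15; TN <- 35   # actual negatives, split by prediction

tpr <- TP / (TP + FN)   # true positive rate (sensitivity)
fpr <- FP / (FP + TN)   # false positive rate (1 - specificity)
c(tpr = tpr, fpr = fpr)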

Get data

First up I will load some sample data. For this I am using the Cleveland heart disease data set; for more info see https://archive.ics.uci.edu/ml/datasets/heart+disease
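
For reference, below is a minimal sketch of how the data could be read in. I am assuming the processed Cleveland file from the UCI repository and the column names from its documentation; adjust the path and names to match wherever your copy lives.

# Assumed source and column names -- adjust to your local copy if needed
hd_data <- read_csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
  col_names = c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "class"),
  na = "?"
)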

I want to change the class column to a binary outcome, so 1 means heart disease and 0 means no heart disease. I will also recode the sex field and convert both to factors.

# Recode class into a binary outcome (1 = heart disease, 0 = no heart disease),
# relabel sex, and convert both to factors
hd_data <- hd_data %>%
  mutate(hd  = factor(ifelse(class > 0, 1, 0)),
         sex = factor(ifelse(sex == 0, "F", "M")))

Now we can glimpse the data just to orientate ourselves. Note we have 303 observations.

glimpse(hd_data)
## Rows: 303
## Columns: 15
## $ age      <int> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52...
## $ sex      <fct> M, M, M, M, F, M, F, F, M, M, M, F, M, M, M, M, M, M, F, M...
## $ cp       <int> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, 3, 2...
## $ trestbps <int> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140...
## $ chol     <int> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294...
## $ fbs      <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  <int> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0...
## $ thalach  <int> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153...
## $ exang    <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ oldpeak  <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3...
## $ slope    <int> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1, 1...
## $ ca       <int> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ thal     <int> 6, 3, 7, 3, 3, 3, 3, 3, 7, 7, 6, 3, 6, 7, 7, 3, 7, 3, 3, 3...
## $ class    <int> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0...
## $ hd       <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0...

Next we shuffle our data to avoid any bias that may result from the default ordering of the data set. We then split the data into training and test sets using a 70:30 split.

# Get the number of observations
n_obs <- nrow(hd_data)

# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)

# Randomly order the data
model_shuffled <- hd_data[permuted_rows, ]

# Identify row to split on: split
split <- round(n_obs * 0.7)

# Create train
train <- model_shuffled[1:split, ]

# Create test
test <- model_shuffled[(split + 1):n_obs, ]
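
As an aside, since caret is already loaded, a stratified split is another option. This is just a sketch of that alternative, not what is used in the rest of the post:

# Alternative: stratified 70:30 split with caret (keeps the class balance similar
# in the training and test sets)
in_train    <- createDataPartition(hd_data$hd, p = 0.7, list = FALSE)
train_strat <- hd_data[in_train, ]
test_strat  <- hd_data[-in_train, ]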

Build a logistic regression model

Now we build our logistic regression model to see if we can predict heart disease from our predictors. Note that for this example I have only used three predictors. To test whether there is a significant relationship between the predictors and the response variable (hd) you can use a chi-squared test to calculate p-values; I won't go into this here.

glm_model <- glm(data = train, hd ~ age + sex + thalach, family = "binomial" )

summary(glm_model)
## 
## Call:
## glm(formula = hd ~ age + sex + thalach, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.94637  -0.84293  -0.04118   0.86104   2.09577  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.016742   1.880857   2.136   0.0327 *  
## age          0.026756   0.019105   1.400   0.1614    
## sexM         1.605177   0.362115   4.433 9.30e-06 ***
## thalach     -0.043793   0.008484  -5.162 2.44e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 293.89  on 211  degrees of freedom
## Residual deviance: 226.25  on 208  degrees of freedom
## AIC: 234.25
## 
## Number of Fisher Scoring iterations: 4
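
As mentioned above, a chi-squared test can be used to check whether each predictor has a significant relationship with hd. One way to do that, shown here only as a sketch, is an analysis of deviance on the fitted model:

# Sequential analysis of deviance with chi-squared p-values
anova(glm_model, test = "Chisq")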

Test the model and create a confusion matrix and ROC curve

Now let's test our model against the test set and create a confusion matrix. Note that a classification threshold of 0.5 is set initially.

# Predict on test
p <- predict(glm_model, test, type = "response")
# If p exceeds threshold of 0.5, 1 else 0
hd_or_nohd <- ifelse(p > 0.5, 1, 0)

# Convert to factor: p_class
p_class <- factor(hd_or_nohd, levels = levels(test[["hd"]]))


# Create confusion matrix
confusionMatrix(p_class, test[["hd"]])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 36 10
##          1 22 23
##                                           
##                Accuracy : 0.6484          
##                  95% CI : (0.5412, 0.7456)
##     No Information Rate : 0.6374          
##     P-Value [Acc > NIR] : 0.46057         
##                                           
##                   Kappa : 0.2946          
##                                           
##  Mcnemar's Test P-Value : 0.05183         
##                                           
##             Sensitivity : 0.6207          
##             Specificity : 0.6970          
##          Pos Pred Value : 0.7826          
##          Neg Pred Value : 0.5111          
##              Prevalence : 0.6374          
##          Detection Rate : 0.3956          
##    Detection Prevalence : 0.5055          
##       Balanced Accuracy : 0.6588          
##                                           
##        'Positive' Class : 0               
## 

The Wikipedia page on confusion matrices (https://en.wikipedia.org/wiki/Confusion_matrix) gives an idea of how the matrix is structured.

So, reading our confusion matrix above (and treating 1, heart disease, as the positive outcome), we have 23 true positives, 36 true negatives, 22 false positives and 10 false negatives, giving an accuracy of about 65%.

We could then go on to create many confusion matrices as we adjust the classification threshold and compare them with one another, as sketched below. This is not ideal.
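
To make that concrete, a rough sketch of the manual approach might look like the following (thresholds chosen arbitrarily for illustration):

# Recompute the confusion matrix at a few candidate thresholds and compare accuracy
for (thresh in c(0.3, 0.5, 0.7)) {
  pred_class <- factor(ifelse(p > thresh, 1, 0), levels = levels(test$hd))
  acc <- confusionMatrix(pred_class, test$hd)$overall["Accuracy"]
  cat("threshold:", thresh, " accuracy:", round(acc, 3), "\n")
}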

Now let’s plot a ROC curve

roc_pred <- prediction(predictions = p  , labels = test$hd)
roc_perf <- performance(roc_pred , "tpr" , "fpr")
plot(roc_perf,
     colorize = TRUE,
     print.cutoffs.at= seq(0,1,0.05),
     text.adj=c(-0.2,1.7))

The x axis is the false positive rate and the y axis is the true positive rate. Each point along the curve corresponds to a confusion matrix (like the one we created above) that we don't have to evaluate manually, and represents a trade-off between the true positive rate and the false positive rate. By looking at the graph we can choose the optimal threshold depending on how many false positives (FP) we are willing to accept.
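
If we want a single suggested cutoff rather than reading it off the plot, one common heuristic (my choice here, not the only option) is Youden's J statistic: pick the cutoff that maximises TPR minus FPR. The ROCR performance object already holds everything we need:

# Cutoffs, TPR and FPR are stored in the performance object's slots
cutoffs <- roc_perf@alpha.values[[1]]
tpr     <- roc_perf@y.values[[1]]
fpr     <- roc_perf@x.values[[1]]

best <- which.max(tpr - fpr)   # index of the cutoff maximising TPR - FPR
cutoffs[best]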

AUC

We can also calculate the area under the ROC curve (AUC). A perfect model would give an AUC of exactly 1.0, while a random model averages an AUC of 0.5, since its ROC curve sits along the diagonal (no better than random guessing). AUC is a single-number summary that lets us evaluate model accuracy without inspecting individual confusion matrices; typically we want a model with an AUC of 0.8 or higher. We can also use it to compare against other models.

 (auc_ROCR <- performance(roc_pred, measure = "auc"))
## A performance instance
##   'Area under the ROC curve'
 (auc_ROCR <- auc_ROCR@y.values[[1]])
## [1] 0.7583595

From here, one way to improve our model would be to include additional relevant predictors.
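
As a rough sketch of that next step (which extra predictors actually add value is not established here), we could refit with more of the available columns and compare the AUC on the test set:

# Refit with extra predictors from the data set and recompute the AUC
glm_model2 <- glm(hd ~ age + sex + thalach + cp + trestbps + chol + oldpeak,
                  data = train, family = "binomial")
p2 <- predict(glm_model2, test, type = "response")
roc_pred2 <- prediction(predictions = p2, labels = test$hd)
performance(roc_pred2, measure = "auc")@y.values[[1]]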