Logistic Regression is a supervised learning model used for classification.
It applies when the independent variable (x) is continuous or categorical and the dependent variable (y) is categorical.
Confusion matrix: A confusion matrix cross-tabulates the actual values against the predicted values. For a problem with n classes it is an n x n matrix.
One can easily get confused when interpreting the cells of this matrix (True Positive, True Negative, False Positive, False Negative in the binary case), hence the name confusion matrix. Here, positive is usually the event of interest (which could be a "bad" event like getting cancer or going bankrupt) and indicates the presence of the condition; negative indicates its absence.
The confusion matrix evaluates the predictions made on the test data, i.e., it counts both the correct and the incorrect predictions made on that data.
This recipe demonstrates how to build and interpret a classification confusion matrix.
Confusion matrix: The confusion matrix is a performance-measurement technique that summarizes how well a classification algorithm performs.
The numbers of correct and incorrect predictions are summarized as counts, broken down by each combination of actual and predicted class.
It gives you insight not only into the errors made by your classifier but, more importantly, the types of errors that have been made.
TN: We predicted class 0 and the observation actually belongs to class 0. This is a correct classification.
TP: We predicted class 1 and the observation actually belongs to class 1. This is a correct classification.
FP: We predicted class 1 but the observation actually belongs to class 0. This is a misclassification.
FN: We predicted class 0 but the observation actually belongs to class 1. This is a misclassification.
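From these four counts the usual metrics follow: Accuracy = (TP + TN) / (TP + TN + FP + FN), Sensitivity (the true positive rate) = TP / (TP + FN), and Specificity (the true negative rate) = TN / (TN + FP).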
rm(list = ls()) # clear the workspace
# install.packages('caret')
library(caret) # for creating confusion matrix
## Loading required package: ggplot2
## Loading required package: lattice
# library(e1071)
Let's assume 1 is the event of interest.
You will have the actual values (the truth) from the raw data or your testing data.
The predicted values will come from your model (logistic, OLS, …).
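As a hedged sketch of where those predicted values could come from, the chunk below simulates a toy data set (the simulated x and y and the 0.5 cutoff are illustrative assumptions, not part of this recipe's data), fits a logistic regression with glm(), and converts the predicted probabilities into predicted classes:
set.seed(123)
x <- rnorm(100)                                      # a continuous predictor
y <- rbinom(100, size = 1, prob = plogis(1.5 * x))   # a binary outcome
fit <- glm(y ~ x, family = binomial)                 # logistic regression fit
pred_prob  <- predict(fit, type = "response")        # predicted probabilities
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0))  # classify with a 0.5 cutoff
table(Actual_Values = factor(y), Predicted_Values = pred_class)
For the rest of this recipe we simply hard-code a small set of actual and predicted values.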
actual_value <- factor(c(1,1,1,0,0,1,0,0,0,1,1,1,0,0,1))
predicted_value <- factor(c(0,0,1,0,1,1,1,0,1,0,0,1,1,1,0))
confusion_mat = table(Actual_Values = actual_value,
Predicted_Values = predicted_value)
confusion_mat
## Predicted_Values
## Actual_Values 0 1
## 0 2 5
## 1 5 3
accuracy <- (2+3) / (2+3+5+5) # (TP + TN) / total
sensitivity <- 3 / (3+5)      # TP / (TP + FN)
specificity <- 2 / (2+5)      # TN / (TN + FP)
accuracy
## [1] 0.3333333
sensitivity
## [1] 0.375
specificity
## [1] 0.2857143
confusion_mat_hand <- as.matrix(confusion_mat)
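As a cross-check, the same metrics can be pulled straight from the cells of the hand-built matrix (a sketch that assumes, as in the table() call above, that rows are the actual values and columns are the predicted values, with 1 as the event of interest):
TN <- confusion_mat_hand["0", "0"] # predicted 0, actually 0
FP <- confusion_mat_hand["0", "1"] # predicted 1, actually 0
FN <- confusion_mat_hand["1", "0"] # predicted 0, actually 1
TP <- confusion_mat_hand["1", "1"] # predicted 1, actually 1
(TP + TN) / sum(confusion_mat_hand) # accuracy
TP / (TP + FN)                      # sensitivity
TN / (TN + FP)                      # specificity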
Let's use the confusionMatrix() function from the caret package.
?confusionMatrix # Calculates a cross-tabulation of observed and predicted classes with associated statistics.
## Help on topic 'confusionMatrix' was found in the following packages:
##
## Package Library
## caret /Users/arvindsharma/Library/R/x86_64/4.2/library
## ModelMetrics /Library/Frameworks/R.framework/Versions/4.2/Resources/library
##
##
## Using the first match ...
confusionMatrix(reference = factor(actual_value),
data = factor(predicted_value)
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 5
## 1 5 3
##
## Accuracy : 0.3333
## 95% CI : (0.1182, 0.6162)
## No Information Rate : 0.5333
## P-Value [Acc > NIR] : 0.9657
##
## Kappa : -0.3393
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.2857
## Specificity : 0.3750
## Pos Pred Value : 0.2857
## Neg Pred Value : 0.3750
## Prevalence : 0.4667
## Detection Rate : 0.1333
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.3304
##
## 'Positive' Class : 0
##
accuracy
## [1] 0.3333333
sensitivity
## [1] 0.375
specificity
## [1] 0.2857143
confusion_mat_hand
## Predicted_Values
## Actual_Values 0 1
## 0 2 5
## 1 5 3
Notice that caret has swapped sensitivity and specificity relative to our hand calculations. If you fail to declare the event of interest, the function will still generate output, but the definitions may be applied to the wrong class.
When you read up on the caret::confusionMatrix command,
you find that the positive argument takes an optional
character string for the factor level that corresponds to a "positive"
result (if that makes sense for your data). If there are only two factor
levels, the first level will be used as the "positive" result.
factor(actual_value)
## [1] 1 1 1 0 0 1 0 0 0 1 1 1 0 0 1
## Levels: 0 1
factor(predicted_value)
## [1] 0 0 1 0 1 1 1 0 1 0 0 1 1 1 0
## Levels: 0 1
In our data, it is taking 0 as the positive event.
So always declare the event of interest explicitly with the
positive option, as below.
confusionMatrix(reference = factor(actual_value),
data = factor(predicted_value),
positive = "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 5
## 1 5 3
##
## Accuracy : 0.3333
## 95% CI : (0.1182, 0.6162)
## No Information Rate : 0.5333
## P-Value [Acc > NIR] : 0.9657
##
## Kappa : -0.3393
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.3750
## Specificity : 0.2857
## Pos Pred Value : 0.3750
## Neg Pred Value : 0.2857
## Prevalence : 0.5333
## Detection Rate : 0.2000
## Detection Prevalence : 0.5333
## Balanced Accuracy : 0.3304
##
## 'Positive' Class : 1
##
accuracy
## [1] 0.3333333
sensitivity
## [1] 0.375
specificity
## [1] 0.2857143
confusion_mat_hand
## Predicted_Values
## Actual_Values 0 1
## 0 2 5
## 1 5 3
The values of sensitivity, specificity, and
accuracy computed by hand do indeed match the caret output now.
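An alternative (a sketch of the same idea rather than a requirement) is to relevel both factors so that "1" is the first level; with only two levels, confusionMatrix() then treats it as the positive class by default:
actual_relevel <- relevel(actual_value, ref = "1")       # make "1" the first level
predicted_relevel <- relevel(predicted_value, ref = "1")
confusionMatrix(reference = actual_relevel, data = predicted_relevel)
Even so, stating positive = "1" explicitly keeps the intent obvious to anyone reading the code.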