Confusion Matrix

How do you get a classification confusion matrix?

Logistic regression is a supervised learning model used for classification.

Logistic regression is used when the independent variables (x) are continuous or categorical and the dependent variable (y) is categorical.

Confusion matrix: a confusion matrix cross-tabulates the actual data against the predicted data. For a problem with n classes it is an n x n matrix.

One can easily get confused by the classes in this matrix (True Positive, True Negative, False Positive, False Negative), hence the name confusion matrix. Here, positive is usually the event of interest (it can be a "bad" event such as getting cancer or going bankrupt) and indicates the presence of the condition; negative indicates the absence of the event of interest.

The confusion matrix evaluates the predictions made on the test data, i.e., it counts the correct predictions as well as the incorrect ones.

Example

This recipe demonstrates how to obtain a classification confusion matrix in R.

Confusion matrix: a confusion matrix is a technique for summarizing the performance of a classification algorithm.

The number of correct and incorrect predictions is summarized with count values, broken down by each combination of actual and predicted class.

It gives you insight not only into the errors made by your classifier but, more importantly, into the types of errors being made.

  • TN (true negative): We predicted class 0 and the actual class is 0. This is a correct classification.

  • TP (true positive): We predicted class 1 and the actual class is 1. This is a correct classification.

  • FP (false positive): We predicted class 1 but the actual class is 0. This is a misclassification.

  • FN (false negative): We predicted class 0 but the actual class is 1. This is a misclassification.
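
These four counts are what the standard metrics are built from. As a minimal sketch (the helper function name is made up for illustration, not from any package), using the counts that appear in Step 3 below (TN = 2, FP = 5, FN = 5, TP = 3):

# Illustrative helper: standard metrics derived from the four counts.
classification_metrics <- function(TP, TN, FP, FN) {
  c(accuracy    = (TP + TN) / (TP + TN + FP + FN),  # share of all predictions that are correct
    sensitivity = TP / (TP + FN),                   # true positive rate (recall)
    specificity = TN / (TN + FP))                   # true negative rate
}
classification_metrics(TP = 3, TN = 2, FP = 5, FN = 5)  # matches the 0.333 / 0.375 / 0.286 computed in Step 3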

Step 1 - Install and load the necessary libraries

rm(list = ls())              # clear the workspace
# install.packages('caret')  # uncomment to install caret the first time
library(caret)               # for creating the confusion matrix
Loading required package: ggplot2
Loading required package: lattice
# library(e1071)

Step 2 - Create sample data

Let's assume 1 is the event of interest.

You will have the actual values (the truth) from the raw data or your test data.
The predicted values will come from your model (logistic regression, OLS, …).

actual_value    <- factor(c(1,1,1,0,0,1,0,0,0,1,1,1,0,0,1))

predicted_value <- factor(c(0,0,1,0,1,1,1,0,1,0,0,1,1,1,0))
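
In practice the predicted classes come from a fitted model rather than being typed in by hand. A minimal sketch, assuming a hypothetical data frame train_df with a binary outcome y and a predictor x (names not from this recipe):

# Hypothetical sketch: turning logistic-regression probabilities into predicted classes.
# train_df, y and x are assumed names, not part of this recipe's data.
fit  <- glm(y ~ x, data = train_df, family = binomial)       # fit a logistic regression
prob <- predict(fit, newdata = train_df, type = "response")  # predicted probabilities
predicted_value <- factor(ifelse(prob > 0.5, 1, 0))          # classify at a 0.5 cutoff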

Step 3 - Create a Confusion Matrix by hand

confusion_mat = table(Actual_Values    = actual_value, 
                      Predicted_Values = predicted_value) 

  
confusion_mat
             Predicted_Values
Actual_Values 0 1
            0 2 5
            1 5 3
accuracy    <- (2+3) / (2+3+5+5)   # (TN + TP) / total
sensitivity <- 3     / (3+5)       # TP / (TP + FN)
specificity <- 2     / (2+5)       # TN / (TN + FP)

accuracy
[1] 0.3333333
sensitivity
[1] 0.375
specificity
[1] 0.2857143
confusion_mat_hand <- as.matrix(confusion_mat)
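
As an optional check, you can pull the four counts out of the table by their dimnames instead of copying them by eye, so the formulas stay correct if the data change. A minimal sketch using the objects defined above:

TN <- confusion_mat["0", "0"]   # actual 0, predicted 0
FP <- confusion_mat["0", "1"]   # actual 0, predicted 1
FN <- confusion_mat["1", "0"]   # actual 1, predicted 0
TP <- confusion_mat["1", "1"]   # actual 1, predicted 1
(TP + TN) / sum(confusion_mat)  # accuracy, same 0.3333333 as above
TP / (TP + FN)                  # sensitivity, 0.375
TN / (TN + FP)                  # specificity, 0.2857143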

Step 4 - Confusion matrix using the ‘caret’ package

Let's use the confusionMatrix() function from the caret package.

?confusionMatrix # Calculates a cross-tabulation of observed and predicted classes with associated statistics.
Help on topic 'confusionMatrix' was found in the following packages:

  Package               Library
  ModelMetrics          /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
  caret                 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library


Using the first match ...
confusionMatrix(reference = factor(actual_value),
                data      = factor(predicted_value)
                )
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 2 5
         1 5 3
                                          
               Accuracy : 0.3333          
                 95% CI : (0.1182, 0.6162)
    No Information Rate : 0.5333          
    P-Value [Acc > NIR] : 0.9657          
                                          
                  Kappa : -0.3393         
                                          
 Mcnemar's Test P-Value : 1.0000          
                                          
            Sensitivity : 0.2857          
            Specificity : 0.3750          
         Pos Pred Value : 0.2857          
         Neg Pred Value : 0.3750          
             Prevalence : 0.4667          
         Detection Rate : 0.1333          
   Detection Prevalence : 0.4667          
      Balanced Accuracy : 0.3304          
                                          
       'Positive' Class : 0               
                                          
accuracy
[1] 0.3333333
sensitivity
[1] 0.375
specificity
[1] 0.2857143
confusion_mat_hand
             Predicted_Values
Actual_Values 0 1
            0 2 5
            1 5 3

If you fail to declare the event of interest, the function will still generate an output, but the definitions may be applied to the wrong class.

When you read up on the caret::confusionMatrix command, you find that the positive argument takes an optional character string for the factor level that corresponds to a "positive" result (if that makes sense for your data). If there are only two factor levels, the first level will be used as the "positive" result.

factor(actual_value)
 [1] 1 1 1 0 0 1 0 0 0 1 1 1 0 0 1
Levels: 0 1
factor(predicted_value)
 [1] 0 0 1 0 1 1 1 0 1 0 0 1 1 1 0
Levels: 0 1

In our data, it is taking 0 as the positive event.
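
As an aside, one way around this (not used below) is to relevel both factors so that "1" becomes the first level and is therefore treated as the positive class by default; a minimal sketch:

# Alternative sketch: make "1" the first factor level so it is taken as positive by default.
actual_relev    <- relevel(factor(actual_value),    ref = "1")
predicted_relev <- relevel(factor(predicted_value), ref = "1")
confusionMatrix(reference = actual_relev, data = predicted_relev)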

So always declare the event of interest explicitly with the positive argument, like below.

confusionMatrix(reference = factor(actual_value),
                data      = factor(predicted_value),
                positive  = "1"
                )
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 2 5
         1 5 3
                                          
               Accuracy : 0.3333          
                 95% CI : (0.1182, 0.6162)
    No Information Rate : 0.5333          
    P-Value [Acc > NIR] : 0.9657          
                                          
                  Kappa : -0.3393         
                                          
 Mcnemar's Test P-Value : 1.0000          
                                          
            Sensitivity : 0.3750          
            Specificity : 0.2857          
         Pos Pred Value : 0.3750          
         Neg Pred Value : 0.2857          
             Prevalence : 0.5333          
         Detection Rate : 0.2000          
   Detection Prevalence : 0.5333          
      Balanced Accuracy : 0.3304          
                                          
       'Positive' Class : 1               
                                          
accuracy
[1] 0.3333333
sensitivity
[1] 0.375
specificity
[1] 0.2857143
confusion_mat_hand
             Predicted_Values
Actual_Values 0 1
            0 2 5
            1 5 3

The values of sensitivity, specificity and accuracy do indeed match now.
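
If you want these numbers programmatically rather than reading them off the printed output, the object returned by confusionMatrix() stores them in its table, overall and byClass components. A minimal sketch:

cm <- confusionMatrix(reference = factor(actual_value),
                      data      = factor(predicted_value),
                      positive  = "1")
cm$table                     # the underlying 2 x 2 table
cm$overall["Accuracy"]       # 0.3333333
cm$byClass["Sensitivity"]    # 0.375
cm$byClass["Specificity"]    # 0.2857143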