Logistic Regression is a supervised learning model used for classification.
It applies when the independent variable (x) is continuous or categorical and the dependent variable (y) is categorical.
Confusion matrix: A confusion matrix cross-tabulates the actual values against the predicted values. For a problem with n classes it is an n x n matrix.
One can easily get confused when interpreting the cells of this matrix (True Positive, True Negative, False Positive, False Negative in the binary case), hence the name confusion matrix. Here, positive is usually the event of interest (which could be a "bad" event like getting cancer or going bankrupt) and indicates the presence of the condition; negative indicates its absence.
The confusion matrix evaluates the predictions made on the test data, i.e., it counts both the correct and the incorrect predictions made on that data.
This recipe demonstrates how to build and interpret a classification confusion matrix.
Confusion matrix: The confusion matrix is a performance-measurement technique that summarizes how well a classification algorithm performs.
The numbers of correct and incorrect predictions are summarized as counts, broken down by each combination of actual and predicted class.
It gives you insight not only into the errors made by your classifier but, more importantly, the types of errors that have been made.
TN: We predicted class 0 and the observation actually belongs to class 0. This is a correct classification.
TP: We predicted class 1 and the observation actually belongs to class 1. This is a correct classification.
FP: We predicted class 1 but the observation actually belongs to class 0. This is a misclassification.
FN: We predicted class 0 but the observation actually belongs to class 1. This is a misclassification.
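From these four counts the usual metrics follow: Accuracy = (TP + TN) / (TP + TN + FP + FN), Sensitivity (the true positive rate) = TP / (TP + FN), and Specificity (the true negative rate) = TN / (TN + FP).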
rm(list = ls()) # clear the workspace
# install.packages('caret')
library(caret) # for creating confusion matrix
## Loading required package: ggplot2
## Loading required package: lattice
# library(e1071)
Let's assume 1 is the event of interest.
You will have the actual values (the truth) from the raw data or your testing data.
The predicted values will come from your model (logistic, OLS, …).
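As a hedged sketch of where those predicted values could come from, the chunk below simulates a toy data set (the simulated x and y and the 0.5 cutoff are illustrative assumptions, not part of this recipe's data), fits a logistic regression with glm(), and converts the predicted probabilities into predicted classes:
set.seed(123)
x <- rnorm(100)                                      # a continuous predictor
y <- rbinom(100, size = 1, prob = plogis(1.5 * x))   # a binary outcome
fit <- glm(y ~ x, family = binomial)                 # logistic regression fit
pred_prob  <- predict(fit, type = "response")        # predicted probabilities
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0))  # classify with a 0.5 cutoff
table(Actual_Values = factor(y), Predicted_Values = pred_class)
For the rest of this recipe we simply hard-code a small set of actual and predicted values.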
actual_value <- factor(c(1,1,1,0,0,1,0,0,0,1,1,1,0,0,1))
predicted_value <- factor(c(0,0,1,0,1,1,1,0,1,0,0,1,1,1,0))
confusion_mat = table(Actual_Values = actual_value,
Predicted_Values = predicted_value)
confusion_mat
## Predicted_Values
## Actual_Values 0 1
## 0 2 5
## 1 5 3
accuracy <- (2+3) / (2+3+5+5) # (TP + TN) / total
sensitivity <- 3 / (3+5)      # TP / (TP + FN)
specificity <- 2 / (2+5)      # TN / (TN + FP)
accuracy
## [1] 0.3333333
sensitivity
## [1] 0.375
specificity
## [1] 0.2857143
confusion_mat_hand <- as.matrix(confusion_mat)
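As a cross-check, the same metrics can be pulled straight from the cells of the hand-built matrix (a sketch that assumes, as in the table() call above, that rows are the actual values and columns are the predicted values, with 1 as the event of interest):
TN <- confusion_mat_hand["0", "0"] # predicted 0, actually 0
FP <- confusion_mat_hand["0", "1"] # predicted 1, actually 0
FN <- confusion_mat_hand["1", "0"] # predicted 0, actually 1
TP <- confusion_mat_hand["1", "1"] # predicted 1, actually 1
(TP + TN) / sum(confusion_mat_hand) # accuracy
TP / (TP + FN)                      # sensitivity
TN / (TN + FP)                      # specificity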
Let's use the confusionMatrix() function from the caret package.
?confusionMatrix # Calculates a cross-tabulation of observed and predicted classes with associated statistics.
## Help on topic 'confusionMatrix' was found in the following packages:
##
## Package Library
## caret /Users/arvindsharma/Library/R/x86_64/4.2/library
## ModelMetrics /Library/Frameworks/R.framework/Versions/4.2/Resources/library
##
##
## Using the first match ...
confusionMatrix(reference = factor(actual_value),
data = factor(predicted_value)
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 5
## 1 5 3
##
## Accuracy : 0.3333
## 95% CI : (0.1182, 0.6162)
## No Information Rate : 0.5333
## P-Value [Acc > NIR] : 0.9657
##
## Kappa : -0.3393
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.2857
## Specificity : 0.3750
## Pos Pred Value : 0.2857
## Neg Pred Value : 0.3750
## Prevalence : 0.4667
## Detection Rate : 0.1333
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.3304
##
## 'Positive' Class : 0
##
accuracy
## [1] 0.3333333
sensitivity
## [1] 0.375
specificity
## [1] 0.2857143
confusion_mat_hand
## Predicted_Values
## Actual_Values 0 1
## 0 2 5
## 1 5 3
Notice that caret has swapped sensitivity and specificity relative to our hand calculations. If you fail to declare the event of interest, the function will still generate output, but the definitions may be applied to the wrong class.
When you read up on the caret::confusionMatrix command,
you find that the positive argument takes an optional
character string for the factor level that corresponds to a "positive"
result (if that makes sense for your data). If there are only two factor
levels, the first level will be used as the "positive" result.
factor(actual_value)
## [1] 1 1 1 0 0 1 0 0 0 1 1 1 0 0 1
## Levels: 0 1
factor(predicted_value)
## [1] 0 0 1 0 1 1 1 0 1 0 0 1 1 1 0
## Levels: 0 1
In our data, it is taking 0 as the positive event.
So always declare the event of interest explicitly with the
positive option, as below.
confusionMatrix(reference = factor(actual_value),
data = factor(predicted_value),
positive = "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 5
## 1 5 3
##
## Accuracy : 0.3333
## 95% CI : (0.1182, 0.6162)
## No Information Rate : 0.5333
## P-Value [Acc > NIR] : 0.9657
##
## Kappa : -0.3393
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.3750
## Specificity : 0.2857
## Pos Pred Value : 0.3750
## Neg Pred Value : 0.2857
## Prevalence : 0.5333
## Detection Rate : 0.2000
## Detection Prevalence : 0.5333
## Balanced Accuracy : 0.3304
##
## 'Positive' Class : 1
##
accuracy
## [1] 0.3333333
sensitivity
## [1] 0.375
specificity
## [1] 0.2857143
confusion_mat_hand
## Predicted_Values
## Actual_Values 0 1
## 0 2 5
## 1 5 3
The values of sensitivity, specificity, and
accuracy computed by hand do indeed match the caret output now.
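An alternative (a sketch of the same idea rather than a requirement) is to relevel both factors so that "1" is the first level; with only two levels, confusionMatrix() then treats it as the positive class by default:
actual_relevel <- relevel(actual_value, ref = "1")       # make "1" the first level
predicted_relevel <- relevel(predicted_value, ref = "1")
confusionMatrix(reference = actual_relevel, data = predicted_relevel)
Even so, stating positive = "1" explicitly keeps the intent obvious to anyone reading the code.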