library(caret)   # caret provides confusionMatrix(); loading it also loads ggplot2 and lattice
Loading required package: ggplot2
Loading required package: lattice
# library(e1071)
Logistic regression is a supervised learning model used for classification.
Logistic regression is used when the independent variables (x) are continuous or categorical and the dependent variable (y) is categorical.
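As a quick illustration of where predicted classes come from, a logistic regression can be fit with glm() and its probabilities thresholded into 0/1 labels. This is only a sketch; the built-in mtcars data, the am ~ mpg formula, and the 0.5 cutoff are assumptions for illustration, not the data used below.
# a minimal sketch: logistic regression, then thresholded class predictions
fit  <- glm(am ~ mpg, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")            # predicted probability of am = 1
pred_class <- ifelse(prob > 0.5, 1, 0)             # threshold probabilities into 0/1 classes
table(Actual = mtcars$am, Predicted = pred_class)  # actual vs predicted counts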
Confusion matrix: A confusion matrix cross-tabulates the actual data against the predicted data. For a problem with n classes it is an n x n matrix.
One can often get confused in understanding the classes in this matrix (True Positive, True Negative, False Positive, False Negative), hence the name confusion matrix. Here, positive is usually the event of interest (which could be a “bad” event such as getting cancer or going bankrupt) and indicates presence of the condition; negative indicates absence of the event of interest.
The confusion matrix evaluates the predictions made on the test data, i.e., it counts both the correct and the incorrect predictions.
This recipe demonstrates how to obtain a classification confusion matrix.
Confusion matrix: A confusion matrix is a technique for summarizing the performance of a classification algorithm.
The numbers of correct and incorrect predictions are summarized with count values, broken down by each combination of predicted and actual class.
It gives you insight not only into the errors made by your classifier but, more importantly, into the types of errors being made.
TN:- We predicted class 0 and the actual class is 0. This is a correct classification.
TP:- We predicted class 1 and the actual class is 1. This is a correct classification.
FP:- We predicted class 1 but the actual class is 0. This is a misclassification.
FN:- We predicted class 0 but the actual class is 1. This is a misclassification.
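With actual values in rows and predicted values in columns (the orientation produced by table(Actual, Predicted) below), the four cells sit as follows:
              Predicted 0   Predicted 1
Actual 0          TN            FP
Actual 1          FN            TP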
Let's assume 1 is the event of interest.
You will have the actual values (the truth) from the raw data, or your testing data.
The predictions will come from your model (logistic, OLS,…). These are the predicted values.
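For this recipe, the actual and predicted values are two short 0/1 vectors. The definitions below reconstruct them from the factor output printed further down; this is an assumption about how the original objects were created, but the vectors reproduce the same counts.
# actual (true) labels and the model's predicted labels for 15 observations
actual_value    <- c(1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1)
predicted_value <- c(0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0)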
confusion_mat = table(Actual_Values    = actual_value,
                      Predicted_Values = predicted_value)
confusion_mat
             Predicted_Values
Actual_Values 0 1
            0 2 5
            1 5 3
accuracy    <- (2+3) / (2+3+5+5)   # (TP + TN) / total observations
sensitivity <- 3 / (3+5)           # TP / (TP + FN), with 1 as the event of interest
specificity <- 2 / (2+5)           # TN / (TN + FP)
accuracy
[1] 0.3333333
sensitivity
[1] 0.375
specificity
[1] 0.2857143
confusion_mat_hand <- as.matrix(confusion_mat)   # keep the hand-built table for later comparison
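Before switching to caret, a small helper can compute the same three metrics from any 2 x 2 table produced by table(). This is only a sketch; it assumes the rows hold the actual values, the columns the predicted values, and that "1" is the positive class.
metrics_from_table <- function(tab, positive = "1") {
  negative <- setdiff(rownames(tab), positive)
  TP <- tab[positive, positive]   # actual positive, predicted positive
  TN <- tab[negative, negative]   # actual negative, predicted negative
  FP <- tab[negative, positive]   # actual negative, predicted positive
  FN <- tab[positive, negative]   # actual positive, predicted negative
  c(accuracy    = (TP + TN) / sum(tab),
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP))
}
metrics_from_table(confusion_mat)   # should match the hand calculations above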
Let's use the confusionMatrix() function from the caret package.
?confusionMatrix # Calculates a cross-tabulation of observed and predicted classes with associated statistics.
Help on topic 'confusionMatrix' was found in the following packages:
Package Library
ModelMetrics /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
caret /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
Using the first match ...
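Because the name is found in both ModelMetrics and caret, you can also qualify the call with the package name so that caret's version is used for certain; the arguments are the same as in the unqualified call below.
# explicit namespace: caret's confusionMatrix rather than ModelMetrics'
caret::confusionMatrix(reference = factor(actual_value),
                       data      = factor(predicted_value))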
confusionMatrix(reference = factor(actual_value),
                data      = factor(predicted_value))
Confusion Matrix and Statistics
          Reference
Prediction 0 1
         0 2 5
         1 5 3
Accuracy : 0.3333
95% CI : (0.1182, 0.6162)
No Information Rate : 0.5333
P-Value [Acc > NIR] : 0.9657
Kappa : -0.3393
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.2857
Specificity : 0.3750
Pos Pred Value : 0.2857
Neg Pred Value : 0.3750
Prevalence : 0.4667
Detection Rate : 0.1333
Detection Prevalence : 0.4667
Balanced Accuracy : 0.3304
'Positive' Class : 0
accuracy
[1] 0.3333333
sensitivity
[1] 0.375
specificity
[1] 0.2857143
confusion_mat_hand
             Predicted_Values
Actual_Values 0 1
            0 2 5
            1 5 3
If you fail to declare the event of interest, the computer will still generate an output, but the definitions may be applied incorrectly. That is what happened above: with 0 treated as the positive class, caret computed sensitivity as 2/(2+5) = 0.2857 and specificity as 3/(3+5) = 0.375, the reverse of the hand calculation.
When you read up on the caret::confusionMatrix function, you find that the positive argument takes an optional character string for the factor level that corresponds to a “positive” result (if that makes sense for your data). If there are only two factor levels, the first level will be used as the “positive” result.
factor(actual_value)
[1] 1 1 1 0 0 1 0 0 0 1 1 1 0 0 1
Levels: 0 1
factor(predicted_value)
[1] 0 0 1 0 1 1 1 0 1 0 0 1 1 1 0
Levels: 0 1
In our data, it is taking 0 as the positive event. So always declare the event of interest explicitly with the positive argument, as below.
confusionMatrix(reference = factor(actual_value),
                data      = factor(predicted_value),
                positive  = "1")
Confusion Matrix and Statistics
          Reference
Prediction 0 1
         0 2 5
         1 5 3
Accuracy : 0.3333
95% CI : (0.1182, 0.6162)
No Information Rate : 0.5333
P-Value [Acc > NIR] : 0.9657
Kappa : -0.3393
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.3750
Specificity : 0.2857
Pos Pred Value : 0.3750
Neg Pred Value : 0.2857
Prevalence : 0.5333
Detection Rate : 0.2000
Detection Prevalence : 0.5333
Balanced Accuracy : 0.3304
'Positive' Class : 1
accuracy
[1] 0.3333333
sensitivity
[1] 0.375
specificity
[1] 0.2857143
confusion_mat_hand
             Predicted_Values
Actual_Values 0 1
            0 2 5
            1 5 3
The values of sensitivity, specificity, and accuracy do indeed match now.
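As a closing tip, the object returned by confusionMatrix() stores these statistics, so they can also be extracted programmatically instead of being read off the printout; a short sketch using the same data:
cm <- confusionMatrix(reference = factor(actual_value),
                      data      = factor(predicted_value),
                      positive  = "1")
cm$overall["Accuracy"]      # 0.3333
cm$byClass["Sensitivity"]   # 0.375
cm$byClass["Specificity"]   # 0.2857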