In this blog, let's review some of the performance metrics used for evaluating binary classifiers and ways to compare multiple classifier models. The only prerequisite is a basic understanding of how binary classifiers work and some of the algorithms used. Regression and classification algorithms fall into two broad categories: parametric and non-parametric models.

Parametric models are algorithms that can be represented as a function with a fixed set of coefficient parameters, for example Logistic Regression and Naive Bayes.

Non-parametric models have no such fixed representation and instead follow simpler, data-driven rules. For example, KNN looks at the K nearest points to determine the class label, assigning whichever label occurs most frequently in that neighbourhood. Decision Trees and SVMs are other examples of non-parametric models.
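As a quick illustration of the non-parametric idea, here is a minimal KNN sketch using the class package on synthetic data (the points, labels and k value are made up purely for illustration):

library(class)

# synthetic training data: 20 points in 2 dimensions with two class labels
set.seed(42)
x_train <- matrix(rnorm(40), ncol = 2)
y_train <- factor(rep(c(0, 1), each = 10))

# one unseen point: its label is the majority vote among its k = 5 nearest neighbours
x_new <- matrix(c(0.1, -0.2), ncol = 2)
knn(train = x_train, test = x_new, cl = y_train, k = 5)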

Let's load a sample dataset that I downloaded from Kaggle. The dataset contains a set of attributes about patients and a binary target label: 0 means the patient does not have heart disease and 1 means the patient does.

df <- read.csv("C:\\Users\\Charls\\Documents\\CunyMSDS\\Data621\\blog2\\heart-disease-uci\\heart.csv")[-1]

head(df)
##   sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
## 1   1  3      145  233   1       0     150     0     2.3     0  0    1      1
## 2   1  2      130  250   0       1     187     0     3.5     0  0    2      1
## 3   0  1      130  204   0       0     172     0     1.4     2  0    2      1
## 4   1  1      120  236   0       1     178     0     0.8     2  0    2      1
## 5   0  0      120  354   0       1     163     1     0.6     2  0    2      1
## 6   1  0      140  192   0       1     148     0     0.4     1  0    1      1

Balanced or Imbalanced?

Now, let's check whether the class labels are balanced. This is crucial because the performance metrics we use to gauge the goodness of the classifier depend on whether the class labels are equally distributed. If the dataset has many more negative cases than positive cases, it is biased towards the negative label.

bp <- barplot(table(df$target), beside = TRUE, main = "Total observations",
              col = c("lightblue", "mistyrose"),
              xlab = "Label class", names = c("No Heart Disease", "Heart Disease"),
              ylab = "# of Observations", legend = c("No Heart Disease", "Heart Disease"),
              args.legend = list(title = "Legend", x = "topright", cex = .7), ylim = c(0, 300))
# add the raw counts on top of each bar
text(bp, 0, round(table(df$target), 1), cex = 1, pos = 3)
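The same check can also be done numerically; a quick sketch using prop.table to see the class proportions:

table(df$target)
prop.table(table(df$target))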

Confusion Matrix

A confusion matrix is a performance measurement for classification problems where the output can be two or more classes. For a binary classifier it is a table with 4 different combinations of predicted and actual values.

The 4 combinations are True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN).

Zero and one are interchangeable in the matrix, so the table can be laid out in two ways; both representations are equivalent, which can be confusing for beginners.

Type 1 and Type 2 errors (false positives and false negatives, respectively) are the ones we should try to minimise while building classifiers.
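To make the four cells concrete, here is a minimal sketch using made-up predicted and actual vectors and base R's table():

# toy vectors, purely illustrative
actual    <- factor(c(1, 1, 1, 0, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 1, 0, 1, 0), levels = c(0, 1))

# rows are predictions, columns are actual labels;
# the diagonal holds TN and TP, the off-diagonal cells hold FN and FP
table(Predicted = predicted, Actual = actual)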

Accuracy

It is measured by dividing the total number of correct predictions by the total number of observations, N.

\(Accuracy = \frac{TP + TN}{N}\)

Accuracy is a reasonable measure of a classifier's goodness when the labels are balanced. However, when the labels are imbalanced we should not rely on it completely; we should also look at other metrics like precision and recall (sensitivity).
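As a small sketch, accuracy can be computed directly from hypothetical confusion-matrix counts:

# hypothetical counts, purely illustrative
TP <- 28; TN <- 19; FP <- 9; FN <- 5
N  <- TP + TN + FP + FN   # total number of observations

accuracy <- (TP + TN) / N
accuracy                  # about 0.77 for these counts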

False Positive (Type 1 Error) and False Negative (Type 2 Error)

\(False Positive Rate = \frac{FP}{FP + TN}\)

\(False Negative Rate = \frac{FN}{FN + TP}\)
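Both rates are simple ratios of the same counts; a quick sketch with the same hypothetical numbers:

TP <- 28; TN <- 19; FP <- 9; FN <- 5

fpr <- FP / (FP + TN)   # false positive rate
fnr <- FN / (FN + TP)   # false negative rate
c(FPR = fpr, FNR = fnr)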

Precision

Precision is the percentage of correctly predicted positive values out of the total number of predicted positives. It is a measure of the classifier's Type 1 errors: the closer it is to 1, the more resilient the classifier is to Type 1 errors (false positives).

\(Precision = \frac{True Positives}{Total Predicted Positives}\)

\(Precision = \frac{TP}{TP + FP}\)
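A one-line sketch with hypothetical counts:

TP <- 28; FP <- 9
precision <- TP / (TP + FP)   # correct positives out of everything predicted positive
precision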

Recall, Sensitivity

Recall is the percentage of correctly predicted positive values out of the total number of actual positive values. It is a measure of the classifier's Type 2 errors: the closer it is to 1, the more resilient the classifier is to Type 2 errors (false negatives).

\(Recall = \frac{True Positives}{Total Actual Positives}\)

\(Recall = \frac{TP}{TP + FN}\)
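And the matching sketch for recall, again with hypothetical counts:

TP <- 28; FN <- 5
recall <- TP / (TP + FN)   # correct positives out of everything actually positive
recall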

Trade-off (Accuracy vs. Precision vs. Recall)

Our goal is to reduce both Type 1 and Type 2 errors. If the test set is balanced, accuracy can be used to evaluate a classifier's goodness. However, if it is not balanced, we cannot rely on accuracy alone. Measures like Precision and Recall focus on Type 1 and Type 2 errors respectively, and we have to pick which one to reduce based on the classifier's requirements.

Let's walk through an example of each of the Type 1 and Type 2 scenarios; that will help decide which metric to prioritise when evaluating a classifier's performance.

Type 1 error (false positive): predicting positive when it is actually negative. E.g. flagging an email as spam when it is not spam; the impact is that the customer misses the email. In this case, Precision is more significant than Recall for evaluating the classifier's performance.

Type 2 error (false negative): predicting negative when it is actually positive. E.g. diagnosing a patient as not having cancer when the patient does have cancer; the impact is that the patient goes untreated, which can be fatal. In this case, Recall is more significant than Precision for evaluating the classifier's performance.

Once again, there are scenarios where Precision is more important and others where Recall is more important. When the scenario makes the Type 2 error (false negative) the costlier one, as in the medical example above, you should focus on improving Recall.

If we have to consider both Precision and Recall simultaneously, we can use another metric called the F1 score. The closer the F1 score is to 1, the better the model balances precision and recall.

\(F1 Score = 2 * \frac{Precision * Recall }{Precision + Recall}\)
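A quick sketch of the calculation, using hypothetical precision and recall values:

precision <- 0.76
recall    <- 0.85

f1 <- 2 * (precision * recall) / (precision + recall)
f1   # the harmonic mean of precision and recall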

Specificity and False Positive Rate

Specificity is the true negative rate, the complement of the false positive rate used in ROC (Receiver Operating Characteristic) analysis. The closer it is to 1, the more resilient the model is to false positives.

\(Specificity = 1-FPR\)
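As a sketch with hypothetical counts, specificity can be computed either directly from TN and FP or as 1 - FPR, and the two agree:

TN <- 19; FP <- 9

specificity   <- TN / (TN + FP)      # true negative rate
one_minus_fpr <- 1 - FP / (FP + TN)  # same value, written as 1 - FPR
c(specificity, one_minus_fpr)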

ROC and AUC

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

\(TPR = \frac{TP}{TP + FN}\)

False Positive Rate (FPR) is defined as follows:

\(FPR = \frac{FP}{FP + TN}\)

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
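A rough sketch of this effect, sweeping a few thresholds over made-up scores and labels:

# synthetic labels and scores, purely illustrative
set.seed(1)
actual <- rbinom(100, 1, 0.5)
scores <- runif(100)

for (t in c(0.8, 0.5, 0.2)) {
  pred <- as.integer(scores > t)
  TP <- sum(pred == 1 & actual == 1); FP <- sum(pred == 1 & actual == 0)
  FN <- sum(pred == 0 & actual == 1); TN <- sum(pred == 0 & actual == 0)
  # lowering the threshold pushes both TPR and FPR up
  cat(sprintf("threshold = %.1f  TPR = %.2f  FPR = %.2f\n",
              t, TP / (TP + FN), FP / (FP + TN)))
}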

AUC stands for “Area Under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0,0) to (1,1). The higher the AUC, the better the model; it is often used to compare different models.

Let's take a stab at building a logistic regression model on the dataset above.

library(caTools)

set.seed(123)
split = sample.split(df$target, SplitRatio = 0.8)
df_train = subset(df, split == TRUE)
df_test = subset(df, split == FALSE)

Let's build the logistic regression model and plot the ROC curve, marking the best threshold.

library(pROC)

log_classifer <- glm(target ~ ., data = df_train, family = "binomial")
lr_pred_prob <- predict(log_classifer, newdata = df_test, type = "response")
# use 0.5 as the classification threshold for the logistic regression classifier
lr_pred_class <- ifelse(lr_pred_prob > 0.5, 1, 0)
roc_log <- roc(df_test$target, lr_pred_class)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot.roc(roc_log,
         main = "Logistic Regression | ROC Curve", percent = TRUE, of = "thresholds", # compute AUC (of threshold)
         thresholds = "best", # select the (best) threshold
         print.auc = TRUE,
         print.thres = "best")
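One thing worth noting: the curve above is built from the hard 0/1 class predictions (lr_pred_class). ROC curves are usually drawn from the predicted probabilities so that every possible threshold contributes a point on the curve; a sketch of that version (roc_log_prob is just an illustrative name):

# ROC built from the predicted probabilities rather than the 0/1 labels
roc_log_prob <- roc(df_test$target, lr_pred_prob)
plot.roc(roc_log_prob, main = "Logistic Regression | ROC (probabilities)",
         print.auc = TRUE, print.thres = "best")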

table(df_test$target)
## 
##  0  1 
## 28 33

Let's print the confusion matrix and examine the performance metrics. Note that the test data is reasonably balanced, so accuracy is an appropriate metric here. But we can still look at precision and recall with a view to reducing Type 1 and Type 2 errors. Since the positive label denotes a patient with heart disease, we should focus on Recall rather than Precision. Recall (Sensitivity) is 0.85 and Precision (Pos Pred Value) is 0.76.

Please note the positive argument in the confusionMatrix call. It decides which label is treated as the positive class. In some cases we may want to treat 0 as the positive label; in that case we would pass positive = "0" instead (a quick sketch follows the output below).

library(caret)

predict.lr_result <- confusionMatrix(as.factor(lr_pred_class), as.factor(df_test$target), positive = "1")
predict.lr_result
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19  5
##          1  9 28
##                                          
##                Accuracy : 0.7705         
##                  95% CI : (0.645, 0.8685)
##     No Information Rate : 0.541          
##     P-Value [Acc > NIR] : 0.0001784      
##                                          
##                   Kappa : 0.5328         
##                                          
##  Mcnemar's Test P-Value : 0.4226781      
##                                          
##             Sensitivity : 0.8485         
##             Specificity : 0.6786         
##          Pos Pred Value : 0.7568         
##          Neg Pred Value : 0.7917         
##              Prevalence : 0.5410         
##          Detection Rate : 0.4590         
##    Detection Prevalence : 0.6066         
##       Balanced Accuracy : 0.7635         
##                                          
##        'Positive' Class : 1              
## 
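As noted above, if we wanted to treat 0 (no heart disease) as the positive class instead, the same call with positive = "0" would flip which cells drive Sensitivity and Pos Pred Value; a quick sketch:

confusionMatrix(as.factor(lr_pred_class), as.factor(df_test$target), positive = "0")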

Plotting ROC curves for multiple models

It is interesting to plot two ROC curves on one chart to visually compare the goodness of the models. Let's build another classifier using Naive Bayes and plot both curves together.

library(e1071)

nb_model <- naiveBayes(as.factor(target) ~ ., data = df_train)

nm_pred_prob <- predict(nb_model, newdata = df_test, type = "raw")    # class probabilities
nm_pred_class <- predict(nb_model, newdata = df_test, type = "class") # predicted labels

roc_nb <- roc(as.factor(df_test$target), nm_pred_prob[, 2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot.roc(roc_log, print.auc = TRUE, col = "red")
plot.roc(roc_nb, add = TRUE, print.auc = TRUE, col = "green", print.auc.y = .4)

legend("bottomright", legend=c("Model-1 - Logistic regression", "Model-2 - Naive Bayes"),
col=c("red", "green"), lwd=2)
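Beyond the visual comparison, the AUC values can also be compared directly with pROC's auc() helper; a quick sketch:

auc(roc_log)   # AUC for the logistic regression model
auc(roc_nb)    # AUC for the Naive Bayes model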