Introduction

Machine learning is widely used to make inferences about new situations from historical data, and there are many algorithms available for doing so; linear regression, logistic regression, decision trees, naive Bayes, K-means, and random forests are among the most commonly used. In practice we rarely settle on a single algorithm when making predictions on data: we often try several and keep the one that predicts best. How do we decide which algorithm works better? Model evaluation metrics help us assess a trained model's accuracy and measure its performance; in particular, they tell us how well the model generalizes to unseen data. By using different metrics for performance evaluation, we can improve the overall predictive power of our model before we roll it out for production on unseen data. Choosing the right metric is therefore critical, and different applications call for different metrics. Let's examine the evaluation metrics used to assess the performance of a machine learning model; this is a crucial step in any data science project, because it estimates how well the model will generalize to future data.

1. Classification Metrics

When the response in a machine learning problem is binary (taking only two values, e.g., 0: failure and 1: success), we use classification models such as logistic regression, decision trees, random forests, XGBoost, and convolutional neural networks. To evaluate these models, we use classification metrics.

1.1. Confusion Matrix (Accuracy, Sensitivity, and Specificity)

A confusion matrix tabulates the predicted classes against the actual classes on a binary test set and is often used to describe the performance of a classification model.
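
In terms of the four cells of the matrix (true positives, true negatives, false positives, and false negatives), the three statistics highlighted in this section are defined in the standard way:

\[ \displaystyle Accuracy= \left( \frac{True\ Positive+True\ Negative}{Total\ Number\ of\ Predictions} \right) \]

\[ \displaystyle Sensitivity= \left( \frac{True\ Positive}{True\ Positive+False\ Negative} \right) \]

\[ \displaystyle Specificity= \left( \frac{True\ Negative}{True\ Negative+False\ Positive} \right) \]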

Let's look at a sample R implementation of the confusion matrix.

# Predict diabetes status (pos/neg) from physiological measurements.
library(mlbench)
library(rsample)
data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35       0 33.6    0.627  50      pos
## 2        1      85       66      29       0 26.6    0.351  31      neg
## 3        8     183       64       0       0 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
## 6        5     116       74       0       0 25.6    0.201  30      neg
summary(PimaIndiansDiabetes) 
##     pregnant         glucose         pressure         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin           mass          pedigree           age        diabetes 
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00   neg:500  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00   pos:268  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00            
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24            
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00            
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00
#Split data into train and test set. 
set.seed(123)
split <- initial_split(PimaIndiansDiabetes, prop = .8) # 80% of the data is used for training and 20% as the test set.
train <- training(split)
test  <- testing(split)
#Construct the Model
model <- glm(diabetes ~ ., data = train, family = binomial(link = "logit"))

#Construct the Confusion Matrix
# Predicted probability of the "pos" class for each test observation
prediction <- predict(model, newdata = test, type = 'response')
# Recode predictions and the reference as 0/1 factors (0 = neg, 1 = pos) so their levels match
pred        <- factor(ifelse(prediction <= 0.5, 0, 1), levels = c(0, 1))
test_actual <- factor(ifelse(test$diabetes == "pos", 1, 0), levels = c(0, 1))
result <- caret::confusionMatrix(pred, test_actual)
result
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 85 27
##          1  9 32
##                                           
##                Accuracy : 0.7647          
##                  95% CI : (0.6894, 0.8294)
##     No Information Rate : 0.6144          
##     P-Value [Acc > NIR] : 5.739e-05       
##                                           
##                   Kappa : 0.4735          
##                                           
##  Mcnemar's Test P-Value : 0.004607        
##                                           
##             Sensitivity : 0.9043          
##             Specificity : 0.5424          
##          Pos Pred Value : 0.7589          
##          Neg Pred Value : 0.7805          
##              Prevalence : 0.6144          
##          Detection Rate : 0.5556          
##    Detection Prevalence : 0.7320          
##       Balanced Accuracy : 0.7233          
##                                           
##        'Positive' Class : 0               
## 

We see the accuracy, sensitivity, and specificity in the confusion matrix output. Note that caret takes the first factor level, 0 (no diabetes), as the positive class here.

  • Accuracy is 0.76: 76% of the test observations are classified correctly.
  • Sensitivity is 0.90, which is the ability of the test to correctly classify an individual as "does not have diabetes" (the positive class, 0).
  • Specificity is 0.54, which is the ability of the test to correctly classify an individual as "has diabetes". The model therefore misclassifies about 46% of the people who really do have the disease.

These numbers can be verified directly from the confusion matrix table, as shown below.
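
As a quick check, the same statistics can be recomputed from the confusion matrix table or pulled out of the caret result object directly:

# Overall accuracy: correct predictions on the diagonal divided by the total
sum(diag(result$table)) / sum(result$table)   # 0.7647
# Accuracy, sensitivity, and specificity are also stored in the result object
result$overall["Accuracy"]
result$byClass[c("Sensitivity", "Specificity")]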

1.2. Precision

When we have a class imbalance, accuracy can become an unreliable measure of performance, so we also need to look at class-specific performance metrics. Precision is one such metric; it is defined as the positive predictive value:

\[ \displaystyle Precision= \left( \frac{True\ Positive}{True\ Positive+\ False\ Positive} \right) \]

1.3. Recall (Sensitivity)

Recall is another important metric: it is the proportion of actual positive cases that are correctly identified.

\[ \displaystyle Recall= \left( \frac{True\ Positive}{True\ Positive+\ False\ Negative} \right) \]

1.4. F1-score

The F1 score combines two important metrics, precision and recall. It is the harmonic mean of precision and recall, and it is particularly useful for imbalanced datasets in binary classification.

\[ \displaystyle F_1\ Score= \left( \frac{2\times Precision\times Recall}{Precision+Recall} \right) \]

  • We can view the confusion table on its own by writing:
cm<-result$table
cm 
##           Reference
## Prediction  0  1
##          0 85 27
##          1  9 32
  • By pulling the byClass element of the resulting confusion matrix object, we can also see the F1 score, precision, and recall; these values are verified by hand after the table below.
metrics<-as.data.frame(result$byClass)
colnames(metrics)<-"metrics"
library(dplyr)
library(kableExtra)
kable(round(metrics,4), caption = "F1-score, Precision and Recall ") %>%
  kable_styling(font_size = 16)
F1-score, Precision and Recall

                        metrics
Sensitivity              0.9043
Specificity              0.5424
Pos Pred Value           0.7589
Neg Pred Value           0.7805
Precision                0.7589
Recall                   0.9043
F1                       0.8252
Prevalence               0.6144
Detection Rate           0.5556
Detection Prevalence     0.7320
Balanced Accuracy        0.7233
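
To see where these values come from, we can recompute precision, recall, and the F1 score by hand from the cells of cm (recall that caret treats class 0 as the positive class here); this is only a quick verification of the formulas above:

# For positive class "0": TP = predicted 0 & actual 0, FP = predicted 0 & actual 1, FN = predicted 1 & actual 0
TP <- cm["0", "0"]
FP <- cm["0", "1"]
FN <- cm["1", "0"]
precision <- TP / (TP + FP)                                 # 0.7589
recall    <- TP / (TP + FN)                                 # 0.9043
f1        <- 2 * precision * recall / (precision + recall)  # 0.8252
c(Precision = precision, Recall = recall, F1 = f1)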

1.5. Receiver Operating Characteristics (ROC) Curve

Measuring the area under the ROC curve is also a very useful way to evaluate a model. The ROC curve shows the performance of a binary classifier as a function of its cut-off threshold: it plots the sensitivity (true positive rate) against the false positive rate for various threshold values.

We write a function that allows us to make predictions at different probability cutoffs and then obtain the accuracy, sensitivity, and specificity of the resulting classifiers.

# A helper that classifies observations as pos/neg for a given probability cutoff
get_logistic_pred = function(mod, data, pos = 1, neg = 0, cut = 0.5) {
  probs = predict(mod, newdata = data, type = "response")
  ifelse(probs > cut, pos, neg)
}
test_pred_10 = get_logistic_pred(model, data = test, pos = "1", neg = "0", cut = 0.1)
test_pred_50 = get_logistic_pred(model, data = test, pos = "1", neg = "0", cut = 0.5)
test_pred_90 = get_logistic_pred(model, data = test, pos = "1", neg = "0", cut = 0.9)

# Cross-tabulate the predictions against the 0/1 reference labels defined earlier
test_tab_10 = table(predicted = test_pred_10, actual = test_actual)
test_tab_50 = table(predicted = test_pred_50, actual = test_actual)
test_tab_90 = table(predicted = test_pred_90, actual = test_actual)

library(caret)
test_con_mat_10 = confusionMatrix(test_tab_10, positive = "1")
test_con_mat_50 = confusionMatrix(test_tab_50, positive = "1")
test_con_mat_90 = confusionMatrix(test_tab_90, positive = "1")
metrics = rbind(
  
  c(test_con_mat_10$overall["Accuracy"], 
    test_con_mat_10$byClass["Sensitivity"], 
    test_con_mat_10$byClass["Specificity"]),
  
  c(test_con_mat_50$overall["Accuracy"], 
    test_con_mat_50$byClass["Sensitivity"], 
    test_con_mat_50$byClass["Specificity"]),
  
  c(test_con_mat_90$overall["Accuracy"], 
    test_con_mat_90$byClass["Sensitivity"], 
    test_con_mat_90$byClass["Specificity"])

)

rownames(metrics) = c("c = 0.10", "c = 0.50", "c = 0.90")
metrics
##           Accuracy Sensitivity Specificity
## c = 0.10 0.5751634   0.9661017   0.3297872
## c = 0.50 0.7647059   0.5423729   0.9042553
## c = 0.90 0.6405229   0.1016949   0.9787234
library(pROC)
test_prob = predict(model, newdata = test, type = "response")
test_roc = roc(test$diabetes ~ test_prob, plot = TRUE, print.auc = TRUE)

The larger the area under the curve (AUC), the better the model separates the two classes; a high AUC is obtained when both sensitivity and specificity are high across thresholds, while an AUC of 0.5 is no better than random guessing.
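
If we need the AUC as a plain number, for example to compare several candidate models, it can be extracted from the roc object returned by pROC:

# Numeric area under the curve of the fitted ROC object
auc(test_roc)
# The same value is also stored inside the object
test_roc$auc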

1.6. Log Loss

Log loss quantifies the quality of a classifier's predicted probabilities by penalizing confident but wrong classifications. Its value reflects how much the predicted probabilities diverge from the actual labels.
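
For a binary response coded as 0/1 with predicted probability p_i for observation i, log loss is the negative average log-likelihood, so smaller values indicate better-calibrated predictions:

\[ \displaystyle LogLoss= -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i\log(p_i)+(1-y_i)\log(1-p_i) \right] \]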

library(MLmetrics)
# Compute log loss on the test set using the predicted probabilities and a numeric 0/1 response
LogLoss(y_pred = test_prob, y_true = ifelse(test$diabetes == "pos", 1, 0))

Conclusion

To conclude, in this article we examined some of the popular machine learning metrics, both regression-related metrics and classification metrics, that are used for evaluating the performance of regression and classification models. We also discussed why choosing the right metric matters for obtaining good predictions.
