Introduction

Machine learning is widely used to make inferences about new situations from historical data, and there are many algorithms available for doing so; linear regression, logistic regression, decision trees, naive Bayes, K-means, and random forests are among the most commonly used. In practice we rarely settle on a single algorithm when making predictions on data: we often try several and keep the one that predicts best. How do we decide which algorithm works better? Model evaluation metrics help us assess a trained model's accuracy and measure its performance; in particular, they tell us how well the model generalizes to unseen data. By using different metrics for performance evaluation, we can improve the overall predictive power of our model before we roll it out for production on unseen data. Choosing the right metric is therefore critical, and different applications call for different metrics. Let's examine the evaluation metrics used to assess the performance of a machine learning model; this is a crucial step in any data science project, because it estimates how well the model will generalize to future data.

1. Classification Metrics

When the response in a machine learning problem is binary (taking only two values, e.g., 0: failure and 1: success), we use classification models such as logistic regression, decision trees, random forests, XGBoost, and convolutional neural networks. To evaluate these models, we use classification metrics.

1.1. Confusion Matrix (Accuracy, Sensitivity, and Specificity)

A confusion matrix tabulates the predicted classes against the actual classes on a binary test set and is often used to describe the performance of a classification model.
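
In terms of the four cells of the matrix (true positives, true negatives, false positives, and false negatives), the three statistics highlighted in this section are defined in the standard way:

\[ \displaystyle Accuracy= \left( \frac{True\ Positive+True\ Negative}{Total\ Number\ of\ Predictions} \right) \]

\[ \displaystyle Sensitivity= \left( \frac{True\ Positive}{True\ Positive+False\ Negative} \right) \]

\[ \displaystyle Specificity= \left( \frac{True\ Negative}{True\ Negative+False\ Positive} \right) \]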

Let's look at a sample R implementation of the confusion matrix.

# Predict diabetes status (pos/neg) from physiological measurements.
library(mlbench)
library(rsample)
data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35       0 33.6    0.627  50      pos
## 2        1      85       66      29       0 26.6    0.351  31      neg
## 3        8     183       64       0       0 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
## 6        5     116       74       0       0 25.6    0.201  30      neg
summary(PimaIndiansDiabetes) 
##     pregnant         glucose         pressure         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin           mass          pedigree           age        diabetes 
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00   neg:500  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00   pos:268  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00            
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24            
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00            
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00
#Split data into train and test set. 
set.seed(123)
split <- initial_split(PimaIndiansDiabetes, prop = .8) # 80% of the data is used for training and 20% as the test set.
train <- training(split)
test  <- testing(split)
#Construct the Model
model <- glm(diabetes ~ ., data = train, family = binomial(link = "logit"))

#Construct the Confusion Matrix
# Predicted probability of the "pos" class for each test observation
prediction <- predict(model, newdata = test, type = 'response')
# Recode predictions and the reference as 0/1 factors (0 = neg, 1 = pos) so their levels match
pred        <- factor(ifelse(prediction <= 0.5, 0, 1), levels = c(0, 1))
test_actual <- factor(ifelse(test$diabetes == "pos", 1, 0), levels = c(0, 1))
result <- caret::confusionMatrix(pred, test_actual)
result
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 85 27
##          1  9 32
##                                           
##                Accuracy : 0.7647          
##                  95% CI : (0.6894, 0.8294)
##     No Information Rate : 0.6144          
##     P-Value [Acc > NIR] : 5.739e-05       
##                                           
##                   Kappa : 0.4735          
##                                           
##  Mcnemar's Test P-Value : 0.004607        
##                                           
##             Sensitivity : 0.9043          
##             Specificity : 0.5424          
##          Pos Pred Value : 0.7589          
##          Neg Pred Value : 0.7805          
##              Prevalence : 0.6144          
##          Detection Rate : 0.5556          
##    Detection Prevalence : 0.7320          
##       Balanced Accuracy : 0.7233          
##                                           
##        'Positive' Class : 0               
## 

We see the accuracy, sensitivity, and specificity in the confusion matrix output. Note that caret takes the first factor level, 0 (no diabetes), as the positive class here.

  • Accuracy is 0.76: 76% of the test observations are classified correctly.
  • Sensitivity is 0.90, which is the ability of the test to correctly classify an individual as "does not have diabetes" (the positive class, 0).
  • Specificity is 0.54, which is the ability of the test to correctly classify an individual as "has diabetes". The model therefore misclassifies about 46% of the people who really do have the disease.

These numbers can be verified directly from the confusion matrix table, as shown below.
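
As a quick check, the same statistics can be recomputed from the confusion matrix table or pulled out of the caret result object directly:

# Overall accuracy: correct predictions on the diagonal divided by the total
sum(diag(result$table)) / sum(result$table)   # 0.7647
# Accuracy, sensitivity, and specificity are also stored in the result object
result$overall["Accuracy"]
result$byClass[c("Sensitivity", "Specificity")]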

1.2. Precision

When we have a class imbalance, accuracy can become an unreliable measure of performance, so we also need to look at class-specific performance metrics. Precision is one such metric; it is defined as the positive predictive value:

\[ \displaystyle Precision= \left( \frac{True\ Positive}{True\ Positive+\ False\ Positive} \right) \]

1.3. Recall (Sensitivity)

Recall is another important metric: it is the proportion of actual positive cases that are correctly identified.

\[ \displaystyle Recall= \left( \frac{True\ Positive}{True\ Positive+\ False\ Negative} \right) \]

1.4. F1-score

The F1 score combines two important metrics, precision and recall. It is the harmonic mean of precision and recall, and it is particularly useful for imbalanced datasets in binary classification.

\[ \displaystyle F_1\ Score= \left( \frac{2\times Precision\times Recall}{Precision+Recall} \right) \]

  • We can view the confusion table on its own by writing:
cm<-result$table
cm 
##           Reference
## Prediction  0  1
##          0 85 27
##          1  9 32
  • By pulling the byClass element of the resulting confusion matrix object, we can also see the F1 score, precision, and recall; these values are verified by hand after the table below.
metrics<-as.data.frame(result$byClass)
colnames(metrics)<-"metrics"
library(dplyr)
library(kableExtra)
kable(round(metrics,4), caption = "F1-score, Precision and Recall ") %>%
  kable_styling(font_size = 16)
F1-score, Precision and Recall

                        metrics
Sensitivity              0.9043
Specificity              0.5424
Pos Pred Value           0.7589
Neg Pred Value           0.7805
Precision                0.7589
Recall                   0.9043
F1                       0.8252
Prevalence               0.6144
Detection Rate           0.5556
Detection Prevalence     0.7320
Balanced Accuracy        0.7233
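
To see where these values come from, we can recompute precision, recall, and the F1 score by hand from the cells of cm (recall that caret treats class 0 as the positive class here); this is only a quick verification of the formulas above:

# For positive class "0": TP = predicted 0 & actual 0, FP = predicted 0 & actual 1, FN = predicted 1 & actual 0
TP <- cm["0", "0"]
FP <- cm["0", "1"]
FN <- cm["1", "0"]
precision <- TP / (TP + FP)                                 # 0.7589
recall    <- TP / (TP + FN)                                 # 0.9043
f1        <- 2 * precision * recall / (precision + recall)  # 0.8252
c(Precision = precision, Recall = recall, F1 = f1)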

1.5. Receiver Operating Characteristics (ROC) Curve

Measuring the area under the ROC curve is also a very useful way to evaluate a model. The ROC curve shows the performance of a binary classifier as a function of its cut-off threshold: it plots the sensitivity (true positive rate) against the false positive rate for various threshold values.

We write a function that allows us to make predictions at different probability cutoffs and then obtain the accuracy, sensitivity, and specificity of the resulting classifiers.

# A helper that classifies observations as pos/neg for a given probability cutoff
get_logistic_pred = function(mod, data, pos = 1, neg = 0, cut = 0.5) {
  probs = predict(mod, newdata = data, type = "response")
  ifelse(probs > cut, pos, neg)
}
test_pred_10 = get_logistic_pred(model, data = test, pos = "1", neg = "0", cut = 0.1)
test_pred_50 = get_logistic_pred(model, data = test, pos = "1", neg = "0", cut = 0.5)
test_pred_90 = get_logistic_pred(model, data = test, pos = "1", neg = "0", cut = 0.9)

# Cross-tabulate the predictions against the 0/1 reference labels defined earlier
test_tab_10 = table(predicted = test_pred_10, actual = test_actual)
test_tab_50 = table(predicted = test_pred_50, actual = test_actual)
test_tab_90 = table(predicted = test_pred_90, actual = test_actual)

library(caret)
test_con_mat_10 = confusionMatrix(test_tab_10, positive = "1")
test_con_mat_50 = confusionMatrix(test_tab_50, positive = "1")
test_con_mat_90 = confusionMatrix(test_tab_90, positive = "1")
metrics = rbind(
  
  c(test_con_mat_10$overall["Accuracy"], 
    test_con_mat_10$byClass["Sensitivity"], 
    test_con_mat_10$byClass["Specificity"]),
  
  c(test_con_mat_50$overall["Accuracy"], 
    test_con_mat_50$byClass["Sensitivity"], 
    test_con_mat_50$byClass["Specificity"]),
  
  c(test_con_mat_90$overall["Accuracy"], 
    test_con_mat_90$byClass["Sensitivity"], 
    test_con_mat_90$byClass["Specificity"])

)

rownames(metrics) = c("c = 0.10", "c = 0.50", "c = 0.90")
metrics
##           Accuracy Sensitivity Specificity
## c = 0.10 0.5751634   0.9661017   0.3297872
## c = 0.50 0.7647059   0.5423729   0.9042553
## c = 0.90 0.6405229   0.1016949   0.9787234
library(pROC)
test_prob = predict(model, newdata = test, type = "response")
test_roc = roc(test$diabetes ~ test_prob, plot = TRUE, print.auc = TRUE)

The larger the area under the curve (AUC), the better the model separates the two classes; a high AUC is obtained when both sensitivity and specificity are high across thresholds, while an AUC of 0.5 is no better than random guessing.
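
If we need the AUC as a plain number, for example to compare several candidate models, it can be extracted from the roc object returned by pROC:

# Numeric area under the curve of the fitted ROC object
auc(test_roc)
# The same value is also stored inside the object
test_roc$auc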

1.6. Log Loss

Log loss quantifies the quality of a classifier's predicted probabilities by penalizing confident but wrong classifications. Its value reflects how much the predicted probabilities diverge from the actual labels.
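
For a binary response coded as 0/1 with predicted probability p_i for observation i, log loss is the negative average log-likelihood, so smaller values indicate better-calibrated predictions:

\[ \displaystyle LogLoss= -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i\log(p_i)+(1-y_i)\log(1-p_i) \right] \]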

library(MLmetrics)
# Compute log loss on the test set using the predicted probabilities and a numeric 0/1 response
LogLoss(y_pred = test_prob, y_true = ifelse(test$diabetes == "pos", 1, 0))

Conclusion

To conclude, in this article we examined some of the popular machine learning metrics, both regression-related metrics and classification metrics, that are used for evaluating the performance of regression and classification models. We also discussed why choosing the right metric matters for obtaining good predictions.
