Machine learning has become very popular. We use it to make inferences about new situations from historical data, and there are many algorithms to choose from: Linear Regression, Logistic Regression, Decision Tree, Naive Bayes, K-Means, and Random Forest are among the most commonly used. We rarely settle on a single algorithm when making predictions on data; often we try several and keep the one that predicts best. How do we decide which algorithm works better? Model evaluation metrics answer this question: they measure the performance of a trained model and tell us how well it generalizes to unseen data. By using different metrics for performance evaluation, we can improve the overall predictive power of our model before we roll it out for production on unseen data. Choosing the right metric is critical, and different applications call for different metrics. Let's examine the evaluation metrics for assessing the performance of a machine learning model, a crucial step of any data science project, because it estimates how well the model will generalize to future data.
When the response in a machine learning problem is binary (it takes only two values, e.g. 0 = failure and 1 = success), we use classification models such as logistic regression, decision trees, random forests, XGBoost, or convolutional neural networks. To evaluate these models, we use classification metrics.
A confusion matrix summarizes the prediction results of a binary classifier and is often used to describe the performance of a classification model: it counts the true positives, false positives, true negatives, and false negatives on the test data.
Let's look at a sample R implementation of the confusion matrix.
# Predict diabetes from diagnostic measurements (Pima Indians Diabetes data).
library(mlbench)
library(rsample)
data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 0 33.6 0.627 50 pos
## 2 1 85 66 29 0 26.6 0.351 31 neg
## 3 8 183 64 0 0 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 0 0 25.6 0.201 30 neg
summary(PimaIndiansDiabetes)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin mass pedigree age diabetes
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00 neg:500
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00 pos:268
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
#Split data into train and test set.
set.seed(123)
split <- initial_split(PimaIndiansDiabetes, prop = .8) # 80% of the data for training, 20% for the test set.
train <- training(split)
test <- testing(split)
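Since accuracy can be misleading when the classes are unbalanced (more on this below), it is worth checking how the outcome is distributed in the two sets. A quick check with base R:
#Check the class balance of the outcome in the training and test sets.
prop.table(table(train$diabetes))
prop.table(table(test$diabetes))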
#Construct the Model
model <- glm(diabetes ~ ., data = train, family = binomial(link = "logit"))
#Construct the Confusion Matrix
prediction <- predict(model, newdata = test, type = 'response')
#Code the predictions and the actual labels as 0 (neg) / 1 (pos) so their factor levels match.
pred <- factor(ifelse(prediction <= 0.5, 0, 1))
actual <- factor(ifelse(test$diabetes == "neg", 0, 1))
result <- caret::confusionMatrix(pred, actual)
result
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 27
## 1 9 32
##
## Accuracy : 0.7647
## 95% CI : (0.6894, 0.8294)
## No Information Rate : 0.6144
## P-Value [Acc > NIR] : 5.739e-05
##
## Kappa : 0.4735
##
## Mcnemar's Test P-Value : 0.004607
##
## Sensitivity : 0.9043
## Specificity : 0.5424
## Pos Pred Value : 0.7589
## Neg Pred Value : 0.7805
## Prevalence : 0.6144
## Detection Rate : 0.5556
## Detection Prevalence : 0.7320
## Balanced Accuracy : 0.7233
##
## 'Positive' Class : 0
##
The confusion matrix output reports the overall Accuracy together with the Sensitivity and Specificity.
When we have a class imbalance, accuracy becomes an unreliable measure of performance, so we also need to look at class-specific metrics. Precision is one such metric: it is the positive predictive value, the proportion of predicted positives that are actually positive.
\[ \displaystyle Precision= \left( \frac{True\ Positive}{True\ Positive+\ False\ Positive} \right) \]
Recall is another important metric: it is the proportion of actual positive cases that are correctly identified.
\[ \displaystyle Recall= \left( \frac{True\ Positive}{True\ Positive+\ False\ Negative} \right) \]
The F1 score combines these two metrics: it is the harmonic mean of Precision and Recall, which makes it particularly useful for imbalanced datasets in binary classification.
\[ \displaystyle F_1\ Score= \left( \frac{2 \times Precision \times Recall}{Precision+Recall} \right) \]
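As a quick sanity check, we can recompute these metrics by hand from the confusion matrix counts. This is a minimal sketch that reuses the result object returned by caret::confusionMatrix() above, with class 0 treated as the positive class (as in the output):
#Recompute accuracy, precision, recall and F1 directly from the confusion matrix counts.
tab <- result$table
TP <- tab["0", "0"] #predicted 0, actually 0
FP <- tab["0", "1"] #predicted 0, actually 1
FN <- tab["1", "0"] #predicted 1, actually 0
TN <- tab["1", "1"] #predicted 1, actually 1
accuracy  <- (TP + TN) / sum(tab)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1), 4)
These values agree with the Accuracy reported earlier and with the Precision, Recall and F1 values reported by caret below.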
cm <- result$table
cm
## Reference
## Prediction 0 1
## 0 85 27
## 1 9 32
metrics <- as.data.frame(result$byClass)
colnames(metrics) <- "Value"
library(dplyr)
library(kableExtra)
kable(round(metrics, 4), caption = "F1-score, Precision and Recall") %>%
  kable_styling(font_size = 16)
| | Value |
|---|---|
| Sensitivity | 0.9043 |
| Specificity | 0.5424 |
| Pos Pred Value | 0.7589 |
| Neg Pred Value | 0.7805 |
| Precision | 0.7589 |
| Recall | 0.9043 |
| F1 | 0.8252 |
| Prevalence | 0.6144 |
| Detection Rate | 0.5556 |
| Detection Prevalence | 0.7320 |
| Balanced Accuracy | 0.7233 |
Measuring the area under the ROC curve is another very useful way to evaluate a model. The ROC curve shows the performance of a binary classifier as a function of its cut-off threshold: it plots the sensitivity (true positive rate) against the false positive rate for various threshold values.
First we write a function that makes class predictions at different probability cutoffs, and then we obtain the accuracy, sensitivity, and specificity of the resulting classifiers.
get_logistic_pred = function(mod, data, pos = 1, neg = 0, cut = 0.5) {
  probs = predict(mod, newdata = data, type = "response")
  ifelse(probs > cut, pos, neg)
}
test_pred_10 = get_logistic_pred(model, data = test, pos = "1", neg = "0", cut = 0.1)
test_pred_50 = get_logistic_pred(model, data = test, pos = "1", neg = "0", cut = 0.5)
test_pred_90 = get_logistic_pred(model, data = test, pos = "1", neg = "0", cut = 0.9)
#Use the 0/1-coded actual labels defined above so that the table rows and columns match.
test_tab_10 = table(predicted = test_pred_10, actual = actual)
test_tab_50 = table(predicted = test_pred_50, actual = actual)
test_tab_90 = table(predicted = test_pred_90, actual = actual)
library(caret)
test_con_mat_10 = confusionMatrix(test_tab_10, positive = "1")
test_con_mat_50 = confusionMatrix(test_tab_50, positive = "1")
test_con_mat_90 = confusionMatrix(test_tab_90, positive = "1")
metrics = rbind(
c(test_con_mat_10$overall["Accuracy"],
test_con_mat_10$byClass["Sensitivity"],
test_con_mat_10$byClass["Specificity"]),
c(test_con_mat_50$overall["Accuracy"],
test_con_mat_50$byClass["Sensitivity"],
test_con_mat_50$byClass["Specificity"]),
c(test_con_mat_90$overall["Accuracy"],
test_con_mat_90$byClass["Sensitivity"],
test_con_mat_90$byClass["Specificity"])
)
rownames(metrics) = c("c = 0.10", "c = 0.50", "c = 0.90")
metrics
## Accuracy Sensitivity Specificity
## c = 0.10 0.5751634 0.9661017 0.3297872
## c = 0.50 0.7647059 0.5423729 0.9042553
## c = 0.90 0.6405229 0.1016949 0.9787234
library(pROC)
test_prob = predict(model, newdata = test, type = "response")
test_roc = roc(test$diabetes ~ test_prob, plot = TRUE, print.auc = TRUE)
The higher the AUC, the better the model separates the two classes: an AUC close to 1 corresponds to high sensitivity and specificity across thresholds, while an AUC near 0.5 is no better than random guessing.
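The AUC value can also be extracted directly from the pROC object created above, for example:
#Extract the area under the curve from the roc object.
auc(test_roc)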
Log Loss is a metric that quantifies the quality of a classifier's predicted probabilities by penalizing confident but wrong predictions. Its value reflects how far the predicted probabilities diverge from the actual labels.
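For a binary outcome y coded as 0/1 with predicted probability p of the positive class, log loss is defined as
\[ \displaystyle LogLoss= -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right] \]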
library(MLmetrics)
#Log loss of the test-set predicted probabilities against the 0/1-coded test labels.
LogLoss(y_pred = test_prob, y_true = ifelse(test$diabetes == "pos", 1, 0))
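Equivalently, we can compute it directly from the formula above using the test-set probabilities obtained earlier; up to the small numerical safeguards MLmetrics applies, this should match the LogLoss() result:
y <- ifelse(test$diabetes == "pos", 1, 0)
-mean(y * log(test_prob) + (1 - y) * log(1 - test_prob))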
To conclude, in this article we examined some of the popular machine learning metrics, regression-related metrics and classification metrics, used for evaluating the performance of regression and classification models, and we discussed why using the right metrics matters for obtaining good predictions.