In real-life machine learning we very often face scenarios where the response variable is not balanced in the dataset we are using to build the model. In fact, unbalanced data is far more common than balanced data in real-world problems. Some typical examples of unbalanced data are: 1) Fraud detection 2) Insurance claim applications 3) Disease detection
ML algorithms trained on an unbalanced dataset will be strongly biased towards the majority class. For example, suppose we have a dataset of samples taken for a disease check and only 3 per cent of the population suffers from the disease. If we build a model on this data, the model will be heavily biased towards predicting new data as negative, because negatives made up 97% of the training data. In addition, the positive samples are very few in number, giving the model very little data from which to learn the positive class.
There are two different aspects to dealing with an unbalanced dataset. First, to help the ML model make good predictions we may have to undersample (take fewer samples of the majority class) or oversample (duplicate samples from the minority class) so that the data becomes more balanced. The second important aspect is how we evaluate our models when working with an unbalanced dataset: instead of accuracy, we may want to consider metrics like precision and recall.
library(imbalance)
## Warning: package 'imbalance' was built under R version 3.6.3
library(caret)
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
## Loading required package: ggplot2
data("iris0")
dim(iris0)
## [1] 150 5
head(iris0)
## SepalLength SepalWidth PetalLength PetalWidth Class
## 1 5.1 3.5 1.4 0.2 positive
## 2 4.9 3.0 1.4 0.2 positive
## 3 4.6 3.1 1.5 0.2 positive
## 4 5.0 3.6 1.4 0.2 positive
## 5 5.4 3.9 1.7 0.4 positive
## 6 4.6 3.4 1.4 0.3 positive
table(iris0$Class)
##
## negative positive
## 100 50
One way to undersample is to do it manually, so that the new dataset is balanced:
df_iris0_positive_ind <- which(iris0$Class == "positive")
df_iris0_negative_ind <- which(iris0$Class == "negative")
### setting negative counts to be same as positive counts - so that the data is balanced
nsample <- 50
pick_negative <- sample(df_iris0_negative_ind, nsample)
undersample_df1 <- iris0[c(df_iris0_positive_ind, pick_negative), ]
dim(undersample_df1)
## [1] 100 5
table(undersample_df1$Class)
##
## negative positive
## 50 50
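Besides slicing the data frame by hand, caret also provides a standalone helper, downSample(), which randomly removes majority-class rows until the classes are balanced (its counterpart upSample() does the opposite). A minimal sketch, assuming the outcome is the fifth column of iris0 and keeping downSample()'s default yname = "Class":
set.seed(123)
### downSample() takes the predictors and the outcome factor separately and
### returns a data frame whose (now balanced) outcome column is named "Class"
undersample_df2 <- downSample(x = iris0[, -5], y = iris0$Class)
table(undersample_df2$Class)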
caret's trainControl() also lets us do the undersampling inside the resampling loop via its sampling argument, and the resulting control object can then be fed into the model training step.
set.seed(123)
train.index <- createDataPartition(iris0$Class, p = .7, list = FALSE)
iris.train.df <- iris0[train.index, ]
iris.test.df <- iris0[-train.index, ]
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
verboseIter = FALSE,
sampling = "down")
set.seed(123)
random_model_us_rf <- caret::train(Class ~ .,
data = iris.train.df,
method = "rf",
preProcess = c("scale", "center"),
trControl = ctrl)
predict_undersample_prob <- predict(random_model_us_rf,
newdata = iris.test.df,
type = "prob")
length(predict_undersample_prob)
## [1] 2
predict_undersample_prob
## negative positive
## 1 0.000 1.000
## 6 0.000 1.000
## 9 0.000 1.000
## 11 0.000 1.000
## 13 0.028 0.972
## 16 0.060 0.940
## 17 0.000 1.000
## 24 0.000 1.000
## 28 0.000 1.000
## 33 0.000 1.000
## 34 0.050 0.950
## 36 0.000 1.000
## 37 0.000 1.000
## 38 0.000 1.000
## 45 0.000 1.000
## 51 1.000 0.000
## 52 1.000 0.000
## 53 1.000 0.000
## 60 1.000 0.000
## 61 1.000 0.000
## 69 1.000 0.000
## 70 0.994 0.006
## 74 0.998 0.002
## 78 1.000 0.000
## 85 1.000 0.000
## 87 1.000 0.000
## 94 0.996 0.004
## 95 0.998 0.002
## 96 0.996 0.004
## 98 1.000 0.000
## 102 1.000 0.000
## 104 1.000 0.000
## 105 1.000 0.000
## 106 1.000 0.000
## 109 1.000 0.000
## 112 1.000 0.000
## 118 0.998 0.002
## 124 1.000 0.000
## 130 1.000 0.000
## 131 1.000 0.000
## 133 1.000 0.000
## 136 1.000 0.000
## 138 1.000 0.000
## 142 1.000 0.000
## 148 1.000 0.000
predict_undersample_classes <- ifelse(predict_undersample_prob$positive > 0.5,
"positive", "negative")
table(predict_undersample_classes)
## predict_undersample_classes
## negative positive
## 30 15
table(Predicted = predict_undersample_classes, Actual = iris.test.df$Class)
## Actual
## Predicted negative positive
## negative 30 0
## positive 0 15
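Rather than building the confusion matrix by hand, we can also let caret summarise it for us. A minimal sketch, assuming iris0$Class is a factor and declaring "positive" as the positive class:
confusionMatrix(data = factor(predict_undersample_classes,
                              levels = levels(iris.test.df$Class)),
                reference = iris.test.df$Class,
                positive = "positive")
Besides accuracy, the output reports sensitivity and specificity, which are far more informative here than accuracy alone.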
Oversampling means replicating data from the minority class so that the count of the minority class becomes similar to that of the majority class.
Just as with undersampling, one way to oversample is to do it manually, so that the new dataset is balanced. Here we simply duplicate the existing minority class rows to increase their count to match the majority class:
df_iris0_positive_ind <- which(iris0$Class == "positive")
df_iris0_negative_ind <- which(iris0$Class == "negative")
### duplicating the positive rows so that the positive count matches the negative count - so that the data is balanced
oversample_df1 <- iris0[c(df_iris0_positive_ind, df_iris0_negative_ind , df_iris0_positive_ind), ]
dim(oversample_df1)
## [1] 200 5
table(oversample_df1$Class)
##
## negative positive
## 100 100
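Instead of plain duplication, we can also generate synthetic minority samples. The imbalance package loaded earlier ships several such generators; the sketch below assumes its mwmote() function (Majority Weighted Minority Oversampling) with default parameters, generating 50 new synthetic "positive" rows that we append to the original data:
### mwmote() returns only the newly generated minority-class rows
new_positives <- mwmote(iris0, numInstances = 50, classAttr = "Class")
oversample_df2 <- rbind(iris0, new_positives)
table(oversample_df2$Class)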
Again, we can use caret's trainControl() to do the oversampling (sampling = "up"), and then feed that trainControl object into the training of the model.
ctrl2 <- trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
verboseIter = FALSE,
sampling = "up")
set.seed(123)
random_model_os_rf <- caret::train(Class ~ .,
data = iris.train.df,
method = "rf",
preProcess = c("scale", "center"),
trControl = ctrl2)
predict_oversample_prob <- predict(random_model_os_rf,
newdata = iris.test.df,
type = "prob")
length(predict_oversample_prob)
## [1] 2
predict_oversample_prob
## negative positive
## 1 0.000 1.000
## 6 0.000 1.000
## 9 0.000 1.000
## 11 0.000 1.000
## 13 0.032 0.968
## 16 0.074 0.926
## 17 0.000 1.000
## 24 0.000 1.000
## 28 0.000 1.000
## 33 0.000 1.000
## 34 0.070 0.930
## 36 0.000 1.000
## 37 0.000 1.000
## 38 0.000 1.000
## 45 0.000 1.000
## 51 1.000 0.000
## 52 1.000 0.000
## 53 1.000 0.000
## 60 1.000 0.000
## 61 1.000 0.000
## 69 1.000 0.000
## 70 1.000 0.000
## 74 1.000 0.000
## 78 1.000 0.000
## 85 1.000 0.000
## 87 1.000 0.000
## 94 0.994 0.006
## 95 1.000 0.000
## 96 1.000 0.000
## 98 1.000 0.000
## 102 1.000 0.000
## 104 1.000 0.000
## 105 1.000 0.000
## 106 1.000 0.000
## 109 1.000 0.000
## 112 1.000 0.000
## 118 1.000 0.000
## 124 1.000 0.000
## 130 1.000 0.000
## 131 1.000 0.000
## 133 1.000 0.000
## 136 1.000 0.000
## 138 1.000 0.000
## 142 1.000 0.000
## 148 1.000 0.000
predict_oversample_classes <- ifelse(predict_oversample_prob$positive > 0.5,
"positive", "negative")
table(predict_oversample_classes)
## predict_oversample_classes
## negative positive
## 30 15
table(Predicted = predict_oversample_classes, Actual = iris.test.df$Class)
## Actual
## Predicted negative positive
## negative 30 0
## positive 0 15
Another very important part of training and evaluating models on an imbalanced dataset is which metrics we use. Accuracy is not a good option for evaluating such classification models. Instead, we can use precision or recall, depending on what we are trying to model and what the cost is of predicting the positive data incorrectly versus the negative data incorrectly.
\(Accuracy\quad =\quad \frac { TP+TN }{ TP+FP+TN+FN }\)
Consider a dataset in which, say, 97% of the data belongs to the negative class and only 3% to the positive class. If our model blindly predicts every new observation as negative, its accuracy will still be 97%, which looks very good. However, the model has completely failed to predict the positives, so it is not a good model.
We can use Precision or Recall in such a case:
\(Precision\quad =\quad \frac { TP }{ TP+FP }\)
\(Recall\quad =\quad \frac { TP }{ TP+FN }\)
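As a worked example, we can compute these metrics directly from the confusion matrix of the undersampled model above (there TP = 15, FP = 0, FN = 0, TN = 30, so precision and recall are both 1). A minimal sketch in base R:
cm <- table(Predicted = predict_undersample_classes,
            Actual = iris.test.df$Class)
TP <- cm["positive", "positive"]
FP <- cm["positive", "negative"]
FN <- cm["negative", "positive"]
TN <- cm["negative", "negative"]
c(Precision = TP / (TP + FP),
  Recall = TP / (TP + FN),
  Accuracy = (TP + TN) / sum(cm))
caret's confusionMatrix() reports the same quantities when called with mode = "prec_recall".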
Hence there are two steps we have to take differently when we have an unbalanced dataset: 1) Undersampling / Oversampling 2) Using the right evaluation metrics to evaluate our model