In real-life machine learning we very often face scenarios where the response variable is not balanced in the dataset we are using to build the model. In fact, unbalanced data is far more common than balanced data in real-world problems. Some typical examples of unbalanced data are: 1) Fraud detection 2) Insurance claim applications 3) Disease detection
ML algorithms trained on an unbalanced dataset will be strongly biased towards the majority class. For example, suppose we have a dataset of samples taken for a disease check and only 3 per cent of the population suffers from the disease. If we build a model on this data, the model will be heavily biased towards predicting new data as negative, because negatives made up 97% of the training data. In addition, the positive samples are very few in number, giving the model very little data from which to learn the positive class.
There are two different aspects to dealing with an unbalanced dataset. First, to help the ML model make good predictions we may have to undersample (take fewer samples of the majority class) or oversample (duplicate samples from the minority class) so that the data becomes more balanced. The second important aspect is how we evaluate our models when working with an unbalanced dataset: instead of accuracy, we may want to consider metrics like precision and recall.
library(imbalance)
## Warning: package 'imbalance' was built under R version 3.6.3
library(caret)
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
## Loading required package: ggplot2
data("iris0")
dim(iris0)
## [1] 150 5
head(iris0)
## SepalLength SepalWidth PetalLength PetalWidth Class
## 1 5.1 3.5 1.4 0.2 positive
## 2 4.9 3.0 1.4 0.2 positive
## 3 4.6 3.1 1.5 0.2 positive
## 4 5.0 3.6 1.4 0.2 positive
## 5 5.4 3.9 1.7 0.4 positive
## 6 4.6 3.4 1.4 0.3 positive
table(iris0$Class)
##
## negative positive
## 100 50
One way to undersample is to do it manually, so that the new dataset is balanced:
df_iris0_positive_ind <- which(iris0$Class == "positive")
df_iris0_negative_ind <- which(iris0$Class == "negative")
### setting negative counts to be same as positive counts - so that the data is balanced
nsample <- 50
pick_negative <- sample(df_iris0_negative_ind, nsample)
undersample_df1 <- iris0[c(df_iris0_positive_ind, pick_negative), ]
dim(undersample_df1)
## [1] 100 5
table(undersample_df1$Class)
##
## negative positive
## 50 50
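Besides slicing the data frame by hand, caret also provides a standalone helper, downSample(), which randomly removes majority-class rows until the classes are balanced (its counterpart upSample() does the opposite). A minimal sketch, assuming the outcome is the fifth column of iris0 and keeping downSample()'s default yname = "Class":
set.seed(123)
### downSample() takes the predictors and the outcome factor separately and
### returns a data frame whose (now balanced) outcome column is named "Class"
undersample_df2 <- downSample(x = iris0[, -5], y = iris0$Class)
table(undersample_df2$Class)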
caret's trainControl() also lets us do the undersampling inside the resampling loop via its sampling argument, and the resulting control object can then be fed into the model training step.
set.seed(123)
train.index <- createDataPartition(iris0$Class, p = .7, list = FALSE)
iris.train.df <- iris0[train.index, ]
iris.test.df <- iris0[-train.index, ]
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
verboseIter = FALSE,
sampling = "down")
set.seed(123)
random_model_us_rf <- caret::train(Class ~ .,
data = iris.train.df,
method = "rf",
preProcess = c("scale", "center"),
trControl = ctrl)
predict_undersample_prob <- predict(random_model_us_rf,
newdata = iris.test.df,
type = "prob")
length(predict_undersample_prob)
## [1] 2
predict_undersample_prob
## negative positive
## 1 0.000 1.000
## 6 0.000 1.000
## 9 0.000 1.000
## 11 0.000 1.000
## 13 0.028 0.972
## 16 0.060 0.940
## 17 0.000 1.000
## 24 0.000 1.000
## 28 0.000 1.000
## 33 0.000 1.000
## 34 0.050 0.950
## 36 0.000 1.000
## 37 0.000 1.000
## 38 0.000 1.000
## 45 0.000 1.000
## 51 1.000 0.000
## 52 1.000 0.000
## 53 1.000 0.000
## 60 1.000 0.000
## 61 1.000 0.000
## 69 1.000 0.000
## 70 0.994 0.006
## 74 0.998 0.002
## 78 1.000 0.000
## 85 1.000 0.000
## 87 1.000 0.000
## 94 0.996 0.004
## 95 0.998 0.002
## 96 0.996 0.004
## 98 1.000 0.000
## 102 1.000 0.000
## 104 1.000 0.000
## 105 1.000 0.000
## 106 1.000 0.000
## 109 1.000 0.000
## 112 1.000 0.000
## 118 0.998 0.002
## 124 1.000 0.000
## 130 1.000 0.000
## 131 1.000 0.000
## 133 1.000 0.000
## 136 1.000 0.000
## 138 1.000 0.000
## 142 1.000 0.000
## 148 1.000 0.000
predict_undersample_classes <- ifelse(predict_undersample_prob$positive > 0.5,
"positive", "negative")
table(predict_undersample_classes)
## predict_undersample_classes
## negative positive
## 30 15
table(Predicted = predict_undersample_classes, Actual = iris.test.df$Class)
## Actual
## Predicted negative positive
## negative 30 0
## positive 0 15
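Rather than building the confusion matrix by hand, we can also let caret summarise it for us. A minimal sketch, assuming iris0$Class is a factor and declaring "positive" as the positive class:
confusionMatrix(data = factor(predict_undersample_classes,
                              levels = levels(iris.test.df$Class)),
                reference = iris.test.df$Class,
                positive = "positive")
Besides accuracy, the output reports sensitivity and specificity, which are far more informative here than accuracy alone.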
Oversampling means replicating data from the minority class so that the count of the minority class becomes similar to that of the majority class.
Just as with undersampling, one way to oversample is to do it manually, so that the new dataset is balanced. Here we simply duplicate the existing minority class rows to increase their count to match the majority class:
df_iris0_positive_ind <- which(iris0$Class == "positive")
df_iris0_negative_ind <- which(iris0$Class == "negative")
### duplicating the positive rows so that the positive count matches the negative count - so that the data is balanced
oversample_df1 <- iris0[c(df_iris0_positive_ind, df_iris0_negative_ind , df_iris0_positive_ind), ]
dim(oversample_df1)
## [1] 200 5
table(oversample_df1$Class)
##
## negative positive
## 100 100
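Instead of plain duplication, we can also generate synthetic minority samples. The imbalance package loaded earlier ships several such generators; the sketch below assumes its mwmote() function (Majority Weighted Minority Oversampling) with default parameters, generating 50 new synthetic "positive" rows that we append to the original data:
### mwmote() returns only the newly generated minority-class rows
new_positives <- mwmote(iris0, numInstances = 50, classAttr = "Class")
oversample_df2 <- rbind(iris0, new_positives)
table(oversample_df2$Class)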
Again, we can use caret's trainControl() to do the oversampling (sampling = "up"), and then feed that trainControl object into the training of the model.
ctrl2 <- trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
verboseIter = FALSE,
sampling = "up")
set.seed(123)
random_model_os_rf <- caret::train(Class ~ .,
data = iris.train.df,
method = "rf",
preProcess = c("scale", "center"),
trControl = ctrl2)
predict_oversample_prob <- predict(random_model_os_rf,
newdata = iris.test.df,
type = "prob")
length(predict_oversample_prob)
## [1] 2
predict_oversample_prob
## negative positive
## 1 0.000 1.000
## 6 0.000 1.000
## 9 0.000 1.000
## 11 0.000 1.000
## 13 0.032 0.968
## 16 0.074 0.926
## 17 0.000 1.000
## 24 0.000 1.000
## 28 0.000 1.000
## 33 0.000 1.000
## 34 0.070 0.930
## 36 0.000 1.000
## 37 0.000 1.000
## 38 0.000 1.000
## 45 0.000 1.000
## 51 1.000 0.000
## 52 1.000 0.000
## 53 1.000 0.000
## 60 1.000 0.000
## 61 1.000 0.000
## 69 1.000 0.000
## 70 1.000 0.000
## 74 1.000 0.000
## 78 1.000 0.000
## 85 1.000 0.000
## 87 1.000 0.000
## 94 0.994 0.006
## 95 1.000 0.000
## 96 1.000 0.000
## 98 1.000 0.000
## 102 1.000 0.000
## 104 1.000 0.000
## 105 1.000 0.000
## 106 1.000 0.000
## 109 1.000 0.000
## 112 1.000 0.000
## 118 1.000 0.000
## 124 1.000 0.000
## 130 1.000 0.000
## 131 1.000 0.000
## 133 1.000 0.000
## 136 1.000 0.000
## 138 1.000 0.000
## 142 1.000 0.000
## 148 1.000 0.000
predict_oversample_classes <- ifelse(predict_oversample_prob$positive > 0.5,
"positive", "negative")
table(predict_oversample_classes)
## predict_oversample_classes
## negative positive
## 30 15
table(Predicted = predict_oversample_classes, Actual = iris.test.df$Class)
## Actual
## Predicted negative positive
## negative 30 0
## positive 0 15
Another very important part of training and evaluating models on an imbalanced dataset is which metrics we use. Accuracy is not a good option for evaluating such classification models. Instead, we can use precision or recall, depending on what we are trying to model and what the cost is of predicting the positive data incorrectly versus the negative data incorrectly.
\(Accuracy\quad =\quad \frac { TP+TN }{ TP+FP+TN+FN }\)
Consider a dataset in which, say, 97% of the data belongs to the negative class and only 3% to the positive class. If our model blindly predicts every new observation as negative, its accuracy will still be 97%, which looks very good. However, the model has completely failed to predict the positives, so it is not a good model.
We can use Precision or Recall in such a case:
\(Precision\quad =\quad \frac { TP }{ TP+FP }\)
\(Recall\quad =\quad \frac { TP }{ TP+FN }\)
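As a worked example, we can compute these metrics directly from the confusion matrix of the undersampled model above (there TP = 15, FP = 0, FN = 0, TN = 30, so precision and recall are both 1). A minimal sketch in base R:
cm <- table(Predicted = predict_undersample_classes,
            Actual = iris.test.df$Class)
TP <- cm["positive", "positive"]
FP <- cm["positive", "negative"]
FN <- cm["negative", "positive"]
TN <- cm["negative", "negative"]
c(Precision = TP / (TP + FP),
  Recall = TP / (TP + FN),
  Accuracy = (TP + TN) / sum(cm))
caret's confusionMatrix() reports the same quantities when called with mode = "prec_recall".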
Hence there are two steps we have to take differently when we have an unbalanced dataset: 1) Undersampling / Oversampling 2) Using the right evaluation metrics to evaluate our model