The following R vignette explores three common methods of handling imbalanced datasets using the R package “unbalanced”: oversampling, undersampling and SMOTE.
Imbalanced datasets occur frequently in classification tasks. The imbalance arises when one class, referred to as the majority class, contains significantly more responses than the other class or classes, referred to as the minority class. For the purpose of this vignette, only binary classification tasks will be considered.
There is no formal definition of when a dataset is imbalanced, but a general rule of thumb is that anything beyond a 60:40 ratio can be regarded as imbalanced (Google, 2020).
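To make that rule of thumb concrete, here is a minimal sketch (the is_imbalanced helper and its 60% threshold are illustrative only, not part of any package):

# flag a response vector as imbalanced when the largest class
# holds more than 60% of the samples
is_imbalanced <- function(y, threshold = 0.6) {
  max(prop.table(table(y))) > threshold
}

is_imbalanced(factor(c(rep("a", 7), rep("b", 3)))) # TRUE for a 70:30 split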
Imbalanced datasets occur frequently in detection problems such as fraud detection, where the majority of purchases are legitimate as opposed to fraudulent. This results in a significant skew in the dataset towards the majority class.
Most machine learning algorithms were designed to handle balanced datasets (Brownlee, 2020). This generally results in models that struggle to identify minority cases, which are often the cases of most interest. Because the models see far more majority cases during training, they tend to mistake minority cases for majority cases.
Let’s consider a dataset with a significant skew of 99:1 yes to no responses. A seemingly successful strategy in this case would be to always guess yes, which would achieve 99% accuracy but would misclassify every minority case, resulting in poor predictive performance (Brownlee, 2020). Alternative metrics that should be considered are specificity, sensitivity, precision, recall, AUC and F1 score (Johnson, 2019).
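To see the accuracy trap in numbers, here is a small sketch with made-up data (the actual and predicted vectors are hypothetical, with no as the minority class):

# a 99:1 dataset and a classifier that always predicts the majority class
actual <- factor(c(rep("yes", 99), "no"), levels = c("yes", "no"))
predicted <- factor(rep("yes", 100), levels = c("yes", "no"))

accuracy <- mean(predicted == actual) # 0.99
minority_recall <- sum(predicted == "no" & actual == "no") / sum(actual == "no") # 0
c(accuracy = accuracy, minority_recall = minority_recall)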
While algorithms have been developed that deal with class imbalances directly, the simplest and potentially most effective approach is to balance the number of samples across the classes.
library(ggplot2)
library(knitr)
suppressMessages(library(dplyr))
suppressMessages(library(unbalanced))
To demonstrate balancing an imbalanced dataset, I’ll be applying the unbalanced package to the credit card fraud dataset available from Kaggle (Machine Learning Group, 2018).
credit_card_data <- read.csv("creditcard.csv")

credit_card_data$Class <- as.factor(credit_card_data$Class) # convert class to factor
levels(credit_card_data$Class) <- c('Legitimate', 'Fraud') # names of factors
summary(credit_card_data$Class)
## Legitimate Fraud
## 284315 492
# plotting number of samples in each class - original dataset
options(scipen=10000) # remove scientific notation when viewing plot
ggplot(data = credit_card_data, aes(fill = Class)) +
  geom_bar(aes(x = Class)) +
  ggtitle("Number of samples in each class", subtitle = "Original dataset") +
  xlab("") +
  ylab("Samples") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = c(0, 0)) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())
There are 284315 legitimate purchases compared to 492 fraudulent purchases, which make up only 0.17% of the total samples. Clearly this dataset is heavily imbalanced, with a ratio of approximately 580:1.
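These figures can be checked directly from the class counts (a quick sketch using the summary from above):

# percentage of samples in each class and approximate imbalance ratio
class_counts <- summary(credit_card_data$Class)
round(class_counts / sum(class_counts) * 100, 2) # Fraud is ~0.17% of samples
round(class_counts["Legitimate"] / class_counts["Fraud"]) # ~578, roughly 580:1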
Let’s also look at a scatter plot of two variables. The meaning of the variables has been lost during the anonymisation process used to create the dataset.
# scatter plot original dataset
ggplot(data = credit_card_data, aes(x = V4, y = V7, colour = Class)) +
  geom_point() +
  ggtitle("Original dataset") +
  xlab("V4") +
  ylab("V7") +
  xlim(-5, 15) +
  ylim(-50, 50) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        legend.key = element_blank())
## Warning: Removed 23 rows containing missing values (geom_point).
### Data preparation

First, we separate the data into the predictor variables and the response variable.
predictor_variables <- credit_card_data[,-31] # Select everything except response
response_variable <- credit_card_data$Class # Only select response variable
The unbalanced package requires the minority class to have the factor level 1 and the majority class the level 0, so we recode the levels accordingly.

# recode the factor levels: Legitimate becomes 0, Fraud becomes 1
levels(response_variable) <- c('0', '1')
### Undersampling

Undersampling counts the number of minority samples in the dataset, then randomly selects the same number of samples from the majority class. In our case we would end up with 492 randomly chosen legitimate samples and the original 492 fraudulent samples, resulting in a 50:50 split. This has a major drawback, as we are only using 0.35% of the original dataset.
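As a quick check of that figure (a one-liner against the original data):

(2 * 492) / nrow(credit_card_data) * 100 # fraction of data retained, ~0.35%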
# Run undersampling function
undersampled_data <- ubBalance(predictor_variables,
                               response_variable,
                               type = 'ubUnder', # Option for undersampling
                               verbose = TRUE)
## Proportion of positives after ubUnder : 50 % of 984 observations
undersampled_combined <- cbind(undersampled_data$X, # combine output
                               undersampled_data$Y)
names(undersampled_combined)[names(undersampled_combined) == "undersampled_data$Y"] <- "Class" # change name to class
levels(undersampled_combined$Class) <- c('Legitimate', 'Fraud')
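As a sanity check, both classes should now hold 492 samples each, matching the verbose output above:

summary(undersampled_combined$Class) # expect 492 Legitimate and 492 Fraud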
# plot number of cases in undersampled dataset
ggplot(data = undersampled_combined, aes(fill = Class)) +
  geom_bar(aes(x = Class)) +
  ggtitle("Number of samples in each class after undersampling",
          subtitle = "Total samples: 984") +
  xlab("") +
  ylab("Samples") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = c(0, 0)) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())
# scatter plot of undersampled data
ggplot(data = undersampled_combined, aes(x = V4, y = V7, colour = Class)) +
  geom_point() +
  ggtitle("Undersampled dataset") +
  xlab("V4") +
  ylab("V7") +
  xlim(-5, 15) +
  ylim(-50, 50) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        legend.key = element_blank())
After undersampling, we can clearly observe the reduced number of majority samples in the scatter plot, while the number of minority samples remains unchanged.
### Oversampling

This method repeatedly duplicates randomly selected minority samples until there are equal numbers of majority and minority samples. It has its own drawback, as the duplicates may cause the model to overfit the minority class and generalize poorly. The scatter plot would look identical to the original dataset and has not been included.
# run oversampling function on data
oversampled_data <- ubBalance(predictor_variables,
                              response_variable,
                              type = 'ubOver', # Option for oversampling
                              k = 0, # Value of 0 creates 50:50 split
                              verbose = TRUE)
## Proportion of positives after ubOver : 50 % of 568630 observations
oversampled_combined <- cbind(oversampled_data$X, # combine output
                              oversampled_data$Y)
names(oversampled_combined)[names(oversampled_combined) == "oversampled_data$Y"] <- "Class" # change name to class
levels(oversampled_combined$Class) <- c('Legitimate', 'Fraud')
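To illustrate the duplication drawback mentioned above, we can count duplicated rows in the oversampled data (note that any duplicates already present in the original dataset are counted as well):

sum(duplicated(oversampled_combined)) # rows that are exact copies of earlier rows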
# plot number of cases in oversampled dataset
ggplot(data = oversampled_combined, aes(fill = Class)) +
  geom_bar(aes(x = Class)) +
  ggtitle("Number of samples in each class after oversampling",
          subtitle = "Total samples: 568630") +
  xlab("") +
  ylab("Samples") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = c(0, 0)) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())
### SMOTE

Synthetic Minority Over-sampling Technique, or SMOTE, is another oversampling method for handling class imbalances. A simple explanation is that it randomly selects a minority data point and looks at its k nearest minority-class neighbours. It then randomly selects one of these neighbours, draws a line between the two points and creates a new data point at a random position along that line. This is repeated until the minority class reaches a predetermined ratio to the majority class.
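As a minimal sketch of that interpolation step (the smote_one_point helper is illustrative only, not the ubSMOTE implementation):

# create one synthetic sample between a minority point x and one of its
# k nearest minority-class neighbours, at a random position along the line
smote_one_point <- function(x, neighbour) {
  x + runif(1) * (neighbour - x)
}

smote_one_point(c(1, 2), c(3, 6)) # a random point between (1, 2) and (3, 6)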
# run SMOTE function on data
Smote_data <- ubBalance(predictor_variables,
                        response_variable,
                        type = 'ubSMOTE', # Option for SMOTE
                        k = 3, # How many neighbouring data points to consider
                        percOver = 57787.6, # Percentage of minority cases: 284315/492 * 100
                        percUnder = 100, # Percentage of majority cases, default 100
                        verbose = TRUE)
## Proportion of positives after ubSMOTE : 50.04 % of 568260 observations
Smote_combined <- cbind(Smote_data$X, # Combine output
                        Smote_data$Y)
names(Smote_combined)[names(Smote_combined) == "Smote_data$Y"] <- "Class" # change name to class
levels(Smote_combined$Class) <- c('Legitimate', 'Fraud')
# plot number of samples in each class after smote
ggplot(data = Smote_combined, aes(fill = Class)) +
  geom_bar(aes(x = Class)) +
  ggtitle("Number of samples in each class after SMOTE",
          subtitle = "Total samples: 568260") +
  xlab("") +
  ylab("Samples") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = c(0, 0)) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())
# scatter plot after smote
ggplot(data = Smote_combined, aes(x = V4, y = V7, colour = Class)) +
  geom_point() +
  ggtitle("SMOTE dataset") +
  xlab("V4") +
  ylab("V7") +
  xlim(-5, 15) +
  ylim(-50, 50) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        legend.key = element_blank())
## Warning: Removed 22 rows containing missing values (geom_point).
We can see the newly created synthetic samples from the minority class in the scatter plot above.
### References

Google, 2020, accessed 13 August 2021, https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data

Brownlee, J., 2020, Tour of Evaluation Metrics for Imbalanced Classification, Machine Learning Mastery, accessed 13 August 2021, https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/

Johnson, J.M. & Khoshgoftaar, T.M., 2019, Survey on deep learning with class imbalance, Journal of Big Data, Volume 6, Article 27

Machine Learning Group - ULB, 2018, Credit Card Fraud Detection - Anonymized credit card transactions labeled as fraudulent or genuine, version 3, https://www.kaggle.com/mlg-ulb/creditcardfraud