The following R vignette explores three common methods of handling imbalanced datasets using the R package “unbalanced”: oversampling, undersampling and SMOTE.
Imbalanced datasets occur frequently in classification tasks. The imbalance arises when one class, referred to as the majority class, contains significantly more responses than the other class or classes, referred to as the minority class. For the purpose of this vignette, only binary classification tasks will be considered.
There is no formal definition of when a dataset is imbalanced, but a general rule of thumb is that anything beyond a 60:40 ratio can be regarded as imbalanced (Google, 2020).
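To make that rule of thumb concrete, here is a minimal sketch (the is_imbalanced helper and its 60% threshold are illustrative only, not part of any package):

# flag a response vector as imbalanced when the largest class
# holds more than 60% of the samples
is_imbalanced <- function(y, threshold = 0.6) {
  max(prop.table(table(y))) > threshold
}

is_imbalanced(factor(c(rep("a", 7), rep("b", 3)))) # TRUE for a 70:30 split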
Imbalanced datasets occur frequently in detection problems such as fraud detection, where the majority of purchases are legitimate as opposed to fraudulent. This results in a significant skew in the dataset towards the majority class.
Most machine learning algorithms were designed to handle balanced datasets (Brownlee, 2020). This generally results in models that struggle to identify minority cases, which are often the cases of most interest. Because the models see far more majority cases during training, they tend to mistake minority cases for majority cases.
Let’s consider a dataset with a significant skew of 99:1 yes to no responses. A seemingly successful strategy in this case would be to always guess yes, which would achieve 99% accuracy but would misclassify every minority case, resulting in poor predictive performance (Brownlee, 2020). Alternative metrics that should be considered are specificity, sensitivity, precision, recall, AUC and F1 score (Johnson, 2019).
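To see the accuracy trap in numbers, here is a small sketch with made-up data (the actual and predicted vectors are hypothetical, with no as the minority class):

# a 99:1 dataset and a classifier that always predicts the majority class
actual <- factor(c(rep("yes", 99), "no"), levels = c("yes", "no"))
predicted <- factor(rep("yes", 100), levels = c("yes", "no"))

accuracy <- mean(predicted == actual) # 0.99
minority_recall <- sum(predicted == "no" & actual == "no") / sum(actual == "no") # 0
c(accuracy = accuracy, minority_recall = minority_recall)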
While algorithms have been developed that deal with class imbalances directly, the simplest and potentially most effective approach is to balance the number of samples across the classes.
library(ggplot2)
library(knitr)
suppressMessages(library(dplyr))
suppressMessages(library(unbalanced))
To demonstrate balancing an imbalanced dataset, I’ll be applying the unbalanced package to the credit card fraud dataset available from Kaggle (Machine Learning Group, 2018).
credit_card_data <- read.csv("creditcard.csv")

credit_card_data$Class <- as.factor(credit_card_data$Class) # convert class to factor
levels(credit_card_data$Class) <- c('Legitimate', 'Fraud') # names of factors
summary(credit_card_data$Class)
## Legitimate Fraud
## 284315 492
# plotting number of samples in each class - original dataset
options(scipen=10000) # remove scientific notation when viewing plot
ggplot(data = credit_card_data, aes(fill = Class)) +
  geom_bar(aes(x = Class)) +
  ggtitle("Number of samples in each class", subtitle = "Original dataset") +
  xlab("") +
  ylab("Samples") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = c(0, 0)) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())
There are 284315 legitimate purchases compared to 492 fraudulent purchases, which make up only 0.17% of the total samples. Clearly this dataset is heavily imbalanced, with a ratio of approximately 580:1.
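These figures can be checked directly from the class counts (a quick sketch using the summary from above):

# percentage of samples in each class and approximate imbalance ratio
class_counts <- summary(credit_card_data$Class)
round(class_counts / sum(class_counts) * 100, 2) # Fraud is ~0.17% of samples
round(class_counts["Legitimate"] / class_counts["Fraud"]) # ~578, roughly 580:1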
Let’s also look at a scatter plot of two variables. The meaning of the variables has been lost during the anonymisation process used to create the dataset.
# scatter plot original dataset
ggplot(data = credit_card_data, aes(x = V4, y = V7, colour = Class)) +
  geom_point() +
  ggtitle("Original dataset") +
  xlab("V4") +
  ylab("V7") +
  xlim(-5, 15) +
  ylim(-50, 50) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        legend.key = element_blank())
## Warning: Removed 23 rows containing missing values (geom_point).
### Data preparation

First, we separate the data into the predictor variables and the response variable.
predictor_variables <- credit_card_data[,-31] # Select everything except response
response_variable <- credit_card_data$Class # Only select response variable
The unbalanced package requires the minority class to have the factor level 1 and the majority class the level 0, so we recode the levels accordingly.

# recode the factor levels: Legitimate becomes 0, Fraud becomes 1
levels(response_variable) <- c('0', '1')
### Undersampling

Undersampling counts the number of minority samples in the dataset, then randomly selects the same number of samples from the majority class. In our case we would end up with 492 randomly chosen legitimate samples and the original 492 fraudulent samples, resulting in a 50:50 split. This has a major drawback, as we are only using 0.35% of the original dataset.
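As a quick check of that figure (a one-liner against the original data):

(2 * 492) / nrow(credit_card_data) * 100 # fraction of data retained, ~0.35%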
# Run undersampling function
undersampled_data <- ubBalance(predictor_variables,
                               response_variable,
                               type = 'ubUnder', # Option for undersampling
                               verbose = TRUE)
## Proportion of positives after ubUnder : 50 % of 984 observations
undersampled_combined <- cbind(undersampled_data$X, # combine output
                               undersampled_data$Y)
names(undersampled_combined)[names(undersampled_combined) == "undersampled_data$Y"] <- "Class" # change name to class
levels(undersampled_combined$Class) <- c('Legitimate', 'Fraud')
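As a sanity check, both classes should now hold 492 samples each, matching the verbose output above:

summary(undersampled_combined$Class) # expect 492 Legitimate and 492 Fraud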
# plot number of cases in undersampled dataset
ggplot(data = undersampled_combined, aes(fill = Class)) +
  geom_bar(aes(x = Class)) +
  ggtitle("Number of samples in each class after undersampling",
          subtitle = "Total samples: 984") +
  xlab("") +
  ylab("Samples") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = c(0, 0)) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())
# scatter plot of undersampled data
ggplot(data = undersampled_combined, aes(x = V4, y = V7, colour = Class)) +
  geom_point() +
  ggtitle("Undersampled dataset") +
  xlab("V4") +
  ylab("V7") +
  xlim(-5, 15) +
  ylim(-50, 50) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        legend.key = element_blank())
After undersampling, we can clearly observe the reduced number of majority samples in the scatter plot, while the number of minority samples remains unchanged.
### Oversampling

This method repeatedly duplicates randomly selected minority samples until there are equal numbers of majority and minority samples. It has its own drawback, as the duplicates may cause the model to overfit the minority class and generalize poorly. The scatter plot would look identical to the original dataset and has not been included.
# run oversampling function on data
oversampled_data <- ubBalance(predictor_variables,
                              response_variable,
                              type = 'ubOver', # Option for oversampling
                              k = 0, # Value of 0 creates 50:50 split
                              verbose = TRUE)
## Proportion of positives after ubOver : 50 % of 568630 observations
oversampled_combined <- cbind(oversampled_data$X, # combine output
                              oversampled_data$Y)
names(oversampled_combined)[names(oversampled_combined) == "oversampled_data$Y"] <- "Class" # change name to class
levels(oversampled_combined$Class) <- c('Legitimate', 'Fraud')
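To illustrate the duplication drawback mentioned above, we can count duplicated rows in the oversampled data (note that any duplicates already present in the original dataset are counted as well):

sum(duplicated(oversampled_combined)) # rows that are exact copies of earlier rows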
# plot number of cases in oversampled dataset
ggplot(data = oversampled_combined, aes(fill = Class)) +
  geom_bar(aes(x = Class)) +
  ggtitle("Number of samples in each class after oversampling",
          subtitle = "Total samples: 568630") +
  xlab("") +
  ylab("Samples") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = c(0, 0)) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())
### SMOTE

Synthetic Minority Over-sampling Technique, or SMOTE, is another oversampling method for handling class imbalances. A simple explanation is that it randomly selects a minority data point and looks at its k nearest minority-class neighbours. It then randomly selects one of these neighbours, draws a line between the two points and creates a new data point at a random position along that line. This is repeated until the minority class reaches a predetermined ratio to the majority class.
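As a minimal sketch of that interpolation step (the smote_one_point helper is illustrative only, not the ubSMOTE implementation):

# create one synthetic sample between a minority point x and one of its
# k nearest minority-class neighbours, at a random position along the line
smote_one_point <- function(x, neighbour) {
  x + runif(1) * (neighbour - x)
}

smote_one_point(c(1, 2), c(3, 6)) # a random point between (1, 2) and (3, 6)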
# run SMOTE function on data
Smote_data <- ubBalance(predictor_variables,
                        response_variable,
                        type = 'ubSMOTE', # Option for SMOTE
                        k = 3, # How many neighbouring data points to consider
                        percOver = 57787.6, # Percentage of minority cases: 284315/492 * 100
                        percUnder = 100, # Percentage of majority cases, default 100
                        verbose = TRUE)
## Proportion of positives after ubSMOTE : 50.04 % of 568260 observations
Smote_combined <- cbind(Smote_data$X, # Combine output
                        Smote_data$Y)
names(Smote_combined)[names(Smote_combined) == "Smote_data$Y"] <- "Class" # change name to class
levels(Smote_combined$Class) <- c('Legitimate', 'Fraud')
# plot number of samples in each class after smote
ggplot(data = Smote_combined, aes(fill = Class)) +
  geom_bar(aes(x = Class)) +
  ggtitle("Number of samples in each class after SMOTE",
          subtitle = "Total samples: 568260") +
  xlab("") +
  ylab("Samples") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = c(0, 0)) +
  theme(legend.position = "none",
        legend.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())
# scatter plot after smote
ggplot(data = Smote_combined, aes(x = V4, y = V7, colour = Class)) +
  geom_point() +
  ggtitle("SMOTE dataset") +
  xlab("V4") +
  ylab("V7") +
  xlim(-5, 15) +
  ylim(-50, 50) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        legend.key = element_blank())
## Warning: Removed 22 rows containing missing values (geom_point).
We can see the newly created synthetic samples from the minority class in the scatter plot above.
### References

Google, 2020, accessed 13 August 2021, https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data

Brownlee, J., 2020, Tour of Evaluation Metrics for Imbalanced Classification, Machine Learning Mastery, accessed 13 August 2021, https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/

Johnson, J.M. & Khoshgoftaar, T.M., 2019, Survey on deep learning with class imbalance, Journal of Big Data, Volume 6, Article 27

Machine Learning Group - ULB, 2018, Credit Card Fraud Detection - Anonymized credit card transactions labeled as fraudulent or genuine, version 3, https://www.kaggle.com/mlg-ulb/creditcardfraud