As W. Edwards Deming once said, “…without data you’re just another person with an opinion”. I believe this statement is more applicable than ever: organisations are pouring attention into deploying machine learning algorithms, yet much of the available data is disparate, incomplete and, ultimately, imbalanced.
Of course, first and foremost, data governance must be employed within the organisation to ensure that quality data is captured; but does this mean the data already captured is useless? Not necessarily. The SMOTE (Synthetic Minority Oversampling Technique) function (Chawla et al., 2002) in R is a tool that helps combat class imbalance, which would otherwise bias a model’s results. Please note that in order to use the SMOTE function, you must install the DMwR package within R. In essence, the function artificially generates new data for the minority class using a nearest-neighbours technique, whilst under-sampling the majority class, leaving us with a balanced dataset.
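If you have not used DMwR before, it can be installed from CRAN and loaded in the usual way:
#One-off installation, then load the package each session
install.packages("DMwR")
library(DMwR)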
To demonstrate the capabilities of the SMOTE function, I will use a dataset containing the loan default history of a sample population. Please note that principal component analysis has already been performed on the dataset, so interpreting individual columns is difficult. Nonetheless, exploratory data analysis quickly shows that we have an imbalanced dataset.
The dataset
#Initialise libraries
library(tidyverse)
library(tabplot)
library(DMwR)
library(blogdown)
#Load data
raw <- read.csv('C:/Users/sean.pereira/Documents/JLL/University/Courses/DAM/Assignments/Assignment 3/AT3_CANVAS_UPDATED/AT3_credit_train_STUDENT.csv')
#1. DATA UNDERSTANDING
#Re-order columns: move the response (last column) to position 2
dat <- raw[ , c(1, ncol(raw), 3:6, 2, 7:(ncol(raw) - 1))]
#Inspecting the data
#Overview of variables' distribution
table(dat$default)
##
## N Y
## 17518 5583
#Visualise the variables (excluding the ID column)
tableplot(dat[, -1])
From the visual above, we can see that our dependent variable (default) is heavily skewed towards non-defaulters as opposed to defaulters. If we think about this, the consequences become obvious: we are trying to predict the likelihood that an individual will default based upon the supplied independent variables, yet the response variable we will use to train our models is heavily skewed one way, inevitably resulting in a biased model.
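To put a number on the skew, the class proportions confirm roughly a 76/24 split:
#Proportion of each class in the response variable
round(prop.table(table(dat$default)), 2)
##
##    N    Y
## 0.76 0.24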
Prior to partitioning the dataset into ‘train’ and ‘test’ portions to validate results, we use the SMOTE function to create a balanced dataset. Before diving back into our example, though, let’s understand the SMOTE function in more detail (remember that it lives in the DMwR package, which must be installed first).
The function exposes the following parameters, which you can control. Let’s go through each in a bit more detail:
form – a formula describing the prediction problem
data – a data frame containing the imbalanced dataset
perc.over – a number that controls how many extra cases of the minority class are generated (over-sampling)
k – the number of nearest neighbours used to generate the new minority-class examples
perc.under – a number that controls how many cases of the majority class are selected for each case generated from the minority class (under-sampling)
learner – an optional string naming a function that implements a classification algorithm to be applied to the SMOTEd dataset (defaults to NULL)
… – in case you specify a learner (the learner parameter), you can indicate further arguments to be used when calling this learner [Torgo, 2013].
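The interplay between perc.over and perc.under is easiest to see with some quick arithmetic. Under the documented defaults (perc.over = 200, k = 5, perc.under = 200), a hypothetical minority class of 100 cases works out as follows (class and df are placeholder names):
#perc.over = 200  -> 200/100 = 2 synthetic cases per minority case,
#                    giving 100 original + 200 synthetic = 300 minority cases
#perc.under = 200 -> 200/100 = 2 majority cases kept per synthetic case,
#                    giving 2 * 200 = 400 majority cases
balanced <- SMOTE(class ~ ., data = df, perc.over = 200, k = 5, perc.under = 200)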
Now that we understand the parameters within the SMOTE function, let’s jump back to our example. Upon some additional feature engineering, we set a seed to ensure reproducibility and assign the SMOTEd dataset to a new variable, newdat. After running the function, we review the data using the tableplot function to confirm that we now have a balanced dataset on which to perform our classification modelling.
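The feature-engineering steps are omitted here, but the call itself looks along the following lines. Note that the seed value is arbitrary, and I am assuming the default perc.over and perc.under of 200, which is consistent with the class counts printed below:
#Generate the balanced dataset (ID column dropped before modelling;
#perc.over/perc.under left at their defaults of 200)
set.seed(123)
newdat <- SMOTE(default ~ ., data = dat[, -1], perc.over = 200, perc.under = 200)
table(newdat$default)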
##
## N Y
## 22280 16710
The SMOTE function provides one way of reducing bias in machine learning problems when faced with imbalanced datasets. As always, many other methods exist and I encourage you to explore them further; examples include collecting more data about your population, changing your performance metric, resampling your dataset, or trying penalised methods.
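To give a flavour of the last option, here is a minimal sketch of a penalised (cost-sensitive) model using rpart; the cost of 3 is an arbitrary illustration, not a tuned value:
library(rpart)
#Loss matrix: rows are the true class (N, Y), columns the prediction.
#Misclassifying a true defaulter (Y) as N costs 3x the opposite error.
loss <- matrix(c(0, 3,
                 1, 0), nrow = 2)
fit <- rpart(default ~ ., data = dat[, -1], parms = list(loss = loss))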
I hope this blog post assists you in your modelling ventures and allows you to transition from ‘opinions’ to ‘data-backed’ decisions.
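Chawla, N. V., Bowyer, K. W., Hall, L. O. and Kegelmeyer, W. P. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, pp. 321-357.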
Machine Learning Mastery. 2018. 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset. [ONLINE] Available at: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/. [Accessed 15 August 2018].
Torgo, L. 2013. Using SMOTE to handle unbalance data. RPubs. [ONLINE] Available at: https://rpubs.com/abhaypadda/smote-for-imbalanced-data. [Accessed 17 August 2018].
SMOTE function | R Documentation. 2018. [ONLINE] Available at: https://www.rdocumentation.org/packages/DMwR/versions/0.4.1/topics/SMOTE. [Accessed 12 August 2018].