Illustration of an imbalanced outcome variable.
Imbalanced data typically refers to a classification problem where the classes of the response variable (or what you’re trying to predict) are not equally represented. In the case of a binary (two-class) classification problem, for example, there may be 100 rows of data where 90 instances indicate one outcome and 10 instances indicate the other. Put differently, a study may include 100 HIV patients of whom 90 take ART and 10 do not. If ART use were your response variable, this would be an example of an imbalanced dataset.
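As a minimal sketch, the 90/10 split above can be represented and inspected like this (the variable names are purely illustrative):

```python
import pandas as pd

# Illustrative 100-patient dataset: 90 on ART, 10 not.
patients = pd.DataFrame({
    "on_art": [1] * 90 + [0] * 10
})

print(patients["on_art"].value_counts())
# 1    90
# 0    10
```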
From a modeling perspective, an imbalanced dataset presents several challenges when developing a predictive model. First, an imbalanced dataset tends to produce a model with artificially inflated performance metrics. Take the HIV example from above. If a model consistently predicted that an HIV patient IS taking ART, it would have an accuracy of 90%! Fantastic, right? Unfortunately, no: the 90% accuracy has more to do with the imbalance in the dataset and less to do with statistical prowess. A model that always predicts the same outcome, even if it is 90% accurate, provides little to no utility in practice. This phenomenon is aptly named the accuracy paradox. Another challenge is that the resulting model may be biased towards the more heavily represented class, treating the minority class as noise and misclassifying it at high rates.
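The accuracy paradox is easy to demonstrate. In this sketch, a "model" that always predicts the majority class scores 90% accuracy on the 90/10 data above, yet never identifies a single patient who is off ART:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([1] * 90 + [0] * 10)  # true ART status, 90/10 split
y_pred = np.ones_like(y_true)           # always predict "on ART"

print(accuracy_score(y_true, y_pred))           # 0.9
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- no better than chance
```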
There are a number of techniques, ranging from simple to complex, that can help address imbalanced data. What is the right technique for your dataset? It depends.
First, you can balance your dataset by randomly dropping rows of data from the majority class. This undersampling continues until the remaining sample has a balanced outcome measure. This method reduces the size of your dataset, which improves the run time of the algorithm. Unfortunately, it can also introduce bias into your model, as the resulting sample may no longer be representative of the actual population.
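A minimal sketch of random undersampling, using the same illustrative 90/10 data as before:

```python
import pandas as pd

patients = pd.DataFrame({"on_art": [1] * 90 + [0] * 10})

majority = patients[patients["on_art"] == 1]
minority = patients[patients["on_art"] == 0]

# Randomly keep only as many majority rows as there are minority rows.
balanced = pd.concat([majority.sample(n=len(minority), random_state=42),
                      minority])

print(balanced["on_art"].value_counts())
# 1    10
# 0    10
```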
The inverse of undersampling is oversampling, where you randomly replicate observations from the minority class. This technique preserves all of your data, and for that reason it often outperforms undersampling. The disadvantage is that the resulting model may overfit your data, since you are duplicating records.
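A corresponding sketch of random oversampling on the same illustrative data:

```python
import pandas as pd

patients = pd.DataFrame({"on_art": [1] * 90 + [0] * 10})

majority = patients[patients["on_art"] == 1]
minority = patients[patients["on_art"] == 0]

# Randomly replicate minority rows (sampling with replacement) until the
# classes are the same size.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["on_art"].value_counts())
# 1    90
# 0    90
```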
A slightly more refined method of oversampling is k-means clustering. In this technique, the data are first grouped into homogeneous clusters. Once the clusters are formed, the observations are oversampled so as to create an equal number of observations in each cluster. This allows you to combat not only imbalance between the outcome classes but also imbalance within them. However, like other oversampling techniques, it increases your chance of overfitting your data.
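One rough way to sketch this idea in Python (the feature names, cluster counts, and target size here are purely illustrative assumptions, not a definitive recipe): cluster each class separately with k-means, then resample every cluster with replacement to a common size, so both the classes and the clusters within them end up equally represented.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 100),
    "cd4_count": rng.normal(500, 150, 100),
    "on_art": [1] * 90 + [0] * 10,
})

target = 45  # illustrative per-cluster size (2 clusters x 45 = 90 per class)
pieces = []
for label, group in df.groupby("on_art"):
    # Cluster this class into homogeneous groups (2 clusters as an example).
    clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(
        group[["age", "cd4_count"]])
    for c in range(2):
        members = group[clusters == c]
        # Oversample each cluster (with replacement) to the common target size.
        pieces.append(members.sample(n=target, replace=True, random_state=42))

balanced = pd.concat(pieces)
print(balanced["on_art"].value_counts())
# 1    90
# 0    90
```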
A more elegant oversampling method, which helps you avoid overfitting your data, is SMOTE (the Synthetic Minority Oversampling Technique). In this technique, new observations for the minority class are artificially (or synthetically) generated from known minority-class observations, typically by interpolating between an existing observation and its nearest minority-class neighbors.
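A short sketch using the SMOTE implementation from the imbalanced-learn package (assumed to be installed; the feature values below are purely illustrative):

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
# Two illustrative numeric features and the 90/10 outcome from the example.
X = np.column_stack([rng.normal(40, 10, 100), rng.normal(500, 150, 100)])
y = np.array([1] * 90 + [0] * 10)

# SMOTE synthesizes new minority-class rows rather than duplicating old ones.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(np.bincount(y_res))  # [90 90]
```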