Class Imbalance
Data Imbalance
Data imbalance refers to a situation where one class in a classification problem has significantly fewer examples than the other class(es). This can lead to inaccurate results, as the model may be biased towards the majority class and may not perform well on the minority class. Some common examples of imbalanced datasets include credit card fraud detection, medical diagnosis, and anomaly detection.
Effects on metrics
ROC curve:
(Plot not reproduced.) The takeaway: with imbalanced data the classifier tends to sit at low TPR and low FPR, because positive samples are underrepresented in the dataset. Undersampling the majority class raises both TPR and FPR, moving the operating point towards the upper-right of the ROC curve.
Several articles argue that ROC-AUC is a misleading metric for imbalanced data and that precision-recall curves (AUPRC) should be used instead; see the references below.
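A quick illustration of the difference, assuming scikit-learn is available; the synthetic dataset, class ratio, and logistic-regression model are illustrative choices, not from these notes.

```python
# Compare ROC-AUC with average precision (a PR-curve summary) on
# synthetic, heavily imbalanced data. All choices here are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01],
                           random_state=0)          # ~1% positive class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("ROC-AUC          :", roc_auc_score(y_te, scores))   # can look deceptively high
print("Average precision:", average_precision_score(y_te, scores))
```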
Countering imbalance
Several techniques can be used to handle imbalanced data, including the following (a short code sketch follows the list):
Undersampling: This involves reducing the number of examples in the majority class to balance the dataset. However, this discards information, so it is not always the best approach.
Oversampling: This involves increasing the number of examples in the minority class to balance the dataset. This can be done by duplicating existing examples or generating new synthetic examples. However, this can result in overfitting and may not be effective for all datasets.
Cost-sensitive learning: This involves assigning different misclassification costs to different classes to account for the imbalance. This can be done by adjusting the threshold for classification or using different evaluation metrics.
Ensemble methods: This involves combining multiple models to improve performance on the minority class. This can be done by using techniques such as bagging, boosting, or stacking.
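A minimal sketch of the first three options, assuming the imbalanced-learn (imblearn) package is installed alongside scikit-learn; the synthetic dataset and the logistic-regression model are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)

# Undersampling: drop majority examples until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Oversampling: duplicate minority examples until the classes are balanced.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)

# Cost-sensitive learning (simplest form): reweight errors on the minority
# class instead of touching the data at all.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```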
SMOTE (Synthetic Minority Over-sampling Technique)
Vanilla SMOTE
Creates synthetic data points based on original data points.
Algorithm
Select a minority class instance at random.
Find its k nearest neighbours within the minority class.
Create synthetic instances by interpolating between the original instance and each of its k nearest neighbors.
Pretty simple; it assumes that minority examples cluster together in the representation space and that the space is continuous, so interpolated points are still plausible minority examples. A minimal sketch is below.
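A toy NumPy sketch of the interpolation step, assuming `X_min` contains only minority-class rows; the function and parameter names are mine, not a library API.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Toy SMOTE: interpolate between minority points and their neighbours."""
    rng = np.random.default_rng(seed)
    # k+1 because each point is returned as its own nearest neighbour.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))        # 1. random minority instance
        j = rng.choice(idx[i, 1:])          # 2. one of its k minority neighbours
        lam = rng.random()                  # 3. interpolate between the two
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# e.g. X_new = smote_sketch(np.random.rand(50, 2), n_synthetic=100)
```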
Borderline SMOTE
Borderline SMOTE only generates synthetic samples near the decision boundary between the minority and majority classes, which can improve the quality of the synthetic samples.
Algorithm
Identify the borderline minority examples: for each minority example, find its m nearest neighbours in the whole dataset and mark the example as "in danger" if more than half, but not all, of those neighbours belong to the majority class.
For each borderline example, select k nearest neighbours from the minority class.
Generate synthetic examples by interpolating between each borderline example and its k minority-class neighbours, exactly as in vanilla SMOTE (a library-based sketch follows).
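In practice this is rarely hand-rolled; a sketch using imbalanced-learn's `BorderlineSMOTE` (assuming imblearn is installed; the dataset is illustrative):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# "borderline-1" interpolates only towards minority neighbours;
# "borderline-2" also allows interpolation towards majority neighbours.
X_res, y_res = BorderlineSMOTE(kind="borderline-1", random_state=0).fit_resample(X, y)
print("after :", Counter(y_res))
```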
Adaptive Synthetic Sampling (AdaSyn)
The idea behind AdaSyn is to assign higher weights to the minority class examples that are "harder to learn" (i.e., surrounded by majority-class examples), and to generate more synthetic examples for those.
For each minority example, it finds the k nearest neighbours (from among all classes) and computes a density distribution from them.
Algorithm:
For each \(x_i\) in the minority class, compute the density \(r_i\) = proportion of majority-class examples among its k nearest neighbours. Normalise across all minority points to get \(\hat{r}_i\) (so that \(\sum_i \hat{r}_i = 1\)).
For each example, compute the number of synthetic data points to be derived from it:
\[ g_i = \hat{r}_i \, G \]
where \(G\) is the total number of synthetic examples to generate (in the ADASYN paper, \(G = \beta \, (N_{maj} - N_{min})\) with \(\beta \in [0, 1]\)).
For each minority class data example, generate \(g_i\) synthetic data examples according to the following steps:
- Randomly choose one minority data example, \(x_{zi}\), from the K nearest neighbors for \(x_i\).
- Generate the synthetic data example:
\[ s_i = x_i + (x_{zi} - x_i) \times \lambda \]
where \((x_{zi} - x_i)\) is the difference vector in the n-dimensional feature space and \(\lambda \in [0, 1]\) is a random number. A toy implementation of the full procedure is sketched below.
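A toy end-to-end sketch of the above for a binary problem; the function and parameter names are mine, and edge cases (e.g. all \(r_i = 0\)) are ignored. imbalanced-learn also ships an `ADASYN` class if you would rather not hand-roll this.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_sketch(X, y, minority=1, k=5, beta=1.0, seed=0):
    """Toy ADASYN: more synthetic points around 'hard' minority examples."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    G = int(beta * ((y != minority).sum() - (y == minority).sum()))  # total synthetics

    # r_i: fraction of majority-class points among the k neighbours of x_i
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X_min)
    r = np.array([(y[idx[i, 1:]] != minority).mean() for i in range(len(X_min))])
    r_hat = r / r.sum()                          # normalise so it sums to 1
    g = np.rint(r_hat * G).astype(int)           # g_i = r_hat_i * G, rounded

    # interpolate towards random minority neighbours, exactly as in SMOTE
    _, idx_min = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    synthetic = []
    for i, g_i in enumerate(g):
        for _ in range(g_i):
            z = rng.choice(idx_min[i, 1:])
            lam = rng.random()
            synthetic.append(X_min[i] + lam * (X_min[z] - X_min[i]))
    return np.vstack(synthetic) if synthetic else np.empty((0, X.shape[1]))
```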
Query: How do these synthetic generation algorithms deal with outliers? Wouldn't they amplify them?
ENN: Edited Nearest Neighbours (Wilson's) method
A kind of intelligent undersampling. The guess: a point whose class disagrees with most of its neighbours (e.g. a majority point deep in minority territory, or a minority point surrounded by majority ones) is probably noise and not all that informative. Removing such points helps delineate a clear boundary between the two classes.
Algorithm
Given the dataset with N observations, choose K, the number of nearest neighbours (K = 3 by default).
For each observation, find its K nearest neighbours among the other observations in the dataset and take the majority class among those neighbours.
If the observation's class differs from the majority class of its K nearest neighbours, the observation and its K nearest neighbours are deleted from the dataset.
Repeat steps 2 and 3 until the desired proportion of each class is reached.
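A sketch with imbalanced-learn's `EditedNearestNeighbours` (assuming imblearn is installed; the dataset is illustrative). As far as I can tell, this implementation removes only the observation whose neighbours disagree with it, not the neighbours as well.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# n_neighbors=3 matches the K=3 default above; sampling_strategy="majority"
# restricts the cleaning to majority-class observations.
enn = EditedNearestNeighbours(n_neighbors=3, sampling_strategy="majority")
X_res, y_res = enn.fit_resample(X, y)
print("after :", Counter(y_res))
```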
Smote+ENN
A combination of intelligent over- and under-sampling.
SMOTE, followed by ENN
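Again assuming imblearn, the combined resampler is a one-liner:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```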
Cost Sensitive Learning
\(c_{i,j}(x)\) = cost of classifying x as i when it's actually j.
Make a cost matrix clarifying the cost/reward of wrong/correct predictions.
Usually \(c(\text{majority}, \text{minority}) > c(\text{minority}, \text{majority})\), i.e. predicting the majority class for a true minority example (a missed fraud case or diagnosis) costs more than the reverse.
The expected cost of predicting class \(i\) for \(x\) is the probability-weighted sum of the costs of predicting \(i\) when the true class is \(j\), over all classes \(j\). We apply this expected-cost function on top of the existing classifier's predicted probabilities, instead of using the probabilities directly, to make predictions:
\[ Cost(i | x) = \sum_j c_{i,j}(x) Pr(C_j | x) \]
We predict the class with the minimum expected cost (see the sketch below).
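A sketch of this decision rule on top of an sklearn-style classifier; the cost values in the matrix, the dataset, and the model are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)            # columns follow clf.classes_ = [0, 1]

# C[i, j] = cost of predicting class i when the true class is j.
# Illustrative values: missing a positive (predict 0, truth 1) costs 10x.
C = np.array([[0.0, 10.0],
              [1.0,  0.0]])

expected_cost = proba @ C.T             # Cost(i | x) = sum_j c_{i,j} * Pr(C_j | x)
y_pred = expected_cost.argmin(axis=1)   # predict the minimum-expected-cost class
```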
References
Unbalanced Data? Stop Using ROC-AUC and Use AUPRC Instead
Imbalanced data & why you should NOT use ROC curve (Kaggle)
ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning