Introduction

Machine learning is making proper inroads into solving medical problems. We need only look at deep learning for medical imaging to see what progress has been made.

Data probably stands as the biggest hurdle to advancing ML in healthcare. Patient confidentiality must be protected, and small sample sizes are another obstacle. Even when these problems are overcome, datasets often suffer from class imbalance.

Class imbalance is a disproportionate representation of one of the target classes. The .csv data file used here is populated with simulated values for seven feature variables and a heavily imbalanced target variable. Such imbalance can create a major headache when splitting the data and training a model.

Fortunately, there are several methods for correcting this issue. One is the synthetic sampling approach. The ROSE (Random Over-Sampling Examples) package creates synthetic data to increase the size of the minority class, and it also provides functions that address class imbalance by resampling. One such function, ovun.sample(), is shown below.

The data

The ImbalancedData.csv file is available on request.

library(tibble) # provides as_tibble()
library(ROSE)   # provides ovun.sample(), used further down
df <- as_tibble(read.csv("ImbalancedData.csv"))
df
## # A tibble: 5,133 x 8
##    Feature1 Feature2 Feature3 Feature4 Feature5 Feature6 Feature7 Target
##       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>  <int>
##  1   0.134   0.110     0.426    0.875    0.820    0.926    0.467       0
##  2   0.861   0.727     0.591    0.343    0.596    0.880    0.952       0
##  3   0.0805  0.485     0.420    0.0193   0.605    0.276    0.494       0
##  4   0.916   0.0659    0.921    0.818    0.0492   0.337    0.688       0
##  5   0.199   0.631     0.191    0.748    0.997    0.0445   0.213       0
##  6   0.905   0.559     0.599    0.428    0.318    0.388    0.865       0
##  7   0.345   0.520     0.840    0.431    0.722    0.919    0.614       0
##  8   0.318   0.487     0.355    0.801    0.462    0.297    0.1000      0
##  9   0.566   0.473     0.691    0.0449   0.426    0.740    0.308       0
## 10   0.911   0.00295   0.0739   0.188    0.851    0.597    0.0672      0
## # ... with 5,123 more rows

Note the seven feature variables, named Feature1 through Feature7, and the target class named Target. The table() function shows how many rows fall into each class of the target variable.

table(df$Target)
## 
##    0    1 
## 4056 1077

The 0 class outnumbers the 1 class by a ratio of almost \(4:1\).
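
The same imbalance can be expressed as proportions. The following is a minimal base R sketch that uses the counts printed above; with the data loaded, table(df$Target) could be passed to prop.table() directly. The names counts, props and ratio are illustrative only.

```r
# Class counts copied from the table() output above
counts <- c("0" = 4056, "1" = 1077)

# Share of each class in the whole dataset
props <- prop.table(counts)
round(props, 3)

# Majority-to-minority ratio
ratio <- unname(counts["0"] / counts["1"])
round(ratio, 2)
```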

The ovun.sample() function

This function supports three values for its method argument: "over", "under" and "both". The problem at hand can be solved with method = "over", which over-samples the minority class.

balanced.df <- ovun.sample(Target ~ .,
                           data = df,
                           seed = 123,
                           method = "over")$data # Don't forget the $data at the end

The first argument is the model formula: Target is the response, and the dot stands for all the remaining columns. With method = "over", the function adds rows by resampling observations of the minority class with replacement until the two classes are roughly balanced. The seed argument allows for reproducibility.
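
To make the idea of random over-sampling concrete, here is a rough base R sketch on a small simulated data frame. It is not the ROSE implementation, and the names toy, minority, n.extra, extra and toy.over are hypothetical. It evens out the classes exactly, whereas the counts returned by ovun.sample() need only be approximately balanced.

```r
set.seed(123)  # reproducibility, analogous to the seed argument

# Hypothetical toy data: 90 majority rows (0) and 10 minority rows (1)
toy <- data.frame(x = runif(100),
                  Target = rep(c(0, 1), times = c(90, 10)))

minority <- which(toy$Target == 1)
n.extra  <- sum(toy$Target == 0) - length(minority)  # rows needed to even out the classes

# Draw minority row indices with replacement and append the duplicated rows
extra <- minority[sample.int(length(minority), n.extra, replace = TRUE)]
toy.over <- rbind(toy, toy[extra, ])

table(toy.over$Target)
```

Because the added rows are copies of existing minority observations, this kind of over-sampling adds no new information; it only changes the class weights a model sees.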

Calling table() on balanced.df now shows a much better balance, with the minority class over-sampled so that it roughly matches the majority class in count.

table(balanced.df$Target)
## 
##    0    1 
## 4056 4010

The balanced dataset, balanced.df, is now ready for model training.
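
For comparison, the other two method values can be tried in the same way. The sketch below assumes the ROSE package is installed and uses a small simulated data frame (toy, with hypothetical columns x1 and x2) rather than ImbalancedData.csv. The resulting counts depend on the seed.

```r
library(ROSE)

# Hypothetical toy data: 160 majority rows (0) and 40 minority rows (1)
set.seed(1)
toy <- data.frame(x1 = runif(200),
                  x2 = runif(200),
                  Target = rep(c(0, 1), times = c(160, 40)))

# "under": sample fewer majority rows so the classes are roughly balanced
under.df <- ovun.sample(Target ~ ., data = toy,
                        seed = 123, method = "under")$data

# "both": over-sample the minority and under-sample the majority;
# N sets the number of rows in the returned data set
both.df <- ovun.sample(Target ~ ., data = toy,
                       seed = 123, method = "both",
                       N = nrow(toy))$data

table(under.df$Target)
table(both.df$Target)
```

Under-sampling discards majority rows, so it may suit large datasets better than small ones, while "both" keeps the overall row count fixed at N.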