From this site: https://shiring.github.io/machine_learning/2017/03/31/webinar_code
The dataset I am using in these example analyses, is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository(http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29).
The first dataset looks at the predictor classes:
The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:
bc_data[bc_data == "?"] <- NA
# how many NAs are in the data
length(which(is.na(bc_data)))
[1] 16
# how many samples would we loose, if we removed them?
nrow(bc_data)
[1] 699
nrow(bc_data[is.na(bc_data), ])
[1] 16
Missing values are imputed with the mice package.
summary(bc_data$classes)
benign malignant
458 241
We can see that our data is unbalanced.
Most machine learning classification algorithms are sensitive to unbalance in the predictor classes. Let’s consider an even more extreme example than our breast cancer dataset: assume we had 10 malignant vs 90 benign samples. A machine learning model that has been trained and tested on such a dataset could now predict “benign” for all samples and still gain a very high accuracy. An unbalanced dataset will bias the prediction model towards the more common class!
The basic theoretical concepts behind over- and under-sampling are very simple:
With under-sampling, we randomly select a subset of samples from the class with more instances to match the number of samples coming from each class. In our example, we would randomly pick 241 out of the 458 benign cases. The main disadvantage of under-sampling is that we loose potentially relevant information from the left-out samples.
With oversampling, we randomly duplicate samples from the class with fewer instances or we generate additional instances based on the data that we have, so as to match the number of samples in each class. While we avoid loosing information with this approach, we also run the risk of overfitting our model as we are more likely to get the same samples in the training and in the test data, i.e. the test data is no longer independent from training data. This would lead to an overestimation of our model’s performance and generalizability.
In reality though, we should not simply perform over- or under-sampling on our training data and then run the model. We need to account for cross-validation and perform over- or under-sampling on each fold independently to get an honest estimate of model performance!
I randomly divide the data into training and test sets (stratified by class) and perform Random Forest modeling with 10 x 10 repeated cross-validation. Final model performance is then measured on the test set.
library(caret)
Loading required package: lattice
Loading required package: ggplot2
Luckily, caret makes it very easy to incorporate over- and under-sampling techniques with cross-validation resampling. We can simply add the sampling option to our trainControl and choose down for under- (also called down-) sampling. The rest stays the same as with our original model.
cm_under
Confusion Matrix and Statistics
Reference
Prediction benign malignant
benign 132 0
malignant 5 72
Accuracy : 0.9761
95% CI : (0.9451, 0.9922)
No Information Rate : 0.6555
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.9479
Mcnemar's Test P-Value : 0.07364
Sensitivity : 0.9635
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9351
Prevalence : 0.6555
Detection Rate : 0.6316
Detection Prevalence : 0.6316
Balanced Accuracy : 0.9818
'Positive' Class : benign
For over- (also called up-) sampling we simply specify sampling = "up".
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
verboseIter = FALSE,
sampling = "up")
set.seed(42)
model_rf_over <- caret::train(classes ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
trControl = ctrl)
Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:ggplot2’:
margin
final_over <- data.frame(actual = test_data$classes,
predict(model_rf_over, newdata = test_data, type = "prob"))
final_over$predict <- ifelse(final_over$benign > 0.5, "benign", "malignant")
cm_over <- confusionMatrix(final_over$predict, test_data$classes)
cm_over
Confusion Matrix and Statistics
Reference
Prediction benign malignant
benign 133 1
malignant 4 71
Accuracy : 0.9761
95% CI : (0.9451, 0.9922)
No Information Rate : 0.6555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9475
Mcnemar's Test P-Value : 0.3711
Sensitivity : 0.9708
Specificity : 0.9861
Pos Pred Value : 0.9925
Neg Pred Value : 0.9467
Prevalence : 0.6555
Detection Rate : 0.6364
Detection Prevalence : 0.6411
Balanced Accuracy : 0.9785
'Positive' Class : benign
Besides over- and under-sampling, there are hybrid methods that combine under-sampling with the generation of additional data. Two of the most popular are ROSE and SMOTE.
From Nicola Lunardon, Giovanna Menardi and Nicola Torelli’s “ROSE: A Package for Binary Imbalanced Learning” (R Journal, 2014, Vol. 6 Issue 1, p. 79): “The ROSE package provides functions to deal with binary classification problems in the presence of imbalanced classes. Artificial balanced samples are generated according to a smoothed bootstrap approach and allow for aiding both the phases of estimation and accuracy evaluation of a binary classifier in the presence of a rare class. Functions that implement more traditional remedies for the class imbalance and different metrics to evaluate accuracy are also provided. These are estimated by holdout, bootstrap, or cross-validation methods.”
You implement them the same way as before, this time choosing sampling = "rose"…
# install.packages('ROSE')
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
verboseIter = FALSE,
sampling = "rose")
set.seed(42)
model_rf_rose <- caret::train(classes ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
trControl = ctrl)
Loaded ROSE 0.0-3
final_rose <- data.frame(actual = test_data$classes,
predict(model_rf_rose, newdata = test_data, type = "prob"))
final_rose$predict <- ifelse(final_rose$benign > 0.5, "benign", "malignant")
cm_rose <- confusionMatrix(final_rose$predict, test_data$classes)
cm_rose
Confusion Matrix and Statistics
Reference
Prediction benign malignant
benign 132 2
malignant 5 70
Accuracy : 0.9665
95% CI : (0.9322, 0.9864)
No Information Rate : 0.6555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9266
Mcnemar's Test P-Value : 0.4497
Sensitivity : 0.9635
Specificity : 0.9722
Pos Pred Value : 0.9851
Neg Pred Value : 0.9333
Prevalence : 0.6555
Detection Rate : 0.6316
Detection Prevalence : 0.6411
Balanced Accuracy : 0.9679
'Positive' Class : benign
… or by choosing sampling = "smote" in the trainControl settings.
From Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer’s “SMOTE: Synthetic Minority Over-sampling Technique” (Journal of Artificial Intelligence Research, 2002, Vol. 16, pp. 321–357): “This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples.”
model_rf_smote <- caret::train(classes ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
trControl = ctrl)
final_smote <- data.frame(actual = test_data$classes,
predict(model_rf_smote, newdata = test_data, type = "prob"))
final_smote$predict <- ifelse(final_smote$benign > 0.5, "benign", "malignant")
cm_smote <- confusionMatrix(final_smote$predict, test_data$classes)
cm_smote
Confusion Matrix and Statistics
Reference
Prediction benign malignant
benign 133 2
malignant 4 70
Accuracy : 0.9713
95% CI : (0.9386, 0.9894)
No Information Rate : 0.6555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9369
Mcnemar's Test P-Value : 0.6831
Sensitivity : 0.9708
Specificity : 0.9722
Pos Pred Value : 0.9852
Neg Pred Value : 0.9459
Prevalence : 0.6555
Detection Rate : 0.6364
Detection Prevalence : 0.6459
Balanced Accuracy : 0.9715
'Positive' Class : benign
Now let’s compare the predictions of all these models:
library(dplyr)
Attaching package: ‘dplyr’
The following object is masked from ‘package:randomForest’:
combine
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
comparison <- data.frame(model = names(models),
Sensitivity = rep(NA, length(models)),
Specificity = rep(NA, length(models)),
Precision = rep(NA, length(models)),
Recall = rep(NA, length(models)),
F1 = rep(NA, length(models)))
for (name in names(models)) {
model <- get(paste0("cm_", name))
comparison[comparison$model == name, ] <- filter(comparison, model == name) %>%
mutate(Sensitivity = model$byClass["Sensitivity"],
Specificity = model$byClass["Specificity"],
Precision = model$byClass["Precision"],
Recall = model$byClass["Recall"],
F1 = model$byClass["F1"])
}
library(tidyr)
Attaching package: ‘tidyr’
The following object is masked from ‘package:mice’:
complete
comparison %>%
gather(x, y, Sensitivity:F1) %>%
ggplot(aes(x = x, y = y, color = model)) +
geom_jitter(width = 0.2, alpha = 0.5, size = 3)
With this small dataset, we can already see how the different techniques can influence model performance. Sensitivity (or recall) describes the proportion of benign cases that have been predicted correctly, while specificity describes the proportion of malignant cases that have been predicted correctly. Precision describes the true positives, i.e. the proportion of benign predictions that were actual from benign samples. F1 is the weighted average of precision and sensitivity/ recall.
Here, all four methods improved specificity and precision compared to the original model. Under-sampling, over-sampling and ROSE additionally improved precision and the F1 score.