Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was initiated in 2016 and aims to:
Encourage the use of Machine Learning techniques in medical research in Vietnam, and
Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.
Introduction
Class imbalance is a common problem in Machine Learning practice that can seriously affect both an algorithm's behaviour and model selection. It refers to situations in which the classes are not equally distributed within the dataset. The imbalance problem can arise in both binary and multiclass classification tasks.
Imagine that we want to build a machine learning model for detecting a rare disease or a rare clinical event. Because the target outcome is rare, there are only 100 or fewer positive cases in a dataset of 1000 cases. After data splitting, we keep a subset of 10% for validation. This test subset contains 10 patients and 90 healthy persons. Model training proposes two models, A and B. Model A misclassifies 6/10 patients as negative (false negatives) and 10/90 healthy subjects as positive (false positives), while model B misclassifies only 2/10 patients as healthy and 30/90 healthy persons as having the disease. Based on absolute accuracy and/or absolute error rate, the computer would consider model A better than model B, since it made only 16 mistakes versus 32 mistakes by model B. However, we should choose model B instead, because a false negative is more dangerous than a false positive (the short sketch below makes this concrete). The problem is that we might never get the chance for such reasoning, because model B was already eliminated by the computer during the training process. Some algorithms only work well on balanced data, and in multiclass classification problems the training might even be interrupted due to class imbalance.
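A minimal R sketch of the arithmetic above (using the hypothetical counts of this example, not real data) shows how overall accuracy favours model A while sensitivity favours model B:
# Hypothetical test subset: 10 patients (positive) and 90 healthy persons (negative)
# Model A: 6 false negatives, 10 false positives; Model B: 2 false negatives, 30 false positives
perf <- function(fn, fp, n_pos = 10, n_neg = 90){
  tp <- n_pos - fn
  tn <- n_neg - fp
  c(accuracy = (tp + tn)/(n_pos + n_neg),
    sensitivity = tp/n_pos,   # proportion of patients correctly detected
    specificity = tn/n_neg)
}
perf(fn = 6, fp = 10)  # Model A: accuracy 0.84 but sensitivity only 0.40
perf(fn = 2, fp = 30)  # Model B: accuracy 0.68 but sensitivity 0.80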
A common tactic for reducing the harmful influence of class imbalance is to take control of the resampling process. The control might consist of oversampling (compensating the minority class by data replication) or undersampling (removing instances from the majority class). Neither is ideal: undersampling may discard informative cases, while oversampling may lead to overfitting. We can also combine oversampling and undersampling, trading off their positive and negative effects; the sketch below illustrates both approaches on a hypothetical data frame.
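As an illustration only, here is a minimal base R sketch of both approaches on a hypothetical data frame train with a binary factor outcome (levels "positive" for the minority class and "negative" for the majority class):
set.seed(123)
minority <- train[train$outcome == "positive", ]   # hypothetical minority class
majority <- train[train$outcome == "negative", ]   # hypothetical majority class
# Undersampling: keep only as many majority rows as there are minority rows
undersampled <- rbind(minority, majority[sample(nrow(majority), nrow(minority)), ])
# Oversampling: replicate minority rows (with replacement) up to the majority size
oversampled <- rbind(majority, minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])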
The h2o package provides a simple and flexible solution for dealing with data imbalance, based on resampling control, via two hyper-parameters set during model training: balance_classes and fold_assignment. These parameters only take effect for Random Forest, Deep Learning and GBM, the learners that involve random splitting. They are set within the model training call and do not physically modify the data frame.
When the balance_classes parameter is set to TRUE, a hybrid sampling control is activated: the training data are oversampled or undersampled according to the proportion of classes within the original training frame, so that rows in the minority classes are effectively given more weight than rows in the majority classes. The user can also take explicit control of this behaviour by supplying weight values through the class_sampling_factors parameter (recommended).
The fold_assignment parameter defines how the training data are split up during cross-validation. Its value can be AUTO (which amounts to random splitting), Random, Modulo or Stratified. When set to Stratified, the program tries to balance the classes equally across the folds, which can significantly improve the model's performance on imbalanced data.
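For illustration, a minimal sketch of how these two hyper-parameters might be set in a training call (the frame train_hex, the outcome y and the predictor list x are hypothetical here; the full code of our experiment appears later):
rf_balanced <- h2o.randomForest(x = x, y = y,
                                training_frame = train_hex,
                                nfolds = 10,
                                fold_assignment = "Stratified",   # balance the classes across CV folds
                                balance_classes = TRUE,           # over/undersample the training data
                                class_sampling_factors = c(1, 5), # hypothetical per-class sampling weights
                                seed = 1234)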
Objective
The main objective of this case study (X8) is to evaluate the effect of different fold_assignment and balance_classes settings on the performance of the Random Forest algorithm applied to a binary classification problem. Our study uses the Thoracic Surgery dataset (Wroclaw Medical University, Poland). This dataset was collected retrospectively at the Wroclaw Thoracic Surgery Centre for patients who underwent major lung resections for primary lung cancer between 2007 and 2011. Our classification task aims to provide a one-year survival prognosis for these patients, based on their preoperative physical and functional characteristics.
Materials and methods
First, we prepare the ggplot theme for our experiment:
library(tidyverse)
my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 12),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#ffffef"),
      strip.background = element_rect(fill = "#ffbb00", color = "black", size = 0.5),
      strip.text = element_text(face = "bold", size = 10, color = "black"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())
mycolors=c("#f32440","#ffd700","#ff8c00","#c9e101","#c100e6","#39d3d6","#e84412")
Now we load the dataset from the UCI website and perform a descriptive analysis:
require(foreign)
df=read.arff("https://archive.ics.uci.edu/ml/machine-learning-databases/00277/ThoraricSurgery.arff")%>%as_tibble()
names(df)=c("Diagnosis","FVC","FEV1","Zubrod","Pain","Haemoptysis","Dyspnoea","Cough","Weakness","T_grade","DBtype2","MI","PAD","Smoking","Asthma","Age","Survival")
df$Survival=df$Survival%>%recode_factor(.,`F` = "Survived", `T` = "Dead")
df$Tiffneau=df$FEV1/df$FVC
Hmisc::describe(df)
## df
##
## 18 Variables 470 Observations
## ---------------------------------------------------------------------------
## Diagnosis
## n missing distinct
## 470 0 7
##
## Value DGN1 DGN2 DGN3 DGN4 DGN5 DGN6 DGN8
## Frequency 1 52 349 47 15 4 2
## Proportion 0.002 0.111 0.743 0.100 0.032 0.009 0.004
## ---------------------------------------------------------------------------
## FVC
## n missing distinct Info Mean Gmd .05 .10
## 470 0 134 1 3.282 0.9818 2.018 2.316
## .25 .50 .75 .90 .95
## 2.600 3.160 3.808 4.560 4.900
##
## lowest : 1.44 1.46 1.70 1.81 1.82, highest: 5.52 5.56 5.60 6.08 6.30
## ---------------------------------------------------------------------------
## FEV1
## n missing distinct Info Mean Gmd .05 .10
## 470 0 136 1 4.569 4.805 1.440 1.640
## .25 .50 .75 .90 .95
## 1.960 2.400 3.080 3.762 4.311
##
## Value 1 2 3 4 5 9 52 61 64 66
## Frequency 27 224 151 47 6 1 1 1 1 1
## Proportion 0.057 0.477 0.321 0.100 0.013 0.002 0.002 0.002 0.002 0.002
##
## Value 67 69 71 73 76 77 78 79 86
## Frequency 1 1 1 2 1 1 1 1 1
## Proportion 0.002 0.002 0.002 0.004 0.002 0.002 0.002 0.002 0.002
## ---------------------------------------------------------------------------
## Zubrod
## n missing distinct
## 470 0 3
##
## Value PRZ0 PRZ1 PRZ2
## Frequency 130 313 27
## Proportion 0.277 0.666 0.057
## ---------------------------------------------------------------------------
## Pain
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 439 31
## Proportion 0.934 0.066
## ---------------------------------------------------------------------------
## Haemoptysis
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 402 68
## Proportion 0.855 0.145
## ---------------------------------------------------------------------------
## Dyspnoea
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 439 31
## Proportion 0.934 0.066
## ---------------------------------------------------------------------------
## Cough
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 147 323
## Proportion 0.313 0.687
## ---------------------------------------------------------------------------
## Weakness
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 392 78
## Proportion 0.834 0.166
## ---------------------------------------------------------------------------
## T_grade
## n missing distinct
## 470 0 4
##
## Value OC11 OC12 OC13 OC14
## Frequency 177 257 19 17
## Proportion 0.377 0.547 0.040 0.036
## ---------------------------------------------------------------------------
## DBtype2
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 435 35
## Proportion 0.926 0.074
## ---------------------------------------------------------------------------
## MI
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 468 2
## Proportion 0.996 0.004
## ---------------------------------------------------------------------------
## PAD
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 462 8
## Proportion 0.983 0.017
## ---------------------------------------------------------------------------
## Smoking
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 84 386
## Proportion 0.179 0.821
## ---------------------------------------------------------------------------
## Asthma
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 468 2
## Proportion 0.996 0.004
## ---------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd .05 .10
## 470 0 45 0.998 62.53 9.737 49.45 52.00
## .25 .50 .75 .90 .95
## 57.00 62.00 69.00 74.00 77.00
##
## lowest : 21 37 38 39 40, highest: 78 79 80 81 87
## ---------------------------------------------------------------------------
## Survival
## n missing distinct
## 470 0 2
##
## Value Survived Dead
## Frequency 400 70
## Proportion 0.851 0.149
## ---------------------------------------------------------------------------
## Tiffneau
## n missing distinct Info Mean Gmd .05 .10
## 470 0 406 1 1.47 1.484 0.5806 0.6375
## .25 .50 .75 .90 .95
## 0.7134 0.7738 0.8307 0.8909 0.9859
##
## Value 0.5 1.0 3.0 16.0 17.5 21.0 23.5 24.0 25.0 25.5
## Frequency 181 274 1 3 1 2 1 1 1 1
## Proportion 0.385 0.583 0.002 0.006 0.002 0.004 0.002 0.002 0.002 0.002
##
## Value 26.0 28.5 33.0 47.5
## Frequency 1 1 1 1
## Proportion 0.002 0.002 0.002 0.002
## ---------------------------------------------------------------------------
The predictors consist of 16 original variables (plus the derived Tiffneau index):
Diagnosis: ICD-10 codes for primary and secondary as well as multiple tumours, if any
Spirometric values: FEV1, FVC and the Tiffneau index (FEV1/FVC)
Zubrod: performance status on the Zubrod scale
Pain before surgery (T, F)
Haemoptysis before surgery (T, F)
Dyspnoea before surgery (T, F)
Cough before surgery (T, F)
Weakness before surgery (T, F)
T_grade: T in clinical TNM, the size of the original tumour, from OC11 (smallest) to OC14 (largest)
DBtype2: type 2 diabetes mellitus (T, F)
MI: myocardial infarction up to 6 months before surgery (T, F)
PAD: peripheral arterial disease (T, F)
Smoking (T, F)
Asthma (T, F)
Age at surgery (numeric)
The target outcome is highly imbalanced, with 400 survivors versus 70 deaths (a ratio of roughly 6:1).
Data visualisation
library(gridExtra)
a1=df%>%ggplot(aes(x=Survival,fill=Diagnosis))+geom_bar(position="fill",color="black",alpha=0.8,show.legend = T)+scale_fill_manual(values=mycolors)+coord_flip()+ggtitle("Diagnosis")
a2=df%>%ggplot(aes(x=Diagnosis,y=..count..,fill=Diagnosis))+geom_bar(color="black",alpha=0.8,show.legend =F)+scale_fill_manual(values=mycolors)+coord_flip()+facet_grid(Survival~.)
grid.arrange(a1,a2,ncol=1)
b1=df%>%ggplot(aes(x=Survival,fill=T_grade))+geom_bar(position="fill",color="black",alpha=0.8,show.legend = T)+scale_fill_manual(values=mycolors)+coord_flip()+ggtitle("T_grade")
b2=df%>%ggplot(aes(x=T_grade,y=..count..,fill=T_grade))+geom_bar(color="black",alpha=0.8,show.legend =F)+scale_fill_manual(values=mycolors)+coord_flip()+facet_grid(Survival~.)
grid.arrange(b1,b2,ncol=1)
df%>%gather(Pain:Weakness,DBtype2:Asthma,key="Features",value="Value")%>%ggplot(aes(x=Survival,y=..count..,fill=Value))+geom_bar(alpha=0.8,color="black")+facet_wrap(~Features,ncol=5)+scale_fill_manual(values=mycolors)
df%>%gather(Age,FEV1,FVC,Tiffneau,key="Features",value="Value")%>%ggplot(aes(x=Survival,y=Value,fill=Survival))+geom_boxplot(alpha=0.8,color="black")+coord_flip()+facet_wrap(~Features,ncol=1,scales="free")+scale_fill_manual(values=mycolors)
df%>%gather(Age,FEV1,FVC,Tiffneau,key="Features",value="Value")%>%ggplot(aes(x=Value,fill=Survival))+geom_density(alpha=0.6,color="black")+facet_wrap(~Features,ncol=2,scales="free")+scale_fill_manual(values=mycolors)
Machine learning experiment
The first step consists of initialising the h2o package in R. The caret package will be used for data splitting, as its createDataPartition function performs stratified random sampling on the outcome, keeping the class proportions nearly identical across the training and testing subsets. By keeping the same proportion of the target outcome, we hope that the imbalanced data will not affect the validation of the trained model.
library(h2o)
h2o.init(nthreads = -1,max_mem_size ="4g")
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 3 hours 11 minutes
## H2O cluster version: 3.10.3.6
## H2O cluster version age: 2 months and 4 days
## H2O cluster name: H2O_started_from_R_Admin_bbl792
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.15 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.3.1 (2016-06-21)
library(caret)
set.seed(123)
idTrain=caret::createDataPartition(y=df$Survival,p=369/470,list=FALSE)
trainset=df[idTrain,]
testset=df[-idTrain,]
sp1=df%>%ggplot(aes(x=Survival,fill=Survival))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f32440","#ffd700"))+coord_flip()+ggtitle("Origin")
sp2=trainset%>%ggplot(aes(x=Survival,fill=Survival))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f32440","#ffd700"))+coord_flip()+ggtitle("Train")
sp3=testset%>%ggplot(aes(x=Survival,fill=Survival))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f32440","#ffd700"))+coord_flip()+ggtitle("Test")
grid.arrange(sp1,sp2,sp3,ncol=1)
wtrain=as.h2o(trainset)
wtest=as.h2o(testset)
response="Survival"
features=setdiff(colnames(wtrain),response)
Our experiment consists of evaluating the effect of the 4 possible combinations of the fold_assignment and balance_classes settings:
RF model with random fold assignment, without class balancing
RF model with random fold assignment, with class balancing
RF model with stratified fold assignment, without class balancing
RF model with stratified fold assignment, with class balancing
#RF learner
#Balanced + stratified
rfmod1=h2o.randomForest(x = features,
y = response,
training_frame = wtrain,nfolds=10,
fold_assignment = "Stratified",
balance_classes = TRUE,class_sampling_factors=c(1.17,6.66),
ntrees = 100, max_depth = 50,
stopping_metric = "logloss",
stopping_tolerance = 0.01,
stopping_rounds = 3,
keep_cross_validation_fold_assignment = TRUE,
keep_cross_validation_predictions=TRUE,
score_each_iteration = TRUE,
seed=12345)
#Balanced + Not stratified
rfmod2=h2o.randomForest(x = features,
y = response,
training_frame = wtrain,nfolds=10,
fold_assignment = "AUTO",
balance_classes = TRUE,class_sampling_factors=c(1.17,6.66),
ntrees = 100, max_depth = 50,
stopping_metric = "logloss",
stopping_tolerance = 0.01,
stopping_rounds = 3,
keep_cross_validation_fold_assignment = TRUE,
keep_cross_validation_predictions=TRUE,
score_each_iteration = TRUE,
seed=12345)
#Unbalanced + stratified
rfmod3=h2o.randomForest(x = features,
y = response,
training_frame = wtrain,nfolds=10,
fold_assignment = "Stratified",
balance_classes = FALSE,
ntrees = 100, max_depth = 50,
stopping_metric = "logloss",
stopping_tolerance = 0.01,
stopping_rounds = 3,
keep_cross_validation_fold_assignment = TRUE,
keep_cross_validation_predictions=TRUE,
score_each_iteration = TRUE,
seed=12345)
#Unbalanced + not stratified
rfmod0=h2o.randomForest(x = features,
y = response,
training_frame = wtrain,nfolds=10,
balance_classes = F,
ntrees = 100, max_depth = 50,
stopping_metric = "logloss",
stopping_tolerance = 0.01,
stopping_rounds = 3,
keep_cross_validation_fold_assignment = TRUE,
keep_cross_validation_predictions=TRUE,
score_each_iteration = TRUE,
seed=12345)
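A note on the class_sampling_factors values used above: the choice of c(1.17, 6.66) is not explained in the code, but these weights roughly upsample both classes to a comparable number of rows (400 x 1.17 ≈ 468 and 70 x 6.66 ≈ 466), i.e. they are close to inverse class-frequency weights. A quick check of that reasoning on the full data (an assumption about how the values may have been chosen):
counts <- table(df$Survival)     # Survived = 400, Dead = 70
round(sum(counts)/counts, 2)     # about 1.18 for Survived and 6.71 for Dead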
Confusion matrices and performance of the 4 models on the test subset
h2o.performance(rfmod0,wtest)
## H2OBinomialMetrics: drf
##
## MSE: 0.1305544
## RMSE: 0.3613231
## LogLoss: 1.031624
## Mean Per-Class Error: 0.4666667
## AUC: 0.6505882
## Gini: 0.3011765
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Dead Survived Error Rate
## Dead 1 14 0.933333 =14/15
## Survived 0 85 0.000000 =0/85
## Totals 1 99 0.140000 =14/100
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.511111 0.923913 60
## 2 max f2 0.511111 0.968109 60
## 3 max f0point5 0.511111 0.883576 60
## 4 max accuracy 0.511111 0.860000 60
## 5 max precision 0.912478 0.955556 19
## 6 max recall 0.511111 1.000000 60
## 7 max specificity 1.000000 0.866667 0
## 8 max absolute_mcc 0.853333 0.268387 30
## 9 max min_per_class_accuracy 0.862222 0.666667 28
## 10 max mean_per_class_accuracy 0.912478 0.686275 19
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(rfmod1,wtest)
## H2OBinomialMetrics: drf
##
## MSE: 0.182993
## RMSE: 0.4277768
## LogLoss: 1.240996
## Mean Per-Class Error: 0.4666667
## AUC: 0.5976471
## Gini: 0.1952941
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Dead Survived Error Rate
## Dead 1 14 0.933333 =14/15
## Survived 0 85 0.000000 =0/85
## Totals 1 99 0.140000 =14/100
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.080790 0.923913 50
## 2 max f2 0.080790 0.968109 50
## 3 max f0point5 0.462663 0.890736 38
## 4 max accuracy 0.080790 0.860000 50
## 5 max precision 0.584416 0.907895 36
## 6 max recall 0.080790 1.000000 50
## 7 max specificity 1.000000 0.866667 0
## 8 max absolute_mcc 0.584416 0.288526 36
## 9 max min_per_class_accuracy 0.659121 0.533333 22
## 10 max mean_per_class_accuracy 0.584416 0.672549 36
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(rfmod2,wtest)
## H2OBinomialMetrics: drf
##
## MSE: 0.182993
## RMSE: 0.4277768
## LogLoss: 1.240996
## Mean Per-Class Error: 0.4666667
## AUC: 0.5976471
## Gini: 0.1952941
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Dead Survived Error Rate
## Dead 1 14 0.933333 =14/15
## Survived 0 85 0.000000 =0/85
## Totals 1 99 0.140000 =14/100
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.080790 0.923913 50
## 2 max f2 0.080790 0.968109 50
## 3 max f0point5 0.462663 0.890736 38
## 4 max accuracy 0.080790 0.860000 50
## 5 max precision 0.584416 0.907895 36
## 6 max recall 0.080790 1.000000 50
## 7 max specificity 1.000000 0.866667 0
## 8 max absolute_mcc 0.584416 0.288526 36
## 9 max min_per_class_accuracy 0.659121 0.533333 22
## 10 max mean_per_class_accuracy 0.584416 0.672549 36
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(rfmod3,wtest)
## H2OBinomialMetrics: drf
##
## MSE: 0.1329517
## RMSE: 0.3646254
## LogLoss: 1.031372
## Mean Per-Class Error: 0.5
## AUC: 0.6639216
## Gini: 0.3278431
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Dead Survived Error Rate
## Dead 0 15 1.000000 =15/15
## Survived 0 85 0.000000 =0/85
## Totals 0 100 0.150000 =15/100
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.450000 0.918919 52
## 2 max f2 0.450000 0.965909 52
## 3 max f0point5 0.450000 0.876289 52
## 4 max accuracy 0.450000 0.850000 52
## 5 max precision 0.906667 0.961538 17
## 6 max recall 0.450000 1.000000 52
## 7 max specificity 1.000000 0.866667 0
## 8 max absolute_mcc 0.906667 0.325125 17
## 9 max min_per_class_accuracy 0.861111 0.670588 25
## 10 max mean_per_class_accuracy 0.906667 0.727451 17
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Suppose that “Survived” is the negative outcome and “Dead” the positive outcome (since we are trying to identify patients at higher risk of post-operative mortality). All 4 models seem to perform identically on the test subset: they provided correct survival prognoses but all failed to identify mortality risk. Each model made 14 to 15 mistakes, all of them false negatives. At this point we cannot tell whether our settings had any effect on these 4 models.
To answer this question, we resample the validation on replicated datasets by bootstrapping.
First, we call up the mlr package and train a dummy h2o random forest model. The real purpose of this dummy model is to clone exactly the 4 models trained in h2o; this trick was explained in the previous tutorials.
library(mlr)
taskTS=mlr::makeClassifTask(id="Thorac",data=df,target="Survival",positive = "Dead")
learnerH2ORF=makeLearner(id="h2oRF","classif.h2o.randomForest", predict.type = "prob")
mlrRF=train(learner = learnerH2ORF, task=taskTS)
Neither accuracy nor the absolute misclassification rate is the right metric when confronting imbalanced data; as mentioned above, they can be misleading. Other metrics are more appropriate, including the true/false positive and negative rates, balanced accuracy or balanced error rate, recall (sensitivity) and specificity, Cohen's Kappa coefficient and the F1 score. We will adopt some of these metrics for our benchmark study.
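For reference, a minimal sketch (assuming "Dead" is the positive class) of how some of these metrics are derived from a 2 x 2 confusion matrix; mlr will compute them for us below, this sketch only fixes the definitions:
imbalance_metrics <- function(tp, fn, fp, tn){
  tpr <- tp/(tp + fn)                    # true positive rate (recall on the "Dead" class)
  tnr <- tn/(tn + fp)                    # true negative rate (specificity)
  c(BAC = (tpr + tnr)/2,                 # balanced accuracy
    F1  = 2*tp/(2*tp + fp + fn),         # F1 score
    TPR = tpr, TNR = tnr,
    FPR = 1 - tnr, FNR = 1 - tpr)
}
# e.g. the rfmod0 test-set confusion matrix above: tp = 1, fn = 14, fp = 0, tn = 85
imbalance_metrics(tp = 1, fn = 14, fp = 0, tn = 85)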
bootmlrPERF=function(h2omodel,data,i){
  d=data[i,]
  # Predict with the dummy mlr model, then overwrite its predictions
  # with those of the h2o model we actually want to evaluate
  predmlr=predict(mlrRF,newdata=d)
  predh2o=predict(get(h2omodel),as.h2o(d))
  predmlr$data$response<-as.vector(predh2o$predict)
  predmlr$data$prob.Dead<-as.vector(predh2o$Dead)
  predmlr$data$prob.Survived<-as.vector(predh2o$Survived)
  mets=list(bac,f1,tpr,tnr,fpr,fnr)
  p=mlr::performance(predmlr,mets)
  BAC=p[[1]]
  F1=p[[2]]
  TPR=p[[3]]
  TNR=p[[4]]
  FPR=p[[5]]
  FNR=p[[6]]
  return(cbind(BAC,F1,TPR,TNR,FPR,FNR))
}
set.seed(123)
library(boot)
perfmod0=boot(statistic=bootmlrPERF,h2omodel="rfmod0",data=df,R=30)%>%.$t%>%as_tibble()%>%mutate(Mode="Unbalanced_Random",iter=as.numeric(rownames(.)))
perfmod1=boot(statistic=bootmlrPERF,h2omodel="rfmod1",data=df,R=30)%>%.$t%>%as_tibble()%>%mutate(Mode="Balanced_Stratified",iter=as.numeric(rownames(.)))
perfmod2=boot(statistic=bootmlrPERF,h2omodel="rfmod2",data=df,R=30)%>%.$t%>%as_tibble()%>%mutate(Mode="Balanced_Random",iter=as.numeric(rownames(.)))
perfmod3=boot(statistic=bootmlrPERF,h2omodel="rfmod3",data=df,R=30)%>%.$t%>%as_tibble()%>%mutate(Mode="Unbalanced_Stratified",iter=as.numeric(rownames(.)))
bootperf=rbind(perfmod0,perfmod1,perfmod2,perfmod3)
names(bootperf)=c("BAC","F1","TPR","TNR","FPR","FNR","Mode","Iteration")
bootperf[,c(1:6)]%>%psych::describeBy(.,bootperf$Mode)
## $Balanced_Random
## vars n mean sd median trimmed mad min max range skew kurtosis
## BAC 1 30 0.93 0.02 0.93 0.93 0.02 0.89 0.97 0.08 0.02 -1.11
## F1 2 30 0.84 0.04 0.84 0.84 0.04 0.77 0.92 0.15 0.09 -0.61
## TPR 3 30 0.90 0.04 0.90 0.90 0.04 0.82 0.97 0.15 -0.09 -1.03
## TNR 4 30 0.96 0.01 0.96 0.96 0.01 0.94 0.98 0.04 -0.08 -0.82
## FPR 5 30 0.04 0.01 0.04 0.04 0.01 0.02 0.06 0.04 0.08 -0.82
## FNR 6 30 0.10 0.04 0.10 0.10 0.04 0.03 0.18 0.15 0.09 -1.03
## se
## BAC 0.00
## F1 0.01
## TPR 0.01
## TNR 0.00
## FPR 0.00
## FNR 0.01
##
## $Balanced_Stratified
## vars n mean sd median trimmed mad min max range skew kurtosis
## BAC 1 30 0.92 0.02 0.92 0.92 0.02 0.89 0.97 0.08 0.32 -0.68
## F1 2 30 0.83 0.03 0.84 0.83 0.04 0.78 0.91 0.13 0.29 -0.70
## TPR 3 30 0.89 0.04 0.89 0.89 0.04 0.81 0.96 0.14 -0.10 -0.89
## TNR 4 30 0.96 0.01 0.96 0.96 0.01 0.93 0.98 0.04 -0.62 -0.39
## FPR 5 30 0.04 0.01 0.04 0.04 0.01 0.02 0.07 0.04 0.62 -0.39
## FNR 6 30 0.11 0.04 0.11 0.11 0.04 0.04 0.19 0.14 0.10 -0.89
## se
## BAC 0.00
## F1 0.01
## TPR 0.01
## TNR 0.00
## FPR 0.00
## FNR 0.01
##
## $Unbalanced_Random
## vars n mean sd median trimmed mad min max range skew kurtosis se
## BAC 1 30 0.5 0 0.5 0.5 0 0.5 0.5 0 NaN NaN 0
## F1 2 30 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## TPR 3 30 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## TNR 4 30 1.0 0 1.0 1.0 0 1.0 1.0 0 NaN NaN 0
## FPR 5 30 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## FNR 6 30 1.0 0 1.0 1.0 0 1.0 1.0 0 NaN NaN 0
##
## $Unbalanced_Stratified
## vars n mean sd median trimmed mad min max range skew kurtosis se
## BAC 1 30 0.5 0 0.5 0.5 0 0.5 0.5 0 NaN NaN 0
## F1 2 30 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## TPR 3 30 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## TNR 4 30 1.0 0 1.0 1.0 0 1.0 1.0 0 NaN NaN 0
## FPR 5 30 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## FNR 6 30 1.0 0 1.0 1.0 0 1.0 1.0 0 NaN NaN 0
##
## attr(,"call")
## by.data.frame(data = x, INDICES = group, FUN = describe, type = type)
pairwise.wilcox.test(x=bootperf$FNR,g=bootperf$Mode,p.adjust.method="bonferroni",n=6,paired=T)
##
## Pairwise comparisons using Wilcoxon signed rank test
##
## data: bootperf$FNR and bootperf$Mode
##
## Balanced_Random Balanced_Stratified
## Balanced_Stratified 1 -
## Unbalanced_Random 9.1e-06 9.1e-06
## Unbalanced_Stratified 9.1e-06 9.1e-06
## Unbalanced_Random
## Balanced_Stratified -
## Unbalanced_Random -
## Unbalanced_Stratified -
##
## P value adjustment method: bonferroni
pairwise.wilcox.test(x=bootperf$FPR,g=bootperf$Mode,p.adjust.method="bonferroni",n=6,paired=T)
##
## Pairwise comparisons using Wilcoxon signed rank test
##
## data: bootperf$FPR and bootperf$Mode
##
## Balanced_Random Balanced_Stratified
## Balanced_Stratified 1 -
## Unbalanced_Random 9.1e-06 9.1e-06
## Unbalanced_Stratified 9.1e-06 9.1e-06
## Unbalanced_Random
## Balanced_Stratified -
## Unbalanced_Random -
## Unbalanced_Stratified -
##
## P value adjustment method: bonferroni
bootlong=bootperf%>%gather(BAC:FNR,key="Metric",value="Score")
bootlong%>%ggplot(aes(x=Score,fill=Mode))+geom_histogram(alpha=0.6,color="black")+facet_grid(Mode~Metric,scales="free")+scale_fill_manual(values=mycolors)
bootlong%>%ggplot(aes(x=Metric,y=Score,fill=Mode))+geom_boxplot(alpha=0.6)+facet_wrap(~Metric,scales="free",ncol=2)+scale_fill_manual(values=mycolors)+coord_flip()
bootperf%>%ggplot(aes(x=Iteration,y=F1,color=Mode,fill=Mode))+geom_path(alpha=0.8,size=1.2)+geom_point(shape=21,size=4,color="black")+scale_color_manual(values=c("red4","purple","orange","blue"))+scale_fill_manual(values=c("red","purple1","gold","skyblue"))+geom_hline(yintercept=0.95,linetype=2,size=1)+facet_wrap(~Mode,ncol=1)
bootperf%>%ggplot(aes(x=Iteration,y=BAC,color=Mode,fill=Mode))+geom_path(alpha=0.8,size=1.2)+geom_point(shape=21,size=4,color="black")+scale_color_manual(values=c("red4","purple","orange","blue"))+scale_fill_manual(values=c("red","purple1","gold","skyblue"))+geom_hline(yintercept=0.9,linetype=2,size=1)+facet_wrap(~Mode,ncol=1)
bootperf%>%ggplot(aes(x=Iteration,y=FNR,color=Mode,fill=Mode))+geom_path(alpha=0.8,size=1.2)+geom_point(shape=21,size=4,color="black")+scale_color_manual(values=c("red4","purple","orange","blue"))+scale_fill_manual(values=c("red","purple1","gold","skyblue"))+geom_hline(yintercept=0.1,linetype=2,size=1)+facet_wrap(~Mode,ncol=1)
The bootstrapped validation on 30 random testing subsets revealed something more interesting:
The models with class balancing, whether alone or combined with stratified fold assignment, produced the best performance (in red and yellow). These two models had the highest F1 scores, the highest true positive rates and the lowest false negative rates. Only models trained with such a protocol remain useful when applied to random imbalanced validation sets.
The difference in false negative rate between the Balanced/Random model and the Unbalanced/Random model is significant (Wilcoxon test p-value = 0.0000091). There was no difference between the Balanced/Random and Balanced/Stratified models. This indicates that applying the balance_classes parameter alone (without stratified fold assignment) can already improve the model's performance on imbalanced data, while fold stratification alone might not be enough to resolve the imbalance problem. Combining class balancing and stratified fold assignment could optimise the model's performance.
Conclusion
Class imbalance is a common and frustrating problem in machine learning practice. Class balancing is a useful feature in h2o that can be used alone or in combination with stratified fold assignment to deal with this problem.
Thank you for joining us and see you soon in the next tutorial.
END