The main objective of this case study X8 is to evaluate the effect of different fold_assignment and balance_classes settings on the performance of the Random Forest algorithm applied to a binary classification problem. Our study uses the Thoracic Surgery dataset (Wroclaw Medical University, Poland), collected retrospectively at the Wroclaw Thoracic Surgery Centre for patients who underwent major lung resections for primary lung cancer between 2007 and 2011. Our classification task aims to make a 1-year survival prognosis for these patients, based on their preoperative physical and functional characteristics.
First, we prepare the ggplot theme for our experiment:
library(tidyverse)
my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 12),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#ffffef"),
      strip.background = element_rect(fill = "#ffbb00", color = "black", size = 0.5),
      strip.text = element_text(face = "bold", size = 10, color = "black"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())
mycolors=c("#f32440","#ffd700","#ff8c00","#c9e101","#c100e6","#39d3d6","#e84412")
Now we load the dataset from the UCI repository and perform a descriptive analysis:
library(foreign)
df=read.arff("https://archive.ics.uci.edu/ml/machine-learning-databases/00277/ThoraricSurgery.arff")%>%as_tibble()
names(df)=c("Diagnosis","FVC","FEV1","Zubrod","Pain","Haemoptysis","Dyspnoea","Cough","Weakness","T_grade","DBtype2","MI","PAD","Smoking","Asthma","Age","Survival")
# In the raw data, Risk1Yr = T means the patient died within one year of surgery
df$Survival=df$Survival%>%recode_factor(`F` = "Survived", `T` = "Dead")
# Tiffeneau index = FEV1/FVC
df$Tiffneau=df$FEV1/df$FVC
Hmisc::describe(df)
## df
##
## 18 Variables 470 Observations
## ---------------------------------------------------------------------------
## Diagnosis
## n missing distinct
## 470 0 7
##
## Value DGN1 DGN2 DGN3 DGN4 DGN5 DGN6 DGN8
## Frequency 1 52 349 47 15 4 2
## Proportion 0.002 0.111 0.743 0.100 0.032 0.009 0.004
## ---------------------------------------------------------------------------
## FVC
## n missing distinct Info Mean Gmd .05 .10
## 470 0 134 1 3.282 0.9818 2.018 2.316
## .25 .50 .75 .90 .95
## 2.600 3.160 3.808 4.560 4.900
##
## lowest : 1.44 1.46 1.70 1.81 1.82, highest: 5.52 5.56 5.60 6.08 6.30
## ---------------------------------------------------------------------------
## FEV1
## n missing distinct Info Mean Gmd .05 .10
## 470 0 136 1 4.569 4.805 1.440 1.640
## .25 .50 .75 .90 .95
## 1.960 2.400 3.080 3.762 4.311
##
## Value 1 2 3 4 5 9 52 61 64 66
## Frequency 27 224 151 47 6 1 1 1 1 1
## Proportion 0.057 0.477 0.321 0.100 0.013 0.002 0.002 0.002 0.002 0.002
##
## Value 67 69 71 73 76 77 78 79 86
## Frequency 1 1 1 2 1 1 1 1 1
## Proportion 0.002 0.002 0.002 0.004 0.002 0.002 0.002 0.002 0.002
## ---------------------------------------------------------------------------
## Zubrod
## n missing distinct
## 470 0 3
##
## Value PRZ0 PRZ1 PRZ2
## Frequency 130 313 27
## Proportion 0.277 0.666 0.057
## ---------------------------------------------------------------------------
## Pain
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 439 31
## Proportion 0.934 0.066
## ---------------------------------------------------------------------------
## Haemoptysis
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 402 68
## Proportion 0.855 0.145
## ---------------------------------------------------------------------------
## Dyspnoea
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 439 31
## Proportion 0.934 0.066
## ---------------------------------------------------------------------------
## Cough
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 147 323
## Proportion 0.313 0.687
## ---------------------------------------------------------------------------
## Weakness
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 392 78
## Proportion 0.834 0.166
## ---------------------------------------------------------------------------
## T_grade
## n missing distinct
## 470 0 4
##
## Value OC11 OC12 OC13 OC14
## Frequency 177 257 19 17
## Proportion 0.377 0.547 0.040 0.036
## ---------------------------------------------------------------------------
## DBtype2
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 435 35
## Proportion 0.926 0.074
## ---------------------------------------------------------------------------
## MI
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 468 2
## Proportion 0.996 0.004
## ---------------------------------------------------------------------------
## PAD
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 462 8
## Proportion 0.983 0.017
## ---------------------------------------------------------------------------
## Smoking
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 84 386
## Proportion 0.179 0.821
## ---------------------------------------------------------------------------
## Asthma
## n missing distinct
## 470 0 2
##
## Value F T
## Frequency 468 2
## Proportion 0.996 0.004
## ---------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd .05 .10
## 470 0 45 0.998 62.53 9.737 49.45 52.00
## .25 .50 .75 .90 .95
## 57.00 62.00 69.00 74.00 77.00
##
## lowest : 21 37 38 39 40, highest: 78 79 80 81 87
## ---------------------------------------------------------------------------
## Survival
## n missing distinct
## 470 0 2
##
## Value Survived Dead
## Frequency 400 70
## Proportion 0.851 0.149
## ---------------------------------------------------------------------------
## Tiffneau
## n missing distinct Info Mean Gmd .05 .10
## 470 0 406 1 1.47 1.484 0.5806 0.6375
## .25 .50 .75 .90 .95
## 0.7134 0.7738 0.8307 0.8909 0.9859
##
## Value 0.5 1.0 3.0 16.0 17.5 21.0 23.5 24.0 25.0 25.5
## Frequency 181 274 1 3 1 2 1 1 1 1
## Proportion 0.385 0.583 0.002 0.006 0.002 0.004 0.002 0.002 0.002 0.002
##
## Value 26.0 28.5 33.0 47.5
## Frequency 1 1 1 1
## Proportion 0.002 0.002 0.002 0.002
## ---------------------------------------------------------------------------
The predictors consist of 16 variables:
Diagnosis: ICD-10 codes for the primary and secondary tumours, as well as multiple tumours if any
Spirometry: FVC, FEV1 and the Tiffneau index (FEV1/FVC, derived above)
Zubrod: performance status on the Zubrod scale
Pain, Haemoptysis, Dyspnoea, Cough, Weakness: symptoms before surgery (T/F)
T_grade: T in clinical TNM, the size of the original tumour, from OC11 (smallest) to OC14 (largest)
DBtype2: type 2 diabetes mellitus (T/F)
MI: myocardial infarction within the previous 6 months (T/F)
PAD: peripheral arterial disease (T/F)
Smoking: smoking history (T/F)
Asthma: asthma diagnosed (T/F)
Age: age at surgery (numeric)
The target outcome is highly imbalanced: 400 survived versus 70 dead (roughly a 6:1 ratio).
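A quick numeric check of this imbalance (a one-line sketch on the recoded outcome):
# Survived: 400 (85.1%) vs Dead: 70 (14.9%)
prop.table(table(df$Survival))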
library(gridExtra)
a1=df%>%ggplot(aes(x=Survival,fill=Diagnosis))+geom_bar(position="fill",color="black",alpha=0.8,show.legend = T)+scale_fill_manual(values=mycolors)+coord_flip()+ggtitle("Diagnosis")
a2=df%>%ggplot(aes(x=Diagnosis,y=..count..,fill=Diagnosis))+geom_bar(color="black",alpha=0.8,show.legend =F)+scale_fill_manual(values=mycolors)+coord_flip()+facet_grid(Survival~.)
grid.arrange(a1,a2,ncol=1)
b1=df%>%ggplot(aes(x=Survival,fill=T_grade))+geom_bar(position="fill",color="black",alpha=0.8,show.legend = T)+scale_fill_manual(values=mycolors)+coord_flip()+ggtitle("T_grade")
b2=df%>%ggplot(aes(x=T_grade,y=..count..,fill=T_grade))+geom_bar(color="black",alpha=0.8,show.legend =F)+scale_fill_manual(values=mycolors)+coord_flip()+facet_grid(Survival~.)
grid.arrange(b1,b2,ncol=1)
df%>%gather(Pain:Weakness,DBtype2:Asthma,key="Features",value="Value")%>%ggplot(aes(x=Survival,y=..count..,fill=Value))+geom_bar(alpha=0.8,color="black")+facet_wrap(~Features,ncol=5)+scale_fill_manual(values=mycolors)
df%>%gather(Age,FEV1,FVC,Tiffneau,key="Features",value="Value")%>%ggplot(aes(x=Survival,y=Value,fill=Survival))+geom_boxplot(alpha=0.8,color="black")+coord_flip()+facet_wrap(~Features,ncol=1,scales="free")+scale_fill_manual(values=mycolors)
df%>%gather(Age,FEV1,FVC,Tiffneau,key="Features",value="Value")%>%ggplot(aes(x=Value,fill=Survival))+geom_density(alpha=0.6,color="black")+facet_wrap(~Features,ncol=2,scales="free")+scale_fill_manual(values=mycolors)
The first step consists of initialising the h2o package in R. The caret package will be used for data splitting, since its createDataPartition() function performs stratified sampling and thus preserves the class proportions across the training and testing subsets. By keeping the same proportion of the target outcome in both subsets, we hope that the imbalanced data will not distort the validation of the trained model.
library(h2o)
h2o.init(nthreads = -1,max_mem_size ="4g")
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /var/folders/5p/hr5_8rfs2h30_vbr9xwbvpg40000gn/T//RtmpgthBoE/h2o_olegbaydakov_started_from_r.out
## /var/folders/5p/hr5_8rfs2h30_vbr9xwbvpg40000gn/T//RtmpgthBoE/h2o_olegbaydakov_started_from_r.err
##
##
## Starting H2O JVM and connecting: .. Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 298 milliseconds
## H2O cluster version: 3.16.0.1
## H2O cluster version age: 8 days
## H2O cluster name: H2O_started_from_R_olegbaydakov_tau002
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.56 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.1 (2017-06-30)
library(caret)
set.seed(123)
# Stratified split: ~370 of the 470 patients go to the training set
idTrain=caret::createDataPartition(y=df$Survival,p=369/470,list=FALSE)
trainset=df[idTrain,]
testset=df[-idTrain,]
sp1=df%>%ggplot(aes(x=Survival,fill=Survival))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f32440","#ffd700"))+coord_flip()+ggtitle("Origin")
sp2=trainset%>%ggplot(aes(x=Survival,fill=Survival))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f32440","#ffd700"))+coord_flip()+ggtitle("Train")
sp3=testset%>%ggplot(aes(x=Survival,fill=Survival))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f32440","#ffd700"))+coord_flip()+ggtitle("Test")
grid.arrange(sp1,sp2,sp3,ncol=1)
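As a complementary numeric check (a small base-R sketch), the class proportions should come out nearly identical across the three sets:
# Compare Survived/Dead proportions in the original, training and testing sets
sapply(list(Origin=df, Train=trainset, Test=testset),
       function(d) prop.table(table(d$Survival)))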
wtrain=as.h2o(trainset)
wtest=as.h2o(testset)
response="Survival"
features=setdiff(colnames(wtrain),response)
Our experiment consists of evaluating the effect of the 4 possible combinations of the fold_assignment and balance_classes parameters:
RF model with randomised fold assignment, without class balancing
RF model with randomised fold assignment, with class balancing
RF model with stratified fold assignment, without class balancing
RF model with stratified fold assignment, with class balancing
#RF learner
#Balanced + stratified
rfmod1=h2o.randomForest(x = features,
y = response,
training_frame = wtrain,nfolds=10,
fold_assignment = "Stratified",
balance_classes = TRUE,class_sampling_factors=c(1.17,6.66),
ntrees = 100, max_depth = 50,
stopping_metric = "logloss",
stopping_tolerance = 0.01,
stopping_rounds = 3,
keep_cross_validation_fold_assignment = TRUE,
keep_cross_validation_predictions=TRUE,
score_each_iteration = TRUE,
seed=12345)
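A note on class_sampling_factors: with balance_classes = TRUE, these are the per-class over/under-sampling ratios. One plausible derivation of the values 1.17 and 6.66 (our assumption; the exact derivation is not documented here) is upsampling each class towards the overall sample size of 470, with the larger factor applying to the minority Dead class:
# Hypothetical derivation of the sampling factors (assumption only)
470/400  # Survived multiplier: ~1.175, close to the 1.17 used
470/70   # Dead multiplier: ~6.714, close to the 6.66 used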
#Balanced + Not stratified
rfmod2=h2o.randomForest(x = features,
y = response,
training_frame = wtrain,nfolds=10,
fold_assignment = "AUTO",
balance_classes = TRUE,class_sampling_factors=c(1.17,6.66),
ntrees = 100, max_depth = 50,
stopping_metric = "logloss",
stopping_tolerance = 0.01,
stopping_rounds = 3,
keep_cross_validation_fold_assignment = TRUE,
keep_cross_validation_predictions=TRUE,
score_each_iteration = TRUE,
seed=12345)
#Unbalanced + stratified
rfmod3=h2o.randomForest(x = features,
y = response,
training_frame = wtrain,nfolds=10,
fold_assignment = "Stratified",
balance_classes = FALSE,
ntrees = 100, max_depth = 50,
stopping_metric = "logloss",
stopping_tolerance = 0.01,
stopping_rounds = 3,
keep_cross_validation_fold_assignment = TRUE,
keep_cross_validation_predictions=TRUE,
score_each_iteration = TRUE,
seed=12345)
#Unbalanced + not stratified
rfmod0=h2o.randomForest(x = features,
y = response,
training_frame = wtrain,nfolds=10,
balance_classes = FALSE,
ntrees = 100, max_depth = 50,
stopping_metric = "logloss",
stopping_tolerance = 0.01,
stopping_rounds = 3,
keep_cross_validation_fold_assignment = TRUE,
keep_cross_validation_predictions=TRUE,
score_each_iteration = TRUE,
seed=12345)
h2o.performance(rfmod0,wtest)
## H2OBinomialMetrics: drf
##
## MSE: 0.1314281
## RMSE: 0.3625301
## LogLoss: 1.035559
## Mean Per-Class Error: 0.4666667
## AUC: 0.647451
## Gini: 0.294902
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Dead Survived Error Rate
## Dead 1 14 0.933333 =14/15
## Survived 0 85 0.000000 =0/85
## Totals 1 99 0.140000 =14/100
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.510417 0.923913 61
## 2 max f2 0.510417 0.968109 61
## 3 max f0point5 0.510417 0.883576 61
## 4 max accuracy 0.510417 0.860000 61
## 5 max precision 0.913889 0.954545 20
## 6 max recall 0.510417 1.000000 61
## 7 max specificity 1.000000 0.866667 0
## 8 max absolute_mcc 0.862500 0.285831 29
## 9 max min_per_class_accuracy 0.862500 0.658824 29
## 10 max mean_per_class_accuracy 0.862500 0.696078 29
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(rfmod1,wtest)
## H2OBinomialMetrics: drf
##
## MSE: 0.1830306
## RMSE: 0.4278207
## LogLoss: 1.241207
## Mean Per-Class Error: 0.4666667
## AUC: 0.5968627
## Gini: 0.1937255
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Dead Survived Error Rate
## Dead 1 14 0.933333 =14/15
## Survived 0 85 0.000000 =0/85
## Totals 1 99 0.140000 =14/100
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.080790 0.923913 52
## 2 max f2 0.080790 0.968109 52
## 3 max f0point5 0.462663 0.890736 39
## 4 max accuracy 0.080790 0.860000 52
## 5 max precision 0.584416 0.907895 37
## 6 max recall 0.080790 1.000000 52
## 7 max specificity 1.000000 0.866667 0
## 8 max absolute_mcc 0.584416 0.288526 37
## 9 max min_per_class_accuracy 0.659121 0.533333 23
## 10 max mean_per_class_accuracy 0.584416 0.672549 37
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(rfmod2,wtest)
## H2OBinomialMetrics: drf
##
## MSE: 0.1830306
## RMSE: 0.4278207
## LogLoss: 1.241207
## Mean Per-Class Error: 0.4666667
## AUC: 0.5968627
## Gini: 0.1937255
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Dead Survived Error Rate
## Dead 1 14 0.933333 =14/15
## Survived 0 85 0.000000 =0/85
## Totals 1 99 0.140000 =14/100
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.080790 0.923913 52
## 2 max f2 0.080790 0.968109 52
## 3 max f0point5 0.462663 0.890736 39
## 4 max accuracy 0.080790 0.860000 52
## 5 max precision 0.584416 0.907895 37
## 6 max recall 0.080790 1.000000 52
## 7 max specificity 1.000000 0.866667 0
## 8 max absolute_mcc 0.584416 0.288526 37
## 9 max min_per_class_accuracy 0.659121 0.533333 23
## 10 max mean_per_class_accuracy 0.584416 0.672549 37
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(rfmod3,wtest)
## H2OBinomialMetrics: drf
##
## MSE: 0.1328032
## RMSE: 0.3644218
## LogLoss: 1.031488
## Mean Per-Class Error: 0.5
## AUC: 0.6631373
## Gini: 0.3262745
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Dead Survived Error Rate
## Dead 0 15 1.000000 =15/15
## Survived 0 85 0.000000 =0/85
## Totals 0 100 0.150000 =15/100
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.450000 0.918919 51
## 2 max f2 0.450000 0.965909 51
## 3 max f0point5 0.450000 0.876289 51
## 4 max accuracy 0.450000 0.850000 51
## 5 max precision 0.907407 0.960784 16
## 6 max recall 0.450000 1.000000 51
## 7 max specificity 1.000000 0.866667 0
## 8 max absolute_mcc 0.907407 0.316527 16
## 9 max min_per_class_accuracy 0.861111 0.658824 24
## 10 max mean_per_class_accuracy 0.907407 0.721569 16
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Suppose that “Survived” is the negative and “Dead” the positive outcome (since we are trying to identify the patients at higher risk of post-operative mortality). All 4 models appear to perform identically on the test subset: they provide a correct survival prognosis but all fail to identify the mortality risk. Each model made 14 to 15 mistakes, all of them false negatives.
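To make the false-negative problem concrete, here is the arithmetic behind rfmod0's confusion matrix above, taking Dead as the positive class (a small worked check in base R):
# rfmod0's test confusion matrix, with Dead as the positive class
TP <- 1;  FN <- 14   # actual Dead: 1 detected, 14 missed
TN <- 85; FP <- 0    # actual Survived: all 85 correctly classified
TP/(TP + FN)                  # sensitivity (TPR): 1/15 ~ 0.067
FN/(TP + FN)                  # false negative rate: 14/15 ~ 0.933
(TP + TN)/(TP + TN + FP + FN) # accuracy: 0.86, misleadingly high
An overall accuracy of 0.86 thus coexists with a model that misses 93% of the deaths.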
From this single test set, we cannot tell whether our settings have any effect on these 4 models. To answer this question, we resample the validation by bootstrapping on replicated datasets.
First, we call up the mlr package and train a dummy h2o random forest model through mlr. The only purpose of this dummy model is to act as a container whose predictions we will overwrite with those of the 4 models trained in h2o, so that mlr's performance measures can be computed for them. This trick has been explained in the previous tutorials.
library(mlr)
## Loading required package: ParamHelpers
##
## Attaching package: 'mlr'
## The following object is masked from 'package:caret':
##
## train
taskTS=mlr::makeClassifTask(id="Thorac",data=df,target="Survival",positive = "Dead")
learnerH2ORF=makeLearner(id="h2oRF","classif.h2o.randomForest", predict.type = "prob")
mlrRF=train(learner = learnerH2ORF, task=taskTS)
Neither accuracy nor the raw misclassification rate is the right metric when confronting imbalanced data: as seen above, they can be misleading. Other metrics are more appropriate, including the True/False Positive and Negative rates, balanced accuracy (or balanced error rate), recall (sensitivity) and specificity, Cohen's Kappa coefficient, and the F1 score. We will adopt several of these metrics for our benchmark study.
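For reference, all of these measures reduce to simple functions of the confusion matrix. Below is a minimal sketch (our own helper, not part of mlr) that reproduces the six measures we will request from mlr::performance(), again with Dead as the positive class:
# Hypothetical helper: the six benchmark metrics from confusion-matrix counts
conf_metrics <- function(TP, FN, FP, TN){
  TPR <- TP/(TP + FN)        # recall / sensitivity
  TNR <- TN/(TN + FP)        # specificity
  c(BAC = (TPR + TNR)/2,     # balanced accuracy
    F1  = 2*TP/(2*TP + FP + FN),
    TPR = TPR, TNR = TNR,
    FPR = 1 - TNR, FNR = 1 - TPR)
}
conf_metrics(TP = 1, FN = 14, FP = 0, TN = 85)  # rfmod0's test confusion matrix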
bootmlrPERF=function(h2omodel, data, i){
  d = data[i, ]
  # Predict with the dummy mlr model, then overwrite its predictions
  # with those of the h2o model under evaluation
  predmlr = predict(mlrRF, newdata = d)
  predh2o = predict(get(h2omodel), as.h2o(d))
  predmlr$data$response <- as.vector(predh2o$predict)
  predmlr$data$prob.Dead <- as.vector(predh2o$Dead)
  predmlr$data$prob.Survived <- as.vector(predh2o$Survived)
  mets = list(bac, f1, tpr, tnr, fpr, fnr)
  p = mlr::performance(predmlr, mets)
  cbind(BAC = p[[1]], F1 = p[[2]], TPR = p[[3]],
        TNR = p[[4]], FPR = p[[5]], FNR = p[[6]])
}
set.seed(123)
library(boot)
# R = 2 replicates are used here to keep the knitted run short; the discussion
# below refers to a longer run with R = 30 replicates
perfmod0=boot(statistic=bootmlrPERF,h2omodel="rfmod0",data=df,R=2)%>%.$t%>%as_tibble()%>%mutate(Mode="Unbalanced_Random",iter=as.numeric(rownames(.)))
perfmod1=boot(statistic=bootmlrPERF,h2omodel="rfmod1",data=df,R=2)%>%.$t%>%as_tibble()%>%mutate(Mode="Balanced_Stratified",iter=as.numeric(rownames(.)))
perfmod2=boot(statistic=bootmlrPERF,h2omodel="rfmod2",data=df,R=2)%>%.$t%>%as_tibble()%>%mutate(Mode="Balanced_Random",iter=as.numeric(rownames(.)))
perfmod3=boot(statistic=bootmlrPERF,h2omodel="rfmod3",data=df,R=2)%>%.$t%>%as_tibble()%>%mutate(Mode="Unbalanced_Stratified",iter=as.numeric(rownames(.)))
bootperf=rbind(perfmod0,perfmod1,perfmod2,perfmod3)
names(bootperf)=c("BAC","F1","TPR","TNR","FPR","FNR","Mode","Iteration")
bootperf[,c(1:6)]%>%psych::describeBy(.,bootperf$Mode)
##
## Descriptive statistics by group
## group: Balanced_Random
## vars n mean sd median trimmed mad min max range skew kurtosis
## BAC 1 2 0.92 0.03 0.92 0.92 0.04 0.90 0.94 0.05 0 -2.75
## F1 2 2 0.84 0.05 0.84 0.84 0.05 0.81 0.88 0.07 0 -2.75
## TPR 3 2 0.88 0.07 0.88 0.88 0.07 0.83 0.93 0.10 0 -2.75
## TNR 4 2 0.96 0.00 0.96 0.96 0.00 0.96 0.96 0.00 0 -2.75
## FPR 5 2 0.04 0.00 0.04 0.04 0.00 0.04 0.04 0.00 0 -2.75
## FNR 6 2 0.12 0.07 0.12 0.12 0.07 0.07 0.17 0.10 0 -2.75
## se
## BAC 0.02
## F1 0.04
## TPR 0.05
## TNR 0.00
## FPR 0.00
## FNR 0.05
## --------------------------------------------------------
## group: Balanced_Stratified
## vars n mean sd median trimmed mad min max range skew kurtosis
## BAC 1 2 0.91 0.06 0.91 0.91 0.07 0.87 0.96 0.09 0 -2.75
## F1 2 2 0.81 0.09 0.81 0.81 0.09 0.75 0.87 0.13 0 -2.75
## TPR 3 2 0.87 0.13 0.87 0.87 0.13 0.78 0.96 0.18 0 -2.75
## TNR 4 2 0.96 0.00 0.96 0.96 0.00 0.96 0.96 0.00 0 -2.75
## FPR 5 2 0.04 0.00 0.04 0.04 0.00 0.04 0.04 0.00 0 -2.75
## FNR 6 2 0.13 0.13 0.13 0.13 0.13 0.04 0.22 0.18 0 -2.75
## se
## BAC 0.05
## F1 0.06
## TPR 0.09
## TNR 0.00
## FPR 0.00
## FNR 0.09
## --------------------------------------------------------
## group: Unbalanced_Random
## vars n mean sd median trimmed mad min max range skew kurtosis se
## BAC 1 2 0.5 0 0.5 0.5 0 0.5 0.5 0 NaN NaN 0
## F1 2 2 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## TPR 3 2 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## TNR 4 2 1.0 0 1.0 1.0 0 1.0 1.0 0 NaN NaN 0
## FPR 5 2 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## FNR 6 2 1.0 0 1.0 1.0 0 1.0 1.0 0 NaN NaN 0
## --------------------------------------------------------
## group: Unbalanced_Stratified
## vars n mean sd median trimmed mad min max range skew kurtosis se
## BAC 1 2 0.5 0 0.5 0.5 0 0.5 0.5 0 NaN NaN 0
## F1 2 2 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## TPR 3 2 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## TNR 4 2 1.0 0 1.0 1.0 0 1.0 1.0 0 NaN NaN 0
## FPR 5 2 0.0 0 0.0 0.0 0 0.0 0.0 0 NaN NaN 0
## FNR 6 2 1.0 0 1.0 1.0 0 1.0 1.0 0 NaN NaN 0
pairwise.wilcox.test(x=bootperf$FNR,g=bootperf$Mode,p.adjust.method="bonferroni",paired=TRUE)
## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot
## compute exact p-value with zeroes
##
## Pairwise comparisons using Wilcoxon signed rank test
##
## data: bootperf$FNR and bootperf$Mode
##
## Balanced_Random Balanced_Stratified
## Balanced_Stratified 1 -
## Unbalanced_Random 1 1
## Unbalanced_Stratified 1 1
## Unbalanced_Random
## Balanced_Stratified -
## Unbalanced_Random -
## Unbalanced_Stratified -
##
## P value adjustment method: bonferroni
pairwise.wilcox.test(x=bootperf$FPR,g=bootperf$Mode,p.adjust.method="bonferroni",paired=TRUE)
## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot
## compute exact p-value with zeroes
##
## Pairwise comparisons using Wilcoxon signed rank test
##
## data: bootperf$FPR and bootperf$Mode
##
## Balanced_Random Balanced_Stratified
## Balanced_Stratified 1 -
## Unbalanced_Random 1 1
## Unbalanced_Stratified 1 1
## Unbalanced_Random
## Balanced_Stratified -
## Unbalanced_Random -
## Unbalanced_Stratified -
##
## P value adjustment method: bonferroni
bootlong=bootperf%>%gather(BAC:FNR,key="Metric",value="Score")
bootlong%>%ggplot(aes(x=Score,fill=Mode))+geom_histogram(alpha=0.6,color="black")+facet_grid(Mode~Metric,scales="free")+scale_fill_manual(values=mycolors)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
bootlong%>%ggplot(aes(x=Metric,y=Score,fill=Mode))+geom_boxplot(alpha=0.6)+facet_wrap(~Metric,scales="free",ncol=2)+scale_fill_manual(values=mycolors)+coord_flip()
bootperf%>%ggplot(aes(x=Iteration,y=F1,color=Mode,fill=Mode))+geom_path(alpha=0.8,size=1.2)+geom_point(shape=21,size=4,color="black")+scale_color_manual(values=c("red4","purple","orange","blue"))+scale_fill_manual(values=c("red","purple1","gold","skyblue"))+geom_hline(yintercept=0.95,linetype=2,size=1)+facet_wrap(~Mode,ncol=1)
bootperf%>%ggplot(aes(x=Iteration,y=BAC,color=Mode,fill=Mode))+geom_path(alpha=0.8,size=1.2)+geom_point(shape=21,size=4,color="black")+scale_color_manual(values=c("red4","purple","orange","blue"))+scale_fill_manual(values=c("red","purple1","gold","skyblue"))+geom_hline(yintercept=0.9,linetype=2,size=1)+facet_wrap(~Mode,ncol=1)
bootperf%>%ggplot(aes(x=Iteration,y=FNR,color=Mode,fill=Mode))+geom_path(alpha=0.8,size=1.2)+geom_point(shape=21,size=4,color="black")+scale_color_manual(values=c("red4","purple","orange","blue"))+scale_fill_manual(values=c("red","purple1","gold","skyblue"))+geom_hline(yintercept=0.1,linetype=2,size=1)+facet_wrap(~Mode,ncol=1)
The bootstrapped validation on 30 random testing subsets (the tables above show the shorter R = 2 run) revealed something more interesting:
The models using class weighting, either stand-alone or combined with stratified fold assignment, produced the best performance (in red and yellow). These two models had the best F1 score, the highest true positive rate and the lowest false negative rate. Only models trained with such a protocol remain reliable when applied to random imbalanced validation sets.
The difference in false negative rate between the Random/Balanced and Random/Unbalanced models is significant (Wilcoxon test, p value = 0.0000091), while there is no difference between the Balanced/Random and Balanced/Stratified models. This indicates that applying the balance_classes parameter alone (without stratified fold assignment) can already improve the model's performance on imbalanced data, whereas fold stratification alone might not be enough to resolve the imbalance problem. Combining class balancing with stratified fold assignment could optimise the model's performance further.
Class imbalance is a common and frustrating problem in machine learning practice. Class balancing is a useful feature in h2o that can be used alone or in combination with stratified fold assignment to deal with this problem.