Case study X8: Dealing with imbalanced data in h2o

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project was initiated in 2016 and aims to:

  1. Encourage the use of Machine Learning techniques in medical research in Vietnam, and

  2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.

Introduction

Class imbalance is a common problem in machine learning practice that can seriously affect both an algorithm's behaviour and model selection. The term refers to situations in which the classes are not equally represented within the dataset. The imbalance problem can arise in both binary and multiclass classification tasks.

Imagine that we would like to build a machine learning model for detecting a rare disease or a rare clinical event. As the target outcome is rare, there are only 100 or fewer positive cases in a dataset of 1000 cases. After data splitting, we keep a subset of 10% for validation. This test subset contains 10 patients and 90 healthy persons. The training proposes two models, A and B. Model A misclassifies 6/10 patients as negative (false negatives) and 10/90 healthy subjects as positive (false positives), while model B misclassifies only 2/10 patients as healthy but 30/90 normal persons as having the disease. Based on absolute accuracy and/or absolute error rate, the computer would consider model A better than model B, as it made only 16 mistakes versus 32 for model B. However, we should choose model B instead, as a false negative error is more dangerous than a false positive one. The problem is that we might never get the chance for such reasoning, as model B would already have been eliminated during the training process. Some algorithms only work well on balanced data, and in multiclass classification the training might even be interrupted because of class imbalance.
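
To see why, the minimal sketch below computes accuracy, sensitivity and specificity for both models, using the hypothetical counts from this example:

# Confusion counts for the hypothetical test subset (10 patients, 90 healthy)
A=c(TP=4,FN=6,FP=10,TN=80)   # model A: 6 false negatives, 10 false positives
B=c(TP=8,FN=2,FP=30,TN=60)   # model B: 2 false negatives, 30 false positives

metrics=function(m){
  c(accuracy=(m[["TP"]]+m[["TN"]])/sum(m),
    sensitivity=m[["TP"]]/(m[["TP"]]+m[["FN"]]),
    specificity=m[["TN"]]/(m[["TN"]]+m[["FP"]]))
}
rbind(A=metrics(A),B=metrics(B))
# A: accuracy 0.84 but sensitivity only 0.40
# B: accuracy 0.68 yet sensitivity 0.80, i.e. far fewer dangerous false negatives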

A common tactic for reducing the harmful influence of class imbalance is to take control of the resampling process. This control may consist of oversampling (compensating the minority class by data replication) or undersampling (removing instances from the majority class). Neither is ideal: undersampling may discard informative cases, while oversampling may lead to overfitting. We can also combine oversampling and undersampling, trading off their positive and negative effects.
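
For illustration only, here is a minimal dplyr sketch of both tactics on a hypothetical frame dat with outcome y (later on, h2o will do this for us automatically):

library(dplyr)
set.seed(123)

# Hypothetical imbalanced frame: 900 negatives versus 100 positives
dat=tibble(y=factor(rep(c("neg","pos"),c(900,100))),x=rnorm(1000))
minority=filter(dat,y=="pos")
majority=filter(dat,y=="neg")

# Oversampling: replicate minority rows (with replacement) up to the majority size
over=bind_rows(majority,sample_n(minority,nrow(majority),replace=TRUE))

# Undersampling: discard majority rows down to the minority size
under=bind_rows(minority,sample_n(majority,nrow(minority)))

table(over$y);table(under$y)   # both are now balanced 1:1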

The h2o package provides a simple and flexible solution for dealing with data imbalance based on resampling control, via two hyperparameters of the training process: balance_classes and fold_assignment. These parameters only take effect for Random Forest, Deep Learning and GBM, which are also the learners that involve random splitting. They are set within the model training call; the data frame itself is never physically modified.

When the balance_classes parameter is set to TRUE, a hybrid sampling control is activated: the training data are over- or undersampled according to the class proportions in the original training frame, so that rows in the minority classes end up weighted (replicated) more heavily than rows in the majority classes. Users can also take explicit control by supplying per-class weights through the class_sampling_factors parameter (recommended).
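
As a rough sketch of how these sampling factors can be chosen (the frame train and outcome y below are hypothetical placeholders), a simple rule is to upweight each class by N/n_k, so that every class contributes roughly N rows after resampling; this matches, up to rounding, the factors c(1.17, 6.66) used later in this case study:

# Hypothetical sketch: per-class sampling factors chosen as N / n_k
n_k=table(train$y)                # e.g. 400 rows in the majority class, 70 in the minority
factors=as.numeric(sum(n_k)/n_k)  # 470/400 ~ 1.18 and 470/70 ~ 6.71
# h2o.randomForest(..., balance_classes=TRUE, class_sampling_factors=factors)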

The fold_assignment parameter defines how the training data are split up during cross-validation. Its value can be AUTO (which amounts to random splitting), Modulo or Stratified. When set to Stratified, the program tries to balance the classes equally across the folds, which can significantly improve the model's performance on imbalanced data.
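
To see what stratification buys us, here is an illustrative sketch (not h2o's internal code) that compares the number of minority cases per fold under random versus stratified assignment, using caret::createFolds for the stratified split:

# Illustrative sketch with a hypothetical outcome vector of 470 rows
set.seed(123)
y=factor(rep(c("neg","pos"),c(430,40)))

# Random assignment: the minority count fluctuates from fold to fold
random_fold=sample(rep(1:10,length.out=length(y)))
table(y,random_fold)["pos",]

# Stratified assignment: about 4 minority cases in every fold
strat=caret::createFolds(y,k=10)
sapply(strat,function(idx) sum(y[idx]=="pos"))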

Objective

The main objective of this case study X8 is to evaluate the effect of different fold_assignment and balance_classes settings on the performance of the Random Forest algorithm applied to a binary classification problem. Our study employs the Thoracic Surgery dataset (Wroclaw Medical University, Poland), collected retrospectively at the Wroclaw Thoracic Surgery Centre for patients who underwent major lung resections for primary lung cancer between 2007 and 2011. Our classification task aims at a 1-year survival prognosis for those patients, based on their preoperative physical and functional characteristics.

Materials and method

First, we prepare a ggplot2 theme for our experiment:

library(tidyverse)

my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 12),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#ffffef"),
      strip.background = element_rect(fill = "#ffbb00", color = "black", size =0.5),
      strip.text = element_text(face = "bold", size = 10, color = "black"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())

mycolors=c("#f32440","#ffd700","#ff8c00","#c9e101","#c100e6","#39d3d6","#e84412")

Now we load the dataset from the UCI website and perform a descriptive analysis:

require(foreign)
df=read.arff("https://archive.ics.uci.edu/ml/machine-learning-databases/00277/ThoraricSurgery.arff")%>%as_tibble()

names(df)=c("Diagnosis","FVC","FEV1","Zubrod","Pain","Haemoptysis","Dyspnoea","Cough","Weakness","T_grade","DBtype2","MI","PAD","Smoking","Asthma","Age","Survival")

df$Survival=df$Survival%>%recode_factor(.,`F` = "Survived", `T` = "Dead")

df$Tiffneau=df$FEV1/df$FVC

Hmisc::describe(df)
## df 
## 
##  18  Variables      470  Observations
## ---------------------------------------------------------------------------
## Diagnosis 
##        n  missing distinct 
##      470        0        7 
##                                                     
## Value       DGN1  DGN2  DGN3  DGN4  DGN5  DGN6  DGN8
## Frequency      1    52   349    47    15     4     2
## Proportion 0.002 0.111 0.743 0.100 0.032 0.009 0.004
## ---------------------------------------------------------------------------
## FVC 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      470        0      134        1    3.282   0.9818    2.018    2.316 
##      .25      .50      .75      .90      .95 
##    2.600    3.160    3.808    4.560    4.900 
## 
## lowest : 1.44 1.46 1.70 1.81 1.82, highest: 5.52 5.56 5.60 6.08 6.30
## ---------------------------------------------------------------------------
## FEV1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      470        0      136        1    4.569    4.805    1.440    1.640 
##      .25      .50      .75      .90      .95 
##    1.960    2.400    3.080    3.762    4.311 
##                                                                       
## Value          1     2     3     4     5     9    52    61    64    66
## Frequency     27   224   151    47     6     1     1     1     1     1
## Proportion 0.057 0.477 0.321 0.100 0.013 0.002 0.002 0.002 0.002 0.002
##                                                                 
## Value         67    69    71    73    76    77    78    79    86
## Frequency      1     1     1     2     1     1     1     1     1
## Proportion 0.002 0.002 0.002 0.004 0.002 0.002 0.002 0.002 0.002
## ---------------------------------------------------------------------------
## Zubrod 
##        n  missing distinct 
##      470        0        3 
##                             
## Value       PRZ0  PRZ1  PRZ2
## Frequency    130   313    27
## Proportion 0.277 0.666 0.057
## ---------------------------------------------------------------------------
## Pain 
##        n  missing distinct 
##      470        0        2 
##                       
## Value          F     T
## Frequency    439    31
## Proportion 0.934 0.066
## ---------------------------------------------------------------------------
## Haemoptysis 
##        n  missing distinct 
##      470        0        2 
##                       
## Value          F     T
## Frequency    402    68
## Proportion 0.855 0.145
## ---------------------------------------------------------------------------
## Dyspnoea 
##        n  missing distinct 
##      470        0        2 
##                       
## Value          F     T
## Frequency    439    31
## Proportion 0.934 0.066
## ---------------------------------------------------------------------------
## Cough 
##        n  missing distinct 
##      470        0        2 
##                       
## Value          F     T
## Frequency    147   323
## Proportion 0.313 0.687
## ---------------------------------------------------------------------------
## Weakness 
##        n  missing distinct 
##      470        0        2 
##                       
## Value          F     T
## Frequency    392    78
## Proportion 0.834 0.166
## ---------------------------------------------------------------------------
## T_grade 
##        n  missing distinct 
##      470        0        4 
##                                   
## Value       OC11  OC12  OC13  OC14
## Frequency    177   257    19    17
## Proportion 0.377 0.547 0.040 0.036
## ---------------------------------------------------------------------------
## DBtype2 
##        n  missing distinct 
##      470        0        2 
##                       
## Value          F     T
## Frequency    435    35
## Proportion 0.926 0.074
## ---------------------------------------------------------------------------
## MI 
##        n  missing distinct 
##      470        0        2 
##                       
## Value          F     T
## Frequency    468     2
## Proportion 0.996 0.004
## ---------------------------------------------------------------------------
## PAD 
##        n  missing distinct 
##      470        0        2 
##                       
## Value          F     T
## Frequency    462     8
## Proportion 0.983 0.017
## ---------------------------------------------------------------------------
## Smoking 
##        n  missing distinct 
##      470        0        2 
##                       
## Value          F     T
## Frequency     84   386
## Proportion 0.179 0.821
## ---------------------------------------------------------------------------
## Asthma 
##        n  missing distinct 
##      470        0        2 
##                       
## Value          F     T
## Frequency    468     2
## Proportion 0.996 0.004
## ---------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      470        0       45    0.998    62.53    9.737    49.45    52.00 
##      .25      .50      .75      .90      .95 
##    57.00    62.00    69.00    74.00    77.00 
## 
## lowest : 21 37 38 39 40, highest: 78 79 80 81 87
## ---------------------------------------------------------------------------
## Survival 
##        n  missing distinct 
##      470        0        2 
##                             
## Value      Survived     Dead
## Frequency       400       70
## Proportion    0.851    0.149
## ---------------------------------------------------------------------------
## Tiffneau 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      470        0      406        1     1.47    1.484   0.5806   0.6375 
##      .25      .50      .75      .90      .95 
##   0.7134   0.7738   0.8307   0.8909   0.9859 
##                                                                       
## Value        0.5   1.0   3.0  16.0  17.5  21.0  23.5  24.0  25.0  25.5
## Frequency    181   274     1     3     1     2     1     1     1     1
## Proportion 0.385 0.583 0.002 0.006 0.002 0.004 0.002 0.002 0.002 0.002
##                                   
## Value       26.0  28.5  33.0  47.5
## Frequency      1     1     1     1
## Proportion 0.002 0.002 0.002 0.002
## ---------------------------------------------------------------------------

The predictors consist of 16 variables:

  1. Diagnosis: ICD-10 codes for primary and secondary as well as multiple tumours, if any
  2. Spirometric values: FVC, FEV1 and the Tiffneau index (= FEV1/FVC)
  3. Zubrod: performance status on the Zubrod scale
  4. Pain before surgery (T, F)
  5. Haemoptysis before surgery (T, F)
  6. Dyspnoea before surgery (T, F)
  7. Cough before surgery (T, F)
  8. Weakness before surgery (T, F)
  9. T_grade: T in clinical TNM, the size of the original tumour, from OC11 (smallest) to OC14 (largest)
  10. DBtype2: type 2 diabetes mellitus (T, F)
  11. MI up to 6 months (T, F)
  12. PAD: peripheral arterial diseases (T, F)
  13. Smoking (T, F)
  14. Asthma (T, F)
  15. Age at surgery (numeric)

The target outcome is highly imbalanced: 400 survived versus 70 dead (a ratio of roughly 6:1).
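
We can confirm this directly from the frequency table:

table(df$Survival)
## Survived     Dead 
##      400       70
round(prop.table(table(df$Survival)),3)
## Survived     Dead 
##    0.851    0.149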

Data visualisation

library(gridExtra)

a1=df%>%ggplot(aes(x=Survival,fill=Diagnosis))+geom_bar(position="fill",color="black",alpha=0.8,show.legend = T)+scale_fill_manual(values=mycolors)+coord_flip()+ggtitle("Diagnosis")

a2=df%>%ggplot(aes(x=Diagnosis,y=..count..,fill=Diagnosis))+geom_bar(color="black",alpha=0.8,show.legend =F)+scale_fill_manual(values=mycolors)+coord_flip()+facet_grid(Survival~.)

grid.arrange(a1,a2,ncol=1)

b1=df%>%ggplot(aes(x=Survival,fill=T_grade))+geom_bar(position="fill",color="black",alpha=0.8,show.legend = T)+scale_fill_manual(values=mycolors)+coord_flip()+ggtitle("T_grade")
b2=df%>%ggplot(aes(x=T_grade,y=..count..,fill=T_grade))+geom_bar(color="black",alpha=0.8,show.legend =F)+scale_fill_manual(values=mycolors)+coord_flip()+facet_grid(Survival~.)

grid.arrange(b1,b2,ncol=1)

df%>%gather(Pain:Weakness,DBtype2:Asthma,key="Features",value="Value")%>%ggplot(aes(x=Survival,y=..count..,fill=Value))+geom_bar(alpha=0.8,color="black")+facet_wrap(~Features,ncol=5)+scale_fill_manual(values=mycolors)

df%>%gather(Age,FEV1,FVC,Tiffneau,key="Features",value="Value")%>%ggplot(aes(x=Survival,y=Value,fill=Survival))+geom_boxplot(alpha=0.8,color="black")+coord_flip()+facet_wrap(~Features,ncol=1,scales="free")+scale_fill_manual(values=mycolors)

df%>%gather(Age,FEV1,FVC,Tiffneau,key="Features",value="Value")%>%ggplot(aes(x=Value,fill=Survival))+geom_density(alpha=0.6,color="black")+facet_wrap(~Features,ncol=2,scales="free")+scale_fill_manual(values=mycolors)

Machine learning experiment

The first step consists of initialising the h2o package in R. The caret package will be used for data splitting, as its createDataPartition function performs stratified splitting and thus preserves the class proportions across the training and testing subsets. By keeping the same proportion of the target outcome, we hope that the imbalanced data will not affect the validation of the trained models.

library(h2o)

h2o.init(nthreads = -1,max_mem_size ="4g")
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         3 hours 11 minutes 
##     H2O cluster version:        3.10.3.6 
##     H2O cluster version age:    2 months and 4 days  
##     H2O cluster name:           H2O_started_from_R_Admin_bbl792 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.15 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.1 (2016-06-21)
library(caret)
set.seed(123)

idTrain=caret::createDataPartition(y=df$Survival,p=369/470,list=FALSE)
trainset=df[idTrain,]
testset=df[-idTrain,]


sp1=df%>%ggplot(aes(x=Survival,fill=Survival))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f32440","#ffd700"))+coord_flip()+ggtitle("Origin")
sp2=trainset%>%ggplot(aes(x=Survival,fill=Survival))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f32440","#ffd700"))+coord_flip()+ggtitle("Train")
sp3=testset%>%ggplot(aes(x=Survival,fill=Survival))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f32440","#ffd700"))+coord_flip()+ggtitle("Test")

grid.arrange(sp1,sp2,sp3,ncol=1)

wtrain=as.h2o(trainset)
wtest=as.h2o(testset)
response="Survival"
features=setdiff(colnames(wtrain),response)

Our experiment consists of evaluating the effect of the 4 possible combinations of the fold_assignment and balance_classes parameters:

  1. Model RF with random fold assignment, without class balancing

  2. Model RF with random fold assignment, with class balancing

  3. Model RF with stratified fold assignment, without class balancing

  4. Model RF with stratified fold assignment, with class balancing

#RF learner

#Balanced + stratified

rfmod1=h2o.randomForest(x = features,
              y = response,
              training_frame = wtrain,nfolds=10,
              fold_assignment = "Stratified",
              balance_classes = TRUE,class_sampling_factors=c(1.17,6.66),
              ntrees = 100, max_depth = 50,
              stopping_metric = "logloss",
              stopping_tolerance = 0.01,
              stopping_rounds = 3,
              keep_cross_validation_fold_assignment = TRUE,
              keep_cross_validation_predictions=TRUE,
              score_each_iteration = TRUE,
              seed=12345)

#Balanced + Not stratified

rfmod2=h2o.randomForest(x = features,
               y = response,
               training_frame = wtrain,nfolds=10,
               fold_assignment = "AUTO",
               balance_classes = TRUE,class_sampling_factors=c(1.17,6.66),
               ntrees = 100, max_depth = 50,
               stopping_metric = "logloss",
               stopping_tolerance = 0.01,
               stopping_rounds = 3,
               keep_cross_validation_fold_assignment = TRUE,
               keep_cross_validation_predictions=TRUE,
               score_each_iteration = TRUE,
               seed=12345)

#Unbalanced + stratified

rfmod3=h2o.randomForest(x = features,
               y = response,
               training_frame = wtrain,nfolds=10,
               fold_assignment = "Stratified",
               balance_classes = FALSE,
               ntrees = 100, max_depth = 50,
               stopping_metric = "logloss",
               stopping_tolerance = 0.01,
               stopping_rounds = 3,
               keep_cross_validation_fold_assignment = TRUE,
               keep_cross_validation_predictions=TRUE,
               score_each_iteration = TRUE,
               seed=12345)

#Unbalanced + not stratified

rfmod0=h2o.randomForest(x = features,
               y = response,
               training_frame = wtrain,nfolds=10,
               balance_classes = F,
               ntrees = 100, max_depth = 50,
               stopping_metric = "logloss",
               stopping_tolerance = 0.01,
               stopping_rounds = 3,
               keep_cross_validation_fold_assignment = TRUE,
               keep_cross_validation_predictions=TRUE,
               score_each_iteration = TRUE,
               seed=12345)

Confusion matrices and performance of the 4 models on the test subset

h2o.performance(rfmod0,wtest)
## H2OBinomialMetrics: drf
## 
## MSE:  0.1305544
## RMSE:  0.3613231
## LogLoss:  1.031624
## Mean Per-Class Error:  0.4666667
## AUC:  0.6505882
## Gini:  0.3011765
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          Dead Survived    Error     Rate
## Dead        1       14 0.933333   =14/15
## Survived    0       85 0.000000    =0/85
## Totals      1       99 0.140000  =14/100
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.511111 0.923913  60
## 2                       max f2  0.511111 0.968109  60
## 3                 max f0point5  0.511111 0.883576  60
## 4                 max accuracy  0.511111 0.860000  60
## 5                max precision  0.912478 0.955556  19
## 6                   max recall  0.511111 1.000000  60
## 7              max specificity  1.000000 0.866667   0
## 8             max absolute_mcc  0.853333 0.268387  30
## 9   max min_per_class_accuracy  0.862222 0.666667  28
## 10 max mean_per_class_accuracy  0.912478 0.686275  19
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(rfmod1,wtest)
## H2OBinomialMetrics: drf
## 
## MSE:  0.182993
## RMSE:  0.4277768
## LogLoss:  1.240996
## Mean Per-Class Error:  0.4666667
## AUC:  0.5976471
## Gini:  0.1952941
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          Dead Survived    Error     Rate
## Dead        1       14 0.933333   =14/15
## Survived    0       85 0.000000    =0/85
## Totals      1       99 0.140000  =14/100
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.080790 0.923913  50
## 2                       max f2  0.080790 0.968109  50
## 3                 max f0point5  0.462663 0.890736  38
## 4                 max accuracy  0.080790 0.860000  50
## 5                max precision  0.584416 0.907895  36
## 6                   max recall  0.080790 1.000000  50
## 7              max specificity  1.000000 0.866667   0
## 8             max absolute_mcc  0.584416 0.288526  36
## 9   max min_per_class_accuracy  0.659121 0.533333  22
## 10 max mean_per_class_accuracy  0.584416 0.672549  36
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(rfmod2,wtest)
## H2OBinomialMetrics: drf
## 
## MSE:  0.182993
## RMSE:  0.4277768
## LogLoss:  1.240996
## Mean Per-Class Error:  0.4666667
## AUC:  0.5976471
## Gini:  0.1952941
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          Dead Survived    Error     Rate
## Dead        1       14 0.933333   =14/15
## Survived    0       85 0.000000    =0/85
## Totals      1       99 0.140000  =14/100
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.080790 0.923913  50
## 2                       max f2  0.080790 0.968109  50
## 3                 max f0point5  0.462663 0.890736  38
## 4                 max accuracy  0.080790 0.860000  50
## 5                max precision  0.584416 0.907895  36
## 6                   max recall  0.080790 1.000000  50
## 7              max specificity  1.000000 0.866667   0
## 8             max absolute_mcc  0.584416 0.288526  36
## 9   max min_per_class_accuracy  0.659121 0.533333  22
## 10 max mean_per_class_accuracy  0.584416 0.672549  36
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(rfmod3,wtest)
## H2OBinomialMetrics: drf
## 
## MSE:  0.1329517
## RMSE:  0.3646254
## LogLoss:  1.031372
## Mean Per-Class Error:  0.5
## AUC:  0.6639216
## Gini:  0.3278431
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          Dead Survived    Error     Rate
## Dead        0       15 1.000000   =15/15
## Survived    0       85 0.000000    =0/85
## Totals      0      100 0.150000  =15/100
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.450000 0.918919  52
## 2                       max f2  0.450000 0.965909  52
## 3                 max f0point5  0.450000 0.876289  52
## 4                 max accuracy  0.450000 0.850000  52
## 5                max precision  0.906667 0.961538  17
## 6                   max recall  0.450000 1.000000  52
## 7              max specificity  1.000000 0.866667   0
## 8             max absolute_mcc  0.906667 0.325125  17
## 9   max min_per_class_accuracy  0.861111 0.670588  25
## 10 max mean_per_class_accuracy  0.906667 0.727451  17
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Suppose that "Survived" is the negative outcome and "Dead" the positive one (as we try to identify the patients with a higher risk of post-operative mortality). All 4 models seem to perform identically on the test subset: they provided correct survival prognoses but all failed to identify the mortality risk. Each model made 14 to 15 mistakes, all of them false negatives. So far we cannot tell whether our settings had any effect on these 4 models.

To answer this question, we must repeat the validation on replicated datasets via bootstrapping.

First, we load the mlr package and train a dummy random forest model in h2o. The real purpose of this dummy model is to serve as a template prediction object into which the predictions of the 4 h2o models can be injected, so that mlr's performance measures can be applied to them. This trick was explained in the previous tutorials.

library(mlr)

taskTS=mlr::makeClassifTask(id="Thorac",data=df,target="Survival",positive = "Dead")

learnerH2ORF=makeLearner(id="h2oRF","classif.h2o.randomForest", predict.type = "prob")

mlrRF=train(learner = learnerH2ORF, task=taskTS)

Neither accuracy nor the absolute misclassification rate is the right metric when confronting imbalanced data; as mentioned above, they can be misleading. Other metrics are more appropriate, including the true/false negative and positive rates, balanced accuracy, balanced error rate, recall (sensitivity), Cohen's Kappa coefficient and the F1 score. We will adopt some of these metrics for our benchmark study.
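
As a quick reminder of how these measures relate to the confusion matrix, here is a minimal sketch taking "Dead" as the positive class; applied to the rfmod0 test confusion matrix shown above, it reproduces the balanced accuracy of about 0.53:

# Minimal sketch: benchmark metrics expressed in confusion-matrix counts,
# with "Dead" as the positive class (TP = correctly predicted deaths)
imbalance_metrics=function(TP,FN,FP,TN){
  TPR=TP/(TP+FN)       # sensitivity (recall)
  TNR=TN/(TN+FP)       # specificity
  precision=TP/(TP+FP)
  c(BAC=(TPR+TNR)/2,   # balanced accuracy
    F1=2*precision*TPR/(precision+TPR),
    TPR=TPR,TNR=TNR,FPR=1-TNR,FNR=1-TPR)
}
round(imbalance_metrics(TP=1,FN=14,FP=0,TN=85),3)
##   BAC    F1   TPR   TNR   FPR   FNR 
## 0.533 0.125 0.067 1.000 0.000 0.933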

bootmlrPERF=function(h2omodel,data,i){
  d=data[i,]                                  # bootstrap replicate
  predmlr=predict(mlrRF,newdata=d)            # dummy mlr prediction (template)
  predh2o=predict(get(h2omodel),as.h2o(d))    # prediction of the actual h2o model
  # Overwrite the template with the h2o model's predictions
  predmlr$data$response<-as.vector(predh2o$predict)
  predmlr$data$prob.Dead<-as.vector(predh2o$Dead)
  predmlr$data$prob.Survived<-as.vector(predh2o$Survived)
  mets=list(bac,f1,tpr,tnr,fpr,fnr)
  p=mlr::performance(predmlr,mets)
  BAC=p[[1]]
  F1=p[[2]]
  TPR=p[[3]]
  TNR=p[[4]]
  FPR=p[[5]]
  FNR=p[[6]]
  return(cbind(BAC,F1,TPR,TNR,FPR,FNR))
}


set.seed(123)
library(boot)

perfmod0=boot(statistic=bootmlrPERF,h2omodel="rfmod0",data=df,R=30)%>%.$t%>%as_tibble()%>%mutate(Mode="Unbalanced_Random",iter=as.numeric(rownames(.)))

perfmod1=boot(statistic=bootmlrPERF,h2omodel="rfmod1",data=df,R=30)%>%.$t%>%as_tibble()%>%mutate(Mode="Balanced_Stratified",iter=as.numeric(rownames(.)))

perfmod2=boot(statistic=bootmlrPERF,h2omodel="rfmod2",data=df,R=30)%>%.$t%>%as_tibble()%>%mutate(Mode="Balanced_Random",iter=as.numeric(rownames(.)))

perfmod3=boot(statistic=bootmlrPERF,h2omodel="rfmod3",data=df,R=30)%>%.$t%>%as_tibble()%>%mutate(Mode="Unbalanced_Stratified",iter=as.numeric(rownames(.)))
bootperf=rbind(perfmod0,perfmod1,perfmod2,perfmod3)

names(bootperf)=c("BAC","F1","TPR","TNR","FPR","FNR","Mode","Iteration")

bootperf[,c(1:6)]%>%psych::describeBy(.,bootperf$Mode)
## $Balanced_Random
##     vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis
## BAC    1 30 0.93 0.02   0.93    0.93 0.02 0.89 0.97  0.08  0.02    -1.11
## F1     2 30 0.84 0.04   0.84    0.84 0.04 0.77 0.92  0.15  0.09    -0.61
## TPR    3 30 0.90 0.04   0.90    0.90 0.04 0.82 0.97  0.15 -0.09    -1.03
## TNR    4 30 0.96 0.01   0.96    0.96 0.01 0.94 0.98  0.04 -0.08    -0.82
## FPR    5 30 0.04 0.01   0.04    0.04 0.01 0.02 0.06  0.04  0.08    -0.82
## FNR    6 30 0.10 0.04   0.10    0.10 0.04 0.03 0.18  0.15  0.09    -1.03
##       se
## BAC 0.00
## F1  0.01
## TPR 0.01
## TNR 0.00
## FPR 0.00
## FNR 0.01
## 
## $Balanced_Stratified
##     vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis
## BAC    1 30 0.92 0.02   0.92    0.92 0.02 0.89 0.97  0.08  0.32    -0.68
## F1     2 30 0.83 0.03   0.84    0.83 0.04 0.78 0.91  0.13  0.29    -0.70
## TPR    3 30 0.89 0.04   0.89    0.89 0.04 0.81 0.96  0.14 -0.10    -0.89
## TNR    4 30 0.96 0.01   0.96    0.96 0.01 0.93 0.98  0.04 -0.62    -0.39
## FPR    5 30 0.04 0.01   0.04    0.04 0.01 0.02 0.07  0.04  0.62    -0.39
## FNR    6 30 0.11 0.04   0.11    0.11 0.04 0.04 0.19  0.14  0.10    -0.89
##       se
## BAC 0.00
## F1  0.01
## TPR 0.01
## TNR 0.00
## FPR 0.00
## FNR 0.01
## 
## $Unbalanced_Random
##     vars  n mean sd median trimmed mad min max range skew kurtosis se
## BAC    1 30  0.5  0    0.5     0.5   0 0.5 0.5     0  NaN      NaN  0
## F1     2 30  0.0  0    0.0     0.0   0 0.0 0.0     0  NaN      NaN  0
## TPR    3 30  0.0  0    0.0     0.0   0 0.0 0.0     0  NaN      NaN  0
## TNR    4 30  1.0  0    1.0     1.0   0 1.0 1.0     0  NaN      NaN  0
## FPR    5 30  0.0  0    0.0     0.0   0 0.0 0.0     0  NaN      NaN  0
## FNR    6 30  1.0  0    1.0     1.0   0 1.0 1.0     0  NaN      NaN  0
## 
## $Unbalanced_Stratified
##     vars  n mean sd median trimmed mad min max range skew kurtosis se
## BAC    1 30  0.5  0    0.5     0.5   0 0.5 0.5     0  NaN      NaN  0
## F1     2 30  0.0  0    0.0     0.0   0 0.0 0.0     0  NaN      NaN  0
## TPR    3 30  0.0  0    0.0     0.0   0 0.0 0.0     0  NaN      NaN  0
## TNR    4 30  1.0  0    1.0     1.0   0 1.0 1.0     0  NaN      NaN  0
## FPR    5 30  0.0  0    0.0     0.0   0 0.0 0.0     0  NaN      NaN  0
## FNR    6 30  1.0  0    1.0     1.0   0 1.0 1.0     0  NaN      NaN  0
## 
## attr(,"call")
## by.data.frame(data = x, INDICES = group, FUN = describe, type = type)
pairwise.wilcox.test(x=bootperf$FNR,g=bootperf$Mode,p.adjust.method="bonferroni",paired=T)
## 
##  Pairwise comparisons using Wilcoxon signed rank test 
## 
## data:  bootperf$FNR and bootperf$Mode 
## 
##                       Balanced_Random Balanced_Stratified
## Balanced_Stratified   1               -                  
## Unbalanced_Random     9.1e-06         9.1e-06            
## Unbalanced_Stratified 9.1e-06         9.1e-06            
##                       Unbalanced_Random
## Balanced_Stratified   -                
## Unbalanced_Random     -                
## Unbalanced_Stratified -                
## 
## P value adjustment method: bonferroni
pairwise.wilcox.test(x=bootperf$FPR,g=bootperf$Mode,p.adjust.method="bonferroni",paired=T)
## 
##  Pairwise comparisons using Wilcoxon signed rank test 
## 
## data:  bootperf$FPR and bootperf$Mode 
## 
##                       Balanced_Random Balanced_Stratified
## Balanced_Stratified   1               -                  
## Unbalanced_Random     9.1e-06         9.1e-06            
## Unbalanced_Stratified 9.1e-06         9.1e-06            
##                       Unbalanced_Random
## Balanced_Stratified   -                
## Unbalanced_Random     -                
## Unbalanced_Stratified -                
## 
## P value adjustment method: bonferroni
bootlong=bootperf%>%gather(BAC:FNR,key="Metric",value="Score")

bootlong%>%ggplot(aes(x=Score,fill=Mode))+geom_histogram(alpha=0.6,color="black")+facet_grid(Mode~Metric,scales="free")+scale_fill_manual(values=mycolors)

bootlong%>%ggplot(aes(x=Metric,y=Score,fill=Mode))+geom_boxplot(alpha=0.6)+facet_wrap(~Metric,scales="free",ncol=2)+scale_fill_manual(values=mycolors)+coord_flip()

bootperf%>%ggplot(aes(x=Iteration,y=F1,color=Mode,fill=Mode))+geom_path(alpha=0.8,size=1.2)+geom_point(shape=21,size=4,color="black")+scale_color_manual(values=c("red4","purple","orange","blue"))+scale_fill_manual(values=c("red","purple1","gold","skyblue"))+geom_hline(yintercept=0.95,linetype=2,size=1)+facet_wrap(~Mode,ncol=1)

bootperf%>%ggplot(aes(x=Iteration,y=BAC,color=Mode,fill=Mode))+geom_path(alpha=0.8,size=1.2)+geom_point(shape=21,size=4,color="black")+scale_color_manual(values=c("red4","purple","orange","blue"))+scale_fill_manual(values=c("red","purple1","gold","skyblue"))+geom_hline(yintercept=0.9,linetype=2,size=1)+facet_wrap(~Mode,ncol=1)

bootperf%>%ggplot(aes(x=Iteration,y=FNR,color=Mode,fill=Mode))+geom_path(alpha=0.8,size=1.2)+geom_point(shape=21,size=4,color="black")+scale_color_manual(values=c("red4","purple","orange","blue"))+scale_fill_manual(values=c("red","purple1","gold","skyblue"))+geom_hline(yintercept=0.1,linetype=2,size=1)+facet_wrap(~Mode,ncol=1)

The bootstrapped validation on 30 random testing subsets revealed something more interesting:

The models with class balancing, either alone or combined with stratified fold assignment, produced the best performance (the red and purple curves in the plots above). These two models had the best F1 score, the highest true positive rate and the lowest false negative rate. Only models trained with such a protocol remain meaningful when applied to random imbalanced validation sets.

The difference in false negative rate between the Balanced_Random and Unbalanced_Random models is significant (Wilcoxon test, p = 0.0000091). There was no difference between the Balanced_Random and Balanced_Stratified models. This indicates that applying the balance_classes parameter alone (without stratified fold assignment) already improves the model's performance on imbalanced data, whereas fold stratification by itself is not enough to resolve the imbalance problem. Combining class balancing with stratified assignment could further optimise the model's performance.

Conclusion

Class imbalance is a common and frustrating problem in machine learning practice. Class balancing is a useful feature in h2o that can be used alone or in combination with stratified fold assignment to deal with this problem.

Thank you for joining us and see you soon in the next tutorial.

END