Case study X6: Deep learning 3 - Tuning a Deep Neural Network

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project was launched in 2016 and aims to:

1. Encourage the use of Machine Learning techniques in medical research in Vietnam.
2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.

Background

In tutorial X4, we introduced the Deep learning algorithm applied to a binary classification task using the h2o package. However, our training protocol was arbitrary. Although h2o provides sensible defaults for Deep learning (they may give impressive results on many tasks without any tuning), hyper-parameter tuning sometimes becomes important for optimizing the model’s performance. Making arbitrary decisions is not acceptable in the real world: we cannot foresee how each choice will affect the model’s performance. Nor do we want to waste our time on a brute-force search, as h2o exposes enough tunable parameters to keep us busy for the rest of our lives.

The main objective of this tutorial X6 is to introduce random grid-based tuning for the Deep learning algorithm in another binary classification task.

Materials and method

The present case study adopts the Biopsy dataset of Dr. William H. Wolberg (William H. Wolberg and O.L. Mangasarian (1990) Proceedings of the National Academy of Sciences, U.S.A. 87, 9193-96.). This dataset was obtained from the University of Wisconsin Hospitals. Biopsies of breast tumours for 699 patients up to 15 July 1992 were recorded. Each of nine attributes was scored on a scale of 1 to 10, and the outcome is classified as a “benign” or “malignant” tumour. The original dataset contains 699 instances of 10 variables (9 features and the outcome); after removing missing values, 683 cases remain. The 9 features are the scores of clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses.

This dataset can be downloaded from the well-known http://vincentarelbundock.github.io website (Rdatasets collection):

library(tidyverse)

my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 12),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#faefff"),
      strip.background = element_rect(fill = "#400156", color = "#400156", size =0.5),
      strip.text = element_text(face = "bold", size = 10, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())


df=read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/MASS/biopsy.csv")%>%as_tibble()%>%.[,c(3:12)]%>%na.omit()

names(df)=c("clumpthickness",
            "SizeUniformity",
            "ShapeUniformity",
            "Margin_adhesion",
            "EpiCellSize",
            "Barenuclei",
            "BlandChromatin",
            "NormalNucleoli",
            "Mitoses",
            "Class"
            )

Data visualisation

dfscale<-df[,-10]%>%as.matrix()%>%scale()%>%as_tibble()%>%mutate(.,Class=df$Class,Id=row.names(.))

dfscale%>%gather(clumpthickness:Mitoses,key="Criteria",value="Score")%>%ggplot(aes(x=reorder(Id,-Score),y=reorder(Criteria,Score),fill=Score))+geom_tile(show.legend=T)+facet_wrap(~Class,ncol=1,shrink=T,scales="free")+scale_fill_gradient2(low="#fcde00",mid="#fc3232",high="#7901a8",midpoint=1.5)+theme(axis.text.x=element_blank())+scale_y_discrete("Criteria")+scale_x_discrete("Patient's Id")

df%>%gather(clumpthickness:Mitoses,key="Criteria",value="Score")%>%ggplot(aes(x=Score,fill=Class,color=Class))+geom_histogram(alpha=0.6,show.legend =T,binwidth = 1)+coord_flip()+facet_wrap(~Criteria,scales="free",ncol=3)+scale_fill_brewer(palette = "Set1",direction = -1)+scale_color_brewer(palette = "Set1",direction = -1)

Initialising h2o and data splitting

#Initialising h2o

library(h2o)

h2o.init(nthreads = -1,max_mem_size ="4g")

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         3 hours 37 minutes 
##     H2O cluster version:        3.10.3.6 
##     H2O cluster version age:    1 month and 27 days  
##     H2O cluster name:           H2O_started_from_R_Admin_myw898 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   2.39 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.1 (2016-06-21)

library(caret)

set.seed(123)
id=createDataPartition(y=df$Class, p=0.75,list=FALSE)
trainset=df[id,]
remain=df[-id,]
set.seed(1234)
idv=createDataPartition(y=remain$Class, p=0.5,list=FALSE)
testset=remain[idv,]
validset=remain[-idv,]

The initialisation of the h2o package was introduced in tutorial X4. As mentioned before, h2o’s own splitting does not guarantee the same proportion of the target variable across the subsets; that is why we split our data with the caret package, whose createDataPartition function samples within each class.

The original dataset was split into 3 parts: a training subset of 75% (n=513), a validation subset of about 12.5% (n=84) and a test subset of about 12.5% (n=86). The training subset will be used for model training and hyper-parameter tuning. The validation subset will be used for calibrating the model during training, and finally the test subset will be used for independent, external validation of our final model.
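
The subset sizes quoted above can be reproduced with a few lines of base R. This is only a sketch of stratified sampling, not caret's code: `stratified_ids` is a hypothetical helper, and the class counts (444 benign, 239 malignant) are taken from the confusion matrices printed later in this tutorial. It assumes that the sample size is rounded up within each class.

```r
# Class labels of the cleaned Biopsy data: 444 benign + 239 malignant = 683.
y <- rep(c("benign", "malignant"), times = c(444, 239))

# Hypothetical helper (not a caret function): draw ceiling(p * n_class)
# positions within each class so that class proportions are preserved.
stratified_ids <- function(y, p) {
  ids <- lapply(split(seq_along(y), y), function(idx) {
    idx[sample.int(length(idx), ceiling(p * length(idx)))]
  })
  sort(unlist(ids, use.names = FALSE))
}

set.seed(123)
train_id  <- stratified_ids(y, p = 0.75)
remain_id <- setdiff(seq_along(y), train_id)
test_id   <- remain_id[stratified_ids(y[remain_id], p = 0.5)]
valid_id  <- setdiff(remain_id, test_id)

sizes <- c(train = length(train_id),
           test  = length(test_id),
           valid = length(valid_id))
prop_malignant <- c(train = mean(y[train_id] == "malignant"),
                    test  = mean(y[test_id]  == "malignant"),
                    valid = mean(y[valid_id] == "malignant"))
print(sizes)
print(round(prop_malignant, 3))
```

Whatever the seed, the three subsets contain 513, 86 and 84 patients, and the share of malignant cases stays close to the 35% observed in the full data.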

p1=trainset%>%as.data.frame()%>%ggplot(aes(x=Class,fill=Class))+stat_count(show.legend=F)+scale_fill_brewer(palette = "Set1",direction = -1)+ggtitle("Train_set")+coord_flip()

p2=testset%>%as.data.frame()%>%ggplot(aes(x=Class,fill=Class))+stat_count(show.legend=F)+scale_fill_brewer(palette = "Set1",direction = -1)+ggtitle("Test_set")+coord_flip()

p3=validset%>%as.data.frame()%>%ggplot(aes(x=Class,fill=Class))+stat_count(show.legend=F)+scale_fill_brewer(palette = "Set1",direction = -1)+ggtitle("Validation_set")+coord_flip()

p4=df%>%as.data.frame()%>%ggplot(aes(x=Class,fill=Class))+stat_count(show.legend=F)+scale_fill_brewer(palette = "Set1",direction = -1)+ggtitle("Origin_data")+coord_flip()


library(gridExtra)

grid.arrange(p1,p2,p3,p4,ncol=1)

Then we convert the 3 subsets and the original data into h2o frames, as h2o uses its own environment:

wdata=as.h2o(df)

wtrain=as.h2o(trainset)

wvalid=as.h2o(validset)

wtest=as.h2o(testset)

The following parameters are usually considered for tuning:

1) Structural parameters: How many hidden layers are required, and how many neurons should each layer contain?

2) Functional parameters: Which activation function should be used? Should we apply dropout to the input or hidden layers? How much of the data should be fed to the input layer? Should we apply a lasso/ridge regularization?

h2o supports both Cartesian grid search (a brute-force scan of all possible combinations) and random search (randomly sampled combinations with early stopping criteria). The latter is recommended when tuning more than 4 parameters at a time.

In our case, we would like to compare the performance across 60 possible combinations of 4 hyper-parameters:

Hidden layer structure: 2 hidden layers versus 3 hidden layers (500 neurons each).

Input_dropout_ratio: from 0 (no dropout) to 0.4 (40% random dropout), indicating the proportion of input features that are randomly dropped before being fed to the first hidden layer.

Activation function: Rectifier, Tanh or Maxout. We use the “WithDropout” variant of each, which by default also drops 50% of the units in each hidden layer during training (this idea is similar to the random feature sampling in Random Forest or GBM algorithms).

Lasso regularisation: L1 = 0 versus 0.00001 (1e-5).

response="Class"
features=setdiff(colnames(wtrain),response)

hyper_params=list(
  hidden=list(c(500,500),c(500,500,500)),
  input_dropout_ratio=c(0,0.1,0.2,0.3,0.4),
  activation=c("RectifierWithDropout","TanhWithDropout","MaxoutWithDropout"),
  l1=c(0,1e-5) # c() rather than seq(): seq(0,1e-5) steps by 1 and returns only 0
)
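
The figure of 60 combinations can be checked by enumerating the grid in base R. In this sketch the hidden-layer layouts are replaced by text labels, and the non-zero l1 value is an assumption (any single positive value gives the same count).

```r
# Levels of the four hyper-parameters described in the text.
hidden_levels <- c("500-500", "500-500-500")   # labels standing in for layer vectors
input_dropout <- c(0, 0.1, 0.2, 0.3, 0.4)
activation    <- c("RectifierWithDropout", "TanhWithDropout", "MaxoutWithDropout")
l1            <- c(0, 1e-5)                    # assumed non-zero level

combos <- expand.grid(hidden = hidden_levels,
                      input_dropout_ratio = input_dropout,
                      activation = activation,
                      l1 = l1,
                      stringsAsFactors = FALSE)
n_combos <- nrow(combos)   # 2 * 5 * 3 * 2
print(n_combos)

# Pitfall: seq() steps by 1 by default, so seq(0, 1e-5) contains a single
# value (0) and would leave only 30 combinations.
n_l1_levels_with_seq <- length(seq(0, 1e-5))
print(n_l1_levels_with_seq)
```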

Classical neural networks used sigmoid functions, but h2o Deep learning adopts the Hyperbolic Tangent (Tanh), Rectifier and Maxout functions. Selecting an appropriate activation function can improve the learning speed, stability and/or accuracy of the model. Tanh provides good flexibility and stability, but it is more computationally expensive than Rectifier. Rectifier gives the fastest training speed for large networks.
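
To make the three functions concrete, here is a minimal base R sketch. It only illustrates their shapes and is not h2o's implementation; the two-channel Maxout and its weights are arbitrary choices for the example.

```r
# Activation functions applied to pre-activation values z.
rectifier <- function(z) pmax(z, 0)   # max(0, z)
tanh_act  <- function(z) tanh(z)      # squashes z into (-1, 1)

# Maxout returns the maximum over several linear "channels" of the input.
# Each column of z_channels holds the pre-activations of one channel.
maxout <- function(z_channels) apply(z_channels, 1, max)

z <- c(-2, -0.5, 0, 0.5, 2)
relu_out <- rectifier(z)
tanh_out <- tanh_act(z)

# Two channels: z itself and a scaled, shifted copy (illustrative weights).
z_channels <- cbind(z, 0.5 * z + 0.25)
maxout_out <- maxout(z_channels)

print(relu_out)
print(round(tanh_out, 3))
print(maxout_out)
```

Rectifier only needs a comparison with zero, which is why it is cheaper than Tanh; Maxout needs one set of weights per channel, so it roughly multiplies the number of parameters by the number of channels.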

Regularization (penalties) can be applied to a neural network via the l1 (lasso) and l2 (ridge) parameters. L1 lets only strong weights survive (a constant pulling force towards zero), while L2 prevents any single weight from getting too big. Another regularization technique consists of randomly dropping out units in either the input layer or the hidden layers; this is set via the input_dropout_ratio and hidden_dropout_ratios parameters. Typical values range from 0 (no dropout) to 0.5.
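
The difference between the two penalties, and the effect of input dropout, can be seen on a toy example. The weights, the penalty strength and the input vector below are made up, and the update rules are textbook proximal steps rather than h2o's actual optimiser.

```r
w      <- c(-0.80, -0.02, 0.00, 0.03, 0.50)   # illustrative weights
lambda <- 0.05                                 # illustrative penalty strength

# Penalty terms added to the loss.
l1_penalty <- lambda * sum(abs(w))
l2_penalty <- lambda * sum(w^2)

# Effect of one penalised step on the weights:
# L1 subtracts a constant amount and clips at zero (soft-thresholding),
# L2 shrinks every weight by the same proportion.
l1_step <- sign(w) * pmax(abs(w) - lambda, 0)
l2_step <- w / (1 + 2 * lambda)

# Input dropout: each feature of a sample is zeroed with probability `ratio`.
set.seed(1)
ratio <- 0.3
x <- c(5, 1, 1, 1, 2, 1, 3, 1, 1)             # one made-up patient, 9 scores
keep <- rbinom(length(x), size = 1, prob = 1 - ratio)
x_dropped <- x * keep

print(rbind(w = w, l1_step = l1_step, l2_step = round(l2_step, 4)))
print(x_dropped)
```

After the L1 step the two small weights are exactly zero while the large ones survive; after the L2 step every non-zero weight is smaller but none has been removed.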

As there are 60 possible combinations, we would like to speed up the tuning by using the random-discrete search mode with early stopping. Our early stopping criterion is based on the logloss metric: the training of each model is stopped when logloss has not improved by at least 1% (stopping_tolerance=1e-2) over 2 consecutive scoring rounds. To maximise the tuning speed, we will not use any cross-validation.
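
The stopping rule can be sketched as follows. This is a simplified reading of it: `should_stop` is a hypothetical helper, the logloss history is made up, and h2o's own implementation differs in detail (it works on a moving average of the metric).

```r
# Stop when the best logloss of the last `rounds` scoring events is not at
# least `tolerance` (relative) better than the best logloss seen before them.
should_stop <- function(logloss, rounds = 2, tolerance = 1e-2) {
  n <- length(logloss)
  if (n <= rounds) return(FALSE)               # not enough history yet
  best_before <- min(logloss[seq_len(n - rounds)])
  best_recent <- min(logloss[(n - rounds + 1):n])
  if (best_before <= 0) return(TRUE)           # nothing left to improve
  (best_before - best_recent) / best_before < tolerance
}

history <- c(0.50, 0.30, 0.20, 0.199, 0.1995, 0.10)  # made-up validation logloss

# Apply the rule after each scoring event and report the first stop.
stop_flags <- vapply(seq_along(history),
                     function(i) should_stop(history[seq_len(i)]),
                     logical(1))
stop_at <- which(stop_flags)[1]
print(stop_at)
```

With this history, training stops at the fifth scoring event: the last two values improve on 0.20 by only 0.5%, so the later drop to 0.10 is never reached. This is the price of early stopping, and the reason the tolerance and the number of rounds deserve some thought.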

search_criteria = list(strategy = "RandomDiscrete", 
                       seed=123,
                       stopping_rounds=2, 
                       stopping_tolerance=1e-2
                       )

After setting the search_criteria and the hyper-parameter list, we call the h2o.grid function with the following configuration:

random_grid=h2o.grid(
  x = features,
  y = response,
  algorithm="deeplearning",
  grid_id = "dl_grid_random",
  training_frame = wtrain,
  validation_frame = wvalid,               
  epochs=1,
  stopping_metric="logloss",
  stopping_rounds=2, 
  stopping_tolerance=1e-2,
  reproducible = TRUE,seed=123,
  hyper_params = hyper_params,
  search_criteria = search_criteria
)

The random grid search takes about 40 minutes. Once all models have been trained, we can explore the results:

grid=h2o.getGrid(random_grid@grid_id,sort_by="logloss",decreasing=FALSE)

grid

## H2O Grid Details
## ================
## 
## Grid ID: dl_grid_random 
## Used hyper parameters: 
##   -  activation 
##   -  hidden 
##   -  input_dropout_ratio 
##   -  l1 
## Number of models: 102 
## Number of failed models: 0 
## 
## Hyper-Parameter Search Summary: ordered by increasing logloss
##          activation          hidden input_dropout_ratio  l1
## 1 MaxoutWithDropout      [500, 500]                 0.3 0.0
## 2   TanhWithDropout      [400, 400]                 0.3 0.0
## 3   TanhWithDropout      [500, 500]                 0.3 0.0
## 4   TanhWithDropout      [200, 200]                 0.2 0.0
## 5   TanhWithDropout [500, 500, 500]                 0.3 0.0
##                 model_ids              logloss
## 1 dl_grid_random_model_38 0.009117608752401826
## 2 dl_grid_random_model_31  0.01223306682665868
## 3 dl_grid_random_model_84 0.014651424042551896
## 4 dl_grid_random_model_69 0.017930589240114335
## 5 dl_grid_random_model_82 0.020520372850408048
## 
## ---
##            activation          hidden input_dropout_ratio  l1
## 97  MaxoutWithDropout [500, 500, 500]                 0.3 0.0
## 98  MaxoutWithDropout      [500, 500]                 0.4 0.0
## 99  MaxoutWithDropout [500, 500, 500]                 0.2 0.0
## 100 MaxoutWithDropout [128, 256, 512]                 0.0 0.0
## 101 MaxoutWithDropout [500, 500, 500]                 0.0 0.0
## 102 MaxoutWithDropout [500, 500, 500]                 0.4 0.0
##                   model_ids             logloss
## 97  dl_grid_random_model_83 0.10582823821645582
## 98  dl_grid_random_model_76 0.11535274872378734
## 99  dl_grid_random_model_91 0.13777146540044524
## 100  dl_grid_random_model_3  0.1686939238996708
## 101 dl_grid_random_model_99 0.18711686181067302
## 102 dl_grid_random_model_74 0.22947589982323383

grid@summary_table[1,]

## Hyper-Parameter Search Summary: ordered by increasing logloss
##          activation     hidden input_dropout_ratio  l1
## 1 MaxoutWithDropout [500, 500]                 0.3 0.0
##                 model_ids              logloss
## 1 dl_grid_random_model_38 0.009117608752401826

best_model=h2o.getModel(grid@model_ids[[1]]) ## model with lowest logloss
best_model

## Model Details:
## ==============
## 
## H2OBinomialModel: deeplearning
## Model ID:  dl_grid_random_model_38 
## Status of Neuron Layers: predicting Class, 2-class classification, bernoulli distribution, CrossEntropy loss, 512 002 weights/biases, 5,9 MB, 2 763 training samples, mini-batch size 1
##   layer units          type dropout       l1       l2 mean_rate rate_rms
## 1     1     9         Input 30.00 %                                     
## 2     2   500 MaxoutDropout 50.00 % 0.000000 0.000000  0.002115 0.001579
## 3     3   500 MaxoutDropout 50.00 % 0.000000 0.000000  0.029228 0.065569
## 4     4     2       Softmax         0.000000 0.000000  0.000918 0.000178
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1                                                   
## 2 0.000000   -0.001916   0.068608  0.491713 0.026613
## 3 0.000000   -0.000460   0.046840  0.998163 0.013527
## 4 0.000000   -0.011817   0.246904 -0.000596 0.004542
## 
## 
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
## 
## MSE:  0.03135303
## RMSE:  0.1770679
## LogLoss:  0.2546598
## Mean Per-Class Error:  0.0222973
## AUC:  0.9947114
## Gini:  0.9894228
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           benign malignant    Error     Rate
## benign       320        13 0.039039  =13/333
## malignant      1       179 0.005556   =1/180
## Totals       321       192 0.027290  =14/513
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.000167 0.962366  73
## 2                       max f2  0.000167 0.981360  73
## 3                 max f0point5  0.849938 0.955882  57
## 4                 max accuracy  0.000167 0.972710  73
## 5                max precision  1.000000 1.000000   0
## 6                   max recall  0.000000 1.000000  99
## 7              max specificity  1.000000 1.000000   0
## 8             max absolute_mcc  0.000167 0.942200  73
## 9   max min_per_class_accuracy  0.063713 0.966667  64
## 10 max mean_per_class_accuracy  0.000167 0.977703  73
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
## 
## MSE:  0.002471014
## RMSE:  0.0497093
## LogLoss:  0.009117609
## Mean Per-Class Error:  0
## AUC:  1
## Gini:  1
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           benign malignant    Error   Rate
## benign        55         0 0.000000  =0/55
## malignant      0        29 0.000000  =0/29
## Totals        55        29 0.000000  =0/84
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.606832 1.000000  27
## 2                       max f2  0.606832 1.000000  27
## 3                 max f0point5  0.606832 1.000000  27
## 4                 max accuracy  0.606832 1.000000  27
## 5                max precision  1.000000 1.000000   0
## 6                   max recall  0.606832 1.000000  27
## 7              max specificity  1.000000 1.000000   0
## 8             max absolute_mcc  0.606832 1.000000  27
## 9   max min_per_class_accuracy  0.606832 1.000000  27
## 10 max mean_per_class_accuracy  0.606832 1.000000  27
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

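The result indicates that the best model (the one with the lowest validation logloss, 0.0091) is a network of 2 hidden layers with 500 neurons in each layer, a Maxout activation function with dropout, no lasso regularisation and a random input dropout of 30%. The same structure with a Tanh activation function ranks third (logloss 0.0147); the retraining in the next section uses this Tanh variant.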
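
Since the models are ranked by logloss, it helps to recall what this metric measures. The function below is the usual definition of binary logloss written in base R; the labels and probabilities are made up for illustration.

```r
# Binary logloss: mean negative log-likelihood of the true labels.
# y is 1 for the positive class and 0 otherwise; p is the predicted
# probability of the positive class.
logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1, 0)
ll_confident_right <- logloss(y, c(0.99, 0.01, 0.98, 0.02))
ll_hesitant        <- logloss(y, c(0.60, 0.40, 0.55, 0.45))
ll_confident_wrong <- logloss(y, c(0.01, 0.99, 0.02, 0.98))

print(c(confident_right = ll_confident_right,
        hesitant        = ll_hesitant,
        confident_wrong = ll_confident_wrong))
```

Unlike accuracy, logloss rewards well-calibrated probabilities and heavily penalises confident mistakes, which makes it a sensitive criterion for comparing models that all classify almost perfectly.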

Retraining the model with the best configuration

Now that we have a good configuration, we can apply this training protocol in a more comprehensive training process. This time, we integrate both 10-fold cross-validation and recurrent scoring on the validation subset.

dlmod=h2o.deeplearning  (x = features,
                         y = response,
                         model_id = "Best_model",
                         training_frame = wtrain,validation_frame = wvalid,                             
                         nfolds = 10,
                         hidden = c(500,500), 
                         stopping_metric = "logloss",
                         replicate_training_data = TRUE,
                         stopping_tolerance = 0.01,
                         stopping_rounds = 2,
                         overwrite_with_best_model=TRUE,
                         fold_assignment = "Stratified",
                         epochs=100,
                         activation = "TanhWithDropout",
                         input_dropout_ratio=0.3,
                         keep_cross_validation_fold_assignment = TRUE,
                         keep_cross_validation_predictions=FALSE,
                         score_each_iteration = TRUE,
                         variable_importances = TRUE,
                         reproducible = TRUE,seed=123)
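
The idea behind fold_assignment = "Stratified" can be illustrated in base R. This is a sketch only: `stratified_folds` is a hypothetical helper, not h2o's algorithm, and the class counts of the training subset (333 benign, 180 malignant) are those of the training confusion matrix shown earlier.

```r
# Class labels of the training subset.
y <- rep(c("benign", "malignant"), times = c(333, 180))

# Hypothetical helper: within each class, deal the fold numbers 1..k out
# as evenly as possible and shuffle them.
stratified_folds <- function(y, k) {
  fold <- integer(length(y))
  for (idx in split(seq_along(y), y)) {
    fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  fold
}

set.seed(123)
fold <- stratified_folds(y, k = 10)
fold_table <- table(fold, y)
print(fold_table)
```

Each of the 10 folds receives 33 or 34 benign and exactly 18 malignant cases, so every hold-out fold has nearly the same class balance as the whole training subset.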

Scoring history

After this training process, we can now enjoy the results of our work. First, we will plot the scoring history for both cross-validation and the fixed validation subset:

# rownames_to_column() keeps the metric names, which as_tibble() would drop
cvf=dlmod@model$cross_validation_metrics_summary%>%as.data.frame()%>%rownames_to_column("Metric")%>%as_tibble()%>%gather(cv_1_valid:cv_10_valid,key="Fold",value="Result")
cvf$Result=as.numeric(cvf$Result)

tshf=dlmod@model$scoring_history%>%as_tibble()%>%gather(training_logloss:training_classification_error,key="Metric",value="Training_Score")
tshf$Training_Score=as.numeric(tshf$Training_Score)

vshf=dlmod@model$scoring_history%>%as_tibble()%>%gather(validation_logloss:validation_classification_error,key="Metric",value="Validation_Score")
vshf$Validation_Score=as.numeric(vshf$Validation_Score)


cvf%>%ggplot(aes(x=Fold,y=Result,color=Metric,fill=Metric))+geom_line(group=1,size=1,show.legend = F)+geom_point(shape=21,size=3,color="black",show.legend = F)+theme(axis.text.x=element_blank())+facet_wrap(~Metric,scales="free",ncol=3)+ggtitle("10-fold cross-validation")

tshf%>%ggplot(aes(x=epochs,y=Training_Score,color=Metric,fill=Metric))+geom_line(group=1,size=1,show.legend = F)+geom_point(shape=21,size=3,color="black",show.legend = F)+facet_wrap(~Metric,scales="free",ncol=1)+ggtitle("Training score history")

vshf%>%ggplot(aes(x=epochs,y=Validation_Score,color=Metric,fill=Metric))+geom_line(group=1,size=1,show.legend = F)+geom_point(shape=21,size=3,color="black",show.legend = F)+facet_wrap(~Metric,scales="free",ncol=1)+ggtitle("Validation score history")

loglossdf=dlmod@model$scoring_history%>%as_tibble()%>%gather(training_logloss,validation_logloss,key="Metric",value="Score")
loglossdf$Score=as.numeric(loglossdf$Score)
pl=loglossdf%>%ggplot(aes(x=epochs,y=Score,color=Metric,fill=Metric))+geom_line(group=1,size=1,show.legend = T)+geom_point(shape=21,size=3,color="black",show.legend = F)+ggtitle("Logloss Score history")

aucdf=dlmod@model$scoring_history%>%as_tibble()%>%gather(training_auc,validation_auc,key="Metric",value="Score")
aucdf$Score=as.numeric(aucdf$Score)
pa=aucdf%>%ggplot(aes(x=epochs,y=Score,color=Metric,fill=Metric))+geom_line(group=1,size=1,show.legend = T)+geom_point(shape=21,size=3,color="black",show.legend = F)+ggtitle("AUC Score history")

grid.arrange(pl,pa,ncol=2)

Model exploration

Then we can plot the variable importance and the partial dependence plots, as usual:

h2o.varimp_plot(dlmod)

h2o.partialPlot(dlmod,data=wdata)

## [[1]]
## PartialDependence: Partial Dependence Plot of model Best_model on column 'clumpthickness'
##    clumpthickness mean_response
## 1        1.000000      0.321763
## 2        2.000000      0.331114
## 3        3.000000      0.339207
## 4        4.000000      0.346659
## 5        5.000000      0.354268
## 6        6.000000      0.362636
## 7        7.000000      0.372220
## 8        8.000000      0.383911
## 9        9.000000      0.399667
## 10      10.000000      0.422970
## 
## [[2]]
## PartialDependence: Partial Dependence Plot of model Best_model on column 'SizeUniformity'
##    SizeUniformity mean_response
## 1        1.000000      0.344683
## 2        2.000000      0.349147
## 3        3.000000      0.353725
## 4        4.000000      0.358494
## 5        5.000000      0.363568
## 6        6.000000      0.369127
## 7        7.000000      0.375464
## 8        8.000000      0.383009
## 9        9.000000      0.392370
## 10      10.000000      0.404327
## 
## [[3]]
## PartialDependence: Partial Dependence Plot of model Best_model on column 'ShapeUniformity'
##    ShapeUniformity mean_response
## 1         1.000000      0.344587
## 2         2.000000      0.348631
## 3         3.000000      0.352813
## 4         4.000000      0.357175
## 5         5.000000      0.361740
## 6         6.000000      0.366548
## 7         7.000000      0.371697
## 8         8.000000      0.377386
## 9         9.000000      0.383932
## 10       10.000000      0.391780
## 
## [[4]]
## PartialDependence: Partial Dependence Plot of model Best_model on column 'Margin_adhesion'
##    Margin_adhesion mean_response
## 1         1.000000      0.350751
## 2         2.000000      0.352761
## 3         3.000000      0.354774
## 4         4.000000      0.356816
## 5         5.000000      0.358914
## 6         6.000000      0.361097
## 7         7.000000      0.363394
## 8         8.000000      0.365839
## 9         9.000000      0.368473
## 10       10.000000      0.371347
## 
## [[5]]
## PartialDependence: Partial Dependence Plot of model Best_model on column 'EpiCellSize'
##    EpiCellSize mean_response
## 1     1.000000      0.350395
## 2     2.000000      0.351860
## 3     3.000000      0.353334
## 4     4.000000      0.354821
## 5     5.000000      0.356328
## 6     6.000000      0.357859
## 7     7.000000      0.359420
## 8     8.000000      0.361020
## 9     9.000000      0.362664
## 10   10.000000      0.364365
## 
## [[6]]
## PartialDependence: Partial Dependence Plot of model Best_model on column 'Barenuclei'
##    Barenuclei mean_response
## 1    1.000000      0.319033
## 2    2.000000      0.330582
## 3    3.000000      0.341724
## 4    4.000000      0.352805
## 5    5.000000      0.364587
## 6    6.000000      0.378576
## 7    7.000000      0.396908
## 8    8.000000      0.422082
## 9    9.000000      0.456868
## 10  10.000000      0.503672
## 
## [[7]]
## PartialDependence: Partial Dependence Plot of model Best_model on column 'BlandChromatin'
##    BlandChromatin mean_response
## 1        1.000000      0.344014
## 2        2.000000      0.348056
## 3        3.000000      0.352068
## 4        4.000000      0.356071
## 5        5.000000      0.360125
## 6        6.000000      0.364334
## 7        7.000000      0.368847
## 8        8.000000      0.373858
## 9        9.000000      0.379625
## 10      10.000000      0.386470
## 
## [[8]]
## PartialDependence: Partial Dependence Plot of model Best_model on column 'NormalNucleoli'
##    NormalNucleoli mean_response
## 1        1.000000      0.340690
## 2        2.000000      0.346594
## 3        3.000000      0.352743
## 4        4.000000      0.359404
## 5        5.000000      0.366905
## 6        6.000000      0.375814
## 7        7.000000      0.387117
## 8        8.000000      0.402303
## 9        9.000000      0.423304
## 10      10.000000      0.452202
## 
## [[9]]
## PartialDependence: Partial Dependence Plot of model Best_model on column 'Mitoses'
##      Mitoses mean_response
## 1   1.000000      0.352187
## 2   2.000000      0.360234
## 3   3.000000      0.368634
## 4   4.000000      0.378562
## 5   5.000000      0.392855
## 6   6.000000      0.416240
## 7   7.000000      0.454577
## 8   8.000000      0.512917
## 9   9.000000      0.591643
## 10 10.000000      0.684522
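
Each mean_response value above is an average of model predictions obtained after forcing one feature to a fixed value for every patient. The sketch below shows this computation in base R. It uses a logistic regression on the built-in mtcars data as a stand-in model, and `partial_dependence` is a hypothetical helper, not an h2o function.

```r
# Stand-in model: logistic regression on the built-in mtcars data.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial())

# For each grid value: overwrite the column for all rows, predict the
# probability of the positive class, and average over the rows.
partial_dependence <- function(model, data, column, grid) {
  mean_response <- vapply(grid, function(value) {
    modified <- data
    modified[[column]] <- value
    mean(predict(model, newdata = modified, type = "response"))
  }, numeric(1))
  data.frame(value = grid, mean_response = mean_response)
}

wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 10)
pd <- partial_dependence(fit, mtcars, "wt", wt_grid)
print(pd)
```

Because all other features keep their observed values, the curve describes the average effect of one feature while the remaining features vary as they do in the data.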

h2o.performance(dlmod,wtest)

## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.01749282
## RMSE:  0.1322604
## LogLoss:  0.0519362
## Mean Per-Class Error:  0
## AUC:  1
## Gini:  1
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           benign malignant    Error   Rate
## benign        56         0 0.000000  =0/56
## malignant      0        30 0.000000  =0/30
## Totals        56        30 0.000000  =0/86
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.115727 1.000000  29
## 2                       max f2  0.115727 1.000000  29
## 3                 max f0point5  0.115727 1.000000  29
## 4                 max accuracy  0.115727 1.000000  29
## 5                max precision  1.000000 1.000000   0
## 6                   max recall  0.115727 1.000000  29
## 7              max specificity  1.000000 1.000000   0
## 8             max absolute_mcc  0.115727 1.000000  29
## 9   max min_per_class_accuracy  0.115727 1.000000  29
## 10 max mean_per_class_accuracy  0.115727 1.000000  29
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

As we can see, the retrained model is really “good”: it classified all 86 patients of the independent test set correctly (AUC = 1, logloss = 0.052).
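
The summary figures printed by h2o follow directly from the confusion matrix and the AUC. On the test set they are trivially 0 and 1, so this sketch recomputes them from the training-data matrix of the best grid model reported earlier (13 of 333 benign and 1 of 180 malignant cases misclassified, AUC = 0.9947114).

```r
# Training confusion matrix (rows: actual, columns: predicted).
cm <- matrix(c(320,  13,
                 1, 179),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("benign", "malignant"),
                             predicted = c("benign", "malignant")))

per_class_error      <- 1 - diag(cm) / rowSums(cm)   # error within each actual class
overall_error        <- 1 - sum(diag(cm)) / sum(cm)  # all misclassified / all cases
mean_per_class_error <- mean(per_class_error)

# Gini is a rescaling of the AUC.
auc  <- 0.9947114
gini <- 2 * auc - 1

print(round(per_class_error, 6))
print(round(overall_error, 6))
print(round(mean_per_class_error, 7))
print(gini)
```

The results agree with the values in the printed report (0.039039, 0.005556, 0.027290, 0.0222973 and 0.9894228). The mean per-class error weights both classes equally, which makes it more informative than the overall error when the classes are imbalanced.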

Prediction boundaries using mlr package

(This one is new!) The mlr package supports h2o algorithms. We just need to make sure that we supply the same configuration. Model training and exploration are integrated into the plotLearnerPrediction function.

We can see that the decision boundary of the neural network looks similar to that of a logistic regression, doesn’t it?

library(mlr)

taskBiopsy=mlr::makeClassifTask(id="Biopsy",data=trainset,target="Class",positive = "malignant")

learnerDL = makeLearner(id="DL","classif.h2o.deeplearning", predict.type = "prob",
                        activation = "TanhWithDropout",
                        hidden = c(500,500),
                        input_dropout_ratio=0.3,
                        reproducible = TRUE,
                        seed=123
                        )

pp1=plotLearnerPrediction(learnerDL,taskBiopsy,features=c("clumpthickness","BlandChromatin"),cv=0,gridsize=100)+scale_fill_manual(values=c("#11a6fc","#ff0061"))

# Same plot for a second pair of features
pp2 <- plotLearnerPrediction(learnerDL, taskBiopsy,
                             features = c("Barenuclei", "Margin_adhesion"),
                             cv = 0, gridsize = 100) +
  scale_fill_manual(values = c("#11a6fc", "#ff0061"))

# Show both panels side by side (grid.arrange comes from the gridExtra package)
grid.arrange(pp1, pp2, ncol = 2)

Conclusion

We have walked through a tuning process for determining a good training protocol for deep learning in h2o. Because Deep learning is the most complex algorithm in h2o, with many hyper-parameters, we should use a randomised search, build the list of candidate hyper-parameter values carefully, and apply an early stopping criterion. A well-organised tuning process can help us optimise the accuracy of our model.
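As a recap, the three ingredients named above (a candidate list, a randomised search and early stopping) can be sketched directly with h2o.grid. This is only a minimal sketch under stated assumptions: the candidate values, the budget (max_models, max_runtime_secs), the grid_id and the helper name run_random_grid are hypothetical and are not the settings used in this tutorial. The helper expects H2OFrame objects for train and valid and a running h2o cluster.

```r
# Candidate values for each hyper-parameter (illustrative values only)
hyper_params <- list(
  activation = c("Rectifier", "Tanh", "RectifierWithDropout"),
  hidden = list(c(20, 20), c(50, 50), c(30, 30, 30)),
  input_dropout_ratio = c(0, 0.05, 0.1),
  l1 = c(0, 1e-5, 1e-4),
  l2 = c(0, 1e-5, 1e-4)
)

# Size of the full Cartesian grid; random search samples only a part of it
n_combinations <- prod(lengths(hyper_params))

# Randomised search with a budget and an early stopping rule for the search itself
search_criteria <- list(
  strategy = "RandomDiscrete",
  max_models = 20,          # stop after 20 models (assumed budget)
  max_runtime_secs = 300,   # or after 5 minutes (assumed budget)
  stopping_metric = "AUC",
  stopping_rounds = 5,      # stop if the best models no longer improve
  stopping_tolerance = 1e-3,
  seed = 1234
)

# Hypothetical helper: runs the grid on H2OFrames `train` and `valid`.
# Requires the h2o package and a cluster started with h2o::h2o.init().
run_random_grid <- function(train, valid, x, y) {
  h2o::h2o.grid(
    algorithm = "deeplearning",
    grid_id = "dl_random_grid",
    x = x, y = y,
    training_frame = train,
    validation_frame = valid,
    hyper_params = hyper_params,
    search_criteria = search_criteria,
    epochs = 100,
    # Early stopping inside each individual model
    stopping_metric = "logloss",
    stopping_rounds = 3,
    stopping_tolerance = 1e-3
  )
}
```

Note that there are two levels of early stopping here: the stopping_* entries in search_criteria end the search as a whole, whereas the stopping_* arguments passed to h2o.grid end the training of each single network. The models in the returned grid can then be ranked with h2o.getGrid, for example by AUC.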

See you in the next tutorial, and thank you for joining us.

END

MLM Case study X6

Le Ngoc Kha Nhi (MD,PhD)

17 April 2017
