The purpose of this project is to optimize the hyper-parameters of a multilayer artificial neural network (ANN) used to recognize handwritten digits. The “H2O” package is one of the best and easiest-to-use packages in R for deep learning applications. Vincenzo Lomonaco, in his Udemy course, provides an excellent introduction to implementing neural networks in R. ANNs are instrumental in the field of computer vision. However, network parameters such as the number of neurons, the activation function, and the regularization method often have to be selected manually. As illustrated below, the “H2O” package provides built-in functions that can select the best hyper-parameters for the model.
The training set from Kaggle’s competition will be used in this project. This set contains 42,000 images of 28x28 pixels. Although Kaggle provides a separate testing set, for the purposes of this exercise the training set will be split into three subsets used for training, validation and testing: 26,000 images for training, 8,000 for validation and another 8,000 for testing. The first column of the data set contains the labels, and the remaining 784 columns contain the pixel values. Undoubtedly, a larger training set would result in higher model accuracy. However, the goal here is to demonstrate how the “H2O” package can be used to optimize the hyper-parameters of a multilayer ANN.
The “H2O” package is easy to install and use in R. It supports parallel computing on multi-core machines. Similarly to other deep learning packages (for example “MxNet”), it requires the data set to be transformed into a special format. In the case of “H2O”, data are converted into an H2O object with the as.h2o command.
# Loading "H2O":
library(h2o)
# Loading data set:
digits.data <- read.csv("dataset.csv")
# This optional function switches off the progress bar, which is displayed by default in the "H2O" package:
h2o.no_progress()
# Using 3 of the 4 cores available on the machine; the 4th core is left for the OS, as calculations may take up to 20 minutes:
h2o.init(nthreads=3)
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\Ulpan\AppData\Local\Temp\RtmpOw5sDw/h2o_Ulpan_started_from_r.out
## C:\Users\Ulpan\AppData\Local\Temp\RtmpOw5sDw/h2o_Ulpan_started_from_r.err
##
##
## Starting H2O JVM and connecting: ......... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 39 seconds 513 milliseconds
## H2O cluster timezone: America/Chicago
## H2O data parsing timezone: UTC
## H2O cluster version: 3.20.0.2
## H2O cluster version age: 3 months and 1 day
## H2O cluster name: H2O_started_from_R_Ulpan_qum829
## H2O cluster total nodes: 1
## H2O cluster total memory: 0.87 GB
## H2O cluster total cores: 0
## H2O cluster allowed cores: 0
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.3 (2017-11-30)
# Splitting the data into training, validation and testing sets:
train <- digits.data[1:26000,]
valid <- digits.data[26001:34000,]
test <- digits.data[34001:42000,]
# Converting labels into a factor variable so that H2O treats the task as classification:
train$label <- as.factor(train$label)
# Transforming data sets into H2O objects:
train_h2o <- as.h2o(train)
valid_h2o <- as.h2o(valid)
test_h2o <- as.h2o(test)
The package offers an excellent choice of hyper-parameters to evaluate for a multilayer ANN: activation functions, the number of neurons in the hidden layers, and regularization. Activation functions can be selected with a “dropout” option, which randomly drops out some units of the ANN during training; dropout is one of the most powerful regularization methods for avoiding overfitting. Three models with different numbers of hidden layers are selected for evaluation:
ANN with 4 hidden layers containing 349, 174, 87 and 29 neurons respectively
ANN with 3 hidden layers containing 174, 87 and 29 neurons respectively
ANN with 2 hidden layers containing 87 and 29 neurons respectively
The number of neurons (units) in each layer was selected so as to progressively shrink the 784-element input vector from layer to layer. Additionally, a dropout rate for the input layer can be specified with the ‘input_dropout_ratio’ option. Three different input dropout rates are tried in this case: 0, 5% and 10%. Lasso (L1) and Ridge (L2) regularization are also added. Three values of the \(\lambda\) parameter are tested: 0, \(10^{-5}\) and \(10^{-4}\).
The documentation for the h2o.grid function summarizes the many options available for optimizing ANN hyper-parameters. The ‘strategy’ option in ‘search_criteria’ selects between an exhaustive search over all combinations of the specified hyper-parameter values (“Cartesian”) and a random sample of those combinations (“RandomDiscrete”); a Cartesian alternative is sketched just after the grid-search code below.
h2o.no_progress()
# Selecting hyper-parameters:
hyper_params <- list(
  activation = c("Rectifier", "Tanh", "Maxout", "RectifierWithDropout", "TanhWithDropout", "MaxoutWithDropout"),
  hidden = list(c(349, 174, 87, 29), c(174, 87, 29), c(87, 29)),
  input_dropout_ratio = c(0, 0.05, 0.1),
  l1 = c(0, 1e-5, 1e-4),
  l2 = c(0, 1e-5, 1e-4))
# Defining the model search criteria: stop after 600 seconds or 100 models, or once the best models
# stop improving by more than 1% (stopping_tolerance) over a moving window of 10 models (stopping_rounds):
search_criteria <- list(strategy = "RandomDiscrete", max_runtime_secs = 600, max_models = 100,
                        seed = 1234567, stopping_rounds = 10, stopping_tolerance = 1e-2)
# Running search for the optimal models:
dl_random_grid <- h2o.grid(algorithm = "deeplearning", grid_id = "dl_grid_random",
                           training_frame = train_h2o, validation_frame = valid_h2o,
                           x = 2:785, y = 1, epochs = 1,
                           stopping_metric = "logloss", stopping_tolerance = 1e-2, stopping_rounds = 3,
                           hyper_params = hyper_params, search_criteria = search_criteria)
# Sorting models by validation logloss (best model first):
grid <- h2o.getGrid("dl_grid_random", sort_by = "logloss", decreasing = FALSE)
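For comparison, an exhaustive “Cartesian” search could be requested by specifying only the strategy in ‘search_criteria’. The snippet below is just a sketch and is not used in this report: with the grid defined above it would evaluate all 486 hyper-parameter combinations and take considerably longer than the random search.
# Exhaustive alternative (not run): every combination of the hyper-parameter grid is evaluated.
search_criteria_cartesian <- list(strategy = "Cartesian")
# The same h2o.grid call as above would then be used with search_criteria = search_criteria_cartesian.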
Let’s look at the three top-performing models selected by “H2O”. We will examine each model’s hyper-parameters and its accuracy on the training, validation and test sets.
The following code identifies the top-performing ANN model:
h2o.no_progress()
grid@summary_table[1,]
## Hyper-Parameter Search Summary: ordered by increasing logloss
## activation hidden input_dropout_ratio l1 l2
## 1 Tanh [349, 174, 87, 29] 0.0 0.0 0.0
## model_ids logloss
## 1 dl_grid_random_model_0 0.23133493030865912
best_model <- h2o.getModel(grid@model_ids[[1]])
best_model
## Model Details:
## ==============
##
## H2OMultinomialModel: deeplearning
## Model ID: dl_grid_random_model_0
## Status of Neuron Layers: predicting label, 10-class classification, multinomial distribution, CrossEntropy loss, 322,928 weights/biases, 3.8 MB, 26,811 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 698 Input 0.00 %
## 2 2 349 Tanh 0.00 % 0.000000 0.000000 0.151625 0.249622
## 3 3 174 Tanh 0.00 % 0.000000 0.000000 0.009515 0.003414
## 4 4 87 Tanh 0.00 % 0.000000 0.000000 0.005023 0.001855
## 5 5 29 Tanh 0.00 % 0.000000 0.000000 0.002029 0.000676
## 6 6 10 Softmax 0.000000 0.000000 0.004179 0.002113
## momentum mean_weight weight_rms mean_bias bias_rms
## 1
## 2 0.000000 0.000112 0.050430 -0.000983 0.031318
## 3 0.000000 0.000365 0.065661 -0.009643 0.056597
## 4 0.000000 0.000755 0.094077 0.002085 0.057244
## 5 0.000000 0.002955 0.124758 -0.001565 0.060723
## 6 0.000000 -0.015762 0.918220 -0.048128 0.075586
##
##
## H2OMultinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 10087 samples **
##
## Training Set Metrics:
## =====================
##
## MSE: (Extract with `h2o.mse`) 0.03688055
## RMSE: (Extract with `h2o.rmse`) 0.1920431
## Logloss: (Extract with `h2o.logloss`) 0.1468963
## Mean Per-Class Error: 0.04346478
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## 0 1 2 3 4 5 6 7 8 9 Error Rate
## 0 935 0 2 0 1 7 4 0 3 1 0.0189 = 18 / 953
## 1 0 1129 9 11 1 1 2 8 5 3 0.0342 = 40 / 1,169
## 2 6 0 975 8 6 1 2 9 5 2 0.0385 = 39 / 1,014
## 3 2 1 13 967 2 24 2 8 9 5 0.0639 = 66 / 1,033
## 4 1 0 5 0 951 2 2 4 0 2 0.0165 = 16 / 967
## 5 3 0 0 11 8 878 6 3 10 3 0.0477 = 44 / 922
## 6 6 1 2 0 6 3 992 0 2 0 0.0198 = 20 / 1,012
## 7 2 2 4 2 1 0 2 1011 1 7 0.0203 = 21 / 1,032
## 8 3 6 4 22 5 10 1 4 914 11 0.0673 = 66 / 980
## 9 1 0 1 3 66 4 1 28 4 897 0.1075 = 108 / 1,005
## Totals 959 1139 1015 1024 1047 930 1014 1075 953 931 0.0434 = 438 / 10,087
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-10 Hit Ratios:
## k hit_ratio
## 1 1 0.956578
## 2 2 0.985625
## 3 3 0.992961
## 4 4 0.995935
## 5 5 0.997522
## 6 6 0.998612
## 7 7 0.999207
## 8 8 0.999306
## 9 9 0.999802
## 10 10 1.000000
##
##
## H2OMultinomialMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
##
## Validation Set Metrics:
## =====================
##
## Extract validation frame with `h2o.getFrame("valid")`
## MSE: (Extract with `h2o.mse`) 0.05458837
## RMSE: (Extract with `h2o.rmse`) 0.2336415
## Logloss: (Extract with `h2o.logloss`) 0.2313349
## Mean Per-Class Error: 0.06271125
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## 0 1 2 3 4 5 6 7 8 9 Error Rate
## 0 774 0 2 3 0 1 5 1 2 3 0.0215 = 17 / 791
## 1 0 850 11 9 1 0 0 7 4 0 0.0363 = 32 / 882
## 2 5 1 765 6 2 3 2 11 8 2 0.0497 = 40 / 805
## 3 3 1 20 721 1 19 1 14 14 6 0.0988 = 79 / 800
## 4 3 0 4 2 710 1 0 2 0 1 0.0180 = 13 / 723
## 5 6 0 2 17 2 691 9 3 12 6 0.0762 = 57 / 748
## 6 8 2 4 0 3 9 736 1 6 0 0.0429 = 33 / 769
## 7 5 1 5 0 1 2 1 827 0 11 0.0305 = 26 / 853
## 8 0 12 11 9 9 13 2 9 713 10 0.0952 = 75 / 788
## 9 2 2 0 7 72 12 0 29 9 708 0.1581 = 133 / 841
## Totals 806 869 824 774 801 751 756 904 768 747 0.0631 = 505 / 8,000
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
## =======================================================================
## Top-10 Hit Ratios:
## k hit_ratio
## 1 1 0.936875
## 2 2 0.976125
## 3 3 0.986750
## 4 4 0.992375
## 5 5 0.995125
## 6 6 0.996625
## 7 7 0.998125
## 8 8 0.999000
## 9 9 0.999625
## 10 10 1.000000
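Instead of reading the headline numbers off the printed model summary, they can be extracted programmatically. The sketch below uses standard “H2O” accessor functions on the model object:
# Logloss on the training and validation frames:
h2o.logloss(best_model, train = TRUE, valid = TRUE)
# Top-1 hit ratio, i.e. overall accuracy, on the validation frame (first row of the hit ratio table):
h2o.hit_ratio_table(best_model, valid = TRUE)[1, 2]
# Confusion matrix on the validation frame:
h2o.confusionMatrix(best_model, valid = TRUE)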
The best model uses the hyperbolic tangent (tanh) activation function and all four hidden layers with 349, 174, 87 and 29 units. No regularization (dropout, Lasso or Ridge) was necessary to achieve the error rates reported above for the training and validation sets. The model’s accuracy (100% minus the error rate) is about 95.7% on the training set and 93.7% on the validation set. Finally, the accuracy on the testing set, shown below, is about 94.4%. The higher accuracy on the training set compared to the validation and testing sets indicates that the model overfits slightly.
yhat <- h2o.predict(best_model, test_h2o)
h2o.confusionMatrix(best_model, test_h2o)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## 0 1 2 3 4 5 6 7 8 9 Error Rate
## 0 789 0 3 2 2 3 11 0 2 1 0.0295 = 24 / 813
## 1 0 887 6 4 2 0 1 7 5 1 0.0285 = 26 / 913
## 2 4 1 731 3 1 5 5 4 7 2 0.0419 = 32 / 763
## 3 0 2 21 773 1 22 1 8 8 6 0.0819 = 69 / 842
## 4 2 1 3 0 762 0 6 1 2 2 0.0218 = 17 / 779
## 5 5 2 5 12 7 653 6 0 16 2 0.0777 = 55 / 708
## 6 7 3 5 0 5 3 768 0 1 0 0.0303 = 24 / 792
## 7 2 0 8 4 4 5 1 803 0 5 0.0349 = 29 / 832
## 8 3 15 4 7 9 7 3 4 716 3 0.0713 = 55 / 771
## 9 3 2 1 7 60 13 2 25 6 668 0.1512 = 119 / 787
## Totals 815 913 787 812 853 711 804 852 763 690 0.0563 = 450 / 8,000
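The overall test-set accuracy can also be computed directly from the predictions stored in yhat above. This is a minimal sketch; ‘predicted’ and ‘actual’ are helper vectors introduced here for illustration:
# Predicted classes as a plain R vector:
predicted <- as.vector(yhat$predict)
# True labels from the first column of the original test data frame:
actual <- test$label
# Fraction of correctly classified digits on the test set:
mean(as.character(predicted) == as.character(actual))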
The next two models are somewhat less accurate, with error rates of 6-7%, which translates into 93-94% accuracy. In order to save space in this report, only the final hyper-parameters selected by “H2O” are listed.
h2o.no_progress()
# Model 2
grid@summary_table[2,]
## Hyper-Parameter Search Summary: ordered by increasing logloss
## activation hidden input_dropout_ratio l1 l2 model_ids
## 2 Tanh [87, 29] 0.05 0.0 0.0 dl_grid_random_model_10
## logloss
## 2 0.2587646621186195
# Model 3
grid@summary_table[3,]
## Hyper-Parameter Search Summary: ordered by increasing logloss
## activation hidden input_dropout_ratio l1 l2
## 3 Rectifier [174, 87, 29] 0.1 0.0 0.0
## model_ids logloss
## 3 dl_grid_random_model_5 0.26394789491360127
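For completeness, the second- and third-best models can be retrieved by their ids from the sorted grid and evaluated on the test frame in the same way as the best model (a short sketch reusing the calls shown earlier):
# Retrieving the 2nd and 3rd best models:
model2 <- h2o.getModel(grid@model_ids[[2]])
model3 <- h2o.getModel(grid@model_ids[[3]])
# Confusion matrices on the test set:
h2o.confusionMatrix(model2, test_h2o)
h2o.confusionMatrix(model3, test_h2o)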
Interestingly, the second- and third-best models use fewer neurons (two or three hidden layers instead of four!), and the third model uses the rectifier instead of the hyperbolic tangent as its activation function. Both models apply a small dropout rate to the input layer (5% and 10%, respectively) as regularization. However, their accuracy is not much different from that of the best-performing model. If you try to replicate this code, you may get different models from the ones shown above: the hyper-parameter search starts from a random selection, which can be controlled to some extent by setting the seed value, but outcomes can still vary from run to run. This report therefore demonstrates only one of many possible outcomes for the ANN’s hyper-parameters.
Artificial neural networks are extremely powerful algorithms when it comes to computer vision. Even a simple ANN achieves quite high accuracy in recognizing handwritten digits, probably only marginally worse than human performance on the task. ANNs such as convolutional neural networks are even better at image recognition than the simple multilayer ANNs described in this report. Comparison of the three best-performing models suggests that the rectifier and the hyperbolic tangent are probably the best activation functions. Dropout regularization seems to work better than L1 and L2 regularization. Interestingly, the best-performing ANNs do not require a large number of units. ANNs are clearly powerful tools for a variety of applications. In my opinion, Python offers more options for deep learning than R; nevertheless, “H2O” is one of the best packages for machine learning algorithms in R.