H2O in practice

Introduction to H2O

The h2o package is the R interface to ‘H2O’, the scalable open-source machine learning platform that offers parallelized implementations of many supervised and unsupervised machine learning algorithms, such as Generalized Linear Models, Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks (Deep Learning), Stacked Ensembles, Naive Bayes, Cox Proportional Hazards, K-Means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (AutoML).

H2O runs as a Java Virtual Machine (JVM) optimized for doing “in memory” processing of distributed, parallel machine learning algorithms on clusters. A “cluster” here is a software construct that can be fired up on your laptop, on a server, or across the multiple nodes of a cluster of real machines, including computers that form a Hadoop cluster.

Under the covers, the H2O JVM sits on an in-memory, non-persistent key-value (KV) store that uses a distributed Java memory model. The KV store holds state information, all results and the big data itself. H2O keeps the data in the JVM heap; when the heap gets full, i.e. when you are working with more data than physical DRAM, H2O swaps to disk. The main point is that the data is not in R: R only holds a pointer to the data, an S4 object containing the IP address, port and key name for the data sitting in H2O.
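
To make the “pointer, not data” idea concrete, here is a minimal sketch (using R’s built-in iris data as a stand-in; none of these objects belong to the example below) showing that an H2OFrame on the R side is only a reference to data held in the cluster’s KV store:

library(h2o)
h2o.init()               # start (or connect to) a local H2O cluster
iris.h2o <- as.h2o(iris) # the data itself is copied into the H2O KV store
class(iris.h2o)          # "H2OFrame": an R-side reference, not the data
h2o.ls()                 # keys of the objects currently held in the KV store
h2o.clusterInfo()        # IP, port, memory and node details of the cluster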

More info here: the booklet Machine Learning with R and H2O

WiFi indoor positioning example

Load packages and data

# pacman::p_load installs (if needed) and loads the listed packages
pacman::p_load(readr, h2o, rstudioapi, caret)

Import datasets

validation <- read.csv("validationData.csv")

train <- read_csv("trainingData.csv", na = c("N/A"))
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   LONGITUDE = col_double(),
##   LATITUDE = col_double(),
##   SPACEID = col_character()
## )
## See spec(...) for full column specifications.

Launch the H2O cluster

Note that the function h2o.init() uses the defaults to start up H2O on your local machine. Users can also provide parameters specifying an IP address and port number in order to connect to a remote instance of H2O already running on a cluster (a sketch of such a connection follows the startup output below).

# To launch the H2O cluster, run:

h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         22 minutes 43 seconds 
##     H2O cluster timezone:       Europe/Paris 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.22.1.1 
##     H2O cluster version age:    4 months and 24 days !!! 
##     H2O cluster name:           H2O_started_from_R_gabri_agd448 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.40 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.1 (2018-07-02)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 24 days)!
## Please download and install the latest version from http://h2o.ai/download/
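
As mentioned above, you can also connect to an H2O instance that is already running elsewhere instead of starting a local one. A hedged sketch (the IP address below is a placeholder, not a value from this example):

# Connect to a remote, already-running H2O cluster; do not start a local JVM
h2o.init(ip = "10.20.30.40", port = 54321, startH2O = FALSE)
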
# Send the training and validation data to the H2O cluster
train.h2o <- as.h2o(train)
## 
  |=================================================================| 100%
test.h2o <- as.h2o(validation)
## 
  |=================================================================| 100%
# Dependent variable: column 522 (LATITUDE)
y.dep <- 522

# Independent variables: columns 1-520 (the WAP signal strengths)
x.indep <- c(1:520)
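
As a hedged aside (not used in the models below), the same columns can be specified by name rather than position; this assumes the standard UJIIndoorLoc-style layout in which the predictors are the WAP* columns and the target is LATITUDE:

# Name-based alternative to the positional indices above (illustrative only)
y.name  <- "LATITUDE"
x.names <- grep("^WAP", names(train.h2o), value = TRUE)
# h2o.glm(y = y.name, x = x.names, training_frame = train.h2o, family = "gaussian")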

regression.model <- h2o.glm(y = y.dep, x = x.indep, training_frame = train.h2o, family = "gaussian")
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [WAP004, WAP246, WAP444, WAP488, WAP247, WAP445, WAP244, WAP365, WAP442, WAP003, WAP245, WAP487, WAP520, WAP242, WAP243, WAP441, WAP485, WAP240, WAP482, WAP241, WAP360, WAP160, WAP158, WAP433, WAP159, WAP353, WAP152, WAP239, WAP438, WAP238, WAP301, WAP423, WAP307, WAP429, WAP349, WAP226, WAP303, WAP227, WAP304, WAP497, WAP333, WAP451, WAP254, WAP296, WAP095, WAP293, WAP491, WAP093, WAP094, WAP092, WAP419, WAP217, WAP416, WAP215, WAP458].
## 
  |=================================================================| 100%
h2o.performance(regression.model)
## H2ORegressionMetrics: glm
## ** Reported on training data. **
## 
## MSE:  343.3843
## RMSE:  18.53063
## MAE:  13.75071
## RMSLE:  3.809061e-06
## Mean Residual Deviance :  343.3843
## R^2 :  0.9233728
## Null Deviance :89342432
## Null D.o.F. :19936
## Residual Deviance :6846053
## Residual D.o.F. :19537
## AIC :173789.9
# Make predictions on the validation set
predict.reg <- as.data.frame(h2o.predict(regression.model, test.h2o))
## 
  |=================================================================| 100%
postResample(predict.reg, validation$LATITUDE)
##       RMSE   Rsquared        MAE 
## 28.5500579  0.8350735 21.1030353
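
Note that the metrics reported by h2o.performance() above are computed on the training data and are therefore optimistic, which helps explain why the validation RMSE from postResample() is noticeably worse. A hedged sketch of adding k-fold cross-validation to the same GLM for a less optimistic estimate (the nfolds value is an arbitrary choice, not part of the original analysis):

# Same GLM with 5-fold cross-validation (illustrative settings)
regression.cv <- h2o.glm(y = y.dep, x = x.indep, training_frame = train.h2o,
                         family = "gaussian", nfolds = 5, seed = 1122)
h2o.performance(regression.cv, xval = TRUE)  # cross-validated metrics
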
# Random Forest
system.time(
  rforest.model <- h2o.randomForest(y=y.dep, x=x.indep, training_frame = train.h2o, 
                                    ntrees = 1000, mtries = 3, max_depth = 4, seed = 1122))
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [WAP004, WAP246, WAP444, WAP488, WAP247, WAP445, WAP244, WAP365, WAP442, WAP003, WAP245, WAP487, WAP520, WAP242, WAP243, WAP441, WAP485, WAP240, WAP482, WAP241, WAP360, WAP160, WAP158, WAP433, WAP159, WAP353, WAP152, WAP239, WAP438, WAP238, WAP301, WAP423, WAP307, WAP429, WAP349, WAP226, WAP303, WAP227, WAP304, WAP497, WAP333, WAP451, WAP254, WAP296, WAP095, WAP293, WAP491, WAP093, WAP094, WAP092, WAP419, WAP217, WAP416, WAP215, WAP458].
## 
  |=================================================================| 100%
##    user  system elapsed 
##    0.28    0.03    6.51
h2o.performance(rforest.model)
## H2ORegressionMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## MSE:  4200.068
## RMSE:  64.80794
## MAE:  56.35595
## RMSLE:  1.33216e-05
## Mean Residual Deviance :  4200.068
h2o.varimp(rforest.model)
## Variable Importances: 
##   variable relative_importance scaled_importance percentage
## 1   WAP052    309852864.000000          1.000000   0.080748
## 2   WAP161    198004784.000000          0.639028   0.051600
## 3   WAP066    160390144.000000          0.517633   0.041798
## 4   WAP162    152954240.000000          0.493635   0.039860
## 5   WAP517    123277664.000000          0.397859   0.032126
## 
## ---
##     variable relative_importance scaled_importance percentage
## 460   WAP506            0.000000          0.000000   0.000000
## 461   WAP507            0.000000          0.000000   0.000000
## 462   WAP509            0.000000          0.000000   0.000000
## 463   WAP510            0.000000          0.000000   0.000000
## 464   WAP514            0.000000          0.000000   0.000000
## 465   WAP519            0.000000          0.000000   0.000000
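
A quick way to visualise the top of this importance ranking (a hedged addition, not part of the original run):

# Plot the 20 most important WAPs for the random forest
h2o.varimp_plot(rforest.model, num_of_features = 20)
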
# Make predictions on unseen data
system.time(predict.rforest <- as.data.frame(h2o.predict(rforest.model, test.h2o)))
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##    0.09    0.00    1.23
postResample(predict.rforest, validation$LATITUDE)
##      RMSE  Rsquared       MAE 
## 72.823783  0.794539 62.813439
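
The random forest above underperforms the GLM on the validation set, most likely because its trees are very shallow (max_depth = 4) and each split considers only 3 candidate WAPs (mtries = 3). A hedged sketch of searching over these hyperparameters with h2o.grid (the grid values and ntrees are illustrative, not from the original analysis):

# Grid search over tree depth and mtries for the random forest (illustrative)
rf.grid <- h2o.grid("randomForest",
                    x = x.indep, y = y.dep, training_frame = train.h2o,
                    hyper_params = list(max_depth = c(10, 20, 30),
                                        mtries = c(20, 50, 100)),
                    ntrees = 200, seed = 1122)
h2o.getGrid(rf.grid@grid_id, sort_by = "rmse", decreasing = FALSE)
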
# Gradient Boosting Machine (GBM)
system.time(
  gbm.model <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 1000, max_depth = 4, learn_rate = 0.01, seed = 1122)
)
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [WAP004, WAP246, WAP444, WAP488, WAP247, WAP445, WAP244, WAP365, WAP442, WAP003, WAP245, WAP487, WAP520, WAP242, WAP243, WAP441, WAP485, WAP240, WAP482, WAP241, WAP360, WAP160, WAP158, WAP433, WAP159, WAP353, WAP152, WAP239, WAP438, WAP238, WAP301, WAP423, WAP307, WAP429, WAP349, WAP226, WAP303, WAP227, WAP304, WAP497, WAP333, WAP451, WAP254, WAP296, WAP095, WAP293, WAP491, WAP093, WAP094, WAP092, WAP419, WAP217, WAP416, WAP215, WAP458].
## 
  |=================================================================| 100%
##    user  system elapsed 
##    0.74    0.15   79.71
h2o.performance(gbm.model)
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  872.2626
## RMSE:  29.53409
## MAE:  24.45155
## RMSLE:  6.07087e-06
## Mean Residual Deviance :  872.2626
predict.gbm <- as.data.frame(h2o.predict(gbm.model, test.h2o))
## 
  |=================================================================| 100%
postResample(predict.gbm, validation$LATITUDE)
##        RMSE    Rsquared         MAE 
## 136.1527664   0.6246418 131.4833031
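
Once the analysis is finished, the H2O cluster can be shut down to free its memory (a standard closing step, not shown in the original run):

# Shut down the local H2O cluster without asking for confirmation
h2o.shutdown(prompt = FALSE)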

Gabriel Ristow Cidral

11/04/2019