Introduction

This is a quick demo for a presentation on H2O:

http://www.h2o.ai

Preliminaries

H2O on Docker

There are many ways to run H2O. I’ll be using my own H2O Docker image to run the subsequent code chunks. The only prerequisite is that Docker is properly configured on your host. Note that H2O also has an official Docker image.

Read more about my docker image here: https://github.com/bwv988/docker-h2oai

On Linux, I would simply start H2O from bash like so:

docker run -it -p 54321:54321 --name h2oai --rm bwv988/h2oai

If the image isn’t present locally, it will be pulled from the central Docker registry. CTRL + C terminates and removes the container.

Once the container is up and running, log files can be inspected by opening a shell into the container:

docker exec -it h2oai bash

Then, logs can be inspected in the logs/ sub-directory.

H2O R package setup

In order to access H2O, we need to install and load the required H2O R package:

# NOTE: Below is set to FALSE to not re-install when "knitting".
# Set doit to TRUE to run through the install.
doit = FALSE

if(doit) {
  # Check and remove previous installations.
  if ("package:h2o" %in% search()) { 
    detach("package:h2o", unload=TRUE) 
  }
  
  if ("h2o" %in% rownames(installed.packages())) { 
    remove.packages("h2o") 
  }
  
  # Next, we download packages that H2O depends on.
  pkgs = c("statmod", "RCurl", "jsonlite")
  for (pkg in pkgs) {
    if (! (pkg %in% rownames(installed.packages()))) { 
      install.packages(pkg) 
    }
  }
  
  # Download, install and initialize the H2O package for R.
  install.packages("h2o", type="source", repos="http://h2o-release.s3.amazonaws.com/h2o/rel-vajda/4/R")
}

# Load the library.
require(h2o)

Now we are in a position to connect R to the H2O instance running in the Docker container:

# This instructs H2O to connect to the instance running in the container.

# Use the below when running the Docker container locally.
host.ip = "127.0.0.1"
h2o.init(ip = host.ip, startH2O = FALSE)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         25 minutes 32 seconds 
##     H2O cluster version:        3.10.5.4 
##     H2O cluster version age:    10 days  
##     H2O cluster name:           H2ODemo 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   12.45 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          127.0.0.1 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.4.1 (2017-06-30)
# Disable progress bars for Rpubs. ;)
h2o.no_progress()

The output tells us something about the size and version of the H2O cloud we’re connecting to.
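
The same details can be queried again at any point; for instance, h2o.clusterInfo() re-prints the connection and cluster summary (a minimal check, assuming the connection above succeeded; output not shown):

# Re-print the cluster summary for the current connection.
h2o.clusterInfo()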

Getting data into H2O

Load an R dataset into H2O

Let’s just load R’s built-in Iris data set, and make it accessible to H2O.

# Attach Iris data to current environment.
data(iris)

# Import the data from R into H2O.
# NOTE: We only retain a handle to the data in R.
# The destination_frame argument assigns the frame's key in the H2O cloud,
# which will be useful later.
iris.data = as.h2o(iris, destination_frame = "iris.hex")

It’s important to understand that the variable iris.data is merely a handle to data in the H2O cloud.
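
One way to make this concrete is to list what actually lives in the H2O cloud: h2o.ls() shows the key we assigned via destination_frame, and h2o.getFrame() builds a fresh R handle from that key (a small sketch; output not shown):

# List the keys of objects currently held in the H2O cloud.
h2o.ls()

# Re-attach a handle to the frame by its key.
iris.data2 = h2o.getFrame("iris.hex")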

So now we can run h2o.describe() on the H2OFrame object we’ve created:

# Describe the data set.
h2o.describe(iris.data)
##          Label Type Missing Zeros PosInf NegInf Min Max     Mean     Sigma
## 1 Sepal.Length real       0     0      0      0 4.3 7.9 5.843333 0.8280661
## 2  Sepal.Width real       0     0      0      0 2.0 4.4 3.057333 0.4358663
## 3 Petal.Length real       0     0      0      0 1.0 6.9 3.758000 1.7652982
## 4  Petal.Width real       0     0      0      0 0.1 2.5 1.199333 0.7622377
## 5      Species enum       0    50      0      0 0.0 2.0       NA        NA
##   Cardinality
## 1          NA
## 2          NA
## 3          NA
## 4          NA
## 5           3

Evaluating the output of str() confirms that iris.data is indeed an H2OFrame.

str(iris.data)
## Class 'H2OFrame' <environment: 0x5d28b38> 
##  - attr(*, "op")= chr "Parse"
##  - attr(*, "id")= chr "iris.hex"
##  - attr(*, "eval")= logi FALSE
##  - attr(*, "nrow")= int 150
##  - attr(*, "ncol")= int 5
##  - attr(*, "types")=List of 5
##   ..$ : chr "real"
##   ..$ : chr "real"
##   ..$ : chr "real"
##   ..$ : chr "real"
##   ..$ : chr "enum"
##  - attr(*, "data")='data.frame': 10 obs. of  5 variables:
##   ..$ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9
##   ..$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1
##   ..$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5
##   ..$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1
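
Conversely, as.data.frame() pulls the contents of an H2OFrame back into a plain R data frame. That is harmless for a 150-row data set like this one, but should be avoided for genuinely large frames (a quick sketch; output not shown):

# Export the H2O frame back into R proper; fine here, dangerous on big data.
iris.local = as.data.frame(iris.data)
class(iris.local)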

Load a CSV file from the internet into H2O

In the upcoming example, I’ll be using the 1984 Congressional Voting Records data set from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/congressional+voting+records

The data set records the outcomes of 16 key votes for each of the 435 members of the U.S. House of Representatives, along with their party affiliation.

We’ll do something more interesting with this data later, but for now, here is how to load remote files:

voting.url = "https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data"

# Name the feature columns.
feature.cols = paste0("vote", 1:16)

# Use some extra args to specify column names and types.
voting.data.raw = h2o.importFile(path = voting.url, 
                             col.names = c("party", feature.cols), 
                             col.types = rep("string", 17), 
                             destination_frame = "voting.hex")

Once the data has been fetched, I can use an overloaded version of the head() function to peek into the data set:

# There are some issues with some entries.
head(voting.data.raw)
##        party vote1 vote2 vote3 vote4 vote5 vote6 vote7 vote8 vote9 vote10
## 1 republican     n     y     n     y     y     y     n     n     n      y
## 2 republican     n     y     n     y     y     y     n     n     n      n
## 3   democrat     ?     y     y     ?     y     y     n     n     n      n
## 4   democrat     n     y     y     n     ?     y     n     n     n      n
## 5   democrat     y     y     y     n     y     y     n     n     n      n
## 6   democrat     n     y     y     n     y     y     n     n     n      n
##   vote11 vote12 vote13 vote14 vote15 vote16
## 1      ?      y      y      y      n      y
## 2      n      y      y      y      n      ?
## 3      y      n      y      y      n      n
## 4      y      n      y      n      n      y
## 5      y      ?      y      y      y      y
## 6      n      n      y      y      y      y
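
H2O also overloads dim(), nrow() and ncol(), so we can quickly check the frame against the 435 rows and 17 columns (16 votes plus party) we expect (output not shown):

# Sanity check: should report 435 rows and 17 columns.
dim(voting.data.raw)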

Data manipulation in H2O

To make things a bit more readable, we supplied the column headings when loading the file. Next, for simplicity, all missing values ("?") are replaced with "n". Lastly, we ensure that all variables are treated as categorical (or enum, in H2O terminology).

This and the earlier examples show how H2O implements some neat syntactic sugar by overloading standard R functions and operators, which makes dealing with big data from within R relatively easy.

However, there are some caveats:

# In plain R I would simply do this:
#
# voting.data.raw[voting.data.raw == "?"] = "n"
#
# However, H2O's data frame semantics do not yet support this.

# The best solution, as per Erin LeDell's suggestion, is to use h2o.sub():
voting.data = h2o.sub("\\?", "n", x = voting.data.raw)

# We want the variables to be categorical (enum).
voting.data = h2o.asfactor(voting.data)

# The above currently has an unexpected side effect and removes the column
# names, so for now we need a temporary fix.
colnames(voting.data) = colnames(voting.data.raw)

# Looks better now.
head(voting.data)
##        party vote1 vote2 vote3 vote4 vote5 vote6 vote7 vote8 vote9 vote10
## 1 republican     n     y     n     y     y     y     n     n     n      y
## 2 republican     n     y     n     y     y     y     n     n     n      n
## 3   democrat     n     y     y     n     y     y     n     n     n      n
## 4   democrat     n     y     y     n     y     y     n     n     n      n
## 5   democrat     y     y     y     n     y     y     n     n     n      n
## 6   democrat     n     y     y     n     y     y     n     n     n      n
##   vote11 vote12 vote13 vote14 vote15 vote16
## 1      n      y      y      y      n      y
## 2      n      y      y      y      n      n
## 3      y      n      y      y      n      n
## 4      y      n      y      n      n      y
## 5      y      n      y      y      y      y
## 6      n      n      y      y      y      y
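
As a further check that the cleaning worked, we can inspect the factor levels and the class balance directly in the cloud; h2o.levels() should no longer list "?" among the levels (a sketch; output not shown):

# Levels of the first vote column; "?" should be gone.
h2o.levels(voting.data["vote1"])

# Class balance of the response.
h2o.table(voting.data["party"])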

Example: US Congressional voting records analysis

In this example, we’ll apply and compare two different H2O classification algorithms to the voting records loaded before.

The goal is to learn from the data and correctly predict a congress member’s party affiliation based on their voting behavior.

Formally, let \(y_i\) be the party affiliation of the \(i\)th congress member, and let \(x_{ij}\) be their vote on the \(j\)th issue, where \(i = 1, \dots, 435\) and \(j = 1, \dots, 16\). For convenience we encode this as:

\[ \begin{aligned} x_{ij} & = \begin{cases} 0, & \text{if member } i \text{ voted "no" on issue } j\\ 1, & \text{if member } i \text{ voted "yes" on issue } j, \end{cases} \\ y_i & = \begin{cases} 0, & \text{if member } i \text{ is a Republican}\\ 1, & \text{if member } i \text{ is a Democrat.} \end{cases} \end{aligned} \]

Splitting the data

As usual, we split the data into training and test sets.

voting.split = h2o.splitFrame(data = voting.data, ratios = 0.8)
voting.train = voting.split[[1]]
voting.test = voting.split[[2]]

# Identify response column.
response = "party"

GLM: Logistic regression

We can perform classification with h2o.glm() simply by setting the family parameter to binomial.

As usual, let \(X\) refer to the design matrix. The idea is to model the logit of the response as a linear function of the predictors; i.e. we are interested in the probabilities \(\pi(\underline{x}_i) = \text{Pr}\{y_i = 1 \mid \underline{x}_i\}\), and the modelling approach is:

\[ \begin{aligned} \pi(X) &= \frac{\exp(X^T \underline{\beta})}{1 + \exp(X^T \underline{\beta})}, \\ \frac{\pi(X)}{1 - \pi(X)} &= \exp(X^T \underline{\beta}), \text{ i.e.}\\ \log \Big( \frac{\pi(X)}{1 - \pi(X)} \Big) &= X^T \underline{\beta} \end{aligned} \] H2O’s implementation of GLM yields penalized maximum likelihood estimates of the parameters \(\underline{\beta}\).

There are many tuning options; for example, we can influence the amount of penalization by choosing the elastic net mixing ratio between the LASSO and ridge penalties.
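
As an illustration (not the fit we use below, which sticks to the defaults), alpha mixes the two penalties — alpha = 1 is pure LASSO, alpha = 0 pure ridge — and lambda_search asks H2O to scan a range of penalty strengths:

# Hypothetical regularized variant: elastic net mixing plus a lambda search.
fit.glm.reg = h2o.glm(x = feature.cols,
                      y = response,
                      training_frame = voting.train,
                      family = "binomial",
                      alpha = 0.5,
                      lambda_search = TRUE)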

# Very powerful. The underlying data set could be huge!
fit.glm.votes = h2o.glm(x = feature.cols, 
                  y = response, 
                  training_frame = voting.train, 
                  family = "binomial", 
                  model_id = "glm.fit")

# Predict on test data.
pred.glm.votes = h2o.predict(object = fit.glm.votes,
                   newdata = voting.test)

Deep Learning classification

This trains a Deep Neural Network model for classifying the political party using H2O’s custom, CPU-based implementation of a feed-forward, multi-layer ANN.

Again, there are many tuning parameters to play with, but I’m not going to go into them here ;)
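
For the record, a tuned variant might look like the sketch below, where hidden sets the layer sizes, epochs the number of training passes, and activation the non-linearity; the fit we actually run afterwards keeps H2O’s defaults:

# Hypothetical tuned variant; the model below uses the defaults instead.
fit.deep.tuned = h2o.deeplearning(x = feature.cols,
                                  y = response,
                                  training_frame = voting.train,
                                  hidden = c(32, 32),
                                  epochs = 20,
                                  activation = "Rectifier")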

# Pretty cool, only one line of code!
fit.deep.votes = h2o.deeplearning(x = feature.cols, 
                            y = response,
                            training_frame = voting.train, 
                            model_id = "deeplearning.fit")  

# Predict on the held-out data.
pred.deep.votes = h2o.predict(object = fit.deep.votes,
                   newdata = voting.test)
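
It is worth noting what h2o.predict() returns for a binomial model: an H2OFrame with a predict column plus one class-probability column per response level (here democrat and republican). A quick way to check (output not shown):

# Inspect the columns of the prediction frame.
colnames(pred.deep.votes)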

Results on training data

First let’s compare the confusion matrices for both models on the training data:

Logistic regression

h2o.confusionMatrix(fit.glm.votes)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.373947892885042:
##            democrat republican    Error     Rate
## democrat        207         11 0.050459  =11/218
## republican        2        129 0.015267   =2/131
## Totals          209        140 0.037249  =13/349

Looks fairly good, but there are some misclassification errors.

Deep Learning

h2o.confusionMatrix(fit.deep.votes)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.9938799492586:
##            democrat republican    Error    Rate
## democrat        215          3 0.013761  =3/218
## republican        0        131 0.000000  =0/131
## Totals          215        134 0.008596  =3/349

On the training data, this approach does somewhat better than the GLM.

Results on test data

But what about predicting on the held-out test set? We need to compare the predicted classes to the observed classes.

The subsequent code demonstrates how to construct confusion matrices for the test set, and how to compute the test prediction accuracy manually from the diagonal and off-diagonal elements of the confusion matrix:

\[ acc = \frac{tp + tn}{tp + fp + tn + fn} \]

In particular, note how we first combine the columns in the H2O cloud and only then export the data into R. This is generally more efficient; however, it would not work if the resulting data frames were too large to fit into R’s memory:

# Little helper function to calculate the accuracy.
accuracy = function(tab) {
  sum(diag(tab)) / sum(tab)
}

# Confusion matrices on the held-out test set.
res.test.glm = table(as.data.frame(h2o.cbind(voting.test["party"], pred.glm.votes["predict"])))
res.test.deep = table(as.data.frame(h2o.cbind(voting.test["party"], pred.deep.votes["predict"])))
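
As a quick sanity check of the helper, feeding it the Deep Learning training confusion matrix counts reported earlier reproduces the corresponding accuracy (a toy example with hard-coded counts):

# Hard-coded counts from the Deep Learning training confusion matrix above.
accuracy(matrix(c(215, 0, 3, 131), nrow = 2))  # ~0.9914, i.e. 346/349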

Logistic regression accuracy

accuracy(res.test.glm)
## [1] 0.9302326

Deep Learning accuracy

accuracy(res.test.deep)
## [1] 0.9534884

Both classifiers achieve fairly decent prediction accuracy on the test set, but again the neural network classifier appears to perform slightly better.

Much simpler: h2o.performance()

With the h2o.performance() function we would have obtained the same results (and more) in a single call; moreover, all calculations are performed entirely in the H2O cloud:

# Just for the GLM model.
h2o.performance(model = fit.glm.votes, newdata = voting.test)
## H2OBinomialMetrics: glm
## 
## MSE:  0.04600654
## RMSE:  0.2144914
## LogLoss:  0.1940718
## Mean Per-Class Error:  0.04412576
## AUC:  0.9942085
## Gini:  0.988417
## R^2:  0.8123197
## Residual Deviance:  33.38035
## AIC:  67.38035
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##            democrat republican    Error   Rate
## democrat         46          3 0.061224  =3/49
## republican        1         36 0.027027  =1/37
## Totals           47         39 0.046512  =4/86
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.654241 0.947368  18
## 2                       max f2  0.312193 0.973684  21
## 3                 max f0point5  0.739593 0.976331  13
## 4                 max accuracy  0.739593 0.953488  13
## 5                max precision  0.863374 1.000000   0
## 6                   max recall  0.312193 1.000000  21
## 7              max specificity  0.863374 1.000000   0
## 8             max absolute_mcc  0.739593 0.908063  13
## 9   max min_per_class_accuracy  0.668563 0.945946  17
## 10 max mean_per_class_accuracy  0.654241 0.955874  18
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
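
The returned metrics object can also be stored and queried for individual metrics (a sketch, assuming the model and test frame from above; output not shown):

# Store the metrics object, then extract individual metrics from it.
perf.glm = h2o.performance(model = fit.glm.votes, newdata = voting.test)
h2o.auc(perf.glm)
h2o.logloss(perf.glm)
h2o.confusionMatrix(perf.glm)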

Flow

Quick walk-through of H2O’s Flow UI.

Flow is great for intuitively exploring data and for building and assessing models through the browser, with little to no coding required.

http://localhost:54321

Some things to do:

Steam

Quick walk-through of the Steam UI.

Login: superuser / superuser.

http://localhost:9000