Fort Worth Data Science MeetUp

03/11/2016

H2O Demonstration in R

Presented by: Matthew Landowski

H2O Overview

  • Open source machine learning platform
  • Written in Java (runs on the JVM)
  • In-memory distributed parallel processing
  • H2O Sparkling Water - Works with Spark
  • Can export models

H2O Built-in Algorithms

  • Gradient Boosting Machine (GBM)
  • K-Means
  • Distributed Random Forest
  • Deep Learning
  • Generalized Linear Model (GLM)
  • Naive Bayes
  • Principal Component Analysis (PCA)

Setting Up a Cluster

flatfile.txt

192.168.1.163:54321
192.168.1.164:54321

Start H2O nodes with the flat file (run on each node)

java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -port 54321

Recommendations:

  • Identical nodes (same CPUs, memory, etc.)
  • 10 Gbit Ethernet or faster
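
Once the nodes are up, an R session can attach to the running cluster instead of starting a local instance. A minimal sketch, assuming the first flat-file node (192.168.1.163:54321) is reachable from the client machine and the installed h2o R package matches the cluster's H2O version:

```r
library(h2o)

# Attach to an already running cluster.
# startH2O = FALSE keeps the client from launching a local
# instance if the connection attempt fails.
h2o.init(ip = "192.168.1.163", port = 54321, startH2O = FALSE)

# Print cluster status, including how many nodes have joined.
h2o.clusterInfo()
```

If the reported node count is lower than the number of entries in flatfile.txt, one of the nodes likely did not join the cluster.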

H2O in R

Easy to install package

# Install the h2o package (latest release on CRAN).
# The R client and the H2O cluster must run the same H2O version.
install.packages("h2o")

Easy to use

library(h2o)

# Can define max memory size, threads (CPUs), etc.
# max_mem_size is a string with a unit suffix, e.g. "24g" for 24 GB
h2o.init(max_mem_size = "24g", nthreads = 3)

# Create an H2O frame from an R data frame
iris.h2o <- as.h2o(iris)

# Train a multinomial model on the training data
# x and y are vectors of column names or indices
fit <- h2o.gbm(x = 1:4, y = "Species", training_frame = iris.h2o)
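
The overview mentions that models can be exported. A minimal sketch of checking the GBM fit and saving it to disk, assuming a local H2O instance (so that tempdir() is a path the H2O server can write to); the variable names perf, model.path, and fit.loaded are illustrative and not from the slides:

```r
library(h2o)
h2o.init()

# Same model as on the slide
iris.h2o <- as.h2o(iris)
fit <- h2o.gbm(x = 1:4, y = "Species", training_frame = iris.h2o)

# Metrics computed on the training data (no validation frame was given)
perf <- h2o.performance(fit)
h2o.confusionMatrix(perf)
h2o.logloss(perf)

# Save the model in H2O's binary format; the return value is the file path
model.path <- h2o.saveModel(fit, path = tempdir(), force = TRUE)

# Load it back into the cluster
fit.loaded <- h2o.loadModel(model.path)
```

Because the metrics here come from the training data, they are optimistic; a held-out frame gives a fairer estimate.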

Deep Learning in H2O

# Train a deep learning model with 10-fold cross-validation
# x = predictor columns, y = response column
model.dl <- h2o.deeplearning(x = 2:784, y = 1, training_frame = h2o.train, nfolds = 10)

# Make predictions on the test frame
h2o.predictions <- h2o.predict(model.dl, h2o.test)
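
The slide does not show how h2o.train and h2o.test were created. One way to get such a pair is h2o.splitFrame. The sketch below uses iris as stand-in data (an assumption, not the data set from the talk) and a deliberately small network; the names splits, train, test, and pred are illustrative:

```r
library(h2o)
h2o.init()

iris.h2o <- as.h2o(iris)

# Roughly 80/20 split; the seed makes the split repeatable
splits <- h2o.splitFrame(iris.h2o, ratios = 0.8, seed = 42)
train <- splits[[1]]
test  <- splits[[2]]

# Small network: two hidden layers of 10 units each
model.dl <- h2o.deeplearning(x = 1:4, y = "Species", training_frame = train,
                             hidden = c(10, 10), epochs = 20)

# One row per test row: predicted class plus per-class probabilities
pred <- h2o.predict(model.dl, test)
head(pred)

# Metrics on the held-out frame
h2o.performance(model.dl, newdata = test)
```

Note that h2o.splitFrame splits approximately, so the two frames will not be exactly 80% and 20% of the rows.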

Grid Search

Model optimization is available through a built-in grid search.

grid <- h2o.grid("gbm", x = 1:4, y = 5, training_frame = iris.h2o,
                 hyper_params = list(ntrees = c(1,2,3)))
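
After the grid finishes, the models can be ranked by a metric and the best one retrieved. A minimal sketch, assuming the same iris grid as above; sorting by log loss is a choice made here for illustration, and since no validation frame is supplied the ranking uses training metrics:

```r
library(h2o)
h2o.init()

iris.h2o <- as.h2o(iris)

# Same grid as on the slide: three GBMs with 1, 2, and 3 trees
grid <- h2o.grid("gbm", x = 1:4, y = 5, training_frame = iris.h2o,
                 hyper_params = list(ntrees = c(1, 2, 3)))

# Rank the grid's models by log loss, lowest (best) first
sorted <- h2o.getGrid(grid_id = grid@grid_id, sort_by = "logloss",
                      decreasing = FALSE)
print(sorted)

# Fetch the top-ranked model
best <- h2o.getModel(sorted@model_ids[[1]])
```

For a real search, passing a validation_frame or nfolds to h2o.grid makes the ranking reflect held-out performance rather than training fit.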

Summary

  • H2O is an easy-to-use machine learning framework
  • Powerful built-in models
  • R, Python, and web interfaces available