Ensemble machine learning methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. Many of the popular modern machine learning algorithms are actually ensembles. For example, Random Forest and Gradient Boosting Machine (GBM) are both ensemble learners. Both bagging (e.g. Random Forest) and boosting (e.g. GBM) are methods for ensembling that take a collection of weak learners (e.g. decision tree) and form a single, strong learner.
H2O’s Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking. The method currently supports regression and binary classification; multiclass support is planned for a future release.
Native support for ensembles of H2O algorithms was added to core H2O in version 3.10.3.1. A separate implementation, the h2oEnsemble R package, is still available; however, for new projects we recommend the native H2O version, documented below.
Stacking, also called Super Learning or Stacked Regression, is a class of algorithms that involves training a second-level “metalearner” to find the optimal combination of the base learners. Unlike bagging and boosting, the goal in stacking is to ensemble strong, diverse sets of learners together.
Although the concept of stacking was originally developed in 1992, the theoretical guarantees for stacking were not proven until the publication of a paper titled “Super Learner” in 2007. In this paper, it was shown that the Super Learner ensemble represents an asymptotically optimal system for learning.
Some ensemble methods are broadly labeled as stacking; however, the Super Learner ensemble is distinguished by its use of cross-validation to form what is called the “level-one” data, that is, the data on which the metalearning or “combiner” algorithm is trained. More detail about the Super Learner algorithm is provided below.
The steps below describe the individual tasks involved in training and testing a Super Learner ensemble. H2O automates most of the steps below so that you can quickly and easily build ensembles of H2O models.
1. Set up the ensemble.
   - Specify a list of L base algorithms (each with a specific set of model parameters).
   - Specify a metalearning algorithm.
2. Train the ensemble.
   - Train each of the L base algorithms on the training set.
   - Perform k-fold cross-validation on each of these learners and collect the cross-validated predicted values from each of the L algorithms.
   - The N cross-validated predicted values from each of the L algorithms can be combined to form a new N x L matrix. This matrix, along with the original response vector, is called the “level-one” data. (N = number of rows in the training set. A minimal sketch of this construction appears after the parameter notes below.)
   - Train the metalearning algorithm on the level-one data. The “ensemble model” consists of the L base learning models and the metalearning model, which can then be used to generate predictions on a test set.
3. Predict on new data.
   - To generate ensemble predictions, first generate predictions from the base learners.
   - Feed those predictions into the metalearner to generate the ensemble prediction.

Defining an H2O Stacked Ensemble Model

- model_id: Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
- training_frame: Specify the dataset used to build the model.
- validation_frame: Specify the dataset used to evaluate the accuracy of the model.
- y: (Required) Specify the column to use as the dependent variable (response column). The data can be numeric or categorical.
- base_models: Specify a list of model IDs that can be stacked together. Models must have been cross-validated using nfolds > 1, they all must use the same cross-validation folds, and keep_cross_validation_predictions must be set to TRUE. Notes regarding base_models:
  - One way to guarantee identical folds across base models is to set fold_assignment = “Modulo” in all the base models. It is also possible to get identical folds by setting fold_assignment = “Random”, provided the same seed is used in all base models.
  - In R, you can specify a list of models in the base_models parameter.
- keep_levelone_frame: Keep the level-one data frame that is constructed for the metalearning step. This option is disabled by default.

In a future release, an additional metalearner parameter will allow the user to specify the metalearning algorithm; currently, the metalearner is fixed as a default H2O GLM with non-negative weights.
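To make the level-one construction in step 2 concrete, here is a minimal base-R sketch for a regression task. The mtcars data, the two lm base learners, and the lm metalearner are illustrative assumptions only; H2O performs all of this internally when you call h2o.stackedEnsemble.

# Minimal sketch: building the N x L level-one matrix with k-fold CV
set.seed(1)
N <- nrow(mtcars); k <- 5
folds <- sample(rep(1:k, length.out = N))  # identical folds for every base learner

cv_pred <- function(formula) {             # cross-validated predictions for one learner
  p <- numeric(N)
  for (i in 1:k) {
    fit <- lm(formula, data = mtcars[folds != i, ])
    p[folds == i] <- predict(fit, newdata = mtcars[folds == i, ])
  }
  p
}

Z <- cbind(m1 = cv_pred(mpg ~ wt),         # L = 2 base learners
           m2 = cv_pred(mpg ~ hp + disp))

level_one <- data.frame(Z, y = mtcars$mpg) # level-one data: Z plus the response
meta <- lm(y ~ ., data = level_one)        # train the metalearner on the level-one data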
Bagging (short for bootstrap aggregating) is a way to decrease the variance of your prediction by generating additional training data from your original dataset, using sampling with replacement to produce multisets of the same cardinality/size as the original data. Increasing the size of your training set this way cannot improve the model’s predictive power; it instead decreases the variance, tuning the prediction more narrowly toward the expected outcome. A minimal sketch is shown below.
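To make the mechanics concrete, here is a minimal base-R sketch of bagging via bootstrap resampling. The mtcars data and the lm base learner are stand-ins chosen purely for illustration.

# Minimal sketch: bagging by averaging models fit on bootstrap resamples
set.seed(1)
B <- 100
preds <- replicate(B, {
  idx <- sample(nrow(mtcars), replace = TRUE)      # bootstrap multiset, same size as the data
  fit <- lm(mpg ~ wt + hp, data = mtcars[idx, ])   # base learner on the resample
  predict(fit, newdata = mtcars)
})
bagged_pred <- rowMeans(preds)                      # aggregate by averaging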
Boosting is a two-step approach: first, subsets of the original data are used to produce a series of averagely performing models, and then their performance is “boosted” by combining them using a particular cost function (e.g. a majority vote). Unlike bagging, in classical boosting the subset creation is not random; it depends on the performance of the previous models: every new subset contains the elements that were (likely to be) misclassified by the previous models. A simplified sketch follows.
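Below is a minimal, simplified sketch of the boosting idea in base R. It uses the gradient-boosting view of repeatedly fitting a small model to the current residuals, rather than the classical reweighted-subset scheme described above; mtcars and the linear base learner are illustrative assumptions.

# Simplified sketch: boosting as stagewise fitting of residuals
y <- mtcars$mpg
pred <- rep(mean(y), nrow(mtcars))   # start from a constant model
lr <- 0.3                            # learning rate
for (m in 1:25) {
  res <- y - pred                              # current residuals
  fit <- lm(res ~ wt + hp, data = mtcars)      # weak learner on the residuals
  pred <- pred + lr * predict(fit)             # boost the combined model
}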
Stacking is similar to boosting: you also apply several models to your original data. The difference is that you do not have a fixed empirical formula for the weight function; instead, you introduce a meta-level and use another model, trained on the inputs together with the outputs of every base model, to estimate the weights, in other words, to determine which models perform well and which perform badly given these input data.
#devtools::install_github("h2oai/h2o-2/R/ensemble/h2oEnsemble-package")
#install.packages("https://h2o-release.s3.amazonaws.com/h2o-ensemble/R/h2oEnsemble_0.2.1.tar.gz", repos = NULL)
library(h2oEnsemble)
library(tidyverse)
library(h2o)
library(rio)
library(doParallel)
library(viridis)
library(RColorBrewer)
library(ggthemes)
library(knitr)
library(plotly)
library(lime)
library(plotROC)
library(pROC)
The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
The data is available at the UCI Machine Learning Repository here.
Data <- rio::import("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/Anomalydetection/bank/bank-full.csv")
Data %>% head()
setwd("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/H20")
Data <- Data %>% mutate_if(is.character, as.factor)
str(Data)
'data.frame': 45211 obs. of 17 variables:
$ age : int 58 44 33 47 33 35 28 42 58 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
$ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
$ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
$ day : int 5 5 5 5 5 5 5 5 5 5 ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
$ duration : int 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
localH2O <- h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = "8G")
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/var/folders/mj/w1gxzjcd0qx2cw_0690z7y640000gn/T//RtmpMWnh3P/h2o_nanaakwasiabayieboateng_started_from_r.out
/var/folders/mj/w1gxzjcd0qx2cw_0690z7y640000gn/T//RtmpMWnh3P/h2o_nanaakwasiabayieboateng_started_from_r.err
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
Starting H2O JVM and connecting: ... Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 seconds 900 milliseconds
H2O cluster version: 3.14.0.3
H2O cluster version age: 16 days
H2O cluster name: H2O_started_from_R_nanaakwasiabayieboateng_zlf343
H2O cluster total nodes: 1
H2O cluster total memory: 7.11 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.1 (2017-06-30)
# convert data to H2OFrame
Data_h2o <- as.h2o(Data)
splits <- h2o.splitFrame(Data_h2o,
ratios = c(0.6, 0.2),
seed = 148) #partition data into 60%, 20%, 20% chunks
train <- splits[[1]]
validation <- splits[[2]]
test <- splits[[3]]
y <- "y"
x <- setdiff(colnames(train), y)
# library(h2o)
# h2o.init()
#
# # Import a sample binary outcome train/test set into H2O
# train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
#
# # Identify predictors and response
# y <- "response"
# x <- setdiff(names(train), y)
#
# # For binary classification, response should be a factor
# train[,y] <- as.factor(train[,y])
# test[,y] <- as.factor(test[,y])
# Number of CV folds (to generate level-one data for stacking)
nfolds <- 5
# There are a few ways to assemble a list of models to stack together:
# 1. Train individual models and put them in a list
# 2. Train a grid of models
# 3. Train several grids of models
# Note: All base models must have the same cross-validation folds and
# the cross-validated predicted values must be kept.
# 1. Generate a 2-model ensemble (GBM + RF)
# Train & Cross-validate a GBM
my_gbm <- h2o.gbm(x = x,
y = y,
training_frame = train,
distribution = "bernoulli",
ntrees = 10,
max_depth = 3,
min_rows = 2,
learn_rate = 0.2,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE,
seed = 1)
# Train & Cross-validate a RF
my_rf <- h2o.randomForest(x = x,
y = y,
training_frame = train,
ntrees = 50,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE,
seed = 1)
# Train a stacked ensemble using the GBM and RF above
ensemble <- h2o.stackedEnsemble(x = x,
y = y,
training_frame = train,
model_id = "my_ensemble_binomial",
base_models = list(my_gbm@model_id, my_rf@model_id))
# Eval ensemble performance on a test set
perf <- h2o.performance(ensemble, newdata = test)
# Compare to base learner performance on the test set
perf_gbm_test <- h2o.performance(my_gbm, newdata = test)
perf_rf_test <- h2o.performance(my_rf, newdata = test)
baselearner_best_auc_test <- max(h2o.auc(perf_gbm_test), h2o.auc(perf_rf_test))
ensemble_auc_test <- h2o.auc(perf)
print(sprintf("Best Base-learner Test AUC: %s", baselearner_best_auc_test))
[1] "Best Base-learner Test AUC: 0.932059874368773"
print(sprintf("Ensemble Test AUC: %s", ensemble_auc_test))
[1] "Ensemble Test AUC: 0.933707424822913"
The ensemble’s test AUC is slightly higher than that of the better of the two base models.
# Generate predictions on a test set (if necessary)
pred <- h2o.predict(ensemble, newdata = test)
pred %>% head()
The no column is the probability (between 0 and 1) that class no is chosen, and the yes column is the probability that class yes is chosen.
The predicted label is produced by applying a threshold to these probabilities. That threshold is chosen depending on whether you want to reduce false positives or false negatives; it is not simply 0.5.
The threshold behind the predict column is the one that maximizes F1, but you can extract the yes probability yourself and threshold it any way you like, as sketched below.
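For example, here is a small sketch of applying a custom threshold to the extracted probabilities; the 0.30 cutoff is an arbitrary illustrative choice, not a recommendation.

# Sketch: extract the "yes" probability and apply a custom threshold
pred_df <- as.data.frame(pred)                           # bring predictions into R
custom_label <- ifelse(pred_df$yes > 0.30, "yes", "no")  # e.g. lower cutoff to catch more positives
table(custom_label, as.data.frame(test)$y)               # confusion table at this threshold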
# 2. Generate a random grid of models and stack them together
# GBM Hyperparameters
learn_rate_opt <- c(0.01, 0.03)
max_depth_opt <- c(3, 4, 5, 6, 9)
sample_rate_opt <- c(0.7, 0.8, 0.9, 1.0)
col_sample_rate_opt <- c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
hyper_params <- list(learn_rate = learn_rate_opt,
max_depth = max_depth_opt,
sample_rate = sample_rate_opt,
col_sample_rate = col_sample_rate_opt)
search_criteria <- list(strategy = "RandomDiscrete",
max_models = 3,
seed = 1)
gbm_grid <- h2o.grid(algorithm = "gbm",
grid_id = "gbm_grid_binomial",
x = x,
y = y,
training_frame = train,
ntrees = 50,
seed = 1,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE,
hyper_params = hyper_params,
search_criteria = search_criteria)
# Train a stacked ensemble using the GBM grid
ensemble2 <- h2o.stackedEnsemble(x = x,
y = y,
training_frame = train,
model_id = "ensemble_gbm_grid_binomial",
base_models = gbm_grid@model_ids)
# storing and loading the model
path <- h2o.saveModel(ensemble2, path = "ensemble2", force = TRUE)
print(path)
[1] "/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/H20/ensemble2/ensemble_gbm_grid_binomial"
loaded <- h2o.loadModel(path)
# Eval ensemble performance on a test set
perf <- h2o.performance(ensemble2, newdata = test)
# Compare to base learner performance on the test set
.getauc <- function(mm) h2o.auc(h2o.performance(h2o.getModel(mm), newdata = test))
baselearner_aucs <- sapply(gbm_grid@model_ids, .getauc)
baselearner_best_auc_test <- max(baselearner_aucs)
ensemble_auc_test <- h2o.auc(perf)
print(sprintf("Best Base-learner Test AUC: %s", baselearner_best_auc_test))
[1] "Best Base-learner Test AUC: 0.914290440531397"
print(sprintf("Ensemble Test AUC: %s", ensemble_auc_test))
[1] "Ensemble Test AUC: 0.915028629103197"
The random grid search (capped here at max_models = 3) produces a final ensemble that actually performs worse on the test set than the initial two-model ensemble and its base learners.
# Generate predictions on a test set (if necessary)
pred <- h2o.predict(ensemble2, newdata = test)
h2o.varimp_plot(my_rf)
h2o.varimp(my_rf) %>% as_tibble()
# for deep learning, set the variable_importances parameter to TRUE, e.g.:
#iris.hex <- as.h2o(iris)
#iris.dl <- h2o.deeplearning(x = 1:4, y = 5, training_frame = iris.hex,
#variable_importances = TRUE)
#h2o.varimp_plot(iris.dl)
plot(h2o.performance(ensemble2)) ## display ROC curve
plot(my_rf)
#plot(my_rf, timestep = "duration", metric = "deviance")
plot(my_rf, timestep = "number_of_trees", metric = "auc")
plot(my_rf, timestep = "number_of_trees", metric = "rmse")
plot(my_rf, timestep = "number_of_trees", metric = "logloss")
Plot an H2O Tabulate Heatmap
tab <- h2o.tabulate(data = Data_h2o, x = "age", y = "y",
weights_column = NULL, nbins_x = 10, nbins_y = 10)
plot(tab)
# storing and loading the model
# path <- h2o.saveModel(model, path = "mybest_deeplearning_covtype_model", force = TRUE)
# print(path)
# loaded <- h2o.loadModel(path)
h2o.shutdown(prompt = FALSE)
[1] TRUE