library(h2o)
library(h2oEnsemble)
library(tidyverse)
library(rio)

AutoML: Automatic Machine Learning

AutoML Interface

The AutoML interface is designed to have as few parameters as possible, so that all the user needs to do is point to their dataset, identify the response column, and optionally specify a time constraint.

In both the R and Python API, AutoML uses the same data-related arguments, x, y, training_frame, validation_frame, as the other H2O algorithms.

The x argument only needs to be specified if the user wants to exclude predictor columns from their data frame. If all columns (other than the response) should be used in prediction, it can be left unspecified.

The y argument is the name (or index) of the response column. It is required.

The training_frame argument is the training set. It is required.

The validation_frame argument is optional and is used for early stopping within the training process of the individual models in the AutoML run.

The leaderboard_frame argument allows the user to specify a particular data frame on which to rank the models on the leaderboard. This frame is not used for anything besides creating the leaderboard.

To control how long the AutoML run executes, the user can specify max_runtime_secs, which defaults to 600 seconds (10 minutes).

If the user doesn't specify all three frames (training, validation and leaderboard), the missing frames are created automatically from what the user provides. For reference, here are the rules for auto-generating the missing frames.

When the user specifies:

training: The training_frame is split into training (70%), validation (15%) and leaderboard (15%) sets.
training + validation: The validation_frame is split into validation (50%) and leaderboard (50%) sets and the original training frame stays as-is.
training + leaderboard: The training_frame is split into training (70%) and validation (30%) sets and the leaderboard frame stays as-is.
training + validation + leaderboard: Leave all frames as-is.
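
The rules above can be written down as a small lookup, which makes it easier to see what happens for a given combination of frames. This is only a sketch in base R: `automl_frame_plan` is a hypothetical helper invented for illustration, it does not call H2O, and the ratios are simply the ones quoted above (other H2O releases may behave differently).

```r
# Hypothetical helper (not part of the h2o package): reports which frame
# would be split, and in what proportions, under the rules listed above.
automl_frame_plan <- function(has_validation = FALSE, has_leaderboard = FALSE) {
  if (has_validation && has_leaderboard) {
    # All three frames supplied: nothing is split.
    list(split_frame = NA_character_, ratios = numeric(0))
  } else if (has_validation) {
    # training + validation: the validation frame is halved.
    list(split_frame = "validation_frame",
         ratios = c(validation = 0.5, leaderboard = 0.5))
  } else if (has_leaderboard) {
    # training + leaderboard: the training frame is split 70/30.
    list(split_frame = "training_frame",
         ratios = c(training = 0.7, validation = 0.3))
  } else {
    # training only: the training frame is split three ways.
    list(split_frame = "training_frame",
         ratios = c(training = 0.7, validation = 0.15, leaderboard = 0.15))
  }
}

# The case used later in this notebook: training + leaderboard.
plan <- automl_frame_plan(has_leaderboard = TRUE)
plan$split_frame
plan$ratios
```

Under these rules, a call that supplies training_frame and leaderboard_frame but no validation_frame would have 30% of its training frame set aside for validation.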

bankdata<-rio::import("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/Anomalydetection/bank/bank-full.csv")

The data relate to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

The data are available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/bank+marketing

bankdata<-bankdata%>%mutate_if(is.character,as.factor)
str(bankdata)
'data.frame':   45211 obs. of  17 variables:
 $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
 $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
 $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
 $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
 $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
 $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
 $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
 $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
 $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
 $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
 $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
localH2O = h2o.init(ip = 'localhost', port = 54321, nthreads = -1,max_mem_size = "8G")

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/mj/w1gxzjcd0qx2cw_0690z7y640000gn/T//RtmpUHt60n/h2o_nanaakwasiabayieboateng_started_from_r.out
    /var/folders/mj/w1gxzjcd0qx2cw_0690z7y640000gn/T//RtmpUHt60n/h2o_nanaakwasiabayieboateng_started_from_r.err
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

Starting H2O JVM and connecting: ... Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 902 milliseconds 
    H2O cluster version:        3.14.0.3 
    H2O cluster version age:    15 days  
    H2O cluster name:           H2O_started_from_R_nanaakwasiabayieboateng_qlb223 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   7.11 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.4.1 (2017-06-30) 
bankdata <- as.h2o(bankdata)

splits <- h2o.splitFrame(bankdata, 
                         ratios = c(0.6, 0.2), 
                         seed = 148)   #partition data into 60%, 20%, 20% chunks
train <- splits[[1]]
validation <- splits[[2]]
test <- splits[[3]]
outcome_name <- "y" # response column: factor with levels "no" and "yes"
features <- setdiff(colnames(train), outcome_name)
features
 [1] "age"       "job"       "marital"   "education" "default"   "balance"   "housing"  
 [8] "loan"      "contact"   "day"       "month"     "duration"  "campaign"  "pdays"    
[15] "previous"  "poutcome" 
aml <- h2o.automl(x = features, y = outcome_name,
                  training_frame = train,
                  leaderboard_frame = test,
                  max_runtime_secs = 120)

# View the AutoML leaderboard
lb <- aml@leaderboard
lb %>% as_tibble()

The stacked ensemble performs better than any of the single algorithms on its own. The same data was also trained with several individual algorithms, and their AUC performance is tabulated below.

# tibble() replaces the deprecated data_frame()
tibble(model = c("Stochastic Gradient Boosting Machine", "Random Forest", "Boosted Logistic Regression",
                 "Extreme Gradient Boosting Machine", "Logistic Regression", "Neural Networks"),
       accuracy = c(0.8832, 0.9174, 0.8416, 0.8832, 0.901, 0.9038))
h2o.performance(model = aml@leader,
                            newdata = test)
H2OBinomialMetrics: stackedensemble

MSE:  0.06543222
RMSE:  0.2557972
LogLoss:  0.2172478
Mean Per-Class Error:  0.1763793
AUC:  0.9335191
Gini:  0.8670382

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         no  yes    Error       Rate
no     7334  572 0.072350  =572/7906
yes     302  775 0.280409  =302/1077
Totals 7636 1347 0.097295  =874/8983

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.218871 0.639439 254
2                       max f2  0.067442 0.744354 338
3                 max f0point5  0.401107 0.622259 186
4                 max accuracy  0.401107 0.909496 186
5                max precision  0.972859 1.000000   0
6                   max recall  0.023700 1.000000 399
7              max specificity  0.972859 1.000000   0
8             max absolute_mcc  0.127667 0.593583 298
9   max min_per_class_accuracy  0.084390 0.866431 325
10 max mean_per_class_accuracy  0.067442 0.871945 338

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
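
To make the reported test-set metrics easier to read, the sketch below recomputes a few of them by hand from the confusion matrix printed above. It uses base R only; the counts and the AUC value are copied from the output above, and the variable names (`tp`, `fp`, `fn`, `tn`) are illustrative choices rather than anything H2O exposes.

```r
# Counts copied from the test-set confusion matrix (F1-optimal threshold).
tn <- 7334  # actual no,  predicted no
fp <- 572   # actual no,  predicted yes
fn <- 302   # actual yes, predicted no
tp <- 775   # actual yes, predicted yes

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / (tp + tn + fp + fn)

# Mean per-class error averages the two row-wise error rates.
mean_per_class_error <- mean(c(fp / (fp + tn), fn / (fn + tp)))

# Gini is a rescaling of AUC: Gini = 2 * AUC - 1.
auc  <- 0.9335191
gini <- 2 * auc - 1

round(c(precision = precision, recall = recall, f1 = f1,
        accuracy = accuracy, mean_per_class_error = mean_per_class_error,
        gini = gini), 6)
```

The recomputed F1 agrees with the "max f1" row of the maximum-metrics table, and the mean per-class error and Gini agree with the values in the summary header. The accuracy here is the one at the F1-optimal threshold, which is slightly lower than the "max accuracy" row because that row uses a different threshold.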
h2o.confusionMatrix(aml@leader)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.319712452240004:
          no  yes    Error        Rate
no     16782   54 0.003207   =54/16836
yes       72 2165 0.032186    =72/2237
Totals 16854 2219 0.006606  =126/19073
# If you need to generate predictions on a test set, you can make
# predictions directly on the `"H2OAutoML"` object, or on the leader
# model object directly
pred <- h2o.predict(aml@leader, test)

pred%>%head() 