Complete ML and DL using H2O AutoML
Source file ⇒ CompleteMLandDLusingH2o_AutoML.rmd
1 About Automated Machine Learning
AutoML is a function in H2O that automates the process of building a large number of models, with the goal of finding the “best” model without any prior knowledge or effort by the Data Scientist.
As of H2O 3.16.*, AutoML trains and cross-validates a default Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and a fixed grid of GLMs, and then trains two Stacked Ensemble models at the end. One ensemble contains all the models (optimized for model performance), and the second contains just the best-performing model from each algorithm class/family (optimized for production use). The run reported below uses H2O 3.30.0.1, which also attempts XGBoost when it is available in the installation.
2 Automated Machine Learning using h2o
AutoML Interface
The AutoML interface is designed to have as few parameters as possible, so that all the user needs to do is point to their dataset, identify the response column, and optionally specify a time constraint.
In both the R and Python API, AutoML uses the same data-related arguments, x, y, training_frame, validation_frame, as the other H2O algorithms.
Steps in AutoML:
- Specify a training frame.
- Specify the response variable and predictor variables.
- Run AutoML where stopping is based on max number of models.
- View the leaderboard (based on cross-validation metrics).
- Explore the ensemble composition.
- Save the leader model (binary format).
The data-related arguments are:
- x: only needs to be specified if the user wants to exclude predictor columns from their data frame. If all columns (other than the response) should be used in prediction, it can be left unspecified.
- y: the name (or index) of the response column. Required.
- training_frame: the training set. Required.
- validation_frame: optional; used for early stopping within the training process of the individual models in the AutoML run.
- leaderboard_frame: lets the user specify a particular data frame on which to rank the models on the leaderboard. This frame is not used for anything besides creating the leaderboard.
To control how long the AutoML run will execute, the user can specify max_runtime_secs, which defaults to 600 seconds (10 minutes).
If the user doesn’t specify all three frames (training, validation and leaderboard), the missing frames are created automatically from what the user provides. For reference, here are the rules for auto-generating the missing frames.
When the user specifies:
- training: The training_frame is split into training (70%), validation (15%) and leaderboard (15%) sets.
- training + validation: The validation_frame is split into validation (50%) and leaderboard (50%) sets and the original training frame stays as-is.
- training + leaderboard: The training_frame is split into training (70%) and validation (30%) sets and the leaderboard frame stays as-is.
- training + validation + leaderboard: All frames are left as-is.
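The splitting rules above can be written down as a small lookup. The sketch below is an illustration in base R only, not H2O code: `split_plan()` is a hypothetical helper introduced here, and the row counts it produces are approximate, since H2O's own splitting is not guaranteed to round the same way.

```r
# Sketch of the frame-splitting rules listed above (illustration only, not H2O code).
# `split_plan()` is a hypothetical helper: for each frame the user supplies, it
# returns the fractions that frame would be divided into.
split_plan <- function(has_validation = FALSE, has_leaderboard = FALSE) {
  if (!has_validation && !has_leaderboard) {
    list(training_frame = c(training = 0.70, validation = 0.15, leaderboard = 0.15))
  } else if (has_validation && !has_leaderboard) {
    list(training_frame   = c(training = 1),
         validation_frame = c(validation = 0.50, leaderboard = 0.50))
  } else if (!has_validation && has_leaderboard) {
    list(training_frame    = c(training = 0.70, validation = 0.30),
         leaderboard_frame = c(leaderboard = 1))
  } else {
    list(training_frame    = c(training = 1),
         validation_frame  = c(validation = 1),
         leaderboard_frame = c(leaderboard = 1))
  }
}

# Example: only a training frame of 5960 rows (the size of hmeq) is supplied.
plan <- split_plan()
approx_rows <- round(5960 * plan$training_frame)
print(approx_rows)
```

Passing `has_validation = TRUE` or `has_leaderboard = TRUE` shows which of the supplied frames would be left untouched.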
2.1 Loading the required packages
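The code for this step is not shown in the report. A minimal sketch, assuming the packages to load are the ones listed as attached in the session information at the end of the report (the exact set used by the original chunk is an assumption):

```r
# Packages taken from the "other attached packages" list in the session
# information; the selection below is an assumption, not the original chunk.
pkgs <- c("tidyverse", "DT", "skimr", "xray", "missRanger", "finalfit",
          "h2o", "DALEX", "caret", "xgboost", "pROC", "plotly", "tictoc")

# Attach whichever packages are installed and report the ones that are not.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
invisible(lapply(pkgs[installed], library, character.only = TRUE))
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}
```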
2.2 Data importing and Basic EDA
# ================================= Data Pre-processing =================================
# Clear workspace:
rm(list = ls())
# Import data:
library(tidyverse)
hmeq <- read_csv("http://www.creditriskanalytics.net/uploads/1/9/5/1/19511601/hmeq.csv")
datatable(head(hmeq), rownames = FALSE, options = list(pageLength = 6, scrollX = TRUE))
# Inspect the structure of the data:
str(hmeq)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5960 obs. of 13 variables:
## $ BAD : int 1 1 1 1 0 1 1 1 1 1 ...
## $ LOAN : int 1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
## $ MORTDUE: num 25860 70053 13500 NA 97800 ...
## $ VALUE : num 39025 68400 16700 NA 112000 ...
## $ REASON : chr "HomeImp" "HomeImp" "HomeImp" NA ...
## $ JOB : chr "Other" "Other" "Other" NA ...
## $ YOJ : num 10.5 7 4 NA 3 9 5 11 3 16 ...
## $ DEROG : int 0 0 0 NA 0 0 3 0 0 0 ...
## $ DELINQ : int 0 2 0 NA 0 0 2 0 2 0 ...
## $ CLAGE : num 94.4 121.8 149.5 NA 93.3 ...
## $ NINQ : int 1 0 1 NA 0 1 1 0 1 0 ...
## $ CLNO : int 9 14 10 NA 14 8 17 8 12 13 ...
## $ DEBTINC: num NA NA NA NA NA ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 13
## .. ..$ BAD : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ LOAN : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ MORTDUE: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ VALUE : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ REASON : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ JOB : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ YOJ : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ DEROG : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ DELINQ : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ CLAGE : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ NINQ : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ CLNO : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ DEBTINC: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
| Name | hmeq |
|---|---|
| Number of rows | 5960 |
| Number of columns | 13 |
| Column type frequency: | |
| character | 2 |
| numeric | 11 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| REASON | 252 | 0.96 | 7 | 7 | 0 | 2 | 0 |
| JOB | 279 | 0.95 | 3 | 7 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| BAD | 0 | 1.00 | 0.20 | 0.40 | 0.00 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
| LOAN | 0 | 1.00 | 18607.97 | 11207.48 | 1100.00 | 11100 | 16300 | 23300 | 89900 | ▇▅▁▁▁ |
| MORTDUE | 518 | 0.91 | 73760.82 | 44457.61 | 2063.00 | 46276 | 65019 | 91488 | 399550 | ▇▃▁▁▁ |
| VALUE | 112 | 0.98 | 101776.05 | 57385.78 | 8000.00 | 66076 | 89236 | 119824 | 855909 | ▇▁▁▁▁ |
| YOJ | 515 | 0.91 | 8.92 | 7.57 | 0.00 | 3 | 7 | 13 | 41 | ▇▃▂▁▁ |
| DEROG | 708 | 0.88 | 0.25 | 0.85 | 0.00 | 0 | 0 | 0 | 10 | ▇▁▁▁▁ |
| DELINQ | 580 | 0.90 | 0.45 | 1.13 | 0.00 | 0 | 0 | 0 | 15 | ▇▁▁▁▁ |
| CLAGE | 308 | 0.95 | 179.77 | 85.81 | 0.00 | 115 | 173 | 232 | 1168 | ▇▂▁▁▁ |
| NINQ | 510 | 0.91 | 1.19 | 1.73 | 0.00 | 0 | 1 | 2 | 17 | ▇▁▁▁▁ |
| CLNO | 222 | 0.96 | 21.30 | 10.14 | 0.00 | 15 | 20 | 26 | 71 | ▃▇▂▁▁ |
| DEBTINC | 1267 | 0.79 | 33.78 | 8.60 | 0.52 | 29 | 35 | 39 | 203 | ▇▁▁▁▁ |
#-------------- Distributions ------------#
# xray::distributions tries to analyze the distribution of your variables, so you
# can understand how each variable is statistically structured. It also returns a
# percentiles table of numeric variables as a result, which can inform you of the
# shape of the data.
xray::distributions(hmeq)
## ================================================================================
## Variable p_1 p_10 p_25 p_50 p_75 p_90 p_99
## 1 DEROG 0 0 0 0 0 1 4
## 2 BAD 0 0 0 0 0 1 1
## 3 DELINQ 0 0 0 0 0 2 5
## 4 NINQ 0 0 0 1 2 3 8.51
## 5 DEBTINC 13.2764 23.7784 29.14 34.8183 39.0031 41.4407 49.2203
## 6 YOJ 0 1 3 7 13 21 30
## 7 MORTDUE 7855.4 26976.6 46276 65019 91488 130280.3 232230.41
## 8 CLAGE 30.2426 84.5543 115.1167 173.4667 231.5623 295.7159 399.5449
## 9 CLNO 0 10 15 20 26 34 50
## 10 VALUE 26262.2 48800 66075.5 89235.5 119824.25 175094.4 289962.8
## 11 LOAN 3359 7600 11100 16300 23300 30500 60869
2.3 Missing Value Handling and data pre-processing
For classification, the response should be encoded as categorical (a.k.a. “factor” or “enum”).
#-------------- Anomaly detection ------------#
# xray::anomalies analyzes all your columns for anomalies, whether they are NAs,
# Zeroes, Infinite, etc, and warns you if it detects variables with at least 80%
# of rows with those anomalies. It also warns you when all rows have the same
# value.
xray::anomalies(hmeq)
## $variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1 DEROG 5960 708 11.88% 4527 75.96% 0 - 0 - 12
## 2 BAD 5960 0 - 4771 80.05% 0 - 0 - 2
## 3 DELINQ 5960 580 9.73% 4179 70.12% 0 - 0 - 15
## 4 NINQ 5960 510 8.56% 2531 42.47% 0 - 0 - 17
## 5 DEBTINC 5960 1267 21.26% 0 - 0 - 0 - 4694
## 6 YOJ 5960 515 8.64% 415 6.96% 0 - 0 - 100
## 7 MORTDUE 5960 518 8.69% 0 - 0 - 0 - 5054
## 8 CLAGE 5960 308 5.17% 2 0.03% 0 - 0 - 5315
## 9 CLNO 5960 222 3.72% 62 1.04% 0 - 0 - 63
## 10 JOB 5960 279 4.68% 0 - 0 - 0 - 7
## 11 REASON 5960 252 4.23% 0 - 0 - 0 - 3
## 12 VALUE 5960 112 1.88% 0 - 0 - 0 - 5382
## 13 LOAN 5960 0 - 0 - 0 - 0 - 540
## type anomalous_percent
## 1 Integer 87.84%
## 2 Integer 80.05%
## 3 Integer 79.85%
## 4 Integer 51.02%
## 5 Numeric 21.26%
## 6 Numeric 15.6%
## 7 Numeric 8.69%
## 8 Numeric 5.2%
## 9 Integer 4.77%
## 10 Character 4.68%
## 11 Character 4.23%
## 12 Numeric 1.88%
## 13 Integer -
##
## $problem_variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1 DEROG 5960 708 11.88% 4527 75.96% 0 - 0 - 12
## 2 BAD 5960 0 - 4771 80.05% 0 - 0 - 2
## type anomalous_percent problems
## 1 Integer 87.84% Anomalies present in 87.84% of the rows.
## 2 Integer 80.05% Anomalies present in 80.05% of the rows.
## $Continuous
## label var_type n missing_n missing_percent mean sd min
## BAD BAD <int> 5960 0 0.0 0.2 0.4 0.0
## LOAN LOAN <int> 5960 0 0.0 18608.0 11207.5 1100.0
## MORTDUE MORTDUE <dbl> 5442 518 8.7 73760.8 44457.6 2063.0
## VALUE VALUE <dbl> 5848 112 1.9 101776.0 57385.8 8000.0
## YOJ YOJ <dbl> 5445 515 8.6 8.9 7.6 0.0
## DEROG DEROG <int> 5252 708 11.9 0.3 0.8 0.0
## DELINQ DELINQ <int> 5380 580 9.7 0.4 1.1 0.0
## CLAGE CLAGE <dbl> 5652 308 5.2 179.8 85.8 0.0
## NINQ NINQ <int> 5450 510 8.6 1.2 1.7 0.0
## CLNO CLNO <int> 5738 222 3.7 21.3 10.1 0.0
## DEBTINC DEBTINC <dbl> 4693 1267 21.3 33.8 8.6 0.5
## quartile_25 median quartile_75 max
## BAD 0.0 0.0 0.0 1.0
## LOAN 11100.0 16300.0 23300.0 89900.0
## MORTDUE 46276.0 65019.0 91488.0 399550.0
## VALUE 66075.5 89235.5 119824.2 855909.0
## YOJ 3.0 7.0 13.0 41.0
## DEROG 0.0 0.0 0.0 10.0
## DELINQ 0.0 0.0 0.0 15.0
## CLAGE 115.1 173.5 231.6 1168.2
## NINQ 0.0 1.0 2.0 17.0
## CLNO 15.0 20.0 26.0 71.0
## DEBTINC 29.1 34.8 39.0 203.3
##
## $Categorical
## label var_type n missing_n missing_percent levels_n levels
## REASON REASON <chr> 5708 252 4.2 2 -
## JOB JOB <chr> 5681 279 4.7 6 -
## levels_count levels_percent
## REASON - -
## JOB - -
##
## Missing value imputation by chained tree ensembles
##
## Variables ignored in imputation (wrong data type or all values missing: REASON, JOB
## iter 1: .........
## iter 2: .........
## iter 3: .........
# Stage 2 - Normalize 0-1 features:
df_final <- hmeq_imputed %>% mutate(BAD = case_when(BAD == 1 ~ "Bad", TRUE ~ "Good")) %>%
mutate_if(is.character, as.factor) %>% mutate_if(is.numeric, function(x) {
(x - min(x))/(max(x) - min(x))
})
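The normalization step above rescales every numeric column to the range 0 to 1. A small worked example in base R, using made-up LOAN-like values (the helper name `min_max` is introduced here for illustration), shows the arithmetic and why imputation has to come first:

```r
# Min-max scaling of a numeric vector to the [0, 1] range.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

loan <- c(1100, 1300, 1500, 1700, 89900)   # a few LOAN-like values
scaled <- min_max(loan)
print(range(scaled))

# With a missing value, min() and max() return NA, so every scaled value is NA.
# This is why the data is imputed before it is normalized.
print(anyNA(min_max(c(1100, NA, 1500))))
```

A column with a single distinct value would give a zero denominator, so constant columns need separate handling if they occur.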
hmeq_imputed %>% missing_plot()
2.4 Train-Test split
Next, let’s identify the response & predictor columns by saving them as x and y.
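The chunk that defines x, y and the train/valid/test frames is not shown in the report. The following is a sketch under stated assumptions: the split ratios are inferred from the row counts that appear later in the output (2984 training rows and 1780 test rows out of 5960), the seed is arbitrary, and `split_h2o_frame()` is a hypothetical helper wrapping `h2o.splitFrame()`.

```r
# Column names of the hmeq data, as listed in the structure output above.
hmeq_cols <- c("BAD", "LOAN", "MORTDUE", "VALUE", "REASON", "JOB", "YOJ",
               "DEROG", "DELINQ", "CLAGE", "NINQ", "CLNO", "DEBTINC")

y <- "BAD"                   # response column
x <- setdiff(hmeq_cols, y)   # the remaining 12 columns are predictors

# Hypothetical helper: split an H2OFrame into train / valid / test.
# ratios = c(0.5, 0.2) is an assumption (the remainder, about 0.3, becomes the
# test set); the original report may have used different values.
split_h2o_frame <- function(frame, ratios = c(0.5, 0.2), seed = 1) {
  parts <- h2o::h2o.splitFrame(frame, ratios = ratios, seed = seed)
  list(train = parts[[1]], valid = parts[[2]], test = parts[[3]])
}

# Usage, once `h2o_frame` has been created with as.h2o():
# splits <- split_h2o_frame(h2o_frame)
# train <- splits$train; valid <- splits$valid; test <- splits$test
```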
# Stage - Split data for training, validation and testing:
# library(h2o)
# h2o.init(nthreads = 40, max_mem_size = '8g')
h2o.init(max_mem_size = "2G", nthreads = 2, ip = "localhost", port = 54321)
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\HP\AppData\Local\Temp\Rtmp6fwQLS\file40c81c327da2/h2o_HP_started_from_r.out
## C:\Users\HP\AppData\Local\Temp\Rtmp6fwQLS\file40c8374e77bf/h2o_HP_started_from_r.err
##
##
## Starting H2O JVM and connecting: Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 9 seconds 388 milliseconds
## H2O cluster timezone: Asia/Kolkata
## H2O data parsing timezone: UTC
## H2O cluster version: 3.30.0.1
## H2O cluster version age: 1 month and 24 days
## H2O cluster name: H2O_started_from_R_HP_nmn851
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.78 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.5.0 (2018-04-23)
# h2o.no_progress() # disable progress bar for RMarkdown
h2o.removeAll() # Optional: remove anything from previous session
h2o_frame <- as.h2o(df_final)
2.5 Train Auto Machine Learning
Run AutoML, stopping after 10 models. The max_models argument specifies the number of individual (or “base”) models, and does not include the two ensemble models that are trained at the end.
# =================================== Training Auto Machine Learning ===================================
autoML <- h2o.automl(x = x, y = y, training_frame = train, leaderboard_frame = valid,
    stopping_metric = "AUC", stopping_rounds = 10, stopping_tolerance = 0.02, max_models = 10,
    max_runtime_secs = 60 * 60, seed = 1, sort_metric = "AUC")
## 13:16:47.214: AutoML: XGBoost is not available; skipping it.
2.6 Model performance by AUC: Leaderboard
Next, we will view the AutoML Leaderboard.
A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric. In the case of binary classification, the default ranking metric is Area Under the ROC Curve (AUC).
The leader model is stored at autoML@leader and the leaderboard is stored at autoML@leaderboard.
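Since the leaderboard is ranked by AUC, it may help to see what that number measures. The sketch below computes AUC in base R with the rank (Mann-Whitney) formulation on four made-up scores; `auc_rank()` is a helper introduced here for illustration and is not part of H2O.

```r
# AUC via the rank (Mann-Whitney) formulation: the probability that a randomly
# chosen positive case receives a higher score than a randomly chosen negative one.
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks handle tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.10, 0.40, 0.35, 0.80)   # toy predicted probabilities
label <- c(0, 0, 1, 1)               # toy true classes
print(auc_rank(score, label))        # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```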
## Model Details:
## ==============
##
## H2OBinomialModel: stackedensemble
## Model ID: StackedEnsemble_AllModels_AutoML_20200528_131647
## NULL
##
##
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
##
## MSE: 0.00076
## RMSE: 0.028
## LogLoss: 0.023
## Mean Per-Class Error: 0
## AUC: 1
## AUCPR: 1
## Gini: 1
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Bad Good Error Rate
## Bad 630 0 0.000000 =0/630
## Good 0 2354 0.000000 =0/2354
## Totals 630 2354 0.000000 =0/2984
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.786425 1.000000 261
## 2 max f2 0.786425 1.000000 261
## 3 max f0point5 0.786425 1.000000 261
## 4 max accuracy 0.786425 1.000000 261
## 5 max precision 0.981050 1.000000 0
## 6 max recall 0.786425 1.000000 261
## 7 max specificity 0.981050 1.000000 0
## 8 max absolute_mcc 0.786425 1.000000 261
## 9 max min_per_class_accuracy 0.786425 1.000000 261
## 10 max mean_per_class_accuracy 0.786425 1.000000 261
## 11 max tns 0.981050 630.000000 0
## 12 max fns 0.981050 2312.000000 0
## 13 max fps 0.001448 630.000000 399
## 14 max tps 0.786425 2354.000000 261
## 15 max tnr 0.981050 1.000000 0
## 16 max fnr 0.981050 0.982158 0
## 17 max fpr 0.001448 1.000000 399
## 18 max tpr 0.786425 1.000000 261
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.051
## RMSE: 0.23
## LogLoss: 0.18
## Mean Per-Class Error: 0.13
## AUC: 0.96
## AUCPR: 0.99
## Gini: 0.93
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Bad Good Error Rate
## Bad 476 154 0.244444 =154/630
## Good 37 2317 0.015718 =37/2354
## Totals 513 2471 0.064008 =191/2984
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.389204 0.960415 267
## 2 max f2 0.196162 0.978170 308
## 3 max f0point5 0.784595 0.957611 174
## 4 max accuracy 0.403159 0.935992 264
## 5 max precision 0.981010 1.000000 0
## 6 max recall 0.012527 1.000000 381
## 7 max specificity 0.981010 1.000000 0
## 8 max absolute_mcc 0.403159 0.800281 264
## 9 max min_per_class_accuracy 0.888581 0.901444 123
## 10 max mean_per_class_accuracy 0.888581 0.902309 123
## 11 max tns 0.981010 630.000000 0
## 12 max fns 0.981010 2309.000000 0
## 13 max fps 0.001339 630.000000 399
## 14 max tps 0.012527 2354.000000 381
## 15 max tnr 0.981010 1.000000 0
## 16 max fnr 0.981010 0.980884 0
## 17 max fpr 0.001339 1.000000 399
## 18 max tpr 0.012527 1.000000 381
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
df_results <- autoML@leaderboard %>% as.data.frame() %>% select(model_id, auc) %>%
    mutate(Rank = 1:nrow(.), auc = round(auc, 4)) %>% rename(AUC_Val = auc)
# df_results %>% knitr::kable(caption = 'Table 1: AUC on Validation Data')
datatable(df_results, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
    formatRound(columns = -1, digits = 4)
2.7 AUC on test data by i-th model:
getAUC_onTestData <- function(i) {
    # Extract i-th model:
    best_ith <- h2o.getModel(autoML@leaderboard[i, 1])
    # Model performance by ith model by AUC on Test data:
    metrics_ith <- h2o.performance(model = best_ith, newdata = test)
    # Return output:
    return(data.frame(AUC_Test = metrics_ith@metrics$AUC, model_id = best_ith@model_id))
}
# Calculate AUC for all models:
auc_on_testData <- lapply(1:nrow(df_results), getAUC_onTestData)
auc_on_testData <- do.call("bind_rows", auc_on_testData)
# AUC by all models on test data:
auc_on_testData %>% select(model_id, AUC_Test) %>% knitr::kable(caption = "Table 2: AUC on Test Data")
| model_id | AUC_Test |
|---|---|
| StackedEnsemble_AllModels_AutoML_20200528_131647 | 0.97 |
| StackedEnsemble_BestOfFamily_AutoML_20200528_131647 | 0.97 |
| XRT_1_AutoML_20200528_131647 | 0.96 |
| GBM_3_AutoML_20200528_131647 | 0.96 |
| DRF_1_AutoML_20200528_131647 | 0.97 |
| GBM_4_AutoML_20200528_131647 | 0.97 |
| GBM_2_AutoML_20200528_131647 | 0.96 |
| GBM_1_AutoML_20200528_131647 | 0.96 |
| GBM_grid__1_AutoML_20200528_131647_model_1 | 0.93 |
| GBM_5_AutoML_20200528_131647 | 0.94 |
| DeepLearning_1_AutoML_20200528_131647 | 0.88 |
| GLM_1_AutoML_20200528_131647 | 0.83 |
2.8 Correlation between Validation and test AUC:
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 20, df = 10, p-value = 0.000000004
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.95 1.00
## sample estimates:
## cor
## 0.99
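The code behind this test is not shown. A rough sketch of how a comparable test could be run, assuming the inputs are the leaderboard AUC and the test AUC of the same 12 models: the values below are the two-decimal figures printed in this report, so the result will only approximate the output above.

```r
# Leaderboard (validation) AUC and test AUC for the 12 models, in leaderboard
# order, copied from the rounded values printed in this report.
auc_val  <- c(0.96, 0.96, 0.96, 0.96, 0.96, 0.96, 0.95, 0.95, 0.93, 0.93, 0.87, 0.85)
auc_test <- c(0.97, 0.97, 0.96, 0.96, 0.97, 0.97, 0.96, 0.96, 0.93, 0.94, 0.88, 0.83)

# Pearson correlation test between the two sets of AUC values.
ct <- cor.test(auc_val, auc_test)
print(ct$estimate)    # correlation coefficient
print(ct$parameter)   # degrees of freedom (number of models minus 2)
```

A high correlation here suggests that the leaderboard ranking carries over to unseen data, although with only 12 models the estimate is driven largely by the two weakest ones.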
3 Use best model for predicting PD
## [1] 0.94 0.98 0.99 1.00 0.99 0.99
## H2OBinomialMetrics: stackedensemble
##
## MSE: 0.043
## RMSE: 0.21
## LogLoss: 0.16
## Mean Per-Class Error: 0.12
## AUC: 0.97
## AUCPR: 0.99
## Gini: 0.94
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Bad Good Error Rate
## Bad 256 78 0.233533 =78/334
## Good 14 1432 0.009682 =14/1446
## Totals 270 1510 0.051685 =92/1780
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.500807 0.968877 270
## 2 max f2 0.357837 0.982497 288
## 3 max f0point5 0.895531 0.966056 164
## 4 max accuracy 0.598648 0.948315 253
## 5 max precision 0.981040 1.000000 0
## 6 max recall 0.062842 1.000000 342
## 7 max specificity 0.981040 1.000000 0
## 8 max absolute_mcc 0.598648 0.824292 253
## 9 max min_per_class_accuracy 0.912400 0.912172 148
## 10 max mean_per_class_accuracy 0.906957 0.914174 154
## 11 max tns 0.981040 334.000000 0
## 12 max fns 0.981040 1409.000000 0
## 13 max fps 0.001666 334.000000 399
## 14 max tps 0.062842 1446.000000 342
## 15 max tnr 0.981040 1.000000 0
## 16 max fnr 0.981040 0.974412 0
## 17 max fpr 0.001666 1.000000 399
## 18 max tpr 0.062842 1.000000 342
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.786425036925665:
## Bad Good Error Rate
## Bad 630 0 0.000000 =0/630
## Good 0 2354 0.000000 =0/2354
## Totals 630 2354 0.000000 =0/2984
# Compute variable importance and performance for all models
lb <- autoML@leaderboard
print(lb, n = nrow(lb))
## model_id auc logloss aucpr
## 1 StackedEnsemble_AllModels_AutoML_20200528_131647 0.96 0.17 0.99
## 2 StackedEnsemble_BestOfFamily_AutoML_20200528_131647 0.96 0.17 0.99
## 3 XRT_1_AutoML_20200528_131647 0.96 0.21 0.99
## 4 GBM_3_AutoML_20200528_131647 0.96 0.18 0.99
## 5 DRF_1_AutoML_20200528_131647 0.96 0.20 0.99
## 6 GBM_4_AutoML_20200528_131647 0.96 0.19 0.99
## 7 GBM_2_AutoML_20200528_131647 0.95 0.19 0.99
## 8 GBM_1_AutoML_20200528_131647 0.95 0.20 0.98
## 9 GBM_grid__1_AutoML_20200528_131647_model_1 0.93 0.22 0.98
## 10 GBM_5_AutoML_20200528_131647 0.93 0.23 0.98
## 11 DeepLearning_1_AutoML_20200528_131647 0.87 0.32 0.95
## 12 GLM_1_AutoML_20200528_131647 0.85 0.34 0.95
## mean_per_class_error rmse mse
## 1 0.13 0.22 0.048
## 2 0.14 0.22 0.049
## 3 0.13 0.24 0.059
## 4 0.13 0.22 0.050
## 5 0.15 0.24 0.057
## 6 0.11 0.22 0.049
## 7 0.12 0.23 0.052
## 8 0.14 0.24 0.055
## 9 0.17 0.25 0.063
## 10 0.16 0.25 0.064
## 11 0.30 0.30 0.092
## 12 0.28 0.32 0.100
##
## [12 rows x 7 columns]
# Get model ids for all models in the AutoML Leaderboard
model_ids <- as.data.frame(lb$model_id)[, 1]
# View variable importance for the top 5 models on the leaderboard
# (as the printed ids show, this includes the two Stacked Ensembles)
for (model_id in model_ids[1:5]) {
    print(model_id)
    m <- h2o.getModel(model_id)
    h2o.varimp(m)
    h2o.varimp_plot(m)
}
## [1] "StackedEnsemble_AllModels_AutoML_20200528_131647"
## [1] "StackedEnsemble_BestOfFamily_AutoML_20200528_131647"
## [1] "XRT_1_AutoML_20200528_131647"
## [1] "GBM_3_AutoML_20200528_131647"
## [1] "DRF_1_AutoML_20200528_131647"
3.1 Ensemble Exploration
To understand how the ensemble works, let’s take a peek inside the Stacked Ensemble “All Models” model. The “All Models” ensemble is an ensemble of all of the individual models in the AutoML run. This is often the top performing model on the leaderboard.
# Get model ids for all models in the AutoML Leaderboard
model_ids <- as.data.frame(autoML@leaderboard$model_id)[, 1]
# Get the 'All Models' Stacked Ensemble model
se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1])
# Get the Stacked Ensemble metalearner model
metalearner <- h2o.getModel(se@model$metalearner$name)
Examine the variable importance of the metalearner (combiner) algorithm in the ensemble. This shows us how much each base learner contributes to the ensemble. The AutoML Stacked Ensembles use the default metalearner algorithm (GLM with non-negative weights), so the variable importance of the metalearner is actually the standardized coefficient magnitudes of the GLM.
## variable relative_importance
## 1 XRT_1_AutoML_20200528_131647 1.14
## 2 GBM_4_AutoML_20200528_131647 0.74
## 3 DRF_1_AutoML_20200528_131647 0.72
## 4 GBM_3_AutoML_20200528_131647 0.36
## 5 GBM_2_AutoML_20200528_131647 0.00
## 6 GBM_1_AutoML_20200528_131647 0.00
## 7 GBM_grid__1_AutoML_20200528_131647_model_1 0.00
## 8 GBM_5_AutoML_20200528_131647 0.00
## 9 DeepLearning_1_AutoML_20200528_131647 0.00
## 10 GLM_1_AutoML_20200528_131647 0.00
## scaled_importance percentage
## 1 1.00 0.38
## 2 0.65 0.25
## 3 0.64 0.24
## 4 0.32 0.12
## 5 0.00 0.00
## 6 0.00 0.00
## 7 0.00 0.00
## 8 0.00 0.00
## 9 0.00 0.00
## 10 0.00 0.00
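The three importance columns in the table above are related by simple arithmetic: scaled_importance divides by the largest value, and percentage divides by the total. A short base R check, using the rounded relative_importance values from the table with shortened model names, so the last digit may differ from the printed output:

```r
# Relative importances copied from the table above (rounded to two decimals).
rel_imp <- c(XRT_1 = 1.14, GBM_4 = 0.74, DRF_1 = 0.72, GBM_3 = 0.36,
             GBM_2 = 0, GBM_1 = 0, GBM_grid_1 = 0, GBM_5 = 0,
             DeepLearning_1 = 0, GLM_1 = 0)

scaled_imp <- rel_imp / max(rel_imp)   # largest contributor is scaled to 1
percentage <- rel_imp / sum(rel_imp)   # shares that sum to 1

print(round(cbind(rel_imp, scaled_imp, percentage), 2))
```

Only four of the ten base learners receive a non-zero weight, which is what the non-negative GLM metalearner tends to produce when base learners are highly correlated.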
We can also plot the base learner contributions to the ensemble.
4 Classification Part Two: XAI (Explainable AI)
4.0.2 The explain() Function
4.0.3 Explainer for H2O Models
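# Note: y should have one value per row of `data` (the 1780 test rows) and be numeric.
# Here the full-data factor df_final$BAD (5960 values) is passed, which triggers the
# warnings in the output and leaves the residuals as NA.
explainer_automl <- DALEX::explain(model = autoML@leader, data = as.data.frame(test)[, x],
    y = df_final$BAD, predict_function = custom_predict, label = "H2O AutoML")
## Preparation of a new explainer is initiated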
## -> model label : H2O AutoML
## -> data : 1780 rows 12 cols
## -> target variable : 5960 values
## -> target variable : length of 'y' is different than number of rows in 'data' ( WARNING )
## -> target variable : Please note that 'y' is a factor. ( WARNING )
## -> target variable : Consider changing the 'y' to a logical or numerical vector.
## -> target variable : Otherwise I will not be able to calculate residuals or loss function.
## -> model_info : package Model of class: H2OBinomialModel package unrecognized , ver. Unknown , task regression ( default )
## -> predict function : custom_predict
## -> predicted values : numerical, min = 0 , mean = 0.85 , max = 1
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = NA , mean = NA , max = NA
## A new explainer has been created!
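The explainer relies on a custom_predict function whose definition is not shown in the report. A plausible sketch follows, assuming the function converts the data to an H2OFrame, scores it, and returns one probability column. Which column the original returned is an assumption; the third column (the probability of the second class level) is used here because the reported mean prediction of 0.85 is close to the share of "Good" cases.

```r
# Hypothetical predict function for DALEX: takes a model and a data.frame and
# returns one numeric score per row.
# Assumption: the third column of the H2O prediction frame (the probability of
# the second class level) is the score of interest.
custom_predict <- function(model, newdata) {
  newdata_h2o <- h2o::as.h2o(newdata)                           # data.frame -> H2OFrame
  pred <- as.data.frame(h2o::h2o.predict(model, newdata_h2o))   # predict, then class probabilities
  as.numeric(pred[[3]])
}
```

DALEX calls this function with the model and a data frame, so it needs a running H2O cluster at explanation time.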
4.0.4 Variable importance
h2o.no_progress()
vi_automl <- feature_importance(explainer_automl, type = "difference")
plot(vi_automl)
4.0.5 Partial Dependence Plots
4.1 Prediction Understanding: Instance-level explanations of the model
https://pbiecek.github.io/ema/breakDown.html
Break-down plots show how the contributions of individual explanatory variables change the average model prediction into the prediction for a single instance (observation). The green and red bars indicate, respectively, positive and negative changes in the average prediction (variable contributions).
The last bar indicates the difference between the model’s prediction for a particular observation and the average model prediction. The other bars show the contributions of the variables. Red means a negative effect on the predicted probability, while green means a positive effect. The order of variables on the y-axis corresponds to their sequence in the Break-down algorithm.
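The additive idea behind a break-down plot is easiest to verify on a linear model, where each variable's contribution has a closed form. The sketch below uses the built-in mtcars data and a plain lm() fit; it is a stand-in chosen for illustration, not the H2O model or the DALEX algorithm itself, which has to estimate contributions sequentially for non-additive models.

```r
# Linear-model special case: the contribution of variable j for one observation
# is coef_j * (x_j - mean(x_j)), and the contributions add up to the gap between
# the observation's prediction and the average prediction.
fit <- lm(mpg ~ wt + hp, data = mtcars)
obs <- mtcars[1, ]                         # the instance to explain
vars <- c("wt", "hp")

mean_pred <- mean(predict(fit, mtcars))    # average model prediction
contrib <- coef(fit)[vars] * (unlist(obs[vars]) - colMeans(mtcars[vars]))

print(contrib)                             # one bar per variable
final_pred <- mean_pred + sum(contrib)     # average prediction plus all bars
print(all.equal(unname(final_pred), unname(predict(fit, obs))))
```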
# Break-down explanation for the first observation in the test set
pb_automl <- break_down(explainer_automl, as.data.frame(test)[1, ])
plot(pb_automl)
5 Compare H2O AutoML with Xgboost
# ========================= Compare with Xgboost =========================
# Convert to data frame:
df_train <- bind_rows(as.data.frame(train), as.data.frame(valid))
df_test <- as.data.frame(test)
# Function conducts one-hot encoding:
library(caret)
one_hotEncoding <- function(df) {
    dummies <- dummyVars("~.", data = df)
    df_oneHot <- predict(dummies, df) %>% as.data.frame()
    # Keep a single 0/1 response column: BAD = 1 for "Bad"
    df_oneHot %>% select(-BAD.Good) %>% rename(BAD = BAD.Bad)
}
# Use function:
df_train <- df_train %>% one_hotEncoding()
df_test <- df_test %>% one_hotEncoding()
# Convert features to matrix form and extract the labels:
X_train <- df_train %>% select(-BAD) %>% as.matrix()
Y_train <- df_train %>% pull(BAD)
X_test <- df_test %>% select(-BAD) %>% as.matrix()
Y_test <- df_test %>% pull(BAD)
#------------------------------------------
# Train XGBoost with default parameters
#------------------------------------------
library(xgboost)
# Convert to DMatrix form for train data:
dtrain <- xgb.DMatrix(data = X_train, label = Y_train)
# Train a default XGBoost:
set.seed(29)
xgb1 <- xgboost(data = dtrain, objective = "binary:logistic", eval_metric = "auc",
    verbose = 0, nrounds = 30)
# Use Xgboost for predicting PD:
pd_xgb <- predict(xgb1, X_test)
# AUC on test data by Xgboost:
library(pROC)
auc_xgb <- roc(Y_test, pd_xgb)$auc %>% as.numeric()
# AUC on test data by the first (leader) model:
auc_best <- auc_on_testData$AUC_Test[1]
# Compare AUC by the two approaches:
data.frame(Model = c("BestAutoML", "Xgboost"), AUC = c(auc_best, auc_xgb)) %>% knitr::kable(caption = "Table 3: AUC on Test Data")
| Model | AUC |
|---|---|
| BestAutoML | 0.97 |
| Xgboost | 0.95 |
5.1 Calculate model performance by cutoff selected for Xgboost
byCutoff_xgb <- function(cutoff) {
    # Predicted and actual classes. Fixed factor levels keep the confusion matrix
    # 2 x 2 even when one class is never predicted at an extreme cutoff.
    pred <- case_when(pd_xgb >= cutoff ~ "Bad", TRUE ~ "Good") %>% factor(levels = c("Bad", "Good"))
    # thuc_te: the actual classes
    thuc_te <- case_when(Y_test == 1 ~ "Bad", Y_test == 0 ~ "Good") %>% factor(levels = c("Bad", "Good"))
    cm <- confusionMatrix(pred, thuc_te, positive = "Bad")
    bg <- cm$table %>% as.vector()
    acc <- cm$overall %>% as.vector()
    sen <- cm$byClass %>% as.vector()
    model_perCutoff <- data.frame(BB = bg[1], BG = bg[2], GB = bg[3], GG = bg[4],
        Accuracy = acc[1], Kappa = acc[2], Recall = sen[1], Specificity = sen[2],
        Cutoff = cutoff)
    return(model_perCutoff)
}
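The quantities that the function above pulls out of caret's confusion matrix can also be computed directly. A base R sketch on five made-up predictions, with "Bad" coded as 1 as in Y_test (`metrics_at_cutoff()` and the toy vectors are introduced here for illustration):

```r
# Accuracy, recall and specificity at a given cutoff, with "Bad" (1) as the
# positive class. prob_bad is the predicted probability of "Bad".
metrics_at_cutoff <- function(prob_bad, actual_bad, cutoff) {
  pred_bad <- prob_bad >= cutoff
  tp <- sum(pred_bad & actual_bad == 1)    # Bad predicted as Bad
  fp <- sum(pred_bad & actual_bad == 0)    # Good predicted as Bad
  fn <- sum(!pred_bad & actual_bad == 1)   # Bad predicted as Good
  tn <- sum(!pred_bad & actual_bad == 0)   # Good predicted as Good
  data.frame(Cutoff = cutoff,
             Accuracy = (tp + tn) / length(actual_bad),
             Recall = tp / (tp + fn),
             Specificity = tn / (tn + fp))
}

prob_bad   <- c(0.9, 0.7, 0.4, 0.2, 0.1)   # toy predicted probabilities
actual_bad <- c(1, 1, 0, 1, 0)             # toy true classes
res <- do.call(rbind, lapply(c(0.3, 0.5),
                             function(k) metrics_at_cutoff(prob_bad, actual_bad, k)))
print(res)
```

Raising the cutoff from 0.3 to 0.5 in this toy data leaves recall unchanged and improves specificity, which is the kind of trade-off the cutoff plots in the following sections display.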
# Calculate model performance by cutoff selected for best Auto ML:
# Note: pd_best must hold the leader model's predicted probability of "Bad" on
# the test set; its definition is not shown in this report.
byCutoff_best <- function(cutoff) {
    pred <- case_when(pd_best >= cutoff ~ "Bad", TRUE ~ "Good") %>% factor(levels = c("Bad", "Good"))
    # thuc_te: the actual classes
    thuc_te <- case_when(Y_test == 1 ~ "Bad", Y_test == 0 ~ "Good") %>% factor(levels = c("Bad", "Good"))
    cm <- confusionMatrix(pred, thuc_te, positive = "Bad")
    bg <- cm$table %>% as.vector()
    acc <- cm$overall %>% as.vector()
    sen <- cm$byClass %>% as.vector()
    model_perCutoff <- data.frame(BB = bg[1], BG = bg[2], GB = bg[3], GG = bg[4],
        Accuracy = acc[1], Kappa = acc[2], Recall = sen[1], Specificity = sen[2],
        Cutoff = cutoff)
    return(model_perCutoff)
}
5.2 Compare model performance by plot:
# A range of cutoffs:
cutoffs <- seq(0.05, 0.95, 0.05)
# Model performance by cutoff for the two models:
performance_cutoff_xgb <- lapply(cutoffs, byCutoff_xgb)
performance_cutoff_best <- lapply(cutoffs, byCutoff_best)
# Convert to DF and combine results:
performance_cutoff_xgb <- do.call("bind_rows", performance_cutoff_xgb)
performance_cutoff_best <- do.call("bind_rows", performance_cutoff_best)
df_comparision <- bind_rows(performance_cutoff_best %>% mutate(Model = "BestAutoML"),
    performance_cutoff_xgb %>% mutate(Model = "Xgboost"))
# Compare model performance by plot:
my_colors <- c("#e41a1c", "#377eb8")
theme_set(theme_gray())
f1 <- df_comparision %>% select(5:10) %>% gather(Metric, Value, -Cutoff, -Model) %>%
    ggplot(aes(Cutoff, Value, color = Model)) + geom_line() + geom_point() +
    scale_color_manual(values = my_colors) + facet_wrap(~Metric, scales = "free") +
    theme(legend.position = "top") + scale_y_continuous(labels = scales::percent) +
    labs(x = NULL, y = NULL, title = "Figure 1: Model Performance between AutoML and Xgboost by Cutoff",
        subtitle = "Data Source: http://www.creditriskanalytics.net")
library(plotly)
ggplotly(f1)
5.3 Model Performance between AutoML and Xgboost by Cutoff
f2 <- df_comparision %>% select(-c(5:8)) %>% gather(Metric, Value, -Cutoff, -Model) %>%
    ggplot(aes(Cutoff, Value, color = Model)) + geom_line() + geom_point() +
    scale_color_manual(values = my_colors) + facet_wrap(~Metric, scales = "free") +
    theme(legend.position = "top") +
    labs(x = NULL, y = NULL, title = "Figure 2: Model Performance between AutoML and Xgboost by Cutoff",
        subtitle = "Data Source: http://www.creditriskanalytics.net")
ggplotly(f2)
6 References
- H2O AutoML Tutorial
- Automated Machine Learning: Methods, Systems, Challenges.
- Practical Automated Machine Learning on Azure: Using Azure Machine Learning to Quickly Build AI Solutions.
- Hands-On Automated Machine Learning: A beginner’s guide to building automated machine learning systems using AutoML and Python.
## 718.66 sec elapsed
## elapsed
## 719
7 R Environment and OS
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
## [3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
## [5] LC_TIME=English_India.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.9.2.1 pROC_1.12.1 xgboost_0.71.2
## [4] caret_6.0-86 xray_0.2 forcats_0.3.0
## [7] purrr_0.3.3 readr_1.1.1 tibble_2.1.3
## [10] tidyverse_1.2.1 skimr_2.1.1 missRanger_1.0.3
## [13] inspectdf_0.0.7 h2o_3.30.0.1 finalfit_1.0.1
## [16] DT_0.4 DALEX_1.2.1 breakDown_0.2.0
## [19] tictoc_1.0 DataComputing_0.8.3 curl_4.3
## [22] base64enc_0.1-3 manipulate_1.0.1 mosaic_1.4.0
## [25] Matrix_1.2-14 mosaicData_0.17.0 ggformula_0.9.0
## [28] ggstance_0.3.1 lattice_0.20-35 knitr_1.28
## [31] stringr_1.3.1 tidyr_1.0.2 lubridate_1.7.4
## [34] dplyr_0.8.5 ggplot2_3.0.0
##
## loaded via a namespace (and not attached):
## [1] minqa_1.2.4 colorspace_1.3-2 class_7.3-14
## [4] ellipsis_0.3.0 ggdendro_0.1-20 rstudioapi_0.11
## [7] mice_3.3.0 ggrepel_0.8.0 ggfittext_0.8.1
## [10] prodlim_2018.04.18 fansi_0.4.1 ranger_0.10.1
## [13] xml2_1.2.0 codetools_0.2-15 splines_3.5.0
## [16] jsonlite_1.6.1 nloptr_1.2.0 Cairo_1.5-9
## [19] broom_0.5.0 shiny_1.1.0 compiler_3.5.0
## [22] httr_1.3.1 backports_1.1.2 assertthat_0.2.0
## [25] lazyeval_0.2.1 cli_1.0.0 later_0.7.5
## [28] formatR_1.5 htmltools_0.3.6 prettyunits_1.0.2
## [31] tools_3.5.0 gtable_0.2.0 glue_1.3.0
## [34] reshape2_1.4.3 Rcpp_1.0.4 cellranger_1.1.0
## [37] vctrs_0.2.4 nlme_3.1-137 crosstalk_1.0.0
## [40] iterators_1.0.10 timeDate_3043.102 gower_0.1.2
## [43] xfun_0.14 lme4_1.1-18-1 rvest_0.3.2
## [46] mime_0.5 miniUI_0.1.1.1 lifecycle_0.2.0
## [49] mosaicCore_0.6.0 pacman_0.4.6 pan_1.6
## [52] MASS_7.3-49 scales_1.0.0 ipred_0.9-7
## [55] hms_0.4.2 promises_1.0.1 parallel_3.5.0
## [58] yaml_2.2.0 gridExtra_2.3 rpart_4.1-13
## [61] stringi_1.2.4 highr_0.7 foreach_1.4.4
## [64] e1071_1.7-0 boot_1.3-20 lava_1.6.3
## [67] repr_0.15.0 rlang_0.4.5 pkgconfig_2.0.2
## [70] bitops_1.0-6 evaluate_0.14 recipes_0.1.12
## [73] labeling_0.3 htmlwidgets_1.5.1 tidyselect_1.0.0
## [76] plyr_1.8.4 magrittr_1.5 bookdown_0.7
## [79] R6_2.2.2 generics_0.0.2 mitml_0.3-6
## [82] pillar_1.4.3 haven_1.1.2 withr_2.1.2
## [85] survival_2.42-6 RCurl_1.95-4.11 nnet_7.3-12
## [88] modelr_0.1.2 crayon_1.3.4 questionr_0.6.3
## [91] jomo_2.6-4 utf8_1.1.4 rmarkdown_2.1
## [94] ingredients_1.2.0 progress_1.2.0 grid_3.5.0
## [97] readxl_1.1.0 data.table_1.11.6 FNN_1.1.2.1
## [100] ModelMetrics_1.2.2.2 rmdformats_0.3.3 digest_0.6.17
## [103] xtable_1.8-3 httpuv_1.4.5 stats4_3.5.0
## [106] munsell_0.5.0 viridisLite_0.3.0