Complete ML and DL using H2O AutoML

Source file ⇒ CompleteMLandDLusingH2o_AutoML.rmd

library(tictoc)
tic()

1 About Automated Machine Learning

AutoML is a function in H2O that automates the process of building a large number of models, with the goal of finding the “best” model with minimal prior knowledge or effort from the data scientist.

AutoML (this description dates from H2O 3.16; version 3.30.0.1 is used below) trains and cross-validates a default Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and a fixed grid of GLMs, and then trains two Stacked Ensemble models at the end. One ensemble contains all the models (optimized for model performance), and the second contains just the best-performing model from each algorithm class/family (optimized for production use).

2 Automated Machine Learning using h2o

AutoML Interface

The AutoML interface is designed to have as few parameters as possible, so that all the user needs to do is point to their dataset, identify the response column, and optionally specify a time constraint.

In both the R and Python APIs, AutoML uses the same data-related arguments (x, y, training_frame, validation_frame) as the other H2O algorithms.

Steps in AutoML:

  • Specify a training frame.
  • Specify the response variable and predictor variables.
  • Run AutoML where stopping is based on max number of models.
  • View the leaderboard (based on cross-validation metrics).
  • Explore the ensemble composition.
  • Save the leader model (binary format).

The x argument only needs to be specified if the user wants to exclude some predictor columns from the data frame; if all columns other than the response should be used as predictors, it can be left unspecified. The y argument (required) is the name or index of the response column, and training_frame (required) is the training set. The validation_frame argument is optional and is used for early stopping within the training of the individual models in the AutoML run.

The leaderboard_frame argument allows the user to specify a particular data frame to rank the models on the leaderboard. This frame is not used for anything besides creating the leaderboard. To control how long the AutoML run will execute, the user can specify max_runtime_secs, which defaults to 600 seconds (10 minutes). If the user does not specify all three frames (training, validation and leaderboard), the missing frames are created automatically from what is provided. For reference, the rules for auto-generating the missing frames are listed below.

When the user specifies:

  • training only: the training_frame is split into training (70%), validation (15%) and leaderboard (15%) sets.
  • training + validation: the validation_frame is split into validation (50%) and leaderboard (50%) sets; the original training frame stays as-is.
  • training + leaderboard: the training_frame is split into training (70%) and validation (30%) sets; the leaderboard frame stays as-is.
  • training + validation + leaderboard: all frames are left as-is.
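
As a quick illustration of these arguments, a minimal call might look like the sketch below (the frame names and the 600-second limit are placeholders for illustration only; the actual run used in this document appears in Section 2.5):

# Minimal sketch only: 'train_h2o', 'valid_h2o' and 'lb_h2o' are placeholder H2OFrames,
# and "BAD" is the response column used later in this document.
aml <- h2o.automl(y = "BAD",                      # response; all other columns become predictors
                  training_frame = train_h2o,     # required
                  validation_frame = valid_h2o,   # optional: early stopping of individual models
                  leaderboard_frame = lb_h2o,     # optional: used only to rank the leaderboard
                  max_runtime_secs = 600)         # optional time constraint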

2.1 Loading the required packages

# Loading the required packages

pacman::p_load(breakDown, DALEX, DT, finalfit, h2o, inspectdf, missRanger, skimr, 
    tidyverse, xray)

2.2 Data importing and Basic EDA

# ================================= Data Pre-processing
# =================================

# Clear workspace:
rm(list = ls())

# Import data: library(tidyverse)
hmeq <- read_csv("http://www.creditriskanalytics.net/uploads/1/9/5/1/19511601/hmeq.csv")

datatable(head(hmeq), rownames = FALSE, options = list(pageLength = 6, scrollX = TRUE))
str(hmeq)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5960 obs. of  13 variables:
##  $ BAD    : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ LOAN   : int  1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
##  $ MORTDUE: num  25860 70053 13500 NA 97800 ...
##  $ VALUE  : num  39025 68400 16700 NA 112000 ...
##  $ REASON : chr  "HomeImp" "HomeImp" "HomeImp" NA ...
##  $ JOB    : chr  "Other" "Other" "Other" NA ...
##  $ YOJ    : num  10.5 7 4 NA 3 9 5 11 3 16 ...
##  $ DEROG  : int  0 0 0 NA 0 0 3 0 0 0 ...
##  $ DELINQ : int  0 2 0 NA 0 0 2 0 2 0 ...
##  $ CLAGE  : num  94.4 121.8 149.5 NA 93.3 ...
##  $ NINQ   : int  1 0 1 NA 0 1 1 0 1 0 ...
##  $ CLNO   : int  9 14 10 NA 14 8 17 8 12 13 ...
##  $ DEBTINC: num  NA NA NA NA NA ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 13
##   .. ..$ BAD    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ LOAN   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ MORTDUE: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ VALUE  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ REASON : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ JOB    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ YOJ    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ DEROG  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ DELINQ : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ CLAGE  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ NINQ   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ CLNO   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ DEBTINC: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
# fix_windows_histograms()  # run once so skimr histograms render correctly in the knitted HTML

skimr::skim(hmeq)
Data summary
Name hmeq
Number of rows 5960
Number of columns 13
_______________________
Column type frequency:
character 2
numeric 11
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
REASON               252           0.96    7    7      0         2           0
JOB                  279           0.95    3    7      0         6           0

Variable type: numeric

skim_variable  n_missing  complete_rate       mean        sd       p0     p25    p50     p75    p100  hist
BAD                    0           1.00       0.20      0.40     0.00       0      0       0       1  ▇▁▁▁▂
LOAN                   0           1.00   18607.97  11207.48  1100.00   11100  16300   23300   89900  ▇▅▁▁▁
MORTDUE              518           0.91   73760.82  44457.61  2063.00   46276  65019   91488  399550  ▇▃▁▁▁
VALUE                112           0.98  101776.05  57385.78  8000.00   66076  89236  119824  855909  ▇▁▁▁▁
YOJ                  515           0.91       8.92      7.57     0.00       3      7      13      41  ▇▃▂▁▁
DEROG                708           0.88       0.25      0.85     0.00       0      0       0      10  ▇▁▁▁▁
DELINQ               580           0.90       0.45      1.13     0.00       0      0       0      15  ▇▁▁▁▁
CLAGE                308           0.95     179.77     85.81     0.00     115    173     232    1168  ▇▂▁▁▁
NINQ                 510           0.91       1.19      1.73     0.00       0      1       2      17  ▇▁▁▁▁
CLNO                 222           0.96      21.30     10.14     0.00      15     20      26      71  ▃▇▂▁▁
DEBTINC             1267           0.79      33.78      8.60     0.52      29     35      39     203  ▇▁▁▁▁
#--------------  Distributions  ------------#
# xray::distributions tries to analyze the distribution of your variables, so you
# can understand how each variable is statistically structured. It also returns a
# percentiles table of numeric variables as a result, which can inform you of the
# shape of the data.

xray::distributions(hmeq)
##    Variable     p_1    p_10     p_25     p_50      p_75     p_90      p_99
## 1     DEROG       0       0        0        0         0        1         4
## 2       BAD       0       0        0        0         0        1         1
## 3    DELINQ       0       0        0        0         0        2         5
## 4      NINQ       0       0        0        1         2        3      8.51
## 5   DEBTINC 13.2764 23.7784    29.14  34.8183   39.0031  41.4407   49.2203
## 6       YOJ       0       1        3        7        13       21        30
## 7   MORTDUE  7855.4 26976.6    46276    65019     91488 130280.3 232230.41
## 8     CLAGE 30.2426 84.5543 115.1167 173.4667  231.5623 295.7159  399.5449
## 9      CLNO       0      10       15       20        26       34        50
## 10    VALUE 26262.2   48800  66075.5  89235.5 119824.25 175094.4  289962.8
## 11     LOAN    3359    7600    11100    16300     23300    30500     60869

2.3 Missing Value Handling and data pre-processing

For classification, the response should be encoded as categorical (i.e., as a “factor” in R or an “enum” in H2O).
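
If the response were still numeric at this point, a minimal sketch of the conversion would be the following (h2o_frame is the H2OFrame created in Section 2.4; in this document the response is instead recoded to "Bad"/"Good" with case_when() and as.factor() below):

# Sketch only: convert the response to a factor before modelling.
# In base R, before uploading to H2O:
# hmeq$BAD <- as.factor(hmeq$BAD)
# Or directly on an existing H2OFrame:
# h2o_frame[, "BAD"] <- h2o.asfactor(h2o_frame[, "BAD"])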

#--------------  Anomaly detection  ------------#

# xray::anomalies analyzes all your columns for anomalies, whether they are NAs,
# Zeroes, Infinite, etc, and warns you if it detects variables with at least 80%
# of rows with those anomalies. It also warns you when all rows have the same
# value.
xray::anomalies(hmeq)
## $variables
##    Variable    q  qNA    pNA qZero  pZero qBlank pBlank qInf pInf qDistinct
## 1     DEROG 5960  708 11.88%  4527 75.96%      0      -    0    -        12
## 2       BAD 5960    0      -  4771 80.05%      0      -    0    -         2
## 3    DELINQ 5960  580  9.73%  4179 70.12%      0      -    0    -        15
## 4      NINQ 5960  510  8.56%  2531 42.47%      0      -    0    -        17
## 5   DEBTINC 5960 1267 21.26%     0      -      0      -    0    -      4694
## 6       YOJ 5960  515  8.64%   415  6.96%      0      -    0    -       100
## 7   MORTDUE 5960  518  8.69%     0      -      0      -    0    -      5054
## 8     CLAGE 5960  308  5.17%     2  0.03%      0      -    0    -      5315
## 9      CLNO 5960  222  3.72%    62  1.04%      0      -    0    -        63
## 10      JOB 5960  279  4.68%     0      -      0      -    0    -         7
## 11   REASON 5960  252  4.23%     0      -      0      -    0    -         3
## 12    VALUE 5960  112  1.88%     0      -      0      -    0    -      5382
## 13     LOAN 5960    0      -     0      -      0      -    0    -       540
##         type anomalous_percent
## 1    Integer            87.84%
## 2    Integer            80.05%
## 3    Integer            79.85%
## 4    Integer            51.02%
## 5    Numeric            21.26%
## 6    Numeric             15.6%
## 7    Numeric             8.69%
## 8    Numeric              5.2%
## 9    Integer             4.77%
## 10 Character             4.68%
## 11 Character             4.23%
## 12   Numeric             1.88%
## 13   Integer                 -
## 
## $problem_variables
##   Variable    q qNA    pNA qZero  pZero qBlank pBlank qInf pInf qDistinct
## 1    DEROG 5960 708 11.88%  4527 75.96%      0      -    0    -        12
## 2      BAD 5960   0      -  4771 80.05%      0      -    0    -         2
##      type anomalous_percent                                 problems
## 1 Integer            87.84% Anomalies present in 87.84% of the rows.
## 2 Integer            80.05% Anomalies present in 80.05% of the rows.
# install.packages('finalfit')

## Missing value info using finalfit package
hmeq %>% ff_glimpse()
## $Continuous
##           label var_type    n missing_n missing_percent     mean      sd    min
## BAD         BAD    <int> 5960         0             0.0      0.2     0.4    0.0
## LOAN       LOAN    <int> 5960         0             0.0  18608.0 11207.5 1100.0
## MORTDUE MORTDUE    <dbl> 5442       518             8.7  73760.8 44457.6 2063.0
## VALUE     VALUE    <dbl> 5848       112             1.9 101776.0 57385.8 8000.0
## YOJ         YOJ    <dbl> 5445       515             8.6      8.9     7.6    0.0
## DEROG     DEROG    <int> 5252       708            11.9      0.3     0.8    0.0
## DELINQ   DELINQ    <int> 5380       580             9.7      0.4     1.1    0.0
## CLAGE     CLAGE    <dbl> 5652       308             5.2    179.8    85.8    0.0
## NINQ       NINQ    <int> 5450       510             8.6      1.2     1.7    0.0
## CLNO       CLNO    <int> 5738       222             3.7     21.3    10.1    0.0
## DEBTINC DEBTINC    <dbl> 4693      1267            21.3     33.8     8.6    0.5
##         quartile_25  median quartile_75      max
## BAD             0.0     0.0         0.0      1.0
## LOAN        11100.0 16300.0     23300.0  89900.0
## MORTDUE     46276.0 65019.0     91488.0 399550.0
## VALUE       66075.5 89235.5    119824.2 855909.0
## YOJ             3.0     7.0        13.0     41.0
## DEROG           0.0     0.0         0.0     10.0
## DELINQ          0.0     0.0         0.0     15.0
## CLAGE         115.1   173.5       231.6   1168.2
## NINQ            0.0     1.0         2.0     17.0
## CLNO           15.0    20.0        26.0     71.0
## DEBTINC        29.1    34.8        39.0    203.3
## 
## $Categorical
##         label var_type    n missing_n missing_percent levels_n levels
## REASON REASON    <chr> 5708       252             4.2        2      -
## JOB       JOB    <chr> 5681       279             4.7        6      -
##        levels_count levels_percent
## REASON            -              -
## JOB               -              -
hmeq %>% missing_plot()

# Stage 1 - Impute missing data: library(missRanger)

hmeq_imputed <- missRanger(hmeq, seed = 29)
## 
## Missing value imputation by chained tree ensembles
## 
##   Variables ignored in imputation (wrong data type or all values missing: REASON, JOB
## iter 1:  .........
## iter 2:  .........
## iter 3:  .........
# Stage 2 - Normalize 0-1 features:

df_final <- hmeq_imputed %>% mutate(BAD = case_when(BAD == 1 ~ "Bad", TRUE ~ "Good")) %>% 
    mutate_if(is.character, as.factor) %>% mutate_if(is.numeric, function(x) {
    (x - min(x))/(max(x) - min(x))
})

hmeq_imputed %>% missing_plot()

hmeq_imputed %>% inspect_na %>% show_plot

2.4 Train-Test split

Next, let’s identify the response & predictor columns by saving them as x and y.

# Stage - Split data for training, validation and testing:

# library(h2o) h2o.init(nthreads = 40, max_mem_size = '8g')
h2o.init(max_mem_size = "2G", nthreads = 2, ip = "localhost", port = 54321)
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\HP\AppData\Local\Temp\Rtmp6fwQLS\file40c81c327da2/h2o_HP_started_from_r.out
##     C:\Users\HP\AppData\Local\Temp\Rtmp6fwQLS\file40c8374e77bf/h2o_HP_started_from_r.err
## 
## 
## Starting H2O JVM and connecting:  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         9 seconds 388 milliseconds 
##     H2O cluster timezone:       Asia/Kolkata 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.30.0.1 
##     H2O cluster version age:    1 month and 24 days  
##     H2O cluster name:           H2O_started_from_R_HP_nmn851 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.78 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.5.0 (2018-04-23)
# h2o.no_progress() # disable progress bar for RMarkdown
h2o.removeAll()  # Optional: remove anything from previous session

h2o_frame <- as.h2o(df_final)
splits <- h2o.splitFrame(h2o_frame, ratios = c(0.5, 0.2), seed = 29)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

# Define predictors and target:
y <- "BAD"
x <- setdiff(names(train), y)

2.5 Train Auto Machine Learning

Run AutoML, stopping after 10 models. The max_models argument specifies the number of individual (or “base”) models, and does not include the two ensemble models that are trained at the end.

# =================================== Training Auto Machine Learning
# ===================================


autoML <- h2o.automl(x = x, y = y, training_frame = train, leaderboard_frame = valid, 
    stopping_metric = "AUC", stopping_rounds = 10, stopping_tolerance = 0.02, max_models = 10, 
    max_runtime_secs = 60 * 60, seed = 1, sort_metric = "AUC")
## 13:16:47.214: AutoML: XGBoost is not available; skipping it.

2.6 Model performance by AUC: Leaderboard

Next, we will view the AutoML Leaderboard.

A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric. In the case of binary classification, the default ranking metric is Area Under the ROC Curve (AUC).

The leader model is stored at autoML@leader and the leaderboard is stored at autoML@leaderboard.

autoML@leader
## Model Details:
## ==============
## 
## H2OBinomialModel: stackedensemble
## Model ID:  StackedEnsemble_AllModels_AutoML_20200528_131647 
## NULL
## 
## 
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
## 
## MSE:  0.00076
## RMSE:  0.028
## LogLoss:  0.023
## Mean Per-Class Error:  0
## AUC:  1
## AUCPR:  1
## Gini:  1
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        Bad Good    Error     Rate
## Bad    630    0 0.000000   =0/630
## Good     0 2354 0.000000  =0/2354
## Totals 630 2354 0.000000  =0/2984
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.786425    1.000000 261
## 2                       max f2  0.786425    1.000000 261
## 3                 max f0point5  0.786425    1.000000 261
## 4                 max accuracy  0.786425    1.000000 261
## 5                max precision  0.981050    1.000000   0
## 6                   max recall  0.786425    1.000000 261
## 7              max specificity  0.981050    1.000000   0
## 8             max absolute_mcc  0.786425    1.000000 261
## 9   max min_per_class_accuracy  0.786425    1.000000 261
## 10 max mean_per_class_accuracy  0.786425    1.000000 261
## 11                     max tns  0.981050  630.000000   0
## 12                     max fns  0.981050 2312.000000   0
## 13                     max fps  0.001448  630.000000 399
## 14                     max tps  0.786425 2354.000000 261
## 15                     max tnr  0.981050    1.000000   0
## 16                     max fnr  0.981050    0.982158   0
## 17                     max fpr  0.001448    1.000000 399
## 18                     max tpr  0.786425    1.000000 261
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.051
## RMSE:  0.23
## LogLoss:  0.18
## Mean Per-Class Error:  0.13
## AUC:  0.96
## AUCPR:  0.99
## Gini:  0.93
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        Bad Good    Error       Rate
## Bad    476  154 0.244444   =154/630
## Good    37 2317 0.015718   =37/2354
## Totals 513 2471 0.064008  =191/2984
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.389204    0.960415 267
## 2                       max f2  0.196162    0.978170 308
## 3                 max f0point5  0.784595    0.957611 174
## 4                 max accuracy  0.403159    0.935992 264
## 5                max precision  0.981010    1.000000   0
## 6                   max recall  0.012527    1.000000 381
## 7              max specificity  0.981010    1.000000   0
## 8             max absolute_mcc  0.403159    0.800281 264
## 9   max min_per_class_accuracy  0.888581    0.901444 123
## 10 max mean_per_class_accuracy  0.888581    0.902309 123
## 11                     max tns  0.981010  630.000000   0
## 12                     max fns  0.981010 2309.000000   0
## 13                     max fps  0.001339  630.000000 399
## 14                     max tps  0.012527 2354.000000 381
## 15                     max tnr  0.981010    1.000000   0
## 16                     max fnr  0.981010    0.980884   0
## 17                     max fpr  0.001339    1.000000 399
## 18                     max tpr  0.012527    1.000000 381
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
df_results <- autoML@leaderboard %>% as.data.frame() %>% select(model_id, auc) %>% 
    mutate(Rank = 1:nrow(.), auc = round(auc, 4)) %>% rename(AUC_Val = auc)

# df_results %>% knitr::kable(caption = 'Table 1: AUC on Validation Data')

datatable(df_results, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>% 
    formatRound(columns = -1, digits = 4)

2.7 AUC on test data by the i-th model:

getAUC_onTestData <- function(i) {
    
    # Extract i-th model:
    best_ith <- h2o.getModel(autoML@leaderboard[i, 1])
    
    # Model performance by ith model by AUC on Test data:
    metrics_ith <- h2o.performance(model = best_ith, newdata = test)
    
    # Return output:
    return(data.frame(AUC_Test = metrics_ith@metrics$AUC, model_id = best_ith@model_id))
    
}

# Calculate AUC for all models:

auc_on_testData <- lapply(1:nrow(df_results), getAUC_onTestData)
auc_on_testData <- do.call("bind_rows", auc_on_testData)

# AUC by all models on test data:
auc_on_testData %>% select(model_id, AUC_Test) %>% knitr::kable(caption = "Table 2: AUC on Test Data")
Table 2: AUC on Test Data

model_id                                               AUC_Test
StackedEnsemble_AllModels_AutoML_20200528_131647           0.97
StackedEnsemble_BestOfFamily_AutoML_20200528_131647        0.97
XRT_1_AutoML_20200528_131647                               0.96
GBM_3_AutoML_20200528_131647                               0.96
DRF_1_AutoML_20200528_131647                               0.97
GBM_4_AutoML_20200528_131647                               0.97
GBM_2_AutoML_20200528_131647                               0.96
GBM_1_AutoML_20200528_131647                               0.96
GBM_grid__1_AutoML_20200528_131647_model_1                 0.93
GBM_5_AutoML_20200528_131647                               0.94
DeepLearning_1_AutoML_20200528_131647                      0.88
GLM_1_AutoML_20200528_131647                               0.83

2.8 Correlation between validation and test AUC:

cor.test(df_results$AUC_Val, auc_on_testData$AUC_Test)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 20, df = 10, p-value = 0.000000004
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.95 1.00
## sample estimates:
##  cor 
## 0.99

3 Use the best model for predicting PD (probability of default)

pd_best <- h2o.predict(autoML@leader, test) %>% as.data.frame() %>% pull(Bad)
head(pd_best)
## [1] 0.94 0.98 0.99 1.00 0.99 0.99
# h2o.predict(autoML@leader, test) %>% as.data.frame()
h2o.performance(model = autoML@leader, newdata = test)
## H2OBinomialMetrics: stackedensemble
## 
## MSE:  0.043
## RMSE:  0.21
## LogLoss:  0.16
## Mean Per-Class Error:  0.12
## AUC:  0.97
## AUCPR:  0.99
## Gini:  0.94
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        Bad Good    Error      Rate
## Bad    256   78 0.233533   =78/334
## Good    14 1432 0.009682  =14/1446
## Totals 270 1510 0.051685  =92/1780
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.500807    0.968877 270
## 2                       max f2  0.357837    0.982497 288
## 3                 max f0point5  0.895531    0.966056 164
## 4                 max accuracy  0.598648    0.948315 253
## 5                max precision  0.981040    1.000000   0
## 6                   max recall  0.062842    1.000000 342
## 7              max specificity  0.981040    1.000000   0
## 8             max absolute_mcc  0.598648    0.824292 253
## 9   max min_per_class_accuracy  0.912400    0.912172 148
## 10 max mean_per_class_accuracy  0.906957    0.914174 154
## 11                     max tns  0.981040  334.000000   0
## 12                     max fns  0.981040 1409.000000   0
## 13                     max fps  0.001666  334.000000 399
## 14                     max tps  0.062842 1446.000000 342
## 15                     max tnr  0.981040    1.000000   0
## 16                     max fnr  0.981040    0.974412   0
## 17                     max fpr  0.001666    1.000000 399
## 18                     max tpr  0.062842    1.000000 342
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.confusionMatrix(autoML@leader)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.786425036925665:
##        Bad Good    Error     Rate
## Bad    630    0 0.000000   =0/630
## Good     0 2354 0.000000  =0/2354
## Totals 630 2354 0.000000  =0/2984
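
The leader can also be saved to disk in H2O’s binary format (the final step listed in Section 2); a minimal sketch, assuming a writable local directory named "models":

# Sketch only: persist the leader model and reload it in a later session.
leader_path <- h2o.saveModel(autoML@leader, path = "models", force = TRUE)
# leader_reloaded <- h2o.loadModel(leader_path)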

# Compute variable importance and performance for all models in the leaderboard

lb <- autoML@leaderboard
print(lb, n = nrow(lb))
##                                               model_id  auc logloss aucpr
## 1     StackedEnsemble_AllModels_AutoML_20200528_131647 0.96    0.17  0.99
## 2  StackedEnsemble_BestOfFamily_AutoML_20200528_131647 0.96    0.17  0.99
## 3                         XRT_1_AutoML_20200528_131647 0.96    0.21  0.99
## 4                         GBM_3_AutoML_20200528_131647 0.96    0.18  0.99
## 5                         DRF_1_AutoML_20200528_131647 0.96    0.20  0.99
## 6                         GBM_4_AutoML_20200528_131647 0.96    0.19  0.99
## 7                         GBM_2_AutoML_20200528_131647 0.95    0.19  0.99
## 8                         GBM_1_AutoML_20200528_131647 0.95    0.20  0.98
## 9           GBM_grid__1_AutoML_20200528_131647_model_1 0.93    0.22  0.98
## 10                        GBM_5_AutoML_20200528_131647 0.93    0.23  0.98
## 11               DeepLearning_1_AutoML_20200528_131647 0.87    0.32  0.95
## 12                        GLM_1_AutoML_20200528_131647 0.85    0.34  0.95
##    mean_per_class_error rmse   mse
## 1                  0.13 0.22 0.048
## 2                  0.14 0.22 0.049
## 3                  0.13 0.24 0.059
## 4                  0.13 0.22 0.050
## 5                  0.15 0.24 0.057
## 6                  0.11 0.22 0.049
## 7                  0.12 0.23 0.052
## 8                  0.14 0.24 0.055
## 9                  0.17 0.25 0.063
## 10                 0.16 0.25 0.064
## 11                 0.30 0.30 0.092
## 12                 0.28 0.32 0.100
## 
## [12 rows x 7 columns]
# Get model ids for all models in the AutoML Leaderboard
model_ids <- as.data.frame(lb$model_id)[, 1]

# View variable importance for the top 5 models on the leaderboard
# (the two Stacked Ensembles at the top have no per-feature variable importance of their own)
for (model_id in model_ids[1:5]) {
    print(model_id)
    m <- h2o.getModel(model_id)
    h2o.varimp(m)
    h2o.varimp_plot(m)
}
## [1] "StackedEnsemble_AllModels_AutoML_20200528_131647"
## [1] "StackedEnsemble_BestOfFamily_AutoML_20200528_131647"
## [1] "XRT_1_AutoML_20200528_131647"

## [1] "GBM_3_AutoML_20200528_131647"

## [1] "DRF_1_AutoML_20200528_131647"

3.1 Ensemble Exploration

To understand how the ensemble works, let’s take a peek inside the Stacked Ensemble “All Models” model. The “All Models” ensemble is an ensemble of all of the individual models in the AutoML run. This is often the top performing model on the leaderboard.

# Get model ids for all models in the AutoML Leaderboard
model_ids <- as.data.frame(autoML@leaderboard$model_id)[, 1]
# Get the 'All Models' Stacked Ensemble model
se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1])
# Get the Stacked Ensemble metalearner model
metalearner <- h2o.getModel(se@model$metalearner$name)

Examine the variable importance of the metalearner (combiner) algorithm in the ensemble. This shows us how much each base learner is contributing to the ensemble. The AutoML Stacked Ensembles use the default metalearner algorithm (GLM with non-negative weights), so the variable importance of the metalearner is actually the standardized coefficient magnitudes of the GLM.

h2o.varimp(metalearner)
##                                      variable relative_importance
## 1                XRT_1_AutoML_20200528_131647                1.14
## 2                GBM_4_AutoML_20200528_131647                0.74
## 3                DRF_1_AutoML_20200528_131647                0.72
## 4                GBM_3_AutoML_20200528_131647                0.36
## 5                GBM_2_AutoML_20200528_131647                0.00
## 6                GBM_1_AutoML_20200528_131647                0.00
## 7  GBM_grid__1_AutoML_20200528_131647_model_1                0.00
## 8                GBM_5_AutoML_20200528_131647                0.00
## 9       DeepLearning_1_AutoML_20200528_131647                0.00
## 10               GLM_1_AutoML_20200528_131647                0.00
##    scaled_importance percentage
## 1               1.00       0.38
## 2               0.65       0.25
## 3               0.64       0.24
## 4               0.32       0.12
## 5               0.00       0.00
## 6               0.00       0.00
## 7               0.00       0.00
## 8               0.00       0.00
## 9               0.00       0.00
## 10              0.00       0.00
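
Because the AutoML metalearner is a GLM, the same contributions can be read directly from its coefficients; a short sketch (h2o.coef_norm() returns the coefficients on the standardized scale, which is what the variable importance above is based on):

# Standardized metalearner coefficients: the non-zero entries are the base learners
# that actually contribute to the ensemble.
h2o.coef_norm(metalearner)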

We can also plot the base learner contributions to the ensemble.

h2o.varimp_plot(metalearner)

4 Classification Part Two: XAI (Explainable AI)

4.0.1 Package DALEX

# Descriptive mAchine Learning EXplanations (DALEX)
library(DALEX)

4.0.2 The explain() Function

# Custom Predict Function
custom_predict <- function(model, newdata) {
    newdata_h2o <- as.h2o(newdata)
    res <- as.data.frame(h2o.predict(model, newdata_h2o))
    return(round(res[, 3]))  # column 3 is the 'Good' class probability; rounding yields 0/1 predictions
}
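
As a quick sanity check, the custom predict function can be called directly on a few test rows (a sketch; test and x were defined in Section 2.4, and the output is the rounded probability of the "Good" class):

# Sketch only: rounded P(Good) for the first five test observations.
custom_predict(autoML@leader, as.data.frame(test)[1:5, x])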

4.0.3 Explainer for H2O Models

explainer_automl <- DALEX::explain(model = autoML@leader, data = as.data.frame(test)[, 
    x], y = df_final$BAD, predict_function = custom_predict, label = "H2O AutoML")
# Note: 'y' is taken from the full df_final (5960 values) while 'data' is the 1780-row
# test set, which triggers the length-mismatch warning below; passing the test-set
# response instead (e.g. as.numeric(as.data.frame(test)$BAD == "Good")) would avoid it.
## Preparation of a new explainer is initiated
##   -> model label       :  H2O AutoML 
##   -> data              :  1780  rows  12  cols 
##   -> target variable   :  5960  values 
##   -> target variable   :  length of 'y' is different than number of rows in 'data' (  WARNING  ) 
##   -> target variable   :  Please note that 'y' is a factor.  (  WARNING  )
##   -> target variable   :  Consider changing the 'y' to a logical or numerical vector.
##   -> target variable   :  Otherwise I will not be able to calculate residuals or loss function.
##   -> model_info        :  package Model of class: H2OBinomialModel package unrecognized , ver. Unknown , task regression (  default  ) 
##   -> predict function  :  custom_predict 
##   -> predicted values  :  numerical, min =  0 , mean =  0.85 , max =  1  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  NA , mean =  NA , max =  NA  
##   A new explainer has been created! 

4.0.4 Variable importance

h2o.no_progress()
vi_automl <- feature_importance(explainer_automl, type = "difference")
plot(vi_automl)

4.0.5 Partial Dependence Plots

4.0.5.1 Prediction Understanding

library(DALEX)

Let’s look at feature LOAN

pdp_automl_rm <- variable_effect_partial_dependency(explainer_automl, variable = "LOAN")
plot(pdp_automl_rm)

4.1 Prediction Understanding: Instance-level explanations of the model

https://pbiecek.github.io/ema/breakDown.html

Break-down plots show how the contributions of individual explanatory variables shift the average model prediction towards the prediction for a single instance (observation). The green and red bars indicate, respectively, positive and negative changes relative to the average prediction (the variable contributions).

The last bar shows the difference between the model’s prediction for the chosen observation and the average model prediction; the other bars show the contributions of individual variables. Red means a negative effect on the prediction (here, the rounded probability of the “Good” class returned by the custom predict function), while green means a positive effect. The order of variables on the y-axis corresponds to the sequence used by the break-down algorithm.

library(breakDown)
# Break-down explanation for the first observation in the test set
pb_automl <- break_down(explainer_automl, as.data.frame(test)[1, ])
plot(pb_automl)

5 Compare H2O AutoML with XGBoost

# ========================= Compare with Xgboost =========================

# Convert to data frame:

df_train <- bind_rows(as.data.frame(train), as.data.frame(valid))
df_test <- as.data.frame(test)


# Function conducts one-hot encoding:

library(caret)

one_hotEncoding <- function(df) {
    
    dummies <- dummyVars("~.", data = df)
    
    df_oneHot <- predict(dummies, df) %>% as.data.frame()
    
    df_oneHot %>% select(-BAD.Good) %>% rename(BAD = BAD.Bad) %>% return()
}


# Use function:
df_train <- df_train %>% one_hotEncoding()
df_test <- df_test %>% one_hotEncoding()

# Convert features to DMatrix form:

X_train <- df_train %>% select(-BAD) %>% as.matrix()

Y_train <- df_train %>% pull(BAD)

X_test <- df_test %>% select(-BAD) %>% as.matrix()

Y_test <- df_test %>% pull(BAD)

#------------------------------------------
# Train XGBoost with default parameters
#------------------------------------------
library(xgboost)

# Convert to DMatrix form for train data:
dtrain <- xgb.DMatrix(data = X_train, label = Y_train)

# Train a default XGBoost:
set.seed(29)
xgb1 <- xgboost(data = dtrain, objective = "binary:logistic", eval_metric = "auc", 
    verbose = 0, nrounds = 30)

# Use Xgboost for predicting PD:
pd_xgb <- predict(xgb1, X_test)


# AUC on test data by Xgboost:
library(pROC)
auc_xgb <- roc(Y_test, pd_xgb)$auc %>% as.numeric()


# AUC on test data for the best (rank-1) AutoML model:
auc_best <- auc_on_testData$AUC_Test[1]

# Compare AUC by the two approaches:
data.frame(Model = c("BestAutoML", "Xgboost"), AUC = c(auc_best, auc_xgb)) %>% knitr::kable(caption = "Table 3: AUC on Test Data")
Table 3: AUC on Test Data
Model AUC
BestAutoML 0.97
Xgboost 0.95

5.1 Calculate model performance by selected cutoff (XGBoost and best AutoML)

byCutoff_xgb <- function(cutoff) {
    
    pred <- case_when(pd_xgb >= cutoff ~ "Bad", TRUE ~ "Good") %>% as.factor()
    
    thuc_te <- case_when(Y_test == 1 ~ "Bad", Y_test == 0 ~ "Good") %>% as.factor()  # actual (observed) classes
    
    cm <- confusionMatrix(pred, thuc_te, positive = "Bad")
    
    bg <- cm$table %>% as.vector()
    acc <- cm$overall %>% as.vector()
    sen <- cm$byClass %>% as.vector()
    
    model_perCutoff <- data.frame(BB = bg[1], BG = bg[2], GB = bg[3], GG = bg[4], 
        Accuracy = acc[1], Kappa = acc[2], Recall = sen[1], Specificity = sen[2], 
        Cutoff = cutoff)
    
    return(model_perCutoff)
    
}

# Calculate model performance by cutoff selected for best Auto ML:



byCutoff_best <- function(cutoff) {
    
    pred <- case_when(pd_best >= cutoff ~ "Bad", TRUE ~ "Good") %>% as.factor()
    
    thuc_te <- case_when(Y_test == 1 ~ "Bad", Y_test == 0 ~ "Good") %>% as.factor()
    
    cm <- confusionMatrix(pred, thuc_te, positive = "Bad")
    
    bg <- cm$table %>% as.vector()
    acc <- cm$overall %>% as.vector()
    sen <- cm$byClass %>% as.vector()
    
    model_perCutoff <- data.frame(BB = bg[1], BG = bg[2], GB = bg[3], GG = bg[4], 
        Accuracy = acc[1], Kappa = acc[2], Recall = sen[1], Specificity = sen[2], 
        Cutoff = cutoff)
    
    return(model_perCutoff)
    
}

5.2 Compare model performance by plot:

# A range of cutoffs:
cutoffs <- seq(0.05, 0.95, 0.05)

# Model performance by cutoff for the two models:
performance_cutoff_xgb <- lapply(cutoffs, byCutoff_xgb)
performance_cutoff_best <- lapply(cutoffs, byCutoff_best)

# Convert to DF and combine results:
performance_cutoff_xgb <- do.call("bind_rows", performance_cutoff_xgb)
performance_cutoff_best <- do.call("bind_rows", performance_cutoff_best)


df_comparision <- bind_rows(performance_cutoff_best %>% mutate(Model = "BestAutoML"), 
    performance_cutoff_xgb %>% mutate(Model = "Xgboost"))


# Compare model performance by plot:


my_colors <- c("#e41a1c", "#377eb8")
theme_set(theme_gray())

f1 <- df_comparision %>% select(5:10) %>% gather(Metric, Value, -Cutoff, -Model) %>% 
    ggplot(aes(Cutoff, Value, color = Model)) + geom_line() + geom_point() + scale_color_manual(values = my_colors) + 
    facet_wrap(~Metric, scales = "free") + theme(legend.position = "top") + scale_y_continuous(labels = scales::percent) + 
    labs(x = NULL, y = NULL, title = "Figure 1: Model Performance between AutoML and Xgboost by Cutoff", 
        subtitle = "Data Source: http://www.creditriskanalytics.net")


library(plotly)
ggplotly(f1)

5.3 Confusion-matrix counts for AutoML and XGBoost by cutoff

f2 <- df_comparision %>% select(-c(5:8)) %>% gather(Metric, Value, -Cutoff, -Model) %>% 
    ggplot(aes(Cutoff, Value, color = Model)) + geom_line() + geom_point() + scale_color_manual(values = my_colors) + 
    facet_wrap(~Metric, scales = "free") + theme(legend.position = "top") + labs(x = NULL, 
    y = NULL, title = "Figure 2: Model Performance between AutoML and Xgboost by Cutoff", 
    subtitle = "Data Source: http://www.creditriskanalytics.net")

ggplotly(f2)

6 R Environment and OS

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252   
## [3] LC_MONETARY=English_India.1252 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plotly_4.9.2.1      pROC_1.12.1         xgboost_0.71.2     
##  [4] caret_6.0-86        xray_0.2            forcats_0.3.0      
##  [7] purrr_0.3.3         readr_1.1.1         tibble_2.1.3       
## [10] tidyverse_1.2.1     skimr_2.1.1         missRanger_1.0.3   
## [13] inspectdf_0.0.7     h2o_3.30.0.1        finalfit_1.0.1     
## [16] DT_0.4              DALEX_1.2.1         breakDown_0.2.0    
## [19] tictoc_1.0          DataComputing_0.8.3 curl_4.3           
## [22] base64enc_0.1-3     manipulate_1.0.1    mosaic_1.4.0       
## [25] Matrix_1.2-14       mosaicData_0.17.0   ggformula_0.9.0    
## [28] ggstance_0.3.1      lattice_0.20-35     knitr_1.28         
## [31] stringr_1.3.1       tidyr_1.0.2         lubridate_1.7.4    
## [34] dplyr_0.8.5         ggplot2_3.0.0      
## 
## loaded via a namespace (and not attached):
##   [1] minqa_1.2.4          colorspace_1.3-2     class_7.3-14        
##   [4] ellipsis_0.3.0       ggdendro_0.1-20      rstudioapi_0.11     
##   [7] mice_3.3.0           ggrepel_0.8.0        ggfittext_0.8.1     
##  [10] prodlim_2018.04.18   fansi_0.4.1          ranger_0.10.1       
##  [13] xml2_1.2.0           codetools_0.2-15     splines_3.5.0       
##  [16] jsonlite_1.6.1       nloptr_1.2.0         Cairo_1.5-9         
##  [19] broom_0.5.0          shiny_1.1.0          compiler_3.5.0      
##  [22] httr_1.3.1           backports_1.1.2      assertthat_0.2.0    
##  [25] lazyeval_0.2.1       cli_1.0.0            later_0.7.5         
##  [28] formatR_1.5          htmltools_0.3.6      prettyunits_1.0.2   
##  [31] tools_3.5.0          gtable_0.2.0         glue_1.3.0          
##  [34] reshape2_1.4.3       Rcpp_1.0.4           cellranger_1.1.0    
##  [37] vctrs_0.2.4          nlme_3.1-137         crosstalk_1.0.0     
##  [40] iterators_1.0.10     timeDate_3043.102    gower_0.1.2         
##  [43] xfun_0.14            lme4_1.1-18-1        rvest_0.3.2         
##  [46] mime_0.5             miniUI_0.1.1.1       lifecycle_0.2.0     
##  [49] mosaicCore_0.6.0     pacman_0.4.6         pan_1.6             
##  [52] MASS_7.3-49          scales_1.0.0         ipred_0.9-7         
##  [55] hms_0.4.2            promises_1.0.1       parallel_3.5.0      
##  [58] yaml_2.2.0           gridExtra_2.3        rpart_4.1-13        
##  [61] stringi_1.2.4        highr_0.7            foreach_1.4.4       
##  [64] e1071_1.7-0          boot_1.3-20          lava_1.6.3          
##  [67] repr_0.15.0          rlang_0.4.5          pkgconfig_2.0.2     
##  [70] bitops_1.0-6         evaluate_0.14        recipes_0.1.12      
##  [73] labeling_0.3         htmlwidgets_1.5.1    tidyselect_1.0.0    
##  [76] plyr_1.8.4           magrittr_1.5         bookdown_0.7        
##  [79] R6_2.2.2             generics_0.0.2       mitml_0.3-6         
##  [82] pillar_1.4.3         haven_1.1.2          withr_2.1.2         
##  [85] survival_2.42-6      RCurl_1.95-4.11      nnet_7.3-12         
##  [88] modelr_0.1.2         crayon_1.3.4         questionr_0.6.3     
##  [91] jomo_2.6-4           utf8_1.1.4           rmarkdown_2.1       
##  [94] ingredients_1.2.0    progress_1.2.0       grid_3.5.0          
##  [97] readxl_1.1.0         data.table_1.11.6    FNN_1.1.2.1         
## [100] ModelMetrics_1.2.2.2 rmdformats_0.3.3     digest_0.6.17       
## [103] xtable_1.8-3         httpuv_1.4.5         stats4_3.5.0        
## [106] munsell_0.5.0        viridisLite_0.3.0

Dr. Nishant Upadhyay
Infosys-Analytics

2020-05-28