The goal is to automate building and tuning a classification model with h2o::h2o.automl(), predicting whether a Himalayan expedition member died.
Import the members data (TidyTuesday, 2020-09-22) and prepare it for h2o.
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::day() masks h2o::day()
## ✖ dplyr::filter() masks stats::filter()
## ✖ lubridate::hour() masks h2o::hour()
## ✖ dplyr::lag() masks stats::lag()
## ✖ lubridate::month() masks h2o::month()
## ✖ lubridate::week() masks h2o::week()
## ✖ lubridate::year() masks h2o::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.7 ✔ rsample 1.2.1
## ✔ dials 1.3.0 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(tidyquant)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## ── Attaching core tidyquant packages ──────────────────────── tidyquant 1.0.8 ──
## ✔ PerformanceAnalytics 2.0.4 ✔ TTR 0.24.4
## ✔ quantmod 0.4.26 ✔ xts 0.13.2
## ── Conflicts ────────────────────────────────────────── tidyquant_conflicts() ──
## ✖ zoo::as.Date() masks base::as.Date()
## ✖ zoo::as.Date.numeric() masks base::as.Date.numeric()
## ✖ scales::col_factor() masks readr::col_factor()
## ✖ lubridate::day() masks h2o::day()
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ xts::first() masks dplyr::first()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ lubridate::hour() masks h2o::hour()
## ✖ dplyr::lag() masks stats::lag()
## ✖ xts::last() masks dplyr::last()
## ✖ PerformanceAnalytics::legend() masks graphics::legend()
## ✖ TTR::momentum() masks dials::momentum()
## ✖ lubridate::month() masks h2o::month()
## ✖ yardstick::spec() masks readr::spec()
## ✖ quantmod::summary() masks h2o::summary(), base::summary()
## ✖ lubridate::week() masks h2o::week()
## ✖ lubridate::year() masks h2o::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv') %>%
# h2o requires all variables to be either numeric or factors
mutate(across(where(is.character), factor))
## Rows: 76519 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): expedition_id, member_id, peak_id, peak_name, season, sex, citizen...
## dbl (5): year, age, highpoint_metres, death_height_metres, injury_height_me...
## lgl (6): hired, success, solo, oxygen_used, died, injured
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(data)
Name | data |
Number of rows | 76519 |
Number of columns | 21 |
_______________________ | |
Column type frequency: | |
factor | 10 |
logical | 6 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
expedition_id | 0 | 1.00 | FALSE | 10350 | EVE: 99, HIM: 90, EVE: 79, EVE: 76 |
member_id | 0 | 1.00 | FALSE | 76518 | KAN: 2, ACH: 1, ACH: 1, ACH: 1 |
peak_id | 0 | 1.00 | FALSE | 391 | EVE: 21813, CHO: 8890, AMA: 8406, MAN: 4593 |
peak_name | 15 | 1.00 | FALSE | 390 | Eve: 21813, Cho: 8890, Ama: 8406, Man: 4593 |
season | 0 | 1.00 | FALSE | 5 | Spr: 37782, Aut: 35895, Win: 2101, Sum: 740 |
sex | 2 | 1.00 | FALSE | 2 | M: 69473, F: 7044 |
citizenship | 10 | 1.00 | FALSE | 212 | Nep: 16135, USA: 6448, Jap: 6432, UK: 5219 |
expedition_role | 21 | 1.00 | FALSE | 524 | Cli: 44667, H-A: 14489, Lea: 10036, Exp: 1450 |
death_cause | 75413 | 0.01 | FALSE | 12 | Ava: 369, Fal: 331, AMS: 102, Ill: 60 |
injury_type | 74807 | 0.02 | FALSE | 11 | Exp: 599, AMS: 415, Ill: 257, Fal: 117 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
hired | 0 | 1 | 0.21 | FAL: 60788, TRU: 15731 |
success | 0 | 1 | 0.38 | FAL: 47320, TRU: 29199 |
solo | 0 | 1 | 0.00 | FAL: 76398, TRU: 121 |
oxygen_used | 0 | 1 | 0.24 | FAL: 58286, TRU: 18233 |
died | 0 | 1 | 0.01 | FAL: 75413, TRU: 1106 |
injured | 0 | 1 | 0.02 | FAL: 74806, TRU: 1713 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2000.36 | 14.78 | 1905 | 1991 | 2004 | 2012 | 2019 | ▁▁▁▃▇ |
age | 3497 | 0.95 | 37.33 | 10.40 | 7 | 29 | 36 | 44 | 85 | ▁▇▅▁▁ |
highpoint_metres | 21833 | 0.71 | 7470.68 | 1040.06 | 3800 | 6700 | 7400 | 8400 | 8850 | ▁▁▆▃▇ |
death_height_metres | 75451 | 0.01 | 6592.85 | 1308.19 | 400 | 5800 | 6600 | 7550 | 8830 | ▁▁▂▇▆ |
injury_height_metres | 75510 | 0.01 | 7049.91 | 1214.24 | 400 | 6200 | 7100 | 8000 | 8880 | ▁▁▂▇▇ |
data_clean <- data %>%
# Convert logical variables to factors (h2o needs numeric or factor columns)
mutate(across(where(is.logical), as.factor)) %>%
# Drop the death/injury columns, which are nearly all missing and leak the
# outcome, plus peak_id, which duplicates peak_name; then drop remaining NAs
select(-death_cause, -injury_type, -death_height_metres, -injury_height_metres, -peak_id) %>%
na.omit() %>%
# Remove the one duplicated member record
filter(member_id != "KANG10101-01")
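Before splitting, it is worth confirming how imbalanced the target is; skim showed died is TRUE for only about 1% of rows, which is why stratified sampling and precision-recall metrics matter below. A quick check (a sketch, not part of the original pipeline):
# Tabulate the target to confirm the class imbalance
data_clean %>%
count(died) %>%
mutate(prop = n / sum(n))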
set.seed(1234)
data_split <- initial_split(data_clean, strata = "died")
train_tbl <- training(data_split)
test_tbl <- testing(data_split)
recipe_obj <- recipe(died ~ ., data = train_tbl) %>%
# Remove zero variance variables
step_zv(all_predictors())
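Note that recipe_obj is defined but never prepped or baked before the data goes to h2o, so the zero-variance filter has no effect on what follows. If you wanted it applied, one option is (a sketch, not run here):
# Estimate the recipe on the training data, then apply it to both splits
recipe_prepped <- prep(recipe_obj, training = train_tbl)
train_baked <- bake(recipe_prepped, new_data = train_tbl)
test_baked <- bake(recipe_prepped, new_data = test_tbl)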
# Initialize h2o
h2o.init()
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\bella\AppData\Local\Temp\RtmpctWXIg\file41344a292830/h2o_bella_started_from_r.out
## C:\Users\bella\AppData\Local\Temp\RtmpctWXIg\file41344f094fa4/h2o_bella_started_from_r.err
##
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 5 seconds 171 milliseconds
## H2O cluster timezone: America/New_York
## H2O data parsing timezone: UTC
## H2O cluster version: 3.44.0.3
## H2O cluster version age: 11 months and 13 days
## H2O cluster name: H2O_started_from_R_bella_saw471
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.93 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.3.3 (2024-02-29 ucrt)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is (11 months and 13 days) old. There may be a newer version available.
## Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
split.h2o <- h2o.splitFrame(as.h2o(train_tbl), ratios = c(0.85), seed = 2345)
train_h2o <- split.h2o[[1]]
valid_h2o <- split.h2o[[2]]
test_h2o <- as.h2o(test_tbl)
y <- "died"
x <- setdiff(names(train_tbl), y)
models_h2o <- h2o.automl(
x = x,
y = y,
training_frame = train_h2o,
validation_frame = valid_h2o,
leaderboard_frame = test_h2o,
max_runtime_secs = 200,
max_models = 30,
exclude_algos = "DeepLearning",
nfolds = 5,
seed = 3456
)
## 17:24:34.939: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## 17:24:34.972: AutoML: XGBoost is not available; skipping it.
Examine the output of h2o.automl().
models_h2o %>% typeof()
## [1] "S4"
models_h2o %>% slotNames()
## [1] "project_name" "leader" "leaderboard" "event_log"
## [5] "modeling_steps" "training_info"
models_h2o@leaderboard
## model_id auc logloss aucpr
## 1 GBM_2_AutoML_1_20241203_172434 0.8036697 0.06516173 0.10910235
## 2 GBM_1_AutoML_1_20241203_172434 0.7973391 0.06579470 0.10605978
## 3 GBM_grid_1_AutoML_1_20241203_172434_model_3 0.7963132 0.06565563 0.11244511
## 4 GBM_grid_1_AutoML_1_20241203_172434_model_1 0.7811029 0.06606823 0.10478619
## 5 GLM_1_AutoML_1_20241203_172434 0.7749443 0.06812567 0.07376531
## 6 GBM_4_AutoML_1_20241203_172434 0.7741146 0.06690653 0.11538820
## mean_per_class_error rmse mse
## 1 0.4026292 0.1171643 0.01372746
## 2 0.4265910 0.1162476 0.01351350
## 3 0.4042430 0.1165517 0.01358430
## 4 0.3874597 0.1162211 0.01350735
## 5 0.3964332 0.1176749 0.01384739
## 6 0.3929593 0.1165073 0.01357394
##
## [11 rows x 7 columns]
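The printed leaderboard shows only the first rows; to work with all 11 models in dplyr, the H2OFrame can be pulled into R (a sketch):
# Convert the full leaderboard to a tibble for filtering and sorting
leaderboard_tbl <- models_h2o@leaderboard %>%
as.data.frame() %>%
as_tibble()
leaderboard_tbl %>% arrange(desc(auc))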
models_h2o@leader
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: GBM_2_AutoML_1_20241203_172434
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 36 36 54773 7
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 7 7.00000 22 73 47.50000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.008961694
## RMSE: 0.09466623
## LogLoss: 0.03852236
## Mean Per-Class Error: 0.2329254
## AUC: 0.9556503
## AUCPR: 0.5973041
## Gini: 0.9113005
## R^2: 0.3456315
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## FALSE TRUE Error Rate
## FALSE 32793 153 0.004644 =153/32946
## TRUE 214 250 0.461207 =214/464
## Totals 33007 403 0.010985 =367/33410
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.159173 0.576701 160
## 2 max f2 0.087845 0.592389 205
## 3 max f0point5 0.279926 0.684172 113
## 4 max accuracy 0.279926 0.990542 113
## 5 max precision 0.987364 1.000000 0
## 6 max recall 0.003187 1.000000 384
## 7 max specificity 0.987364 1.000000 0
## 8 max absolute_mcc 0.232973 0.580404 129
## 9 max min_per_class_accuracy 0.018224 0.883621 313
## 10 max mean_per_class_accuracy 0.025603 0.890043 292
## 11 max tns 0.987364 32946.000000 0
## 12 max fns 0.987364 463.000000 0
## 13 max fps 0.001299 32946.000000 399
## 14 max tps 0.003187 464.000000 384
## 15 max tnr 0.987364 1.000000 0
## 16 max fnr 0.987364 0.997845 0
## 17 max fpr 0.001299 1.000000 399
## 18 max tpr 0.003187 1.000000 384
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## ** Validation metrics **
##
## MSE: 0.01532671
## RMSE: 0.1238011
## LogLoss: 0.07626308
## Mean Per-Class Error: 0.4368569
## AUC: 0.7368485
## AUCPR: 0.09142548
## Gini: 0.473697
## R^2: 0.005352599
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## FALSE TRUE Error Rate
## FALSE 5761 24 0.004149 =24/5785
## TRUE 80 12 0.869565 =80/92
## Totals 5841 36 0.017696 =104/5877
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.250715 0.187500 35
## 2 max f2 0.046723 0.209340 154
## 3 max f0point5 0.250715 0.254237 35
## 4 max accuracy 0.982513 0.984516 0
## 5 max precision 0.982513 1.000000 0
## 6 max recall 0.001914 1.000000 396
## 7 max specificity 0.982513 1.000000 0
## 8 max absolute_mcc 0.250715 0.200912 35
## 9 max min_per_class_accuracy 0.007850 0.695652 318
## 10 max mean_per_class_accuracy 0.008958 0.709039 309
## 11 max tns 0.982513 5785.000000 0
## 12 max fns 0.982513 91.000000 0
## 13 max fps 0.001438 5785.000000 399
## 14 max tps 0.001914 92.000000 396
## 15 max tnr 0.982513 1.000000 0
## 16 max fnr 0.982513 0.989130 0
## 17 max fpr 0.001438 1.000000 399
## 18 max tpr 0.001914 1.000000 396
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.01307764
## RMSE: 0.1143575
## LogLoss: 0.06324787
## Mean Per-Class Error: 0.4147945
## AUC: 0.8005885
## AUCPR: 0.1282246
## Gini: 0.6011769
## R^2: 0.04509201
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## FALSE TRUE Error Rate
## FALSE 32596 350 0.010623 =350/32946
## TRUE 380 84 0.818966 =380/464
## Totals 32976 434 0.021850 =730/33410
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.118795 0.187082 162
## 2 max f2 0.028451 0.231857 268
## 3 max f0point5 0.237870 0.223064 102
## 4 max accuracy 0.881138 0.986411 5
## 5 max precision 0.995080 1.000000 0
## 6 max recall 0.001527 1.000000 397
## 7 max specificity 0.995080 1.000000 0
## 8 max absolute_mcc 0.128907 0.176320 155
## 9 max min_per_class_accuracy 0.008456 0.728448 341
## 10 max mean_per_class_accuracy 0.010056 0.735163 332
## 11 max tns 0.995080 32946.000000 0
## 12 max fns 0.995080 463.000000 0
## 13 max fps 0.001074 32946.000000 399
## 14 max tps 0.001527 464.000000 397
## 15 max tnr 0.995080 1.000000 0
## 16 max fnr 0.995080 0.997845 0
## 17 max fpr 0.001074 1.000000 399
## 18 max tpr 0.001527 1.000000 397
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid
## accuracy 0.977073 0.005698 0.978001 0.983239 0.971266
## auc 0.800842 0.015145 0.825141 0.800069 0.783483
## err 0.022927 0.005698 0.021999 0.016761 0.028734
## err_count 153.200000 38.074924 147.000000 112.000000 192.000000
## f0point5 0.207694 0.038214 0.183585 0.243446 0.173193
## f1 0.202558 0.020821 0.187845 0.188406 0.193277
## f2 0.204234 0.032135 0.192308 0.153664 0.218631
## lift_top_group 15.763171 2.627420 16.048721 14.697250 13.505286
## logloss 0.063450 0.003458 0.059600 0.065307 0.066987
## max_per_class_error 0.791391 0.046694 0.804598 0.863158 0.760417
## mcc 0.196236 0.020548 0.176847 0.195857 0.182775
## mean_per_class_accuracy 0.598248 0.020663 0.591863 0.566144 0.610757
## mean_per_class_error 0.401752 0.020663 0.408137 0.433856 0.389243
## mse 0.013112 0.000714 0.012570 0.013483 0.013919
## pr_auc 0.129926 0.021163 0.106255 0.137610 0.110188
## precision 0.216797 0.063778 0.180851 0.302326 0.161972
## r2 0.042598 0.024789 0.021823 0.037972 0.017042
## recall 0.208609 0.046694 0.195402 0.136842 0.239583
## rmse 0.114471 0.003130 0.112116 0.116116 0.117980
## specificity 0.987888 0.006121 0.988324 0.995446 0.981931
## cv_4_valid cv_5_valid
## accuracy 0.971116 0.981742
## auc 0.796007 0.799508
## err 0.028884 0.018258
## err_count 193.000000 122.000000
## f0point5 0.183554 0.254692
## f1 0.205761 0.237500
## f2 0.234082 0.222482
## lift_top_group 14.394215 20.170383
## logloss 0.065496 0.059862
## max_per_class_error 0.742268 0.786517
## mcc 0.195829 0.229873
## mean_per_class_accuracy 0.619678 0.602798
## mean_per_class_error 0.380322 0.397202
## mse 0.013406 0.012179
## pr_auc 0.139295 0.156283
## precision 0.171233 0.267606
## r2 0.062897 0.073254
## recall 0.257732 0.213483
## rmse 0.115785 0.110360
## specificity 0.981625 0.992113
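Two things stand out in the leader's metrics: training AUC (0.956) sits far above the cross-validation AUC (0.800), so the GBM fits the training folds much more closely than it generalizes, and recall on the rare TRUE class is low at the F1-optimal threshold. To see which predictors drive the leader, h2o's variable importance helpers can be used (a sketch):
# Variable importance table and plot for the AutoML leader (GBM)
h2o.varimp(models_h2o@leader)
h2o.varimp_plot(models_h2o@leader, num_of_features = 10)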
#?h2o.getModel
#?h2o.saveModel
#?h2o.loadModel
# The leader from an earlier AutoML run (note the 20241121 timestamp in the
# model ID) was saved once with:
# h2o.getModel("GBM_2_AutoML_1_20241121_145121") %>%
#     h2o.saveModel("h2o_models/")
# Reload that saved model so results are reproducible across sessions:
best_model <- h2o.loadModel("../h2o_models/GBM_2_AutoML_1_20241121_145121")
predictions <- h2o.predict(best_model, newdata = test_h2o)
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'expedition_id' has levels not trained on: ["AMAD00107",
## "AMAD03102", "AMAD03311", "AMAD03315", "AMAD04322", "AMAD04324", "AMAD04337",
## "AMAD04340", "AMAD05104", "AMAD05346", ...422 not listed..., "TAWO09401",
## "THAM82301", "TILI18301", "TILI84301", "TILI97301", "TKPO03301", "TKRE16401",
## "TLNG75101", "TUKU05101", "TUKU11101"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'member_id' has levels not trained on: ["ACHN15302-01",
## "ACHN15302-11", "AMAD00101-04", "AMAD00101-05", "AMAD00103-03", "AMAD00103-04",
## "AMAD00106-03", "AMAD00107-01", "AMAD00107-02", "AMAD00110-04", ...13076 not
## listed..., "YALU89101-11", "YALU89101-16", "YALU89101-17", "YALU89401-05",
## "YALU91301-05", "YANK17301-01", "YANS03301-03", "YARA18301-02", "YAUP17101-05",
## "YAUP17101-07"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'peak_name' has levels not trained on: ["Kumlung", "Lashar I",
## "Ngojumba Kang III"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'citizenship' has levels not trained on: ["Algeria",
## "Australia/UK", "Azerbaijan", "Bahrain", "Canada/Macedonia", "Canada/UK",
## "China/USA", "Egypt/UK", "Netherlands/Switzerland", "New Zealand/UK",
## "Poland/Canada", "Saudi Arabia/USA", "Spain/Brazil", "Syria", "UK/Iceland",
## "USA/China", "USA/Israel", "USA/Jamaica"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'expedition_role' has levels not trained on: ["2nd Deputy
## Leader", "ABC Staff", "BC Manager/Climber", "BC Mgr/Coach", "Climber (Guest)",
## "Climber (NE Ridge)", "Climber (trekking agency)", "Climber/Advisor",
## "Climber/Food Officer", "Climber/Journalist", ...16 not listed..., "Leaer",
## "Naike", "Non-member", "PR & media manager", "Physiologist", "Rope Team/H-A
## Assistant", "Scientific Coordinator", "Signal Officer", "Transport Officer",
## "leader"]
predictions_tbl <- predictions %>%
as_tibble()
predictions_tbl %>%
bind_cols(test_tbl)
## # A tibble: 13,096 × 19
## predict FALSE. TRUE. expedition_id member_id peak_name year season sex
## <fct> <dbl> <dbl> <fct> <fct> <fct> <dbl> <fct> <fct>
## 1 FALSE 0.989 0.0107 AMAD78301 AMAD78301-… Ama Dabl… 1978 Autumn M
## 2 FALSE 0.988 0.0116 AMAD78301 AMAD78301-… Ama Dabl… 1978 Autumn M
## 3 FALSE 0.986 0.0140 AMAD79101 AMAD79101-… Ama Dabl… 1979 Spring M
## 4 FALSE 0.997 0.00342 AMAD79101 AMAD79101-… Ama Dabl… 1979 Spring M
## 5 FALSE 0.996 0.00356 AMAD79301 AMAD79301-… Ama Dabl… 1979 Autumn M
## 6 FALSE 0.996 0.00374 AMAD79301 AMAD79301-… Ama Dabl… 1979 Autumn M
## 7 FALSE 0.993 0.00707 AMAD79303 AMAD79303-… Ama Dabl… 1979 Autumn M
## 8 FALSE 0.993 0.00689 AMAD79303 AMAD79303-… Ama Dabl… 1979 Autumn M
## 9 FALSE 0.995 0.00483 AMAD80301 AMAD80301-… Ama Dabl… 1980 Autumn M
## 10 FALSE 0.995 0.00502 AMAD80301 AMAD80301-… Ama Dabl… 1980 Autumn M
## # ℹ 13,086 more rows
## # ℹ 10 more variables: age <dbl>, citizenship <fct>, expedition_role <fct>,
## # hired <fct>, highpoint_metres <dbl>, success <fct>, solo <fct>,
## # oxygen_used <fct>, died <fct>, injured <fct>
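Since yardstick is already loaded, the test-set AUC can be cross-checked straight from this tibble; the probability column for the positive class is named TRUE. after coercion (a sketch, assuming event_level = "second" marks the TRUE level as the event):
# ROC AUC from the predicted probability of death; should be close to 0.80
predictions_tbl %>%
bind_cols(test_tbl) %>%
roc_auc(truth = died, `TRUE.`, event_level = "second")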
#?h2o.performance
performance_h2o <- h2o.performance(best_model, newdata = test_h2o)
typeof(performance_h2o)
## [1] "S4"
slotNames(performance_h2o)
## [1] "algorithm" "on_train" "on_valid" "on_xval" "metrics"
performance_h2o@metrics
## $model
## $model$`__meta`
## $model$`__meta`$schema_version
## [1] 3
##
## $model$`__meta`$schema_name
## [1] "ModelKeyV3"
##
## $model$`__meta`$schema_type
## [1] "Key<Model>"
##
##
## $model$name
## [1] "GBM_2_AutoML_1_20241121_145121"
##
## $model$type
## [1] "Key<Model>"
##
## $model$URL
## [1] "/3/Models/GBM_2_AutoML_1_20241121_145121"
##
##
## $model_checksum
## [1] "6790451505041704840"
##
## $frame
## $frame$name
## [1] "test_tbl_sid_a713_3"
##
##
## $frame_checksum
## [1] "-8290472776413816056"
##
## $description
## NULL
##
## $scoring_time
## [1] 1.733265e+12
##
## $predictions
## NULL
##
## $MSE
## [1] 0.01372746
##
## $RMSE
## [1] 0.1171643
##
## $nobs
## [1] 13096
##
## $custom_metric_name
## NULL
##
## $custom_metric_value
## [1] 0
##
## $r2
## [1] 0.02982337
##
## $logloss
## [1] 0.06516173
##
## $AUC
## [1] 0.8036697
##
## $pr_auc
## [1] 0.1091024
##
## $Gini
## [1] 0.6073394
##
## $mean_per_class_error
## [1] 0.4026292
##
## $domain
## [1] "FALSE" "TRUE"
##
## $cm
## $cm$`__meta`
## $cm$`__meta`$schema_version
## [1] 3
##
## $cm$`__meta`$schema_name
## [1] "ConfusionMatrixV3"
##
## $cm$`__meta`$schema_type
## [1] "ConfusionMatrix"
##
##
## $cm$table
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## FALSE TRUE Error Rate
## FALSE 12744 164 0.0127 = 164 / 12,908
## TRUE 149 39 0.7926 = 149 / 188
## Totals 12893 203 0.0239 = 313 / 13,096
##
##
## $thresholds_and_metric_scores
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
## threshold f1 f2 f0point5 accuracy precision recall specificity
## 1 0.977988 0.010582 0.006640 0.026042 0.985721 1.000000 0.005319 1.000000
## 2 0.952335 0.010526 0.006631 0.025510 0.985644 0.500000 0.005319 0.999923
## 3 0.895224 0.010471 0.006623 0.025000 0.985568 0.333333 0.005319 0.999845
## 4 0.839610 0.010417 0.006614 0.024510 0.985492 0.250000 0.005319 0.999768
## 5 0.797666 0.020725 0.013210 0.048077 0.985568 0.400000 0.010638 0.999768
## absolute_mcc min_per_class_accuracy mean_per_class_accuracy tns fns fps tps
## 1 0.072410 0.005319 0.502660 12908 187 0 1
## 2 0.050458 0.005319 0.502621 12907 187 1 1
## 3 0.040591 0.005319 0.502582 12906 187 2 1
## 4 0.034627 0.005319 0.502543 12905 187 3 1
## 5 0.063360 0.010638 0.505203 12905 186 3 2
## tnr fnr fpr tpr idx
## 1 1.000000 0.994681 0.000000 0.005319 0
## 2 0.999923 0.994681 0.000077 0.005319 1
## 3 0.999845 0.994681 0.000155 0.005319 2
## 4 0.999768 0.994681 0.000232 0.005319 3
## 5 0.999768 0.989362 0.000232 0.010638 4
##
## ---
## threshold f1 f2 f0point5 accuracy precision recall
## 395 0.002256 0.029154 0.069831 0.018423 0.043907 0.014793 1.000000
## 396 0.002137 0.028790 0.068996 0.018190 0.031460 0.014605 1.000000
## 397 0.001995 0.028561 0.068468 0.018044 0.023442 0.014487 1.000000
## 398 0.001831 0.028418 0.068141 0.017953 0.018403 0.014414 1.000000
## 399 0.001610 0.028350 0.067983 0.017909 0.015959 0.014379 1.000000
## 400 0.001279 0.028305 0.067880 0.017880 0.014356 0.014356 1.000000
## specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy tns
## 395 0.029981 0.021060 0.029981 0.514991 387
## 396 0.017354 0.015920 0.017354 0.508677 224
## 397 0.009219 0.011557 0.009219 0.504610 119
## 398 0.004106 0.007693 0.004106 0.502053 53
## 399 0.001627 0.004837 0.001627 0.500813 21
## 400 0.000000 0.000000 0.000000 0.500000 0
## fns fps tps tnr fnr fpr tpr idx
## 395 0 12521 188 0.029981 0.000000 0.970019 1.000000 394
## 396 0 12684 188 0.017354 0.000000 0.982646 1.000000 395
## 397 0 12789 188 0.009219 0.000000 0.990781 1.000000 396
## 398 0 12855 188 0.004106 0.000000 0.995894 1.000000 397
## 399 0 12887 188 0.001627 0.000000 0.998373 1.000000 398
## 400 0 12908 188 0.000000 0.000000 1.000000 1.000000 399
##
## $max_criteria_and_metric_scores
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.100475 0.199488 127
## 2 max f2 0.027684 0.244173 251
## 3 max f0point5 0.259758 0.233645 48
## 4 max accuracy 0.977988 0.985721 0
## 5 max precision 0.977988 1.000000 0
## 6 max recall 0.002256 1.000000 394
## 7 max specificity 0.977988 1.000000 0
## 8 max absolute_mcc 0.100475 0.187518 127
## 9 max min_per_class_accuracy 0.007961 0.723272 339
## 10 max mean_per_class_accuracy 0.010520 0.744399 324
## 11 max tns 0.977988 12908.000000 0
## 12 max fns 0.977988 187.000000 0
## 13 max fps 0.001279 12908.000000 399
## 14 max tps 0.002256 188.000000 394
## 15 max tnr 0.977988 1.000000 0
## 16 max fnr 0.977988 0.994681 0
## 17 max fpr 0.001279 1.000000 399
## 18 max tpr 0.002256 1.000000 394
##
## $gains_lift_table
## Gains/Lift Table: Avg response rate: 1.44 %, avg score: 1.24 %
## group cumulative_data_fraction lower_threshold lift cumulative_lift
## 1 1 0.01000305 0.133596 15.420822 15.420822
## 2 2 0.02000611 0.080601 7.444535 11.432678
## 3 3 0.03000916 0.059313 4.254020 9.039792
## 4 4 0.04001222 0.048213 3.722267 7.710411
## 5 5 0.05001527 0.040546 4.785772 7.125483
## 6 6 0.10003054 0.022496 3.296865 5.211174
## 7 7 0.15004582 0.015088 1.276206 3.899518
## 8 8 0.20013745 0.011418 1.699014 3.348762
## 9 9 0.30001527 0.007395 0.798848 2.499873
## 10 10 0.40004582 0.005433 0.744453 2.060934
## 11 11 0.50000000 0.004302 0.425727 1.734043
## 12 12 0.60003054 0.003657 0.372227 1.507015
## 13 13 0.69998473 0.003489 0.159648 1.314618
## 14 14 0.80001527 0.002987 0.691278 1.236679
## 15 15 0.89996946 0.002539 0.053216 1.105238
## 16 16 1.00000000 0.001148 0.053175 1.000000
## response_rate score cumulative_response_rate cumulative_score
## 1 0.221374 0.308330 0.221374 0.308330
## 2 0.106870 0.102575 0.164122 0.205453
## 3 0.061069 0.069056 0.129771 0.159987
## 4 0.053435 0.053118 0.110687 0.133270
## 5 0.068702 0.044059 0.102290 0.115428
## 6 0.047328 0.029681 0.074809 0.072554
## 7 0.018321 0.018285 0.055980 0.054464
## 8 0.024390 0.013203 0.048073 0.044137
## 9 0.011468 0.009106 0.035887 0.032475
## 10 0.010687 0.006396 0.029586 0.025954
## 11 0.006112 0.004803 0.024893 0.021726
## 12 0.005344 0.003935 0.021634 0.018760
## 13 0.002292 0.003575 0.018872 0.016592
## 14 0.009924 0.003215 0.017753 0.014919
## 15 0.000764 0.002756 0.015866 0.013568
## 16 0.000763 0.002254 0.014356 0.012436
## capture_rate cumulative_capture_rate gain cumulative_gain
## 1 0.154255 0.154255 1442.082183 1442.082183
## 2 0.074468 0.228723 644.453468 1043.267825
## 3 0.042553 0.271277 325.401981 803.979211
## 4 0.037234 0.308511 272.226734 671.041091
## 5 0.047872 0.356383 378.577229 612.548319
## 6 0.164894 0.521277 229.686536 421.117427
## 7 0.063830 0.585106 27.620594 289.951816
## 8 0.085106 0.670213 69.901401 234.876245
## 9 0.079787 0.750000 -20.115167 149.987274
## 10 0.074468 0.824468 -25.554653 106.093416
## 11 0.042553 0.867021 -57.427304 73.404255
## 12 0.037234 0.904255 -62.777327 50.701548
## 13 0.015957 0.920213 -84.035239 31.461835
## 14 0.069149 0.989362 -30.872178 23.667852
## 15 0.005319 0.994681 -94.678413 10.523845
## 16 0.005319 1.000000 -94.682475 0.000000
## kolmogorov_smirnov
## 1 0.146353
## 2 0.211757
## 3 0.244781
## 4 0.272409
## 5 0.310830
## 6 0.427381
## 7 0.441397
## 8 0.476922
## 9 0.456539
## 10 0.430604
## 11 0.372367
## 12 0.308656
## 13 0.223436
## 14 0.192104
## 15 0.096091
## 16 0.000000
h2o.auc(performance_h2o)
## [1] 0.8036697
h2o.confusionMatrix(performance_h2o)
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.100474663733287:
## FALSE TRUE Error Rate
## FALSE 12744 164 0.012705 =164/12908
## TRUE 149 39 0.792553 =149/188
## Totals 12893 203 0.023900 =313/13096
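The matrix above uses the F1-optimal threshold, which catches only 39 of the 188 deaths in the test set. If recall is the priority, h2o.confusionMatrix can report the matrix at other optimal thresholds via its metrics argument (a sketch):
# Confusion matrix at the threshold maximizing F2, which weights recall higher
h2o.confusionMatrix(performance_h2o, metrics = "f2")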
h2o.metric(performance_h2o)
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
## threshold f1 f2 f0point5 accuracy precision recall specificity
## 1 0.977988 0.010582 0.006640 0.026042 0.985721 1.000000 0.005319 1.000000
## 2 0.952335 0.010526 0.006631 0.025510 0.985644 0.500000 0.005319 0.999923
## 3 0.895224 0.010471 0.006623 0.025000 0.985568 0.333333 0.005319 0.999845
## 4 0.839610 0.010417 0.006614 0.024510 0.985492 0.250000 0.005319 0.999768
## 5 0.797666 0.020725 0.013210 0.048077 0.985568 0.400000 0.010638 0.999768
## absolute_mcc min_per_class_accuracy mean_per_class_accuracy tns fns fps tps
## 1 0.072410 0.005319 0.502660 12908 187 0 1
## 2 0.050458 0.005319 0.502621 12907 187 1 1
## 3 0.040591 0.005319 0.502582 12906 187 2 1
## 4 0.034627 0.005319 0.502543 12905 187 3 1
## 5 0.063360 0.010638 0.505203 12905 186 3 2
## tnr fnr fpr tpr idx
## 1 1.000000 0.994681 0.000000 0.005319 0
## 2 0.999923 0.994681 0.000077 0.005319 1
## 3 0.999845 0.994681 0.000155 0.005319 2
## 4 0.999768 0.994681 0.000232 0.005319 3
## 5 0.999768 0.989362 0.000232 0.010638 4
##
## ---
## threshold f1 f2 f0point5 accuracy precision recall
## 395 0.002256 0.029154 0.069831 0.018423 0.043907 0.014793 1.000000
## 396 0.002137 0.028790 0.068996 0.018190 0.031460 0.014605 1.000000
## 397 0.001995 0.028561 0.068468 0.018044 0.023442 0.014487 1.000000
## 398 0.001831 0.028418 0.068141 0.017953 0.018403 0.014414 1.000000
## 399 0.001610 0.028350 0.067983 0.017909 0.015959 0.014379 1.000000
## 400 0.001279 0.028305 0.067880 0.017880 0.014356 0.014356 1.000000
## specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy tns
## 395 0.029981 0.021060 0.029981 0.514991 387
## 396 0.017354 0.015920 0.017354 0.508677 224
## 397 0.009219 0.011557 0.009219 0.504610 119
## 398 0.004106 0.007693 0.004106 0.502053 53
## 399 0.001627 0.004837 0.001627 0.500813 21
## 400 0.000000 0.000000 0.000000 0.500000 0
## fns fps tps tnr fnr fpr tpr idx
## 395 0 12521 188 0.029981 0.000000 0.970019 1.000000 394
## 396 0 12684 188 0.017354 0.000000 0.982646 1.000000 395
## 397 0 12789 188 0.009219 0.000000 0.990781 1.000000 396
## 398 0 12855 188 0.004106 0.000000 0.995894 1.000000 397
## 399 0 12887 188 0.001627 0.000000 0.998373 1.000000 398
## 400 0 12908 188 0.000000 0.000000 1.000000 1.000000 399
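The threshold table can also be summarized visually; h2o's plot method for a performance object draws the ROC curve (a sketch):
# ROC curve for the test-set performance object
plot(performance_h2o, type = "roc")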
The h2o leader performed very similarly to the earlier xgboost model, though xgboost was slightly better: the h2o model reached a test AUC of 0.803 versus 0.813 for xgboost. Increasing max_runtime_secs and max_models did not improve the h2o model's AUC.