Import data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(correlationfunnel)

## ══ correlationfunnel Tip #3 ════════════════════════════════════════════════════
## Using `binarize()` with data containing many columns or many rows can increase dimensionality substantially.
## Try subsetting your data column-wise or row-wise to avoid creating too many columns.
## You can always make a big problem smaller by sampling. :)

library(tidymodels) # for building models

## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.1 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.4.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10

## Warning: package 'modeldata' was built under R version 4.3.3

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/

library(textrecipes) # for processing string variables
library(tidytext)
library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.3.3

library(h2o)

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## 
## Attaching package: 'h2o'
## 
## The following objects are masked from 'package:lubridate':
## 
##     day, hour, month, week, year
## 
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## 
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc

library(tidyquant)

## Loading required package: PerformanceAnalytics
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## 
## Attaching package: 'PerformanceAnalytics'
## 
## The following object is masked from 'package:graphics':
## 
##     legend
## 
## Loading required package: quantmod
## Loading required package: TTR
## 
## Attaching package: 'TTR'
## 
## The following object is masked from 'package:dials':
## 
##     momentum
## 
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

members <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv')

## Rows: 76519 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): expedition_id, member_id, peak_id, peak_name, season, sex, citizen...
## dbl  (5): year, age, highpoint_metres, death_height_metres, injury_height_me...
## lgl  (6): hired, success, solo, oxygen_used, died, injured
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
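
The message above suggests either declaring the column types or setting `show_col_types = FALSE`. A minimal sketch of the first option follows. It reads a small inline CSV so that it runs without a network connection; the three rows and their IDs are made-up stand-ins, not rows from members.csv.

```r
library(readr)

# Hypothetical miniature of members.csv; values are illustrative only
csv_text <- "member_id,age,died
AAAA00001-01,34,FALSE
AAAA00001-02,NA,TRUE
AAAA00002-01,51,FALSE
"

# I() tells readr to treat the string as literal data rather than a file path
members_demo <- read_csv(
    I(csv_text),
    col_types = cols(
        member_id = col_character(),
        age       = col_double(),
        died      = col_logical()
    )
)

members_demo
```

Declaring the types up front silences the column-specification message and makes the import fail loudly if a column changes type upstream.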

Clean data

skimr::skim(members)

Data summary
Name	members
Number of rows	76519
Number of columns	21
_______________________
Column type frequency:
character	10
logical	6
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
expedition_id	0	1.00	9	9	10350
member_id	0	1.00	12	12	76518
peak_id	0	1.00	4	4	391
peak_name	15	1.00	4	25	390
season	0	1.00	6	7	5
sex	2	1.00	1	1	2
citizenship	10	1.00	2	23	212
expedition_role	21	1.00	4	25	524
death_cause	75413	0.01	3	27	12
injury_type	74807	0.02	3	27	11

Variable type: logical

skim_variable	complete_rate	mean	count
hired	1	0.21	FAL: 60788, TRU: 15731
success	1	0.38	FAL: 47320, TRU: 29199
solo	1	0.00	FAL: 76398, TRU: 121
oxygen_used	1	0.24	FAL: 58286, TRU: 18233
died	1	0.01	FAL: 75413, TRU: 1106
injured	1	0.02	FAL: 74806, TRU: 1713

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2000.36	14.78	1905	1991	2004	2012	2019	▁▁▁▃▇
age	3497	0.95	37.33	10.40	7	29	36	44	85	▁▇▅▁▁
highpoint_metres	21833	0.71	7470.68	1040.06	3800	6700	7400	8400	8850	▁▁▆▃▇
death_height_metres	75451	0.01	6592.85	1308.19	400	5800	6600	7550	8830	▁▁▂▇▆
injury_height_metres	75510	0.01	7049.91	1214.24	400	6200	7100	8000	8880	▁▁▂▇▇
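
The `n_missing` and `complete_rate` columns in the summary above can be reproduced with base R, which is handy when skimr is not installed. The sketch below uses a toy data frame whose values are illustrative and not taken from `members`.

```r
# Toy data frame; the values are illustrative, not taken from members
toy <- data.frame(
    age              = c(34, NA, 51, NA),
    highpoint_metres = c(8850, 7400, NA, 6700),
    death_cause      = c(NA, NA, NA, "Avalanche")
)

n_missing     <- colSums(is.na(toy))        # count of NA values per column
complete_rate <- 1 - n_missing / nrow(toy)  # share of non-missing values

data.frame(n_missing, complete_rate)
```

Columns with a very low `complete_rate`, such as `death_cause` in the real data, are natural candidates for removal before modelling.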

Explore data

members1 <- members %>%
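    # Handle missing values: drop the mostly-empty columns (and peak_id),
    # drop rows missing age or highpoint_metres, and keep one row per member_id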
    select(-death_height_metres, -injury_height_metres, -death_cause, -injury_type, -peak_id) %>%
    filter(!is.na(age)) %>%
    filter(!is.na(highpoint_metres)) %>%
    distinct(member_id, .keep_all = TRUE)

members1 %>% filter(duplicated(member_id))

## # A tibble: 0 × 16
## # ℹ 16 variables: expedition_id <chr>, member_id <chr>, peak_name <chr>,
## #   year <dbl>, season <chr>, sex <chr>, age <dbl>, citizenship <chr>,
## #   expedition_role <chr>, hired <lgl>, highpoint_metres <dbl>, success <lgl>,
## #   solo <lgl>, oxygen_used <lgl>, died <lgl>, injured <lgl>
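
The empty tibble above confirms that no `member_id` is repeated after cleaning. As a small sketch of what `filter(!is.na(...))` followed by `distinct(..., .keep_all = TRUE)` does, here is the same pattern on a toy tibble (the IDs and ages are hypothetical):

```r
library(dplyr)

# Toy data: one duplicated ID and one missing age (hypothetical values)
toy <- tibble(
    member_id = c("A-01", "A-02", "A-02", "A-03"),
    age       = c(30, 41, 42, NA)
)

deduped <- toy %>%
    filter(!is.na(age)) %>%                 # drops the A-03 row
    distinct(member_id, .keep_all = TRUE)   # keeps the first A-02 row only

deduped
```

Because `distinct()` keeps the first occurrence, the order of the rows determines which duplicate survives.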

factors_vec1 <- members1 %>% select(hired, success, solo, oxygen_used, died, injured) %>% names()

members1_clean <- members1 %>%
    # Convert the logical flag columns (imported as lgl) to factors
    mutate(across(all_of(factors_vec1), as.factor)) %>%
    
    # Recode the outcome: TRUE becomes "Died"; FALSE keeps its "FALSE" label
    mutate(died = if_else(died == "TRUE", "Died", as.character(died))) %>%
    
    # Convert character to factor
    mutate(across(where(is.character), factor))
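
To see what the three `mutate()` steps produce, the sketch below applies the same pattern to a toy tibble (hypothetical values). It is only an illustration of the recoding; it selects the logical columns with `where(is.logical)` rather than the `factors_vec1` name vector.

```r
library(dplyr)

# Toy data with two logical flags (hypothetical values)
toy <- tibble(
    id   = c("a", "b", "c"),
    died = c(TRUE, FALSE, FALSE),
    solo = c(FALSE, FALSE, TRUE)
)

toy_clean <- toy %>%
    # Logical columns become factors with levels "FALSE" and "TRUE"
    mutate(across(where(is.logical), as.factor)) %>%
    # The outcome becomes a character vector: "Died" or "FALSE"
    mutate(died = if_else(died == "TRUE", "Died", as.character(died))) %>%
    # Every character column (including id and died) becomes a factor
    mutate(across(where(is.character), factor))

levels(toy_clean$died)
```

The outcome ends up with the levels "Died" and "FALSE", which is why those labels show up in the H2O output later on.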

Split data

set.seed(1234)
# Optional: downsample members1_clean here (e.g. with sample_n()) to speed up experimentation

members_split <- initial_split(members1_clean, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)

members_cv <- rsample::vfold_cv(members_train, strata = died)
members_cv

## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits               id    
##    <list>               <chr> 
##  1 <split [35373/3931]> Fold01
##  2 <split [35373/3931]> Fold02
##  3 <split [35373/3931]> Fold03
##  4 <split [35373/3931]> Fold04
##  5 <split [35374/3930]> Fold05
##  6 <split [35374/3930]> Fold06
##  7 <split [35374/3930]> Fold07
##  8 <split [35374/3930]> Fold08
##  9 <split [35374/3930]> Fold09
## 10 <split [35374/3930]> Fold10
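
A quick way to check that a stratified split behaves as intended is to compare class shares in the two pieces. The sketch below does this on toy data with a 20% minority class (hypothetical, for illustration only). With a class as rare as `died` in the real data, rsample may pool the rare stratum (see the `pool` argument of `initial_split()`), so it is worth running the same check on `members_train` and `members_test`.

```r
library(rsample)

set.seed(1234)
# Toy data with a 20% minority class (hypothetical, for illustration only)
toy <- data.frame(
    x    = rnorm(1000),
    died = factor(rep(c("Died", "FALSE"), times = c(200, 800)))
)

toy_split <- initial_split(toy, prop = 0.75, strata = died)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class shares should be close to 0.2 / 0.8 in both pieces
prop.table(table(toy_train$died))
prop.table(table(toy_test$died))
```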

Recipes

recipe_obj <- recipe(died ~ ., data = members_train) %>%
    
    # Remove zero variance variables
    step_zv(all_predictors())
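
Note that `recipe_obj` is only defined here; the H2O code further down is fed `members_train` directly. To see what `step_zv()` would do once the recipe is prepped and baked, here is a minimal sketch on a toy data frame (the `constant` column is a hypothetical zero-variance predictor):

```r
library(recipes)

# Toy training data; `constant` never varies, so step_zv() should drop it
toy <- data.frame(
    died     = factor(c("Died", "FALSE", "FALSE", "FALSE")),
    age      = c(30, 41, 28, 55),
    constant = rep(1, 4)
)

toy_recipe <- recipe(died ~ ., data = toy) %>%
    step_zv(all_predictors())

# prep() estimates the step on the training data; bake(new_data = NULL)
# returns the processed training set
toy_baked <- toy_recipe %>%
    prep() %>%
    bake(new_data = NULL)

names(toy_baked)
```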

Model

h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         3 days 19 hours 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.44.0.3 
##     H2O cluster version age:    11 months 
##     H2O cluster name:           H2O_started_from_R_kajsabergstrand_fhp551 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.36 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.3.2 (2023-10-31)

## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is (11 months) old. There may be a newer version available.
## Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

split.h2o <- h2o.splitFrame(as.h2o(members_train), ratios = c(0.85), seed = 2345)

train_h2o <- split.h2o[[1]]
valid_h2o <- split.h2o[[2]]
test_h2o <- as.h2o(members_test)

y <- "died"
x <- setdiff(names(members_train), y)

models_h2o <- h2o.automl(
    x = x,
    y = y,
    training_frame    = train_h2o, 
    validation_frame  = valid_h2o, 
    leaderboard_frame = test_h2o, 
    #max_runtime_secs  = 30,
    max_models        = 10, 
    exclude_algos     = "DeepLearning",
    nfolds            = 5, 
    seed              = 3456
)

## 11:13:48.273: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
## Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = urlSuffix,  : 
##   Unexpected CURL error: Received HTTP/0.9 when not allowed
## [1] "Job request failed Unexpected CURL error: Received HTTP/0.9 when not allowed, will retry after 3s."

Examine the output of h2o.automl

models_h2o %>% typeof()

## [1] "S4"

models_h2o %>% slotNames()

## [1] "project_name"   "leader"         "leaderboard"    "event_log"     
## [5] "modeling_steps" "training_info"

models_h2o@leaderboard

##                                                   model_id       auc    logloss
## 1    StackedEnsemble_AllModels_1_AutoML_22_20241121_111348 0.8158502 0.06114443
## 2 StackedEnsemble_BestOfFamily_1_AutoML_22_20241121_111348 0.8112641 0.06186289
## 3                          GBM_1_AutoML_22_20241121_111348 0.7991690 0.06367366
## 4                      XGBoost_1_AutoML_22_20241121_111348 0.7987528 0.06543955
## 5                          GBM_4_AutoML_22_20241121_111348 0.7984025 0.06357861
## 6                      XGBoost_2_AutoML_22_20241121_111348 0.7983176 0.06538963
##       aucpr mean_per_class_error      rmse        mse
## 1 0.9960705            0.4812215 0.1133256 0.01284270
## 2 0.9960055            0.4892473 0.1137901 0.01294819
## 3 0.9953286            0.4760000 0.1140544 0.01300841
## 4 0.9958949            0.4946624 0.1163321 0.01353317
## 5 0.9945556            0.4919355 0.1148071 0.01318067
## 6 0.9959030            0.4811828 0.1159002 0.01343285
## 
## [12 rows x 7 columns]
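
The leaderboard above appears to be sorted by `auc`. With a heavily imbalanced outcome it can be useful to re-rank by another column. The sketch below copies the six printed rows into a plain data frame and sorts by `mean_per_class_error`. On a live session, `as.data.frame(models_h2o@leaderboard)` should give a comparable data frame, though that is an assumption not shown in this document.

```r
# Values copied from the six leaderboard rows printed above
leaderboard <- data.frame(
    model_id = c(
        "StackedEnsemble_AllModels_1_AutoML_22_20241121_111348",
        "StackedEnsemble_BestOfFamily_1_AutoML_22_20241121_111348",
        "GBM_1_AutoML_22_20241121_111348",
        "XGBoost_1_AutoML_22_20241121_111348",
        "GBM_4_AutoML_22_20241121_111348",
        "XGBoost_2_AutoML_22_20241121_111348"
    ),
    auc = c(0.8158502, 0.8112641, 0.7991690, 0.7987528, 0.7984025, 0.7983176),
    logloss = c(0.06114443, 0.06186289, 0.06367366, 0.06543955, 0.06357861, 0.06538963),
    mean_per_class_error = c(0.4812215, 0.4892473, 0.4760000, 0.4946624, 0.4919355, 0.4811828)
)

# Re-rank by mean per-class error (lower is better)
by_class_error <- leaderboard[order(leaderboard$mean_per_class_error), ]
by_class_error$model_id[1]
```

Among the rows shown, the best model by AUC and the best by mean per-class error differ, so the choice of ranking metric matters here.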

models_h2o@leader

## Model Details:
## ==============
## 
## H2OBinomialModel: stackedensemble
## Model ID:  StackedEnsemble_AllModels_1_AutoML_22_20241121_111348 
## Model Summary for Stacked Ensemble: 
##                                     key            value
## 1                     Stacking strategy cross_validation
## 2  Number of base models (used / total)             8/10
## 3      # GBM base models (used / total)              3/4
## 4  # XGBoost base models (used / total)              2/3
## 5      # GLM base models (used / total)              1/1
## 6      # DRF base models (used / total)              2/2
## 7                 Metalearner algorithm              GLM
## 8    Metalearner fold assignment scheme           Random
## 9                    Metalearner nfolds                5
## 10              Metalearner fold_column               NA
## 11   Custom metalearner hyperparameters             None
## 
## 
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
## 
## MSE:  0.007992434
## RMSE:  0.08940041
## LogLoss:  0.0335353
## Mean Per-Class Error:  0.1866879
## AUC:  0.9840665
## AUCPR:  0.9997542
## Gini:  0.968133
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        Died FALSE    Error       Rate
## Died     86    51 0.372263    =51/137
## FALSE    11  9872 0.001113   =11/9883
## Totals   97  9923 0.006188  =62/10020
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.776798    0.996870 322
## 2                       max f2  0.736166    0.998322 337
## 3                 max f0point5  0.835047    0.996097 301
## 4                 max accuracy  0.786338    0.993812 320
## 5                max precision  0.999352    1.000000   0
## 6                   max recall  0.591910    1.000000 371
## 7              max specificity  0.999352    1.000000   0
## 8             max absolute_mcc  0.786338    0.744178 320
## 9   max min_per_class_accuracy  0.962421    0.934307 170
## 10 max mean_per_class_accuracy  0.944068    0.950205 207
## 11                     max tns  0.999352  137.000000   0
## 12                     max fns  0.999352 9842.000000   0
## 13                     max fps  0.136985  137.000000 399
## 14                     max tps  0.591910 9883.000000 371
## 15                     max tnr  0.999352    1.000000   0
## 16                     max fnr  0.999352    0.995851   0
## 17                     max fpr  0.136985    1.000000 399
## 18                     max tpr  0.591910    1.000000 371
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: stackedensemble
## ** Reported on validation data. **
## 
## MSE:  0.01181528
## RMSE:  0.1086981
## LogLoss:  0.05989122
## Mean Per-Class Error:  0.4794521
## AUC:  0.764154
## AUCPR:  0.9951604
## Gini:  0.528308
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        Died FALSE    Error      Rate
## Died      3    70 0.958904    =70/73
## FALSE     0  5806 0.000000   =0/5806
## Totals    3  5876 0.011907  =70/5879
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.524271    0.994008 396
## 2                       max f2  0.524271    0.997595 396
## 3                 max f0point5  0.570644    0.990647 392
## 4                 max accuracy  0.570644    0.988093 392
## 5                max precision  0.999380    1.000000   0
## 6                   max recall  0.524271    1.000000 396
## 7              max specificity  0.999380    1.000000   0
## 8             max absolute_mcc  0.570644    0.218834 392
## 9   max min_per_class_accuracy  0.989460    0.684932  94
## 10 max mean_per_class_accuracy  0.986821    0.712945 110
## 11                     max tns  0.999380   73.000000   0
## 12                     max fns  0.999380 5780.000000   0
## 13                     max fps  0.211172   73.000000 399
## 14                     max tps  0.524271 5806.000000 396
## 15                     max tnr  0.999380    1.000000   0
## 16                     max fnr  0.999380    0.995522   0
## 17                     max fpr  0.211172    1.000000 399
## 18                     max tpr  0.524271    1.000000 396
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.01298572
## RMSE:  0.1139549
## LogLoss:  0.06104625
## Mean Per-Class Error:  0.4702549
## AUC:  0.8371741
## AUCPR:  0.9966334
## Gini:  0.6743482
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        Died FALSE    Error        Rate
## Died     29   456 0.940206    =456/485
## FALSE    10 32930 0.000304   =10/32940
## Totals   39 33386 0.013942  =466/33425
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold        value idx
## 1                       max f1  0.550747     0.992974 367
## 2                       max f2  0.331035     0.997100 384
## 3                 max f0point5  0.828024     0.989411 293
## 4                 max accuracy  0.576689     0.986058 365
## 5                max precision  0.999394     1.000000   0
## 6                   max recall  0.153974     1.000000 395
## 7              max specificity  0.999394     1.000000   0
## 8             max absolute_mcc  0.828024     0.244724 293
## 9   max min_per_class_accuracy  0.986998     0.748454  78
## 10 max mean_per_class_accuracy  0.989301     0.757612  68
## 11                     max tns  0.999394   485.000000   0
## 12                     max fns  0.999394 32834.000000   0
## 13                     max fps  0.091965   485.000000 399
## 14                     max tps  0.153974 32940.000000 395
## 15                     max tnr  0.999394     1.000000   0
## 16                     max fnr  0.999394     0.996782   0
## 17                     max fpr  0.091965     1.000000 399
## 18                     max tpr  0.153974     1.000000 395
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                mean       sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy   0.986297 0.001074   0.986978   0.987179   0.986986   0.985596
## auc        0.837601 0.023955   0.837157   0.805541   0.861431   0.823745
## err        0.013703 0.001074   0.013022   0.012821   0.013014   0.014404
## err_count 91.600000 7.162402  87.000000  85.000000  88.000000  96.000000
## f0point5   0.989278 0.000767   0.989810   0.989888   0.989811   0.988528
##           cv_5_valid
## accuracy    0.984746
## auc         0.860130
## err         0.015253
## err_count 102.000000
## f0point5    0.988355
## 
## ---
##                         mean        sd cv_1_valid cv_2_valid cv_3_valid
## precision           0.986752  0.000923   0.987406   0.987466   0.987406
## r2                  0.090926  0.027205   0.084964   0.082159   0.105197
## recall              0.999514  0.000347   0.999545   0.999694   0.999550
## residual_deviance 815.154970 40.406376 795.735300 783.903200 778.394000
## rmse                0.113884  0.004035   0.111476   0.110252   0.111331
## specificity         0.086878  0.042314   0.086957   0.067416   0.105263
##                   cv_4_valid cv_5_valid
## precision           0.985738   0.985744
## r2                  0.054795   0.127518
## recall              0.999848   0.998935
## residual_deviance 860.157500 857.584960
## rmse                0.117020   0.119341
## specificity         0.030612   0.144144
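
The headline numbers in the leader output can be recomputed by hand, which helps when reading them. The sketch below takes the cross-validation confusion matrix and the training AUC printed above and derives the overall accuracy, the share of actual deaths that were predicted as "Died", and the Gini coefficient (using the standard relation Gini = 2 * AUC - 1).

```r
# Cross-validation confusion matrix copied from the leader output above
cm <- matrix(
    c(29,   456,
      10, 32930),
    nrow = 2, byrow = TRUE,
    dimnames = list(actual = c("Died", "FALSE"), predicted = c("Died", "FALSE"))
)

accuracy    <- sum(diag(cm)) / sum(cm)                 # overall share correct
died_recall <- cm["Died", "Died"] / sum(cm["Died", ])  # deaths correctly flagged

# Gini for the training metrics, from the reported training AUC
gini_train <- 2 * 0.9840665 - 1

round(c(accuracy = accuracy, died_recall = died_recall, gini_train = gini_train), 4)
```

The accuracy is high only because survivors dominate the data. The per-class error of 0.940206 reported for "Died" means that roughly 6% of deaths are caught at this threshold.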

Save and Load

?h2o.getModel
?h2o.saveModel
?h2o.loadModel

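# Note: this hard-coded ID comes from an earlier AutoML run (AutoML_20), so it does
# not appear in the AutoML_22 leaderboard printed above; update it to match your own run.
h2o.getModel("GBM_4_AutoML_20_20241121_105527") %>%
    h2o.saveModel(path = "h2o_models/")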

## [1] "/Users/kajsabergstrand/Desktop/PSU_DAT3100/11_module13/h2o_models/GBM_4_AutoML_20_20241121_105527"

best_model <- h2o.loadModel("h2o_models/GBM_4_AutoML_20_20241121_105527")

Make predictions

predictions <- h2o.predict(best_model, newdata = test_h2o)

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'expedition_id' has levels not trained on: ["ACHN15301",
## "ACHN15302", "AMAD00101", "AMAD00103", "AMAD00105", "AMAD00106", "AMAD00111",
## "AMAD00302", "AMAD00304", "AMAD00305", ...6183 not listed..., "YALU84301",
## "YALU84302", "YALU88401", "YALU89101", "YALU89301", "YALU89401", "YANK17301",
## "YANS03301", "YAUP17101", "YAUP89301"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'member_id' has levels not trained on: ["ACHN15301-01",
## "ACHN15302-01", "ACHN15302-03", "ACHN15302-10", "AMAD00101-03", "AMAD00101-04",
## "AMAD00103-03", "AMAD00103-04", "AMAD00105-01", "AMAD00106-01", ...13068 not
## listed..., "YALU89401-01", "YALU89401-05", "YANK17301-01", "YANS03301-02",
## "YANS03301-03", "YANS03301-04", "YAUP17101-01", "YAUP17101-07", "YAUP17101-09",
## "YAUP89301-01"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'peak_name' has levels not trained on: ["Aichyn", "Amotsang",
## "Amphu Gyabjen", "Amphu I", "Amphu Middle", "Anidesh Chuli", "Annapurna I
## East", "Annapurna I Middle", "Annapurna II", "Annapurna III", ...282 not
## listed..., "Tsaurabong Peak", "Tsisima", "Tso Karpo Kang", "Urkema",
## "Urkinmang", "Yakawa Kang", "Yalung Kang", "Yanme Kang", "Yansa Tsenji",
## "Yaupa"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'season' has levels not trained on: ["Summer"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'citizenship' has levels not trained on: ["Albania", "Andorra",
## "Argentina", "Argentina/Canada", "Australia/Ireland", "Australia/New Zealand",
## "Australia/UK", "Azerbaijan", "Azerbaijan/Russia", "Bangladesh", ...85 not
## listed..., "USA/Dominican Republic", "USA/Jamaica", "USA/UK", "USSR",
## "Ukraine", "Uruguay", "Uzbekistan", "Venezuela", "Vietnam", "Yugoslavia"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'expedition_role' has levels not trained on: ["2nd Deputy
## Leader", "2nd Sirdar", "ABC Manager", "ABC support", "Acting Leader",
## "Assistant Guide", "Assistant Leader", "Assistant Sirdar", "BC Manager", "BC
## Support", ...81 not listed..., "Scientific Coordinator", "Signal Officer",
## "Sirdar", "Support", "Support Climber", "Support Member", "Support member", "TV
## Reporter", "Technical Advisor", "Trekker"]
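
Several of these warnings involve `expedition_id` and `member_id`, which look like identifiers rather than meaningful predictors. One option, sketched below under that assumption, is to leave them out of the predictor vector `x` before calling `h2o.automl()`. This is a suggestion rather than what the analysis above does, and it would likely not remove the warnings for columns such as `citizenship`.

```r
# Column names as they appear in the cleaned data above
all_cols <- c(
    "expedition_id", "member_id", "peak_name", "year", "season", "sex", "age",
    "citizenship", "expedition_role", "hired", "highpoint_metres", "success",
    "solo", "oxygen_used", "died", "injured"
)

y       <- "died"
id_cols <- c("expedition_id", "member_id")  # assumption: pure identifiers

# Predictors = everything except the outcome and the identifier columns
x <- setdiff(all_cols, c(y, id_cols))
x
```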

predictions_tbl <- predictions %>%
    as_tibble()

predictions_tbl %>%
    bind_cols(members_test)

## # A tibble: 13,102 × 19
##    predict   Died FALSE. expedition_id member_id    peak_name  year season sex  
##    <fct>    <dbl>  <dbl> <fct>         <fct>        <fct>     <dbl> <fct>  <fct>
##  1 FALSE   0.118   0.882 AMAD78301     AMAD78301-06 Ama Dabl…  1978 Autumn M    
##  2 FALSE   0.118   0.882 AMAD78301     AMAD78301-07 Ama Dabl…  1978 Autumn M    
##  3 FALSE   0.0654  0.935 AMAD79101     AMAD79101-12 Ama Dabl…  1979 Spring M    
##  4 FALSE   0.0654  0.935 AMAD79101     AMAD79101-18 Ama Dabl…  1979 Spring M    
##  5 FALSE   0.119   0.881 AMAD79301     AMAD79301-20 Ama Dabl…  1979 Autumn M    
##  6 FALSE   0.119   0.881 AMAD79301     AMAD79301-22 Ama Dabl…  1979 Autumn M    
##  7 FALSE   0.0652  0.935 AMAD79303     AMAD79303-01 Ama Dabl…  1979 Autumn M    
##  8 FALSE   0.0652  0.935 AMAD79303     AMAD79303-05 Ama Dabl…  1979 Autumn M    
##  9 FALSE   0.0654  0.935 AMAD80301     AMAD80301-01 Ama Dabl…  1980 Autumn M    
## 10 FALSE   0.0654  0.935 AMAD80301     AMAD80301-02 Ama Dabl…  1980 Autumn M    
## # ℹ 13,092 more rows
## # ℹ 10 more variables: age <dbl>, citizenship <fct>, expedition_role <fct>,
## #   hired <fct>, highpoint_metres <dbl>, success <fct>, solo <fct>,
## #   oxygen_used <fct>, died <fct>, injured <fct>
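
Every row shown above is predicted "FALSE", even though the `Died` probabilities vary. To explore other cut-offs, the `Died` column can be thresholded by hand and cross-tabulated against the true outcome. The sketch below uses toy probabilities and outcomes (illustrative only, not real results) and a hypothetical threshold of 0.10.

```r
# Toy predictions; probabilities and outcomes are illustrative, not real results
preds <- data.frame(
    Died   = c(0.118, 0.0654, 0.119, 0.0652, 0.30),
    actual = factor(c("FALSE", "FALSE", "Died", "FALSE", "Died"),
                    levels = c("Died", "FALSE"))
)

threshold <- 0.10  # hypothetical cut-off on the predicted probability of death

preds$predicted <- factor(
    ifelse(preds$Died >= threshold, "Died", "FALSE"),
    levels = c("Died", "FALSE")
)

# Rows: actual class; columns: predicted class
cm_toy <- table(actual = preds$actual, predicted = preds$predicted)
cm_toy
```

Lowering the threshold catches more deaths at the cost of more false alarms, so the cut-off should reflect which error is more costly.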

Evaluate model

performance_h2o <- h2o.performance(best_model, newdata = test_h2o)
typeof(performance_h2o)

## [1] "S4"

slotNames(performance_h2o)

## [1] "algorithm" "on_train"  "on_valid"  "on_xval"   "metrics"

performance_h2o@metrics

## $model
## $model$`__meta`
## $model$`__meta`$schema_version
## [1] 3
## 
## $model$`__meta`$schema_name
## [1] "ModelKeyV3"
## 
## $model$`__meta`$schema_type
## [1] "Key<Model>"
## 
## 
## $model$name
## [1] "GBM_4_AutoML_20_20241121_105527"
## 
## $model$type
## [1] "Key<Model>"
## 
## $model$URL
## [1] "/3/Models/GBM_4_AutoML_20_20241121_105527"
## 
## 
## $model_checksum
## [1] "5880890028699031936"
## 
## $frame
## $frame$name
## [1] "members_test_sid_a856_3"
## 
## 
## $frame_checksum
## [1] "-8290376914860413972"
## 
## $description
## NULL
## 
## $scoring_time
## [1] 1.732206e+12
## 
## $predictions
## NULL
## 
## $MSE
## [1] 0.01915106
## 
## $RMSE
## [1] 0.1383874
## 
## $nobs
## [1] 13102
## 
## $custom_metric_name
## NULL
## 
## $custom_metric_value
## [1] 0
## 
## $r2
## [1] -0.3684442
## 
## $logloss
## [1] 0.1220857
## 
## $AUC
## [1] 0.6628169
## 
## $pr_auc
## [1] 0.9912565
## 
## $Gini
## [1] 0.3256339
## 
## $mean_per_class_error
## [1] 0.4811828
## 
## $domain
## [1] "Died"  "FALSE"
## 
## $cm
## $cm$`__meta`
## $cm$`__meta`$schema_version
## [1] 3
## 
## $cm$`__meta`$schema_name
## [1] "ConfusionMatrixV3"
## 
## $cm$`__meta`$schema_type
## [1] "ConfusionMatrix"
## 
## 
## $cm$table
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##        Died FALSE  Error           Rate
## Died      7   179 0.9624 =    179 / 186
## FALSE     0 12916 0.0000  =  0 / 12 916
## Totals    7 13095 0.0137 = 179 / 13 102
## 
## 
## $thresholds_and_metric_scores
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
##   threshold       f1       f2 f0point5 accuracy precision   recall specificity
## 1  0.934895 0.000155 0.000097 0.000387 0.014273  1.000000 0.000077    1.000000
## 2  0.934795 0.154209 0.102360 0.312500 0.095787  0.989918 0.083617    0.940860
## 3  0.934768 0.154999 0.102917 0.313800 0.096245  0.989973 0.084082    0.940860
## 4  0.934686 0.155131 0.103010 0.314017 0.096321  0.989982 0.084159    0.940860
## 5  0.934585 0.342583 0.245893 0.564591 0.216990  0.994050 0.206953    0.913978
##   absolute_mcc min_per_class_accuracy mean_per_class_accuracy tns   fns fps
## 1     0.001048               0.000077                0.500039 186 12915   0
## 2     0.010481               0.083617                0.512239 175 11836  11
## 3     0.010653               0.084082                0.512471 175 11830  11
## 4     0.010682               0.084159                0.512510 175 11829  11
## 5     0.035422               0.206953                0.560466 170 10243  16
##    tps      tnr      fnr      fpr      tpr idx
## 1    1 1.000000 0.999923 0.000000 0.000077   0
## 2 1080 0.940860 0.916383 0.059140 0.083617   1
## 3 1086 0.940860 0.915918 0.059140 0.084082   2
## 4 1087 0.940860 0.915841 0.059140 0.084159   3
## 5 2673 0.913978 0.793047 0.086022 0.206953   4
## 
## ---
##    threshold       f1       f2 f0point5 accuracy precision   recall specificity
## 82  0.085909 0.993080 0.997221 0.988974 0.986262  0.986255 1.000000    0.032258
## 83  0.085638 0.993042 0.997205 0.988913 0.986185  0.986180 1.000000    0.026882
## 84  0.078472 0.993004 0.997190 0.988853 0.986109  0.986105 1.000000    0.021505
## 85  0.072323 0.992966 0.997174 0.988792 0.986033  0.986029 1.000000    0.016129
## 86  0.070837 0.992889 0.997144 0.988671 0.985880  0.985879 1.000000    0.005376
## 87  0.069369 0.992851 0.997128 0.988611 0.985804  0.985804 1.000000    0.000000
##    absolute_mcc min_per_class_accuracy mean_per_class_accuracy tns fns fps
## 82     0.178367               0.032258                0.516129   6   0 180
## 83     0.162820               0.026882                0.513441   5   0 181
## 84     0.145625               0.021505                0.510753   4   0 182
## 85     0.126110               0.016129                0.508065   3   0 183
## 86     0.072804               0.005376                0.502688   1   0 185
## 87     0.000000               0.000000                0.500000   0   0 186
##      tps      tnr      fnr      fpr      tpr idx
## 82 12916 0.032258 0.000000 0.967742 1.000000  81
## 83 12916 0.026882 0.000000 0.973118 1.000000  82
## 84 12916 0.021505 0.000000 0.978495 1.000000  83
## 85 12916 0.016129 0.000000 0.983871 1.000000  84
## 86 12916 0.005376 0.000000 0.994624 1.000000  85
## 87 12916 0.000000 0.000000 1.000000 1.000000  86
## 
## $max_criteria_and_metric_scores
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold        value idx
## 1                       max f1  0.133877     0.993118  80
## 2                       max f2  0.133877     0.997236  80
## 3                 max f0point5  0.768089     0.989126  63
## 4                 max accuracy  0.133877     0.986338  80
## 5                max precision  0.934895     1.000000   0
## 6                   max recall  0.133877     1.000000  80
## 7              max specificity  0.934895     1.000000   0
## 8             max absolute_mcc  0.133877     0.192665  80
## 9   max min_per_class_accuracy  0.916652     0.596774  25
## 10 max mean_per_class_accuracy  0.925077     0.627086  21
## 11                     max tns  0.934895   186.000000   0
## 12                     max fns  0.934895 12915.000000   0
## 13                     max fps  0.069369   186.000000  86
## 14                     max tps  0.133877 12916.000000  80
## 15                     max tnr  0.934895     1.000000   0
## 16                     max fnr  0.934895     0.999923   0
## 17                     max fpr  0.069369     1.000000  86
## 18                     max tpr  0.133877     1.000000  80
## 
## $gains_lift_table
## Gains/Lift Table: Avg response rate: 98,58 %, avg score: 91,17 %
##   group cumulative_data_fraction lower_threshold     lift cumulative_lift
## 1     1               0.08326973        0.934795 1.004173        1.004173
## 2     2               0.20523584        0.934585 1.011227        1.008365
## 3     3               0.42138605        0.931867 1.008670        1.008521
## 4     4               0.57517936        0.924940 0.995774        1.005113
## 5     5               0.62906426        0.916652 1.005780        1.005170
## 6     6               0.70355671        0.902257 0.997771        1.004387
## 7     7               0.85811327        0.881979 0.995365        1.002762
## 8     8               0.90024424        0.881633 0.992349        1.002274
## 9     9               1.00000000        0.069369 0.979475        1.000000
##   response_rate    score cumulative_response_rate cumulative_score capture_rate
## 1      0.989918 0.934795                 0.989918         0.934795     0.083617
## 2      0.996871 0.934586                 0.994050         0.934671     0.123335
## 3      0.994350 0.931883                 0.994204         0.933241     0.218024
## 4      0.981638 0.924966                 0.990844         0.931028     0.153143
## 5      0.991501 0.916799                 0.990900         0.929809     0.054196
## 6      0.983607 0.907075                 0.990128         0.927402     0.074326
## 7      0.981235 0.884194                 0.988526         0.919620     0.153840
## 8      0.978261 0.881713                 0.988046         0.917846     0.041809
## 9      0.965570 0.856650                 0.985804         0.911741     0.097708
##   cumulative_capture_rate      gain cumulative_gain kolmogorov_smirnov
## 1                0.083617  0.417305        0.417305           0.024477
## 2                0.206953  1.122677        0.836489           0.120931
## 3                0.424977  0.866967        0.852122           0.252934
## 4                0.578120 -0.422597        0.511284           0.207152
## 5                0.632317  0.577977        0.516997           0.229091
## 6                0.706643 -0.222878        0.438659           0.217396
## 7                0.860483 -0.463493        0.276171           0.166935
## 8                0.902292 -0.765145        0.227438           0.144227
## 9                1.000000 -2.052507        0.000000           0.000000
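The lift and gain columns of the gains/lift table appear to follow directly from the response rates: lift is a group's response rate divided by the overall response rate, and gain is (lift - 1) * 100. The sketch below checks this for the first four groups. The values are copied by hand from the table, and the variable names are illustrative.

```r
# Values copied by hand from the gains/lift table above
avg_response_rate <- 0.985804  # overall share of the positive class (98.58 %)
response_rate     <- c(0.989918, 0.996871, 0.994350, 0.981638)  # groups 1 to 4

# Lift compares each group's response rate with the overall rate
lift <- response_rate / avg_response_rate

# Gain expresses the same comparison as a percentage above or below 1
gain <- (lift - 1) * 100

round(data.frame(group = 1:4, lift = lift, gain = gain), 4)
```

All lift values sit very close to 1, which suggests that ranking climbers by predicted score separates the two classes only slightly better than a random ordering would.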

h2o.auc(performance_h2o)

## [1] 0.6628169
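The Gini value in the metrics list is tied to the AUC by Gini = 2 * AUC - 1. A quick check, using the two values printed above (copied by hand, so the variable names are illustrative):

```r
auc_reported  <- 0.6628169  # from h2o.auc(performance_h2o)
gini_reported <- 0.3256339  # from performance_h2o@metrics$Gini

# Gini rescales AUC so that a random classifier (AUC = 0.5) scores 0
gini_from_auc <- 2 * auc_reported - 1

all.equal(gini_from_auc, gini_reported, tolerance = 1e-6)
```

An AUC of 0.5 corresponds to a Gini of 0, so a Gini of about 0.33 indicates limited ranking ability on the test set.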

h2o.confusionMatrix(performance_h2o)

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.133876663207534:
##        Died FALSE    Error        Rate
## Died      7   179 0.962366    =179/186
## FALSE     0 12916 0.000000    =0/12916
## Totals    7 13095 0.013662  =179/13102
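The headline metrics can be re-derived from the four counts in this confusion matrix. The sketch below assumes, as the tps and tns columns of the thresholds table suggest, that H2O treats the second level of the domain ("FALSE") as the positive class. The counts are copied by hand from the output above.

```r
# Counts at the max-F1 threshold, with "FALSE" taken as the positive class
tp <- 12916  # actual FALSE, predicted FALSE
fn <- 0      # actual FALSE, predicted Died
fp <- 179    # actual Died,  predicted FALSE
tn <- 7      # actual Died,  predicted Died

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Mean per-class error averages the error rate of each actual class
mean_per_class_error <- (fp / (fp + tn) + fn / (fn + tp)) / 2

round(c(accuracy = accuracy, precision = precision, recall = recall,
        f1 = f1, mean_per_class_error = mean_per_class_error), 6)
```

These values reproduce the max accuracy (0.986338), max f1 (0.993118) and mean_per_class_error (0.4811828) reported above. Because the positive class here is the survivors, high precision and recall say little about how well deaths are detected: only 7 of the 186 deaths in the test set are predicted correctly.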

h2o.metric(performance_h2o)

## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
##   threshold       f1       f2 f0point5 accuracy precision   recall specificity
## 1  0.934895 0.000155 0.000097 0.000387 0.014273  1.000000 0.000077    1.000000
## 2  0.934795 0.154209 0.102360 0.312500 0.095787  0.989918 0.083617    0.940860
## 3  0.934768 0.154999 0.102917 0.313800 0.096245  0.989973 0.084082    0.940860
## 4  0.934686 0.155131 0.103010 0.314017 0.096321  0.989982 0.084159    0.940860
## 5  0.934585 0.342583 0.245893 0.564591 0.216990  0.994050 0.206953    0.913978
##   absolute_mcc min_per_class_accuracy mean_per_class_accuracy tns   fns fps
## 1     0.001048               0.000077                0.500039 186 12915   0
## 2     0.010481               0.083617                0.512239 175 11836  11
## 3     0.010653               0.084082                0.512471 175 11830  11
## 4     0.010682               0.084159                0.512510 175 11829  11
## 5     0.035422               0.206953                0.560466 170 10243  16
##    tps      tnr      fnr      fpr      tpr idx
## 1    1 1.000000 0.999923 0.000000 0.000077   0
## 2 1080 0.940860 0.916383 0.059140 0.083617   1
## 3 1086 0.940860 0.915918 0.059140 0.084082   2
## 4 1087 0.940860 0.915841 0.059140 0.084159   3
## 5 2673 0.913978 0.793047 0.086022 0.206953   4
## 
## ---
##    threshold       f1       f2 f0point5 accuracy precision   recall specificity
## 82  0.085909 0.993080 0.997221 0.988974 0.986262  0.986255 1.000000    0.032258
## 83  0.085638 0.993042 0.997205 0.988913 0.986185  0.986180 1.000000    0.026882
## 84  0.078472 0.993004 0.997190 0.988853 0.986109  0.986105 1.000000    0.021505
## 85  0.072323 0.992966 0.997174 0.988792 0.986033  0.986029 1.000000    0.016129
## 86  0.070837 0.992889 0.997144 0.988671 0.985880  0.985879 1.000000    0.005376
## 87  0.069369 0.992851 0.997128 0.988611 0.985804  0.985804 1.000000    0.000000
##    absolute_mcc min_per_class_accuracy mean_per_class_accuracy tns fns fps
## 82     0.178367               0.032258                0.516129   6   0 180
## 83     0.162820               0.026882                0.513441   5   0 181
## 84     0.145625               0.021505                0.510753   4   0 182
## 85     0.126110               0.016129                0.508065   3   0 183
## 86     0.072804               0.005376                0.502688   1   0 185
## 87     0.000000               0.000000                0.500000   0   0 186
##      tps      tnr      fnr      fpr      tpr idx
## 82 12916 0.032258 0.000000 0.967742 1.000000  81
## 83 12916 0.026882 0.000000 0.973118 1.000000  82
## 84 12916 0.021505 0.000000 0.978495 1.000000  83
## 85 12916 0.016129 0.000000 0.983871 1.000000  84
## 86 12916 0.005376 0.000000 0.994624 1.000000  85
## 87 12916 0.000000 0.000000 1.000000 1.000000  86

Conclusion

In conclusion, the H2O AutoML approach is a faster way to build a model. Comparing accuracy and AUC for the two approaches:

Supervised ML - Classification model: Accuracy: 0.985, AUC: 0.762

Automatic machine learning: Accuracy: 0.538462 (at threshold 0.706676), AUC: 0.760355

Conclusion: the previous model has a higher accuracy and AUC than this one, but this model took less time to build.

Apply 11 NEW

Tindra Bergstrand

2024-11-21