Code Along 12

I have a dataset called attrition_raw_tbl that looks like this.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

attrition_raw_tbl <- read_csv("00_data/WA_Fn-UseC_-HR-Employee-Attrition.csv")

## Rows: 1470 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
## dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

attrition_raw_tbl %>% slice() %>% glimpse()

## Rows: 0
## Columns: 35
## $ Age                      <dbl> 
## $ Attrition                <chr> 
## $ BusinessTravel           <chr> 
## $ DailyRate                <dbl> 
## $ Department               <chr> 
## $ DistanceFromHome         <dbl> 
## $ Education                <dbl> 
## $ EducationField           <chr> 
## $ EmployeeCount            <dbl> 
## $ EmployeeNumber           <dbl> 
## $ EnvironmentSatisfaction  <dbl> 
## $ Gender                   <chr> 
## $ HourlyRate               <dbl> 
## $ JobInvolvement           <dbl> 
## $ JobLevel                 <dbl> 
## $ JobRole                  <chr> 
## $ JobSatisfaction          <dbl> 
## $ MaritalStatus            <chr> 
## $ MonthlyIncome            <dbl> 
## $ MonthlyRate              <dbl> 
## $ NumCompaniesWorked       <dbl> 
## $ Over18                   <chr> 
## $ OverTime                 <chr> 
## $ PercentSalaryHike        <dbl> 
## $ PerformanceRating        <dbl> 
## $ RelationshipSatisfaction <dbl> 
## $ StandardHours            <dbl> 
## $ StockOptionLevel         <dbl> 
## $ TotalWorkingYears        <dbl> 
## $ TrainingTimesLastYear    <dbl> 
## $ WorkLifeBalance          <dbl> 
## $ YearsAtCompany           <dbl> 
## $ YearsInCurrentRole       <dbl> 
## $ YearsSinceLastPromotion  <dbl> 
## $ YearsWithCurrManager     <dbl>

The goal is to help predict attrition for employees.

Please write R code to create a predictive model that predicts the probability of attrition.

Prompts

Prompt 1: Can you use this code, but with tidymodels instead of caret and h2o instead of pROC?

Prompt 2: Error in step_dummy(): Caused by error in bake(): ! Only one factor level in col_name: Y.

Prompt 3: Update the code so that h2o is initialized at the point where the h2o model is trained, and preprocess the data first.

Prompt 4: Can you train the AutoML model for longer?
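
The error quoted in Prompt 2 is raised when a categorical column holds only one distinct value, which leaves nothing to dummy-encode. Below is a minimal base-R sketch for spotting such columns before building a recipe. The toy data frame, its values, and the helper name constant_columns are hypothetical stand-ins for attrition_raw_tbl; treating Over18 as the offending column is an assumption based on the "Y" in the message.

```r
# Toy stand-in for the real data (hypothetical values).
toy_tbl <- data.frame(
  Attrition = c("Yes", "No", "No", "Yes"),
  Over18    = c("Y", "Y", "Y", "Y"),   # assumed single-level column
  Age       = c(41, 49, 37, 33),
  Hours     = c(80, 80, 80, 80),       # constant numeric column
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of columns with at most one distinct
# non-missing value.
constant_columns <- function(df) {
  n_distinct <- vapply(
    df,
    function(col) length(unique(col[!is.na(col)])),
    integer(1)
  )
  names(n_distinct)[n_distinct <= 1]
}

constant_cols <- constant_columns(toy_tbl)
print(constant_cols)
```

Running the same helper on the real table before defining the recipe would show which columns to drop or inspect.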

# Load required libraries
library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──

## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.2.1     ✔ tune         1.2.1
## ✔ infer        1.0.7     ✔ workflows    1.1.4
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.2.1     ✔ yardstick    1.3.1
## ✔ recipes      1.1.0

## Warning: package 'broom' was built under R version 4.3.3

## Warning: package 'modeldata' was built under R version 4.3.3

## Warning: package 'recipes' was built under R version 4.3.3

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/

library(h2o)

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
## 
## ----------------------------------------------------------------------

## 
## Attaching package: 'h2o'

## The following objects are masked from 'package:lubridate':
## 
##     day, hour, month, week, year

## The following objects are masked from 'package:stats':
## 
##     cor, sd, var

## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc

# Step 1: Data Preparation
set.seed(123)
attrition_split <- initial_split(attrition_raw_tbl, prop = 0.8, strata = Attrition)
train_data <- training(attrition_split)
test_data <- testing(attrition_split)

# Step 2: Define Recipe
attrition_recipe <- recipe(Attrition ~ ., data = train_data) %>%
  step_rm(all_nominal_predictors()) %>%  # Note: drops ALL nominal predictors, not only single-level factors
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>%  # No-op here: no nominal predictors remain after step_rm()
  step_normalize(all_numeric_predictors())

# Prepare data for H2O
train_processed <- attrition_recipe %>% prep() %>% bake(new_data = NULL)

## Warning: !  The following columns have zero variance so scaling cannot be used:
##   EmployeeCount and StandardHours.
## ℹ Consider using ?step_zv (`?recipes::step_zv()`) to remove those columns
##   before normalizing.

test_processed <- attrition_recipe %>% prep() %>% bake(new_data = test_data)

## Warning: !  The following columns have zero variance so scaling cannot be used:
##   EmployeeCount and StandardHours.
## ℹ Consider using ?step_zv (`?recipes::step_zv()`) to remove those columns
##   before normalizing.
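
The warning above suggests step_zv(). Here is a minimal sketch, on a toy data frame with hypothetical column names and values, of placing step_zv() ahead of step_dummy() and step_normalize() so that constant columns are dropped rather than scaled. It shows one possible arrangement and is not the recipe that produced the results in this document.

```r
library(recipes)

# Toy data (hypothetical): one constant numeric column, one categorical column.
toy <- data.frame(
  outcome        = factor(c("No", "Yes", "No", "Yes", "No", "Yes")),
  age            = c(25, 40, 31, 52, 38, 29),
  standard_hours = rep(80, 6),
  dept           = factor(c("Sales", "Research", "Sales", "HR", "Research", "HR"))
)

toy_recipe <- recipe(outcome ~ ., data = toy) %>%
  step_zv(all_predictors()) %>%              # drop zero-variance columns first
  step_dummy(all_nominal_predictors()) %>%   # encode the remaining categoricals
  step_normalize(all_numeric_predictors())   # then center and scale

baked <- toy_recipe %>% prep() %>% bake(new_data = NULL)
print(names(baked))
```

With the constant column removed up front, step_normalize() has no zero-variance input left to warn about.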

# Step 3: Initialize H2O
h2o.init()

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /var/folders/x_/s4jnxcsx0fsd3qx1z_c351r80000gn/T//Rtmp6fy9Dt/file87d07c107a6d/h2o_jordanlanowy_started_from_r.out
##     /var/folders/x_/s4jnxcsx0fsd3qx1z_c351r80000gn/T//Rtmp6fy9Dt/file87d01dd5c9bd/h2o_jordanlanowy_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: .... Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         3 seconds 976 milliseconds 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.44.0.3 
##     H2O cluster version age:    11 months and 13 days 
##     H2O cluster name:           H2O_started_from_R_jordanlanowy_rxf816 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.77 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.3.2 (2023-10-31)

## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is (11 months and 13 days) old. There may be a newer version available.
## Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

# Convert preprocessed data to H2OFrame
train_h2o <- as.h2o(train_processed)

test_h2o <- as.h2o(test_processed)

# Step 4: Train AutoML Model
aml_model <- h2o.automl(
  x = setdiff(names(train_h2o), "Attrition"),  # Predictors
  y = "Attrition",                            # Target variable
  training_frame = train_h2o,
  max_runtime_secs = 90,                     # Cap the AutoML run at 90 seconds
  seed = 123,
  balance_classes = TRUE                      # Handle class imbalance
)

## 17:53:20.302: _train param, Dropping bad and constant columns: [StandardHours, EmployeeCount]
## 17:53:30.397: _train param, Dropping unused columns: [StandardHours, EmployeeCount]

# Step 5: Get Leader Model and Summarize Results
leader_model <- aml_model@leader
print(leader_model)

## Model Details:
## ==============
## 
## H2OBinomialModel: stackedensemble
## Model ID:  StackedEnsemble_BestOfFamily_4_AutoML_1_20241203_175320 
## Model Summary for Stacked Ensemble: 
##                                          key            value
## 1                          Stacking strategy cross_validation
## 2       Number of base models (used / total)              4/6
## 3           # GBM base models (used / total)              1/1
## 4       # XGBoost base models (used / total)              1/1
## 5           # GLM base models (used / total)              1/1
## 6           # DRF base models (used / total)              1/2
## 7  # DeepLearning base models (used / total)              0/1
## 8                      Metalearner algorithm              GLM
## 9         Metalearner fold assignment scheme           Random
## 10                        Metalearner nfolds                5
## 11                   Metalearner fold_column               NA
## 12        Custom metalearner hyperparameters             None
## 
## 
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
## 
## MSE:  0.0512914
## RMSE:  0.226476
## LogLoss:  0.2032248
## Mean Per-Class Error:  0.08309186
## AUC:  0.9846904
## AUCPR:  0.9414842
## Gini:  0.9693809
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error      Rate
## No     963  23 0.023327   =23/986
## Yes     27 162 0.142857   =27/189
## Totals 990 185 0.042553  =50/1175
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.361525   0.866310 136
## 2                       max f2  0.279409   0.878514 171
## 3                 max f0point5  0.473075   0.911493 106
## 4                 max accuracy  0.399196   0.958298 126
## 5                max precision  0.902397   1.000000   0
## 6                   max recall  0.150036   1.000000 244
## 7              max specificity  0.902397   1.000000   0
## 8             max absolute_mcc  0.361525   0.841078 136
## 9   max min_per_class_accuracy  0.268490   0.929006 176
## 10 max mean_per_class_accuracy  0.268490   0.930112 176
## 11                     max tns  0.902397 986.000000   0
## 12                     max fns  0.902397 188.000000   0
## 13                     max fps  0.005819 986.000000 399
## 14                     max tps  0.150036 189.000000 244
## 15                     max tnr  0.902397   1.000000   0
## 16                     max fnr  0.902397   0.994709   0
## 17                     max fpr  0.005819   1.000000 399
## 18                     max tpr  0.150036   1.000000 244
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.1143952
## RMSE:  0.3382236
## LogLoss:  0.3780821
## Mean Per-Class Error:  0.3051665
## AUC:  0.7521733
## AUCPR:  0.4402968
## Gini:  0.5043466
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error       Rate
## No     812 174 0.176471   =174/986
## Yes     82 107 0.433862    =82/189
## Totals 894 281 0.217872  =256/1175
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.216966   0.455319 162
## 2                       max f2  0.158981   0.567498 208
## 3                 max f0point5  0.377343   0.467980  80
## 4                 max accuracy  0.535384   0.856170  32
## 5                max precision  0.854803   1.000000   0
## 6                   max recall  0.012075   1.000000 396
## 7              max specificity  0.854803   1.000000   0
## 8             max absolute_mcc  0.237916   0.336953 147
## 9   max min_per_class_accuracy  0.158981   0.698413 208
## 10 max mean_per_class_accuracy  0.158981   0.709754 208
## 11                     max tns  0.854803 986.000000   0
## 12                     max fns  0.854803 188.000000   0
## 13                     max fps  0.008062 986.000000 399
## 14                     max tps  0.012075 189.000000 396
## 15                     max tnr  0.854803   1.000000   0
## 16                     max fnr  0.854803   0.994709   0
## 17                     max fpr  0.008062   1.000000 399
## 18                     max tpr  0.012075   1.000000 396
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                mean        sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy   0.824865  0.043497   0.806723   0.838565   0.795918   0.894737
## auc        0.754676  0.034352   0.813168   0.753417   0.744000   0.724341
## err        0.175135  0.043497   0.193277   0.161435   0.204082   0.105263
## err_count 41.400000 11.392981  46.000000  36.000000  50.000000  24.000000
## f0point5   0.485235  0.038597   0.472727   0.510204   0.476190   0.533981
##           cv_5_valid
## accuracy    0.788382
## auc         0.738454
## err         0.211618
## err_count  51.000000
## f0point5    0.433071
## 
## ---
##                         mean        sd cv_1_valid cv_2_valid cv_3_valid
## precision           0.478172  0.064230   0.440678   0.500000   0.456140
## r2                  0.146401  0.032966   0.169370   0.170207   0.162054
## recall              0.546243  0.094023   0.666667   0.555556   0.577778
## residual_deviance 177.698600 22.106504 173.156460 166.093600 204.036830
## rmse                0.337397  0.018619   0.337354   0.335160   0.354457
## specificity         0.875328  0.052675   0.834171   0.893048   0.845000
##                   cv_4_valid cv_5_valid
## precision           0.578947   0.415094
## r2                  0.092174   0.138197
## recall              0.407407   0.523810
## residual_deviance 149.648180 195.557900
## rmse                0.307855   0.352158
## specificity         0.960199   0.844221

# Step 6: Evaluate Performance on Test Data
performance <- h2o.performance(leader_model, newdata = test_h2o)

# Print AUC
auc <- h2o.auc(performance)
print(paste("AUC Score:", auc))

## [1] "AUC Score: 0.808367071524966"
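
AUC can be read as the probability that a randomly chosen "Yes" case receives a higher score than a randomly chosen "No" case. The base-R sketch below illustrates that reading on toy labels and scores using the rank-sum identity. The helper name auc_from_scores and the data are hypothetical; this is not how h2o.auc() computes its value internally.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
auc_from_scores <- function(is_positive, score) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  ranks <- rank(score)  # average ranks, so tied scores count as half
  (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 3 positives, 3 negatives (made-up scores).
is_positive <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
score       <- c(0.9, 0.3, 0.6, 0.5, 0.2, 0.4)

toy_auc <- auc_from_scores(is_positive, score)
print(toy_auc)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the toy AUC is 8/9.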

# Shut down H2O when done
h2o.shutdown(prompt = FALSE)