Code Along 12

Prompt 1:

I have a dataset called attrition_raw_tbl that looks like this.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.4     ✔ purrr   1.0.2
## ✔ tibble  3.2.1     ✔ dplyr   1.1.4
## ✔ tidyr   1.3.1     ✔ stringr 1.5.1
## ✔ readr   2.1.2     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

attrition_raw_tbl <- read_csv("../00_data/WA_Fn-UseC_-HR-Employee-Attrition.csv")

## Rows: 1470 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
## dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

attrition_raw_tbl %>% glimpse()

## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 29, 31, 34, 28, 29, 32, 22, 53, 38, 24, …
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No"…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel_Rarely", "Travel_Frequently", "Travel_…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358, 216, 1299, 809, 153, 670, 1346, 103, 1…
## $ Department               <chr> "Sales", "Research & Development", "Research & Development", "Research & Development…
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, 19, 24, 21, 5, 16, 2, 2, 11, 9, 7, 15, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, 4, 2, 2, 4, 3, 2, 4, 4, 2, 1, 3, 1, 4, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "Life Sciences", "Medical", "Life Science…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23, 24, 26, 27, 28…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, 2, 1, 4, 1, 4, 1, 3, 1, 3, 2, 3, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "Male", "Female", "Male", "Male", "Male"…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 49, 31, 93, 50, 51, 80, 96, 78, 45, 96, …
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4, 4, 4, 2, 3, 4, 2, 3, 3, 3, 3, 1, 3, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, 3, 1, 1, 4, 1, 2, 1, 3, 1, 1, 5, 1, 2, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Laboratory Technician", "Research Scientis…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, 1, 2, 4, 4, 4, 3, 1, 2, 4, 1, 3, 1, 2, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Married", "Single", "Married", "Divorced"…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2693, 9526, 5237, 2426, 4193, 2911, 2661, …
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964, 13335, 8787, 16577, 16479, 12682, 151…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, 1, 0, 1, 2, 5, 0, 7, 0, 1, 2, 4, 1, 0, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "No", "…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 12, 17, 11, 14, 11, 12, 13, 16, 11, 18, …
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 4, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, 3, 4, 2, 3, 3, 4, 2, 3, 4, 3, 4, 2, 4, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, …
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, 1, 2, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3, 6, 10, 7, 1, 31, 6, 5, 10, 13, 0, 8, …
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, 1, 5, 2, 3, 3, 5, 4, 4, 6, 2, 3, 5, 2, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4, 10, 6, 1, 25, 3, 4, 5, 12, 0, 4, 14, 1…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, 9, 2, 0, 8, 2, 2, 3, 6, 0, 2, 13, 2, 7,…
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, 8, 0, 0, 3, 1, 1, 0, 2, 0, 1, 4, 6, 4, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, 8, 5, 0, 7, 2, 3, 3, 11, 0, 3, 8, 7, 2,…

The goal is to help predict attrition for employees.

Please write R code to create a predictive model that predicts the probability of attrition.

Prompt 2:

Please update the code to use tidymodels instead of caret and to use the h2o model instead of glmnet.

# Load libraries
library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.6     ✔ workflows    1.1.3
## ✔ modeldata    1.3.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.3.0
## ✔ recipes      1.0.9

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.

library(h2o)

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
## 
## ----------------------------------------------------------------------

## 
## Attaching package: 'h2o'

## The following objects are masked from 'package:stats':
## 
##     cor, sd, var

## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc

# Initialize h2o
h2o.init(nthreads = -1)

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         8 days 16 hours 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.44.0.3 
##     H2O cluster version age:    4 months and 11 days 
##     H2O cluster name:           H2O_started_from_R_jasonzink_qxv383 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.12 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.2.1 (2022-06-23)

## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is (4 months and 11 days) old. There may be a newer version available.
## Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

# Convert data to h2o frame
attrition_h2o <- as.h2o(attrition_raw_tbl)

# Convert the response variable to a factor
attrition_h2o$Attrition <- as.factor(attrition_h2o$Attrition)
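
Only the response is converted to a factor here. The AutoML log further down reports dropping JobRole, MaritalStatus, BusinessTravel, Department, OverTime, Over18, Gender, and EducationField (the eight remaining character columns) together with the constant columns StandardHours and EmployeeCount. That pattern suggests, although the log does not say so explicitly, that the character predictors were not usable as categoricals. The following is a minimal sketch of one way to prepare the data in R before calling `as.h2o()`. It uses base R so that it runs without extra packages; `toy_tbl` and its values are hypothetical stand-ins for `attrition_raw_tbl`.

```r
# Hypothetical stand-in for attrition_raw_tbl: same column types, made-up rows
toy_tbl <- data.frame(
  Attrition     = c("Yes", "No", "No", "Yes"),
  OverTime      = c("Yes", "No", "Yes", "No"),
  Over18        = c("Y", "Y", "Y", "Y"),
  StandardHours = c(80, 80, 80, 80),
  Age           = c(41, 49, 37, 33),
  stringsAsFactors = FALSE
)

# Columns with a single distinct value carry no information for a model
is_constant <- vapply(toy_tbl, function(col) length(unique(col)) == 1, logical(1))
constant_cols <- names(toy_tbl)[is_constant]
prepared_tbl <- toy_tbl[, !is_constant, drop = FALSE]

# Convert the remaining character columns to factors.
# The base:: prefixes sidestep the versions that library(h2o) masks.
is_chr <- vapply(prepared_tbl, base::is.character, logical(1))
prepared_tbl[is_chr] <- lapply(prepared_tbl[is_chr], base::factor)

constant_cols
str(prepared_tbl)
```

In the real workflow the same steps would be applied to `attrition_raw_tbl`, and the prepared table would be passed to `as.h2o()` in place of the raw one. Whether this changes the leaderboard is not something the output in this document can confirm.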

# Split data into training and testing sets
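# set.seed() only seeds R; h2o.splitFrame() takes its own `seed` argument,
# so this split is not reproducible as written.
set.seed(123)
# Two ratios yield three frames (~50% / 40% / 10%); the last 10% is unused below.
split <- h2o.splitFrame(data = attrition_h2o, ratios = c(0.5, 0.4))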
train <- split[[1]]
test <- split[[2]]
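
As a rough check on what `ratios = c(0.5, 0.4)` implies, the sketch below works out the expected row counts for the 1,470-row dataset. It is plain arithmetic, not an H2O call, and the labels `train`, `test`, and `unused` are names chosen here for illustration. H2O's split is approximate, so actual counts differ somewhat: the model summary later in this document reports 716 training rows (607 + 109) rather than 735.

```r
n_rows <- 1470
ratios <- c(0.5, 0.4)

# h2o.splitFrame() returns one more frame than there are ratios;
# the final frame receives whatever share is left over
all_shares <- c(ratios, 1 - sum(ratios))

expected_rows <- round(n_rows * all_shares)
names(expected_rows) <- c("train", "test", "unused")
expected_rows
```

A single ratio such as `ratios = 0.8` would instead give two frames and leave no rows unused.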

# Define predictors and response
predictors <- setdiff(names(train), "Attrition")
response <- "Attrition"

# Define AutoML model
aml <- h2o.automl(
  x = predictors,
  y = response,
  training_frame = train,
  leaderboard_frame = test,
  max_runtime_secs = 30  # Set the maximum runtime for AutoML in seconds
)

## 12:54:07.534: _train param, Dropping bad and constant columns: [JobRole, MaritalStatus, StandardHours, BusinessTravel, Department, OverTime, Over18, EmployeeCount, Gender, EducationField]
## 12:54:21.118: _train param, Dropping bad and constant columns: [JobRole, MaritalStatus, StandardHours, BusinessTravel, Department, OverTime, Over18, EmployeeCount, Gender, EducationField]

# Get the best model from AutoML
best_model <- aml@leader

# Print best model summary
print(best_model)

## Model Details:
## ==============
## 
## H2OBinomialModel: xgboost
## Model ID:  XGBoost_1_AutoML_9_20240502_125405 
## Model Summary: 
##   number_of_trees
## 1               1
## 
## 
## H2OBinomialMetrics: xgboost
## ** Reported on training data. **
## 
## MSE:  0.1885444
## RMSE:  0.434217
## LogLoss:  0.5681876
## Mean Per-Class Error:  0.3124707
## AUC:  0.7220954
## AUCPR:  0.2745602
## Gini:  0.4441909
## R^2:  -0.4609136
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error      Rate
## No     417 190 0.313015  =190/607
## Yes     34  75 0.311927   =34/109
## Totals 451 265 0.312849  =224/716
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.430804   0.401070   2
## 2                       max f2  0.407958   0.553435   3
## 3                 max f0point5  0.438248   0.332497   1
## 4                 max accuracy  0.485298   0.801676   0
## 5                max precision  0.438248   0.308140   1
## 6                   max recall  0.373558   1.000000   5
## 7              max specificity  0.485298   0.906096   0
## 8             max absolute_mcc  0.430804   0.279059   2
## 9   max min_per_class_accuracy  0.430804   0.686985   2
## 10 max mean_per_class_accuracy  0.430804   0.687529   2
## 11                     max tns  0.485298 550.000000   0
## 12                     max fns  0.485298  85.000000   0
## 13                     max fps  0.373558 607.000000   5
## 14                     max tps  0.373558 109.000000   5
## 15                     max tnr  0.485298   0.906096   0
## 16                     max fnr  0.485298   0.779817   0
## 17                     max fpr  0.373558   1.000000   5
## 18                     max tpr  0.373558   1.000000   5
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: xgboost
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.190732
## RMSE:  0.4367288
## LogLoss:  0.5726233
## Mean Per-Class Error:  0.4131917
## AUC:  0.5954159
## AUCPR:  0.1924633
## Gini:  0.1908317
## R^2:  -0.4778641
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error      Rate
## No     317 290 0.477759  =290/607
## Yes     38  71 0.348624   =38/109
## Totals 355 361 0.458101  =328/716
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.399392   0.302128  16
## 2                       max f2  0.370184   0.475952  21
## 3                 max f0point5  0.455750   0.251509   3
## 4                 max accuracy  0.493750   0.825419   0
## 5                max precision  0.455750   0.257732   3
## 6                   max recall  0.363651   1.000000  24
## 7              max specificity  0.493750   0.968699   0
## 8             max absolute_mcc  0.399392   0.124747  16
## 9   max min_per_class_accuracy  0.400389   0.545305  15
## 10 max mean_per_class_accuracy  0.399392   0.586808  16
## 11                     max tns  0.493750 588.000000   0
## 12                     max fns  0.493750 106.000000   0
## 13                     max fps  0.363651 607.000000  24
## 14                     max tps  0.363651 109.000000  24
## 15                     max tnr  0.493750   0.968699   0
## 16                     max fnr  0.493750   0.972477   0
## 17                     max fpr  0.363651   1.000000  24
## 18                     max tpr  0.363651   1.000000  24
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                              mean        sd cv_1_valid cv_2_valid cv_3_valid
## accuracy                 0.520824  0.223588   0.611111   0.706294   0.132867
## auc                      0.591425  0.075637   0.664632   0.665549   0.504669
## err                      0.479176  0.223588   0.388889   0.293706   0.867133
## err_count               68.600000 31.934307  56.000000  42.000000 124.000000
## f0point5                 0.241035  0.070263   0.250896   0.351759   0.160745
## f1                       0.307289  0.062775   0.333333   0.400000   0.234568
## f2                       0.438276  0.042580   0.496454   0.463576   0.433790
## lift_top_group           1.529521  0.670138   2.526316   1.713508   1.026316
## logloss                  0.572659  0.014376   0.547277   0.580259   0.577150
## max_per_class_error      0.559715  0.247543   0.408000   0.481482   1.000000
## mcc                      0.159854  0.077906   0.223640   0.229166         NA
## mean_per_class_accuracy  0.585356  0.065287   0.664421   0.634259   0.500000
## mean_per_class_error     0.414644  0.065287   0.335579   0.365741   0.500000
## mse                      0.190749  0.006972   0.178442   0.194521   0.192954
## pr_auc                   0.202452  0.060710   0.225318   0.289975   0.137382
## precision                0.211690  0.070830   0.215385   0.325581   0.132867
## r2                      -0.496398  0.147612  -0.557965  -0.270038  -0.674755
## recall                   0.669254  0.204533   0.736842   0.518518   1.000000
## rmse                     0.436689  0.008076   0.422424   0.441045   0.439266
## specificity              0.501458  0.289598   0.592000   0.750000   0.000000
##                         cv_4_valid cv_5_valid
## accuracy                  0.594406   0.559441
## auc                       0.597671   0.524606
## err                       0.405594   0.440559
## err_count                58.000000  63.000000
## f0point5                  0.229008   0.212766
## f1                        0.292683   0.275862
## f2                        0.405405   0.392157
## lift_top_group            1.568966   0.812500
## logloss                   0.582153   0.576454
## max_per_class_error       0.454545   0.454545
## mcc                       0.108762   0.077850
## mean_per_class_accuracy   0.574380   0.553719
## mean_per_class_error      0.425620   0.446281
## mse                       0.195294   0.192535
## pr_auc                    0.205275   0.154311
## precision                 0.200000   0.184615
## r2                       -0.500213  -0.479021
## recall                    0.545455   0.545455
## rmse                      0.441921   0.438788
## specificity               0.603306   0.561983
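
Several figures in the summary above can be re-derived by hand, which helps when reading it. The sketch below recomputes precision, recall, and F1 from the training confusion matrix, then the mean and standard deviation of the five per-fold AUC values. All numbers are copied from the printed output; nothing here calls H2O.

```r
# Training confusion matrix at the F1-optimal threshold (values from the summary)
tn <- 417; fp <- 190
fn <- 34;  tp <- 75

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Per-fold AUC values from the cross-validation summary
fold_auc <- c(0.664632, 0.665549, 0.504669, 0.597671, 0.524606)

# stats::sd is called explicitly because library(h2o) masks sd()
round(c(precision = precision, recall = recall, f1 = f1), 4)
round(c(mean_auc = mean(fold_auc), sd_auc = stats::sd(fold_auc)), 4)
```

The recomputed F1 agrees with the reported `max f1` of 0.401070, and the fold mean and standard deviation agree with the `auc` row (0.591425 and 0.075637). The cross-validated AUC of roughly 0.59 is well below the training AUC of 0.72, which is the more honest indicator of how this single-tree model is likely to perform on new data.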

# Make predictions on test data
predictions <- h2o.predict(best_model, newdata = test)
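
Because the stated goal is the probability of attrition, the class probabilities matter more than the predicted label. For a binomial model with factor levels `No` and `Yes`, the frame returned by `h2o.predict()` is expected to hold a `predict` column plus one probability column per class. The sketch below assumes that layout and uses a small hypothetical data frame (`predictions_df`, with made-up values) in place of `as.data.frame(predictions)`, so it runs without an H2O cluster. The 0.42 cutoff is likewise an arbitrary illustration.

```r
# Hypothetical stand-in for as.data.frame(predictions); values are made up
predictions_df <- data.frame(
  predict = c("No", "Yes", "No", "Yes"),
  No      = c(0.62, 0.55, 0.60, 0.52),
  Yes     = c(0.38, 0.45, 0.40, 0.48)
)

# Probability of attrition for each employee
attrition_prob <- predictions_df$Yes

# Apply a custom cutoff instead of relying on the label H2O assigned
custom_threshold <- 0.42  # illustrative value, not taken from the model
flag_at_risk <- attrition_prob >= custom_threshold

data.frame(attrition_prob, flag_at_risk)
```

Note that the label in `predict` does not have to correspond to a 0.5 cutoff: the model summary above lists F1-optimal thresholds near 0.40 to 0.43, and H2O appears to use a threshold of that kind when assigning labels.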

# Calculate accuracy
accuracy <- mean(as.vector(predictions$predict == test$Attrition))
accuracy

## [1] 0.6850649
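
Accuracy on its own is hard to interpret when the classes are imbalanced. The sketch below compares the reported figure with a majority-class baseline, that is, the accuracy of always predicting "No". The class balance of the test frame is not shown in this document, so the training counts from the confusion matrix (607 "No", 109 "Yes") are used as a proxy; that substitution is an assumption.

```r
# Class counts from the training confusion matrix (used as a proxy for the test set)
train_no  <- 607
train_yes <- 109

# Accuracy of a model that always predicts the majority class "No"
baseline_accuracy <- train_no / (train_no + train_yes)

# Test accuracy reported above
reported_accuracy <- 0.6850649

round(c(baseline = baseline_accuracy,
        reported = reported_accuracy,
        gap      = reported_accuracy - baseline_accuracy), 4)
```

Under that assumption the baseline is about 0.85, so the model's 0.69 accuracy is lower than what a constant "No" prediction would achieve. Threshold-independent measures such as AUC, or metrics focused on the "Yes" class, are better suited to judging this model.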

Jason Zink

2024-05-02
