Class 3 HW - Elastic Net Logistic Regression for Employee Attrition

In this assignment, I use an elastic net logistic regression model to predict employee attrition using the IBM HR dataset. I begin by loading and preparing the data, then split the dataset into training and test sets so that model performance can be evaluated on unseen employees. Next, I create a preprocessing recipe, address class imbalance using SMOTE, and tune the model’s hyperparameters using cross-validation. Finally, I select the best-performing model, evaluate its performance on the held-out test set, and examine the characteristics of the final model.

Stage 1a: Load Packages

The following packages support data import, reproducible preprocessing, stratified resampling, hyperparameter tuning, and classification diagnostics within a single tidymodels workflow.

# Clearing the environment so the analysis starts from a clean workspace
rm(list = ls())

# Loading conflicted to manage functions that share the same name across packages
library(conflicted)

# Installing pacman if needed, then loading it
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman", repos = "http://cran.r-project.org")
}

library(pacman)

# Loading packages needed for data import, cleaning, modeling, resampling, and evaluation
pacman::p_load(
  readxl, 
  dplyr, 
  tidyr, 
  ggplot2, 
  knitr, 
  tidymodels,
  janitor, 
  scales, 
  themis, 
  tidylog
)

# Giving tidylog priority when its functions overlap with functions from other packages
for (f in getNamespaceExports("tidylog")) {
  conflicted::conflict_prefer(f, "tidylog", quiet = TRUE)
}

# Giving dplyr priority for common data-wrangling functions
conflicted::conflict_prefer("filter", "dplyr", quiet = TRUE)
conflicted::conflict_prefer("select", "dplyr", quiet = TRUE)

Stage 1b: Prepare the HR Data

The IBM HR dataset contains employee demographic, job, satisfaction, and compensation variables along with an attrition outcome. Before building the model, I need to prepare the data so that categorical variables are coded correctly, the attrition outcome is defined consistently, and variable names are easier to work with throughout the analysis.

# Updating the file path if the Excel file is stored in a different location
data_path <- "~/Desktop/1 - Coursework/Graduate Coursework/UGA/4 - Summer 2026/Advanced Analytics (PSYC 6841)/Data/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"

# Reading the IBM HR attrition dataset into R
df <- read_excel(data_path)

# Converting all character variables into factors so categorical predictors are recognized correctly
df <- df %>%
  mutate(across(where(is.character), as.factor))

# Setting attrition = "Yes" as the event of interest so predictions focus on employee departures
df <- df %>%
  mutate(Attrition = relevel(factor(Attrition), ref = "Yes"))

# Creating a unique row identifier that can be used to link predictions back to employees later
df <- df %>%
  mutate(id = row_number()) %>%
  select(id, everything())

# Standardizing variable names into snake_case for easier coding and interpretation
df <- df %>%
  clean_names()

# Reviewing the structure of the cleaned dataset before moving into model development
glimpse(df)

## Rows: 1,470
## Columns: 36
## $ id                         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ age                        <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35,…
## $ attrition                  <fct> Yes, No, Yes, No, No, No, No, No, No, No, N…
## $ business_travel            <fct> Travel_Rarely, Travel_Frequently, Travel_Ra…
## $ daily_rate                 <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 135…
## $ department                 <fct> Sales, Research & Development, Research & D…
## $ distance_from_home         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26…
## $ education                  <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3…
## $ education_field            <fct> Life Sciences, Life Sciences, Other, Life S…
## $ employee_count             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ employee_number            <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 1…
## $ environment_satisfaction   <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3…
## $ gender                     <fct> Female, Male, Male, Female, Male, Male, Fem…
## $ hourly_rate                <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84,…
## $ job_involvement            <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2…
## $ job_level                  <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1…
## $ job_role                   <fct> Sales Executive, Research Scientist, Labora…
## $ job_satisfaction           <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3…
## $ marital_status             <fct> Single, Married, Single, Married, Married, …
## $ monthly_income             <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2…
## $ monthly_rate               <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 99…
## $ num_companies_worked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5…
## $ over18                     <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
## $ over_time                  <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No,…
## $ percent_salary_hike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13,…
## $ performance_rating         <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3…
## $ relationship_satisfaction  <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2…
## $ standard_hours             <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80,…
## $ stock_option_level         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0…
## $ total_working_years        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5,…
## $ training_times_last_year   <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4…
## $ work_life_balance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3…
## $ years_at_company           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, …
## $ years_in_current_role      <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2…
## $ years_since_last_promotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0…
## $ years_with_curr_manager    <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3…

Stage 2: Split the Data into Training and Test Sets

Before building the model, I split the data set into training and test sets. The training set will be used to build and tune the model, while the test set will be reserved for evaluating how well the final model performs on employees it has not seen before. I also use stratification so that the attrition rate remains similar across both data sets.

# Setting a seed so the train-test split can be reproduced
set.seed(42)

# Creating a 75/25 split while preserving the attrition rate in both datasets
data_split <- initial_split(df, prop = 0.75, strata = attrition)

# Separating the training and test datasets from the split object
train_data <- training(data_split)
test_data  <- testing(data_split)

After creating the split, I want to verify that the attrition rate remains consistent across the full data set, training data set, and test data set. If the rates differ substantially, model performance could be influenced by sampling differences rather than actual predictive ability.

# Calculating the attrition count and percentage for the full dataset
tabyl(df$attrition) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting()

df$attrition	n	percent
Yes	237	16.1%
No	1233	83.9%
Total	1470	100.0%

Approximately 16% of employees in the full dataset experienced attrition.

# Calculating the attrition count and percentage for the training dataset
tabyl(train_data$attrition) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting()

train_data$attrition	n	percent
Yes	177	16.1%
No	924	83.9%
Total	1101	100.0%

# Calculating the attrition count and percentage for the test dataset
tabyl(test_data$attrition) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting()

test_data$attrition	n	percent
Yes	60	16.3%
No	309	83.7%
Total	369	100.0%

The attrition rates in the training and test datasets are very similar to the rate observed in the full dataset. This suggests that stratification worked as intended and that both datasets contain a comparable proportion of employees who stayed and employees who left.

Stage 3: Create Cross-Validation Folds

After splitting the data, I create five cross-validation folds from the training dataset. Cross-validation allows the model to be trained and evaluated multiple times on different parts of the training data before it is tested on the held-out test set. I use stratification again so that each fold keeps a similar attrition rate.

# Setting a seed so the cross-validation folds can be reproduced
set.seed(42)

# Creating 5 stratified folds from the training data only
# The test set is not used here because it must remain untouched until final evaluation
cv_folds <- vfold_cv(
  train_data,
  v      = 5,
  strata = attrition
)

Stage 4: Identify Skewed Predictors

Before creating the preprocessing recipe, I want to identify any predictors with highly skewed distributions. Variables with substantial skewness can contain extreme values that may influence the model’s ability to estimate stable relationships. To avoid introducing information from the test set, I calculate skewness using only the training data.

# Identifying numeric variables that will later be converted into factors
vars_to_factorize <- c("job_level", "stock_option_level")

# Identifying ordinal variables that should remain in their original form
ordinal_likert_vars <- c(
  "education",
  "environment_satisfaction",
  "job_involvement",
  "job_satisfaction",
  "performance_rating",
  "relationship_satisfaction",
  "work_life_balance"
)

# Excluding outcome, identifier, factor, and ordinal variables from the skewness assessment
exclude_from_skew <- c(
  "attrition",
  "employee_number",
  "id",
  vars_to_factorize,
  ordinal_likert_vars
)

# Calculating skewness for the remaining numeric predictors using only the training dataset
skewness_df <- train_data %>%
  select(where(is.numeric)) %>%
  select(-any_of(exclude_from_skew)) %>%
  summarise(
    across(
      .cols = everything(),
      .fns  = ~ {
        x <- .x[!is.na(.x)]
        n <- length(x)
        m <- mean(x)
        s <- sd(x)
        if (s == 0 || n < 3) NA_real_ else (sum((x - m)^3) / n) / s^3
      }
    )
  ) %>%
  pivot_longer(
    cols      = everything(),
    names_to  = "variable",
    values_to = "skewness"
  ) %>%
  arrange(desc(abs(skewness)))

# Identifying predictors with substantial skewness that may benefit from a Yeo-Johnson transformation
skewed_feature_names <- skewness_df %>%
  filter(!is.na(skewness), abs(skewness) > 1) %>%
  pull(variable)

# Printing the variables that will be transformed later in the preprocessing recipe
cat("\nVariables flagged for Yeo-Johnson (|skewness| > 1):\n")

## 
## Variables flagged for Yeo-Johnson (|skewness| > 1):

print(skewed_feature_names)

## [1] "years_since_last_promotion" "years_at_company"          
## [3] "monthly_income"             "total_working_years"       
## [5] "num_companies_worked"       "distance_from_home"

The variables identified above exceed the selected skewness threshold and will be transformed later in the preprocessing recipe. Reducing skewness can help stabilize coefficient estimation and improve the performance of regularized regression models.

Stage 5a: Create the Preprocessing Recipe

Before training the model, I need to prepare the predictors so they can be used appropriately by the elastic net logistic regression model. The preprocessing recipe below removes uninformative variables, prepares categorical predictors, transforms highly skewed variables, and standardizes predictors onto a common scale. Defining these steps in a recipe ensures that the same preprocessing is applied consistently during cross-validation, testing, and final model fitting.

# Creating a preprocessing recipe that uses all available predictors to estimate attrition
reg_recipe <- recipe(attrition ~ ., data = train_data) %>%

  # Identifying employee number as an ID rather than a predictive feature
  update_role(employee_number, new_role = "ID") %>%

  # Removing predictors with no variation
  step_zv(all_predictors()) %>%

  # Removing predictors with very little variation
  step_nzv(all_predictors()) %>%

  # Converting selected numeric variables into categorical factors
  step_mutate(
    job_level          = factor(job_level,          levels = 1:5),
    stock_option_level = factor(stock_option_level, levels = 0:3)
  ) %>%

  # Creating a placeholder level for categories that may appear in future data
  step_novel(all_nominal_predictors()) %>%

  # Grouping infrequent categories into a single "other_rare" category
  step_other(
    all_nominal_predictors(),
    threshold = 0.05,
    other     = "other_rare"
  ) %>%

  # Applying a Yeo-Johnson transformation to highly skewed predictors
  step_YeoJohnson(all_of(skewed_feature_names)) %>%

  # Converting categorical predictors into dummy-coded variables
  step_dummy(all_nominal_predictors(), one_hot = FALSE) %>%

  # Removing any zero-variance predictors created during dummy coding
  step_zv(all_predictors()) %>%

  # Standardizing all numeric predictors prior to elastic net regularization
  step_normalize(all_numeric_predictors())

Stage 5b: Handle Class Imbalance with SMOTE

Employee attrition is relatively uncommon in this dataset, which means a model could achieve high accuracy by predicting that most employees will stay. To help the model learn patterns associated with attrition, I use SMOTE (Synthetic Minority Oversampling Technique) to create additional synthetic attrition cases during model training. This balancing procedure is applied only within the training resamples so that the test set continues to reflect the natural attrition rate.

# Adding SMOTE to the preprocessing recipe to address class imbalance
reg_recipe_smote <- reg_recipe %>%

  # Generating synthetic attrition cases within the training data
  step_smote(
    attrition,
    over_ratio = 0.5,
    seed       = 42
  )

At this point, the preprocessing recipe is complete. The recipe now handles variable preparation, skewness correction, dummy coding, normalization, and class balancing. These steps will be applied automatically whenever the model is trained, tuned, or evaluated in later stages of the analysis.

Stage 6a: Specify the Elastic Net Model

Now that the data are prepared, I can define the elastic net logistic regression model. Elastic net extends traditional logistic regression by adding regularization, which helps reduce overfitting and improve model stability when predictors are correlated. Because the amount of regularization and the balance between Ridge and LASSO penalties can affect performance, both parameters will be tuned later using cross-validation.

# Defining an elastic net logistic regression model with tunable hyperparameters
elastic_net_spec <- logistic_reg(

  # Tuning the overall amount of regularization (lambda)
  penalty = tune(),

  # Tuning the balance between Ridge (0) and LASSO (1) regularization
  mixture = tune()
) %>%

  # Using glmnet as the modeling engine
  set_engine("glmnet") %>%

  # Specifying a binary classification model
  set_mode("classification")

Stage 6b: Build the Modeling Workflow

Before tuning the model, I combine the preprocessing recipe and elastic net specification into a single workflow. Doing so ensures that every preprocessing step and modeling step is applied together throughout cross-validation, model selection, and final evaluation.

# Combining the preprocessing recipe and elastic net model into a single workflow
elastic_net_wf <- workflow() %>%

  # Adding the preprocessing recipe
  add_recipe(reg_recipe_smote) %>%

  # Adding the elastic net model specification
  add_model(elastic_net_spec)

With stages 6a and 6b complete, the modeling workflow has been fully defined. The remaining steps focus on determining which combination of hyperparameters produces the strongest predictive performance.

Stage 7: Define Model Evaluation Metrics

Before tuning the model, I need to decide how model performance will be evaluated. Because employee attrition is an imbalanced outcome, relying on accuracy alone could be misleading. Instead, I use a collection of metrics that evaluate how well the model identifies employees who leave, distinguishes leavers from stayers, and balances different types of classification errors.

# Giving yardstick versions of accuracy and precision priority when conflicts exist
conflicts_prefer(yardstick::accuracy)
conflicts_prefer(yardstick::precision)

# Creating a metric set that will be used during model tuning and evaluation
class_metric <- metric_set(
  accuracy,    # Measuring overall classification accuracy
  f_meas,      # Measuring the balance between precision and recall
  j_index,     # Measuring the balance between sensitivity and specificity
  kap,         # Measuring agreement beyond chance
  precision,   # Measuring the proportion of flagged employees who actually left
  sensitivity, # Measuring the proportion of actual leavers correctly identified
  specificity, # Measuring the proportion of stayers correctly identified
  roc_auc,     # Measuring overall class separation across thresholds
  mcc,         # Measuring classification quality using all confusion matrix cells
  pr_auc       # Measuring precision-recall performance under class imbalance
)

Although several metrics will be reported, I use PR AUC (Precision-Recall Area Under the Curve) as the primary metric for model selection. PR AUC is particularly useful when the outcome is imbalanced because it focuses on how effectively the model identifies attrition cases. ROC AUC is also reported as a complementary measure of the model’s ability to separate employees who leave from those who stay.

Stage 8: Tune the Elastic Net Hyperparameters

The next step is to determine which combination of penalty and mixture values produces the strongest model performance. Rather than selecting these values manually, I evaluate many possible combinations using cross-validation. The tuning grid includes Ridge-like models, LASSO-like models, and intermediate elastic net solutions across a range of regularization strengths.

# Creating a tuning grid that evaluates different combinations of penalty and mixture values
en_grid <- grid_regular(

  # Testing a range of regularization strengths on a log scale
  penalty(range = c(-4, 2)),

  # Testing Ridge, LASSO, and intermediate elastic net solutions
  mixture(range = c(0, 1)),

  # Defining the number of values evaluated for each hyperparameter
  levels = c(penalty = 20, mixture = 5)
)

# Reviewing the first few combinations in the tuning grid
head(en_grid, 10)

penalty	mixture
0.0001000	0
0.0002069	0
0.0004281	0
0.0008859	0
0.0018330	0
0.0037927	0
0.0078476	0
0.0162378	0
0.0335982	0
0.0695193	0

# Displaying the total number of hyperparameter combinations being evaluated
cat("\nTuning grid size:", nrow(en_grid), "combinations\n")

## 
## Tuning grid size: 100 combinations

# Setting a seed so tuning results can be reproduced
set.seed(42)

# Evaluating each penalty and mixture combination using cross-validation
elastic_net_tune_res <- tune_grid(
  elastic_net_wf,
  resamples = cv_folds,
  grid      = en_grid,
  metrics   = class_metric
)

# Displaying the top-performing models based on PR AUC
show_best(elastic_net_tune_res, metric = "pr_auc", n = 5)

penalty	mixture	.metric	.estimator	mean	n	std_err	.config
0.0162378	0.50	pr_auc	binary	0.6390982	5	0.0318394	pre0_mod038_post0
0.0695193	0.00	pr_auc	binary	0.6382526	5	0.0304547	pre0_mod046_post0
0.0335982	0.00	pr_auc	binary	0.6380157	5	0.0274793	pre0_mod041_post0
0.0162378	0.25	pr_auc	binary	0.6357271	5	0.0264245	pre0_mod037_post0
0.0078476	1.00	pr_auc	binary	0.6344649	5	0.0295331	pre0_mod035_post0

# Visualizing how PR AUC changes across penalty and mixture combinations
autoplot(elastic_net_tune_res, metric = "pr_auc") +
  labs(title = "Elastic Net Tuning: PR AUC by Penalty and Mixture")

# Visualizing how ROC AUC changes across penalty and mixture combinations
autoplot(elastic_net_tune_res, metric = "roc_auc") +
  labs(title = "Elastic Net Tuning: ROC AUC by Penalty and Mixture")

# Summarizing the top-performing hyperparameter combinations for PR AUC and ROC AUC
collect_metrics(elastic_net_tune_res) %>%
  filter(.metric %in% c("pr_auc", "roc_auc")) %>%
  group_by(.metric) %>%
  slice_max(mean, n = 3, with_ties = FALSE) %>%
  ungroup() %>%
  select(.metric, penalty, mixture, mean, std_err) %>%
  knitr::kable(
    digits = 3,
    caption = "Top 3 cross-validated penalty/mixture combinations by AUC metric"
  )

Top 3 cross-validated penalty/mixture combinations by AUC metric
.metric	penalty	mixture	mean	std_err
pr_auc	0.016	0.5	0.639	0.032
pr_auc	0.070	0.0	0.638	0.030
pr_auc	0.034	0.0	0.638	0.027
roc_auc	0.034	0.0	0.832	0.023
roc_auc	0.016	0.0	0.832	0.022
roc_auc	0.000	0.0	0.832	0.022

The tuning results above show which combinations of regularization strength and penalty type produced the strongest cross-validated performance. In the next section, I select the combination with the highest PR AUC and use it to define the final elastic net model.

Stage 9: Select the Best Hyperparameters

Now that all hyperparameter combinations have been evaluated, I select the combination that produced the highest average PR AUC across the five cross-validation folds. Because attrition is an imbalanced outcome, this approach prioritizes identifying employees who leave rather than relying solely on overall accuracy.

# Selecting the penalty and mixture combination with the highest PR AUC
best_auc <- select_best(elastic_net_tune_res, metric = "pr_auc")

# Reviewing the selected hyperparameter values
best_auc

penalty	mixture	.config
0.0162378	0.5	pre0_mod038_post0

# Extracting the optimal penalty value
best_penalty <- best_auc$penalty

# Extracting the optimal mixture value
best_mixture <- best_auc$mixture

# Printing the selected hyperparameters
cat(
  "\nBest penalty:", format(best_penalty, digits = 4, scientific = TRUE),
  "\nBest mixture:", round(best_mixture, 3), "\n"
)

## 
## Best penalty: 1.624e-02 
## Best mixture: 0.5

# Interpreting the selected mixture value based on its location along the Ridge-LASSO continuum
if (best_mixture == 0) {

  # Ridge regression solution
  mixture_note <- "The selected solution corresponds to Ridge regression (mixture = 0), which shrinks coefficients without eliminating predictors entirely."

} else if (best_mixture == 1) {

  # LASSO regression solution
  mixture_note <- "The selected solution corresponds to LASSO regression (mixture = 1), which can reduce the active predictor set by zeroing coefficients."

} else if (best_mixture < 0.25) {

  # Elastic net solution that behaves mostly like Ridge
  mixture_note <- paste0(
    "The selected solution is elastic net but behaves more like Ridge (mixture = ",
    round(best_mixture, 3), "), suggesting that coefficient shrinkage mattered more than aggressive variable elimination."
  )

} else if (best_mixture > 0.75) {

  # Elastic net solution that behaves mostly like LASSO
  mixture_note <- paste0(
    "The selected solution is elastic net but behaves more like LASSO (mixture = ",
    round(best_mixture, 3), "), suggesting that variable selection contributed meaningfully to the final model."
  )

} else {

  # Balanced elastic net solution
  mixture_note <- paste0(
    "The selected solution reflects a balanced elastic net blend of Ridge and LASSO (mixture = ",
    round(best_mixture, 3), "), combining shrinkage with partial variable selection."
  )
}

# Printing the interpretation of the selected mixture value
mixture_note

## [1] "The selected solution reflects a balanced elastic net blend of Ridge and LASSO (mixture = 0.5), combining shrinkage with partial variable selection."

Elastic net performs regularization as part of the modeling process, so I did not apply a separate feature-selection procedure such as Boruta. Instead, the tuning process allowed the model to determine how much coefficient shrinkage and variable selection were needed to maximize performance. The selected mixture value indicates whether the final solution behaved more like Ridge regression, LASSO regression, or a combination of both.

The tuning process evaluated Ridge, LASSO, and intermediate elastic net solutions across a range of penalty values. Based on PR AUC, the selected model used a mixture value of 0.5, indicating that the final model combined coefficient shrinkage and variable selection. This suggests that the strongest-performing model was obtained by balancing model complexity and predictive performance through regularization.

Stage 10: Finalize and Re-Evaluate the Model

After selecting the best hyperparameters, I finalize the workflow by replacing the tunable placeholders with the chosen penalty and mixture values. I then re-run cross-validation using the finalized model to estimate how consistently it performs across different subsets of the training data before moving to the held-out test set.

# Creating the finalized workflow using the selected hyperparameters
elastic_net_final_wflow <- finalize_workflow(elastic_net_wf, best_auc)

The next step evaluates the finalized model across the cross-validation folds. This provides a final estimate of training-set performance and allows me to verify that performance remains stable before testing the model on unseen employees.

# Setting a seed so the resampling results can be reproduced
set.seed(42)

# Evaluating the finalized model across the cross-validation folds
elastic_net_cv_results <- fit_resamples(
  elastic_net_final_wflow,
  resamples = cv_folds,
  metrics   = class_metric
)

# Summarizing cross-validated model performance across all folds
elastic_net_cv_results %>%
  collect_metrics() %>%
  select(.metric, mean, std_err) %>%
  arrange(.metric) %>%
  knitr::kable(
    digits = 3,
    caption = "Finalized Elastic Net — 5-fold CV performance"
  )

Finalized Elastic Net — 5-fold CV performance
.metric	mean	std_err
accuracy	0.856	0.007
f_meas	0.565	0.025
j_index	0.494	0.035
kap	0.479	0.028
mcc	0.480	0.029
pr_auc	0.639	0.032
precision	0.548	0.021
roc_auc	0.831	0.025
sensitivity	0.587	0.035
specificity	0.907	0.007

At this point, the model development process is complete. The remaining steps focus on evaluating the finalized model on the held-out test dataset and interpreting the results.

Stage 11a: Evaluate the Final Model on the Test Set

The test dataset provides the most important estimate of model performance because it was not used during preprocessing, tuning, or cross-validation. In this stage, I fit the finalized model using the training data and then evaluate how well it predicts attrition among employees in the held-out test dataset.

# Setting a seed so the final model evaluation can be reproduced
set.seed(42)

# Fitting the finalized model on the training data and evaluating it on the test data
elastic_net_last_fit <- elastic_net_final_wflow %>%
  last_fit(data_split, metrics = class_metric)

# Extracting performance metrics from the held-out test dataset
elastic_net_test_performance <- elastic_net_last_fit %>%
  collect_metrics() %>%
  select(.metric, .estimate) %>%
  arrange(.metric)

# Printing the final test-set performance table
knitr::kable(
  elastic_net_test_performance,
  digits = 3,
  col.names = c("Metric", "Test Set Estimate"),
  caption = "Final Elastic Net model — held-out test set performance"
)

Final Elastic Net model — held-out test set performance
Metric	Test Set Estimate
accuracy	0.873
f_meas	0.624
j_index	0.566
kap	0.547
mcc	0.548
pr_auc	0.696
precision	0.600
roc_auc	0.862
sensitivity	0.650
specificity	0.916

The next step extracts the model predictions generated for the test dataset. Reviewing these predictions provides a direct look at the probabilities and classifications used to calculate the performance metrics reported above.

# Extracting individual employee predictions from the test dataset
elastic_net_test_predictions <- elastic_net_last_fit %>%
  collect_predictions()

# Reviewing the first few predicted probabilities and classifications
elastic_net_test_predictions %>%
  select(attrition, .pred_Yes, .pred_No, .pred_class) %>%
  head(10)

attrition	.pred_Yes	.pred_No	.pred_class
No	0.2080112	0.7919888	No
No	0.4466760	0.5533240	No
No	0.2290620	0.7709380	No
No	0.2311662	0.7688338	No
No	0.1247012	0.8752988	No
No	0.2001179	0.7998821	No
Yes	0.8566513	0.1433487	Yes
No	0.3676295	0.6323705	No
Yes	0.5136523	0.4863477	Yes
Yes	0.0143757	0.9856243	No

Stage 11b: Confusion Matrix

The confusion matrix summarizes how often the model correctly and incorrectly classified employees as staying or leaving. This provides a more concrete view of model performance by showing the number of correct predictions, false alarms, and missed attrition cases.

# Giving yardstick's specificity function priority when conflicts exist
conflicts_prefer(yardstick::spec)

# Creating a confusion matrix for the test-set classifications
elastic_net_test_predictions %>%
  conf_mat(truth = attrition, estimate = .pred_class)

##           Truth
## Prediction Yes  No
##        Yes  39  26
##        No   21 283

The summary below converts the confusion matrix into commonly reported classification statistics, including sensitivity and specificity.

# Calculating summary statistics from the confusion matrix
elastic_net_test_predictions %>%
  conf_mat(truth = attrition, estimate = .pred_class) %>%
  summary()

.metric	.estimator	.estimate
accuracy	binary	0.8726287
kap	binary	0.5474755
sens	binary	0.6500000
spec	binary	0.9158576
ppv	binary	0.6000000
npv	binary	0.9309211
mcc	binary	0.5481110
j_index	binary	0.5658576
bal_accuracy	binary	0.7829288
detection_prevalence	binary	0.1761518
precision	binary	0.6000000
recall	binary	0.6500000
f_meas	binary	0.6240000

Stage 11c: ROC Curve

The ROC curve shows how sensitivity and specificity change across different classification thresholds. It provides a threshold-independent view of how well the model separates employees who leave from employees who stay.

# Creating an ROC curve to visualize model performance across classification thresholds
elastic_net_test_predictions %>%
  roc_curve(attrition, .pred_Yes) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_line(linewidth = 1.5, color = "midnightblue") +
  geom_abline(lty = 2, alpha = 0.5, color = "gray50", linewidth = 1.2) +
  coord_equal() +
  labs(
    title    = "ROC Curve: Elastic Net Logistic Regression",
    subtitle = "Event = Attrition (Yes) | IBM HR dataset",
    x        = "1 - Specificity",
    y        = "Sensitivity"
  ) +
  theme_minimal(base_size = 13)

Although ROC curves are widely used, precision-recall curves are often more informative when the outcome is relatively uncommon. Because attrition represents a minority class in this dataset, I use the precision-recall curve as the primary graphical summary of model performance.

Stage 11d: PR Curve

The precision-recall curve focuses specifically on how well the model identifies employees who leave. This perspective is particularly useful when evaluating imbalanced outcomes because it emphasizes performance on the minority class rather than overall classification success.

# Giving ggplot's margin function priority when conflicts exist
conflicts_prefer(ggplot2::margin)

# Calculating the attrition prevalence in the test dataset
prevalence <- mean(test_data$attrition == "Yes")

# Calculating the PR AUC value that will be displayed on the plot
pr_auc_val <- elastic_net_test_predictions %>%
  pr_auc(attrition, .pred_Yes) %>%
  pull(.estimate) %>%
  round(3)

# Creating a precision-recall curve and comparing it to the no-skill baseline
elastic_net_test_predictions %>%
  pr_curve(attrition, .pred_Yes) %>%
  ggplot(aes(x = recall, y = precision)) +
  geom_hline(
    yintercept = prevalence,
    linetype   = "dashed",
    color      = "grey55",
    linewidth  = 0.7
  ) +
  annotate(
    "text",
    x        = 0.02,
    y        = prevalence + 0.035,
    label    = paste0("No-skill baseline (", round(prevalence, 2), ")"),
    hjust    = 0,
    size     = 3.3,
    color    = "grey45"
  ) +
  geom_area(fill = "#2c7bb6", alpha = 0.12) +
  geom_path(color = "#2c7bb6", linewidth = 1.1) +
  annotate(
    "text",
    x        = 0.65,
    y        = 0.93,
    label    = paste0("PR-AUC = ", pr_auc_val),
    hjust    = 0,
    size     = 4.2,
    fontface = "bold",
    color    = "#2c7bb6"
  ) +
  scale_x_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
  coord_equal() +
  labs(
    title    = "Precision-Recall Curve: Elastic Net Logistic Regression",
    subtitle = "Event = Attrition (Yes) | IBM HR dataset",
    x        = "Recall  (share of true leavers identified)",
    y        = "Precision  (share of flagged employees who actually left)",
    caption  = "Dashed line = no-skill baseline at positive class prevalence (~16%)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold"),
    plot.subtitle      = element_text(color = "grey40", margin = margin(b = 8)),
    plot.caption       = element_text(color = "grey50", size = 9),
    panel.grid.minor   = element_blank(),
    axis.title.x       = element_text(margin = margin(t = 8)),
    axis.title.y       = element_text(margin = margin(r = 8))
  )

Together, the test-set metrics, confusion matrix, ROC curve, and precision-recall curve provide a comprehensive evaluation of model performance. Because all results were generated using employees who were not involved in model training, these estimates provide the most realistic indication of how the model would perform in practice.

Stage 12: Fit the Final Deployment Model

The previous stage evaluated model performance using the held-out test dataset and provides the most trustworthy estimate of how well the model generalizes to new employees. After completing that evaluation, I refit the finalized model using the entire dataset so that all available information contributes to the final model.

This full-data model is intended for future prediction and interpretation, not for evaluating performance. Because the model has now been trained on every employee record, any performance statistics generated from this fit would no longer represent out-of-sample performance.

# Refitting the finalized model using the full dataset
elastic_net_full_fit <- fit(elastic_net_final_wflow, data = df)

Stage 13: Examine the Most Influential Predictors

In addition to evaluating model performance, I also examine the predictors with the largest non-zero coefficients. Because elastic net applies regularization, these coefficients reflect the variables that contributed most strongly to attrition prediction within the final model.

# Extracting and displaying the largest non-zero coefficients from the finalized model
tidy(elastic_net_full_fit %>% extract_fit_parsnip()) %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  mutate(abs_estimate = abs(estimate)) %>%
  arrange(desc(abs_estimate)) %>%
  slice_head(n = 10) %>%
  select(term, estimate) %>%
  knitr::kable(
    digits = 4,
    caption = "Top 10 non-zero Elastic Net coefficients (by absolute size)"
  )

Top 10 non-zero Elastic Net coefficients (by absolute size)
term	estimate
over_time_Yes	-0.6838
job_level_X2	0.4784
stock_option_level_X1	0.4375
total_working_years	0.4053
environment_satisfaction	0.3754
business_travel_Travel_Frequently	-0.3532
num_companies_worked	-0.3298
job_satisfaction	0.2927
job_involvement	0.2794
stock_option_level_X2	0.2439

The coefficient table highlights the predictors that contributed most strongly to attrition prediction in the finalized model. Positive coefficients are associated with a higher predicted probability of attrition, whereas negative coefficients are associated with a lower predicted probability of attrition. Because all numeric predictors were standardized during preprocessing, coefficient magnitudes can be compared within the model as indicators of relative association strength.

Although these coefficients help identify variables that influenced prediction, they should not be interpreted as evidence of causation. Instead, they indicate which employee characteristics and workplace factors the model relied on most heavily when estimating attrition risk.

Stage 14: Interpret Model Performance

In this analysis, I used an elastic net logistic regression model to predict employee attrition using the IBM HR dataset. The model was tuned using five-fold cross-validation, with PR AUC serving as the primary selection metric because attrition represents an imbalanced outcome.

The final model selected a penalty value of 0.0162 and a mixture value of 0.5. This indicates that the model used a balanced elastic net blend of Ridge and LASSO regularization, combining coefficient shrinkage with partial variable selection. On the held-out test set, the model achieved a PR AUC of 0.696, a ROC AUC of 0.862, a sensitivity of 0.650, a specificity of 0.916, an accuracy of 0.873, and a Youden’s J value of 0.566.

Overall, these results suggest that the model was able to identify meaningful patterns associated with employee attrition and distinguish employees who left from those who stayed. The ROC AUC of 0.862 indicates strong overall separation between leavers and stayers, while the PR AUC of 0.696 suggests that the model performs well even when evaluated under the natural class imbalance present in the data. The model correctly identified 65% of employees who left while correctly identifying approximately 92% of employees who stayed.

From a practical perspective, this model would be most useful if HR intends to prioritize retention efforts toward employees with elevated attrition risk while still avoiding many unnecessary interventions for employees who are likely to remain. Although some attrition cases were missed, the model provides meaningful information beyond the base attrition rate and could serve as a useful decision-support tool for identifying employees who may be at greater risk of leaving the organization.