In this assignment, I use an elastic net logistic regression model to predict employee attrition using the IBM HR dataset. I begin by loading and preparing the data, then split the dataset into training and test sets so that model performance can be evaluated on unseen employees. Next, I create a preprocessing recipe, address class imbalance using SMOTE, and tune the model’s hyperparameters using cross-validation. Finally, I select the best-performing model, evaluate its performance on the held-out test set, and examine the characteristics of the final model.
The following packages support data import, reproducible preprocessing, stratified resampling, hyperparameter tuning, and classification diagnostics within a single tidymodels workflow.
# Clearing the environment so the analysis starts from a clean workspace
rm(list = ls())
# Loading conflicted to manage functions that share the same name across packages
library(conflicted)
# Installing pacman if needed, then loading it
if (!requireNamespace("pacman", quietly = TRUE)) {
install.packages("pacman", repos = "http://cran.r-project.org")
}
library(pacman)
# Loading packages needed for data import, cleaning, modeling, resampling, and evaluation
pacman::p_load(
readxl,
dplyr,
tidyr,
ggplot2,
knitr,
tidymodels,
janitor,
scales,
themis,
tidylog
)
# Giving tidylog priority when its functions overlap with functions from other packages
for (f in getNamespaceExports("tidylog")) {
conflicted::conflict_prefer(f, "tidylog", quiet = TRUE)
}
# Giving dplyr priority for common data-wrangling functions
conflicted::conflict_prefer("filter", "dplyr", quiet = TRUE)
conflicted::conflict_prefer("select", "dplyr", quiet = TRUE)
The IBM HR dataset contains employee demographic, job, satisfaction, and compensation variables along with an attrition outcome. Before building the model, I need to prepare the data so that categorical variables are coded correctly, the attrition outcome is defined consistently, and variable names are easier to work with throughout the analysis.
# Updating the file path if the Excel file is stored in a different location
data_path <- "~/Desktop/1 - Coursework/Graduate Coursework/UGA/4 - Summer 2026/Advanced Analytics (PSYC 6841)/Data/WA_Fn-UseC_-HR-Employee-Attrition.xlsx"
# Reading the IBM HR attrition dataset into R
df <- read_excel(data_path)
# Converting all character variables into factors so categorical predictors are recognized correctly
df <- df %>%
mutate(across(where(is.character), as.factor))
# Setting attrition = "Yes" as the event of interest so predictions focus on employee departures
df <- df %>%
mutate(Attrition = relevel(factor(Attrition), ref = "Yes"))
# Creating a unique row identifier that can be used to link predictions back to employees later
df <- df %>%
mutate(id = row_number()) %>%
select(id, everything())
# Standardizing variable names into snake_case for easier coding and interpretation
df <- df %>%
clean_names()
# Reviewing the structure of the cleaned dataset before moving into model development
glimpse(df)
## Rows: 1,470
## Columns: 36
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35,…
## $ attrition <fct> Yes, No, Yes, No, No, No, No, No, No, No, N…
## $ business_travel <fct> Travel_Rarely, Travel_Frequently, Travel_Ra…
## $ daily_rate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 135…
## $ department <fct> Sales, Research & Development, Research & D…
## $ distance_from_home <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26…
## $ education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3…
## $ education_field <fct> Life Sciences, Life Sciences, Other, Life S…
## $ employee_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ employee_number <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 1…
## $ environment_satisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3…
## $ gender <fct> Female, Male, Male, Female, Male, Male, Fem…
## $ hourly_rate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84,…
## $ job_involvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2…
## $ job_level <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1…
## $ job_role <fct> Sales Executive, Research Scientist, Labora…
## $ job_satisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3…
## $ marital_status <fct> Single, Married, Single, Married, Married, …
## $ monthly_income <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2…
## $ monthly_rate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 99…
## $ num_companies_worked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5…
## $ over18 <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
## $ over_time <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No,…
## $ percent_salary_hike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13,…
## $ performance_rating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3…
## $ relationship_satisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2…
## $ standard_hours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80,…
## $ stock_option_level <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0…
## $ total_working_years <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5,…
## $ training_times_last_year <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4…
## $ work_life_balance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3…
## $ years_at_company <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, …
## $ years_in_current_role <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2…
## $ years_since_last_promotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0…
## $ years_with_curr_manager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3…
Before building the model, I split the data set into training and test sets. The training set will be used to build and tune the model, while the test set will be reserved for evaluating how well the final model performs on employees it has not seen before. I also use stratification so that the attrition rate remains similar across both data sets.
# Setting a seed so the train-test split can be reproduced
set.seed(42)
# Creating a 75/25 split while preserving the attrition rate in both datasets
data_split <- initial_split(df, prop = 0.75, strata = attrition)
# Separating the training and test datasets from the split object
train_data <- training(data_split)
test_data <- testing(data_split)
After creating the split, I want to verify that the attrition rate remains consistent across the full data set, training data set, and test data set. If the rates differ substantially, model performance could be influenced by sampling differences rather than actual predictive ability.
# Calculating the attrition count and percentage for the full dataset
tabyl(df$attrition) %>%
adorn_totals("row") %>%
adorn_pct_formatting()
| df$attrition | n | percent |
|---|---|---|
| Yes | 237 | 16.1% |
| No | 1233 | 83.9% |
| Total | 1470 | 100.0% |
Approximately 16% of employees in the full dataset experienced attrition.
# Calculating the attrition count and percentage for the training dataset
tabyl(train_data$attrition) %>%
adorn_totals("row") %>%
adorn_pct_formatting()
| train_data$attrition | n | percent |
|---|---|---|
| Yes | 177 | 16.1% |
| No | 924 | 83.9% |
| Total | 1101 | 100.0% |
# Calculating the attrition count and percentage for the test dataset
tabyl(test_data$attrition) %>%
adorn_totals("row") %>%
adorn_pct_formatting()
| test_data$attrition | n | percent |
|---|---|---|
| Yes | 60 | 16.3% |
| No | 309 | 83.7% |
| Total | 369 | 100.0% |
The attrition rates in the training and test datasets are very similar to the rate observed in the full dataset. This suggests that stratification worked as intended and that both datasets contain a comparable proportion of employees who stayed and employees who left.
After splitting the data, I create five cross-validation folds from the training dataset. Cross-validation allows the model to be trained and evaluated multiple times on different parts of the training data before it is tested on the held-out test set. I use stratification again so that each fold keeps a similar attrition rate.
# Setting a seed so the cross-validation folds can be reproduced
set.seed(42)
# Creating 5 stratified folds from the training data only
# The test set is not used here because it must remain untouched until final evaluation
cv_folds <- vfold_cv(
train_data,
v = 5,
strata = attrition
)
Before creating the preprocessing recipe, I want to identify any predictors with highly skewed distributions. Variables with substantial skewness can contain extreme values that may influence the model’s ability to estimate stable relationships. To avoid introducing information from the test set, I calculate skewness using only the training data.
# Identifying numeric variables that will later be converted into factors
vars_to_factorize <- c("job_level", "stock_option_level")
# Identifying ordinal variables that should remain in their original form
ordinal_likert_vars <- c(
"education",
"environment_satisfaction",
"job_involvement",
"job_satisfaction",
"performance_rating",
"relationship_satisfaction",
"work_life_balance"
)
# Excluding outcome, identifier, factor, and ordinal variables from the skewness assessment
exclude_from_skew <- c(
"attrition",
"employee_number",
"id",
vars_to_factorize,
ordinal_likert_vars
)
# Calculating skewness for the remaining numeric predictors using only the training dataset
skewness_df <- train_data %>%
select(where(is.numeric)) %>%
select(-any_of(exclude_from_skew)) %>%
summarise(
across(
.cols = everything(),
.fns = ~ {
x <- .x[!is.na(.x)]
n <- length(x)
m <- mean(x)
s <- sd(x)
if (s == 0 || n < 3) NA_real_ else (sum((x - m)^3) / n) / s^3
}
)
) %>%
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "skewness"
) %>%
arrange(desc(abs(skewness)))
# Identifying predictors with substantial skewness that may benefit from a Yeo-Johnson transformation
skewed_feature_names <- skewness_df %>%
filter(!is.na(skewness), abs(skewness) > 1) %>%
pull(variable)
# Printing the variables that will be transformed later in the preprocessing recipe
cat("\nVariables flagged for Yeo-Johnson (|skewness| > 1):\n")
##
## Variables flagged for Yeo-Johnson (|skewness| > 1):
print(skewed_feature_names)
## [1] "years_since_last_promotion" "years_at_company"
## [3] "monthly_income" "total_working_years"
## [5] "num_companies_worked" "distance_from_home"
The variables identified above exceed the selected skewness threshold and will be transformed later in the preprocessing recipe. Reducing skewness can help stabilize coefficient estimation and improve the performance of regularized regression models.
Before training the model, I need to prepare the predictors so they can be used appropriately by the elastic net logistic regression model. The preprocessing recipe below removes uninformative variables, prepares categorical predictors, transforms highly skewed variables, and standardizes predictors onto a common scale. Defining these steps in a recipe ensures that the same preprocessing is applied consistently during cross-validation, testing, and final model fitting.
# Creating a preprocessing recipe that uses all available predictors to estimate attrition
reg_recipe <- recipe(attrition ~ ., data = train_data) %>%
# Identifying employee number as an ID rather than a predictive feature
update_role(employee_number, new_role = "ID") %>%
# Removing predictors with no variation
step_zv(all_predictors()) %>%
# Removing predictors with very little variation
step_nzv(all_predictors()) %>%
# Converting selected numeric variables into categorical factors
step_mutate(
job_level = factor(job_level, levels = 1:5),
stock_option_level = factor(stock_option_level, levels = 0:3)
) %>%
# Creating a placeholder level for categories that may appear in future data
step_novel(all_nominal_predictors()) %>%
# Grouping infrequent categories into a single "other_rare" category
step_other(
all_nominal_predictors(),
threshold = 0.05,
other = "other_rare"
) %>%
# Applying a Yeo-Johnson transformation to highly skewed predictors
step_YeoJohnson(all_of(skewed_feature_names)) %>%
# Converting categorical predictors into dummy-coded variables
step_dummy(all_nominal_predictors(), one_hot = FALSE) %>%
# Removing any zero-variance predictors created during dummy coding
step_zv(all_predictors()) %>%
# Standardizing all numeric predictors prior to elastic net regularization
step_normalize(all_numeric_predictors())
Employee attrition is relatively uncommon in this dataset, which means a model could achieve high accuracy by predicting that most employees will stay. To help the model learn patterns associated with attrition, I use SMOTE (Synthetic Minority Oversampling Technique) to create additional synthetic attrition cases during model training. This balancing procedure is applied only within the training resamples so that the test set continues to reflect the natural attrition rate.
# Adding SMOTE to the preprocessing recipe to address class imbalance
reg_recipe_smote <- reg_recipe %>%
# Generating synthetic attrition cases within the training data
step_smote(
attrition,
over_ratio = 0.5,
seed = 42
)
At this point, the preprocessing recipe is complete. The recipe now handles variable preparation, skewness correction, dummy coding, normalization, and class balancing. These steps will be applied automatically whenever the model is trained, tuned, or evaluated in later stages of the analysis.
Now that the data are prepared, I can define the elastic net logistic regression model. Elastic net extends traditional logistic regression by adding regularization, which helps reduce overfitting and improve model stability when predictors are correlated. Because the amount of regularization and the balance between Ridge and LASSO penalties can affect performance, both parameters will be tuned later using cross-validation.
# Defining an elastic net logistic regression model with tunable hyperparameters
elastic_net_spec <- logistic_reg(
# Tuning the overall amount of regularization (lambda)
penalty = tune(),
# Tuning the balance between Ridge (0) and LASSO (1) regularization
mixture = tune()
) %>%
# Using glmnet as the modeling engine
set_engine("glmnet") %>%
# Specifying a binary classification model
set_mode("classification")
Before tuning the model, I combine the preprocessing recipe and elastic net specification into a single workflow. Doing so ensures that every preprocessing step and modeling step is applied together throughout cross-validation, model selection, and final evaluation.
# Combining the preprocessing recipe and elastic net model into a single workflow
elastic_net_wf <- workflow() %>%
# Adding the preprocessing recipe
add_recipe(reg_recipe_smote) %>%
# Adding the elastic net model specification
add_model(elastic_net_spec)
With stages 6a and 6b complete, the modeling workflow has been fully defined. The remaining steps focus on determining which combination of hyperparameters produces the strongest predictive performance.
Before tuning the model, I need to decide how model performance will be evaluated. Because employee attrition is an imbalanced outcome, relying on accuracy alone could be misleading. Instead, I use a collection of metrics that evaluate how well the model identifies employees who leave, distinguishes leavers from stayers, and balances different types of classification errors.
# Giving yardstick versions of accuracy and precision priority when conflicts exist
conflicts_prefer(yardstick::accuracy)
conflicts_prefer(yardstick::precision)
# Creating a metric set that will be used during model tuning and evaluation
class_metric <- metric_set(
accuracy, # Measuring overall classification accuracy
f_meas, # Measuring the balance between precision and recall
j_index, # Measuring the balance between sensitivity and specificity
kap, # Measuring agreement beyond chance
precision, # Measuring the proportion of flagged employees who actually left
sensitivity, # Measuring the proportion of actual leavers correctly identified
specificity, # Measuring the proportion of stayers correctly identified
roc_auc, # Measuring overall class separation across thresholds
mcc, # Measuring classification quality using all confusion matrix cells
pr_auc # Measuring precision-recall performance under class imbalance
)
Although several metrics will be reported, I use PR AUC (Precision-Recall Area Under the Curve) as the primary metric for model selection. PR AUC is particularly useful when the outcome is imbalanced because it focuses on how effectively the model identifies attrition cases. ROC AUC is also reported as a complementary measure of the model’s ability to separate employees who leave from those who stay.
The next step is to determine which combination of penalty and mixture values produces the strongest model performance. Rather than selecting these values manually, I evaluate many possible combinations using cross-validation. The tuning grid includes Ridge-like models, LASSO-like models, and intermediate elastic net solutions across a range of regularization strengths.
# Creating a tuning grid that evaluates different combinations of penalty and mixture values
en_grid <- grid_regular(
# Testing a range of regularization strengths on a log scale
penalty(range = c(-4, 2)),
# Testing Ridge, LASSO, and intermediate elastic net solutions
mixture(range = c(0, 1)),
# Defining the number of values evaluated for each hyperparameter
levels = c(penalty = 20, mixture = 5)
)
# Reviewing the first few combinations in the tuning grid
head(en_grid, 10)
| penalty | mixture |
|---|---|
| 0.0001000 | 0 |
| 0.0002069 | 0 |
| 0.0004281 | 0 |
| 0.0008859 | 0 |
| 0.0018330 | 0 |
| 0.0037927 | 0 |
| 0.0078476 | 0 |
| 0.0162378 | 0 |
| 0.0335982 | 0 |
| 0.0695193 | 0 |
# Displaying the total number of hyperparameter combinations being evaluated
cat("\nTuning grid size:", nrow(en_grid), "combinations\n")
##
## Tuning grid size: 100 combinations
# Setting a seed so tuning results can be reproduced
set.seed(42)
# Evaluating each penalty and mixture combination using cross-validation
elastic_net_tune_res <- tune_grid(
elastic_net_wf,
resamples = cv_folds,
grid = en_grid,
metrics = class_metric
)
# Displaying the top-performing models based on PR AUC
show_best(elastic_net_tune_res, metric = "pr_auc", n = 5)
| penalty | mixture | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 0.0162378 | 0.50 | pr_auc | binary | 0.6390982 | 5 | 0.0318394 | pre0_mod038_post0 |
| 0.0695193 | 0.00 | pr_auc | binary | 0.6382526 | 5 | 0.0304547 | pre0_mod046_post0 |
| 0.0335982 | 0.00 | pr_auc | binary | 0.6380157 | 5 | 0.0274793 | pre0_mod041_post0 |
| 0.0162378 | 0.25 | pr_auc | binary | 0.6357271 | 5 | 0.0264245 | pre0_mod037_post0 |
| 0.0078476 | 1.00 | pr_auc | binary | 0.6344649 | 5 | 0.0295331 | pre0_mod035_post0 |
# Visualizing how PR AUC changes across penalty and mixture combinations
autoplot(elastic_net_tune_res, metric = "pr_auc") +
labs(title = "Elastic Net Tuning: PR AUC by Penalty and Mixture")
# Visualizing how ROC AUC changes across penalty and mixture combinations
autoplot(elastic_net_tune_res, metric = "roc_auc") +
labs(title = "Elastic Net Tuning: ROC AUC by Penalty and Mixture")
# Summarizing the top-performing hyperparameter combinations for PR AUC and ROC AUC
collect_metrics(elastic_net_tune_res) %>%
filter(.metric %in% c("pr_auc", "roc_auc")) %>%
group_by(.metric) %>%
slice_max(mean, n = 3, with_ties = FALSE) %>%
ungroup() %>%
select(.metric, penalty, mixture, mean, std_err) %>%
knitr::kable(
digits = 3,
caption = "Top 3 cross-validated penalty/mixture combinations by AUC metric"
)
| .metric | penalty | mixture | mean | std_err |
|---|---|---|---|---|
| pr_auc | 0.016 | 0.5 | 0.639 | 0.032 |
| pr_auc | 0.070 | 0.0 | 0.638 | 0.030 |
| pr_auc | 0.034 | 0.0 | 0.638 | 0.027 |
| roc_auc | 0.034 | 0.0 | 0.832 | 0.023 |
| roc_auc | 0.016 | 0.0 | 0.832 | 0.022 |
| roc_auc | 0.000 | 0.0 | 0.832 | 0.022 |
The tuning results above show which combinations of regularization strength and penalty type produced the strongest cross-validated performance. In the next section, I select the combination with the highest PR AUC and use it to define the final elastic net model.
Now that all hyperparameter combinations have been evaluated, I select the combination that produced the highest average PR AUC across the five cross-validation folds. Because attrition is an imbalanced outcome, this approach prioritizes identifying employees who leave rather than relying solely on overall accuracy.
# Selecting the penalty and mixture combination with the highest PR AUC
best_auc <- select_best(elastic_net_tune_res, metric = "pr_auc")
# Reviewing the selected hyperparameter values
best_auc
| penalty | mixture | .config |
|---|---|---|
| 0.0162378 | 0.5 | pre0_mod038_post0 |
# Extracting the optimal penalty value
best_penalty <- best_auc$penalty
# Extracting the optimal mixture value
best_mixture <- best_auc$mixture
# Printing the selected hyperparameters
cat(
"\nBest penalty:", format(best_penalty, digits = 4, scientific = TRUE),
"\nBest mixture:", round(best_mixture, 3), "\n"
)
##
## Best penalty: 1.624e-02
## Best mixture: 0.5
# Interpreting the selected mixture value based on its location along the Ridge-LASSO continuum
if (best_mixture == 0) {
# Ridge regression solution
mixture_note <- "The selected solution corresponds to Ridge regression (mixture = 0), which shrinks coefficients without eliminating predictors entirely."
} else if (best_mixture == 1) {
# LASSO regression solution
mixture_note <- "The selected solution corresponds to LASSO regression (mixture = 1), which can reduce the active predictor set by zeroing coefficients."
} else if (best_mixture < 0.25) {
# Elastic net solution that behaves mostly like Ridge
mixture_note <- paste0(
"The selected solution is elastic net but behaves more like Ridge (mixture = ",
round(best_mixture, 3), "), suggesting that coefficient shrinkage mattered more than aggressive variable elimination."
)
} else if (best_mixture > 0.75) {
# Elastic net solution that behaves mostly like LASSO
mixture_note <- paste0(
"The selected solution is elastic net but behaves more like LASSO (mixture = ",
round(best_mixture, 3), "), suggesting that variable selection contributed meaningfully to the final model."
)
} else {
# Balanced elastic net solution
mixture_note <- paste0(
"The selected solution reflects a balanced elastic net blend of Ridge and LASSO (mixture = ",
round(best_mixture, 3), "), combining shrinkage with partial variable selection."
)
}
# Printing the interpretation of the selected mixture value
mixture_note
## [1] "The selected solution reflects a balanced elastic net blend of Ridge and LASSO (mixture = 0.5), combining shrinkage with partial variable selection."
Elastic net performs regularization as part of the modeling process, so I did not apply a separate feature-selection procedure such as Boruta. Instead, the tuning process allowed the model to determine how much coefficient shrinkage and variable selection were needed to maximize performance. The selected mixture value indicates whether the final solution behaved more like Ridge regression, LASSO regression, or a combination of both.
The tuning process evaluated Ridge, LASSO, and intermediate elastic net solutions across a range of penalty values. Based on PR AUC, the selected model used a mixture value of 0.5, indicating that the final model combined coefficient shrinkage and variable selection. This suggests that the strongest-performing model was obtained by balancing model complexity and predictive performance through regularization.
After selecting the best hyperparameters, I finalize the workflow by replacing the tunable placeholders with the chosen penalty and mixture values. I then re-run cross-validation using the finalized model to estimate how consistently it performs across different subsets of the training data before moving to the held-out test set.
# Creating the finalized workflow using the selected hyperparameters
elastic_net_final_wflow <- finalize_workflow(elastic_net_wf, best_auc)
The next step evaluates the finalized model across the cross-validation folds. This provides a final estimate of training-set performance and allows me to verify that performance remains stable before testing the model on unseen employees.
# Setting a seed so the resampling results can be reproduced
set.seed(42)
# Evaluating the finalized model across the cross-validation folds
elastic_net_cv_results <- fit_resamples(
elastic_net_final_wflow,
resamples = cv_folds,
metrics = class_metric
)
# Summarizing cross-validated model performance across all folds
elastic_net_cv_results %>%
collect_metrics() %>%
select(.metric, mean, std_err) %>%
arrange(.metric) %>%
knitr::kable(
digits = 3,
caption = "Finalized Elastic Net — 5-fold CV performance"
)
| .metric | mean | std_err |
|---|---|---|
| accuracy | 0.856 | 0.007 |
| f_meas | 0.565 | 0.025 |
| j_index | 0.494 | 0.035 |
| kap | 0.479 | 0.028 |
| mcc | 0.480 | 0.029 |
| pr_auc | 0.639 | 0.032 |
| precision | 0.548 | 0.021 |
| roc_auc | 0.831 | 0.025 |
| sensitivity | 0.587 | 0.035 |
| specificity | 0.907 | 0.007 |
At this point, the model development process is complete. The remaining steps focus on evaluating the finalized model on the held-out test dataset and interpreting the results.
The test dataset provides the most important estimate of model performance because it was not used during preprocessing, tuning, or cross-validation. In this stage, I fit the finalized model using the training data and then evaluate how well it predicts attrition among employees in the held-out test dataset.
# Setting a seed so the final model evaluation can be reproduced
set.seed(42)
# Fitting the finalized model on the training data and evaluating it on the test data
elastic_net_last_fit <- elastic_net_final_wflow %>%
last_fit(data_split, metrics = class_metric)
# Extracting performance metrics from the held-out test dataset
elastic_net_test_performance <- elastic_net_last_fit %>%
collect_metrics() %>%
select(.metric, .estimate) %>%
arrange(.metric)
# Printing the final test-set performance table
knitr::kable(
elastic_net_test_performance,
digits = 3,
col.names = c("Metric", "Test Set Estimate"),
caption = "Final Elastic Net model — held-out test set performance"
)
| Metric | Test Set Estimate |
|---|---|
| accuracy | 0.873 |
| f_meas | 0.624 |
| j_index | 0.566 |
| kap | 0.547 |
| mcc | 0.548 |
| pr_auc | 0.696 |
| precision | 0.600 |
| roc_auc | 0.862 |
| sensitivity | 0.650 |
| specificity | 0.916 |
The next step extracts the model predictions generated for the test dataset. Reviewing these predictions provides a direct look at the probabilities and classifications used to calculate the performance metrics reported above.
# Extracting individual employee predictions from the test dataset
elastic_net_test_predictions <- elastic_net_last_fit %>%
collect_predictions()
# Reviewing the first few predicted probabilities and classifications
elastic_net_test_predictions %>%
select(attrition, .pred_Yes, .pred_No, .pred_class) %>%
head(10)
| attrition | .pred_Yes | .pred_No | .pred_class |
|---|---|---|---|
| No | 0.2080112 | 0.7919888 | No |
| No | 0.4466760 | 0.5533240 | No |
| No | 0.2290620 | 0.7709380 | No |
| No | 0.2311662 | 0.7688338 | No |
| No | 0.1247012 | 0.8752988 | No |
| No | 0.2001179 | 0.7998821 | No |
| Yes | 0.8566513 | 0.1433487 | Yes |
| No | 0.3676295 | 0.6323705 | No |
| Yes | 0.5136523 | 0.4863477 | Yes |
| Yes | 0.0143757 | 0.9856243 | No |
The confusion matrix summarizes how often the model correctly and incorrectly classified employees as staying or leaving. This provides a more concrete view of model performance by showing the number of correct predictions, false alarms, and missed attrition cases.
# Giving yardstick's specificity function priority when conflicts exist
conflicts_prefer(yardstick::spec)
# Creating a confusion matrix for the test-set classifications
elastic_net_test_predictions %>%
conf_mat(truth = attrition, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 39 26
## No 21 283
The summary below converts the confusion matrix into commonly reported classification statistics, including sensitivity and specificity.
# Calculating summary statistics from the confusion matrix
elastic_net_test_predictions %>%
conf_mat(truth = attrition, estimate = .pred_class) %>%
summary()
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.8726287 |
| kap | binary | 0.5474755 |
| sens | binary | 0.6500000 |
| spec | binary | 0.9158576 |
| ppv | binary | 0.6000000 |
| npv | binary | 0.9309211 |
| mcc | binary | 0.5481110 |
| j_index | binary | 0.5658576 |
| bal_accuracy | binary | 0.7829288 |
| detection_prevalence | binary | 0.1761518 |
| precision | binary | 0.6000000 |
| recall | binary | 0.6500000 |
| f_meas | binary | 0.6240000 |
The ROC curve shows how sensitivity and specificity change across different classification thresholds. It provides a threshold-independent view of how well the model separates employees who leave from employees who stay.
# Creating an ROC curve to visualize model performance across classification thresholds
elastic_net_test_predictions %>%
roc_curve(attrition, .pred_Yes) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_line(linewidth = 1.5, color = "midnightblue") +
geom_abline(lty = 2, alpha = 0.5, color = "gray50", linewidth = 1.2) +
coord_equal() +
labs(
title = "ROC Curve: Elastic Net Logistic Regression",
subtitle = "Event = Attrition (Yes) | IBM HR dataset",
x = "1 - Specificity",
y = "Sensitivity"
) +
theme_minimal(base_size = 13)
Although ROC curves are widely used, precision-recall curves are often more informative when the outcome is relatively uncommon. Because attrition represents a minority class in this dataset, I use the precision-recall curve as the primary graphical summary of model performance.
The precision-recall curve focuses specifically on how well the model identifies employees who leave. This perspective is particularly useful when evaluating imbalanced outcomes because it emphasizes performance on the minority class rather than overall classification success.
# Giving ggplot's margin function priority when conflicts exist
conflicts_prefer(ggplot2::margin)
# Calculating the attrition prevalence in the test dataset
prevalence <- mean(test_data$attrition == "Yes")
# Calculating the PR AUC value that will be displayed on the plot
pr_auc_val <- elastic_net_test_predictions %>%
pr_auc(attrition, .pred_Yes) %>%
pull(.estimate) %>%
round(3)
# Creating a precision-recall curve and comparing it to the no-skill baseline
elastic_net_test_predictions %>%
pr_curve(attrition, .pred_Yes) %>%
ggplot(aes(x = recall, y = precision)) +
geom_hline(
yintercept = prevalence,
linetype = "dashed",
color = "grey55",
linewidth = 0.7
) +
annotate(
"text",
x = 0.02,
y = prevalence + 0.035,
label = paste0("No-skill baseline (", round(prevalence, 2), ")"),
hjust = 0,
size = 3.3,
color = "grey45"
) +
geom_area(fill = "#2c7bb6", alpha = 0.12) +
geom_path(color = "#2c7bb6", linewidth = 1.1) +
annotate(
"text",
x = 0.65,
y = 0.93,
label = paste0("PR-AUC = ", pr_auc_val),
hjust = 0,
size = 4.2,
fontface = "bold",
color = "#2c7bb6"
) +
scale_x_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
coord_equal() +
labs(
title = "Precision-Recall Curve: Elastic Net Logistic Regression",
subtitle = "Event = Attrition (Yes) | IBM HR dataset",
x = "Recall (share of true leavers identified)",
y = "Precision (share of flagged employees who actually left)",
caption = "Dashed line = no-skill baseline at positive class prevalence (~16%)"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(color = "grey40", margin = margin(b = 8)),
plot.caption = element_text(color = "grey50", size = 9),
panel.grid.minor = element_blank(),
axis.title.x = element_text(margin = margin(t = 8)),
axis.title.y = element_text(margin = margin(r = 8))
)
Together, the test-set metrics, confusion matrix, ROC curve, and precision-recall curve provide a comprehensive evaluation of model performance. Because all results were generated using employees who were not involved in model training, these estimates provide the most realistic indication of how the model would perform in practice.
The previous stage evaluated model performance using the held-out test dataset and provides the most trustworthy estimate of how well the model generalizes to new employees. After completing that evaluation, I refit the finalized model using the entire dataset so that all available information contributes to the final model.
This full-data model is intended for future prediction and interpretation, not for evaluating performance. Because the model has now been trained on every employee record, any performance statistics generated from this fit would no longer represent out-of-sample performance.
# Refitting the finalized model using the full dataset
elastic_net_full_fit <- fit(elastic_net_final_wflow, data = df)
In addition to evaluating model performance, I also examine the predictors with the largest non-zero coefficients. Because elastic net applies regularization, these coefficients reflect the variables that contributed most strongly to attrition prediction within the final model.
# Extracting and displaying the largest non-zero coefficients from the finalized model
tidy(elastic_net_full_fit %>% extract_fit_parsnip()) %>%
filter(term != "(Intercept)", estimate != 0) %>%
mutate(abs_estimate = abs(estimate)) %>%
arrange(desc(abs_estimate)) %>%
slice_head(n = 10) %>%
select(term, estimate) %>%
knitr::kable(
digits = 4,
caption = "Top 10 non-zero Elastic Net coefficients (by absolute size)"
)
| term | estimate |
|---|---|
| over_time_Yes | -0.6838 |
| job_level_X2 | 0.4784 |
| stock_option_level_X1 | 0.4375 |
| total_working_years | 0.4053 |
| environment_satisfaction | 0.3754 |
| business_travel_Travel_Frequently | -0.3532 |
| num_companies_worked | -0.3298 |
| job_satisfaction | 0.2927 |
| job_involvement | 0.2794 |
| stock_option_level_X2 | 0.2439 |
The coefficient table highlights the predictors that contributed most strongly to attrition prediction in the finalized model. Positive coefficients are associated with a higher predicted probability of attrition, whereas negative coefficients are associated with a lower predicted probability of attrition. Because all numeric predictors were standardized during preprocessing, coefficient magnitudes can be compared within the model as indicators of relative association strength.
Although these coefficients help identify variables that influenced prediction, they should not be interpreted as evidence of causation. Instead, they indicate which employee characteristics and workplace factors the model relied on most heavily when estimating attrition risk.
In this analysis, I used an elastic net logistic regression model to predict employee attrition using the IBM HR dataset. The model was tuned using five-fold cross-validation, with PR AUC serving as the primary selection metric because attrition represents an imbalanced outcome.
The final model selected a penalty value of 0.0162 and a mixture value of 0.5. This indicates that the model used a balanced elastic net blend of Ridge and LASSO regularization, combining coefficient shrinkage with partial variable selection. On the held-out test set, the model achieved a PR AUC of 0.696, a ROC AUC of 0.862, a sensitivity of 0.650, a specificity of 0.916, an accuracy of 0.873, and a Youden’s J value of 0.566.
Overall, these results suggest that the model was able to identify meaningful patterns associated with employee attrition and distinguish employees who left from those who stayed. The ROC AUC of 0.862 indicates strong overall separation between leavers and stayers, while the PR AUC of 0.696 suggests that the model performs well even when evaluated under the natural class imbalance present in the data. The model correctly identified 65% of employees who left while correctly identifying approximately 92% of employees who stayed.
From a practical perspective, this model would be most useful if HR intends to prioritize retention efforts toward employees with elevated attrition risk while still avoiding many unnecessary interventions for employees who are likely to remain. Although some attrition cases were missed, the model provides meaningful information beyond the base attrition rate and could serve as a useful decision-support tool for identifying employees who may be at greater risk of leaving the organization.