This report applies logistic regression to predict
employee attrition using the IBM HR Analytics Employee Attrition
dataset, accessed directly via the rsample package
— the exact same dataset used in the textbook reference below.
Reference methodology: Bradley Boehmke & Brandon
Greenwell, Hands-On Machine Learning with R, Chapter 5 —
Logistic Regression.
https://bradleyboehmke.github.io/HOML/logistic-regression.html#assessing-model-accuracy-1
We estimate three nested logistic regression models and evaluate each on a held-out test set using confusion matrices.
library(tidyverse) # data wrangling + ggplot2
library(modeldata) # attrition dataset (moved from rsample)
library(rsample) # train/test split
library(caret) # confusionMatrix()
library(broom) # tidy model output
library(knitr) # kable tables
library(kableExtra) # table stylingInstall any missing packages with:
install.packages(c("tidyverse", "modeldata", "rsample", "caret", "broom", "kableExtra"))
The attrition dataset ships with the
rsample package and contains 1,470
employees across 31 variables — no external
file download required.
## Rows: 1,470
## Columns: 31
## $ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,…
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rare…
## $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department <fct> Sales, Research_Development, Research_Develop…
## $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education <ord> College, Below_College, College, Master, Belo…
## $ EducationField <fct> Life_Sciences, Life_Sciences, Other, Life_Sci…
## $ EnvironmentSatisfaction <ord> Medium, High, Very_High, Very_High, Low, Very…
## $ Gender <fct> Female, Male, Male, Female, Male, Male, Femal…
## $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement <ord> High, Medium, Medium, High, High, High, Very_…
## $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole <fct> Sales_Executive, Research_Scientist, Laborato…
## $ JobSatisfaction <ord> Very_High, Medium, High, High, Medium, Very_H…
## $ MaritalStatus <fct> Single, Married, Single, Married, Married, Si…
## $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
## $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating <ord> Excellent, Outstanding, Excellent, Excellent,…
## $ RelationshipSatisfaction <ord> Low, Very_High, Medium, High, Very_High, High…
## $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance <ord> Bad, Better, Better, Better, Better, Good, Go…
## $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
df <- attrition %>%
mutate(Attrition = factor(Attrition, levels = c("No", "Yes")))
df %>%
count(Attrition) %>%
mutate(Proportion = round(n / sum(n), 3)) %>%
kable(caption = "Attrition Class Distribution") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Attrition | n | Proportion |
|---|---|---|
| No | 1233 | 0.839 |
| Yes | 237 | 0.161 |
All three models are fitted on the training set with
glm(..., family = binomial).
\[\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{MonthlyIncome}\]
m1 <- glm(Attrition ~ MonthlyIncome,
data = train,
family = binomial)
tidy(m1) %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
kable(caption = "Model 1: Attrition ~ MonthlyIncome") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.9792 | 0.1437 | -6.8164 | 0 |
| MonthlyIncome | -0.0001 | 0.0000 | -4.9938 | 0 |
\[\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{MonthlyIncome} + \beta_2\,\text{OverTime}\]
m2 <- glm(Attrition ~ MonthlyIncome + OverTime,
data = train,
family = binomial)
tidy(m2) %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
kable(caption = "Model 2: Attrition ~ MonthlyIncome + OverTime") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.4775 | 0.1645 | -8.9827 | 0 |
| MonthlyIncome | -0.0001 | 0.0000 | -5.1236 | 0 |
| OverTimeYes | 1.3982 | 0.1674 | 8.3522 | 0 |
\[\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j\,X_j\]
m3 <- glm(Attrition ~ .,
data = train,
family = binomial)
tidy(m3) %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
arrange(p.value) %>%
kable(caption = "Model 3: Attrition ~ All Variables (sorted by p-value)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
scroll_box(height = "400px")| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| BusinessTravelTravel_Frequently | 2.2568 | 0.5268 | 4.2843 | 0.0000 |
| EnvironmentSatisfaction.L | -0.9488 | 0.2049 | -4.6294 | 0.0000 |
| JobSatisfaction.L | -0.8894 | 0.2105 | -4.2249 | 0.0000 |
| NumCompaniesWorked | 0.1874 | 0.0436 | 4.3024 | 0.0000 |
| OverTimeYes | 2.0093 | 0.2216 | 9.0669 | 0.0000 |
| JobInvolvement.L | -1.5370 | 0.3941 | -3.9001 | 0.0001 |
| MaritalStatusSingle | 1.5185 | 0.3951 | 3.8435 | 0.0001 |
| DistanceFromHome | 0.0462 | 0.0124 | 3.7132 | 0.0002 |
| YearsSinceLastPromotion | 0.1497 | 0.0470 | 3.1855 | 0.0014 |
| YearsAtCompany | 0.1415 | 0.0455 | 3.1129 | 0.0019 |
| YearsInCurrentRole | -0.1573 | 0.0510 | -3.0820 | 0.0021 |
| YearsWithCurrManager | -0.1565 | 0.0517 | -3.0253 | 0.0025 |
| JobRoleLaboratory_Technician | 1.6566 | 0.5618 | 2.9487 | 0.0032 |
| BusinessTravelTravel_Rarely | 1.4402 | 0.4913 | 2.9313 | 0.0034 |
| RelationshipSatisfaction.L | -0.5658 | 0.2039 | -2.7752 | 0.0055 |
| EnvironmentSatisfaction.Q | 0.5388 | 0.2084 | 2.5848 | 0.0097 |
| TotalWorkingYears | -0.0845 | 0.0340 | -2.4849 | 0.0130 |
| RelationshipSatisfaction.Q | 0.5028 | 0.2113 | 2.3792 | 0.0173 |
| TrainingTimesLastYear | -0.1754 | 0.0823 | -2.1329 | 0.0329 |
| WorkLifeBalance.L | -0.7032 | 0.3316 | -2.1210 | 0.0339 |
| GenderMale | 0.4133 | 0.2097 | 1.9712 | 0.0487 |
| JobSatisfaction.C | -0.3457 | 0.2062 | -1.6765 | 0.0936 |
| JobRoleSales_Representative | 2.1994 | 1.3656 | 1.6106 | 0.1073 |
| JobInvolvement.C | -0.3187 | 0.2014 | -1.5822 | 0.1136 |
| WorkLifeBalance.Q | 0.4249 | 0.2726 | 1.5585 | 0.1191 |
| Age | -0.0237 | 0.0156 | -1.5218 | 0.1281 |
| MaritalStatusMarried | 0.4562 | 0.3051 | 1.4952 | 0.1349 |
| EnvironmentSatisfaction.C | -0.3136 | 0.2127 | -1.4745 | 0.1403 |
| DailyRate | -0.0004 | 0.0003 | -1.4518 | 0.1466 |
| JobRoleResearch_Scientist | 0.7229 | 0.5814 | 1.2434 | 0.2137 |
| Education.Q | -0.4674 | 0.4066 | -1.1495 | 0.2504 |
| JobRoleResearch_Director | -1.0285 | 1.0924 | -0.9415 | 0.3464 |
| JobRoleSales_Executive | 1.1745 | 1.3012 | 0.9027 | 0.3667 |
| WorkLifeBalance.C | 0.1574 | 0.1970 | 0.7988 | 0.4244 |
| EducationFieldTechnical_Degree | 0.6096 | 0.9526 | 0.6399 | 0.5222 |
| RelationshipSatisfaction.C | -0.1332 | 0.2207 | -0.6034 | 0.5463 |
| Education.L | -0.2863 | 0.4763 | -0.6010 | 0.5478 |
| MonthlyIncome | 0.0001 | 0.0001 | 0.5591 | 0.5761 |
| EducationFieldLife_Sciences | -0.4911 | 0.9363 | -0.5245 | 0.5999 |
| EducationFieldMedical | -0.4763 | 0.9395 | -0.5069 | 0.6122 |
| JobRoleManufacturing_Director | 0.3051 | 0.6043 | 0.5048 | 0.6137 |
| EducationFieldOther | -0.4826 | 1.0002 | -0.4825 | 0.6295 |
| PercentSalaryHike | -0.0178 | 0.0442 | -0.4033 | 0.6867 |
| JobRoleManager | -0.3839 | 1.0754 | -0.3570 | 0.7211 |
| Education.C | -0.1067 | 0.3032 | -0.3520 | 0.7248 |
| JobInvolvement.Q | -0.0977 | 0.3095 | -0.3157 | 0.7522 |
| Education^4 | -0.0570 | 0.2125 | -0.2680 | 0.7887 |
| MonthlyRate | 0.0000 | 0.0000 | 0.1891 | 0.8500 |
| HourlyRate | 0.0007 | 0.0050 | 0.1464 | 0.8836 |
| JobLevel | -0.0460 | 0.3563 | -0.1292 | 0.8972 |
| JobSatisfaction.Q | 0.0261 | 0.2082 | 0.1253 | 0.9003 |
| PerformanceRating.L | -0.0361 | 0.3228 | -0.1119 | 0.9109 |
| (Intercept) | -17.6456 | 664.6184 | -0.0265 | 0.9788 |
| EducationFieldMarketing | 0.0244 | 0.9878 | 0.0247 | 0.9803 |
| JobRoleHuman_Resources | 15.0289 | 664.6169 | 0.0226 | 0.9820 |
| DepartmentResearch_Development | 13.5399 | 664.6166 | 0.0204 | 0.9837 |
| DepartmentSales | 13.3934 | 664.6167 | 0.0202 | 0.9839 |
| StockOptionLevel | 0.0000 | 0.1737 | 0.0002 | 0.9998 |
tibble(
Model = c("M1: MonthlyIncome",
"M2: MonthlyIncome + OverTime",
"M3: All Variables"),
AIC = round(c(AIC(m1), AIC(m2), AIC(m3)), 2),
Deviance = round(c(deviance(m1), deviance(m2), deviance(m3)), 2),
Null_Deviance = round(c(m1$null.deviance,
m2$null.deviance,
m3$null.deviance), 2),
df_residual = c(df.residual(m1), df.residual(m2), df.residual(m3))
) %>%
kable(caption = "Model Fit Statistics — Lower AIC Indicates Better Fit") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | AIC | Deviance | Null_Deviance | df_residual |
|---|---|---|---|---|
| M1: MonthlyIncome | 1008.49 | 1004.49 | 1036.53 | 1173 |
| M2: MonthlyIncome + OverTime | 940.60 | 934.60 | 1036.53 | 1172 |
| M3: All Variables | 791.87 | 675.87 | 1036.53 | 1117 |
Interpretation: AIC penalises model complexity. A lower AIC in Model 3 confirms that additional predictors improve fit beyond their added cost.
Each observation is classified as “Yes” (attrition) when the predicted probability exceeds the default threshold of 0.50.
predict_class <- function(model, newdata, threshold = 0.50) {
probs <- predict(model, newdata = newdata, type = "response")
factor(if_else(probs > threshold, "Yes", "No"), levels = c("No", "Yes"))
}pred1 <- predict_class(m1, test)
cm1 <- confusionMatrix(pred1, test$Attrition, positive = "Yes")
print(cm1)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 247 48
## Yes 0 0
##
## Accuracy : 0.8373
## 95% CI : (0.7901, 0.8775)
## No Information Rate : 0.8373
## P-Value [Acc > NIR] : 0.5384
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.17e-11
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8373
## Prevalence : 0.1627
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Yes
##
pred2 <- predict_class(m2, test)
cm2 <- confusionMatrix(pred2, test$Attrition, positive = "Yes")
print(cm2)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 247 48
## Yes 0 0
##
## Accuracy : 0.8373
## 95% CI : (0.7901, 0.8775)
## No Information Rate : 0.8373
## P-Value [Acc > NIR] : 0.5384
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.17e-11
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8373
## Prevalence : 0.1627
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Yes
##
pred3 <- predict_class(m3, test)
cm3 <- confusionMatrix(pred3, test$Attrition, positive = "Yes")
print(cm3)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 236 27
## Yes 11 21
##
## Accuracy : 0.8712
## 95% CI : (0.8275, 0.9072)
## No Information Rate : 0.8373
## P-Value [Acc > NIR] : 0.06380
##
## Kappa : 0.4539
##
## Mcnemar's Test P-Value : 0.01496
##
## Sensitivity : 0.43750
## Specificity : 0.95547
## Pos Pred Value : 0.65625
## Neg Pred Value : 0.89734
## Prevalence : 0.16271
## Detection Rate : 0.07119
## Detection Prevalence : 0.10847
## Balanced Accuracy : 0.69648
##
## 'Positive' Class : Yes
##
extract_metrics <- function(cm, label) {
tibble(
Model = label,
Accuracy = round(cm$overall["Accuracy"], 4),
Sensitivity = round(cm$byClass["Sensitivity"], 4),
Specificity = round(cm$byClass["Specificity"], 4),
Precision = round(cm$byClass["Precision"], 4),
F1_Score = round(cm$byClass["F1"], 4),
Kappa = round(cm$overall["Kappa"], 4)
)
}
perf <- bind_rows(
extract_metrics(cm1, "M1: MonthlyIncome"),
extract_metrics(cm2, "M2: MonthlyIncome + OverTime"),
extract_metrics(cm3, "M3: All Variables")
)
best_row <- which.max(perf$Accuracy)
perf %>%
kable(caption = "Test-Set Performance Metrics (threshold = 0.50)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
column_spec(2, bold = TRUE) %>%
row_spec(best_row, background = "#d4edda")| Model | Accuracy | Sensitivity | Specificity | Precision | F1_Score | Kappa |
|---|---|---|---|---|---|---|
| M1: MonthlyIncome | 0.8373 | 0.0000 | 1.0000 | NA | NA | 0.0000 |
| M2: MonthlyIncome + OverTime | 0.8373 | 0.0000 | 1.0000 | NA | NA | 0.0000 |
| M3: All Variables | 0.8712 | 0.4375 | 0.9555 | 0.6562 | 0.525 | 0.4539 |
prob_df <- test %>%
transmute(
Attrition,
`M1: MonthlyIncome` = predict(m1, newdata = test, type = "response"),
`M2: MonthlyIncome + OverTime` = predict(m2, newdata = test, type = "response"),
`M3: All Variables` = predict(m3, newdata = test, type = "response")
) %>%
pivot_longer(-Attrition, names_to = "Model", values_to = "Probability") %>%
mutate(Model = factor(Model, levels = c(
"M1: MonthlyIncome",
"M2: MonthlyIncome + OverTime",
"M3: All Variables"
)))
ggplot(prob_df, aes(x = Probability, fill = Attrition)) +
geom_histogram(bins = 40, alpha = 0.75, position = "identity") +
geom_vline(xintercept = 0.5, linetype = "dashed",
colour = "grey30", linewidth = 0.8) +
facet_wrap(~ Model, ncol = 1) +
scale_fill_manual(values = c("No" = "#2196F3", "Yes" = "#F44336")) +
labs(
title = "Predicted Probability Distributions by Model",
subtitle = "Dashed line = 0.50 classification threshold",
x = "Predicted Probability of Attrition",
y = "Count",
fill = "Actual Attrition"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "bottom",
strip.text = element_text(face = "bold"))tidy(m3, conf.int = TRUE) %>%
filter(term != "(Intercept)") %>%
mutate(Significant = if_else(p.value < 0.05, "p < 0.05", "p >= 0.05")) %>%
arrange(desc(abs(estimate))) %>%
slice_head(n = 20) %>%
ggplot(aes(x = estimate,
y = reorder(term, estimate),
colour = Significant)) +
geom_point(size = 3) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.25) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey50") +
scale_colour_manual(values = c("p < 0.05" = "#E53935",
"p >= 0.05" = "#90A4AE")) +
labs(
title = "Model 3 — Top 20 Coefficients (Log-Odds Scale)",
subtitle = "Error bars = 95 % confidence intervals",
x = "Coefficient Estimate",
y = NULL,
colour = NULL
) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")Model 1 (MonthlyIncome only) is a minimal baseline. Lower income is weakly associated with higher attrition risk but explains very little variance overall.
Model 2 adds OverTime and meaningfully improves sensitivity — employees who work overtime carry a substantially higher log-odds of leaving.
Model 3 (all variables) achieves the lowest AIC
and highest overall accuracy. Key significant predictors typically
include OverTime, JobRole,
MaritalStatus, and YearsAtCompany.
Class imbalance (~16 % attrition) suppresses sensitivity across all models at the 0.50 threshold. For production use, consider lowering the threshold, applying SMOTE, or incorporating class weights.
## R version 4.5.3 (2026-03-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Asia/Ulaanbaatar
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51 broom_1.0.12 caret_7.0-1
## [5] lattice_0.22-9 rsample_1.3.2 modeldata_1.5.1 lubridate_1.9.5
## [9] forcats_1.0.1 stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1
## [13] readr_2.2.0 tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.2
## [17] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.3 timeDate_4052.112
## [4] farver_2.1.2 S7_0.2.1 fastmap_1.2.0
## [7] pROC_1.19.0.1 digest_0.6.39 rpart_4.1.24
## [10] timechange_0.4.0 lifecycle_1.0.5 survival_3.8-6
## [13] magrittr_2.0.4 compiler_4.5.3 rlang_1.1.7
## [16] sass_0.4.10 tools_4.5.3 yaml_2.3.12
## [19] data.table_1.18.2.1 labeling_0.4.3 xml2_1.5.2
## [22] plyr_1.8.9 RColorBrewer_1.1-3 withr_3.0.2
## [25] nnet_7.3-20 grid_4.5.3 stats4_4.5.3
## [28] e1071_1.7-17 future_1.70.0 globals_0.19.1
## [31] scales_1.4.0 iterators_1.0.14 MASS_7.3-65
## [34] cli_3.6.5 rmarkdown_2.30 generics_0.1.4
## [37] rstudioapi_0.18.0 future.apply_1.20.2 reshape2_1.4.5
## [40] tzdb_0.5.0 proxy_0.4-29 cachem_1.1.0
## [43] splines_4.5.3 parallel_4.5.3 vctrs_0.7.1
## [46] hardhat_1.4.2 Matrix_1.7-4 jsonlite_2.0.0
## [49] hms_1.1.4 listenv_0.10.1 systemfonts_1.3.2
## [52] foreach_1.5.2 gower_1.0.2 jquerylib_0.1.4
## [55] recipes_1.3.1 glue_1.8.0 parallelly_1.46.1
## [58] codetools_0.2-20 stringi_1.8.7 gtable_0.3.6
## [61] furrr_0.3.1 pillar_1.11.1 htmltools_0.5.9
## [64] ipred_0.9-15 lava_1.8.2 R6_2.6.1
## [67] textshaping_1.0.5 evaluate_1.0.5 backports_1.5.0
## [70] bslib_0.10.0 class_7.3-23 Rcpp_1.1.1
## [73] svglite_2.2.2 nlme_3.1-168 prodlim_2026.03.11
## [76] xfun_0.56 pkgconfig_2.0.3 ModelMetrics_1.2.2.2