Employee attrition — the voluntary departure of employees — is a costly challenge for organizations. Replacing a single employee can cost between 50% and 200% of their annual salary when accounting for recruitment, onboarding, and lost productivity. Predicting which employees are likely to leave allows HR departments to intervene early.
This report fits three logistic regression models of increasing complexity to predict attrition and compares their predictive performance using confusion matrices and ROC curves on a held-out test set.
Dataset: Employee
Attrition Dataset — Kaggle
Reference: Boehmke, B. — Hands-On
Machine Learning with R, Ch. 5
# Install any missing packages automatically
pkgs <- c("tidyverse", "caret", "knitr", "kableExtra",
"gridExtra", "pROC", "broom")
invisible(lapply(pkgs, function(p) {
if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
}))
library(tidyverse)
library(caret)
library(knitr)
library(kableExtra)
library(gridExtra)
library(pROC)
library(broom)
theme_set(theme_minimal(base_size = 12))df <- read.csv("test.csv", stringsAsFactors = TRUE)
# Standardise column names
names(df) <- make.names(names(df))
# Recode target: "Left" = positive class
df <- df %>%
rename(attrition = Attrition) %>%
mutate(attrition = factor(
ifelse(attrition == "Left", "Left", "Stayed"),
levels = c("Stayed", "Left")
))
cat("Dimensions:", nrow(df), "rows x", ncol(df), "columns\n")## Dimensions: 14900 rows x 24 columns
## Rows: 14,900
## Columns: 24
## $ Employee.ID <int> 52685, 30585, 54656, 33442, 15667, 3496, 4677…
## $ Age <int> 36, 35, 50, 58, 39, 45, 22, 34, 48, 55, 32, 2…
## $ Gender <fct> Male, Male, Male, Male, Male, Female, Female,…
## $ Years.at.Company <int> 13, 7, 7, 44, 24, 30, 5, 15, 40, 16, 12, 15, …
## $ Job.Role <fct> Healthcare, Education, Education, Media, Educ…
## $ Monthly.Income <int> 8029, 4563, 5583, 5525, 4604, 8104, 8700, 110…
## $ Work.Life.Balance <fct> Excellent, Good, Fair, Fair, Good, Fair, Good…
## $ Job.Satisfaction <fct> High, High, High, Very High, High, High, High…
## $ Performance.Rating <fct> Average, Average, Average, High, Average, Ave…
## $ Number.of.Promotions <int> 1, 1, 3, 0, 0, 0, 0, 1, 0, 0, 0, 2, 3, 0, 1, …
## $ Overtime <fct> Yes, Yes, Yes, Yes, Yes, No, No, No, No, No, …
## $ Distance.from.Home <int> 83, 55, 14, 43, 47, 38, 2, 9, 65, 31, 28, 35,…
## $ Education.Level <fct> Master’s Degree, Associate Degree, Associate …
## $ Marital.Status <fct> Married, Single, Divorced, Single, Married, D…
## $ Number.of.Dependents <int> 1, 4, 2, 4, 6, 0, 0, 4, 1, 1, 1, 1, 3, 0, 0, …
## $ Job.Level <fct> Mid, Entry, Senior, Entry, Mid, Senior, Mid, …
## $ Company.Size <fct> Large, Medium, Medium, Medium, Large, Large, …
## $ Company.Tenure <int> 22, 27, 76, 96, 45, 75, 48, 16, 52, 46, 57, 9…
## $ Remote.Work <fct> No, No, No, No, Yes, No, No, No, No, No, No, …
## $ Leadership.Opportunities <fct> No, No, No, No, No, No, No, No, No, No, No, N…
## $ Innovation.Opportunities <fct> No, No, Yes, No, No, No, No, No, No, No, No, …
## $ Company.Reputation <fct> Poor, Good, Good, Poor, Good, Good, Poor, Goo…
## $ Employee.Recognition <fct> Medium, High, Low, Low, High, Low, High, Low,…
## $ attrition <fct> Stayed, Left, Stayed, Left, Stayed, Stayed, S…
Class imbalance can bias a model toward the majority class. We check the split before modelling.
df %>%
count(attrition) %>%
mutate(Percentage = scales::percent(n / sum(n), accuracy = 0.1)) %>%
rename(Class = attrition, Count = n) %>%
kable(caption = "Table 1 - Attrition Class Distribution") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Class | Count | Percentage |
|---|---|---|
| Stayed | 7868 | 52.8% |
| Left | 7032 | 47.2% |
ggplot(df, aes(x = attrition, fill = attrition)) +
geom_bar(width = 0.5, show.legend = FALSE) +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
scale_fill_manual(values = c("Stayed" = "#74b9ff", "Left" = "#d63031")) +
labs(title = "Figure 1 - Attrition Class Distribution",
x = "Attrition", y = "Count")Before modelling, we examine the two key predictors featured in Models 1 and 2.
ggplot(df, aes(x = attrition, y = Monthly.Income, fill = attrition)) +
geom_boxplot(alpha = 0.8, outlier.shape = 21, show.legend = FALSE) +
scale_fill_manual(values = c("Stayed" = "#74b9ff", "Left" = "#d63031")) +
scale_y_continuous(labels = scales::dollar_format()) +
labs(title = "Figure 2 - Monthly Income by Attrition Status",
x = "Attrition", y = "Monthly Income (USD)")Observation: Employees who left tend to earn lower monthly incomes, suggesting income is a meaningful predictor of attrition.
df %>%
count(Overtime, attrition) %>%
group_by(Overtime) %>%
mutate(pct = n / sum(n)) %>%
ggplot(aes(x = Overtime, y = pct, fill = attrition)) +
geom_col(position = "fill", width = 0.5) +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = c("Stayed" = "#74b9ff", "Left" = "#d63031")) +
labs(title = "Figure 3 - Attrition Rate by Overtime Status",
x = "Works Overtime", y = "Proportion", fill = "Attrition")Observation: Employees who work overtime leave at a substantially higher rate, confirming overtime as a strong predictor of attrition.
We perform a stratified 80 / 20 split, preserving the class ratio in both sets.
set.seed(42)
train_idx <- createDataPartition(df$attrition, p = 0.80, list = FALSE)
train_df <- df[ train_idx, ]
test_df <- df[-train_idx, ]
cat(sprintf("Training set : %d rows (%.0f%%)\nTest set : %d rows (%.0f%%)\n",
nrow(train_df), 100 * nrow(train_df) / nrow(df),
nrow(test_df), 100 * nrow(test_df) / nrow(df)))## Training set : 11921 rows (80%)
## Test set : 2979 rows (20%)
Logistic regression models the log-odds of an event as a linear combination of predictors:
\[\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\]
The estimated coefficients can be exponentiated to obtain odds ratios (OR), which express the multiplicative change in the odds of attrition for a one-unit increase in each predictor (OR > 1 increases risk, OR < 1 decreases risk).
attrition ~ MonthlyIncomeA simple baseline using income alone.
m1 <- glm(attrition ~ Monthly.Income,
data = train_df,
family = binomial(link = "logit"))
tidy(m1, exponentiate = TRUE, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(.x, 5))) %>%
rename(Term = term, `Odds Ratio` = estimate,
`p-value` = p.value, `CI 2.5%` = conf.low, `CI 97.5%` = conf.high) %>%
kable(caption = "Table 2 - Model 1 Coefficients (Odds Ratios)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Odds Ratio | std.error | statistic | p-value | CI 2.5% | CI 97.5% |
|---|---|---|---|---|---|---|
| (Intercept) | 0.90221 | 0.06456 | -1.59400 | 0.11094 | 0.79495 | 1.02389 |
| Monthly.Income | 1.00000 | 0.00001 | -0.15265 | 0.87867 | 0.99998 | 1.00002 |
cat("AIC:", round(AIC(m1), 2),
"| Null deviance:", round(m1$null.deviance, 2),
"| Residual deviance:", round(m1$deviance, 2))## AIC: 16492.43 | Null deviance: 16488.45 | Residual deviance: 16488.43
Interpretation: An odds ratio < 1 for
Monthly.Incomemeans higher earners are less likely to leave. For each additional dollar of monthly income, the odds of attrition are multiplied by the OR shown above.
attrition ~ MonthlyIncome + OvertimeAdding overtime status to capture work-life balance effects.
m2 <- glm(attrition ~ Monthly.Income + Overtime,
data = train_df,
family = binomial(link = "logit"))
tidy(m2, exponentiate = TRUE, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(.x, 5))) %>%
rename(Term = term, `Odds Ratio` = estimate,
`p-value` = p.value, `CI 2.5%` = conf.low, `CI 97.5%` = conf.high) %>%
kable(caption = "Table 3 - Model 2 Coefficients (Odds Ratios)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Odds Ratio | std.error | statistic | p-value | CI 2.5% | CI 97.5% |
|---|---|---|---|---|---|---|
| (Intercept) | 0.83022 | 0.06603 | -2.81793 | 0.00483 | 0.72939 | 0.94488 |
| Monthly.Income | 1.00000 | 0.00001 | -0.10719 | 0.91464 | 0.99998 | 1.00002 |
| OvertimeYes | 1.27606 | 0.03907 | 6.23936 | 0.00000 | 1.18201 | 1.37765 |
cat("AIC:", round(AIC(m2), 2),
"| Null deviance:", round(m2$null.deviance, 2),
"| Residual deviance:", round(m2$deviance, 2))## AIC: 16455.46 | Null deviance: 16488.45 | Residual deviance: 16449.46
Interpretation: The
OvertimeYesodds ratio quantifies how much more likely overtime workers are to leave, holding income constant.
attrition ~ . (All Variables)The full model uses every available feature.
m3 <- glm(attrition ~ .,
data = train_df,
family = binomial(link = "logit"))
tidy(m3, exponentiate = TRUE, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(.x, 5))) %>%
arrange(p.value) %>%
rename(Term = term, `Odds Ratio` = estimate,
`p-value` = p.value, `CI 2.5%` = conf.low, `CI 97.5%` = conf.high) %>%
kable(caption = "Table 4 - Model 3 Coefficients (sorted by p-value)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
scroll_box(height = "350px")| Term | Odds Ratio | std.error | statistic | p-value | CI 2.5% | CI 97.5% |
|---|---|---|---|---|---|---|
| GenderMale | 0.57555 | 0.04663 | -11.84789 | 0.00000 | 0.52520 | 0.63054 |
| Years.at.Company | 0.98285 | 0.00267 | -6.47919 | 0.00000 | 0.97772 | 0.98800 |
| Work.Life.BalanceFair | 3.77803 | 0.06997 | 18.99811 | 0.00000 | 3.29546 | 4.33546 |
| Work.Life.BalancePoor | 4.88953 | 0.08463 | 18.75239 | 0.00000 | 4.14473 | 5.77554 |
| Job.SatisfactionLow | 1.89724 | 0.07860 | 8.14792 | 0.00000 | 1.62682 | 2.21392 |
| Job.SatisfactionVery High | 1.76020 | 0.06109 | 9.25554 | 0.00000 | 1.56184 | 1.98449 |
| Performance.RatingBelow Average | 1.39517 | 0.06716 | 4.95855 | 0.00000 | 1.22322 | 1.59167 |
| Performance.RatingLow | 1.66272 | 0.10763 | 4.72402 | 0.00000 | 1.34714 | 2.05447 |
| Number.of.Promotions | 0.75766 | 0.02359 | -11.76291 | 0.00000 | 0.72334 | 0.79343 |
| OvertimeYes | 1.43469 | 0.04913 | 7.34737 | 0.00000 | 1.30311 | 1.57986 |
| Distance.from.Home | 1.00982 | 0.00082 | 11.98412 | 0.00000 | 1.00821 | 1.01144 |
| Education.LevelPhD | 0.16136 | 0.12913 | -14.12618 | 0.00000 | 0.12491 | 0.20727 |
| Marital.StatusMarried | 0.72264 | 0.06618 | -4.90866 | 0.00000 | 0.63475 | 0.82277 |
| Marital.StatusSingle | 4.77225 | 0.07225 | 21.63136 | 0.00000 | 4.14418 | 5.50106 |
| Number.of.Dependents | 0.87501 | 0.01497 | -8.91759 | 0.00000 | 0.84966 | 0.90103 |
| Job.LevelMid | 0.35051 | 0.05124 | -20.45979 | 0.00000 | 0.31693 | 0.38744 |
| Job.LevelSenior | 0.06742 | 0.07332 | -36.78211 | 0.00000 | 0.05833 | 0.07775 |
| Remote.WorkYes | 0.17508 | 0.06536 | -26.65993 | 0.00000 | 0.15390 | 0.19885 |
| Company.ReputationFair | 1.70160 | 0.09112 | 5.83358 | 0.00000 | 1.42363 | 2.03491 |
| Company.ReputationPoor | 1.99948 | 0.09038 | 7.66653 | 0.00000 | 1.67537 | 2.38776 |
| Work.Life.BalanceGood | 1.28968 | 0.06632 | 3.83583 | 0.00013 | 1.13271 | 1.46904 |
| Job.RoleMedia | 0.78744 | 0.08074 | -2.95972 | 0.00308 | 0.67212 | 0.92237 |
| Leadership.OpportunitiesYes | 0.75461 | 0.10676 | -2.63735 | 0.00836 | 0.61182 | 0.92989 |
| Age | 0.99422 | 0.00224 | -2.59112 | 0.00957 | 0.98987 | 0.99859 |
| Company.SizeSmall | 1.17992 | 0.06692 | 2.47232 | 0.01342 | 1.03494 | 1.34539 |
| Innovation.OpportunitiesYes | 0.87174 | 0.06306 | -2.17665 | 0.02951 | 0.77030 | 0.98635 |
| Job.RoleFinance | 0.82556 | 0.10912 | -1.75677 | 0.07896 | 0.66652 | 1.02235 |
| Education.LevelBachelor’s Degree | 0.89902 | 0.06191 | -1.71951 | 0.08552 | 0.79626 | 1.01497 |
| Education.LevelHigh School | 0.89281 | 0.06915 | -1.63980 | 0.10105 | 0.77961 | 1.02235 |
| Job.RoleTechnology | 0.83896 | 0.11030 | -1.59194 | 0.11140 | 0.67579 | 1.04138 |
| (Intercept) | 1.25869 | 0.19593 | 1.17426 | 0.24029 | 0.85730 | 1.84809 |
| Education.LevelMaster’s Degree | 0.92401 | 0.06874 | -1.14975 | 0.25025 | 0.80750 | 1.05726 |
| Monthly.Income | 1.00002 | 0.00002 | 1.13185 | 0.25770 | 0.99998 | 1.00006 |
| Job.SatisfactionMedium | 1.07190 | 0.06177 | 1.12396 | 0.26103 | 0.94963 | 1.20985 |
| Job.RoleHealthcare | 0.91110 | 0.09570 | -0.97288 | 0.33061 | 0.75523 | 1.09905 |
| Performance.RatingHigh | 1.05429 | 0.05877 | 0.89955 | 0.36836 | 0.93956 | 1.18300 |
| Employee.RecognitionLow | 1.04555 | 0.05903 | 0.75455 | 0.45052 | 0.93133 | 1.17382 |
| Company.SizeMedium | 0.95574 | 0.06114 | -0.74045 | 0.45903 | 0.84781 | 1.07743 |
| Employee.RecognitionMedium | 0.96023 | 0.06197 | -0.65490 | 0.51253 | 0.85038 | 1.08424 |
| Company.Tenure | 1.00037 | 0.00102 | 0.36026 | 0.71865 | 0.99838 | 1.00236 |
| Employee.ID | 1.00000 | 0.00000 | 0.33870 | 0.73483 | 1.00000 | 1.00000 |
| Company.ReputationGood | 0.98055 | 0.08123 | -0.24187 | 0.80888 | 0.83632 | 1.14994 |
| Employee.RecognitionVery High | 1.00168 | 0.11488 | 0.01464 | 0.98832 | 0.79962 | 1.25461 |
cat("AIC:", round(AIC(m3), 2),
"| Null deviance:", round(m3$null.deviance, 2),
"| Residual deviance:", round(m3$deviance, 2))## AIC: 11534.85 | Null deviance: 16488.45 | Residual deviance: 11448.85
A lower AIC indicates a better balance between goodness-of-fit and model complexity.
aic_tbl <- tibble(
Model = c("Model 1: MonthlyIncome",
"Model 2: MonthlyIncome + Overtime",
"Model 3: All Variables"),
Parameters = c(length(coef(m1)), length(coef(m2)), length(coef(m3))),
AIC = round(c(AIC(m1), AIC(m2), AIC(m3)), 2),
`Residual Deviance` = round(c(m1$deviance, m2$deviance, m3$deviance), 2)
)
aic_tbl %>%
kable(caption = "Table 5 - AIC and Deviance Comparison") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which.min(aic_tbl$AIC), bold = TRUE,
color = "white", background = "#2c7fb8")| Model | Parameters | AIC | Residual Deviance |
|---|---|---|---|
| Model 1: MonthlyIncome | 2 | 16492.43 | 16488.43 |
| Model 2: MonthlyIncome + Overtime | 3 | 16455.46 | 16449.46 |
| Model 3: All Variables | 43 | 11534.85 | 11448.85 |
Note: The highlighted row has the lowest AIC. Lower AIC = better model fit without unnecessary complexity.
Predictions are made at a 0.5 probability threshold.
| Metric | Formula | Meaning |
|---|---|---|
| Accuracy | (TP + TN) / N | Overall correct predictions |
| Sensitivity | TP / (TP + FN) | Correctly identified leavers |
| Specificity | TN / (TN + FP) | Correctly identified stayers |
| Precision | TP / (TP + FP) | Of predicted leavers, truly left |
| Kappa | — | Agreement beyond chance (0 = random) |
pred1 <- predict_class(m1, test_df)
cm1 <- confusionMatrix(pred1, test_df$attrition, positive = "Left")
print(cm1)## Confusion Matrix and Statistics
##
## Reference
## Prediction Stayed Left
## Stayed 1573 1406
## Left 0 0
##
## Accuracy : 0.528
## 95% CI : (0.5099, 0.5461)
## No Information Rate : 0.528
## P-Value [Acc > NIR] : 0.5075
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000
## Specificity : 1.000
## Pos Pred Value : NaN
## Neg Pred Value : 0.528
## Prevalence : 0.472
## Detection Rate : 0.000
## Detection Prevalence : 0.000
## Balanced Accuracy : 0.500
##
## 'Positive' Class : Left
##
pred2 <- predict_class(m2, test_df)
cm2 <- confusionMatrix(pred2, test_df$attrition, positive = "Left")
print(cm2)## Confusion Matrix and Statistics
##
## Reference
## Prediction Stayed Left
## Stayed 1099 905
## Left 474 501
##
## Accuracy : 0.5371
## 95% CI : (0.519, 0.5551)
## No Information Rate : 0.528
## P-Value [Acc > NIR] : 0.1654
##
## Kappa : 0.0559
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.3563
## Specificity : 0.6987
## Pos Pred Value : 0.5138
## Neg Pred Value : 0.5484
## Prevalence : 0.4720
## Detection Rate : 0.1682
## Detection Prevalence : 0.3273
## Balanced Accuracy : 0.5275
##
## 'Positive' Class : Left
##
pred3 <- predict_class(m3, test_df)
cm3 <- confusionMatrix(pred3, test_df$attrition, positive = "Left")
print(cm3)## Confusion Matrix and Statistics
##
## Reference
## Prediction Stayed Left
## Stayed 1237 385
## Left 336 1021
##
## Accuracy : 0.758
## 95% CI : (0.7422, 0.7733)
## No Information Rate : 0.528
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5135
##
## Mcnemar's Test P-Value : 0.07384
##
## Sensitivity : 0.7262
## Specificity : 0.7864
## Pos Pred Value : 0.7524
## Neg Pred Value : 0.7626
## Prevalence : 0.4720
## Detection Rate : 0.3427
## Detection Prevalence : 0.4555
## Balanced Accuracy : 0.7563
##
## 'Positive' Class : Left
##
plot_cm <- function(cm, title) {
tbl <- as.data.frame(cm$table)
names(tbl) <- c("Predicted", "Actual", "Freq")
tbl <- tbl %>%
group_by(Actual) %>%
mutate(pct = scales::percent(Freq / sum(Freq), accuracy = 1))
ggplot(tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(colour = "white", linewidth = 1) +
geom_text(aes(label = paste0(Freq, "\n(", pct, ")")),
size = 4.5, fontface = "bold") +
scale_fill_gradient(low = "#dfe6e9", high = "#2c7fb8") +
labs(title = title, x = "Actual", y = "Predicted") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(face = "bold", hjust = 0.5))
}
grid.arrange(
plot_cm(cm1, "Model 1\nMonthlyIncome"),
plot_cm(cm2, "Model 2\nMonthlyIncome + Overtime"),
plot_cm(cm3, "Model 3\nAll Variables"),
ncol = 3,
top = grid::textGrob("Figure 4 - Confusion Matrices (Test Set)",
gp = grid::gpar(fontsize = 13, fontface = "bold"))
)The ROC curve plots sensitivity against (1 - specificity) across all classification thresholds. The AUC summarises discriminative ability: 0.5 = random, 1.0 = perfect.
prob1 <- predict(m1, test_df, type = "response")
prob2 <- predict(m2, test_df, type = "response")
prob3 <- predict(m3, test_df, type = "response")
roc1 <- roc(test_df$attrition, prob1, levels = c("Stayed", "Left"), quiet = TRUE)
roc2 <- roc(test_df$attrition, prob2, levels = c("Stayed", "Left"), quiet = TRUE)
roc3 <- roc(test_df$attrition, prob3, levels = c("Stayed", "Left"), quiet = TRUE)
roc_df <- bind_rows(
data.frame(FPR = 1 - roc1$specificities, TPR = roc1$sensitivities,
Model = sprintf("Model 1 (AUC = %.3f)", auc(roc1))),
data.frame(FPR = 1 - roc2$specificities, TPR = roc2$sensitivities,
Model = sprintf("Model 2 (AUC = %.3f)", auc(roc2))),
data.frame(FPR = 1 - roc3$specificities, TPR = roc3$sensitivities,
Model = sprintf("Model 3 (AUC = %.3f)", auc(roc3)))
)
ggplot(roc_df, aes(x = FPR, y = TPR, colour = Model)) +
geom_line(linewidth = 1) +
geom_abline(linetype = "dashed", colour = "grey50") +
scale_colour_manual(values = c("#e17055", "#0984e3", "#00b894")) +
labs(title = "Figure 5 - ROC Curves (Test Set)",
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)",
colour = NULL) +
theme(legend.position = c(0.65, 0.2),
legend.background = element_rect(fill = "white", colour = "grey80"))Interpretation: A model with AUC closer to 1.0 better discriminates between employees who stay and those who leave, regardless of the chosen threshold.
extract_metrics <- function(cm, roc_obj, model_name) {
tibble(
Model = model_name,
Accuracy = cm$overall["Accuracy"],
Sensitivity = cm$byClass["Sensitivity"],
Specificity = cm$byClass["Specificity"],
Precision = cm$byClass["Pos Pred Value"],
Kappa = cm$overall["Kappa"],
AUC = as.numeric(auc(roc_obj))
)
}
comparison <- bind_rows(
extract_metrics(cm1, roc1, "Model 1: MonthlyIncome"),
extract_metrics(cm2, roc2, "Model 2: MonthlyIncome + Overtime"),
extract_metrics(cm3, roc3, "Model 3: All Variables")
)
comparison %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
kable(caption = "Table 6 - Full Performance Comparison on Test Set") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE) %>%
row_spec(which.max(comparison$AUC), bold = TRUE,
color = "white", background = "#2c7fb8")| Model | Accuracy | Sensitivity | Specificity | Precision | Kappa | AUC |
|---|---|---|---|---|---|---|
| Model 1: MonthlyIncome | 0.5280 | 0.0000 | 1.0000 | NaN | 0.0000 | 0.5198 |
| Model 2: MonthlyIncome + Overtime | 0.5371 | 0.3563 | 0.6987 | 0.5138 | 0.0559 | 0.5398 |
| Model 3: All Variables | 0.7580 | 0.7262 | 0.7864 | 0.7524 | 0.5135 | 0.8501 |
Model complexity matters. The full model (Model 3) achieves the highest AUC and accuracy, confirming that using all available employee features substantially improves attrition prediction beyond income alone.
Overtime is a key driver. Adding overtime status in Model 2 meaningfully improves on the income-only baseline, evidenced by the lower AIC and improved sensitivity. Employees working overtime are significantly more likely to leave, even after controlling for income.
Sensitivity vs. specificity trade-off. All three models are better at identifying stayers (high specificity) than leavers (lower sensitivity). This is common in imbalanced datasets and suggests that the 0.5 threshold could be lowered to catch more true attrition cases, at the cost of more false alarms.
Practical recommendation. For deployment, Model 3 is preferred for predictive accuracy. A simpler model like Model 2 may be sufficient when interpretability and data availability are priorities — it uses only two variables yet captures much of the signal.
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Asia/Ulaanbaatar
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] broom_1.0.10 pROC_1.19.0.1 gridExtra_2.3 kableExtra_1.4.0
## [5] knitr_1.50 caret_7.0-1 lattice_0.22-7 lubridate_1.9.4
## [9] forcats_1.0.1 stringr_1.5.2 dplyr_1.2.0 purrr_1.1.0
## [13] readr_2.1.5 tidyr_1.3.1 tibble_3.3.1 ggplot2_4.0.2
## [17] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 timeDate_4051.111
## [4] farver_2.1.2 S7_0.2.0 fastmap_1.2.0
## [7] digest_0.6.37 rpart_4.1.24 timechange_0.3.0
## [10] lifecycle_1.0.5 survival_3.8-3 magrittr_2.0.3
## [13] compiler_4.5.1 rlang_1.1.7 sass_0.4.10
## [16] tools_4.5.1 yaml_2.3.10 data.table_1.17.8
## [19] labeling_0.4.3 plyr_1.8.9 xml2_1.4.0
## [22] RColorBrewer_1.1-3 withr_3.0.2 nnet_7.3-20
## [25] grid_4.5.1 stats4_4.5.1 e1071_1.7-17
## [28] future_1.67.0 globals_0.18.0 scales_1.4.0
## [31] iterators_1.0.14 MASS_7.3-65 cli_3.6.5
## [34] rmarkdown_2.29 generics_0.1.4 rstudioapi_0.17.1
## [37] future.apply_1.20.0 tzdb_0.5.0 reshape2_1.4.5
## [40] proxy_0.4-29 cachem_1.1.0 splines_4.5.1
## [43] parallel_4.5.1 vctrs_0.7.1 hardhat_1.4.2
## [46] Matrix_1.7-3 jsonlite_2.0.0 hms_1.1.3
## [49] listenv_0.9.1 systemfonts_1.3.2 foreach_1.5.2
## [52] gower_1.0.2 jquerylib_0.1.4 recipes_1.3.1
## [55] glue_1.8.0 parallelly_1.45.1 codetools_0.2-20
## [58] stringi_1.8.7 gtable_0.3.6 pillar_1.11.0
## [61] htmltools_0.5.8.1 ipred_0.9-15 lava_1.8.2
## [64] R6_2.6.1 textshaping_1.0.3 evaluate_1.0.5
## [67] backports_1.5.0 bslib_0.9.0 class_7.3-23
## [70] Rcpp_1.1.0 svglite_2.2.2 nlme_3.1-168
## [73] prodlim_2025.04.28 xfun_0.53 ModelMetrics_1.2.2.2
## [76] pkgconfig_2.0.3