1 Introduction

Employee attrition — the voluntary departure of employees — is a costly challenge for organizations. Replacing a single employee can cost between 50% and 200% of their annual salary when accounting for recruitment, onboarding, and lost productivity. Predicting which employees are likely to leave allows HR departments to intervene early.

This report fits three logistic regression models of increasing complexity to predict attrition and compares their predictive performance using confusion matrices and ROC curves on a held-out test set.

Dataset: Employee Attrition Dataset — Kaggle
Reference: Boehmke, B. — Hands-On Machine Learning with R, Ch. 5


2 Setup

# Install any missing packages automatically
pkgs <- c("tidyverse", "caret", "knitr", "kableExtra",
          "gridExtra", "pROC", "broom")
invisible(lapply(pkgs, function(p) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
}))

library(tidyverse)
library(caret)
library(knitr)
library(kableExtra)
library(gridExtra)
library(pROC)
library(broom)

theme_set(theme_minimal(base_size = 12))

3 Data Preparation

3.1 Loading & Cleaning

df <- read.csv("test.csv", stringsAsFactors = TRUE)

# Standardise column names
names(df) <- make.names(names(df))

# Recode target: "Left" = positive class
df <- df %>%
  rename(attrition = Attrition) %>%
  mutate(attrition = factor(
    ifelse(attrition == "Left", "Left", "Stayed"),
    levels = c("Stayed", "Left")
  ))

cat("Dimensions:", nrow(df), "rows x", ncol(df), "columns\n")
## Dimensions: 14900 rows x 24 columns
glimpse(df)
## Rows: 14,900
## Columns: 24
## $ Employee.ID              <int> 52685, 30585, 54656, 33442, 15667, 3496, 4677…
## $ Age                      <int> 36, 35, 50, 58, 39, 45, 22, 34, 48, 55, 32, 2…
## $ Gender                   <fct> Male, Male, Male, Male, Male, Female, Female,…
## $ Years.at.Company         <int> 13, 7, 7, 44, 24, 30, 5, 15, 40, 16, 12, 15, …
## $ Job.Role                 <fct> Healthcare, Education, Education, Media, Educ…
## $ Monthly.Income           <int> 8029, 4563, 5583, 5525, 4604, 8104, 8700, 110…
## $ Work.Life.Balance        <fct> Excellent, Good, Fair, Fair, Good, Fair, Good…
## $ Job.Satisfaction         <fct> High, High, High, Very High, High, High, High…
## $ Performance.Rating       <fct> Average, Average, Average, High, Average, Ave…
## $ Number.of.Promotions     <int> 1, 1, 3, 0, 0, 0, 0, 1, 0, 0, 0, 2, 3, 0, 1, …
## $ Overtime                 <fct> Yes, Yes, Yes, Yes, Yes, No, No, No, No, No, …
## $ Distance.from.Home       <int> 83, 55, 14, 43, 47, 38, 2, 9, 65, 31, 28, 35,…
## $ Education.Level          <fct> Master’s Degree, Associate Degree, Associate …
## $ Marital.Status           <fct> Married, Single, Divorced, Single, Married, D…
## $ Number.of.Dependents     <int> 1, 4, 2, 4, 6, 0, 0, 4, 1, 1, 1, 1, 3, 0, 0, …
## $ Job.Level                <fct> Mid, Entry, Senior, Entry, Mid, Senior, Mid, …
## $ Company.Size             <fct> Large, Medium, Medium, Medium, Large, Large, …
## $ Company.Tenure           <int> 22, 27, 76, 96, 45, 75, 48, 16, 52, 46, 57, 9…
## $ Remote.Work              <fct> No, No, No, No, Yes, No, No, No, No, No, No, …
## $ Leadership.Opportunities <fct> No, No, No, No, No, No, No, No, No, No, No, N…
## $ Innovation.Opportunities <fct> No, No, Yes, No, No, No, No, No, No, No, No, …
## $ Company.Reputation       <fct> Poor, Good, Good, Poor, Good, Good, Poor, Goo…
## $ Employee.Recognition     <fct> Medium, High, Low, Low, High, Low, High, Low,…
## $ attrition                <fct> Stayed, Left, Stayed, Left, Stayed, Stayed, S…

3.2 Class Balance

Class imbalance can bias a model toward the majority class. We check the split before modelling.

df %>%
  count(attrition) %>%
  mutate(Percentage = scales::percent(n / sum(n), accuracy = 0.1)) %>%
  rename(Class = attrition, Count = n) %>%
  kable(caption = "Table 1 - Attrition Class Distribution") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 1 - Attrition Class Distribution
Class Count Percentage
Stayed 7868 52.8%
Left 7032 47.2%
ggplot(df, aes(x = attrition, fill = attrition)) +
  geom_bar(width = 0.5, show.legend = FALSE) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  scale_fill_manual(values = c("Stayed" = "#74b9ff", "Left" = "#d63031")) +
  labs(title = "Figure 1 - Attrition Class Distribution",
       x = "Attrition", y = "Count")


4 Exploratory Data Analysis

Before modelling, we examine the two key predictors featured in Models 1 and 2.

4.1 Monthly Income by Attrition

ggplot(df, aes(x = attrition, y = Monthly.Income, fill = attrition)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, show.legend = FALSE) +
  scale_fill_manual(values = c("Stayed" = "#74b9ff", "Left" = "#d63031")) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(title = "Figure 2 - Monthly Income by Attrition Status",
       x = "Attrition", y = "Monthly Income (USD)")

Observation: Employees who left tend to earn lower monthly incomes, suggesting income is a meaningful predictor of attrition.

4.2 Overtime by Attrition

df %>%
  count(Overtime, attrition) %>%
  group_by(Overtime) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = Overtime, y = pct, fill = attrition)) +
  geom_col(position = "fill", width = 0.5) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("Stayed" = "#74b9ff", "Left" = "#d63031")) +
  labs(title = "Figure 3 - Attrition Rate by Overtime Status",
       x = "Works Overtime", y = "Proportion", fill = "Attrition")

Observation: Employees who work overtime leave at a substantially higher rate, confirming overtime as a strong predictor of attrition.


5 Train / Test Split

We perform a stratified 80 / 20 split, preserving the class ratio in both sets.

set.seed(42)

train_idx <- createDataPartition(df$attrition, p = 0.80, list = FALSE)
train_df  <- df[ train_idx, ]
test_df   <- df[-train_idx, ]

cat(sprintf("Training set : %d rows  (%.0f%%)\nTest set     : %d rows  (%.0f%%)\n",
            nrow(train_df), 100 * nrow(train_df) / nrow(df),
            nrow(test_df),  100 * nrow(test_df)  / nrow(df)))
## Training set : 11921 rows  (80%)
## Test set     : 2979 rows  (20%)

6 Logistic Regression Models

Logistic regression models the log-odds of an event as a linear combination of predictors:

\[\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\]

The estimated coefficients can be exponentiated to obtain odds ratios (OR), which express the multiplicative change in the odds of attrition for a one-unit increase in each predictor (OR > 1 increases risk, OR < 1 decreases risk).

6.1 Model 1 — attrition ~ MonthlyIncome

A simple baseline using income alone.

m1 <- glm(attrition ~ Monthly.Income,
          data   = train_df,
          family = binomial(link = "logit"))

tidy(m1, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(.x, 5))) %>%
  rename(Term = term, `Odds Ratio` = estimate,
         `p-value` = p.value, `CI 2.5%` = conf.low, `CI 97.5%` = conf.high) %>%
  kable(caption = "Table 2 - Model 1 Coefficients (Odds Ratios)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2 - Model 1 Coefficients (Odds Ratios)
Term Odds Ratio std.error statistic p-value CI 2.5% CI 97.5%
(Intercept) 0.90221 0.06456 -1.59400 0.11094 0.79495 1.02389
Monthly.Income 1.00000 0.00001 -0.15265 0.87867 0.99998 1.00002
cat("AIC:", round(AIC(m1), 2),
    "| Null deviance:", round(m1$null.deviance, 2),
    "| Residual deviance:", round(m1$deviance, 2))
## AIC: 16492.43 | Null deviance: 16488.45 | Residual deviance: 16488.43

Interpretation: An odds ratio < 1 for Monthly.Income means higher earners are less likely to leave. For each additional dollar of monthly income, the odds of attrition are multiplied by the OR shown above.

6.2 Model 2 — attrition ~ MonthlyIncome + Overtime

Adding overtime status to capture work-life balance effects.

m2 <- glm(attrition ~ Monthly.Income + Overtime,
          data   = train_df,
          family = binomial(link = "logit"))

tidy(m2, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(.x, 5))) %>%
  rename(Term = term, `Odds Ratio` = estimate,
         `p-value` = p.value, `CI 2.5%` = conf.low, `CI 97.5%` = conf.high) %>%
  kable(caption = "Table 3 - Model 2 Coefficients (Odds Ratios)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 3 - Model 2 Coefficients (Odds Ratios)
Term Odds Ratio std.error statistic p-value CI 2.5% CI 97.5%
(Intercept) 0.83022 0.06603 -2.81793 0.00483 0.72939 0.94488
Monthly.Income 1.00000 0.00001 -0.10719 0.91464 0.99998 1.00002
OvertimeYes 1.27606 0.03907 6.23936 0.00000 1.18201 1.37765
cat("AIC:", round(AIC(m2), 2),
    "| Null deviance:", round(m2$null.deviance, 2),
    "| Residual deviance:", round(m2$deviance, 2))
## AIC: 16455.46 | Null deviance: 16488.45 | Residual deviance: 16449.46

Interpretation: The OvertimeYes odds ratio quantifies how much more likely overtime workers are to leave, holding income constant.

6.3 Model 3 — attrition ~ . (All Variables)

The full model uses every available feature.

m3 <- glm(attrition ~ .,
          data   = train_df,
          family = binomial(link = "logit"))

tidy(m3, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(.x, 5))) %>%
  arrange(p.value) %>%
  rename(Term = term, `Odds Ratio` = estimate,
         `p-value` = p.value, `CI 2.5%` = conf.low, `CI 97.5%` = conf.high) %>%
  kable(caption = "Table 4 - Model 3 Coefficients (sorted by p-value)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  scroll_box(height = "350px")
Table 4 - Model 3 Coefficients (sorted by p-value)
Term Odds Ratio std.error statistic p-value CI 2.5% CI 97.5%
GenderMale 0.57555 0.04663 -11.84789 0.00000 0.52520 0.63054
Years.at.Company 0.98285 0.00267 -6.47919 0.00000 0.97772 0.98800
Work.Life.BalanceFair 3.77803 0.06997 18.99811 0.00000 3.29546 4.33546
Work.Life.BalancePoor 4.88953 0.08463 18.75239 0.00000 4.14473 5.77554
Job.SatisfactionLow 1.89724 0.07860 8.14792 0.00000 1.62682 2.21392
Job.SatisfactionVery High 1.76020 0.06109 9.25554 0.00000 1.56184 1.98449
Performance.RatingBelow Average 1.39517 0.06716 4.95855 0.00000 1.22322 1.59167
Performance.RatingLow 1.66272 0.10763 4.72402 0.00000 1.34714 2.05447
Number.of.Promotions 0.75766 0.02359 -11.76291 0.00000 0.72334 0.79343
OvertimeYes 1.43469 0.04913 7.34737 0.00000 1.30311 1.57986
Distance.from.Home 1.00982 0.00082 11.98412 0.00000 1.00821 1.01144
Education.LevelPhD 0.16136 0.12913 -14.12618 0.00000 0.12491 0.20727
Marital.StatusMarried 0.72264 0.06618 -4.90866 0.00000 0.63475 0.82277
Marital.StatusSingle 4.77225 0.07225 21.63136 0.00000 4.14418 5.50106
Number.of.Dependents 0.87501 0.01497 -8.91759 0.00000 0.84966 0.90103
Job.LevelMid 0.35051 0.05124 -20.45979 0.00000 0.31693 0.38744
Job.LevelSenior 0.06742 0.07332 -36.78211 0.00000 0.05833 0.07775
Remote.WorkYes 0.17508 0.06536 -26.65993 0.00000 0.15390 0.19885
Company.ReputationFair 1.70160 0.09112 5.83358 0.00000 1.42363 2.03491
Company.ReputationPoor 1.99948 0.09038 7.66653 0.00000 1.67537 2.38776
Work.Life.BalanceGood 1.28968 0.06632 3.83583 0.00013 1.13271 1.46904
Job.RoleMedia 0.78744 0.08074 -2.95972 0.00308 0.67212 0.92237
Leadership.OpportunitiesYes 0.75461 0.10676 -2.63735 0.00836 0.61182 0.92989
Age 0.99422 0.00224 -2.59112 0.00957 0.98987 0.99859
Company.SizeSmall 1.17992 0.06692 2.47232 0.01342 1.03494 1.34539
Innovation.OpportunitiesYes 0.87174 0.06306 -2.17665 0.02951 0.77030 0.98635
Job.RoleFinance 0.82556 0.10912 -1.75677 0.07896 0.66652 1.02235
Education.LevelBachelor’s Degree 0.89902 0.06191 -1.71951 0.08552 0.79626 1.01497
Education.LevelHigh School 0.89281 0.06915 -1.63980 0.10105 0.77961 1.02235
Job.RoleTechnology 0.83896 0.11030 -1.59194 0.11140 0.67579 1.04138
(Intercept) 1.25869 0.19593 1.17426 0.24029 0.85730 1.84809
Education.LevelMaster’s Degree 0.92401 0.06874 -1.14975 0.25025 0.80750 1.05726
Monthly.Income 1.00002 0.00002 1.13185 0.25770 0.99998 1.00006
Job.SatisfactionMedium 1.07190 0.06177 1.12396 0.26103 0.94963 1.20985
Job.RoleHealthcare 0.91110 0.09570 -0.97288 0.33061 0.75523 1.09905
Performance.RatingHigh 1.05429 0.05877 0.89955 0.36836 0.93956 1.18300
Employee.RecognitionLow 1.04555 0.05903 0.75455 0.45052 0.93133 1.17382
Company.SizeMedium 0.95574 0.06114 -0.74045 0.45903 0.84781 1.07743
Employee.RecognitionMedium 0.96023 0.06197 -0.65490 0.51253 0.85038 1.08424
Company.Tenure 1.00037 0.00102 0.36026 0.71865 0.99838 1.00236
Employee.ID 1.00000 0.00000 0.33870 0.73483 1.00000 1.00000
Company.ReputationGood 0.98055 0.08123 -0.24187 0.80888 0.83632 1.14994
Employee.RecognitionVery High 1.00168 0.11488 0.01464 0.98832 0.79962 1.25461
cat("AIC:", round(AIC(m3), 2),
    "| Null deviance:", round(m3$null.deviance, 2),
    "| Residual deviance:", round(m3$deviance, 2))
## AIC: 11534.85 | Null deviance: 16488.45 | Residual deviance: 11448.85

6.4 AIC Comparison

A lower AIC indicates a better balance between goodness-of-fit and model complexity.

aic_tbl <- tibble(
  Model = c("Model 1: MonthlyIncome",
            "Model 2: MonthlyIncome + Overtime",
            "Model 3: All Variables"),
  Parameters = c(length(coef(m1)), length(coef(m2)), length(coef(m3))),
  AIC = round(c(AIC(m1), AIC(m2), AIC(m3)), 2),
  `Residual Deviance` = round(c(m1$deviance, m2$deviance, m3$deviance), 2)
)

aic_tbl %>%
  kable(caption = "Table 5 - AIC and Deviance Comparison") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which.min(aic_tbl$AIC), bold = TRUE,
           color = "white", background = "#2c7fb8")
Table 5 - AIC and Deviance Comparison
Model Parameters AIC Residual Deviance
Model 1: MonthlyIncome 2 16492.43 16488.43
Model 2: MonthlyIncome + Overtime 3 16455.46 16449.46
Model 3: All Variables 43 11534.85 11448.85

Note: The highlighted row has the lowest AIC. Lower AIC = better model fit without unnecessary complexity.


7 Confusion Matrices on Test Data

7.1 Metric Reference

Predictions are made at a 0.5 probability threshold.

Metric Formula Meaning
Accuracy (TP + TN) / N Overall correct predictions
Sensitivity TP / (TP + FN) Correctly identified leavers
Specificity TN / (TN + FP) Correctly identified stayers
Precision TP / (TP + FP) Of predicted leavers, truly left
Kappa Agreement beyond chance (0 = random)
predict_class <- function(model, newdata, threshold = 0.5) {
  probs  <- predict(model, newdata = newdata, type = "response")
  labels <- ifelse(probs >= threshold, "Left", "Stayed")
  factor(labels, levels = c("Stayed", "Left"))
}

7.2 Model 1

pred1 <- predict_class(m1, test_df)
cm1   <- confusionMatrix(pred1, test_df$attrition, positive = "Left")
print(cm1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Stayed Left
##     Stayed   1573 1406
##     Left        0    0
##                                           
##                Accuracy : 0.528           
##                  95% CI : (0.5099, 0.5461)
##     No Information Rate : 0.528           
##     P-Value [Acc > NIR] : 0.5075          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.000           
##             Specificity : 1.000           
##          Pos Pred Value :   NaN           
##          Neg Pred Value : 0.528           
##              Prevalence : 0.472           
##          Detection Rate : 0.000           
##    Detection Prevalence : 0.000           
##       Balanced Accuracy : 0.500           
##                                           
##        'Positive' Class : Left            
## 

7.3 Model 2

pred2 <- predict_class(m2, test_df)
cm2   <- confusionMatrix(pred2, test_df$attrition, positive = "Left")
print(cm2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Stayed Left
##     Stayed   1099  905
##     Left      474  501
##                                          
##                Accuracy : 0.5371         
##                  95% CI : (0.519, 0.5551)
##     No Information Rate : 0.528          
##     P-Value [Acc > NIR] : 0.1654         
##                                          
##                   Kappa : 0.0559         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.3563         
##             Specificity : 0.6987         
##          Pos Pred Value : 0.5138         
##          Neg Pred Value : 0.5484         
##              Prevalence : 0.4720         
##          Detection Rate : 0.1682         
##    Detection Prevalence : 0.3273         
##       Balanced Accuracy : 0.5275         
##                                          
##        'Positive' Class : Left           
## 

7.4 Model 3

pred3 <- predict_class(m3, test_df)
cm3   <- confusionMatrix(pred3, test_df$attrition, positive = "Left")
print(cm3)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Stayed Left
##     Stayed   1237  385
##     Left      336 1021
##                                           
##                Accuracy : 0.758           
##                  95% CI : (0.7422, 0.7733)
##     No Information Rate : 0.528           
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5135          
##                                           
##  Mcnemar's Test P-Value : 0.07384         
##                                           
##             Sensitivity : 0.7262          
##             Specificity : 0.7864          
##          Pos Pred Value : 0.7524          
##          Neg Pred Value : 0.7626          
##              Prevalence : 0.4720          
##          Detection Rate : 0.3427          
##    Detection Prevalence : 0.4555          
##       Balanced Accuracy : 0.7563          
##                                           
##        'Positive' Class : Left            
## 

7.5 Confusion Matrix Heatmaps

plot_cm <- function(cm, title) {
  tbl <- as.data.frame(cm$table)
  names(tbl) <- c("Predicted", "Actual", "Freq")
  tbl <- tbl %>%
    group_by(Actual) %>%
    mutate(pct = scales::percent(Freq / sum(Freq), accuracy = 1))

  ggplot(tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
    geom_tile(colour = "white", linewidth = 1) +
    geom_text(aes(label = paste0(Freq, "\n(", pct, ")")),
              size = 4.5, fontface = "bold") +
    scale_fill_gradient(low = "#dfe6e9", high = "#2c7fb8") +
    labs(title = title, x = "Actual", y = "Predicted") +
    theme_minimal(base_size = 12) +
    theme(legend.position = "none",
          plot.title = element_text(face = "bold", hjust = 0.5))
}

grid.arrange(
  plot_cm(cm1, "Model 1\nMonthlyIncome"),
  plot_cm(cm2, "Model 2\nMonthlyIncome + Overtime"),
  plot_cm(cm3, "Model 3\nAll Variables"),
  ncol = 3,
  top = grid::textGrob("Figure 4 - Confusion Matrices (Test Set)",
                       gp = grid::gpar(fontsize = 13, fontface = "bold"))
)


8 ROC Curves & AUC

The ROC curve plots sensitivity against (1 - specificity) across all classification thresholds. The AUC summarises discriminative ability: 0.5 = random, 1.0 = perfect.

prob1 <- predict(m1, test_df, type = "response")
prob2 <- predict(m2, test_df, type = "response")
prob3 <- predict(m3, test_df, type = "response")

roc1 <- roc(test_df$attrition, prob1, levels = c("Stayed", "Left"), quiet = TRUE)
roc2 <- roc(test_df$attrition, prob2, levels = c("Stayed", "Left"), quiet = TRUE)
roc3 <- roc(test_df$attrition, prob3, levels = c("Stayed", "Left"), quiet = TRUE)

roc_df <- bind_rows(
  data.frame(FPR = 1 - roc1$specificities, TPR = roc1$sensitivities,
             Model = sprintf("Model 1 (AUC = %.3f)", auc(roc1))),
  data.frame(FPR = 1 - roc2$specificities, TPR = roc2$sensitivities,
             Model = sprintf("Model 2 (AUC = %.3f)", auc(roc2))),
  data.frame(FPR = 1 - roc3$specificities, TPR = roc3$sensitivities,
             Model = sprintf("Model 3 (AUC = %.3f)", auc(roc3)))
)

ggplot(roc_df, aes(x = FPR, y = TPR, colour = Model)) +
  geom_line(linewidth = 1) +
  geom_abline(linetype = "dashed", colour = "grey50") +
  scale_colour_manual(values = c("#e17055", "#0984e3", "#00b894")) +
  labs(title = "Figure 5 - ROC Curves (Test Set)",
       x = "False Positive Rate (1 - Specificity)",
       y = "True Positive Rate (Sensitivity)",
       colour = NULL) +
  theme(legend.position = c(0.65, 0.2),
        legend.background = element_rect(fill = "white", colour = "grey80"))

Interpretation: A model with AUC closer to 1.0 better discriminates between employees who stay and those who leave, regardless of the chosen threshold.


9 Summary Comparison

extract_metrics <- function(cm, roc_obj, model_name) {
  tibble(
    Model       = model_name,
    Accuracy    = cm$overall["Accuracy"],
    Sensitivity = cm$byClass["Sensitivity"],
    Specificity = cm$byClass["Specificity"],
    Precision   = cm$byClass["Pos Pred Value"],
    Kappa       = cm$overall["Kappa"],
    AUC         = as.numeric(auc(roc_obj))
  )
}

comparison <- bind_rows(
  extract_metrics(cm1, roc1, "Model 1: MonthlyIncome"),
  extract_metrics(cm2, roc2, "Model 2: MonthlyIncome + Overtime"),
  extract_metrics(cm3, roc3, "Model 3: All Variables")
)

comparison %>%
  mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
  kable(caption = "Table 6 - Full Performance Comparison on Test Set") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  row_spec(which.max(comparison$AUC), bold = TRUE,
           color = "white", background = "#2c7fb8")
Table 6 - Full Performance Comparison on Test Set
Model Accuracy Sensitivity Specificity Precision Kappa AUC
Model 1: MonthlyIncome 0.5280 0.0000 1.0000 NaN 0.0000 0.5198
Model 2: MonthlyIncome + Overtime 0.5371 0.3563 0.6987 0.5138 0.0559 0.5398
Model 3: All Variables 0.7580 0.7262 0.7864 0.7524 0.5135 0.8501

10 Conclusions

  1. Model complexity matters. The full model (Model 3) achieves the highest AUC and accuracy, confirming that using all available employee features substantially improves attrition prediction beyond income alone.

  2. Overtime is a key driver. Adding overtime status in Model 2 meaningfully improves on the income-only baseline, evidenced by the lower AIC and improved sensitivity. Employees working overtime are significantly more likely to leave, even after controlling for income.

  3. Sensitivity vs. specificity trade-off. All three models are better at identifying stayers (high specificity) than leavers (lower sensitivity). This is common in imbalanced datasets and suggests that the 0.5 threshold could be lowered to catch more true attrition cases, at the cost of more false alarms.

  4. Practical recommendation. For deployment, Model 3 is preferred for predictive accuracy. A simpler model like Model 2 may be sufficient when interpretability and data availability are priorities — it uses only two variables yet captures much of the signal.


11 Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Asia/Ulaanbaatar
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] broom_1.0.10     pROC_1.19.0.1    gridExtra_2.3    kableExtra_1.4.0
##  [5] knitr_1.50       caret_7.0-1      lattice_0.22-7   lubridate_1.9.4 
##  [9] forcats_1.0.1    stringr_1.5.2    dplyr_1.2.0      purrr_1.1.0     
## [13] readr_2.1.5      tidyr_1.3.1      tibble_3.3.1     ggplot2_4.0.2   
## [17] tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.2    timeDate_4051.111   
##  [4] farver_2.1.2         S7_0.2.0             fastmap_1.2.0       
##  [7] digest_0.6.37        rpart_4.1.24         timechange_0.3.0    
## [10] lifecycle_1.0.5      survival_3.8-3       magrittr_2.0.3      
## [13] compiler_4.5.1       rlang_1.1.7          sass_0.4.10         
## [16] tools_4.5.1          yaml_2.3.10          data.table_1.17.8   
## [19] labeling_0.4.3       plyr_1.8.9           xml2_1.4.0          
## [22] RColorBrewer_1.1-3   withr_3.0.2          nnet_7.3-20         
## [25] grid_4.5.1           stats4_4.5.1         e1071_1.7-17        
## [28] future_1.67.0        globals_0.18.0       scales_1.4.0        
## [31] iterators_1.0.14     MASS_7.3-65          cli_3.6.5           
## [34] rmarkdown_2.29       generics_0.1.4       rstudioapi_0.17.1   
## [37] future.apply_1.20.0  tzdb_0.5.0           reshape2_1.4.5      
## [40] proxy_0.4-29         cachem_1.1.0         splines_4.5.1       
## [43] parallel_4.5.1       vctrs_0.7.1          hardhat_1.4.2       
## [46] Matrix_1.7-3         jsonlite_2.0.0       hms_1.1.3           
## [49] listenv_0.9.1        systemfonts_1.3.2    foreach_1.5.2       
## [52] gower_1.0.2          jquerylib_0.1.4      recipes_1.3.1       
## [55] glue_1.8.0           parallelly_1.45.1    codetools_0.2-20    
## [58] stringi_1.8.7        gtable_0.3.6         pillar_1.11.0       
## [61] htmltools_0.5.8.1    ipred_0.9-15         lava_1.8.2          
## [64] R6_2.6.1             textshaping_1.0.3    evaluate_1.0.5      
## [67] backports_1.5.0      bslib_0.9.0          class_7.3-23        
## [70] Rcpp_1.1.0           svglite_2.2.2        nlme_3.1-168        
## [73] prodlim_2025.04.28   xfun_0.53            ModelMetrics_1.2.2.2
## [76] pkgconfig_2.0.3