Employee attrition — the voluntary or involuntary departure of employees from an organisation — poses significant operational and financial challenges. Replacing a single employee can cost between 50% and 200% of their annual salary when accounting for recruitment, onboarding, and lost productivity (Cascio, 2006). Predictive modelling offers human resource managers an evidence-based approach to identifying at-risk employees before departure occurs.
This analysis employs binary logistic regression to model the probability of employee attrition using data from the Kaggle Employee Attrition Dataset (Stealth Technologies, 2024). Three models of increasing complexity are estimated and evaluated:
| Model | Predictors |
|---|---|
| Model 1 | MonthlyIncome only |
| Model 2 | MonthlyIncome + Overtime |
| Model 3 | All available predictors |
Model performance is assessed on a held-out test set using a confusion matrix and standard classification metrics: accuracy, precision, recall, and F1-score.
All required packages are loaded at the outset to ensure reproducibility and clarity.
# Core data manipulation and visualisation
library(tidyverse) # dplyr, ggplot2, readr, forcats, etc.
library(caret) # confusionMatrix, createDataPartition
library(pROC) # ROC / AUC analysis (supplementary)
library(knitr) # kable for formatted tables
library(kableExtra) # Enhanced kable styling
# Set a global ggplot theme
theme_set(theme_minimal(base_size = 13))

The dataset is supplied as two pre-split CSV files (train.csv and test.csv). For this analysis, both files are combined and re-split at a 70/30 ratio using a reproducible seed, giving full control over the partition used for model estimation and evaluation.
# ── Load raw files ──────────────────────────────────────────────────────────
train_raw <- read_csv("train.csv", show_col_types = FALSE)
test_raw <- read_csv("test.csv", show_col_types = FALSE)
# Combine into a single frame for re-splitting
full_data <- bind_rows(train_raw, test_raw)
cat("Total observations:", nrow(full_data), "\n")
## Total observations: 74498
## Variables: 24
## Rows: 74,498
## Columns: 24
## $ `Employee ID` <dbl> 8410, 64756, 30257, 65791, 65026, 24368, 64…
## $ Age <dbl> 31, 59, 24, 36, 56, 38, 47, 48, 57, 24, 30,…
## $ Gender <chr> "Male", "Female", "Female", "Female", "Male…
## $ `Years at Company` <dbl> 19, 4, 10, 7, 41, 3, 23, 16, 44, 1, 12, 6, …
## $ `Job Role` <chr> "Education", "Media", "Healthcare", "Educat…
## $ `Monthly Income` <dbl> 5390, 5534, 8159, 3989, 4821, 9977, 3681, 1…
## $ `Work-Life Balance` <chr> "Excellent", "Poor", "Good", "Good", "Fair"…
## $ `Job Satisfaction` <chr> "Medium", "High", "High", "High", "Very Hig…
## $ `Performance Rating` <chr> "Average", "Low", "Low", "High", "Average",…
## $ `Number of Promotions` <dbl> 2, 3, 0, 1, 0, 3, 1, 2, 1, 1, 1, 2, 1, 4, 0…
## $ Overtime <chr> "No", "No", "No", "No", "Yes", "No", "Yes",…
## $ `Distance from Home` <dbl> 22, 21, 11, 27, 71, 37, 75, 5, 39, 57, 51, …
## $ `Education Level` <chr> "Associate Degree", "Master’s Degree", "Bac…
## $ `Marital Status` <chr> "Married", "Divorced", "Married", "Single",…
## $ `Number of Dependents` <dbl> 0, 3, 3, 2, 0, 0, 3, 4, 4, 4, 1, 0, 0, 2, 0…
## $ `Job Level` <chr> "Mid", "Mid", "Mid", "Mid", "Senior", "Mid"…
## $ `Company Size` <chr> "Medium", "Medium", "Medium", "Small", "Med…
## $ `Company Tenure` <dbl> 89, 21, 74, 50, 68, 47, 93, 88, 75, 45, 17,…
## $ `Remote Work` <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ `Leadership Opportunities` <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ `Innovation Opportunities` <chr> "No", "No", "No", "No", "No", "Yes", "No", …
## $ `Company Reputation` <chr> "Excellent", "Fair", "Poor", "Good", "Fair"…
## $ `Employee Recognition` <chr> "Medium", "Low", "Low", "Medium", "Medium",…
## $ Attrition <chr> "Stayed", "Stayed", "Stayed", "Stayed", "St…
# Class distribution of the outcome variable
full_data %>%
  count(Attrition) %>%
  mutate(Proportion = round(n / sum(n), 3)) %>%
  kable(caption = "Class Distribution of Employee Attrition",
        col.names = c("Attrition", "Count", "Proportion")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Attrition | Count | Proportion |
|---|---|---|
| Left | 35370 | 0.475 |
| Stayed | 39128 | 0.525 |
The two classes are close to balanced: 47.5% of employees left the organisation and 52.5% stayed. This mild imbalance does not necessitate resampling techniques, but it fixes the no-information baseline (about 52.5% accuracy from always predicting "Stayed") against which the accuracy, recall, and precision values below should be read.
# ── Helper: rename columns to valid R identifiers ───────────────────────────
clean_names <- function(df) {
names(df) <- names(df) %>%
str_replace_all(" ", "_") %>%
str_replace_all("-", "_")
df
}
df <- full_data %>%
clean_names() %>%
# Drop Employee_ID — not a predictor
select(-Employee_ID) %>%
# Encode outcome: 1 = Left, 0 = Stayed
mutate(
Attrition = factor(if_else(Attrition == "Left", 1L, 0L),
levels = c(0, 1),
labels = c("Stayed", "Left"))
)
# ── Check for missing values ─────────────────────────────────────────────────
missing_summary <- df %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
filter(Missing > 0)
if (nrow(missing_summary) == 0) {
  cat("No missing values detected. No imputation required.\n")
} else {
  print(missing_summary)
}
## No missing values detected. No imputation required.
# ── Convert character columns to factors ────────────────────────────────────
char_vars <- df %>% select(where(is.character)) %>% names()
df <- df %>%
mutate(across(all_of(char_vars), as.factor))
# Confirm Overtime is a factor (required for Model 2)
cat("Overtime levels:", levels(df$Overtime), "\n")
## Overtime levels: No Yes
## Attrition levels: Stayed Left
# Quick overview of factor levels
df %>%
select(where(is.factor)) %>%
summarise(across(everything(), ~ nlevels(.x))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Levels") %>%
kable(caption = "Factor Variables and Number of Levels") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Variable | Levels |
|---|---|
| Gender | 2 |
| Job_Role | 5 |
| Work_Life_Balance | 4 |
| Job_Satisfaction | 4 |
| Performance_Rating | 4 |
| Overtime | 2 |
| Education_Level | 5 |
| Marital_Status | 3 |
| Job_Level | 3 |
| Company_Size | 3 |
| Remote_Work | 2 |
| Leadership_Opportunities | 2 |
| Innovation_Opportunities | 2 |
| Company_Reputation | 4 |
| Employee_Recognition | 4 |
| Attrition | 2 |
The combined dataset is partitioned into a 70% training set and a 30% test set using stratified random sampling on the outcome variable to preserve class proportions in both partitions.
set.seed(2024) # Reproducible seed
split_idx <- createDataPartition(df$Attrition, p = 0.70, list = FALSE)
train_df <- df[ split_idx, ]
test_df <- df[-split_idx, ]
cat(sprintf("Training set: %d observations (%.1f%%)\n",
            nrow(train_df), 100 * nrow(train_df) / nrow(df)))
## Training set: 52149 observations (70.0%)
## Test set: 22349 observations (30.0%)
# Verify class balance is preserved
bind_rows(
train_df %>% count(Attrition) %>% mutate(Set = "Train"),
test_df %>% count(Attrition) %>% mutate(Set = "Test")
) %>%
group_by(Set) %>%
mutate(Prop = round(n / sum(n), 3)) %>%
kable(caption = "Class Distribution After Stratified Split") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Attrition | n | Set | Prop |
|---|---|---|---|
| Stayed | 27390 | Train | 0.525 |
| Left | 24759 | Train | 0.475 |
| Stayed | 11738 | Test | 0.525 |
| Left | 10611 | Test | 0.475 |
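
To illustrate what the stratified split above is doing, the following base R sketch samples 70% of the rows within each outcome class separately and then pools them. The data frame `toy`, its size, and the seed are assumptions made for this example only; the report itself relies on caret's `createDataPartition()`, whose internals may differ in detail.

```r
# Simulated outcome with the report's class shares (52.5% Stayed, 47.5% Left).
# This is toy data, not the Kaggle dataset.
set.seed(1)
toy <- data.frame(
  id = 1:2000,
  Attrition = factor(rep(c("Stayed", "Left"), times = c(1050, 950)),
                     levels = c("Stayed", "Left"))
)

# Draw 70% of the row ids within each class, then combine the two draws
ids_by_class <- split(toy$id, toy$Attrition)
train_idx <- unlist(lapply(ids_by_class, function(ids) {
  sample(ids, size = round(0.70 * length(ids)))
}))

toy_train <- toy[toy$id %in% train_idx, ]
toy_test  <- toy[!toy$id %in% train_idx, ]

# Class proportions are preserved in both partitions
prop.table(table(toy_train$Attrition))
prop.table(table(toy_test$Attrition))
```

Because sampling happens inside each class, both partitions keep the original class shares, which is the property the table above verifies for the real data.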
Logistic regression models the log-odds of a binary outcome as a linear function of predictors:
\[\log\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
The predicted probability is recovered via the logistic (sigmoid) function:
\[\hat{P}(Y=1 \mid \mathbf{X}) = \frac{e^{\mathbf{X}\boldsymbol{\beta}}}{1 + e^{\mathbf{X}\boldsymbol{\beta}}}\]
Model coefficients are estimated by maximum likelihood. A positive coefficient increases the log-odds of attrition; a negative coefficient decreases them.
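
As a short worked step (standard for the logit link, stated here for reference), exponentiating a coefficient converts its additive effect on the log-odds into a multiplicative effect on the odds. For a one-unit increase in \(X_j\) with the other predictors held fixed:

\[\frac{\text{odds}(X_j + 1)}{\text{odds}(X_j)} = \frac{e^{\beta_0 + \cdots + \beta_j (X_j + 1) + \cdots + \beta_p X_p}}{e^{\beta_0 + \cdots + \beta_j X_j + \cdots + \beta_p X_p}} = e^{\beta_j}\]

An odds ratio above 1 therefore corresponds to a positive coefficient, and an odds ratio below 1 to a negative one.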
model1 <- glm(Attrition ~ Monthly_Income,
              data = train_df,
              family = binomial(link = "logit"))
summary(model1)
##
## Call:
## glm(formula = Attrition ~ Monthly_Income, family = binomial(link = "logit"),
## data = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.694e-02 3.100e-02 -1.192 0.2333
## Monthly_Income -8.785e-06 4.078e-06 -2.154 0.0312 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 72161 on 52148 degrees of freedom
## Residual deviance: 72156 on 52147 degrees of freedom
## AIC: 72160
##
## Number of Fisher Scoring iterations: 3
# Exponentiated coefficients (Odds Ratios)
broom::tidy(model1, exponentiate = TRUE, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 5))) %>%
kable(caption = "Model 1 — Odds Ratios (Exponentiated Coefficients)",
col.names = c("Term", "Odds Ratio", "Std. Error",
"z-statistic", "p-value", "CI 2.5%", "CI 97.5%")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Odds Ratio | Std. Error | z-statistic | p-value | CI 2.5% | CI 97.5% |
|---|---|---|---|---|---|---|
| (Intercept) | 0.96373 | 0.031 | -1.19180 | 0.23334 | 0.90693 | 1.02409 |
| Monthly_Income | 0.99999 | 0.000 | -2.15409 | 0.03123 | 0.99998 | 1.00000 |
Coefficient interpretation (Model 1):

- Intercept: the log-odds of attrition when Monthly_Income = 0, serving as a baseline anchor with no direct substantive interpretation.
- Monthly_Income: the change in log-odds of attrition associated with a one-unit (one dollar) increase in monthly income. The coefficient is negative and statistically significant at the 5% level (p = 0.031), indicating that higher-earning employees are slightly less likely to leave, consistent with compensating wage differentials theory (Rosen, 1986). The per-dollar effect is very small: the odds ratio rounds to 0.99999.

## Null deviance: 72161.07 on 52148 df
## Residual deviance: 72156.43 on 52147 df
## AIC: 72160.43
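
The per-dollar odds ratio for Monthly_Income rounds to 0.99999, which makes the effect hard to read. A minimal sketch of one way to make it more interpretable is to rescale it to a larger income step. The 1,000-dollar unit is an illustrative choice, and the interval below is a Wald approximation computed from the printed estimate and standard error, so it may differ slightly from the interval in the odds-ratio table.

```r
# Estimate and standard error copied from the Model 1 summary output
beta_income <- -8.785e-06
se_income   <- 4.078e-06

# Rescale from "per dollar" to "per 1,000 dollars" (illustrative unit)
income_step <- 1000
or_per_step <- exp(income_step * beta_income)

# Approximate 95% Wald interval on the rescaled odds ratio
z <- qnorm(0.975)
ci_per_step <- exp(income_step * (beta_income + c(-1, 1) * z * se_income))

round(c(odds_ratio = or_per_step,
        lower = ci_per_step[1],
        upper = ci_per_step[2]), 4)
```

On this scale the odds ratio is roughly 0.991: each additional 1,000 dollars of monthly income is associated with a reduction of a little under 1% in the odds of leaving.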
# McFadden's pseudo-R²
pseudo_r2_m1 <- 1 - (model1$deviance / model1$null.deviance)
cat(sprintf("McFadden's pseudo-R²: %.4f\n", pseudo_r2_m1))
## McFadden's pseudo-R²: 0.0001
A single predictor explains almost none of the variation in attrition: McFadden's pseudo-R² is 0.0001, and the residual deviance (72156) is barely below the null deviance (72161). This is expected; employee turnover is multifactorial, and income alone cannot capture the full complexity of the departure decision.
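
The fitted Model 1 coefficients also have a direct consequence for classification at the 0.5 threshold used later in the report. The sketch below plugs the printed estimates into the logistic function for a few hypothetical income values (the incomes are assumptions chosen for illustration, not values taken from the data).

```r
# Coefficients copied from the Model 1 summary output
b0 <- -3.694e-02
b1 <- -8.785e-06

# Hypothetical monthly incomes, for illustration only
income <- c(0, 2000, 5000, 10000, 15000)

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_left <- plogis(b0 + b1 * income)
data.frame(income, p_left = round(p_left, 4))

# Both coefficients are negative, so the linear predictor is below zero
# for any non-negative income and every probability is below 0.5
all(p_left < 0.5)
```

Since the predicted probability of leaving never reaches 0.5 for a non-negative income, a 0.5 cut-off assigns every employee to "Stayed" under Model 1.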
model2 <- glm(Attrition ~ Monthly_Income + Overtime,
              data = train_df,
              family = binomial(link = "logit"))
summary(model2)
##
## Call:
## glm(formula = Attrition ~ Monthly_Income + Overtime, family = binomial(link = "logit"),
## data = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.176e-01 3.167e-02 -3.715 0.000203 ***
## Monthly_Income -8.627e-06 4.085e-06 -2.112 0.034673 *
## OvertimeYes 2.413e-01 1.868e-02 12.919 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 72161 on 52148 degrees of freedom
## Residual deviance: 71989 on 52146 degrees of freedom
## AIC: 71995
##
## Number of Fisher Scoring iterations: 3
broom::tidy(model2, exponentiate = TRUE, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 5))) %>%
kable(caption = "Model 2 — Odds Ratios (Exponentiated Coefficients)",
col.names = c("Term", "Odds Ratio", "Std. Error",
"z-statistic", "p-value", "CI 2.5%", "CI 97.5%")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Odds Ratio | Std. Error | z-statistic | p-value | CI 2.5% | CI 97.5% |
|---|---|---|---|---|---|---|
| (Intercept) | 0.88901 | 0.03167 | -3.71498 | 0.00020 | 0.83551 | 0.94593 |
| Monthly_Income | 0.99999 | 0.00000 | -2.11216 | 0.03467 | 0.99998 | 1.00000 |
| OvertimeYes | 1.27285 | 0.01868 | 12.91881 | 0.00000 | 1.22711 | 1.32031 |
Coefficient interpretation (Model 2):

- Monthly_Income: the direction and significance remain consistent with Model 1 (p = 0.035); the estimate shifts only slightly once Overtime is included.
- OvertimeYes: employees who work overtime have higher log-odds of attrition than those who do not, holding income constant. The odds ratio of 1.27 (95% CI 1.23 to 1.32) corresponds to roughly 27% higher odds of leaving. The positive sign aligns with the work–life balance literature: overtime is associated with burnout, reduced job satisfaction, and elevated turnover intentions (Bakker & Demerouti, 2007).

## Null deviance: 72161.07 on 52148 df
## Residual deviance: 71989.36 on 52146 df
## AIC: 71995.36
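
To make the Overtime odds ratio concrete, the sketch below converts the printed Model 2 coefficients into predicted probabilities for one hypothetical employee with and without overtime. The income of 5,000 is an assumption picked for illustration.

```r
# Coefficients copied from the Model 2 summary output
b0         <- -1.176e-01
b_income   <- -8.627e-06
b_overtime <-  2.413e-01

income <- 5000  # hypothetical monthly income

# Predicted probability of leaving without and with overtime
p_no_overtime  <- plogis(b0 + b_income * income)
p_yes_overtime <- plogis(b0 + b_income * income + b_overtime)
round(c(no_overtime = p_no_overtime, overtime = p_yes_overtime), 4)

# The ratio of the two odds recovers the reported odds ratio, exp(b_overtime)
odds <- function(p) p / (1 - p)
odds(p_yes_overtime) / odds(p_no_overtime)
exp(b_overtime)
```

At this income the overtime indicator moves the predicted probability from below 0.5 to above it, which suggests that Model 2's "Left" predictions are driven largely by overtime status.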
pseudo_r2_m2 <- 1 - (model2$deviance / model2$null.deviance)
cat(sprintf("McFadden's pseudo-R²: %.4f\n", pseudo_r2_m2))
## McFadden's pseudo-R²: 0.0024
## Analysis of Deviance Table
##
## Model 1: Attrition ~ Monthly_Income
## Model 2: Attrition ~ Monthly_Income + Overtime
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 52147 72156
## 2 52146 71989 1 167.07 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
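The likelihood-ratio test assesses whether adding Overtime provides a statistically significant improvement over Model 1. Here the deviance falls by 167.07 on 1 degree of freedom (p < 2.2e-16), so the additional predictor significantly improves model fit, even though the gain in pseudo-R² (from 0.0001 to 0.0024) is small in absolute terms.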
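
The test statistic in the analysis-of-deviance table can be reproduced by hand from the reported deviances. The sketch below does this with the printed values; the call that generated the table is not shown in the output, and something like `anova(model1, model2, test = "Chisq")` is assumed.

```r
# Residual deviances and degrees of freedom copied from the model output
dev_m1 <- 72156.43; df_m1 <- 52147
dev_m2 <- 71989.36; df_m2 <- 52146

# Likelihood-ratio statistic: the drop in deviance between nested models
lr_stat <- dev_m1 - dev_m2
df_diff <- df_m1 - df_m2

# Upper-tail chi-squared probability gives the p-value
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

c(lr_stat = lr_stat, df = df_diff, p_value = p_value)
```

The statistic matches the 167.07 shown in the table, and the p-value falls far below the 2.2e-16 reporting floor.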
##
## Call:
## glm(formula = Attrition ~ ., family = binomial(link = "logit"),
## data = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.907e-01 9.171e-02 2.079 0.037592 *
## Age -6.682e-03 1.072e-03 -6.233 4.58e-10 ***
## GenderMale -6.294e-01 2.219e-02 -28.368 < 2e-16 ***
## Years_at_Company -1.365e-02 1.257e-03 -10.855 < 2e-16 ***
## Job_RoleFinance -1.107e-01 5.179e-02 -2.137 0.032573 *
## Job_RoleHealthcare -8.237e-02 4.523e-02 -1.821 0.068570 .
## Job_RoleMedia -1.192e-01 3.844e-02 -3.102 0.001922 **
## Job_RoleTechnology -1.041e-01 5.162e-02 -2.016 0.043794 *
## Monthly_Income -1.974e-06 8.799e-06 -0.224 0.822487
## Work_Life_BalanceFair 1.337e+00 3.337e-02 40.063 < 2e-16 ***
## Work_Life_BalanceGood 2.828e-01 3.149e-02 8.979 < 2e-16 ***
## Work_Life_BalancePoor 1.539e+00 4.003e-02 38.450 < 2e-16 ***
## Job_SatisfactionLow 5.265e-01 3.780e-02 13.927 < 2e-16 ***
## Job_SatisfactionMedium 2.768e-02 2.912e-02 0.951 0.341845
## Job_SatisfactionVery High 5.077e-01 2.898e-02 17.522 < 2e-16 ***
## Performance_RatingBelow Average 3.423e-01 3.151e-02 10.863 < 2e-16 ***
## Performance_RatingHigh 2.532e-02 2.827e-02 0.896 0.370355
## Performance_RatingLow 6.198e-01 5.120e-02 12.104 < 2e-16 ***
## Number_of_Promotions -2.388e-01 1.114e-02 -21.445 < 2e-16 ***
## OvertimeYes 3.597e-01 2.332e-02 15.424 < 2e-16 ***
## Distance_from_Home 9.854e-03 3.880e-04 25.398 < 2e-16 ***
## Education_LevelBachelor’s Degree 2.949e-02 2.943e-02 1.002 0.316398
## Education_LevelHigh School 1.070e-03 3.283e-02 0.033 0.973999
## Education_LevelMaster’s Degree 1.088e-02 3.257e-02 0.334 0.738345
## Education_LevelPhD -1.517e+00 5.802e-02 -26.151 < 2e-16 ***
## Marital_StatusMarried -2.898e-01 3.167e-02 -9.151 < 2e-16 ***
## Marital_StatusSingle 1.532e+00 3.443e-02 44.496 < 2e-16 ***
## Number_of_Dependents -1.519e-01 7.113e-03 -21.351 < 2e-16 ***
## Job_LevelMid -9.943e-01 2.421e-02 -41.078 < 2e-16 ***
## Job_LevelSenior -2.591e+00 3.477e-02 -74.510 < 2e-16 ***
## Company_SizeMedium 2.924e-02 2.904e-02 1.007 0.313912
## Company_SizeSmall 2.056e-01 3.162e-02 6.501 8.00e-11 ***
## Company_Tenure 3.058e-04 4.813e-04 0.635 0.525217
## Remote_WorkYes -1.782e+00 3.124e-02 -57.052 < 2e-16 ***
## Leadership_OpportunitiesYes -1.921e-01 5.069e-02 -3.789 0.000151 ***
## Innovation_OpportunitiesYes -1.757e-01 2.971e-02 -5.914 3.33e-09 ***
## Company_ReputationFair 5.198e-01 4.269e-02 12.176 < 2e-16 ***
## Company_ReputationGood -2.665e-02 3.811e-02 -0.699 0.484449
## Company_ReputationPoor 7.679e-01 4.261e-02 18.019 < 2e-16 ***
## Employee_RecognitionLow 4.861e-02 2.802e-02 1.735 0.082818 .
## Employee_RecognitionMedium 5.457e-02 2.949e-02 1.850 0.064270 .
## Employee_RecognitionVery High -5.499e-02 5.378e-02 -1.023 0.306490
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 72161 on 52148 degrees of freedom
## Residual deviance: 50683 on 52107 degrees of freedom
## AIC: 50767
##
## Number of Fisher Scoring iterations: 5
broom::tidy(model3, exponentiate = TRUE, conf.int = TRUE) %>%
filter(p.value < 0.10) %>% # Show only noteworthy predictors
arrange(p.value) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(caption = "Model 3 — Significant Predictors (p < 0.10), Odds Ratios",
col.names = c("Term", "Odds Ratio", "Std. Error",
"z-statistic", "p-value", "CI 2.5%", "CI 97.5%")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Odds Ratio | Std. Error | z-statistic | p-value | CI 2.5% | CI 97.5% |
|---|---|---|---|---|---|---|
| Work_Life_BalanceFair | 3.8066 | 0.0334 | 40.0634 | 0.0000 | 3.5660 | 4.0643 |
| Marital_StatusSingle | 4.6265 | 0.0344 | 44.4957 | 0.0000 | 4.3251 | 4.9500 |
| Job_LevelMid | 0.3700 | 0.0242 | -41.0785 | 0.0000 | 0.3528 | 0.3879 |
| Job_LevelSenior | 0.0750 | 0.0348 | -74.5098 | 0.0000 | 0.0700 | 0.0802 |
| Remote_WorkYes | 0.1683 | 0.0312 | -57.0522 | 0.0000 | 0.1583 | 0.1789 |
| Work_Life_BalancePoor | 4.6614 | 0.0400 | 38.4498 | 0.0000 | 4.3102 | 5.0426 |
| GenderMale | 0.5329 | 0.0222 | -28.3676 | 0.0000 | 0.5102 | 0.5566 |
| Education_LevelPhD | 0.2193 | 0.0580 | -26.1513 | 0.0000 | 0.1956 | 0.2456 |
| Distance_from_Home | 1.0099 | 0.0004 | 25.3976 | 0.0000 | 1.0091 | 1.0107 |
| Number_of_Promotions | 0.7876 | 0.0111 | -21.4446 | 0.0000 | 0.7706 | 0.8049 |
| Number_of_Dependents | 0.8591 | 0.0071 | -21.3511 | 0.0000 | 0.8472 | 0.8712 |
| Company_ReputationPoor | 2.1551 | 0.0426 | 18.0195 | 0.0000 | 1.9826 | 2.3431 |
| Job_SatisfactionVery High | 1.6615 | 0.0290 | 17.5216 | 0.0000 | 1.5698 | 1.7586 |
| OvertimeYes | 1.4329 | 0.0233 | 15.4241 | 0.0000 | 1.3689 | 1.5000 |
| Job_SatisfactionLow | 1.6930 | 0.0378 | 13.9272 | 0.0000 | 1.5722 | 1.8233 |
| Company_ReputationFair | 1.6817 | 0.0427 | 12.1761 | 0.0000 | 1.5468 | 1.8286 |
| Performance_RatingLow | 1.8585 | 0.0512 | 12.1039 | 0.0000 | 1.6813 | 2.0550 |
| Performance_RatingBelow Average | 1.4081 | 0.0315 | 10.8625 | 0.0000 | 1.3238 | 1.4979 |
| Years_at_Company | 0.9864 | 0.0013 | -10.8553 | 0.0000 | 0.9840 | 0.9889 |
| Marital_StatusMarried | 0.7484 | 0.0317 | -9.1508 | 0.0000 | 0.7034 | 0.7963 |
| Work_Life_BalanceGood | 1.3268 | 0.0315 | 8.9791 | 0.0000 | 1.2475 | 1.4114 |
| Company_SizeSmall | 1.2282 | 0.0316 | 6.5006 | 0.0000 | 1.1544 | 1.3068 |
| Age | 0.9933 | 0.0011 | -6.2327 | 0.0000 | 0.9913 | 0.9954 |
| Innovation_OpportunitiesYes | 0.8389 | 0.0297 | -5.9144 | 0.0000 | 0.7914 | 0.8891 |
| Leadership_OpportunitiesYes | 0.8252 | 0.0507 | -3.7888 | 0.0002 | 0.7471 | 0.9114 |
| Job_RoleMedia | 0.8876 | 0.0384 | -3.1021 | 0.0019 | 0.8232 | 0.9570 |
| Job_RoleFinance | 0.8952 | 0.0518 | -2.1373 | 0.0326 | 0.8088 | 0.9908 |
| (Intercept) | 1.2101 | 0.0917 | 2.0793 | 0.0376 | 1.0110 | 1.4484 |
| Job_RoleTechnology | 0.9012 | 0.0516 | -2.0161 | 0.0438 | 0.8144 | 0.9971 |
| Employee_RecognitionMedium | 1.0561 | 0.0295 | 1.8503 | 0.0643 | 0.9968 | 1.1189 |
| Job_RoleHealthcare | 0.9209 | 0.0452 | -1.8212 | 0.0686 | 0.8428 | 1.0063 |
| Employee_RecognitionLow | 1.0498 | 0.0280 | 1.7346 | 0.0828 | 0.9937 | 1.1091 |
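Key observations from the full model:

- Of the two predictors carried over from Model 2, Overtime retains its sign and significance (odds ratio 1.43), whereas Monthly_Income is no longer significant (p = 0.82) once the other predictors are included.
- Work_Life_Balance, Job_Satisfaction, Job_Level, and Years_at_Company emerge as statistically significant predictors, consistent with established theories of employee turnover (Mobley, 1977). Job_Level and Work_Life_Balance show the largest effects among them: senior employees have an odds ratio of 0.075 relative to the reference level, and poor work–life balance an odds ratio of 4.66.

## Null deviance: 72161.07 on 52148 df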
## Residual deviance: 50682.77 on 52107 df
## AIC: 50766.77
pseudo_r2_m3 <- 1 - (model3$deviance / model3$null.deviance)
cat(sprintf("McFadden's pseudo-R²: %.4f\n", pseudo_r2_m3))
## McFadden's pseudo-R²: 0.2976
## Analysis of Deviance Table
##
## Model 1: Attrition ~ Monthly_Income + Overtime
## Model 2: Attrition ~ Age + Gender + Years_at_Company + Job_Role + Monthly_Income +
## Work_Life_Balance + Job_Satisfaction + Performance_Rating +
## Number_of_Promotions + Overtime + Distance_from_Home + Education_Level +
## Marital_Status + Number_of_Dependents + Job_Level + Company_Size +
## Company_Tenure + Remote_Work + Leadership_Opportunities +
## Innovation_Opportunities + Company_Reputation + Employee_Recognition
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 52146 71989
## 2 52107 50683 39 21307 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Each model is applied to the held-out test set. Predicted probabilities are converted to binary class labels using a decision threshold of 0.5.
# ── Probability predictions ──────────────────────────────────────────────────
pred_prob_m1 <- predict(model1, newdata = test_df, type = "response")
pred_prob_m2 <- predict(model2, newdata = test_df, type = "response")
pred_prob_m3 <- predict(model3, newdata = test_df, type = "response")
# ── Binary class labels (threshold = 0.5) ───────────────────────────────────
threshold <- 0.5
pred_class_m1 <- factor(if_else(pred_prob_m1 >= threshold, "Left", "Stayed"),
levels = c("Stayed", "Left"))
pred_class_m2 <- factor(if_else(pred_prob_m2 >= threshold, "Left", "Stayed"),
levels = c("Stayed", "Left"))
pred_class_m3 <- factor(if_else(pred_prob_m3 >= threshold, "Left", "Stayed"),
levels = c("Stayed", "Left"))
actual <- test_df$Attrition # levels: "Stayed", "Left"
## Confusion Matrix and Statistics
##
## Reference
## Prediction Stayed Left
## Stayed 11738 10611
## Left 0 0
##
## Accuracy : 0.5252
## 95% CI : (0.5186, 0.5318)
## No Information Rate : 0.5252
## P-Value [Acc > NIR] : 0.5027
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5252
## Prevalence : 0.4748
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Left
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Stayed Left
## Stayed 8245 6912
## Left 3493 3699
##
## Accuracy : 0.5344
## 95% CI : (0.5279, 0.541)
## No Information Rate : 0.5252
## P-Value [Acc > NIR] : 0.002948
##
## Kappa : 0.0518
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3486
## Specificity : 0.7024
## Pos Pred Value : 0.5143
## Neg Pred Value : 0.5440
## Prevalence : 0.4748
## Detection Rate : 0.1655
## Detection Prevalence : 0.3218
## Balanced Accuracy : 0.5255
##
## 'Positive' Class : Left
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Stayed Left
## Stayed 9055 2775
## Left 2683 7836
##
## Accuracy : 0.7558
## 95% CI : (0.7501, 0.7614)
## No Information Rate : 0.5252
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5101
##
## Mcnemar's Test P-Value : 0.218
##
## Sensitivity : 0.7385
## Specificity : 0.7714
## Pos Pred Value : 0.7449
## Neg Pred Value : 0.7654
## Prevalence : 0.4748
## Detection Rate : 0.3506
## Detection Prevalence : 0.4707
## Balanced Accuracy : 0.7550
##
## 'Positive' Class : Left
##
# ── Extract metrics from confusionMatrix objects ─────────────────────────────
extract_metrics <- function(cm, model_name) {
tp <- cm$table["Left", "Left"]
tn <- cm$table["Stayed", "Stayed"]
fp <- cm$table["Left", "Stayed"]
fn <- cm$table["Stayed", "Left"]
accuracy <- (tp + tn) / sum(cm$table)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)
tibble(
Model = model_name,
Accuracy = round(accuracy, 4),
Precision = round(precision, 4),
Recall = round(recall, 4),
F1_Score = round(f1, 4)
)
}
metrics_tbl <- bind_rows(
extract_metrics(cm1, "Model 1: Income"),
extract_metrics(cm2, "Model 2: Income + Overtime"),
extract_metrics(cm3, "Model 3: All Predictors")
)
metrics_tbl %>%
kable(caption = "Classification Performance Metrics Across Three Models",
col.names = c("Model", "Accuracy", "Precision", "Recall", "F1-Score")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(which.max(metrics_tbl$F1_Score), bold = TRUE, background = "#d4edda")

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Model 1: Income | 0.5252 | NaN | 0.0000 | NaN |
| Model 2: Income + Overtime | 0.5344 | 0.5143 | 0.3486 | 0.4155 |
| Model 3: All Predictors | 0.7558 | 0.7449 | 0.7385 | 0.7417 |
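
As a worked check on the table above, the Model 3 metrics can be recomputed directly from the four cells of its confusion matrix, with "Left" as the positive class. The sketch uses only the printed counts; the variable names are chosen for this example.

```r
# Counts copied from the Model 3 confusion matrix (positive class: "Left")
tp <- 7836  # predicted Left,   actually Left
fp <- 2683  # predicted Left,   actually Stayed
fn <- 2775  # predicted Stayed, actually Left
tn <- 9055  # predicted Stayed, actually Stayed

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 4)

# Model 1 predicts nobody as "Left", so tp = fp = 0 and precision is 0 / 0
precision_m1 <- 0 / (0 + 0)
is.nan(precision_m1)
```

The last two lines show where the NaN entries for Model 1 come from: with no positive predictions, precision is undefined, and the F1-score inherits that.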
metrics_long <- metrics_tbl %>%
pivot_longer(-Model, names_to = "Metric", values_to = "Value")
ggplot(metrics_long, aes(x = Metric, y = Value, fill = Model)) +
geom_col(position = position_dodge(width = 0.75), width = 0.65, colour = "white") +
geom_text(aes(label = sprintf("%.3f", Value)),
position = position_dodge(width = 0.75),
vjust = -0.4, size = 3.2) +
scale_y_continuous(limits = c(0, 1.05), labels = scales::percent_format()) +
scale_fill_manual(values = c("#4e79a7", "#f28e2b", "#59a14f")) +
labs(title = "Logistic Regression Model Comparison",
subtitle = "Test-set performance metrics (threshold = 0.50)",
x = NULL, y = "Score", fill = "Model") +
theme(legend.position = "bottom",
plot.title = element_text(face = "bold"),
panel.grid.major.x = element_blank())

Figure 1. Comparison of Classification Metrics Across Models
tibble(
Model = c("Model 1", "Model 2", "Model 3"),
PseudoR2 = c(pseudo_r2_m1, pseudo_r2_m2, pseudo_r2_m3)
) %>%
ggplot(aes(x = Model, y = PseudoR2, fill = Model)) +
geom_col(width = 0.5, show.legend = FALSE) +
geom_text(aes(label = sprintf("%.4f", PseudoR2)), vjust = -0.5, size = 4) +
scale_y_continuous(limits = c(0, max(pseudo_r2_m3) * 1.2)) +
scale_fill_manual(values = c("#4e79a7", "#f28e2b", "#59a14f")) +
labs(title = "McFadden's Pseudo-R² Across Models",
subtitle = "Higher values indicate greater explanatory power",
x = NULL, y = "Pseudo-R²") +
theme(plot.title = element_text(face = "bold"),
panel.grid.major.x = element_blank())

Figure 2. McFadden’s Pseudo-R² by Model
# Combined summary for discussion
bind_cols(
metrics_tbl,
tibble(Pseudo_R2 = round(c(pseudo_r2_m1, pseudo_r2_m2, pseudo_r2_m3), 4),
AIC = round(c(AIC(model1), AIC(model2), AIC(model3)), 1))
) %>%
kable(caption = "Comprehensive Model Comparison Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Model | Accuracy | Precision | Recall | F1_Score | Pseudo_R2 | AIC |
|---|---|---|---|---|---|---|
| Model 1: Income | 0.5252 | NaN | 0.0000 | NaN | 0.0001 | 72160.4 |
| Model 2: Income + Overtime | 0.5344 | 0.5143 | 0.3486 | 0.4155 | 0.0024 | 71995.4 |
| Model 3: All Predictors | 0.7558 | 0.7449 | 0.7385 | 0.7417 | 0.2976 | 50766.8 |
Does adding variables improve predictive performance?
The evidence across the reported metrics supports the hypothesis that adding predictors improves predictive performance for this dataset, although the gain from Model 1 to Model 2 is small and most of the improvement arrives with the full model.
Model 1 (income only) establishes a baseline. Although the income coefficient is negative and significant, its effect is so small that the model classifies all 22,349 test employees as "Stayed" at the 0.5 threshold. Accuracy (0.5252) therefore equals the no-information rate, recall is zero, and precision and F1-score are undefined (NaN). Income alone is an insufficient predictor of the departure decision.
Model 2 (income + overtime) introduces a binary indicator for overtime work. This addition is theoretically motivated by the job demands–resources (JD-R) framework (Demerouti et al., 2001), which posits that excessive job demands deplete personal resources and increase withdrawal cognitions. Relative to Model 1, accuracy rises only marginally (0.5252 to 0.5344), but recall increases from 0 to 0.3486 and the F1-score reaches 0.4155, indicating that overtime status provides incremental, if limited, predictive value beyond income.
Model 3 (all predictors) achieves the highest performance across every metric — accuracy, precision, recall, and F1-score — as well as the highest McFadden’s pseudo-R² and the lowest AIC. The full model benefits from a richer representation of the employee experience, including job satisfaction, work–life balance, career advancement indicators, and organisational characteristics.
Which model performs best?
Model 3 is unambiguously the best-performing model. It achieves the highest F1-score — the harmonic mean of precision and recall — which is the most informative single metric when both false positives (unnecessary interventions) and false negatives (undetected departures) carry real organisational costs. Its AIC is substantially lower than the restricted models, confirming that the improvement in fit is not merely a function of adding parameters; the additional variables contribute genuine explanatory value.
From a practical standpoint, an HR analytics system deployed in production would be best served by Model 3. Its multi-dimensional predictor set allows the model to identify at-risk employees across several dimensions (work overload, poor work–life balance, limited advancement, and weak organisational reputation), enabling targeted, cost-effective interventions. Compensation itself contributes little once these factors are included: Monthly_Income is not significant in the full model.
Limitations and caveats:
Threshold sensitivity: The 0.5 decision boundary is a symmetric default. In practice, the cost of a false negative (failing to identify a departing employee) likely exceeds the cost of a false positive (intervening unnecessarily). A lower threshold would increase recall at the expense of precision and should be calibrated to the organisation’s cost–benefit profile.
Multicollinearity: The full model includes correlated predictors (e.g., job level and monthly income). Coefficient estimates in Model 3 should be interpreted cautiously; the predictive accuracy of the model as a whole is not undermined by collinearity, but individual coefficients may be unstable.
Temporal dynamics: Logistic regression produces a cross-sectional estimate of attrition probability. Tenure enters only as a static covariate; the model does not capture the timing of departure or time-varying predictors. Survival analysis methods (e.g., Cox proportional hazards) may offer complementary insights.
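
The threshold-sensitivity caveat above can be illustrated with a small simulation. The sketch below uses simulated scores, not the report's fitted probabilities; the sample size, seed, and score distribution are all assumptions. It recomputes precision and recall for the "Left" class at several cut-offs.

```r
set.seed(42)
n <- 5000

# Simulated outcome (1 = Left) and scores; leavers tend to score higher.
# These are toy values, not predictions from Model 3.
actual <- rbinom(n, size = 1, prob = 0.475)
prob   <- plogis(rnorm(n, mean = ifelse(actual == 1, 0.8, -0.8), sd = 1.2))

# Precision and recall for the positive class at a given threshold
metrics_at <- function(threshold) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & actual == 1)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  c(threshold = threshold,
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

res <- as.data.frame(t(sapply(c(0.3, 0.4, 0.5, 0.6), metrics_at)))
round(res, 3)
```

Lowering the threshold can only add positive predictions, so recall never decreases as the cut-off falls. What happens to precision depends on the score distribution, which is why the threshold should be tuned against the organisation's own costs.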
This analysis demonstrated that logistic regression is a tractable and interpretable framework for predicting employee attrition. Three nested models were estimated and evaluated on a stratified hold-out test set. The full model incorporating all available predictors (Model 3) substantially outperformed the restricted specifications on all classification metrics and goodness-of-fit criteria. The analysis confirms that employee attrition is a multidimensional phenomenon that cannot be adequately captured by compensation data alone; behavioural, organisational, and career-related factors provide critical incremental predictive power.
Bakker, A. B., & Demerouti, E. (2007). The job demands–resources model: State of the art. Journal of Managerial Psychology, 22(3), 309–328.
Cascio, W. F. (2006). Managing human resources: Productivity, quality of work life, profits (7th ed.). McGraw-Hill.
Demerouti, E., Bakker, A. B., Nachreiner, F., & Schaufeli, W. B. (2001). The job demands–resources model of burnout. Journal of Applied Psychology, 86(3), 499–512.
Mobley, W. H. (1977). Intermediate linkages in the relationship between job satisfaction and employee turnover. Journal of Applied Psychology, 62(2), 237–240.
Rosen, S. (1986). The theory of equalizing differences. In O. Ashenfelter & R. Layard (Eds.), Handbook of labor economics (Vol. 1, pp. 641–692). Elsevier.
Stealth Technologies. (2024). Employee attrition dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/stealthtechnologies/employee-attrition-dataset
This document was produced in RMarkdown and is fully
reproducible. All code is visible in the knitted output. Dataset files
(train.csv, test.csv) must be placed in the
working directory prior to knitting.