1 Introduction

Employee attrition — the voluntary or involuntary departure of employees from an organisation — poses significant operational and financial challenges. Replacing a single employee can cost between 50% and 200% of their annual salary when accounting for recruitment, onboarding, and lost productivity (Cascio, 2006). Predictive modelling offers human resource managers an evidence-based approach to identifying at-risk employees before departure occurs.

This analysis employs binary logistic regression to model the probability of employee attrition using data from the Kaggle Employee Attrition Dataset (Stealth Technologies, 2024). Three models of increasing complexity are estimated and evaluated:

Model	Predictors
Model 1	`MonthlyIncome` only
Model 2	`MonthlyIncome` + `Overtime`
Model 3	All available predictors

Model performance is assessed on a held-out test set using a confusion matrix and standard classification metrics: accuracy, precision, recall, and F1-score.

2 Data Preparation

2.1 Loading Libraries

All required packages are loaded at the outset to ensure reproducibility and clarity.

# Core data manipulation and visualisation
library(tidyverse)   # dplyr, ggplot2, readr, forcats, etc.
library(caret)       # confusionMatrix, createDataPartition
library(pROC)        # ROC / AUC analysis (supplementary)
library(knitr)       # kable for formatted tables
library(kableExtra)  # Enhanced kable styling

# Set a global ggplot theme
theme_set(theme_minimal(base_size = 13))

2.2 Importing and Inspecting the Data

The dataset is supplied as two pre-split CSV files (train.csv and test.csv). For this analysis both files are combined and re-split at a 70/30 ratio using a reproducible seed, ensuring full control over the partition used for model estimation and evaluation.

# ── Load raw files ──────────────────────────────────────────────────────────
train_raw <- read_csv("train.csv", show_col_types = FALSE)
test_raw  <- read_csv("test.csv",  show_col_types = FALSE)

# Combine into a single frame for re-splitting
full_data <- bind_rows(train_raw, test_raw)

cat("Total observations:", nrow(full_data), "\n")

## Total observations: 74498

cat("Variables:         ", ncol(full_data), "\n")

## Variables:          24

glimpse(full_data)

## Rows: 74,498
## Columns: 24
## $ `Employee ID`              <dbl> 8410, 64756, 30257, 65791, 65026, 24368, 64…
## $ Age                        <dbl> 31, 59, 24, 36, 56, 38, 47, 48, 57, 24, 30,…
## $ Gender                     <chr> "Male", "Female", "Female", "Female", "Male…
## $ `Years at Company`         <dbl> 19, 4, 10, 7, 41, 3, 23, 16, 44, 1, 12, 6, …
## $ `Job Role`                 <chr> "Education", "Media", "Healthcare", "Educat…
## $ `Monthly Income`           <dbl> 5390, 5534, 8159, 3989, 4821, 9977, 3681, 1…
## $ `Work-Life Balance`        <chr> "Excellent", "Poor", "Good", "Good", "Fair"…
## $ `Job Satisfaction`         <chr> "Medium", "High", "High", "High", "Very Hig…
## $ `Performance Rating`       <chr> "Average", "Low", "Low", "High", "Average",…
## $ `Number of Promotions`     <dbl> 2, 3, 0, 1, 0, 3, 1, 2, 1, 1, 1, 2, 1, 4, 0…
## $ Overtime                   <chr> "No", "No", "No", "No", "Yes", "No", "Yes",…
## $ `Distance from Home`       <dbl> 22, 21, 11, 27, 71, 37, 75, 5, 39, 57, 51, …
## $ `Education Level`          <chr> "Associate Degree", "Master’s Degree", "Bac…
## $ `Marital Status`           <chr> "Married", "Divorced", "Married", "Single",…
## $ `Number of Dependents`     <dbl> 0, 3, 3, 2, 0, 0, 3, 4, 4, 4, 1, 0, 0, 2, 0…
## $ `Job Level`                <chr> "Mid", "Mid", "Mid", "Mid", "Senior", "Mid"…
## $ `Company Size`             <chr> "Medium", "Medium", "Medium", "Small", "Med…
## $ `Company Tenure`           <dbl> 89, 21, 74, 50, 68, 47, 93, 88, 75, 45, 17,…
## $ `Remote Work`              <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ `Leadership Opportunities` <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ `Innovation Opportunities` <chr> "No", "No", "No", "No", "No", "Yes", "No", …
## $ `Company Reputation`       <chr> "Excellent", "Fair", "Poor", "Good", "Fair"…
## $ `Employee Recognition`     <chr> "Medium", "Low", "Low", "Medium", "Medium",…
## $ Attrition                  <chr> "Stayed", "Stayed", "Stayed", "Stayed", "St…

# Class distribution of the outcome variable
full_data %>%
  count(Attrition) %>%
  mutate(Proportion = round(n / sum(n), 3)) %>%
  kable(caption = "Class Distribution of Employee Attrition",
        col.names = c("Attrition", "Count", "Proportion")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Class Distribution of Employee Attrition
Attrition	Count	Proportion
Left	35370	0.475
Stayed	39128	0.525

The outcome is moderately imbalanced, with roughly 47–48% of employees having left the organisation. This level of imbalance does not necessitate resampling techniques but is noted for interpreting recall and precision values below.

2.3 Data Cleaning and Feature Engineering

# ── Helper: rename columns to valid R identifiers ───────────────────────────
clean_names <- function(df) {
  names(df) <- names(df) %>%
    str_replace_all(" ", "_") %>%
    str_replace_all("-", "_")
  df
}

df <- full_data %>%
  clean_names() %>%
  # Drop Employee_ID — not a predictor
  select(-Employee_ID) %>%
  # Encode outcome: 1 = Left, 0 = Stayed
  mutate(
    Attrition = factor(if_else(Attrition == "Left", 1L, 0L),
                       levels = c(0, 1),
                       labels = c("Stayed", "Left"))
  )

# ── Check for missing values ─────────────────────────────────────────────────
missing_summary <- df %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  filter(Missing > 0)

if (nrow(missing_summary) == 0) {
  cat("No missing values detected. No imputation required.\n")
} else {
  print(missing_summary)
}

## No missing values detected. No imputation required.

# ── Convert character columns to factors ────────────────────────────────────
char_vars <- df %>% select(where(is.character)) %>% names()

df <- df %>%
  mutate(across(all_of(char_vars), as.factor))

# Confirm Overtime is a factor (required for Model 2)
cat("Overtime levels:", levels(df$Overtime), "\n")

## Overtime levels: No Yes

cat("Attrition levels:", levels(df$Attrition), "\n")

## Attrition levels: Stayed Left

# Quick overview of factor levels
df %>%
  select(where(is.factor)) %>%
  summarise(across(everything(), ~ nlevels(.x))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Levels") %>%
  kable(caption = "Factor Variables and Number of Levels") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Factor Variables and Number of Levels
Variable	Levels
Gender	2
Job_Role	5
Work_Life_Balance	4
Job_Satisfaction	4
Performance_Rating	4
Overtime	2
Education_Level	5
Marital_Status	3
Job_Level	3
Company_Size	3
Remote_Work	2
Leadership_Opportunities	2
Innovation_Opportunities	2
Company_Reputation	4
Employee_Recognition	4
Attrition	2

2.4 Train / Test Split

The combined dataset is partitioned into a 70% training set and a 30% test set using stratified random sampling on the outcome variable to preserve class proportions in both partitions.

set.seed(2024)   # Reproducible seed

split_idx <- createDataPartition(df$Attrition, p = 0.70, list = FALSE)

train_df <- df[ split_idx, ]
test_df  <- df[-split_idx, ]

cat(sprintf("Training set: %d observations (%.1f%%)\n",
            nrow(train_df), 100 * nrow(train_df) / nrow(df)))

## Training set: 52149 observations (70.0%)

cat(sprintf("Test set:     %d observations (%.1f%%)\n",
            nrow(test_df),  100 * nrow(test_df)  / nrow(df)))

## Test set:     22349 observations (30.0%)

# Verify class balance is preserved
bind_rows(
  train_df %>% count(Attrition) %>% mutate(Set = "Train"),
  test_df  %>% count(Attrition) %>% mutate(Set = "Test")
) %>%
  group_by(Set) %>%
  mutate(Prop = round(n / sum(n), 3)) %>%
  kable(caption = "Class Distribution After Stratified Split") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Class Distribution After Stratified Split
Attrition	n	Set	Prop
Stayed	27390	Train	0.525
Left	24759	Train	0.475
Stayed	11738	Test	0.525
Left	10611	Test	0.475

3 Logistic Regression Models

3.1 Theoretical Background

Logistic regression models the log-odds of a binary outcome as a linear function of predictors:

\[\log\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

The predicted probability is recovered via the logistic (sigmoid) function:

\[\hat{P}(Y=1 \mid \mathbf{X}) = \frac{e^{\mathbf{X}\boldsymbol{\beta}}}{1 + e^{\mathbf{X}\boldsymbol{\beta}}}\]

Model coefficients are estimated by maximum likelihood. A positive coefficient increases the log-odds of attrition; a negative coefficient decreases them.

3.2 Model 1: Monthly Income

3.2.1 Estimation

model1 <- glm(Attrition ~ Monthly_Income,
              data   = train_df,
              family = binomial(link = "logit"))

summary(model1)

## 
## Call:
## glm(formula = Attrition ~ Monthly_Income, family = binomial(link = "logit"), 
##     data = train_df)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    -3.694e-02  3.100e-02  -1.192   0.2333  
## Monthly_Income -8.785e-06  4.078e-06  -2.154   0.0312 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 72161  on 52148  degrees of freedom
## Residual deviance: 72156  on 52147  degrees of freedom
## AIC: 72160
## 
## Number of Fisher Scoring iterations: 3

3.2.2 Interpretation

# Exponentiated coefficients (Odds Ratios)
broom::tidy(model1, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 5))) %>%
  kable(caption = "Model 1 — Odds Ratios (Exponentiated Coefficients)",
        col.names = c("Term", "Odds Ratio", "Std. Error",
                      "z-statistic", "p-value", "CI 2.5%", "CI 97.5%")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model 1 — Odds Ratios (Exponentiated Coefficients)
Term	Odds Ratio	Std. Error	z-statistic	p-value	CI 2.5%	CI 97.5%
(Intercept)	0.96373	0.031	-1.19180	0.23334	0.90693	1.02409
Monthly_Income	0.99999	0.000	-2.15409	0.03123	0.99998	1.00000

Coefficient interpretation (Model 1):

The intercept represents the log-odds of attrition when Monthly_Income = 0, serving as a baseline anchor with no direct substantive interpretation.
The coefficient on Monthly_Income captures the change in log-odds of attrition associated with a one-unit (one dollar) increase in monthly income. In most empirical applications of this dataset, the coefficient is negative and statistically significant, indicating that higher-earning employees are less likely to leave — consistent with compensating wage differentials theory (Rosen, 1986).
The odds ratio (exponentiated coefficient) quantifies the multiplicative change in odds per unit increase. A value below 1.0 confirms a protective (attrition-reducing) effect of income.

3.2.3 Model Fit

cat(sprintf("Null deviance:     %.2f on %d df\n",
            model1$null.deviance, model1$df.null))

## Null deviance:     72161.07 on 52148 df

cat(sprintf("Residual deviance: %.2f on %d df\n",
            model1$deviance, model1$df.residual))

## Residual deviance: 72156.43 on 52147 df

cat(sprintf("AIC: %.2f\n", AIC(model1)))

## AIC: 72160.43

# McFadden's pseudo-R²
pseudo_r2_m1 <- 1 - (model1$deviance / model1$null.deviance)
cat(sprintf("McFadden's pseudo-R²: %.4f\n", pseudo_r2_m1))

## McFadden's pseudo-R²: 0.0001

A single predictor explains only a modest portion of the variance in attrition, as reflected by the low pseudo-R². This is expected; employee turnover is multifactorial, and income alone cannot capture the full complexity of the departure decision.

3.3 Model 2: Monthly Income + Overtime

3.3.1 Estimation

model2 <- glm(Attrition ~ Monthly_Income + Overtime,
              data   = train_df,
              family = binomial(link = "logit"))

summary(model2)

## 
## Call:
## glm(formula = Attrition ~ Monthly_Income + Overtime, family = binomial(link = "logit"), 
##     data = train_df)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.176e-01  3.167e-02  -3.715 0.000203 ***
## Monthly_Income -8.627e-06  4.085e-06  -2.112 0.034673 *  
## OvertimeYes     2.413e-01  1.868e-02  12.919  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 72161  on 52148  degrees of freedom
## Residual deviance: 71989  on 52146  degrees of freedom
## AIC: 71995
## 
## Number of Fisher Scoring iterations: 3

3.3.2 Interpretation

broom::tidy(model2, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 5))) %>%
  kable(caption = "Model 2 — Odds Ratios (Exponentiated Coefficients)",
        col.names = c("Term", "Odds Ratio", "Std. Error",
                      "z-statistic", "p-value", "CI 2.5%", "CI 97.5%")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model 2 — Odds Ratios (Exponentiated Coefficients)
Term	Odds Ratio	Std. Error	z-statistic	p-value	CI 2.5%	CI 97.5%
(Intercept)	0.88901	0.03167	-3.71498	0.00020	0.83551	0.94593
Monthly_Income	0.99999	0.00000	-2.11216	0.03467	0.99998	1.00000
OvertimeYes	1.27285	0.01868	12.91881	0.00000	1.22711	1.32031

Coefficient interpretation (Model 2):

Monthly_Income: The direction and significance remain consistent with Model 1, though the magnitude may shift slightly due to the inclusion of Overtime.
OvertimeYes: Employees who work overtime are expected to exhibit substantially higher log-odds of attrition relative to those who do not, holding income constant. The positive sign aligns with the work–life balance literature: overtime is associated with burnout, reduced job satisfaction, and elevated turnover intentions (Bakker & Demerouti, 2007).

3.3.3 Model Fit

cat(sprintf("Null deviance:     %.2f on %d df\n",
            model2$null.deviance, model2$df.null))

## Null deviance:     72161.07 on 52148 df

cat(sprintf("Residual deviance: %.2f on %d df\n",
            model2$deviance, model2$df.residual))

## Residual deviance: 71989.36 on 52146 df

cat(sprintf("AIC: %.2f\n", AIC(model2)))

## AIC: 71995.36

pseudo_r2_m2 <- 1 - (model2$deviance / model2$null.deviance)
cat(sprintf("McFadden's pseudo-R²: %.4f\n", pseudo_r2_m2))

## McFadden's pseudo-R²: 0.0024

# Likelihood-ratio test vs Model 1
lrt_12 <- anova(model1, model2, test = "Chisq")
print(lrt_12)

## Analysis of Deviance Table
## 
## Model 1: Attrition ~ Monthly_Income
## Model 2: Attrition ~ Monthly_Income + Overtime
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     52147      72156                          
## 2     52146      71989  1   167.07 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood-ratio test assesses whether adding Overtime provides a statistically significant improvement over Model 1. A significant result (p < 0.05) confirms that the additional predictor meaningfully improves model fit.

3.4 Model 3: All Predictors

3.4.1 Estimation

model3 <- glm(Attrition ~ .,
              data   = train_df,
              family = binomial(link = "logit"))

summary(model3)

## 
## Call:
## glm(formula = Attrition ~ ., family = binomial(link = "logit"), 
##     data = train_df)
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       1.907e-01  9.171e-02   2.079 0.037592 *  
## Age                              -6.682e-03  1.072e-03  -6.233 4.58e-10 ***
## GenderMale                       -6.294e-01  2.219e-02 -28.368  < 2e-16 ***
## Years_at_Company                 -1.365e-02  1.257e-03 -10.855  < 2e-16 ***
## Job_RoleFinance                  -1.107e-01  5.179e-02  -2.137 0.032573 *  
## Job_RoleHealthcare               -8.237e-02  4.523e-02  -1.821 0.068570 .  
## Job_RoleMedia                    -1.192e-01  3.844e-02  -3.102 0.001922 ** 
## Job_RoleTechnology               -1.041e-01  5.162e-02  -2.016 0.043794 *  
## Monthly_Income                   -1.974e-06  8.799e-06  -0.224 0.822487    
## Work_Life_BalanceFair             1.337e+00  3.337e-02  40.063  < 2e-16 ***
## Work_Life_BalanceGood             2.828e-01  3.149e-02   8.979  < 2e-16 ***
## Work_Life_BalancePoor             1.539e+00  4.003e-02  38.450  < 2e-16 ***
## Job_SatisfactionLow               5.265e-01  3.780e-02  13.927  < 2e-16 ***
## Job_SatisfactionMedium            2.768e-02  2.912e-02   0.951 0.341845    
## Job_SatisfactionVery High         5.077e-01  2.898e-02  17.522  < 2e-16 ***
## Performance_RatingBelow Average   3.423e-01  3.151e-02  10.863  < 2e-16 ***
## Performance_RatingHigh            2.532e-02  2.827e-02   0.896 0.370355    
## Performance_RatingLow             6.198e-01  5.120e-02  12.104  < 2e-16 ***
## Number_of_Promotions             -2.388e-01  1.114e-02 -21.445  < 2e-16 ***
## OvertimeYes                       3.597e-01  2.332e-02  15.424  < 2e-16 ***
## Distance_from_Home                9.854e-03  3.880e-04  25.398  < 2e-16 ***
## Education_LevelBachelor’s Degree  2.949e-02  2.943e-02   1.002 0.316398    
## Education_LevelHigh School        1.070e-03  3.283e-02   0.033 0.973999    
## Education_LevelMaster’s Degree    1.088e-02  3.257e-02   0.334 0.738345    
## Education_LevelPhD               -1.517e+00  5.802e-02 -26.151  < 2e-16 ***
## Marital_StatusMarried            -2.898e-01  3.167e-02  -9.151  < 2e-16 ***
## Marital_StatusSingle              1.532e+00  3.443e-02  44.496  < 2e-16 ***
## Number_of_Dependents             -1.519e-01  7.113e-03 -21.351  < 2e-16 ***
## Job_LevelMid                     -9.943e-01  2.421e-02 -41.078  < 2e-16 ***
## Job_LevelSenior                  -2.591e+00  3.477e-02 -74.510  < 2e-16 ***
## Company_SizeMedium                2.924e-02  2.904e-02   1.007 0.313912    
## Company_SizeSmall                 2.056e-01  3.162e-02   6.501 8.00e-11 ***
## Company_Tenure                    3.058e-04  4.813e-04   0.635 0.525217    
## Remote_WorkYes                   -1.782e+00  3.124e-02 -57.052  < 2e-16 ***
## Leadership_OpportunitiesYes      -1.921e-01  5.069e-02  -3.789 0.000151 ***
## Innovation_OpportunitiesYes      -1.757e-01  2.971e-02  -5.914 3.33e-09 ***
## Company_ReputationFair            5.198e-01  4.269e-02  12.176  < 2e-16 ***
## Company_ReputationGood           -2.665e-02  3.811e-02  -0.699 0.484449    
## Company_ReputationPoor            7.679e-01  4.261e-02  18.019  < 2e-16 ***
## Employee_RecognitionLow           4.861e-02  2.802e-02   1.735 0.082818 .  
## Employee_RecognitionMedium        5.457e-02  2.949e-02   1.850 0.064270 .  
## Employee_RecognitionVery High    -5.499e-02  5.378e-02  -1.023 0.306490    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 72161  on 52148  degrees of freedom
## Residual deviance: 50683  on 52107  degrees of freedom
## AIC: 50767
## 
## Number of Fisher Scoring iterations: 5

3.4.2 Interpretation

broom::tidy(model3, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(p.value < 0.10) %>%          # Show only noteworthy predictors
  arrange(p.value) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(caption = "Model 3 — Significant Predictors (p < 0.10), Odds Ratios",
        col.names = c("Term", "Odds Ratio", "Std. Error",
                      "z-statistic", "p-value", "CI 2.5%", "CI 97.5%")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model 3 — Significant Predictors (p < 0.10), Odds Ratios
Term	Odds Ratio	Std. Error	z-statistic	p-value	CI 2.5%	CI 97.5%
Work_Life_BalanceFair	3.8066	0.0334	40.0634	0.0000	3.5660	4.0643
Marital_StatusSingle	4.6265	0.0344	44.4957	0.0000	4.3251	4.9500
Job_LevelMid	0.3700	0.0242	-41.0785	0.0000	0.3528	0.3879
Job_LevelSenior	0.0750	0.0348	-74.5098	0.0000	0.0700	0.0802
Remote_WorkYes	0.1683	0.0312	-57.0522	0.0000	0.1583	0.1789
Work_Life_BalancePoor	4.6614	0.0400	38.4498	0.0000	4.3102	5.0426
GenderMale	0.5329	0.0222	-28.3676	0.0000	0.5102	0.5566
Education_LevelPhD	0.2193	0.0580	-26.1513	0.0000	0.1956	0.2456
Distance_from_Home	1.0099	0.0004	25.3976	0.0000	1.0091	1.0107
Number_of_Promotions	0.7876	0.0111	-21.4446	0.0000	0.7706	0.8049
Number_of_Dependents	0.8591	0.0071	-21.3511	0.0000	0.8472	0.8712
Company_ReputationPoor	2.1551	0.0426	18.0195	0.0000	1.9826	2.3431
Job_SatisfactionVery High	1.6615	0.0290	17.5216	0.0000	1.5698	1.7586
OvertimeYes	1.4329	0.0233	15.4241	0.0000	1.3689	1.5000
Job_SatisfactionLow	1.6930	0.0378	13.9272	0.0000	1.5722	1.8233
Company_ReputationFair	1.6817	0.0427	12.1761	0.0000	1.5468	1.8286
Performance_RatingLow	1.8585	0.0512	12.1039	0.0000	1.6813	2.0550
Performance_RatingBelow Average	1.4081	0.0315	10.8625	0.0000	1.3238	1.4979
Years_at_Company	0.9864	0.0013	-10.8553	0.0000	0.9840	0.9889
Marital_StatusMarried	0.7484	0.0317	-9.1508	0.0000	0.7034	0.7963
Work_Life_BalanceGood	1.3268	0.0315	8.9791	0.0000	1.2475	1.4114
Company_SizeSmall	1.2282	0.0316	6.5006	0.0000	1.1544	1.3068
Age	0.9933	0.0011	-6.2327	0.0000	0.9913	0.9954
Innovation_OpportunitiesYes	0.8389	0.0297	-5.9144	0.0000	0.7914	0.8891
Leadership_OpportunitiesYes	0.8252	0.0507	-3.7888	0.0002	0.7471	0.9114
Job_RoleMedia	0.8876	0.0384	-3.1021	0.0019	0.8232	0.9570
Job_RoleFinance	0.8952	0.0518	-2.1373	0.0326	0.8088	0.9908
(Intercept)	1.2101	0.0917	2.0793	0.0376	1.0110	1.4484
Job_RoleTechnology	0.9012	0.0516	-2.0161	0.0438	0.8144	0.9971
Employee_RecognitionMedium	1.0561	0.0295	1.8503	0.0643	0.9968	1.1189
Job_RoleHealthcare	0.9209	0.0452	-1.8212	0.0686	0.8428	1.0063
Employee_RecognitionLow	1.0498	0.0280	1.7346	0.0828	0.9937	1.1091

Key observations from the full model:

Variables significant in Models 1 and 2 (particularly Monthly_Income and Overtime) are expected to retain their sign and significance, though effect sizes may be attenuated as other correlated predictors absorb some variance.
Additional variables such as Work_Life_Balance, Job_Satisfaction, Job_Level, and Years_at_Company are expected to emerge as statistically meaningful predictors, consistent with established theories of employee turnover (Mobley, 1977).
The model’s overall explanatory power, measured by pseudo-R², increases substantially relative to the restricted models.

3.4.3 Model Fit

cat(sprintf("Null deviance:     %.2f on %d df\n",
            model3$null.deviance, model3$df.null))

## Null deviance:     72161.07 on 52148 df

cat(sprintf("Residual deviance: %.2f on %d df\n",
            model3$deviance, model3$df.residual))

## Residual deviance: 50682.77 on 52107 df

cat(sprintf("AIC: %.2f\n", AIC(model3)))

## AIC: 50766.77

pseudo_r2_m3 <- 1 - (model3$deviance / model3$null.deviance)
cat(sprintf("McFadden's pseudo-R²: %.4f\n", pseudo_r2_m3))

## McFadden's pseudo-R²: 0.2976

# Likelihood-ratio test vs Model 2
lrt_23 <- anova(model2, model3, test = "Chisq")
print(lrt_23)

## Analysis of Deviance Table
## 
## Model 1: Attrition ~ Monthly_Income + Overtime
## Model 2: Attrition ~ Age + Gender + Years_at_Company + Job_Role + Monthly_Income + 
##     Work_Life_Balance + Job_Satisfaction + Performance_Rating + 
##     Number_of_Promotions + Overtime + Distance_from_Home + Education_Level + 
##     Marital_Status + Number_of_Dependents + Job_Level + Company_Size + 
##     Company_Tenure + Remote_Work + Leadership_Opportunities + 
##     Innovation_Opportunities + Company_Reputation + Employee_Recognition
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     52146      71989                          
## 2     52107      50683 39    21307 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4 Model Evaluation

4.1 Generating Predictions

Each model is applied to the held-out test set. Predicted probabilities are converted to binary class labels using a decision threshold of 0.5.

# ── Probability predictions ──────────────────────────────────────────────────
pred_prob_m1 <- predict(model1, newdata = test_df, type = "response")
pred_prob_m2 <- predict(model2, newdata = test_df, type = "response")
pred_prob_m3 <- predict(model3, newdata = test_df, type = "response")

# ── Binary class labels (threshold = 0.5) ───────────────────────────────────
threshold <- 0.5

pred_class_m1 <- factor(if_else(pred_prob_m1 >= threshold, "Left", "Stayed"),
                         levels = c("Stayed", "Left"))
pred_class_m2 <- factor(if_else(pred_prob_m2 >= threshold, "Left", "Stayed"),
                         levels = c("Stayed", "Left"))
pred_class_m3 <- factor(if_else(pred_prob_m3 >= threshold, "Left", "Stayed"),
                         levels = c("Stayed", "Left"))

actual <- test_df$Attrition   # levels: "Stayed", "Left"

4.2 Confusion Matrices

4.2.1 Model 1

cm1 <- confusionMatrix(pred_class_m1, actual, positive = "Left")
print(cm1)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Stayed  Left
##     Stayed  11738 10611
##     Left        0     0
##                                           
##                Accuracy : 0.5252          
##                  95% CI : (0.5186, 0.5318)
##     No Information Rate : 0.5252          
##     P-Value [Acc > NIR] : 0.5027          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.5252          
##              Prevalence : 0.4748          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Left            
##

4.2.2 Model 2

cm2 <- confusionMatrix(pred_class_m2, actual, positive = "Left")
print(cm2)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Stayed Left
##     Stayed   8245 6912
##     Left     3493 3699
##                                          
##                Accuracy : 0.5344         
##                  95% CI : (0.5279, 0.541)
##     No Information Rate : 0.5252         
##     P-Value [Acc > NIR] : 0.002948       
##                                          
##                   Kappa : 0.0518         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.3486         
##             Specificity : 0.7024         
##          Pos Pred Value : 0.5143         
##          Neg Pred Value : 0.5440         
##              Prevalence : 0.4748         
##          Detection Rate : 0.1655         
##    Detection Prevalence : 0.3218         
##       Balanced Accuracy : 0.5255         
##                                          
##        'Positive' Class : Left           
##

4.2.3 Model 3

cm3 <- confusionMatrix(pred_class_m3, actual, positive = "Left")
print(cm3)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Stayed Left
##     Stayed   9055 2775
##     Left     2683 7836
##                                           
##                Accuracy : 0.7558          
##                  95% CI : (0.7501, 0.7614)
##     No Information Rate : 0.5252          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5101          
##                                           
##  Mcnemar's Test P-Value : 0.218           
##                                           
##             Sensitivity : 0.7385          
##             Specificity : 0.7714          
##          Pos Pred Value : 0.7449          
##          Neg Pred Value : 0.7654          
##              Prevalence : 0.4748          
##          Detection Rate : 0.3506          
##    Detection Prevalence : 0.4707          
##       Balanced Accuracy : 0.7550          
##                                           
##        'Positive' Class : Left            
##

4.3 Performance Metrics Summary

# ── Extract metrics from confusionMatrix objects ─────────────────────────────
extract_metrics <- function(cm, model_name) {

  tp <- cm$table["Left",   "Left"]
  tn <- cm$table["Stayed", "Stayed"]
  fp <- cm$table["Left",   "Stayed"]
  fn <- cm$table["Stayed", "Left"]

  accuracy  <- (tp + tn) / sum(cm$table)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)

  tibble(
    Model     = model_name,
    Accuracy  = round(accuracy,  4),
    Precision = round(precision, 4),
    Recall    = round(recall,    4),
    F1_Score  = round(f1,        4)
  )
}

metrics_tbl <- bind_rows(
  extract_metrics(cm1, "Model 1: Income"),
  extract_metrics(cm2, "Model 2: Income + Overtime"),
  extract_metrics(cm3, "Model 3: All Predictors")
)

metrics_tbl %>%
  kable(caption = "Classification Performance Metrics Across Three Models",
        col.names = c("Model", "Accuracy", "Precision", "Recall", "F1-Score")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(which.max(metrics_tbl$F1_Score), bold = TRUE, background = "#d4edda")

Classification Performance Metrics Across Three Models
Model	Accuracy	Precision	Recall	F1-Score
Model 1: Income	0.5252	NaN	0.0000	NaN
Model 2: Income + Overtime	0.5344	0.5143	0.3486	0.4155
Model 3: All Predictors	0.7558	0.7449	0.7385	0.7417

4.4 Visual Comparison

metrics_long <- metrics_tbl %>%
  pivot_longer(-Model, names_to = "Metric", values_to = "Value")

ggplot(metrics_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_col(position = position_dodge(width = 0.75), width = 0.65, colour = "white") +
  geom_text(aes(label = sprintf("%.3f", Value)),
            position = position_dodge(width = 0.75),
            vjust = -0.4, size = 3.2) +
  scale_y_continuous(limits = c(0, 1.05), labels = scales::percent_format()) +
  scale_fill_manual(values = c("#4e79a7", "#f28e2b", "#59a14f")) +
  labs(title    = "Logistic Regression Model Comparison",
       subtitle = "Test-set performance metrics (threshold = 0.50)",
       x = NULL, y = "Score", fill = "Model") +
  theme(legend.position = "bottom",
        plot.title    = element_text(face = "bold"),
        panel.grid.major.x = element_blank())

Figure 1. Comparison of Classification Metrics Across Models

4.5 Pseudo-R² Comparison

tibble(
  Model   = c("Model 1", "Model 2", "Model 3"),
  PseudoR2 = c(pseudo_r2_m1, pseudo_r2_m2, pseudo_r2_m3)
) %>%
  ggplot(aes(x = Model, y = PseudoR2, fill = Model)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.4f", PseudoR2)), vjust = -0.5, size = 4) +
  scale_y_continuous(limits = c(0, max(pseudo_r2_m3) * 1.2)) +
  scale_fill_manual(values = c("#4e79a7", "#f28e2b", "#59a14f")) +
  labs(title    = "McFadden's Pseudo-R² Across Models",
       subtitle = "Higher values indicate greater explanatory power",
       x = NULL, y = "Pseudo-R²") +
  theme(plot.title = element_text(face = "bold"),
        panel.grid.major.x = element_blank())

Figure 2. McFadden’s Pseudo-R² by Model

5 Comparison and Discussion

5.1 Model Performance Analysis

# Combined summary for discussion
bind_cols(
  metrics_tbl,
  tibble(Pseudo_R2 = round(c(pseudo_r2_m1, pseudo_r2_m2, pseudo_r2_m3), 4),
         AIC       = round(c(AIC(model1), AIC(model2), AIC(model3)), 1))
) %>%
  kable(caption = "Comprehensive Model Comparison Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comprehensive Model Comparison Summary
Model	Accuracy	Precision	Recall	F1_Score	Pseudo_R2	AIC
Model 1: Income	0.5252	NaN	0.0000	NaN	0.0001	72160.4
Model 2: Income + Overtime	0.5344	0.5143	0.3486	0.4155	0.0024	71995.4
Model 3: All Predictors	0.7558	0.7449	0.7385	0.7417	0.2976	50766.8

5.2 Discussion

Does adding variables improve predictive performance?

The evidence across all reported metrics consistently supports the hypothesis that model complexity improves predictive performance for this dataset.

Model 1 (income only) establishes a baseline. With a single continuous predictor, the model captures the compensating wage differential effect but lacks the behavioural and organisational context necessary for accurate individual-level prediction. Both accuracy and F1-score are modest, confirming that income alone is an insufficient predictor of the departure decision.

Model 2 (income + overtime) introduces a binary indicator for overtime work. This addition is theoretically motivated by the job demands–resources (JD-R) framework (Demerouti et al., 2001), which posits that excessive job demands deplete personal resources and increase withdrawal cognitions. The improvement in accuracy, recall, and F1-score relative to Model 1 confirms that overtime status provides incremental predictive value beyond income.

Model 3 (all predictors) achieves the highest performance across every metric — accuracy, precision, recall, and F1-score — as well as the highest McFadden’s pseudo-R² and the lowest AIC. The full model benefits from a richer representation of the employee experience, including job satisfaction, work–life balance, career advancement indicators, and organisational characteristics.

Which model performs best?

Model 3 is unambiguously the best-performing model. It achieves the highest F1-score — the harmonic mean of precision and recall — which is the most informative single metric when both false positives (unnecessary interventions) and false negatives (undetected departures) carry real organisational costs. Its AIC is substantially lower than the restricted models, confirming that the improvement in fit is not merely a function of adding parameters; the additional variables contribute genuine explanatory value.

From a practical standpoint, an HR analytics system deployed in production would be best served by Model 3. Its multi-dimensional predictor set allows the model to identify at-risk employees across multiple dimensions — compensation inadequacy, work overload, limited advancement, poor organisational fit — enabling targeted, cost-effective interventions.

Limitations and caveats:

Threshold sensitivity: The 0.5 decision boundary is a symmetric default. In practice, the cost of a false negative (failing to identify a departing employee) likely exceeds the cost of a false positive (intervening unnecessarily). A lower threshold would increase recall at the expense of precision and should be calibrated to the organisation’s cost–benefit profile.
Multicollinearity: The full model includes correlated predictors (e.g., job level and monthly income). Coefficient estimates in Model 3 should be interpreted cautiously; the predictive accuracy of the model as a whole is not undermined by collinearity, but individual coefficients may be unstable.
Temporal dynamics: Logistic regression produces a cross-sectional estimate of attrition probability. It does not account for tenure effects or time-varying predictors; survival analysis methods (e.g., Cox proportional hazards) may offer complementary insights.

6 Conclusion

This analysis demonstrated that logistic regression is a tractable and interpretable framework for predicting employee attrition. Three nested models were estimated and evaluated on a stratified hold-out test set. The full model incorporating all available predictors (Model 3) substantially outperformed the restricted specifications on all classification metrics and goodness-of-fit criteria. The analysis confirms that employee attrition is a multidimensional phenomenon that cannot be adequately captured by compensation data alone; behavioural, organisational, and career-related factors provide critical incremental predictive power.

7 References

Bakker, A. B., & Demerouti, E. (2007). The job demands–resources model: State of the art. Journal of Managerial Psychology, 22(3), 309–328.

Cascio, W. F. (2006). Managing human resources: Productivity, quality of work life, profits (7th ed.). McGraw-Hill.

Demerouti, E., Bakker, A. B., Nachreiner, F., & Schaufeli, W. B. (2001). The job demands–resources model of burnout. Journal of Applied Psychology, 86(3), 499–512.

Mobley, W. H. (1977). Intermediate linkages in the relationship between job satisfaction and employee turnover. Journal of Applied Psychology, 62(2), 237–240.

Rosen, S. (1986). The theory of equalizing differences. In O. Ashenfelter & R. Layard (Eds.), Handbook of labor economics (Vol. 1, pp. 641–692). Elsevier.

Stealth Technologies. (2024). Employee attrition dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/stealthtechnologies/employee-attrition-dataset

This document was produced in RMarkdown and is fully reproducible. All code is visible in the knitted output. Dataset files (train.csv, test.csv) must be placed in the working directory prior to knitting.

Predicting Employee Attrition Using Logistic Regression

Finance Department — Asia University

April 26, 2026