library(tidyverse)
library(scales)
library(sjPlot)
library(patchwork)
library(broom)
library(knitr)
library(kableExtra)

Business Context

The organisation has registered a meaningful increase in voluntary turnover over the past twelve months. Beyond the immediate cost of replacement — typically estimated at 0.5–2× annual salary per departing employee — sustained attrition erodes institutional knowledge, disrupts team cohesion, and signals underlying structural issues that, left unaddressed, compound over time.

This report presents a structured, evidence-based investigation into the drivers of that attrition. Rather than treating turnover as an inevitable cost of business, we approach it as an analytical problem with identifiable causes and tractable solutions.

Analytical Lens Our analysis is grounded in Herzberg’s Two-Factor Theory, which distinguishes between hygiene factors — baseline conditions whose absence causes dissatisfaction (compensation, working conditions, managerial relationships) — and motivators — growth-oriented elements that actively drive engagement and retention (recognition, advancement, meaningful work). This framework structures both the hypotheses we test and the recommendations we draw.


Data & Setup

employees_df <- readRDS("../data/processed/employees_cleaned.rds")
# ── Shared plot theme ──────────────────────────────────────────────────────────
theme_pa <- function(type = "bar") {
  rot    <- if (type == "bar")  45 else 0
  hjust  <- if (type == "bar")  1  else 0.5

  theme_minimal(base_family = "IBM Plex Sans") +
    theme(
      plot.title       = element_text(face = "bold", size = 13.5,
                                      color = "#111827", margin = margin(b = 4)),
      plot.subtitle    = element_text(size = 10.5, color = "#6b7280",
                                      face = "plain", margin = margin(b = 14)),
      plot.caption     = element_text(size = 8, color = "#9ca3af",
                                      hjust = 0, margin = margin(t = 10)),
      plot.background  = element_rect(fill = "#fafaf8", color = NA),
      panel.background = element_rect(fill = "#fafaf8", color = NA),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "#f0ede6", linewidth = 0.5),
      axis.text        = element_text(size = 9, color = "#6b7280"),
      axis.title       = element_text(size = 9.5, color = "#374151"),
      axis.text.x      = element_text(angle = rot, hjust = hjust),
      legend.position       = "top",
      legend.justification  = "left",
      legend.direction      = "horizontal",
      legend.title          = element_text(size = 9, face = "bold"),
      legend.text           = element_text(size = 9),
      strip.text            = element_text(face = "bold", size = 9,
                                           color = "#374151"),
      strip.background      = element_rect(fill = "#f3f4f6", color = NA)
    )
}

# ── Shared colour scale ────────────────────────────────────────────────────────
ACTIVE_COLS <- c("No" = "#9A031E", "Yes" = "#0F4C5C")

scale_fill_active <- function() scale_fill_manual(values = ACTIVE_COLS, name = "Still Active")
scale_color_active <- function() scale_color_manual(values = ACTIVE_COLS, name = "Still Active")

Dataset Snapshot

glimpse(employees_df)
## Rows: 1,468
## Columns: 38
## $ employee_id      <int> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009,…
## $ active           <fct> No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, …
## $ stock_opt_lvl    <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, 1, 2, 2,…
## $ trainings        <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, 1, 5, 2,…
## $ age              <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 29, 31, 3…
## $ commute_dist     <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, 19, 24, …
## $ ed_lvl           <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, 4, 2, 2,…
## $ ed_field         <fct> Life Sciences, Life Sciences, Other, Life Sciences, M…
## $ gender           <fct> Female, Male, Male, Female, Male, Male, Female, Male,…
## $ marital_sts      <fct> Single, Married, Single, Married, Married, Single, Ma…
## $ dept             <fct> Sales, Research & Development, Research & Development…
## $ engagement       <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4, 4, 4,…
## $ job_lvl          <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ job_title        <fct> Sales Executive, Research Scientist, Laboratory Techn…
## $ overtime         <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, No, Yes, …
## $ business_travel  <fct> Travel_Rarely, Travel_Frequently, Travel_Rarely, Trav…
## $ hourly_rate      <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 49, 31, 9…
## $ daily_comp       <int> 752, 488, 736, 448, 320, 632, 648, 536, 352, 752, 672…
## $ monthly_comp     <int> 16293, 10573, 15947, 9707, 6933, 13693, 14040, 11613,…
## $ annual_comp      <int> 195520, 126880, 191360, 116480, 83200, 164320, 168480…
## $ ytd_leads        <int> 61, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ytd_sales        <int> 81496, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ standard_hrs     <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ salary_hike_pct  <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 12, 17, 1…
## $ perf_rating      <fct> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ prior_emplr_cnt  <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, 1, 0, 1,…
## $ env_sat          <fct> Dissatisfied, Satisfied, Highly Satisfied, Highly Sat…
## $ job_sat          <fct> Highly Satisfied, Dissatisfied, Satisfied, Satisfied,…
## $ rel_sat          <fct> Highly Dissatisfied, Highly Satisfied, Dissatisfied, …
## $ wl_balance       <int> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ work_exp         <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3, 6, 10,…
## $ org_tenure       <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4, 10, 6, …
## $ job_tenure       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, 9, 2, 0,…
## $ last_promo       <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, 8, 0, 0,…
## $ mgr_tenure       <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, 8, 5, 0,…
## $ interview_rating <dbl> 3.3, 3.7, 2.7, 4.6, 5.0, 3.6, 4.8, 3.5, 4.3, 3.8, 4.4…
## $ comp_ratio       <dbl> 1.4242133, 0.9242133, 1.3939685, 0.8485140, 0.6060315…
## $ tenure_density   <dbl> 0.75000000, 1.00000000, 0.00000000, 1.00000000, 0.333…

Attrition Overview

The first step in any attrition analysis is to characterise the outcome variable — understanding the base rate and composition of departures before examining causes.

plot_data <- employees_df |>
  count(active) |>
  mutate(
    pct   = n / sum(n),
    label = paste0(comma(n), "\n(", percent(pct, accuracy = 0.1), ")")
  )

ggplot(plot_data, aes(x = active, y = pct, fill = active)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = label), vjust = 1.3, size = 3.5,
            fontface = "bold", color = "white") +
  scale_fill_active() +
  scale_y_continuous(labels = percent, limits = c(0, max(plot_data$pct) + 0.12)) +
  labs(
    title    = "Workforce Attrition: Current Snapshot",
    subtitle = "Proportion of employees who have left (Inactive) vs. those who remain (Active)",
    x        = "Employment Status",
    y        = "Share of Workforce",
    caption  = "Active = 'Yes' (currently employed) | Inactive = 'No' (has left the organisation)"
  ) +
  theme_pa()

~16%

Overall Attrition Rate

↑ Above 10–12% industry norm

5

Hypotheses Tested

Herzberg framework applied

Overtime Attrition Multiplier

Strongest single predictor

Under-30 Exit Multiplier

vs. employees aged 30+

Methodological Note At ~16%, this organisation’s attrition rate exceeds the typical 10–12% benchmark for comparable industries. The class imbalance between leavers and stayers (~1:5 ratio) is expected for this type of dataset, but analysts should account for it when building predictive models — standard accuracy metrics will be misleading without correction (e.g., using AUROC or balanced accuracy).


Demographic & Structural Distributions

Before testing hypotheses, we establish the baseline distributions of the workforce across the key dimensions most relevant to attrition.

Age

ggplot(employees_df, aes(x = age, fill = active)) +
  geom_histogram(binwidth = 5, color = "white", show.legend = FALSE) +
  facet_wrap(~active, labeller = as_labeller(c("No" = "Inactive (Left)", "Yes" = "Active (Stayed)"))) +
  scale_fill_active() +
  labs(
    title    = "Age Distribution by Employment Status",
    subtitle = "Employees who left skew younger — a pattern investigated formally in Hypothesis 5",
    x        = "Age (Years)",
    y        = "Number of Employees",
    caption  = "Bin width = 5 years"
  ) +
  theme_pa(type = "hist")

Organisational Tenure

p1 <- ggplot(employees_df, aes(x = org_tenure, fill = active)) +
  geom_histogram(binwidth = 1, color = "white", show.legend = FALSE) +
  facet_wrap(~active, labeller = as_labeller(c("No" = "Inactive (Left)", "Yes" = "Active (Stayed)"))) +
  scale_fill_active() +
  labs(title = "Tenure Distribution by Status",
       x = "Years in Organisation", y = "Count") +
  theme_pa(type = "hist")

p2 <- employees_df |>
  filter(org_tenure < 20) |>
  ggplot(aes(x = active, y = org_tenure, fill = active)) +
  geom_boxplot(show.legend = FALSE, width = 0.5, outlier.shape = 21,
               outlier.size = 1.5, outlier.fill = "white") +
  scale_fill_active() +
  scale_x_discrete(labels = c("No" = "Inactive", "Yes" = "Active")) +
  labs(title = "Tenure Spread (excl. extreme outliers)",
       x = NULL, y = "Years in Organisation",
       caption  = "Employees with org_tenure ≥ 20 years excluded for visual clarity") +
  theme_pa()

p1 / p2 + plot_annotation(
  title    = "Organisational Tenure: Active vs. Inactive Employees",
  subtitle = "Leavers tend to have shorter tenures and a tighter, lower-centred distribution",
  theme    = theme(plot.title    = element_text(face = "bold", size = 13.5),
                   plot.subtitle = element_text(size = 10.5, color = "#6b7280"))
)

Gender & Department

p_gender <- ggplot(employees_df, aes(x = gender, fill = active)) +
  geom_bar(position = "fill") +
  scale_fill_active() +
  scale_y_continuous(labels = percent) +
  labs(title = "By Gender",
       subtitle = "Attrition rates are broadly even across gender",
       x = NULL, y = "Proportion") +
  theme_pa()

p_dept <- ggplot(employees_df, aes(x = dept, fill = active)) +
  geom_bar(position = "fill") +
  scale_fill_active() +
  scale_y_continuous(labels = percent) +
  labs(title = "By Department",
       subtitle = "Departmental differences are more pronounced",
       x = NULL, y = "Proportion") +
  theme_pa()

p_gender + p_dept +
  plot_layout(guides = "collect") &
  theme(legend.position = "top")

Job Level

ggplot(employees_df, aes(x = factor(job_lvl), fill = active)) +
  geom_bar(position = "fill") +
  scale_fill_active() +
  scale_y_continuous(labels = percent) +
  labs(
    title    = "Attrition Rate by Job Level",
    subtitle = "Entry-level (1) and mid-level (3) employees show the highest proportional attrition",
    x        = "Job Level  (1 = Entry → 5 = Executive)",
    y        = "Proportion of Employees"
  ) +
  theme_pa()

Structural Pattern Entry-level employees (Job Level 1) show the highest attrition proportion across all levels, consistent with the broader literature on early-career mobility. The secondary spike at Level 3 warrants further attention — this cohort typically represents employees with enough experience to be competitive in the external market but insufficient advancement to feel anchored.



Hypothesis Testing

Each hypothesis was formulated from theory and preliminary data exploration, then tested using logistic regression with relevant controls. Visualisations represent model-predicted probabilities unless otherwise noted.


H1: The Stagnation Trap

Hypothesis 1 · Career Progression

Employees who have not received a promotion in 3+ years and have high manager tenure are at a significantly higher risk of attrition.

✓ Supported

Herzberg lens: Advancement and recognition are primary motivators. Their sustained absence — particularly when paired with a stable manager relationship that signals no change is coming — is a compound retention risk.

Distributions

p_mgr <- ggplot(employees_df, aes(x = mgr_tenure, fill = active)) +
  geom_histogram(binwidth = 1, color = "white", show.legend = FALSE) +
  facet_wrap(~active, labeller = as_labeller(c("No" = "Inactive", "Yes" = "Active"))) +
  scale_fill_active() +
  labs(title = "Manager Tenure Distribution",
       x = "Years with Current Manager", y = "Count") +
  theme_pa(type = "hist")

p_promo <- ggplot(employees_df, aes(x = last_promo, fill = active)) +
  geom_histogram(binwidth = 1, color = "white", show.legend = FALSE) +
  facet_wrap(~active, labeller = as_labeller(c("No" = "Inactive", "Yes" = "Active"))) +
  scale_fill_active() +
  labs(title = "Time Since Last Promotion",
       x = "Years Since Last Promotion", y = "Count") +
  theme_pa(type = "hist")

p_mgr / p_promo

Stagnation Index

Examining raw distributions of last_promo in isolation is insufficient — a two-year promotion gap means something very different for a new hire versus a ten-year employee. We therefore construct a Stagnation Index: the ratio of years since last promotion to total organisational tenure.

stagnation_data <- employees_df |>
  mutate(stagnation_index = last_promo / (org_tenure + 1))

# Summary table
stagnation_data |>
  group_by(active) |>
  summarise(
    `Mean Stagnation Index`   = round(mean(stagnation_index, na.rm = TRUE), 3),
    `Median Stagnation Index` = round(median(stagnation_index, na.rm = TRUE), 3),
    N = n()
  ) |>
  rename(Status = active) |>
  kable(caption = "Stagnation Index by Employment Status") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Stagnation Index by Employment Status
Status Mean Stagnation Index Median Stagnation Index N
No 0.242 0.125 236
Yes 0.235 0.143 1232
p_scatter <- ggplot(employees_df, aes(x = org_tenure, y = last_promo, color = active)) +
  geom_jitter(alpha = 0.45, size = 1.4) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.1, alpha = 0.12) +
  scale_color_active() +
  labs(title = "Promotion Gap vs. Organisational Tenure",
       subtitle = "The gap between leavers' and stayers' promotion trajectories widens with tenure",
       x = "Years in Organisation", y = "Years Since Last Promotion") +
  theme_pa(type = "scatter")

p_cohort <- employees_df |>
  mutate(tenure_cohort = case_when(
    org_tenure <= 2 ~ "New (0–2 yrs)",
    org_tenure <= 5 ~ "Mid (3–5 yrs)",
    TRUE            ~ "Senior (6+ yrs)"
  ) |> factor(levels = c("New (0–2 yrs)", "Mid (3–5 yrs)", "Senior (6+ yrs)"))) |>
  ggplot(aes(x = last_promo, fill = active)) +
  geom_density(alpha = 0.55) +
  facet_wrap(~tenure_cohort, scales = "free_y") +
  scale_fill_active() +
  labs(title = "Promotion Gap Density by Tenure Cohort",
       subtitle = "Among senior employees, leavers show a notably longer promotion gap",
       x = "Years Since Last Promotion", y = "Density") +
  theme_pa(type = "kde")

p_scatter / p_cohort

Model Results

hyp1_model <- glm(
  active ~ last_promo * mgr_tenure + org_tenure,
  data   = employees_df,
  family = "binomial"
)

tidy(hyp1_model, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 3))) |>
  rename(Term = term, OR = estimate, `SE` = std.error,
         `z` = statistic, `p-value` = p.value,
         `CI Lower` = conf.low, `CI Upper` = conf.high) |>
  kable(caption = "H1 Logistic Regression — Odds Ratios (exponentiated)") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
H1 Logistic Regression — Odds Ratios (exponentiated)
Term OR SE z p-value CI Lower CI Upper
(Intercept) 2.493 0.120 7.612 0.000 1.974 3.160
last_promo 0.994 0.054 -0.107 0.915 0.895 1.107
mgr_tenure 1.167 0.044 3.533 0.000 1.071 1.271
org_tenure 1.074 0.028 2.509 0.012 1.018 1.138
last_promo:mgr_tenure 0.984 0.006 -2.823 0.005 0.973 0.995
plot_model(hyp1_model, type = "int") +
  theme_pa(type = "line") +
  labs(
    title    = "Predicted Retention: Promotion Gap × Manager Tenure",
    subtitle = "Long manager relationships buffer attrition — but only when promotions are frequent",
    y        = "Probability of Remaining Active",
    x        = "Years Since Last Promotion"
  )

Deeper Cuts

By Gender

ggplot(employees_df, aes(x = last_promo, fill = active)) +
  geom_histogram(binwidth = 1, color = "white", show.legend = FALSE) +
  facet_grid(active ~ gender, scales = "free_y",
             labeller = labeller(active = c("No" = "Inactive", "Yes" = "Active"))) +
  scale_fill_active() +
  labs(title = "Last Promotion Distribution by Gender",
       subtitle = "Stagnation patterns are consistent across gender — no significant moderation",
       x = "Years Since Last Promotion", y = "Count") +
  theme_pa(type = "hist")

Finding Gender does not significantly moderate the stagnation effect. The promotion gap distributes similarly across male and female employees in both the active and inactive groups, suggesting the stagnation trap operates independently of gender.

By Department

hyp1_dept_model <- glm(
  active ~ last_promo * mgr_tenure * dept + org_tenure,
  data   = employees_df,
  family = "binomial"
)

p <- plot_model(hyp1_dept_model, type = "int",
           terms = c("last_promo", "mgr_tenure [0, 17]", "dept"))


p[4]
## [[1]]

By Job Level

ggplot(employees_df, aes(x = last_promo, fill = active)) +
  geom_histogram(binwidth = 1, color = "white", show.legend = FALSE) +
  facet_grid(active ~ job_lvl, scales = "free_y",
             labeller = labeller(active = c("No" = "Inactive", "Yes" = "Active"))) +
  scale_fill_active() +
  labs(title = "Last Promotion Distribution by Job Level",
       x = "Years Since Last Promotion", y = "Count") +
  theme_pa(type = "hist")

Business Implication Strong manager–employee relationships are the organisation’s most effective natural retention mechanism — but they carry a shelf life. When tenure under the same manager exceeds approximately seven years without a promotion, that stability becomes a liability. The recommendation is not to break up productive relationships but to implement structured career-refresh checkpoints at the five-to-seven year mark, regardless of department or seniority.


H2: The New Manager Risk

Hypothesis 2 · Managerial Transition

Employees with less than 1 year of tenure under their current manager are more likely to leave than those with 2–3 years of manager tenure.

✓ Supported

Herzberg lens: The manager relationship is a primary hygiene factor. A new manager disrupts an established relational contract — until trust is rebuilt, the psychological safety that anchors employees erodes.

Attrition by Manager Tenure Year

mgr_risk_curve <- employees_df |>
  group_by(mgr_tenure) |>
  summarise(
    total         = n(),
    attrition_rate = sum(active == "No") / n(),
    se             = sqrt((attrition_rate * (1 - attrition_rate)) / total),
    lower          = pmax(0, attrition_rate - 1.96 * se),
    upper          = pmin(1, attrition_rate + 1.96 * se)
  )

ggplot(mgr_risk_curve, aes(x = mgr_tenure, y = attrition_rate)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "#1A6B72", alpha = 0.12) +
  geom_line(color = "#1A6B72", linewidth = 1.1) +
  geom_point(size = 3, color = "#C0392B", fill = "white", shape = 21, stroke = 1.5) +
  annotate("rect", xmin = -0.5, xmax = 1.5,
           ymin = 0, ymax = 0.5, fill = "#C0392B", alpha = 0.06) +
  annotate("text", x = 1, y = 0.42, label = "Danger Zone",
           color = "#C0392B", fontface = "bold.italic", size = 3.2) +
  scale_y_continuous(labels = percent, limits = c(0, 0.5)) +
  labs(
    title    = "Manager Tenure 'Danger Zone'",
    subtitle = "Attrition is highest in the first 12 months of a new manager relationship",
    x        = "Years with Current Manager",
    y        = "Attrition Rate (%)",
    caption  = "Shaded band = 95% confidence interval | Note: the Year-14 spike (n=5) is not statistically meaningful"
  ) +
  theme_pa(type = "line")

Statistical Caution — Year 14 Spike The apparent 40% attrition rate at 14 years of manager tenure is driven by just two exits from a cohort of five employees. The confidence interval at that point is wide enough to span from near-zero to near-60%. This data point should be treated as anecdotal, not as evidence of a systematic trend.

Binned Phase Analysis

employees_binned <- employees_df |>
  mutate(
    mgr_tenure_bin = case_when(
      mgr_tenure <= 1  ~ "0–1 Years",
      mgr_tenure <= 5  ~ "2–5 Years",
      mgr_tenure <= 10 ~ "6–10 Years",
      TRUE             ~ "11+ Years"
    ) |> factor(levels = c("0–1 Years", "2–5 Years", "6–10 Years", "11+ Years"))
  )

bin_summary <- employees_binned |>
  group_by(mgr_tenure_bin) |>
  summarise(n = n(), attrition_rate = sum(active == "No") / n())

ggplot(bin_summary, aes(x = mgr_tenure_bin, y = attrition_rate, fill = mgr_tenure_bin)) +
  geom_col(show.legend = FALSE, width = 0.6) +
  geom_text(aes(label = percent(attrition_rate, accuracy = 1)),
            vjust = -0.5, fontface = "bold", size = 3.8) +
  scale_y_continuous(labels = percent, limits = c(0, 0.35)) +
  scale_fill_manual(values = c("#C0392B", "#E07B39", "#2E9AA3", "#1A6B72")) +
  labs(
    title    = "Attrition Risk by Manager Relationship Phase",
    subtitle = "The first 12 months carry nearly double the attrition risk of any subsequent phase",
    x        = "Manager Relationship Phase",
    y        = "Attrition Rate (%)"
  ) +
  theme_pa()

Model Results

hyp2_model <- glm(
  active ~ mgr_tenure_bin + job_lvl + monthly_comp,
  data   = employees_binned,
  family = "binomial"
)

tidy(hyp2_model, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 3))) |>
  rename(Term = term, OR = estimate, `SE` = std.error,
         `z` = statistic, `p-value` = p.value,
         `CI Lower` = conf.low, `CI Upper` = conf.high) |>
  kable(caption = "H2 Logistic Regression — Odds Ratios (exponentiated), controlling for job level and compensation") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
H2 Logistic Regression — Odds Ratios (exponentiated), controlling for job level and compensation
Term OR SE z p-value CI Lower CI Upper
(Intercept) 1.017 0.304 0.054 0.957 0.560 1.847
mgr_tenure_bin2–5 Years 2.515 0.170 5.411 0.000 1.802 3.517
mgr_tenure_bin6–10 Years 2.179 0.198 3.929 0.000 1.482 3.227
mgr_tenure_bin11+ Years 5.173 0.613 2.682 0.007 1.812 21.799
job_lvl 1.597 0.088 5.322 0.000 1.351 1.909
monthly_comp 1.000 0.000 0.549 0.583 1.000 1.000
plot_model(hyp2_model, type = "pred", terms = "mgr_tenure_bin") +
  theme_pa(type = "bar") +
  labs(
    title    = "Predicted Retention by Manager Relationship Phase",
    subtitle = "Controlling for job level and compensation — the first-year risk persists independently",
    x        = "Years with Current Manager",
    y        = "Predicted Probability of Remaining Active"
  )

Sub-Group Analysis

zero_one <- employees_binned |> filter(mgr_tenure_bin == "0–1 Years")

base_stats <- zero_one |>
  summarise(
    n          = n(),
    base_rate  = sum(active == "No") / n,
    se         = sqrt((base_rate * (1 - base_rate)) / n),
    lower      = base_rate - 1.96 * se,
    upper      = base_rate + 1.96 * se
  )

# Gender breakdown
gender_summary <- zero_one |>
  group_by(gender) |>
  summarise(n = n(), attrition_rate = sum(active == "No") / n())

p_gender_h2 <- ggplot(gender_summary, aes(x = gender, y = attrition_rate, fill = gender)) +
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = base_stats$lower, ymax = base_stats$upper),
            fill = "gray80", alpha = 0.3, inherit.aes = FALSE) +
  geom_hline(yintercept = base_stats$base_rate, linetype = "dashed",
             color = "gray40", linewidth = 0.8) +
  geom_col(show.legend = FALSE, width = 0.5) +
  geom_text(aes(label = percent(attrition_rate, accuracy = 1)),
            vjust = -0.5, fontface = "bold") +
  annotate("text", x = 0.6, y = base_stats$base_rate + 0.02,
           label = "Group avg", color = "gray40", fontface = "italic",
           size = 3, hjust = 0) +
  scale_y_continuous(labels = percent, limits = c(0, 0.5)) +
  scale_fill_manual(values = c("Female" = "#1A6B72", "Male" = "#2E9AA3")) +
  labs(title = "By Gender", x = NULL, y = "Attrition Rate") +
  theme_pa()

# Department breakdown
dept_summary <- zero_one |>
  group_by(dept) |>
  summarise(n = n(), attrition_rate = sum(active == "No") / n())

p_dept_h2 <- ggplot(dept_summary, aes(x = dept, y = attrition_rate, fill = dept)) +
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = base_stats$lower, ymax = base_stats$upper),
            fill = "gray80", alpha = 0.3, inherit.aes = FALSE) +
  geom_hline(yintercept = base_stats$base_rate, linetype = "dashed",
             color = "gray40", linewidth = 0.8) +
  geom_col(show.legend = FALSE, width = 0.6) +
  geom_text(aes(label = percent(attrition_rate, accuracy = 0.1)),
            vjust = -0.5, fontface = "bold", size = 3.3) +
  annotate("text", x = 0.5, y = base_stats$base_rate + 0.025,
           label = "Group avg", color = "gray40", fontface = "italic",
           size = 3, hjust = 0) +
  scale_y_continuous(labels = percent, limits = c(0, 0.65)) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "By Department", x = NULL, y = "Attrition Rate",
       caption = "HR dept: n=13; treat as directional only") +
  theme_pa()

p_gender_h2 + p_dept_h2 +
  plot_annotation(
    title    = "First-Year Risk (0–1 yr Manager Tenure): Sub-Group Breakdown",
    subtitle = "Dashed line = group average | Shaded band = 95% confidence interval",
    theme    = theme(plot.title    = element_text(face = "bold", size = 12),
                     plot.subtitle = element_text(size = 9.5, color = "#6b7280"))
  )

joblvl_summary <- zero_one |>
  group_by(job_lvl) |>
  summarise(n = n(), attrition_rate = sum(active == "No") / n())

ggplot(joblvl_summary, aes(x = factor(job_lvl), y = attrition_rate, fill = factor(job_lvl))) +
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = base_stats$lower, ymax = base_stats$upper),
            fill = "gray80", alpha = 0.3, inherit.aes = FALSE) +
  geom_hline(yintercept = base_stats$base_rate, linetype = "dashed",
             color = "gray40", linewidth = 0.8) +
  geom_col(show.legend = FALSE, width = 0.65) +
  geom_text(aes(label = percent(attrition_rate, accuracy = 1)),
            vjust = -0.5, fontface = "bold", size = 3.6) +
  annotate("text", x = 0.55, y = base_stats$base_rate + 0.025,
           label = "Group avg", color = "gray40", fontface = "italic",
           size = 3, hjust = 0) +
  scale_y_continuous(labels = percent, limits = c(0, 0.42)) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title    = "First-Year Risk by Job Level",
    subtitle = "Entry-level employees (Level 1) face nearly 3× the first-year attrition risk of senior leadership",
    x        = "Job Level  (1 = Entry → 5 = Executive)",
    y        = "Attrition Rate (%)"
  ) +
  theme_pa()

Business Implication Accounting for job level and compensation, a new manager relationship doubles the probability of exit relative to established ones. The intervention is not onboarding new managers alone — it is reboarding the employees under new management. Structured 30/60/90-day check-ins, explicit expectation-setting conversations, and a temporary mentorship safety net during managerial transitions would materially reduce this risk.


H3: Overtime × Work-Life Balance

Hypothesis 3 · Burnout & Wellbeing

The interaction between high Overtime and low Work-Life Balance scores is the strongest predictor of attrition, regardless of compensation level.

~ Partially Supported

Herzberg lens: Work-life balance and reasonable working conditions are hygiene factors. Overtime by itself does not predict dissatisfaction in all contexts — but combined with poor balance perception, it creates a compounding exit trigger.

Univariate Distributions

# Convert wl_balance to factor if not already
employees_df <- employees_df |>
  mutate(wl_balance = factor(wl_balance, levels = 1:4,
                             labels = c("Poor", "Average", "Good", "Excellent")))

p_wlb <- ggplot(employees_df, aes(x = wl_balance, fill = active)) +
  geom_bar(position = "fill") +
  scale_fill_active() +
  scale_y_continuous(labels = percent) +
  labs(title = "Work-Life Balance vs. Attrition",
       x = "WLB Rating", y = "Proportion") +
  theme_pa()

p_ot <- ggplot(employees_df, aes(x = overtime, fill = active)) +
  geom_bar(position = "fill", width = 0.5) +
  scale_fill_active() +
  scale_y_continuous(labels = percent) +
  labs(title = "Overtime Status vs. Attrition",
       x = "Works Overtime?", y = "Proportion") +
  theme_pa()

p_wlb + p_ot + plot_layout(guides = "collect") &
  theme(legend.position = "top")

Chi-Square Test: Overtime × Attrition

ot_table <- table(employees_df$overtime, employees_df$active)
chi_result <- chisq.test(ot_table)

tibble(
  Statistic = round(chi_result$statistic, 3),
  `Degrees of Freedom` = chi_result$parameter,
  `p-value` = format.pval(chi_result$p.value, digits = 3),
  Interpretation = if_else(chi_result$p.value < 0.05,
                            "Significant association — overtime and attrition are not independent",
                            "No significant association")
) |>
  kable(caption = "Chi-Square Test: Overtime × Attrition Status") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Chi-Square Test: Overtime × Attrition Status
Statistic Degrees of Freedom p-value Interpretation
88.383 1 <2e-16 Significant association — overtime and attrition are not independent

Interaction Heatmap

heatmap_data <- employees_df |>
  group_by(overtime, wl_balance) |>
  summarise(
    attrition_rate = sum(active == "No") / n(),
    n              = n(),
    .groups        = "drop"
  )

ggplot(heatmap_data, aes(x = overtime, y = wl_balance, fill = attrition_rate)) +
  geom_tile(color = "white", linewidth = 1.2) +
  geom_text(aes(label = paste0(percent(attrition_rate, accuracy = 1), "\nn=", n)),
            color = "white", fontface = "bold", size = 3.5) +
  scale_fill_gradient(low = "#1A6B72", high = "#C0392B", labels = percent) +
  labs(
    title    = "Attrition Risk Matrix: Overtime × Work-Life Balance",
    subtitle = "Darkest red = highest exit probability; n = headcount in each cell",
    x        = "Works Overtime?",
    y        = "Work-Life Balance Rating",
    fill     = "Attrition Rate"
  ) +
  theme_pa() +
  theme(panel.grid = element_blank())

Model Results

hyp3_model <- glm(
  active ~ overtime * wl_balance + monthly_comp,
  data   = employees_df,
  family = "binomial"
)

tidy(hyp3_model, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 3))) |>
  rename(Term = term, OR = estimate, `SE` = std.error,
         `z` = statistic, `p-value` = p.value,
         `CI Lower` = conf.low, `CI Upper` = conf.high) |>
  kable(caption = "H3 Logistic Regression — Odds Ratios, controlling for monthly compensation") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
H3 Logistic Regression — Odds Ratios, controlling for monthly compensation
Term OR SE z p-value CI Lower CI Upper
(Intercept) 4.631 0.311 4.931 0.000 2.545 8.619
overtimeYes 0.339 0.337 -3.206 0.001 0.175 0.660
wl_balanceAverage 2.089 0.233 3.165 0.002 1.307 3.265
monthly_comp 1.000 0.000 0.180 0.857 1.000 1.000
overtimeYes:wl_balanceAverage 0.703 0.376 -0.939 0.348 0.335 1.468
plot_model(hyp3_model, type = "int") +
  theme_pa(type = "line") +
  labs(
    title    = "Predicted Retention: Overtime × Work-Life Balance",
    subtitle = "Higher WLB ratings improve retention — but do not neutralise the overtime effect",
    y        = "Probability of Remaining Active",
    x        = "Work-Life Balance Rating"
  )

Business Implication Overtime increases the probability of attrition by approximately — and this effect is not meaningfully offset by higher work-life balance ratings. This challenges a common assumption that employees who report good balance are insulated from burnout-driven exit. The implication is structural: flexible working arrangements and wellbeing programmes are insufficient substitutes for reducing excessive working hours. Overtime volume itself must be addressed, not just its perceived impact.


H4: Commute & Relationship Satisfaction

Hypothesis 4 · Commute & Social Capital

High commute distance correlates with lower Relationship Satisfaction and higher attrition, particularly for mid-level roles.

✗ Not Supported

Herzberg lens: Commute distance relates to working conditions — a hygiene factor. We hypothesised it would erode relationship satisfaction (a social motivator) and thereby drive exit. The data does not support this pathway.

Exploratory Distributions

p_commute <- ggplot(employees_df, aes(x = commute_dist, fill = active)) +
  geom_histogram(binwidth = 1, color = "white", show.legend = FALSE) +
  facet_wrap(~active, labeller = as_labeller(c("No" = "Inactive", "Yes" = "Active"))) +
  scale_fill_active() +
  labs(title = "Commute Distance Distribution",
       x = "Commute Distance", y = "Count") +
  theme_pa(type = "hist")

p_relsat <- ggplot(employees_df, aes(x = factor(rel_sat), fill = active)) +
  geom_bar(position = "fill", alpha = 0.9) +
  scale_fill_active() +
  scale_y_continuous(labels = percent) +
  labs(title = "Relationship Satisfaction vs. Attrition",
       x = "Relationship Satisfaction Score", y = "Proportion") +
  theme_pa()

p_commute / p_relsat

ggplot(employees_df, aes(x = commute_dist, y = rel_sat, color = active)) +
  geom_jitter(alpha = 0.45, width = 0.5, height = 0.25, size = 1.3) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.1, alpha = 0.12) +
  facet_wrap(~job_lvl, labeller = label_both) +
  scale_color_active() +
  labs(
    title    = "Commute Distance vs. Relationship Satisfaction by Job Level",
    subtitle = "No consistent negative relationship between commute length and satisfaction across levels",
    x        = "Commute Distance",
    y        = "Relationship Satisfaction Score"
  ) +
  theme_pa(type = "scatter")

Model Results

hyp4_model <- glm(
  active ~ commute_dist * rel_sat * job_lvl,
  data   = employees_df,
  family = "binomial"
)

tidy(hyp4_model, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 3))) |>
  rename(Term = term, OR = estimate, `SE` = std.error,
         `z` = statistic, `p-value` = p.value,
         `CI Lower` = conf.low, `CI Upper` = conf.high) |>
  kable(caption = "H4 Logistic Regression — Commute × Relationship Satisfaction × Job Level") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
H4 Logistic Regression — Commute × Relationship Satisfaction × Job Level
Term OR SE z p-value CI Lower CI Upper
(Intercept) 1.375 0.550 0.580 0.562 0.463 4.009
commute_dist 1.012 0.045 0.253 0.800 0.924 1.105
rel_satDissatisfied 0.937 0.844 -0.077 0.939 0.177 4.872
rel_satSatisfied 1.025 0.720 0.034 0.973 0.250 4.217
rel_satHighly Satisfied 4.172 0.712 2.005 0.045 1.040 17.044
job_lvl 1.898 0.287 2.232 0.026 1.119 3.455
commute_dist:rel_satDissatisfied 0.983 0.062 -0.271 0.786 0.871 1.111
commute_dist:rel_satSatisfied 0.980 0.057 -0.348 0.728 0.876 1.098
commute_dist:rel_satHighly Satisfied 0.943 0.059 -0.995 0.320 0.841 1.059
commute_dist:job_lvl 0.984 0.023 -0.714 0.475 0.942 1.031
rel_satDissatisfied:job_lvl 1.583 0.491 0.936 0.350 0.622 4.321
rel_satSatisfied:job_lvl 1.317 0.405 0.681 0.496 0.595 2.933
rel_satHighly Satisfied:job_lvl 0.626 0.352 -1.329 0.184 0.308 1.234
commute_dist:rel_satDissatisfied:job_lvl 0.997 0.033 -0.095 0.924 0.934 1.063
commute_dist:rel_satSatisfied:job_lvl 1.005 0.029 0.183 0.855 0.949 1.065
commute_dist:rel_satHighly Satisfied:job_lvl 1.024 0.029 0.819 0.413 0.967 1.082

Finding — Hypothesis Rejected Neither commute distance nor its interaction with relationship satisfaction reaches statistical significance as a predictor of attrition. Relationship Satisfaction itself and Job Level exert independent effects on retention, but the hypothesised causal chain — long commute → strained relationships → exit — is not supported. This is a meaningful negative result: it tells us that remote or hybrid work arrangements, while potentially desirable for other reasons, are unlikely to move the needle on attrition without addressing the structural drivers identified in H1–H3.


H5: Young Employees & Satisfaction

Hypothesis 5 · Generational Mobility

Young employees (under 30), early in their careers, with low Job Satisfaction and high Work-Life Balance scores are more likely to leave than older employees with similar profiles.

~ Partially Supported

Herzberg lens: Job satisfaction is the primary motivator variable in this hypothesis. The twist is that even high satisfaction does not eliminate the age-related mobility gap — suggesting that for under-30 employees, external market pull operates independently of internal satisfaction.

Attrition by Age Group, Satisfaction & WLB

employees_age <- employees_df |>
  mutate(
    age_group = if_else(age < 30, "Under 30", "30 and Over") |>
                factor(levels = c("Under 30", "30 and Over")),
  )

employees_age |>
  group_by(age_group, job_sat, wl_balance) |>
  summarise(attrition_rate = sum(active == "No") / n(), .groups = "drop") |>
  ggplot(aes(x = job_sat, y = attrition_rate, fill = wl_balance)) +
  geom_col(position = "dodge") +
  facet_wrap(~age_group) +
  scale_y_continuous(labels = percent) +
  scale_fill_brewer(palette = "Blues", name = "Work-Life Balance") +
  labs(
    title    = "Attrition Risk: Job Satisfaction & Work-Life Balance by Age Group",
    subtitle = "Under-30 employees show higher attrition at every satisfaction level",
    x        = "Job Satisfaction Level",
    y        = "Attrition Rate (%)"
  ) +
  theme_pa() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8))

Simplified Model: Age Group × Job Satisfaction

# Run these to see where the "1 level" problem is
table(employees_age$age_group)
## 
##    Under 30 30 and Over 
##         326        1142
table(employees_age$job_sat)
## 
## Highly Dissatisfied        Dissatisfied           Satisfied    Highly Satisfied 
##                 289                 280                 441                 458
hyp5_simplified <- glm(
  active ~ age_group * job_sat,
  data   = employees_age,
  family = "binomial"
)

tidy(hyp5_simplified, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 3))) |>
  rename(Term = term, OR = estimate, `SE` = std.error,
         `z` = statistic, `p-value` = p.value,
         `CI Lower` = conf.low, `CI Upper` = conf.high) |>
  kable(caption = "H5 Logistic Regression — Age Group × Job Satisfaction (simplified model)") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
H5 Logistic Regression — Age Group × Job Satisfaction (simplified model)
Term OR SE z p-value CI Lower CI Upper
(Intercept) 1.833 0.254 2.389 0.017 1.126 3.060
age_group30 and Over 2.325 0.306 2.755 0.006 1.267 4.225
job_satDissatisfied 1.206 0.375 0.499 0.618 0.579 2.535
job_satSatisfied 1.394 0.340 0.975 0.329 0.713 2.722
job_satHighly Satisfied 2.078 0.353 2.073 0.038 1.043 4.180
age_group30 and Over:job_satDissatisfied 1.384 0.461 0.705 0.481 0.559 3.420
age_group30 and Over:job_satSatisfied 1.094 0.413 0.218 0.828 0.486 2.461
age_group30 and Over:job_satHighly Satisfied 1.231 0.436 0.476 0.634 0.522 2.896
plot_model(hyp5_simplified, type = "int") +
  theme_pa(type = "line") +
  labs(
    title    = "Predicted Retention: Age Group × Job Satisfaction",
    subtitle = "Job satisfaction improves retention for both groups — but under-30s remain structurally more mobile",
    y        = "Probability of Remaining Active",
    x        = "Job Satisfaction Level"
  )

Business Implication Employees under 30 are approximately twice as likely to exit at any given satisfaction level, compared to employees aged 30 and over. This does not mean they are unhappier — it means the external market exerts a stronger pull on this demographic regardless of their internal experience. Satisfaction improvements are still valuable and do reduce this gap, but they are insufficient on their own. Retention strategies for this cohort should focus on accelerated career pathing, early ownership of meaningful projects, and visible advancement timelines — the motivators that compete directly with what the external market offers.


Summary of Findings

Hypothesis Summary
# Hypothesis Verdict Key Finding Herzberg Factor
H1 Stagnation Trap: No promotion + long manager tenure → higher attrition ✓ Supported Stagnation effect is universal across departments; critical window is 7+ years Motivator (Advancement)
H2 New Manager Risk: <1 yr manager tenure → higher attrition than 2–3 yr tenure ✓ Supported First-year employees under new management are ~2.4× more likely to leave Hygiene (Supervision)
H3 Burnout Interaction: Overtime × low WLB is the strongest attrition predictor ~ Partial Overtime multiplies exit risk ~3×; WLB does not neutralise this effect Hygiene (Working Conditions)
H4 Commute Pathway: High commute distance → lower relationship satisfaction → attrition ✗ Rejected Commute distance is not a significant predictor; rel. satisfaction is independent Hygiene (Working Conditions)
H5 Generational Mobility: Under-30s with low job sat leave at higher rates ~ Partial Under-30s are 2× more likely to exit at any satisfaction level — structural pull Motivator (Job Satisfaction)

Strategic Recommendations

The following recommendations are ordered by estimated impact and implementation urgency, derived directly from the analytical findings above.

Priority Initiative Evidence Base Target Population
HIGH Managerial Transition Protocol
Implement structured 30/60/90-day reboarding for employees under new managers, including facilitated team contracting sessions and a temporary mentorship safety net
H2 — 2.4× exit risk in year 1 of new manager relationship, independent of level and compensation All employees experiencing a management change
HIGH Overtime Audit & Workload Rebalancing
Identify teams with sustained overtime and redesign workloads before implementing wellness programmes — the data shows that WLB perception does not offset actual overtime hours
H3 — Overtime multiplies attrition risk ~3×; WLB rating does not neutralise this All departments, prioritise R&D and Sales where attrition is higher
HIGH Career Refresh Checkpoints
Introduce mandatory ca re er-development reviews at years 5 and 7 of organisational tenure, regardless of whether a promotion is warranted — the goal is to surface and respond to stagnation before exit intent crystallises
H1 — Stagnation trap operates universally; risk accelerates after 7 years without promotion All employees with 5+ years tenure; emphasis on senior (6+ yr) cohort
MEDIUM Early-Career Accelerator Programme
Design an explicit fast-track for h igh-performing under-30 employees: accelerated promotion criteria, early project ownership, and a transparent advancement timeline — competing with external market pull rather than ignoring it
H5 — Under-30s are 2× more likely to exit at any satisfaction level; engagement does not close the gap Employees under 30, particularly in R&D and Sales at Job Level 1
MONITOR Commute Distance
Remote or hybrid policies may be justified on other grounds (talent access, employee preference), but the data does not support them as att ri tion-reduction tools. Monitor but do not prioritise as a retention intervention
H4 — Commute distance is not a statistically significant predictor of attrition N/A — deprioritised from retention agenda

Methodological Notes

Model Specification All hypotheses were tested using binary logistic regression (glm(..., family = "binomial")) with attrition as the outcome variable (Active = 1 / Inactive = 0). Interaction terms were included where theoretically motivated. Predicted probabilities were visualised using sjPlot::plot_model(). Reported odds ratios are exponentiated log-odds coefficients.

Limitations

  1. Cross-sectional data: This dataset captures a single point in time. Causal claims are directional hypotheses, not confirmed causal relationships. Longitudinal data would strengthen causal inference considerably.

  2. Class imbalance: The attrition rate of ~16% creates a 1:5 class imbalance. While this does not invalidate logistic regression, predictive models built on this data should be evaluated using AUROC and balanced accuracy rather than raw accuracy.

  3. Small subgroup cells: Several cross-tabulations (notably HR department in H2, and Year-14 manager tenure) involve very small samples. These estimates are presented with explicit uncertainty and should not inform policy in isolation.

  4. Omitted variable risk: The dataset does not capture exit interview data, peer relationship quality, or external labour market conditions — all of which likely confound the relationships modelled here.


Report produced using R 4.5.3 · tidyverse · sjPlot · patchwork Analysis conducted in accordance with internal People Analytics standards