1 Introduction

Employee attrition — the voluntary or involuntary departure of employees — is a costly challenge for organizations. Predicting which employees are likely to leave enables HR departments to intervene proactively, reducing turnover costs and retaining institutional knowledge.

This analysis applies logistic regression, a statistical technique suited for binary classification, to predict employee attrition (Left vs. Stayed) using the Kaggle Employee Attrition Dataset. Three models of increasing complexity are estimated and evaluated:

Model	Formula
Model 1	`Attrition ~ MonthlyIncome`
Model 2	`Attrition ~ MonthlyIncome + Overtime`
Model 3	`Attrition ~ .` (all predictors)

2 Prerequisites

2.1 Packages

# Data wrangling & visualization
library(dplyr)
library(ggplot2)
library(tidyr)

# Modeling
library(caret)

# ROC / AUC
library(pROC)

# Tidy model output
library(broom)

# Nice tables
library(knitr)
library(kableExtra)

2.2 Data Loading

The pre-split training and test sets are loaded directly from the Kaggle dataset files.

train_raw <- read.csv("train.csv", stringsAsFactors = FALSE)
test_raw  <- read.csv("test.csv",  stringsAsFactors = FALSE)

cat("Training observations:", nrow(train_raw), "\n")

## Training observations: 59598

cat("Test observations:    ", nrow(test_raw),  "\n")

## Test observations:     14900

cat("Features:             ", ncol(train_raw) - 1, "\n")

## Features:              23

3 Data Preprocessing

3.1 Variable Inspection

glimpse(train_raw)

## Rows: 59,598
## Columns: 24
## $ Employee.ID              <int> 8410, 64756, 30257, 65791, 65026, 24368, 6497…
## $ Age                      <int> 31, 59, 24, 36, 56, 38, 47, 48, 57, 24, 30, 2…
## $ Gender                   <chr> "Male", "Female", "Female", "Female", "Male",…
## $ Years.at.Company         <int> 19, 4, 10, 7, 41, 3, 23, 16, 44, 1, 12, 6, 38…
## $ Job.Role                 <chr> "Education", "Media", "Healthcare", "Educatio…
## $ Monthly.Income           <int> 5390, 5534, 8159, 3989, 4821, 9977, 3681, 112…
## $ Work.Life.Balance        <chr> "Excellent", "Poor", "Good", "Good", "Fair", …
## $ Job.Satisfaction         <chr> "Medium", "High", "High", "High", "Very High"…
## $ Performance.Rating       <chr> "Average", "Low", "Low", "High", "Average", "…
## $ Number.of.Promotions     <int> 2, 3, 0, 1, 0, 3, 1, 2, 1, 1, 1, 2, 1, 4, 0, …
## $ Overtime                 <chr> "No", "No", "No", "No", "Yes", "No", "Yes", "…
## $ Distance.from.Home       <int> 22, 21, 11, 27, 71, 37, 75, 5, 39, 57, 51, 26…
## $ Education.Level          <chr> "Associate Degree", "Master’s Degree", "Bache…
## $ Marital.Status           <chr> "Married", "Divorced", "Married", "Single", "…
## $ Number.of.Dependents     <int> 0, 3, 3, 2, 0, 0, 3, 4, 4, 4, 1, 0, 0, 2, 0, …
## $ Job.Level                <chr> "Mid", "Mid", "Mid", "Mid", "Senior", "Mid", …
## $ Company.Size             <chr> "Medium", "Medium", "Medium", "Small", "Mediu…
## $ Company.Tenure           <int> 89, 21, 74, 50, 68, 47, 93, 88, 75, 45, 17, 3…
## $ Remote.Work              <chr> "No", "No", "No", "Yes", "No", "No", "No", "N…
## $ Leadership.Opportunities <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ Innovation.Opportunities <chr> "No", "No", "No", "No", "No", "Yes", "No", "N…
## $ Company.Reputation       <chr> "Excellent", "Fair", "Poor", "Good", "Fair", …
## $ Employee.Recognition     <chr> "Medium", "Low", "Low", "Medium", "Medium", "…
## $ Attrition                <chr> "Stayed", "Stayed", "Stayed", "Stayed", "Stay…

3.2 Cleaning & Encoding

Several steps are required before modelling:

Rename columns with spaces to syntactically valid R names.
Recode the response variable: "Left" → 1 (event), "Stayed" → 0.
Convert character variables to factors so glm() handles dummy coding automatically.
Drop the Employee ID column (a unique identifier with no predictive value).

clean_data <- function(df) {
  df %>%
    rename(
      MonthlyIncome          = Monthly.Income,
      Overtime               = Overtime,
      YearsAtCompany         = Years.at.Company,
      JobRole                = Job.Role,
      WorkLifeBalance        = Work.Life.Balance,
      JobSatisfaction        = Job.Satisfaction,
      PerformanceRating      = Performance.Rating,
      NumberOfPromotions     = Number.of.Promotions,
      DistanceFromHome       = Distance.from.Home,
      EducationLevel         = Education.Level,
      MaritalStatus          = Marital.Status,
      NumberOfDependents     = Number.of.Dependents,
      JobLevel               = Job.Level,
      CompanySize            = Company.Size,
      CompanyTenure          = Company.Tenure,
      RemoteWork             = Remote.Work,
      LeadershipOpportunities = Leadership.Opportunities,
      InnovationOpportunities = Innovation.Opportunities,
      CompanyReputation      = Company.Reputation,
      EmployeeRecognition    = Employee.Recognition
    ) %>%
    select(-Employee.ID) %>%
    mutate(
      Attrition = factor(ifelse(Attrition == "Left", "Yes", "No"),
                         levels = c("No", "Yes")),
      across(where(is.character), as.factor)
    )
}

train <- clean_data(train_raw)
test  <- clean_data(test_raw)

3.3 Class Distribution

Understanding class imbalance is critical, as it directly affects the No Information Rate — the accuracy achievable by always predicting the majority class.

bind_rows(
  train %>% count(Attrition) %>% mutate(Split = "Train"),
  test  %>% count(Attrition) %>% mutate(Split = "Test")
) %>%
  group_by(Split) %>%
  mutate(Proportion = n / sum(n)) %>%
  ggplot(aes(x = Attrition, y = Proportion, fill = Attrition)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = scales::percent(Proportion, accuracy = 0.1)),
            vjust = -0.4, fontface = "bold") +
  facet_wrap(~Split) +
  scale_fill_manual(values = c("No" = "#2196F3", "Yes" = "#F44336")) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  labs(title = "Attrition Class Distribution",
       subtitle = "Consistent class imbalance across both splits",
       x = "Attrition", y = "Proportion") +
  theme_minimal(base_size = 13)

The dataset is imbalanced — approximately 78–79% of employees stayed and ~21–22% left. This sets the No Information Rate as a baseline that any model must meaningfully exceed.

4 Model Estimation

All three models are fit using glm() with family = "binomial" on the training set.

4.1 Model 1: Monthly Income Only

model1 <- glm(Attrition ~ MonthlyIncome,
              family = "binomial", data = train)

tidy(model1) %>%
  mutate(across(where(is.numeric), ~round(., 6))) %>%
  kable(caption = "Model 1 – Coefficient Estimates") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"))

Model 1 – Coefficient Estimates
term	estimate	std.error	statistic	p.value
(Intercept)	-0.020809	0.029020	-0.717061	0.473336
MonthlyIncome	-0.000011	0.000004	-2.777349	0.005480

Interpretation:
The negative coefficient on MonthlyIncome (-1.1^{-5}) indicates that higher earners are less likely to leave. On the odds scale, each additional dollar of monthly income multiplies the attrition odds by 0.999989 — a very small but statistically significant protective effect.

4.2 Model 2: Monthly Income + Overtime

model2 <- glm(Attrition ~ MonthlyIncome + Overtime,
              family = "binomial", data = train)

tidy(model2) %>%
  mutate(across(where(is.numeric), ~round(., 6))) %>%
  kable(caption = "Model 2 – Coefficient Estimates") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"))

Model 2 – Coefficient Estimates
term	estimate	std.error	statistic	p.value
(Intercept)	-0.100061	0.029646	-3.375257	0.000737
MonthlyIncome	-0.000010	0.000004	-2.717495	0.006578
OvertimeYes	0.237424	0.017501	13.566062	0.000000

Interpretation:
Adding Overtime substantially improves the model. The positive coefficient on OvertimeYes means employees who work overtime face considerably higher attrition odds — by a factor of approximately 1.268 — confirming that overwork is a meaningful driver of turnover.

4.3 Model 3: All Predictors

model3 <- glm(Attrition ~ .,
              family = "binomial", data = train)

tidy(model3) %>%
  mutate(across(where(is.numeric), ~round(., 5)),
         Significant = ifelse(p.value < 0.05, "✓", "")) %>%
  kable(caption = "Model 3 – All Predictors (✓ = p < 0.05)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")

Model 3 – All Predictors (✓ = p < 0.05)
term	estimate	std.error	statistic	p.value	Significant
(Intercept)	0.25629	0.08586	2.98499	0.00284	✓
Age	-0.00600	0.00100	-5.97435	0.00000	✓
GenderMale	-0.62616	0.02078	-30.13064	0.00000	✓
YearsAtCompany	-0.01360	0.00117	-11.58778	0.00000	✓
JobRoleFinance	-0.09223	0.04838	-1.90642	0.05660
JobRoleHealthcare	-0.07214	0.04238	-1.70197	0.08876
JobRoleMedia	-0.09867	0.03616	-2.72900	0.00635	✓
JobRoleTechnology	-0.08476	0.04847	-1.74869	0.08034
MonthlyIncome	-0.00001	0.00001	-0.81470	0.41524
WorkLifeBalanceFair	1.32703	0.03129	42.41040	0.00000	✓
WorkLifeBalanceGood	0.29336	0.02958	9.91615	0.00000	✓
WorkLifeBalancePoor	1.50922	0.03757	40.16735	0.00000	✓
JobSatisfactionLow	0.48803	0.03572	13.66402	0.00000	✓
JobSatisfactionMedium	0.01133	0.02721	0.41662	0.67695
JobSatisfactionVery High	0.49882	0.02701	18.46942	0.00000	✓
PerformanceRatingBelow Average	0.33386	0.02948	11.32415	0.00000	✓
PerformanceRatingHigh	0.00508	0.02647	0.19182	0.84788
PerformanceRatingLow	0.59725	0.04847	12.32091	0.00000	✓
NumberOfPromotions	-0.24919	0.01043	-23.89984	0.00000	✓
OvertimeYes	0.35116	0.02189	16.04303	0.00000	✓
DistanceFromHome	0.00995	0.00036	27.36510	0.00000	✓
EducationLevelBachelor’s Degree	0.04581	0.02757	1.66164	0.09658
EducationLevelHigh School	0.03035	0.03072	0.98824	0.32304
EducationLevelMaster’s Degree	0.03087	0.03048	1.01311	0.31101
EducationLevelPhD	-1.56391	0.05447	-28.71405	0.00000	✓
MaritalStatusMarried	-0.25588	0.02957	-8.65441	0.00000	✓
MaritalStatusSingle	1.57326	0.03220	48.86324	0.00000	✓
NumberOfDependents	-0.15732	0.00666	-23.62126	0.00000	✓
JobLevelMid	-1.00326	0.02266	-44.27069	0.00000	✓
JobLevelSenior	-2.61640	0.03262	-80.22062	0.00000	✓
CompanySizeMedium	0.00614	0.02713	0.22645	0.82085
CompanySizeSmall	0.20635	0.02958	6.97489	0.00000	✓
CompanyTenure	-0.00020	0.00045	-0.45223	0.65110
RemoteWorkYes	-1.77541	0.02916	-60.88797	0.00000	✓
LeadershipOpportunitiesYes	-0.16271	0.04753	-3.42362	0.00062	✓
InnovationOpportunitiesYes	-0.14100	0.02784	-5.06418	0.00000	✓
CompanyReputationFair	0.46981	0.03962	11.85776	0.00000	✓
CompanyReputationGood	-0.06027	0.03535	-1.70512	0.08817
CompanyReputationPoor	0.75645	0.03972	19.04269	0.00000	✓
EmployeeRecognitionLow	0.03956	0.02620	1.51009	0.13102
EmployeeRecognitionMedium	0.04354	0.02769	1.57245	0.11585
EmployeeRecognitionVery High	-0.08300	0.05019	-1.65376	0.09818

5 Model Evaluation on Test Data

5.1 Predicted Probabilities

test$prob1 <- predict(model1, newdata = test, type = "response")
test$prob2 <- predict(model2, newdata = test, type = "response")
test$prob3 <- predict(model3, newdata = test, type = "response")

# Default 0.5 threshold for class predictions
test$pred1 <- factor(ifelse(test$prob1 >= 0.5, "Yes", "No"), levels = c("No","Yes"))
test$pred2 <- factor(ifelse(test$prob2 >= 0.5, "Yes", "No"), levels = c("No","Yes"))
test$pred3 <- factor(ifelse(test$prob3 >= 0.5, "Yes", "No"), levels = c("No","Yes"))

5.2 Confusion Matrices

5.2.1 Model 1 — Monthly Income

cm1 <- confusionMatrix(test$pred1, test$Attrition, positive = "Yes")
cm1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  7868 7032
##        Yes    0    0
##                                         
##                Accuracy : 0.5281        
##                  95% CI : (0.52, 0.5361)
##     No Information Rate : 0.5281        
##     P-Value [Acc > NIR] : 0.5033        
##                                         
##                   Kappa : 0             
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.0000        
##             Specificity : 1.0000        
##          Pos Pred Value :    NaN        
##          Neg Pred Value : 0.5281        
##              Prevalence : 0.4719        
##          Detection Rate : 0.0000        
##    Detection Prevalence : 0.0000        
##       Balanced Accuracy : 0.5000        
##                                         
##        'Positive' Class : Yes           
##

5.2.2 Model 2 — Monthly Income + Overtime

cm2 <- confusionMatrix(test$pred2, test$Attrition, positive = "Yes")
cm2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  5487 4530
##        Yes 2381 2502
##                                           
##                Accuracy : 0.5362          
##                  95% CI : (0.5281, 0.5442)
##     No Information Rate : 0.5281          
##     P-Value [Acc > NIR] : 0.02397         
##                                           
##                   Kappa : 0.0541          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.3558          
##             Specificity : 0.6974          
##          Pos Pred Value : 0.5124          
##          Neg Pred Value : 0.5478          
##              Prevalence : 0.4719          
##          Detection Rate : 0.1679          
##    Detection Prevalence : 0.3277          
##       Balanced Accuracy : 0.5266          
##                                           
##        'Positive' Class : Yes             
##

5.2.3 Model 3 — All Variables

cm3 <- confusionMatrix(test$pred3, test$Attrition, positive = "Yes")
cm3

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  6080 1855
##        Yes 1788 5177
##                                           
##                Accuracy : 0.7555          
##                  95% CI : (0.7485, 0.7624)
##     No Information Rate : 0.5281          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5092          
##                                           
##  Mcnemar's Test P-Value : 0.2742          
##                                           
##             Sensitivity : 0.7362          
##             Specificity : 0.7728          
##          Pos Pred Value : 0.7433          
##          Neg Pred Value : 0.7662          
##              Prevalence : 0.4719          
##          Detection Rate : 0.3474          
##    Detection Prevalence : 0.4674          
##       Balanced Accuracy : 0.7545          
##                                           
##        'Positive' Class : Yes             
##

5.3 Confusion Matrix Heat-maps

Visualizing all three confusion matrices side by side for quick comparison.

plot_cm <- function(cm, title) {
  df <- as.data.frame(cm$table) %>%
    rename(Predicted = Prediction, Actual = Reference) %>%
    group_by(Actual) %>%
    mutate(ColTotal = sum(Freq)) %>%
    ungroup() %>%
    mutate(
      Pct      = Freq / sum(Freq) * 100,
      Label    = paste0(Freq, "\n(", sprintf("%.1f%%", Pct), ")")
    )

  ggplot(df, aes(x = Actual, y = Predicted, fill = Freq)) +
    geom_tile(color = "white", linewidth = 0.8) +
    geom_text(aes(label = Label), size = 4.2, fontface = "bold", color = "white",
              lineheight = 1.4) +
    scale_fill_gradient(low = "#90CAF9", high = "#1565C0") +
    labs(title = title, fill = "Count") +
    theme_minimal(base_size = 12) +
    theme(legend.position = "none",
          plot.title = element_text(hjust = 0.5, face = "bold"))
}

library(gridExtra)
grid.arrange(
  plot_cm(cm1, "Model 1\nMonthlyIncome"),
  plot_cm(cm2, "Model 2\nMonthlyIncome + Overtime"),
  plot_cm(cm3, "Model 3\nAll Predictors"),
  ncol = 3
)

5.4 Performance Summary Table

extract_metrics <- function(cm, label) {
  tibble(
    Model       = label,
    Accuracy    = cm$overall["Accuracy"],
    Sensitivity = cm$byClass["Sensitivity"],
    Specificity = cm$byClass["Specificity"],
    `Pos Pred Value` = cm$byClass["Pos Pred Value"],
    `Neg Pred Value` = cm$byClass["Neg Pred Value"],
    `Balanced Accuracy` = cm$byClass["Balanced Accuracy"]
  )
}

bind_rows(
  extract_metrics(cm1, "Model 1: MonthlyIncome"),
  extract_metrics(cm2, "Model 2: MonthlyIncome + Overtime"),
  extract_metrics(cm3, "Model 3: All Predictors")
) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(caption = "Test-Set Performance Metrics Across Three Models") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed")) %>%
  row_spec(3, bold = TRUE, background = "#E3F2FD")

Test-Set Performance Metrics Across Three Models
Model	Accuracy	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	Balanced Accuracy
Model 1: MonthlyIncome	0.5281	0.0000	1.0000	NaN	0.5281	0.5000
Model 2: MonthlyIncome + Overtime	0.5362	0.3558	0.6974	0.5124	0.5478	0.5266
Model 3: All Predictors	0.7555	0.7362	0.7728	0.7433	0.7662	0.7545

6 ROC Curves & AUC

ROC (Receiver Operating Characteristic) curves plot the True Positive Rate against the False Positive Rate across all decision thresholds. The Area Under the Curve (AUC) summarises discrimination ability in a single value: 0.5 = random, 1.0 = perfect.

roc1 <- roc(test$Attrition, test$prob1, levels = c("No","Yes"), direction = "<", quiet = TRUE)
roc2 <- roc(test$Attrition, test$prob2, levels = c("No","Yes"), direction = "<", quiet = TRUE)
roc3 <- roc(test$Attrition, test$prob3, levels = c("No","Yes"), direction = "<", quiet = TRUE)

# Build tidy data frames for ggplot
roc_df <- function(roc_obj, model_name) {
  data.frame(
    FPR   = 1 - roc_obj$specificities,
    TPR   = roc_obj$sensitivities,
    Model = model_name
  )
}

roc_data <- bind_rows(
  roc_df(roc1, sprintf("Model 1 (AUC = %.3f)", auc(roc1))),
  roc_df(roc2, sprintf("Model 2 (AUC = %.3f)", auc(roc2))),
  roc_df(roc3, sprintf("Model 3 (AUC = %.3f)", auc(roc3)))
)

ggplot(roc_data, aes(x = FPR, y = TPR, color = Model)) +
  geom_line(linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", color = "grey60") +
  scale_color_manual(values = c("#E53935","#FB8C00","#1E88E5")) +
  labs(
    title    = "ROC Curves — Logistic Regression Models",
    subtitle = "Evaluated on the held-out test set",
    x        = "False Positive Rate (1 – Specificity)",
    y        = "True Positive Rate (Sensitivity)",
    color    = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = c(0.65, 0.2),
    legend.background = element_rect(fill = "white", color = "grey85"),
    plot.title    = element_text(face = "bold"),
    plot.subtitle = element_text(color = "grey50")
  ) +
  annotate("text", x = 0.8, y = 0.05,
           label = "Random\nClassifier", color = "grey60", size = 3.5)

7 Discussion

7.1 Key Findings

Aspect	Observation
Best overall model	Model 3 (all predictors) achieves the highest accuracy and AUC
Single most informative predictor	`Overtime` — employees working overtime face substantially higher attrition risk
MonthlyIncome alone (Model 1)	Modest discrimination; AUC barely exceeds the random baseline
Sensitivity vs. Specificity trade-off	All models identify Stayed employees well (high specificity) but struggle to catch Left employees (lower sensitivity) — a direct consequence of class imbalance
Class imbalance	~79% majority class sets a high No Information Rate; raw accuracy is a misleading metric here; Balanced Accuracy and AUC are preferred

7.2 Limitations

Default threshold (0.5): Given class imbalance, lowering the threshold (e.g., to 0.3) can substantially improve sensitivity at a modest specificity cost.
Logistic regression assumptions: Linearity on the log-odds scale, no multicollinearity, and no influential outliers — all require verification before deployment.
Feature engineering: Several categorical variables (e.g., JobRole, EducationLevel) contain many levels; sparse cells may inflate variance in coefficient estimates.
No cross-validation here: The confusion matrix is evaluated on a single test split; k-fold CV would provide more stable performance estimates.

7.3 Recommendations

Explore threshold tuning using the ROC curve’s optimal operating point (pROC::coords(roc3, "best")).
Consider regularized logistic regression (LASSO/Ridge via glmnet) to handle multicollinearity and perform implicit feature selection.
For non-linear relationships and interaction effects, tree-based methods (Random Forest, XGBoost) typically outperform logistic regression on this type of HR data.

8 Conclusion

This analysis demonstrates the step-by-step application of logistic regression to an employee attrition prediction problem. Starting from a single financial predictor (Model 1) through to a full-feature model (Model 3), each successive model meaningfully improved discrimination ability as measured by AUC. Model 3 is the recommended baseline for further development, with the caveat that addressing class imbalance (via threshold tuning, SMOTE, or cost-sensitive learning) is essential before operational deployment.

9 References

Boehmke, B. & Greenwell, B. (2020). Hands-On Machine Learning with R. CRC Press. Chapter 5: Logistic Regression. https://bradleyboehmke.github.io/HOML/logistic-regression.html
Kaggle Dataset: Employee Attrition Dataset by Stealth Technologies. https://www.kaggle.com/datasets/stealthtechnologies/employee-attrition-dataset

Logistic Regression for Employee Attrition

Khongorzul Erkhembayar

2026-04-26