1 Introduction

Employee attrition — the voluntary or involuntary departure of employees — is a costly challenge for organizations. Predicting which employees are likely to leave enables HR departments to intervene proactively, reducing turnover costs and retaining institutional knowledge.

This analysis applies logistic regression, a statistical technique suited for binary classification, to predict employee attrition (Left vs. Stayed) using the Kaggle Employee Attrition Dataset. Three models of increasing complexity are estimated and evaluated:

Model Formula
Model 1 Attrition ~ MonthlyIncome
Model 2 Attrition ~ MonthlyIncome + Overtime
Model 3 Attrition ~ . (all predictors)

2 Prerequisites

2.1 Packages

# Data wrangling & visualization
library(dplyr)
library(ggplot2)
library(tidyr)

# Modeling
library(caret)

# ROC / AUC
library(pROC)

# Tidy model output
library(broom)

# Nice tables
library(knitr)
library(kableExtra)

2.2 Data Loading

The pre-split training and test sets are loaded directly from the Kaggle dataset files.

train_raw <- read.csv("train.csv", stringsAsFactors = FALSE)
test_raw  <- read.csv("test.csv",  stringsAsFactors = FALSE)

cat("Training observations:", nrow(train_raw), "\n")
## Training observations: 59598
cat("Test observations:    ", nrow(test_raw),  "\n")
## Test observations:     14900
cat("Features:             ", ncol(train_raw) - 1, "\n")
## Features:              23

3 Data Preprocessing

3.1 Variable Inspection

glimpse(train_raw)
## Rows: 59,598
## Columns: 24
## $ Employee.ID              <int> 8410, 64756, 30257, 65791, 65026, 24368, 6497…
## $ Age                      <int> 31, 59, 24, 36, 56, 38, 47, 48, 57, 24, 30, 2…
## $ Gender                   <chr> "Male", "Female", "Female", "Female", "Male",…
## $ Years.at.Company         <int> 19, 4, 10, 7, 41, 3, 23, 16, 44, 1, 12, 6, 38…
## $ Job.Role                 <chr> "Education", "Media", "Healthcare", "Educatio…
## $ Monthly.Income           <int> 5390, 5534, 8159, 3989, 4821, 9977, 3681, 112…
## $ Work.Life.Balance        <chr> "Excellent", "Poor", "Good", "Good", "Fair", …
## $ Job.Satisfaction         <chr> "Medium", "High", "High", "High", "Very High"…
## $ Performance.Rating       <chr> "Average", "Low", "Low", "High", "Average", "…
## $ Number.of.Promotions     <int> 2, 3, 0, 1, 0, 3, 1, 2, 1, 1, 1, 2, 1, 4, 0, …
## $ Overtime                 <chr> "No", "No", "No", "No", "Yes", "No", "Yes", "…
## $ Distance.from.Home       <int> 22, 21, 11, 27, 71, 37, 75, 5, 39, 57, 51, 26…
## $ Education.Level          <chr> "Associate Degree", "Master’s Degree", "Bache…
## $ Marital.Status           <chr> "Married", "Divorced", "Married", "Single", "…
## $ Number.of.Dependents     <int> 0, 3, 3, 2, 0, 0, 3, 4, 4, 4, 1, 0, 0, 2, 0, …
## $ Job.Level                <chr> "Mid", "Mid", "Mid", "Mid", "Senior", "Mid", …
## $ Company.Size             <chr> "Medium", "Medium", "Medium", "Small", "Mediu…
## $ Company.Tenure           <int> 89, 21, 74, 50, 68, 47, 93, 88, 75, 45, 17, 3…
## $ Remote.Work              <chr> "No", "No", "No", "Yes", "No", "No", "No", "N…
## $ Leadership.Opportunities <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ Innovation.Opportunities <chr> "No", "No", "No", "No", "No", "Yes", "No", "N…
## $ Company.Reputation       <chr> "Excellent", "Fair", "Poor", "Good", "Fair", …
## $ Employee.Recognition     <chr> "Medium", "Low", "Low", "Medium", "Medium", "…
## $ Attrition                <chr> "Stayed", "Stayed", "Stayed", "Stayed", "Stay…

3.2 Cleaning & Encoding

Several steps are required before modelling:

  1. Rename columns with spaces to syntactically valid R names.
  2. Recode the response variable: "Left"1 (event), "Stayed"0.
  3. Convert character variables to factors so glm() handles dummy coding automatically.
  4. Drop the Employee ID column (a unique identifier with no predictive value).
clean_data <- function(df) {
  df %>%
    rename(
      MonthlyIncome          = Monthly.Income,
      Overtime               = Overtime,
      YearsAtCompany         = Years.at.Company,
      JobRole                = Job.Role,
      WorkLifeBalance        = Work.Life.Balance,
      JobSatisfaction        = Job.Satisfaction,
      PerformanceRating      = Performance.Rating,
      NumberOfPromotions     = Number.of.Promotions,
      DistanceFromHome       = Distance.from.Home,
      EducationLevel         = Education.Level,
      MaritalStatus          = Marital.Status,
      NumberOfDependents     = Number.of.Dependents,
      JobLevel               = Job.Level,
      CompanySize            = Company.Size,
      CompanyTenure          = Company.Tenure,
      RemoteWork             = Remote.Work,
      LeadershipOpportunities = Leadership.Opportunities,
      InnovationOpportunities = Innovation.Opportunities,
      CompanyReputation      = Company.Reputation,
      EmployeeRecognition    = Employee.Recognition
    ) %>%
    select(-Employee.ID) %>%
    mutate(
      Attrition = factor(ifelse(Attrition == "Left", "Yes", "No"),
                         levels = c("No", "Yes")),
      across(where(is.character), as.factor)
    )
}

train <- clean_data(train_raw)
test  <- clean_data(test_raw)

3.3 Class Distribution

Understanding class imbalance is critical, as it directly affects the No Information Rate — the accuracy achievable by always predicting the majority class.

bind_rows(
  train %>% count(Attrition) %>% mutate(Split = "Train"),
  test  %>% count(Attrition) %>% mutate(Split = "Test")
) %>%
  group_by(Split) %>%
  mutate(Proportion = n / sum(n)) %>%
  ggplot(aes(x = Attrition, y = Proportion, fill = Attrition)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = scales::percent(Proportion, accuracy = 0.1)),
            vjust = -0.4, fontface = "bold") +
  facet_wrap(~Split) +
  scale_fill_manual(values = c("No" = "#2196F3", "Yes" = "#F44336")) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  labs(title = "Attrition Class Distribution",
       subtitle = "Consistent class imbalance across both splits",
       x = "Attrition", y = "Proportion") +
  theme_minimal(base_size = 13)

The dataset is imbalanced — approximately 78–79% of employees stayed and ~21–22% left. This sets the No Information Rate as a baseline that any model must meaningfully exceed.


4 Model Estimation

All three models are fit using glm() with family = "binomial" on the training set.

4.1 Model 1: Monthly Income Only

model1 <- glm(Attrition ~ MonthlyIncome,
              family = "binomial", data = train)

tidy(model1) %>%
  mutate(across(where(is.numeric), ~round(., 6))) %>%
  kable(caption = "Model 1 – Coefficient Estimates") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"))
Model 1 – Coefficient Estimates
term estimate std.error statistic p.value
(Intercept) -0.020809 0.029020 -0.717061 0.473336
MonthlyIncome -0.000011 0.000004 -2.777349 0.005480

Interpretation:
The negative coefficient on MonthlyIncome (-1.1^{-5}) indicates that higher earners are less likely to leave. On the odds scale, each additional dollar of monthly income multiplies the attrition odds by 0.999989 — a very small but statistically significant protective effect.

4.2 Model 2: Monthly Income + Overtime

model2 <- glm(Attrition ~ MonthlyIncome + Overtime,
              family = "binomial", data = train)

tidy(model2) %>%
  mutate(across(where(is.numeric), ~round(., 6))) %>%
  kable(caption = "Model 2 – Coefficient Estimates") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"))
Model 2 – Coefficient Estimates
term estimate std.error statistic p.value
(Intercept) -0.100061 0.029646 -3.375257 0.000737
MonthlyIncome -0.000010 0.000004 -2.717495 0.006578
OvertimeYes 0.237424 0.017501 13.566062 0.000000

Interpretation:
Adding Overtime substantially improves the model. The positive coefficient on OvertimeYes means employees who work overtime face considerably higher attrition odds — by a factor of approximately 1.268 — confirming that overwork is a meaningful driver of turnover.

4.3 Model 3: All Predictors

model3 <- glm(Attrition ~ .,
              family = "binomial", data = train)

tidy(model3) %>%
  mutate(across(where(is.numeric), ~round(., 5)),
         Significant = ifelse(p.value < 0.05, "✓", "")) %>%
  kable(caption = "Model 3 – All Predictors (✓ = p < 0.05)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")
Model 3 – All Predictors (✓ = p < 0.05)
term estimate std.error statistic p.value Significant
(Intercept) 0.25629 0.08586 2.98499 0.00284
Age -0.00600 0.00100 -5.97435 0.00000
GenderMale -0.62616 0.02078 -30.13064 0.00000
YearsAtCompany -0.01360 0.00117 -11.58778 0.00000
JobRoleFinance -0.09223 0.04838 -1.90642 0.05660
JobRoleHealthcare -0.07214 0.04238 -1.70197 0.08876
JobRoleMedia -0.09867 0.03616 -2.72900 0.00635
JobRoleTechnology -0.08476 0.04847 -1.74869 0.08034
MonthlyIncome -0.00001 0.00001 -0.81470 0.41524
WorkLifeBalanceFair 1.32703 0.03129 42.41040 0.00000
WorkLifeBalanceGood 0.29336 0.02958 9.91615 0.00000
WorkLifeBalancePoor 1.50922 0.03757 40.16735 0.00000
JobSatisfactionLow 0.48803 0.03572 13.66402 0.00000
JobSatisfactionMedium 0.01133 0.02721 0.41662 0.67695
JobSatisfactionVery High 0.49882 0.02701 18.46942 0.00000
PerformanceRatingBelow Average 0.33386 0.02948 11.32415 0.00000
PerformanceRatingHigh 0.00508 0.02647 0.19182 0.84788
PerformanceRatingLow 0.59725 0.04847 12.32091 0.00000
NumberOfPromotions -0.24919 0.01043 -23.89984 0.00000
OvertimeYes 0.35116 0.02189 16.04303 0.00000
DistanceFromHome 0.00995 0.00036 27.36510 0.00000
EducationLevelBachelor’s Degree 0.04581 0.02757 1.66164 0.09658
EducationLevelHigh School 0.03035 0.03072 0.98824 0.32304
EducationLevelMaster’s Degree 0.03087 0.03048 1.01311 0.31101
EducationLevelPhD -1.56391 0.05447 -28.71405 0.00000
MaritalStatusMarried -0.25588 0.02957 -8.65441 0.00000
MaritalStatusSingle 1.57326 0.03220 48.86324 0.00000
NumberOfDependents -0.15732 0.00666 -23.62126 0.00000
JobLevelMid -1.00326 0.02266 -44.27069 0.00000
JobLevelSenior -2.61640 0.03262 -80.22062 0.00000
CompanySizeMedium 0.00614 0.02713 0.22645 0.82085
CompanySizeSmall 0.20635 0.02958 6.97489 0.00000
CompanyTenure -0.00020 0.00045 -0.45223 0.65110
RemoteWorkYes -1.77541 0.02916 -60.88797 0.00000
LeadershipOpportunitiesYes -0.16271 0.04753 -3.42362 0.00062
InnovationOpportunitiesYes -0.14100 0.02784 -5.06418 0.00000
CompanyReputationFair 0.46981 0.03962 11.85776 0.00000
CompanyReputationGood -0.06027 0.03535 -1.70512 0.08817
CompanyReputationPoor 0.75645 0.03972 19.04269 0.00000
EmployeeRecognitionLow 0.03956 0.02620 1.51009 0.13102
EmployeeRecognitionMedium 0.04354 0.02769 1.57245 0.11585
EmployeeRecognitionVery High -0.08300 0.05019 -1.65376 0.09818

5 Model Evaluation on Test Data

5.1 Predicted Probabilities

test$prob1 <- predict(model1, newdata = test, type = "response")
test$prob2 <- predict(model2, newdata = test, type = "response")
test$prob3 <- predict(model3, newdata = test, type = "response")

# Default 0.5 threshold for class predictions
test$pred1 <- factor(ifelse(test$prob1 >= 0.5, "Yes", "No"), levels = c("No","Yes"))
test$pred2 <- factor(ifelse(test$prob2 >= 0.5, "Yes", "No"), levels = c("No","Yes"))
test$pred3 <- factor(ifelse(test$prob3 >= 0.5, "Yes", "No"), levels = c("No","Yes"))

5.2 Confusion Matrices

5.2.1 Model 1 — Monthly Income

cm1 <- confusionMatrix(test$pred1, test$Attrition, positive = "Yes")
cm1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  7868 7032
##        Yes    0    0
##                                         
##                Accuracy : 0.5281        
##                  95% CI : (0.52, 0.5361)
##     No Information Rate : 0.5281        
##     P-Value [Acc > NIR] : 0.5033        
##                                         
##                   Kappa : 0             
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.0000        
##             Specificity : 1.0000        
##          Pos Pred Value :    NaN        
##          Neg Pred Value : 0.5281        
##              Prevalence : 0.4719        
##          Detection Rate : 0.0000        
##    Detection Prevalence : 0.0000        
##       Balanced Accuracy : 0.5000        
##                                         
##        'Positive' Class : Yes           
## 

5.2.2 Model 2 — Monthly Income + Overtime

cm2 <- confusionMatrix(test$pred2, test$Attrition, positive = "Yes")
cm2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  5487 4530
##        Yes 2381 2502
##                                           
##                Accuracy : 0.5362          
##                  95% CI : (0.5281, 0.5442)
##     No Information Rate : 0.5281          
##     P-Value [Acc > NIR] : 0.02397         
##                                           
##                   Kappa : 0.0541          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.3558          
##             Specificity : 0.6974          
##          Pos Pred Value : 0.5124          
##          Neg Pred Value : 0.5478          
##              Prevalence : 0.4719          
##          Detection Rate : 0.1679          
##    Detection Prevalence : 0.3277          
##       Balanced Accuracy : 0.5266          
##                                           
##        'Positive' Class : Yes             
## 

5.2.3 Model 3 — All Variables

cm3 <- confusionMatrix(test$pred3, test$Attrition, positive = "Yes")
cm3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  6080 1855
##        Yes 1788 5177
##                                           
##                Accuracy : 0.7555          
##                  95% CI : (0.7485, 0.7624)
##     No Information Rate : 0.5281          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5092          
##                                           
##  Mcnemar's Test P-Value : 0.2742          
##                                           
##             Sensitivity : 0.7362          
##             Specificity : 0.7728          
##          Pos Pred Value : 0.7433          
##          Neg Pred Value : 0.7662          
##              Prevalence : 0.4719          
##          Detection Rate : 0.3474          
##    Detection Prevalence : 0.4674          
##       Balanced Accuracy : 0.7545          
##                                           
##        'Positive' Class : Yes             
## 

5.3 Confusion Matrix Heat-maps

Visualizing all three confusion matrices side by side for quick comparison.

plot_cm <- function(cm, title) {
  df <- as.data.frame(cm$table) %>%
    rename(Predicted = Prediction, Actual = Reference) %>%
    group_by(Actual) %>%
    mutate(ColTotal = sum(Freq)) %>%
    ungroup() %>%
    mutate(
      Pct      = Freq / sum(Freq) * 100,
      Label    = paste0(Freq, "\n(", sprintf("%.1f%%", Pct), ")")
    )

  ggplot(df, aes(x = Actual, y = Predicted, fill = Freq)) +
    geom_tile(color = "white", linewidth = 0.8) +
    geom_text(aes(label = Label), size = 4.2, fontface = "bold", color = "white",
              lineheight = 1.4) +
    scale_fill_gradient(low = "#90CAF9", high = "#1565C0") +
    labs(title = title, fill = "Count") +
    theme_minimal(base_size = 12) +
    theme(legend.position = "none",
          plot.title = element_text(hjust = 0.5, face = "bold"))
}

library(gridExtra)
grid.arrange(
  plot_cm(cm1, "Model 1\nMonthlyIncome"),
  plot_cm(cm2, "Model 2\nMonthlyIncome + Overtime"),
  plot_cm(cm3, "Model 3\nAll Predictors"),
  ncol = 3
)

5.4 Performance Summary Table

extract_metrics <- function(cm, label) {
  tibble(
    Model       = label,
    Accuracy    = cm$overall["Accuracy"],
    Sensitivity = cm$byClass["Sensitivity"],
    Specificity = cm$byClass["Specificity"],
    `Pos Pred Value` = cm$byClass["Pos Pred Value"],
    `Neg Pred Value` = cm$byClass["Neg Pred Value"],
    `Balanced Accuracy` = cm$byClass["Balanced Accuracy"]
  )
}

bind_rows(
  extract_metrics(cm1, "Model 1: MonthlyIncome"),
  extract_metrics(cm2, "Model 2: MonthlyIncome + Overtime"),
  extract_metrics(cm3, "Model 3: All Predictors")
) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(caption = "Test-Set Performance Metrics Across Three Models") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed")) %>%
  row_spec(3, bold = TRUE, background = "#E3F2FD")
Test-Set Performance Metrics Across Three Models
Model Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value Balanced Accuracy
Model 1: MonthlyIncome 0.5281 0.0000 1.0000 NaN 0.5281 0.5000
Model 2: MonthlyIncome + Overtime 0.5362 0.3558 0.6974 0.5124 0.5478 0.5266
Model 3: All Predictors 0.7555 0.7362 0.7728 0.7433 0.7662 0.7545

6 ROC Curves & AUC

ROC (Receiver Operating Characteristic) curves plot the True Positive Rate against the False Positive Rate across all decision thresholds. The Area Under the Curve (AUC) summarises discrimination ability in a single value: 0.5 = random, 1.0 = perfect.

roc1 <- roc(test$Attrition, test$prob1, levels = c("No","Yes"), direction = "<", quiet = TRUE)
roc2 <- roc(test$Attrition, test$prob2, levels = c("No","Yes"), direction = "<", quiet = TRUE)
roc3 <- roc(test$Attrition, test$prob3, levels = c("No","Yes"), direction = "<", quiet = TRUE)

# Build tidy data frames for ggplot
roc_df <- function(roc_obj, model_name) {
  data.frame(
    FPR   = 1 - roc_obj$specificities,
    TPR   = roc_obj$sensitivities,
    Model = model_name
  )
}

roc_data <- bind_rows(
  roc_df(roc1, sprintf("Model 1 (AUC = %.3f)", auc(roc1))),
  roc_df(roc2, sprintf("Model 2 (AUC = %.3f)", auc(roc2))),
  roc_df(roc3, sprintf("Model 3 (AUC = %.3f)", auc(roc3)))
)

ggplot(roc_data, aes(x = FPR, y = TPR, color = Model)) +
  geom_line(linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", color = "grey60") +
  scale_color_manual(values = c("#E53935","#FB8C00","#1E88E5")) +
  labs(
    title    = "ROC Curves — Logistic Regression Models",
    subtitle = "Evaluated on the held-out test set",
    x        = "False Positive Rate (1 – Specificity)",
    y        = "True Positive Rate (Sensitivity)",
    color    = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = c(0.65, 0.2),
    legend.background = element_rect(fill = "white", color = "grey85"),
    plot.title    = element_text(face = "bold"),
    plot.subtitle = element_text(color = "grey50")
  ) +
  annotate("text", x = 0.8, y = 0.05,
           label = "Random\nClassifier", color = "grey60", size = 3.5)


7 Discussion

7.1 Key Findings

Aspect Observation
Best overall model Model 3 (all predictors) achieves the highest accuracy and AUC
Single most informative predictor Overtime — employees working overtime face substantially higher attrition risk
MonthlyIncome alone (Model 1) Modest discrimination; AUC barely exceeds the random baseline
Sensitivity vs. Specificity trade-off All models identify Stayed employees well (high specificity) but struggle to catch Left employees (lower sensitivity) — a direct consequence of class imbalance
Class imbalance ~79% majority class sets a high No Information Rate; raw accuracy is a misleading metric here; Balanced Accuracy and AUC are preferred

7.2 Limitations

  1. Default threshold (0.5): Given class imbalance, lowering the threshold (e.g., to 0.3) can substantially improve sensitivity at a modest specificity cost.
  2. Logistic regression assumptions: Linearity on the log-odds scale, no multicollinearity, and no influential outliers — all require verification before deployment.
  3. Feature engineering: Several categorical variables (e.g., JobRole, EducationLevel) contain many levels; sparse cells may inflate variance in coefficient estimates.
  4. No cross-validation here: The confusion matrix is evaluated on a single test split; k-fold CV would provide more stable performance estimates.

7.3 Recommendations

  • Explore threshold tuning using the ROC curve’s optimal operating point (pROC::coords(roc3, "best")).
  • Consider regularized logistic regression (LASSO/Ridge via glmnet) to handle multicollinearity and perform implicit feature selection.
  • For non-linear relationships and interaction effects, tree-based methods (Random Forest, XGBoost) typically outperform logistic regression on this type of HR data.

8 Conclusion

This analysis demonstrates the step-by-step application of logistic regression to an employee attrition prediction problem. Starting from a single financial predictor (Model 1) through to a full-feature model (Model 3), each successive model meaningfully improved discrimination ability as measured by AUC. Model 3 is the recommended baseline for further development, with the caveat that addressing class imbalance (via threshold tuning, SMOTE, or cost-sensitive learning) is essential before operational deployment.


9 References