Employee attrition — the voluntary or involuntary departure of employees — is a costly challenge for organizations. Predicting which employees are likely to leave enables HR departments to intervene proactively, reducing turnover costs and retaining institutional knowledge.
This analysis applies logistic regression, a statistical technique suited for binary classification, to predict employee attrition (Left vs. Stayed) using the Kaggle Employee Attrition Dataset. Three models of increasing complexity are estimated and evaluated:
| Model | Formula |
|---|---|
| Model 1 | Attrition ~ MonthlyIncome |
| Model 2 | Attrition ~ MonthlyIncome + Overtime |
| Model 3 | Attrition ~ . (all predictors) |
The pre-split training and test sets are loaded directly from the Kaggle dataset files.
train_raw <- read.csv("train.csv", stringsAsFactors = FALSE)
test_raw <- read.csv("test.csv", stringsAsFactors = FALSE)
cat("Training observations:", nrow(train_raw), "\n")## Training observations: 59598
## Test observations: 14900
## Features: 23
## Rows: 59,598
## Columns: 24
## $ Employee.ID <int> 8410, 64756, 30257, 65791, 65026, 24368, 6497…
## $ Age <int> 31, 59, 24, 36, 56, 38, 47, 48, 57, 24, 30, 2…
## $ Gender <chr> "Male", "Female", "Female", "Female", "Male",…
## $ Years.at.Company <int> 19, 4, 10, 7, 41, 3, 23, 16, 44, 1, 12, 6, 38…
## $ Job.Role <chr> "Education", "Media", "Healthcare", "Educatio…
## $ Monthly.Income <int> 5390, 5534, 8159, 3989, 4821, 9977, 3681, 112…
## $ Work.Life.Balance <chr> "Excellent", "Poor", "Good", "Good", "Fair", …
## $ Job.Satisfaction <chr> "Medium", "High", "High", "High", "Very High"…
## $ Performance.Rating <chr> "Average", "Low", "Low", "High", "Average", "…
## $ Number.of.Promotions <int> 2, 3, 0, 1, 0, 3, 1, 2, 1, 1, 1, 2, 1, 4, 0, …
## $ Overtime <chr> "No", "No", "No", "No", "Yes", "No", "Yes", "…
## $ Distance.from.Home <int> 22, 21, 11, 27, 71, 37, 75, 5, 39, 57, 51, 26…
## $ Education.Level <chr> "Associate Degree", "Master’s Degree", "Bache…
## $ Marital.Status <chr> "Married", "Divorced", "Married", "Single", "…
## $ Number.of.Dependents <int> 0, 3, 3, 2, 0, 0, 3, 4, 4, 4, 1, 0, 0, 2, 0, …
## $ Job.Level <chr> "Mid", "Mid", "Mid", "Mid", "Senior", "Mid", …
## $ Company.Size <chr> "Medium", "Medium", "Medium", "Small", "Mediu…
## $ Company.Tenure <int> 89, 21, 74, 50, 68, 47, 93, 88, 75, 45, 17, 3…
## $ Remote.Work <chr> "No", "No", "No", "Yes", "No", "No", "No", "N…
## $ Leadership.Opportunities <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ Innovation.Opportunities <chr> "No", "No", "No", "No", "No", "Yes", "No", "N…
## $ Company.Reputation <chr> "Excellent", "Fair", "Poor", "Good", "Fair", …
## $ Employee.Recognition <chr> "Medium", "Low", "Low", "Medium", "Medium", "…
## $ Attrition <chr> "Stayed", "Stayed", "Stayed", "Stayed", "Stay…
Several steps are required before modelling:
"Left" →
1 (event), "Stayed" → 0.glm() handles dummy coding automatically.clean_data <- function(df) {
df %>%
rename(
MonthlyIncome = Monthly.Income,
Overtime = Overtime,
YearsAtCompany = Years.at.Company,
JobRole = Job.Role,
WorkLifeBalance = Work.Life.Balance,
JobSatisfaction = Job.Satisfaction,
PerformanceRating = Performance.Rating,
NumberOfPromotions = Number.of.Promotions,
DistanceFromHome = Distance.from.Home,
EducationLevel = Education.Level,
MaritalStatus = Marital.Status,
NumberOfDependents = Number.of.Dependents,
JobLevel = Job.Level,
CompanySize = Company.Size,
CompanyTenure = Company.Tenure,
RemoteWork = Remote.Work,
LeadershipOpportunities = Leadership.Opportunities,
InnovationOpportunities = Innovation.Opportunities,
CompanyReputation = Company.Reputation,
EmployeeRecognition = Employee.Recognition
) %>%
select(-Employee.ID) %>%
mutate(
Attrition = factor(ifelse(Attrition == "Left", "Yes", "No"),
levels = c("No", "Yes")),
across(where(is.character), as.factor)
)
}
train <- clean_data(train_raw)
test <- clean_data(test_raw)Understanding class imbalance is critical, as it directly affects the No Information Rate — the accuracy achievable by always predicting the majority class.
bind_rows(
train %>% count(Attrition) %>% mutate(Split = "Train"),
test %>% count(Attrition) %>% mutate(Split = "Test")
) %>%
group_by(Split) %>%
mutate(Proportion = n / sum(n)) %>%
ggplot(aes(x = Attrition, y = Proportion, fill = Attrition)) +
geom_col(width = 0.5, show.legend = FALSE) +
geom_text(aes(label = scales::percent(Proportion, accuracy = 0.1)),
vjust = -0.4, fontface = "bold") +
facet_wrap(~Split) +
scale_fill_manual(values = c("No" = "#2196F3", "Yes" = "#F44336")) +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
labs(title = "Attrition Class Distribution",
subtitle = "Consistent class imbalance across both splits",
x = "Attrition", y = "Proportion") +
theme_minimal(base_size = 13)The dataset is imbalanced — approximately 78–79% of employees stayed and ~21–22% left. This sets the No Information Rate as a baseline that any model must meaningfully exceed.
All three models are fit using glm() with
family = "binomial" on the training set.
model1 <- glm(Attrition ~ MonthlyIncome,
family = "binomial", data = train)
tidy(model1) %>%
mutate(across(where(is.numeric), ~round(., 6))) %>%
kable(caption = "Model 1 – Coefficient Estimates") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"))| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.020809 | 0.029020 | -0.717061 | 0.473336 |
| MonthlyIncome | -0.000011 | 0.000004 | -2.777349 | 0.005480 |
Interpretation:
The negative coefficient on MonthlyIncome (-1.1^{-5})
indicates that higher earners are less likely to leave. On the
odds scale, each additional dollar of monthly income multiplies the
attrition odds by 0.999989 — a very small but statistically significant
protective effect.
model2 <- glm(Attrition ~ MonthlyIncome + Overtime,
family = "binomial", data = train)
tidy(model2) %>%
mutate(across(where(is.numeric), ~round(., 6))) %>%
kable(caption = "Model 2 – Coefficient Estimates") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"))| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.100061 | 0.029646 | -3.375257 | 0.000737 |
| MonthlyIncome | -0.000010 | 0.000004 | -2.717495 | 0.006578 |
| OvertimeYes | 0.237424 | 0.017501 | 13.566062 | 0.000000 |
Interpretation:
Adding Overtime substantially improves the model. The
positive coefficient on OvertimeYes means employees who
work overtime face considerably higher attrition odds — by a factor of
approximately 1.268 — confirming that overwork is a meaningful driver of
turnover.
model3 <- glm(Attrition ~ .,
family = "binomial", data = train)
tidy(model3) %>%
mutate(across(where(is.numeric), ~round(., 5)),
Significant = ifelse(p.value < 0.05, "✓", "")) %>%
kable(caption = "Model 3 – All Predictors (✓ = p < 0.05)") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE) %>%
scroll_box(height = "400px")| term | estimate | std.error | statistic | p.value | Significant |
|---|---|---|---|---|---|
| (Intercept) | 0.25629 | 0.08586 | 2.98499 | 0.00284 | ✓ |
| Age | -0.00600 | 0.00100 | -5.97435 | 0.00000 | ✓ |
| GenderMale | -0.62616 | 0.02078 | -30.13064 | 0.00000 | ✓ |
| YearsAtCompany | -0.01360 | 0.00117 | -11.58778 | 0.00000 | ✓ |
| JobRoleFinance | -0.09223 | 0.04838 | -1.90642 | 0.05660 | |
| JobRoleHealthcare | -0.07214 | 0.04238 | -1.70197 | 0.08876 | |
| JobRoleMedia | -0.09867 | 0.03616 | -2.72900 | 0.00635 | ✓ |
| JobRoleTechnology | -0.08476 | 0.04847 | -1.74869 | 0.08034 | |
| MonthlyIncome | -0.00001 | 0.00001 | -0.81470 | 0.41524 | |
| WorkLifeBalanceFair | 1.32703 | 0.03129 | 42.41040 | 0.00000 | ✓ |
| WorkLifeBalanceGood | 0.29336 | 0.02958 | 9.91615 | 0.00000 | ✓ |
| WorkLifeBalancePoor | 1.50922 | 0.03757 | 40.16735 | 0.00000 | ✓ |
| JobSatisfactionLow | 0.48803 | 0.03572 | 13.66402 | 0.00000 | ✓ |
| JobSatisfactionMedium | 0.01133 | 0.02721 | 0.41662 | 0.67695 | |
| JobSatisfactionVery High | 0.49882 | 0.02701 | 18.46942 | 0.00000 | ✓ |
| PerformanceRatingBelow Average | 0.33386 | 0.02948 | 11.32415 | 0.00000 | ✓ |
| PerformanceRatingHigh | 0.00508 | 0.02647 | 0.19182 | 0.84788 | |
| PerformanceRatingLow | 0.59725 | 0.04847 | 12.32091 | 0.00000 | ✓ |
| NumberOfPromotions | -0.24919 | 0.01043 | -23.89984 | 0.00000 | ✓ |
| OvertimeYes | 0.35116 | 0.02189 | 16.04303 | 0.00000 | ✓ |
| DistanceFromHome | 0.00995 | 0.00036 | 27.36510 | 0.00000 | ✓ |
| EducationLevelBachelor’s Degree | 0.04581 | 0.02757 | 1.66164 | 0.09658 | |
| EducationLevelHigh School | 0.03035 | 0.03072 | 0.98824 | 0.32304 | |
| EducationLevelMaster’s Degree | 0.03087 | 0.03048 | 1.01311 | 0.31101 | |
| EducationLevelPhD | -1.56391 | 0.05447 | -28.71405 | 0.00000 | ✓ |
| MaritalStatusMarried | -0.25588 | 0.02957 | -8.65441 | 0.00000 | ✓ |
| MaritalStatusSingle | 1.57326 | 0.03220 | 48.86324 | 0.00000 | ✓ |
| NumberOfDependents | -0.15732 | 0.00666 | -23.62126 | 0.00000 | ✓ |
| JobLevelMid | -1.00326 | 0.02266 | -44.27069 | 0.00000 | ✓ |
| JobLevelSenior | -2.61640 | 0.03262 | -80.22062 | 0.00000 | ✓ |
| CompanySizeMedium | 0.00614 | 0.02713 | 0.22645 | 0.82085 | |
| CompanySizeSmall | 0.20635 | 0.02958 | 6.97489 | 0.00000 | ✓ |
| CompanyTenure | -0.00020 | 0.00045 | -0.45223 | 0.65110 | |
| RemoteWorkYes | -1.77541 | 0.02916 | -60.88797 | 0.00000 | ✓ |
| LeadershipOpportunitiesYes | -0.16271 | 0.04753 | -3.42362 | 0.00062 | ✓ |
| InnovationOpportunitiesYes | -0.14100 | 0.02784 | -5.06418 | 0.00000 | ✓ |
| CompanyReputationFair | 0.46981 | 0.03962 | 11.85776 | 0.00000 | ✓ |
| CompanyReputationGood | -0.06027 | 0.03535 | -1.70512 | 0.08817 | |
| CompanyReputationPoor | 0.75645 | 0.03972 | 19.04269 | 0.00000 | ✓ |
| EmployeeRecognitionLow | 0.03956 | 0.02620 | 1.51009 | 0.13102 | |
| EmployeeRecognitionMedium | 0.04354 | 0.02769 | 1.57245 | 0.11585 | |
| EmployeeRecognitionVery High | -0.08300 | 0.05019 | -1.65376 | 0.09818 |
test$prob1 <- predict(model1, newdata = test, type = "response")
test$prob2 <- predict(model2, newdata = test, type = "response")
test$prob3 <- predict(model3, newdata = test, type = "response")
# Default 0.5 threshold for class predictions
test$pred1 <- factor(ifelse(test$prob1 >= 0.5, "Yes", "No"), levels = c("No","Yes"))
test$pred2 <- factor(ifelse(test$prob2 >= 0.5, "Yes", "No"), levels = c("No","Yes"))
test$pred3 <- factor(ifelse(test$prob3 >= 0.5, "Yes", "No"), levels = c("No","Yes"))## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 7868 7032
## Yes 0 0
##
## Accuracy : 0.5281
## 95% CI : (0.52, 0.5361)
## No Information Rate : 0.5281
## P-Value [Acc > NIR] : 0.5033
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5281
## Prevalence : 0.4719
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 5487 4530
## Yes 2381 2502
##
## Accuracy : 0.5362
## 95% CI : (0.5281, 0.5442)
## No Information Rate : 0.5281
## P-Value [Acc > NIR] : 0.02397
##
## Kappa : 0.0541
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.3558
## Specificity : 0.6974
## Pos Pred Value : 0.5124
## Neg Pred Value : 0.5478
## Prevalence : 0.4719
## Detection Rate : 0.1679
## Detection Prevalence : 0.3277
## Balanced Accuracy : 0.5266
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6080 1855
## Yes 1788 5177
##
## Accuracy : 0.7555
## 95% CI : (0.7485, 0.7624)
## No Information Rate : 0.5281
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5092
##
## Mcnemar's Test P-Value : 0.2742
##
## Sensitivity : 0.7362
## Specificity : 0.7728
## Pos Pred Value : 0.7433
## Neg Pred Value : 0.7662
## Prevalence : 0.4719
## Detection Rate : 0.3474
## Detection Prevalence : 0.4674
## Balanced Accuracy : 0.7545
##
## 'Positive' Class : Yes
##
Visualizing all three confusion matrices side by side for quick comparison.
plot_cm <- function(cm, title) {
df <- as.data.frame(cm$table) %>%
rename(Predicted = Prediction, Actual = Reference) %>%
group_by(Actual) %>%
mutate(ColTotal = sum(Freq)) %>%
ungroup() %>%
mutate(
Pct = Freq / sum(Freq) * 100,
Label = paste0(Freq, "\n(", sprintf("%.1f%%", Pct), ")")
)
ggplot(df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white", linewidth = 0.8) +
geom_text(aes(label = Label), size = 4.2, fontface = "bold", color = "white",
lineheight = 1.4) +
scale_fill_gradient(low = "#90CAF9", high = "#1565C0") +
labs(title = title, fill = "Count") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"))
}
library(gridExtra)
grid.arrange(
plot_cm(cm1, "Model 1\nMonthlyIncome"),
plot_cm(cm2, "Model 2\nMonthlyIncome + Overtime"),
plot_cm(cm3, "Model 3\nAll Predictors"),
ncol = 3
)extract_metrics <- function(cm, label) {
tibble(
Model = label,
Accuracy = cm$overall["Accuracy"],
Sensitivity = cm$byClass["Sensitivity"],
Specificity = cm$byClass["Specificity"],
`Pos Pred Value` = cm$byClass["Pos Pred Value"],
`Neg Pred Value` = cm$byClass["Neg Pred Value"],
`Balanced Accuracy` = cm$byClass["Balanced Accuracy"]
)
}
bind_rows(
extract_metrics(cm1, "Model 1: MonthlyIncome"),
extract_metrics(cm2, "Model 2: MonthlyIncome + Overtime"),
extract_metrics(cm3, "Model 3: All Predictors")
) %>%
mutate(across(where(is.numeric), ~round(., 4))) %>%
kable(caption = "Test-Set Performance Metrics Across Three Models") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed")) %>%
row_spec(3, bold = TRUE, background = "#E3F2FD")| Model | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Balanced Accuracy |
|---|---|---|---|---|---|---|
| Model 1: MonthlyIncome | 0.5281 | 0.0000 | 1.0000 | NaN | 0.5281 | 0.5000 |
| Model 2: MonthlyIncome + Overtime | 0.5362 | 0.3558 | 0.6974 | 0.5124 | 0.5478 | 0.5266 |
| Model 3: All Predictors | 0.7555 | 0.7362 | 0.7728 | 0.7433 | 0.7662 | 0.7545 |
ROC (Receiver Operating Characteristic) curves plot the True Positive Rate against the False Positive Rate across all decision thresholds. The Area Under the Curve (AUC) summarises discrimination ability in a single value: 0.5 = random, 1.0 = perfect.
roc1 <- roc(test$Attrition, test$prob1, levels = c("No","Yes"), direction = "<", quiet = TRUE)
roc2 <- roc(test$Attrition, test$prob2, levels = c("No","Yes"), direction = "<", quiet = TRUE)
roc3 <- roc(test$Attrition, test$prob3, levels = c("No","Yes"), direction = "<", quiet = TRUE)
# Build tidy data frames for ggplot
roc_df <- function(roc_obj, model_name) {
data.frame(
FPR = 1 - roc_obj$specificities,
TPR = roc_obj$sensitivities,
Model = model_name
)
}
roc_data <- bind_rows(
roc_df(roc1, sprintf("Model 1 (AUC = %.3f)", auc(roc1))),
roc_df(roc2, sprintf("Model 2 (AUC = %.3f)", auc(roc2))),
roc_df(roc3, sprintf("Model 3 (AUC = %.3f)", auc(roc3)))
)
ggplot(roc_data, aes(x = FPR, y = TPR, color = Model)) +
geom_line(linewidth = 1.1) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", color = "grey60") +
scale_color_manual(values = c("#E53935","#FB8C00","#1E88E5")) +
labs(
title = "ROC Curves — Logistic Regression Models",
subtitle = "Evaluated on the held-out test set",
x = "False Positive Rate (1 – Specificity)",
y = "True Positive Rate (Sensitivity)",
color = NULL
) +
theme_minimal(base_size = 13) +
theme(
legend.position = c(0.65, 0.2),
legend.background = element_rect(fill = "white", color = "grey85"),
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(color = "grey50")
) +
annotate("text", x = 0.8, y = 0.05,
label = "Random\nClassifier", color = "grey60", size = 3.5)| Aspect | Observation |
|---|---|
| Best overall model | Model 3 (all predictors) achieves the highest accuracy and AUC |
| Single most informative predictor | Overtime — employees working overtime face
substantially higher attrition risk |
| MonthlyIncome alone (Model 1) | Modest discrimination; AUC barely exceeds the random baseline |
| Sensitivity vs. Specificity trade-off | All models identify Stayed employees well (high specificity) but struggle to catch Left employees (lower sensitivity) — a direct consequence of class imbalance |
| Class imbalance | ~79% majority class sets a high No Information Rate; raw accuracy is a misleading metric here; Balanced Accuracy and AUC are preferred |
JobRole, EducationLevel) contain many
levels; sparse cells may inflate variance in coefficient
estimates.pROC::coords(roc3, "best")).glmnet) to handle multicollinearity and perform implicit
feature selection.This analysis demonstrates the step-by-step application of logistic regression to an employee attrition prediction problem. Starting from a single financial predictor (Model 1) through to a full-feature model (Model 3), each successive model meaningfully improved discrimination ability as measured by AUC. Model 3 is the recommended baseline for further development, with the caveat that addressing class imbalance (via threshold tuning, SMOTE, or cost-sensitive learning) is essential before operational deployment.