Employee attrition is a binary classification problem: the goal is to predict whether an employee will leave the company or stay. This report fits logistic regression models to the Kaggle Employee Attrition Dataset to predict attrition.
Dataset: https://www.kaggle.com/datasets/stealthtechnologies/employee-attrition-dataset
Reference: https://bradleyboehmke.github.io/HOML/logistic-regression.html#assessing-model-accuracy-1
The assignment requires three logistic regression models:
- Attrition ~ MonthlyIncome
- Attrition ~ MonthlyIncome + Overtime
- Attrition ~ . (all available variables)

The models are trained on the training dataset and evaluated on the test dataset using confusion matrices.
The main purpose of this report is to compare how model performance changes when more predictors are added.
Because the response variable has two outcomes, Left and
Stayed, logistic regression is appropriate. The model
estimates the probability that an employee belongs to the positive
class, which is defined here as:
| Class | Value |
|---|---|
| Positive class | Left |
| Negative class | Stayed |
A cutoff value of 0.5 is used:
| Rule | Prediction |
|---|---|
| Predicted probability >= 0.5 | Left |
| Predicted probability < 0.5 | Stayed |
Model performance is assessed using the following metrics, where precision, recall, and F1 are computed with respect to the Left class:
| Metric | Definition |
|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 score | 2 * Precision * Recall / (Precision + Recall) |
In this report, TP means an employee who actually left
and was predicted as Left.
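The confusion_stats() helper called throughout this report is not reproduced in the source. The following is a minimal sketch, assuming it applies the 0.5 cutoff to the predicted probabilities and returns the confusion matrix together with the four metrics defined above; the function name's usage matches the report, but the returned field names (confusion, accuracy, precision, recall, f1) are illustrative.
# Assumed helper (not shown in the original report): apply the 0.5 cutoff
# and compute the confusion matrix and evaluation metrics on a test set.
confusion_stats <- function(model, newdata, cutoff = 0.5) {
  prob <- predict(model, newdata = newdata, type = "response")
  pred <- ifelse(prob >= cutoff, 1, 0)
  actual <- newdata$attrition_binary

  tp <- sum(pred == 1 & actual == 1)
  tn <- sum(pred == 0 & actual == 0)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)

  precision <- if (tp + fp == 0) 0 else tp / (tp + fp)
  recall    <- if (tp + fn == 0) 0 else tp / (tp + fn)
  f1        <- if (precision + recall == 0) 0 else
    2 * precision * recall / (precision + recall)

  list(confusion = table(Actual = actual, Predicted = pred),
       accuracy  = (tp + tn) / length(actual),
       precision = precision,
       recall    = recall,
       f1        = f1)
}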
# Load the pre-split training and test sets
train <- read.csv("data/train.csv", check.names = TRUE, fileEncoding = "UTF-8-BOM")
test <- read.csv("data/test.csv", check.names = TRUE, fileEncoding = "UTF-8-BOM")
dim(train)
## [1] 59598 24
dim(test)
## [1] 14900 24
# Encode the response as binary: Left = 1, Stayed = 0
train$attrition_binary <- ifelse(train$Attrition == "Left", 1, 0)
test$attrition_binary <- ifelse(test$Attrition == "Left", 1, 0)
train$Attrition <- NULL
test$Attrition <- NULL
# Employee ID is only an identifier, so it is excluded from the model.
train$Employee.ID <- NULL
test$Employee.ID <- NULL
# Convert character columns to factors, aligning test levels with the
# training levels so predict() encodes categories consistently.
predictors <- setdiff(names(train), "attrition_binary")
categorical_cols <- predictors[sapply(train[predictors], function(x) is.character(x) || is.factor(x))]
for (col in categorical_cols) {
train[[col]] <- factor(train[[col]])
test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
}
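The class-distribution table below appears without the code that produced it; a small sketch of one way to compute it (the class_summary() name is illustrative, not from the source):
# Sketch (assumed, not shown in the source): per-dataset class counts
class_summary <- function(y) {
  c(Rows = length(y),
    Stayed = sum(y == 0),
    Left = sum(y == 1),
    Left_Rate = round(mean(y), 4))
}
rbind(Training = class_summary(train$attrition_binary),
      Test     = class_summary(test$attrition_binary))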
| Dataset | Rows | Stayed | Left | Left_Rate |
|---|---|---|---|---|
| Training | 59598 | 31260 | 28338 | 0.4755 |
| Test | 14900 | 7868 | 7032 | 0.4719 |
The training and test datasets have similar attrition rates, so the test dataset is suitable for evaluating the models.
| Model | Formula | Notes |
|---|---|---|
| Model 1 | attrition_binary ~ Monthly.Income | Income-only baseline model |
| Model 2 | attrition_binary ~ Monthly.Income + Overtime | Adds overtime indicator |
| Model 3 | attrition_binary ~ . | Uses all predictors except Employee ID |
The first model uses only monthly income as the predictor.
model1 <- glm(attrition_binary ~ Monthly.Income, data = train, family = binomial)
model1_results <- confusion_stats(model1, test)
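The coefficient tables in this report can be assembled from the fitted model object; a sketch of one way to do it (the coef_table() helper is an assumption, not code from the source):
# Sketch (assumed): build a coefficient table with odds ratios from a glm fit
coef_table <- function(model) {
  s <- coef(summary(model))
  data.frame(Term = rownames(s),
             Estimate = s[, "Estimate"],
             Odds_Ratio = exp(s[, "Estimate"]),
             Std_Error = s[, "Std. Error"],
             z_value = s[, "z value"],
             p_value = s[, "Pr(>|z|)"],
             row.names = NULL)
}
coef_table(model1)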
| Term | Estimate | Odds_Ratio | Std_Error | z_value | p_value |
|---|---|---|---|---|---|
| (Intercept) | -0.0208 | 0.9794 | 0.0290 | -0.7171 | 0.4733 |
| Monthly.Income | -1.059e-05 | 1.0000 | 3.813e-06 | -2.7773 | 0.0055 |
| Actual | Predicted Stayed | Predicted Left |
|---|---|---|
| Stayed | 7868 | 0 |
| Left | 7032 | 0 |
| Accuracy | Precision | Recall | F1 |
|---|---|---|---|
| 0.5281 | 0.0000 | 0.0000 | 0.0000 |
Model 1 predicts all test observations as Stayed: with an intercept near zero and a very small negative income coefficient, the predicted probability of leaving falls below the 0.5 cutoff at every observed income. Therefore, although monthly income is statistically significant, it is not useful by itself for identifying employees who leave.
The second model adds overtime as a categorical predictor.
model2 <- glm(attrition_binary ~ Monthly.Income + Overtime, data = train, family = binomial)
model2_results <- confusion_stats(model2, test)
| Term | Estimate | Odds_Ratio | Std_Error | z_value | p_value |
|---|---|---|---|---|---|
| (Intercept) | -0.1001 | 0.9048 | 0.0296 | -3.3753 | 0.0007 |
| Monthly.Income | -1.038e-05 | 1.0000 | 3.819e-06 | -2.7175 | 0.0066 |
| OvertimeYes | 0.2374 | 1.2680 | 0.0175 | 13.5661 | <0.0001 |
| Actual | Predicted Stayed | Predicted Left |
|---|---|---|
| Stayed | 5487 | 2381 |
| Left | 4530 | 2502 |
| Accuracy | Precision | Recall | F1 |
|---|---|---|---|
| 0.5362 | 0.5124 | 0.3558 | 0.4200 |
Adding overtime raises recall for the Left class from 0.0000 to 0.3558. However, most employees who actually left (4530 of 7032) are still predicted as Stayed.
The third model uses all available predictors except Employee ID.
# Build the full formula explicitly (equivalent to attrition_binary ~ .)
all_predictors <- setdiff(names(train), "attrition_binary")
model3_formula <- as.formula(paste("attrition_binary ~", paste(all_predictors, collapse = " + ")))
model3 <- glm(model3_formula, data = train, family = binomial)
model3_results <- confusion_stats(model3, test)
The full model contains many coefficients, so the table below shows the 10 largest coefficients by absolute value. The complete coefficient table is included in the appendix.
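One way to obtain the top-10 table, reusing the coef_table() sketch above (assumed, not from the source):
# Sketch (assumed): the 10 Model 3 coefficients largest in absolute value
full_coefs <- coef_table(model3)
head(full_coefs[order(-abs(full_coefs$Estimate)), ], 10)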
| Term | Estimate | Odds_Ratio | Std_Error | z_value | p_value |
|---|---|---|---|---|---|
| Job.LevelSenior | -2.6164 | 0.0731 | 0.0326 | -80.2206 | <0.0001 |
| Remote.WorkYes | -1.7754 | 0.1694 | 0.0292 | -60.8880 | <0.0001 |
| Marital.StatusSingle | 1.5733 | 4.8223 | 0.0322 | 48.8632 | <0.0001 |
| Education.LevelPhD | -1.5639 | 0.2093 | 0.0545 | -28.7140 | <0.0001 |
| Work.Life.BalancePoor | 1.5092 | 4.5232 | 0.0376 | 40.1674 | <0.0001 |
| Work.Life.BalanceFair | 1.3270 | 3.7698 | 0.0313 | 42.4104 | <0.0001 |
| Job.LevelMid | -1.0033 | 0.3667 | 0.0227 | -44.2707 | <0.0001 |
| Company.ReputationPoor | 0.7564 | 2.1307 | 0.0397 | 19.0427 | <0.0001 |
| GenderMale | -0.6262 | 0.5346 | 0.0208 | -30.1306 | <0.0001 |
| Performance.RatingLow | 0.5972 | 1.8171 | 0.0485 | 12.3209 | <0.0001 |
| Actual | Predicted Stayed | Predicted Left |
|---|---|---|
| Stayed | 6080 | 1788 |
| Left | 1855 | 5177 |
| Accuracy | Precision | Recall | F1 |
|---|---|---|---|
| 0.7555 | 0.7433 | 0.7362 | 0.7397 |
The full model performs much better than the first two models. This indicates that attrition is explained by multiple employee, job, and company-related factors, not by income alone.
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| MonthlyIncome | 0.5281 | 0.0000 | 0.0000 | 0.0000 |
| MonthlyIncome + Overtime | 0.5362 | 0.5124 | 0.3558 | 0.4200 |
| All variables | 0.7555 | 0.7433 | 0.7362 | 0.7397 |
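A sketch of how the comparison table can be assembled from the saved results, assuming the list layout of the confusion_stats() sketch above:
# Sketch (assumed): combine the per-model metrics into one table
metrics <- c("accuracy", "precision", "recall", "f1")
comparison <- rbind(
  "MonthlyIncome"            = unlist(model1_results[metrics]),
  "MonthlyIncome + Overtime" = unlist(model2_results[metrics]),
  "All variables"            = unlist(model3_results[metrics])
)
round(comparison, 4)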
Model 3 is the best-performing model, with the highest accuracy, precision, recall, and F1 score.
Compared with Model 1, the full model improves accuracy from
0.5281 to 0.7555. More importantly, recall for
the Left class increases from 0.0000 to
0.7362, meaning the full model is much better at
identifying employees who actually leave.
Three logistic regression models were estimated for employee
attrition classification. Model 1, using only monthly income, predicted
every observation as Stayed and failed to identify
employees who left. Model 2 improved performance by adding overtime, but
recall remained limited.
Model 3, which used all available predictors except Employee ID,
achieved the best performance on the test dataset. Its accuracy was
0.7555, precision was 0.7433, recall was
0.7362, and F1 score was 0.7397.
Overall, the results show that employee attrition is a multifactorial problem. A model using many employee and workplace characteristics performs much better than a model using compensation variables alone.
Appendix: the complete coefficient table for Model 3 referenced above.
| Term | Estimate | Odds_Ratio | Std_Error | z_value | p_value |
|---|---|---|---|---|---|
| (Intercept) | 0.2563 | 1.2921 | 0.0859 | 2.9850 | 0.0028 |
| Age | -0.0060 | 0.9940 | 0.0010 | -5.9743 | <0.0001 |
| GenderMale | -0.6262 | 0.5346 | 0.0208 | -30.1306 | <0.0001 |
| Years.at.Company | -0.0136 | 0.9865 | 0.0012 | -11.5878 | <0.0001 |
| Job.RoleFinance | -0.0922 | 0.9119 | 0.0484 | -1.9064 | 0.0566 |
| Job.RoleHealthcare | -0.0721 | 0.9304 | 0.0424 | -1.7020 | 0.0888 |
| Job.RoleMedia | -0.0987 | 0.9060 | 0.0362 | -2.7290 | 0.0064 |
| Job.RoleTechnology | -0.0848 | 0.9187 | 0.0485 | -1.7487 | 0.0803 |
| Monthly.Income | -6.718e-06 | 1.0000 | 8.246e-06 | -0.8147 | 0.4152 |
| Work.Life.BalanceFair | 1.3270 | 3.7698 | 0.0313 | 42.4104 | <0.0001 |
| Work.Life.BalanceGood | 0.2934 | 1.3409 | 0.0296 | 9.9162 | <0.0001 |
| Work.Life.BalancePoor | 1.5092 | 4.5232 | 0.0376 | 40.1674 | <0.0001 |
| Job.SatisfactionLow | 0.4880 | 1.6291 | 0.0357 | 13.6640 | <0.0001 |
| Job.SatisfactionMedium | 0.0113 | 1.0114 | 0.0272 | 0.4166 | 0.6770 |
| Job.SatisfactionVery High | 0.4988 | 1.6468 | 0.0270 | 18.4694 | <0.0001 |
| Performance.RatingBelow Average | 0.3339 | 1.3963 | 0.0295 | 11.3241 | <0.0001 |
| Performance.RatingHigh | 0.0051 | 1.0051 | 0.0265 | 0.1918 | 0.8479 |
| Performance.RatingLow | 0.5972 | 1.8171 | 0.0485 | 12.3209 | <0.0001 |
| Number.of.Promotions | -0.2492 | 0.7794 | 0.0104 | -23.8998 | <0.0001 |
| OvertimeYes | 0.3512 | 1.4207 | 0.0219 | 16.0430 | <0.0001 |
| Distance.from.Home | 0.0099 | 1.0100 | 0.0004 | 27.3651 | <0.0001 |
| Education.LevelBachelor’s Degree | 0.0458 | 1.0469 | 0.0276 | 1.6616 | 0.0966 |
| Education.LevelHigh School | 0.0304 | 1.0308 | 0.0307 | 0.9882 | 0.3230 |
| Education.LevelMaster’s Degree | 0.0309 | 1.0314 | 0.0305 | 1.0131 | 0.3110 |
| Education.LevelPhD | -1.5639 | 0.2093 | 0.0545 | -28.7140 | <0.0001 |
| Marital.StatusMarried | -0.2559 | 0.7742 | 0.0296 | -8.6544 | <0.0001 |
| Marital.StatusSingle | 1.5733 | 4.8223 | 0.0322 | 48.8632 | <0.0001 |
| Number.of.Dependents | -0.1573 | 0.8544 | 0.0067 | -23.6213 | <0.0001 |
| Job.LevelMid | -1.0033 | 0.3667 | 0.0227 | -44.2707 | <0.0001 |
| Job.LevelSenior | -2.6164 | 0.0731 | 0.0326 | -80.2206 | <0.0001 |
| Company.SizeMedium | 0.0061 | 1.0062 | 0.0271 | 0.2265 | 0.8208 |
| Company.SizeSmall | 0.2063 | 1.2292 | 0.0296 | 6.9749 | <0.0001 |
| Company.Tenure | -0.0002 | 0.9998 | 0.0005 | -0.4522 | 0.6511 |
| Remote.WorkYes | -1.7754 | 0.1694 | 0.0292 | -60.8880 | <0.0001 |
| Leadership.OpportunitiesYes | -0.1627 | 0.8498 | 0.0475 | -3.4236 | 0.0006 |
| Innovation.OpportunitiesYes | -0.1410 | 0.8685 | 0.0278 | -5.0642 | <0.0001 |
| Company.ReputationFair | 0.4698 | 1.5997 | 0.0396 | 11.8578 | <0.0001 |
| Company.ReputationGood | -0.0603 | 0.9415 | 0.0353 | -1.7051 | 0.0882 |
| Company.ReputationPoor | 0.7564 | 2.1307 | 0.0397 | 19.0427 | <0.0001 |
| Employee.RecognitionLow | 0.0396 | 1.0404 | 0.0262 | 1.5101 | 0.1310 |
| Employee.RecognitionMedium | 0.0435 | 1.0445 | 0.0277 | 1.5725 | 0.1158 |
| Employee.RecognitionVery High | -0.0830 | 0.9204 | 0.0502 | -1.6538 | 0.0982 |