1 Introduction

Employee attrition is treated here as a binary classification problem: the goal is to predict whether an employee will leave the company or stay. This report fits logistic regression models to predict attrition on the Kaggle Employee Attrition Dataset.

Dataset: https://www.kaggle.com/datasets/stealthtechnologies/employee-attrition-dataset

Reference: https://bradleyboehmke.github.io/HOML/logistic-regression.html#assessing-model-accuracy-1

The assignment requires three logistic regression models:

  1. Attrition ~ MonthlyIncome
  2. Attrition ~ MonthlyIncome + Overtime
  3. Attrition ~ . using all available variables

The models are trained on the training dataset and evaluated on the test dataset using confusion matrices.

The main purpose of this report is to compare how model performance changes when more predictors are added.

2 Methodology

Because the response variable has two outcomes, Left and Stayed, logistic regression is appropriate. The model estimates the probability that an employee belongs to the positive class, which is defined here as:

Class definition
Class            Value
Positive class   Left
Negative class   Stayed

A cutoff value of 0.5 is used:

Classification rule
Rule                           Prediction
Predicted probability >= 0.5   Left
Predicted probability < 0.5    Stayed
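
The classification rule can be sketched in R as follows. This is a minimal illustration rather than the report's own code: the helper name classify_at_cutoff and the example probabilities are hypothetical.

```r
# Hypothetical helper: turn predicted probabilities into class labels.
# Probabilities at or above the cutoff become "Left", the rest "Stayed".
classify_at_cutoff <- function(prob, cutoff = 0.5) {
  factor(ifelse(prob >= cutoff, "Left", "Stayed"), levels = c("Stayed", "Left"))
}

# Made-up probabilities, chosen to include the boundary value 0.5.
example_prob <- c(0.12, 0.50, 0.49, 0.93)
example_pred <- classify_at_cutoff(example_prob)
print(example_pred)
```

A probability of exactly 0.5 is assigned to Left because the rule uses >= rather than >.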

Model performance is assessed using:

Evaluation metric definitions
Metric      Definition
Accuracy    (TP + TN) / (TP + TN + FP + FN)
Precision   TP / (TP + FP)
Recall      TP / (TP + FN)
F1 score    2 * Precision * Recall / (Precision + Recall)

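In this report, a true positive (TP) is an employee who actually left and was predicted as Left. A true negative (TN) is an employee who stayed and was predicted as Stayed, a false positive (FP) is an employee who stayed but was predicted as Left, and a false negative (FN) is an employee who left but was predicted as Stayed.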
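
As a worked example of the metric definitions, the sketch below computes all four metrics from confusion-matrix counts. The function name metrics_from_counts and the counts are hypothetical, and returning 0 when a denominator is zero is an assumption made for this illustration.

```r
# Hypothetical helper: compute the four metrics from confusion-matrix counts.
metrics_from_counts <- function(tp, tn, fp, fn) {
  accuracy <- (tp + tn) / (tp + tn + fp + fn)
  # Precision and recall are undefined (0/0) when their denominator is zero.
  # Assumption: report 0 in that case.
  precision <- if (tp + fp == 0) 0 else tp / (tp + fp)
  recall <- if (tp + fn == 0) 0 else tp / (tp + fn)
  f1 <- if (precision + recall == 0) 0 else 2 * precision * recall / (precision + recall)
  round(c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1), 4)
}

# Made-up counts for illustration only.
example_metrics <- metrics_from_counts(tp = 40, tn = 45, fp = 5, fn = 10)
print(example_metrics)

# Made-up case in which no observation is predicted positive.
no_positive_metrics <- metrics_from_counts(tp = 0, tn = 50, fp = 0, fn = 50)
print(no_positive_metrics)
```

The second call shows that a classifier can reach 50% accuracy on balanced data while having zero recall, which is why accuracy is not reported on its own.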

3 Data Preparation

train <- read.csv("data/train.csv", check.names = TRUE, fileEncoding = "UTF-8-BOM")
test <- read.csv("data/test.csv", check.names = TRUE, fileEncoding = "UTF-8-BOM")

dim(train)
## [1] 59598    24
dim(test)
## [1] 14900    24
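
# Recode the response as 0/1 so that "Left" is the positive class.
train$attrition_binary <- ifelse(train$Attrition == "Left", 1, 0)
test$attrition_binary <- ifelse(test$Attrition == "Left", 1, 0)

# Drop the original label so it cannot be used as a predictor in the full model.
train$Attrition <- NULL
test$Attrition <- NULL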

# Employee ID is only an identifier, so it is excluded from the model.
train$Employee.ID <- NULL
test$Employee.ID <- NULL

predictors <- setdiff(names(train), "attrition_binary")
categorical_cols <- predictors[sapply(train[predictors], function(x) is.character(x) || is.factor(x))]

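# Convert categorical columns to factors. The test set reuses the training levels
# so that both datasets are encoded identically.
for (col in categorical_cols) {
  train[[col]] <- factor(train[[col]])
  test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
}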
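
One consequence of reusing the training levels is worth illustrating: a test value that never appears in the training data becomes NA. The toy vectors below are made up and are not taken from the dataset.

```r
# Toy columns: "Maybe" occurs in the test values but not in the training values.
train_col <- factor(c("Yes", "No", "Yes"))
test_raw <- c("No", "Yes", "Maybe")

# Reuse the training levels, as in the loop above.
test_col <- factor(test_raw, levels = levels(train_col))

print(levels(test_col))      # same levels as the training column
print(sum(is.na(test_col)))  # number of test values with an unseen level
```

Checking for NA values after this step is a simple way to detect categories that the model never saw during training.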

Dataset summary
Dataset    Rows    Stayed   Left    Left_Rate
Training   59598   31260    28338   0.4755
Test       14900   7868     7032    0.4719

The training and test datasets have similar attrition rates (0.4755 and 0.4719), so the test dataset is a reasonable basis for evaluating the models.
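
The dataset summary can be reproduced from a 0/1 response vector with a few lines of R. In this sketch, summarise_split is a hypothetical helper, and the response vectors are rebuilt from the counts in the table as stand-ins for train$attrition_binary and test$attrition_binary.

```r
# Hypothetical helper: one summary row from a 0/1 response vector.
summarise_split <- function(name, y) {
  data.frame(
    Dataset = name,
    Rows = length(y),
    Stayed = sum(y == 0),
    Left = sum(y == 1),
    Left_Rate = round(mean(y == 1), 4)
  )
}

# Stand-in response vectors rebuilt from the reported counts.
train_y <- rep(c(0, 1), times = c(31260, 28338))
test_y <- rep(c(0, 1), times = c(7868, 7032))

split_summary <- rbind(summarise_split("Training", train_y),
                       summarise_split("Test", test_y))
print(split_summary)
```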

Model specifications
Model     Formula                                      Notes
Model 1   attrition_binary ~ Monthly.Income            Income-only baseline model
Model 2   attrition_binary ~ Monthly.Income + Overtime Adds overtime indicator
Model 3   attrition_binary ~ .                         Uses all predictors except Employee ID

4 Evaluation Functions
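
The code for this section is not shown in the report. The sketch below is one possible implementation of the confusion_stats helper that the later sections call as confusion_stats(model, test). Only the name and the two-argument call come from the report. The function body, the cutoff argument, the return structure, and the handling of zero denominators are assumptions, and the usage example runs on simulated data rather than the Kaggle dataset.

```r
# Sketch of an evaluation helper. The body is an assumption, not the report's code.
confusion_stats <- function(model, data, cutoff = 0.5) {
  # Predicted probability of the positive class (attrition_binary == 1).
  prob <- predict(model, newdata = data, type = "response")
  predicted <- factor(ifelse(prob >= cutoff, "Left", "Stayed"),
                      levels = c("Stayed", "Left"))
  actual <- factor(ifelse(data$attrition_binary == 1, "Left", "Stayed"),
                   levels = c("Stayed", "Left"))

  # Rows are actual classes, columns are predicted classes.
  cm <- table(Actual = actual, Predicted = predicted)
  tp <- cm["Left", "Left"]
  tn <- cm["Stayed", "Stayed"]
  fp <- cm["Stayed", "Left"]
  fn <- cm["Left", "Stayed"]

  # Assumption: ratios with a zero denominator are reported as 0.
  safe_div <- function(num, den) if (den == 0) 0 else num / den
  precision <- safe_div(tp, tp + fp)
  recall <- safe_div(tp, tp + fn)
  metrics <- c(
    Accuracy = safe_div(tp + tn, sum(cm)),
    Precision = precision,
    Recall = recall,
    F1 = safe_div(2 * precision * recall, precision + recall)
  )
  list(confusion_matrix = cm, metrics = round(metrics, 4))
}

# Toy usage on simulated data (not the Kaggle dataset).
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$attrition_binary <- rbinom(200, 1, plogis(1.5 * toy$x))
toy_model <- glm(attrition_binary ~ x, data = toy, family = binomial)
toy_results <- confusion_stats(toy_model, toy)
print(toy_results$confusion_matrix)
print(toy_results$metrics)
```

Fixing the factor levels to Stayed and Left keeps the confusion matrix at 2 x 2 even when a model predicts only one class.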

5 Model 1: Monthly Income

The first model uses only monthly income as the predictor.

model1 <- glm(attrition_binary ~ Monthly.Income, data = train, family = binomial)
model1_results <- confusion_stats(model1, test)

Model 1 coefficient estimates
Term            Estimate    Odds_Ratio  Std_Error  z_value   p_value
(Intercept)     -0.0208     0.9794      0.0290     -0.7171   0.4733
Monthly.Income  -1.059e-05  1.0000      3.813e-06  -2.7773   0.0055

Model 1 confusion matrix on test data (rows: actual class, columns: predicted class)
Actual   Predicted Stayed   Predicted Left
Stayed   7868               0
Left     7032               0

Model 1 test metrics
Accuracy  Precision  Recall  F1
0.5281    0.0000     0.0000  0.0000

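Model 1 predicts every test observation as Stayed, so its accuracy of 0.5281 is simply the share of Stayed employees in the test set (7868 of 14900). Precision is undefined when no observation is predicted as Left and is reported here as 0. The cause is that both the intercept and the income coefficient are negative, which keeps the predicted probability of leaving below the 0.5 cutoff for every non-negative income. Therefore, although monthly income is statistically significant (p = 0.0055), it is not useful by itself for identifying employees who leave.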
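
The following sketch illustrates why a model with these coefficients cannot predict Left at a 0.5 cutoff. The coefficients are copied from the Model 1 table, while the income values are hypothetical and chosen only to show the shape of the curve.

```r
# Coefficients copied from the Model 1 coefficient table.
intercept <- -0.0208
slope <- -1.059e-05

# Hypothetical income values for illustration.
income <- c(0, 2000, 5000, 10000, 20000)

# plogis() is the logistic function: it maps the linear predictor to a probability.
prob_left <- plogis(intercept + slope * income)
print(round(prob_left, 4))

# Because the slope is negative, the largest probability occurs at zero income.
max_prob <- plogis(intercept)
print(round(max_prob, 4))
```

Since the maximum is about 0.4948, no employee can reach the 0.5 cutoff, whatever the income.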

6 Model 2: Monthly Income and Overtime

The second model adds overtime as a categorical predictor.

model2 <- glm(attrition_binary ~ Monthly.Income + Overtime, data = train, family = binomial)
model2_results <- confusion_stats(model2, test)

Model 2 coefficient estimates
Term            Estimate    Odds_Ratio  Std_Error  z_value   p_value
(Intercept)     -0.1001     0.9048      0.0296     -3.3753   0.0007
Monthly.Income  -1.038e-05  1.0000      3.819e-06  -2.7175   0.0066
OvertimeYes     0.2374      1.2680      0.0175     13.5661   <0.0001

Model 2 confusion matrix on test data (rows: actual class, columns: predicted class)
Actual   Predicted Stayed   Predicted Left
Stayed   5487               2381
Left     4530               2502

Model 2 test metrics
Accuracy  Precision  Recall  F1
0.5362    0.5124     0.3558  0.4200

Adding overtime raises recall for the Left class from 0.0000 to 0.3558, but accuracy improves only slightly, from 0.5281 to 0.5362. The model still predicts Stayed for 4530 of the 7032 employees who actually left.
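
The fitted coefficients imply a simple decision rule, which the sketch below works out. The coefficients are copied from the Model 2 table, the function name prob_left is hypothetical, and the code assumes that OvertimeYes is a 0/1 indicator with No as the reference level.

```r
# Coefficients copied from the Model 2 coefficient table.
b0 <- -0.1001
b_income <- -1.038e-05
b_overtime <- 0.2374

# Hypothetical helper: predicted probability of leaving.
prob_left <- function(income, overtime_yes) {
  plogis(b0 + b_income * income + b_overtime * overtime_yes)
}

# Without overtime the linear predictor is negative even at zero income,
# so the probability stays below 0.5 and the prediction is always Stayed.
print(round(prob_left(0, overtime_yes = 0), 4))

# With overtime the probability equals 0.5 where the linear predictor is zero.
income_threshold <- (b0 + b_overtime) / (-b_income)
print(round(income_threshold))
```

Under these coefficients, the model predicts Left only for employees who work overtime and whose monthly income is below the threshold of roughly 13,227, which is consistent with its limited recall.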

7 Model 3: All Variables

The third model uses all available predictors except Employee ID.

all_predictors <- setdiff(names(train), "attrition_binary")
model3_formula <- as.formula(paste("attrition_binary ~", paste(all_predictors, collapse = " + ")))
model3 <- glm(model3_formula, data = train, family = binomial)
model3_results <- confusion_stats(model3, test)

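The full model contains many coefficients, so the table below shows the 10 largest coefficients by absolute value. Because numeric predictors are measured on different scales, this ranking favors categorical indicators and should not be read as a measure of variable importance. The complete coefficient table is included in the appendix.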

Top 10 Model 3 coefficients by absolute value
Term                    Estimate    Odds_Ratio  Std_Error  z_value   p_value
Job.LevelSenior         -2.6164     0.0731      0.0326     -80.2206  <0.0001
Remote.WorkYes          -1.7754     0.1694      0.0292     -60.8880  <0.0001
Marital.StatusSingle    1.5733      4.8223      0.0322     48.8632   <0.0001
Education.LevelPhD      -1.5639     0.2093      0.0545     -28.7140  <0.0001
Work.Life.BalancePoor   1.5092      4.5232      0.0376     40.1674   <0.0001
Work.Life.BalanceFair   1.3270      3.7698      0.0313     42.4104   <0.0001
Job.LevelMid            -1.0033     0.3667      0.0227     -44.2707  <0.0001
Company.ReputationPoor  0.7564      2.1307      0.0397     19.0427   <0.0001
GenderMale              -0.6262     0.5346      0.0208     -30.1306  <0.0001
Performance.RatingLow   0.5972      1.8171      0.0485     12.3209   <0.0001

Model 3 confusion matrix on test data (rows: actual class, columns: predicted class)
Actual   Predicted Stayed   Predicted Left
Stayed   6080               1788
Left     1855               5177

Model 3 test metrics
Accuracy  Precision  Recall  F1
0.7555    0.7433     0.7362  0.7397

The full model performs much better than the first two models on every metric, with an accuracy of 0.7555 compared with 0.5362 for Model 2. This suggests that attrition depends on multiple employee, job, and company-related factors, not on income alone.
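
To read the coefficient table, it helps to note that the Odds_Ratio column is the exponential of the Estimate column. The sketch below checks this for three rows copied from the Model 3 table. Small differences in the last digit are expected because the estimates are rounded to four decimals.

```r
# Estimates copied from the Model 3 coefficient table.
estimates <- c(
  Job.LevelSenior = -2.6164,
  Remote.WorkYes = -1.7754,
  Marital.StatusSingle = 1.5733
)

# An odds ratio is the exponentiated logistic regression coefficient.
odds_ratios <- exp(estimates)
print(round(odds_ratios, 4))
```

An odds ratio below 1 means lower odds of leaving relative to the reference level with the other predictors held fixed, and an odds ratio above 1 means higher odds.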

8 Model Comparison

Classification performance comparison
Model                               Accuracy  Precision  Recall  F1
Model 1: Monthly.Income             0.5281    0.0000     0.0000  0.0000
Model 2: Monthly.Income + Overtime  0.5362    0.5124     0.3558  0.4200
Model 3: All variables              0.7555    0.7433     0.7362  0.7397

The best-performing model is Model 3, which has the highest accuracy, precision, recall, and F1 score.

Compared with Model 1, the full model improves accuracy from 0.5281 to 0.7555. More importantly, recall for the Left class increases from 0.0000 to 0.7362, meaning the full model is much better at identifying employees who actually leave.
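
A useful reference point for these accuracies is the majority-class baseline, meaning the accuracy obtained by predicting Stayed for everyone. The sketch below computes it from the test-set class counts in Section 3 and compares it with the accuracies in the comparison table. The variable names are hypothetical.

```r
# Test-set class counts from the dataset summary in Section 3.
stayed <- 7868
left <- 7032
baseline_accuracy <- round(max(stayed, left) / (stayed + left), 4)
print(baseline_accuracy)

# Accuracies copied from the comparison table.
accuracy <- c(Model1 = 0.5281, Model2 = 0.5362, Model3 = 0.7555)
gain_over_baseline <- round(accuracy - baseline_accuracy, 4)
print(gain_over_baseline)
```

Model 1 matches the baseline exactly and Model 2 adds less than one percentage point, so only Model 3 gives a substantial gain over always predicting Stayed.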

9 Conclusion

Three logistic regression models were estimated for employee attrition classification. Model 1, using only monthly income, predicted every observation as Stayed and failed to identify employees who left. Model 2 improved performance by adding overtime, but recall remained limited.

Model 3, which used all available predictors except Employee ID, achieved the best performance on the test dataset. Its accuracy was 0.7555, precision was 0.7433, recall was 0.7362, and F1 score was 0.7397.

Overall, the results show that employee attrition is a multifactorial problem. A model using many employee and workplace characteristics performs much better than models using only monthly income and overtime.

10 Appendix: Complete Model 3 Coefficients

Complete Model 3 coefficient estimates
Term                              Estimate    Odds_Ratio  Std_Error  z_value   p_value
(Intercept)                       0.2563      1.2921      0.0859     2.9850    0.0028
Age                               -0.0060     0.9940      0.0010     -5.9743   <0.0001
GenderMale                        -0.6262     0.5346      0.0208     -30.1306  <0.0001
Years.at.Company                  -0.0136     0.9865      0.0012     -11.5878  <0.0001
Job.RoleFinance                   -0.0922     0.9119      0.0484     -1.9064   0.0566
Job.RoleHealthcare                -0.0721     0.9304      0.0424     -1.7020   0.0888
Job.RoleMedia                     -0.0987     0.9060      0.0362     -2.7290   0.0064
Job.RoleTechnology                -0.0848     0.9187      0.0485     -1.7487   0.0803
Monthly.Income                    -6.718e-06  1.0000      8.246e-06  -0.8147   0.4152
Work.Life.BalanceFair             1.3270      3.7698      0.0313     42.4104   <0.0001
Work.Life.BalanceGood             0.2934      1.3409      0.0296     9.9162    <0.0001
Work.Life.BalancePoor             1.5092      4.5232      0.0376     40.1674   <0.0001
Job.SatisfactionLow               0.4880      1.6291      0.0357     13.6640   <0.0001
Job.SatisfactionMedium            0.0113      1.0114      0.0272     0.4166    0.6770
Job.SatisfactionVery High         0.4988      1.6468      0.0270     18.4694   <0.0001
Performance.RatingBelow Average   0.3339      1.3963      0.0295     11.3241   <0.0001
Performance.RatingHigh            0.0051      1.0051      0.0265     0.1918    0.8479
Performance.RatingLow             0.5972      1.8171      0.0485     12.3209   <0.0001
Number.of.Promotions              -0.2492     0.7794      0.0104     -23.8998  <0.0001
OvertimeYes                       0.3512      1.4207      0.0219     16.0430   <0.0001
Distance.from.Home                0.0099      1.0100      0.0004     27.3651   <0.0001
Education.LevelBachelor’s Degree  0.0458      1.0469      0.0276     1.6616    0.0966
Education.LevelHigh School        0.0304      1.0308      0.0307     0.9882    0.3230
Education.LevelMaster’s Degree    0.0309      1.0314      0.0305     1.0131    0.3110
Education.LevelPhD                -1.5639     0.2093      0.0545     -28.7140  <0.0001
Marital.StatusMarried             -0.2559     0.7742      0.0296     -8.6544   <0.0001
Marital.StatusSingle              1.5733      4.8223      0.0322     48.8632   <0.0001
Number.of.Dependents              -0.1573     0.8544      0.0067     -23.6213  <0.0001
Job.LevelMid                      -1.0033     0.3667      0.0227     -44.2707  <0.0001
Job.LevelSenior                   -2.6164     0.0731      0.0326     -80.2206  <0.0001
Company.SizeMedium                0.0061      1.0062      0.0271     0.2265    0.8208
Company.SizeSmall                 0.2063      1.2292      0.0296     6.9749    <0.0001
Company.Tenure                    -0.0002     0.9998      0.0005     -0.4522   0.6511
Remote.WorkYes                    -1.7754     0.1694      0.0292     -60.8880  <0.0001
Leadership.OpportunitiesYes       -0.1627     0.8498      0.0475     -3.4236   0.0006
Innovation.OpportunitiesYes       -0.1410     0.8685      0.0278     -5.0642   <0.0001
Company.ReputationFair            0.4698      1.5997      0.0396     11.8578   <0.0001
Company.ReputationGood            -0.0603     0.9415      0.0353     -1.7051   0.0882
Company.ReputationPoor            0.7564      2.1307      0.0397     19.0427   <0.0001
Employee.RecognitionLow           0.0396      1.0404      0.0262     1.5101    0.1310
Employee.RecognitionMedium        0.0435      1.0445      0.0277     1.5725    0.1158
Employee.RecognitionVery High     -0.0830     0.9204      0.0502     -1.6538   0.0982