1 Introduction

Employee attrition is a binary classification problem where the goal is to predict whether an employee will leave the company or stay. This report uses logistic regression to predict employee attrition on the Kaggle Employee Attrition Dataset.

Dataset: https://www.kaggle.com/datasets/stealthtechnologies/employee-attrition-dataset

Reference: https://bradleyboehmke.github.io/HOML/logistic-regression.html#assessing-model-accuracy-1

The assignment requires three logistic regression models:

Attrition ~ MonthlyIncome
Attrition ~ MonthlyIncome + Overtime
Attrition ~ . using all available variables

The models are trained on the training dataset and evaluated on the test dataset using confusion matrices.

The main purpose of this report is to compare how model performance changes when more predictors are added.

2 Methodology

Because the response variable has two outcomes, Left and Stayed, logistic regression is appropriate. The model estimates the probability that an employee belongs to the positive class, which is defined here as:

Class definition
Class	Value
Positive class	Left
Negative class	Stayed

A cutoff value of 0.5 is used:

Classification rule
Rule	Prediction
Predicted probability >= 0.5	Left
Predicted probability < 0.5	Stayed

Model accuracy is assessed using:

Confusion matrix
Accuracy
Precision for the Left class
Recall for the Left class
F1 score for the Left class

Evaluation metric definitions
Metric	Definition
Accuracy	(TP + TN) / (TP + TN + FP + FN)
Precision	TP / (TP + FP)
Recall	TP / (TP + FN)
F1 score	2 * Precision * Recall / (Precision + Recall)

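In this report, TP (true positive) means an employee who actually left and was predicted as Left. TN (true negative) is an employee who stayed and was predicted as Stayed, FP (false positive) is an employee who stayed but was predicted as Left, and FN (false negative) is an employee who left but was predicted as Stayed.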
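As a worked check of the definitions above, the short sketch below computes the four metrics from raw counts. The helper name `classification_metrics` and the rule of returning 0 when a denominator is zero are assumptions made for illustration; the counts are taken from the Model 3 confusion matrix reported in Section 7.

```r
# Compute accuracy, precision, recall and F1 from confusion-matrix counts.
# Assumption: a zero denominator yields 0 rather than NaN.
classification_metrics <- function(tp, tn, fp, fn) {
  safe_div <- function(num, den) if (den == 0) 0 else num / den
  precision <- safe_div(tp, tp + fp)
  recall <- safe_div(tp, tp + fn)
  c(
    Accuracy = safe_div(tp + tn, tp + tn + fp + fn),
    Precision = precision,
    Recall = recall,
    F1 = safe_div(2 * precision * recall, precision + recall)
  )
}

# Counts from the Model 3 confusion matrix (positive class = Left).
m3 <- classification_metrics(tp = 5177, tn = 6080, fp = 1788, fn = 1855)
round(m3, 4)
```

Rounded to four decimals, these values should agree with the Model 3 test metrics (0.7555, 0.7433, 0.7362, 0.7397).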

3 Data Preparation

train <- read.csv("data/train.csv", check.names = TRUE, fileEncoding = "UTF-8-BOM")
test <- read.csv("data/test.csv", check.names = TRUE, fileEncoding = "UTF-8-BOM")

dim(train)

## [1] 59598    24

dim(test)

## [1] 14900    24

train$attrition_binary <- ifelse(train$Attrition == "Left", 1, 0)
test$attrition_binary <- ifelse(test$Attrition == "Left", 1, 0)

train$Attrition <- NULL
test$Attrition <- NULL

# Employee ID is only an identifier, so it is excluded from the model.
train$Employee.ID <- NULL
test$Employee.ID <- NULL

predictors <- setdiff(names(train), "attrition_binary")
categorical_cols <- predictors[sapply(train[predictors], function(x) is.character(x) || is.factor(x))]

for (col in categorical_cols) {
  train[[col]] <- factor(train[[col]])
  test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
}
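The loop above forces each test factor to use the training levels. A minimal illustration of the consequence, using made-up values rather than the attrition data, is that any category appearing only in the test set becomes NA:

```r
# Toy vectors (hypothetical values, not taken from the dataset).
train_col <- c("No", "Yes", "No")
test_col <- c("Yes", "Maybe", "No")

train_f <- factor(train_col)
# Reuse the training levels so both factors are coded identically.
test_f <- factor(test_col, levels = levels(train_f))

levels(test_f)   # same levels as the training factor
is.na(test_f)    # TRUE only for the unseen value "Maybe"
```

If this matters for a given dataset, counting missing values per column in the test set after the conversion would reveal any unseen categories.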

Dataset summary
Dataset	Rows	Stayed	Left	Left_Rate
Training	59598	31260	28338	0.4755
Test	14900	7868	7032	0.4719

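The training and test datasets have similar attrition rates (0.4755 and 0.4719), so the class balance of the test dataset is representative of the training data and the test dataset is suitable for evaluating the models. Because the two classes are roughly balanced, accuracy is a meaningful baseline metric here.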
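The Left_Rate column is the number of Left cases divided by the number of rows. A small sketch of how such a summary could be produced is shown below; the helper `split_summary` is a hypothetical name, and the binary vectors are rebuilt from the reported counts rather than read from the data files.

```r
# Summarise one split from its 0/1 outcome vector (1 = Left, 0 = Stayed).
split_summary <- function(y, name) {
  data.frame(
    Dataset = name,
    Rows = length(y),
    Stayed = sum(y == 0),
    Left = sum(y == 1),
    Left_Rate = round(mean(y == 1), 4)
  )
}

# Outcome vectors reconstructed from the counts in the table above.
train_y <- rep(c(0, 1), times = c(31260, 28338))
test_y <- rep(c(0, 1), times = c(7868, 7032))

summary_tbl <- rbind(split_summary(train_y, "Training"), split_summary(test_y, "Test"))
summary_tbl
```

With the actual data, the same helper would be called on `train$attrition_binary` and `test$attrition_binary`.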

Model specifications
Model	Formula	Notes
Model 1	attrition_binary ~ Monthly.Income	Income-only baseline model
Model 2	attrition_binary ~ Monthly.Income + Overtime	Adds overtime indicator
Model 3	attrition_binary ~ .	Uses all predictors except Employee ID

4 Evaluation Functions
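
The sections that follow call `confusion_stats(model, test)`. A minimal sketch of such a helper is given below, assuming it applies the 0.5 cutoff and returns the confusion matrix together with the four metrics. The return structure, the `outcome` and `cutoff` arguments, and the zero-denominator handling are assumptions for illustration, not necessarily the definition used to produce the reported results. The demonstration at the end uses simulated data, not the attrition dataset.

```r
# Hypothetical sketch of an evaluation helper for a fitted binomial glm.
confusion_stats <- function(model, data, outcome = "attrition_binary", cutoff = 0.5) {
  # Predicted probability of the positive class (Left = 1).
  prob <- predict(model, newdata = data, type = "response")
  classes <- c("Stayed", "Left")
  predicted <- factor(ifelse(prob >= cutoff, "Left", "Stayed"), levels = classes)
  actual <- factor(ifelse(data[[outcome]] == 1, "Left", "Stayed"), levels = classes)

  # Rows are actual classes, columns are predicted classes.
  cm <- table(Actual = actual, Predicted = predicted)
  tp <- cm["Left", "Left"]
  tn <- cm["Stayed", "Stayed"]
  fp <- cm["Stayed", "Left"]
  fn <- cm["Left", "Stayed"]

  # Assumption: a zero denominator yields 0 rather than NaN.
  safe_div <- function(num, den) if (den == 0) 0 else num / den
  precision <- safe_div(tp, tp + fp)
  recall <- safe_div(tp, tp + fn)
  metrics <- c(
    Accuracy = safe_div(tp + tn, sum(cm)),
    Precision = precision,
    Recall = recall,
    F1 = safe_div(2 * precision * recall, precision + recall)
  )
  list(confusion = cm, metrics = metrics)
}

# Toy demonstration on simulated data.
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$attrition_binary <- rbinom(200, 1, plogis(1.5 * toy$x))
toy_model <- glm(attrition_binary ~ x, data = toy, family = binomial)
toy_results <- confusion_stats(toy_model, toy)
toy_results$confusion
round(toy_results$metrics, 4)
```

Fixing the factor levels to `c("Stayed", "Left")` keeps the matrix 2 x 2 even when a model predicts only one class, which is the situation that arises with Model 1.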

5 Model 1: Monthly Income

The first model uses only monthly income as the predictor.

model1 <- glm(attrition_binary ~ Monthly.Income, data = train, family = binomial)
model1_results <- confusion_stats(model1, test)

Model 1 coefficient estimates
Term	Estimate	Odds_Ratio	Std_Error	z_value	p_value
(Intercept)	-0.0208	0.9794	0.0290	-0.7171	0.4733
Monthly.Income	-1.059e-05	1.0000	3.813e-06	-2.7773	0.0055

Model 1 confusion matrix on test data
Actual	Predicted Stayed	Predicted Left
Stayed	7868	0
Left	7032	0

Model 1 test metrics
Accuracy	Precision	Recall	F1
0.5281	0.0000	0.0000	0.0000

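Model 1 predicts every test observation as Stayed, so its accuracy of 0.5281 simply equals the share of Stayed employees in the test set (7868 / 14900). Precision is reported as 0.0000 because no observation was predicted as Left; strictly, TP / (TP + FP) is undefined in that case. Therefore, although monthly income is statistically significant (p = 0.0055), it is not useful by itself for identifying employees who leave.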
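
The all-Stayed result follows directly from the fitted coefficients: both the intercept and the income slope are negative, so the linear predictor is negative for any non-negative income and the predicted probability of leaving never reaches 0.5. The sketch below illustrates this with the rounded coefficients from the table above; the income values are hypothetical, chosen only to span a plausible range.

```r
# Rounded coefficients copied from the Model 1 coefficient table.
b0 <- -0.0208
b1 <- -1.059e-05

# Predicted probability of leaving at several hypothetical income levels.
income <- c(0, 2000, 5000, 10000, 20000)
p_left <- plogis(b0 + b1 * income)
round(p_left, 4)

# The highest probability occurs at income = 0 and is already below 0.5.
max(p_left) < 0.5

# Odds ratio for a 1,000-unit increase in monthly income.
exp(b1 * 1000)
```

The last line gives an odds ratio of roughly 0.99 per 1,000 units of income, which is easier to interpret than the per-unit odds ratio that rounds to 1.0000 in the table.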

6 Model 2: Monthly Income and Overtime

The second model adds overtime as a categorical predictor.

model2 <- glm(attrition_binary ~ Monthly.Income + Overtime, data = train, family = binomial)
model2_results <- confusion_stats(model2, test)

Model 2 coefficient estimates
Term	Estimate	Odds_Ratio	Std_Error	z_value	p_value
(Intercept)	-0.1001	0.9048	0.0296	-3.3753	0.0007
Monthly.Income	-1.038e-05	1.0000	3.819e-06	-2.7175	0.0066
OvertimeYes	0.2374	1.2680	0.0175	13.5661	<0.0001

Model 2 confusion matrix on test data
Actual	Predicted Stayed	Predicted Left
Stayed	5487	2381
Left	4530	2502

Model 2 test metrics
Accuracy	Precision	Recall	F1
0.5362	0.5124	0.3558	0.4200

Adding overtime improves recall for the Left class from 0.0000 to 0.3558, but accuracy rises only slightly, from 0.5281 to 0.5362. The model still predicts 4530 of the 7032 employees who actually left as Stayed.
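
With only two predictors, the decision rule implied by Model 2 can be worked out by hand. The sketch below uses the rounded coefficients from the table above, so the resulting income threshold is approximate. It suggests that employees without overtime are always predicted as Stayed, while employees with overtime are predicted as Left only when their income is below the threshold.

```r
# Rounded coefficients copied from the Model 2 coefficient table.
b0 <- -0.1001
b_income <- -1.038e-05
b_overtime <- 0.2374

# Without overtime, the probability is highest at income = 0 and stays below 0.5.
p_no_overtime_max <- plogis(b0)
p_no_overtime_max

# With overtime, the linear predictor equals zero (probability = 0.5) at this income.
income_cutoff <- -(b0 + b_overtime) / b_income
income_cutoff
```

Under these rounded coefficients the threshold is about 13,227, so the predicted class depends mostly on the overtime indicator and only weakly on income.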

7 Model 3: All Variables

The third model uses all available predictors except Employee ID.

all_predictors <- setdiff(names(train), "attrition_binary")
model3_formula <- as.formula(paste("attrition_binary ~", paste(all_predictors, collapse = " + ")))
model3 <- glm(model3_formula, data = train, family = binomial)
model3_results <- confusion_stats(model3, test)

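The full model contains many coefficients, so the table below shows the 10 largest coefficients by absolute value. Because coefficient size depends on each predictor's scale, this ranking consists entirely of categorical indicators; continuous predictors such as Monthly.Income have small per-unit coefficients. The complete coefficient table is included in the appendix.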

Top 10 Model 3 coefficients by absolute coefficient size
Term	Estimate	Odds_Ratio	Std_Error	z_value	p_value
Job.LevelSenior	-2.6164	0.0731	0.0326	-80.2206	<0.0001
Remote.WorkYes	-1.7754	0.1694	0.0292	-60.8880	<0.0001
Marital.StatusSingle	1.5733	4.8223	0.0322	48.8632	<0.0001
Education.LevelPhD	-1.5639	0.2093	0.0545	-28.7140	<0.0001
Work.Life.BalancePoor	1.5092	4.5232	0.0376	40.1674	<0.0001
Work.Life.BalanceFair	1.3270	3.7698	0.0313	42.4104	<0.0001
Job.LevelMid	-1.0033	0.3667	0.0227	-44.2707	<0.0001
Company.ReputationPoor	0.7564	2.1307	0.0397	19.0427	<0.0001
GenderMale	-0.6262	0.5346	0.0208	-30.1306	<0.0001
Performance.RatingLow	0.5972	1.8171	0.0485	12.3209	<0.0001

Model 3 confusion matrix on test data
Actual	Predicted Stayed	Predicted Left
Stayed	6080	1788
Left	1855	5177

Model 3 test metrics
Accuracy	Precision	Recall	F1
0.7555	0.7433	0.7362	0.7397

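The full model performs much better than the first two models. This suggests that attrition is associated with multiple employee, job, and company-related factors, not with income alone. In fact, Monthly.Income is no longer statistically significant once the other predictors are included (p = 0.4152 in the appendix table).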

8 Model Comparison

Classification performance comparison
Model	Accuracy	Precision	Recall	F1
MonthlyIncome	0.5281	0.0000	0.0000	0.0000
MonthlyIncome + Overtime	0.5362	0.5124	0.3558	0.4200
All variables	0.7555	0.7433	0.7362	0.7397

The best performing model is Model 3. It has the highest accuracy, precision, recall, and F1 score.

Compared with Model 1, the full model improves accuracy from 0.5281 to 0.7555. More importantly, recall for the Left class increases from 0.0000 to 0.7362, meaning the full model is much better at identifying employees who actually leave.

9 Conclusion

Three logistic regression models were estimated for employee attrition classification. Model 1, using only monthly income, predicted every observation as Stayed and failed to identify employees who left. Model 2 improved performance by adding overtime, but recall remained limited.

Model 3, which used all available predictors except Employee ID, achieved the best performance on the test dataset. Its accuracy was 0.7555, precision was 0.7433, recall was 0.7362, and F1 score was 0.7397.

Overall, the results show that employee attrition is a multifactorial problem. A model using many employee and workplace characteristics performs much better than a model using monthly income alone or monthly income and overtime together.

10 Appendix: Complete Model 3 Coefficients

Complete Model 3 coefficient estimates
Term	Estimate	Odds_Ratio	Std_Error	z_value	p_value
(Intercept)	0.2563	1.2921	0.0859	2.9850	0.0028
Age	-0.0060	0.9940	0.0010	-5.9743	<0.0001
GenderMale	-0.6262	0.5346	0.0208	-30.1306	<0.0001
Years.at.Company	-0.0136	0.9865	0.0012	-11.5878	<0.0001
Job.RoleFinance	-0.0922	0.9119	0.0484	-1.9064	0.0566
Job.RoleHealthcare	-0.0721	0.9304	0.0424	-1.7020	0.0888
Job.RoleMedia	-0.0987	0.9060	0.0362	-2.7290	0.0064
Job.RoleTechnology	-0.0848	0.9187	0.0485	-1.7487	0.0803
Monthly.Income	-6.718e-06	1.0000	8.246e-06	-0.8147	0.4152
Work.Life.BalanceFair	1.3270	3.7698	0.0313	42.4104	<0.0001
Work.Life.BalanceGood	0.2934	1.3409	0.0296	9.9162	<0.0001
Work.Life.BalancePoor	1.5092	4.5232	0.0376	40.1674	<0.0001
Job.SatisfactionLow	0.4880	1.6291	0.0357	13.6640	<0.0001
Job.SatisfactionMedium	0.0113	1.0114	0.0272	0.4166	0.6770
Job.SatisfactionVery High	0.4988	1.6468	0.0270	18.4694	<0.0001
Performance.RatingBelow Average	0.3339	1.3963	0.0295	11.3241	<0.0001
Performance.RatingHigh	0.0051	1.0051	0.0265	0.1918	0.8479
Performance.RatingLow	0.5972	1.8171	0.0485	12.3209	<0.0001
Number.of.Promotions	-0.2492	0.7794	0.0104	-23.8998	<0.0001
OvertimeYes	0.3512	1.4207	0.0219	16.0430	<0.0001
Distance.from.Home	0.0099	1.0100	0.0004	27.3651	<0.0001
Education.LevelBachelor’s Degree	0.0458	1.0469	0.0276	1.6616	0.0966
Education.LevelHigh School	0.0304	1.0308	0.0307	0.9882	0.3230
Education.LevelMaster’s Degree	0.0309	1.0314	0.0305	1.0131	0.3110
Education.LevelPhD	-1.5639	0.2093	0.0545	-28.7140	<0.0001
Marital.StatusMarried	-0.2559	0.7742	0.0296	-8.6544	<0.0001
Marital.StatusSingle	1.5733	4.8223	0.0322	48.8632	<0.0001
Number.of.Dependents	-0.1573	0.8544	0.0067	-23.6213	<0.0001
Job.LevelMid	-1.0033	0.3667	0.0227	-44.2707	<0.0001
Job.LevelSenior	-2.6164	0.0731	0.0326	-80.2206	<0.0001
Company.SizeMedium	0.0061	1.0062	0.0271	0.2265	0.8208
Company.SizeSmall	0.2063	1.2292	0.0296	6.9749	<0.0001
Company.Tenure	-0.0002	0.9998	0.0005	-0.4522	0.6511
Remote.WorkYes	-1.7754	0.1694	0.0292	-60.8880	<0.0001
Leadership.OpportunitiesYes	-0.1627	0.8498	0.0475	-3.4236	0.0006
Innovation.OpportunitiesYes	-0.1410	0.8685	0.0278	-5.0642	<0.0001
Company.ReputationFair	0.4698	1.5997	0.0396	11.8578	<0.0001
Company.ReputationGood	-0.0603	0.9415	0.0353	-1.7051	0.0882
Company.ReputationPoor	0.7564	2.1307	0.0397	19.0427	<0.0001
Employee.RecognitionLow	0.0396	1.0404	0.0262	1.5101	0.1310
Employee.RecognitionMedium	0.0435	1.0445	0.0277	1.5725	0.1158
Employee.RecognitionVery High	-0.0830	0.9204	0.0502	-1.6538	0.0982

HW 7: Employee Attrition Classification

Enkhjin.N

2026-04-26