Maulana Ahmad Fahrezi
Student ID: 114035115
This analysis aims to develop and evaluate logistic regression models to predict employee attrition using the Employee Attrition dataset. Three models are constructed with increasing complexity, and their performance is assessed using a confusion matrix on the test dataset.
First, load the necessary libraries for data manipulation and modeling.
# Load libraries
library(tidyverse)
library(caret)
# Load training and testing data
train <- read.csv("train.csv")
test <- read.csv("test.csv")
# Inspect structure
str(train)
## 'data.frame': 59598 obs. of 24 variables:
## $ Employee.ID : int 8410 64756 30257 65791 65026 24368 64970 36999 32714 15944 ...
## $ Age : int 31 59 24 36 56 38 47 48 57 24 ...
## $ Gender : chr "Male" "Female" "Female" "Female" ...
## $ Years.at.Company : int 19 4 10 7 41 3 23 16 44 1 ...
## $ Job.Role : chr "Education" "Media" "Healthcare" "Education" ...
## $ Monthly.Income : int 5390 5534 8159 3989 4821 9977 3681 11223 3773 7319 ...
## $ Work.Life.Balance : chr "Excellent" "Poor" "Good" "Good" ...
## $ Job.Satisfaction : chr "Medium" "High" "High" "High" ...
## $ Performance.Rating : chr "Average" "Low" "Low" "High" ...
## $ Number.of.Promotions : int 2 3 0 1 0 3 1 2 1 1 ...
## $ Overtime : chr "No" "No" "No" "No" ...
## $ Distance.from.Home : int 22 21 11 27 71 37 75 5 39 57 ...
## $ Education.Level : chr "Associate Degree" "Master’s Degree" "Bachelor’s Degree" "High School" ...
## $ Marital.Status : chr "Married" "Divorced" "Married" "Single" ...
## $ Number.of.Dependents : int 0 3 3 2 0 0 3 4 4 4 ...
## $ Job.Level : chr "Mid" "Mid" "Mid" "Mid" ...
## $ Company.Size : chr "Medium" "Medium" "Medium" "Small" ...
## $ Company.Tenure : int 89 21 74 50 68 47 93 88 75 45 ...
## $ Remote.Work : chr "No" "No" "No" "Yes" ...
## $ Leadership.Opportunities: chr "No" "No" "No" "No" ...
## $ Innovation.Opportunities: chr "No" "No" "No" "No" ...
## $ Company.Reputation : chr "Excellent" "Fair" "Poor" "Good" ...
## $ Employee.Recognition : chr "Medium" "Low" "Low" "Medium" ...
## $ Attrition : chr "Stayed" "Stayed" "Stayed" "Stayed" ...
str(test)
## 'data.frame': 14900 obs. of 24 variables:
## $ Employee.ID : int 52685 30585 54656 33442 15667 3496 46775 72645 4941 65181 ...
## $ Age : int 36 35 50 58 39 45 22 34 48 55 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Years.at.Company : int 13 7 7 44 24 30 5 15 40 16 ...
## $ Job.Role : chr "Healthcare" "Education" "Education" "Media" ...
## $ Monthly.Income : int 8029 4563 5583 5525 4604 8104 8700 11025 11452 5939 ...
## $ Work.Life.Balance : chr "Excellent" "Good" "Fair" "Fair" ...
## $ Job.Satisfaction : chr "High" "High" "High" "Very High" ...
## $ Performance.Rating : chr "Average" "Average" "Average" "High" ...
## $ Number.of.Promotions : int 1 1 3 0 0 0 0 1 0 0 ...
## $ Overtime : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Distance.from.Home : int 83 55 14 43 47 38 2 9 65 31 ...
## $ Education.Level : chr "Master’s Degree" "Associate Degree" "Associate Degree" "Master’s Degree" ...
## $ Marital.Status : chr "Married" "Single" "Divorced" "Single" ...
## $ Number.of.Dependents : int 1 4 2 4 6 0 0 4 1 1 ...
## $ Job.Level : chr "Mid" "Entry" "Senior" "Entry" ...
## $ Company.Size : chr "Large" "Medium" "Medium" "Medium" ...
## $ Company.Tenure : int 22 27 76 96 45 75 48 16 52 46 ...
## $ Remote.Work : chr "No" "No" "No" "No" ...
## $ Leadership.Opportunities: chr "No" "No" "No" "No" ...
## $ Innovation.Opportunities: chr "No" "No" "Yes" "No" ...
## $ Company.Reputation : chr "Poor" "Good" "Good" "Poor" ...
## $ Employee.Recognition : chr "Medium" "High" "Low" "Low" ...
## $ Attrition : chr "Stayed" "Left" "Stayed" "Left" ...
# Check missing values
colSums(is.na(train))
## Employee.ID Age Gender
## 0 0 0
## Years.at.Company Job.Role Monthly.Income
## 0 0 0
## Work.Life.Balance Job.Satisfaction Performance.Rating
## 0 0 0
## Number.of.Promotions Overtime Distance.from.Home
## 0 0 0
## Education.Level Marital.Status Number.of.Dependents
## 0 0 0
## Job.Level Company.Size Company.Tenure
## 0 0 0
## Remote.Work Leadership.Opportunities Innovation.Opportunities
## 0 0 0
## Company.Reputation Employee.Recognition Attrition
## 0 0 0
colSums(is.na(test))
## Employee.ID Age Gender
## 0 0 0
## Years.at.Company Job.Role Monthly.Income
## 0 0 0
## Work.Life.Balance Job.Satisfaction Performance.Rating
## 0 0 0
## Number.of.Promotions Overtime Distance.from.Home
## 0 0 0
## Education.Level Marital.Status Number.of.Dependents
## 0 0 0
## Job.Level Company.Size Company.Tenure
## 0 0 0
## Remote.Work Leadership.Opportunities Innovation.Opportunities
## 0 0 0
## Company.Reputation Employee.Recognition Attrition
## 0 0 0
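These checks confirm that the training set (59,598 rows) and the test set (14,900 rows) share the same 24 variables and that neither contains missing values.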
# Convert Attrition to factor
train$Attrition <- as.factor(train$Attrition)
test$Attrition <- as.factor(test$Attrition)
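With family = "binomial", glm() expects a factor (or 0/1) response, so the character column is converted first. For a factor response, glm() treats the first level as failure and the other level as success. The levels here are ordered alphabetically (“Left”, “Stayed”), so every model below estimates the probability of “Stayed”, and a positive coefficient means a higher chance of staying.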
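The effect of factor level order can be checked on a small example. The sketch below uses simulated data (the variables x, y, and y_left are hypothetical and not part of the attrition dataset) to show that glm() models the second level, and that relevel() switches the modeled outcome.

```r
# Toy data (hypothetical): "Stayed" becomes more likely as x grows.
set.seed(1)
x <- rnorm(200)
y <- factor(ifelse(x + rnorm(200) > 0, "Stayed", "Left"))

levels(y)  # alphabetical order, so "Left" is the first (reference) level

fit <- glm(y ~ x, family = "binomial")

# A positive slope means larger x raises the probability of the SECOND
# level ("Stayed"), not of the first level ("Left").
coef(fit)["x"] > 0

# To model the probability of leaving directly, make "Stayed" the reference.
y_left   <- relevel(y, ref = "Stayed")
fit_left <- glm(y_left ~ x, family = "binomial")

# The two fits are mirror images: every coefficient flips sign.
all.equal(unname(coef(fit)), -unname(coef(fit_left)), tolerance = 1e-6)
```

If the same idea were applied to this report, releveling Attrition with “Stayed” as the reference would make the fitted probabilities refer to “Left”.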
model1 <- glm(Attrition ~ Monthly.Income,
data = train,
family = "binomial")
summary(model1)
##
## Call:
## glm(formula = Attrition ~ Monthly.Income, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.081e-02 2.902e-02 0.717 0.47334
## Monthly.Income 1.059e-05 3.813e-06 2.777 0.00548 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 82477 on 59597 degrees of freedom
## Residual deviance: 82469 on 59596 degrees of freedom
## AIC: 82473
##
## Number of Fisher Scoring iterations: 3
This model evaluates whether monthly income alone can predict employee attrition.
model2 <- glm(Attrition ~ Monthly.Income + Overtime,
data = train,
family = "binomial")
summary(model2)
##
## Call:
## glm(formula = Attrition ~ Monthly.Income + Overtime, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.001e-01 2.965e-02 3.375 0.000737 ***
## Monthly.Income 1.038e-05 3.819e-06 2.717 0.006578 **
## OvertimeYes -2.374e-01 1.750e-02 -13.566 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 82477 on 59597 degrees of freedom
## Residual deviance: 82285 on 59595 degrees of freedom
## AIC: 82291
##
## Number of Fisher Scoring iterations: 3
This model adds overtime as an additional predictor, allowing us to assess whether workload contributes to attrition.
model3 <- glm(Attrition ~ .,
data = train,
family = "binomial")
summary(model3)
##
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.624e-01 8.764e-02 -2.994 0.00275 **
## Employee.ID 1.653e-07 4.761e-07 0.347 0.72843
## Age 5.998e-03 1.004e-03 5.974 2.31e-09 ***
## GenderMale 6.262e-01 2.078e-02 30.131 < 2e-16 ***
## Years.at.Company 1.360e-02 1.174e-03 11.588 < 2e-16 ***
## Job.RoleFinance 9.212e-02 4.838e-02 1.904 0.05690 .
## Job.RoleHealthcare 7.204e-02 4.239e-02 1.700 0.08919 .
## Job.RoleMedia 9.854e-02 3.616e-02 2.725 0.00642 **
## Job.RoleTechnology 8.459e-02 4.847e-02 1.745 0.08096 .
## Monthly.Income 6.733e-06 8.246e-06 0.817 0.41417
## Work.Life.BalanceFair -1.327e+00 3.129e-02 -42.409 < 2e-16 ***
## Work.Life.BalanceGood -2.934e-01 2.958e-02 -9.916 < 2e-16 ***
## Work.Life.BalancePoor -1.509e+00 3.757e-02 -40.165 < 2e-16 ***
## Job.SatisfactionLow -4.880e-01 3.572e-02 -13.665 < 2e-16 ***
## Job.SatisfactionMedium -1.135e-02 2.721e-02 -0.417 0.67664
## Job.SatisfactionVery High -4.989e-01 2.701e-02 -18.471 < 2e-16 ***
## Performance.RatingBelow Average -3.338e-01 2.948e-02 -11.322 < 2e-16 ***
## Performance.RatingHigh -5.054e-03 2.647e-02 -0.191 0.84860
## Performance.RatingLow -5.972e-01 4.847e-02 -12.321 < 2e-16 ***
## Number.of.Promotions 2.492e-01 1.043e-02 23.898 < 2e-16 ***
## OvertimeYes -3.512e-01 2.189e-02 -16.043 < 2e-16 ***
## Distance.from.Home -9.951e-03 3.636e-04 -27.367 < 2e-16 ***
## Education.LevelBachelor’s Degree -4.580e-02 2.757e-02 -1.661 0.09669 .
## Education.LevelHigh School -3.037e-02 3.072e-02 -0.989 0.32272
## Education.LevelMaster’s Degree -3.095e-02 3.048e-02 -1.015 0.30991
## Education.LevelPhD 1.564e+00 5.446e-02 28.716 < 2e-16 ***
## Marital.StatusMarried 2.559e-01 2.957e-02 8.656 < 2e-16 ***
## Marital.StatusSingle -1.573e+00 3.220e-02 -48.864 < 2e-16 ***
## Number.of.Dependents 1.573e-01 6.660e-03 23.622 < 2e-16 ***
## Job.LevelMid 1.003e+00 2.266e-02 44.269 < 2e-16 ***
## Job.LevelSenior 2.616e+00 3.261e-02 80.221 < 2e-16 ***
## Company.SizeMedium -6.168e-03 2.713e-02 -0.227 0.82015
## Company.SizeSmall -2.063e-01 2.958e-02 -6.975 3.07e-12 ***
## Company.Tenure 2.034e-04 4.501e-04 0.452 0.65134
## Remote.WorkYes 1.775e+00 2.916e-02 60.887 < 2e-16 ***
## Leadership.OpportunitiesYes 1.627e-01 4.752e-02 3.423 0.00062 ***
## Innovation.OpportunitiesYes 1.410e-01 2.784e-02 5.063 4.12e-07 ***
## Company.ReputationFair -4.699e-01 3.962e-02 -11.859 < 2e-16 ***
## Company.ReputationGood 6.023e-02 3.535e-02 1.704 0.08836 .
## Company.ReputationPoor -7.565e-01 3.972e-02 -19.043 < 2e-16 ***
## Employee.RecognitionLow -3.956e-02 2.620e-02 -1.510 0.13099
## Employee.RecognitionMedium -4.354e-02 2.769e-02 -1.573 0.11578
## Employee.RecognitionVery High 8.304e-02 5.019e-02 1.655 0.09802 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 82477 on 59597 degrees of freedom
## Residual deviance: 57833 on 59555 degrees of freedom
## AIC: 57919
##
## Number of Fisher Scoring iterations: 5
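This model includes all available predictors, providing the most comprehensive analysis. Note that the formula Attrition ~ . also picks up Employee.ID, which is an identifier rather than a genuine predictor; its coefficient is not significant (p = 0.73).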
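The three fits can also be compared directly from the deviances printed in the summaries. The sketch below is a rough check rather than part of the original analysis: it copies the rounded deviances and degrees of freedom from the output above, computes McFadden's pseudo R-squared, and runs likelihood-ratio tests between consecutive nested models. The variable names are illustrative.

```r
# Deviances and residual degrees of freedom copied from the summaries above
# (rounded as printed, so the results are approximate).
null_dev  <- 82477
resid_dev <- c(model1 = 82469, model2 = 82285, model3 = 57833)
resid_df  <- c(model1 = 59596, model2 = 59595, model3 = 59555)

# McFadden's pseudo R-squared: share of the null deviance removed by each model.
mcfadden <- 1 - resid_dev / null_dev
round(mcfadden, 4)

# Likelihood-ratio tests between consecutive nested models: the drop in
# deviance is compared with a chi-squared distribution.
dev_drop <- -diff(resid_dev)  # model1 -> model2, then model2 -> model3
df_used  <- -diff(resid_df)   # number of coefficients added at each step
p_value  <- pchisq(dev_drop, df = df_used, lower.tail = FALSE)

data.frame(dev_drop, df_used, p_value)
```

With the fitted objects available, anova(model1, model2, model3, test = "Chisq") would give the same comparison from unrounded deviances.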
# Predict probabilities
pred1 <- predict(model1, test, type = "response")
pred2 <- predict(model2, test, type = "response")
pred3 <- predict(model3, test, type = "response")
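The output is a probability between 0 and 1. Because glm() treats the first factor level (“Left”) as the reference, this is the estimated probability that an employee stayed, not the probability of attrition.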
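One way to keep the label mapping tied to the factor levels, rather than hard-coding it, is sketched below. The helper classify() is hypothetical and not part of caret or base R; it assumes the probabilities come from predict(..., type = "response") on a binomial glm with a two-level factor response.

```r
# Hypothetical helper: turn P(second level) into a factor of class labels.
# `prob` is assumed to be the probability of the SECOND level of `reference`.
classify <- function(prob, reference, cutoff = 0.5) {
  lv <- levels(reference)
  stopifnot(length(lv) == 2)
  # Above the cutoff -> second level; otherwise -> first level.
  factor(ifelse(prob > cutoff, lv[2], lv[1]), levels = lv)
}

# Toy check with made-up probabilities of the second level ("Stayed").
ref  <- factor(c("Left", "Stayed", "Stayed", "Left"))
prob <- c(0.10, 0.90, 0.70, 0.40)
classify(prob, ref)
```

In this report the equivalent call would be classify(pred3, test$Attrition), which labels high probabilities as “Stayed”.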
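# Convert probabilities to class labels at a 0.5 cutoff.
# Caution: predict() returns P(Stayed) here, so this mapping attaches "Left"
# to high probabilities of staying; the confusion matrices below were
# produced with the mapping exactly as written.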
pred1_class <- ifelse(pred1 > 0.5, "Left", "Stayed")
pred2_class <- ifelse(pred2 > 0.5, "Left", "Stayed")
pred3_class <- ifelse(pred3 > 0.5, "Left", "Stayed")
# Convert to factor with SAME levels
pred1_class <- factor(pred1_class, levels = levels(test$Attrition))
pred2_class <- factor(pred2_class, levels = levels(test$Attrition))
pred3_class <- factor(pred3_class, levels = levels(test$Attrition))
# Confusion matrices
cm1 <- confusionMatrix(pred1_class, test$Attrition)
cm2 <- confusionMatrix(pred2_class, test$Attrition)
cm3 <- confusionMatrix(pred3_class, test$Attrition)
cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction Left Stayed
## Left 7032 7868
## Stayed 0 0
##
## Accuracy : 0.4719
## 95% CI : (0.4639, 0.48)
## No Information Rate : 0.5281
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.4719
## Neg Pred Value : NaN
## Prevalence : 0.4719
## Detection Rate : 0.4719
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Left
##
cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction Left Stayed
## Left 4530 5487
## Stayed 2502 2381
##
## Accuracy : 0.4638
## 95% CI : (0.4558, 0.4719)
## No Information Rate : 0.5281
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.052
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6442
## Specificity : 0.3026
## Pos Pred Value : 0.4522
## Neg Pred Value : 0.4876
## Prevalence : 0.4719
## Detection Rate : 0.3040
## Detection Prevalence : 0.6723
## Balanced Accuracy : 0.4734
##
## 'Positive' Class : Left
##
cm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction Left Stayed
## Left 1853 6077
## Stayed 5179 1791
##
## Accuracy : 0.2446
## 95% CI : (0.2377, 0.2515)
## No Information Rate : 0.5281
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.5054
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.2635
## Specificity : 0.2276
## Pos Pred Value : 0.2337
## Neg Pred Value : 0.2570
## Prevalence : 0.4719
## Detection Rate : 0.1244
## Detection Prevalence : 0.5322
## Balanced Accuracy : 0.2456
##
## 'Positive' Class : Left
##
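The counts in the three tables are enough to check what happens when the prediction labels are swapped. The sketch below rebuilds each table from the printed counts and compares accuracy as reported with accuracy after relabelling; the helper names (make_cm, swap_labels, accuracy) are illustrative only.

```r
# Build a 2x2 table from printed counts.
# Arguments: ll = predicted Left / actual Left, ls = predicted Left / actual Stayed,
#            sl = predicted Stayed / actual Left, ss = predicted Stayed / actual Stayed.
make_cm <- function(ll, ls, sl, ss) {
  matrix(c(ll, sl, ls, ss), nrow = 2,
         dimnames = list(Prediction = c("Left", "Stayed"),
                         Reference  = c("Left", "Stayed")))
}

# Counts copied from the confusion matrices above.
cms <- list(
  model1 = make_cm(7032, 7868,    0,    0),
  model2 = make_cm(4530, 5487, 2502, 2381),
  model3 = make_cm(1853, 6077, 5179, 1791)
)

accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Swapping the prediction labels exchanges the two rows of each table.
swap_labels <- function(cm) {
  swapped <- cm[c("Stayed", "Left"), ]
  rownames(swapped) <- c("Left", "Stayed")
  swapped
}

as_reported <- sapply(cms, accuracy)
relabelled  <- sapply(cms, function(cm) accuracy(swap_labels(cm)))

round(rbind(as_reported, relabelled), 4)
```

For a binary outcome the two rows always sum to 1, so an accuracy of 24.46% under one labelling corresponds to 75.54% under the other.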
The confusion matrix reports accuracy, sensitivity, and specificity for each model, with “Left” treated as the positive class.
library(pROC)
# ROC curves
roc1 <- roc(test$Attrition, pred1)
roc2 <- roc(test$Attrition, pred2)
roc3 <- roc(test$Attrition, pred3)
# Plot ROC
plot(roc1, col = "blue", main = "ROC Curve Comparison")
lines(roc2, col = "red")
lines(roc3, col = "green")
legend("bottomright",
legend = c("Model 1", "Model 2", "Model 3"),
col = c("blue", "red", "green"),
lwd = 2)
The ROC curve comparison shows that Model 3 outperforms Models 1 and 2 in terms of class discrimination, as its curve is closest to the top-left corner. In contrast, Models 1 and 2 lie near the diagonal, indicating performance close to random guessing.
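However, this finding contrasts with the confusion matrix, where Model 3 has the lowest accuracy (24.46%). An accuracy this far below 50% on a nearly balanced binary outcome cannot be explained by a poorly chosen threshold; it indicates that the predicted labels are inverted. predict(..., type = "response") returns the probability of the second factor level, “Stayed”, while the classification step labels probabilities above 0.5 as “Left”. Swapping the labels turns the same counts into (5179 + 6077) / 14900 = 75.54% accuracy, which agrees with the ROC curve.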
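Why a ROC curve can look healthy while thresholded accuracy looks poor is easier to see on simulated data. The sketch below is an illustration under stated assumptions, not a re-analysis: it uses made-up scores and a base R rank formula for AUC (auc_rank is a hypothetical helper, not a pROC function). AUC depends only on how the scores are ordered, and pROC's roc() chooses the comparison direction automatically by default, so a label mix-up at the cutoff step leaves the curve unaffected.

```r
# Rank-based AUC: the probability that a randomly chosen case from the
# positive class receives a higher score than one from the other class.
auc_rank <- function(score, is_positive) {
  r  <- rank(score)  # average ranks handle ties
  n1 <- sum(is_positive)
  n0 <- sum(!is_positive)
  (sum(r[is_positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated scores (hypothetical): p_stay plays the role of P(Stayed).
set.seed(42)
stayed <- rbinom(1000, 1, 0.5) == 1
p_stay <- plogis(rnorm(1000, mean = ifelse(stayed, 1, -1)))

auc_for_stayed <- auc_rank(p_stay, stayed)   # score matches the class
auc_for_left   <- auc_rank(p_stay, !stayed)  # same score, opposite class

# The two AUC values are mirror images and sum to 1.
c(auc_for_stayed, auc_for_left, auc_for_stayed + auc_for_left)

# Accuracy when the 0.5 cutoff is paired with the wrong label.
pred_wrong <- ifelse(p_stay > 0.5, "Left", "Stayed")
acc_wrong  <- mean(pred_wrong == ifelse(stayed, "Stayed", "Left"))
acc_wrong
```

A score with good ranking ability therefore yields accuracy well below 50% when its labels are swapped, which is the pattern seen for Model 3.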
accuracy <- c(cm1$overall["Accuracy"],
cm2$overall["Accuracy"],
cm3$overall["Accuracy"])
model_names <- c("Model 1", "Model 2", "Model 3")
barplot(accuracy,
names.arg = model_names,
main = "Model Accuracy Comparison",
ylab = "Accuracy",
col = c("blue", "red", "green"))
This bar chart provides a visual comparison of the accuracy of the three models, highlighting their relative performance.
library(ggplot2)
cm_df <- as.data.frame(cm2$table)
ggplot(cm_df, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "blue") +
ggtitle("Confusion Matrix - Model 2")
The performance of the three logistic regression models shows notable differences in predictive capability and behavior.
Model 1 demonstrates very limited predictive power. Although the coefficient for Monthly Income is statistically significant (p < 0.01), its practical impact is minimal. This is reflected in the confusion matrix, where the model predicts all observations as “Left.” As a result, the model achieves a sensitivity of 1.000 but a specificity of 0.000, indicating that it fails completely to identify employees who stayed.
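The reported accuracy is 47.19%, which is lower than the No Information Rate (52.81%). The two figures sum to exactly 100%, which is informative: every fitted probability from Model 1 exceeds 0.5, and because glm() models the probability of the second factor level (“Stayed”), the model is in effect predicting the majority class for every employee. Labelled correctly, it would match the naive classifier rather than fall below it. Either way, relying solely on income is insufficient to explain employee attrition.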
Model 2 introduces Overtime as an additional predictor, which significantly improves the model’s behavior. The variable Overtime is highly significant (p < 0.001), indicating that workload plays an important role in employee attrition.
Compared to Model 1, Model 2 produces a more balanced classification, with a sensitivity of 0.6442 and a specificity of 0.3026.
This shows that the model can now distinguish between employees who leave and those who stay, although performance is still modest. The overall accuracy is 46.38%, which is still below the No Information Rate.
The negative coefficient for OvertimeYes lowers the log-odds of the modeled outcome, which is “Stayed” (the second factor level). Employees who work overtime are therefore estimated to be less likely to stay, that is, more likely to leave. This matches the intuitive link between workload and attrition.
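The size of the overtime effect is easier to read on the odds scale. The sketch below copies the OvertimeYes estimates and standard errors from the Model 2 and Model 3 summaries and converts them to odds ratios with approximate 95% Wald intervals; the data frame and column names are illustrative. It assumes, as R's glm() does for a factor response, that the modeled outcome is the second level, “Stayed”.

```r
# OvertimeYes estimates and standard errors copied from the summaries above.
overtime <- data.frame(
  model    = c("Model 2", "Model 3"),
  estimate = c(-0.2374, -0.3512),
  se       = c(0.01750, 0.02189)
)

# exp(coefficient) is the multiplicative change in the odds of the modeled
# outcome, which here is "Stayed".
overtime$odds_ratio_stay <- exp(overtime$estimate)

# Approximate 95% Wald interval on the odds-ratio scale.
z <- qnorm(0.975)
overtime$lower <- exp(overtime$estimate - z * overtime$se)
overtime$upper <- exp(overtime$estimate + z * overtime$se)

# The reciprocal expresses the same effect as odds of "Left".
overtime$odds_ratio_leave <- 1 / overtime$odds_ratio_stay

print(overtime, digits = 3)
```

Both intervals lie entirely below 1 on the “Stayed” scale, so overtime is associated with lower odds of staying in either model.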
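Model 3 incorporates all available predictors and provides deeper insights into the factors influencing attrition. Many variables are statistically significant (p < 0.001), including Work.Life.Balance, Marital.Status, Job.Level, Remote.Work, Distance.from.Home, and Number.of.Promotions.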
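These results confirm that employee attrition is a complex phenomenon influenced by multiple factors rather than a single variable. However, despite its complexity, Model 3 performs poorly on the reported metrics: its accuracy is 24.46%, its balanced accuracy is 24.56%, and its Kappa is -0.5054.
Overfitting is an unlikely explanation for this result. The model estimates 43 coefficients from 59,598 observations, and an overfit model would drift toward chance-level accuracy (about 50%) rather than fall far below it. A strongly negative Kappa means the predictions disagree with the truth systematically, which is what happens when class labels are assigned the wrong way round: glm() models the probability of the second factor level, “Stayed”, but probabilities above 0.5 are labelled “Left”. Read with the labels swapped, the same table gives an accuracy of 75.54%.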
Comparative Analysis
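When comparing all three models on reported test accuracy, Model 1 reaches 47.19%, Model 2 reaches 46.38%, and Model 3 reaches 24.46%, all below the No Information Rate of 52.81%.
This highlights an important principle in machine learning: model complexity alone does not guarantee better predictions, and a ranking is only trustworthy when the evaluation metrics agree with one another.
On reported accuracy alone, Model 2 appears to offer the best trade-off between interpretability and predictive capability among the three. The ROC curves do not support that ranking, however: Models 1 and 2 lie near the diagonal, while Model 3 discriminates well between the classes. Because the classification step labels high probabilities of “Stayed” as “Left”, each reported accuracy is the complement of the true one; read correctly, Model 3 is the strongest of the three (75.54%, against 52.81% and 53.62% for Models 1 and 2).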
This study examined the effectiveness of logistic regression models in predicting employee attrition using three different model specifications.
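The results indicate that income alone has almost no predictive value, that adding overtime yields a small but statistically significant improvement in fit (residual deviance 82469 to 82285), and that the full model fits far better (residual deviance 57833) and discriminates best on the ROC curve.
Overall, the reported accuracy figures favor Model 2, but they are distorted by the inverted class labels in the classification step: glm() models the probability of “Stayed”, while the code labels probabilities above 0.5 as “Left”. The deviance and ROC evidence favor Model 3, which suggests that the additional predictors carry real signal rather than noise.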
From a practical perspective, the findings suggest that organizations should consider workload factors, such as overtime, alongside compensation when addressing employee attrition. Additionally, the results emphasize the importance of model evaluation using unseen data to ensure that predictive models are robust and generalizable.