1. Introduction

This analysis develops and evaluates logistic regression models that predict employee attrition from the Employee Attrition dataset. Three models of increasing complexity are fitted on the training set, and their performance on the test set is assessed with confusion matrices and ROC curves.

2. Data Preparation

2.1 Load Required Libraries

First, load the necessary libraries for data manipulation and modeling.

# Load libraries
library(tidyverse)
library(caret)
  • tidyverse is used for data manipulation and visualization
  • caret is used for model evaluation, especially confusion matrix

2.2 Load the Dataset

# Load training and testing data
train <- read.csv("train.csv")
test  <- read.csv("test.csv")
  • The dataset is divided into training and testing sets
  • The training set is used to build the model
  • The testing set is used to evaluate model performance

2.3 Data Inspection

# Inspect structure
str(train)
## 'data.frame':    59598 obs. of  24 variables:
##  $ Employee.ID             : int  8410 64756 30257 65791 65026 24368 64970 36999 32714 15944 ...
##  $ Age                     : int  31 59 24 36 56 38 47 48 57 24 ...
##  $ Gender                  : chr  "Male" "Female" "Female" "Female" ...
##  $ Years.at.Company        : int  19 4 10 7 41 3 23 16 44 1 ...
##  $ Job.Role                : chr  "Education" "Media" "Healthcare" "Education" ...
##  $ Monthly.Income          : int  5390 5534 8159 3989 4821 9977 3681 11223 3773 7319 ...
##  $ Work.Life.Balance       : chr  "Excellent" "Poor" "Good" "Good" ...
##  $ Job.Satisfaction        : chr  "Medium" "High" "High" "High" ...
##  $ Performance.Rating      : chr  "Average" "Low" "Low" "High" ...
##  $ Number.of.Promotions    : int  2 3 0 1 0 3 1 2 1 1 ...
##  $ Overtime                : chr  "No" "No" "No" "No" ...
##  $ Distance.from.Home      : int  22 21 11 27 71 37 75 5 39 57 ...
##  $ Education.Level         : chr  "Associate Degree" "Master’s Degree" "Bachelor’s Degree" "High School" ...
##  $ Marital.Status          : chr  "Married" "Divorced" "Married" "Single" ...
##  $ Number.of.Dependents    : int  0 3 3 2 0 0 3 4 4 4 ...
##  $ Job.Level               : chr  "Mid" "Mid" "Mid" "Mid" ...
##  $ Company.Size            : chr  "Medium" "Medium" "Medium" "Small" ...
##  $ Company.Tenure          : int  89 21 74 50 68 47 93 88 75 45 ...
##  $ Remote.Work             : chr  "No" "No" "No" "Yes" ...
##  $ Leadership.Opportunities: chr  "No" "No" "No" "No" ...
##  $ Innovation.Opportunities: chr  "No" "No" "No" "No" ...
##  $ Company.Reputation      : chr  "Excellent" "Fair" "Poor" "Good" ...
##  $ Employee.Recognition    : chr  "Medium" "Low" "Low" "Medium" ...
##  $ Attrition               : chr  "Stayed" "Stayed" "Stayed" "Stayed" ...
str(test)
## 'data.frame':    14900 obs. of  24 variables:
##  $ Employee.ID             : int  52685 30585 54656 33442 15667 3496 46775 72645 4941 65181 ...
##  $ Age                     : int  36 35 50 58 39 45 22 34 48 55 ...
##  $ Gender                  : chr  "Male" "Male" "Male" "Male" ...
##  $ Years.at.Company        : int  13 7 7 44 24 30 5 15 40 16 ...
##  $ Job.Role                : chr  "Healthcare" "Education" "Education" "Media" ...
##  $ Monthly.Income          : int  8029 4563 5583 5525 4604 8104 8700 11025 11452 5939 ...
##  $ Work.Life.Balance       : chr  "Excellent" "Good" "Fair" "Fair" ...
##  $ Job.Satisfaction        : chr  "High" "High" "High" "Very High" ...
##  $ Performance.Rating      : chr  "Average" "Average" "Average" "High" ...
##  $ Number.of.Promotions    : int  1 1 3 0 0 0 0 1 0 0 ...
##  $ Overtime                : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Distance.from.Home      : int  83 55 14 43 47 38 2 9 65 31 ...
##  $ Education.Level         : chr  "Master’s Degree" "Associate Degree" "Associate Degree" "Master’s Degree" ...
##  $ Marital.Status          : chr  "Married" "Single" "Divorced" "Single" ...
##  $ Number.of.Dependents    : int  1 4 2 4 6 0 0 4 1 1 ...
##  $ Job.Level               : chr  "Mid" "Entry" "Senior" "Entry" ...
##  $ Company.Size            : chr  "Large" "Medium" "Medium" "Medium" ...
##  $ Company.Tenure          : int  22 27 76 96 45 75 48 16 52 46 ...
##  $ Remote.Work             : chr  "No" "No" "No" "No" ...
##  $ Leadership.Opportunities: chr  "No" "No" "No" "No" ...
##  $ Innovation.Opportunities: chr  "No" "No" "Yes" "No" ...
##  $ Company.Reputation      : chr  "Poor" "Good" "Good" "Poor" ...
##  $ Employee.Recognition    : chr  "Medium" "High" "Low" "Low" ...
##  $ Attrition               : chr  "Stayed" "Left" "Stayed" "Left" ...
# Check missing values
colSums(is.na(train))
##              Employee.ID                      Age                   Gender 
##                        0                        0                        0 
##         Years.at.Company                 Job.Role           Monthly.Income 
##                        0                        0                        0 
##        Work.Life.Balance         Job.Satisfaction       Performance.Rating 
##                        0                        0                        0 
##     Number.of.Promotions                 Overtime       Distance.from.Home 
##                        0                        0                        0 
##          Education.Level           Marital.Status     Number.of.Dependents 
##                        0                        0                        0 
##                Job.Level             Company.Size           Company.Tenure 
##                        0                        0                        0 
##              Remote.Work Leadership.Opportunities Innovation.Opportunities 
##                        0                        0                        0 
##       Company.Reputation     Employee.Recognition                Attrition 
##                        0                        0                        0
colSums(is.na(test))
##              Employee.ID                      Age                   Gender 
##                        0                        0                        0 
##         Years.at.Company                 Job.Role           Monthly.Income 
##                        0                        0                        0 
##        Work.Life.Balance         Job.Satisfaction       Performance.Rating 
##                        0                        0                        0 
##     Number.of.Promotions                 Overtime       Distance.from.Home 
##                        0                        0                        0 
##          Education.Level           Marital.Status     Number.of.Dependents 
##                        0                        0                        0 
##                Job.Level             Company.Size           Company.Tenure 
##                        0                        0                        0 
##              Remote.Work Leadership.Opportunities Innovation.Opportunities 
##                        0                        0                        0 
##       Company.Reputation     Employee.Recognition                Attrition 
##                        0                        0                        0

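The inspection shows that:

  • Both sets contain the same 24 columns
  • Numeric variables are stored as integers and categorical variables as character strings, which glm() converts to factors when fitting
  • Neither set contains missing values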
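One check that str() does not cover is whether every category in the test set also occurs in the training set; predict() stops with an error when it meets a factor level the model was not fitted on. The following is a minimal sketch of such a check on two made-up data frames (train_toy and test_toy are hypothetical stand-ins, not the attrition data).

```r
# Toy stand-ins for train and test (illustrative only, not the attrition data)
train_toy <- data.frame(Job.Level = c("Entry", "Mid", "Senior"),
                        Age = c(25, 40, 50),
                        stringsAsFactors = FALSE)
test_toy  <- data.frame(Job.Level = c("Mid", "Entry"),
                        Age = c(30, 45),
                        stringsAsFactors = FALSE)

# Character columns are the ones glm() will turn into factors
chr_cols <- names(train_toy)[sapply(train_toy, is.character)]

# For each such column, list test values that never occur in training
unseen <- lapply(chr_cols, function(col) {
  setdiff(unique(test_toy[[col]]), unique(train_toy[[col]]))
})
names(unseen) <- chr_cols
unseen   # an empty character vector means the column is safe for predict()
```

The same loop could be run on train and test once they are loaded.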

2.4 Convert Target Variable to Factor

# Convert Attrition to factor
train$Attrition <- as.factor(train$Attrition)
test$Attrition  <- as.factor(test$Attrition)

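With family = "binomial", glm() accepts a factor as the response. It treats the first factor level as failure and the second as success. Because the levels are ordered alphabetically (“Left”, “Stayed”), the models below estimate the probability that an employee stayed, not the probability that they left.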
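Which outcome the fitted probabilities refer to depends on the order of the factor levels. The sketch below uses a small simulated data frame (toy is hypothetical, not the attrition data) to show that glm() models the second level, and how relevel() could be used if P(Left) were wanted instead.

```r
# Toy data (illustrative only): larger x makes "Stayed" more likely
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$Attrition <- factor(ifelse(toy$x + rnorm(200) > 0, "Stayed", "Left"))

levels(toy$Attrition)        # alphabetical order: "Left" "Stayed"

# glm() treats the first level as failure and the second as success,
# so the fitted values are P(Attrition == "Stayed")
fit <- glm(Attrition ~ x, data = toy, family = "binomial")
coef(fit)["x"]               # positive: larger x raises P(Stayed)

# To model P(Left) instead, make "Stayed" the reference (first) level
toy$Attrition2 <- relevel(toy$Attrition, ref = "Stayed")
fit_left <- glm(Attrition2 ~ x, data = toy, family = "binomial")
coef(fit_left)["x"]          # same magnitude, opposite sign
```

Either coding is valid as long as the later step that turns probabilities into labels uses the same convention.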

3. Model Development

3.1 Model 1: Attrition ~ Monthly.Income

model1 <- glm(Attrition ~ Monthly.Income, 
              data = train, 
              family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = Attrition ~ Monthly.Income, family = "binomial", 
##     data = train)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    2.081e-02  2.902e-02   0.717  0.47334   
## Monthly.Income 1.059e-05  3.813e-06   2.777  0.00548 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 82477  on 59597  degrees of freedom
## Residual deviance: 82469  on 59596  degrees of freedom
## AIC: 82473
## 
## Number of Fisher Scoring iterations: 3

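This model tests whether monthly income alone can predict employee attrition. The coefficient is statistically significant, but the residual deviance falls only from 82477 to 82469, so income by itself explains almost none of the variation in the outcome.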
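To make the size of the income effect concrete, the estimate printed above can be converted to an odds ratio. The snippet only re-uses the reported coefficient; the 1,000-unit income step is an arbitrary choice for illustration. Because glm() models the second factor level, the odds are those of “Stayed”.

```r
# Coefficient for Monthly.Income copied from the summary(model1) output above
beta_income <- 1.059e-05

# Multiplicative change in the odds of the modelled outcome ("Stayed")
# for a 1,000-unit increase in monthly income (step size chosen for illustration)
or_per_1000 <- exp(beta_income * 1000)
or_per_1000   # roughly 1.011, i.e. about a 1% change in the odds
```

An effect of this size is far too small to move any fitted probability across a 0.5 threshold.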

3.2 Model 2: Attrition ~ Monthly.Income + Overtime

model2 <- glm(Attrition ~ Monthly.Income + Overtime, 
              data = train, 
              family = "binomial")
summary(model2)
## 
## Call:
## glm(formula = Attrition ~ Monthly.Income + Overtime, family = "binomial", 
##     data = train)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.001e-01  2.965e-02   3.375 0.000737 ***
## Monthly.Income  1.038e-05  3.819e-06   2.717 0.006578 ** 
## OvertimeYes    -2.374e-01  1.750e-02 -13.566  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 82477  on 59597  degrees of freedom
## Residual deviance: 82285  on 59595  degrees of freedom
## AIC: 82291
## 
## Number of Fisher Scoring iterations: 3

This model adds overtime as an additional predictor, allowing us to assess whether workload contributes to attrition.
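Whether the extra predictor improves the fit can be checked with a likelihood-ratio test on the two nested models. The sketch below does the arithmetic from the deviances printed in the two summaries; on the fitted objects, anova(model1, model2, test = "Chisq") performs the same comparison.

```r
# Residual deviances and degrees of freedom copied from the two summaries above
dev_model1 <- 82469; df_model1 <- 59596
dev_model2 <- 82285; df_model2 <- 59595

# Likelihood-ratio statistic: drop in deviance from adding Overtime
lr_stat <- dev_model1 - dev_model2      # 184
lr_df   <- df_model1 - df_model2        # 1

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_value
```

The p-value is extremely small, but with almost 60,000 observations statistical significance says little about predictive value.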

3.3 Model 3: Attrition ~ . (All Variables)

model3 <- glm(Attrition ~ ., 
              data = train, 
              family = "binomial")
summary(model3)
## 
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = train)
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -2.624e-01  8.764e-02  -2.994  0.00275 ** 
## Employee.ID                       1.653e-07  4.761e-07   0.347  0.72843    
## Age                               5.998e-03  1.004e-03   5.974 2.31e-09 ***
## GenderMale                        6.262e-01  2.078e-02  30.131  < 2e-16 ***
## Years.at.Company                  1.360e-02  1.174e-03  11.588  < 2e-16 ***
## Job.RoleFinance                   9.212e-02  4.838e-02   1.904  0.05690 .  
## Job.RoleHealthcare                7.204e-02  4.239e-02   1.700  0.08919 .  
## Job.RoleMedia                     9.854e-02  3.616e-02   2.725  0.00642 ** 
## Job.RoleTechnology                8.459e-02  4.847e-02   1.745  0.08096 .  
## Monthly.Income                    6.733e-06  8.246e-06   0.817  0.41417    
## Work.Life.BalanceFair            -1.327e+00  3.129e-02 -42.409  < 2e-16 ***
## Work.Life.BalanceGood            -2.934e-01  2.958e-02  -9.916  < 2e-16 ***
## Work.Life.BalancePoor            -1.509e+00  3.757e-02 -40.165  < 2e-16 ***
## Job.SatisfactionLow              -4.880e-01  3.572e-02 -13.665  < 2e-16 ***
## Job.SatisfactionMedium           -1.135e-02  2.721e-02  -0.417  0.67664    
## Job.SatisfactionVery High        -4.989e-01  2.701e-02 -18.471  < 2e-16 ***
## Performance.RatingBelow Average  -3.338e-01  2.948e-02 -11.322  < 2e-16 ***
## Performance.RatingHigh           -5.054e-03  2.647e-02  -0.191  0.84860    
## Performance.RatingLow            -5.972e-01  4.847e-02 -12.321  < 2e-16 ***
## Number.of.Promotions              2.492e-01  1.043e-02  23.898  < 2e-16 ***
## OvertimeYes                      -3.512e-01  2.189e-02 -16.043  < 2e-16 ***
## Distance.from.Home               -9.951e-03  3.636e-04 -27.367  < 2e-16 ***
## Education.LevelBachelor’s Degree -4.580e-02  2.757e-02  -1.661  0.09669 .  
## Education.LevelHigh School       -3.037e-02  3.072e-02  -0.989  0.32272    
## Education.LevelMaster’s Degree   -3.095e-02  3.048e-02  -1.015  0.30991    
## Education.LevelPhD                1.564e+00  5.446e-02  28.716  < 2e-16 ***
## Marital.StatusMarried             2.559e-01  2.957e-02   8.656  < 2e-16 ***
## Marital.StatusSingle             -1.573e+00  3.220e-02 -48.864  < 2e-16 ***
## Number.of.Dependents              1.573e-01  6.660e-03  23.622  < 2e-16 ***
## Job.LevelMid                      1.003e+00  2.266e-02  44.269  < 2e-16 ***
## Job.LevelSenior                   2.616e+00  3.261e-02  80.221  < 2e-16 ***
## Company.SizeMedium               -6.168e-03  2.713e-02  -0.227  0.82015    
## Company.SizeSmall                -2.063e-01  2.958e-02  -6.975 3.07e-12 ***
## Company.Tenure                    2.034e-04  4.501e-04   0.452  0.65134    
## Remote.WorkYes                    1.775e+00  2.916e-02  60.887  < 2e-16 ***
## Leadership.OpportunitiesYes       1.627e-01  4.752e-02   3.423  0.00062 ***
## Innovation.OpportunitiesYes       1.410e-01  2.784e-02   5.063 4.12e-07 ***
## Company.ReputationFair           -4.699e-01  3.962e-02 -11.859  < 2e-16 ***
## Company.ReputationGood            6.023e-02  3.535e-02   1.704  0.08836 .  
## Company.ReputationPoor           -7.565e-01  3.972e-02 -19.043  < 2e-16 ***
## Employee.RecognitionLow          -3.956e-02  2.620e-02  -1.510  0.13099    
## Employee.RecognitionMedium       -4.354e-02  2.769e-02  -1.573  0.11578    
## Employee.RecognitionVery High     8.304e-02  5.019e-02   1.655  0.09802 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 82477  on 59597  degrees of freedom
## Residual deviance: 57833  on 59555  degrees of freedom
## AIC: 57919
## 
## Number of Fisher Scoring iterations: 5

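This model uses every other column as a predictor, providing the most comprehensive analysis. Note that the formula Attrition ~ . also picks up Employee.ID, which is an arbitrary identifier rather than an employee characteristic. Its coefficient is not significant (p = 0.73), and such a column would normally be excluded.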
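An identifier column can be kept out of a "." formula by subtracting it. The following sketch shows the syntax on a small simulated data frame (toy and its values are made up for illustration; only the column names mirror the dataset).

```r
# Toy data (illustrative only) with an identifier column and two predictors
set.seed(42)
toy <- data.frame(
  Employee.ID = sample(1:10000, 300),
  Age         = sample(20:60, 300, replace = TRUE),
  Overtime    = sample(c("No", "Yes"), 300, replace = TRUE)
)
toy$Attrition <- factor(ifelse(runif(300) < 0.5, "Left", "Stayed"))

# "." expands to every other column; "- Employee.ID" removes the identifier
fit <- glm(Attrition ~ . - Employee.ID, data = toy, family = "binomial")
names(coef(fit))   # "(Intercept)" "Age" "OvertimeYes"
```

Applied to the attrition data, the equivalent call would be glm(Attrition ~ . - Employee.ID, data = train, family = "binomial").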

4. Model Evaluation

4.1 Generate Predictions

# Predict probabilities
pred1 <- predict(model1, test, type = "response")
pred2 <- predict(model2, test, type = "response")
pred3 <- predict(model3, test, type = "response")

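The output is a probability between 0 and 1. Because glm() models the second level of the response factor, and the levels are ordered alphabetically (“Left”, “Stayed”), each value is the estimated probability that the employee stayed, not the probability of attrition.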

4.2 Convert Probabilities to Classes

# Convert probabilities to class labels (MATCH DATASET LABEL)
pred1_class <- ifelse(pred1 > 0.5, "Left", "Stayed")
pred2_class <- ifelse(pred2 > 0.5, "Left", "Stayed")
pred3_class <- ifelse(pred3 > 0.5, "Left", "Stayed")

# Convert to factor with SAME levels
pred1_class <- factor(pred1_class, levels = levels(test$Attrition))
pred2_class <- factor(pred2_class, levels = levels(test$Attrition))
pred3_class <- factor(pred3_class, levels = levels(test$Attrition))
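  • A threshold of 0.5 is used
  • If probability > 0.5 → predicted as “Left” (attrition)
  • Otherwise → “Stayed”
  • Caution: the probabilities from Section 4.1 are P(Stayed), so this rule assigns “Left” to the employees the model considers most likely to stay. The predicted labels, and every confusion-matrix statistic derived from them, are therefore reversed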
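A mapping that cannot get the direction wrong takes the label for high probabilities from the factor levels themselves. This is a minimal sketch on made-up values (truth and p_stay are hypothetical; p_stay stands in for the output of predict(..., type = "response")), with the reversed rule shown for contrast.

```r
# Toy example (illustrative only): true labels and fitted probabilities
truth  <- factor(c("Stayed", "Stayed", "Left", "Left", "Stayed"),
                 levels = c("Left", "Stayed"))
p_stay <- c(0.90, 0.70, 0.20, 0.40, 0.60)

# glm() models the SECOND factor level, so a high probability means that level
modelled <- levels(truth)[2]   # "Stayed"
other    <- levels(truth)[1]   # "Left"

pred_class <- factor(ifelse(p_stay > 0.5, modelled, other),
                     levels = levels(truth))
mean(pred_class == truth)      # 1: every toy case is classified correctly

# The reversed rule flips every prediction
pred_reversed <- factor(ifelse(p_stay > 0.5, other, modelled),
                        levels = levels(truth))
mean(pred_reversed == truth)   # 0: every toy case is now wrong
```

With the attrition models, the first rule would correspond to ifelse(pred1 > 0.5, "Stayed", "Left").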

4.3 Confusion Matrix

# Confusion matrices
cm1 <- confusionMatrix(pred1_class, test$Attrition)
cm2 <- confusionMatrix(pred2_class, test$Attrition)
cm3 <- confusionMatrix(pred3_class, test$Attrition)

cm1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Left Stayed
##     Left   7032   7868
##     Stayed    0      0
##                                         
##                Accuracy : 0.4719        
##                  95% CI : (0.4639, 0.48)
##     No Information Rate : 0.5281        
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0             
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 1.0000        
##             Specificity : 0.0000        
##          Pos Pred Value : 0.4719        
##          Neg Pred Value :    NaN        
##              Prevalence : 0.4719        
##          Detection Rate : 0.4719        
##    Detection Prevalence : 1.0000        
##       Balanced Accuracy : 0.5000        
##                                         
##        'Positive' Class : Left          
## 
cm2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Left Stayed
##     Left   4530   5487
##     Stayed 2502   2381
##                                           
##                Accuracy : 0.4638          
##                  95% CI : (0.4558, 0.4719)
##     No Information Rate : 0.5281          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.052          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6442          
##             Specificity : 0.3026          
##          Pos Pred Value : 0.4522          
##          Neg Pred Value : 0.4876          
##              Prevalence : 0.4719          
##          Detection Rate : 0.3040          
##    Detection Prevalence : 0.6723          
##       Balanced Accuracy : 0.4734          
##                                           
##        'Positive' Class : Left            
## 
cm3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Left Stayed
##     Left   1853   6077
##     Stayed 5179   1791
##                                           
##                Accuracy : 0.2446          
##                  95% CI : (0.2377, 0.2515)
##     No Information Rate : 0.5281          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.5054         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.2635          
##             Specificity : 0.2276          
##          Pos Pred Value : 0.2337          
##          Neg Pred Value : 0.2570          
##              Prevalence : 0.4719          
##          Detection Rate : 0.1244          
##    Detection Prevalence : 0.5322          
##       Balanced Accuracy : 0.2456          
##                                           
##        'Positive' Class : Left            
## 

The confusion matrix evaluates:

  • Accuracy
  • Sensitivity (Recall)
  • Specificity
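These statistics follow directly from the four cell counts. As a worked check, the sketch below recomputes them for Model 2 from the table printed above (the object name tab is arbitrary), using “Left” as the positive class, as caret does in the output.

```r
# Counts copied from the cm2 table above (rows = prediction, columns = reference)
tab <- matrix(c(4530, 2502, 5487, 2381), nrow = 2,
              dimnames = list(Prediction = c("Left", "Stayed"),
                              Reference  = c("Left", "Stayed")))

# Accuracy: correct predictions (diagonal) over all predictions
accuracy    <- sum(diag(tab)) / sum(tab)                       # 6911 / 14900

# Sensitivity: share of actual "Left" cases predicted as "Left"
sensitivity <- tab["Left", "Left"] / sum(tab[, "Left"])        # 4530 / 7032

# Specificity: share of actual "Stayed" cases predicted as "Stayed"
specificity <- tab["Stayed", "Stayed"] / sum(tab[, "Stayed"])  # 2381 / 7868

round(c(accuracy, sensitivity, specificity), 4)   # 0.4638 0.6442 0.3026
```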

4.4 ROC Curve Analysis

library(pROC)

# ROC curves
roc1 <- roc(test$Attrition, pred1)
roc2 <- roc(test$Attrition, pred2)
roc3 <- roc(test$Attrition, pred3)

# Plot ROC
plot(roc1, col = "blue", main = "ROC Curve Comparison")
lines(roc2, col = "red")
lines(roc3, col = "green")

legend("bottomright",
       legend = c("Model 1", "Model 2", "Model 3"),
       col = c("blue", "red", "green"),
       lwd = 2)

The ROC curve comparison shows that Model 3 outperforms Models 1 and 2 in terms of class discrimination, as its curve is closest to the top-left corner. In contrast, Models 1 and 2 lie near the diagonal, indicating performance close to random guessing.
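The visual comparison can be summarised by the area under the curve (AUC). The sketch below computes the AUC from ranks using only base R and made-up scores (auc_rank is a hypothetical helper, not part of pROC); for the fitted roc objects, pROC's auc() returns the same quantity.

```r
# AUC via the rank-sum (Mann-Whitney) identity; auc_rank is an illustrative helper
auc_rank <- function(score, is_positive) {
  r  <- rank(score)                  # average ranks, so ties count as one half
  n1 <- sum(is_positive)             # number of positive cases
  n0 <- sum(!is_positive)            # number of negative cases
  (sum(r[is_positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up scores: one positive case is ranked below one negative case
score       <- c(0.10, 0.40, 0.35, 0.80)
is_positive <- c(FALSE, FALSE, TRUE, TRUE)
auc_rank(score, is_positive)   # 0.75: 3 of the 4 positive-negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to the diagonal, and values near 1 to a curve that hugs the top-left corner.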

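However, this finding contrasts with the confusion matrix, where Model 3 has the lowest accuracy. A suboptimal threshold cannot explain the gap: an accuracy of 24.46% on a two-class problem is far below chance, and the negative Kappa (-0.5054) means the predictions are systematically opposite to the truth. The cause is the label assignment in Section 4.2, which treats the fitted probability of “Stayed” as the probability of “Left”. The ROC curves are not affected, because roc() determines the direction of the comparison automatically.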
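A quick check on the reported figures: in a two-class problem, reversing every predicted label changes an accuracy of a into 1 - a. The sketch applies this to the accuracies printed in Section 4.3. It is arithmetic on the reported values, not a re-run of the models.

```r
# Accuracies copied from the confusion matrices in Section 4.3
reported <- c(model1 = 0.4719, model2 = 0.4638, model3 = 0.2446)

# With two classes, reversing every predicted label turns each hit into a miss
# and each miss into a hit, so accuracy becomes 1 - accuracy
reversed <- 1 - reported
reversed   # model1 0.5281, model2 0.5362, model3 0.7554
```

Ranked this way, the three models fall in the same order as their ROC curves.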

4.5 Accuracy Comparison

accuracy <- c(cm1$overall["Accuracy"],
              cm2$overall["Accuracy"],
              cm3$overall["Accuracy"])

model_names <- c("Model 1", "Model 2", "Model 3")

barplot(accuracy,
        names.arg = model_names,
        main = "Model Accuracy Comparison",
        ylab = "Accuracy",
        col = c("blue", "red", "green"))

This bar chart provides a visual comparison of the accuracy of the three models, highlighting their relative performance.

4.6 Confusion Matrix Visualization

library(ggplot2)

cm_df <- as.data.frame(cm2$table)

ggplot(cm_df, aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq)) +
  scale_fill_gradient(low = "white", high = "blue") +
  ggtitle("Confusion Matrix - Model 2")

5. Interpretation of Results

The performance of the three logistic regression models shows notable differences in predictive capability and behavior.

5.1 Model 1: Attrition ~ Monthly Income

Model 1 demonstrates very limited predictive power. Although the coefficient for Monthly Income is statistically significant (p < 0.01), its practical impact is minimal: the fitted probability lies on the same side of 0.5 for every test employee, so the model assigns all 14,900 observations to a single class. In the confusion matrix that class is labelled “Left”, which gives a sensitivity of 1.000 and a specificity of 0.000. Because the fitted probability is P(Stayed), the class the model actually predicts is “Stayed”.

The reported accuracy is 47.19%, below the No Information Rate (52.81%). With the labels assigned correctly, the model would predict the majority class for everyone and its accuracy would equal the No Information Rate exactly. Either way it is no better than a naive classifier, so income alone is insufficient to predict employee attrition.

5.2 Model 2: Attrition ~ Monthly Income + Overtime

Model 2 adds Overtime as a second predictor. Its coefficient is highly significant (p < 0.001), indicating that overtime is associated with attrition, although the residual deviance falls only from 82469 to 82285.

Compared to Model 1, Model 2 produces a more balanced classification:

  • Sensitivity: 0.6442
  • Specificity: 0.3026

The model no longer assigns every employee to one class, but its discrimination is weak. The reported accuracy is 46.38%. With the labels from Section 4.2 assigned correctly, the same predictions give an accuracy of 53.62% (sensitivity 0.3558, specificity 0.6974), barely above the No Information Rate of 52.81%.

The coefficient for OvertimeYes is negative. Since the model estimates the log-odds of “Stayed”, this means that employees who work overtime are less likely to stay, that is, more likely to leave. This is consistent with the expectation that a heavier workload increases attrition.
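The size of the overtime effect is easier to read as an odds ratio. The snippet re-uses the coefficient printed in the Model 2 summary; the odds are those of the level that glm() models, which is “Stayed”.

```r
# Coefficient for OvertimeYes copied from the summary(model2) output
beta_overtime <- -2.374e-01

# Odds of "Stayed" for overtime workers relative to non-overtime workers
or_overtime <- exp(beta_overtime)
or_overtime   # about 0.79: the odds of staying are roughly 21% lower with overtime
```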

5.3 Model 3: Attrition ~ All Variables

Model 3 incorporates all available predictors and provides deeper insights into the factors influencing attrition. Many variables are statistically significant, including:

  • Age
  • Gender
  • Years at Company
  • Work-Life Balance
  • Job Satisfaction
  • Overtime
  • Remote Work
  • Marital Status

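These results confirm that employee attrition is a complex phenomenon influenced by multiple factors rather than a single variable. All coefficients describe the log-odds of staying, so a positive estimate (for example Remote.WorkYes) is associated with retention. The confusion matrix in Section 4.3 nevertheless reports poor results:

  • Accuracy: 24.46%
  • Sensitivity: 0.2635
  • Specificity: 0.2276

These values do not indicate overfitting. With 59,598 training observations and 43 estimated coefficients the model has little room to overfit, and an accuracy far below 50% on a two-class problem arises only when predictions are systematically reversed. That is the case here: Section 4.2 labels a high fitted probability as “Left”, although the probability refers to “Stayed”. With the labels assigned correctly, the same predictions give an accuracy of 75.54%, a sensitivity of 0.7365 and a specificity of 0.7724, which agrees with the ROC analysis.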
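Overfitting is usually diagnosed by comparing performance on the training data with performance on held-out data. The following sketch shows that comparison on simulated data (toy, acc and all values are hypothetical), with predicted labels taken from the factor level that glm() models.

```r
# Simulated data (illustrative only): two predictors with a real effect on the outcome
set.seed(7)
n   <- 1000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "Stayed", "Left"))

# 80/20 split into training and held-out rows
idx      <- sample(n, 0.8 * n)
train_t  <- toy[idx, ]
test_t   <- toy[-idx, ]

fit <- glm(y ~ x1 + x2, data = train_t, family = "binomial")

# Accuracy at a 0.5 threshold; high probabilities map to the second factor level
acc <- function(d) {
  p   <- predict(fit, d, type = "response")
  cls <- ifelse(p > 0.5, levels(d$y)[2], levels(d$y)[1])
  mean(cls == d$y)
}

c(train = acc(train_t), test = acc(test_t))
```

Similar accuracy on both sets suggests the model generalises; a large gap in favour of the training set would point to overfitting.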

5.4 Comparative Analysis

When comparing all three models:

  • Model 1 assigns every employee to a single class and is no better than the No Information Rate
  • Model 2 separates the classes only slightly better than chance
  • Model 3 has the lowest reported accuracy, but only because of the reversed labels; its ROC curve shows by far the best discrimination

The reported accuracies therefore cannot be used to rank the models. Once the label reversal is taken into account (an accuracy of a becomes 1 - a), the ordering is Model 1 (52.81%), Model 2 (53.62%) and Model 3 (75.54%), in line with the ROC curves.

Increasing model complexity does not always improve predictive performance, which is why each model is evaluated on unseen data. In this case, however, the additional predictors in Model 3 carry real information about attrition, at the cost of a model that is harder to interpret than Model 2.

6. Conclusion

This study examined the effectiveness of logistic regression models in predicting employee attrition using three different model specifications.

The results indicate that:

  1. Monthly income alone is not sufficient to predict employee attrition, despite being statistically significant.
  2. Adding overtime yields a small improvement; employees who work overtime are more likely to leave.
  3. The full model discriminates best between leavers and stayers according to the ROC analysis. Its low reported accuracy (24.46%) results from the reversed class labels in Section 4.2 rather than from overfitting; with the labels corrected, its accuracy is 75.54%.

Overall, Model 3 provides the best predictive performance among the three models, while Model 2 remains the easier one to interpret.

From a practical perspective, the findings suggest that organizations should consider workload factors, such as overtime, alongside compensation when addressing employee attrition. The results also emphasize the importance of evaluating models on unseen data and of checking which class the predicted probabilities refer to before converting them into labels.