Background

Employee attrition is one of the key problems affecting the productivity of companies worldwide. In this brief study, we will attempt to address this problem by utilizing machine learning algorithms to: (i) constructing a model to predict whether a particular employee will leave the company, (ii) identifying the key factors contributing to employee leaving the company, and (iii) generate key actionables to reduce employee attrition rate.

Due to the fact that HR datasets are highly confidential information and closely hold by companies, we will utilize a simulated dataset retrieved from Kaggle for this study as a proof of concept.

Data Import and Processing

We will first load the necessary dependencies.

library(ggplot2)
library(dplyr)
library(caret)
library(GGally)
library(plotly)
library(reshape2)

Load the dataset (assuming dataset has been downloaded to the working directory and named HR_raw.csv).

raw <- read.csv("HR_raw.csv", stringsAsFactors = FALSE)

A quick look at the data structure.

head(raw)
##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years sales salary
## 1                  3             0    1                     0 sales    low
## 2                  6             0    1                     0 sales medium
## 3                  4             0    1                     0 sales medium
## 4                  5             0    1                     0 sales    low
## 5                  3             0    1                     0 sales    low
## 6                  3             0    1                     0 sales    low
anyNA(raw)
## [1] FALSE

There are no missing values in the dataset, as such, no imputations are required.

Some data cleaning and formatting.

names(raw)[3] <- "no_projects"
names(raw)[4] <- "monthly_hours"
names(raw)[5] <- "years_in_co"
names(raw)[7] <- "turnover"
names(raw)[9] <- "dept"
raw$turnover <- as.factor(raw$turnover)
raw$dept <- as.factor(raw$dept)
raw$salary <- as.factor(raw$salary)
raw$promotion_last_5years <- as.factor(raw$promotion_last_5years)
raw$Work_accident <- as.factor(raw$Work_accident)

Exploratory Data Analysis

Exploratory data analysis is arguably the most important step in any data analysis project.

Basic Descriptive Analysis

We will first look at what is the scale and variables of the dataset.

dim(raw)
## [1] 14999    10
turnover_rate <- round(nrow(filter(raw, turnover=="1"))/length(raw$turnover),2)
turnover_rate
## [1] 0.24

The dataset has 14999 observations (employees) and 10 variables. The total turnover rate is 0.24. This will be our benchmark comparison. If a model only predicts that all employees will not leave the company, the model will still have a prediction accuracy of 0.76. As such, our prediction model’s accuracy shall be a lot higher than 0.76 to be useful. Similarly, the lower the turnover rate of a particular dataset, the harder it is to construct a useful prediction model.

The list of 10 variables are as below:-

names(raw)
##  [1] "satisfaction_level"    "last_evaluation"      
##  [3] "no_projects"           "monthly_hours"        
##  [5] "years_in_co"           "Work_accident"        
##  [7] "turnover"              "promotion_last_5years"
##  [9] "dept"                  "salary"

The variable of interest, turnover, represent whether an employee have left the company. Employees that have left the company are annotated as “1”.

Correlation Matrix

Now we will explore the relationship between variables using a correlation matrix.

ggpairs(data=raw, lower = list(continuous = "smooth"))

qplot(x=Var1, y=Var2, data=melt(cor(raw[,1:5])), 
      fill=value, geom="tile")+
  theme(axis.title.x = element_blank(), 
        axis.title.y = element_blank())

From the correlation matrix and correlation heatmap above, the below are some interesting observations:-

  • No strong correlation (|r| > 0.5) between variables

  • Slight positive correlation between number of projects and average monthly hours worked

  • Employees that left the company tend to have low satisfaction levels

Just by exploring the dataset alone, many interesting questions can be asked and investigated. However, we will just focus on our two objectives for this particular study:-

  • Objective 1: Construct a model to predict whether a particular employee will leave the company

  • Objective 2: Identify key factors contributing to employee attrition

  • Objective 3: Generate key actionables to reduce employee attrition

Objective 1: Construct Predictive Model

For the purpose of constructing and evaluating the prediction models, the dataset will be split into a training (80%) and testing (20%) dataset, with equal proportions of employees that has left the company in both datasets.

The training dataset will be used to construct (train) the model, whereas the testing dataset will be used to evaluate (test) the accuracy of the model constructed.

set.seed(1234)
inTrain <- createDataPartition(raw$turnover, p = 0.2, list = FALSE)
training <- raw[inTrain,]
testing <- raw[-inTrain,]

As the variable of interest, turnover, is binomial, this is a classification problem. We will construct three different models using different classification algorithms.

Model A: Logistic Regression Model

We will use logistic regression to construct our first model. Logistic regression classifies the categorical variable by estimating the probabilities via a cumulative logistic distribution. The advantages of using logistic regression is that it is easy to interpret. However, it usually does not perform as well as other modern classification algorithm.

fit1 <- train(turnover~., data = training, 
              method = "glm", family = "binomial", 
              preProc = c("center", "scale"))
summary(fit1)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0355  -0.6712  -0.4092  -0.1287   2.9066  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.569273   0.059201 -26.508  < 2e-16 ***
## satisfaction_level     -1.012157   0.054502 -18.571  < 2e-16 ***
## last_evaluation         0.148772   0.056794   2.620  0.00881 ** 
## no_projects            -0.336435   0.058641  -5.737 9.62e-09 ***
## monthly_hours           0.187094   0.057244   3.268  0.00108 ** 
## years_in_co             0.387602   0.050072   7.741 9.87e-15 ***
## Work_accident1         -0.419930   0.066427  -6.322 2.59e-10 ***
## promotion_last_5years1 -0.208927   0.072951  -2.864  0.00418 ** 
## depthr                  0.012936   0.061556   0.210  0.83354    
## deptIT                 -0.051573   0.072673  -0.710  0.47792    
## deptmanagement         -0.079910   0.068111  -1.173  0.24070    
## deptmarketing          -0.057048   0.067883  -0.840  0.40069    
## deptproduct_mng        -0.051702   0.066376  -0.779  0.43602    
## deptRandD              -0.123590   0.070681  -1.749  0.08037 .  
## deptsales              -0.043355   0.098001  -0.442  0.65821    
## deptsupport            -0.008532   0.084225  -0.101  0.91931    
## depttechnical          -0.032565   0.087701  -0.371  0.71040    
## salarylow               1.042516   0.145116   7.184 6.77e-13 ***
## salarymedium            0.693566   0.145587   4.764 1.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3295.5  on 3000  degrees of freedom
## Residual deviance: 2597.9  on 2982  degrees of freedom
## AIC: 2635.9
## 
## Number of Fisher Scoring iterations: 5

Now let’s evaluate Model A using the testing dataset.

pred1 <- predict(fit1, newdata = testing)
confusionMatrix(pred1, testing$turnover)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8489 1891
##          1  653  965
##                                           
##                Accuracy : 0.788           
##                  95% CI : (0.7805, 0.7952)
##     No Information Rate : 0.762           
##     P-Value [Acc > NIR] : 6.713e-12       
##                                           
##                   Kappa : 0.3131          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9286          
##             Specificity : 0.3379          
##          Pos Pred Value : 0.8178          
##          Neg Pred Value : 0.5964          
##              Prevalence : 0.7620          
##          Detection Rate : 0.7075          
##    Detection Prevalence : 0.8651          
##       Balanced Accuracy : 0.6332          
##                                           
##        'Positive' Class : 0               
## 

The accuracy of Model A is less than 80%, which is just slightly above the benchmark rate, 0.76. This is reflected by the low Kappa of less than 0.4.

As we are interested to correctly predict employees that are leaving the company, the metric Specificity (percentage of employees actually leaving that is predicted to be leaving) will also be one of our key evaluation metric. The Specificity of Model A is less than 40% (i.e. out of all the employees actually leaving the company, we only manage to predict less than 40% of them using this model).

As such, Model A is a less than desirable fit.

Model B: Support Vector Machine (SVM) Model

Support Vector Machines (SVMs) are a class of algorithms that performs classifications by mapping their input in high dimensional space and separating the inputs based on the gaps in the high dimensional space. The advantages of SVM is in general it has a high performance. However, the disadvantages of SVM is it will require time to train for larger datasets and is difficult to interpret the results.

We will use the RBF (Radial Basis Function) kernel and retain the defaults for most of the tuning parameters in constructing this model.

fit2 <- train(turnover~., data = training, 
              method = "svmRadial",
              preProc = c("center", "scale"))
fit2
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 3001 samples
##    9 predictor
##    2 classes: '0', '1' 
## 
## Pre-processing: centered (18), scaled (18) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3001, 3001, 3001, 3001, 3001, 3001, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.9035577  0.6948703
##   0.50  0.9314909  0.7938462
##   1.00  0.9457136  0.8411365
## 
## Tuning parameter 'sigma' was held constant at a value of 0.04024686
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 0.04024686 and C = 1.

Now let’s evaluate Model B using the testing dataset.

pred2 <- predict(fit2, newdata = testing)
confusionMatrix(pred2, testing$turnover)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9078  480
##          1   64 2376
##                                           
##                Accuracy : 0.9547          
##                  95% CI : (0.9508, 0.9583)
##     No Information Rate : 0.762           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8684          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9930          
##             Specificity : 0.8319          
##          Pos Pred Value : 0.9498          
##          Neg Pred Value : 0.9738          
##              Prevalence : 0.7620          
##          Detection Rate : 0.7566          
##    Detection Prevalence : 0.7966          
##       Balanced Accuracy : 0.9125          
##                                           
##        'Positive' Class : 0               
## 

Model B acchieves an accuracy of 95%, with specificity of 83% (able to identify 83% of employees leaving the company) and Kappa of 0.87. This is a good model for our purpose, relative to Model A.

Model C: Extreme Gradient Boosted Decision Trees (XGBoost) Model

XGBoost is part of a gradient boosting technique that produces a number of very weak models (typically decision trees) and combine multiple of these weak models into a strong prediction model. The advantages of XGBoost is of its general high performance and high scalability. However, the disadvantages of XGBoost is that its result will be difficult to interpret.

For simplicity sake, we will retain the most defaults of the tuning parameters in constructing this model.

fit3 <- train(turnover~., data = training, 
              method = "xgbTree",
              preProc = c("center", "scale"))

Let’s evaluate Model C using the testing dataset.

pred3 <- predict(fit3, newdata = testing)
confusionMatrix(pred3, testing$turnover)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9056  231
##          1   86 2625
##                                           
##                Accuracy : 0.9736          
##                  95% CI : (0.9705, 0.9764)
##     No Information Rate : 0.762           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9259          
##  Mcnemar's Test P-Value : 6.073e-16       
##                                           
##             Sensitivity : 0.9906          
##             Specificity : 0.9191          
##          Pos Pred Value : 0.9751          
##          Neg Pred Value : 0.9683          
##              Prevalence : 0.7620          
##          Detection Rate : 0.7548          
##    Detection Prevalence : 0.7740          
##       Balanced Accuracy : 0.9549          
##                                           
##        'Positive' Class : 0               
## 

Model C result outperformed Model A and Model B. The model acchieved 97% Accuracy, 92% Specificity and a Kappa of 0.92. As such, Model C will be selected as the prediction model for this study.

Note: further tuning and ensembling various models is likely to improve the accuracy of the model. However, the degree of accuracy required will be dependent on the purpose and resource available (e.g. processing time, computing power etc.)

Objective 2: Identify Key Factors Contributing To Employee Attrition

Due to its difficulty in interpreting the various variables in Model C, we will utilize an indirect method in selecting the variables which plays an “important” role in the classification.

Leveraging on our Model C constructed above, we can identify the key variables that explains the most variance of the dataset in the model.

var_imp3 <- varImp(fit3)
plot(var_imp3)

We can see that the following 5 variables (in order) plays the most important role:

  1. Employee Satisfaction Level
  2. Average Hours Worked Monthly
  3. Number of Years in Company
  4. Number of Projects
  5. Last Evaluation Score

We will look into each of these variables separately.

Satisfaction Level vs Attrition

plot_ly(data = raw, color = ~turnover, x = ~satisfaction_level, 
        type = "histogram", alpha = 0.7) %>%
  layout(barmode = "overlay") %>%
  layout(title = "Histogram of Employee Satisfaction Level",
         xaxis = list(title = "Employee Satisfaction Level"),
         yaxis = list(title = "No of Employees"))

Key observations:

  • Three distinct peaks at different satisfaction level for employees that left the company

  • Last peak at 0.7~0.9 satisfaction level is interesting, as it suggests that employees are highly satisfied at the jobs are also leaving the company.

  • Employees with satisfaction level below 0.5 is very likely to leave the company.

Let’s see whether the two distribution are statistically significant.

kruskal.test(raw$satisfaction_level~raw$turnover)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  raw$satisfaction_level by raw$turnover
## Kruskal-Wallis chi-squared = 2007.3, df = 1, p-value < 2.2e-16

Average Monthly Hours Worked vs Attrition

plot_ly(data = raw, color = ~turnover, x = ~monthly_hours, 
        type = "histogram", alpha = 0.7) %>%
  layout(barmode = "overlay") %>%
  layout(title = "Histogram of Average Hours Worked Monthly",
         xaxis = list(title = "Average Hours Worked Monthly"),
         yaxis = list(title = "No of Employees"))

Key Observations:

  • Two distinct peaks at the extreme ends of the average monthly hours worked for employees that left the company.

  • Employees that worked more than an average of 277 hours every month is highly likely to leave the company, with employees worked more than 292 hours all left the company.

Let’s see whether the two distribution are statistically significant.

kruskal.test(raw$monthly_hours~raw$turnover)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  raw$monthly_hours by raw$turnover
## Kruskal-Wallis chi-squared = 32.366, df = 1, p-value = 1.277e-08

Number of Years in Company vs Attrition

plot_ly(data = raw, color = ~turnover, x = ~years_in_co, 
        type = "histogram", alpha = 0.7) %>%
  layout(barmode = "overlay") %>%
  layout(title = "Histogram of No of Years Worked in Company",
         xaxis = list(title = "No of Years Worked in Company"),
         yaxis = list(title = "No of Employees"))

Key Observations:

  • 82% of employees worked for the company for less than 5 years.

  • 93% of employees that left the company worked between 3 to 5 years for the company.

  • Employees that worked for more than 7 years for the company did not leave the company.

Let’s see whether the two distribution are statistically significant.

kruskal.test(raw$years_in_co~raw$turnover)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  raw$years_in_co by raw$turnover
## Kruskal-Wallis chi-squared = 1084.3, df = 1, p-value < 2.2e-16

Number of Projects vs Attrition

plot_ly(data = raw, color = ~turnover, x = ~no_projects,
        type = "histogram", alpha = 0.7) %>%
  layout(barmode = "overlay") %>%
  layout(title = "Histogram of No of Projects Hold by Employee",
         xaxis = list(title = "No of Projects"),
         yaxis = list(title = "No of Employees"))

Key Observations:

  • Employees that has 2 or less projects are highly likely to leave the company.

  • Employees that has 6 or more projects are highly likely to leave the company, with all employees that held 7 projects left the company.

Let’s see whether the two distribution are statistically significant.

kruskal.test(raw$no_projects~raw$turnover)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  raw$no_projects by raw$turnover
## Kruskal-Wallis chi-squared = 5.7285, df = 1, p-value = 0.01669

Last Evaluation Score vs Attrition

plot_ly(data = raw, color = ~turnover, x = ~last_evaluation, 
        type = "histogram", alpha = 0.7) %>%
  layout(barmode = "overlay") %>%
  layout(title = "Histogram of Last Evaluation Score of Employees",
         xaxis = list(title = "Last Evaluation Score"),
         yaxis = list(title = "No of Employees"))

Key Observations:

  • Two distinct peaks towards the two end of the last evaluation scores of employees that left the company (low peak 0.45~0.57, high peak 0.77~1.00).

Let’s see whether the two distribution are statistically significant.

kruskal.test(raw$last_evaluation~raw$turnover)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  raw$last_evaluation by raw$turnover
## Kruskal-Wallis chi-squared = 0.089328, df = 1, p-value = 0.765

There is no significant differences between the distribution of last evaluation scores for employees that stayed and left.

Multivariate Comparison

plot_ly(data = raw, color = ~turnover, x = ~last_evaluation, 
        y = ~satisfaction_level, z = ~monthly_hours) %>%
  layout(title = "Last Evaluation Score vs Satisfaction Level vs Hours Worked")

There are three distinct clusters of employees that left the company:

  • Group 1 (unhappy limited performer): low evaluation score, low satisfaction level, low hours worked. based on their hours worked, would probably consists mostly of contract workers or part-time workers which have a high turnover in nature. May be subjected to other variables.

  • Group 2 (happy high performers): high evaluation score, high satisfaction level, high hours worked. Due to their high contribution (and sometimes ambition), are likely to leave the company to pursue a better opportunity presented (such as poaching/headhunting by other company).

  • Group 3 (unhappy high performers) : very high evaluation score, very low satisfaction level, very high hours worked. Likely to leave the company due to being overworked.

Objective 3: Identify Key Actionables

With the a good prediction model (Model C) and a basic understanding of the key factors contributing to employee attrition, we can now work on the actionables that the company can take to reduce the attrition rate.

Below is a proposal of a general step-by-step action plan:-

Aside from the periodic screening, managers can also keep a few key indicators in mind to reduce the possibility of an employee leaving the company:-

Further Actions

The granularity and specificity of actionables can be further improved by conducting a cluster analysis (developing a multinomial classification model) using the key factors for a better understanding of employee’s motivation to leave in the respective clusters. Different actionables can also be tested / evaluated on their efficiency by keeping detailed records and performing A/B testing. However, do keep in mind that dealing with employees will always require human engagement and there are certain inherent limitations by just using models and statistics. These tools as shown in this proof of concept can be used to as a complement and not a replacement to existing human resource activities.