Employee attrition is one of the key problems affecting the productivity of companies worldwide. In this brief study, we will address this problem by utilizing machine learning algorithms to: (i) construct a model to predict whether a particular employee will leave the company, (ii) identify the key factors contributing to employees leaving the company, and (iii) generate key actionables to reduce the employee attrition rate.
Because HR datasets contain highly confidential information and are closely held by companies, we will utilize a simulated dataset retrieved from Kaggle for this study as a proof of concept.
We will first load the necessary dependencies.
library(ggplot2)
library(dplyr)
library(caret)
library(GGally)
library(plotly)
library(reshape2)
Load the dataset (assuming the dataset has been downloaded to the working directory and named HR_raw.csv).
raw <- read.csv("HR_raw.csv", stringsAsFactors = FALSE)
A quick look at the data structure.
head(raw)
## satisfaction_level last_evaluation number_project average_montly_hours
## 1 0.38 0.53 2 157
## 2 0.80 0.86 5 262
## 3 0.11 0.88 7 272
## 4 0.72 0.87 5 223
## 5 0.37 0.52 2 159
## 6 0.41 0.50 2 153
## time_spend_company Work_accident left promotion_last_5years sales salary
## 1 3 0 1 0 sales low
## 2 6 0 1 0 sales medium
## 3 4 0 1 0 sales medium
## 4 5 0 1 0 sales low
## 5 3 0 1 0 sales low
## 6 3 0 1 0 sales low
anyNA(raw)
## [1] FALSE
There are no missing values in the dataset; as such, no imputation is required.
Some data cleaning and formatting.
names(raw)[3] <- "no_projects"
names(raw)[4] <- "monthly_hours"
names(raw)[5] <- "years_in_co"
names(raw)[7] <- "turnover"
names(raw)[9] <- "dept"
raw$turnover <- as.factor(raw$turnover)
raw$dept <- as.factor(raw$dept)
raw$salary <- as.factor(raw$salary)
raw$promotion_last_5years <- as.factor(raw$promotion_last_5years)
raw$Work_accident <- as.factor(raw$Work_accident)
Exploratory data analysis is arguably the most important step in any data analysis project.
We will first look at the scale and the variables of the dataset.
dim(raw)
## [1] 14999 10
turnover_rate <- round(mean(raw$turnover == "1"), 2)
turnover_rate
## [1] 0.24
The dataset has 14999 observations (employees) and 10 variables. The overall turnover rate is 0.24; this will be our benchmark for comparison. A model that simply predicts that no employee will leave would still achieve a prediction accuracy of 0.76, so our prediction model's accuracy must be substantially higher than 0.76 to be useful. Similarly, the lower the turnover rate of a particular dataset, the harder it is to construct a useful prediction model.
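As a quick cross-check, the class balance can also be tabulated directly (output omitted here):
# Proportion of employees that stayed ("0") vs left ("1")
prop.table(table(raw$turnover))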
The 10 variables are listed below:-
names(raw)
## [1] "satisfaction_level" "last_evaluation"
## [3] "no_projects" "monthly_hours"
## [5] "years_in_co" "Work_accident"
## [7] "turnover" "promotion_last_5years"
## [9] "dept" "salary"
The variable of interest, turnover, represents whether an employee has left the company. Employees who have left the company are annotated as "1".
Now we will explore the relationship between variables using a correlation matrix.
ggpairs(data=raw, lower = list(continuous = "smooth"))
qplot(x=Var1, y=Var2, data=melt(cor(raw[,1:5])),
fill=value, geom="tile")+
theme(axis.title.x = element_blank(),
axis.title.y = element_blank())
From the correlation matrix and correlation heatmap above, some interesting observations are:-
No strong correlation (|r| > 0.5) between variables
Slight positive correlation between number of projects and average monthly hours worked
Employees that left the company tend to have low satisfaction levels
Just by exploring the dataset alone, many interesting questions can be asked and investigated. However, we will focus on the three objectives of this particular study:-
Objective 1: Construct a model to predict whether a particular employee will leave the company
Objective 2: Identify key factors contributing to employee attrition
Objective 3: Generate key actionables to reduce employee attrition
For the purpose of constructing and evaluating the prediction models, the dataset will be split into a training (20%) set and a testing (80%) set, with equal proportions of employees who have left the company in both sets.
The training dataset will be used to construct (train) the model, whereas the testing dataset will be used to evaluate (test) the accuracy of the model constructed.
set.seed(1234)
inTrain <- createDataPartition(raw$turnover, p = 0.2, list = FALSE)
training <- raw[inTrain,]
testing <- raw[-inTrain,]
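As a sanity check (not part of the original output), we can confirm that createDataPartition preserved the turnover proportion in both sets:
# Both proportions should be close to the overall turnover rate of 0.24
round(prop.table(table(training$turnover)), 2)
round(prop.table(table(testing$turnover)), 2)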
As the variable of interest, turnover, is binary, this is a classification problem. We will construct three different models using different classification algorithms.
We will use logistic regression to construct our first model (Model A). Logistic regression classifies a categorical variable by estimating probabilities via a cumulative logistic distribution. Its main advantage is that it is easy to interpret; however, it usually does not perform as well as more modern classification algorithms.
fit1 <- train(turnover~., data = training,
method = "glm", family = "binomial",
preProc = c("center", "scale"))
summary(fit1)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0355 -0.6712 -0.4092 -0.1287 2.9066
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.569273 0.059201 -26.508 < 2e-16 ***
## satisfaction_level -1.012157 0.054502 -18.571 < 2e-16 ***
## last_evaluation 0.148772 0.056794 2.620 0.00881 **
## no_projects -0.336435 0.058641 -5.737 9.62e-09 ***
## monthly_hours 0.187094 0.057244 3.268 0.00108 **
## years_in_co 0.387602 0.050072 7.741 9.87e-15 ***
## Work_accident1 -0.419930 0.066427 -6.322 2.59e-10 ***
## promotion_last_5years1 -0.208927 0.072951 -2.864 0.00418 **
## depthr 0.012936 0.061556 0.210 0.83354
## deptIT -0.051573 0.072673 -0.710 0.47792
## deptmanagement -0.079910 0.068111 -1.173 0.24070
## deptmarketing -0.057048 0.067883 -0.840 0.40069
## deptproduct_mng -0.051702 0.066376 -0.779 0.43602
## deptRandD -0.123590 0.070681 -1.749 0.08037 .
## deptsales -0.043355 0.098001 -0.442 0.65821
## deptsupport -0.008532 0.084225 -0.101 0.91931
## depttechnical -0.032565 0.087701 -0.371 0.71040
## salarylow 1.042516 0.145116 7.184 6.77e-13 ***
## salarymedium 0.693566 0.145587 4.764 1.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3295.5 on 3000 degrees of freedom
## Residual deviance: 2597.9 on 2982 degrees of freedom
## AIC: 2635.9
##
## Number of Fisher Scoring iterations: 5
Now let’s evaluate Model A using the testing dataset.
pred1 <- predict(fit1, newdata = testing)
confusionMatrix(pred1, testing$turnover)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 8489 1891
## 1 653 965
##
## Accuracy : 0.788
## 95% CI : (0.7805, 0.7952)
## No Information Rate : 0.762
## P-Value [Acc > NIR] : 6.713e-12
##
## Kappa : 0.3131
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9286
## Specificity : 0.3379
## Pos Pred Value : 0.8178
## Neg Pred Value : 0.5964
## Prevalence : 0.7620
## Detection Rate : 0.7075
## Detection Prevalence : 0.8651
## Balanced Accuracy : 0.6332
##
## 'Positive' Class : 0
##
The accuracy of Model A is below 80%, only slightly above the 0.76 benchmark rate. This is reflected in the low Kappa of less than 0.4.
As we are interested in correctly predicting employees who are leaving the company, Specificity (the percentage of employees actually leaving who are predicted to be leaving) will also be a key evaluation metric. The Specificity of Model A is below 40% (i.e. out of all the employees actually leaving the company, the model correctly identifies fewer than 40%).
As such, Model A is a less than desirable fit.
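Note that caret treats the first factor level ("0", stayed) as the positive class by default, as shown in the output above. If preferred, the same confusion matrix can be computed with "1" (left) as the positive class, so that Sensitivity directly reports the proportion of leavers caught:
# Report metrics with "left" as the positive class
confusionMatrix(pred1, testing$turnover, positive = "1")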
Support Vector Machines (SVMs) are a class of algorithms that perform classification by mapping inputs into a high-dimensional space and separating the classes along the gaps between them in that space. SVMs generally offer high performance; however, they can take a long time to train on larger datasets, and their results are difficult to interpret.
We will use the RBF (Radial Basis Function) kernel and retain the defaults for most of the tuning parameters in constructing this model (Model B).
fit2 <- train(turnover~., data = training,
method = "svmRadial",
preProc = c("center", "scale"))
fit2
## Support Vector Machines with Radial Basis Function Kernel
##
## 3001 samples
## 9 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (18), scaled (18)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3001, 3001, 3001, 3001, 3001, 3001, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.9035577 0.6948703
## 0.50 0.9314909 0.7938462
## 1.00 0.9457136 0.8411365
##
## Tuning parameter 'sigma' was held constant at a value of 0.04024686
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.04024686 and C = 1.
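If tighter control over tuning is desired, an explicit resampling scheme and search grid can be supplied to train. A minimal sketch, with illustrative (untuned) sigma and C values:
# 5-fold cross-validation over an explicit sigma/C grid
ctrl <- trainControl(method = "cv", number = 5)
svm_grid <- expand.grid(sigma = c(0.02, 0.04, 0.08),
                        C = c(0.5, 1, 2, 4))
fit2_tuned <- train(turnover ~ ., data = training,
                    method = "svmRadial",
                    preProc = c("center", "scale"),
                    trControl = ctrl,
                    tuneGrid = svm_grid)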
Now let’s evaluate Model B using the testing dataset.
pred2 <- predict(fit2, newdata = testing)
confusionMatrix(pred2, testing$turnover)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9078 480
## 1 64 2376
##
## Accuracy : 0.9547
## 95% CI : (0.9508, 0.9583)
## No Information Rate : 0.762
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8684
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9930
## Specificity : 0.8319
## Pos Pred Value : 0.9498
## Neg Pred Value : 0.9738
## Prevalence : 0.7620
## Detection Rate : 0.7566
## Detection Prevalence : 0.7966
## Balanced Accuracy : 0.9125
##
## 'Positive' Class : 0
##
Model B achieves an accuracy of 95%, with a specificity of 83% (able to identify 83% of employees leaving the company) and a Kappa of 0.87. Relative to Model A, this is a good model for our purpose.
XGBoost is a gradient boosting technique that produces a number of weak models (typically decision trees) and combines them into a strong prediction model. Its advantages are high performance and high scalability; its disadvantage is that the resulting model is difficult to interpret.
For simplicity's sake, we will retain the defaults for most of the tuning parameters in constructing this model (Model C).
fit3 <- train(turnover~., data = training,
method = "xgbTree",
preProc = c("center", "scale"))
Let’s evaluate Model C using the testing dataset.
pred3 <- predict(fit3, newdata = testing)
confusionMatrix(pred3, testing$turnover)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9056 231
## 1 86 2625
##
## Accuracy : 0.9736
## 95% CI : (0.9705, 0.9764)
## No Information Rate : 0.762
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9259
## Mcnemar's Test P-Value : 6.073e-16
##
## Sensitivity : 0.9906
## Specificity : 0.9191
## Pos Pred Value : 0.9751
## Neg Pred Value : 0.9683
## Prevalence : 0.7620
## Detection Rate : 0.7548
## Detection Prevalence : 0.7740
## Balanced Accuracy : 0.9549
##
## 'Positive' Class : 0
##
Model C outperformed both Model A and Model B, achieving 97% accuracy, 92% specificity, and a Kappa of 0.93. As such, Model C will be selected as the prediction model for this study.
Note: further tuning and ensembling of various models would likely improve the accuracy of the model. However, the degree of accuracy required depends on the purpose and the resources available (e.g. processing time, computing power, etc.).
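As an illustration, caret's xgbTree method exposes seven tuning parameters that can be searched over a grid with cross-validation. A minimal sketch with illustrative values:
# 5-fold CV over a small grid; all seven xgbTree parameters
# must be present in the tuning grid
ctrl <- trainControl(method = "cv", number = 5)
xgb_grid <- expand.grid(nrounds = c(100, 200),
                        max_depth = c(3, 6),
                        eta = c(0.1, 0.3),
                        gamma = 0,
                        colsample_bytree = 0.8,
                        min_child_weight = 1,
                        subsample = 0.8)
fit3_tuned <- train(turnover ~ ., data = training,
                    method = "xgbTree",
                    trControl = ctrl,
                    tuneGrid = xgb_grid)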
As the individual variables in Model C are difficult to interpret directly, we will utilize an indirect method to select the variables that play an "important" role in the classification.
Leveraging Model C constructed above, we can identify the key variables that contribute the most to the model's predictions.
var_imp3 <- varImp(fit3)
plot(var_imp3)
We can see that the following five variables (in order) play the most important roles: satisfaction_level, monthly_hours, years_in_co, no_projects, and last_evaluation.
We will look into each of these variables separately.
plot_ly(data = raw, color = ~turnover, x = ~satisfaction_level,
type = "histogram", alpha = 0.7) %>%
layout(barmode = "overlay") %>%
layout(title = "Histogram of Employee Satisfaction Level",
xaxis = list(title = "Employee Satisfaction Level"),
yaxis = list(title = "No of Employees"))
Key observations:
Three distinct peaks at different satisfaction levels for employees that left the company
The last peak, at a satisfaction level of 0.7~0.9, is interesting, as it suggests that employees who are highly satisfied with their jobs are also leaving the company.
Employees with a satisfaction level below 0.5 are very likely to leave the company.
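The 0.5 cut-off can be checked directly, for example (output omitted):
# Compare turnover rates below and above the 0.5 satisfaction mark
raw %>%
  mutate(low_satisfaction = satisfaction_level < 0.5) %>%
  group_by(low_satisfaction) %>%
  summarise(n = n(), turnover_rate = round(mean(turnover == "1"), 2))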
Let’s see whether the difference between the two distributions is statistically significant.
kruskal.test(raw$satisfaction_level~raw$turnover)
##
## Kruskal-Wallis rank sum test
##
## data: raw$satisfaction_level by raw$turnover
## Kruskal-Wallis chi-squared = 2007.3, df = 1, p-value < 2.2e-16
plot_ly(data = raw, color = ~turnover, x = ~monthly_hours,
type = "histogram", alpha = 0.7) %>%
layout(barmode = "overlay") %>%
layout(title = "Histogram of Average Hours Worked Monthly",
xaxis = list(title = "Average Hours Worked Monthly"),
yaxis = list(title = "No of Employees"))
Key Observations:
Two distinct peaks at the extreme ends of average monthly hours worked for employees that left the company.
Employees who worked more than an average of 277 hours per month are highly likely to leave the company; all employees who worked more than 292 hours left.
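These cut-offs can be verified directly, for example (output omitted):
# Turnover rate among employees averaging more than 277 hours a month
raw %>%
  filter(monthly_hours > 277) %>%
  summarise(n = n(), turnover_rate = round(mean(turnover == "1"), 2))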
Let’s see whether the difference between the two distributions is statistically significant.
kruskal.test(raw$monthly_hours~raw$turnover)
##
## Kruskal-Wallis rank sum test
##
## data: raw$monthly_hours by raw$turnover
## Kruskal-Wallis chi-squared = 32.366, df = 1, p-value = 1.277e-08
plot_ly(data = raw, color = ~turnover, x = ~years_in_co,
type = "histogram", alpha = 0.7) %>%
layout(barmode = "overlay") %>%
layout(title = "Histogram of No of Years Worked in Company",
xaxis = list(title = "No of Years Worked in Company"),
yaxis = list(title = "No of Employees"))
Key Observations:
82% of employees have worked for the company for less than 5 years.
93% of employees that left the company had worked for the company for between 3 and 5 years.
Employees who worked for the company for more than 7 years did not leave.
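These proportions can be reproduced directly, for example (output omitted):
# Tenure distribution among employees that left
raw %>%
  filter(turnover == "1") %>%
  count(years_in_co) %>%
  mutate(prop = round(n / sum(n), 2))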
Let’s see whether the difference between the two distributions is statistically significant.
kruskal.test(raw$years_in_co~raw$turnover)
##
## Kruskal-Wallis rank sum test
##
## data: raw$years_in_co by raw$turnover
## Kruskal-Wallis chi-squared = 1084.3, df = 1, p-value < 2.2e-16
plot_ly(data = raw, color = ~turnover, x = ~no_projects,
type = "histogram", alpha = 0.7) %>%
layout(barmode = "overlay") %>%
layout(title = "Histogram of No of Projects Hold by Employee",
xaxis = list(title = "No of Projects"),
yaxis = list(title = "No of Employees"))
Key Observations:
Employees with 2 or fewer projects are highly likely to leave the company.
Employees with 6 or more projects are highly likely to leave the company; all employees who held 7 projects left.
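A quick tabulation confirms the pattern at both extremes (output omitted):
# Turnover rate by number of projects
raw %>%
  group_by(no_projects) %>%
  summarise(n = n(), turnover_rate = round(mean(turnover == "1"), 2))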
Let’s see whether the difference between the two distributions is statistically significant.
kruskal.test(raw$no_projects~raw$turnover)
##
## Kruskal-Wallis rank sum test
##
## data: raw$no_projects by raw$turnover
## Kruskal-Wallis chi-squared = 5.7285, df = 1, p-value = 0.01669
plot_ly(data = raw, color = ~turnover, x = ~last_evaluation,
type = "histogram", alpha = 0.7) %>%
layout(barmode = "overlay") %>%
layout(title = "Histogram of Last Evaluation Score of Employees",
xaxis = list(title = "Last Evaluation Score"),
yaxis = list(title = "No of Employees"))
Key Observations:
The distributions of last evaluation scores for employees that stayed and left largely overlap, with no clear separation.
Let’s see whether the difference between the two distributions is statistically significant.
kruskal.test(raw$last_evaluation~raw$turnover)
##
## Kruskal-Wallis rank sum test
##
## data: raw$last_evaluation by raw$turnover
## Kruskal-Wallis chi-squared = 0.089328, df = 1, p-value = 0.765
There is no significant difference between the distributions of last evaluation scores for employees that stayed and those that left.
plot_ly(data = raw, color = ~turnover, x = ~last_evaluation,
        y = ~satisfaction_level, z = ~monthly_hours,
        type = "scatter3d", mode = "markers") %>%
  layout(title = "Last Evaluation Score vs Satisfaction Level vs Hours Worked")
There are three distinct clusters of employees that left the company:
Group 1 (unhappy limited performers): low evaluation scores, low satisfaction levels, low hours worked. Based on their hours worked, this group probably consists mostly of contract or part-time workers, who have high turnover by nature; their attrition may also be driven by other variables.
Group 2 (happy high performers): high evaluation scores, high satisfaction levels, high hours worked. Given their high contribution (and sometimes ambition), they are likely to leave the company to pursue better opportunities presented to them (such as poaching/headhunting by another company).
Group 3 (unhappy high performers): very high evaluation scores, very low satisfaction levels, very high hours worked. Likely to leave the company due to being overworked.
With a good prediction model (Model C) and a basic understanding of the key factors contributing to employee attrition, we can now work on the actionables the company can take to reduce its attrition rate.
Below is a proposal of a general step-by-step action plan (a minimal code sketch follows the list):-
The company runs a screening using the Prediction Model every quarter (or any time period as it deems fit)
The Prediction Model will list out employees who are predicted to leave the company
Employees who are predicted to leave are then plotted in the 3D scatterplot above and identified as belonging to Group 1, Group 2, Group 3, or others.
For Group 1, as there are no specific traits, the company will have to investigate further into the reasons a particular employee may leave, or it can factor this expected turnover into its hiring plans.
For Group 2, the company may want to conduct a performance review with the employee, and potentially offer better opportunities and/or growth for the employee to retain them.
For Group 3, the company can reduce the number of projects or hours worked to address their dissatisfaction from being overworked.
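As a minimal sketch of steps 1 to 3, the screening could look like the code below. The group cut-offs used here are illustrative assumptions only and would need to be calibrated against the clusters in the 3D scatterplot above:
# Screen current employees (those who have not left) with Model C
current <- filter(raw, turnover == "0")
at_risk <- current[predict(fit3, newdata = current) == "1", ]
# Assign at-risk employees to groups; cut-offs are assumed, not derived
at_risk <- at_risk %>%
  mutate(group = case_when(
    last_evaluation < 0.6 & satisfaction_level < 0.5 ~ "Group 1",
    last_evaluation > 0.8 & satisfaction_level > 0.7 ~ "Group 2",
    last_evaluation > 0.8 & satisfaction_level < 0.2 ~ "Group 3",
    TRUE ~ "Others"
  ))
table(at_risk$group)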
Aside from the periodic screening, managers can also keep a few key indicators in mind to reduce the possibility of an employee leaving the company (a simple flag encoding these indicators is sketched after the list):-
The number of projects under each employee should be kept below 6
Try to prevent employees from working more than 277 hours on average per month
Look out for employees with a satisfaction level below 0.5
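These rules of thumb can also be encoded as a simple watch-list flag, for example:
# Flag current employees matching any of the three indicators
watch_list <- raw %>%
  filter(turnover == "0",
         no_projects >= 6 |
           monthly_hours > 277 |
           satisfaction_level < 0.5)
nrow(watch_list)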
The granularity and specificity of the actionables can be further improved by conducting a cluster analysis (i.e. developing a multinomial classification model) on the key factors, for a better understanding of employees' motivations to leave within each cluster. Different actionables can also be tested and evaluated for their effectiveness by keeping detailed records and performing A/B testing. However, keep in mind that dealing with employees will always require human engagement, and there are inherent limitations to relying only on models and statistics. The tools shown in this proof of concept should be used as a complement to, not a replacement for, existing human resource activities.