Employees are the backbone of any organization. They are a company’s biggest investment and a key source of revenue. Since a company invests a large amount of time and money in acquiring, training and equipping employees with the needed skills and expertise, keeping employees satisfied and retaining them for many years is an ideal goal for any company.
In this project, I use a simulated dataset of employee information to analyse and draw insights on why “valuable” employees are leaving a company. Then, I build three different prediction models and choose the best one to predict whether an employee will leave the company. I believe that identifying these factors will help a company get to the root cause of employee attrition. Furthermore, with the help of the prediction model, a company can take steps to prevent the next employee from leaving. Please refer to the Data Preparation tab for details on the dataset used.
library(dplyr)
library(knitr)
library(tidyr)
library(DT)
library(caret)
library(glmnet)
library(ggplot2)
library(plotly)
library(gridExtra)
library(e1071)
library(ranger)
dplyr - A fast, consistent tool for working with data-frame-like objects, both in memory and out of memory. We use it in our project to transform data (joining, summarizing, etc.)
knitr - A General-Purpose Package for Dynamic Report Generation in R
tidyr - Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions
DT - Data objects in R can be rendered as HTML tables using the JavaScript library ‘DataTables’. The package name ‘DT’ is an abbreviation of ‘DataTables’
caret - Provides functions for training and plotting classification and regression models
glmnet - Used for building efficient lasso and ridge regression models
ggplot2 - A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”
gridExtra - Used for arranging multiple ggplots in a grid layout
e1071 - Miscellaneous statistical functions (SVM, naive Bayes and others); the caret package depends on it for some of its model-fitting functions
ranger - The caret package uses this package for building random forests. It is a fast implementation of Random Forests, particularly suited for high dimensional data
Before we can understand why employees leave a company, we need to acquire and clean the data. The data used for this project is the Human Resources Analytics dataset on Kaggle. As per the source, this dataset was simulated and contains 10 variables and 14999 rows of data. It reports metrics such as employee satisfaction level, last evaluation, number of projects, salary, etc. The full data and documentation can be found at the Kaggle source. The data dictionary below also explains each variable.
#Importing csv file
empData <- read.csv("HR_comma_sep.csv",header = TRUE, stringsAsFactors = FALSE)
Once the data has been imported, we can now understand the variables and create a data dictionary.
| Variable | Datatype | Description |
|---|---|---|
| satisfaction_level | Numeric | Value between 0 and 1 |
| last_evaluation | Numeric | Value between 0 and 1 |
| number_project | Integer | No. of projects the employee has worked on |
| average_montly_hours | Integer | Average hours an employee works per month |
| time_spend_company | Integer | No. of years spent in the company |
| Work_accident | Integer | Boolean value 0 or 1 indicating if an employee had an accident at work |
| left | Integer | Boolean value 0 or 1 indicating if an employee left the company |
| promotion_last_5years | Integer | Boolean value 0 or 1 indicating if an employee was promoted in the last 5 years |
| sales | Character | The department an employee belongs to |
| salary | Character | Categorical value indicating if the salary is low, medium or high |
First and foremost, we see that the variable ‘sales’ actually represents the department an employee belongs to, such as marketing, accounting, hr, support, etc. So we rename the variable to ‘Dept’ to make it more intuitive.
colnames(empData)[9] <- "Dept" #changing col name
We check whether there are any missing values in the data:
sum(is.na(empData)) #finding if there are any NAs in the data
## [1] 0
There are 0 missing values in the data.
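A single total is enough here because it is zero, but a per-column count is often more useful when a dataset does have gaps. The sketch below illustrates the difference on a small made-up data frame; `toy` and its values are hypothetical and not part of the HR data.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(
  satisfaction_level = c(0.38, NA, 0.11),
  salary = c("low", "medium", NA),
  stringsAsFactors = FALSE
)

total_na <- sum(is.na(toy))           # one number for the whole table
na_per_column <- colSums(is.na(toy))  # named vector, one count per column

print(total_na)
print(na_per_column)
```

Running `colSums(is.na(empData))` on the real data would give the same breakdown per variable.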
The data as such looks tidy, with one variable per column and each row representing an observation.
The goal of our EDA is to find out why employees are leaving the company and then why “good” employees in particular are leaving. We analyse whether the factors observed differ between valuable employees (defined below) and less valuable ones. In most parts of this project, we use tables, histograms and barplots for the purpose of our EDA.
Filtering the employees that left the company, we find that out of the 14999 total employees, 3571 have left the company.
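The filtering code itself is not shown in this report. A minimal sketch of the step, assuming it is a simple row filter on `left == 1`, could look like the following; `toy` is a hypothetical stand-in for `empData`, and base R subsetting is used so that the sketch needs no packages.

```r
# Hypothetical stand-in for empData with only the columns needed here
toy <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72),
  left = c(1, 0, 1, 0)
)

# Keep only the rows where the employee left (left == 1)
leavers <- toy[toy$left == 1, ]

n_left <- nrow(leavers)           # number of employees that left
share_left <- n_left / nrow(toy)  # attrition rate in this toy sample

print(n_left)
print(share_left)
```

With dplyr loaded, `filter(empData, left == 1)` expresses the same filter. Using the counts reported in this section, 3571 / 14999 is about 0.238, so roughly 24% of employees left.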
Below we visualize each variable in our dataset to see what’s going on
From the above plots, we can observe the following:
We define a good employee as one who has been in the company for more than four years, has worked on four or more projects, or has a last evaluation greater than or equal to 0.7.
Filtering the good employees that left and selecting only important rows: out of the 3571 employees that left, 2020 were valuable (good) employees.
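The definition of a good employee can be written as a logical flag. The sketch below assumes the three criteria are alternatives combined with a logical OR, which is one reading of the definition; the original filter is not shown, so that combination is an assumption. `toy` is a hypothetical stand-in for `empData`.

```r
# Hypothetical stand-in for empData with only the columns needed here
toy <- data.frame(
  time_spend_company = c(5, 2, 3, 6),
  number_project     = c(2, 4, 3, 7),
  last_evaluation    = c(0.50, 0.55, 0.60, 0.90),
  left               = c(1, 1, 1, 0)
)

# ASSUMPTION: the three criteria are alternatives (logical OR)
is_good <- toy$time_spend_company > 4 |
  toy$number_project >= 4 |
  toy$last_evaluation >= 0.7

# Good employees that also left the company
good_leavers <- toy[is_good & toy$left == 1, ]
n_good_left <- nrow(good_leavers)

print(is_good)
print(n_good_left)
```

If the criteria were meant to hold jointly, replacing `|` with `&` gives the stricter definition, and the resulting counts would differ.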
Below we visualize each variable in the dataset pertaining to “good employees” to see what’s going on. Note that here I’m using the hist() and barplot() functions from base R instead of ggplot2 functions.
We can observe the following from the above graphs:
We can see that the majority of the good employees that left had very low levels of satisfaction.
We can see that the majority of the good employees that left belonged to the sales, support and technical departments.
We can see that the majority of the good employees that left were not promoted in the last 5 years.
We can see that the majority of the good employees that left were overworked and spent many hours in the company.
Hence, we can see that valuable employees left because they were overworked, not promoted in the last 5 years, in the low salary level and dissatisfied. We have also found that most of the employees leaving are from the sales, technical and support departments.
Now we will analyse the sales, technical and support departments to see what is going on there:
## # A tibble: 10 x 7
## Var1 high low medium totalEmp PercentLow PercentHigh
## <chr> <int> <int> <int> <int> <dbl> <dbl>
## 1 accounting 74 358 335 767 46.67536 9.647979
## 2 hr 45 335 359 739 45.33153 6.089310
## 3 IT 83 609 535 1227 49.63325 6.764466
## 4 management 225 180 225 630 28.57143 35.714286
## 5 marketing 80 402 376 858 46.85315 9.324009
## 6 product_mng 68 451 383 902 50.00000 7.538803
## 7 RandD 51 364 372 787 46.25159 6.480305
## 8 sales 269 2099 1772 4140 50.70048 6.497585
## 9 support 141 1146 942 2229 51.41319 6.325707
## 10 technical 201 1372 1147 2720 50.44118 7.389706
We can see that in the sales, technical and support departments, around 50% of the employees are in the low salary level, which may be one reason employees left.
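The percentage columns in the salary table can be reproduced from the raw counts. The sketch below copies the counts for the three departments from the table and recomputes `totalEmp`, `PercentLow` and `PercentHigh`; the data frame name `salary_counts` is illustrative, and this is not necessarily how the original table was built.

```r
# Counts copied from the salary table above for the three departments
salary_counts <- data.frame(
  Dept   = c("sales", "support", "technical"),
  high   = c(269, 141, 201),
  low    = c(2099, 1146, 1372),
  medium = c(1772, 942, 1147),
  stringsAsFactors = FALSE
)

# Total headcount per department, then the share in each salary band
salary_counts$totalEmp    <- with(salary_counts, high + low + medium)
salary_counts$PercentLow  <- 100 * salary_counts$low  / salary_counts$totalEmp
salary_counts$PercentHigh <- 100 * salary_counts$high / salary_counts$totalEmp

print(salary_counts)
```

For sales, this gives 2099 / 4140, which matches the 50.70048 reported in the table.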
## # A tibble: 10 x 4
## Dept `Not Promoted` Promoted PercentPromoted
## <chr> <int> <int> <dbl>
## 1 accounting 753 14 1.8252934
## 2 hr 724 15 2.0297700
## 3 IT 1224 3 0.2444988
## 4 management 561 69 10.9523810
## 5 marketing 815 43 5.0116550
## 6 product_mng 902 0 0.0000000
## 7 RandD 760 27 3.4307497
## 8 sales 4040 100 2.4154589
## 9 support 2209 20 0.8972633
## 10 technical 2692 28 1.0294118
We can see that the sales, technical and support departments have each promoted only a very small percentage of their employees in the last 5 years, which may be one reason employees left.
One thing to note here is that, contrary to our expectation, the satisfaction level of the employees in all these 3 departments is not very low.
Plot showing employees’ work hours in the support, sales and technical departments:
We can see that the majority of the employees in the 3 departments were overworked, which may be one reason they left.
Additionally we can also use boxplots to check for outliers and find interesting relationships as follows:
Following are the observations that we can make:
From all the above analyses, we can see that the situation in these 3 departments aligns with the observations we made on why employees are leaving.
In this section we build 3 different prediction models and choose the best one amongst them. We predict whether a person left the company using the binary response variable ‘left’, and use all the other variables as explanatory variables. Throughout this section we make use of the caret package, which provides many functions to construct different model types and tune their parameters.
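One preparation step is not shown in this report: the data dictionary lists ‘left’ as an integer, while caret treats a numeric response as a regression target, and computing class probabilities requires factor levels that are valid R names. The sketch below shows one way to do the conversion. The labels `"stayed"` and `"left"` are illustrative choices, not names taken from the original analysis, and `toy` is a hypothetical stand-in for `empData`.

```r
# Hypothetical stand-in for the response column of empData
toy <- data.frame(left = c(0, 1, 0, 0, 1))

# Convert 0/1 to a two-level factor with valid R names as labels
# (the label names are illustrative, not from the original analysis)
toy$left <- factor(toy$left, levels = c(0, 1), labels = c("stayed", "left"))

print(levels(toy$left))
print(table(toy$left))
```

The order of the levels matters for reading the results: caret's twoClassSummary reports sensitivity for the first factor level and specificity for the second.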
We first create a shared custom trainControl object to use across all three models, as we need a fair comparison on the same train/test splits. We use createFolds to build the folds so that every model is evaluated on exactly the same cross-validation folds.
myFolds <- createFolds(empData$left,k=5) #creating folds
myControl <- trainControl(summaryFunction=twoClassSummary,classProbs = TRUE,verboseIter = TRUE, index=myFolds) #custom control object to be used across all models
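A note on the fold indices: by default createFolds() returns the held-out rows of each fold, while the index argument of trainControl() expects the rows used for training. Passed as above, each resample therefore fits on roughly one fifth of the rows and is evaluated on the remaining four fifths; createFolds(..., returnTrain = TRUE) would give the more common 80/20 split. The base R sketch below does not call caret. It only illustrates the two kinds of index lists for 10 rows and 5 folds, and the names `fold_id`, `held_out` and `train_idx` are illustrative.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
n <- 10
k <- 5

# Assign each row to one of k folds at random
fold_id <- sample(rep(seq_len(k), length.out = n))

# Held-out indices: the rows belonging to each fold
held_out <- split(seq_len(n), fold_id)

# Training indices: every row NOT in the fold
train_idx <- lapply(held_out, function(idx) setdiff(seq_len(n), idx))

print(lengths(held_out))   # 2 rows per fold
print(lengths(train_idx))  # 8 rows per fold
```

Either choice keeps the comparison fair, since all models share the same splits; it only changes how much data each fit sees.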
As this is a classic classification problem with a binary response variable, we first build a logistic regression model
model_glm <- train(left ~ ., data=empData,method="glm",trControl=myControl) #using train and shared control object myControl to build a logistic regression model
Model summary:
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2248 -0.6645 -0.4026 -0.1177 3.0688
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4762862 0.1938373 -7.616 2.61e-14 ***
## satisfaction_level -4.1356889 0.0980538 -42.178 < 2e-16 ***
## last_evaluation 0.7309032 0.1491787 4.900 9.61e-07 ***
## number_project -0.3150787 0.0213248 -14.775 < 2e-16 ***
## average_montly_hours 0.0044603 0.0005161 8.643 < 2e-16 ***
## time_spend_company 0.2677537 0.0155736 17.193 < 2e-16 ***
## Work_accident -1.5298283 0.0895473 -17.084 < 2e-16 ***
## promotion_last_5years -1.4301364 0.2574958 -5.554 2.79e-08 ***
## Depthr 0.2323779 0.1313084 1.770 0.07678 .
## DeptIT -0.1807179 0.1221276 -1.480 0.13894
## Deptmanagement -0.4484236 0.1598254 -2.806 0.00502 **
## Deptmarketing -0.0120882 0.1319304 -0.092 0.92700
## Deptproduct_mng -0.1532529 0.1301538 -1.177 0.23901
## DeptRandD -0.5823659 0.1448848 -4.020 5.83e-05 ***
## Deptsales -0.0387859 0.1024006 -0.379 0.70486
## Deptsupport 0.0500251 0.1092834 0.458 0.64713
## Depttechnical 0.0701464 0.1065379 0.658 0.51027
## salarylow 1.9440627 0.1286272 15.114 < 2e-16 ***
## salarymedium 1.4132244 0.1293534 10.925 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16465 on 14998 degrees of freedom
## Residual deviance: 12850 on 14980 degrees of freedom
## AIC: 12888
##
## Number of Fisher Scoring iterations: 5
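Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies three coefficients from the summary above and assumes the modelled event is leaving the company, as the signs of the coefficients suggest.

```r
# Coefficients copied from the model summary above
coefs <- c(
  satisfaction_level = -4.1356889,
  time_spend_company =  0.2677537,
  salarylow          =  1.9440627
)

# exp() turns log-odds coefficients into odds ratios
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

Holding the other variables fixed, the odds for a low-salary employee are about seven times those of a high-salary employee (the baseline salary level), while a one-unit increase in satisfaction_level multiplies the odds by about 0.016.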
We next build a decision tree, which represents the data in a visual format that enables us to understand it quickly and make a prediction.
model_dtree <- train(left ~., data = empData, method = "rpart",trControl=myControl,
tuneLength = 10,
parms=list(split='information')) ##using train and shared control object myControl to build a decision tree
plot(model_dtree)
Finally, we build a random forest. This is a widely used model which grows a multitude of decision trees and combines these weak learners, by aggregating their predictions, into a stronger ensemble model.
model_rf <- train(left ~ .,data = empData,method = "ranger",trControl=myControl,metric="ROC")
plot(model_rf)
### Finding the best model
We compare the three models to find the best performing one by summarizing the distributions of the resampling results as a boxplot, an xyplot and a dotplot. We choose the model with the highest AUC (area under the ROC curve) as our best model.
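To make the AUC concrete: it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The base R sketch below computes it from ranks. The function name `auc_rank` and the scores are hypothetical, and caret computes the value reported as ROC with its own routines rather than with this function.

```r
# AUC as the probability that a random positive outranks a random negative
auc_rank <- function(scores, labels) {
  # labels: logical vector, TRUE for the positive class
  r <- rank(scores)  # average ranks handle ties
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true outcomes
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

auc_value <- auc_rank(scores, labels)
print(auc_value)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 corresponds to random ordering and 1 to perfect separation.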
Creating a list of the models and collecting the results
model_list <- list(glm = model_glm,rf = model_rf,tree = model_dtree) #creating a list of the above 3 models
resamps <- resamples(model_list) #collecting results
Summarizing the results:
summary(resamps) #summarizing results
##
## Call:
## summary.resamples(object = resamps)
##
## Models: glm, rf, tree
## Number of resamples: 5
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.8090899 0.8134251 0.8194887 0.8169878 0.8204509 0.8224845 0
## rf 0.9807052 0.9829490 0.9838707 0.9835975 0.9851353 0.9853275 0
## tree 0.9671087 0.9681915 0.9685040 0.9690591 0.9698746 0.9716163 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.9210238 0.9216887 0.9275870 0.9258838 0.9294465 0.9296730 0
## rf 0.9960621 0.9962813 0.9964997 0.9966311 0.9966094 0.9977029 0
## tree 0.9850142 0.9865456 0.9881877 0.9889525 0.9916867 0.9933282 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.2775639 0.3395170 0.3524676 0.3557851 0.3907563 0.4186209 0
## rf 0.9173959 0.9198179 0.9212461 0.9207504 0.9215961 0.9236962 0
## tree 0.9086454 0.9096639 0.9124956 0.9123493 0.9128456 0.9180959 0
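The comparison can also be read directly off the mean values in this summary. The sketch below copies the mean ROC, Sens and Spec per model from the output above into a small data frame (the name `results` is illustrative) and picks the model with the highest mean ROC.

```r
# Mean resampled metrics copied from the summary above
results <- data.frame(
  model = c("glm", "rf", "tree"),
  ROC   = c(0.8169878, 0.9835975, 0.9690591),
  Sens  = c(0.9258838, 0.9966311, 0.9889525),
  Spec  = c(0.3557851, 0.9207504, 0.9123493),
  stringsAsFactors = FALSE
)

# Model with the highest mean ROC (AUC)
best_model <- results$model[which.max(results$ROC)]

print(results[order(-results$ROC), ])
print(best_model)
```

The logistic regression also stands out for its low mean specificity (about 0.36), which the tree-based models do not share.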
Next we draw the plots to choose the model with the highest AUC
In each of these plots, we can see that the ROC value is the highest for the Random Forest model and is greater than 0.95 (its mean ROC in the summary above is about 0.98).
**From the above plots we can clearly see that the AUC value is the highest for the Random Forest model. Hence we choose this as our best model.**
About the data
In this project we used the Human Resources Analytics dataset available on Kaggle. As per the source, this dataset was simulated and contains 10 variables and 14999 rows of data. It reports metrics such as employee satisfaction level, last evaluation, number of projects, salary, etc. The full data and documentation can be found at the Kaggle source.
Problem Statement and approach taken
Broadly, we analysed why employees were leaving a company and built models to predict the same. The variable ‘left’ is the binary response variable and all the other variables were used as explanatory variables.
Following were the steps performed on the dataset
Inferences
Based on the EDA, we can broadly conclude that any employee, whether valuable (as per our definition) or not, tends to leave if they are overworked, paid less or not promoted for 5 years.
Based on the prediction models we built, random forest was the best of the three as it had the highest AUC value. Also, because we used cross-validation to estimate its performance, we expect this model to perform similarly on new data drawn from the same population.
Benefit to the consumer
I believe that identifying the factors that lead to an employee leaving the company will help a company get to the root cause of employee attrition. Furthermore, with the help of the prediction model, a company can take steps to prevent the next employee from leaving.
Limitations/future work
There is a lot of scope for building more prediction models, such as SVM, GBM, Naive Bayes, etc., and checking whether they perform better than the chosen random forest model.