Introduction

Employees are the backbone of any organization. They are a company’s biggest investment and a key source of revenue. Since a company invests a large amount of time and money in acquiring, training and equipping employees with the needed skills and expertise, keeping employees satisfied and retaining them for many years is an ideal goal for any company.

In this project, I use a simulated dataset of employee information to analyse why “valuable” employees are leaving a company. I then build three prediction models and choose the best one to predict whether an employee will leave. I believe that identifying these factors will help a company get to the root cause of employee attrition. Furthermore, with the help of the prediction model, a company can take steps to prevent the next employee from leaving. Please refer to the Data Preparation tab for details on the dataset used.

Packages Required

library(dplyr) 
library(knitr) 
library(tidyr)
library(DT)
library(caret)
library(glmnet)
library(ggplot2)
library(plotly)
library(gridExtra)
library(e1071)
library(ranger)
  • dplyr - A fast, consistent tool for working with data-frame-like objects, both in memory and out of memory. We use it in this project to transform data (joining, summarizing, etc.)

  • knitr - A General-Purpose Package for Dynamic Report Generation in R

  • tidyr - Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions

  • DT - Data objects in R can be rendered as HTML tables using the JavaScript library ‘DataTables’. The package name ‘DT’ is an abbreviation of ‘DataTables’

  • caret - Provides functions for training and plotting classification and regression models

  • glmnet - Used for building efficient lasso and ridge regression models

  • ggplot2 - A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”

  • plotly - Creates interactive web-based graphics, including interactive versions of ggplot2 plots

  • gridExtra - For arranging multiple ggplots in a grid layout

  • e1071 - A collection of statistical learning functions; caret depends on it for some of its classification routines

  • ranger - The caret package uses this package for building random forests. It is a fast implementation of Random Forests, particularly suited for high dimensional data

Data Preparation

Before we can understand why employees leave a company, we need to acquire and clean the data. The data used for this project is the Human Resources Analytics dataset on Kaggle. As per the source, this dataset is simulated and contains 10 variables and 14999 rows. It reports metrics such as employee satisfaction level, last evaluation, number of projects and salary. The full data and documentation can be found on the dataset's Kaggle page. The data dictionary tab below also explains each variable.

Import data

#Importing csv file
empData <- read.csv("HR_comma_sep.csv",header = TRUE, stringsAsFactors = FALSE)

Data dictionary

Once the data has been imported, we can now understand the variables and create a data dictionary.

Variable               Datatype   Description
satisfaction_level     Numeric    Value between 0 and 1
last_evaluation        Numeric    Value between 0 and 1
number_project         Integer    No. of projects the employee has worked on
average_montly_hours   Integer    Average hours an employee works per month
time_spend_company     Integer    No. of years spent in the company
Work_accident          Integer    Boolean value 0 or 1 indicating if an employee had an accident at work
left                   Integer    Boolean value 0 or 1 indicating if an employee left the company
promotion_last_5years  Integer    Boolean value 0 or 1 indicating if an employee was promoted in the last 5 years
sales                  Character  The department an employee belongs to
salary                 Character  Categorical value indicating if the salary is low, medium or high

Creating Tidy data

First, we see that the variable ‘sales’ actually represents the department an employee belongs to (marketing, accounting, hr, support, etc.), so we rename it to ‘Dept’ to make it more intuitive.

colnames(empData)[9] <- "Dept" #changing col name

We check for missing values:

sum(is.na(empData)) #finding if there are any NAs in the data
## [1] 0

There are no missing values in the data.

The data as such looks tidy, with one variable per column and each row representing an observation.

HR Dataset

Exploratory Data Analysis

The goal of our EDA is to find out why employees are leaving the company, and then why “good” employees in particular are leaving. We analyse whether the observed factors differ between valuable employees (defined below) and the rest. In most parts of this project, we use tables, histograms and barplots for the EDA.

Why are employees leaving?

Filtering for the employees who left the company, we find that 3571 of the 14999 total employees have left:

## [1] 3571

Below we visualize each variable in our dataset to see what’s going on

From the above plots, we can observe the following:

  • Employees who left tended to have low satisfaction levels
  • A group of the employees who left had received low evaluations, which may be why management asked them to leave
  • Interestingly, almost none of the employees who left had been promoted in the last 5 years, which might have caused them to leave
  • The majority of the employees who left were from the sales, support and technical departments
  • The majority of the employees who left were in the low salary level
  • Employees who left worked an average of 200-250 hours per month

Why are good employees leaving?

We define a good employee as one who has been with the company for more than four years, has worked on four or more projects, or has a last evaluation greater than or equal to 0.7.

Filtering for the good employees who left and keeping only the relevant data, we find that 2020 of the 3571 employees who left were valuable/good employees:

## [1] 2020
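The filtering code itself isn't shown, so here is a minimal sketch of the two steps in base R on a made-up data frame. The name toy and its values are hypothetical, and the sketch assumes that meeting any one of the three criteria is enough to count as a good employee:

```r
# Toy stand-in for empData; values are made up for illustration
toy <- data.frame(
  left               = c(1, 1, 1, 0, 1),
  time_spend_company = c(5, 2, 3, 6, 2),
  number_project     = c(2, 4, 3, 5, 2),
  last_evaluation    = c(0.50, 0.60, 0.90, 0.80, 0.40)
)

# Step 1: keep only the employees who left
left_emp <- toy[toy$left == 1, ]

# Step 2: flag "good" employees (assumption: any one criterion is sufficient)
is_good <- left_emp$time_spend_company > 4 |
  left_emp$number_project >= 4 |
  left_emp$last_evaluation >= 0.7
good_left <- left_emp[is_good, ]

n_left      <- nrow(left_emp)
n_good_left <- nrow(good_left)
print(c(left = n_left, good_left = n_good_left))
```

If the definition were meant to require all three criteria at once, the | operators would become &.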

Below we visualize each variable in the dataset for the “good employees” to see what’s going on. Note that here I’m using the hist() and barplot() functions from base R instead of ggplot2 functions.

We can observe the following from the above graphs:

  • The majority of the good employees who left had very low levels of satisfaction

  • The majority of the good employees who left belonged to the sales, support and technical departments

  • The majority of the good employees who left were not promoted in the last 5 years

  • The majority of the good employees who left were overworked and spent many hours at the company

Hence, we can see that valuable employees left because they were overworked, not promoted in the last 5 years, in the low salary level and dissatisfied. We have also found that most of the employees leaving are from the sales, technical and support departments.

Now we will analyse the sales, technical and support departments to see what is going on there:

## # A tibble: 10 x 7
##           Var1  high   low medium totalEmp PercentLow PercentHigh
##          <chr> <int> <int>  <int>    <int>      <dbl>       <dbl>
##  1  accounting    74   358    335      767   46.67536    9.647979
##  2          hr    45   335    359      739   45.33153    6.089310
##  3          IT    83   609    535     1227   49.63325    6.764466
##  4  management   225   180    225      630   28.57143   35.714286
##  5   marketing    80   402    376      858   46.85315    9.324009
##  6 product_mng    68   451    383      902   50.00000    7.538803
##  7       RandD    51   364    372      787   46.25159    6.480305
##  8       sales   269  2099   1772     4140   50.70048    6.497585
##  9     support   141  1146    942     2229   51.41319    6.325707
## 10   technical   201  1372   1147     2720   50.44118    7.389706

We can see that in the sales, technical and support departments around 50% of the employees are in the low salary level, which might be why employees left.
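A table like the one above can be built by cross-tabulating department against salary level. The code that produced it isn't shown, so the following is only a sketch in base R on a made-up data frame. The names toy and salary_tab are hypothetical; the derived columns mirror totalEmp, PercentLow and PercentHigh above:

```r
# Toy stand-in for the Dept and salary columns of empData (made-up values)
toy <- data.frame(
  Dept   = c("sales", "sales", "sales", "sales", "hr", "hr"),
  salary = c("low", "low", "medium", "high", "low", "high"),
  stringsAsFactors = FALSE
)

counts     <- table(toy$Dept, toy$salary)     # departments x salary levels
salary_tab <- as.data.frame.matrix(counts)    # one column per salary level

salary_tab$totalEmp    <- rowSums(counts)
salary_tab$PercentLow  <- 100 * salary_tab$low  / salary_tab$totalEmp
salary_tab$PercentHigh <- 100 * salary_tab$high / salary_tab$totalEmp
print(salary_tab)
```

The same arithmetic reproduces the figures in the output: for sales, 100 * 2099 / 4140 is about 50.7.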

## # A tibble: 10 x 4
##           Dept `Not Promoted` Promoted PercentPromoted
##          <chr>          <int>    <int>           <dbl>
##  1  accounting            753       14       1.8252934
##  2          hr            724       15       2.0297700
##  3          IT           1224        3       0.2444988
##  4  management            561       69      10.9523810
##  5   marketing            815       43       5.0116550
##  6 product_mng            902        0       0.0000000
##  7       RandD            760       27       3.4307497
##  8       sales           4040      100       2.4154589
##  9     support           2209       20       0.8972633
## 10   technical           2692       28       1.0294118

We can see that the sales, technical and support departments have each promoted only a very small percentage of their employees in the last 5 years, which might be why employees left.

One thing to note here is that, contrary to our expectation, the satisfaction level of the employees in all these 3 departments is not very low.

Plot showing employees’ work hours in the support, sales and technical departments:

We can see that the majority of the employees in the 3 departments were overworked, which might be why they left.

Additionally we can also use boxplots to check for outliers and find interesting relationships as follows:

Following are the observations that we can make:

  • Boxplot between Satisfaction levels and Salary: We find that the average satisfaction of the employees who left is lower than that of those who haven’t left
  • Boxplot between Time Spent in Company/Average monthly hours and Salary: We can see that the people leaving are more experienced (i.e. higher time spent in company on average) in the low and medium salary classes
  • Boxplot between Average Monthly Hours and Time Spent in Company: We can see that people leaving have spent more hours at work

From all the above analyses, we can see that the scenario in the above 3 departments aligns with the observations we made on why employees are leaving.

Predictive modeling

In this section we build 3 different prediction models and choose the best one amongst them. We predict whether a person left the company using the binary response variable ‘left’, and use all the other variables as explanatory variables. Throughout this section we make use of the caret package, which provides many functions to construct different model types and tune their parameters.

We first create a shared custom trainControl object to use across all three models, as we need a fair comparison on the same train/test splits. We use createFolds to build the folds for this control object so that each model sees exactly the same cross-validation folds.

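myFolds <- createFolds(empData$left, k = 5) #creating folds
# Note: createFolds() returns the held-out row indices of each fold by default, while
# trainControl(index = ) expects the rows used for fitting; pass returnTrain = TRUE to
# createFolds() if conventional 5-fold cross-validation is intended.
myControl <- trainControl(summaryFunction = twoClassSummary, classProbs = TRUE,
                          verboseIter = TRUE, index = myFolds) #custom control object to be used across all models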
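To make the idea of shared folds concrete, here is a minimal base R sketch that does not use caret. The names n, k, fold_id, held_out and train_idx are all hypothetical, and the seed is arbitrary. Every row is assigned to exactly one fold, and each resample trains on the rows outside that fold:

```r
set.seed(42)   # arbitrary seed so the split is reproducible
n <- 20        # toy number of rows
k <- 5         # number of folds

# Assign every row to exactly one of k folds
fold_id <- sample(rep(seq_len(k), length.out = n))

# Held-out rows per fold, and the complementary training rows
held_out  <- split(seq_len(n), fold_id)
train_idx <- lapply(held_out, function(idx) setdiff(seq_len(n), idx))

print(lengths(held_out))
print(lengths(train_idx))
```

Because the fold assignment is computed once and reused, any number of models can be scored on identical splits, which is what makes their resampled metrics directly comparable.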

Logistic regression model

As this is a classic classification problem with a binary response variable, we first build a logistic regression model
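One preparation step is not shown here: with classProbs = TRUE and twoClassSummary, caret expects the outcome to be a two-level factor whose level names are valid R names, so the raw 0/1 codes need converting first. A minimal sketch in base R follows; the labels "stayed" and "left" are my own hypothetical choice, not necessarily the ones used for the results in this report:

```r
# 0/1 codes as they appear in the `left` column (toy values)
left_raw <- c(0, 1, 0, 0, 1)

# Hypothetical labels; "stayed" is listed first so it becomes the first level
left_fac <- factor(left_raw, levels = c(0, 1), labels = c("stayed", "left"))

# Level names are usable if make.names() leaves them unchanged
names_ok <- all(levels(left_fac) == make.names(levels(left_fac)))
print(levels(left_fac))
print(names_ok)
```

The order of the levels matters later on, because two-class summaries typically treat the first level as the positive class when reporting sensitivity and specificity.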

model_glm <- train(left ~ ., data=empData,method="glm",trControl=myControl) #using train and shared control object myControl to build a logistic regression model

Model summary:

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2248  -0.6645  -0.4026  -0.1177   3.0688  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.4762862  0.1938373  -7.616 2.61e-14 ***
## satisfaction_level    -4.1356889  0.0980538 -42.178  < 2e-16 ***
## last_evaluation        0.7309032  0.1491787   4.900 9.61e-07 ***
## number_project        -0.3150787  0.0213248 -14.775  < 2e-16 ***
## average_montly_hours   0.0044603  0.0005161   8.643  < 2e-16 ***
## time_spend_company     0.2677537  0.0155736  17.193  < 2e-16 ***
## Work_accident         -1.5298283  0.0895473 -17.084  < 2e-16 ***
## promotion_last_5years -1.4301364  0.2574958  -5.554 2.79e-08 ***
## Depthr                 0.2323779  0.1313084   1.770  0.07678 .  
## DeptIT                -0.1807179  0.1221276  -1.480  0.13894    
## Deptmanagement        -0.4484236  0.1598254  -2.806  0.00502 ** 
## Deptmarketing         -0.0120882  0.1319304  -0.092  0.92700    
## Deptproduct_mng       -0.1532529  0.1301538  -1.177  0.23901    
## DeptRandD             -0.5823659  0.1448848  -4.020 5.83e-05 ***
## Deptsales             -0.0387859  0.1024006  -0.379  0.70486    
## Deptsupport            0.0500251  0.1092834   0.458  0.64713    
## Depttechnical          0.0701464  0.1065379   0.658  0.51027    
## salarylow              1.9440627  0.1286272  15.114  < 2e-16 ***
## salarymedium           1.4132244  0.1293534  10.925  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16465  on 14998  degrees of freedom
## Residual deviance: 12850  on 14980  degrees of freedom
## AIC: 12888
## 
## Number of Fisher Scoring iterations: 5
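
The estimates above are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio. The sketch below does this for three coefficients copied from the summary. It assumes the modelled outcome is leaving the company, which the signs suggest but the output does not state:

```r
# Coefficients copied from the model summary above (log-odds scale)
coefs <- c(
  satisfaction_level = -4.1356889,
  time_spend_company =  0.2677537,
  salarylow          =  1.9440627
)

# exp() turns a log-odds coefficient into an odds ratio per one-unit change
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# satisfaction_level moves in small steps, so a 0.1 change may be easier to read
or_sat_tenth <- exp(0.1 * coefs[["satisfaction_level"]])
print(round(or_sat_tenth, 3))
```

Under that assumption, an odds ratio above 1 (salarylow, relative to the baseline salary level that has no row of its own) points to higher odds of leaving, and one below 1 (satisfaction_level) to lower odds, holding the other variables fixed.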

Decision tree

We next build a decision tree, which represents the data in a visual format that lets us understand it quickly and make predictions.

model_dtree <- train(left ~., data = empData, method = "rpart",trControl=myControl,
                     tuneLength = 10, 
                     parms=list(split='information')) ##using train and shared control object myControl to build a decision tree
plot(model_dtree)

Random forest

This is a widely used model that builds a multitude of decision trees, each a relatively weak learner on its own, and combines their predictions into a stronger overall prediction.

model_rf <- train(left ~ .,data = empData,method = "ranger",trControl=myControl,metric="ROC")
plot(model_rf)
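
How a forest combines its trees can be illustrated with a simple majority vote. This is a toy sketch with made-up per-tree predictions (votes and forest_pred are hypothetical names), not what ranger does internally; implementations may instead average per-tree class probabilities:

```r
# Each row holds one tree's predicted class for three employees (toy values)
votes <- rbind(
  tree1 = c("left",   "stayed", "left"),
  tree2 = c("left",   "stayed", "stayed"),
  tree3 = c("stayed", "stayed", "left")
)

# Majority vote per employee (one column per employee)
forest_pred <- apply(votes, 2, function(v) names(which.max(table(v))))
print(forest_pred)
```

An odd number of trees avoids ties in this two-class toy case; a real forest uses hundreds of trees, each grown on a bootstrap sample with a random subset of variables considered at each split.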

Finding the best model

We compare the three models by summarizing the distributions of their resampling results as a boxplot, xyplot and dotplot. We choose the model with the highest AUC (area under the ROC curve) as our best model.
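
For intuition about what the AUC measures: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it from ranks in base R on made-up labels and scores. The helper auc_rank is a hypothetical name of my own; caret reports the metric as ROC through its own machinery.

```r
# Toy labels (1 = positive class) and predicted scores; values are made up
labels <- c(1, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

# Rank-based (Mann-Whitney) form of the AUC; rank() averages tied scores
auc_rank <- function(labels, scores) {
  r     <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_rank(labels, scores)
print(auc)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 corresponds to random ranking and 1 to perfect separation.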

Creating a list of the models and collecting the results

model_list <- list(glm = model_glm,rf = model_rf,tree = model_dtree) #creating a list of the above 3 models
resamps <- resamples(model_list) #collecting results

Summarizing the results:

summary(resamps) #summarizing results
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: glm, rf, tree 
## Number of resamples: 5 
## 
## ROC 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.8090899 0.8134251 0.8194887 0.8169878 0.8204509 0.8224845    0
## rf   0.9807052 0.9829490 0.9838707 0.9835975 0.9851353 0.9853275    0
## tree 0.9671087 0.9681915 0.9685040 0.9690591 0.9698746 0.9716163    0
## 
## Sens 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.9210238 0.9216887 0.9275870 0.9258838 0.9294465 0.9296730    0
## rf   0.9960621 0.9962813 0.9964997 0.9966311 0.9966094 0.9977029    0
## tree 0.9850142 0.9865456 0.9881877 0.9889525 0.9916867 0.9933282    0
## 
## Spec 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glm  0.2775639 0.3395170 0.3524676 0.3557851 0.3907563 0.4186209    0
## rf   0.9173959 0.9198179 0.9212461 0.9207504 0.9215961 0.9236962    0
## tree 0.9086454 0.9096639 0.9124956 0.9123493 0.9128456 0.9180959    0
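
The Sens and Spec columns depend on which class is treated as positive. The sketch below computes both by hand on made-up predictions, assuming the first factor level is the positive class and using the hypothetical labels "stayed" and "left":

```r
# Toy observed and predicted classes (made-up values, hypothetical labels)
lv   <- c("stayed", "left")
obs  <- factor(c("stayed", "stayed", "stayed", "stayed", "left", "left",   "left"),   levels = lv)
pred <- factor(c("stayed", "stayed", "stayed", "left",   "left", "stayed", "stayed"), levels = lv)

positive <- lv[1]   # assumption: the first level is the positive class

# Sensitivity: share of positive cases predicted as positive
sens <- sum(pred == positive & obs == positive) / sum(obs == positive)
# Specificity: share of non-positive cases predicted as non-positive
spec <- sum(pred != positive & obs != positive) / sum(obs != positive)
print(c(Sens = sens, Spec = spec))
```

If the models above were coded this way, the low Spec of the logistic regression (mean of about 0.36) would mean that it identifies only about a third of the employees who actually left, even though its Sens looks high.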

Next we draw the plots to choose the model with the highest AUC

Across the three plots, the ROC value is highest for the random forest model and is greater than 0.95.

From the above plots we can clearly see that the AUC value is the highest for the random forest model. Hence we choose this as our best model.

Summary

About the data

In this project we used the Human Resources Analytics dataset available on Kaggle. As per the source, this dataset is simulated and contains 10 variables and 14999 rows. It reports metrics such as employee satisfaction level, last evaluation, number of projects and salary. The full data and documentation can be found on the dataset's Kaggle page.

Problem Statement and approach taken

Broadly, we analysed why employees were leaving a company and built prediction models to predict the same (the variable ‘left’ is the binary response variable and all the other variables were used as explanatory variables).

The following steps were performed on the dataset:

  • Imported the dataset
  • Cleaned and tidied the dataset
  • Performed Exploratory data analysis using various plots and tables
  • Built 3 prediction models using the caret package (logistic regression, decision tree and random forest) and chose the best one based on the highest AUC

Inferences

  • Based on the EDA, we can broadly conclude that any employee, valuable (as per our definition) or not, tends to leave if overworked, paid less or not promoted for 5 years.

  • Based on the prediction models we built, random forest was the best of the three as it had the highest AUC value. Since this AUC was estimated on held-out resamples rather than on the training rows, the model can be expected to perform similarly on new data of the same kind.

Benefit to the consumer

I believe that identifying the factors that lead to an employee leaving the company will help a company get to the root cause of employee attrition. Furthermore, with the help of the prediction model, a company can take steps to prevent the next employee from leaving.

Limitations/future work

There is a lot of scope for building further prediction models, such as SVM, GBM and Naive Bayes, and checking whether they perform better than the chosen random forest model.