Employees are the backbone of any organization. They are a company’s biggest investment and its source of revenue. Since a company invests a large amount of time and money in acquiring, training, and equipping employees with the needed skills and expertise, keeping employees satisfied and retaining them for many years is the ideal goal for any company. In this project, I use a simulated dataset of employee information to analyse why “valuable” employees are leaving a company, and then build a prediction model to identify which employees are likely to leave next. I believe that identifying these factors will help a company get to the root cause of employee attrition. Furthermore, with the help of the prediction model, a company can take steps to prevent the next employee from leaving. Please refer to the tab under Data Preparation for details on the dataset used.
library(dplyr)
library(knitr)
library(formattable)
library(tidyr)
dplyr - A fast, consistent tool for working with data-frame-like objects, both in memory and out of memory. We use it in our project to transform data (joining, summarizing, etc.)
knitr - A General-Purpose Package for Dynamic Report Generation in R
formattable - Provides functions to create formattable vectors and data frames. It improves the readability of data presented in tabular form rendered in web pages
tidyr - Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions
Before we can understand why employees leave a company, we need to acquire the data and clean it. The data used for this project is the Human Resources Analytics dataset on Kaggle. As per the source, this dataset was simulated and contains 10 variables and 14999 rows of data. It reports metrics such as employee satisfaction level, last evaluation, number of projects, salary, etc. The full data and documentation can be found in the above link. The data dictionary tab below also explains each variable.
empData <- read.csv("HR_comma_sep.csv",header = TRUE, stringsAsFactors = FALSE)
Once the data has been imported, we can now understand the variables and create a data dictionary.
Variable <- c("satisfaction_level","last_evaluation","number_project","average_montly_hours","time_spend_company","Work_accident","left","promotion_last_5years","sales","salary")
Datatype <- c("Numeric","Numeric","Integer","Integer","Integer","Integer","Integer","Integer","Character","Character")
Description <- c("Value between 0 and 1","Value between 0 and 1","No. of projects the employee has worked on","Average hours an employee works per month","No. of years spent in the company","Boolean value 0 or 1 indicating if an employee had an accident at work","Boolean value 0 or 1 indicating if an employee left the company","Boolean value 0 or 1 indicating if an employee was promoted in the last 5 years","The department an employee belongs to","Categorical value indicating if the salary is low, medium, high")
datadic <- data.frame(Variable,Datatype,Description)
kable(datadic)
| Variable | Datatype | Description |
|---|---|---|
| satisfaction_level | Numeric | Value between 0 and 1 |
| last_evaluation | Numeric | Value between 0 and 1 |
| number_project | Integer | No. of projects the employee has worked on |
| average_montly_hours | Integer | Average hours an employee works per month |
| time_spend_company | Integer | No. of years spent in the company |
| Work_accident | Integer | Boolean value 0 or 1 indicating if an employee had an accident at work |
| left | Integer | Boolean value 0 or 1 indicating if an employee left the company |
| promotion_last_5years | Integer | Boolean value 0 or 1 indicating if an employee was promoted in the last 5 years |
| sales | Character | The department an employee belongs to |
| salary | Character | Categorical value indicating if the salary is low, medium, high |
First and foremost, we see that the variable ‘sales’ actually represents the department an employee belongs to, such as marketing, accounting, hr, support, etc. So we change the name of the variable to ‘Dept’ to make it intuitive.
colnames(empData)[9] <- "Dept"
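Renaming by position works only as long as the department column stays in position 9. As a minimal sketch, assuming a small made-up data frame `toy` in place of `empData`, the same rename can be done by matching the old column name:

```r
# Toy data frame standing in for empData (values are made up)
toy <- data.frame(
  left   = c(0, 1),
  sales  = c("hr", "IT"),
  salary = c("low", "high"),
  stringsAsFactors = FALSE
)

# Rename by matching the old name instead of relying on a hard-coded position
names(toy)[names(toy) == "sales"] <- "Dept"

names(toy)
```

If the column order of the CSV ever changes, this version still renames the intended column.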
We can see that there are
sum(is.na(empData))
## [1] 0
missing values in the data
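A single total is enough here because it is zero. When the total is not zero, a per-column count shows where the gaps are. The sketch below uses a made-up data frame `toy` with deliberately inserted `NA` values; it is not the project data:

```r
# Toy data with two missing values (made up for illustration)
toy <- data.frame(
  satisfaction_level = c(0.38, NA, 0.11),
  salary             = c("low", "medium", NA),
  stringsAsFactors   = FALSE
)

# Total number of missing cells, as in the check above
total_na <- sum(is.na(toy))

# Number of missing cells in each column
na_per_column <- colSums(is.na(toy))

total_na
na_per_column
```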
The data as such looks tidy, with one variable per column and each row representing an observation.
empData %>%
formattable() %>%
as.datatable(options=list(dom = 't',scrollX = TRUE,scrollCollapse = TRUE))
The goal of our EDA is to find out why employees are leaving the company, and then why “good” employees in particular are leaving. We try to analyse whether the factors differ between valuable employees (we define a valuable employee below) and the remaining employees. In most parts of this project, we use tables, histograms, and barplots for our EDA.
Filtering the employees that left the company,
emp.left <- empData %>%
filter(left==1)
Out of the 14,999 employees,
nrow(emp.left)
## [1] 3571
have left the company.
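The count is easier to interpret as a rate. A short sketch of the arithmetic, using only the two numbers reported so far (14999 rows in the dataset and 3571 leavers):

```r
n_total <- 14999  # rows in the dataset, as stated in the data description
n_left  <- 3571   # output of nrow(emp.left)

# Percentage of employees who left, rounded to one decimal place
attrition_pct <- round(100 * n_left / n_total, 1)
attrition_pct
```

On the full data frame, `mean(empData$left == 1)` should give the same proportion directly.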
Below we visualize each variable in our dataset to see what’s going on
par(mfrow=c(3,3))
hist(emp.left$satisfaction_level,col="grey",main="Satisfaction level")
hist(emp.left$last_evaluation,col="grey",main="Last Evaluation")
hist(emp.left$promotion_last_5years,col="grey",main="Promoted in the last 5 years?")
hist(emp.left$Work_accident,col="grey",main="Work Accident")
barplot(table(emp.left$Dept),col="grey",main="Department")
barplot(table(emp.left$salary),col="grey",main="Salary")
hist(emp.left$average_montly_hours,col="grey",main="Average monthly hours")
hist(emp.left$time_spend_company,col="grey",main="Time spent in company")
hist(emp.left$number_project,col="grey",main="Number of projects")
From the above plots, we can observe the following:
We define a good employee as one who has been in the company for 4 or more years, has worked on more than 4 projects, or has a last evaluation greater than or equal to 0.7.
Filtering the good employees that left and selecting only important rows:
emp.good.left <- emp.left %>%
filter(last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 4)
Out of the 3571 employees that left,
nrow(emp.good.left)
## [1] 2020
were valuable/ good employees
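Because the three conditions are joined with `|`, an employee counts as good if any one of them holds. The sketch below applies the same conditions to four made-up employees (the data frame `toy` is hypothetical) and then converts the reported counts into a share:

```r
# Four made-up employees
toy <- data.frame(
  last_evaluation    = c(0.50, 0.85, 0.60, 0.55),
  time_spend_company = c(2,    3,    5,    3),
  number_project     = c(3,    4,    2,    4)
)

# Same three conditions as in the filter above, combined with OR
is_good <- toy$last_evaluation >= 0.70 |
  toy$time_spend_company >= 4 |
  toy$number_project > 4

# Rows 2 and 3 qualify: one on evaluation, one on tenure
toy[is_good, ]

# Share of leavers who were good employees, from the counts reported above
good_pct <- round(100 * 2020 / 3571, 1)
good_pct
```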
Below we visualize each variable in the dataset pertaining to “good employees” to see what’s going on
par(mfrow=c(2,3))
hist(emp.good.left$satisfaction_level,col="blue",main="Satisfaction level")
barplot(table(emp.good.left$Dept),col="blue",main="Department")
hist(emp.good.left$promotion_last_5years,col="blue",main="Promoted in the last 5 years?")
barplot(table(emp.good.left$salary),col="blue",main="Salary")
hist(emp.good.left$average_montly_hours,col="blue",main="Average monthly hours")
hist(emp.good.left$time_spend_company,col="blue",main="Time spent in company")
We can observe the following from the above graphs:
We can see that the majority of the good employees that left had very low levels of satisfaction
We can see that the majority of the good employees that left belonged to the sales, support, and technical departments
We can see that the majority of the good employees that left were not promoted in the last 5 years
We can see that the majority of the good employees that left were overworked and spent many hours in the company
**Hence, we can see that valuable employees left because they were overworked, had not been promoted in the last 5 years, were in the low salary level, and were dissatisfied. We have also found that most of the employees leaving are from the sales, technical, and support departments.**
Now we will analyse the sales, technical, and support departments to see what is going on there:
emp.temp <- as_tibble(table(empData$Dept,empData$salary))
emp.temp<-spread(emp.temp,Var2,n)
emp.temp %>% mutate(totalEmp=low+medium+high,PercentLow=(low/totalEmp)*100,PercentHigh=(high/totalEmp)*100)
## # A tibble: 10 x 7
## Var1 high low medium totalEmp PercentLow PercentHigh
## <chr> <int> <int> <int> <int> <dbl> <dbl>
## 1 accounting 74 358 335 767 46.67536 9.647979
## 2 hr 45 335 359 739 45.33153 6.089310
## 3 IT 83 609 535 1227 49.63325 6.764466
## 4 management 225 180 225 630 28.57143 35.714286
## 5 marketing 80 402 376 858 46.85315 9.324009
## 6 product_mng 68 451 383 902 50.00000 7.538803
## 7 RandD 51 364 372 787 46.25159 6.480305
## 8 sales 269 2099 1772 4140 50.70048 6.497585
## 9 support 141 1146 942 2229 51.41319 6.325707
## 10 technical 201 1372 1147 2720 50.44118 7.389706
We can see that in the sales, technical, and support departments around 50% of the employees are in the low salary level, which might be one reason employees left.
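The percentage columns above are row-wise shares of a department-by-salary count table. As a minimal base R sketch, assuming a tiny made-up data frame `toy` rather than `empData`, `prop.table()` with `margin = 1` computes the same kind of shares for every salary level at once:

```r
# Six made-up employees in two departments
toy <- data.frame(
  Dept   = c("sales", "sales", "sales", "sales", "management", "management"),
  salary = c("low", "low", "medium", "high", "high", "medium"),
  stringsAsFactors = FALSE
)

# Count employees per department and salary level
counts <- table(toy$Dept, toy$salary)

# margin = 1 divides each cell by its row (department) total
pct <- round(100 * prop.table(counts, margin = 1), 1)
pct
```

Each row of `pct` sums to 100, so departments of very different sizes can be compared directly.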
emp.temp <- as_tibble(table(empData$Dept,empData$promotion_last_5years))
emp.temp<-spread(emp.temp,Var2,n)
colnames(emp.temp) <- c("Dept","Not Promoted","Promoted")
emp.temp %>% mutate(PercentPromoted=(`Promoted`/(`Promoted` + `Not Promoted`))*100)
## # A tibble: 10 x 4
## Dept `Not Promoted` Promoted PercentPromoted
## <chr> <int> <int> <dbl>
## 1 accounting 753 14 1.8252934
## 2 hr 724 15 2.0297700
## 3 IT 1224 3 0.2444988
## 4 management 561 69 10.9523810
## 5 marketing 815 43 5.0116550
## 6 product_mng 902 0 0.0000000
## 7 RandD 760 27 3.4307497
## 8 sales 4040 100 2.4154589
## 9 support 2209 20 0.8972633
## 10 technical 2692 28 1.0294118
We can see that the sales, technical, and support departments have promoted only a very small percentage of their employees in the last 5 years, which might be another reason employees left.
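Since `promotion_last_5years` is coded as 0 or 1, its mean within a department is the promoted share. A minimal sketch of that shortcut, assuming a made-up data frame `toy` in place of `empData`:

```r
# Six made-up employees; 1 means promoted in the last 5 years
toy <- data.frame(
  Dept                  = c("sales", "sales", "sales", "sales", "management", "management"),
  promotion_last_5years = c(0, 0, 0, 1, 1, 0),
  stringsAsFactors      = FALSE
)

# The mean of a 0/1 column is the proportion of ones, computed here per department
promo_pct <- 100 * tapply(toy$promotion_last_5years, toy$Dept, mean)
promo_pct
```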
par(mfrow=c(1,3))
emp.satis <- empData %>%
select(Dept,satisfaction_level) %>%
filter(Dept=="technical")
hist(emp.satis$satisfaction_level, main = "Technical department")
emp.satis <- empData %>%
select(Dept,satisfaction_level) %>%
filter(Dept=="sales")
hist(emp.satis$satisfaction_level, main = "Sales department")
emp.satis <- empData %>%
select(Dept,satisfaction_level) %>%
filter(Dept=="support")
hist(emp.satis$satisfaction_level,main = "Support department")
One thing to note here is that the satisfaction levels of the employees in these 3 departments are not as low as we expected them to be.
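One possible reason is that these histograms include every employee in the department, not only those who left. As a sketch of how the two groups could be separated, assuming a made-up data frame `toy` (the numbers are invented and say nothing about the real data), `aggregate()` gives the mean satisfaction per department and `left` status:

```r
# Made-up employees in two departments; left = 1 means the employee left
toy <- data.frame(
  Dept               = c("sales", "sales", "sales", "sales", "support", "support"),
  left               = c(0, 0, 1, 1, 0, 1),
  satisfaction_level = c(0.8, 0.6, 0.1, 0.3, 0.7, 0.4),
  stringsAsFactors   = FALSE
)

# Mean satisfaction for each combination of department and left status
mean_satis <- aggregate(satisfaction_level ~ Dept + left, data = toy, FUN = mean)
mean_satis
```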
par(mfrow=c(1,3))
emp.workhours <- empData %>%
select(Dept,average_montly_hours) %>%
filter(Dept=="support")
hist(emp.workhours$average_montly_hours, main="Support department")
emp.workhours <- empData %>%
select(Dept,average_montly_hours) %>%
filter(Dept=="sales")
hist(emp.workhours$average_montly_hours, main="Sales department")
emp.workhours <- empData %>%
select(Dept,average_montly_hours) %>%
filter(Dept=="technical")
hist(emp.workhours$average_montly_hours, main="Technical department")
We can see that the majority of the employees in the 3 departments were overworked, which might be why they left.
**From all the above analyses, we can see that the scenario in the above 3 departments aligns with the observations we made on why employees are leaving.**
We can broadly conclude that any employee, valuable (as per our definition) or not, tends to leave if they are overworked, paid a low salary, or not promoted for 5 years.
I’m planning to build a decision tree to predict whether an employee will leave the company or not. I also want to upload my input dataset to GitHub so I can use the data directly from there.
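As a rough sketch of what that next step could look like, the code below fits a classification tree with the `rpart` package. The choice of `rpart` is an assumption, and the data frame `toy` is simulated with a made-up rule for who leaves, so the resulting tree says nothing about the real dataset:

```r
library(rpart)  # recommended package that ships with most R installations

set.seed(42)
n <- 500

# Simulated stand-in for empData; column values are randomly generated
toy <- data.frame(
  satisfaction_level   = runif(n),
  average_montly_hours = sample(96:310, n, replace = TRUE),
  salary               = factor(sample(c("low", "medium", "high"), n, replace = TRUE))
)

# Made-up rule so the tree has a signal to find:
# dissatisfied employees with long hours leave
toy$left <- factor(ifelse(toy$satisfaction_level < 0.3 &
                            toy$average_montly_hours > 220, 1, 0))

# Hold out 30% of the rows for testing
train_idx <- sample(n, 0.7 * n)

# Fit a classification tree on the training rows
fit <- rpart(left ~ ., data = toy[train_idx, ], method = "class")

# Predict classes for the held-out rows and tabulate against the truth
pred <- predict(fit, newdata = toy[-train_idx, ], type = "class")
conf_mat <- table(predicted = pred, actual = toy$left[-train_idx])
conf_mat
```

With the real data, `toy` would be replaced by `empData`, with `left`, `Dept`, and `salary` converted to factors first.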