Introduction

Employees are the backbone of any organization. They are a company’s biggest investment and its source of revenue. Because a company invests a large amount of time and money in acquiring, training and equipping employees with the skills and expertise they need, keeping employees satisfied and retaining them for as long as possible is an ideal goal for any company. In this project, I use a simulated dataset of employee information to analyse why “valuable” employees are leaving a company, and then build a prediction model to find out which employee is likely to leave next. I believe that identifying these factors will help a company get to the root cause of employee attrition. Furthermore, with the help of the prediction model, a company can take steps to prevent the next employee from leaving. Please refer to the Data Preparation tab for details on the dataset used.

Packages Required

library(dplyr) 
library(knitr) 
library(formattable)
library(tidyr)
  • dplyr - A fast, consistent tool for working with data-frame-like objects, both in memory and out of memory. We use it in this project to transform data (joining, summarizing, etc.)

  • knitr - A General-Purpose Package for Dynamic Report Generation in R

  • formattable - Provides functions to create formattable vectors and data frames. It improves the readability of data presented in tabular form rendered in web pages

  • tidyr - Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions

Data Preparation

Before we can understand why employees leave a company, we need to acquire and clean the data. The data used for this project is the Human Resources Analytics dataset from Kaggle. According to the source, this dataset is simulated and contains 10 variables and 14999 rows of data. It reports metrics such as employee satisfaction level, last evaluation, number of projects, salary, etc. The full data and documentation can be found in the above link. The data dictionary tab below also explains each variable.

Import data

empData <- read.csv("HR_comma_sep.csv",header = TRUE, stringsAsFactors = FALSE)
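Before going further, it can help to confirm that the import produced the expected columns. The following is a minimal sketch, assuming the column names of the Kaggle file as they are used later in this report. The helper `check_hr_columns()` and the toy data frame are hypothetical and exist only for illustration.

```r
# Hypothetical helper: returns the expected columns that are missing from df
check_hr_columns <- function(df) {
  expected <- c("satisfaction_level", "last_evaluation", "number_project",
                "average_montly_hours", "time_spend_company", "Work_accident",
                "left", "promotion_last_5years", "sales", "salary")
  setdiff(expected, names(df))
}

# Toy data frame standing in for empData (made-up values, not the real data)
toy <- data.frame(satisfaction_level = 0.5, last_evaluation = 0.7, left = 1)

# Lists the seven expected columns that the toy data frame lacks
check_hr_columns(toy)
```

On the real data, `check_hr_columns(empData)` should return an empty character vector if the file was read as expected, and `dim(empData)` should report 14999 rows and 10 columns.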

Data dictionary

With the data imported, we can examine the variables and create a data dictionary.

Variable <- c("satisfaction_level","last_evaluation","number_project","average_montly_hours","time_spend_company","Work_accident","left","promotion_last_5years","sales","salary")

Datatype <- c("Numeric","Numeric","Integer","Integer","Integer","Integer","Integer","Integer","Character","Character")

Description <- c("Value between 0 and 1","Value between 0 and 1","No. of projects the employee has worked on","Average hours an employee works per month","No. of years spent in the company","Boolean value 0 or 1 indicating if an employee had an accident at work","Boolean value 0 or 1 indicating if an employee left the company","Boolean value 0 or 1 indicating if an employee was promoted in the last 5 years","The department an employee belongs to","Categorical value indicating if the salary is low, medium or high")
datadic <- data.frame(Variable, Datatype, Description)
kable(datadic)
| Variable | Datatype | Description |
|---|---|---|
| satisfaction_level | Numeric | Value between 0 and 1 |
| last_evaluation | Numeric | Value between 0 and 1 |
| number_project | Integer | No. of projects the employee has worked on |
| average_montly_hours | Integer | Average hours an employee works per month |
| time_spend_company | Integer | No. of years spent in the company |
| Work_accident | Integer | Boolean value 0 or 1 indicating if an employee had an accident at work |
| left | Integer | Boolean value 0 or 1 indicating if an employee left the company |
| promotion_last_5years | Integer | Boolean value 0 or 1 indicating if an employee was promoted in the last 5 years |
| sales | Character | The department an employee belongs to |
| salary | Character | Categorical value indicating if the salary is low, medium or high |

Creating Tidy data

First, we see that the variable ‘sales’ actually represents the department an employee belongs to, such as marketing, accounting, hr or support. So we rename the variable to ‘Dept’ to make it more intuitive.

colnames(empData)[9] <- "Dept"

We count the missing values in the data:

sum(is.na(empData))
## [1] 0

There are no missing values in the data.
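A single total does not say which columns are affected when it is non-zero. A minimal sketch of a per-column count, using a toy data frame with two deliberately missing values (made-up values, not the Kaggle data):

```r
# Toy data frame with two deliberately missing values (not real data)
toy <- data.frame(
  satisfaction_level = c(0.38, NA, 0.11),
  salary = c("low", "medium", NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```

Applied to the real data as `colSums(is.na(empData))`, every entry should be 0, consistent with the total reported above.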

The data as such looks tidy, with one variable per column and each row representing an observation.

HR Dataset

empData %>%
  formattable() %>%
  as.datatable(options = list(dom = 't', scrollX = TRUE, scrollCollapse = TRUE))

Exploratory Data Analysis

The goal of our EDA is to find out why employees are leaving the company, and then why “good” employees in particular are leaving. We examine whether the observed factors differ between valuable employees (as defined below) and the rest. In most parts of this project, we use tables, histograms and bar plots for our EDA.

Why are employees leaving?

We first filter the employees who left the company:

emp.left <- empData %>%
    filter(left==1)

Out of the 14999 employees,

nrow(emp.left)
## [1] 3571

have left the company.
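Expressed as a rate, this is a little under a quarter of the workforce. The arithmetic, using only the two counts reported in this document:

```r
n_total <- 14999  # rows in the dataset, as stated under Data Preparation
n_left  <- 3571   # output of nrow(emp.left) above

# Share of employees who left, in percent
attrition_rate <- n_left / n_total * 100
round(attrition_rate, 1)
## [1] 23.8
```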

Below we visualize each variable in our dataset for the employees who left, to see what’s going on

par(mfrow=c(3,3))
hist(emp.left$satisfaction_level,col="grey",main="Satisfaction level")
hist(emp.left$last_evaluation,col="grey",main="Last Evaluation")
hist(emp.left$promotion_last_5years,col="grey",main="Promoted in the last 5 years?")
hist(emp.left$Work_accident,col="grey",main="Work Accident")
barplot(table(emp.left$Dept),col="grey",main="Department")
barplot(table(emp.left$salary),col="grey",main="Salary")
hist(emp.left$average_montly_hours,col="grey",main="Average monthly hours")
hist(emp.left$time_spend_company,col="grey",main="Time spent in company")
hist(emp.left$number_project,col="grey",main="Number of projects")

From the above plots, we can observe the following:

  • Employees who left tended to have a low level of satisfaction
  • A set of employees received a low evaluation, which may be why management asked them to leave the company
  • Interestingly, almost all the employees who left had not been promoted in the last 5 years, which might have caused them to leave
  • The majority of the employees who left were from the sales, support and technical departments
  • The majority of the employees who left received a low salary
  • Employees who left worked an average of 200-250 hours per month.
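The histograms above describe only the employees who left, so on their own they cannot show whether those who stayed look any different. A minimal base R sketch of a side-by-side comparison follows. It uses a toy data frame (made-up values, not the Kaggle data) that borrows the column names used in this report.

```r
# Toy data standing in for empData (made-up values for illustration only)
toy <- data.frame(
  left = c(0, 0, 0, 1, 1, 1),
  satisfaction_level = c(0.8, 0.7, 0.6, 0.1, 0.4, 0.4),
  average_montly_hours = c(160, 180, 200, 260, 150, 250)
)

# Mean satisfaction and mean monthly hours for stayers (0) and leavers (1)
by_left <- aggregate(
  cbind(satisfaction_level, average_montly_hours) ~ left,
  data = toy,
  FUN = mean
)
by_left
```

Running the same `aggregate()` call on `empData` would give one row per group, which makes it easier to judge whether a pattern seen among leavers is actually specific to them.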

Why are good employees leaving?

We define a good employee as one who has been with the company for at least 4 years, has worked on more than 4 projects, or has a last evaluation greater than or equal to 0.7.

Filtering the employees that left to keep only the good ones:

emp.good.left <- emp.left %>% 
                  filter(last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 4)
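Because the three conditions in the filter are joined with `|`, an employee counts as “good” when any one of them holds. A minimal sketch that makes this explicit, using a hypothetical helper `is_good_employee()` and a toy data frame (made-up values, not the Kaggle data):

```r
# Hypothetical helper mirroring the filter condition used above
is_good_employee <- function(last_evaluation, time_spend_company, number_project) {
  last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 4
}

# Toy employees: none, evaluation only, tenure only, projects only
toy <- data.frame(
  last_evaluation    = c(0.50, 0.90, 0.40, 0.60),
  time_spend_company = c(2, 3, 5, 3),
  number_project     = c(3, 2, 4, 6)
)

toy$good <- with(toy, is_good_employee(last_evaluation, time_spend_company, number_project))
toy$good
## [1] FALSE  TRUE  TRUE  TRUE
```

Replacing `|` with `&` would require all three conditions at once and would select a much smaller group, so the choice of operator is part of the definition.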

Out of the 3571 employees that left,

nrow(emp.good.left)
## [1] 2020

were valuable (good) employees.

Below we visualize each variable in the dataset for these “good employees” to see what’s going on

par(mfrow=c(2,3))
hist(emp.good.left$satisfaction_level,col="blue",main="Satisfaction level")
barplot(table(emp.good.left$Dept),col="blue",main="Department")
hist(emp.good.left$promotion_last_5years,col="blue",main="Promoted in the last 5 years?")
barplot(table(emp.good.left$salary),col="blue",main="Salary")
hist(emp.good.left$average_montly_hours,col="blue",main="Average monthly hours")
hist(emp.good.left$time_spend_company,col="blue",main="Time spent in company")

We can observe the following from the above graphs:

  • The majority of the good employees that left had very low levels of satisfaction

  • The majority of the good employees that left belonged to the sales, support and technical departments

  • The majority of the good employees that left had not been promoted in the last 5 years

  • The majority of the good employees that left were overworked and spent many hours at the company

**Hence, we can see that valuable employees left because they were overworked, had not been promoted in the last 5 years, were at the low salary level and were dissatisfied. We have also found that most of the employees leaving are from the sales, technical and support departments.**

Now we analyse the sales, technical and support departments to see what is going on there:

emp.temp <- tbl_df(table(empData$Dept,empData$salary))
emp.temp<-spread(emp.temp,Var2,n)
emp.temp %>% mutate(totalEmp=low+medium+high,PercentLow=(low/totalEmp)*100,PercentHigh=(high/totalEmp)*100)
## # A tibble: 10 x 7
##           Var1  high   low medium totalEmp PercentLow PercentHigh
##          <chr> <int> <int>  <int>    <int>      <dbl>       <dbl>
##  1  accounting    74   358    335      767   46.67536    9.647979
##  2          hr    45   335    359      739   45.33153    6.089310
##  3          IT    83   609    535     1227   49.63325    6.764466
##  4  management   225   180    225      630   28.57143   35.714286
##  5   marketing    80   402    376      858   46.85315    9.324009
##  6 product_mng    68   451    383      902   50.00000    7.538803
##  7       RandD    51   364    372      787   46.25159    6.480305
##  8       sales   269  2099   1772     4140   50.70048    6.497585
##  9     support   141  1146    942     2229   51.41319    6.325707
## 10   technical   201  1372   1147     2720   50.44118    7.389706

We can see that in the sales, technical and support departments around 50% of the employees are at the low salary level, which might be why employees left.
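The same kind of percentage breakdown can also be obtained in base R with `prop.table()`, which avoids reshaping the table first. A minimal sketch on a toy data frame (made-up values, not the Kaggle data), offered as an alternative rather than as the method used above:

```r
# Toy data standing in for empData (made-up values for illustration only)
toy <- data.frame(
  Dept   = c("sales", "sales", "sales", "sales", "management", "management"),
  salary = c("low", "low", "medium", "high", "high", "medium"),
  stringsAsFactors = FALSE
)

# Cross-tabulate department against salary level
counts <- table(toy$Dept, toy$salary)

# margin = 1 normalises each row, so every department sums to 100%
pct <- round(prop.table(counts, margin = 1) * 100, 1)
pct
```

With `margin = 1` each row is a department and each cell is the share of that department at a given salary level, which corresponds to the PercentLow and PercentHigh columns computed above.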

emp.temp <- tbl_df(table(empData$Dept,empData$promotion_last_5years))
emp.temp<-spread(emp.temp,Var2,n)
colnames(emp.temp) <- c("Dept","Not Promoted","Promoted")
emp.temp %>% mutate(PercentPromoted=(`Promoted`/(`Promoted` + `Not Promoted`))*100)
## # A tibble: 10 x 4
##           Dept `Not Promoted` Promoted PercentPromoted
##          <chr>          <int>    <int>           <dbl>
##  1  accounting            753       14       1.8252934
##  2          hr            724       15       2.0297700
##  3          IT           1224        3       0.2444988
##  4  management            561       69      10.9523810
##  5   marketing            815       43       5.0116550
##  6 product_mng            902        0       0.0000000
##  7       RandD            760       27       3.4307497
##  8       sales           4040      100       2.4154589
##  9     support           2209       20       0.8972633
## 10   technical           2692       28       1.0294118

We can see that the sales, technical and support departments promoted only a very small percentage of their employees in the last 5 years (about 2.4%, 1.0% and 0.9% respectively), which might be why employees left.

par(mfrow=c(1,3))
emp.satis <- empData %>% 
  select(Dept,satisfaction_level) %>%
    filter(Dept=="technical")
hist(emp.satis$satisfaction_level, main = "Technical department")
emp.satis <- empData %>% 
  select(Dept,satisfaction_level) %>%
    filter(Dept=="sales")
hist(emp.satis$satisfaction_level, main = "Sales department")
emp.satis <- empData %>% 
  select(Dept,satisfaction_level) %>%
    filter(Dept=="support")
hist(emp.satis$satisfaction_level,main = "Support department")

One thing to note here is that the satisfaction level of the employees in all 3 of these departments is not as low as we expected.
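The three near-identical blocks above can be collapsed into a loop over department names. A minimal sketch, using a toy data frame (made-up values, not the Kaggle data); the median per department is added here as a numeric companion to the plots and is not part of the original analysis:

```r
# Toy data standing in for empData (made-up values for illustration only)
toy <- data.frame(
  Dept = rep(c("technical", "sales", "support"), each = 4),
  satisfaction_level = c(0.2, 0.5, 0.7, 0.9,
                         0.1, 0.6, 0.8, 0.8,
                         0.4, 0.5, 0.6, 0.9),
  stringsAsFactors = FALSE
)

depts <- c("technical", "sales", "support")

# One histogram per department, side by side
par(mfrow = c(1, 3))
for (d in depts) {
  hist(toy$satisfaction_level[toy$Dept == d],
       main = paste(d, "department"),
       xlab = "Satisfaction level")
}

# Median satisfaction per department
dept_medians <- tapply(toy$satisfaction_level, toy$Dept, median)
dept_medians
```

The same loop works for `average_montly_hours` by swapping the column name, so adding a fourth department only requires extending `depts`.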

par(mfrow=c(1,3))
emp.workhours <- empData %>% 
  select(Dept,average_montly_hours) %>%
  filter(Dept=="support")
hist(emp.workhours$average_montly_hours, main="Support department")

emp.workhours <- empData %>% 
  select(Dept,average_montly_hours) %>%
  filter(Dept=="sales")
hist(emp.workhours$average_montly_hours, main="Sales department")

emp.workhours <- empData %>% 
  select(Dept,average_montly_hours) %>%
  filter(Dept=="technical")
hist(emp.workhours$average_montly_hours, main="Technical department")

We can see that the majority of the employees in the 3 departments were overworked, which might be why they left.

**From all the above analyses, we can see that the situation in these 3 departments aligns with the observations we made on why employees are leaving.**

Summary

We can broadly conclude that any employee, whether valuable (as per our definition) or not, tends to leave if overworked, paid less or not promoted for 5 years.

Next steps

I’m planning to build a decision tree to predict whether an employee will leave the company. I also want to upload my input dataset to GitHub so I can use the data directly from there.
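As a rough idea of what that step could look like, here is a minimal sketch that fits a classification tree with the `rpart` package. Everything in it is an assumption made for illustration: the data are randomly generated, the rule that decides who “left” is invented, and only two predictors are used. It is not the model for this project.

```r
library(rpart)

set.seed(42)
n <- 200

# Randomly generated toy data (not the Kaggle data)
toy <- data.frame(
  satisfaction_level   = runif(n),
  average_montly_hours = sample(100:300, n, replace = TRUE)
)

# Invented rule for illustration: dissatisfied and overworked employees leave
toy$left <- factor(ifelse(toy$satisfaction_level < 0.3 &
                            toy$average_montly_hours > 200, 1, 0))

# Fit a classification tree on the two predictors
fit <- rpart(left ~ satisfaction_level + average_montly_hours,
             data = toy, method = "class")

# Predicted class for each row, and the share predicted correctly
pred <- predict(fit, newdata = toy, type = "class")
accuracy <- mean(pred == toy$left)
accuracy
```

The accuracy here is measured on the same rows used for fitting, so it is optimistic. For the real model, the data would need to be split into training and test sets before any accuracy figure is reported.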