EMPLOYEE ATTRITION PHASE 1

Introduction

Recruiting the best talent is the aspect of Human Resources (HR) that most people think of first, but it is not the only role HR takes care of. Retaining current staff is just as important as recruitment, and employee retention is one of the biggest challenges companies face today. Companies that do not keep their employees satisfied with respect to their aspirations and goals risk losing their best talent. A great deal of research and training now goes into teaching HR staff and managers how to retain their best employees, and every company wants to know why its best and most experienced employees leave the company prematurely. Ludovic Benistant has released a simulated dataset on Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics. By analyzing this dataset, we will try to predict, using machine learning algorithms, whether an employee is likely to leave the company, and to identify which factors (satisfaction level, salary, number of projects, average monthly hours) lead an employee to leave. Once the factors with the greatest impact on attrition are known, companies can take appropriate action to retain their employees.

Methods

The simulated Human Resources analytics dataset contains records for 14999 employees and has the following 10 variables:

Employee satisfaction level - How satisfied the employee is with his or her work (values range from 0 to 1, where 0 stands for not satisfied and 1 stands for satisfied).

Last evaluation - The employee's most recent evaluation rating (values range from 0 to 1).

Number of projects - Number of projects the employee is working on (2, 3, 4, 5, 6 or 7 projects).

Average monthly hours - Average number of hours worked per month.

Time spent at the company - Number of years spent in the company (2, 3, 4, 5, 6, 7, 8 or 10 years).

Work accident - Whether the employee had an accident at work (0 for no, 1 for yes).

Promotion in the last 5 years - Whether the employee was promoted in the last 5 years (0 for no, 1 for yes).

Department - The department the employee works in (sales, accounting, technical, support, IT, product_mng, marketing, hr, RandD, management).

Salary - The employee's salary, categorized into three levels (low, medium and high).

Whether the employee has left - Whether the employee has left the company (0 for no, 1 for yes).

library(readr)
library(mosaic)

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: lattice

## Loading required package: ggplot2

## Loading required package: mosaicData

## Loading required package: Matrix

## 
## The 'mosaic' package masks several functions from core packages in order to add additional features.  
## The original behavior of these functions should not be affected by this.

## 
## Attaching package: 'mosaic'

## The following object is masked from 'package:Matrix':
## 
##     mean

## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally

## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cov, D, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var

## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

library(nortest)
library(corrplot)
library(Hmisc)

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

Loading the appropriate libraries

Reading the data

Loading the data from the CSV file

HR <- read_csv(
  "C:/Users/Bhargab/Desktop/HR_comma_sep.csv",
  col_types = cols(
    salary = col_factor(levels = c("low", "medium", "high")),
    sales = col_factor(levels = c("accounting", "hr", "IT", "management", "marketing", "product_mng", "RandD", "sales", "support", "technical"))
  )
)

Exploring the data

head(HR) #Checking the data, 1st 6 entries

## # A tibble: 6 x 10
##   satisfaction_level last_evaluation number_project average_montly_hours
##                <dbl>           <dbl>          <int>                <int>
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
## # ... with 6 more variables: time_spend_company <int>,
## #   Work_accident <int>, left <int>, promotion_last_5years <int>,
## #   sales <fctr>, salary <fctr>

str(HR) # We have 2 factor variables (sales and salary), 2 numeric variables and 6 integer variables.

## Classes 'tbl_df', 'tbl' and 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "low","medium",..: 1 2 2 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 10
##   .. ..$ satisfaction_level   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ last_evaluation      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ number_project       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ average_montly_hours : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ time_spend_company   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Work_accident        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ left                 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ promotion_last_5years: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ sales                :List of 2
##   .. .. ..$ levels : chr  "accounting" "hr" "IT" "management" ...
##   .. .. ..$ ordered: logi FALSE
##   .. .. ..- attr(*, "class")= chr  "collector_factor" "collector"
##   .. ..$ salary               :List of 2
##   .. .. ..$ levels : chr  "low" "medium" "high"
##   .. .. ..$ ordered: logi FALSE
##   .. .. ..- attr(*, "class")= chr  "collector_factor" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

HR$Department <- HR$sales # copy the department column under a more appropriate name
HR$sales <- NULL          # drop the misnamed original column
HR$promotion_last_5years <- as.factor(HR$promotion_last_5years)
HR$Work_accident <- as.factor(HR$Work_accident)
HR$salary <- factor(HR$salary)
HR$left <- as.factor(HR$left)
str(HR) # Rechecking the data types of the variables

## Classes 'tbl_df', 'tbl' and 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ left                 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ salary               : Factor w/ 3 levels "low","medium",..: 1 2 2 1 1 1 1 1 1 1 ...
##  $ Department           : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 10
##   .. ..$ satisfaction_level   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ last_evaluation      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ number_project       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ average_montly_hours : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ time_spend_company   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Work_accident        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ left                 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ promotion_last_5years: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ sales                :List of 2
##   .. .. ..$ levels : chr  "accounting" "hr" "IT" "management" ...
##   .. .. ..$ ordered: logi FALSE
##   .. .. ..- attr(*, "class")= chr  "collector_factor" "collector"
##   .. ..$ salary               :List of 2
##   .. .. ..$ levels : chr  "low" "medium" "high"
##   .. .. ..$ ordered: logi FALSE
##   .. .. ..- attr(*, "class")= chr  "collector_factor" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

summary(HR) # Checking the entire summary of the dataset.

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident left      promotion_last_5years
##  Min.   : 2.000     0:12830       0:11428   0:14680              
##  1st Qu.: 3.000     1: 2169       1: 3571   1:  319              
##  Median : 3.000                                                  
##  Mean   : 3.498                                                  
##  3rd Qu.: 4.000                                                  
##  Max.   :10.000                                                  
##                                                                  
##     salary           Department  
##  low   :7316   sales      :4140  
##  medium:6446   technical  :2720  
##  high  :1237   support    :2229  
##                IT         :1227  
##                product_mng: 902  
##                marketing  : 858  
##                (Other)    :2923

Satisfaction level has a minimum of 0.09, a maximum of 1.0 and a mean of 0.6128. All values lie within the valid range (0 to 1), so there are no out-of-range values.

Last_Evaluation has a minimum of 0.36, a maximum of 1 and a mean of 0.7161. These values also lie within the valid range (0 to 1), so there are no out-of-range values.

Number_of_Projects has a minimum value of 2 and a maximum value of 7, with a mean of 3.803.

Average_monthly_hours has a minimum value of 96 and a maximum value of 310, with a mean of 201.1.

Time_spend_company has a minimum value of 2 years and a maximum value of 10 years, with a mean of 3.498.

Work_accident is a categorical variable: 0 marks the employees without any accident (12830) and 1 marks the employees who had an accident (2169).

Left is a categorical variable with values 0 and 1: 0 represents the employees who are still working in the company (11428) and 1 represents the employees who have left (3571).

promotion_last_5years is also a categorical variable: 0 represents the employees who did not get a promotion in the last 5 years (14680) and 1 represents the employees who got a promotion (319).

7316 employees get a low salary, 6446 employees get a medium salary and 1237 employees receive a high salary.

By department, there are 4140 employees in sales, 2720 in technical, 2229 in support, 1227 in IT, 902 in product_mng, 858 in marketing and 2923 in the other departments (accounting, hr, RandD, management).

Checking for missing values

sum(is.na(HR))

## [1] 0

There are no missing values in the dataset, so we do not need to apply any missing-value techniques.
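A per-column count can show where missing values sit when a dataset does contain some. The sketch below is only an illustration on a made-up data frame (`toy`, with one NA inserted on purpose); it is not run on the HR data:

```r
# Toy stand-in for the HR data; the values (and the NA) are made up for illustration.
toy <- data.frame(
  satisfaction_level = c(0.38, 0.80, NA, 0.72),
  number_project     = c(2L, 5L, 7L, 5L),
  salary             = factor(c("low", "medium", "medium", "low"))
)

# Missing values per column, then the overall total
# (the total is what sum(is.na(HR)) reports for the real data).
na_per_column <- colSums(is.na(toy))
na_total      <- sum(na_per_column)
print(na_per_column)
```

On the real data, the same two lines applied to `HR` would be expected to report zero for every column, in line with the total of 0 shown above.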

Checking for outliers

boxplot(HR$satisfaction_level,main="BoxPlot of Satisfaction Level",lwd=2)

boxplot(HR$last_evaluation,main="BoxPlot of Last Evaluation",lwd=2)

boxplot(HR$number_project,main="BoxPlot of Number of Projects",lwd=2)

boxplot(HR$average_montly_hours,main="BoxPlot of average_montly_hours",lwd=2)

boxplot(HR$time_spend_company,main="BoxPlot of time_spend_company",lwd=2)

The box plots show that every variable except time spent in the company has values within the whiskers, so those variables have no outliers. The points flagged for time spent in the company are employees who have worked there for more than 5 years. Such employees can genuinely exist, so we cannot simply delete them; how to treat them depends on further analysis.
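The points a box plot draws beyond its whiskers can also be listed numerically. The following is a minimal sketch on hypothetical tenure values (the vector `tenure` is made up, not taken from the HR data); `boxplot.stats()` applies the same 1.5 * IQR whisker rule that `boxplot()` uses by default:

```r
# Hypothetical years-at-company values, chosen only to illustrate the rule.
tenure <- c(2, 3, 3, 3, 3, 4, 4, 5, 6, 10)

# $out holds the observations lying beyond the whiskers.
flagged   <- boxplot.stats(tenure)$out
n_flagged <- length(flagged)
print(flagged)
```

Applied to `HR$time_spend_company`, the same call would list the long-tenure employees discussed above without requiring the plot to be read by eye.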

Checking for normality

hist(HR$satisfaction_level,main="Histogram of Satisfaction Level",lwd=2)

ad.test(HR$satisfaction_level)

## 
##  Anderson-Darling normality test
## 
## data:  HR$satisfaction_level
## A = 168.37, p-value < 2.2e-16

qqnorm(HR$satisfaction_level,main="Q-Q plot for Satisfaction level")
qqline(HR$satisfaction_level,col="red",lwd=2)

hist(HR$last_evaluation,main="Histogram of Last Evaluation",lwd=2)

ad.test(HR$last_evaluation)

## 
##  Anderson-Darling normality test
## 
## data:  HR$last_evaluation
## A = 221.12, p-value < 2.2e-16

qqnorm(HR$last_evaluation,main="Q-Q plot for Last evaluation")
qqline(HR$last_evaluation,col="red",lwd=2)

hist(HR$average_montly_hours,main="Histogram of Average Monthly Hours",lwd=2)

ad.test(HR$average_montly_hours)

## 
##  Anderson-Darling normality test
## 
## data:  HR$average_montly_hours
## A = 195.66, p-value < 2.2e-16

qqnorm(HR$average_montly_hours,main="Q-Q plot for Average monthly hours")
qqline(HR$average_montly_hours,col="red",lwd=2)

Since the p-value is less than 0.05 for all three variables tested, we reject the null hypothesis of normality, i.e. the data are not normally distributed. The Q-Q plots also show flat tails for these variables.
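Running a normality test over several columns at once can be sketched as below. This is an illustration under stated assumptions: the data frame `toy` is simulated, and the base-R Shapiro-Wilk test is swapped in for Anderson-Darling so that the example needs no extra package. `shapiro.test()` accepts at most 5000 observations, so for the full HR columns `ad.test()` or a subsample would be needed instead.

```r
set.seed(42)

# Simulated columns: one clearly non-normal (uniform), one normal.
toy <- data.frame(
  uniform_like = runif(1000),
  bell_like    = rnorm(1000)
)

# Apply the test to every column and keep only the p-values.
p_values <- sapply(toy, function(col) shapiro.test(col)$p.value)
print(p_values)
```

A p-value below 0.05 for a column leads to the same conclusion as in the text: the hypothesis of normality is rejected for that column.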

Correlation

corrplot(cor(HR[sapply(HR,is.numeric)]), method="pie") # to check the pairwise correlation

rcorr(as.matrix(HR[,c(1,2,3,4,5)]))

##                      satisfaction_level last_evaluation number_project
## satisfaction_level                 1.00            0.11          -0.14
## last_evaluation                    0.11            1.00           0.35
## number_project                    -0.14            0.35           1.00
## average_montly_hours              -0.02            0.34           0.42
## time_spend_company                -0.10            0.13           0.20
##                      average_montly_hours time_spend_company
## satisfaction_level                  -0.02              -0.10
## last_evaluation                      0.34               0.13
## number_project                       0.42               0.20
## average_montly_hours                 1.00               0.13
## time_spend_company                   0.13               1.00
## 
## n= 14999 
## 
## 
## P
##                      satisfaction_level last_evaluation number_project
## satisfaction_level                      0.0000          0.0000        
## last_evaluation      0.0000                             0.0000        
## number_project       0.0000             0.0000                        
## average_montly_hours 0.0141             0.0000          0.0000        
## time_spend_company   0.0000             0.0000          0.0000        
##                      average_montly_hours time_spend_company
## satisfaction_level   0.0141               0.0000            
## last_evaluation      0.0000               0.0000            
## number_project       0.0000               0.0000            
## average_montly_hours                      0.0000            
## time_spend_company   0.0000

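As per the matrix and plot, none of the pairwise correlations is strong: the largest is 0.42, between number_project and average_montly_hours. All of them are statistically significant (p < 0.05), which is expected with 14999 observations, but their magnitudes are small. Hence we can go ahead and use all the predictors for modeling and visualizations (we don't have to worry about multicollinearity).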
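The strongest pairwise correlation can be picked out programmatically. The sketch below re-enters the rounded coefficients printed by `rcorr()` above by hand; the 0.7 cut-off for a "strong" correlation is an assumed rule of thumb, not something fixed by the analysis:

```r
vars <- c("satisfaction_level", "last_evaluation", "number_project",
          "average_montly_hours", "time_spend_company")

# Rounded correlation coefficients copied from the rcorr() output.
r <- matrix(c(
   1.00, 0.11, -0.14, -0.02, -0.10,
   0.11, 1.00,  0.35,  0.34,  0.13,
  -0.14, 0.35,  1.00,  0.42,  0.20,
  -0.02, 0.34,  0.42,  1.00,  0.13,
  -0.10, 0.13,  0.20,  0.13,  1.00),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Ignore the diagonal, then locate the largest absolute correlation.
off_diag <- r
diag(off_diag) <- 0
max_abs   <- max(abs(off_diag))
strongest <- which(abs(off_diag) == max_abs, arr.ind = TRUE)[1, ]
pair      <- c(vars[strongest[["row"]]], vars[strongest[["col"]]])

# Assumed rule of thumb: flag multicollinearity only above 0.7.
any_strong <- max_abs > 0.7
print(pair)
print(any_strong)
```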

Visualizing the data

HR$left=factor(HR$left,levels = c(0,1),labels=c("No","Yes")) #factoring the target variable


# Box plot of satisfaction level by salary, split by whether the employee left
ggplot(HR, aes(x = salary, y = satisfaction_level, fill = as.factor(left), colour = as.factor(left))) + 
  geom_boxplot(outlier.colour = "black") + xlab("Salary") + ylab("Satisfaction level") + ggtitle("Plot of Satisfaction Level by Salary")

Among employees with a high salary, those with a low satisfaction level seem to be leaving. There are a few outliers as well, whose impact we can examine in further analysis (while building the model). If the satisfaction level is below 0.5, there is a high chance that an employee will leave regardless of salary.

ggplot(HR, aes(x = as.factor(number_project), y = satisfaction_level,fill=as.factor(left),color=salary)) + geom_boxplot(width=1)+ ggtitle(("Plot of Satisfaction Level by number of projects \n & Salary"))

Both too few and too many projects are associated with a lower satisfaction level and a very high chance of the employee leaving. Also, employees with 4-5 projects seem to be leaving even when their satisfaction level is high.

ggplot(HR, aes(x = as.factor(number_project), y= satisfaction_level,fill=as.factor(number_project)  )) + geom_boxplot() + ggtitle(("Plot of Satisfaction Level by number of projects"))

ggplot(HR, aes(x = as.factor(number_project), fill = as.factor(left))) + geom_bar(position = "dodge") + ggtitle(("Plot of number of projects by count \n based on Left"))

In this dataset, everyone with 7 projects has left and had a very low satisfaction level. They are followed by the employees working on very few projects, whose satisfaction is also low, as the box plot shows. Employees working on 3 to 5 projects are mostly not leaving the company, since their satisfaction is above 0.5.
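The share of leavers per project count can be computed directly rather than read off the bar chart. A minimal sketch, assuming a made-up data frame `toy` with the same two columns as the HR data (the values are hypothetical):

```r
# Made-up rows; only the column names mirror the HR data.
toy <- data.frame(
  number_project = c(2, 2, 3, 3, 4, 4, 7, 7),
  left = factor(c("Yes", "No", "No", "No", "No", "Yes", "Yes", "Yes"),
                levels = c("No", "Yes"))
)

# Row-wise proportions: within each project count, the share who left.
counts     <- table(toy$number_project, toy$left)
rates      <- prop.table(counts, margin = 1)
leave_rate <- rates[, "Yes"]
print(leave_rate)
```

A leave rate of 1 for a project count means every employee in that group left, which is the pattern described above for 7 projects.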

ggplot(HR, aes(x = as.factor(number_project), fill = promotion_last_5years)) + geom_bar(position = "dodge") + ggtitle(("Plot of number of projects \n by count based on Promotion"))

ggplot(HR, aes(x = as.factor(left), fill = promotion_last_5years)) + geom_bar(position = "dodge")+ ggtitle(("Plot of Employees leaving by Count \n based on Promotion"))

Promotion in the last 5 years does not seem to have any notable relation with employees leaving, as both graphs show. Only a small number of employees received a promotion within the last 5 years.

ggplot(HR, aes(x =average_montly_hours)) + geom_histogram(binwidth = 2) + facet_grid(as.factor(left) ~.) + ggtitle(("Plot of Employees leaving by Average Monthly Hours"))

We can see that few employees who work between 150 and 200 hours leave the company, but once they work more than about 225 hours they start leaving. We will check the density plot to confirm this, and to look at employees working below 150 hours as well.

ggplot(HR, aes(x=average_montly_hours, colour = as.factor(left))) + geom_density()+ ggtitle(("Density Plot of number of employees leaving based \n on average monthly hours"))

The density plot confirms the observation above.

ggplot(HR, aes(x = as.factor(time_spend_company), y=average_montly_hours, fill =as.factor(left),color=salary)) + geom_boxplot()+ ggtitle(("Average monthly hours by time spent \n in the company and salary"))

It is also seen that anyone who has stayed for more than 6 years is unlikely to leave, even with a low salary and more than 225 working hours per month. Employees who have been with the company for 4 to 6 years and work more than 225 hours are leaving irrespective of salary. The graph also suggests that employees who work between 150 and 225 hours per month are less likely to leave the company.
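The hour ranges discussed above can be turned into bands and compared by attrition rate. This is a sketch under assumptions: the data frame `toy` and its values are made up, `left` is coded 0/1 as in the raw file, and the cut points 150 and 225 are simply the thresholds mentioned in the text:

```r
# Hypothetical monthly hours and leave flags (1 = left, 0 = stayed).
toy <- data.frame(
  hours = c(130, 145, 160, 180, 210, 220, 240, 260, 280, 300),
  left  = c(1, 1, 0, 0, 0, 0, 1, 1, 0, 1)
)

# Bin the hours; intervals are right-closed, so 150 falls in "under 150".
toy$band <- cut(toy$hours,
                breaks = c(-Inf, 150, 225, Inf),
                labels = c("under 150", "150 to 225", "over 225"))

# Mean of a 0/1 flag is the share of leavers in each band.
rate_by_band <- tapply(toy$left, toy$band, mean)
print(rate_by_band)
```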

p<-ggplot(HR, aes(x = Department, fill = salary)) + geom_bar(position = "dodge") + geom_text(stat='count',aes(label=..count..),vjust=-1,color='black', position=position_dodge(1.2), size=3) + ggtitle(("Plot of Intra department salaries by Count "))
p + scale_x_discrete(labels = abbreviate)

## Warning: position_dodge requires non-overlapping x intervals

Management has the highest proportion of high salaries (about 36%), while the proportion of low salaries is about 50% in sales and slightly higher in support (about 51%), which is confirmed with the tally function below. The sales team is the largest department in the company, with 4140 employees. Management has 225 employees with a high salary and only 180 with a low salary, the fewest low-salary employees of any department. HR has the highest proportion of medium salaries (about 49%).

tally(HR$salary~HR$Department, format = "proportion")

##          HR$Department
## HR$salary accounting         hr         IT management  marketing
##    low    0.46675359 0.45331529 0.49633252 0.28571429 0.46853147
##    medium 0.43676662 0.48579161 0.43602282 0.35714286 0.43822844
##    high   0.09647979 0.06089310 0.06764466 0.35714286 0.09324009
##          HR$Department
## HR$salary product_mng      RandD      sales    support  technical
##    low     0.50000000 0.46251588 0.50700483 0.51413190 0.50441176
##    medium  0.42461197 0.47268107 0.42801932 0.42261104 0.42169118
##    high    0.07538803 0.06480305 0.06497585 0.06325707 0.07389706
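The same kind of within-department proportions can be obtained in base R with `prop.table()`. The sketch below uses a small made-up data frame (`toy`) rather than the HR data, and is offered as an equivalent of the `tally()` call only for this simple two-variable case:

```r
# Made-up rows; only the column names mirror the HR data.
toy <- data.frame(
  salary = factor(c("low", "low", "medium", "high", "low", "medium"),
                  levels = c("low", "medium", "high")),
  Department = factor(c("sales", "sales", "sales",
                        "management", "management", "management"))
)

# margin = 2 normalizes each column, so every department sums to 1.
salary_share <- prop.table(table(toy$salary, toy$Department), margin = 2)
print(round(salary_share, 3))
```

Normalizing by column is what makes departments of very different sizes comparable, which is why the proportions and the raw counts can rank departments differently.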

Conclusion

Since this is a simulated dataset, there are no outliers except in time spent in the company, and we cannot treat those values as outliers given our objectives and goals, so we are not making any changes to the dataset. There are no missing values, and hence cleaning the dataset is not necessary. We also saw that the data is not normally distributed, but for our modeling we have decided to use Decision Tree, Random Forest and Logistic Regression with Lasso, none of which depends on the assumption of normality. Only for Lasso do we need to center the data, and that will be done during the modeling phase. For now, based on data visualization and exploration, we can say that satisfaction level has a high impact on the retention of an employee, followed by the number of projects and the average number of working hours. Surprisingly, salary and promotion do not seem to have much of an effect.
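The centering step mentioned for Lasso could look like the following. This is a minimal sketch on a made-up data frame (`toy`); it also scales each column to unit standard deviation, which is an assumption on our part about how the modeling phase will prepare the predictors:

```r
# Made-up numeric predictors; only the column names mirror the HR data.
toy <- data.frame(
  satisfaction_level   = c(0.38, 0.80, 0.11, 0.72),
  average_montly_hours = c(157, 262, 272, 223)
)

# scale() subtracts each column mean and divides by the column standard deviation.
scaled    <- scale(toy)
col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)
print(round(scaled, 3))
```

After this step every predictor has mean 0 and standard deviation 1, so a penalty on coefficient size treats hours and satisfaction on the same footing.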

Machine Learning

Bhargab Dhar @ Siddharth Jain

April 18, 2017