Our Goal: To find the factor contributed most to employee turnover and to create a model that can predict the percentage likelihood of a certain employee will leave the company.
Introduction: We all know that high quality employees play the most important role in a successful company. So when a company experiences a high rate of employee turnover, we should realize something is going wrong. This can lead the company to huge monetary losses by these innovative and valuable employees. Companies that maintain a healthy organization and culture are always a good sign of future prosperity. Recognizing and understanding what factors that were associated with employee turnover will allow companies and individuals to limit this from happening and may even increase employee productivity and growth. These predictive insights give managers the opportunity to take corrective steps to build and preserve their successful business.
Our goal is to help the company to understand what factors contributed most to employee turnover and to create a model that can predict the percentage likelihood of a certain employee will leave the company. The goal is to create or improve different retention strategies on targeted employees. Overall, the implementation of this model will allow management to create better decision-making actions.
About the dataset: We found a dataset from kaggle, which is a human resource data. When we first looked at this dataset, there are 14999 observations or employees and 10 variables includes satisfaction level of employees, last evaluation grades, number of projects worked on, promotion in last 5 years, department, work accident yes or not, time spend in company, average monthly hours, salary and outcome.
##load data and EDA
data <- read.csv("/Users/HUHU1995/Desktop/6101/6101 Final Project/HR_comma_sep.csv")
head(data)
## satisfaction_level last_evaluation number_project average_montly_hours
## 1 0.38 0.53 2 157
## 2 0.80 0.86 5 262
## 3 0.11 0.88 7 272
## 4 0.72 0.87 5 223
## 5 0.37 0.52 2 159
## 6 0.41 0.50 2 153
## time_spend_company Work_accident left promotion_last_5years sales salary
## 1 3 0 1 0 sales low
## 2 6 0 1 0 sales medium
## 3 4 0 1 0 sales medium
## 4 5 0 1 0 sales low
## 5 3 0 1 0 sales low
## 6 3 0 1 0 sales low
summary(data)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years sales salary
## Min. :0.00000 sales :4140 high :1237
## 1st Qu.:0.00000 technical :2720 low :7316
## Median :0.00000 support :2229 medium:6446
## Mean :0.02127 IT :1227
## 3rd Qu.:0.00000 product_mng: 902
## Max. :1.00000 marketing : 858
## (Other) :2923
str(data)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "IT","RandD","accounting",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
attach(data)
#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
library(car)
library(mice)
library(ggplot2)
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
md.pattern(data)
## satisfaction_level last_evaluation number_project
## [1,] 1 1 1
## [2,] 0 0 0
## average_montly_hours time_spend_company Work_accident left
## [1,] 1 1 1 1
## [2,] 0 0 0 0
## promotion_last_5years sales salary
## [1,] 1 1 1 0
## [2,] 0 0 0 0
#The first thing we did was checking the missing values because there are too many data information in our dataset. We can see that there is no missing (or NA) values in the dataset by looking the histogram and pattern.
aggr_plot <- aggr(data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE,labels=names(data), cex.axis=.6, gap=1, ylab=c("Histogram of missing data","Pattern")) #We can also know there is no missing (or NA) values in the dataset by looking the histogram and pattern.
##
## Variables sorted by number of missings:
## Variable Count
## satisfaction_level 0
## last_evaluation 0
## number_project 0
## average_montly_hours 0
## time_spend_company 0
## Work_accident 0
## left 0
## promotion_last_5years 0
## sales 0
## salary 0
countNA(data) #Another way to count NA values
## [1] 0
#plot the graphs of correlations between each variables.
#This figure shows how all the variables correlates to each other. The size of each bubble shows the significance of the correlation. The size of each bubble shows the direction (either positive or negative). Moreover, we can see the staff who leave have a low satisfaction level. In addition, they employees who worked more but didn't get promoted within the past five years because there is no relationship between them according to this graph.
corrplot(cor(data[1:8]), method = "circle", type = "lower", tl.col = "black", tl.srt = 45)
corrplot(cor(data[1:8]), method = "number", type = "lower", tl.col = "black", tl.srt = 45)
#we separate the dataset based on if the employees left or not. Then we only focus on the dataset which employees left. We compared 4 variables to see if they can influence the result (go or stay).
data1 <- data[left==1,]
data0 <- data[left==0,]
str(data0)
## 'data.frame': 11428 obs. of 10 variables:
## $ satisfaction_level : num 0.58 0.82 0.45 0.78 0.49 0.36 0.54 0.99 0.5 0.74 ...
## $ last_evaluation : num 0.74 0.67 0.69 0.82 0.6 0.95 0.37 0.91 0.75 0.64 ...
## $ number_project : int 4 2 5 5 3 3 2 5 6 4 ...
## $ average_montly_hours : int 215 202 193 247 214 206 176 136 127 268 ...
## $ time_spend_company : int 3 3 3 3 2 4 2 4 3 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 0 0 0 0 0 0 0 0 0 0 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "IT","RandD","accounting",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 2 2 2 2 2 2 2 2 2 ...
#At first, we want to find out is the low satisfaction level is really a reason to let employees leave the company? Then we made a histogram to show the satisfaction level of all the employees who left the company. From the figure we can see that employees who left the company tend to have low satisfaction level. Therefore, we conclude that can be a reason to leave the company.
plot <- function(i){ggplot(data = data1, aes(x= i, y=..density..)) + geom_histogram(fill="blue",color="green",alpha = 0.1,bins =60) +
geom_density()
}
plot(data1$satisfaction_level)
#To find out the is the evaluation have impacts on people to leave the company, we drew the distribution we can see from the figure that the people who left the company have whether really high evaluation or really low evaluation, not many in the meddle. We can see that the reason they leave the company might because they are too good or too bad to stay in.
plot(data1$last_evaluation)
#Then we want to see if the salary can influence the employee to leave the company, so we did the salary compare for both people who left the company and left the company. We can see for the employees who left the company, they tend to have less percentage of medium and high level salary than the employees who stay. Therefore, we can say that the reason that they leave the company could be their salary is low.
data1$salary <-factor(data1$salary, levels = c("low","medium","high"))
barplot(table(data1$salary), col=c("orange","blue", "red"),
main="Salary for employees who have left")
data0$salary <-factor(data0$salary, levels = c("low","medium","high"))
barplot(table(data0$salary), col=c("orange","blue", "red"),
main="Salary for employees who still stay")
#Then we want to find is there a certain time period after that employees begin to leave the company. From the figure we can see that Most of the employees who have left company have spent three years in company so it might possible that they have considered three years as enough time leave.
ggplot(data1, aes(time_spend_company))+
geom_histogram(binwidth=0.3,fill='blue')+
labs(x="Time spent company", title="Time Spend in company")+
theme_bw()
#To create z-scores (standardized scores) for variables.
#Before we built the prediction model for the human resource data, we firstly converted some variables??? values to z-scores, in others words, we standardized the values of satisfaction level of employees, last evaluation grades, number of projects worked on, time spend in company, average monthly hours. we created a standard deviation for each one of these variable terms, because we can see in the original data, the values of the satisfaction level are decimals, which are less than 1, but for the average monthly hour are hundreds. So the variable range for these two variables are very different. There is a huge influence for coefficients for variables. So if we standardize them, that influence will not happen.
rm(list=ls())
data <- read.csv("/Users/HUHU1995/Desktop/6101/6101 Final Project/HR_comma_sep.csv")
attach(data)
## The following objects are masked from data (pos = 11):
##
## Work_accident, average_montly_hours, last_evaluation, left,
## number_project, promotion_last_5years, salary, sales,
## satisfaction_level, time_spend_company
#change the categorical variables to factors
data$left <- factor(data$left)
data$promotion_last_5years <- factor(data$promotion_last_5years)
data$Work_accident<-factor(data$Work_accident)
satisfaction_level<-scale(satisfaction_level, center = TRUE, scale=TRUE)
last_evaluation<-scale(last_evaluation, center = TRUE, scale=TRUE)
number_project<-scale(number_project, center = TRUE, scale=TRUE)
average_montly_hours<-scale(average_montly_hours, center = TRUE, scale=TRUE)
time_spend_company<-scale(time_spend_company, center = TRUE, scale=TRUE)
data <- subset(data, select = -c(satisfaction_level, last_evaluation, number_project,time_spend_company, average_montly_hours) )
data <- data.frame(satisfaction_level,last_evaluation,number_project,time_spend_company,average_montly_hours,data)
head(data)
## satisfaction_level last_evaluation number_project time_spend_company
## 1 -0.9364635 -1.0872390 -1.4628141 -0.3412238
## 2 0.7527892 0.8406789 0.9710806 1.7133790
## 3 -2.0224116 0.9575224 2.5936770 0.3436438
## 4 0.4310268 0.8991007 0.9710806 1.0285114
## 5 -0.9766838 -1.1456608 -1.4628141 -0.3412238
## 6 -0.8158026 -1.2625043 -1.4628141 -0.3412238
## average_montly_hours Work_accident left promotion_last_5years sales
## 1 -0.8820105 0 1 0 sales
## 2 1.2203821 0 1 0 sales
## 3 1.4206099 0 1 0 sales
## 4 0.4394934 0 1 0 sales
## 5 -0.8419649 0 1 0 sales
## 6 -0.9621016 0 1 0 sales
## salary
## 1 low
## 2 medium
## 3 medium
## 4 low
## 5 low
## 6 low
summary(data)
## satisfaction_level last_evaluation number_project
## Min. :-2.1029 Min. :-2.08041 Min. :-1.4628
## 1st Qu.:-0.6951 1st Qu.:-0.91197 1st Qu.:-0.6515
## Median : 0.1093 Median : 0.02277 Median : 0.1598
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.8332 3rd Qu.: 0.89910 3rd Qu.: 0.9711
## Max. : 1.5572 Max. : 1.65858 Max. : 2.5937
##
## time_spend_company average_montly_hours Work_accident left
## Min. :-1.0261 Min. :-2.10340 0:12830 0:11428
## 1st Qu.:-0.3412 1st Qu.:-0.90203 1: 2169 1: 3571
## Median :-0.3412 Median :-0.02103
## Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3436 3rd Qu.: 0.87999
## Max. : 4.4528 Max. : 2.18148
##
## promotion_last_5years sales salary
## 0:14680 sales :4140 high :1237
## 1: 319 technical :2720 low :7316
## support :2229 medium:6446
## IT :1227
## product_mng: 902
## marketing : 858
## (Other) :2923
#Because the dependent variable left is binary variable, which is consisted of 1 and 0, so we used the logistics regression model to find the factors contributed most to employee turnover and to create a model that can predict the percentage likelihood of a certain employee will leave the company.
#First, select train dataset and test dataset.
set.seed(1) #Randomly splitting the data into train (80% of data) and test (20%) segmenets.
train_index = sample(1:nrow(data), nrow(data)*0.8, replace=FALSE)
train = data[train_index, ]
test = data[-train_index, ]
#Fit a logistic regression model using train data.
HR <- glm(left~.-sales, family = binomial(link = "logit"), data = train)
summary(HR)
##
## Call:
## glm(formula = left ~ . - sales, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1626 -0.6652 -0.4105 -0.1299 3.0316
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.89134 0.13755 -21.020 < 2e-16 ***
## satisfaction_level -1.00472 0.02701 -37.197 < 2e-16 ***
## last_evaluation 0.13426 0.02839 4.729 2.26e-06 ***
## number_project -0.37373 0.02919 -12.803 < 2e-16 ***
## time_spend_company 0.36971 0.02482 14.893 < 2e-16 ***
## average_montly_hours 0.22641 0.02855 7.931 2.18e-15 ***
## Work_accident1 -1.50768 0.09866 -15.282 < 2e-16 ***
## promotion_last_5years1 -1.67159 0.30433 -5.493 3.96e-08 ***
## salarylow 1.94680 0.14031 13.875 < 2e-16 ***
## salarymedium 1.40264 0.14127 9.929 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13111 on 11998 degrees of freedom
## Residual deviance: 10335 on 11989 degrees of freedom
## AIC: 10355
##
## Number of Fisher Scoring iterations: 5
#convert to % via exp() function
HR.ouput <- exp(coef(HR))
HR.ouput
## (Intercept) satisfaction_level last_evaluation
## 0.05550161 0.36614544 1.14368940
## number_project time_spend_company average_montly_hours
## 0.68816261 1.44731465 1.25408642
## Work_accident1 promotion_last_5years1 salarylow
## 0.22142336 0.18794782 7.00622481
## salarymedium
## 4.06591742
#Then we can see which variable contribute most to the employees left by the coefficient values. According to the p-values of the logistic regression output, we can see all of the variables are statistically significant to the dependent variable left. The coefficient of salary is larger than others, so we can say it contributes most to employee turnover. It uses the high salary as the baseline so the log odds increase 1.94 when move from the high salary to low salary. This also make sense because intuitively employees have more probability to leave the company if the salary is lower.
#In order to check the model accuracy, we calculate the ROC(Receiver Operating Characteristic) curves, AUC, Pseudo R value and the hit rate using 50% as the break point.
prob=predict(HR, type = c("response"))
#install.packages("pROC")
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:colorspace':
##
## coords
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
h <- roc(left~prob, data=train)
h
##
## Call:
## roc.formula(formula = left ~ prob, data = train)
##
## Data: prob in 9168 controls (left 0) < 2831 cases (left 1).
## Area under the curve: 0.817
plot(h) #ROC plot
#Psuedo R
library(pscl)
## Warning: package 'pscl' was built under R version 3.4.2
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(HR)
## llh llhNull G2 McFadden r2ML
## -5167.4379280 -6555.6581265 2776.4403969 0.2117591 0.2065695
## r2CU
## 0.3107755
#Calculate the hit rate.
HR2 <- predict(HR,test,type='response')
HR2.1 <- ifelse(HR2 > 0.5,1,0)
head(HR2.1)
## 10 30 50 52 56 57
## 0 0 0 0 0 1
table(as.factor(test$left),HR2.1)
## HR2.1
## 0 1
## 0 2105 155
## 1 510 230
HR3 <- mean(HR2.1!=as.numeric(test$left))
head(HR2.1)
## 10 30 50 52 56 57
## 0 0 0 0 0 1
head(left)
## [1] 1 1 1 1 1 1
HR3
## [1] 0.9483333
#According to the ROC result which measure the sensitivity and the specificity of the model, we can see the curve converges to the right left side, and the AUC is 0.817, the Pseudo R is 0.21, and the hit rate is 94.8%, all of these statistics show us out model is very good.
#Finally, we used our model to predict the percentage likelihood of a certain employee will leave the company. For example, we assume there is a valuable employee whose last evaluation is 90%, the number of project is 5, time spend company is 9, and the salary is low, we can know the percentage likelihood of this employee will leave the company is 56.2%, but if the salary of this employee increase to high level, the probability that he leaves the company decreases to 14.7%.
data2 <- read.csv("/Users/HUHU1995/Desktop/6101/6101 Final Project/HR_comma_sep.csv")
summary(data2)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years sales salary
## Min. :0.00000 sales :4140 high :1237
## 1st Qu.:0.00000 technical :2720 low :7316
## Median :0.00000 support :2229 medium:6446
## Mean :0.02127 IT :1227
## 3rd Qu.:0.00000 product_mng: 902
## Max. :1.00000 marketing : 858
## (Other) :2923
data2$left <- factor(data2$left)
data2$promotion_last_5years <- factor(data2$promotion_last_5years)
data2$Work_accident<-factor(data2$Work_accident)
attach(data2)
## The following objects are masked _by_ .GlobalEnv:
##
## average_montly_hours, last_evaluation, number_project,
## satisfaction_level, time_spend_company
## The following objects are masked from data (pos = 5):
##
## Work_accident, average_montly_hours, last_evaluation, left,
## number_project, promotion_last_5years, salary, sales,
## satisfaction_level, time_spend_company
## The following objects are masked from data (pos = 14):
##
## Work_accident, average_montly_hours, last_evaluation, left,
## number_project, promotion_last_5years, salary, sales,
## satisfaction_level, time_spend_company
HRnew <- glm(left~satisfaction_level+last_evaluation+average_montly_hours+ number_project+time_spend_company+Work_accident+promotion_last_5years+ salary, family = binomial(link = "logit"), data = data2)
mean(data2$satisfaction_level)
## [1] 0.6128335
mean(data2$average_montly_hours)
## [1] 201.0503
HR.df_low <- data.frame(left="1", satisfaction_level=0.613, last_evaluation=0.9,average_montly_hours=201.05, number_project=5, time_spend_company=9, Work_accident="0", promotion_last_5years="0", salary="low")
pred.df_low <- predict(HRnew,HR.df_low,type='response')
pred.df_low
## 1
## 0.5618217
HR.df_high <- data.frame(left="1", satisfaction_level=0.613, last_evaluation=0.9,average_montly_hours=201.05, number_project=5, time_spend_company=9, Work_accident="0", promotion_last_5years="0", salary="high")
pred.df_high <- predict(HRnew,HR.df_high,type='response')
pred.df_high
## 1
## 0.1471827
Conclusion: According to EDA result, satisfaction level, salary, time spent in company are related to the employees leaving. According to the regression result, the variable salary contributes most to the effect of employees leaving. The percentage of a valuable employee will leave the company can reduce a lot (56.2% to 14.7%) if increase the salary from low level to high level.
Company improvement: In order to prevent the valuable employees leave the company, the company need to improve employees benefit such as increasing salary based on their performance.
Limitation: Information about employee age, working environment, company performance are not provided. If they are all included, the data will be much better for building our regression model.