Our Goal: To find the factor contributed most to employee turnover and to create a model that can predict the percentage likelihood of a certain employee will leave the company.

Introduction: We all know that high quality employees play the most important role in a successful company. So when a company experiences a high rate of employee turnover, we should realize something is going wrong. This can lead the company to huge monetary losses by these innovative and valuable employees. Companies that maintain a healthy organization and culture are always a good sign of future prosperity. Recognizing and understanding what factors that were associated with employee turnover will allow companies and individuals to limit this from happening and may even increase employee productivity and growth. These predictive insights give managers the opportunity to take corrective steps to build and preserve their successful business.

Our goal is to help the company to understand what factors contributed most to employee turnover and to create a model that can predict the percentage likelihood of a certain employee will leave the company. The goal is to create or improve different retention strategies on targeted employees. Overall, the implementation of this model will allow management to create better decision-making actions.

About the dataset: We found a dataset from kaggle, which is a human resource data. When we first looked at this dataset, there are 14999 observations or employees and 10 variables includes satisfaction level of employees, last evaluation grades, number of projects worked on, promotion in last 5 years, department, work accident yes or not, time spend in company, average monthly hours, salary and outcome.

##load data and EDA
data <- read.csv("/Users/HUHU1995/Desktop/6101/6101 Final Project/HR_comma_sep.csv")
head(data)
##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years sales salary
## 1                  3             0    1                     0 sales    low
## 2                  6             0    1                     0 sales medium
## 3                  4             0    1                     0 sales medium
## 4                  5             0    1                     0 sales    low
## 5                  3             0    1                     0 sales    low
## 6                  3             0    1                     0 sales    low
summary(data)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923
str(data)
## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "IT","RandD","accounting",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
attach(data)
#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
library(car)
library(mice)
library(ggplot2)
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## VIM is ready to use. 
##  Since version 4.0.0 the GUI is in its own package VIMGUI.
## 
##           Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
md.pattern(data)
##      satisfaction_level last_evaluation number_project
## [1,]                  1               1              1
## [2,]                  0               0              0
##      average_montly_hours time_spend_company Work_accident left
## [1,]                    1                  1             1    1
## [2,]                    0                  0             0    0
##      promotion_last_5years sales salary  
## [1,]                     1     1      1 0
## [2,]                     0     0      0 0
#The first thing we did was checking the missing values because there are too many data information in our dataset. We can see that there is no missing (or NA) values in the dataset by looking the histogram and pattern.

aggr_plot <- aggr(data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE,labels=names(data), cex.axis=.6, gap=1, ylab=c("Histogram of missing data","Pattern")) #We can also know there is no missing (or NA) values in the dataset by looking the histogram and pattern.

## 
##  Variables sorted by number of missings: 
##               Variable Count
##     satisfaction_level     0
##        last_evaluation     0
##         number_project     0
##   average_montly_hours     0
##     time_spend_company     0
##          Work_accident     0
##                   left     0
##  promotion_last_5years     0
##                  sales     0
##                 salary     0
countNA(data) #Another way to count NA values
## [1] 0
#plot the graphs of correlations between each variables.
#This figure shows how all the variables correlates to each other. The size of each bubble shows the significance of the correlation. The size of each bubble shows the direction (either positive or negative). Moreover, we can see the staff who leave have a low satisfaction level. In addition, they employees who worked more but didn't get promoted within the past five years because there is no relationship between them according to this graph.

corrplot(cor(data[1:8]), method = "circle", type = "lower", tl.col = "black", tl.srt = 45) 

corrplot(cor(data[1:8]), method = "number", type = "lower", tl.col = "black", tl.srt = 45)

#we separate the dataset based on if the employees left or not. Then we only focus on the dataset which employees left. We compared 4 variables to see if they can influence the result (go or stay).

data1 <- data[left==1,]
data0 <- data[left==0,]
str(data0)
## 'data.frame':    11428 obs. of  10 variables:
##  $ satisfaction_level   : num  0.58 0.82 0.45 0.78 0.49 0.36 0.54 0.99 0.5 0.74 ...
##  $ last_evaluation      : num  0.74 0.67 0.69 0.82 0.6 0.95 0.37 0.91 0.75 0.64 ...
##  $ number_project       : int  4 2 5 5 3 3 2 5 6 4 ...
##  $ average_montly_hours : int  215 202 193 247 214 206 176 136 127 268 ...
##  $ time_spend_company   : int  3 3 3 3 2 4 2 4 3 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "IT","RandD","accounting",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 2 2 2 2 2 2 2 2 2 ...
#At first, we want to find out is the low satisfaction level is really a reason to let employees leave the company? Then we made a histogram to show the satisfaction level of all the employees who left the company. From the figure we can see that employees who left the company tend to have low satisfaction level. Therefore, we conclude that can be a reason to leave the company.
 
plot <- function(i){ggplot(data = data1, aes(x= i, y=..density..)) + geom_histogram(fill="blue",color="green",alpha = 0.1,bins =60) +
geom_density()
}
plot(data1$satisfaction_level)

#To find out the is the evaluation have impacts on people to leave the company, we drew the distribution we can see from the figure that the people who left the company have whether really high evaluation or really low evaluation, not many in the meddle. We can see that the reason they leave the company might because they are too good or too bad to stay in.
plot(data1$last_evaluation)

#Then we want to see if the salary can influence the employee to leave the company, so we did the salary compare for both people who left the company and left the company. We can see for the employees who left the company, they tend to have less percentage of medium and high level salary than the employees who stay. Therefore, we can say that the reason that they leave the company could be their salary is low.
data1$salary <-factor(data1$salary, levels = c("low","medium","high"))
barplot(table(data1$salary), col=c("orange","blue", "red"), 
        main="Salary for employees who have left")

data0$salary <-factor(data0$salary, levels = c("low","medium","high"))
barplot(table(data0$salary), col=c("orange","blue", "red"), 
        main="Salary for employees who still stay")

#Then we want to find is there a certain time period after that employees begin to leave the company. From the figure we can see that Most of the employees who have left company have spent three years in company so it might possible that they have considered three years as enough time leave.
ggplot(data1, aes(time_spend_company))+
  geom_histogram(binwidth=0.3,fill='blue')+
  labs(x="Time spent company", title="Time Spend in company")+
  theme_bw()

#To create z-scores (standardized scores) for variables.
#Before we built the prediction model for the human resource data, we firstly converted some variables??? values to z-scores, in others words, we standardized the values of satisfaction level of employees, last evaluation grades, number of projects worked on, time spend in company, average monthly hours. we created a standard deviation for each one of these variable terms, because we can see in the original data, the values of the satisfaction level are decimals, which are less than 1, but for the average monthly hour are hundreds. So the variable range for these two variables are very different. There is a huge influence for coefficients for variables. So if we standardize them, that influence will not happen. 
rm(list=ls())
data <- read.csv("/Users/HUHU1995/Desktop/6101/6101 Final Project/HR_comma_sep.csv")
attach(data)
## The following objects are masked from data (pos = 11):
## 
##     Work_accident, average_montly_hours, last_evaluation, left,
##     number_project, promotion_last_5years, salary, sales,
##     satisfaction_level, time_spend_company
#change the categorical variables to factors
data$left <- factor(data$left) 
data$promotion_last_5years <- factor(data$promotion_last_5years)
data$Work_accident<-factor(data$Work_accident)

satisfaction_level<-scale(satisfaction_level, center = TRUE, scale=TRUE)
last_evaluation<-scale(last_evaluation, center = TRUE, scale=TRUE)
number_project<-scale(number_project, center = TRUE, scale=TRUE)
average_montly_hours<-scale(average_montly_hours, center = TRUE, scale=TRUE)
time_spend_company<-scale(time_spend_company, center = TRUE, scale=TRUE)

data <- subset(data, select = -c(satisfaction_level, last_evaluation, number_project,time_spend_company, average_montly_hours) )
data <- data.frame(satisfaction_level,last_evaluation,number_project,time_spend_company,average_montly_hours,data)
head(data)
##   satisfaction_level last_evaluation number_project time_spend_company
## 1         -0.9364635      -1.0872390     -1.4628141         -0.3412238
## 2          0.7527892       0.8406789      0.9710806          1.7133790
## 3         -2.0224116       0.9575224      2.5936770          0.3436438
## 4          0.4310268       0.8991007      0.9710806          1.0285114
## 5         -0.9766838      -1.1456608     -1.4628141         -0.3412238
## 6         -0.8158026      -1.2625043     -1.4628141         -0.3412238
##   average_montly_hours Work_accident left promotion_last_5years sales
## 1           -0.8820105             0    1                     0 sales
## 2            1.2203821             0    1                     0 sales
## 3            1.4206099             0    1                     0 sales
## 4            0.4394934             0    1                     0 sales
## 5           -0.8419649             0    1                     0 sales
## 6           -0.9621016             0    1                     0 sales
##   salary
## 1    low
## 2 medium
## 3 medium
## 4    low
## 5    low
## 6    low
summary(data)
##  satisfaction_level last_evaluation    number_project   
##  Min.   :-2.1029    Min.   :-2.08041   Min.   :-1.4628  
##  1st Qu.:-0.6951    1st Qu.:-0.91197   1st Qu.:-0.6515  
##  Median : 0.1093    Median : 0.02277   Median : 0.1598  
##  Mean   : 0.0000    Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.8332    3rd Qu.: 0.89910   3rd Qu.: 0.9711  
##  Max.   : 1.5572    Max.   : 1.65858   Max.   : 2.5937  
##                                                         
##  time_spend_company average_montly_hours Work_accident left     
##  Min.   :-1.0261    Min.   :-2.10340     0:12830       0:11428  
##  1st Qu.:-0.3412    1st Qu.:-0.90203     1: 2169       1: 3571  
##  Median :-0.3412    Median :-0.02103                            
##  Mean   : 0.0000    Mean   : 0.00000                            
##  3rd Qu.: 0.3436    3rd Qu.: 0.87999                            
##  Max.   : 4.4528    Max.   : 2.18148                            
##                                                                 
##  promotion_last_5years         sales         salary    
##  0:14680               sales      :4140   high  :1237  
##  1:  319               technical  :2720   low   :7316  
##                        support    :2229   medium:6446  
##                        IT         :1227                
##                        product_mng: 902                
##                        marketing  : 858                
##                        (Other)    :2923
#Because the dependent variable left is binary variable, which is consisted of 1 and 0, so we used the logistics regression model to find the factors contributed most to employee turnover and to create a model that can predict the percentage likelihood of a certain employee will leave the company.
#First, select train dataset and test dataset.
set.seed(1) #Randomly splitting the data into train (80% of data) and test (20%) segmenets. 
train_index = sample(1:nrow(data), nrow(data)*0.8, replace=FALSE) 
train = data[train_index, ]
test = data[-train_index, ]

#Fit a logistic regression model using train data.
HR <- glm(left~.-sales, family = binomial(link = "logit"), data = train)
summary(HR)
## 
## Call:
## glm(formula = left ~ . - sales, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1626  -0.6652  -0.4105  -0.1299   3.0316  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.89134    0.13755 -21.020  < 2e-16 ***
## satisfaction_level     -1.00472    0.02701 -37.197  < 2e-16 ***
## last_evaluation         0.13426    0.02839   4.729 2.26e-06 ***
## number_project         -0.37373    0.02919 -12.803  < 2e-16 ***
## time_spend_company      0.36971    0.02482  14.893  < 2e-16 ***
## average_montly_hours    0.22641    0.02855   7.931 2.18e-15 ***
## Work_accident1         -1.50768    0.09866 -15.282  < 2e-16 ***
## promotion_last_5years1 -1.67159    0.30433  -5.493 3.96e-08 ***
## salarylow               1.94680    0.14031  13.875  < 2e-16 ***
## salarymedium            1.40264    0.14127   9.929  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13111  on 11998  degrees of freedom
## Residual deviance: 10335  on 11989  degrees of freedom
## AIC: 10355
## 
## Number of Fisher Scoring iterations: 5
#convert to % via exp() function
HR.ouput <- exp(coef(HR))
HR.ouput
##            (Intercept)     satisfaction_level        last_evaluation 
##             0.05550161             0.36614544             1.14368940 
##         number_project     time_spend_company   average_montly_hours 
##             0.68816261             1.44731465             1.25408642 
##         Work_accident1 promotion_last_5years1              salarylow 
##             0.22142336             0.18794782             7.00622481 
##           salarymedium 
##             4.06591742
#Then we can see which variable contribute most to the employees left by the coefficient values. According to the p-values of the logistic regression output, we can see all of the variables are statistically significant to the dependent variable left. The coefficient of salary is larger than others, so we can say it contributes most to employee turnover. It uses the high salary as the baseline so the log odds increase 1.94 when move from the high salary to low salary. This also make sense because intuitively employees have more probability to leave the company if the salary is lower. 


#In order to check the model accuracy, we calculate the ROC(Receiver Operating Characteristic) curves, AUC, Pseudo R value and the hit rate using 50% as the break point. 
prob=predict(HR, type = c("response"))

#install.packages("pROC")
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following object is masked from 'package:colorspace':
## 
##     coords
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
h <- roc(left~prob, data=train)
h
## 
## Call:
## roc.formula(formula = left ~ prob, data = train)
## 
## Data: prob in 9168 controls (left 0) < 2831 cases (left 1).
## Area under the curve: 0.817
plot(h) #ROC plot

#Psuedo R
library(pscl)
## Warning: package 'pscl' was built under R version 3.4.2
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(HR)
##           llh       llhNull            G2      McFadden          r2ML 
## -5167.4379280 -6555.6581265  2776.4403969     0.2117591     0.2065695 
##          r2CU 
##     0.3107755
#Calculate the hit rate.
HR2 <- predict(HR,test,type='response')
HR2.1 <- ifelse(HR2 > 0.5,1,0)
head(HR2.1)
## 10 30 50 52 56 57 
##  0  0  0  0  0  1
table(as.factor(test$left),HR2.1)
##    HR2.1
##        0    1
##   0 2105  155
##   1  510  230
HR3 <- mean(HR2.1!=as.numeric(test$left))
head(HR2.1)
## 10 30 50 52 56 57 
##  0  0  0  0  0  1
head(left)
## [1] 1 1 1 1 1 1
HR3
## [1] 0.9483333
#According to the ROC result which measure the sensitivity and the specificity of the model, we can see the curve converges to the right left side, and the AUC is 0.817, the Pseudo R is 0.21, and the hit rate is 94.8%, all of these statistics show us out model is very good. 
#Finally, we used our model to predict the percentage likelihood of a certain employee will leave the company. For example, we assume there is a valuable employee whose last evaluation is 90%, the number of project is 5, time spend company is 9, and the salary is low, we can know the percentage likelihood of this employee will leave the company is 56.2%, but if the salary of this employee increase to high level, the probability that he leaves the company decreases to 14.7%.
data2 <- read.csv("/Users/HUHU1995/Desktop/6101/6101 Final Project/HR_comma_sep.csv")
summary(data2)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923
data2$left <- factor(data2$left) 
data2$promotion_last_5years <- factor(data2$promotion_last_5years)
data2$Work_accident<-factor(data2$Work_accident)
attach(data2)
## The following objects are masked _by_ .GlobalEnv:
## 
##     average_montly_hours, last_evaluation, number_project,
##     satisfaction_level, time_spend_company
## The following objects are masked from data (pos = 5):
## 
##     Work_accident, average_montly_hours, last_evaluation, left,
##     number_project, promotion_last_5years, salary, sales,
##     satisfaction_level, time_spend_company
## The following objects are masked from data (pos = 14):
## 
##     Work_accident, average_montly_hours, last_evaluation, left,
##     number_project, promotion_last_5years, salary, sales,
##     satisfaction_level, time_spend_company
HRnew <- glm(left~satisfaction_level+last_evaluation+average_montly_hours+ number_project+time_spend_company+Work_accident+promotion_last_5years+ salary, family = binomial(link = "logit"), data = data2)

mean(data2$satisfaction_level)
## [1] 0.6128335
mean(data2$average_montly_hours)
## [1] 201.0503
HR.df_low <- data.frame(left="1", satisfaction_level=0.613, last_evaluation=0.9,average_montly_hours=201.05, number_project=5,  time_spend_company=9, Work_accident="0", promotion_last_5years="0", salary="low")
pred.df_low <- predict(HRnew,HR.df_low,type='response')
pred.df_low
##         1 
## 0.5618217
HR.df_high <- data.frame(left="1", satisfaction_level=0.613, last_evaluation=0.9,average_montly_hours=201.05, number_project=5,  time_spend_company=9, Work_accident="0", promotion_last_5years="0", salary="high")
pred.df_high <- predict(HRnew,HR.df_high,type='response')
pred.df_high
##         1 
## 0.1471827

Conclusion: According to EDA result, satisfaction level, salary, time spent in company are related to the employees leaving. According to the regression result, the variable salary contributes most to the effect of employees leaving. The percentage of a valuable employee will leave the company can reduce a lot (56.2% to 14.7%) if increase the salary from low level to high level.

Company improvement: In order to prevent the valuable employees leave the company, the company need to improve employees benefit such as increasing salary based on their performance.

Limitation: Information about employee age, working environment, company performance are not provided. If they are all included, the data will be much better for building our regression model.