EMPLOYEE ATTRITION PHASE 2

METHODOLOGY

To further investigate on what factors influence an employee to leave the company, we formulated 3 models: Logistic Regression with Lasso, Decission Tree with Cross Validation and Random Forest. The binary variable ‘left’ was used as the response variable. The 2 levels of response were “0” which corresponded to an employee not leaving the company and “1” which corresponded to an employee leaving the company. All the variables were used in the model formulation and based on advanced techniques they were dropped as per their significance. The data set was divided into training and test data sets (75% training and 25% test). The training set was used to build the model and the test set used to validate the model genereated.

Preparing the dataset for modelling.

library(readr)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(caTools)
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
## 
## Attaching package: 'glmnet'
## The following object is masked from 'package:pROC':
## 
##     auc
HR<- read_csv("C:/Users/Bhargab/Desktop/HR_comma_sep.csv")
## Parsed with column specification:
## cols(
##   satisfaction_level = col_double(),
##   last_evaluation = col_double(),
##   number_project = col_integer(),
##   average_montly_hours = col_integer(),
##   time_spend_company = col_integer(),
##   Work_accident = col_integer(),
##   left = col_integer(),
##   promotion_last_5years = col_integer(),
##   sales = col_character(),
##   salary = col_character()
## )
HR$Work_accident<-as.factor(HR$Work_accident)
HR$left<-as.factor(HR$left)
HR$promotion_last_5years<-as.factor(HR$promotion_last_5years)
HR$sales<-as.factor(HR$sales)
HR$salary<-as.factor(HR$salary)
str(HR)
## Classes 'tbl_df', 'tbl' and 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ left                 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 10
##   .. ..$ satisfaction_level   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ last_evaluation      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ number_project       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ average_montly_hours : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ time_spend_company   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Work_accident        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ left                 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ promotion_last_5years: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ sales                : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ salary               : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Converted predictors left,Promotion last five year ,sales and salary to factor for modeling purpose.

Dividing the dataset into training and test data set

set.seed(200)
sample = sample.split(HR$left, SplitRatio = .75)
train = subset(HR, sample == TRUE)
test  = subset(HR, sample == FALSE)

Converting split data to model matrix

x=model.matrix(left~.,data = train)
y=(train$left)

Logistic Regression with Lasso

We find the lambda value(Regularization Parameter) with cross validation.

fit.lasso=glmnet(x,y,family = "binomial")
plot(fit.lasso,xvar="lambda",label=TRUE)

cv.lasso=cv.glmnet(x,y,family="binomial",alpha=1,type.measure = "class")
plot(cv.lasso) # Plotting the value of Lambda with the lowest misclassification error

log(cv.lasso$lambda.min)
## [1] -2.997232
bestlambda=exp(-3) 

As we can seen from the plot of missicasfication error, a lambda value of -3 gives us the best result,i.e we use this value to fit the model.

#Fitting the lasso model with the best lambda value and checking the coefficients
lasso.mod=glmnet(x,y ,family='binomial', alpha = 1,lambda=bestlambda)
coef(lasso.mod)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                                s0
## (Intercept)             0.3432778
## (Intercept)             .        
## satisfaction_level     -2.5826623
## last_evaluation         .        
## number_project          .        
## average_montly_hours    .        
## time_spend_company      .        
## Work_accident1         -0.2041824
## promotion_last_5years1  .        
## saleshr                 .        
## salesIT                 .        
## salesmanagement         .        
## salesmarketing          .        
## salesproduct_mng        .        
## salesRandD              .        
## salessales              .        
## salessupport            .        
## salestechnical          .        
## salarylow               .        
## salarymedium            .

The best variables come out to be satisfaction level and having a work accident.

Evaluating model

new.x<-model.matrix(~.,data=test)
predLasso<-predict(lasso.mod,newx=x[1:3750,],s="bestlambda",type="response")
lasso.probs<-ifelse(predLasso > 0.5,"1","0")
table(test$left, lasso.probs)
##    lasso.probs
##        0    1
##   0 2595  262
##   1  767  126
summary(lasso.probs)
##  1       
##  0:3362  
##  1: 388
mean(lasso.probs==test$left) # Accuracy of 72.56%
## [1] 0.7256

With the fitted model, we use it on the test data set. Cases having probabilities of greater than 0.5 we categorized as ‘yes’, which basically means that employees with a probability value of greater than 0.5 were likely to leave the company. The mean accuracy obtainted from this process was 72.56%.

AUC curve for Logistic Regression

library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
predict_lasso_ROC <-predict(lasso.mod,newx=x[1:3750,],s="lambda.min",type="response")
pred_lasso <- prediction(predict_lasso_ROC, test$left)
perf_lasso <- performance(pred_lasso, "tpr", "fpr")

auc_lasso <- performance(pred_lasso,"auc")
auc_lasso <- round(as.numeric(auc_lasso@y.values),3)
plot(perf_lasso, main = "ROC curves for the models", col='blue')

print(paste('AUC of Lasso Regression:',auc_lasso))
## [1] "AUC of Lasso Regression: 0.552"
#AUC comes out to be only 55%, which is very poor

The area under the AUC curve (true positive rate vs the false positive rate) obtained was very low at around 55%.

Decision tree

Building the descision tree using all the variables

library(tree)
set.seed(200)
traind=sample(1:nrow(HR),11249)
tree.left=tree(left~.,HR,subset = traind)
{plot(tree.left)
text(tree.left,pretty = 0) }

tree.predict=predict(tree.left,test,type = "class")
with(test,table(tree.predict,left))
##             left
## tree.predict    0    1
##            0 2819   74
##            1   38  819
(2819+819)/3750 #(97%)
## [1] 0.9701333

By using all the variables from the dataset we got 97% of accuracy. We can see satisfaction level is the most important variable, but we know that not all the variables from the dataset contribute towards an employee leaving. We will apply Cross validation to find the optimal number of leaves needed to make an acceptable accuracy and accordonly prune the tree.

Cross validating using the full tree

cv.dt=cv.tree(tree.left,FUN = prune.misclass)
cv.dt
## $size
## [1] 10  9  8  7  6  4  2  1
## 
## $dev
## [1]  321  346  394  489  638 1014 2131 2638
## 
## $k
## [1]  -Inf  26.0  49.0  95.0 151.0 188.5 507.0 610.0
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv.dt) # Lowest mis classification comes out to be at size 9

prune.trees=prune.misclass(tree.left,best = 9)
{plot(prune.trees)
text(prune.trees,pretty = 0) }

cv.tree.predict=predict(prune.trees,test,type = "class")
with(test,table(cv.tree.predict,left))
##                left
## cv.tree.predict    0    1
##               0 2808   74
##               1   49  819
(2808+819)/3750 #Accuracy of 96.72%
## [1] 0.9672

Based on the decision tree we found out that the most important factors affecting an employee retention was satisfaction level, the time spent in the company, and the number of projects the employee is working on. Using cross validation, we pruned the tree and the minimum misclassification error rate was obtained for the tree with 9 leaves. The model was used on the test data set and it estimated properly the cases in which an employee leaves with a 96.72% accuracy.

Pruned Decision Tree AUC

predict_dt_ROC <-predict(prune.trees,test)
pred_dt <- prediction(predict_dt_ROC[,2], test$left)
perf_dt <- performance(pred_dt, "tpr", "fpr")

auc_dt <- performance(pred_dt,"auc")
auc_dt <- round(as.numeric(auc_dt@y.values),3)
plot(perf_dt, main = "ROC curves for the models", col='blue')

print(paste('AUC of Lasso Regression:',auc_dt))
## [1] "AUC of Lasso Regression: 0.97"
#AUC comes out to be 97%

Finally the AUC curve was plotted (true positive rate vs the false positive rate) on pruned model and a satisfying 97% area was achieved.

Random Forest

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
# Using mtry=3, so 3 random variables will be used for each split
modelrandom<-randomForest(left~.,data=train,mtry=3,ntree=400)
modelrandom
## 
## Call:
##  randomForest(formula = left ~ ., data = train, mtry = 3, ntree = 400) 
##                Type of random forest: classification
##                      Number of trees: 400
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.79%
## Confusion matrix:
##      0    1 class.error
## 0 8556   15 0.001750088
## 1   74 2604 0.027632562
importance(modelrandom)
##                       MeanDecreaseGini
## satisfaction_level         1429.373837
## last_evaluation             483.709458
## number_project              720.120423
## average_montly_hours        569.509520
## time_spend_company          731.855401
## Work_accident                22.300603
## promotion_last_5years         3.451512
## sales                        63.304188
## salary                       30.320133
varImpPlot(modelrandom)

The final model we used for this analysis was random forests. The parameters used are mtry =3(number of variables that were used for each split) and ntree=400(the number of random forests that were grown and then errors calculated). Based on the best tree that was selected from the process, it is clearly seen that satisfaction level is by far the most important factor which determines whether an employee is likely to stay or not with a Gini Index of 1377, followed by number of projects and time spent in the company.

predictrftree<-predict(modelrandom,test,type = "class")
table(predictrftree,test$left)
##              
## predictrftree    0    1
##             0 2854   28
##             1    3  865
mean(predictrftree==test$left) #Accuracy 99.17%
## [1] 0.9917333

We applied this model to our test data set and we got an impressive 99.16% accuracy.

Random Forest AUC

predict_rf_ROC <- predict(modelrandom, test, type="prob")
pred_rf <- prediction(predict_rf_ROC[,2], test$left)
perf_rf <- performance(pred_rf, "tpr", "fpr")

auc_rf <- performance(pred_rf,"auc")
auc_rf <- round(as.numeric(auc_rf@y.values),3)
plot(perf_rf, main = "ROC curves for the models", col='blue')

print(paste('AUC of Random Forest:',auc_rf)) #AUC comes out to be 99.3%
## [1] "AUC of Random Forest: 0.993"

The AUC curve was also quite nice with a 99.3%.

Limitations

Since this is a simulated dataset, models and predictions that were built might not be applicable when applied to a real world dataset using different measures to asses an employee’s retention.This dataset does not have any outliers, which is highly improbable in real world. We have also not accounted for interaction terms, like for an example if an employee is working for a long time, yet has a low salary.

Comparison

Random Forest with the highest accuracy of 99.16% and area under curve of 99.3%, out classes both logistic regression (with an accuracy of 72.56% and AUC of 55%) and Pruned Decision Tree (with an accuracy of 96.72% and AUC of 97%). Here we want to classify people who are most likely to leave the company, so even if we mis-classify someone who won’t leave the company as someone leaving the copmany it won’t hurt the model. Eventually, our goal is to classify the maximum number of employees who are likely to leave, for which the best measure of performance metric would be AUC curve.By classifying employees who have a chance of leaving, the company can take steps to retain the employee.

Conclusion

Overall for this stimulated dataset Random Forest seems to be the best model with the highest area under curve of 99.3%. We can confidently say that co mpanies must focus on satisfaction level of employees to make sure they are happy with their work. Along with that too many projects does not help in retaining an employee either. Time spent in the company also plays a significant role, but we feel it is more linked with a few other factors like salary, promotion as well.