High employee turnover is unhealthy for an organization

A study by Employee Benefits News found that the average cost of losing an employee is 33% of their annual salary. And in 2016, the SHRM Benchmarking Report found that the average cost-per-hire is $4,129.

Therefore, our goal is to accurately predict which employees are planning to leave, so that companies can prepare and plan accordingly.

Source:

To identify employees who are at risk of leaving, we used the following variables:

  • Satisfaction level
  • Last evaluation
  • Number of projects
  • Average monthly hours
  • Time spent at the company
  • Work accidents
  • Promotion in the last 5 years
  • Sales (the employee's department)
  • Salary

RandomForest

Splitting the data into training & validation
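
The report does not show the code behind this step. A minimal sketch of a 70:30 random split in base R might look like the following. The data frame `hr`, its toy columns, and the seed are assumptions for illustration only, not the original data or code.

```r
# Hypothetical stand-in for the HR data set (the real data is not shown here).
set.seed(42)  # assumed seed, only to make the sketch reproducible
hr <- data.frame(
  satisfaction_level = runif(1000),
  left = factor(rbinom(1000, 1, 0.24))
)

# Draw 70% of the row indices for training; the remaining rows form the validation set.
train_idx <- sample(seq_len(nrow(hr)), size = round(0.7 * nrow(hr)))
train <- hr[train_idx, ]
val   <- hr[-train_idx, ]

c(train = nrow(train), validation = nrow(val))
```

Sampling row indices once and using the negated index for validation guarantees that no employee appears in both sets.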

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4510 
## 
##  
##              | val$predicted 
##     val$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      3421 |         5 |      3426 | 
##              |   241.144 |   790.697 |           | 
##              |     0.999 |     0.001 |     0.760 | 
##              |     0.990 |     0.005 |           | 
##              |     0.759 |     0.001 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        35 |      1049 |      1084 | 
##              |   762.141 |  2499.012 |           | 
##              |     0.032 |     0.968 |     0.240 | 
##              |     0.010 |     0.995 |           | 
##              |     0.008 |     0.233 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3456 |      1054 |      4510 | 
##              |     0.766 |     0.234 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

Interpretation of the data

This model predicts which employees will leave and which will stay with the company. In the confusion matrix above, 99.5% of the employees predicted to leave (1,049 of 1,054) actually left the organisation, and 99.0% of those predicted to stay (3,421 of 3,456) actually stayed.

The accuracy of the model is 99.1%, based on a 70:30 split between training and validation data.
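
The percentages in a confusion matrix can be read column-wise (how reliable a prediction is) or row-wise (how many actual cases are caught), and the two are easy to mix up. The sketch below recomputes both from the random forest counts printed above; the object and metric names are illustrative choices.

```r
# Counts copied from the random forest confusion matrix above
# (rows = actual `left`, columns = predicted).
cm <- matrix(c(3421,    5,
                 35, 1049),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # share of all employees classified correctly
precision   <- cm["1", "1"] / sum(cm[, "1"])  # of those predicted to leave, share who left
recall      <- cm["1", "1"] / sum(cm["1", ])  # of those who left, share the model flagged
specificity <- cm["0", "0"] / sum(cm["0", ])  # of those who stayed, share predicted to stay

round(100 * c(accuracy = accuracy, precision = precision,
              recall = recall, specificity = specificity), 1)
```

The column-wise figure (precision) matches the "N / Col Total" row of the table, and the row-wise figures (recall, specificity) match "N / Row Total".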

CART

##  accounting          hr          IT  management   marketing product_mng 
##         767         739        1227         630         858         902 
##       RandD       sales     support   technical 
##         787        4140        2229        2720

Split the data in training & validation

Creating Tree

##           CP nsplit rel error    xerror        xstd
## 1 0.26491366      0 1.0000000 1.0000000 0.017268873
## 2 0.18269231      1 0.7350863 0.7350863 0.015413205
## 3 0.07397959      3 0.3697017 0.3697017 0.011498380
## 4 0.04984301      5 0.2217425 0.2217425 0.009076995
## 5 0.02825746      6 0.1718995 0.1762166 0.008138311
## 6 0.01687598      7 0.1436421 0.1503140 0.007540782
## 7 0.01000000      8 0.1267661 0.1326531 0.007099516
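
The CP table is typically used to decide how far to prune the tree. As a sketch, the code below applies two common rules to the values printed above: the row with the lowest cross-validated error, and the 1-SE rule. The report does not say which rule, if any, was used, so this is an illustration rather than the original procedure.

```r
# CP table values copied from the output above.
cp_table <- data.frame(
  CP     = c(0.26491366, 0.18269231, 0.07397959, 0.04984301,
             0.02825746, 0.01687598, 0.01000000),
  nsplit = c(0, 1, 3, 5, 6, 7, 8),
  xerror = c(1.0000000, 0.7350863, 0.3697017, 0.2217425,
             0.1762166, 0.1503140, 0.1326531),
  xstd   = c(0.017268873, 0.015413205, 0.011498380, 0.009076995,
             0.008138311, 0.007540782, 0.007099516)
)

# Rule 1: row with the lowest cross-validated error.
best <- which.min(cp_table$xerror)

# Rule 2 (1-SE): simplest tree whose xerror is within one standard error of the minimum.
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= threshold))

cp_table[unique(c(best, one_se)), c("CP", "nsplit", "xerror")]
```

Both rules pick the last row (8 splits) here, because the cross-validated error is still falling when the tree stops growing.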

Identifying most important variables

##                       ct1.variable.importance
## satisfaction_level                2205.051676
## average_montly_hours              1171.008521
## number_project                    1136.764011
## last_evaluation                    929.554234
## time_spend_company                 829.388045
## Work_accident                       30.133103
## promotion_last_5years                4.703823
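
Raw importance scores are hard to compare at a glance. One option, sketched here with the values printed above, is to express each score as a share of the total; the object names are illustrative.

```r
# Importance scores copied from the output above.
importance <- c(
  satisfaction_level    = 2205.051676,
  average_montly_hours  = 1171.008521,
  number_project        = 1136.764011,
  last_evaluation       =  929.554234,
  time_spend_company    =  829.388045,
  Work_accident         =   30.133103,
  promotion_last_5years =    4.703823
)

# Express each score as a percentage of the total importance.
share <- round(100 * importance / sum(importance), 1)
share
```

On these figures, satisfaction level accounts for roughly 35% of the total, and the first five variables together account for over 99%.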

Report on accuracy of the model

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6500     Median :0.7200   Median :4.000   Median :202.0       
##  Mean   :0.6153     Mean   :0.7181   Mean   :3.801   Mean   :201.6       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident    left     promotion_last_5years
##  Min.   : 2.000     Min.   :0.0000   0:3366   Min.   :0.00000      
##  1st Qu.: 3.000     1st Qu.:0.0000   1:1023   1st Qu.:0.00000      
##  Median : 3.000     Median :0.0000            Median :0.00000      
##  Mean   : 3.508     Mean   :0.1408            Mean   :0.02028      
##  3rd Qu.: 4.000     3rd Qu.:0.0000            3rd Qu.:0.00000      
##  Max.   :10.000     Max.   :1.0000            Max.   :1.00000      
##                                                                    
##          sales         salary     left_predicted
##  sales      :1198   high  : 379   0:3418        
##  technical  : 783   low   :2140   1: 971        
##  support    : 668   medium:1870                 
##  IT         : 355                               
##  product_mng: 268                               
##  marketing  : 253                               
##  (Other)    : 864
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4389 
## 
##  
##                 | validation$left_predicted 
## validation$left |         0 |         1 | Row Total | 
## ----------------|-----------|-----------|-----------|
##               0 |      3314 |        52 |      3366 | 
##                 |   183.038 |   644.308 |           | 
##                 |     0.985 |     0.015 |     0.767 | 
##                 |     0.970 |     0.054 |           | 
##                 |     0.755 |     0.012 |           | 
## ----------------|-----------|-----------|-----------|
##               1 |       104 |       919 |      1023 | 
##                 |   602.253 |  2119.980 |           | 
##                 |     0.102 |     0.898 |     0.233 | 
##                 |     0.030 |     0.946 |           | 
##                 |     0.024 |     0.209 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |      3418 |       971 |      4389 | 
##                 |     0.779 |     0.221 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Interpretation of the data

The classification and regression tree (CART) is a visual representation of the variables most closely related to employees leaving the company.

  • For 11% of the data, satisfaction level is less than 0.46 and last evaluation is less than 0.45; these employees are likely to leave the company.
  • For 58% of the data, satisfaction level is greater than or equal to 0.47 and time spent at the company is less than 5 years; these employees are likely to stay with the company.

This model predicts which employees will leave and which will stay with the company. In the confusion matrix above, 94.6% of the employees predicted to leave (919 of 971) actually left the organisation, and 97.0% of those predicted to stay (3,314 of 3,418) actually stayed.

The accuracy of the model is 96.4% ((3,314 + 919) / 4,389), based on a 70:30 split between training and validation data.

Logistic Regression

Splitting the data into trainlr & vallr

Create Model

## 
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = trainlr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3927  -0.6771  -0.4306  -0.1523   3.1369  
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            0.2297893  0.1399177   1.642    0.101    
## satisfaction_level    -4.1541678  0.1156513 -35.920  < 2e-16 ***
## last_evaluation        0.8140587  0.1744926   4.665 3.08e-06 ***
## number_project        -0.3252356  0.0249471 -13.037  < 2e-16 ***
## average_montly_hours   0.0042840  0.0006043   7.089 1.35e-12 ***
## time_spend_company     0.2394629  0.0179363  13.351  < 2e-16 ***
## Work_accident         -1.5533337  0.1066776 -14.561  < 2e-16 ***
## promotion_last_5years -1.7907012  0.3050679  -5.870 4.36e-09 ***
## salary                        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11569.0  on 10582  degrees of freedom
## Residual deviance:  9336.2  on 10575  degrees of freedom
## AIC: 9352.2
## 
## Number of Fisher Scoring iterations: 5
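
Logistic regression coefficients are on the log-odds scale, which makes them hard to read directly. As a sketch, exponentiating the estimates printed above turns them into odds ratios; the vector name is an illustrative choice.

```r
# Estimates copied from the coefficient table above (intercept omitted).
coefs <- c(
  satisfaction_level    = -4.1541678,
  last_evaluation       =  0.8140587,
  number_project        = -0.3252356,
  average_montly_hours  =  0.0042840,
  time_spend_company    =  0.2394629,
  Work_accident         = -1.5533337,
  promotion_last_5years = -1.7907012
)

# exp(coefficient) is the multiplicative change in the odds of leaving
# for a one-unit increase in the variable, holding the others fixed.
odds_ratio <- round(exp(coefs), 3)
odds_ratio

# satisfaction_level only spans 0.09 to 1, so a 0.1 step is easier to interpret.
exp(0.1 * coefs[["satisfaction_level"]])
```

For example, each additional year at the company multiplies the odds of leaving by about 1.27, while a 0.1 increase in satisfaction level multiplies them by about 0.66.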

Checking which variables to use

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: left
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                  10582    11569.0              
## satisfaction_level     1  1595.97     10581     9973.0 < 2.2e-16 ***
## last_evaluation        1    12.41     10580     9960.6  0.000426 ***
## number_project         1    93.35     10579     9867.2 < 2.2e-16 ***
## average_montly_hours   1    56.30     10578     9810.9 6.226e-14 ***
## time_spend_company     1   135.68     10577     9675.2 < 2.2e-16 ***
## Work_accident          1   289.22     10576     9386.0 < 2.2e-16 ***
## promotion_last_5years  1    49.82     10575     9336.2 1.683e-12 ***
## salary                 0     0.00     10575     9336.2              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## glm(formula = left ~ satisfaction_level + last_evaluation + average_montly_hours + 
##     promotion_last_5years, family = binomial(link = "logit"), 
##     data = trainlr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4859  -0.7023  -0.4980  -0.3267   2.7014  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            0.4469563  0.1320889   3.384 0.000715 ***
## satisfaction_level    -3.8272903  0.1053896 -36.316  < 2e-16 ***
## last_evaluation        0.3094267  0.1580656   1.958 0.050279 .  
## average_montly_hours   0.0016123  0.0005289   3.048 0.002302 ** 
## promotion_last_5years -1.4525982  0.2865873  -5.069 4.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11569.0  on 10582  degrees of freedom
## Residual deviance:  9914.2  on 10578  degrees of freedom
## AIC: 9924.2
## 
## Number of Fisher Scoring iterations: 5
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: left
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                  10582    11569.0              
## satisfaction_level     1  1595.97     10581     9973.0 < 2.2e-16 ***
## last_evaluation        1    12.41     10580     9960.6  0.000426 ***
## average_montly_hours   1     9.43     10579     9951.1  0.002129 ** 
## promotion_last_5years  1    36.96     10578     9914.2 1.205e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Score the validation data (create predicted values)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01476 0.10711 0.18567 0.24017 0.31129 0.70888

How many left the job

Applying New Cutoff

## [1] 1072
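
The report does not show how the new cutoff was chosen. In the confusion matrix that follows, the number of employees predicted to leave equals the number who actually left (1,072), which suggests the cutoff was set to flag exactly that many. The sketch below shows one way to do this; the simulated `scores` and the seed are assumptions standing in for the real predicted probabilities.

```r
set.seed(1)             # assumed seed for a reproducible toy example
scores <- runif(4416)   # stand-in for the predicted probabilities on the validation set
n_left <- 1072          # number of actual leavers, from the output above

# Cutoff = the n_left-th largest score, so exactly n_left rows are flagged
# (assuming no ties at the cutoff).
cutoff <- sort(scores, decreasing = TRUE)[n_left]
predicted_left01 <- as.integer(scores >= cutoff)

sum(predicted_left01)
```

Matching the predicted count to the observed count keeps false positives and false negatives equal in number, which is consistent with the two 483 cells in the table.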

How well are we predicting ~ Confusion Matrix

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4416 
## 
##  
##              | vallr$predicted_left01 
##   vallr$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      2861 |       483 |      3344 | 
##              |    42.685 |   133.152 |           | 
##              |     0.856 |     0.144 |     0.757 | 
##              |     0.856 |     0.451 |           | 
##              |     0.648 |     0.109 |           | 
## -------------|-----------|-----------|-----------|
##            1 |       483 |       589 |      1072 | 
##              |   133.152 |   415.354 |           | 
##              |     0.451 |     0.549 |     0.243 | 
##              |     0.144 |     0.549 |           | 
##              |     0.109 |     0.133 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3344 |      1072 |      4416 | 
##              |     0.757 |     0.243 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

Interpretation of the data

This model predicts which employees will leave and which will stay with the company. In the confusion matrix above, 54.9% of the employees predicted to leave (589 of 1,072) actually left the organisation, and 85.6% of those predicted to stay (2,861 of 3,344) actually stayed.

The accuracy of the model is 78.1% ((2,861 + 589) / 4,416), based on a 70:30 split between training and validation data.

Conclusion

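Percentages are computed from the confusion matrices above, as shares of each predicted group.

Particulars                             | Logistic Regression | CART   | RandomForest
----------------------------------------|---------------------|--------|-------------
Predicted to leave vs actually left     | 54.9%               | 94.6%  | 99.5%
Predicted to leave vs actually stayed   | 45.1%               | 5.4%   | 0.5%
Predicted to stay vs actually left      | 14.4%               | 3.0%   | 1.0%
Predicted to stay vs actually stayed    | 85.6%               | 97.0%  | 99.0%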
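
As a cross-check on the comparison, the following sketch recomputes overall accuracy and the share of correct predictions in each predicted group from the counts in the three confusion matrices in this report. The column names (`tn`, `fp`, `fn`, `tp`, and the derived ones) are illustrative.

```r
# Confusion-matrix counts copied from the three tables in this report.
# tn = stayed, predicted to stay;  fp = stayed, predicted to leave;
# fn = left, predicted to stay;    tp = left, predicted to leave.
models <- data.frame(
  model = c("Logistic Regression", "CART", "RandomForest"),
  tn = c(2861, 3314, 3421),
  fp = c( 483,   52,    5),
  fn = c( 483,  104,   35),
  tp = c( 589,  919, 1049)
)

models$accuracy      <- with(models, (tn + tp) / (tn + fp + fn + tp))
models$leave_correct <- with(models, tp / (tp + fp))  # predicted to leave and actually left
models$stay_correct  <- with(models, tn / (tn + fn))  # predicted to stay and actually stayed

cbind(models["model"],
      round(100 * models[c("accuracy", "leave_correct", "stay_correct")], 1))
```

Keeping the raw counts in one table makes it easy to recompute any other metric, such as recall, without returning to the individual outputs.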

Logistic Regression - Overall accuracy is 78.1%

  • Benefits: Estimates the likelihood of a “yes” or “no” outcome from several variables
  • Limitations: Once a cutoff is applied there is no gray area; an employee is classed as leaving or not

Classification And Regression Tree (CART) - Overall accuracy is 96.4%

  • Benefits: Useful for market research on survey responses
  • Limitations: Readability of the charts; the appearance isn’t great and the tree is hard to read & analyse

Random Forest - Overall accuracy is 99.1%

  • Benefits: Accuracy
  • Limitations: Complexity & longer training

It’s hard to favor one algorithm over another. Each algorithm has a purpose and its own pros and cons. The algorithm to use depends on the data being analysed.