High employee turnover is unhealthy for an organization

A study by Employee Benefits News found that the average cost of losing an employee is 33% of their annual salary. And in 2016, the SHRM Benchmarking Report found that the average cost-per-hire is $4,129.

Therefore, our goal is to accurately predict which employees are planning to leave so that companies can prepare and plan appropriately.

Source:

De Leon, L. (2019, September 20). The Costs and Trends of Employee Turnover - Part 1 | Employers Resource. Employers Resource. https://www.employersresource.com/blog/the-costs-and-trends-of-employee-turnover-part-1/

To identify employees who are at risk of leaving, we used the following variables as predictors:

Satisfaction level
Last evaluation
Number of projects
Average monthly hours
Time spent at the company
Work accidents
Promotion in the last 5 years
Sales (the employee's department)
Salary

Random Forest

Splitting the data into training & validation
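The report does not show how the split was made. The validation sets differ in size between models (4510, 4389 and 4416 rows), which is consistent with assigning each row at random rather than taking a fixed count. A minimal sketch of that approach follows, assuming a per-row random assignment; the data frame `hr`, its simulated contents and the seed are hypothetical stand-ins, not the author's code.

```r
# Hypothetical stand-in for the HR data: 14,999 rows, matching the
# total of the department counts reported in the CART section.
set.seed(42)  # assumed seed, only so the sketch is reproducible
n  <- 14999
hr <- data.frame(
  satisfaction_level = runif(n),
  left               = factor(rbinom(n, 1, 0.24))
)

# Assign each row to training with probability 0.7 (assumed method).
in_train <- runif(n) < 0.7
train <- hr[in_train, ]
val   <- hr[!in_train, ]

# The validation share is close to, but not exactly, 30%.
nrow(val) / n
```

Because the assignment is random, each run without a fixed seed produces a slightly different validation size, which would also explain small differences between reported percentages and a re-knitted table.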

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4510 
## 
##  
##              | val$predicted 
##     val$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      3421 |         5 |      3426 | 
##              |   241.144 |   790.697 |           | 
##              |     0.999 |     0.001 |     0.760 | 
##              |     0.990 |     0.005 |           | 
##              |     0.759 |     0.001 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        35 |      1049 |      1084 | 
##              |   762.141 |  2499.012 |           | 
##              |     0.032 |     0.968 |     0.240 | 
##              |     0.010 |     0.995 |           | 
##              |     0.008 |     0.233 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3456 |      1054 |      4510 | 
##              |     0.766 |     0.234 |           | 
## -------------|-----------|-----------|-----------|
## 
##

Interpretation of the data

This model predicts which employees will leave the company and which will stay. Of the employees who actually stayed, 99.9% were predicted to stay, and of the employees who actually left, 96.8% were predicted to leave.

Accuracy of the model is 99.1%, based on a 70:30 split between training and validation data.
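The figures in the cross table can be reproduced from its four counts. The sketch below uses base R only and copies the counts from the Random Forest table; the object names are illustrative. It also separates the two readings that are easy to confuse: row proportions (out of actual outcomes) and column proportions (out of predictions).

```r
# Counts copied from the Random Forest table: rows = actual, columns = predicted.
cm <- matrix(c(3421,    5,
                 35, 1049),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)      # correct predictions / all rows
row_rates <- prop.table(cm, margin = 1)   # "N / Row Total" in the table
col_rates <- prop.table(cm, margin = 2)   # "N / Col Total" in the table

round(accuracy, 3)             # 0.991
round(row_rates["1", "1"], 3)  # 0.968: share of actual leavers who were flagged
round(col_rates["1", "1"], 3)  # 0.995: share of flagged employees who actually left
```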

CART

##  accounting          hr          IT  management   marketing product_mng 
##         767         739        1227         630         858         902 
##       RandD       sales     support   technical 
##         787        4140        2229        2720

Splitting the data into training & validation

Creating Tree

##           CP nsplit rel error    xerror        xstd
## 1 0.26491366      0 1.0000000 1.0000000 0.017268873
## 2 0.18269231      1 0.7350863 0.7350863 0.015413205
## 3 0.07397959      3 0.3697017 0.3697017 0.011498380
## 4 0.04984301      5 0.2217425 0.2217425 0.009076995
## 5 0.02825746      6 0.1718995 0.1762166 0.008138311
## 6 0.01687598      7 0.1436421 0.1503140 0.007540782
## 7 0.01000000      8 0.1267661 0.1326531 0.007099516
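The complexity table above lists, for each tree size, the cross-validated error (xerror) and its standard error (xstd), both relative to the root node. One common heuristic for picking a tree size, not necessarily the one used here, is the one-standard-error rule: take the smallest tree whose xerror is within one xstd of the minimum. A sketch with the table values typed in by hand:

```r
# Values copied from the complexity table above.
cp_table <- data.frame(
  CP     = c(0.26491366, 0.18269231, 0.07397959, 0.04984301,
             0.02825746, 0.01687598, 0.01000000),
  nsplit = c(0, 1, 3, 5, 6, 7, 8),
  xerror = c(1.0000000, 0.7350863, 0.3697017, 0.2217425,
             0.1762166, 0.1503140, 0.1326531),
  xstd   = c(0.017268873, 0.015413205, 0.011498380, 0.009076995,
             0.008138311, 0.007540782, 0.007099516)
)

best      <- which.min(cp_table$xerror)                   # row with lowest xerror
threshold <- cp_table$xerror[best] + cp_table$xstd[best]  # minimum plus one SE

# Smallest tree whose cross-validated error is within one SE of the best.
chosen <- cp_table[which(cp_table$xerror <= threshold)[1], ]
chosen
```

For this table the rule selects the full tree with 8 splits, because no smaller tree comes within one standard error of the lowest xerror.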

Identifying most important variables

##                       ct1.variable.importance
## satisfaction_level                2205.051676
## average_montly_hours              1171.008521
## number_project                    1136.764011
## last_evaluation                    929.554234
## time_spend_company                 829.388045
## Work_accident                       30.133103
## promotion_last_5years                4.703823
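The raw importance scores are easier to compare as shares of their total. The sketch below rescales the values printed above to percentages; the vector is typed in by hand and the object names are illustrative.

```r
# Importance scores copied from the output above.
importance <- c(
  satisfaction_level    = 2205.051676,
  average_montly_hours  = 1171.008521,
  number_project        = 1136.764011,
  last_evaluation       = 929.554234,
  time_spend_company    = 829.388045,
  Work_accident         = 30.133103,
  promotion_last_5years = 4.703823
)

# Each variable's share of the total importance, in percent.
share <- round(100 * importance / sum(importance), 1)
share
```

Satisfaction level accounts for roughly 35% of the total, while work accidents and promotions together contribute well under 1%.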

Report on accuracy of the model

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6500     Median :0.7200   Median :4.000   Median :202.0       
##  Mean   :0.6153     Mean   :0.7181   Mean   :3.801   Mean   :201.6       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident    left     promotion_last_5years
##  Min.   : 2.000     Min.   :0.0000   0:3366   Min.   :0.00000      
##  1st Qu.: 3.000     1st Qu.:0.0000   1:1023   1st Qu.:0.00000      
##  Median : 3.000     Median :0.0000            Median :0.00000      
##  Mean   : 3.508     Mean   :0.1408            Mean   :0.02028      
##  3rd Qu.: 4.000     3rd Qu.:0.0000            3rd Qu.:0.00000      
##  Max.   :10.000     Max.   :1.0000            Max.   :1.00000      
##                                                                    
##          sales         salary     left_predicted
##  sales      :1198   high  : 379   0:3418        
##  technical  : 783   low   :2140   1: 971        
##  support    : 668   medium:1870                 
##  IT         : 355                               
##  product_mng: 268                               
##  marketing  : 253                               
##  (Other)    : 864

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4389 
## 
##  
##                 | validation$left_predicted 
## validation$left |         0 |         1 | Row Total | 
## ----------------|-----------|-----------|-----------|
##               0 |      3314 |        52 |      3366 | 
##                 |   183.038 |   644.308 |           | 
##                 |     0.985 |     0.015 |     0.767 | 
##                 |     0.970 |     0.054 |           | 
##                 |     0.755 |     0.012 |           | 
## ----------------|-----------|-----------|-----------|
##               1 |       104 |       919 |      1023 | 
##                 |   602.253 |  2119.980 |           | 
##                 |     0.102 |     0.898 |     0.233 | 
##                 |     0.030 |     0.946 |           | 
##                 |     0.024 |     0.209 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |      3418 |       971 |      4389 | 
##                 |     0.779 |     0.221 |           | 
## ----------------|-----------|-----------|-----------|
## 
##

Interpretation of the data

The classification & regression tree is a visual representation of the most important variables related to employees leaving the company. Two of its branches stand out:

1. For 11% of the data, satisfaction level is less than 0.46 and last evaluation is less than 0.45; these employees are likely to leave the company.
2. For 58% of the data, satisfaction level is greater than or equal to 0.47 and time spent at the company is less than 5 years; these employees are likely to stay with the company.

This model predicts which employees will leave the company and which will stay. Of the employees who actually stayed, 98.5% were predicted to stay, and of the employees who actually left, 89.8% were predicted to leave.

Accuracy of the model is 96.4%, based on a 70:30 split between training and validation data.

Logistic Regression

Splitting the data into trainlr & vallr

Create Model

## 
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = trainlr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3927  -0.6771  -0.4306  -0.1523   3.1369  
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            0.2297893  0.1399177   1.642    0.101    
## satisfaction_level    -4.1541678  0.1156513 -35.920  < 2e-16 ***
## last_evaluation        0.8140587  0.1744926   4.665 3.08e-06 ***
## number_project        -0.3252356  0.0249471 -13.037  < 2e-16 ***
## average_montly_hours   0.0042840  0.0006043   7.089 1.35e-12 ***
## time_spend_company     0.2394629  0.0179363  13.351  < 2e-16 ***
## Work_accident         -1.5533337  0.1066776 -14.561  < 2e-16 ***
## promotion_last_5years -1.7907012  0.3050679  -5.870 4.36e-09 ***
## salary                        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11569.0  on 10582  degrees of freedom
## Residual deviance:  9336.2  on 10575  degrees of freedom
## AIC: 9352.2
## 
## Number of Fisher Scoring iterations: 5

Analysis of deviance to see which variables to use

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: left
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                  10582    11569.0              
## satisfaction_level     1  1595.97     10581     9973.0 < 2.2e-16 ***
## last_evaluation        1    12.41     10580     9960.6  0.000426 ***
## number_project         1    93.35     10579     9867.2 < 2.2e-16 ***
## average_montly_hours   1    56.30     10578     9810.9 6.226e-14 ***
## time_spend_company     1   135.68     10577     9675.2 < 2.2e-16 ***
## Work_accident          1   289.22     10576     9386.0 < 2.2e-16 ***
## promotion_last_5years  1    49.82     10575     9336.2 1.683e-12 ***
## salary                 0     0.00     10575     9336.2              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
## Call:
## glm(formula = left ~ satisfaction_level + last_evaluation + average_montly_hours + 
##     promotion_last_5years, family = binomial(link = "logit"), 
##     data = trainlr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4859  -0.7023  -0.4980  -0.3267   2.7014  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            0.4469563  0.1320889   3.384 0.000715 ***
## satisfaction_level    -3.8272903  0.1053896 -36.316  < 2e-16 ***
## last_evaluation        0.3094267  0.1580656   1.958 0.050279 .  
## average_montly_hours   0.0016123  0.0005289   3.048 0.002302 ** 
## promotion_last_5years -1.4525982  0.2865873  -5.069 4.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11569.0  on 10582  degrees of freedom
## Residual deviance:  9914.2  on 10578  degrees of freedom
## AIC: 9924.2
## 
## Number of Fisher Scoring iterations: 5

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: left
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                  10582    11569.0              
## satisfaction_level     1  1595.97     10581     9973.0 < 2.2e-16 ***
## last_evaluation        1    12.41     10580     9960.6  0.000426 ***
## average_montly_hours   1     9.43     10579     9951.1  0.002129 ** 
## promotion_last_5years  1    36.96     10578     9914.2 1.205e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Score the validation data (create predicted values)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01476 0.10711 0.18567 0.24017 0.31129 0.70888
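Scoring means turning the fitted coefficients into a probability of leaving for each validation row. The sketch below does this by hand for a single row, using the coefficients of the reduced model printed above; the example employee is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the reduced logistic regression summary.
coefs <- c(
  intercept             = 0.4469563,
  satisfaction_level    = -3.8272903,
  last_evaluation       = 0.3094267,
  average_montly_hours  = 0.0016123,
  promotion_last_5years = -1.4525982
)

# Hypothetical employee, not a row from the data.
employee <- c(satisfaction_level = 0.40, last_evaluation = 0.70,
              average_montly_hours = 200, promotion_last_5years = 0)

# Linear predictor (log-odds), then the inverse logit to get a probability.
log_odds   <- coefs[["intercept"]] + sum(coefs[names(employee)] * employee)
prob_leave <- plogis(log_odds)   # same as 1 / (1 + exp(-log_odds))
prob_leave
```

For this employee the predicted probability of leaving is about 0.37, inside the range of scores summarised above. With a fitted `glm` object, `predict(model, newdata, type = "response")` performs the same calculation for every row.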

How many left the job

Applying New Cutoff

## [1] 1072
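The report does not state how the new cutoff was chosen. In the confusion matrix that follows, the number of predicted leavers equals the number of actual leavers (both 1072), which is consistent with, though it does not confirm, a cutoff set so that the highest-scoring 1072 employees are flagged. A sketch of that approach under this assumption, using simulated scores rather than the model's real output:

```r
set.seed(1)                        # assumed seed for a reproducible sketch
n_val     <- 4416                  # validation rows, from the confusion matrix
n_left    <- 1072                  # employees who actually left
predicted <- rbeta(n_val, 2, 6)    # simulated scores, NOT the model's real output

# Use the score of the n_left-th highest-ranked employee as the cutoff.
cutoff <- sort(predicted, decreasing = TRUE)[n_left]
predicted_left01 <- as.integer(predicted >= cutoff)

sum(predicted_left01)  # equals n_left when the scores have no ties
cutoff
```

A cutoff chosen this way forces the two kinds of error to be equal in count, which matches the 483 and 483 off-diagonal cells in the table below.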

How well are we predicting ~ Confusion Matrix

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4416 
## 
##  
##              | vallr$predicted_left01 
##   vallr$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      2861 |       483 |      3344 | 
##              |    42.685 |   133.152 |           | 
##              |     0.856 |     0.144 |     0.757 | 
##              |     0.856 |     0.451 |           | 
##              |     0.648 |     0.109 |           | 
## -------------|-----------|-----------|-----------|
##            1 |       483 |       589 |      1072 | 
##              |   133.152 |   415.354 |           | 
##              |     0.451 |     0.549 |     0.243 | 
##              |     0.144 |     0.549 |           | 
##              |     0.109 |     0.133 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3344 |      1072 |      4416 | 
##              |     0.757 |     0.243 |           | 
## -------------|-----------|-----------|-----------|
## 
##

Interpretation of the data

This model predicts which employees will leave the company and which will stay. Of the employees who actually stayed, 85.6% were predicted to stay, but of the employees who actually left, only 54.9% were predicted to leave.

Accuracy of the model is 78.1%, based on a 70:30 split between training and validation data.

Conclusion

    Particulars                               Logistic Regression |   CART     | RandomForest
                                                 -----------------   --------     --------
Actually stayed, predicted to stay                  85.6%         |   98.5%    |    99.9%

Actually stayed, predicted to leave                 14.4%         |   1.5%     |     0.1%

Actually left, predicted to stay                    45.1%         |   10.2%    |     3.2%

Actually left, predicted to leave                   54.9%         |   89.8%    |    96.8%
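A side-by-side comparison like the one above can be derived directly from the three confusion matrices. The sketch below uses the counts from the cross tables in this report (rows = actual, columns = predicted); the list and function names are illustrative.

```r
# Confusion matrices copied from the three cross tables.
confusion <- list(
  "Logistic Regression" = matrix(c(2861, 483,  483,  589), 2, byrow = TRUE),
  "CART"                = matrix(c(3314,  52,  104,  919), 2, byrow = TRUE),
  "Random Forest"       = matrix(c(3421,   5,   35, 1049), 2, byrow = TRUE)
)

# For each model: row-wise hit rates and overall accuracy.
rates <- sapply(confusion, function(cm) {
  c(stayers_predicted_stay  = cm[1, 1] / sum(cm[1, ]),
    leavers_predicted_leave = cm[2, 2] / sum(cm[2, ]),
    accuracy                = sum(diag(cm)) / sum(cm))
})

round(100 * rates, 1)
```

The models differ most in the second row: logistic regression catches just over half of the employees who left, while the tree-based models catch roughly nine in ten or more.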

Logistic Regression - Overall accuracy is 78.1%

Benefits: Estimates the likelihood of a “yes” or “no” outcome from several variables
Limitations: Once a cutoff is applied there is no gray area; each prediction is or it isn’t

Classification And Regression Tree (CART) - Overall accuracy is 96.4%

Benefits: Useful for market research on survey responses
Limitations: Readability of the charts; the appearance isn’t great and large trees are very hard to read & analyse

Random Forest - Overall accuracy is 99.1%

Benefits: Accuracy
Limitations: Complexity & longer training

It’s hard to favor one algorithm over another. Each algorithm has a purpose and its own pros and cons. Which algorithm to use depends on the data being analysed.

Final Project

Ramya Murthy

4/24/2020
