We found this human-resources dataset on Kaggle and use logistic regression to examine which factors affect employee departure. Our S.M.A.R.T. question is “What factors play a role in employee departure, and what are their impacts?”. Note: the dataset is simulated. This document is published to RPubs (http://rpubs.com/lewk/gwu-hrdata-1).
First, we checked that there are no NAs; we then examined the types and normality of the variables.
Read in the HR dataset, reclassify any factor variables missed on ingest, and ensure the dependent binomial variable (left) is of numeric type.
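A minimal sketch of the ingest step (the CSV file name is an assumption based on the usual Kaggle download):

hr <- read.csv("HR_comma_sep.csv", stringsAsFactors = TRUE)  # sales and salary ingest as factors
hr$Work_accident <- as.factor(hr$Work_accident)              # reclassify the binary flags
hr$promotion_last_5years <- as.factor(hr$promotion_last_5years)
hr$left <- as.numeric(as.character(hr$left))                 # ensure the dependent is numeric 0/1 even if it was factored
str(hr)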
A correlation plot was also made to look at the initial correlations between variables. From the plot, we can see that satisfaction_level has the highest correlation with left, though at only -0.35 this is a moderate correlation. The correlations among the independent variables show no multicollinearity in the dataset.
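A sketch of the correlation plot, assuming the corrplot package (factors are coerced to their numeric codes first):

library(corrplot)
hr.num <- data.frame(lapply(hr, as.numeric))  # coerce factors to numeric codes
corrplot(cor(hr.num))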
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ left : Factor w/ 2 levels "1","0": 2 2 2 2 2 2 2 2 2 2 ...
## $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ left : num 0 0 0 0 0 0 0 0 0 0 ...
## $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
Run histograms and describe() on the numeric variables to assess normality. From the plots, we can see that skew is not a serious problem.
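A sketch of this step, assuming the psych package (the result is transposed with t() so that the statistics appear as rows, matching the output below):

library(psych)
nums <- sapply(hr, is.numeric)                        # pick out the numeric columns
for (v in names(hr)[nums]) hist(hr[[v]], main = v, xlab = v)
t(describe(hr[, nums]))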
## satisfaction_level last_evaluation number_project
## vars 1 1 1
## n 14999 14999 14999
## mean 0.6128335 0.7161017 3.803054
## sd 0.2486307 0.1711691 1.232592
## median 0.64 0.72 4
## trimmed 0.6301642 0.716467 3.738272
## mad 0.281694 0.22239 1.4826
## min 0.09 0.36 2
## max 1 1 7
## range 0.91 0.64 5
## skew -0.4762651 -0.02661643 0.3376381
## kurtosis -0.6713455 -1.239262 -0.4960467
## se 0.002030128 0.001397637 0.01006441
## average_montly_hours time_spend_company left
## vars 1 1 1
## n 14999 14999 14999
## mean 201.0503 3.498233 0.7619175
## sd 49.9431 1.460136 0.4259241
## median 200 3 1
## trimmed 200.6359 3.276977 0.8273477
## mad 65.2344 1.4826 0
## min 96 2 0
## max 310 10 1
## range 214 8 1
## skew 0.05283142 1.852948 -1.229797
## kurtosis -1.135252 4.770184 -0.4876329
## se 0.4077973 0.01192236 0.003477772
Create training and testing datasets, and run correlation against the variables (with factors coerced to numeric, as above).
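A sketch of the split (the 12000-row training size matches the str() output below; the seed choice is an assumption):

set.seed(1)                        # assumption: any fixed seed for reproducibility
idx <- sample(nrow(hr), 12000)     # 12000 training rows, 2999 held out for testing
train <- hr[idx, ]
test <- hr[-idx, ]
str(train)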
## 'data.frame': 12000 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ left : num 0 0 0 0 0 0 0 0 0 0 ...
## $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
Create a logistic regression model using all variables to gain an understanding of each variable's interaction with the dependent (left).
Looking at the p-values in the all-variable model, every variable seems to contribute to the model.
NOTES: satisfaction_level outputs a surprisingly large coefficient of about 4; additionally, average_montly_hours and time_spend_company have an almost negligible effect on the dependent.
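In code, the full model is:

m.all <- glm(left ~ ., family = binomial(link = "logit"), data = train)
summary(m.all)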
##
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1555 0.1921 0.3614 0.5658 2.1595
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0505718 0.2521011 8.134 4.16e-16 ***
## satisfaction_level 4.0226715 0.1211739 33.198 < 2e-16 ***
## last_evaluation -0.5962296 0.1800562 -3.311 0.000928 ***
## number_project 0.2886107 0.0258907 11.147 < 2e-16 ***
## average_montly_hours -0.0041173 0.0006216 -6.624 3.51e-11 ***
## time_spend_company -0.3079136 0.0202271 -15.223 < 2e-16 ***
## Work_accident1 1.4429068 0.1109351 13.007 < 2e-16 ***
## promotion_last_5years1 1.4628025 0.3843782 3.806 0.000141 ***
## saleshr -0.1377605 0.1657808 -0.831 0.405985
## salesIT 0.0191346 0.1525542 0.125 0.900184
## salesmanagement 0.2295874 0.2024092 1.134 0.256680
## salesmarketing -0.0405011 0.1662055 -0.244 0.807478
## salesproduct_mng 0.0344749 0.1646693 0.209 0.834168
## salesRandD 0.3495131 0.1733268 2.016 0.043748 *
## salessales -0.0561511 0.1290379 -0.435 0.663452
## salessupport -0.1173290 0.1370692 -0.856 0.392007
## salestechnical -0.0839353 0.1337196 -0.628 0.530203
## salarylow -1.8386550 0.1638898 -11.219 < 2e-16 ***
## salarymedium -1.3844728 0.1647690 -8.403 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10813.5 on 11999 degrees of freedom
## Residual deviance: 8602.1 on 11981 degrees of freedom
## AIC: 8640.1
##
## Number of Fisher Scoring iterations: 6
Use bestglm() with the forward selection method to determine the best possible model. The best model shows that the higher the satisfaction level, the less likely the employee is to leave; employees with a higher number of projects, a past work accident, or a promotion in the last five years also have a higher chance of staying with the company. Meanwhile, employees with a higher last evaluation, more average monthly hours, more years at the company, or a low or medium salary have a higher chance of leaving.
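A sketch of the selection step, assuming the bestglm package, which requires the response as the last column and named y (note in the output below that bestglm falls back to a Morgan-Tatar search for non-gaussian families):

library(bestglm)
Xy <- train[, c(setdiff(names(train), "left"), "left")]  # move the response to the last column
names(Xy)[ncol(Xy)] <- "y"
best.out <- bestglm(Xy, family = binomial, method = "forward", IC = "AIC")
bglm <- best.out$BestModel                               # the fitted glm for the chosen subset
summary(bglm)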
## Morgan-Tatar search since family is non-gaussian.
## Note: factors present with more than 2 levels.
##
## Call:
## glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1821 0.1930 0.3631 0.5658 2.1373
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0359697 0.2242410 9.079 < 2e-16 ***
## satisfaction_level 4.0179362 0.1209866 33.210 < 2e-16 ***
## last_evaluation -0.5948691 0.1797951 -3.309 0.000938 ***
## number_project 0.2878855 0.0258428 11.140 < 2e-16 ***
## average_montly_hours -0.0040972 0.0006208 -6.600 4.12e-11 ***
## time_spend_company -0.3041396 0.0200786 -15.147 < 2e-16 ***
## Work_accident1 1.4469436 0.1108942 13.048 < 2e-16 ***
## promotion_last_5years1 1.5125628 0.3835336 3.944 8.02e-05 ***
## salarylow -1.8709210 0.1631060 -11.471 < 2e-16 ***
## salarymedium -1.4135286 0.1640800 -8.615 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10813.5 on 11999 degrees of freedom
## Residual deviance: 8618.2 on 11990 degrees of freedom
## AIC: 8638.2
##
## Number of Fisher Scoring iterations: 6
Compare the bestglm model (bglm) against the all-variable model (m.all) using roc().
The AUC of the all-variable model is 79.5%, while the AUC of the best model is 79.2%; the AIC decreases from 8640 to 8638.
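A sketch of the comparison, assuming the pROC package and predicted probabilities on the test set:

library(pROC)
best.pred <- predict(bglm, newdata = test, type = "response")
all.pred <- predict(m.all, newdata = test, type = "response")
roc(left ~ best.pred, data = test)  # AUC of the best model
roc(left ~ all.pred, data = test)   # AUC of the all-variable model
AIC(m.all)
AIC(bglm)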
##
## Call:
## roc.formula(formula = left ~ best.pred, data = test)
##
## Data: best.pred in 1571 controls (left 0) < 1428 cases (left 1).
## Area under the curve: 0.792
##
## Call:
## roc.formula(formula = left ~ all.pred, data = test)
##
## Data: all.pred in 1571 controls (left 0) < 1428 cases (left 1).
## Area under the curve: 0.7954
## [1] 8640.053
## [1] 8638.177
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 11999 10813.5
## satisfaction_level 1 1437.83 11998 9375.6
## last_evaluation 1 10.52 11997 9365.1
## number_project 1 57.52 11996 9307.6
## average_montly_hours 1 51.31 11995 9256.3
## time_spend_company 1 160.53 11994 9095.8
## Work_accident 1 221.94 11993 8873.8
## promotion_last_5years 1 34.87 11992 8839.0
## salary 2 220.77 11990 8618.2
## llh llhNull G2 McFadden r2ML
## -4301.0264319 -5406.7345064 2211.4161490 0.2045057 0.1683010
## r2CU
## 0.2833892
## (Intercept) satisfaction_level last_evaluation
## 677.2344315 5485.0112815 -44.9115197
## number_project average_montly_hours time_spend_company
## 33.4572030 -0.4108826 -26.5021204
## Work_accident1 promotion_last_5years1 saleshr
## 323.2982337 331.8043690 -12.8692698
## salesIT salesmanagement salesmarketing
## 1.9318868 25.8080779 -3.9691851
## salesproduct_mng salesRandD salessales
## 3.5076078 41.8376802 -5.4603728
## salessupport salestechnical salarylow
## -11.0707401 -8.0509221 -84.0968826
## salarymedium
## -74.9544193
## (Intercept) satisfaction_level last_evaluation
## 665.9675825 5458.6270996 -44.8365213
## number_project average_montly_hours time_spend_company
## 33.3604611 -0.4088828 -26.2242161
## Work_accident1 promotion_last_5years1 salarylow
## 325.0104710 353.8346722 -84.6018217
## salarymedium
## -75.6716690
## [1] 0.4504835
## [1] 0.4431477
Transform satisfaction_level and last_evaluation into factor variables, as each can be represented by a 3-level factor (1, 2, 3).
This time, the best regression shows that employees with a higher satisfaction level, more projects, a past work accident, or a promotion in the last five years have a higher chance of staying with the company. However, employees with more average monthly hours, more time spent at the company, or a low or medium salary have a higher chance of leaving. The result is almost the same as the former best regression.
Doing so actually raised the AIC of the models, including the newly run bestglm() model, which indicates we lost some predictive power: the AUC went down to 77% and the AIC increased to 8966 on this transformed dataframe.
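A sketch of the transformation; equal-width bins via cut() are an assumption (quantile-based cuts would be equally plausible):

trans <- hr
trans$satisfaction_level <- cut(trans$satisfaction_level, breaks = 3, labels = c("1", "2", "3"))
trans$last_evaluation <- cut(trans$last_evaluation, breaks = 3, labels = c("1", "2", "3"))
trans.train <- trans[idx, ]   # reuse the earlier train/test split
trans.test <- trans[-idx, ]
m.trans <- glm(left ~ ., family = binomial(link = "logit"), data = trans.train)
summary(m.trans)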
##
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = trans.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0732 0.1851 0.3960 0.5900 2.4726
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.978e+01 1.862e+02 0.106 0.91537
## satisfaction_level2 1.153e+00 8.760e-02 13.160 < 2e-16 ***
## satisfaction_level3 1.769e+00 8.431e-02 20.980 < 2e-16 ***
## last_evaluation2 -1.692e+01 1.862e+02 -0.091 0.92760
## last_evaluation3 -1.581e+01 1.862e+02 -0.085 0.93231
## number_project 1.326e-01 2.678e-02 4.953 7.33e-07 ***
## average_montly_hours -5.073e-03 6.161e-04 -8.234 < 2e-16 ***
## time_spend_company -3.620e-01 2.002e-02 -18.083 < 2e-16 ***
## Work_accident1 1.460e+00 1.104e-01 13.232 < 2e-16 ***
## promotion_last_5years1 1.493e+00 3.828e-01 3.899 9.64e-05 ***
## saleshr -1.341e-01 1.628e-01 -0.824 0.40992
## salesIT 1.090e-01 1.493e-01 0.730 0.46511
## salesmanagement 2.784e-01 1.985e-01 1.402 0.16082
## salesmarketing 1.543e-02 1.623e-01 0.095 0.92428
## salesproduct_mng 1.000e-01 1.619e-01 0.618 0.53671
## salesRandD 4.443e-01 1.706e-01 2.604 0.00921 **
## salessales 1.154e-02 1.266e-01 0.091 0.92738
## salessupport -5.343e-02 1.343e-01 -0.398 0.69070
## salestechnical -4.987e-03 1.310e-01 -0.038 0.96963
## salarylow -1.877e+00 1.625e-01 -11.552 < 2e-16 ***
## salarymedium -1.403e+00 1.634e-01 -8.586 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10813.5 on 11999 degrees of freedom
## Residual deviance: 8924.8 on 11979 degrees of freedom
## AIC: 8966.8
##
## Number of Fisher Scoring iterations: 16
## Morgan-Tatar search since family is non-gaussian.
## Note: factors present with more than 2 levels.
##
## Call:
## glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0732 0.1851 0.3960 0.5900 2.4726
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.978e+01 1.862e+02 0.106 0.91537
## satisfaction_level2 1.153e+00 8.760e-02 13.160 < 2e-16 ***
## satisfaction_level3 1.769e+00 8.431e-02 20.980 < 2e-16 ***
## last_evaluation2 -1.692e+01 1.862e+02 -0.091 0.92760
## last_evaluation3 -1.581e+01 1.862e+02 -0.085 0.93231
## number_project 1.326e-01 2.678e-02 4.953 7.33e-07 ***
## average_montly_hours -5.073e-03 6.161e-04 -8.234 < 2e-16 ***
## time_spend_company -3.620e-01 2.002e-02 -18.083 < 2e-16 ***
## Work_accident1 1.460e+00 1.104e-01 13.232 < 2e-16 ***
## promotion_last_5years1 1.493e+00 3.828e-01 3.899 9.64e-05 ***
## saleshr -1.341e-01 1.628e-01 -0.824 0.40992
## salesIT 1.090e-01 1.493e-01 0.730 0.46511
## salesmanagement 2.784e-01 1.985e-01 1.402 0.16082
## salesmarketing 1.543e-02 1.623e-01 0.095 0.92428
## salesproduct_mng 1.000e-01 1.619e-01 0.618 0.53671
## salesRandD 4.443e-01 1.706e-01 2.604 0.00921 **
## salessales 1.154e-02 1.266e-01 0.091 0.92738
## salessupport -5.343e-02 1.343e-01 -0.398 0.69070
## salestechnical -4.987e-03 1.310e-01 -0.038 0.96963
## salarylow -1.877e+00 1.625e-01 -11.552 < 2e-16 ***
## salarymedium -1.403e+00 1.634e-01 -8.586 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10813.5 on 11999 degrees of freedom
## Residual deviance: 8924.8 on 11979 degrees of freedom
## AIC: 8966.8
##
## Number of Fisher Scoring iterations: 16
## [1] 8966.751
## [1] 8966.751
##
## Call:
## roc.formula(formula = left ~ best.trans.pred, data = trans.test)
##
## Data: best.trans.pred in 1571 controls (left 0) < 1428 cases (left 1).
## Area under the curve: 0.7738
## [1] 0.5268423
## llh llhNull G2 McFadden r2ML
## -4462.3754608 -5406.7345064 1888.7180913 0.1746635 0.1456319
## r2CU
## 0.2452185
So, we still think the former best model is better, and we draw the following conclusions:
1. Staff are about 500% more likely to quit if they have a low salary, and about 200% more likely if they have a medium salary; salary is therefore an important factor in people's decision to leave.
2. Staff are about 80% more likely to quit if they have a good last evaluation.
3. Staff are far less likely to quit when they are satisfied.
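The percentage figures come from exponentiating the logit coefficients, as in the output above; for example:

(exp(coef(bglm)) - 1) * 100  # percent change in the odds for a one-unit increase in each predictor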
Since salary is significant to the decision to leave, we believe the prediction would improve if we had numeric salary values. Investigating the evaluations could also help, as they appear to play a part in people leaving.
We also looked into other models on Kaggle and found similar results there.