Who left? HR Data Prediction

We found a human resources dataset on the Kaggle website and use logistic regression to see which factors affect whether an employee leaves. First, we confirmed that the data contain no NAs, and then checked the type and normality of the variables.

Read in the HR dataset, reclassify any factor-like variables as factors (if missed on ingest), and ensure the dependent binary variable (`left`) is of numeric type

## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...

## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ left                 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
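The reclassification step can be sketched as follows. This is a minimal illustration rather than the authors' code: the data frame `hr` is a hypothetical stand-in built from a few values printed above, because the original read-in call and file path are not shown.

```r
# Hypothetical stand-in for the HR data: a few rows copied from the str() output.
hr <- data.frame(
  satisfaction_level    = c(0.38, 0.80, 0.11, 0.72, 0.37),
  Work_accident         = c(0L, 0L, 0L, 0L, 0L),
  left                  = c(1L, 1L, 1L, 1L, 1L),
  promotion_last_5years = c(0L, 0L, 0L, 0L, 0L),
  salary                = c("low", "medium", "medium", "low", "low"),
  stringsAsFactors      = FALSE
)

# Binary 0/1 indicators become two-level factors. Levels are given explicitly
# so that both "0" and "1" exist even when a sample contains only one of them.
hr$Work_accident         <- factor(hr$Work_accident, levels = c(0, 1))
hr$promotion_last_5years <- factor(hr$promotion_last_5years, levels = c(0, 1))

# Character columns become factors (alphabetical levels: high, low, medium).
hr$salary <- factor(hr$salary, levels = c("high", "low", "medium"))

# The response stays numeric 0/1 for glm(..., family = binomial).
hr$left <- as.numeric(hr$left)

str(hr)
```

Because `salary` is stored with alphabetical levels, `high` is the reference level in the models, which is why only `salarylow` and `salarymedium` appear as coefficients.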

Run histograms and `describe()` on the numeric variables to assess normality

##          satisfaction_level last_evaluation number_project
## vars     1                  1               1             
## n        14999              14999           14999         
## mean     0.6128335          0.7161017       3.803054      
## sd       0.2486307          0.1711691       1.232592      
## median   0.64               0.72            4             
## trimmed  0.6301642          0.716467        3.738272      
## mad      0.281694           0.22239         1.4826        
## min      0.09               0.36            2             
## max      1                  1               7             
## range    0.91               0.64            5             
## skew     -0.4762651         -0.02661643     0.3376381     
## kurtosis -0.6713455         -1.239262       -0.4960467    
## se       0.002030128        0.001397637     0.01006441    
##          average_montly_hours time_spend_company left       
## vars     1                    1                  1          
## n        14999                14999              14999      
## mean     201.0503             3.498233           0.2380825  
## sd       49.9431              1.460136           0.4259241  
## median   200                  3                  0          
## trimmed  200.6359             3.276977           0.1726523  
## mad      65.2344              1.4826             0          
## min      96                   2                  0          
## max      310                  10                 1          
## range    214                  8                  1          
## skew     0.05283142           1.852948           1.229797   
## kurtosis -1.135252            4.770184           -0.4876329 
## se       0.4077973            0.01192236         0.003477772
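To make the `skew`, `kurtosis` and `se` rows concrete, the sketch below recomputes them for `left`. It assumes simple moment-based estimators (central moment divided by a power of the sample standard deviation), which appear to match the figures above. The helper names are ours, and the vector is rebuilt from `n` and `mean` rather than read from the data.

```r
# Rebuild `left` from the summary: n = 14999 and mean = 0.2380825
# imply 3571 ones and 11428 zeros.
left <- c(rep(1, 3571), rep(0, 11428))

# Moment-based skewness: third central moment over sd^3 (sd uses n - 1).
skew_moment <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / sd(x)^3
}

# Moment-based excess kurtosis: fourth central moment over sd^4, minus 3.
kurtosis_moment <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / sd(x)^4 - 3
}

left_se   <- sd(left) / sqrt(length(left))  # standard error of the mean
left_skew <- skew_moment(left)
left_kurt <- kurtosis_moment(left)

print(c(se = left_se, skew = left_skew, kurtosis = left_kurt))
```

The positive skew of `left` simply reflects that about 24% of employees left. For a 0/1 variable, skew and kurtosis are functions of that proportion and say nothing about normality.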

Create training (12,000 rows) and testing (2,999 rows) datasets, and run correlations on the numeric variables

## [1] 14999

## 'data.frame':    12000 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ left                 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
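A split of this size and a correlation matrix can be sketched as below. The source does not show how rows were assigned to each set, so the random draw, the seed, and the simulated stand-in data frame `hr` are all assumptions made for illustration.

```r
set.seed(2017)  # hypothetical seed, only so this sketch is reproducible

# Hypothetical stand-in with the same number of rows as the HR data.
n  <- 14999
hr <- data.frame(
  satisfaction_level = runif(n),
  number_project     = sample(2:7, n, replace = TRUE),
  left               = rbinom(n, 1, 0.238)
)

# Draw 12000 row indices at random for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = 12000)
train <- hr[train_idx, ]
test  <- hr[-train_idx, ]

# Correlation matrix over the numeric columns of the training set.
num_cols <- sapply(train, is.numeric)
cor_mat  <- cor(train[, num_cols])
print(round(cor_mat, 3))
```

A random draw keeps the share of leavers roughly equal in both sets. Taking the first 12,000 rows of an ordered file would not guarantee that.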

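Create a logistic regression model using all variables to understand how each variable relates to the dependent variable (`left`)

Looking at the p-values in the all-variable model, every variable contributes significantly except `sales` (department), where only the `RandD` level is significant at the 5% level

NOTES: `satisfaction_level` has a large coefficient of about -4. This is less strange than it looks, because the variable is on a 0 to 1 scale, so the coefficient describes a move across nearly its whole range. `average_montly_hours` has a small per-unit effect on the dependent variable (about 0.004 per hour) and `time_spend_company` a modest one (about 0.31 per year), although both are highly significant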

## 
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1595  -0.5658  -0.3614  -0.1921   3.1555  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.0505718  0.2521011  -8.134 4.16e-16 ***
## satisfaction_level     -4.0226715  0.1211739 -33.198  < 2e-16 ***
## last_evaluation         0.5962296  0.1800562   3.311 0.000928 ***
## number_project         -0.2886107  0.0258907 -11.147  < 2e-16 ***
## average_montly_hours    0.0041173  0.0006216   6.624 3.51e-11 ***
## time_spend_company      0.3079136  0.0202271  15.223  < 2e-16 ***
## Work_accident1         -1.4429068  0.1109351 -13.007  < 2e-16 ***
## promotion_last_5years1 -1.4628025  0.3843782  -3.806 0.000141 ***
## saleshr                 0.1377605  0.1657808   0.831 0.405985    
## salesIT                -0.0191346  0.1525542  -0.125 0.900184    
## salesmanagement        -0.2295874  0.2024092  -1.134 0.256680    
## salesmarketing          0.0405011  0.1662055   0.244 0.807478    
## salesproduct_mng       -0.0344749  0.1646693  -0.209 0.834168    
## salesRandD             -0.3495131  0.1733268  -2.016 0.043748 *  
## salessales              0.0561511  0.1290379   0.435 0.663452    
## salessupport            0.1173290  0.1370692   0.856 0.392007    
## salestechnical          0.0839353  0.1337196   0.628 0.530203    
## salarylow               1.8386550  0.1638898  11.219  < 2e-16 ***
## salarymedium            1.3844728  0.1647690   8.403  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10813.5  on 11999  degrees of freedom
## Residual deviance:  8602.1  on 11981  degrees of freedom
## AIC: 8640.1
## 
## Number of Fisher Scoring iterations: 6
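The coefficients above are on the log-odds scale, which makes their size hard to judge. A small sketch, using estimates copied from the summary, converts them to odds ratios. The step sizes of 0.1 and 50 hours are our own choices for illustration.

```r
# Selected estimates copied from the summary above (log-odds scale).
coefs <- c(
  satisfaction_level   = -4.0226715,
  average_montly_hours =  0.0041173,
  time_spend_company   =  0.3079136,
  Work_accident1       = -1.4429068,
  salarylow            =  1.8386550
)

# exp() turns a log-odds coefficient into an odds ratio per one-unit increase.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))

# A one-unit step is not always meaningful: satisfaction_level only spans
# 0.09 to 1, and one extra monthly hour is a tiny change.
or_satisfaction_tenth <- exp(coefs[["satisfaction_level"]] * 0.1)   # +0.1
or_hours_fifty        <- exp(coefs[["average_montly_hours"]] * 50)  # +50 hours
print(c(or_satisfaction_tenth, or_hours_fifty))
```

On this scale the low-salary odds ratio is a little above 6 relative to high salary, and 50 extra monthly hours raise the odds of leaving by roughly a fifth, so the small per-hour coefficient is not negligible over a realistic range.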

Use `bestglm()` with the forward method to determine the best possible model

The best model shows that the higher the satisfaction level, the less likely an employee is to leave. Employees who have more projects, have experienced a work accident, or were promoted in the last 5 years are also more likely to stay. In contrast, employees with a higher last evaluation, more average monthly hours, more time spent at the company, or a low or medium salary are more likely to leave.

## Morgan-Tatar search since family is non-gaussian.
## Note: factors present with more than 2 levels.

## 
## Call:
## glm(formula = y ~ ., family = family, data = Xi, weights = weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1373  -0.5658  -0.3631  -0.1930   3.1821  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.0359697  0.2242410  -9.079  < 2e-16 ***
## satisfaction_level     -4.0179362  0.1209866 -33.210  < 2e-16 ***
## last_evaluation         0.5948691  0.1797951   3.309 0.000938 ***
## number_project         -0.2878855  0.0258428 -11.140  < 2e-16 ***
## average_montly_hours    0.0040972  0.0006208   6.600 4.12e-11 ***
## time_spend_company      0.3041396  0.0200786  15.147  < 2e-16 ***
## Work_accident1         -1.4469436  0.1108942 -13.048  < 2e-16 ***
## promotion_last_5years1 -1.5125628  0.3835336  -3.944 8.02e-05 ***
## salarylow               1.8709210  0.1631060  11.471  < 2e-16 ***
## salarymedium            1.4135286  0.1640800   8.615  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10813.5  on 11999  degrees of freedom
## Residual deviance:  8618.2  on 11990  degrees of freedom
## AIC: 8638.2
## 
## Number of Fisher Scoring iterations: 6
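The selected model differs from the all-variable model only in dropping `sales`. One way to check that this costs little is a likelihood-ratio test computed from the two printed residual deviances. This is a sketch based on the rounded figures in the summaries, not output from the original analysis.

```r
# Residual deviances and degrees of freedom copied from the two summaries.
dev_full <- 8602.1; df_full <- 11981   # all variables, including `sales`
dev_best <- 8618.2; df_best <- 11990   # bestglm model, without `sales`

# Likelihood-ratio statistic: the increase in deviance from dropping terms,
# compared with a chi-squared distribution whose df equals the number of
# coefficients dropped (9 dummy variables for the 10 departments).
lr_stat <- dev_best - dev_full
lr_df   <- df_best - df_full
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = lr_df, p = p_value))
```

The p-value falls between 0.05 and 0.10, so at the 5% level the nine department dummies do not significantly improve the fit, which agrees with the slightly lower AIC of the smaller model.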

Compare the bestglm model (`bglm`) against the all-variable model (`m.all`) using `roc()`

Both models have very similar AUCs on the test set (0.792 and 0.795, i.e. about 79%), which seems fairly decent

## 
## Call:
## roc.formula(formula = left ~ best.pred, data = test)
## 
## Data: best.pred in 1428 controls (left 0) < 1571 cases (left 1).
## Area under the curve: 0.792

## 
## Call:
## roc.formula(formula = left ~ all.pred, data = test)
## 
## Data: all.pred in 1428 controls (left 0) < 1571 cases (left 1).
## Area under the curve: 0.7954
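For readers unfamiliar with AUC: it is the probability that a randomly chosen leaver receives a higher predicted score than a randomly chosen stayer. The base R sketch below computes it from ranks. The function name `auc_rank` and the toy data are ours, and this is not how `roc()` is implemented internally.

```r
# AUC as the probability that a randomly chosen case outranks a randomly
# chosen control (Mann-Whitney form); tied scores receive average ranks.
auc_rank <- function(labels, scores) {
  stopifnot(length(labels) == length(scores), all(labels %in% c(0, 1)))
  r  <- rank(scores)        # average ranks for ties
  n1 <- sum(labels == 1)    # cases (left = 1)
  n0 <- sum(labels == 0)    # controls (left = 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data (hypothetical, not taken from the HR set).
toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
toy_auc    <- auc_rank(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 case-control pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ordering and 1 to perfect separation, so values near 0.79 indicate useful but imperfect discrimination.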

## [1] 8640.053

## [1] 8638.177
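The two values above match the AIC lines of the all-variable and bestglm summaries. As a sketch of where they come from, AIC for these models is the residual deviance plus twice the number of estimated coefficients. The helper name is ours, and the inputs are the rounded deviances printed in the summaries.

```r
# AIC = residual deviance + 2 * (number of estimated coefficients).
# For a 0/1 response the residual deviance equals -2 * log-likelihood.
aic_from_deviance <- function(deviance, n_coef) deviance + 2 * n_coef

aic_all  <- aic_from_deviance(8602.1, 19)  # all-variable model: 19 coefficients
aic_best <- aic_from_deviance(8618.2, 10)  # bestglm model: 10 coefficients

print(c(all = aic_all, best = aic_best))
```

The smaller model has a higher deviance but pays a smaller penalty, which is why its AIC is lower by about 2.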

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev
## NULL                                  11999    10813.5
## satisfaction_level     1  1437.83     11998     9375.6
## last_evaluation        1    10.52     11997     9365.1
## number_project         1    57.52     11996     9307.6
## average_montly_hours   1    51.31     11995     9256.3
## time_spend_company     1   160.53     11994     9095.8
## Work_accident          1   221.94     11993     8873.8
## promotion_last_5years  1    34.87     11992     8839.0
## salary                 2   220.77     11990     8618.2

##           llh       llhNull            G2      McFadden          r2ML 
## -4301.0264319 -5406.7345064  2211.4161490     0.2045057     0.1683010 
##          r2CU 
##     0.2833892
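The pseudo R-squared values above can be reproduced from the two log-likelihoods alone. The sketch below uses the standard McFadden, maximum-likelihood (Cox and Snell) and Cragg-Uhler (Nagelkerke) formulas, and assumes the model was fitted on the 12,000 training rows.

```r
llh      <- -4301.0264319   # log-likelihood of the fitted model
llh_null <- -5406.7345064   # log-likelihood of the intercept-only model
n        <- 12000           # assumed: number of training observations

g2       <- 2 * (llh - llh_null)                 # likelihood-ratio statistic
mcfadden <- 1 - llh / llh_null                   # McFadden
r2_ml    <- 1 - exp(-g2 / n)                     # maximum likelihood
r2_cu    <- r2_ml / (1 - exp(2 * llh_null / n))  # Cragg-Uhler

print(c(G2 = g2, McFadden = mcfadden, r2ML = r2_ml, r2CU = r2_cu))
```

Note that -2 times `llh` is about 8602.05, the residual deviance of the all-variable model, so these figures describe that model rather than the bestglm one.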

## [1] 0.4504835

## [1] 0.4431477

Transform `satisfaction_level` and `last_evaluation` into factor variables, as they can be represented by 3-level factors (1, 2, 3)

Doing so actually raised the AIC of the models (from about 8640 to about 8967), including the newly run `bestglm()` model, which indicates that we lost some predictive power. AUC went from around 79% to 72% when using this transformed data frame

This time, the best regression shows that employees with a higher satisfaction level, more projects, a work accident, or a promotion in the last 5 years are more likely to stay. Employees with more average monthly hours, more time spent at the company, or a low or medium salary are more likely to leave. The result is almost the same as for the former best regression. Only last evaluation no longer contributes (neither of its factor levels is significant in the output below), and this may explain the increase in AIC.
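Binning a 0 to 1 score into three levels can be sketched with `cut()`. The source does not state which cut points were used, so the equal-width bins below are an assumption, and the input vector is made up for illustration.

```r
# Hypothetical scores on the same 0-1 scale as satisfaction_level.
satisfaction <- c(0.09, 0.38, 0.41, 0.64, 0.72, 0.80, 0.92, 1.00)

# cut() with breaks = 3 splits the observed range into three equal-width bins;
# the labels 1, 2, 3 follow the level naming used in the model output.
satisfaction_f <- cut(satisfaction, breaks = 3, labels = c("1", "2", "3"))

bin_counts <- table(satisfaction_f)
print(bin_counts)
```

Replacing a continuous score by three bins discards the variation within each bin, which is consistent with the higher AIC and lower AUC reported for the transformed data.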

## 
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = trans.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4726  -0.5900  -0.3960  -0.1851   3.0732  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.978e+01  1.862e+02  -0.106  0.91537    
## satisfaction_level2    -1.153e+00  8.760e-02 -13.160  < 2e-16 ***
## satisfaction_level3    -1.769e+00  8.431e-02 -20.980  < 2e-16 ***
## last_evaluation2        1.692e+01  1.862e+02   0.091  0.92760    
## last_evaluation3        1.581e+01  1.862e+02   0.085  0.93231    
## number_project         -1.326e-01  2.678e-02  -4.953 7.33e-07 ***
## average_montly_hours    5.073e-03  6.161e-04   8.234  < 2e-16 ***
## time_spend_company      3.620e-01  2.002e-02  18.083  < 2e-16 ***
## Work_accident1         -1.460e+00  1.104e-01 -13.232  < 2e-16 ***
## promotion_last_5years1 -1.493e+00  3.828e-01  -3.899 9.64e-05 ***
## saleshr                 1.341e-01  1.628e-01   0.824  0.40992    
## salesIT                -1.090e-01  1.493e-01  -0.730  0.46511    
## salesmanagement        -2.784e-01  1.985e-01  -1.402  0.16082    
## salesmarketing         -1.543e-02  1.623e-01  -0.095  0.92428    
## salesproduct_mng       -1.000e-01  1.619e-01  -0.618  0.53671    
## salesRandD             -4.443e-01  1.706e-01  -2.604  0.00921 ** 
## salessales             -1.154e-02  1.266e-01  -0.091  0.92738    
## salessupport            5.343e-02  1.343e-01   0.398  0.69070    
## salestechnical          4.987e-03  1.310e-01   0.038  0.96963    
## salarylow               1.877e+00  1.625e-01  11.552  < 2e-16 ***
## salarymedium            1.403e+00  1.634e-01   8.586  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10813.5  on 11999  degrees of freedom
## Residual deviance:  8924.8  on 11979  degrees of freedom
## AIC: 8966.8
## 
## Number of Fisher Scoring iterations: 16
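The intercept and the two `last_evaluation` levels above have estimates near 17 to 20 in absolute value with standard errors near 186, and the fit needed 16 Fisher scoring iterations. One possible explanation, which the source does not confirm, is that the reference level of `last_evaluation` contains no leavers in the training data (quasi-complete separation). The toy sketch below, on made-up data, shows the same pattern.

```r
# Toy data (hypothetical): a 3-level factor whose first level has no events,
# mimicking a predictor level in which nobody left.
x <- factor(rep(c("1", "2", "3"), each = 20))
y <- c(rep(0, 20),                      # level 1: all zeros
       rep(c(0, 1), times = 10),        # level 2: half ones
       rep(c(0, 0, 0, 1), times = 5))   # level 3: a quarter ones

# glm() may warn about fitted probabilities of 0 or 1; any warning is silenced
# here because producing that situation is the point of the sketch.
fit <- suppressWarnings(glm(y ~ x, family = binomial(link = "logit")))

est <- coef(summary(fit))[, c("Estimate", "Std. Error")]
print(round(est, 2))
print(fit$iter)  # more iterations than a well-behaved logistic fit needs
```

In such a case the large p-values for the affected levels reflect an unstable estimate rather than a lack of association, so a different choice of reference level or of cut points may be worth trying.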

## Morgan-Tatar search since family is non-gaussian.
## Note: factors present with more than 2 levels.

## 
## Call:
## glm(formula = y ~ ., family = family, data = Xi, weights = weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4726  -0.5900  -0.3960  -0.1851   3.0732  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.978e+01  1.862e+02  -0.106  0.91537    
## satisfaction_level2    -1.153e+00  8.760e-02 -13.160  < 2e-16 ***
## satisfaction_level3    -1.769e+00  8.431e-02 -20.980  < 2e-16 ***
## last_evaluation2        1.692e+01  1.862e+02   0.091  0.92760    
## last_evaluation3        1.581e+01  1.862e+02   0.085  0.93231    
## number_project         -1.326e-01  2.678e-02  -4.953 7.33e-07 ***
## average_montly_hours    5.073e-03  6.161e-04   8.234  < 2e-16 ***
## time_spend_company      3.620e-01  2.002e-02  18.083  < 2e-16 ***
## Work_accident1         -1.460e+00  1.104e-01 -13.232  < 2e-16 ***
## promotion_last_5years1 -1.493e+00  3.828e-01  -3.899 9.64e-05 ***
## saleshr                 1.341e-01  1.628e-01   0.824  0.40992    
## salesIT                -1.090e-01  1.493e-01  -0.730  0.46511    
## salesmanagement        -2.784e-01  1.985e-01  -1.402  0.16082    
## salesmarketing         -1.543e-02  1.623e-01  -0.095  0.92428    
## salesproduct_mng       -1.000e-01  1.619e-01  -0.618  0.53671    
## salesRandD             -4.443e-01  1.706e-01  -2.604  0.00921 ** 
## salessales             -1.154e-02  1.266e-01  -0.091  0.92738    
## salessupport            5.343e-02  1.343e-01   0.398  0.69070    
## salestechnical          4.987e-03  1.310e-01   0.038  0.96963    
## salarylow               1.877e+00  1.625e-01  11.552  < 2e-16 ***
## salarymedium            1.403e+00  1.634e-01   8.586  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10813.5  on 11999  degrees of freedom
## Residual deviance:  8924.8  on 11979  degrees of freedom
## AIC: 8966.8
## 
## Number of Fisher Scoring iterations: 16

## [1] 8966.751

## [1] 8966.751

## [1] 0.5268423

##           llh       llhNull            G2      McFadden          r2ML 
## -4462.3754608 -5406.7345064  1888.7180913     0.1746635     0.1456319 
##          r2CU 
##     0.2452185

Luke Bogacz and Zheng Lyu

June 21, 2017