We found this human-resources dataset on Kaggle and use logistic regression to examine which factors affect employee departure. Our S.M.A.R.T. question is “What factors play a role in employee departure, and what are their impacts?”. Note: the dataset is simulated. This document is published to RPubs (http://rpubs.com/lewk/gwu-hrdata-1).
First, we checked that there are no NAs; we then examined the types and normality of the variables.
Read in the HR dataset, reclassify any factor variables missed on ingest, and ensure the dependent binomial variable (left) is of numeric type.
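A minimal sketch of the ingest step (the CSV file name is an assumption based on the usual Kaggle download):

hr <- read.csv("HR_comma_sep.csv", stringsAsFactors = TRUE)  # sales and salary ingest as factors
hr$Work_accident <- as.factor(hr$Work_accident)              # reclassify the binary flags
hr$promotion_last_5years <- as.factor(hr$promotion_last_5years)
hr$left <- as.numeric(as.character(hr$left))                 # ensure the dependent is numeric 0/1 even if it was factored
str(hr)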
A correlation plot was also made to look at the initial correlations between variables. From the plot, we can see that satisfaction_level has the highest correlation with left, though at only -0.35 this is a moderate correlation. The correlations among the independent variables show no multicollinearity in the dataset.
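A sketch of the correlation plot, assuming the corrplot package (factors are coerced to their numeric codes first):

library(corrplot)
hr.num <- data.frame(lapply(hr, as.numeric))  # coerce factors to numeric codes
corrplot(cor(hr.num))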
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ left : Factor w/ 2 levels "1","0": 2 2 2 2 2 2 2 2 2 2 ...
## $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ left : num 0 0 0 0 0 0 0 0 0 0 ...
## $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
Run histograms and describe() on the numeric variables to assess normality. From the plots, we can see that skew is not a serious problem.
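A sketch of this step, assuming the psych package (the result is transposed with t() so that the statistics appear as rows, matching the output below):

library(psych)
nums <- sapply(hr, is.numeric)                        # pick out the numeric columns
for (v in names(hr)[nums]) hist(hr[[v]], main = v, xlab = v)
t(describe(hr[, nums]))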
## satisfaction_level last_evaluation number_project
## vars 1 1 1
## n 14999 14999 14999
## mean 0.6128335 0.7161017 3.803054
## sd 0.2486307 0.1711691 1.232592
## median 0.64 0.72 4
## trimmed 0.6301642 0.716467 3.738272
## mad 0.281694 0.22239 1.4826
## min 0.09 0.36 2
## max 1 1 7
## range 0.91 0.64 5
## skew -0.4762651 -0.02661643 0.3376381
## kurtosis -0.6713455 -1.239262 -0.4960467
## se 0.002030128 0.001397637 0.01006441
## average_montly_hours time_spend_company left
## vars 1 1 1
## n 14999 14999 14999
## mean 201.0503 3.498233 0.7619175
## sd 49.9431 1.460136 0.4259241
## median 200 3 1
## trimmed 200.6359 3.276977 0.8273477
## mad 65.2344 1.4826 0
## min 96 2 0
## max 310 10 1
## range 214 8 1
## skew 0.05283142 1.852948 -1.229797
## kurtosis -1.135252 4.770184 -0.4876329
## se 0.4077973 0.01192236 0.003477772
Create training and testing datasets, and run correlation against the variables (with factors coerced to numeric, as above).
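A sketch of the split (the 12000-row training size matches the str() output below; the seed choice is an assumption):

set.seed(1)                        # assumption: any fixed seed for reproducibility
idx <- sample(nrow(hr), 12000)     # 12000 training rows, 2999 held out for testing
train <- hr[idx, ]
test <- hr[-idx, ]
str(train)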
## 'data.frame': 12000 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ left : num 0 0 0 0 0 0 0 0 0 0 ...
## $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
Create a logistic regression model using all variables to gain an understanding of each variable's interaction with the dependent (left).
Looking at the p-values in the all-variable model, every variable seems to contribute to the model.
NOTES: satisfaction_level outputs a surprisingly large coefficient of about 4; additionally, average_montly_hours and time_spend_company have an almost negligible effect on the dependent.
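In code, the full model is:

m.all <- glm(left ~ ., family = binomial(link = "logit"), data = train)
summary(m.all)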
##
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1555 0.1921 0.3614 0.5658 2.1595
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0505718 0.2521011 8.134 4.16e-16 ***
## satisfaction_level 4.0226715 0.1211739 33.198 < 2e-16 ***
## last_evaluation -0.5962296 0.1800562 -3.311 0.000928 ***
## number_project 0.2886107 0.0258907 11.147 < 2e-16 ***
## average_montly_hours -0.0041173 0.0006216 -6.624 3.51e-11 ***
## time_spend_company -0.3079136 0.0202271 -15.223 < 2e-16 ***
## Work_accident1 1.4429068 0.1109351 13.007 < 2e-16 ***
## promotion_last_5years1 1.4628025 0.3843782 3.806 0.000141 ***
## saleshr -0.1377605 0.1657808 -0.831 0.405985
## salesIT 0.0191346 0.1525542 0.125 0.900184
## salesmanagement 0.2295874 0.2024092 1.134 0.256680
## salesmarketing -0.0405011 0.1662055 -0.244 0.807478
## salesproduct_mng 0.0344749 0.1646693 0.209 0.834168
## salesRandD 0.3495131 0.1733268 2.016 0.043748 *
## salessales -0.0561511 0.1290379 -0.435 0.663452
## salessupport -0.1173290 0.1370692 -0.856 0.392007
## salestechnical -0.0839353 0.1337196 -0.628 0.530203
## salarylow -1.8386550 0.1638898 -11.219 < 2e-16 ***
## salarymedium -1.3844728 0.1647690 -8.403 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10813.5 on 11999 degrees of freedom
## Residual deviance: 8602.1 on 11981 degrees of freedom
## AIC: 8640.1
##
## Number of Fisher Scoring iterations: 6
Use bestglm() with the forward selection method to determine the best possible model. The best model shows that the higher the satisfaction level, the less likely the employee is to leave; employees with a higher number of projects, a past work accident, or a promotion in the last five years also have a higher chance of staying with the company. Meanwhile, employees with a higher last evaluation, more average monthly hours, more years at the company, or a low or medium salary have a higher chance of leaving.
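A sketch of the selection step, assuming the bestglm package, which requires the response as the last column and named y (note in the output below that bestglm falls back to a Morgan-Tatar search for non-gaussian families):

library(bestglm)
Xy <- train[, c(setdiff(names(train), "left"), "left")]  # move the response to the last column
names(Xy)[ncol(Xy)] <- "y"
best.out <- bestglm(Xy, family = binomial, method = "forward", IC = "AIC")
bglm <- best.out$BestModel                               # the fitted glm for the chosen subset
summary(bglm)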
## Morgan-Tatar search since family is non-gaussian.
## Note: factors present with more than 2 levels.
##
## Call:
## glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1821 0.1930 0.3631 0.5658 2.1373
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0359697 0.2242410 9.079 < 2e-16 ***
## satisfaction_level 4.0179362 0.1209866 33.210 < 2e-16 ***
## last_evaluation -0.5948691 0.1797951 -3.309 0.000938 ***
## number_project 0.2878855 0.0258428 11.140 < 2e-16 ***
## average_montly_hours -0.0040972 0.0006208 -6.600 4.12e-11 ***
## time_spend_company -0.3041396 0.0200786 -15.147 < 2e-16 ***
## Work_accident1 1.4469436 0.1108942 13.048 < 2e-16 ***
## promotion_last_5years1 1.5125628 0.3835336 3.944 8.02e-05 ***
## salarylow -1.8709210 0.1631060 -11.471 < 2e-16 ***
## salarymedium -1.4135286 0.1640800 -8.615 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10813.5 on 11999 degrees of freedom
## Residual deviance: 8618.2 on 11990 degrees of freedom
## AIC: 8638.2
##
## Number of Fisher Scoring iterations: 6
Compare the bestglm model (bglm) against the all-variable model (m.all) using roc().
The AUC of the all-variable model is 79.5%, while the AUC of the best model is 79.2%; the AIC decreases from 8640 to 8638.
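A sketch of the comparison, assuming the pROC package and predicted probabilities on the test set:

library(pROC)
best.pred <- predict(bglm, newdata = test, type = "response")
all.pred <- predict(m.all, newdata = test, type = "response")
roc(left ~ best.pred, data = test)  # AUC of the best model
roc(left ~ all.pred, data = test)   # AUC of the all-variable model
AIC(m.all)
AIC(bglm)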
##
## Call:
## roc.formula(formula = left ~ best.pred, data = test)
##
## Data: best.pred in 1571 controls (left 0) < 1428 cases (left 1).
## Area under the curve: 0.792
##
## Call:
## roc.formula(formula = left ~ all.pred, data = test)
##
## Data: all.pred in 1571 controls (left 0) < 1428 cases (left 1).
## Area under the curve: 0.7954
## [1] 8640.053
## [1] 8638.177
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 11999 10813.5
## satisfaction_level 1 1437.83 11998 9375.6
## last_evaluation 1 10.52 11997 9365.1
## number_project 1 57.52 11996 9307.6
## average_montly_hours 1 51.31 11995 9256.3
## time_spend_company 1 160.53 11994 9095.8
## Work_accident 1 221.94 11993 8873.8
## promotion_last_5years 1 34.87 11992 8839.0
## salary 2 220.77 11990 8618.2
## llh llhNull G2 McFadden r2ML
## -4301.0264319 -5406.7345064 2211.4161490 0.2045057 0.1683010
## r2CU
## 0.2833892
## (Intercept) satisfaction_level last_evaluation
## 677.2344315 5485.0112815 -44.9115197
## number_project average_montly_hours time_spend_company
## 33.4572030 -0.4108826 -26.5021204
## Work_accident1 promotion_last_5years1 saleshr
## 323.2982337 331.8043690 -12.8692698
## salesIT salesmanagement salesmarketing
## 1.9318868 25.8080779 -3.9691851
## salesproduct_mng salesRandD salessales
## 3.5076078 41.8376802 -5.4603728
## salessupport salestechnical salarylow
## -11.0707401 -8.0509221 -84.0968826
## salarymedium
## -74.9544193
## (Intercept) satisfaction_level last_evaluation
## 665.9675825 5458.6270996 -44.8365213
## number_project average_montly_hours time_spend_company
## 33.3604611 -0.4088828 -26.2242161
## Work_accident1 promotion_last_5years1 salarylow
## 325.0104710 353.8346722 -84.6018217
## salarymedium
## -75.6716690
## [1] 0.4504835
## [1] 0.4431477
Transform satisfaction_level and last_evaluation into factor variables, as each can be represented by a 3-level factor (1, 2, 3).
This time, the best regression shows that employees with a higher satisfaction level, more projects, a past work accident, or a promotion in the last five years have a higher chance of staying with the company. However, employees with more average monthly hours, more time spent at the company, or a low or medium salary have a higher chance of leaving. The result is almost the same as the former best regression.
Doing so actually raised the AIC of the models, including the newly run bestglm() model, which indicates we lost some predictive power: the AUC went down to 77% and the AIC increased to 8966 on this transformed dataframe.
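A sketch of the transformation; equal-width bins via cut() are an assumption (quantile-based cuts would be equally plausible):

trans <- hr
trans$satisfaction_level <- cut(trans$satisfaction_level, breaks = 3, labels = c("1", "2", "3"))
trans$last_evaluation <- cut(trans$last_evaluation, breaks = 3, labels = c("1", "2", "3"))
trans.train <- trans[idx, ]   # reuse the earlier train/test split
trans.test <- trans[-idx, ]
m.trans <- glm(left ~ ., family = binomial(link = "logit"), data = trans.train)
summary(m.trans)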
##
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = trans.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0732 0.1851 0.3960 0.5900 2.4726
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.978e+01 1.862e+02 0.106 0.91537
## satisfaction_level2 1.153e+00 8.760e-02 13.160 < 2e-16 ***
## satisfaction_level3 1.769e+00 8.431e-02 20.980 < 2e-16 ***
## last_evaluation2 -1.692e+01 1.862e+02 -0.091 0.92760
## last_evaluation3 -1.581e+01 1.862e+02 -0.085 0.93231
## number_project 1.326e-01 2.678e-02 4.953 7.33e-07 ***
## average_montly_hours -5.073e-03 6.161e-04 -8.234 < 2e-16 ***
## time_spend_company -3.620e-01 2.002e-02 -18.083 < 2e-16 ***
## Work_accident1 1.460e+00 1.104e-01 13.232 < 2e-16 ***
## promotion_last_5years1 1.493e+00 3.828e-01 3.899 9.64e-05 ***
## saleshr -1.341e-01 1.628e-01 -0.824 0.40992
## salesIT 1.090e-01 1.493e-01 0.730 0.46511
## salesmanagement 2.784e-01 1.985e-01 1.402 0.16082
## salesmarketing 1.543e-02 1.623e-01 0.095 0.92428
## salesproduct_mng 1.000e-01 1.619e-01 0.618 0.53671
## salesRandD 4.443e-01 1.706e-01 2.604 0.00921 **
## salessales 1.154e-02 1.266e-01 0.091 0.92738
## salessupport -5.343e-02 1.343e-01 -0.398 0.69070
## salestechnical -4.987e-03 1.310e-01 -0.038 0.96963
## salarylow -1.877e+00 1.625e-01 -11.552 < 2e-16 ***
## salarymedium -1.403e+00 1.634e-01 -8.586 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10813.5 on 11999 degrees of freedom
## Residual deviance: 8924.8 on 11979 degrees of freedom
## AIC: 8966.8
##
## Number of Fisher Scoring iterations: 16
## Morgan-Tatar search since family is non-gaussian.
## Note: factors present with more than 2 levels.
##
## Call:
## glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0732 0.1851 0.3960 0.5900 2.4726
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.978e+01 1.862e+02 0.106 0.91537
## satisfaction_level2 1.153e+00 8.760e-02 13.160 < 2e-16 ***
## satisfaction_level3 1.769e+00 8.431e-02 20.980 < 2e-16 ***
## last_evaluation2 -1.692e+01 1.862e+02 -0.091 0.92760
## last_evaluation3 -1.581e+01 1.862e+02 -0.085 0.93231
## number_project 1.326e-01 2.678e-02 4.953 7.33e-07 ***
## average_montly_hours -5.073e-03 6.161e-04 -8.234 < 2e-16 ***
## time_spend_company -3.620e-01 2.002e-02 -18.083 < 2e-16 ***
## Work_accident1 1.460e+00 1.104e-01 13.232 < 2e-16 ***
## promotion_last_5years1 1.493e+00 3.828e-01 3.899 9.64e-05 ***
## saleshr -1.341e-01 1.628e-01 -0.824 0.40992
## salesIT 1.090e-01 1.493e-01 0.730 0.46511
## salesmanagement 2.784e-01 1.985e-01 1.402 0.16082
## salesmarketing 1.543e-02 1.623e-01 0.095 0.92428
## salesproduct_mng 1.000e-01 1.619e-01 0.618 0.53671
## salesRandD 4.443e-01 1.706e-01 2.604 0.00921 **
## salessales 1.154e-02 1.266e-01 0.091 0.92738
## salessupport -5.343e-02 1.343e-01 -0.398 0.69070
## salestechnical -4.987e-03 1.310e-01 -0.038 0.96963
## salarylow -1.877e+00 1.625e-01 -11.552 < 2e-16 ***
## salarymedium -1.403e+00 1.634e-01 -8.586 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10813.5 on 11999 degrees of freedom
## Residual deviance: 8924.8 on 11979 degrees of freedom
## AIC: 8966.8
##
## Number of Fisher Scoring iterations: 16
## [1] 8966.751
## [1] 8966.751
##
## Call:
## roc.formula(formula = left ~ best.trans.pred, data = trans.test)
##
## Data: best.trans.pred in 1571 controls (left 0) < 1428 cases (left 1).
## Area under the curve: 0.7738
## [1] 0.5268423
## llh llhNull G2 McFadden r2ML
## -4462.3754608 -5406.7345064 1888.7180913 0.1746635 0.1456319
## r2CU
## 0.2452185
So, we still think the former best model is better, and we draw the following conclusions:
1. Staff are about 500% more likely to quit if they have a low salary, and about 200% more likely if they have a medium salary; salary is therefore an important factor in people's decision to leave.
2. Staff are about 80% more likely to quit if they have a good last evaluation.
3. Staff are far less likely to quit when they are satisfied.
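The percentage figures come from exponentiating the logit coefficients, as in the output above; for example:

(exp(coef(bglm)) - 1) * 100  # percent change in the odds for a one-unit increase in each predictor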
Since salary is significant to the decision to leave, we believe the prediction would improve if we had numeric salary values. Investigating the evaluations could also help, as they appear to play a part in people leaving.
We also looked into other models on Kaggle and found similar results there.