Stela Mucllari, Shahla Khan
December 5, 2019
We present the results of an experiment based on our dataset comparing the prediction performance of 3 models - Logistic Regression, Classification Tree and Random Forest. Our dataset comprises of employee information from different departments of a company and we are predicting if an employee from the dataset has left the company or continues to work.
Results - Random Forest is the best model.
2.The predictions of Decision Tree is more accurate when compared to Logistic Regression but Random Forest is the most accurate.
Observations
HR_comma_sep_lr <- HR_comma_sep_random %>%
mutate(sales = as.factor(sales),
salary = as.factor(salary))
HR_lr <- HR_comma_sep_lr%>%
mutate(Salary = ifelse(salary == 'low' , 1 ,
ifelse(salary == 'medium' , 2 , 3)) ,
Department = sales) %>%
select(-sales , -salary)
summary(HR_lr)## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years random Salary
## Min. :0.00000 Min. :0.0000102 Min. :1.000
## 1st Qu.:0.00000 1st Qu.:0.2484315 1st Qu.:1.000
## Median :0.00000 Median :0.5016901 Median :2.000
## Mean :0.02127 Mean :0.5005216 Mean :1.595
## 3rd Qu.:0.00000 3rd Qu.:0.7520898 3rd Qu.:2.000
## Max. :1.00000 Max. :0.9998694 Max. :3.000
##
## Department
## sales :4140
## technical :2720
## support :2229
## IT :1227
## product_mng: 902
## marketing : 858
## (Other) :2923
## Training Data
trainlr <- HR_lr %>%
filter(random < .7) %>%
select(-random)
## Validation Data
vallr <- HR_lr %>%
filter(random >= .7) %>%
select(-random)3.Developing an initial Logistic Regression model
Modellr <- glm(left~ . , data = trainlr ,
family = binomial(link = 'logit'))
summary(Modellr)##
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = trainlr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1681 -0.6565 -0.4034 -0.1322 3.1260
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0786643 0.1923993 5.606 2.07e-08 ***
## satisfaction_level -4.1095998 0.1172338 -35.055 < 2e-16 ***
## last_evaluation 0.8094812 0.1793053 4.515 6.35e-06 ***
## number_project -0.2938758 0.0254624 -11.542 < 2e-16 ***
## average_montly_hours 0.0045784 0.0006173 7.416 1.20e-13 ***
## time_spend_company 0.2737543 0.0184879 14.807 < 2e-16 ***
## Work_accident -1.4921655 0.1059261 -14.087 < 2e-16 ***
## promotion_last_5years -1.5294833 0.3218676 -4.752 2.02e-06 ***
## Salary -0.7250425 0.0463260 -15.651 < 2e-16 ***
## Departmenthr 0.1480178 0.1602499 0.924 0.35566
## DepartmentIT -0.0862304 0.1458337 -0.591 0.55432
## Departmentmanagement -0.6211896 0.1953026 -3.181 0.00147 **
## Departmentmarketing -0.0970064 0.1583028 -0.613 0.54002
## Departmentproduct_mng -0.2073260 0.1572405 -1.319 0.18733
## DepartmentRandD -0.6024246 0.1735907 -3.470 0.00052 ***
## Departmentsales -0.0938085 0.1232529 -0.761 0.44659
## Departmentsupport 0.0339621 0.1311244 0.259 0.79563
## Departmenttechnical 0.0948718 0.1280236 0.741 0.45866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11403.1 on 10418 degrees of freedom
## Residual deviance: 8884.8 on 10401 degrees of freedom
## AIC: 8920.8
##
## Number of Fisher Scoring iterations: 5
anova(Modellr , test = "Chisq")## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: left
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 10418 11403.1
## satisfaction_level 1 1577.13 10417 9826.0 < 2.2e-16 ***
## last_evaluation 1 21.83 10416 9804.1 2.976e-06 ***
## number_project 1 58.69 10415 9745.5 1.848e-14 ***
## average_montly_hours 1 67.85 10414 9677.6 < 2.2e-16 ***
## time_spend_company 1 137.38 10413 9540.2 < 2.2e-16 ***
## Work_accident 1 265.05 10412 9275.2 < 2.2e-16 ***
## promotion_last_5years 1 59.33 10411 9215.8 1.334e-14 ***
## Salary 1 285.27 10410 8930.6 < 2.2e-16 ***
## Department 9 45.76 10401 8884.8 6.669e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We have identified all variables as significant except Department which is less significant in comparision to the other variables.
Last evaluation, Average monthly hours and time spent in the company are directly proportional to the variable left. For example higher the number of projects assigned higher are the chances of employees staying in the company.
Model2lr <- glm(left~ satisfaction_level + last_evaluation + number_project + average_montly_hours + time_spend_company + Work_accident + promotion_last_5years + Salary, data = trainlr ,
family = binomial(link = 'logit'))
summary(Model2lr)##
## Call:
## glm(formula = left ~ satisfaction_level + last_evaluation + number_project +
## average_montly_hours + time_spend_company + Work_accident +
## promotion_last_5years + Salary, family = binomial(link = "logit"),
## data = trainlr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1501 -0.6552 -0.4088 -0.1353 3.1617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0670148 0.1565432 6.816 9.35e-12 ***
## satisfaction_level -4.0997713 0.1167253 -35.123 < 2e-16 ***
## last_evaluation 0.7998351 0.1786764 4.476 7.59e-06 ***
## number_project -0.2916011 0.0253404 -11.507 < 2e-16 ***
## average_montly_hours 0.0045708 0.0006148 7.434 1.05e-13 ***
## time_spend_company 0.2644625 0.0182117 14.522 < 2e-16 ***
## Work_accident -1.4998533 0.1057153 -14.188 < 2e-16 ***
## promotion_last_5years -1.6416480 0.3191967 -5.143 2.70e-07 ***
## Salary -0.7425782 0.0458780 -16.186 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11403.1 on 10418 degrees of freedom
## Residual deviance: 8930.6 on 10410 degrees of freedom
## AIC: 8948.6
##
## Number of Fisher Scoring iterations: 5
anova(Model2lr , test = "Chisq")## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: left
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 10418 11403.1
## satisfaction_level 1 1577.13 10417 9826.0 < 2.2e-16 ***
## last_evaluation 1 21.83 10416 9804.1 2.976e-06 ***
## number_project 1 58.69 10415 9745.5 1.848e-14 ***
## average_montly_hours 1 67.85 10414 9677.6 < 2.2e-16 ***
## time_spend_company 1 137.38 10413 9540.2 < 2.2e-16 ***
## Work_accident 1 265.05 10412 9275.2 < 2.2e-16 ***
## promotion_last_5years 1 59.33 10411 9215.8 1.334e-14 ***
## Salary 1 285.27 10410 8930.6 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vallr$logit <- predict.glm(Model2lr , vallr)
vallr <- vallr %>%
mutate(predicted_left = 1/(1+exp(-logit)))
summary(vallr$predicted_left)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001977 0.077629 0.172063 0.241537 0.350740 0.900875
num_left_vallr <- sum(vallr$left)
sum(vallr$left)## [1] 1105
cutoff <- vallr %>%
arrange(desc(predicted_left)) %>%
filter(row_number() == num_left_vallr)
cutoff <- as.numeric(cutoff[ , 'predicted_left'])
cutoff## [1] 0.3671782
vallr <- vallr %>%
mutate(predicted_left_1 = ifelse(predicted_left >= cutoff ,
1 , 0))
sum(vallr$predicted_left_1)## [1] 1105
CrossTable(vallr$left, vallr$predicted_left_1)##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4580
##
##
## | vallr$predicted_left_1
## vallr$left | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 2970 | 505 | 3475 |
## | 42.159 | 132.581 | |
## | 0.855 | 0.145 | 0.759 |
## | 0.855 | 0.457 | |
## | 0.648 | 0.110 | |
## -------------|-----------|-----------|-----------|
## 1 | 505 | 600 | 1105 |
## | 132.581 | 416.940 | |
## | 0.457 | 0.543 | 0.241 |
## | 0.145 | 0.543 | |
## | 0.110 | 0.131 | |
## -------------|-----------|-----------|-----------|
## Column Total | 3475 | 1105 | 4580 |
## | 0.759 | 0.241 | |
## -------------|-----------|-----------|-----------|
##
##
vallr <- vallr %>%
mutate(predicted_left_1 = ifelse(predicted_left >= 0.75 ,
1 , 0))
CrossTable(vallr$left , vallr$predicted_left_1)##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4580
##
##
## | vallr$predicted_left_1
## vallr$left | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 3422 | 53 | 3475 |
## | 0.680 | 22.748 | |
## | 0.985 | 0.015 | 0.759 |
## | 0.770 | 0.398 | |
## | 0.747 | 0.012 | |
## -------------|-----------|-----------|-----------|
## 1 | 1025 | 80 | 1105 |
## | 2.140 | 71.537 | |
## | 0.928 | 0.072 | 0.241 |
## | 0.230 | 0.602 | |
## | 0.224 | 0.017 | |
## -------------|-----------|-----------|-----------|
## Column Total | 4447 | 133 | 4580 |
## | 0.971 | 0.029 | |
## -------------|-----------|-----------|-----------|
##
##
Coef <- vallr %>%
mutate(logit_confirmation = 1.0670148 + satisfaction_level*(-4.0997713) +
last_evaluation*0.7998351 + number_project*(-0.2916011) + average_montly_hours*(0.0045708) +
time_spend_company*(0.2644625) + Work_accident* (-1.49985333) + promotion_last_5years*(-1.6416480) +
Salary*(-0.7425782),
predicted_left = 1/(1+exp(-logit)))
Coef## # A tibble: 4,580 x 14
## satisfaction_le~ last_evaluation number_project average_montly_~
## <dbl> <dbl> <dbl> <dbl>
## 1 0.89 1 5 224
## 2 0.45 0.54 2 135
## 3 0.84 0.92 4 234
## 4 0.41 0.55 2 148
## 5 0.45 0.51 2 160
## 6 0.76 0.89 5 262
## 7 0.11 0.83 6 282
## 8 0.89 0.92 5 242
## 9 0.82 0.87 4 239
## 10 0.45 0.570 2 134
## # ... with 4,570 more rows, and 10 more variables:
## # time_spend_company <dbl>, Work_accident <dbl>, left <dbl>,
## # promotion_last_5years <dbl>, Salary <dbl>, Department <fct>,
## # logit <dbl>, predicted_left <dbl>, predicted_left_1 <dbl>,
## # logit_confirmation <dbl>
Observations
HR_comma_sep_ct <- HR_comma_sep_random %>%
mutate(sales = as.factor(sales),
salary = as.factor(salary))
HR_ct <- HR_comma_sep_ct%>%
mutate(Salary = ifelse(salary == 'low' , 1 ,
ifelse(salary == 'medium' , 2 , 3)) ,
Department = sales) %>%
select(-sales , -salary)
summary(HR_ct)## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years random Salary
## Min. :0.00000 Min. :0.0000102 Min. :1.000
## 1st Qu.:0.00000 1st Qu.:0.2484315 1st Qu.:1.000
## Median :0.00000 Median :0.5016901 Median :2.000
## Mean :0.02127 Mean :0.5005216 Mean :1.595
## 3rd Qu.:0.00000 3rd Qu.:0.7520898 3rd Qu.:2.000
## Max. :1.00000 Max. :0.9998694 Max. :3.000
##
## Department
## sales :4140
## technical :2720
## support :2229
## IT :1227
## product_mng: 902
## marketing : 858
## (Other) :2923
## Training Data
trainct <- HR_ct %>%
filter(random < .7) %>%
select(-random)
## Validation Data
valct <- HR_ct %>%
filter(random >= .7) %>%
select(-random)3.Creating a Classification Tree
ct <- rpart(left ~ . , data = trainct, method = 'class', cp = 0.02)
rpart.plot(ct)var_importance <- data.frame(ct$variable.importance)
ct$variable.importance## satisfaction_level average_montly_hours number_project
## 2057.31378 1039.93673 1029.31457
## last_evaluation time_spend_company Work_accident
## 977.81765 753.47089 26.88815
## Department
## 10.77219
valct$left_predicted <- predict(ct, valct, type = 'class')
summary(valct)## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5500 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6300 Median :0.7100 Median :4.000 Median :199.0
## Mean :0.6102 Mean :0.7151 Mean :3.791 Mean :200.8
## 3rd Qu.:0.8100 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.486 Mean :0.1393 Mean :0.2413
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years Salary Department left_predicted
## Min. :0.00000 Min. :1.000 sales :1263 0:3499
## 1st Qu.:0.00000 1st Qu.:1.000 technical : 862 1:1081
## Median :0.00000 Median :2.000 support : 649
## Mean :0.01812 Mean :1.587 IT : 368
## 3rd Qu.:0.00000 3rd Qu.:2.000 product_mng: 277
## Max. :1.00000 Max. :3.000 marketing : 250
## (Other) : 911
CrossTable(valct$left , valct$left_predicted)##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4580
##
##
## | valct$left_predicted
## valct$left | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 3403 | 72 | 3475 |
## | 210.859 | 682.512 | |
## | 0.979 | 0.021 | 0.759 |
## | 0.973 | 0.067 | |
## | 0.743 | 0.016 | |
## -------------|-----------|-----------|-----------|
## 1 | 96 | 1009 | 1105 |
## | 663.108 | 2146.360 | |
## | 0.087 | 0.913 | 0.241 |
## | 0.027 | 0.933 | |
## | 0.021 | 0.220 | |
## -------------|-----------|-----------|-----------|
## Column Total | 3499 | 1081 | 4580 |
## | 0.764 | 0.236 | |
## -------------|-----------|-----------|-----------|
##
##
Observations
HR_comma_sep_rf <- HR_comma_sep_random %>%
mutate(sales = as.factor(sales),
salary = as.factor(salary),
Work_accident = as.factor(Work_accident),
promotion_last_5years = as.factor(promotion_last_5years),
left = as.factor(left))
HR_rf <- HR_comma_sep_rf%>%
mutate(Salary = ifelse(salary == 'low' , 1 ,
ifelse(salary == 'medium' , 2 , 3)) ,
Department = sales) %>%
select(-sales , -salary)
summary(HR_rf)## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left promotion_last_5years
## Min. : 2.000 0:12830 0:11428 0:14680
## 1st Qu.: 3.000 1: 2169 1: 3571 1: 319
## Median : 3.000
## Mean : 3.498
## 3rd Qu.: 4.000
## Max. :10.000
##
## random Salary Department
## Min. :0.0000102 Min. :1.000 sales :4140
## 1st Qu.:0.2484315 1st Qu.:1.000 technical :2720
## Median :0.5016901 Median :2.000 support :2229
## Mean :0.5005216 Mean :1.595 IT :1227
## 3rd Qu.:0.7520898 3rd Qu.:2.000 product_mng: 902
## Max. :0.9998694 Max. :3.000 marketing : 858
## (Other) :2923
## Training Data
train_rf <- HR_rf %>%
filter(random < .7) %>%
select(-random)
## Validation Data
validation_rf <- HR_rf %>%
filter(random >= .7) %>%
select(-random)rf <- randomForest(left ~. ,
type = 'classification' ,
data = train_rf ,
importance = T)rf$importance## 0 1 MeanDecreaseAccuracy
## satisfaction_level 0.0540639636 0.602137565 0.1841024316
## last_evaluation 0.0034459274 0.435523657 0.1059847505
## number_project 0.0149989679 0.452986066 0.1189328334
## average_montly_hours 0.0175139357 0.416289340 0.1121669158
## time_spend_company 0.0116851865 0.353231444 0.0927197055
## Work_accident 0.0003320547 0.004589783 0.0013390412
## promotion_last_5years 0.0001060105 0.001519397 0.0004407257
## Salary 0.0014637000 0.012173810 0.0040056784
## Department 0.0015717265 0.018329947 0.0055469067
## MeanDecreaseGini
## satisfaction_level 1264.131721
## last_evaluation 454.096527
## number_project 665.676154
## average_montly_hours 568.313110
## time_spend_company 672.712577
## Work_accident 18.886370
## promotion_last_5years 3.731783
## Salary 31.881837
## Department 61.198830
validation_rf$predicted_left <- predict(rf , validation_rf)
summary(validation_rf$left)## 0 1
## 3475 1105
CrossTable(validation_rf$left , validation_rf$predicted_left)##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4580
##
##
## | validation_rf$predicted_left
## validation_rf$left | 0 | 1 | Row Total |
## -------------------|-----------|-----------|-----------|
## 0 | 3465 | 10 | 3475 |
## | 247.787 | 801.072 | |
## | 0.997 | 0.003 | 0.759 |
## | 0.991 | 0.009 | |
## | 0.757 | 0.002 | |
## -------------------|-----------|-----------|-----------|
## 1 | 33 | 1072 | 1105 |
## | 779.240 | 2519.207 | |
## | 0.030 | 0.970 | 0.241 |
## | 0.009 | 0.991 | |
## | 0.007 | 0.234 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 3498 | 1082 | 4580 |
## | 0.764 | 0.236 | |
## -------------------|-----------|-----------|-----------|
##
##
comp <- matrix(c(1105,"600",1105,"1009",1105,"1072"),ncol=2,byrow=TRUE)
colnames(comp) <- c("Actuals-left","Predicted-left")
rownames(comp) <- c("Logistic Regression" ,"CART","Random Forest")
comp <- as.table(comp)
comp## Actuals-left Predicted-left
## Logistic Regression 1105 600
## CART 1105 1009
## Random Forest 1105 1072
compa <- matrix(c(3475,2970,3475,3403,3475,3464),ncol=2,byrow=TRUE)
colnames(compa) <- c("Actuals-notleft","Predicted-notleft")
rownames(compa) <- c("Logistic Regression" ,"CART","Random Forest")
compa <- as.table(compa)
compa## Actuals-notleft Predicted-notleft
## Logistic Regression 3475 2970
## CART 3475 3403
## Random Forest 3475 3464