Final Project

Clive Baker, Dawn Thompson, Hewan Soltau, Kaushika Priya Kankanala and Sanjana Marasandra Rangaraj

April 30, 2019

  1. Based on our assumption, we observed that Random Forest predicts employees who left and who didn’t leave with more accuracy when compared to the accuracy of Logistic Regression and Decision Tree.
  2. We also observed that the predictions of Decision Tree is more accurate when compared to Logistic Regression but Random Forest is the most accurate.

LOGISTIC REGRESSION

##                      Accurate prediction Wrong Prediction
## Employees who left   0.855               0.145           
## Employees who stayed 0.543               0.457

Observations

  1. Initial data preparations.
HR_comma_sep_lr <- HR_comma_sep_random %>%
  mutate(sales = as.factor(sales),
         salary = as.factor(salary))

HR_lr <- HR_comma_sep_lr%>% 
  mutate(Salary = ifelse(salary == 'low' , 1 ,
         ifelse(salary == 'medium' , 2 , 3)) ,
         Department = sales) %>%
  select(-sales , -salary)


summary(HR_lr)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years     random              Salary     
##  Min.   :0.00000       Min.   :0.0000102   Min.   :1.000  
##  1st Qu.:0.00000       1st Qu.:0.2484315   1st Qu.:1.000  
##  Median :0.00000       Median :0.5016901   Median :2.000  
##  Mean   :0.02127       Mean   :0.5005216   Mean   :1.595  
##  3rd Qu.:0.00000       3rd Qu.:0.7520898   3rd Qu.:2.000  
##  Max.   :1.00000       Max.   :0.9998694   Max.   :3.000  
##                                                           
##        Department  
##  sales      :4140  
##  technical  :2720  
##  support    :2229  
##  IT         :1227  
##  product_mng: 902  
##  marketing  : 858  
##  (Other)    :2923
  1. Spliting the data into training and validation.
## Training Data

trainlr <- HR_lr %>% 
  filter(random < .7) %>% 
  select(-random)

## Validation Data

vallr <- HR_lr %>% 
  filter(random >= .7) %>% 
  select(-random)

3.Developing an initial Logistic Regression model

Mdllr <- glm(left~ . , data = trainlr ,
             family = binomial(link = 'logit'))

summary(Mdllr)
## 
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = trainlr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1681  -0.6565  -0.4034  -0.1322   3.1260  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            1.0786643  0.1923993   5.606 2.07e-08 ***
## satisfaction_level    -4.1095998  0.1172338 -35.055  < 2e-16 ***
## last_evaluation        0.8094812  0.1793053   4.515 6.35e-06 ***
## number_project        -0.2938758  0.0254624 -11.542  < 2e-16 ***
## average_montly_hours   0.0045784  0.0006173   7.416 1.20e-13 ***
## time_spend_company     0.2737543  0.0184879  14.807  < 2e-16 ***
## Work_accident         -1.4921655  0.1059261 -14.087  < 2e-16 ***
## promotion_last_5years -1.5294833  0.3218676  -4.752 2.02e-06 ***
## Salary                -0.7250425  0.0463260 -15.651  < 2e-16 ***
## Departmenthr           0.1480178  0.1602499   0.924  0.35566    
## DepartmentIT          -0.0862304  0.1458337  -0.591  0.55432    
## Departmentmanagement  -0.6211896  0.1953026  -3.181  0.00147 ** 
## Departmentmarketing   -0.0970064  0.1583028  -0.613  0.54002    
## Departmentproduct_mng -0.2073260  0.1572405  -1.319  0.18733    
## DepartmentRandD       -0.6024246  0.1735907  -3.470  0.00052 ***
## Departmentsales       -0.0938085  0.1232529  -0.761  0.44659    
## Departmentsupport      0.0339621  0.1311244   0.259  0.79563    
## Departmenttechnical    0.0948718  0.1280236   0.741  0.45866    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11403.1  on 10418  degrees of freedom
## Residual deviance:  8884.8  on 10401  degrees of freedom
## AIC: 8920.8
## 
## Number of Fisher Scoring iterations: 5
anova(Mdllr , test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: left
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                  10418    11403.1              
## satisfaction_level     1  1577.13     10417     9826.0 < 2.2e-16 ***
## last_evaluation        1    21.83     10416     9804.1 2.976e-06 ***
## number_project         1    58.69     10415     9745.5 1.848e-14 ***
## average_montly_hours   1    67.85     10414     9677.6 < 2.2e-16 ***
## time_spend_company     1   137.38     10413     9540.2 < 2.2e-16 ***
## Work_accident          1   265.05     10412     9275.2 < 2.2e-16 ***
## promotion_last_5years  1    59.33     10411     9215.8 1.334e-14 ***
## Salary                 1   285.27     10410     8930.6 < 2.2e-16 ***
## Department             9    45.76     10401     8884.8 6.669e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Developing a model with significant variables
Mdl2lr <- glm(left~ satisfaction_level + last_evaluation + number_project + average_montly_hours + time_spend_company + Work_accident + promotion_last_5years + Salary, data = trainlr ,
              family = binomial(link = 'logit'))

summary(Mdl2lr)
## 
## Call:
## glm(formula = left ~ satisfaction_level + last_evaluation + number_project + 
##     average_montly_hours + time_spend_company + Work_accident + 
##     promotion_last_5years + Salary, family = binomial(link = "logit"), 
##     data = trainlr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1501  -0.6552  -0.4088  -0.1353   3.1617  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            1.0670148  0.1565432   6.816 9.35e-12 ***
## satisfaction_level    -4.0997713  0.1167253 -35.123  < 2e-16 ***
## last_evaluation        0.7998351  0.1786764   4.476 7.59e-06 ***
## number_project        -0.2916011  0.0253404 -11.507  < 2e-16 ***
## average_montly_hours   0.0045708  0.0006148   7.434 1.05e-13 ***
## time_spend_company     0.2644625  0.0182117  14.522  < 2e-16 ***
## Work_accident         -1.4998533  0.1057153 -14.188  < 2e-16 ***
## promotion_last_5years -1.6416480  0.3191967  -5.143 2.70e-07 ***
## Salary                -0.7425782  0.0458780 -16.186  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11403.1  on 10418  degrees of freedom
## Residual deviance:  8930.6  on 10410  degrees of freedom
## AIC: 8948.6
## 
## Number of Fisher Scoring iterations: 5
anova(Mdl2lr , test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: left
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                  10418    11403.1              
## satisfaction_level     1  1577.13     10417     9826.0 < 2.2e-16 ***
## last_evaluation        1    21.83     10416     9804.1 2.976e-06 ***
## number_project         1    58.69     10415     9745.5 1.848e-14 ***
## average_montly_hours   1    67.85     10414     9677.6 < 2.2e-16 ***
## time_spend_company     1   137.38     10413     9540.2 < 2.2e-16 ***
## Work_accident          1   265.05     10412     9275.2 < 2.2e-16 ***
## promotion_last_5years  1    59.33     10411     9215.8 1.334e-14 ***
## Salary                 1   285.27     10410     8930.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Measuring the accuracy of the model
vallr$logit <- predict.glm(Mdl2lr , vallr)

vallr <- vallr %>% 
  mutate(predicted_left = 1/(1+exp(-logit)))


summary(vallr$predicted_left)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001977 0.077629 0.172063 0.241537 0.350740 0.900875
  1. Calculating number of left employee in vallr
num_left_vallr <- sum(vallr$left)
sum(vallr$left)
## [1] 1105
cutoff <- vallr %>% 
  arrange(desc(predicted_left)) %>% 
  filter(row_number() == num_left_vallr)

cutoff <- as.numeric(cutoff[ , 'predicted_left'])

cutoff
## [1] 0.3671782
  1. Applying cutoff
vallr <- vallr %>% 
  mutate(predicted_left_1 = ifelse(predicted_left >= cutoff ,
                                   1 , 0))

sum(vallr$predicted_left_1)
## [1] 1105
CrossTable(vallr$left, vallr$predicted_left_1)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4580 
## 
##  
##              | vallr$predicted_left_1 
##   vallr$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      2970 |       505 |      3475 | 
##              |    42.159 |   132.581 |           | 
##              |     0.855 |     0.145 |     0.759 | 
##              |     0.855 |     0.457 |           | 
##              |     0.648 |     0.110 |           | 
## -------------|-----------|-----------|-----------|
##            1 |       505 |       600 |      1105 | 
##              |   132.581 |   416.940 |           | 
##              |     0.457 |     0.543 |     0.241 | 
##              |     0.145 |     0.543 |           | 
##              |     0.110 |     0.131 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3475 |      1105 |      4580 | 
##              |     0.759 |     0.241 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
vallr <- vallr %>% 
  mutate(predicted_left_1 = ifelse(predicted_left >= 0.75 ,
                                   1 , 0))

CrossTable(vallr$left , vallr$predicted_left_1)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4580 
## 
##  
##              | vallr$predicted_left_1 
##   vallr$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      3422 |        53 |      3475 | 
##              |     0.680 |    22.748 |           | 
##              |     0.985 |     0.015 |     0.759 | 
##              |     0.770 |     0.398 |           | 
##              |     0.747 |     0.012 |           | 
## -------------|-----------|-----------|-----------|
##            1 |      1025 |        80 |      1105 | 
##              |     2.140 |    71.537 |           | 
##              |     0.928 |     0.072 |     0.241 | 
##              |     0.230 |     0.602 |           | 
##              |     0.224 |     0.017 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      4447 |       133 |      4580 | 
##              |     0.971 |     0.029 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
  1. Equation
Coef <- vallr %>% 
  mutate(logit_confirmation =  1.0670148 + satisfaction_level*(-4.0997713) + 
           last_evaluation*0.7998351 + number_project*(-0.2916011)  + average_montly_hours*(0.0045708) +
           time_spend_company*(0.2644625) + Work_accident* (-1.49985333) + promotion_last_5years*(-1.6416480) +
           Salary*(-0.7425782), 
         predicted_left = 1/(1+exp(-logit)))
Coef
## # A tibble: 4,580 x 14
##    satisfaction_le~ last_evaluation number_project average_montly_~
##               <dbl>           <dbl>          <dbl>            <dbl>
##  1             0.89           1                  5              224
##  2             0.45           0.54               2              135
##  3             0.84           0.92               4              234
##  4             0.41           0.55               2              148
##  5             0.45           0.51               2              160
##  6             0.76           0.89               5              262
##  7             0.11           0.83               6              282
##  8             0.89           0.92               5              242
##  9             0.82           0.87               4              239
## 10             0.45           0.570              2              134
## # ... with 4,570 more rows, and 10 more variables:
## #   time_spend_company <dbl>, Work_accident <dbl>, left <dbl>,
## #   promotion_last_5years <dbl>, Salary <dbl>, Department <fct>,
## #   logit <dbl>, predicted_left <dbl>, predicted_left_1 <dbl>,
## #   logit_confirmation <dbl>

CLASSIFICATION TREE

##                      Accurate prediction Wrong Prediction
## Employees who left   0.913               0.087           
## Employees who stayed 0.979               0.021

Observations

  1. Initial data preparations.
HR_comma_sep_ct <- HR_comma_sep_random %>%
  mutate(sales = as.factor(sales),
         salary = as.factor(salary))

HR_ct <- HR_comma_sep_ct%>% 
  mutate(Salary = ifelse(salary == 'low' , 1 ,
                         ifelse(salary == 'medium' , 2 , 3)) ,
         Department = sales) %>%
  select(-sales , -salary)


summary(HR_ct)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years     random              Salary     
##  Min.   :0.00000       Min.   :0.0000102   Min.   :1.000  
##  1st Qu.:0.00000       1st Qu.:0.2484315   1st Qu.:1.000  
##  Median :0.00000       Median :0.5016901   Median :2.000  
##  Mean   :0.02127       Mean   :0.5005216   Mean   :1.595  
##  3rd Qu.:0.00000       3rd Qu.:0.7520898   3rd Qu.:2.000  
##  Max.   :1.00000       Max.   :0.9998694   Max.   :3.000  
##                                                           
##        Department  
##  sales      :4140  
##  technical  :2720  
##  support    :2229  
##  IT         :1227  
##  product_mng: 902  
##  marketing  : 858  
##  (Other)    :2923
  1. Spliting the data into training and validation.
## Training Data

trainct <- HR_ct %>% 
  filter(random < .7) %>% 
  select(-random)

## Validation Data

valct <- HR_ct %>% 
  filter(random >= .7) %>% 
  select(-random)

3.Creating a Classification Tree

ct <- rpart(left ~ . , data = trainct, method = 'class', cp = 0.02)
rpart.plot(ct)

  1. Identifying the most important variables
var_importance <- data.frame(ct$variable.importance)
ct$variable.importance
##   satisfaction_level average_montly_hours       number_project 
##           2057.31378           1039.93673           1029.31457 
##      last_evaluation   time_spend_company        Work_accident 
##            977.81765            753.47089             26.88815 
##           Department 
##             10.77219
  1. Measuring the accuracy of the model
valct$left_predicted <- predict(ct, valct, type = 'class')

summary(valct)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5500   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6300     Median :0.7100   Median :4.000   Median :199.0       
##  Mean   :0.6102     Mean   :0.7151   Mean   :3.791   Mean   :200.8       
##  3rd Qu.:0.8100     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.486     Mean   :0.1393   Mean   :0.2413  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years     Salary            Department   left_predicted
##  Min.   :0.00000       Min.   :1.000   sales      :1263   0:3499        
##  1st Qu.:0.00000       1st Qu.:1.000   technical  : 862   1:1081        
##  Median :0.00000       Median :2.000   support    : 649                 
##  Mean   :0.01812       Mean   :1.587   IT         : 368                 
##  3rd Qu.:0.00000       3rd Qu.:2.000   product_mng: 277                 
##  Max.   :1.00000       Max.   :3.000   marketing  : 250                 
##                                        (Other)    : 911
CrossTable(valct$left , valct$left_predicted)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4580 
## 
##  
##              | valct$left_predicted 
##   valct$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      3403 |        72 |      3475 | 
##              |   210.859 |   682.512 |           | 
##              |     0.979 |     0.021 |     0.759 | 
##              |     0.973 |     0.067 |           | 
##              |     0.743 |     0.016 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        96 |      1009 |      1105 | 
##              |   663.108 |  2146.360 |           | 
##              |     0.087 |     0.913 |     0.241 | 
##              |     0.027 |     0.933 |           | 
##              |     0.021 |     0.220 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3499 |      1081 |      4580 | 
##              |     0.764 |     0.236 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

RANDOM FOREST

##                      Accurate prediction Wrong Prediction
## Employees who left   0.971               0.029           
## Employees who stayed 0.997               0.003

Observations

  1. Initial data preparations.
HR_comma_sep_rf <- HR_comma_sep_random %>%
  mutate(sales = as.factor(sales),
         salary = as.factor(salary),
         Work_accident = as.factor(Work_accident),
         promotion_last_5years = as.factor(promotion_last_5years),
         left = as.factor(left))

HR_rf <- HR_comma_sep_rf%>% 
  mutate(Salary = ifelse(salary == 'low' , 1 ,
                         ifelse(salary == 'medium' , 2 , 3)) ,
         Department = sales) %>%
  select(-sales , -salary)

summary(HR_rf)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident left      promotion_last_5years
##  Min.   : 2.000     0:12830       0:11428   0:14680              
##  1st Qu.: 3.000     1: 2169       1: 3571   1:  319              
##  Median : 3.000                                                  
##  Mean   : 3.498                                                  
##  3rd Qu.: 4.000                                                  
##  Max.   :10.000                                                  
##                                                                  
##      random              Salary            Department  
##  Min.   :0.0000102   Min.   :1.000   sales      :4140  
##  1st Qu.:0.2484315   1st Qu.:1.000   technical  :2720  
##  Median :0.5016901   Median :2.000   support    :2229  
##  Mean   :0.5005216   Mean   :1.595   IT         :1227  
##  3rd Qu.:0.7520898   3rd Qu.:2.000   product_mng: 902  
##  Max.   :0.9998694   Max.   :3.000   marketing  : 858  
##                                      (Other)    :2923
  1. Spliting the data into training and validation.
## Training Data

train_rf <- HR_rf %>% 
  filter(random < .7) %>% 
  select(-random)

## Validation Data

validation_rf <- HR_rf %>% 
  filter(random >= .7) %>% 
  select(-random)
  1. Creating a model using randomForest
rf <- randomForest(left ~. , 
                   type = 'classification' ,
                   data = train_rf ,
                   importance = T)
  1. Identifying the important Variables
rf$importance
##                                  0           1 MeanDecreaseAccuracy
## satisfaction_level    5.348398e-02 0.603238937         0.1835829576
## last_evaluation       3.159604e-03 0.428543726         0.1038466321
## number_project        1.435852e-02 0.449707308         0.1173864479
## average_montly_hours  1.790360e-02 0.416656712         0.1122600168
## time_spend_company    1.213630e-02 0.359557546         0.0943697485
## Work_accident         3.710969e-04 0.004571812         0.0013647509
## promotion_last_5years 7.211924e-05 0.001537716         0.0004174981
## Salary                1.337696e-03 0.011877213         0.0038320956
## Department            1.480444e-03 0.017867063         0.0053601537
##                       MeanDecreaseGini
## satisfaction_level         1305.832016
## last_evaluation             433.502542
## number_project              617.859641
## average_montly_hours        579.507817
## time_spend_company          693.970021
## Work_accident                17.955451
## promotion_last_5years         3.634031
## Salary                       31.449350
## Department                   58.997485
  1. Measuring the accuracy of the model
validation_rf$predicted_left <- predict(rf , validation_rf)

summary(validation_rf$left)
##    0    1 
## 3475 1105
CrossTable(validation_rf$left , validation_rf$predicted_left)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4580 
## 
##  
##                    | validation_rf$predicted_left 
## validation_rf$left |         0 |         1 | Row Total | 
## -------------------|-----------|-----------|-----------|
##                  0 |      3465 |        10 |      3475 | 
##                    |   247.787 |   801.072 |           | 
##                    |     0.997 |     0.003 |     0.759 | 
##                    |     0.991 |     0.009 |           | 
##                    |     0.757 |     0.002 |           | 
## -------------------|-----------|-----------|-----------|
##                  1 |        33 |      1072 |      1105 | 
##                    |   779.240 |  2519.207 |           | 
##                    |     0.030 |     0.970 |     0.241 | 
##                    |     0.009 |     0.991 |           | 
##                    |     0.007 |     0.234 |           | 
## -------------------|-----------|-----------|-----------|
##       Column Total |      3498 |      1082 |      4580 | 
##                    |     0.764 |     0.236 |           | 
## -------------------|-----------|-----------|-----------|
## 
## 

Conclusion

##                     Actuals-left Predicted-left
## Logistic Regression 1105         600           
## CART                1105         1009          
## Random Forest       1105         1072
##                     Actuals-notleft Predicted-notleft
## Logistic Regression            3475              2970
## CART                           3475              3403
## Random Forest                  3475              3464