LOGISTIC REGRESSION

Observations

The total accuracy of our final model is only 77.9% i.e 64.8% + 13.1% .
The model wrongly predicts 14.5% of the employees who stayed in the company as left and 45.7% of the employees who left were predicted as employees who will stay.

Initial data preparations.

HR_comma_sep_lr <- HR_comma_sep_random %>%
  mutate(sales = as.factor(sales),
         salary = as.factor(salary))

HR_lr <- HR_comma_sep_lr%>% 
  mutate(Salary = ifelse(salary == 'low' , 1 ,
         ifelse(salary == 'medium' , 2 , 3)) ,
         Department = sales) %>%
  select(-sales , -salary)


summary(HR_lr)

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years     random              Salary     
##  Min.   :0.00000       Min.   :0.0000102   Min.   :1.000  
##  1st Qu.:0.00000       1st Qu.:0.2484315   1st Qu.:1.000  
##  Median :0.00000       Median :0.5016901   Median :2.000  
##  Mean   :0.02127       Mean   :0.5005216   Mean   :1.595  
##  3rd Qu.:0.00000       3rd Qu.:0.7520898   3rd Qu.:2.000  
##  Max.   :1.00000       Max.   :0.9998694   Max.   :3.000  
##                                                           
##        Department  
##  sales      :4140  
##  technical  :2720  
##  support    :2229  
##  IT         :1227  
##  product_mng: 902  
##  marketing  : 858  
##  (Other)    :2923

Spliting the data into training and validation.

## Training Data

trainlr <- HR_lr %>% 
  filter(random < .7) %>% 
  select(-random)

## Validation Data

vallr <- HR_lr %>% 
  filter(random >= .7) %>% 
  select(-random)

3.Developing an initial Logistic Regression model

Modellr <- glm(left~ . , data = trainlr ,
             family = binomial(link = 'logit'))

summary(Modellr)

## 
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = trainlr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1681  -0.6565  -0.4034  -0.1322   3.1260  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            1.0786643  0.1923993   5.606 2.07e-08 ***
## satisfaction_level    -4.1095998  0.1172338 -35.055  < 2e-16 ***
## last_evaluation        0.8094812  0.1793053   4.515 6.35e-06 ***
## number_project        -0.2938758  0.0254624 -11.542  < 2e-16 ***
## average_montly_hours   0.0045784  0.0006173   7.416 1.20e-13 ***
## time_spend_company     0.2737543  0.0184879  14.807  < 2e-16 ***
## Work_accident         -1.4921655  0.1059261 -14.087  < 2e-16 ***
## promotion_last_5years -1.5294833  0.3218676  -4.752 2.02e-06 ***
## Salary                -0.7250425  0.0463260 -15.651  < 2e-16 ***
## Departmenthr           0.1480178  0.1602499   0.924  0.35566    
## DepartmentIT          -0.0862304  0.1458337  -0.591  0.55432    
## Departmentmanagement  -0.6211896  0.1953026  -3.181  0.00147 ** 
## Departmentmarketing   -0.0970064  0.1583028  -0.613  0.54002    
## Departmentproduct_mng -0.2073260  0.1572405  -1.319  0.18733    
## DepartmentRandD       -0.6024246  0.1735907  -3.470  0.00052 ***
## Departmentsales       -0.0938085  0.1232529  -0.761  0.44659    
## Departmentsupport      0.0339621  0.1311244   0.259  0.79563    
## Departmenttechnical    0.0948718  0.1280236   0.741  0.45866    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11403.1  on 10418  degrees of freedom
## Residual deviance:  8884.8  on 10401  degrees of freedom
## AIC: 8920.8
## 
## Number of Fisher Scoring iterations: 5

anova(Modellr , test = "Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: left
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                  10418    11403.1              
## satisfaction_level     1  1577.13     10417     9826.0 < 2.2e-16 ***
## last_evaluation        1    21.83     10416     9804.1 2.976e-06 ***
## number_project         1    58.69     10415     9745.5 1.848e-14 ***
## average_montly_hours   1    67.85     10414     9677.6 < 2.2e-16 ***
## time_spend_company     1   137.38     10413     9540.2 < 2.2e-16 ***
## Work_accident          1   265.05     10412     9275.2 < 2.2e-16 ***
## promotion_last_5years  1    59.33     10411     9215.8 1.334e-14 ***
## Salary                 1   285.27     10410     8930.6 < 2.2e-16 ***
## Department             9    45.76     10401     8884.8 6.669e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have identified all variables as significant except Department which is less significant in comparision to the other variables.
Last evaluation, Average monthly hours and time spent in the company are directly proportional to the variable left. For example higher the number of projects assigned higher are the chances of employees staying in the company.

Developing a model with significant variables

Model2lr <- glm(left~ satisfaction_level + last_evaluation + number_project + average_montly_hours + time_spend_company + Work_accident + promotion_last_5years + Salary, data = trainlr ,
              family = binomial(link = 'logit'))

summary(Model2lr)

## 
## Call:
## glm(formula = left ~ satisfaction_level + last_evaluation + number_project + 
##     average_montly_hours + time_spend_company + Work_accident + 
##     promotion_last_5years + Salary, family = binomial(link = "logit"), 
##     data = trainlr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1501  -0.6552  -0.4088  -0.1353   3.1617  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            1.0670148  0.1565432   6.816 9.35e-12 ***
## satisfaction_level    -4.0997713  0.1167253 -35.123  < 2e-16 ***
## last_evaluation        0.7998351  0.1786764   4.476 7.59e-06 ***
## number_project        -0.2916011  0.0253404 -11.507  < 2e-16 ***
## average_montly_hours   0.0045708  0.0006148   7.434 1.05e-13 ***
## time_spend_company     0.2644625  0.0182117  14.522  < 2e-16 ***
## Work_accident         -1.4998533  0.1057153 -14.188  < 2e-16 ***
## promotion_last_5years -1.6416480  0.3191967  -5.143 2.70e-07 ***
## Salary                -0.7425782  0.0458780 -16.186  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11403.1  on 10418  degrees of freedom
## Residual deviance:  8930.6  on 10410  degrees of freedom
## AIC: 8948.6
## 
## Number of Fisher Scoring iterations: 5

anova(Model2lr , test = "Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: left
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                  10418    11403.1              
## satisfaction_level     1  1577.13     10417     9826.0 < 2.2e-16 ***
## last_evaluation        1    21.83     10416     9804.1 2.976e-06 ***
## number_project         1    58.69     10415     9745.5 1.848e-14 ***
## average_montly_hours   1    67.85     10414     9677.6 < 2.2e-16 ***
## time_spend_company     1   137.38     10413     9540.2 < 2.2e-16 ***
## Work_accident          1   265.05     10412     9275.2 < 2.2e-16 ***
## promotion_last_5years  1    59.33     10411     9215.8 1.334e-14 ***
## Salary                 1   285.27     10410     8930.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Measuring the accuracy of the model

vallr$logit <- predict.glm(Model2lr , vallr)

vallr <- vallr %>% 
  mutate(predicted_left = 1/(1+exp(-logit)))


summary(vallr$predicted_left)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001977 0.077629 0.172063 0.241537 0.350740 0.900875

Calculating number of left employee in vallr

num_left_vallr <- sum(vallr$left)
sum(vallr$left)

## [1] 1105

cutoff <- vallr %>% 
  arrange(desc(predicted_left)) %>% 
  filter(row_number() == num_left_vallr)

cutoff <- as.numeric(cutoff[ , 'predicted_left'])

cutoff

## [1] 0.3671782

Applying cutoff

vallr <- vallr %>% 
  mutate(predicted_left_1 = ifelse(predicted_left >= cutoff ,
                                   1 , 0))

sum(vallr$predicted_left_1)

## [1] 1105

CrossTable(vallr$left, vallr$predicted_left_1)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4580 
## 
##  
##              | vallr$predicted_left_1 
##   vallr$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      2970 |       505 |      3475 | 
##              |    42.159 |   132.581 |           | 
##              |     0.855 |     0.145 |     0.759 | 
##              |     0.855 |     0.457 |           | 
##              |     0.648 |     0.110 |           | 
## -------------|-----------|-----------|-----------|
##            1 |       505 |       600 |      1105 | 
##              |   132.581 |   416.940 |           | 
##              |     0.457 |     0.543 |     0.241 | 
##              |     0.145 |     0.543 |           | 
##              |     0.110 |     0.131 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3475 |      1105 |      4580 | 
##              |     0.759 |     0.241 |           | 
## -------------|-----------|-----------|-----------|
## 
##

vallr <- vallr %>% 
  mutate(predicted_left_1 = ifelse(predicted_left >= 0.75 ,
                                   1 , 0))

CrossTable(vallr$left , vallr$predicted_left_1)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4580 
## 
##  
##              | vallr$predicted_left_1 
##   vallr$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      3422 |        53 |      3475 | 
##              |     0.680 |    22.748 |           | 
##              |     0.985 |     0.015 |     0.759 | 
##              |     0.770 |     0.398 |           | 
##              |     0.747 |     0.012 |           | 
## -------------|-----------|-----------|-----------|
##            1 |      1025 |        80 |      1105 | 
##              |     2.140 |    71.537 |           | 
##              |     0.928 |     0.072 |     0.241 | 
##              |     0.230 |     0.602 |           | 
##              |     0.224 |     0.017 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      4447 |       133 |      4580 | 
##              |     0.971 |     0.029 |           | 
## -------------|-----------|-----------|-----------|
## 
##

Equation

Coef <- vallr %>% 
  mutate(logit_confirmation =  1.0670148 + satisfaction_level*(-4.0997713) + 
           last_evaluation*0.7998351 + number_project*(-0.2916011)  + average_montly_hours*(0.0045708) +
           time_spend_company*(0.2644625) + Work_accident* (-1.49985333) + promotion_last_5years*(-1.6416480) +
           Salary*(-0.7425782), 
         predicted_left = 1/(1+exp(-logit)))
Coef

## # A tibble: 4,580 x 14
##    satisfaction_le~ last_evaluation number_project average_montly_~
##               <dbl>           <dbl>          <dbl>            <dbl>
##  1             0.89           1                  5              224
##  2             0.45           0.54               2              135
##  3             0.84           0.92               4              234
##  4             0.41           0.55               2              148
##  5             0.45           0.51               2              160
##  6             0.76           0.89               5              262
##  7             0.11           0.83               6              282
##  8             0.89           0.92               5              242
##  9             0.82           0.87               4              239
## 10             0.45           0.570              2              134
## # ... with 4,570 more rows, and 10 more variables:
## #   time_spend_company <dbl>, Work_accident <dbl>, left <dbl>,
## #   promotion_last_5years <dbl>, Salary <dbl>, Department <fct>,
## #   logit <dbl>, predicted_left <dbl>, predicted_left_1 <dbl>,
## #   logit_confirmation <dbl>

CLASSIFICATION TREE

Observations

The total accuracy of our final model is only 96.3% i.e 74.3% + 22% .
The model wrongly predicts 2.1% of the employees who stayed in the company as left and 8.7% of the employees who left were predicted as employees who stayed.

Initial data preparations.

HR_comma_sep_ct <- HR_comma_sep_random %>%
  mutate(sales = as.factor(sales),
         salary = as.factor(salary))

HR_ct <- HR_comma_sep_ct%>% 
  mutate(Salary = ifelse(salary == 'low' , 1 ,
                         ifelse(salary == 'medium' , 2 , 3)) ,
         Department = sales) %>%
  select(-sales , -salary)


summary(HR_ct)

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years     random              Salary     
##  Min.   :0.00000       Min.   :0.0000102   Min.   :1.000  
##  1st Qu.:0.00000       1st Qu.:0.2484315   1st Qu.:1.000  
##  Median :0.00000       Median :0.5016901   Median :2.000  
##  Mean   :0.02127       Mean   :0.5005216   Mean   :1.595  
##  3rd Qu.:0.00000       3rd Qu.:0.7520898   3rd Qu.:2.000  
##  Max.   :1.00000       Max.   :0.9998694   Max.   :3.000  
##                                                           
##        Department  
##  sales      :4140  
##  technical  :2720  
##  support    :2229  
##  IT         :1227  
##  product_mng: 902  
##  marketing  : 858  
##  (Other)    :2923

Spliting the data into training and validation.

## Training Data

trainct <- HR_ct %>% 
  filter(random < .7) %>% 
  select(-random)

## Validation Data

valct <- HR_ct %>% 
  filter(random >= .7) %>% 
  select(-random)

3.Creating a Classification Tree

ct <- rpart(left ~ . , data = trainct, method = 'class', cp = 0.02)
rpart.plot(ct)

Identifying the most important variables

var_importance <- data.frame(ct$variable.importance)
ct$variable.importance

##   satisfaction_level average_montly_hours       number_project 
##           2057.31378           1039.93673           1029.31457 
##      last_evaluation   time_spend_company        Work_accident 
##            977.81765            753.47089             26.88815 
##           Department 
##             10.77219

Measuring the accuracy of the model

valct$left_predicted <- predict(ct, valct, type = 'class')

summary(valct)

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5500   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6300     Median :0.7100   Median :4.000   Median :199.0       
##  Mean   :0.6102     Mean   :0.7151   Mean   :3.791   Mean   :200.8       
##  3rd Qu.:0.8100     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.486     Mean   :0.1393   Mean   :0.2413  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years     Salary            Department   left_predicted
##  Min.   :0.00000       Min.   :1.000   sales      :1263   0:3499        
##  1st Qu.:0.00000       1st Qu.:1.000   technical  : 862   1:1081        
##  Median :0.00000       Median :2.000   support    : 649                 
##  Mean   :0.01812       Mean   :1.587   IT         : 368                 
##  3rd Qu.:0.00000       3rd Qu.:2.000   product_mng: 277                 
##  Max.   :1.00000       Max.   :3.000   marketing  : 250                 
##                                        (Other)    : 911

CrossTable(valct$left , valct$left_predicted)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4580 
## 
##  
##              | valct$left_predicted 
##   valct$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      3403 |        72 |      3475 | 
##              |   210.859 |   682.512 |           | 
##              |     0.979 |     0.021 |     0.759 | 
##              |     0.973 |     0.067 |           | 
##              |     0.743 |     0.016 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        96 |      1009 |      1105 | 
##              |   663.108 |  2146.360 |           | 
##              |     0.087 |     0.913 |     0.241 | 
##              |     0.027 |     0.933 |           | 
##              |     0.021 |     0.220 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3499 |      1081 |      4580 | 
##              |     0.764 |     0.236 |           | 
## -------------|-----------|-----------|-----------|
## 
##

RANDOM FOREST

Observations

The total accuracy of our final model is only 99% i.e 75.6% + 23.4% .
The model misclassifies only 0.03% of current employees as employees left.

Initial data preparations.

HR_comma_sep_rf <- HR_comma_sep_random %>%
  mutate(sales = as.factor(sales),
         salary = as.factor(salary),
         Work_accident = as.factor(Work_accident),
         promotion_last_5years = as.factor(promotion_last_5years),
         left = as.factor(left))

HR_rf <- HR_comma_sep_rf%>% 
  mutate(Salary = ifelse(salary == 'low' , 1 ,
                         ifelse(salary == 'medium' , 2 , 3)) ,
         Department = sales) %>%
  select(-sales , -salary)

summary(HR_rf)

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident left      promotion_last_5years
##  Min.   : 2.000     0:12830       0:11428   0:14680              
##  1st Qu.: 3.000     1: 2169       1: 3571   1:  319              
##  Median : 3.000                                                  
##  Mean   : 3.498                                                  
##  3rd Qu.: 4.000                                                  
##  Max.   :10.000                                                  
##                                                                  
##      random              Salary            Department  
##  Min.   :0.0000102   Min.   :1.000   sales      :4140  
##  1st Qu.:0.2484315   1st Qu.:1.000   technical  :2720  
##  Median :0.5016901   Median :2.000   support    :2229  
##  Mean   :0.5005216   Mean   :1.595   IT         :1227  
##  3rd Qu.:0.7520898   3rd Qu.:2.000   product_mng: 902  
##  Max.   :0.9998694   Max.   :3.000   marketing  : 858  
##                                      (Other)    :2923

Spliting the data into training and validation.

## Training Data

train_rf <- HR_rf %>% 
  filter(random < .7) %>% 
  select(-random)

## Validation Data

validation_rf <- HR_rf %>% 
  filter(random >= .7) %>% 
  select(-random)

Creating a model using randomForest

rf <- randomForest(left ~. , 
                   type = 'classification' ,
                   data = train_rf ,
                   importance = T)

Identifying the important Variables

rf$importance

##                                  0           1 MeanDecreaseAccuracy
## satisfaction_level    0.0540639636 0.602137565         0.1841024316
## last_evaluation       0.0034459274 0.435523657         0.1059847505
## number_project        0.0149989679 0.452986066         0.1189328334
## average_montly_hours  0.0175139357 0.416289340         0.1121669158
## time_spend_company    0.0116851865 0.353231444         0.0927197055
## Work_accident         0.0003320547 0.004589783         0.0013390412
## promotion_last_5years 0.0001060105 0.001519397         0.0004407257
## Salary                0.0014637000 0.012173810         0.0040056784
## Department            0.0015717265 0.018329947         0.0055469067
##                       MeanDecreaseGini
## satisfaction_level         1264.131721
## last_evaluation             454.096527
## number_project              665.676154
## average_montly_hours        568.313110
## time_spend_company          672.712577
## Work_accident                18.886370
## promotion_last_5years         3.731783
## Salary                       31.881837
## Department                   61.198830

Measuring the accuracy of the model

validation_rf$predicted_left <- predict(rf , validation_rf)

summary(validation_rf$left)

##    0    1 
## 3475 1105

CrossTable(validation_rf$left , validation_rf$predicted_left)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4580 
## 
##  
##                    | validation_rf$predicted_left 
## validation_rf$left |         0 |         1 | Row Total | 
## -------------------|-----------|-----------|-----------|
##                  0 |      3465 |        10 |      3475 | 
##                    |   247.787 |   801.072 |           | 
##                    |     0.997 |     0.003 |     0.759 | 
##                    |     0.991 |     0.009 |           | 
##                    |     0.757 |     0.002 |           | 
## -------------------|-----------|-----------|-----------|
##                  1 |        33 |      1072 |      1105 | 
##                    |   779.240 |  2519.207 |           | 
##                    |     0.030 |     0.970 |     0.241 | 
##                    |     0.009 |     0.991 |           | 
##                    |     0.007 |     0.234 |           | 
## -------------------|-----------|-----------|-----------|
##       Column Total |      3498 |      1082 |      4580 | 
##                    |     0.764 |     0.236 |           | 
## -------------------|-----------|-----------|-----------|
## 
##

Final Project

LOGISTIC REGRESSION

CLASSIFICATION TREE

RANDOM FOREST

Conclusion