library(tidyverse)
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.3.4     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)
library(corrplot)
## corrplot 0.84 loaded
library(rpart)
library(rpart.plot)
library(gmodels)
HR <- read_csv("C:/Users/ThuyAnh/Desktop/ITKM560/HR_comma_sep2.csv")
## Parsed with column specification:
## cols(
##   satisfaction_level = col_double(),
##   last_evaluation = col_double(),
##   number_project = col_integer(),
##   average_montly_hours = col_integer(),
##   time_spend_company = col_integer(),
##   Work_accident = col_integer(),
##   left = col_integer(),
##   promotion_last_5years = col_integer(),
##   sales = col_character(),
##   salary = col_character()
## )
summary (HR)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##  promotion_last_5years    sales              salary         
##  Min.   :0.00000       Length:14999       Length:14999      
##  1st Qu.:0.00000       Class :character   Class :character  
##  Median :0.00000       Mode  :character   Mode  :character  
##  Mean   :0.02127                                            
##  3rd Qu.:0.00000                                            
##  Max.   :1.00000
HR <- HR %>% 
  # Encode salary as a number: low = 3, medium = 2, anything else (high) = 1
  mutate(salary = ifelse(salary == 'low', 3, ifelse(salary == 'medium', 2, 1))) %>% 
  rename(department = 'sales')
summary (HR)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##  promotion_last_5years  department            salary     
##  Min.   :0.00000       Length:14999       Min.   :1.000  
##  1st Qu.:0.00000       Class :character   1st Qu.:2.000  
##  Median :0.00000       Mode  :character   Median :2.000  
##  Mean   :0.02127                          Mean   :2.405  
##  3rd Qu.:0.00000                          3rd Qu.:3.000  
##  Max.   :1.00000                          Max.   :3.000
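
The numeric recoding above hides which number means which salary level. As a minimal sketch, assuming a toy vector in place of the real `salary` column, an ordered factor keeps the labels readable while still giving the same numeric codes. The names `salary_f` and `salary_code` are illustrative and do not appear in the original analysis.

```r
# Toy salary values standing in for the salary column (hypothetical data)
salary <- c("low", "medium", "high", "low", "medium")

# Ordered factor: labels stay readable and the order low < medium < high is explicit
salary_f <- factor(salary, levels = c("low", "medium", "high"), ordered = TRUE)

# Reproduce the numeric coding used in the report: high = 1, medium = 2, low = 3
salary_code <- 4 - as.integer(salary_f)

print(table(salary_f))
print(salary_code)
```

Keeping the factor alongside the numeric code means later tables can show "low", "medium", "high" instead of 3, 2, 1.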

1. Clearly define the research question:

How can I predict people leaving?

The variable ‘left’ is my dependent variable. Which variables correlate strongly with ‘left’?

Which factors affect the turnover rate?

2. Explore and understand the data:

2.1. Which independent variables show a relationship with the dependent variable?

Test 1

HR1 <- HR %>% 
  select(-department, -salary) 
cors1 <- cor(HR1) 
corrplot (cors1, method = "number")

From the correlation plot, the turnover rate appears to have a negative relationship with satisfaction level and work accidents, and a positive relationship with average monthly hours and time_spend_company.
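
Reading signs off a correlation plot can also be done numerically by pulling out the ‘left’ column of the correlation matrix and sorting it. The sketch below uses a small synthetic data frame (`toy`, with made-up values) in place of the real HR file, so only the technique carries over, not the numbers.

```r
set.seed(1)
n <- 200
# Synthetic stand-in for the HR data (hypothetical values, not the real file)
toy <- data.frame(
  satisfaction_level = runif(n),
  average_montly_hours = round(runif(n, 96, 310))
)
# Make leaving more likely when satisfaction is low
toy$left <- as.integer(runif(n) > toy$satisfaction_level)

# Correlation of every predictor with 'left', sorted from most negative to most positive
cors <- cor(toy)
left_cors <- sort(cors[setdiff(rownames(cors), "left"), "left"])
print(round(left_cors, 2))
```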

Test 2: satisfaction and left

t.test(HR$satisfaction_level~HR$left)
## 
##  Welch Two Sample t-test
## 
## data:  HR$satisfaction_level by HR$left
## t = 46.636, df = 5167, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2171815 0.2362417
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6668096       0.4400980

Mean satisfaction is 0.67 for people who stayed and 0.44 for people who left. The difference is highly significant (p < 2.2e-16).
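
The group means, the confidence interval, and the p-value can be extracted from the object that `t.test()` returns instead of being read off the printout. This is a sketch on synthetic data (`toy` is hypothetical, generated so that leavers have lower satisfaction); the formula-plus-`data` call is equivalent to the `HR$...~HR$...` form used above.

```r
set.seed(42)
# Synthetic stand-in for HR (hypothetical values): leavers get lower satisfaction on average
toy <- data.frame(left = rep(c(0, 1), times = c(300, 100)))
toy$satisfaction_level <- ifelse(toy$left == 1,
                                 rnorm(400, mean = 0.44, sd = 0.2),
                                 rnorm(400, mean = 0.67, sd = 0.2))

# Formula + data interface runs the same Welch two-sample test
tt <- t.test(satisfaction_level ~ left, data = toy)

# Pull out the pieces needed for reporting
group_means <- unname(tt$estimate)      # mean in group 0, mean in group 1
mean_diff   <- group_means[1] - group_means[2]
ci          <- as.numeric(tt$conf.int)  # 95% CI for (group 0 - group 1)
print(round(c(diff = mean_diff, lower = ci[1], upper = ci[2], p = tt$p.value), 4))
```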

Test 3: last_evaluation and left

t.test(HR$last_evaluation~HR$left)
## 
##  Welch Two Sample t-test
## 
## data:  HR$last_evaluation by HR$left
## t = -0.72534, df = 5154.9, p-value = 0.4683
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.009772224  0.004493874
## sample estimates:
## mean in group 0 mean in group 1 
##       0.7154734       0.7181126

The relationship between ‘last evaluation’ and ‘left’ is not significant: the mean ‘last evaluation’ is nearly the same for people who stayed (0.715) and people who left (0.718), with p = 0.4683. The 95% confidence interval for the difference (-0.0098, 0.0045) includes 0, which supports that there is no significant difference here.

Test 4: number_project and left

t.test(HR$number_project~HR$left)
## 
##  Welch Two Sample t-test
## 
## data:  HR$number_project by HR$left
## t = -2.1663, df = 4236.5, p-value = 0.03034
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.131136535 -0.006540119
## sample estimates:
## mean in group 0 mean in group 1 
##        3.786664        3.855503

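The difference in ‘number_project’ between people who stayed (3.79) and people who left (3.86) is statistically significant at the 5% level (p = 0.03034), but it is very small in practice: the 95% confidence interval for the difference is (-0.131, -0.007) projects. On its own, the mean number of projects is a weak predictor of ‘left’.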
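
Statistical significance and practical size are two separate questions for this test. The short check below only reuses the numbers printed in the Test 4 output; the variable names are illustrative.

```r
# Values copied from the Test 4 output
mean_stay <- 3.786664
mean_left <- 3.855503
p_value   <- 0.03034
ci        <- c(-0.131136535, -0.006540119)  # CI for (stay - left)

diff_means <- mean_stay - mean_left
significant_5pct <- p_value < 0.05          # statistical significance
ci_excludes_zero <- ci[1] < 0 && ci[2] < 0  # same conclusion from the interval

print(round(diff_means, 3))
print(c(significant_5pct, ci_excludes_zero))
```

The difference is under a tenth of a project, so the test can be significant with about 15,000 rows while the effect stays small.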

Test 5: average_montly_hours and left

t.test(HR$average_montly_hours~HR$left)
## 
##  Welch Two Sample t-test
## 
## data:  HR$average_montly_hours by HR$left
## t = -7.5323, df = 4875.1, p-value = 5.907e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.534631  -6.183384
## sample estimates:
## mean in group 0 mean in group 1 
##        199.0602        207.4192

People who stayed worked around 199 hours per month, on average. People who left worked around 207 hours per month, on average. The difference is significant (p = 5.907e-14).

Test 5b: time_spend_company and left

t.test(HR$time_spend_company~HR$left)
## 
##  Welch Two Sample t-test
## 
## data:  HR$time_spend_company by HR$left
## t = -22.631, df = 9625.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5394767 -0.4534706
## sample estimates:
## mean in group 0 mean in group 1 
##        3.380032        3.876505

There is a significant difference in the average ‘time_spend_company’ between the two groups (p < 2.2e-16). People who stayed had worked for the company 3.38 years, on average. People who left had worked for the company 3.88 years, on average.

Test 6: Work_accident and left

CrossTable(HR$Work_accident,HR$left, chisq = T)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14999 
## 
##  
##                  | HR$left 
## HR$Work_accident |         0 |         1 | Row Total | 
## -----------------|-----------|-----------|-----------|
##                0 |      9428 |      3402 |     12830 | 
##                  |    12.346 |    39.510 |           | 
##                  |     0.735 |     0.265 |     0.855 | 
##                  |     0.825 |     0.953 |           | 
##                  |     0.629 |     0.227 |           | 
## -----------------|-----------|-----------|-----------|
##                1 |      2000 |       169 |      2169 | 
##                  |    73.029 |   233.709 |           | 
##                  |     0.922 |     0.078 |     0.145 | 
##                  |     0.175 |     0.047 |           | 
##                  |     0.133 |     0.011 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |     11428 |      3571 |     14999 | 
##                  |     0.762 |     0.238 |           | 
## -----------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  358.5938     d.f. =  1     p =  5.698673e-80 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  357.5624     d.f. =  1     p =  9.55824e-80 
## 
## 

Interestingly, people who did not have a work accident left at a higher rate (26.5%) than people who had one (7.8%). This relationship is significant (p = 5.7e-80). Further testing is needed to decide whether ‘Work_accident’ helps predict ‘left’.
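
The row percentages and the Pearson chi-squared statistic can be recomputed from the four cell counts alone, which is a useful check on how the table is being read. The sketch below types the counts in from the CrossTable output and uses base R only.

```r
# Counts copied from the CrossTable output (rows: Work_accident 0/1, columns: left 0/1)
tab <- matrix(c(9428, 2000, 3402, 169), nrow = 2,
              dimnames = list(Work_accident = c("0", "1"), left = c("0", "1")))

# Row proportions: share of stayers and leavers within each accident group
row_props <- prop.table(tab, margin = 1)
print(round(row_props, 3))

# Pearson chi-squared test without the continuity correction
res <- chisq.test(tab, correct = FALSE)
print(res$statistic)
```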

Test 7: promotion and left

CrossTable(HR$promotion_last_5years, HR$left, chisq=T)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14999 
## 
##  
##                          | HR$left 
## HR$promotion_last_5years |         0 |         1 | Row Total | 
## -------------------------|-----------|-----------|-----------|
##                        0 |     11128 |      3552 |     14680 | 
##                          |     0.290 |     0.928 |           | 
##                          |     0.758 |     0.242 |     0.979 | 
##                          |     0.974 |     0.995 |           | 
##                          |     0.742 |     0.237 |           | 
## -------------------------|-----------|-----------|-----------|
##                        1 |       300 |        19 |       319 | 
##                          |    13.343 |    42.702 |           | 
##                          |     0.940 |     0.060 |     0.021 | 
##                          |     0.026 |     0.005 |           | 
##                          |     0.020 |     0.001 |           | 
## -------------------------|-----------|-----------|-----------|
##             Column Total |     11428 |      3571 |     14999 | 
##                          |     0.762 |     0.238 |           | 
## -------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  57.26273     d.f. =  1     p =  3.813123e-14 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  56.26163     d.f. =  1     p =  6.344155e-14 
## 
## 

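Promotion appears to be a useful factor in predicting ‘left’. People who were not promoted in the last 5 years left at a higher rate (24.2%) than people who were promoted (6%). Only 319 of the 14,999 employees (2.1%) were promoted, so this factor applies to few cases.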
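
One way to summarise the gap between the two groups is the ratio of their leaving rates. The sketch below uses only the counts from the Test 7 table; the variable names are illustrative.

```r
# Counts copied from the Test 7 table
left_no_promo <- 3552
n_no_promo    <- 14680
left_promo    <- 19
n_promo       <- 319

rate_no_promo <- left_no_promo / n_no_promo  # leaving rate without promotion
rate_promo    <- left_promo / n_promo        # leaving rate with promotion
relative_risk <- rate_no_promo / rate_promo  # how many times higher

print(round(c(rate_no_promo, rate_promo, relative_risk), 3))
```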

Test 8: department and left

CrossTable(HR$department, HR$left , chisq=TRUE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14999 
## 
##  
##               | HR$left 
## HR$department |         0 |         1 | Row Total | 
## --------------|-----------|-----------|-----------|
##    accounting |       563 |       204 |       767 | 
##               |     0.783 |     2.506 |           | 
##               |     0.734 |     0.266 |     0.051 | 
##               |     0.049 |     0.057 |           | 
##               |     0.038 |     0.014 |           | 
## --------------|-----------|-----------|-----------|
##            hr |       524 |       215 |       739 | 
##               |     2.709 |     8.670 |           | 
##               |     0.709 |     0.291 |     0.049 | 
##               |     0.046 |     0.060 |           | 
##               |     0.035 |     0.014 |           | 
## --------------|-----------|-----------|-----------|
##            IT |       954 |       273 |      1227 | 
##               |     0.391 |     1.252 |           | 
##               |     0.778 |     0.222 |     0.082 | 
##               |     0.083 |     0.076 |           | 
##               |     0.064 |     0.018 |           | 
## --------------|-----------|-----------|-----------|
##    management |       539 |        91 |       630 | 
##               |     7.250 |    23.202 |           | 
##               |     0.856 |     0.144 |     0.042 | 
##               |     0.047 |     0.025 |           | 
##               |     0.036 |     0.006 |           | 
## --------------|-----------|-----------|-----------|
##     marketing |       655 |       203 |       858 | 
##               |     0.002 |     0.008 |           | 
##               |     0.763 |     0.237 |     0.057 | 
##               |     0.057 |     0.057 |           | 
##               |     0.044 |     0.014 |           | 
## --------------|-----------|-----------|-----------|
##   product_mng |       704 |       198 |       902 | 
##               |     0.408 |     1.307 |           | 
##               |     0.780 |     0.220 |     0.060 | 
##               |     0.062 |     0.055 |           | 
##               |     0.047 |     0.013 |           | 
## --------------|-----------|-----------|-----------|
##         RandD |       666 |       121 |       787 | 
##               |     7.346 |    23.510 |           | 
##               |     0.846 |     0.154 |     0.052 | 
##               |     0.058 |     0.034 |           | 
##               |     0.044 |     0.008 |           | 
## --------------|-----------|-----------|-----------|
##         sales |      3126 |      1014 |      4140 | 
##               |     0.255 |     0.815 |           | 
##               |     0.755 |     0.245 |     0.276 | 
##               |     0.274 |     0.284 |           | 
##               |     0.208 |     0.068 |           | 
## --------------|-----------|-----------|-----------|
##       support |      1674 |       555 |      2229 | 
##               |     0.348 |     1.114 |           | 
##               |     0.751 |     0.249 |     0.149 | 
##               |     0.146 |     0.155 |           | 
##               |     0.112 |     0.037 |           | 
## --------------|-----------|-----------|-----------|
##     technical |      2023 |       697 |      2720 | 
##               |     1.178 |     3.771 |           | 
##               |     0.744 |     0.256 |     0.181 | 
##               |     0.177 |     0.195 |           | 
##               |     0.135 |     0.046 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |     11428 |      3571 |     14999 | 
##               |     0.762 |     0.238 |           | 
## --------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  86.82547     d.f. =  9     p =  7.04213e-15 
## 
## 
## 

Management and RandD are the two departments with a much lower percentage of people who left (14.4% and 15.4% respectively), compared with the other departments, which all have more than 20%. Department seems to be a significant factor for predicting ‘left’ (p = 7.04e-15).
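
With ten departments, sorting the leaving rates makes the comparison easier than scanning the table. The sketch below types in the ‘left = 1’ counts and row totals from the Test 8 table and ranks them in base R.

```r
# Counts copied from the Test 8 table
dept <- c("accounting", "hr", "IT", "management", "marketing",
          "product_mng", "RandD", "sales", "support", "technical")
n_left  <- c(204, 215, 273, 91, 203, 198, 121, 1014, 555, 697)
n_total <- c(767, 739, 1227, 630, 858, 902, 787, 4140, 2229, 2720)

# Leaving rate per department, sorted from lowest to highest
leave_rate <- setNames(n_left / n_total, dept)
ranked <- sort(leave_rate)
print(round(ranked, 3))
```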

Test 9: salary and left

CrossTable(HR$salary, HR$left, chisq = T)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14999 
## 
##  
##              | HR$left 
##    HR$salary |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            1 |      1155 |        82 |      1237 | 
##              |    47.915 |   153.339 |           | 
##              |     0.934 |     0.066 |     0.082 | 
##              |     0.101 |     0.023 |           | 
##              |     0.077 |     0.005 |           | 
## -------------|-----------|-----------|-----------|
##            2 |      5129 |      1317 |      6446 | 
##              |     9.648 |    30.876 |           | 
##              |     0.796 |     0.204 |     0.430 | 
##              |     0.449 |     0.369 |           | 
##              |     0.342 |     0.088 |           | 
## -------------|-----------|-----------|-----------|
##            3 |      5144 |      2172 |      7316 | 
##              |    33.200 |   106.247 |           | 
##              |     0.703 |     0.297 |     0.488 | 
##              |     0.450 |     0.608 |           | 
##              |     0.343 |     0.145 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |     11428 |      3571 |     14999 | 
##              |     0.762 |     0.238 |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  381.225     d.f. =  2     p =  1.652087e-83 
## 
## 
## 

In this test, salary ‘1’ is the high level of salary, ‘2’ is medium, and ‘3’ is low. The share of people who left rises as salary falls: 6.6% for high, 20.4% for medium, and 29.7% for low. Salary is a significant factor to use as well (p = 1.65e-83).
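
Because the salary levels are ordered, a test for a trend in proportions is a possible complement to the overall chi-squared test. This sketch uses `prop.trend.test()` from base R's stats package on the counts from the Test 9 table; it is an additional check, not part of the original analysis.

```r
# Leavers and group sizes copied from the Test 9 table, ordered high -> medium -> low salary
n_left  <- c(82, 1317, 2172)
n_total <- c(1237, 6446, 7316)

rates <- n_left / n_total
print(round(rates, 3))

# Chi-squared test for a linear trend in proportions across the ordered groups
trend <- prop.trend.test(x = n_left, n = n_total)
print(trend$p.value)
```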

2.2. Are there any independent variables that correlate with each other?

Test 1: Correlation test of continuous IVs

cors2 <- cor(HR1 %>% 
              select(-left, -promotion_last_5years, -Work_accident ))

corrplot(cors2 , method = 'number')

The correlation of the numeric IVs shows that ‘last_evaluation’, ‘number_project’, and ‘average_montly_hours’ are positively related. This probably means the last evaluation reflects the number of projects and working hours: employees may receive a high evaluation because they take on more tasks or hours. From the employees’ side, however, it seems that the number of projects affects satisfaction adversely.

Test 2: Correlation test of satisfaction and number of projects

cor.test(HR1$satisfaction_level, HR1$number_project)
## 
##  Pearson's product-moment correlation
## 
## data:  HR1$satisfaction_level and HR1$number_project
## t = -17.69, df = 14997, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1586105 -0.1272570
## sample estimates:
##        cor 
## -0.1429696

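The hypothesis that the more projects employees have, the lower their satisfaction tends to be is supported: the correlation is negative and statistically significant (p < 2.2e-16), but weak (r = -0.14).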
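
A very small p-value does not by itself mean a strong correlation. The sketch below takes `r` and `df` from the Test 2 output and computes the shared variance and the t statistic they imply, using the standard formula t = r * sqrt(df) / sqrt(1 - r^2).

```r
# Values copied from the Test 2 output
r  <- -0.1429696
df <- 14997

shared_variance <- r^2                  # proportion of variance shared
t_stat <- r * sqrt(df) / sqrt(1 - r^2)  # t statistic implied by r and df

print(round(shared_variance, 4))
print(round(t_stat, 2))
```

The large sample drives the t statistic, while the two variables share only about 2% of their variance.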

Test 3: Correlation test of last evaluation and number of projects

cor.test(HR1$last_evaluation, HR1$number_project)
## 
##  Pearson's product-moment correlation
## 
## data:  HR1$last_evaluation and HR1$number_project
## t = 45.656, df = 14997, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3352028 0.3633053
## sample estimates:
##       cor 
## 0.3493326

Test 4: Correlation test of last evaluation and average monthly hours

cor.test(HR1$last_evaluation, HR1$average_montly_hours)
## 
##  Pearson's product-moment correlation
## 
## data:  HR1$last_evaluation and HR1$average_montly_hours
## t = 44.237, df = 14997, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3255078 0.3538218
## sample estimates:
##       cor 
## 0.3397418

From tests 3 and 4, it seems that the more hours and projects employees work, the higher their evaluation is likely to be (r = 0.35 and r = 0.34 respectively).

3. Create a model using a decision tree predicting attrition:

3.1. Creating the tree and finding the most influential variables

tree1 <- rpart(left ~ ., data = HR, method = "class")

rpart.plot(tree1, shadow.col="gray", tweak = 0.8, type =1,
           extra = 101, fallen.leaves =  F, box.palette = "Purples")
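
Besides reading the plot, the most influential variables can be listed from the `variable.importance` element that a fitted rpart tree carries. The sketch below uses the `kyphosis` data set that ships with rpart as a stand-in for the HR data, so the variable names and scores are not those of `tree1`.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the HR data here
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Importance scores, largest first, rescaled to percentages for easier reading
imp <- sort(fit$variable.importance, decreasing = TRUE)
imp_pct <- round(100 * imp / sum(imp), 1)
print(imp_pct)
```

The same two lines applied to `tree1` would rank the HR predictors, assuming `tree1` was fitted as shown above.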

3.2. Pruning as/if necessary

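# Find the cptable row whose nsplit (column 2) equals 6
cp6 <- which(tree1$cptable[, "nsplit"] == 6)
# Prune back to the complexity parameter (CP) of that row
final_tree <- prune(tree1, cp = tree1$cptable[cp6, "CP"])
rpart.plot(final_tree, shadow.col = "gray", tweak = 0.8, type = 1,
           extra = 101, fallen.leaves = FALSE, box.palette = "Blues",
           main = "Final Tree 1")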

# Same pruned tree (final_tree), plotted with class proportions (extra = 5) instead of counts
rpart.plot(final_tree, shadow.col = "gray", tweak = 0.8, type = 1,
           extra = 5, fallen.leaves = FALSE, box.palette = "Blues",
           main = "Final Tree 2")
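
Selecting a row with `nsplit == 6` returns nothing if the cptable happens to have no row with exactly that many splits. As a hedged sketch, the hypothetical helper `prune_to_nsplit()` below picks the largest tree with at most the requested number of splits instead. It is shown on the `kyphosis` data from rpart, not on the HR data.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
print(fit$cptable)  # one row per candidate tree size: CP, nsplit, errors

# Hypothetical helper: prune to the largest tree with at most max_splits splits
prune_to_nsplit <- function(tree, max_splits) {
  ok <- which(tree$cptable[, "nsplit"] <= max_splits)
  row <- max(ok)  # cptable rows are ordered by increasing nsplit
  prune(tree, cp = tree$cptable[row, "CP"])
}

small <- prune_to_nsplit(fit, max_splits = 1)
print(small)
```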

4. Explanation of the tree.

I have 2 trees, both pruned at nsplit 6, so they have the same structure. The only difference is that Final Tree 1 shows the number of observations in each node (extra = 101) and Final Tree 2 shows the class proportions (extra = 5). Showing both counts and proportions makes the tree easier to interpret. I decided to prune the tree at 6 splits because the branches at splits 7, 8, and 9 cover a very small number of observations compared with the size of the data, so rules based on them might not be reliable for later predictions.

The decision to leave is most likely based on people’s satisfaction with the company. First, people who rate their satisfaction at 0.46 or higher make up 76% of the 14,999 observations, and they tend to stay.

Group 1: Among the 76% of people who have satisfaction level >= 0.46, the 90% who have worked for the company less than 4.5 years tend to stay; only 1% left.

Group 2: For the 10% of people with satisfaction >= 0.46, those with last evaluation < 0.8 tend to stay as well; only 4% of them left.

Group 3: 24% of the 14,999 employees have satisfaction level < 0.46. 39% of these people who had more than 2.5 projects are still working for the company although they don’t feel satisfied (satisfaction >= 0.11). If their satisfaction is < 0.11, they are likely to leave.

Group 4: People who had satisfaction level < 0.46 and number_project < 2.5 seem not to have had enough work to do; that is probably why they did not feel satisfied with the company, and they left.
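
The four groups can be written down as explicit rules, which makes the cutoffs easier to check against the plot. The function below is a hypothetical helper that encodes only what the text above states. It assumes Group 2 applies to people with 4.5 or more years at the company, and it returns `NA` for the branch the text does not describe. It is not the fitted tree and should not replace `predict()`.

```r
# Hypothetical helper that encodes only the rules described in Groups 1-4
describe_group <- function(satisfaction, years, last_eval, projects) {
  if (satisfaction >= 0.46) {
    if (years < 4.5) return("stay")      # Group 1
    if (last_eval < 0.8) return("stay")  # Group 2 (assumed to follow years >= 4.5)
    return(NA_character_)                # branch not covered by the description
  }
  if (projects < 2.5) return("leave")    # Group 4
  if (satisfaction >= 0.11) "stay" else "leave"  # Group 3
}

print(describe_group(0.70, 3, 0.60, 4))  # satisfied, short tenure
print(describe_group(0.30, 3, 0.50, 2))  # unsatisfied, few projects
print(describe_group(0.05, 4, 0.90, 6))  # very low satisfaction
```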

I think the interesting point of the decision tree is how it connects the important variables together. We ran many correlation tests, t-tests, and cross-table tests to see the significance of the dependent variable (DV) against the independent variables (IVs), and of the independent variables among themselves, so we can conclude which ones are important for the model later on. Moreover, the tree helps to connect the variables and define which cutoff point is important in each one. By plotting the tree, it’s easier to see which variables affect the DV the most, which may help in interpreting the predictive models.