library(tidyverse)
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.3.4 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
library(corrplot)
## corrplot 0.84 loaded
library(rpart)
library(rpart.plot)
library(gmodels)
HR <- read_csv("C:/Users/ThuyAnh/Desktop/ITKM560/HR_comma_sep2.csv")
## Parsed with column specification:
## cols(
## satisfaction_level = col_double(),
## last_evaluation = col_double(),
## number_project = col_integer(),
## average_montly_hours = col_integer(),
## time_spend_company = col_integer(),
## Work_accident = col_integer(),
## left = col_integer(),
## promotion_last_5years = col_integer(),
## sales = col_character(),
## salary = col_character()
## )
summary (HR)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
## promotion_last_5years sales salary
## Min. :0.00000 Length:14999 Length:14999
## 1st Qu.:0.00000 Class :character Class :character
## Median :0.00000 Mode :character Mode :character
## Mean :0.02127
## 3rd Qu.:0.00000
## Max. :1.00000
HR <- HR %>%
  mutate(salary = ifelse(salary == 'low', 3, ifelse(salary == 'medium', 2, 1))) %>%
  rename(department = 'sales')
summary (HR)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
## promotion_last_5years department salary
## Min. :0.00000 Length:14999 Min. :1.000
## 1st Qu.:0.00000 Class :character 1st Qu.:2.000
## Median :0.00000 Mode :character Median :2.000
## Mean :0.02127 Mean :2.405
## 3rd Qu.:0.00000 3rd Qu.:3.000
## Max. :1.00000 Max. :3.000
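The salary recoding above can also be written without nested ifelse calls. The following is a minimal sketch on a small made-up data frame (toy and its values are assumptions, not the HR data); it keeps the same coding of low = 3, medium = 2, high = 1.

```r
library(dplyr)

# Hypothetical toy data standing in for the salary column of HR
toy <- data.frame(
  salary = c("low", "medium", "high", "low"),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(
    # Same coding as in the report: low = 3, medium = 2, high = 1
    salary_code = case_when(
      salary == "low"    ~ 3,
      salary == "medium" ~ 2,
      salary == "high"   ~ 1
    )
  )

toy
```

One difference worth noting: case_when returns NA for a value that matches none of the conditions, whereas the nested ifelse codes every value other than 'low' and 'medium' as 1.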
1. Clearly define the research question:
How can I predict which employees will leave?
The variable ‘left’ is my dependent variable. Which variables correlate strongly with ‘left’?
Which factors affect the turnover rate?
2. Explore the data:
2.1. Which independent variables show a relationship with the dependent variable?
Test 1: correlation of the numeric variables with left
HR1 <- HR %>%
  select(-department, -salary)
cors1 <- cor(HR1)
corrplot(cors1, method = "number")

From the correlation plot, the turnover rate appears to be negatively related to satisfaction level and work accident, and positively related to average monthly hours and time_spend_company.
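When the correlation plot is hard to read, the row for ‘left’ can be pulled out of the correlation matrix and sorted by absolute size. This is only a sketch on simulated data: the data frame toy, its values, and the way ‘left’ is generated are assumptions made for illustration, and only the column names mirror the HR data.

```r
set.seed(123)
n <- 500

# Hypothetical toy data; values are simulated, not taken from the HR file
toy <- data.frame(
  satisfaction_level   = runif(n),
  average_montly_hours = round(runif(n, 96, 310)),
  time_spend_company   = sample(2:10, n, replace = TRUE)
)
# Simulate leaving as more likely when satisfaction is low
toy$left <- rbinom(n, 1, plogis(1 - 3 * toy$satisfaction_level))

cors <- cor(toy)

# Keep only the correlations with 'left', dropping its correlation with itself
left_cors <- cors["left", colnames(cors) != "left"]

# Largest absolute correlation first
left_cors[order(abs(left_cors), decreasing = TRUE)]
```

With the real data, the same two lines applied to cors1 would list the variables in the order in which they relate to ‘left’.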
Test 2: satisfaction and left
t.test(HR$satisfaction_level~HR$left)
##
## Welch Two Sample t-test
##
## data: HR$satisfaction_level by HR$left
## t = 46.636, df = 5167, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2171815 0.2362417
## sample estimates:
## mean in group 0 mean in group 1
## 0.6668096 0.4400980
The mean satisfaction is 0.67 for people who stayed and 0.44 for people who left. The difference is significant (p < 2.2e-16).
Test 3: last_evaluation and left
t.test(HR$last_evaluation~HR$left)
##
## Welch Two Sample t-test
##
## data: HR$last_evaluation by HR$left
## t = -0.72534, df = 5154.9, p-value = 0.4683
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.009772224 0.004493874
## sample estimates:
## mean in group 0 mean in group 1
## 0.7154734 0.7181126
The relationship between ‘last_evaluation’ and ‘left’ is not significant (p = 0.47): the mean evaluations of employees who stayed (0.715) and employees who left (0.718) barely differ, and the 95% confidence interval for the difference (-0.0098, 0.0045) includes zero.
Test 4: number_project and left
t.test(HR$number_project~HR$left)
##
## Welch Two Sample t-test
##
## data: HR$number_project by HR$left
## t = -2.1663, df = 4236.5, p-value = 0.03034
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.131136535 -0.006540119
## sample estimates:
## mean in group 0 mean in group 1
## 3.786664 3.855503
The difference in ‘number_project’ is statistically significant at the 5% level (p = 0.03), but it is small in practice: employees who stayed had 3.79 projects on average, and employees who left had 3.86. It is not significant at the 1% level.
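The numbers reported for the number_project test can be checked by hand. The sketch below copies the t statistic, degrees of freedom, confidence interval, and group means from the output above and recomputes the two-sided p-value; the variable names are chosen here for illustration.

```r
# Values copied from the Welch t-test output for number_project
t_stat    <- -2.1663
df        <- 4236.5
ci        <- c(-0.131136535, -0.006540119)
mean_stay <- 3.786664
mean_left <- 3.855503

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# Difference in means (stayed minus left); it sits at the centre of the interval
mean_diff <- mean_stay - mean_left

# The interval lies entirely below zero, so the difference is significant at 5%
ci_excludes_zero <- ci[1] > 0 || ci[2] < 0

p_value
mean_diff
ci_excludes_zero
```

A p-value below 0.05 with a difference of well under one tenth of a project is a reminder that, with about 15,000 observations, statistical significance and practical importance can diverge.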
Test 5: average_montly_hours and left
t.test(HR$average_montly_hours~HR$left)
##
## Welch Two Sample t-test
##
## data: HR$average_montly_hours by HR$left
## t = -7.5323, df = 4875.1, p-value = 5.907e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.534631 -6.183384
## sample estimates:
## mean in group 0 mean in group 1
## 199.0602 207.4192
People who stayed worked around 199 hours per month on average. People who left worked around 207 hours on average. The difference is significant (p = 5.9e-14).
Test 5: time_spend_company and left
t.test(HR$time_spend_company~HR$left)
##
## Welch Two Sample t-test
##
## data: HR$time_spend_company by HR$left
## t = -22.631, df = 9625.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5394767 -0.4534706
## sample estimates:
## mean in group 0 mean in group 1
## 3.380032 3.876505
There is a significant difference in ‘time_spend_company’ between the two groups (p < 2.2e-16). People who stayed had worked for the company 3.38 years on average. People who left had worked for the company 3.88 years on average.
Test 6: Work_accident and left
CrossTable(HR$Work_accident,HR$left, chisq = T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 14999
##
##
## | HR$left
## HR$Work_accident | 0 | 1 | Row Total |
## -----------------|-----------|-----------|-----------|
## 0 | 9428 | 3402 | 12830 |
## | 12.346 | 39.510 | |
## | 0.735 | 0.265 | 0.855 |
## | 0.825 | 0.953 | |
## | 0.629 | 0.227 | |
## -----------------|-----------|-----------|-----------|
## 1 | 2000 | 169 | 2169 |
## | 73.029 | 233.709 | |
## | 0.922 | 0.078 | 0.145 |
## | 0.175 | 0.047 | |
## | 0.133 | 0.011 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 11428 | 3571 | 14999 |
## | 0.762 | 0.238 | |
## -----------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 358.5938 d.f. = 1 p = 5.698673e-80
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 357.5624 d.f. = 1 p = 9.55824e-80
##
##
This is an interesting result: people who did not have a work accident left at a higher rate (26.5%) than people who had one (7.8%). The relationship is significant (chi-square = 358.59, p < 0.001). Further tests are needed to decide whether ‘Work_accident’ helps in predicting ‘left’.
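The row percentages and the chi-square statistic in this table can be reproduced from the four cell counts alone. The sketch below copies the counts from the CrossTable output and uses base R's chisq.test without the continuity correction; the object names are chosen here for illustration.

```r
# Counts copied from the CrossTable output (rows: Work_accident, columns: left)
accident_by_left <- matrix(
  c(9428, 2000, 3402, 169),
  nrow = 2,
  dimnames = list(Work_accident = c("0", "1"), left = c("0", "1"))
)

# Share of each Work_accident group that stayed (column "0") or left (column "1")
row_rates <- prop.table(accident_by_left, margin = 1)

# Pearson chi-square test without Yates' continuity correction
test <- chisq.test(accident_by_left, correct = FALSE)

round(row_rates, 3)
test$statistic
```

Reading the rates row by row is what shows the direction of the effect; the chi-square statistic only says that the two variables are not independent.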
Promotion is also a good factor in predicting ‘left’. People who were not promoted in the last 5 years left at a higher rate (24.2%) than people who were promoted (6%).
Test 8: department and left
CrossTable(HR$department, HR$left , chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 14999
##
##
## | HR$left
## HR$department | 0 | 1 | Row Total |
## --------------|-----------|-----------|-----------|
## accounting | 563 | 204 | 767 |
## | 0.783 | 2.506 | |
## | 0.734 | 0.266 | 0.051 |
## | 0.049 | 0.057 | |
## | 0.038 | 0.014 | |
## --------------|-----------|-----------|-----------|
## hr | 524 | 215 | 739 |
## | 2.709 | 8.670 | |
## | 0.709 | 0.291 | 0.049 |
## | 0.046 | 0.060 | |
## | 0.035 | 0.014 | |
## --------------|-----------|-----------|-----------|
## IT | 954 | 273 | 1227 |
## | 0.391 | 1.252 | |
## | 0.778 | 0.222 | 0.082 |
## | 0.083 | 0.076 | |
## | 0.064 | 0.018 | |
## --------------|-----------|-----------|-----------|
## management | 539 | 91 | 630 |
## | 7.250 | 23.202 | |
## | 0.856 | 0.144 | 0.042 |
## | 0.047 | 0.025 | |
## | 0.036 | 0.006 | |
## --------------|-----------|-----------|-----------|
## marketing | 655 | 203 | 858 |
## | 0.002 | 0.008 | |
## | 0.763 | 0.237 | 0.057 |
## | 0.057 | 0.057 | |
## | 0.044 | 0.014 | |
## --------------|-----------|-----------|-----------|
## product_mng | 704 | 198 | 902 |
## | 0.408 | 1.307 | |
## | 0.780 | 0.220 | 0.060 |
## | 0.062 | 0.055 | |
## | 0.047 | 0.013 | |
## --------------|-----------|-----------|-----------|
## RandD | 666 | 121 | 787 |
## | 7.346 | 23.510 | |
## | 0.846 | 0.154 | 0.052 |
## | 0.058 | 0.034 | |
## | 0.044 | 0.008 | |
## --------------|-----------|-----------|-----------|
## sales | 3126 | 1014 | 4140 |
## | 0.255 | 0.815 | |
## | 0.755 | 0.245 | 0.276 |
## | 0.274 | 0.284 | |
## | 0.208 | 0.068 | |
## --------------|-----------|-----------|-----------|
## support | 1674 | 555 | 2229 |
## | 0.348 | 1.114 | |
## | 0.751 | 0.249 | 0.149 |
## | 0.146 | 0.155 | |
## | 0.112 | 0.037 | |
## --------------|-----------|-----------|-----------|
## technical | 2023 | 697 | 2720 |
## | 1.178 | 3.771 | |
## | 0.744 | 0.256 | 0.181 |
## | 0.177 | 0.195 | |
## | 0.135 | 0.046 | |
## --------------|-----------|-----------|-----------|
## Column Total | 11428 | 3571 | 14999 |
## | 0.762 | 0.238 | |
## --------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 86.82547 d.f. = 9 p = 7.04213e-15
##
##
##
Management and RandD are the two departments with a much lower percentage of people who left (14.4% and 15.4% respectively), compared with the other departments, all of which fall between 22% and 29%. Department seems to be a significant factor in predicting ‘left’ (chi-square = 86.83, p < 0.001).
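To compare departments directly, the leave rate can be computed per department and sorted. The sketch below copies the stayed and left counts from the CrossTable output; the vector names are chosen here for illustration.

```r
# Counts copied from the CrossTable output (left = 0 and left = 1 per department)
stayed <- c(accounting = 563, hr = 524, IT = 954, management = 539,
            marketing = 655, product_mng = 704, RandD = 666,
            sales = 3126, support = 1674, technical = 2023)
left_n <- c(204, 215, 273, 91, 203, 198, 121, 1014, 555, 697)
names(left_n) <- names(stayed)

# Share of each department's employees who left
leave_rate <- left_n / (stayed + left_n)

# Lowest leave rate first
sort(round(leave_rate, 3))
```

Sorting makes the gap visible: management and RandD sit at the bottom, hr at the top, and the remaining departments are bunched closely together.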
Test 9: salary and left
CrossTable(HR$salary, HR$left, chisq = T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 14999
##
##
## | HR$left
## HR$salary | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 1 | 1155 | 82 | 1237 |
## | 47.915 | 153.339 | |
## | 0.934 | 0.066 | 0.082 |
## | 0.101 | 0.023 | |
## | 0.077 | 0.005 | |
## -------------|-----------|-----------|-----------|
## 2 | 5129 | 1317 | 6446 |
## | 9.648 | 30.876 | |
## | 0.796 | 0.204 | 0.430 |
## | 0.449 | 0.369 | |
## | 0.342 | 0.088 | |
## -------------|-----------|-----------|-----------|
## 3 | 5144 | 2172 | 7316 |
## | 33.200 | 106.247 | |
## | 0.703 | 0.297 | 0.488 |
## | 0.450 | 0.608 | |
## | 0.343 | 0.145 | |
## -------------|-----------|-----------|-----------|
## Column Total | 11428 | 3571 | 14999 |
## | 0.762 | 0.238 | |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 381.225 d.f. = 2 p = 1.652087e-83
##
##
##
In this test, salary ‘1’ is the high level, ‘2’ is medium, and ‘3’ is low. The share of people who left rises as salary falls: 6.6% for high, 20.4% for medium, and 29.7% for low. Salary is a significant factor to use as well (chi-square = 381.23, p < 0.001).
2.2. Are any of the independent variables correlated with each other?
Test 1: Correlation test of continuous IVs
cors2 <- cor(HR1 %>%
  select(-left, -promotion_last_5years, -Work_accident))
corrplot(cors2, method = 'number')

The correlations among the numeric IVs show that ‘last_evaluation’, ‘number_project’, and ‘average_montly_hours’ are positively related. This probably means that the evaluation partly reflects the number of projects and the hours worked: employees may receive high evaluations because they take on more tasks or hours. From the employees' side, however, the number of projects seems to affect satisfaction adversely.
Test 2: Correlation test of satisfaction and number of projects
cor.test(HR1$satisfaction_level, HR1$number_project)
##
## Pearson's product-moment correlation
##
## data: HR1$satisfaction_level and HR1$number_project
## t = -17.69, df = 14997, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1586105 -0.1272570
## sample estimates:
## cor
## -0.1429696
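The negative correlation between satisfaction and number of projects is statistically significant (p < 2.2e-16) but weak (r = -0.14). It gives only modest support to the hypothesis that the more projects employees work on, the lower their satisfaction.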
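The t statistic in a Pearson correlation test follows directly from r and the degrees of freedom, which helps explain how a small correlation can still produce a very small p-value in a large sample. The sketch below uses the r and df values copied from the output above; the variable names are chosen here for illustration.

```r
# Values copied from the cor.test output for satisfaction_level and number_project
r  <- -0.1429696
df <- 14997

# t statistic of the Pearson correlation test: t = r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of variance in one variable associated with the other
r_squared <- r^2

t_stat
r_squared
```

The large t comes mostly from the sqrt(df) term; r squared is about 0.02, so number of projects accounts for only about 2% of the variance in satisfaction.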
Test 3: Correlation test of last evaluation and number of projects
cor.test(HR1$last_evaluation, HR1$number_project)
##
## Pearson's product-moment correlation
##
## data: HR1$last_evaluation and HR1$number_project
## t = 45.656, df = 14997, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3352028 0.3633053
## sample estimates:
## cor
## 0.3493326
Test 4: Correlation test of last evaluation and average monthly hours
cor.test(HR1$last_evaluation, HR1$average_montly_hours)
##
## Pearson's product-moment correlation
##
## data: HR1$last_evaluation and HR1$average_montly_hours
## t = 44.237, df = 14997, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3255078 0.3538218
## sample estimates:
## cor
## 0.3397418
From tests 3 and 4, employees who work on more projects and for more hours tend to receive higher evaluations (r = 0.35 and r = 0.34 respectively).
3. Create a model using a decision tree predicting attrition:
3.1. Creating the tree and finding the most influential variables
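# Classification tree predicting 'left' from all other variables
tree1 <- rpart(left ~ ., data = HR, method = "class")
# extra = 101 labels each node with class counts and its share of observations
rpart.plot(tree1, shadow.col = "gray", tweak = 0.8, type = 1,
           extra = 101, fallen.leaves = FALSE, box.palette = "Purples")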

3.2. Pruning as/if necessary
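# Row of the complexity table where the tree has 6 splits
cp6 <- which(tree1$cptable[, "nsplit"] == 6)
# Prune back to that complexity parameter
final_tree <- prune(tree1, cp = tree1$cptable[cp6, "CP"])
rpart.plot(final_tree, shadow.col = "gray", tweak = 0.8, type = 1,
           extra = 101, fallen.leaves = FALSE, box.palette = "Blues",
           main = "Final Tree 1")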

# final_tree was pruned above; only the node labels change (extra = 5 shows class proportions)
rpart.plot(final_tree, shadow.col = "gray", tweak = 0.8, type = 1,
           extra = 5, fallen.leaves = FALSE, box.palette = "Blues",
           main = "Final Tree 2")

4. Explanation of the tree.
I have two trees, both pruned at nsplit = 6, so they have the same structure. The only difference is that Final Tree 1 shows counts and Final Tree 2 shows proportions. Seeing both the counts and the proportions makes the tree easier to interpret. I decided to prune the tree at 6 splits because the branches at splits 7, 8, and 9 contain very few observations compared with the size of the data, so rules based on them might not be reliable if we later use the tree to predict.
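A different way to choose the pruning point, not the one used in this report, is to take the complexity parameter with the lowest cross-validated error in the cptable. The sketch below illustrates that alternative on the kyphosis data set that ships with rpart, not on the HR data; fit, best_row, and pruned are names chosen here for illustration.

```r
library(rpart)

# Cross-validation inside rpart is random, so fix the seed for repeatability
set.seed(42)

# Example classification tree on rpart's built-in kyphosis data (not the HR data)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# cptable columns: CP, nsplit, rel error, xerror (cross-validated error), xstd
cp_table <- fit$cptable

# Row with the smallest cross-validated error
best_row <- which.min(cp_table[, "xerror"])

# Prune back to the complexity parameter of that row
pruned <- prune(fit, cp = cp_table[best_row, "CP"])

cp_table
pruned
```

Pruning at a fixed nsplit, as done above, keeps the tree at a size that is easy to explain; choosing by xerror lets the data decide the size, and the two approaches need not agree.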
The decision to leave is most likely based on employees' satisfaction with the company. The first split is on satisfaction: people who rate their satisfaction at 0.46 or higher make up 76% of the 14,999 observations, and most of them stay.
Group 1: Of the 76% of people with satisfaction level >= 0.46, 90% have worked for the company less than 4.5 years and tend to stay; only 1% left.
Group 2: For another 10% of people with satisfaction >= 0.46, those with last evaluation < 0.8 tend to stay as well; only 4% of this group left.
Group 3: 24% of the 14,999 employees have satisfaction level < 0.46. 39% of these people had more than 2.5 projects (that is, three or more) and are still working for the company even though they do not feel satisfied, as long as satisfaction >= 0.11. If their satisfaction is < 0.11, they are likely to leave.
Group 4: People with satisfaction level < 0.46 and number_project < 2.5 (that is, two projects) seem not to have had enough work to do. That is probably the reason they did not feel satisfied with the company, and they left.
I think the interesting point of the decision tree is how it connects the important variables together. Having run many correlation tests, t-tests, and crosstable tests to see the significance of the dependent variable (DV) against the independent variables (IVs), and of the independent variables among themselves, we can judge which ones are important for the model later on. Moreover, the tree connects these variables and shows which cutoff point matters in each one. Plotting the tree makes it easier to see which variables affect the DV the most, and it may help in interpreting predictive models.