Our example concerns a big company that wants to understand why some of their best and most experienced employees are leaving prematurely. The company also wishes to predict which valuable employees will leave next.
We have two goals: first, we want to understand why valuable employees leave, and second, we want to predict who will leave next.
Therefore, we propose to work with the HR department to gather relevant data about the employees and to communicate the significant effect that could explain and predict employees’ departure.
For our almost 15000 employees we know: satisfaction level, latest evaluation (yearly), number of project worked on, average monthly hours, time spend in the company (in years), work accident (within the past 2 years), promotion within the past 5 years, department and salary.
At this stage, we read the data set in R and visualize its length and breadth
hr <- read.csv(paste("HR_comma_sep.csv", sep = " "))
nrow(hr)
## [1] 14999
ncol(hr)
## [1] 10
This table describe the characteristics of each features of our ABT. We can see different statistical measures of central tendency and variation. For example we can see that our attrition rate is equal to 24%, the satisfaction level is around 62% and the performance average is around 71%. We see that on average people work on 3 to 4 projects a year and about 200 hours per months.
summary(hr)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years sales salary
## Min. :0.00000 sales :4140 high :1237
## 1st Qu.:0.00000 technical :2720 low :7316
## Median :0.00000 support :2229 medium:6446
## Mean :0.02127 IT :1227
## 3rd Qu.:0.00000 product_mng: 902
## Max. :1.00000 marketing : 858
## (Other) :2923
table1 <- table(hr$satisfaction_level,hr$left)
table2 <- table(hr$promotion_last_5years, hr$left)
head(table1)
##
## 0 1
## 0.09 0 195
## 0.1 0 358
## 0.11 0 335
## 0.12 26 4
## 0.13 51 3
## 0.14 63 10
table2
##
## 0 1
## 0 11128 3552
## 1 300 19
table3 <- table(xtabs(~left+satisfaction_level+promotion_last_5years, data = hr))
table3
##
## 0 1 2 3 4 5 6 7 8 9 10 11 17 20 24 25 26 27
## 121 19 23 17 16 13 10 3 14 2 4 1 1 1 1 1 1 1
## 28 29 30 31 33 35 36 37 38 39 40 41 42 43 44 45 46 47
## 2 1 1 2 3 1 1 2 1 1 2 1 4 1 2 2 1 2
## 50 51 52 56 57 58 62 63 64 65 66 67 68 69 71 75 96 108
## 1 5 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1
## 109 126 131 140 152 154 155 158 161 163 164 165 167 168 169 170 173 174
## 1 1 1 1 1 2 2 1 1 3 3 1 2 1 2 3 1 1
## 176 178 179 180 182 183 184 185 186 187 191 194 195 196 197 198 202 204
## 2 3 1 2 1 2 2 1 1 1 2 2 3 1 1 2 2 1
## 205 210 215 218 332 358
## 1 1 1 1 1 1
Taking visuals of only those people who left:
hr_hist <- hr[hr$left==1,]
par(mfrow=c(1,3))
hist(hr_hist$satisfaction_level,col="#3090C7", main = "Satisfaction level")
hist(hr_hist$last_evaluation,col="#3090C7", main = "Last evaluation")
hist(hr_hist$average_montly_hours,col="#3090C7", main = "Average montly hours")
boxplot(hr_hist$satisfaction_level, main = "Satisfaction level")
boxplot(hr_hist$last_evaluation, main = "Last evaluation")
boxplot(hr_hist$average_montly_hours, main = "Average monthly hours")
We can see why we don’t want to retain everybody. Some people don’t work well as we can see from their evaluation, but clearly there are also many good workers that leave.
par(mfrow=c(1,2))
hist(hr_hist$Work_accident,col="#3090C7", main = "Work accident")
plot(hr_hist$salary,col="#3090C7", main = "Salary")
The total number of people that left:
nrow(hr_hist)
## [1] 3571
More problematic, here are the total of employees that received an evaluation above average, or spend at least four years in the company, or were working on more than 5 projects at the same time and still have left the company. These are the people the company should have retained.
hr_good_leaving_people <- hr_hist[hr_hist$last_evaluation >= 0.70 | hr_hist$time_spend_company >= 4 | hr_hist$number_project > 5,]
nrow(hr_good_leaving_people)
## [1] 2014
HR_correlation <- hr[,1:8]
M <- cor(HR_correlation)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.3
## corrplot 0.84 loaded
corrplot(M, method="circle")
On average people who leave have a low satisfaction level, they work more and didn’t get promoted within the past five years.
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.4.3
corrgram(M, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="HR Analysis")
library(car)
## Warning: package 'car' was built under R version 3.4.3
scatterplot.matrix(~left+satisfaction_level+average_montly_hours+promotion_last_5years, data = hr, main = "HR Analysis ")
## Warning: 'scatterplot.matrix' is deprecated.
## Use 'scatterplotMatrix' instead.
## See help("Deprecated") and help("car-deprecated").
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
fit1 <- lm(left~satisfaction_level, data = hr_hist)
summary(fit1)
##
## Call:
## lm(formula = left ~ satisfaction_level, data = hr_hist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.525e-12 3.200e-16 4.400e-16 5.400e-16 5.600e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+00 8.310e-16 1.203e+15 <2e-16 ***
## satisfaction_level 3.688e-16 1.619e-15 2.280e-01 0.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.554e-14 on 3569 degrees of freedom
## Multiple R-squared: 0.5, Adjusted R-squared: 0.4999
## F-statistic: 3569 on 1 and 3569 DF, p-value: < 2.2e-16
This shows that there is a relation between satisfaction level and the employee leaving and the model is approximately 50% accurate.
fit2 <- lm(left~average_montly_hours, data = hr_hist)
summary(fit2)
##
## Call:
## lm(formula = left ~ average_montly_hours, data = hr_hist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.525e-12 1.100e-16 3.300e-16 7.800e-16 9.000e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+00 1.510e-15 6.622e+14 <2e-16 ***
## average_montly_hours 5.753e-18 6.983e-18 8.240e-01 0.41
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.554e-14 on 3569 degrees of freedom
## Multiple R-squared: 0.5, Adjusted R-squared: 0.4999
## F-statistic: 3569 on 1 and 3569 DF, p-value: < 2.2e-16
This shows that there is a relation between average monthly hours spent by an employee and the employee leaving and the model is approximately 50% accurate.
fit3 <- lm(left~promotion_last_5years, data = hr_hist)
summary(fit3)
##
## Call:
## lm(formula = left ~ promotion_last_5years, data = hr_hist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.525e-12 4.300e-16 4.300e-16 4.300e-16 4.300e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+00 4.285e-16 2.334e+15 <2e-16 ***
## promotion_last_5years 4.296e-16 5.875e-15 7.300e-02 0.942
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.554e-14 on 3569 degrees of freedom
## Multiple R-squared: 0.5, Adjusted R-squared: 0.4998
## F-statistic: 3569 on 1 and 3569 DF, p-value: < 2.2e-16
This shows that there is a relation between whether the employee was promoted in the last 5 years and the employee leaving and the model is approximately 50% accurate.
fit4 <- lm(left~number_project, data = hr_hist)
summary(fit4)
##
## Call:
## lm(formula = left ~ number_project, data = hr_hist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.525e-12 -9.000e-17 3.900e-16 8.700e-16 8.700e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+00 1.002e-15 9.981e+14 <2e-16 ***
## number_project 2.399e-16 2.351e-16 1.021e+00 0.307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.553e-14 on 3569 degrees of freedom
## Multiple R-squared: 0.5, Adjusted R-squared: 0.4998
## F-statistic: 3569 on 1 and 3569 DF, p-value: < 2.2e-16
This shows that there is a relation between whether the number of projects the employee has worked on and the employee leaving and the model is approximately 50% accurate.
t.test(hr_hist$left,hr_hist$satisfaction_level)
##
## Welch Two Sample t-test
##
## data: hr_hist$left and hr_hist$satisfaction_level
## t = 126.77, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.5512424 0.5685615
## sample estimates:
## mean of x mean of y
## 1.000000 0.440098
The null hypothesis can be rejected.
t.test(hr_hist$left,hr_hist$number_project)
##
## Welch Two Sample t-test
##
## data: hr_hist$left and hr_hist$number_project
## t = -93.852, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.915156 -2.795849
## sample estimates:
## mean of x mean of y
## 1.000000 3.855503
The null hypothesis can be rejected.
t.test(hr_hist$left,hr_hist$average_montly_hours)
##
## Welch Two Sample t-test
##
## data: hr_hist$left and hr_hist$average_montly_hours
## t = -201.55, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -208.4272 -204.4112
## sample estimates:
## mean of x mean of y
## 1.0000 207.4192
The null hypothesis can be rejected.
t.test(hr_hist$left,hr_hist$promotion_last_5years)
##
## Welch Two Sample t-test
##
## data: hr_hist$left and hr_hist$promotion_last_5years
## t = 816.95, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.9922922 0.9970665
## sample estimates:
## mean of x mean of y
## 1.000000000 0.005320638
The null hypothesis can be rejected.
On average valuable employees that leave are not satisfayed, work on many projects, spend many hours in the company each month and aren’t promoted.