Project Title : HR Analysis: Why are our best and most experienced employees leaving prematurely?

Name : Ananya Gupta

Email : ananyagupta6@gmail.com

College : Manipal Institute of Technology

Our example concerns a big company that wants to understand why some of their best and most experienced employees are leaving prematurely. The company also wishes to predict which valuable employees will leave next.

Analytic solution

We have two goals: first, we want to understand why valuable employees leave, and second, we want to predict who will leave next.

Therefore, we propose to work with the HR department to gather relevant data about the employees and to communicate the significant effect that could explain and predict employees’ departure.

Assessment

For our almost 15000 employees we know: satisfaction level, latest evaluation (yearly), number of project worked on, average monthly hours, time spend in the company (in years), work accident (within the past 2 years), promotion within the past 5 years, department and salary.

At this stage, we read the data set in R and visualize its length and breadth

hr <- read.csv(paste("HR_comma_sep.csv", sep = " "))
nrow(hr)
## [1] 14999
ncol(hr)
## [1] 10

Data Description Analysis

This table describe the characteristics of each features of our ABT. We can see different statistical measures of central tendency and variation. For example we can see that our attrition rate is equal to 24%, the satisfaction level is around 62% and the performance average is around 71%. We see that on average people work on 3 to 4 projects a year and about 200 hours per months.

summary(hr)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923

Relation between the employees who left and their satisfaction levels and whether they were promoted or not

table1 <- table(hr$satisfaction_level,hr$left)
table2 <- table(hr$promotion_last_5years, hr$left)
head(table1)
##       
##          0   1
##   0.09   0 195
##   0.1    0 358
##   0.11   0 335
##   0.12  26   4
##   0.13  51   3
##   0.14  63  10
table2
##    
##         0     1
##   0 11128  3552
##   1   300    19
table3 <- table(xtabs(~left+satisfaction_level+promotion_last_5years, data = hr))
table3
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  17  20  24  25  26  27 
## 121  19  23  17  16  13  10   3  14   2   4   1   1   1   1   1   1   1 
##  28  29  30  31  33  35  36  37  38  39  40  41  42  43  44  45  46  47 
##   2   1   1   2   3   1   1   2   1   1   2   1   4   1   2   2   1   2 
##  50  51  52  56  57  58  62  63  64  65  66  67  68  69  71  75  96 108 
##   1   5   1   2   2   1   1   2   1   1   1   1   1   1   1   1   1   1 
## 109 126 131 140 152 154 155 158 161 163 164 165 167 168 169 170 173 174 
##   1   1   1   1   1   2   2   1   1   3   3   1   2   1   2   3   1   1 
## 176 178 179 180 182 183 184 185 186 187 191 194 195 196 197 198 202 204 
##   2   3   1   2   1   2   2   1   1   1   2   2   3   1   1   2   2   1 
## 205 210 215 218 332 358 
##   1   1   1   1   1   1

Who is leaving?

Taking visuals of only those people who left:

hr_hist <- hr[hr$left==1,]
par(mfrow=c(1,3))
hist(hr_hist$satisfaction_level,col="#3090C7", main = "Satisfaction level") 
hist(hr_hist$last_evaluation,col="#3090C7", main = "Last evaluation")
hist(hr_hist$average_montly_hours,col="#3090C7", main = "Average montly hours")

boxplot(hr_hist$satisfaction_level, main = "Satisfaction level")

boxplot(hr_hist$last_evaluation, main = "Last evaluation")

boxplot(hr_hist$average_montly_hours, main = "Average monthly hours")

We can see why we don’t want to retain everybody. Some people don’t work well as we can see from their evaluation, but clearly there are also many good workers that leave.

par(mfrow=c(1,2))
hist(hr_hist$Work_accident,col="#3090C7", main = "Work accident")
plot(hr_hist$salary,col="#3090C7", main = "Salary")

The total number of people that left:

nrow(hr_hist)
## [1] 3571

More problematic, here are the total of employees that received an evaluation above average, or spend at least four years in the company, or were working on more than 5 projects at the same time and still have left the company. These are the people the company should have retained.

hr_good_leaving_people <- hr_hist[hr_hist$last_evaluation >= 0.70 | hr_hist$time_spend_company >= 4 | hr_hist$number_project > 5,]
nrow(hr_good_leaving_people)
## [1] 2014

Correlation matrix

HR_correlation <- hr[,1:8]
M <- cor(HR_correlation)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.3
## corrplot 0.84 loaded
corrplot(M, method="circle")

On average people who leave have a low satisfaction level, they work more and didn’t get promoted within the past five years.

Corrgram

library(corrgram)
## Warning: package 'corrgram' was built under R version 3.4.3
corrgram(M, order=TRUE, lower.panel=panel.shade,
  upper.panel=panel.pie, text.panel=panel.txt,
  main="HR Analysis")

Scatterplot matrix

library(car)
## Warning: package 'car' was built under R version 3.4.3
scatterplot.matrix(~left+satisfaction_level+average_montly_hours+promotion_last_5years, data = hr, main = "HR Analysis ")
## Warning: 'scatterplot.matrix' is deprecated.
## Use 'scatterplotMatrix' instead.
## See help("Deprecated") and help("car-deprecated").
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

Proving the relation between whether an employee leaves and certain factors

fit1 <- lm(left~satisfaction_level, data = hr_hist)
summary(fit1)
## 
## Call:
## lm(formula = left ~ satisfaction_level, data = hr_hist)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.525e-12  3.200e-16  4.400e-16  5.400e-16  5.600e-16 
## 
## Coefficients:
##                     Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)        1.000e+00  8.310e-16 1.203e+15   <2e-16 ***
## satisfaction_level 3.688e-16  1.619e-15 2.280e-01     0.82    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.554e-14 on 3569 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:  0.4999 
## F-statistic:  3569 on 1 and 3569 DF,  p-value: < 2.2e-16

This shows that there is a relation between satisfaction level and the employee leaving and the model is approximately 50% accurate.

fit2 <- lm(left~average_montly_hours, data = hr_hist)
summary(fit2)
## 
## Call:
## lm(formula = left ~ average_montly_hours, data = hr_hist)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.525e-12  1.100e-16  3.300e-16  7.800e-16  9.000e-16 
## 
## Coefficients:
##                       Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)          1.000e+00  1.510e-15 6.622e+14   <2e-16 ***
## average_montly_hours 5.753e-18  6.983e-18 8.240e-01     0.41    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.554e-14 on 3569 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:  0.4999 
## F-statistic:  3569 on 1 and 3569 DF,  p-value: < 2.2e-16

This shows that there is a relation between average monthly hours spent by an employee and the employee leaving and the model is approximately 50% accurate.

fit3 <- lm(left~promotion_last_5years, data = hr_hist)
summary(fit3)
## 
## Call:
## lm(formula = left ~ promotion_last_5years, data = hr_hist)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.525e-12  4.300e-16  4.300e-16  4.300e-16  4.300e-16 
## 
## Coefficients:
##                        Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)           1.000e+00  4.285e-16 2.334e+15   <2e-16 ***
## promotion_last_5years 4.296e-16  5.875e-15 7.300e-02    0.942    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.554e-14 on 3569 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:  0.4998 
## F-statistic:  3569 on 1 and 3569 DF,  p-value: < 2.2e-16

This shows that there is a relation between whether the employee was promoted in the last 5 years and the employee leaving and the model is approximately 50% accurate.

fit4 <- lm(left~number_project, data = hr_hist)
summary(fit4)
## 
## Call:
## lm(formula = left ~ number_project, data = hr_hist)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.525e-12 -9.000e-17  3.900e-16  8.700e-16  8.700e-16 
## 
## Coefficients:
##                 Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)    1.000e+00  1.002e-15 9.981e+14   <2e-16 ***
## number_project 2.399e-16  2.351e-16 1.021e+00    0.307    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.553e-14 on 3569 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:  0.4998 
## F-statistic:  3569 on 1 and 3569 DF,  p-value: < 2.2e-16

This shows that there is a relation between whether the number of projects the employee has worked on and the employee leaving and the model is approximately 50% accurate.

T tests

t.test(hr_hist$left,hr_hist$satisfaction_level)
## 
##  Welch Two Sample t-test
## 
## data:  hr_hist$left and hr_hist$satisfaction_level
## t = 126.77, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5512424 0.5685615
## sample estimates:
## mean of x mean of y 
##  1.000000  0.440098

The null hypothesis can be rejected.

t.test(hr_hist$left,hr_hist$number_project)
## 
##  Welch Two Sample t-test
## 
## data:  hr_hist$left and hr_hist$number_project
## t = -93.852, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.915156 -2.795849
## sample estimates:
## mean of x mean of y 
##  1.000000  3.855503

The null hypothesis can be rejected.

t.test(hr_hist$left,hr_hist$average_montly_hours)
## 
##  Welch Two Sample t-test
## 
## data:  hr_hist$left and hr_hist$average_montly_hours
## t = -201.55, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -208.4272 -204.4112
## sample estimates:
## mean of x mean of y 
##    1.0000  207.4192

The null hypothesis can be rejected.

t.test(hr_hist$left,hr_hist$promotion_last_5years)
## 
##  Welch Two Sample t-test
## 
## data:  hr_hist$left and hr_hist$promotion_last_5years
## t = 816.95, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9922922 0.9970665
## sample estimates:
##   mean of x   mean of y 
## 1.000000000 0.005320638

The null hypothesis can be rejected.

On average valuable employees that leave are not satisfayed, work on many projects, spend many hours in the company each month and aren’t promoted.