Name : Jayesh Dosi

Project Title : HUMAN RESOURCE ANALYSIS

E-mail : jd131296@gmail.com

College : Institute of Engineering and Technology (IET), DAVV

This is the capstone project submitted during the Data Analytics Internship under the guidance of Prof. Sameer Mathur.

1. Introduction

Employee attrition is one of the biggest challenges that companies face. There are several factors that lead to attrition. While it may not be easy to control all of them, it may still be worthwhile to look into the factors that seem controllable. Factors such as the average number of hours spent per month by employees, salary, promotions, job rotation, and the number of projects are a few that are easier to manage. Our example concerns a big company that wants to understand why some of its best and most experienced employees are leaving prematurely. The company also wishes to predict which valuable employees will leave next. If we can extract cut-off levels for some of the above-mentioned factors through our analysis, we should gain a better understanding of the factors responsible for employees leaving the company prematurely.

2. Overview of the Study

The objective of this study is to predict whether an employee is going to stay or leave. We will calculate the probability of an employee leaving (resigning).
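
The linear models later in this report focus on satisfaction; for the stay/leave probability itself, one common approach is logistic regression. The snippet below is only an illustrative sketch, not part of the original analysis; it assumes the HR_comma_sep.csv file read in the Appendix and the column names shown in Section 2.2, and the names logit_fit and leave_prob are hypothetical.

# Sketch (assumption): estimate each employee's probability of leaving
# with a logistic regression on the 'left' flag.
hrdata.df <- read.csv("HR_comma_sep.csv")

logit_fit <- glm(left ~ satisfaction_level + last_evaluation + number_project +
                   average_montly_hours + time_spend_company + Work_accident +
                   promotion_last_5years + salary,
                 data = hrdata.df, family = binomial)

# Predicted probability of leaving for every employee, highest first
hrdata.df$leave_prob <- predict(logit_fit, type = "response")
head(hrdata.df[order(-hrdata.df$leave_prob),
               c("satisfaction_level", "salary", "leave_prob")])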

2.1 Data

This dataset is simulated and was downloaded from Kaggle. The question it poses is: why are our best and most experienced employees leaving prematurely? The task is to explore the data and predict which valuable employees will leave next. Fields in the dataset include:

Data Dictionary

Variable Name : Variable Definition
Satisfaction Level : Employee Satisfaction (can be interpreted as a %)
Last evaluation : Employee Evaluation (can be interpreted as a %)
Projects : Number of Projects (per year)
Average monthly hours : Average monthly hours
Time spent at company : Time spent at company
Accident : Whether they have had a work accident
Promotion Last 5 yrs : Whether they have had a promotion in the last 5 years
Positions : Type of Job Position
Salary : Salary level (1= low, 2= medium, 3= high)
Left : Whether the employee has left (0= remains employed, 1= left)
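
Note that in the raw file salary is stored as an unordered text factor ("high", "low", "medium") rather than the 1/2/3 coding above. A minimal sketch of recoding it into an ordered variable that matches the dictionary (the name salary_ord is hypothetical):

# Sketch (assumption): map the text salary levels onto the dictionary's
# 1 = low, 2 = medium, 3 = high ordering, without overwriting the original column.
salary_ord <- factor(hrdata.df$salary, levels = c("low", "medium", "high"),
                     ordered = TRUE)
table(salary_ord)   # counts at level 1 (low), 2 (medium), 3 (high)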

2.2 Data Structure

## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...

2.3 Model

What factors increase job satisfaction?

The next step of our analysis involves performing some modeling. For this, we create a scaled data frame of our original data called 'p'. For our question, "What factors increase job satisfaction?", we begin by including all variables. This gives a fairly complex model, and from it we want to work towards the simplest model with the best significance and error rates.
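
The scaling step itself is not shown in this report, so the snippet below is only one plausible way to build the scaled frame 'p', assuming the hrdata.df frame from the Appendix: standardise the numeric columns and leave the factor columns untouched.

# Sketch (assumption): build the scaled frame 'p' by standardising numeric columns.
p <- hrdata.df
num_cols <- sapply(p, is.numeric)
p[num_cols] <- scale(p[num_cols])
summary(p[num_cols])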

Full Model

The first model uses the independent variables in the dataset (all except job position) to predict the dependent variable, satisfaction level. In this model we see that average monthly hours, number of projects, time spent at company, last evaluation, and whether or not the employee left have significant p-values and are therefore significant indicator variables. R-squared and adjusted R-squared are both about 0.19. This is not typically considered a good value, but it is rather common in analyses of human behavior. For example, in psychology studies this R-squared level would not invalidate the model, especially when the p-values indicate significance.

fullmodel <- lm(satisfaction_level ~ salary + average_montly_hours + number_project +
                  time_spend_company + promotion_last_5years + last_evaluation +
                  Work_accident + left)
summary(fullmodel)
## 
## Call:
## lm(formula = satisfaction_level ~ salary + average_montly_hours + 
##     number_project + time_spend_company + promotion_last_5years + 
##     last_evaluation + Work_accident + left)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64740 -0.13677 -0.01193  0.17004  0.52773 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            6.148e-01  1.159e-02  53.065  < 2e-16 ***
## salarylow              1.200e-02  6.961e-03   1.724   0.0847 .  
## salarymedium           1.306e-02  6.956e-03   1.878   0.0604 .  
## average_montly_hours   1.913e-04  4.127e-05   4.636 3.58e-06 ***
## number_project        -4.090e-02  1.691e-03 -24.183  < 2e-16 ***
## time_spend_company    -5.525e-03  1.295e-03  -4.267 2.00e-05 ***
## promotion_last_5years  9.285e-03  1.272e-02   0.730   0.4655    
## last_evaluation        2.460e-01  1.167e-02  21.071  < 2e-16 ***
## Work_accident         -3.356e-05  5.238e-03  -0.006   0.9949    
## left                  -2.241e-01  4.449e-03 -50.360  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2227 on 14989 degrees of freedom
## Multiple R-squared:  0.1982, Adjusted R-squared:  0.1977 
## F-statistic: 411.6 on 9 and 14989 DF,  p-value: < 2.2e-16

Revised Model

Because we want the simplest model possible, this model drops the predictor variables that appeared insignificant (the 'left' flag is also set aside here). It has even lower R-squared and adjusted R-squared values of about 0.05, which again can be attributed to the fact that human behavior is very difficult to predict. Most of the variables remain significant, except for average monthly hours. Based on these p-values, we conclude the model is still viable and remove the remaining insignificant variable for the next model.

revisedmodel<- lm(satisfaction_level ~ average_montly_hours + number_project + time_spend_company + last_evaluation)
summary(revisedmodel)
## 
## Call:
## lm(formula = satisfaction_level ~ average_montly_hours + number_project + 
##     time_spend_company + last_evaluation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.61923 -0.19061  0.02274  0.19617  0.59000 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.152e-01  1.066e-02   57.70   <2e-16 ***
## average_montly_hours  5.183e-05  4.469e-05    1.16    0.246    
## number_project       -3.894e-02  1.835e-03  -21.23   <2e-16 ***
## time_spend_company   -1.498e-02  1.383e-03  -10.83   <2e-16 ***
## last_evaluation       2.622e-01  1.266e-02   20.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2417 on 14994 degrees of freedom
## Multiple R-squared:  0.05522,    Adjusted R-squared:  0.05497 
## F-statistic: 219.1 on 4 and 14994 DF,  p-value: < 2.2e-16

Final Model

This final model displays similar multiple R-squared and adjusted R-squared values, and it contains only significant variables. Based on this, we conclude that the number of projects, time spent at the company, and last evaluation are significant predictors of job satisfaction.

finalmodel <- lm(satisfaction_level ~ number_project + time_spend_company + last_evaluation)
summary(finalmodel)
## 
## Call:
## lm(formula = satisfaction_level ~ number_project + time_spend_company + 
##     last_evaluation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62420 -0.19154  0.02245  0.19690  0.59039 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.620334   0.009683   64.06   <2e-16 ***
## number_project     -0.038241   0.001732  -22.08   <2e-16 ***
## time_spend_company -0.014918   0.001382  -10.80   <2e-16 ***
## last_evaluation     0.265490   0.012334   21.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2417 on 14995 degrees of freedom
## Multiple R-squared:  0.05514,    Adjusted R-squared:  0.05495 
## F-statistic: 291.7 on 3 and 14995 DF,  p-value: < 2.2e-16
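
Since the three models are nested, the trade-off between simplicity and fit can be checked directly with a nested F-test and an AIC comparison. This is a sketch, assuming the fullmodel, revisedmodel, and finalmodel objects fitted above are still in the workspace.

# Sketch: compare the nested models fitted above.
anova(finalmodel, revisedmodel, fullmodel)   # sequential F-tests between nested models
AIC(finalmodel, revisedmodel, fullmodel)     # lower AIC indicates a better fit/complexity trade-off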

3. Conclusion

Result

From this analysis we get the answers to the two questions stated earlier.

Why are our best and most experienced employees leaving prematurely?

High-salaried employees show a different pattern of leaving the company compared with medium- and low-salaried employees. This needs further analysis and is outside the scope of this report. The key indicators to watch for are: employees who have left had a satisfaction level below 0.5; average monthly hours above 200 are associated with employees leaving the company; and employees also tend to leave after spending, on average, about 4 years at the company.
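
Turning these cut-offs into a watch list is straightforward; the sketch below flags current employees matching the pattern (the cut-off values are those quoted above, and the at_risk name is hypothetical).

# Sketch: current employees who match the attrition pattern described above.
at_risk <- subset(hrdata.df,
                  left == 0 &
                    satisfaction_level < 0.5 &
                    average_montly_hours > 200 &
                    time_spend_company >= 4)
nrow(at_risk)    # number of currently employed staff matching the pattern
head(at_risk)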

Which employee will leave next?

The next employee predicted to leave is employee number 7, who has a low salary, a satisfaction level below 0.5, and is putting in more than 200 average monthly hours. The probability of this employee leaving is 70%.
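
As an illustration of how such a per-employee probability could be produced (an assumption, using the hypothetical logistic model sketched in the Overview rather than the report's own calculation):

# Sketch (assumption): probability of leaving for employee number 7 (row 7 of the data),
# using the illustrative logistic model 'logit_fit' from the Overview sketch.
predict(logit_fit, newdata = hrdata.df[7, ], type = "response")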

Interesting Insights

The main conclusion we can draw from our analysis is that the number of projects, time spent at the company, and last evaluation are significant predictors of job satisfaction. Therefore, students should seek positions where they plan to remain employed for longer periods of time, work on a significant number of projects, and have the potential to receive high evaluations or other forms of positive feedback.

The well-balanced worker who has recently been promoted is the happiest. Employees who are over-worked or under-worked are relatively dissatisfied. For students, this could mean that jobs with more opportunity for growth could be a better choice in terms of happiness. For example, a job offer with a higher initial salary but less opportunity for growth could be less satisfying than a job with a lower initial salary but more room for improvement and advancement. In addition, company managers could use this insight to set up a better system for vertical growth, perhaps with more levels of employment or more opportunities for improvement, to increase the satisfaction of their workforce.

Surprisingly, employees at the lowest salary level (level 1) are actually the most satisfied. This shows that more money does not necessarily increase job satisfaction and, in fact, has the potential to decrease it. The happiest employees work on 3 or 4 projects each year, and people with the most extreme hours are typically very dissatisfied. Employees with poor or excellent evaluations are the most satisfied, and employees who have left the company were typically on the higher end of average monthly hours. Employees who left worked on fewer than 3 projects, received poor or excellent performance ratings, were not promoted in the last 5 years, and most did not have a work accident. We see that the last evaluations of employees who left have a bimodal distribution, with large numbers at each extreme.
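
A quick numeric check of these patterns (a sketch using base R's aggregate): mean hours, projects, and last evaluation, split by whether the employee left.

# Sketch: compare leavers (left = 1) and stayers (left = 0) on the key fields.
aggregate(cbind(average_montly_hours, number_project, last_evaluation) ~ left,
          data = hrdata.df, FUN = mean)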

In terms of time spent at the company, there is a large spike in satisfaction for employees who have remained with the company for about 2.5 years; satisfaction levels decrease past this peak. However, loyal employees who have remained with the company for more than 6 years tend to be happier in their positions.
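
The tenure effect described above can be checked with a simple group summary (a sketch):

# Sketch: average satisfaction by years spent at the company.
aggregate(satisfaction_level ~ time_spend_company, data = hrdata.df, FUN = mean)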

4. Appendix

Read the dataset into R and inspect its dimensions (number of rows and columns).

hrdata.df <- read.csv("HR_comma_sep.csv")
head(hrdata.df)
##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years sales salary
## 1                  3             0    1                     0 sales    low
## 2                  6             0    1                     0 sales medium
## 3                  4             0    1                     0 sales medium
## 4                  5             0    1                     0 sales    low
## 5                  3             0    1                     0 sales    low
## 6                  3             0    1                     0 sales    low
attach(hrdata.df)
dim(hrdata.df)
## [1] 14999    10

Create descriptive statistics (min, max, median, etc.) for each variable.

library(psych)
describe(hrdata.df)
##                       vars     n   mean    sd median trimmed   mad   min
## satisfaction_level       1 14999   0.61  0.25   0.64    0.63  0.28  0.09
## last_evaluation          2 14999   0.72  0.17   0.72    0.72  0.22  0.36
## number_project           3 14999   3.80  1.23   4.00    3.74  1.48  2.00
## average_montly_hours     4 14999 201.05 49.94 200.00  200.64 65.23 96.00
## time_spend_company       5 14999   3.50  1.46   3.00    3.28  1.48  2.00
## Work_accident            6 14999   0.14  0.35   0.00    0.06  0.00  0.00
## left                     7 14999   0.24  0.43   0.00    0.17  0.00  0.00
## promotion_last_5years    8 14999   0.02  0.14   0.00    0.00  0.00  0.00
## sales*                   9 14999   6.94  2.75   8.00    7.23  2.97  1.00
## salary*                 10 14999   2.35  0.63   2.00    2.41  1.48  1.00
##                       max  range  skew kurtosis   se
## satisfaction_level      1   0.91 -0.48    -0.67 0.00
## last_evaluation         1   0.64 -0.03    -1.24 0.00
## number_project          7   5.00  0.34    -0.50 0.01
## average_montly_hours  310 214.00  0.05    -1.14 0.41
## time_spend_company     10   8.00  1.85     4.77 0.01
## Work_accident           1   1.00  2.02     2.08 0.00
## left                    1   1.00  1.23    -0.49 0.00
## promotion_last_5years   1   1.00  6.64    42.03 0.00
## sales*                 10   9.00 -0.79    -0.62 0.02
## salary*                 3   2.00 -0.42    -0.67 0.01

Create one-way contingency tables for the categorical variables in your dataset.

table(number_project)
## number_project
##    2    3    4    5    6    7 
## 2388 4055 4365 2761 1174  256
table(time_spend_company)
## time_spend_company
##    2    3    4    5    6    7    8   10 
## 3244 6443 2557 1473  718  188  162  214
table(Work_accident)
## Work_accident
##     0     1 
## 12830  2169
table(left)
## left
##     0     1 
## 11428  3571
table(promotion_last_5years)
## promotion_last_5years
##     0     1 
## 14680   319
table(sales)
## sales
##  accounting          hr          IT  management   marketing product_mng 
##         767         739        1227         630         858         902 
##       RandD       sales     support   technical 
##         787        4140        2229        2720
table(salary)
## salary
##   high    low medium 
##   1237   7316   6446

Create two-way contingency tables for the categorical variables in your dataset.

table(number_project,time_spend_company)
##               time_spend_company
## number_project    2    3    4    5    6    7    8   10
##              2  224 1854  136   83   53   16   12   10
##              3 1255 1782  530  135  139   58   62   94
##              4 1144 1798  577  445  215   64   46   76
##              5  554  866  431  592  224   38   34   22
##              6   66  136  673  180   87   12    8   12
##              7    1    7  210   38    0    0    0    0
table(Work_accident,left)
##              left
## Work_accident    0    1
##             0 9428 3402
##             1 2000  169
table(promotion_last_5years,left)
##                      left
## promotion_last_5years     0     1
##                     0 11128  3552
##                     1   300    19
table(sales,promotion_last_5years)
##              promotion_last_5years
## sales            0    1
##   accounting   753   14
##   hr           724   15
##   IT          1224    3
##   management   561   69
##   marketing    815   43
##   product_mng  902    0
##   RandD        760   27
##   sales       4040  100
##   support     2209   20
##   technical   2692   28
table(salary,sales)
##         sales
## salary   accounting   hr   IT management marketing product_mng RandD sales
##   high           74   45   83        225        80          68    51   269
##   low           358  335  609        180       402         451   364  2099
##   medium        335  359  535        225       376         383   372  1772
##         sales
## salary   support technical
##   high       141       201
##   low       1146      1372
##   medium     942      1147
table(number_project,salary)
##               salary
## number_project high  low medium
##              2  140 1344    904
##              3  408 1791   1856
##              4  368 2087   1910
##              5  245 1317   1199
##              6   73  633    468
##              7    3  144    109

Draw boxplots of the variables relevant to the study.

boxplot(satisfaction_level ~ left, horizontal=TRUE,
           ylab="left", xlab="Satisfaction level", las=1,
           main="Analysis of employee left on the basis of their satisfaction level",
           col=c("red","blue")
           )

boxplot(average_montly_hours ~ left, horizontal=TRUE,
           ylab="left", xlab="Average monthly hours", las=1,
           main="Analysis of employee left on the basis of average monthly hours spent",
           col=c("red","blue")
           )

boxplot(time_spend_company ~ left, horizontal=TRUE,
           ylab="left", xlab="Time spent at company", las=1,
           main="Analysis of employee left on the basis of time spent at company",
           col=c("red","blue")
           )

boxplot(last_evaluation ~ left, horizontal=TRUE,
           ylab="left", xlab="Last evaluation", las=1,
           main="Analysis of employee left on the basis of last evaluation",
           col=c("red","blue")
           )

Draw histograms for the suitable data fields.

hist(satisfaction_level, main = "Satisfaction level Distribution", xlab = "Satisfaction level")

hist(last_evaluation, main = "Last Evaluation Distribution", xlab = "last evaluation")

hist(average_montly_hours, main = "Average monthly hours Distribution", xlab = "Average monthly hours")

hist(time_spend_company, main = "Time Spend at company Distribution", xlab = "Time Spend")

Draw suitable plots for the data fields.

plot(y=salary, x=sales,
     col="light blue",
     main="Relationship between salary and sales",
     ylab="Salary", xlab="Sales")

plot(y=satisfaction_level, x=salary,
     col="red",
     main="Relationship between satisfaction level and salary",
     ylab="Satisfaction level", xlab="Salary")

plot(y=satisfaction_level, x=sales,
     col="light green",
     main="Relationship between satisfaction level and sales",
     ylab="Satisfaction level", xlab="Sales")

library(corrplot)
## corrplot 0.84 loaded
correlationMatrix <- cor(hrdata.df[,c(1:8)])
corrplot(correlationMatrix, method="circle")

Create a correlation matrix.

cor(hrdata.df[ ,c(1,2,3,4,5,6,7,8)])
##                       satisfaction_level last_evaluation number_project
## satisfaction_level            1.00000000     0.105021214   -0.142969586
## last_evaluation               0.10502121     1.000000000    0.349332589
## number_project               -0.14296959     0.349332589    1.000000000
## average_montly_hours         -0.02004811     0.339741800    0.417210634
## time_spend_company           -0.10086607     0.131590722    0.196785891
## Work_accident                 0.05869724    -0.007104289   -0.004740548
## left                         -0.38837498     0.006567120    0.023787185
## promotion_last_5years         0.02560519    -0.008683768   -0.006063958
##                       average_montly_hours time_spend_company
## satisfaction_level            -0.020048113       -0.100866073
## last_evaluation                0.339741800        0.131590722
## number_project                 0.417210634        0.196785891
## average_montly_hours           1.000000000        0.127754910
## time_spend_company             0.127754910        1.000000000
## Work_accident                 -0.010142888        0.002120418
## left                           0.071287179        0.144822175
## promotion_last_5years         -0.003544414        0.067432925
##                       Work_accident        left promotion_last_5years
## satisfaction_level      0.058697241 -0.38837498           0.025605186
## last_evaluation        -0.007104289  0.00656712          -0.008683768
## number_project         -0.004740548  0.02378719          -0.006063958
## average_montly_hours   -0.010142888  0.07128718          -0.003544414
## time_spend_company      0.002120418  0.14482217           0.067432925
## Work_accident           1.000000000 -0.15462163           0.039245435
## left                   -0.154621634  1.00000000          -0.061788107
## promotion_last_5years   0.039245435 -0.06178811           1.000000000

Visualize your correlation matrix using corrgram.

library(corrgram)
corrgram(hrdata.df, lower.panel = panel.shade, upper.panel = panel.pie,
         text.panel = panel.txt, main = "Corrgram of all variables")

Create a scatter plot matrix for your data set.

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(formula = ~ left + satisfaction_level + time_spend_company +
                    Work_accident + average_montly_hours,
                  data = hrdata.df, smooth = TRUE)
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

Run suitable tests to check the hypotheses under the stated assumptions.

cor.test(left,satisfaction_level)
## 
##  Pearson's product-moment correlation
## 
## data:  left and satisfaction_level
## t = -51.613, df = 14997, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4018809 -0.3747001
## sample estimates:
##       cor 
## -0.388375
cor.test(left,time_spend_company)
## 
##  Pearson's product-moment correlation
## 
## data:  left and time_spend_company
## t = 17.924, df = 14997, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1291176 0.1604541
## sample estimates:
##       cor 
## 0.1448222
cor.test(left,average_montly_hours)
## 
##  Pearson's product-moment correlation
## 
## data:  left and average_montly_hours
## t = 8.7523, df = 14997, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05534652 0.08719151
## sample estimates:
##        cor 
## 0.07128718
cor.test(left,last_evaluation)
## 
##  Pearson's product-moment correlation
## 
## data:  left and last_evaluation
## t = 0.80424, df = 14997, p-value = 0.4213
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.009437678  0.022568555
## sample estimates:
##        cor 
## 0.00656712
cor.test(left,number_project)
## 
##  Pearson's product-moment correlation
## 
## data:  left and number_project
## t = 2.9139, df = 14997, p-value = 0.003575
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.007786343 0.039775850
## sample estimates:
##        cor 
## 0.02378719

Run t-tests to analyse the hypotheses.

t.test(satisfaction_level~left)
## 
##  Welch Two Sample t-test
## 
## data:  satisfaction_level by left
## t = 46.636, df = 5167, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2171815 0.2362417
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6668096       0.4400980
t.test(time_spend_company~left)
## 
##  Welch Two Sample t-test
## 
## data:  time_spend_company by left
## t = -22.631, df = 9625.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5394767 -0.4534706
## sample estimates:
## mean in group 0 mean in group 1 
##        3.380032        3.876505
t.test(average_montly_hours~left)
## 
##  Welch Two Sample t-test
## 
## data:  average_montly_hours by left
## t = -7.5323, df = 4875.1, p-value = 5.907e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.534631  -6.183384
## sample estimates:
## mean in group 0 mean in group 1 
##        199.0602        207.4192
t.test(last_evaluation~left)
## 
##  Welch Two Sample t-test
## 
## data:  last_evaluation by left
## t = -0.72534, df = 5154.9, p-value = 0.4683
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.009772224  0.004493874
## sample estimates:
## mean in group 0 mean in group 1 
##       0.7154734       0.7181126
t.test(number_project~left)
## 
##  Welch Two Sample t-test
## 
## data:  number_project by left
## t = -2.1663, df = 4236.5, p-value = 0.03034
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.131136535 -0.006540119
## sample estimates:
## mean in group 0 mean in group 1 
##        3.786664        3.855503