HR Analytics

Name: “Parv Patni”
Email: “parvpatni@gmail.com”
College: “SRM University, Chennai”

Introduction

This project concerns a large company that wants to understand why some of its best and most experienced employees are leaving prematurely. The company also wants to predict which valuable employees will leave next.

We have two goals: first, to understand why valuable employees leave; second, to predict who will leave next.

Therefore, we propose to work with the HR department to gather relevant data about the employees and to communicate the significant effects that could explain and predict employee departures.

Unfortunately, managers didn’t keep an organised record of why people left, but we can still find some explanations in the data set provided by the HR department.

For our roughly 15,000 employees we know: satisfaction level,
latest evaluation (yearly),
number of projects worked on,
average monthly hours,
time spent in the company (in years),
work accident (within the past 2 years),
promotion within the past 5 years,
department and salary.

Data exploration

Reading the CSV file into a data frame and basic description of the dataset

getwd()

## [1] "C:/Users/parvp/Desktop/data analytics internship"

hr.df<-read.csv("HR_comma_sep.csv", stringsAsFactors = TRUE)  # keep text columns as factors (the default before R 4.0)
dim(hr.df)

## [1] 14999    10

Our dataset consists of 14999 rows and 10 columns.

library(psych)

## Warning: package 'psych' was built under R version 3.4.3

describe(hr.df)

##                       vars     n   mean    sd median trimmed   mad   min
## satisfaction_level       1 14999   0.61  0.25   0.64    0.63  0.28  0.09
## last_evaluation          2 14999   0.72  0.17   0.72    0.72  0.22  0.36
## number_project           3 14999   3.80  1.23   4.00    3.74  1.48  2.00
## average_montly_hours     4 14999 201.05 49.94 200.00  200.64 65.23 96.00
## time_spend_company       5 14999   3.50  1.46   3.00    3.28  1.48  2.00
## Work_accident            6 14999   0.14  0.35   0.00    0.06  0.00  0.00
## left                     7 14999   0.24  0.43   0.00    0.17  0.00  0.00
## promotion_last_5years    8 14999   0.02  0.14   0.00    0.00  0.00  0.00
## sales*                   9 14999   6.94  2.75   8.00    7.23  2.97  1.00
## salary*                 10 14999   2.35  0.63   2.00    2.41  1.48  1.00
##                       max  range  skew kurtosis   se
## satisfaction_level      1   0.91 -0.48    -0.67 0.00
## last_evaluation         1   0.64 -0.03    -1.24 0.00
## number_project          7   5.00  0.34    -0.50 0.01
## average_montly_hours  310 214.00  0.05    -1.14 0.41
## time_spend_company     10   8.00  1.85     4.77 0.01
## Work_accident           1   1.00  2.02     2.08 0.00
## left                    1   1.00  1.23    -0.49 0.00
## promotion_last_5years   1   1.00  6.64    42.03 0.00
## sales*                 10   9.00 -0.79    -0.62 0.02
## salary*                 3   2.00 -0.42    -0.67 0.01

We can see different statistical measures of central tendency and variation. For example, the attrition rate is about 24%, the mean satisfaction level is about 61% and the mean last-evaluation score is about 72%. On average, people work on 3 to 4 projects and about 200 hours per month.

One-way contingency tables

mytable <- with(hr.df,table(left))
mytable

## left
##     0     1 
## 11428  3571

prop.table(mytable)*100

## left
##        0        1 
## 76.19175 23.80825

mytable1 <- with(hr.df,table(promotion_last_5years))
mytable1

## promotion_last_5years
##     0     1 
## 14680   319

prop.table(mytable1)*100

## promotion_last_5years
##         0         1 
## 97.873192  2.126808

mytable2 <- with(hr.df,table(salary))
mytable2

## salary
##   high    low medium 
##   1237   7316   6446

prop.table(mytable2)*100

## salary
##      high       low    medium 
##  8.247216 48.776585 42.976198

mytable3 <- with(hr.df,table(Work_accident))
mytable3

## Work_accident
##     0     1 
## 12830  2169

prop.table(mytable3)*100

## Work_accident
##        0        1 
## 85.53904 14.46096

Two-way contingency tables

mytable4 <- xtabs(~ left+promotion_last_5years, data=hr.df)
mytable4

##     promotion_last_5years
## left     0     1
##    0 11128   300
##    1  3552    19

prop.table(mytable4, 1)*100

##     promotion_last_5years
## left          0          1
##    0 97.3748687  2.6251313
##    1 99.4679362  0.5320638

mytable5 <-xtabs(~left+salary,data=hr.df)
mytable5

##     salary
## left high  low medium
##    0 1155 5144   5129
##    1   82 2172   1317

margin.table(mytable5,2)

## salary
##   high    low medium 
##   1237   7316   6446

prop.table(mytable5, 2)

##     salary
## left       high        low     medium
##    0 0.93371059 0.70311646 0.79568725
##    1 0.06628941 0.29688354 0.20431275

mytable6 <- xtabs(~left+Work_accident,data=hr.df)
mytable6

##     Work_accident
## left    0    1
##    0 9428 2000
##    1 3402  169

margin.table(mytable6,1)

## left
##     0     1 
## 11428  3571

prop.table(mytable6, 1)

##     Work_accident
## left          0          1
##    0 0.82499125 0.17500875
##    1 0.95267432 0.04732568
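
Note that prop.table(mytable6, 1) gives row proportions, that is, the share of work accidents among stayers and among leavers. The complementary view, the attrition rate within each accident group, uses margin 2. The following is a minimal sketch that rebuilds the table from the counts printed above so that it runs without the CSV file; the object names tab and by_accident are illustrative and not part of the original analysis.

```r
# Rebuild the left x Work_accident table from the printed counts
# (matrix() fills column by column: Work_accident = 0 first, then 1).
tab <- as.table(matrix(c(9428, 3402, 2000, 169), nrow = 2,
                       dimnames = list(left = c("0", "1"),
                                       Work_accident = c("0", "1"))))

# Column proportions: share of stayers/leavers within each accident group.
by_accident <- prop.table(tab, 2)
round(by_accident, 4)
```

Read this way, about 26.5% of employees without a work accident left, against about 7.8% of those who had one.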

library(gmodels)

## Warning: package 'gmodels' was built under R version 3.4.3

CrossTable(hr.df$left,hr.df$salary)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14999 
## 
##  
##              | hr.df$salary 
##   hr.df$left |      high |       low |    medium | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##            0 |      1155 |      5144 |      5129 |     11428 | 
##              |    47.915 |    33.200 |     9.648 |           | 
##              |     0.101 |     0.450 |     0.449 |     0.762 | 
##              |     0.934 |     0.703 |     0.796 |           | 
##              |     0.077 |     0.343 |     0.342 |           | 
## -------------|-----------|-----------|-----------|-----------|
##            1 |        82 |      2172 |      1317 |      3571 | 
##              |   153.339 |   106.247 |    30.876 |           | 
##              |     0.023 |     0.608 |     0.369 |     0.238 | 
##              |     0.066 |     0.297 |     0.204 |           | 
##              |     0.005 |     0.145 |     0.088 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |      1237 |      7316 |      6446 |     14999 | 
##              |     0.082 |     0.488 |     0.430 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
##

CrossTable(hr.df$left,hr.df$promotion_last_5years)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14999 
## 
##  
##              | hr.df$promotion_last_5years 
##   hr.df$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |     11128 |       300 |     11428 | 
##              |     0.290 |    13.343 |           | 
##              |     0.974 |     0.026 |     0.762 | 
##              |     0.758 |     0.940 |           | 
##              |     0.742 |     0.020 |           | 
## -------------|-----------|-----------|-----------|
##            1 |      3552 |        19 |      3571 | 
##              |     0.928 |    42.702 |           | 
##              |     0.995 |     0.005 |     0.238 | 
##              |     0.242 |     0.060 |           | 
##              |     0.237 |     0.001 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |     14680 |       319 |     14999 | 
##              |     0.979 |     0.021 |           | 
## -------------|-----------|-----------|-----------|
## 
##

Boxplot

attach(hr.df)
boxplot(hr.df$satisfaction_level,col="yellow",main="Satisfaction level",horizontal = TRUE)

boxplot(hr.df$last_evaluation,col="blue",main="Last Evaluation",horizontal = TRUE)

boxplot(hr.df$number_project,col="green",main="Number of project",horizontal = TRUE)

boxplot(hr.df$average_montly_hours,col="red",main="Average monthly hours",horizontal = TRUE)

boxplot(time_spend_company,horizontal = TRUE,main="Time Spend in the Company",xlab="IN YEARS",col = "blue")

boxplot(satisfaction_level ~left  ,data=hr.df, main="Distribution of satisfaction by attrition status", ylab="satisfaction level", xlab="left",col= "lightblue")

boxplot(satisfaction_level ~promotion_last_5years  ,data=hr.df, main="Distribution of satisfaction with promotion", ylab="satisfaction level", xlab="promotion in last 5 years",col= "peachpuff")

boxplot(number_project ~left  ,data=hr.df, main="Distribution of number of projects", ylab="number of projects", xlab="left",col= "lightblue")

boxplot(satisfaction_level ~salary  ,data=hr.df, main="Distribution of satisfaction with salary", ylab="satisfaction level", xlab=" salary",col= "blue")

Histogram

# Recode the 0/1 indicator columns as no/yes factors
hr.df$Work_accident <- factor(hr.df$Work_accident, levels = c(0, 1), labels = c("no", "yes"))

hr.df$promotion_last_5years <- factor(hr.df$promotion_last_5years, levels = c(0, 1), labels = c("no", "yes"))

hr.df$left <- factor(hr.df$left, levels = c(0, 1), labels = c("no", "yes"))

Let’s visualise the number of employees who left through a pie chart.

pie(table(hr.df$left),main="Employees who Left", col=c("lightblue","khaki"))

The pie chart shows that approximately a quarter of the employees in the data set left their job.

library(lattice)

## Warning: package 'lattice' was built under R version 3.4.3

histogram(~left, data = hr.df,
 main = "Frequency of human resource leaving the company", xlab="left", col='lightgreen' )

histogram(~satisfaction_level,data=hr.df,main="Frequency of satisfaction level",col="lightblue")

histogram(~last_evaluation,data=hr.df,main="frequency of last evaluation",col="yellow")

histogram(~salary,data=hr.df,main="frequency of salary",col="darkolivegreen1")

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.3

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(hr.df,  aes( sales,fill = sales) ) + geom_bar()

ggplot(hr.df, aes(satisfaction_level, number_project))+ scale_x_continuous("satisfaction level", breaks = seq(0,1,0.1))+scale_y_continuous("Projects", breaks = seq(2,7,1)) + labs(title="satisfaction level vs number of projects") + geom_jitter(colour="darkolivegreen1")

The least satisfied employees tended to work on more projects than the more satisfied ones.
The more satisfied employees mostly worked on 3 to 5 projects.

Scatterplot

library(car)

## Warning: package 'car' was built under R version 3.4.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplot(satisfaction_level~number_project,     data=hr.df,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of satisfaction level vs number of project",
            xlab="number of project",
            ylab="satisfaction level")

scatterplot(satisfaction_level~average_montly_hours,data=hr.df,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of satisfaction level vs average working hours",
            xlab="average working hours",
            ylab="satisfaction level")

scatterplot(satisfaction_level,last_evaluation,main="Satisfaction level vs last evaluation",pch = 16)

pairs(formula= ~satisfaction_level + average_montly_hours+ salary +last_evaluation + number_project, cex=0.6, data=hr.df)

Correlation Matrix

hr1.df<-read.csv("HR_comma_sep.csv", stringsAsFactors = TRUE)  # fresh copy with the original 0/1 coding
cor(hr1.df[,1:8])

##                       satisfaction_level last_evaluation number_project
## satisfaction_level            1.00000000     0.105021214   -0.142969586
## last_evaluation               0.10502121     1.000000000    0.349332589
## number_project               -0.14296959     0.349332589    1.000000000
## average_montly_hours         -0.02004811     0.339741800    0.417210634
## time_spend_company           -0.10086607     0.131590722    0.196785891
## Work_accident                 0.05869724    -0.007104289   -0.004740548
## left                         -0.38837498     0.006567120    0.023787185
## promotion_last_5years         0.02560519    -0.008683768   -0.006063958
##                       average_montly_hours time_spend_company
## satisfaction_level            -0.020048113       -0.100866073
## last_evaluation                0.339741800        0.131590722
## number_project                 0.417210634        0.196785891
## average_montly_hours           1.000000000        0.127754910
## time_spend_company             0.127754910        1.000000000
## Work_accident                 -0.010142888        0.002120418
## left                           0.071287179        0.144822175
## promotion_last_5years         -0.003544414        0.067432925
##                       Work_accident        left promotion_last_5years
## satisfaction_level      0.058697241 -0.38837498           0.025605186
## last_evaluation        -0.007104289  0.00656712          -0.008683768
## number_project         -0.004740548  0.02378719          -0.006063958
## average_montly_hours   -0.010142888  0.07128718          -0.003544414
## time_spend_company      0.002120418  0.14482217           0.067432925
## Work_accident           1.000000000 -0.15462163           0.039245435
## left                   -0.154621634  1.00000000          -0.061788107
## promotion_last_5years   0.039245435 -0.06178811           1.000000000

Corrplot

library(magrittr)

## Warning: package 'magrittr' was built under R version 3.4.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.4.3

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

HR_correlation <- hr1.df %>% select(satisfaction_level:promotion_last_5years)
M <- cor(HR_correlation)
library(corrplot)

## Warning: package 'corrplot' was built under R version 3.4.3

## corrplot 0.84 loaded

corrplot(M, method="number")

On average, people who leave have a lower satisfaction level (correlation with left: -0.39), work slightly more hours (0.07) and are less likely to have been promoted within the past five years (-0.06).
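
To make the reading of the matrix easier, the correlations with left can be ranked by absolute size. This is a small sketch that copies the left column of the correlation matrix printed above into a named vector, so it runs without the data file; the names cor_with_left and ranked are illustrative.

```r
# Correlations of each variable with `left`, copied from the matrix above.
cor_with_left <- c(
  satisfaction_level    = -0.38837498,
  last_evaluation       =  0.00656712,
  number_project        =  0.02378719,
  average_montly_hours  =  0.07128718,
  time_spend_company    =  0.14482217,
  Work_accident         = -0.15462163,
  promotion_last_5years = -0.06178811
)

# Sort by absolute value, strongest association first.
ranked <- cor_with_left[order(abs(cor_with_left), decreasing = TRUE)]
round(ranked, 3)
```

Satisfaction level stands out clearly; work accidents and time spent in the company follow, and the remaining correlations are weak.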

Corrgram

library(corrgram)

## Warning: package 'corrgram' was built under R version 3.4.3

corrgram(hr.df,upper.panel = panel.pie,main="Corrgram")

Chi-square test (test of independence)

H0: The two variables are independent

H1: The two variables are not independent

mytable6

##     Work_accident
## left    0    1
##    0 9428 2000
##    1 3402  169

chisq.test(mytable6)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mytable6
## X-squared = 357.56, df = 1, p-value < 2.2e-16

p-value < 0.05, so we reject H0: the variables “left” and “Work_accident” are not independent.

mytable4

##     promotion_last_5years
## left     0     1
##    0 11128   300
##    1  3552    19

chisq.test(mytable4)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mytable4
## X-squared = 56.262, df = 1, p-value = 6.344e-14

p-value < 0.05, so we reject H0: the variables “left” and “promotion_last_5years” are not independent.

mytable5

##     salary
## left high  low medium
##    0 1155 5144   5129
##    1   82 2172   1317

chisq.test(mytable5)

## 
##  Pearson's Chi-squared test
## 
## data:  mytable5
## X-squared = 381.23, df = 2, p-value < 2.2e-16

p-value < 0.05, so we reject H0: the variables “left” and “salary” are not independent.
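
To show where the X-squared value for left and salary comes from, the statistic can be recomputed by hand from the observed counts. This is a minimal sketch using the counts printed above, so it does not need the CSV file; the object names observed, expected and chi_sq are illustrative.

```r
# Observed counts of left (rows) by salary (columns), from the table above.
observed <- matrix(c(1155, 82, 5144, 2172, 5129, 1317), nrow = 2,
                   dimnames = list(left = c("0", "1"),
                                   salary = c("high", "low", "medium")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic, degrees of freedom and upper-tail p-value.
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

round(chi_sq, 2)
df
```

The individual terms of the sum are the "Chi-square contribution" values shown in the CrossTable output for left and salary.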

Hypothesis (H1): Employees who were promoted work more average monthly hours than employees who were not promoted.

t.test(average_montly_hours~promotion_last_5years)

## 
##  Welch Two Sample t-test
## 
## data:  average_montly_hours by promotion_last_5years
## t = 0.44937, df = 333.03, p-value = 0.6535
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.143788  6.597589
## sample estimates:
## mean in group 0 mean in group 1 
##        201.0764        199.8495

p-value > 0.05. Result: there is no significant difference in average monthly hours between employees who were promoted and those who were not.

Hypothesis (H1): Employees who were promoted have spent more time in the company than employees who were not promoted.

t.test(time_spend_company~promotion_last_5years)

## 
##  Welch Two Sample t-test
## 
## data:  time_spend_company by promotion_last_5years
## t = -5.6111, df = 324.14, p-value = 4.316e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9216896 -0.4431601
## sample estimates:
## mean in group 0 mean in group 1 
##        3.483719        4.166144

p-value < 0.05 implies the rejection of the null hypothesis (H0) in favour of H1. Result: employees who were promoted have spent more time in the company (mean 4.17 years) than employees who were not promoted (mean 3.48 years).

Hypothesis (H1): Employees who left worked more average monthly hours than employees who stayed.

t.test(average_montly_hours~left)

## 
##  Welch Two Sample t-test
## 
## data:  average_montly_hours by left
## t = -7.5323, df = 4875.1, p-value = 5.907e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.534631  -6.183384
## sample estimates:
## mean in group 0 mean in group 1 
##        199.0602        207.4192

p-value < 0.05 implies the rejection of the null hypothesis (H0). Result: employees who left worked more average monthly hours (mean 207.4) than employees who stayed (mean 199.1).

Hypothesis (H1): Employees who were promoted are more satisfied than employees who were not promoted.

t.test(satisfaction_level~promotion_last_5years)

## 
##  Welch Two Sample t-test
## 
## data:  satisfaction_level by promotion_last_5years
## t = -3.6545, df = 337.3, p-value = 0.0002987
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06787297 -0.02037446
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6118951       0.6560188

p-value < 0.05 implies the rejection of the null hypothesis (H0).

Result: employees who were promoted are more satisfied (mean 0.656) than employees who were not promoted (mean 0.612).
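
The hypotheses in this section are directional, whereas t.test() reports a two-sided p-value by default. As a rough sketch, the one-sided p-value for the last test can be recovered from the reported t statistic and degrees of freedom with pt(). The numbers are copied from the output above; the variable names are illustrative.

```r
# Reported Welch test of satisfaction_level by promotion_last_5years.
t_stat <- -3.6545
df     <- 337.3

# Two-sided p-value, as printed by t.test().
p_two_sided <- 2 * pt(-abs(t_stat), df)

# One-sided p-value for H1: mean(group 0) - mean(group 1) < 0,
# i.e. promoted employees (group 1) are more satisfied.
p_one_sided <- pt(t_stat, df)

signif(p_two_sided, 4)
signif(p_one_sided, 4)
```

With the data loaded, the same one-sided test could be requested directly by passing alternative = "less" to t.test(), since the formula interface takes the difference as group 0 minus group 1.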

Regression Models

Fitting a Linear Regression Model using lm()

model1<-lm(formula=number_project~Work_accident,data = hr1.df)
summary(model1)

## 
## Call:
## lm(formula = number_project ~ Work_accident, data = hr1.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8055 -0.8055  0.1945  1.1945  3.2112 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.80546    0.01088 349.696   <2e-16 ***
## Work_accident -0.01661    0.02862  -0.581    0.562    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.233 on 14997 degrees of freedom
## Multiple R-squared:  2.247e-05,  Adjusted R-squared:  -4.421e-05 
## F-statistic: 0.337 on 1 and 14997 DF,  p-value: 0.5616

model2<-lm(formula=average_montly_hours~salary,data = hr1.df)
summary(model2)

## 
## Call:
## lm(formula = average_montly_hours ~ salary, data = hr1.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.338  -45.338   -0.997   44.003  109.003 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   199.867      1.420 140.746   <2e-16 ***
## salarylow       1.129      1.535   0.735    0.462    
## salarymedium    1.471      1.550   0.949    0.343    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.94 on 14996 degrees of freedom
## Multiple R-squared:  6.113e-05,  Adjusted R-squared:  -7.223e-05 
## F-statistic: 0.4584 on 2 and 14996 DF,  p-value: 0.6323

model3<-lm(formula=satisfaction_level~salary,data = hr1.df)
summary(model3)

## 
## Call:
## lm(formula = satisfaction_level ~ salary, data = hr1.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54747 -0.17182  0.02925  0.19925  0.39925 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.637470   0.007061  90.284  < 2e-16 ***
## salarylow    -0.036717   0.007634  -4.809 1.53e-06 ***
## salarymedium -0.015653   0.007709  -2.031   0.0423 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2483 on 14996 degrees of freedom
## Multiple R-squared:  0.002522,   Adjusted R-squared:  0.002389 
## F-statistic: 18.96 on 2 and 14996 DF,  p-value: 5.967e-09

model4<-lm(formula=satisfaction_level ~ number_project,data = hr1.df)
summary(model4)

## 
## Call:
## lm(formula = satisfaction_level ~ number_project, data = hr1.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56483 -0.21483  0.03169  0.20401  0.45052 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.722509   0.006517  110.86   <2e-16 ***
## number_project -0.028839   0.001630  -17.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2461 on 14997 degrees of freedom
## Multiple R-squared:  0.02044,    Adjusted R-squared:  0.02037 
## F-statistic: 312.9 on 1 and 14997 DF,  p-value: < 2.2e-16

model5<-lm(formula=average_montly_hours ~ satisfaction_level,data = hr1.df)
summary(model5)
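
The linear models above do not address the second goal from the introduction, predicting who will leave. Because left is a binary outcome, logistic regression with glm() is one common choice. The sketch below only illustrates the mechanics on simulated data, so that it runs without the CSV file: the data frame sim, the simulated coefficients and the column p_leave are all assumptions, not results from the HR data.

```r
set.seed(42)

# Simulated stand-in for the HR data: lower satisfaction -> more likely to leave.
n   <- 2000
sim <- data.frame(satisfaction_level = runif(n, min = 0.09, max = 1))
sim$left <- rbinom(n, size = 1, prob = plogis(1 - 4 * sim$satisfaction_level))

# Logistic regression of the binary outcome on satisfaction level.
fit <- glm(left ~ satisfaction_level, data = sim, family = binomial)

# Predicted probability of leaving for each (simulated) employee.
sim$p_leave <- predict(fit, type = "response")

# Employees with the highest predicted risk come first.
head(sim[order(-sim$p_leave), ])
```

On the real data the same call would use hr1.df, where left is still coded 0/1, and could include further predictors such as salary or time_spend_company.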