This project concerns a big company that wants to understand why some of their best and most experienced employees are leaving prematurely. The company also wishes to predict which valuable employees will leave next.

getwd()
## [1] "C:/Users/parvp/Desktop/data analytics internship"
hr.df <- read.csv("HR_comma_sep.csv")
dim(hr.df)
## [1] 14999    10

Our dataset consists of 14999 rows and 10 columns.

library(psych)
## Warning: package 'psych' was built under R version 3.4.3
describe(hr.df)
##                       vars     n   mean    sd median trimmed   mad   min
## satisfaction_level       1 14999   0.61  0.25   0.64    0.63  0.28  0.09
## last_evaluation          2 14999   0.72  0.17   0.72    0.72  0.22  0.36
## number_project           3 14999   3.80  1.23   4.00    3.74  1.48  2.00
## average_montly_hours     4 14999 201.05 49.94 200.00  200.64 65.23 96.00
## time_spend_company       5 14999   3.50  1.46   3.00    3.28  1.48  2.00
## Work_accident            6 14999   0.14  0.35   0.00    0.06  0.00  0.00
## left                     7 14999   0.24  0.43   0.00    0.17  0.00  0.00
## promotion_last_5years    8 14999   0.02  0.14   0.00    0.00  0.00  0.00
## sales*                   9 14999   6.94  2.75   8.00    7.23  2.97  1.00
## salary*                 10 14999   2.35  0.63   2.00    2.41  1.48  1.00
##                       max  range  skew kurtosis   se
## satisfaction_level      1   0.91 -0.48    -0.67 0.00
## last_evaluation         1   0.64 -0.03    -1.24 0.00
## number_project          7   5.00  0.34    -0.50 0.01
## average_montly_hours  310 214.00  0.05    -1.14 0.41
## time_spend_company     10   8.00  1.85     4.77 0.01
## Work_accident           1   1.00  2.02     2.08 0.00
## left                    1   1.00  1.23    -0.49 0.00
## promotion_last_5years   1   1.00  6.64    42.03 0.00
## sales*                 10   9.00 -0.79    -0.62 0.02
## salary*                 3   2.00 -0.42    -0.67 0.01

The table reports measures of central tendency and spread for each variable. For example, the attrition rate is about 24% (mean of left = 0.24), the mean satisfaction level is 0.61, and the mean last-evaluation score is 0.72. On average, employees work on 3 to 4 projects (mean 3.80) and about 200 hours per month.
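Because left, Work_accident and promotion_last_5years are coded 0/1, their means in the table are simply proportions. A minimal sketch, rebuilding the left column from the counts in this report's one-way table of left (the name `left_flag` is illustrative and not part of the original analysis):

```r
# Rebuild the 0/1 attrition indicator from the reported counts
# (11428 stayed, 3571 left); `left_flag` is an illustrative name.
left_flag <- rep(c(0, 1), times = c(11428, 3571))

# The mean of a 0/1 variable is the share of ones
attrition_rate <- mean(left_flag)
round(100 * attrition_rate, 2)
```

The same reasoning applies to the other two indicator columns, so their means can be read directly as rates.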

One-way contingency tables

mytable <- with(hr.df,table(left))
mytable
## left
##     0     1 
## 11428  3571
prop.table(mytable)*100
## left
##        0        1 
## 76.19175 23.80825
mytable1 <- with(hr.df,table(promotion_last_5years))
mytable1
## promotion_last_5years
##     0     1 
## 14680   319
prop.table(mytable1)*100
## promotion_last_5years
##         0         1 
## 97.873192  2.126808
mytable2 <- with(hr.df,table(salary))
mytable2
## salary
##   high    low medium 
##   1237   7316   6446
prop.table(mytable2)*100
## salary
##      high       low    medium 
##  8.247216 48.776585 42.976198
mytable3 <- with(hr.df,table(Work_accident))
mytable3
## Work_accident
##     0     1 
## 12830  2169
prop.table(mytable3)*100
## Work_accident
##        0        1 
## 85.53904 14.46096

Two-way contingency tables

mytable4 <- xtabs(~ left+promotion_last_5years, data=hr.df)
mytable4
##     promotion_last_5years
## left     0     1
##    0 11128   300
##    1  3552    19
prop.table(mytable4, 1)*100
##     promotion_last_5years
## left          0          1
##    0 97.3748687  2.6251313
##    1 99.4679362  0.5320638
mytable5 <-xtabs(~left+salary,data=hr.df)
mytable5
##     salary
## left high  low medium
##    0 1155 5144   5129
##    1   82 2172   1317
margin.table(mytable5,2) 
## salary
##   high    low medium 
##   1237   7316   6446
prop.table(mytable5, 2)
##     salary
## left       high        low     medium
##    0 0.93371059 0.70311646 0.79568725
##    1 0.06628941 0.29688354 0.20431275
mytable6 <- xtabs(~left+Work_accident,data=hr.df)
mytable6
##     Work_accident
## left    0    1
##    0 9428 2000
##    1 3402  169
margin.table(mytable6,1)
## left
##     0     1 
## 11428  3571
prop.table(mytable6, 1)
##     Work_accident
## left          0          1
##    0 0.82499125 0.17500875
##    1 0.95267432 0.04732568
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.3
CrossTable(hr.df$left,hr.df$salary)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14999 
## 
##  
##              | hr.df$salary 
##   hr.df$left |      high |       low |    medium | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##            0 |      1155 |      5144 |      5129 |     11428 | 
##              |    47.915 |    33.200 |     9.648 |           | 
##              |     0.101 |     0.450 |     0.449 |     0.762 | 
##              |     0.934 |     0.703 |     0.796 |           | 
##              |     0.077 |     0.343 |     0.342 |           | 
## -------------|-----------|-----------|-----------|-----------|
##            1 |        82 |      2172 |      1317 |      3571 | 
##              |   153.339 |   106.247 |    30.876 |           | 
##              |     0.023 |     0.608 |     0.369 |     0.238 | 
##              |     0.066 |     0.297 |     0.204 |           | 
##              |     0.005 |     0.145 |     0.088 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |      1237 |      7316 |      6446 |     14999 | 
##              |     0.082 |     0.488 |     0.430 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 
CrossTable(hr.df$left,hr.df$promotion_last_5years)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14999 
## 
##  
##              | hr.df$promotion_last_5years 
##   hr.df$left |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |     11128 |       300 |     11428 | 
##              |     0.290 |    13.343 |           | 
##              |     0.974 |     0.026 |     0.762 | 
##              |     0.758 |     0.940 |           | 
##              |     0.742 |     0.020 |           | 
## -------------|-----------|-----------|-----------|
##            1 |      3552 |        19 |      3571 | 
##              |     0.928 |    42.702 |           | 
##              |     0.995 |     0.005 |     0.238 | 
##              |     0.242 |     0.060 |           | 
##              |     0.237 |     0.001 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |     14680 |       319 |     14999 | 
##              |     0.979 |     0.021 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

Boxplot

attach(hr.df)
boxplot(hr.df$satisfaction_level,col="yellow",main="Satisfaction level",horizontal = TRUE)

boxplot(hr.df$last_evaluation,col="blue",main="Last Evaluation",horizontal = TRUE)

boxplot(hr.df$number_project,col="green",main="Number of project",horizontal = TRUE)

boxplot(hr.df$average_montly_hours,col="red",main="Average monthly hours",horizontal = TRUE)

boxplot(time_spend_company,horizontal = TRUE,main="Time Spend in the Company",xlab="IN YEARS",col = "blue")

boxplot(satisfaction_level ~ left, data=hr.df, main="Distribution of satisfaction by attrition status", ylab="satisfaction level", xlab="left", col="lightblue")

boxplot(satisfaction_level ~ promotion_last_5years, data=hr.df, main="Distribution of satisfaction by promotion", ylab="satisfaction level", xlab="promotion in last 5 years", col="peachpuff")

boxplot(number_project ~ left, data=hr.df, main="Distribution of number of projects by attrition status", ylab="number of projects", xlab="left", col="lightblue")

boxplot(satisfaction_level ~ salary, data=hr.df, main="Distribution of satisfaction by salary", ylab="satisfaction level", xlab="salary", col="blue")

Histogram

hr.df$Work_accident[hr.df$Work_accident == 1] <- 'yes'
hr.df$Work_accident[hr.df$Work_accident == 0] <- 'no'
hr.df$Work_accident <- factor(hr.df$Work_accident)

hr.df$promotion_last_5years[hr.df$promotion_last_5years ==0] <-'no'
hr.df$promotion_last_5years[hr.df$promotion_last_5years ==1] <- 'yes'
hr.df$promotion_last_5years <- factor(hr.df$promotion_last_5years)

hr.df$left[hr.df$left==0] <- 'no'
hr.df$left[hr.df$left==1] <-'yes'
hr.df$left <- factor(hr.df$left)
library(lattice)
## Warning: package 'lattice' was built under R version 3.4.3
histogram(~left, data = hr.df,
 main = "Frequency of human resource leaving the company", xlab="left", col='lightgreen' ) 

histogram(~satisfaction_level,data=hr.df,main="Frequency of satisfaction level",col="lightblue")

histogram(~last_evaluation,data=hr.df,main="frequency of last evalution",col="yellow")

histogram(~salary,data=hr.df,main="frequency of salary",col="darkolivegreen1")

Scatterplot

library(car)
## Warning: package 'car' was built under R version 3.4.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(satisfaction_level~number_project,     data=hr.df,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of satisfaction level vs number of project",
            xlab="number of project",
            ylab="satisfaction level")

scatterplot(satisfaction_level~average_montly_hours,data=hr.df,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of satisfaction level vs average working hours",
            xlab="average working hours",
            ylab="satisfaction level")

scatterplot(satisfaction_level,last_evaluation,main="Satisfaction level vs last evaluation",pch = 16)

pairs(formula= ~satisfaction_level + average_montly_hours+ salary +last_evaluation + number_project, cex=0.6, data=hr.df)

Correlation Matrix

# Reload the raw data: several hr.df columns were recoded as factors above,
# and cor() needs numeric input
hr1.df <- read.csv("HR_comma_sep.csv")
cor(hr1.df[,1:8])
##                       satisfaction_level last_evaluation number_project
## satisfaction_level            1.00000000     0.105021214   -0.142969586
## last_evaluation               0.10502121     1.000000000    0.349332589
## number_project               -0.14296959     0.349332589    1.000000000
## average_montly_hours         -0.02004811     0.339741800    0.417210634
## time_spend_company           -0.10086607     0.131590722    0.196785891
## Work_accident                 0.05869724    -0.007104289   -0.004740548
## left                         -0.38837498     0.006567120    0.023787185
## promotion_last_5years         0.02560519    -0.008683768   -0.006063958
##                       average_montly_hours time_spend_company
## satisfaction_level            -0.020048113       -0.100866073
## last_evaluation                0.339741800        0.131590722
## number_project                 0.417210634        0.196785891
## average_montly_hours           1.000000000        0.127754910
## time_spend_company             0.127754910        1.000000000
## Work_accident                 -0.010142888        0.002120418
## left                           0.071287179        0.144822175
## promotion_last_5years         -0.003544414        0.067432925
##                       Work_accident        left promotion_last_5years
## satisfaction_level      0.058697241 -0.38837498           0.025605186
## last_evaluation        -0.007104289  0.00656712          -0.008683768
## number_project         -0.004740548  0.02378719          -0.006063958
## average_montly_hours   -0.010142888  0.07128718          -0.003544414
## time_spend_company      0.002120418  0.14482217           0.067432925
## Work_accident           1.000000000 -0.15462163           0.039245435
## left                   -0.154621634  1.00000000          -0.061788107
## promotion_last_5years   0.039245435 -0.06178811           1.000000000

Corrgram

library(corrgram)
## Warning: package 'corrgram' was built under R version 3.4.3
corrgram(hr.df,upper.panel = panel.pie,main="Corrgram")

Chi-square test (test of independence)

H0: The two variables are independent

H1: The two variables are not independent

mytable6
##     Work_accident
## left    0    1
##    0 9428 2000
##    1 3402  169
chisq.test(mytable6)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mytable6
## X-squared = 357.56, df = 1, p-value < 2.2e-16

p-value < 0.05, so we reject H0: the variables “left” and “Work_accident” are not independent.
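To make the statistic less of a black box, the sketch below rebuilds the left by Work_accident table from the counts printed above, computes the expected counts under independence, and applies the Yates-corrected Pearson formula by hand. The object names (`acc_tab`, `expected`, `x2_yates`) are illustrative assumptions; only `chisq.test()` comes from the original analysis.

```r
# Counts copied from the left x Work_accident table above
acc_tab <- matrix(c(9428, 3402, 2000, 169), nrow = 2,
                  dimnames = list(left = c("0", "1"),
                                  Work_accident = c("0", "1")))

# Expected counts under independence: row total * column total / N
expected <- outer(rowSums(acc_tab), colSums(acc_tab)) / sum(acc_tab)

# Pearson statistic with Yates' continuity correction (2 x 2 tables only)
x2_yates <- sum((abs(acc_tab - expected) - 0.5)^2 / expected)

# The hand computation should agree with the built-in test
fit <- chisq.test(acc_tab)
all.equal(unname(fit$statistic), x2_yates)
```

Comparing `acc_tab` with `expected` also shows the direction of the association: fewer leavers had a work accident than independence would predict.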

mytable4
##     promotion_last_5years
## left     0     1
##    0 11128   300
##    1  3552    19
chisq.test(mytable4)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mytable4
## X-squared = 56.262, df = 1, p-value = 6.344e-14

p-value < 0.05, so we reject H0: the variables “left” and “promotion_last_5years” are not independent.
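The test says the two variables are related but not how. The row percentages computed earlier answer "what share of leavers had been promoted"; column proportions answer the more practical question "what share of promoted employees left". A small sketch using the counts from the table above (the names `promo_tab` and `col_prop` are illustrative):

```r
# Counts copied from the left x promotion_last_5years table above
promo_tab <- matrix(c(11128, 3552, 300, 19), nrow = 2,
                    dimnames = list(left = c("0", "1"),
                                    promotion_last_5years = c("0", "1")))

# Column proportions: within each promotion group, the share who stayed or left
col_prop <- prop.table(promo_tab, margin = 2)
round(100 * col_prop, 1)
```

The `left = 1` row of `col_prop` gives the leaving rate in each promotion group, which corresponds to the "N / Col Total" line of the CrossTable output.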

mytable5
##     salary
## left high  low medium
##    0 1155 5144   5129
##    1   82 2172   1317
chisq.test(mytable5)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable5
## X-squared = 381.23, df = 2, p-value < 2.2e-16

p-value < 0.05, so we reject H0: the variables “left” and “salary” are not independent.
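With almost 15000 rows, even weak associations give tiny p-values, so an effect size is a useful complement. The sketch below computes Cramér's V, which was not part of the original analysis and is added here only as an illustration. It uses the standard formula V = sqrt(X² / (N · (min(rows, cols) − 1))) on the counts from the table above; `salary_tab` and `cramers_v` are illustrative names.

```r
# Counts copied from the left x salary table above
salary_tab <- matrix(c(1155, 82, 5144, 2172, 5129, 1317), nrow = 2,
                     dimnames = list(left = c("0", "1"),
                                     salary = c("high", "low", "medium")))

fit <- chisq.test(salary_tab)   # no Yates correction for tables larger than 2 x 2

n <- sum(salary_tab)            # total number of employees
k <- min(dim(salary_tab)) - 1   # min(rows, cols) - 1

# Cramer's V: 0 means no association, 1 means perfect association
cramers_v <- sqrt(unname(fit$statistic) / (n * k))
cramers_v
```

Here V comes out at about 0.16, so the association is clearly present but far from deterministic.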

Hypothesis (H1): Employees who were promoted work more hours per month on average than employees who were not promoted.

t.test(average_montly_hours~promotion_last_5years)
## 
##  Welch Two Sample t-test
## 
## data:  average_montly_hours by promotion_last_5years
## t = 0.44937, df = 333.03, p-value = 0.6535
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.143788  6.597589
## sample estimates:
## mean in group 0 mean in group 1 
##        201.0764        199.8495

p-value > 0.05, so we fail to reject the null hypothesis (H0). Result: there is no significant difference in average monthly hours between employees who were promoted and those who were not.

Hypothesis (H1): Employees who were promoted have spent more time at the company than employees who were not promoted.

t.test(time_spend_company~promotion_last_5years)
## 
##  Welch Two Sample t-test
## 
## data:  time_spend_company by promotion_last_5years
## t = -5.6111, df = 324.14, p-value = 4.316e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9216896 -0.4431601
## sample estimates:
## mean in group 0 mean in group 1 
##        3.483719        4.166144

p-value < 0.05, so we reject the null hypothesis (H0) of equal means in favour of H1. Result: employees who were promoted have spent more time at the company on average (4.17 years) than those who were not (3.48 years).
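The hypothesis above is directional, while `t.test()` reports a two-sided p-value by default. As a rough sketch, the one-sided p-value can be recovered from the t statistic and degrees of freedom printed in the output; the variable names are illustrative, and the inputs are the rounded values from the output, so the result is approximate.

```r
# Values copied from the Welch test output above
t_stat <- -5.6111
df     <- 324.14

# Two-sided p-value, which is what t.test() reports by default
p_two_sided <- 2 * pt(-abs(t_stat), df)

# One-sided p-value for H1: mean(group 0) < mean(group 1)
p_one_sided <- pt(t_stat, df)

c(two_sided = p_two_sided, one_sided = p_one_sided)
```

The same one-sided test can be requested directly with `t.test(time_spend_company ~ promotion_last_5years, alternative = "less")`, where "less" refers to the first group (0, not promoted) minus the second.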

Hypothesis (H1): Employees who left worked more hours per month on average than employees who stayed.

t.test(average_montly_hours~left)
## 
##  Welch Two Sample t-test
## 
## data:  average_montly_hours by left
## t = -7.5323, df = 4875.1, p-value = 5.907e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.534631  -6.183384
## sample estimates:
## mean in group 0 mean in group 1 
##        199.0602        207.4192

p-value < 0.05, so we reject the null hypothesis (H0). Result: employees who left worked more hours per month on average (207.4) than employees who stayed (199.1).

Hypothesis (H1): Employees who were promoted are more satisfied than employees who were not promoted.

t.test(satisfaction_level~promotion_last_5years)
## 
##  Welch Two Sample t-test
## 
## data:  satisfaction_level by promotion_last_5years
## t = -3.6545, df = 337.3, p-value = 0.0002987
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06787297 -0.02037446
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6118951       0.6560188

p-value < 0.05, so we reject the null hypothesis (H0).

Result: employees who were promoted are more satisfied on average (0.656) than employees who were not (0.612).