Name: “Parv Patni”
Email: “parvpatni@gmail.com”
College: “SRM University, Chennai”
This project concerns a big company that wants to understand why some of their best and most experienced employees are leaving prematurely. The company also wishes to predict which valuable employees will leave next.
We have two goals: first, to understand why valuable employees leave, and second, to predict who will leave next.
Therefore, we propose to work with the HR department to gather relevant data about the employees and to communicate the significant effects that could explain and predict employee departures.
Unfortunately, managers did not keep an organised record of why people left, but we can still find some explanations in the data set provided by the HR department.
For our roughly 15,000 employees we know: satisfaction level,
latest evaluation (yearly),
number of projects worked on,
average monthly hours,
time spent in the company (in years),
work accident (within the past 2 years),
promotion within the past 5 years,
department and salary.
Reading the CSV file into a data frame and a basic description of the dataset
getwd()
## [1] "C:/Users/parvp/Desktop/data analytics internship"
hr.df <- read.csv("HR_comma_sep.csv")
dim(hr.df)
## [1] 14999 10
Our dataset consists of 14999 rows and 10 columns.
library(psych)
## Warning: package 'psych' was built under R version 3.4.3
describe(hr.df)
## vars n mean sd median trimmed mad min
## satisfaction_level 1 14999 0.61 0.25 0.64 0.63 0.28 0.09
## last_evaluation 2 14999 0.72 0.17 0.72 0.72 0.22 0.36
## number_project 3 14999 3.80 1.23 4.00 3.74 1.48 2.00
## average_montly_hours 4 14999 201.05 49.94 200.00 200.64 65.23 96.00
## time_spend_company 5 14999 3.50 1.46 3.00 3.28 1.48 2.00
## Work_accident 6 14999 0.14 0.35 0.00 0.06 0.00 0.00
## left 7 14999 0.24 0.43 0.00 0.17 0.00 0.00
## promotion_last_5years 8 14999 0.02 0.14 0.00 0.00 0.00 0.00
## sales* 9 14999 6.94 2.75 8.00 7.23 2.97 1.00
## salary* 10 14999 2.35 0.63 2.00 2.41 1.48 1.00
## max range skew kurtosis se
## satisfaction_level 1 0.91 -0.48 -0.67 0.00
## last_evaluation 1 0.64 -0.03 -1.24 0.00
## number_project 7 5.00 0.34 -0.50 0.01
## average_montly_hours 310 214.00 0.05 -1.14 0.41
## time_spend_company 10 8.00 1.85 4.77 0.01
## Work_accident 1 1.00 2.02 2.08 0.00
## left 1 1.00 1.23 -0.49 0.00
## promotion_last_5years 1 1.00 6.64 42.03 0.00
## sales* 10 9.00 -0.79 -0.62 0.02
## salary* 3 2.00 -0.42 -0.67 0.01
We can see different measures of central tendency and variation. For example, the attrition rate is about 24%, the mean satisfaction level is about 61% and the mean evaluation score is about 72%. On average, people work on 3 to 4 projects and about 200 hours per month.
mytable <- with(hr.df,table(left))
mytable
## left
## 0 1
## 11428 3571
prop.table(mytable)*100
## left
## 0 1
## 76.19175 23.80825
mytable1 <- with(hr.df,table(promotion_last_5years))
mytable1
## promotion_last_5years
## 0 1
## 14680 319
prop.table(mytable1)*100
## promotion_last_5years
## 0 1
## 97.873192 2.126808
mytable2 <- with(hr.df,table(salary))
mytable2
## salary
## high low medium
## 1237 7316 6446
prop.table(mytable2)*100
## salary
## high low medium
## 8.247216 48.776585 42.976198
mytable3 <- with(hr.df,table(Work_accident))
mytable3
## Work_accident
## 0 1
## 12830 2169
prop.table(mytable3)*100
## Work_accident
## 0 1
## 85.53904 14.46096
mytable4 <- xtabs(~ left+promotion_last_5years, data=hr.df)
mytable4
## promotion_last_5years
## left 0 1
## 0 11128 300
## 1 3552 19
prop.table(mytable4, 1)*100
## promotion_last_5years
## left 0 1
## 0 97.3748687 2.6251313
## 1 99.4679362 0.5320638
mytable5 <-xtabs(~left+salary,data=hr.df)
mytable5
## salary
## left high low medium
## 0 1155 5144 5129
## 1 82 2172 1317
margin.table(mytable5,2)
## salary
## high low medium
## 1237 7316 6446
prop.table(mytable5, 2)
## salary
## left high low medium
## 0 0.93371059 0.70311646 0.79568725
## 1 0.06628941 0.29688354 0.20431275
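The column proportions above are the attrition rates within each salary band. A minimal sketch that rebuilds the left-by-salary table from the counts printed above, so it runs without the CSV file; the object names `counts` and `attrition_by_salary` are illustrative and not part of the original analysis.

```r
# Rebuild the left x salary contingency table from the printed counts.
# matrix() fills column by column, so each pair is (stayed, left).
counts <- matrix(
  c(1155, 82,     # high
    5144, 2172,   # low
    5129, 1317),  # medium
  nrow = 2,
  dimnames = list(left = c("0", "1"), salary = c("high", "low", "medium"))
)

# margin = 2 divides each cell by its column total, so row "1"
# holds the share of leavers within each salary band.
attrition_by_salary <- prop.table(counts, margin = 2)["1", ]
print(round(100 * attrition_by_salary, 1))
```

Using `margin = 1` instead would answer a different question: among those who stayed (or left), what share falls in each salary band.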
mytable6 <- xtabs(~left+Work_accident,data=hr.df)
mytable6
## Work_accident
## left 0 1
## 0 9428 2000
## 1 3402 169
margin.table(mytable6,1)
## left
## 0 1
## 11428 3571
prop.table(mytable6, 1)
## Work_accident
## left 0 1
## 0 0.82499125 0.17500875
## 1 0.95267432 0.04732568
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.3
CrossTable(hr.df$left,hr.df$salary)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 14999
##
##
## | hr.df$salary
## hr.df$left | high | low | medium | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## 0 | 1155 | 5144 | 5129 | 11428 |
## | 47.915 | 33.200 | 9.648 | |
## | 0.101 | 0.450 | 0.449 | 0.762 |
## | 0.934 | 0.703 | 0.796 | |
## | 0.077 | 0.343 | 0.342 | |
## -------------|-----------|-----------|-----------|-----------|
## 1 | 82 | 2172 | 1317 | 3571 |
## | 153.339 | 106.247 | 30.876 | |
## | 0.023 | 0.608 | 0.369 | 0.238 |
## | 0.066 | 0.297 | 0.204 | |
## | 0.005 | 0.145 | 0.088 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 1237 | 7316 | 6446 | 14999 |
## | 0.082 | 0.488 | 0.430 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
CrossTable(hr.df$left,hr.df$promotion_last_5years)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 14999
##
##
## | hr.df$promotion_last_5years
## hr.df$left | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 11128 | 300 | 11428 |
## | 0.290 | 13.343 | |
## | 0.974 | 0.026 | 0.762 |
## | 0.758 | 0.940 | |
## | 0.742 | 0.020 | |
## -------------|-----------|-----------|-----------|
## 1 | 3552 | 19 | 3571 |
## | 0.928 | 42.702 | |
## | 0.995 | 0.005 | 0.238 |
## | 0.242 | 0.060 | |
## | 0.237 | 0.001 | |
## -------------|-----------|-----------|-----------|
## Column Total | 14680 | 319 | 14999 |
## | 0.979 | 0.021 | |
## -------------|-----------|-----------|-----------|
##
##
attach(hr.df)
boxplot(hr.df$satisfaction_level,col="yellow",main="Satisfaction level",horizontal = TRUE)
boxplot(hr.df$last_evaluation,col="blue",main="Last Evaluation",horizontal = TRUE)
boxplot(hr.df$number_project,col="green",main="Number of projects",horizontal = TRUE)
boxplot(hr.df$average_montly_hours,col="red",main="Average monthly hours",horizontal = TRUE)
boxplot(time_spend_company,horizontal = TRUE,main="Time Spent in the Company",xlab="IN YEARS",col = "blue")
boxplot(satisfaction_level ~ left, data=hr.df, main="Distribution of satisfaction by attrition status", ylab="satisfaction level", xlab="left", col="lightblue")
boxplot(satisfaction_level ~ promotion_last_5years, data=hr.df, main="Distribution of satisfaction by promotion", ylab="satisfaction level", xlab="promotion in last 5 years", col="peachpuff")
boxplot(number_project ~ left, data=hr.df, main="Distribution of number of projects", ylab="number of projects", xlab="left", col="lightblue")
boxplot(satisfaction_level ~ salary, data=hr.df, main="Distribution of satisfaction by salary", ylab="satisfaction level", xlab="salary", col="blue")
hr.df$Work_accident[hr.df$Work_accident == 1] <- 'yes'
hr.df$Work_accident[hr.df$Work_accident == 0] <- 'no'
hr.df$Work_accident <- factor(hr.df$Work_accident)
hr.df$promotion_last_5years[hr.df$promotion_last_5years ==0] <-'no'
hr.df$promotion_last_5years[hr.df$promotion_last_5years ==1] <- 'yes'
hr.df$promotion_last_5years <- factor(hr.df$promotion_last_5years)
hr.df$left[hr.df$left==0] <- 'no'
hr.df$left[hr.df$left==1] <-'yes'
hr.df$left <- factor(hr.df$left)
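The three-step recoding above can also be written as a single `factor()` call with explicit `levels` and `labels`. A small sketch on a hypothetical 0/1 vector (`flag` stands in for a column such as `hr.df$left` and is not part of the original data):

```r
# Hypothetical 0/1 indicator standing in for a column such as hr.df$left.
flag <- c(0, 1, 0, 0, 1)

# Map 0 -> "no" and 1 -> "yes" in one call; any other value becomes NA.
flag_f <- factor(flag, levels = c(0, 1), labels = c("no", "yes"))

print(table(flag_f))
```

Note that objects exposed earlier through `attach(hr.df)` are a snapshot of the data frame, so they keep their original 0/1 coding after this kind of recoding.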
Let’s visualise the number of employees who left through a pie chart.
pie(table(hr.df$left),main="Employees who Left", col=c("lightblue","khaki"))
The pie chart shows that approximately a quarter of the employees in the data set left the company.
library(lattice)
## Warning: package 'lattice' was built under R version 3.4.3
histogram(~left, data = hr.df,
main = "Frequency of employees leaving the company", xlab="left", col='lightgreen' )
histogram(~satisfaction_level,data=hr.df,main="Frequency of satisfaction level",col="lightblue")
histogram(~last_evaluation,data=hr.df,main="Frequency of last evaluation",col="yellow")
histogram(~salary,data=hr.df,main="Frequency of salary",col="darkolivegreen1")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.3
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(hr.df, aes( sales,fill = sales) ) + geom_bar()
ggplot(hr.df, aes(satisfaction_level, number_project))+ scale_x_continuous("satisfaction level", breaks = seq(0,1,0.1))+scale_y_continuous("Projects", breaks = seq(2,7,1)) + labs(title="satisfaction level vs number of projects") + geom_jitter(colour="darkolivegreen1")
The least satisfied employees tend to have worked on more projects than the more satisfied employees.
Highly satisfied employees mostly worked on 3 to 5 projects.
library(car)
## Warning: package 'car' was built under R version 3.4.3
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(satisfaction_level~number_project, data=hr.df,
spread=FALSE, smoother.args=list(lty=2),
main="Scatter plot of satisfaction level vs number of project",
xlab="number of project",
ylab="satisfaction level")
scatterplot(satisfaction_level~average_montly_hours,data=hr.df,
spread=FALSE, smoother.args=list(lty=2),
main="Scatter plot of satisfaction level vs average working hours",
xlab="average working hours",
ylab="satisfaction level")
scatterplot(satisfaction_level,last_evaluation,main="Satisfaction level vs last evaluation",pch = 16)
pairs(formula= ~satisfaction_level + average_montly_hours+ salary +last_evaluation + number_project, cex=0.6, data=hr.df)
# Re-read the raw data so that the indicator columns are numeric again for cor()
hr1.df <- read.csv("HR_comma_sep.csv")
cor(hr1.df[,1:8])
## satisfaction_level last_evaluation number_project
## satisfaction_level 1.00000000 0.105021214 -0.142969586
## last_evaluation 0.10502121 1.000000000 0.349332589
## number_project -0.14296959 0.349332589 1.000000000
## average_montly_hours -0.02004811 0.339741800 0.417210634
## time_spend_company -0.10086607 0.131590722 0.196785891
## Work_accident 0.05869724 -0.007104289 -0.004740548
## left -0.38837498 0.006567120 0.023787185
## promotion_last_5years 0.02560519 -0.008683768 -0.006063958
## average_montly_hours time_spend_company
## satisfaction_level -0.020048113 -0.100866073
## last_evaluation 0.339741800 0.131590722
## number_project 0.417210634 0.196785891
## average_montly_hours 1.000000000 0.127754910
## time_spend_company 0.127754910 1.000000000
## Work_accident -0.010142888 0.002120418
## left 0.071287179 0.144822175
## promotion_last_5years -0.003544414 0.067432925
## Work_accident left promotion_last_5years
## satisfaction_level 0.058697241 -0.38837498 0.025605186
## last_evaluation -0.007104289 0.00656712 -0.008683768
## number_project -0.004740548 0.02378719 -0.006063958
## average_montly_hours -0.010142888 0.07128718 -0.003544414
## time_spend_company 0.002120418 0.14482217 0.067432925
## Work_accident 1.000000000 -0.15462163 0.039245435
## left -0.154621634 1.00000000 -0.061788107
## promotion_last_5years 0.039245435 -0.06178811 1.000000000
library(magrittr)
## Warning: package 'magrittr' was built under R version 3.4.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
HR_correlation <- hr1.df %>% select(satisfaction_level:promotion_last_5years)
M <- cor(HR_correlation)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.3
## corrplot 0.84 loaded
corrplot(M, method="number")
On average, employees who left had a lower satisfaction level (r = -0.39), worked slightly more hours (r = 0.07) and were less likely to have been promoted within the past five years (r = -0.06).
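For two 0/1 variables, the Pearson correlation in the matrix above coincides with the phi coefficient of their 2x2 table. A sketch that uses only the left-by-Work_accident counts printed earlier; the names `tab`, `left_flag` and `accident_flag` are illustrative assumptions.

```r
# Counts from the left x Work_accident table
# (rows = left 0/1, columns = Work_accident 0/1).
cell_counts <- c(9428, 3402, 2000, 169)
tab <- matrix(cell_counts, nrow = 2,
              dimnames = list(left = c("0", "1"), Work_accident = c("0", "1")))

# Expand the table back into one 0/1 pair per employee.
left_flag     <- rep(c(0, 1, 0, 1), times = cell_counts)
accident_flag <- rep(c(0, 0, 1, 1), times = cell_counts)

# Pearson correlation of the two indicators.
r <- cor(left_flag, accident_flag)

# Phi coefficient computed directly from the cell counts.
phi <- (tab[1, 1] * tab[2, 2] - tab[1, 2] * tab[2, 1]) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

print(c(pearson = r, phi = phi))
```

Both values should agree with the -0.1546 entry for `left` and `Work_accident` in the correlation matrix.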
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.4.3
corrgram(hr.df,upper.panel = panel.pie,main="Corrgram")
Chi-square test of independence:
H0: The two variables are independent
H1: The two variables are not independent
mytable6
## Work_accident
## left 0 1
## 0 9428 2000
## 1 3402 169
chisq.test(mytable6)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable6
## X-squared = 357.56, df = 1, p-value < 2.2e-16
p-value < 0.05, so we reject H0: the variables “left” and “Work_accident” are not independent.
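The test on “left” and “Work_accident” can be reproduced from the printed counts alone. The sketch below rebuilds the 2x2 table and also shows the expected counts under independence, which the statistic compares against the observed counts. The names `tab`, `test`, `alpha` and `reject_h0` are illustrative, and the 0.05 threshold is the one used in the text.

```r
# Observed counts (rows = left 0/1, columns = Work_accident 0/1).
tab <- matrix(c(9428, 3402, 2000, 169), nrow = 2,
              dimnames = list(left = c("0", "1"), Work_accident = c("0", "1")))

# For 2x2 tables chisq.test() applies Yates' continuity correction by default.
test <- chisq.test(tab)

# Expected counts under H0: row total * column total / grand total.
print(round(test$expected, 1))
print(test$statistic)

# Decision at the 5% level used in the text.
alpha <- 0.05
reject_h0 <- test$p.value < alpha
print(reject_h0)
```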
mytable4
## promotion_last_5years
## left 0 1
## 0 11128 300
## 1 3552 19
chisq.test(mytable4)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable4
## X-squared = 56.262, df = 1, p-value = 6.344e-14
p-value < 0.05, so we reject H0: the variables “left” and “promotion_last_5years” are not independent.
mytable5
## salary
## left high low medium
## 0 1155 5144 5129
## 1 82 2172 1317
chisq.test(mytable5)
##
## Pearson's Chi-squared test
##
## data: mytable5
## X-squared = 381.23, df = 2, p-value < 2.2e-16
p-value < 0.05, so we reject H0: the variables “left” and “salary” are not independent.
Hypothesis (H1): Employees who were promoted work more average monthly hours than employees who were not promoted.
t.test(average_montly_hours~promotion_last_5years)
##
## Welch Two Sample t-test
##
## data: average_montly_hours by promotion_last_5years
## t = 0.44937, df = 333.03, p-value = 0.6535
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.143788 6.597589
## sample estimates:
## mean in group 0 mean in group 1
## 201.0764 199.8495
p-value > 0.05. Result: there is no significant difference in average monthly hours between employees who were promoted and those who were not.
Hypothesis (H1): Employees who were promoted have spent more time in the company than employees who were not promoted.
t.test(time_spend_company~promotion_last_5years)
##
## Welch Two Sample t-test
##
## data: time_spend_company by promotion_last_5years
## t = -5.6111, df = 324.14, p-value = 4.316e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9216896 -0.4431601
## sample estimates:
## mean in group 0 mean in group 1
## 3.483719 4.166144
p-value < 0.05, so we reject the null hypothesis (H0) in favour of H1. Result: employees who were promoted have spent more time in the company on average (4.17 vs 3.48 years) than employees who were not promoted.
Hypothesis (H1): Employees who left worked more average monthly hours than employees who stayed.
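The hypotheses in this section are directional, while `t.test()` uses a two-sided alternative unless told otherwise. A sketch with simulated data showing how the `alternative` argument expresses a directional H1; the vectors `not_promoted` and `promoted` and their parameters are hypothetical stand-ins, not the HR data.

```r
set.seed(42)
# Simulated tenure in years; the values are illustrative only.
not_promoted <- rnorm(200, mean = 3.5, sd = 1.5)
promoted     <- rnorm(60,  mean = 4.2, sd = 1.5)

# Two-sided Welch test (the default): H1 is "the means differ".
two_sided <- t.test(not_promoted, promoted)

# Directional H1: mean(not_promoted) < mean(promoted).
one_sided <- t.test(not_promoted, promoted, alternative = "less")

print(c(two_sided = two_sided$p.value, one_sided = one_sided$p.value))
```

When the observed difference points in the hypothesised direction, the one-sided p-value is half the two-sided one, so the two-sided results reported here are the more conservative choice.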
t.test(average_montly_hours~left)
##
## Welch Two Sample t-test
##
## data: average_montly_hours by left
## t = -7.5323, df = 4875.1, p-value = 5.907e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.534631 -6.183384
## sample estimates:
## mean in group 0 mean in group 1
## 199.0602 207.4192
p-value < 0.05, so we reject the null hypothesis (H0). Result: employees who left worked more average monthly hours (207.4 vs 199.1) than employees who stayed.
Hypothesis (H1): Employees who were promoted are more satisfied than employees who were not promoted.
t.test(satisfaction_level~promotion_last_5years)
##
## Welch Two Sample t-test
##
## data: satisfaction_level by promotion_last_5years
## t = -3.6545, df = 337.3, p-value = 0.0002987
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06787297 -0.02037446
## sample estimates:
## mean in group 0 mean in group 1
## 0.6118951 0.6560188
p-value < 0.05, so we reject the null hypothesis (H0).
Result: employees who were promoted are more satisfied on average (0.656 vs 0.612) than employees who were not promoted.
Fitting a Linear Regression Model using lm()
model1<-lm(formula=number_project~Work_accident,data = hr1.df)
summary(model1)
##
## Call:
## lm(formula = number_project ~ Work_accident, data = hr1.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8055 -0.8055 0.1945 1.1945 3.2112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.80546 0.01088 349.696 <2e-16 ***
## Work_accident -0.01661 0.02862 -0.581 0.562
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.233 on 14997 degrees of freedom
## Multiple R-squared: 2.247e-05, Adjusted R-squared: -4.421e-05
## F-statistic: 0.337 on 1 and 14997 DF, p-value: 0.5616
model2<-lm(formula=average_montly_hours~salary,data = hr1.df)
summary(model2)
##
## Call:
## lm(formula = average_montly_hours ~ salary, data = hr1.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.338 -45.338 -0.997 44.003 109.003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 199.867 1.420 140.746 <2e-16 ***
## salarylow 1.129 1.535 0.735 0.462
## salarymedium 1.471 1.550 0.949 0.343
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.94 on 14996 degrees of freedom
## Multiple R-squared: 6.113e-05, Adjusted R-squared: -7.223e-05
## F-statistic: 0.4584 on 2 and 14996 DF, p-value: 0.6323
model3<-lm(formula=satisfaction_level~salary,data = hr1.df)
summary(model3)
##
## Call:
## lm(formula = satisfaction_level ~ salary, data = hr1.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54747 -0.17182 0.02925 0.19925 0.39925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.637470 0.007061 90.284 < 2e-16 ***
## salarylow -0.036717 0.007634 -4.809 1.53e-06 ***
## salarymedium -0.015653 0.007709 -2.031 0.0423 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2483 on 14996 degrees of freedom
## Multiple R-squared: 0.002522, Adjusted R-squared: 0.002389
## F-statistic: 18.96 on 2 and 14996 DF, p-value: 5.967e-09
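With a factor predictor such as `salary`, `lm()` uses treatment contrasts by default: the intercept is the mean of the reference level (the alphabetically first one, here `high`), and every other coefficient is a difference from that level. A sketch on simulated data that makes this visible; the vectors `salary` and `satisfaction` below are hypothetical and unrelated to the HR data.

```r
set.seed(1)
# Simulated data; group sizes and values are illustrative only.
salary <- factor(rep(c("high", "low", "medium"), times = c(30, 50, 40)))
satisfaction <- rnorm(120, mean = 0.6, sd = 0.2)

fit <- lm(satisfaction ~ salary)
group_means <- tapply(satisfaction, salary, mean)

# Intercept    = mean of the reference level ("high").
# salarylow    = mean(low)    - mean(high)
# salarymedium = mean(medium) - mean(high)
print(coef(fit))
print(group_means)

# relevel() changes the reference level, e.g. to compare against "low".
fit_low_ref <- lm(satisfaction ~ relevel(salary, ref = "low"))
print(coef(fit_low_ref))
```

Read this way, the model3 output above says that the high-salary group has a mean satisfaction of about 0.637, and the low and medium groups sit about 0.037 and 0.016 below it.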
model4<-lm(formula=satisfaction_level ~ number_project,data = hr1.df)
summary(model4)
##
## Call:
## lm(formula = satisfaction_level ~ number_project, data = hr1.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56483 -0.21483 0.03169 0.20401 0.45052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.722509 0.006517 110.86 <2e-16 ***
## number_project -0.028839 0.001630 -17.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2461 on 14997 degrees of freedom
## Multiple R-squared: 0.02044, Adjusted R-squared: 0.02037
## F-statistic: 312.9 on 1 and 14997 DF, p-value: < 2.2e-16
model5<-lm(formula=average_montly_hours ~ satisfaction_level,data = hr1.df)
summary(model5)