Project Title: HR Analytics
NAME: Ram Sankar
EMAIL: ramshankar797@gmail.com
COLLEGE : NIT Trichy
A company is trying to figure out why its best and most experienced employees are leaving prematurely. The following are some of the important columns of the dataset:
-> satisfaction level (ranges from 0 to 1)
-> Number of projects
-> Average monthly hours
-> Time spent at the company (in years)
-> work accident (0 or 1)
-> Promotion in the last 5 years (0 or 1)
-> sales (the employee's department)
-> salary (low, medium, high)
-> left (0 or 1)
The analysis of the dataset follows.
Read the CSV file into a data frame and print a basic description of the dataset.
# stringsAsFactors = TRUE keeps sales and salary as factors, as in the output below
hr.df <- read.csv("HR_comma_sep.csv", stringsAsFactors = TRUE)
str(hr.df)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
summary(hr.df)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years sales salary
## Min. :0.00000 sales :4140 high :1237
## 1st Qu.:0.00000 technical :2720 low :7316
## Median :0.00000 support :2229 medium:6446
## Mean :0.02127 IT :1227
## 3rd Qu.:0.00000 product_mng: 902
## Max. :1.00000 marketing : 858
## (Other) :2923
library(psych)
describe(hr.df)
## vars n mean sd median trimmed mad min
## satisfaction_level 1 14999 0.61 0.25 0.64 0.63 0.28 0.09
## last_evaluation 2 14999 0.72 0.17 0.72 0.72 0.22 0.36
## number_project 3 14999 3.80 1.23 4.00 3.74 1.48 2.00
## average_montly_hours 4 14999 201.05 49.94 200.00 200.64 65.23 96.00
## time_spend_company 5 14999 3.50 1.46 3.00 3.28 1.48 2.00
## Work_accident 6 14999 0.14 0.35 0.00 0.06 0.00 0.00
## left 7 14999 0.24 0.43 0.00 0.17 0.00 0.00
## promotion_last_5years 8 14999 0.02 0.14 0.00 0.00 0.00 0.00
## sales* 9 14999 6.94 2.75 8.00 7.23 2.97 1.00
## salary* 10 14999 2.35 0.63 2.00 2.41 1.48 1.00
## max range skew kurtosis se
## satisfaction_level 1 0.91 -0.48 -0.67 0.00
## last_evaluation 1 0.64 -0.03 -1.24 0.00
## number_project 7 5.00 0.34 -0.50 0.01
## average_montly_hours 310 214.00 0.05 -1.14 0.41
## time_spend_company 10 8.00 1.85 4.77 0.01
## Work_accident 1 1.00 2.02 2.08 0.00
## left 1 1.00 1.23 -0.49 0.00
## promotion_last_5years 1 1.00 6.64 42.03 0.00
## sales* 10 9.00 -0.79 -0.62 0.02
## salary* 3 2.00 -0.42 -0.67 0.01
hr.df$left[hr.df$left==1]<-'Yes'
hr.df$left[hr.df$left==0]<-'No'
pie(table(hr.df$left),main="Left", col=c("green4","Red2"))
This pie chart shows that approximately a quarter of the employees (about 24%) left the company.
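The same share can be read from a table instead of the chart. A minimal sketch on a small made-up vector (`left_toy` is hypothetical and is not the real data):

```r
# Toy stand-in for hr.df$left (hypothetical values, not the real data)
left_toy <- c("Yes", "No", "No", "No", "Yes", "No", "No", "No")

# Counts per category, then the same counts as proportions
counts <- table(left_toy)
shares <- prop.table(counts)
print(round(shares, 2))
```

On the real data the equivalent call would presumably be `prop.table(table(hr.df$left))`.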
library(RColorBrewer)
par(mfrow=c(1,2))
boxplot(hr.df$satisfaction_level,horizontal = TRUE,main="satisfaction level of employees",xlab="Satisfaction level")
hist(hr.df$satisfaction_level,col=brewer.pal(8,"Greens"),main="",xlab="satisfaction level",breaks=5)
#heatmap(as.matrix(hr.df$satisfaction_level))
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(hr.df, aes(satisfaction_level)) + geom_area( stat = "bin", bins = 30,fill = "steelblue")+ scale_x_continuous(breaks = seq(0,1,0.1))
library(RColorBrewer)
par(mfrow=c(1,2))
boxplot(hr.df$last_evaluation,horizontal = TRUE,main="last evaluation score",xlab="evaluation score")
hist(hr.df$last_evaluation,col=brewer.pal(8,"Greys"),main="",xlab="evaluation score",breaks=7)
-> The last evaluation ranges from 0.36 to 1 with a median of 0.72, so it reads as a rating on a 0 to 1 scale rather than a length of time.
library(ggplot2)
ggplot(hr.df, aes(number_project)) + scale_x_continuous("Projects",seq(2,7,by=1))+
geom_bar(fill="red")
-> Most employees worked on 3 or 4 projects
library(ggplot2)
ggplot(hr.df, aes(average_montly_hours)) +
geom_bar(fill="blue")+scale_x_continuous("hours", breaks = seq(90,320, by = 50))+labs(title="average working hours per month")
ggplot(hr.df, aes(time_spend_company)) + geom_bar(fill = "darkgreen")+
coord_flip()+ labs(title = "Years in the company")
par(mfrow=c(1,2))
hr.df$promotion_last_5years[hr.df$promotion_last_5years==1]<-'Yes'
hr.df$promotion_last_5years[hr.df$promotion_last_5years==0]<-'No'
pie(table(hr.df$promotion_last_5years),main="promotion", col=c("orangered","yellowgreen"))
hr.df$Work_accident[hr.df$Work_accident==1]<-'Yes'
hr.df$Work_accident[hr.df$Work_accident==0]<-'No'
pie(table(hr.df$Work_accident),main="Accident in work place", col=c("antiquewhite","darkslategray"))
library(ggplot2)
ggplot(hr.df, aes( sales,fill = sales) ) +
geom_bar()
library(ggplot2)
ggplot(hr.df,aes(x = factor(""), fill = salary) ) +
geom_bar() +
coord_polar(theta = "y") +
scale_x_discrete("")
library(ggplot2)
ggplot(hr.df, aes(satisfaction_level, number_project))+
scale_x_continuous("satisfaction level", breaks = seq(0,1,0.1))+scale_y_continuous("Projects", breaks = seq(2,7,1))+
labs(title="satisfaction level vs number of projects")+geom_jitter(colour="greenyellow")
-> Employees with higher satisfaction levels mostly worked on 3 to 5 projects
library(ggplot2)
ggplot(hr.df, aes(time_spend_company, fill = sales) ) +
geom_bar(position = "stack")+labs(title="time spent in company department wise")
ggplot(hr.df, aes(left, fill = sales) ) +
geom_bar(position = "stack")
-> The largest group of people who left the company came from the sales department, which is also the largest department (4140 of 14999 employees)
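Because departments differ in size, raw counts of leavers can mislead; the share of each department that left is easier to compare. A minimal sketch on made-up rows (the data frame `toy` is hypothetical):

```r
# Toy data (hypothetical): department and whether the employee left
toy <- data.frame(
  sales = c("sales", "sales", "sales", "sales", "hr", "hr", "IT", "IT"),
  left  = c("Yes",   "No",    "No",    "No",    "Yes", "No", "No", "No")
)

# Row-wise proportions: within each department, the share that left
rates <- prop.table(table(toy$sales, toy$left), margin = 1)
print(round(rates, 2))
```

With `margin = 1` each row sums to 1, so the `Yes` column is the leaving rate per department.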
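# left is a character column at this point ("Yes"/"No"); recode it and convert back to numeric
hr.df$left[hr.df$left=="Yes"]<-1
hr.df$left[hr.df$left=="No"]<-2
hr.df$left <- as.numeric(hr.df$left)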
library(lattice)
par(mfrow=c(2,1))
histogram(~left|promotion_last_5years+Work_accident,data=hr.df,main="left vs promotion and work accident histogram")
histogram(~left|salary,data=hr.df,main="left and salary histogram")
# On the x axis, the bar at the low end (near 0.96) is people who left (coded 1) and the bar at the high end (near 2.04) is people still in the company (coded 2)
library(ggplot2)
ggplot(hr.df, aes(satisfaction_level,average_montly_hours)) + geom_point(aes(color = sales)) +
scale_x_continuous("satisfaction level", breaks = seq(0,1,0.1))+
scale_y_continuous("hours", breaks = seq(90,310,50))+
theme_bw() + labs(title="Scatterplot")
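# Reload the data so that left, Work_accident and promotion_last_5years are numeric 0/1 again
hr.df <- read.csv("HR_comma_sep.csv", stringsAsFactors = TRUE)
# as.numeric() on a factor returns the level codes in alphabetical order,
# so salary becomes high = 1, low = 2, medium = 3 (not ordered by pay)
hr.df$sales <- as.numeric(hr.df$sales)
hr.df$salary <- as.numeric(hr.df$salary)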
cor(hr.df)
## satisfaction_level last_evaluation number_project
## satisfaction_level 1.00000000 0.105021214 -0.142969586
## last_evaluation 0.10502121 1.000000000 0.349332589
## number_project -0.14296959 0.349332589 1.000000000
## average_montly_hours -0.02004811 0.339741800 0.417210634
## time_spend_company -0.10086607 0.131590722 0.196785891
## Work_accident 0.05869724 -0.007104289 -0.004740548
## left -0.38837498 0.006567120 0.023787185
## promotion_last_5years 0.02560519 -0.008683768 -0.006063958
## sales 0.01226082 0.006809655 0.019077888
## salary 0.01175416 0.013964915 0.009671757
## average_montly_hours time_spend_company
## satisfaction_level -0.020048113 -0.100866073
## last_evaluation 0.339741800 0.131590722
## number_project 0.417210634 0.196785891
## average_montly_hours 1.000000000 0.127754910
## time_spend_company 0.127754910 1.000000000
## Work_accident -0.010142888 0.002120418
## left 0.071287179 0.144822175
## promotion_last_5years -0.003544414 0.067432925
## sales 0.007722204 -0.034825154
## salary 0.007081960 -0.003086256
## Work_accident left promotion_last_5years
## satisfaction_level 0.058697241 -0.388374983 0.025605186
## last_evaluation -0.007104289 0.006567120 -0.008683768
## number_project -0.004740548 0.023787185 -0.006063958
## average_montly_hours -0.010142888 0.071287179 -0.003544414
## time_spend_company 0.002120418 0.144822175 0.067432925
## Work_accident 1.000000000 -0.154621634 0.039245435
## left -0.154621634 1.000000000 -0.061788107
## promotion_last_5years 0.039245435 -0.061788107 1.000000000
## sales 0.011323553 0.009935740 -0.036953987
## salary -0.002505606 -0.001293717 -0.001318425
## sales salary
## satisfaction_level 0.012260815 0.011754160
## last_evaluation 0.006809655 0.013964915
## number_project 0.019077888 0.009671757
## average_montly_hours 0.007722204 0.007081960
## time_spend_company -0.034825154 -0.003086256
## Work_accident 0.011323553 -0.002505606
## left 0.009935740 -0.001293717
## promotion_last_5years -0.036953987 -0.001318425
## sales 1.000000000 0.016429857
## salary 0.016429857 1.000000000
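The salary and sales columns enter the correlation matrix as factor level codes. A minimal sketch of what `as.numeric()` does to a factor and how an explicit level order changes the codes (`salary_toy` and `salary_ord` are hypothetical names on made-up values):

```r
# Toy salary column (hypothetical values) stored as a factor
salary_toy <- factor(c("low", "medium", "high", "low"))

# Default levels are alphabetical, so the codes do not follow pay order:
# high = 1, low = 2, medium = 3
codes_default <- as.numeric(salary_toy)
print(codes_default)

# Declaring the order explicitly gives codes that increase with salary:
# low = 1, medium = 2, high = 3
salary_ord <- factor(salary_toy, levels = c("low", "medium", "high"), ordered = TRUE)
codes_ordered <- as.numeric(salary_ord)
print(codes_ordered)
```

A department column has no natural order at all, so any correlation computed on its codes depends on the arbitrary alphabetical coding.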
library(corrgram)
corrgram(hr.df, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of hr.df intercorrelations")
t1<-xtabs(~hr.df$left+hr.df$promotion_last_5years)
t2<-xtabs(~hr.df$left+hr.df$satisfaction_level)
t3<-xtabs(~hr.df$left+hr.df$salary)
t4<-xtabs(~hr.df$left+hr.df$salary) # same table as t3: salary is tabulated twice
t5<-xtabs(~hr.df$left+hr.df$time_spend_company)
t6<-xtabs(~hr.df$left+hr.df$average_montly_hours)
t7<-xtabs(~hr.df$left+hr.df$sales)
t8<-xtabs(~hr.df$left+hr.df$number_project)
chisq.test(t1)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t1
## X-squared = 56.262, df = 1, p-value = 6.344e-14
chisq.test(t2)
##
## Pearson's Chi-squared test
##
## data: t2
## X-squared = 7937.7, df = 91, p-value < 2.2e-16
chisq.test(t3)
##
## Pearson's Chi-squared test
##
## data: t3
## X-squared = 381.23, df = 2, p-value < 2.2e-16
chisq.test(t4)
##
## Pearson's Chi-squared test
##
## data: t4
## X-squared = 381.23, df = 2, p-value < 2.2e-16
chisq.test(t5)
##
## Pearson's Chi-squared test
##
## data: t5
## X-squared = 2110.1, df = 7, p-value < 2.2e-16
chisq.test(t6)
## Warning in chisq.test(t6): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: t6
## X-squared = 3623.1, df = 214, p-value < 2.2e-16
chisq.test(t7)
##
## Pearson's Chi-squared test
##
## data: t7
## X-squared = 86.825, df = 9, p-value = 7.042e-15
chisq.test(t8)
##
## Pearson's Chi-squared test
##
## data: t8
## X-squared = 5373.6, df = 5, p-value < 2.2e-16
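Every Pearson’s chi-squared test above gives p < 0.01, so in each case we reject the null hypothesis that leaving is independent of the tabulated column. The test on average monthly hours (t6) raised a warning that the chi-squared approximation may be incorrect, so that result should be read with caution.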
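When a near-continuous column such as monthly hours is tabulated value by value, many cells have very small expected counts, which is what typically triggers the approximation warning. One option is to bin the column first. A minimal sketch on simulated data (all values below are made up, and the choice of four bins is an assumption):

```r
set.seed(1)  # toy data below is simulated, not taken from HR_comma_sep.csv
n <- 400
hours_toy <- round(runif(n, 96, 310))
# Simulated outcome: longer hours make leaving somewhat more likely
left_toy <- rbinom(n, 1, ifelse(hours_toy > 250, 0.5, 0.2))

# Group the hours into four equal-width bins before tabulating
hours_bin <- cut(hours_toy, breaks = 4)
tab <- table(left_toy, hours_bin)
res <- chisq.test(tab)
print(res$expected)   # expected counts per cell; small values cause the warning
print(res$p.value)
```

Inspecting `res$expected` shows whether the cells are large enough for the approximation to be reasonable.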
t.test(hr.df$left - hr.df$satisfaction_level)
##
## One Sample t-test
##
## data: hr.df$left - hr.df$satisfaction_level
## t = -80.447, df = 14998, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.383882 -0.365620
## sample estimates:
## mean of x
## -0.374751
t.test(hr.df$left - hr.df$number_project)
##
## One Sample t-test
##
## data: hr.df$left - hr.df$number_project
## t = -337.28, df = 14998, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -3.585689 -3.544253
## sample estimates:
## mean of x
## -3.564971
t.test(hr.df$left - hr.df$average_montly_hours)
##
## One Sample t-test
##
## data: hr.df$left - hr.df$average_montly_hours
## t = -492.71, df = 14998, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -201.6111 -200.0134
## sample estimates:
## mean of x
## -200.8123
t.test(hr.df$left - hr.df$time_spend_company)
##
## One Sample t-test
##
## data: hr.df$left - hr.df$time_spend_company
## t = -273.37, df = 14998, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -3.283527 -3.236774
## sample estimates:
## mean of x
## -3.260151
t.test(hr.df$left - hr.df$Work_accident)
##
## One Sample t-test
##
## data: hr.df$left - hr.df$Work_accident
## t = 19.31, df = 14998, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.08398479 0.10296101
## sample estimates:
## mean of x
## 0.0934729
t.test(hr.df$left - hr.df$promotion_last_5years)
##
## One Sample t-test
##
## data: hr.df$left - hr.df$promotion_last_5years
## t = 57.969, df = 14998, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.2094832 0.2241457
## sample estimates:
## mean of x
## 0.2168145
t.test(hr.df$left - hr.df$sales)
##
## One Sample t-test
##
## data: hr.df$left - hr.df$sales
## t = -295.52, df = 14998, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -6.742675 -6.653818
## sample estimates:
## mean of x
## -6.698247
t.test(hr.df$left - hr.df$salary)
##
## One Sample t-test
##
## data: hr.df$left - hr.df$salary
## t = -341.03, df = 14998, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -2.121330 -2.097084
## sample estimates:
## mean of x
## -2.109207
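Each t-test above also gives p < 0.01, but these are one-sample tests on the difference between left and another column, so they only show that the mean of left differs from the mean of that column. They do not test whether leaving is independent of the column.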
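To compare a column between employees who left and those who stayed, a two-sample t-test with left as the grouping variable is one option. A minimal sketch on simulated data (the data frame `toy` and its group means are made up for illustration):

```r
set.seed(2)  # simulated toy data, not the real dataset
toy <- data.frame(
  left = rep(c(0, 1), each = 50),
  satisfaction_level = c(rnorm(50, mean = 0.67, sd = 0.1),   # stayers (made-up mean)
                         rnorm(50, mean = 0.44, sd = 0.1))   # leavers (made-up mean)
)

# Welch two-sample t-test: does mean satisfaction differ between the two groups?
res <- t.test(satisfaction_level ~ left, data = toy)
print(res$estimate)   # mean satisfaction in each group
print(res$p.value)
```

On the real data the same formula interface would presumably be applied as `t.test(satisfaction_level ~ left, data = hr.df)`.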
fit <- lm(satisfaction_level ~ average_montly_hours+Work_accident+promotion_last_5years+sales+salary , data = hr.df)
summary(fit)
##
## Call:
## lm(formula = satisfaction_level ~ average_montly_hours + Work_accident +
## promotion_last_5years + sales + salary, data = hr.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55470 -0.17486 0.02876 0.20363 0.40719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.068e-01 1.236e-02 49.093 < 2e-16 ***
## average_montly_hours -9.738e-05 4.057e-05 -2.401 0.0164 *
## Work_accident 4.062e-02 5.765e-03 7.045 1.93e-12 ***
## promotion_last_5years 4.094e-02 1.406e-02 2.911 0.0036 **
## sales 1.126e-03 7.381e-04 1.526 0.1271
## salary 4.713e-03 3.238e-03 1.456 0.1455
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2481 on 14993 degrees of freedom
## Multiple R-squared: 0.004665, Adjusted R-squared: 0.004333
## F-statistic: 14.05 on 5 and 14993 DF, p-value: 9.711e-14
prom <- lm(number_project ~ satisfaction_level+average_montly_hours+Work_accident+promotion_last_5years+sales+salary , data = hr.df)
summary(prom)
##
## Call:
## lm(formula = number_project ~ satisfaction_level + average_montly_hours +
## Work_accident + promotion_last_5years + sales + salary, data = hr.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9320 -0.9334 -0.0846 0.8333 3.7053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0636094 0.0594599 34.706 <2e-16 ***
## satisfaction_level -0.6711281 0.0364662 -18.404 <2e-16 ***
## average_montly_hours 0.0102268 0.0001812 56.449 <2e-16 ***
## Work_accident 0.0254529 0.0257838 0.987 0.3236
## promotion_last_5years -0.0065286 0.0628034 -0.104 0.9172
## sales 0.0077595 0.0032958 2.354 0.0186 *
## salary 0.0158775 0.0144571 1.098 0.2721
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.108 on 14992 degrees of freedom
## Multiple R-squared: 0.1926, Adjusted R-squared: 0.1923
## F-statistic: 596.1 on 6 and 14992 DF, p-value: < 2.2e-16
prom <- lm(promotion_last_5years ~ satisfaction_level+average_montly_hours+Work_accident+number_project+sales+salary , data = hr.df)
summary(prom)
##
## Call:
## lm(formula = promotion_last_5years ~ satisfaction_level + average_montly_hours +
## Work_accident + number_project + sales + salary, data = hr.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.05179 -0.02681 -0.01916 -0.01483 0.99466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.634e-02 8.034e-03 3.278 0.00105 **
## satisfaction_level 1.373e-02 4.794e-03 2.863 0.00420 **
## average_montly_hours -5.755e-06 2.594e-05 -0.222 0.82446
## Work_accident 1.569e-02 3.351e-03 4.684 2.84e-06 ***
## number_project -1.104e-04 1.062e-03 -0.104 0.91721
## sales -1.976e-03 4.284e-04 -4.613 4.00e-06 ***
## salary -1.981e-04 1.880e-03 -0.105 0.91610
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1441 on 14992 degrees of freedom
## Multiple R-squared: 0.003512, Adjusted R-squared: 0.003113
## F-statistic: 8.805 on 6 and 14992 DF, p-value: 1.32e-09
model <- glm(left ~.,family=binomial(link='logit'),data=hr.df)
summary(model)
##
## Call:
## glm(formula = left ~ ., family = binomial(link = "logit"), data = hr.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3568 -0.6819 -0.4343 -0.1533 3.1068
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.054122 0.151993 0.356 0.72178
## satisfaction_level -4.129254 0.096584 -42.753 < 2e-16 ***
## last_evaluation 0.762165 0.145708 5.231 1.69e-07 ***
## number_project -0.310068 0.020850 -14.872 < 2e-16 ***
## average_montly_hours 0.004346 0.000504 8.624 < 2e-16 ***
## time_spend_company 0.228638 0.014855 15.391 < 2e-16 ***
## Work_accident -1.498575 0.088254 -16.980 < 2e-16 ***
## promotion_last_5years -1.768024 0.255495 -6.920 4.52e-12 ***
## sales 0.020587 0.007854 2.621 0.00876 **
## salary 0.011953 0.035040 0.341 0.73300
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16465 on 14998 degrees of freedom
## Residual deviance: 13323 on 14989 degrees of freedom
## AIC: 13343
##
## Number of Fisher Scoring iterations: 5
anova(model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: left
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 14998 16465
## satisfaction_level 1 2266.90 14997 14198 < 2.2e-16 ***
## last_evaluation 1 17.89 14996 14180 2.337e-05 ***
## number_project 1 115.05 14995 14065 < 2.2e-16 ***
## average_montly_hours 1 83.35 14994 13982 < 2.2e-16 ***
## time_spend_company 1 187.24 14993 13794 < 2.2e-16 ***
## Work_accident 1 390.95 14992 13403 < 2.2e-16 ***
## promotion_last_5years 1 73.32 14991 13330 < 2.2e-16 ***
## sales 1 6.91 14990 13323 0.008558 **
## salary 1 0.12 14989 13323 0.732948
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test <- hr.df
fitted.results <- predict(model,newdata=subset(test,select=c(1,2,3,4,5,6,8,9,10)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$left)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.765584372291486"
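-> The linear regression on satisfaction level showed that work accident (p < 0.001) and promotion in the last 5 years (p < 0.01) are the most significant predictors; department and salary are not significant, and the model explains under 1% of the variance (R-squared = 0.0047)
-> The linear regression on number of projects showed that satisfaction level and average monthly hours are the most significant predictors (R-squared = 0.19)
-> The linear regression on promotion showed that work accident and department (sales) are the most significant predictors, followed by satisfaction level; salary is not significant
-> When logistic regression was applied to left, every column except salary was significant, and when the model was tested against the same dataset the accuracy was 76.6%. About 76.2% of employees stayed (the mean of left is 0.2381), so this is only slightly above always predicting that an employee stays.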
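Accuracy measured on the rows used for fitting tends to be optimistic, and it is easier to judge next to a majority-class baseline. A minimal sketch of a hold-out evaluation on simulated data (the data frame `toy`, the 30% split and the single predictor are assumptions for illustration, not the model above):

```r
set.seed(3)  # simulated toy data, not the real dataset
n <- 500
toy <- data.frame(satisfaction_level = runif(n))
# Simulated outcome: lower satisfaction makes leaving more likely
toy$left <- rbinom(n, 1, plogis(2 - 5 * toy$satisfaction_level))

# Hold out 30% of the rows so accuracy is measured on unseen data
test_idx <- sample(n, size = 0.3 * n)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

model_toy <- glm(left ~ satisfaction_level, family = binomial(link = "logit"), data = train)
pred <- ifelse(predict(model_toy, newdata = test, type = "response") > 0.5, 1, 0)

accuracy <- mean(pred == test$left)
baseline <- max(prop.table(table(test$left)))   # always predict the majority class
print(table(predicted = pred, actual = test$left))
print(c(accuracy = accuracy, baseline = baseline))
```

The confusion table shows how many leavers the model actually identifies, which a single accuracy figure hides when one class dominates.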