This analysis examines a company’s employee data to identify the significant factors behind turnover and to determine which employees are at greater risk of leaving the organization. The conclusions are summarized at the end of the analysis.
The dataset has:
# 14999 total employee observations and 10 features captured
dim(data_set)
## [1] 14999 10
#View the 10 features collected for these 14999 employee observations.
str(data_set)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ exp_in_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ role : chr "sales" "sales" "sales" "sales" ...
## $ salary : chr "low" "medium" "medium" "low" ...
#From the calculation below, the company retained 76% of its employees and 24% of employees left.
#Calculating the employee turnover rate (percent() comes from the scales package).
library(scales)
turn_over_rate <- percent(sum(data_set$left == 1) / length(data_set$left))
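The same rate can be obtained without an extra package. The following is a minimal sketch in base R; the vector `left` is a made-up stand-in for `data_set$left`, not the company data.

```r
# Toy stand-in for data_set$left (hypothetical values, not the real data).
left <- c(1, 0, 0, 1, 0, 0, 0, 0)

# The mean of a 0/1 indicator is the share of ones, i.e. the turnover rate.
turnover_share <- mean(left == 1)

# prop.table() gives the stayed/left split in one step.
split_share <- prop.table(table(left))

# Format the rate as a percentage using base R only.
turnover_label <- sprintf("%.1f%%", 100 * turnover_share)
print(turnover_label)
```

Working with the numeric share first and formatting it last keeps the value usable in later calculations.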
#The average satisfaction level for those who left the company was 0.44, versus 0.67 for those who stayed. On average, those who left worked slightly more hours and projects, had fewer work accidents, and were promoted less often than those who stayed.
data_set2 <- data_set
data_set2$Left2 <- sapply(data_set2$left, function(x) if(x ==1) "Yes" else "No")
aggregate(data_set2[,-c(7,9,10,11)], by = list(Terminated = data_set2$Left2), FUN = mean)
## Terminated satisfaction_level last_evaluation number_project
## 1 No 0.6668096 0.7154734 3.786664
## 2 Yes 0.4400980 0.7181126 3.855503
## average_montly_hours exp_in_company Work_accident promotion_last_5years
## 1 199.0602 3.380032 0.17500875 0.026251313
## 2 207.4192 3.876505 0.04732568 0.005320638
Moderate Positively Correlated Features:
number_project vs last_evaluation
number_project vs average_montly_hours
average_montly_hours vs last_evaluation
Moderate Negatively Correlated Feature:
satisfaction_level vs left
From the heatmap, there is a positive(+) correlation between number_project, average_montly_hours, and last_evaluation, which could mean that the employees who spent more hours and did more projects were evaluated more highly.
For the negative(-) relationship, left and satisfaction_level are moderately correlated, which suggests that people tend to leave a company more when they are less satisfied.
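library(corrplot) # provides corrplot(); columns 9 and 10 (role, salary) are character, so they are dropped
corrplot(cor(data_set[, -c(9, 10)]), tl.col = 'black', type = 'lower', tl.srt = 45, addCoef.col = 'black', number.cex = 0.8)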
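Reading the strongest pairs off a heatmap by eye is error-prone. As a minimal sketch, the correlation matrix can be flattened into a ranked list of feature pairs. The data frame `toy` holds made-up values under the same column names as `data_set`; it is an illustration only, and the numbers it produces say nothing about the real data.

```r
# Hypothetical toy data reusing four column names from data_set.
toy <- data.frame(
  satisfaction_level   = c(0.9, 0.8, 0.7, 0.4, 0.2, 0.1),
  number_project       = c(3, 4, 3, 5, 6, 7),
  average_montly_hours = c(160, 200, 170, 240, 260, 280),
  left                 = c(0, 0, 0, 1, 1, 1)
)

# Pearson correlation matrix of all numeric columns.
cm <- cor(toy)

# Flatten the matrix into one row per (Var1, Var2) combination.
pairs <- as.data.frame(as.table(cm))

# Keep each unordered pair once and drop the diagonal.
i <- match(as.character(pairs$Var1), colnames(cm))
j <- match(as.character(pairs$Var2), colnames(cm))
pairs <- pairs[i < j, ]

# Rank by absolute correlation, strongest first.
pairs <- pairs[order(-abs(pairs$Freq)), ]
print(pairs)
```

Applied to `data_set[, -c(9, 10)]`, the same steps would list the pairs in the order of strength that the heatmap shows with colour.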
A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.
We will run the following test to see whether the average satisfaction level of employees who left differs from that of the entire employee population.
Hypothesis Testing: Is there a significant difference in mean satisfaction level between employees who left and the entire employee population?
Null Hypothesis: (H0: pTS = pES) There is no difference in mean satisfaction level between employees who left and the entire employee population.
Alternate Hypothesis: (HA: pTS != pES) There is a difference in mean satisfaction level between employees who left and the entire employee population.
## [1] "The mean satisfaction level for all employees (the population) is: 0.612833522234816"
## [1] "The mean satisfaction level for terminated employees is: 0.440098011761411"
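The code that printed these two means is not shown in the report. A minimal sketch of how such values could be computed follows, using a small made-up data frame `toy` (hypothetical values, not the company data) with the same two column names.

```r
# Hypothetical toy data; the real analysis would use data_set instead.
toy <- data.frame(
  satisfaction_level = c(0.8, 0.6, 0.7, 0.3, 0.4),
  left               = c(0, 0, 0, 1, 1)
)

# Mean over every employee (the "population" in the test).
pop_mean <- mean(toy$satisfaction_level)

# Mean over only the employees who left.
left_mean <- mean(toy$satisfaction_level[toy$left == 1])

print(paste("The mean satisfaction level for all employees is:", pop_mean))
print(paste("The mean satisfaction level for terminated employees is:", left_mean))
```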
Let’s conduct a t-test at the 95% confidence level and see whether it rejects the null hypothesis that the employees who left have the same mean satisfaction level as the employee population.
#Running the T-Test
left_pop<-subset(data_set,left==1)
t.test(left_pop$satisfaction_level,mu = mean(data_set$satisfaction_level))
##
## One Sample t-test
##
## data: left_pop$satisfaction_level
## t = -39.109, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0.6128335
## 95 percent confidence interval:
## 0.4314385 0.4487576
## sample estimates:
## mean of x
## 0.440098
#Degrees of freedom for a one-sample t-test: n - 1, matching df = 3570 in the output above
dof<-sum(data_set$left) - 1
LQ <-qt(0.025,dof) # Left (2.5%) quantile
RQ <-qt(0.975,dof) # Right (97.5%) quantile
print(c(paste('The t-distribution left quantile is:',round(LQ,4)), paste('The t-distribution right quantile is:',round(RQ,4))))
## [1] "The t-distribution left quantile is: -1.9606"
## [2] "The t-distribution right quantile is: 1.9606"
The test result shows that the test statistic t is equal to -39.109. This statistic measures how far the sample mean deviates from the value stated in the null hypothesis, in units of the standard error. If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis.
Reject the null hypothesis because:
The t-statistic of -39.109 lies far outside the quantile range of -1.96 to 1.96.
The p-value is lower than the 5% significance level.
Based on the one-sample t-test, there is a statistically significant difference between the mean satisfaction of employees who left and that of the entire employee population. The extremely low p-value of 9.012e-279 (printed by R as < 2.2e-16) is far below the 5% significance level and is strong evidence against the null hypothesis.
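The two rejection criteria, the t-statistic against the critical quantiles and the p-value against the significance level, are two views of the same decision. The sketch below checks that on a deliberately tiny sample: the ten `satisfaction_level` values visible in the `str()` output near the top of this report, tested against the population mean printed above. The variable names are illustrative, and ten values are far too few to reproduce the full analysis, so the outcome here says nothing about the real result.

```r
# The ten satisfaction_level values shown in the str() output (illustration only).
sample_x <- c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42)

# Population mean taken from the printed output above.
mu0   <- 0.6128335
alpha <- 0.05

# One-sample, two-sided t-test against mu0.
res <- t.test(sample_x, mu = mu0)

# Critical value for a two-sided test with n - 1 degrees of freedom.
df_x <- length(sample_x) - 1
crit <- qt(1 - alpha / 2, df_x)

# Rule 1: the statistic falls outside [-crit, crit].
reject_by_stat <- abs(unname(res$statistic)) > crit

# Rule 2: the p-value falls below the significance level.
reject_by_p <- res$p.value < alpha

print(c(by_statistic = reject_by_stat, by_p_value = reject_by_p))
```

`res$p.value` also holds the unrounded p-value, which is how a figure smaller than the `< 2.2e-16` shown in the printed summary can be retrieved.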
par(mfrow=c(1,3))
hist(data_set$satisfaction_level, col="green", xlab = "Satisfaction level", main = "Satisfaction Level")
hist(data_set$last_evaluation, col="red", xlab = "Last Evaluation" , main = "Last Evaluation")
hist(data_set$average_montly_hours, col="blue", xlab = "Avg Monthly Hours", main = "Avg Monthly Hours")
library(ggplot2) # provides ggplot(), used for the plots below
vis_1<-table(data_set$salary,data_set$left)
colnames(vis_1) <- c("No","Yes")
d_vis_1<-data.frame(vis_1)
colnames(d_vis_1) <- c("Salary","Terminated Status","Freq")
ggplot(d_vis_1, aes(x=Salary,y=Freq,fill=`Terminated Status`)) +
geom_bar(position="dodge",stat='identity') + coord_flip() + theme_classic()
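Raw counts per salary band depend on how many employees each band contains, so a rate per band can be easier to compare. The following is a minimal sketch with a made-up data frame `toy` (hypothetical values, not the company data) that converts the same kind of contingency table into row-wise proportions.

```r
# Hypothetical toy data with the same two column names as data_set.
toy <- data.frame(
  salary = c(rep("low", 4), rep("medium", 4), rep("high", 2)),
  left   = c(1, 1, 0, 0,  1, 0, 0, 0,  0, 0)
)

# Contingency table: rows are salary bands, columns are left = 0 / 1.
counts <- table(toy$salary, toy$left)

# margin = 1 normalises each row, giving the share that stayed/left per band.
rates <- prop.table(counts, margin = 1)

# Turnover rate per salary band (the column for left == 1).
left_rate <- rates[, "1"]
print(round(left_rate, 2))
```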
vis_2<-table(data_set$role,data_set$left)
d_vis_2<-as.data.frame(vis_2)
d_vis_2<-subset(d_vis_2,Var2==1)
colnames(d_vis_2)[1] <- "Department"
d_vis_2$Department <- factor(d_vis_2$Department, levels = d_vis_2$Department[order(-d_vis_2$Freq)])
ggplot(d_vis_2, aes(x=Department,y=Freq,fill=Department)) + theme_classic() +
geom_bar(stat='identity') +theme(axis.text.x = element_text(angle = 90, hjust = 1))
vis_3<-table(data_set$number_project,data_set$left)
colnames(vis_3) <- c("No", "Yes")
d_vis_3<-as.data.frame(vis_3)
colnames(d_vis_3) <- c("Project Count", "Terminated Status", "Freq")
ggplot(d_vis_3, aes(x=`Project Count`,y=Freq,fill=`Terminated Status`)) +
geom_bar(position="dodge",stat='identity') + coord_flip() + theme_classic()
#Kernel density plot of last_evaluation: red = employees who left, blue = employees who stayed
left_data<-subset(data_set,left==1)
stay_data<-subset(data_set,left==0)
ggplot() + geom_density(aes(x=last_evaluation), colour="red", data=left_data) +
geom_density(aes(x=last_evaluation), colour="blue", data=stay_data) + theme_classic()
#Kernel density plot of average_montly_hours: red = employees who left, blue = employees who stayed
ggplot() + geom_density(aes(x=average_montly_hours), colour="red", data=left_data) +
geom_density(aes(x=average_montly_hours), colour="blue", data=stay_data) + theme_classic()
Reasons why employees left the company:
Employees generally left when they were underworked (less than 150 hr/month, or about 6 hr/day).
Employees generally left when they were overworked (more than 250 hr/month, or about 10 hr/day).
Employees with either very high or very low evaluations should be watched as a higher turnover risk.
Employees with low to medium salaries make up the bulk of employee turnover.
Employees with a project count of 2, 6, or 7 were at risk of leaving the company.
Employee satisfaction is the strongest indicator of employee turnover.
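The hour and project thresholds listed above can be turned into simple rule-based flags. The sketch below is one possible way to do that, not a fitted model: the data frame `toy` and the column names `hours_flag`, `project_flag`, and `risk_flags` are hypothetical, and only the cut-offs (150 and 250 hours, project counts of 2, 6, or 7) come from the findings above.

```r
# Hypothetical toy data reusing two column names from data_set.
toy <- data.frame(
  average_montly_hours = c(140, 200, 260, 180, 210),
  number_project       = c(2, 4, 6, 3, 7)
)

# Flag employees who are underworked (< 150 hr) or overworked (> 250 hr).
toy$hours_flag <- toy$average_montly_hours < 150 | toy$average_montly_hours > 250

# Flag employees whose project count is 2, 6, or 7.
toy$project_flag <- toy$number_project %in% c(2, 6, 7)

# Count how many of the two rules each employee triggers (0, 1, or 2).
toy$risk_flags <- toy$hours_flag + toy$project_flag

print(toy)
```

A count of triggered rules is only a rough screening aid; the evaluation, salary, and satisfaction findings would need their own cut-offs, which the analysis does not state.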