Brief Overview


This analysis examines a company’s employee data. The aim is to identify the significant factors that lead to turnover and to determine which employees are at greater risk of leaving the organization. The final results are summarized at the end of the analysis.

Statistical Overview


The dataset has:


# 14999 total employee observations and 10 features captured
dim(data_set)
## [1] 14999    10
#View the 10 features collected for these 14999 employee observations. 
str(data_set)
## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ exp_in_company       : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ role                 : chr  "sales" "sales" "sales" "sales" ...
##  $ salary               : chr  "low" "medium" "medium" "low" ...
#Calculating the employee turnover rate (percent() is assumed to come from the scales package).
#From the calculation below, the company kept 76% of its employees and 24% of employees left.
turn_over_rate <- percent((sum(data_set$left ==1) / length(data_set$left)))
#The average satisfaction level for those who left the company was 0.44. On average, these employees handled slightly more projects, worked more hours, and had fewer work accidents than those who stayed at the company.
data_set2 <- data_set
data_set2$Left2 <- sapply(data_set2$left, function(x) if(x ==1) "Yes" else "No")
aggregate(data_set2[,-c(7,9,10,11)], by = list(Terminated = data_set2$Left2), FUN = mean)
##   Terminated satisfaction_level last_evaluation number_project
## 1         No          0.6668096       0.7154734       3.786664
## 2        Yes          0.4400980       0.7181126       3.855503
##   average_montly_hours exp_in_company Work_accident promotion_last_5years
## 1             199.0602       3.380032    0.17500875           0.026251313
## 2             207.4192       3.876505    0.04732568           0.005320638
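The turnover rate computed above is the mean of a 0/1 indicator. A minimal sketch using base R only, assuming a small made-up vector in place of data_set$left (the names turnover_rate and turnover_label are illustrative, not from the original analysis):

```r
# Toy stand-in for data_set$left (1 = left, 0 = stayed); values are made up.
left <- c(1, 0, 0, 0, 1, 0, 0, 0)

# Share of employees who left: the mean of a logical/0-1 indicator.
turnover_rate <- mean(left == 1)

# Format as a percentage string without any extra packages.
turnover_label <- sprintf("%.0f%%", 100 * turnover_rate)
print(turnover_label)
```

Applied to the real data, the same expression on data_set$left should reproduce the 24% figure quoted above.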

Correlation Plot


Moderate Positively Correlated Features:

Moderate Negatively Correlated Feature:

Corrplot Summary


From the heatmap, there is a positive (+) correlation among projectCount (number_project), averageMonthlyHours (average_montly_hours), and evaluation (last_evaluation). This could mean that employees who worked more hours and completed more projects were evaluated more highly.

For the negative (-) relationships, turnover (left) and satisfaction (satisfaction_level) are moderately correlated. This suggests that people tend to leave a company more when they are less satisfied.


corrplot(cor(data_set[,-c(9,10)]), tl.col = 'black', type = 'lower',tl.srt = 45, addCoef.col ='black',number.cex = 0.8)
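The moderately correlated pairs can also be read off the correlation matrix programmatically rather than from the plot. The sketch below is illustrative only: it uses made-up toy values, and the 0.3 cut-off for "moderate" is an assumption, not a threshold stated in the analysis.

```r
# Toy data with made-up values; column names mirror three dataset features.
toy <- data.frame(
  number_project       = c(2, 5, 7, 5, 2, 3, 6, 4),
  average_montly_hours = c(157, 262, 272, 223, 159, 180, 247, 200),
  last_evaluation      = c(0.53, 0.86, 0.88, 0.87, 0.52, 0.60, 0.77, 0.70)
)

cor_mat <- cor(toy)

# Keep each pair once (lower triangle) and filter on absolute correlation.
threshold <- 0.3  # assumed cut-off for a "moderate" correlation
idx <- which(lower.tri(cor_mat) & abs(cor_mat) >= threshold, arr.ind = TRUE)

moderate_pairs <- data.frame(
  feature_1 = rownames(cor_mat)[idx[, "row"]],
  feature_2 = colnames(cor_mat)[idx[, "col"]],
  r         = cor_mat[idx]
)
print(moderate_pairs)
```

On the real data, replacing toy with data_set[,-c(9,10)] would list both the positive and the negative pairs, since the filter uses the absolute value.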

T-Test


T-Test (Measuring Satisfaction Level)

A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

We will run the following test to see whether the average satisfaction level of employees who left differs from that of the entire employee population.

Hypothesis Testing: Is there a significant difference in mean satisfaction level between employees who left and the entire employee population?

  • Null Hypothesis: (H0: pTS = pES) There is no difference in mean satisfaction level between employees who left and the entire employee population.

  • Alternate Hypothesis: (HA: pTS != pES) There is a difference in mean satisfaction level between employees who left and the entire employee population.

## [1] "The mean satisfaction level for all employees (the population) is:  0.612833522234816"
## [1] "The mean satisfaction level for terminated employees is:  0.440098011761411"

Conducting the T-Test


Let’s conduct a t-test at the 95% confidence level and see whether it rejects the null hypothesis that the sample of employees who left comes from the same distribution as the employee population.

#Running the T-Test
left_pop<-subset(data_set,left==1)
t.test(left_pop$satisfaction_level,mu = mean(data_set$satisfaction_level))
## 
##  One Sample t-test
## 
## data:  left_pop$satisfaction_level
## t = -39.109, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0.6128335
## 95 percent confidence interval:
##  0.4314385 0.4487576
## sample estimates:
## mean of x 
##  0.440098
#Degrees of freedom (this uses n = 3571 leavers; the t-test above reports df = n - 1 = 3570, and the resulting critical values are nearly identical)
dof<-sum(data_set$left)
LQ <-qt(0.025,dof)  # Lower 2.5% quantile
RQ <-qt(0.975,dof)  # Upper 97.5% quantile

print(c(paste('The t-distribution left quantile is: ',LQ), paste('The t-distribution right quantile is: ' ,RQ)))
## [1] "The t-distribution left quantile is:  -1.96062852159556"
## [2] "The t-distribution right quantile is:  1.96062852159556"

T-Test Result


The test result shows the test statistic “t” is equal to -39.109. This test statistic tells us how far the sample mean deviates from the hypothesized mean, measured in standard errors. If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis. Here, -39.109 lies far below the lower quantile of about -1.96.
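As a worked check of the decision rule, the t-statistic can be rebuilt from the numbers printed by t.test(). This is a sketch under one assumption: the standard error is backed out from the reported confidence interval, because the sample standard deviation is not shown in the output. Variable names are illustrative.

```r
# Values copied from the t.test() output above.
x_bar <- 0.440098                 # mean satisfaction_level of employees who left
mu0   <- 0.6128335                # hypothesized mean (all employees)
ci    <- c(0.4314385, 0.4487576)  # 95 percent confidence interval
df    <- 3570                     # degrees of freedom reported by t.test()

# Critical value of the two-sided test at the 5% significance level.
t_crit <- qt(0.975, df)

# The CI half-width equals t_crit * standard error, so the standard error
# can be recovered from the reported interval.
std_err <- diff(ci) / (2 * t_crit)

# t-statistic: distance of the sample mean from mu0 in standard-error units.
t_stat <- (x_bar - mu0) / std_err

# Reject H0 when the statistic falls outside [-t_crit, t_crit].
reject_null <- abs(t_stat) > t_crit
print(c(t_stat = t_stat, t_crit = t_crit))
print(reject_null)
```

The recovered statistic agrees with the reported t = -39.109 up to rounding of the printed inputs.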

T-Test Summary


t-statistic = -39.109 | p-value < 2.2e-16 | Reject Null Hypothesis

Reject the null hypothesis because:

  • The t-statistic is outside the quantiles
  • The p-value is lower than the significance level of 5%

Based on the one-sample t-test, there is a statistically significant difference between the mean satisfaction of employees who left and that of the entire employee population. The very low p-value (< 2.2e-16), far below the 5% significance level, is a strong indication to reject the null hypothesis.

Distribution plots


Summary: The following are the distributions of some of the employee features.
  • Employee Satisfaction - There is a large spike of employees with low satisfaction and another concentration of employees with high satisfaction.
  • Employee Evaluation - There is a bimodal distribution of employees with low evaluations (less than 0.6) and high evaluations (more than 0.8), which suggests there are two distinct types of employees in the data.
  • Average Monthly Hours - There is another bimodal distribution of employees with lower and higher average monthly hours (less than 150 hours and more than 250 hours), which again suggests two distinct types of employees in the data.
  • The evaluation and average monthly hours graphs share a similar distribution.
  • Employees with lower average monthly hours received lower evaluations, and vice versa.
  • Looking back at the correlation matrix, there is a moderate positive correlation between evaluation and averageMonthlyHours (0.34), which supports this.
par(mfrow=c(1,3))
hist(data_set$satisfaction_level, col="green", xlab = "Satisfaction level", main = "Satisfaction Level")
hist(data_set$last_evaluation, col="red", xlab = "Last Evaluation" , main = "Last Evaluation")
hist(data_set$average_montly_hours, col="blue", xlab = "Avg Monthly Hours",  main = "Avg Monthly Hours")

Salary vs. Turnover


Summary:

  • The majority of employees who left the company had low or medium salaries.
vis_1<-table(data_set$salary,data_set$left)
colnames(vis_1) <- c("No","Yes")
d_vis_1<-data.frame(vis_1)
colnames(d_vis_1) <- c("Salary","Terminated Status","Freq")
ggplot(d_vis_1, aes(x=Salary,y=Freq,fill=`Terminated Status`)) +
 geom_bar(position="dodge",stat='identity') + coord_flip() + theme_classic()

Department vs. Turnover


Summary:

  • The sales, technical, and support departments have the highest number of employees who left.
  • The management department has the lowest number of employees who left.
vis_2<-table(data_set$role,data_set$left)
d_vis_2<-as.data.frame(vis_2)
d_vis_2<-subset(d_vis_2,Var2==1)
colnames(d_vis_2)[1] <- "Department"
d_vis_2$Department <- factor(d_vis_2$Department, levels = d_vis_2$Department[order(-d_vis_2$Freq)])
ggplot(d_vis_2, aes(x=Department,y=Freq,fill=Department)) + theme_classic() +
 geom_bar(stat='identity') +theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
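The bar chart above shows counts of leavers, which favor large departments. A minimal sketch of how counts and per-department rates could be compared side by side, assuming made-up toy rows rather than the real data (the names leaver_counts and leaver_rates are illustrative):

```r
# Toy data with made-up values; `role` and `left` mirror the dataset columns.
toy <- data.frame(
  role = c("sales", "sales", "sales", "sales",
           "management", "management", "support", "support"),
  left = c(1, 0, 0, 0, 1, 0, 1, 1)
)

# Number of leavers per department (the quantity the bar chart displays).
leaver_counts <- tapply(toy$left, toy$role, sum)

# Share of each department that left (mean of the 0/1 indicator).
leaver_rates <- tapply(toy$left, toy$role, mean)

print(sort(leaver_counts, decreasing = TRUE))
print(sort(leaver_rates, decreasing = TRUE))
```

A department can rank high on counts and low on rates, so looking at both helps when comparing departments of different sizes.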

Project Count vs. Turnover


Summary:

  • More than half of the employees with 2, 6, or 7 projects left the company.
  • All of the employees with 7 projects left the company.
  • Beyond the 2-project group, the employee turnover rate increases as project count increases.
vis_3<-table(data_set$number_project,data_set$left)
colnames(vis_3) <- c("No", "Yes")
d_vis_3<-as.data.frame(vis_3)
colnames(d_vis_3) <- c("Project Count", "Terminated Status", "Freq")
ggplot(d_vis_3, aes(x=`Project Count`,y=Freq,fill=`Terminated Status`)) +
 geom_bar(position="dodge",stat='identity') + coord_flip() + theme_classic()

Evaluation vs. Turnover


Summary:

  • Employees with low performance tend to leave the company more.
  • Employees with high performance tend to leave the company more.
  • Employees who stayed are most concentrated at evaluation scores between 0.6 and 0.8.
# Kernel Density Plot
left_data<-subset(data_set,left==1)
stay_data<-subset(data_set,left==0)
ggplot() + geom_density(aes(x=last_evaluation), colour="red", data=left_data) + 
  geom_density(aes(x=last_evaluation), colour="blue", data=stay_data) + theme_classic()

Average Monthly Hours vs. Turnover


Summary:

  • Employees who worked about 150 hours or less per month left the company more.
  • Employees who worked about 250 hours or more per month left the company more.
  • Employees who left were generally underworked or overworked.
#Kernel Density Estimate Plot
ggplot() + geom_density(aes(x=average_montly_hours), colour="red", data=left_data) + 
  geom_density(aes(x=average_montly_hours), colour="blue", data=stay_data) + theme_classic()

Overall Summary:

Reasons why employees left the company:

  1. Employees generally left when they were underworked (less than 150hr/month or 6hr/day).

  2. Employees generally left when they were overworked (more than 250hr/month or 10hr/day).

  3. Employees with either very high or very low evaluations are at higher risk of turnover.

  4. Employees with low to medium salaries make up the bulk of employee turnover.

  5. Employees with a project count of 2, 6, or 7 were at risk of leaving the company.

  6. Employee satisfaction is the strongest indicator of employee turnover.
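The findings with explicit thresholds can be turned into simple rule-based flags for spotting employees at greater risk. This is only an illustrative sketch on made-up toy rows, not a fitted model and not part of the original analysis. Evaluation and satisfaction are left out because the summary gives no cut-offs for them.

```r
# Toy employees with made-up values; column names mirror the dataset.
toy <- data.frame(
  average_montly_hours = c(140, 200, 270, 180),
  number_project       = c(2, 4, 7, 3),
  salary               = c("low", "high", "medium", "medium"),
  stringsAsFactors     = FALSE
)

# One logical flag per finding that states an explicit threshold.
flags <- data.frame(
  underworked   = toy$average_montly_hours < 150,
  overworked    = toy$average_montly_hours > 250,
  project_count = toy$number_project %in% c(2, 6, 7),
  lower_salary  = toy$salary %in% c("low", "medium")
)

# Number of flags raised per employee; a crude count, not a probability.
toy$risk_flags <- rowSums(flags)
print(toy)
```

Employees with more flags match more of the patterns listed above. How to weight the flags against each other is left open here.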