Abstract

This analysis deals with exploring different categorical and quantitative variables and their relationship to employee attrition. We will begin this analysis with a general overview to see how the data set is organized, calculate and visualize the correlation between the different variables and employee attrition, and generate a generalized linear model to reveal the good predictors of employee attrition. The data represented in this analysis is artificial/hypothetical and can be found on kaggle.com.

Key:

The data analysis will be broken down by the following sections:

Proportion of Attrition Within Dataset

ggplot(employee_data, aes(x = Attrition)) +
  geom_bar(position = "stack", fill = wes_palette("GrandBudapest2", n = 2)) +
  theme_dark() +
  labs(x = "Attrition", 
       y = "Count",
       title = "Less Attrition in this Data Set",
       caption = "Source: IBM HR Analytics") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

employee_data %>%
  group_by(Attrition) %>%
  summarise(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   Attrition     n
##   <chr>     <int>
## 1 No         1233
## 2 Yes         237

Pearson’s Correlation - Heat Map

corr_data <- employee_data %>%
  mutate(attrition = ifelse(Attrition == "No", 0, 1),
         gender = ifelse(Gender == "Female", 0, 1),
         overtime = ifelse(OverTime == "No", 0, 1)) %>%
  select(Age, attrition, DistanceFromHome, Education, NumCompaniesWorked, EnvironmentSatisfaction, gender, HourlyRate, JobSatisfaction, PercentSalaryHike,  overtime, TotalWorkingYears, WorkLifeBalance, YearsAtCompany:YearsWithCurrManager)


ggcorrplot(cor(corr_data), hc.order = TRUE, lab = TRUE, lab_size = 2) +
  labs(title = "Correlation Between Variables and Attrition",
       subtitle = "Netural and Positive Correlation",
       caption = "Source: IBM HR Analytics") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Positive Correlation to Attrition

In this section, we will evaluate the positive correlations within our data set in relation to attrition. According to our heat map, we have 2:

This indicates that as overtime increases and work distance from home increases, so does the likelihood of attrition as shown below:

ot_bar <- ggplot(employee_data, aes(x = OverTime, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  labs(x = "Over Time", 
       y = "Proportion",
       title = "Over Time Employees",
       subtitle = "Have More Attrition",
       caption = "Source: IBM HR Analytics") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

dist_box <- ggplot(employee_data, aes(x = Attrition, y = DistanceFromHome)) +
  geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
  theme_dark() +
  labs(x = "Attrition", 
       y = "Distance from Home",
       title = "More Distance from Work to Home",
       subtitle = "Has More Attrition",
       caption = "Source: IBM HR Analytics") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(ot_bar, dist_box, cols = 2)

This bar chart tells us, that there is more attrition in those that decide to work overtime vs those that do not. Also, this box plot tells us, that there is more attrition in those that have to commute greater distances than those that do not.

Although both plots show a positive correlation to attrition, is it statistically significant?

t.test(corr_data$overtime ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  corr_data$overtime by corr_data$attrition
## t = -8.7046, df = 304.63, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3696299 -0.2333247
## sample estimates:
## mean in group 0 mean in group 1 
##       0.2343877       0.5358650
t.test(corr_data$DistanceFromHome ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  corr_data$DistanceFromHome by corr_data$attrition
## t = -2.8882, df = 322.72, p-value = 0.004137
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.8870025 -0.5475146
## sample estimates:
## mean in group 0 mean in group 1 
##        8.915653       10.632911

Null Hypothesis 1: states that there is no significant difference between the attrition of those who work and do not work overtime.

Null Hypothesis 2: states that there is no significant difference between the attrition of those who work closer to home and those who work further from home.

Due to our p-values being less than 5% and the confidence interval does not include zero, we can reject both of the null hypotheses above with 95% confidence and state that there is a statistically significant difference in attrition for each respective variable.

Therefore, overtime workers and commuters (average of >10.63 miles) have a higher likelihood of attrition.

Neutral Correlation to Attrition

In this section, we will evaluate the neutral correlation within our data set in relation to attrition. According to our heat map, we have 7 (+- .05 from zero):

This indicates, that as each variable increases there will be no change in the likelihood of attrition as shown below:

hr_att <- ggplot(employee_data, aes(x = Attrition, y = HourlyRate)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "Hourly Rate"
)

perc_att <- ggplot(employee_data, aes(x = Attrition, y = PercentSalaryHike)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "% Salary Hike"
  )

years_att <- ggplot(employee_data, aes(x = Attrition, y = YearsSinceLastPromotion)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "Yrs Since Promo"
  )

numcomp_att <- ggplot(employee_data, aes(x = Attrition, y = NumCompaniesWorked)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "# Companies Worked"
  )

edu_att <- ggplot(employee_data, aes(x = Attrition, y = Education)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "Education"
  )

gender_bar<- ggplot(employee_data, aes(x = Gender, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark()

multiplot(hr_att, perc_att, years_att, numcomp_att, edu_att, gender_bar, cols = 2)

hourrate_t <- t.test(corr_data$HourlyRate ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
percsalhike_t <- t.test(corr_data$PercentSalaryHike ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
yrslastpromo_t <- t.test(corr_data$YearsSinceLastPromotion ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
gender_t <- t.test(corr_data$gender ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
numcompworked_t <- t.test(corr_data$NumCompaniesWorked ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
edu_t <-  t.test(corr_data$Education ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)

kable1 <- tribble(
  ~name, ~p.value,
  "Hourly Rate", hourrate_t$p.value,
  "Percent Salary Hike", percsalhike_t$p.value,
  "Years Since Last Promotion", yrslastpromo_t$p.value,
  "Number of Companies Worked", numcompworked_t$p.value,
  "Education", edu_t$p.value,
  "Gender", gender_t$p.value
)

knitr::kable(kable1)
name p.value
Hourly Rate 0.7913501
Percent Salary Hike 0.6144301
Years Since Last Promotion 0.1986513
Number of Companies Worked 0.1163340
Education 0.2241713
Gender 0.2542175

Hourly rate, percent salary hike, years since last promotion, number of companies worked, education, and gender all show no significant difference in those who did and did not leave their job. Which aligns with Pearson’s correlation test and our hypothesis!

Negative Correlation to Attrition

In this section, we will evaluate the negative correlation within our data set in relation to attrition. According to our heat map, we have 8:

This indicates, that as each variable increases there will be less likelihood of attrition as shown below:

job_att <- ggplot(employee_data, aes(x = Attrition, y = JobSatisfaction)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Job Satisfaction"
  )

age_att <- ggplot(employee_data, aes(x = Attrition, y = Age)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Age"
  )

work_att <- ggplot(employee_data, aes(x = Attrition, y = WorkLifeBalance)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "Work Life Balance"
  )

worky_att <- ggplot(employee_data, aes(x = Attrition, y = TotalWorkingYears)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Total Working Yrs"
  )

yearscurr_att <- ggplot(employee_data, aes(x = Attrition, y = YearsInCurrentRole)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Yrs in Curr Role"
  )

yearscomp_att <- ggplot(employee_data, aes(x = Attrition, y = YearsAtCompany)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Yrs at Company"
  )

yearsmgr_att <- ggplot(employee_data, aes(x = Attrition, y = YearsWithCurrManager)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Yrs w/ Curr Manager"
  )

env_att <- ggplot(employee_data, aes(x = Attrition, y = EnvironmentSatisfaction)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Env Satisfaction"
  )


multiplot(job_att, age_att, work_att, worky_att, yearscurr_att, yearscomp_att, yearsmgr_att, env_att, cols = 3)

jobsat_t <- t.test(corr_data$JobSatisfaction ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
age_t <- t.test(corr_data$Age ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
wrklifebal_t <- t.test(corr_data$WorkLifeBalance ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
workyrs_t <- t.test(corr_data$TotalWorkingYears ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
yrsrole_t <- t.test(corr_data$YearsInCurrentRole ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
yrscomp_t <- t.test(corr_data$YearsAtCompany ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
yrsmgr_t <- t.test(corr_data$YearsWithCurrManager ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
envsat_t <- t.test(corr_data$EnvironmentSatisfaction ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
kable2 <- tribble(
  ~name, ~p.value,
  "Job Satisfaction", jobsat_t$p.value,
  "Age", age_t$p.value,
  "Work Life Balance", wrklifebal_t$p.value,
  "Total Working Years", workyrs_t$p.value,
  "Years In Current Role", yrsrole_t$p.value,
  "Years at Company", yrscomp_t$p.value,
  "Years with Current Mgr", yrsmgr_t$p.value,
  "Environment Satisfaction", envsat_t$p.value
)

knitr::kable(kable2)
name p.value
Job Satisfaction 0.0001052
Age 0.0000000
Work Life Balance 0.0304657
Total Working Years 0.0000000
Years In Current Role 0.0000000
Years at Company 0.0000002
Years with Current Mgr 0.0000000
Environment Satisfaction 0.0002092

As shown in the table above, the p-value is the smallest for Years In Current Role, Total Working Years, and years with current Manager.

Logistic Linear Regression Model

As promised in earlier sections, we will build a binomial generalized linear model to predict the likelihood of attrition depending on all variables explored within the data analysis thus far.

When analyzing the results of the model, the variables with significant p-values (less than 5%) and large effect sizes will be categorized as good predictors of attrition.

Below is our Logistic Linear Regression Model and the summary of each variable in relation to each other and how it corresponds to the probability of attrition:

predicted <- glm(attrition ~ ., family = "binomial", data = corr_data)
summary(predicted)
## 
## Call:
## glm(formula = attrition ~ ., family = "binomial", data = corr_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7683  -0.5781  -0.3673  -0.1867   3.4194  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              2.177261   0.733859   2.967 0.003009 ** 
## Age                     -0.040006   0.012531  -3.193 0.001410 ** 
## DistanceFromHome         0.030448   0.009570   3.182 0.001465 ** 
## Education               -0.004928   0.078363  -0.063 0.949853    
## NumCompaniesWorked       0.146438   0.033706   4.345 1.40e-05 ***
## EnvironmentSatisfaction -0.369854   0.073564  -5.028 4.97e-07 ***
## gender                   0.263308   0.164557   1.600 0.109575    
## HourlyRate              -0.002323   0.003969  -0.585 0.558363    
## JobSatisfaction         -0.341895   0.072292  -4.729 2.25e-06 ***
## PercentSalaryHike       -0.014925   0.021932  -0.681 0.496163    
## overtime                 1.635352   0.165674   9.871  < 2e-16 ***
## TotalWorkingYears       -0.080278   0.022430  -3.579 0.000345 ***
## WorkLifeBalance         -0.260008   0.108545  -2.395 0.016603 *  
## YearsAtCompany           0.068049   0.034250   1.987 0.046939 *  
## YearsInCurrentRole      -0.125399   0.040721  -3.079 0.002074 ** 
## YearsSinceLastPromotion  0.143608   0.037811   3.798 0.000146 ***
## YearsWithCurrManager    -0.115703   0.041306  -2.801 0.005093 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1038.1  on 1453  degrees of freedom
## AIC: 1072.1
## 
## Number of Fisher Scoring iterations: 6

Based on the summary: Age, Distance from Home, Environment satisfaction, job satisfaction, overtime, total working years, work life balance, years in current role, years since last promotion, and years with current manager are all good predictors.

Top 3 Predictors

It is interesting to see that Numbers of Companies Worked and Years Since Last Promotion is statistically significant and are good predictors of attrition when it showed no average difference, noted by the t.test and heat map in the neutral correlation section.

It is also surprising to see that years at company is also not statistically significant variable and a good predictor of attrition despite the t.test and heat map (in the negative correlation section) stating there was a significant difference.

Categorical Variable Analysis

In this section, we will evaluate the correlation between the categorical variables within our data set in relation to attrition. Within this data set we have 5:

We want to explore if someone who travels rarely will have a different likelihood of attrition vs someone who travels frequently and so on.

To do this I have created a bar chart for each categorical variable and visualized the proportion of attrition for each level.

As shown below each categorical variable has a set of different levels:

Below are the bar charts for each categorical variable in relation to attrition:

ggplot(employee_data, aes(x = BusinessTravel, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  coord_flip() +
  labs(y = "Proportion")

ggplot(employee_data, aes(x = Department, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  coord_flip() +
  labs(y = "Proportion")

ggplot(employee_data, aes(x = EducationField, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  coord_flip() +
  labs(y = "Proportion")

ggplot(employee_data, aes(x = JobRole, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  coord_flip() +
  labs(y = "Proportion")

ggplot(employee_data, aes(x = MaritalStatus, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  coord_flip() +
  labs(y = "Proportion")

Keep in mind that these outcomes are a representation of this particular data set population but not the general population. In order to make a statistical inference on the general population, we would need a larger/diverse data set.

Predictors of Interest - Numeric vs Categorical

Within this section of the analysis, I wanted to dive deeper into the good predictors and analyze their relationship to one another.

We will facet wrap multiple variables together and analyze the relationship shown:

facet_data <- employee_data %>%
  mutate(education = mapvalues(employee_data$Education,
            from = c(1, 2, 3, 4, 5),
            to = c("Below College", "College", "Bachelor", "Master", "Doctor")),
          jobsatisfaction = mapvalues(employee_data$JobSatisfaction,
                                      from = c(1, 2, 3, 4),
                                      to = c("Low", "Medium", "High", "Very High")),
         environmentsatisfaction = mapvalues(employee_data$EnvironmentSatisfaction,
                                             from = c(1, 2, 3, 4),
                                             to = c("Low", "Medium", "High", "Very High")),
         worklifebalance = mapvalues(employee_data$WorkLifeBalance,
                                             from = c(1, 2, 3, 4),
                                             to = c("Low", "Medium", "High", "Very High"))
         )

facet_data$jobsatisfaction_f <- factor(facet_data$jobsatisfaction, levels = c("Low", "Medium", "High", "Very High"))
facet_data$education_f <- factor(facet_data$education, levels = c("Below College", "College", "Bachelor", "Master", "Doctor"))
facet_data$environmentsatisfaction_f <- factor(facet_data$environmentsatisfaction, levels = c("Low", "Medium", "High", "Very High"))
facet_data$worklifebalance_f <- factor(facet_data$worklifebalance, levels = c("Low", "Medium", "High", "Very High"))
ggplot(facet_data, aes(x = YearsInCurrentRole, y = YearsWithCurrManager)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_grid(Attrition ~ jobsatisfaction_f) 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As seen in the above graph, years in current role and years with current manager has a positive linear relationship in relation to attrition and job satisfaction

Let’s look at each section separately by attrition:

ggplot(facet_data, aes(x = TotalWorkingYears, y = YearsSinceLastPromotion)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_grid(Attrition ~ education_f) 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As seen in the above graph, total working years and years since last promotion, there are varying relationships in relation to attrition and education

Let’s look at each section separately by attrition:

Conclusion

In conclusion, the Pearson’s correlation heat map was very accurate in correspondence to the t.test results. Those that showed statistical significance showed either a positive or negative correlation with attrition and those that did not show statistical significance showed a neutral correlation with attrition. Although the accuracy was profound, we noticed varying results from our Linear Regression Model (which is a better representation for real-world application). The number of companies an individual worked at and years since their last promotion showed statistically significant positive relationships to attrition, although these two variables showed a neutral correlation in Pearson’s correlation heat map. Also, we found in our Linear Regression Model that the amount of years at a company is not statistically significant in comparison to the other variables used in this analysis.

There are many reasons why an individual chooses to leave their current role/company. However, throughout this analysis, we have generated possible focus points to improve retention (negative correlation section). Such as acquiring tenured professionals, both in total work experience and experience in their role, as well as maintaining a good employee to manager relationship. Lastly, we have also generated good predictors of attrition that, if focused on, can predict the likelihood of attrition within an individual. Such as, Overtime (more overtime workers have a higher likelihood of attrition), Environment satisfaction, and Job satisfaction (more satisfied the less likelihood of attrition).