Employee Attrition EDA

Abstract

This analysis deals with exploring different categorical and quantitative variables and their relationship to employee attrition. We will begin this analysis with a general overview to see how the data set is organized, calculate and visualize the correlation between the different variables and employee attrition, and generate a generalized linear model to reveal the good predictors of employee attrition. The data represented in this analysis is artificial/hypothetical and can be found on kaggle.com.

Key:

Education 1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’
EnvironmentSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’
JobInvolvement 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’
JobSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’
PerformanceRating 1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’
RelationshipSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’
WorkLifeBalance 1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’

The data analysis will be broken down by the following sections:

Proportion of Attrition Within Dataset
Pearson’s Correlation - Heat Map
Positive Correlation to Attrition
Neutral Correlation to Attrition
Negative Correlation to Attrition
Logistic Linear Regression Model
Categorical Variable Analysis
Predictors of Interest - Numeric vs Categorical

Proportion of Attrition Within Dataset

ggplot(employee_data, aes(x = Attrition)) +
  geom_bar(position = "stack", fill = wes_palette("GrandBudapest2", n = 2)) +
  theme_dark() +
  labs(x = "Attrition", 
       y = "Count",
       title = "Less Attrition in this Data Set",
       caption = "Source: IBM HR Analytics") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

employee_data %>%
  group_by(Attrition) %>%
  summarise(n = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   Attrition     n
##   <chr>     <int>
## 1 No         1233
## 2 Yes         237

Within the data set, 237 individuals left their employer and 1233 individuals who are still currently with their employer.
- This data set shows that roughly 20% of employees have left their employer.
We will analyze attrition and its relationship to multiple variables to see their correlation and possible focus points of improvement.
- To do this, we will run a Pearson’s Correlation heat map and see which variables have a positive, neutral, or negative correlation and explore each group and produce possible explanations for these relationships.

Pearson’s Correlation - Heat Map

corr_data <- employee_data %>%
  mutate(attrition = ifelse(Attrition == "No", 0, 1),
         gender = ifelse(Gender == "Female", 0, 1),
         overtime = ifelse(OverTime == "No", 0, 1)) %>%
  select(Age, attrition, DistanceFromHome, Education, NumCompaniesWorked, EnvironmentSatisfaction, gender, HourlyRate, JobSatisfaction, PercentSalaryHike,  overtime, TotalWorkingYears, WorkLifeBalance, YearsAtCompany:YearsWithCurrManager)


ggcorrplot(cor(corr_data), hc.order = TRUE, lab = TRUE, lab_size = 2) +
  labs(title = "Correlation Between Variables and Attrition",
       subtitle = "Netural and Positive Correlation",
       caption = "Source: IBM HR Analytics") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

First we have narrowed our data set to focus on variables of interest. To ensure that the correlation heat map correctly ran, all variables must be a numeric value.
Based on this correlation plot, we are more interested in how attrition fares with the other employee factors
We can separate by positive correlation, neutral correlation. and negative correlation
- Positive correlation: overtime and distance from home
- Neutral correlation (+- .05 from zero): hourly rate, distance from home, percent salary hike, gender, years since last promotion, and work life balance.
- Negative correlation: job satisfaction, age, total working years, years in current role, years at company, years with current manager, and environment satisfaction
This will be our overhead analysis and we will now dive a bit deeper into each group and see if it matches our linear regression model (which we will formulate later in our analysis).

Positive Correlation to Attrition

In this section, we will evaluate the positive correlations within our data set in relation to attrition. According to our heat map, we have 2:

OverTime
Distance from Home

This indicates that as overtime increases and work distance from home increases, so does the likelihood of attrition as shown below:

ot_bar <- ggplot(employee_data, aes(x = OverTime, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  labs(x = "Over Time", 
       y = "Proportion",
       title = "Over Time Employees",
       subtitle = "Have More Attrition",
       caption = "Source: IBM HR Analytics") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

dist_box <- ggplot(employee_data, aes(x = Attrition, y = DistanceFromHome)) +
  geom_boxplot(fill = wes_palette("Moonrise3", n = 2)) +
  theme_dark() +
  labs(x = "Attrition", 
       y = "Distance from Home",
       title = "More Distance from Work to Home",
       subtitle = "Has More Attrition",
       caption = "Source: IBM HR Analytics") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(ot_bar, dist_box, cols = 2)

This bar chart tells us, that there is more attrition in those that decide to work overtime vs those that do not. Also, this box plot tells us, that there is more attrition in those that have to commute greater distances than those that do not.

Within this data set, attrition will occur when an individual will have to commute 10.63+ miles to work.
- Keep in mind commute tolerance will vary depending on job type, geographic location, etc.

Although both plots show a positive correlation to attrition, is it statistically significant?

t.test(corr_data$overtime ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  corr_data$overtime by corr_data$attrition
## t = -8.7046, df = 304.63, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3696299 -0.2333247
## sample estimates:
## mean in group 0 mean in group 1 
##       0.2343877       0.5358650

t.test(corr_data$DistanceFromHome ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  corr_data$DistanceFromHome by corr_data$attrition
## t = -2.8882, df = 322.72, p-value = 0.004137
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.8870025 -0.5475146
## sample estimates:
## mean in group 0 mean in group 1 
##        8.915653       10.632911

Null Hypothesis 1: states that there is no significant difference between the attrition of those who work and do not work overtime.

Null Hypothesis 2: states that there is no significant difference between the attrition of those who work closer to home and those who work further from home.

Due to our p-values being less than 5% and the confidence interval does not include zero, we can reject both of the null hypotheses above with 95% confidence and state that there is a statistically significant difference in attrition for each respective variable.

Therefore, overtime workers and commuters (average of >10.63 miles) have a higher likelihood of attrition.

Neutral Correlation to Attrition

In this section, we will evaluate the neutral correlation within our data set in relation to attrition. According to our heat map, we have 7 (+- .05 from zero):

Hourly Rate
Percent Salary Hike
Years Since Last Promotion
Number of Companies Worked For
Education
Gender

This indicates, that as each variable increases there will be no change in the likelihood of attrition as shown below:

hr_att <- ggplot(employee_data, aes(x = Attrition, y = HourlyRate)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "Hourly Rate"
)

perc_att <- ggplot(employee_data, aes(x = Attrition, y = PercentSalaryHike)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "% Salary Hike"
  )

years_att <- ggplot(employee_data, aes(x = Attrition, y = YearsSinceLastPromotion)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "Yrs Since Promo"
  )

numcomp_att <- ggplot(employee_data, aes(x = Attrition, y = NumCompaniesWorked)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "# Companies Worked"
  )

edu_att <- ggplot(employee_data, aes(x = Attrition, y = Education)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "Education"
  )

gender_bar<- ggplot(employee_data, aes(x = Gender, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark()

multiplot(hr_att, perc_att, years_att, numcomp_att, edu_att, gender_bar, cols = 2)

As we can see, the averages of each variable are similar regardless of attrition status. Let’s confirm this by running t.tests for each variable in relation to attrition.
Because these are neutral correlation variables, we hypothesize, that each p-value will be > 5% which claims insignificance.

hourrate_t <- t.test(corr_data$HourlyRate ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
percsalhike_t <- t.test(corr_data$PercentSalaryHike ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
yrslastpromo_t <- t.test(corr_data$YearsSinceLastPromotion ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
gender_t <- t.test(corr_data$gender ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
numcompworked_t <- t.test(corr_data$NumCompaniesWorked ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
edu_t <-  t.test(corr_data$Education ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)

kable1 <- tribble(
  ~name, ~p.value,
  "Hourly Rate", hourrate_t$p.value,
  "Percent Salary Hike", percsalhike_t$p.value,
  "Years Since Last Promotion", yrslastpromo_t$p.value,
  "Number of Companies Worked", numcompworked_t$p.value,
  "Education", edu_t$p.value,
  "Gender", gender_t$p.value
)

knitr::kable(kable1)

name	p.value
Hourly Rate	0.7913501
Percent Salary Hike	0.6144301
Years Since Last Promotion	0.1986513
Number of Companies Worked	0.1163340
Education	0.2241713
Gender	0.2542175

Hourly rate, percent salary hike, years since last promotion, number of companies worked, education, and gender all show no significant difference in those who did and did not leave their job. Which aligns with Pearson’s correlation test and our hypothesis!

Therefore, within this data set, we can assume that the variables above do not affect attrition and will not be a good predictor when formulating a linear regression model (which will be formulated in later sections).

Negative Correlation to Attrition

In this section, we will evaluate the negative correlation within our data set in relation to attrition. According to our heat map, we have 8:

Job Satisfaction
Age
Work Life Balance
Total Working Years
Years in Current Role
Years at Company
Years with Current Manager
Environment Satisfaction

This indicates, that as each variable increases there will be less likelihood of attrition as shown below:

job_att <- ggplot(employee_data, aes(x = Attrition, y = JobSatisfaction)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Job Satisfaction"
  )

age_att <- ggplot(employee_data, aes(x = Attrition, y = Age)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Age"
  )

work_att <- ggplot(employee_data, aes(x = Attrition, y = WorkLifeBalance)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "Work Life Balance"
  )

worky_att <- ggplot(employee_data, aes(x = Attrition, y = TotalWorkingYears)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Total Working Yrs"
  )

yearscurr_att <- ggplot(employee_data, aes(x = Attrition, y = YearsInCurrentRole)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Yrs in Curr Role"
  )

yearscomp_att <- ggplot(employee_data, aes(x = Attrition, y = YearsAtCompany)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Yrs at Company"
  )

yearsmgr_att <- ggplot(employee_data, aes(x = Attrition, y = YearsWithCurrManager)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Yrs w/ Curr Manager"
  )

env_att <- ggplot(employee_data, aes(x = Attrition, y = EnvironmentSatisfaction)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Env Satisfaction"
  )


multiplot(job_att, age_att, work_att, worky_att, yearscurr_att, yearscomp_att, yearsmgr_att, env_att, cols = 3)

jobsat_t <- t.test(corr_data$JobSatisfaction ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
age_t <- t.test(corr_data$Age ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
wrklifebal_t <- t.test(corr_data$WorkLifeBalance ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
workyrs_t <- t.test(corr_data$TotalWorkingYears ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
yrsrole_t <- t.test(corr_data$YearsInCurrentRole ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
yrscomp_t <- t.test(corr_data$YearsAtCompany ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
yrsmgr_t <- t.test(corr_data$YearsWithCurrManager ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)
envsat_t <- t.test(corr_data$EnvironmentSatisfaction ~ corr_data$attrition, mu = 0, alt = "two.sided", conf = 0.95, var.eq = FALSE, paired = FALSE)

kable2 <- tribble(
  ~name, ~p.value,
  "Job Satisfaction", jobsat_t$p.value,
  "Age", age_t$p.value,
  "Work Life Balance", wrklifebal_t$p.value,
  "Total Working Years", workyrs_t$p.value,
  "Years In Current Role", yrsrole_t$p.value,
  "Years at Company", yrscomp_t$p.value,
  "Years with Current Mgr", yrsmgr_t$p.value,
  "Environment Satisfaction", envsat_t$p.value
)

knitr::kable(kable2)

name	p.value
Job Satisfaction	0.0001052
Age	0.0000000
Work Life Balance	0.0304657
Total Working Years	0.0000000
Years In Current Role	0.0000000
Years at Company	0.0000002
Years with Current Mgr	0.0000000
Environment Satisfaction	0.0002092

All t.tests above show a significant difference between the average of each variable with respect to attrition.
- Therefore we can say with 95% confidence that each variable has a significantly negative correlation with attrition.
To retain employees, they are more satisfied with their job, have had more tenure and working experience at the company.
They have also stayed under their current manager and are satisfied with their environment.
- If we were to choose areas to improve retention, we would look at the p-values and see which one had the smallest and make those areas of focus, since the probability of retention would be greater!

As shown in the table above, the p-value is the smallest for Years In Current Role, Total Working Years, and years with current Manager.

We need to ensure that we acquire tenured professionals, both in total work experience and experience in their role, as well as maintain a good employee to manager relationship.

Logistic Linear Regression Model

As promised in earlier sections, we will build a binomial generalized linear model to predict the likelihood of attrition depending on all variables explored within the data analysis thus far.

When analyzing the results of the model, the variables with significant p-values (less than 5%) and large effect sizes will be categorized as good predictors of attrition.

Below is our Logistic Linear Regression Model and the summary of each variable in relation to each other and how it corresponds to the probability of attrition:

predicted <- glm(attrition ~ ., family = "binomial", data = corr_data)
summary(predicted)

## 
## Call:
## glm(formula = attrition ~ ., family = "binomial", data = corr_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7683  -0.5781  -0.3673  -0.1867   3.4194  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              2.177261   0.733859   2.967 0.003009 ** 
## Age                     -0.040006   0.012531  -3.193 0.001410 ** 
## DistanceFromHome         0.030448   0.009570   3.182 0.001465 ** 
## Education               -0.004928   0.078363  -0.063 0.949853    
## NumCompaniesWorked       0.146438   0.033706   4.345 1.40e-05 ***
## EnvironmentSatisfaction -0.369854   0.073564  -5.028 4.97e-07 ***
## gender                   0.263308   0.164557   1.600 0.109575    
## HourlyRate              -0.002323   0.003969  -0.585 0.558363    
## JobSatisfaction         -0.341895   0.072292  -4.729 2.25e-06 ***
## PercentSalaryHike       -0.014925   0.021932  -0.681 0.496163    
## overtime                 1.635352   0.165674   9.871  < 2e-16 ***
## TotalWorkingYears       -0.080278   0.022430  -3.579 0.000345 ***
## WorkLifeBalance         -0.260008   0.108545  -2.395 0.016603 *  
## YearsAtCompany           0.068049   0.034250   1.987 0.046939 *  
## YearsInCurrentRole      -0.125399   0.040721  -3.079 0.002074 ** 
## YearsSinceLastPromotion  0.143608   0.037811   3.798 0.000146 ***
## YearsWithCurrManager    -0.115703   0.041306  -2.801 0.005093 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1038.1  on 1453  degrees of freedom
## AIC: 1072.1
## 
## Number of Fisher Scoring iterations: 6

Based on the summary: Age, Distance from Home, Environment satisfaction, job satisfaction, overtime, total working years, work life balance, years in current role, years since last promotion, and years with current manager are all good predictors.

Top 3 Predictors

Overtime had the smallest p-value as well as the largest effect size (positive correlation)
Environment satisfaction had a small p-value and the second largest effect size (negative correlation)
Job Satisfaction had a slightly larger p-value and the third largest effect size (negative correlation)

It is interesting to see that Numbers of Companies Worked and Years Since Last Promotion is statistically significant and are good predictors of attrition when it showed no average difference, noted by the t.test and heat map in the neutral correlation section.

It is also surprising to see that years at company is also not statistically significant variable and a good predictor of attrition despite the t.test and heat map (in the negative correlation section) stating there was a significant difference.

Keep in mind, these good/bad predictors are determined by their relationship to one another in regards to attrition
- They are not individually analyzed in relation to attrition as the earlier sections did
- The real world would operate more closely to the results of the generalized linear model as people will generally consider multiple variables together prior to leaving a job versus just one.

Categorical Variable Analysis

In this section, we will evaluate the correlation between the categorical variables within our data set in relation to attrition. Within this data set we have 5:

Business Travel
Department
Education Field
Job Role
Marital Status

We want to explore if someone who travels rarely will have a different likelihood of attrition vs someone who travels frequently and so on.

To do this I have created a bar chart for each categorical variable and visualized the proportion of attrition for each level.

As shown below each categorical variable has a set of different levels:

Business Travel: Travel Rarely, Travel Freq, Non-Travel
Department: Sales, Research & Development, HR
Education Field: Life Sciences, Other, Medical, Marketing, Tech degree, HR
Job Role: Sales Exec, Research Scientist, Lab tech, manufacturing director, HC rep, Manager, Sales rep, research director, and HR
Marital Status: Single, Married, and Divorced

Below are the bar charts for each categorical variable in relation to attrition:

ggplot(employee_data, aes(x = BusinessTravel, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  coord_flip() +
  labs(y = "Proportion")

Those who travel more frequently have a higher attrition rate. So the more frequently you travel, the higher the likelihood you are to leave your job within this data set.

ggplot(employee_data, aes(x = Department, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  coord_flip() +
  labs(y = "Proportion")

Within this analysis, those who work in sales and HR have a higher attrition rate vs those in the research and development department.

ggplot(employee_data, aes(x = EducationField, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  coord_flip() +
  labs(y = "Proportion")

Those with degrees or an educational background of HR, Marketing, and tech have a higher attrition rate than those with degrees or an educational background of life sciences and medical.

ggplot(employee_data, aes(x = JobRole, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  coord_flip() +
  labs(y = "Proportion")

Sales representatives have the highest attrition rate of all job roles in this data set, next are those who work in HR and as a lab technician
Sales executives and research scientists have lower attrition than the above but still have high attrition rates.
Manufacturing directors, managers, healthcare representatives, and research directors all have low attrition rates.

ggplot(employee_data, aes(x = MaritalStatus, fill = Attrition)) +
  geom_bar(position = "fill") +
  theme_dark() +
  coord_flip() +
  labs(y = "Proportion")

Those that are single have the highest attrition rates, those that are married and divorced have lower attrition rates.

Keep in mind that these outcomes are a representation of this particular data set population but not the general population. In order to make a statistical inference on the general population, we would need a larger/diverse data set.

Predictors of Interest - Numeric vs Categorical

Within this section of the analysis, I wanted to dive deeper into the good predictors and analyze their relationship to one another.

We will facet wrap multiple variables together and analyze the relationship shown:

facet_data <- employee_data %>%
  mutate(education = mapvalues(employee_data$Education,
            from = c(1, 2, 3, 4, 5),
            to = c("Below College", "College", "Bachelor", "Master", "Doctor")),
          jobsatisfaction = mapvalues(employee_data$JobSatisfaction,
                                      from = c(1, 2, 3, 4),
                                      to = c("Low", "Medium", "High", "Very High")),
         environmentsatisfaction = mapvalues(employee_data$EnvironmentSatisfaction,
                                             from = c(1, 2, 3, 4),
                                             to = c("Low", "Medium", "High", "Very High")),
         worklifebalance = mapvalues(employee_data$WorkLifeBalance,
                                             from = c(1, 2, 3, 4),
                                             to = c("Low", "Medium", "High", "Very High"))
         )

facet_data$jobsatisfaction_f <- factor(facet_data$jobsatisfaction, levels = c("Low", "Medium", "High", "Very High"))
facet_data$education_f <- factor(facet_data$education, levels = c("Below College", "College", "Bachelor", "Master", "Doctor"))
facet_data$environmentsatisfaction_f <- factor(facet_data$environmentsatisfaction, levels = c("Low", "Medium", "High", "Very High"))
facet_data$worklifebalance_f <- factor(facet_data$worklifebalance, levels = c("Low", "Medium", "High", "Very High"))

ggplot(facet_data, aes(x = YearsInCurrentRole, y = YearsWithCurrManager)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_grid(Attrition ~ jobsatisfaction_f)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As seen in the above graph, years in current role and years with current manager has a positive linear relationship in relation to attrition and job satisfaction

Let’s look at each section separately by attrition:

Those without attrition: low job satisfaction shows a positive and then negative correlation (a peak) in years with current role and current manager.
- This could imply internal management change throughout their tenure with the employer. Keep this in mind, as we will see this occur in those who left their jobs.
Medium, High, and Very High job satisfaction shows a consistently positive relationship, with no noticeable decline/peak
Those with attrition: Low, medium, and high job satisfaction showed two declines in the relationship between years in current role and years with current manager.
- The dips occur at different lengths of time, people with low satisfaction tend to have a decline sooner than those with medium and high job satisfaction.
- As expected, those with very high job satisfaction but still left their job showed a positive correlation throughout. This could imply that individuals might have suddenly left for a better opportunity towards the end of their tenure which explains no dips throughout their time with the current manager/current role.

ggplot(facet_data, aes(x = TotalWorkingYears, y = YearsSinceLastPromotion)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_grid(Attrition ~ education_f)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As seen in the above graph, total working years and years since last promotion, there are varying relationships in relation to attrition and education

Let’s look at each section separately by attrition:

Those without attrition: over the total years of work, there is an increase in years since last promotion for those with education levels of below college and a Bachelor’s.
We see that there is a dip in years since last promotion over total working years within education levels of college and master’s. This means that these individuals had a promotion within their tenure.
- It is surprising to see that college students will receive a promotion within their jobs at a sooner rate than those with Bachelor’s. However, this can be explained by the type of work.
- College students (less than a Bachelor’s degree) tend to have different types of part-time/full-time positions (food service, sales associate, etc) vs Bachelors (entry-level corporate roles, etc) which might lead to earlier promotions. Individuals with Bachelor Degrees that have entry-level positions normally require a certain amount of tenure/experience with the company prior to promotions.
Those with attrition: individuals with below college and college education (less than Bachelor’s) showed attrition after promotions (as noted by the dip). This can be explained by the type of jobs obtained with that particular education level.
- That regardless of a promotion, these positions are expected to have a higher turnover rate.
As expected, those with a Bachelor’s and Master’s Degree who were not promoted for a long time over the total tenure have left their jobs.
- Some possible explanations can include feeling under-appreciated for their tenure and effort, overlooking educational impact, etc.
Those with doctorates, showed attrition even after promotions, this could be due to suddenly finding a better opportunity with their tenure at the company.

Conclusion

In conclusion, the Pearson’s correlation heat map was very accurate in correspondence to the t.test results. Those that showed statistical significance showed either a positive or negative correlation with attrition and those that did not show statistical significance showed a neutral correlation with attrition. Although the accuracy was profound, we noticed varying results from our Linear Regression Model (which is a better representation for real-world application). The number of companies an individual worked at and years since their last promotion showed statistically significant positive relationships to attrition, although these two variables showed a neutral correlation in Pearson’s correlation heat map. Also, we found in our Linear Regression Model that the amount of years at a company is not statistically significant in comparison to the other variables used in this analysis.

There are many reasons why an individual chooses to leave their current role/company. However, throughout this analysis, we have generated possible focus points to improve retention (negative correlation section). Such as acquiring tenured professionals, both in total work experience and experience in their role, as well as maintaining a good employee to manager relationship. Lastly, we have also generated good predictors of attrition that, if focused on, can predict the likelihood of attrition within an individual. Such as, Overtime (more overtime workers have a higher likelihood of attrition), Environment satisfaction, and Job satisfaction (more satisfied the less likelihood of attrition).

Employee Attrition EDA

Kevin Tran

10/5/2020

Abstract

Proportion of Attrition Within Dataset

Pearson’s Correlation - Heat Map

Positive Correlation to Attrition

Neutral Correlation to Attrition

Negative Correlation to Attrition

Logistic Linear Regression Model

Categorical Variable Analysis

Predictors of Interest - Numeric vs Categorical

Conclusion