Employee attrition is a term to describe the situation when an employee leaves the company whether he or she is retire, resign to move to another company, layoff, and any other reasons. Employee Attrition does not always bad when it come to saves some spending, but become bad when the attrition rate is high and the company struggle to find the replacement. So it is important to analyze the case and the reasons, so then the company can prevents or be prepared.
In this article I would try to make simple analysis of some factors that probably leads to employee retention in companies. For this research I use IBM HR Employee Attrition dataset.
The following are a brief explanations of each variable in this dataset.
| Variable | Description |
|---|---|
Age |
Age of employee |
Attrition |
Attrition of employee (Yes, No) |
BusinessTravel |
Frequency of business travel (Non-Travel, Travel_Rarely, Travel_Frequently) |
DailyRate |
Amount of money a company has to pay employee to work for them for a day |
Department |
Work Department (Research and Development, Sales, Human Resources) |
DistanceFromHome |
Distance between company and home |
Education |
Level of education (1: Below College, 2: College, 3: Bachelor, 4: Master, 5: Doctor) |
EducationField |
Field of Education (Life Sciences, Medical, Human Resources, Technical Degree, Marketing, Other) |
EmployeeCount |
Count of employee (always 1) |
EmployeeNumber |
ID Employee |
EnvironmentSatisfaction |
Satisfaction of environment score(1: Low, 2: Medium, 3: High, 4: Very High) |
HourlyRate |
Amount of money a company has to pay employee to work for them for an hour |
JobInvolvement |
Level of job involvement (1: Low, 2: Medium, 3: High, 4: Very High) |
JobLevel |
Level of job (1 - 5) |
JobRole |
Role of job (Sales Executive, Research Scientist, Laboratory Technician, Manager, Healthcare Representative, Sales Representative, Manufacturing Director, Human Resources, Manager) |
JobSatisfaction |
Satisfaction of job (1: Low, 2: Medium, 3: High, 4: Very High) |
MaritalStatus |
Marital Status (Married, Single, Divorced) |
MonthlyIncome |
Monthly Income |
MonthlyRate |
Percent of salary of hike |
NumCompaniesWorked |
Total number of companies have been worked with |
Over18 |
Employee age over 18 years old (Yes, No) |
OverTime |
Frequently spent overtime working (Yes, No) |
PercentSalaryHike |
Percent of salary of hike |
PerformanceRating |
Level of performance assessment (1: Low, 2: Good, 3: Excellent, 4: Outstanding) |
RelationshipSatisfaction |
Level of relationship satisfaction (1: Low, 2: Medium, 3: High, 4: Very High) |
StandardHours |
Standard work hours (always 80) |
StockOptionLevel |
Stock option level (0 - 3) |
TotalWorkingYears |
Years of total working |
TrainingTimesLastYear |
Training times of last year |
WorkLifeBalance |
Level of work life balance (1: Bad, 2: Good, 3: Better, 4: Best) |
YearsAtCompany |
Years at company |
YearsInCurrentRole |
Years in current role |
YearsSinceLastPromotion |
Years since last promotion |
YearsWithCurrManager |
Years with current manager |
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(GGally)
library(gtools)
library(caret)
library(car)
library(scales)
library(lmtest)
library(ggplot2)
library(plotly)
library(ggthemes)
library(MLmetrics)
library(performance)
library(ggpubr)First Step: Load data
df_employeeAttrition <- read.csv("data_input/WA_Fn-UseC_-HR-Employee-Attrition.csv")Check data structure
glimpse(df_employeeAttrition)#> Rows: 1,470
#> Columns: 35
#> $ ï..Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35...
#> $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No"...
#> $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Tra...
#> $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 13...
#> $ Department <chr> "Sales", "Research & Development", "Resear...
#> $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 2...
#> $ Education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, ...
#> $ EducationField <chr> "Life Sciences", "Life Sciences", "Other",...
#> $ EmployeeCount <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
#> $ EmployeeNumber <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, ...
#> $ EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, ...
#> $ Gender <chr> "Female", "Male", "Male", "Female", "Male"...
#> $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84...
#> $ JobInvolvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, ...
#> $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, ...
#> $ JobRole <chr> "Sales Executive", "Research Scientist", "...
#> $ JobSatisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, ...
#> $ MaritalStatus <chr> "Single", "Married", "Single", "Married", ...
#> $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, ...
#> $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 9...
#> $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, ...
#> $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
#> $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Ye...
#> $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13...
#> $ PerformanceRating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, ...
#> $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, ...
#> $ StandardHours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80...
#> $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, ...
#> $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5...
#> $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, ...
#> $ WorkLifeBalance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, ...
#> $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2,...
#> $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, ...
#> $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, ...
#> $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, ...
This dataset consist of 35 features (variables) and 1,470 observations (rows data). There are 9 categorical columns and 26 numerical columns.
I will modify the first column name (ï..Age) to simplify and easy for processing. I also change the categorical type to factor
df_employeeAttrition <- df_employeeAttrition %>% rename(Age = ï..Age) %>% mutate_if(is.character, as.factor)Here are the summary of the raw dataset:
summary(df_employeeAttrition %>% select_if(is.numeric))#> Age DailyRate DistanceFromHome Education
#> Min. :18.00 Min. : 102.0 Min. : 1.000 Min. :1.000
#> 1st Qu.:30.00 1st Qu.: 465.0 1st Qu.: 2.000 1st Qu.:2.000
#> Median :36.00 Median : 802.0 Median : 7.000 Median :3.000
#> Mean :36.92 Mean : 802.5 Mean : 9.193 Mean :2.913
#> 3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000 3rd Qu.:4.000
#> Max. :60.00 Max. :1499.0 Max. :29.000 Max. :5.000
#> EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate
#> Min. :1 Min. : 1.0 Min. :1.000 Min. : 30.00
#> 1st Qu.:1 1st Qu.: 491.2 1st Qu.:2.000 1st Qu.: 48.00
#> Median :1 Median :1020.5 Median :3.000 Median : 66.00
#> Mean :1 Mean :1024.9 Mean :2.722 Mean : 65.89
#> 3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.:4.000 3rd Qu.: 83.75
#> Max. :1 Max. :2068.0 Max. :4.000 Max. :100.00
#> JobInvolvement JobLevel JobSatisfaction MonthlyIncome MonthlyRate
#> Min. :1.00 Min. :1.000 Min. :1.000 Min. : 1009 Min. : 2094
#> 1st Qu.:2.00 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 2911 1st Qu.: 8047
#> Median :3.00 Median :2.000 Median :3.000 Median : 4919 Median :14236
#> Mean :2.73 Mean :2.064 Mean :2.729 Mean : 6503 Mean :14313
#> 3rd Qu.:3.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:20462
#> Max. :4.00 Max. :5.000 Max. :4.000 Max. :19999 Max. :26999
#> NumCompaniesWorked PercentSalaryHike PerformanceRating
#> Min. :0.000 Min. :11.00 Min. :3.000
#> 1st Qu.:1.000 1st Qu.:12.00 1st Qu.:3.000
#> Median :2.000 Median :14.00 Median :3.000
#> Mean :2.693 Mean :15.21 Mean :3.154
#> 3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:3.000
#> Max. :9.000 Max. :25.00 Max. :4.000
#> RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
#> Min. :1.000 Min. :80 Min. :0.0000 Min. : 0.00
#> 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000 1st Qu.: 6.00
#> Median :3.000 Median :80 Median :1.0000 Median :10.00
#> Mean :2.712 Mean :80 Mean :0.7939 Mean :11.28
#> 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000 3rd Qu.:15.00
#> Max. :4.000 Max. :80 Max. :3.0000 Max. :40.00
#> TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
#> Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
#> 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
#> Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
#> Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
#> 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
#> Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
#> YearsSinceLastPromotion YearsWithCurrManager
#> Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 2.000
#> Median : 1.000 Median : 3.000
#> Mean : 2.188 Mean : 4.123
#> 3rd Qu.: 3.000 3rd Qu.: 7.000
#> Max. :15.000 Max. :17.000
and here is the categorical feature along with the number of unique values:
df_employeeAttrition %>% select_if(is.factor) %>%
summarise_all(~n_distinct(.)) %>%
pivot_longer(., everything(), names_to = "columns", values_to = "count_unique_values")There are some variables that can be removed as they do not give useful information nor relevant to the dependent variable
df_employeeAttrition <- df_employeeAttrition %>% select(-c("Over18", "EmployeeCount", "EmployeeNumber", "StandardHours", "HourlyRate", "MonthlyRate", "DailyRate"))
str(df_employeeAttrition)#> 'data.frame': 1470 obs. of 28 variables:
#> $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
#> $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
#> $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
#> $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
#> $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
#> $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
#> $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
#> $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
#> $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
#> $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
#> $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
#> $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
#> $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
#> $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
#> $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
#> $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
#> $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
#> $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
#> $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
#> $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
#> $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
#> $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
#> $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
#> $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
#> $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
#> $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
#> $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
#> $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
Now the number of columns reduced from 35 to 28.
Now let’s check if there is a missing value:
colSums(is.na(df_employeeAttrition))#> Age Attrition BusinessTravel
#> 0 0 0
#> Department DistanceFromHome Education
#> 0 0 0
#> EducationField EnvironmentSatisfaction Gender
#> 0 0 0
#> JobInvolvement JobLevel JobRole
#> 0 0 0
#> JobSatisfaction MaritalStatus MonthlyIncome
#> 0 0 0
#> NumCompaniesWorked OverTime PercentSalaryHike
#> 0 0 0
#> PerformanceRating RelationshipSatisfaction StockOptionLevel
#> 0 0 0
#> TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
#> 0 0 0
#> YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
#> 0 0 0
#> YearsWithCurrManager
#> 0
Great, no missing value.
Check if any duplicate data
df_employeeAttrition[duplicated(df_employeeAttrition),]Awesome, no duplicate data.
Here is the first 6 data.
head(df_employeeAttrition) EDA
3.1 Demography
3.1.1 Employee Demography
Gender, Age, Education Bg, Dept, Job Role, Job Level
3.1.2 Experience
Total Industry Experience
Years in Company
Total Company Worked
Training Completed
3.1.3 Company Survey [Scale (1-4)]
Employee Satisfaction
Job Satisfaction
Job Involvement
Work Life Balance
3.1.4 Performance
Job Involvement
Performance Rating
3.2 Employee Attrition against several factor
3.2.1 Per Department
3.2.2 Job Level
3.2.3 Salary
3.2.4 Gender
3.2.5 Other Factors
Home Distance
Overtime
Stock Options
As company owner or HR Department, it is important to know the employees demography to know their characteristics and become valuable to analyze further employees development. In this part i break the analysis into three segments: Overall Demography, Experiences, and Company Survey Result .
In this section i try to visualize the overall employees demography data like Gender,Education Background, Age Departments, Job Level, and Job Role. I also display the number of attrition rate per categories to find out whether
d1_plot <- df_employeeAttrition %>% select(Gender, Attrition) %>%
count(Gender, Attrition) %>%
ggplot(aes(x=Gender, y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.2, reverse = F), size=4) +
labs(fill = "Attrition", y="Count") + theme_minimal() +
theme(axis.text.x = element_text(angle = 40),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
d2_plot <- df_employeeAttrition %>%
mutate(Education = ifelse(Education == 0, "Below College",
ifelse(Education == 1, "College",
ifelse(Education == 2, "Bachelor",
ifelse(Education == 3, "Master", "Doctor")
)
)
)
) %>%
select(Education, Attrition) %>% count(Education, Attrition) %>%
ggplot(aes(x=Education, y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F), size=4) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(fill = "Attrition", y = "")
d3_plot <- df_employeeAttrition %>%
select(Department, Attrition) %>% count(Department, Attrition) %>%
ggplot(aes(x=Department, y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
scale_x_discrete(labels=c("HRD","R&D","Sales")) +
labs(fill = "Attrition", y = "Count")
d4_plot <- df_employeeAttrition %>%
select(JobRole, Attrition) %>% count(JobRole, Attrition) %>%
ggplot(aes(x=JobRole, y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.3, reverse = F),size=3.5) +
coord_flip() +
theme_minimal() +
theme(#axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(fill = "Attrition",
x = "Job Role", y = "Count")
d5_plot <- df_employeeAttrition %>%
select(JobLevel, Attrition) %>% count(JobLevel, Attrition) %>%
ggplot(aes(x=JobLevel, y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.8, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(fill = "Attrition",
x = "Job Level", y="")
d6_plot <- df_employeeAttrition %>%
mutate(Age = as.factor(
ifelse(Age < 20, "18-19",
ifelse((Age >= 20) & (Age <= 25), "20-25",
ifelse((Age >= 26) & (Age <= 30), "26-30",
ifelse((Age >= 31) & (Age <= 35), "31-35",
ifelse((Age >= 36) & (Age <= 40), "36-40",
ifelse((Age >= 41) & (Age <= 45), "41-45",
ifelse((Age >= 46) & (Age <= 50), "46-50",
ifelse((Age >= 51) & (Age <= 55), "51-55", ">55"
)
)
)
)
)
)
)
)
)
) %>%
group_by(Age, Attrition) %>% count(Age, Attrition) %>%
ggplot(aes(x=factor(Age, levels = c("18-19", "20-25", "26-30", "31-35", "36-40",
"41-45", "46-50", "51-55", ">55")),
y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.3, reverse = F),size=3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(fill = "Attrition", x = "Age", y = "")
demography_plot <- ggarrange(d1_plot, d2_plot, d3_plot, d5_plot,
ncol = 2, nrow = 2,
common.legend = T,
legend = "bottom")
demography_plotdemography_plot2 <- ggarrange(d4_plot, d6_plot,
ncol = 2, nrow = 1,
common.legend = T,
legend = "bottom")
demography_plot2Insights :
There are no significant discrepancies between male and female employee composition. 61 % of the employees are male and the rest is female.
From three departments, the most employees are in R&D (65.37%) and Sales Department (30..34%) while Human Resource Department has the least employee.
Most of our employees were graduated from university.
Most of the employees are work as Sales Executive, Research Scientist, and also Lab Technician. However those job role has the most highest level of attrition compare to any Job Role
Most of the employees are in productive age. From the employee’s age distribution, 60% of employee age are between 20-40 years.
Most of the employees are on Job Level 1 and 2. When we see the attrition rate, the highest rate among all Job Levels are from Job Level 1 (26.3 %).
In this section i try to visualize several employees experience feature that are available in the dataset, such as total working experience, total company work, duration work in this company, and number of completed training.
# 1. TotalWorkingYears
plot_exp1 <- df_employeeAttrition %>% select(TotalWorkingYears, Attrition) %>%
mutate(binning_workingyears = as.factor(
ifelse(TotalWorkingYears <= 4, "0-4",
ifelse((TotalWorkingYears >= 5 & TotalWorkingYears <= 9), "5-9",
ifelse((TotalWorkingYears >= 10 & TotalWorkingYears <= 14), "10-14",
ifelse((TotalWorkingYears >= 15 & TotalWorkingYears <= 19), "15-19",
ifelse((TotalWorkingYears >= 20 & TotalWorkingYears <= 24), "20-24",
ifelse((TotalWorkingYears >= 25 & TotalWorkingYears <= 29), "25-29",
ifelse((TotalWorkingYears >= 26 & TotalWorkingYears <= 29), "26-29",
ifelse((TotalWorkingYears >= 30 & TotalWorkingYears <= 34), "30-34",
ifelse((TotalWorkingYears >= 35 & TotalWorkingYears <= 39), "35-39", ">=40"
)
)
)
)
)
)
)
)
)
)
) %>%
count(binning_workingyears, Attrition) %>%
ggplot(aes(x=factor(binning_workingyears, levels = c("0-4", "5-9", "10-14", "15-19", "20-24",
"25-29", "26-29", "30-34", "35-39", ">=40")),
y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.3, reverse = F), size=3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Total Working Experience",
fill = "Attrition", x = "Total Working (Years)", y = ""
)
# 2.NumCompaniesWorked
plot_exp2 <- df_employeeAttrition %>% select(NumCompaniesWorked, Attrition) %>%
mutate(binning = as.factor(
ifelse(NumCompaniesWorked <= 1, "0-1",
ifelse((NumCompaniesWorked >= 2 & NumCompaniesWorked <= 3), "2-3",
ifelse((NumCompaniesWorked >= 4 & NumCompaniesWorked <= 5), "4-5",
ifelse((NumCompaniesWorked >= 6 & NumCompaniesWorked <= 7), "6-7", ">7"
)
)
)
)
)
) %>%
count(binning, Attrition) %>%
ggplot(aes(x=factor(binning, levels = c("0-1", "2-3", "4-5", "6-7",">7")),
y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Total Company Worked",
fill = "Attrition", x = "Total Company Worked", y = ""
)
# 3.YearsAtCompany
plot_exp3 <- df_employeeAttrition %>% select(YearsAtCompany, Attrition) %>%
mutate(binning_years = as.factor(
ifelse(YearsAtCompany <= 4, "0-4",
ifelse((YearsAtCompany >= 5 & YearsAtCompany <= 9), "5-9",
ifelse((YearsAtCompany >= 10 & YearsAtCompany <= 14), "10-14",
ifelse((YearsAtCompany >= 15 & YearsAtCompany <= 19), "15-19",
ifelse((YearsAtCompany >= 20 & YearsAtCompany <= 24), "20-24",
ifelse((YearsAtCompany >= 25 & YearsAtCompany <= 29), "25-29",
ifelse((YearsAtCompany >= 26 & YearsAtCompany <= 29), "26-29",
ifelse((YearsAtCompany >= 30 & YearsAtCompany <= 34), "30-34",
ifelse((YearsAtCompany >= 35 & YearsAtCompany <= 39), "35-39", ">=40"
)
)
)
)
)
)
)
)
)
)
) %>%
count(binning_years, Attrition) %>%
ggplot(aes(x=factor(binning_years, levels = c("0-4", "5-9", "10-14", "15-19", "20-24",
"25-29", "26-29", "30-34", "35-39", ">=40")),
y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.2, reverse = F), size=3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Years at Company",
fill = "Attrition", x = "Years at Company", y = ""
)
# 4.NumCompaniesWorked
plot_exp4 <- df_employeeAttrition %>% select(TrainingTimesLastYear, Attrition) %>%
mutate(binning = as.factor(
ifelse(TrainingTimesLastYear <= 1, "0-1",
ifelse((TrainingTimesLastYear >= 2 & TrainingTimesLastYear <= 3), "2-3",
ifelse((TrainingTimesLastYear >= 4 & TrainingTimesLastYear <= 5), "4-5", ">5"
)
)
)
)
) %>%
count(binning, Attrition) %>%
ggplot(aes(x=factor(binning, levels = c("0-1", "2-3", "4-5", ">5")),
y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.8, reverse = F), size=4) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Total Completed Training",
fill = "Attrition", x = "Total Training Completed", y = ""
)
experience_plot <- ggarrange(plot_exp1, plot_exp2, plot_exp3, plot_exp4,
ncol = 2, nrow = 2,
common.legend = T,
legend = "bottom")
experience_plotInsights :
The employees mostly have more than five years working experience and mostly have been work 0-9 years in this company.
Around 48 % of the employees probably fresh-graduate and start their career by working in this company, 44 % of them have been work for 2-5 different companies, and about 16 % have been work for more than 5 different companies
Around 70% (1038 persons) of the employees had completed 2-3 training. Interestingly, the highest attrition rate among the number of completed training groups are also from those group (around 59 % on group of employees that had completed 2-3 training).
It is also important to understand some feedback from the employees to analyze and become foundation for future improvements. In the datasets there are several company survey feature like Employee Satisfaction, Job Satisfaction, Job Involvement, and Work Life Balance. All of those survey are in range 1-4 score, the higher score means better result.
#1. EnvironmentSatisfaction
plot_sv1 <- df_employeeAttrition %>% select(EnvironmentSatisfaction, Attrition) %>%
count(EnvironmentSatisfaction, Attrition) %>%
ggplot(aes(x=factor(EnvironmentSatisfaction), y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Employee Satisfaction",
fill = "Attrition", x = "", y = ""
)
#2. JobSatisfaction
plot_sv2 <- df_employeeAttrition %>% select(JobSatisfaction, Attrition) %>%
count(JobSatisfaction, Attrition) %>%
ggplot(aes(x=factor(JobSatisfaction), y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Job Satisfaction",
fill = "Attrition", x = "", y = ""
)
#3. JobInvolvement
plot_sv3 <- df_employeeAttrition %>% select(JobInvolvement, Attrition) %>%
count(JobInvolvement, Attrition) %>%
ggplot(aes(x=factor(JobInvolvement), y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Job Involvement",
fill = "Attrition", x = "", y = ""
)
#4. WorkLifeBalance
plot_sv4 <- df_employeeAttrition %>% select(WorkLifeBalance, Attrition) %>%
count(WorkLifeBalance, Attrition) %>%
ggplot(aes(x=factor(WorkLifeBalance), y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Work Life Balance",
fill = "Attrition", x = "", y = ""
)
csurvey_plot <- ggarrange(plot_sv1, plot_sv2, plot_sv3, plot_sv4,
ncol = 2, nrow = 2,
common.legend = T,
legend = "bottom")
csurvey_plotInsights :
The proportion of each score for Employee Satisfaction and Job Satisfaction are quite similar. From Employee Satisfaction survey, 30.84 % were very satisfied (score = 4), 30.82 % were satisfied (score = 3), 19.52 % were partly-satisfied (score = 2), and 19.32 % are not satisfied (score = 1). While in in Job Satisfaction survey, 31.22 % were very satisfied (score = 4), 30.07 % were satisfied (score = 3), 19.05 % were partly-satisfied (score = 2), and 19.66 % are not satisfied (score = 1).
From the Job Involvement survey, around 59 % of them had high involvement in their job (score = 3). Among that group, the attrition rate is around 8 %.
Looks like the employees can manage their work-life balance. Based on Work-Life balance survey, around 60 % of them feels they had better (score = 3) work-life balance and only 5 % of them feels that they had bad (score = 1) work-life balance.
In this section, i would like to analyze several factor that probably leads to employee attrition such as income, home to workplace distance and overtime working, . I also breakdown the attrition rate per several variable like departments and company survey.
Before we deep dive into several factor, let’s see the overall attrition in this company
df_employeeAttrition %>% select(Attrition) %>% count(Attrition) %>%
mutate(percent=round((n/sum(n))*100,2),
lab_ypos = cumsum(percent) - 0.7*percent) %>%
ggplot(aes(x=2, y=percent, fill = factor(Attrition, levels = c("Yes", "No")))) +
geom_bar(stat="identity", start=0) +
coord_polar(theta = "y", start=0) +
geom_text(aes(y = lab_ypos,
label = paste0(percent,' ','%')), color = "white") +
theme_void() + theme(legend.position = "bottom") + xlim(0.5, 2.5) +
labs(title = "Attrition Rate ", fill = "Attrition") As shown in above diagram, the attrition rate is about 16 % . Several companies has different metrics or threshold to decide whether this rate is good, moderate, or bad.
temp_plot1 <- df_employeeAttrition %>%
select(Department, Attrition) %>% count(Department, Attrition) %>%
group_by(Department) %>%
mutate(percent = round((n/sum(n))*100,2)) %>%
ggplot(aes(x=Department, y=n,
text=paste0('</br>Department: ', Department,
'</br>Attrition Status: ', Attrition,
'</br>Count: ', n,
'</br>Percentage: ', percent, ' ', '%'))) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=paste0(percent, ' ' ,'%') , fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(title = "Attrition Rate per Department", fill = "Attrition", y = "")
ggplotly(temp_plot1, tooltip="text")From the Departments, the highest employees (consecutively) in this follwing department: Research & Developments, Sales, and HRD. However the highest attrition rate is from Sales department (20.63 %), followed by Research & Developments department (13.84 %) and HRD department (19.05 %).
temp_plot2 <- df_employeeAttrition %>%
select(JobLevel, Attrition) %>% count(JobLevel, Attrition) %>%
group_by(JobLevel) %>%
mutate(percent = round((n/sum(n))*100,2)) %>%
ggplot(aes(x=factor(JobLevel), y=n,
text=paste0('</br>Job Level: ', JobLevel,
'</br>Attrition Status: ', Attrition,
'</br>Count: ', n,
'</br>Percentage: ', percent, ' ', '%'))) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=paste0(percent, ' ' ,'%') , fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(title = "Attrition Rate per Job Level", fill = "Attrition", x = "Job Level", y = "")
ggplotly(temp_plot2, tooltip="text")Around 63 % of employees were in Job Level 1 and 2. The highest attrition rate is from Job Level I (26.34 %), followed by Job Level III (14.68 %), Job Level II (9.74 %), Job Level V (7.25 %), and Job Level IV (4.72 %).
avgincome_plot1 <- df_employeeAttrition %>%
select(JobRole, MonthlyIncome, Attrition) %>%
group_by(JobRole, Attrition) %>%
summarise(avg_monthly_income = round(mean(MonthlyIncome),2)) %>%
ggplot(aes(x=JobRole, y=avg_monthly_income,
fill=factor(Attrition, levels = c("Yes","No")))) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(x=JobRole, y=1000,
label = paste0("$"," ", avg_monthly_income)),
fontface="bold", color="white", size = 2.5, nudge_y = 1500) +
facet_wrap(~Attrition) +
coord_flip() +
theme_minimal() +
labs(
x = "Job Role", y = "Average Income (dollar)",
fill = "Attrition",
title="Average Income\nby Job Role and Attrition Status"
) +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
avgincome_plot1median_incomevsjobsscore <- df_employeeAttrition %>% select(MonthlyIncome, JobSatisfaction, Attrition) %>%
group_by(JobSatisfaction, Attrition) %>%
summarise(median_income = median(MonthlyIncome)) %>%
ggplot(aes(x=median_income, y=JobSatisfaction,
color=factor(Attrition, levels = c("Yes","No")))) +
geom_point(size = 4) +
geom_segment(aes(x = 0, xend = median_income,
y = JobSatisfaction, yend = JobSatisfaction), size = 2) +
geom_text(aes(x=1000, y=JobSatisfaction,
label= paste0("$ ", " ", median_income)),
nudge_y = 0.2) +
facet_wrap(~Attrition) +
theme_minimal() +
theme(legend.position = "bottom",
panel.grid.major.x = element_line(linetype = "dashed", colour = "grey"),
panel.grid.minor.x = element_line(linetype = "dashed", colour = "grey"),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()) +
labs(
x = "Median Income (dollar)", y = "Job Satisfaction Score",
color = "Attrition",
title="Does Income affect on Job Satisfaction?\nby Attrition Status"
)
median_incomevsjobsscoreI try to analyze the income per each job satisfaction score. In the plot above shows us that for the same level of Job Satisfaction Score there is a big gap of (median) income between those who leave the company and those who stay.
It also seems like the gap of income between the attrition status group tend to increase when Job Score Satisfaction Score is lower. Probably the gap of income become the reason for employee to leave.
#summary(df_employeeAttrition$PercentSalaryHike)
plot_subsalary1 <- df_employeeAttrition %>% select(PercentSalaryHike, Attrition) %>%
count(PercentSalaryHike, Attrition) %>%
group_by(PercentSalaryHike) %>% mutate(percent = round((n/sum(n))*100,2)) %>%
ggplot(aes(x=factor(PercentSalaryHike), y=n,
text=paste0('</br>Salary Hike Last Year (in percent): ', PercentSalaryHike,
'</br>Attrition Status: ', Attrition,
'</br>Count: ', n,
'</br>Percentage: ', percent, ' ', '%'))) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=paste0(percent, ' ' ,'%') , fill = factor(Attrition, levels = c("Yes", "No"))),
angle=90, fontface="bold", color="white", size=2,
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(title = "Attrition Rate per Salary Hike Last Year\n in percent", fill = "Attrition", x = "Salary Hike (%)", y = "")
ggplotly(plot_subsalary1, tooltip="text") Insights :
Most of the employees get salary hike between 11 - 14 %.
Employees that have salary hike between 11 to 14 percent have higher chance to leave.
df_employeeAttrition %>% select(EnvironmentSatisfaction, JobRole, Attrition) %>%
group_by(JobRole, Attrition) %>%
summarize(avg_env_score = round(mean(EnvironmentSatisfaction),2)) %>%
ggplot(aes(x=JobRole,y=avg_env_score)) +
geom_line(aes(group=Attrition), linetype= "twodash", size=1) +
geom_point(aes(color=Attrition), size=3) +
theme_minimal() +
theme(legend.position = "top", axis.text.x = element_text(angle = 90),
axis.line = element_line(colour = "grey",
size = 0.7, linetype = "solid"),
panel.grid.major.x = element_line(size = 0.5, linetype = "dashed", colour = "lightgray"),
panel.grid.minor.x = element_line(size = 0.5, linetype = "dashed", colour = "lightgray"),
panel.grid.major.y = element_line(size = 0.5,linetype = "dashed", colour = "lightgray")) +
labs(
x = "", y = "Average Score",
fill = "Attrition",
title = "Average Environtment Satisfaction Score\nper Job Role"
)Insights:
As seen on the plot above, for almost every job role those who leave the company has lower Environmental Satisfaction Score compare to those who stay. There is an exception for Research Director, probably Environmental Satisfaction is not their main reason to leave the company
On average, Managers and Healthcare Representatives that leave has lowest score and big gap compare to those who stay. Probably the environment job become the main issue for those role.
On average, the gap of Environmental Satisfaction Score between who leave and those who stay as Research Scientist and Sales Representativeis small. Probably the Environment for those role is good and not the main reason to leave.
I was wondering whether specific marital status increase the chance to leave the company. I assume that the married woman or the employee that has divorce tend to leave the company.
After I agggregate the data between the attrition status and the marital status the assumption is not true.
temp_plot4 <- df_employeeAttrition %>%
select(Gender, MaritalStatus, Attrition) %>% count(Gender, MaritalStatus, Attrition) %>%
group_by(Gender, MaritalStatus) %>%
mutate(percent = round((n/sum(n))*100,2)) %>%
ggplot(aes(x=factor(MaritalStatus), y=n,
text=paste0('</br>Gender: ', Gender,
'</br>Marital Status: ', MaritalStatus,
'</br>Attrition Status: ', Attrition,
'</br>Count: ', n,
'</br>Percentage: ', percent, ' ', '%'))) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
facet_wrap(~Gender) +
geom_text(aes(label=paste0(percent, ' ' ,'%') , fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(title = "Attrition Rate per Job Level", fill = "Attrition", x = "Marital Status", y = "")
ggplotly(temp_plot4, tooltip="text")Insights:
As seen on the plot above, the attrition rate for divorced employee is still lower than those who married or single.
Also the married woman has lower attrition rate than those who married. If compared to married male, the attrition rate for woman is still lower than married man.
#1. DistancefromHome
plot_homedistt <- df_employeeAttrition %>% select(DistanceFromHome, Attrition) %>%
mutate(homedist_category = as.factor(
ifelse(DistanceFromHome > mean(df_employeeAttrition$DistanceFromHome), "Above Average", "Below Average"))
) %>%
count(homedist_category, Attrition) %>%
group_by(homedist_category) %>%
mutate(percentage=round((n/sum(n))*100,2)) %>%
ggplot(aes(x=Attrition, y=n)) +
geom_bar(stat="identity", position = "dodge",
aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_label(aes(label = paste0(n,' ','(', percentage,'%',')')),
size=3, fill = "white", color = "black") +
facet_wrap(~homedist_category) +
theme_minimal() +
labs(fill = "Attrition", y="Count",
title="Proportion of Employee Attrition \nby Office-Home Distance") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
guides(fill = guide_legend(reverse = TRUE))
plot_homedistt#2. Overtime
plot_overtime <- df_employeeAttrition %>% select(OverTime, Attrition) %>%
count(OverTime, Attrition) %>%
group_by(OverTime) %>%
mutate(OverTime = as.factor(ifelse(OverTime == "Yes", "Overtime", "Non-Overtime")),
percentage=round((n/sum(n))*100,2)) %>%
ggplot(aes(x=Attrition ,y=n)) +
geom_bar(aes(fill = factor(Attrition, levels = c("Yes", "No"))),
show.legend = F,
stat="identity", position="dodge") +
facet_wrap(~OverTime) +
geom_label(aes(label = paste0(n,' ','(', percentage,'%',')')),
stat="identity", position = position_dodge(1),
size=3, fill = "white", color = "black") +
theme_minimal() +
labs(fill = "Attrition",
x = "Attrition", y="Count",
title="Proportion of Employee Attrition \nby Working Overtime") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
guides(fill = guide_legend(reverse = TRUE))
plot_overtime#3. StockOption
plot_stockoption <- df_employeeAttrition %>% select(StockOptionLevel, Attrition) %>%
count(StockOptionLevel, Attrition) %>%
group_by(StockOptionLevel) %>%
mutate(percentage = round((n/sum(n))*100,2)) %>%
ggplot(aes(x=factor(StockOptionLevel), y=n, group=factor(Attrition, levels = c("Yes", "No")))) +
geom_bar(aes(fill = factor(Attrition, levels = c("Yes", "No"))),
stat="identity", position="dodge", vjust=0) +
geom_label(aes(label = paste0(n,' ','\n(', percentage,'%',')')),
position = position_dodge(0.9), stat = "identity",
size=2.5, fill = "white", color = "black") +
theme_minimal() +
labs(fill = "Attrition",
x = "Stock Option Level", y="Count",
title="Proportion of Employee Attrition\nby Stock Option Level") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
guides(fill = guide_legend(reverse = TRUE))
plot_stockoptionInsight :
The average distance between office to home is around 9 km. Around 63 % of employees has office-home distance below or equal to the average.
The distance between office to home is not significantly contribute to higher attrition rate. Group of employees has the home distance above average still has lower rate compare to group who has distance below average
Around 28.3 % of employees worked overtime. The employee attrition rate on that group is higher than those who worked non-overtime.
Employees that have stock option is less likely to leave compared to those who do not. Less level of Stock Option leads to the increase of the attrition rate.
From the EDA section above we know that the current attrition rate is 16 %. There are several factors that have a distinct pattern and probably leads to increasing of rate of employees attrition such as salary hike, working overtime, and Stock Option Levels. By learning the data and visualize it, we can gain some insight and become foundation to develop strategic planning.