1 Introduction

Employee attrition is a term to describe the situation when an employee leaves the company whether he or she is retire, resign to move to another company, layoff, and any other reasons. Employee Attrition does not always bad when it come to saves some spending, but become bad when the attrition rate is high and the company struggle to find the replacement. So it is important to analyze the case and the reasons, so then the company can prevents or be prepared.

In this article I would try to make simple analysis of some factors that probably leads to employee retention in companies. For this research I use IBM HR Employee Attrition dataset.

2 Data Preparation

2.1 About the Data

The following are a brief explanations of each variable in this dataset.

Variable Description
Age Age of employee
Attrition Attrition of employee (Yes, No)
BusinessTravel Frequency of business travel (Non-Travel, Travel_Rarely, Travel_Frequently)
DailyRate Amount of money a company has to pay employee to work for them for a day
Department Work Department (Research and Development, Sales, Human Resources)
DistanceFromHome Distance between company and home
Education Level of education (1: Below College, 2: College, 3: Bachelor, 4: Master, 5: Doctor)
EducationField Field of Education (Life Sciences, Medical, Human Resources, Technical Degree, Marketing, Other)
EmployeeCount Count of employee (always 1)
EmployeeNumber ID Employee
EnvironmentSatisfaction Satisfaction of environment score(1: Low, 2: Medium, 3: High, 4: Very High)
HourlyRate Amount of money a company has to pay employee to work for them for an hour
JobInvolvement Level of job involvement (1: Low, 2: Medium, 3: High, 4: Very High)
JobLevel Level of job (1 - 5)
JobRole Role of job (Sales Executive, Research Scientist, Laboratory Technician, Manager, Healthcare Representative, Sales Representative, Manufacturing Director, Human Resources, Manager)
JobSatisfaction Satisfaction of job (1: Low, 2: Medium, 3: High, 4: Very High)
MaritalStatus Marital Status (Married, Single, Divorced)
MonthlyIncome Monthly Income
MonthlyRate Percent of salary of hike
NumCompaniesWorked Total number of companies have been worked with
Over18 Employee age over 18 years old (Yes, No)
OverTime Frequently spent overtime working (Yes, No)
PercentSalaryHike Percent of salary of hike
PerformanceRating Level of performance assessment (1: Low, 2: Good, 3: Excellent, 4: Outstanding)
RelationshipSatisfaction Level of relationship satisfaction (1: Low, 2: Medium, 3: High, 4: Very High)
StandardHours Standard work hours (always 80)
StockOptionLevel Stock option level (0 - 3)
TotalWorkingYears Years of total working
TrainingTimesLastYear Training times of last year
WorkLifeBalance Level of work life balance (1: Bad, 2: Good, 3: Better, 4: Best)
YearsAtCompany Years at company
YearsInCurrentRole Years in current role
YearsSinceLastPromotion Years since last promotion
YearsWithCurrManager Years with current manager

2.2 Import Library

library(dplyr)
library(tidyr)
library(readr)
library(stringr)

library(GGally)
library(gtools)
library(caret)
library(car)
library(scales)
library(lmtest)
library(ggplot2)
library(plotly)
library(ggthemes)
library(MLmetrics)
library(performance)
library(ggpubr)

2.3 Pre-Processing Data

First Step: Load data

df_employeeAttrition <- read.csv("data_input/WA_Fn-UseC_-HR-Employee-Attrition.csv")

Check data structure

glimpse(df_employeeAttrition)
#> Rows: 1,470
#> Columns: 35
#> $ ï..Age                   <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35...
#> $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No"...
#> $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Tra...
#> $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 13...
#> $ Department               <chr> "Sales", "Research & Development", "Resear...
#> $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 2...
#> $ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, ...
#> $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other",...
#> $ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
#> $ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, ...
#> $ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, ...
#> $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male"...
#> $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84...
#> $ JobInvolvement           <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, ...
#> $ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, ...
#> $ JobRole                  <chr> "Sales Executive", "Research Scientist", "...
#> $ JobSatisfaction          <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, ...
#> $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", ...
#> $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, ...
#> $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 11864, 9...
#> $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, ...
#> $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
#> $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Ye...
#> $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13...
#> $ PerformanceRating        <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, ...
#> $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, ...
#> $ StandardHours            <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80...
#> $ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, ...
#> $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5...
#> $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, ...
#> $ WorkLifeBalance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, ...
#> $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2,...
#> $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, ...
#> $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, ...
#> $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, ...

This dataset consist of 35 features (variables) and 1,470 observations (rows data). There are 9 categorical columns and 26 numerical columns.

I will modify the first column name (ï..Age) to simplify and easy for processing. I also change the categorical type to factor

df_employeeAttrition <- df_employeeAttrition %>% rename(Age = ï..Age) %>% mutate_if(is.character, as.factor)

Here are the summary of the raw dataset:

summary(df_employeeAttrition %>% select_if(is.numeric))
#>       Age          DailyRate      DistanceFromHome   Education    
#>  Min.   :18.00   Min.   : 102.0   Min.   : 1.000   Min.   :1.000  
#>  1st Qu.:30.00   1st Qu.: 465.0   1st Qu.: 2.000   1st Qu.:2.000  
#>  Median :36.00   Median : 802.0   Median : 7.000   Median :3.000  
#>  Mean   :36.92   Mean   : 802.5   Mean   : 9.193   Mean   :2.913  
#>  3rd Qu.:43.00   3rd Qu.:1157.0   3rd Qu.:14.000   3rd Qu.:4.000  
#>  Max.   :60.00   Max.   :1499.0   Max.   :29.000   Max.   :5.000  
#>  EmployeeCount EmployeeNumber   EnvironmentSatisfaction   HourlyRate    
#>  Min.   :1     Min.   :   1.0   Min.   :1.000           Min.   : 30.00  
#>  1st Qu.:1     1st Qu.: 491.2   1st Qu.:2.000           1st Qu.: 48.00  
#>  Median :1     Median :1020.5   Median :3.000           Median : 66.00  
#>  Mean   :1     Mean   :1024.9   Mean   :2.722           Mean   : 65.89  
#>  3rd Qu.:1     3rd Qu.:1555.8   3rd Qu.:4.000           3rd Qu.: 83.75  
#>  Max.   :1     Max.   :2068.0   Max.   :4.000           Max.   :100.00  
#>  JobInvolvement    JobLevel     JobSatisfaction MonthlyIncome    MonthlyRate   
#>  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   : 1009   Min.   : 2094  
#>  1st Qu.:2.00   1st Qu.:1.000   1st Qu.:2.000   1st Qu.: 2911   1st Qu.: 8047  
#>  Median :3.00   Median :2.000   Median :3.000   Median : 4919   Median :14236  
#>  Mean   :2.73   Mean   :2.064   Mean   :2.729   Mean   : 6503   Mean   :14313  
#>  3rd Qu.:3.00   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.: 8379   3rd Qu.:20462  
#>  Max.   :4.00   Max.   :5.000   Max.   :4.000   Max.   :19999   Max.   :26999  
#>  NumCompaniesWorked PercentSalaryHike PerformanceRating
#>  Min.   :0.000      Min.   :11.00     Min.   :3.000    
#>  1st Qu.:1.000      1st Qu.:12.00     1st Qu.:3.000    
#>  Median :2.000      Median :14.00     Median :3.000    
#>  Mean   :2.693      Mean   :15.21     Mean   :3.154    
#>  3rd Qu.:4.000      3rd Qu.:18.00     3rd Qu.:3.000    
#>  Max.   :9.000      Max.   :25.00     Max.   :4.000    
#>  RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
#>  Min.   :1.000            Min.   :80    Min.   :0.0000   Min.   : 0.00    
#>  1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000   1st Qu.: 6.00    
#>  Median :3.000            Median :80    Median :1.0000   Median :10.00    
#>  Mean   :2.712            Mean   :80    Mean   :0.7939   Mean   :11.28    
#>  3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000   3rd Qu.:15.00    
#>  Max.   :4.000            Max.   :80    Max.   :3.0000   Max.   :40.00    
#>  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
#>  Min.   :0.000         Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
#>  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
#>  Median :3.000         Median :3.000   Median : 5.000   Median : 3.000    
#>  Mean   :2.799         Mean   :2.761   Mean   : 7.008   Mean   : 4.229    
#>  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
#>  Max.   :6.000         Max.   :4.000   Max.   :40.000   Max.   :18.000    
#>  YearsSinceLastPromotion YearsWithCurrManager
#>  Min.   : 0.000          Min.   : 0.000      
#>  1st Qu.: 0.000          1st Qu.: 2.000      
#>  Median : 1.000          Median : 3.000      
#>  Mean   : 2.188          Mean   : 4.123      
#>  3rd Qu.: 3.000          3rd Qu.: 7.000      
#>  Max.   :15.000          Max.   :17.000

and here is the categorical feature along with the number of unique values:

df_employeeAttrition %>% select_if(is.factor) %>%
  summarise_all(~n_distinct(.)) %>% 
  pivot_longer(., everything(), names_to = "columns", values_to = "count_unique_values")

There are some variables that can be removed as they do not give useful information nor relevant to the dependent variable

df_employeeAttrition <- df_employeeAttrition %>% select(-c("Over18", "EmployeeCount", "EmployeeNumber", "StandardHours", "HourlyRate", "MonthlyRate", "DailyRate"))

str(df_employeeAttrition)
#> 'data.frame':    1470 obs. of  28 variables:
#>  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
#>  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
#>  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
#>  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
#>  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
#>  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
#>  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
#>  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
#>  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
#>  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
#>  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
#>  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
#>  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
#>  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
#>  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
#>  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
#>  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
#>  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
#>  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
#>  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
#>  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
#>  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
#>  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
#>  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
#>  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
#>  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
#>  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
#>  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

Now the number of columns reduced from 35 to 28.

Now let’s check if there is a missing value:

colSums(is.na(df_employeeAttrition))
#>                      Age                Attrition           BusinessTravel 
#>                        0                        0                        0 
#>               Department         DistanceFromHome                Education 
#>                        0                        0                        0 
#>           EducationField  EnvironmentSatisfaction                   Gender 
#>                        0                        0                        0 
#>           JobInvolvement                 JobLevel                  JobRole 
#>                        0                        0                        0 
#>          JobSatisfaction            MaritalStatus            MonthlyIncome 
#>                        0                        0                        0 
#>       NumCompaniesWorked                 OverTime        PercentSalaryHike 
#>                        0                        0                        0 
#>        PerformanceRating RelationshipSatisfaction         StockOptionLevel 
#>                        0                        0                        0 
#>        TotalWorkingYears    TrainingTimesLastYear          WorkLifeBalance 
#>                        0                        0                        0 
#>           YearsAtCompany       YearsInCurrentRole  YearsSinceLastPromotion 
#>                        0                        0                        0 
#>     YearsWithCurrManager 
#>                        0

Great, no missing value.

Check if any duplicate data

df_employeeAttrition[duplicated(df_employeeAttrition),]

Awesome, no duplicate data.

Here is the first 6 data.

head(df_employeeAttrition)

3 EDA

  EDA
      3.1 Demography
          3.1.1 Employee Demography
              Gender, Age, Education Bg, Dept, Job Role, Job Level
    
          3.1.2 Experience
            Total Industry Experience
            Years in Company
            Total Company Worked
            Training Completed
    
          3.1.3 Company Survey [Scale (1-4)]
              Employee Satisfaction
              Job Satisfaction
              Job Involvement
              Work Life Balance
        
          3.1.4 Performance
              Job Involvement
              Performance Rating

      3.2 Employee Attrition against several factor
            3.2.1 Per Department
            3.2.2 Job Level
            3.2.3 Salary
            3.2.4 Gender
            3.2.5 Other Factors
                  Home Distance
                  Overtime
                  Stock Options

3.1 Employee Demographics data

As company owner or HR Department, it is important to know the employees demography to know their characteristics and become valuable to analyze further employees development. In this part i break the analysis into three segments: Overall Demography, Experiences, and Company Survey Result .

3.1.1 Overall Demography

In this section i try to visualize the overall employees demography data like Gender,Education Background, Age Departments, Job Level, and Job Role. I also display the number of attrition rate per categories to find out whether

d1_plot <- df_employeeAttrition %>% select(Gender, Attrition) %>% 
  count(Gender, Attrition) %>% 
  ggplot(aes(x=Gender, y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.2, reverse = F), size=4) +
  labs(fill = "Attrition", y="Count") + theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())


d2_plot <- df_employeeAttrition %>%
  mutate(Education = ifelse(Education == 0, "Below College", 
                            ifelse(Education == 1, "College",
                                   ifelse(Education == 2, "Bachelor",
                                          ifelse(Education == 3, "Master", "Doctor")
                                          )
                                   )
                            )
         ) %>% 
  select(Education, Attrition) %>% count(Education, Attrition) %>% 
  ggplot(aes(x=Education, y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F), size=4) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(fill = "Attrition", y = "")


d3_plot <- df_employeeAttrition %>% 
  select(Department, Attrition) %>% count(Department, Attrition) %>% 
  ggplot(aes(x=Department, y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  scale_x_discrete(labels=c("HRD","R&D","Sales")) +
  labs(fill = "Attrition", y = "Count")

d4_plot <- df_employeeAttrition %>% 
  select(JobRole, Attrition) %>% count(JobRole, Attrition) %>% 
  ggplot(aes(x=JobRole, y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.3, reverse = F),size=3.5) +
  coord_flip() +
  theme_minimal() + 
  theme(#axis.text.x = element_text(angle = 90),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(fill = "Attrition",
       x = "Job Role", y = "Count")

d5_plot <- df_employeeAttrition %>%
  select(JobLevel, Attrition) %>% count(JobLevel, Attrition) %>% 
  ggplot(aes(x=JobLevel, y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.8, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(fill = "Attrition",
       x = "Job Level", y="")

d6_plot <- df_employeeAttrition %>%
  mutate(Age = as.factor(
    ifelse(Age < 20, "18-19",
        ifelse((Age >= 20) & (Age <= 25), "20-25",
          ifelse((Age >= 26) & (Age <= 30), "26-30",
            ifelse((Age >= 31) & (Age <= 35), "31-35",
              ifelse((Age >= 36) & (Age <= 40), "36-40",
                ifelse((Age >= 41) & (Age <= 45), "41-45",
                  ifelse((Age >= 46) & (Age <= 50), "46-50",
                    ifelse((Age >= 51) & (Age <= 55), "51-55", ">55"
                    )
                  )
                )
              )
            )
          )
        )
      )
    ) 
  ) %>% 
  group_by(Age, Attrition) %>% count(Age, Attrition) %>% 
  ggplot(aes(x=factor(Age, levels = c("18-19", "20-25", "26-30", "31-35", "36-40",
                                      "41-45", "46-50", "51-55", ">55")), 
             y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.3, reverse = F),size=3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90),
         panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(fill = "Attrition", x = "Age", y = "")

demography_plot <- ggarrange(d1_plot, d2_plot, d3_plot, d5_plot,
                      ncol = 2, nrow = 2, 
                      common.legend = T, 
                      legend = "bottom")
demography_plot

demography_plot2 <- ggarrange(d4_plot, d6_plot,
                      ncol = 2, nrow = 1, 
                      common.legend = T, 
                      legend = "bottom")
demography_plot2

Insights :

  • There are no significant discrepancies between male and female employee composition. 61 % of the employees are male and the rest is female.

  • From three departments, the most employees are in R&D (65.37%) and Sales Department (30..34%) while Human Resource Department has the least employee.

  • Most of our employees were graduated from university.

  • Most of the employees are work as Sales Executive, Research Scientist, and also Lab Technician. However those job role has the most highest level of attrition compare to any Job Role

  • Most of the employees are in productive age. From the employee’s age distribution, 60% of employee age are between 20-40 years.

  • Most of the employees are on Job Level 1 and 2. When we see the attrition rate, the highest rate among all Job Levels are from Job Level 1 (26.3 %).

3.1.2 Experiences

In this section i try to visualize several employees experience feature that are available in the dataset, such as total working experience, total company work, duration work in this company, and number of completed training.

# 1. TotalWorkingYears
plot_exp1 <- df_employeeAttrition %>% select(TotalWorkingYears, Attrition) %>% 
  mutate(binning_workingyears = as.factor(
                                ifelse(TotalWorkingYears <= 4, "0-4", 
                                  ifelse((TotalWorkingYears >= 5 & TotalWorkingYears <= 9), "5-9",
                                    ifelse((TotalWorkingYears >= 10 & TotalWorkingYears <= 14), "10-14",
                                      ifelse((TotalWorkingYears >= 15 & TotalWorkingYears <= 19), "15-19",
                                        ifelse((TotalWorkingYears >= 20 & TotalWorkingYears <= 24), "20-24",
                                          ifelse((TotalWorkingYears >= 25 & TotalWorkingYears <= 29), "25-29",
                                            ifelse((TotalWorkingYears >= 26 & TotalWorkingYears <= 29), "26-29",
                                              ifelse((TotalWorkingYears >= 30 & TotalWorkingYears <= 34), "30-34",
                                                ifelse((TotalWorkingYears >= 35 & TotalWorkingYears <= 39), "35-39", ">=40"
                                                      )
                                                    )
                                                  )
                                                )
                                              )
                                            )
                                          )
                                        )
                                      )
                                )
         ) %>% 
  count(binning_workingyears, Attrition) %>% 
  ggplot(aes(x=factor(binning_workingyears, levels = c("0-4", "5-9", "10-14", "15-19", "20-24",
                                                       "25-29", "26-29", "30-34", "35-39", ">=40")), 
             y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.3, reverse = F), size=3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Total Working Experience",
    fill = "Attrition", x = "Total Working (Years)", y = ""
    )


# 2.NumCompaniesWorked
plot_exp2 <- df_employeeAttrition %>% select(NumCompaniesWorked, Attrition) %>% 
  mutate(binning = as.factor(
                                ifelse(NumCompaniesWorked <= 1, "0-1", 
                                  ifelse((NumCompaniesWorked >= 2 & NumCompaniesWorked <= 3), "2-3",
                                    ifelse((NumCompaniesWorked >= 4 & NumCompaniesWorked <= 5), "4-5",
                                      ifelse((NumCompaniesWorked >= 6 & NumCompaniesWorked <= 7), "6-7", ">7"
                                            )
                                          )
                                        )
                                      )
                                )
         ) %>% 
  count(binning, Attrition) %>% 
  ggplot(aes(x=factor(binning, levels = c("0-1", "2-3", "4-5", "6-7",">7")), 
             y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Total Company Worked",
    fill = "Attrition", x = "Total Company Worked", y = ""
    )

# 3.YearsAtCompany
plot_exp3 <- df_employeeAttrition %>% select(YearsAtCompany, Attrition) %>% 
  mutate(binning_years = as.factor(
                                ifelse(YearsAtCompany <= 4, "0-4", 
                                  ifelse((YearsAtCompany >= 5 & YearsAtCompany <= 9), "5-9",
                                    ifelse((YearsAtCompany >= 10 & YearsAtCompany <= 14), "10-14",
                                      ifelse((YearsAtCompany >= 15 & YearsAtCompany <= 19), "15-19",
                                        ifelse((YearsAtCompany >= 20 & YearsAtCompany <= 24), "20-24",
                                          ifelse((YearsAtCompany >= 25 & YearsAtCompany <= 29), "25-29",
                                            ifelse((YearsAtCompany >= 26 & YearsAtCompany <= 29), "26-29",
                                              ifelse((YearsAtCompany >= 30 & YearsAtCompany <= 34), "30-34",
                                                ifelse((YearsAtCompany >= 35 & YearsAtCompany <= 39), "35-39", ">=40"
                                                      )
                                                    )
                                                  )
                                                )
                                              )
                                            )
                                          )
                                        )
                                      )
                                )
         ) %>% 
  count(binning_years, Attrition) %>% 
  ggplot(aes(x=factor(binning_years, levels = c("0-4", "5-9", "10-14", "15-19", "20-24",
                                                       "25-29", "26-29", "30-34", "35-39", ">=40")), 
             y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.2, reverse = F), size=3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Years at Company",
    fill = "Attrition", x = "Years at Company", y = ""
    )

# 4.NumCompaniesWorked
plot_exp4 <- df_employeeAttrition %>% select(TrainingTimesLastYear, Attrition) %>% 
  mutate(binning = as.factor(
                                ifelse(TrainingTimesLastYear <= 1, "0-1", 
                                  ifelse((TrainingTimesLastYear >= 2 & TrainingTimesLastYear <= 3), "2-3",
                                    ifelse((TrainingTimesLastYear >= 4 & TrainingTimesLastYear <= 5), "4-5", ">5"
                                          )
                                        )
                                      )
                                )
         ) %>% 
  count(binning, Attrition) %>% 
  ggplot(aes(x=factor(binning, levels = c("0-1", "2-3", "4-5", ">5")), 
             y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.8, reverse = F), size=4) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Total Completed Training",
    fill = "Attrition", x = "Total Training Completed", y = ""
    )

experience_plot <- ggarrange(plot_exp1, plot_exp2, plot_exp3, plot_exp4,
                      ncol = 2, nrow = 2, 
                      common.legend = T, 
                      legend = "bottom")
experience_plot

Insights :

  • The employees mostly have more than five years working experience and mostly have been work 0-9 years in this company.

  • Around 48 % of the employees probably fresh-graduate and start their career by working in this company, 44 % of them have been work for 2-5 different companies, and about 16 % have been work for more than 5 different companies

  • Around 70% (1038 persons) of the employees had completed 2-3 training. Interestingly, the highest attrition rate among the number of completed training groups are also from those group (around 59 % on group of employees that had completed 2-3 training).

3.1.3 Company Survey

It is also important to understand some feedback from the employees to analyze and become foundation for future improvements. In the datasets there are several company survey feature like Employee Satisfaction, Job Satisfaction, Job Involvement, and Work Life Balance. All of those survey are in range 1-4 score, the higher score means better result.

#1. EnvironmentSatisfaction
plot_sv1 <- df_employeeAttrition %>% select(EnvironmentSatisfaction, Attrition) %>% 
  count(EnvironmentSatisfaction, Attrition) %>% 
  ggplot(aes(x=factor(EnvironmentSatisfaction), y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Employee Satisfaction",
    fill = "Attrition", x = "", y = ""
    )

#2. JobSatisfaction
plot_sv2 <- df_employeeAttrition %>% select(JobSatisfaction, Attrition) %>% 
  count(JobSatisfaction, Attrition) %>% 
  ggplot(aes(x=factor(JobSatisfaction), y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Job Satisfaction",
    fill = "Attrition", x = "", y = ""
    )

#3. JobInvolvement
plot_sv3 <- df_employeeAttrition %>% select(JobInvolvement, Attrition) %>% 
  count(JobInvolvement, Attrition) %>% 
  ggplot(aes(x=factor(JobInvolvement), y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Job Involvement",
    fill = "Attrition", x = "", y = ""
    )

#4. WorkLifeBalance
plot_sv4 <- df_employeeAttrition %>% select(WorkLifeBalance, Attrition) %>% 
  count(WorkLifeBalance, Attrition) %>% 
  ggplot(aes(x=factor(WorkLifeBalance), y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Work Life Balance",
    fill = "Attrition", x = "", y = ""
    )

csurvey_plot <- ggarrange(plot_sv1, plot_sv2, plot_sv3, plot_sv4,
                      ncol = 2, nrow = 2, 
                      common.legend = T, 
                      legend = "bottom")
csurvey_plot

Insights :

  • The proportion of each score for Employee Satisfaction and Job Satisfaction are quite similar. From Employee Satisfaction survey, 30.84 % were very satisfied (score = 4), 30.82 % were satisfied (score = 3), 19.52 % were partly-satisfied (score = 2), and 19.32 % are not satisfied (score = 1). While in in Job Satisfaction survey, 31.22 % were very satisfied (score = 4), 30.07 % were satisfied (score = 3), 19.05 % were partly-satisfied (score = 2), and 19.66 % are not satisfied (score = 1).

  • From the Job Involvement survey, around 59 % of them had high involvement in their job (score = 3). Among that group, the attrition rate is around 8 %.

  • Looks like the employees can manage their work-life balance. Based on Work-Life balance survey, around 60 % of them feels they had better (score = 3) work-life balance and only 5 % of them feels that they had bad (score = 1) work-life balance.


3.2 Employee Attrition against several factor

In this section, i would like to analyze several factor that probably leads to employee attrition such as income, home to workplace distance and overtime working, . I also breakdown the attrition rate per several variable like departments and company survey.

3.2.1 Overall Attrition Rate

Before we deep dive into several factor, let’s see the overall attrition in this company

df_employeeAttrition %>% select(Attrition) %>% count(Attrition) %>% 
  mutate(percent=round((n/sum(n))*100,2), 
         lab_ypos = cumsum(percent) - 0.7*percent) %>% 
  ggplot(aes(x=2, y=percent, fill = factor(Attrition, levels = c("Yes", "No")))) +
  geom_bar(stat="identity", start=0) +
  coord_polar(theta = "y", start=0) +
  geom_text(aes(y = lab_ypos, 
                label = paste0(percent,' ','%')), color = "white") +
  theme_void() + theme(legend.position = "bottom") + xlim(0.5, 2.5) +
  labs(title = "Attrition Rate ", fill = "Attrition")

As shown in above diagram, the attrition rate is about 16 % . Several companies has different metrics or threshold to decide whether this rate is good, moderate, or bad.

3.2.2 Department

temp_plot1 <- df_employeeAttrition %>% 
  select(Department, Attrition) %>% count(Department, Attrition) %>% 
  group_by(Department) %>% 
  mutate(percent = round((n/sum(n))*100,2)) %>% 
  ggplot(aes(x=Department, y=n,
             text=paste0('</br>Department: ', Department,
                         '</br>Attrition Status: ', Attrition,
                         '</br>Count: ', n,
                         '</br>Percentage: ', percent, ' ', '%'))) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=paste0(percent, ' ' ,'%') , fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(title = "Attrition Rate per Department", fill = "Attrition", y = "")

ggplotly(temp_plot1, tooltip="text")

From the Departments, the highest employees (consecutively) in this follwing department: Research & Developments, Sales, and HRD. However the highest attrition rate is from Sales department (20.63 %), followed by Research & Developments department (13.84 %) and HRD department (19.05 %).

3.2.3 Job Level

temp_plot2 <- df_employeeAttrition %>% 
  select(JobLevel, Attrition) %>% count(JobLevel, Attrition) %>% 
  group_by(JobLevel) %>% 
  mutate(percent = round((n/sum(n))*100,2)) %>% 
  ggplot(aes(x=factor(JobLevel), y=n,
             text=paste0('</br>Job Level: ', JobLevel,
                         '</br>Attrition Status: ', Attrition,
                         '</br>Count: ', n,
                         '</br>Percentage: ', percent, ' ', '%'))) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=paste0(percent, ' ' ,'%') , fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(title = "Attrition Rate per Job Level", fill = "Attrition", x = "Job Level", y = "")

ggplotly(temp_plot2, tooltip="text")

Around 63 % of employees were in Job Level 1 and 2. The highest attrition rate is from Job Level I (26.34 %), followed by Job Level III (14.68 %), Job Level II (9.74 %), Job Level V (7.25 %), and Job Level IV (4.72 %).

3.2.4 Salary & Satisfaction Score

avgincome_plot1 <-  df_employeeAttrition %>% 
  select(JobRole, MonthlyIncome, Attrition) %>% 
  group_by(JobRole, Attrition) %>% 
  summarise(avg_monthly_income = round(mean(MonthlyIncome),2)) %>% 
  ggplot(aes(x=JobRole, y=avg_monthly_income,
             fill=factor(Attrition, levels = c("Yes","No")))) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(x=JobRole, y=1000, 
                label = paste0("$"," ", avg_monthly_income)),
                fontface="bold", color="white", size = 2.5, nudge_y = 1500) +
  facet_wrap(~Attrition) +
  coord_flip() +
  theme_minimal() +
  labs(
    x = "Job Role", y = "Average Income (dollar)",
    fill = "Attrition",
    title="Average Income\nby Job Role and Attrition Status"
  ) +
  theme(axis.text.x = element_text(angle = 90),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())
avgincome_plot1

median_incomevsjobsscore <- df_employeeAttrition %>% select(MonthlyIncome, JobSatisfaction, Attrition) %>% 
  group_by(JobSatisfaction, Attrition) %>% 
  summarise(median_income = median(MonthlyIncome)) %>% 
  ggplot(aes(x=median_income, y=JobSatisfaction, 
             color=factor(Attrition, levels = c("Yes","No")))) +
  geom_point(size = 4) +
  geom_segment(aes(x = 0, xend = median_income, 
                   y = JobSatisfaction, yend = JobSatisfaction), size = 2) +
  geom_text(aes(x=1000, y=JobSatisfaction, 
                label= paste0("$ ", " ", median_income)),
            nudge_y = 0.2) + 
  facet_wrap(~Attrition) +
  theme_minimal() +
  theme(legend.position = "bottom",
        panel.grid.major.x = element_line(linetype = "dashed", colour = "grey"),
        panel.grid.minor.x = element_line(linetype = "dashed", colour = "grey"),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank()) +
  labs(
    x = "Median Income (dollar)", y = "Job Satisfaction Score",
    color = "Attrition",
    title="Does Income affect on Job Satisfaction?\nby Attrition Status"
  )
median_incomevsjobsscore

I try to analyze the income per each job satisfaction score. In the plot above shows us that for the same level of Job Satisfaction Score there is a big gap of (median) income between those who leave the company and those who stay.

It also seems like the gap of income between the attrition status group tend to increase when Job Score Satisfaction Score is lower. Probably the gap of income become the reason for employee to leave.

#summary(df_employeeAttrition$PercentSalaryHike)

plot_subsalary1 <- df_employeeAttrition %>% select(PercentSalaryHike, Attrition) %>% 
  count(PercentSalaryHike, Attrition) %>% 
  group_by(PercentSalaryHike) %>% mutate(percent = round((n/sum(n))*100,2)) %>% 
  ggplot(aes(x=factor(PercentSalaryHike), y=n,
             text=paste0('</br>Salary Hike Last Year (in percent): ', PercentSalaryHike,
                         '</br>Attrition Status: ', Attrition,
                         '</br>Count: ', n,
                         '</br>Percentage: ', percent, ' ', '%'))) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=paste0(percent, ' ' ,'%') , fill = factor(Attrition, levels = c("Yes", "No"))),
            angle=90, fontface="bold", color="white", size=2,
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(title = "Attrition Rate per Salary Hike Last Year\n in percent", fill = "Attrition", x = "Salary Hike (%)", y = "")

ggplotly(plot_subsalary1, tooltip="text")  

Insights :

  • Most of the employees get salary hike between 11 - 14 %.

  • Employees that have salary hike between 11 to 14 percent have higher chance to leave.

df_employeeAttrition %>% select(EnvironmentSatisfaction, JobRole, Attrition) %>% 
  group_by(JobRole, Attrition) %>% 
  summarize(avg_env_score = round(mean(EnvironmentSatisfaction),2)) %>% 
  ggplot(aes(x=JobRole,y=avg_env_score)) +
  geom_line(aes(group=Attrition), linetype= "twodash", size=1) +
  geom_point(aes(color=Attrition), size=3) + 
  theme_minimal() +
  theme(legend.position = "top", axis.text.x = element_text(angle = 90),
        axis.line = element_line(colour = "grey", 
                      size = 0.7, linetype = "solid"),
        panel.grid.major.x = element_line(size = 0.5, linetype = "dashed", colour = "lightgray"),
        panel.grid.minor.x = element_line(size = 0.5, linetype = "dashed", colour = "lightgray"),
        panel.grid.major.y = element_line(size = 0.5,linetype = "dashed", colour = "lightgray")) +
  labs(
    x = "", y = "Average Score",
    fill = "Attrition",
    title = "Average Environtment Satisfaction Score\nper Job Role"
  )

Insights:

  • As seen on the plot above, for almost every job role those who leave the company has lower Environmental Satisfaction Score compare to those who stay. There is an exception for Research Director, probably Environmental Satisfaction is not their main reason to leave the company

  • On average, Managers and Healthcare Representatives that leave has lowest score and big gap compare to those who stay. Probably the environment job become the main issue for those role.

  • On average, the gap of Environmental Satisfaction Score between who leave and those who stay as Research Scientist and Sales Representativeis small. Probably the Environment for those role is good and not the main reason to leave.

3.2.5 Gender

I was wondering whether specific marital status increase the chance to leave the company. I assume that the married woman or the employee that has divorce tend to leave the company.

After I agggregate the data between the attrition status and the marital status the assumption is not true.

temp_plot4 <- df_employeeAttrition %>% 
  select(Gender, MaritalStatus, Attrition) %>% count(Gender, MaritalStatus, Attrition) %>% 
  group_by(Gender, MaritalStatus) %>% 
  mutate(percent = round((n/sum(n))*100,2)) %>% 
  ggplot(aes(x=factor(MaritalStatus), y=n,
             text=paste0('</br>Gender: ', Gender,
                         '</br>Marital Status: ', MaritalStatus,
                         '</br>Attrition Status: ', Attrition,
                         '</br>Count: ', n,
                         '</br>Percentage: ', percent, ' ', '%'))) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  facet_wrap(~Gender) +
  geom_text(aes(label=paste0(percent, ' ' ,'%') , fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(title = "Attrition Rate per Job Level", fill = "Attrition", x = "Marital Status", y = "")

ggplotly(temp_plot4, tooltip="text")

Insights:

  • As seen on the plot above, the attrition rate for divorced employee is still lower than those who married or single.

  • Also the married woman has lower attrition rate than those who married. If compared to married male, the attrition rate for woman is still lower than married man.

3.2.6 Other Factors

#1. DistancefromHome
plot_homedistt <- df_employeeAttrition %>% select(DistanceFromHome, Attrition) %>% 
  mutate(homedist_category = as.factor(
    ifelse(DistanceFromHome > mean(df_employeeAttrition$DistanceFromHome), "Above Average", "Below Average"))
         ) %>% 
  count(homedist_category, Attrition) %>% 
  group_by(homedist_category) %>% 
  mutate(percentage=round((n/sum(n))*100,2)) %>% 
  ggplot(aes(x=Attrition, y=n)) +
  geom_bar(stat="identity", position = "dodge", 
           aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_label(aes(label = paste0(n,' ','(', percentage,'%',')')), 
             size=3, fill = "white", color = "black") +
  facet_wrap(~homedist_category) +
  theme_minimal() +
  labs(fill = "Attrition", y="Count",
       title="Proportion of Employee Attrition \nby Office-Home Distance") +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  guides(fill = guide_legend(reverse = TRUE))
plot_homedistt

#2. Overtime
plot_overtime <- df_employeeAttrition %>% select(OverTime, Attrition) %>% 
  count(OverTime, Attrition) %>%
  group_by(OverTime) %>% 
  mutate(OverTime = as.factor(ifelse(OverTime == "Yes", "Overtime", "Non-Overtime")),
         percentage=round((n/sum(n))*100,2)) %>% 
  ggplot(aes(x=Attrition ,y=n)) +
  geom_bar(aes(fill = factor(Attrition, levels = c("Yes", "No"))),
           show.legend = F,
           stat="identity", position="dodge") +
  facet_wrap(~OverTime) +
  geom_label(aes(label = paste0(n,' ','(', percentage,'%',')')), 
             stat="identity", position = position_dodge(1),
             size=3, fill = "white", color = "black") +
  theme_minimal() +
  labs(fill = "Attrition", 
       x = "Attrition", y="Count",
       title="Proportion of Employee Attrition \nby Working Overtime") +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  guides(fill = guide_legend(reverse = TRUE))
plot_overtime

#3. StockOption
plot_stockoption <- df_employeeAttrition %>% select(StockOptionLevel, Attrition) %>%
  count(StockOptionLevel, Attrition) %>% 
  group_by(StockOptionLevel) %>% 
  mutate(percentage = round((n/sum(n))*100,2)) %>% 
  ggplot(aes(x=factor(StockOptionLevel), y=n, group=factor(Attrition, levels = c("Yes", "No")))) +
  geom_bar(aes(fill = factor(Attrition, levels = c("Yes", "No"))),
           stat="identity", position="dodge", vjust=0) +
  geom_label(aes(label = paste0(n,' ','\n(', percentage,'%',')')), 
             position = position_dodge(0.9), stat = "identity",
             size=2.5, fill = "white", color = "black") +
  theme_minimal() +
  labs(fill = "Attrition", 
       x = "Stock Option Level", y="Count",
       title="Proportion of Employee Attrition\nby Stock Option Level") +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  guides(fill = guide_legend(reverse = TRUE))

plot_stockoption

Insight :

  • The average distance between office to home is around 9 km. Around 63 % of employees has office-home distance below or equal to the average.

  • The distance between office to home is not significantly contribute to higher attrition rate. Group of employees has the home distance above average still has lower rate compare to group who has distance below average

  • Around 28.3 % of employees worked overtime. The employee attrition rate on that group is higher than those who worked non-overtime.

  • Employees that have stock option is less likely to leave compared to those who do not. Less level of Stock Option leads to the increase of the attrition rate.


4 Conclusion

From the EDA section above we know that the current attrition rate is 16 %. There are several factors that have a distinct pattern and probably leads to increasing of rate of employees attrition such as salary hike, working overtime, and Stock Option Levels. By learning the data and visualize it, we can gain some insight and become foundation to develop strategic planning.