1 Introduction

Employee attrition is a term to describe the situation when an employee leaves the company whether he or she is retire, resign to move to another company, layoff, and any other reasons. Employee Attrition does not always bad when it come to saves some spending, but become bad when the attrition rate is high and the company struggle to find the replacement. So it is important to analyze the case and the reasons, so then the company can prevents or be prepared.

In this article I would try develop classification model to predict the employee attrition status. In this article i use IBM HR Employee Attrition dataset. As part of Algortima DAta Science LBB (Learning By Building) Project for Classification-II, I use Naive Bayes, Decision Tree, and Random Forest to create model and compare their result in the end of chapter.

2 Data Preparation

2.1 About the Data

The following are a brief explanations of each variable in this dataset.

Variable Description
Age Age of employee
Attrition Attrition of employee (Yes, No)
BusinessTravel Frequency of business travel (Non-Travel, Travel_Rarely, Travel_Frequently)
DailyRate Amount of money a company has to pay employee to work for them for a day
Department Work Department (Research and Development, Sales, Human Resources)
DistanceFromHome Distance between company and home
Education Level of education (1: Below College, 2: College, 3: Bachelor, 4: Master, 5: Doctor)
EducationField Field of Education (Life Sciences, Medical, Human Resources, Technical Degree, Marketing, Other)
EmployeeCount Count of employee (always 1)
EmployeeNumber ID Employee
EnvironmentSatisfaction Satisfaction of environment score(1: Low, 2: Medium, 3: High, 4: Very High)
HourlyRate Amount of money a company has to pay employee to work for them for an hour
JobInvolvement Level of job involvement (1: Low, 2: Medium, 3: High, 4: Very High)
JobLevel Level of job (1 - 5)
JobRole Role of job (Sales Executive, Research Scientist, Laboratory Technician, Manager, Healthcare Representative, Sales Representative, Manufacturing Director, Human Resources, Manager)
JobSatisfaction Satisfaction of job (1: Low, 2: Medium, 3: High, 4: Very High)
MaritalStatus Marital Status (Married, Single, Divorced)
MonthlyIncome Monthly Income
MonthlyRate Percent of salary of hike
NumCompaniesWorked Total number of companies have been worked with
Over18 Employee age over 18 years old (Yes, No)
OverTime Frequently spent overtime working (Yes, No)
PercentSalaryHike Percent of salary of hike
PerformanceRating Level of performance assessment (1: Low, 2: Good, 3: Excellent, 4: Outstanding)
RelationshipSatisfaction Level of relationship satisfaction (1: Low, 2: Medium, 3: High, 4: Very High)
StandardHours Standard work hours (always 80)
StockOptionLevel Stock option level (0 - 3)
TotalWorkingYears Years of total working
TrainingTimesLastYear Training times of last year
WorkLifeBalance Level of work life balance (1: Bad, 2: Good, 3: Better, 4: Best)
YearsAtCompany Years at company
YearsInCurrentRole Years in current role
YearsSinceLastPromotion Years since last promotion
YearsWithCurrManager Years with current manager

2.2 Import Library

library(dplyr) # for data wrangling
library(caret) # to pre-process data
library(e1071)
library(ggplot2) # to visualize datagraph
library(tidymodels) # to build tidy models
library(ROCR) # create ROC plot
#library(partykit) #Visualisasi Decision Tree:

2.3 Pre-Processing Data

First Step: Load data

employee <- read.csv("data_input/WA_Fn-UseC_-HR-Employee-Attrition.csv")

Check data structure

glimpse(employee)
#> Rows: 1,470
#> Columns: 35
#> $ ï..Age                   <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35...
#> $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No"...
#> $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Tra...
#> $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 13...
#> $ Department               <chr> "Sales", "Research & Development", "Resear...
#> $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 2...
#> $ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, ...
#> $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other",...
#> $ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
#> $ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, ...
#> $ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, ...
#> $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male"...
#> $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84...
#> $ JobInvolvement           <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, ...
#> $ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, ...
#> $ JobRole                  <chr> "Sales Executive", "Research Scientist", "...
#> $ JobSatisfaction          <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, ...
#> $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", ...
#> $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, ...
#> $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 11864, 9...
#> $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, ...
#> $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
#> $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Ye...
#> $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13...
#> $ PerformanceRating        <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, ...
#> $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, ...
#> $ StandardHours            <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80...
#> $ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, ...
#> $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5...
#> $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, ...
#> $ WorkLifeBalance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, ...
#> $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2,...
#> $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, ...
#> $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, ...
#> $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, ...

This dataset consist of 35 features (variables) and 1,470 observations (rows data). There are 9 categorical columns and 26 numerical columns.

I will modify the first column name (ï..Age) to simplify and easy for processing. I also change the categorical type to factor

employee <- employee %>% rename(Age = ï..Age) %>% mutate_if(is.character, as.factor)

Here are the summary of the raw dataset:

summary(employee %>% select_if(is.numeric))
#>       Age          DailyRate      DistanceFromHome   Education    
#>  Min.   :18.00   Min.   : 102.0   Min.   : 1.000   Min.   :1.000  
#>  1st Qu.:30.00   1st Qu.: 465.0   1st Qu.: 2.000   1st Qu.:2.000  
#>  Median :36.00   Median : 802.0   Median : 7.000   Median :3.000  
#>  Mean   :36.92   Mean   : 802.5   Mean   : 9.193   Mean   :2.913  
#>  3rd Qu.:43.00   3rd Qu.:1157.0   3rd Qu.:14.000   3rd Qu.:4.000  
#>  Max.   :60.00   Max.   :1499.0   Max.   :29.000   Max.   :5.000  
#>  EmployeeCount EmployeeNumber   EnvironmentSatisfaction   HourlyRate    
#>  Min.   :1     Min.   :   1.0   Min.   :1.000           Min.   : 30.00  
#>  1st Qu.:1     1st Qu.: 491.2   1st Qu.:2.000           1st Qu.: 48.00  
#>  Median :1     Median :1020.5   Median :3.000           Median : 66.00  
#>  Mean   :1     Mean   :1024.9   Mean   :2.722           Mean   : 65.89  
#>  3rd Qu.:1     3rd Qu.:1555.8   3rd Qu.:4.000           3rd Qu.: 83.75  
#>  Max.   :1     Max.   :2068.0   Max.   :4.000           Max.   :100.00  
#>  JobInvolvement    JobLevel     JobSatisfaction MonthlyIncome    MonthlyRate   
#>  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   : 1009   Min.   : 2094  
#>  1st Qu.:2.00   1st Qu.:1.000   1st Qu.:2.000   1st Qu.: 2911   1st Qu.: 8047  
#>  Median :3.00   Median :2.000   Median :3.000   Median : 4919   Median :14236  
#>  Mean   :2.73   Mean   :2.064   Mean   :2.729   Mean   : 6503   Mean   :14313  
#>  3rd Qu.:3.00   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.: 8379   3rd Qu.:20462  
#>  Max.   :4.00   Max.   :5.000   Max.   :4.000   Max.   :19999   Max.   :26999  
#>  NumCompaniesWorked PercentSalaryHike PerformanceRating
#>  Min.   :0.000      Min.   :11.00     Min.   :3.000    
#>  1st Qu.:1.000      1st Qu.:12.00     1st Qu.:3.000    
#>  Median :2.000      Median :14.00     Median :3.000    
#>  Mean   :2.693      Mean   :15.21     Mean   :3.154    
#>  3rd Qu.:4.000      3rd Qu.:18.00     3rd Qu.:3.000    
#>  Max.   :9.000      Max.   :25.00     Max.   :4.000    
#>  RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
#>  Min.   :1.000            Min.   :80    Min.   :0.0000   Min.   : 0.00    
#>  1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000   1st Qu.: 6.00    
#>  Median :3.000            Median :80    Median :1.0000   Median :10.00    
#>  Mean   :2.712            Mean   :80    Mean   :0.7939   Mean   :11.28    
#>  3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000   3rd Qu.:15.00    
#>  Max.   :4.000            Max.   :80    Max.   :3.0000   Max.   :40.00    
#>  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
#>  Min.   :0.000         Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
#>  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
#>  Median :3.000         Median :3.000   Median : 5.000   Median : 3.000    
#>  Mean   :2.799         Mean   :2.761   Mean   : 7.008   Mean   : 4.229    
#>  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
#>  Max.   :6.000         Max.   :4.000   Max.   :40.000   Max.   :18.000    
#>  YearsSinceLastPromotion YearsWithCurrManager
#>  Min.   : 0.000          Min.   : 0.000      
#>  1st Qu.: 0.000          1st Qu.: 2.000      
#>  Median : 1.000          Median : 3.000      
#>  Mean   : 2.188          Mean   : 4.123      
#>  3rd Qu.: 3.000          3rd Qu.: 7.000      
#>  Max.   :15.000          Max.   :17.000

and here is the categorical feature along with the number of unique values:

employee %>% select_if(is.factor) %>%
  summarise_all(~n_distinct(.)) %>% 
  pivot_longer(., everything(), names_to = "columns", values_to = "count_unique_values") %>%
  arrange(desc(count_unique_values))

There are some variables that can be removed as they do not give useful information nor relevant to the dependent variable

employee <- employee %>% select(-c("Over18", "EmployeeCount", "EmployeeNumber", "StandardHours", "HourlyRate", "MonthlyRate", "DailyRate"))

str(employee)
#> 'data.frame':    1470 obs. of  28 variables:
#>  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
#>  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
#>  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
#>  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
#>  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
#>  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
#>  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
#>  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
#>  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
#>  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
#>  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
#>  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
#>  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
#>  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
#>  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
#>  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
#>  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
#>  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
#>  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
#>  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
#>  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
#>  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
#>  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
#>  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
#>  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
#>  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
#>  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
#>  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

Now the number of columns reduced from 35 to 28.

Now let’s check if there is a missing value:

colSums(is.na(employee))
#>                      Age                Attrition           BusinessTravel 
#>                        0                        0                        0 
#>               Department         DistanceFromHome                Education 
#>                        0                        0                        0 
#>           EducationField  EnvironmentSatisfaction                   Gender 
#>                        0                        0                        0 
#>           JobInvolvement                 JobLevel                  JobRole 
#>                        0                        0                        0 
#>          JobSatisfaction            MaritalStatus            MonthlyIncome 
#>                        0                        0                        0 
#>       NumCompaniesWorked                 OverTime        PercentSalaryHike 
#>                        0                        0                        0 
#>        PerformanceRating RelationshipSatisfaction         StockOptionLevel 
#>                        0                        0                        0 
#>        TotalWorkingYears    TrainingTimesLastYear          WorkLifeBalance 
#>                        0                        0                        0 
#>           YearsAtCompany       YearsInCurrentRole  YearsSinceLastPromotion 
#>                        0                        0                        0 
#>     YearsWithCurrManager 
#>                        0

Great, no missing value.

Check if any duplicate data

employee[duplicated(employee),]

Awesome, no duplicate data.

Here is the first 6 data.

head(employee)

3 EDA

Several analysis about this data analysis can be seen in my previous post here.

3.1 Class Target Proportion

employee %>% select(Attrition) %>% group_by(Attrition) %>% count(.) %>% 
  ggplot(aes(x=factor(Attrition),y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No"))), show.legend = F) +
  labs(title = "Overall Attrition Rate", x = "Attrition Status", y = "Count")

3.2 Demography

d1_plot <- employee %>% select(Gender, Attrition) %>% 
  count(Gender, Attrition) %>% 
  ggplot(aes(x=Gender, y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.2, reverse = F), size=4) +
  labs(fill = "Attrition", y="Count") + theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())


d2_plot <- employee %>%
  mutate(Education = ifelse(Education == 0, "Below College", 
                            ifelse(Education == 1, "College",
                                   ifelse(Education == 2, "Bachelor",
                                          ifelse(Education == 3, "Master", "Doctor")
                                          )
                                   )
                            )
         ) %>% 
  select(Education, Attrition) %>% count(Education, Attrition) %>% 
  ggplot(aes(x=Education, y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F), size=4) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(fill = "Attrition", y = "")


d3_plot <- employee %>% 
  select(Department, Attrition) %>% count(Department, Attrition) %>% 
  ggplot(aes(x=Department, y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  scale_x_discrete(labels=c("HRD","R&D","Sales")) +
  labs(fill = "Attrition", y = "Count")

d4_plot <- employee %>% 
  select(JobRole, Attrition) %>% count(JobRole, Attrition) %>% 
  ggplot(aes(x=JobRole, y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.3, reverse = F),size=3.5) +
  coord_flip() +
  theme_minimal() + 
  theme(#axis.text.x = element_text(angle = 90),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(fill = "Attrition",
       x = "Job Role", y = "Count")

d5_plot <- employee %>%
  select(JobLevel, Attrition) %>% count(JobLevel, Attrition) %>% 
  ggplot(aes(x=JobLevel, y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.8, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(fill = "Attrition",
       x = "Job Level", y="")

d6_plot <- employee %>%
  mutate(Age = as.factor(
    ifelse(Age < 20, "18-19",
        ifelse((Age >= 20) & (Age <= 25), "20-25",
          ifelse((Age >= 26) & (Age <= 30), "26-30",
            ifelse((Age >= 31) & (Age <= 35), "31-35",
              ifelse((Age >= 36) & (Age <= 40), "36-40",
                ifelse((Age >= 41) & (Age <= 45), "41-45",
                  ifelse((Age >= 46) & (Age <= 50), "46-50",
                    ifelse((Age >= 51) & (Age <= 55), "51-55", ">55"
                    )
                  )
                )
              )
            )
          )
        )
      )
    ) 
  ) %>% 
  group_by(Age, Attrition) %>% count(Age, Attrition) %>% 
  ggplot(aes(x=factor(Age, levels = c("18-19", "20-25", "26-30", "31-35", "36-40",
                                      "41-45", "46-50", "51-55", ">55")), 
             y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.3, reverse = F),size=3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90),
         panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(fill = "Attrition", x = "Age", y = "")


library(ggpubr)
demography_plot <- ggarrange(d1_plot, d2_plot, d3_plot, d5_plot,
                      ncol = 2, nrow = 2, 
                      common.legend = T, 
                      legend = "bottom")
demography_plot

  • Based on gender and education background, there is no big gap (of attrition rate) between it’s categories.
  • Among the Job Level, the top three Job Level that has highest Attrition Rate are Job level 1 (26.34%), Job level 3 (14.67%), and Job level 2 (9.37%)

3.3 Company Survey

#1. EnvironmentSatisfaction
plot_sv1 <- employee %>% select(EnvironmentSatisfaction, Attrition) %>% 
  count(EnvironmentSatisfaction, Attrition) %>% 
  ggplot(aes(x=factor(EnvironmentSatisfaction), y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Employee Satisfaction",
    fill = "Attrition", x = "", y = ""
    )

#2. JobSatisfaction
plot_sv2 <- employee %>% select(JobSatisfaction, Attrition) %>% 
  count(JobSatisfaction, Attrition) %>% 
  ggplot(aes(x=factor(JobSatisfaction), y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Job Satisfaction",
    fill = "Attrition", x = "", y = ""
    )

#3. JobInvolvement
plot_sv3 <- employee %>% select(JobInvolvement, Attrition) %>% 
  count(JobInvolvement, Attrition) %>% 
  ggplot(aes(x=factor(JobInvolvement), y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Job Involvement",
    fill = "Attrition", x = "", y = ""
    )

#4. WorkLifeBalance
plot_sv4 <- employee %>% select(WorkLifeBalance, Attrition) %>% 
  count(WorkLifeBalance, Attrition) %>% 
  ggplot(aes(x=factor(WorkLifeBalance), y=n)) +
  geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.5, reverse = F)) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  labs(
    title="Work Life Balance",
    fill = "Attrition", x = "", y = ""
    )

csurvey_plot <- ggarrange(plot_sv1, plot_sv2, plot_sv3, plot_sv4,
                      ncol = 2, nrow = 2, 
                      common.legend = T, 
                      legend = "bottom")
csurvey_plot

3.4 Others Factor

#1. DistancefromHome
plot_homedistt <- employee %>% select(DistanceFromHome, Attrition) %>% 
  mutate(homedist_category = as.factor(
    ifelse(DistanceFromHome > mean(employee$DistanceFromHome), "Above Average", "Below Average"))
         ) %>% 
  count(homedist_category, Attrition) %>% 
  group_by(homedist_category) %>% 
  mutate(percentage=round((n/sum(n))*100,2)) %>% 
  ggplot(aes(x=Attrition, y=n)) +
  geom_bar(stat="identity", position = "dodge", 
           aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
  geom_label(aes(label = paste0(n,' ','(', percentage,'%',')')), 
             size=3, fill = "white", color = "black") +
  facet_wrap(~homedist_category) +
  theme_minimal() +
  labs(fill = "Attrition", y="Count",
       title="Proportion of Employee Attrition \nby Office-Home Distance") +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  guides(fill = guide_legend(reverse = TRUE))
plot_homedistt

#2. Overtime
plot_overtime <- employee %>% select(OverTime, Attrition) %>% 
  count(OverTime, Attrition) %>%
  group_by(OverTime) %>% 
  mutate(OverTime = as.factor(ifelse(OverTime == "Yes", "Overtime", "Non-Overtime")),
         percentage=round((n/sum(n))*100,2)) %>% 
  ggplot(aes(x=Attrition ,y=n)) +
  geom_bar(aes(fill = factor(Attrition, levels = c("Yes", "No"))),
           show.legend = F,
           stat="identity", position="dodge") +
  facet_wrap(~OverTime) +
  geom_label(aes(label = paste0(n,' ','(', percentage,'%',')')), 
             stat="identity", position = position_dodge(1),
             size=3, fill = "white", color = "black") +
  theme_minimal() +
  labs(fill = "Attrition", 
       x = "Attrition", y="Count",
       title="Proportion of Employee Attrition \nby Working Overtime") +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  guides(fill = guide_legend(reverse = TRUE))
plot_overtime

#3. StockOption
plot_stockoption <- employee %>% select(StockOptionLevel, Attrition) %>%
  count(StockOptionLevel, Attrition) %>% 
  group_by(StockOptionLevel) %>% 
  mutate(percentage = round((n/sum(n))*100,2)) %>% 
  ggplot(aes(x=factor(StockOptionLevel), y=n, group=factor(Attrition, levels = c("Yes", "No")))) +
  geom_bar(aes(fill = factor(Attrition, levels = c("Yes", "No"))),
           stat="identity", position="dodge", vjust=0) +
  geom_label(aes(label = paste0(n,' ','\n(', percentage,'%',')')), 
             position = position_dodge(0.9), stat = "identity",
             size=2.5, fill = "white", color = "black") +
  theme_minimal() +
  labs(fill = "Attrition", 
       x = "Stock Option Level", y="Count",
       title="Proportion of Employee Attrition\nby Stock Option Level") +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  guides(fill = guide_legend(reverse = TRUE))
plot_stockoption

# otheroptions_plot <- ggarrange(plot_homedistt, plot_overtime, plot_stockoption,
#                       ncol = 2, nrow = 2, 
#                       common.legend = F, 
#                       legend = "bottom")
# otheroptions_plot
  • Employees that work overtime has higher attrition rate (30.53%) compare to those who worked non overtime (10.44%)

  • Employees that have stock option is less likely to leave compared to those who do not. Less level of Stock Option leads to the increase of the attrition rate.

3.5 Correlation between variables

library(GGally)
GGally::ggcorr(employee)  

  • MOst of numerical variables has low correlation. Several variable has zero correlation.

  • TotalWorkingYears, YearsWithCurrManager, YearsSinceLastPromotion, YearsInCurrentRole, and YearsAtCompanyhas strong correlation to each other.

4 Cross Validation

In this section i split 80 % of dataset as data train and the rest as data test.

RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(employee)

# set random sampling index and only get 80%
index <- sample(row_data, row_data*0.8)

data_train <- employee[ index, ]
data_test <- employee[ -index, ] 

As spotted in EDA section, our target class is imbalance. In this case i use upscale function from package caret to oversampling data-train class.

5 Fitting Model

5.1 Naive Bayes

I use Naive Bayes because it’s fast and as benchmark for other models.

model_nb <- naiveBayes(x=data_train_up %>% select(-Attrition), 
                       y=data_train_up$Attrition,
                       laplace = 1)

Let’s see the performance against data-train

predict_modelNB_train <- predict(model_nb, 
                                 newdata = data_train_up)


cm_modelNB_train <- confusionMatrix(data = predict_modelNB_train, 
                reference = data_train_up$Attrition,
                positive = "Yes")
cm_modelNB_train
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  564 206
#>        Yes 424 782
#>                                                
#>                Accuracy : 0.6812               
#>                  95% CI : (0.6601, 0.7017)     
#>     No Information Rate : 0.5                  
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.3623               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.7915               
#>             Specificity : 0.5709               
#>          Pos Pred Value : 0.6484               
#>          Neg Pred Value : 0.7325               
#>              Prevalence : 0.5000               
#>          Detection Rate : 0.3957               
#>    Detection Prevalence : 0.6103               
#>       Balanced Accuracy : 0.6812               
#>                                                
#>        'Positive' Class : Yes                  
#> 
  • Accuracy = 68.12 %
  • Recall = 79.15 %
  • Specificity = 57.09 %
  • Precision = 64.84 %

5.2 Decision Tree

Decision tree is an algorithm that will make a set of rules visualized in a diagram that resembles a tree, so the model is easily interpretable. Here is the model

library(rpart)
model_dct <- rpart(data = data_train_up,
                    formula = Attrition~.,
                    method = "class")
                    #cp = 0.001, minbucket = 20)

library(rpart.plot)
rpart.plot::rpart.plot(model_dct)

model_dct$call
#> rpart(formula = Attrition ~ ., data = data_train_up, method = "class")
model_dct$variable.importance
#>                 OverTime                      Age            MonthlyIncome 
#>              108.5476651               90.5149774               69.0624462 
#>                  JobRole        TotalWorkingYears           YearsAtCompany 
#>               68.7367832               66.4763514               43.9593326 
#>     YearsWithCurrManager                 JobLevel       YearsInCurrentRole 
#>               31.2980923               30.3953065               25.4923353 
#>           EducationField         DistanceFromHome RelationshipSatisfaction 
#>               20.0440177               17.7407261               15.1933443 
#>       NumCompaniesWorked           BusinessTravel  YearsSinceLastPromotion 
#>               13.6871631                2.9656863                2.9656863 
#>         StockOptionLevel        PercentSalaryHike                Education 
#>                2.2059030                1.5347356                1.2453561 
#>          WorkLifeBalance               Department 
#>                0.4981424                0.4619070

Let’s see the performance against data-train

predict_modelDCT_train <- predict(model_dct, 
                                  newdata = data_train_up, type="class")


# confusion matrix data test

cm_modelDCT_train <- confusionMatrix(data = predict_modelDCT_train, 
                reference = data_train_up$Attrition,
                positive = "Yes")
cm_modelDCT_train
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  760 198
#>        Yes 228 790
#>                                              
#>                Accuracy : 0.7844             
#>                  95% CI : (0.7656, 0.8024)   
#>     No Information Rate : 0.5                
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.5688             
#>                                              
#>  Mcnemar's Test P-Value : 0.16               
#>                                              
#>             Sensitivity : 0.7996             
#>             Specificity : 0.7692             
#>          Pos Pred Value : 0.7760             
#>          Neg Pred Value : 0.7933             
#>              Prevalence : 0.5000             
#>          Detection Rate : 0.3998             
#>    Detection Prevalence : 0.5152             
#>       Balanced Accuracy : 0.7844             
#>                                              
#>        'Positive' Class : Yes                
#> 
  • Accuracy = 78.44 %
  • Recall = 79.96 %
  • Specificity = 76.92 %
  • Precision = 77.6 %

Compare to model Naive Bayes in data train, Decision tree has better Recall and Precision.


5.3 Random Forest

Random Forest is an algorithm that using Ensemble-based method that was built based on decision tree. Ensemble-based algorithm itself is actually a hybrid of several machine learning techniques combined into one predictive model, built to reduce error, bias, and improve predictions.

Random Forest makes use of the concept of Bagging (Bootstrap and Aggregation). In this step i use k-fold=5 and repeat 4 times.

set.seed(123)
ctrl <- trainControl(method = "repeatedcv",
                          number = 5, # k-fold
                          repeats = 4) # repetition 
     
model_rforest <- train(Attrition ~ .,
                        data = data_train_up,
                        method = "rf", # random forest
                        trControl = ctrl)

# to save model to save process-time
saveRDS(model_rforest, "LBB_randomforest.RDS")
model_rforest <- readRDS("LBB_randomforest.RDS")
model_rforest
#> Random Forest 
#> 
#> 1976 samples
#>   27 predictor
#>    2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 4 times) 
#> Summary of sample sizes: 1581, 1581, 1580, 1581, 1581, 1581, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.9664700  0.9329411
#>   21    0.9640653  0.9281301
#>   41    0.9588783  0.9177557
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.

From the model summary, the final value chosen is mtry=21 with highest accuracy when tested bootstap sampling. We can also inspect the importance of each variable that was used in our random forest using varImp().

varImp(model_rforest)
#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 41)
#> 
#>                          Overall
#> MonthlyIncome             100.00
#> Age                        94.01
#> OverTimeYes                81.77
#> TotalWorkingYears          81.37
#> YearsAtCompany             69.02
#> DistanceFromHome           67.58
#> YearsWithCurrManager       60.50
#> YearsInCurrentRole         56.91
#> PercentSalaryHike          53.37
#> JobSatisfaction            49.42
#> StockOptionLevel           46.51
#> JobLevel                   45.16
#> NumCompaniesWorked         44.26
#> EnvironmentSatisfaction    42.78
#> TrainingTimesLastYear      41.62
#> YearsSinceLastPromotion    41.54
#> RelationshipSatisfaction   37.93
#> Education                  35.19
#> WorkLifeBalance            31.45
#> JobInvolvement             29.70
plot(varImp(model_rforest))

model_rforest %>% ggplot()

model_rforest$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 2
#> 
#>         OOB estimate of  error rate: 2.33%
#> Confusion matrix:
#>      No Yes class.error
#> No  942  46   0.0465587
#> Yes   0 988   0.0000000

OOB Score = 2.73 % performance accuracy : 100 - OOB_Error = 100-2.83 = 97.27 %

In practice, it is not required to split the dataset into train-test because random forest already has out-of-bag estimates (OOB) which act as a reliable estimate of the accuracy on unseen data. Although, it is possible to hold out a regular train-test cross-validation.

predict_modelRF_train <- predict(model_rforest, newdata = data_train_up, type = "raw")
cm_rf <- confusionMatrix(data = predict_modelRF_train,
                reference = data_train_up$Attrition,
                positive = "Yes")
cm_rf$table
#>           Reference
#> Prediction  No Yes
#>        No  985   0
#>        Yes   3 988
cm_rf$byClass
#>          Sensitivity          Specificity       Pos Pred Value 
#>            1.0000000            0.9969636            0.9969728 
#>       Neg Pred Value            Precision               Recall 
#>            1.0000000            0.9969728            1.0000000 
#>                   F1           Prevalence       Detection Rate 
#>            0.9984841            0.5000000            0.5000000 
#> Detection Prevalence    Balanced Accuracy 
#>            0.5015182            0.9984818

6 Evaluation

Time to evaluate the models. we use Confusion matrix to evaluate the result between prediction and actual on data test.

Generally there are four metrics to measure performance (for binary classification) based on Confusion Matrix:

  • Accuracy : Measure how many of our data is correctly predicted. \[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]

  • Sensitivity : measures out of all positive outcome, how many are correctly predicted. \[Sensitivity = \frac{TP}{TP + FN}\]

  • Specificty: measure how many negative outcome is correctly predicted. \[Specificty = \frac{TN}{TN + FP}\]

  • Precision: measures how many of our positive prediction is correct. \[Precision = \frac{TN}{TN + FP}\] ## Naive Bayes

predict_modelNB_test <- predict(model_nb, newdata = data_test)

# confusion matrix data test

cm_modelNB_test <- confusionMatrix(data = predict_modelNB_test, 
                reference = data_test$Attrition,
                positive = "Yes")
cm_modelNB_test
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  141  11
#>        Yes 104  38
#>                                              
#>                Accuracy : 0.6088             
#>                  95% CI : (0.5505, 0.665)    
#>     No Information Rate : 0.8333             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.1995             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.7755             
#>             Specificity : 0.5755             
#>          Pos Pred Value : 0.2676             
#>          Neg Pred Value : 0.9276             
#>              Prevalence : 0.1667             
#>          Detection Rate : 0.1293             
#>    Detection Prevalence : 0.4830             
#>       Balanced Accuracy : 0.6755             
#>                                              
#>        'Positive' Class : Yes                
#> 
  • Accuracy = 60.88 %

  • Recall = 77.55 %

  • Specificity = 57.55 %

  • Precision = 26.76 %


6.1 Decision Tree

library(rpart) model_dct <- rpart(data = data_train_up, formula = Attrition~., method = “class”) #cp = 0.001, minbucket = 20)

library(rpart.plot) rpart.plot::rpart.plot(model_dct)

predict_modelDCT_test <- predict(model_dct, 
                                newdata = data_test, type="class")
#head(predict_modelDCT_test)

# confusion matrix data test

cm_modelDCT_test <- confusionMatrix(data = predict_modelDCT_test, 
                reference = data_test$Attrition,
                positive = "Yes")
cm_modelDCT_test
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  171  18
#>        Yes  74  31
#>                                           
#>                Accuracy : 0.6871          
#>                  95% CI : (0.6307, 0.7397)
#>     No Information Rate : 0.8333          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.2269          
#>                                           
#>  Mcnemar's Test P-Value : 0.0000000098    
#>                                           
#>             Sensitivity : 0.6327          
#>             Specificity : 0.6980          
#>          Pos Pred Value : 0.2952          
#>          Neg Pred Value : 0.9048          
#>              Prevalence : 0.1667          
#>          Detection Rate : 0.1054          
#>    Detection Prevalence : 0.3571          
#>       Balanced Accuracy : 0.6653          
#>                                           
#>        'Positive' Class : Yes             
#> 
  • Accuracy = 68.71 %

  • Recall = 63.27 %

  • Specificity = 69.8 %

  • Precision = 29.52 %

6.2 Random Forest

predict_modelRF_test <- predict(model_rforest, newdata = data_test, type = "raw")
#head(predict_modelRF_test)


cm_modelRF_test <- confusionMatrix(data = predict_modelRF_test,
                                   reference = data_test$Attrition,
                                   positive = "Yes")
cm_modelRF_test
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No  233  32
#>        Yes  12  17
#>                                           
#>                Accuracy : 0.8503          
#>                  95% CI : (0.8043, 0.8891)
#>     No Information Rate : 0.8333          
#>     P-Value [Acc > NIR] : 0.243419        
#>                                           
#>                   Kappa : 0.3561          
#>                                           
#>  Mcnemar's Test P-Value : 0.004179        
#>                                           
#>             Sensitivity : 0.34694         
#>             Specificity : 0.95102         
#>          Pos Pred Value : 0.58621         
#>          Neg Pred Value : 0.87925         
#>              Prevalence : 0.16667         
#>          Detection Rate : 0.05782         
#>    Detection Prevalence : 0.09864         
#>       Balanced Accuracy : 0.64898         
#>                                           
#>        'Positive' Class : Yes             
#> 
  • Accuracy = 84.01 %

  • Recall = 28.57 %

  • Specificity = 95.1 %

  • Precision = 53.48 %

Suprisingly, the Accuracy score is the lowest against other model.

6.3 Evaluation Summary

library(tibble)

summary_evalTest_NB <- data_frame(Model = "Naive Bayes", 
                              Accuracy = round(cm_modelNB_test$overall[1]*100,2),
                              Recall = round(cm_modelNB_test$byClass[1]*100,2),
                              Specificity = round(cm_modelNB_test$byClass[2]*100,2),
                              Precision = round(cm_modelNB_test$byClass[3]*100,2))

summary_evalTest_DCT <- data_frame(Model = "Decision Tree", 
                               Accuracy = round(cm_modelDCT_test$overall[1]*100,2),
                              Recall = round(cm_modelDCT_test$byClass[1]*100,2),
                              Specificity = round(cm_modelDCT_test$byClass[2]*100,2),
                              Precision = round(cm_modelDCT_test$byClass[3]*100,2))

summary_evalTest_RF <- data_frame(Model = "Random Forest", 
                              Accuracy = round(cm_modelRF_test$overall[1]*100,2),
                              Recall = round(cm_modelRF_test$byClass[1]*100,2),
                              Specificity = round(cm_modelRF_test$byClass[2]*100,2),
                              Precision = round(cm_modelRF_test$byClass[3]*100,2))

summary_evalTest_model <-  rbind(summary_evalTest_NB, summary_evalTest_DCT, summary_evalTest_RF)
summary_evalTest_model
summary_evalTrain_NB <- data_frame(Model = "Naive Bayes", 
                              Accuracy = round(cm_modelNB_train$overall[1]*100,2),
                              Recall = round(cm_modelNB_train$byClass[1]*100,2),
                              Specificity = round(cm_modelNB_train$byClass[2]*100,2),
                              Precision = round(cm_modelNB_train$byClass[3]*100,2))

summary_evalTrain_DCT <- data_frame(Model = "Decision Tree", 
                               Accuracy = round(cm_modelDCT_train$overall[1]*100,2),
                              Recall = round(cm_modelDCT_train$byClass[1]*100,2),
                              Specificity = round(cm_modelDCT_train$byClass[2]*100,2),
                              Precision = round(cm_modelDCT_train$byClass[3]*100,2))

summary_evalTrain_RF <- data_frame(Model = "Random Forest", 
                              Accuracy = round(cm_rf$overall[1]*100,2),
                              Recall = round(cm_rf$byClass[1]*100,2),
                              Specificity = round(cm_rf$byClass[2]*100,2),
                              Precision = round(cm_rf$byClass[3]*100,2))

summary_evalTrain_model <-  rbind(summary_evalTrain_NB, summary_evalTrain_DCT, summary_evalTrain_RF)
summary_evalTrain_model

From the performance summary above we can see that the model tend to be overfit. Among the three models, model Random Forest has better Accuracy, Specificity, and Precision score rather than the other models, but still not good enough.

However from Recall score, Random Forest is the worst, while Naive Bayes is better Decision Tree.


7 Conclusion

As previously mention in evaluation section, Random Forest has better Accuracy, Specificity, and Precision score rather than the other models, but still not good enough. However the Recall score of Random Forest is bad. Naive Bayes has better score compare to Decision Tree and Random Forest

I guess it is happen because several predictors are low-correlated while Naive Bayes assumes that all features of the dataset are equally important and independent.

In this case, it is both True Positive and True Negative are important so all of the metric are good to watch, but since the problem is to predict the attrition rate the metrics that mainly focus on high True Positive such as Recall and Precision must be the main focus in order to evaluate the attrition model.