Employee attrition describes the situation where an employee leaves the company, whether through retirement, resignation to move to another company, layoff, or any other reason. Employee attrition is not always bad, since it can save some spending, but it becomes a problem when the attrition rate is high and the company struggles to find replacements. It is therefore important to analyze the cases and the reasons behind them, so that the company can prevent attrition or be prepared for it.
In this article I develop classification models to predict employee attrition status, using the IBM HR Employee Attrition dataset. As part of the Algoritma Data Science LBB (Learning By Building) project for Classification II, I build Naive Bayes, Decision Tree, and Random Forest models and compare their results at the end.
The following is a brief explanation of each variable in this dataset.
Variable | Description |
---|---|
Age | Age of employee |
Attrition | Attrition of employee (Yes, No) |
BusinessTravel | Frequency of business travel (Non-Travel, Travel_Rarely, Travel_Frequently) |
DailyRate | Amount of money the company pays an employee for a day of work |
Department | Work department (Research & Development, Sales, Human Resources) |
DistanceFromHome | Distance between company and home |
Education | Level of education (1: Below College, 2: College, 3: Bachelor, 4: Master, 5: Doctor) |
EducationField | Field of education (Life Sciences, Medical, Human Resources, Technical Degree, Marketing, Other) |
EmployeeCount | Count of employee (always 1) |
EmployeeNumber | Employee ID |
EnvironmentSatisfaction | Environment satisfaction score (1: Low, 2: Medium, 3: High, 4: Very High) |
Gender | Gender of employee (Female, Male) |
HourlyRate | Amount of money the company pays an employee for an hour of work |
JobInvolvement | Level of job involvement (1: Low, 2: Medium, 3: High, 4: Very High) |
JobLevel | Level of job (1 - 5) |
JobRole | Job role (Sales Executive, Research Scientist, Laboratory Technician, Manager, Healthcare Representative, Sales Representative, Manufacturing Director, Research Director, Human Resources) |
JobSatisfaction | Job satisfaction score (1: Low, 2: Medium, 3: High, 4: Very High) |
MaritalStatus | Marital status (Married, Single, Divorced) |
MonthlyIncome | Monthly income |
MonthlyRate | Monthly pay rate |
NumCompaniesWorked | Total number of companies the employee has worked for |
Over18 | Employee is over 18 years old (always Y) |
OverTime | Whether the employee frequently works overtime (Yes, No) |
PercentSalaryHike | Percentage of salary hike |
PerformanceRating | Performance assessment level (1: Low, 2: Good, 3: Excellent, 4: Outstanding) |
RelationshipSatisfaction | Relationship satisfaction level (1: Low, 2: Medium, 3: High, 4: Very High) |
StandardHours | Standard work hours (always 80) |
StockOptionLevel | Stock option level (0 - 3) |
TotalWorkingYears | Total years of working |
TrainingTimesLastYear | Number of trainings attended last year |
WorkLifeBalance | Work-life balance level (1: Bad, 2: Good, 3: Better, 4: Best) |
YearsAtCompany | Years at the company |
YearsInCurrentRole | Years in current role |
YearsSinceLastPromotion | Years since last promotion |
YearsWithCurrManager | Years with current manager |
library(dplyr)      # for data wrangling
library(caret)      # to pre-process data and train models
library(e1071)      # for the Naive Bayes model
library(ggplot2)    # for data visualization
library(tidymodels) # to build tidy models
library(ROCR)       # to create ROC plots
#library(partykit)  # to visualize the decision tree
First Step: Load data
read.csv("data_input/WA_Fn-UseC_-HR-Employee-Attrition.csv") employee <-
Check data structure
glimpse(employee)
#> Rows: 1,470
#> Columns: 35
#> $ ï..Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35...
#> $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No"...
#> $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Tra...
#> $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 13...
#> $ Department <chr> "Sales", "Research & Development", "Resear...
#> $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 2...
#> $ Education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, ...
#> $ EducationField <chr> "Life Sciences", "Life Sciences", "Other",...
#> $ EmployeeCount <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
#> $ EmployeeNumber <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, ...
#> $ EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, ...
#> $ Gender <chr> "Female", "Male", "Male", "Female", "Male"...
#> $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84...
#> $ JobInvolvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, ...
#> $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, ...
#> $ JobRole <chr> "Sales Executive", "Research Scientist", "...
#> $ JobSatisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, ...
#> $ MaritalStatus <chr> "Single", "Married", "Single", "Married", ...
#> $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, ...
#> $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 9...
#> $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, ...
#> $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
#> $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Ye...
#> $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13...
#> $ PerformanceRating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, ...
#> $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, ...
#> $ StandardHours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80...
#> $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, ...
#> $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5...
#> $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, ...
#> $ WorkLifeBalance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, ...
#> $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2,...
#> $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, ...
#> $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, ...
#> $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, ...
This dataset consists of 35 features (variables) and 1,470 observations (rows). There are 9 categorical columns and 26 numerical columns.
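As a quick cross-check of that split of column types, we can tabulate the class of each column; a minimal sketch:
# character columns are the categorical ones, integer columns the numerical ones
# (expect 9 character and 26 integer)
table(sapply(employee, class))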
I will rename the first column (ï..Age) to make processing easier, and convert the categorical (character) columns to factors.
employee <- employee %>% rename(Age = ï..Age) %>% mutate_if(is.character, as.factor)
Here is a summary of the raw dataset's numerical variables:
summary(employee %>% select_if(is.numeric))
#> Age DailyRate DistanceFromHome Education
#> Min. :18.00 Min. : 102.0 Min. : 1.000 Min. :1.000
#> 1st Qu.:30.00 1st Qu.: 465.0 1st Qu.: 2.000 1st Qu.:2.000
#> Median :36.00 Median : 802.0 Median : 7.000 Median :3.000
#> Mean :36.92 Mean : 802.5 Mean : 9.193 Mean :2.913
#> 3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000 3rd Qu.:4.000
#> Max. :60.00 Max. :1499.0 Max. :29.000 Max. :5.000
#> EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate
#> Min. :1 Min. : 1.0 Min. :1.000 Min. : 30.00
#> 1st Qu.:1 1st Qu.: 491.2 1st Qu.:2.000 1st Qu.: 48.00
#> Median :1 Median :1020.5 Median :3.000 Median : 66.00
#> Mean :1 Mean :1024.9 Mean :2.722 Mean : 65.89
#> 3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.:4.000 3rd Qu.: 83.75
#> Max. :1 Max. :2068.0 Max. :4.000 Max. :100.00
#> JobInvolvement JobLevel JobSatisfaction MonthlyIncome MonthlyRate
#> Min. :1.00 Min. :1.000 Min. :1.000 Min. : 1009 Min. : 2094
#> 1st Qu.:2.00 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 2911 1st Qu.: 8047
#> Median :3.00 Median :2.000 Median :3.000 Median : 4919 Median :14236
#> Mean :2.73 Mean :2.064 Mean :2.729 Mean : 6503 Mean :14313
#> 3rd Qu.:3.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:20462
#> Max. :4.00 Max. :5.000 Max. :4.000 Max. :19999 Max. :26999
#> NumCompaniesWorked PercentSalaryHike PerformanceRating
#> Min. :0.000 Min. :11.00 Min. :3.000
#> 1st Qu.:1.000 1st Qu.:12.00 1st Qu.:3.000
#> Median :2.000 Median :14.00 Median :3.000
#> Mean :2.693 Mean :15.21 Mean :3.154
#> 3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:3.000
#> Max. :9.000 Max. :25.00 Max. :4.000
#> RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
#> Min. :1.000 Min. :80 Min. :0.0000 Min. : 0.00
#> 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000 1st Qu.: 6.00
#> Median :3.000 Median :80 Median :1.0000 Median :10.00
#> Mean :2.712 Mean :80 Mean :0.7939 Mean :11.28
#> 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000 3rd Qu.:15.00
#> Max. :4.000 Max. :80 Max. :3.0000 Max. :40.00
#> TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
#> Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
#> 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
#> Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
#> Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
#> 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
#> Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
#> YearsSinceLastPromotion YearsWithCurrManager
#> Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 2.000
#> Median : 1.000 Median : 3.000
#> Mean : 2.188 Mean : 4.123
#> 3rd Qu.: 3.000 3rd Qu.: 7.000
#> Max. :15.000 Max. :17.000
and here are the categorical features along with their number of unique values:
employee %>% select_if(is.factor) %>%
  summarise_all(~n_distinct(.)) %>%
  pivot_longer(everything(), names_to = "columns", values_to = "count_unique_values") %>%
  arrange(desc(count_unique_values))
Some variables can be removed, as they provide no useful information and are not relevant to the dependent variable:
employee %>% select(-c("Over18", "EmployeeCount", "EmployeeNumber", "StandardHours", "HourlyRate", "MonthlyRate", "DailyRate"))
employee <-
str(employee)
#> 'data.frame': 1470 obs. of 28 variables:
#> $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
#> $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
#> $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
#> $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
#> $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
#> $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
#> $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
#> $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
#> $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
#> $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
#> $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
#> $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
#> $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
#> $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
#> $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
#> $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
#> $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
#> $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
#> $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
#> $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
#> $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
#> $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
#> $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
#> $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
#> $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
#> $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
#> $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
#> $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
The number of columns is now reduced from 35 to 28.
Now let's check for missing values:
colSums(is.na(employee))
#> Age Attrition BusinessTravel
#> 0 0 0
#> Department DistanceFromHome Education
#> 0 0 0
#> EducationField EnvironmentSatisfaction Gender
#> 0 0 0
#> JobInvolvement JobLevel JobRole
#> 0 0 0
#> JobSatisfaction MaritalStatus MonthlyIncome
#> 0 0 0
#> NumCompaniesWorked OverTime PercentSalaryHike
#> 0 0 0
#> PerformanceRating RelationshipSatisfaction StockOptionLevel
#> 0 0 0
#> TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
#> 0 0 0
#> YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
#> 0 0 0
#> YearsWithCurrManager
#> 0
Great, no missing values.
Check for duplicated rows:
employee[duplicated(employee), ]
Awesome, no duplicated rows.
Here are the first 6 rows:
head(employee)
Further exploratory analysis of this dataset can be found in my previous post here.
employee %>% select(Attrition) %>% group_by(Attrition) %>% count() %>%
  ggplot(aes(x = factor(Attrition), y = n)) +
  geom_col(aes(fill = factor(Attrition, levels = c("Yes", "No"))), show.legend = F) +
  labs(title = "Overall Attrition Rate", x = "Attrition Status", y = "Count")
d1_plot <- employee %>% select(Gender, Attrition) %>%
  count(Gender, Attrition) %>%
  ggplot(aes(x = Gender, y = n)) +
  geom_col(aes(fill = factor(Attrition, levels = c("Yes", "No")))) +
  geom_text(aes(label = n, fill = factor(Attrition, levels = c("Yes", "No"))),
            position = position_stack(vjust = 0.2, reverse = F), size = 4) +
  labs(fill = "Attrition", y = "Count") + theme_minimal() +
  theme(axis.text.x = element_text(angle = 40),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())
d2_plot <- employee %>%
  # map the Education codes (1-5) to their labels from the data dictionary
  mutate(Education = ifelse(Education == 1, "Below College",
                     ifelse(Education == 2, "College",
                     ifelse(Education == 3, "Bachelor",
                     ifelse(Education == 4, "Master", "Doctor"))))) %>%
  select(Education, Attrition) %>% count(Education, Attrition) %>%
ggplot(aes(x=Education, y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F), size=4) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(fill = "Attrition", y = "")
d3_plot <- employee %>%
  select(Department, Attrition) %>% count(Department, Attrition) %>%
ggplot(aes(x=Department, y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
scale_x_discrete(labels=c("HRD","R&D","Sales")) +
labs(fill = "Attrition", y = "Count")
d4_plot <- employee %>%
  select(JobRole, Attrition) %>% count(JobRole, Attrition) %>%
ggplot(aes(x=JobRole, y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.3, reverse = F),size=3.5) +
coord_flip() +
theme_minimal() +
theme(#axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(fill = "Attrition",
x = "Job Role", y = "Count")
d5_plot <- employee %>%
  select(JobLevel, Attrition) %>% count(JobLevel, Attrition) %>%
ggplot(aes(x=JobLevel, y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.8, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(fill = "Attrition",
x = "Job Level", y="")
d6_plot <- employee %>%
  # bin Age into age groups
  mutate(Age = as.factor(
    ifelse(Age < 20, "18-19",
    ifelse(Age <= 25, "20-25",
    ifelse(Age <= 30, "26-30",
    ifelse(Age <= 35, "31-35",
    ifelse(Age <= 40, "36-40",
    ifelse(Age <= 45, "41-45",
    ifelse(Age <= 50, "46-50",
    ifelse(Age <= 55, "51-55", ">55"))))))))
  )) %>%
  group_by(Age, Attrition) %>% count(Age, Attrition) %>%
ggplot(aes(x=factor(Age, levels = c("18-19", "20-25", "26-30", "31-35", "36-40",
"41-45", "46-50", "51-55", ">55")),
y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.3, reverse = F),size=3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(fill = "Attrition", x = "Age", y = "")
library(ggpubr)
demography_plot <- ggarrange(d1_plot, d2_plot, d3_plot, d5_plot,
                             ncol = 2, nrow = 2,
                             common.legend = T,
                             legend = "bottom")
demography_plot
#1. EnvironmentSatisfaction
plot_sv1 <- employee %>% select(EnvironmentSatisfaction, Attrition) %>%
  count(EnvironmentSatisfaction, Attrition) %>%
ggplot(aes(x=factor(EnvironmentSatisfaction), y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Employee Satisfaction",
fill = "Attrition", x = "", y = ""
)
#2. JobSatisfaction
plot_sv2 <- employee %>% select(JobSatisfaction, Attrition) %>%
  count(JobSatisfaction, Attrition) %>%
ggplot(aes(x=factor(JobSatisfaction), y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Job Satisfaction",
fill = "Attrition", x = "", y = ""
)
#3. JobInvolvement
plot_sv3 <- employee %>% select(JobInvolvement, Attrition) %>%
  count(JobInvolvement, Attrition) %>%
ggplot(aes(x=factor(JobInvolvement), y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Job Involvement",
fill = "Attrition", x = "", y = ""
)
#4. WorkLifeBalance
plot_sv4 <- employee %>% select(WorkLifeBalance, Attrition) %>%
  count(WorkLifeBalance, Attrition) %>%
ggplot(aes(x=factor(WorkLifeBalance), y=n)) +
geom_col(aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_text(aes(label=n, fill = factor(Attrition, levels = c("Yes", "No"))),
position = position_stack(vjust = 0.5, reverse = F)) +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
labs(
title="Work Life Balance",
fill = "Attrition", x = "", y = ""
)
csurvey_plot <- ggarrange(plot_sv1, plot_sv2, plot_sv3, plot_sv4,
                          ncol = 2, nrow = 2,
                          common.legend = T,
                          legend = "bottom")
csurvey_plot
#1. DistancefromHome
plot_homedistt <- employee %>% select(DistanceFromHome, Attrition) %>%
  mutate(homedist_category = as.factor(
    ifelse(DistanceFromHome > mean(employee$DistanceFromHome), "Above Average", "Below Average"))) %>%
  count(homedist_category, Attrition) %>%
group_by(homedist_category) %>%
mutate(percentage=round((n/sum(n))*100,2)) %>%
ggplot(aes(x=Attrition, y=n)) +
geom_bar(stat="identity", position = "dodge",
aes(fill=factor(Attrition, levels = c("Yes", "No")))) +
geom_label(aes(label = paste0(n,' ','(', percentage,'%',')')),
size=3, fill = "white", color = "black") +
facet_wrap(~homedist_category) +
theme_minimal() +
labs(fill = "Attrition", y="Count",
title="Proportion of Employee Attrition \nby Office-Home Distance") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
guides(fill = guide_legend(reverse = TRUE))
plot_homedistt
#2. Overtime
plot_overtime <- employee %>% select(OverTime, Attrition) %>%
  count(OverTime, Attrition) %>%
group_by(OverTime) %>%
mutate(OverTime = as.factor(ifelse(OverTime == "Yes", "Overtime", "Non-Overtime")),
percentage=round((n/sum(n))*100,2)) %>%
ggplot(aes(x=Attrition ,y=n)) +
geom_bar(aes(fill = factor(Attrition, levels = c("Yes", "No"))),
show.legend = F,
stat="identity", position="dodge") +
facet_wrap(~OverTime) +
geom_label(aes(label = paste0(n,' ','(', percentage,'%',')')),
stat="identity", position = position_dodge(1),
size=3, fill = "white", color = "black") +
theme_minimal() +
labs(fill = "Attrition",
x = "Attrition", y="Count",
title="Proportion of Employee Attrition \nby Working Overtime") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
guides(fill = guide_legend(reverse = TRUE))
plot_overtime
#3. StockOption
plot_stockoption <- employee %>% select(StockOptionLevel, Attrition) %>%
  count(StockOptionLevel, Attrition) %>%
group_by(StockOptionLevel) %>%
mutate(percentage = round((n/sum(n))*100,2)) %>%
ggplot(aes(x=factor(StockOptionLevel), y=n, group=factor(Attrition, levels = c("Yes", "No")))) +
geom_bar(aes(fill = factor(Attrition, levels = c("Yes", "No"))),
         stat="identity", position="dodge") +
geom_label(aes(label = paste0(n,' ','\n(', percentage,'%',')')),
position = position_dodge(0.9), stat = "identity",
size=2.5, fill = "white", color = "black") +
theme_minimal() +
labs(fill = "Attrition",
x = "Stock Option Level", y="Count",
title="Proportion of Employee Attrition\nby Stock Option Level") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
guides(fill = guide_legend(reverse = TRUE))
plot_stockoption
# otheroptions_plot <- ggarrange(plot_homedistt, plot_overtime, plot_stockoption,
# ncol = 2, nrow = 2,
# common.legend = F,
# legend = "bottom")
# otheroptions_plot
Employees who work overtime have a higher attrition rate (30.53%) compared to those who do not (10.44%).
Employees who have stock options are less likely to leave than those who do not; lower stock option levels lead to higher attrition rates.
library(GGally)
GGally::ggcorr(employee)
Most of the numerical variables have low correlations with each other, and several pairs have zero correlation. TotalWorkingYears, YearsWithCurrManager, YearsSinceLastPromotion, YearsInCurrentRole, and YearsAtCompany have strong correlations with each other.
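To back up that reading of the plot with numbers, here is a small sketch that lists the strongly correlated numeric pairs (the 0.6 cutoff is my own arbitrary choice):
# correlation matrix of the numeric columns
num_cor <- cor(employee %>% select_if(is.numeric))
# indices of strongly correlated pairs (upper triangle only, to avoid duplicates)
high <- which(abs(num_cor) > 0.6 & upper.tri(num_cor), arr.ind = TRUE)
data.frame(var1 = rownames(num_cor)[high[, 1]],
           var2 = colnames(num_cor)[high[, 2]],
           cor  = round(num_cor[high], 2))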
In this section I split 80% of the dataset into the training set and keep the rest as the test set.
RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(employee)
# set a random sampling index covering 80% of the rows
index <- sample(row_data, row_data * 0.8)
data_train <- employee[index, ]
data_test  <- employee[-index, ]
As spotted in the EDA section, our target class is imbalanced. In this case I use the upSample() function from the caret package to oversample the minority class in the training data.
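The upsampling call itself is not shown above; here is a minimal sketch of how data_train_up can be built with caret's upSample(), which duplicates minority-class rows until both classes are balanced (yname restores the target column's name):
data_train_up <- upSample(x = data_train %>% select(-Attrition),
                          y = data_train$Attrition,
                          yname = "Attrition")
# both classes should now have the same count (988 each, matching the model summaries below)
table(data_train_up$Attrition)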
I use Naive Bayes first because it is fast and serves as a benchmark for the other models.
model_nb <- naiveBayes(x = data_train_up %>% select(-Attrition),
                       y = data_train_up$Attrition,
                       laplace = 1)
Let's see the performance on the training data:
predict_modelNB_train <- predict(model_nb, newdata = data_train_up)
cm_modelNB_train <- confusionMatrix(data = predict_modelNB_train,
                                    reference = data_train_up$Attrition,
                                    positive = "Yes")
cm_modelNB_train
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 564 206
#> Yes 424 782
#>
#> Accuracy : 0.6812
#> 95% CI : (0.6601, 0.7017)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.3623
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.7915
#> Specificity : 0.5709
#> Pos Pred Value : 0.6484
#> Neg Pred Value : 0.7325
#> Prevalence : 0.5000
#> Detection Rate : 0.3957
#> Detection Prevalence : 0.6103
#> Balanced Accuracy : 0.6812
#>
#> 'Positive' Class : Yes
#>
A decision tree is an algorithm that produces a set of rules visualized in a diagram that resembles a tree, which makes the model easy to interpret. Here is the model:
library(rpart)
model_dct <- rpart(formula = Attrition ~ .,
                   data = data_train_up,
                   method = "class")
                   #cp = 0.001, minbucket = 20)
library(rpart.plot)
rpart.plot::rpart.plot(model_dct)
model_dct$call
#> rpart(formula = Attrition ~ ., data = data_train_up, method = "class")
model_dct$variable.importance
#> OverTime Age MonthlyIncome
#> 108.5476651 90.5149774 69.0624462
#> JobRole TotalWorkingYears YearsAtCompany
#> 68.7367832 66.4763514 43.9593326
#> YearsWithCurrManager JobLevel YearsInCurrentRole
#> 31.2980923 30.3953065 25.4923353
#> EducationField DistanceFromHome RelationshipSatisfaction
#> 20.0440177 17.7407261 15.1933443
#> NumCompaniesWorked BusinessTravel YearsSinceLastPromotion
#> 13.6871631 2.9656863 2.9656863
#> StockOptionLevel PercentSalaryHike Education
#> 2.2059030 1.5347356 1.2453561
#> WorkLifeBalance Department
#> 0.4981424 0.4619070
Let's see the performance on the training data:
predict_modelDCT_train <- predict(model_dct, newdata = data_train_up, type = "class")
# confusion matrix on the training data
cm_modelDCT_train <- confusionMatrix(data = predict_modelDCT_train,
                                     reference = data_train_up$Attrition,
                                     positive = "Yes")
cm_modelDCT_train
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 760 198
#> Yes 228 790
#>
#> Accuracy : 0.7844
#> 95% CI : (0.7656, 0.8024)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.5688
#>
#> Mcnemar's Test P-Value : 0.16
#>
#> Sensitivity : 0.7996
#> Specificity : 0.7692
#> Pos Pred Value : 0.7760
#> Neg Pred Value : 0.7933
#> Prevalence : 0.5000
#> Detection Rate : 0.3998
#> Detection Prevalence : 0.5152
#> Balanced Accuracy : 0.7844
#>
#> 'Positive' Class : Yes
#>
Compared to Naive Bayes on the training data, the decision tree has better recall and precision.
Random Forest is an ensemble method built on top of decision trees. An ensemble algorithm combines several machine learning models into one predictive model, built to reduce error and bias and improve predictions.
Random Forest makes use of the concept of bagging (bootstrap aggregation). In this step I use 5-fold cross-validation repeated 4 times.
set.seed(123)
ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # k-fold
                     repeats = 4) # repetitions
model_rforest <- train(Attrition ~ .,
                       data = data_train_up,
                       method = "rf", # random forest
                       trControl = ctrl)
# save the model to disk to save processing time on re-runs
saveRDS(model_rforest, "LBB_randomforest.RDS")
model_rforest <- readRDS("LBB_randomforest.RDS")
model_rforest
#> Random Forest
#>
#> 1976 samples
#> 27 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 4 times)
#> Summary of sample sizes: 1581, 1581, 1580, 1581, 1581, 1581, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.9664700 0.9329411
#> 21 0.9640653 0.9281301
#> 41 0.9588783 0.9177557
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
From the model summary, the final value chosen is mtry = 2, which had the highest accuracy in the repeated cross-validation. We can also inspect the importance of each variable used in our random forest using varImp().
varImp(model_rforest)
#> rf variable importance
#>
#> only 20 most important variables shown (out of 41)
#>
#> Overall
#> MonthlyIncome 100.00
#> Age 94.01
#> OverTimeYes 81.77
#> TotalWorkingYears 81.37
#> YearsAtCompany 69.02
#> DistanceFromHome 67.58
#> YearsWithCurrManager 60.50
#> YearsInCurrentRole 56.91
#> PercentSalaryHike 53.37
#> JobSatisfaction 49.42
#> StockOptionLevel 46.51
#> JobLevel 45.16
#> NumCompaniesWorked 44.26
#> EnvironmentSatisfaction 42.78
#> TrainingTimesLastYear 41.62
#> YearsSinceLastPromotion 41.54
#> RelationshipSatisfaction 37.93
#> Education 35.19
#> WorkLifeBalance 31.45
#> JobInvolvement 29.70
plot(varImp(model_rforest))
model_rforest %>% ggplot()
model_rforest$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 2
#>
#> OOB estimate of error rate: 2.33%
#> Confusion matrix:
#> No Yes class.error
#> No 942 46 0.0465587
#> Yes 0 988 0.0000000
OOB error estimate = 2.33%, so the estimated accuracy is 100 - 2.33 = 97.67%.
In practice, it is not strictly required to split the dataset into train and test sets for a random forest, because it already has an out-of-bag (OOB) estimate that acts as a reliable estimate of the accuracy on unseen data. It is still possible, though, to hold out a regular test set for cross-validation, as done here.
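The OOB estimate printed above can also be read directly from the underlying randomForest object that caret stores in finalModel; a small sketch:
# err.rate holds the running OOB error per number of trees;
# the last row is the final OOB estimate reported above (~2.33%)
oob_error <- tail(model_rforest$finalModel$err.rate, 1)[, "OOB"]
100 * (1 - oob_error) # estimated accuracy on unseen data (~97.67%)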
predict_modelRF_train <- predict(model_rforest, newdata = data_train_up, type = "raw")
cm_rf <- confusionMatrix(data = predict_modelRF_train,
                         reference = data_train_up$Attrition,
                         positive = "Yes")
cm_rf$table
#> Reference
#> Prediction No Yes
#> No 985 0
#> Yes 3 988
cm_rf$byClass
#> Sensitivity Specificity Pos Pred Value
#> 1.0000000 0.9969636 0.9969728
#> Neg Pred Value Precision Recall
#> 1.0000000 0.9969728 1.0000000
#> F1 Prevalence Detection Rate
#> 0.9984841 0.5000000 0.5000000
#> Detection Prevalence Balanced Accuracy
#> 0.5015182 0.9984818
Time to evaluate the models. We use the confusion matrix to compare predictions against actual values on the test data.
Generally, there are four metrics to measure performance (for binary classification) based on the confusion matrix:
Accuracy: measures how much of our data is correctly predicted. \[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]
Sensitivity (Recall): measures, out of all actual positive outcomes, how many are correctly predicted. \[Sensitivity = \frac{TP}{TP + FN}\]
Specificity: measures how many negative outcomes are correctly predicted. \[Specificity = \frac{TN}{TN + FP}\]
Precision: measures how many of our positive predictions are correct. \[Precision = \frac{TP}{TP + FP}\]
## Naive Bayes
predict_modelNB_test <- predict(model_nb, newdata = data_test)
# confusion matrix on the test data
cm_modelNB_test <- confusionMatrix(data = predict_modelNB_test,
                                   reference = data_test$Attrition,
                                   positive = "Yes")
cm_modelNB_test
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 141 11
#> Yes 104 38
#>
#> Accuracy : 0.6088
#> 95% CI : (0.5505, 0.665)
#> No Information Rate : 0.8333
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.1995
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.7755
#> Specificity : 0.5755
#> Pos Pred Value : 0.2676
#> Neg Pred Value : 0.9276
#> Prevalence : 0.1667
#> Detection Rate : 0.1293
#> Detection Prevalence : 0.4830
#> Balanced Accuracy : 0.6755
#>
#> 'Positive' Class : Yes
#>
Accuracy = 60.88 %
Recall = 77.55 %
Specificity = 57.55 %
Precision = 26.76 %
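As a sanity check, these four metrics can be recomputed by hand from the confusion matrix above; a minimal sketch:
TP <- 38; TN <- 141; FP <- 104; FN <- 11   # counts from cm_modelNB_test$table
(TP + TN) / (TP + TN + FP + FN)            # accuracy    = 0.6088
TP / (TP + FN)                             # recall      = 0.7755
TN / (TN + FP)                             # specificity = 0.5755
TP / (TP + FP)                             # precision   = 0.2676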
## Decision Tree
predict_modelDCT_test <- predict(model_dct, newdata = data_test, type = "class")
#head(predict_modelDCT_test)
# confusion matrix on the test data
cm_modelDCT_test <- confusionMatrix(data = predict_modelDCT_test,
                                    reference = data_test$Attrition,
                                    positive = "Yes")
cm_modelDCT_test
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 171 18
#> Yes 74 31
#>
#> Accuracy : 0.6871
#> 95% CI : (0.6307, 0.7397)
#> No Information Rate : 0.8333
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2269
#>
#> Mcnemar's Test P-Value : 0.0000000098
#>
#> Sensitivity : 0.6327
#> Specificity : 0.6980
#> Pos Pred Value : 0.2952
#> Neg Pred Value : 0.9048
#> Prevalence : 0.1667
#> Detection Rate : 0.1054
#> Detection Prevalence : 0.3571
#> Balanced Accuracy : 0.6653
#>
#> 'Positive' Class : Yes
#>
Accuracy = 68.71 %
Recall = 63.27 %
Specificity = 69.8 %
Precision = 29.52 %
## Random Forest
predict_modelRF_test <- predict(model_rforest, newdata = data_test, type = "raw")
#head(predict_modelRF_test)
cm_modelRF_test <- confusionMatrix(data = predict_modelRF_test,
                                   reference = data_test$Attrition,
                                   positive = "Yes")
cm_modelRF_test
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 233 32
#> Yes 12 17
#>
#> Accuracy : 0.8503
#> 95% CI : (0.8043, 0.8891)
#> No Information Rate : 0.8333
#> P-Value [Acc > NIR] : 0.243419
#>
#> Kappa : 0.3561
#>
#> Mcnemar's Test P-Value : 0.004179
#>
#> Sensitivity : 0.34694
#> Specificity : 0.95102
#> Pos Pred Value : 0.58621
#> Neg Pred Value : 0.87925
#> Prevalence : 0.16667
#> Detection Rate : 0.05782
#> Detection Prevalence : 0.09864
#> Balanced Accuracy : 0.64898
#>
#> 'Positive' Class : Yes
#>
Accuracy = 85.03 %
Recall = 34.69 %
Specificity = 95.1 %
Precision = 58.62 %
Surprisingly, although Random Forest scores the highest accuracy on the test set, its recall is the lowest among the three models.
library(tibble)
summary_evalTest_NB <- tibble(Model = "Naive Bayes",
                              Accuracy = round(cm_modelNB_test$overall[1]*100, 2),
                              Recall = round(cm_modelNB_test$byClass[1]*100, 2),
                              Specificity = round(cm_modelNB_test$byClass[2]*100, 2),
                              Precision = round(cm_modelNB_test$byClass[3]*100, 2))
summary_evalTest_DCT <- tibble(Model = "Decision Tree",
                               Accuracy = round(cm_modelDCT_test$overall[1]*100, 2),
                               Recall = round(cm_modelDCT_test$byClass[1]*100, 2),
                               Specificity = round(cm_modelDCT_test$byClass[2]*100, 2),
                               Precision = round(cm_modelDCT_test$byClass[3]*100, 2))
summary_evalTest_RF <- tibble(Model = "Random Forest",
                              Accuracy = round(cm_modelRF_test$overall[1]*100, 2),
                              Recall = round(cm_modelRF_test$byClass[1]*100, 2),
                              Specificity = round(cm_modelRF_test$byClass[2]*100, 2),
                              Precision = round(cm_modelRF_test$byClass[3]*100, 2))
summary_evalTest_model <- rbind(summary_evalTest_NB, summary_evalTest_DCT, summary_evalTest_RF)
summary_evalTest_model
data_frame(Model = "Naive Bayes",
summary_evalTrain_NB <-Accuracy = round(cm_modelNB_train$overall[1]*100,2),
Recall = round(cm_modelNB_train$byClass[1]*100,2),
Specificity = round(cm_modelNB_train$byClass[2]*100,2),
Precision = round(cm_modelNB_train$byClass[3]*100,2))
data_frame(Model = "Decision Tree",
summary_evalTrain_DCT <-Accuracy = round(cm_modelDCT_train$overall[1]*100,2),
Recall = round(cm_modelDCT_train$byClass[1]*100,2),
Specificity = round(cm_modelDCT_train$byClass[2]*100,2),
Precision = round(cm_modelDCT_train$byClass[3]*100,2))
data_frame(Model = "Random Forest",
summary_evalTrain_RF <-Accuracy = round(cm_rf$overall[1]*100,2),
Recall = round(cm_rf$byClass[1]*100,2),
Specificity = round(cm_rf$byClass[2]*100,2),
Precision = round(cm_rf$byClass[3]*100,2))
rbind(summary_evalTrain_NB, summary_evalTrain_DCT, summary_evalTrain_RF)
summary_evalTrain_model <- summary_evalTrain_model
From the performance summary above we can see that the models tend to overfit. Among the three models, Random Forest has better accuracy, specificity, and precision scores than the others, but they are still not good enough.
On recall, however, Random Forest is the worst, while Naive Bayes does better than Decision Tree.
As previously mentioned in the evaluation section, Random Forest has better accuracy, specificity, and precision scores than the other models, but still not good enough, and its recall score is poor. On recall, Naive Bayes scores better than both Decision Tree and Random Forest.
I suspect this happens because several predictors are only weakly correlated with the target, while Naive Bayes assumes that all features of the dataset are equally important and independent.
In this case both true positives and true negatives are important, so all of the metrics are worth watching; but since the problem is to predict attrition, metrics that focus on true positives, such as recall and precision, must be the main focus when evaluating the attrition model.
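As a natural follow-up, the ROCR package loaded at the start was never used; threshold-independent measures such as the ROC curve and AUC complement recall and precision when comparing these models. A minimal sketch for the Naive Bayes model (the other models work the same way with their class probabilities):
# class probabilities for the positive class ("Yes")
prob_nb <- predict(model_nb, newdata = data_test, type = "raw")[, "Yes"]
pred_rocr <- ROCR::prediction(prob_nb, data_test$Attrition)
# ROC curve: true positive rate against false positive rate
plot(ROCR::performance(pred_rocr, "tpr", "fpr"))
# area under the ROC curve
ROCR::performance(pred_rocr, "auc")@y.values[[1]]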