The main objectives for this project are:
The dataset for this project is available on Kaggle: IBM HR Analytics Employee Attrition & Performance.
For this analysis we have used the following packages:
library(plyr)
library(dplyr)
library(ggplot2)
library(caret)
library(car)
library(cowplot)
library(gridExtra)
library(corrplot)
library(MASS)
library(ROCR)
library(ROSE)
library(rpart)
library(rpart.plot)# Reading files
general_data_raw <- read.csv('D:/DS_PROJECTS/HR Analytics Case Study/datasets_42363_72602_general_data.csv', header = TRUE, sep = ',')
survey_data_raw <- read.csv('D:/DS_PROJECTS/HR Analytics Case Study/datasets_42363_72602_employee_survey_data.csv',header = TRUE,sep = ',')
manager_survey_data_raw <- read.csv('D:/DS_PROJECTS/HR Analytics Case Study/datasets_42363_72602_manager_survey_data.csv', header = TRUE, sep = ',')
in_time_raw <- read.csv('D:/DS_PROJECTS/HR Analytics Case Study/in_time.csv', header = TRUE, sep = ',')
out_time_raw <- read.csv('D:/DS_PROJECTS/HR Analytics Case Study/out_time.csv', header = TRUE, sep = ',')
# Exploring the dataframes
str(survey_data_raw)## 'data.frame': 4410 obs. of 4 variables:
## $ EmployeeID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ EnvironmentSatisfaction: int 3 3 2 4 4 3 1 1 2 2 ...
## $ JobSatisfaction : int 4 2 2 4 1 2 3 2 4 1 ...
## $ WorkLifeBalance : int 2 4 1 3 3 2 1 3 3 3 ...
str(manager_survey_data_raw)## 'data.frame': 4410 obs. of 3 variables:
## $ EmployeeID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ JobInvolvement : int 3 2 3 2 3 3 3 3 3 3 ...
## $ PerformanceRating: int 3 4 3 3 3 3 4 4 4 3 ...
str(general_data_raw)## 'data.frame': 4410 obs. of 24 variables:
## $ Age : int 51 31 32 38 32 46 28 29 31 25 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 2 1 3 3 3 3 3 1 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 6 10 17 2 10 8 11 18 1 7 ...
## $ Education : int 2 1 4 5 1 3 2 3 3 4 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 1 2 2 2 1 ...
## $ JobLevel : int 1 1 4 3 1 4 2 2 3 4 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 1 7 8 2 8 6 8 8 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 3 2 2 1 ...
## $ MonthlyIncome : int 131160 41890 193280 83210 23420 40710 58130 31430 20440 134640 ...
## $ NumCompaniesWorked : int 1 0 1 3 4 3 2 2 0 1 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ StandardHours : int 8 8 8 8 8 8 8 8 8 8 ...
## $ StockOptionLevel : int 0 1 3 3 2 0 1 3 0 1 ...
## $ TotalWorkingYears : int 1 6 5 13 9 28 5 10 10 6 ...
## $ TrainingTimesLastYear : int 6 3 2 5 2 5 2 2 2 2 ...
## $ YearsAtCompany : int 1 5 5 8 6 7 0 0 9 6 ...
## $ YearsSinceLastPromotion: int 0 1 0 7 0 7 0 0 7 1 ...
## $ YearsWithCurrManager : int 0 4 3 5 4 7 0 0 8 5 ...
#str(in_time_raw)
#str(out_time_raw)
# Checking for missing values
sapply(survey_data_raw, function(x) sum(is.na(x)))/nrow(survey_data_raw)*100## EmployeeID EnvironmentSatisfaction JobSatisfaction
## 0.0000000 0.5668934 0.4535147
## WorkLifeBalance
## 0.8616780
sapply(manager_survey_data_raw, function(x) sum(is.na(x)))/nrow(manager_survey_data_raw)*100## EmployeeID JobInvolvement PerformanceRating
## 0 0 0
sapply(general_data_raw, function(x) sum(is.na(x)))/nrow(general_data_raw)*100## Age Attrition BusinessTravel
## 0.0000000 0.0000000 0.0000000
## Department DistanceFromHome Education
## 0.0000000 0.0000000 0.0000000
## EducationField EmployeeCount EmployeeID
## 0.0000000 0.0000000 0.0000000
## Gender JobLevel JobRole
## 0.0000000 0.0000000 0.0000000
## MaritalStatus MonthlyIncome NumCompaniesWorked
## 0.0000000 0.0000000 0.4308390
## Over18 PercentSalaryHike StandardHours
## 0.0000000 0.0000000 0.0000000
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 0.0000000 0.2040816 0.0000000
## YearsAtCompany YearsSinceLastPromotion YearsWithCurrManager
## 0.0000000 0.0000000 0.0000000
Since we have out_time and in_time we can compute the average working hours per employee. This might perhaps gives us some additional insight.
in_time <- sapply(in_time_raw, function(x) as.POSIXlt(x, origin = '1970-01-01', '%y-%m-%d %H:%M:%S'))
in_time <- as.data.frame(in_time)
out_time <- sapply(out_time_raw, function(x) as.POSIXlt(x, origin = '1970-01-01', '%y-%m-%d %H:%M:%S'))
out_time <- as.data.frame(out_time)
# Removing the first column since there has no value in the data
in_time <- in_time[,c(2:ncol(in_time))]
out_time <- out_time[,c(2:ncol(out_time))]
# Computing the difference between out_time and in_time
hours_worked_raw <- out_time - in_time
# We want to keep the records even if there was one employee that went to work
hours_worked_raw <- hours_worked_raw[,colSums(is.na(hours_worked_raw)) < nrow(hours_worked_raw)]
# Transforming the resuls to numeric so we can take the average
hours_worked_raw <- sapply(hours_worked_raw, function(x) as.numeric(x))
hours_worked_raw <- as.data.frame(hours_worked_raw)
# Taking the average of working hours per employee
AvrWorkHrs <- apply(hours_worked_raw, 1, mean, na.rm = TRUE) # 1 Is to indicate rows
# Creating ID column so it can be merged with the rest of the datasets
id_vector <- 1:nrow(hours_worked_raw)
AverageWorkedHours <- data.frame(EmployeeID = id_vector, AvrWorkHrs = AvrWorkHrs)
# Before merging we take a look if each observation is from a different employee
setdiff(survey_data_raw$EmployeeID, manager_survey_data_raw$EmployeeID)## integer(0)
setdiff(manager_survey_data_raw$EmployeeID, general_data_raw$EmployeeID)## integer(0)
setdiff(general_data_raw$EmployeeID, AverageWorkedHours$EmployeeID)## integer(0)
# Since all of them are complete we can merge them
general_data_merged <- inner_join(general_data_raw,manager_survey_data_raw, by = 'EmployeeID') %>%
inner_join(., survey_data_raw, by = 'EmployeeID') %>%
inner_join(., AverageWorkedHours, by = 'EmployeeID')
# Droping values that have the same value for all observations
same_values <- nearZeroVar(general_data_merged, names = TRUE)
general_data_merged <- general_data_merged %>%
dplyr::select(-c(c('EmployeeID', same_values)))# Now we check the structure of the new dataframe and the missing values. And evaluate if they are significant in terms of the total observations
str(general_data_merged)## 'data.frame': 4410 obs. of 26 variables:
## $ Age : int 51 31 32 38 32 46 28 29 31 25 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 2 1 3 3 3 3 3 1 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 6 10 17 2 10 8 11 18 1 7 ...
## $ Education : int 2 1 4 5 1 3 2 3 3 4 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 1 2 2 2 1 ...
## $ JobLevel : int 1 1 4 3 1 4 2 2 3 4 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 1 7 8 2 8 6 8 8 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 3 2 2 1 ...
## $ MonthlyIncome : int 131160 41890 193280 83210 23420 40710 58130 31430 20440 134640 ...
## $ NumCompaniesWorked : int 1 0 1 3 4 3 2 2 0 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ StockOptionLevel : int 0 1 3 3 2 0 1 3 0 1 ...
## $ TotalWorkingYears : int 1 6 5 13 9 28 5 10 10 6 ...
## $ TrainingTimesLastYear : int 6 3 2 5 2 5 2 2 2 2 ...
## $ YearsAtCompany : int 1 5 5 8 6 7 0 0 9 6 ...
## $ YearsSinceLastPromotion: int 0 1 0 7 0 7 0 0 7 1 ...
## $ YearsWithCurrManager : int 0 4 3 5 4 7 0 0 8 5 ...
## $ JobInvolvement : int 3 2 3 2 3 3 3 3 3 3 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ EnvironmentSatisfaction: int 3 3 2 4 4 3 1 1 2 2 ...
## $ JobSatisfaction : int 4 2 2 4 1 2 3 2 4 1 ...
## $ WorkLifeBalance : int 2 4 1 3 3 2 1 3 3 3 ...
## $ AvrWorkHrs : num 7.37 7.72 7.01 7.19 8.01 ...
sapply(general_data_merged, function(x) sum(is.na(x)))/nrow(general_data_merged)*100## Age Attrition BusinessTravel
## 0.0000000 0.0000000 0.0000000
## Department DistanceFromHome Education
## 0.0000000 0.0000000 0.0000000
## EducationField Gender JobLevel
## 0.0000000 0.0000000 0.0000000
## JobRole MaritalStatus MonthlyIncome
## 0.0000000 0.0000000 0.0000000
## NumCompaniesWorked PercentSalaryHike StockOptionLevel
## 0.4308390 0.0000000 0.0000000
## TotalWorkingYears TrainingTimesLastYear YearsAtCompany
## 0.2040816 0.0000000 0.0000000
## YearsSinceLastPromotion YearsWithCurrManager JobInvolvement
## 0.0000000 0.0000000 0.0000000
## PerformanceRating EnvironmentSatisfaction JobSatisfaction
## 0.0000000 0.5668934 0.4535147
## WorkLifeBalance AvrWorkHrs
## 0.8616780 0.0000000
# We have a very small proportion of missing values per feature. We translate those in terms of observations
flag_nulls <- ifelse(rowSums(is.na(general_data_merged)) == 0, 0, 1)
sum(flag_nulls)/nrow(general_data_merged)*100 ## [1] 2.494331
# Since missing values represent 2.5% of the observations we remove them (just for this case, one should be careful when dropping missing values).
general_data_merged <- general_data_merged[rowSums(is.na(general_data_merged)) == 0,]For this EDA we create two dataframes one with labels of the ordinal variables and the other just as it is.
# First we keep the numeric variables
nums <- unlist(lapply(general_data_merged, is.numeric))
general_data_merged_with_ordinal_values <- general_data_merged[, nums]
# With this data we create a correlation matrix in order to see the relantionships between the features. Since the data contains ordinal data (hierarchy or ranks) we use Spearman correlation instead of Pearson.
cor_matrix <- cor(general_data_merged_with_ordinal_values, method = 'spearman')
corrplot(corr = cor_matrix, method = 'color', addCoef.col = 'gray',tl.cex = 0.7, number.cex = 0.5)# We create the second dataframe with the labels of the ordinal features
general_data_merged_with_categories <- general_data_merged
general_data_merged_with_categories$Education[which(general_data_merged_with_categories$Education == 1)] <- 'Below College'
general_data_merged_with_categories$Education[which(general_data_merged_with_categories$Education == 2)] <- 'College'
general_data_merged_with_categories$Education[which(general_data_merged_with_categories$Education == 3)] <- 'Bachelor'
general_data_merged_with_categories$Education[which(general_data_merged_with_categories$Education == 4)] <- 'Master'
general_data_merged_with_categories$Education[which(general_data_merged_with_categories$Education == 5)] <- 'Doctor'
general_data_merged_with_categories$EnvironmentSatisfaction[which(general_data_merged_with_categories$EnvironmentSatisfaction == 1)] <- 'Low'
general_data_merged_with_categories$EnvironmentSatisfaction[which(general_data_merged_with_categories$EnvironmentSatisfaction == 2)] <- 'Medium'
general_data_merged_with_categories$EnvironmentSatisfaction[which(general_data_merged_with_categories$EnvironmentSatisfaction == 3)] <- 'High'
general_data_merged_with_categories$EnvironmentSatisfaction[which(general_data_merged_with_categories$EnvironmentSatisfaction == 4)] <- 'Very High'
general_data_merged_with_categories$JobInvolvement[which(general_data_merged_with_categories$JobInvolvement == 1)] <- 'Low'
general_data_merged_with_categories$JobInvolvement[which(general_data_merged_with_categories$JobInvolvement == 2)] <- 'Medium'
general_data_merged_with_categories$JobInvolvement[which(general_data_merged_with_categories$JobInvolvement == 3)] <- 'High'
general_data_merged_with_categories$JobInvolvement[which(general_data_merged_with_categories$JobInvolvement == 4)] <- 'Very High'
general_data_merged_with_categories$JobSatisfaction[which(general_data_merged_with_categories$JobSatisfaction == 1)] <- 'Low'
general_data_merged_with_categories$JobSatisfaction[which(general_data_merged_with_categories$JobSatisfaction == 2)] <- 'Medium'
general_data_merged_with_categories$JobSatisfaction[which(general_data_merged_with_categories$JobSatisfaction == 3)] <- 'High'
general_data_merged_with_categories$JobSatisfaction[which(general_data_merged_with_categories$JobSatisfaction == 4)] <- 'Very High'
general_data_merged_with_categories$PerformanceRating[which(general_data_merged_with_categories$PerformanceRating == 1)] <- 'Low'
general_data_merged_with_categories$PerformanceRating[which(general_data_merged_with_categories$PerformanceRating == 2)] <- 'Good'
general_data_merged_with_categories$PerformanceRating[which(general_data_merged_with_categories$PerformanceRating == 3)] <- 'Excellent'
general_data_merged_with_categories$PerformanceRating[which(general_data_merged_with_categories$PerformanceRating == 4)] <- 'Outstanding'
general_data_merged_with_categories$WorkLifeBalance[which(general_data_merged_with_categories$WorkLifeBalance == 1)] <- 'Bad'
general_data_merged_with_categories$WorkLifeBalance[which(general_data_merged_with_categories$WorkLifeBalance == 2)] <- 'Good'
general_data_merged_with_categories$WorkLifeBalance[which(general_data_merged_with_categories$WorkLifeBalance == 3)] <- 'Better'
general_data_merged_with_categories$WorkLifeBalance[which(general_data_merged_with_categories$WorkLifeBalance == 4)] <- 'Best'general_data_merged_with_categories %>%
group_by(Gender, Education) %>%
summarize(Average_Income = mean(MonthlyIncome)) %>%
arrange(-Average_Income)general_data_merged_with_categories %>%
group_by(MaritalStatus, WorkLifeBalance) %>%
summarize(Average_Work = mean(AvrWorkHrs), Percent_employees = round(n()/nrow(general_data_merged_with_categories)*100,0)) %>%
arrange(-Percent_employees)Findings:
It seems that female workers have a high income in comparison with men, except for the women who have a masters degree since they earn less on average than the men.
More than a half of the employees say they have a ‘Better’ work-life balance.
# Creating some plots to see the rate of attrition by other variables such as Education, Environment Satisfaction, Job Involvement, Job satisfaction, Performance rating, work balance
g1 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = Education)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
facet_grid(~Education) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
labs(y = 'Percent', x = 'Attrition', title = 'Attrition by Education', fill = 'Attrition')+
theme_light() +
theme(legend.position = 'none')
g2 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = EnvironmentSatisfaction)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
facet_grid(~EnvironmentSatisfaction) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
labs(y = 'Percent', x = 'Attrition', title = 'Environment satisfaction vs Attrition', fill = 'Attrition')+
theme_light() +
theme(legend.position = 'none')
g3 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = JobInvolvement)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
facet_grid(~JobInvolvement) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
labs(y = 'Percent', x = 'Attrition', title = 'Job involvement vs Attrition', fill = 'Attrition')+
theme_light() +
theme(legend.position = 'none')
g4 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = JobSatisfaction)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
facet_grid(~JobSatisfaction) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
labs(y = 'Percent', x = 'Attrition', title = 'Job Satisfaction vs Attrition', fill = 'Attrition')+
theme_light() +
theme(legend.position = 'none')
g5 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = PerformanceRating)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
facet_grid(~PerformanceRating) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
labs(y = 'Percent', x = 'Attrition', title = 'Performance Rating vs Attrition', fill = 'Attrition')+
theme_light() +
theme(legend.position = 'none')
g6 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = WorkLifeBalance)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
facet_grid(~WorkLifeBalance) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
labs(y = 'Percent', x = 'Attrition', title = 'Work-life balance vs Attrition', fill = 'Attrition')+
theme_light() +
theme(legend.position = 'none')
plot_grid(g1,g2,g3,g4,g5,g6, nrow = 3)Findings:
We can see that attrition comes most from people with college education, perhaps they’re could be interns.
People that have a low environment satisfaction have a high rate of attrition.
People with low job involvement have a high rate of attrition.
People with low job satisfaction have a high rate of attrition.
People with low work-life balance have a high rate of attrition.
Performance rating does not have a significance difference in the level of attrition.
# Plots of Attrition vs Age, Average working hours, Monthly income, Gender, Marital status, and Department.
g7 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), y = Age, fill = Attrition)) +
geom_boxplot() +
xlab('')+
theme(legend.position = 'none')
g8 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), y = AvrWorkHrs, fill = Attrition)) +
geom_boxplot()+
xlab('')+
theme(legend.position = 'none')
g9 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), y = log(MonthlyIncome), fill = Attrition)) +
geom_boxplot()+
xlab('')+
ylab('Log of Monthly Income')+
theme(legend.position = 'none')
g10 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = Gender)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
facet_grid(~Gender) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
labs(y = 'Percent', x = 'Attrition', title = 'Gender vs Attrition', fill = 'Attrition')+
theme_light() +
theme(legend.position = 'none')
g11 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = MaritalStatus)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
facet_grid(~MaritalStatus) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
labs(y = 'Percent', x = 'Attrition', title = 'Marital Status vs Attrition', fill = 'Attrition')+
theme_light() +
theme(legend.position = 'none')
g12 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = Department)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
facet_grid(~Department) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
labs(y = 'Percent', x = 'Attrition', title = 'Department vs Attrition', fill = 'Attrition')+
theme_light() +
theme(legend.position = 'none')
plot_grid(g7,g8,g9,g10,g11,g12, nrow = 3)Findings:
Younger people have higher level of attrition than older people.
People that work more hours on average have a high rate of attrition.
It seems that income and gender are not relevant in the level of attrition.
People that are single have a higher rate of attrition than the ones that are married or divorced.
Human Resources have a higher rate of attrition than Research & Development and Sales departments.
We want to understand the most important factors that lead to employee attrition. For this we use logistic regression to uncover which factors are the most relevant. For this moment we do not want to predict. In the following part Model building: Part 2 we split the data into training and test and predict the outcomes based on the best classification algorithm either logistic regression, decision trees or random forest.
# Based on the findings in EDA we compute the first model based relevant features.
mod_a <- glm(formula = Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
Gender+JobLevel+MaritalStatus+MonthlyIncome+NumCompaniesWorked+YearsAtCompany+
JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
AvrWorkHrs, data = general_data_merged, family = 'binomial')
summary(mod_a)##
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + Department +
## DistanceFromHome + Education + Gender + JobLevel + MaritalStatus +
## MonthlyIncome + NumCompaniesWorked + YearsAtCompany + JobInvolvement +
## PerformanceRating + EnvironmentSatisfaction + JobSatisfaction +
## WorkLifeBalance + AvrWorkHrs, family = "binomial", data = general_data_merged)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8051 -0.5777 -0.3877 -0.2272 3.3738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.012e-01 6.653e-01 -1.355 0.17557
## Age -5.419e-02 6.159e-03 -8.798 < 2e-16 ***
## BusinessTravelTravel_Frequently 1.375e+00 2.078e-01 6.616 3.69e-11 ***
## BusinessTravelTravel_Rarely 6.252e-01 1.939e-01 3.224 0.00126 **
## DepartmentResearch & Development -9.342e-01 1.916e-01 -4.877 1.08e-06 ***
## DepartmentSales -9.463e-01 2.007e-01 -4.716 2.41e-06 ***
## DistanceFromHome -1.601e-04 5.785e-03 -0.028 0.97791
## Education -7.715e-02 4.483e-02 -1.721 0.08528 .
## GenderMale 9.117e-02 9.512e-02 0.958 0.33787
## JobLevel -6.167e-02 4.207e-02 -1.466 0.14269
## MaritalStatusMarried 3.024e-01 1.375e-01 2.199 0.02784 *
## MaritalStatusSingle 1.122e+00 1.362e-01 8.239 < 2e-16 ***
## MonthlyIncome -1.076e-06 1.011e-06 -1.064 0.28711
## NumCompaniesWorked 1.145e-01 1.929e-02 5.936 2.92e-09 ***
## YearsAtCompany -4.310e-02 1.018e-02 -4.233 2.30e-05 ***
## JobInvolvement -4.869e-02 6.326e-02 -0.770 0.44146
## PerformanceRating 7.858e-02 1.241e-01 0.633 0.52655
## EnvironmentSatisfaction -3.288e-01 4.149e-02 -7.924 2.30e-15 ***
## JobSatisfaction -3.110e-01 4.120e-02 -7.548 4.42e-14 ***
## WorkLifeBalance -2.872e-01 6.312e-02 -4.550 5.36e-06 ***
## AvrWorkHrs 4.296e-01 3.274e-02 13.124 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3804.3 on 4299 degrees of freedom
## Residual deviance: 3127.5 on 4279 degrees of freedom
## AIC: 3169.5
##
## Number of Fisher Scoring iterations: 5
vif(mod = mod_a)## GVIF Df GVIF^(1/(2*Df))
## Age 1.333480 1 1.154764
## BusinessTravel 1.052222 2 1.012807
## Department 1.057845 2 1.014158
## DistanceFromHome 1.017167 1 1.008547
## Education 1.013869 1 1.006911
## Gender 1.032351 1 1.016047
## JobLevel 1.021485 1 1.010686
## MaritalStatus 1.050845 2 1.012476
## MonthlyIncome 1.024963 1 1.012405
## NumCompaniesWorked 1.228627 1 1.108434
## YearsAtCompany 1.141766 1 1.068534
## JobInvolvement 1.011986 1 1.005975
## PerformanceRating 1.027053 1 1.013436
## EnvironmentSatisfaction 1.030300 1 1.015037
## JobSatisfaction 1.024502 1 1.012177
## WorkLifeBalance 1.019058 1 1.009484
## AvrWorkHrs 1.047025 1 1.023242
Interpretation:
Age(-): All things being equal, the older the people are, the less likely to leave the company.
BusinessTravel(+): All things being equal, people that travel frequently or rarely are more likely to leave the company than non-travel people.
Department(-): All things being equal, people that work in Research and Sales are less likely to leave the company.
MaritalStatus(+): All things being equal, people that are married or single are more likely to leave the company.
NumCompaniesWorked(+): All things being equal, people who have worked with more companies are more likely to leave the company.
YearsAtCompany(-): All things being equal, people with more years at the company are less likely to leave.
EnvironmentSatisfaction(-): All things being equal, people with more environment satisfaction are less likely to leave.
JobSatisfaction(-): All things being equal, people with more job satisfaction are less likely to leave.
WorkLifeBalance(-): All things being equal, people with more work-life balance are less likely to leave.
AvrWorkHrs(+): All things being equal, people with more hours worked are more likely to leave.
From these results, we can say that they’re are align with the findings from the EDA.
The Variance Inflation Factor(VIF) tells us that there are no problems with multicollinearity.
We could also improve the first model by droping some variables based on the Akaike Information Criteria (AIC) using the step wise algorithm.
mod_b <- stepAIC(mod_a, direction = 'both', trace = FALSE)
summary(mod_b)##
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + Department +
## Education + JobLevel + MaritalStatus + NumCompaniesWorked +
## YearsAtCompany + EnvironmentSatisfaction + JobSatisfaction +
## WorkLifeBalance + AvrWorkHrs, family = "binomial", data = general_data_merged)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8264 -0.5769 -0.3926 -0.2289 3.3994
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.783706 0.500468 -1.566 0.11736
## Age -0.054558 0.006133 -8.896 < 2e-16 ***
## BusinessTravelTravel_Frequently 1.371882 0.207092 6.625 3.48e-11 ***
## BusinessTravelTravel_Rarely 0.621613 0.193498 3.213 0.00132 **
## DepartmentResearch & Development -0.942783 0.190989 -4.936 7.96e-07 ***
## DepartmentSales -0.946018 0.200501 -4.718 2.38e-06 ***
## Education -0.079047 0.044740 -1.767 0.07726 .
## JobLevel -0.063872 0.041955 -1.522 0.12791
## MaritalStatusMarried 0.305664 0.137073 2.230 0.02575 *
## MaritalStatusSingle 1.129215 0.135828 8.314 < 2e-16 ***
## NumCompaniesWorked 0.114091 0.019217 5.937 2.90e-09 ***
## YearsAtCompany -0.043178 0.010177 -4.242 2.21e-05 ***
## EnvironmentSatisfaction -0.330640 0.041464 -7.974 1.53e-15 ***
## JobSatisfaction -0.309223 0.041121 -7.520 5.48e-14 ***
## WorkLifeBalance -0.288729 0.063040 -4.580 4.65e-06 ***
## AvrWorkHrs 0.431877 0.032630 13.236 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3804.3 on 4299 degrees of freedom
## Residual deviance: 3130.5 on 4284 degrees of freedom
## AIC: 3162.5
##
## Number of Fisher Scoring iterations: 5
vif(mod = mod_b)## GVIF Df GVIF^(1/(2*Df))
## Age 1.320872 1 1.149292
## BusinessTravel 1.042031 2 1.010346
## Department 1.042443 2 1.010446
## Education 1.011867 1 1.005916
## JobLevel 1.017562 1 1.008743
## MaritalStatus 1.041215 2 1.010148
## NumCompaniesWorked 1.220371 1 1.104704
## YearsAtCompany 1.140375 1 1.067884
## EnvironmentSatisfaction 1.028594 1 1.014196
## JobSatisfaction 1.021555 1 1.010720
## WorkLifeBalance 1.017083 1 1.008505
## AvrWorkHrs 1.043959 1 1.021743
Interpretation:
In this case our model has reduced its features from 20 to 15.
The effects are the same as the prior model but now the new model has a lower AIC so in those terms this is a simpler and better model than the previous one.
There is also no problem with multicollinearity.
Model evaluation:
To evaluate our model we create a graph comparing the predicted probabilities against actual attrition.
predicted_data <- data.frame(prob_of_attrition = mod_b$fitted.values, attrition = general_data_merged$Attrition)
predicted_data <- predicted_data %>%
arrange(prob_of_attrition) %>%
mutate(rank = 1:nrow(predicted_data))
ggplot(data = predicted_data, aes(x = rank, y = prob_of_attrition))+
geom_point(aes(color = attrition), alpha = 1, shape = 4, stroke = 2)+
labs(y = 'Predicted probability of attrition', x = 'Index', title = 'Probability of attrition vs Actual attrition')Graph Interpretation:
As we see our model is making a good job identifying attrition. The salmon color represents the people with low probability of attrition and did not leave the company while the turquoise represents the people that had a high probability of leaving and actually left.
However this model can’t be used to predict anything since we are using all the data. We need to see how our model works with unseen data.
Now instead of just uncover the factors of attrition we want to make predictions on which employee will leave based on several characteristcs gather from the company. The steps we take are the following:
We split the data into training and test datasets.
We build three models: Logistic regression, decision trees and random forest and compare their results.
Select the best model
# Setting seed for reproducibilty
set.seed(123)
# Data split
training.sample <- general_data_merged$Attrition %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- general_data_merged[training.sample,]
test.data <- general_data_merged[-training.sample,]
mod_c <- glm(formula = Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
Gender+JobLevel+MaritalStatus+log(MonthlyIncome)+NumCompaniesWorked+YearsAtCompany+
JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
AvrWorkHrs, data = train.data, family = 'binomial')
mod_c_StepWise <- stepAIC(mod_c, direction = 'both', trace = FALSE)
summary(mod_c_StepWise)##
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + Department +
## Education + MaritalStatus + NumCompaniesWorked + YearsAtCompany +
## JobInvolvement + EnvironmentSatisfaction + JobSatisfaction +
## WorkLifeBalance + AvrWorkHrs, family = "binomial", data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7597 -0.5811 -0.3855 -0.2290 3.3936
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.693881 0.592687 -1.171 0.241704
## Age -0.052146 0.006853 -7.610 2.75e-14 ***
## BusinessTravelTravel_Frequently 1.475801 0.239529 6.161 7.22e-10 ***
## BusinessTravelTravel_Rarely 0.663749 0.225277 2.946 0.003215 **
## DepartmentResearch & Development -1.002461 0.212039 -4.728 2.27e-06 ***
## DepartmentSales -1.128969 0.224367 -5.032 4.86e-07 ***
## Education -0.106843 0.050013 -2.136 0.032655 *
## MaritalStatusMarried 0.342032 0.154962 2.207 0.027301 *
## MaritalStatusSingle 1.192341 0.153175 7.784 7.02e-15 ***
## NumCompaniesWorked 0.106587 0.021571 4.941 7.76e-07 ***
## YearsAtCompany -0.043955 0.011333 -3.879 0.000105 ***
## JobInvolvement -0.144800 0.070039 -2.067 0.038694 *
## EnvironmentSatisfaction -0.314071 0.046318 -6.781 1.20e-11 ***
## JobSatisfaction -0.293706 0.046437 -6.325 2.54e-10 ***
## WorkLifeBalance -0.273230 0.070445 -3.879 0.000105 ***
## AvrWorkHrs 0.438169 0.036596 11.973 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3043.4 on 3439 degrees of freedom
## Residual deviance: 2497.2 on 3424 degrees of freedom
## AIC: 2529.2
##
## Number of Fisher Scoring iterations: 5
p.training <- predict(mod_c_StepWise, train.data, type = 'response') # vector of probabilites from the training dataset
p.training.attrition <- as.factor(ifelse(p.training > 0.5, 'Yes', 'No')) # transforming those probabilites into factors (Yes or No)
# Obtaining the confusion matrix for this model in our training data
confusionMatrix(p.training.attrition, train.data$Attrition)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2825 441
## Yes 59 115
##
## Accuracy : 0.8547
## 95% CI : (0.8424, 0.8663)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : 0.004642
##
## Kappa : 0.2579
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9795
## Specificity : 0.2068
## Pos Pred Value : 0.8650
## Neg Pred Value : 0.6609
## Prevalence : 0.8384
## Detection Rate : 0.8212
## Detection Prevalence : 0.9494
## Balanced Accuracy : 0.5932
##
## 'Positive' Class : No
##
# Testing our results in the test dataset with threshold = 0.5
p.test <- predict(mod_c_StepWise, test.data, type = 'response')
p.test.attrition <- as.factor(ifelse(p.test > 0.5, 'Yes','No'))
confusionMatrix(p.test.attrition, test.data$Attrition)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 703 114
## Yes 18 25
##
## Accuracy : 0.8465
## 95% CI : (0.8207, 0.87)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : 0.2758
##
## Kappa : 0.2148
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9750
## Specificity : 0.1799
## Pos Pred Value : 0.8605
## Neg Pred Value : 0.5814
## Prevalence : 0.8384
## Detection Rate : 0.8174
## Detection Prevalence : 0.9500
## Balanced Accuracy : 0.5774
##
## 'Positive' Class : No
##
Interpretation
As we see, the results in our training dataset have an accuracy of 85% while in the test dataset gives us 84%. However if we take a look at the No Information Rate indicator this shows that based only in the distribution of our features we can predict correctly 83% of the time. This is, if we take a new data point and assign it to the most common set we will be correct 83% of the time.
Futhermore if we take a look at Sensitivity and Specificity we have values of 97% and 18% respectively. So if we want to predict people who might leave (Prediction = Yes and Reference = Yes) this is not a good model, since will only detect 18% of them (25/114+25).
Possible solutions
Solution 1: Selecting a different threshold
# Selecting a different treshold based on ROC curve
ROCR_prediction <- prediction(p.training, train.data$Attrition)
ROCR_performance <- performance(ROCR_prediction, 'tpr', 'fpr')
plot(ROCR_performance, colorize = TRUE, print.cutoffs.at = seq(0.1, by = 0.1))# Based on this we select the new threshold to be 0.2 and test our model again.# Evaluating in training dataset withe new threshold
p.training <- predict(mod_c_StepWise, train.data, type = 'response')
p.training.attrition <- as.factor(ifelse(p.training > 0.2, 'Yes', 'No'))
confusionMatrix(p.training.attrition, train.data$Attrition)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2271 181
## Yes 613 375
##
## Accuracy : 0.7692
## 95% CI : (0.7547, 0.7832)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3516
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7874
## Specificity : 0.6745
## Pos Pred Value : 0.9262
## Neg Pred Value : 0.3796
## Prevalence : 0.8384
## Detection Rate : 0.6602
## Detection Prevalence : 0.7128
## Balanced Accuracy : 0.7310
##
## 'Positive' Class : No
##
# Evaluating in test dataset
p.test <- predict(mod_c_StepWise, test.data, type = 'response')
p.test.attrition <- as.factor(ifelse(p.test> 0.2, 'Yes', 'No'))
confusionMatrix(p.test.attrition, test.data$Attrition)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 590 50
## Yes 131 89
##
## Accuracy : 0.7895
## 95% CI : (0.7607, 0.8163)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.3713
##
## Mcnemar's Test P-Value : 2.742e-09
##
## Sensitivity : 0.8183
## Specificity : 0.6403
## Pos Pred Value : 0.9219
## Neg Pred Value : 0.4045
## Prevalence : 0.8384
## Detection Rate : 0.6860
## Detection Prevalence : 0.7442
## Balanced Accuracy : 0.7293
##
## 'Positive' Class : No
##
Results:
Solution 2: Oversampling
Since the data is unbalanced and we do not want to loose any information of the mayority class we oversample the minority class and try again with logistic regression.
We also use the stepwise algorithm to decrease the number of features in the model.
Training dataset
# Using oversampling in the training dataset
over <- ovun.sample(formula = Attrition ~., data = train.data, method = 'over')$data
table(over$Attrition) # now the data is balanced between No and Yes##
## No Yes
## 2884 2846
# Building the model
balanced_model <- glm(formula = Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
Gender+JobLevel+MaritalStatus+log(MonthlyIncome)+NumCompaniesWorked+YearsAtCompany+
JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
AvrWorkHrs, data = over, family = 'binomial')
# Model output
summary(balanced_model)##
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + Department +
## DistanceFromHome + Education + Gender + JobLevel + MaritalStatus +
## log(MonthlyIncome) + NumCompaniesWorked + YearsAtCompany +
## JobInvolvement + PerformanceRating + EnvironmentSatisfaction +
## JobSatisfaction + WorkLifeBalance + AvrWorkHrs, family = "binomial",
## data = over)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5145 -0.9066 -0.2923 0.9129 2.6345
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.418185 0.688563 2.060 0.039434 *
## Age -0.047525 0.003914 -12.142 < 2e-16 ***
## BusinessTravelTravel_Frequently 1.475060 0.133064 11.085 < 2e-16 ***
## BusinessTravelTravel_Rarely 0.716333 0.121629 5.890 3.87e-09 ***
## DepartmentResearch & Development -1.338819 0.143217 -9.348 < 2e-16 ***
## DepartmentSales -1.338110 0.148093 -9.036 < 2e-16 ***
## DistanceFromHome 0.001657 0.003903 0.425 0.671130
## Education -0.099516 0.030506 -3.262 0.001106 **
## GenderMale 0.054859 0.063564 0.863 0.388112
## JobLevel -0.113311 0.029187 -3.882 0.000103 ***
## MaritalStatusMarried 0.395811 0.087825 4.507 6.58e-06 ***
## MaritalStatusSingle 1.178699 0.089146 13.222 < 2e-16 ***
## log(MonthlyIncome) -0.157270 0.047457 -3.314 0.000920 ***
## NumCompaniesWorked 0.117968 0.013161 8.963 < 2e-16 ***
## YearsAtCompany -0.030521 0.006126 -4.983 6.28e-07 ***
## JobInvolvement -0.060252 0.041674 -1.446 0.148236
## PerformanceRating 0.339953 0.083488 4.072 4.66e-05 ***
## EnvironmentSatisfaction -0.289308 0.027411 -10.554 < 2e-16 ***
## JobSatisfaction -0.309423 0.028510 -10.853 < 2e-16 ***
## WorkLifeBalance -0.275408 0.041185 -6.687 2.28e-11 ***
## AvrWorkHrs 0.435745 0.022971 18.969 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7943.2 on 5729 degrees of freedom
## Residual deviance: 6329.9 on 5709 degrees of freedom
## AIC: 6371.9
##
## Number of Fisher Scoring iterations: 4
# Step wise
balanced_model_step_wise <- stepAIC(balanced_model, direction = 'both', trace = FALSE)
# Step wise model output
summary(balanced_model_step_wise)##
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + Department +
## Education + JobLevel + MaritalStatus + log(MonthlyIncome) +
## NumCompaniesWorked + YearsAtCompany + JobInvolvement + PerformanceRating +
## EnvironmentSatisfaction + JobSatisfaction + WorkLifeBalance +
## AvrWorkHrs, family = "binomial", data = over)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5261 -0.9013 -0.2901 0.9158 2.6209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.465011 0.685749 2.136 0.032650 *
## Age -0.047468 0.003913 -12.131 < 2e-16 ***
## BusinessTravelTravel_Frequently 1.467963 0.132684 11.064 < 2e-16 ***
## BusinessTravelTravel_Rarely 0.710717 0.121304 5.859 4.66e-09 ***
## DepartmentResearch & Development -1.339945 0.143265 -9.353 < 2e-16 ***
## DepartmentSales -1.337897 0.148122 -9.032 < 2e-16 ***
## Education -0.098041 0.030459 -3.219 0.001287 **
## JobLevel -0.114806 0.029099 -3.945 7.97e-05 ***
## MaritalStatusMarried 0.392157 0.087733 4.470 7.83e-06 ***
## MaritalStatusSingle 1.174886 0.089040 13.195 < 2e-16 ***
## log(MonthlyIncome) -0.157297 0.047408 -3.318 0.000907 ***
## NumCompaniesWorked 0.116839 0.013099 8.920 < 2e-16 ***
## YearsAtCompany -0.030522 0.006113 -4.993 5.93e-07 ***
## JobInvolvement -0.060258 0.041671 -1.446 0.148165
## PerformanceRating 0.342283 0.083315 4.108 3.99e-05 ***
## EnvironmentSatisfaction -0.289354 0.027403 -10.559 < 2e-16 ***
## JobSatisfaction -0.310816 0.028469 -10.918 < 2e-16 ***
## WorkLifeBalance -0.275165 0.041176 -6.683 2.35e-11 ***
## AvrWorkHrs 0.436605 0.022942 19.031 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7943.2 on 5729 degrees of freedom
## Residual deviance: 6330.8 on 5711 degrees of freedom
## AIC: 6368.8
##
## Number of Fisher Scoring iterations: 4
p.training <- predict(balanced_model_step_wise, train.data, type = 'response')
p.training.attrition <- as.factor(ifelse(p.training > 0.5, 'Yes', 'No'))
confusionMatrix(p.training.attrition, train.data$Attrition)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2110 153
## Yes 774 403
##
## Accuracy : 0.7305
## 95% CI : (0.7154, 0.7453)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3146
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7316
## Specificity : 0.7248
## Pos Pred Value : 0.9324
## Neg Pred Value : 0.3424
## Prevalence : 0.8384
## Detection Rate : 0.6134
## Detection Prevalence : 0.6578
## Balanced Accuracy : 0.7282
##
## 'Positive' Class : No
##
Result: As we can see our model has improved by doing oversampling. Also setting the threshold to 0.5 does a good job identifying which employee might leave the company in our training dataset. Now we test the result in our testing dataset
Testing dataset
p.test <- predict(balanced_model_step_wise, test.data, type = 'response')
p.test.attrition <- as.factor(ifelse(p.test > 0.5, 'Yes', 'No'))
confusionMatrix(p.test.attrition, test.data$Attrition)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 545 46
## Yes 176 93
##
## Accuracy : 0.7419
## 95% CI : (0.7112, 0.7708)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3085
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7559
## Specificity : 0.6691
## Pos Pred Value : 0.9222
## Neg Pred Value : 0.3457
## Prevalence : 0.8384
## Detection Rate : 0.6337
## Detection Prevalence : 0.6872
## Balanced Accuracy : 0.7125
##
## 'Positive' Class : No
##
Result: Now the speceificity has increased to 66% in our testing set, this is result is almost the same to the result we obtained previously when we set our threshold to be 02.
For the second model we create a decision tree to see if our results can improve.
# Setting seed for reproducibilty
set.seed(123)
# Data split
training.sample <- general_data_merged$Attrition %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- general_data_merged[training.sample,]
test.data <- general_data_merged[-training.sample,]
# Decision tree model
model.tree <- rpart::rpart(formula = Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
Gender+JobLevel+MaritalStatus+MonthlyIncome+NumCompaniesWorked+YearsAtCompany+
JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
AvrWorkHrs, data = train.data, method = 'class')
# Plotting decision tree
rpart.plot::prp(model.tree)predict_tree <- predict(model.tree, test.data, type = 'class')
confusionMatrix(predict_tree, test.data$Attrition)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 708 96
## Yes 13 43
##
## Accuracy : 0.8733
## 95% CI : (0.8492, 0.8948)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : 0.002491
##
## Kappa : 0.3838
##
## Mcnemar's Test P-Value : 4.024e-15
##
## Sensitivity : 0.9820
## Specificity : 0.3094
## Pos Pred Value : 0.8806
## Neg Pred Value : 0.7679
## Prevalence : 0.8384
## Detection Rate : 0.8233
## Detection Prevalence : 0.9349
## Balanced Accuracy : 0.6457
##
## 'Positive' Class : No
##
Findings
Just comparing the result of the first logistic regression(before doing oversampling) we can see that in terms of specificity the decision tree does a better job identifying attrition (30% vs 27%).
However when the data is balanced by oversampling we see that logistic regression is a better model to predict attrition (66% vs 30%).
As we did in logistic regression we balanced the data by doing oversampling and test the decision tree again
Decision tree with oversampling
over <- ovun.sample(formula = Attrition ~., data = train.data, method = 'over')$data
model.tree <- rpart::rpart(formula = Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
Gender+JobLevel+MaritalStatus+MonthlyIncome+NumCompaniesWorked+YearsAtCompany+
JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
AvrWorkHrs, data = over, method = 'class')
# Plotting decision tree
rpart.plot::prp(model.tree)predict_tree <- predict(model.tree, test.data, type = 'class')
confusionMatrix(predict_tree, test.data$Attrition)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 584 51
## Yes 137 88
##
## Accuracy : 0.7814
## 95% CI : (0.7523, 0.8086)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3545
##
## Mcnemar's Test P-Value : 5.673e-10
##
## Sensitivity : 0.8100
## Specificity : 0.6331
## Pos Pred Value : 0.9197
## Neg Pred Value : 0.3911
## Prevalence : 0.8384
## Detection Rate : 0.6791
## Detection Prevalence : 0.7384
## Balanced Accuracy : 0.7215
##
## 'Positive' Class : No
##
Findings
A priori we know that Random Forest works well with inbalaced data in comparison with logistic and decision trees. We take the same parameters or features as before and feed them into the Random Forest model and take a look at the results.
set.seed(123)
partition <- createDataPartition(general_data_merged$Attrition, p = 0.8, list = FALSE)
training <- general_data_merged[partition,]
test <- general_data_merged[-partition,]
# Random forest with cross-validation
model_rf_1 <- randomForest::randomForest(Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
Gender+JobLevel+MaritalStatus+MonthlyIncome+NumCompaniesWorked+YearsAtCompany+
JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
AvrWorkHrs, data = training, trControl = trainControl(method = 'cv', 10))
# Printing the model
model_rf_1##
## Call:
## randomForest(formula = Attrition ~ Age + BusinessTravel + Department + DistanceFromHome + Education + Gender + JobLevel + MaritalStatus + MonthlyIncome + NumCompaniesWorked + YearsAtCompany + JobInvolvement + PerformanceRating + EnvironmentSatisfaction + JobSatisfaction + WorkLifeBalance + AvrWorkHrs, data = training, trControl = trainControl(method = "cv", 10))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 0.81%
## Confusion matrix:
## No Yes class.error
## No 2880 4 0.001386963
## Yes 24 532 0.043165468
# Evaluation of the model with useen data
p <- predict(model_rf_1, test, type = 'response')
confusionMatrix(p, test$Attrition)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 721 6
## Yes 0 133
##
## Accuracy : 0.993
## 95% CI : (0.9849, 0.9974)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9738
##
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 1.0000
## Specificity : 0.9568
## Pos Pred Value : 0.9917
## Neg Pred Value : 1.0000
## Prevalence : 0.8384
## Detection Rate : 0.8384
## Detection Prevalence : 0.8453
## Balanced Accuracy : 0.9784
##
## 'Positive' Class : No
##
Findings
Random Forest is doing a pretty good job predicting attrition since it only misclasfied 6 out of 139 observations reaching a level of specificty of 95%.
The number of trees for this model is set by default to be 500.
We might want to decrease the number of trees based on the error rate. If we decrease the number of trees the time in computing the model should decrease as well.
# Plotting error rate
plot(model_rf_1, main = 'Error rate vs number of trees')Based on this 100 trees should be fine.
We can also defined the mtry(No. of variables tried at each split) to improve our model.
# Plotting mtry
t <- randomForest::tuneRF(training[,-2], training[,2], stepFactor = 0.5, plot = TRUE, ntreeTry = 100, trace = TRUE, improve = 0.05)## mtry = 5 OOB error = 0.84%
## Searching left ...
## mtry = 10 OOB error = 0.87%
## -0.03448276 0.05
## Searching right ...
## mtry = 2 OOB error = 0.78%
## 0.06896552 0.05
## mtry = 1 OOB error = 8.31%
## -9.592593 0.05
With our default model we had an OOB of 0.81% and based on the graph we can leave it there because we do not think there is going to be a significance improvement with the data we have.
Since we found the 100 trees should be fine instead of 500, we build our model with that amount of trees.
model_rf_2 <- randomForest::randomForest(Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
Gender+JobLevel+MaritalStatus+MonthlyIncome+NumCompaniesWorked+YearsAtCompany+
JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
AvrWorkHrs, data = training, ntree = 100, trControl = trainControl(method = 'cv', 10))
model_rf_2##
## Call:
## randomForest(formula = Attrition ~ Age + BusinessTravel + Department + DistanceFromHome + Education + Gender + JobLevel + MaritalStatus + MonthlyIncome + NumCompaniesWorked + YearsAtCompany + JobInvolvement + PerformanceRating + EnvironmentSatisfaction + JobSatisfaction + WorkLifeBalance + AvrWorkHrs, data = training, ntree = 100, trControl = trainControl(method = "cv", 10))
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 0.84%
## Confusion matrix:
## No Yes class.error
## No 2878 6 0.002080444
## Yes 23 533 0.041366906
p <- predict(model_rf_2, test, type = 'response')
confusionMatrix(p, test$Attrition)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 721 6
## Yes 0 133
##
## Accuracy : 0.993
## 95% CI : (0.9849, 0.9974)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9738
##
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 1.0000
## Specificity : 0.9568
## Pos Pred Value : 0.9917
## Neg Pred Value : 1.0000
## Prevalence : 0.8384
## Detection Rate : 0.8384
## Detection Prevalence : 0.8453
## Balanced Accuracy : 0.9784
##
## 'Positive' Class : No
##
hist(randomForest::treesize(model_rf_2), main = 'No. of nodes', xlab = '', col = 'forestgreen')The range of nodes goes from 250 to 300 for our 100 trees.
Also we can identify which set of features are the most important.
# Plotting features based on importance
randomForest::varImpPlot(model_rf_2, main = 'Feature Importance')Top 5 features
Partial dependency plot
randomForest::partialPlot(model_rf_2, training, AvrWorkHrs, 'Yes')# When the Average working hours is more than 7 it tends to predict Attrition ('Yes')
randomForest::partialPlot(model_rf_2, training, Age, 'Yes')# In this case the model is confused. When is more than 40 it tends to predict 'Yes' but also when the range is less than 40. This could be because of the outliers.First we construct a logistic regression to identify the most relevant features for attrition. We found that Age(-), BusinessTravel(+), Department(-), MaritalStatus(+), NumCompaniesWorked(+), YearsAtCompany(-), EnvironmentSatisfaction(-), JobSatisfaction(-), WorkLifeBalance(-) and AvrWorkHrs(+) are statistically significant. This model was using all the data since at first we were not interested in prediction.
When we changed the problem to predict employee attrition we used 3 different models: Logistic regression, decision trees and random forest to test which of those model performed better with the data we had.
First we build those model with the data as it is and changing the threshold level to increase specificity. After that we did some manipulation to balanced the data by oversampling the minority class. With this data manipulation the results were better for both the logistic regression and decision trees in terms of specificity (i.e attrition = ‘Yes’).
For the random forest model we did not use oversampling since that model usually works better with unbalanced data. The results that we got outperformed the previous two models, reaching a level of specificity of 95% in the test data set.
We also modify our random forest model by reducing the number of trees needed to construct the forest from 500 (default) to 100 based on the error rate for each tree without compromising our results. We decided not change the number of variables for each split since it did not add any value.
Based on the results of the random forest analysis we can conclude that the top 5 features are: Average working hours, Age, Years at the company, Monthly Income and Distance from home. Those features could be important for the manager in order to reduce the level of attrition among employees.
For this particular dataset, we conclude that the best model to use in order to predict attrition is random forest.