HR Analytics: Classification

Project Overview

The main objectives for this project are:

Uncover the factors that lead to employee attrition.
Build a model that can predict attrition based on certain features of the employee.

The dataset for this project is available on Kaggle: IBM HR Analytics Employee Attrition & Performance.

For this analysis we have used the following packages:

library(plyr)
library(dplyr)
library(ggplot2)
library(caret)
library(car)
library(cowplot)
library(gridExtra)
library(corrplot)
library(MASS)
library(ROCR)
library(ROSE)
library(rpart)
library(rpart.plot)

Reading data and basic exploration

# Reading files
general_data_raw <- read.csv('D:/DS_PROJECTS/HR Analytics Case Study/datasets_42363_72602_general_data.csv', header = TRUE, sep = ',')

survey_data_raw <- read.csv('D:/DS_PROJECTS/HR Analytics Case Study/datasets_42363_72602_employee_survey_data.csv',header = TRUE,sep = ',')

manager_survey_data_raw <- read.csv('D:/DS_PROJECTS/HR Analytics Case Study/datasets_42363_72602_manager_survey_data.csv', header = TRUE, sep = ',')

in_time_raw <- read.csv('D:/DS_PROJECTS/HR Analytics Case Study/in_time.csv', header = TRUE, sep = ',')

out_time_raw <- read.csv('D:/DS_PROJECTS/HR Analytics Case Study/out_time.csv', header = TRUE, sep = ',')

# Exploring the dataframes

str(survey_data_raw)

## 'data.frame':    4410 obs. of  4 variables:
##  $ EmployeeID             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ EnvironmentSatisfaction: int  3 3 2 4 4 3 1 1 2 2 ...
##  $ JobSatisfaction        : int  4 2 2 4 1 2 3 2 4 1 ...
##  $ WorkLifeBalance        : int  2 4 1 3 3 2 1 3 3 3 ...

str(manager_survey_data_raw)

## 'data.frame':    4410 obs. of  3 variables:
##  $ EmployeeID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ JobInvolvement   : int  3 2 3 2 3 3 3 3 3 3 ...
##  $ PerformanceRating: int  3 4 3 3 3 3 4 4 4 3 ...

str(general_data_raw)

## 'data.frame':    4410 obs. of  24 variables:
##  $ Age                    : int  51 31 32 38 32 46 28 29 31 25 ...
##  $ Attrition              : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 1 1 ...
##  $ BusinessTravel         : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 2 1 3 3 3 3 3 1 ...
##  $ Department             : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome       : int  6 10 17 2 10 8 11 18 1 7 ...
##  $ Education              : int  2 1 4 5 1 3 2 3 3 4 ...
##  $ EducationField         : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EmployeeCount          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeID             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                 : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 1 2 2 2 1 ...
##  $ JobLevel               : int  1 1 4 3 1 4 2 2 3 4 ...
##  $ JobRole                : Factor w/ 9 levels "Healthcare Representative",..: 1 7 8 2 8 6 8 8 3 3 ...
##  $ MaritalStatus          : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 3 2 2 1 ...
##  $ MonthlyIncome          : int  131160 41890 193280 83210 23420 40710 58130 31430 20440 134640 ...
##  $ NumCompaniesWorked     : int  1 0 1 3 4 3 2 2 0 1 ...
##  $ Over18                 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PercentSalaryHike      : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ StandardHours          : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ StockOptionLevel       : int  0 1 3 3 2 0 1 3 0 1 ...
##  $ TotalWorkingYears      : int  1 6 5 13 9 28 5 10 10 6 ...
##  $ TrainingTimesLastYear  : int  6 3 2 5 2 5 2 2 2 2 ...
##  $ YearsAtCompany         : int  1 5 5 8 6 7 0 0 9 6 ...
##  $ YearsSinceLastPromotion: int  0 1 0 7 0 7 0 0 7 1 ...
##  $ YearsWithCurrManager   : int  0 4 3 5 4 7 0 0 8 5 ...

#str(in_time_raw)

#str(out_time_raw)

# Checking for missing values

sapply(survey_data_raw, function(x) sum(is.na(x)))/nrow(survey_data_raw)*100

##              EmployeeID EnvironmentSatisfaction         JobSatisfaction 
##               0.0000000               0.5668934               0.4535147 
##         WorkLifeBalance 
##               0.8616780

sapply(manager_survey_data_raw, function(x) sum(is.na(x)))/nrow(manager_survey_data_raw)*100

##        EmployeeID    JobInvolvement PerformanceRating 
##                 0                 0                 0

sapply(general_data_raw, function(x) sum(is.na(x)))/nrow(general_data_raw)*100

##                     Age               Attrition          BusinessTravel 
##               0.0000000               0.0000000               0.0000000 
##              Department        DistanceFromHome               Education 
##               0.0000000               0.0000000               0.0000000 
##          EducationField           EmployeeCount              EmployeeID 
##               0.0000000               0.0000000               0.0000000 
##                  Gender                JobLevel                 JobRole 
##               0.0000000               0.0000000               0.0000000 
##           MaritalStatus           MonthlyIncome      NumCompaniesWorked 
##               0.0000000               0.0000000               0.4308390 
##                  Over18       PercentSalaryHike           StandardHours 
##               0.0000000               0.0000000               0.0000000 
##        StockOptionLevel       TotalWorkingYears   TrainingTimesLastYear 
##               0.0000000               0.2040816               0.0000000 
##          YearsAtCompany YearsSinceLastPromotion    YearsWithCurrManager 
##               0.0000000               0.0000000               0.0000000

Data treatment

Since we have out_time and in_time we can compute the average working hours per employee. This might perhaps gives us some additional insight.

in_time <- sapply(in_time_raw, function(x) as.POSIXlt(x, origin = '1970-01-01', '%y-%m-%d %H:%M:%S'))
in_time <- as.data.frame(in_time)

out_time <- sapply(out_time_raw, function(x) as.POSIXlt(x, origin = '1970-01-01', '%y-%m-%d %H:%M:%S'))
out_time <- as.data.frame(out_time)

# Removing the first column since there has no value in the data

in_time <- in_time[,c(2:ncol(in_time))]

out_time <- out_time[,c(2:ncol(out_time))]

# Computing the difference between out_time and in_time

hours_worked_raw <- out_time - in_time

# We want to keep the records even if there was one employee that went to work

hours_worked_raw <- hours_worked_raw[,colSums(is.na(hours_worked_raw)) < nrow(hours_worked_raw)]

# Transforming the resuls to numeric so we can take the average

hours_worked_raw <- sapply(hours_worked_raw, function(x) as.numeric(x))

hours_worked_raw <- as.data.frame(hours_worked_raw)

# Taking the average of working hours per employee
AvrWorkHrs <- apply(hours_worked_raw, 1, mean, na.rm = TRUE) # 1 Is to indicate rows

# Creating ID column so it can be merged with the rest of the datasets

id_vector <- 1:nrow(hours_worked_raw)

AverageWorkedHours <- data.frame(EmployeeID = id_vector, AvrWorkHrs = AvrWorkHrs)

# Before merging we take a look if each observation is from a different employee

setdiff(survey_data_raw$EmployeeID, manager_survey_data_raw$EmployeeID)

## integer(0)

setdiff(manager_survey_data_raw$EmployeeID, general_data_raw$EmployeeID)

## integer(0)

setdiff(general_data_raw$EmployeeID, AverageWorkedHours$EmployeeID)

## integer(0)

# Since all of them are complete we can merge them

general_data_merged <- inner_join(general_data_raw,manager_survey_data_raw, by = 'EmployeeID') %>%
  inner_join(., survey_data_raw, by = 'EmployeeID') %>%
  inner_join(., AverageWorkedHours, by = 'EmployeeID')

# Droping values that have the same value for all observations

same_values <- nearZeroVar(general_data_merged, names = TRUE)
general_data_merged <- general_data_merged %>%
  dplyr::select(-c(c('EmployeeID', same_values)))

# Now we check the structure of the new dataframe and the missing values. And evaluate if they are significant in terms of the total observations

str(general_data_merged)

## 'data.frame':    4410 obs. of  26 variables:
##  $ Age                    : int  51 31 32 38 32 46 28 29 31 25 ...
##  $ Attrition              : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 1 1 ...
##  $ BusinessTravel         : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 2 1 3 3 3 3 3 1 ...
##  $ Department             : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome       : int  6 10 17 2 10 8 11 18 1 7 ...
##  $ Education              : int  2 1 4 5 1 3 2 3 3 4 ...
##  $ EducationField         : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ Gender                 : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 1 2 2 2 1 ...
##  $ JobLevel               : int  1 1 4 3 1 4 2 2 3 4 ...
##  $ JobRole                : Factor w/ 9 levels "Healthcare Representative",..: 1 7 8 2 8 6 8 8 3 3 ...
##  $ MaritalStatus          : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 3 2 2 1 ...
##  $ MonthlyIncome          : int  131160 41890 193280 83210 23420 40710 58130 31430 20440 134640 ...
##  $ NumCompaniesWorked     : int  1 0 1 3 4 3 2 2 0 1 ...
##  $ PercentSalaryHike      : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ StockOptionLevel       : int  0 1 3 3 2 0 1 3 0 1 ...
##  $ TotalWorkingYears      : int  1 6 5 13 9 28 5 10 10 6 ...
##  $ TrainingTimesLastYear  : int  6 3 2 5 2 5 2 2 2 2 ...
##  $ YearsAtCompany         : int  1 5 5 8 6 7 0 0 9 6 ...
##  $ YearsSinceLastPromotion: int  0 1 0 7 0 7 0 0 7 1 ...
##  $ YearsWithCurrManager   : int  0 4 3 5 4 7 0 0 8 5 ...
##  $ JobInvolvement         : int  3 2 3 2 3 3 3 3 3 3 ...
##  $ PerformanceRating      : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ EnvironmentSatisfaction: int  3 3 2 4 4 3 1 1 2 2 ...
##  $ JobSatisfaction        : int  4 2 2 4 1 2 3 2 4 1 ...
##  $ WorkLifeBalance        : int  2 4 1 3 3 2 1 3 3 3 ...
##  $ AvrWorkHrs             : num  7.37 7.72 7.01 7.19 8.01 ...

sapply(general_data_merged, function(x) sum(is.na(x)))/nrow(general_data_merged)*100

##                     Age               Attrition          BusinessTravel 
##               0.0000000               0.0000000               0.0000000 
##              Department        DistanceFromHome               Education 
##               0.0000000               0.0000000               0.0000000 
##          EducationField                  Gender                JobLevel 
##               0.0000000               0.0000000               0.0000000 
##                 JobRole           MaritalStatus           MonthlyIncome 
##               0.0000000               0.0000000               0.0000000 
##      NumCompaniesWorked       PercentSalaryHike        StockOptionLevel 
##               0.4308390               0.0000000               0.0000000 
##       TotalWorkingYears   TrainingTimesLastYear          YearsAtCompany 
##               0.2040816               0.0000000               0.0000000 
## YearsSinceLastPromotion    YearsWithCurrManager          JobInvolvement 
##               0.0000000               0.0000000               0.0000000 
##       PerformanceRating EnvironmentSatisfaction         JobSatisfaction 
##               0.0000000               0.5668934               0.4535147 
##         WorkLifeBalance              AvrWorkHrs 
##               0.8616780               0.0000000

# We have a very small proportion of missing values per feature. We translate those in terms of observations

flag_nulls <- ifelse(rowSums(is.na(general_data_merged)) == 0, 0, 1)
sum(flag_nulls)/nrow(general_data_merged)*100

## [1] 2.494331

# Since missing values represent 2.5% of the observations we remove them (just for this case, one should be careful when dropping missing values).

general_data_merged <- general_data_merged[rowSums(is.na(general_data_merged)) == 0,]

Exploratory Data Analysis (EDA)

For this EDA we create two dataframes one with labels of the ordinal variables and the other just as it is.

Correlation plot

# First we keep the numeric variables

nums <- unlist(lapply(general_data_merged, is.numeric))
general_data_merged_with_ordinal_values <- general_data_merged[, nums]

# With this data we create a correlation matrix in order to see the relantionships between the features. Since the data contains ordinal data (hierarchy or ranks) we use Spearman correlation instead of Pearson.

cor_matrix <- cor(general_data_merged_with_ordinal_values, method = 'spearman')
corrplot(corr = cor_matrix, method = 'color', addCoef.col = 'gray',tl.cex = 0.7, number.cex = 0.5)

# We create the second dataframe with the labels of the ordinal features

general_data_merged_with_categories <- general_data_merged

general_data_merged_with_categories$Education[which(general_data_merged_with_categories$Education == 1)] <- 'Below College'
general_data_merged_with_categories$Education[which(general_data_merged_with_categories$Education == 2)] <- 'College'
general_data_merged_with_categories$Education[which(general_data_merged_with_categories$Education == 3)] <- 'Bachelor'
general_data_merged_with_categories$Education[which(general_data_merged_with_categories$Education == 4)] <- 'Master'
general_data_merged_with_categories$Education[which(general_data_merged_with_categories$Education == 5)] <- 'Doctor'

general_data_merged_with_categories$EnvironmentSatisfaction[which(general_data_merged_with_categories$EnvironmentSatisfaction == 1)] <- 'Low'
general_data_merged_with_categories$EnvironmentSatisfaction[which(general_data_merged_with_categories$EnvironmentSatisfaction == 2)] <- 'Medium'
general_data_merged_with_categories$EnvironmentSatisfaction[which(general_data_merged_with_categories$EnvironmentSatisfaction == 3)] <- 'High'
general_data_merged_with_categories$EnvironmentSatisfaction[which(general_data_merged_with_categories$EnvironmentSatisfaction == 4)] <- 'Very High'

general_data_merged_with_categories$JobInvolvement[which(general_data_merged_with_categories$JobInvolvement == 1)] <- 'Low'
general_data_merged_with_categories$JobInvolvement[which(general_data_merged_with_categories$JobInvolvement == 2)] <- 'Medium'
general_data_merged_with_categories$JobInvolvement[which(general_data_merged_with_categories$JobInvolvement == 3)] <- 'High'
general_data_merged_with_categories$JobInvolvement[which(general_data_merged_with_categories$JobInvolvement == 4)] <- 'Very High'

general_data_merged_with_categories$JobSatisfaction[which(general_data_merged_with_categories$JobSatisfaction == 1)] <- 'Low'
general_data_merged_with_categories$JobSatisfaction[which(general_data_merged_with_categories$JobSatisfaction == 2)] <- 'Medium'
general_data_merged_with_categories$JobSatisfaction[which(general_data_merged_with_categories$JobSatisfaction == 3)] <- 'High'
general_data_merged_with_categories$JobSatisfaction[which(general_data_merged_with_categories$JobSatisfaction == 4)] <- 'Very High'

general_data_merged_with_categories$PerformanceRating[which(general_data_merged_with_categories$PerformanceRating == 1)] <- 'Low'
general_data_merged_with_categories$PerformanceRating[which(general_data_merged_with_categories$PerformanceRating == 2)] <- 'Good'
general_data_merged_with_categories$PerformanceRating[which(general_data_merged_with_categories$PerformanceRating == 3)] <- 'Excellent'
general_data_merged_with_categories$PerformanceRating[which(general_data_merged_with_categories$PerformanceRating == 4)] <- 'Outstanding'

general_data_merged_with_categories$WorkLifeBalance[which(general_data_merged_with_categories$WorkLifeBalance == 1)] <- 'Bad'
general_data_merged_with_categories$WorkLifeBalance[which(general_data_merged_with_categories$WorkLifeBalance == 2)] <- 'Good'
general_data_merged_with_categories$WorkLifeBalance[which(general_data_merged_with_categories$WorkLifeBalance == 3)] <- 'Better'
general_data_merged_with_categories$WorkLifeBalance[which(general_data_merged_with_categories$WorkLifeBalance == 4)] <- 'Best'

Pivot tables

general_data_merged_with_categories %>%
  group_by(Gender, Education) %>%
  summarize(Average_Income = mean(MonthlyIncome)) %>%
  arrange(-Average_Income)

general_data_merged_with_categories %>%
  group_by(MaritalStatus, WorkLifeBalance) %>%
  summarize(Average_Work = mean(AvrWorkHrs), Percent_employees = round(n()/nrow(general_data_merged_with_categories)*100,0)) %>%
  arrange(-Percent_employees)

Findings:

It seems that female workers have a high income in comparison with men, except for the women who have a masters degree since they earn less on average than the men.
More than a half of the employees say they have a ‘Better’ work-life balance.

Bar and box plots

First part

# Creating some plots to see the rate of attrition by other variables such as Education, Environment Satisfaction, Job Involvement, Job satisfaction, Performance rating, work balance

g1 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = Education)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
  facet_grid(~Education) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
  labs(y = 'Percent', x = 'Attrition', title = 'Attrition by Education', fill = 'Attrition')+
  theme_light() +
  theme(legend.position = 'none')

g2 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = EnvironmentSatisfaction)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
  facet_grid(~EnvironmentSatisfaction) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
  labs(y = 'Percent', x = 'Attrition', title = 'Environment satisfaction vs Attrition', fill = 'Attrition')+
  theme_light() +
  theme(legend.position = 'none')

g3 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = JobInvolvement)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
  facet_grid(~JobInvolvement) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
  labs(y = 'Percent', x = 'Attrition', title = 'Job involvement vs Attrition', fill = 'Attrition')+
  theme_light() +
  theme(legend.position = 'none')

g4 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = JobSatisfaction)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
  facet_grid(~JobSatisfaction) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
  labs(y = 'Percent', x = 'Attrition', title = 'Job Satisfaction vs Attrition', fill = 'Attrition')+
  theme_light() +
  theme(legend.position = 'none')

g5 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = PerformanceRating)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
  facet_grid(~PerformanceRating) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
  labs(y = 'Percent', x = 'Attrition', title = 'Performance Rating vs Attrition', fill = 'Attrition')+
  theme_light() +
  theme(legend.position = 'none')

g6 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = WorkLifeBalance)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
  facet_grid(~WorkLifeBalance) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
  labs(y = 'Percent', x = 'Attrition', title = 'Work-life balance vs Attrition', fill = 'Attrition')+
  theme_light() +
  theme(legend.position = 'none')

plot_grid(g1,g2,g3,g4,g5,g6, nrow = 3)

Findings:

We can see that attrition comes most from people with college education, perhaps they’re could be interns.
People that have a low environment satisfaction have a high rate of attrition.
People with low job involvement have a high rate of attrition.
People with low job satisfaction have a high rate of attrition.
People with low work-life balance have a high rate of attrition.
Performance rating does not have a significance difference in the level of attrition.

Second part

# Plots of Attrition vs Age, Average working hours, Monthly income, Gender, Marital status, and Department.

g7 <-  ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), y = Age, fill = Attrition)) +
  geom_boxplot() +
  xlab('')+
  theme(legend.position = 'none')

g8 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), y = AvrWorkHrs, fill = Attrition)) +
  geom_boxplot()+
  xlab('')+
  theme(legend.position = 'none')

g9 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), y = log(MonthlyIncome), fill = Attrition)) +
  geom_boxplot()+
  xlab('')+
  ylab('Log of Monthly Income')+
  theme(legend.position = 'none')

g10 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = Gender)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
  facet_grid(~Gender) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
  labs(y = 'Percent', x = 'Attrition', title = 'Gender vs Attrition', fill = 'Attrition')+
  theme_light() +
  theme(legend.position = 'none')

g11 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = MaritalStatus)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
  facet_grid(~MaritalStatus) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
  labs(y = 'Percent', x = 'Attrition', title = 'Marital Status vs Attrition', fill = 'Attrition')+
  theme_light() +
  theme(legend.position = 'none')

g12 <- ggplot(data = general_data_merged_with_categories, aes(x = factor(Attrition), group = Department)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat = 'count', position = 'stack') +
  facet_grid(~Department) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = scales::percent(round((..prop..),2)), y = ..prop..), stat = 'count', vjust = -.2, size = 3) +
  labs(y = 'Percent', x = 'Attrition', title = 'Department vs Attrition', fill = 'Attrition')+
  theme_light() +
  theme(legend.position = 'none')

plot_grid(g7,g8,g9,g10,g11,g12, nrow = 3)

Findings:

Younger people have higher level of attrition than older people.
People that work more hours on average have a high rate of attrition.
It seems that income and gender are not relevant in the level of attrition.
People that are single have a higher rate of attrition than the ones that are married or divorced.
Human Resources have a higher rate of attrition than Research & Development and Sales departments.

Model building: Part 1

We want to understand the most important factors that lead to employee attrition. For this we use logistic regression to uncover which factors are the most relevant. For this moment we do not want to predict. In the following part Model building: Part 2 we split the data into training and test and predict the outcomes based on the best classification algorithm either logistic regression, decision trees or random forest.

# Based on the findings in EDA we compute the first model based relevant features.

mod_a <- glm(formula = Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
               Gender+JobLevel+MaritalStatus+MonthlyIncome+NumCompaniesWorked+YearsAtCompany+
               JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
               AvrWorkHrs, data = general_data_merged, family = 'binomial')

summary(mod_a)

## 
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + Department + 
##     DistanceFromHome + Education + Gender + JobLevel + MaritalStatus + 
##     MonthlyIncome + NumCompaniesWorked + YearsAtCompany + JobInvolvement + 
##     PerformanceRating + EnvironmentSatisfaction + JobSatisfaction + 
##     WorkLifeBalance + AvrWorkHrs, family = "binomial", data = general_data_merged)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8051  -0.5777  -0.3877  -0.2272   3.3738  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -9.012e-01  6.653e-01  -1.355  0.17557    
## Age                              -5.419e-02  6.159e-03  -8.798  < 2e-16 ***
## BusinessTravelTravel_Frequently   1.375e+00  2.078e-01   6.616 3.69e-11 ***
## BusinessTravelTravel_Rarely       6.252e-01  1.939e-01   3.224  0.00126 ** 
## DepartmentResearch & Development -9.342e-01  1.916e-01  -4.877 1.08e-06 ***
## DepartmentSales                  -9.463e-01  2.007e-01  -4.716 2.41e-06 ***
## DistanceFromHome                 -1.601e-04  5.785e-03  -0.028  0.97791    
## Education                        -7.715e-02  4.483e-02  -1.721  0.08528 .  
## GenderMale                        9.117e-02  9.512e-02   0.958  0.33787    
## JobLevel                         -6.167e-02  4.207e-02  -1.466  0.14269    
## MaritalStatusMarried              3.024e-01  1.375e-01   2.199  0.02784 *  
## MaritalStatusSingle               1.122e+00  1.362e-01   8.239  < 2e-16 ***
## MonthlyIncome                    -1.076e-06  1.011e-06  -1.064  0.28711    
## NumCompaniesWorked                1.145e-01  1.929e-02   5.936 2.92e-09 ***
## YearsAtCompany                   -4.310e-02  1.018e-02  -4.233 2.30e-05 ***
## JobInvolvement                   -4.869e-02  6.326e-02  -0.770  0.44146    
## PerformanceRating                 7.858e-02  1.241e-01   0.633  0.52655    
## EnvironmentSatisfaction          -3.288e-01  4.149e-02  -7.924 2.30e-15 ***
## JobSatisfaction                  -3.110e-01  4.120e-02  -7.548 4.42e-14 ***
## WorkLifeBalance                  -2.872e-01  6.312e-02  -4.550 5.36e-06 ***
## AvrWorkHrs                        4.296e-01  3.274e-02  13.124  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3804.3  on 4299  degrees of freedom
## Residual deviance: 3127.5  on 4279  degrees of freedom
## AIC: 3169.5
## 
## Number of Fisher Scoring iterations: 5

vif(mod = mod_a)

##                             GVIF Df GVIF^(1/(2*Df))
## Age                     1.333480  1        1.154764
## BusinessTravel          1.052222  2        1.012807
## Department              1.057845  2        1.014158
## DistanceFromHome        1.017167  1        1.008547
## Education               1.013869  1        1.006911
## Gender                  1.032351  1        1.016047
## JobLevel                1.021485  1        1.010686
## MaritalStatus           1.050845  2        1.012476
## MonthlyIncome           1.024963  1        1.012405
## NumCompaniesWorked      1.228627  1        1.108434
## YearsAtCompany          1.141766  1        1.068534
## JobInvolvement          1.011986  1        1.005975
## PerformanceRating       1.027053  1        1.013436
## EnvironmentSatisfaction 1.030300  1        1.015037
## JobSatisfaction         1.024502  1        1.012177
## WorkLifeBalance         1.019058  1        1.009484
## AvrWorkHrs              1.047025  1        1.023242

Interpretation:

Coefficients: The coefficients evaluate the effect against ‘Yes’ i.e an employee leaving the company. Based on this we can say the following about the statistical significant features:

Age(-): All things being equal, the older the people are, the less likely to leave the company.
BusinessTravel(+): All things being equal, people that travel frequently or rarely are more likely to leave the company than non-travel people.
Department(-): All things being equal, people that work in Research and Sales are less likely to leave the company.
MaritalStatus(+): All things being equal, people that are married or single are more likely to leave the company.
NumCompaniesWorked(+): All things being equal, people who have worked with more companies are more likely to leave the company.
YearsAtCompany(-): All things being equal, people with more years at the company are less likely to leave.
EnvironmentSatisfaction(-): All things being equal, people with more environment satisfaction are less likely to leave.
JobSatisfaction(-): All things being equal, people with more job satisfaction are less likely to leave.
WorkLifeBalance(-): All things being equal, people with more work-life balance are less likely to leave.
AvrWorkHrs(+): All things being equal, people with more hours worked are more likely to leave.

Overall

From these results, we can say that they’re are align with the findings from the EDA.
The Variance Inflation Factor(VIF) tells us that there are no problems with multicollinearity.
We could also improve the first model by droping some variables based on the Akaike Information Criteria (AIC) using the step wise algorithm.

mod_b <- stepAIC(mod_a, direction = 'both', trace = FALSE)

summary(mod_b)

## 
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + Department + 
##     Education + JobLevel + MaritalStatus + NumCompaniesWorked + 
##     YearsAtCompany + EnvironmentSatisfaction + JobSatisfaction + 
##     WorkLifeBalance + AvrWorkHrs, family = "binomial", data = general_data_merged)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8264  -0.5769  -0.3926  -0.2289   3.3994  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -0.783706   0.500468  -1.566  0.11736    
## Age                              -0.054558   0.006133  -8.896  < 2e-16 ***
## BusinessTravelTravel_Frequently   1.371882   0.207092   6.625 3.48e-11 ***
## BusinessTravelTravel_Rarely       0.621613   0.193498   3.213  0.00132 ** 
## DepartmentResearch & Development -0.942783   0.190989  -4.936 7.96e-07 ***
## DepartmentSales                  -0.946018   0.200501  -4.718 2.38e-06 ***
## Education                        -0.079047   0.044740  -1.767  0.07726 .  
## JobLevel                         -0.063872   0.041955  -1.522  0.12791    
## MaritalStatusMarried              0.305664   0.137073   2.230  0.02575 *  
## MaritalStatusSingle               1.129215   0.135828   8.314  < 2e-16 ***
## NumCompaniesWorked                0.114091   0.019217   5.937 2.90e-09 ***
## YearsAtCompany                   -0.043178   0.010177  -4.242 2.21e-05 ***
## EnvironmentSatisfaction          -0.330640   0.041464  -7.974 1.53e-15 ***
## JobSatisfaction                  -0.309223   0.041121  -7.520 5.48e-14 ***
## WorkLifeBalance                  -0.288729   0.063040  -4.580 4.65e-06 ***
## AvrWorkHrs                        0.431877   0.032630  13.236  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3804.3  on 4299  degrees of freedom
## Residual deviance: 3130.5  on 4284  degrees of freedom
## AIC: 3162.5
## 
## Number of Fisher Scoring iterations: 5

vif(mod = mod_b)

##                             GVIF Df GVIF^(1/(2*Df))
## Age                     1.320872  1        1.149292
## BusinessTravel          1.042031  2        1.010346
## Department              1.042443  2        1.010446
## Education               1.011867  1        1.005916
## JobLevel                1.017562  1        1.008743
## MaritalStatus           1.041215  2        1.010148
## NumCompaniesWorked      1.220371  1        1.104704
## YearsAtCompany          1.140375  1        1.067884
## EnvironmentSatisfaction 1.028594  1        1.014196
## JobSatisfaction         1.021555  1        1.010720
## WorkLifeBalance         1.017083  1        1.008505
## AvrWorkHrs              1.043959  1        1.021743

Interpretation:

In this case our model has reduced its features from 20 to 15.
The effects are the same as the prior model but now the new model has a lower AIC so in those terms this is a simpler and better model than the previous one.
There is also no problem with multicollinearity.

Model evaluation:

To evaluate our model we create a graph comparing the predicted probabilities against actual attrition.

predicted_data <- data.frame(prob_of_attrition = mod_b$fitted.values, attrition = general_data_merged$Attrition)

predicted_data <- predicted_data %>%
  arrange(prob_of_attrition) %>%
  mutate(rank = 1:nrow(predicted_data))

ggplot(data = predicted_data, aes(x = rank, y = prob_of_attrition))+
  geom_point(aes(color = attrition), alpha = 1, shape = 4, stroke = 2)+
  labs(y = 'Predicted probability of attrition', x = 'Index', title = 'Probability of attrition vs Actual attrition')

Graph Interpretation:

As we see our model is making a good job identifying attrition. The salmon color represents the people with low probability of attrition and did not leave the company while the turquoise represents the people that had a high probability of leaving and actually left.
However this model can’t be used to predict anything since we are using all the data. We need to see how our model works with unseen data.

Model building: Part 2

Now instead of just uncover the factors of attrition we want to make predictions on which employee will leave based on several characteristcs gather from the company. The steps we take are the following:

We split the data into training and test datasets.
We build three models: Logistic regression, decision trees and random forest and compare their results.
Select the best model

Model 1: Logistic Regression

# Setting seed for reproducibilty

set.seed(123)

# Data split

training.sample <- general_data_merged$Attrition %>%
  createDataPartition(p = 0.8, list = FALSE)

train.data <- general_data_merged[training.sample,]
test.data <- general_data_merged[-training.sample,]

mod_c <- glm(formula = Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
               Gender+JobLevel+MaritalStatus+log(MonthlyIncome)+NumCompaniesWorked+YearsAtCompany+
               JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
               AvrWorkHrs, data = train.data, family = 'binomial')

mod_c_StepWise <- stepAIC(mod_c, direction = 'both', trace = FALSE)

summary(mod_c_StepWise)

## 
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + Department + 
##     Education + MaritalStatus + NumCompaniesWorked + YearsAtCompany + 
##     JobInvolvement + EnvironmentSatisfaction + JobSatisfaction + 
##     WorkLifeBalance + AvrWorkHrs, family = "binomial", data = train.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7597  -0.5811  -0.3855  -0.2290   3.3936  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -0.693881   0.592687  -1.171 0.241704    
## Age                              -0.052146   0.006853  -7.610 2.75e-14 ***
## BusinessTravelTravel_Frequently   1.475801   0.239529   6.161 7.22e-10 ***
## BusinessTravelTravel_Rarely       0.663749   0.225277   2.946 0.003215 ** 
## DepartmentResearch & Development -1.002461   0.212039  -4.728 2.27e-06 ***
## DepartmentSales                  -1.128969   0.224367  -5.032 4.86e-07 ***
## Education                        -0.106843   0.050013  -2.136 0.032655 *  
## MaritalStatusMarried              0.342032   0.154962   2.207 0.027301 *  
## MaritalStatusSingle               1.192341   0.153175   7.784 7.02e-15 ***
## NumCompaniesWorked                0.106587   0.021571   4.941 7.76e-07 ***
## YearsAtCompany                   -0.043955   0.011333  -3.879 0.000105 ***
## JobInvolvement                   -0.144800   0.070039  -2.067 0.038694 *  
## EnvironmentSatisfaction          -0.314071   0.046318  -6.781 1.20e-11 ***
## JobSatisfaction                  -0.293706   0.046437  -6.325 2.54e-10 ***
## WorkLifeBalance                  -0.273230   0.070445  -3.879 0.000105 ***
## AvrWorkHrs                        0.438169   0.036596  11.973  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3043.4  on 3439  degrees of freedom
## Residual deviance: 2497.2  on 3424  degrees of freedom
## AIC: 2529.2
## 
## Number of Fisher Scoring iterations: 5

p.training <- predict(mod_c_StepWise, train.data, type = 'response') # vector of probabilites from the training dataset

p.training.attrition <- as.factor(ifelse(p.training > 0.5, 'Yes', 'No')) # transforming those probabilites into factors (Yes or No)

# Obtaining the confusion matrix for this model in our training data

confusionMatrix(p.training.attrition, train.data$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  2825  441
##        Yes   59  115
##                                           
##                Accuracy : 0.8547          
##                  95% CI : (0.8424, 0.8663)
##     No Information Rate : 0.8384          
##     P-Value [Acc > NIR] : 0.004642        
##                                           
##                   Kappa : 0.2579          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9795          
##             Specificity : 0.2068          
##          Pos Pred Value : 0.8650          
##          Neg Pred Value : 0.6609          
##              Prevalence : 0.8384          
##          Detection Rate : 0.8212          
##    Detection Prevalence : 0.9494          
##       Balanced Accuracy : 0.5932          
##                                           
##        'Positive' Class : No              
##

# Testing our results in the test dataset with threshold = 0.5

p.test <- predict(mod_c_StepWise, test.data, type = 'response')
p.test.attrition <- as.factor(ifelse(p.test > 0.5, 'Yes','No'))
confusionMatrix(p.test.attrition, test.data$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  703 114
##        Yes  18  25
##                                         
##                Accuracy : 0.8465        
##                  95% CI : (0.8207, 0.87)
##     No Information Rate : 0.8384        
##     P-Value [Acc > NIR] : 0.2758        
##                                         
##                   Kappa : 0.2148        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.9750        
##             Specificity : 0.1799        
##          Pos Pred Value : 0.8605        
##          Neg Pred Value : 0.5814        
##              Prevalence : 0.8384        
##          Detection Rate : 0.8174        
##    Detection Prevalence : 0.9500        
##       Balanced Accuracy : 0.5774        
##                                         
##        'Positive' Class : No            
##

Interpretation

As we see, the results in our training dataset have an accuracy of 85% while in the test dataset gives us 84%. However if we take a look at the No Information Rate indicator this shows that based only in the distribution of our features we can predict correctly 83% of the time. This is, if we take a new data point and assign it to the most common set we will be correct 83% of the time.
Futhermore if we take a look at Sensitivity and Specificity we have values of 97% and 18% respectively. So if we want to predict people who might leave (Prediction = Yes and Reference = Yes) this is not a good model, since will only detect 18% of them (25/114+25).

Possible solutions

Selecting a different threshold: Right now our threshold is 0.5, we could reduce it to increase the specificity value.

For finding this new value we can plot a ROC curve to tell us which is the most appropiate value to identify the most true positives. By selecting a new treshold to improve specificity we are going to decrease our level of sensitivity because of the trade off between them.

Since the data is clearly unbalanced towards the people who stayed we can balanced it by using oversampling.

Solution 1: Selecting a different threshold

# Selecting a different treshold based on ROC curve

ROCR_prediction <- prediction(p.training, train.data$Attrition)
ROCR_performance <- performance(ROCR_prediction, 'tpr', 'fpr')

plot(ROCR_performance, colorize = TRUE, print.cutoffs.at = seq(0.1, by = 0.1))

# Based on this we select the new threshold to be 0.2 and test our model again.

# Evaluating in training dataset withe new threshold

p.training <- predict(mod_c_StepWise, train.data, type = 'response')
p.training.attrition <- as.factor(ifelse(p.training > 0.2, 'Yes', 'No'))
confusionMatrix(p.training.attrition, train.data$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  2271  181
##        Yes  613  375
##                                           
##                Accuracy : 0.7692          
##                  95% CI : (0.7547, 0.7832)
##     No Information Rate : 0.8384          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3516          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7874          
##             Specificity : 0.6745          
##          Pos Pred Value : 0.9262          
##          Neg Pred Value : 0.3796          
##              Prevalence : 0.8384          
##          Detection Rate : 0.6602          
##    Detection Prevalence : 0.7128          
##       Balanced Accuracy : 0.7310          
##                                           
##        'Positive' Class : No              
##

# Evaluating in test dataset

p.test <- predict(mod_c_StepWise, test.data, type = 'response')
p.test.attrition <-  as.factor(ifelse(p.test> 0.2, 'Yes', 'No'))

confusionMatrix(p.test.attrition, test.data$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  590  50
##        Yes 131  89
##                                           
##                Accuracy : 0.7895          
##                  95% CI : (0.7607, 0.8163)
##     No Information Rate : 0.8384          
##     P-Value [Acc > NIR] : 0.9999          
##                                           
##                   Kappa : 0.3713          
##                                           
##  Mcnemar's Test P-Value : 2.742e-09       
##                                           
##             Sensitivity : 0.8183          
##             Specificity : 0.6403          
##          Pos Pred Value : 0.9219          
##          Neg Pred Value : 0.4045          
##              Prevalence : 0.8384          
##          Detection Rate : 0.6860          
##    Detection Prevalence : 0.7442          
##       Balanced Accuracy : 0.7293          
##                                           
##        'Positive' Class : No              
##

Results:

With this new threshold our level of specificity increased from 20% to 67% in the training dataset and from 18% to 64% in the test dataset. So now our model is predicting better the people who might leave the company.

Solution 2: Oversampling

Since the data is unbalanced and we do not want to loose any information of the mayority class we oversample the minority class and try again with logistic regression.
We also use the stepwise algorithm to decrease the number of features in the model.

Training dataset

# Using oversampling in the training dataset
over <- ovun.sample(formula = Attrition ~., data = train.data, method = 'over')$data

table(over$Attrition) # now the data is balanced between No and Yes

## 
##   No  Yes 
## 2884 2846

# Building the model

balanced_model <- glm(formula = Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
               Gender+JobLevel+MaritalStatus+log(MonthlyIncome)+NumCompaniesWorked+YearsAtCompany+
               JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
               AvrWorkHrs, data = over, family = 'binomial')

# Model output

summary(balanced_model)

## 
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + Department + 
##     DistanceFromHome + Education + Gender + JobLevel + MaritalStatus + 
##     log(MonthlyIncome) + NumCompaniesWorked + YearsAtCompany + 
##     JobInvolvement + PerformanceRating + EnvironmentSatisfaction + 
##     JobSatisfaction + WorkLifeBalance + AvrWorkHrs, family = "binomial", 
##     data = over)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5145  -0.9066  -0.2923   0.9129   2.6345  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       1.418185   0.688563   2.060 0.039434 *  
## Age                              -0.047525   0.003914 -12.142  < 2e-16 ***
## BusinessTravelTravel_Frequently   1.475060   0.133064  11.085  < 2e-16 ***
## BusinessTravelTravel_Rarely       0.716333   0.121629   5.890 3.87e-09 ***
## DepartmentResearch & Development -1.338819   0.143217  -9.348  < 2e-16 ***
## DepartmentSales                  -1.338110   0.148093  -9.036  < 2e-16 ***
## DistanceFromHome                  0.001657   0.003903   0.425 0.671130    
## Education                        -0.099516   0.030506  -3.262 0.001106 ** 
## GenderMale                        0.054859   0.063564   0.863 0.388112    
## JobLevel                         -0.113311   0.029187  -3.882 0.000103 ***
## MaritalStatusMarried              0.395811   0.087825   4.507 6.58e-06 ***
## MaritalStatusSingle               1.178699   0.089146  13.222  < 2e-16 ***
## log(MonthlyIncome)               -0.157270   0.047457  -3.314 0.000920 ***
## NumCompaniesWorked                0.117968   0.013161   8.963  < 2e-16 ***
## YearsAtCompany                   -0.030521   0.006126  -4.983 6.28e-07 ***
## JobInvolvement                   -0.060252   0.041674  -1.446 0.148236    
## PerformanceRating                 0.339953   0.083488   4.072 4.66e-05 ***
## EnvironmentSatisfaction          -0.289308   0.027411 -10.554  < 2e-16 ***
## JobSatisfaction                  -0.309423   0.028510 -10.853  < 2e-16 ***
## WorkLifeBalance                  -0.275408   0.041185  -6.687 2.28e-11 ***
## AvrWorkHrs                        0.435745   0.022971  18.969  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7943.2  on 5729  degrees of freedom
## Residual deviance: 6329.9  on 5709  degrees of freedom
## AIC: 6371.9
## 
## Number of Fisher Scoring iterations: 4

# Step wise

balanced_model_step_wise <- stepAIC(balanced_model, direction = 'both', trace = FALSE)

# Step wise model output

summary(balanced_model_step_wise)

## 
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + Department + 
##     Education + JobLevel + MaritalStatus + log(MonthlyIncome) + 
##     NumCompaniesWorked + YearsAtCompany + JobInvolvement + PerformanceRating + 
##     EnvironmentSatisfaction + JobSatisfaction + WorkLifeBalance + 
##     AvrWorkHrs, family = "binomial", data = over)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5261  -0.9013  -0.2901   0.9158   2.6209  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       1.465011   0.685749   2.136 0.032650 *  
## Age                              -0.047468   0.003913 -12.131  < 2e-16 ***
## BusinessTravelTravel_Frequently   1.467963   0.132684  11.064  < 2e-16 ***
## BusinessTravelTravel_Rarely       0.710717   0.121304   5.859 4.66e-09 ***
## DepartmentResearch & Development -1.339945   0.143265  -9.353  < 2e-16 ***
## DepartmentSales                  -1.337897   0.148122  -9.032  < 2e-16 ***
## Education                        -0.098041   0.030459  -3.219 0.001287 ** 
## JobLevel                         -0.114806   0.029099  -3.945 7.97e-05 ***
## MaritalStatusMarried              0.392157   0.087733   4.470 7.83e-06 ***
## MaritalStatusSingle               1.174886   0.089040  13.195  < 2e-16 ***
## log(MonthlyIncome)               -0.157297   0.047408  -3.318 0.000907 ***
## NumCompaniesWorked                0.116839   0.013099   8.920  < 2e-16 ***
## YearsAtCompany                   -0.030522   0.006113  -4.993 5.93e-07 ***
## JobInvolvement                   -0.060258   0.041671  -1.446 0.148165    
## PerformanceRating                 0.342283   0.083315   4.108 3.99e-05 ***
## EnvironmentSatisfaction          -0.289354   0.027403 -10.559  < 2e-16 ***
## JobSatisfaction                  -0.310816   0.028469 -10.918  < 2e-16 ***
## WorkLifeBalance                  -0.275165   0.041176  -6.683 2.35e-11 ***
## AvrWorkHrs                        0.436605   0.022942  19.031  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7943.2  on 5729  degrees of freedom
## Residual deviance: 6330.8  on 5711  degrees of freedom
## AIC: 6368.8
## 
## Number of Fisher Scoring iterations: 4

p.training <- predict(balanced_model_step_wise, train.data, type = 'response')

p.training.attrition <- as.factor(ifelse(p.training > 0.5, 'Yes', 'No'))

confusionMatrix(p.training.attrition, train.data$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  2110  153
##        Yes  774  403
##                                           
##                Accuracy : 0.7305          
##                  95% CI : (0.7154, 0.7453)
##     No Information Rate : 0.8384          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3146          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7316          
##             Specificity : 0.7248          
##          Pos Pred Value : 0.9324          
##          Neg Pred Value : 0.3424          
##              Prevalence : 0.8384          
##          Detection Rate : 0.6134          
##    Detection Prevalence : 0.6578          
##       Balanced Accuracy : 0.7282          
##                                           
##        'Positive' Class : No              
##

Result: As we can see our model has improved by doing oversampling. Also setting the threshold to 0.5 does a good job identifying which employee might leave the company in our training dataset. Now we test the result in our testing dataset

Testing dataset

p.test <- predict(balanced_model_step_wise, test.data, type = 'response')

p.test.attrition <- as.factor(ifelse(p.test > 0.5, 'Yes', 'No'))

confusionMatrix(p.test.attrition, test.data$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  545  46
##        Yes 176  93
##                                           
##                Accuracy : 0.7419          
##                  95% CI : (0.7112, 0.7708)
##     No Information Rate : 0.8384          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3085          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7559          
##             Specificity : 0.6691          
##          Pos Pred Value : 0.9222          
##          Neg Pred Value : 0.3457          
##              Prevalence : 0.8384          
##          Detection Rate : 0.6337          
##    Detection Prevalence : 0.6872          
##       Balanced Accuracy : 0.7125          
##                                           
##        'Positive' Class : No              
##

Result: Now the speceificity has increased to 66% in our testing set, this is result is almost the same to the result we obtained previously when we set our threshold to be 02.

Model 2: Decision Tree

For the second model we create a decision tree to see if our results can improve.

# Setting seed for reproducibilty

set.seed(123)

# Data split

training.sample <- general_data_merged$Attrition %>%
  createDataPartition(p = 0.8, list = FALSE)

train.data <- general_data_merged[training.sample,]
test.data <- general_data_merged[-training.sample,]

# Decision tree model

model.tree <- rpart::rpart(formula = Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
                             Gender+JobLevel+MaritalStatus+MonthlyIncome+NumCompaniesWorked+YearsAtCompany+
                             JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
                             AvrWorkHrs, data = train.data, method = 'class')

# Plotting decision tree
rpart.plot::prp(model.tree)

predict_tree <- predict(model.tree, test.data, type = 'class')

confusionMatrix(predict_tree, test.data$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  708  96
##        Yes  13  43
##                                           
##                Accuracy : 0.8733          
##                  95% CI : (0.8492, 0.8948)
##     No Information Rate : 0.8384          
##     P-Value [Acc > NIR] : 0.002491        
##                                           
##                   Kappa : 0.3838          
##                                           
##  Mcnemar's Test P-Value : 4.024e-15       
##                                           
##             Sensitivity : 0.9820          
##             Specificity : 0.3094          
##          Pos Pred Value : 0.8806          
##          Neg Pred Value : 0.7679          
##              Prevalence : 0.8384          
##          Detection Rate : 0.8233          
##    Detection Prevalence : 0.9349          
##       Balanced Accuracy : 0.6457          
##                                           
##        'Positive' Class : No              
##

Findings

Just comparing the result of the first logistic regression(before doing oversampling) we can see that in terms of specificity the decision tree does a better job identifying attrition (30% vs 27%).
However when the data is balanced by oversampling we see that logistic regression is a better model to predict attrition (66% vs 30%).

As we did in logistic regression we balanced the data by doing oversampling and test the decision tree again

Decision tree with oversampling

over <- ovun.sample(formula = Attrition ~., data = train.data, method = 'over')$data

model.tree <- rpart::rpart(formula = Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
                             Gender+JobLevel+MaritalStatus+MonthlyIncome+NumCompaniesWorked+YearsAtCompany+
                             JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
                             AvrWorkHrs, data = over, method = 'class')

# Plotting decision tree
rpart.plot::prp(model.tree)

predict_tree <- predict(model.tree, test.data, type = 'class')

confusionMatrix(predict_tree, test.data$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  584  51
##        Yes 137  88
##                                           
##                Accuracy : 0.7814          
##                  95% CI : (0.7523, 0.8086)
##     No Information Rate : 0.8384          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3545          
##                                           
##  Mcnemar's Test P-Value : 5.673e-10       
##                                           
##             Sensitivity : 0.8100          
##             Specificity : 0.6331          
##          Pos Pred Value : 0.9197          
##          Neg Pred Value : 0.3911          
##              Prevalence : 0.8384          
##          Detection Rate : 0.6791          
##    Detection Prevalence : 0.7384          
##       Balanced Accuracy : 0.7215          
##                                           
##        'Positive' Class : No              
##

Findings

With balanced data the model has improved from 30% to 63% in our test dataset.
Overall oversampling the data has led to better results in logistic and decision tree models.

Model 3: Random Forest

A priori we know that Random Forest works well with inbalaced data in comparison with logistic and decision trees. We take the same parameters or features as before and feed them into the Random Forest model and take a look at the results.

set.seed(123)

partition <- createDataPartition(general_data_merged$Attrition, p = 0.8, list = FALSE)

training <- general_data_merged[partition,]
test <- general_data_merged[-partition,]

# Random forest with cross-validation
model_rf_1 <- randomForest::randomForest(Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
                             Gender+JobLevel+MaritalStatus+MonthlyIncome+NumCompaniesWorked+YearsAtCompany+
                             JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
                             AvrWorkHrs, data = training, trControl = trainControl(method = 'cv', 10))

# Printing the model
model_rf_1

## 
## Call:
##  randomForest(formula = Attrition ~ Age + BusinessTravel + Department +      DistanceFromHome + Education + Gender + JobLevel + MaritalStatus +      MonthlyIncome + NumCompaniesWorked + YearsAtCompany + JobInvolvement +      PerformanceRating + EnvironmentSatisfaction + JobSatisfaction +      WorkLifeBalance + AvrWorkHrs, data = training, trControl = trainControl(method = "cv",      10)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 0.81%
## Confusion matrix:
##       No Yes class.error
## No  2880   4 0.001386963
## Yes   24 532 0.043165468

# Evaluation of the model with useen data
p <- predict(model_rf_1, test, type = 'response')

confusionMatrix(p, test$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  721   6
##        Yes   0 133
##                                           
##                Accuracy : 0.993           
##                  95% CI : (0.9849, 0.9974)
##     No Information Rate : 0.8384          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9738          
##                                           
##  Mcnemar's Test P-Value : 0.04123         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9568          
##          Pos Pred Value : 0.9917          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.8384          
##          Detection Rate : 0.8384          
##    Detection Prevalence : 0.8453          
##       Balanced Accuracy : 0.9784          
##                                           
##        'Positive' Class : No              
##

Findings

Random Forest is doing a pretty good job predicting attrition since it only misclasfied 6 out of 139 observations reaching a level of specificty of 95%.
The number of trees for this model is set by default to be 500.
We might want to decrease the number of trees based on the error rate. If we decrease the number of trees the time in computing the model should decrease as well.

# Plotting error rate
plot(model_rf_1, main = 'Error rate vs number of trees')

Based on this 100 trees should be fine.
We can also defined the mtry(No. of variables tried at each split) to improve our model.

# Plotting mtry
t <- randomForest::tuneRF(training[,-2], training[,2], stepFactor = 0.5, plot = TRUE, ntreeTry = 100, trace = TRUE, improve = 0.05)

## mtry = 5  OOB error = 0.84% 
## Searching left ...
## mtry = 10    OOB error = 0.87% 
## -0.03448276 0.05 
## Searching right ...
## mtry = 2     OOB error = 0.78% 
## 0.06896552 0.05 
## mtry = 1     OOB error = 8.31% 
## -9.592593 0.05

With our default model we had an OOB of 0.81% and based on the graph we can leave it there because we do not think there is going to be a significance improvement with the data we have.
Since we found the 100 trees should be fine instead of 500, we build our model with that amount of trees.

model_rf_2 <- randomForest::randomForest(Attrition ~ Age+BusinessTravel+Department+DistanceFromHome+Education+
                                           Gender+JobLevel+MaritalStatus+MonthlyIncome+NumCompaniesWorked+YearsAtCompany+
                                           JobInvolvement+PerformanceRating+EnvironmentSatisfaction+JobSatisfaction+WorkLifeBalance+
                                           AvrWorkHrs, data = training, ntree = 100, trControl = trainControl(method = 'cv', 10))

model_rf_2

## 
## Call:
##  randomForest(formula = Attrition ~ Age + BusinessTravel + Department +      DistanceFromHome + Education + Gender + JobLevel + MaritalStatus +      MonthlyIncome + NumCompaniesWorked + YearsAtCompany + JobInvolvement +      PerformanceRating + EnvironmentSatisfaction + JobSatisfaction +      WorkLifeBalance + AvrWorkHrs, data = training, ntree = 100,      trControl = trainControl(method = "cv", 10)) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 0.84%
## Confusion matrix:
##       No Yes class.error
## No  2878   6 0.002080444
## Yes   23 533 0.041366906

p <- predict(model_rf_2, test, type = 'response')

confusionMatrix(p, test$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  721   6
##        Yes   0 133
##                                           
##                Accuracy : 0.993           
##                  95% CI : (0.9849, 0.9974)
##     No Information Rate : 0.8384          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9738          
##                                           
##  Mcnemar's Test P-Value : 0.04123         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9568          
##          Pos Pred Value : 0.9917          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.8384          
##          Detection Rate : 0.8384          
##    Detection Prevalence : 0.8453          
##       Balanced Accuracy : 0.9784          
##                                           
##        'Positive' Class : No              
##

We also can take a look at the number of nodes by each tree

hist(randomForest::treesize(model_rf_2), main = 'No. of nodes', xlab = '', col = 'forestgreen')

The range of nodes goes from 250 to 300 for our 100 trees.
Also we can identify which set of features are the most important.

# Plotting features based on importance
randomForest::varImpPlot(model_rf_2, main = 'Feature Importance')

Top 5 features

Average working hours
Age
Years at the company
Monthly Income
Distance from home

Partial dependency plot

randomForest::partialPlot(model_rf_2, training, AvrWorkHrs, 'Yes')

# When the Average working hours is more than 7 it tends to predict Attrition ('Yes')

randomForest::partialPlot(model_rf_2, training, Age, 'Yes')

# In this case the model is confused. When is more than 40 it tends to predict 'Yes' but also when the range is less than 40. This could be because of the outliers.

Conclusions

First we construct a logistic regression to identify the most relevant features for attrition. We found that Age(-), BusinessTravel(+), Department(-), MaritalStatus(+), NumCompaniesWorked(+), YearsAtCompany(-), EnvironmentSatisfaction(-), JobSatisfaction(-), WorkLifeBalance(-) and AvrWorkHrs(+) are statistically significant. This model was using all the data since at first we were not interested in prediction.
When we changed the problem to predict employee attrition we used 3 different models: Logistic regression, decision trees and random forest to test which of those model performed better with the data we had.
First we build those model with the data as it is and changing the threshold level to increase specificity. After that we did some manipulation to balanced the data by oversampling the minority class. With this data manipulation the results were better for both the logistic regression and decision trees in terms of specificity (i.e attrition = ‘Yes’).
For the random forest model we did not use oversampling since that model usually works better with unbalanced data. The results that we got outperformed the previous two models, reaching a level of specificity of 95% in the test data set.
We also modify our random forest model by reducing the number of trees needed to construct the forest from 500 (default) to 100 based on the error rate for each tree without compromising our results. We decided not change the number of variables for each split since it did not add any value.
Based on the results of the random forest analysis we can conclude that the top 5 features are: Average working hours, Age, Years at the company, Monthly Income and Distance from home. Those features could be important for the manager in order to reduce the level of attrition among employees.
For this particular dataset, we conclude that the best model to use in order to predict attrition is random forest.

References

Kaggle Notebook
Statquest
Dealing with unbalanced data
Lectures from professor Bharatendra Rai in Handling Class Imbalance Problem, ROC Curve and AUC and Random Forest

HR Analytics: Classification

Omar Juarez

Project Overview

Reading data and basic exploration

Data treatment

Exploratory Data Analysis (EDA)

Correlation plot

Pivot tables

Bar and box plots

First part

Second part

Model building: Part 1

Model building: Part 2

Model 1: Logistic Regression

Model 2: Decision Tree

Model 3: Random Forest

Conclusions

References