This project aims to predict attrition of employee based on data sourced from HR department. This data includes columns for Age, Attrition, Department, Gender, Salary, Education Level, Role, Hours Worked among others. The target variable here is the Attrition column with values Yes or No. The other 34 variables are used in predicting the target using a classification machine learning model. Performance of the model will also be evaluated using the McFadden R2 and confusion matrix.
Lastly, factor importance will also be determined to ascertain the most important features driving staff to leave or stay with the company. This will enable HR can formulate appropriate staff retention strategies.
Libraries used for this project includes caret for Machine Learning classification models Tidyverse for manipulating dataframes Corrplot for viewing correlation among the numerical features pscl is used for calculating McFadden’s R2 to evaluate model’s fit Random Forest for determining feature importance
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library (tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
library (corrplot)
## corrplot 0.92 loaded
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
#loading dataset
HR_data <- read.csv("Employee Attrition.csv", header = TRUE, stringsAsFactors = FALSE)
head(HR_data)
## Age Attrition BusinessTravel DailyRate Department
## 1 41 Yes Travel_Rarely 1102 Sales
## 2 49 No Travel_Frequently 279 Research & Development
## 3 37 Yes Travel_Rarely 1373 Research & Development
## 4 33 No Travel_Frequently 1392 Research & Development
## 5 27 No Travel_Rarely 591 Research & Development
## 6 32 No Travel_Frequently 1005 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 1 2 Life Sciences 1 1
## 2 8 1 Life Sciences 1 2
## 3 2 2 Other 1 4
## 4 3 4 Life Sciences 1 5
## 5 2 1 Medical 1 7
## 6 2 2 Life Sciences 1 8
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Female 94 3 2
## 2 3 Male 61 2 2
## 3 4 Male 92 2 1
## 4 4 Female 56 3 1
## 5 1 Male 40 3 1
## 6 4 Male 79 3 1
## JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1 Sales Executive 4 Single 5993 19479
## 2 Research Scientist 2 Married 5130 24907
## 3 Laboratory Technician 3 Single 2090 2396
## 4 Research Scientist 3 Married 2909 23159
## 5 Laboratory Technician 2 Married 3468 16632
## 6 Laboratory Technician 4 Single 3068 11864
## NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1 8 Y Yes 11 3
## 2 1 Y No 23 4
## 3 6 Y Yes 15 3
## 4 1 Y Yes 11 3
## 5 9 Y No 12 3
## 6 0 Y No 13 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1 1 80 0 8
## 2 4 80 1 10
## 3 2 80 0 7
## 4 3 80 0 8
## 5 4 80 1 6
## 6 3 80 0 8
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 0 1 6 4
## 2 3 3 10 7
## 3 3 3 0 0
## 4 3 3 8 7
## 5 3 3 2 2
## 6 2 2 7 7
## YearsSinceLastPromotion YearsWithCurrManager
## 1 0 5
## 2 1 7
## 3 0 0
## 4 3 0
## 5 2 2
## 6 3 6
str(HR_data)
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
#Attrition rate at the company
p1 <- HR_data %>% dplyr::group_by(Attrition) %>% dplyr::summarise(cnt = n()) %>%
dplyr::mutate(freq = (cnt / sum(cnt))*100) %>%
ggplot(aes(x = Attrition, y = freq, fill = Attrition)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(freq,0), "%")), position = position_stack(vjust = 0.5), size = 2.5) +
scale_y_continuous(labels = function(x) paste0(x, "%")) +labs(title = "Attrition", x = "Attrition", y ="Percentage")
p2 <- HR_data %>% dplyr::group_by(Department, Attrition) %>% dplyr::summarise(cnt = n()) %>%
dplyr::mutate(freq = (cnt / sum(cnt))*100) %>% ggplot(aes(x = Department, y = freq, fill = Attrition)) +
geom_bar(position = position_stack(), stat = "identity", width = .6) +
geom_text(aes(label = paste0(round(freq,0), "%")), position = position_stack(vjust = 0.5), size = 2.5) +
scale_y_continuous(labels = function(x) paste0(x, "%")) +
labs(title = "Attrition by Department", x = "Department", y = "Percentage")
## `summarise()` has grouped output by 'Department'. You can override using the
## `.groups` argument.
grid.arrange(p1, p2, nrow = 2, ncol = 1)
Comments - General attrition rate across the company is
16%. Sales (21%) and HR(19%) have a higher attrition than the company
average while R&D department is slightly lower at 14%.
#Plotting data to view distr of Job Roles
HR_data %>%
dplyr::group_by(JobRole, Attrition) %>%
dplyr::summarise(cnt = n()) %>%
dplyr::mutate(freq = (cnt / sum(cnt))*100) %>%
ggplot(aes(x = JobRole, y = freq, fill = Attrition)) +
geom_bar(position = position_stack(), stat = "identity", width = .7) +
geom_text(aes(label = paste0(round(freq,0), "%")),
position = position_stack(vjust = 0.5), size = 3) +
scale_y_continuous(labels = function(x) paste0(x, "%")) +
labs(title = "Job Role and Attrition", x = "Job Role", y = "Percentage") +
theme(axis.text.x = element_text(angle = 20, hjust = 0.5))
## `summarise()` has grouped output by 'JobRole'. You can override using the
## `.groups` argument.
Comments - A further dive into the Attrition numbers
shows by job roles shows HR Representatives (23%), Laboratory
Technicians (24%) and Sales Representatives (40%) are the key drivers of
attrition at the company.
p1 <- ggplot(HR_data) + geom_histogram(aes(Age), binwidth = 5, fill = "purple",col = "black")
p2 <- ggplot(HR_data) + geom_histogram(aes(DistanceFromHome), binwidth = 5, fill = "purple",col = "black")
p3 <- ggplot(HR_data) + geom_histogram(aes(NumCompaniesWorked), binwidth = 2, fill = "purple",col = "black")
p4 <- ggplot(HR_data) + geom_histogram(aes(YearsAtCompany ), binwidth = 3, fill = "purple",col = "black")
p5 <- ggplot(HR_data) + geom_histogram(aes(MonthlyIncome), binwidth = 1000, fill = "purple",col = "black")
p6 <- ggplot(HR_data) + geom_histogram (aes(PercentSalaryHike), binwidth = 1, fill = "purple",col = "black")
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)
Comments - The Age feature is close to being normally
distrusted with most employees within the ages of 30 and 40. The other 5
numerical features are skewed to the right. Features will be normalized
in subsequent sections
p1 <- HR_data %>%
ggplot(aes(x = Age, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Age") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p2 <- HR_data %>%
ggplot(aes(x = DistanceFromHome, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Distance From Home") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p3 <- HR_data %>%
ggplot(aes(x = NumCompaniesWorked, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Number of Companies Worked") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p4 <- HR_data %>%
ggplot(aes(x = YearsAtCompany, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Years at Company") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p5 <- HR_data %>%
ggplot(aes(x = MonthlyIncome, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Monthly Income") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p6 <- HR_data %>%
ggplot(aes(x = PercentSalaryHike, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Percent Salary Hike") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
grid.arrange(p1, p2, p3, p4, p5, p6 , nrow = 3, ncol = 2)
Comments - The Bivariate analysis applies EDA using 2
variables. In this case 6 of the numerical variables used earlier and
the target variable (Attrition).
From the plots above, it can be seen that attritition is highest between the ages of 20-30 and also among staff that leave more than 10kms from work. In terms of salaries, staff that earn less than 5000 per month while staff that have worked for 5 or move companies have an higher attrition rate.
p1<- HR_data %>%
group_by(Gender) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(Gender), y = counts)) + geom_bar(stat = 'identity', fill = "darkolivegreen3") + ggtitle("Gender") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 900))
p2<- HR_data %>%
group_by(Education) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(Education), y = counts)) + geom_bar(stat = 'identity', fill = "darkolivegreen3") + ggtitle("Education") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 650))
p3 <- HR_data %>%
group_by(EducationField) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(EducationField), y = counts)) + geom_bar(stat = 'identity', fill = "darkolivegreen3") + ggtitle("Education Field") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 650))
p4 <- HR_data %>%
group_by(MaritalStatus) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(MaritalStatus), y = counts)) + geom_bar(stat = 'identity', fill = "darkolivegreen3")+ ggtitle("Marital Status") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 750))
p5 <- HR_data %>%
group_by(RelationshipSatisfaction) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(RelationshipSatisfaction), y = counts)) + geom_bar(stat = 'identity', fill = "darkolivegreen3") + ggtitle("Relationship Satisfaction") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())+ scale_y_continuous(limits = c(0, 500))
p6 <- HR_data %>%
group_by(WorkLifeBalance) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(WorkLifeBalance), y = counts)) + geom_bar(stat = 'identity', fill = "darkolivegreen3")+ ggtitle("Work Life Balance") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 950))
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)
Comments - The company employs more men (60%) than
women (40%). Also, most employees (73%) are either from the life
sciences and medical field. In terms of education qualifications, 50% of
staff have at least a college education.
On a personal level, 46% of staff are married with 60% having either high or very high relationship satisfaction. Lastly, the data shows a high level of work life balance with 95% of staff choosing good to best option on the survey
This stage involves making the data suitable for a machine-learning model. Operations performed include;
#checking for null values
sapply(HR_data, function(x) sum(is.na(x)))
## Age Attrition BusinessTravel
## 0 0 0
## DailyRate Department DistanceFromHome
## 0 0 0
## Education EducationField EmployeeCount
## 0 0 0
## EmployeeNumber EnvironmentSatisfaction Gender
## 0 0 0
## HourlyRate JobInvolvement JobLevel
## 0 0 0
## JobRole JobSatisfaction MaritalStatus
## 0 0 0
## MonthlyIncome MonthlyRate NumCompaniesWorked
## 0 0 0
## Over18 OverTime PercentSalaryHike
## 0 0 0
## PerformanceRating RelationshipSatisfaction StandardHours
## 0 0 0
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 0 0 0
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 0 0 0
## YearsSinceLastPromotion YearsWithCurrManager
## 0 0
# Removing Zero and Near Zero-Variance Predictors - feature with very few unique values
nzv <- nearZeroVar(HR_data)
nzcol <- colnames(HR_data)[nzv]
nzcol
## [1] "EmployeeCount" "Over18" "StandardHours"
#new df with redundant columns
HR_data1<- HR_data[, -nzv]
dim(HR_data1)
## [1] 1470 32
#Dropping other columns with little bearing with attrition or are better represented by other features
drop <- c("DailyRate", "EmployeeNumber","HourlyRate", "MonthlyRate" )
HR_data2 = HR_data1[,!(names(HR_data1) %in% drop)]
dim(HR_data2)
## [1] 1470 28
# numeric columns
num_cols <- unlist(lapply(HR_data2, is.numeric))
num_cols
## Age Attrition BusinessTravel
## TRUE FALSE FALSE
## Department DistanceFromHome Education
## FALSE TRUE TRUE
## EducationField EnvironmentSatisfaction Gender
## FALSE TRUE FALSE
## JobInvolvement JobLevel JobRole
## TRUE TRUE FALSE
## JobSatisfaction MaritalStatus MonthlyIncome
## TRUE FALSE TRUE
## NumCompaniesWorked OverTime PercentSalaryHike
## TRUE FALSE TRUE
## PerformanceRating RelationshipSatisfaction StockOptionLevel
## TRUE TRUE TRUE
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## TRUE TRUE TRUE
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## TRUE TRUE TRUE
## YearsWithCurrManager
## TRUE
HR_data_num <- HR_data2[ , num_cols]
dim(HR_data_num)
## [1] 1470 20
# non-numeric columns of data
char_cols <- unlist(lapply(HR_data2, is.character))
char_cols
## Age Attrition BusinessTravel
## FALSE TRUE TRUE
## Department DistanceFromHome Education
## TRUE FALSE FALSE
## EducationField EnvironmentSatisfaction Gender
## TRUE FALSE TRUE
## JobInvolvement JobLevel JobRole
## FALSE FALSE TRUE
## JobSatisfaction MaritalStatus MonthlyIncome
## FALSE TRUE FALSE
## NumCompaniesWorked OverTime PercentSalaryHike
## FALSE TRUE FALSE
## PerformanceRating RelationshipSatisfaction StockOptionLevel
## FALSE FALSE FALSE
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## FALSE FALSE FALSE
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## FALSE FALSE FALSE
## YearsWithCurrManager
## FALSE
# non-numeric columns of data
char_cols <- unlist(lapply(HR_data2, is.character))
char_cols
## Age Attrition BusinessTravel
## FALSE TRUE TRUE
## Department DistanceFromHome Education
## TRUE FALSE FALSE
## EducationField EnvironmentSatisfaction Gender
## TRUE FALSE TRUE
## JobInvolvement JobLevel JobRole
## FALSE FALSE TRUE
## JobSatisfaction MaritalStatus MonthlyIncome
## FALSE TRUE FALSE
## NumCompaniesWorked OverTime PercentSalaryHike
## FALSE TRUE FALSE
## PerformanceRating RelationshipSatisfaction StockOptionLevel
## FALSE FALSE FALSE
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## FALSE FALSE FALSE
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## FALSE FALSE FALSE
## YearsWithCurrManager
## FALSE
HR_data_char <- HR_data2[ , char_cols]
dim(HR_data_char)
## [1] 1470 8
corrplot(Cor, type="lower",method ="color", title = "Correlation Plot",
mar=c(0,1,1,1), tl.cex= 0.7, outline= T, tl.col= rgb(0, 0, 0))
#Setting correlation cutoff
highlyCorrelated <- findCorrelation(Cor, cutoff = 0.6)
highlyCorCol <- colnames(HR_data_num)[highlyCorrelated]
highlyCorCol
## [1] "TotalWorkingYears" "YearsAtCompany" "JobLevel"
## [4] "YearsInCurrentRole" "PercentSalaryHike"
HR_data_num1 <- HR_data_num[, -which(colnames(HR_data_num) %in% highlyCorCol)]
dim(HR_data_num1)
## [1] 1470 15
Outliers should be handled before building a statistical model as they reduce the fit and stability of the model. In order to avoid this, features are scaled using a technique called Standardization, which is a process of rescaling data so that the data have a mean of ‘0’ and standard deviation of ‘1’.
Viewing distribution of numerical features using a boxplot
boxplot(HR_data_num1)
Numerical features are standardized using the scale function
HR_data_num2 <- HR_data_num1 %>% mutate_all(~(scale(.) %>% as.vector))
head(HR_data_num2)
## Age DistanceFromHome Education EnvironmentSatisfaction
## 1 0.44619856 -1.0105654 -0.8913849 -0.6603060
## 2 1.32191535 -0.1470997 -1.8677901 0.2545383
## 3 0.00834016 -0.8872132 -0.8913849 1.1693826
## 4 -0.42951824 -0.7638609 1.0614255 1.1693826
## 5 -1.08630583 -0.8872132 -1.8677901 -1.5751502
## 6 -0.53898284 -0.8872132 -0.8913849 1.1693826
## JobInvolvement JobSatisfaction MonthlyIncome NumCompaniesWorked
## 1 0.379543 1.1528613 -0.1083127 2.1244130
## 2 -1.025818 -0.6606284 -0.2916193 -0.6778187
## 3 -1.025818 0.2461164 -0.9373347 1.3237753
## 4 0.379543 0.2461164 -0.7633739 -0.6778187
## 5 0.379543 -0.6606284 -0.6446387 2.5247318
## 6 0.379543 1.1528613 -0.7296013 -1.0781375
## PerformanceRating RelationshipSatisfaction StockOptionLevel
## 1 -0.426085 -1.5836393 -0.9316973
## 2 2.345353 1.1910327 0.2419060
## 3 -0.426085 -0.6587487 -0.9316973
## 4 -0.426085 0.2661420 -0.9316973
## 5 -0.426085 1.1910327 0.2419060
## 6 -0.426085 0.2661420 -0.9316973
## TrainingTimesLastYear WorkLifeBalance YearsSinceLastPromotion
## 1 -2.1712429 -2.4929720 -0.67891464
## 2 0.1556541 0.3379811 -0.36858985
## 3 0.1556541 0.3379811 -0.67891464
## 4 0.1556541 0.3379811 0.25205973
## 5 0.1556541 0.3379811 -0.05826506
## 6 -0.6199782 -1.0774954 0.25205973
## YearsWithCurrManager
## 1 0.2457504
## 2 0.8062671
## 3 -1.1555415
## 4 -1.1555415
## 5 -0.5950247
## 6 0.5260087
The chi-square test is a statistical test of independence to determine the dependency of two variables. It shares similarities with coefficient of determination, R². However, chi-square test is only applicable to categorical or nominal data while R² is only applicable to numeric data.
The chi-square statistics is calculated between every feature variable and the target variable. The null hypothesis for this test is the two variables are independent, and the alternative hypothesis is the variables are not independent. In order to reject the null hypothesis and keep variables in the model, the p-value of this test must have a p-value below .05
glimpse(HR_data_char)
## Rows: 1,470
## Columns: 8
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No",…
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel_Rarely", …
## $ Department <chr> "Sales", "Research & Development", "Research & Developm…
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "Life Scienc…
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "Male", "Fe…
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Laboratory Te…
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Married", "S…
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No…
chisq.test(HR_data_char$BusinessTravel, HR_data_char$Attrition)
##
## Pearson's Chi-squared test
##
## data: HR_data_char$BusinessTravel and HR_data_char$Attrition
## X-squared = 24.182, df = 2, p-value = 5.609e-06
chisq.test(HR_data_char$Department, HR_data_char$Attrition)
##
## Pearson's Chi-squared test
##
## data: HR_data_char$Department and HR_data_char$Attrition
## X-squared = 10.796, df = 2, p-value = 0.004526
chisq.test(HR_data_char$EducationField, HR_data_char$Attrition)
## Warning in chisq.test(HR_data_char$EducationField, HR_data_char$Attrition): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: HR_data_char$EducationField and HR_data_char$Attrition
## X-squared = 16.025, df = 5, p-value = 0.006774
chisq.test(HR_data_char$Gender, HR_data_char$Attrition)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: HR_data_char$Gender and HR_data_char$Attrition
## X-squared = 1.117, df = 1, p-value = 0.2906
chisq.test(HR_data_char$JobRole, HR_data_char$Attrition)
##
## Pearson's Chi-squared test
##
## data: HR_data_char$JobRole and HR_data_char$Attrition
## X-squared = 86.19, df = 8, p-value = 2.752e-15
chisq.test(HR_data_char$MaritalStatus, HR_data_char$Attrition)
##
## Pearson's Chi-squared test
##
## data: HR_data_char$MaritalStatus and HR_data_char$Attrition
## X-squared = 46.164, df = 2, p-value = 9.456e-11
chisq.test(HR_data_char$OverTime, HR_data_char$Attrition)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: HR_data_char$OverTime and HR_data_char$Attrition
## X-squared = 87.564, df = 1, p-value < 2.2e-16
From chi-square tests carried out, the gender features with be dropped because it has a p-value of > 0.05 i.e. 0.2906
#dropping gender column
HR_data_char = subset(HR_data_char, select = -c(Gender) )
head(HR_data_char)
## Attrition BusinessTravel Department EducationField
## 1 Yes Travel_Rarely Sales Life Sciences
## 2 No Travel_Frequently Research & Development Life Sciences
## 3 Yes Travel_Rarely Research & Development Other
## 4 No Travel_Frequently Research & Development Life Sciences
## 5 No Travel_Rarely Research & Development Medical
## 6 No Travel_Frequently Research & Development Life Sciences
## JobRole MaritalStatus OverTime
## 1 Sales Executive Single Yes
## 2 Research Scientist Married No
## 3 Laboratory Technician Single Yes
## 4 Research Scientist Married Yes
## 5 Laboratory Technician Married No
## 6 Laboratory Technician Single No
# Label encoding columns with 2 unique values
HR_data_char$Attrition[HR_data_char$Attrition == 'Yes'] <- 1
HR_data_char$Attrition[HR_data_char$Attrition == 'No'] <- 0
HR_data_char$OverTime[HR_data_char$OverTime == 'Yes'] <- 1
HR_data_char$OverTime[HR_data_char$OverTime == 'No'] <- 0
#converting columns to numeric
HR_data_char$Attrition <- as.numeric(HR_data_char$Attrition)
HR_data_char$OverTime <- as.numeric(HR_data_char$OverTime)
str(HR_data_char)
## 'data.frame': 1470 obs. of 7 variables:
## $ Attrition : num 1 0 1 0 0 0 0 0 0 0 ...
## $ BusinessTravel: chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ EducationField: chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ OverTime : num 1 0 1 1 0 0 1 0 0 0 ...
#one-hot encoding columns with more than 2 unique values
dummy <- dummyVars(" ~ .", data = HR_data_char)
HR_data_char1 <- data.frame(predict(dummy, newdata = HR_data_char))
str(HR_data_char1)
## 'data.frame': 1470 obs. of 26 variables:
## $ Attrition : num 1 0 1 0 0 0 0 0 0 0 ...
## $ BusinessTravelNon.Travel : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BusinessTravelTravel_Frequently : num 0 1 0 1 0 1 0 0 1 0 ...
## $ BusinessTravelTravel_Rarely : num 1 0 1 0 1 0 1 1 0 1 ...
## $ DepartmentHuman.Resources : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DepartmentResearch...Development: num 0 1 1 1 1 1 1 1 1 1 ...
## $ DepartmentSales : num 1 0 0 0 0 0 0 0 0 0 ...
## $ EducationFieldHuman.Resources : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EducationFieldLife.Sciences : num 1 1 0 1 0 1 0 1 1 0 ...
## $ EducationFieldMarketing : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EducationFieldMedical : num 0 0 0 0 1 0 1 0 0 1 ...
## $ EducationFieldOther : num 0 0 1 0 0 0 0 0 0 0 ...
## $ EducationFieldTechnical.Degree : num 0 0 0 0 0 0 0 0 0 0 ...
## $ JobRoleHealthcare.Representative: num 0 0 0 0 0 0 0 0 0 1 ...
## $ JobRoleHuman.Resources : num 0 0 0 0 0 0 0 0 0 0 ...
## $ JobRoleLaboratory.Technician : num 0 0 1 0 1 1 1 1 0 0 ...
## $ JobRoleManager : num 0 0 0 0 0 0 0 0 0 0 ...
## $ JobRoleManufacturing.Director : num 0 0 0 0 0 0 0 0 1 0 ...
## $ JobRoleResearch.Director : num 0 0 0 0 0 0 0 0 0 0 ...
## $ JobRoleResearch.Scientist : num 0 1 0 1 0 0 0 0 0 0 ...
## $ JobRoleSales.Executive : num 1 0 0 0 0 0 0 0 0 0 ...
## $ JobRoleSales.Representative : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MaritalStatusDivorced : num 0 0 0 0 0 0 0 1 0 0 ...
## $ MaritalStatusMarried : num 0 1 0 1 1 0 1 0 0 1 ...
## $ MaritalStatusSingle : num 1 0 1 0 0 1 0 0 1 0 ...
## $ OverTime : num 1 0 1 1 0 0 1 0 0 0 ...
#Binding categorical and numerical dfs to form new (complete) df
HR_data2 <- cbind(HR_data_char1, HR_data_num2)
glimpse(HR_data2)
## Rows: 1,470
## Columns: 41
## $ Attrition <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BusinessTravelNon.Travel <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BusinessTravelTravel_Frequently <dbl> 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0…
## $ BusinessTravelTravel_Rarely <dbl> 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1…
## $ DepartmentHuman.Resources <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DepartmentResearch...Development <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DepartmentSales <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EducationFieldHuman.Resources <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EducationFieldLife.Sciences <dbl> 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1…
## $ EducationFieldMarketing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EducationFieldMedical <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0…
## $ EducationFieldOther <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EducationFieldTechnical.Degree <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JobRoleHealthcare.Representative <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ JobRoleHuman.Resources <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JobRoleLaboratory.Technician <dbl> 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0…
## $ JobRoleManager <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JobRoleManufacturing.Director <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ JobRoleResearch.Director <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JobRoleResearch.Scientist <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ JobRoleSales.Executive <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JobRoleSales.Representative <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ MaritalStatusDivorced <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
## $ MaritalStatusMarried <dbl> 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0…
## $ MaritalStatusSingle <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0…
## $ OverTime <dbl> 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
## $ Age <dbl> 0.44619856, 1.32191535, 0.00834016, -…
## $ DistanceFromHome <dbl> -1.01056544, -0.14709966, -0.88721318…
## $ Education <dbl> -0.89138490, -1.86779013, -0.89138490…
## $ EnvironmentSatisfaction <dbl> -0.6603060, 0.2545383, 1.1693826, 1.1…
## $ JobInvolvement <dbl> 0.379543, -1.025818, -1.025818, 0.379…
## $ JobSatisfaction <dbl> 1.1528613, -0.6606284, 0.2461164, 0.2…
## $ MonthlyIncome <dbl> -0.1083127, -0.2916193, -0.9373347, -…
## $ NumCompaniesWorked <dbl> 2.1244130, -0.6778187, 1.3237753, -0.…
## $ PerformanceRating <dbl> -0.426085, 2.345353, -0.426085, -0.42…
## $ RelationshipSatisfaction <dbl> -1.5836393, 1.1910327, -0.6587487, 0.…
## $ StockOptionLevel <dbl> -0.9316973, 0.2419060, -0.9316973, -0…
## $ TrainingTimesLastYear <dbl> -2.1712429, 0.1556541, 0.1556541, 0.1…
## $ WorkLifeBalance <dbl> -2.4929720, 0.3379811, 0.3379811, 0.3…
## $ YearsSinceLastPromotion <dbl> -0.67891464, -0.36858985, -0.67891464…
## $ YearsWithCurrManager <dbl> 0.2457504, 0.8062671, -1.1555415, -1.…
# To achieve reproducible model; set the random seed number
set.seed(100)
# Data is split into training and test set in a 80:20 ratio
TrainingIndex <- createDataPartition(HR_data2$Attrition, p=0.8, list = FALSE)
TrainingSet <- HR_data2[TrainingIndex,] # Training Set
TestingSet <- HR_data2[-TrainingIndex,] # Test Set
#Model fitting
model <- glm(Attrition ~.,family=binomial(link='logit'),data = TrainingSet )
summary(model)
##
## Call:
## glm(formula = Attrition ~ ., family = binomial(link = "logit"),
## data = TrainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0356 -0.4613 -0.2312 -0.0766 3.5945
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.66241 0.51878 -1.277 0.20165
## BusinessTravelNon.Travel -0.86982 0.41824 -2.080 0.03755 *
## BusinessTravelTravel_Frequently 0.80515 0.24604 3.272 0.00107 **
## BusinessTravelTravel_Rarely NA NA NA NA
## DepartmentHuman.Resources -13.33593 682.09278 -0.020 0.98440
## DepartmentResearch...Development 1.72803 1.27577 1.355 0.17558
## DepartmentSales NA NA NA NA
## EducationFieldHuman.Resources 1.18525 1.04279 1.137 0.25570
## EducationFieldLife.Sciences -0.81368 0.34865 -2.334 0.01961 *
## EducationFieldMarketing -0.70680 0.46096 -1.533 0.12520
## EducationFieldMedical -1.02297 0.35848 -2.854 0.00432 **
## EducationFieldOther -1.29757 0.58295 -2.226 0.02602 *
## EducationFieldTechnical.Degree NA NA NA NA
## JobRoleHealthcare.Representative -4.40468 1.44292 -3.053 0.00227 **
## JobRoleHuman.Resources 12.25560 682.09251 0.018 0.98566
## JobRoleLaboratory.Technician -2.51441 1.32971 -1.891 0.05863 .
## JobRoleManager -2.40495 1.39399 -1.725 0.08449 .
## JobRoleManufacturing.Director -3.55433 1.42154 -2.500 0.01241 *
## JobRoleResearch.Director -5.36094 1.88490 -2.844 0.00445 **
## JobRoleResearch.Scientist -3.40236 1.33761 -2.544 0.01097 *
## JobRoleSales.Executive -1.07118 0.45032 -2.379 0.01737 *
## JobRoleSales.Representative NA NA NA NA
## MaritalStatusDivorced -1.15375 0.40058 -2.880 0.00397 **
## MaritalStatusMarried -0.85460 0.29564 -2.891 0.00384 **
## MaritalStatusSingle NA NA NA NA
## OverTime 2.24965 0.22788 9.872 < 2e-16 ***
## Age -0.45087 0.12806 -3.521 0.00043 ***
## DistanceFromHome 0.32186 0.10022 3.212 0.00132 **
## Education 0.19894 0.10580 1.880 0.06005 .
## EnvironmentSatisfaction -0.55965 0.10675 -5.243 1.58e-07 ***
## JobInvolvement -0.39147 0.10002 -3.914 9.08e-05 ***
## JobSatisfaction -0.48350 0.10325 -4.683 2.83e-06 ***
## MonthlyIncome -0.36458 0.29363 -1.242 0.21437
## NumCompaniesWorked 0.43758 0.10914 4.009 6.09e-05 ***
## PerformanceRating -0.04862 0.10846 -0.448 0.65396
## RelationshipSatisfaction -0.26494 0.10430 -2.540 0.01108 *
## StockOptionLevel -0.23216 0.16239 -1.430 0.15281
## TrainingTimesLastYear -0.20770 0.10854 -1.914 0.05568 .
## WorkLifeBalance -0.24826 0.10158 -2.444 0.01453 *
## YearsSinceLastPromotion 0.73246 0.14898 4.916 8.81e-07 ***
## YearsWithCurrManager -0.71666 0.16302 -4.396 1.10e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1006.61 on 1175 degrees of freedom
## Residual deviance: 642.35 on 1140 degrees of freedom
## AIC: 714.35
##
## Number of Fisher Scoring iterations: 15
The coefficients indicates the average change in log odds of attrition. For instance, every unit increase in OverTime is associated with an average increase of 2.2840 in the log odds of Attrition. The p-values in the output also give us an idea of how effective each predictor variable is at predicting the probability of Attrition:
While linear models performance is measured by R2, that of logistic regression is measured by a metric called McFadden’s R2. The value ranges from 0 to 1, in practice values over 0.40 indicates a good model fit.
We can compute McFadden’s R2 for our model using the pR2 function from the pscl package. A rule of thumb that is quite helpful is that a McFadden’s pseudo R2 ranging from 0.2 to 0.4 indicates very good model fit
pscl::pR2(model)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.3618759
A value of 0.3647089 indicates the model fits the data quite well and has high predictive power.
The confusion matrix table in R helps in matching the predictions against actual values. It includes two dimensions, among them one will indicate the predicted values and another one will represent the actual values.
# Prediction on TestingSet
prediction <- predict(model, TestingSet, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
head(prediction)
## 5 7 9 14 22 25
## 0.23414477 0.15677665 0.02209677 0.02449423 0.62525367 0.08203140
#Assigning probabilities - If prediction exceeds threshold of 0.5, 1 else 0
prediction <- ifelse(prediction >0.5, 1, 0)
head(prediction)
## 5 7 9 14 22 25
## 0 0 0 0 1 0
#Computing confusion matrix values
confusionMatrix(factor(TestingSet$Attrition),factor(prediction), mode = 'everything', positive = "0")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 225 12
## 1 34 23
##
## Accuracy : 0.8435
## 95% CI : (0.7969, 0.8831)
## No Information Rate : 0.881
## P-Value [Acc > NIR] : 0.97759
##
## Kappa : 0.4135
##
## Mcnemar's Test P-Value : 0.00196
##
## Sensitivity : 0.8687
## Specificity : 0.6571
## Pos Pred Value : 0.9494
## Neg Pred Value : 0.4035
## Precision : 0.9494
## Recall : 0.8687
## F1 : 0.9073
## Prevalence : 0.8810
## Detection Rate : 0.7653
## Detection Prevalence : 0.8061
## Balanced Accuracy : 0.7629
##
## 'Positive' Class : 0
##
Interpreting the measures in the confusion matrix:
Accuracy 84.35% - The success rate or accuracy of the model is calculated by dividing total no. of correction predictions by total predictions (TP + TN/TP+TN+FP+FN)
Sensitivity 86.87% - Also known as recall or True Positive Rate (TPR), sensitivity measures how often the model is correct when it predicts employee attrition TPR = (TP/TP+FN)
Specificity 65.71% - This is the opposite of sensitivity and it measures how often the model is correct when predicts employees retention. The closer this number is to 0, the better. TNR = (TN/TN+FP)
Precision 94.94% - Precision measures how well the model correctly predicts attrition. Precision = TP/TP+FP
This section of focuses on ranking all the features in order of importance using the random forest algorithm in caret. A higher score means that the specific feature has a larger effect on the model in predicting the target label Attrition.
Feature importance exercise is critical because it makes it easier to identify variables to be dropped in order to reduce complexity of the model. Also, it is a straightforward way of communicating your model performance to other stakeholders.
set.seed(355)
rf <- train(Attrition ~., data = TrainingSet, method = "rf")
rf
## Random Forest
##
## 1176 samples
## 40 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1176, 1176, 1176, 1176, 1176, 1176, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.3249222 0.2472995 0.2285590
## 21 0.3154840 0.2377607 0.2070822
## 40 0.3210502 0.2121344 0.2062005
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 21.
varimp_RF <- varImp(rf)
plot(varimp_RF, main = "Employee Attrition (Random Forest)")
To conclude, we have seen the entire process where we started with importing the dataset, getting to know the dataset at a high level by carrying out EDA (univariate & multivariate) and then moving on to data pre processing before finally building models to predict employee Attrition.
While 16% of employees left the company, it was shown that the rate of attrition was higher with employees with the following characteristics; Sales, Traveled frequently, low job satisfaction, low environment satisfaction, bad work life balance
Chi-square results revealed gender, education, and performance rating did not have a significant role in employee attrition.
According to the variable importance plot, monthly income and over time are critical in attrition. Other important variables are related to work history, and distance from the office.
To reduce attrition, the company could consider raising wages, fostering a company culture that promotes work life balance, and allow remote work so employees don’t have long commutes. Remote work will also permit flexible schedules that will aid in work life balance issues.