Objective: The objective of this project is to identify factors that are related to employee attrition and building a model that could be used to predict whether an employee would leave the company or not.
Abstract:
Employee attrition can be detrimental to a company’s performance in the long term. I have personally observed a negative impact on one of my past employer’s performance because of employee attrition. The objective of this project is to explore the factors that are related to employee attrition through data wrangling and building a model that could be used to predict whether an employee would leave the company or not.
I have used different statistical techniques to predict employee attrition and compared the performance for those models. Further, I have explored different sampling techniques such as Over Sampling, Under Sampling, SMOTE, etc. as an attempt to manage the imbalance in the data set. There are only 16% positive response values. Finally, I have compared the predicting performance of the model built through the artificial sampling with the original model with random sampling.
I have used a fictional data set created by IBM data scientists that is openly available on Kaggle. The data set contains 1450 observations and 35 variables. There is a categorical response variable - Attrition with a ‘Yes’ or a ‘No’ response. The predictor variables include a mix of categorical and continuous variables such as Age, Income, Business Travel, Department, JobLevel, etc.
Loading and Tidying Data
setwd("C:/Users/aabha/Desktop/Capstone/Data/ibm-hr-analytics-employee-attrition-performance")
data <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
# renaming AGE
data <- data %>% rename(Age = ï..Age)
data <- data %>% mutate(Attrition=ifelse(Attrition=="Yes",1,0))
# correcting data types
data[,c("Attrition", "Education","EnvironmentSatisfaction","JobInvolvement","JobLevel", "JobSatisfaction","PerformanceRating","RelationshipSatisfaction","StockOptionLevel","TrainingTimesLastYear","WorkLifeBalance")] <-
data[,c("Attrition", "Education","EnvironmentSatisfaction","JobInvolvement","JobLevel",
"JobSatisfaction","PerformanceRating","RelationshipSatisfaction","StockOptionLevel","TrainingTimesLastYear", "WorkLifeBalance")] %>%
mutate_all(as.factor)
# removing variables of no importance
## Over18 and EmployeeCount
data <- data %>% dplyr::select(-c("Over18", "EmployeeCount"))
Response Variable - Attrition
data %>%
group_by(Attrition) %>%
summarise(Count=n()) %>%
ggplot(aes(x=Attrition, y=Count)) + geom_bar(stat="identity", fill="#FF6666", color="grey40") + theme_bw() +
geom_text(aes(x=Attrition, y=0.03, label= Count),
hjust=-0.5, vjust=-1, size=4,
colour="black", fontface="bold",
angle=360) + labs(title="Employee Attrition (Number)", x="Employee Attrition",y="Amount") +
theme(plot.title=element_text(hjust=0.5))
Attrition - by Age
data %>%
filter(!is.na(Age)) %>% ggplot(aes(x=Age)) + geom_histogram(fill="#FF6666",alpha=0.8, show.legend=FALSE) +
theme_minimal() +
labs(title="Age Distribution") +
theme(plot.title=element_text(hjust=0.5)) +
scale_fill_manual(values=c( "#819FF7")) +
facet_grid(.~Attrition)
Attrition - by Age and Gender
ggplot(data,
aes(x = Age, fill = Attrition)) +
geom_density(alpha = 0.7) +
scale_fill_manual(values = c("#386cb0","#fdb462")) +
labs(title="Attrion Density by Age") + facet_wrap(~Gender) +
labs(title="Age Distribution") +
theme(plot.title=element_text(hjust=0.5))
Attrition - by Job Satisfactionr
ggplot(data, aes(as.factor(Attrition), fill=as.factor(JobSatisfaction))) +
geom_bar(position = "fill") +
labs(title = "Job Satisfaction", x = "Attrition",
y = "Proportion", fill="Satisfaction Level") +
theme(plot.title=element_text(hjust=0.5))
Attrition - by Income
ggplot(data,
aes(x = MonthlyIncome, fill = Attrition)) +
geom_density(alpha = 0.7) +
scale_fill_manual(values = c("#386cb0","#fdb462")) +
labs(title = "Income and Attrition", x = "Monthly Income",
y = "Count", fill="Attrition") +
theme(plot.title=element_text(hjust=0.5))
Attrition - by Overtime
ggplot(data,
aes(x = OverTime, group = Attrition)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count",
alpha = 0.7) +
geom_text(aes(label = scales::percent(..prop..), y = ..prop.. ),
stat= "count",
vjust = -.5) +
labs(y = "Percentage", fill= "OverTime") +
facet_grid(~Attrition) +
scale_fill_manual(values = c("#386cb0","#fdb462")) +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
ggtitle("Attrition")
Attrition - by Business Travel
# business travel
ggplot(data,
aes(x= BusinessTravel, group=Attrition)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count",
alpha = 0.7) +
geom_text(aes(label = scales::percent(..prop..), y = ..prop.. ),
stat= "count",
vjust = -.5) +
labs(y = "Percentage", fill="Business Travel") +
facet_grid(~Attrition) +
scale_y_continuous(labels=percent) +
scale_fill_manual(values = c("#386cb0","#ef3b2c", "#fdb462")) +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
ggtitle("Attrition")
set.seed(1234)
index <- sample(nrow(data),nrow(data)*0.75)
data.train <- data[index,]
data.test <- data[-index,]
# Variable Selection with Stepwise Backward Selecton Approach
data.lr0<- glm(Attrition~., family=binomial, data=data.train)
data.lr.back <- step(data.lr0, direction = "backward", trace=0)
# summary(data.lr.back)
# model
data.lr.train <- glm(formula = Attrition ~ Age + BusinessTravel + DistanceFromHome +
EducationField + EnvironmentSatisfaction + JobInvolvement +
JobLevel + JobRole + JobSatisfaction + MonthlyRate + NumCompaniesWorked +
OverTime + RelationshipSatisfaction + StockOptionLevel +
TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance +
YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion +
YearsWithCurrManager, family = binomial, data = data.train)
I have taken a weightage of 5:1 to penalize False Negatives more than False Positives.
It is more important to correctly identify positive cases in order to check if changes can be made to prevent the employee from leaving the company.
Using these weights, I have taken cut off value of probability as 1/6.
# Asymmetric Misclassification Rate, using 5:1 asymmetric cost
cost <- function(r, pi){
weight1 = 5
weight0 = 1
c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
# pcutoff value
pcut <- 1/6
Confusion Matrix:
# getting predictions for in sample
pred.lr.train<- predict(data.lr.train, type="response")
# In sample prediction after cost function
pred_cutoff_train <-as.numeric(pred.lr.train >pcut)
# confusion matrix - LOR - train data
table(data.train$Attrition, pred_cutoff_train,dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0 1
## 0 756 167
## 1 34 145
# MR - LOR - train data
trainmrlor <- round(mean(data.train$Attrition!=pred_cutoff_train),2)
# AMR - LOR - train data
trainamrlor <- round(cost(r = data.train$Attrition,pred_cutoff_train),2)
The Misclassification Rate is 0.18
The Asymmetrical Misclassification Rate is 0.31
Confusion Matrix:
# out of sample prediction
pred.lr.test <- predict(data.lr.train, newdata = data.test, type = "response")
pred_cutoff_test <- as.numeric(pred.lr.test >pcut)
# confusion matrix - LOR - train data
table(data.test$Attrition, pred_cutoff_test,dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0 1
## 0 255 55
## 1 19 39
# MR - LOR - test data
testmrlor <- round(mean(data.test$Attrition!=as.numeric(pred_cutoff_test)),2)
# AMR - LOR - test data
testamrlor <- round(cost(r = data.test$Attrition,pred_cutoff_test),2)
The Misclassification Rate is 0.2
The Asymmetrical Misclassification Rate is 0.41
# roc
pred.lor <- prediction(pred.lr.test, data.test$Attrition)
perf.lor <- performance(pred.lor, "tpr", "fpr")
plot(perf.lor, colorize=TRUE)
auclor <- round(unlist(slot(performance(pred.lor, "auc"), "y.values")),2)
# precision recall curve
score1.test.lor= pred.lr.test[data.test$Attrition==1]
score0.test.lor= pred.lr.test[data.test$Attrition==0]
pr.test= pr.curve(score1.test.lor, score0.test.lor, curve = T)
plot(pr.test)
Building a Big Tree:
set.seed(91011)
data.rpart <- rpart(formula = Attrition ~ . , data = data.train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)),cp = 0.001)
prp(data.rpart, extra = 1)
Pruning Tree:
plotcp(data.rpart)
printcp(data.rpart)
##
## Classification tree:
## rpart(formula = Attrition ~ ., data = data.train, method = "class",
## parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)), cp = 0.001)
##
## Variables actually used in tree construction:
## [1] Age DailyRate
## [3] DistanceFromHome Education
## [5] EducationField EmployeeNumber
## [7] EnvironmentSatisfaction HourlyRate
## [9] JobInvolvement JobLevel
## [11] JobRole JobSatisfaction
## [13] MonthlyIncome NumCompaniesWorked
## [15] OverTime PercentSalaryHike
## [17] StockOptionLevel TotalWorkingYears
## [19] TrainingTimesLastYear WorkLifeBalance
##
## Root node error: 895/1102 = 0.81216
##
## n= 1102
##
## CP nsplit rel error xerror xstd
## 1 0.2983240 0 1.00000 1.8156 0.080173
## 2 0.0770950 1 0.70168 1.5676 0.078187
## 3 0.0374302 2 0.62458 1.4056 0.075300
## 4 0.0296089 4 0.54972 1.2536 0.071714
## 5 0.0223464 6 0.49050 1.2648 0.072144
## 6 0.0167598 7 0.46816 1.2313 0.071369
## 7 0.0100559 8 0.45140 1.1464 0.069335
## 8 0.0092179 11 0.42123 1.1397 0.069201
## 9 0.0089385 18 0.34860 1.1821 0.070331
## 10 0.0067039 20 0.33073 1.2089 0.070839
## 11 0.0061453 21 0.32402 1.2883 0.072811
## 12 0.0055866 26 0.28939 1.3173 0.073247
## 13 0.0044693 27 0.28380 1.3218 0.073379
## 14 0.0026071 30 0.27039 1.4022 0.075000
## 15 0.0022346 34 0.25922 1.3855 0.074658
## 16 0.0016760 35 0.25698 1.3855 0.074658
## 17 0.0011173 37 0.25363 1.3855 0.074658
## 18 0.0010000 38 0.25251 1.3687 0.074142
# pruning the tree
data.rpart.prunedtree <- rpart(Attrition~., data = data.train, method = "class",
parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)),cp=0.013)
prp(data.rpart.prunedtree, extra = 1, nn.font=500,box.palette = "green")
Confusion Matrix:
#training stats
pred.tree.train <- predict(data.rpart.prunedtree, type = "prob")
data.train.pred.rpart = as.numeric(pred.tree.train[,2] > pcut)
# confusion matrix - Class Tree - train data
table(data.train$Attrition, data.train.pred.rpart, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 764 159
## 1 49 130
# MR - Class Tree - train data
trainmrct <- round(mean(data.train$Attrition!=data.train.pred.rpart),2)
# AMR - Class Tree - train data
testmrct <- cost(r = data.train$Attrition,data.train.pred.rpart)
The Misclassification Rate is 0.19
The Asymmetrical Misclassification Rate is 0.3666062
Confusion Matrix:
# test data set
pred.tree.test <- predict(data.rpart.prunedtree, newdata=data.test, type = "prob")
data.test.pred.rpart = as.numeric(pred.tree.test[,2] > pcut)
# test stats
# confusion matrix - Class Tree - train data
table(data.test$Attrition, data.test.pred.rpart, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 241 69
## 1 26 32
# MR - Class Tree - test data
testmrct <- round(mean(data.test$Attrition!=as.numeric(data.test.pred.rpart)),2)
# AMR - Class Tree - test data
testamrct <- round(cost(r = data.test$Attrition,data.test.pred.rpart),2)
The Misclassification Rate is 0.26
The Asymmetrical Misclassification Rate is 0.54
# roc
pred.ct <- prediction(pred.tree.test[,2], data.test$Attrition)
perf.ct <- performance(pred.ct, "tpr", "fpr")
plot(perf.ct, colorize=TRUE)
aucct <- round(unlist(slot(performance(pred.ct, "auc"), "y.values")),2)
# precision recall curve
score1.test.lor= pred.lr.test[data.test$Attrition==1]
score0.test.lor= pred.lr.test[data.test$Attrition==0]
pr.test= pr.curve(score1.test.lor, score0.test.lor, curve = T)
plot(pr.test)
Confusion Matrix:
set.seed(112233)
data.train.rf <- randomForest(Attrition~., data = data.train)
#training stats
pred.rf.train <- predict(data.train.rf, type = "prob")
data.train.pred.rf = as.numeric(pred.rf.train[,2] > pcut)
# confusion matrix - Class Tree - train data
table(data.train$Attrition, data.train.pred.rf, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 639 284
## 1 36 143
# MR - Class Tree - train data
trainmrrf <- round(mean(data.train$Attrition!=data.train.pred.rf),2)
# AMR - Class Tree - train data
trainamrf <- round(cost(r = data.train$Attrition,data.train.pred.rf),2)
The Misclassification Rate is 0.29
The Asymmetrical Misclassification Rate is 0.42
Confusion Matrix:
# test data set
pred.rf.test <- predict(data.train.rf, newdata=data.test, type = "prob")
data.test.pred.rf = as.numeric(pred.rf.test[,2] > pcut)
# test stats
# confusion matrix - Class Tree - train data
table(data.test$Attrition, data.test.pred.rf, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 211 99
## 1 12 46
# MR - Class Tree - test data
testmrrf <- round(mean(data.test$Attrition!=as.numeric(data.test.pred.rf)),2)
# AMR - Class Tree - test data
testamrrf <- round(cost(r = data.test$Attrition,data.test.pred.rf),2)
The Misclassification Rate is 0.3
The Asymmetrical Misclassification Rate is 0.43
# roc
pred.rf <- prediction(pred.rf.test[,2], data.test$Attrition)
perf.rf <- performance(pred.rf, "tpr", "fpr")
plot(perf.rf, colorize=TRUE)
aucrf <- round(unlist(slot(performance(pred.rf, "auc"), "y.values")),2)
# precion recall curve
score1.test.rf <- pred.rf.test[,2][data.test$Attrition==1]
score0.test.rf <- pred.rf.test[,2][data.test$Attrition==0]
pr.test.ct <- pr.curve(score1.test.rf, score0.test.rf, curve = T)
plot(pr.test.ct)
table(data.train$Attrition)
##
## 0 1
## 923 179
set.seed(1234)
# oversampling
data_train_over <- ovun.sample( Attrition~ ., data = data.train, method = "over",N = 1846)$data
# checking frequency of response variale
table(data_train_over$Attrition)
##
## 0 1
## 923 923
set.seed(1234)
data_train_under <- ovun.sample( Attrition~ ., data = data.train, method = "under",N = 358)$data
# checking frequency of response variale
table(data_train_under$Attrition)
##
## 0 1
## 179 179
# both
set.seed(1234)
data_train_both <- ovun.sample( Attrition~ ., data = data.train, method = "both",p=0.5)$data
# checking frequency of response variale
table(data_train_both$Attrition)
##
## 0 1
## 539 563
# ROSE
set.seed(1234)
data_train_rose <- ROSE(Attrition~ ., data = data.train, seed = 1234)$data
table(data_train_both$Attrition)
##
## 0 1
## 539 563
# SMOTE
set.seed(1234)
data_train_smote <- SMOTE(Attrition~ ., data = data.train, perc.over = 100, perc.under = 200)
table(data_train_smote$Attrition)
##
## 0 1
## 358 358
set.seed(112233)
data.train.rf.over <- randomForest(Attrition~., data = data_train_over)
data.train.rf.under <- randomForest(Attrition~., data = data_train_under)
data.train.rf.both <- randomForest(Attrition~., data = data_train_both)
data.train.rf.rose <- randomForest(Attrition~., data = data_train_rose)
data.train.rf.smote <- randomForest(Attrition~., data = data_train_smote)
# making predictions
pred.rf.over <- predict(data.train.rf.over, newdata = data.test,type = "prob")
pred.rf.under <- predict(data.train.rf.under, newdata = data.test,type = "prob")
pred.rf.both <- predict(data.train.rf.both, newdata = data.test,type = "prob")
pred.rf.rose <- predict(data.train.rf.rose, newdata = data.test,type = "prob")
pred.rf.smote <- predict(data.train.rf.smote, newdata = data.test,type = "prob")
# ROC and AUC
a <- roc.curve(data.test$Attrition, pred.rf.over[,2],col="red")
par(new=TRUE)
b<- roc.curve(data.test$Attrition, pred.rf.under[,2],col="violet")
par(new=TRUE)
c<- roc.curve(data.test$Attrition, pred.rf.both[,2],col="brown")
par(new=TRUE)
d<- roc.curve(data.test$Attrition, pred.rf.rose[,2],col="black")
par(new=TRUE)
e<-roc.curve(data.test$Attrition, pred.rf.smote[,2],col="blue")
#
# roc.over <- roc.curve(data.test$Attrition, pred.rf.over[,2],add.roc=TRUE,col="red")
# roc.under <- roc.curve(data.test$Attrition, pred.rf.under[,2],add.roc=TRUE,col="violet")
# roc.both <- roc.curve(data.test$Attrition, pred.rf.both[,2],add.roc=TRUE,col="brown")
# roc.rose <- roc.curve(data.test$Attrition, pred.rf.rose[,2],add.roc=TRUE,col="black")
# roc.smote <- roc.curve(data.test$Attrition, pred.rf.smote[,2],add.roc=TRUE, col="blue")
Confusion Matrix:
pred.rf.rose.test <- predict(data.train.rf.rose, newdata = data.test)
# test stats
# confusion matrix - Class Tree - train data
table(data.test$Attrition, pred.rf.rose.test, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 268 42
## 1 24 34
# MR - Class Tree - test data
rose.test.mr <- round(mean(data.test$Attrition!=pred.rf.rose.test),2)
# AMR - Class Tree - test data
rose.test.amr <- round(cost(r = data.test$Attrition,pred.rf.rose.test),2)
The Misclassification Rate is 0.18
The Asymmetrical Misclassification Rate is 0.44
ROC AUC:
LR:0.82
CT: 0.64
RF: 0.8
RF (ROSE): .81
Precision Recall Curve AUC:
LR: 0.55
CT: 0.3
RF: 0.52
RF (ROSE): 0.44
Precision and Recall Values:
LR - P:0.41 R:0.67
CT - P:0.31 R:0.55
RF - P:0.31 R:0.79
RF (ROSE) - P:0.46 R:0.62
Asymmetrical Misclassification Rate:
LR - 0.41
CT - 0.54
RF - 0.43
RF (ROSE) - 0.41
Analysis:
Logistic Regression has maximum AUC; however, Random Forest has AUC very close to Logistic Regression. But, Recall for Random Forest is higher compared with all other models.
High Recall will result in correctly identifying employees who may leave the company. A lower Recall would mean that there are higher cases where employees who are actually not going to leave are wrongly predicted in the attrition category. Hence, a higher Recall is preferable for a comparatively smaller decrease in Precision. Thus, Random Forest should be preferred.
Random Over Sampling Examples (ROSE) for Random Forest resulted in a slightly higher AUC for the ROC curve but a lower AUC for Precision Recall Curve compared with the AUC’s for Random Forest without ROSE. This technique managed to increase Precision in Random Forest but at the cost of decreasing Recall by a significant margin. Since there was a fall in AUC for precision recall curve for Random Forest with ROSE, it can be said that artificial sampling failed to provide us with a better model in this case.
Random Over Sampling Examples (ROSE) gave us the best ROC curve among all the sampling techniques