Predicting Employee Attrition

Objective

The objective of this project is to identify factors related to employee attrition and to build a model that can be used to predict whether an employee will leave the company.



Abstract:

  • Employee attrition can be detrimental to a company’s performance in the long term; I have personally observed its negative impact at one of my past employers. This project explores the factors related to employee attrition through data wrangling and builds a model that can be used to predict whether an employee will leave the company.

  • I used several statistical techniques to predict employee attrition and compared the performance of the resulting models. Because the data set is imbalanced (only 16% of the response values are positive), I also explored sampling techniques such as Over Sampling, Under Sampling, and SMOTE. Finally, I compared the predictive performance of the models built on artificially sampled data with the original model built on a random sample.

Data

Link to Data Set on Kaggle

I have used a fictional data set created by IBM data scientists that is openly available on Kaggle. The data set contains 1470 observations and 35 variables. The response variable, Attrition, is categorical with a ‘Yes’ or ‘No’ value. The predictors are a mix of categorical and continuous variables such as Age, MonthlyIncome, BusinessTravel, Department, and JobLevel.

Exploratory Data Analysis

Loading and Tidying Data

setwd("C:/Users/aabha/Desktop/Capstone/Data/ibm-hr-analytics-employee-attrition-performance")

# libraries used throughout the report
library(dplyr); library(ggplot2); library(scales)
library(ROCR); library(PRROC)         # ROC and precision-recall curves
library(rpart); library(rpart.plot); library(randomForest)
library(ROSE); library(DMwR)          # resampling techniques and SMOTE

data <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

# renaming Age (read.csv picks up a byte-order mark in the first column name)
data <- data %>% rename(Age = ï..Age)
data <- data %>% mutate(Attrition = ifelse(Attrition == "Yes", 1, 0))


# correcting data types: ordinal survey/level variables are stored as integers
cat_vars <- c("Attrition", "Education", "EnvironmentSatisfaction", "JobInvolvement",
              "JobLevel", "JobSatisfaction", "PerformanceRating",
              "RelationshipSatisfaction", "StockOptionLevel",
              "TrainingTimesLastYear", "WorkLifeBalance")
data[, cat_vars] <- data[, cat_vars] %>% mutate_all(as.factor)

# removing variables of no importance (constant for every employee)
##  Over18 and EmployeeCount
data <- data %>% dplyr::select(-c("Over18", "EmployeeCount"))


Response Variable - Attrition

data %>% 
  group_by(Attrition) %>% 
  summarise(Count = n()) %>%
  ggplot(aes(x = Attrition, y = Count)) + 
  geom_bar(stat = "identity", fill = "#FF6666", color = "grey40") + 
  theme_bw() + 
  geom_text(aes(label = Count), vjust = -0.5, size = 4,
            colour = "black", fontface = "bold") + 
  labs(title = "Employee Attrition (Number)", x = "Employee Attrition", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5)) 

  • Only about 16% of the employees have Attrition = ‘Yes’.
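
A quick check of the class proportions confirms the imbalance (a minimal sketch, assuming data has been prepared as above):

# proportion of each Attrition class
round(prop.table(table(data$Attrition)), 2)
## 
##    0    1 
## 0.84 0.16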


Attrition - by Age

data %>% 
  filter(!is.na(Age)) %>% 
  ggplot(aes(x = Age)) + 
  geom_histogram(fill = "#FF6666", alpha = 0.8, show.legend = FALSE) + 
  theme_minimal() + 
  labs(title = "Age Distribution") + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  facet_grid(. ~ Attrition)

Attrition - by Age and Gender

ggplot(data, 
       aes(x = Age, fill = Attrition)) + 
  geom_density(alpha = 0.7) + 
  scale_fill_manual(values = c("#386cb0", "#fdb462")) +
  facet_wrap(~Gender) +
  labs(title = "Attrition Density by Age and Gender") + 
  theme(plot.title = element_text(hjust = 0.5)) 

  • Attrition density peaks for people around the age of 30.
  • Around that age, females display a higher attrition density than their male counterparts.



Attrition - by Job Satisfaction

ggplot(data, aes(as.factor(Attrition), fill=as.factor(JobSatisfaction))) +
  geom_bar(position = "fill") +
  labs(title = "Job Satisfaction", x = "Attrition", 
       y = "Proportion", fill="Satisfaction Level") +
  theme(plot.title=element_text(hjust=0.5))

  • A higher proportion of the employees who left reported a low satisfaction level compared with those who stayed.



Attrition - by Income

ggplot(data, 
       aes(x = MonthlyIncome, fill = Attrition)) + 
  geom_density(alpha = 0.7) + 
  scale_fill_manual(values = c("#386cb0","#fdb462")) +
  labs(title = "Income and Attrition", x = "Monthly Income",
  y = "Density", fill="Attrition") +
  theme(plot.title=element_text(hjust=0.5))

  • A visibly higher proportion of people with low monthly incomes left the company.



Attrition - by Overtime

ggplot(data, 
       aes(x = OverTime, group = Attrition)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), 
           stat = "count", 
           alpha = 0.7) +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), 
            stat = "count", 
            vjust = -.5) +
  labs(y = "Percentage", fill = "OverTime") +
  facet_grid(~Attrition) +
  scale_y_continuous(labels = scales::percent) + 
  scale_fill_manual(values = c("#386cb0", "#fdb462")) + 
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) + 
  ggtitle("Attrition")

  • Among the employees who left, a higher percentage had worked overtime compared with those who stayed.



Attrition - by Business Travel

# business travel
ggplot(data, 
       aes(x = BusinessTravel, group = Attrition)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), 
           stat = "count", 
           alpha = 0.7) +
  geom_text(aes(label = scales::percent(..prop..), y = ..prop..), 
            stat = "count", 
            vjust = -.5) +
  labs(y = "Percentage", fill = "Business Travel") +
  facet_grid(~Attrition) +
  scale_y_continuous(labels = scales::percent) + 
  scale_fill_manual(values = c("#386cb0", "#ef3b2c", "#fdb462")) + 
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) + 
  ggtitle("Attrition")

  • Among the employees who left, a higher percentage traveled frequently compared with those who stayed.

Random Sampling

set.seed(1234)
index <- sample(nrow(data),nrow(data)*0.75)
data.train <- data[index,]
data.test <- data[-index,]
  • The data has been randomly split into training and test sets in a 75:25 ratio.

Models

Logistic Regression

Variable Selection
  • I used stepwise backward selection to choose the variables for the logistic regression.
# Variable selection with the stepwise backward selection approach
data.lr0<- glm(Attrition~., family=binomial, data=data.train)
data.lr.back <- step(data.lr0, direction = "backward", trace=0) 
# summary(data.lr.back)

# model
data.lr.train <- glm(formula = Attrition ~ Age + BusinessTravel + DistanceFromHome + 
      EducationField + EnvironmentSatisfaction + JobInvolvement + 
      JobLevel + JobRole + JobSatisfaction + MonthlyRate + NumCompaniesWorked + 
      OverTime + RelationshipSatisfaction + StockOptionLevel + 
      TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + 
      YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
      YearsWithCurrManager, family = binomial, data = data.train)



Defining Cost Function
  • I have used a 5:1 weighting to penalize False Negatives more heavily than False Positives.

  • It is more important to correctly identify positive cases so that changes can be made to keep the employee from leaving the company.

  • With these weights, the implied probability cutoff is weight0 / (weight0 + weight1) = 1 / (1 + 5) = 1/6.

# Asymmetric Misclassification Rate, using  5:1 asymmetric cost
cost <- function(r, pi){
  weight1 = 5
  weight0 = 1
  c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
  c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}

# pcutoff value
pcut <- 1/6



In Sample Prediction

Confusion Matrix:

# getting predictions for in sample
pred.lr.train<- predict(data.lr.train, type="response")

# In sample prediction after cost function
pred_cutoff_train <-as.numeric(pred.lr.train >pcut) 
# confusion matrix - LOR - train data 
table(data.train$Attrition, pred_cutoff_train,dnn = c("Actual", "Predicted"))
##       Predicted
## Actual   0   1
##      0 756 167
##      1  34 145
# MR - LOR - train data
trainmrlor <- round(mean(data.train$Attrition!=pred_cutoff_train),2)
# AMR - LOR - train data
trainamrlor <- round(cost(r = data.train$Attrition,pred_cutoff_train),2)
  • The Misclassification Rate is 0.18

  • The Asymmetrical Misclassification Rate is 0.31



Out of Sample Prediction

Confusion Matrix:

# out of sample prediction
pred.lr.test <- predict(data.lr.train, newdata = data.test, type = "response")
pred_cutoff_test <- as.numeric(pred.lr.test >pcut)
# confusion matrix - LOR - test data
table(data.test$Attrition, pred_cutoff_test,dnn = c("Actual", "Predicted"))
##       Predicted
## Actual   0   1
##      0 255  55
##      1  19  39
# MR - LOR - test data
testmrlor <- round(mean(data.test$Attrition!=as.numeric(pred_cutoff_test)),2)
# AMR - LOR - test data
testamrlor <- round(cost(r = data.test$Attrition,pred_cutoff_test),2)
  • The Misclassification Rate is 0.2

  • The Asymmetrical Misclassification Rate is 0.41



ROC Curve - Out of Sample
# roc
pred.lor <- prediction(pred.lr.test, data.test$Attrition)
perf.lor <- performance(pred.lor, "tpr", "fpr")
plot(perf.lor, colorize=TRUE)

auclor <- round(unlist(slot(performance(pred.lor, "auc"), "y.values")),2)
  • The AUC is 0.82



Precision Recall - Out of Sample
# precision recall curve
score1.test.lor= pred.lr.test[data.test$Attrition==1]
score0.test.lor= pred.lr.test[data.test$Attrition==0]
pr.test= pr.curve(score1.test.lor, score0.test.lor, curve = T)
plot(pr.test)

Classification Tree

In Sample Prediction

Building a Big Tree:

set.seed(91011)
data.rpart <- rpart(formula = Attrition ~ . , data = data.train, method = "class",
                    parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)), cp = 0.001)
prp(data.rpart, extra = 1)

Pruning Tree:

plotcp(data.rpart)

printcp(data.rpart)
## 
## Classification tree:
## rpart(formula = Attrition ~ ., data = data.train, method = "class", 
##     parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)), cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] Age                     DailyRate              
##  [3] DistanceFromHome        Education              
##  [5] EducationField          EmployeeNumber         
##  [7] EnvironmentSatisfaction HourlyRate             
##  [9] JobInvolvement          JobLevel               
## [11] JobRole                 JobSatisfaction        
## [13] MonthlyIncome           NumCompaniesWorked     
## [15] OverTime                PercentSalaryHike      
## [17] StockOptionLevel        TotalWorkingYears      
## [19] TrainingTimesLastYear   WorkLifeBalance        
## 
## Root node error: 895/1102 = 0.81216
## 
## n= 1102 
## 
##           CP nsplit rel error xerror     xstd
## 1  0.2983240      0   1.00000 1.8156 0.080173
## 2  0.0770950      1   0.70168 1.5676 0.078187
## 3  0.0374302      2   0.62458 1.4056 0.075300
## 4  0.0296089      4   0.54972 1.2536 0.071714
## 5  0.0223464      6   0.49050 1.2648 0.072144
## 6  0.0167598      7   0.46816 1.2313 0.071369
## 7  0.0100559      8   0.45140 1.1464 0.069335
## 8  0.0092179     11   0.42123 1.1397 0.069201
## 9  0.0089385     18   0.34860 1.1821 0.070331
## 10 0.0067039     20   0.33073 1.2089 0.070839
## 11 0.0061453     21   0.32402 1.2883 0.072811
## 12 0.0055866     26   0.28939 1.3173 0.073247
## 13 0.0044693     27   0.28380 1.3218 0.073379
## 14 0.0026071     30   0.27039 1.4022 0.075000
## 15 0.0022346     34   0.25922 1.3855 0.074658
## 16 0.0016760     35   0.25698 1.3855 0.074658
## 17 0.0011173     37   0.25363 1.3855 0.074658
## 18 0.0010000     38   0.25251 1.3687 0.074142
# pruning the tree by refitting with a larger cp (keeps only the stronger splits)
data.rpart.prunedtree <- rpart(Attrition~., data = data.train, method = "class",
                           parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)),cp=0.013)
prp(data.rpart.prunedtree, extra = 1, nn.font=500,box.palette = "green")

Confusion Matrix:

#training stats
pred.tree.train <- predict(data.rpart.prunedtree, type = "prob")
data.train.pred.rpart = as.numeric(pred.tree.train[,2] > pcut)
# confusion matrix - Class Tree - train data 
table(data.train$Attrition, data.train.pred.rpart, dnn=c("Truth","Predicted"))
##      Predicted
## Truth   0   1
##     0 764 159
##     1  49 130
# MR - Class Tree - train data
trainmrct <-  round(mean(data.train$Attrition!=data.train.pred.rpart),2)
# AMR - Class Tree - train data
trainamrct <- round(cost(r = data.train$Attrition, data.train.pred.rpart), 2)
  • The Misclassification Rate is 0.19

  • The Asymmetrical Misclassification Rate is 0.37



Out of Sample Prediction

Confusion Matrix:

# test data set 
pred.tree.test <- predict(data.rpart.prunedtree, newdata=data.test, type = "prob")
data.test.pred.rpart = as.numeric(pred.tree.test[,2] > pcut)
# test stats
# confusion matrix - Class Tree - test data
table(data.test$Attrition, data.test.pred.rpart, dnn=c("Truth","Predicted"))
##      Predicted
## Truth   0   1
##     0 241  69
##     1  26  32
# MR - Class Tree - test data
testmrct <- round(mean(data.test$Attrition!=as.numeric(data.test.pred.rpart)),2)
# AMR - Class Tree - test data
testamrct <- round(cost(r = data.test$Attrition,data.test.pred.rpart),2)
  • The Misclassification Rate is 0.26

  • The Asymmetrical Misclassification Rate is 0.54



ROC Curve - Out of Sample
# roc
pred.ct <- prediction(pred.tree.test[,2], data.test$Attrition)
perf.ct <- performance(pred.ct, "tpr", "fpr")
plot(perf.ct, colorize=TRUE)

aucct <- round(unlist(slot(performance(pred.ct, "auc"), "y.values")),2)
  • The AUC is 0.64



Precision Recall - Out of Sample
# precision recall curve (using the classification tree's predicted probabilities)
score1.test.ct <- pred.tree.test[,2][data.test$Attrition==1]
score0.test.ct <- pred.tree.test[,2][data.test$Attrition==0]
pr.test.ct <- pr.curve(score1.test.ct, score0.test.ct, curve = T)
plot(pr.test.ct)

Random Forest

In Sample Prediction

Confusion Matrix:

set.seed(112233)
data.train.rf <- randomForest(Attrition~., data = data.train)

# training stats (predictions without newdata are out-of-bag)
pred.rf.train <- predict(data.train.rf, type = "prob")
data.train.pred.rf = as.numeric(pred.rf.train[,2] > pcut)
# confusion matrix - Random Forest - train data 
table(data.train$Attrition, data.train.pred.rf, dnn=c("Truth","Predicted"))
##      Predicted
## Truth   0   1
##     0 639 284
##     1  36 143
# MR - Random Forest - train data
trainmrrf <- round(mean(data.train$Attrition!=data.train.pred.rf),2)
# AMR - Random Forest - train data
trainamrrf <- round(cost(r = data.train$Attrition,data.train.pred.rf),2)
  • The Misclassification Rate is 0.29

  • The Asymmetrical Misclassification Rate is 0.42



Out of Sample Prediction

Confusion Matrix:

# test data set
pred.rf.test <- predict(data.train.rf, newdata=data.test, type = "prob")
data.test.pred.rf = as.numeric(pred.rf.test[,2] > pcut)
# test stats
# confusion matrix - Random Forest - test data
table(data.test$Attrition, data.test.pred.rf, dnn=c("Truth","Predicted"))
##      Predicted
## Truth   0   1
##     0 211  99
##     1  12  46
# MR - Random Forest - test data
testmrrf <- round(mean(data.test$Attrition!=as.numeric(data.test.pred.rf)),2)
# AMR - Random Forest - test data
testamrrf <- round(cost(r = data.test$Attrition,data.test.pred.rf),2)
  • The Misclassification Rate is 0.3

  • The Asymmetrical Misclassification Rate is 0.43



ROC Curve - Out of Sample
# roc
pred.rf <- prediction(pred.rf.test[,2], data.test$Attrition)
perf.rf <- performance(pred.rf, "tpr", "fpr")
plot(perf.rf, colorize=TRUE)

aucrf <- round(unlist(slot(performance(pred.rf, "auc"), "y.values")),2)
  • The AUC is 0.8



Precision Recall - Out of Sample
# precision recall curve
score1.test.rf <- pred.rf.test[,2][data.test$Attrition==1]
score0.test.rf <- pred.rf.test[,2][data.test$Attrition==0]
pr.test.rf <- pr.curve(score1.test.rf, score0.test.rf, curve = T)
plot(pr.test.rf)

Other Sampling Techniques

Original Data:

table(data.train$Attrition)
## 
##   0   1 
## 923 179



Over Sampling

  • The minority positive cases are resampled with replacement until they match the number of negatives.
set.seed(1234)
# oversampling
data_train_over <- ovun.sample( Attrition~ ., data = data.train, method = "over",N = 1846)$data
# checking frequency of response variable
table(data_train_over$Attrition)
## 
##   0   1 
## 923 923



Under Sampling

  • The majority negatives are reduced to match the minority positives. Caution: Information is lost in this case
set.seed(1234)
data_train_under <- ovun.sample( Attrition~ ., data = data.train, method = "under",N = 358)$data
# checking frequency of response variable
table(data_train_under$Attrition)
## 
##   0   1 
## 179 179



Mixed Sampling

  • We can use a mix of over and under sampling by specifying the desired percentage of the positive class in the newly generated sample; the function will try to match it. For the following table, I specified 50%.
# both
set.seed(1234)
data_train_both <- ovun.sample( Attrition~ ., data = data.train, method = "both",p=0.5)$data
# checking frequency of response variable
table(data_train_both$Attrition)
## 
##   0   1 
## 539 563



Random Over Sampling Examples (ROSE)

  • Oversampling and undersampling can distort the resulting performance estimates because information is duplicated or discarded. ROSE (Random Over Sampling Examples) instead generates data synthetically to obtain a better estimate of the original distribution.
# ROSE
set.seed(1234)
data_train_rose <- ROSE(Attrition~ ., data = data.train, seed = 1234)$data
# checking frequency of response variable (ROSE defaults to roughly balanced classes)
table(data_train_rose$Attrition)



Synthetic Minority Oversampling Technique (SMOTE)

  • Synthetic Minority Oversampling Technique generates artificial minority-class observations to shift the classifier’s learning bias towards the minority class; it uses bootstrapping and k-nearest neighbors to create the synthetic data. Here, perc.over = 100 creates one synthetic case per minority case (179 → 358), and perc.under = 200 then keeps 2 × 179 = 358 majority cases.
# SMOTE
set.seed(1234)
data_train_smote <- SMOTE(Attrition~ ., data = data.train, perc.over = 100,  perc.under = 200)
table(data_train_smote$Attrition)
## 
##   0   1 
## 358 358



Random Forest Using Each of the Sampled Data Sets

  • I built a Random Forest model using the data obtained from each sampling technique.
set.seed(112233)
data.train.rf.over <- randomForest(Attrition~., data = data_train_over)
data.train.rf.under <- randomForest(Attrition~., data = data_train_under)
data.train.rf.both <- randomForest(Attrition~., data = data_train_both)
data.train.rf.rose <- randomForest(Attrition~., data = data_train_rose)
data.train.rf.smote <- randomForest(Attrition~., data = data_train_smote)

# making predictions
pred.rf.over <- predict(data.train.rf.over, newdata = data.test,type = "prob")
pred.rf.under <- predict(data.train.rf.under, newdata = data.test,type = "prob")
pred.rf.both <- predict(data.train.rf.both, newdata = data.test,type = "prob")
pred.rf.rose <- predict(data.train.rf.rose, newdata = data.test,type = "prob")
pred.rf.smote <- predict(data.train.rf.smote, newdata = data.test,type = "prob")



ROC and AUC

# ROC curves for each sampling technique, overlaid on one plot
a <- roc.curve(data.test$Attrition, pred.rf.over[,2],  col = "red")
b <- roc.curve(data.test$Attrition, pred.rf.under[,2], col = "violet", add.roc = TRUE)
c <- roc.curve(data.test$Attrition, pred.rf.both[,2],  col = "brown",  add.roc = TRUE)
d <- roc.curve(data.test$Attrition, pred.rf.rose[,2],  col = "black",  add.roc = TRUE)
e <- roc.curve(data.test$Attrition, pred.rf.smote[,2], col = "blue",   add.roc = TRUE)
  • AUC:
    • Over Sampling: 0.79
    • Under Sampling: 0.79
    • Mixed Sampling: 0.8
    • ROSE: 0.81
    • SMOTE: 0.77
  • All sampling techniques gave very similar AUCs for this data set.
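
The AUC values listed above can be read directly from the objects returned by roc.curve (a small sketch, using the objects a through e created in the plotting code above):

# AUC for each sampling technique, from the roc.curve objects
round(c(over = a$auc, under = b$auc, both = c$auc, rose = d$auc, smote = e$auc), 2)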



Checking Random Forest Model Performance for ROSE on Test data

Confusion Matrix:

pred.rf.rose.test <- predict(data.train.rf.rose, newdata = data.test)
# test stats
# confusion matrix - Random Forest (ROSE) - test data
table(data.test$Attrition, pred.rf.rose.test, dnn=c("Truth","Predicted"))
##      Predicted
## Truth   0   1
##     0 268  42
##     1  24  34
# MR - Random Forest (ROSE) - test data
rose.test.mr <- round(mean(data.test$Attrition!=pred.rf.rose.test),2)
# AMR - Random Forest (ROSE) - test data
rose.test.amr <- round(cost(r = data.test$Attrition,pred.rf.rose.test),2)
  • The Misclassification Rate is 0.18

  • The Asymmetrical Misclassification Rate is 0.44

Analysis

ROC AUC:

  • LR: 0.82

  • CT: 0.64

  • RF: 0.8

  • RF (ROSE): 0.81
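
These values were stored earlier as auclor, aucct, and aucrf; the RF (ROSE) value is in the roc.curve object d from the sampling section. A small consolidation sketch (assuming those objects are still in the workspace):

# consolidated ROC AUC comparison on the test set
round(c(LR = auclor, CT = aucct, RF = aucrf, RF.ROSE = d$auc), 2)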



Precision Recall Curve AUC:

  • LR: 0.55

  • CT: 0.3

  • RF: 0.52

  • RF (ROSE): 0.44
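
For the PRROC objects, the area under the Precision Recall curve is stored in the auc.integral component. A sketch, assuming pr.test (LR), pr.test.ct (CT), and pr.test.rf (RF) from the model sections above; a ROSE PR object is not created in the code shown, but it can be computed the same way from pred.rf.rose[,2]:

# PR AUC from the pr.curve objects
round(c(LR = pr.test$auc.integral,
        CT = pr.test.ct$auc.integral,
        RF = pr.test.rf$auc.integral), 2)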



Precision and Recall Values:

  • LR - P:0.41 R:0.67

  • CT - P:0.31 R:0.55

  • RF - P:0.31 R:0.79

  • RF (ROSE) - P:0.45 R:0.59
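
Precision and Recall can be read off each model's out-of-sample confusion matrix. A minimal sketch for logistic regression, reusing pred_cutoff_test from above (the same pattern applies to the other models):

# precision and recall for logistic regression on the test set
cm <- table(data.test$Attrition, pred_cutoff_test)
precision <- cm["1", "1"] / sum(cm[, "1"])   # TP / (TP + FP)
recall    <- cm["1", "1"] / sum(cm["1", ])   # TP / (TP + FN)
round(c(precision = precision, recall = recall), 2)
## precision    recall 
##      0.41      0.67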



Asymmetrical Misclassification Rate:

  • LR - 0.41

  • CT - 0.54

  • RF - 0.43

  • RF (ROSE) - 0.44
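
These AMR values use the 5:1 cost function defined earlier and can be gathered from the stored variables (assuming testamrlor, testamrct, testamrrf, and rose.test.amr are in the workspace):

# consolidated AMR comparison on the test set
round(c(LR = testamrlor, CT = testamrct, RF = testamrrf, RF.ROSE = rose.test.amr), 2)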



Analysis:

  • Logistic Regression has the maximum ROC AUC; however, the AUC for Random Forest is very close to it, and Recall for Random Forest is higher than for all other models.

  • High Recall results in correctly identifying most of the employees who may leave the company. A lower Recall would mean that more employees who are actually going to leave are missed by the model. Hence, a higher Recall is preferable for a comparatively small decrease in Precision, and Random Forest should be preferred.

  • Random Over Sampling Examples (ROSE) for Random Forest resulted in a slightly higher AUC for the ROC curve but a lower AUC for the Precision Recall curve compared with Random Forest without ROSE. The technique increased Precision, but at the cost of a significant drop in Recall. Since the AUC for the Precision Recall curve fell, artificial sampling failed to provide a better model in this case.

  • Among the sampling techniques, Random Over Sampling Examples (ROSE) gave the best ROC curve.