HR Analytics is defined as the “collection and application of talent data to improve critical talent and business outcomes” (source). Examples include building models that predict the amount of sick leave taken, ascertaining links between employee engagement and business performance, and quantifying the ROI of employee training. These practices enable data-driven decision-making in the HR function, a valuable capability that many companies are now investing in.
My second capstone project is based in the HR Analytics domain. Specifically, I aim to: (1) identify factors that correlate with employee attrition (i.e., the voluntary departure of employees from an organization), and (2) build and test a series of models that predict attrition. There is a clear case for studying this phenomenon: Gallup estimates that the cost of replacing an employee ranges from 1.5 to 2 times the employee’s annual salary, so it is clearly useful to identify and intervene with high-risk employees before actual attrition takes place.
This project uses a fictitious HR dataset provided by IBM, which contains information on 1,470 ‘employees’ such as education level, tenure in the company, amount of overtime worked, and marital status. Subsequent chapters are as follows: Section 2 covers data processing and analysis, which lets me identify useful predictors of employee attrition (Research Task 1). In Section 3, I build a series of predictive models and identify which one best serves my context (Research Task 2). I close with Section 4: Conclusions, in which I discuss limitations and possible lines of future work.
I start by downloading the dataset.
#Download IBM's HR Analytics Dataset
#Source: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
temp <- tempfile(fileext = ".csv")
download.file("https://drive.google.com/u/0/uc?id=1MYrMS546iSBcps-1lkO0aHV-nVJn8htD&export=download", temp)
readfile <- read.csv(temp, quote = "")
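The rest of this report relies on a handful of packages whose load calls are not shown above; the block below is a minimal setup sketch, assuming these packages are already installed (the tidyverse for data wrangling, caret for the modelling utilities, corrplot for the correlation plots, and rpart as the backend for the classification tree).
#Load the packages used throughout this report (sketch; assumes they are installed)
library(tidyverse)   #as_tibble, mutate, select, pipes, tibble
library(caret)       #nearZeroVar, createDataPartition, knn3, train, confusionMatrix
library(corrplot)    #cor.mtest, corrplot
library(rpart)       #backend for the classification tree (method = "rpart")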
The methods section is split into four parts: (1) Data Observation, (2) Data Cleaning, (3) Pre-processing, and (4) Exploration. Observation helps me make preliminary sense of the data, Cleaning removes ‘corrupt’ data that would interfere with analysis, Pre-processing parses the data into a usable format, and Exploration addresses Research Task 1.
I first present the dataset in a tabular form for a quick visualization.
#make sense of the overall data
as_tibble(readfile)
## # A tibble: 1,470 × 35
## Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## <int> <chr> <chr> <int> <chr> <int>
## 1 41 Yes Travel_Rarely 1102 Sales 1
## 2 49 No Travel_Frequently 279 Research & Deve… 8
## 3 37 Yes Travel_Rarely 1373 Research & Deve… 2
## 4 33 No Travel_Frequently 1392 Research & Deve… 3
## 5 27 No Travel_Rarely 591 Research & Deve… 2
## 6 32 No Travel_Frequently 1005 Research & Deve… 2
## 7 59 No Travel_Rarely 1324 Research & Deve… 3
## 8 30 No Travel_Rarely 1358 Research & Deve… 24
## 9 38 No Travel_Frequently 216 Research & Deve… 23
## 10 36 No Travel_Rarely 1299 Research & Deve… 27
## # … with 1,460 more rows, and 29 more variables: Education <int>,
## # EducationField <chr>, EmployeeCount <int>, EmployeeNumber <int>,
## # EnvironmentSatisfaction <int>, Gender <chr>, HourlyRate <int>,
## # JobInvolvement <int>, JobLevel <int>, JobRole <chr>, JobSatisfaction <int>,
## # MaritalStatus <chr>, MonthlyIncome <int>, MonthlyRate <int>,
## # NumCompaniesWorked <int>, Over18 <chr>, OverTime <chr>,
## # PercentSalaryHike <int>, PerformanceRating <int>, …
We see a table of 1,470 rows of employee data, with 35 columns. Let us dive further into each of these columns to understand what properties each of them has.
#delve into individual columns
str(readfile)
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
The str() function lets us see the data contained in each of the 35 columns. We see a wide variety of employee information, ranging from age to income to job role, presented in both text and numeric forms.
In Data Cleaning, I check for N/A values (i.e., missing/blank information in the dataset) and possible duplicated rows, to ensure that I do not get errors or misleading insights as I perform my analysis and modelling. I first check for N/A values with the is.na() function, which reports any N/A values observed.
which(is.na(readfile))
## integer(0)
The output is integer(0), i.e., there are no blank/NA values. Let us now check for duplicate rows.
sum(duplicated(readfile))
## [1] 0
No duplicate rows are spotted either; the data does not require further cleaning at this point.
To work on the data, I have to modify certain columns that are clearly not suited for mathematical processing. For instance, the Gender column consists of the character strings “Male” and “Female”, which cannot be fed directly into correlation or regression calculations. As such, I will apply the following methods:
Splitting multi-categorical data into dummy variables. Columns such as Job Role and Department each hold multiple values (e.g., Sales Rep, Human Resources, Director). To force them into a mathematically-processable format, I will create a dummy variable for each of these individual values. For example, in the case of Job Role, I will create one new column called Job Role_Sales Rep, another called Job Role_Human Resources, and so on until all roles are covered. An employee who is a Sales Rep will have a value of ‘1’ in the Sales Rep column and a value of ‘0’ in all other Job Role columns; the converse holds for a Human Resources employee.
Converting ordinal variables into numerical equivalents. This applies only to the “Business Travel” column, as there is a natural order of travel intensity (Non-Travel, Rarely, Frequently). Specifically, I will recode these levels as ‘0’, ‘1’ and ‘2’ respectively.
Converting binary variables into numerical equivalents. Some columns can only hold two values, such as Gender (M/F), Attrition (Y/N) and Overtime (Y/N). I will recode them as 0/1 (1 = Male, or Yes).
This image sums up the three steps I am trying to achieve. The transformation code itself will not be shown here (it is quite dense!).
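That said, a minimal sketch of one possible implementation of the three steps is shown below. It is illustrative only: the object and column names (e.g., readfiletest1, chosen to match the code that follows) and the dummyVars-based approach are my assumptions, not the original code.
#Sketch only - one possible implementation of the three pre-processing steps
#(1) Dummy variables for the multi-categorical columns
dummies <- dummyVars(~ Department + JobRole + EducationField + MaritalStatus, data = readfile)
dummy_cols <- predict(dummies, newdata = readfile)
#(2) Ordinal recoding of BusinessTravel (Non-Travel = 0, Rarely = 1, Frequently = 2)
travel_levels <- c("Non-Travel" = 0, "Travel_Rarely" = 1, "Travel_Frequently" = 2)
#(3) Binary recoding (1 = Male / Yes), then combine everything into one numeric data frame
readfiletest1 <- readfile %>%
  mutate(Gender = ifelse(Gender == "Male", 1, 0),
         Attrition = ifelse(Attrition == "Yes", 1, 0),
         OverTime = ifelse(OverTime == "Yes", 1, 0),
         Over18 = ifelse(Over18 == "Y", 1, 0),
         BusinessTravel = unname(travel_levels[BusinessTravel])) %>%
  select(-Department, -JobRole, -EducationField, -MaritalStatus) %>%
  cbind(dummy_cols)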
By splitting data into dummy variables, I have just created many new columns, making analysis more cumbersome. To simplify, I will remove near-zero-variance predictors, i.e., predictors whose values are almost constant across employees (for example, a dummy column in which nearly everyone is a ‘0’) and which therefore carry too little information to be worth analyzing. Removing such predictors can be done easily using the nearZeroVar function.
#we have many variables, identify the near zero variance predictors using caret package's nearzerovar function
#<https://riptutorial.com/r/example/24920/removing-features-with-zero-or-near-zero-variance>
nearZeroVar(readfiletest1)
## [1] 7 18 23 33 44 47
The nearZeroVar function identifies the column indices (i.e., column numbers) to remove, and the subset function below strips these columns from our dataset.
#remove those columns from the dataset
readfiletest2 <- subset(readfiletest1,select=-c(nearZeroVar(readfiletest1)))
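As a quick illustrative check (not part of the original analysis), the indices above can be mapped back to column names to confirm which predictors were dropped:
#Illustrative: show the names of the near-zero-variance columns that were removed
colnames(readfiletest1)[nearZeroVar(readfiletest1)]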
We now have a cleaner dataset. I will split it into training and test sets at an 80:20 ratio, a rule of thumb espoused by Roshan Joseph (2022).
# split data into training and test sets, using an 80:20 split
set.seed(1, sample.kind = "Rounding") # if using R 3.5 or earlier, remove the sample.kind argument
test_index <- createDataPartition(as.factor(readfiletest2$Attrition), times = 1, p = 0.2, list = FALSE)
test_set <- readfiletest2[test_index, ]
train_set <- readfiletest2[-test_index, ]
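As a quick sanity check (illustrative, not part of the original analysis), one can confirm the split sizes and verify that the attrition rate is similar in both sets, since createDataPartition stratifies the split on the outcome:
#Illustrative sanity check: split sizes and class balance
nrow(train_set); nrow(test_set)            #roughly 80% and 20% of the 1,470 rows
prop.table(table(train_set$Attrition))     #share of leavers vs. stayers in the training data
prop.table(table(test_set$Attrition))      #should be similar in the test data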
Putting aside the test set, which is reserved for its stated purpose, I will evaluate the training set to identify predictors that are (statistically significantly) correlated with Attrition. This is done through a correlation plot, which evaluates the pairwise correlation between every pair of columns, i.e., variables.
Note that the concept of statistical significance will not be discussed here for parsimony.
First, I need to remove highly correlated variables, which add no independent predictive power (and add modelling overhead).
##### 2.3.1. Remove highly correlated variables
cor_mat<-cor(readfiletest2)
findCorrelation(cor_mat, .99)
## integer(0)
#Find linear combinations of variables
findLinearCombos(readfiletest2)
## $linearCombos
## list()
##
## $remove
## NULL
There are no highly correlated variables, or combinations thereof. Let us continue with the correlation plot.
##### 2.3.2. Remove non-significantly correlated variables (between Attrition and each predictor) #####
#First, generate correlation table with p-values below a certain value
testRes <- cor.mtest(train_set, conf.level = 0.95)
#Then, generate correlation plot
corrplot(cor(train_set), p.mat = testRes$p, method = 'circle', type = 'lower', insig='blank',
addCoef.col ='black', number.cex = 0.2, order = 'AOE', diag=FALSE, tl.cex=0.3, tl.srt=45)
Zooming into the correlation plot above, we see that several columns (i.e., variables) do not have a statistically significant correlation with Attrition. I will remove those variables prior to training the machine learning models, and re-run the correlation plot.
#Identify column indices that have no relationship with attrition
#i.e. in the correlation table containing p-values, identify the 2nd row, which is attrition
#then, identify which column indices exceed the p-value of 0.05
remove_index <- which(((as.data.frame(testRes$p))[2,]) > 0.05)
#manually remove those columns that have no (statistically-significant) correlation with attrition
train_set_cleaned <- subset(train_set,select=-remove_index)
#generate correlation table once again with p-values below a certain value
testRes_cleaned <- cor.mtest(train_set_cleaned, conf.level = 0.95)
#generate correlation plot again, we should see a much cleaner table
corrplot(cor(train_set_cleaned), p.mat = testRes_cleaned$p, method = 'circle', type = 'lower', insig='blank',
addCoef.col ='black', number.cex = 0.25, order = 'AOE', diag=FALSE, tl.cex=0.3, tl.srt=45)
(Notes: a higher value means a stronger correlation; a negative sign means an inverse relationship, i.e., the higher the value of the variable, the lower the attrition risk.)
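For a text-based complement to the plot (illustrative only; output not shown), the retained predictors can also be listed directly, ordered by the strength of their correlation with Attrition:
#Illustrative: predictors ordered by their correlation with Attrition (output not shown)
attr_cor <- cor(train_set_cleaned)[, "Attrition"]
sort(attr_cor[names(attr_cor) != "Attrition"], decreasing = TRUE)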
Notice that the correlation plot is now simpler (i.e., fewer rows and columns), thanks to the removal of the variables that are not correlated with Attrition. Furthermore, the Attrition row is now filled exclusively with significant correlations. This allows us to identify a few insights, thereby answering Research Task 1.
Some of these points intuitively make sense, but it is important to acknowledge that these are but correlations, and not causal relationships. Working overtime does not necessarily cause attrition; there could be a background factor (e.g., an incompatible supervisor) that causes both overtime and attrition. Proving causal relationships requires actual experimentation.
I aim to build a predictive model based on our training data set, and apply it to a test data set, for which the Attrition outcomes are already known. A good model is one that predicts correctly which employees will leave/stay in the test set.
Several such classification models exist. My modelling approach is to experiment with a number of them and identify the ones with the highest accuracy and recall. Accuracy is a general reporting metric. Recall, on the other hand, is the more important metric to prioritize here, as it reflects how many actual leavers are missed as false negatives. Minimizing false negatives (i.e., employees predicted to stay who actually leave) is important to ensure that key talent stays within the organization.
The equations that reflect accuracy and recall are as follows:
\[Accuracy = \displaystyle \frac{TP + TN}{TP + TN + FP + FN}\]

\[Recall = \displaystyle \frac{TP}{TP + FN}\]

Where:

TP = True Positives (i.e., those correctly classified as Attrition)
TN = True Negatives (i.e., those correctly classified as non-Attrition)
FP = False Positives (i.e., those incorrectly classified as Attrition)
FN = False Negatives (i.e., those incorrectly classified as non-Attrition)
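To make the formulas concrete, here is a small worked example using made-up counts (not taken from any of the models below):
#Illustrative only: computing the two metrics from made-up confusion-matrix counts
TP <- 40; TN <- 200; FP <- 10; FN <- 15
accuracy <- (TP + TN) / (TP + TN + FP + FN)   #240 / 265 = approx. 0.906
recall   <- TP / (TP + FN)                    #40 / 55 = approx. 0.727
accuracy; recall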
I will experiment with three different types of models: logistic regression, K-Nearest Neighbors (KNN), and a classification tree. The mathematical reasoning behind each of them is complex and thus out of the scope of this discussion.
Logistic regression is a type of generalized linear model, usually used for binary (two-outcome) classification tasks such as ours. The following code fits the model on the training set and uses it to predict Attrition outcomes in the test set.
###Start with logistic regression###
#Fit the model on the training data
lm_fit <- lm(Attrition ~ ., data = train_set_cleaned)
#Generate predicted attrition scores for the test set (treated as probabilities)
p_hat <- predict(lm_fit, test_set)
#Set outcomes to be 1 (i.e., attrition) if the predicted score exceeds 0.5
y_hat <- ifelse(p_hat > 0.5, "1", "0") |> factor()
test_set$Attrition <- as.factor(test_set$Attrition)
#Run a confusion matrix to evaluate the performance
cm_1 <- confusionMatrix(y_hat, test_set$Attrition)
#Calculate accuracy and recall
Accuracy_1 <- cm_1$overall[["Accuracy"]]
Recall_1 <- cm_1$byClass[["Recall"]]
Accuracy_1
Recall_1
The Accuracy and Recall scores are 0.875 and 0.996 respectively. Note that a perfect score is 1.0, so the Recall score in particular is good. I will record this result for comparison purposes.
#Add a results table to comparatively show how Accuracy and Recall fare
Results_table <- tibble(Method = "Logistic", Accuracy = Accuracy_1, Recall=Recall_1)
Results_table
## # A tibble: 1 × 3
## Method Accuracy Recall
## <chr> <dbl> <dbl>
## 1 Logistic 0.875 0.996
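An aside on implementation: the model above is fit with lm(), which approximates the logistic approach with a linear probability model. A conventional logistic regression would instead use glm() with a binomial family, as sketched below (illustrative only; the results reported in this section come from the lm() fit and could differ slightly).
#Illustrative alternative (not used for the reported results): a conventional logistic fit
glm_fit <- glm(Attrition ~ ., data = train_set_cleaned, family = binomial)
p_hat_glm <- predict(glm_fit, newdata = test_set, type = "response")  #predicted probabilities
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, "1", "0"))
confusionMatrix(y_hat_glm, test_set$Attrition)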
Let us move on to K-Nearest Neighbors (KNN). KNN classifies a given data point by looking at its ‘k’ nearest neighbors as measured by distance (formally, Euclidean distance), and assigning the point to the category that is most common among those neighbors.
A very small k (e.g., k = 1) carries a high risk of overfitting, so I use the common rule of thumb of k = sqrt(number of rows in the training set) (source). Another option is to experiment with different k values until the best test-set accuracy is found, but I refrain from doing so, as tuning against the test set breaks the ‘golden rule’ of machine learning: never train (or tune) the model on one’s testing data!
#Fit data
knn_fit <- knn3(Attrition~.,data=train_set_cleaned, k=sqrt(nrow(train_set_cleaned)))
#Assign each test observation to the class with the higher predicted probability (equivalent to a 0.5 cut-off)
y_hat2raw <- as.data.frame(predict(knn_fit, test_set)) %>% mutate('class'=names(.)[apply(., 1, which.max)]) %>% select(class)
y_hat2 <- as.factor(y_hat2raw$class)
#Run a confusion matrix to evaluate the performance.
cm_2<-confusionMatrix(y_hat2, test_set$Attrition)
#calculate performance metrics
Accuracy_2<-cm_2$overall[["Accuracy"]]
Recall_2<-cm_2$byClass[["Recall"]]
Accuracy_2
Recall_2
We see a drop in Accuracy of about 4 percentage points, while Recall is unchanged. Let’s tabulate the results.
#Add a results table to comparatively show how Accuracy and Recall fare
Results_table <- Results_table %>% add_row(Method="KNN",Accuracy=Accuracy_2,Recall=Recall_2)
as.data.frame(Results_table)
## Method Accuracy Recall
## 1 Logistic 0.8745763 0.9959514
## 2 KNN 0.8372881 0.9959514
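A brief implementation note on the KNN step above: predict() on a knn3 fit also accepts type = "class", which returns the predicted classes directly and would replace the two-step which.max conversion. A minimal sketch:
#Illustrative simplification of the KNN prediction step
y_hat2_alt <- predict(knn_fit, test_set, type = "class")
confusionMatrix(y_hat2_alt, test_set$Attrition)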
The classification tree is another widely used approach, and one that also underpins more sophisticated machine learning methods (e.g., bagging). A classification tree repeatedly splits the data on predictor values, yielding a set of decision rules that assign each observation to a class.
###CLASSIFICATION TREE###
set.seed(1, sample.kind = "Rounding") # if using R 3.5 or earlier, remove the sample.kind argument
train_rpart<- train(Attrition~., method="rpart", data=train_set_cleaned, tuneGrid=data.frame(cp=seq(0,0.05,0.002)))
train_rpart$bestTune
#Accuracy based on the best value of cp (the complexity parameter)
y_hat6raw <- predict(train_rpart,test_set)
y_hat6 <- ifelse(y_hat6raw>0.5,"1","0") %>% factor(levels=c("0","1"))
cm_6 <- confusionMatrix(y_hat6, test_set$Attrition)
#calculate performance metrics
Accuracy_6<-cm_6$overall[["Accuracy"]]
Recall_6<-cm_6$byClass[["Recall"]]
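For interpretability, the fitted tree itself can also be drawn to show the split rules the model learned. This is illustrative only and assumes the rpart.plot package is installed:
#Illustrative: plot the final fitted tree (assumes the rpart.plot package is installed)
library(rpart.plot)
rpart.plot(train_rpart$finalModel)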
Notice that its Recall unfortunately falls short of both Logistic Regression and KNN, and its Accuracy, while better than KNN’s, still trails the logistic model.
#Add a results table to comparatively show how Accuracy and Recall fare
Results_table <- Results_table %>% add_row(Method="Classification Tree",Accuracy=Accuracy_6,Recall=Recall_6)
as.data.frame(Results_table)
## Method Accuracy Recall
## 1 Logistic 0.8745763 0.9959514
## 2 KNN 0.8372881 0.9959514
## 3 Classification Tree 0.8508475 0.9797571
The logistic model is the best variant for me, with the highest accuracy and near-perfect recall (KNN matches that recall, but at a noticeably lower accuracy). That said, I have observed one limitation: running the regression produced a warning of a ‘rank-deficient fit’. This means that some variables are perfectly correlated (unlikely, since I ran the findLinearCombos check earlier), or that there are more model parameters than observations (source). While there are indeed more observations than parameters (>1,000 observations vs 35 parameters), it is likely that not all possible combinations of parameter values are present in the data, which can lead to misleading estimates. Future areas of work include: (1) using academic literature to identify plausible relationships and reduce the number of parameters to be modelled, instead of the ‘spray-and-pray’ approach used here where every available variable is thrown in (this is where domain knowledge is important!), or (2) simply obtaining a larger dataset, though I imagine that would require access to data from very large companies. It is also for this reason that I avoided cross-validation approaches (which involve re-sampling the same data multiple times), as predictions may become less accurate.
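As a small follow-up diagnostic that could support this line of future work, one could inspect which coefficients the fit was unable to estimate. An illustrative sketch, assuming the lm_fit object from Section 3:
#Illustrative diagnostic: list any coefficients that could not be estimated (returned as NA)
names(coef(lm_fit))[is.na(coef(lm_fit))]
#alias() additionally reports exact linear dependencies among the predictors, if any
alias(lm_fit)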
In any case, this exercise has helped me to better understand the use cases for HR Analytics, as well as its limitations. Predicting attrition would be difficult without a substantial pool of data, and it may be better to apply machine learning concepts towards other sub-realms where data is plentiful (e.g., HR chatbot interactions, caseloads, or employee sentiments).