1 Introduction

Decision tree algorithms are a class of machine learning algorithms used for both classification and regression tasks. They work by recursively splitting the dataset into subsets based on the values of input features, ultimately creating a tree-like structure of decisions that leads to a predicted output.

In the context of our employee attrition dataset, decision tree algorithms are appropriate for the following reasons:

Interpretability: Decision trees provide transparent and easy-to-understand rules that map input features to predicted outcomes. This can be valuable in understanding the factors that contribute to employee attrition.

Feature Importance: Decision trees can rank features based on their importance in the decision-making process. This helps identify which factors have the most significant impact on employee attrition (a short sketch of extracting this ranking is given after this list).

Non-linearity: Decision trees can capture complex relationships and interactions between features, which could be present in employee attrition scenarios.

Handling Missing Values: Decision trees can handle missing values robustly by using the available data (and, in the case of rpart, surrogate splits) at each split.

No Assumption about Data Distribution: Decision trees do not assume any specific data distribution, making them suitable for various types of data.
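As a quick illustration of the feature-importance point above, a fitted rpart tree stores a variable-importance ranking in the returned object. A minimal sketch, where fit is a hypothetical tree already fitted with rpart():

## Variable importance of a fitted rpart tree (hypothetical object `fit`);
## larger values mean a larger contribution to the splits (including surrogate splits).
imp = fit$variable.importance
round(100 * imp / sum(imp), 1)   # rescale to percentages for easier reading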

A decision tree (DT) classifier predicts by estimating class probabilities conditional on the region of the feature space defined by each terminal node. Unlike many other classification algorithms, decision trees generate explicit rules. A rule is a conditional statement that can easily be understood by humans and easily applied within a database to identify a set of records, which makes trees easy to interpret and to implement in real-world applications. Among the basic tree-based algorithms, the Classification and Regression Tree (CART) is the one most frequently used in practice.
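CART grows a tree by choosing, at each node, the split that most reduces node impurity; the two impurity measures used later in this note ("gini" and "information", i.e., entropy) can be written down directly for a binary node. A minimal sketch with illustrative helper functions (not part of rpart), expressed in terms of the node's attrition proportion p:

gini    = function(p) 2 * p * (1 - p)                 # Gini index for a two-class node
entropy = function(p) ifelse(p %in% c(0, 1), 0,       # entropy ("information"), in bits
                             -p * log2(p) - (1 - p) * log2(1 - p))
gini(0.3)      # 0.42
entropy(0.3)   # about 0.88

Both measures are largest for a 50/50 node and zero for a pure node; rpart picks the split that gives the largest drop in the weighted impurity of the child nodes.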

For our analysis, we can follow these steps:

Data Preparation: Clean and preprocess the employee attrition dataset, handling missing values and encoding categorical variables.

Split Data: Divide the dataset into training and testing subsets.

Model Selection: Choose a decision tree algorithm, such as ID3, C4.5, or CART (or a tree ensemble such as Random Forest), based on the requirements and the characteristics of the dataset.

Model Training: Train the selected decision tree algorithm on the training data.

Model Evaluation: Evaluate the model’s performance using appropriate metrics such as accuracy, precision, recall, F1-score, and AUC. Utilize techniques like cross-validation to assess stability.

Visualize the Tree: If the decision tree is not too complex, visualize it to understand the hierarchy of decisions and feature importance.

Hyperparameter Tuning: Optimize hyperparameters, such as tree depth or minimum samples per leaf, to avoid overfitting or underfitting (a pruning sketch is given after this list).

Interpretation: Analyze the tree’s rules and feature importance to gain insights into the factors contributing to employee attrition.

Prediction and Deployment: Use the trained decision tree model to make predictions on new, unseen data, and potentially deploy it in your HR or business process.
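As noted in the hyperparameter-tuning step, a common way to control tree size with rpart is cost-complexity pruning based on the cross-validated error stored in the cp table. A minimal sketch, where fit is a hypothetical tree already fitted with rpart():

printcp(fit)                                            # cp table with cross-validated error (xerror)
best.cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  = prune(fit, cp = best.cp)                      # prune back to the cp with the smallest xerror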

2 Case Study - Predicting Attrition

Decision trees give us a new type of model, different from the logistic regression and neural network models considered previously. We first load the analytic data set.

library(rpart)        # tree fitting
library(rpart.plot)   # tree plotting
library(pROC)         # ROC and AUC calculation
attrition = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")
## Random split: 70% training, 30% testing
# set.seed(2023)       # uncomment to make the random split reproducible
n = dim(attrition)[1]                                    # sample size
train.id = sample(1:n, round(0.7*n), replace = FALSE)    # sampling without replacement
train = attrition[train.id, ]    # training data
test  = attrition[-train.id, ]   # testing data
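Optionally, we can check that the split produced subsets of the expected sizes and with similar attrition rates; Attrition, with labels "Yes" and "No", is the response used throughout this case study.

dim(train); dim(test)                 # 70/30 split sizes
prop.table(table(train$Attrition))    # attrition rate in the training data
prop.table(table(test$Attrition))     # attrition rate in the testing data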


The rpart() function offers a lot of flexibility for constructing decision trees through its user-controlled options. It is particularly useful in applications where the costs of false positives and false negatives are different.

Next, we write a wrapper so we can build different decision trees conveniently.

# Arguments passed into rpart():
# 1. data set (training or testing)
# 2. penalty coefficients fp and fn for the loss matrix
# 3. impurity measure ("gini" or "information")
##
tree.builder = function(attrition, fp, fn, purity){
  tree = rpart(Attrition ~ .,              # include all features
               data = attrition,
               na.action = na.rpart,       # default: rows with a missing outcome are dropped,
                                           # rows with missing predictors are kept
               method = "class",           # classification tree (categorical outcome)
               model  = FALSE,
               x = FALSE,
               y = TRUE,
               parms = list(               # loss matrix: penalize false positives (fp) and
                                           # false negatives (fn) differently
                 loss  = matrix(c(0, fp, fn, 0), ncol = 2, byrow = TRUE),
                 split = purity),          # "gini" or "information"
               ## rpart algorithm options
               control = rpart.control(
                 minsplit  = 10,           # minimum observations in a node before a split is attempted
                 minbucket = 10,           # minimum observations in any terminal node (default is minsplit/3)
                 cp   = 0.01,              # complexity parameter for the stopping rule (larger cp -> smaller tree)
                 xval = 10                 # number of cross-validations
               )
  )
  tree                                     # return the fitted tree
}
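To see how the fp and fn arguments enter the model, note that tree.builder() places them in the off-diagonal entries of a 2-by-2 loss matrix. Given the model descriptions below (e.g., a tree with fp = 1 and fn = 10 penalizes a false negative ten times as heavily as a false positive), the rows of this matrix correspond to the actual class and the columns to the predicted class, in the level order "No", "Yes":

## Loss matrix built inside tree.builder() for fp = 1, fn = 10
## (rows = actual class, columns = predicted class, level order "No", "Yes"):
##              pred "No"   pred "Yes"
## actual No        0            1      <- false positive cost (fp)
## actual Yes      10            0      <- false negative cost (fn)
matrix(c(0, 1, 10, 0), ncol = 2, byrow = TRUE)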

Using the above function, we define six different decision tree models in the following.

  • Model 1: gini.tree.1.1 uses the Gini index without penalizing false positives or false negatives.

  • Model 2: info.tree.1.1 uses entropy without penalizing false positives or false negatives.

  • Model 3: gini.tree.1.10 uses the Gini index with the cost of a false negative set to 10 times that of a false positive.

  • Model 4: info.tree.1.10 uses entropy with the cost of a false negative set to 10 times that of a false positive.

  • Model 5: gini.tree.10.1 uses the Gini index with the cost of a false positive set to 10 times that of a false negative.

  • Model 6: info.tree.10.1 uses entropy with the cost of a false positive set to 10 times that of a false negative.

The first four models are built and plotted below; the two trees that penalize false positives (models 5 and 6) are constructed later as part of the ROC analysis. The tree diagrams of the non-penalized and penalized models are given in Figures 14 and 15.

## Call the tree model wrapper.
gini.tree.1.1 = tree.builder(attrition = train, fp = 1, fn = 1, purity = "gini")
info.tree.1.1 = tree.builder(attrition = train, fp = 1, fn = 1, purity = "information")
gini.tree.1.10 = tree.builder(attrition = train, fp = 1, fn = 10, purity = "gini")
info.tree.1.10 = tree.builder(attrition = train, fp = 1, fn = 10, purity = "information")
## tree plots
par(mfrow=c(1,2))
rpart.plot(gini.tree.1.1, main = "Tree with Gini index: non-penalization")
rpart.plot(info.tree.1.1, main = "Tree with entropy: non-penalization")

Figure 14. Non-penalized decision tree models using Gini index (left) and entropy (right).

par(mfrow=c(1,2))
rpart.plot(gini.tree.1.10, main = "Tree with Gini index: penalization")
rpart.plot(info.tree.1.10, main = "Tree with entropy: penalization")

Figure 15. Penalized decision tree models using Gini index (left) and entropy (right).
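Because interpretability was one of the main reasons for choosing a tree model, the fitted trees can also be printed as plain-text rules. A minimal sketch, assuming the installed version of rpart.plot provides rpart.rules():

## Print the decision rules of the non-penalized Gini tree; each row describes one
## terminal node (its fitted attrition probability and the conditions leading to it).
rpart.rules(gini.tree.1.1)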

2.1 ROC for Model Selection

We defined six candidate decision tree models above (four of which were plotted). Next, we use ROC analysis to select the best among these candidates.

# Function returning a sensitivity/specificity matrix and the AUC
SenSpe = function(attrition, fp, fn, purity){
  cutoff = seq(0, 1, length = 20)            # 20 cut-offs including 0 and 1
  model = tree.builder(attrition, fp, fn, purity)
  ## Caution: the decision tree returns probabilities for both classes.
  ## Only the "Yes" (success) probability is needed to define sensitivity and specificity.
  pred = predict(model, newdata = attrition, type = "prob")   # two-column matrix
  senspe.mtx = matrix(0, ncol = length(cutoff), nrow = 2, byrow = FALSE)
  for (i in 1:length(cutoff)){
    # "Yes" and "No" are the labels of Attrition in this data set;
    # the following line uses only the "Yes" probability: pred[, "Yes"]
    pred.out = ifelse(pred[, "Yes"] >= cutoff[i], "Yes", "No")
    TP = sum(pred.out == "Yes" & attrition$Attrition == "Yes")
    TN = sum(pred.out == "No"  & attrition$Attrition == "No")
    FP = sum(pred.out == "Yes" & attrition$Attrition == "No")
    FN = sum(pred.out == "No"  & attrition$Attrition == "Yes")
    senspe.mtx[1, i] = TP/(TP + FN)          # sensitivity
    senspe.mtx[2, i] = TN/(TN + FP)          # specificity
  }
  ## A smoother ROC approximation and the AUC, using library {pROC}
  prediction = pred[, "Yes"]
  category = attrition$Attrition == "Yes"
  ROCobj <- roc(category, prediction)
  AUC = auc(ROCobj)
  ##
  list(senspe.mtx = senspe.mtx, AUC = round(AUC, 3))
}

The above function has four arguments that let the user build different types of decision trees, including those discussed in the previous subsection. Next, we apply this function to the six candidate trees and plot their corresponding ROC curves so we can compare the global performance of these tree models.

giniROC11 = SenSpe(attrition = train, fp=1, fn=1, purity="gini")
infoROC11 = SenSpe(attrition = train, fp=1, fn=1, purity="information")
giniROC110 = SenSpe(attrition = train, fp=1, fn=10, purity="gini")
infoROC110 = SenSpe(attrition = train, fp=1, fn=10, purity="information")
giniROC101 = SenSpe(attrition = train, fp=10, fn=1, purity="gini")
infoROC101 = SenSpe(attrition = train, fp=10, fn=1, purity="information")

Next, we plot the ROC curves and calculate the areas under the ROC curves for the individual decision tree models.

par(pty="s")      # set up square plot through graphic parameter
plot(1-giniROC11$senspe.mtx[2,], giniROC11$senspe.mtx[1,], type = "l", xlim=c(0,1), ylim=c(0,1), 
     xlab="1 - specificity: FPR", ylab="Sensitivity: TPR", col = "blue", lwd = 2,
     main="ROC Curves of Decision Trees", cex.main = 0.9, col.main = "navy")
abline(0,1, lty = 2, col = "orchid4", lwd = 2)
lines(1-infoROC11$senspe.mtx[2,], infoROC11$senspe.mtx[1,], col = "firebrick2", lwd = 2, lty=2)
lines(1-giniROC110$senspe.mtx[2,], giniROC110$senspe.mtx[1,], col = "olivedrab", lwd = 2)
lines(1-infoROC110$senspe.mtx[2,], infoROC110$senspe.mtx[1,], col = "skyblue", lwd = 2)
lines(1-giniROC101$senspe.mtx[2,], giniROC101$senspe.mtx[1,], col = "red", lwd = 2, lty = 4)
lines(1-infoROC101$senspe.mtx[2,], infoROC101$senspe.mtx[1,], col = "sienna3", lwd = 2)
legend("bottomright", c(paste("gini.1.1,  AUC =", giniROC11$AUC), 
                        paste("info.1.1,   AUC =",infoROC11$AUC), 
                        paste("gini.1.10, AUC =",giniROC110$AUC), 
                        paste("info.1.10, AUC =",infoROC110$AUC),
                        paste("gini.10.1, AUC =",giniROC101$AUC), 
                        paste("info.10.1, AUC =",infoROC101$AUC)),
                        col=c("blue","firebrick2","olivedrab","skyblue","red","sienna3"), 
                        lty=rep(1,6), lwd=rep(2,6), cex = 0.5, bty = "n")

Figure 16. Comparison of ROC curves

The above ROC curves represent the six decision trees together with their corresponding AUCs. The model with the largest AUC is considered the best decision tree among the candidates.

The largest of the six AUC (area under the curve) values, 0.875, indicates that the corresponding decision tree performs relatively well in distinguishing between the positive and negative classes in our dataset. The AUC measures a model's ability to correctly rank instances of the two classes by their predicted probabilities: a value close to 1.0 indicates strong discrimination, while a value close to 0.5 suggests performance no better than random guessing. An AUC of 0.875 therefore indicates that this decision tree separates the two classes effectively.

2.2 Optimal Cut-off Score Determination

As usual, once the final model is determined, we need to find the optimal cut-off score for reporting the predictive performance of the final model on the test data. Keep in mind that the optimal cut-off must be determined through cross-validation on the training data set only.

In practical applications, one may end up with two or more final models with similar AUCs. In that case, we report the performance of all final models on the test data and let clients choose one to deploy (possibly keeping the rest as challengers). For this reason, we write a function to determine the optimal cut-off for a given decision tree in this project, since different decision trees have their own optimal cut-offs.

Optm.cutoff = function(attrition, fp, fn, purity){
  n0 = floor(dim(attrition)[1]/5)              # fold size for 5-fold cross-validation
  cutoff = seq(0, 1, length = 20)              # candidate cut-off probabilities
  ## accuracy for each fold and each candidate cut-off
  accuracy.mtx = matrix(0, ncol = 20, nrow = 5)   # 5 folds by 20 candidate cut-offs
  ##
  for (k in 1:5){
    valid.id  = ((k-1)*n0 + 1):(k*n0)
    valid.dat = attrition[valid.id, ]
    train.dat = attrition[-valid.id, ]
    ## tree model fitted on the training folds only
    tree.model = tree.builder(train.dat, fp, fn, purity)
    ## predicted "Yes" probabilities on the validation fold
    pred = predict(tree.model, newdata = valid.dat, type = "prob")[, "Yes"]
    ## loop over candidate cut-offs
    for (i in 1:20){
      ## predicted class labels
      pc.1 = ifelse(pred > cutoff[i], "Yes", "No")
      ## accuracy
      a1 = mean(pc.1 == valid.dat$Attrition)
      accuracy.mtx[k, i] = a1
    }
  }
  avg.acc = apply(accuracy.mtx, 2, mean)
  ## plot of average accuracy against the candidate cut-offs
  n = length(avg.acc)
  idx = which(avg.acc == max(avg.acc))
  tick.label = as.character(round(cutoff, 2))
  ##
  plot(1:n, avg.acc, xlab = "cut-off score", ylab = "average accuracy",
       ylim = c(min(avg.acc), 1),
       axes = FALSE,
       main = paste("5-fold CV optimal cut-off \n ", purity, "(fp, fn) = (", fp, ",", fn, ")", collapse = ""),
       cex.main = 0.9,
       col.main = "navy")
  axis(1, at = 1:20, labels = tick.label, las = 2)
  axis(2)
  points(idx, avg.acc[idx], pch = 19, col = "red")
  segments(idx, min(avg.acc), idx, avg.acc[idx], col = "red")
  text(idx, avg.acc[idx] + 0.03, as.character(round(avg.acc[idx], 4)), col = "red", cex = 0.8)
  invisible(cutoff[idx])                       # return the optimal cut-off(s) invisibly
}

For demonstration, we use the above function to calculate the optimal cut-offs of the six decision trees constructed earlier.

par(mfrow=c(3,2))
Optm.cutoff(attrition = train, fp=1, fn=1, purity="gini")
Optm.cutoff(attrition = train, fp=1, fn=1, purity="information")
Optm.cutoff(attrition = train, fp=1, fn=10, purity="gini")
Optm.cutoff(attrition = train, fp=1, fn=10, purity="information")
Optm.cutoff(attrition = train, fp=10, fn=1, purity="gini")
Optm.cutoff(attrition = train, fp=10, fn=1, purity="information")

Figure 17: Plot of optimal cut-off determination

As anticipated, different trees have their own optimal cut-offs. Keep in mind that the cut-off is random (it depends on the randomly split training data), so different runs may produce different cut-offs. The result also depends on the tree size, and we may sometimes end up with multiple optimal cut-offs. Technically, any one of them can be chosen for implementation; a better recommendation is to average the multiple cut-offs and use that average as the final cut-off on the testing data set.
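As a final illustration of the last point, here is a hedged sketch of applying a chosen tree and cut-off to the hold-out test data. The model choice and the cut-off value of 0.5 below are placeholders; in practice they would be replaced by the winning model and the cut-off identified by Optm.cutoff().

final.tree = tree.builder(attrition = train, fp = 1, fn = 10, purity = "gini")  # placeholder final model
final.cut  = 0.5                                                                # placeholder cut-off
test.prob  = predict(final.tree, newdata = test, type = "prob")[, "Yes"]
test.class = ifelse(test.prob >= final.cut, "Yes", "No")
table(Predicted = test.class, Actual = test$Attrition)   # confusion matrix on the test data
mean(test.class == test$Attrition)                       # test-set accuracy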

3 Concluding Remarks

Our decision tree model achieved an area under the ROC curve (AUC) of 0.875. A higher AUC indicates better discriminatory power in distinguishing between positive and negative cases. In the context of employee attrition analysis, this suggests that the decision tree model ranks and classifies employees well with respect to attrition risk.

However, the choice of a cutoff score is important for practical deployment and decision-making. The cutoff score determines how the model’s predictions are converted into binary classifications (positive or negative). The selection of a cutoff involves a trade-off between sensitivity (true positive rate) and specificity (true negative rate).

Model Performance: An AUC of 0.875 is generally considered good, indicating that our decision tree model differentiates well between employees who will leave and those who will not.

Cutoff Selection: The optimal cutoff score depends on the specific business context and the relative importance of false positives (predicting attrition when it will not occur) versus false negatives (failing to predict attrition when it will occur). We should choose a cutoff that aligns with the organization’s goals and risk tolerance.

Trade-off Consideration: We need to balance sensitivity and specificity based on our business needs. A higher sensitivity might be preferred if missing actual attrition cases is more costly, while a higher specificity might be chosen if avoiding false alarms is a higher priority.

Further Analysis: To make an informed decision about the cutoff, we might want to analyze precision, recall, F1-score, and other relevant metrics (a short sketch is given below). We could also perform a sensitivity analysis by evaluating the model’s performance across a range of cutoff values.

Deployment and Monitoring: Keep in mind that model performance can vary in real-world scenarios. It’s important to monitor the model’s performance after deployment and refine the cutoff if necessary.
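A minimal sketch of the additional metrics mentioned in the Further Analysis point, computed from predicted and actual class labels with "Yes" treated as the positive class (an illustrative helper, not part of the analysis above):

attrition.metrics = function(pred.class, actual){
  TP = sum(pred.class == "Yes" & actual == "Yes")
  FP = sum(pred.class == "Yes" & actual == "No")
  FN = sum(pred.class == "No"  & actual == "Yes")
  precision = TP / (TP + FP)
  recall    = TP / (TP + FN)                       # same as sensitivity
  F1        = 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, F1 = F1)
}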

In summary, our Decision Tree model is demonstrating strong predictive performance, but the choice of a cutoff score should be guided by the specific needs of the organization and the implications of false positives and false negatives in the context of employee attrition.