Decision trees are models used to make predictions in statistical problems where the output variable is categorical. In this project, we will perform an exercise using a simulated database to estimate whether customers of a financial institution have any potential for default or not.

This type of analysis is extremely useful in the real world as it assists financial institutions in decision-making, such as granting credit, while minimizing the risks of default. Based on customer characteristics, it is possible to identify those with a higher probability of meeting their financial obligations, thus ensuring better risk management.

Package installation

# List of packages to be installed or loaded
packages <- c('tidyverse', 'rpart', 'rpart.plot', 'gtools', 'Rmisc', 'gridExtra', 'scales', 'caret', 'plotROC')

# Check if the packages are already installed; if not, install them
if (sum(!as.numeric(!packages %in% installed.packages())) != 0) {
  installer <- packages[!packages %in% installed.packages()]
  for (i in 1:length(installer)) {
    install.packages(installer, dependencies = TRUE)
    break()
  }
  sapply(packages, require, character = TRUE) 
} else {
  sapply(packages, require, character = TRUE) 
}

Creating the database to be used in the exercise

Let’s create a fictitious database that simulates a customer portfolio of a financial institution. Each observation will represent a customer, the independent variables will represent characteristics of that customer, and our output variable will be binary categorical, indicating whether the customer has any history of default or not.

set.seed(42)  # Setting a random seed for result reproducibility

# Generating simulated variables with correlation
age <- sample(18:70, 10000, replace = TRUE)

# Generating correlated variables using the rmvnorm() function from the mvtnorm package
library(mvtnorm)

mean_values <- c(5000, 2000, 0.5, 5)  # Means of the variables
correlation_matrix <- matrix(c(1, 0.3, 0.2, -0.1, 
                                0.3, 1, -0.1, 0.2, 
                                0.2, -0.1, 1, 0.4, 
                                -0.1, 0.2, 0.4, 1), nrow = 4)  # Correlation matrix

simulated_data <- rmvnorm(10000, mean = mean_values, sigma = correlation_matrix)

income <- simulated_data[, 1]
debt <- simulated_data[, 2]
credit_utilization <- pmin(pmax(simulated_data[, 3], 0), 1)  # Limits credit utilization between 0 and 1
recent_inquiries <- pmax(round(simulated_data[, 4]), 0)  # Ensures that the number of recent inquiries is non-negative and non-decimal

# Generating a linear predictor from the explanatory variables
linear_predictor <- -7 - 0.01*age - 0.0002*income + 0.003*debt - 3*credit_utilization + 0.5*recent_inquiries

# Calculating the probability of default (PD) using the logit link function
prob_default <- plogis(linear_predictor)
# Note: plogis(x) is equivalent to 1/(1+exp(-x))

# Generating default as a Bernoulli variable based on the probability of default
default <- rbinom(10000, size = 1, prob = prob_default)

# Creating a dataframe
data <- data.frame(age, income, debt, credit_utilization, recent_inquiries, default)
data$default <- ifelse(data$default == 0, "N", "Y")
data$default <- factor(data$default)

head(data)
##   age   income     debt credit_utilization recent_inquiries default
## 1  66 5000.309 1999.040          0.6570972                6       N
## 2  54 5001.162 2001.150          0.0000000                5       N
## 3  18 4999.755 1999.397          1.0000000                6       N
## 4  42 4999.277 1999.902          0.0000000                5       N
## 5  27 4999.854 1999.030          0.1964189                4       N
## 6  53 5000.227 2000.533          0.9166606                5       N

Splitting the dataset

## Creating a vector to split the dataset into a 75/25 proportion
bool_train <- stats::runif(dim(data)[1]) > 0.25

## Using the vector to separate the dataset into "train" and "test" sets
train <- data[bool_train,]
test  <- data[!bool_train,]

data %>% str()
## 'data.frame':    10000 obs. of  6 variables:
##  $ age               : int  66 54 18 42 27 53 35 66 64 41 ...
##  $ income            : num  5000 5001 5000 4999 5000 ...
##  $ debt              : num  1999 2001 1999 2000 1999 ...
##  $ credit_utilization: num  0.657 0 1 0 0.196 ...
##  $ recent_inquiries  : num  6 5 6 5 4 5 8 4 4 4 ...
##  $ default           : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

Creating the first tree

# Creating a decision tree without constraints (maxdepth=30) using the "train" dataset
tree <- rpart::rpart(default ~ age + income + debt + credit_utilization + recent_inquiries,
                     data = train,
                     method = 'class',
                     xval = 5,
                     control = rpart.control(cp = 0,
                                             minsplit = 1,
                                             maxdepth = 30))

Evaluating the tree on the training and test datasets

Using predict to store the predictions of the tree using the training and test datasets in the “p_train” object. Afterward, we transform the predicted values into Y or N with a cutoff probability of 0.5 and save them in the “c_train” object.

# Generating predictions on the training data using the original tree
p_train <- stats::predict(tree, train)
c_train <- base::factor(ifelse(p_train[, 2] > 0.5, "Y", "N"))

# Generating predictions on the test data using the original tree
p_test <- stats::predict(tree, test)
c_test <- base::factor(ifelse(p_test[, 2] > 0.5, "Y", "N"))

## Calculating accuracy of the tree on the "train" dataset
tab <- table(c_train, train$default)
accuracy_train <- (tab[1, 1] + tab[2, 2]) / nrow(train)
sprintf('Accuracy on the training dataset: %s ', percent(accuracy_train))
## [1] "Accuracy on the training dataset: 100% "
## Calculating accuracy of the tree on the "test" dataset
tab2 <- table(c_test, test$default)
accuracy_test <- (tab2[1, 1] + tab2[2, 2]) / nrow(test)
sprintf('Accuracy on the test dataset: %s ', percent(accuracy_test))
## [1] "Accuracy on the test dataset: 69% "

ROC curve for training

Let’s calculate the area under the ROC curve using the twoClassSummary function in Caret. This function expects input in the form of a dataframe with the following layout:

obs: a column containing a factor with the observed classes. pred: a factor with the predicted classes. Y: contains the probability of default. N: contains the probability of non-default.

evaluation_train <- data.frame(obs = train$default,
                               pred = c_train,
                               Y = p_train[, 2],
                               N = 1 - p_train[, 2]
)

# Summarizing two-class classification on the training data
caret::twoClassSummary(evaluation_train, lev = levels(evaluation_train$obs))
##       ROC      Sens      Spec 
## 0.9999651 0.9982310 0.9945711
# Generating an ROC curve for the training dataset
ROC_curve <- ggplot2::ggplot(evaluation_train, aes(d = obs, m = Y, colour = '1')) + 
  plotROC::geom_roc(n.cuts = 0) +
  scale_color_viridis_d(direction = -1, begin = 0, end = .25) +
  theme(legend.position = "none") +
  ggtitle("ROC Curve - Training Data")

ROC_curve

ROC curve for the test dataset

# Creating a data frame for test evaluation
evaluation_test <- data.frame(obs = test$default, 
                              pred = c_test, 
                              Y = p_test[, 2], 
                              N = 1 - p_test[, 2]
)

# Summarizing two-class classification on the test data
twoClassSummary(evaluation_test, lev = levels(evaluation_test$obs))
##       ROC      Sens      Spec 
## 0.5952859 0.7845333 0.4063492
# Generating an ROC curve for the test dataset
ROC_curve_2 <- ggplot(evaluation_test, aes(d = obs, m = Y, colour = 'a')) +
  plotROC::geom_roc(n.cuts = 0) +
  scale_color_viridis_d(direction = -1, begin = 0, end = .25) +
  theme(legend.position = "none") +
  ggtitle("ROC Curve - Test Data")

ROC_curve_2

The ROC curve we constructed using the test data does not have a very good accuracy. This result likely occurs due to overfitting of our tree to the training data. To resolve this issue, we can prune our tree based on the point on the complexity path (cp) that has the lowest relative error (xerror). This way, we can search for a tree that better generalizes the data and improves performance on previously unseen data.

Pruned/Tuned tree

Let’s define and save the cp point in an object called “cp_min” and create a new tree using this object as a parameter

# Displaying the complexity parameter table for the original tree
tab_cp <- rpart::printcp(tree)
## 
## Classification tree:
## rpart::rpart(formula = default ~ age + income + debt + credit_utilization + 
##     recent_inquiries, data = train, method = "class", control = rpart.control(cp = 0, 
##     minsplit = 1, maxdepth = 30), xval = 5)
## 
## Variables actually used in tree construction:
## [1] age                credit_utilization debt               income            
## [5] recent_inquiries  
## 
## Root node error: 1842/7495 = 0.24576
## 
## n= 7495 
## 
##            CP nsplit rel error  xerror     xstd
## 1  0.01176258      0  1.000000 1.00000 0.020235
## 2  0.01031488      4  0.946254 0.99349 0.020191
## 3  0.00579081      6  0.925624 0.95657 0.019931
## 4  0.00434311     10  0.901737 0.95440 0.019915
## 5  0.00271444     12  0.893051 0.95494 0.019919
## 6  0.00230727     13  0.890337 0.95494 0.019919
## 7  0.00217155     17  0.881107 0.95548 0.019923
## 8  0.00190011     19  0.876764 0.95657 0.019931
## 9  0.00162866     23  0.869164 0.96634 0.020001
## 10 0.00135722     40  0.830619 0.97720 0.020077
## 11 0.00126674     42  0.827904 0.99294 0.020187
## 12 0.00108578     55  0.808360 0.99294 0.020187
## 13 0.00096513     80  0.780673 1.00109 0.020243
## 14 0.00095005     94  0.761129 1.00814 0.020291
## 15 0.00090481    104  0.748643 1.00869 0.020294
## 16 0.00084449    142  0.694354 1.02823 0.020424
## 17 0.00081433    153  0.684039 1.03094 0.020442
## 18 0.00072385    298  0.529316 1.03909 0.020495
## 19 0.00069800    320  0.513029 1.04940 0.020562
## 20 0.00067861    327  0.508143 1.05103 0.020572
## 21 0.00063337    331  0.505429 1.05592 0.020603
## 22 0.00054289    347  0.492942 1.11781 0.020979
## 23 0.00049353    676  0.307818 1.12866 0.021042
## 24 0.00048257    694  0.298046 1.13355 0.021070
## 25 0.00046533    706  0.292074 1.13572 0.021083
## 26 0.00043431    716  0.286645 1.14169 0.021116
## 27 0.00040717    723  0.283388 1.14549 0.021138
## 28 0.00038778    757  0.268730 1.17372 0.021293
## 29 0.00036193    780  0.258415 1.17372 0.021293
## 30 0.00032573    885  0.218241 1.18350 0.021345
## 31 0.00027144    938  0.198154 1.25841 0.021723
## 32 0.00021716   1513  0.032030 1.26276 0.021744
## 33 0.00018096   1541  0.025516 1.27850 0.021817
## 34 0.00013572   1603  0.011944 1.27850 0.021817
## 35 0.00000000   1611  0.010858 1.27850 0.021817
# Plotting the complexity parameter plot
plotcp(tree)

# Finding the row with the minimum cross-validated error
min_cp_row <- tab_cp[which.min(tab_cp[, 'xerror']), ]

# Extracting the minimum complexity parameter (CP)
cp_min <- min_cp_row['CP']

## Creating a new pruned tree using the previously defined CP
set.seed(1)
pruned_tree <- rpart::rpart(default ~ age + income + debt + credit_utilization + recent_inquiries,
                            data = train,
                            method = 'class',
                            xval = 0,
                            control = rpart.control(cp = cp_min,
                                                    minsplit = 1,
                                                    maxdepth = 30))

Evaluating the pruned tree on the training and test datasets

# Generating predictions on the training data using the pruned tree
p_train_poda <- stats::predict(pruned_tree, train)
c_train_poda <- base::factor(ifelse(p_train_poda[, 2] > 0.5, "Y", "N"))

# Generating predictions on the test data using the pruned tree
p_test_poda <- stats::predict(pruned_tree, test)
c_test_poda <- base::factor(ifelse(p_test_poda[, 2] > 0.5, "Y", "N"))

Training dataset

Initial tree vs. Pruned tree on the training dataset

# Creating a data frame for training evaluation
evaluation_train_poda <- data.frame(obs = train$default,
                                    pred = c_train_poda,
                                    Y = p_train_poda[, 2],
                                    N = 1 - p_train_poda[, 2]
)

# Summary for two-class classification on the training data
caret::twoClassSummary(evaluation_train, lev = levels(evaluation_train$obs))
##       ROC      Sens      Spec 
## 0.9999651 0.9982310 0.9945711
caret::twoClassSummary(evaluation_train_poda, lev = levels(evaluation_train_poda$obs))
##       ROC      Sens      Spec 
## 0.7576569 0.9303025 0.3121607
# ROC curve for the "pruned" tree on the training dataset
ROC_curve_3 <- ggplot2::ggplot(evaluation_train_poda, aes(d = obs, m = Y, colour = '1')) + 
  plotROC::geom_roc(n.cuts = 0) +
  scale_color_viridis_d(direction = -1, begin = 0, end = .25) +
  theme(legend.position = "none") +
  ggtitle("ROC Curve (Pruned) - Training Data")

# Arranging the plots side by side
grid.arrange(ROC_curve, ROC_curve_3, ncol = 2)

Test dataset

Initial tree vs. Pruned tree on the test dataset

# Creating a data frame for test evaluation
evaluation_test_poda <- data.frame(obs = test$default,
                                    pred = c_test_poda,
                                    Y = p_test_poda[, 2],
                                    N = 1 - p_test_poda[, 2]
)

# Summary for two-class classification on the test data
twoClassSummary(evaluation_test, lev = levels(evaluation_train$obs))
##       ROC      Sens      Spec 
## 0.5952859 0.7845333 0.4063492
twoClassSummary(evaluation_test_poda, lev = levels(evaluation_test_poda$obs))
##       ROC      Sens      Spec 
## 0.7626662 0.9269333 0.3603175
# ROC curve for the "pruned" tree on the test dataset
ROC_curve_4 <- ggplot2::ggplot(evaluation_train_poda, aes(d = obs, m = Y, colour = '1')) + 
  plotROC::geom_roc(n.cuts = 0) +
  scale_color_viridis_d(direction = -1, begin = 0, end = .25) +
  theme(legend.position = "none") +
  ggtitle("ROC Curve (Pruned) - Test Data")

# Arranging the plots side by side
grid.arrange(ROC_curve_2, ROC_curve_4, ncol = 2)

We can see that overfitting has been reduced. This is evidenced by the decrease in the area under the ROC curve for the initial tree compared to the area under the ROC curve for the pruned tree on the training dataset. When analyzing the ROC curves on the test dataset, we observe an increase in the area under the ROC curve for the pruned tree compared to the initial tree.

These results indicate that pruning the tree has helped improve the model’s generalization ability and reduce overfitting. By selecting an optimal point on the complexity path, we limit the tree’s growth and prevent it from fitting too closely to the training data.

This approach demonstrates how it’s possible to control overfitting and improve the accuracy of a decision tree. By finding the appropriate point on the complexity path to limit tree growth, we ensure a more robust model with better generalization ability for previously unseen data.

Visualizing the pruned tree

# Defining a color palette
palette = scales::viridis_pal(begin=.75, end=1)(20)

# Plotting the tree
rpart.plot::rpart.plot(pruned_tree,
                       box.palette = palette)

Visualizing an alternative plot

prp(pruned_tree, box.palette = palette)