1 Introduction

1.1 Problem Background

One key activity of banks is the issuance of credit cards, a process that involves assessing the creditworthiness of potential cardholders. Banks typically evaluate an individual’s financial history, income, existing debts, and other economic factors to determine their eligibility for a credit card. This process is critical for managing risk and ensuring that customers are capable of repaying their debts.

Once a credit card is issued, banks continually monitor card usage to manage credit risk. Understanding how and why card balances fluctuate is crucial in this regard. Predicting credit card balances becomes a vital part of managing credit portfolios and preventing financial losses.

In this analysis project, we aim to develop a machine learning model to predict credit card balances. The outcomes of this analysis can help identify debtors at high risk of credit payment default and shed light on debtor behavior. Additionally, combining credit balance data with information like credit limits makes it possible to calculate credit card utilization, a factor that significantly impacts a cardholder’s credit rating.
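
As a quick illustration of that last point, utilization is simply the ratio of balance to credit limit. Below is a minimal sketch; the function name and example values are hypothetical, not part of the dataset:

# hypothetical sketch: credit utilization as balance divided by credit limit
utilization <- function(balance, limit) {
  balance / limit
}

utilization(balance = 500, limit = 5000) # 0.1, i.e. 10% utilization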

1.2 Dataset

In this project, we will use a credit card balance dataset consisting of 400 observations and 12 columns.

  • ID: a unique identification number for each individual.
  • Income: the individual’s income, scaled in units of $10,000.
  • Limit: the maximum amount of credit available to the individual.
  • Rating: a score representing the individual’s creditworthiness.
  • Cards: the number of credit cards the individual owns.
  • Age: the individual’s age, measured in years.
  • Education: the number of years the individual has spent in education.
  • Gender: specifies the gender of the individual, either Male or Female.
  • Student: indicates whether the individual is a student, with possible values being Yes or No.
  • Married: shows if the individual is married or not, options being Yes or No.
  • Ethnicity: the individual’s ethnic background, which can be African American, Asian, or Caucasian.
  • Balance: the average balance maintained on the individual’s credit card, expressed in dollars.

1.3 Problem Statement, Predictor and Target Variable Identification

Given the background information and the details of the dataset, the objective is clearly defined as follows:

The primary goal is to develop regression models for predicting the average credit card balance of individuals.

  • Target variable: Balance.
  • Predictor variables: these include various aspects of a cardholder’s personal and financial details, which are expected to influence their credit card balance.

2 Load Packages

Below are the packages that we will use during the analysis.

# for data loading - data preprocessing
library(readxl)
library(dplyr)
library(tidyr)

# for exploratory data analysis
library(psych)
library(ggplot2)
library(GGally)
library(gridExtra)

# for modeling - evaluation
library(caret)
library(partykit)
library(randomForest)
library(xgboost)
library(tidymodels)
library(MLmetrics)

# for model interpretation
library(lime)

# for interactive table display
library(DT)

# for reproducible analysis
set.seed(123)

3 Load Dataset

data <- read.csv("data_input/credit.csv")

Below are the first six rows of our observations.

head(data)

4 Data Preparation

Before we dive into our analysis, the first thing we’re going to do is clean up our data. This means we’ll make sure everything in our dataset is accurate and ready for analysis. Cleaning the data involves fixing any mistakes, sorting out any mix-ups in how the data is formatted, getting rid of any repeated information, and filling in any gaps where information is missing. To start off right, we need to get a good grasp of what our data looks like.

glimpse(data)
#> Rows: 400
#> Columns: 12
#> $ X         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
#> $ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
#> $ Limit     <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
#> $ Rating    <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
#> $ Cards     <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
#> $ Age       <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
#> $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
#> $ Gender    <chr> " Male", "Female", " Male", "Female", " Male", " Male", "Fem…
#> $ Student   <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes"…
#> $ Married   <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
#> $ Ethnicity <chr> "Caucasian", "Asian", "Asian", "Asian", "Caucasian", "Caucas…
#> $ Balance   <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…

Our dataset consists of 400 rows and 12 columns. Each row corresponds to an individual cardholder. Out of the 12 columns, 8 are numerical and 4 are categorical. The numerical columns are formatted correctly. However, the categorical columns are currently formatted as character types. We will convert these to the appropriate categorical type.

data_clean <- data %>% mutate_if(is.character, as.factor)

Recheck the data types.

glimpse(data_clean)
#> Rows: 400
#> Columns: 12
#> $ X         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
#> $ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
#> $ Limit     <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
#> $ Rating    <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
#> $ Cards     <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
#> $ Age       <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
#> $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
#> $ Gender    <fct>  Male, Female,  Male, Female,  Male,  Male, Female,  Male, F…
#> $ Student   <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No, No…
#> $ Married   <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Yes, Ye…
#> $ Ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian, Africa…
#> $ Balance   <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…

Our data is formatted correctly.

Let’s check if our data is complete by looking for any missing values, also called nulls. Missing entries can cause problems in our analysis, so it’s important to find out whether there are any and then decide what to do with them to keep our data good and useful.

colSums(is.na(data_clean))
#>         X    Income     Limit    Rating     Cards       Age Education    Gender 
#>         0         0         0         0         0         0         0         0 
#>   Student   Married Ethnicity   Balance 
#>         0         0         0         0

After checking, we found no missing values in our data.

To clean our data, it’s essential to remove duplicate entries. Duplicates can skew the modeling process by giving undue weight to repeated information. We’ll identify and discard any such repeats for a more accurate analysis.

data_clean[duplicated(data_clean),]

There is no duplication in our data.

For our analysis, we’re going to remove columns that aren’t needed and won’t affect our results, such as the row-index column X, which only serves as an ID.

data_clean <- data_clean %>% select(-X)
head(data_clean)

Since our project focuses on monitoring the active usage of credit cards, we will ignore cards that show no recent activity. In this dataset, a zero balance is our indicator of inactivity, so we filter those observations out.

data_clean <- data_clean %>% filter(Balance != 0)
dim(data_clean)
#> [1] 310  11

Our data now is clean and ready for analysis.

5 Exploratory Data Analysis

5.1 Outliers

Spotting outliers is super important in data analysis. These are the unusual values that stand out from the rest and can throw off our whole analysis. We need to identify them so we can decide whether to keep them, tweak them, or remove them. This step isn’t just about fixing errors; it’s about really understanding our data, catching strange patterns, and making sure the rest of our analysis is reliable.

To spot these outliers, we use the base R function boxplot().

boxplot(data_clean$Income, horizontal = T)

boxplot(data_clean$Limit, horizontal = T)

boxplot(data_clean$Rating, horizontal = T)

boxplot(data_clean$Cards, horizontal = T)

boxplot(data_clean$Age, horizontal = T)

boxplot(data_clean$Education, horizontal = T)

From these boxplots, we can see outliers in these columns: Income, Limit, Rating, and Cards.

  • For Income, the data is positively skewed: we have a bunch of cases where individuals earn less than 37.14 (in units of $10,000), but also a few instances where earnings go above 124.62.
  • For Limit, as depicted in the boxplot, the data distribution suggests that the majority of credit limits are clustered around the median, which is 5147. We can see that credit limits stretch out mostly between 1160 and up to 13913, with a few cases extending well beyond this range. These cases, shown as individual points, indicate that a small number of individuals have credit limits that are exceptionally high compared to the bulk of the data. These outliers hint at a group of customers with notably higher credit limits, suggesting they may have higher incomes or better credit standings than the average customer.
  • Looking at the boxplot for Rating, it shows us that the credit ratings for most individuals hover around the median, which is 380. However, we also notice a small number of ratings that are exceptionally high, going beyond 716. These points are our outliers. They’re significant because they indicate that there are some individuals with credit ratings that are much higher than average, suggesting they may have an outstanding financial track record or creditworthiness.
  • In the data for Cards, we find that it’s pretty normal for people to have a couple of credit cards. But what’s really interesting is that some folks have way more than that, even over 7 cards. This could mean they’re into collecting rewards or maybe they like to spread out their spending. It also might show that these people have a really good credit score that lets them get their hands on more cards.

We have a couple of options for handling these unusual data points: we could remove them or we can keep them in. We’ve chosen to keep these outliers in our dataset. The reason is that these unusual points can sometimes reveal valuable insights about what we’re analyzing. By including them, we ensure our analysis is comprehensive and accounts for all aspects of the data, even those that are less common.
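
To put numbers on what the boxplots show, here is a small sketch that counts the points beyond the standard boxplot (Tukey) fences, i.e. 1.5 × IQR past the quartiles, which is the rule behind cutoffs like 124.62 for Income. The helper count_outliers is purely illustrative:

# count values beyond the 1.5 * IQR boxplot fences
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  lower <- q[1] - 1.5 * IQR(x)
  upper <- q[2] + 1.5 * IQR(x)
  sum(x < lower | x > upper)
}

sapply(data_clean[c("Income", "Limit", "Rating", "Cards")], count_outliers)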

5.2 Correlation

In our next step, we’re going to explore the way different pieces of information in our dataset relate to each other by conducting a correlation analysis. This will help us see how certain variables move in relation to one another. For instance, if two variables tend to increase or decrease in tandem, they’re said to have a positive correlation. On the flip side, if one variable tends to go up when the other goes down, this is known as a negative correlation.

Grasping these patterns is crucial for a clear and accurate interpretation of our data. To carry out this analysis, we’ll employ the ggcorr() function from the GGally package, which is a great tool for such statistical explorations.

ggcorr(data_clean, label = T, hjust = 0.7)

From the plot, it’s clear that Income, Limit, and Rating are closely related. The correlation between Rating and Limit is displayed as 1, an almost perfectly linear relationship (the plot label is rounded). The strong connection between these three variables makes a lot of sense: people with higher incomes usually get higher credit limits, and both of these go hand in hand with better credit ratings. And it’s no surprise that these factors are really important for figuring out credit card balances. The higher someone’s income and credit limit, the more likely they are to have a higher balance on their credit card.
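
To confirm that the 1 in the plot is a rounded label rather than a truly perfect correlation, we can compute the exact value directly:

cor(data_clean$Rating, data_clean$Limit)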

We can investigate these strong relationships further with the following scatterplots.

# scatterplot for Rating and Balance

ggplot(data_clean, aes(x = Rating, y = Balance)) +
  geom_point() +
  labs(title = "Rating vs Balance", 
       x = "Rating", 
       y = "Balance")

# scatterplot for Limit and Balance

ggplot(data_clean, aes(x = Limit, y = Balance)) +
  geom_point() +
  labs(title = "Limit vs Balance", 
       x = "Limit", 
       y = "Balance")

# scatterplot for Income and Balance

ggplot(data_clean, aes(x = Income, y = Balance)) +
  geom_point() +
  labs(title = "Income vs Balance", 
       x = "Income", 
       y = "Balance")

On the other hand, Cards, Age, and Education don’t seem to have much impact on predicting credit card balances. This is shown by their correlation values, which are close to 0. This means they don’t really move in sync with the credit card balance like Income, Limit, and Rating do. So, things like how many cards someone has, their age, or how much education they’ve received, don’t significantly change the prediction of their credit card balance.

We can investigate these weak relationships further with the following scatterplots.

# scatterplot for Cards and Balance

ggplot(data_clean, aes(x = Cards, y = Balance)) +
  geom_point() +
  labs(title = "Cards vs Balance", 
       x = "Cards", 
       y = "Balance")

# scatterplot for Age and Balance

ggplot(data_clean, aes(x = Age, y = Balance)) +
  geom_point() +
  labs(title = "Age vs Balance", 
       x = "Age", 
       y = "Balance")

# scatterplot for Education and Balance

ggplot(data_clean, aes(x = Education, y = Balance)) +
  geom_point() +
  labs(title = "Education vs Balance", 
       x = "Education", 
       y = "Balance")

6 Modeling

Having looked over our data and understood the main points, we’re now ready to move on to building our model. The goal is to create a regression model that can predict credit card balances. We’re going to try three different kinds of regression models: a decision tree regressor, a random forest regressor, and an extreme gradient boosting (XGBoost) regressor.

6.1 Train-Test Splitting

For the model, we’re going to divide the data into two parts: one for training and one for testing. We’ll use 80% of the data to train the model – this is where it’ll learn how to predict. The remaining 20% will be used for testing – this is where we check how good the model is at predicting. By doing this, we make sure the model learns properly and also gets a good test to see how well it’s doing.

This splitting uses the sample() function, which returns a vector of row indices for the training data.

# for reproducibility
RNGkind(sample.kind = "Rounding")
set.seed(100)

# define train and test proportion
index <- sample(nrow(data_clean), nrow(data_clean) * 0.8)

# subset
train_data <- data_clean[index,]
test_data <- data_clean[-index,]
# check the result

dim(train_data)
#> [1] 248  11
dim(test_data)
#> [1] 62 11

Now, we will train the data with 248 observations and test the result with 62 observations.

6.2 Model Building

For our project, we’re mostly going to focus on tree-based models that are really good at making predictions. These include the decision tree model, the random forest model, and the XGBoost (extreme gradient boosting) model.

6.2.1 Decision Tree Regressor

6.2.1.1 How Decision Tree Regressor Works

A decision tree regressor works by sorting a bunch of data into different groups based on certain features. Let’s say we have data about people, including their income, credit limit, credit rating, age, education, gender, student status, marital status, ethnicity, and their average credit card balance. Each of these bits of info tells us something different about the person, like how much they earn, their credit history, or their lifestyle.

We start by looking at all this data. The decision tree begins by figuring out the best way to group people based on one of these features. For example, it might sort them by income, putting people with similar earnings together. The aim is to make groups where the credit card balances are as similar as possible.

At each step, we check each feature to find the best way to further split these groups, scoring candidate splits with a numeric criterion such as the reduction in the sum of squared errors. We keep splitting the data into smaller, more specific groups until the credit card balances within each group are similar enough, or until we reach a limit like the maximum depth of the tree.

The final model is like a big tree. Each branch end, or leaf, represents a group of people with similar predicted credit card balances, often the average balance of the group in that leaf. So, when we have a new person’s details, the tree helps us figure out where they fit based on their features. The predicted credit card balance for them is then based on the average balance of that group in the leaf. This way, we can predict someone’s credit card balance based on their personal and financial details.
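
To make the split-scoring idea concrete, here is a toy sketch that scores a candidate cutpoint by the total within-group sum of squared errors. Note that this is a simplification: the ctree() function we use below selects splits via conditional inference tests rather than raw SSE.

# toy sketch: score a candidate split by the within-group sum of squared errors
split_sse <- function(x, y, cutpoint) {
  left <- y[x <= cutpoint]
  right <- y[x > cutpoint]
  # lower total SSE means the groups have more homogeneous balances
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# compare two candidate cutpoints for Rating (352 is the root split chosen below)
split_sse(train_data$Rating, train_data$Balance, 352)
split_sse(train_data$Rating, train_data$Balance, 500)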

6.2.1.2 Implementation

Now that we’ve learned how a decision tree regressor works, the next step is to actually build it in R. We’ll be using the ctree() function from the partykit package. This function is super useful because it can handle both classification and regression tasks. To get it going, we need to focus on two main things: the formula and the data. In the formula, we specify our target variable (which, for us, is Balance) and the predictors we’re using (things like Income, Limit, Rating, and so on). Then, in the data part, we supply the dataset we’re training on. By setting these parameters, we can build a decision tree regressor in R that fits our specific needs.

# predict Balance with all predictors
dtr_base <- ctree(formula = Balance ~ .,
                 data = train_data)

To see what our decision tree model looks like, we can use the plot() function in R. This will show us a picture of the tree, letting us see how the data is divided at each step and giving us a peek into how the model makes its decisions.

# visualize the tree
plot(dtr_base, type="simple")

dtr_base
#> 
#> Model formula:
#> Balance ~ Income + Limit + Rating + Cards + Age + Education + 
#>     Gender + Student + Married + Ethnicity
#> 
#> Fitted party:
#> [1] root
#> |   [2] Rating <= 352
#> |   |   [3] Rating <= 279: 181.436 (n = 39, err = 422075.6)
#> |   |   [4] Rating > 279
#> |   |   |   [5] Income <= 44.978: 475.708 (n = 48, err = 1265325.9)
#> |   |   |   [6] Income > 44.978: 126.091 (n = 11, err = 25906.9)
#> |   [7] Rating > 352
#> |   |   [8] Rating <= 682
#> |   |   |   [9] Student in No
#> |   |   |   |   [10] Rating <= 394
#> |   |   |   |   |   [11] Income <= 51.872
#> |   |   |   |   |   |   [12] Income <= 17.765: 829.571 (n = 7, err = 33507.7)
#> |   |   |   |   |   |   [13] Income > 17.765
#> |   |   |   |   |   |   |   [14] Income <= 39.422: 651.933 (n = 15, err = 70272.9)
#> |   |   |   |   |   |   |   [15] Income > 39.422: 538.000 (n = 9, err = 14498.0)
#> |   |   |   |   |   [16] Income > 51.872: 324.778 (n = 9, err = 205983.6)
#> |   |   |   |   [17] Rating > 394
#> |   |   |   |   |   [18] Income <= 83.851
#> |   |   |   |   |   |   [19] Rating <= 536: 880.122 (n = 49, err = 1102143.3)
#> |   |   |   |   |   |   [20] Rating > 536: 1194.333 (n = 9, err = 127314.0)
#> |   |   |   |   |   [21] Income > 83.851
#> |   |   |   |   |   |   [22] Limit <= 7140: 505.286 (n = 7, err = 228189.4)
#> |   |   |   |   |   |   [23] Limit > 7140: 790.933 (n = 15, err = 715742.9)
#> |   |   |   [24] Student in Yes: 1217.412 (n = 17, err = 700078.1)
#> |   |   [25] Rating > 682: 1488.615 (n = 13, err = 830473.1)
#> 
#> Number of inner nodes:    12
#> Number of terminal nodes: 13

In the tree diagram we created, Rating sits right at the top. This tells us that Rating is the first thing the tree considers when predicting the credit card balance. Essentially, Rating is a big factor in determining the balance. So, whenever we use this tree to predict a balance, it first looks at the Rating, and then bases its further decisions on that. This points out that the credit rating of a person is really important in figuring out their credit card balance.

6.2.1.3 Model Performance

Next up, we’re going to test how well our dtr_base model works on both the training data and the test data. We’ll focus on two main measures: adjusted r-squared and MAE (Mean Absolute Error). It’s crucial to evaluate our model on both sets of data. This helps us figure out how well it can use what it learned on brand new data it hasn’t seen before. By doing this, we can be confident that our model isn’t just repeating what it saw in the training data but is really learning patterns that hold true more broadly and can be applied in various situations.
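
For reference, with n observations, p predictors, predictions ŷ, and actual values y, the metrics are defined as:

r-squared = 1 - RSS/TSS
adjusted r-squared = 1 - (1 - r-squared)(n - 1)/(n - p - 1)
MAE = mean(|ŷ - y|)

where RSS is the residual sum of squares and TSS the total sum of squares. The custom function below implements the adjusted r-squared directly; the MAE() function is provided by the packages loaded earlier (both caret and MLmetrics export one).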

# custom function for calculating adjusted r-squared
# preds: predicted values, actual: observed values
# n: number of observations, p: complexity term in the penalty
# (the calls below pass the column count of the data)

adj_r2 <- function(preds, actual, n, p){
  
  rss <- sum((preds - actual) ^ 2)         # residual sum of squares
  tss <- sum((actual - mean(actual)) ^ 2)  # total sum of squares
  rsq <- 1 - rss/tss                       # ordinary r-squared

  return(1 - (((1 - rsq) * (n - 1)) / (n - p - 1)))
}
# prediction on training data and testing data

y_train_pred <- predict(object = dtr_base, newdata = train_data)
y_test_pred <- predict(object = dtr_base, newdata = test_data)
dtr_results = data.frame(
  adj_r2_train = adj_r2(y_train_pred, train_data$Balance, nrow(train_data), ncol(train_data)),
  MAE_train = MAE(y_train_pred, train_data$Balance),
  adj_r2_test = adj_r2(y_test_pred, test_data$Balance, nrow(test_data), ncol(test_data)),
  MAE_test = MAE(y_test_pred, test_data$Balance)
)

dtr_results

💡 Insight:

  • The adjusted r-squared indicates that our model accounts for approximately 85.11% of the variation in Balance within the training data, and 74.75% within the testing data.
  • Using the MAE, we determined that our predictions could be off by about 115.39 units (either too high or too low) for the training data, and approximately 157.30 units for the testing data.

range(train_data$Balance)
#> [1]    8 1999

When we look at our model’s errors, which are about 115.39 for the training data and 157.30 for the testing data, and compare them with the range of credit card balances from 8 to 1999, the model seems to be doing a pretty good job. These errors are just a small part of the maximum balance amount – less than 8%. This means that for really high balances, the mistakes our model makes aren’t too big of a deal. But, for smaller balances, these errors might be more noticeable. Overall, the model is doing a decent job, especially for a first try. It works well for getting a general idea or for an initial check on how risky a credit card balance might be. But if we need super accurate predictions, like for really important financial decisions, we might want to fine-tune the model a bit more.
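
That “less than 8%” figure is straightforward to verify, relating the test MAE reported above to the maximum observed balance:

# test-set MAE relative to the maximum balance in the training data
157.30 / max(train_data$Balance) # roughly 0.079, i.e. just under 8%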

6.2.2 Random Forest Regressor

6.2.2.1 How Random Forest Regressor Works

A random forest regressor is an enhanced version of a decision tree model, utilizing multiple trees to refine our predictive accuracy. It constructs these trees from varied random segments of our data, employing a method known as ‘bootstrapping’. Additionally, each tree analyzes a distinct subset of features, ensuring their individuality and aiding the comprehensive model in mitigating errors common to single-tree models.

When it’s time to make predictions, the random forest employs all its trees. Each tree contributes its prediction, and the final output is derived from the average of these contributions. This collective approach not only heightens the model’s accuracy but also bolsters its performance on new, unfamiliar data by reducing the likelihood of overfitting.

Random forests are highly adaptable, suitable for both regression and classification tasks. They adeptly handle diverse data types, positioning them as a favored choice for a broad range of predictive modeling scenarios.
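
To make the idea concrete, below is a deliberately simplified bagging sketch built on the ctree() function from earlier. A real random forest, like the one the randomForest package fits, also samples a subset of features at every split; this toy version only bootstraps rows:

# toy sketch of bagging: fit trees on bootstrap resamples of the rows,
# then average their predictions
set.seed(123)
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(i) {
  boot_idx <- sample(nrow(train_data), replace = TRUE)
  ctree(Balance ~ ., data = train_data[boot_idx, ])
})

# ensemble prediction = the average of the individual trees' predictions
preds <- sapply(trees, predict, newdata = test_data)
bagged_pred <- rowMeans(preds)
MAE(bagged_pred, test_data$Balance)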

6.2.2.2 Implementation

Previously, we utilized the ctree() function from the partykit package, specifically designed for decision trees. However, for our next step in model building, we’ll use the train() function from the caret package, a high-level abstraction function. This function allows us to experiment with different models simply by modifying the method parameter. In this case, as we aim to build a random forest regressor, we will set method = "rf".

rfr_base <- train(Balance ~ .,
                  data = train_data,
                  method = "rf")

Since a random forest has many trees and it’s hard to show them all at once, we use something called ‘feature importance’ to understand the model better. Feature importance tells us how important each part of our data is for making good predictions. We can find this out by using the varImp() function.

vi <- varImp(rfr_base$finalModel)
vi %>% arrange(desc(Overall))

From our analysis, it’s clear that Limit stands out as the most important factor for predicting credit card balances. Rating and Income are also vital. This shows that the model gives a lot of weight to Limit, highlighting that a person’s credit limit is a key indicator of their balance. The credit rating (Rating) and the individual’s income (Income) are also significant, underlining their importance in accurately forecasting credit card balances.

6.2.2.3 Model Performance

Similarly to the decision tree regressor, we will evaluate the model’s performance on both the training and testing data.

# prediction on training data and testing data

y_train_pred <- predict(object = rfr_base, newdata = train_data)
y_test_pred <- predict(object = rfr_base, newdata = test_data)
rfr_results = data.frame(
  adj_r2_train = adj_r2(y_train_pred, train_data$Balance, nrow(train_data), ncol(train_data)),
  MAE_train = MAE(y_train_pred, train_data$Balance),
  adj_r2_test = adj_r2(y_test_pred, test_data$Balance, nrow(test_data), ncol(test_data)),
  MAE_test = MAE(y_test_pred, test_data$Balance)
)

rfr_results

💡 Insight:

  • The adjusted r-squared indicates that our model accounts for approximately 98.30% of the variation in Balance within the training data, and 90.25% within the testing data.
  • Using the MAE, we determined that our predictions could be off by about 34.80 units (either too high or too low) for the training data, and approximately 96.75 units for the testing data.

When we compare the two models, the random forest regressor does a much better job than the decision tree regressor when it comes to predicting credit card balances. The decision tree regressor has errors of around 115.39 for the data used for training, and 157.30 for the data used for testing. These errors are less than 8% of the maximum balance range (which goes from 8 to 1999). While this is okay for a general analysis, it’s more noticeable when dealing with lower balances.

On the other hand, the random forest regressor significantly reduces these errors to 34.80 for training and 96.75 for testing. These errors are only about 1.74% and 4.83% of the maximum range, respectively. This big improvement in accuracy, especially during testing, shows that the random forest model is better. So, if we need precision in situations like detailed financial decision-making, the random forest model is the more reliable choice.

6.2.3 XGBoost Regressor

6.2.3.1 How XGBoost Regressor Works

XGBoost takes the idea of decision trees and kicks it up a notch. Here’s how it works: It starts by building a basic decision tree, but it doesn’t stop there. Instead, XGBoost keeps creating more trees, and each new tree is like a smart teammate that learns from the mistakes of the previous ones. Think of it as a team effort where everyone is getting better over time. This technique is called ‘boosting,’ and it’s all about focusing on the tricky parts that earlier trees struggled with, kind of like solving a puzzle one challenging piece at a time.

What makes XGBoost stand out is its efficiency and accuracy. It doesn’t just throw in trees randomly; it does it in a clever and calculated way. The algorithm also has built-in safeguards to avoid getting too fixated on the specific details of the training data, which makes it more dependable when predicting new, unseen data. In a nutshell, XGBoost creates a well-tuned and collaborative group of trees that team up to make more precise predictions, especially when dealing with complex datasets.
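
As a stripped-down sketch of that residual-fitting loop (real XGBoost adds regularized objectives, second-order gradient information, and clever split finding, among other things):

# toy sketch of boosting: each new tree fits the residuals of the
# ensemble so far, and its shrunken predictions are accumulated
eta <- 0.3 # learning rate
n_rounds <- 20
pred <- rep(mean(train_data$Balance), nrow(train_data)) # start from the mean

for (i in seq_len(n_rounds)) {
  resid_data <- train_data
  resid_data$Balance <- train_data$Balance - pred # current residuals
  tree <- ctree(Balance ~ ., data = resid_data)
  pred <- pred + eta * predict(tree, newdata = resid_data)
}

MAE(pred, train_data$Balance) # training error shrinks as rounds increase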

6.2.3.2 Implementation

In implementing the XGBoost regressor, we will use the same approach as with the random forest regressor, the only change being method = "xgbTree".

xgbr_base <- train(Balance ~ .,
                  data = train_data,
                  method = "xgbTree")

Alternatively, we can construct an XGBoost regressor using the boost_tree() function from the tidymodels package.

xgbr_base_tm <- boost_tree() %>% 
                set_mode("regression") %>% 
                set_engine("xgboost") %>% 
                fit(Balance ~ ., data = train_data)

6.2.3.3 Model Performance

# prediction on training data and testing data with model from tidymodels

y_train_pred <- predict(object = xgbr_base_tm, new_data = train_data)
y_test_pred <- predict(object = xgbr_base_tm, new_data = test_data)
xgbr_tm_results = data.frame(
  adj_r2_train = adj_r2(y_train_pred$.pred, train_data$Balance, nrow(train_data), ncol(train_data)),
  MAE_train = MAE(y_train_pred$.pred, train_data$Balance),
  adj_r2_test = adj_r2(y_test_pred$.pred, test_data$Balance, nrow(test_data), ncol(test_data)),
  MAE_test = MAE(y_test_pred$.pred, test_data$Balance)
)

xgbr_tm_results

💡 Insight:

  • The adjusted r-squared indicates that our model accounts for approximately 99.69% of the variation in Balance within the training data, and 93.29% within the testing data.
  • Using the MAE, we determined that our predictions could be off by about 15.21 units (either too high or too low) for the training data, and approximately 74.88 units for the testing data.

From these results, the XGBoost regressor does better on the training metrics than the random forest regressor, and it also outperforms the random forest on the testing metrics.

6.3 Model Improvement: Fine Tuning

In this section, we will fine-tune the XGBoost regressor using the tidymodels package. Among the adjustable parameters for the XGBoost regressor, we will focus on trees, which controls the number of trees used in the model.

Initially, we define the model and identify the specific argument to be tuned. For tuning solely the trees parameter, we assign the tune() function to it.

# 1. Mark trees as parameter to be tuned
xgbr_tune <- boost_tree(trees = tune()) %>% 
                set_mode("regression") %>% 
                set_engine("xgboost") 

Next, we define the recipe, which outlines the steps for preparing our data for modeling. This process can involve various preprocessing tasks. In the following code, we specify what to predict, the predictors to use, and the dataset.

# 2. Define recipe
rcp <- recipe(Balance ~ ., data = train_data) %>% 
        step_dummy(Gender, Student, Married, Ethnicity)

Subsequently, we integrate the model and recipe using the workflow() function. The following code demonstrates creating a tuning workflow that combines the XGBoost regressor with our dataset.

# 3. Make workflow 
xgb_wflow <- 
  workflow() %>% 
  add_model(xgbr_tune) %>% 
  add_recipe(rcp)
xgb_wflow %>% extract_parameter_dials("trees")
#> # Trees (quantitative)
#> Range: [1, 2000]

Executing the above code shows that the default tuning range spans from 1 to 2000. However, we want to specifically tune the trees value between 15 and 30. To do this, we can manually create a tibble with our desired range.

# 4. Create a tibble with the specific range for trees
trees_values <- tibble(trees = 15:30)

Next, we will define the benchmark for determining the best trees value, using MAE as the standard of measurement.

# 5. Define specific metrics
mae_res <- metric_set(mae)

Finally, we are ready to tune our model using the tune_grid() function.

# 6. Tune the model using the specified range for trees
tuning_results <- tune_grid(
  xgb_wflow,
  grid = trees_values,
  resamples = vfold_cv(train_data, v = 5),
  metrics = mae_res
)

Let’s inspect the best parameter for our XGBoost regressor.

show_best(tuning_results, n = 1)

Based on the results, trees = 30 yields the lowest MAE in the 5-fold cross-validation. We will remodel our data using this outcome.

xgbr_tune <- boost_tree(trees = 30) %>% 
                set_mode("regression") %>% 
                set_engine("xgboost") %>% 
                fit(Balance ~ ., data = train_data)
# prediction on training data and testing data with tuned model from tidymodels

y_train_pred <- predict(object = xgbr_tune, new_data = train_data)
y_test_pred <- predict(object = xgbr_tune, new_data = test_data)
xgbr_tm_tuned_results = data.frame(
  adj_r2_train = adj_r2(y_train_pred$.pred, train_data$Balance, nrow(train_data), ncol(train_data)),
  MAE_train = MAE(y_train_pred$.pred, train_data$Balance),
  adj_r2_test = adj_r2(y_test_pred$.pred, test_data$Balance, nrow(test_data), ncol(test_data)),
  MAE_test = MAE(y_test_pred$.pred, test_data$Balance)
)

xgbr_tm_tuned_results
xgbr_tm_results

Looking at these results, our post-tuned regressor is better than the pre-tuned one. However, there’s a big difference between the errors we see when we train the model and when we test it. We observe this in the majority of the models we have tested, and the more complex the model, the bigger this error gap seems to be. Based on these findings, we can say that the more complex models tend to memorize unnecessary things in the data, like noise or unimportant details. In the next part, we’ll use xgbr_tune to help us understand the model better.
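
One quick way to see that gap is to compare the training and testing MAE side by side:

# error gap between training and testing for the tuned model
xgbr_tm_tuned_results$MAE_test - xgbr_tm_tuned_results$MAE_train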

6.4 Model Interpretation

Our top-performing model is xgbr_tune. It gives us an adjusted r-squared of 94.15% and a MAE of 70.36 on the test data. This XGBoost model is great at finding complex patterns, but it’s not very easy to figure out how it comes up with its predictions.

Luckily, R has a tool to help with this. It’s called the lime package, short for Local Interpretable Model-Agnostic Explanations. This tool can explain any kind of predictive model, even complex ones like ours.

Now, let’s use the lime function to better understand our model. Here’s the code we’ll use.

# create the explainer
explainer <- lime(x = train_data %>% select(-Balance), 
                  model = xgbr_tune)
# select only the first 4 observations of test_data
selected_data <- test_data %>% 
  select(-Balance) %>% 
  dplyr::slice(1:4)
selected_data
explanation <- explain(x = selected_data, 
                       explainer = explainer, 
                       feature_select = "auto", # method of feature selection for lime
                       n_features = 10 # number of features to explain the model
                       )

Now, we will visualize the explanation using the plot_features() function.

plot_features(explanation)

As we can see from the figure, our plot contains 4 observations and the details for each.

1. Case

Case indicates the index of our observation.

  • Case = 1 means the first row of our test data.
  • Case = 2 means the second row of our test data.
  • and so on.

2. Explanation Fit

Explanation Fit reveals how well lime interprets our model, expressed as the r-squared of the local linear regression it fits. Given the figure, we know lime is able to:

  • Describe 38% variation in case 1.
  • Describe 67% variation in case 2.
  • Describe 12% variation in case 3.
  • Describe 67% variation in case 4.

3. Prediction

Prediction refers to the predicted credit card balance (Balance) given the predictors. Now, let’s compare the predicted values with the actual values.

test_data$Balance[1:4]
#> [1]  333  903  331 1350
# case 1
319.82-333
#> [1] -13.18
# case 2
920.67-903
#> [1] 17.67
# case 3
417.77-331
#> [1] 86.77
# case 4
1378.32-1350
#> [1] 28.32

If we pay close attention, we can see that case 3 has the largest error. This ties back to the explanation fit discussed above: because case 3 has the lowest explanation fit, lime’s local approximation captures the model’s behavior poorly for this observation, and the prediction itself turns out to be much less accurate than in the other cases.

4. Feature Barplot

For each observation, lime generates bar plots in red and blue: red indicates a negative contribution to our target and blue a positive one. Each observation shows a different pattern regarding which variables drive the credit card balance. Out of the 10 predictors, in most cases there are 4 that dominantly affect the prediction of credit card balances: Rating, Income, Limit, and Student.

  • For case 1, the only positive contribution to the credit balance comes from the Income variable; this cardholder’s income is below 22.8. However, judging by the bar length, its influence is small. The lower credit limit, the No student status, and the lower credit rating contribute more, pushing the predicted balance down.
  • In case 2, the income is considerably larger than in case 1. It is interesting to note that despite the large income, this factor contributes negatively to the credit balance. However, the cardholder in this scenario has a higher credit limit, is a student (status “Yes”), and has a higher credit rating. These factors have a positive impact on the credit balance and significantly increase it.
  • In case 3, where the explanation fit is the lowest, we cannot reliably determine the factors that affect the credit balance. Based on the barplot, however, most of the variables for this cardholder have a negative impact.
  • In case 4, we observe the same pattern as in case 2: a higher credit limit, student status, and a higher credit rating significantly increase the credit balance.

7 Conclusion

dtr_results$model <- "Base Decision Tree Regressor"
rfr_results$model <- "Base Random Forest Regressor"
xgbr_tm_results$model <- "Base XGBoost Regressor"
xgbr_tm_tuned_results$model <- "Fine-Tuned XGBoost Regressor"

summary <- rbind(dtr_results, 
      rfr_results, 
      xgbr_tm_results, 
      xgbr_tm_tuned_results)

summary[, c(ncol(summary), 1:(ncol(summary) - 1))]

  • In our effort to create a model that predicts credit card balances, we tried out three different types: a decision tree model, a random forest model, and an XGBoost model. From the results, it’s clear that the XGBoost model is doing a better job than the other two when it comes to forecasting credit card balances.
  • Our top model is xgbr_tune, which we set up to use 30 trees. It’s really doing well with an adjusted r-squared of 99.95% on the training data and 94.15% on the testing data. When it comes to errors, it’s got a 5.69 MAE for the training data and 70.36 for the testing data. But these results suggest that the model might be overfitting, which means it’s great at handling the data it trained on but not as good when it comes to new, unseen data.
  • To enhance our models and tackle overfitting:
    • Expand Parameter Tuning: We’ll explore more parameter tuning options for the XGBoost model to achieve a balanced performance.
    • Adjust Parameters: We plan to adjust existing model parameters, like learning rate and tree depth, for better generalization.
    • Data Preprocessing: Advanced data preprocessing steps, including feature selection and transformation, will be implemented for enhanced accuracy.
    • Fine Tuning Random Forest: the random forest regressor also delivered solid results, so in the future we could consider fine-tuning this model as well.