One key activity of banks is the issuance of credit cards, a process that involves assessing the creditworthiness of potential cardholders. Banks typically evaluate an individual’s financial history, income, existing debts, and other economic factors to determine their eligibility for a credit card. This process is critical for managing risk and ensuring that customers are capable of repaying their debts.
Once a credit card is issued, banks continually monitor card usage to manage credit risk. Understanding how and why card balances fluctuate is crucial in this regard. Predicting credit card balances becomes a vital part of managing credit portfolios and preventing financial losses.
In this analysis project, we aim to develop a machine learning model to predict credit card balances. The outcomes of this analysis can help identify debtors with a high risk of credit payment default and improve our understanding of debtor behavior. Additionally, combining credit balance data with information such as credit limits can aid in calculating credit card utilization, a factor that significantly impacts a cardholder’s credit rating.
In this project, we will use a credit card balance dataset. It consists of 400 observations and 12 columns.
ID: a unique identification number for each individual.
Income: the individual’s income, in units of $10,000.
Limit: the maximum amount of credit available to the individual.
Rating: a score representing the individual’s creditworthiness.
Cards: the number of credit cards the individual owns.
Age: the individual’s age, in years.
Education: the number of years the individual has spent in education.
Gender: the gender of the individual, either Male or Female.
Student: whether the individual is a student, Yes or No.
Married: whether the individual is married, Yes or No.
Ethnicity: the individual’s ethnic background, which can be African American, Asian, or Caucasian.
Balance: the average balance maintained on the individual’s credit card, expressed in dollars.

Given the background information and the details of the dataset, the objective is clearly defined as follows: the primary goal is to develop regression models for predicting the average credit card balance of individuals, Balance.

Below are the packages that we will use during the analysis.
# for data loading - data preprocessing
library(readxl)
library(dplyr)
library(tidyr)
# for exploratory data analysis
library(psych)
library(ggplot2)
library(GGally)
library(gridExtra)
# for modeling - evaluation
library(caret)
library(partykit)
library(randomForest)
library(xgboost)
library(tidymodels)
library(MLmetrics)
# for model interpretation
library(lime)
# tidy visualization
library(DT)
# for reproducible analysis
set.seed(123)

Below are the first six rows of our observations.
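As a sketch of this step, the data might be loaded and previewed like this; the file name credit_card_balance.xlsx is an assumption, and the data frame name data matches the code used later.

# load the raw data (hypothetical file name) and preview the first six rows
data <- read_excel("credit_card_balance.xlsx")
head(data)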
Before we dive into our analysis, the first thing we’re going to do is clean up our data. This means we’ll make sure everything in our dataset is accurate and ready for analysis. Cleaning the data involves fixing any mistakes, sorting out any mix-ups in how the data is formatted, getting rid of any repeated information, and filling in any gaps where information is missing. To start off right, we need to get a good grasp of what our data looks like.
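A quick way to get that overview is to print the structure of the dataset, which produces the output below (a minimal sketch, assuming the data frame is named data).

# inspect column types and sample values
glimpse(data)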
#> Rows: 400
#> Columns: 12
#> $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
#> $ Income <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
#> $ Limit <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
#> $ Rating <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
#> $ Cards <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
#> $ Age <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
#> $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
#> $ Gender <chr> " Male", "Female", " Male", "Female", " Male", " Male", "Fem…
#> $ Student <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes"…
#> $ Married <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
#> $ Ethnicity <chr> "Caucasian", "Asian", "Asian", "Asian", "Caucasian", "Caucas…
#> $ Balance <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…
Our dataset consists of 400 rows and 12 columns. Each row corresponds to an individual cardholder. Out of the 12 columns, 8 are numerical and 4 are categorical. The numerical columns are formatted correctly. However, the categorical columns are currently stored as character types, so we will convert them to factors, as sketched below.
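A minimal sketch of that conversion, assuming the data frame is named data.

# convert the character columns (Gender, Student, Married, Ethnicity) to factors
data <- data %>%
  mutate(across(where(is.character), as.factor))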
Recheck the data types.
#> Rows: 400
#> Columns: 12
#> $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
#> $ Income <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
#> $ Limit <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
#> $ Rating <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
#> $ Cards <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
#> $ Age <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
#> $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
#> $ Gender <fct> Male, Female, Male, Female, Male, Male, Female, Male, F…
#> $ Student <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No, No…
#> $ Married <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Yes, Ye…
#> $ Ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian, Africa…
#> $ Balance <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…
Our data is formatted correctly.
Let’s check if our data is complete by looking for any missing or blank spots, called null values. These missing parts can cause problems in our analysis. It’s important to find out if there are any and then figure out what to do with them to keep our data good and useful.
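A sketch of that check, which counts missing values per column and produces the output below.

# count missing values in each column
colSums(is.na(data))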
#> X Income Limit Rating Cards Age Education Gender
#> 0 0 0 0 0 0 0 0
#> Student Married Ethnicity Balance
#> 0 0 0 0
After checking, we found no missing values in our data.
To clean our data, it’s also essential to remove duplicate entries. Duplicates can skew the modeling process by giving undue weight to repeated information, so we’ll identify and discard any repeats, as sketched below.
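A sketch of the duplicate check.

# count exact duplicate rows
sum(duplicated(data))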
There are no duplicate rows in our data.
For our analysis, we’re going to remove columns that aren’t needed and won’t affect our results, such as the X column, which only serves as a row identifier (ID).
Since our project focuses on monitoring the active usage of credit cards, we will ignore data in which the credit card has not been used in the last three months. In practice, this means filtering out rows where the balance is zero. Both steps are sketched below.
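A sketch of those two steps, dropping the identifier column and keeping only non-zero balances; data_clean is the name used throughout the rest of the analysis, and the dimensions printed below follow from these operations.

# drop the identifier column and keep only active (non-zero balance) accounts
data_clean <- data %>%
  select(-X) %>%
  filter(Balance != 0)
dim(data_clean)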
#> [1] 310 11
Our data now is clean and ready for analysis.
Spotting outliers is super important in data analysis. These are the unusual numbers that stand out from the rest and can throw off our whole analysis. We need to identify these odd ones out so we can figure out if we should keep them, tweak them, or remove them, ensuring our analysis is accurate. This step isn’t just about fixing errors; it’s about really understanding our data, catching strange patterns, and making sure the rest of our analysis is reliable. So, keeping an eye out for outliers is key to a strong and effective data analysis process.
To spot these outliers, we use the base boxplot() function, as sketched below.
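A sketch of the inspection, drawing one boxplot per numeric column; the grid layout via par() is just a display choice.

# one boxplot per numeric column
num_cols <- c("Income", "Limit", "Rating", "Cards", "Age", "Education")
par(mfrow = c(2, 3))
for (col in num_cols) boxplot(data_clean[[col]], main = col)
par(mfrow = c(1, 1))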
From these boxplots, we can see outliers in these columns:
Income, Limit, Rating, and
Cards.
Income: since the data is positively skewed, most individuals earn less than 37.14 (in units of $10,000), while a few earn more than 124.62.

Limit: the boxplot shows that most credit limits cluster around the median of 5147, stretching roughly from 1160 up to 13913, with a few cases extending well beyond this range. These individual points indicate that a small number of individuals have exceptionally high credit limits compared to the bulk of the data, suggesting they may have higher incomes or better credit standings than the average customer.

Rating: credit ratings for most individuals hover around the median of 380, but a small number of ratings go beyond 716. These outliers indicate individuals with much higher-than-average credit ratings, suggesting an outstanding financial track record or creditworthiness.

Cards: it is fairly normal for people to hold a couple of credit cards, but some hold far more, even over 7. This could reflect reward collecting, a preference for spreading out spending, or a credit score strong enough to qualify for additional cards.

We have a couple of options for handling these unusual data points: remove them or keep them. We’ve chosen to keep the outliers in our dataset, because unusual points can sometimes reveal valuable insights about what we’re analyzing. By including them, we ensure our analysis is comprehensive and accounts for all aspects of the data, even those that are less common.
In our next step, we’re going to explore the way different pieces of information in our dataset relate to each other by conducting a correlation analysis. This will help us see how certain variables move in relation to one another. For instance, if two variables tend to increase or decrease in tandem, they’re said to have a positive correlation. On the flip side, if one variable tends to go up when the other goes down, this is known as a negative correlation.
Grasping these patterns is crucial for a clear and accurate
interpretation of our data. To carry out this analysis, we’ll employ the
ggcorr() function from the GGally package,
which is a great tool for such statistical explorations.
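A sketch of that call; ggcorr() drops the non-numeric columns automatically, and label = TRUE prints the correlation values on the plot.

# correlation matrix of the numeric columns
ggcorr(data_clean, label = TRUE)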
From the image, it’s clear that Income,
Limit, and Rating are closely related. The
correlation value between Rating and Limit is
perfect, at 1. This means they move together in the same way. The strong
connection between these three things makes a lot of sense. Usually,
people with higher incomes get higher credit limits, and both of these
can lead to better credit ratings. And it’s no surprise that these
factors are really important for figuring out credit card balances. The
higher someone’s income and credit limit, the more likely they are to
have a higher balance on their credit card.
We can investigate these strong relationships further with the following scatterplots.
# scatterplot for Rating and Balance
ggplot(data_clean, aes(x = Rating, y = Balance)) +
  geom_point() +
  labs(title = "Rating vs Balance",
       x = "Rating",
       y = "Balance")

# scatterplot for Limit and Balance
ggplot(data_clean, aes(x = Limit, y = Balance)) +
  geom_point() +
  labs(title = "Limit vs Balance",
       x = "Limit",
       y = "Balance")

# scatterplot for Income and Balance
ggplot(data_clean, aes(x = Income, y = Balance)) +
  geom_point() +
  labs(title = "Income vs Balance",
       x = "Income",
       y = "Balance")

On the other hand, Cards, Age, and
Education don’t seem to have much impact on predicting
credit card balances. This is shown by their correlation values, which
are close to 0. This means they don’t really move in sync with the
credit card balance like Income, Limit, and
Rating do. So, things like how many cards someone has,
their age, or how much education they’ve received, don’t significantly
change the prediction of their credit card balance.
We can investigate these weak relationships further with the following scatterplots.
# scatterplot for Cards and Balance
ggplot(data_clean, aes(x = Cards, y = Balance)) +
  geom_point() +
  labs(title = "Cards vs Balance",
       x = "Cards",
       y = "Balance")

# scatterplot for Age and Balance
ggplot(data_clean, aes(x = Age, y = Balance)) +
  geom_point() +
  labs(title = "Age vs Balance",
       x = "Age",
       y = "Balance")

# scatterplot for Education and Balance
ggplot(data_clean, aes(x = Education, y = Balance)) +
  geom_point() +
  labs(title = "Education vs Balance",
       x = "Education",
       y = "Balance")

Having looked over our data and understood the main points, we’re now ready to move on to building our model. The goal is to create a regression model that can predict credit card balances. We’re going to try three different kinds of regression models: a decision tree regressor, a random forest regressor, and an extreme gradient boosting (XGBoost) regressor.
For the model, we’re going to divide the data into two parts: one for training and one for testing. We’ll use 80% of the data to train the model – this is where it’ll learn how to predict. The remaining 20% will be used for testing – this is where we check how good the model is at predicting. By doing this, we make sure the model learns properly and also gets a good test to see how well it’s doing.
This splitting utilizes the sample() function, which returns a vector of row indexes for the training data.
# for reproducibility
RNGkind(sample.kind = "Rounding")
set.seed(100)
# define train and test proportion
index <- sample(nrow(data_clean), nrow(data_clean) * 0.8)
# subset
train_data <- data_clean[index,]
test_data <- data_clean[-index,]

#> [1] 248 11
#> [1] 62 11
Now, we will train the data with 248 observations and test the result with 62 observations.
For our project, we’re mostly going to focus on tree-based models that are really good at making predictions. These include the decision tree model, the random forest model, and the XGBoost (extreme gradient boosting) model.
A decision tree regressor works by sorting a bunch of data into different groups based on certain features. Let’s say we have data about people, including their income, credit limit, credit rating, age, education, gender, student status, marital status, ethnicity, and their average credit card balance. Each of these bits of info tells us something different about the person, like how much they earn, their credit history, or their lifestyle.
We start by looking at all this data. The decision tree begins by figuring out the best way to group people based on one of these features. For example, it might sort them by income, putting people with similar earnings together. The aim is to make groups where the credit card balances are as similar as possible.
At each step, we check each feature to find the best way to further split these groups, using a numeric criterion (for example, how much a split reduces the spread of credit card balances within the resulting groups) to judge how good a split is. We keep splitting the data into smaller, more specific groups until the groups are similar enough in their credit card balances, or we hit a limit such as the maximum depth of the tree.
The final model is like a big tree. Each branch end, or leaf, represents a group of people with similar predicted credit card balances, often the average balance of the group in that leaf. So, when we have a new person’s details, the tree helps us figure out where they fit based on their features. The predicted credit card balance for them is then based on the average balance of that group in the leaf. This way, we can predict someone’s credit card balance based on their personal and financial details.
Now that we’ve learned how to create a decision tree regressor with
our data, the next step is to actually build it using R. We’ll be using
the ctree() function from the partykit package
in R. This function is super useful because it can handle both
classification and prediction tasks. To get it going, we need to focus
on two main things: the formula and the data.
In the formula, we specify our target variable (which, for
us, is Balance) and the predictors we’re using (things like
Income, Limit, Rating, and so
on). Then, in the data part, we input the dataset we’re
training it with. By setting these parameters, we can build a decision
tree regressor in R that fits our specific needs, as sketched below.
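A minimal sketch of that call, using all remaining columns as predictors, which matches the model formula printed further below; dtr_base is the name used in the evaluation step.

# fit the decision tree regressor on the training data
dtr_base <- ctree(formula = Balance ~ ., data = train_data)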
To see what our decision tree model looks like, we can use the
plot() function in R. This will show us a picture of the
tree, letting us see how the data is divided at each step and giving us
a peek into how the model makes its decisions.
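A sketch of the visualization call, followed by printing the fitted tree, which produces the output below; type = "simple" is just a compact display option.

plot(dtr_base, type = "simple")  # draw the tree (figure not reproduced here)
dtr_base                         # print the fitted tree structure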
#>
#> Model formula:
#> Balance ~ Income + Limit + Rating + Cards + Age + Education +
#> Gender + Student + Married + Ethnicity
#>
#> Fitted party:
#> [1] root
#> | [2] Rating <= 352
#> | | [3] Rating <= 279: 181.436 (n = 39, err = 422075.6)
#> | | [4] Rating > 279
#> | | | [5] Income <= 44.978: 475.708 (n = 48, err = 1265325.9)
#> | | | [6] Income > 44.978: 126.091 (n = 11, err = 25906.9)
#> | [7] Rating > 352
#> | | [8] Rating <= 682
#> | | | [9] Student in No
#> | | | | [10] Rating <= 394
#> | | | | | [11] Income <= 51.872
#> | | | | | | [12] Income <= 17.765: 829.571 (n = 7, err = 33507.7)
#> | | | | | | [13] Income > 17.765
#> | | | | | | | [14] Income <= 39.422: 651.933 (n = 15, err = 70272.9)
#> | | | | | | | [15] Income > 39.422: 538.000 (n = 9, err = 14498.0)
#> | | | | | [16] Income > 51.872: 324.778 (n = 9, err = 205983.6)
#> | | | | [17] Rating > 394
#> | | | | | [18] Income <= 83.851
#> | | | | | | [19] Rating <= 536: 880.122 (n = 49, err = 1102143.3)
#> | | | | | | [20] Rating > 536: 1194.333 (n = 9, err = 127314.0)
#> | | | | | [21] Income > 83.851
#> | | | | | | [22] Limit <= 7140: 505.286 (n = 7, err = 228189.4)
#> | | | | | | [23] Limit > 7140: 790.933 (n = 15, err = 715742.9)
#> | | | [24] Student in Yes: 1217.412 (n = 17, err = 700078.1)
#> | | [25] Rating > 682: 1488.615 (n = 13, err = 830473.1)
#>
#> Number of inner nodes: 12
#> Number of terminal nodes: 13
In the tree diagram we created, Rating sits right at the
top. This tells us that Rating is the first thing the tree
considers when predicting the credit card balance. Essentially,
Rating is a big factor in determining the balance. So,
whenever we use this tree to predict a balance, it first looks at the
Rating, and then bases its further decisions on that. This
points out that the credit rating of a person is really important in
figuring out their credit card balance.
Next up, we’re going to test how well our dtr_base model
works on both the training data and the test data. We’ll focus on two
main measures: adjusted r-squared and MAE (Mean Absolute Error). It’s
crucial to evaluate our model on both sets of data. This helps us figure
out how well it can use what it learned on brand new data it hasn’t seen
before. By doing this, we can be confident that our model isn’t just
repeating what it saw in the training data but is really learning
patterns that hold true more broadly and can be applied in various
situations.
# custom function for calculating adj. r-squared
adj_r2 <- function(preds, actual, n, p){
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
return (1 - (((1-rsq)*(n-1))/(n-p-1)))
}

# prediction on training data and testing data
y_train_pred <- predict(object = dtr_base, newdata = train_data)
y_test_pred <- predict(object = dtr_base, newdata = test_data)

dtr_results <- data.frame(
adj_r2_train = adj_r2(y_train_pred, train_data$Balance, nrow(train_data), ncol(train_data)),
MAE_train = MAE(y_train_pred, train_data$Balance),
adj_r2_test = adj_r2(y_test_pred, test_data$Balance, nrow(test_data), ncol(test_data)),
MAE_test = MAE(y_test_pred, test_data$Balance)
)
dtr_results

💡 Insight: the dtr_base model explains a higher share of the variance in Balance within the training data, and 74.75% within the testing data.

For context, the range of Balance in our data is shown below.

#> [1] 8 1999
When we look at our model’s errors, which are about 115.39 for the training data and 157.30 for the testing data, and compare them with the range of credit card balances from 8 to 1999, the model seems to be doing a pretty good job. These errors are just a small part of the maximum balance amount – less than 8%. This means that for really high balances, the mistakes our model makes aren’t too big of a deal. But, for smaller balances, these errors might be more noticeable. Overall, the model is doing a decent job, especially for a first try. It works well for getting a general idea or for an initial check on how risky a credit card balance might be. But if we need super accurate predictions, like for really important financial decisions, we might want to fine-tune the model a bit more.
A random forest regressor is an enhanced version of a decision tree model, utilizing multiple trees to refine our predictive accuracy. It constructs these trees from varied random segments of our data, employing a method known as ‘bootstrapping’. Additionally, each tree analyzes a distinct subset of features, ensuring their individuality and aiding the comprehensive model in mitigating errors common to single-tree models.
When it’s time to make predictions, the random forest employs all its trees. Each tree contributes its prediction, and the final output is derived from the average of these contributions. This collective approach not only heightens the model’s accuracy but also bolsters its performance on new, unfamiliar data by reducing the likelihood of overfitting.
Random forests are highly adaptable, suitable for both regression and classification tasks. They adeptly handle diverse data types, positioning them as a favored choice for a broad range of predictive modeling scenarios.
Previously, we utilized the ctree() function from the
partykit package, specifically designed for decision trees.
However, for our next step in model building, we’ll use the
train() function from the caret package, a
high-level abstraction function. This function allows us to experiment
with different models simply by modifying the method parameter. In this
case, as we aim to build a random forest regressor, we will set
method = "rf".
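A sketch of that fit, assuming caret’s default resampling and tuning grid; rfr_base is the name used in the evaluation below.

# random forest regressor via caret (defaults assumed for resampling and mtry)
set.seed(100)
rfr_base <- train(Balance ~ ., data = train_data, method = "rf")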
Since a random forest has many trees and it’s hard to show them all
at once, we use something called ‘feature importance’ to understand the
model better. Feature importance tells us how important each part of our
data is for making good predictions. We can find this out by using the
varImp() function.
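A sketch of that check.

# rank predictors by their importance in the fitted random forest
varImp(rfr_base)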
From our analysis, it’s clear that Limit stands out as the most important factor for predicting credit card balances, with Rating and Income also vital. The model gives a lot of weight to Limit, highlighting that a person’s credit limit is a key indicator of their balance. The credit rating (Rating) and the individual’s income (Income) are also significant, underlining their importance in accurately forecasting credit card balances.
Similarly to the decision tree regressor, we will evaluate the model’s performance on both the training and testing data.
# prediction on training data and testing data
y_train_pred <- predict(object = rfr_base, newdata = train_data)
y_test_pred <- predict(object = rfr_base, newdata = test_data)

rfr_results <- data.frame(
adj_r2_train = adj_r2(y_train_pred, train_data$Balance, nrow(train_data), ncol(train_data)),
MAE_train = MAE(y_train_pred, train_data$Balance),
adj_r2_test = adj_r2(y_test_pred, test_data$Balance, nrow(test_data), ncol(test_data)),
MAE_test = MAE(y_test_pred, test_data$Balance)
)
rfr_results

💡 Insight: the rfr_base model explains a higher share of the variance in Balance within the training data, and 90.25% within the testing data.

When we compare the two models, the random forest regressor does a much better job than the decision tree regressor at predicting credit card balances. The decision tree regressor has errors of around 115.39 on the training data and 157.30 on the testing data. These errors are less than 8% of the maximum balance range (which goes from 8 to 1999). While this is okay for a general analysis, it’s more noticeable when dealing with lower balances.
On the other hand, the random forest regressor significantly reduces these errors to 34.80 for training and 96.75 for testing. These errors are only about 1.74% and 4.83% of the maximum range, respectively. This big improvement in accuracy, especially during testing, shows that the random forest model is better. So, if we need precision in situations like detailed financial decision-making, the random forest model is the more reliable choice.
XGBoost takes the idea of decision trees and kicks it up a notch. Here’s how it works: It starts by building a basic decision tree, but it doesn’t stop there. Instead, XGBoost keeps creating more trees, and each new tree is like a smart teammate that learns from the mistakes of the previous ones. Think of it as a team effort where everyone is getting better over time. This technique is called ‘boosting,’ and it’s all about focusing on the tricky parts that earlier trees struggled with, kind of like solving a puzzle one challenging piece at a time.
What makes XGBoost stand out is its efficiency and accuracy. It doesn’t just throw in trees randomly; it does it in a clever and calculated way. The algorithm also has built-in safeguards to avoid getting too fixated on the specific details of the training data, which makes it more dependable when predicting new, unseen data. In a nutshell, XGBoost creates a well-tuned and collaborative group of trees that team up to make more precise predictions, especially when dealing with complex datasets.
In implementing the XGBoost regressor, we will use the same approach as with the random forest regressor, the only change being setting method = "xgbTree". Alternatively, we can construct an XGBoost regressor using the boost_tree() function from the tidymodels package. Both routes are sketched below.
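A sketch of both routes; xgbr_base is a hypothetical name for the caret fit, while xgbr_base_tm is the tidymodels fit evaluated next. Default settings are assumed in both cases.

# caret route: same call as before, switching the method
set.seed(100)
xgbr_base <- train(Balance ~ ., data = train_data, method = "xgbTree")

# tidymodels route: default XGBoost regressor fitted on the training data
xgbr_base_tm <- boost_tree() %>%
  set_mode("regression") %>%
  set_engine("xgboost") %>%
  fit(Balance ~ ., data = train_data)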
# prediction on training data and testing data with model from tidymodels
y_train_pred <- predict(object = xgbr_base_tm, new_data = train_data)
y_test_pred <- predict(object = xgbr_base_tm, new_data = test_data)

xgbr_tm_results <- data.frame(
  adj_r2_train = adj_r2(y_train_pred$.pred, train_data$Balance, nrow(train_data), ncol(train_data)),
  MAE_train = MAE(y_train_pred$.pred, train_data$Balance),
  adj_r2_test = adj_r2(y_test_pred$.pred, test_data$Balance, nrow(test_data), ncol(test_data)),
  MAE_test = MAE(y_test_pred$.pred, test_data$Balance)
)
xgbr_tm_results

💡 Insight: the xgbr_base_tm model explains a higher share of the variance in Balance within the training data, and 93.29% within the testing data.

From what we can see in the results, the XGBoost regressor performs better on the training metrics than the random forest regressor, and it also outperforms the random forest regressor on the testing metrics.
In this section, we will fine-tune the XGBoost regressor using the
tidymodels package. Among the adjustable parameters for the
XGBoost regressor, we will focus on trees, which controls
the number of trees used in the model.
Initially, we define the model and identify the specific argument to
be tuned. For tuning solely the trees parameter, we assign
the tune() function to it.
# 1. Mark trees as parameter to be tuned
xgbr_tune <- boost_tree(trees = tune()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

Next, we define the recipe, which outlines the steps for preparing our data for modeling. This process can involve various preprocessing tasks. In the following code, we specify what to predict, the predictors to use, and the dataset.
# 2. Define recipe
rcp <- recipe(Balance ~ ., data = train_data) %>%
  step_dummy(Gender, Student, Married, Ethnicity)

Subsequently, we integrate the model and recipe using the workflow() function. The following code demonstrates creating a tuning workflow that combines the XGBoost regressor with our preprocessing recipe.
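A sketch of that step, numbered to continue the comments above; xgb_wflow is the name passed to tune_grid() later, and the parameter inspection is assumed to be what produced the range printed below.

# 3. Combine model and recipe into a tuning workflow
xgb_wflow <- workflow() %>%
  add_model(xgbr_tune) %>%
  add_recipe(rcp)

# inspect the default tuning range of the trees parameter
extract_parameter_set_dials(xgb_wflow) %>%
  extract_parameter_dials("trees")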
#> # Trees (quantitative)
#> Range: [1, 2000]
Executing the above code shows that the default tuning range spans
from 1 to 2000. However, we want to specifically tune the
trees value between 15 and 30. To do this, we can manually
create a tibble with our desired range.
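A sketch of that grid; the name trees_values matches the tune_grid() call below.

# 4. Candidate values for trees
trees_values <- tibble(trees = 15:30)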
Next, we will define the benchmark for determining the best trees value, using MAE as the standard of measurement.
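A sketch of that metric definition; the name mae_res matches the tune_grid() call below.

# 5. Use MAE to compare candidate values
mae_res <- metric_set(mae)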
Finally, we are ready to tune our model using the
tune_grid() function.
# 6. Tune the model using the specified range for trees
tuning_results <- tune_grid(
  xgb_wflow,
  grid = trees_values,
  resamples = vfold_cv(train_data, v = 5),
  metrics = mae_res
)

Let’s inspect the best parameter for our XGBoost regressor.
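A sketch of that inspection.

# best trees value according to cross-validated MAE
select_best(tuning_results, metric = "mae")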
Based on the results, trees = 30 yields the lowest MAE in the 5-fold cross-validation. We will remodel our data using this outcome.
xgbr_tune <- boost_tree(trees = 30) %>%
  set_mode("regression") %>%
  set_engine("xgboost") %>%
  fit(Balance ~ ., data = train_data)

# prediction on training data and testing data with tuned model from tidymodels
y_train_pred <- predict(object = xgbr_tune, new_data = train_data)
y_test_pred <- predict(object = xgbr_tune, new_data = test_data)

xgbr_tm_tuned_results <- data.frame(
  adj_r2_train = adj_r2(y_train_pred$.pred, train_data$Balance, nrow(train_data), ncol(train_data)),
  MAE_train = MAE(y_train_pred$.pred, train_data$Balance),
  adj_r2_test = adj_r2(y_test_pred$.pred, test_data$Balance, nrow(test_data), ncol(test_data)),
  MAE_test = MAE(y_test_pred$.pred, test_data$Balance)
)
xgbr_tm_tuned_results

Looking at these results, our tuned regressor performs better than the pre-tuned one. However, there is a big gap between the errors we see on the training data and on the testing data. We observe this phenomenon in most of the models we have tested, and the more complex the model is, the bigger this error gap seems to be. Based on these findings, we can say that more complex models tend to memorize irrelevant things in the data, such as noise or unimportant details. In the next part, we’ll use xgbr_tune to help us understand the model better.
Our top-performing model is xgbr_tune. It gives us an
adjusted r-squared of 94.15% and a MAE of 70.36 on the test data. This
XGBoost model is great at finding complex patterns, but it’s not very
easy to figure out how it comes up with its predictions.
Luckily, R has a tool to help with this. It’s called the
lime package, short for Local Interpretable Model-Agnostic
Explanations. This tool can explain any kind of predictive model, even
complex ones like ours.
Now, let’s use the lime package to better understand our model. First, we build an explainer from the training data (sketched below); then we ask it to explain a few observations from the test data.
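As a sketch, the explainer might be built like this; the name explainer matches the explain() call below. This assumes lime can work with the parsnip model fit directly; if not, a custom model_type() method for the model class would be needed.

# build a lime explainer from the training predictors and the tuned model
explainer <- lime(x = train_data %>% select(-Balance),
                  model = xgbr_tune)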
# select only the first 4 observations of test_data
selected_data <- test_data %>%
  select(-Balance) %>%
  dplyr::slice(1:4)

explanation <- explain(x = selected_data,
                       explainer = explainer,
                       feature_select = "auto", # method of feature selection for lime
                       n_features = 10)         # number of features to explain the model

Now, we will visualize the explanation using the plot_features() function.
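A sketch of that call.

# bar plots of feature contributions for each explained case
plot_features(explanation)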
As we can see from the figure, our plot contains 4 observations and the details for each.
1. Case
Case indicates the index of our observation.
Case = 1 means the first row of our test data.
Case = 2 means the second row of our test data.
2. Explanation Fit
Explanation Fit reveals how well lime interprets our model, expressed as the R-squared of the local linear regression fitted around each observation. The figure reports this fit for each of the four cases.
3. Prediction
Prediction refers to the predicted credit card balance (Balance) given the predictors. Now, we will compare the predicted values with our actual values.
#> [1] 333 903 331 1350
#> [1] -13.18
#> [1] 17.67
#> [1] 86.77
#> [1] 28.32
If we pay close attention, we can see that case 3 has the largest error. This is connected to the explanation provided in the previous section. Because case 3 has the lowest explanation fit, it suggests that the model doesn’t have a good idea when making predictions. This results in a much higher error rate compared to the other cases.
4. Feature Barplot
For each observation, lime generates bar plots in red and blue. Red indicates a negative contribution to our target, and blue a positive one. Each observation shows a different pattern regarding which variables drive the predicted credit card balance. Of the 10 predictors, in most cases 4 dominate the prediction of credit card balances: Rating, Income, Limit, and Student.

For example, in one case the Income variable shows the cardholder’s income is below 22.8; judging by the bar length, however, its influence is insignificant. The lower credit limit, the No student status, and the lower credit rating contribute more to a low predicted balance.

Finally, we combine the results of all four models into a single comparison table.

dtr_results$model <- "Base Decision Tree Regressor"
rfr_results$model <- "Base Random Forest Regressor"
xgbr_tm_results$model <- "Base XGBoost Regressor"
xgbr_tm_tuned_results$model <- "Fine-Tuned XGBoost Regressor"
summary <- rbind(dtr_results,
rfr_results,
xgbr_tm_results,
xgbr_tm_tuned_results)
summary[, c(ncol(summary), 1:(ncol(summary) - 1))]

Our best-performing model is xgbr_tune, which we set up to use 30
trees. It’s really doing well with an adjusted r-squared of 99.95% on
the training data and 94.15% on the testing data. When it comes to
errors, it’s got a 5.69 MAE for the training data and 70.36 for the
testing data. But these results suggest that the model might be
overfitting, which means it’s great at handling the data it trained on
but not as good when it comes to new, unseen data.