This analysis aims to quantify the inferential relationship between historical credit card delinquency and future default. Specifically, we ask: How do a customer’s historical payment records (from the most recent month to six months prior) affect their probability of defaulting on their next payment, and which specific historical period carries the most predictive weight?
This report aims for a parsimonious model that yields actionable insights for risk management.
This dataset comes from the UCI Machine Learning Repository: Default of Credit Card Clients.
Sample size: 30,000 observations and 23 features; together with the ID and response columns, the file has 25 columns.
Although the dataset used in this report is already well structured and organized, some light data cleaning is needed to simplify the analysis and tailor it to the research question.
library(readxl) # read_excel()
library(dplyr)  # glimpse(), %>%, rename(), mutate(), across()
taiwan_finance <- read_excel('/Users/yuhe/Downloads/default of credit card clients.xls', skip = 1)
glimpse(taiwan_finance)
## Rows: 30,000
## Columns: 25
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ LIMIT_BAL <dbl> 20000, 120000, 90000, 50000, 50000, 50000…
## $ SEX <dbl> 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1,…
## $ EDUCATION <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2,…
## $ MARRIAGE <dbl> 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2,…
## $ AGE <dbl> 24, 26, 34, 37, 57, 37, 29, 23, 28, 35, 3…
## $ PAY_0 <dbl> 2, -1, 0, 0, -1, 0, 0, 0, 0, -2, 0, -1, -…
## $ PAY_2 <dbl> 2, 2, 0, 0, 0, 0, 0, -1, 0, -2, 0, -1, 0,…
## $ PAY_3 <dbl> -1, 0, 0, 0, -1, 0, 0, -1, 2, -2, 2, -1, …
## $ PAY_4 <dbl> -1, 0, 0, 0, 0, 0, 0, 0, 0, -2, 0, -1, -1…
## $ PAY_5 <dbl> -2, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, -1, -1…
## $ PAY_6 <dbl> -2, 2, 0, 0, 0, 0, 0, -1, 0, -1, -1, 2, -…
## $ BILL_AMT1 <dbl> 3913, 2682, 29239, 46990, 8617, 64400, 36…
## $ BILL_AMT2 <dbl> 3102, 1725, 14027, 48233, 5670, 57069, 41…
## $ BILL_AMT3 <dbl> 689, 2682, 13559, 49291, 35835, 57608, 44…
## $ BILL_AMT4 <dbl> 0, 3272, 14331, 28314, 20940, 19394, 5426…
## $ BILL_AMT5 <dbl> 0, 3455, 14948, 28959, 19146, 19619, 4830…
## $ BILL_AMT6 <dbl> 0, 3261, 15549, 29547, 19131, 20024, 4739…
## $ PAY_AMT1 <dbl> 0, 0, 1518, 2000, 2000, 2500, 55000, 380,…
## $ PAY_AMT2 <dbl> 689, 1000, 1500, 2019, 36681, 1815, 40000…
## $ PAY_AMT3 <dbl> 0, 1000, 1000, 1200, 10000, 657, 38000, 0…
## $ PAY_AMT4 <dbl> 0, 1000, 1000, 1100, 9000, 1000, 20239, 5…
## $ PAY_AMT5 <dbl> 0, 0, 1000, 1069, 689, 1000, 13750, 1687,…
## $ PAY_AMT6 <dbl> 0, 2000, 5000, 1000, 679, 800, 13770, 154…
## $ `default payment next month` <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
ID - ID
LIMIT_BAL - Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX - Gender (1 = male; 2 = female)
EDUCATION - Education level (1= graduate school; 2 = university; 3 = high school; 4 = others)
MARRIAGE - Marital status (1 = married; 2 = single; 3 = others)
AGE - Age
PAY_0 to PAY_6 - History of past payment (the dataset has no PAY_1 column). PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; . . .; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -2 = dormant; -1 = paid in full; 0 = partially paid; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
BILL_AMT1 to BILL_AMT6 - Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; . . .; BILL_AMT6 = amount of bill statement in April, 2005.
PAY_AMT1 to PAY_AMT6 - Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; . . .;PAY_AMT6 = amount paid in April, 2005.
default payment next month - whether the customer fails to make the minimum payment on the next month's bill (Yes = 1; No = 0)
In this report, we use only the PAY_0 to PAY_6 columns and the
default payment next month column.
In the credit industry, inactive accounts (-2), fully paid balances
(-1), and minimum payments (0) all represent accounts in good
standing, so for the purpose of default risk we treat the differences
between them as functionally irrelevant. Leaving these as distinct
categories would add unnecessary noise and work against the
principle of parsimony.
Therefore, we group these safe behaviors into a single categorical
level labeled "Current", which denotes an account in good standing
in the credit industry.
clean_data <- taiwan_finance %>%
  rename(default_next_month = `default payment next month`) %>%
  mutate(
    default_next_month = factor(ifelse(default_next_month == 1, "Yes", "No"),
                                levels = c("Yes", "No")),
    # Recode only PAY_0, PAY_2, ..., PAY_6; starts_with("PAY_") would also
    # match PAY_AMT1 to PAY_AMT6 and recode the payment amounts by mistake
    across(matches("^PAY_[0-9]+$"), ~ case_when(
      .x <= 0 ~ "Current",
      TRUE ~ as.character(.x))
    ),
    across(matches("^PAY_[0-9]+$"), as.factor)
  )
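To make the grouping rule concrete, here is a minimal base R sketch of the same recoding applied to a single vector. The status codes in `pay_status` are hypothetical values chosen for illustration, not rows from the dataset.

```r
# Toy repayment-status codes (hypothetical; same coding scheme as PAY_0 to PAY_6)
pay_status <- c(-2, -1, 0, 1, 2, 0, 3)

# Codes <= 0 collapse into "Current"; delay codes keep their own label
recoded <- factor(ifelse(pay_status <= 0, "Current", as.character(pay_status)))

# One "Current" level plus one level per observed delay length
table(recoded)
```

The four non-positive codes end up in one level, while each delay length stays separate so the models can still distinguish a one-month delay from a three-month delay.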
To prevent data leakage and ensure a fair evaluation, the dataset
is split into a 70% training set and a 30% testing set. A stratified
split via createDataPartition is used so that the ratio of
defaulters to non-defaulters stays approximately the same in both
subsets.
library(caret) # createDataPartition(), train(), confusionMatrix()
set.seed(123)
data_split <- createDataPartition(clean_data$default_next_month, p = 0.70, list = FALSE)
# Force standard data frames to keep memory use manageable during training (PS: my Mac is too weak for the random forest training...)
train_data <- as.data.frame(clean_data[data_split, ])
test_data <- as.data.frame(clean_data[-data_split, ])
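The idea behind a stratified split can be sketched in base R without caret. This is an illustration of the general technique on a simulated outcome vector `y` (an assumption, not the real data), and it is not a reproduction of createDataPartition's internals.

```r
set.seed(123)

# Hypothetical outcome with roughly 22% "Yes", similar to the default rate here
y <- factor(sample(c("Yes", "No"), size = 1000, replace = TRUE, prob = c(0.22, 0.78)))

# Draw 70% of the row indices separately inside each class
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.70 * length(idx)))]
}))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Class shares in the two subsets are close, though not exactly identical
round(prop.table(table(train_y)), 3)
round(prop.table(table(test_y)), 3)
```

Because the sampling happens within each class, the default rate in the training and testing subsets can differ only by rounding, which is what keeps the evaluation comparable.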
To answer the research question, two distinct classification models are applied and compared:
1. Logistic Regression (Baseline): Selected because its sigmoid function bounds predictions as probabilities between 0 and 1, making it mathematically appropriate for binary classification. It is also highly interpretable.
2. Random Forest (Advanced): Selected to test whether non-linear interactions between historical months (e.g., compounding consecutive late payments) yield higher predictive accuracy than a linear model.
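The bounding property of the sigmoid mentioned for the baseline model can be checked directly. The sketch below uses hypothetical log-odds values in `log_odds`; they are not coefficients or predictions from the fitted model.

```r
# Logistic (sigmoid) function: maps any real-valued linear predictor into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical linear predictors (log-odds), chosen only for illustration
log_odds <- c(-5, -1, 0, 1, 5)
probs <- sigmoid(log_odds)
round(probs, 3)

# Base R's plogis() computes the same quantity
all.equal(probs, plogis(log_odds))
```

However extreme the linear predictor becomes, the output stays strictly between 0 and 1, which is why it can be read as a default probability.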
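To guard against overfitting during model fitting and tuning, 5-fold cross-validation is applied exclusively within the training data, so the test set is never used for model selection. The models are optimized for the area under the ROC curve (the metric that caret labels "ROC") rather than raw accuracy, which can be highly misleading on imbalanced financial datasets.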
# Clean up memory
gc()
# Define cross-validation rules
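cv_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,                 # needed so ROC can be computed from class probabilities
  summaryFunction = twoClassSummary  # reports ROC, sensitivity, and specificity
)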
# 1. Fit Logistic Regression
log_model <- train(
  default_next_month ~ PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6,
  data = train_data,
  method = "glm",
  family = "binomial",
  trControl = cv_control,
  metric = "ROC"
)
# 2. Fit Random Forest
rf_model <- train(
  default_next_month ~ PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6,
  data = train_data,
  method = "ranger",
  importance = "impurity",
  trControl = cv_control,
  metric = "ROC"
)
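A short calculation illustrates why raw accuracy is a poor optimization target here. It uses the test-set class counts that appear in the confusion matrices of this report; the "always predict No" rule is a hypothetical baseline, not one of the fitted models.

```r
# Test-set class counts taken from the confusion matrices in this report
n_yes <- 694 + 1296   # actual defaulters
n_no  <- 363 + 6646   # actual non-defaulters

# A trivial rule that predicts "No" for every customer
accuracy_all_no    <- n_no / (n_yes + n_no)
sensitivity_all_no <- 0 / n_yes   # it never flags a single defaulter

round(accuracy_all_no, 4)
sensitivity_all_no
```

The trivial rule reaches roughly 78% accuracy, the same figure reported as the No Information Rate, while catching no defaulters at all. A ranking-based metric such as the area under the ROC curve does not reward this behavior.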
Both models are evaluated on the held-out test set using Receiver Operating Characteristic (ROC) curves and confusion matrices.
library(pROC)    # roc(), auc(), ggroc()
library(ggplot2) # geom_abline(), theme_minimal(), labs()
log_preds <- predict(log_model, newdata = test_data, type = "prob")
rf_preds <- predict(rf_model, newdata = test_data, type = "prob")
# Construct ROC objects (levels = controls first, then cases)
log_roc <- roc(test_data$default_next_month, log_preds$Yes, levels = c("No", "Yes"))
rf_roc <- roc(test_data$default_next_month, rf_preds$Yes, levels = c("No", "Yes"))
# Visualize comparative performance
ggroc(list("Logistic Regression" = log_roc, "Random Forest" = rf_roc), size = 1) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "gray") +
  theme_minimal() +
  labs(
    title = "Model Validation: Historical Payment Impact",
    subtitle = paste("AUC - Logistic:", round(auc(log_roc), 3),
                     "| AUC - Random Forest:", round(auc(rf_roc), 3)),
    x = "Specificity (True Negative Rate)",
    y = "Sensitivity (True Positive Rate)",
    color = "Algorithm Type"
  )
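The AUC shown in the plot subtitle has a simple pairwise interpretation, sketched below in base R. The scores are hypothetical predicted probabilities invented for this example, not output from either model.

```r
# Hypothetical predicted default probabilities (not model output)
scores_yes <- c(0.9, 0.6, 0.4)        # customers who defaulted
scores_no  <- c(0.7, 0.3, 0.2, 0.1)   # customers who did not

# AUC = share of (defaulter, non-defaulter) pairs in which the defaulter
# receives the higher score; tied pairs count as half
pairwise <- outer(scores_yes, scores_no, ">") +
  0.5 * outer(scores_yes, scores_no, "==")
auc_manual <- mean(pairwise)
auc_manual
```

Here 10 of the 12 pairs are ordered correctly, so the AUC is 10/12. A model that ranks customers at random would score about 0.5, which corresponds to the dashed diagonal in the plot.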
log_class_preds <- predict(log_model, newdata = test_data)
rf_class_preds <- predict(rf_model, newdata = test_data)
confusionMatrix(
  data = log_class_preds,
  reference = test_data$default_next_month,
  positive = "Yes"
)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  Yes   No
##        Yes  694  363
##        No  1296 6646
##
## Accuracy : 0.8156
## 95% CI : (0.8075, 0.8236)
## No Information Rate : 0.7789
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3569
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.34874
## Specificity : 0.94821
## Pos Pred Value : 0.65658
## Neg Pred Value : 0.83682
## Prevalence : 0.22114
## Detection Rate : 0.07712
## Detection Prevalence : 0.11746
## Balanced Accuracy : 0.64848
##
## 'Positive' Class : Yes
##
confusionMatrix(
  data = rf_class_preds,
  reference = test_data$default_next_month,
  positive = "Yes"
)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  Yes   No
##        Yes  709  380
##        No  1281 6629
##
## Accuracy : 0.8154
## 95% CI : (0.8073, 0.8234)
## No Information Rate : 0.7789
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3605
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.35628
## Specificity : 0.94578
## Pos Pred Value : 0.65106
## Neg Pred Value : 0.83805
## Prevalence : 0.22114
## Detection Rate : 0.07879
## Detection Prevalence : 0.12101
## Balanced Accuracy : 0.65103
##
## 'Positive' Class : Yes
##
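The headline statistics in these outputs follow directly from the four cell counts. As a check, the sketch below recomputes several of them for the logistic regression matrix in base R; the variable names are ours, and the formulas are the standard definitions rather than caret's source code.

```r
# Cell counts from the logistic regression confusion matrix ("Yes" is positive)
tp <- 694    # predicted Yes, actually Yes
fp <- 363    # predicted Yes, actually No
fn <- 1296   # predicted No,  actually Yes
tn <- 6646   # predicted No,  actually No
n  <- tp + fp + fn + tn

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / n
ppv         <- tp / (tp + fp)
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_chance    <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, ppv = ppv, balanced = balanced,
        kappa = cohen_kappa), 4)
```

The results agree with the printed values (for example, sensitivity 0.3487 and kappa 0.3569), and swapping in the random forest counts reproduces its output in the same way.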
The ROC curves and confusion matrices show that the baseline Logistic Regression model and the advanced Random Forest model achieved nearly identical predictive performance on the test set (accuracy 0.8156 vs. 0.8154; sensitivity 0.349 vs. 0.356; specificity 0.948 vs. 0.946). Both models are far better at recognizing non-defaulters than defaulters: at the default classification cutoff, each detects only about 35% of actual defaults. A likely reason for the tie is that the feature set was restricted to categorical payment histories, so the non-linear interaction capabilities of the Random Forest had little extra structure to exploit. Following the principle of parsimony, Logistic Regression is the preferred model for this specific dataset: it achieves equivalent accuracy at much lower computational cost and with greater regulatory transparency.
Having validated the models, we extract the variable importance metrics from the Random Forest and compare them with those from the Logistic Regression model to answer the inference question directly.
rf_importance <- varImp(rf_model)
plot(rf_importance, top = 6,
     main = "Drivers of Default: Most Critical Historical Months")
log_importance <- varImp(log_model)
plot(log_importance, top = 6,
     main = "Logistic Regression: Variable Importance")
### Insight
While the two variable importance plots look slightly different, they point to the same conclusion: behavior in the most recent month (PAY_0) drives risk. The Random Forest ranks PAY_0Current (the dummy variable for a "Current" status in PAY_0) as the most important term for isolating non-defaulters (largest impurity reduction), while the Logistic Regression ranks PAY_02 (the dummy variable for a two-month delay in PAY_0) highest. For a glm, caret's varImp is based on the absolute value of each coefficient's test statistic, so that ranking reflects the strength of the statistical evidence rather than effect size alone.
The variable importance analysis yields a clear inferential
conclusion: PAY_0 (the most recent month’s payment
status) is by far the dominant predictor of default.
The predictive weight of historical delinquency decays rapidly over
time. A customer’s behavior six months ago (PAY_6) adds
comparatively little predictive value relative to their immediate,
current financial state. For risk mitigation, financial institutions
should weight the most recent 30-to-60 days of behavior heavily rather
than relying on longer-term historical averages.
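The "impurity reduction" behind the Random Forest ranking can be illustrated with a single split. The sketch below assumes Gini impurity and uses hypothetical node counts chosen for illustration; they are not taken from the fitted forest.

```r
# Gini impurity of a node with n_yes defaulters and n_no non-defaulters
gini <- function(n_yes, n_no) {
  p <- n_yes / (n_yes + n_no)
  2 * p * (1 - p)
}

# Hypothetical node counts (illustrative only)
parent <- c(yes = 220, no = 780)
left   <- c(yes = 80,  no = 720)   # e.g. rows where PAY_0 is "Current"
right  <- c(yes = 140, no = 60)    # e.g. rows with any delay in PAY_0

# Impurity after the split is the size-weighted average of the child nodes
n <- sum(parent)
weighted_child <- (sum(left) / n) * gini(left[["yes"]], left[["no"]]) +
  (sum(right) / n) * gini(right[["yes"]], right[["no"]])

impurity_decrease <- gini(parent[["yes"]], parent[["no"]]) - weighted_child
impurity_decrease
```

A split that separates defaulters from non-defaulters this cleanly lowers impurity substantially. An impurity-based importance score accumulates such decreases over every split that uses a given variable across the trees, which is how a term like PAY_0Current can rise to the top.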
Although these models discriminate between defaulters and non-defaulters considerably better than chance, computational resource constraints necessitated the exclusion of continuous financial variables from the current analysis. Consequently, the model can predict only the behavioral probability of default and cannot estimate the magnitude of the resulting financial losses. In future iterations of this research, distributed computing technologies should be leveraged to reincorporate variables such as BILL_AMT (bill amount) and demographic data, thereby enabling deeper inferences regarding the socioeconomic drivers underlying default behavior.