Predicting Credit Default: The Impact of Historical Payment Behavior

1. Introduction & Research Question

This analysis aims to quantify the inferential relationship between historical credit card delinquency and future default. Specifically, we ask: How do a customer’s historical payment records (from the most recent month to six months prior) affect their probability of defaulting on their next payment, and which specific historical period carries the most predictive weight?

This report aims for a parsimonious model and actionable insights for risk management.

1.1 Dataset Introduction

The data come from the UCI Machine Learning Repository: Default of Credit Card Clients.

Sample size: 30,000 observations with 23 features. Including the ID column and the target, the file has 25 columns.

2. Data Preprocessing

Although the dataset is already well structured and organized, some light data cleaning is needed to tailor it to this analysis and simplify the modeling.

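# Packages used throughout this report (ranger must also be installed for the random forest)
library(tidyverse)  # dplyr verbs, glimpse(), ggplot2
library(readxl)     # read_excel()
library(caret)      # createDataPartition(), train(), confusionMatrix()
library(pROC)       # roc(), auc(), ggroc()

taiwan_finance <- read_excel('/Users/yuhe/Downloads/default of credit card clients.xls', skip = 1)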

glimpse(taiwan_finance)

## Rows: 30,000
## Columns: 25
## $ ID                           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ LIMIT_BAL                    <dbl> 20000, 120000, 90000, 50000, 50000, 50000…
## $ SEX                          <dbl> 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1,…
## $ EDUCATION                    <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2,…
## $ MARRIAGE                     <dbl> 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2,…
## $ AGE                          <dbl> 24, 26, 34, 37, 57, 37, 29, 23, 28, 35, 3…
## $ PAY_0                        <dbl> 2, -1, 0, 0, -1, 0, 0, 0, 0, -2, 0, -1, -…
## $ PAY_2                        <dbl> 2, 2, 0, 0, 0, 0, 0, -1, 0, -2, 0, -1, 0,…
## $ PAY_3                        <dbl> -1, 0, 0, 0, -1, 0, 0, -1, 2, -2, 2, -1, …
## $ PAY_4                        <dbl> -1, 0, 0, 0, 0, 0, 0, 0, 0, -2, 0, -1, -1…
## $ PAY_5                        <dbl> -2, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, -1, -1…
## $ PAY_6                        <dbl> -2, 2, 0, 0, 0, 0, 0, -1, 0, -1, -1, 2, -…
## $ BILL_AMT1                    <dbl> 3913, 2682, 29239, 46990, 8617, 64400, 36…
## $ BILL_AMT2                    <dbl> 3102, 1725, 14027, 48233, 5670, 57069, 41…
## $ BILL_AMT3                    <dbl> 689, 2682, 13559, 49291, 35835, 57608, 44…
## $ BILL_AMT4                    <dbl> 0, 3272, 14331, 28314, 20940, 19394, 5426…
## $ BILL_AMT5                    <dbl> 0, 3455, 14948, 28959, 19146, 19619, 4830…
## $ BILL_AMT6                    <dbl> 0, 3261, 15549, 29547, 19131, 20024, 4739…
## $ PAY_AMT1                     <dbl> 0, 0, 1518, 2000, 2000, 2500, 55000, 380,…
## $ PAY_AMT2                     <dbl> 689, 1000, 1500, 2019, 36681, 1815, 40000…
## $ PAY_AMT3                     <dbl> 0, 1000, 1000, 1200, 10000, 657, 38000, 0…
## $ PAY_AMT4                     <dbl> 0, 1000, 1000, 1100, 9000, 1000, 20239, 5…
## $ PAY_AMT5                     <dbl> 0, 0, 1000, 1069, 689, 1000, 13750, 1687,…
## $ PAY_AMT6                     <dbl> 0, 2000, 5000, 1000, 679, 800, 13770, 154…
## $ `default payment next month` <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…

Variable explanation

ID - ID

LIMIT_BAL - Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

SEX - Gender (1 = male; 2 = female)

EDUCATION - Education level (1 = graduate school; 2 = university; 3 = high school; 4 = others)

MARRIAGE - Marital status (1 = married; 2 = single; 3 = others)

AGE - Age

PAY_0, PAY_2 to PAY_6 - History of past payment (there is no PAY_1 column). PAY_0 = the repayment status in September 2005; PAY_2 = the repayment status in August 2005; . . .; PAY_6 = the repayment status in April 2005. The measurement scale for the repayment status is: -2 = dormant; -1 = paid in full; 0 = partially paid; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

BILL_AMT1 to BILL_AMT6 - Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; . . .; BILL_AMT6 = amount of bill statement in April, 2005.

PAY_AMT1 to PAY_AMT6 - Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; . . .;PAY_AMT6 = amount paid in April, 2005.

default payment next month - Whether the customer fails to make the minimum payment on the next month's bill (Yes = 1; No = 0)

In this report, we use only the PAY_0 to PAY_6 columns and the default payment next month column.

In the credit industry, inactive accounts (-2), fully paid balances (-1), and minimum payments (0) all represent accounts in good standing, so we treat the differences among them as immaterial for default risk. Keeping them as distinct categories would add noise and work against the principle of parsimony.

Therefore, we group these safe behaviors into a single factor level labeled "Current", which denotes an account in good standing.

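clean_data <- taiwan_finance %>%
  rename(default_next_month = `default payment next month`) %>%
  mutate(
    # "Yes" is the first level so that caret treats default as the event of interest
    default_next_month = factor(ifelse(default_next_month == 1, "Yes", "No"),
                                levels = c("Yes", "No")),
    # matches("^PAY_[0-9]$") selects PAY_0 and PAY_2..PAY_6 only;
    # starts_with("PAY_") would also recode PAY_AMT1..PAY_AMT6
    across(matches("^PAY_[0-9]$"), ~ factor(case_when(
      .x <= 0 ~ "Current",
      TRUE ~ as.character(.x)
    )))
  )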
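To make the recoding concrete, the following minimal sketch applies the same rule to a small hypothetical vector of status codes (the vector `pay_status` is illustrative and not taken from the dataset). It also shows how R orders the resulting factor levels, which matters because the first level becomes the reference category in the logistic regression.

```r
# Hypothetical status codes covering each kind of value in the PAY_* columns
pay_status <- c(-2, -1, 0, 1, 2, 8)

# Same rule as above: anything <= 0 is "Current", delays keep their month count
recoded <- factor(ifelse(pay_status <= 0, "Current", as.character(pay_status)))

table(recoded)    # three "Current" entries, one each of "1", "2", "8"
levels(recoded)   # digits sort before letters, so "1" comes first
```

Because "1" is the first level, model terms are named by appending the level to the column name (for example PAY_02 or PAY_0Current) and are measured relative to a one-month delay. If "Current" were preferred as the baseline, `relevel(recoded, ref = "Current")` would be one option.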

2.1 Stratified Data Splitting

To guard against data leakage and allow a fair evaluation, the dataset is split into a 70% training set and a 30% test set. The split is stratified via createDataPartition, which samples within each outcome class so that the ratio of defaulters to non-defaulters stays approximately the same in both subsets.

set.seed(123)

data_split <- createDataPartition(clean_data$default_next_month, p = 0.70, list = FALSE)

# Convert tibbles to standard data frames to reduce overhead during training (PS: my Mac is too weak for the random forest training...)
train_data <- as.data.frame(clean_data[data_split, ])
test_data  <- as.data.frame(clean_data[-data_split, ])
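The idea behind a stratified split can be sketched in base R on synthetic data. This is only an illustration of sampling within each class, under the assumption of a roughly 22% default rate; it is not a reproduction of what createDataPartition does internally, and the names `y`, `idx_by_class`, and `train_idx` are hypothetical.

```r
set.seed(123)

# Synthetic outcome with roughly 22% defaulters (illustrative, not the real data)
y <- factor(sample(c("Yes", "No"), 1000, replace = TRUE, prob = c(0.22, 0.78)),
            levels = c("Yes", "No"))

# Sample 70% of the row indices separately within each class
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.70 * length(idx)))
}), use.names = FALSE)

# Class proportions should be close in both subsets
prop_train <- prop.table(table(y[train_idx]))
prop_test  <- prop.table(table(y[-train_idx]))
round(rbind(train = prop_train, test = prop_test), 3)
```

The two rows of the printed table should differ only by rounding, which is the property the stratified split is meant to provide.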

3. Methodology & Model Fitting

To answer the research question, two distinct classification models are applied and compared:

1. Logistic Regression (Baseline): Selected because its sigmoid function properly bounds predictions as probabilities between 0 and 1, making it mathematically appropriate for binary classification. It offers maximum interpretability.
2. Random Forest (Advanced): Selected to test if non-linear interactions between historical months (e.g., compounding consecutive late payments) yield higher predictive accuracy than a linear model.
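As a small illustration of the bounding property mentioned for logistic regression, the sketch below defines the sigmoid by hand and compares it with base R's `plogis()`. The input values in `log_odds` are arbitrary examples.

```r
# Sigmoid (inverse logit): maps any real-valued log-odds to a value in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Arbitrary example log-odds, from strongly negative to strongly positive
log_odds <- c(-5, -1, 0, 1, 5)
probs <- sigmoid(log_odds)

round(probs, 3)
all.equal(probs, plogis(log_odds))  # plogis() is the built-in equivalent
```

A log-odds of 0 corresponds to a probability of 0.5, and no input can push the output outside the open interval between 0 and 1.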

3.1 Cross-Validation Strategy

To limit overfitting, 5-fold cross-validation is applied exclusively within the training data. The models are selected on the ROC metric (area under the ROC curve) rather than raw accuracy, which can be highly misleading on imbalanced financial datasets.
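The arithmetic behind the accuracy concern can be checked with the test-set class counts reported in the confusion matrices of Section 4.3 (1,990 defaulters and 7,009 non-defaulters). The sketch below evaluates a hypothetical classifier that always predicts "No".

```r
# Test-set class counts taken from the confusion matrices in Section 4.3
n_default    <- 1990   # 694 + 1296
n_no_default <- 7009   # 363 + 6646

# A hypothetical model that predicts "No" for every customer
always_no_accuracy    <- n_no_default / (n_default + n_no_default)
always_no_sensitivity <- 0 / n_default   # it never catches a defaulter

round(always_no_accuracy, 4)   # 0.7789, the "No Information Rate" in the output
always_no_sensitivity          # 0
```

An accuracy near 78% is therefore achievable without learning anything, which is why a threshold-independent metric is used for model selection.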

Modeling

# Clean up memory
gc()

# Define cross-validation rules
cv_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,                  # required to compute ROC
  summaryFunction = twoClassSummary   # reports ROC, Sens, Spec
)

# 1. Fit Logistic Regression
log_model <- train(
  default_next_month ~ PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6,
  data = train_data,
  method = "glm",
  family = "binomial",
  trControl = cv_control,
  metric = "ROC"
)

# 2. Fit Random Forest
rf_model <- train(
  default_next_month ~ PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6,
  data = train_data,
  method = "ranger", 
  importance = "impurity", 
  trControl = cv_control,
  metric = "ROC"
)

4. Model Diagnostics & Validation

The models are evaluated on the test set using Receiver Operating Characteristic (ROC) curves and confusion matrices.

4.1 Generate Predictions

log_preds <- predict(log_model, newdata = test_data, type = "prob")
rf_preds  <- predict(rf_model, newdata = test_data, type = "prob")

4.2 ROC curve visualization

# Construct ROC objects
log_roc <- roc(test_data$default_next_month, log_preds$Yes, levels = c("No", "Yes"))
rf_roc  <- roc(test_data$default_next_month, rf_preds$Yes, levels = c("No", "Yes"))

# Visualize comparative performance
ggroc(list("Logistic Regression" = log_roc, "Random Forest" = rf_roc), size = 1) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "gray") +
  theme_minimal() +
  labs(
    title = "Model Validation: Historical Payment Impact",
    subtitle = paste("AUC - Logistic:", round(auc(log_roc), 3), 
                     "| AUC - Random Forest:", round(auc(rf_roc), 3)),
    x = "Specificity (True Negative Rate)",
    y = "Sensitivity (True Positive Rate)",
    color = "Algorithm Type"
  )
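For intuition about the AUC values shown in the plot subtitle, the area under the ROC curve equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes it from ranks on hypothetical scores; `auc_rank` is an illustrative helper, not a pROC function.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic scaled to [0, 1])
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)                      # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted default probabilities and true labels
scores      <- c(0.90, 0.80, 0.30, 0.20)
is_positive <- c(TRUE, TRUE, FALSE, FALSE)

auc_rank(scores, is_positive)    # 1: every defaulter outranks every non-defaulter
auc_rank(scores, !is_positive)   # 0: the ordering is exactly reversed
```

A value of 0.5 corresponds to the dashed diagonal in the plot, meaning the scores carry no ranking information.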

4.3 Confusion Matrices

log_class_preds <- predict(log_model, newdata = test_data)
rf_class_preds  <- predict(rf_model, newdata = test_data)

confusionMatrix(
  data = log_class_preds, 
  reference = test_data$default_next_month, 
  positive = "Yes"
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  694  363
##        No  1296 6646
##                                           
##                Accuracy : 0.8156          
##                  95% CI : (0.8075, 0.8236)
##     No Information Rate : 0.7789          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3569          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.34874         
##             Specificity : 0.94821         
##          Pos Pred Value : 0.65658         
##          Neg Pred Value : 0.83682         
##              Prevalence : 0.22114         
##          Detection Rate : 0.07712         
##    Detection Prevalence : 0.11746         
##       Balanced Accuracy : 0.64848         
##                                           
##        'Positive' Class : Yes             
##

confusionMatrix(
  data = rf_class_preds, 
  reference = test_data$default_next_month, 
  positive = "Yes"
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  709  380
##        No  1281 6629
##                                           
##                Accuracy : 0.8154          
##                  95% CI : (0.8073, 0.8234)
##     No Information Rate : 0.7789          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3605          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.35628         
##             Specificity : 0.94578         
##          Pos Pred Value : 0.65106         
##          Neg Pred Value : 0.83805         
##              Prevalence : 0.22114         
##          Detection Rate : 0.07879         
##    Detection Prevalence : 0.12101         
##       Balanced Accuracy : 0.65103         
##                                           
##        'Positive' Class : Yes             
##
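The headline statistics in both outputs follow directly from the four cell counts. As a cross-check, the sketch below recomputes them by hand; `metrics_from_counts` is a hypothetical helper written for this illustration, with "Yes" as the positive class.

```r
# Recompute summary metrics from the cells of a 2x2 confusion matrix
metrics_from_counts <- function(tp, fp, fn, tn) {
  sens <- tp / (tp + fn)   # share of actual defaulters that were flagged
  spec <- tn / (tn + fp)   # share of non-defaulters that were cleared
  c(accuracy          = (tp + tn) / (tp + fp + fn + tn),
    sensitivity       = sens,
    specificity       = spec,
    pos_pred_value    = tp / (tp + fp),
    balanced_accuracy = (sens + spec) / 2)
}

# Cell counts copied from the two outputs above
log_metrics <- metrics_from_counts(tp = 694, fp = 363, fn = 1296, tn = 6646)
rf_metrics  <- metrics_from_counts(tp = 709, fp = 380, fn = 1281, tn = 6629)

round(rbind(logistic = log_metrics, random_forest = rf_metrics), 4)
```

Placing the two rows side by side makes it easier to see that the models differ only in the third decimal place on most metrics.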

4.4 Interpretation

The ROC curves and confusion matrices show that the baseline Logistic Regression and the Random Forest achieve nearly identical predictive performance on the test set: accuracy of 0.8156 versus 0.8154 and sensitivity of 0.349 versus 0.356. At the default classification cutoff, both models flag only about 35% of actual defaulters while correctly clearing about 95% of non-defaulters. A likely reason for the tie is that, with the feature set restricted to six categorical payment histories, the Random Forest's ability to model non-linear interactions adds little. Following the principle of parsimony, Logistic Regression is the preferred model for this dataset; it achieves equivalent accuracy with far lower computational cost and greater regulatory transparency.

5. Results & Interpretation

Having validated the models, we extract variable importance from the Random Forest and compare it with the logistic regression model to answer the research question directly.

rf_importance <- varImp(rf_model)

plot(rf_importance, top = 6, 
     main = "Drivers of Default: Most Critical Historical Months")

log_importance <- varImp(log_model)

plot(log_importance, top = 6, 
     main = "Logistic Regression: Variable Importance")

Insight

While the two variable importance plots look slightly different, they point to the same conclusion: behavior in the most recent month (PAY_0) dominates risk. The Random Forest ranks PAY_0Current (the dummy variable for PAY_0 = "Current") as the most important term for isolating non-defaulters (impurity reduction), while the Logistic Regression ranks PAY_02 (the dummy variable for PAY_0 = 2, a two-month delay) as the strongest risk factor for predicting default.
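To our understanding, caret's varImp() for glm models ranks terms by the absolute value of each coefficient's test statistic. The sketch below mimics that ranking with base R's `glm()` on synthetic data, under the assumption that default depends only on the most recent status; the variables `pay_recent`, `pay_old`, and `default` are hypothetical and unrelated to the real dataset.

```r
set.seed(1)
n <- 2000
status_levels <- c("1", "2", "Current")

# Synthetic recoded payment statuses for a recent and an older month
pay_recent <- factor(sample(status_levels, n, replace = TRUE, prob = c(0.1, 0.1, 0.8)))
pay_old    <- factor(sample(status_levels, n, replace = TRUE, prob = c(0.1, 0.1, 0.8)))

# Assumed data-generating process: only the recent month drives default
p_default <- ifelse(pay_recent == "2", 0.70,
                    ifelse(pay_recent == "1", 0.35, 0.15))
default <- rbinom(n, size = 1, prob = p_default)

fit <- glm(default ~ pay_recent + pay_old, family = binomial)

# Absolute z-statistics per dummy term (intercept dropped), largest first
z_values   <- summary(fit)$coefficients[-1, "z value"]
importance <- sort(abs(z_values), decreasing = TRUE)
round(importance, 2)
```

The term names in the output (such as pay_recent2 or pay_recentCurrent) follow the same column-plus-level pattern as PAY_02 and PAY_0Current, and each is measured against the reference level "1".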

5.1 Conclusion

The variable importance analysis points to a clear conclusion: PAY_0 (the most recent month’s payment status) is the overwhelmingly dominant predictor of default. The predictive weight of historical delinquency decays rapidly over time. A customer’s behavior six months ago (PAY_6) adds little predictive value compared to their immediate, current financial state. For risk mitigation, financial institutions should weight the most recent 30 to 60 days of behavior most heavily rather than relying on longer-term historical averages.

5.2 Limitation

Although these models achieved significant predictive discrimination, computational resource constraints necessitated the exclusion of continuous financial variables from the current analysis. Consequently, the model is capable of predicting only the behavioral probability of default, but cannot estimate the magnitude of the resulting financial losses. In future iterations of this research, distributed computing technologies should be leveraged to reincorporate variables such as BILL_AMT (bill amount) and demographic data, thereby enabling deeper inferences regarding the socioeconomic drivers underlying default behavior.