Data Split Thesis

Overview

This section constructs a stratified 60/20/20 train–validation–test split of the Freddie Mac panel dataset. The split is made at the loan level, so that all observations of a given loan fall in the same subset and no loan's history leaks from the training data into the validation or test data. Stratification on the loan-level default indicator keeps the share of defaulting loans nearly identical across the three subsets, which makes training and evaluation results comparable.

Setup

library(data.table)

Data Loading

The panel dataset generated during preprocessing is loaded from the specified directory.

OUTPUT_DIR <- "/Users/amalianimeskern/Library/CloudStorage/OneDrive-ErasmusUniversityRotterdam/Freddie Mac Data"

panel <- readRDS(file.path(OUTPUT_DIR, "freddie_mac_panel.rds"))

Construction of Loan-Level Default Indicator

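A binary indicator is constructed at the loan level to identify whether a loan experiences a default event within the observation window. The indicator equals 1 if default_next_12m equals 1 for at least one observation of the loan, and 0 otherwise.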

loan_ids <- panel[, .(
  ever_default = max(default_next_12m)
), by = loan_sequence_number]
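
The collapse from observations to loans can be illustrated on a small toy panel. The sketch below is not part of the pipeline: the table toy_panel and its values are hypothetical, and the use of any() with na.rm = TRUE is an assumption that treats a missing flag as "no observed default". Whether the real panel contains missing flags is not stated here.

```r
library(data.table)

# Toy panel (hypothetical values): three loans with 3, 2, and 1 observations
toy_panel <- data.table(
  loan_sequence_number = c("A", "A", "A", "B", "B", "C"),
  default_next_12m     = c(0L, 0L, 1L, 0L, 0L, NA)
)

# One row per loan: 1 if any observation is flagged, 0 otherwise.
# Assumption: a missing flag counts as "no observed default".
toy_loans <- toy_panel[, .(
  ever_default = as.integer(any(default_next_12m == 1L, na.rm = TRUE))
), by = loan_sequence_number]

toy_loans
```

Loan A is flagged because one of its rows carries a default, while loans B and C are not. With max() and no NA handling, as in the pipeline code, loan C would instead receive NA and would fall into neither stratum.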

Stratified Sampling Procedure

To preserve the class proportions, defaulting and non-defaulting loans are partitioned separately. Within each group, the loans are randomly shuffled and assigned to the training (60%), validation (20%), and test (20%) subsets. A fixed seed makes the assignment reproducible.

set.seed(123)

defaults     <- loan_ids[ever_default == 1]
non_defaults <- loan_ids[ever_default == 0]

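split_loans <- function(dt) {
  n   <- nrow(dt)
  idx <- sample(n)  # random permutation of the row indices

  n_train <- round(0.6 * n)
  n_valid <- round(0.2 * n)
  n_test  <- n - n_train - n_valid  # remainder, so the three sizes sum to n

  # Label the shuffled rows in one step. Unlike a:b index ranges, rep()
  # cannot run backwards when a subset is empty (very small groups).
  labels <- rep(c("train", "valid", "test"),
                times = c(n_train, n_valid, n_test))
  dt[idx, split := labels]

  dt
}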

defaults     <- split_loans(defaults)
non_defaults <- split_loans(non_defaults)

loan_splits <- rbindlist(list(defaults, non_defaults))
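
As a worked check of the rounding rule, the subset sizes can be recomputed from the stratum sizes alone. The sketch below takes the two stratum sizes from the loan-level table in the verification section (sums over the three splits); the object names n_strata and sizes are introduced here for illustration only.

```r
# Stratum sizes implied by the loan-level verification table
n_strata <- c(
  default     = 10661 + 3554 + 3553,
  non_default = (144339 - 10661) + (48113 - 3554) + (48113 - 3553)
)

# Apply the 60/20/20 rounding rule to each stratum;
# the test subset receives the remainder
sizes <- sapply(n_strata, function(n) {
  n_train <- round(0.6 * n)
  n_valid <- round(0.2 * n)
  c(train = n_train, valid = n_valid, test = n - n_train - n_valid)
})

sizes            # rows: train / valid / test, columns: the two strata
rowSums(sizes)   # total loans per subset
```

The row sums reproduce the reported totals of 144,339, 48,113, and 48,113 loans. Because the test subset absorbs the rounding remainder in each stratum, its composition can differ from the validation subset by a loan or two even when the totals coincide.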

Assignment of Split Labels to Panel Data

The loan-level split assignments are merged back into the full panel dataset, so that every observation inherits the split of its loan.

panel <- merge(panel,
               loan_splits[, .(loan_sequence_number, split)],
               by = "loan_sequence_number")
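
A few assertions can confirm that the merge behaves as intended. The following is a minimal sketch on toy stand-ins for panel and loan_splits (all values hypothetical). The same three checks could be run on the real objects, assuming loan_splits holds exactly one row per loan.

```r
library(data.table)

# Toy stand-ins for `panel` and `loan_splits` (hypothetical values)
toy_panel <- data.table(
  loan_sequence_number = rep(c("A", "B", "C"), times = c(3, 2, 4)),
  default_next_12m     = 0L
)
toy_splits <- data.table(
  loan_sequence_number = c("A", "B", "C"),
  split                = c("train", "valid", "test")
)

merged <- merge(toy_panel, toy_splits, by = "loan_sequence_number")

# 1. The inner join should neither drop nor duplicate observations
stopifnot(nrow(merged) == nrow(toy_panel))

# 2. Each loan should appear in exactly one split (no leakage)
splits_per_loan <- merged[, .(n_splits = uniqueN(split)),
                          by = loan_sequence_number]
stopifnot(all(splits_per_loan$n_splits == 1L))

# 3. No observation should be left without a label
stopifnot(!anyNA(merged$split))
```

The first check matters because merge() performs an inner join by default, so any loan missing from the split table would silently disappear from the panel.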

Verification of Stratification

Loan-Level Distribution

The distribution of default events across splits is evaluated at the loan level.

loan_splits[, .(
  n_loans      = .N,
  n_default    = sum(ever_default),
  default_rate = round(100 * mean(ever_default), 2)
), by = split]

##     split n_loans n_default default_rate
##    <char>   <int>     <int>        <num>
## 1:  train  144339     10661         7.39
## 2:  valid   48113      3554         7.39
## 3:   test   48113      3553         7.38

Observation-Level Distribution

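The distribution is also checked at the observation level. Because stratification is applied to loans and loans contribute different numbers of observations, the observation-level rates are only approximately equal across splits. They are also far lower than the loan-level rates, since each loan contributes many rows of which few are flagged as defaults.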

panel[, .(
  n_obs        = .N,
  n_default    = sum(default_next_12m),
  default_rate = round(100 * mean(default_next_12m), 4)
), by = split]

##     split   n_obs n_default default_rate
##    <char>   <int>     <int>        <num>
## 1:   test 1461257      3553       0.2431
## 2:  valid 1465449      3554       0.2425
## 3:  train 4386259     10661       0.2431
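
The small gap between the validation rate and the other two can be traced to panel length. The sketch below re-enters the counts from the two tables above into a small table named counts (a name introduced here for illustration), then recomputes the rates and the average number of observations per loan.

```r
library(data.table)

# Counts copied from the loan-level and observation-level tables above
counts <- data.table(
  split     = c("train", "valid", "test"),
  n_loans   = c(144339, 48113, 48113),
  n_obs     = c(4386259, 1465449, 1461257),
  n_default = c(10661, 3554, 3553)
)

# Observation-level default rate (percent) and average rows per loan
counts[, default_rate  := round(100 * n_default / n_obs, 4)]
counts[, rows_per_loan := n_obs / n_loans]

counts
```

Each subset averages a little over 30 observations per loan. The validation subset has slightly longer histories on average, which spreads a nearly identical number of default flags over more rows and lowers its observation-level rate.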

Export of Final Datasets

The panel dataset is partitioned into training, validation, and test subsets and saved for subsequent modeling.

train <- panel[split == "train"]
valid <- panel[split == "valid"]
test  <- panel[split == "test"]

train[, split := NULL]
valid[, split := NULL]
test[, split := NULL]

saveRDS(train, file.path(OUTPUT_DIR, "train.rds"))
saveRDS(valid, file.path(OUTPUT_DIR, "valid.rds"))
saveRDS(test,  file.path(OUTPUT_DIR, "test.rds"))

Summary

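A stratified 60/20/20 split is applied at the loan level, so that all observations of a loan fall in the same subset.
Default rates are nearly identical across subsets: about 7.4% of loans and about 0.24% of observations.
The resulting datasets are suitable for model training, hyperparameter tuning, and out-of-sample evaluation.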