Credit Risk

FZ2024 Financial Modeling and Programming

Author: Sergio Castellanos-Gamboa, PhD

Affiliation: Tecnológico de Monterrey

Published: November 7, 2025

0.1 Before you begin: important instructions for all Workshops

Welcome to our workshop series! Please read these instructions carefully before starting any activity. Following these guidelines will make your work smoother and ensure that your submissions are graded without issues.

0.1.1 Working environment

We will use Google Colab for all workshops. Colab runs Python in the cloud — you don’t need to install anything locally.

  • Access Colab at: https://colab.research.google.com/
  • Sign in with your institutional Google account for access to all features.
  • Always save a copy of the notebook to your Google Drive:
    • Go to File → Save a copy in Drive.

0.1.2 Loading data

You may work with datasets provided by the instructor or public datasets online. You will receive instructions each time to load the data with Python code. However, it is a good idea to store files, like data or your own notes, in a dedicated Google Drive folder:

  1. Create a folder in your Google Drive named fz2024_workshops (or similar).
  2. Upload your datasets there.

0.1.3 Output and submission format

  • After completing the workshop, export your notebook as PDF:
    • In Colab: File → Print → Save as PDF.
  • Submit both the PDF file and the .ipynb notebook through Canvas.
  • Include all outputs, tables, and graphs in your PDF — make sure you run all cells before exporting.
  • Name your PDF file using the following format: Lastname_Firstname_WorkshopX.pdf

0.1.4 Deadlines

All assignments must be uploaded to Canvas before the stated deadline. Late submissions are not accepted. Once you have read and understood these instructions, you are ready to begin the workshop!

1 Overview

In this workshop you will build credit risk models to estimate the Probability of Default (PD) using Logistic Regression, Linear Discriminant Analysis (LDA), Decision Tree Classifier, and Ridge Classifier. We work with a LendingClub sample and follow a simple Internal Ratings-Based (IRB)-style flow:

  1. Define the dependent variable
  2. Prepare features
  3. Split data
  4. Train models
  5. Evaluate with Accuracy and Recall, and
  6. Extract PD from the best model.

Learning goals: understand credit risk and PD, translate Basel IRB intuition into a modeling workflow, train/compare 4 classifiers, evaluate with Accuracy & Recall, and produce PD estimates and a new-applicant prediction.

2 Credit Risk & Basel in a nutshell

Credit risk is the risk that a borrower will not meet their obligations. More broadly, it is the probability that the counterparty to any agreement will not fulfill their commitment. Basel expected loss (EL) is the product of three components:

\text{EL} = \text{PD} \times \text{LGD} \times \text{EAD}.

  • Probability of Default (PD): likelihood of default within a horizon (often 12 months).
  • Loss Given Default (LGD): fraction of exposure lost when default occurs (after recoveries).
  • Exposure at Default (EAD): amount ($) outstanding at the moment of default.

Takeaway: PD is the probability lever; LGD is the severity lever; EAD is the exposure lever. Changing any lever changes EL — and capital, pricing, and strategy follow.
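To make the identity concrete, here is a small worked example. All three inputs are hypothetical values chosen for illustration; they are not taken from the workshop dataset.

```python
# Hypothetical loan: illustrative inputs only.
pd_12m = 0.04     # Probability of Default over a 12-month horizon
lgd = 0.45        # Loss Given Default (fraction of exposure lost)
ead = 10_000.0    # Exposure at Default, in dollars

# EL = PD x LGD x EAD
expected_loss = pd_12m * lgd * ead
print(f"EL = {expected_loss:.2f}")            # EL = 180.00

# The identity is multiplicative, so doubling PD doubles EL.
stressed_loss = (2 * pd_12m) * lgd * ead
print(f"Stressed EL = {stressed_loss:.2f}")   # Stressed EL = 360.00
```

The same proportional effect holds for LGD and EAD, which is why each of the three is described as a lever.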

2.1 Basel Accords: why PD modeling matters

Banks don’t model default just for sport; they do it because capital, provisioning, pricing, and strategy depend on it. Basel’s Expected Loss (EL) identity is the “bridge” between risk measurement and business decisions.

Basel I → II → III (and now IV in some jurisdictions):

  • Basel I (1988): first global minimum capital rules; simple risk weights, mostly about credit risk.
  • Basel II (2004): three pillars (Pillar 1 capital, Pillar 2 supervisory review, Pillar 3 market discipline), plus Operational Risk; crucially, allows IRB so banks can model PD, LGD, EAD.
  • Basel III (2010+): post-GFC tightening — higher quality capital, leverage ratio, liquidity, and buffers (capital conservation, countercyclical).

How we study credit risk in practice for retail/SME lending:

  • Standardized Approach (SA) uses regulator-set risk weights — easy, conservative, less sensitive to borrower risk.
  • IRB Approach estimates PD/LGD/EAD internally — more risk-sensitive but requires model governance: data lineage, validation, backtesting, stability, and conservatism.

We focus on PD — mapping borrower features X to \Pr(\text{Default}=1 \mid X).

Even if you’re not building regulatory capital models, a good PD model informs loan approvals, risk-based pricing, limits, and collections strategy. Later, PDs roll up into portfolio EL, stress testing, and capital planning.

A PD model is a probability map from borrower features X to the chance of default. With logistic regression:

\Pr(\text{Default}=1\mid X)=\frac{1}{1+e^{-(\beta_0+X\beta)}}.
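The formula can be evaluated directly. The sketch below assumes a hypothetical intercept, two hypothetical coefficients, and a made-up borrower; the function name logistic_pd is ours, not part of any library.

```python
import numpy as np

def logistic_pd(x, beta0, beta):
    """Return Pr(Default=1 | x) under a logistic model."""
    z = beta0 + np.dot(x, beta)          # linear index: beta_0 + X beta
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid maps it into (0, 1)

# Hypothetical coefficients and borrower features (illustration only).
beta0 = -2.0
beta = np.array([0.8, -0.5])
x = np.array([1.0, 2.0])

pd_hat = logistic_pd(x, beta0, beta)     # linear index here is -2.2
print(round(float(pd_hat), 4))

# A linear index of exactly zero corresponds to a PD of 0.5.
pd_at_zero = logistic_pd(np.zeros(2), 0.0, beta)
```

A more negative linear index pushes the PD toward 0, and a more positive one pushes it toward 1.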

3 Setup

We’ll use libraries available in Google Colab and standard Python environments. If you’re in Colab, no special installs should be required.

3.1 Core Python libraries for Financial Modeling and Machine Learning

In this workshop, we use a small ecosystem of Python libraries that together form the foundation of most data science and machine learning projects.
Each one has a specific role in the workflow:

| Library | Import Name | Main Purpose |
|---|---|---|
| NumPy | import numpy as np | Provides fast mathematical operations, array manipulation, and linear algebra tools. It is the numerical backbone of all modern Python data libraries. |
| pandas | import pandas as pd | Used for data loading, cleaning, and manipulation. It introduces the DataFrame, a spreadsheet-like structure ideal for tabular data. |
| scikit-learn | from sklearn import ... | The main machine learning library in Python. It includes algorithms for classification, regression, and clustering, as well as tools for preprocessing, model validation, and performance metrics. |
| matplotlib | import matplotlib.pyplot as plt | The standard library for data visualization. We use it to create simple charts, histograms, and diagnostic plots. |
| statsmodels | import statsmodels.api as sm | Focused on statistical modeling. It complements scikit-learn by providing detailed regression summaries, significance tests, and econometric-style analysis. |

Together, these libraries let us:

  1. Load and explore data (pandas, NumPy),
  2. Prepare and transform variables (scikit-learn preprocessors),
  3. Build and validate models (scikit-learn, statsmodels),
  4. Visualize results (matplotlib).

Think of them as a pipeline: NumPy handles raw numbers → pandas organizes data → scikit-learn learns patterns → statsmodels interprets results → matplotlib visualizes findings.

# Core stack
import numpy as np
import pandas as pd

# Modeling
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Metrics & diagnostics
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import statsmodels.api as sm

3.2 Data Workflow Summary

Before any modeling, it’s essential to understand the full data workflow — from loading the dataset to connecting preprocessing with the model. Each step below ensures that the data entering the model is structured, scaled, and encoded correctly. This guarantees consistency, reproducibility, and fair comparison among different algorithms.

| Step | Tool / Function | Purpose |
|---|---|---|
| Load data | pd.read_csv(url_or_path) | Imports the dataset from a file or URL into a pandas DataFrame. |
| Inspect data | .head(), .info(), .describe() | Quickly review the first rows, data types, and summary statistics to detect anomalies or missing values. |
| Split dataset | train_test_split(X, y, test_size=..., random_state=#) | Divides data into training and testing sets. A fixed random_state makes the split reproducible; adding stratify=y keeps the same class distribution in both sets. |

Together, these steps form the foundation of any supervised learning project. They standardize how data is handled before reaching the model, making experiments replicable and minimizing manual mistakes.

Quick reminders

  • Load data: pd.read_csv(url_or_path); inspect with .head(), .info(), .describe().
  • Split: train_test_split(X, y, test_size=..., random_state=#).

4 Data: LendingClub Sample

We use a cleaned sample hosted online; the variable descriptions are given below. The original data set has at least 2 million observations and 150 variables. The file “credit.csv” contains only 873 observations (rows) and 71 columns: the target Default plus 70 predictors. Each row represents a LendingClub client. The data have already been cleaned (missing values, correlated variables, and zero- and near-zero-variance predictors).

The goal is to predict whether the loan will default (Charged Off) or be fully repaid (Fully Paid).

| Variable | Type | Meaning / Description | Interpretation in Credit Risk |
|---|---|---|---|
| Default | Categorical (target) | Loan outcome: "Fully Paid" (0) or "Charged Off" (1). | Dependent variable: whether the client defaulted. |
| term | Numeric (coded) | Loan term, coded as a category (1 = short-term, 2 = long-term). | Longer terms increase uncertainty and usually risk. |
| installment | Numeric | Monthly payment amount due on the loan. | Higher installments may strain borrower capacity. |
| grade | Categorical (A–G mapped to numeric) | Internal credit grade assigned by LendingClub. | Proxy for borrower credit quality (a lower letter grade, toward G, means riskier). |
| emp_length | Numeric | Years of employment at current job. | Longer employment often signals stability and lower risk. |
| home_ownership | Categorical | Borrower’s housing status (e.g., rent, mortgage, own). | Homeowners may be more stable; renters slightly riskier. |
| annual_inc | Numeric | Annual income of the borrower in U.S. dollars. | Higher income improves repayment capacity. |
| verification_status | Categorical | Indicates whether the borrower’s income was verified. | Verified income reduces uncertainty about true earnings. |
| purpose | Categorical | Purpose of the loan (e.g., debt consolidation, car, credit card). | Some purposes (like debt consolidation) are historically riskier. |
| num_il_tl | Numeric | Number of installment accounts the borrower has. | Many existing loans may signal leverage. |
| num_rev_accts | Numeric | Number of revolving (credit card) accounts. | Too many revolving accounts can raise risk. |
| percent_bc_gt_75 | Numeric (percentage) | % of bankcard accounts where balance > 75% of limit. | High utilization indicates potential over-indebtedness. |
| pub_rec_bankruptcies | Numeric (count) | Number of public bankruptcy records. | Any record increases default risk significantly. |
| total_bc_limit | Numeric | Total bankcard credit limit available. | Higher limits may show creditworthiness or exposure risk. |

These features capture capacity, character, and credit behavior — the core “3 Cs” of credit analysis. Models use them jointly to estimate the Probability of Default (PD) for new applicants.

url = "https://raw.githubusercontent.com/abernal30/ml_book/main/credit.csv"
credit = pd.read_csv(url)
credit.info()
credit.head()
credit.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 873 entries, 0 to 872
Data columns (total 71 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Default                873 non-null    object 
 1   term                   873 non-null    int64  
 2   installment            873 non-null    float64
 3   grade                  873 non-null    int64  
 4   emp_title              873 non-null    float64
 5   emp_length             873 non-null    int64  
 6   home_ownership         873 non-null    int64  
 7   annual_inc             873 non-null    float64
 8   verification_status    873 non-null    int64  
 9   purpose                873 non-null    int64  
 10  title                  873 non-null    int64  
 11  zip_code               873 non-null    int64  
 12  addr_state             873 non-null    int64  
 13  dti                    873 non-null    float64
 14  delinq_2yrs            873 non-null    int64  
 15  earliest_cr_line       873 non-null    int64  
 16  fico_range_high        873 non-null    int64  
 17  inq_last_6mths         873 non-null    int64  
 18  pub_rec                873 non-null    int64  
 19  revol_bal              873 non-null    int64  
 20  revol_util             873 non-null    float64
 21  total_acc              873 non-null    int64  
 22  total_rec_int          873 non-null    float64
 23  recoveries             873 non-null    float64
 24  last_pymnt_d           873 non-null    int64  
 25  last_pymnt_amnt        873 non-null    float64
 26  last_credit_pull_d     873 non-null    int64  
 27  last_fico_range_high   873 non-null    int64  
 28  last_fico_range_low    873 non-null    int64  
 29  tot_coll_amt           873 non-null    int64  
 30  tot_cur_bal            873 non-null    int64  
 31  open_acc_6m            873 non-null    int64  
 32  open_act_il            873 non-null    int64  
 33  open_il_12m            873 non-null    int64  
 34  open_il_24m            873 non-null    int64  
 35  mths_since_rcnt_il     873 non-null    int64  
 36  total_bal_il           873 non-null    int64  
 37  il_util                873 non-null    int64  
 38  open_rv_12m            873 non-null    int64  
 39  open_rv_24m            873 non-null    int64  
 40  max_bal_bc             873 non-null    int64  
 41  all_util               873 non-null    int64  
 42  total_rev_hi_lim       873 non-null    int64  
 43  inq_fi                 873 non-null    int64  
 44  total_cu_tl            873 non-null    int64  
 45  inq_last_12m           873 non-null    int64  
 46  acc_open_past_24mths   873 non-null    int64  
 47  avg_cur_bal            873 non-null    int64  
 48  bc_open_to_buy         873 non-null    int64  
 49  bc_util                873 non-null    float64
 50  mo_sin_old_il_acct     873 non-null    float64
 51  mo_sin_old_rev_tl_op   873 non-null    int64  
 52  mo_sin_rcnt_rev_tl_op  873 non-null    int64  
 53  mo_sin_rcnt_tl         873 non-null    int64  
 54  mort_acc               873 non-null    int64  
 55  mths_since_recent_bc   873 non-null    int64  
 56  mths_since_recent_inq  873 non-null    int64  
 57  num_accts_ever_120_pd  873 non-null    int64  
 58  num_actv_bc_tl         873 non-null    int64  
 59  num_bc_sats            873 non-null    int64  
 60  num_bc_tl              873 non-null    int64  
 61  num_il_tl              873 non-null    int64  
 62  num_op_rev_tl          873 non-null    int64  
 63  num_rev_accts          873 non-null    int64  
 64  num_rev_tl_bal_gt_0    873 non-null    int64  
 65  num_sats               873 non-null    int64  
 66  num_tl_op_past_12m     873 non-null    int64  
 67  pct_tl_nvr_dlq         873 non-null    float64
 68  percent_bc_gt_75       873 non-null    float64
 69  pub_rec_bankruptcies   873 non-null    int64  
 70  total_bc_limit         873 non-null    int64  
dtypes: float64(12), int64(58), object(1)
memory usage: 484.4+ KB
term installment grade emp_title emp_length home_ownership annual_inc verification_status purpose title ... num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies total_bc_limit
count 873.000000 873.000000 873.000000 873.000000 873.000000 873.000000 873.000000 873.000000 873.000000 873.000000 ... 873.000000 873.000000 873.000000 873.000000 873.000000 873.000000 873.000000 873.000000 873.000000 873.000000
mean 1.219931 447.911707 2.538373 334.707331 4.298969 1.792669 79421.277010 1.768614 3.306987 4.223368 ... 8.798396 8.749141 15.305842 5.840779 12.406644 2.463918 94.175716 39.036197 0.135166 24827.071019
std 0.414437 258.247786 1.271135 181.498130 2.513489 0.915885 42055.600733 0.779070 1.814736 1.667774 ... 7.385868 4.635333 8.239209 3.217439 5.543059 1.944503 9.026038 35.194738 0.355253 24568.268359
min 1.000000 32.970000 1.000000 1.000000 1.000000 1.000000 13000.000000 1.000000 1.000000 1.000000 ... 0.000000 1.000000 2.000000 0.000000 2.000000 0.000000 39.000000 0.000000 0.000000 0.000000
25% 1.000000 256.050000 2.000000 191.000000 3.000000 1.000000 50000.000000 1.000000 2.000000 3.000000 ... 4.000000 5.000000 10.000000 4.000000 9.000000 1.000000 91.700000 0.000000 0.000000 8800.000000
50% 1.000000 391.620000 2.000000 335.500000 3.000000 1.000000 70000.000000 2.000000 3.000000 4.000000 ... 7.000000 8.000000 14.000000 5.000000 11.000000 2.000000 97.900000 33.300000 0.000000 17500.000000
75% 1.000000 612.890000 3.000000 487.000000 6.000000 3.000000 100000.000000 2.000000 3.000000 4.000000 ... 11.000000 11.000000 19.000000 7.000000 15.000000 3.000000 100.000000 66.700000 0.000000 33200.000000
max 2.000000 1252.560000 7.000000 648.000000 11.000000 3.000000 450000.000000 3.000000 11.000000 11.000000 ... 59.000000 30.000000 80.000000 28.000000 46.000000 12.000000 100.000000 100.000000 2.000000 281300.000000

8 rows × 70 columns

The target Default has two labels: Charged Off (default) and Fully Paid (no default).

credit['Default'].value_counts(dropna=False)
Default
Fully Paid     728
Charged Off    145
Name: count, dtype: int64

Define dependent and independent variables

y=credit["Default"] # select the Default variable (the target)
X=credit.drop(columns=["Default"]) # we drop the dependent variable, Default
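The target is stored as text labels. Some metrics and models are easier to read with a 0/1 target, so here is a minimal sketch of that encoding on a tiny illustrative Series (the values are made up; only the two label names come from the dataset).

```python
import pandas as pd

# Tiny illustrative Series using the same two labels as the Default column.
labels = pd.Series(["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid"])

# 1 = default (Charged Off), 0 = no default (Fully Paid)
y_binary = (labels == "Charged Off").astype(int)

print(y_binary.tolist())   # [0, 1, 0, 0]
print(y_binary.mean())     # observed default rate: 0.25
```

With a 0/1 target, the mean of the column is the observed default rate.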

5 Train / Test Split

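Use an 80/20 split with the assignment seed (random_state=43). Note that the call below does not stratify; to preserve the default rate in both sets, pass stratify=y as well (the split, and therefore the numbers reported in the following sections, may then differ slightly).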

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=43
)
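As a minimal sketch of what stratification does, the snippet below uses synthetic labels with the same class counts as the sample (728 and 145). X_demo and y_demo are placeholders, not the workshop data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mirroring the class counts of the sample.
y_demo = np.array(["Fully Paid"] * 728 + ["Charged Off"] * 145)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature

# stratify=y_demo keeps the share of each class (almost) equal in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=43, stratify=y_demo
)

rate_all = np.mean(y_demo == "Charged Off")
rate_train = np.mean(y_tr == "Charged Off")
rate_test = np.mean(y_te == "Charged Off")
print(round(rate_all, 4), round(rate_train, 4), round(rate_test, 4))
```

Without stratify, the default rate in the test set can drift away from the overall rate purely by chance, which matters most for the minority class.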

6 Estimate models

import warnings 
warnings.filterwarnings('ignore')

model = LogisticRegression()

model.fit(X_train,y_train) 
LogisticRegression()

For now, we will not focus on causal analysis; that is, we are not asking what effect the independent variables have on the dependent variable (that will come later). Instead, we focus on the predictive power of the models, so we need to generate predictions.

predict_logit=model.predict(X_test)
predict_logit[:5]
array(['Fully Paid', 'Fully Paid', 'Charged Off', 'Fully Paid',
       'Charged Off'], dtype=object)

Now, to measure the performance of our prediction we will use the accuracy measure (see more below).

accuracy_score(y_test, predict_logit) 
0.9314285714285714

This means that 93.14% of the time, the algorithm predicted the same outcome as the observed data. Other things equal, the higher this number, the better the model.
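Accuracy treats both classes alike. Recall for the default class can be computed in the same way; because the labels are strings, scikit-learn needs to be told which one is the positive class through pos_label. The labels below are hypothetical and serve only to illustrate the call.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical true and predicted labels (illustration only).
y_true_demo = ["Fully Paid", "Charged Off", "Charged Off",
               "Fully Paid", "Fully Paid", "Charged Off"]
y_pred_demo = ["Fully Paid", "Charged Off", "Fully Paid",
               "Fully Paid", "Charged Off", "Charged Off"]

acc = accuracy_score(y_true_demo, y_pred_demo)

# pos_label names the class that counts as "positive" (here, default).
rec = recall_score(y_true_demo, y_pred_demo, pos_label="Charged Off")

print(round(acc, 4), round(rec, 4))   # 0.6667 0.6667
```

In the workshop, the same call would take y_test and the model predictions in place of the two demo lists.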

6.1 LinearDiscriminantAnalysis

We will now estimate the predictions with a different model. For explanations on this and the other models involved in the exercise, please read the next section.

# Define the model

model_LDA=LDA()

# Follow the same steps as before
model_LDA.fit(X_train,y_train)
predict_LDA=model_LDA.predict(X_test)
accuracy_score(y_test, predict_LDA)
0.9428571428571428

This model achieved a higher accuracy (94.29%) than the Logit model (93.14%).

6.2 Probability of Default

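Now we estimate the probability of default for each observation in the test set, that is, data that was not used to train the model. The columns of predict_proba follow the order of model_LDA.classes_, which scikit-learn sorts alphabetically, so column 0 is the probability of Charged Off (the PD) and column 1 is the probability of Fully Paid.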

pd.DataFrame(model_LDA.predict_proba(X_test)).head()
0 1
0 0.000036 0.999964
1 0.999554 0.000446
2 0.489148 0.510852
3 0.000002 0.999998
4 0.999953 0.000047
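Reading the PD off unnamed columns 0 and 1 is error-prone. A minimal sketch of a safer pattern, on synthetic stand-in data (two made-up features, not the workshop dataset), is to label the columns with the fitted class names and then score a hypothetical new applicant.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in data: two numeric features, string labels as in the workshop.
rng = np.random.default_rng(0)
X_good = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X_bad = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(40, 2))
X_demo = np.vstack([X_good, X_bad])
y_demo = np.array(["Fully Paid"] * 100 + ["Charged Off"] * 40)

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

# Name the probability columns after the classes instead of 0 and 1.
proba = pd.DataFrame(lda_demo.predict_proba(X_demo), columns=lda_demo.classes_)
pd_hat = proba["Charged Off"]                  # the PD column

# Hypothetical new applicant described by the same two features.
new_applicant = np.array([[1.5, 1.8]])
col = list(lda_demo.classes_).index("Charged Off")
pd_new = lda_demo.predict_proba(new_applicant)[0, col]
print(list(lda_demo.classes_), round(float(pd_new), 4))
```

Looking up the column by class name keeps the code correct even if the label spelling or ordering changes.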

7 Model Families

7.1 Logistic Regression (Logit)

What it is:
A classic statistical model that predicts the probability that a borrower will default.
It models the log-odds of default as a linear combination of borrower characteristics:

\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k

How it works:
It finds the line (or surface) that best separates “good” and “bad” borrowers, then uses the logistic (sigmoid) function to turn that into probabilities between 0 and 1.

Why it’s good:
- Produces probabilities, not just yes/no labels.
- Coefficients can be interpreted as odds ratios, great for explainability.
- Transparent, simple, and widely accepted under Basel frameworks.

Why it’s popular:
- The standard for credit scoring.
- Easy to implement and explain.
- Performs well even with limited data.
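To illustrate the odds-ratio interpretation, the sketch below fits a logistic regression on simulated data. The feature, the data-generating coefficients, and the sample size are all assumptions made for the example. The target is coded 0/1 so that the coefficient refers to the default class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 2000
feature = rng.normal(size=n)                   # hypothetical standardized feature
true_logit = -1.5 + 1.0 * feature              # assumed data-generating process
prob_default = 1.0 / (1.0 + np.exp(-true_logit))
y = (rng.random(n) < prob_default).astype(int) # 1 = default, 0 = no default

logit_demo = LogisticRegression().fit(feature.reshape(-1, 1), y)

# coef_ holds the log-odds effect for class 1; exponentiate to get an odds ratio.
odds_ratio = np.exp(logit_demo.coef_[0, 0])
print(round(float(odds_ratio), 2))
```

An odds ratio above 1 means that a one-unit increase in the feature multiplies the odds of default by that factor. With string labels, the coefficients refer to the second entry of classes_, so it is worth checking which class that is before interpreting signs.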

7.2 Linear Discriminant Analysis (LDA)

What it is:
A method that finds the line (or plane) that best separates two groups (defaulters vs non-defaulters).

How it works:
It assumes each group follows a normal distribution and that both groups share the same covariance matrix.
Then it computes a linear boundary that maximizes the separation between the two groups.

Why it’s good:
- Fast to compute.
- Works well if features are roughly continuous and normally distributed.
- A useful benchmark model.

Why it’s less used now:
- Assumes data are Gaussian and continuous.
- Doesn’t handle categorical variables or non-linear effects easily.

7.3 Decision Tree Classifier

What it is:
A model that makes predictions by asking a sequence of if/else questions, such as:
“Is income < 30,000?” → “Does the borrower own a house?” → “Is the loan > 10,000?”

How it works:
It recursively splits the data into smaller and smaller groups to make each leaf node as “pure” as possible — meaning mostly all good or all bad borrowers.

Why it’s good:
- Handles both numeric and categorical data.
- Captures non-linear relationships naturally.
- Easy to visualize and explain to managers and regulators.

Why it’s tricky:
- Can overfit if not pruned (too deep or too many leaves).
- Needs regularization to generalize well.

Why it’s popular:
- Intuitive and visual.
- Foundation for advanced ensemble models (Random Forest, XGBoost).
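The overfitting point can be seen in a small experiment. This is a sketch on simulated data with an assumed non-linear default rule plus label noise; max_depth=3 is an arbitrary illustrative choice, not a recommended setting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(600, 4))

# Assumed rule: default when features 0 and 1 are both high, with 10% label noise.
noise = rng.random(600) < 0.1
y_demo = (((X_demo[:, 0] > 0.3) & (X_demo[:, 1] > 0.0)) ^ noise).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# An unrestricted tree keeps splitting until it memorizes the training data.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Limiting the depth is one simple form of regularization.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("deep", deep_tree), ("shallow", shallow_tree)]:
    print(name, tree.get_depth(),
          round(tree.score(X_tr, y_tr), 3), round(tree.score(X_te, y_te), 3))
```

Compare the gap between training and test accuracy for the two trees: a large gap is the usual symptom of overfitting.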

7.4 Ridge Classifier

What it is:
A linear model similar to Logistic Regression that includes a penalty discouraging large coefficients; this is called L2 regularization.

How it works:
It still tries to draw a linear boundary but shrinks the coefficients toward zero to reduce overfitting and noise sensitivity.

Why it’s good:
- Handles many correlated features well (useful after one-hot encoding).
- More stable than plain logistic regression.

Why it’s limited:
- Doesn’t directly output probabilities (needs calibration).
- Slightly harder to interpret.

Why it’s popular:
- Fast, robust, and good when you have many features or little data.
- Often used as a strong baseline in machine learning competitions.
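The sketch below shows the probability limitation in practice with scikit-learn's RidgeClassifier on simulated data (the features and the labeling rule are assumptions for the example). The model exposes a signed score through decision_function, but no predict_proba method.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)   # assumed rule

# alpha controls the strength of the L2 penalty (1.0 is the library default).
ridge_demo = RidgeClassifier(alpha=1.0).fit(X_demo, y_demo)

scores = ridge_demo.decision_function(X_demo)   # signed scores, not probabilities
labels = ridge_demo.predict(X_demo)             # class 1 where the score is positive

has_proba = hasattr(ridge_demo, "predict_proba")
print(has_proba)                                # False
```

To obtain PDs from such a model, the scores would need an additional calibration step, which is why the workshop extracts PDs from a model that provides predict_proba directly.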

7.5 Why these models matter for PD estimation

In credit risk, we don’t just care about overall accuracy — we care about the type of error:

| Error Type | Meaning | Cost |
|---|---|---|
| False Negative | Predict “good” but borrower defaults | 💸 Very costly |
| False Positive | Predict “bad” but borrower pays | ❌ Lost opportunity |

Therefore, banks often prioritize Recall for class 1 (defaults): it’s better to be conservative and reject a few safe borrowers than to approve one who defaults.

7.6 Model Evaluation Metrics: Accuracy, Recall & Precision

Once a model predicts who will default and who will not, we must evaluate how good those predictions are. In credit risk, we usually care about both overall performance and the type of mistakes the model makes.

7.6.1 The Confusion Matrix

A confusion matrix summarizes model predictions versus actual outcomes:

| | Predicted: No Default (0) | Predicted: Default (1) |
|---|---|---|
| Actual: No Default (0) | True Negative (TN) – borrower pays | False Positive (FP) – model wrongly predicts default |
| Actual: Default (1) | False Negative (FN) – borrower defaults but model missed it | True Positive (TP) – borrower defaults and model predicted it |

We can think of it as a simple 2×2 grid of model outcomes.
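A minimal sketch of how to obtain the four cells with scikit-learn follows. The labels are hypothetical; passing labels=[0, 1] fixes the row and column order so that it matches the grid above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical outcomes: 0 = no default, 1 = default.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, both in the order [0, 1].
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)   # 4 2 1 3
```

With string labels such as those in the workshop data, labels=["Fully Paid", "Charged Off"] would give the same layout.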

7.6.2 Accuracy

Accuracy measures the percentage of total predictions that the model got right.

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

It answers: “Out of all cases, how many did we classify correctly?”

Good for: general model performance
⚠️ Limitation: misleading with imbalanced data (e.g., if only 5% of loans default, a model that always predicts “no default” gets 95% accuracy but is useless).
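The 5% example can be checked in a few lines. The data below are constructed to match it exactly: 100 loans, 5 of which default, and a "model" that always predicts no default.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true_demo = np.array([1] * 5 + [0] * 95)     # 5% default rate
y_pred_naive = np.zeros_like(y_true_demo)      # always predict "no default"

acc = accuracy_score(y_true_demo, y_pred_naive)
rec = recall_score(y_true_demo, y_pred_naive)  # recall for class 1 (default)

print(acc, rec)   # 0.95 0.0
```

The accuracy looks excellent, yet the model does not catch a single defaulter.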

7.6.3 Recall (Sensitivity or True Positive Rate)

Recall focuses on the default (1) class and measures how many actual defaulters we correctly identified.

\text{Recall} = \frac{TP}{TP + FN}

It answers: “Of all borrowers who actually defaulted, how many did we catch?”

Good for: credit risk and fraud detection, where missing a defaulter is costly.
⚠️ Limitation: increasing Recall can reduce Precision — catching more defaults may mean more false alarms.

7.6.4 Precision

Precision tells us how many of the borrowers predicted as defaulters actually were defaulters.

\text{Precision} = \frac{TP}{TP + FP}

It answers: “When we predict someone will default, how often are we correct?”

Good for: when false positives (rejecting good clients) are costly.
⚠️ Trade-off: a model can have high Recall but low Precision, or vice versa.
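The trade-off appears as soon as the decision threshold on the predicted PD moves. The sketch below uses ten hypothetical borrowers with made-up PDs and compares two thresholds.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical outcomes (1 = default) and predicted PDs, for illustration only.
y_true_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
pd_scores = np.array([0.90, 0.80, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05])

results = {}
for threshold in (0.5, 0.3):
    # Flag a borrower as a predicted defaulter when the PD reaches the threshold.
    y_pred = (pd_scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true_demo, y_pred),
        recall_score(y_true_demo, y_pred),
    )
    print(threshold, results[threshold])
```

Lowering the threshold from 0.5 to 0.3 catches every defaulter in this example (Recall rises from 0.75 to 1.0), but more good borrowers are flagged, so Precision falls from 0.75 to about 0.57.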

7.6.5 Putting it Together

| Metric | Focus | Measures | Ideal Use Case |
|---|---|---|---|
| Accuracy | Overall performance | Correct predictions over total | Balanced datasets |
| Recall (Sensitivity) | Defaults caught | TP / (TP + FN) | Credit risk, fraud, medical tests |
| Precision | Correct alarms | TP / (TP + FP) | Lending profitability, marketing |
| F1-Score | Balance of Precision & Recall | 2 × (Precision × Recall) / (Precision + Recall) | When both types of error matter |
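All four metrics follow from the confusion-matrix counts. As a worked example with hypothetical counts (TN = 4, FP = 2, FN = 1, TP = 3):

```python
# Hypothetical confusion-matrix counts (illustration only).
tn, fp, fn, tp = 4, 2, 1, 3

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 7 / 10
recall = tp / (tp + fn)                               # 3 / 4
precision = tp / (tp + fp)                            # 3 / 5
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print(accuracy, recall, precision, round(f1, 4))      # 0.7 0.75 0.6 0.6667
```

Because F1 is a harmonic mean, it stays low whenever either Precision or Recall is low.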

Why this matters for credit risk

  • A bank that misses many defaulters (low Recall) loses money.
  • A bank that flags too many safe borrowers (low Precision) loses business.
  • The right balance depends on strategy:
    • Retail lending: prioritize Recall (avoid defaults).
    • Marketing or growth: prioritize Precision (avoid false alarms).

In practice, Recall for class 1 (Default) is usually emphasized in PD models under Basel and regulatory frameworks.

Summary:
- Use Accuracy for a quick overview.
- Use Recall to measure how many defaulters the model finds.
- Use Precision to measure how often predicted defaulters really default.
- Combine both in F1-Score when you need balance.