Credit Risk - Assignment solution

FZ2024 Financial Modeling and Programming

Author: Sergio Castellanos-Gamboa, PhD
Affiliation: Tecnológico de Monterrey
Published: November 11, 2025

1 Libraries

First we load the libraries, including the additional imports needed for the RidgeClassifier and DecisionTreeClassifier models and for the recall metric.

# Core stack
import numpy as np
import pandas as pd

# Modeling
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.tree import DecisionTreeClassifier

# Metrics & diagnostics
from sklearn.metrics import accuracy_score, recall_score
import matplotlib.pyplot as plt
import statsmodels.api as sm

2 Load data

url = "https://github.com/chechurris/FZ2024/raw/refs/heads/main/Q1.csv"
credit = pd.read_csv(url)
credit.info()
credit.head()
credit.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 473 entries, 0 to 472
Data columns (total 39 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             473 non-null    int64  
 1   payment_status         473 non-null    object 
 2   installment            473 non-null    float64
 3   grade                  473 non-null    int64  
 4   emp_title              473 non-null    float64
 5   emp_length             473 non-null    int64  
 6   home_ownership         473 non-null    int64  
 7   annual_inc             473 non-null    float64
 8   verification_status    473 non-null    int64  
 9   title                  473 non-null    int64  
 10  addr_state             473 non-null    int64  
 11  delinq_2yrs            473 non-null    int64  
 12  earliest_cr_line       473 non-null    int64  
 13  fico_range_high        473 non-null    int64  
 14  last_pymnt_amnt        473 non-null    float64
 15  last_credit_pull_d     473 non-null    int64  
 16  last_fico_range_high   473 non-null    int64  
 17  last_fico_range_low    473 non-null    int64  
 18  tot_coll_amt           473 non-null    int64  
 19  tot_cur_bal            473 non-null    int64  
 20  open_acc_6m            473 non-null    int64  
 21  open_act_il            473 non-null    int64  
 22  open_il_24m            473 non-null    int64  
 23  open_rv_12m            473 non-null    int64  
 24  open_rv_24m            473 non-null    int64  
 25  max_bal_bc             473 non-null    int64  
 26  all_util               473 non-null    int64  
 27  total_cu_tl            473 non-null    int64  
 28  inq_last_12m           473 non-null    int64  
 29  bc_open_to_buy         473 non-null    int64  
 30  mo_sin_rcnt_rev_tl_op  473 non-null    int64  
 31  mo_sin_rcnt_tl         473 non-null    int64  
 32  mort_acc               473 non-null    int64  
 33  mths_since_recent_inq  473 non-null    int64  
 34  num_bc_sats            473 non-null    int64  
 35  num_bc_tl              473 non-null    int64  
 36  num_rev_accts          473 non-null    int64  
 37  num_sats               473 non-null    int64  
 38  num_tl_op_past_12m     473 non-null    int64  
dtypes: float64(4), int64(34), object(1)
memory usage: 144.2+ KB
Unnamed: 0 installment grade emp_title emp_length home_ownership annual_inc verification_status title addr_state ... bc_open_to_buy mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_inq num_bc_sats num_bc_tl num_rev_accts num_sats num_tl_op_past_12m
count 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 ... 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000
mean 433.040169 454.273658 2.556025 335.885835 4.340381 1.805497 78498.062114 1.813953 4.209302 21.133192 ... 12267.871036 11.668076 6.718816 1.775899 6.325581 4.900634 8.243129 15.344609 12.230444 2.494715
std 249.373649 252.293289 1.284495 177.513355 2.544602 0.920696 38225.792322 0.780629 1.659768 12.821675 ... 20064.077177 16.353930 8.008682 1.901296 5.620381 2.959590 4.514346 7.955494 5.331367 1.943174
min 5.000000 33.210000 1.000000 2.000000 1.000000 1.000000 16968.000000 1.000000 1.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 3.000000 0.000000
25% 224.000000 268.960000 2.000000 197.000000 3.000000 1.000000 50400.000000 1.000000 3.000000 9.000000 ... 2076.000000 3.000000 2.000000 0.000000 2.000000 3.000000 5.000000 10.000000 9.000000 1.000000
50% 423.000000 403.640000 2.000000 335.500000 3.000000 1.000000 70000.000000 2.000000 4.000000 20.000000 ... 6092.000000 6.000000 4.000000 1.000000 5.000000 4.000000 7.000000 14.000000 11.000000 2.000000
75% 660.000000 620.060000 3.000000 484.000000 6.000000 3.000000 100000.000000 2.000000 4.000000 31.000000 ... 14788.000000 15.000000 9.000000 3.000000 9.000000 6.000000 10.000000 19.000000 15.000000 4.000000
max 871.000000 1240.720000 7.000000 648.000000 11.000000 3.000000 300000.000000 3.000000 11.000000 46.000000 ... 263953.000000 139.000000 82.000000 9.000000 23.000000 21.000000 27.000000 57.000000 34.000000 12.000000

8 rows × 38 columns

The target payment_status has two labels: No paga (default) and Paga (no default).

credit['payment_status'].value_counts(dropna=False)
payment_status
Paga       395
No paga     78
Name: count, dtype: int64
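Only 78 of the 473 loans (about 16.5%) are defaults, so the classes are imbalanced; this is why recall on the No paga class, rather than accuracy alone, guides model selection below. As a quick sanity check, the default rate can be computed directly (a minimal sketch using the column already loaded):

# Share of defaulters ("No paga") in the sample
default_rate = (credit["payment_status"] == "No paga").mean()
print(f"Default rate: {default_rate:.1%}")  # roughly 16.5%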

Define the dependent and independent variables:

y = credit["payment_status"]                 # select the default variable
X = credit.drop(columns=["payment_status"])  # drop the dependent variable from the features

3 Train / Test Split

Use an 80/20 split with the assignment seed (random_state=20). Note that the call below does not stratify; a stratified variant that preserves the default rate is sketched after the code.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20
)
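If we wanted the train and test sets to keep the same default rate, train_test_split accepts a stratify argument. A hedged alternative, shown for comparison only; the metrics reported below were produced with the unstratified split above:

# Stratified variant: train and test keep the ~16.5% default rate.
# Not used for the reported results.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=20, stratify=y
)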

4 Models

We fit four models on the same features. Logistic regression, LDA, and decision trees provide class probabilities natively; RidgeClassifier does not expose predict_proba, so it would have to be calibrated to obtain PDs when needed (a sketch follows).
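For illustration only (not part of the assignment pipeline), scikit-learn's CalibratedClassifierCV can wrap RidgeClassifier so that it exposes predict_proba; the sketch below assumes the train/test split from Section 3:

from sklearn.calibration import CalibratedClassifierCV

# Wrap RidgeClassifier with Platt scaling so it exposes predict_proba.
ridge_cal = CalibratedClassifierCV(RidgeClassifier(), method="sigmoid", cv=5)
ridge_cal.fit(X_train, y_train)
ridge_pd = ridge_cal.predict_proba(X_test)  # column order follows ridge_cal.classes_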

# Comparing competing models

# 1) Define the models

# Logistic Regression
logit_clf = LogisticRegression(max_iter=2000)
# Linear Discriminant Analysis
lda_clf = LDA()
# Ridge Classifier
ridge_clf = RidgeClassifier()
# Decision Tree
dt_clf = DecisionTreeClassifier(random_state=20)

# 2) Estimate the models

logit_clf.fit(X_train, y_train)
lda_clf.fit(X_train, y_train)
ridge_clf.fit(X_train, y_train)
dt_clf.fit(X_train, y_train)

# 3) Predict models

y_logit = logit_clf.predict(X_test)
y_lda = lda_clf.predict(X_test)
y_ridge = ridge_clf.predict(X_test)
y_dt = dt_clf.predict(X_test)

# 4) Metrics (Recall focused on defaulters = "No paga")
acc_logit = accuracy_score(y_test, y_logit)
acc_lda   = accuracy_score(y_test, y_lda)
acc_ridge = accuracy_score(y_test, y_ridge)
acc_dt    = accuracy_score(y_test, y_dt)

rec_logit = recall_score(y_test, y_logit, pos_label="No paga")
rec_lda   = recall_score(y_test, y_lda,   pos_label="No paga")
rec_ridge = recall_score(y_test, y_ridge, pos_label="No paga")
rec_dt    = recall_score(y_test, y_dt,    pos_label="No paga")

# Summary table
results_df = pd.DataFrame({
    "Model":   ["Logit", "LDA", "Ridge", "DT"],
    "Accuracy": [acc_logit, acc_lda, acc_ridge, acc_dt],
    "Recall":   [rec_logit, rec_lda, rec_ridge, rec_dt]
}).sort_values(by=["Recall", "Accuracy"], ascending=False).reset_index(drop=True)

results_df
C:\Users\L03544739\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\linear_model\_logistic.py:473: ConvergenceWarning: lbfgs failed to converge after 2000 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=2000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
   Model  Accuracy    Recall
0    LDA  0.936842  0.785714
1  Ridge  0.936842  0.714286
2  Logit  0.905263  0.642857
3     DT  0.915789  0.571429

According to the table, LDA is the best model: it achieves the highest recall, which measures the ability to correctly identify defaulters (No paga), and it ties with Ridge for the highest accuracy.
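The ConvergenceWarning above arises because the logistic regression is fit on unscaled features, exactly as the warning message suggests. A minimal sketch of the usual remedy, standardizing inside a pipeline (shown for illustration; the reported metrics were obtained without scaling):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features before the logistic fit so lbfgs converges cleanly.
logit_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
logit_pipe.fit(X_train, y_train)
rec_logit_scaled = recall_score(y_test, logit_pipe.predict(X_test), pos_label="No paga")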

5 Predicted probabilities for a new dataset

# Load the new applicant data
url_new = "https://github.com/chechurris/FZ2024/raw/refs/heads/main/prueba.csv"
new_data = pd.read_csv(url_new)

pd.DataFrame(lda_clf.predict_proba(new_data)).head()
          0         1
0  0.000367  0.999633
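The probability columns follow lda_clf.classes_, which scikit-learn orders alphabetically, so column 0 is P(No paga) and column 1 is P(Paga): this applicant is overwhelmingly predicted to repay. A small sketch that makes the labels explicit:

# Label the probability columns with the class names for readability.
probs = pd.DataFrame(lda_clf.predict_proba(new_data), columns=lda_clf.classes_)
probs.head()  # e.g., P("No paga") ~ 0.0004, P("Paga") ~ 0.9996 for the first row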