Credit Risk - Assignment solution

FZ2024 Financial Modeling and Programming

Author: Sergio Castellanos-Gamboa, PhD
Affiliation: Tecnológico de Monterrey
Published: November 11, 2025

1 Libraries

First we load the libraries, including the additional imports needed for the RidgeClassifier and DecisionTreeClassifier models and for the recall metric.

# Core stack
import numpy as np
import pandas as pd

# Modeling
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.tree import DecisionTreeClassifier

# Metrics & diagnostics
from sklearn.metrics import accuracy_score, recall_score
import matplotlib.pyplot as plt
import statsmodels.api as sm

2 Load data

url = "https://github.com/chechurris/FZ2024/raw/refs/heads/main/Q1.csv"
credit = pd.read_csv(url)
credit.info()
credit.head()
credit.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 473 entries, 0 to 472
Data columns (total 39 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             473 non-null    int64  
 1   payment_status         473 non-null    object 
 2   installment            473 non-null    float64
 3   grade                  473 non-null    int64  
 4   emp_title              473 non-null    float64
 5   emp_length             473 non-null    int64  
 6   home_ownership         473 non-null    int64  
 7   annual_inc             473 non-null    float64
 8   verification_status    473 non-null    int64  
 9   title                  473 non-null    int64  
 10  addr_state             473 non-null    int64  
 11  delinq_2yrs            473 non-null    int64  
 12  earliest_cr_line       473 non-null    int64  
 13  fico_range_high        473 non-null    int64  
 14  last_pymnt_amnt        473 non-null    float64
 15  last_credit_pull_d     473 non-null    int64  
 16  last_fico_range_high   473 non-null    int64  
 17  last_fico_range_low    473 non-null    int64  
 18  tot_coll_amt           473 non-null    int64  
 19  tot_cur_bal            473 non-null    int64  
 20  open_acc_6m            473 non-null    int64  
 21  open_act_il            473 non-null    int64  
 22  open_il_24m            473 non-null    int64  
 23  open_rv_12m            473 non-null    int64  
 24  open_rv_24m            473 non-null    int64  
 25  max_bal_bc             473 non-null    int64  
 26  all_util               473 non-null    int64  
 27  total_cu_tl            473 non-null    int64  
 28  inq_last_12m           473 non-null    int64  
 29  bc_open_to_buy         473 non-null    int64  
 30  mo_sin_rcnt_rev_tl_op  473 non-null    int64  
 31  mo_sin_rcnt_tl         473 non-null    int64  
 32  mort_acc               473 non-null    int64  
 33  mths_since_recent_inq  473 non-null    int64  
 34  num_bc_sats            473 non-null    int64  
 35  num_bc_tl              473 non-null    int64  
 36  num_rev_accts          473 non-null    int64  
 37  num_sats               473 non-null    int64  
 38  num_tl_op_past_12m     473 non-null    int64  
dtypes: float64(4), int64(34), object(1)
memory usage: 144.2+ KB
Unnamed: 0 installment grade emp_title emp_length home_ownership annual_inc verification_status title addr_state ... bc_open_to_buy mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_inq num_bc_sats num_bc_tl num_rev_accts num_sats num_tl_op_past_12m
count 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 ... 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000 473.000000
mean 433.040169 454.273658 2.556025 335.885835 4.340381 1.805497 78498.062114 1.813953 4.209302 21.133192 ... 12267.871036 11.668076 6.718816 1.775899 6.325581 4.900634 8.243129 15.344609 12.230444 2.494715
std 249.373649 252.293289 1.284495 177.513355 2.544602 0.920696 38225.792322 0.780629 1.659768 12.821675 ... 20064.077177 16.353930 8.008682 1.901296 5.620381 2.959590 4.514346 7.955494 5.331367 1.943174
min 5.000000 33.210000 1.000000 2.000000 1.000000 1.000000 16968.000000 1.000000 1.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 3.000000 0.000000
25% 224.000000 268.960000 2.000000 197.000000 3.000000 1.000000 50400.000000 1.000000 3.000000 9.000000 ... 2076.000000 3.000000 2.000000 0.000000 2.000000 3.000000 5.000000 10.000000 9.000000 1.000000
50% 423.000000 403.640000 2.000000 335.500000 3.000000 1.000000 70000.000000 2.000000 4.000000 20.000000 ... 6092.000000 6.000000 4.000000 1.000000 5.000000 4.000000 7.000000 14.000000 11.000000 2.000000
75% 660.000000 620.060000 3.000000 484.000000 6.000000 3.000000 100000.000000 2.000000 4.000000 31.000000 ... 14788.000000 15.000000 9.000000 3.000000 9.000000 6.000000 10.000000 19.000000 15.000000 4.000000
max 871.000000 1240.720000 7.000000 648.000000 11.000000 3.000000 300000.000000 3.000000 11.000000 46.000000 ... 263953.000000 139.000000 82.000000 9.000000 23.000000 21.000000 27.000000 57.000000 34.000000 12.000000

8 rows × 38 columns

The target payment_status has two labels: No paga (default) and Paga (no default).

credit['payment_status'].value_counts(dropna=False)
payment_status
Paga       395
No paga     78
Name: count, dtype: int64
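Only 78 of the 473 loans (about 16.5%) are defaults, so the classes are imbalanced; this is why recall on the No paga class, rather than accuracy alone, guides model selection below. As a quick sanity check, the default rate can be computed directly (a minimal sketch using the column already loaded):

# Share of defaulters ("No paga") in the sample
default_rate = (credit["payment_status"] == "No paga").mean()
print(f"Default rate: {default_rate:.1%}")  # roughly 16.5%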

Define the dependent and independent variables:

y = credit["payment_status"]                 # select the default variable
X = credit.drop(columns=["payment_status"])  # drop the dependent variable from the features

3 Train / Test Split

Use an 80/20 split with the assignment seed (random_state=20). Note that the call below does not stratify; a stratified variant that preserves the default rate is sketched after the code.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20
)
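If we wanted the train and test sets to keep the same default rate, train_test_split accepts a stratify argument. A hedged alternative, shown for comparison only; the metrics reported below were produced with the unstratified split above:

# Stratified variant: train and test keep the ~16.5% default rate.
# Not used for the reported results.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=20, stratify=y
)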

4 Models

We fit four models on the same features. Logistic regression, LDA, and decision trees provide class probabilities natively; RidgeClassifier does not expose predict_proba, so it would have to be calibrated to obtain PDs when needed (a sketch follows).
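For illustration only (not part of the assignment pipeline), scikit-learn's CalibratedClassifierCV can wrap RidgeClassifier so that it exposes predict_proba; the sketch below assumes the train/test split from Section 3:

from sklearn.calibration import CalibratedClassifierCV

# Wrap RidgeClassifier with Platt scaling so it exposes predict_proba.
ridge_cal = CalibratedClassifierCV(RidgeClassifier(), method="sigmoid", cv=5)
ridge_cal.fit(X_train, y_train)
ridge_pd = ridge_cal.predict_proba(X_test)  # column order follows ridge_cal.classes_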

# Comparing competing models

# 1) Define the models

# Logistic Regression
logit_clf = LogisticRegression(max_iter=2000)
# Linear Discriminant Analysis
lda_clf = LDA()
# Ridge Classifier
ridge_clf = RidgeClassifier()
# Decision Tree
dt_clf = DecisionTreeClassifier(random_state=20)

# 2) Estimate the models

logit_clf.fit(X_train, y_train)
lda_clf.fit(X_train, y_train)
ridge_clf.fit(X_train, y_train)
dt_clf.fit(X_train, y_train)

# 3) Predict models

y_logit = logit_clf.predict(X_test)
y_lda = lda_clf.predict(X_test)
y_ridge = ridge_clf.predict(X_test)
y_dt = dt_clf.predict(X_test)

# 4) Metrics (Recall focused on defaulters = "No paga")
acc_logit = accuracy_score(y_test, y_logit)
acc_lda   = accuracy_score(y_test, y_lda)
acc_ridge = accuracy_score(y_test, y_ridge)
acc_dt    = accuracy_score(y_test, y_dt)

rec_logit = recall_score(y_test, y_logit, pos_label="No paga")
rec_lda   = recall_score(y_test, y_lda,   pos_label="No paga")
rec_ridge = recall_score(y_test, y_ridge, pos_label="No paga")
rec_dt    = recall_score(y_test, y_dt,    pos_label="No paga")

# Summary table
results_df = pd.DataFrame({
    "Model":   ["Logit", "LDA", "Ridge", "DT"],
    "Accuracy": [acc_logit, acc_lda, acc_ridge, acc_dt],
    "Recall":   [rec_logit, rec_lda, rec_ridge, rec_dt]
}).sort_values(by=["Recall", "Accuracy"], ascending=False).reset_index(drop=True)

results_df
C:\Users\L03544739\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\linear_model\_logistic.py:473: ConvergenceWarning: lbfgs failed to converge after 2000 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=2000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
   Model  Accuracy    Recall
0    LDA  0.936842  0.785714
1  Ridge  0.936842  0.714286
2  Logit  0.905263  0.642857
3     DT  0.915789  0.571429

According to the table, LDA is the best model: it achieves the highest recall, which measures the ability to correctly identify defaulters (No paga), and it ties with Ridge for the highest accuracy.
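The ConvergenceWarning above arises because the logistic regression is fit on unscaled features, exactly as the warning message suggests. A minimal sketch of the usual remedy, standardizing inside a pipeline (shown for illustration; the reported metrics were obtained without scaling):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features before the logistic fit so lbfgs converges cleanly.
logit_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
logit_pipe.fit(X_train, y_train)
rec_logit_scaled = recall_score(y_test, logit_pipe.predict(X_test), pos_label="No paga")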

5 Predicted probabilities for a new dataset

# Load the new applicant data
url_new = "https://github.com/chechurris/FZ2024/raw/refs/heads/main/prueba.csv"
new_data = pd.read_csv(url_new)

pd.DataFrame(lda_clf.predict_proba(new_data)).head()
          0         1
0  0.000367  0.999633
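The probability columns follow lda_clf.classes_, which scikit-learn orders alphabetically, so column 0 is P(No paga) and column 1 is P(Paga): this applicant is overwhelmingly predicted to repay. A small sketch that makes the labels explicit:

# Label the probability columns with the class names for readability.
probs = pd.DataFrame(lda_clf.predict_proba(new_data), columns=lda_clf.classes_)
probs.head()  # e.g., P("No paga") ~ 0.0004, P("Paga") ~ 0.9996 for the first row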