Credit Risk - Assignment solution
FZ2024 Financial Modeling and Programming
1 Libraries
First we load the libraries, including the additional ones needed for the Ridge Classifier and Decision Tree Classifier models and for the recall metric.
# Core stack
import numpy as np
import pandas as pd
# Modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.tree import DecisionTreeClassifier
# Metrics & diagnostics
from sklearn.metrics import accuracy_score, recall_score
import matplotlib.pyplot as plt
import statsmodels.api as sm
2 Load data
url = "https://github.com/chechurris/FZ2024/raw/refs/heads/main/Q1.csv"
credit = pd.read_csv(url)
credit.info()
credit.head()
credit.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 473 entries, 0 to 472
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 473 non-null int64
1 payment_status 473 non-null object
2 installment 473 non-null float64
3 grade 473 non-null int64
4 emp_title 473 non-null float64
5 emp_length 473 non-null int64
6 home_ownership 473 non-null int64
7 annual_inc 473 non-null float64
8 verification_status 473 non-null int64
9 title 473 non-null int64
10 addr_state 473 non-null int64
11 delinq_2yrs 473 non-null int64
12 earliest_cr_line 473 non-null int64
13 fico_range_high 473 non-null int64
14 last_pymnt_amnt 473 non-null float64
15 last_credit_pull_d 473 non-null int64
16 last_fico_range_high 473 non-null int64
17 last_fico_range_low 473 non-null int64
18 tot_coll_amt 473 non-null int64
19 tot_cur_bal 473 non-null int64
20 open_acc_6m 473 non-null int64
21 open_act_il 473 non-null int64
22 open_il_24m 473 non-null int64
23 open_rv_12m 473 non-null int64
24 open_rv_24m 473 non-null int64
25 max_bal_bc 473 non-null int64
26 all_util 473 non-null int64
27 total_cu_tl 473 non-null int64
28 inq_last_12m 473 non-null int64
29 bc_open_to_buy 473 non-null int64
30 mo_sin_rcnt_rev_tl_op 473 non-null int64
31 mo_sin_rcnt_tl 473 non-null int64
32 mort_acc 473 non-null int64
33 mths_since_recent_inq 473 non-null int64
34 num_bc_sats 473 non-null int64
35 num_bc_tl 473 non-null int64
36 num_rev_accts 473 non-null int64
37 num_sats 473 non-null int64
38 num_tl_op_past_12m 473 non-null int64
dtypes: float64(4), int64(34), object(1)
memory usage: 144.2+ KB
| | Unnamed: 0 | installment | grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | title | addr_state | ... | bc_open_to_buy | mo_sin_rcnt_rev_tl_op | mo_sin_rcnt_tl | mort_acc | mths_since_recent_inq | num_bc_sats | num_bc_tl | num_rev_accts | num_sats | num_tl_op_past_12m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | ... | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 | 473.000000 |
| mean | 433.040169 | 454.273658 | 2.556025 | 335.885835 | 4.340381 | 1.805497 | 78498.062114 | 1.813953 | 4.209302 | 21.133192 | ... | 12267.871036 | 11.668076 | 6.718816 | 1.775899 | 6.325581 | 4.900634 | 8.243129 | 15.344609 | 12.230444 | 2.494715 |
| std | 249.373649 | 252.293289 | 1.284495 | 177.513355 | 2.544602 | 0.920696 | 38225.792322 | 0.780629 | 1.659768 | 12.821675 | ... | 20064.077177 | 16.353930 | 8.008682 | 1.901296 | 5.620381 | 2.959590 | 4.514346 | 7.955494 | 5.331367 | 1.943174 |
| min | 5.000000 | 33.210000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 16968.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 3.000000 | 0.000000 |
| 25% | 224.000000 | 268.960000 | 2.000000 | 197.000000 | 3.000000 | 1.000000 | 50400.000000 | 1.000000 | 3.000000 | 9.000000 | ... | 2076.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 | 3.000000 | 5.000000 | 10.000000 | 9.000000 | 1.000000 |
| 50% | 423.000000 | 403.640000 | 2.000000 | 335.500000 | 3.000000 | 1.000000 | 70000.000000 | 2.000000 | 4.000000 | 20.000000 | ... | 6092.000000 | 6.000000 | 4.000000 | 1.000000 | 5.000000 | 4.000000 | 7.000000 | 14.000000 | 11.000000 | 2.000000 |
| 75% | 660.000000 | 620.060000 | 3.000000 | 484.000000 | 6.000000 | 3.000000 | 100000.000000 | 2.000000 | 4.000000 | 31.000000 | ... | 14788.000000 | 15.000000 | 9.000000 | 3.000000 | 9.000000 | 6.000000 | 10.000000 | 19.000000 | 15.000000 | 4.000000 |
| max | 871.000000 | 1240.720000 | 7.000000 | 648.000000 | 11.000000 | 3.000000 | 300000.000000 | 3.000000 | 11.000000 | 46.000000 | ... | 263953.000000 | 139.000000 | 82.000000 | 9.000000 | 23.000000 | 21.000000 | 27.000000 | 57.000000 | 34.000000 | 12.000000 |
8 rows × 38 columns
The target payment_status has two labels: No paga (default) and Paga (no default).
credit['payment_status'].value_counts(dropna=False)
payment_status
Paga 395
No paga 78
Name: count, dtype: int64
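The class counts above imply a default rate of roughly 16.5%, so a trivial model that always predicts Paga would already score about 83.5% accuracy. The short sketch below reproduces that arithmetic from the printed counts; the names `counts`, `default_rate`, and `baseline_accuracy` are illustrative and not part of the assignment code.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"Paga": 395, "No paga": 78})

# Share of defaulters ("No paga") in the sample.
default_rate = counts["No paga"] / counts.sum()

# Accuracy of a trivial model that always predicts the majority class "Paga".
baseline_accuracy = counts["Paga"] / counts.sum()

print(f"default rate:      {default_rate:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

This baseline is a useful reference point when reading accuracy figures on an imbalanced target, and it is one reason to look at recall for the default class as well.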
Define dependent and independent variables
y = credit["payment_status"]  # dependent variable: payment status (default indicator)
X = credit.drop(columns=["payment_status"])  # independent variables: every column except the target
3 Train / Test Split
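Use an 80/20 split with the assignment seed (random_state=20). The call below does not pass stratify, so the split is not stratified by the target and the default rate in the test set can differ slightly from the overall rate.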
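For comparison, a stratified version of the split would pass `stratify=y` to `train_test_split`. The sketch below is not the assignment's split: it uses synthetic labels with the same class counts (395 Paga, 78 No paga) and a placeholder feature, and the `*_demo` names are hypothetical. It only shows that stratification keeps the default rate nearly identical across the two subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target, with the same class counts as the data set.
y_demo = np.array(["Paga"] * 395 + ["No paga"] * 78)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature, not real data

# Same 80/20 split and seed, but stratified by the target.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20, stratify=y_demo
)

# Compare the share of defaulters in each subset.
overall_rate = np.mean(y_demo == "No paga")
train_rate = np.mean(y_tr == "No paga")
test_rate = np.mean(y_te == "No paga")
print(f"overall {overall_rate:.3f}  train {train_rate:.3f}  test {test_rate:.3f}")
```

Switching the assignment's split to a stratified one would change which rows land in the test set, so the metrics reported later would not be reproduced exactly.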
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=20
)
4 Models
We fit four models on the same training data, with no additional preprocessing. Logistic regression, LDA, and decision trees provide class probabilities directly through predict_proba; RidgeClassifier does not output probabilities, so we will later calibrate it to obtain PDs (probabilities of default) when needed.
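One possible way to obtain probabilities from RidgeClassifier is to wrap it in scikit-learn's `CalibratedClassifierCV`. The sketch below is an assumption about how that calibration could be done, not the assignment's method; it runs on synthetic data from `make_classification`, and all `*_demo` names are hypothetical.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (illustrative only, not the credit data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.83, 0.17], random_state=20
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20, stratify=y_demo
)

# Map RidgeClassifier's decision scores to probabilities with a sigmoid
# (Platt-style) fit, estimated with 5-fold cross-validation.
calibrated_ridge = CalibratedClassifierCV(RidgeClassifier(), method="sigmoid", cv=5)
calibrated_ridge.fit(X_tr, y_tr)

# One row per observation, one column per class; each row sums to 1.
proba = calibrated_ridge.predict_proba(X_te)
print(proba[:3])
```

The sigmoid method is a common choice for small samples; isotonic calibration (`method="isotonic"`) is the other built-in option and usually needs more data.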
# Comparing competing models
#1) Define the models
# Logistic Regression
logit_clf = LogisticRegression(max_iter=2000)
# Linear Discriminant Analysis
lda_clf = LDA()
# Ridge Classifier
ridge_clf = RidgeClassifier()
# Decision Tree
dt_clf = DecisionTreeClassifier(random_state=20)
# 2) Estimate the models
logit_clf.fit(X_train, y_train)
lda_clf.fit(X_train, y_train)
ridge_clf.fit(X_train, y_train)
dt_clf.fit(X_train, y_train)
# 3) Predict models
y_logit = logit_clf.predict(X_test)
y_lda = lda_clf.predict(X_test)
y_ridge = ridge_clf.predict(X_test)
y_dt = dt_clf.predict(X_test)
# 4) Metrics (Recall focused on defaulters = "No paga")
acc_logit = accuracy_score(y_test, y_logit)
acc_lda = accuracy_score(y_test, y_lda)
acc_ridge = accuracy_score(y_test, y_ridge)
acc_dt = accuracy_score(y_test, y_dt)
rec_logit = recall_score(y_test, y_logit, pos_label="No paga")
rec_lda = recall_score(y_test, y_lda, pos_label="No paga")
rec_ridge = recall_score(y_test, y_ridge, pos_label="No paga")
rec_dt = recall_score(y_test, y_dt, pos_label="No paga")
# Summary table
results_df = pd.DataFrame({
"Model": ["Logit", "LDA", "Ridge", "DT"],
"Accuracy": [acc_logit, acc_lda, acc_ridge, acc_dt],
"Recall": [rec_logit, rec_lda, rec_ridge, rec_dt]
}).sort_values(by=["Recall", "Accuracy"], ascending=False).reset_index(drop=True)
results_df
C:\Users\L03544739\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\linear_model\_logistic.py:473: ConvergenceWarning: lbfgs failed to converge after 2000 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT
Increase the number of iterations to improve the convergence (max_iter=2000).
You might also want to scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
| | Model | Accuracy | Recall |
|---|---|---|---|
| 0 | LDA | 0.936842 | 0.785714 |
| 1 | Ridge | 0.936842 | 0.714286 |
| 2 | Logit | 0.905263 | 0.642857 |
| 3 | DT | 0.915789 | 0.571429 |
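To make the Recall column concrete: with `pos_label="No paga"`, recall is the share of actual defaulters that the model flags as defaulters. The recall values in the table are all multiples of 1/14 (for example 11/14 ≈ 0.785714), which is consistent with 14 defaulters in the test set. The toy labels below are hypothetical and only illustrate how the two metrics are computed.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels (hypothetical): 3 defaulters followed by 7 payers.
y_true = ["No paga"] * 3 + ["Paga"] * 7

# The model catches 2 of the 3 defaulters and wrongly flags 1 payer.
y_pred = ["No paga", "No paga", "Paga"] + ["No paga"] + ["Paga"] * 6

# Recall for the default class: true positives / actual defaulters = 2 / 3.
recall_default = recall_score(y_true, y_pred, pos_label="No paga")

# Accuracy: correct predictions / all predictions = 8 / 10.
accuracy = accuracy_score(y_true, y_pred)

print(f"recall (No paga): {recall_default:.3f}")
print(f"accuracy:         {accuracy:.3f}")
```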
According to the table, LDA is the best model. It has the highest recall for the No paga class (0.786), meaning it correctly identifies the largest share of actual defaulters in the test set, and it ties with Ridge for the highest accuracy (0.937).
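The ConvergenceWarning raised by the logistic regression points to feature scaling as a possible remedy. A minimal sketch of that idea, assuming a `StandardScaler` placed in front of the model with `make_pipeline`, is shown below on synthetic data; it is not part of the assignment solution, and applying it to the credit data would change the logistic regression results reported in the table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only, not the credit data).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=20)

# Inflate one column to mimic an income-like magnitude next to small counts.
X_demo[:, 0] = X_demo[:, 0] * 50_000 + 80_000

# Standardize every feature before fitting the logistic regression.
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
scaled_logit.fit(X_demo, y_demo)

# Number of lbfgs iterations actually used by the fitted model.
n_iter = scaled_logit.named_steps["logisticregression"].n_iter_[0]
print(f"lbfgs iterations used: {n_iter}")
```

Because the scaler is fitted inside the pipeline, calling `fit` on the training set and `predict` on the test set keeps test-set statistics out of the scaling step.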
5 Predicted probability with new dataset
# Load the new applicant data
url_new = "https://github.com/chechurris/FZ2024/raw/refs/heads/main/prueba.csv"
new_data = pd.read_csv(url_new)
pd.DataFrame(lda_clf.predict_proba(new_data)).head()
| | 0 | 1 |
|---|---|---|
| 0 | 0.000367 | 0.999633 |
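The columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, which scikit-learn stores in sorted order. With the string labels used here, column 0 should therefore correspond to No paga and column 1 to Paga, so the predicted probability of default for this applicant would be about 0.000367. The sketch below shows how to label the columns explicitly; it uses synthetic data and hypothetical `*_demo` names rather than the assignment's fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic features with string labels that mirror the assignment's target.
X_demo, y_num = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=20
)
y_demo = np.where(y_num == 1, "No paga", "Paga")

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

# classes_ gives the label behind each predict_proba column.
print(lda_demo.classes_)

# Name the probability columns after the classes instead of 0 and 1.
proba_df = pd.DataFrame(
    lda_demo.predict_proba(X_demo[:5]), columns=lda_demo.classes_
)

# Probability of default (PD) is the "No paga" column.
pd_default = proba_df["No paga"]
print(proba_df)
```

Selecting the PD by label rather than by position avoids mistakes if the class names or their order ever change.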