Modeling Credit Default in Python
1 Intro
Credit is a transaction between two parties in which one (the creditor or lender) supplies money, goods, services or securities in return for a promise of future payment by the other (the debtor or borrower). Such transactions normally include the payment of interest to the lender (Joseph 2013).
The potential for loss due to failure of a borrower to meet its contractual obligation to repay a debt in accordance with the agreed terms is called Credit Risk. Three elements characterize credit risk:
- Probability of Default (PD): Likelihood that the counter-party would fail to make full and timely repayment of its financial obligations
- Exposure at Default (EAD): The expected value of the loan at the time of default, i.e., the amount the customer owes the lender in case of default.
- Loss Given Default (LGD): The amount of loss in case of a default. Expressed as a percentage of the EAD.
The product of these three is called the Expected Loss, a measure of the anticipated financial loss from a loan or portfolio of loans over a certain period. Creditors invest significant effort in building algorithms to predict the likelihood of a customer defaulting on a loan (PD). Defaults can result in substantial financial losses, affecting both the profitability and the stability of financial institutions. To mitigate these risks, it is essential to develop robust predictive models that help identify potential defaulters before credit is granted. Our aim here is to develop such models.
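To make the expected-loss relationship concrete, here is a minimal sketch with purely illustrative figures (none of these numbers come from the dataset used below):
prob_default = 0.04   # PD: assumed 4% chance the borrower defaults (illustrative)
ead = 150000          # EAD: assumed balance owed at the time of default (illustrative)
lgd = 0.60            # LGD: assumed fraction of the exposure lost if default occurs
expected_loss = prob_default * ead * lgd
print(f"Expected Loss: {expected_loss:,.0f}")  # 3,600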
We will employ a dataset from Kaggle (Lichman 2013) that contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan. Click on the link to identify the variables and additional features of the data.
Disclaimer:
This document is intended for educational purposes only. It does not constitute financial advice. It is part of my financial modeling course at Tec de Monterrey, created to demonstrate the application of machine learning algorithms in predicting credit defaults.
2 Data exploration
Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
Load data
data = pd.read_csv("/mydirectory/UCI_Credit_Card.csv")
Data preparation
We begin by exploring general features of the dataset, such as the number of columns, the types of variables, and whether there are missing values:
data.info()
To facilitate the analysis, we will rename "PAY_0" to "PAY_1" and "LIMIT_BAL" to "AMOUNT". We will also shorten the name of "default.payment.next.month" to "default".
data.rename(columns={
'LIMIT_BAL' : 'AMOUNT',
'PAY_0': 'PAY_1',
'default.payment.next.month': 'default'
}, inplace=True)
Some features of the data are either confusing or counter-intuitive, which adds an easily avoidable degree of complexity. Let's fix that here:
- SEX will be 0 for female (instead of 2) and 1 for male.
data.SEX.value_counts()
data['SEX'] = data['SEX'].apply(lambda x: 0 if x == 2 else 1)
data.SEX.value_counts()
- The scale of EDUCATION is 1 = graduate school, 2 = university, 3 = high school, 4 = others, 5 = unknown, 6 = unknown. Categories 0, 4, 5, and 6 can be grouped into class 4.
data.EDUCATION.value_counts()
data['EDUCATION'] = data['EDUCATION'].replace([0, 4, 5, 6], 4)
data.EDUCATION.value_counts()
- MARRIAGE will be regrouped. The original scale of marital status (0 = undefined, 1 = married, 2 = single, 3 = others) becomes a binary indicator: 1 for single and 0 otherwise.
data.MARRIAGE.value_counts()
# Recode MARRIAGE as a binary indicator: 1 = single (original category 2), 0 otherwise
data['MARRIAGE'] = data['MARRIAGE'].apply(lambda x: 1 if x == 2 else 0)
data.MARRIAGE.value_counts()
- The payment history variables (PAY_1 to PAY_6) indicate the repayment status in previous months. Negative values represent payments made on time, while positive values indicate delays. A more intuitive scale reverses the signs:
data.PAY_1.value_counts()
for i in range(1,7):
data[f'PAY_{i}'] = data[f'PAY_{i}'].apply(lambda x: -x)
data.PAY_1.value_counts()
Next, we generate descriptive statistics, which will help us identify general characteristics of the sample.
3 Descriptive analysis
dStats = data.describe().T.round(1)
dStats
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 30000.0 | 15000.5 | 8660.4 | 1.0 | 7500.8 | 15000.5 | 22500.2 | 30000.0 |
| AMOUNT | 30000.0 | 167484.3 | 129747.7 | 10000.0 | 50000.0 | 140000.0 | 240000.0 | 1000000.0 |
| SEX | 30000.0 | 0.4 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| EDUCATION | 30000.0 | 1.8 | 0.7 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| MARRIAGE | 30000.0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| AGE | 30000.0 | 35.5 | 9.2 | 21.0 | 28.0 | 34.0 | 41.0 | 79.0 |
| PAY_1 | 30000.0 | 0.0 | 1.1 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| PAY_2 | 30000.0 | 0.1 | 1.2 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| PAY_3 | 30000.0 | 0.2 | 1.2 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| PAY_4 | 30000.0 | 0.2 | 1.2 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| PAY_5 | 30000.0 | 0.3 | 1.1 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| PAY_6 | 30000.0 | 0.3 | 1.1 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| BILL_AMT1 | 30000.0 | 51223.3 | 73635.9 | -165580.0 | 3558.8 | 22381.5 | 67091.0 | 964511.0 |
| BILL_AMT2 | 30000.0 | 49179.1 | 71173.8 | -69777.0 | 2984.8 | 21200.0 | 64006.2 | 983931.0 |
| BILL_AMT3 | 30000.0 | 47013.2 | 69349.4 | -157264.0 | 2666.2 | 20088.5 | 60164.8 | 1664089.0 |
| BILL_AMT4 | 30000.0 | 43262.9 | 64332.9 | -170000.0 | 2326.8 | 19052.0 | 54506.0 | 891586.0 |
| BILL_AMT5 | 30000.0 | 40311.4 | 60797.2 | -81334.0 | 1763.0 | 18104.5 | 50190.5 | 927171.0 |
| BILL_AMT6 | 30000.0 | 38871.8 | 59554.1 | -339603.0 | 1256.0 | 17071.0 | 49198.2 | 961664.0 |
| PAY_AMT1 | 30000.0 | 5663.6 | 16563.3 | 0.0 | 1000.0 | 2100.0 | 5006.0 | 873552.0 |
| PAY_AMT2 | 30000.0 | 5921.2 | 23040.9 | 0.0 | 833.0 | 2009.0 | 5000.0 | 1684259.0 |
| PAY_AMT3 | 30000.0 | 5225.7 | 17607.0 | 0.0 | 390.0 | 1800.0 | 4505.0 | 896040.0 |
| PAY_AMT4 | 30000.0 | 4826.1 | 15666.2 | 0.0 | 296.0 | 1500.0 | 4013.2 | 621000.0 |
| PAY_AMT5 | 30000.0 | 4799.4 | 15278.3 | 0.0 | 252.5 | 1500.0 | 4031.5 | 426529.0 |
| PAY_AMT6 | 30000.0 | 5215.5 | 17777.5 | 0.0 | 117.8 | 1500.0 | 4000.0 | 528666.0 |
| default | 30000.0 | 0.2 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Define the categorical variables:
cats = ['default','SEX', 'EDUCATION', 'MARRIAGE',
'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
# Converting columns to 'category' data type
data[cats] = data[cats].astype('category')
3.1 Default status
Figure 1 shows how the variable of interest is distributed.
plt.figure(figsize=(7, 5))
ax = sns.countplot(x=data['default'], hue=data['default'], palette=['#3182bd', '#DBB40C'], legend=False)
# Adding percentage
total_count = len(data['default'])
for container in ax.containers:
ax.bar_label(container, labels=[f'{(v/total_count)*100:.1f}%' for v in container.datavalues])
# Adjust for binary variable
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])
plt.xlabel('Default Status', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Distribution of Default', fontsize=10)
plt.tight_layout()
plt.show()
3.2 Loan amount
plt.figure(figsize=(7, 5))
sns.histplot(data['AMOUNT'], kde=True, color='#04d8b2', edgecolor='black')
plt.xlabel('Credit Amount', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
# plt.title('Distribution of Credit Amount', fontsize=12)
plt.tight_layout()
plt.show()
plt.figure(figsize=(7, 5))
sns.boxplot(x=data['default'], hue=data['default'], y=data['AMOUNT'], palette=['#04d8b2', '#DBB40C'])
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])
plt.xlabel('Status', fontsize=10)
plt.ylabel('Credit Amount', fontsize=10)
# Displaying the plot
plt.tight_layout()
plt.show()
An empirical Cumulative Distribution Function (eCDF), as in Figure 2, can also help derive lessons. It plots the loan amounts from lowest to highest against their cumulative percentiles.
plt.figure(figsize=(7, 5))
sns.ecdfplot(data=data, x='AMOUNT', hue='default', palette=['#6baed6', '#fd8d3c'])
plt.xlabel('Credit Amount', fontsize=10)
plt.ylabel('eCDF', fontsize=10)
plt.legend(labels=["Not Defaulted", "Defaulted"], title="Default Status")
plt.tight_layout()
plt.show()
3.3 Age
plt.figure(figsize=(7, 5))
sns.boxplot(x=data['default'],hue = data['default'], y=data['AGE'], palette=['#6baed6', '#fd8d3c'])
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])
plt.xlabel('Default Status', fontsize=10)
plt.ylabel('Age', fontsize=10)
plt.tight_layout()
plt.show()
3.4 Sex
plt.figure(figsize=(7, 5))
sns.countplot(x='SEX', hue='default', data=data, palette=['#6baed6', '#fd8d3c'])
plt.xticks([0, 1], labels=["Female", "Male"])
plt.title('Distribution of Default by Sex', fontsize=10)
plt.xlabel('Sex', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.legend(labels=["Not Defaulted", "Defaulted"], title="Default Status")
plt.tight_layout()
plt.show()
3.5 Education
plt.figure(figsize=(7, 5))
sns.countplot(x='EDUCATION', hue='default', data=data, palette=['#6baed6', '#fd8d3c'])
plt.xticks([0, 1, 2, 3], labels=["Graduate", "University", "High School", "Unknown/Others"])
plt.xlabel('Education Level', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.legend(labels=["Not Defaulted", "Defaulted"], title="Default Status")
plt.tight_layout()
plt.show()
# Marriage
plt.figure(figsize=(7, 5))
ax = sns.countplot(x='MARRIAGE', hue='default', data=data, palette=['#6baed6', '#fd8d3c'])
# Adjusting x-axis labels for marriage categories
ax.set_xticks([0, 1])
ax.set_xticklabels(['Others', 'Single'])
plt.title('Distribution of Default by Marriage Status', fontsize=10)
plt.xlabel('Marriage Status', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.legend(labels=["Not Defaulted", "Defaulted"], title="Default Status")
plt.tight_layout()
plt.show()
4 Modeling
Begin by defining independent (X) and target (y) features, then split the data into training (80%) and testing (20%) subsets.
data.columns
X = data.drop(['default', 'ID'], axis=1)
y = data['default']
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The function train_test_split() randomly splits the dataset into two parts: one for training the model (X_train, y_train) and one for testing the model's performance (X_test, y_test). The value 0.2 means 20% of the dataset will be used for testing and 80% for training. The parameter random_state=42 sets the random seed, ensuring that the data is split the same way every time the code is run; this makes the results reproducible and facilitates the comparison of different models.
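Because defaulters are a minority class (about 22% of the sample, as shown in Figure 1), one may optionally pass stratify=y so that both subsets keep the same default rate. A minimal variant of the split above, shown only as a sketch (the rest of this document keeps the unstratified split):
# Optional: stratified split preserves the default proportion in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(y_train_s.value_counts(normalize=True))  # close to the overall default rate
print(y_test_s.value_counts(normalize=True))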
Proceed to scale all numerical variables:
# Scaling only the numerical features
cat_features = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
num_features = ['AMOUNT', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6','PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
scaler = StandardScaler()
X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features] = scaler.transform(X_test[num_features])
Why do we scale? Some models (e.g., Logistic Regression) are sensitive to the scale of the features. If one feature (e.g., AMOUNT, which ranges from 10,000 to 1,000,000) has much larger values than another (e.g., AGE, which ranges from 21 to 79), the model might give more weight to the larger-scale feature simply because of its magnitude. Standardization (StandardScaler) puts all numerical variables on a similar scale, with a mean of 0 and a standard deviation of 1, so the model treats them on an equal footing.
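As a quick, optional sanity check, the scaled training columns should now have a mean close to 0 and a standard deviation close to 1:
# Verify the standardization on the training set (where the scaler was fit)
print(X_train[num_features].mean().round(2))
print(X_train[num_features].std().round(2))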
4.1 Logistic Regression
4.1.1 Estimation
# Initialize the Logistic Regression model
log_reg = LogisticRegression(random_state=42)
# Fit the model on the training data
log_reg.fit(X_train, y_train)
# Make predictions on the test data
y_pred = log_reg.predict(X_test)
4.1.2 Evaluation
Confusion Matrix:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
index=['Actual No Default', 'Actual Default'],
columns=['Predicted No Default', 'Predicted Default'])
cm_df
| | Predicted No Default | Predicted Default |
|---|---|---|
| Actual No Default | 4553 | 134 |
| Actual Default | 1001 | 312 |
Accuracy:
Measures the proportion of correct predictions out of all predictions.
\[ Accuracy = \frac{True \ Positives + True \ Negatives}{Total \ Predictions}\]
High accuracy means the model correctly predicted most of the cases. However, it can be misleading if the dataset is imbalanced (e.g., if the majority class dominates).
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")Accuracy: 0.81
Precision:
Measures the proportion of correctly predicted positive observations out of all predicted positives.
\[ Precision = \frac{True \ Positives}{True \ Positives + False \ Positives}\]
Precision indicates how well the model avoids false positives. It's useful when the cost of a false positive is high (e.g., wrongly predicting that someone will default).
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.2f}")Precision: 0.70
Recall:
Measures the proportion of correctly predicted positive observations out of all actual positives.
\[ Recall = \frac{True \ Positives}{True \ Positives + False \ Negatives}\]
Recall indicates how well the model identifies true positives. It's useful when the cost of missing a positive case is high (e.g., failing to identify a potential defaulter).
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")Recall: 0.24
4.2 Gradient Boosting
Gradient Boosting (in Scikit-learn) is a method proposed by Friedman (2001) that builds a series of simple decision trees sequentially, optimizing errors at each step to create a stronger, more accurate model.
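The estimation below uses scikit-learn's default settings. For reference, these are the main hyperparameters one might tune later; the values shown are the library defaults, listed here only as a sketch:
from sklearn.ensemble import GradientBoostingClassifier
# Defaults made explicit (illustration only; the model estimated below relies on them implicitly)
gbc_sketch = GradientBoostingClassifier(
    n_estimators=100,   # number of trees built sequentially
    learning_rate=0.1,  # contribution of each tree to the ensemble
    max_depth=3,        # depth of each individual tree
    random_state=42)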
4.2.1 Estimation
from sklearn.ensemble import GradientBoostingClassifier
# Initialize the Gradient Boosting Classifier
gbc_model = GradientBoostingClassifier(random_state=42)
# Fit the model on the training data
gbc_model.fit(X_train, y_train)
# Predictions on the test data
y_pred_gbc = gbc_model.predict(X_test)
4.2.2 Evaluation
# Confusion matrix
cm_gb = confusion_matrix(y_test, y_pred_gbc)
cm_df_gbc = pd.DataFrame(cm_gb,
index=['Actual No Default', 'Actual Default'],
columns=['Predicted No Default', 'Predicted Default'])
print(cm_df_gbc)
                   Predicted No Default  Predicted Default
Actual No Default 4451 236
Actual Default 843 470
accuracy = accuracy_score(y_test, y_pred_gbc)
precision = precision_score(y_test, y_pred_gbc)
recall = recall_score(y_test, y_pred_gbc)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")Accuracy: 0.82
Precision: 0.67
Recall: 0.36
4.3 Random Forest
(A). Estimate a Random Forest Model.
- Use the function “RandomForestClassifier(n_estimators=100, random_state=42)” to initialize the classifier,
- “rf_model.fit” to fit the model, and
- “rf_model.predict” to predict
(B). Evaluate the random forest classifier:
- Show the confusion matrix
- Obtain the measures of “accuracy”, “precision” , and “recall”
Here is some code to get you started:
from sklearn.ensemble import RandomForestClassifier
# (A). Estimation of the Random Forest Model
# 1. Initialize the Random Forest Classifier here
rf_model =
# 2. Fit the model on the training data here
# 3. Make predictions on the test data here
y_pred_rf =
# (B). Evaluation of the model's performance
from sklearn.ensemble import RandomForestClassifier
# Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the model on the training data
rf_model.fit(X_train, y_train)
# Predictions on the test data
y_pred_rf = rf_model.predict(X_test)
# Confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
cm_df_rf = pd.DataFrame(cm_rf,
index=['Actual No Default', 'Actual Default'],
columns=['Predicted No Default', 'Predicted Default'])
print(cm_df_rf)
# Random Forest model's performance
accuracy = accuracy_score(y_test, y_pred_rf)
precision = precision_score(y_test, y_pred_rf)
recall = recall_score(y_test, y_pred_rf)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}") Predicted No Default Predicted Default
Actual No Default 4413 274
Actual Default 827 486
Accuracy: 0.82
Precision: 0.64
Recall: 0.37
5 Models Comparison
# Creating a dictionary to store the evaluation metrics for each model
model_comparison = {
'Model': ['Logistic Regression', 'GBoost', 'Random Forest'],
'Accuracy': [
accuracy_score(y_test, y_pred), # Logistic Regression
accuracy_score(y_test, y_pred_gbc),
accuracy_score(y_test, y_pred_rf)
],
'Precision': [
precision_score(y_test, y_pred),
precision_score(y_test, y_pred_gbc),
precision_score(y_test, y_pred_rf)
],
'Recall': [
recall_score(y_test, y_pred),
recall_score(y_test, y_pred_gbc),
recall_score(y_test, y_pred_rf)
]
}
comparison_df = pd.DataFrame(model_comparison)
print(comparison_df)
                 Model  Accuracy  Precision    Recall
0 Logistic Regression 0.810833 0.699552 0.237624
1 GBoost 0.820167 0.665722 0.357959
2 Random Forest 0.816500 0.639474 0.370145
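Because precision and recall pull in opposite directions across these models, the F1 score (imported at the top but not used so far) combines both into a single number; a minimal addition to the comparison table:
# F1 score: harmonic mean of precision and recall, one value per model
comparison_df['F1'] = [
    f1_score(y_test, y_pred),      # Logistic Regression
    f1_score(y_test, y_pred_gbc),  # GBoost
    f1_score(y_test, y_pred_rf)    # Random Forest
]
print(comparison_df)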
6 Conclusion
The three models offer relatively high accuracy (>81%). The logistic model offers better precision (\(\approx\) 70%); however, the analyst must be aware that in the context of credit default, recall is particularly crucial, as it measures the model's ability to correctly identify actual defaulters. Missing a default can lead to significant financial losses. Therefore, the GBoost and Random Forest models would be the better choice, as they offer higher recall than the logistic model.
7 Other comparison measures
Analysts often include the Receiver Operating Characteristic (ROC) curve to evaluate the performance of a binary classifier. The ROC curve plots the True Positive Rate \(\Big(\frac{TP}{TP + FN}\Big)\) against the False Positive Rate \(\Big(\frac{FP}{FP + TN}\Big)\) at various threshold settings. See the ROC curve for each of the models developed here in Figure 3.
# ROC and AUC for Logistic Regression
fpr_log, tpr_log, _ = roc_curve(y_test, log_reg.predict_proba(X_test)[:, 1])
auc_log = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])
# ROC and AUC for GBoost
fpr_gbc, tpr_gbc, _ = roc_curve(y_test, gbc_model.predict_proba(X_test)[:, 1])
auc_gbc = roc_auc_score(y_test, gbc_model.predict_proba(X_test)[:, 1])
# ROC and AUC for Random Forest
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_model.predict_proba(X_test)[:, 1])
auc_rf = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])
# Plotting the ROC curves
plt.figure(figsize=(7, 5))
plt.plot(fpr_log, tpr_log, label=f'Logistic Regression (AUC = {auc_log:.2f})', color='blue')
plt.plot(fpr_gbc, tpr_gbc, label=f'GBoost (AUC = {auc_gbc:.2f})', color='orange')
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.2f})', color='green')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.xlabel('False Positive Rate', fontsize=10)
plt.ylabel('True Positive Rate', fontsize=10)
# plt.title('ROC Curves', fontsize=12)
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()
The AUC represents the area under the ROC curve. This single value summarizes the model's overall discriminative power in distinguishing between the two classes:
- AUC=1: perfect separation between classes
- AUC=0.5: random model, no discrimination between classes.
- AUC<0.5: worse than random, the model misclassifies more than it correctly classifies.
The accuracy of a model’s predictions can also be measured with the Root Mean Square Error (RMSE). It quantifies the difference between the actual (observed) values and the predicted values produced by the model. That is:
\[ RMSE = \Big[\frac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2\Big]^{1/2} \]
Where \(n\) is the sample size, \(y_i\) and \(\hat{y}_i\) are the actual and predicted values, respectively.
Note, however, that the RMSE is generally not useful when the \(y\) variable is not continuous, as is the case here.
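For completeness, here is a minimal NumPy sketch of the formula above, applied to the logistic model's predicted default probabilities purely for illustration (keeping in mind the caveat that RMSE is not the metric of choice for a binary target):
# RMSE between actual 0/1 outcomes and predicted default probabilities (illustration only)
y_prob = log_reg.predict_proba(X_test)[:, 1]
rmse = np.sqrt(np.mean((y_test.astype(float) - y_prob) ** 2))
print(f"RMSE: {rmse:.3f}")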
8 Prediction
Suppose a new borrower has the following characteristics:
newCred = {
'AMOUNT': [167484], # Loan amount at the mean
'SEX': [1], # Male
'EDUCATION': [2], # University
'MARRIAGE': [1], # Single
'AGE': [30], # Age
'PAY_1': [0], # On-time payment
'PAY_2': [0],
'PAY_3': [0],
'PAY_4': [0],
'PAY_5': [0],
'PAY_6': [0],
'BILL_AMT1': [50000], # Bill statement
'BILL_AMT2': [45000],
'BILL_AMT3': [47000],
'BILL_AMT4': [35000],
'BILL_AMT5': [30000],
'BILL_AMT6': [25000],
'PAY_AMT1': [0], # Previous payments
'PAY_AMT2': [0],
'PAY_AMT3': [0],
'PAY_AMT4': [0],
'PAY_AMT5': [0],
'PAY_AMT6': [0]
}
newCred_df = pd.DataFrame(newCred)
Prepare the new data:
cats = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
nums = ['AMOUNT', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6','PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
newCred_df[cats] = newCred_df[cats].astype('category')
newCred_df[nums] = scaler.transform(newCred_df[nums])
Predict the default probability using the model of your choice:
probDefault = log_reg.predict_proba(newCred_df)[:, 1][0]
print(f"Probability of Default for the new customer is: {probDefault:.2f}")Probability of Default for the new customer is: 0.21