Modeling Credit Default in Python
1 Intro
Credit is a transaction between two parties in which one (the creditor or lender) supplies money, goods, services or securities in return for a promise of future payment by the other (the debtor or borrower). Such transactions normally include the payment of interest to the lender (Joseph 2013).
The potential for loss due to failure of a borrower to meet its contractual obligation to repay a debt in accordance with the agreed terms is called Credit Risk. Three elements characterize credit risk:
- Probability of Default (PD): Likelihood that the counter-party would fail to make full and timely repayment of its financial obligations
- Exposure at Default (EAD): The expected value of the loan at the time of default, i.e., the amount the customer owes the lender in case of default.
- Loss Given Default (LGD): The amount of loss in case of a default. Expressed as a percentage of the EAD.
The product of these three is called the Expected Loss, a measure of the anticipated financial loss from a loan or portfolio of loans over a certain period. Creditors invest significant effort in building algorithms to predict the likelihood of a customer defaulting on a loan (PD). Defaults can result in substantial financial losses, affecting both the profitability and the stability of financial institutions. To mitigate these risks, it is essential to develop robust predictive models that help identify potential defaulters before credit is granted. Our aim here is to develop such models.
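To make the expected-loss relationship concrete, here is a minimal sketch with purely illustrative figures (none of these numbers come from the dataset used below):
prob_default = 0.04   # PD: assumed 4% chance the borrower defaults (illustrative)
ead = 150000          # EAD: assumed balance owed at the time of default (illustrative)
lgd = 0.60            # LGD: assumed fraction of the exposure lost if default occurs
expected_loss = prob_default * ead * lgd
print(f"Expected Loss: {expected_loss:,.0f}")  # 3,600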
We will employ a dataset from Kaggle (Lichman 2013) that contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan. Click on the link to identify the variables and additional features of the data.
Disclaimer:
This document is intended for educational purposes only. It does not constitute financial advice. It is part of my financial modeling course at Tec de Monterrey, created to demonstrate the application of machine learning algorithms in predicting credit defaults.
2 Data exploration
Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
Load data
data = pd.read_csv("/mydirectory/UCI_Credit_Card.csv")
Data preparation
We begin by exploring general features of the dataset, such as the number of columns, the types of variables, and whether there are missing values:
data.info()
To facilitate the analysis, we will rename "PAY_0" to "PAY_1" and "LIMIT_BAL" to "AMOUNT". We will also shorten the name of "default.payment.next.month" to "default".
data.rename(columns={
'LIMIT_BAL' : 'AMOUNT',
'PAY_0': 'PAY_1',
'default.payment.next.month': 'default'
}, inplace=True)
Some features of the data are either confusing or counter-intuitive, which adds an easily avoidable degree of complexity. Let's fix that here:
- SEX will be 0 for female (instead of 2) and 1 for male.
data.SEX.value_counts()
data['SEX'] = data['SEX'].apply(lambda x: 0 if x == 2 else 1)
data.SEX.value_counts()
- The scale of EDUCATION is 1 = graduate school, 2 = university, 3 = high school, 4 = others, 5 = unknown, 6 = unknown. Categories 0, 4, 5, and 6 can be grouped into class 4.
data.EDUCATION.value_counts()
data['EDUCATION'] = data['EDUCATION'].replace([0, 4, 5, 6], 4)
data.EDUCATION.value_counts()
- MARRIAGE will be regrouped. The original scale of marital status (0 = undefined, 1 = married, 2 = single, 3 = others) becomes a binary indicator: 1 for single and 0 otherwise.
data.MARRIAGE.value_counts()
# Recode MARRIAGE as a binary indicator: 1 = single (original category 2), 0 otherwise
data['MARRIAGE'] = data['MARRIAGE'].apply(lambda x: 1 if x == 2 else 0)
data.MARRIAGE.value_counts()
- The payment history variables (PAY_1 to PAY_6) indicate the repayment status in previous months. Negative values represent payments made on time, while positive values indicate delays. A more intuitive scale reverses the signs:
data.PAY_1.value_counts()
for i in range(1,7):
data[f'PAY_{i}'] = data[f'PAY_{i}'].apply(lambda x: -x)
data.PAY_1.value_counts()
Next, we generate descriptive statistics, which will help us identify general characteristics of the sample.
3 Descriptive analysis
dStats = data.describe().T.round(1)
dStats
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 30000.0 | 15000.5 | 8660.4 | 1.0 | 7500.8 | 15000.5 | 22500.2 | 30000.0 |
| AMOUNT | 30000.0 | 167484.3 | 129747.7 | 10000.0 | 50000.0 | 140000.0 | 240000.0 | 1000000.0 |
| SEX | 30000.0 | 0.4 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| EDUCATION | 30000.0 | 1.8 | 0.7 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| MARRIAGE | 30000.0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| AGE | 30000.0 | 35.5 | 9.2 | 21.0 | 28.0 | 34.0 | 41.0 | 79.0 |
| PAY_1 | 30000.0 | 0.0 | 1.1 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| PAY_2 | 30000.0 | 0.1 | 1.2 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| PAY_3 | 30000.0 | 0.2 | 1.2 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| PAY_4 | 30000.0 | 0.2 | 1.2 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| PAY_5 | 30000.0 | 0.3 | 1.1 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| PAY_6 | 30000.0 | 0.3 | 1.1 | -8.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| BILL_AMT1 | 30000.0 | 51223.3 | 73635.9 | -165580.0 | 3558.8 | 22381.5 | 67091.0 | 964511.0 |
| BILL_AMT2 | 30000.0 | 49179.1 | 71173.8 | -69777.0 | 2984.8 | 21200.0 | 64006.2 | 983931.0 |
| BILL_AMT3 | 30000.0 | 47013.2 | 69349.4 | -157264.0 | 2666.2 | 20088.5 | 60164.8 | 1664089.0 |
| BILL_AMT4 | 30000.0 | 43262.9 | 64332.9 | -170000.0 | 2326.8 | 19052.0 | 54506.0 | 891586.0 |
| BILL_AMT5 | 30000.0 | 40311.4 | 60797.2 | -81334.0 | 1763.0 | 18104.5 | 50190.5 | 927171.0 |
| BILL_AMT6 | 30000.0 | 38871.8 | 59554.1 | -339603.0 | 1256.0 | 17071.0 | 49198.2 | 961664.0 |
| PAY_AMT1 | 30000.0 | 5663.6 | 16563.3 | 0.0 | 1000.0 | 2100.0 | 5006.0 | 873552.0 |
| PAY_AMT2 | 30000.0 | 5921.2 | 23040.9 | 0.0 | 833.0 | 2009.0 | 5000.0 | 1684259.0 |
| PAY_AMT3 | 30000.0 | 5225.7 | 17607.0 | 0.0 | 390.0 | 1800.0 | 4505.0 | 896040.0 |
| PAY_AMT4 | 30000.0 | 4826.1 | 15666.2 | 0.0 | 296.0 | 1500.0 | 4013.2 | 621000.0 |
| PAY_AMT5 | 30000.0 | 4799.4 | 15278.3 | 0.0 | 252.5 | 1500.0 | 4031.5 | 426529.0 |
| PAY_AMT6 | 30000.0 | 5215.5 | 17777.5 | 0.0 | 117.8 | 1500.0 | 4000.0 | 528666.0 |
| default | 30000.0 | 0.2 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Define the categorical variables:
cats = ['default','SEX', 'EDUCATION', 'MARRIAGE',
'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
# Converting columns to 'category' data type
data[cats] = data[cats].astype('category')
3.1 Default status
Figure 1 shows how the variable of interest is distributed.
plt.figure(figsize=(7, 5))
ax = sns.countplot(x=data['default'], hue=data['default'], palette=['#3182bd', '#DBB40C'], legend=False)
# Adding percentage
total_count = len(data['default'])
for container in ax.containers:
ax.bar_label(container, labels=[f'{(v/total_count)*100:.1f}%' for v in container.datavalues])
# Adjust for binary variable
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])
plt.xlabel('Default Status', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Distribution of Default', fontsize=10)
plt.tight_layout()
plt.show()
3.2 Loan amount
plt.figure(figsize=(7, 5))
sns.histplot(data['AMOUNT'], kde=True, color='#04d8b2', edgecolor='black')
plt.xlabel('Credit Amount', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
# plt.title('Distribution of Credit Amount', fontsize=12)
plt.tight_layout()
plt.show()
plt.figure(figsize=(7, 5))
sns.boxplot(x=data['default'], hue=data['default'], y=data['AMOUNT'], palette=['#04d8b2', '#DBB40C'])
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])
plt.xlabel('Status', fontsize=10)
plt.ylabel('Credit Amount', fontsize=10)
# Displaying the plot
plt.tight_layout()
plt.show()
An empirical Cumulative Distribution Function (eCDF), as in Figure 2, can also help derive lessons. It plots the loan amounts from lowest to highest against their cumulative percentiles.
plt.figure(figsize=(7, 5))
sns.ecdfplot(data=data, x='AMOUNT', hue='default', palette=['#6baed6', '#fd8d3c'])
plt.xlabel('Credit Amount', fontsize=10)
plt.ylabel('eCDF', fontsize=10)
plt.legend(labels=["Not Defaulted", "Defaulted"], title="Default Status")
plt.tight_layout()
plt.show()
3.3 Age
plt.figure(figsize=(7, 5))
sns.boxplot(x=data['default'],hue = data['default'], y=data['AGE'], palette=['#6baed6', '#fd8d3c'])
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])
plt.xlabel('Default Status', fontsize=10)
plt.ylabel('Age', fontsize=10)
plt.tight_layout()
plt.show()
3.4 Sex
plt.figure(figsize=(7, 5))
sns.countplot(x='SEX', hue='default', data=data, palette=['#6baed6', '#fd8d3c'])
plt.xticks([0, 1], labels=["Female", "Male"])
plt.title('Distribution of Default by Sex', fontsize=10)
plt.xlabel('Sex', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.legend(labels=["Not Defaulted", "Defaulted"], title="Default Status")
plt.tight_layout()
plt.show()
3.5 Education
plt.figure(figsize=(7, 5))
sns.countplot(x='EDUCATION', hue='default', data=data, palette=['#6baed6', '#fd8d3c'])
plt.xticks([0, 1, 2, 3], labels=["Graduate", "University", "High School", "Unknown/Others"])
plt.xlabel('Education Level', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.legend(labels=["Not Defaulted", "Defaulted"], title="Default Status")
plt.tight_layout()
plt.show()
# Marriage
plt.figure(figsize=(7, 5))
ax = sns.countplot(x='MARRIAGE', hue='default', data=data, palette=['#6baed6', '#fd8d3c'])
# Adjusting x-axis labels for marriage categories
ax.set_xticks([0, 1])
ax.set_xticklabels(['Others', 'Single'])
plt.title('Distribution of Default by Marriage Status', fontsize=10)
plt.xlabel('Marriage Status', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.legend(labels=["Not Defaulted", "Defaulted"], title="Default Status")
plt.tight_layout()
plt.show()
4 Modeling
Begin by defining independent (X) and target (y) features, then split the data into training (80%) and testing (20%) subsets.
data.columns
X = data.drop(['default', 'ID'], axis=1)
y = data['default']
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The function train_test_split() randomly splits the dataset into two parts: one for training the model (X_train, y_train) and one for testing the model's performance (X_test, y_test). The value 0.2 means 20% of the dataset will be used for testing and 80% for training. The parameter random_state=42 sets the random seed, ensuring that the data is split the same way every time the code is run; this makes the results reproducible and facilitates the comparison of different models.
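Because defaulters are a minority class (about 22% of the sample, as shown in Figure 1), one may optionally pass stratify=y so that both subsets keep the same default rate. A minimal variant of the split above, shown only as a sketch (the rest of this document keeps the unstratified split):
# Optional: stratified split preserves the default proportion in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(y_train_s.value_counts(normalize=True))  # close to the overall default rate
print(y_test_s.value_counts(normalize=True))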
Proceed to scale all numerical variables:
# Scaling only the numerical features
cat_features = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
num_features = ['AMOUNT', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6','PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
scaler = StandardScaler()
X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features] = scaler.transform(X_test[num_features])
Why do we scale? Some models (e.g., Logistic Regression) are sensitive to the scale of the features. If one feature (e.g., AMOUNT, which ranges from 10,000 to 1,000,000) has much larger values than another (e.g., AGE, which ranges from 21 to 79), the model might give more weight to the larger-scale feature simply because of its magnitude. Standardization (StandardScaler) puts all numerical variables on a similar scale, with a mean of 0 and a standard deviation of 1, so the model treats them on an equal footing.
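As a quick, optional sanity check, the scaled training columns should now have a mean close to 0 and a standard deviation close to 1:
# Verify the standardization on the training set (where the scaler was fit)
print(X_train[num_features].mean().round(2))
print(X_train[num_features].std().round(2))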
4.1 Logistic Regression
4.1.1 Estimation
# Initialize the Logistic Regression model
log_reg = LogisticRegression(random_state=42)
# Fit the model on the training data
log_reg.fit(X_train, y_train)
# Make predictions on the test data
y_pred = log_reg.predict(X_test)
4.1.2 Evaluation
Confusion Matrix:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
index=['Actual No Default', 'Actual Default'],
columns=['Predicted No Default', 'Predicted Default'])
cm_df
| | Predicted No Default | Predicted Default |
|---|---|---|
| Actual No Default | 4553 | 134 |
| Actual Default | 1001 | 312 |
Accuracy:
Measures the proportion of correct predictions out of all predictions.
\[ Accuracy = \frac{True \ Positives + True \ Negatives}{Total \ Predictions}\]
High accuracy means the model correctly predicted most of the cases. However, it can be misleading if the dataset is imbalanced (e.g., if the majority class dominates).
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")Accuracy: 0.81
Precision:
Measures the proportion of correctly predicted positive observations out of all predicted positives.
\[ Precision = \frac{True \ Positives}{True \ Positives + False \ Positives}\]
Precision indicates how well the model avoids false positives. It's useful when the cost of a false positive is high (e.g., wrongly predicting that someone will default).
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.2f}")Precision: 0.70
Recall:
Measures the proportion of correctly predicted positive observations out of all actual positives.
\[ Recall = \frac{True \ Positives}{True \ Positives + False \ Negatives}\]
Recall indicates how well the model identifies true positives. It's useful when the cost of missing a positive case is high (e.g., failing to identify a potential defaulter).
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")Recall: 0.24
4.2 Gradient Boosting
Gradient Boosting (in Scikit-learn) is a method proposed by Friedman (2001) that builds a series of simple decision trees sequentially, optimizing errors at each step to create a stronger, more accurate model.
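The estimation below uses scikit-learn's default settings. For reference, these are the main hyperparameters one might tune later; the values shown are the library defaults, listed here only as a sketch:
from sklearn.ensemble import GradientBoostingClassifier
# Defaults made explicit (illustration only; the model estimated below relies on them implicitly)
gbc_sketch = GradientBoostingClassifier(
    n_estimators=100,   # number of trees built sequentially
    learning_rate=0.1,  # contribution of each tree to the ensemble
    max_depth=3,        # depth of each individual tree
    random_state=42)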
4.2.1 Estimation
from sklearn.ensemble import GradientBoostingClassifier
# Initialize the Gradient Boosting Classifier
gbc_model = GradientBoostingClassifier(random_state=42)
# Fit the model on the training data
gbc_model.fit(X_train, y_train)
# Predictions on the test data
y_pred_gbc = gbc_model.predict(X_test)
4.2.2 Evaluation
# Confusion matrix
cm_gb = confusion_matrix(y_test, y_pred_gbc)
cm_df_gbc = pd.DataFrame(cm_gb,
index=['Actual No Default', 'Actual Default'],
columns=['Predicted No Default', 'Predicted Default'])
print(cm_df_gbc)
                   Predicted No Default  Predicted Default
Actual No Default 4451 236
Actual Default 843 470
accuracy = accuracy_score(y_test, y_pred_gbc)
precision = precision_score(y_test, y_pred_gbc)
recall = recall_score(y_test, y_pred_gbc)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")Accuracy: 0.82
Precision: 0.67
Recall: 0.36
4.3 Random Forest
(A). Estimate a Random Forest Model.
- Use the function “RandomForestClassifier(n_estimators=100, random_state=42)” to initialize the classifier,
- “rf_model.fit” to fit the model, and
- “rf_model.predict” to predict
(B). Evaluate the random forest classifier:
- Show the confusion matrix
- Obtain the measures of “accuracy”, “precision” , and “recall”
Here is some code to get you started:
from sklearn.ensemble import RandomForestClassifier
# (A). Estimation of the Random Forest Model
# 1. Initialize the Random Forest Classifier here
rf_model =
# 2. Fit the model on the training data here
# 3. Make predictions on the test data here
y_pred_rf =
# (B). Evaluation of the model's performance
from sklearn.ensemble import RandomForestClassifier
# Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the model on the training data
rf_model.fit(X_train, y_train)
# Predictions on the test data
y_pred_rf = rf_model.predict(X_test)
# Confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
cm_df_rf = pd.DataFrame(cm_rf,
index=['Actual No Default', 'Actual Default'],
columns=['Predicted No Default', 'Predicted Default'])
print(cm_df_rf)
# Random Forest model's performance
accuracy = accuracy_score(y_test, y_pred_rf)
precision = precision_score(y_test, y_pred_rf)
recall = recall_score(y_test, y_pred_rf)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}") Predicted No Default Predicted Default
Actual No Default 4413 274
Actual Default 827 486
Accuracy: 0.82
Precision: 0.64
Recall: 0.37
5 Models Comparison
# Creating a dictionary to store the evaluation metrics for each model
model_comparison = {
'Model': ['Logistic Regression', 'GBoost', 'Random Forest'],
'Accuracy': [
accuracy_score(y_test, y_pred), # Logistic Regression
accuracy_score(y_test, y_pred_gbc),
accuracy_score(y_test, y_pred_rf)
],
'Precision': [
precision_score(y_test, y_pred),
precision_score(y_test, y_pred_gbc),
precision_score(y_test, y_pred_rf)
],
'Recall': [
recall_score(y_test, y_pred),
recall_score(y_test, y_pred_gbc),
recall_score(y_test, y_pred_rf)
]
}
comparison_df = pd.DataFrame(model_comparison)
print(comparison_df)
                 Model  Accuracy  Precision    Recall
0 Logistic Regression 0.810833 0.699552 0.237624
1 GBoost 0.820167 0.665722 0.357959
2 Random Forest 0.816500 0.639474 0.370145
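Because precision and recall pull in opposite directions across these models, the F1 score (imported at the top but not used so far) combines both into a single number; a minimal addition to the comparison table:
# F1 score: harmonic mean of precision and recall, one value per model
comparison_df['F1'] = [
    f1_score(y_test, y_pred),      # Logistic Regression
    f1_score(y_test, y_pred_gbc),  # GBoost
    f1_score(y_test, y_pred_rf)    # Random Forest
]
print(comparison_df)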
6 Conclusion
The three models offer relatively high accuracy (>81%). The logistic model offers better precision (\(\approx\) 70%); however, the analyst must be aware that in the context of credit default, recall is particularly crucial, as it measures the model's ability to correctly identify actual defaulters. Missing a default can lead to significant financial losses. Therefore, the GBoost and Random Forest models would be the better choice, as they offer higher recall than the logistic model.
7 Other comparison measures
Analysts often include the Receiver Operating Characteristic (ROC) curve to evaluate the performance of a binary classifier. The ROC curve plots the True Positive Rate \(\Big(\frac{TP}{TP + FN}\Big)\) against the False Positive Rate \(\Big(\frac{FP}{FP + TN}\Big)\) at various threshold settings. See the ROC curve for each of the models developed here in Figure 3.
# ROC and AUC for Logistic Regression
fpr_log, tpr_log, _ = roc_curve(y_test, log_reg.predict_proba(X_test)[:, 1])
auc_log = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])
# ROC and AUC for GBoost
fpr_gbc, tpr_gbc, _ = roc_curve(y_test, gbc_model.predict_proba(X_test)[:, 1])
auc_gbc = roc_auc_score(y_test, gbc_model.predict_proba(X_test)[:, 1])
# ROC and AUC for Random Forest
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_model.predict_proba(X_test)[:, 1])
auc_rf = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])
# Plotting the ROC curves
plt.figure(figsize=(7, 5))
plt.plot(fpr_log, tpr_log, label=f'Logistic Regression (AUC = {auc_log:.2f})', color='blue')
plt.plot(fpr_gbc, tpr_gbc, label=f'GBoost (AUC = {auc_gbc:.2f})', color='orange')
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.2f})', color='green')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.xlabel('False Positive Rate', fontsize=10)
plt.ylabel('True Positive Rate', fontsize=10)
# plt.title('ROC Curves', fontsize=12)
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()
The AUC represents the area under the ROC curve. This single value summarizes the model's overall discriminative power in distinguishing between the two classes:
- AUC=1: perfect separation between classes
- AUC=0.5: random model, no discrimination between classes.
- AUC<0.5: worse than random, the model misclassifies more than it correctly classifies.
The accuracy of a model’s predictions can also be measured with the Root Mean Square Error (RMSE). It quantifies the difference between the actual (observed) values and the predicted values produced by the model. That is:
\[ RMSE = \Big[\frac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2\Big]^{1/2} \]
Where \(n\) is the sample size, \(y_i\) and \(\hat{y}_i\) are the actual and predicted values, respectively.
Note, however, that the RMSE is generally not useful when the \(y\) variable is not continuous, as is the case here.
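For completeness, here is a minimal NumPy sketch of the formula above, applied to the logistic model's predicted default probabilities purely for illustration (keeping in mind the caveat that RMSE is not the metric of choice for a binary target):
# RMSE between actual 0/1 outcomes and predicted default probabilities (illustration only)
y_prob = log_reg.predict_proba(X_test)[:, 1]
rmse = np.sqrt(np.mean((y_test.astype(float) - y_prob) ** 2))
print(f"RMSE: {rmse:.3f}")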
8 Prediction
Suppose a new borrower has the following characteristics:
newCred = {
'AMOUNT': [167484], # Loan amount at the mean
'SEX': [1], # Male
'EDUCATION': [2], # University
'MARRIAGE': [1], # Single
'AGE': [30], # Age
'PAY_1': [0], # On-time payment
'PAY_2': [0],
'PAY_3': [0],
'PAY_4': [0],
'PAY_5': [0],
'PAY_6': [0],
'BILL_AMT1': [50000], # Bill statement
'BILL_AMT2': [45000],
'BILL_AMT3': [47000],
'BILL_AMT4': [35000],
'BILL_AMT5': [30000],
'BILL_AMT6': [25000],
'PAY_AMT1': [0], # Previous payments
'PAY_AMT2': [0],
'PAY_AMT3': [0],
'PAY_AMT4': [0],
'PAY_AMT5': [0],
'PAY_AMT6': [0]
}
newCred_df = pd.DataFrame(newCred)
Prepare the new data:
cats = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
nums = ['AMOUNT', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6','PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
newCred_df[cats] = newCred_df[cats].astype('category')
newCred_df[nums] = scaler.transform(newCred_df[nums])
Predict the default probability using the model of your choice:
probDefault = log_reg.predict_proba(newCred_df)[:, 1][0]
print(f"Probability of Default for the new customer is: {probDefault:.2f}")Probability of Default for the new customer is: 0.21