Module 1

Isai Guizar

Disclaimer:

This document is intended for educational purposes only. It does not constitute business advice.

1 Intro

1.1 What is Business Intelligence?

– Business Intelligence (BI) refers to the strategies, technologies, and tools that organizations use to collect, process, and analyze data in order to support decision-making.

– Business intelligence is the set of tools and systems used to collect and analyze data to make more informed business decisions.

BI is important because it allows organizations to:

  • Transform raw data into actionable insights.
  • Support evidence-based decision-making.
  • Improve operational efficiency.
  • Gain competitive advantages.

Examples:

  • Finance: credit risk analysis, fraud detection.

  • Economics: forecasting, policy intervention.

  • Marketing: customer segmentation, sentiment analysis.

  • Healthcare: patient monitoring, resource allocation.

  • Operations: supply chain optimization.


1.2 Types of Data

When working in Business Intelligence, it is crucial to understand the different types of data we may encounter:

Organization:

  • Structured Data
    Organized in tabular formats (rows and columns), easy to store in relational databases.
    Examples: transaction records, customer information, financial statements.

  • Semi-Structured Data
    Does not follow a strict tabular structure but still contains organizational tags or markers.
    Examples: JSON, XML, log files, sensor data.

  • Unstructured Data
    Lacks predefined structure; often qualitative and requires special techniques for processing.
    Examples: text documents, social media posts, audio, video, images.

Analysis:

  • Cross-Sectional Data
    Collected at a single point in time across multiple samples of individuals, households, firms, cities, states, countries, or other units of interest.
    Examples: customer surveys, demographics of clients in one year.

  • Temporal / Time Series Data
    Data indexed over time; central in forecasting and anomaly detection.
    Examples: stock prices, sales per month, daily website traffic.

  • Longitudinal / Panel Data
    Follows the same entities across time, combining cross-section and time dimensions.
    Examples: repeated credit histories of the same clients, health data tracked over years.

Key takeaway: BI projects can involve multiple data types, and the choice of tools and methods depends on the structure and nature of the dataset. Use of inappropriate methods may lead to misleading results.
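To make these three structures concrete, here is a minimal pandas sketch (the values are made up purely for illustration) showing how each type is typically organized:

import pandas as pd

# Cross-sectional: one observation per unit, collected at a single point in time
cross = pd.DataFrame({'client': ['A', 'B', 'C'], 'income': [500, 620, 410]})

# Time series: a single entity indexed over time
sales = pd.Series([100, 104, 98],
                  index=pd.period_range('2024-01', periods=3, freq='M'),
                  name='sales')

# Panel / longitudinal: the same clients followed across several periods (MultiIndex)
panel = pd.DataFrame(
    {'balance': [50, 55, 70, 68]},
    index=pd.MultiIndex.from_product([['A', 'B'], [2023, 2024]],
                                     names=['client', 'year'])
)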

Video: Business Intelligence vs Data Analytics

2 Application

Creditors invest significant effort in creating algorithms to predict the likelihood of a customer defaulting on a loan (the probability of default, PD). Defaults can result in substantial financial losses, impacting both the profitability and stability of financial institutions. To mitigate these risks, it is essential to develop robust predictive models that help identify potential defaulters. Our goal in this module is to:

  1. Explore and clean the structured dataset.

  2. Conduct descriptive analysis for cross-sectional data.

  3. Apply classification algorithms to predict default.

We will employ a dataset from Kaggle (Lichman 2013) that contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan. Click on the link above to identify the variables and additional features of the data.


2.1 Data exploration

Import libraries

import pandas  as pd
import numpy   as np

Load data

data = pd.read_csv("/mydirectory/UCI_Credit_Card.csv")

Data preparation

We begin by exploring general features of the dataset, such as the number of columns, the types of variables, and whether there are missing values:

data.info()

As shown, variables can store different types of data. Some common types are int (whole numbers), float (numbers with decimal points or in exponential form), str (sequences of characters), and bool (logical values, True or False). This dataset contains 13 float columns and 12 integer columns, and there are no missing values.
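As a quick check of this breakdown, the column types can also be tabulated directly (a small illustrative snippet):

data.dtypes.value_counts()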

To facilitate the analysis, we will rename “PAY_0” to “PAY_1” and “LIMIT_BAL” to “AMOUNT”. We will also shorten “default.payment.next.month” to “default”.

data.rename(columns={
    'LIMIT_BAL' : 'AMOUNT',
    'PAY_0': 'PAY_1',
    'default.payment.next.month': 'default'
}, inplace=True)

To further explore the variables, we can use the function value_counts(). For example, for the variable SEX:

data.SEX.value_counts()

It is convenient for binary variables to take values of 0 and 1. SEX will be recoded to 0 for female (instead of 2) and 1 for male.

data['SEX'] = data['SEX'].apply(lambda x: 0 if x == 2 else 1)
data.SEX.value_counts()

Video: More on the Python lambda function

Some other features of the data remain confusing or counter-intuitive; if ignored, they add an easily avoidable degree of complexity to the analysis. We fix them in what follows:

  1. The scale of EDUCATION (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown). Categories 4,5,6 and 0 can be grouped into only one class, say 4.
data.EDUCATION.value_counts()

data['EDUCATION'] = data['EDUCATION'].replace([0, 4, 5, 6], 4)

data.EDUCATION.value_counts()
  2. MARRIAGE will be regrouped. The original marital status codes are (0 = undocumented, 1 = married, 2 = single, 3 = others). We will recode it as a binary variable equal to 1 for single and 0 otherwise.
data.MARRIAGE.value_counts()

data['MARRIAGE'] = data['MARRIAGE'].apply(lambda x: 1 if x == 2 else 0)   # 1 = single (originally 2), 0 otherwise

data.MARRIAGE.value_counts()
  3. The payment history variables (PAY_1 to PAY_6) indicate the repayment status in previous months. Negative values represent payments made on time, while positive values indicate delays. A more intuitive scale reverses the signs:
data.PAY_1.value_counts()

for i in range(1,7):
    data[f'PAY_{i}'] = data[f'PAY_{i}'].apply(lambda x: -x)

data.PAY_1.value_counts()

Next, we proceed to generate descriptive statistics, which will help us identify general characteristics of the sample.


2.2 Descriptive statistics

dStats  = data.describe().T.round(1) 
dStats
Variable count mean std min 25% 50% 75% max
ID 30000.0 15000.5 8660.4 1.0 7500.8 15000.5 22500.2 30000.0
AMOUNT 30000.0 167484.3 129747.7 10000.0 50000.0 140000.0 240000.0 1000000.0
SEX 30000.0 0.4 0.5 0.0 0.0 0.0 1.0 1.0
EDUCATION 30000.0 1.8 0.7 1.0 1.0 2.0 2.0 4.0
MARRIAGE 30000.0 0.5 0.5 0.0 0.0 1.0 1.0 1.0
AGE 30000.0 35.5 9.2 21.0 28.0 34.0 41.0 79.0
PAY_1 30000.0 0.0 1.1 -8.0 0.0 0.0 1.0 2.0
PAY_2 30000.0 0.1 1.2 -8.0 0.0 0.0 1.0 2.0
PAY_3 30000.0 0.2 1.2 -8.0 0.0 0.0 1.0 2.0
PAY_4 30000.0 0.2 1.2 -8.0 0.0 0.0 1.0 2.0
PAY_5 30000.0 0.3 1.1 -8.0 0.0 0.0 1.0 2.0
PAY_6 30000.0 0.3 1.1 -8.0 0.0 0.0 1.0 2.0
BILL_AMT1 30000.0 51223.3 73635.9 -165580.0 3558.8 22381.5 67091.0 964511.0
BILL_AMT2 30000.0 49179.1 71173.8 -69777.0 2984.8 21200.0 64006.2 983931.0
BILL_AMT3 30000.0 47013.2 69349.4 -157264.0 2666.2 20088.5 60164.8 1664089.0
BILL_AMT4 30000.0 43262.9 64332.9 -170000.0 2326.8 19052.0 54506.0 891586.0
BILL_AMT5 30000.0 40311.4 60797.2 -81334.0 1763.0 18104.5 50190.5 927171.0
BILL_AMT6 30000.0 38871.8 59554.1 -339603.0 1256.0 17071.0 49198.2 961664.0
PAY_AMT1 30000.0 5663.6 16563.3 0.0 1000.0 2100.0 5006.0 873552.0
PAY_AMT2 30000.0 5921.2 23040.9 0.0 833.0 2009.0 5000.0 1684259.0
PAY_AMT3 30000.0 5225.7 17607.0 0.0 390.0 1800.0 4505.0 896040.0
PAY_AMT4 30000.0 4826.1 15666.2 0.0 296.0 1500.0 4013.2 621000.0
PAY_AMT5 30000.0 4799.4 15278.3 0.0 252.5 1500.0 4031.5 426529.0
PAY_AMT6 30000.0 5215.5 17777.5 0.0 117.8 1500.0 4000.0 528666.0
default 30000.0 0.2 0.4 0.0 0.0 0.0 0.0 1.0

Note that the dataset contains variables stored as “int” even when no ranking should be implied, such as SEX. A categorical variable is one that takes on a limited, fixed number of possible values (categories), rather than continuous numeric values.

Define the categorical variables:

cats = ['default','SEX', 'EDUCATION', 'MARRIAGE']

# Converting columns to 'category' data type
data[cats] = data[cats].astype('category')
data.info()

2.3 Graphical output

Python offers a wide variety of plotting options, with Matplotlib and Seaborn being the two main libraries for this purpose. You can explore their galleries here: Matplotlib Gallery | Seaborn Gallery. Although there are many, only a few of them are used in this document.

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

2.3.1 Default status

Figure 1 shows how the variable of interest is distributed.

plt.figure(figsize=(7, 5))
ax = sns.countplot(x=data['default'], hue=data['default'], palette=['#3182bd', '#DBB40C'], legend=False)

# Adding percentage labels
total_count = len(data['default'])
for container in ax.containers:
    ax.bar_label(container, labels=[f'{(v/total_count)*100:.1f}%' for v in container.datavalues])

# Adjust for binary variable
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])
plt.xlabel('Default Status', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Distribution of Default', fontsize=10)

plt.tight_layout()
plt.show()
Figure 1: Default distribution

Here is a scatter plot of the amounts of credit vs default status:

plt.figure(figsize=(7, 5))

plt.scatter(data['AMOUNT'], data['default'], alpha=0.6)
plt.xlabel('Amount')
plt.ylabel('Default Status')
plt.title('Scatter plot of credit amount vs default status')

plt.tight_layout()
plt.show()

2.3.2 Loan amount

plt.figure(figsize=(7, 5))
sns.histplot(data['AMOUNT'], kde=True, color='#04d8b2', edgecolor='black')

plt.xlabel('Credit Amount', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
# plt.title('Distribution of Credit Amount', fontsize=12)

plt.tight_layout()
plt.show()

Distribution of credit amount
plt.figure(figsize=(7, 5))
sns.boxplot(x=data['default'], hue=data['default'], y=data['AMOUNT']/1000, palette=['#04d8b2', '#DBB40C'])
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])

plt.xlabel('Status', fontsize=10)
plt.ylabel('Credit Amount ($1000)', fontsize=10)

# Displaying the plot
plt.tight_layout()
plt.show()

Boxplot of amount of credit by default status
fig = px.histogram(
    data, 
    x="AMOUNT", 
    nbins=50,              # adjust bin size
    color_discrete_sequence=["#04d8b2"], 
    opacity=0.5,
    marginal="box"         # optional: adds a boxplot on top
)

fig.update_layout(
    xaxis_title="Credit Amount",
    yaxis_title="Frequency",
    bargap=0.05
)

fig.show()

2.3.3 Age

plt.figure(figsize=(7, 5))
sns.boxplot(x=data['default'],hue = data['default'], y=data['AGE'], palette=['#6baed6', '#fd8d3c'])
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])

plt.xlabel('Default Status', fontsize=10)
plt.ylabel('Age', fontsize=10)

plt.tight_layout()
plt.show()

2.3.4 Sex

plt.figure(figsize=(7, 5))
sns.countplot(x='SEX', hue='default', data=data, palette=['#6baed6', '#fd8d3c'])
plt.xticks([0, 1], labels=["Female", "Male"])

plt.title('Distribution of Default by Sex', fontsize=10)
plt.xlabel('Sex',   fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.legend(labels=["Not Defaulted", "Defaulted"])

plt.tight_layout()
plt.show()
# Calculate proportions
prop_df = (
    data.groupby(['SEX', 'default'])
        .size()
        .reset_index(name='count')
)

# Convert counts to percentages within each SEX group
prop_df['percentage'] = (
    prop_df.groupby('SEX')['count']
           .transform(lambda x: 100 * x / x.sum())
)

# Plot percentages
plt.figure(figsize=(7, 5))
sns.barplot(
    x='SEX', y='percentage', hue='default',
    data=prop_df, palette=['#6baed6', '#fd8d3c']
)

plt.xticks([0, 1], labels=["Female", "Male"])
plt.title('Distribution of Default by Sex', fontsize=10)
plt.xlabel('Sex', fontsize=10)
plt.ylabel('Percentage (%)', fontsize=10)

plt.tight_layout()
plt.show()

2.3.5 Education

plt.figure(figsize=(7, 5))
sns.countplot(x='EDUCATION', hue='default', data=data, palette=['#6baed6', '#fd8d3c'])
plt.xticks([0, 1, 2, 3], labels=["Graduate", "University", "High School", "Unknown/Others"])

plt.xlabel('Education Level', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.legend(labels=["Not Defaulted", "Defaulted"])


plt.tight_layout()
plt.show()

Distribution of Default Status by Education

2.4 Classification

Classification methods are a family of machine learning techniques used to predict a categorical outcome based on a set of input variables. They employ historical data where the outcome of interest is known, so the algorithm can learn the relationships between predictors and outcomes. Once trained, the model can be used to assign new observations to one of the predefined categories. Classification algorithms include logistic regression (often used as a benchmark), decision trees, random forests, and support vector machines. More advanced methods such as gradient boosting and neural networks can capture more complex, nonlinear relationships.

Scikit-learn is a widely used Python library for predictive analysis. It is open source and provides a wide range of algorithms and tools for classification; explore further here:
👉 Supervised Learning

Video: Classification methods

We focus here on three classification methods: Logistic Regression, Random Forest, and Gradient Boosting. Logistic Regression is perhaps the simplest, yet powerful, statistical model that estimates the probability of an outcome; it is easy to interpret and useful as a baseline. Random Forest extends the idea of decision trees by building many trees and averaging their predictions, which improves accuracy and reduces overfitting. Finally, Gradient Boosting (of which XGBoost is a popular implementation) is a more advanced technique that builds trees sequentially, with each new tree correcting the errors of the previous ones.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing   import StandardScaler
from sklearn.linear_model    import LogisticRegression
from sklearn.metrics         import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics         import confusion_matrix, classification_report, roc_curve, roc_auc_score

Begin by defining independent (X) and target (y) features, then split the data into training (80%) and testing (20%) subsets.

data.columns

X = data.drop(['default', 'ID'], axis=1)
y = data['default']

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The function train_test_split() randomly splits the dataset into two parts: one for training the model (X_train, y_train) and one for testing its performance (X_test, y_test). The value 0.2 means 20% of the dataset will be used for testing and 80% for training. The parameter random_state=42 sets the random seed, which ensures that the data is split the same way every time the code is run, making results reproducible and facilitating comparisons across models.
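A quick sanity check (illustrative) confirms the split: with 30,000 observations and 23 predictors, we expect 24,000 training rows and 6,000 test rows:

# Verify the sizes of the training and testing subsets
print(X_train.shape, X_test.shape)   # expected: (24000, 23) (6000, 23)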

Proceed to scale all numerical variables:

# Scaling only the numerical features
cat_features = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_1']

num_features = ['AMOUNT', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6','PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

scaler = StandardScaler()

X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features]  = scaler.transform(X_test[num_features])

Why do we scale? Some models (e.g., Logistic Regression) are sensitive to the scale of the features. If one feature (e.g., AMOUNT, which ranges from 10,000 to 1,000,000) has much larger values than another (e.g., AGE, which ranges from 21 to 79), the model might give more weight to the larger-scale feature just because of its magnitude. Scaling puts all numerical variables on a similar scale, with a mean of 0 and a standard deviation of 1 (Standardization: StandardScaler), making the model treat them equally.
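As an illustrative check that the standardization behaved as described, the scaled training features should now have a mean of approximately 0 and a standard deviation of approximately 1:

# Means should be ~0 and standard deviations ~1 for every scaled column
print(X_train[num_features].mean().round(3))
print(X_train[num_features].std().round(3))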

2.4.1 Logistic Regression

2.4.1.1 Estimation
# Initialize the Logistic Regression model
log_reg = LogisticRegression(random_state=42)

# Fit the model on the training data
log_reg.fit(X_train, y_train)

# Make predictions on the test data
y_pred = log_reg.predict(X_test)
2.4.1.2 Evaluation

Confusion Matrix:

# Confusion matrix
cm    = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, 
                     index=['Actual No Default', 'Actual Default'], 
                     columns=['Predicted No Default', 'Predicted Default'])

cm_df
                   Predicted No Default  Predicted Default
Actual No Default                  4553                134
Actual Default                     1001                312

Accuracy:

Measures the proportion of correct predictions out of all predictions.

\[ Accuracy = \frac{True\ Positives + True\ Negatives}{Total\ Predictions}\] High accuracy means the model correctly predicted most of the cases. However, it can be misleading if the dataset is imbalanced (e.g., if the majority class dominates).

accuracy  = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Accuracy: 0.81

Precision:

Measures the proportion of correctly predicted positive observations out of all predicted positives.

\[ Precision = \frac{True\ Positives}{True\ Positives + False\ Positives}\] Indicates how well the model avoids false positives. It’s useful when the cost of a false positive is high (e.g., wrongly predicting someone will default).

precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.2f}")
Precision: 0.70

Recall:

Measures the proportion of correctly predicted positive observations out of all actual positives.

\[ Recall = \frac{True\ Positives}{True\ Positives + False\ Negatives}\] Indicates how well the model identifies true positives. It’s useful when the cost of missing a positive case is high (e.g., failing to identify a potential defaulter).

recall    = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")
Recall: 0.24
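As a sanity check, all three metrics can be recomputed by hand from the confusion matrix above (TN = 4553, FP = 134, FN = 1001, TP = 312); this is only an illustrative verification of the formulas:

# Manual verification of the metrics from the confusion matrix entries
tn, fp, fn, tp = cm.ravel()   # sklearn lays out the 2x2 matrix as [[TN, FP], [FN, TP]]

acc_manual  = (tp + tn) / (tp + tn + fp + fn)   # (312 + 4553) / 6000 ≈ 0.81
prec_manual = tp / (tp + fp)                    # 312 / 446  ≈ 0.70
rec_manual  = tp / (tp + fn)                    # 312 / 1313 ≈ 0.24

print(f"Accuracy: {acc_manual:.2f}, Precision: {prec_manual:.2f}, Recall: {rec_manual:.2f}")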

2.4.2 Gradient Boosting

Gradient Boosting, implemented in Scikit-learn as GradientBoostingClassifier, is a method proposed by Friedman (2001) that builds a series of simple decision trees sequentially, reducing the errors of the previous trees at each step to create a stronger, more accurate model.

2.4.2.1 Estimation
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the Gradient Boosting Classifier
gbc_model = GradientBoostingClassifier(random_state=42)

# Fit the model on the training data
gbc_model.fit(X_train, y_train)

# Predictions on the test data
y_pred_gbc = gbc_model.predict(X_test)
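Because boosting adds trees one at a time, we can also illustrate the sequential idea by tracking how test accuracy evolves as more trees are included; this is an optional sketch using scikit-learn's staged_predict (accuracy_score was imported above):

# Accuracy on the test set after each additional tree (illustrative)
staged_acc = [accuracy_score(y_test, pred)
              for pred in gbc_model.staged_predict(X_test)]

print(f"Accuracy with 1 tree:    {staged_acc[0]:.3f}")
print(f"Accuracy with {len(staged_acc)} trees: {staged_acc[-1]:.3f}")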
2.4.2.2 Evaluation
# Confusion matrix
cm_gb     = confusion_matrix(y_test, y_pred_gbc)
cm_df_gbc = pd.DataFrame(cm_gb, 
                     index=['Actual No Default', 'Actual Default'], 
                     columns=['Predicted No Default', 'Predicted Default'])

print(cm_df_gbc)
                   Predicted No Default  Predicted Default
Actual No Default                  4451                236
Actual Default                      843                470
accuracy  = accuracy_score(y_test, y_pred_gbc)
precision = precision_score(y_test, y_pred_gbc)
recall    = recall_score(y_test, y_pred_gbc)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
Accuracy: 0.82
Precision: 0.67
Recall: 0.36

2.4.3 Random Forest

Activity

(A). Estimate a Random Forest Model.

  1. Use the function “RandomForestClassifier(n_estimators=100, random_state=42)” to initialize the classifier,
  2. “rf_model.fit” to fit the model, and
  3. “rf_model.predict” to predict

(B). Evaluate the random forest classifier:

  1. Show the confusion matrix
  2. Obtain the measures of “accuracy”, “precision” , and “recall”

Here is some code to get you started:

from sklearn.ensemble import RandomForestClassifier

# (A). Estimation of the Random Forest Model

  # 1. Initialize the Random Forest Classifier here
rf_model =

  # 2. Fit the model on the training data here

  # 3. Make predictions on the test data here
y_pred_rf = 

# (B). Evaluation of the model's performance:
Solution

2.4.4 Models Comparison

# Creating a dictionary to store the evaluation metrics for each model
model_comparison = {
    'Model': ['Logistic Regression', 'GBoost', 'Random Forest'],
    'Accuracy': [
        accuracy_score(y_test, y_pred),    # Logistic Regression
        accuracy_score(y_test, y_pred_gbc),  
        accuracy_score(y_test, y_pred_rf)  
    ],
    'Precision': [
        precision_score(y_test, y_pred),
        precision_score(y_test, y_pred_gbc),
        precision_score(y_test, y_pred_rf)
    ],
    'Recall': [
        recall_score(y_test, y_pred),
        recall_score(y_test, y_pred_gbc),
        recall_score(y_test, y_pred_rf)
    ]
}

comparison_df = pd.DataFrame(model_comparison)
print(comparison_df)
                 Model  Accuracy  Precision    Recall
0  Logistic Regression  0.810833   0.699552  0.237624
1               GBoost  0.820167   0.665722  0.357959
2        Random Forest  0.816500   0.639474  0.370145

2.5 Conclusion

Note

The three models achieve relatively high accuracy (>81%). The logistic model offers the best precision (\(\approx\) 70%); however, the analyst must be aware that in the context of credit default, recall is particularly crucial, as it measures the model’s ability to correctly identify actual defaulters. Missing a default can lead to significant financial losses. Therefore, the Gradient Boosting and Random Forest models would be the better choice, as they offer higher recall than the logistic model.

2.6 Other comparison measures

Analysts often include the Receiver Operating Characteristic (ROC) curve to evaluate the performance of a binary classifier. The ROC curve plots the True Positive Rate \(\Big(\frac{TP}{TP + FN}\Big)\) against the False Positive Rate \(\Big(\frac{FP}{FP + TN}\Big)\) at various threshold settings. See the ROC curve for each of the models developed here in Figure 2.

# ROC and AUC for Logistic Regression
fpr_log, tpr_log, _ = roc_curve(y_test, log_reg.predict_proba(X_test)[:, 1])
auc_log = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])

# ROC and AUC for GBoost
fpr_gbc, tpr_gbc, _ = roc_curve(y_test, gbc_model.predict_proba(X_test)[:, 1])
auc_gbc = roc_auc_score(y_test, gbc_model.predict_proba(X_test)[:, 1])

# ROC and AUC for Random Forest
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_model.predict_proba(X_test)[:, 1])
auc_rf = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])

# Plotting the ROC curves
plt.figure(figsize=(7, 5))
plt.plot(fpr_log, tpr_log, label=f'Logistic Regression (AUC = {auc_log:.2f})', color='blue')
plt.plot(fpr_gbc, tpr_gbc, label=f'GBoost (AUC = {auc_gbc:.2f})', color='orange')
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.2f})', color='green')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')


plt.xlabel('False Positive Rate', fontsize=10)
plt.ylabel('True Positive Rate', fontsize=10)
# plt.title('ROC Curves', fontsize=12)
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()
Figure 2: ROC curves

The AUC is the area under the ROC curve. This single value summarizes the model’s overall ability to discriminate between the two classes.

  • AUC=1: perfect separation between classes
  • AUC=0.5: random model, no discrimination between classes.
  • AUC<0.5: worse than random, the model misclassifies more than it correctly classifies.

The accuracy of a model’s predictions can also be measured with the Root Mean Square Error (RMSE). It quantifies the difference between the actual (observed) values and the predicted values produced by the model. That is:

\[ RMSE = \Big[\frac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2\Big]^{1/2} \]

Where \(n\) is the sample size, \(y_i\) and \(\hat{y}_i\) are the actual and predicted values, respectively.

Note, however, that the RMSE is generally not useful when the \(y\) variable is not continuous, as is the case here.
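For completeness, here is a minimal sketch of the RMSE formula applied to a small, made-up continuous example (illustrative only; as noted above, it is not the right metric for the binary default outcome used in this module):

import numpy as np

def rmse(y_actual, y_predicted):
    """Root Mean Square Error: square root of the average squared error."""
    y_actual, y_predicted = np.asarray(y_actual), np.asarray(y_predicted)
    return np.sqrt(np.mean((y_actual - y_predicted) ** 2))

# Toy numbers, purely illustrative
print(rmse([100, 150, 200], [110, 140, 195]))   # approximately 8.66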

2.7 Prediction

Suppose a new borrower has the following characteristics:

newCred = {
    'AMOUNT': [167484],  # Loan amount at the mean
    'SEX': [1],          # Male
    'EDUCATION': [2],    # University
    'MARRIAGE': [1],     # Single
    'AGE': [30],         # Age
    'PAY_1': [0],        # On-time payment
    'PAY_2': [0],  
    'PAY_3': [0], 
    'PAY_4': [0],  
    'PAY_5': [0], 
    'PAY_6': [0],        
    'BILL_AMT1': [50000],  # Bill statement
    'BILL_AMT2': [45000], 
    'BILL_AMT3': [47000],
    'BILL_AMT4': [35000], 
    'BILL_AMT5': [30000], 
    'BILL_AMT6': [25000],
    'PAY_AMT1': [0],       # Previous payments
    'PAY_AMT2': [0], 
    'PAY_AMT3': [0],
    'PAY_AMT4': [0],  
    'PAY_AMT5': [0], 
    'PAY_AMT6': [0]       
}

newCred_df = pd.DataFrame(newCred)

Prepare the new data:

cats = ['SEX', 'EDUCATION', 'MARRIAGE']

nums = ['AMOUNT', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6','PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']


newCred_df[cats] = newCred_df[cats].astype('category')
newCred_df[nums] = scaler.transform(newCred_df[nums])

Predict the default probability using the model of your choice:

probDefault = log_reg.predict_proba(newCred_df)[:, 1][0]

print(f"Probability of Default for the new customer is: {probDefault:.2f}")
Probability of Default for the new customer is: 0.21
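If a hard yes/no classification is needed rather than a probability, the predicted probability can be compared against a decision threshold. Below is a minimal sketch using the conventional 0.5 cutoff (the appropriate threshold is ultimately a business decision and is not prescribed here):

# Convert the predicted probability into a class label using a 0.5 cutoff
threshold = 0.5
predicted_class = int(probDefault >= threshold)
print("Predicted class:", "Default" if predicted_class == 1 else "No default")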

References

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29 (5). https://doi.org/10.1214/aos/1013203451.
Lichman, M. 2013. “UCI Machine Learning Repository.” Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml.