Credit Risk: Analyzing Financial and Behavioral Factors

Author

Bhavani Priya

Predicting Credit Risk: Analyzing Financial and Behavioral Factors

1) Introduction and Data

The South German Credit dataset originates from a study conducted by Prof. Hofmann and offers detailed insights into credit applicants’ profiles and associated risks from 1973 to 1975 in Germany. The dataset includes demographic and financial information on 1,000 individuals who took credit loans, detailing factors such as credit history, loan purpose, account status, and personal attributes like age and employment duration. Financial variables include credit amount and installment rate, while categorical attributes outline account status, housing type, and other key indicators of financial responsibility.

The dataset is imbalanced, containing 700 “good” and 300 “bad” credit cases, making it particularly useful for examining credit risk prediction in an imbalanced setting. This data supports predictive modeling tasks aimed at assessing individual credit risk based on socio-economic and financial attributes, with a focus on differentiating between good and bad credit profiles. The South German Credit dataset is a refined and corrected alternative to the widely referenced Statlog German credit dataset, providing essential background and data integrity that enhances its reliability for credit risk analysis.

Topic Description

This project aims to develop a predictive model to assess an individual’s credit risk—specifically, whether they are likely to be classified as a “good” or “bad” credit risk. By analyzing a range of demographic, financial, and account-related attributes, such as credit history, loan purpose, checking account status, and employment duration, the project seeks to identify the key factors that most strongly correlate with credit risk.

The target variable, credit_risk, categorizes each individual as a good or bad risk based on these features. The ultimate goal is to create a model capable of accurately predicting an individual’s credit risk.

Data Description

Individual Demographics: Age, sex and marital status (e.g., married, single, widowed) may not be strong predictors of credit risk but could provide additional insights and are worth exploring.
Employment (Job Type) & Employment Duration: Job quality and length of employment could play a significant role in credit risk, as individuals with stable, long-term jobs may have more consistent income and lower default risk.
Housing and Property Ownership: Whether individuals own property, such as a home, might indicate financial stability and reduce perceived credit risk. Property ownership can provide collateral, which might lower the risk for lenders.
Duration at Present Residence: A longer stay at the current residence could suggest stability in the applicant’s lifestyle, potentially lowering credit risk.
Foreign Worker Status: Foreign worker status might influence risk assessments due to potential differences in income stability or residency duration, which could be relevant in evaluating credit risk.
Status of Checking and Savings Accounts: I expect that the status of the checking and savings accounts, along with the balance, could impact credit risk. Individuals who maintain higher balances may be less likely to default.
Number of Credits & Credit History: The number of loans taken by an individual, combined with their past credit history, would likely provide strong insights into credit risk. These factors could reveal patterns in borrowing behavior and repayment reliability.
Purpose of Loan: The reason for taking out the loan, whether for repairs, business, or a vacation, may reflect an individual’s priorities and potential repayment behavior, possibly impacting their credit risk.
Installment Rate & Other Installment Plans: The rate of installments as a portion of income, along with other installment plans the individual may have, could affect their ability to manage the loan. Higher installment rates might increase credit risk due to financial strain.
Other Debtors and Other Installment plans: These features can indicate the financial commitments of an individual. If someone has other debtors or installment plans, it suggests a higher financial burden and may imply a higher credit risk. Lenders will be cautious if an individual has multiple debt obligations.
People Dependent (liable): Refers to the number of people who depend on the individual for financial support. A higher number of dependents can suggest a greater financial responsibility, which might increase the likelihood of credit default due to a potential strain on financial resources.
Telephone: Whether a person has a telephone can be an indicator of stability and reliability, which might suggest a lower credit risk. However, the goal is to evaluate model accuracy rather than making real-world credit decisions solely based on this feature.

Project Expectations and Motivation

The main motivation of this project is to develop a predictive model using the South German Credit dataset to accurately assess an individual’s credit risk. The dataset provides detailed demographic, financial, and account-related attributes that are useful in distinguishing between “good” and “bad” credit profiles. The goal is to identify which model—Logistic Regression, K-Nearest Neighbors (KNN) Classifier, Random Forest, Gradient Boosting, and Support Vector Classifier—can predict the credit_risk outcome most accurately. This project aims to evaluate predictive accuracy, not to implement real-world credit decisions.

Import necessary libraries

2) Exploratory Data Analysis

2.1 Data Cleaning

Renamed column names from German to English for better readability.
Defined levels for all categorical columns to ensure consistency and clarity in analysis.

odict_keys(['credit_data'])

       checking_status  credit_duration  ... foreign_worker credit_risk
0  no checking account               18  ...             no        good
1  no checking account                9  ...             no        good
2           ... < 0 DM               12  ...             no        good
3  no checking account               12  ...            yes        good
4  no checking account               12  ...            yes        good

[5 rows x 21 columns]

Transformation of `personal_status_sex` to Create `sex` and `personal_status` Columns

The transformation process involved the following steps:

1. Understanding the personal_status_sex Variable

The personal_status_sex column contains four distinct categories:
- Female: non-single or male: single
- Male: married/widowed
- Female: single
- Male: divorced/separated

2. Creating Separate Rows for Dual Categories

For entries with both “female” and “male” values (e.g., Female: non-single or male: single), new rows were created to preserve the information for both sexes without making assumptions.
This step ensures the integrity of the data by retaining both sex and personal_status in separate rows.

3. Splitting into sex and personal_status Columns

The personal_status_sex column was split into two new columns:
- sex: Indicates the gender (female or male).
- personal_status: Indicates the marital or relationship status (non-single, single, married/widowed, divorced/separated).

4. Impact on Data Size

After the transformation, the dataset grew from 1,000 rows to 1,310 rows, reflecting the addition of new rows for dual-category entries.

5. Distribution in the New Columns

sex column:
- Male: 908 entries
- Female: 402 entries
personal_status column:
- married/widowed: 548 entries
- single: 402 entries
- non-single: 310 entries
- divorced/separated: 50 entries

Dropped personal_status_sex variable

Personal_status_sex: The categories of this variable are not clearly distinguished, so using it directly in the model would be ambiguous and of limited use. I created clearer sex and personal_status variables from it and used those for modeling instead.

#check categories
#credit_data.personal_status_sex.value_counts()

#convert credit data into pandas dataframe
credit_data = pd.DataFrame(credit_data)
      
# Step 1: Split rows based on "or"
credit_data = credit_data.assign(
    personal_status_sex=credit_data['personal_status_sex'].str.split(' or ')
).explode('personal_status_sex')

# Step 2: Split each row into 'sex' and 'status' based on ":"
credit_data[['sex', 'personal_status']] = credit_data['personal_status_sex'].str.split(' : ', expand=True)

# Factorize the 'sex' column
credit_data['sex'] = pd.Categorical(
    credit_data['sex'], 
    categories=['female', 'male'],  # Adjust based on actual values in your data
    ordered=False
)

# Factorize the 'personal_status' column
credit_data['personal_status'] = pd.Categorical(
    credit_data['personal_status'], 
    categories=['non-single', 'single', 'married/widowed', 'divorced/separated'],  # Adjust based on actual values
    ordered=False
)

# Drop the original column
credit_data = credit_data.drop(columns=['personal_status_sex'])

#check new column sex and personal status value counts
#print(credit_data.sex.value_counts())
#print(credit_data.personal_status.value_counts())
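As a minimal sketch of what the split-and-explode step does, the toy frame below (made-up rows, not the real data) runs the same two splits. It also shows that explode keeps the original index label on both rows derived from a dual-category entry, so the sketch resets the index afterwards; the reset is an addition for illustration, not part of the code above.

```python
import pandas as pd

# Toy rows (hypothetical values) using the same label format as the dataset
toy = pd.DataFrame({
    "age": [25, 40, 33],
    "personal_status_sex": [
        "female : non-single or male : single",
        "male : married/widowed",
        "female : single",
    ],
})

# Step 1: one row per alternative separated by " or "
exploded = toy.assign(
    personal_status_sex=toy["personal_status_sex"].str.split(" or ")
).explode("personal_status_sex")

# Both rows from the first applicant keep index label 0
raw_index = exploded.index.tolist()

# Give every row a unique label before further processing
exploded = exploded.reset_index(drop=True)

# Step 2: split "sex : status" into two columns
exploded[["sex", "personal_status"]] = (
    exploded["personal_status_sex"].str.split(" : ", expand=True)
)

print(raw_index)
print(exploded[["age", "sex", "personal_status"]])
```

The first applicant appears twice, once per sex, with identical values in every other column.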

Create Test column

Set a random seed to ensure reproducibility of the train-test split.
Randomly assigned 80% of the data to the training set and 20% to the test set by creating a new column, test, with values 0 (training) and 1 (test).

test
0    1040
1     270
Name: count, dtype: int64
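The code that creates the test column is not shown. The counts above (1,040 and 270, roughly 79.4% and 20.6%) are consistent with a per-row random draw rather than an exact 80/20 cut, so the sketch below assumes that approach. The seed, the generator, and the toy frame are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # hypothetical seed for reproducibility

# Toy frame standing in for credit_data
toy = pd.DataFrame({"credit_amount": rng.integers(250, 18425, size=1000)})

# Each row is assigned 0 (training) with probability 0.8, else 1 (test)
toy["test"] = rng.choice([0, 1], size=len(toy), p=[0.8, 0.2])

counts = toy["test"].value_counts()
print(counts)
```

Because each row is drawn independently, the realized split only approximates 80/20.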

Splitting Train and test sets

Extracted training and test datasets from credit_data by filtering based on the test column and dropping the test column afterward.
Separated features (X_train, X_test) and target variable (y_train, y_test) for both training and test datasets.

Shapes of the datasets:

Training set: 1,040 rows, 22 columns (features + target).
Test set: 270 rows, 22 columns (features + target).

Shape of Train set:  (1040, 22)

Shape of Test set:  (270, 22)
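The splitting step described above can be sketched as follows. The variable names follow the ones used elsewhere in this report (train_data, X_train, y_train, and so on), while the toy rows are made up for illustration.

```python
import pandas as pd

# Toy frame with a target column and a 0/1 test indicator (hypothetical values)
toy = pd.DataFrame({
    "age": [25, 40, 33, 51, 29],
    "credit_amount": [1200, 5400, 800, 2300, 9100],
    "credit_risk": ["good", "bad", "good", "good", "bad"],
    "test": [0, 0, 1, 0, 1],
})

# Filter on the indicator, then drop it so it cannot leak into the features
train_data = toy[toy["test"] == 0].drop(columns="test")
test_data = toy[toy["test"] == 1].drop(columns="test")

# Separate features from the target
X_train = train_data.drop(columns="credit_risk")
y_train = train_data["credit_risk"]
X_test = test_data.drop(columns="credit_risk")
y_test = test_data["credit_risk"]

print(train_data.shape, X_train.shape, X_test.shape)
```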

Null Values: Since the dataset has no missing values, imputation is not required.

checking_status            0
credit_duration            0
credit_history             0
purpose                    0
credit_amount              0
savings_status             0
employment_duration        0
installment_rate           0
other_debtors              0
present_residence          0
property                   0
age                        0
other_installment_plans    0
housing                    0
number_credits             0
job                        0
people_liable              0
telephone                  0
foreign_worker             0
credit_risk                0
sex                        0
personal_status            0
test                       0
dtype: int64

2.2 Numerical and Visual Summary

Outcome Variable: Credit Risk Statistics

count     1040
unique       2
top       good
freq       717
Name: credit_risk, dtype: object

Count: There are 1040 observations in the credit_risk variable.
Unique Values: There are 2 unique values in the variable: ‘good’ and ‘bad’.
Frequency: ‘good’ occurs 717 times, while ‘bad’ occurs 323 times.

The credit_risk variable is imbalanced, with 717 “good” and 323 “bad” observations. This imbalance can bias models toward predicting the majority class (“good”), reducing accuracy for the minority class (“bad”). Addressing this requires techniques like class weighting, resampling, or using algorithms suited for imbalanced data.
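To make the imbalance concrete, the short calculation below derives the class shares and the weights that scikit-learn's class_weight="balanced" heuristic would assign, n_samples / (n_classes * class_count), using the training counts reported above.

```python
# Training-set class counts reported above
counts = {"good": 717, "bad": 323}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the training data
shares = {label: n / n_samples for label, n in counts.items()}

# "balanced" weights: rarer classes receive proportionally larger weights
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.3f}, weight={weights[label]:.3f}")
```

Under this weighting, each misclassified "bad" case counts a little more than twice as much as a misclassified "good" case.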

#summary of train data
print(train_data.describe())

       credit_duration  credit_amount          age
count      1040.000000    1040.000000  1040.000000
mean         20.508654    3155.041346    34.694231
std          11.787904    2768.686533    11.568383
min           4.000000     250.000000    19.000000
25%          12.000000    1346.750000    26.000000
50%          18.000000    2191.500000    31.500000
75%          24.000000    3845.500000    41.000000
max          72.000000   18424.000000    75.000000

Descriptive Statistics of Numerical variables

Credit Duration: The average credit duration is 20.5 months, ranging from 4 to 72 months. The standard deviation of 11.79 months is more than half the mean, indicating substantial variability in credit duration among individuals.
Credit Amount: The average credit amount is 3,155 DM, with a minimum of 250 DM and a maximum of 18,424 DM. The standard deviation of 2,768.69 DM is close to the mean, and the median (2,191.50 DM) lies well below the mean, indicating a wide, right-skewed spread of credit amounts.
Age: Ages range from 19 to 75 years. The average age is 34.7 years (median 31.5), with a standard deviation of 11.6 years, indicating considerable age variability among individuals in the dataset.

Pair Plot of Numerical Variables

Age Distribution: The distribution of age shows that younger individuals are more likely to have a “bad” credit risk, while older individuals tend to have a “good” credit risk.
Credit Amount vs. Credit Risk: Higher credit amounts are associated with both good and bad credit risks, but there is a noticeable concentration of bad credit risks at higher amounts.
Credit Duration vs. Credit Risk: Longer credit durations are slightly more associated with bad credit risks, indicating potential challenges in managing long-term financial commitments.

Positive Skew in Distribution plots

The distribution plots for age, credit amount, and credit duration all show a positive skew.

Age Distribution: The age distribution shows a positive skew, with more individuals in the younger age group. Younger individuals are more associated with “bad” credit risk.
Credit Amount Distribution: There is a positive skew in credit amounts, with most values concentrated at lower amounts. Higher credit amounts are linked to both “good” and “bad” credit risks.
Credit Duration Distribution: The duration distribution also exhibits positive skewness, with shorter durations being more common. Longer durations tend to be associated with “bad” credit risk.
Positive Skewness Impact: The positive skewness in age, credit amount, and credit duration indicates that most data points are concentrated at lower values, which might lead models to underpredict for higher values. This skewness can affect model accuracy, especially if the model does not handle non-linear relationships well.
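Skewness can also be quantified rather than read off the plots. The sketch below uses synthetic right-skewed values (not the real credit amounts) to show how pandas reports sample skewness and how a log transform changes it. The transform is shown only for illustration; it is not a step this project applies.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for credit_amount (hypothetical values)
amount = pd.Series(rng.lognormal(mean=7.7, sigma=0.8, size=1000),
                   name="credit_amount")

raw_skew = amount.skew()            # clearly positive for right-skewed data
log_skew = np.log1p(amount).skew()  # close to zero after the log transform

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```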

Credit Amount and Risk by Housing category

The violin plot displays the distribution of credit amounts across different housing categories (“for free,” “own,” and “rent”) and separates them by credit risk (“bad” in blue and “good” in orange).
The plot shows that for all housing categories, “good” credit risk generally corresponds to a lower median credit amount than “bad” credit risk, indicating that those with higher credit risk tend to have larger credit amounts.

Credit Risk Distribution across Gender by Credit Amount

Gender Count Plot: This plot shows the distribution of credit risk (bad and good) by gender. Males have a higher count in both “good” and “bad” credit risk categories compared to females.
Credit Amount by Gender Plot: This box plot displays the distribution of credit amounts by gender, separated by credit risk. Outliers are present, especially in “bad” credit risks for both genders, indicating some individuals have significantly higher credit amounts.

Credit Risk Distribution Across Job by Credit Amount

Distribution by Job Category: This bar chart shows the distribution of credit risk (“bad” in blue and “good” in orange) across different job categories. “Skilled employee/official” has the highest count of both “good” and “bad” credit risks, while “unemployed/unskilled - non-resident” is the least represented.
Distribution by Job by Credit Amount: The box plot illustrates credit amounts across these job categories, separated by credit risk. There are several outliers, particularly in the “skilled employee/official” and “manager/self-empl./highly qualif. employee” groups, indicating individuals with unusually high credit amounts.

Credit Risk Distribution Across Savings Account by Credit Amount and Age

Savings Accounts Count by Credit Risk: This bar plot shows the count of individuals categorized by their savings account status (<100 DM, >=1000 DM, etc.) and their associated credit risk (good or bad). Most people with good credit risk do not have a savings account (“No account”) or have very low savings (<100 DM). People with bad credit risk are more evenly distributed across savings categories, with a notable presence in the <100 DM category. Savings of >=1000 DM are associated with a lower proportion of bad credit risks.
Credit Amount by Savings Account: This box plot represents the distribution of credit amounts for different savings account categories, split by credit risk (good or bad). For both good and bad credit risks, individuals with higher savings (e.g., >=1000 DM) tend to request or possess higher credit amounts. “No account” and lower savings categories (<100 DM) are generally associated with lower credit amounts. There are more outliers (higher requested credit amounts) among people with good credit risk compared to bad credit risk.
Age by Savings Account: This box plot shows the distribution of ages for each savings account status category, split by credit risk. People with higher savings (>=1000 DM) tend to be older on average compared to those with lower savings (<100 DM) or no account. For both good and bad credit risks, individuals without a savings account (“No account”) tend to have a broader age distribution. Younger individuals are more likely to fall into the lower savings categories (<100 DM and No account).

Credit Risk Distribution by Purpose and Credit amount

Purposes Count: This bar chart shows the count of credit applications by purpose. Each bar represents a purpose such as “business,” “car (new),” “car (used),” etc. “Furniture/equipment” has the highest number of applications, with more good credit risks than bad. “Car (used)” and “others” also show a significant number of applications.
Credit Amount Distribution by Purposes: This box plot displays credit amount distributions for various purposes. The x-axis lists the different purposes, while the y-axis denotes the credit amount. “Business” and “car (new)” have higher credit amounts compared to other purposes. The boxes represent the interquartile range (IQR), and the whiskers extend to show the rest of the distribution. Outliers are visible as individual points beyond the whiskers.

Credit Risk by Foreign Worker Status

Most individuals in the dataset are not foreign workers, with both “good” and “bad” credit risks predominantly found in this group.
Among foreign workers, the majority are classified as “good” credit risks, with very few classified as “bad.” This suggests that foreign workers in the dataset are more likely to have a lower credit risk.
Non-foreign workers also show a strong skew toward “good” credit risk, though they have a larger number of “bad” credit risks compared to foreign workers.
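Because the two groups differ greatly in size, raw counts can hide differences in the share of bad risks. A row-normalized cross-tabulation is one way to compare groups on equal footing; the sketch below uses made-up rows, so the printed rates say nothing about the real data.

```python
import pandas as pd

# Toy rows (hypothetical values) with the same column names as the dataset
toy = pd.DataFrame({
    "foreign_worker": ["no", "no", "no", "no", "yes", "yes", "no", "no"],
    "credit_risk": ["good", "bad", "good", "good", "good", "bad", "bad", "good"],
})

# Raw counts per group and risk class
counts = pd.crosstab(toy["foreign_worker"], toy["credit_risk"])

# Row-normalized: each row sums to 1, giving the risk share within each group
rates = pd.crosstab(toy["foreign_worker"], toy["credit_risk"], normalize="index")

print(counts)
print(rates.round(3))
```

The same pattern applies to any categorical feature discussed in this section, such as telephone or housing.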

Credit Risk and Telephone

The bar chart illustrates the distribution of credit risk (either “good” or “bad”) categorized by the telephone ownership status: “no” (no telephone) and “yes (under customer name).”
In both telephone ownership groups, individuals with good credit risk significantly outnumber those with bad credit risk. However, the count of good credit risk is noticeably higher for those without a telephone.
The number of individuals with bad credit risk is relatively low across both groups, with slightly more cases among those without a telephone compared to those with one listed under their name.

3) Evaluation Metric

For this project, I have chosen the F1 score as the evaluation metric because:

Binary Classification Model: The project involves a classification model with a binary outcome variable (good and bad credit risk). Hence, using a classification metric is necessary.
Imbalance Between Classes: Because the distribution of good and bad credit risks is imbalanced, relying on accuracy alone could be misleading. The F1 score balances precision (how many predicted positives are correct) and recall (how many actual positives are captured), making it a reliable metric in such cases.
Performance Focus: The F1 score provides a single-number summary of the model’s ability to minimize false positives and false negatives, which is critical in credit risk classification to assess both good and bad risks effectively.
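As a worked example of the metric, the function below computes F1 from confusion counts. The modeling code in this report treats “good” as the positive class, so as a reference point (computed here from the training class counts, not from any fitted model) the sketch evaluates a trivial classifier that labels every case “good”.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Always predicting "good" on 717 good and 323 bad cases:
# every good case is a true positive, every bad case a false positive.
baseline_f1 = f1_from_counts(tp=717, fp=323, fn=0)
print(f"F1 of an all-'good' classifier: {baseline_f1:.4f}")
```

A model's F1 for the “good” class is easiest to interpret relative to a trivial baseline of this kind.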

4) Fitting Different Models

Preprocessing

Preprocessing Setup:
- The code sets up a column transformer for preprocessing the data.
- It uses different techniques to handle categorical and numerical features.
Handling Ordinal Features:
- OrdinalEncoder is used to encode ordinal features (checking_status, savings_status, etc.) with a specified value for unknown categories.
- Unknown categories are assigned a value of -1.
Handling Nominal Features:
- OneHotEncoder is used to perform one-hot encoding on nominal features (credit_history, purpose, etc.).
- The drop='first' parameter prevents the creation of one redundant column to avoid multicollinearity issues.
Scaling Numerical Features:
- StandardScaler is applied to scale numerical features (age, credit_duration, credit_amount).
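- Standardization helps gradient-based solvers such as saga converge and keeps distance-based models such as KNN from being dominated by large-scale features like credit_amount.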
Remainder Option:
- The remainder='passthrough' option keeps any remaining columns unchanged, which include those not explicitly specified for encoding or scaling.

nominal_features = ["credit_history","purpose",
                    "other_debtors", "property","other_installment_plans",
                    "housing","job","personal_status",
                    "telephone","foreign_worker","sex"]
                    
ordinal_features = ["checking_status","savings_status","people_liable",
                    "employment_duration","installment_rate",
                    "present_residence","number_credits"]

numeric_features = ["age","credit_duration","credit_amount"]

# one hot encoding
preprocessor = make_column_transformer(
  (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),ordinal_features),
  (OneHotEncoder(drop = 'first',handle_unknown='ignore'), nominal_features),
  (StandardScaler(), numeric_features),
  remainder = 'passthrough',
  verbose_feature_names_out = False
)

# Initialize the label encoder
label_encoder = LabelEncoder()

# Fit and transform the target variable
y_train_encoded = label_encoder.fit_transform(y_train)

#transform y_test 
y_test_encoded = label_encoder.transform(y_test)

Cross-Validation Strategy:
- Defines a StratifiedKFold cross-validation strategy with 10 splits, which is used for every model in this project to ensure that the distribution of the credit_risk classes is maintained in each fold. This helps in robust model evaluation.
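A small check of what stratification guarantees, using a synthetic 70/30 label vector rather than the real target: every one of the 10 validation folds receives the same number of minority-class cases.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with a 70/30 split (hypothetical, for illustration)
y = np.array(["good"] * 70 + ["bad"] * 30)
X = np.zeros((len(y), 1))  # features are irrelevant to how folds are formed

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Count minority-class cases in each validation fold
bad_per_fold = [int((y[test_idx] == "bad").sum())
                for _, test_idx in cv.split(X, y)]
print(bad_per_fold)
```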

Logistic Regression Model with ElasticNet penalty coefficients

Logistic regression with an ElasticNet penalty was chosen to capture both lasso (L1) and ridge (L2) regularization effects, balancing model interpretability and predictive performance on an imbalanced dataset.

Parameter Grid Setup:
- Sets up a parameter grid for LogisticRegression with values for regularization (C), elastic net mixing parameter (l1_ratio), and maximum iterations (max_iter). This grid is used to tune the logistic regression model.
Pipeline Creation:
- Creates a preprocessing pipeline that includes data preprocessing steps and a logistic regression model. The logistic regression is configured with an elastic net penalty, saga solver, and balanced class weights to handle the imbalanced nature of the credit_risk target.
Grid Search with Cross-Validation:
- Performs a GridSearchCV on the pipeline to find the best logistic regression model based on the parameter grid using f1_score as the scoring metric.
Feature Importance:
- Extracts and visualizes feature names and coefficients from the best model to understand which features are most influential in determining credit_risk.
Prediction and Evaluation:
- Uses the best model to predict credit_risk on the test set and calculates the f1_score to evaluate its performance.

# Define cross-validation strategy
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Define the parameter grid with correct parameter names
param_grid = {
    'log_reg__C': np.logspace(-4, 4, 10),
    'log_reg__l1_ratio': np.linspace(0.1, 1.0, 10),
    'log_reg__max_iter': [5000, 10000, 15000]
}

# Create a pipeline with preprocessing and logistic regression
pipeline_log = Pipeline([
    ('preprocessor', preprocessor),  # Ensure preprocessor is defined
    ('log_reg', LogisticRegression(penalty='elasticnet', solver='saga',
                class_weight = "balanced"))
])

# Set up GridSearchCV
grid_search_log = (GridSearchCV(
  pipeline_log, 
  param_grid, 
  cv=cv, 
  scoring=make_scorer(f1_score,pos_label="good"),
  error_score="raise").fit(X_train, y_train)
  )

#save grid_search_log as pickle
with open("grid_search_log.pkl", "wb") as f:
    dump(grid_search_log, f, protocol=5)

Best Parameters for Logistic Regression Model: {'log_reg__C': 0.000774263682681127, 'log_reg__l1_ratio': 0.2, 'log_reg__max_iter': 10000}

Predicting using Best Logistic Regression Model

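# Retrieve the best estimator (refit on the full training set by GridSearchCV)
best_log_model = grid_search_log.best_estimator_

#predict 
y_pred_log = best_log_model.predict(X_test)

#f1_score
f1_score_log = f1_score(y_test,y_pred_log, pos_label="good")
print("F1 score for Logistic Regression:",f1_score_log)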

F1 score for Logistic Regression: 0.8105726872246696

K-Nearest Neighbors Classification

K-Nearest Neighbors was chosen for its simplicity and flexibility: it makes no assumption about the shape of the decision boundary and predicts credit risk from the labels of the most similar training instances.

Label Encoding:
- Initialized a LabelEncoder to encode the target variables y_train and y_test as integers. KNeighborsClassifier also accepts string labels, so this encoding is a convenience rather than a requirement; with it, the positive class is referenced as 1 in the scorer.
Parameter Grid:
- Defined a parameter grid for the K-Neighbors Classifier (knn) with options for n_neighbors, weights, and p (manhattan or euclidean distance metric).
- This grid helps identify the best configuration for the K-Neighbors Classifier.
Pipeline Creation:
- Created a pipeline combining preprocessing steps and the K-Neighbors Classifier.
- The preprocessor defined earlier is included to preprocess the data before fitting the model.
Grid Search:
- Set up a GridSearchCV to find the best parameters using the defined parameter grid and cross-validation strategy.
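- It scores models based on the F1-score for the positive class (1, which corresponds to “good” because LabelEncoder assigns integer codes in sorted label order) and refits the best parameter combination on the full training data.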
Prediction:
- Retrieves the best model from the grid search.
- Makes predictions on the test set.
- Calculates the F1-score of the best K-Neighbors Classifier on the test data to evaluate its performance.
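A quick check of how LabelEncoder maps the two risk labels, assuming the target holds the strings “good” and “bad”: codes are assigned in sorted label order, so pos_label=1 refers to “good”.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(["good", "bad", "good", "bad", "good"])

# classes_ is sorted, and the position of each label is its integer code
mapping = {label: int(code)
           for code, label in enumerate(label_encoder.classes_)}
print(mapping)
print(encoded.tolist())
```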

# Define the parameter grid with correct parameter names
param_grid_knn = {
    'knn__n_neighbors': np.arange(1,51),
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2]
}

# Create a pipeline with preprocessing and KNeighborsClassifier
pipeline_knn = Pipeline([
    ('preprocessor', preprocessor),  # Ensure preprocessor is defined
    ('knn', KNeighborsClassifier())
])

# Create a GridSearchCV object
grid_search_knn = (GridSearchCV(
  pipeline_knn,
  param_grid_knn,
  cv=cv,
  scoring=make_scorer(f1_score,pos_label=1),
  error_score='raise').fit(X_train,y_train_encoded)
  )

#save grid_search_knn as pickle
with open("grid_search_knn.pkl", "wb") as f:
    dump(grid_search_knn, f, protocol=5)

Predicting using Best K-Neighbors Classification Model

with open("grid_search_knn.pkl", "rb") as f:
    grid_search_knn = load(f)

# Retrieve the best parameters from GridSearchCV
best_knn_params = grid_search_knn.best_params_
print("Best Parameters:", best_knn_params)

Best Parameters: {'knn__n_neighbors': 3, 'knn__p': 1, 'knn__weights': 'distance'}
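To illustrate what these settings mean, the sketch below hand-computes a prediction with three neighbors, Manhattan distance (p=1), and inverse-distance weighting. The points and labels are made up, and the zero-distance special case is left out for brevity; it illustrates the idea rather than reproducing scikit-learn's implementation.

```python
from collections import defaultdict

def manhattan(a, b):
    """Minkowski distance with p=1: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def weighted_vote(query, neighbors):
    """Each neighbor votes for its label with weight 1 / distance."""
    scores = defaultdict(float)
    for point, label in neighbors:
        scores[label] += 1.0 / manhattan(query, point)
    return max(scores, key=scores.get), dict(scores)

query = (0.0, 0.0)
neighbors = [
    ((0.5, 0.0), "bad"),   # distance 0.5 -> weight 2.0
    ((1.0, 1.0), "good"),  # distance 2.0 -> weight 0.5
    ((2.0, 0.0), "good"),  # distance 2.0 -> weight 0.5
]

label, scores = weighted_vote(query, neighbors)
print(label, scores)
```

With uniform weights the two “good” neighbors would win the vote; with distance weighting the single, much closer “bad” neighbor decides the prediction.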

# Retrieve the best estimator (already refit on the full training set)
best_knn_model = grid_search_knn.best_estimator_

#predict 
y_pred_knn = best_knn_model.predict(X_test)

#f1_score
f1_score_knn = f1_score(y_test_encoded,y_pred_knn, pos_label=1)
print("F1 score for KNN Classifier:",f1_score_knn)

F1 score for KNN Classifier: 0.8527918781725888

Random Forest Classifier

Random Forest was chosen for its ability to handle high-dimensional data and its robustness against overfitting, which is beneficial when dealing with complex relationships in credit risk prediction.

Parameter Grid Definition:
- Defines a parameter grid for the RandomForestClassifier (rf) including options for n_estimators (number of trees), max_features (number of features to consider when splitting a node), max_depth (maximum depth of the tree), min_samples_split (minimum number of samples required to split an internal node), min_samples_leaf (minimum number of samples required to be at a leaf node), and bootstrap (whether to use bootstrap sampling or not).
- This grid ensures that the RandomForest model can explore a variety of configurations to optimize performance.
Pipeline Creation:
- Combines the preprocessing steps and the RandomForestClassifier into a single pipeline.
- The preprocessor from the previous steps is included to preprocess the data before fitting the model.
- The RandomForestClassifier is configured with balanced subsample class weighting to address the class imbalance in the data.
RandomizedSearchCV:
- Sets up a RandomizedSearchCV to find the best parameters using the defined parameter grid and cross-validation strategy.
- RandomizedSearchCV explores a random subset of the parameter space, which is computationally cheaper than a grid search, especially with a large parameter space.
- It scores models based on the F1-score for the ‘good’ class and uses 50 iterations for exploration.
Prediction:
- Retrieves the best model from the random search.
- Evaluates the performance of the best RandomForest model on the test data using the F1-score metric. This helps in understanding how well the model generalizes to unseen data.

# Define the parameter grid with correct parameter names
param_grid_rf = {
    'rf__n_estimators': [100, 300, 500, 1000, 1500],
    'rf__max_features': np.arange(1, X_train.shape[1] + 1),
    'rf__max_depth': [None] + list(np.arange(5, 30, 5)),
    'rf__min_samples_split': np.arange(2, 10, 2),
    'rf__min_samples_leaf': np.arange(1, 5),
    'rf__bootstrap': [True, False]
}


# Create a pipeline with preprocessing and RandomForestClassifier
pipeline_rf = Pipeline([
    ('preprocessor', preprocessor),  # Ensure preprocessor is defined
    ('rf', RandomForestClassifier(
      random_state = 42, 
      class_weight = 'balanced_subsample'))
])

# Create a RandomizedSearchCV object
random_search = (RandomizedSearchCV(
    estimator=pipeline_rf,
    param_distributions=param_grid_rf,
    n_iter=50,
    scoring=make_scorer(f1_score,pos_label='good'),
    cv=cv,
    verbose=1,
    random_state=42
).fit(X_train,y_train))

#save random_search as pickle
with open("random_search.pkl", "wb") as f:
    dump(random_search, f, protocol=5)
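The saving from random search can be put in numbers. The arithmetic below counts the combinations in the parameter grid above and compares the number of model fits for 50 random draws against an exhaustive grid, both under 10-fold cross-validation. It assumes X_train has 21 feature columns, which is consistent with the 22-column shape reported earlier (features plus target).

```python
from math import prod

# Number of candidate values per hyperparameter in param_grid_rf
grid_sizes = {
    "n_estimators": 5,       # 100, 300, 500, 1000, 1500
    "max_features": 21,      # 1..21, assuming X_train has 21 columns
    "max_depth": 6,          # None plus 5, 10, 15, 20, 25
    "min_samples_split": 4,  # 2, 4, 6, 8
    "min_samples_leaf": 4,   # 1, 2, 3, 4
    "bootstrap": 2,          # True, False
}

n_combinations = prod(grid_sizes.values())
n_iter, n_folds = 50, 10

sampled_share = n_iter / n_combinations
fits_random = n_iter * n_folds
fits_full_grid = n_combinations * n_folds

print(f"{n_combinations} combinations, {sampled_share:.2%} sampled")
print(f"fits: {fits_random} (random) vs {fits_full_grid} (full grid)")
```

The search therefore tries about a quarter of one percent of the grid, so the reported best parameters are the best among the sampled candidates rather than the best in the full grid.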

Predicting using Best Random Forest Classifier Model

with open("random_search.pkl", "rb") as f:
    random_search = load(f)

best_rf_params = random_search.best_params_
print("Best Parameters:", best_rf_params)

Best Parameters: {'rf__n_estimators': 500, 'rf__min_samples_split': 2, 'rf__min_samples_leaf': 1, 'rf__max_features': 21, 'rf__max_depth': 20, 'rf__bootstrap': True}

# Evaluate the best model on the test set
best_model_rf = random_search.best_estimator_  # Best model
y_pred_rf = best_model_rf.predict(X_test)  # Predictions on test data
f1_random_forest = f1_score(y_test, y_pred_rf, pos_label="good")  # Calculate F1 score
print("F1 Score on Test Data for Random Forest model:", f1_random_forest)

F1 Score on Test Data for Random Forest model: 0.8956743002544529

Gradient Boosting Classifier

Gradient Boosting was chosen because it builds trees sequentially, each one correcting the errors of its predecessors, which lets it capture the complex, non-linear relationships inherent in credit risk prediction.
No one-hot encoding is required for this model, because HistGradientBoostingClassifier handles categorical features natively.

Pipeline Definition (pipe_gbm):
- Defined a pipeline specifically for hyperparameter tuning of the HistGradientBoostingClassifier. This model is chosen for its ability to handle categorical data directly, which is important given the dataset’s features.
- The HistGradientBoostingClassifier is configured with random_state for reproducibility and class_weight = 'balanced' to manage class imbalance in the dataset. The categorical_features parameter specifies which features are categorical.
Grid Configuration (param_grid_gbm_hlv):
- Set up a hyperparameter grid for tuning the HistGradientBoostingClassifier. The grid includes max_iter, learning_rate, and max_depth, each with five candidate values to explore during the search. These parameters control the number of boosting iterations, the step size of each boosting update, and the tree depth, which determines model complexity.
Randomized Search (grid_gbm_hlv):
- Used RandomizedSearchCV to search the hyperparameter space efficiently. It evaluates 50 randomly sampled combinations from the grid instead of fitting every combination, which keeps the tuning cost manageable.
- The search uses cross-validation (cv) and an F1 score (make_scorer(f1_score, pos_label="good")) to evaluate model performance on the training data.
Evaluation:
- The best parameters found by the randomized search are printed.
- The best model (gbm_best_hlv) is evaluated on the test data using the F1 score to measure its performance in credit risk prediction.
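
As a rough illustration of what class_weight = 'balanced' does, the sketch below applies the weighting rule n_samples / (n_classes * class_count) to the 700/300 class split of the full dataset. The helper balanced_class_weights is a hypothetical name. The weights used during fitting depend on the class counts in the training split, so the numbers here are indicative only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


# Class counts of the full dataset (700 good, 300 bad); the training
# split used for fitting will have somewhat different counts.
labels = ["good"] * 700 + ["bad"] * 300
weights = balanced_class_weights(labels)
print(weights)
```

Under this rule, each "bad" example counts a little more than twice as much as a "good" example, so both classes contribute equally to the loss in total.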

pipe_gbm = Pipeline(
  [
    ('gbm', 
    HistGradientBoostingClassifier(
      random_state = 42,
      class_weight = 'balanced',  # Handle class imbalance
      categorical_features = ["credit_history","purpose",
                    "other_debtors", "property","other_installment_plans",
                    "housing","job","telephone","foreign_worker",
                    "sex","personal_status","checking_status","savings_status",
                    "people_liable","employment_duration","installment_rate",
                    "present_residence","number_credits"])
      )
  ]
)

# Combine the hyperparameters into a grid
param_grid_gbm_hlv = dict(
  gbm__max_iter = [1000, 1500, 2000, 2500, 3000],  # Number of boosting iterations
  gbm__learning_rate = [0.001, 0.0015, 0.01, 0.015, 0.1],  # Learning rate (step size)
  gbm__max_depth = [1, 2, 3, 4, 5]  # Maximum tree depth
)
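
The grid above contains 5 x 5 x 5 = 125 combinations, and the search that follows evaluates n_iter = 50 of them. The standard-library sketch below enumerates the grid and draws 50 combinations without replacement, which is how RandomizedSearchCV samples when every parameter is given as a list. It only illustrates the coverage; it does not reproduce the exact combinations scikit-learn would draw for random_state = 42.

```python
from itertools import product
import random

# Same values as param_grid_gbm_hlv.
grid = {
    "gbm__max_iter": [1000, 1500, 2000, 2500, 3000],
    "gbm__learning_rate": [0.001, 0.0015, 0.01, 0.015, 0.1],
    "gbm__max_depth": [1, 2, 3, 4, 5],
}

all_combos = list(product(*grid.values()))
n_iter = 50

# Draw without replacement; the seed is for this illustration only.
rng = random.Random(42)
sampled = rng.sample(all_combos, n_iter)

coverage = n_iter / len(all_combos)
print(f"{len(all_combos)} combinations, {n_iter} sampled ({coverage:.0%})")
```

With 40% of the grid evaluated, the reported best parameters are the best among the sampled combinations, not necessarily the best in the whole grid.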

# Perform a Randomized Search CV for hyperparameter tuning
grid_gbm_hlv = (RandomizedSearchCV(
    estimator=pipe_gbm,  # Pipeline to tune
    param_distributions=param_grid_gbm_hlv,  # Hyperparameter values to sample from
    n_iter=50,  # Number of random samples to evaluate
    cv=cv,  # Cross-validation strategy
    scoring=make_scorer(f1_score, pos_label="good"),  # F1 score as the evaluation metric
    verbose=1,  # Optional: Print progress during search
    random_state=42  # For reproducibility
).fit(X_train,y_train))

with open("grid_gbm_hlv.pkl", "wb") as f:
    dump(grid_gbm_hlv, f, protocol=5)

with open("grid_gbm_hlv.pkl", "rb") as f:
    grid_gbm_hlv = load(f)

# Print the best parameters and cross-validation score
print("Best Parameters for Gradient Boosting:", grid_gbm_hlv.best_params_)

Best Parameters for Gradient Boosting: {'gbm__max_iter': 3000, 'gbm__max_depth': 4, 'gbm__learning_rate': 0.015}

# Evaluate the best model on the test set
gbm_best_hlv = grid_gbm_hlv.best_estimator_

# Predictions on test data
y_pred_gbm = gbm_best_hlv.predict(X_test)

#calculate f1 score
f1_gbm = f1_score(y_test, y_pred_gbm, pos_label = "good")
print("F1 Score on Test Data for Hist Gradient Boosting model:", f1_gbm)

F1 Score on Test Data for Hist Gradient Boosting model: 0.8746666666666667

Support Vector Classification

Support Vector Classification was chosen for its strength in finding a maximum-margin decision boundary between classes, which makes it effective for classifying credit risk when dealing with high-dimensional data.

Pipeline Definition (pipeline_svm):
- Created a pipeline to preprocess data and train a Support Vector Classifier (SVC). The preprocessor ensures that data is cleaned and standardized before feeding it to the model.
- The SVC is configured with a balanced class weight to handle class imbalance and a random state for reproducibility.
Parameter Grid (param_distributions_svm):
- Defined a parameter grid for tuning the SVC. This grid includes parameters such as C (regularization parameter), gamma (kernel coefficient), kernel type (linear, RBF, polynomial), and degree (only relevant for polynomial kernels). These settings allow us to explore different trade-offs between model complexity and accuracy.
RandomizedSearchCV (random_search_svm):
- Used a RandomizedSearchCV for hyperparameter tuning of the SVC. This method efficiently explores the parameter space by sampling random combinations, allowing us to find the best settings quickly.
- The search uses cross-validation (cv) to evaluate model performance, and an F1 score (make_scorer(f1_score, pos_label="good")) is used as the scoring metric to focus on predictive performance for the ‘good’ credit risk class.
Evaluation:
- The best model (best_model_svm) is evaluated on the test set using the F1 score to measure its performance in credit risk prediction.

# Create a pipeline with preprocessing and SV classifier
pipeline_svm = Pipeline([
    ('preprocessor', preprocessor),
    ('svm', SVC(class_weight='balanced', random_state=42))
])

# Define the parameter grid for RandomizedSearchCV
param_distributions_svm = {
    'svm__C': np.logspace(-3, 3, 10),
    'svm__gamma': ['scale', 'auto'] + list(np.logspace(-3, 2, 10)),
    'svm__kernel': ['linear', 'rbf', 'poly'],
    'svm__degree': [2, 3, 4]  # Only relevant for 'poly' kernel
}

# Create a RandomizedSearchCV object
random_search_svm = (RandomizedSearchCV(
    estimator=pipeline_svm,
    param_distributions=param_distributions_svm,
    n_iter=50,
    scoring=make_scorer(f1_score,pos_label="good"),
    cv=cv,
    verbose=1,
    random_state=42
).fit(X_train, y_train))

#save the pickle file
with open("random_search_svm.pkl", "wb") as f:
    dump(random_search_svm, f, protocol=5)
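
One detail of the SVM search space is easy to miss: degree only affects the polynomial kernel, and gamma is ignored by the linear kernel. The sketch below counts how many of the nominal combinations are distinct once ignored parameters are dropped. The helpers logspace and effective_config are hypothetical names written for this illustration, and logspace is a plain-Python stand-in for np.logspace.

```python
from itertools import product


def logspace(start, stop, num):
    """Plain-Python stand-in for np.logspace (base 10)."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]


C_values = logspace(-3, 3, 10)
gamma_values = ["scale", "auto"] + logspace(-3, 2, 10)
kernels = ["linear", "rbf", "poly"]
degrees = [2, 3, 4]


def effective_config(C, gamma, kernel, degree):
    """Keep only the parameters that the chosen kernel actually uses."""
    if kernel == "linear":
        return (kernel, C)
    if kernel == "rbf":
        return (kernel, C, gamma)
    return (kernel, C, gamma, degree)


nominal = list(product(C_values, gamma_values, kernels, degrees))
distinct = {effective_config(*combo) for combo in nominal}
print(len(nominal), "nominal combinations,", len(distinct), "distinct models")
```

This is why a reported degree value next to an RBF kernel carries no meaning: the sampled degree has no effect on that model.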

Predicting using Best Support Vector Classification Model

Best Parameters for SVM:  {'svm__kernel': 'rbf', 'svm__gamma': 0.1668100537200059, 'svm__degree': 2, 'svm__C': 1000.0}

F1 Score on Test Data for Support Vector Classification model: 0.8702290076335878

5) Comparison and Evaluation of Models:

                    Model  F1 Score
0     Logistic Regression  0.810573
1     K-Nearest Neighbors  0.852792
2           Random Forest  0.895674
3       Gradient Boosting  0.874667
4  Support Vector Machine  0.870229
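
For easier reading, the same scores can be ranked and compared with the best model. The short sketch below uses only the F1 values from the table above; the variable names are illustrative.

```python
# Test-set F1 scores copied from the comparison table.
scores = {
    "Logistic Regression": 0.810573,
    "K-Nearest Neighbors": 0.852792,
    "Random Forest": 0.895674,
    "Gradient Boosting": 0.874667,
    "Support Vector Machine": 0.870229,
}

# Sort from best to worst and report each model's gap to the best score.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_name, best_score = ranking[0]

for name, score in ranking:
    print(f"{name:<24} {score:.4f}  (gap to best: {best_score - score:.4f})")
```

These gaps come from a single train/test split, so small differences between models should be read with caution.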

Evaluation of Models

Overfitting vs. Underfitting

Logistic Regression:
- Simpler model; less prone to overfitting.
- May underfit complex relationships due to its linear nature.
- F1 Score: 0.8106, the lowest of the five models, indicating decent performance but limited flexibility.
KNN Classifier:
- Moderate risk of overfitting, especially with lower neighbor counts and weighted distances.
- F1 Score: 0.8527, showing good performance but potentially sensitive to noise.
Random Forest:
- Strong resistance to overfitting due to ensemble averaging.
- F1 Score: 0.8956, the highest among all models, indicating excellent performance on test data.
Gradient Boosting:
- Can overfit if not tuned carefully, but controlled here with a low learning rate and shallow trees.
- F1 Score: 0.8746, showing good generalization, behind Random Forest but slightly ahead of SVM.
SVM:
- Risk of overfitting with a high C value (1000.0) and the RBF kernel, but performs well due to optimized parameters.
- F1 Score: 0.8702, indicating strong performance without significant overfitting.

Bias vs. Variance Tradeoff

Logistic Regression:
- High bias due to its linear assumptions; struggles with non-linear relationships in the data.
- Low variance, making it stable across different datasets.
KNN Classifier:
- Low bias, as KNN is a flexible algorithm that can capture complex patterns.
- Higher variance due to sensitivity to local data points and noise.
Random Forest:
- Balanced bias-variance tradeoff due to ensemble averaging.
- Performs well on both training and test data, indicating low variance and moderate bias.
Gradient Boosting:
- Can have higher variance compared to Random Forest if not tuned well, but here it is mitigated by the low learning rate and shallow trees.
- Moderate bias ensures good generalization.
SVM:
- Low bias due to the RBF kernel capturing complex relationships.
- Moderate variance due to high C value, which prioritizes fitting the training data.

Flexibility vs. Interpretability

Logistic Regression:
- Highly interpretable (coefficients directly explain feature importance).
- Limited flexibility; struggles with non-linear patterns in the data.
KNN Classifier:
- Flexible and easy to understand conceptually, but lacks interpretability regarding feature importance.
Random Forest:
- Highly flexible and robust for various data types.
- Interpretability is moderate; feature importance can be extracted but lacks direct interpretability like logistic regression.
Gradient Boosting:
- Flexible and powerful for complex datasets but less interpretable than Random Forest due to sequential boosting steps.
SVM:
- Very flexible with RBF kernels for non-linear decision boundaries but lacks interpretability (e.g., feature contributions are not directly available).

Main Takeaways

Best Model for Performance:
- Random Forest achieved the highest F1 score (0.8956) on test data, making it the best-performing model in terms of predictive power.
Tradeoffs Between Models:
- Logistic Regression is simple and interpretable but underperforms on complex datasets due to its linear nature.
- KNN provides good performance (0.8527) but is more prone to overfitting and sensitive to noise in imbalanced datasets.
- Gradient Boosting offers a balance between flexibility and generalization, though slightly behind Random Forest in performance (0.8746).
- SVM performs well (0.8702) with optimized hyperparameters but lacks interpretability.
Model Selection Depends on Goals:
- For interpretability: Logistic Regression or Random Forest (with feature importance).
- For best predictive performance: Random Forest, followed by Gradient Boosting and SVM (if interpretability is less critical).
- For imbalanced datasets: Gradient Boosting or Random Forest with class weights or balanced sampling.

6) Ethical Implications

Given the nature of predicting credit risk using demographic, financial, and behavioral factors, several ethical concerns could arise if this system were deployed. First, the potential for harm and injustice is a key issue. The imbalanced credit_risk target variable, with 700 “good” and 300 “bad” credit cases, might bias the model toward the majority class (“good”), making its predictions for the minority class (“bad”) less reliable. Misclassification in either direction causes harm, and a system of this kind could unfairly deny credit to individuals based on less favorable past credit histories, which might reflect valid but unfortunate financial circumstances not directly indicative of future risk.

Moreover, replication of human biases and inequities is a significant risk. Factors such as job stability, length of residence, and foreign worker status can reflect underlying social biases and inequalities, which could be unintentionally perpetuated by the model. For instance, people who are foreign workers or those from socioeconomically disadvantaged backgrounds might face higher scrutiny and be unfairly penalized, which could exacerbate existing inequalities in credit access.

To address these issues, it is essential to adopt fairness-aware machine learning techniques. This includes strategies like re-balancing datasets to address class imbalances, implementing bias mitigation techniques such as fairness constraints during model training, and regularly monitoring the model’s performance to detect and correct for any unfair outcomes. Additionally, transparent and interpretable models should be used, so that users can understand how decisions are made and identify any patterns of discrimination. Ensuring that such systems do not merely replicate historical biases requires continuous evaluation and adjustments to promote fairness and equity in credit risk assessment.
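
One simple form of the monitoring described above is to compare how often the model predicts “good” for different groups. The sketch below computes per-group rates and their gap (a demographic-parity style check). The function name, the group values, and the predictions are all hypothetical and are not results from this project.

```python
from collections import defaultdict


def positive_rate_by_group(groups, predictions, positive="good"):
    """Share of `positive` predictions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, pred in zip(groups, predictions):
        totals[group] += 1
        positives[group] += int(pred == positive)
    return {group: positives[group] / totals[group] for group in totals}


# Made-up group membership (e.g. foreign_worker) and made-up predictions.
groups = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
preds = ["good", "bad", "bad", "good", "good", "good", "good", "bad"]

rates = positive_rate_by_group(groups, preds)
parity_gap = max(rates.values()) - min(rates.values())
print(rates, "gap:", parity_gap)
```

A large gap does not prove discrimination on its own, but it flags where a closer look at the features and error rates for each group is warranted.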