Introduction

In the complex and ever-evolving world of finance, the ability to accurately assess and predict the likelihood of bankruptcy is a critical task for banks and financial institutions. The stakes are high, as the consequences of missing the signs of financial distress can be dire. From safeguarding investments and managing risks to ensuring regulatory compliance, the importance of leveraging statistics to determine the likelihood of bankruptcy cannot be overstated. This proactive approach empowers banks to make informed decisions, allocate resources efficiently, and maintain the stability and integrity of the financial system. In this context, statistical analysis becomes an indispensable tool for mitigating potential financial crises, protecting stakeholders’ interests, and fostering a healthier and more resilient banking industry. Using Random Forest and XGBoost, this article explores the vital role that statistical methods play in enabling banks to make data-driven decisions and navigate the complex landscape of bankruptcy prediction.

Scope and Purpose of the Research

Understanding the underlying indicators of bankruptcy is extremely important in finance. Banks can not only track potential bankruptcies and notify clients and businesses of them, but also act before the situation gets out of hand. Random Forest and XGBoost, both decision tree-based algorithms, provide insight into what exactly these leading indicators of bankruptcy are.

The following Python packages are used in this analysis.

import pandas as pd
from scipy.io import arff
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, auc
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.impute import SimpleImputer

Methodology of Data Collection and Analysis

To begin understanding the data, the first step is to look for null values. We found quite a few, and handled them with mean imputation (described in the preprocessing section) rather than dropping rows from the data set. Additionally, we converted the two classes in the bankruptcy column to integers: 0 for no and 1 for yes. Only a small number of entries are bankruptcies, which creates an imbalance in the data. We then fit a Random Forest classifier to see both how well it predicts bankruptcy and which variables the algorithm deems important. Afterwards, we fit an XGBoost model, a gradient-boosted tree method, to compare its results and its ranking of the variables that matter most in determining bankruptcies.

# Function: load and convert .arff file to a pandas DataFrame
def load_arff_to_dataframe(path):
    data, meta = arff.loadarff(path)
    df = pd.DataFrame(data)
    df['class'] = df['class'].apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x).astype(int)  # Ensure 'class' is integer
    return df

# Paths to case study files
file_paths = [
    r'C:\Users\fidel\Desktop\School\Quantifying the World\Case Study 4\1year.arff',
    r'C:\Users\fidel\Desktop\School\Quantifying the World\Case Study 4\2year.arff',
    r'C:\Users\fidel\Desktop\School\Quantifying the World\Case Study 4\3year.arff',
    r'C:\Users\fidel\Desktop\School\Quantifying the World\Case Study 4\4year.arff',
    r'C:\Users\fidel\Desktop\School\Quantifying the World\Case Study 4\5year.arff'
]

# Load and concatenate the datasets
datasets = [load_arff_to_dataframe(path) for path in file_paths]
df = pd.concat(datasets, ignore_index=True)

Exploratory Data Analysis

Here we look at histograms for each feature, the class distribution of successful and bankrupt businesses, missing values for each attribute, basic statistics for the dataset, and a correlation matrix. We see three significant issues to address: the dataset has a significant number of missing values, the features have large variances, and the dataset is very imbalanced in terms of the prediction variable (class). We will use mean imputation for the missing values, feature scaling for the large variances, and SMOTE for the imbalance. SMOTE (Synthetic Minority Over-sampling Technique) is a statistical technique designed to address the problem of class imbalance in machine learning datasets.

# Histograms for numerical features
numerical_features = df.select_dtypes(include=['float64', 'int64']).columns

num_features = len(numerical_features)
num_rows = -(-num_features // 5)  # Ceiling division to ensure we have enough rows

# Plot histograms for all numerical features, 5 per row
fig, axes = plt.subplots(num_rows, 5, figsize=(20, num_rows * 3))  # Adjust the figure size as needed
axes = axes.flatten()  # Flatten the axes array for easy iteration

for i, col in enumerate(numerical_features):
    sns.histplot(df[col], bins=15, ax=axes[i], kde=False)
    axes[i].set_title(col)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')

# Hide any unused subplots if the number of features is not a multiple of 5
for j in range(i + 1, num_rows * 5):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

Continuing EDA: class distribution and data quality

Imbalance in the ‘class’ feature

The dataset contains 41,314 instances of class 0 vs. 2,091 instances of class 1. This imbalance can bias classifiers towards the majority class. We will implement SMOTE to oversample the minority class so that the two classes are balanced.

# Class distribution, very imbalanced, implement SMOTE
plt.figure(figsize=(6, 4))
sns.countplot(x='class', data=df)
plt.title('Class Distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()

print(df['class'].value_counts())
## class
## 0    41314
## 1     2091
## Name: count, dtype: int64
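
To put the imbalance in perspective, the short sketch below works through the arithmetic on the two counts printed above. The variable names are illustrative only; the point is that a classifier which always predicts class 0 would already score a high accuracy, which is why accuracy alone is a weak yardstick on the raw data.

```python
# Class counts as printed by df['class'].value_counts() above
counts = {0: 41314, 1: 2091}

total = sum(counts.values())
minority_share = counts[1] / total        # fraction of bankrupt firms
imbalance_ratio = counts[0] / counts[1]   # majority rows per minority row
majority_baseline = counts[0] / total     # accuracy of always predicting class 0

print(f"total rows:        {total}")
print(f"minority share:    {minority_share:.4f}")
print(f"imbalance ratio:   {imbalance_ratio:.2f} : 1")
print(f"baseline accuracy: {majority_baseline:.4f}")
```

Bankrupt firms make up roughly 4.8% of the rows, so a trivial majority-class predictor would be right about 95% of the time on the unbalanced data.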

Missing Values:

Several features have more than 1,000 missing values, and others have fewer than 1,000. For example, ‘Attr60’ has 2,152 missing values and ‘Attr64’ has 812. The random forest cannot be fit here with NaN values in the data set, so we will apply mean imputation to fill these gaps. Leaving them unresolved could also bias the results.

# Count and plot missing values
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]

plt.figure(figsize=(10, 5))
missing_values.sort_values().plot(kind='bar')
plt.title('Missing Values Count by Feature')
plt.xlabel('Features')
plt.ylabel('Count of missing values')
plt.show()

# Missing values counts
print("Missing values by feature:\n", missing_values)
## Missing values by feature:
##  Attr1        8
## Attr2        8
## Attr3        8
## Attr4      134
## Attr5       89
##           ... 
## Attr60    2152
## Attr61     102
## Attr62     127
## Attr63     134
## Attr64     812
## Length: 64, dtype: int64

Basic dataset statistics and correlation matrix:

# Basic statistics
basic_stats = df.describe()
print("Basic Statistics for Numerical Features:\n", basic_stats)
## Basic Statistics for Numerical Features:
##                Attr1         Attr2  ...         Attr64         class
## count  43397.000000  43397.000000  ...   42593.000000  43405.000000
## mean       0.035160      0.590212  ...      72.788592      0.048174
## std        2.994109      5.842748  ...    2369.339482      0.214137
## min     -463.890000   -430.870000  ...  -10677.000000      0.000000
## 25%        0.003429      0.268980  ...       2.176800      0.000000
## 50%        0.049660      0.471900  ...       4.282500      0.000000
## 75%        0.129580      0.688320  ...       9.776200      0.000000
## max       94.280000    480.960000  ...  294770.000000      1.000000
## 
## [8 rows x 65 columns]
# Correlation matrix without numbers since there are a lot of features
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), cmap='coolwarm', center=0)
plt.title('Correlation Matrix - Color Coded')
plt.show()

Preprocessing data to handle missing values and implement SMOTE

# Preprocess data: Convert features to numeric and handle missing values
X = df.drop('class', axis=1).apply(pd.to_numeric, errors='coerce')
y = df['class']

Mean Imputation:

Mean imputation replaces each missing value with the mean of the observed values for that feature. Random forest does not work well with missing values, so these need to be resolved before we fit the model.

# Impute missing values in features, using the mean for each feature
imputer = SimpleImputer(strategy='mean')
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
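
As a rough illustration of what the mean strategy does, here is a from-scratch sketch using only the standard library. The helper `impute_column_means` and the toy values are hypothetical and are not part of the analysis; the sketch also assumes every column has at least one observed value.

```python
import math

def impute_column_means(rows):
    """Replace NaN entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))  # assumes at least one observed value
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)] for row in rows]

# Toy data (hypothetical values, not taken from the bankruptcy dataset)
toy = [
    [1.0, 10.0],
    [math.nan, 20.0],
    [3.0, math.nan],
]
filled = impute_column_means(toy)
print(filled)
```

The missing entry in the first column is filled with 2.0 (the mean of 1.0 and 3.0) and the one in the second column with 15.0. Observed values are left unchanged.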

Feature Scaling:

Here we apply StandardScaler from sklearn.preprocessing to help deal with the large variation within each feature.

# Apply feature scaling
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_imputed), columns=X_imputed.columns)
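
Standardization rescales each feature to zero mean and unit variance using z = (x - mean) / std. The sketch below shows that calculation on a single made-up feature with the standard library; the function name `standardize` and the toy values are illustrative assumptions, not part of the analysis.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0), the convention StandardScaler uses
    return [(v - mean) / std for v in values]

# Toy feature (hypothetical values): mean 5.0, population std 2.0
toy_feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(toy_feature)
print([round(v, 2) for v in z])
```

After the transform the feature has mean 0 and standard deviation 1, so features measured on very different scales become comparable.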

Implementing SMOTE

SMOTE works by synthetically generating new instances of the minority class, helping to balance the dataset without losing valuable information. The steps involved in SMOTE are as follows:

Choosing a Minority Class Instance: For each instance in the minority class, SMOTE selects a certain number of its nearest neighbors from the minority class.

Synthesizing New Instances: For each selected neighbor, SMOTE generates synthetic instances. This is done by interpolating between the original instance and its neighbors. Specifically, a new instance is created by choosing a point along the line segment that connects the original instance with its selected neighbor(s) in the feature space.

Repeating the Process: This process is repeated until the class distribution is balanced or until the desired number of synthetic instances has been generated.
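
The steps above can be sketched in a few lines of plain Python. This is a simplified illustration of the interpolation idea, not the imbalanced-learn implementation: the function `smote_sketch`, its parameters, and the toy points are all assumptions made for the example.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a minority instance at random
        base = rng.choice(minority)
        # 2. Find its k nearest minority neighbours (excluding the instance itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # 3. Interpolate: new = base + gap * (neighbour - base), with gap in [0, 1)
        neighbour = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class: the four corners of a unit square (hypothetical data)
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_new=5)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority points, the new rows stay inside the region the minority class already occupies rather than being exact duplicates.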

# Apply SMOTE to address class imbalance
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_scaled, y)

Creating the training and test sets

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.2, random_state=42)

Random Forests

The Random Forest model was run with 100 estimators.

# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
## RandomForestClassifier(random_state=42)
rf_predictions = rf_clf.predict(X_test)
rf_proba = rf_clf.predict_proba(X_test)[:, 1]

Random Forest Evaluation metrics

We achieved 0.99 accuracy overall on the SMOTE-balanced test set. From the classification report for the random forest model, precision is 0.99 for companies that are not going bankrupt and 0.98 for those that are. Looking at the feature importances from the random forest, ‘Attr27’, ‘Attr21’, ‘Attr24’ and ‘Attr46’ carry the largest weights in predicting whether a company is on its way to bankruptcy. On this test set the model appears accurate at classifying companies that may go bankrupt. We also computed a confusion matrix to help validate the initial metrics we observed. It shows 8,191 True Positives (TP), 8,089 True Negatives (TN), 164 False Positives (FP) and 82 False Negatives (FN), which supports the model's accuracy and precision in classifying companies at risk of bankruptcy.
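
As a cross-check, the headline metrics can be re-derived by hand from the four confusion matrix counts quoted above. The sketch below is plain arithmetic; the variable names are illustrative.

```python
# Counts quoted above: rows are true classes, columns are predicted classes
tn, fp = 8089, 164   # true class 0
fn, tp = 82, 8191    # true class 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # of firms predicted bankrupt, the share that are
recall_1 = tp / (tp + fn)      # of bankrupt firms, the share that are caught
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

for name, value in [
    ("accuracy", accuracy),
    ("precision (class 1)", precision_1),
    ("recall (class 1)", recall_1),
    ("precision (class 0)", precision_0),
    ("recall (class 0)", recall_0),
]:
    print(f"{name:<20} {value:.4f}")
```

Rounded to two decimals, these values agree with the classification report: 0.99 accuracy, 0.98 precision and 0.99 recall for class 1, and 0.99 precision and 0.98 recall for class 0.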
# Evaluation metrics
print("Random Forest Classifier Metrics:")
## Random Forest Classifier Metrics:
print(classification_report(y_test, rf_predictions))
##               precision    recall  f1-score   support
## 
##            0       0.99      0.98      0.99      8253
##            1       0.98      0.99      0.99      8273
## 
##     accuracy                           0.99     16526
##    macro avg       0.99      0.99      0.99     16526
## weighted avg       0.99      0.99      0.99     16526
# Confusion Matrix
print("Confusion Matrix:")
## Confusion Matrix:
print(confusion_matrix(y_test, rf_predictions))
## [[8089  164]
##  [  82 8191]]
# ROC-AUC Score
roc_auc = roc_auc_score(y_test, rf_proba)
print("ROC-AUC Score:", roc_auc)
## ROC-AUC Score: 0.9982747282253724
# Precision-Recall Curve and AUC
precision, recall, _ = precision_recall_curve(y_test, rf_proba)
pr_auc = auc(recall, precision)
print("Precision-Recall AUC:", pr_auc)
## Precision-Recall AUC: 0.9980413788155538
# Get feature importances
importances_rf = rf_clf.feature_importances_

# Create a DataFrame for easier handling and visualization
feature_names = X.columns
importances_rf_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances_rf}).sort_values(by='Importance', ascending=False)

# Display the top 10 most important features
print("Top 10 Important Features in Random Forest:")
## Top 10 Important Features in Random Forest:
print(importances_rf_df.head(10))
##    Feature  Importance
## 26  Attr27    0.146958
## 20  Attr21    0.052544
## 23  Attr24    0.050143
## 45  Attr46    0.037691
## 33  Attr34    0.037293
## 5    Attr6    0.035149
## 38  Attr39    0.023633
## 15  Attr16    0.023209
## 25  Attr26    0.022353
## 34  Attr35    0.021759

XGBoost

XGBoost performed slightly better than the random forest model. Its overall accuracy was also 99%, and it matched or exceeded the random forest on every metric in the classification report: precision for companies that did go bankrupt (class 1) was 0.99 versus 0.98 for the random forest, and recall for class 0 was 0.99 versus 0.98. XGBoost reports a different ordering of important features, but six features appear in both top-10 lists: ‘Attr6’, ‘Attr21’, ‘Attr26’, ‘Attr27’, ‘Attr34’ and ‘Attr39’. Of these, ‘Attr26’ and ‘Attr27’ are the two highest-ranked features in XGBoost, and ‘Attr27’ is also the top feature in the random forest. These two attributes should be targeted for further investigation into how they may indicate a company's financial health.

# XGBoost Classifier
# use_label_encoder is omitted: it is deprecated and ignored by recent XGBoost releases
xgb_clf = XGBClassifier(eval_metric='logloss')
xgb_clf.fit(X_train, y_train)
## XGBClassifier(base_score=None, booster=None, callbacks=None,
##               colsample_bylevel=None, colsample_bynode=None,
##               colsample_bytree=None, device=None, early_stopping_rounds=None,
##               enable_categorical=False, eval_metric='logloss',
##               feature_types=None, gamma=None, grow_policy=None,
##               importance_type=None, interaction_constraints=None,
##               learning_rate=None, max_bin=None, max_cat_threshold=None,
##               max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
##               max_leaves=None, min_child_weight=None, missing=nan,
##               monotone_constraints=None, multi_strategy=None, n_estimators=None,
##               n_jobs=None, num_parallel_tree=None, random_state=None, ...)
xgb_predictions = xgb_clf.predict(X_test)
print("XGBoost Classifier Metrics:")
## XGBoost Classifier Metrics:
print(classification_report(y_test, xgb_predictions))
##               precision    recall  f1-score   support
## 
##            0       0.99      0.99      0.99      8253
##            1       0.99      0.99      0.99      8273
## 
##     accuracy                           0.99     16526
##    macro avg       0.99      0.99      0.99     16526
## weighted avg       0.99      0.99      0.99     16526
# Get feature importance
importances_xgb = xgb_clf.feature_importances_

# Create a DataFrame for easier handling and visualization
importances_xgb_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances_xgb}).sort_values(by='Importance', ascending=False)

# Display the top 10 most important features
print("Top 10 Important Features in XGBoost:")
## Top 10 Important Features in XGBoost:
print(importances_xgb_df.head(10))
##    Feature  Importance
## 25  Attr26    0.105672
## 26  Attr27    0.095705
## 33  Attr34    0.075430
## 20  Attr21    0.071993
## 38  Attr39    0.037719
## 36  Attr37    0.033919
## 5    Attr6    0.026472
## 24  Attr25    0.023384
## 55  Attr56    0.022518
## 19  Attr20    0.019384
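
To compare the two rankings systematically, the top-10 lists can be intersected. The sketch below copies the feature names from the two printed outputs so that it runs on its own; in the analysis itself the same lists could presumably be taken from the `Feature` columns of `importances_rf_df.head(10)` and `importances_xgb_df.head(10)`.

```python
# Top-10 features copied from the printed outputs, in ranked order
rf_top10 = ["Attr27", "Attr21", "Attr24", "Attr46", "Attr34",
            "Attr6", "Attr39", "Attr16", "Attr26", "Attr35"]
xgb_top10 = ["Attr26", "Attr27", "Attr34", "Attr21", "Attr39",
             "Attr37", "Attr6", "Attr25", "Attr56", "Attr20"]

# Features present in both lists, kept in random forest rank order
xgb_set = set(xgb_top10)
shared = [feature for feature in rf_top10 if feature in xgb_set]

print(len(shared), shared)
```

Six of the ten features are shared between the two models, with ‘Attr27’ ranked first by the random forest and second by XGBoost.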

Conclusion

The data provided for the analysis was sufficient to create two highly accurate models using random forests and XGBoost. Both models achieved 99% overall accuracy on the test set, and each produced a top-10 list of features that are important indicators of a company's financial health. Six features appear on both lists, including ‘Attr26’ and ‘Attr27’, which rank first and second in XGBoost; ‘Attr27’ is also the top feature in the random forest. These two features warrant additional investigation into how they may indicate a company's financial health. Random Forests and XGBoost handled a large dataset of numeric data and made accurate predictions after simple tidying of the data provided. We feel confident these models will better equip our company to determine an investment company's financial health.