Executive Summary

This analysis investigates the potential of unsupervised learning, specifically k-means clustering, to predict corporate bankruptcy using financial statement data. Leveraging a dataset of 78,682 firm-year observations containing 18 numeric financial indicators, the study follows a structured data science workflow involving exploratory data analysis (EDA), data cleaning, feature engineering, clustering, and model evaluation.

Preprocessing steps included reclassification of variable types, verification of structural integrity, and outlier analysis. Financial indicators exhibited significant skewness and variance, typical of real-world economic data, and were normalized using z-score standardization. Two derived features, Return on Assets (ROA) and Debt Ratio, were engineered to incorporate domain knowledge into the model.

A k-means clustering model (k = 2) was built on the standardized training set to identify latent financial groupings that might map onto bankruptcy outcomes. Principal component analysis (PCA) visualization revealed two distinct but overlapping clusters. Mapping cluster assignments to actual bankruptcy labels using majority voting enabled evaluation on the test set.

The model achieved strong test-set performance, with precision of 1.000, recall of 0.9337, and an F1-score of 0.9657. These results indicate that the clustering algorithm, though unsupervised, aligned closely with true firm status given well-prepared input features. However, the approach remains limited by k-means’ geometric assumptions and its inability to produce probabilistic risk estimates.

Future enhancements could include alternative clustering methods (e.g., DBSCAN, GMM), dimensionality reduction (e.g., PCA), and integration of macroeconomic or temporal features. Ultimately, transitioning to supervised models may allow for optimized classification and cost-sensitive decision-making in real-world financial risk assessment contexts.

Introduction

Bankruptcy prediction remains a critical concern in corporate finance, providing stakeholders with foresight into financial instability. Financial ratios have historically demonstrated predictive utility in bankruptcy modeling (Altman, 1968). Contemporary research continues to explore advanced methods for selecting and transforming financial indicators for business crisis detection (Lin, Liang, & Chen, 2011). This study conducts a comprehensive exploratory data analysis (EDA) and clustering analysis of U.S. corporate bankruptcy data to identify patterns distinguishing active versus bankrupt companies. The methodology adheres to standard EDA principles, including data cleaning, variable reclassification, graphical and statistical analysis, and unsupervised clustering. The objective is to uncover latent groupings among firms and evaluate the potential of cluster-based labels to predict firm status. Cluster analysis, particularly k-means, has been a foundational approach in unsupervised learning (Jain, 2010; Xu & Wunsch, 2005). All analysis is performed in R and adheres to APA-style reporting conventions (Saunders, Lewis, & Thornhill, 2019).


Step 1: Data Preprocessing

Data preprocessing ensures the dataset is clean, structured, and appropriate for analysis. This step involves identifying variable types, assessing data structure, evaluating missingness and anomalies, and performing recoding as necessary. Preprocessing establishes the foundational data integrity required for robust exploratory and predictive analytics.
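
The code chunks in this report rely on several packages that are assumed to be attached up front; the list below is inferred from the functions used throughout (a minimal setup chunk):

library(dplyr)       # select(), mutate(), and the %>% pipe
library(ggplot2)     # histograms, boxplots, and tile plots
library(reshape2)    # melt()
library(purrr)       # map()
library(gridExtra)   # grid.arrange()
library(psych)       # describe()
library(knitr)       # kable()
library(caret)       # createDataPartition(), confusionMatrix()
library(factoextra)  # fviz_cluster()
library(MLmetrics)   # Precision(), Recall(), Accuracy(), F1_Score()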

1.1 Load Data and Inspect Structure

data <- read.csv("C:/Users/Kat/Desktop/Bankruptcy Data/american_bankruptcy.csv")
str(data)
## 'data.frame':    78682 obs. of  21 variables:
##  $ company_name: chr  "C_1" "C_1" "C_1" "C_1" ...
##  $ status_label: chr  "alive" "alive" "alive" "alive" ...
##  $ year        : int  1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
##  $ X1          : num  511 486 437 396 432 ...
##  $ X2          : num  833 714 526 497 523 ...
##  $ X3          : num  18.4 18.6 22.5 27.2 26.7 ...
##  $ X4          : num  89 64.4 27.2 30.7 47.5 ...
##  $ X5          : num  336 321 287 260 247 ...
##  $ X6          : num  35.2 18.5 -58.9 -12.4 3.5 ...
##  $ X7          : num  128.3 115.2 77.5 66.3 104.7 ...
##  $ X8          : num  373 377 365 143 309 ...
##  $ X9          : num  1024 874 639 606 652 ...
##  $ X10         : num  741 702 710 687 709 ...
##  $ X11         : num  180 180 218 165 249 ...
##  $ X12         : num  70.66 45.79 4.71 3.57 20.81 ...
##  $ X13         : num  191 160 112 110 129 ...
##  $ X14         : num  164 125 150 204 131 ...
##  $ X15         : num  201 204 140 124 132 ...
##  $ X16         : num  1024 874 639 606 652 ...
##  $ X17         : num  401 362 400 392 408 ...
##  $ X18         : num  935 810 612 576 604 ...
kable(head(data))
company_name status_label year X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18
C_1 alive 1999 511.267 833.107 18.373 89.031 336.018 35.163 128.348 372.7519 1024.333 740.998 180.447 70.658 191.226 163.816 201.026 1024.333 401.483 935.302
C_1 alive 2000 485.856 713.811 18.577 64.367 320.590 18.531 115.187 377.1180 874.255 701.854 179.987 45.790 160.444 125.392 204.065 874.255 361.642 809.888
C_1 alive 2001 436.656 526.477 22.496 27.207 286.588 -58.939 77.528 364.5928 638.721 710.199 217.699 4.711 112.244 150.464 139.603 638.721 399.964 611.514
C_1 alive 2002 396.412 496.747 27.172 30.745 259.954 -12.410 66.322 143.3295 606.337 686.621 164.658 3.573 109.590 203.575 124.106 606.337 391.633 575.592
C_1 alive 2003 432.204 523.302 26.680 47.491 247.245 3.504 104.661 308.9071 651.958 709.292 248.666 20.811 128.656 131.261 131.884 651.958 407.608 604.467
C_1 alive 2004 474.542 598.172 27.950 61.774 255.477 15.453 127.121 522.6794 747.848 732.230 227.159 33.824 149.676 160.025 142.450 747.848 417.486 686.074

1.2 Identify Variable Types and Recategorize

This step ensures all variables are correctly classified as numeric, categorical, or identifiers. Accurate data typing is critical for valid statistical inference and appropriate modeling choices.

data$status_label <- as.factor(data$status_label)
data$company_name <- as.character(data$company_name)
data$year <- as.integer(data$year)
summary(data$status_label)
##  alive failed 
##  73462   5220
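
The class distribution is heavily imbalanced: failed firms account for roughly 6.6% of observations (5,220 of 78,682). A quick check of the proportions:

round(prop.table(table(data$status_label)), 4)
##  alive failed 
## 0.9337 0.0663

This imbalance matters later: a method that favors the majority class can score well on accuracy alone, which is why precision, recall, and F1 are reported alongside accuracy in Step 6.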

1.3 Evaluate Missing Data and Outliers

Missing data can bias results, and outliers can distort distributions and unduly influence models. This step identifies and assesses anomalies to determine whether imputation or filtering is required.

kable(sapply(data, function(x) sum(is.na(x))))
Variable Missing
company_name 0
status_label 0
year 0
X1 0
X2 0
X3 0
X4 0
X5 0
X6 0
X7 0
X8 0
X9 0
X10 0
X11 0
X12 0
X13 0
X14 0
X15 0
X16 0
X17 0
X18 0
data_numeric <- data %>% select(where(is.numeric))
melted <- melt(data_numeric)  # long format: one row per (variable, value) pair
ggplot(melted, aes(x = variable, y = value)) +
  geom_boxplot(outlier.color = "red") +  # outliers flagged in red
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Boxplots of Numeric Variables")

Step 2: Exploratory Data Analysis

EDA provides insight into data distributions and interrelationships through descriptive statistics and visualizations. This helps detect patterns, guide feature engineering, and inform model selection.

2.1 Univariate Analysis

Univariate analysis evaluates the distribution of individual variables (Saunders et al., 2019). It helps identify patterns such as skewness, modality, and variability, which inform appropriate transformation techniques or highlight potential data quality issues (Lin et al., 2011).

describe(data_numeric)
##      vars     n    mean       sd  median trimmed    mad        min     max
## year    1 78682 2007.51     5.74 2007.00 2007.33   7.41    1999.00    2018
## X1      2 78682  880.36  3928.56  100.45  245.50 141.88      -7.76  169662
## X2      3 78682 1594.53  8930.48  103.66  356.63 149.69    -366.64  374623
## X3      4 78682  121.23   652.38    7.93   26.83  11.46       0.00   28430
## X4      5 78682  376.76  2012.02   15.03   76.53  38.18  -21913.00   81730
## X5      6 78682  201.61  1060.77    7.02   42.86  10.41       0.00   62567
## X6      7 78682  129.38  1265.53    1.62   19.34  26.45  -98696.00  104821
## X7      8 78682  286.83  1335.98   22.82   72.21  33.34      -0.01   65812
## X8      9 78682 3414.35 18414.10  227.51  705.05 325.43       0.00 1073391
## X9     10 78682 2364.02 11950.07  186.60  582.85 271.46   -1965.00  511729
## X10    11 78682 2867.11 12917.94  213.20  668.95 303.69       0.00  531864
## X11    12 78682  722.48  3242.17    7.59  140.23  11.26      -0.02  166250
## X12    13 78682  255.53  1494.64    6.52   47.30  30.37  -25913.00   71230
## X13    14 78682  769.49  3774.70   63.58  192.17  94.06  -21536.00  137106
## X14    15 78682  610.07  2938.39   43.33  131.31  60.35       0.00  116866
## X15    16 78682  532.47  6369.16   -1.13   56.89 139.16 -102362.00  402089
## X16    17 78682 2364.02 11950.07  186.60  582.85 271.46   -1965.00  511729
## X17    18 78682 1773.56  8053.68   81.99  370.59 117.39       0.00  337980
## X18    19 78682 1987.26 10419.63  168.91  492.77 237.46    -317.20  481580
##           range  skew kurtosis    se
## year      19.00  0.19    -1.17  0.02
## X1    169669.76 14.84   337.46 14.01
## X2    374989.64 20.05   577.29 31.84
## X3     28430.00 17.86   445.40  2.33
## X4    103643.00 16.40   399.81  7.17
## X5     62567.00 22.57   814.44  3.78
## X6    203517.00 11.87  1496.19  4.51
## X7     65812.01 15.84   400.32  4.76
## X8   1073390.54 18.19   540.96 65.65
## X9    513694.00 18.99   546.01 42.60
## X10   531864.00 13.56   272.64 46.05
## X11   166250.02 14.85   387.42 11.56
## X12    97143.00 17.97   528.46  5.33
## X13   158642.00 15.30   335.11 13.46
## X14   116866.00 14.22   293.58 10.48
## X15   504451.00 29.61  1544.01 22.71
## X16   513694.00 18.99   546.01 42.60
## X17   337980.00 13.77   296.42 28.71
## X18   481897.20 20.38   631.40 37.15
histograms <- map(names(data_numeric), function(var) {
  ggplot(data, aes(x = .data[[var]])) +  # the .data pronoun replaces the deprecated aes_string()
    geom_histogram(bins = 30, fill = "blue", color = "white") +
    ggtitle(paste("Histogram of", var))
})

grid.arrange(grobs = histograms[1:6], ncol = 3)
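
Given the extreme right skew reported above (skewness values between roughly 12 and 30), a log-style transform is a common remedy before distance-based modeling. The sketch below uses a signed log1p so that the negative values present in several indicators are handled; it is illustrative only, since this analysis relies on z-score standardization alone.

# Signed log1p: compresses heavy tails while preserving the sign of negative values
signed_log1p <- function(x) sign(x) * log1p(abs(x))

ggplot(data, aes(x = signed_log1p(X2))) +
  geom_histogram(bins = 30, fill = "blue", color = "white") +
  ggtitle("Histogram of X2 after signed log1p transform")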

2.2 Multivariate Analysis

Multivariate analysis assesses relationships between variables (Jain, 2010; Xu & Wunsch, 2005). These relationships are useful for detecting redundancy, multicollinearity, or natural groupings that may inform modeling strategies or feature selection (Saunders et al., 2019).

cor_mat <- round(cor(data_numeric), 2)
heatmap(cor_mat, symm = TRUE, col = topo.colors(10), margins = c(6,6))
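
The descriptive statistics above already hint at redundancy: X9 and X16 report identical distributions. Flagging highly correlated columns programmatically complements the heatmap; a sketch using caret::findCorrelation, which returns the indices of columns whose removal would reduce the largest pairwise correlations:

# Columns involved in pairwise correlations above 0.95
high_cor <- findCorrelation(cor(data_numeric), cutoff = 0.95)
names(data_numeric)[high_cor]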

Step 3: Feature Engineering

Feature engineering transforms raw inputs into meaningful predictors. Constructed features such as financial ratios are well established in the literature as indicators of financial distress (Altman, 1968; Lin et al., 2011). Embedding this domain-specific logic enhances model interpretability and can improve clustering and classification performance (Saunders et al., 2019).

# Derived ratios: the small constant in each denominator guards against division by zero
data <- data %>% mutate(debt_ratio = X5 / (X4 + 1e-5),
                        roa = X3 / (X4 + 1e-5))

Step 4: Data Splitting and Scaling

Splitting into training and test sets ensures model performance is evaluated on unseen data, thereby reducing overfitting and preserving generalizability (Saunders et al., 2019). Standardizing variables is essential for clustering algorithms that rely on distance metrics, as it prevents variables measured on large scales from dominating the distance computation (Jain, 2010).

set.seed(42)
index <- createDataPartition(data$status_label, p = 0.7, list = FALSE)
train <- data[index, ]
test <- data[-index, ]

scale_cols <- names(train)[4:21]  # the 18 raw indicators X1–X18 (engineered ratios excluded)
train_means <- colMeans(train[, scale_cols])
train_sds <- apply(train[, scale_cols], 2, sd)
train_scaled <- as.data.frame(scale(train[, scale_cols], center = train_means, scale = train_sds))
# Reuse the training parameters so train and test share one feature space
test_scaled <- as.data.frame(scale(test[, scale_cols], center = train_means, scale = train_sds))

Step 5: Clustering Analysis

K-means clustering (k=2) segments observations into groups by minimizing within-cluster variance. The goal is to determine if unsupervised grouping aligns with bankruptcy status.

k2 <- kmeans(train_scaled, centers = 2, nstart = 25)  # 25 random starts guard against poor initialization
fviz_cluster(k2, data = train_scaled) + ggtitle("2-Means Clustering")  # plotted on the first two principal components

train$cluster <- as.factor(k2$cluster)
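
The choice of k = 2 mirrors the binary outcome of interest, but it is worth checking against the data itself. A quick elbow-method diagnostic (a sketch using factoextra on the same standardized training set; total within-cluster sum of squares is plotted for each candidate k):

# Look for the "elbow" where additional clusters stop reducing WSS appreciably
fviz_nbclust(train_scaled, kmeans, method = "wss", k.max = 10)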

Step 6: Cluster Evaluation on Test Set

Cluster validity is assessed using supervised labels in the test set. While clustering is an unsupervised method, post-hoc evaluation against known classes provides a quantitative estimate of alignment with actual outcomes (Xu & Wunsch, 2005). Precision, recall, accuracy, and F1-score are standard metrics for this assessment (Saunders et al., 2019).

# Majority vote: map each training cluster to its most common true label
mapping <- table(train$cluster, train$status_label)
cluster_to_label <- apply(mapping, 1, function(row) names(which.max(row)))

# Assign each test observation to its nearest training centroid; re-fitting
# k-means on the test set could move or renumber the centers and break the mapping
nearest_center <- function(row) which.min(colSums((t(k2$centers) - row)^2))
test$cluster <- as.factor(apply(as.matrix(test_scaled), 1, nearest_center))
test$predicted_label <- as.factor(cluster_to_label[as.numeric(test$cluster)])

conf <- confusionMatrix(test$predicted_label, test$status_label)
conf_matrix <- as.data.frame(conf$table)

conf_plot <- ggplot(conf_matrix, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), color = "white") +
  geom_text(aes(label = Freq), size = 6, color = "black") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(
    title = "Confusion Matrix",
    x = "Actual Label",
    y = "Predicted Label"
  ) +
  theme_minimal(base_size = 14)

print(conf_plot)

# MLmetrics expects y_true first for Precision, Recall, and F1_Score (Accuracy takes y_pred first)
precision <- Precision(y_true = as.character(test$status_label),
                       y_pred = as.character(test$predicted_label))
recall <- Recall(y_true = as.character(test$status_label),
                 y_pred = as.character(test$predicted_label))
accuracy <- Accuracy(y_pred = as.character(test$predicted_label),
                     y_true = as.character(test$status_label))
f1 <- F1_Score(y_true = as.character(test$status_label),
               y_pred = as.character(test$predicted_label))


kable(data.frame(Precision = precision, Recall = recall, Accuracy = accuracy, F1_Score = f1))
Precision     Recall   Accuracy   F1_Score
        1  0.9336553  0.9336553  0.9656895

Step 7: Limitations and Future Work

The cluster-based approach does not use label information when forming groups and has no optimized decision boundaries, which limits prediction performance. Additionally, k-means assumes roughly spherical clusters of similar variance, assumptions that often fail for heavy-tailed financial data. Future work should investigate alternative clustering techniques (e.g., DBSCAN, Gaussian mixture models, or hierarchical methods) and supervised classification models built on the engineered features. Dimensionality reduction via PCA could enhance visualization and interpretability, and incorporating macroeconomic or temporal indicators could improve predictive utility.
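
As a concrete starting point for the Gaussian mixture suggestion, the mclust package fits mixtures with flexible covariance structures, relaxing the spherical-cluster assumption of k-means. A minimal sketch (mclust is an addition here and is not used elsewhere in this report):

library(mclust)  # Gaussian mixture models fit via EM

# Fit a two-component mixture on the standardized training features
gmm <- Mclust(train_scaled, G = 2)

# Cross-tabulate mixture components against true firm status, as in Step 6
table(gmm$classification, train$status_label)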

References