Executive Summary

This analysis investigates the potential of unsupervised learning, specifically k-means clustering, to predict corporate bankruptcy using financial statement data. Leveraging a dataset of 78,682 firm-year observations containing 18 numeric financial indicators, the study follows a structured data science workflow involving exploratory data analysis (EDA), data cleaning, feature engineering, clustering, and model evaluation.

Preprocessing steps included reclassification of variable types, verification of structural integrity, and outlier analysis. Financial indicators exhibited significant skewness and variance, typical of real-world economic data, and were normalized using z-score standardization. Two derived features, Return on Assets (ROA) and Debt Ratio, were engineered to incorporate domain knowledge into the model.

A k-means clustering model (k = 2) was built on the standardized training set to identify latent financial groupings that might map onto bankruptcy outcomes. Principal component analysis (PCA) visualization revealed two distinct but overlapping clusters. Mapping cluster assignments to actual bankruptcy labels using majority voting enabled evaluation on the test set.

The model achieved strong test-set performance, with precision of 1.000, recall of 0.9337, and an F1-score of 0.9657. These results indicate that the clustering algorithm, though unsupervised, aligned closely with true firm status given well-prepared input features. However, the approach remains limited by k-means’ geometric assumptions and its inability to produce probabilistic risk estimates.

Future enhancements could include alternative clustering methods (e.g., DBSCAN, GMM), dimensionality reduction (e.g., PCA), and integration of macroeconomic or temporal features. Ultimately, transitioning to supervised models may allow for optimized classification and cost-sensitive decision-making in real-world financial risk assessment contexts.

Introduction

Bankruptcy prediction remains a critical concern in corporate finance, providing stakeholders with foresight into financial instability. Financial ratios have historically demonstrated predictive utility in bankruptcy modeling (Altman, 1968). Contemporary research continues to explore advanced methods for selecting and transforming financial indicators for business crisis detection (Lin, Liang, & Chen, 2011). This study conducts a comprehensive exploratory data analysis (EDA) and clustering analysis of U.S. corporate bankruptcy data to identify patterns distinguishing active versus bankrupt companies. The methodology adheres to standard EDA principles, including data cleaning, variable reclassification, graphical and statistical analysis, and unsupervised clustering. The objective is to uncover latent groupings among firms and evaluate the potential of cluster-based labels to predict firm status. Cluster analysis, particularly k-means, has been a foundational approach in unsupervised learning (Jain, 2010; Xu & Wunsch, 2005). All analysis is performed in R and adheres to APA-style reporting conventions (Saunders, Lewis, & Thornhill, 2019).


Step 1: Data Preprocessing

Data preprocessing ensures the dataset is clean, structured, and appropriate for analysis. This step involves identifying variable types, assessing data structure, evaluating missingness and anomalies, and performing recoding as necessary. Preprocessing establishes the foundational data integrity required for robust exploratory and predictive analytics.
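
The code chunks in this report rely on several packages that are assumed to be attached up front; the list below is inferred from the functions used throughout (a minimal setup chunk):

library(dplyr)       # select(), mutate(), and the %>% pipe
library(ggplot2)     # histograms, boxplots, and tile plots
library(reshape2)    # melt()
library(purrr)       # map()
library(gridExtra)   # grid.arrange()
library(psych)       # describe()
library(knitr)       # kable()
library(caret)       # createDataPartition(), confusionMatrix()
library(factoextra)  # fviz_cluster()
library(MLmetrics)   # Precision(), Recall(), Accuracy(), F1_Score()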

1.1 Load Data and Inspect Structure

data <- read.csv("C:/Users/Kat/Desktop/Bankruptcy Data/american_bankruptcy.csv")
str(data)
## 'data.frame':    78682 obs. of  21 variables:
##  $ company_name: chr  "C_1" "C_1" "C_1" "C_1" ...
##  $ status_label: chr  "alive" "alive" "alive" "alive" ...
##  $ year        : int  1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
##  $ X1          : num  511 486 437 396 432 ...
##  $ X2          : num  833 714 526 497 523 ...
##  $ X3          : num  18.4 18.6 22.5 27.2 26.7 ...
##  $ X4          : num  89 64.4 27.2 30.7 47.5 ...
##  $ X5          : num  336 321 287 260 247 ...
##  $ X6          : num  35.2 18.5 -58.9 -12.4 3.5 ...
##  $ X7          : num  128.3 115.2 77.5 66.3 104.7 ...
##  $ X8          : num  373 377 365 143 309 ...
##  $ X9          : num  1024 874 639 606 652 ...
##  $ X10         : num  741 702 710 687 709 ...
##  $ X11         : num  180 180 218 165 249 ...
##  $ X12         : num  70.66 45.79 4.71 3.57 20.81 ...
##  $ X13         : num  191 160 112 110 129 ...
##  $ X14         : num  164 125 150 204 131 ...
##  $ X15         : num  201 204 140 124 132 ...
##  $ X16         : num  1024 874 639 606 652 ...
##  $ X17         : num  401 362 400 392 408 ...
##  $ X18         : num  935 810 612 576 604 ...
kable(head(data))
company_name status_label year X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18
C_1 alive 1999 511.267 833.107 18.373 89.031 336.018 35.163 128.348 372.7519 1024.333 740.998 180.447 70.658 191.226 163.816 201.026 1024.333 401.483 935.302
C_1 alive 2000 485.856 713.811 18.577 64.367 320.590 18.531 115.187 377.1180 874.255 701.854 179.987 45.790 160.444 125.392 204.065 874.255 361.642 809.888
C_1 alive 2001 436.656 526.477 22.496 27.207 286.588 -58.939 77.528 364.5928 638.721 710.199 217.699 4.711 112.244 150.464 139.603 638.721 399.964 611.514
C_1 alive 2002 396.412 496.747 27.172 30.745 259.954 -12.410 66.322 143.3295 606.337 686.621 164.658 3.573 109.590 203.575 124.106 606.337 391.633 575.592
C_1 alive 2003 432.204 523.302 26.680 47.491 247.245 3.504 104.661 308.9071 651.958 709.292 248.666 20.811 128.656 131.261 131.884 651.958 407.608 604.467
C_1 alive 2004 474.542 598.172 27.950 61.774 255.477 15.453 127.121 522.6794 747.848 732.230 227.159 33.824 149.676 160.025 142.450 747.848 417.486 686.074

1.2 Identify Variable Types and Recategorize

This step ensures all variables are correctly classified as numeric, categorical, or identifiers. Accurate data typing is critical for valid statistical inference and appropriate modeling choices.

data$status_label <- as.factor(data$status_label)
data$company_name <- as.character(data$company_name)
data$year <- as.integer(data$year)
summary(data$status_label)
##  alive failed 
##  73462   5220
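
The class distribution is heavily imbalanced: failed firms account for roughly 6.6% of observations (5,220 of 78,682). A quick check of the proportions:

round(prop.table(table(data$status_label)), 4)
##  alive failed 
## 0.9337 0.0663

This imbalance matters later: a method that favors the majority class can score well on accuracy alone, which is why precision, recall, and F1 are reported alongside accuracy in Step 6.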

1.3 Evaluate Missing Data and Outliers

Missing data can bias results, and outliers can distort distributions and unduly influence models. This step identifies and assesses anomalies to determine whether imputation or filtering is required.

kable(sapply(data, function(x) sum(is.na(x))))
Variable Missing
company_name 0
status_label 0
year 0
X1 0
X2 0
X3 0
X4 0
X5 0
X6 0
X7 0
X8 0
X9 0
X10 0
X11 0
X12 0
X13 0
X14 0
X15 0
X16 0
X17 0
X18 0
data_numeric <- data %>% select(where(is.numeric))
melted <- melt(data_numeric)  # long format: one row per (variable, value) pair
ggplot(melted, aes(x = variable, y = value)) +
  geom_boxplot(outlier.color = "red") +  # outliers flagged in red
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Boxplots of Numeric Variables")

Step 2: Exploratory Data Analysis

EDA provides insight into data distributions and interrelationships through descriptive statistics and visualizations. This helps detect patterns, guide feature engineering, and inform model selection.

2.1 Univariate Analysis

Univariate analysis evaluates the distribution of individual variables (Saunders et al., 2019). It helps identify patterns such as skewness, modality, and variability, which inform appropriate transformation techniques or highlight potential data quality issues (Lin et al., 2011).

describe(data_numeric)
##      vars     n    mean       sd  median trimmed    mad        min     max
## year    1 78682 2007.51     5.74 2007.00 2007.33   7.41    1999.00    2018
## X1      2 78682  880.36  3928.56  100.45  245.50 141.88      -7.76  169662
## X2      3 78682 1594.53  8930.48  103.66  356.63 149.69    -366.64  374623
## X3      4 78682  121.23   652.38    7.93   26.83  11.46       0.00   28430
## X4      5 78682  376.76  2012.02   15.03   76.53  38.18  -21913.00   81730
## X5      6 78682  201.61  1060.77    7.02   42.86  10.41       0.00   62567
## X6      7 78682  129.38  1265.53    1.62   19.34  26.45  -98696.00  104821
## X7      8 78682  286.83  1335.98   22.82   72.21  33.34      -0.01   65812
## X8      9 78682 3414.35 18414.10  227.51  705.05 325.43       0.00 1073391
## X9     10 78682 2364.02 11950.07  186.60  582.85 271.46   -1965.00  511729
## X10    11 78682 2867.11 12917.94  213.20  668.95 303.69       0.00  531864
## X11    12 78682  722.48  3242.17    7.59  140.23  11.26      -0.02  166250
## X12    13 78682  255.53  1494.64    6.52   47.30  30.37  -25913.00   71230
## X13    14 78682  769.49  3774.70   63.58  192.17  94.06  -21536.00  137106
## X14    15 78682  610.07  2938.39   43.33  131.31  60.35       0.00  116866
## X15    16 78682  532.47  6369.16   -1.13   56.89 139.16 -102362.00  402089
## X16    17 78682 2364.02 11950.07  186.60  582.85 271.46   -1965.00  511729
## X17    18 78682 1773.56  8053.68   81.99  370.59 117.39       0.00  337980
## X18    19 78682 1987.26 10419.63  168.91  492.77 237.46    -317.20  481580
##           range  skew kurtosis    se
## year      19.00  0.19    -1.17  0.02
## X1    169669.76 14.84   337.46 14.01
## X2    374989.64 20.05   577.29 31.84
## X3     28430.00 17.86   445.40  2.33
## X4    103643.00 16.40   399.81  7.17
## X5     62567.00 22.57   814.44  3.78
## X6    203517.00 11.87  1496.19  4.51
## X7     65812.01 15.84   400.32  4.76
## X8   1073390.54 18.19   540.96 65.65
## X9    513694.00 18.99   546.01 42.60
## X10   531864.00 13.56   272.64 46.05
## X11   166250.02 14.85   387.42 11.56
## X12    97143.00 17.97   528.46  5.33
## X13   158642.00 15.30   335.11 13.46
## X14   116866.00 14.22   293.58 10.48
## X15   504451.00 29.61  1544.01 22.71
## X16   513694.00 18.99   546.01 42.60
## X17   337980.00 13.77   296.42 28.71
## X18   481897.20 20.38   631.40 37.15
histograms <- map(names(data_numeric), function(var) {
  ggplot(data, aes(x = .data[[var]])) +  # the .data pronoun replaces the deprecated aes_string()
    geom_histogram(bins = 30, fill = "blue", color = "white") +
    ggtitle(paste("Histogram of", var))
})

grid.arrange(grobs = histograms[1:6], ncol = 3)
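
Given the extreme right skew reported above (skewness values between roughly 12 and 30), a log-style transform is a common remedy before distance-based modeling. The sketch below uses a signed log1p so that the negative values present in several indicators are handled; it is illustrative only, since this analysis relies on z-score standardization alone.

# Signed log1p: compresses heavy tails while preserving the sign of negative values
signed_log1p <- function(x) sign(x) * log1p(abs(x))

ggplot(data, aes(x = signed_log1p(X2))) +
  geom_histogram(bins = 30, fill = "blue", color = "white") +
  ggtitle("Histogram of X2 after signed log1p transform")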

2.2 Multivariate Analysis

Multivariate analysis assesses relationships between variables (Jain, 2010; Xu & Wunsch, 2005). These relationships are useful for detecting redundancy, multicollinearity, or natural groupings that may inform modeling strategies or feature selection (Saunders et al., 2019).

cor_mat <- round(cor(data_numeric), 2)
heatmap(cor_mat, symm = TRUE, col = topo.colors(10), margins = c(6,6))
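
The descriptive statistics above already hint at redundancy: X9 and X16 report identical distributions. Flagging highly correlated columns programmatically complements the heatmap; a sketch using caret::findCorrelation, which returns the indices of columns whose removal would reduce the largest pairwise correlations:

# Columns involved in pairwise correlations above 0.95
high_cor <- findCorrelation(cor(data_numeric), cutoff = 0.95)
names(data_numeric)[high_cor]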

Step 3: Feature Engineering

Feature engineering transforms raw inputs into meaningful predictors. Constructed features such as financial ratios are well established in the literature as indicators of financial distress (Altman, 1968; Lin et al., 2011). Embedding this domain-specific logic enhances model interpretability and can improve clustering and classification performance (Saunders et al., 2019).

# Derived ratios: the small constant in each denominator guards against division by zero
data <- data %>% mutate(debt_ratio = X5 / (X4 + 1e-5),
                        roa = X3 / (X4 + 1e-5))

Step 4: Data Splitting and Scaling

Splitting into training and test sets ensures model performance is evaluated on unseen data, thereby reducing overfitting and preserving generalizability (Saunders et al., 2019). Standardizing variables is essential for clustering algorithms that rely on distance metrics, as it prevents variables measured on large scales from dominating the distance computation (Jain, 2010).

set.seed(42)
index <- createDataPartition(data$status_label, p = 0.7, list = FALSE)
train <- data[index, ]
test <- data[-index, ]

scale_cols <- names(train)[4:21]  # the 18 raw indicators X1–X18 (engineered ratios excluded)
train_means <- colMeans(train[, scale_cols])
train_sds <- apply(train[, scale_cols], 2, sd)
train_scaled <- as.data.frame(scale(train[, scale_cols], center = train_means, scale = train_sds))
# Reuse the training parameters so train and test share one feature space
test_scaled <- as.data.frame(scale(test[, scale_cols], center = train_means, scale = train_sds))

Step 5: Clustering Analysis

K-means clustering (k=2) segments observations into groups by minimizing within-cluster variance. The goal is to determine if unsupervised grouping aligns with bankruptcy status.

k2 <- kmeans(train_scaled, centers = 2, nstart = 25)  # 25 random starts guard against poor initialization
fviz_cluster(k2, data = train_scaled) + ggtitle("2-Means Clustering")  # plotted on the first two principal components

train$cluster <- as.factor(k2$cluster)
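
The choice of k = 2 mirrors the binary outcome of interest, but it is worth checking against the data itself. A quick elbow-method diagnostic (a sketch using factoextra on the same standardized training set; total within-cluster sum of squares is plotted for each candidate k):

# Look for the "elbow" where additional clusters stop reducing WSS appreciably
fviz_nbclust(train_scaled, kmeans, method = "wss", k.max = 10)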

Step 6: Cluster Evaluation on Test Set

Cluster validity is assessed using supervised labels in the test set. While clustering is an unsupervised method, post-hoc evaluation against known classes provides a quantitative estimate of alignment with actual outcomes (Xu & Wunsch, 2005). Precision, recall, accuracy, and F1-score are standard metrics for this assessment (Saunders et al., 2019).

# Majority vote: map each training cluster to its most common true label
mapping <- table(train$cluster, train$status_label)
cluster_to_label <- apply(mapping, 1, function(row) names(which.max(row)))

# Assign each test observation to its nearest training centroid; re-fitting
# k-means on the test set could move or renumber the centers and break the mapping
nearest_center <- function(row) which.min(colSums((t(k2$centers) - row)^2))
test$cluster <- as.factor(apply(as.matrix(test_scaled), 1, nearest_center))
test$predicted_label <- as.factor(cluster_to_label[as.numeric(test$cluster)])

conf <- confusionMatrix(test$predicted_label, test$status_label)
conf_matrix <- as.data.frame(conf$table)

conf_plot <- ggplot(conf_matrix, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), color = "white") +
  geom_text(aes(label = Freq), size = 6, color = "black") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(
    title = "Confusion Matrix",
    x = "Actual Label",
    y = "Predicted Label"
  ) +
  theme_minimal(base_size = 14)

print(conf_plot)

# MLmetrics expects y_true first for Precision, Recall, and F1_Score (Accuracy takes y_pred first)
precision <- Precision(y_true = as.character(test$status_label),
                       y_pred = as.character(test$predicted_label))
recall <- Recall(y_true = as.character(test$status_label),
                 y_pred = as.character(test$predicted_label))
accuracy <- Accuracy(y_pred = as.character(test$predicted_label),
                     y_true = as.character(test$status_label))
f1 <- F1_Score(y_true = as.character(test$status_label),
               y_pred = as.character(test$predicted_label))


kable(data.frame(Precision = precision, Recall = recall, Accuracy = accuracy, F1_Score = f1))
Precision     Recall   Accuracy   F1_Score
        1  0.9336553  0.9336553  0.9656895

Step 7: Limitations and Future Work

The cluster-based approach does not use label information when forming groups and has no optimized decision boundaries, which limits prediction performance. Additionally, k-means assumes roughly spherical clusters of similar variance, assumptions that often fail for heavy-tailed financial data. Future work should investigate alternative clustering techniques (e.g., DBSCAN, Gaussian mixture models, or hierarchical methods) and supervised classification models built on the engineered features. Dimensionality reduction via PCA could enhance visualization and interpretability, and incorporating macroeconomic or temporal indicators could improve predictive utility.
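
As a concrete starting point for the Gaussian mixture suggestion, the mclust package fits mixtures with flexible covariance structures, relaxing the spherical-cluster assumption of k-means. A minimal sketch (mclust is an addition here and is not used elsewhere in this report):

library(mclust)  # Gaussian mixture models fit via EM

# Fit a two-component mixture on the standardized training features
gmm <- Mclust(train_scaled, G = 2)

# Cross-tabulate mixture components against true firm status, as in Step 6
table(gmm$classification, train$status_label)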

References