This analysis investigates the potential of unsupervised learning, specifically k-means clustering, to predict corporate bankruptcy using financial statement data. Leveraging a dataset of 78,682 firm-year observations containing 18 numeric financial indicators, the study follows a structured data science workflow involving exploratory data analysis (EDA), data cleaning, feature engineering, clustering, and model evaluation.
Preprocessing steps included reclassification of variable types, verification of structural integrity, and outlier analysis. Financial indicators exhibited significant skewness and variance, typical of real-world economic data, and were normalized using z-score standardization. Two derived features, Return on Assets (ROA) and Debt Ratio, were engineered to incorporate domain knowledge into the model.
A k-means clustering model (k = 2) was built on the standardized training set to identify latent financial groupings that might map onto bankruptcy outcomes. Principal component analysis (PCA) visualization revealed two distinct but overlapping clusters. Mapping cluster assignments to actual bankruptcy labels using majority voting enabled evaluation on the test set.
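The model reported precision = 1.000, recall = 0.9337, and F1-score = 0.9657 on the test set. On their face, these results suggest that the clustering algorithm, though unsupervised, aligned with true firm status. They should be read against the class imbalance, however: the reported accuracy of 0.9337 equals the share of alive firms in the data (73,462 of 78,682), which is the pattern expected when nearly every observation receives the majority label, so the scores alone do not show that the clusters separate failed firms from alive ones. The approach also remains limited by k-means' geometric assumptions and its inability to produce probabilistic risk estimates.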
Future enhancements could include alternative clustering methods (e.g., DBSCAN, GMM), dimensionality reduction (e.g., PCA), and integration of macroeconomic or temporal features. Ultimately, transitioning to supervised models may allow for optimized classification and cost-sensitive decision-making in real-world financial risk assessment contexts.
Bankruptcy prediction remains a critical concern in corporate finance, providing stakeholders with foresight into financial instability. Financial ratios have historically demonstrated predictive utility in bankruptcy modeling (Altman, 1968). Contemporary research continues to explore advanced methods for selecting and transforming financial indicators for business crisis detection (Lin, Liang, & Chen, 2011). This study conducts a comprehensive exploratory data analysis (EDA) and clustering analysis of U.S. corporate bankruptcy data to identify patterns distinguishing active versus bankrupt companies. The methodology adheres to standard EDA principles, including data cleaning, variable reclassification, graphical and statistical analysis, and unsupervised clustering. The objective is to uncover latent groupings among firms and evaluate the potential of cluster-based labels to predict firm status. Cluster analysis, particularly k-means, has been a foundational approach in unsupervised learning (Jain, 2010; Xu & Wunsch, 2005). All analysis is performed in R and adheres to APA-style reporting conventions (Saunders, Lewis, & Thornhill, 2019).
Data preprocessing ensures the dataset is clean, structured, and appropriate for analysis. This step involves identifying variable types, assessing data structure, evaluating missingness and anomalies, and performing recoding as necessary. Preprocessing establishes the foundational data integrity required for robust exploratory and predictive analytics.
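# Packages inferred from the functions called in this report
library(knitr)       # kable()
library(dplyr)       # %>%, select(), mutate()
library(reshape2)    # melt() (assumed source of melt)
library(ggplot2)     # plots
library(psych)       # describe()
library(purrr)       # map()
library(gridExtra)   # grid.arrange()
library(caret)       # createDataPartition(), confusionMatrix()
library(factoextra)  # fviz_cluster()
library(MLmetrics)   # Precision(), Recall(), Accuracy(), F1_Score()
data <- read.csv("C:/Users/Kat/Desktop/Bankruptcy Data/american_bankruptcy.csv")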
str(data)
## 'data.frame': 78682 obs. of 21 variables:
## $ company_name: chr "C_1" "C_1" "C_1" "C_1" ...
## $ status_label: chr "alive" "alive" "alive" "alive" ...
## $ year : int 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
## $ X1 : num 511 486 437 396 432 ...
## $ X2 : num 833 714 526 497 523 ...
## $ X3 : num 18.4 18.6 22.5 27.2 26.7 ...
## $ X4 : num 89 64.4 27.2 30.7 47.5 ...
## $ X5 : num 336 321 287 260 247 ...
## $ X6 : num 35.2 18.5 -58.9 -12.4 3.5 ...
## $ X7 : num 128.3 115.2 77.5 66.3 104.7 ...
## $ X8 : num 373 377 365 143 309 ...
## $ X9 : num 1024 874 639 606 652 ...
## $ X10 : num 741 702 710 687 709 ...
## $ X11 : num 180 180 218 165 249 ...
## $ X12 : num 70.66 45.79 4.71 3.57 20.81 ...
## $ X13 : num 191 160 112 110 129 ...
## $ X14 : num 164 125 150 204 131 ...
## $ X15 : num 201 204 140 124 132 ...
## $ X16 : num 1024 874 639 606 652 ...
## $ X17 : num 401 362 400 392 408 ...
## $ X18 : num 935 810 612 576 604 ...
kable(head(data))
| company_name | status_label | year | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C_1 | alive | 1999 | 511.267 | 833.107 | 18.373 | 89.031 | 336.018 | 35.163 | 128.348 | 372.7519 | 1024.333 | 740.998 | 180.447 | 70.658 | 191.226 | 163.816 | 201.026 | 1024.333 | 401.483 | 935.302 |
| C_1 | alive | 2000 | 485.856 | 713.811 | 18.577 | 64.367 | 320.590 | 18.531 | 115.187 | 377.1180 | 874.255 | 701.854 | 179.987 | 45.790 | 160.444 | 125.392 | 204.065 | 874.255 | 361.642 | 809.888 |
| C_1 | alive | 2001 | 436.656 | 526.477 | 22.496 | 27.207 | 286.588 | -58.939 | 77.528 | 364.5928 | 638.721 | 710.199 | 217.699 | 4.711 | 112.244 | 150.464 | 139.603 | 638.721 | 399.964 | 611.514 |
| C_1 | alive | 2002 | 396.412 | 496.747 | 27.172 | 30.745 | 259.954 | -12.410 | 66.322 | 143.3295 | 606.337 | 686.621 | 164.658 | 3.573 | 109.590 | 203.575 | 124.106 | 606.337 | 391.633 | 575.592 |
| C_1 | alive | 2003 | 432.204 | 523.302 | 26.680 | 47.491 | 247.245 | 3.504 | 104.661 | 308.9071 | 651.958 | 709.292 | 248.666 | 20.811 | 128.656 | 131.261 | 131.884 | 651.958 | 407.608 | 604.467 |
| C_1 | alive | 2004 | 474.542 | 598.172 | 27.950 | 61.774 | 255.477 | 15.453 | 127.121 | 522.6794 | 747.848 | 732.230 | 227.159 | 33.824 | 149.676 | 160.025 | 142.450 | 747.848 | 417.486 | 686.074 |
This step ensures all variables are correctly classified as numeric, categorical, or identifiers. Accurate data typing is critical for valid statistical inference and appropriate modeling choices.
data$status_label <- as.factor(data$status_label)
data$company_name <- as.character(data$company_name)
data$year <- as.integer(data$year)
summary(data$status_label)
## alive failed
## 73462 5220
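The label counts above imply a strong class imbalance, which matters for every accuracy figure reported later. The following is a minimal sketch, using only the two counts printed by `summary(data$status_label)`, of the class shares and of the accuracy a trivial "always alive" rule would reach. The variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the summary(data$status_label) output above.
status_counts <- c(alive = 73462, failed = 5220)

# Share of each class among all firm-year observations.
class_share <- status_counts / sum(status_counts)
print(round(class_share, 4))

# Accuracy of a trivial rule that labels every observation "alive".
baseline_accuracy <- unname(class_share["alive"])
print(round(baseline_accuracy, 4))
```

Any classifier or cluster mapping is usefully compared against this baseline before its accuracy is interpreted.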
Missing data can bias results and outliers can distort distributions and influence models. This step identifies and assesses anomalies to determine if imputation or filtering is required.
kable(sapply(data, function(x) sum(is.na(x))))
| Variable | Missing values |
|---|---|
| company_name | 0 |
| status_label | 0 |
| year | 0 |
| X1 | 0 |
| X2 | 0 |
| X3 | 0 |
| X4 | 0 |
| X5 | 0 |
| X6 | 0 |
| X7 | 0 |
| X8 | 0 |
| X9 | 0 |
| X10 | 0 |
| X11 | 0 |
| X12 | 0 |
| X13 | 0 |
| X14 | 0 |
| X15 | 0 |
| X16 | 0 |
| X17 | 0 |
| X18 | 0 |
data_numeric <- data %>% select(where(is.numeric))
melted <- melt(data_numeric)
ggplot(melted, aes(x = variable, y = value)) +
geom_boxplot(outlier.color = "red") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Boxplots of Numeric Variables")
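The boxplots flag outliers visually. A numeric count per variable can complement them. The sketch below applies the same 1.5 × IQR fences that boxplots use by default. The helper `count_iqr_outliers` and the toy data frame are hypothetical stand-ins, not part of the original analysis.

```r
# Hypothetical helper: count values outside the Tukey fences (k * IQR).
count_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Toy data standing in for data_numeric: one skewed column, one flat column.
set.seed(1)
toy <- data.frame(
  skewed = c(rlnorm(200), 1e4),    # right tail plus one extreme value
  flat   = rep(c(1, 2, 3), 67)     # nothing beyond the fences
)

outlier_counts <- sapply(toy, count_iqr_outliers)
print(outlier_counts)
```

Applied to the real data, the call would be `sapply(data_numeric, count_iqr_outliers)`, assuming `data_numeric` is defined as above.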
EDA provides insight into data distributions and interrelationships through descriptive statistics and visualizations. This helps detect patterns, guide feature engineering, and inform model selection.
Univariate analysis evaluates the distribution of individual variables (Saunders et al., 2019). It identifies patterns such as skewness, modality, and variability, which inform the choice of transformations and can highlight data quality issues (Lin et al., 2011).
describe(data_numeric)
## vars n mean sd median trimmed mad min max
## year 1 78682 2007.51 5.74 2007.00 2007.33 7.41 1999.00 2018
## X1 2 78682 880.36 3928.56 100.45 245.50 141.88 -7.76 169662
## X2 3 78682 1594.53 8930.48 103.66 356.63 149.69 -366.64 374623
## X3 4 78682 121.23 652.38 7.93 26.83 11.46 0.00 28430
## X4 5 78682 376.76 2012.02 15.03 76.53 38.18 -21913.00 81730
## X5 6 78682 201.61 1060.77 7.02 42.86 10.41 0.00 62567
## X6 7 78682 129.38 1265.53 1.62 19.34 26.45 -98696.00 104821
## X7 8 78682 286.83 1335.98 22.82 72.21 33.34 -0.01 65812
## X8 9 78682 3414.35 18414.10 227.51 705.05 325.43 0.00 1073391
## X9 10 78682 2364.02 11950.07 186.60 582.85 271.46 -1965.00 511729
## X10 11 78682 2867.11 12917.94 213.20 668.95 303.69 0.00 531864
## X11 12 78682 722.48 3242.17 7.59 140.23 11.26 -0.02 166250
## X12 13 78682 255.53 1494.64 6.52 47.30 30.37 -25913.00 71230
## X13 14 78682 769.49 3774.70 63.58 192.17 94.06 -21536.00 137106
## X14 15 78682 610.07 2938.39 43.33 131.31 60.35 0.00 116866
## X15 16 78682 532.47 6369.16 -1.13 56.89 139.16 -102362.00 402089
## X16 17 78682 2364.02 11950.07 186.60 582.85 271.46 -1965.00 511729
## X17 18 78682 1773.56 8053.68 81.99 370.59 117.39 0.00 337980
## X18 19 78682 1987.26 10419.63 168.91 492.77 237.46 -317.20 481580
## range skew kurtosis se
## year 19.00 0.19 -1.17 0.02
## X1 169669.76 14.84 337.46 14.01
## X2 374989.64 20.05 577.29 31.84
## X3 28430.00 17.86 445.40 2.33
## X4 103643.00 16.40 399.81 7.17
## X5 62567.00 22.57 814.44 3.78
## X6 203517.00 11.87 1496.19 4.51
## X7 65812.01 15.84 400.32 4.76
## X8 1073390.54 18.19 540.96 65.65
## X9 513694.00 18.99 546.01 42.60
## X10 531864.00 13.56 272.64 46.05
## X11 166250.02 14.85 387.42 11.56
## X12 97143.00 17.97 528.46 5.33
## X13 158642.00 15.30 335.11 13.46
## X14 116866.00 14.22 293.58 10.48
## X15 504451.00 29.61 1544.01 22.71
## X16 513694.00 18.99 546.01 42.60
## X17 337980.00 13.77 296.42 28.71
## X18 481897.20 20.38 631.40 37.15
histograms <- map(names(data_numeric), function(var) {
  # .data[[var]] replaces the deprecated aes_string()
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(bins = 30, fill = "blue", color = "white") +
    labs(x = var, title = paste("Histogram of", var))
})
grid.arrange(grobs = histograms[1:6], ncol = 3)
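The `describe()` output reports skew values between 11.87 and 29.61 for X1 to X18, and several variables have negative minima, so a plain logarithm is not defined for all values. As one possible response (an assumption, not a step the original analysis performs), the sketch below applies a sign-preserving log transform and compares skewness before and after on toy data. Both helper functions are hypothetical.

```r
# Hypothetical helper: sign-preserving log transform, defined for negatives.
signed_log1p <- function(x) sign(x) * log1p(abs(x))

# Moment-based sample skewness; psych::describe() may differ slightly
# depending on its `type` setting.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy data: heavily right-skewed values with a few negatives mixed in.
set.seed(2)
toy <- c(rlnorm(1000, meanlog = 4, sdlog = 2), -50, -5)

skew_raw <- skewness(toy)
skew_log <- skewness(signed_log1p(toy))
print(round(c(raw = skew_raw, transformed = skew_log), 2))
```

Whether such a transform helps k-means on the real data would need to be checked empirically; z-score standardization alone does not change skewness.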
Multivariate analysis assesses relationships between variables (Jain, 2010; Xu & Wunsch, 2005). These relationships help detect redundancy, multicollinearity, and natural groupings that may inform modeling strategy or feature selection (Saunders et al., 2019).
cor_mat <- round(cor(data_numeric), 2)
heatmap(cor_mat, symm = TRUE, col = topo.colors(10), margins = c(6,6))
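A heatmap shows redundancy only qualitatively. In the `describe()` output, X9 and X16 have identical summary statistics, which suggests duplicated columns. The sketch below lists variable pairs whose absolute correlation exceeds a cutoff. The helper `high_cor_pairs`, the 0.95 cutoff, and the toy data are illustrative assumptions.

```r
# Hypothetical helper: list variable pairs with |correlation| above a cutoff.
high_cor_pairs <- function(df, cutoff = 0.95) {
  cm <- cor(df)
  # Keep each pair once by looking only at the upper triangle.
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data: `a_copy` duplicates `a`; `b` is independent noise.
set.seed(3)
a <- rnorm(100)
toy <- data.frame(a = a, a_copy = a, b = rnorm(100))

pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```

On the real data the call would be `high_cor_pairs(data_numeric)`. Dropping one member of each near-duplicate pair keeps it from being counted twice in the Euclidean distances k-means relies on.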
Feature engineering transforms raw inputs into meaningful predictors. Constructed features such as financial ratios are supported in the literature as predictive indicators of financial distress (Altman, 1968; Lin et al., 2011). By embedding domain-specific ratios or interactions into the analysis, feature engineering enhances interpretability and can improve clustering separation and classification performance (Saunders et al., 2019).
data <- data %>% mutate(debt_ratio = X5 / (X4 + 1e-5),
roa = X3 / (X4 + 1e-5))
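Adding a small constant to the denominator guards against exact zeros, but the `describe()` output shows that X4 takes negative values (minimum -21913), so denominators close to zero can still occur. One alternative, sketched below under stated assumptions, returns `NA` whenever the denominator is too close to zero. The helper `safe_ratio` and the toy vectors are hypothetical and carry no meaning from the dataset.

```r
# Hypothetical helper: divide only when the denominator is safely away from 0.
safe_ratio <- function(num, den, eps = 1e-5) {
  ifelse(abs(den) < eps, NA_real_, num / den)
}

# Toy values; the roles of numerator and denominator are illustrative only.
num <- c(10, 5, 3, 8)
den <- c(100, 0, -1e-5 / 2, -20)

ratio <- safe_ratio(num, den)
print(ratio)
```

Rows with `NA` ratios would then need explicit handling (removal or imputation) before scaling and clustering, since `kmeans()` does not accept missing values.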
Splitting the data into training and test sets ensures that model performance is evaluated on unseen data, which guards against overfitting and gives a fairer estimate of generalization (Saunders et al., 2019). Standardizing variables is essential for distance-based algorithms such as k-means, because it prevents variables with large scales from dominating the distance calculation (Jain, 2010).
set.seed(42)
index <- createDataPartition(data$status_label, p = 0.7, list = FALSE)
train <- data[index, ]
test <- data[-index, ]
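# Columns 4:21 are X1 to X18; debt_ratio and roa (columns 22 and 23) fall outside this range
scale_cols <- names(train)[4:21]
train_scaled <- as.data.frame(scale(train[, scale_cols]))
# The test set is standardized here with its own mean and standard deviation
test_scaled <- as.data.frame(scale(test[, scale_cols]))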
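A common convention is to estimate the scaling parameters on the training set only and reuse them for the test set, so that both sets live in the same standardized space. The sketch below shows this with base R's `scale()` on toy data. It illustrates the convention and is not the procedure used above.

```r
# Toy train/test split standing in for the real data.
set.seed(4)
toy_train <- data.frame(x1 = rnorm(70, mean = 100, sd = 20), x2 = rexp(70))
toy_test  <- data.frame(x1 = rnorm(30, mean = 100, sd = 20), x2 = rexp(30))

# Fit the scaling on the training data only.
train_z <- scale(toy_train)
mu    <- attr(train_z, "scaled:center")
sigma <- attr(train_z, "scaled:scale")

# Apply the training mean and standard deviation to the test data.
test_z <- scale(toy_test, center = mu, scale = sigma)

print(round(colMeans(train_z), 10))  # training columns are centered at 0
print(round(colMeans(test_z), 3))    # test columns are close to, not exactly, 0
```

With this approach, a test observation is compared with the training centroids on the same scale the centroids were estimated on.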
K-means clustering (k=2) segments observations into groups by minimizing within-cluster variance. The goal is to determine if unsupervised grouping aligns with bankruptcy status.
k2 <- kmeans(train_scaled, centers = 2, nstart = 25)
fviz_cluster(k2, data = train_scaled) + ggtitle("2-Means Clustering")
train$cluster <- as.factor(k2$cluster)
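To make the "minimizing within-cluster variance" objective concrete, the sketch below recomputes the total within-cluster sum of squares by hand and compares it across several values of k (the usual elbow heuristic). The toy data and the range k = 1 to 5 are assumptions for illustration. The analysis above fixes k = 2.

```r
# Toy data: two well-separated Gaussian blobs in two dimensions.
set.seed(5)
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..5.
wss <- sapply(1:5, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
print(round(wss, 1))

# Recompute the k = 2 objective by hand: squared distance of each point
# to the centroid of the cluster it was assigned to.
fit <- kmeans(toy, centers = 2, nstart = 10)
manual_wss <- sum((toy - fit$centers[fit$cluster, ])^2)
print(c(manual = manual_wss, reported = fit$tot.withinss))
```

A sharp drop from k = 1 to k = 2 followed by small gains indicates two natural groups in the toy data. The same check on `train_scaled` would show whether k = 2 is supported by the real data.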
Cluster validity is assessed using the supervised labels in the test set. Although clustering is unsupervised, post-hoc evaluation against known classes gives a quantitative estimate of how well the clusters align with actual outcomes (Xu & Wunsch, 2005). Precision, recall, accuracy, and F1-score are standard metrics for this assessment (Saunders et al., 2019).
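# Cross-tabulate training clusters against the known labels
mapping <- table(train$cluster, train$status_label)
# Majority vote: each cluster takes the label it contains most often
cluster_to_label <- apply(mapping, 1, function(row) names(which.max(row)))
# Re-run k-means on the test set, starting from the training centroids
test_pred <- kmeans(test_scaled, centers = k2$centers, nstart = 1)
test$cluster <- as.factor(test_pred$cluster)
# Translate test clusters into predicted labels
test$predicted_label <- as.factor(cluster_to_label[as.numeric(test$cluster)])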
conf <- confusionMatrix(test$predicted_label, test$status_label)
conf_matrix <- as.data.frame(conf$table)
conf_plot <- ggplot(conf_matrix, aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = Freq), color = "white") +
geom_text(aes(label = Freq), size = 6, color = "black") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(
title = "Confusion Matrix",
x = "Actual Label",
y = "Predicted Label"
) +
theme_minimal(base_size = 14)
print(conf_plot)
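The test-set step above calls `kmeans()` on the test data with the training centroids as starting values, which allows the centroids to move during fitting. An alternative is to keep the training centroids fixed and assign each test observation to the nearest one. The sketch below does this with a hypothetical helper, `assign_to_centroids`, on toy data.

```r
# Hypothetical helper: index of the closest centroid for each row of `x`.
assign_to_centroids <- function(x, centers) {
  x <- as.matrix(x)
  # Squared Euclidean distance from every row to every centroid.
  d2 <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(x, 2, centers[j, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(x))      # keep matrix shape for a single row
  max.col(-d2, ties.method = "first")   # column with the smallest distance
}

# Toy training data: two well-separated blobs.
set.seed(6)
toy_train <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)
fit <- kmeans(toy_train, centers = 2, nstart = 10)

# On the training data the helper reproduces kmeans' own assignments.
train_assign <- assign_to_centroids(toy_train, fit$centers)

# New points are assigned without moving the centroids.
toy_test <- rbind(c(0, 0), c(8, 8))
test_assign <- assign_to_centroids(toy_test, fit$centers)
print(test_assign)
```

With fixed centroids, the cluster-to-label mapping learned on the training set keeps its meaning when applied to the test set.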
precision <- Precision(as.character(test$predicted_label), as.character(test$status_label))
recall <- Recall(as.character(test$predicted_label), as.character(test$status_label))
accuracy <- Accuracy(as.character(test$predicted_label), as.character(test$status_label))
f1 <- F1_Score(as.character(test$predicted_label), as.character(test$status_label))
kable(data.frame(Precision = precision, Recall = recall, Accuracy = accuracy, F1_Score = f1))
| Precision | Recall | Accuracy | F1_Score |
|---|---|---|---|
| 1 | 0.9336553 | 0.9336553 | 0.9656895 |
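Precision, recall, and F1 depend on which class is treated as positive. With imbalanced labels, it can be informative to report them for the minority class as well. The sketch below computes the metrics by hand for a chosen positive class. The helper `class_metrics` and the toy labels are hypothetical and are not the study's data.

```r
# Hypothetical helper: precision, recall and F1 for a chosen positive class.
class_metrics <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  precision <- if (tp + fp == 0) NA_real_ else tp / (tp + fp)
  recall    <- if (tp + fn == 0) NA_real_ else tp / (tp + fn)
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  c(precision = precision, recall = recall, f1 = f1)
}

# Toy labels: 9 alive, 1 failed, and a rule that always predicts "alive".
actual    <- c(rep("alive", 9), "failed")
predicted <- rep("alive", 10)

m_alive  <- class_metrics(actual, predicted, positive = "alive")
m_failed <- class_metrics(actual, predicted, positive = "failed")
print(round(m_alive, 4))
print(m_failed)
```

In this toy case the scores for "alive" look strong while recall for "failed" is zero, which is why per-class reporting is useful when one class dominates.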
The cluster-based approach does not incorporate domain knowledge or optimized decision boundaries, which limits prediction performance. Additionally, k-means clustering assumes spherical clusters and equal variance, which may not hold in financial data. Future work should investigate alternative clustering techniques (e.g., DBSCAN or hierarchical methods) and supervised classification models with engineered features. Dimensionality reduction via PCA could enhance visualization and interpretability. Moreover, incorporation of macroeconomic indicators could improve predictive utility.
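As a starting point for the PCA suggestion, the sketch below runs `prcomp()` on toy data, reports the share of variance explained by each component, and extracts a two-dimensional projection that could be plotted or clustered. The toy columns are assumptions. On the real data the input would be the standardized indicator matrix.

```r
# Toy data: three columns, two of them strongly correlated.
set.seed(7)
z <- rnorm(200)
toy <- data.frame(
  a = z + rnorm(200, sd = 0.1),
  b = 2 * z + rnorm(200, sd = 0.1),
  c = rnorm(200)
)

# PCA on centered and scaled variables.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scores on the first two components, usable for plotting or clustering.
scores_2d <- pca$x[, 1:2]
```

If a few components capture most of the variance in the financial indicators, clustering on those scores would reduce the influence of redundant columns.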