# Define the list of required packages
packages <- c("tidyverse", "skimr", "visdat", "naniar",
"GGally", "gridExtra", "scales", "corrplot", "dplyr", "tidyr", "rsample", "kableExtra")
# Function to install missing packages
install_if_missing <- function(pkg) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg, dependencies = TRUE)
library(pkg, character.only = TRUE)
}
}
# Run the function across the list
invisible(lapply(packages, install_if_missing))
# Verify loading
print("All packages are installed and loaded.")
## [1] "All packages are installed and loaded."
This analysis uses the American Companies Bankruptcy Prediction dataset to build a two-cluster k-means model and evaluate its ability to separate surviving firms from failed firms. The workflow follows the standard exploratory data analysis pipeline: data ingestion, type verification, missing value assessment, outlier handling, transformation, splitting, feature engineering, modeling, and evaluation.
The dataset contains 18 financial variables along with a company identifier, year, and survival status label. The table below documents the original column codes, the renamed variables used throughout this analysis, and a short description of each measure.
| Variable | Renamed Variable | Description |
|---|---|---|
| X1 | CurrAssets | All the assets of a company that are expected to be sold or used as a result of standard business operations over the next year. |
| X2 | COGS | The total amount a company paid as a cost directly related to the sale of products. |
| X3 | DepAmort | Depreciation refers to the loss of value of a tangible fixed asset over time (such as property, machinery, buildings, and plant). Amortization refers to the loss of value of intangible assets over time. |
| X4 | EBITDA | Earnings before interest, taxes, depreciation, and amortization. A measure of a company’s overall financial performance, serving as an alternative to net income. |
| X5 | Inventory | The accounting of items and raw materials that a company either uses in production or sells. |
| X6 | NetIncome | The overall profitability of a company after all expenses and costs have been deducted from total revenue. |
| X7 | TotRecv | The balance of money due to a firm for goods or services delivered or used but not yet paid for by customers. |
| X8 | MktValue | The price of an asset in a marketplace. In this dataset, it refers to market capitalization since companies are publicly traded in the stock market. |
| X9 | NetSales | The sum of a company’s gross sales minus its returns, allowances, and discounts. |
| X10 | TotAssets | All the assets, or items of value, a business owns. |
| X11 | LTDebt | A company’s loans and other liabilities that will not become due within one year of the balance sheet date. |
| X12 | EBIT | Earnings before interest and taxes. |
| X13 | GrossProfit | The profit a business makes after subtracting all the costs that are related to manufacturing and selling its products or services. |
| X14 | TotCurrLiab | The sum of accounts payable, accrued liabilities, and taxes such as bonds payable at the end of the year, salaries, and commissions remaining. |
| X15 | RetEarn | The amount of profit a company has left over after paying all its direct costs, indirect costs, income taxes, and dividends to shareholders. |
| X16 | TotRevenue | The amount of income that a business has made from all sales before subtracting expenses. It may include interest and dividends from investments. |
| X17 | TotLiab | The combined debts and obligations that the company owes to outside parties. |
| X18 | TotOpEx | The expenses a business incurs through its normal business operations. |
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
## Response [https://storage.googleapis.com:443/kaggle-data-sets/3326190/5791434/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20260427%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20260427T000519Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=6f57d5eee3785d87922bb706568cc5b8ee7a62d36d8f7e7722f6024b2a723fcbc32c660a2b712e0f774ecdb32278e950adb48380f4e0e66d6e9f20b0d735dc7c3e6495409513e064d00ddc55d5d32afb0bea06d6dd014e3581cdc8b09227e7f969ca1a92a3c351e2fcd55203159c20784b88f2ed00f52239e7da44b166519453911741c1bf2407492ad43ccb6bb3f9f5ee246167d10a60765f0ad839c4e749edafe58766cd78036b9a97b82552e12b94b3bc95e7e476a3d683e1b7ddd9d531b48359a111f2cc638fd3c94efa5f878da96f184bc96d050fc315380787169ca208e5dd59ac4d94d9ee8944b2d1416c2d159499cf644cae2e2c55d7768fdcbc3839]
## Date: 2026-04-27 00:05
## Status: 200
## Content-Type: application/zip
## Size: 4.68 MB
## <ON DISK> C:\Users\abdou\OneDrive - Prod Student NU\Documents\DDS-8501 Exploratory Data Analysis\DDS8501_EDA\bankruptcy.zip
## Rows: 78682 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): company_name, status_label
## dbl (19): year, X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 21
## company_name status_label year X1 X2 X3 X4 X5 X6 X7
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 C_1 alive 1999 511. 833. 18.4 89.0 336. 35.2 128.
## 2 C_1 alive 2000 486. 714. 18.6 64.4 321. 18.5 115.
## 3 C_1 alive 2001 437. 526. 22.5 27.2 287. -58.9 77.5
## 4 C_1 alive 2002 396. 497. 27.2 30.7 260. -12.4 66.3
## 5 C_1 alive 2003 432. 523. 26.7 47.5 247. 3.50 105.
## 6 C_1 alive 2004 475. 598. 28.0 61.8 255. 15.5 127.
## # ℹ 11 more variables: X8 <dbl>, X9 <dbl>, X10 <dbl>, X11 <dbl>, X12 <dbl>,
## # X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>, X18 <dbl>
library(dplyr)
df <- df %>%
dplyr::rename(
CurrAssets = X1,
COGS = X2,
DepAmort = X3,
EBITDA = X4,
Inventory = X5,
NetIncome = X6,
TotRecv = X7,
MktValue = X8,
NetSales = X9,
TotAssets = X10,
LTDebt = X11,
EBIT = X12,
GrossProfit = X13,
TotCurrLiab = X14,
RetEarn = X15,
TotRevenue = X16,
TotLiab = X17,
TotOpEx = X18
)
## spc_tbl_ [78,682 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ company_name: chr [1:78682] "C_1" "C_1" "C_1" "C_1" ...
## $ status_label: chr [1:78682] "alive" "alive" "alive" "alive" ...
## $ year : num [1:78682] 1999 2000 2001 2002 2003 ...
## $ CurrAssets : num [1:78682] 511 486 437 396 432 ...
## $ COGS : num [1:78682] 833 714 526 497 523 ...
## $ DepAmort : num [1:78682] 18.4 18.6 22.5 27.2 26.7 ...
## $ EBITDA : num [1:78682] 89 64.4 27.2 30.7 47.5 ...
## $ Inventory : num [1:78682] 336 321 287 260 247 ...
## $ NetIncome : num [1:78682] 35.2 18.5 -58.9 -12.4 3.5 ...
## $ TotRecv : num [1:78682] 128.3 115.2 77.5 66.3 104.7 ...
## $ MktValue : num [1:78682] 373 377 365 143 309 ...
## $ NetSales : num [1:78682] 1024 874 639 606 652 ...
## $ TotAssets : num [1:78682] 741 702 710 687 709 ...
## $ LTDebt : num [1:78682] 180 180 218 165 249 ...
## $ EBIT : num [1:78682] 70.66 45.79 4.71 3.57 20.81 ...
## $ GrossProfit : num [1:78682] 191 160 112 110 129 ...
## $ TotCurrLiab : num [1:78682] 164 125 150 204 131 ...
## $ RetEarn : num [1:78682] 201 204 140 124 132 ...
## $ TotRevenue : num [1:78682] 1024 874 639 606 652 ...
## $ TotLiab : num [1:78682] 401 362 400 392 408 ...
## $ TotOpEx : num [1:78682] 935 810 612 576 604 ...
## - attr(*, "spec")=
## .. cols(
## .. company_name = col_character(),
## .. status_label = col_character(),
## .. year = col_double(),
## .. X1 = col_double(),
## .. X2 = col_double(),
## .. X3 = col_double(),
## .. X4 = col_double(),
## .. X5 = col_double(),
## .. X6 = col_double(),
## .. X7 = col_double(),
## .. X8 = col_double(),
## .. X9 = col_double(),
## .. X10 = col_double(),
## .. X11 = col_double(),
## .. X12 = col_double(),
## .. X13 = col_double(),
## .. X14 = col_double(),
## .. X15 = col_double(),
## .. X16 = col_double(),
## .. X17 = col_double(),
## .. X18 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## company_name status_label year CurrAssets
## Length:78682 Length:78682 Min. :1999 Min. : -7.76
## Class :character Class :character 1st Qu.:2002 1st Qu.: 18.92
## Mode :character Mode :character Median :2007 Median : 100.45
## Mean :2008 Mean : 880.36
## 3rd Qu.:2012 3rd Qu.: 431.53
## Max. :2018 Max. :169662.00
## COGS DepAmort EBITDA
## Min. : -366.64 Min. : 0.000 Min. :-21913.000
## 1st Qu.: 17.04 1st Qu.: 1.192 1st Qu.: -0.811
## Median : 103.66 Median : 7.929 Median : 15.034
## Mean : 1594.53 Mean : 121.234 Mean : 376.759
## 3rd Qu.: 634.55 3rd Qu.: 47.972 3rd Qu.: 139.655
## Max. :374623.00 Max. :28430.000 Max. : 81730.000
## Inventory NetIncome TotRecv
## Min. : 0.000 Min. :-98696.000 Min. : -0.006
## 1st Qu.: 0.000 1st Qu.: -7.416 1st Qu.: 3.281
## Median : 7.023 Median : 1.616 Median : 22.820
## Mean : 201.606 Mean : 129.382 Mean : 286.833
## 3rd Qu.: 74.747 3rd Qu.: 40.144 3rd Qu.: 131.580
## Max. :62567.000 Max. :104821.000 Max. :65812.000
## MktValue NetSales TotAssets
## Min. :0.000e+00 Min. : -1965.00 Min. : 0.00
## 1st Qu.:3.498e+01 1st Qu.: 27.55 1st Qu.: 37.36
## Median :2.275e+02 Median : 186.60 Median : 213.20
## Mean :3.414e+03 Mean : 2364.02 Mean : 2867.11
## 3rd Qu.:1.245e+03 3rd Qu.: 1046.40 3rd Qu.: 1171.36
## Max. :1.073e+06 Max. :511729.00 Max. :531864.00
## LTDebt EBIT GrossProfit
## Min. : -0.023 Min. :-25913.000 Min. :-21536.000
## 1st Qu.: 0.000 1st Qu.: -2.787 1st Qu.: 8.521
## Median : 7.593 Median : 6.518 Median : 63.581
## Mean : 722.484 Mean : 255.525 Mean : 769.491
## 3rd Qu.: 248.761 3rd Qu.: 87.599 3rd Qu.: 344.074
## Max. :166250.000 Max. : 71230.000 Max. :137106.000
## TotCurrLiab RetEarn TotRevenue
## Min. :1.000e-03 Min. :-102362.00 Min. : -1965.00
## 1st Qu.:8.889e+00 1st Qu.: -68.28 1st Qu.: 27.55
## Median :4.333e+01 Median : -1.13 Median : 186.60
## Mean :6.101e+02 Mean : 532.47 Mean : 2364.02
## 3rd Qu.:2.228e+02 3rd Qu.: 146.07 3rd Qu.: 1046.40
## Max. :1.169e+05 Max. : 402089.00 Max. :511729.00
## TotLiab TotOpEx
## Min. : 0.00 Min. : -317.20
## 1st Qu.: 13.49 1st Qu.: 32.87
## Median : 81.99 Median : 168.91
## Mean : 1773.56 Mean : 1987.26
## 3rd Qu.: 629.98 3rd Qu.: 875.52
## Max. :337980.00 Max. :481580.00
The dataset contains three identifier or label fields (company_name,
status_label, year) and 18 quantitative financial measures. The target
variable status_label should be a factor, and
year should be treated as an ordered grouping variable
rather than a numeric measure.
# Reclassify the important variables
df$status_label <- as.factor(df$status_label)
df$year <- as.integer(df$year)
str(df)
| Type | Variables |
|---|---|
| Identifier (qualitative) | company_name |
| Categorical (qualitative, target) | status_label |
| Temporal (discrete) | year |
| Quantitative (continuous) | CurrAssets, COGS, DepAmort, EBITDA, Inventory, NetIncome, TotRecv, MktValue, NetSales, TotAssets, LTDebt, EBIT, GrossProfit, TotCurrLiab, RetEarn, TotRevenue, TotLiab, TotOpEx |
# Overall dimensions of the dataset.
dim(df)
## [1] 78682 21
# Inspect the resulting structure.
glimpse(df)
## Rows: 78,682
## Columns: 21
## $ company_name <chr> "C_1", "C_1", "C_1", "C_1", "C_1", "C_1", "C_1", "C_1", "…
## $ status_label <chr> "alive", "alive", "alive", "alive", "alive", "alive", "al…
## $ year <dbl> 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 200…
## $ CurrAssets <dbl> 511.267, 485.856, 436.656, 396.412, 432.204, 474.542, 624…
## $ COGS <dbl> 833.107, 713.811, 526.477, 496.747, 523.302, 598.172, 704…
## $ DepAmort <dbl> 18.373, 18.577, 22.496, 27.172, 26.680, 27.950, 29.222, 3…
## $ EBITDA <dbl> 89.031, 64.367, 27.207, 30.745, 47.491, 61.774, 91.877, 1…
## $ Inventory <dbl> 336.018, 320.590, 286.588, 259.954, 247.245, 255.477, 323…
## $ NetIncome <dbl> 35.163, 18.531, -58.939, -12.410, 3.504, 15.453, 35.163, …
## $ TotRecv <dbl> 128.348, 115.187, 77.528, 66.322, 104.661, 127.121, 136.2…
## $ MktValue <dbl> 372.7519, 377.1180, 364.5928, 143.3295, 308.9071, 522.679…
## $ NetSales <dbl> 1024.333, 874.255, 638.721, 606.337, 651.958, 747.848, 89…
## $ TotAssets <dbl> 740.998, 701.854, 710.199, 686.621, 709.292, 732.230, 978…
## $ LTDebt <dbl> 180.447, 179.987, 217.699, 164.658, 248.666, 227.159, 318…
## $ EBIT <dbl> 70.658, 45.790, 4.711, 3.573, 20.811, 33.824, 62.655, 86.…
## $ GrossProfit <dbl> 191.226, 160.444, 112.244, 109.590, 128.656, 149.676, 193…
## $ TotCurrLiab <dbl> 163.816, 125.392, 150.464, 203.575, 131.261, 160.025, 187…
## $ RetEarn <dbl> 201.026, 204.065, 139.603, 124.106, 131.884, 142.450, 183…
## $ TotRevenue <dbl> 1024.333, 874.255, 638.721, 606.337, 651.958, 747.848, 89…
## $ TotLiab <dbl> 401.483, 361.642, 399.964, 391.633, 407.608, 417.486, 556…
## $ TotOpEx <dbl> 935.302, 809.888, 611.514, 575.592, 604.467, 686.074, 805…
# Compact summary of all columns including missingness, distribution shape,
# and quartiles.
skim(df)
| Name | df |
| Number of rows | 78682 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| company_name | 0 | 1 | 3 | 6 | 0 | 8971 | 0 |
| status_label | 0 | 1 | 5 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2007.51 | 5.74 | 1999.00 | 2002.00 | 2007.00 | 2012.00 | 2018 | ▇▇▆▆▅ |
| CurrAssets | 0 | 1 | 880.36 | 3928.56 | -7.76 | 18.92 | 100.45 | 431.53 | 169662 | ▇▁▁▁▁ |
| COGS | 0 | 1 | 1594.53 | 8930.48 | -366.64 | 17.04 | 103.66 | 634.55 | 374623 | ▇▁▁▁▁ |
| DepAmort | 0 | 1 | 121.23 | 652.38 | 0.00 | 1.19 | 7.93 | 47.97 | 28430 | ▇▁▁▁▁ |
| EBITDA | 0 | 1 | 376.76 | 2012.02 | -21913.00 | -0.81 | 15.03 | 139.66 | 81730 | ▁▇▁▁▁ |
| Inventory | 0 | 1 | 201.61 | 1060.77 | 0.00 | 0.00 | 7.02 | 74.75 | 62567 | ▇▁▁▁▁ |
| NetIncome | 0 | 1 | 129.38 | 1265.53 | -98696.00 | -7.42 | 1.62 | 40.14 | 104821 | ▁▁▇▁▁ |
| TotRecv | 0 | 1 | 286.83 | 1335.98 | -0.01 | 3.28 | 22.82 | 131.58 | 65812 | ▇▁▁▁▁ |
| MktValue | 0 | 1 | 3414.35 | 18414.10 | 0.00 | 34.98 | 227.51 | 1244.89 | 1073391 | ▇▁▁▁▁ |
| NetSales | 0 | 1 | 2364.02 | 11950.07 | -1965.00 | 27.55 | 186.60 | 1046.40 | 511729 | ▇▁▁▁▁ |
| TotAssets | 0 | 1 | 2867.11 | 12917.94 | 0.00 | 37.36 | 213.20 | 1171.36 | 531864 | ▇▁▁▁▁ |
| LTDebt | 0 | 1 | 722.48 | 3242.17 | -0.02 | 0.00 | 7.59 | 248.76 | 166250 | ▇▁▁▁▁ |
| EBIT | 0 | 1 | 255.53 | 1494.64 | -25913.00 | -2.79 | 6.52 | 87.60 | 71230 | ▁▇▁▁▁ |
| GrossProfit | 0 | 1 | 769.49 | 3774.70 | -21536.00 | 8.52 | 63.58 | 344.07 | 137106 | ▇▁▁▁▁ |
| TotCurrLiab | 0 | 1 | 610.07 | 2938.39 | 0.00 | 8.89 | 43.33 | 222.82 | 116866 | ▇▁▁▁▁ |
| RetEarn | 0 | 1 | 532.47 | 6369.16 | -102362.00 | -68.28 | -1.13 | 146.07 | 402089 | ▁▇▁▁▁ |
| TotRevenue | 0 | 1 | 2364.02 | 11950.07 | -1965.00 | 27.55 | 186.60 | 1046.40 | 511729 | ▇▁▁▁▁ |
| TotLiab | 0 | 1 | 1773.56 | 8053.68 | 0.00 | 13.49 | 81.99 | 629.98 | 337980 | ▇▁▁▁▁ |
| TotOpEx | 0 | 1 | 1987.26 | 10419.63 | -317.20 | 32.87 | 168.91 | 875.52 | 481580 | ▇▁▁▁▁ |
Report the row count, column count, observed year range, and the
distribution of status_label (count of alive versus
failed) directly from the skim() output. The class
imbalance between failed and surviving firms is a documented
characteristic of this dataset and should be noted before any
modeling.
The dataset show complete records across the 18 financial variables.
miss_var_summary returns zero missingness, which means the
imputation procedure is not needed.
# Numeric summary
miss_var_summary(df)
## # A tibble: 21 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 company_name 0 0
## 2 status_label 0 0
## 3 year 0 0
## 4 CurrAssets 0 0
## 5 COGS 0 0
## 6 DepAmort 0 0
## 7 EBITDA 0 0
## 8 Inventory 0 0
## 9 NetIncome 0 0
## 10 TotRecv 0 0
## # ℹ 11 more rows
# Visual inspection
vis_miss(df, warn_large_data = FALSE)
gg_miss_var(df)
Financial data is heavily right-skewed and contains legitimate extreme values (large firms vs. small firms). Standard deviation-based rules are inappropriate. Use the IQR rule for flagging and, where transformation is warranted, apply a signed log transform that handles negatives (since variables like NetIncome, EBIT, and RetEarn can be negative).
# Helper function: signed log transform.
# log1p(x) = log(1 + x), which handles values near zero stably.
# Multiplying by sign(x) preserves negative values after transformation.
# Select numeric variables
numeric_vars <- df %>%
select(CurrAssets:TotOpEx)
# IQR-based outlier counts per variable
outlier_counts <- numeric_vars %>%
summarise(across(everything(), ~ {
q <- quantile(.x, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- IQR(.x, na.rm = TRUE)
sum(.x < (q[1] - 1.5 * iqr) | .x > (q[2] + 1.5 * iqr), na.rm = TRUE)
})) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "n_outliers") %>%
arrange(desc(n_outliers))
outlier_counts
## # A tibble: 18 × 2
## variable n_outliers
## <chr> <int>
## 1 RetEarn 17867
## 2 NetIncome 17004
## 3 EBIT 13251
## 4 LTDebt 12641
## 5 Inventory 12435
## 6 TotLiab 12366
## 7 EBITDA 12218
## 8 TotCurrLiab 11781
## 9 COGS 11594
## 10 TotAssets 11521
## 11 DepAmort 11475
## 12 MktValue 11409
## 13 GrossProfit 11306
## 14 TotRecv 11237
## 15 NetSales 11225
## 16 TotRevenue 11225
## 17 TotOpEx 11214
## 18 CurrAssets 10975
# Signed log transform: preserves the sign of negative values
signed_log <- function(x) {
sign(x) * log1p(abs(x))
}
df_log <- df %>%
mutate(across(CurrAssets:TotOpEx, signed_log, .names = "log_{.col}"))
Outliers are flagged but retained. Extreme values in financial data often correspond to the firms most relevant to bankruptcy prediction. The signed log transform compresses scale while preserving the sign of negative earnings.
# Histograms of all financial variables on the signed log scale.
# Faceting by variable allows visual comparison of distribution shapes.
df_log %>%
select(starts_with("log_")) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white") +
facet_wrap(~ variable, scales = "free", ncol = 4) +
labs(title = "Distribution of financial variables (signed log scale)",
x = NULL, y = "Count") +
theme_minimal()
df %>%
ggplot(aes(x = status_label, fill = status_label)) +
geom_bar() +
labs(title = "Class distribution: alive vs. failed",
x = NULL, y = "Count") +
theme_minimal() +
theme(legend.position = "none")
# Compute mean, median, sd, range, and skewness for each financial variable.
# Sorting by skew highlights the most heavily tailed distributions.
univariate_stats <- df %>%
select(CurrAssets:TotOpEx) %>%
summarise(across(everything(),
list(mean = ~ mean(.x, na.rm = TRUE),
median = ~ median(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE),
min = ~ min(.x, na.rm = TRUE),
max = ~ max(.x, na.rm = TRUE),
skew = ~ moments::skewness(.x, na.rm = TRUE),
kurtosis = ~ moments::kurtosis(.x, na.rm = TRUE)))) %>%
pivot_longer(everything(),
names_to = c("variable", ".value"),
names_sep = "_(?=[^_]+$)") %>%
arrange(desc(skew))
univariate_stats
## # A tibble: 18 × 8
## variable mean median sd min max skew kurtosis
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RetEarn 532. -1.13 6369. -102362 402089 29.6 1547.
## 2 Inventory 202. 7.02 1061. 0 62567 22.6 817.
## 3 TotOpEx 1987. 169. 10420. -317. 481580 20.4 634.
## 4 COGS 1595. 104. 8930. -367. 374623 20.1 580.
## 5 NetSales 2364. 187. 11950. -1965. 511729 19.0 549.
## 6 TotRevenue 2364. 187. 11950. -1965. 511729 19.0 549.
## 7 MktValue 3414. 228. 18414. 0.0001 1073391. 18.2 544.
## 8 EBIT 256. 6.52 1495. -25913 71230 18.0 531.
## 9 DepAmort 121. 7.93 652. 0 28430 17.9 448.
## 10 EBITDA 377. 15.0 2012. -21913 81730 16.4 403.
## 11 TotRecv 287. 22.8 1336. -0.006 65812 15.8 403.
## 12 GrossProfit 769. 63.6 3775. -21536 137106 15.3 338.
## 13 LTDebt 722. 7.59 3242. -0.023 166250 14.8 390.
## 14 CurrAssets 880. 100. 3929. -7.76 169662 14.8 340.
## 15 TotCurrLiab 610. 43.3 2938. 0.001 116866 14.2 297.
## 16 TotLiab 1774. 82.0 8054. 0.001 337980 13.8 299.
## 17 TotAssets 2867. 213. 12918. 0.001 531864 13.6 276.
## 18 NetIncome 129. 1.62 1266. -98696 104821 11.9 1499.
# Pearson correlation matrix on the log-transformed variables.
cor_matrix <- df_log %>%
select(starts_with("log_")) %>%
cor(use = "complete.obs")
# Visualize the correlation matrix as an upper-triangular colored grid.
cor_matrix %>%
corrplot(method = "color",
type = "upper",
tl.cex = 0.7,
tl.col = "black",
addCoef.col = "black",
number.cex = 0.55)
# Boxplots of key financial indicators split by survival status.
# These plots reveal whether log-scale distributions differ between groups.
df_log %>%
select(status_label,
log_TotAssets, log_NetIncome, log_RetEarn,
log_TotLiab, log_EBITDA, log_MktValue) %>%
pivot_longer(-status_label, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = status_label, y = value, fill = status_label)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~ variable, scales = "free_y") +
labs(title = "Financial indicators by firm status",
x = NULL, y = "Signed log value") +
theme_minimal() +
theme(legend.position = "none")
# Median values of each variable split by survival status.
# The ratio column shows how much smaller failed firms are at the median.
group_medians <- df %>%
group_by(status_label) %>%
summarise(across(CurrAssets:TotOpEx, ~ median(.x, na.rm = TRUE))) %>%
pivot_longer(-status_label, names_to = "variable", values_to = "median") %>%
pivot_wider(names_from = status_label, values_from = median) %>%
mutate(ratio_failed_to_alive = failed / alive) %>%
arrange(ratio_failed_to_alive)
group_medians
## # A tibble: 18 × 4
## variable alive failed ratio_failed_to_alive
## <chr> <dbl> <dbl> <dbl>
## 1 NetIncome 2.07 -3.33 -1.61
## 2 EBIT 7.18 0.487 0.0678
## 3 MktValue 241. 118. 0.489
## 4 EBITDA 15.6 7.79 0.499
## 5 TotRecv 23.5 15.2 0.649
## 6 GrossProfit 64.8 47.5 0.733
## 7 CurrAssets 103. 75.9 0.737
## 8 Inventory 7.12 6.17 0.867
## 9 TotAssets 215. 195. 0.908
## 10 TotOpEx 170. 156. 0.914
## 11 NetSales 187. 181. 0.967
## 12 TotRevenue 187. 181. 0.967
## 13 DepAmort 7.94 7.85 0.989
## 14 TotCurrLiab 43.4 42.9 0.990
## 15 COGS 104. 107. 1.03
## 16 TotLiab 80.7 97.0 1.20
## 17 LTDebt 7.09 18.4 2.60
## 18 RetEarn -0.172 -25.5 148.
# Top pairwise correlations on the log scale.
# distinct() removes duplicate (var1, var2) and (var2, var1) pairs.
cor_long <- cor_matrix %>%
as.data.frame() %>%
rownames_to_column("var1") %>%
pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
filter(var1 != var2) %>%
mutate(abs_r = abs(r)) %>%
arrange(desc(abs_r)) %>%
distinct(abs_r, .keep_all = TRUE)
cor_long %>% head(15)
## # A tibble: 15 × 4
## var1 var2 r abs_r
## <chr> <chr> <dbl> <dbl>
## 1 log_NetSales log_TotRevenue 1 1
## 2 log_COGS log_TotOpEx 0.977 0.977
## 3 log_NetSales log_TotOpEx 0.975 0.975
## 4 log_TotCurrLiab log_TotLiab 0.969 0.969
## 5 log_CurrAssets log_TotAssets 0.961 0.961
## 6 log_COGS log_NetSales 0.959 0.959
## 7 log_TotAssets log_TotLiab 0.955 0.955
## 8 log_TotCurrLiab log_TotOpEx 0.948 0.948
## 9 log_CurrAssets log_TotOpEx 0.942 0.942
## 10 log_TotAssets log_TotCurrLiab 0.942 0.942
## 11 log_TotAssets log_TotOpEx 0.938 0.938
## 12 log_DepAmort log_TotLiab 0.936 0.936
## 13 log_EBITDA log_EBIT 0.934 0.934
## 14 log_DepAmort log_TotAssets 0.933 0.933
## 15 log_NetSales log_TotCurrLiab 0.932 0.932
The panel structure of this dataset (multiple year-rows per company) creates a real leakage risk. If observations from the same firm appear in both training and test sets, the model effectively sees its target firm during training. The correct unit for splitting is the company, not the row.
set.seed(8501)
# Build a company-level frame with a single survival label per firm.
# last() selects the most recent status, which is the final outcome.
company_labels <- df %>%
group_by(company_name) %>%
summarise(status_label = last(status_label), .groups = "drop")
# Stratified 80/20 split at the company level to preserve class balance.
company_split <- company_labels %>%
initial_split(prop = 0.8, strata = status_label)
train_companies <- company_split %>% training() %>% pull(company_name)
test_companies <- company_split %>% testing() %>% pull(company_name)
# Apply the split back to the full firm-year dataset
train_set <- df %>% filter(company_name %in% train_companies)
test_set <- df %>% filter(company_name %in% test_companies)
# Verify the split
list(train = train_set, test = test_set) %>%
map_dfr(~ .x %>%
summarise(n_rows = n(),
n_companies = n_distinct(company_name),
pct_failed = round(100 * mean(status_label == "failed"), 2)),
.id = "set")
## # A tibble: 2 × 4
## set n_rows n_companies pct_failed
## <chr> <int> <int> <dbl>
## 1 train 63245 7176 6.68
## 2 test 15437 1795 6.44
The verification table confirms that no company appears in both sets
and that the failure rate is preserved across them. Stratification by
status_label maintains roughly equal class balance between
train and test, which matters given the imbalance between surviving and
failed firms.
Two transformations apply here. The signed log handles skew and negative values in the raw financial measures. Standardization (z-score) puts all variables on a comparable scale for modeling. Both must be fit on the training set only and then applied to the test set using the training parameters.
signed_log <- function(x) sign(x) * log1p(abs(x))
# Step 1: signed log transform on both sets
train_log <- train_set %>%
mutate(across(CurrAssets:TotOpEx, signed_log, .names = "log_{.col}"))
test_log <- test_set %>%
mutate(across(CurrAssets:TotOpEx, signed_log, .names = "log_{.col}"))
# Step 2: compute scaling parameters (mean and sd) from the training set.
# These parameters will be reused on the test set without recomputation.
scale_params <- train_log %>%
select(starts_with("log_")) %>%
summarise(across(everything(),
list(mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE)))) %>%
pivot_longer(everything(),
names_to = c("variable", ".value"),
names_sep = "_(?=[^_]+$)")
scale_params
## # A tibble: 18 × 3
## variable mean sd
## <chr> <dbl> <dbl>
## 1 log_CurrAssets 4.54 2.21
## 2 log_COGS 4.67 2.45
## 3 log_DepAmort 2.50 1.98
## 4 log_EBITDA 2.29 3.40
## 5 log_Inventory 2.50 2.36
## 6 log_NetIncome 0.827 3.56
## 7 log_TotRecv 3.26 2.20
## 8 log_MktValue 5.40 2.43
## 9 log_NetSales 5.11 2.55
## 10 log_TotAssets 5.35 2.43
## 11 log_LTDebt 2.94 2.93
## 12 log_EBIT 1.67 3.49
## 13 log_GrossProfit 3.85 2.87
## 14 log_TotCurrLiab 3.94 2.16
## 15 log_RetEarn 0.309 5.03
## 16 log_TotRevenue 5.11 2.55
## 17 log_TotLiab 4.62 2.48
## 18 log_TotOpEx 5.16 2.28
# Step 3: helper function that applies training-derived scaling to a dataset.
# cur_column() inside across() lets us look up the right mean/sd per variable.
apply_scaling <- function(df, params) {
df %>%
mutate(across(starts_with("log_"),
~ {
p <- params %>% filter(variable == cur_column())
(.x - p$mean) / p$sd
},
.names = "z_{.col}"))
}
train_scaled <- train_log %>% apply_scaling(scale_params)
test_scaled <- test_log %>% apply_scaling(scale_params)
# Verification: training z-scores should have mean ~0 and sd ~1.
# Test z-scores will not, which is the correct behavior.
train_scaled %>%
select(starts_with("z_")) %>%
summarise(across(everything(),
list(mean = ~ round(mean(.x, na.rm = TRUE), 3),
sd = ~ round(sd(.x, na.rm = TRUE), 3)))) %>%
pivot_longer(everything(),
names_to = c("variable", ".value"),
names_sep = "_(?=[^_]+$)") %>%
head(10)
## # A tibble: 10 × 3
## variable mean sd
## <chr> <dbl> <dbl>
## 1 z_log_CurrAssets 0 1
## 2 z_log_COGS 0 1
## 3 z_log_DepAmort 0 1
## 4 z_log_EBITDA 0 1
## 5 z_log_Inventory 0 1
## 6 z_log_NetIncome 0 1
## 7 z_log_TotRecv 0 1
## 8 z_log_MktValue 0 1
## 9 z_log_NetSales 0 1
## 10 z_log_TotAssets 0 1
The mean of each z_log_ column in the training set should be approximately zero and the standard deviation approximately one. Test set z-scores will not have exactly these properties, which is the point. The transformation reflects training-set parameters rather than re-fitting on test data.
Bankruptcy prediction benefits from financial ratios more than from raw dollar amounts. Ratios normalize across firm size and capture relationships between balance sheet components that the original variables describe only indirectly. The features below are grounded in the Altman Z-score tradition, which remains a standard reference in corporate distress prediction (Altman, 1968; Altman, 2000).
add_engineered_features <- function(df) {
df %>%
mutate(
working_capital = CurrAssets - TotCurrLiab,
wc_to_assets = working_capital / TotAssets,
retained_to_assets = RetEarn / TotAssets,
ebit_to_assets = EBIT / TotAssets,
mkt_to_liab = MktValue / TotLiab,
sales_to_assets = NetSales / TotAssets,
debt_to_assets = TotLiab / TotAssets,
profit_margin = NetIncome / NetSales,
current_ratio = CurrAssets / TotCurrLiab,
across(working_capital:current_ratio,
~ if_else(is.finite(.x), .x, NA_real_))
)
}
train_features <- train_scaled %>% add_engineered_features()
test_features <- test_scaled %>% add_engineered_features()
# Inspect the new features
train_features %>%
select(working_capital:current_ratio) %>%
summarise(across(everything(),
list(median = ~ median(.x, na.rm = TRUE),
mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE),
n_na = ~ sum(is.na(.x))))) %>%
pivot_longer(everything(),
names_to = c("variable", ".value"),
names_sep = "_(?=[^_]+$)")
## # A tibble: 18 × 5
## variable median mean sd na
## <chr> <dbl> <dbl> <dbl> <int>
## 1 working_capital 34.5 275. 1918. NA
## 2 working_capital_n NA NA NA 0
## 3 wc_to_assets 0.211 -0.998 36.5 NA
## 4 wc_to_assets_n NA NA NA 0
## 5 retained_to_assets -0.0179 -11.2 240. NA
## 6 retained_to_assets_n NA NA NA 0
## 7 ebit_to_assets 0.0474 -0.466 10.2 NA
## 8 ebit_to_assets_n NA NA NA 0
## 9 mkt_to_liab 2.29 15.5 836. NA
## 10 mkt_to_liab_n NA NA NA 0
## 11 sales_to_assets 0.892 1.19 3.55 NA
## 12 sales_to_assets_n NA NA NA 0
## 13 debt_to_assets 0.506 1.95 43.2 NA
## 14 debt_to_assets_n NA NA NA 0
## 15 profit_margin 0.0151 -11.4 277. NA
## 16 profit_margin_n NA NA NA 0
## 17 current_ratio 1.90 3.60 98.7 NA
## 18 current_ratio_n NA NA NA 0
| Feature | Construction | Interpretation |
|---|---|---|
| working_capital | CurrAssets − TotCurrLiab | Short-term liquidity buffer |
| wc_to_assets | working_capital / TotAssets | Liquidity scaled by firm size (Altman X1) |
| retained_to_assets | RetEarn / TotAssets | Cumulative profitability over firm history (Altman X2) |
| ebit_to_assets | EBIT / TotAssets | Operating profitability (Altman X3) |
| mkt_to_liab | MktValue / TotLiab | Market-based solvency cushion (Altman X4) |
| sales_to_assets | NetSales / TotAssets | Asset turnover efficiency (Altman X5) |
| debt_to_assets | TotLiab / TotAssets | Leverage |
| profit_margin | NetIncome / NetSales | Earnings efficiency on sales |
| current_ratio | CurrAssets / TotCurrLiab | Standard short-term solvency measure |
The if_else(is.finite(.x), .x, NA_real_) step converts
infinite values (from divide-by-zero cases) to NA, which is necessary
because some firms report zero current liabilities or zero sales in
certain years.
# Compare engineered features between alive and failed firms
train_features %>%
group_by(status_label) %>%
summarise(across(working_capital:current_ratio,
~ median(.x, na.rm = TRUE))) %>%
pivot_longer(-status_label, names_to = "feature", values_to = "median") %>%
pivot_wider(names_from = status_label, values_from = median) %>%
mutate(diff = alive - failed) %>%
arrange(desc(abs(diff)))
## # A tibble: 9 × 4
## feature alive failed diff
## <chr> <dbl> <dbl> <dbl>
## 1 working_capital 35.8 17.0 18.8
## 2 mkt_to_liab 2.38 1.05 1.33
## 3 current_ratio 1.93 1.53 0.392
## 4 retained_to_assets -0.00464 -0.228 0.224
## 5 debt_to_assets 0.498 0.632 -0.134
## 6 wc_to_assets 0.218 0.123 0.0946
## 7 profit_margin 0.0177 -0.0391 0.0568
## 8 sales_to_assets 0.893 0.843 0.0505
## 9 ebit_to_assets 0.0499 0.00755 0.0424
This table shows which engineered features most sharply separate surviving from failed firms at the median. Features with large positive differences (alive minus failed) are those in which surviving firms outperform, providing an early signal of feature importance before any modeling.
K-means is unsupervised, so the cluster labels are not directly tied
to status_label. The model partitions firms into two groups
based on financial structure, and we then map cluster identity to the
survival outcome by majority class within each cluster on the training
set.
set.seed(8501)
# Use the standardized log-scale variables as clustering input
cluster_vars <- train_features %>%
select(starts_with("z_log_")) %>%
drop_na()
# Keep the row alignment with status_label for later evaluation
train_clean <- train_features %>%
drop_na(starts_with("z_log_"))
# Fit k-means with k = 2
kmeans_fit <- cluster_vars %>%
kmeans(centers = 2, nstart = 25, iter.max = 100)
# Attach cluster assignments to the training data
train_clustered <- train_clean %>%
mutate(cluster = factor(kmeans_fit$cluster))
# Map clusters to survival outcome by majority class
cluster_to_label <- train_clustered %>%
count(cluster, status_label) %>%
group_by(cluster) %>%
slice_max(n, n = 1) %>%
ungroup() %>%
select(cluster, predicted_label = status_label)
cluster_to_label
## # A tibble: 2 × 2
## cluster predicted_label
## <fct> <chr>
## 1 1 alive
## 2 2 alive
The cluster_to_label table assigns each cluster to its dominant class. If both clusters map to “alive,” that is itself an important finding about class imbalance in the feature space, and it should be reported rather than corrected by forcing a different mapping.
library(knitr)
library(kableExtra)
# Display the cluster-to-label mapping table
cluster_to_label %>%
kable(caption = "Final Two-Cluster k-means Partition", align = 'c') %>%
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(0, bold = TRUE, background = "#D3D3D3")
| cluster | predicted_label |
|---|---|
| 1 | alive |
| 2 | alive |
# Project the standardized features onto the first two principal components
# for visualization of the cluster partition.
pca_fit <- train_clustered %>%
select(starts_with("z_log_")) %>%
prcomp(center = FALSE, scale. = FALSE)
# Build a tidy data frame with the first two PCs, cluster, and survival label
plot_data <- train_clustered %>%
select(cluster, status_label) %>%
bind_cols(as_tibble(pca_fit$x[, 1:2]))
# Project the cluster centroids into the same PC space
centroids_pc <- as_tibble(kmeans_fit$centers %*% pca_fit$rotation[, 1:2]) %>%
mutate(cluster = factor(1:n()))
# Final cluster diagram
plot_data %>%
ggplot(aes(x = PC1, y = PC2, color = cluster, shape = status_label)) +
geom_point(alpha = 0.4, size = 2) +
geom_point(data = centroids_pc,
aes(x = PC1, y = PC2),
color = "black", fill = "white",
shape = 23, size = 5, stroke = 1.5,
inherit.aes = FALSE) +
scale_color_manual(values = c("1" = "#1E2761", "2" = "#F9A826")) +
scale_shape_manual(values = c("alive" = 16, "failed" = 17)) +
labs(title = "Figure 3. Final Cluster Partition (PC1 vs PC2)",
subtitle = "Diamonds mark cluster centroids",
x = "Principal Component 1",
y = "Principal Component 2",
color = "Cluster",
shape = "Status") +
theme_minimal() +
theme(legend.position = "right",
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(color = "gray40"))
The most useful interpretation of a k-means model is the centroid profile: which features distinguish one cluster from the other.
# Cluster sizes
train_clustered %>%
count(cluster) %>%
mutate(percent = round(100 * n / sum(n), 2))
## # A tibble: 2 × 3
## cluster n percent
## <fct> <int> <dbl>
## 1 1 35056 55.4
## 2 2 28189 44.6
# Centroid profile in standardized space
# The separation column shows which features drive the partition;
# variables with large absolute separation define the cluster identities.
kmeans_fit$centers %>%
as.data.frame() %>%
rownames_to_column("cluster") %>%
pivot_longer(-cluster, names_to = "variable", values_to = "z_mean") %>%
pivot_wider(names_from = cluster, values_from = z_mean, names_prefix = "cluster_") %>%
mutate(separation = cluster_1 - cluster_2) %>%
arrange(desc(abs(separation)))
## # A tibble: 18 × 4
## variable cluster_1 cluster_2 separation
## <chr> <dbl> <dbl> <dbl>
## 1 z_log_NetSales -0.716 0.891 -1.61
## 2 z_log_TotRevenue -0.716 0.891 -1.61
## 3 z_log_TotLiab -0.711 0.885 -1.60
## 4 z_log_TotOpEx -0.711 0.884 -1.59
## 5 z_log_COGS -0.705 0.877 -1.58
## 6 z_log_TotCurrLiab -0.704 0.875 -1.58
## 7 z_log_TotAssets -0.703 0.874 -1.58
## 8 z_log_DepAmort -0.699 0.870 -1.57
## 9 z_log_TotRecv -0.697 0.867 -1.56
## 10 z_log_EBITDA -0.687 0.855 -1.54
## 11 z_log_CurrAssets -0.683 0.850 -1.53
## 12 z_log_GrossProfit -0.661 0.823 -1.48
## 13 z_log_MktValue -0.643 0.799 -1.44
## 14 z_log_EBIT -0.634 0.789 -1.42
## 15 z_log_LTDebt -0.615 0.765 -1.38
## 16 z_log_Inventory -0.579 0.720 -1.30
## 17 z_log_NetIncome -0.475 0.590 -1.06
## 18 z_log_RetEarn -0.468 0.582 -1.05
# Cluster composition by survival status.
# pct_within_cluster shows how dominant each class is in each cluster.
train_clustered %>%
count(cluster, status_label) %>%
group_by(cluster) %>%
mutate(pct_within_cluster = round(100 * n / sum(n), 2)) %>%
ungroup()
## # A tibble: 4 × 4
## cluster status_label n pct_within_cluster
## <fct> <chr> <int> <dbl>
## 1 1 alive 32629 93.1
## 2 1 failed 2427 6.92
## 3 2 alive 26390 93.6
## 4 2 failed 1799 6.38
The separation column indicates which financial variables drive the partition. Variables with the largest absolute separation define the cluster identities. In bankruptcy data, the typical pattern is that one cluster centers on firms with stronger asset bases, profitability, and revenue, while the other centers on firms with weaker positions across most measures. Report the actual top variables from your output rather than predicting the pattern in advance.
To evaluate a clustering model as a classifier, assign each test
observation to its nearest centroid using the same scaling parameters as
the training set, then compare the mapped cluster label to the true
status_label.
# Test-set clustering input
test_clean <- test_features %>%
drop_na(starts_with("z_log_"))
test_input <- test_clean %>%
select(starts_with("z_log_"))
# Helper: assign each row to the nearest centroid
assign_cluster <- function(df, centers) {
centers_mat <- as.matrix(centers)
df %>%
as.matrix() %>%
apply(1, function(row) {
distances <- rowSums((centers_mat - matrix(row,
nrow = nrow(centers_mat),
ncol = length(row),
byrow = TRUE))^2)
which.min(distances)
})
}
test_predictions <- test_clean %>%
mutate(cluster = factor(assign_cluster(test_input, kmeans_fit$centers))) %>%
left_join(cluster_to_label, by = "cluster") %>%
mutate(predicted_label = factor(predicted_label,
levels = c("alive", "failed")),
status_label = factor(status_label,
levels = c("alive", "failed")))
For a binary classification framing, treat “failed” as the positive class. This convention matches the practical question: how well does the model identify firms heading toward bankruptcy?
# Confusion matrix: rows are true labels, columns are predicted labels.
conf_matrix <- test_predictions %>%
count(status_label, predicted_label) %>%
pivot_wider(names_from = predicted_label, values_from = n, values_fill = 0)
conf_matrix
## # A tibble: 2 × 2
## status_label alive
## <fct> <int>
## 1 alive 14443
## 2 failed 994
# Compute classification metrics in a single pipeline.
# TP = true positives (correctly predicted failed firms)
# FP = false positives (alive firms predicted as failed)
# FN = false negatives (failed firms predicted as alive)
# TN = true negatives (correctly predicted alive firms)
metrics <- test_predictions %>%
summarise(
TP = sum(status_label == "failed" & predicted_label == "failed"),
FP = sum(status_label == "alive" & predicted_label == "failed"),
FN = sum(status_label == "failed" & predicted_label == "alive"),
TN = sum(status_label == "alive" & predicted_label == "alive")
) %>%
mutate(
accuracy = (TP + TN) / (TP + TN + FP + FN),
precision = TP / (TP + FP),
recall = TP / (TP + FN),
f1_score = 2 * (precision * recall) / (precision + recall)
)
metrics %>%
select(accuracy, precision, recall, f1_score) %>%
pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
mutate(value = round(value, 4))
## # A tibble: 4 × 2
## metric value
## <chr> <dbl>
## 1 accuracy 0.936
## 2 precision NaN
## 3 recall 0
## 4 f1_score NaN
Report each metric using the exact value produced by the chunk above.
Clustering performance. A k-means model fit on financial measures partitions firms into two structurally distinct groups. The first question is whether those groups align with the survival outcome at all. If one cluster is dominated by failed firms while the other is dominated by surviving firms, the partition has captured a real economic signal. If both clusters are dominated by surviving firms, the partition reflects firm size or sector rather than distress, and the model is not informative for the prediction task. The cluster composition table above answers this directly.
Precision (positive predictive value). The proportion of firms predicted as “failed” that actually failed. High precision indicates that when the model flags a firm, the flag is reliable. Low precision means the model produces many false alarms, and an analyst acting on those flags would waste resources investigating healthy firms.
Recall (sensitivity). The proportion of failed firms the model correctly identified. High recall is critical in bankruptcy prediction because the cost of missing a failing firm (lost investment, unrecovered loans) typically exceeds the cost of investigating a healthy one. Low recall means the model fails its primary purpose: catching firms in distress before they collapse.
Accuracy. The proportion of all firms classified correctly. Accuracy is the least informative metric for this dataset because of class imbalance. A trivial classifier that predicts every firm as “alive” can score high accuracy while providing no value. Report it for completeness but emphasize precision, recall, and F1.
F1 score. The harmonic mean of precision and recall. A balanced summary that penalizes models that achieve high performance on only one of the two. Useful here because both errors carry real cost: false positives consume analyst time, false negatives miss the firms that matter most.
The analysis has several limitations that should be documented.
K-means assumes spherical, equally sized clusters in Euclidean space. Financial data, even after signed log transformation and standardization, does not satisfy this assumption strictly. Firm-size effects produce elongated clusters, and the failure outcome is far rarer than survival, so the clusters are unlikely to be balanced.
The dataset has panel structure (firm-year observations), and the survival label is defined at the firm level. Clustering on firm-year rows means the same company contributes multiple observations, which violates the independence assumption underlying most distance-based methods. The firm-level split mitigates leakage but does not change the within-set non-independence.
K-means is unsupervised. It cannot be tuned to favor recall over precision the way a supervised classifier can. The cluster-to-label mapping is post hoc, and if the dominant class in both clusters happens to be the same, the model produces no useful predictions on the failed class regardless of cluster quality.
Class imbalance is severe in this dataset. Failed firms are a small fraction of total observations. K-means weights all points equally, so the centroids are pulled toward the majority class. Resampling, weighted clustering, or supervised methods would address this more directly.
Financial ratios and accounting variables capture firm condition at a single point in time. Bankruptcy is a process that unfolds over multiple years (Beaver, 1966; Ohlson, 1980), and a static cross-sectional model cannot capture the trajectory leading to failure.
Several directions follow naturally from these limitations.
A supervised baseline using logistic regression or gradient-boosted trees would establish whether the features carry predictive signal beyond what k-means recovers. Comparing supervised performance against the clustering baseline quantifies the cost of the unsupervised approach.
Time-series feature engineering would address the static-snapshot limitation. Year-over-year changes in liquidity, leverage, and profitability are well-documented predictors of distress (Beaver, 1966) and are not used in the current feature set.
Alternative clustering algorithms (Gaussian mixture models, DBSCAN, hierarchical methods) relax the spherical-cluster assumption and may produce partitions more aligned with the actual structure of financial distress. Comparing these against k-means on the same features would clarify whether the limitation lies in the algorithm or in the features themselves.
Class-imbalance treatments such as SMOTE or cost-sensitive learning would address the imbalance directly and are standard in the bankruptcy prediction literature (Zhou, 2013).
Cluster validation indices (silhouette width, Calinski-Harabasz, gap statistic) would provide formal evidence on whether k = 2 is appropriate. The choice of k = 2 here is dictated by the binary outcome rather than by data structure, and the two questions need not produce the same answer.
On bankruptcy and corporate distress prediction:
Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23(4), 589–609.
Beaver, W. H. (1966). Financial ratios as predictors of failure. Journal of Accounting Research, 4, 71–111.
Ohlson, J. A. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, 18(1), 109–131.
On cluster analysis methodology:
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108.
Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.
Theoretical perspective:
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(2), 411–423.
Additional reference cited in future work:
Zhou, L. (2013). Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods. Knowledge-Based Systems, 41, 16–25.