# Define the list of required packages
packages <- c("tidyverse", "skimr", "visdat", "naniar", 
              "GGally", "gridExtra", "scales", "corrplot", "dplyr", "tidyr", "rsample", "kableExtra")

# Function to install missing packages
install_if_missing <- function(pkg) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
    library(pkg, character.only = TRUE)
  }
}

# Run the function across the list
invisible(lapply(packages, install_if_missing))

# Verify loading
print("All packages are installed and loaded.")

## [1] "All packages are installed and loaded."

About the Dataset:

This analysis uses the American Companies Bankruptcy Prediction dataset to build a two-cluster k-means model and evaluate its ability to separate surviving firms from failed firms. The workflow follows the standard exploratory data analysis pipeline: data ingestion, type verification, missing value assessment, outlier handling, transformation, splitting, feature engineering, modeling, and evaluation.

The dataset contains 18 financial variables along with a company identifier, year, and survival status label. The table below documents the original column codes, the renamed variables used throughout this analysis, and a short description of each measure.

Variable	Renamed Variable	Description
X1	CurrAssets	All the assets of a company that are expected to be sold or used as a result of standard business operations over the next year.
X2	COGS	The total amount a company paid as a cost directly related to the sale of products.
X3	DepAmort	Depreciation refers to the loss of value of a tangible fixed asset over time (such as property, machinery, buildings, and plant). Amortization refers to the loss of value of intangible assets over time.
X4	EBITDA	Earnings before interest, taxes, depreciation, and amortization. A measure of a company’s overall financial performance, serving as an alternative to net income.
X5	Inventory	The accounting of items and raw materials that a company either uses in production or sells.
X6	NetIncome	The overall profitability of a company after all expenses and costs have been deducted from total revenue.
X7	TotRecv	The balance of money due to a firm for goods or services delivered or used but not yet paid for by customers.
X8	MktValue	The price of an asset in a marketplace. In this dataset, it refers to market capitalization since companies are publicly traded in the stock market.
X9	NetSales	The sum of a company’s gross sales minus its returns, allowances, and discounts.
X10	TotAssets	All the assets, or items of value, a business owns.
X11	LTDebt	A company’s loans and other liabilities that will not become due within one year of the balance sheet date.
X12	EBIT	Earnings before interest and taxes.
X13	GrossProfit	The profit a business makes after subtracting all the costs that are related to manufacturing and selling its products or services.
X14	TotCurrLiab	The sum of accounts payable, accrued liabilities, and taxes such as bonds payable at the end of the year, salaries, and commissions remaining.
X15	RetEarn	The amount of profit a company has left over after paying all its direct costs, indirect costs, income taxes, and dividends to shareholders.
X16	TotRevenue	The amount of income that a business has made from all sales before subtracting expenses. It may include interest and dividends from investments.
X17	TotLiab	The combined debts and obligations that the company owes to outside parties.
X18	TotOpEx	The expenses a business incurs through its normal business operations.

Data Preprocessing:

Identify variable types and redefine/recategorize as necessary.
Analyze and report the structure of the data.
Evaluate for missing data, outliers, and anomalies and handle them.
Conduct univariate/multivariate graphical analysis.
Evaluate and report on univariate/multivariate statistics.

Setting Up Directory

Data Acquisition

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

## Response [https://storage.googleapis.com:443/kaggle-data-sets/3326190/5791434/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20260427%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20260427T000519Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=6f57d5eee3785d87922bb706568cc5b8ee7a62d36d8f7e7722f6024b2a723fcbc32c660a2b712e0f774ecdb32278e950adb48380f4e0e66d6e9f20b0d735dc7c3e6495409513e064d00ddc55d5d32afb0bea06d6dd014e3581cdc8b09227e7f969ca1a92a3c351e2fcd55203159c20784b88f2ed00f52239e7da44b166519453911741c1bf2407492ad43ccb6bb3f9f5ee246167d10a60765f0ad839c4e749edafe58766cd78036b9a97b82552e12b94b3bc95e7e476a3d683e1b7ddd9d531b48359a111f2cc638fd3c94efa5f878da96f184bc96d050fc315380787169ca208e5dd59ac4d94d9ee8944b2d1416c2d159499cf644cae2e2c55d7768fdcbc3839]
##   Date: 2026-04-27 00:05
##   Status: 200
##   Content-Type: application/zip
##   Size: 4.68 MB
## <ON DISK>  C:\Users\abdou\OneDrive - Prod Student NU\Documents\DDS-8501 Exploratory Data Analysis\DDS8501_EDA\bankruptcy.zip

## Rows: 78682 Columns: 21

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): company_name, status_label
## dbl (19): year, X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 6 × 21
##   company_name status_label  year    X1    X2    X3    X4    X5     X6    X7
##   <chr>        <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 C_1          alive         1999  511.  833.  18.4  89.0  336.  35.2  128. 
## 2 C_1          alive         2000  486.  714.  18.6  64.4  321.  18.5  115. 
## 3 C_1          alive         2001  437.  526.  22.5  27.2  287. -58.9   77.5
## 4 C_1          alive         2002  396.  497.  27.2  30.7  260. -12.4   66.3
## 5 C_1          alive         2003  432.  523.  26.7  47.5  247.   3.50 105. 
## 6 C_1          alive         2004  475.  598.  28.0  61.8  255.  15.5  127. 
## # ℹ 11 more variables: X8 <dbl>, X9 <dbl>, X10 <dbl>, X11 <dbl>, X12 <dbl>,
## #   X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>, X18 <dbl>

Rename Variables

library(dplyr)

df <- df %>%
  dplyr::rename(
    CurrAssets  = X1,
    COGS        = X2,
    DepAmort    = X3,
    EBITDA      = X4,
    Inventory   = X5,
    NetIncome   = X6,
    TotRecv     = X7,
    MktValue    = X8,
    NetSales    = X9,
    TotAssets   = X10,
    LTDebt      = X11,
    EBIT        = X12,
    GrossProfit = X13,
    TotCurrLiab = X14,
    RetEarn     = X15,
    TotRevenue  = X16,
    TotLiab     = X17,
    TotOpEx     = X18
  )

## spc_tbl_ [78,682 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ company_name: chr [1:78682] "C_1" "C_1" "C_1" "C_1" ...
##  $ status_label: chr [1:78682] "alive" "alive" "alive" "alive" ...
##  $ year        : num [1:78682] 1999 2000 2001 2002 2003 ...
##  $ CurrAssets  : num [1:78682] 511 486 437 396 432 ...
##  $ COGS        : num [1:78682] 833 714 526 497 523 ...
##  $ DepAmort    : num [1:78682] 18.4 18.6 22.5 27.2 26.7 ...
##  $ EBITDA      : num [1:78682] 89 64.4 27.2 30.7 47.5 ...
##  $ Inventory   : num [1:78682] 336 321 287 260 247 ...
##  $ NetIncome   : num [1:78682] 35.2 18.5 -58.9 -12.4 3.5 ...
##  $ TotRecv     : num [1:78682] 128.3 115.2 77.5 66.3 104.7 ...
##  $ MktValue    : num [1:78682] 373 377 365 143 309 ...
##  $ NetSales    : num [1:78682] 1024 874 639 606 652 ...
##  $ TotAssets   : num [1:78682] 741 702 710 687 709 ...
##  $ LTDebt      : num [1:78682] 180 180 218 165 249 ...
##  $ EBIT        : num [1:78682] 70.66 45.79 4.71 3.57 20.81 ...
##  $ GrossProfit : num [1:78682] 191 160 112 110 129 ...
##  $ TotCurrLiab : num [1:78682] 164 125 150 204 131 ...
##  $ RetEarn     : num [1:78682] 201 204 140 124 132 ...
##  $ TotRevenue  : num [1:78682] 1024 874 639 606 652 ...
##  $ TotLiab     : num [1:78682] 401 362 400 392 408 ...
##  $ TotOpEx     : num [1:78682] 935 810 612 576 604 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   company_name = col_character(),
##   ..   status_label = col_character(),
##   ..   year = col_double(),
##   ..   X1 = col_double(),
##   ..   X2 = col_double(),
##   ..   X3 = col_double(),
##   ..   X4 = col_double(),
##   ..   X5 = col_double(),
##   ..   X6 = col_double(),
##   ..   X7 = col_double(),
##   ..   X8 = col_double(),
##   ..   X9 = col_double(),
##   ..   X10 = col_double(),
##   ..   X11 = col_double(),
##   ..   X12 = col_double(),
##   ..   X13 = col_double(),
##   ..   X14 = col_double(),
##   ..   X15 = col_double(),
##   ..   X16 = col_double(),
##   ..   X17 = col_double(),
##   ..   X18 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

##  company_name       status_label            year        CurrAssets       
##  Length:78682       Length:78682       Min.   :1999   Min.   :    -7.76  
##  Class :character   Class :character   1st Qu.:2002   1st Qu.:    18.92  
##  Mode  :character   Mode  :character   Median :2007   Median :   100.45  
##                                        Mean   :2008   Mean   :   880.36  
##                                        3rd Qu.:2012   3rd Qu.:   431.53  
##                                        Max.   :2018   Max.   :169662.00  
##       COGS              DepAmort             EBITDA          
##  Min.   :  -366.64   Min.   :    0.000   Min.   :-21913.000  
##  1st Qu.:    17.04   1st Qu.:    1.192   1st Qu.:    -0.811  
##  Median :   103.66   Median :    7.929   Median :    15.034  
##  Mean   :  1594.53   Mean   :  121.234   Mean   :   376.759  
##  3rd Qu.:   634.55   3rd Qu.:   47.972   3rd Qu.:   139.655  
##  Max.   :374623.00   Max.   :28430.000   Max.   : 81730.000  
##    Inventory           NetIncome             TotRecv         
##  Min.   :    0.000   Min.   :-98696.000   Min.   :   -0.006  
##  1st Qu.:    0.000   1st Qu.:    -7.416   1st Qu.:    3.281  
##  Median :    7.023   Median :     1.616   Median :   22.820  
##  Mean   :  201.606   Mean   :   129.382   Mean   :  286.833  
##  3rd Qu.:   74.747   3rd Qu.:    40.144   3rd Qu.:  131.580  
##  Max.   :62567.000   Max.   :104821.000   Max.   :65812.000  
##     MktValue            NetSales           TotAssets        
##  Min.   :0.000e+00   Min.   : -1965.00   Min.   :     0.00  
##  1st Qu.:3.498e+01   1st Qu.:    27.55   1st Qu.:    37.36  
##  Median :2.275e+02   Median :   186.60   Median :   213.20  
##  Mean   :3.414e+03   Mean   :  2364.02   Mean   :  2867.11  
##  3rd Qu.:1.245e+03   3rd Qu.:  1046.40   3rd Qu.:  1171.36  
##  Max.   :1.073e+06   Max.   :511729.00   Max.   :531864.00  
##      LTDebt                EBIT             GrossProfit        
##  Min.   :    -0.023   Min.   :-25913.000   Min.   :-21536.000  
##  1st Qu.:     0.000   1st Qu.:    -2.787   1st Qu.:     8.521  
##  Median :     7.593   Median :     6.518   Median :    63.581  
##  Mean   :   722.484   Mean   :   255.525   Mean   :   769.491  
##  3rd Qu.:   248.761   3rd Qu.:    87.599   3rd Qu.:   344.074  
##  Max.   :166250.000   Max.   : 71230.000   Max.   :137106.000  
##   TotCurrLiab           RetEarn             TotRevenue       
##  Min.   :1.000e-03   Min.   :-102362.00   Min.   : -1965.00  
##  1st Qu.:8.889e+00   1st Qu.:    -68.28   1st Qu.:    27.55  
##  Median :4.333e+01   Median :     -1.13   Median :   186.60  
##  Mean   :6.101e+02   Mean   :    532.47   Mean   :  2364.02  
##  3rd Qu.:2.228e+02   3rd Qu.:    146.07   3rd Qu.:  1046.40  
##  Max.   :1.169e+05   Max.   : 402089.00   Max.   :511729.00  
##     TotLiab             TotOpEx         
##  Min.   :     0.00   Min.   :  -317.20  
##  1st Qu.:    13.49   1st Qu.:    32.87  
##  Median :    81.99   Median :   168.91  
##  Mean   :  1773.56   Mean   :  1987.26  
##  3rd Qu.:   629.98   3rd Qu.:   875.52  
##  Max.   :337980.00   Max.   :481580.00

The dataset contains three identifier or label fields (company_name, status_label, year) and 18 quantitative financial measures. The target variable status_label should be a factor, and year should be treated as an ordered grouping variable rather than a numeric measure.

# Reclassify the important variables

df$status_label   <- as.factor(df$status_label)
df$year <- as.integer(df$year)
str(df)

Type	Variables
Identifier (qualitative)	company_name
Categorical (qualitative, target)	status_label
Temporal (discrete)	year
Quantitative (continuous)	CurrAssets, COGS, DepAmort, EBITDA, Inventory, NetIncome, TotRecv, MktValue, NetSales, TotAssets, LTDebt, EBIT, GrossProfit, TotCurrLiab, RetEarn, TotRevenue, TotLiab, TotOpEx

# Overall dimensions of the dataset.
dim(df)

## [1] 78682    21

# Inspect the resulting structure.
glimpse(df)

## Rows: 78,682
## Columns: 21
## $ company_name <chr> "C_1", "C_1", "C_1", "C_1", "C_1", "C_1", "C_1", "C_1", "…
## $ status_label <chr> "alive", "alive", "alive", "alive", "alive", "alive", "al…
## $ year         <dbl> 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 200…
## $ CurrAssets   <dbl> 511.267, 485.856, 436.656, 396.412, 432.204, 474.542, 624…
## $ COGS         <dbl> 833.107, 713.811, 526.477, 496.747, 523.302, 598.172, 704…
## $ DepAmort     <dbl> 18.373, 18.577, 22.496, 27.172, 26.680, 27.950, 29.222, 3…
## $ EBITDA       <dbl> 89.031, 64.367, 27.207, 30.745, 47.491, 61.774, 91.877, 1…
## $ Inventory    <dbl> 336.018, 320.590, 286.588, 259.954, 247.245, 255.477, 323…
## $ NetIncome    <dbl> 35.163, 18.531, -58.939, -12.410, 3.504, 15.453, 35.163, …
## $ TotRecv      <dbl> 128.348, 115.187, 77.528, 66.322, 104.661, 127.121, 136.2…
## $ MktValue     <dbl> 372.7519, 377.1180, 364.5928, 143.3295, 308.9071, 522.679…
## $ NetSales     <dbl> 1024.333, 874.255, 638.721, 606.337, 651.958, 747.848, 89…
## $ TotAssets    <dbl> 740.998, 701.854, 710.199, 686.621, 709.292, 732.230, 978…
## $ LTDebt       <dbl> 180.447, 179.987, 217.699, 164.658, 248.666, 227.159, 318…
## $ EBIT         <dbl> 70.658, 45.790, 4.711, 3.573, 20.811, 33.824, 62.655, 86.…
## $ GrossProfit  <dbl> 191.226, 160.444, 112.244, 109.590, 128.656, 149.676, 193…
## $ TotCurrLiab  <dbl> 163.816, 125.392, 150.464, 203.575, 131.261, 160.025, 187…
## $ RetEarn      <dbl> 201.026, 204.065, 139.603, 124.106, 131.884, 142.450, 183…
## $ TotRevenue   <dbl> 1024.333, 874.255, 638.721, 606.337, 651.958, 747.848, 89…
## $ TotLiab      <dbl> 401.483, 361.642, 399.964, 391.633, 407.608, 417.486, 556…
## $ TotOpEx      <dbl> 935.302, 809.888, 611.514, 575.592, 604.467, 686.074, 805…

# Compact summary of all columns including missingness, distribution shape,
# and quartiles.
skim(df)

Data summary
Name	df
Number of rows	78682
Number of columns	21
_______________________
Column type frequency:
character	2
numeric	19
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
company_name	0	1	3	6	0	8971	0
status_label	0	1	5	6	0	2	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2007.51	5.74	1999.00	2002.00	2007.00	2012.00	2018	▇▇▆▆▅
CurrAssets	1	880.36	3928.56	-7.76	18.92	100.45	431.53	169662	▇▁▁▁▁
COGS	1	1594.53	8930.48	-366.64	17.04	103.66	634.55	374623	▇▁▁▁▁
DepAmort	1	121.23	652.38	0.00	1.19	7.93	47.97	28430	▇▁▁▁▁
EBITDA	1	376.76	2012.02	-21913.00	-0.81	15.03	139.66	81730	▁▇▁▁▁
Inventory	1	201.61	1060.77	0.00	0.00	7.02	74.75	62567	▇▁▁▁▁
NetIncome	1	129.38	1265.53	-98696.00	-7.42	1.62	40.14	104821	▁▁▇▁▁
TotRecv	1	286.83	1335.98	-0.01	3.28	22.82	131.58	65812	▇▁▁▁▁
MktValue	1	3414.35	18414.10	0.00	34.98	227.51	1244.89	1073391	▇▁▁▁▁
NetSales	1	2364.02	11950.07	-1965.00	27.55	186.60	1046.40	511729	▇▁▁▁▁
TotAssets	1	2867.11	12917.94	0.00	37.36	213.20	1171.36	531864	▇▁▁▁▁
LTDebt	1	722.48	3242.17	-0.02	0.00	7.59	248.76	166250	▇▁▁▁▁
EBIT	1	255.53	1494.64	-25913.00	-2.79	6.52	87.60	71230	▁▇▁▁▁
GrossProfit	1	769.49	3774.70	-21536.00	8.52	63.58	344.07	137106	▇▁▁▁▁
TotCurrLiab	1	610.07	2938.39	0.00	8.89	43.33	222.82	116866	▇▁▁▁▁
RetEarn	1	532.47	6369.16	-102362.00	-68.28	-1.13	146.07	402089	▁▇▁▁▁
TotRevenue	1	2364.02	11950.07	-1965.00	27.55	186.60	1046.40	511729	▇▁▁▁▁
TotLiab	1	1773.56	8053.68	0.00	13.49	81.99	629.98	337980	▇▁▁▁▁
TotOpEx	1	1987.26	10419.63	-317.20	32.87	168.91	875.52	481580	▇▁▁▁▁

Report the row count, column count, observed year range, and the distribution of status_label (count of alive versus failed) directly from the skim() output. The class imbalance between failed and surviving firms is a documented characteristic of this dataset and should be noted before any modeling.

Missing Data Evaluation

The dataset show complete records across the 18 financial variables. miss_var_summary returns zero missingness, which means the imputation procedure is not needed.

# Numeric summary
miss_var_summary(df)

## # A tibble: 21 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <num>
##  1 company_name      0        0
##  2 status_label      0        0
##  3 year              0        0
##  4 CurrAssets        0        0
##  5 COGS              0        0
##  6 DepAmort          0        0
##  7 EBITDA            0        0
##  8 Inventory         0        0
##  9 NetIncome         0        0
## 10 TotRecv           0        0
## # ℹ 11 more rows

# Visual inspection
vis_miss(df, warn_large_data = FALSE)

gg_miss_var(df)

Outlier and Anomalies.

Financial data is heavily right-skewed and contains legitimate extreme values (large firms vs. small firms). Standard deviation-based rules are inappropriate. Use the IQR rule for flagging and, where transformation is warranted, apply a signed log transform that handles negatives (since variables like NetIncome, EBIT, and RetEarn can be negative).

# Helper function: signed log transform.
# log1p(x) = log(1 + x), which handles values near zero stably.
# Multiplying by sign(x) preserves negative values after transformation.

# Select numeric variables
numeric_vars <- df %>%
  select(CurrAssets:TotOpEx)

# IQR-based outlier counts per variable
outlier_counts <- numeric_vars %>%
  summarise(across(everything(), ~ {
    q <- quantile(.x, probs = c(0.25, 0.75), na.rm = TRUE)
    iqr <- IQR(.x, na.rm = TRUE)
    sum(.x < (q[1] - 1.5 * iqr) | .x > (q[2] + 1.5 * iqr), na.rm = TRUE)
  })) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_outliers") %>%
  arrange(desc(n_outliers))

outlier_counts

## # A tibble: 18 × 2
##    variable    n_outliers
##    <chr>            <int>
##  1 RetEarn          17867
##  2 NetIncome        17004
##  3 EBIT             13251
##  4 LTDebt           12641
##  5 Inventory        12435
##  6 TotLiab          12366
##  7 EBITDA           12218
##  8 TotCurrLiab      11781
##  9 COGS             11594
## 10 TotAssets        11521
## 11 DepAmort         11475
## 12 MktValue         11409
## 13 GrossProfit      11306
## 14 TotRecv          11237
## 15 NetSales         11225
## 16 TotRevenue       11225
## 17 TotOpEx          11214
## 18 CurrAssets       10975

# Signed log transform: preserves the sign of negative values
signed_log <- function(x) {
  sign(x) * log1p(abs(x))
}

df_log <- df %>%
  mutate(across(CurrAssets:TotOpEx, signed_log, .names = "log_{.col}"))

Outliers are flagged but retained. Extreme values in financial data often correspond to the firms most relevant to bankruptcy prediction. The signed log transform compresses scale while preserving the sign of negative earnings.

Univariate Graphical Analysis

# Histograms of all financial variables on the signed log scale.
# Faceting by variable allows visual comparison of distribution shapes.
df_log %>%
  select(starts_with("log_")) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  facet_wrap(~ variable, scales = "free", ncol = 4) +
  labs(title = "Distribution of financial variables (signed log scale)",
       x = NULL, y = "Count") +
  theme_minimal()

df %>%
  ggplot(aes(x = status_label, fill = status_label)) +
  geom_bar() +
  labs(title = "Class distribution: alive vs. failed",
       x = NULL, y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Univariate Statistics

# Compute mean, median, sd, range, and skewness for each financial variable.
# Sorting by skew highlights the most heavily tailed distributions.
univariate_stats <- df %>%
  select(CurrAssets:TotOpEx) %>%
  summarise(across(everything(),
                   list(mean   = ~ mean(.x, na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE),
                        sd     = ~ sd(.x, na.rm = TRUE),
                        min    = ~ min(.x, na.rm = TRUE),
                        max    = ~ max(.x, na.rm = TRUE),
                        skew   = ~ moments::skewness(.x, na.rm = TRUE),
                        kurtosis = ~ moments::kurtosis(.x, na.rm = TRUE)))) %>%
  pivot_longer(everything(),
               names_to  = c("variable", ".value"),
               names_sep = "_(?=[^_]+$)") %>%
  arrange(desc(skew))

univariate_stats

## # A tibble: 18 × 8
##    variable     mean median     sd          min      max  skew kurtosis
##    <chr>       <dbl>  <dbl>  <dbl>        <dbl>    <dbl> <dbl>    <dbl>
##  1 RetEarn      532.  -1.13  6369. -102362       402089   29.6    1547.
##  2 Inventory    202.   7.02  1061.       0        62567   22.6     817.
##  3 TotOpEx     1987. 169.   10420.    -317.      481580   20.4     634.
##  4 COGS        1595. 104.    8930.    -367.      374623   20.1     580.
##  5 NetSales    2364. 187.   11950.   -1965.      511729   19.0     549.
##  6 TotRevenue  2364. 187.   11950.   -1965.      511729   19.0     549.
##  7 MktValue    3414. 228.   18414.       0.0001 1073391.  18.2     544.
##  8 EBIT         256.   6.52  1495.  -25913        71230   18.0     531.
##  9 DepAmort     121.   7.93   652.       0        28430   17.9     448.
## 10 EBITDA       377.  15.0   2012.  -21913        81730   16.4     403.
## 11 TotRecv      287.  22.8   1336.      -0.006    65812   15.8     403.
## 12 GrossProfit  769.  63.6   3775.  -21536       137106   15.3     338.
## 13 LTDebt       722.   7.59  3242.      -0.023   166250   14.8     390.
## 14 CurrAssets   880. 100.    3929.      -7.76    169662   14.8     340.
## 15 TotCurrLiab  610.  43.3   2938.       0.001   116866   14.2     297.
## 16 TotLiab     1774.  82.0   8054.       0.001   337980   13.8     299.
## 17 TotAssets   2867. 213.   12918.       0.001   531864   13.6     276.
## 18 NetIncome    129.   1.62  1266.  -98696       104821   11.9    1499.

Multivariate Graphic Analysis

# Pearson correlation matrix on the log-transformed variables.
cor_matrix <- df_log %>%
  select(starts_with("log_")) %>%
  cor(use = "complete.obs")

# Visualize the correlation matrix as an upper-triangular colored grid.
cor_matrix %>%
  corrplot(method = "color",
           type   = "upper",
           tl.cex = 0.7,
           tl.col = "black",
           addCoef.col = "black",
           number.cex  = 0.55)

# Boxplots of key financial indicators split by survival status.
# These plots reveal whether log-scale distributions differ between groups.
df_log %>%
  select(status_label,
         log_TotAssets, log_NetIncome, log_RetEarn,
         log_TotLiab,   log_EBITDA,    log_MktValue) %>%
  pivot_longer(-status_label, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = status_label, y = value, fill = status_label)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ variable, scales = "free_y") +
  labs(title = "Financial indicators by firm status",
       x = NULL, y = "Signed log value") +
  theme_minimal() +
  theme(legend.position = "none")

Multivariate Statistics

# Median values of each variable split by survival status.
# The ratio column shows how much smaller failed firms are at the median.
group_medians <- df %>%
  group_by(status_label) %>%
  summarise(across(CurrAssets:TotOpEx, ~ median(.x, na.rm = TRUE))) %>%
  pivot_longer(-status_label, names_to = "variable", values_to = "median") %>%
  pivot_wider(names_from = status_label, values_from = median) %>%
  mutate(ratio_failed_to_alive = failed / alive) %>%
  arrange(ratio_failed_to_alive)

group_medians

## # A tibble: 18 × 4
##    variable      alive  failed ratio_failed_to_alive
##    <chr>         <dbl>   <dbl>                 <dbl>
##  1 NetIncome     2.07   -3.33                -1.61  
##  2 EBIT          7.18    0.487                0.0678
##  3 MktValue    241.    118.                   0.489 
##  4 EBITDA       15.6     7.79                 0.499 
##  5 TotRecv      23.5    15.2                  0.649 
##  6 GrossProfit  64.8    47.5                  0.733 
##  7 CurrAssets  103.     75.9                  0.737 
##  8 Inventory     7.12    6.17                 0.867 
##  9 TotAssets   215.    195.                   0.908 
## 10 TotOpEx     170.    156.                   0.914 
## 11 NetSales    187.    181.                   0.967 
## 12 TotRevenue  187.    181.                   0.967 
## 13 DepAmort      7.94    7.85                 0.989 
## 14 TotCurrLiab  43.4    42.9                  0.990 
## 15 COGS        104.    107.                   1.03  
## 16 TotLiab      80.7    97.0                  1.20  
## 17 LTDebt        7.09   18.4                  2.60  
## 18 RetEarn      -0.172 -25.5                148.

# Top pairwise correlations on the log scale.
# distinct() removes duplicate (var1, var2) and (var2, var1) pairs.
cor_long <- cor_matrix %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
  filter(var1 != var2) %>%
  mutate(abs_r = abs(r)) %>%
  arrange(desc(abs_r)) %>%
  distinct(abs_r, .keep_all = TRUE)

cor_long %>% head(15)

## # A tibble: 15 × 4
##    var1            var2                r abs_r
##    <chr>           <chr>           <dbl> <dbl>
##  1 log_NetSales    log_TotRevenue  1     1    
##  2 log_COGS        log_TotOpEx     0.977 0.977
##  3 log_NetSales    log_TotOpEx     0.975 0.975
##  4 log_TotCurrLiab log_TotLiab     0.969 0.969
##  5 log_CurrAssets  log_TotAssets   0.961 0.961
##  6 log_COGS        log_NetSales    0.959 0.959
##  7 log_TotAssets   log_TotLiab     0.955 0.955
##  8 log_TotCurrLiab log_TotOpEx     0.948 0.948
##  9 log_CurrAssets  log_TotOpEx     0.942 0.942
## 10 log_TotAssets   log_TotCurrLiab 0.942 0.942
## 11 log_TotAssets   log_TotOpEx     0.938 0.938
## 12 log_DepAmort    log_TotLiab     0.936 0.936
## 13 log_EBITDA      log_EBIT        0.934 0.934
## 14 log_DepAmort    log_TotAssets   0.933 0.933
## 15 log_NetSales    log_TotCurrLiab 0.932 0.932

Data Splitting and Transformation:

Split your data into training/test sets without leakage.
Conduct transformation/scaling.
Conduct feature engineering, producing at least one new variable for use in your analysis.

Data Splitting Without Leakage

The panel structure of this dataset (multiple year-rows per company) creates a real leakage risk. If observations from the same firm appear in both training and test sets, the model effectively sees its target firm during training. The correct unit for splitting is the company, not the row.

set.seed(8501)

# Build a company-level frame with a single survival label per firm.
# last() selects the most recent status, which is the final outcome.
company_labels <- df %>%
  group_by(company_name) %>%
  summarise(status_label = last(status_label), .groups = "drop")

# Stratified 80/20 split at the company level to preserve class balance.
company_split <- company_labels %>%
  initial_split(prop = 0.8, strata = status_label)

train_companies <- company_split %>% training() %>% pull(company_name)
test_companies  <- company_split %>% testing()  %>% pull(company_name)

# Apply the split back to the full firm-year dataset
train_set <- df %>% filter(company_name %in% train_companies)
test_set  <- df %>% filter(company_name %in% test_companies)

# Verify the split
list(train = train_set, test = test_set) %>%
  map_dfr(~ .x %>%
            summarise(n_rows      = n(),
                      n_companies = n_distinct(company_name),
                      pct_failed  = round(100 * mean(status_label == "failed"), 2)),
          .id = "set")

## # A tibble: 2 × 4
##   set   n_rows n_companies pct_failed
##   <chr>  <int>       <int>      <dbl>
## 1 train  63245        7176       6.68
## 2 test   15437        1795       6.44

The verification table confirms that no company appears in both sets and that the failure rate is preserved across them. Stratification by status_label maintains roughly equal class balance between train and test, which matters given the imbalance between surviving and failed firms.

Transformation and Scaling

Two transformations apply here. The signed log handles skew and negative values in the raw financial measures. Standardization (z-score) puts all variables on a comparable scale for modeling. Both must be fit on the training set only and then applied to the test set using the training parameters.

signed_log <- function(x) sign(x) * log1p(abs(x))

# Step 1: signed log transform on both sets
train_log <- train_set %>%
  mutate(across(CurrAssets:TotOpEx, signed_log, .names = "log_{.col}"))

test_log <- test_set %>%
  mutate(across(CurrAssets:TotOpEx, signed_log, .names = "log_{.col}"))

# Step 2: compute scaling parameters (mean and sd) from the training set.
# These parameters will be reused on the test set without recomputation.
scale_params <- train_log %>%
  select(starts_with("log_")) %>%
  summarise(across(everything(),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x,   na.rm = TRUE)))) %>%
  pivot_longer(everything(),
               names_to  = c("variable", ".value"),
               names_sep = "_(?=[^_]+$)")

scale_params

## # A tibble: 18 × 3
##    variable         mean    sd
##    <chr>           <dbl> <dbl>
##  1 log_CurrAssets  4.54   2.21
##  2 log_COGS        4.67   2.45
##  3 log_DepAmort    2.50   1.98
##  4 log_EBITDA      2.29   3.40
##  5 log_Inventory   2.50   2.36
##  6 log_NetIncome   0.827  3.56
##  7 log_TotRecv     3.26   2.20
##  8 log_MktValue    5.40   2.43
##  9 log_NetSales    5.11   2.55
## 10 log_TotAssets   5.35   2.43
## 11 log_LTDebt      2.94   2.93
## 12 log_EBIT        1.67   3.49
## 13 log_GrossProfit 3.85   2.87
## 14 log_TotCurrLiab 3.94   2.16
## 15 log_RetEarn     0.309  5.03
## 16 log_TotRevenue  5.11   2.55
## 17 log_TotLiab     4.62   2.48
## 18 log_TotOpEx     5.16   2.28

# Step 3: helper function that applies training-derived scaling to a dataset.
# cur_column() inside across() lets us look up the right mean/sd per variable.
apply_scaling <- function(df, params) {
  df %>%
    mutate(across(starts_with("log_"),
                  ~ {
                    p <- params %>% filter(variable == cur_column())
                    (.x - p$mean) / p$sd
                  },
                  .names = "z_{.col}"))
}

train_scaled <- train_log %>% apply_scaling(scale_params)
test_scaled  <- test_log  %>% apply_scaling(scale_params)

# Verification: training z-scores should have mean ~0 and sd ~1.
# Test z-scores will not, which is the correct behavior.
train_scaled %>%
  select(starts_with("z_")) %>%
  summarise(across(everything(),
                   list(mean = ~ round(mean(.x, na.rm = TRUE), 3),
                        sd   = ~ round(sd(.x,   na.rm = TRUE), 3)))) %>%
  pivot_longer(everything(),
               names_to  = c("variable", ".value"),
               names_sep = "_(?=[^_]+$)") %>%
  head(10)

## # A tibble: 10 × 3
##    variable          mean    sd
##    <chr>            <dbl> <dbl>
##  1 z_log_CurrAssets     0     1
##  2 z_log_COGS           0     1
##  3 z_log_DepAmort       0     1
##  4 z_log_EBITDA         0     1
##  5 z_log_Inventory      0     1
##  6 z_log_NetIncome      0     1
##  7 z_log_TotRecv        0     1
##  8 z_log_MktValue       0     1
##  9 z_log_NetSales       0     1
## 10 z_log_TotAssets      0     1

The mean of each z_log_ column in the training set should be approximately zero and the standard deviation approximately one. Test set z-scores will not have exactly these properties, which is the point. The transformation reflects training-set parameters rather than re-fitting on test data.

Feature Engineering

Bankruptcy prediction benefits from financial ratios more than from raw dollar amounts. Ratios normalize across firm size and capture relationships between balance sheet components that the original variables describe only indirectly. The features below are grounded in the Altman Z-score tradition, which remains a standard reference in corporate distress prediction (Altman, 1968; Altman, 2000).

add_engineered_features <- function(df) {
  df %>%
    mutate(
      working_capital     = CurrAssets - TotCurrLiab,
      wc_to_assets        = working_capital / TotAssets,
      retained_to_assets  = RetEarn        / TotAssets,
      ebit_to_assets      = EBIT           / TotAssets,
      mkt_to_liab         = MktValue       / TotLiab,
      sales_to_assets     = NetSales       / TotAssets,
      debt_to_assets      = TotLiab        / TotAssets,
      profit_margin       = NetIncome      / NetSales,
      current_ratio       = CurrAssets     / TotCurrLiab,
      across(working_capital:current_ratio,
             ~ if_else(is.finite(.x), .x, NA_real_))
    )
}

train_features <- train_scaled %>% add_engineered_features()
test_features  <- test_scaled  %>% add_engineered_features()

# Inspect the new features
train_features %>%
  select(working_capital:current_ratio) %>%
  summarise(across(everything(),
                   list(median = ~ median(.x, na.rm = TRUE),
                        mean   = ~ mean(.x,   na.rm = TRUE),
                        sd     = ~ sd(.x,     na.rm = TRUE),
                        n_na   = ~ sum(is.na(.x))))) %>%
  pivot_longer(everything(),
               names_to  = c("variable", ".value"),
               names_sep = "_(?=[^_]+$)")

## # A tibble: 18 × 5
##    variable              median    mean      sd    na
##    <chr>                  <dbl>   <dbl>   <dbl> <int>
##  1 working_capital      34.5    275.    1918.      NA
##  2 working_capital_n    NA       NA       NA        0
##  3 wc_to_assets          0.211   -0.998   36.5     NA
##  4 wc_to_assets_n       NA       NA       NA        0
##  5 retained_to_assets   -0.0179 -11.2    240.      NA
##  6 retained_to_assets_n NA       NA       NA        0
##  7 ebit_to_assets        0.0474  -0.466   10.2     NA
##  8 ebit_to_assets_n     NA       NA       NA        0
##  9 mkt_to_liab           2.29    15.5    836.      NA
## 10 mkt_to_liab_n        NA       NA       NA        0
## 11 sales_to_assets       0.892    1.19     3.55    NA
## 12 sales_to_assets_n    NA       NA       NA        0
## 13 debt_to_assets        0.506    1.95    43.2     NA
## 14 debt_to_assets_n     NA       NA       NA        0
## 15 profit_margin         0.0151 -11.4    277.      NA
## 16 profit_margin_n      NA       NA       NA        0
## 17 current_ratio         1.90     3.60    98.7     NA
## 18 current_ratio_n      NA       NA       NA        0

Feature	Construction	Interpretation
working_capital	CurrAssets − TotCurrLiab	Short-term liquidity buffer
wc_to_assets	working_capital / TotAssets	Liquidity scaled by firm size (Altman X1)
retained_to_assets	RetEarn / TotAssets	Cumulative profitability over firm history (Altman X2)
ebit_to_assets	EBIT / TotAssets	Operating profitability (Altman X3)
mkt_to_liab	MktValue / TotLiab	Market-based solvency cushion (Altman X4)
sales_to_assets	NetSales / TotAssets	Asset turnover efficiency (Altman X5)
debt_to_assets	TotLiab / TotAssets	Leverage
profit_margin	NetIncome / NetSales	Earnings efficiency on sales
current_ratio	CurrAssets / TotCurrLiab	Standard short-term solvency measure

The if_else(is.finite(.x), .x, NA_real_) step converts infinite values (from divide-by-zero cases) to NA, which is necessary because some firms report zero current liabilities or zero sales in certain years.

# Compare engineered features between alive and failed firms
train_features %>%
  group_by(status_label) %>%
  summarise(across(working_capital:current_ratio,
                   ~ median(.x, na.rm = TRUE))) %>%
  pivot_longer(-status_label, names_to = "feature", values_to = "median") %>%
  pivot_wider(names_from = status_label, values_from = median) %>%
  mutate(diff = alive - failed) %>%
  arrange(desc(abs(diff)))

## # A tibble: 9 × 4
##   feature               alive   failed    diff
##   <chr>                 <dbl>    <dbl>   <dbl>
## 1 working_capital    35.8     17.0     18.8   
## 2 mkt_to_liab         2.38     1.05     1.33  
## 3 current_ratio       1.93     1.53     0.392 
## 4 retained_to_assets -0.00464 -0.228    0.224 
## 5 debt_to_assets      0.498    0.632   -0.134 
## 6 wc_to_assets        0.218    0.123    0.0946
## 7 profit_margin       0.0177  -0.0391   0.0568
## 8 sales_to_assets     0.893    0.843    0.0505
## 9 ebit_to_assets      0.0499   0.00755  0.0424

This table shows which engineered features most sharply separate surviving from failed firms at the median. Features with large positive differences (alive minus failed) are those in which surviving firms outperform, providing an early signal of feature importance before any modeling.

Clustering and Model Building

Build the K-Means Model (k = 2)

K-means is unsupervised, so the cluster labels are not directly tied to status_label. The model partitions firms into two groups based on financial structure, and we then map cluster identity to the survival outcome by majority class within each cluster on the training set.

set.seed(8501)

# Use the standardized log-scale variables as clustering input
cluster_vars <- train_features %>%
  select(starts_with("z_log_")) %>%
  drop_na()

# Keep the row alignment with status_label for later evaluation
train_clean <- train_features %>%
  drop_na(starts_with("z_log_"))

# Fit k-means with k = 2
kmeans_fit <- cluster_vars %>%
  kmeans(centers = 2, nstart = 25, iter.max = 100)

# Attach cluster assignments to the training data
train_clustered <- train_clean %>%
  mutate(cluster = factor(kmeans_fit$cluster))

# Map clusters to survival outcome by majority class
cluster_to_label <- train_clustered %>%
  count(cluster, status_label) %>%
  group_by(cluster) %>%
  slice_max(n, n = 1) %>%
  ungroup() %>%
  select(cluster, predicted_label = status_label)
  

cluster_to_label

## # A tibble: 2 × 2
##   cluster predicted_label
##   <fct>   <chr>          
## 1 1       alive          
## 2 2       alive

The cluster_to_label table assigns each cluster to its dominant class. If both clusters map to “alive,” that is itself an important finding about class imbalance in the feature space, and it should be reported rather than corrected by forcing a different mapping.

library(knitr)
library(kableExtra)

# Display the cluster-to-label mapping table
cluster_to_label %>%
  kable(caption = "Final Two-Cluster k-means Partition", align = 'c') %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")

Final Two-Cluster k-means Partition
cluster	predicted_label
1	alive
2	alive

# Project the standardized features onto the first two principal components
# for visualization of the cluster partition.
pca_fit <- train_clustered %>%
  select(starts_with("z_log_")) %>%
  prcomp(center = FALSE, scale. = FALSE)

# Build a tidy data frame with the first two PCs, cluster, and survival label
plot_data <- train_clustered %>%
  select(cluster, status_label) %>%
  bind_cols(as_tibble(pca_fit$x[, 1:2]))

# Project the cluster centroids into the same PC space
centroids_pc <- as_tibble(kmeans_fit$centers %*% pca_fit$rotation[, 1:2]) %>%
  mutate(cluster = factor(1:n()))

# Final cluster diagram
plot_data %>%
  ggplot(aes(x = PC1, y = PC2, color = cluster, shape = status_label)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_point(data = centroids_pc,
             aes(x = PC1, y = PC2),
             color = "black", fill = "white",
             shape = 23, size = 5, stroke = 1.5,
             inherit.aes = FALSE) +
  scale_color_manual(values = c("1" = "#1E2761", "2" = "#F9A826")) +
  scale_shape_manual(values = c("alive" = 16, "failed" = 17)) +
  labs(title    = "Figure 3. Final Cluster Partition (PC1 vs PC2)",
       subtitle = "Diamonds mark cluster centroids",
       x = "Principal Component 1",
       y = "Principal Component 2",
       color = "Cluster",
       shape = "Status") +
  theme_minimal() +
  theme(legend.position = "right",
        plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "gray40"))

Interpret the Cluster Centers

The most useful interpretation of a k-means model is the centroid profile: which features distinguish one cluster from the other.

# Cluster sizes
train_clustered %>%
  count(cluster) %>%
  mutate(percent = round(100 * n / sum(n), 2))

## # A tibble: 2 × 3
##   cluster     n percent
##   <fct>   <int>   <dbl>
## 1 1       35056    55.4
## 2 2       28189    44.6

# Centroid profile in standardized space
# The separation column shows which features drive the partition;
# variables with large absolute separation define the cluster identities.
kmeans_fit$centers %>%
  as.data.frame() %>%
  rownames_to_column("cluster") %>%
  pivot_longer(-cluster, names_to = "variable", values_to = "z_mean") %>%
  pivot_wider(names_from = cluster, values_from = z_mean, names_prefix = "cluster_") %>%
  mutate(separation = cluster_1 - cluster_2) %>%
  arrange(desc(abs(separation)))

## # A tibble: 18 × 4
##    variable          cluster_1 cluster_2 separation
##    <chr>                 <dbl>     <dbl>      <dbl>
##  1 z_log_NetSales       -0.716     0.891      -1.61
##  2 z_log_TotRevenue     -0.716     0.891      -1.61
##  3 z_log_TotLiab        -0.711     0.885      -1.60
##  4 z_log_TotOpEx        -0.711     0.884      -1.59
##  5 z_log_COGS           -0.705     0.877      -1.58
##  6 z_log_TotCurrLiab    -0.704     0.875      -1.58
##  7 z_log_TotAssets      -0.703     0.874      -1.58
##  8 z_log_DepAmort       -0.699     0.870      -1.57
##  9 z_log_TotRecv        -0.697     0.867      -1.56
## 10 z_log_EBITDA         -0.687     0.855      -1.54
## 11 z_log_CurrAssets     -0.683     0.850      -1.53
## 12 z_log_GrossProfit    -0.661     0.823      -1.48
## 13 z_log_MktValue       -0.643     0.799      -1.44
## 14 z_log_EBIT           -0.634     0.789      -1.42
## 15 z_log_LTDebt         -0.615     0.765      -1.38
## 16 z_log_Inventory      -0.579     0.720      -1.30
## 17 z_log_NetIncome      -0.475     0.590      -1.06
## 18 z_log_RetEarn        -0.468     0.582      -1.05

# Cluster composition by survival status.
# pct_within_cluster shows how dominant each class is in each cluster.
train_clustered %>%
  count(cluster, status_label) %>%
  group_by(cluster) %>%
  mutate(pct_within_cluster = round(100 * n / sum(n), 2)) %>%
  ungroup()

## # A tibble: 4 × 4
##   cluster status_label     n pct_within_cluster
##   <fct>   <chr>        <int>              <dbl>
## 1 1       alive        32629              93.1 
## 2 1       failed        2427               6.92
## 3 2       alive        26390              93.6 
## 4 2       failed        1799               6.38

The separation column indicates which financial variables drive the partition. Variables with the largest absolute separation define the cluster identities. In bankruptcy data, the typical pattern is that one cluster centers on firms with stronger asset bases, profitability, and revenue, while the other centers on firms with weaker positions across most measures. Report the actual top variables from your output rather than predicting the pattern in advance.

Apply the Model to the Test Set

To evaluate a clustering model as a classifier, assign each test observation to its nearest centroid using the same scaling parameters as the training set, then compare the mapped cluster label to the true status_label.

# Test-set clustering input
test_clean <- test_features %>%
  drop_na(starts_with("z_log_"))

test_input <- test_clean %>%
  select(starts_with("z_log_"))

# Helper: assign each row to the nearest centroid
assign_cluster <- function(df, centers) {
  centers_mat <- as.matrix(centers)
  df %>%
    as.matrix() %>%
    apply(1, function(row) {
      distances <- rowSums((centers_mat - matrix(row,
                                                 nrow = nrow(centers_mat),
                                                 ncol = length(row),
                                                 byrow = TRUE))^2)
      which.min(distances)
    })
}

test_predictions <- test_clean %>%
  mutate(cluster = factor(assign_cluster(test_input, kmeans_fit$centers))) %>%
  left_join(cluster_to_label, by = "cluster") %>%
  mutate(predicted_label = factor(predicted_label,
                                  levels = c("alive", "failed")),
         status_label    = factor(status_label,
                                  levels = c("alive", "failed")))

Model Evaluation Metrics

For a binary classification framing, treat “failed” as the positive class. This convention matches the practical question: how well does the model identify firms heading toward bankruptcy?

# Confusion matrix: rows are true labels, columns are predicted labels.
conf_matrix <- test_predictions %>%
  count(status_label, predicted_label) %>%
  pivot_wider(names_from = predicted_label, values_from = n, values_fill = 0)

conf_matrix

## # A tibble: 2 × 2
##   status_label alive
##   <fct>        <int>
## 1 alive        14443
## 2 failed         994

# Compute classification metrics in a single pipeline.
# TP = true positives (correctly predicted failed firms)
# FP = false positives (alive firms predicted as failed)
# FN = false negatives (failed firms predicted as alive)
# TN = true negatives (correctly predicted alive firms)
metrics <- test_predictions %>%
  summarise(
    TP = sum(status_label == "failed" & predicted_label == "failed"),
    FP = sum(status_label == "alive"  & predicted_label == "failed"),
    FN = sum(status_label == "failed" & predicted_label == "alive"),
    TN = sum(status_label == "alive"  & predicted_label == "alive")
  ) %>%
  mutate(
    accuracy  = (TP + TN) / (TP + TN + FP + FN),
    precision = TP / (TP + FP),
    recall    = TP / (TP + FN),
    f1_score  = 2 * (precision * recall) / (precision + recall)
  )

metrics %>%
  select(accuracy, precision, recall, f1_score) %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  mutate(value = round(value, 4))

## # A tibble: 4 × 2
##   metric      value
##   <chr>       <dbl>
## 1 accuracy    0.936
## 2 precision NaN    
## 3 recall      0    
## 4 f1_score  NaN

Interpretation of Results

Report each metric using the exact value produced by the chunk above.

Clustering performance. A k-means model fit on financial measures partitions firms into two structurally distinct groups. The first question is whether those groups align with the survival outcome at all. If one cluster is dominated by failed firms while the other is dominated by surviving firms, the partition has captured a real economic signal. If both clusters are dominated by surviving firms, the partition reflects firm size or sector rather than distress, and the model is not informative for the prediction task. The cluster composition table above answers this directly.

Precision (positive predictive value). The proportion of firms predicted as “failed” that actually failed. High precision indicates that when the model flags a firm, the flag is reliable. Low precision means the model produces many false alarms, and an analyst acting on those flags would waste resources investigating healthy firms.

Recall (sensitivity). The proportion of failed firms the model correctly identified. High recall is critical in bankruptcy prediction because the cost of missing a failing firm (lost investment, unrecovered loans) typically exceeds the cost of investigating a healthy one. Low recall means the model fails its primary purpose: catching firms in distress before they collapse.

Accuracy. The proportion of all firms classified correctly. Accuracy is the least informative metric for this dataset because of class imbalance. A trivial classifier that predicts every firm as “alive” can score high accuracy while providing no value. Report it for completeness but emphasize precision, recall, and F1.

F1 score. The harmonic mean of precision and recall. A balanced summary that penalizes models that achieve high performance on only one of the two. Useful here because both errors carry real cost: false positives consume analyst time, false negatives miss the firms that matter most.

Limitations

The analysis has several limitations that should be documented.

K-means assumes spherical, equally sized clusters in Euclidean space. Financial data, even after signed log transformation and standardization, does not satisfy this assumption strictly. Firm-size effects produce elongated clusters, and the failure outcome is far rarer than survival, so the clusters are unlikely to be balanced.

The dataset has panel structure (firm-year observations), and the survival label is defined at the firm level. Clustering on firm-year rows means the same company contributes multiple observations, which violates the independence assumption underlying most distance-based methods. The firm-level split mitigates leakage but does not change the within-set non-independence.

K-means is unsupervised. It cannot be tuned to favor recall over precision the way a supervised classifier can. The cluster-to-label mapping is post hoc, and if the dominant class in both clusters happens to be the same, the model produces no useful predictions on the failed class regardless of cluster quality.

Class imbalance is severe in this dataset. Failed firms are a small fraction of total observations. K-means weights all points equally, so the centroids are pulled toward the majority class. Resampling, weighted clustering, or supervised methods would address this more directly.

Financial ratios and accounting variables capture firm condition at a single point in time. Bankruptcy is a process that unfolds over multiple years (Beaver, 1966; Ohlson, 1980), and a static cross-sectional model cannot capture the trajectory leading to failure.

Future Work

Several directions follow naturally from these limitations.

A supervised baseline using logistic regression or gradient-boosted trees would establish whether the features carry predictive signal beyond what k-means recovers. Comparing supervised performance against the clustering baseline quantifies the cost of the unsupervised approach.

Time-series feature engineering would address the static-snapshot limitation. Year-over-year changes in liquidity, leverage, and profitability are well-documented predictors of distress (Beaver, 1966) and are not used in the current feature set.

Alternative clustering algorithms (Gaussian mixture models, DBSCAN, hierarchical methods) relax the spherical-cluster assumption and may produce partitions more aligned with the actual structure of financial distress. Comparing these against k-means on the same features would clarify whether the limitation lies in the algorithm or in the features themselves.

Class-imbalance treatments such as SMOTE or cost-sensitive learning would address the imbalance directly and are standard in the bankruptcy prediction literature (Zhou, 2013).

Cluster validation indices (silhouette width, Calinski-Harabasz, gap statistic) would provide formal evidence on whether k = 2 is appropriate. The choice of k = 2 here is dictated by the binary outcome rather than by data structure, and the two questions need not produce the same answer.

References

On bankruptcy and corporate distress prediction:

Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23(4), 589–609.

Beaver, W. H. (1966). Financial ratios as predictors of failure. Journal of Accounting Research, 4, 71–111.

Ohlson, J. A. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, 18(1), 109–131.

On cluster analysis methodology:

Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108.

Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.

Theoretical perspective:

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(2), 411–423.

Additional reference cited in future work:

Zhou, L. (2013). Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods. Knowledge-Based Systems, 41, 16–25.

US Company Bankruptcy Prediction - Implement EDA/Clustering Analysis and Interpret

A.Toure

2026-04-26