In this project, we aim to build a machine learning model to predict company bankruptcy based on financial ratios from Taiwanese companies between 1999 and 2009. Using classification techniques, the goal is to analyze which financial factors are most predictive of bankruptcy. The project involves thorough data exploration, preprocessing, and the comparison of three machine learning models: Decision Trees, Random Forest, and k-Nearest Neighbors. These models will be cross-validated to evaluate performance, and their results will be analyzed to determine which is most effective at predicting company insolvency.
This dataset contains financial ratios collected from the Taiwan Economic Journal between 1999 and 2009. The goal is to predict company bankruptcy based on 95 financial indicators, covering areas such as profitability, solvency, cash flow, and corporate governance. The target variable, “Bankrupt?”, is binary: 0 indicates a company did not go bankrupt, and 1 indicates a company went bankrupt. This comprehensive set of financial features allows for an in-depth analysis of factors influencing corporate financial health and the prediction of bankruptcy risk.
How accurately can financial ratios predict the likelihood of bankruptcy?
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(ggplot2)
library(e1071)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.3
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.2.3
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(rpart)
## Warning: package 'rpart' was built under R version 4.2.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.2.3
library(janitor)
## Warning: package 'janitor' was built under R version 4.2.3
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
bankrupt <- read.csv("data.csv")
glimpse(bankrupt)
## Rows: 6,819
## Columns: 94
## $ Bankrupt <int> 1, 1, 1, 1, 1,…
## $ ROA.C..before.interest.and.depreciation.before.interest <dbl> 0.3705943, 0.4…
## $ ROA.A..before.interest.and...after.tax <dbl> 0.4243894, 0.5…
## $ ROA.B..before.interest.and.depreciation.after.tax <dbl> 0.4057498, 0.5…
## $ Operating.Gross.Margin <dbl> 0.6014572, 0.6…
## $ Realized.Sales.Gross.Margin <dbl> 0.6014572, 0.6…
## $ Operating.Profit.Rate <dbl> 0.9989692, 0.9…
## $ Pre.tax.net.Interest.Rate <dbl> 0.7968871, 0.7…
## $ After.tax.net.Interest.Rate <dbl> 0.8088094, 0.8…
## $ Non.industry.income.and.expenditure.revenue <dbl> 0.3026464, 0.3…
## $ Continuous.interest.rate..after.tax. <dbl> 0.7809848, 0.7…
## $ Operating.Expense.Rate <dbl> 1.25697e-04, 2…
## $ Research.and.development.expense.rate <dbl> 0.00e+00, 0.00…
## $ Cash.flow.rate <dbl> 0.4581431, 0.4…
## $ Interest.bearing.debt.interest.rate <dbl> 0.000725073, 0…
## $ Tax.rate..A. <dbl> 0.000000000, 0…
## $ Net.Value.Per.Share..B. <dbl> 0.1479499, 0.1…
## $ Net.Value.Per.Share..A. <dbl> 0.1479499, 0.1…
## $ Net.Value.Per.Share..C. <dbl> 0.1479499, 0.1…
## $ Persistent.EPS.in.the.Last.Four.Seasons <dbl> 0.1691406, 0.2…
## $ Cash.Flow.Per.Share <dbl> 0.3116644, 0.3…
## $ Operating.Profit.Per.Share..Yuan. <dbl> 0.09592053, 0.…
## $ Per.Share.Net.profit.before.tax..Yuan. <dbl> 0.1387362, 0.1…
## $ Realized.Sales.Gross.Profit.Growth.Rate <dbl> 0.02210228, 0.…
## $ Operating.Profit.Growth.Rate <dbl> 0.8481950, 0.8…
## $ After.tax.Net.Profit.Growth.Rate <dbl> 0.6889795, 0.6…
## $ Regular.Net.Profit.Growth.Rate <dbl> 0.6889795, 0.6…
## $ Continuous.Net.Profit.Growth.Rate <dbl> 0.2175354, 0.2…
## $ Total.Asset.Growth.Rate <dbl> 4.98e+09, 6.11…
## $ Net.Value.Growth.Rate <dbl> 0.000326977, 0…
## $ Total.Asset.Return.Growth.Rate.Ratio <dbl> 0.2631000, 0.2…
## $ Cash.Reinvestment.. <dbl> 0.3637253, 0.3…
## $ Current.Ratio <dbl> 0.002258963, 0…
## $ Quick.Ratio <dbl> 0.001207755, 0…
## $ Interest.Expense.Ratio <dbl> 0.6299513, 0.6…
## $ Total.debt.Total.net.worth <dbl> 0.021265924, 0…
## $ Debt.ratio.. <dbl> 0.20757626, 0.…
## $ Net.worth.Assets <dbl> 0.7924237, 0.8…
## $ Long.term.fund.suitability.ratio..A. <dbl> 0.005024455, 0…
## $ Borrowing.dependency <dbl> 0.3902844, 0.3…
## $ Contingent.liabilities.Net.worth <dbl> 0.006478502, 0…
## $ Operating.profit.Paid.in.capital <dbl> 0.09588483, 0.…
## $ Net.profit.before.tax.Paid.in.capital <dbl> 0.1377573, 0.1…
## $ Inventory.and.accounts.receivable.Net.value <dbl> 0.3980357, 0.3…
## $ Total.Asset.Turnover <dbl> 0.08695652, 0.…
## $ Accounts.Receivable.Turnover <dbl> 0.001813884, 0…
## $ Average.Collection.Days <dbl> 0.003487364, 0…
## $ Inventory.Turnover.Rate..times. <dbl> 1.82093e-04, 9…
## $ Fixed.Assets.Turnover.Frequency <dbl> 1.16501e-04, 7…
## $ Net.Worth.Turnover.Rate..times. <dbl> 0.03290323, 0.…
## $ Revenue.per.person <dbl> 0.034164182, 0…
## $ Operating.profit.per.person <dbl> 0.3929129, 0.3…
## $ Allocation.rate.per.person <dbl> 0.037135302, 0…
## $ Working.Capital.to.Total.Assets <dbl> 0.6727753, 0.7…
## $ Quick.Assets.Total.Assets <dbl> 0.16667296, 0.…
## $ Current.Assets.Total.Assets <dbl> 0.1906430, 0.1…
## $ Cash.Total.Assets <dbl> 0.004094406, 0…
## $ Quick.Assets.Current.Liability <dbl> 0.001996771, 0…
## $ Current.Liability.to.Assets <dbl> 0.14730845, 0.…
## $ Operating.Funds.to.Liability <dbl> 0.3340152, 0.3…
## $ Inventory.Working.Capital <dbl> 0.2769202, 0.2…
## $ Inventory.Current.Liability <dbl> 0.001035990, 0…
## $ Current.Liabilities.Liability <dbl> 0.6762692, 0.3…
## $ Working.Capital.Equity <dbl> 0.7212746, 0.7…
## $ Current.Liabilities.Equity <dbl> 0.3390770, 0.3…
## $ Long.term.Liability.to.Current.Assets <dbl> 0.025592368, 0…
## $ Retained.Earnings.to.Total.Assets <dbl> 0.9032248, 0.9…
## $ Total.income.Total.expense <dbl> 0.002021613, 0…
## $ Total.expense.Assets <dbl> 0.064855708, 0…
## $ Current.Asset.Turnover.Rate <dbl> 7.010000e+08, …
## $ Quick.Asset.Turnover.Rate <dbl> 6.550000e+09, …
## $ Working.capitcal.Turnover.Rate <dbl> 0.5938305, 0.5…
## $ Cash.Turnover.Rate <dbl> 4.58000e+08, 2…
## $ Cash.Flow.to.Sales <dbl> 0.6715677, 0.6…
## $ Fixed.Assets.to.Assets <dbl> 0.4242058, 0.4…
## $ Current.Liability.to.Liability <dbl> 0.6762692, 0.3…
## $ Current.Liability.to.Equity <dbl> 0.3390770, 0.3…
## $ Equity.to.Long.term.Liability <dbl> 0.1265495, 0.1…
## $ Cash.Flow.to.Total.Assets <dbl> 0.6375554, 0.6…
## $ Cash.Flow.to.Liability <dbl> 0.4586091, 0.4…
## $ CFO.to.Assets <dbl> 0.5203819, 0.5…
## $ Cash.Flow.to.Equity <dbl> 0.3129049, 0.3…
## $ Current.Liability.to.Current.Assets <dbl> 0.11825048, 0.…
## $ Liability.Assets.Flag <int> 0, 0, 0, 0, 0,…
## $ Net.Income.to.Total.Assets <dbl> 0.7168453, 0.7…
## $ Total.assets.to.GNP.price <dbl> 0.009219440, 0…
## $ No.credit.Interval <dbl> 0.6228790, 0.6…
## $ Gross.Profit.to.Sales <dbl> 0.6014533, 0.6…
## $ Net.Income.to.Stockholder.s.Equity <dbl> 0.8278902, 0.8…
## $ Liability.to.Equity <dbl> 0.2902019, 0.2…
## $ Degree.of.Financial.Leverage..DFL. <dbl> 0.02660063, 0.…
## $ Interest.Coverage.Ratio..Interest.expense.to.EBIT. <dbl> 0.5640501, 0.5…
## $ Net.Income.Flag <int> 1, 1, 1, 1, 1,…
## $ Equity.to.Liability <dbl> 0.01646874, 0.…
bankrupt <- bankrupt %>%
clean_names() %>%
mutate(bankrupt = as.factor(bankrupt),
liability_assets_flag = as.factor(liability_assets_flag),
net_income_flag = as.factor(net_income_flag))
bankrupt column
bankrupt_distribution <- bankrupt %>%
group_by(bankrupt) %>%
summarise(count = n()) %>%
mutate(percentage = count / sum(count) * 100)
ggplot(bankrupt_distribution, aes(x = "", y = count, fill = bankrupt)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(title = "Distribution of Bankruptcy Status",
fill = "Bankrupt")
The pie chart illustrates the distribution of the bankrupt column, showing that the majority of companies (represented in red) did not go bankrupt, while a small proportion (represented in teal) did go bankrupt. This indicates an imbalanced dataset with significantly more non-bankrupt companies than bankrupt ones.
features <- bankrupt %>%
select(-c(bankrupt, net_income_flag, liability_assets_flag))
par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
for (feature in names(features)) {
hist(features[[feature]], main = feature, xlab = feature)
}
Columns like tax_rate_a are dropped because they contain many zero values, and a zero tax rate is not financially meaningful, as companies generally pay some tax unless they are in very specific situations.
Columns like fixed_assets_to_assets and interest_bearing_debt_interest_rate are removed due to excessive zero values or limited relevance to bankruptcy prediction.
Liquidity and turnover ratios such as quick_ratio, current_ratio, and accounts_receivable_turnover are dropped because zeros in these metrics are implausible for operational companies.
Zeros in net_value_per_share_b, net_value_per_share_a, and the other ratios listed in the code below indicate companies with no equity value per share, which is unrealistic for operational companies and could distort the model’s predictions. These zeros are set to NA and the affected rows are then removed.
Lastly, outliers in critical financial ratios are capped at an upper quantile to prevent them from disproportionately influencing model predictions. Note that the code sets this quantile to 1, which is the column maximum, so as written the cap leaves the values unchanged; a probability below 1 would be needed for the cap to take effect.
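The zero-handling and capping rules described above can be sketched on a toy vector. This is only an illustration under stated assumptions: the values in `x` are made up, and the helper name `cap_and_flag` is hypothetical and not part of the original analysis.

```r
# Toy vector standing in for one financial ratio (values are made up).
x <- c(0, 0.2, 0.25, 0.3, 0.35, 5)

# Hypothetical helper: zeros become NA, all other values are capped
# at the requested quantile of the column.
cap_and_flag <- function(v, prob) {
  cap <- quantile(v, prob, na.rm = TRUE)
  ifelse(v == 0, NA, pmin(v, cap))
}

capped_100 <- cap_and_flag(x, 1)    # cap equals the maximum, so no value changes
capped_90  <- cap_and_flag(x, 0.9)  # the largest value is pulled down to the 90th percentile

capped_100
capped_90
```

With a probability of 1 only the zero is affected (it becomes NA), whereas a probability below 1 also shrinks the extreme value. Which probability is appropriate for the real data is a modelling choice that this sketch does not settle.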
bankrupt.cleaned <- bankrupt %>%
select(-c(tax_rate_a,
fixed_assets_to_assets,
interest_bearing_debt_interest_rate,
total_assets_to_gnp_price,
net_value_growth_rate,
quick_ratio,
quick_assets_current_liability,
average_collection_days,
inventory_turnover_rate_times,
fixed_assets_turnover_frequency,
inventory_current_liability,
accounts_receivable_turnover,
current_asset_turnover_rate,
research_and_development_expense_rate,
allocation_rate_per_person,
quick_asset_turnover_rate,
long_term_liability_to_current_assets,
operating_expense_rate,
total_debt_total_net_worth,
current_ratio)) %>%
mutate(across(
.cols = c(net_value_per_share_b,
net_value_per_share_a,
roa_b_before_interest_and_depreciation_after_tax,
operating_gross_margin,
realized_sales_gross_margin,
operating_profit_rate,
pre_tax_net_interest_rate,
after_tax_net_interest_rate,
current_liability_to_equity,
equity_to_long_term_liability,
no_credit_interval,
gross_profit_to_sales,
net_income_to_stockholder_s_equity,
liability_to_equity,
equity_to_liability,
degree_of_financial_leverage_dfl,
cash_flow_per_share,
operating_profit_per_share_yuan,
contingent_liabilities_net_worth,
roa_a_before_interest_and_after_tax,
persistent_eps_in_the_last_four_seasons,
inventory_working_capital,
per_share_net_profit_before_tax_yuan,
realized_sales_gross_profit_growth_rate,
total_asset_turnover,
current_liability_to_assets,
operating_funds_to_liability,
cash_flow_to_total_assets,
cash_flow_to_liability,
cfo_to_assets,
cash_turnover_rate,
interest_expense_ratio,
debt_ratio,
long_term_fund_suitability_ratio_a,
interest_coverage_ratio_interest_expense_to_ebit,
continuous_net_profit_growth_rate,
total_asset_growth_rate,
cash_reinvestment,
operating_profit_paid_in_capital,
current_liability_to_current_assets,
retained_earnings_to_total_assets,
total_income_total_expense,
current_liabilities_liability,
operating_profit_growth_rate,
after_tax_net_profit_growth_rate,
regular_net_profit_growth_rate,
cash_flow_rate
),
.fns = ~ifelse(. == 0, NA, pmin(., quantile(., 1, na.rm = TRUE), na.rm = TRUE))
)) %>%
drop_na()
summary(bankrupt.cleaned)
## bankrupt roa_c_before_interest_and_depreciation_before_interest
## 0:6574 Min. :0.06693
## 1: 212 1st Qu.:0.47687
## Median :0.50288
## Mean :0.50579
## 3rd Qu.:0.53566
## Max. :1.00000
## roa_a_before_interest_and_after_tax
## Min. :0.05718
## 1st Qu.:0.53588
## Median :0.55997
## Mean :0.55932
## 3rd Qu.:0.58929
## Max. :1.00000
## roa_b_before_interest_and_depreciation_after_tax operating_gross_margin
## Min. :0.05482 Min. :0.1563
## 1st Qu.:0.52765 1st Qu.:0.6005
## Median :0.55244 Median :0.6060
## Mean :0.55423 Mean :0.6080
## 3rd Qu.:0.58448 3rd Qu.:0.6139
## Max. :1.00000 Max. :0.6652
## realized_sales_gross_margin operating_profit_rate pre_tax_net_interest_rate
## Min. :0.1563 Min. :0.9734 Min. :0.7530
## 1st Qu.:0.6005 1st Qu.:0.9990 1st Qu.:0.7974
## Median :0.6060 Median :0.9990 Median :0.7975
## Mean :0.6080 Mean :0.9990 Mean :0.7974
## 3rd Qu.:0.6138 3rd Qu.:0.9991 3rd Qu.:0.7976
## Max. :0.6660 Max. :1.0000 Max. :0.8128
## after_tax_net_interest_rate non_industry_income_and_expenditure_revenue
## Min. :0.7662 Min. :0.2351
## 1st Qu.:0.8093 1st Qu.:0.3035
## Median :0.8094 Median :0.3035
## Mean :0.8093 Mean :0.3035
## 3rd Qu.:0.8095 3rd Qu.:0.3036
## Max. :0.8226 Max. :0.3301
## continuous_interest_rate_after_tax cash_flow_rate net_value_per_share_b
## Min. :0.7351 Min. :0.3067 Min. :0.1200
## 1st Qu.:0.7816 1st Qu.:0.4616 1st Qu.:0.1737
## Median :0.7816 Median :0.4651 Median :0.1844
## Mean :0.7816 Mean :0.4675 Mean :0.1908
## 3rd Qu.:0.7817 3rd Qu.:0.4710 3rd Qu.:0.1996
## Max. :0.7959 Max. :1.0000 Max. :1.0000
## net_value_per_share_a net_value_per_share_c
## Min. :0.06506 Min. :0.06506
## 1st Qu.:0.17375 1st Qu.:0.17378
## Median :0.18444 Median :0.18448
## Mean :0.19080 Mean :0.19084
## 3rd Qu.:0.19965 3rd Qu.:0.19973
## Max. :1.00000 Max. :1.00000
## persistent_eps_in_the_last_four_seasons cash_flow_per_share
## Min. :0.07857 Min. :0.1287
## 1st Qu.:0.21481 1st Qu.:0.3178
## Median :0.22454 Median :0.3225
## Mean :0.22902 Mean :0.3236
## 3rd Qu.:0.23901 3rd Qu.:0.3286
## Max. :1.00000 Max. :1.0000
## operating_profit_per_share_yuan per_share_net_profit_before_tax_yuan
## Min. :0.00627 Min. :0.05024
## 1st Qu.:0.09625 1st Qu.:0.17045
## Median :0.10431 Median :0.17978
## Mean :0.10922 Mean :0.18456
## 3rd Qu.:0.11620 3rd Qu.:0.19362
## Max. :1.00000 Max. :1.00000
## realized_sales_gross_profit_growth_rate operating_profit_growth_rate
## Min. :0.004282 Min. :0.7364
## 1st Qu.:0.022065 1st Qu.:0.8480
## Median :0.022103 Median :0.8480
## Mean :0.022402 Mean :0.8481
## 3rd Qu.:0.022153 3rd Qu.:0.8481
## Max. :1.000000 Max. :1.0000
## after_tax_net_profit_growth_rate regular_net_profit_growth_rate
## Min. :0.1807 Min. :0.1807
## 1st Qu.:0.6893 1st Qu.:0.6893
## Median :0.6894 Median :0.6894
## Mean :0.6893 Mean :0.6893
## 3rd Qu.:0.6896 3rd Qu.:0.6896
## Max. :1.0000 Max. :1.0000
## continuous_net_profit_growth_rate total_asset_growth_rate
## Min. :0.09576 Min. :0.000e+00
## 1st Qu.:0.21758 1st Qu.:4.900e+09
## Median :0.21760 Median :6.410e+09
## Mean :0.21768 Mean :5.519e+09
## 3rd Qu.:0.21762 3rd Qu.:7.390e+09
## Max. :1.00000 Max. :9.990e+09
## total_asset_return_growth_rate_ratio cash_reinvestment interest_expense_ratio
## Min. :0.2562 Min. :0.02583 Min. :0.4600
## 1st Qu.:0.2638 1st Qu.:0.37478 1st Qu.:0.6306
## Median :0.2641 Median :0.38046 Median :0.6307
## Mean :0.2642 Mean :0.37973 Mean :0.6311
## 3rd Qu.:0.2644 3rd Qu.:0.38676 3rd Qu.:0.6311
## Max. :0.3586 Max. :0.75995 Max. :1.0000
## debt_ratio net_worth_assets long_term_fund_suitability_ratio_a
## Min. :0.0002721 Min. :0.6681 Min. :0.004676
## 1st Qu.:0.0728905 1st Qu.:0.8516 1st Qu.:0.005246
## Median :0.1112556 Median :0.8887 Median :0.005666
## Mean :0.1127852 Mean :0.8872 Mean :0.008693
## 3rd Qu.:0.1484415 3rd Qu.:0.9271 3rd Qu.:0.006849
## Max. :0.3318923 Max. :0.9997 Max. :1.000000
## borrowing_dependency contingent_liabilities_net_worth
## Min. :0.1871 Min. :0.0006331
## 1st Qu.:0.3702 1st Qu.:0.0053659
## Median :0.3726 Median :0.0053659
## Mean :0.3745 Mean :0.0058218
## 3rd Qu.:0.3763 3rd Qu.:0.0057647
## Max. :0.8874 Max. :0.0731641
## operating_profit_paid_in_capital net_profit_before_tax_paid_in_capital
## Min. :0.008859 Min. :0.05244
## 1st Qu.:0.096176 1st Qu.:0.16947
## Median :0.104166 Median :0.17853
## Mean :0.109107 Mean :0.18291
## 3rd Qu.:0.116005 3rd Qu.:0.19164
## Max. :1.000000 Max. :1.00000
## inventory_and_accounts_receivable_net_value total_asset_turnover
## Min. :0.3514 Min. :0.001499
## 1st Qu.:0.3974 1st Qu.:0.076462
## Median :0.4001 Median :0.118441
## Mean :0.4025 Mean :0.141828
## 3rd Qu.:0.4045 3rd Qu.:0.176911
## Max. :1.0000 Max. :1.000000
## net_worth_turnover_rate_times revenue_per_person operating_profit_per_person
## Min. :0.009032 Min. :0.0001116 Min. :0.0000
## 1st Qu.:0.021936 1st Qu.:0.0104674 1st Qu.:0.3925
## Median :0.029516 Median :0.0186361 Median :0.3959
## Mean :0.038473 Mean :0.0355880 Mean :0.4008
## 3rd Qu.:0.042903 3rd Qu.:0.0358423 3rd Qu.:0.4019
## Max. :1.000000 Max. :1.0000000 Max. :1.0000
## working_capital_to_total_assets quick_assets_total_assets
## Min. :0.4752 Min. :0.0000
## 1st Qu.:0.7746 1st Qu.:0.2421
## Median :0.8105 Median :0.3866
## Mean :0.8145 Mean :0.4004
## 3rd Qu.:0.8504 3rd Qu.:0.5407
## Max. :1.0000 Max. :1.0000
## current_assets_total_assets cash_total_assets current_liability_to_assets
## Min. :0.0000 Min. :0.0001842 Min. :0.0008467
## 1st Qu.:0.3529 1st Qu.:0.0335917 1st Qu.:0.0532728
## Median :0.5151 Median :0.0750036 Median :0.0826016
## Mean :0.5225 Mean :0.1242431 Mean :0.0903545
## 3rd Qu.:0.6892 3rd Qu.:0.1612751 3rd Qu.:0.1192421
## Max. :1.0000 Max. :1.0000000 Max. :0.3431427
## operating_funds_to_liability inventory_working_capital
## Min. :0.01472 Min. :0.1765
## 1st Qu.:0.34108 1st Qu.:0.2770
## Median :0.34865 Median :0.2772
## Mean :0.35392 Mean :0.2775
## 3rd Qu.:0.36096 3rd Qu.:0.2774
## Max. :1.00000 Max. :1.0000
## current_liabilities_liability working_capital_equity
## Min. :0.0008623 Min. :0.5071
## 1st Qu.:0.6274174 1st Qu.:0.7336
## Median :0.8068333 Median :0.7360
## Mean :0.7617579 Mean :0.7360
## 3rd Qu.:0.9418341 3rd Qu.:0.7386
## Max. :1.0000000 Max. :1.0000
## current_liabilities_equity retained_earnings_to_total_assets
## Min. :0.1538 Min. :0.5021
## 1st Qu.:0.3281 1st Qu.:0.9313
## Median :0.3297 Median :0.9377
## Mean :0.3313 Mean :0.9351
## 3rd Qu.:0.3323 3rd Qu.:0.9449
## Max. :0.7759 Max. :1.0000
## total_income_total_expense total_expense_assets working_capitcal_turnover_rate
## Min. :0.000772 Min. :0.00000 Min. :0.5884
## 1st Qu.:0.002237 1st Qu.:0.01457 1st Qu.:0.5939
## Median :0.002337 Median :0.02268 Median :0.5940
## Mean :0.002551 Mean :0.02890 Mean :0.5940
## 3rd Qu.:0.002493 3rd Qu.:0.03588 3rd Qu.:0.5940
## Max. :1.000000 Max. :0.46348 Max. :0.6054
## cash_turnover_rate cash_flow_to_sales current_liability_to_liability
## Min. :0.000e+00 Min. :0.6618 Min. :0.0008623
## 1st Qu.:0.000e+00 1st Qu.:0.6716 1st Qu.:0.6274174
## Median :1.085e+09 Median :0.6716 Median :0.8068333
## Mean :2.477e+09 Mean :0.6716 Mean :0.7617579
## 3rd Qu.:4.530e+09 3rd Qu.:0.6716 3rd Qu.:0.9418341
## Max. :1.000e+10 Max. :0.6760 Max. :1.0000000
## current_liability_to_equity equity_to_long_term_liability
## Min. :0.1538 Min. :0.02585
## 1st Qu.:0.3281 1st Qu.:0.11093
## Median :0.3297 Median :0.11235
## Mean :0.3313 Mean :0.11543
## 3rd Qu.:0.3323 3rd Qu.:0.11711
## Max. :0.7759 Max. :1.00000
## cash_flow_to_total_assets cash_flow_to_liability cfo_to_assets
## Min. :0.09209 Min. :0.02806 Min. :0.07425
## 1st Qu.:0.63336 1st Qu.:0.45714 1st Qu.:0.56626
## Median :0.64540 Median :0.45976 Median :0.59347
## Mean :0.64987 Mean :0.46186 Mean :0.59385
## 3rd Qu.:0.66302 3rd Qu.:0.46424 3rd Qu.:0.62490
## Max. :1.00000 Max. :1.00000 Max. :1.00000
## cash_flow_to_equity current_liability_to_current_assets liability_assets_flag
## Min. :0.06196 Min. :0.00022 0:6783
## 1st Qu.:0.31301 1st Qu.:0.01803 1: 3
## Median :0.31496 Median :0.02755
## Mean :0.31554 Mean :0.03129
## 3rd Qu.:0.31770 3rd Qu.:0.03827
## Max. :0.50887 Max. :1.00000
## net_income_to_total_assets no_credit_interval gross_profit_to_sales
## Min. :0.4118 Min. :0.4087 Min. :0.1563
## 1st Qu.:0.7969 1st Qu.:0.6236 1st Qu.:0.6005
## Median :0.8107 Median :0.6239 Median :0.6060
## Mean :0.8083 Mean :0.6240 Mean :0.6080
## 3rd Qu.:0.8266 3rd Qu.:0.6242 3rd Qu.:0.6139
## Max. :1.0000 Max. :1.0000 Max. :0.6651
## net_income_to_stockholder_s_equity liability_to_equity
## Min. :0.3447 Min. :0.1335
## 1st Qu.:0.8401 1st Qu.:0.2770
## Median :0.8412 Median :0.2788
## Mean :0.8406 Mean :0.2802
## 3rd Qu.:0.8424 3rd Qu.:0.2814
## Max. :1.0000 Max. :0.6523
## degree_of_financial_leverage_dfl
## Min. :0.0007895
## 1st Qu.:0.0267912
## Median :0.0268085
## Mean :0.0275491
## 3rd Qu.:0.0269140
## Max. :1.0000000
## interest_coverage_ratio_interest_expense_to_ebit net_income_flag
## Min. :0.1721 1:6786
## 1st Qu.:0.5652
## Median :0.5653
## Mean :0.5654
## 3rd Qu.:0.5657
## Max. :1.0000
## equity_to_liability
## Min. :0.008753
## 1st Qu.:0.024543
## Median :0.033852
## Mean :0.047218
## 3rd Qu.:0.052837
## Max. :0.942729
set.seed(1)
bankrupt.split <- createDataPartition(bankrupt.cleaned$bankrupt, p = 0.8, list = FALSE, times = 10)
head(bankrupt.split)
## Resample01 Resample02 Resample03 Resample04 Resample05 Resample06
## [1,] 1 1 1 2 3 1
## [2,] 2 2 2 4 4 2
## [3,] 3 3 4 5 5 4
## [4,] 4 4 5 7 6 5
## [5,] 5 6 6 10 7 6
## [6,] 6 7 7 11 8 7
## Resample07 Resample08 Resample09 Resample10
## [1,] 1 2 1 1
## [2,] 2 3 2 5
## [3,] 4 4 3 7
## [4,] 5 5 6 8
## [5,] 6 7 7 9
## [6,] 7 10 9 10
set.seed(1)
feature.train <- bankrupt.cleaned[bankrupt.split[, 1], !(colnames(bankrupt.cleaned) %in% 'bankrupt')]
feature.test <- bankrupt.cleaned[-bankrupt.split[, 1], !(colnames(bankrupt.cleaned) %in% 'bankrupt')]
target.train <- bankrupt.cleaned[bankrupt.split[, 1], 'bankrupt']
target.test <- bankrupt.cleaned[-bankrupt.split[, 1], 'bankrupt']
preprocess.obj <- preProcess(feature.train, method = c("scale", "center"))
feature.train <- predict(preprocess.obj, feature.train)
feature.test <- predict(preprocess.obj, feature.test)
random.forest <- randomForest(x = feature.train, y = target.train)
rf.pred <- predict(random.forest, newdata = feature.test)
rf.error <- mean(rf.pred != target.test)
round(rf.error, 3)
## [1] 0.032
error.frame <- data.frame(matrix(ncol = 3, nrow = 10))
colnames(error.frame) <- c("folds", "tree_error", "kNN_error")
set.seed(1)
for (i in seq_len(10)) {
feature.train <- bankrupt.cleaned[bankrupt.split[, i], !(colnames(bankrupt.cleaned) %in% 'bankrupt')]
feature.test <- bankrupt.cleaned[-bankrupt.split[, i], !(colnames(bankrupt.cleaned) %in% 'bankrupt')]
target.train <- bankrupt.cleaned[bankrupt.split[, i], 'bankrupt']
target.test <- bankrupt.cleaned[-bankrupt.split[, i], 'bankrupt']
preprocess.obj <- preProcess(feature.train, method = c("scale", "center", "knnImpute"))
feature.train <- predict(preprocess.obj, feature.train)
feature.test <- predict(preprocess.obj, feature.test)
full.train <- cbind(feature.train, target.train)
full.train <- full.train %>%
rename(bankrupt = target.train)
# kNN error
k.form <- sqrt(length(target.test))
k.form <- ifelse(k.form %% 2 == 0, k.form + 1, k.form) # Ensure k.form is odd
knn.fit <- knn3(feature.train, target.train, k = k.form)
knn.pred <- predict(knn.fit, feature.test, type = 'class')
error <- mean(ifelse(target.test != knn.pred, 1, 0))
error.frame[i, "kNN_error"] <- error
error.frame[i, "folds"] <- i
# Decision Tree
decision.tree <- rpart(bankrupt ~ ., data = full.train, method = "class")
decision.tree.predictions <- predict(decision.tree, newdata = feature.test, type = "class")
decision.tree.error <- mean(target.test != decision.tree.predictions)
error.frame[i, 'tree_error'] <- decision.tree.error
}
error.frame
## folds tree_error kNN_error
## 1 1 0.03908555 0.03097345
## 2 2 0.03318584 0.03171091
## 3 3 0.03318584 0.03097345
## 4 4 0.03539823 0.03171091
## 5 5 0.03908555 0.03023599
## 6 6 0.03466077 0.03171091
## 7 7 0.03982301 0.03097345
## 8 8 0.03466077 0.03171091
## 9 9 0.04056047 0.03171091
## 10 10 0.03171091 0.03023599
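In the loop above, k is derived from the square root of the test-set size, which is generally not a whole number (the test sets here hold 1356 rows, and the square root of 1356 is about 36.8), so the even/odd check does not apply to it. One possible way to obtain an odd integer k is sketched below. The helper name `odd_k` is hypothetical, rounding down is an assumption, and the results reported in this document were produced with the original code, not with this helper.

```r
# Hypothetical helper: largest integer not exceeding sqrt(n),
# bumped up by one if that integer is even.
odd_k <- function(n) {
  k <- floor(sqrt(n))
  if (k %% 2 == 0) k + 1 else k
}

odd_k(1356)  # floor(sqrt(1356)) is 36, which is even, so this gives 37
```

An odd k avoids tied votes between the two classes, which is the usual motivation for this adjustment.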
ggplot(error.frame, aes(x = folds)) +
geom_line(aes(y = tree_error), color = "red") +
geom_line(aes(y = kNN_error), color = "blue") +
scale_x_continuous(breaks = seq(min(error.frame$folds), max(error.frame$folds), by = 1)) +
labs(x = "Folds", y = "Error Rate", title = "Error Rates for Decision Tree and kNN") +
theme_minimal()
mean.tree.error <- mean(error.frame$tree_error)
mean.kNN.error <- mean(error.frame$kNN_error)
print(paste("Mean Decision Tree Error Rate:", round(mean.tree.error, 3)))
## [1] "Mean Decision Tree Error Rate: 0.036"
print(paste("Mean kNN Error Rate:", round(mean.kNN.error, 3)))
## [1] "Mean kNN Error Rate: 0.031"
print(paste("Random Forest Error Rate:", round(rf.error, 3)))
## [1] "Random Forest Error Rate: 0.032"
Mean Error Rates:
Decision Tree: The mean error rate across the 10 folds is 0.036.
k-Nearest Neighbors (kNN): The mean error rate across the 10 folds is 0.031.
Random Forest: The error rate on the single 80/20 split used for this model (Resample01) is 0.032.
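For context, these error rates can be compared with a majority-class baseline, that is, a rule that labels every company as non-bankrupt. The sketch below uses the class counts printed by summary(bankrupt.cleaned) above (6574 non-bankrupt, 212 bankrupt). The variable names are illustrative, and the counts refer to the full cleaned dataset rather than to any single test set.

```r
# Class counts taken from the summary(bankrupt.cleaned) output above.
n_non_bankrupt <- 6574
n_bankrupt <- 212

# A rule that always predicts "0" (not bankrupt) misclassifies
# exactly the bankrupt companies.
baseline_error <- n_bankrupt / (n_non_bankrupt + n_bankrupt)
round(baseline_error, 3)
## [1] 0.031
```

A model's error rate is informative mainly to the extent that it falls below this baseline.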
knn.confusion <- confusionMatrix(knn.pred, target.test)
knn.confusion$table
## Reference
## Prediction 0 1
## 0 1314 41
## 1 0 1
mean(knn.pred == target.test)
## [1] 0.969764
plot_confusion_matrix <- function(confusion_matrix, title) {
cm_table <- as.data.frame(confusion_matrix$table)
cm_table$Prediction <- factor(cm_table$Prediction, levels = rev(levels(cm_table$Prediction)))
cm_table$Correct <- cm_table$Reference == cm_table$Prediction
ggplot(data = cm_table, aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = Correct)) +
scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red")) +
geom_text(aes(label = Freq), vjust = 1) +
labs(title = title, x = "Actual", y = "Predicted") +
theme_minimal()
}
knn.confusion <- confusionMatrix(knn.pred, target.test)
plot_confusion_matrix(knn.confusion, "kNN Confusion Matrix")
tree.confusion <- confusionMatrix(decision.tree.predictions, target.test)
tree.confusion$table
## Reference
## Prediction 0 1
## 0 1299 28
## 1 15 14
mean(decision.tree.predictions == target.test)
## [1] 0.9682891
plot_confusion_matrix(tree.confusion, "Decision Tree Confusion Matrix")
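# Caution: rf.pred was computed on the Resample01 test set, whereas target.test now
# holds the labels from the last loop iteration (Resample10), so this table compares
# predictions and labels from different test sets.
rf.confusion <- confusionMatrix(rf.pred, target.test)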
rf.confusion$table
## Reference
## Prediction 0 1
## 0 1307 42
## 1 7 0
mean(rf.pred == target.test)
## [1] 0.9638643
rf.confusion <- confusionMatrix(rf.pred, target.test)
plot_confusion_matrix(rf.confusion, "Random Forest Confusion Matrix")
Based on the 10 repeated 80/20 splits (the folds above) and the error rates reported above, the accuracy of each model, computed as 1 minus its error rate, is as follows:
k-Nearest Neighbors (kNN): 0.969 (mean over 10 splits)
Random Forest (RF): 0.968 (single split)
Decision Tree: 0.964 (mean over 10 splits)
Given the goal of accurately classifying companies as bankrupt or not bankrupt, the k-Nearest Neighbors (kNN) model is recommended for this project. It has the highest accuracy (0.969), although its margin over Random Forest is small. Accuracy should be read with the class imbalance in mind: in the kNN confusion matrix above, only 1 of the 42 bankrupt companies in the test set was identified, compared with 14 of 42 for the Decision Tree. It is also important to consider the computational cost and scalability of kNN, especially if the dataset grows larger. If interpretability and feature importance are also critical, the Random Forest model could be a viable alternative despite its slightly lower accuracy.
For the current dataset and project requirements, the kNN model is the best choice on overall accuracy in classifying bankruptcy status, which is the criterion used here for the project’s objective of predicting company insolvency.
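Because the classes are imbalanced, overall accuracy can be complemented by recall on the bankrupt class. The sketch below recomputes it from the kNN and Decision Tree confusion matrices printed earlier. The counts are copied from that output, while the object and function names are illustrative and not part of the original analysis.

```r
# Confusion-matrix counts copied from the output above
# (rows = prediction, columns = reference; class "1" = bankrupt).
labels <- list(Prediction = c("0", "1"), Reference = c("0", "1"))
knn_tab  <- matrix(c(1314, 0, 41, 1),   nrow = 2, dimnames = labels)
tree_tab <- matrix(c(1299, 15, 28, 14), nrow = 2, dimnames = labels)

# Recall for the bankrupt class:
# correctly flagged bankrupt companies / all actually bankrupt companies.
bankrupt_recall <- function(tab) tab["1", "1"] / sum(tab[, "1"])

knn_recall  <- bankrupt_recall(knn_tab)   # 1 of 42
tree_recall <- bankrupt_recall(tree_tab)  # 14 of 42

round(c(kNN = knn_recall, tree = tree_recall), 3)
```

The same quantity is reported by caret's confusionMatrix() as sensitivity when the bankrupt class is set as the positive class.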