Predicting Corporate Bankruptcy

Project Description:

In this project, we build a machine learning model to predict company bankruptcy from financial ratios of Taiwanese companies between 1999 and 2009. Using classification techniques, the goal is to identify which financial factors are most predictive of bankruptcy. The project involves data exploration, preprocessing, and the comparison of three machine learning models: Decision Trees, Random Forest, and k-nearest neighbors (kNN). The models are cross-validated to evaluate performance, and their results are compared to determine which is most effective at predicting company insolvency.

Dataset Description:

This dataset contains financial ratios collected from the Taiwan Economic Journal between 1999 and 2009. The goal is to predict company bankruptcy based on 95 financial indicators, covering areas such as profitability, solvency, cash flow, and corporate governance. The target variable, “Bankrupt?”, is binary: 1 indicates a company went bankrupt, and 0 indicates a company did not go bankrupt. This set of financial features allows for an in-depth analysis of the factors influencing corporate financial health and the prediction of bankruptcy risk.

Research Question:

How accurately can financial ratios predict the likelihood of bankruptcy?

Load packages

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(ggplot2)
library(e1071)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.3
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.2.3
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(rpart)
## Warning: package 'rpart' was built under R version 4.2.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.2.3
library(janitor)
## Warning: package 'janitor' was built under R version 4.2.3
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Load Dataset

bankrupt <- read.csv("data.csv")
glimpse(bankrupt)
## Rows: 6,819
## Columns: 94
## $ Bankrupt                                                <int> 1, 1, 1, 1, 1,…
## $ ROA.C..before.interest.and.depreciation.before.interest <dbl> 0.3705943, 0.4…
## $ ROA.A..before.interest.and...after.tax                  <dbl> 0.4243894, 0.5…
## $ ROA.B..before.interest.and.depreciation.after.tax       <dbl> 0.4057498, 0.5…
## $ Operating.Gross.Margin                                  <dbl> 0.6014572, 0.6…
## $ Realized.Sales.Gross.Margin                             <dbl> 0.6014572, 0.6…
## $ Operating.Profit.Rate                                   <dbl> 0.9989692, 0.9…
## $ Pre.tax.net.Interest.Rate                               <dbl> 0.7968871, 0.7…
## $ After.tax.net.Interest.Rate                             <dbl> 0.8088094, 0.8…
## $ Non.industry.income.and.expenditure.revenue             <dbl> 0.3026464, 0.3…
## $ Continuous.interest.rate..after.tax.                    <dbl> 0.7809848, 0.7…
## $ Operating.Expense.Rate                                  <dbl> 1.25697e-04, 2…
## $ Research.and.development.expense.rate                   <dbl> 0.00e+00, 0.00…
## $ Cash.flow.rate                                          <dbl> 0.4581431, 0.4…
## $ Interest.bearing.debt.interest.rate                     <dbl> 0.000725073, 0…
## $ Tax.rate..A.                                            <dbl> 0.000000000, 0…
## $ Net.Value.Per.Share..B.                                 <dbl> 0.1479499, 0.1…
## $ Net.Value.Per.Share..A.                                 <dbl> 0.1479499, 0.1…
## $ Net.Value.Per.Share..C.                                 <dbl> 0.1479499, 0.1…
## $ Persistent.EPS.in.the.Last.Four.Seasons                 <dbl> 0.1691406, 0.2…
## $ Cash.Flow.Per.Share                                     <dbl> 0.3116644, 0.3…
## $ Operating.Profit.Per.Share..Yuan.                       <dbl> 0.09592053, 0.…
## $ Per.Share.Net.profit.before.tax..Yuan.                  <dbl> 0.1387362, 0.1…
## $ Realized.Sales.Gross.Profit.Growth.Rate                 <dbl> 0.02210228, 0.…
## $ Operating.Profit.Growth.Rate                            <dbl> 0.8481950, 0.8…
## $ After.tax.Net.Profit.Growth.Rate                        <dbl> 0.6889795, 0.6…
## $ Regular.Net.Profit.Growth.Rate                          <dbl> 0.6889795, 0.6…
## $ Continuous.Net.Profit.Growth.Rate                       <dbl> 0.2175354, 0.2…
## $ Total.Asset.Growth.Rate                                 <dbl> 4.98e+09, 6.11…
## $ Net.Value.Growth.Rate                                   <dbl> 0.000326977, 0…
## $ Total.Asset.Return.Growth.Rate.Ratio                    <dbl> 0.2631000, 0.2…
## $ Cash.Reinvestment..                                     <dbl> 0.3637253, 0.3…
## $ Current.Ratio                                           <dbl> 0.002258963, 0…
## $ Quick.Ratio                                             <dbl> 0.001207755, 0…
## $ Interest.Expense.Ratio                                  <dbl> 0.6299513, 0.6…
## $ Total.debt.Total.net.worth                              <dbl> 0.021265924, 0…
## $ Debt.ratio..                                            <dbl> 0.20757626, 0.…
## $ Net.worth.Assets                                        <dbl> 0.7924237, 0.8…
## $ Long.term.fund.suitability.ratio..A.                    <dbl> 0.005024455, 0…
## $ Borrowing.dependency                                    <dbl> 0.3902844, 0.3…
## $ Contingent.liabilities.Net.worth                        <dbl> 0.006478502, 0…
## $ Operating.profit.Paid.in.capital                        <dbl> 0.09588483, 0.…
## $ Net.profit.before.tax.Paid.in.capital                   <dbl> 0.1377573, 0.1…
## $ Inventory.and.accounts.receivable.Net.value             <dbl> 0.3980357, 0.3…
## $ Total.Asset.Turnover                                    <dbl> 0.08695652, 0.…
## $ Accounts.Receivable.Turnover                            <dbl> 0.001813884, 0…
## $ Average.Collection.Days                                 <dbl> 0.003487364, 0…
## $ Inventory.Turnover.Rate..times.                         <dbl> 1.82093e-04, 9…
## $ Fixed.Assets.Turnover.Frequency                         <dbl> 1.16501e-04, 7…
## $ Net.Worth.Turnover.Rate..times.                         <dbl> 0.03290323, 0.…
## $ Revenue.per.person                                      <dbl> 0.034164182, 0…
## $ Operating.profit.per.person                             <dbl> 0.3929129, 0.3…
## $ Allocation.rate.per.person                              <dbl> 0.037135302, 0…
## $ Working.Capital.to.Total.Assets                         <dbl> 0.6727753, 0.7…
## $ Quick.Assets.Total.Assets                               <dbl> 0.16667296, 0.…
## $ Current.Assets.Total.Assets                             <dbl> 0.1906430, 0.1…
## $ Cash.Total.Assets                                       <dbl> 0.004094406, 0…
## $ Quick.Assets.Current.Liability                          <dbl> 0.001996771, 0…
## $ Current.Liability.to.Assets                             <dbl> 0.14730845, 0.…
## $ Operating.Funds.to.Liability                            <dbl> 0.3340152, 0.3…
## $ Inventory.Working.Capital                               <dbl> 0.2769202, 0.2…
## $ Inventory.Current.Liability                             <dbl> 0.001035990, 0…
## $ Current.Liabilities.Liability                           <dbl> 0.6762692, 0.3…
## $ Working.Capital.Equity                                  <dbl> 0.7212746, 0.7…
## $ Current.Liabilities.Equity                              <dbl> 0.3390770, 0.3…
## $ Long.term.Liability.to.Current.Assets                   <dbl> 0.025592368, 0…
## $ Retained.Earnings.to.Total.Assets                       <dbl> 0.9032248, 0.9…
## $ Total.income.Total.expense                              <dbl> 0.002021613, 0…
## $ Total.expense.Assets                                    <dbl> 0.064855708, 0…
## $ Current.Asset.Turnover.Rate                             <dbl> 7.010000e+08, …
## $ Quick.Asset.Turnover.Rate                               <dbl> 6.550000e+09, …
## $ Working.capitcal.Turnover.Rate                          <dbl> 0.5938305, 0.5…
## $ Cash.Turnover.Rate                                      <dbl> 4.58000e+08, 2…
## $ Cash.Flow.to.Sales                                      <dbl> 0.6715677, 0.6…
## $ Fixed.Assets.to.Assets                                  <dbl> 0.4242058, 0.4…
## $ Current.Liability.to.Liability                          <dbl> 0.6762692, 0.3…
## $ Current.Liability.to.Equity                             <dbl> 0.3390770, 0.3…
## $ Equity.to.Long.term.Liability                           <dbl> 0.1265495, 0.1…
## $ Cash.Flow.to.Total.Assets                               <dbl> 0.6375554, 0.6…
## $ Cash.Flow.to.Liability                                  <dbl> 0.4586091, 0.4…
## $ CFO.to.Assets                                           <dbl> 0.5203819, 0.5…
## $ Cash.Flow.to.Equity                                     <dbl> 0.3129049, 0.3…
## $ Current.Liability.to.Current.Assets                     <dbl> 0.11825048, 0.…
## $ Liability.Assets.Flag                                   <int> 0, 0, 0, 0, 0,…
## $ Net.Income.to.Total.Assets                              <dbl> 0.7168453, 0.7…
## $ Total.assets.to.GNP.price                               <dbl> 0.009219440, 0…
## $ No.credit.Interval                                      <dbl> 0.6228790, 0.6…
## $ Gross.Profit.to.Sales                                   <dbl> 0.6014533, 0.6…
## $ Net.Income.to.Stockholder.s.Equity                      <dbl> 0.8278902, 0.8…
## $ Liability.to.Equity                                     <dbl> 0.2902019, 0.2…
## $ Degree.of.Financial.Leverage..DFL.                      <dbl> 0.02660063, 0.…
## $ Interest.Coverage.Ratio..Interest.expense.to.EBIT.      <dbl> 0.5640501, 0.5…
## $ Net.Income.Flag                                         <int> 1, 1, 1, 1, 1,…
## $ Equity.to.Liability                                     <dbl> 0.01646874, 0.…

Data Wrangling

Standardize column names and convert the target and flag variables to factors

bankrupt <- bankrupt %>%
  clean_names() %>%
  mutate(bankrupt = as.factor(bankrupt),
         liability_assets_flag = as.factor(liability_assets_flag),
         net_income_flag = as.factor(net_income_flag))
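
As a minimal sketch of what this step does, the snippet below applies the same two calls to a small made-up data frame. The three column names are taken from the glimpse() output above; the values are invented for illustration only.

```r
library(janitor)
library(dplyr)

# Toy data frame with three column names exactly as read.csv() produced them;
# the values are made up for illustration
toy <- data.frame(
  Bankrupt        = c(1L, 0L, 0L),
  Debt.ratio..    = c(0.21, 0.17, 0.11),
  Net.Income.Flag = c(1L, 1L, 1L)
)

# clean_names() converts names to snake_case and drops trailing punctuation;
# as.factor() turns the integer codes into categorical levels
toy_clean <- toy %>%
  clean_names() %>%
  mutate(bankrupt = as.factor(bankrupt),
         net_income_flag = as.factor(net_income_flag))

names(toy_clean)
levels(toy_clean$bankrupt)
```

Converting the target to a factor matters later on, because the modelling functions used in this report treat a factor outcome as a classification problem and a numeric outcome as regression.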

EDA

Create a pie chart for the target bankrupt column

bankrupt_distribution <- bankrupt %>%
  group_by(bankrupt) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count) * 100)

ggplot(bankrupt_distribution, aes(x = "", y = count, fill = bankrupt)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Bankruptcy Status",
       fill = "Bankrupt")

The pie chart illustrates the distribution of the bankrupt column, showing that the majority of companies (represented in red) did not go bankrupt, while a small proportion (represented in teal) did go bankrupt. This indicates an imbalanced dataset with significantly more non-bankrupt companies compared to bankrupt ones.
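
To put a number on the imbalance, the following sketch uses the class counts that summary() reports for the cleaned data further down in this report (6,574 rows at level 0 and 212 at level 1). The variable names are illustrative and not part of the original analysis.

```r
# Class counts as reported by summary(bankrupt.cleaned) later in this report
counts <- c(`0` = 6574, `1` = 212)

# Share of each class, in percent
shares <- prop.table(counts)
round(100 * shares, 2)

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy <- max(shares)

# Number of majority-class rows per minority-class row
imbalance_ratio <- max(counts) / min(counts)
```

Because always predicting "not bankrupt" is already right for almost 97% of rows, raw accuracy alone says little here; measures such as sensitivity for the bankrupt class are worth reporting alongside it.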

Create histograms for features

features <- bankrupt %>%
  select(-c(bankrupt, net_income_flag, liability_assets_flag))

par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))

for (feature in names(features)) {
  hist(features[[feature]], main = feature, xlab = feature)
}

  • The tax_rate_a column is dropped because it contains many zero values, and a zero tax rate is not financially meaningful, as companies generally pay some tax unless they are in very specific situations.

  • Columns like fixed_assets_to_assets and interest_bearing_debt_interest_rate are removed due to excessive zero values or limited relevance to bankruptcy prediction.

  • Zeros in net_value_per_share_b, net_value_per_share_a, … indicate companies with no equity value per share, which is unrealistic for operational companies and could distort the model’s predictions. These columns are kept, but rows containing such zeros are removed.

  • Liquidity and turnover ratios such as quick_ratio, current_ratio, and accounts_receivable_turnover were dropped, as zeros in these metrics are implausible for operational companies.

  • Lastly, the remaining critical financial ratios are passed through an upper-quantile cap so that outliers cannot disproportionately influence model predictions.
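
The share of zeros that motivates these decisions can be checked directly. The snippet below is a minimal sketch on made-up values; the column names are borrowed from the dataset, while the numbers and the 0.5 threshold are assumptions chosen only for illustration.

```r
# Toy stand-in for the numeric feature table (values are made up)
toy_features <- data.frame(
  tax_rate_a  = c(0, 0, 0.12, 0, 0.30),
  debt_ratio  = c(0.21, 0.17, 0.11, 0.09, 0.15),
  quick_ratio = c(0.001, 0, 0.004, 0.002, 0)
)

# Fraction of exact zeros in each column, sorted from most to fewest
zero_share <- sort(colMeans(toy_features == 0), decreasing = TRUE)
zero_share

# Columns whose zero share exceeds a chosen threshold (0.5 is arbitrary)
mostly_zero <- names(zero_share[zero_share > 0.5])
mostly_zero
```

Applied to the real feature table, the same two lines would give a ranked list of candidates to drop instead of reading them off the histograms by eye.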

bankrupt.cleaned <- bankrupt %>%
  select(-c(tax_rate_a,
            fixed_assets_to_assets,
            interest_bearing_debt_interest_rate,
            total_assets_to_gnp_price,
            net_value_growth_rate,
            quick_ratio,
            quick_assets_current_liability,
            average_collection_days,
            inventory_turnover_rate_times,
            fixed_assets_turnover_frequency,
            inventory_current_liability,
            accounts_receivable_turnover,
            current_asset_turnover_rate,
            research_and_development_expense_rate,
            allocation_rate_per_person,
            quick_asset_turnover_rate,
            long_term_liability_to_current_assets,
            operating_expense_rate,
            total_debt_total_net_worth,
            current_ratio)) %>%
  mutate(across(
    .cols = c(net_value_per_share_b,
              net_value_per_share_a,
              roa_b_before_interest_and_depreciation_after_tax,
              operating_gross_margin,
              realized_sales_gross_margin,
              operating_profit_rate,
              pre_tax_net_interest_rate,
              after_tax_net_interest_rate,
              current_liability_to_equity,
              equity_to_long_term_liability,
              no_credit_interval,
              gross_profit_to_sales,
              net_income_to_stockholder_s_equity,
              liability_to_equity,
              equity_to_liability,
              degree_of_financial_leverage_dfl,
              cash_flow_per_share,
              operating_profit_per_share_yuan,
              contingent_liabilities_net_worth,
              roa_a_before_interest_and_after_tax,
              persistent_eps_in_the_last_four_seasons,
              inventory_working_capital,
              per_share_net_profit_before_tax_yuan,
              realized_sales_gross_profit_growth_rate,
              total_asset_turnover,
              current_liability_to_assets,
              operating_funds_to_liability,
              cash_flow_to_total_assets,
              cash_flow_to_liability,
              cfo_to_assets,
              cash_turnover_rate,
              interest_expense_ratio,
              debt_ratio,
              long_term_fund_suitability_ratio_a,
              interest_coverage_ratio_interest_expense_to_ebit,
              continuous_net_profit_growth_rate,
              total_asset_growth_rate,
              cash_reinvestment,
              operating_profit_paid_in_capital,
              current_liability_to_current_assets,
              retained_earnings_to_total_assets,
              total_income_total_expense,
              current_liabilities_liability,
              operating_profit_growth_rate,
              after_tax_net_profit_growth_rate,
              regular_net_profit_growth_rate,
              cash_flow_rate
    ),
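    # Zeros become NA (those rows are dropped below). quantile(., 1) is the
    # column maximum, so pmin() leaves all other values unchanged at this setting.
    .fns = ~ifelse(. == 0, NA, pmin(., quantile(., 1, na.rm = TRUE), na.rm = TRUE))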
  )) %>%
  drop_na()
summary(bankrupt.cleaned)
##  bankrupt roa_c_before_interest_and_depreciation_before_interest
##  0:6574   Min.   :0.06693                                       
##  1: 212   1st Qu.:0.47687                                       
##           Median :0.50288                                       
##           Mean   :0.50579                                       
##           3rd Qu.:0.53566                                       
##           Max.   :1.00000                                       
##  roa_a_before_interest_and_after_tax
##  Min.   :0.05718                    
##  1st Qu.:0.53588                    
##  Median :0.55997                    
##  Mean   :0.55932                    
##  3rd Qu.:0.58929                    
##  Max.   :1.00000                    
##  roa_b_before_interest_and_depreciation_after_tax operating_gross_margin
##  Min.   :0.05482                                  Min.   :0.1563        
##  1st Qu.:0.52765                                  1st Qu.:0.6005        
##  Median :0.55244                                  Median :0.6060        
##  Mean   :0.55423                                  Mean   :0.6080        
##  3rd Qu.:0.58448                                  3rd Qu.:0.6139        
##  Max.   :1.00000                                  Max.   :0.6652        
##  realized_sales_gross_margin operating_profit_rate pre_tax_net_interest_rate
##  Min.   :0.1563              Min.   :0.9734        Min.   :0.7530           
##  1st Qu.:0.6005              1st Qu.:0.9990        1st Qu.:0.7974           
##  Median :0.6060              Median :0.9990        Median :0.7975           
##  Mean   :0.6080              Mean   :0.9990        Mean   :0.7974           
##  3rd Qu.:0.6138              3rd Qu.:0.9991        3rd Qu.:0.7976           
##  Max.   :0.6660              Max.   :1.0000        Max.   :0.8128           
##  after_tax_net_interest_rate non_industry_income_and_expenditure_revenue
##  Min.   :0.7662              Min.   :0.2351                             
##  1st Qu.:0.8093              1st Qu.:0.3035                             
##  Median :0.8094              Median :0.3035                             
##  Mean   :0.8093              Mean   :0.3035                             
##  3rd Qu.:0.8095              3rd Qu.:0.3036                             
##  Max.   :0.8226              Max.   :0.3301                             
##  continuous_interest_rate_after_tax cash_flow_rate   net_value_per_share_b
##  Min.   :0.7351                     Min.   :0.3067   Min.   :0.1200       
##  1st Qu.:0.7816                     1st Qu.:0.4616   1st Qu.:0.1737       
##  Median :0.7816                     Median :0.4651   Median :0.1844       
##  Mean   :0.7816                     Mean   :0.4675   Mean   :0.1908       
##  3rd Qu.:0.7817                     3rd Qu.:0.4710   3rd Qu.:0.1996       
##  Max.   :0.7959                     Max.   :1.0000   Max.   :1.0000       
##  net_value_per_share_a net_value_per_share_c
##  Min.   :0.06506       Min.   :0.06506      
##  1st Qu.:0.17375       1st Qu.:0.17378      
##  Median :0.18444       Median :0.18448      
##  Mean   :0.19080       Mean   :0.19084      
##  3rd Qu.:0.19965       3rd Qu.:0.19973      
##  Max.   :1.00000       Max.   :1.00000      
##  persistent_eps_in_the_last_four_seasons cash_flow_per_share
##  Min.   :0.07857                         Min.   :0.1287     
##  1st Qu.:0.21481                         1st Qu.:0.3178     
##  Median :0.22454                         Median :0.3225     
##  Mean   :0.22902                         Mean   :0.3236     
##  3rd Qu.:0.23901                         3rd Qu.:0.3286     
##  Max.   :1.00000                         Max.   :1.0000     
##  operating_profit_per_share_yuan per_share_net_profit_before_tax_yuan
##  Min.   :0.00627                 Min.   :0.05024                     
##  1st Qu.:0.09625                 1st Qu.:0.17045                     
##  Median :0.10431                 Median :0.17978                     
##  Mean   :0.10922                 Mean   :0.18456                     
##  3rd Qu.:0.11620                 3rd Qu.:0.19362                     
##  Max.   :1.00000                 Max.   :1.00000                     
##  realized_sales_gross_profit_growth_rate operating_profit_growth_rate
##  Min.   :0.004282                        Min.   :0.7364              
##  1st Qu.:0.022065                        1st Qu.:0.8480              
##  Median :0.022103                        Median :0.8480              
##  Mean   :0.022402                        Mean   :0.8481              
##  3rd Qu.:0.022153                        3rd Qu.:0.8481              
##  Max.   :1.000000                        Max.   :1.0000              
##  after_tax_net_profit_growth_rate regular_net_profit_growth_rate
##  Min.   :0.1807                   Min.   :0.1807                
##  1st Qu.:0.6893                   1st Qu.:0.6893                
##  Median :0.6894                   Median :0.6894                
##  Mean   :0.6893                   Mean   :0.6893                
##  3rd Qu.:0.6896                   3rd Qu.:0.6896                
##  Max.   :1.0000                   Max.   :1.0000                
##  continuous_net_profit_growth_rate total_asset_growth_rate
##  Min.   :0.09576                   Min.   :0.000e+00      
##  1st Qu.:0.21758                   1st Qu.:4.900e+09      
##  Median :0.21760                   Median :6.410e+09      
##  Mean   :0.21768                   Mean   :5.519e+09      
##  3rd Qu.:0.21762                   3rd Qu.:7.390e+09      
##  Max.   :1.00000                   Max.   :9.990e+09      
##  total_asset_return_growth_rate_ratio cash_reinvestment interest_expense_ratio
##  Min.   :0.2562                       Min.   :0.02583   Min.   :0.4600        
##  1st Qu.:0.2638                       1st Qu.:0.37478   1st Qu.:0.6306        
##  Median :0.2641                       Median :0.38046   Median :0.6307        
##  Mean   :0.2642                       Mean   :0.37973   Mean   :0.6311        
##  3rd Qu.:0.2644                       3rd Qu.:0.38676   3rd Qu.:0.6311        
##  Max.   :0.3586                       Max.   :0.75995   Max.   :1.0000        
##    debt_ratio        net_worth_assets long_term_fund_suitability_ratio_a
##  Min.   :0.0002721   Min.   :0.6681   Min.   :0.004676                  
##  1st Qu.:0.0728905   1st Qu.:0.8516   1st Qu.:0.005246                  
##  Median :0.1112556   Median :0.8887   Median :0.005666                  
##  Mean   :0.1127852   Mean   :0.8872   Mean   :0.008693                  
##  3rd Qu.:0.1484415   3rd Qu.:0.9271   3rd Qu.:0.006849                  
##  Max.   :0.3318923   Max.   :0.9997   Max.   :1.000000                  
##  borrowing_dependency contingent_liabilities_net_worth
##  Min.   :0.1871       Min.   :0.0006331               
##  1st Qu.:0.3702       1st Qu.:0.0053659               
##  Median :0.3726       Median :0.0053659               
##  Mean   :0.3745       Mean   :0.0058218               
##  3rd Qu.:0.3763       3rd Qu.:0.0057647               
##  Max.   :0.8874       Max.   :0.0731641               
##  operating_profit_paid_in_capital net_profit_before_tax_paid_in_capital
##  Min.   :0.008859                 Min.   :0.05244                      
##  1st Qu.:0.096176                 1st Qu.:0.16947                      
##  Median :0.104166                 Median :0.17853                      
##  Mean   :0.109107                 Mean   :0.18291                      
##  3rd Qu.:0.116005                 3rd Qu.:0.19164                      
##  Max.   :1.000000                 Max.   :1.00000                      
##  inventory_and_accounts_receivable_net_value total_asset_turnover
##  Min.   :0.3514                              Min.   :0.001499    
##  1st Qu.:0.3974                              1st Qu.:0.076462    
##  Median :0.4001                              Median :0.118441    
##  Mean   :0.4025                              Mean   :0.141828    
##  3rd Qu.:0.4045                              3rd Qu.:0.176911    
##  Max.   :1.0000                              Max.   :1.000000    
##  net_worth_turnover_rate_times revenue_per_person  operating_profit_per_person
##  Min.   :0.009032              Min.   :0.0001116   Min.   :0.0000             
##  1st Qu.:0.021936              1st Qu.:0.0104674   1st Qu.:0.3925             
##  Median :0.029516              Median :0.0186361   Median :0.3959             
##  Mean   :0.038473              Mean   :0.0355880   Mean   :0.4008             
##  3rd Qu.:0.042903              3rd Qu.:0.0358423   3rd Qu.:0.4019             
##  Max.   :1.000000              Max.   :1.0000000   Max.   :1.0000             
##  working_capital_to_total_assets quick_assets_total_assets
##  Min.   :0.4752                  Min.   :0.0000           
##  1st Qu.:0.7746                  1st Qu.:0.2421           
##  Median :0.8105                  Median :0.3866           
##  Mean   :0.8145                  Mean   :0.4004           
##  3rd Qu.:0.8504                  3rd Qu.:0.5407           
##  Max.   :1.0000                  Max.   :1.0000           
##  current_assets_total_assets cash_total_assets   current_liability_to_assets
##  Min.   :0.0000              Min.   :0.0001842   Min.   :0.0008467          
##  1st Qu.:0.3529              1st Qu.:0.0335917   1st Qu.:0.0532728          
##  Median :0.5151              Median :0.0750036   Median :0.0826016          
##  Mean   :0.5225              Mean   :0.1242431   Mean   :0.0903545          
##  3rd Qu.:0.6892              3rd Qu.:0.1612751   3rd Qu.:0.1192421          
##  Max.   :1.0000              Max.   :1.0000000   Max.   :0.3431427          
##  operating_funds_to_liability inventory_working_capital
##  Min.   :0.01472              Min.   :0.1765           
##  1st Qu.:0.34108              1st Qu.:0.2770           
##  Median :0.34865              Median :0.2772           
##  Mean   :0.35392              Mean   :0.2775           
##  3rd Qu.:0.36096              3rd Qu.:0.2774           
##  Max.   :1.00000              Max.   :1.0000           
##  current_liabilities_liability working_capital_equity
##  Min.   :0.0008623             Min.   :0.5071        
##  1st Qu.:0.6274174             1st Qu.:0.7336        
##  Median :0.8068333             Median :0.7360        
##  Mean   :0.7617579             Mean   :0.7360        
##  3rd Qu.:0.9418341             3rd Qu.:0.7386        
##  Max.   :1.0000000             Max.   :1.0000        
##  current_liabilities_equity retained_earnings_to_total_assets
##  Min.   :0.1538             Min.   :0.5021                   
##  1st Qu.:0.3281             1st Qu.:0.9313                   
##  Median :0.3297             Median :0.9377                   
##  Mean   :0.3313             Mean   :0.9351                   
##  3rd Qu.:0.3323             3rd Qu.:0.9449                   
##  Max.   :0.7759             Max.   :1.0000                   
##  total_income_total_expense total_expense_assets working_capitcal_turnover_rate
##  Min.   :0.000772           Min.   :0.00000      Min.   :0.5884                
##  1st Qu.:0.002237           1st Qu.:0.01457      1st Qu.:0.5939                
##  Median :0.002337           Median :0.02268      Median :0.5940                
##  Mean   :0.002551           Mean   :0.02890      Mean   :0.5940                
##  3rd Qu.:0.002493           3rd Qu.:0.03588      3rd Qu.:0.5940                
##  Max.   :1.000000           Max.   :0.46348      Max.   :0.6054                
##  cash_turnover_rate  cash_flow_to_sales current_liability_to_liability
##  Min.   :0.000e+00   Min.   :0.6618     Min.   :0.0008623             
##  1st Qu.:0.000e+00   1st Qu.:0.6716     1st Qu.:0.6274174             
##  Median :1.085e+09   Median :0.6716     Median :0.8068333             
##  Mean   :2.477e+09   Mean   :0.6716     Mean   :0.7617579             
##  3rd Qu.:4.530e+09   3rd Qu.:0.6716     3rd Qu.:0.9418341             
##  Max.   :1.000e+10   Max.   :0.6760     Max.   :1.0000000             
##  current_liability_to_equity equity_to_long_term_liability
##  Min.   :0.1538              Min.   :0.02585              
##  1st Qu.:0.3281              1st Qu.:0.11093              
##  Median :0.3297              Median :0.11235              
##  Mean   :0.3313              Mean   :0.11543              
##  3rd Qu.:0.3323              3rd Qu.:0.11711              
##  Max.   :0.7759              Max.   :1.00000              
##  cash_flow_to_total_assets cash_flow_to_liability cfo_to_assets    
##  Min.   :0.09209           Min.   :0.02806        Min.   :0.07425  
##  1st Qu.:0.63336           1st Qu.:0.45714        1st Qu.:0.56626  
##  Median :0.64540           Median :0.45976        Median :0.59347  
##  Mean   :0.64987           Mean   :0.46186        Mean   :0.59385  
##  3rd Qu.:0.66302           3rd Qu.:0.46424        3rd Qu.:0.62490  
##  Max.   :1.00000           Max.   :1.00000        Max.   :1.00000  
##  cash_flow_to_equity current_liability_to_current_assets liability_assets_flag
##  Min.   :0.06196     Min.   :0.00022                     0:6783               
##  1st Qu.:0.31301     1st Qu.:0.01803                     1:   3               
##  Median :0.31496     Median :0.02755                                          
##  Mean   :0.31554     Mean   :0.03129                                          
##  3rd Qu.:0.31770     3rd Qu.:0.03827                                          
##  Max.   :0.50887     Max.   :1.00000                                          
##  net_income_to_total_assets no_credit_interval gross_profit_to_sales
##  Min.   :0.4118             Min.   :0.4087     Min.   :0.1563       
##  1st Qu.:0.7969             1st Qu.:0.6236     1st Qu.:0.6005       
##  Median :0.8107             Median :0.6239     Median :0.6060       
##  Mean   :0.8083             Mean   :0.6240     Mean   :0.6080       
##  3rd Qu.:0.8266             3rd Qu.:0.6242     3rd Qu.:0.6139       
##  Max.   :1.0000             Max.   :1.0000     Max.   :0.6651       
##  net_income_to_stockholder_s_equity liability_to_equity
##  Min.   :0.3447                     Min.   :0.1335     
##  1st Qu.:0.8401                     1st Qu.:0.2770     
##  Median :0.8412                     Median :0.2788     
##  Mean   :0.8406                     Mean   :0.2802     
##  3rd Qu.:0.8424                     3rd Qu.:0.2814     
##  Max.   :1.0000                     Max.   :0.6523     
##  degree_of_financial_leverage_dfl
##  Min.   :0.0007895               
##  1st Qu.:0.0267912               
##  Median :0.0268085               
##  Mean   :0.0275491               
##  3rd Qu.:0.0269140               
##  Max.   :1.0000000               
##  interest_coverage_ratio_interest_expense_to_ebit net_income_flag
##  Min.   :0.1721                                   1:6786         
##  1st Qu.:0.5652                                                  
##  Median :0.5653                                                  
##  Mean   :0.5654                                                  
##  3rd Qu.:0.5657                                                  
##  Max.   :1.0000                                                  
##  equity_to_liability
##  Min.   :0.008753   
##  1st Qu.:0.024543   
##  Median :0.033852   
##  Mean   :0.047218   
##  3rd Qu.:0.052837   
##  Max.   :0.942729

Preprocessing/cleaning

Create ten stratified 80/20 train/test partitions of the data. The features are centered and scaled later, within each partition, so that all features contribute equally to the distance calculations in models like kNN.

set.seed(1)
bankrupt.split <- createDataPartition(bankrupt.cleaned$bankrupt, p = 0.8, list = FALSE, times = 10)
head(bankrupt.split)
##      Resample01 Resample02 Resample03 Resample04 Resample05 Resample06
## [1,]          1          1          1          2          3          1
## [2,]          2          2          2          4          4          2
## [3,]          3          3          4          5          5          4
## [4,]          4          4          5          7          6          5
## [5,]          5          6          6         10          7          6
## [6,]          6          7          7         11          8          7
##      Resample07 Resample08 Resample09 Resample10
## [1,]          1          2          1          1
## [2,]          2          3          2          5
## [3,]          4          4          3          7
## [4,]          5          5          6          8
## [5,]          6          7          7          9
## [6,]          7         10          9         10
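
For intuition, the stratified resampling that createDataPartition performs for a factor outcome can be approximated in base R. The following is a minimal sketch on made-up toy labels, not caret's actual implementation; the helper name stratified_split and the toy data are assumptions introduced for illustration.

```r
# Toy outcome with a heavy class imbalance (hypothetical data, not the bankruptcy set)
set.seed(1)
toy.target <- factor(c(rep(0, 95), rep(1, 5)))

# Hypothetical helper: sample a proportion p of the row indices within each class
stratified_split <- function(y, p = 0.8) {
  idx.by.class <- split(seq_along(y), y)
  picked <- lapply(idx.by.class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Ten independent resamples, one column per resample (mirrors times = 10)
toy.split <- replicate(10, stratified_split(toy.target, p = 0.8))

# Share of the minority class among the training rows of each resample
minority.share <- apply(toy.split, 2, function(idx) mean(toy.target[idx] == "1"))
minority.share
```

Each column holds the training row indices for one resample; the matching test rows are obtained by negative indexing, as in the modeling code that follows. Sampling within each class keeps the minority share the same in every resample.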

Modeling

Fitting a random forest

set.seed(1)
feature.train <- bankrupt.cleaned[bankrupt.split[, 1], !(colnames(bankrupt.cleaned) %in% 'bankrupt')]
feature.test <- bankrupt.cleaned[-bankrupt.split[, 1], !(colnames(bankrupt.cleaned) %in% 'bankrupt')]
target.train <- bankrupt.cleaned[bankrupt.split[, 1], 'bankrupt']
target.test <- bankrupt.cleaned[-bankrupt.split[, 1], 'bankrupt']
preprocess.obj <- preProcess(feature.train, method = c("scale", "center"))
feature.train <- predict(preprocess.obj, feature.train)
feature.test <- predict(preprocess.obj, feature.test)

random.forest <- randomForest(x = feature.train, y = target.train)
rf.pred <- predict(random.forest, newdata = feature.test)
rf.error <- mean(rf.pred != target.test)
round(rf.error, 3)
## [1] 0.032
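
The preProcess call above estimates the centering and scaling parameters on the training features and then applies them to both the training and the test features. A rough base-R equivalent is sketched below on toy numeric data; the toy matrices and column names are made up for illustration, and the sketch does not handle the factor columns present in the real data.

```r
set.seed(1)
toy.train <- matrix(rnorm(40, mean = 50, sd = 10), ncol = 2,
                    dimnames = list(NULL, c("ratio_a", "ratio_b")))
toy.test <- matrix(rnorm(10, mean = 50, sd = 10), ncol = 2,
                   dimnames = list(NULL, c("ratio_a", "ratio_b")))

# Estimate the parameters from the training rows only
train.mean <- colMeans(toy.train)
train.sd <- apply(toy.train, 2, sd)

# Apply the same training parameters to both sets
toy.train.scaled <- scale(toy.train, center = train.mean, scale = train.sd)
toy.test.scaled <- scale(toy.test, center = train.mean, scale = train.sd)

round(colMeans(toy.train.scaled), 10)  # training columns are centered at zero
apply(toy.train.scaled, 2, sd)         # and have unit standard deviation
```

Reusing the training parameters for the test set keeps information about the test rows out of the fitted model.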

Create an empty data frame with three columns, one for decision tree error rate, one for kNN error rate, and another for fold number.

error.frame <- data.frame(matrix(ncol = 3, nrow = 10))

colnames(error.frame) <- c("folds", "tree_error", "kNN_error")

Fitting & testing the models

set.seed(1)
for (i in seq_len(10)) {
  feature.train <- bankrupt.cleaned[bankrupt.split[, i], !(colnames(bankrupt.cleaned) %in% 'bankrupt')]
  feature.test <- bankrupt.cleaned[-bankrupt.split[, i], !(colnames(bankrupt.cleaned) %in% 'bankrupt')]
  target.train <- bankrupt.cleaned[bankrupt.split[, i], 'bankrupt']
  target.test <- bankrupt.cleaned[-bankrupt.split[, i], 'bankrupt']
  preprocess.obj <- preProcess(feature.train, method = c("scale", "center", "knnImpute"))
  feature.train <- predict(preprocess.obj, feature.train)
  feature.test <- predict(preprocess.obj, feature.test)
  full.train <- cbind(feature.train, target.train)
  full.train <- full.train %>%
    rename(bankrupt = target.train)
  # kNN error
  k.form <- sqrt(length(target.test))
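  # Intended to force an odd k; since k.form is usually not a whole number, this only changes it when the square root is an even integer
  k.form <- ifelse(k.form %% 2 == 0, k.form + 1, k.form)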
  knn.fit <- knn3(feature.train, target.train, k = k.form)
  knn.pred <- predict(knn.fit, feature.test, type = 'class')
  error <- mean(ifelse(target.test != knn.pred, 1, 0))
  error.frame[i, "kNN_error"] <- error
  error.frame[i, "folds"] <- i

  # Decision Tree
  decision.tree <- rpart(bankrupt ~ ., data = full.train, method = "class")
  decision.tree.predictions <- predict(decision.tree, newdata = feature.test, type = "class")
  decision.tree.error <- mean(target.test != decision.tree.predictions)
  error.frame[i, 'tree_error'] <- decision.tree.error
}
error.frame
##    folds tree_error  kNN_error
## 1      1 0.03908555 0.03097345
## 2      2 0.03318584 0.03171091
## 3      3 0.03318584 0.03097345
## 4      4 0.03539823 0.03171091
## 5      5 0.03908555 0.03023599
## 6      6 0.03466077 0.03171091
## 7      7 0.03982301 0.03097345
## 8      8 0.03466077 0.03171091
## 9      9 0.04056047 0.03171091
## 10    10 0.03171091 0.03023599
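
The loop derives k from the square root of the test-set size. Because that square root is usually not a whole number, a parity check with %% only behaves as intended after rounding. A minimal sketch of one way to obtain an odd integer k follows; the helper name choose_odd_k is hypothetical, and this is not the code that produced the results above.

```r
# Hypothetical helper: round sqrt(n) down, then add one if the result is even
choose_odd_k <- function(n) {
  k <- floor(sqrt(n))
  if (k %% 2 == 0) k + 1 else k
}

choose_odd_k(1356)  # a test set of 1356 rows; sqrt(1356) is about 36.8
choose_odd_k(25)    # sqrt(25) is already an odd whole number
```

An odd k avoids tied votes between the two classes.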

Plot the error rate for each of the 10 resamples

ggplot(error.frame, aes(x = folds)) +
  geom_line(aes(y = tree_error), color = "red") +
  geom_line(aes(y = kNN_error), color = "blue") +
  scale_x_continuous(breaks = seq(min(error.frame$folds), max(error.frame$folds), by = 1)) +
  labs(x = "Folds", y = "Error Rate", title = "Error Rates for Decision Tree and kNN") +
  theme_minimal()

Calculate the mean error rate for the decision tree and kNN models.

mean.tree.error <- mean(error.frame$tree_error)
mean.kNN.error <- mean(error.frame$kNN_error)
print(paste("Mean Decision Tree Error Rate:", round(mean.tree.error, 3)))
## [1] "Mean Decision Tree Error Rate: 0.036"
print(paste("Mean kNN Error Rate:", round(mean.kNN.error, 3)))
## [1] "Mean kNN Error Rate: 0.031"
print(paste("Random Forest Error Rate:", round(rf.error, 3)))
## [1] "Random Forest Error Rate: 0.032"

Mean Error Rates:

  1. Decision Tree: The mean error rate across the 10 folds is 0.036.

  2. k-Nearest Neighbors (kNN): The mean error rate across the 10 folds is 0.031.

  3. Random Forest: The error rate on the single train/test split used for this model (Resample01) is 0.032.
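
Because these error rates are close, the spread across resamples is worth looking at alongside the means. The sketch below re-enters the per-resample error rates printed above and reports the mean and standard deviation for the decision tree and kNN; the object names errors and error.summary are introduced here for illustration.

```r
# Per-resample error rates copied from the error.frame output above
errors <- data.frame(
  tree_error = c(0.03908555, 0.03318584, 0.03318584, 0.03539823, 0.03908555,
                 0.03466077, 0.03982301, 0.03466077, 0.04056047, 0.03171091),
  kNN_error = c(0.03097345, 0.03171091, 0.03097345, 0.03171091, 0.03023599,
                0.03171091, 0.03097345, 0.03171091, 0.03171091, 0.03023599)
)

# Mean and standard deviation of the error rate for each model
error.summary <- sapply(errors, function(x) c(mean = mean(x), sd = sd(x)))
round(error.summary, 4)

# Number of resamples in which kNN had the lower error rate
sum(errors$kNN_error < errors$tree_error)
```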

Create Confusion Matrices

knn.confusion <- confusionMatrix(knn.pred, target.test)
knn.confusion$table
##           Reference
## Prediction    0    1
##          0 1314   41
##          1    0    1
mean(knn.pred == target.test)
## [1] 0.969764
plot_confusion_matrix <- function(confusion_matrix, title) {
  cm_table <- as.data.frame(confusion_matrix$table)
  cm_table$Prediction <- factor(cm_table$Prediction, levels = rev(levels(cm_table$Prediction)))
  cm_table$Correct <- cm_table$Reference == cm_table$Prediction
  ggplot(data = cm_table, aes(x = Reference, y = Prediction)) +
    geom_tile(aes(fill = Correct)) +
    scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red")) +
    geom_text(aes(label = Freq), vjust = 1) +
    labs(title = title, x = "Actual", y = "Predicted") +
    theme_minimal()
}
plot_confusion_matrix(knn.confusion, "kNN Confusion Matrix")

tree.confusion <- confusionMatrix(decision.tree.predictions, target.test)
tree.confusion$table
##           Reference
## Prediction    0    1
##          0 1299   28
##          1   15   14
mean(decision.tree.predictions == target.test)
## [1] 0.9682891
plot_confusion_matrix(tree.confusion, "Decision Tree Confusion Matrix")

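# Caution: rf.pred was produced on the Resample01 test rows, but target.test now holds
# the Resample10 test rows from the loop above, so the rows compared below do not line up.
rf.confusion <- confusionMatrix(rf.pred, target.test)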
rf.confusion$table
##           Reference
## Prediction    0    1
##          0 1307   42
##          1    7    0
mean(rf.pred == target.test)
## [1] 0.9638643
plot_confusion_matrix(rf.confusion, "Random Forest Confusion Matrix")
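
Overall accuracy can hide how each model treats the rare class. The sketch below re-enters the kNN and decision tree tables printed above and computes, for each, the share of actual class-1 cases that were predicted as class 1, alongside the overall accuracy. The helper names recall_class1 and accuracy are hypothetical.

```r
# Confusion tables copied from the output above (rows = prediction, columns = reference)
labels <- list(Prediction = c("0", "1"), Reference = c("0", "1"))
knn.table <- matrix(c(1314, 0, 41, 1), nrow = 2, dimnames = labels)
tree.table <- matrix(c(1299, 15, 28, 14), nrow = 2, dimnames = labels)

# Hypothetical helpers: share of actual class-1 rows predicted as 1, and overall accuracy
recall_class1 <- function(tab) tab["1", "1"] / sum(tab[, "1"])
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

round(c(kNN = recall_class1(knn.table), tree = recall_class1(tree.table)), 3)
round(c(kNN = accuracy(knn.table), tree = accuracy(tree.table)), 4)
```

With classes this imbalanced, two models can have nearly the same accuracy while differing widely in how many rare-class cases they detect.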

Model Performance Comparison

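Based on the 10 resampled train/test splits and the confusion matrix results shown above, the accuracy of each model on held-out data is as follows:

  1. k-Nearest Neighbors (kNN): 0.9698 on the final resample (mean across the 10 resamples: 0.969)

  2. Random Forest (RF): approximately 0.968 on its single train/test split (1 minus the 0.032 error rate)

  3. Decision Tree: 0.9683 on the final resample (mean across the 10 resamples: 0.964)


Recommendation

Given the goal of accurately classifying companies as bankrupt or not bankrupt, the k-Nearest Neighbors (kNN) model is recommended for this project. It has the highest accuracy (0.9698), although its margin over the Decision Tree and Random Forest is small. The confusion matrices also show that kNN assigns almost every company to class 0: it predicts class 1 only once and misses 41 of the 42 class-1 companies in the test set, whereas the Decision Tree identifies 14 of them. Accuracy alone therefore overstates how well the models separate the two classes. It is also important to consider the computational cost and scalability of kNN, especially if the dataset grows larger. If interpretability and feature importance are also critical, the Random Forest model could be a viable alternative despite its slightly lower accuracy.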


Conclusion

For the current dataset and project requirements, the kNN model is the preferred choice because it achieved the highest accuracy in classifying bankruptcy status, although its lead over the other two models is narrow. Accurate predictions are essential for the project’s objective of predicting company insolvency.