Clustering banks based on bankruptcy risk is a crucial analytical approach for understanding and managing financial stability within the banking sector. By grouping banks with similar risk profiles, stakeholders can better assess potential vulnerabilities, allocate resources efficiently, and implement targeted strategies. This method provides a structured way to address the complexities of financial risk and improve overall oversight and intervention efforts.
Here our objective is to cluster the data points.
The data are taken from the Kaggle dataset "Company Bankruptcy Prediction", which contains the following variables:
Y - Bankrupt?: Class label
X1 - ROA(C) before interest and depreciation before interest: Return On Total Assets(C)
X2 - ROA(A) before interest and % after tax: Return On Total Assets(A)
X3 - ROA(B) before interest and depreciation after tax: Return On Total Assets(B)
X4 - Operating Gross Margin: Gross Profit/Net Sales
X5 - Realized Sales Gross Margin: Realized Gross Profit/Net Sales
X6 - Operating Profit Rate: Operating Income/Net Sales
X7 - Pre-tax net Interest Rate: Pre-Tax Income/Net Sales
X8 - After-tax net Interest Rate: Net Income/Net Sales
X9 - Non-industry income and expenditure/revenue: Net Non-operating Income Ratio
X10 - Continuous interest rate (after tax): Net Income-Exclude Disposal Gain or Loss/Net Sales
X11 - Operating Expense Rate: Operating Expenses/Net Sales
X12 - Research and development expense rate: (Research and Development Expenses)/Net Sales
X13 - Cash flow rate: Cash Flow from Operating/Current Liabilities
X14 - Interest-bearing debt interest rate: Interest-bearing Debt/Equity
X15 - Tax rate (A): Effective Tax Rate
X16 - Net Value Per Share (B): Book Value Per Share(B)
X17 - Net Value Per Share (A): Book Value Per Share(A)
X18 - Net Value Per Share (C): Book Value Per Share(C)
X19 - Persistent EPS in the Last Four Seasons: EPS-Net Income
X20 - Cash Flow Per Share
X21 - Revenue Per Share (Yuan ¥): Sales Per Share
X22 - Operating Profit Per Share (Yuan ¥): Operating Income Per Share
X23 - Per Share Net profit before tax (Yuan ¥): Pretax Income Per Share
X24 - Realized Sales Gross Profit Growth Rate
X25 - Operating Profit Growth Rate: Operating Income Growth
X26 - After-tax Net Profit Growth Rate: Net Income Growth
X27 - Regular Net Profit Growth Rate: Continuing Operating Income after Tax Growth
X28 - Continuous Net Profit Growth Rate: Net Income-Excluding Disposal Gain or Loss Growth
X29 - Total Asset Growth Rate: Total Asset Growth
X30 - Net Value Growth Rate: Total Equity Growth
X31 - Total Asset Return Growth Rate Ratio: Return on Total Asset Growth
X32 - Cash Reinvestment %: Cash Reinvestment Ratio
X33 - Current Ratio
X34 - Quick Ratio: Acid Test
X35 - Interest Expense Ratio: Interest Expenses/Total Revenue
X36 - Total debt/Total net worth: Total Liability/Equity Ratio
X37 - Debt ratio %: Liability/Total Assets
X38 - Net worth/Assets: Equity/Total Assets
X39 - Long-term fund suitability ratio (A): (Long-term Liability+Equity)/Fixed Assets
X40 - Borrowing dependency: Cost of Interest-bearing Debt
X41 - Contingent liabilities/Net worth: Contingent Liability/Equity
X42 - Operating profit/Paid-in capital: Operating Income/Capital
X43 - Net profit before tax/Paid-in capital: Pretax Income/Capital
X44 - Inventory and accounts receivable/Net value: (Inventory+Accounts Receivables)/Equity
X45 - Total Asset Turnover
X46 - Accounts Receivable Turnover
X47 - Average Collection Days: Days Receivable Outstanding
X48 - Inventory Turnover Rate (times)
X49 - Fixed Assets Turnover Frequency
X50 - Net Worth Turnover Rate (times): Equity Turnover
X51 - Revenue per person: Sales Per Employee
X52 - Operating profit per person: Operation Income Per Employee
X53 - Allocation rate per person: Fixed Assets Per Employee
X54 - Working Capital to Total Assets
X55 - Quick Assets/Total Assets
X56 - Current Assets/Total Assets
X57 - Cash/Total Assets
X58 - Quick Assets/Current Liability
X59 - Cash/Current Liability
X60 - Current Liability to Assets
X61 - Operating Funds to Liability
X62 - Inventory/Working Capital
X63 - Inventory/Current Liability
X64 - Current Liabilities/Liability
X65 - Working Capital/Equity
X66 - Current Liabilities/Equity
X67 - Long-term Liability to Current Assets
X68 - Retained Earnings to Total Assets
X69 - Total income/Total expense
X70 - Total expense/Assets
X71 - Current Asset Turnover Rate: Current Assets to Sales
X72 - Quick Asset Turnover Rate: Quick Assets to Sales
X73 - Working capitcal Turnover Rate: Working Capital to Sales
X74 - Cash Turnover Rate: Cash to Sales
X75 - Cash Flow to Sales
X76 - Fixed Assets to Assets
X77 - Current Liability to Liability
X78 - Current Liability to Equity
X79 - Equity to Long-term Liability
X80 - Cash Flow to Total Assets
X81 - Cash Flow to Liability
X82 - CFO to Assets
X83 - Cash Flow to Equity
X84 - Current Liability to Current Assets
X85 - Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0 otherwise
X86 - Net Income to Total Assets
X87 - Total assets to GNP price
X88 - No-credit Interval
X89 - Gross Profit to Sales
X90 - Net Income to Stockholder’s Equity
X91 - Liability to Equity
X92 - Degree of Financial Leverage (DFL)
X93 - Interest Coverage Ratio (Interest expense to EBIT)
X94 - Net Income Flag: 1 if Net Income is Negative for the last two years, 0 otherwise
X95 - Equity to Liability
Getting the required packages:
pacman::p_load(MASS,clValid,cluster,dbscan,ggplot2)
Importing and cleaning the dataset:
mydata=read.csv("C:\\Users\\zeeda\\Downloads\\archive (3)\\data.csv")
table(mydata$Bankrupt.)
0 1
6599 220
The data are clearly highly imbalanced: there are many more observations on non-bankrupt banks (6599) than on bankrupt ones (220). We therefore undersample the majority (non-bankrupt) class, keeping 201 randomly chosen observations.
no_bankrupt=subset(mydata,mydata$Bankrupt.==0)
# no seed is set before this draw, so the 201 sampled rows change from run to run
mydata_index=sample(1:nrow(no_bankrupt),201,replace=FALSE)
no_bankrupt_data=no_bankrupt[mydata_index,]
bankrupt_data=subset(mydata,mydata$Bankrupt.==1)
mydata=rbind(no_bankrupt_data,bankrupt_data)
table(mydata$Bankrupt.)
0 1
201 220
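The undersampling step draws a different random subset each time it runs, because no seed is set before `sample()`. A minimal sketch of a reproducible version on simulated data follows. The data frame `toy`, the helper `undersample`, the seed value, and the choice of 220 majority rows (an exactly balanced sample rather than the 201 used above) are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: keep all minority rows and n_major random majority rows.
undersample <- function(df, label, n_major, seed = 1) {
  set.seed(seed)                                  # fixes the random draw
  major <- df[df[[label]] == 0, , drop = FALSE]
  minor <- df[df[[label]] == 1, , drop = FALSE]
  keep  <- sample(seq_len(nrow(major)), n_major, replace = FALSE)
  rbind(major[keep, , drop = FALSE], minor)
}

# Simulated data with the same class counts as the table above.
toy <- data.frame(Bankrupt. = rep(c(0, 1), times = c(6599, 220)),
                  x = seq_len(6819))
balanced  <- undersample(toy, "Bankrupt.", n_major = 220)
balanced2 <- undersample(toy, "Bankrupt.", n_major = 220)
class_counts <- table(balanced$Bankrupt.)
print(class_counts)
```

Calling the helper twice with the same seed returns the same rows, which makes every later result repeatable.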
summary(mydata)
Bankrupt. ROA.C..before.interest.and.depreciation.before.interest
Min. :0.0000 Min. :0.02428
1st Qu.:0.0000 1st Qu.:0.43022
Median :1.0000 Median :0.46921
Mean :0.5226 Mean :0.45933
3rd Qu.:1.0000 3rd Qu.:0.50261
Max. :1.0000 Max. :0.69385
ROA.A..before.interest.and...after.tax
Min. :0.0000
1st Qu.:0.4702
Median :0.5258
Mean :0.5052
3rd Qu.:0.5612
Max. :0.7439
ROA.B..before.interest.and.depreciation.after.tax Operating.Gross.Margin
Min. :0.03351 Min. :0.5329
1st Qu.:0.47272 1st Qu.:0.5966
Median :0.51823 Median :0.6016
Mean :0.50449 Mean :0.6028
3rd Qu.:0.55249 3rd Qu.:0.6091
Max. :0.72440 Max. :0.6652
Realized.Sales.Gross.Margin Operating.Profit.Rate Pre.tax.net.Interest.Rate
Min. :0.5329 Min. :0.9862 Min. :0.7572
1st Qu.:0.5966 1st Qu.:0.9988 1st Qu.:0.7971
Median :0.6016 Median :0.9990 Median :0.7973
Mean :0.6029 Mean :0.9988 Mean :0.7969
3rd Qu.:0.6092 3rd Qu.:0.9990 3rd Qu.:0.7975
Max. :0.6652 Max. :0.9995 Max. :0.7983
After.tax.net.Interest.Rate Non.industry.income.and.expenditure.revenue
Min. :0.7616 Min. :0.2351
1st Qu.:0.8090 1st Qu.:0.3032
Median :0.8093 Median :0.3035
Mean :0.8088 Mean :0.3030
3rd Qu.:0.8094 3rd Qu.:0.3035
Max. :0.8101 Max. :0.3054
Continuous.interest.rate..after.tax. Operating.Expense.Rate
Min. :0.7427 Min. :0.000e+00
1st Qu.:0.7812 1st Qu.:0.000e+00
Median :0.7815 Median :0.000e+00
Mean :0.7811 Mean :1.997e+09
3rd Qu.:0.7816 3rd Qu.:4.090e+09
Max. :0.7821 Max. :9.890e+09
Research.and.development.expense.rate Cash.flow.rate
Min. :0.000e+00 Min. :0.3438
1st Qu.:0.000e+00 1st Qu.:0.4596
Median :2.040e+08 Median :0.4622
Mean :1.766e+09 Mean :0.4641
3rd Qu.:3.190e+09 3rd Qu.:0.4660
Max. :9.920e+09 Max. :0.7433
Interest.bearing.debt.interest.rate Tax.rate..A. Net.Value.Per.Share..B.
Min. : 0 Min. :0.0000 Min. :0.06966
1st Qu.: 0 1st Qu.:0.0000 1st Qu.:0.15528
Median : 0 Median :0.0000 Median :0.17361
Mean : 8078385 Mean :0.0698 Mean :0.17514
3rd Qu.: 0 3rd Qu.:0.1195 3rd Qu.:0.18857
Max. :790000000 Max. :0.9755 Max. :0.30024
Net.Value.Per.Share..A. Net.Value.Per.Share..C.
Min. :0.06966 Min. :0.06966
1st Qu.:0.15528 1st Qu.:0.15528
Median :0.17361 Median :0.17361
Mean :0.17511 Mean :0.17519
3rd Qu.:0.18857 3rd Qu.:0.18963
Max. :0.30024 Max. :0.30024
Persistent.EPS.in.the.Last.Four.Seasons Cash.Flow.Per.Share
Min. :0.0000 Min. :0.2085
1st Qu.:0.1924 1st Qu.:0.3144
Median :0.2097 Median :0.3192
Mean :0.2074 Mean :0.3191
3rd Qu.:0.2244 3rd Qu.:0.3243
Max. :0.3505 Max. :0.4208
Revenue.Per.Share..Yuan... Operating.Profit.Per.Share..Yuan...
Min. :0.000e+00 Min. :0.00000
1st Qu.:0.000e+00 1st Qu.:0.08696
Median :0.000e+00 Median :0.09576
Mean :7.173e+06 Mean :0.09709
3rd Qu.:0.000e+00 3rd Qu.:0.10553
Max. :3.020e+09 Max. :0.25446
Per.Share.Net.profit.before.tax..Yuan...
Min. :0.0000
1st Qu.:0.1512
Median :0.1672
Mean :0.1651
3rd Qu.:0.1795
Max. :0.3233
Realized.Sales.Gross.Profit.Growth.Rate Operating.Profit.Growth.Rate
Min. :0.01885 Min. :0.7364
1st Qu.:0.02203 1st Qu.:0.8479
Median :0.02208 Median :0.8480
Mean :0.02253 Mean :0.8475
3rd Qu.:0.02213 3rd Qu.:0.8481
Max. :0.08345 Max. :0.8525
After.tax.Net.Profit.Growth.Rate Regular.Net.Profit.Growth.Rate
Min. :0.1807 Min. :0.1807
1st Qu.:0.6888 1st Qu.:0.6889
Median :0.6893 Median :0.6893
Mean :0.6868 Mean :0.6868
3rd Qu.:0.6895 3rd Qu.:0.6895
Max. :0.7831 Max. :0.7831
Continuous.Net.Profit.Growth.Rate Total.Asset.Growth.Rate
Min. :0.1617 Min. :0.000e+00
1st Qu.:0.2175 1st Qu.:4.330e+09
Median :0.2176 Median :5.890e+09
Mean :0.2173 Mean :5.206e+09
3rd Qu.:0.2176 3rd Qu.:6.900e+09
Max. :0.2195 Max. :9.980e+09
Net.Value.Growth.Rate Total.Asset.Return.Growth.Rate.Ratio Cash.Reinvestment..
Min. :0.000e+00 Min. :0.2516 Min. :0.02583
1st Qu.:0.000e+00 1st Qu.:0.2634 1st Qu.:0.37004
Median :0.000e+00 Median :0.2639 Median :0.37807
Mean :2.216e+07 Mean :0.2637 Mean :0.37647
3rd Qu.:0.000e+00 3rd Qu.:0.2642 3rd Qu.:0.38359
Max. :9.330e+09 Max. :0.2721 Max. :1.00000
Current.Ratio Quick.Ratio Interest.Expense.Ratio
Min. :0.0003551 Min. :0.000e+00 Min. :0.5251
1st Qu.:0.0050964 1st Qu.:0.000e+00 1st Qu.:0.6301
Median :0.0079251 Median :0.000e+00 Median :0.6306
Mean :0.0119850 Mean :2.192e+07 Mean :0.6309
3rd Qu.:0.0121160 3rd Qu.:0.000e+00 3rd Qu.:0.6310
Max. :0.7126299 Max. :9.230e+09 Max. :0.8122
Total.debt.Total.net.worth Debt.ratio.. Net.worth.Assets
Min. :0.000e+00 Min. :0.0000 Min. :0.4746
1st Qu.:0.000e+00 1st Qu.:0.1005 1st Qu.:0.8073
Median :0.000e+00 Median :0.1548 Median :0.8452
Mean :8.242e+06 Mean :0.1500 Mean :0.8500
3rd Qu.:0.000e+00 3rd Qu.:0.1927 3rd Qu.:0.8995
Max. :3.470e+09 Max. :0.5254 Max. :1.0000
Long.term.fund.suitability.ratio..A. Borrowing.dependency
Min. :0.004129 Min. :0.0000
1st Qu.:0.005080 1st Qu.:0.3716
Median :0.005440 Median :0.3772
Mean :0.009931 Mean :0.3825
3rd Qu.:0.006222 3rd Qu.:0.3839
Max. :0.923930 Max. :1.0000
Contingent.liabilities.Net.worth Operating.profit.Paid.in.capital
Min. :0.000000 Min. :0.00000
1st Qu.:0.005366 1st Qu.:0.08703
Median :0.005366 Median :0.09571
Mean :0.008349 Mean :0.09714
3rd Qu.:0.006038 3rd Qu.:0.10531
Max. :1.000000 Max. :0.25447
Net.profit.before.tax.Paid.in.capital
Min. :0.0000
1st Qu.:0.1508
Median :0.1662
Mean :0.1642
3rd Qu.:0.1782
Max. :0.2998
Inventory.and.accounts.receivable.Net.value Total.Asset.Turnover
Min. :0.0000 Min. :0.00000
1st Qu.:0.3979 1st Qu.:0.06297
Median :0.4015 Median :0.09745
Mean :0.4053 Mean :0.12394
3rd Qu.:0.4086 3rd Qu.:0.15592
Max. :0.7074 Max. :0.66117
Accounts.Receivable.Turnover Average.Collection.Days
Min. :0.000e+00 Min. : 0
1st Qu.:0.000e+00 1st Qu.: 0
Median :0.000e+00 Median : 0
Mean :2.898e+06 Mean : 325416
3rd Qu.:0.000e+00 3rd Qu.: 0
Max. :1.220e+09 Max. :137000000
Inventory.Turnover.Rate..times. Fixed.Assets.Turnover.Frequency
Min. :0.000e+00 Min. :0.000e+00
1st Qu.:0.000e+00 1st Qu.:0.000e+00
Median :0.000e+00 Median :0.000e+00
Mean :1.987e+09 Mean :1.564e+09
3rd Qu.:3.900e+09 3rd Qu.:9.530e+08
Max. :9.990e+09 Max. :9.990e+09
Net.Worth.Turnover.Rate..times. Revenue.per.person
Min. :0.008871 Min. :0.000e+00
1st Qu.:0.021452 1st Qu.:0.000e+00
Median :0.029839 Median :0.000e+00
Mean :0.040590 Mean :1.675e+07
3rd Qu.:0.044355 3rd Qu.:0.000e+00
Max. :0.396129 Max. :7.050e+09
Operating.profit.per.person Allocation.rate.per.person
Min. :0.3023 Min. :0.000e+00
1st Qu.:0.3849 1st Qu.:0.000e+00
Median :0.3923 Median :0.000e+00
Mean :0.3905 Mean :2.793e+07
3rd Qu.:0.3978 3rd Qu.:0.000e+00
Max. :0.5165 Max. :8.280e+09
Working.Capital.to.Total.Assets Quick.Assets.Total.Assets
Min. :0.4942 Min. :0.01743
1st Qu.:0.7337 1st Qu.:0.17470
Median :0.7828 Median :0.33588
Mean :0.7830 Mean :0.35182
3rd Qu.:0.8312 3rd Qu.:0.49470
Max. :0.9623 Max. :0.88270
Current.Assets.Total.Assets Cash.Total.Assets Quick.Assets.Current.Liability
Min. :0.02083 Min. :0.0001842 Min. :0.0001435
1st Qu.:0.31889 1st Qu.:0.0174118 1st Qu.:0.0027748
Median :0.48552 Median :0.0443369 Median :0.0052089
Mean :0.49951 Mean :0.0864997 Mean :0.0083411
3rd Qu.:0.67877 3rd Qu.:0.1040967 3rd Qu.:0.0092043
Max. :0.99545 Max. :0.6582866 Max. :0.3251893
Cash.Current.Liability Current.Liability.to.Assets
Min. :0.000e+00 Min. :0.001481
1st Qu.:0.000e+00 1st Qu.:0.068199
Median :0.000e+00 Median :0.111588
Mean :1.442e+08 Mean :0.118628
3rd Qu.:0.000e+00 3rd Qu.:0.160469
Max. :9.010e+09 Max. :0.343143
Operating.Funds.to.Liability Inventory.Working.Capital
Min. :0.02627 Min. :0.0000
1st Qu.:0.33714 1st Qu.:0.2769
Median :0.34278 Median :0.2771
Mean :0.34602 Mean :0.2766
3rd Qu.:0.35059 3rd Qu.:0.2775
Max. :0.70850 Max. :0.3066
Inventory.Current.Liability Current.Liabilities.Liability
Min. :0.000e+00 Min. :0.04958
1st Qu.:0.000e+00 1st Qu.:0.64141
Median :0.000e+00 Median :0.80095
Mean :4.253e+07 Mean :0.75683
3rd Qu.:0.000e+00 3rd Qu.:0.91956
Max. :8.790e+09 Max. :1.00000
Working.Capital.Equity Current.Liabilities.Equity
Min. :0.0000 Min. :0.0000
1st Qu.:0.7308 1st Qu.:0.3289
Median :0.7347 Median :0.3324
Mean :0.7310 Mean :0.3372
3rd Qu.:0.7382 3rd Qu.:0.3372
Max. :0.8252 Max. :1.0000
Long.term.Liability.to.Current.Assets Retained.Earnings.to.Total.Assets
Min. :0.000e+00 Min. :0.7292
1st Qu.:0.000e+00 1st Qu.:0.9091
Median :0.000e+00 Median :0.9291
Mean :4.301e+07 Mean :0.9197
3rd Qu.:0.000e+00 3rd Qu.:0.9385
Max. :7.000e+09 Max. :0.9755
Total.income.Total.expense Total.expense.Assets Current.Asset.Turnover.Rate
Min. :0.0009712 Min. :0.003324 Min. :0.000e+00
1st Qu.:0.0020192 1st Qu.:0.016257 1st Qu.:0.000e+00
Median :0.0021953 Median :0.026571 Median :0.000e+00
Mean :0.0022202 Mean :0.039829 Mean :1.229e+09
3rd Qu.:0.0023434 3rd Qu.:0.044457 3rd Qu.:0.000e+00
Max. :0.0039097 Max. :0.463483 Max. :9.880e+09
Quick.Asset.Turnover.Rate Working.capitcal.Turnover.Rate Cash.Turnover.Rate
Min. :0.000e+00 Min. :0.5729 Min. :0.000e+00
1st Qu.:0.000e+00 1st Qu.:0.5939 1st Qu.:0.000e+00
Median :0.000e+00 Median :0.5939 Median :1.160e+09
Mean :2.379e+09 Mean :0.5941 Mean :2.305e+09
3rd Qu.:5.520e+09 3rd Qu.:0.5940 3rd Qu.:3.660e+09
Max. :9.980e+09 Max. :0.6742 Max. :9.940e+09
Cash.Flow.to.Sales Fixed.Assets.to.Assets Current.Liability.to.Liability
Min. :0.6706 Min. :0.000e+00 Min. :0.04958
1st Qu.:0.6716 1st Qu.:0.000e+00 1st Qu.:0.64141
Median :0.6716 Median :0.000e+00 Median :0.80095
Mean :0.6716 Mean :1.976e+07 Mean :0.75683
3rd Qu.:0.6716 3rd Qu.:0.000e+00 3rd Qu.:0.91956
Max. :0.6908 Max. :8.320e+09 Max. :1.00000
Current.Liability.to.Equity Equity.to.Long.term.Liability
Min. :0.0000 Min. :0.0000
1st Qu.:0.3289 1st Qu.:0.1109
Median :0.3324 Median :0.1142
Mean :0.3372 Mean :0.1228
3rd Qu.:0.3372 3rd Qu.:0.1208
Max. :1.0000 Max. :0.9221
Cash.Flow.to.Total.Assets Cash.Flow.to.Liability CFO.to.Assets
Min. :0.1677 Min. :0.07397 Min. :0.2270
1st Qu.:0.6285 1st Qu.:0.45693 1st Qu.:0.5432
Median :0.6410 Median :0.45897 Median :0.5747
Mean :0.6379 Mean :0.45776 Mean :0.5730
3rd Qu.:0.6511 3rd Qu.:0.46062 3rd Qu.:0.6041
Max. :0.8637 Max. :0.81312 Max. :0.9156
Cash.Flow.to.Equity Current.Liability.to.Current.Assets Liability.Assets.Flag
Min. :0.0000 Min. :0.0001224 Min. :0.00000
1st Qu.:0.3114 1st Qu.:0.0241626 1st Qu.:0.00000
Median :0.3142 Median :0.0366234 Median :0.00000
Mean :0.3130 Mean :0.0459983 Mean :0.01425
3rd Qu.:0.3163 3rd Qu.:0.0559658 3rd Qu.:0.00000
Max. :0.5692 Max. :0.4606748 Max. :1.00000
Net.Income.to.Total.Assets Total.assets.to.GNP.price No.credit.Interval
Min. :0.4118 Min. :0.000e+00 Min. :0.5283
1st Qu.:0.7536 1st Qu.:0.000e+00 1st Qu.:0.6232
Median :0.7907 Median :0.000e+00 Median :0.6237
Mean :0.7714 Mean :4.755e+07 Mean :0.6235
3rd Qu.:0.8097 3rd Qu.:0.000e+00 3rd Qu.:0.6241
Max. :0.8955 Max. :9.170e+09 Max. :0.6797
Gross.Profit.to.Sales Net.Income.to.Stockholder.s.Equity Liability.to.Equity
Min. :0.5329 Min. :0.0000 Min. :0.0000
1st Qu.:0.5966 1st Qu.:0.8356 1st Qu.:0.2781
Median :0.6016 Median :0.8398 Median :0.2817
Mean :0.6028 Mean :0.8330 Mean :0.2870
3rd Qu.:0.6091 3rd Qu.:0.8413 3rd Qu.:0.2867
Max. :0.6651 Max. :1.0000 Max. :1.0000
Degree.of.Financial.Leverage..DFL.
Min. :0.02076
1st Qu.:0.02663
Median :0.02678
Mean :0.02797
3rd Qu.:0.02684
Max. :0.26458
Interest.Coverage.Ratio..Interest.expense.to.EBIT. Net.Income.Flag
Min. :0.4493 Min. :1
1st Qu.:0.5644 1st Qu.:1
Median :0.5652 Median :1
Mean :0.5652 Mean :1
3rd Qu.:0.5655 3rd Qu.:1
Max. :0.7370 Max. :1
Equity.to.Liability
Min. :0.003946
1st Qu.:0.018048
Median :0.023393
Mean :0.036272
3rd Qu.:0.037762
Max. :1.000000
dim(mydata)
[1] 421 96
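# drop the class label (column 1) before running PCA
data=mydata[,-1]
# centred but not scaled: columns with values near 1e9 dominate the leading components
pca_model=prcomp(data,center = TRUE)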
summary(pca_model)
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 3.894e+09 3.266e+09 3.027e+09 2.904e+09 2.848e+09
Proportion of Variance 2.128e-01 1.497e-01 1.286e-01 1.184e-01 1.139e-01
Cumulative Proportion 2.128e-01 3.626e-01 4.912e-01 6.096e-01 7.235e-01
PC6 PC7 PC8 PC9 PC10
Standard deviation 2.562e+09 2.335e+09 2.262e+09 8.921e+08 6.049e+08
Proportion of Variance 9.210e-02 7.654e-02 7.179e-02 1.117e-02 5.140e-03
Cumulative Proportion 8.156e-01 8.921e-01 9.639e-01 9.751e-01 9.802e-01
PC11 PC12 PC13 PC14 PC15
Standard deviation 5.169e+08 4.904e+08 4.681e+08 4.504e+08 4.465e+08
Proportion of Variance 3.750e-03 3.380e-03 3.080e-03 2.850e-03 2.800e-03
Cumulative Proportion 9.840e-01 9.873e-01 9.904e-01 9.933e-01 9.961e-01
PC16 PC17 PC18 PC19 PC20 PC21
Standard deviation 3.475e+08 3.251e+08 1.673e+08 1.467e+08 6.088e+07 0.3201
Proportion of Variance 1.700e-03 1.480e-03 3.900e-04 3.000e-04 5.000e-05 0.0000
Cumulative Proportion 9.978e-01 9.992e-01 9.997e-01 1.000e+00 1.000e+00 1.0000
PC22 PC23 PC24 PC25 PC26 PC27 PC28
Standard deviation 0.2077 0.1742 0.1356 0.1218 0.116 0.09867 0.08805
Proportion of Variance 0.0000 0.0000 0.0000 0.0000 0.000 0.00000 0.00000
Cumulative Proportion 1.0000 1.0000 1.0000 1.0000 1.000 1.00000 1.00000
PC29 PC30 PC31 PC32 PC33 PC34 PC35
Standard deviation 0.07034 0.06137 0.05922 0.04598 0.04146 0.04027 0.03632
Proportion of Variance 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
Cumulative Proportion 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
PC36 PC37 PC38 PC39 PC40 PC41 PC42
Standard deviation 0.03256 0.02981 0.02834 0.0251 0.02276 0.01898 0.01756
Proportion of Variance 0.00000 0.00000 0.00000 0.0000 0.00000 0.00000 0.00000
Cumulative Proportion 1.00000 1.00000 1.00000 1.0000 1.00000 1.00000 1.00000
PC43 PC44 PC45 PC46 PC47 PC48 PC49
Standard deviation 0.01712 0.01545 0.01459 0.01377 0.01293 0.01266 0.0119
Proportion of Variance 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.0000
Cumulative Proportion 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.0000
PC50 PC51 PC52 PC53 PC54 PC55 PC56
Standard deviation 0.01117 0.0109 0.01032 0.01006 0.00938 0.009023 0.007936
Proportion of Variance 0.00000 0.0000 0.00000 0.00000 0.00000 0.000000 0.000000
Cumulative Proportion 1.00000 1.0000 1.00000 1.00000 1.00000 1.000000 1.000000
PC57 PC58 PC59 PC60 PC61 PC62
Standard deviation 0.007726 0.006835 0.0064 0.005821 0.005314 0.004968
Proportion of Variance 0.000000 0.000000 0.0000 0.000000 0.000000 0.000000
Cumulative Proportion 1.000000 1.000000 1.0000 1.000000 1.000000 1.000000
PC63 PC64 PC65 PC66 PC67 PC68
Standard deviation 0.004642 0.00429 0.003913 0.003407 0.002945 0.002702
Proportion of Variance 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000
Cumulative Proportion 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000
PC69 PC70 PC71 PC72 PC73 PC74
Standard deviation 0.002498 0.002372 0.001883 0.001725 0.001373 0.001053
Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Cumulative Proportion 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
PC75 PC76 PC77 PC78 PC79
Standard deviation 0.0009467 0.0008456 0.0007278 0.0006526 0.0005193
Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
Cumulative Proportion 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
PC80 PC81 PC82 PC83 PC84
Standard deviation 0.0004672 0.0004013 0.0003665 0.0002562 0.0001398
Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
Cumulative Proportion 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
PC85 PC86 PC87 PC88 PC89
Standard deviation 9.142e-05 7.56e-05 3.445e-05 2.573e-05 1.331e-06
Proportion of Variance 0.000e+00 0.00e+00 0.000e+00 0.000e+00 0.000e+00
Cumulative Proportion 1.000e+00 1.00e+00 1.000e+00 1.000e+00 1.000e+00
PC90 PC91 PC92 PC93 PC94
Standard deviation 2.734e-07 2.734e-07 2.734e-07 2.734e-07 2.734e-07
Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
PC95
Standard deviation 2.156e-07
Proportion of Variance 0.000e+00
Cumulative Proportion 1.000e+00
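The standard deviations above range from about 1e9 down to about 1e-7, which suggests that the unscaled PCA is driven almost entirely by the few columns measured on a 1e9 scale. One possible alternative, sketched here on simulated data, is to drop zero-variance columns (Net.Income.Flag is constant in the summary above, and `prcomp()` cannot rescale a constant column) and standardize the rest with `scale. = TRUE`. The toy data frame and its column names are assumptions for illustration only.

```r
set.seed(1)
n <- 200
toy <- data.frame(
  ratio_a = runif(n),            # ratio on a 0-1 scale
  ratio_b = runif(n),            # second ratio on a 0-1 scale
  big     = runif(n, 0, 1e9),    # column on a 1e9 scale
  flag    = rep(1, n)            # constant column
)

# Unscaled PCA: the first component is essentially the `big` column.
pca_raw   <- prcomp(toy, center = TRUE)
raw_share <- (pca_raw$sdev^2 / sum(pca_raw$sdev^2))[1]

# Drop constant columns, then standardize the remaining ones.
keep      <- vapply(toy, function(col) var(col) > 0, logical(1))
pca_std   <- prcomp(toy[, keep], center = TRUE, scale. = TRUE)
std_share <- (pca_std$sdev^2 / sum(pca_std$sdev^2))[1]

print(c(raw = raw_share, standardized = std_share))
```

After standardizing, no single column can account for nearly all of the variance simply because of its units.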
Next we draw the scree plot and the cumulative variance plot.
par(mfrow=c(1,2))
plot(pca_model,typ="l",main= "Screeplot",col="blue")
variance_plot=cumsum((pca_model$sdev^2)/sum(pca_model$sdev^2))
plot(variance_plot,type="l",main="Cumulative variance",xlab="No of principal components",ylab="Cumulative proportion of variance",col="blue")
abline(h=0.9,col="red")
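The first 7 PCs explain about 89% of the total variance (cumulative proportion 0.8921), just under the 0.9 reference line, so we reduce the dimension of the data to these 7 PCs and use them as the clustering variables.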
rotation_matrix=pca_model$rotation
reduced_rotation_matrix=rotation_matrix[,1:7]
final_data=as.matrix(data)%*%reduced_rotation_matrix
dim(final_data)
[1] 421 7
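The number of components can also be picked programmatically from the cumulative proportions. In the small sketch below, `cum_prop` copies the first ten cumulative proportions from the PCA summary above, and `threshold` is an assumed cutoff matching the red reference line in the plot.

```r
# First ten cumulative proportions, copied from the PCA summary above.
cum_prop <- c(0.2128, 0.3626, 0.4912, 0.6096, 0.7235,
              0.8156, 0.8921, 0.9639, 0.9751, 0.9802)

threshold <- 0.90                          # assumed cutoff
n_pc <- which(cum_prop >= threshold)[1]    # smallest count that reaches the cutoff
print(n_pc)
print(cum_prop[7])                         # share retained by the 7 components used here
```

With a strict 0.9 cutoff the rule selects 8 components; the 7 used here retain slightly less. For a fitted model, the centred scores of the first k components are also available directly as `pca_model$x[, 1:k]`; they differ from the uncentred product above only by a constant shift, which leaves Euclidean distances unchanged.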
Now we will apply different clustering techniques, beginning with K-means and choosing the number of clusters by the Dunn index.
set.seed(12)
dun_index=c()
for(i in 2:10){
obj=kmeans(x=final_data,i)
dun_index[i]=dunn(dist(final_data),as.vector(obj$cluster))
}
plot(dun_index,typ="b",xlab="No of Clusters",ylab="Dunn Index",col="darkviolet")
which.max(dun_index)
[1] 2
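For reference, the Dunn index is the smallest distance between two observations in different clusters divided by the largest distance between two observations in the same cluster, so larger values indicate compact, well-separated clusters. A base-R sketch of that definition follows; `dunn_index` is a hypothetical helper written for illustration and is not the clValid implementation.

```r
dunn_index <- function(x, clusters) {
  d <- as.matrix(dist(x))                    # pairwise Euclidean distances
  same <- outer(clusters, clusters, "==")    # TRUE where both points share a cluster
  min_between <- min(d[!same])               # closest pair across clusters
  max_within  <- max(d[same])                # widest cluster diameter
  min_between / max_within
}

# Two tight groups on a line: each diameter is 1, the gap between groups is 9.
x <- matrix(c(0, 1, 10, 11), ncol = 1)
good <- dunn_index(x, c(1, 1, 2, 2))   # groups match the gap
bad  <- dunn_index(x, c(1, 2, 1, 2))   # groups cut across the gap
print(c(good = good, bad = bad))
```

The grouping that respects the gap scores 9, while the one that mixes the two groups scores 0.1.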
#--optimal clustering
set.seed(12)
obj=kmeans(final_data,2)
cluster_vector=obj$cluster
#--class labels
class_vector=mydata$Bankrupt.
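#--Accuracy
# Cluster 1 is read as class 0 (non-bankrupt) and cluster 2 as class 1 (bankrupt).
True_Positive=sum(class_vector==0 & cluster_vector==1)
False_Negative=sum(class_vector==0 & cluster_vector!=1)
True_Negative=sum(class_vector==1 & cluster_vector==2)
False_Positive=sum(class_vector==1 & cluster_vector!=2)
# matrix() fills column by column, so this order puts TP and TN on the diagonal.
Confusion_mat=matrix(c(True_Positive,False_Negative,False_Positive,True_Negative),ncol=2)
Accuracy=(sum(diag(Confusion_mat))/sum(Confusion_mat))*100
Accuracy
The run reported here printed 47.981. That run ordered the cells TP, FN, TN, FP, so the diagonal added TP and FP and the figure is the share of observations in cluster 1 (202 of 421), not the accuracy.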
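Cluster numbers are arbitrary, so cluster 1 need not correspond to the non-bankrupt class. One way to avoid depending on the numbering in the two-cluster case is to cross-tabulate classes against clusters and score the better of the two possible label assignments. `matched_accuracy` below is a hypothetical helper, and the toy vectors are assumptions for illustration.

```r
# Hypothetical helper for the two-class, two-cluster case.
matched_accuracy <- function(classes, clusters) {
  tab <- table(classes, clusters)         # rows: true class, columns: cluster id
  stopifnot(all(dim(tab) == c(2, 2)))
  straight <- sum(diag(tab))              # cluster ids in the same order as classes
  swapped  <- sum(tab[cbind(1:2, 2:1)])   # cluster ids in the opposite order
  100 * max(straight, swapped) / sum(tab)
}

classes  <- c(0, 0, 0, 1, 1, 1)
clusters <- c(2, 2, 2, 1, 1, 2)   # ids flipped, last point in the wrong cluster
acc <- matched_accuracy(classes, clusters)
print(acc)
```

Five of the six toy points agree with their class once the ids are swapped, so the score is 5/6 of 100.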
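#--PAM: Dunn index for 2 to 8 clusters
set.seed(12)
# reset, so that the K-means values stored at positions 9 and 10 are not carried over
dun_index=c()
for(i in 2:8){
obj2=pam(final_data,i)
dun_index[i]=dunn(dist(final_data),obj2$clustering)
}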
obj2=pam(final_data,2)
plot(obj2,which=2,main="silhouette plot")
plot(obj2,which=1,main="")
obj2=pam(final_data,3)
plot(obj2,which=2,main="silhouette plot")
plot(obj2,which=1,main="")
plot(dun_index,typ="b",xlab="No of Clusters",ylab="Dunn Index",col="darkviolet")
From the silhouette plots, the largest average silhouette width, 0.5, is obtained with 2 clusters.
A value of 0.5 indicates that the clusters are moderately well separated.
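The average silhouette width can also be read from the fitted object instead of the plot. The sketch below uses simulated data and assumes the cluster package is installed; `pam()` stores the value in `silinfo$avg.width`. The toy matrix and the range of k are assumptions for illustration.

```r
library(cluster)

set.seed(1)
# Two well-separated simulated groups in two dimensions.
toy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2))

ks <- 2:5
# Average silhouette width of the PAM solution for each candidate k.
avg_width <- sapply(ks, function(k) pam(toy, k)$silinfo$avg.width)
names(avg_width) <- ks
best_k <- ks[which.max(avg_width)]
print(round(avg_width, 3))
print(best_k)
```

Comparing the widths across several values of k gives a numeric counterpart to inspecting the silhouette plots one at a time.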
The Dunn index is largest when the number of clusters is:
which.max(dun_index)
[1] 2
#--optimal clustering
set.seed(12)
obj2=pam(final_data,2)
cluster_vector=obj2$clustering
#--Accuracy
# Cluster 1 is read as class 0 (non-bankrupt) and cluster 2 as class 1 (bankrupt).
True_Positive=sum(class_vector==0 & cluster_vector==1)
False_Negative=sum(class_vector==0 & cluster_vector!=1)
True_Negative=sum(class_vector==1 & cluster_vector==2)
False_Positive=sum(class_vector==1 & cluster_vector!=2)
# matrix() fills column by column, so this order puts TP and TN on the diagonal.
Confusion_mat=matrix(c(True_Positive,False_Negative,False_Positive,True_Negative),ncol=2)
Accuracy=(sum(diag(Confusion_mat))/sum(Confusion_mat))*100
Accuracy
The run reported here printed 55.34442. That run ordered the cells TP, FN, TN, FP, so the figure is the share of observations in cluster 1 (233 of 421), not the accuracy.
par(mfrow=c(1,2))
#--AGNES with single linkage
obj3=agnes(final_data,method="single")
plot(obj3,which=2,main="Dendrogram of single linkage",col="darkblue")
dun_index=c()
for(i in 2:8){
clust=cutree(obj3,i)
dun_index[i]=dunn(dist(final_data),clust) }
plot(dun_index,typ="b",xlab="No of Clusters",ylab="Dunn Index",col="darkviolet")
which.max(dun_index)
[1] 2
#--optimal clustering
obj3=agnes(final_data,method = "single")
clust=cutree(obj3,2)
cluster_vector=clust
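#--Accuracy
# Cluster 1 is read as class 0 (non-bankrupt) and cluster 2 as class 1 (bankrupt).
True_Positive=sum(class_vector==0 & cluster_vector==1)
False_Negative=sum(class_vector==0 & cluster_vector!=1)
True_Negative=sum(class_vector==1 & cluster_vector==2)
False_Positive=sum(class_vector==1 & cluster_vector!=2)
# matrix() fills column by column, so this order puts TP and TN on the diagonal.
Confusion_mat=matrix(c(True_Positive,False_Negative,False_Positive,True_Negative),ncol=2)
Accuracy=(sum(diag(Confusion_mat))/sum(Confusion_mat))*100
Accuracy
The run reported here printed 99.76247. That run ordered the cells TP, FN, TN, FP, so the figure is the share of observations in cluster 1 (420 of 421), not the accuracy: single linkage placed all but one observation in the same cluster.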
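A figure this close to 100% is worth checking against the cluster sizes, because single linkage tends to split off isolated points rather than form balanced groups. The sketch below uses `hclust()` from base R on simulated data with one outlier; the data and object names are assumptions for illustration.

```r
set.seed(1)
# 50 simulated points in one cloud plus a single far-away outlier.
toy <- rbind(matrix(rnorm(100), ncol = 2), c(50, 50))

tree  <- hclust(dist(toy), method = "single")
sizes <- table(cutree(tree, k = 2))
print(sizes)

# Share of points in the larger cluster: high, although nothing was classified.
largest_share <- 100 * max(sizes) / sum(sizes)
print(largest_share)
```

Printing `table(cutree(obj3, 2))` for the fit above would show in the same way whether the two clusters are balanced or whether one of them holds a single observation.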
par(mfrow=c(1,2))
#--AGNES with complete linkage
obj3=agnes(final_data,method="complete")
plot(obj3,which=2,main="Dendrogram of complete linkage",col="darkblue")
dun_index=c()
for(i in 2:8){
clust=cutree(obj3,i)
dun_index[i]=dunn(dist(final_data),clust) }
plot(dun_index,typ="b",xlab="No of Clusters",ylab="Dunn Index",col="darkviolet")
which.max(dun_index)
[1] 2
#--optimal clustering
obj3=agnes(final_data,method = "complete")
clust=cutree(obj3,2)
cluster_vector=clust
#--Accuracy
# Cluster 1 is read as class 0 (non-bankrupt) and cluster 2 as class 1 (bankrupt).
True_Positive=sum(class_vector==0 & cluster_vector==1)
False_Negative=sum(class_vector==0 & cluster_vector!=1)
True_Negative=sum(class_vector==1 & cluster_vector==2)
False_Positive=sum(class_vector==1 & cluster_vector!=2)
# matrix() fills column by column, so this order puts TP and TN on the diagonal.
Confusion_mat=matrix(c(True_Positive,False_Negative,False_Positive,True_Negative),ncol=2)
Accuracy=(sum(diag(Confusion_mat))/sum(Confusion_mat))*100
Accuracy
The run reported here printed 60.09501. That run ordered the cells TP, FN, TN, FP, so the figure is the share of observations in cluster 1 (253 of 421), not the accuracy.
par(mfrow=c(1,2))
#--DIANA (divisive clustering)
obj5=diana(final_data)
plot(obj5,which=2,main="Dendrogram",col="darkblue")
dun_index=c()
for(i in 2:15){
clust=cutree(obj5,i)
dun_index[i]=dunn(dist(final_data),clust) }
plot(dun_index,typ="b",xlab="No of Clusters",ylab="Dunn Index",col="darkviolet")
which.max(dun_index)
[1] 11
This is quite different from the results obtained with the other methods.
Now we check how well the method separates the two classes, using a two-cluster cut so that the result is comparable with the other methods.
#--optimal clustering
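obj5=diana(final_data)
# cut the DIANA tree (obj5); obj3 holds the complete-linkage AGNES fit
clust=cutree(obj5,2)
cluster_vector=clust
#--Accuracy
# Cluster 1 is read as class 0 (non-bankrupt) and cluster 2 as class 1 (bankrupt).
True_Positive=sum(class_vector==0 & cluster_vector==1)
False_Negative=sum(class_vector==0 & cluster_vector!=1)
True_Negative=sum(class_vector==1 & cluster_vector==2)
False_Positive=sum(class_vector==1 & cluster_vector!=2)
# matrix() fills column by column, so this order puts TP and TN on the diagonal.
Confusion_mat=matrix(c(True_Positive,False_Negative,False_Positive,True_Negative),ncol=2)
Accuracy=(sum(diag(Confusion_mat))/sum(Confusion_mat))*100
Accuracy
The run reported here printed 60.09501, identical to the complete-linkage figure, because it cut obj3 instead of obj5. With the cells ordered TP, FN, TN, FP, that figure is also the share of observations in cluster 1 (253 of 421), not the accuracy.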
mat=kNN(final_data, k = 8)
k_dist=sort(mat$dist[,8])
plot(k_dist,typ="l",xlab="data points",ylab="k distances",main="k distance plot")
abline(h=0.5,col="red")
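The k-distance plot sorts each point's distance to its k-th nearest neighbour, and a bend in the curve is a common guide for choosing eps. The same quantity can be computed in base R, as sketched below on simulated data. `k_distance` is a hypothetical helper written for illustration, not part of the dbscan package.

```r
k_distance <- function(x, k) {
  d <- as.matrix(dist(x))
  # Row by row: sort the distances and skip the zero distance to the point itself.
  sort(apply(d, 1, function(row) sort(row)[k + 1]))
}

x <- matrix(c(0, 1, 2, 3, 20), ncol = 1)   # four close points and one isolated point
kd <- k_distance(x, k = 2)
print(unname(kd))
```

The four close points have second-neighbour distances of 1 or 2, while the isolated point jumps to 18; a jump of this kind is what the horizontal line in the plot is meant to locate.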
db_seq=seq(0.1,2.0,0.1)
dun_index=c()
for(i in 2:10){
obj6=dbscan(final_data,eps=db_seq[i],minPts=8)
dun_index[i]=dunn(dist(final_data),obj6$cluster) }
plot(dun_index,typ="b",xlab="Member of db_seq",ylab="Dunn Index",col="darkviolet")
which.max(dun_index)
[1] 5
The 5th member of db_seq gives the largest Dunn index, so we choose eps = 0.5.
db=dbscan(final_data,eps=db_seq[5],minPts=8)
db
DBSCAN clustering for 421 objects.
Parameters: eps = 0.5, minPts = 8
Using euclidean distances and borderpoints = TRUE
The clustering contains 2 cluster(s) and 194 noise points.
0 1 2
194 208 19
Available fields: cluster, eps, minPts, metric, borderPoints
clus_vec=db$cluster
index_remove=which(clus_vec==0)
cluster_vector=clus_vec[-index_remove]
class_vector=mydata[-index_remove,]$Bankrupt.
#--Accuracy (noise points excluded)
# Cluster 1 is read as class 0 (non-bankrupt) and cluster 2 as class 1 (bankrupt).
True_Positive=sum(class_vector==0 & cluster_vector==1)
False_Negative=sum(class_vector==0 & cluster_vector!=1)
True_Negative=sum(class_vector==1 & cluster_vector==2)
False_Positive=sum(class_vector==1 & cluster_vector!=2)
# matrix() fills column by column, so this order puts TP and TN on the diagonal.
Confusion_mat=matrix(c(True_Positive,False_Negative,False_Positive,True_Negative),ncol=2)
Accuracy=(sum(diag(Confusion_mat))/sum(Confusion_mat))*100
Accuracy
The run reported here printed 91.62996. That run ordered the cells TP, FN, TN, FP, so the figure is the share of non-noise observations in cluster 1 (208 of 227), not the accuracy.
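Because the noise points are removed before scoring, any figure computed at this step describes only the 227 clustered observations; 194 of the 421 (about 46%) are left unassigned. The sketch below reports that coverage from the cluster sizes printed above. The label vector is rebuilt from those counts, and the object names are assumptions for illustration.

```r
# Cluster labels rebuilt from the printed counts: 0 = noise, 1 and 2 = clusters.
clus_vec <- rep(c(0, 1, 2), times = c(194, 208, 19))

noise     <- clus_vec == 0
coverage  <- 100 * mean(!noise)                  # share of points given a cluster
share_one <- 100 * mean(clus_vec[!noise] == 1)   # share of clustered points in cluster 1

print(round(c(coverage = coverage, share_cluster_1 = share_one), 2))
```

The share of clustered points in cluster 1 is 208/227, which is the same 91.63 printed above, while only about 54% of the observations receive a cluster at all.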
dun_index=c()
for(i in 2:10){
obj6=optics(final_data,eps=db_seq[i],minPts=8)
dun_index[i]=dunn(dist(final_data),extractDBSCAN(obj6,db_seq[i])$cluster) }
plot(dun_index,typ="b",xlab="Member of db_seq",ylab="Dunn Index",col="darkviolet")
which.max(dun_index)
[1] 5
The 5th member of db_seq gives the largest Dunn index, so we choose eps = 0.5.
op=optics(final_data,eps=db_seq[5],minPts=8)
table(extractDBSCAN(op,db_seq[5])$cluster)
0 1 2
194 208 19
clus_vec=extractDBSCAN(op,db_seq[5])$cluster
index_remove=which(clus_vec==0)
cluster_vector=clus_vec[-index_remove]
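#--Accuracy (noise points excluded; class_vector is the subset built in the DBSCAN step)
# Cluster 1 is read as class 0 (non-bankrupt) and cluster 2 as class 1 (bankrupt).
True_Positive=sum(class_vector==0 & cluster_vector==1)
False_Negative=sum(class_vector==0 & cluster_vector!=1)
True_Negative=sum(class_vector==1 & cluster_vector==2)
False_Positive=sum(class_vector==1 & cluster_vector!=2)
# matrix() fills column by column, so this order puts TP and TN on the diagonal.
Confusion_mat=matrix(c(True_Positive,False_Negative,False_Positive,True_Negative),ncol=2)
Accuracy=(sum(diag(Confusion_mat))/sum(Confusion_mat))*100
Accuracy
The run reported here printed 91.62996. That run ordered the cells TP, FN, TN, FP, so the figure is the share of non-noise observations in cluster 1 (208 of 227), not the accuracy.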
plot(obj6,ylab="Reachability distance")
Clearly, there are 2 clusters here, with varying density.
The findings from the different methods are summarized in the table below.
| Method applied | Optimal no. of clusters | Reported accuracy (%) |
|---|---|---|
| K-Means | 2 | 47.98 |
| PAM | 2 | 55.34 |
| AGNES (single linkage) | 2 | 99.76 |
| AGNES (complete linkage) | 2 | 60.10 |
| DIANA | 11 | 60.10 |
| DBSCAN | 2 | 91.63 |
| OPTICS | 2 | 91.63 |
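A label-free way to compare a clustering with the class labels is the adjusted Rand index, which is 1 for identical partitions, close to 0 for chance-level agreement, and independent of how the clusters are numbered. A base-R sketch follows; `adjusted_rand` is a hypothetical helper written for illustration, and the toy vectors are assumptions.

```r
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  pairs <- function(n) n * (n - 1) / 2        # number of pairs, choose(n, 2)
  sum_cells <- sum(pairs(tab))                # pairs grouped together in both partitions
  sum_rows  <- sum(pairs(rowSums(tab)))
  sum_cols  <- sum(pairs(colSums(tab)))
  expected  <- sum_rows * sum_cols / pairs(sum(tab))
  maximum   <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

classes <- c(0, 0, 0, 1, 1, 1)
ari_same   <- adjusted_rand(classes, c(2, 2, 2, 1, 1, 1))  # same split, ids flipped
ari_lumped <- adjusted_rand(classes, c(1, 1, 1, 1, 1, 2))  # one big cluster plus one point
print(c(same = ari_same, lumped = ari_lumped))
```

The flipped but otherwise identical split scores 1, whereas lumping five of the six points into one cluster scores 0.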
In our analysis of bankruptcy data using clustering methods, K-means, PAM, both AGNES linkages, DBSCAN and OPTICS each pointed to two clusters, while the Dunn index for DIANA peaked at 11. The percentages in the table have to be read with care, however. In every accuracy step the confusion matrix was filled in the order TP, FN, TN, FP, so its diagonal adds TP and FP, and the reported figure is the share of observations assigned to cluster 1 rather than the share assigned to the correct class.
Seen that way, the 99.76% for single linkage corresponds to 420 of the 421 observations falling in one cluster, and the 91.63% for DBSCAN and OPTICS corresponds to 208 of the 227 non-noise points falling in cluster 1, after 194 points were set aside as noise. The DIANA figure equals the complete-linkage figure because the DIANA step cut the complete-linkage tree. These numbers therefore do not show that the clusters separate “bankrupt” from “non-bankrupt” entities.
Overall, the analysis shows how internal criteria such as the Dunn index and the silhouette width can be used to choose a clustering. The agreement between the clusters and bankruptcy status still has to be measured with a correctly built confusion matrix, or with a label-free measure such as the adjusted Rand index, before conclusions can be drawn about how effective these algorithms are for this problem or for similar data analysis scenarios.