I. Introduction

1. Data Overview and Research Context

The purpose of this project is to analyze the likelihood of a company going bankrupt based on financial ratios and related indicators. Examining key financial ratios shows how specific financial metrics influence a company’s probability of bankruptcy. The original data set contains financial information (96 variables) from thousands of companies in Taiwan and was designed for evaluating business performance indicators and the potential risk of bankruptcy. The data were collected from the Taiwan Economic Journal for the years 1999 to 2009, and company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange. The data set includes both companies that filed for bankruptcy and those that did not, which allows for a binary classification of financial health (https://archive.ics.uci.edu/dataset/572/taiwanese+bankruptcy+prediction).

After reviewing the data set for indicators likely to be related to bankruptcy, 6 variables were chosen to create a subset: Current Ratio, Quick Ratio, Cash Flow to Sales Ratio, Liability to Equity Ratio, Current Assets to Total Assets Ratio, and Bankrupt or Not. The question to investigate is: “Can decision trees be used to predict bankruptcy and to explain why one company may be more likely than another to declare bankruptcy?”

library(readr)
Dataset1 <- 
  read_csv("~/Math 248 Project/Sheet1.csv")
head(Dataset1, n=2)
## # A tibble: 2 × 6
##   Bankrupt `Current Ratio` `Quick Ratio` `Cash Flow to Sales`
##      <dbl>           <dbl>         <dbl>                <dbl>
## 1        1         0.00226       0.00121                0.672
## 2        1         0.00602       0.00404                0.672
## # ℹ 2 more variables: `Liability to Equity` <dbl>,
## #   `Current Assets/Total Assets` <dbl>

2. Descriptive Statistics

Current Ratio: This variable divides current assets by current liabilities and evaluates a company’s capability to meet its short-term debt obligations. Generally, a higher current ratio means that the company has more liquidity: a current ratio of 2 or higher is usually considered healthy, while a ratio below 1 may signal potential financial distress. This is a quantitative variable.

mean(Dataset1$`Current Ratio`)
## [1] 403285
median(Dataset1$`Current Ratio`)
## [1] 0.01058717
sd(Dataset1$`Current Ratio`)
## [1] 33302156

The mean current ratio is very large (403285), while the median is only about 0.0106, so the mean is driven by a small number of extreme values rather than by typical companies. More than half of the companies have a current ratio far below the mean. The standard deviation (33302156) is large relative to both the mean and the median, indicating a very wide spread in liquidity among the companies. In other words, most companies have very low values while a few are far higher, which might be related to bankruptcy status. Because the typical values are far below the conventional benchmarks of 1 and 2, the ratios in this data set do not appear to be on the usual scale, so those benchmarks should not be applied to them directly.

library(mosaic)
histogram(~`Current Ratio`, data = Dataset1, nint = 25)

Looking at the histogram, the current ratio is strongly right-skewed, which is a concern because a few extreme values dominate the scale of the plot.
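The effect of a few extreme values on the mean can be illustrated with a minimal sketch. The vector below is synthetic (it is not taken from the bankruptcy data) and only assumes a column with mostly small values plus one very large one; the median, the IQR, and a log10 transform are shown as more robust ways to summarize or plot such a variable.

```r
# Synthetic stand-in for a ratio column: mostly small values plus one extreme outlier.
ratio <- c(0.004, 0.007, 0.009, 0.011, 0.012, 0.015, 0.020, 2.75e9)

mean_ratio   <- mean(ratio)     # dominated by the single outlier
median_ratio <- median(ratio)   # barely affected by the outlier
iqr_ratio    <- IQR(ratio)      # robust measure of spread
log_ratio    <- log10(ratio)    # compresses the right tail before plotting

print(c(mean = mean_ratio, median = median_ratio, IQR = iqr_ratio))
```

A histogram of `log_ratio` rather than `ratio` would spread out the small values that are otherwise squeezed into a single bar.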

Quick Ratio: This variable measures a company’s capacity to pay its current liabilities without needing to sell its inventory or obtain additional financing. The quick ratio is a conservative measure of liquidity, as it excludes inventory and focuses on the most liquid assets. This is a quantitative variable.

mean(Dataset1$`Quick Ratio`)
## [1] 8376595
median(Dataset1$`Quick Ratio`)
## [1] 0.007412472
sd(Dataset1$`Quick Ratio`)
## [1] 244684748

The mean quick ratio is very high (8376595), which seems to indicate a calculation or data issue, especially given that the median is much lower (0.007). The median suggests that most companies have very little liquid cover for current liabilities once inventory is excluded, similar to the findings for the current ratio. The standard deviation (244684748) points to extreme outliers or inconsistencies in the data.

library(mosaic)
histogram(~`Quick Ratio`, data = Dataset1, nint = 25)

Looking at the histogram, the quick ratio is also a concern since it is not normally distributed. Most observations are so small relative to the few extreme values that the histogram cannot show how they vary.

Cash Flow to Sales: This variable is operating cash flow divided by sales revenue, which shows a company’s ability to generate cash from its sales. The two should move in parallel: if cash flow does not increase in line with sales, this is a cause for concern. Generally, the higher the ratio, the better. This is a quantitative variable.

mean(Dataset1$`Cash Flow to Sales`)
## [1] 0.6715308
median(Dataset1$`Cash Flow to Sales`)
## [1] 0.671574
sd(Dataset1$`Cash Flow to Sales`)
## [1] 0.009341346

The mean and median cash flow to sales ratios are nearly identical (both about 0.6715), suggesting a consistent ability to generate cash from sales across the dataset. The standard deviation is very small (0.009341346), indicating that most companies have similar cash flow efficiency relative to their sales. This may be a favorable sign, as it suggests that companies are converting sales into cash at similar rates.

library(mosaic)
histogram(~`Cash Flow to Sales`, data = Dataset1, nint = 25)

Looking at the histogram, cash flow to sales is highly concentrated around 0.6715, with very little variation. Most companies in the dataset have similar values for this ratio, with only a few outliers on either side. This suggests that the ratio is generally stable across companies.

Liability to Equity Ratio: This ratio is also known as the Debt to Equity (D/E) Ratio. It measures a company’s ability to meet financing obligations and the structure of its financing. If the ratio is high or increasing, the company is overly dependent on financing from creditors. The optimal D/E is about 1, where equity roughly equals liabilities. As a general rule, a D/E higher than 2 is considered unhealthy. This is a quantitative variable.

mean(Dataset1$`Liability to Equity`)
## [1] 0.2803652
median(Dataset1$`Liability to Equity`)
## [1] 0.2787776
sd(Dataset1$`Liability to Equity`)
## [1] 0.01446322

The mean (0.2804) and median (0.2788) suggest that, on average, companies have a moderate level of debt relative to equity, which is generally a healthy indicator of financial stability. The small SD (0.01446322) indicates that debt levels are very similar across companies.

library(mosaic)
histogram(~`Liability to Equity`, data = Dataset1, nint = 20)

Looking at the histogram, Liability to Equity is highly concentrated around 0.28, with very little variation. Most companies in the dataset have similar values for this ratio, with only a few outliers on either side. This suggests that the ratio is generally stable across companies.

Current Assets to Total Assets: This ratio is the share of a company’s total assets held as current assets. A higher ratio suggests greater liquidity, as a larger portion of the company’s assets is easily convertible to cash. This is generally favorable, especially for companies in industries that require high liquidity. This is a quantitative variable.

mean(Dataset1$`Current Assets/Total Assets`)
## [1] 0.5222734
median(Dataset1$`Current Assets/Total Assets`)
## [1] 0.5148298
sd(Dataset1$`Current Assets/Total Assets`)
## [1] 0.2181118

The mean (0.5222) indicates that, on average, approximately 52.22% of a company’s total assets are current assets, which suggests a reasonable level of liquidity. The median (0.5148) is slightly lower than the mean, which points to a mild right skew. The SD (0.2181) is relatively high, indicating variability in the asset structure across companies.

library(mosaic)
histogram(~`Current Assets/Total Assets`, data = Dataset1, nint = 25)

Looking at the histogram, most companies keep about 0.4-0.5 of their assets as current assets, as shown by the peak of the distribution. It is relatively rare to see companies with either very low or very high proportions of current assets.

Bankrupt or not: This is the target (outcome) variable, indicating whether a company has declared bankruptcy. It enables classification analysis to predict the probability of bankruptcy based on the other financial variables. It is a binary variable with two possible values: 1 indicates that the company has filed for bankruptcy, and 0 indicates that the company has not filed for bankruptcy and may still be healthy.

II. Methodology

1. Tree-based Methods

The tree-based methods for regression and classification (also known as CART) involve stratifying or segmenting the predictor space into a number of simple regions. Because the set of splitting rules used to segment the predictor space can be summarized in a tree, these approaches are known as decision-tree methods. Regression trees are used to predict a quantitative (continuous) response variable, and classification trees are used to predict a qualitative response.

2. Regression Tree

Notation for decision trees: Each split has the form \(X_j < t_k\) or \(X_j \geq t_k\), which divides the data into two branches. In the example of predicting baseball players’ salaries, the left-hand branch of the first split is Years < 4.5 and the right-hand branch is Years >= 4.5. In keeping with the tree analogy, the regions \(R_1\), \(R_2\), and \(R_3\) are known as terminal nodes (leaves), and decision trees are drawn upside down, with the leaves at the bottom. The points along the tree where the predictor space is split are internal nodes. This tree has two internal nodes and three terminal nodes, and the number in each leaf equals the mean of the response for the observations that fall there.

Interpretation of the baseball salary tree: For the Hitters data, we predict the log of a baseball player’s salary based on the number of Years he has played in the major leagues and the number of Hits he made in the previous year. Years is the most important factor in determining Salary, and players with less experience earn lower salaries than more experienced players. Given that a player is less experienced, the number of Hits made in the previous year seems to play little role in his Salary. Among more experienced players, however, the number of Hits does affect Salary, and players who made more Hits last year tend to have higher salaries. The tree is likely an over-simplification of the true relationship between Hits, Years, and Salary.

Regression tree-building process: We divide the predictor space, the set of possible values for \(X_1, X_2, \ldots, X_p\), into \(J\) distinct and non-overlapping regions \(R_1, R_2, \ldots, R_J\). The goal is to find boxes \(R_1, \ldots, R_J\) that minimize the RSS given by \(\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2\). Here each \(R_j\) is the set of observations that fall within the j-th box, and \(\hat{y}_{R_j}\) is the mean response for the training observations within the j-th box. Each box corresponds to one terminal leaf of the decision tree, where all observations in the same leaf are summarized by their average. Since it is computationally infeasible to consider every possible partition of the feature space into \(J\) boxes, we take a top-down, greedy approach known as recursive binary splitting.

The approach is top-down because it starts at the top of the tree, where all observations belong to a single region, and then successively splits the predictor space into two branches. It is greedy because at each step it picks the best split at that particular step, rather than looking ahead and picking a split that would lead to a better tree in some future step. Once the tree is built, predictions are made by passing observations down the tree, following each split until they reach a terminal node. The prediction is then the mean of the training observations in that region.
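A single greedy step of recursive binary splitting can be sketched in a few lines. The function `best_split()` and the toy data below are hypothetical illustrations, not the code used by the `tree` package: for one predictor, every midpoint between consecutive observed values is tried as a cutpoint, and the one with the smallest total RSS is kept.

```r
# Find the best cutpoint for one predictor by minimizing the total RSS of the two branches.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate cutpoints: midpoints between consecutive distinct values of x.
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  rss <- sapply(cuts, function(t) {
    left  <- y[x < t]
    right <- y[x >= t]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cut = cuts[which.min(rss)], rss = min(rss))
}

# Toy data (hypothetical): the response jumps once x passes 4.5.
years  <- c(1, 2, 3, 4, 5, 6, 7, 8)
salary <- c(5.0, 5.1, 5.2, 5.1, 6.4, 6.5, 6.6, 6.5)
split1 <- best_split(years, salary)
print(split1)
```

A full tree would repeat this search over every predictor, apply the winning split, and then recurse within each of the two resulting regions until a stopping rule is met.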

Tree Pruning:

The process described above may produce good predictions on the training set, but it is likely to overfit the data, which leads to poor test set performance. Why?

If the tree is grown too large, with each observation ending up in its own terminal node, the training error will be 0 because the model perfectly fits the training data. However, this leads to overfitting, meaning the model adapts too closely to the training data and performs poorly on new test data, resulting in high test error.

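One possible alternative is to grow the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold. This strategy is short-sighted, however, because a seemingly worthless split early on might be followed by a very good split later. A better strategy is to grow a very large tree \(T_0\) and then prune it back to obtain a subtree. Cost complexity pruning, also known as weakest link pruning, is used to do this. For each value of a nonnegative tuning parameter \(\alpha\) there corresponds a subtree \(T \subseteq T_0\) that makes \(\sum_{m=1}^{|T|} \sum_{i : x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|\) as small as possible, where \(|T|\) is the number of terminal nodes of \(T\). If \(\alpha = 0\), there is no pruning and \(T = T_0\). As \(\alpha\) increases, branches are pruned from the tree in a nested and predictable fashion, so the tree becomes smaller.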
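The trade-off controlled by \(\alpha\) can be seen in a small numerical sketch. The subtree sizes and RSS values below are made up for illustration; the helper `best_size()` is hypothetical and simply evaluates the penalized criterion RSS + \(\alpha |T|\) for each candidate subtree.

```r
# Hypothetical nested subtrees: number of terminal nodes and training RSS of each.
subtrees <- data.frame(size = c(1, 2, 4, 7),
                       rss  = c(100, 60, 45, 40))

# Return the size of the subtree that minimizes RSS + alpha * |T|.
best_size <- function(alpha) {
  cost <- subtrees$rss + alpha * subtrees$size
  subtrees$size[which.min(cost)]
}

sizes <- sapply(c(0, 5, 50), best_size)
print(sizes)
```

With no penalty the largest tree wins, and as the penalty grows the selected subtree shrinks, which is the behavior described above. In practice \(\alpha\) is chosen by cross-validation.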

3. Classification Tree

Definition of a classification tree: A classification tree is similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one. For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. We grow a classification tree the same way we grow a regression tree, but we cannot use RSS as the criterion for making splits. A natural alternative is the classification error rate.

Classification error rate: \(E = 1 - \max_k(\hat{p}_{mk})\), where \(\hat{p}_{mk}\) represents the proportion of training observations in the m-th region that are from the k-th class. Since the classification error rate is not sufficiently sensitive for tree growing, two other measures are preferred in practice: the Gini index and the deviance (cross-entropy). The Gini index is defined by \(G = \sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk})\), which takes on a small value if all of the \(\hat{p}_{mk}\) are close to zero or one. The Gini index is therefore a measure of node purity: a small value indicates that a node contains predominantly observations from a single class. An alternative to the Gini index is cross-entropy, \(D = -\sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk}\).
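The three measures can be compared on a few example nodes. The helper `node_impurity()` below is a hypothetical sketch that takes a vector of class proportions for one node; the proportions used are illustrative only.

```r
# Classification error, Gini index, and cross-entropy for one node's class proportions.
node_impurity <- function(p) {
  p <- p[p > 0]   # drop empty classes so that log(0) never appears
  c(error   = 1 - max(p),
    gini    = sum(p * (1 - p)),
    entropy = -sum(p * log(p)))
}

pure   <- node_impurity(c(1, 0))       # all observations in one class
skewed <- node_impurity(c(0.9, 0.1))   # mostly one class
mixed  <- node_impurity(c(0.5, 0.5))   # evenly mixed
print(rbind(pure, skewed, mixed))
```

All three measures are zero for a pure node and largest for the evenly mixed node, but the Gini index and cross-entropy respond more smoothly to changes in the proportions than the classification error does.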

III. Analyze data and conclusion for Bankruptcy Data

1. Bankruptcy Analysis using Logistic Regression Model

As a first approach, logistic regression is a natural model to fit. By transforming a linear combination of the financial indicators into a probability between 0 and 1, logistic regression models the binary outcome of bankruptcy directly, which makes it well suited to this research question and gives a baseline for understanding how the financial ratios relate to bankruptcy.

library(readr)
Dataset1 <- 
  read_csv("~/Math 248 Lab/Sheet2.csv")
head(Dataset1, n=2)
## # A tibble: 2 × 6
##   Bankrupt `Current Ratio` `Quick Ratio` `Cash Flow to Sales`
##      <dbl>           <dbl>         <dbl>                <dbl>
## 1        1         0.00226       0.00121                0.672
## 2        1         0.00602       0.00404                0.672
## # ℹ 2 more variables: `Liability to Equity` <dbl>,
## #   `Current Assets/Total Assets` <dbl>
  • Theoretical model: \(\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 \text{Current Ratio} + \beta_2 \text{Quick Ratio} + \beta_3 \text{Cash Flow to Sales} + \beta_4 \text{Liability to Equity} + \beta_5 \text{Current Assets/Total Assets}\), where \(\pi_i\) is the probability that company i is bankrupt.

Step 2: Fit the Model.

bankruptcy_model <- glm(Bankrupt ~ `Current Ratio` + `Quick Ratio` + 
                        `Cash Flow to Sales` + `Liability to Equity` + 
                        `Current Assets/Total Assets`, 
                        data = Dataset1, 
                        family = 'binomial')
summary(bankruptcy_model)
## 
## Call:
## glm(formula = Bankrupt ~ `Current Ratio` + `Quick Ratio` + `Cash Flow to Sales` + 
##     `Liability to Equity` + `Current Assets/Total Assets`, family = "binomial", 
##     data = Dataset1)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -1.597e+01  4.975e+00  -3.209  0.00133 ** 
## `Current Ratio`               -3.492e-09  1.181e-07  -0.030  0.97641    
## `Quick Ratio`                  2.525e-10  1.309e-10   1.930  0.05363 .  
## `Cash Flow to Sales`           9.775e-01  6.878e+00   0.142  0.88699    
## `Liability to Equity`          4.474e+01  6.468e+00   6.917 4.60e-12 ***
## `Current Assets/Total Assets` -1.412e+00  3.254e-01  -4.340 1.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1943.7  on 6818  degrees of freedom
## Residual deviance: 1847.2  on 6813  degrees of freedom
## AIC: 1859.2
## 
## Number of Fisher Scoring iterations: 11

=> \(\log\left(\frac{\pi_i}{1-\pi_i}\right) = -15.97 - (3.492 \times 10^{-9})\,\text{Current Ratio} + (2.525 \times 10^{-10})\,\text{Quick Ratio} + 0.9775\,\text{Cash Flow to Sales} + 44.74\,\text{Liability to Equity} - 1.412\,\text{Current Assets/Total Assets}\).

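To interpret this, the intercept of \(-15.97\) is the estimated log-odds of bankruptcy when all the predictor variables are zero. The coefficients for the Current Ratio and Quick Ratio are extremely small compared with those for Liability to Equity, Cash Flow to Sales, and Current Assets/Total Assets, so the latter three indicators might play a larger role in predicting bankruptcy. This comparison is only tentative, because the size of a coefficient also depends on the scale of its predictor, and the Current Ratio and Quick Ratio contain values in the millions.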
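The fitted coefficients can be turned into more tangible quantities. The sketch below uses the coefficients rounded from the model summary and, as an illustration, the predictor values of the first company in the data set; the vector names are hypothetical. It computes the odds multiplier for a 0.01 increase in Liability to Equity and the fitted bankruptcy probability for that company.

```r
# Coefficients rounded from the summary() output of the fitted model.
b <- c(intercept = -15.97, current = -3.492e-09, quick = 2.525e-10,
       cash_flow = 0.9775, liab_equity = 44.74, ca_ta = -1.412)

# Odds multiplier for a 0.01 increase in Liability to Equity, other ratios held fixed.
odds_ratio <- exp(b["liab_equity"] * 0.01)

# Leading 1 for the intercept, then the five ratios of the first company in the data.
x1 <- c(1, 0.002258963, 0.001207755, 0.6715677, 0.2902019, 0.1906430)
log_odds <- sum(b * x1)
prob <- plogis(log_odds)   # inverse logit: exp(eta) / (1 + exp(eta))

print(c(odds_ratio = unname(odds_ratio), log_odds = log_odds, prob = prob))
```

Under these rounded coefficients, a 0.01 increase in Liability to Equity multiplies the odds of bankruptcy by roughly 1.56, and the first company, which did go bankrupt, receives a fitted probability of only about 7%.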

Step 3: Assess the Model.

summary(bankruptcy_model)
## 
## Call:
## glm(formula = Bankrupt ~ `Current Ratio` + `Quick Ratio` + `Cash Flow to Sales` + 
##     `Liability to Equity` + `Current Assets/Total Assets`, family = "binomial", 
##     data = Dataset1)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -1.597e+01  4.975e+00  -3.209  0.00133 ** 
## `Current Ratio`               -3.492e-09  1.181e-07  -0.030  0.97641    
## `Quick Ratio`                  2.525e-10  1.309e-10   1.930  0.05363 .  
## `Cash Flow to Sales`           9.775e-01  6.878e+00   0.142  0.88699    
## `Liability to Equity`          4.474e+01  6.468e+00   6.917 4.60e-12 ***
## `Current Assets/Total Assets` -1.412e+00  3.254e-01  -4.340 1.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1943.7  on 6818  degrees of freedom
## Residual deviance: 1847.2  on 6813  degrees of freedom
## AIC: 1859.2
## 
## Number of Fisher Scoring iterations: 11
library(car)
library(carData)
vif(bankruptcy_model)
##               `Current Ratio`                 `Quick Ratio` 
##                      1.000000                      1.004618 
##          `Cash Flow to Sales`         `Liability to Equity` 
##                      1.000200                      1.013338 
## `Current Assets/Total Assets` 
##                      1.017723

The logistic regression model for predicting bankruptcy shows that the Liability to Equity and Current Assets/Total Assets ratios are highly significant predictors. The Current Ratio and Cash Flow to Sales are not statistically significant, and the Quick Ratio is only marginally so (p = 0.054), suggesting they do not have a strong effect on bankruptcy prediction. The residual deviance (\(1847.2\)) is only modestly lower than the null deviance (\(1943.7\)), so the predictors explain a limited share of the variation; the AIC of \(1859.2\) is mainly useful for comparing this model with alternatives. The VIF values are close to \(1\), suggesting no multicollinearity.

Step 4: Use the Model.

Dataset1$predicted_prob <- predict(bankruptcy_model, type = "response")

However, applying the logistic regression model for predicting bankruptcy has a limitation. The primary issue is that the threshold for classifying a company as “bankrupt” or “not bankrupt” is not specified, and an arbitrary number cannot simply be used. Without a justified threshold value, we cannot confidently classify observations. Moreover, only about 3% of the companies in the data are bankrupt, so the predicted probabilities are mostly small, and a poorly chosen threshold could lead to incorrect decisions. Therefore, until a suitable threshold is determined based on further analysis or domain knowledge, we cannot use this model effectively for classification.
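One way to explore the threshold question is to tabulate sensitivity and specificity at several candidate cutoffs. The sketch below is illustrative only: the probabilities and labels are made up, and `classify_at()` is a hypothetical helper, not part of the analysis above.

```r
# Classify at a given cutoff and report sensitivity and specificity.
classify_at <- function(prob, actual, cutoff) {
  pred <- as.integer(prob >= cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}

# Hypothetical predicted probabilities and true labels (1 = bankrupt).
prob   <- c(0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.10, 0.20, 0.30, 0.60)
actual <- c(0,    0,    0,    0,    1,    0,    0,    1,    0,    1)

results <- t(sapply(c(0.5, 0.1, 0.05), classify_at, prob = prob, actual = actual))
print(results)
```

Lowering the cutoff catches more bankrupt companies at the cost of more false alarms. With the real data, the same table could be built from `Dataset1$predicted_prob` and `Dataset1$Bankrupt`, and the cutoff chosen according to how costly each type of error is.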

2. Bankruptcy Analysis using Decision Tree Method

library(tree)
Dataset1 <- read.csv("~/Math 248 Project/Sheet1.csv")
Dataset1$Bankrupt <- factor(Dataset1$Bankrupt, levels = c(0, 1), 
                            labels = c("Not Bankrupt", "Bankrupt"))
head(Dataset1, n=2)
##   Bankrupt Current.Ratio Quick.Ratio Cash.Flow.to.Sales Liability.to.Equity
## 1 Bankrupt   0.002258963 0.001207755          0.6715677           0.2902019
## 2 Bankrupt   0.006016206 0.004039367          0.6715699           0.2838460
##   Current.Assets.Total.Assets
## 1                   0.1906430
## 2                   0.1824191

The unpruned tree:

tree.Dataset1 <- tree (Bankrupt ~ . , Dataset1)
summary(tree.Dataset1)
## 
## Classification tree:
## tree(formula = Bankrupt ~ ., data = Dataset1)
## Variables actually used in tree construction:
## [1] "Liability.to.Equity" "Quick.Ratio"         "Current.Ratio"      
## [4] "Cash.Flow.to.Sales" 
## Number of terminal nodes:  7 
## Residual mean deviance:  0.2059 = 1402 / 6812 
## Misclassification error rate: 0.03153 = 215 / 6819

The unpruned decision tree has 7 terminal nodes and uses only four of the five financial indicators in the dataset: Liability to Equity, Quick Ratio, Current Ratio, and Cash Flow to Sales. In other words, the tree identifies these as the important financial ratios, and Current Assets/Total Assets does not enter the model. The training misclassification error rate is low at 3.15%, which suggests the tree might have decent predictive power. However, only about 3.2% of the companies are bankrupt, so a rule that labels every company “Not Bankrupt” would have almost the same error rate.

# Fit the classification tree on the full data set
# Using all variables except 'Bankrupt' as predictors
tree.Dataset1 <- tree(Bankrupt ~ ., Dataset1)

# Plot the initial tree
plot(tree.Dataset1)
text(tree.Dataset1, pretty = 0)

tree.Dataset1
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 6819 1944.000 Not Bankrupt ( 0.967737 0.032263 )  
##    2) Liability.to.Equity < 0.282745 5598  690.700 Not Bankrupt ( 0.988746 0.011254 )  
##      4) Quick.Ratio < 0.00394985 679  264.000 Not Bankrupt ( 0.951399 0.048601 )  
##        8) Liability.to.Equity < 0.26225 7    5.742 Bankrupt ( 0.142857 0.857143 ) *
##        9) Liability.to.Equity > 0.26225 672  226.500 Not Bankrupt ( 0.959821 0.040179 ) *
##      5) Quick.Ratio > 0.00394985 4919  365.800 Not Bankrupt ( 0.993901 0.006099 ) *
##    3) Liability.to.Equity > 0.282745 1221  937.000 Not Bankrupt ( 0.871417 0.128583 )  
##      6) Quick.Ratio < 0.0050085 769  741.700 Not Bankrupt ( 0.812744 0.187256 )  
##       12) Current.Ratio < 0.00429328 162  211.300 Not Bankrupt ( 0.641975 0.358025 ) *
##       13) Current.Ratio > 0.00429328 607  495.300 Not Bankrupt ( 0.858320 0.141680 )  
##         26) Cash.Flow.to.Sales < 0.671573 319  319.800 Not Bankrupt ( 0.799373 0.200627 ) *
##         27) Cash.Flow.to.Sales > 0.671573 288  155.400 Not Bankrupt ( 0.923611 0.076389 ) *
##      7) Quick.Ratio > 0.0050085 452  117.900 Not Bankrupt ( 0.971239 0.028761 ) *

The first split variable is Liability to Equity, which appears to be the most critical factor for distinguishing between bankrupt and non-bankrupt companies. Among the larger groups, companies with Liability to Equity > 0.28, Quick Ratio < 0.005, and Current Ratio < 0.004 (node 12, 162 companies) have the highest predicted bankruptcy risk, at 35.80%. The only terminal node that actually predicts “Bankrupt” is node 8 (Liability to Equity < 0.262 and Quick Ratio < 0.0039), where 85.71% of companies are bankrupt, but it contains only 7 companies. This demonstrates how multiple financial indicators, when combined, can provide a more nuanced risk assessment than looking at individual metrics in isolation.
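The printed tree can be cross-checked against the summary by rebuilding the leaf counts by hand. The data frame below copies the size and bankruptcy proportion of each terminal node from the output above; the object names are hypothetical.

```r
# Terminal nodes from the printed tree: node id, size, and proportion bankrupt.
leaves <- data.frame(
  node       = c(8, 9, 5, 12, 26, 27, 7),
  n          = c(7, 672, 4919, 162, 319, 288, 452),
  p_bankrupt = c(0.857143, 0.040179, 0.006099, 0.358025,
                 0.200627, 0.076389, 0.028761)
)
leaves$bankrupt <- round(leaves$n * leaves$p_bankrupt)

# A leaf predicts "Bankrupt" only when more than half of its companies are bankrupt;
# the misclassified companies are those in the minority class of each leaf.
errors <- sum(ifelse(leaves$p_bankrupt > 0.5,
                     leaves$n - leaves$bankrupt,
                     leaves$bankrupt))
print(leaves)
print(c(total = sum(leaves$n), bankrupt = sum(leaves$bankrupt), errors = errors))
```

The leaves account for all 6819 companies, and the error count reproduces the 215 misclassifications reported by `summary()`. Only one of those errors comes from the single leaf that predicts “Bankrupt”; the rest are bankrupt companies sitting in leaves labelled “Not Bankrupt”.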

The cross-validation error for different sizes of the pruned tree:

set.seed(567)

# Split the data into training and test sets
train <- sample(1:nrow(Dataset1), 0.7 * nrow(Dataset1))
Dataset1.test <- Dataset1[-train, ]
High.test <- Dataset1$Bankrupt[-train]

tree.Dataset1 <- tree(Bankrupt ~ ., Dataset1, subset = train)
tree.pred <- predict(tree.Dataset1, Dataset1.test, type="class")
table(tree.pred, High.test)
##               High.test
## tree.pred      Not Bankrupt Bankrupt
##   Not Bankrupt         1982       61
##   Bankrupt                1        2
# Calculate accuracy
accuracy <- (1982 + 2) / (1982 + 61 + 1 + 2)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 96.97 %"

This approach leads to correct predictions for around 96.97% of the companies in the test data set. However, the tree correctly flags only 2 of the 63 bankrupt companies in the test set, so the high accuracy mostly reflects the large share of non-bankrupt companies. To evaluate the performance of the classification tree more fully, we next consider whether pruning the tree leads to a better level of tree complexity and easier interpretation.
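Accuracy alone can be misleading when one class is rare, so it helps to break the test-set confusion matrix into per-class rates. The sketch below re-enters the four counts from the table above by hand; the object names are hypothetical.

```r
# Test-set confusion matrix copied from the table above (rows = predicted, columns = actual).
conf <- matrix(c(1982, 1, 61, 2), nrow = 2,
               dimnames = list(predicted = c("Not Bankrupt", "Bankrupt"),
                               actual    = c("Not Bankrupt", "Bankrupt")))

accuracy    <- sum(diag(conf)) / sum(conf)
sensitivity <- conf["Bankrupt", "Bankrupt"] / sum(conf[, "Bankrupt"])
specificity <- conf["Not Bankrupt", "Not Bankrupt"] / sum(conf[, "Not Bankrupt"])
# Accuracy of a rule that labels every company "Not Bankrupt".
baseline    <- sum(conf[, "Not Bankrupt"]) / sum(conf)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, baseline = baseline), 4))
```

The tree's accuracy is only slightly above the all-“Not Bankrupt” baseline, and its sensitivity is about 3%, so most bankrupt companies in the test set are missed.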

# Perform cross-validation to determine optimal tree complexity
set.seed(7)
cv.Dataset1 <- cv.tree(tree.Dataset1, FUN = prune.misclass)
names(cv.Dataset1)
## [1] "size"   "dev"    "k"      "method"
cv.Dataset1
## $size
## [1] 7 4 1
## 
## $dev
## [1] 160 160 161
## 
## $k
## [1] -Inf    0    1
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
# Plot cross-validation results
par(mfrow = c(1, 2))
plot(cv.Dataset1$size, cv.Dataset1$dev, type = "b",
     xlab = "Tree Size", ylab = "Cross-validation Errors")
plot(cv.Dataset1$k, cv.Dataset1$dev, type = "b",
    xlab = "Cost-Complexity Parameter", ylab = "Cross-validation Errors")

The dev value corresponds to the number of cross-validation errors. The tree with 4 terminal nodes has the same number of cross-validation errors (160) as the tree with 7 terminal nodes, while the single-node tree has 161. Since the tree with 4 terminal nodes is simpler and easier to interpret, it offers the best balance between model complexity and predictive performance. Note that the difference between 160 and 161 errors is very small, so pruning improves interpretability more than accuracy.
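The choice of tree size can also be made programmatically instead of by reading the output. The sketch below re-enters the `size` and `dev` values printed above and applies one simple rule, the smallest tree among those with the lowest cross-validation error; this rule is an assumption for illustration, and other rules are possible.

```r
# Values copied from the cv.tree() output above.
cv <- list(size = c(7, 4, 1), dev = c(160, 160, 161))

# Among the sizes tied for the lowest cross-validation error, keep the smallest tree.
best <- min(cv$size[cv$dev == min(cv$dev)])
print(best)
```

In the analysis itself, the same expression applied to `cv.Dataset1` would give the value to pass as `best` to `prune.misclass()`.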

The pruned tree:

prune.Dataset1 <- prune.misclass(tree.Dataset1, best=4)
plot (prune.Dataset1)
text(prune.Dataset1, pretty=0)

For the pruned tree, the Liability to Equity ratio is the most important factor in determining bankruptcy risk. Companies with a Liability to Equity ratio below 0.267304 are more likely to be classified as “Not Bankrupt”. For companies above this threshold, a Quick Ratio below 0.00431417 indicates a higher probability of bankruptcy. This suggests that high Liability to Equity combined with low liquidity might be the key driver of bankruptcy risk according to the model. While this is a simplified view, the pruned tree provides a quick way to understand the critical financial factors associated with bankruptcy.

Conclusion

The decision tree uses only 4 key predictors: Liability-to-Equity, Quick Ratio, Current Ratio, and Cash Flow-to-Sales. The full tree has 7 terminal nodes, indicating a moderately complex model. The top split separates companies based on Liability-to-Equity, with a threshold of 0.282745. Companies with a lower Liability-to-Equity ratio are mostly classified as Not Bankrupt; they are split next on Quick Ratio at a threshold of 0.00394985. For companies with higher Liability-to-Equity, the next split is also on Quick Ratio, with a threshold of 0.0050085, and those with a higher Quick Ratio are classified as Not Bankrupt. The remaining splits further refine the classification using the Current Ratio and Cash Flow-to-Sales ratio.

Model Assumption

  • The tree assumes that the dataset can be divided into smaller subsets based on the input features.
  • Splits are chosen to maximize the reduction in node impurity (for example, information gain), so that each split improves the clarity of the classification or regression.
  • Each split creates exactly two subsets, simplifying the structure and ensuring clear decision paths.
  • Features may be categorical or continuous; continuous features, such as the financial ratios used here, are split at a threshold rather than being discretized beforehand.
  • The model recursively distributes records based on feature values to construct the tree structure.
  • Attributes are selected as root or internal nodes based on statistical criteria like information gain or Gini index.

IV. Discussion and critique

1. The need for the decision tree method instead of the linear models from class

In this bankruptcy dataset, the values of several predictors are extremely small for most companies and are not normally distributed. Moreover, the relationships between the financial ratios and bankruptcy appear to be non-linear and to involve interactions, which may not be captured well by linear or logistic regression. Decision trees are a better option when the relationships in the data are non-linear and complex. However, since decision trees tend to overfit, especially with smaller datasets, it might be necessary to consider ensemble methods like bagging, boosting, or random forests to improve accuracy.

2. Strengths

  • The decision tree approach effectively captured the nonlinear and complex relationships between the financial ratios and bankruptcy risk.
  • The model provided a clear and interpretable representation of the key decision points, making it easier to understand.
  • The low training misclassification rate of the full tree (3.15%, or 96.85% accuracy) and the 96.97% accuracy on the test set suggest good predictive power, although these figures are close to the share of non-bankrupt companies in the data (about 96.8%).

3. Improvements

  • The data used in this analysis was limited to a specific time period and geographical region (Taiwan), which limits the generalizability of the findings.
  • Decision trees can be susceptible to overfitting, especially when the tree grows too complex. Techniques like pruning and ensemble methods (including bagging, boosting, and random forests) could be used to address this issue.
  • Choosing other predictors among the 96 variables might produce a better model for predicting bankruptcy.

V. Citation