This project focuses on solving a bankruptcy prediction problem using data analysis and machine learning techniques. The dataset used in this project is sourced from the Taiwan Economic Journal and is publicly available on Kaggle. It consists of 6,819 observations of 96 variables, mainly financial coefficients, covering the period from 1999 to 2009. The dataset specifically focuses on companies listed on the Taiwan Stock Exchange, with bankruptcy defined according to the exchange’s business regulations.
Predicting bankruptcy requires a substantial amount of structured data, which can be challenging to obtain due to the scarcity of bankrupt companies compared to healthy ones. Additionally, determining the exact timing of a company’s bankruptcy is difficult, making it even more challenging to predict financial distress based on potentially outdated financial data. Moreover, various factors such as company-specific characteristics (e.g., size, shareholder structure, liabilities and equity structure), sector-specific factors (e.g., industry growth, dependence on suppliers, sales sector type), local factors (country and region-specific), and global factors (economic cycles) further complicate the prediction process.
In most jurisdictions, bankruptcy is defined as the inability of an entity to pay its debts on time (exceeding legal limits of payment delays) or as over-indebtedness (liabilities to assets ratio exceeding legal limits) over a specific period. Insolvency laws aim to rehabilitate such entities, allowing them to repay their creditors and prevent further debt accumulation that could lead to a cascade of bankruptcies. Therefore, having an early warning mechanism that detects and prevents financial distress and bankruptcy is crucial. While legal criteria are commonly used for this purpose, there is significant potential for data-driven econometric models to serve as more effective early warning mechanisms. Identifying financial distress and potential bankruptcy at an early stage benefits the entity, its creditors, and the overall economy.
The following code presents the packages utilized throughout the analysis, with comments indicating their main purposes. Not all listed packages made it into the final version of this project, as some functions did not yield the expected results and were subsequently excluded.
#PACKAGES
# List of packages
packages <-
c(
#working with data
"tidyverse", # working with data (inc. ggplot2, dplyr, tidyr etc.)
"psych", #useful inter alia for descriptive statistics and data preparation
"DescTools", #descriptive statistics
"formattable", #formatting text functions
"DataExplorer", #exploratory data analysis
"skimr", #exploratory data analysis
"missForest", #impute NA using RF
#visualization
"GGally", #ggplot extension (eg. ggpairs function)
"plotly", #interactive visualization
"gridExtra", #grid upgrade for ggplot
"cowplot", #another grid upgrade for ggplot...
"ggExtra", #marginal plots
"ggfortify", #objects from other packages in ggplot (autoplot function)
#correlation
"corrplot", #advanced correlation matrix ordering and visualization
"ggcorrplot", #correlation in ggplot
"lares", #advanced correlation for big data sets
#clustering
"factoextra", #clustering
"flexclust", #clustering
"cluster", #clustering
"dbscan", #dbscan and KNN
"ClusterR",
#regression and classification
"tidymodels", #many useful functions and models
"car", #regression
"MASS", #linear regression
"glmnet", #advanced glm models
"lme4", #mixed-effects models
"caret", #advanced regression and classification
"rpart", #decision trees
"rpart.plot", #decision trees visualization
"randomForest", #random forests
"xgboost", #XGboost
"ROCR", #visualization and model evaluation
"performanceEstimation", #extension of other models (eg. smote)
"infotheo", #information theory functions
"naivebayes", #naive Bayes classifier
"e1071", #support vector machine and other models
"DALEX",
"smacof",
"FSelector",
"subselect"
)
# Install and load packages
for (pkg in packages) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg)
library(pkg, character.only = TRUE)
}
}
This section focuses on preparing the data for subsequent stages of the analysis, such as feature selection, data exploration, and prediction using supervised learning models. The data undergoes various procedures to ensure its suitability for these tasks, including renaming variables, conducting exploratory data analysis (EDA), calculating descriptive statistics, removing redundant variables, addressing outliers and performing data transformation. The objective of data preparation is to strike a balance between eliminating irregularities by transforming the data while preserving crucial information contained within the variables.
The first step in data preparation involves cleaning the data. This includes procedures such as renaming variables, conducting preliminary exploratory data analysis (EDA), calculating descriptive statistics, and removing clearly redundant variables.
options(scipen = 999)
#setting the seed
set.seed(999)
#loading data
data_loaded <- read_csv("data.csv")
#clearing variable names from special signs and spaces
names <- names(data_loaded)
names <- gsub("\\([^)]+\\)", "", names)
names <- gsub(" ", "_", names)
names <- gsub("-", "_", names)
names <- gsub("%$", "", names)
names <- gsub("%", "", names)
names <- gsub("/", "_to_", names)
names <- gsub("_$", "", names)
names <- gsub("'", "", names)
names[17] <- "Net_Value_Per_Share_B"
names[18] <- "Net_Value_Per_Share_A"
names[19] <- "Net_Value_Per_Share_C"
names[1] <- "Bankrupt"
#loading new names
colnames(data_loaded) <- c(names)
#removing problematic variables (one has sd=0)
data_loaded <-
data_loaded[, !names(data_loaded) %in% c("Liability_Assets_Flag", "Net_Income_Flag")]
#loading new names of the variables and coding variables to make their names shorter
names <- names(data_loaded)
original_names <- names
new_names <- c("B", paste0("V", 1:93))
names <-
as.data.frame(cbind(as.vector(original_names), as.vector(new_names)))
colnames(names) <- c("original", "new")
rm(original_names, new_names)
#object names is a dictionary containing old and new variable names
#below two functions were created to facilitate translation of variable names
#renaming function 1
rename_original_to_new <- function(data) {
for (i in seq_along(names(data))) {
var_name <- names(data)[i]
replacement <- names$new[match(var_name, names$original)]
if (!is.na(replacement)) {
names(data)[i] <- replacement
}
}
return(data)
}
#renaming function 2
rename_new_to_original <- function(data) {
for (i in seq_along(names(data))) {
var_name <- names(data)[i]
replacement <- names$original[match(var_name, names$new)]
if (!is.na(replacement)) {
names(data)[i] <- replacement
}
}
return(data)
}
#descriptive statistics
#desc_stats <- describe(data_loaded)
#dividing data set
#target_variable
bankrupt <- data_loaded[, 1]
bankrupt$Bankrupt<-as.factor(bankrupt$Bankrupt)
#variables only
data_var <- data_loaded[, -1] %>% rename_original_to_new()
#Healthy and bankrupt companies
healthy_only <- cbind(bankrupt, data_var) %>% filter(Bankrupt==0) %>% dplyr::select(-Bankrupt) %>% rename_original_to_new()
bankrupt_only <- cbind(bankrupt, data_var) %>% filter(Bankrupt==1) %>% dplyr::select(-Bankrupt) %>% rename_original_to_new()
#function for ploting boxplots (divides variables into quarters and uses 2x2 grid)
boxplots <- function(data) {
#dividing variables into parts to better visualize them
number<- ncol(data)
quarter1 <- seq(1, floor(number * 0.25))
quarter2 <- seq(floor(number * 0.25) + 1, floor(number * 0.5))
quarter3 <- seq(floor(number * 0.5) + 1, floor(number * 0.75))
quarter4 <- seq(floor(number * 0.75) + 1, number)
quarters<-list(quarter1,quarter2,quarter3,quarter4)
boxplots_list <- list()
for (i in 1:4) {
boxplots_list[[i]]<-data %>% dplyr::select(quarters[[i]]) %>%
mutate(across(everything(), scale)) %>%
gather(key = "Variable", value = "Value") %>%
mutate(Variable = factor(Variable, levels = rev(factor(names(data)[quarters[[i]]])))) %>%
ggplot(aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip()
}
Result<-grid.arrange(boxplots_list[[1]], boxplots_list[[2]], boxplots_list[[3]],
boxplots_list[[4]], nrow = 2, ncol = 2)[2]
return(Result)
}
## Warning: attributes are not identical across measure variables; they will be dropped
## attributes are not identical across measure variables; they will be dropped
## attributes are not identical across measure variables; they will be dropped
## attributes are not identical across measure variables; they will be dropped
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 3 (1-1,1-1) arrange gtable[layout]
## 2 4 (1-1,2-2) arrange gtable[layout]
In this step, outliers were identified using position parameters and subsequently imputed using the K-Nearest Neighbors (KNN) and Random Forest (RF) algorithms. KNN operates locally, whereas RF takes a top-to-bottom approach - both consider relationships among observations and features to determine missing values, which makes them superior to simple parametric imputation methods like mean and median winsorization, as they preserve information while addressing outliers and missing values effectively.
#Outliers removal using boxplot stats
outliers_boxplot <- function(x) {
stats <- boxplot.stats(x)
lower_fence <- stats$stats[1] - 1.5* IQR(x)
upper_fence <- stats$stats[5] + 1.5* IQR(x)
x[x < lower_fence | x > upper_fence] <- NA
return(x)
}
#Outliers removal using zscore
outliers_zscore <- function(x) {
z_scores <- abs(scale(x))
threshold <- 3
x[z_scores > threshold] <- NA
return(x)
}
#detecting outliers (boxplot)
data_var_NA1<-data_var %>%
mutate(across(everything(), outliers_boxplot)) %>% as.data.frame()
#identified outliers acroaa all variables
DataExplorer::introduce(data_var_NA1)[,6]
## [1] 25422
#imputing data with RF
#data_var_RFmodel <- missForest(as.matrix(data_var_NA), 2, 5)
#data_var_RF<-as.data.frame(data_var_RFmodel$ximp)
data_var_imp1<-ImputeKnn(data_var_NA1, k=10)
#create_report(scale(data_var_imp1))
#boxplots(data_var_imp1)
data_var_NA2<- data_var_imp1 %>%
mutate(across(everything(), outliers_zscore)) %>% as.data.frame()
DataExplorer::introduce(data_var_NA2)[,6]
## [1] 8337
data_var_imp2<-ImputeKnn(data_var_NA2, k=10)
#boxplots(data_var_imp2)
#create_report(scale(data_var_imp2))
#data to be used in supervised learning
data_ready<-data_var_imp2
boxplots(data_ready)
## Warning: attributes are not identical across measure variables; they will be dropped
## attributes are not identical across measure variables; they will be dropped
## attributes are not identical across measure variables; they will be dropped
## attributes are not identical across measure variables; they will be dropped
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 3 (1-1,1-1) arrange gtable[layout]
## 2 4 (1-1,2-2) arrange gtable[layout]
The elimination of outliers has yielded satisfactory results, as it removed exceptional observations while retaining information about extreme values of variables. The presence of a large number of extreme values, skewness and heteroscedasticity can introduce bias in some algorithms (especially those used in unsupervised learning). To mitigate this issue, a Box-Cox transformation was employed. This transformation is preferred over the logistic transformation as its parameters can be optimized to achieve the best results. Notably, when the lambda value is set to 0, the Box-Cox transformation reduces to the logistic transformation. This choice helps address the problem of bias and enhances the suitability of the data for further analysis.
data_USL_NA1<-data_var %>%
mutate(across(everything(), outliers_boxplot)) %>% as.data.frame()
data_USL_imp1<-missForest(as.matrix(data_USL_NA1), 2, 5)
data_USL_imp1<-as.data.frame(data_USL_imp1$ximp)
#Box-Cox Transformation
data_BoxCox1<-as.data.frame(apply(data_var_imp2, 2, function(x) BoxCox(x, BoxCoxLambda(x))))
data_USL_NA2<- data_BoxCox1 %>%
mutate(across(everything(), outliers_zscore)) %>% as.data.frame()
data_USL_imp2<-missForest(as.matrix(data_USL_NA2), 2, 5)
data_USL_imp2<-as.data.frame(data_USL_imp2$ximp)
#scaled and transformed data without outliers for further stages of analysis
data_USL<-as.data.frame(scale(data_USL_imp2))
boxplots(data_USL)
## Warning: attributes are not identical across measure variables; they will be dropped
## attributes are not identical across measure variables; they will be dropped
## attributes are not identical across measure variables; they will be dropped
## attributes are not identical across measure variables; they will be dropped
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 3 (1-1,1-1) arrange gtable[layout]
## 2 4 (1-1,2-2) arrange gtable[layout]
The objective of this section is to determine the most important variables and eliminate redundant features. This process enhances the effectiveness of models and enables the utilization of algorithms that would otherwise require excessive computing power. Various criteria are employed to select features, including correlation with target variables, correlation with other variables, variance, mutual information (information theory), statistical significance in prediction models (p-value), accuracy and gini purity (decision trees) among others.
#the following function will be used to score and select variables
#it uses chi2 (Cramer's V, type of correlation), accuracy and gini purity as indicators
#for each critirion 30% of the best variables are selected
#other possible criteria include:
#linear correlation with target variable (Pearson, Spearman, Kanndall)
#entropy based meassures:
#information gain
#information gain ratio
#symmetrical uncertainty
#measures based on decision rules
#relief algorithm
#all of which can be easily included by slightly modifying the score_var function
#they were excluded in this particular project becouse results they provided
#did not effectively differentiate the variables to serve as a reliable basis for feature selection
score_var <- function(input_data, target_var, top) {
data <- cbind(target_var, as.data.frame(scale(input_data)))
x <- names(target_var)
formula <- as.simple.formula(names(input_data), x)
#linear correlation based selection
#corr_score <- linear.correlation(formula, data)
#chi score (Cramer's V), similar
chi_score <- chi.squared(formula, data)
# entropy based scores
# info_gain_score <- information.gain(formula, data)
# info_gain_ratio_score <- gain.ratio(formula, data)
# sym_uncert_score <- symmetrical.uncertainty(formula, data)
# simple decision rules score
# decision_score <- oneR(formula, data)
# random.forest.importance
rf_acc_score <- random.forest.importance(formula, data, importance.type = 1)
rf_purity_score <-random.forest.importance(formula, data, importance.type = 2)
# relief
#relief_score <- relief(formula, data)
chi_var<-cutoff.k.percent(chi_score,top)
rf_acc_var_var<-cutoff.k.percent(rf_acc_score,top)
rf_purity_var<-cutoff.k.percent(rf_purity_score,top)
Results<-unique(c(chi_var,rf_acc_var_var,rf_purity_var))
return(Results)
}
During this step, highly correlated variable pairs were identified and examined. To reduce redundancy, the variable with a higher mean correlation with other variables was eliminated. This process was iteratively applied, updating the comparison variables after each elimination.
high_corr<-findCorrelation(cor(data_ready), cutoff = 0.90, exact= TRUE)
data_to_group_variables<-data_USL[,-high_corr]
high_corr
## [1] 85 2 89 3 19 43 8 7 22 34 58 37 38 4 88 90 61 16 17 66 78 26 83 77 92
Hierarchical clustering was utilized to group variables into manageable subgroups. The number of clusters was determined based on correlation analysis, ensuring that correlated variables were grouped together. This approach facilitated a clearer understanding of variable relationships and patterns, enhancing the organization and analysis of the dataset.
bankruptYN<- bankrupt %>% mutate(Bankrupt=ifelse( Bankrupt ==1, "Y", "N"))
grouping_variables<- function (data, num_clusters,result,singles) {
corr_plot_list<- list()
corr_list<-list()
var_list<-list()
cor_matrix <- cor(data)
dissimilarity <- 1 - abs(cor_matrix)
hc <- hclust(as.dist(dissimilarity))
cluster_labels <- cutree(hc, k = num_clusters)
correlated_groups <- split(names(as.data.frame(cor_matrix)), cluster_labels)
for (i in seq(1, num_clusters)) {
if(singles==1){
plot <- ggcorrplot(
cor(data[correlated_groups[[i]]]),
hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 1.5,
digits = 2
)}
else {
plot <- ggcorrplot(
cor(data[correlated_groups[[i]]]),
type = "full",
lab = TRUE,
lab_size = 1.5,
digits = 2
)}
cor_m<-cor(data[correlated_groups[[i]]])
vars<-correlated_groups[[i]]
corr_plot_list[[i]]<- plot
corr_list[[i]]<-cor_m
var_list[[i]]<-vars
}
if(result==1){return(corr_plot_list)}
if(result==2){return(corr_list)}
if(result==3){return(var_list)}
}
var_groups_plots<-grouping_variables(data_to_group_variables,6,1,1)
var_groups_matrix<-grouping_variables(data_to_group_variables,6,2,1)
var_groups_lists<-grouping_variables(data_to_group_variables,6,3,1)
var_groups_plots
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
After hierarchical clustering, features were divided into six clusters based on their dominant characteristics: Profitability Metrics, Operational Efficiency Metrics, Cash Flow Metrics, Financial Health Metrics, Growth and Performance Metrics, and Capital and Asset Structure Metrics. Within each group, a feature selection algorithm was applied, utilizing Cramer’s V, random forest accuracy and purity, to identify the most informative features.
# Group One: Profitability Metrics
group1<-data_ready[var_groups_lists[[1]]]
names(rename_new_to_original(group1))
## [1] "ROA_before_interest_and_depreciation_before_interest"
## [2] "Operating_Profit_Rate"
## [3] "Non_industry_income_and_expenditure_to_revenue"
## [4] "Continuous_interest_rate"
## [5] "Tax_rate"
## [6] "Net_Value_Per_Share_C"
## [7] "Per_Share_Net_profit_before_tax"
## [8] "Realized_Sales_Gross_Profit_Growth_Rate"
## [9] "Operating_Profit_Growth_Rate"
## [10] "Regular_Net_Profit_Growth_Rate"
## [11] "Continuous_Net_Profit_Growth_Rate"
## [12] "Total_Asset_Growth_Rate"
## [13] "Net_Value_Growth_Rate"
## [14] "Total_Asset_Return_Growth_Rate_Ratio"
## [15] "Operating_profit_to_Paid_in_capital"
## [16] "Operating_profit_per_person"
## [17] "Retained_Earnings_to_Total_Assets"
## [18] "Total_income_to_Total_expense"
#Group Two: Operational Efficiency Metrics
group2<-data_ready[var_groups_lists[[2]]]
names(rename_new_to_original(group2))
## [1] "Realized_Sales_Gross_Margin" "Operating_Expense_Rate"
## [3] "Contingent_liabilities_to_Net_worth" "Inventory_Turnover_Rate"
## [5] "Revenue_per_person" "Inventory_to_Current_Liability"
## [7] "Total_expense_to_Assets" "Total_assets_to_GNP_price"
#Group Three: Financial Health Metrics
group3<-data_ready[var_groups_lists[[3]]]
names(rename_new_to_original(group3))
## [1] "Research_and_development_expense_rate"
## [2] "Interest_bearing_debt_interest_rate"
## [3] "Interest_Expense_Ratio"
## [4] "Total_debt_to_Total_net_worth"
## [5] "Borrowing_dependency"
## [6] "Inventory_and_accounts_receivable_to_Net_value"
## [7] "Current_Liability_to_Assets"
## [8] "Inventory_to_Working_Capital"
## [9] "Degree_of_Financial_Leverage"
## [10] "Equity_to_Liability"
#Group Four: Cash Flow Metrics
group4<-data_ready[var_groups_lists[[4]]]
names(rename_new_to_original(group4))
## [1] "Cash_flow_rate" "Cash_Flow_Per_Share"
## [3] "Cash_Reinvestment" "Cash_to_Total_Assets"
## [5] "Cash_to_Current_Liability" "Cash_Turnover_Rate"
## [7] "Cash_Flow_to_Sales" "Cash_Flow_to_Total_Assets"
## [9] "Cash_Flow_to_Liability" "CFO_to_Assets"
#Group Five: Performance Metrics
group5<-data_ready[var_groups_lists[[5]]]
names(rename_new_to_original(group5))
## [1] "Revenue_Per_Share" "Total_Asset_Turnover"
## [3] "Accounts_Receivable_Turnover" "Average_Collection_Days"
## [5] "Net_Worth_Turnover_Rate" "Current_Asset_Turnover_Rate"
## [7] "Quick_Asset_Turnover_Rate"
#Group Six: Structure Metrics
group6<-data_ready[var_groups_lists[[6]]]
names(rename_new_to_original(group6))
## [1] "Current_Ratio"
## [2] "Long_term_fund_suitability_ratio"
## [3] "Fixed_Assets_Turnover_Frequency"
## [4] "Allocation_rate_per_person"
## [5] "Working_Capital_to_Total_Assets"
## [6] "Quick_Assets_to_Total_Assets"
## [7] "Current_Assets_to_Total_Assets"
## [8] "Current_Liabilities_to_Liability"
## [9] "Working_Capital_to_Equity"
## [10] "Long_term_Liability_to_Current_Assets"
## [11] "Working_capitcal_Turnover_Rate"
## [12] "Fixed_Assets_to_Assets"
## [13] "Equity_to_Long_term_Liability"
## [14] "Current_Liability_to_Current_Assets"
## [15] "No_credit_Interval"
data_whole<-cbind(bankrupt,as.data.frame(scale(data_ready)))
data_long <- data_whole %>%
pivot_longer(cols=-Bankrupt,names_to = "Variable", values_to = "Value")
grouped_vars <- data_long %>%
mutate(Metric=case_when(Variable %in% var_groups_lists[[1]] ~ "Profitability Metrics",
Variable %in% var_groups_lists[[2]] ~ "Operational Efficiency Metrics",
Variable %in% var_groups_lists[[3]] ~ "Financial Health Metrics",
Variable %in% var_groups_lists[[4]] ~ "Cash Flow Metrics",
Variable %in% var_groups_lists[[5]] ~ "Performance Metrics",
Variable %in% var_groups_lists[[6]] ~ "Structure Metrics"
),
Bankrupt=if_else(Bankrupt==0,"Healthy Companies", "Bankrupt Companies")
)
p1 <- grouped_vars %>% filter(Metric=="Profitability Metrics") %>% ggplot(aes(x = Value, fill=Bankrupt))+
geom_density(alpha=0.7,color=NA) +
facet_wrap(vars(Variable)) +
ggtitle("Profitability Metrics")+
theme(title =element_text(size=8),
plot.title = element_text(hjust = 0.5),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position="none")
p2 <- grouped_vars %>% filter(Metric=="Operational Efficiency Metrics") %>% ggplot(aes(x = Value, fill=Bankrupt))+
geom_density(alpha=0.7,color=NA) +
facet_wrap(vars(Variable)) +
ggtitle("Operational Efficiency Metrics")+
theme(title =element_text(size=8),
plot.title = element_text(hjust = 0.5),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position="none")
p3 <- grouped_vars %>% filter(Metric=="Financial Health Metrics") %>% ggplot(aes(x = Value, fill=Bankrupt))+
geom_density(alpha=0.7,color=NA) +
facet_wrap(vars(Variable)) +
ggtitle("Financial Health Metrics")+
theme(title =element_text(size=8),
plot.title = element_text(hjust = 0.5),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position="none")
p4 <- grouped_vars %>% filter(Metric=="Cash Flow Metrics") %>% ggplot(aes(x = Value, fill=Bankrupt))+
geom_density(alpha=0.7,color=NA) +
facet_wrap(vars(Variable)) +
ggtitle("Cash Flow Metrics")+
theme(title =element_text(size=8),
plot.title = element_text(hjust = 0.5),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position="none")
p5 <- grouped_vars %>% filter(Metric=="Performance Metrics") %>% ggplot(aes(x = Value, fill=Bankrupt))+
geom_density(alpha=0.7,color=NA) +
facet_wrap(vars(Variable)) +
ggtitle("Performance Metrics")+
theme(title =element_text(size=8),
plot.title = element_text(hjust = 0.5),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position="none")
p6 <- grouped_vars %>% filter(Metric=="Structure Metrics") %>% ggplot(aes(x = Value, fill=Bankrupt))+
geom_density(alpha=0.7,color=NA) +
facet_wrap(vars(Variable)) +
ggtitle("Structure Metrics")+
theme(title =element_text(size=8),
plot.title = element_text(hjust = 0.5),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position="none")
p1
p2
p3
p4
p5
p6
group1_var<-score_var(group1,bankruptYN,0.3)
group2_var<-score_var(group2,bankruptYN,0.2)
group3_var<-score_var(group3,bankruptYN,0.2)
group4_var<-score_var(group4,bankruptYN,0.2)
group5_var<-score_var(group5,bankruptYN,0.2)
group6_var<-score_var(group6,bankruptYN,0.2)
group1_var
## [1] "V30" "V23" "V69" "V18" "V10" "V42" "V52" "V31" "V6"
group2_var
## [1] "V5" "V70"
group3_var
## [1] "V93" "V40" "V35" "V91"
group4_var
## [1] "V59" "V57" "V32"
group5_var
## [1] "V45" "V50"
group6_var
## [1] "V54" "V33" "V73" "V56" "V65" "V55"
#data with reduced number of features
data_SF<-data_ready %>%
dplyr::select(all_of(sort(c(group6_var,group5_var,group4_var,group3_var,group2_var,group1_var))))
#create_report(data_SF)
The attempt to cluster bankrupt companies did not provide sufficient insights to reach decisive conclusions.
#data
BSF<-cbind(bankrupt,data_SF)
BSF<-BSF %>% filter(Bankrupt==1)
BSF<-BSF[,-1]
BSF<-as.data.frame(scale(BSF))
# fviz_nbclust(BSF,FUNcluster=kmeans,method = "silhouette")
# fviz_nbclust(BSF,FUNcluster=kmeans,method = "wss")
# fviz_nbclust(BSF,FUNcluster=kmeans,method = "gap_stat")
km1<-eclust(BSF, "kmeans", hc_metric="euclidean",k=3)
clust<-as.data.frame(as.factor(km1$cluster))
colnames(clust)<-c("cluster")
BSF_clust<-cbind(BSF,clust)
BSF_clust_long<-BSF_clust %>%
pivot_longer(cols=-cluster,names_to = "Variable", values_to = "Value") %>%
mutate(Metric=case_when(Variable %in% group1_var ~ "Profitability Metrics",
Variable %in% group2_var ~ "Operational Efficiency Metrics",
Variable %in% group3_var ~ "Cash Flow Metrics",
Variable %in% group4_var ~ "Financial Health Metrics",
Variable %in% group5_var ~ "Performance Metrics",
Variable %in% group6_var ~ "Structure Metrics"
))
g1<-BSF_clust_long %>% filter(Metric=="Profitability Metrics") %>%
ggplot(aes(y = Value, x = Variable,fill=cluster)) +
geom_boxplot() +
coord_flip() +
ggtitle("Profitability Metrics")
g2<-BSF_clust_long %>% filter(Metric=="Operational Efficiency Metrics") %>%
ggplot(aes(y = Value, x = Variable,fill=cluster)) +
geom_boxplot() +
coord_flip() +
ggtitle("Operational Efficiency Metrics")
g3<-BSF_clust_long %>% filter(Metric=="Financial Health Metrics") %>%
ggplot(aes(y = Value, x = Variable,fill=cluster)) +
geom_boxplot() +
coord_flip() +
ggtitle("Financial Health Metrics")
g4<-BSF_clust_long %>% filter(Metric=="Cash Flow Metrics") %>%
ggplot(aes(y = Value, x = Variable,fill=cluster)) +
geom_boxplot() +
coord_flip() +
ggtitle("Cash Flow Metrics")
g5<-BSF_clust_long %>% filter(Metric=="Performance Metrics") %>%
ggplot(aes(y = Value, x = Variable,fill=cluster)) +
geom_boxplot() +
coord_flip() +
ggtitle("Performance Metrics")
g6<-BSF_clust_long %>% filter(Metric=="Structure Metrics") %>%
ggplot(aes(y = Value, x = Variable,fill=cluster)) +
geom_boxplot() +
coord_flip() +
ggtitle("Structure Metrics")
g1
g2
g3
g4
g5
g6
In this section, we will develope models to accurately classify companies as either bankrupt or non-bankrupt based on previously selected features.
When predicting bankruptcy, the scarcity of data on bankrupt companies creates a class imbalance issue, which can make training models challenging. To address this problem, we will utilize the SMOTE (Synthetic Minority Over-sampling Technique) algorithm.
bankrupt$Bankrupt<-as.factor(bankrupt$Bankrupt)
data_model<-cbind(bankrupt,data_SF)
#dividing data into test and train sets
index <- (runif(nrow(data_model)) < 0.3)
test_set <- data_model[index, ]
train_set1 <- data_model[!index, ]
# summary(as.factor(test_set$Bankrupt))
summary(as.factor(train_set1$Bankrupt))
## 0 1
## 4630 161
#SMOTE - balancing
smote_data1 <- smote(Bankrupt ~ ., train_set1, perc.under =2, perc.over = 11)
summary(as.factor(smote_data1$Bankrupt))
## 0 1
## 3542 1932
train_set<-smote_data1
This section presents and evaluates five bankruptcy prediction models: two logistic regression models, two decision tree models, and one random forest model.
#Logistic Models
log1<-glm(Bankrupt ~ .,
data = train_set,
family = binomial(link="logit"))
log2<-stepAIC(log1, trace = FALSE)
#Decision Tree
tree1 <- rpart(Bankrupt ~ .,
data = train_set,
method = "class")
rpart.plot(tree1)
tree2 <- rpart(
Bankrupt ~ .,
data = train_set,
method = "class",
control = list(cp = 0.005)
)
rpart.plot(tree2)
# tree1$variable.importance
# tree2$variable.importance
#Random Forest
rf1 <- randomForest(Bankrupt ~ .,
data = train_set,
ntree = 1000)
#rf1
varImpPlot(rf1)
# rf1$importance
#Predictions
pred_log1 <- ifelse(predict(log1, test_set, type = "response") >= 0.5, 1, 0)
pred_log2 <- ifelse(predict(log2, test_set, type = "response") >= 0.5, 1, 0)
pred_tree1 <- predict(tree1, test_set, type = "class")
pred_tree2 <- predict(tree2, test_set, type = "class")
pred_rf1 <- predict(rf1, test_set, type = "class")
#Confusion Matrix
CM_log1<-confusionMatrix(as.factor(pred_log1),as.factor(test_set$Bankrupt))
CM_log2<-confusionMatrix(as.factor(pred_log2),as.factor(test_set$Bankrupt))
CM_tree1<-confusionMatrix(as.factor(pred_tree1), as.factor(test_set$Bankrupt))
CM_tree2<-confusionMatrix(as.factor(pred_tree2),as.factor(test_set$Bankrupt))
CM_rf<-confusionMatrix(as.factor(pred_rf1), as.factor(test_set$Bankrupt))
CM <- list()
CM[["log1"]] <- table(pred_log1, test_set$Bankrupt)
CM[["log2"]] <- table(pred_log2, test_set$Bankrupt)
CM[["tree1"]] <- table(pred_tree1, test_set$Bankrupt)
CM[["tree2"]] <- table(pred_tree2, test_set$Bankrupt)
CM[["rf1"]] <- table(pred_rf1, test_set$Bankrupt)
EvaluateModel <- function(classif_mx){
true_positive <- classif_mx[2, 2]
true_negative <- classif_mx[1, 1]
condition_positive <- sum(classif_mx[ , 2])
condition_negative <- sum(classif_mx[ , 1])
predicted_positive <- sum(classif_mx[2, ])
predicted_negative <- sum(classif_mx[1, ])
accuracy <- (true_positive + true_negative) / sum(classif_mx)
MER <- 1 - accuracy # Misclassification Error Rate (false_positive + false_positive) / sum(classif_mx)
precision <- true_positive / predicted_positive
sensitivity <- true_positive / condition_positive # Recall / True Positive Rate (TPR)
specificity <- true_negative / condition_negative
F1 <- (2 * precision * sensitivity) / (precision + sensitivity)
return(list(accuracy = accuracy,
MER = MER,
precision = precision,
sensitivity = sensitivity,
specificity = specificity,
F1 = F1))
}
sapply(CM, EvaluateModel)
## log1 log2 tree1 tree2 rf1
## accuracy 0.9018738 0.9003945 0.8989152 0.8915187 0.9452663
## MER 0.09812623 0.09960552 0.1010848 0.1084813 0.05473373
## precision 0.1929825 0.1904762 0.1983471 0.1740891 0.2868852
## sensitivity 0.7457627 0.7457627 0.8135593 0.7288136 0.5932203
## specificity 0.9065515 0.9050279 0.9014728 0.8963941 0.9558151
## F1 0.3066202 0.3034483 0.3189369 0.2810458 0.3867403
This section presents and evaluates three penalized regression models used to predict bunkruptcy: Ridge model, Lasso model and elastic-net model.
x<-as.data.frame(scale(data_ready))
y<-as.factor(bankrupt$Bankrupt)
new_data<-as.data.frame(cbind(y,x))
division <- (runif(nrow(new_data)) < 0.3)
test_set2 <- new_data[division, ]
train_set2 <- new_data[!division, ]
smote_data2 <- smote(y ~ ., train_set2, perc.under =2, perc.over = 11)
x.train <-as.matrix(smote_data2[,-1])
y.train <-as.factor(smote_data2[,1])
row.names(x.train) <- NULL
row.names(y.train) <- NULL
x.test <- as.matrix(test_set2[,-1 ])
y.test <- as.factor(test_set2[,1 ])
#RIDGE
alpha0.fit <- cv.glmnet(x.train, y.train, type.measure="auc",
alpha=0, family="binomial")
alpha0.predicted <-
predict(alpha0.fit, s=alpha0.fit$lambda.1se, newx=x.test)
#LASSO
alpha1.fit <- cv.glmnet(x.train, y.train, type.measure="auc",
alpha=1, family="binomial")
alpha1.predicted <-
predict(alpha1.fit, s=alpha1.fit$lambda.1se, newx=x.test)
#ELASTIC-NET
alpha0.5.fit <- cv.glmnet(x.train, y.train, type.measure="auc",
alpha=0.5, family="binomial")
alpha0.5.predicted <-
predict(alpha0.fit, s=alpha0.5.fit$lambda.1se, newx=x.test)
pred_ridge <- ifelse(alpha0.predicted >= 0.5, 1, 0)
pred_lasso <- ifelse(alpha1.predicted >= 0.5, 1, 0)
pred_elasticNet <- ifelse(alpha0.5.predicted >= 0.5, 1, 0)
#Confusion Matrix
CM_ridge<-confusionMatrix(as.factor(pred_ridge),as.factor(y.test))
CM_lasso<-confusionMatrix(as.factor(pred_lasso),as.factor(y.test))
CM_elasticNet<-confusionMatrix(as.factor(pred_elasticNet),as.factor(y.test))
CM2 <- list()
CM2[["ridge"]] <- table(pred_ridge, y.test)
CM2[["lasso"]] <- table(pred_lasso, y.test)
CM2[["elasticNet"]] <- table(pred_elasticNet, y.test)
sapply(CM2, EvaluateModel)
## ridge lasso elasticNet
## accuracy 0.9310178 0.9300531 0.9310178
## MER 0.06898215 0.06994694 0.06898215
## precision 0.2405063 0.2407407 0.24375
## sensitivity 0.6229508 0.6393443 0.6393443
## specificity 0.9403579 0.9388668 0.9398608
## F1 0.347032 0.3497758 0.3529412
The analysis highlights the importance of bankruptcy prediction in mitigating financial risks. The dataset was carefully reviewed, and comprehensive data preparation techniques, including thorough cleaning, outlier identification, and data transformation, were applied. Effective feature selection techniques were used, and in an attempt to uncover distinct patterns among bankrupt companies, clustering techniques were also employed. The evaluation process addressed the issue of class imbalance, and the performance of logistic regression, decision trees, random forest, and penalized regression models was assessed.
Most of the models performed well in achieving sensitivity above 0.70 (with some exceeding 0.80), indicating their effectiveness in predicting bankruptcy, which is a crucial metric when dealing with imbalanced classes. Bankruptcy prediction models should prioritize the detection of bankruptcies (sensitivity) over accurately classifying non-bankrupt companies (accuracy).
It is noteworthy that models utilizing the feature selection method proposed in this project consistently outperformed penalized regression models in terms of sensitivity. This finding underscores the value of this analysis, providing valuable insights into identifying and understanding potential financial distress.