Financial institutions, governments, lenders, investors, and businesses have long been concerned about credit risk and the fallout when companies default and ultimately go bankrupt. As such, efforts to develop credible models that assess the likelihood of bankruptcy effectively and efficiently have intensified (Chen, N. et al., 2016). Given the urgency of this situation, this study was designed to determine whether a company’s bankruptcy could be predicted from its financial ratios. The dataset was taken from publicly available information on Polish manufacturing companies, which experienced a relatively high rate of bankruptcy in the period under study (2000-2013) (UCI Machine Repository, n.d.-a). This analysis focused on the 4th year data, which contained 9,792 instances (ratios derived from each company’s financial statements), 64 predictor variables, and a corresponding dependent variable that indicated bankruptcy status after two years.
The expectation is that the models developed can reliably predict this outcome on new or unused data with some degree of efficiency. Underlying this expectation is the premise that identifying ‘bankrupters’ is the primary concern, while the misidentification of ‘non-bankrupters’ as bankrupters should be minimized. This priority will shape both model development and evaluation.
Secondly, and of significance, is the identification of those financial ratios which are the strongest contributors to an effective model. Financial ratios are static summarizations of key financial information on a company at a single point in time and have been used as indicators of financial distress in corporate bankruptcy prediction (Liang et al., 2016). Thus, an important secondary goal in answering this research question is determining which ratios play a significant role in bankruptcy prediction, so that a deeper understanding of the underlying conditions in bankruptcy can be ascertained (Liang et al., 2020). Finally, this study integrates a deep analysis of the underlying data with a comparative appraisal of the output of several statistical and machine learning algorithms to assist in predicting potential company bankruptcies.
Data: The dataset was downloaded from, and can be accessed at, the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data#) (UCI Machine Repository, n.d.-a). The authors of this published dataset (Zieba et al., 2016) compiled EMIS (Emerging Market Information Systems) data (EMIS, n.d.) on Polish manufacturing company bankruptcies because of their relatively high failure rate in the time period under analysis. Their criteria for selecting the sector centered on it having the highest bankruptcy rate and on the availability of data in the EMIS database. For bankrupters, at least one financial report had to be available in the five years before bankruptcy; for non-bankrupters, a minimum of three consecutive financial statements had to be available (2000-2012). The data was sorted into 5 classes:
1st year: financial rates from the 1st year of the forecasting period and class label that indicates bankruptcy status after 5 years (7,027 instances; 271 bankrupters; 6,756 non-bankrupters).
2nd year: financial rates from the 2nd year of the forecasting period and class label that indicates bankruptcy status after 4 years (10,173 instances; 400 bankrupters; 9,773 non-bankrupters).
3rd year: financial rates from the 3rd year of the forecasting period and class label that indicates bankruptcy status after 3 years (10,503 instances; 495 bankrupters; 10,008 non-bankrupters).
4th year: financial rates from the 4th year of the forecasting period and class label that indicates bankruptcy status after 2 years (9,792 instances; 515 bankrupters; 9,277 non-bankrupters). This is the subject of this study.
5th year: financial rates from the 5th year of the forecasting period and class label that indicates bankruptcy status after 1 year (5,910 instances; 410 bankrupters; 5,500 non-bankrupters).
The dataset consists of 64 financial ratios as continuous, independent variables and a single, dependent categorical variable.
The bankrupt companies (taken from 2007-2013) and solvent companies (2000-2012) were examined 1 through 5 years prior to bankruptcy. Thus, the dataset consists of data for 5 years with the corresponding class label indicating bankruptcy status after 1-5 years, respectively.
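The class imbalance implied by these counts can be made explicit. The following is a minimal sketch that recomputes the bankruptcy rate for each of the five files from the instance counts listed above; the data frame name `yearly` and its column names are illustrative only.

```r
# Instance counts as listed above for the five yearly files
yearly <- data.frame(
  year      = 1:5,
  instances = c(7027, 10173, 10503, 9792, 5910),
  bankrupt  = c(271, 400, 495, 515, 410)
)

# Non-bankrupt count and bankruptcy rate (percent) per file
yearly$solvent  <- yearly$instances - yearly$bankrupt
yearly$rate_pct <- round(100 * yearly$bankrupt / yearly$instances, 2)

print(yearly)
```

For the 4th year file this works out to roughly 5.3% bankrupters, which suggests that overall accuracy alone would be a weak yardstick for any model built on it.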
The data collected covered a period (13 years in total) which may not be representative of bankruptcy conditions in other time frames (Zieba et al., 2016).
The data was limited geographically (Poland) and to a particular industry with a relatively high rate of bankruptcy.
The financial ratios used (feature selection), while comprehensive, are likely to be overlapping and redundant, and may still not capture essential indicators of bankruptcy. Furthermore, recent studies have indicated that measures of corporate governance improve model performance, and these are not present in this data (Liang et al., 2016).
The financial ratios are static representations of the financial status of a company and may not be reliable indicators. Since the companies are de-identified, it is not possible to develop time series models, which might be more informative than the studies performed by others and in this analysis (Kovacova et al., 2019).
Additionally, since the companies are de-identified, it is not possible to determine whether a company had already experienced financial deterioration (Poston et al., 1994) prior to the analysis of its financial statements. The only criterion here is that the company collapsed during a given period.
Since the criteria for this study precluded datasets with fewer than 7,000 observations, the 5th year dataset was eliminated. Similarly, the 1st year dataset was eliminated: although it initially contains 7,027 instances, it is unlikely to retain over 7,000 samples after data processing and preparation (e.g., eliminating samples with excessive missing data (NAs) and outliers).
A static measure of corporate stability (i.e., financial ratios) is more likely to be significant the closer the company is to bankruptcy. Although the length of financial instability for a firm cannot be ascertained (see pt. 2, Limitations), analyzing data as close as possible to collapse is most likely to find bankrupters, and hence lead to the strongest model. Therefore, the 2nd year and 3rd year data were eliminated from consideration, and the 4th year data was selected for analysis.
It is tempting to combine all datasets to arrive at a much larger set. However, given the complexity described above, this approach was rejected as being unlikely to produce a meaningful model.
The 4th year data contains 9,792 instances (ratios derived from each company’s financial statement) and a corresponding class label that indicates bankruptcy status after 2 years. Thus, this data year represents a large sample whose financial ratios are likely to be more predictive of bankruptcy within a brief time frame (two years). The other years either contained too few samples (fewer than ~7,000) or were deemed too far out to provide meaningful predictive power.
# Load standard R packages
require(ggplot2) ## others added as needed
require(tidytext)
require(textdata)
require(tidyverse)
require(dplyr)
require(readr)
require(xlsx)
require(readxl)
require(stats)
require(RWeka) ## needed to read ARFF files; used but not shown
## Loading required package: RWeka
getwd()
## [1] "C:/WGU/post grad/final"
Test_arf4<-read.csv('pbk_initial.csv')
str(Test_arf4)##all variables shown in native form
## 'data.frame': 9792 obs. of 66 variables:
## $ Obs_number: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Attr1 : num 0.1593 -0.1274 0.0705 0.1368 -0.1101 ...
## $ Attr2 : num 0.462 0.462 0.236 0.405 0.698 ...
## $ Attr3 : num 0.0777 0.2692 0.5278 0.3154 0.1888 ...
## $ Attr4 : num 1.17 1.75 3.24 1.87 1.27 ...
## $ Attr5 : num -44.9 7.6 125.7 19.1 -15.3 ...
## $ Attr6 : num 0.46702 0.000925 0.16367 0.50497 0 ...
## $ Attr7 : num 0.1895 -0.1274 0.0869 0.1368 -0.1101 ...
## $ Attr8 : num 0.829 1.163 2.872 1.454 0.433 ...
## $ Attr9 : num 1.12 1.29 1.06 1.11 1.74 ...
## $ Attr10 : num 0.383 0.538 0.677 0.589 0.302 ...
## $ Attr11 : num 0.1895 -0.1232 0.0869 0.1368 -0.1031 ...
## $ Attr12 : num 0.41 -0.356 0.369 0.377 -0.158 ...
## $ Attr13 : num 0.1555 -0.0697 0.1048 0.1069 0.0883 ...
## $ Attr14 : num 0.1895 -0.1274 0.0869 0.1368 -0.1101 ...
## $ Attr15 : num 771 -1871 726 924 1663 ...
## $ Attr16 : num 0.473 -0.195 0.503 0.395 0.219 ...
## $ Attr17 : num 2.16 2.16 4.24 2.47 1.43 ...
## $ Attr18 : num 0.1895 -0.1274 0.0869 0.1368 -0.1101 ...
## $ Attr19 : num 0.1347 -0.0984 0.0768 0.0913 -0.0634 ...
## $ Attr20 : num 46.8 67.2 51.4 59.5 19.4 ...
## $ Attr21 : num 1.035 0.657 0.992 1.335 1.031 ...
## $ Attr22 : num 0.1808 -0.0801 0.0766 0.1448 0 ...
## $ Attr23 : num 0.1132 -0.0984 0.0623 0.0913 -0.0634 ...
## $ Attr24 : num 0.576 NA 0.208 0.505 NA ...
## $ Attr25 : num 0.3833 0.0891 0.6769 0.5894 0.2302 ...
## $ Attr26 : num 0.408 -0.195 0.433 0.395 0.219 ...
## $ Attr27 : num 1.442 -18.996 0.716 1.077 0 ...
## $ Attr28 : num 0.169 0.722 2.232 0.979 1.636 ...
## $ Attr29 : num 6.07 3.99 4.67 4.32 3.84 ...
## $ Attr30 : num 0.3091 0.347 -0.0246 0.2187 0.3767 ...
## $ Attr31 : num 0.1347 -0.0952 0.0768 0.0913 -0.1333 ...
## $ Attr32 : num 134.5 92.7 80.4 98.4 261.4 ...
## $ Attr33 : num 2.71 3.94 4.54 3.71 1.4 ...
## $ Attr34 : num 0.391 3.049 0.325 0.357 1.392 ...
## $ Attr35 : num 0.1808 -0.1157 0.0766 0.1448 0.0506 ...
## $ Attr36 : num 1.48 1.29 1.2 1.51 1.74 ...
## $ Attr37 : num 658.7 4.81 NA 10.08 NA ...
## $ Attr38 : num 0.384 0.618 0.677 0.632 0.302 ...
## $ Attr39 : num 0.1285 -0.0894 0.0678 0.0967 0.0292 ...
## $ Attr40 : num 0.167 0.0424 1.1451 0.2185 0.2458 ...
## $ Attr41 : num 0.0724 -0.3593 0.0716 0.0792 0.0884 ...
## $ Attr42 : num 0.1285 -0.0619 0.0678 0.0967 0 ...
## $ Attr43 : num 120 173 159 146 150 ...
## $ Attr44 : num 73.1 105.4 108 86.4 130.7 ...
## $ Attr45 : num 0.882 -0.535 0.443 0.56 -1.193 ...
## $ Attr46 : num 0.777 1.086 2.564 1.197 1.139 ...
## $ Attr47 : num 52.6 61.7 54.3 66.3 34.7 ...
## $ Attr48 : num 0.152 -0.117 0.045 0.121 -0.263 ...
## $ Attr49 : num 0.1077 -0.0907 0.0398 0.0811 -0.1517 ...
## $ Attr50 : num 1.17 1.36 3.24 1.67 1.27 ...
## $ Attr51 : num 0.462 0.358 0.236 0.362 0.696 ...
## $ Attr52 : num 0.368 0.254 0.22 0.27 0.716 ...
## $ Attr53 : num 0.833 1.442 2.862 1.829 2.618 ...
## $ Attr54 : num 0.834 1.659 2.862 1.963 2.618 ...
## $ Attr55 : num 90533 2625 24672 6650 1314 ...
## $ Attr56 : num 0.109 -0.0894 0.0543 0.1026 0.4399 ...
## $ Attr57 : num 0.416 -0.237 0.104 0.232 -0.364 ...
## $ Attr58 : num 0.891 1.062 0.946 0.897 0.572 ...
## $ Attr59 : num 0.00142 0.15041 0 0.07302 0 ...
## $ Attr60 : num 7.79 5.43 7.11 6.14 18.8 ...
## $ Attr61 : num 4.99 3.46 3.38 4.22 2.79 ...
## $ Attr62 : num 119.8 101 76.1 88.3 146.4 ...
## $ Attr63 : num 3.05 3.62 4.8 4.13 2.49 ...
## $ Attr64 : num 3.06 3.47 4.78 4.65 15.04 ...
## $ class : int 0 0 0 0 0 0 0 0 0 0 ...
dim(Test_arf4) # 9792 observations 66 variables
## [1] 9792 66
plot(colSums(is.na(Test_arf4)))##plot NAs for each variable column
sum(colSums(is.na(Test_arf4)))
## [1] 8776
(100*sum(colSums(is.na(Test_arf4)))/
sum(colSums(!is.na(Test_arf4)))) ## NAs as a percentage of the non-missing values --> ~1.4%, NOT SPARSE
## [1] 1.376636
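As a cross-check, the share of missing cells can also be computed directly with `mean(is.na(...))`, which divides by all cells rather than by the non-missing ones. The sketch below uses a small made-up data frame (`toy`) rather than `Test_arf4`, so the numbers are illustrative only.

```r
# Hypothetical toy data frame standing in for Test_arf4
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 2, 5),
  c = c(7, 8, 9, 10)
)

na_per_col <- colSums(is.na(toy))                        # NA count in each column
pct_of_all <- 100 * mean(is.na(toy))                     # NAs as % of all cells
pct_of_obs <- 100 * sum(is.na(toy)) / sum(!is.na(toy))   # NAs as % of non-missing cells

print(na_per_col)
print(c(pct_of_all = pct_of_all, pct_of_obs = pct_of_obs))
```

The two percentages are close when missingness is low, as it is here, but they diverge as the share of NAs grows.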
RESULTS:
1. One attribute column contains over 4,000 NAs.
2. Several attribute columns contain hundreds of NAs.
CONCLUSIONS:
1. Columns with excessive NAs will be dropped since replacing the NAs with the mean or median renders the column void of useful information for analysis. Identify the specific columns involved.
2. Will use >300 NAs as cutoff (~3% NA).
# use which() to identify columns
which(colSums(is.na(Test_arf4))>300) ## columns (variables) with >300 NAs, to be dropped
## Attr27 Attr37 Attr45 Attr60
## 28 38 46 61
#drop columns 27,37,45,60
yr4_v2<-Test_arf4[,-c(27,37,45,60)]
dim(yr4_v2) # 62 columns remaining
## [1] 9792 62
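Because `Obs_number` occupies the first column, the positions that `which()` reports above (28, 38, 46, 61) are offset by one from the attribute numbers. Selecting by column name avoids having to track that offset. A minimal sketch on a hypothetical toy data frame (the names `toy`, `cutoff`, and `drop_nm` are illustrative):

```r
# Hypothetical toy frame: an identifier column followed by attributes
toy <- data.frame(
  Obs_number = 1:6,
  Attr1 = c(1, NA, NA, NA, 5, 6),
  Attr2 = c(1, 2, 3, 4, 5, 6),
  Attr3 = c(NA, NA, NA, NA, 1, 2)
)

cutoff   <- 2                                    # illustrative NA cutoff
na_count <- colSums(is.na(toy))
drop_nm  <- names(na_count)[na_count > cutoff]   # column names, not positions

toy_kept <- toy[, !(names(toy) %in% drop_nm), drop = FALSE]
print(drop_nm)
print(names(toy_kept))
```

Name-based selection keeps working if columns are later added or reordered, which positional indices do not.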
##find rows with >25%(count 15) NAs
xyz<- rowSums(is.na(yr4_v2))
xyz1<-which(xyz>15)
xyz1
## [1] 981 1480 1740 1797 1807 1914 2899 3885 4020 4116 4132 5488 6145 6403 6615
## [16] 7491 8031 8653 9607 9619
## remove observations with >25% NAs
yr4_v3<-yr4_v2[-c(xyz),]
dim(yr4_v3) # dropped 23 observations
## [1] 9769 62
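Negative indexing removes rows by position, so it needs the row positions returned by `which()` (held in `xyz1` above) rather than the per-row NA counts (held in `xyz`). A minimal sketch of that pattern on a hypothetical toy data frame, with the cutoff expressed as a fraction of the columns:

```r
# Hypothetical toy frame with four columns
toy <- data.frame(
  x1 = c(1, NA, 3, NA),
  x2 = c(1, NA, 3, 4),
  x3 = c(1, NA, NA, 4),
  x4 = c(1, 2, 3, 4)
)

na_per_row <- rowSums(is.na(toy))                    # NA count for each row
bad_rows   <- which(na_per_row / ncol(toy) > 0.25)   # row positions above the cutoff

# Guard: toy[-integer(0), ] would return zero rows, so only subset when needed
toy_clean <- if (length(bad_rows) > 0) toy[-bad_rows, , drop = FALSE] else toy
print(bad_rows)
print(dim(toy_clean))
```

Expressing the cutoff as a fraction also keeps the rule valid if the number of columns changes.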
NAs remain to be handled, but ~75% have been removed. The final step is to replace the remaining NAs with the median (codingProf.com, n.d.). This is justified since many variables do not exhibit a normal distribution, which argues against using the mean.
##convert dependent variable to numeric(will be used for scaling also), replace NAs with median.
require(dplyr)
yr4_v4<-yr4_v3
yr4_v4$class<-as.numeric(yr4_v4$class)
sum(is.na(yr4_v4)) # NAs remaining before median replacement
## [1] 8710
yr4_v5<- yr4_v4 %>%
mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = T), x))#replace NAS with median
sum(is.na(yr4_v5)) # NAs remaining 0
## [1] 0
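The same median replacement can be written in base R without `dplyr`. This is a minimal sketch on a hypothetical toy data frame; the helper name `impute_median` is an assumption made for illustration.

```r
# Hypothetical toy frame with one skewed column and one missing value per column
toy <- data.frame(
  r1 = c(1, 2, NA, 100),
  r2 = c(NA, 0.5, 0.7, 0.9)
)

# Replace NAs in a numeric vector with the median of its observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

toy_imp <- as.data.frame(lapply(toy, impute_median))
print(toy_imp)
```

In column `r1` the median of the observed values is 2 while the mean is about 34.3, which illustrates why the median is the safer fill value for skewed ratios.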
RESULTS: 1. Replaced approximately 0.2% of values in the data frame with the median (minimal impact on data integrity).
D. Dealing with multicollinearity. Multicollinear variables in the dataset are identified and removed where the correlation is > 0.90 or < -0.90. The correlation matrix is computed on the reduced dataset. As the first pruning step, the strategy is to find variables with correlation values > 0.90 or < -0.90 and decide which to eliminate. Within each highly correlated pair, the variable that shows less correlation with the other variables is kept and the other is eliminated.
Since the variables vary in range, the existing data frame, yr4_v5, will be centered and scaled (standardized to z-scores) using the scale() function.
#scaling, centering
yr4_v5_scdf<-scale(yr4_v5, center=T)
##correlation matrix
require(corrplot)
## Loading required package: corrplot
## corrplot 0.92 loaded
cor_4v5<-cor(yr4_v5_scdf)
cor_4v5_pl <-corrplot(cor_4v5)
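`scale()` standardizes every column it is given, including any identifier or label columns. A minimal sketch, on simulated data with hypothetical column names, of standardizing only the predictor columns and confirming that the result has mean 0 and standard deviation 1:

```r
set.seed(1)
# Simulated stand-in: an id, two predictors on different scales, and a 0/1 label
toy <- data.frame(
  id     = 1:50,
  ratio1 = rexp(50, rate = 2),
  ratio2 = rnorm(50, mean = 10, sd = 3),
  class  = rbinom(50, 1, 0.1)
)

pred_cols <- setdiff(names(toy), c("id", "class"))   # predictors only
z <- scale(toy[, pred_cols], center = TRUE, scale = TRUE)

print(round(colMeans(z), 10))   # column means after centering
print(apply(z, 2, sd))          # column standard deviations after scaling
cor_z <- cor(z)
print(cor_z)
```

Pearson correlation is unchanged by centering and scaling, so `cor(z)` matches the correlation matrix of the raw predictor columns.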
Results: There are several ways to select variables for removal to reduce multicollinearity (MC) (Perceptive Analytics, 2018). The challenge is to retain as much variation as possible in the dataset while reducing the impact of MC. For this dataset, the variables showing the strongest correlation (> 0.90 or < -0.90) were identified.
## strongly positively correlated (> 0.90)
N<-nrow(cor_4v5)
xy_p<-vector('list', N) ## (stackoverflow, n.d.-a)
class(xy_p)
## [1] "list"
for(i in rownames(cor_4v5)){ ## for loop to identify strongly correlated variables (> 0.90)
  xy_p[[i]]<-data.frame(row.names(cor_4v5)[which(cor_4v5[,i] > 0.90)])
}
xy_p<-do.call(rbind, xy_p)
colnames(xy_p)<-c('strong correlation')
xy_p ## output not shown due to length
## strongly negatively correlated (< -0.90)
xy_n<-vector('list', N)
for(i in rownames(cor_4v5)){
  xy_n[[i]]<-data.frame(row.names(cor_4v5)[which(cor_4v5[,i] < (-0.90))])
}
xy_n<-do.call(rbind, xy_n)
colnames(xy_n)<-c('strong neg. correlation')
xy_n
## strong neg. correlation
## Attr2.1 Attr3
## Attr2.2 Attr10
## Attr2.3 Attr25
## Attr2.4 Attr38
## Attr3.1 Attr2
## Attr3.2 Attr51
## Attr10.1 Attr2
## Attr10.2 Attr51
## Attr25.1 Attr2
## Attr25.2 Attr51
## Attr38.1 Attr2
## Attr38.2 Attr51
## Attr51.1 Attr3
## Attr51.2 Attr10
## Attr51.3 Attr25
## Attr51.4 Attr38
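The loops above list each correlated pair twice, once from each variable's side, as the output shows. A more compact sketch, on simulated data with hypothetical variable names, uses the upper triangle of the correlation matrix so that each pair appears once, and handles both signs with `abs()`:

```r
set.seed(42)
n <- 200
a <- rnorm(n)
# Simulated stand-in for the ratios: B tracks A, C mirrors A, D is unrelated
toy <- data.frame(
  A = a,
  B = a + rnorm(n, sd = 0.05),
  C = -a + rnorm(n, sd = 0.05),
  D = rnorm(n)
)

cm  <- cor(toy)
hit <- which(abs(cm) > 0.90 & upper.tri(cm), arr.ind = TRUE)   # each pair once

cor_pairs <- data.frame(
  var1 = rownames(cm)[hit[, "row"]],
  var2 = colnames(cm)[hit[, "col"]],
  r    = cm[hit]
)
print(cor_pairs)
```

Keeping the signed coefficient in the `r` column makes it easy to separate the positive and negative cases afterwards.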
RESULTS: 1. Some financial ratios showed perfect correlation and were identical.
2. Using domain expertise (Kieso et al., 2016), when two variables show extremely high correlation, the option is to select the variable with the simplest financial ratio. For example, Attr7 showed high correlation with Attr11, Attr14, and Attr22. Attr7 is EBIT/total assets, while Attr11 is (gross profit + extraordinary items + financial expenses)/total assets. Therefore, Attr7 was selected.
CONCLUSIONS: 1. After removing multicollinear variables, the dataset contains 39 independent variables and 1 dependent variable.
#eliminate MC variables
yr4_v6_scdf<-yr4_v5_scdf[,c(1:7, 9, 12:13, 15, 17:21,24,27:29,31,33:34,37:41,45:46,49:50,52:56, 57,60)]
dim(yr4_v6_scdf)
## [1] 9769 39
yr4_v6<-yr4_v6_scdf ## simplify data frame name
dim(yr4_v6)
## [1] 9769 39
str(yr4_v6)
## num [1:9769, 1:39] -1.73 -1.73 -1.73 -1.73 -1.73 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:9769] "22" "23" "25" "26" ...
## ..$ : chr [1:39] "Obs_number" "Attr1" "Attr2" "Attr3" ...
Outliers can influence ML models and predictions and must be considered for removal. In this dataset, to preserve as much variation as possible, only observations with z-scores > 10 or < -10 are removed, using the following code. It is particularly important to remove observations in which multiple variables meet this criterion, especially for any type of regression analysis (finnstats, 2021-a).
#rebind dependent variable (class)
yr4_v6c<-as.data.frame(cbind(yr4_v6, yr4_v3$class))
yr4_v6c$V40<-as.factor(yr4_v6c$V40)
require(data.table)
## Loading required package: data.table
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
setnames(yr4_v6c,'V40','class') ## rename V40 to class in the new data frame
yr4_v6c$class<-ifelse(yr4_v6c$class==2,1,0)
yr4_v6c$class<-as.factor(yr4_v6c$class)
table(yr4_v6c$class)
##
## 0
## 9769
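Care is needed when converting a label between numeric and factor form, because comparing a factor with a number tests its level labels, not its internal integer codes. A minimal sketch on a hypothetical 0/1 label vector (all names here are illustrative):

```r
label_int <- c(0, 0, 1, 0, 1)            # hypothetical 0/1 class labels

f      <- as.factor(label_int)           # levels are "0" and "1"
codes  <- as.integer(f)                  # internal codes: 1 and 2
values <- as.numeric(as.character(f))    # original values: 0 and 1

# Comparing the factor itself with 2 matches level labels, so nothing matches
n_match_label <- sum(f == 2)
# Comparing the integer codes with 2 picks out the second level ("1")
n_match_code  <- sum(codes == 2)

# Building the factor directly from the 0/1 values avoids the detour
cls <- factor(label_int, levels = c(0, 1), labels = c("0", "1"))
print(table(cls))
```

Checking `table()` straight after any recoding step confirms that both classes are still present.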
str(yr4_v6c)
## 'data.frame': 9769 obs. of 40 variables:
## $ Obs_number: num -1.73 -1.73 -1.73 -1.73 -1.73 ...
## $ Attr1 : num 0.0512 1.0661 -0.1048 -0.0379 0.1603 ...
## $ Attr2 : num -0.0527 -0.0503 -0.035 0.0526 0.022 ...
## $ Attr3 : num 0.10997 0.09507 -0.00721 0.00528 0.06091 ...
## $ Attr4 : num -0.018 -0.0185 -0.0234 -0.0239 -0.022 ...
## $ Attr5 : num -1.89e-03 3.38e-05 -5.15e-03 -4.96e-03 -1.12e-03 ...
## $ Attr6 : num 0.00874 -0.03336 0.00874 0.00874 0.00874 ...
## $ Attr8 : num -0.0259 -0.026 -0.0266 -0.0282 -0.0278 ...
## $ Attr11 : num 0.0395 0.1927 -0.1162 -0.0173 0.1002 ...
## $ Attr12 : num 0.000306 0.004484 -0.002472 -0.002249 0.000225 ...
## $ Attr14 : num 0.0337 0.2148 -0.0964 -0.0431 0.1291 ...
## $ Attr16 : num -0.00443 -0.00033 -0.00642 -0.00725 -0.00551 ...
## $ Attr17 : num -0.026 -0.0261 -0.0267 -0.0283 -0.028 ...
## $ Attr18 : num 0.0175 0.1487 -0.0768 -0.0382 0.0866 ...
## $ Attr19 : num 0.00232 0.00377 0.00117 0.00121 0.00318 ...
## $ Attr20 : num 0.05757 -0.08432 -0.00937 -0.09501 -0.09488 ...
## $ Attr23 : num 0.00429 0.01086 0.00325 0.00335 0.00499 ...
## $ Attr27 : num -0.0342 -0.0326 -0.0343 -0.0343 0.1343 ...
## $ Attr28 : num 0.0794 -0.0113 -0.0439 0.1015 0.0645 ...
## $ Attr29 : num -0.667 -0.117 0.427 -0.854 1.366 ...
## $ Attr31 : num -3.44e-05 1.46e-03 -9.98e-04 -1.04e-03 8.24e-04 ...
## $ Attr33 : num -0.0412 -0.028 -0.0855 -0.0699 -0.0725 ...
## $ Attr34 : num 0.00423 0.01217 -0.05602 -0.02405 -0.03808 ...
## $ Attr38 : num 0.0346 0.0374 0.0263 -0.0709 -0.0403 ...
## $ Attr39 : num 0.0144 0.0149 0.0139 0.0142 0.0148 ...
## $ Attr40 : num -0.0317 -0.0221 -0.0347 -0.0347 -0.0266 ...
## $ Attr41 : num -0.0202 -0.022 -0.0162 -0.0109 -0.0186 ...
## $ Attr42 : num 0.0289 0.0316 0.027 0.0274 0.0303 ...
## $ Attr47 : num -0.0126 -0.0398 -0.027 -0.0429 -0.0421 ...
## $ Attr48 : num 0.1086 0.2546 -0.059 0.0924 0.2282 ...
## $ Attr51 : num -0.03 -0.0326 -0.0373 0.0808 0.0211 ...
## $ Attr52 : num -0.0109 -0.011 -0.0107 -0.0108 -0.0108 ...
## $ Attr54 : num 0.0783 -0.0105 -0.0434 0.105 0.0317 ...
## $ Attr55 : num -0.0815 -0.0491 -0.0744 -0.0977 0.6198 ...
## $ Attr56 : num 0.0133 0.0139 0.0128 0.0131 0.0138 ...
## $ Attr57 : num 0.00665 0.07105 -0.00291 0.01632 0.03321 ...
## $ Attr58 : num -0.0215 -0.0262 -0.0179 -0.0181 -0.0243 ...
## $ Attr60 : num -0.0343 -0.0319 -0.0337 -0.0313 -0.0313 ...
## $ Attr63 : num -0.0474 -0.0268 -0.1017 -0.0825 -0.0825 ...
## $ class : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
## find observations with z-score >10, <-10
yr4_6c_z10<-as.data.frame(which(yr4_v6c[,1:39] >10 | yr4_v6c[,1:39] <(-10) , arr.ind = TRUE))
dim(yr4_6c_z10) ## save as separate dataset for later analysis
## [1] 187 2
dismiss_z10<-unique(yr4_6c_z10$row) ## select out unique observations
length(dismiss_z10) ## number of unique observations flagged
## [1] 94
## remove these observations(outliers) from dataset
yr4_v7<-yr4_v6c[-c(dismiss_z10),]
dim(yr4_v7)
## [1] 9675 40
dismiss_kp <-yr4_v6c[c(dismiss_z10, xyz1),] ##combine all discarded observations in a dataset for analysis
dim(dismiss_kp)
## [1] 114 40
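The z-score filter can be traced on a small example. This sketch uses a simulated matrix with three planted extreme cells (the object names are illustrative) to show how `which(..., arr.ind = TRUE)` returns one line per offending cell, and why `unique()` is needed before removing rows:

```r
set.seed(7)
# Simulated standardized data: 10 observations, 4 variables
z <- matrix(rnorm(40), nrow = 10, ncol = 4,
            dimnames = list(NULL, paste0("v", 1:4)))
z[3, 2] <- 12     # planted extreme values
z[3, 4] <- -15
z[8, 1] <- 11

hits     <- which(abs(z) > 10, arr.ind = TRUE)   # one line per offending cell
out_rows <- unique(hits[, "row"])                # each observation once
per_row  <- table(hits[, "row"])                 # flagged variables per observation

z_kept <- z[-out_rows, , drop = FALSE]
print(hits)
print(per_row)
print(dim(z_kept))
```

The `per_row` table identifies observations that are extreme on several variables at once, which the text singles out as the most important ones to remove.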
summary(yr4_v7[,31:40])##display last 10 variables in dataset
## Attr51 Attr52 Attr54 Attr55
## Min. :-0.10293 Min. :-0.01115 Min. :-7.20627 Min. :-9.46878
## 1st Qu.:-0.06221 1st Qu.:-0.01101 1st Qu.:-0.04374 1st Qu.:-0.10065
## Median :-0.02905 Median :-0.01090 Median :-0.04080 Median :-0.08843
## Mean :-0.01100 Mean :-0.01048 Mean :-0.02272 Mean :-0.01567
## 3rd Qu.: 0.01336 3rd Qu.:-0.01075 3rd Qu.:-0.03393 3rd Qu.:-0.03916
## Max. : 5.47406 Max. : 0.42211 Max. : 8.58496 Max. : 9.48337
## Attr56 Attr57 Attr58 Attr60
## Min. :-0.85229 Min. :-7.443143 Min. :-3.98274 Min. :-0.03564
## 1st Qu.: 0.01293 1st Qu.:-0.002983 1st Qu.:-0.03062 1st Qu.:-0.03386
## Median : 0.01345 Median : 0.006961 Median :-0.02178 Median :-0.03261
## Mean : 0.01325 Mean : 0.010648 Mean :-0.01945 Mean :-0.01561
## 3rd Qu.: 0.01440 3rd Qu.: 0.023143 3rd Qu.:-0.01709 3rd Qu.:-0.02976
## Max. : 1.46739 Max. : 8.077145 Max. : 8.40330 Max. : 8.54421
## Attr63 class
## Min. :-0.142788 0:9675
## 1st Qu.:-0.093592
## Median :-0.062581
## Mean :-0.012177
## 3rd Qu.:-0.004941
## Max. : 8.539004
summary(dismiss_kp[,31:40])##display last 10 variables
## Attr51 Attr52 Attr54 Attr55
## Min. :-0.10293 Min. :-0.01115 Min. :-0.19397 Min. :-0.43634
## 1st Qu.:-0.08432 1st Qu.:-0.01105 1st Qu.:-0.04311 1st Qu.:-0.10034
## Median :-0.05014 Median :-0.01086 Median :-0.04080 Median :-0.09327
## Mean : 0.92995 Mean : 0.88793 Mean : 1.92169 Mean : 1.33495
## 3rd Qu.: 0.02662 3rd Qu.:-0.01011 3rd Qu.:-0.03098 3rd Qu.:-0.07090
## Max. :97.90405 Max. :98.81097 Max. :80.79225 Max. :80.33131
## Attr56 Attr57 Attr58 Attr60
## Min. :-97.65662 Min. :-66.71620 Min. :-0.14096 Min. :-0.03564
## 1st Qu.: 0.01242 1st Qu.: -0.00658 1st Qu.:-0.04739 1st Qu.:-0.03415
## Median : 0.01345 Median : 0.00416 Median :-0.02178 Median :-0.03261
## Mean : -1.12221 Mean : -0.91607 Mean : 1.64620 Mean : 1.31939
## 3rd Qu.: 0.01569 3rd Qu.: 0.02281 3rd Qu.:-0.01279 3rd Qu.:-0.03132
## Max. : 0.02587 Max. : 25.31767 Max. :83.02583 Max. :80.30797
## Attr63 class
## Min. :-0.14239 0:114
## 1st Qu.:-0.13183
## Median :-0.08733
## Mean : 1.03146
## 3rd Qu.:-0.01273
## Max. :93.02601
Histograms provide insight into the frequency distribution, skewness, and overall spread of the data (Frost, n.d.-a). Since outliers compress the visualization, the z-score range is reduced to -2 to 2 for these plots.
newdrop<-as.data.frame(which(yr4_v7[,1:39] > 2 | yr4_v7[,1:39] <(-2) , arr.ind = TRUE)) ## identify observations with a z-score outside 2 S.D. for any variable
str(newdrop)
## 'data.frame': 1271 obs. of 2 variables:
## $ row: int 26 96 146 284 406 458 601 766 835 1032 ...
## $ col: int 2 2 2 2 2 2 2 2 2 2 ...
dim(newdrop)
## [1] 1271 2
yr4_v8<-yr4_v7[-c(newdrop$row),]
dim(yr4_v8)
## [1] 8872 40
str(yr4_v8)
## 'data.frame': 8872 obs. of 40 variables:
## $ Obs_number: num -1.73 -1.73 -1.73 -1.73 -1.73 ...
## $ Attr1 : num 0.0512 1.0661 -0.1048 -0.0379 0.1603 ...
## $ Attr2 : num -0.0527 -0.0503 -0.035 0.0526 0.022 ...
## $ Attr3 : num 0.10997 0.09507 -0.00721 0.00528 0.06091 ...
## $ Attr4 : num -0.018 -0.0185 -0.0234 -0.0239 -0.022 ...
## $ Attr5 : num -1.89e-03 3.38e-05 -5.15e-03 -4.96e-03 -1.12e-03 ...
## $ Attr6 : num 0.00874 -0.03336 0.00874 0.00874 0.00874 ...
## $ Attr8 : num -0.0259 -0.026 -0.0266 -0.0282 -0.0278 ...
## $ Attr11 : num 0.0395 0.1927 -0.1162 -0.0173 0.1002 ...
## $ Attr12 : num 0.000306 0.004484 -0.002472 -0.002249 0.000225 ...
## $ Attr14 : num 0.0337 0.2148 -0.0964 -0.0431 0.1291 ...
## $ Attr16 : num -0.00443 -0.00033 -0.00642 -0.00725 -0.00551 ...
## $ Attr17 : num -0.026 -0.0261 -0.0267 -0.0283 -0.028 ...
## $ Attr18 : num 0.0175 0.1487 -0.0768 -0.0382 0.0866 ...
## $ Attr19 : num 0.00232 0.00377 0.00117 0.00121 0.00318 ...
## $ Attr20 : num 0.05757 -0.08432 -0.00937 -0.09501 -0.09488 ...
## $ Attr23 : num 0.00429 0.01086 0.00325 0.00335 0.00499 ...
## $ Attr27 : num -0.0342 -0.0326 -0.0343 -0.0343 0.1343 ...
## $ Attr28 : num 0.0794 -0.0113 -0.0439 0.1015 0.0645 ...
## $ Attr29 : num -0.667 -0.117 0.427 -0.854 1.366 ...
## $ Attr31 : num -3.44e-05 1.46e-03 -9.98e-04 -1.04e-03 8.24e-04 ...
## $ Attr33 : num -0.0412 -0.028 -0.0855 -0.0699 -0.0725 ...
## $ Attr34 : num 0.00423 0.01217 -0.05602 -0.02405 -0.03808 ...
## $ Attr38 : num 0.0346 0.0374 0.0263 -0.0709 -0.0403 ...
## $ Attr39 : num 0.0144 0.0149 0.0139 0.0142 0.0148 ...
## $ Attr40 : num -0.0317 -0.0221 -0.0347 -0.0347 -0.0266 ...
## $ Attr41 : num -0.0202 -0.022 -0.0162 -0.0109 -0.0186 ...
## $ Attr42 : num 0.0289 0.0316 0.027 0.0274 0.0303 ...
## $ Attr47 : num -0.0126 -0.0398 -0.027 -0.0429 -0.0421 ...
## $ Attr48 : num 0.1086 0.2546 -0.059 0.0924 0.2282 ...
## $ Attr51 : num -0.03 -0.0326 -0.0373 0.0808 0.0211 ...
## $ Attr52 : num -0.0109 -0.011 -0.0107 -0.0108 -0.0108 ...
## $ Attr54 : num 0.0783 -0.0105 -0.0434 0.105 0.0317 ...
## $ Attr55 : num -0.0815 -0.0491 -0.0744 -0.0977 0.6198 ...
## $ Attr56 : num 0.0133 0.0139 0.0128 0.0131 0.0138 ...
## $ Attr57 : num 0.00665 0.07105 -0.00291 0.01632 0.03321 ...
## $ Attr58 : num -0.0215 -0.0262 -0.0179 -0.0181 -0.0243 ...
## $ Attr60 : num -0.0343 -0.0319 -0.0337 -0.0313 -0.0313 ...
## $ Attr63 : num -0.0474 -0.0268 -0.1017 -0.0825 -0.0825 ...
## $ class : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
Visualizations of histograms
#loop through
#counter=0
#par(mfrow=c(2,3))
#for(i in colnames(yr4_v8[,c(1:3,6,19,30)])){
# x<-hist(yr4_v8[,i], breaks=100)
# counter = counter +1
#}
SUMMARY: Most of the distributions are asymmetric. This is not surprising, given that financial ratios (often with different feature selection) exhibit a wide range of skewness and kurtosis (Tomczak & Wilimowska, 2016). Representative histograms were selected for further review (Attribute): Attr 1, 2, 3, 6, 29, 49.
For these examples, the skewness, kurtosis, and the Jarque-Bera Normality Test are calculated (Bobbitt, 2020). Skewness measures the asymmetry of the distribution, with (-) indicating left skew (tail to the left) and (+) indicating right skew (tail to the right). Kurtosis measures the degree to which the data clusters in the tails of the distribution in comparison to a normal distribution (flatness of a distribution). Normal distributions have a skewness near zero and a kurtosis of 3.
The Jarque-Bera test (JBT) is a goodness-of-fit test that determines whether sample data has the skewness and kurtosis of a normal distribution. Figure 1-9. Snippet JBT for variables.
Null hypothesis (H0): The data has a skewness and kurtosis that match a normal distribution.
Alternative hypothesis (Ha): The data has a skewness and kurtosis that do not match a normal distribution.
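To make the three quantities concrete, the sketch below computes them in base R from the standard moment-based definitions, with the Jarque-Bera statistic taken as JB = n(S²/6 + (K − 3)²/24) and referred to a chi-square distribution with 2 degrees of freedom. The helper `jb_by_hand` and the simulated samples are assumptions made for illustration; the `moments` functions are assumed, not verified here, to follow the same definitions.

```r
# Moment-based skewness, kurtosis, and Jarque-Bera statistic for a numeric vector
jb_by_hand <- function(x) {
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)                  # central moments (divide by n)
  m3 <- mean(d^3)
  m4 <- mean(d^4)
  skew <- m3 / m2^(3 / 2)          # 0 for a symmetric distribution
  kurt <- m4 / m2^2                # 3 for a normal distribution
  jb   <- n * (skew^2 / 6 + (kurt - 3)^2 / 24)
  p    <- pchisq(jb, df = 2, lower.tail = FALSE)   # asymptotic chi-square(2)
  c(skewness = skew, kurtosis = kurt, JB = jb, p.value = p)
}

set.seed(123)
res_norm <- jb_by_hand(rnorm(5000))   # roughly symmetric, kurtosis near 3
res_exp  <- jb_by_hand(rexp(5000))    # right-skewed with a heavy right tail
print(round(res_norm, 4))
print(signif(res_exp, 4))
```

With samples in the thousands, even modest departures from normality give a very large JB statistic, so the test is best read alongside the skewness and kurtosis values themselves.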
require(moments)
## Loading required package: moments
counter=0
for(i in colnames(yr4_v8[,c(1:3,6,19,30)])){
x<-skewness(yr4_v8[,i])
y<-kurtosis(yr4_v8[,i])
z<-jarque.test(yr4_v8[,i])
counter = counter +1
print(counter); print(x); print(y); print(z) }
## [1] 1
## [1] -0.01062078
## [1] 1.793283
##
## Jarque-Bera Normality Test
##
## data: yr4_v8[, i]
## JB = 538.46, p-value < 2.2e-16
## alternative hypothesis: greater
##
## [1] 2
## [1] -0.1154087
## [1] 7.25985
##
## Jarque-Bera Normality Test
##
## data: yr4_v8[, i]
## JB = 6727.8, p-value < 2.2e-16
## alternative hypothesis: greater
##
## [1] 3
## [1] 3.348483
## [1] 31.85472
##
## Jarque-Bera Normality Test
##
## data: yr4_v8[, i]
## JB = 324362, p-value < 2.2e-16
## alternative hypothesis: greater
##
## [1] 4
## [1] -0.9622606
## [1] 165.2058
##
## Jarque-Bera Normality Test
##
## data: yr4_v8[, i]
## JB = 9727566, p-value < 2.2e-16
## alternative hypothesis: greater
##
## [1] 5
## [1] 8.807446
## [1] 159.3492
##
## Jarque-Bera Normality Test
##
## data: yr4_v8[, i]
## JB = 9151230, p-value < 2.2e-16
## alternative hypothesis: greater
##
## [1] 6
## [1] -0.4170273
## [1] 8.688804
##
## Jarque-Bera Normality Test
##
## data: yr4_v8[, i]
## JB = 12220, p-value < 2.2e-16
## alternative hypothesis: greater
Another visualization approach is to use Q-Q plots (Perceptive Analytics, 2018). If the data comes from a normal distribution, the Q-Q scatterplot of theoretical vs. sample quantiles should follow a straight line. Deviations from a straight line indicate non-normality.
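The "straight line" reading of a Q-Q plot can be given a rough numeric counterpart. This sketch, on simulated data, calls `qqnorm()` with `plot.it = FALSE` to obtain the plotted coordinates and then correlates theoretical and sample quantiles; the helper `qq_cor` is an assumption made for illustration and is not a formal test.

```r
set.seed(11)
x_norm <- rnorm(500)   # simulated normal sample
x_skew <- rexp(500)    # simulated right-skewed sample

# Correlation between theoretical and sample quantiles of a normal Q-Q plot
qq_cor <- function(x) {
  q <- qqnorm(x, plot.it = FALSE)   # q$x: theoretical, q$y: sample quantiles
  cor(q$x, q$y)
}

r_norm <- qq_cor(x_norm)
r_skew <- qq_cor(x_skew)
print(c(normal = r_norm, skewed = r_skew))
```

A value very close to 1 corresponds to points hugging the reference line, while the skewed sample yields a visibly lower value.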
par(mfrow=c(2,3))
for(i in colnames(yr4_v7[,c(1:3,6,19,30)])){
qqnorm(yr4_v7[,i])
qqline(yr4_v7[,i])
}
Continuing to explore the six exemplar variables, the Kolmogorov-Smirnov test can be used to test for normality (GeeksforGeeks, 2021).
Null hypothesis (H0): The sample data comes from a normal distribution.
Alternative hypothesis (Ha): The sample data does not come from a normal distribution.
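As a point of reference for reading the output, the sketch below applies `ks.test()` to two simulated samples, one normal and one exponential, using the same call pattern with the sample mean and standard deviation. The data and object names are illustrative only.

```r
set.seed(2024)
x_norm <- rnorm(300, mean = 5, sd = 2)   # simulated normal sample
x_skew <- rexp(300, rate = 1)            # simulated right-skewed sample

# One-sample KS test against a normal with parameters estimated from the sample
ks_norm <- ks.test(x_norm, "pnorm", mean = mean(x_norm), sd = sd(x_norm))
ks_skew <- ks.test(x_skew, "pnorm", mean = mean(x_skew), sd = sd(x_skew))

print(c(D_normal = unname(ks_norm$statistic), p_normal = ks_norm$p.value))
print(c(D_skewed = unname(ks_skew$statistic), p_skewed = ks_skew$p.value))
```

Because the mean and standard deviation are estimated from the same sample, the reported p-value is conservative; the Lilliefors variant of the test addresses this.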
## ks.test() from the stats package
for(i in colnames(yr4_v7[,c(1:3,6,19,30)])){
x<-ks.test(yr4_v7[,i], 'pnorm', mean= mean(yr4_v7[,i]), sd= sd(yr4_v7[,i]))
print(i)
print(x)
}
## [1] "Obs_number"
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: yr4_v7[, i]
## D = 0.057487, p-value < 2.2e-16
## alternative hypothesis: two-sided
## Warning in ks.test.default(yr4_v7[, i], "pnorm", mean = mean(yr4_v7[, i]), :
## ties should not be present for the Kolmogorov-Smirnov test
## [1] "Attr1"
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: yr4_v7[, i]
## D = 0.18767, p-value < 2.2e-16
## alternative hypothesis: two-sided
## Warning in ks.test.default(yr4_v7[, i], "pnorm", mean = mean(yr4_v7[, i]), :
## ties should not be present for the Kolmogorov-Smirnov test
## [1] "Attr2"
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: yr4_v7[, i]
## D = 0.24717, p-value < 2.2e-16
## alternative hypothesis: two-sided
## Warning in ks.test.default(yr4_v7[, i], "pnorm", mean = mean(yr4_v7[, i]), :
## ties should not be present for the Kolmogorov-Smirnov test
## [1] "Attr5"
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: yr4_v7[, i]
## D = 0.41939, p-value < 2.2e-16
## alternative hypothesis: two-sided
## Warning in ks.test.default(yr4_v7[, i], "pnorm", mean = mean(yr4_v7[, i]), :
## ties should not be present for the Kolmogorov-Smirnov test
## [1] "Attr28"
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: yr4_v7[, i]
## D = 0.41521, p-value < 2.2e-16
## alternative hypothesis: two-sided
## Warning in ks.test.default(yr4_v7[, i], "pnorm", mean = mean(yr4_v7[, i]), :
## ties should not be present for the Kolmogorov-Smirnov test
## [1] "Attr48"
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: yr4_v7[, i]
## D = 0.2184, p-value < 2.2e-16
## alternative hypothesis: two-sided
RESULTS (N.B. only one snippet shown): 1. The null hypothesis (H0) was rejected for each variable, i.e., none of the variables came from a normal distribution, which aligns with the previous analysis.
The function descdist() from the fitdistrplus package helps to identify the probability distribution that best fits sample data from a defined family of distributions (Delignette-Muller et al., 2009). This function computes descriptive parameters of the empirical distribution of a variable and provides a skewness-kurtosis plot to aid in determining which distribution best matches the data. This plot is a Cullen and Frey graph, which indicates the values (or range) of skewness and kurtosis for a particular distribution. By using bootstrapping within the function, it is possible to determine a range of possibilities for the dataset parameters. In addition, by setting another parameter (discrete), it is possible to get a second set of possible distributions.
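To illustrate what the skewness-kurtosis plot summarizes, here is a base-R sketch that bootstraps the squared skewness and kurtosis of a simulated sample. It does not call `descdist()`; the helper `sk_pair` and the simulated lognormal data are assumptions made for illustration, and moment-based estimates are used rather than unbiased ones.

```r
# Squared skewness and kurtosis of a numeric vector (moment-based estimates)
sk_pair <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)
  c(sq_skew = (mean(d^3) / m2^1.5)^2,   # horizontal axis of a Cullen and Frey graph
    kurt    = mean(d^4) / m2^2)         # vertical axis
}

set.seed(99)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)   # simulated right-skewed sample

obs  <- sk_pair(x)                                             # observed point
boot <- t(replicate(200, sk_pair(sample(x, replace = TRUE))))  # bootstrapped points

print(round(obs, 3))
print(round(apply(boot, 2, quantile, probs = c(0.025, 0.975)), 3))
```

The spread of the bootstrapped pairs gives a sense of how uncertain the observed point is. Every pair also satisfies kurtosis ≥ squared skewness + 1, a bound that holds for any distribution.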
require(fitdistrplus)
## Loading required package: fitdistrplus
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: survival
#descdist() function(Rdocumentation. n.d.-a)
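par(mfrow = c(2, 3))
## Intended exemplar variables: Attr1, Attr2, Attr3, Attr6, Attr29, Attr49.
## N.B. the names printed below show that these column indices actually select
## Obs_number, Attr1, Attr2, Attr5, Attr28 and Attr48.
for (i in colnames(yr4_v7[, c(1:3, 6, 19, 30)])) {
  ## continuous candidates only, 1000 bootstrap samples, unbiased moment estimates
  x <- descdist(yr4_v7[, i], discrete = FALSE, boot = 1000, method = "unbiased", graph = TRUE)
  print(i)
}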
## [1] "Obs_number"
## [1] "Attr1"
## [1] "Attr2"
## [1] "Attr5"
## [1] "Attr28"
## [1] "Attr48"
RESULTS: 1. The observed skewness-kurtosis points (and the bootstrapped
values) lie outside the regions of the candidate distributions.
2. Candidate distribution
types: normal, uniform, exponential, logistic, lognormal, gamma.
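To make the quantities behind the Cullen and Frey graph concrete, the sketch below computes skewness and kurtosis for one sample and for bootstrap resamples using base R only. It is an illustration under stated assumptions: the sample `x` is synthetic, the helper `skew_kurt()` is hypothetical, and it uses simple moment estimators rather than the unbiased estimators requested in the descdist() call above.

```r
set.seed(1)
x <- rlnorm(1000)  # hypothetical skewed sample standing in for one ratio

# Moment-based skewness and kurtosis (a normal distribution has 0 and 3)
skew_kurt <- function(v) {
  m <- mean(v)
  s2 <- mean((v - m)^2)
  c(skewness = mean((v - m)^3) / s2^1.5,
    kurtosis = mean((v - m)^4) / s2^2)
}

obs <- skew_kurt(x)

# Bootstrap: resample with replacement and recompute both moments
boot <- t(replicate(200, skew_kurt(sample(x, replace = TRUE))))

print(obs)
print(apply(boot, 2, quantile, probs = c(0.025, 0.975)))
```

If the bootstrap range stays far from skewness 0 and kurtosis 3, the normal distribution is an implausible candidate for that variable.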
The Generalized Additive Model for Location, Scale and Shape
(GAMLSS) can be used to explore distributions that are highly
skewed and/or kurtotic (Stasinopoulos & Rigby,
2007).
a. Split the dataset by the ‘class’ variable.
It is important to recall that, because scaling and centering were done
prior to this summary, the descriptive statistical differences can be
used for comparison (Bobbitt, 2019). The z-score
approach does not compress outliers into a fixed range, so they retain
their weight and can be significant contributors to modeling.
## Use subset() to separate bankrupters (class == 1) and non-bankrupters (class == 0).
pbkers <- subset(yr4_v7, class == 1)
no_pbkers <- subset(yr4_v7, class == 0)
summary(pbkers[, 31:40])
summary(no_pbkers[, 31:40]) ## showing only the last 10 variables
## Attr51 Attr52 Attr54 Attr55
## Min. :-0.10293 Min. :-0.01115 Min. :-7.20627 Min. :-9.46878
## 1st Qu.:-0.06221 1st Qu.:-0.01101 1st Qu.:-0.04374 1st Qu.:-0.10065
## Median :-0.02905 Median :-0.01090 Median :-0.04080 Median :-0.08843
## Mean :-0.01100 Mean :-0.01048 Mean :-0.02272 Mean :-0.01567
## 3rd Qu.: 0.01336 3rd Qu.:-0.01075 3rd Qu.:-0.03393 3rd Qu.:-0.03916
## Max. : 5.47406 Max. : 0.42211 Max. : 8.58496 Max. : 9.48337
## Attr56 Attr57 Attr58 Attr60
## Min. :-0.85229 Min. :-7.443143 Min. :-3.98274 Min. :-0.03564
## 1st Qu.: 0.01293 1st Qu.:-0.002983 1st Qu.:-0.03062 1st Qu.:-0.03386
## Median : 0.01345 Median : 0.006961 Median :-0.02178 Median :-0.03261
## Mean : 0.01325 Mean : 0.010648 Mean :-0.01945 Mean :-0.01561
## 3rd Qu.: 0.01440 3rd Qu.: 0.023143 3rd Qu.:-0.01709 3rd Qu.:-0.02976
## Max. : 1.46739 Max. : 8.077145 Max. : 8.40330 Max. : 8.54421
## Attr63 class
## Min. :-0.142788 0:9675
## 1st Qu.:-0.093592
## Median :-0.062581
## Mean :-0.012177
## 3rd Qu.:-0.004941
## Max. : 8.539004
While the t-test is usually associated with normal
distributions, it remains approximately valid for non-normal data as the
sample size in the two groups increases (Bartlett, 2013). Because of the central
limit theorem, the distribution of the sample means will converge to a
normal distribution, regardless of the population distribution (provided
its variance is finite).
Moreover, the estimator for the standard error of these means is
consistent regardless of the distribution of the variable. Thus, in large
samples the test statistic approximately follows a standard normal
distribution under the null hypothesis.
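The central limit theorem argument can be checked with a short simulation. This is a sketch on synthetic data, not part of the study: the exponential population, the group size `n_per_sample`, and the number of replications `n_reps` are all assumptions chosen for illustration.

```r
set.seed(123)
n_per_sample <- 200   # hypothetical group size
n_reps <- 2000        # number of simulated samples

# Means of repeated samples from a right-skewed population (exponential, mean 1)
sample_means <- replicate(n_reps, mean(rexp(n_per_sample, rate = 1)))

# Standardize with the known population mean (1) and standard error (1 / sqrt(n))
z <- (sample_means - 1) / (1 / sqrt(n_per_sample))

# If the normal approximation is good, about 95% of z lies in [-1.96, 1.96]
coverage <- mean(abs(z) <= 1.96)
print(coverage)
```

A coverage close to 0.95 suggests that, at this sample size, the skewness of the population has little effect on the behavior of the sample mean. Smaller groups would need a separate check.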
The second
consideration is that the normalization of the dataset occurred
prior to splitting. Therefore, comparing the two
subpopulations (class=0,1) is meaningful and any differences can be
statistically evaluated. Finally, this analysis is done on scaled
data. To verify that these conclusions hold on the unscaled dataset, the
tests herein were replicated and the results were unchanged (data not
shown).
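The agreement between scaled and unscaled results is expected, because centering and scaling the pooled data is a linear transformation that changes the difference in means and its standard error by the same factor. A minimal sketch with synthetic data follows; the vector `raw` and the grouping factor `grp` are hypothetical and do not come from the study's dataset.

```r
set.seed(7)
# Hypothetical raw ratio and class labels (1 = bankrupt, 0 = not bankrupt)
raw <- c(rlnorm(300, meanlog = 0), rlnorm(40, meanlog = 0.5))
grp <- factor(c(rep(0, 300), rep(1, 40)))

# Center and scale on the pooled data, i.e., before splitting by class
scaled <- as.numeric(scale(raw))

t_raw <- t.test(raw ~ grp)        # Welch test is the default (var.equal = FALSE)
t_scaled <- t.test(scaled ~ grp)

print(c(raw = t_raw$p.value, scaled = t_scaled$p.value))
```

The two p-values coincide up to floating-point error, which is consistent with the replication reported above.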
Since it is unclear whether the subpopulations display
equal variances, the Welch test will be applied. In addition, it controls
the Type I error rate better than Student's t-test when variances are
unequal and should perform better in this comparison
with two unequal sized samples (Ruxton, 2006).
# require(reshape2)
# #yr4_v7sh<- yr4_v7[,c(1:3,6,19,30)]
# dat_v47<-melt(yr4_v7)##reshape data from wide to long for lapply
# lapply(unique(dat_v47$variable), function(x){
# Good<-subset(dat_v47, class==0 & variable ==x)$value
# Bad<- subset(dat_v47, class==1 & variable ==x)$value
# t.test(Good, Bad)
# })
WELCH Null H0: The means of the two subpopulations are
equal. Alternative Ha: The means of the two subpopulations are
unequal.
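One way to run the Welch test for every predictor and keep the results in a single table is sketched below. This is an illustration on synthetic data rather than the code used for the study: the data frame `demo`, its columns, and the group sizes are assumptions.

```r
set.seed(2024)
n0 <- 400; n1 <- 40   # hypothetical, unequal group sizes
demo <- data.frame(
  class   = factor(c(rep(0, n0), rep(1, n1))),
  ratio_a = c(rnorm(n0, 0, 1), rnorm(n1, 2, 2)),  # means and variances differ
  ratio_b = c(rnorm(n0, 0, 1), rnorm(n1, 0, 1))   # same distribution
)

predictors <- setdiff(names(demo), "class")
is_bankrupt <- demo$class == 1

# Welch two-sample t-test (var.equal = FALSE) for each predictor
welch_table <- do.call(rbind, lapply(predictors, function(v) {
  res <- t.test(demo[[v]][!is_bankrupt], demo[[v]][is_bankrupt], var.equal = FALSE)
  data.frame(variable = v,
             t = unname(res$statistic),
             df = unname(res$parameter),
             p_value = res$p.value)
}))
welch_table$reject_H0 <- welch_table$p_value < 0.05
print(welch_table)
```

The fractional degrees of freedom in the `df` column come from the Welch-Satterthwaite approximation, which is what distinguishes this test from the pooled-variance t-test.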
RESULTS: (only showing Welch test results for six
exemplars)
1. Attr1, Attr2, Attr3, Attr6, Attr29: Reject null H0: the
means are unequal.
2. Attr49: Fail to reject null H0: the means are
indistinguishable from each other.
3. Summary for all variables.
Fail to reject null H0 (p>0.05): Attr4,
Attr5, Attr9, Attr12, Attr13, Attr15, Attr17, Attr19, Attr21, Attr28, Attr30,
Attr34, Attr40, Attr42, Attr43, Attr49, Attr53, Attr56, Attr57, Attr58,
Attr59, Attr61, Attr64.
Reject null H0 (p<0.05): Attr1, Attr2, Attr3,
Attr6, Attr7, Attr18, Attr20, Attr24, Attr29, Attr32, Attr35, Attr39,
Attr41, Attr48, Attr52, Attr55.
CONCLUSIONS 1. For
the variables where the outcome is 'fail to reject null H0', the test
found no evidence that the means of the two subpopulations differ.
2. For the variables
where the outcome is 'reject null H0', the means of the two subpopulations are
unequal. Will these variables be included in the final models to be
developed? (see Part 5. Data Summary and Implications)
Since the data contains variables which have non-normal
distributions, a non-parametric test can be used to compare
the two subpopulations (bankrupters, non-bankrupters).
The Wilcoxon rank-sum test addresses whether the observations in one group
tend to be greater than those in the other after ranking the combined groups.
If the two distributions are assumed to have the same shape and spread,
this is a test of the
medians of the two groups. This is the non-parametric counterpart
of the Welch t-test conducted in the last section and can
highlight variables which are different between the two subpopulations
and may be significant for the final machine learning model.
Wilcoxon Null H0: The distributions of the two
subpopulations are identical. Alternative Ha: The
distributions of the two subpopulations are not identical.
Variables
tested: All 39 independent variables. Variables discussed: Six exemplar
variables.
#counter <- 0 ##initialize before the loop
#for(i in colnames(pbkers[,c(1:3,6,19,30)])){
# x<-wilcox.test(pbkers[,i], no_pbkers[,i])
# counter <- counter + 1
# print(counter)
# print(x)
#}
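The same comparison can be written so that the p-values for all predictors land in one table. The sketch below uses synthetic data; the data frame `demo`, its columns, and the group sizes are hypothetical and only illustrate the pattern.

```r
set.seed(99)
n0 <- 400; n1 <- 40   # hypothetical, unequal group sizes
demo <- data.frame(
  class   = factor(c(rep(0, n0), rep(1, n1))),
  ratio_a = c(rexp(n0, rate = 1), rexp(n1, rate = 0.25)),  # bankrupt group tends to be larger
  ratio_b = c(rexp(n0, rate = 1), rexp(n1, rate = 1))      # same distribution
)

predictors <- setdiff(names(demo), "class")
is_bankrupt <- demo$class == 1

# Wilcoxon rank-sum (Mann-Whitney) test for each predictor
wilcox_p <- sapply(predictors, function(v) {
  wilcox.test(demo[[v]][is_bankrupt], demo[[v]][!is_bankrupt])$p.value
})

wilcox_table <- data.frame(variable = predictors,
                           p_value = unname(wilcox_p),
                           reject_H0 = unname(wilcox_p) < 0.05)
print(wilcox_table)
```

Because the test works on ranks, it is not driven by the extreme values that are common in financial ratios.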
RESULTS: 1. The null H0 is rejected for all six exemplar variables, i.e., the distributions of these variables in the two subpopulations are not identical. 2. Across all variables, the null H0 was not rejected for only three (Attr20, Attr43, Attr64).
SUMMARY OF ANALYSES: 1. The parametric Welch t-test
identified 16 variables whose means are different between
the two subpopulations.
2. The non-parametric Wilcoxon rank-sum
test identified thirty-six variables whose distributions (and, assuming
equal shape and spread, medians) differ between the two subpopulations.
3. The following
attributes (variables) have been identified in both the Welch t-test and
Wilcoxon rank-sum test as being different in the two subpopulations:
Attr1, Attr2, Attr3, Attr6, Attr7, Attr18, Attr24, Attr29, Attr32,
Attr35, Attr39, Attr41, Attr48, Attr52, Attr55.
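The overlap between the two tests can be reproduced with set operations on the variable lists reported in the Welch and Wilcoxon results. This is a small bookkeeping sketch; the vector names are hypothetical, and the variable names are copied from the lists in the text.

```r
# Variables for which the Welch test rejected / did not reject the null H0
welch_reject <- c("Attr1", "Attr2", "Attr3", "Attr6", "Attr7", "Attr18", "Attr20",
                  "Attr24", "Attr29", "Attr32", "Attr35", "Attr39", "Attr41",
                  "Attr48", "Attr52", "Attr55")
welch_fail <- c("Attr4", "Attr5", "Attr9", "Attr12", "Attr13", "Attr15", "Attr17",
                "Attr19", "Attr21", "Attr28", "Attr30", "Attr34", "Attr40",
                "Attr42", "Attr43", "Attr49", "Attr53", "Attr56", "Attr57",
                "Attr58", "Attr59", "Attr61", "Attr64")

# Variables for which the Wilcoxon test did not reject the null H0
wilcox_fail <- c("Attr20", "Attr43", "Attr64")

all_vars <- c(welch_reject, welch_fail)        # all tested variables
wilcox_reject <- setdiff(all_vars, wilcox_fail)
both <- intersect(welch_reject, wilcox_reject) # flagged by both tests

print(length(all_vars))
print(length(wilcox_reject))
print(both)
```

Keeping the lists in code makes it straightforward to carry the shared variables forward into later modeling steps.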
Barboza, F., Kimura, H., and Altman, E. (2017) Machine
Learning models and bankruptcy prediction. Expert Systems with
Applications. Vol. 83: 405-417. http://isiarticles.com/bundles/Article/pre/pdf/146024.pdf
Bartlett, J. (2013) The t-test and robustness to non-normality. https://thestatsgeek.com/2013/09/28/the-t-test-and-robustness-to-non-normality/
Beaver, W.H. (1968) Market Prices, Financial Ratios and the
prediction of failure. J. Acct. Resrch. 6(2): 179-192.
Bhalla, D. (n.d.) SAS: Calculating the Optimal Predicted Probability Cutoff. https://www.listendata.com/2015/03/sas-calculating-optimal-predicted.html
Bharadwaj, V. (2022) What is Jarque test How to perform it in R. https://www.projectpro.io/recipes/what-is-jarque-bera-test-perform-it-r
Bobbitt, Z. (2018) Mann Whitney U Test. https://www.statology.org/mann-whitney-u-test/
Bobbitt, Z. (2019) How to Normalize Data in R. https://www.statology.org/how-to-normalize-data-in-r/
Bobbitt, Z. (2020) How to Calculate Skewness & Kurtosis in R.
https://www.statology.org/skewness-kurtosis-in-r/
Bobbitt, Z. (2021-a) The Complete Guide: When to Remove Outliers in R. https://www.statology.org/remove-outliers/
Bobbitt, Z. (2021-b) How to Use Q-Q plots to Check Normality. https://www.statology.org/q-q-plot-normality/
Bounthavong, M. (2021) Logistic regression in R. Rpubs by RStudio. https://rpubs.com/mbounthavong/logistic_regression
Bonthu, S., and Bindu, K.H. (2017) Review of Leading Data Analytics
Tools. Intl. Jrnl. of Engr. & Technology. Vol 7: 10-15. https://www.researchgate.net/profile/Sridevi-Bonthu-2/publication/327233649_Review_of_Leading_Data_Analytics_Tools/links/5b82bbdaa6fdcc5f8b695315/Review-of-Leading-Data-Analytics-Tools.pdf
Bredart, X. (2014) Bankruptcy Prediction Model: The Case of the
United States. Intl. J. Econ. And Finance Vol. 6(3):1-7. https://scholar.archive.org/work/odijcz2cbrerlfccirnnft2vo4/access/wayback/http://ccsenet.org/journal/index.php/ijef/article/download/32877/19695
Carroll, J. (2019) Beyond Spreadsheets with R. Chap. 7: Doing
things with lots of data. 1st edition. Publisher: Manning.
Chawla, N.V., Bowyer, K.W., Hall, L.O., and Kegelmeyer, W. P. (2002) SMOTE:
Synthetic Minority Over-Sampling Technique. https://arxiv.org/pdf/1106.1813.pdf
Chen, C., Liaw, A., and Breiman, L. (2004) Using Random Forest to Learn Imbalanced Data.
Univ. California Berkeley Tech Report 666; 1-12.
Chen, N., Ribeiro, B. and Chen, A. (2016) Financial credit risk assessment: a recent
review. Artificial Intelligence Review Vol: 45(1): 1-23. https://www.researchgate.net/profile/Bernardete-Ribeiro/publication/284100532_Financial_credit_risk_assessment_a_recent_review/links/564f4dc808aeafc2aab3c43c/Financial-credit-risk-assessment-a-recent-review.pdf
Cho, K.I., and Kim, Y. M. (2021) Comparison of Bankruptcy
prediction models using statistical learning at multiple times. The
Korean Data & Information Science Society 32(3):487-499. https://doi.org/10.7465/jkdi.2021.32.3.487
Chu, M. and Yong, K. (2021) Big Data Analytics for Business Intelligence in
Accounting and Audit. Open Journal of Social Sciences, 9, 42-52. doi:
10.4236/jss.2021.99004.
codingProf.com (n.d.) How to replace NA’s with the Median in R. https://www.codingprof.com/how-to-replace-missing-values-with-the-median-in-r/
Coding Prof(2022) 3 Ways to Test for Multicollinearity in
R[Examples]. https://www.codingprof.com/3-ways-to-test-for-multicollinearity-in-r-examples/
Correa-Mejia, A.D. and Lopero-Castano, M. (2020) Financial ratios
as a powerful instrument to predict insolvency: a study using boosting
algorithms in Colombian firms. Estud.gerenc. 36(155) 229-238. http://www.scielo.org.co/scielo.php?pid=S0123-59232020000200229&script=sci_arttext&tlng=en
Corporate Finance Institute (2022) Financial ratios: The use of
financial figures to gain significant information about a company. https://corporatefinanceinstitute.com/resources/knowledge/finance/financial-ratios/
Cross Validated(n.d.-a) Area under the ROC curve or area under the
PR curve for imbalanced data? https://stats.stackexchange.com/questions/90779/area-under-the-roc-curve-or-area-under-the-pr-curve-for-imbalanced-data/90783#90783
Cross Validated(n.d.-b) Meaning of y-axis in Random Forest partial
dependence plot. https://stats.stackexchange.com/questions/147763/meaning-of-y-axis-in-random-forest-partial-dependence-plot
Cross Validated(n.d.-c) Regression: Transforming Variables. https://stats.stackexchange.com/questions/4831/regression-transforming-variables/4833#4833
Data Novia(n.d.-a) T-Test Essentials : Definition, Formula and
Calculation. https://www.datanovia.com/en/lessons/how-to-do-a-t-test-in-r-calculation-and-reporting/how-to-do-two-sample-t-test-in-r/
Data Science Team(2021) Box-Cox Transformation for Normalizing a
Non-Normal Variable in R. https://universeofdatascience.com/box-cox-transformation-for-normalizing-a-non-normal-variable-in-r/
Davis, J., and Goadrich, M. The Relationship between
Precision-Recall and ROC Curves. https://www.biostat.wisc.edu/~page/rocpr.pdf
Delignette-Muller, M.L., Pouillot, R., Denis, J.-B., and Dutang, C.
(2009) Use of the package fitdistrplus to specify a distribution from
non-censored or censored data. https://civil.colorado.edu/~balajir/CVEN5454/R-sessions/sess2/intro2fitdistrplus.pdf
Devi, S.S., and Radhika, Y. (2018) A Survey on Machine Learning and
Statistical Techniques in Bankruptcy Prediction. Intl. Jour. Machine
Learning and Computing. Vol 8(2): 133-139. http://www.ijmlc.org/vol8/676-L0125.pdf
EMIS (n.d.) Emerging Markets Information System. https://www.emis.com/
finnstats (2021-a) How to Remove Outliers in R. https://www.r-bloggers.com/2021/09/how-to-remove-outliers-in-r-3/
finnstats (2021-b) Class Imbalance-Handling Imbalanced Data in R.
https://www.r-bloggers.com/2021/05/class-imbalance-handling-imbalanced-data-in-r/
Frost, J. (n.d.-a) How to Identify the Distribution of Your Data.
https://statisticsbyjim.com/hypothesis-testing/identify-distribution-data/
GeeksforGeeks(2021) Kolmogorov-Smirnov Test in R. https://www.geeksforgeeks.org/kolmogorov-smirnov-test-in-r-programming/
Giannopoulos, G, Sigbjornsen, S. (2019) Prediction of Bankruptcy
Using Financial Ratios in the Greek Market. Theoretical Economics
Letters 9:1114-1128. https://eprints.kingston.ac.uk/id/eprint/43059/1/Giannopoulos-G-43059-VoR.pdf
Gosiewska, A.(n.d.) ModelOriented/auditor https://github.com/ModelOriented/auditor/blob/master/R/plot_prc.R
Groff, D. (2020) Bankruptcy Overview Process and Warning Signs. https://www.mossadams.com/articles/2020/06/bankruptcy-overview-process-and-warning-signs
hackerearth (n.d.) Beginners Tutorial on XGBoost and Parameter
Tuning in R. https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/
Harrell, F. (2020) Damage Caused by Classification Accuracy and
Other Discontinuous Improper Accuracy Scoring Rules. https://www.fharrell.com/post/class-damage/
Hauser, R.P., and Booth, D. (2011) Predicting Bankruptcy with Robust Logistic
Regression. J. Data Science 9: 565-584. https://www.jds-online.com/files/JDS-716.pdf
Hjelseth, I.N., Raknerud, A., and Vatne, B.H. (2022) A bankruptcy
probability model for assessing credit risk on corporate loans with
automated variable selection. Norges Bank Research. Working Paper
pp. 1-40. https://www.norges-bank.no/contentassets/b26854d9fce24f49b68182e121eed2eb/wp_07_2022.pdf?v=06/21/2022162855&ft=.pdf
Horak, J., Vrbka, J., and Suler, P. (2020) Support Vector Machine
Methods and Artificial Neural Networks Used for the Development of
Bankruptcy Prediction Models and their Comparison. J. Risk and Financial
Management Vol. 13(60): 1-15. https://www.mdpi.com/1911-8074/13/3/60/pdf
Horton, B. (2016) Calculating AUC: the area under a ROC Curve. https://www.r-bloggers.com/2016/11/calculating-auc-the-area-under-a-roc-curve/
Islam, M.S. (2020) Predictive capability of Financial Ratios for
forecasting of Corporate Bankruptcy. IOSR Jrnl. Bus. And Mngmt.
22(6):13-57. https://www.academia.edu/download/63759958/C220610135720200627-73076-19tr1g2.pdf
Jayadev, M. (2006) Predictive Power of Financial Risk Factors: An
Empirical Analysis of Default Companies. Vikalpa 31(3): 45-56. https://journals.sagepub.com/doi/pdf/10.1177/0256090920060304
Javaid, K. (2022) Cost to sales ratio. https://financiopedia.com/cost-to-sales-ratio/
Johnson, D. (2022) SAS vs. R: What is the Difference Between R and SAS?
https://www.guru99.com/sas-versus-r.html
Kieso, D.E., Weygandt, J.J., and Warfield, T.D. (2016) Intermediate Accounting:
1435-1439. 16th edition. Publisher: Wiley.
Kitowski, J., Kowal-Pawul, A., and Lichota, W. (2022) Identifying Symptoms of Bankruptcy Risk Based
on Bankruptcy Prediction Models- A Case Study in Poland. Sustainability
14(3): 1416. https://doi.org/10.3390/su14031416
KnowHow (2021) Testing the Assumptions of Logistic Regression using R. https://www.youtube.com/watch?v=jILEwqg2p3k
Kovacova, M., Kliestik,T., Valaskova, K., Durana, P., and Juhaszova, Z.
(2019). Systematic review of variables applied in bankruptcy prediction
models of Visegrad group countries. OECONOMIA Copernicana Vol. 10(4):
743-772. http://economic-research.pl/Journals/index.php/oc/article/download/1739/1630
Kumar, N. (2019) Advantages and Disadvantages of Random Forest
Algorithm in Machine Learning. https://theprofessionalspoint.blogspot.com/2019/02/advantages-and-disadvantages-of-random.html
Kumar, A. (2022) Accuracy, Precision, Recall & F1-Score-Python
Examples. https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/
Lantz, B. (2019-a) Machine Learning with R. Chapter 11: 359-374.
3rd edition. Publisher: Packt Publishing Ltd.
Lantz, B. (2019-b) Machine
Learning with R. Chap. 11: Improving Model Performance. 3rd edition.
Publisher: Packt Publishing Ltd.
Le, T. (2021) A comprehensive
survey of imbalanced learning methods for bankruptcy prediction. IET
Communications 16(5): 433-441. https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cmu2.12268
Leung, K. (2021) Assumptions of Logistic Regression, Clearly
Explained. https://towardsdatascience.com/assumptions-of-logistic-regression-clearly-explained-44d85a22b290
Liang, D., Lu, C.C., Tsai, C.F., and Shih, G.A. (2016) Financial
ratios and corporate governance indicators in bankruptcy prediction: A
comprehensive study. Eur. Jour. of Operational Res. Vol. 252: 561-572.
https://isslab.csie.ncu.edu.tw/download/publications/1.Financial%20Ratios%20and%20Corporate%20Governance%20Indicators%20in%20Bankruptcy%20Prediction-A%20Comprehensive%20Study.pdf
Liang, D., Tsai, C.F., Lu, H.Y.R., and Chang, L.S. (2020) Combining
Corporate Governance indicators with stacking ensembles for financial
distress prediction. J. Bus. Research Vol. 120:137-146. https://isslab.csie.ncu.edu.tw/download/publications/Combining%20corporate%20governance%20indicators%20with%20stacking%20ensembles%20for%20financial%20distress%20prediction.pdf
Lunardon, N., Menardi, G., and Torelli, N. (2021) Package ‘ROSE’.
https://cran.r-project.org/web/packages/ROSE/ROSE.pdf
Lunardon, N., Menardi, G., and Torelli, N. (2014) ROSE: A Package
for Binary Imbalanced Learning. The R Journal: Vol 6(1): 79-89. https://journal.r-project.org/archive/2014-1/menardi-lunardon-torelli.pdf
Marso, S., and El Merouani, M. (2020) Bankruptcy Prediction using
Hybrid Neural Networks with Artificial Bee Colony. Engineering Letters
28(4). http://www.engineeringletters.com/issues_v28/issue_4/EL_28_4_26.pdf
Menardi, G., and Torelli, N. (2010) Training and assessing
classification rules with unbalanced data. University Degli Studi Di
Trieste: Working Paper Series, N.2, 2010. Pp:1-28.
Mendekar, V.
(2021) Machine Learning-It’s all about assumptions. https://www.kdnuggets.com/2021/02/machine-learning-assumptions.html
Mendis, A. (2018) Using mlr for Machine Learning in R: A Step by
Step Approach for Decision Trees. https://towardsdatascience.com/decision-tree-classification-of-diabetes-among-the-pima-indian-community-in-r-using-mlr-778ae2f87c69
Pedamkar, P. (n.d.) R vs. Python. https://www.educba.com/r-vs-python/
Perceptive Analytics (2018) Dealing with The Problem of Multicollinearity in R. https://www.r-bloggers.com/2018/08/dealing-with-the-problem-of-multicollinearity-in-r/
Pipis, G. (2020) Undersampling by Groups in R. R-bloggers. https://www.r-bloggers.com/2020/11/undersampling-by-groups-in-r/
Poston, K.M., Harmon, W.K., and Gramlich, J.D. (1994) A Test of
Financial Ratios as Predictors of Turnaround versus Failure among
Financially Distressed Firms. J. Appl. Bus. Research Vol 10: 41-56. https://doi.org/10.19030/jabr.v10i1.5962
Prabhakaran, S. (2016) Outlier detection and treatment with R. https://www.r-bloggers.com/2016/12/outlier-detection-and-treatment-with-r/
Prabhakaran, S. (n.d.-a) Logistic Regression http://r-statistics.co/Logistic-Regression-With-R.html
Probst, P. (2017) Tuning random forest. R-bloggers. https://www.r-bloggers.com/2017/11/tuning-random-forest/
Rai, B. (2017) Handling Class Imbalance Problem in R: Improving
Predictive Model Performance https://www.youtube.com/watch?v=Ho2Klvzjegg
Rai, B.
(2017) eXtreme Gradient Boosting XGBoost Algorithm with R- Example in
Easy Steps with One-Hot Encoding. https://www.youtube.com/watch?v=woVTNwRrFHE
Ranganathan, P., Pramesh, C.S., and Aggarwal, R. (2017) Common pitfalls
in statistical analysis: Logistic regression. Perspect. Clin. Res. Vol
8(3): 148-151. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543767/
RDocumentation (n.d.-a) descdist: Description of an empirical
distribution for non-censored data. https://www.rdocumentation.org/packages/fitdistrplus/versions/1.1-8/topics/descdist
RDocumentation (n.d.-b) glm: Fitting Generalized Linear Models https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm
RDocumentation(n.d.-c) sample.split: Split Data into Test and Train
Set https://www.rdocumentation.org/packages/caTools/versions/1.17.1/topics/sample.split
RDocumentation(n.d.-d) https://www.rdocumentation.org/packages/ROSE/versions/0.0-4/topics/roc.curve
Rdocumentation(n.d.-e) cforest: Random Forest. https://www.rdocumentation.org/packages/party/versions/1.3-0/topics/cforest
rdrr.io(n.d.-a)ROSE-package: ROSE: Random Over-Sampling Examples https://rdrr.io/cran/ROSE/man/ROSE-package.html
rdrr.io(n.d.-b) ovun.sample: Over-sampling, under-sampling, combination
of over- and under. https://rdrr.io/cran/ROSE/man/ovun.sample.html
rdrr.io (n.d.-c) smote: SMOTE algorithm for unbalanced classification
problems. https://rdrr.io/cran/performanceEstimation/man/smote.html
rdrr.io(n.d.-d) mlr-package: mlr: Machine Learning in R. https://rdrr.io/cran/mlr/man/mlr-package.html
rdrr.io (n.d.-e) Partial dependence plot. https://rdrr.io/cran/randomForest/man/partialPlot.html
Rhys, H.I. (2020-a) Machine learning with R, tidyverse, and mlr.
Chapter 4: Classifying based on odds with logistic regression.
Publisher: Manning.
Rhys, H.I. (2020-b) Machine Learning with R,
the tidyverse, and mlr. Chapter 8: Improving decision trees with random
forests and boosting. pp. 186-204, 1st Edition. Manning Publications Co.
Rhys, H.I. (2020-c) Machine Learning with R, the tidyverse, and
mlr: Chap. 3. Classifying based on similarity with k-nearest neighbors.
1st edition. Publisher: Manning Publications Co.
Rossiter, D.G.
(2017) Tutorial: An example of statistical data analysis using the R
environment for statistical computing. http://www.css.cornell.edu/faculty/dgr2/_static/files/R_PDF/corregr.pdf
Ruopp, M., Perkins, N.J., Whitcomb, B.W. and Schisterman, E. (2008)
Youden Index and Optimal Cut-Point Estimated from Observations Affected
by a Lower Limit of Detection. Biom. J. 50(3): 419-430. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2515362/
Ruxton, G. D. (2006) The unequal variance t-test is an underused
alternative to Student’s t-test and the Mann-Whitney U test. Behavioral
Ecology Vol. 17(4): 688-690. https://academic.oup.com/beheco/article/17/4/688/215960
Saito, T., and Rehmsmeier, M. (2015) The Precision-Recall Plot is
More Informative than the ROC Plot When Evaluating Binary Classifiers on
Imbalanced Datasets. PLOS One. https://doi.org/10.1371/journal.pone.0118432. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
Sameeruddin, S. (2020) How Gradient Boosting Algorithm Works. https://dataaspirant.com/gradient-boosting-algorithm/
Sharma, N. (2018) Ways to Detect and Remove Outliers. https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
Shi, Y., and Li. X. (2019) An overview of bankruptcy prediction
models for corporate firms: A systematic review. https://upcommons.upc.edu/bitstream/handle/2117/176066/1354-5538-1-PB.pdf
Shirin’s playground(2017) Dealing with unbalanced data in machine
learning. R-bloggers. https://www.r-bloggers.com/2017/04/dealing-with-unbalanced-data-in-machine-learning/
Silipo, R. (2020) Cohen’s Kappa: What it is, when to use it and how
to avoid its pitfalls. https://towardsdatascience.com/cohens-kappa-what-it-is-when-to-use-it-and-how-to-avoid-its-pitfalls-e42447962bbc
stackoverflow(n.d.-a) Writing a for loop with the output as a data
frame in R. https://stackoverflow.com/questions/41889944/writing-a-for-loop-with-the-output-as-a-data-frame-in-r
StackExchange (n.d.-a) What distribution does my data follow? https://stats.stackexchange.com/questions/58220/what-distribution-does-my-data-follow
StackExchange (n.d.-b) ROSE and SMOTE oversampling methods. https://stats.stackexchange.com/questions/166458/rose-and-smote-oversampling-methods
stackOverflow(n.d.-p) Plot legend randomForest in R. https://stackoverflow.com/questions/39330728/plot-legend-random-forest-r
Stasinopoulos, D.M. and Rigby, R.A. (2007) Generalized Additive
Models for Location Scale and Shape (GAMLSS) in R. J. Statistical
Software Vol 23(7): 1-46. https://www.jstatsoft.org/article/download/v023i07/20764
Talk Stats(2014) When removing outliers, creates more, what then?
https://www.talkstats.com/threads/when-removing-influential-outliers-creates-more-what-then.56919/
The R Foundation for Statistical Computing (2021) R: Software
Development Life Cycle: A Description of R’s Development, Testing,
Release and Maintenance Processes. https://www.r-project.org/doc/R-SDLC.pdf
Tomczak, S.K., and Wilimowska, Z. (2016) Testing the Probability Distribution of
Financial Ratios. ISAT: Proc. 36th Intl. Conf. Infor. Sys. Arch. Tech.:
75-84.
UCI Machine Repository (n.d.-a) Polish companies bankruptcy
data Dataset. https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data#
Vanschoren, J. (n.d.) Machine Learning in R. https://joaquinvanschoren.github.io/ML-course-R/TutorialMLR.slides.html#/
Webb, J. (2017) Course Notes for IS6489, Statistics and Predictive
Analytics. Chap. 8 Logistic Regression. https://bookdown.org/jefftemplewebb/IS-6489/logistic-regression.html
Westfall, P. (2016) Re: What do I do if my data distribution is not
Normal?. Retrieved from: https://www.researchgate.net/post/What-do-I-do-if-my-data-distribution-is-not-Normal/584eae2feeae3934d93f477b/citation/download
Wikipedia (n.d.-a) Logistic regression. https://en.wikipedia.org/wiki/Logistic_regression
Zach(2020) The 6 Assumptions of Logistic Regression(with Examples). https://www.statology.org/assumptions-of-logistic-regression/
Zeya, L.T. (2021) Precision and Recall Made Simple. https://towardsdatascience.com/precision-and-recall-made-simple-afb5e098970f
Zhang, Y., Liu, R., Heidari, A.A., Wang, X., Chen, Y., Wang M., and
Chen, H. (2021) Towards augmented kernel extreme learning models for
bankruptcy prediction: Algorithmic behavior and comprehensive analysis.
Neurocomputing 430: 185-212. https://doi.org/10.1016/j.neucom.2020.10.038
Zhou, V. (2019) Decision tree learning: Gini impurity. https://victorzhou.com/blog/gini-impurity/
Zieba, M., Tomczak, S.K., and Tomczak, J.M. (2016) Ensemble Boosted Trees with
Synthetic Features Generation in Application to Bankruptcy Prediction.
Expert Systems with Applications Vol. 58:93-101. https://www.ii.pwr.edu.pl/~tomczak/PDF/%5BMZSTJT%5D.pdf
require(here)
## Loading required package: here
## here() starts at C:/WGU/post grad/final