PART I. INTRODUCTION: STATEMENT OF THE PROBLEM

Financial institutions, governments, lenders, investors, and businesses have long been concerned about credit risk and the fallout when companies default and ultimately go bankrupt. As a result, efforts to develop credible models that assess the likelihood of bankruptcy effectively and efficiently have intensified (Chen, N. et al., 2016). Given this urgency, this study was designed to determine whether a company’s bankruptcy can be predicted from its financial ratios. The dataset was taken from publicly available information on Polish manufacturing companies, which experienced a relatively high rate of bankruptcy in the period under study (2000-2013) (UCI Machine Repository, n.d.-a). This analysis focuses on the 4th year data, which contains 9,792 instances (ratios derived from each company’s financial statements), 64 predictor variables, and a corresponding dependent variable that indicates bankruptcy status after two years.

Limitations of this study include:
  1. The possibility that the financial ratios:
     a. do not capture essential indicators of bankruptcy;
     b. are static, overlapping representations of the firms’ solvency that focus on past performance and may contain errors and inconsistencies that limit their usefulness.
  2. In addition, this study presumes that the data preparation steps undertaken do not diminish the effectiveness of any emergent models and predictions (Sharma, 2018; Perceptive Analytics, 2018).
  3. There is no guarantee that the approaches taken, and the models and predictions developed, are the optimum solutions for answering the original hypothesis of bankruptcy predictability (Kovacova et al., 2019).

Reasonable goals of and expectations for this analysis

The expectation is that the models developed can reliably predict bankruptcy on new or unused data with some degree of efficiency. Underlying this expectation is the premise that identifying ‘bankrupters’ is the primary concern, while the misidentification of ‘non-bankrupters’ as bankrupters should be minimized. This priority will shape both model development and evaluation.
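The two priorities named above correspond to standard confusion-matrix measures: sensitivity (the share of bankrupters that are caught) and specificity (the share of non-bankrupters left unflagged). The following is a minimal sketch with made-up labels and predictions; none of the values come from this study.

```r
# Hypothetical labels and predictions (1 = bankrupt, 0 = solvent); not study data.
actual    <- factor(c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 0, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))

cm <- table(Predicted = predicted, Actual = actual)

# Sensitivity: share of true bankrupters that the model flags.
sensitivity <- cm["1", "1"] / sum(cm[, "1"])
# Specificity: share of solvent firms that the model leaves unflagged.
specificity <- cm["0", "0"] / sum(cm[, "0"])

c(sensitivity = sensitivity, specificity = specificity)
```

With a rare positive class, overall accuracy can look high even when sensitivity is poor, which is why the two measures are reported separately here.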

Secondly, and of significance, is the identification of those financial ratios which are the strongest contributors to an effective model. Financial ratios are static summarizations of key financial information on a company at a single point in time and have been used as indicators of financial distress in corporate bankruptcy prediction (Liang et al., 2016). Thus, an important secondary goal in answering this research question is to determine which ratios play a significant role in the bankruptcy prediction, so that a deeper understanding of the conditions underlying bankruptcy can be gained (Liang et al., 2020). Finally, this study integrates a deep analysis of the underlying data with a comparative appraisal of the output of several statistical and machine learning algorithms to assist in predicting potential company bankruptcies.

PART II. DATA COLLECTION

Data: The dataset was downloaded from, and can be accessed at, the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data#) (UCI Machine Repository, n.d.-a). The authors of this published dataset (Zieba et al., 2016) compiled EMIS (Emerging Market Information Systems) data (EMIS, n.d.) on Polish manufacturing company bankruptcies because of their relatively high failure rate in the time period under analysis. Their selection criteria were: the sector with the highest bankruptcy rate; the availability of databases (EMIS); for bankrupters, the availability of at least one financial report in the five years before bankruptcy; and for non-bankrupters, the availability of a minimum of three consecutive financial statements (2000-2012). The data was sorted into 5 classes:

1st year: financial rates from the 1st year of the forecasting period and class label that indicates bankruptcy status after 5 years (7,027 instances; 271 bankrupters; 6,756 non-bankrupters).

2nd year: financial rates from the 2nd year of the forecasting period and class label that indicates bankruptcy status after 4 years (10,173 instances; 400 bankrupters; 9,773 non-bankrupters).

3rd year: financial rates from the 3rd year of the forecasting period and class label that indicates bankruptcy status after 3 years (10,503 instances; 495 bankrupters; 10,008 non-bankrupters).

4th year: financial rates from the 4th year of the forecasting period and class label that indicates bankruptcy status after 2 years (9,792 instances; 515 bankrupters; 9,277 non-bankrupters). This is the subject of this study.

5th year: financial rates from the 5th year of the forecasting period and class label that indicates bankruptcy status after 1 year (5,910 instances; 410 bankrupters; 5,500 non-bankrupters).
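The degree of class imbalance in each yearly file follows directly from the counts listed above. A short sketch that tabulates it (the data frame and column names are chosen here for illustration):

```r
# Instance counts as listed above for the five yearly files.
counts <- data.frame(
  year     = 1:5,
  bankrupt = c(271, 400, 495, 515, 410),
  solvent  = c(6756, 9773, 10008, 9277, 5500)
)
counts$total        <- counts$bankrupt + counts$solvent
counts$pct_bankrupt <- round(100 * counts$bankrupt / counts$total, 2)

counts
```

The bankrupt share stays in the single digits for every year, so any model built on these files has to cope with a heavily unbalanced dependent variable.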

The dataset consists of 64 financial ratios as continuous, independent variables and a single, dependent categorical variable.

The bankrupt companies (taken from 2007-2013) and solvent companies (2000-2012) were examined 1 through 5 years prior to bankruptcy. Thus, the dataset consists of data for 5 years with the corresponding class label indicating bankruptcy status after 1-5 years, respectively.

Data Limitations: Pre-analysis
  1. The data collected covered a period (13 years in total) which may not be representative of bankruptcy conditions in other time frames (Zieba et al., 2016).

  2. The data was limited geographically (Poland) and to a particular industry with a relatively high rate of bankruptcy.

  3. The financial ratios used (feature selection), while comprehensive, are likely to be overlapping and redundant, and may still not capture essential indicators of bankruptcy. Furthermore, recent studies have indicated that other measures, such as corporate governance indicators, which improve model performance, are not present in this data (Liang et al., 2016).

  4. The financial ratios are static representations of the financial status of a company and may not be reliable indicators. Since the companies are de-identified, it is not possible to develop time series models, which might be more informative than the studies performed by others and than this analysis (Kovacova et al., 2019).

  5. Additionally, since the companies are de-identified, it is not possible to determine whether a company had already experienced financial deterioration (Poston et al., 1994) prior to the analysis of its financial statements. The only criterion here is that the company collapsed during a given period.

    Limitations of this analysis
  6. Since the criteria for this study precluded datasets with fewer than 7,000 observations, the 5th year dataset was eliminated. Similarly, the 1st year dataset was eliminated: although it initially contains 7,027 instances, it is unlikely to retain over 7,000 samples after data processing and preparation (e.g., eliminating samples with excessive missing data (NAs) and outliers).

  7. A static measure of corporate stability (i.e., financial ratios) is more likely to be significant the closer the firm is to bankruptcy. Although the length of financial instability for a firm cannot be ascertained (see pt. 2 Limitations), analyzing data as close as possible to collapse is most likely to find bankrupters, and hence to lead to the strongest model. Therefore, the 2nd year and 3rd year data were eliminated from consideration, and the 4th year data was selected for analysis.

  8. It is tempting to combine all datasets to arrive at a much larger set. However, given the complexity described above, this approach was rejected as being unlikely to produce a meaningful model.

SUMMARY

The 4th year data contains 9,792 instances (ratios derived from each company’s financial statements) and a corresponding class label that indicates bankruptcy status after 2 years. Thus, this data year offers a large sample whose financial ratios are more likely to reflect impending bankruptcy within a brief time frame (two years). The other years either contained too few samples (fewer than roughly 7,000) or were deemed too far from the bankruptcy date to provide meaningful predictive power.

PART III. DATA EXTRACTION AND PREPARATION

# Load standard R packages (others added as needed)
require(ggplot2)
require(tidytext)
require(textdata)
require(tidyverse)
require(dplyr)
require(readr)
require(xlsx)
require(readxl)
require(stats)
A. The data consists of a set of Attribute-Relation File Format (ARFF) files
require(RWeka)  ## needed to read ARFF files (used, but not shown)
## Loading required package: RWeka
getwd()
## [1] "C:/WGU/post grad/final"
Test_arf4<-read.csv('pbk_initial.csv')   
str(Test_arf4)##all variables shown in native form
## 'data.frame':    9792 obs. of  66 variables:
##  $ Obs_number: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Attr1     : num  0.1593 -0.1274 0.0705 0.1368 -0.1101 ...
##  $ Attr2     : num  0.462 0.462 0.236 0.405 0.698 ...
##  $ Attr3     : num  0.0777 0.2692 0.5278 0.3154 0.1888 ...
##  $ Attr4     : num  1.17 1.75 3.24 1.87 1.27 ...
##  $ Attr5     : num  -44.9 7.6 125.7 19.1 -15.3 ...
##  $ Attr6     : num  0.46702 0.000925 0.16367 0.50497 0 ...
##  $ Attr7     : num  0.1895 -0.1274 0.0869 0.1368 -0.1101 ...
##  $ Attr8     : num  0.829 1.163 2.872 1.454 0.433 ...
##  $ Attr9     : num  1.12 1.29 1.06 1.11 1.74 ...
##  $ Attr10    : num  0.383 0.538 0.677 0.589 0.302 ...
##  $ Attr11    : num  0.1895 -0.1232 0.0869 0.1368 -0.1031 ...
##  $ Attr12    : num  0.41 -0.356 0.369 0.377 -0.158 ...
##  $ Attr13    : num  0.1555 -0.0697 0.1048 0.1069 0.0883 ...
##  $ Attr14    : num  0.1895 -0.1274 0.0869 0.1368 -0.1101 ...
##  $ Attr15    : num  771 -1871 726 924 1663 ...
##  $ Attr16    : num  0.473 -0.195 0.503 0.395 0.219 ...
##  $ Attr17    : num  2.16 2.16 4.24 2.47 1.43 ...
##  $ Attr18    : num  0.1895 -0.1274 0.0869 0.1368 -0.1101 ...
##  $ Attr19    : num  0.1347 -0.0984 0.0768 0.0913 -0.0634 ...
##  $ Attr20    : num  46.8 67.2 51.4 59.5 19.4 ...
##  $ Attr21    : num  1.035 0.657 0.992 1.335 1.031 ...
##  $ Attr22    : num  0.1808 -0.0801 0.0766 0.1448 0 ...
##  $ Attr23    : num  0.1132 -0.0984 0.0623 0.0913 -0.0634 ...
##  $ Attr24    : num  0.576 NA 0.208 0.505 NA ...
##  $ Attr25    : num  0.3833 0.0891 0.6769 0.5894 0.2302 ...
##  $ Attr26    : num  0.408 -0.195 0.433 0.395 0.219 ...
##  $ Attr27    : num  1.442 -18.996 0.716 1.077 0 ...
##  $ Attr28    : num  0.169 0.722 2.232 0.979 1.636 ...
##  $ Attr29    : num  6.07 3.99 4.67 4.32 3.84 ...
##  $ Attr30    : num  0.3091 0.347 -0.0246 0.2187 0.3767 ...
##  $ Attr31    : num  0.1347 -0.0952 0.0768 0.0913 -0.1333 ...
##  $ Attr32    : num  134.5 92.7 80.4 98.4 261.4 ...
##  $ Attr33    : num  2.71 3.94 4.54 3.71 1.4 ...
##  $ Attr34    : num  0.391 3.049 0.325 0.357 1.392 ...
##  $ Attr35    : num  0.1808 -0.1157 0.0766 0.1448 0.0506 ...
##  $ Attr36    : num  1.48 1.29 1.2 1.51 1.74 ...
##  $ Attr37    : num  658.7 4.81 NA 10.08 NA ...
##  $ Attr38    : num  0.384 0.618 0.677 0.632 0.302 ...
##  $ Attr39    : num  0.1285 -0.0894 0.0678 0.0967 0.0292 ...
##  $ Attr40    : num  0.167 0.0424 1.1451 0.2185 0.2458 ...
##  $ Attr41    : num  0.0724 -0.3593 0.0716 0.0792 0.0884 ...
##  $ Attr42    : num  0.1285 -0.0619 0.0678 0.0967 0 ...
##  $ Attr43    : num  120 173 159 146 150 ...
##  $ Attr44    : num  73.1 105.4 108 86.4 130.7 ...
##  $ Attr45    : num  0.882 -0.535 0.443 0.56 -1.193 ...
##  $ Attr46    : num  0.777 1.086 2.564 1.197 1.139 ...
##  $ Attr47    : num  52.6 61.7 54.3 66.3 34.7 ...
##  $ Attr48    : num  0.152 -0.117 0.045 0.121 -0.263 ...
##  $ Attr49    : num  0.1077 -0.0907 0.0398 0.0811 -0.1517 ...
##  $ Attr50    : num  1.17 1.36 3.24 1.67 1.27 ...
##  $ Attr51    : num  0.462 0.358 0.236 0.362 0.696 ...
##  $ Attr52    : num  0.368 0.254 0.22 0.27 0.716 ...
##  $ Attr53    : num  0.833 1.442 2.862 1.829 2.618 ...
##  $ Attr54    : num  0.834 1.659 2.862 1.963 2.618 ...
##  $ Attr55    : num  90533 2625 24672 6650 1314 ...
##  $ Attr56    : num  0.109 -0.0894 0.0543 0.1026 0.4399 ...
##  $ Attr57    : num  0.416 -0.237 0.104 0.232 -0.364 ...
##  $ Attr58    : num  0.891 1.062 0.946 0.897 0.572 ...
##  $ Attr59    : num  0.00142 0.15041 0 0.07302 0 ...
##  $ Attr60    : num  7.79 5.43 7.11 6.14 18.8 ...
##  $ Attr61    : num  4.99 3.46 3.38 4.22 2.79 ...
##  $ Attr62    : num  119.8 101 76.1 88.3 146.4 ...
##  $ Attr63    : num  3.05 3.62 4.8 4.13 2.49 ...
##  $ Attr64    : num  3.06 3.47 4.78 4.65 15.04 ...
##  $ class     : int  0 0 0 0 0 0 0 0 0 0 ...
dim(Test_arf4)      # 9792 observations   66 variables
## [1] 9792   66
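The ARFF import mentioned in step A is not shown in the text. As a rough sketch of what such a step can look like, the example below writes a tiny, made-up ARFF file and reads it back with `read.arff()` from the `foreign` package, which ships with standard R installations. `RWeka::read.arff()` is an alternative that requires Java. The file contents and object names are illustrative assumptions, not the study's files.

```r
library(foreign)  # provides read.arff()

# Write a tiny, made-up ARFF file so the example is self-contained.
arff_path <- tempfile(fileext = ".arff")
writeLines(c(
  "@relation toy_bankruptcy",
  "@attribute Attr1 numeric",
  "@attribute Attr2 numeric",
  "@attribute class {0,1}",
  "@data",
  "0.15,0.46,0",
  "-0.12,?,1"          # '?' marks a missing value in ARFF
), arff_path)

toy <- read.arff(arff_path)
str(toy)               # numeric attributes plus a factor for the nominal class
```

Reading the ARFF file directly avoids an intermediate CSV export, which is where an extra index column such as Obs_number can be introduced.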
B. Locate missing values (NAs) in the data frame
plot(colSums(is.na(Test_arf4)))  ## plot the NA count for each variable column

sum(colSums(is.na(Test_arf4)))
## [1] 8776
(100*sum(colSums(is.na(Test_arf4)))/
sum(colSums(!is.na(Test_arf4))))  ## NAs as a percentage of the non-missing cells --> ~1.4%, so missingness is low overall
## [1] 1.376636

RESULTS:
1. One attribute column contains over 4,000 NAs.
2. Several attribute columns contain hundreds of NAs.
CONCLUSIONS:
1. Columns with excessive NAs will be dropped, since replacing so many NAs with the mean or median would leave the column with little useful information for analysis. The specific columns involved are identified next.
2. A cutoff of more than 300 NAs (~3% of observations) will be used.
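The per-column audit described in these results can be sketched on a small made-up data frame. Note that this sketch divides by the total number of cells, which is the usual definition of the overall missing share, whereas the call above divides by the number of non-missing cells. The `toy` data and the 50% cutoff are illustrative assumptions.

```r
# Toy data frame standing in for Test_arf4 (values are made up).
toy <- data.frame(
  Attr1 = c(0.1, NA, 0.3, 0.4),
  Attr2 = c(NA, NA, NA, 1.2),
  Attr3 = c(1, 2, 3, 4)
)

na_count <- colSums(is.na(toy))                      # NAs per column
na_pct   <- 100 * na_count / nrow(toy)               # share of rows missing, per column
overall  <- 100 * sum(is.na(toy)) / prod(dim(toy))   # NAs as a share of all cells

cutoff  <- 50                                        # hypothetical threshold, in percent
flagged <- names(na_pct)[na_pct > cutoff]            # columns above the threshold
flagged
```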

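# use which() to identify columns with >300 NAs
which(colSums(is.na(Test_arf4))>300)  ## names and column positions of variables with >300 NAs
## Attr27 Attr37 Attr45 Attr60 
##     28     38     46     61
# drop columns 27, 37, 45, 60
# (because Obs_number occupies column 1, positions 27, 37, 45, 60 hold Attr26, Attr36,
#  Attr44, Attr59; the variables flagged above sit at positions 28, 38, 46, 61)
yr4_v2<-Test_arf4[,-c(27,37,45,60)]
dim(yr4_v2)   # 62 columns remaining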
## [1] 9792   62
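Dropping columns by name instead of by position keeps the selection tied to the variables that were actually flagged, even when an index column shifts the positions. A minimal sketch on made-up data (the `toy` frame and the cutoff of 2 are assumptions for illustration):

```r
toy <- data.frame(
  Obs_number = 1:4,
  Attr1 = c(0.1, 0.2, 0.3, 0.4),
  Attr2 = c(NA, NA, NA, 1.2),
  class = c(0, 0, 1, 0)
)

max_na    <- 2                                          # hypothetical cutoff
drop_cols <- names(which(colSums(is.na(toy)) > max_na)) # flagged column names

# setdiff() on names keeps the selection correct even if columns are reordered
# or a leading index column changes the numeric positions.
toy_kept <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
names(toy_kept)
```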
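## find rows with >25% NAs (more than 15 of the 62 columns)
xyz<- rowSums(is.na(yr4_v2))   # NA count per row
xyz1<-which(xyz>15)            # positions of rows with more than 15 NAs
xyz1
##  [1]  981 1480 1740 1797 1807 1914 2899 3885 4020 4116 4132 5488 6145 6403 6615
## [16] 7491 8031 8653 9607 9619
## remove observations
## (the index used here is xyz, the per-row NA counts, not xyz1, the 20 flagged rows;
##  the rows removed are therefore those whose position equals one of the NA-count values)
yr4_v3<-yr4_v2[-c(xyz),]
dim(yr4_v3) # dropped 23 observations 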
## [1] 9769   62
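Row filtering on the share of missing values can also be written with a logical mask, which makes it explicit which rows are kept. This is a sketch on made-up data; the 50% cutoff is an assumption chosen for the toy example.

```r
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(1, NA, 3, 4),
  c = c(1, NA, NA, 4),
  d = c(1, 2, 3, 4)
)

na_per_row <- rowSums(is.na(toy))                 # NA count for each row
max_share  <- 0.5                                 # hypothetical cutoff
keep       <- na_per_row <= max_share * ncol(toy) # TRUE for rows to retain

toy_rows <- toy[keep, , drop = FALSE]
nrow(toy_rows)
```

A logical mask also behaves safely when no row is flagged, whereas negative indexing with an empty `which()` result returns zero rows.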
C. Continue with NA handling: replacement with the median

NAs remain after the column and row removal above: the count below shows 8,710 of the original 8,776 still in place. The final step is to replace them with the column median (codingProf.com, n.d.). The median is preferred to the mean because many variables do not exhibit a normal distribution.

## convert dependent variable to numeric (will be used for scaling also), replace NAs with median
require(dplyr)
yr4_v4<-yr4_v3
yr4_v4$class<-as.numeric(yr4_v4$class)
sum(is.na(yr4_v4))   # NAs remaining before imputation
## [1] 8710
yr4_v5<- yr4_v4 %>%
  mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))  # replace NAs with the column median
sum(is.na(yr4_v5)) # NAs remaining: 0
## [1] 0

RESULTS: 1. Replaced approximately 1.4% of the values in the data frame (8,710 of 605,678 cells) with the column median (minimal impact on data integrity).
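In current versions of dplyr, `mutate_if()` is superseded by `across()`. The sketch below shows the same median replacement in that style on a made-up data frame; it assumes dplyr 1.0.0 or later, and the `toy` data are illustrative only.

```r
library(dplyr)

toy <- data.frame(
  Attr1 = c(1, NA, 3, 100),
  Attr2 = c(NA, 2, 4, 6),
  label = c("a", "b", "c", "d")
)

# Replace NAs in every numeric column with that column's median;
# the non-numeric column is left untouched.
toy_imputed <- toy %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))

toy_imputed
```

The extreme value 100 in `Attr1` does not pull the imputed value upward, which illustrates why the median is the more robust choice for skewed ratios.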

D. Dealing with multicollinearity: find multicollinear variables in the dataset and remove those with cor > 0.90 or cor < -0.90
The correlation matrix is computed on the reduced dataset. As the first pruning step, the strategy is to find variables with correlation values > 0.90 or < -0.90 and decide which to eliminate. Within each correlated group, the variable that shows less correlation with the other variables will be kept, and the other correlated variables will be eliminated.

E. Normalization (scaling, centering) and correlation analysis

Since the variables vary in range, the existing data frame, yr4_v5, will be scaled and centered (normalization) using the scale() function.

# scaling, centering (every column is scaled, including Obs_number and class)
yr4_v5_scdf<-scale(yr4_v5, center=TRUE)
## correlation matrix
require(corrplot)
## Loading required package: corrplot
## corrplot 0.92 loaded
cor_4v5<-cor(yr4_v5_scdf)
cor_4v5_pl <-corrplot(cor_4v5)
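The `scale()` call above is applied to the whole data frame. One possible variation, sketched here on made-up data, standardizes only the predictor columns and leaves an identifier and the class label untouched. The column names and values are assumptions for illustration.

```r
toy <- data.frame(
  Obs_number = 1:5,
  Attr1 = c(0.1, 0.4, 0.2, 0.9, 0.5),
  Attr2 = c(10, 200, 35, 80, 15),
  class = c(0, 0, 1, 0, 1)
)

# Columns to standardize: everything except the identifier and the label.
predictors <- setdiff(names(toy), c("Obs_number", "class"))

toy_scaled <- toy
toy_scaled[predictors] <- as.data.frame(scale(toy[predictors]))  # mean 0, sd 1 per column

cor_mat <- cor(toy_scaled[predictors])   # correlation among predictors only
cor_mat
```

Pearson correlations are unchanged by centering and scaling, so the correlation matrix is the same whether it is computed before or after this step.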


Results: There are several ways to select variables for removal in order to reduce multicollinearity (MC) (Perceptive Analytics, 2018). The challenge is to retain as much variation as possible in the dataset while reducing the impact of MC. For this dataset, the variables showing the strongest correlation (> 0.90 or < -0.90) were identified.

## strongly positively correlated (> 0.90)
N<-nrow(cor_4v5)
xy_p<-vector('list', N)  ## (stackoverflow, n.d.-a)
class(xy_p)
## [1] "list"
for(i in rownames(cor_4v5)){   ## for each variable, collect the variables correlated with it above 0.90
  xy_p[[i]]<-data.frame(row.names(cor_4v5)[which(cor_4v5[,i] > 0.90)])
}
xy_p<-do.call(rbind, xy_p) 
colnames(xy_p)<-c('strong correlation')
xy_p ## output not shown due to length
## strongly negatively correlated (< -0.90)
xy_n<-vector('list', N)
for(i in rownames(cor_4v5)){   ## same loop for correlations below -0.90
  xy_n[[i]]<-data.frame(row.names(cor_4v5)[which(cor_4v5[,i] < (-0.90))])
}
xy_n<-do.call(rbind, xy_n)
colnames(xy_n)<-c('strong neg. correlation')
xy_n 
##          strong neg. correlation
## Attr2.1                    Attr3
## Attr2.2                   Attr10
## Attr2.3                   Attr25
## Attr2.4                   Attr38
## Attr3.1                    Attr2
## Attr3.2                   Attr51
## Attr10.1                   Attr2
## Attr10.2                  Attr51
## Attr25.1                   Attr2
## Attr25.2                  Attr51
## Attr38.1                   Attr2
## Attr38.2                  Attr51
## Attr51.1                   Attr3
## Attr51.2                  Attr10
## Attr51.3                  Attr25
## Attr51.4                  Attr38
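The listing above reports each correlated pair twice, once from each variable's side. A compact alternative, sketched here on simulated data, masks the lower triangle and the diagonal so that each pair appears once together with its correlation. The variables `A` to `D` are made up for illustration.

```r
set.seed(1)
x <- rnorm(50)
toy <- data.frame(
  A = x,
  B = x + rnorm(50, sd = 0.05),    # nearly identical to A
  C = -x + rnorm(50, sd = 0.05),   # nearly the mirror image of A
  D = rnorm(50)                    # unrelated
)

cm <- cor(toy)
cm[lower.tri(cm, diag = TRUE)] <- NA            # keep each pair once, drop the diagonal

hits  <- which(abs(cm) > 0.90, arr.ind = TRUE)  # positions of strong correlations
pairs <- data.frame(
  var1 = rownames(cm)[hits[, "row"]],
  var2 = colnames(cm)[hits[, "col"]],
  r    = cm[hits]
)
pairs
```

Using `abs()` covers the positive and negative thresholds in one pass; the sign of `r` still shows the direction.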

RESULTS: 1. Some financial ratios showed perfect correlation and were identical.

2. Using domain expertise (Kieso et al., 2016), when two variables show extremely high correlation, the variable with the simplest financial ratio is selected. For example, Attr7 showed high correlation with Attr11, Attr14, and Attr22. Attr7 is EBIT/total assets, while Attr11 is (gross profit + extraordinary items + financial expenses)/total assets. Therefore, Attr7 was selected.

CONCLUSIONS: 1. The dataset after removing multicollinear variables contains 39 independent variables and 1 dependent variable.

#eliminate MC variables
yr4_v6_scdf<-yr4_v5_scdf[,c(1:7, 9, 12:13, 15, 17:21,24,27:29,31,33:34,37:41,45:46,49:50,52:56, 57,60)]
dim(yr4_v6_scdf)
## [1] 9769   39
yr4_v6<-yr4_v6_scdf  ## simplify data frame name
dim(yr4_v6)  ## confirm dimensions
## [1] 9769   39
str(yr4_v6)
##  num [1:9769, 1:39] -1.73 -1.73 -1.73 -1.73 -1.73 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:9769] "22" "23" "25" "26" ...
##   ..$ : chr [1:39] "Obs_number" "Attr1" "Attr2" "Attr3" ...

IV. Removing potentially troublesome outliers

Outliers can influence ML models and predictions and must be considered for removal. In this dataset, to preserve as much variation as possible, only observations with z-scores > 10 or < -10 are removed, using the following code. It is particularly important to remove observations in which multiple variables meet this criterion, especially for any type of regression analysis (finnstats, 2021-a).

# rebind dependent variable (class); cbind() leaves the new column unnamed, so it becomes V40
yr4_v6c<-as.data.frame(cbind(yr4_v6, yr4_v3$class)) 
yr4_v6c$V40<-as.factor(yr4_v6c$V40)
require(data.table)
## Loading required package: data.table
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
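setnames(yr4_v6c,'V40','class')  ## rename V40 to class in the new data frame
## recode class: the comparison maps the value 2 to 1 and everything else to 0;
## yr4_v3$class is coded 0/1, so every row is set to 0 here (see the table below)
yr4_v6c$class<-ifelse(yr4_v6c$class==2,1,0)
yr4_v6c$class<-as.factor(yr4_v6c$class)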
table(yr4_v6c$class)
## 
##    0 
## 9769
str(yr4_v6c)
## 'data.frame':    9769 obs. of  40 variables:
##  $ Obs_number: num  -1.73 -1.73 -1.73 -1.73 -1.73 ...
##  $ Attr1     : num  0.0512 1.0661 -0.1048 -0.0379 0.1603 ...
##  $ Attr2     : num  -0.0527 -0.0503 -0.035 0.0526 0.022 ...
##  $ Attr3     : num  0.10997 0.09507 -0.00721 0.00528 0.06091 ...
##  $ Attr4     : num  -0.018 -0.0185 -0.0234 -0.0239 -0.022 ...
##  $ Attr5     : num  -1.89e-03 3.38e-05 -5.15e-03 -4.96e-03 -1.12e-03 ...
##  $ Attr6     : num  0.00874 -0.03336 0.00874 0.00874 0.00874 ...
##  $ Attr8     : num  -0.0259 -0.026 -0.0266 -0.0282 -0.0278 ...
##  $ Attr11    : num  0.0395 0.1927 -0.1162 -0.0173 0.1002 ...
##  $ Attr12    : num  0.000306 0.004484 -0.002472 -0.002249 0.000225 ...
##  $ Attr14    : num  0.0337 0.2148 -0.0964 -0.0431 0.1291 ...
##  $ Attr16    : num  -0.00443 -0.00033 -0.00642 -0.00725 -0.00551 ...
##  $ Attr17    : num  -0.026 -0.0261 -0.0267 -0.0283 -0.028 ...
##  $ Attr18    : num  0.0175 0.1487 -0.0768 -0.0382 0.0866 ...
##  $ Attr19    : num  0.00232 0.00377 0.00117 0.00121 0.00318 ...
##  $ Attr20    : num  0.05757 -0.08432 -0.00937 -0.09501 -0.09488 ...
##  $ Attr23    : num  0.00429 0.01086 0.00325 0.00335 0.00499 ...
##  $ Attr27    : num  -0.0342 -0.0326 -0.0343 -0.0343 0.1343 ...
##  $ Attr28    : num  0.0794 -0.0113 -0.0439 0.1015 0.0645 ...
##  $ Attr29    : num  -0.667 -0.117 0.427 -0.854 1.366 ...
##  $ Attr31    : num  -3.44e-05 1.46e-03 -9.98e-04 -1.04e-03 8.24e-04 ...
##  $ Attr33    : num  -0.0412 -0.028 -0.0855 -0.0699 -0.0725 ...
##  $ Attr34    : num  0.00423 0.01217 -0.05602 -0.02405 -0.03808 ...
##  $ Attr38    : num  0.0346 0.0374 0.0263 -0.0709 -0.0403 ...
##  $ Attr39    : num  0.0144 0.0149 0.0139 0.0142 0.0148 ...
##  $ Attr40    : num  -0.0317 -0.0221 -0.0347 -0.0347 -0.0266 ...
##  $ Attr41    : num  -0.0202 -0.022 -0.0162 -0.0109 -0.0186 ...
##  $ Attr42    : num  0.0289 0.0316 0.027 0.0274 0.0303 ...
##  $ Attr47    : num  -0.0126 -0.0398 -0.027 -0.0429 -0.0421 ...
##  $ Attr48    : num  0.1086 0.2546 -0.059 0.0924 0.2282 ...
##  $ Attr51    : num  -0.03 -0.0326 -0.0373 0.0808 0.0211 ...
##  $ Attr52    : num  -0.0109 -0.011 -0.0107 -0.0108 -0.0108 ...
##  $ Attr54    : num  0.0783 -0.0105 -0.0434 0.105 0.0317 ...
##  $ Attr55    : num  -0.0815 -0.0491 -0.0744 -0.0977 0.6198 ...
##  $ Attr56    : num  0.0133 0.0139 0.0128 0.0131 0.0138 ...
##  $ Attr57    : num  0.00665 0.07105 -0.00291 0.01632 0.03321 ...
##  $ Attr58    : num  -0.0215 -0.0262 -0.0179 -0.0181 -0.0243 ...
##  $ Attr60    : num  -0.0343 -0.0319 -0.0337 -0.0313 -0.0313 ...
##  $ Attr63    : num  -0.0474 -0.0268 -0.1017 -0.0825 -0.0825 ...
##  $ class     : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
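When the label is already coded 0/1, it can be re-attached to the scaled predictors as a two-level factor without an intermediate recoding step. A minimal sketch on made-up values (object names are assumptions for illustration):

```r
# Scaled predictors as a matrix, as returned by scale(); values are made up.
scaled <- matrix(c(-1, 0, 1, 0.5, -0.5, 0), ncol = 2,
                 dimnames = list(NULL, c("Attr1", "Attr2")))
class_raw <- c(0L, 1L, 0L)                    # 0 = solvent, 1 = bankrupt (toy values)

df <- as.data.frame(scaled)
df$class <- factor(class_raw, levels = c(0, 1))   # explicit levels keep both classes

table(df$class)
```

Declaring the levels explicitly means both classes remain visible in `table()` and `str()` even if a subset happens to contain only one of them.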
## find observations with z-score > 10 or < -10
yr4_6c_z10<-as.data.frame(which(yr4_v6c[,1:39] >10 |  yr4_v6c[,1:39] <(-10) , arr.ind = TRUE))
dim(yr4_6c_z10)  ## one row per (observation, variable) pair beyond the cutoff; saved for later analysis
## [1] 187   2
dismiss_z10<-unique(yr4_6c_z10$row)  ## select out unique observations
length(dismiss_z10)  ## number of unique observations
## [1] 94
## remove these observations (outliers) from dataset
yr4_v7<-yr4_v6c[-c(dismiss_z10),]
dim(yr4_v7)
## [1] 9675   40
dismiss_kp <-yr4_v6c[c(dismiss_z10, xyz1),]  ## combine all discarded observations in a dataset for analysis
dim(dismiss_kp)
## [1] 114  40
## [1] 114  40

RESULTS: 1. One hundred and eighty-seven (187) instances of a variable exceeding a z-score of +/- 10 were observed. 2. These fall in ninety-four (94) unique observations, indicating that some observations contained multiple outlier scores. 3. These outliers were segregated, along with the observations flagged earlier (xyz1), into a new dataset for further analysis. It is anticipated that more outliers will be identified during the next phase of modeling.
CONCLUSIONS: 1. The rationale for removing outliers is clear: outliers can skew both models and predictions (Bobbitt, 2021-a). The impact of their removal on the dataset will be analyzed once outlier elimination is complete (next section). 2. While several methods to remove outliers have been proposed (e.g., visual plot inspection, z-score analysis, IQR analysis) (finnstats, 2021-a), the first two were chosen. The IQR analysis approach did not yield consistent results given the skewness of many of the variables in this dataset (data not shown).
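The z-score filter described in this section can be sketched with a logical mask that splits the data into kept and discarded observations in one step. The data below are simulated, with two extreme values planted deliberately; only the cutoff of 10 is taken from the text.

```r
set.seed(42)
toy <- data.frame(a = rnorm(200), b = rnorm(200))
toy$a[5]  <- 60       # planted extreme values
toy$b[17] <- -75

z     <- scale(toy)                            # z-scores per column
limit <- 10                                    # same cutoff as in the text
is_outlier <- rowSums(abs(z) > limit) > 0      # TRUE if any variable is beyond the cutoff

kept      <- toy[!is_outlier, ]
discarded <- toy[is_outlier, ]
c(kept = nrow(kept), discarded = nrow(discarded))
```

Because the extreme values themselves inflate the standard deviation used in the z-score, a cutoff of 10 only catches very pronounced outliers, which matches the stated aim of preserving as much variation as possible.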

V. An analysis of the modified dataset at this stage


While other changes are likely to occur to the dataset during the modeling process, it is imperative to explore descriptive statistics and a simple analysis of the dataset at this point.
a. SUMMARY STATISTICS: compare the dataset with the discarded observations


summary(yr4_v7[,31:40])##display last 10 variables in dataset
##      Attr51             Attr52             Attr54             Attr55        
##  Min.   :-0.10293   Min.   :-0.01115   Min.   :-7.20627   Min.   :-9.46878  
##  1st Qu.:-0.06221   1st Qu.:-0.01101   1st Qu.:-0.04374   1st Qu.:-0.10065  
##  Median :-0.02905   Median :-0.01090   Median :-0.04080   Median :-0.08843  
##  Mean   :-0.01100   Mean   :-0.01048   Mean   :-0.02272   Mean   :-0.01567  
##  3rd Qu.: 0.01336   3rd Qu.:-0.01075   3rd Qu.:-0.03393   3rd Qu.:-0.03916  
##  Max.   : 5.47406   Max.   : 0.42211   Max.   : 8.58496   Max.   : 9.48337  
##      Attr56             Attr57              Attr58             Attr60        
##  Min.   :-0.85229   Min.   :-7.443143   Min.   :-3.98274   Min.   :-0.03564  
##  1st Qu.: 0.01293   1st Qu.:-0.002983   1st Qu.:-0.03062   1st Qu.:-0.03386  
##  Median : 0.01345   Median : 0.006961   Median :-0.02178   Median :-0.03261  
##  Mean   : 0.01325   Mean   : 0.010648   Mean   :-0.01945   Mean   :-0.01561  
##  3rd Qu.: 0.01440   3rd Qu.: 0.023143   3rd Qu.:-0.01709   3rd Qu.:-0.02976  
##  Max.   : 1.46739   Max.   : 8.077145   Max.   : 8.40330   Max.   : 8.54421  
##      Attr63          class   
##  Min.   :-0.142788   0:9675  
##  1st Qu.:-0.093592           
##  Median :-0.062581           
##  Mean   :-0.012177           
##  3rd Qu.:-0.004941           
##  Max.   : 8.539004
summary(dismiss_kp[,31:40])##display last 10 variables
##      Attr51             Attr52             Attr54             Attr55        
##  Min.   :-0.10293   Min.   :-0.01115   Min.   :-0.19397   Min.   :-0.43634  
##  1st Qu.:-0.08432   1st Qu.:-0.01105   1st Qu.:-0.04311   1st Qu.:-0.10034  
##  Median :-0.05014   Median :-0.01086   Median :-0.04080   Median :-0.09327  
##  Mean   : 0.92995   Mean   : 0.88793   Mean   : 1.92169   Mean   : 1.33495  
##  3rd Qu.: 0.02662   3rd Qu.:-0.01011   3rd Qu.:-0.03098   3rd Qu.:-0.07090  
##  Max.   :97.90405   Max.   :98.81097   Max.   :80.79225   Max.   :80.33131  
##      Attr56              Attr57              Attr58             Attr60        
##  Min.   :-97.65662   Min.   :-66.71620   Min.   :-0.14096   Min.   :-0.03564  
##  1st Qu.:  0.01242   1st Qu.: -0.00658   1st Qu.:-0.04739   1st Qu.:-0.03415  
##  Median :  0.01345   Median :  0.00416   Median :-0.02178   Median :-0.03261  
##  Mean   : -1.12221   Mean   : -0.91607   Mean   : 1.64620   Mean   : 1.31939  
##  3rd Qu.:  0.01569   3rd Qu.:  0.02281   3rd Qu.:-0.01279   3rd Qu.:-0.03132  
##  Max.   :  0.02587   Max.   : 25.31767   Max.   :83.02583   Max.   :80.30797  
##      Attr63         class  
##  Min.   :-0.14239   0:114  
##  1st Qu.:-0.13183          
##  Median :-0.08733          
##  Mean   : 1.03146          
##  3rd Qu.:-0.01273          
##  Max.   :93.02601

RESULTS: 1. The removal of the outliers produced a small but perceptible shift in the mean (center = 0.0) and a dramatic reduction in the min/max range. 2. The class split: yr4_v7: 5.25% bankrupters; dismiss_kp: 7.14%. 3. The outlier dataset has large max/min values and means that lie much farther from zero.
CONCLUSIONS: 1. The outlier dataset differs markedly from the regular dataset. 2. Although the outlier data is slightly enriched in bankrupters, its removal does not substantially alter the proportion of bankrupters.
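Whether two subsets differ in their share of bankrupters can be checked with a test of proportions rather than by inspection. The sketch below uses Fisher's exact test on hypothetical counts; in practice the counts would come from `table(yr4_v7$class)` and `table(dismiss_kp$class)`.

```r
# Hypothetical counts, not the study's figures.
counts <- matrix(c(950, 50,      # kept:      solvent, bankrupt
                   93,  7),      # discarded: solvent, bankrupt
                 nrow = 2, byrow = TRUE,
                 dimnames = list(set   = c("kept", "discarded"),
                                 class = c("solvent", "bankrupt")))

round(100 * prop.table(counts, margin = 1), 2)   # row percentages

ft <- fisher.test(counts)   # exact test of equal bankrupt share in both sets
ft$p.value
```

A large p-value would be consistent with the conclusion that removing the outliers leaves the class balance essentially unchanged.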
b. Distribution of full data set

i. Histograms


Histograms provide insight into the frequency distribution, skewness, and overall spread of the data (Frost, n.d.-a). Since outliers compress the visualization, the z-score range is reduced to -2 to 2 for these plots.

newdrop<-as.data.frame(which(yr4_v7[,1:39] > 2 |  yr4_v7[,1:39] <(-2) , arr.ind = TRUE))  ## identify observations with a z-score outside 2 S.D. for any variable
str(newdrop)
## 'data.frame':    1271 obs. of  2 variables:
##  $ row: int  26 96 146 284 406 458 601 766 835 1032 ...
##  $ col: int  2 2 2 2 2 2 2 2 2 2 ...
dim(newdrop)
## [1] 1271    2
yr4_v8<-yr4_v7[-c(newdrop$row),]  ## drop every observation flagged above
dim(yr4_v8)
## [1] 8872   40
str(yr4_v8)
## 'data.frame':    8872 obs. of  40 variables:
##  $ Obs_number: num  -1.73 -1.73 -1.73 -1.73 -1.73 ...
##  $ Attr1     : num  0.0512 1.0661 -0.1048 -0.0379 0.1603 ...
##  $ Attr2     : num  -0.0527 -0.0503 -0.035 0.0526 0.022 ...
##  $ Attr3     : num  0.10997 0.09507 -0.00721 0.00528 0.06091 ...
##  $ Attr4     : num  -0.018 -0.0185 -0.0234 -0.0239 -0.022 ...
##  $ Attr5     : num  -1.89e-03 3.38e-05 -5.15e-03 -4.96e-03 -1.12e-03 ...
##  $ Attr6     : num  0.00874 -0.03336 0.00874 0.00874 0.00874 ...
##  $ Attr8     : num  -0.0259 -0.026 -0.0266 -0.0282 -0.0278 ...
##  $ Attr11    : num  0.0395 0.1927 -0.1162 -0.0173 0.1002 ...
##  $ Attr12    : num  0.000306 0.004484 -0.002472 -0.002249 0.000225 ...
##  $ Attr14    : num  0.0337 0.2148 -0.0964 -0.0431 0.1291 ...
##  $ Attr16    : num  -0.00443 -0.00033 -0.00642 -0.00725 -0.00551 ...
##  $ Attr17    : num  -0.026 -0.0261 -0.0267 -0.0283 -0.028 ...
##  $ Attr18    : num  0.0175 0.1487 -0.0768 -0.0382 0.0866 ...
##  $ Attr19    : num  0.00232 0.00377 0.00117 0.00121 0.00318 ...
##  $ Attr20    : num  0.05757 -0.08432 -0.00937 -0.09501 -0.09488 ...
##  $ Attr23    : num  0.00429 0.01086 0.00325 0.00335 0.00499 ...
##  $ Attr27    : num  -0.0342 -0.0326 -0.0343 -0.0343 0.1343 ...
##  $ Attr28    : num  0.0794 -0.0113 -0.0439 0.1015 0.0645 ...
##  $ Attr29    : num  -0.667 -0.117 0.427 -0.854 1.366 ...
##  $ Attr31    : num  -3.44e-05 1.46e-03 -9.98e-04 -1.04e-03 8.24e-04 ...
##  $ Attr33    : num  -0.0412 -0.028 -0.0855 -0.0699 -0.0725 ...
##  $ Attr34    : num  0.00423 0.01217 -0.05602 -0.02405 -0.03808 ...
##  $ Attr38    : num  0.0346 0.0374 0.0263 -0.0709 -0.0403 ...
##  $ Attr39    : num  0.0144 0.0149 0.0139 0.0142 0.0148 ...
##  $ Attr40    : num  -0.0317 -0.0221 -0.0347 -0.0347 -0.0266 ...
##  $ Attr41    : num  -0.0202 -0.022 -0.0162 -0.0109 -0.0186 ...
##  $ Attr42    : num  0.0289 0.0316 0.027 0.0274 0.0303 ...
##  $ Attr47    : num  -0.0126 -0.0398 -0.027 -0.0429 -0.0421 ...
##  $ Attr48    : num  0.1086 0.2546 -0.059 0.0924 0.2282 ...
##  $ Attr51    : num  -0.03 -0.0326 -0.0373 0.0808 0.0211 ...
##  $ Attr52    : num  -0.0109 -0.011 -0.0107 -0.0108 -0.0108 ...
##  $ Attr54    : num  0.0783 -0.0105 -0.0434 0.105 0.0317 ...
##  $ Attr55    : num  -0.0815 -0.0491 -0.0744 -0.0977 0.6198 ...
##  $ Attr56    : num  0.0133 0.0139 0.0128 0.0131 0.0138 ...
##  $ Attr57    : num  0.00665 0.07105 -0.00291 0.01632 0.03321 ...
##  $ Attr58    : num  -0.0215 -0.0262 -0.0179 -0.0181 -0.0243 ...
##  $ Attr60    : num  -0.0343 -0.0319 -0.0337 -0.0313 -0.0313 ...
##  $ Attr63    : num  -0.0474 -0.0268 -0.1017 -0.0825 -0.0825 ...
##  $ class     : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...


Visualizations of histograms

#loop through
#counter=0
#par(mfrow=c(2,3))
#for(i in colnames(yr4_v8[,c(1:3,6,19,30)])){
#  x<-hist(yr4_v8[,i], breaks=100)
#  counter = counter +1
#}
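An alternative to removing whole observations before plotting is to clip each variable to the display range separately, so that the underlying dataset stays intact. The following sketch draws such histograms on simulated z-scores; the variable names and distributions are made up for illustration.

```r
set.seed(7)
toy <- data.frame(Attr1 = rnorm(500), Attr2 = rexp(500) - 1)   # made-up z-scores

op <- par(mfrow = c(1, 2))                     # two panels side by side
for (v in names(toy)) {
  inside <- toy[[v]][abs(toy[[v]]) <= 2]       # clip per variable, for display only
  hist(inside, breaks = 50, main = v, xlab = "z-score")
}
par(op)                                        # restore the previous layout
```

Because the clipping happens inside the loop, `toy` itself is never modified and later statistics can still be computed on the full data.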


SUMMARY: Most of the distributions are asymmetric. This is not surprising, given that financial ratios (often with different feature selection) exhibit a wide range of skewness and kurtosis (Tomczak & Wilimowska, 2016).
Representative histograms selected for further review (Attribute): Attr 1, 2, 3, 6, 29, 49.
For these examples, the skewness, kurtosis, and the Jarque-Bera Normality Test are calculated (Bobbitt, 2020). Skewness measures the asymmetry of the distribution, with (-) indicating left skew (tail to the left) and (+) indicating right skew (tail to the right). Kurtosis measures the degree to which the data clusters in the tails of the distribution in comparison to a normal distribution (tail heaviness). A normal distribution has a skewness of zero and a kurtosis of 3.
The Jarque-Bera test (JBT) is a goodness-of-fit test that determines whether sample data has the skewness and kurtosis of a normal distribution. Figure 1-9. Snippet JBT for variables.
Null hypothesis (H0): The data has a skewness and kurtosis that match a normal distribution. Alternative hypothesis (Ha): The data has a skewness and kurtosis that do not match a normal distribution.
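To make the test statistic concrete, the sketch below computes skewness, kurtosis, and the Jarque-Bera statistic in base R using the standard formula JB = n/6 * (S^2 + (K - 3)^2 / 4), with an asymptotic chi-square p-value on 2 degrees of freedom. The function name `jb_by_hand` is hypothetical, and the simulated samples are for illustration only.

```r
jb_by_hand <- function(x) {
  x  <- x[!is.na(x)]
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)                        # central moments with divisor n
  S  <- mean(d^3) / m2^1.5               # skewness
  K  <- mean(d^4) / m2^2                 # kurtosis (normal = 3)
  JB <- n / 6 * (S^2 + (K - 3)^2 / 4)    # Jarque-Bera statistic
  p  <- pchisq(JB, df = 2, lower.tail = FALSE)   # asymptotic p-value
  c(skewness = S, kurtosis = K, JB = JB, p.value = p)
}

set.seed(123)
jb_by_hand(rnorm(1000))   # roughly normal sample
jb_by_hand(rexp(1000))    # right-skewed sample
```

As a consistency check against the output that follows, inserting n = 8872 with the first reported skewness (-0.01062078) and kurtosis (1.793283) into the same formula gives JB of about 538.46, which agrees with the first test result.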

require(moments)
## Loading required package: moments
counter=0
for(i in colnames(yr4_v8[,c(1:3,6,19,30)])){
  x<-skewness(yr4_v8[,i])
  y<-kurtosis(yr4_v8[,i])
  z<-jarque.test(yr4_v8[,i])
  counter = counter +1
  print(counter); print(x); print(y); print(z) }
## [1] 1
## [1] -0.01062078
## [1] 1.793283
## 
##  Jarque-Bera Normality Test
## 
## data:  yr4_v8[, i]
## JB = 538.46, p-value < 2.2e-16
## alternative hypothesis: greater
## 
## [1] 2
## [1] -0.1154087
## [1] 7.25985
## 
##  Jarque-Bera Normality Test
## 
## data:  yr4_v8[, i]
## JB = 6727.8, p-value < 2.2e-16
## alternative hypothesis: greater
## 
## [1] 3
## [1] 3.348483
## [1] 31.85472
## 
##  Jarque-Bera Normality Test
## 
## data:  yr4_v8[, i]
## JB = 324362, p-value < 2.2e-16
## alternative hypothesis: greater
## 
## [1] 4
## [1] -0.9622606
## [1] 165.2058
## 
##  Jarque-Bera Normality Test
## 
## data:  yr4_v8[, i]
## JB = 9727566, p-value < 2.2e-16
## alternative hypothesis: greater
## 
## [1] 5
## [1] 8.807446
## [1] 159.3492
## 
##  Jarque-Bera Normality Test
## 
## data:  yr4_v8[, i]
## JB = 9151230, p-value < 2.2e-16
## alternative hypothesis: greater
## 
## [1] 6
## [1] -0.4170273
## [1] 8.688804
## 
##  Jarque-Bera Normality Test
## 
## data:  yr4_v8[, i]
## JB = 12220, p-value < 2.2e-16
## alternative hypothesis: greater
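
As a cross-check on output of this kind, the Jarque-Bera statistic can be computed directly from the sample skewness and kurtosis. The following is a minimal sketch in base R on simulated data; the helper name `jb_by_hand` and the simulated vectors are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis): moment-based
# skewness, kurtosis, and the Jarque-Bera statistic in base R.
jb_by_hand <- function(x) {
  x  <- x[!is.na(x)]
  n  <- length(x)
  m2 <- mean((x - mean(x))^2)            # second central moment
  sk <- mean((x - mean(x))^3) / m2^1.5   # sample skewness (normal = 0)
  ku <- mean((x - mean(x))^4) / m2^2     # sample kurtosis (normal = 3)
  jb <- n / 6 * (sk^2 + (ku - 3)^2 / 4)  # Jarque-Bera statistic
  p  <- pchisq(jb, df = 2, lower.tail = FALSE)  # asymptotic chi-square(2) p-value
  c(skewness = sk, kurtosis = ku, JB = jb, p.value = p)
}

set.seed(1)
jb_normal <- jb_by_hand(rnorm(5000))  # simulated, roughly normal
jb_skewed <- jb_by_hand(rexp(5000))   # simulated, right-skewed
print(round(rbind(jb_normal, jb_skewed), 4))
```

With samples of several thousand observations, even small departures from normality produce a large JB value, which is worth keeping in mind when reading the p-values above.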

RESULTS: 1. All variables fail the JBT. However, some exhibit skewness and kurtosis similar to normal (sk=0; kur=3)(e.g., Attr29).
2. Some attributes display a right or left skew. 3. Nearly all attributes have kurtosis greater than normal.
CONCLUSIONS: 1. The dataset’s predictor variables are non-normal to some extent. However, the normality assumption applies to the process that produces the data (i.e., random sampling from a theoretical population). The question is therefore how ‘close’ to normal the data-generating distribution must be (Westfall, 2016) for the intended modeling techniques to apply.
2. A few more tests will be performed to examine the issue of the variables’ distribution.

ii. Q/Q plots


Another visualization approach is to use Q-Q plots (Perceptive Analytics, 2018). If the data come from a normal distribution, the Q-Q scatterplot of theoretical vs. sample quantiles should follow a straight line. Deviations from a straight line indicate non-normality.

par(mfrow=c(2,3))
for(i in colnames(yr4_v7[,c(1:3,6,19,30)])){
  qqnorm(yr4_v7[,i])
  qqline(yr4_v7[,i])
}
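
A Q-Q plot can also be summarized numerically. The sketch below, on simulated data, uses the correlation between theoretical and sample quantiles as a rough straight-line score; the helper name `qq_correlation` is an assumption introduced for illustration and was not part of the original analysis.

```r
# Illustrative only: summarize a Q-Q plot by the correlation between
# theoretical and sample quantiles (values near 1 suggest a straight line).
qq_correlation <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)  # list with theoretical (x) and sample (y) quantiles
  cor(qq$x, qq$y)
}

set.seed(2)
r_normal <- qq_correlation(rnorm(2000))       # simulated normal data
r_heavy  <- qq_correlation(rt(2000, df = 2))  # simulated heavy-tailed data
print(c(normal = r_normal, heavy_tailed = r_heavy))
```

A score like this does not replace the plot, but it makes it easier to rank many variables by how far they depart from a straight line.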

RESULTS: 1. All variables, with the exception of Attr29, show significant deviation from normality. 2. The pattern of deviation matches the tailedness seen earlier (see the histogram analysis above). 3. Attr29 has a near-normal distribution.

iii. Kolmogorov-Smirnov test


Continuing to explore the six exemplar variables, the Kolmogorov-Smirnov test can be used to test for normality (GeeksforGeeks, 2021).
Null Ho: The sample data come from a normal distribution. Alternative Ho: The sample data do not come from a normal distribution.

## ks.test() from the stats package  
for(i in colnames(yr4_v7[,c(1:3,6,19,30)])){
  x<-ks.test(yr4_v7[,i], 'pnorm', mean= mean(yr4_v7[,i]), sd= sd(yr4_v7[,i]))
  print(i)
  print(x)
}
## [1] "Obs_number"
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  yr4_v7[, i]
## D = 0.057487, p-value < 2.2e-16
## alternative hypothesis: two-sided
## Warning in ks.test.default(yr4_v7[, i], "pnorm", mean = mean(yr4_v7[, i]), :
## ties should not be present for the Kolmogorov-Smirnov test
## [1] "Attr1"
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  yr4_v7[, i]
## D = 0.18767, p-value < 2.2e-16
## alternative hypothesis: two-sided
## Warning in ks.test.default(yr4_v7[, i], "pnorm", mean = mean(yr4_v7[, i]), :
## ties should not be present for the Kolmogorov-Smirnov test
## [1] "Attr2"
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  yr4_v7[, i]
## D = 0.24717, p-value < 2.2e-16
## alternative hypothesis: two-sided
## Warning in ks.test.default(yr4_v7[, i], "pnorm", mean = mean(yr4_v7[, i]), :
## ties should not be present for the Kolmogorov-Smirnov test
## [1] "Attr5"
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  yr4_v7[, i]
## D = 0.41939, p-value < 2.2e-16
## alternative hypothesis: two-sided
## Warning in ks.test.default(yr4_v7[, i], "pnorm", mean = mean(yr4_v7[, i]), :
## ties should not be present for the Kolmogorov-Smirnov test
## [1] "Attr28"
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  yr4_v7[, i]
## D = 0.41521, p-value < 2.2e-16
## alternative hypothesis: two-sided
## Warning in ks.test.default(yr4_v7[, i], "pnorm", mean = mean(yr4_v7[, i]), :
## ties should not be present for the Kolmogorov-Smirnov test
## [1] "Attr48"
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  yr4_v7[, i]
## D = 0.2184, p-value < 2.2e-16
## alternative hypothesis: two-sided
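
When many variables are tested, it can be easier to collect the results in one table than to read the printed output. The following sketch does this on simulated data; `toy` and `ks_table` are assumed names. It keeps the approach used above, in which the mean and standard deviation are estimated from the same sample, so the reported p-values should be treated as approximate.

```r
# Illustrative sketch on simulated data; `toy` and `ks_table` are assumed names.
set.seed(3)
toy <- data.frame(a = rnorm(1000), b = rexp(1000), c = rt(1000, df = 3))

ks_table <- do.call(rbind, lapply(names(toy), function(v) {
  x   <- toy[[v]]
  # One-sample KS test against a normal with the sample's own mean and sd
  res <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
  data.frame(variable = v,
             D        = unname(res$statistic),
             p.value  = res$p.value)
}))
print(ks_table)
```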


RESULTS: (N.B.- only one snippet shown). 1. The null Ho was rejected for each variable, i.e., none of the variables came from a normal distribution, which aligns with the previous analysis.

CONCLUSIONS: 1. All analyses indicate that the financial ratios as formulated for this dataset come from non-normal distributions.
2. However, since the distributions resemble normal distributions in some respects and the sample sizes are relatively large (>500), the final segment of this analysis will apply both parametric and non-parametric tests.

iv. Potential transformation to make dataset normal

Algorithms exist that attempt to convert a non-normal dataset to normality so that sophisticated parametric analyses can be applied to it, e.g., the Box-Cox transformation (Data Science Team, 2021). There are several practical considerations. First, an examination of the variables indicated that the distributions fall into a variety of types, and it is highly unlikely that a ‘one-size-fits-all’ transformational approach will suffice. Second, the modeling techniques employed do not require normality in the dataset being analyzed (Cross Validated, n.d.-c). Nonetheless, the dataset was examined to determine whether such a transformation was possible.
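
To make the idea concrete, the sketch below applies a Box-Cox transformation to a single simulated variable, choosing the power parameter by maximizing the profile log-likelihood in base R. The function names `box_cox` and `bc_loglik` are assumptions made for illustration. Box-Cox requires strictly positive values, so scaled and centered ratios would need to be shifted before anything like this could be applied.

```r
# Hypothetical sketch: Box-Cox transformation of one strictly positive variable.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda under a normal model for the transformed data
bc_loglik <- function(lambda, y) {
  z <- box_cox(y, lambda)
  n <- length(y)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}

set.seed(4)
y <- rlnorm(2000)  # simulated right-skewed, positive data
lambda_hat <- optimize(bc_loglik, interval = c(-2, 2), y = y,
                       maximum = TRUE)$maximum
y_bc <- box_cox(y, lambda_hat)  # transformed values
print(lambda_hat)
```

A separate power parameter would have to be estimated for each variable, which is one practical reason a single transformation is unlikely to work across the whole dataset.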

v. What types of distribution exist for the variables in this dataset?


The function descdist() from the fitdistrplus package helps to identify which probability distribution, from a defined family of distributions, best fits the sample data (Delignette-Muller et al., 2009). This function computes descriptive parameters of the empirical distribution of a variable and provides a skewness-kurtosis plot to aid in determining which distribution best matches the data. This plot is a Cullen and Frey graph, which indicates the values (or range) of skewness and kurtosis for a particular distribution. By using bootstrapping within the function, it is possible to determine a range of plausible values for the dataset parameters. In addition, by setting another parameter (discrete), it is possible to get a second set of candidate distributions.

require(fitdistrplus) 
## Loading required package: fitdistrplus
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: survival
#descdist() function(Rdocumentation. n.d.-a)
par(mfrow=c(2,3))
for(i in colnames(yr4_v7[,c(1:3,6,19,30)])){ 
##using the six exemplar variables; Attr1,Attr2,Attr3,Attr6,Attr29,Attr49
   x<-descdist(yr4_v7[,i],discrete = F, boot =1000, method='unbiased', graph=T)  ## 
   print(i)
}
## [1] "Obs_number"
## [1] "Attr1"
## [1] "Attr2"
## [1] "Attr5"
## [1] "Attr28"

## [1] "Attr48"
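
The two coordinates shown in a Cullen and Frey graph can be reproduced without the plotting function. The base-R sketch below computes squared skewness and kurtosis for a simulated variable together with bootstrap replicates. It uses simple moment estimators rather than the 'unbiased' option passed to descdist() above, and the helper name `cf_point` is an assumption.

```r
# Illustrative base-R sketch (assumed names): the two coordinates plotted in a
# Cullen and Frey graph, squared skewness and kurtosis, with bootstrap replicates.
cf_point <- function(x) {
  m2 <- mean((x - mean(x))^2)
  c(sq_skewness = (mean((x - mean(x))^3) / m2^1.5)^2,
    kurtosis    = mean((x - mean(x))^4) / m2^2)
}

set.seed(5)
x <- rgamma(1500, shape = 2)  # simulated stand-in for one ratio
observed <- cf_point(x)
# Each bootstrap replicate resamples x with replacement and recomputes the point
boot <- t(replicate(200, cf_point(sample(x, replace = TRUE))))
print(observed)
print(apply(boot, 2, quantile, probs = c(0.025, 0.975)))  # spread of bootstrap points
```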


RESULTS: 1. The observations (and the bootstrapped values) lie outside the regions of the candidate distributions.
2. Distribution types considered: normal, uniform, exponential, logistic, lognormal, gamma.
The Generalized Additive Model for Location, Scale and Shape (GAMLSS) can be used to explore distributions that are highly skewed and/or kurtotic (Stasinopoulos & Rigby, 2007).

CONCLUSIONS: 1. All six variables have different distributions. 2. None falls within the parameters of any of the tested distributions.
3. This pattern also applies to the other variables in the dataset (data not shown).
4. This is not surprising since the data were not generated using a function that would create these distributions (StackExchange, n.d.-a). 5. There is no clear distribution pattern for the variables in the dataset and therefore no single transformation that will normalize all of them.

VIa. Comparative analysis of the bankrupter (class=1) and non-bankrupter (class=0) segments of the population.


a. Split the dataset by the ‘class’ variable. It is important to recall that, because scaling and centering were done prior to this summary, the descriptive statistics of the two groups are on a common scale and can be compared directly (Bobbitt, 2019). The z-score approach tends to give greater weight to outliers, which can be significant contributors to modeling.

##Use subset to separate bankrupters and non-bankrupters. 
pbkers<-subset(yr4_v7, yr4_v7$class==1)
no_pbkers<-subset(yr4_v7, yr4_v7$class==0)
summary(pbkers[,31:40])


summary(no_pbkers[,31:40]) ## showing only the last 10 variables
##      Attr51             Attr52             Attr54             Attr55        
##  Min.   :-0.10293   Min.   :-0.01115   Min.   :-7.20627   Min.   :-9.46878  
##  1st Qu.:-0.06221   1st Qu.:-0.01101   1st Qu.:-0.04374   1st Qu.:-0.10065  
##  Median :-0.02905   Median :-0.01090   Median :-0.04080   Median :-0.08843  
##  Mean   :-0.01100   Mean   :-0.01048   Mean   :-0.02272   Mean   :-0.01567  
##  3rd Qu.: 0.01336   3rd Qu.:-0.01075   3rd Qu.:-0.03393   3rd Qu.:-0.03916  
##  Max.   : 5.47406   Max.   : 0.42211   Max.   : 8.58496   Max.   : 9.48337  
##      Attr56             Attr57              Attr58             Attr60        
##  Min.   :-0.85229   Min.   :-7.443143   Min.   :-3.98274   Min.   :-0.03564  
##  1st Qu.: 0.01293   1st Qu.:-0.002983   1st Qu.:-0.03062   1st Qu.:-0.03386  
##  Median : 0.01345   Median : 0.006961   Median :-0.02178   Median :-0.03261  
##  Mean   : 0.01325   Mean   : 0.010648   Mean   :-0.01945   Mean   :-0.01561  
##  3rd Qu.: 0.01440   3rd Qu.: 0.023143   3rd Qu.:-0.01709   3rd Qu.:-0.02976  
##  Max.   : 1.46739   Max.   : 8.077145   Max.   : 8.40330   Max.   : 8.54421  
##      Attr63          class   
##  Min.   :-0.142788   0:9675  
##  1st Qu.:-0.093592           
##  Median :-0.062581           
##  Mean   :-0.012177           
##  3rd Qu.:-0.004941           
##  Max.   : 8.539004

VIb. Parametric test – two-sample t-test.


While the t-test is usually associated with normal distributions, it remains valid for non-normal data as the sample size in the two groups increases (Bartlett, 2013). Because of the central limit theorem, the distribution of the sample means converges to a normal distribution regardless of the population distribution, provided the variance is finite. Moreover, the estimator of the standard error of these means is consistent regardless of the distribution of the variable. Thus, in large samples the test statistic approximately follows a normal distribution.
The second consideration is that the normalization of the dataset occurred prior to splitting. Therefore, comparing the two subpopulations (class=0,1) is meaningful and any differences can be statistically evaluated. Finally, this analysis is done on scaled data. To verify that these conclusions hold on the unscaled dataset, the tests herein were replicated and the results were unchanged (data not shown).
Since it is unclear whether the subpopulations have equal variances, the Welch test will be applied. In addition, it controls the Type I error rate better and should perform better in this comparison of two samples of unequal size (Ruxton, 2006).

# require(reshape2)
# #yr4_v7sh<- yr4_v7[,c(1:3,6,19,30)]
# dat_v47<-melt(yr4_v7)##reshape data from wide to long for lapply
# lapply(unique(dat_v47$variable), function(x){
#   Good<-subset(dat_v47, class==0 & variable ==x)$value
#   Bad<- subset(dat_v47, class==1 & variable ==x)$value
#   t.test(Good, Bad)
# })
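
A per-variable Welch test of this kind can be sketched as follows, collecting the p-values into one table instead of printing each test. The data are simulated and the names `toy`, `predictors`, and `welch_table` are assumptions; this is an illustration, not the code that produced the results reported below.

```r
# Sketch with simulated data; `toy`, `welch_table` and column names are assumptions.
set.seed(6)
toy <- data.frame(
  class = factor(rep(c(0, 1), times = c(950, 50))),  # imbalanced groups
  v1    = c(rnorm(950, 0), rnorm(50, 1)),            # group means differ
  v2    = rnorm(1000)                                # group means equal
)

predictors  <- setdiff(names(toy), "class")
welch_table <- do.call(rbind, lapply(predictors, function(v) {
  x   <- toy[[v]]
  # var.equal = FALSE requests the Welch (unequal variance) test
  res <- t.test(x[toy$class == 0], x[toy$class == 1], var.equal = FALSE)
  data.frame(variable = v, t = unname(res$statistic), p.value = res$p.value)
}))
welch_table$reject_H0 <- welch_table$p.value < 0.05
print(welch_table)
```

With dozens of variables tested at the 0.05 level, a few rejections are expected by chance alone, so a multiplicity adjustment such as `p.adjust()` may be worth considering.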


WELCH Null Ho: The means of the two samples are equal. Alternative Ho: The means of the two samples are unequal.

RESULTS (only showing Welch test results for six exemplars):
1. Attr1, Attr2, Attr3, Attr6, Attr29: Reject null Ho; the means are unequal.
2. Attr49: Fail to reject null Ho; the means are indistinguishable from each other.
3. Summary for all variables. Fail to reject null Ho (p>0.05): Attr4, Attr5, Attr9, Attr12, Attr13, Attr15, Attr17, Attr19, Attr21, Attr28, Attr30, Attr34, Attr40, Attr42, Attr43, Attr49, Attr53, Attr56, Attr57, Attr58, Attr59, Attr61, Attr64. Reject null Ho (p<0.05): Attr1, Attr2, Attr3, Attr6, Attr7, Attr18, Attr20, Attr24, Attr29, Attr32, Attr35, Attr39, Attr41, Attr48, Attr52, Attr55.
CONCLUSIONS: 1. For the variables where the outcome is ‘fail to reject null Ho’, the means of the two subpopulations are not distinguishable. 2. For the variables where the outcome is ‘reject null Ho’, the means of the two samples are unequal. Will these variables be included in the final models to be developed? (see Part 5. Data Summary and Implications)

VIc. Mann-Whitney U (Wilcoxon rank-sum, two-sample) test (Bobbitt, 2018).


Since the data contain variables with non-normal distributions, a non-parametric test can be used to compare the two subpopulations (bankrupters, non-bankrupters). The test addresses whether the observations in one group tend to be greater than those in the other after ranking the combined groups. Under the additional assumption that the two groups have distributions of the same shape and spread, this is a test of the medians of the two groups. This is the non-parametric counterpart of the Welch t-test conducted in the last section and can highlight variables which differ between the two subpopulations and may be significant for the final machine learning model.
Wilcoxon Null Ho: The distribution of both subpopulations is identical. Alternative Ho: The distributions of the subpopulations are not identical.
Variables tested: All 39 independent variables. Variables discussed: Six exemplar variables.

#for(i in colnames(pbkers[,c(1:3,6,19,30)])){
#  x<-wilcox.test(pbkers[,i], no_pbkers[,i]) 
#  counter = counter +1
#  print(counter)
#  print(x)
#}
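
For a single variable, the comparison reduces to one call to `wilcox.test()`. The sketch below uses two simulated groups of unequal size; the vector names `good` and `bad` are assumptions chosen to mirror the two subpopulations.

```r
# Illustrative sketch on simulated data (assumed names `good` and `bad`).
set.seed(7)
good <- rexp(900, rate = 1)      # simulated non-bankrupt group
bad  <- rexp(60,  rate = 1) + 1  # simulated bankrupt group, shifted upward

# Two-sided Wilcoxon rank-sum (Mann-Whitney U) test on the two groups
res <- wilcox.test(bad, good, alternative = "two.sided")
print(c(W = unname(res$statistic), p.value = res$p.value))
```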


RESULTS: 1. The null Ho is rejected for all six exemplar variables, i.e., the distributions of these variables in the two subpopulations are not identical. 2. Across all variables, the null Ho could not be rejected for only three (Attr20, Attr43, Attr64).

SUMMARY OF ANALYSES: 1. The parametric Welch t-test identified 16 variables whose means differ between the two subpopulations.
2. The non-parametric Wilcoxon rank-sum test identified thirty-six variables whose medians differ between the two subpopulations, indicating that the two subpopulations are different.
3. The following attributes (variables) were identified by both the Welch t-test and the Wilcoxon rank-sum test as differing between the two subpopulations: Attr1, Attr2, Attr3, Attr6, Attr7, Attr18, Attr24, Attr29, Attr32, Attr35, Attr39, Attr41, Attr48, Attr52, Attr55.
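
The overlap between the two tests can be derived from the lists reported in this section. The sketch below copies the attribute names from the text; the object names are assumptions.

```r
# Variable lists copied from the results reported in this section.
welch_reject  <- paste0("Attr", c(1, 2, 3, 6, 7, 18, 20, 24, 29, 32, 35, 39,
                                  41, 48, 52, 55))
wilcoxon_fail <- paste0("Attr", c(20, 43, 64))

# Flagged by both tests: rejected by Welch and not among the Wilcoxon failures
both_tests <- setdiff(welch_reject, wilcoxon_fail)
print(both_tests)
```

Attr20 is the only variable that the Welch test flags and the Wilcoxon test does not.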

VII. Advantages/disadvantages of approaches used

Advantages: 1. Cleansing the dataset of obvious outliers and reducing multicollinearity stabilizes the data with respect to the modeling techniques used in this study.
2. Scaling and centering (normalization) of the dataset creates a ‘level playing field’ in modeling for variables measured on different scales.
3. Removing observations with z-scores beyond +/- 10 standard deviations (usually for multiple variables) is critical. This threshold was chosen to retain as much variation as possible in the dataset while removing a small population of outliers which make modeling much more difficult.
4. Testing each variable for normality and distribution type reveals the lack of normality and of easily defined distributions for nearly all of the variables. This is critical in deciding which modeling approaches can be used.
Disadvantages: 1. There are many published approaches to detecting and removing outliers and reducing multicollinearity (Bobbitt, 2021-a; Coding Prof, 2022). The method chosen may not be the best for this dataset.
2. The tests chosen for normality detection and group comparison (t-test (Data Novia, n.d.-a); Kolmogorov-Smirnov (GeeksforGeeks, 2021)) may not have been executed properly, or better tests may exist. If a substantial portion of the variables have normal distributions, alternative modeling strategies may be superior to those chosen here.
3. The presence of either or both of the conditions above could influence the differences between the bankrupters and non-bankrupters, which would substantially alter the prediction outcomes.
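
The scaling and z-score filtering discussed above can be sketched in a few lines of base R. The example uses simulated data with one planted extreme value; the names `toy`, `keep`, and `toy_clean` are assumptions, and the original analysis may have applied the filter at a different stage of data preparation.

```r
# Illustrative sketch on simulated data; all object names are assumptions.
set.seed(8)
toy <- data.frame(r1 = rnorm(500), r2 = rnorm(500))
toy$r1[1] <- 1e4                     # plant one extreme value

z    <- scale(toy)                   # center and scale each column (z-scores)
keep <- apply(abs(z) <= 10, 1, all)  # TRUE when every z-score is within 10 SD
toy_clean <- toy[keep, ]
print(c(before = nrow(toy), after = nrow(toy_clean)))
```

Because an extreme value inflates the standard deviation used to compute its own z-score, a threshold as wide as 10 removes only very isolated observations.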

VIII. REFERENCES (N.B.: This list contains references for all parts of this analysis, not just this chapter).



Barboza, F., Kimura, H., and Altman, E. (2017) Machine Learning models and bankruptcy prediction. Expert Systems with Applications. Vol. 83: 405-417. http://isiarticles.com/bundles/Article/pre/pdf/146024.pdf
Bartlett, J. (2013) The t-test and robustness to non-normality. https://thestatsgeek.com/2013/09/28/the-t-test-and-robustness-to-non-normality/
Beaver, W.H.(1968) Market Prices, Financial Ratios and the prediction of failure. J. Acct. Resrch. 6(2): 179-192.
Bhalla, D. (n.d.) SAS: Calculating the Optimal Predicted Probability Cutoff. https://www.listendata.com/2015/03/sas-calculating-optimal-predicted.html
Bharadwaj, V. (2022) What is Jarque-Bera test? How to perform it in R. https://www.projectpro.io/recipes/what-is-jarque-bera-test-perform-it-r
Bobbitt, Z. (2018) Mann Whitney U Test. https://www.statology.org/mann-whitney-u-test/
Bobbitt, Z. (2019) How to Normalize Data in R. https://www.statology.org/how-to-normalize-data-in-r/
Bobbitt, Z. (2020) How to Calculate Skewness & Kurtosis in R. https://www.statology.org/skewness-kurtosis-in-r/
Bobbitt, Z. (2021-a) The Complete Guide: When to Remove Outliers in R https://www.statology.org/remove-outliers/
Bobbitt, Z. (2021-b) How to Use Q-Q plots to Check Normality https://www.statology.org/q-q-plot-normality/
Bounthavong, M. (2021) Logistic regression in R. Rpubs by RStudio. https://rpubs.com/mbounthavong/logistic_regression
Bonthu, S., and Bindu, K.H. (2017) Review of Leading Data Analytics Tools. Intl. Jrnl. of Engr. & Technology. Vol 7: 10-15. https://www.researchgate.net/profile/Sridevi-Bonthu-2/publication/327233649_Review_of_Leading_Data_Analytics_Tools/links/5b82bbdaa6fdcc5f8b695315/Review-of-Leading-Data-Analytics-Tools.pdf
Bredart, X. (2014) Bankruptcy Prediction Model: The Case of the United States. Intl. J. Econ. And Finance Vol. 6(3):1-7. https://scholar.archive.org/work/odijcz2cbrerlfccirnnft2vo4/access/wayback/http://ccsenet.org/journal/index.php/ijef/article/download/32877/19695
Carroll, J. (2019) Beyond Spreadsheets with R. Chap. 7: Doing things with lots of data. 1st edition. Publisher: Manning.
Chawla, N.V., Bowyer, K.W., Hall, L.O., and Kegelmeyer, W. P. (2002) SMOTE: Synthetic Minority Over-Sampling Technique. https://arxiv.org/pdf/1106.1813.pdf
Chen, C., Liaw, A., and Breiman, L. (2004) Using Random Forest to Learn Imbalance Data. Univ. California Berkeley Tech Report 666; 1-12.
Chen, N., Ribeiro, B. and Chen, A. (2016) Financial credit risk assessment: a recent review. Artificial Intelligence Review Vol: 45(1): 1-23. https://www.researchgate.net/profile/Bernardete-Ribeiro/publication/284100532_Financial_credit_risk_assessment_a_recent_review/links/564f4dc808aeafc2aab3c43c/Financial-credit-risk-assessment-a-recent-review.pdf
Cho, K.I., and Kim, Y. M. (2021) Comparison of Bankruptcy prediction models using statistical learning at multiple times. The Korean Data & Information Science Society 32(3):487-499. https://doi.org/10.7465/jkdi.2021.32.3.487
Chu, M. and Yong, K. (2021) Big Data Analytics for Business Intelligence in Accounting and Audit. Open Journal of Social Sciences, 9, 42-52. doi: 10.4236/jss.2021.99004.
codingProf.com(n.d.) How to replace NA’s with the Median in R. https://www.codingprof.com/how-to-replace-missing-values-with-the-median-in-r/
Coding Prof(2022) 3 Ways to Test for Multicollinearity in R[Examples]. https://www.codingprof.com/3-ways-to-test-for-multicollinearity-in-r-examples/
Correa-Mejia, A.D. and Lopero-Castano, M. (2020) Financial ratios as a powerful instrument to predict insolvency: a study using boosting algorithms in Colombian firms. Estud.gerenc. 36(155) 229-238. http://www.scielo.org.co/scielo.php?pid=S0123-59232020000200229&script=sci_arttext&tlng=en
Corporate Finance Institute (2022) Financial ratios: The use of financial figures to gain significant information about a company. https://corporatefinanceinstitute.com/resources/knowledge/finance/financial-ratios/
Cross Validated(n.d.-a) Area under the ROC curve or area under the PR curve for imbalanced data? https://stats.stackexchange.com/questions/90779/area-under-the-roc-curve-or-area-under-the-pr-curve-for-imbalanced-data/90783#90783
Cross Validated(n.d.-b) Meaning of y-axis in Random Forest partial dependence plot. https://stats.stackexchange.com/questions/147763/meaning-of-y-axis-in-random-forest-partial-dependence-plot
Cross Validated(n.d.-c) Regression: Transforming Variables. https://stats.stackexchange.com/questions/4831/regression-transforming-variables/4833#4833
Data Novia(n.d.-a) T-Test Essentials : Definition, Formula and Calculation. https://www.datanovia.com/en/lessons/how-to-do-a-t-test-in-r-calculation-and-reporting/how-to-do-two-sample-t-test-in-r/
Data Science Team(2021) Box-Cox Transformation for Normalizing a Non-Normal Variable in R. https://universeofdatascience.com/box-cox-transformation-for-normalizing-a-non-normal-variable-in-r/
Davis, J., and Goadrich, M. The Relationship between Precision-Recall and ROC Curves. https://www.biostat.wisc.edu/~page/rocpr.pdf
Delignette-Muller, M.L., Pouillot, R., Denis, J.-B., and Dutang, C. (2009) Use of the package fitdistrplus to specify a distribution from non-censored or censored data.https://civil.colorado.edu/~balajir/CVEN5454/R-sessions/sess2/intro2fitdistrplus.pdf
Devi, S.S., and Radhika, Y. (2018) A Survey on Machine Learning and Statistical Techniques in Bankruptcy Prediction. Intl. Jour. Machine Learning and Computing. Vol 8(2): 133-139. http://www.ijmlc.org/vol8/676-L0125.pdf
EMIS (n.d.) Emerging Markets Information System. https://www.emis.com/
finnstats(2021-a) How to Remove Outliers in R. https://www.r-bloggers.com/2021/09/how-to-remove-outliers-in-r-3/
finnstats (2021-b) Class Imbalance-Handling Imbalanced Data in R. https://www.r-bloggers.com/2021/05/class-imbalance-handling-imbalanced-data-in-r/
Frost, J. (n.d.-a) How to Identify the Distribution of Your Data. https://statisticsbyjim.com/hypothesis-testing/identify-distribution-data/
GeeksforGeeks(2021) Kolmogorov-Smirnov Test in R. https://www.geeksforgeeks.org/kolmogorov-smirnov-test-in-r-programming/
Giannopoulos, G, Sigbjornsen, S. (2019) Prediction of Bankruptcy Using Financial Ratios in the Greek Market. Theoretical Economics Letters 9:1114-1128. https://eprints.kingston.ac.uk/id/eprint/43059/1/Giannopoulos-G-43059-VoR.pdf
Gosiewska, A.(n.d.) ModelOriented/auditor https://github.com/ModelOriented/auditor/blob/master/R/plot_prc.R
Groff, D. (2020) Bankruptcy Overview Process and Warning Signs. https://www.mossadams.com/articles/2020/06/bankruptcy-overview-process-and-warning-signs
hackerearth (n.d.) Beginners Tutorial on XGBoost and Parameter Tuning in R. https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/
Harrell, F. (2020) Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules. https://www.fharrell.com/post/class-damage/
Hauser, R.P., and Booth, D. (2011) Predicting Bankruptcy with Robust Logistic Regression. J. Data Science 9: 565-584. https://www.jds-online.com/files/JDS-716.pdf
Hjelseth, I.N., Raknerud, A., and Vatne, B.H. (2022) A bankruptcy probability model for assessing credit risk on corporate loans with automated variable selection. Norges Bank Research. Working Paper pp. 1-40. https://www.norges-bank.no/contentassets/b26854d9fce24f49b68182e121eed2eb/wp_07_2022.pdf?v=06/21/2022162855&ft=.pdf
Horak, J., Vrbka, J., and Suler, P. (2020) Support Vector Machine Methods and Artificial Neural Networks Used for the Development of Bankruptcy Prediction Models and their Comparison. J. Risk and Financial Management Vol. 13(60): 1-15. https://www.mdpi.com/1911-8074/13/3/60/pdf
Horton, B. (2016) Calculating AUC: the area under a ROC Curve https://www.r-bloggers.com/2016/11/calculating-auc-the-area-under-a-roc-curve/
Islam, M.S. (2020) Predictive capability of Financial Ratios for forecasting of Corporate Bankruptcy. IOSR Jrnl. Bus. And Mngmt. 22(6):13-57. https://www.academia.edu/download/63759958/C220610135720200627-73076-19tr1g2.pdf
Jayadev, M. (2006) Predictive Power of Financial Risk Factors: An Empirical Analysis of Default Companies. Vikalpa 31(3): 45-56. https://journals.sagepub.com/doi/pdf/10.1177/0256090920060304
Javaid, K. (2022) Cost to sales ratio. https://financiopedia.com/cost-to-sales-ratio/
Johnson, D. (2022) SAS vs. R: What is the Difference Between R and SAS? https://www.guru99.com/sas-versus-r.html
Kieso, D.E., Weygandt, J.J., and Warfield, T.D. (2016) Intermediate Accounting: 1435-1439. 16th edition. Publisher: Wiley.
Kitowski, J., Kowal-Pawul, A., and Lichota, W. (2022) Identifying Symptoms of Bankruptcy Risk Based on Bankruptcy Prediction Models- A Case Study in Poland. Sustainability 14(3): 1416. https://doi.org/10.3390/su14031416
KnowHow (2021) Testing the Assumptions of Logistic Regression using R. https://www.youtube.com/watch?v=jILEwqg2p3k
Kovacova, M., Kliestik,T., Valaskova, K., Durana, P., and Juhaszova, Z. (2019). Systematic review of variables applied in bankruptcy prediction models of Visegrad group countries. OECONOMIA Copernicana Vol. 10(4): 743-772. http://economic-research.pl/Journals/index.php/oc/article/download/1739/1630
Kumar, N. (2019) Advantages and Disadvantages of Random Forest Algorithm in Machine Learning. https://theprofessionalspoint.blogspot.com/2019/02/advantages-and-disadvantages-of-random.html
Kumar, A. (2022) Accuracy, Precision, Recall & F1-Score-Python Examples. https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/
Lantz, B. (2019-a) Machine Learning with R. Chapter 11: 359-374. 3rd edition. Publisher: Packt Publishing Ltd.
Lantz, B. (2019-b) Machine Learning with R. Chap. 11: Improving Model Performance. 3rd edition. Publisher: Packt Publishing Ltd.
Le, T. (2021) A comprehensive survey of imbalanced learning methods for bankruptcy prediction. IET Communications 16(5): 433-441. https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cmu2.12268
Leung, K. (2021) Assumptions of Logistic Regression, Clearly Explained. https://towardsdatascience.com/assumptions-of-logistic-regression-clearly-explained-44d85a22b290
Liang, D., Lu, C.C., Tsai, C.F., and Shih, G.A. (2016) Financial ratios and corporate governance indicators in bankruptcy prediction: A comprehensive study. Eur. Jour. of Operational Res. Vol. 252: 561-572. https://isslab.csie.ncu.edu.tw/download/publications/1.Financial%20Ratios%20and%20Corporate%20Governance%20Indicators%20in%20Bankruptcy%20Prediction-A%20Comprehensive%20Study.pdf
Liang, D., Tsai, C.F., Lu, H.Y.R., and Chang, L.S. (2020) Combining Corporate Governance indicators with stacking ensembles for financial distress prediction. J. Bus. Research Vol. 120:137-146. https://isslab.csie.ncu.edu.tw/download/publications/Combining%20corporate%20governance%20indicators%20with%20stacking%20ensembles%20for%20financial%20distress%20prediction.pdf
Lunardon, N., Menardi, G., and Torelli, N. (2021) Package ‘ROSE’. https://cran.r-project.org/web/packages/ROSE/ROSE.pdf
Lunardon, N., Menardi, G., and Torelli, N. (2014) ROSE: A Package for Binary Imbalanced Learning. The R Journal: Vol 6(1): 79-89. https://journal.r-project.org/archive/2014-1/menardi-lunardon-torelli.pdf
Marso, S., and El Merouani, M. (2020) Bankruptcy Prediction using Hybrid Neural Networks with Artificial Bee Colony. Engineering Letters 28(4). http://www.engineeringletters.com/issues_v28/issue_4/EL_28_4_26.pdf
Menardi, G., and Torelli, N. (2010) Training and assessing classification rules with unbalanced data. University Degli Studi Di Trieste: Working Paper Series, N.2, 2010. Pp:1-28.
Mendekar, V. (2021) Machine Learning-It’s all about assumptions. https://www.kdnuggets.com/2021/02/machine-learning-assumptions.html
Mendis, A. (2018) Using mlr for Machine Learning in R: A Step by Step Approach for Decision Trees. https://towardsdatascience.com/decision-tree-classification-of-diabetes-among-the-pima-indian-community-in-r-using-mlr-778ae2f87c69
Pedamkar, P. (n.d.) R vs. Python. https://www.educba.com/r-vs-python/
Perceptive Analytics (2018) Dealing with The Problem of Multicollinearity in R. https://www.r-bloggers.com/2018/08/dealing-with-the-problem-of-multicollinearity-in-r/
Pipis, G. (2020) Undersampling by Groups in R. R-bloggers. https://www.r-bloggers.com/2020/11/undersampling-by-groups-in-r/
Poston, K.M., Harmon, W.K., and Gramlich, J.D. (1994) A Test of Financial Ratios as Predictors of Turnaround versus Failure among Financially Distressed Firms. J. Appl. Bus. Research Vol 10: 41-56. https://doi.org/10.19030/jabr.v10i1.5962
Prabhakaran, S. (2016) Outlier detection and treatment with R. https://www.r-bloggers.com/2016/12/outlier-detection-and-treatment-with-r/
Prabhakaran, S. (n.d.-a) Logistic Regression http://r-statistics.co/Logistic-Regression-With-R.html
Probst, P. (2017) Tuning random forest. R-bloggers. https://www.r-bloggers.com/2017/11/tuning-random-forest/
Rai, B. (2017) Handling Class Imbalance Problem in R: Improving Predictive Model Performance https://www.youtube.com/watch?v=Ho2Klvzjegg
Rai, B. (2017) eXtreme Gradient Boosting XGBoost Algorithm with R- Example in Easy Steps with One-Hot Encoding. https://www.youtube.com/watch?v=woVTNwRrFHE
Ranganathan, P., Pramesh, C.S., and Aggarwal, R. (2017) Common pitfalls in statistical analysis: Logistic regression. Perspect. Clin. Res. Vol 8(3): 148-151. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543767/
RDocumentation (n.d.-a) descdist: Description of an empirical distribution for non-censored data. https://www.rdocumentation.org/packages/fitdistrplus/versions/1.1-8/topics/descdist
RDocumentation (n.d.-b) glm: Fitting Generalized Linear Models https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm
RDocumentation(n.d.-c) sample.split: Split Data into Test and Train Set https://www.rdocumentation.org/packages/caTools/versions/1.17.1/topics/sample.split
RDocumentation(n.d.-d) https://www.rdocumentation.org/packages/ROSE/versions/0.0-4/topics/roc.curve
Rdocumentation(n.d.-e) cforest: Random Forest. https://www.rdocumentation.org/packages/party/versions/1.3-0/topics/cforest
rdrr.io(n.d.-a)ROSE-package: ROSE: Random Over-Sampling Examples https://rdrr.io/cran/ROSE/man/ROSE-package.html
rdrr.io(n.d.-b) ovun.sample: Over-sampling, under-sampling, combination of over- and under. https://rdrr.io/cran/ROSE/man/ovun.sample.html
rdrr.io (n.d.-c) smote: SMOTE algorithm for unbalanced classification problems. https://rdrr.io/cran/performanceEstimation/man/smote.html
rdrr.io(n.d.-d) mlr-package: mlr: Machine Learning in R. https://rdrr.io/cran/mlr/man/mlr-package.html
rdrr.io (n.d.-e) Partial dependence plot https://rdrr.io/cran/randomForest/man/partialPlot.html
Rhys, H.I. (2020-a) Machine learning with R, tidyverse, and mlr. Chapter 4: Classifying based on odds with logistic regression. Publisher: Manning.
Rhys, H.I. (2020-b) Machine Learning with R, the tidyverse, and mlr. Chapter 8: Improving decision trees with random forests and boosting. pp. 186-204, 1st Edition. Manning Publications Co.
Rhys, H.I. (2020-c) Machine Learning with R, the tidyverse, and mlr: Chap. 3. Classifying based on similarity with k-nearest neighbors. 1st edition. Publisher: Manning Publications Co. 
Rossiter, D.G. (2017) Tutorial: An example of statistical data analysis using the R environment for statistical computing. http://www.css.cornell.edu/faculty/dgr2/_static/files/R_PDF/corregr.pdf
Ruopp, M., Perkins, N.J., Whitcomb, B.W. and Schisterman, E. (2008) Youden Index and Optimal Cut-Point Estimated from Observations Affected by a Lower Limit of Detection. Biom. J. 50(3): 419-430. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2515362/
Ruxton, G. D. (2006) The unequal variance t-test is an underused alternative to Student’s t-test and the Mann-Whitney U test. Behavioral Ecology Vol. 17(4): 688-690. https://academic.oup.com/beheco/article/17/4/688/215960
Saito, T., and Rehmsmeier, M. (2015) The Precision-Recall Plot is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS One. https://doi.org/10.1371/journal.pone.0118432. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
Sameeruddin, S. (2020) How Gradient Boosting Algorithm Works. https://dataaspirant.com/gradient-boosting-algorithm/
Sharma, N. (2018) Ways to Detect and Remove Outliers. https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
Shi, Y., and Li, X. (2019) An overview of bankruptcy prediction models for corporate firms: A systematic review. https://upcommons.upc.edu/bitstream/handle/2117/176066/1354-5538-1-PB.pdf
Shirin’s playground (2017) Dealing with unbalanced data in machine learning. R-bloggers. https://www.r-bloggers.com/2017/04/dealing-with-unbalanced-data-in-machine-learning/
Silipo, R. (2020) Cohen’s Kappa: What it is, when to use it and how to avoid its pitfalls. https://towardsdatascience.com/cohens-kappa-what-it-is-when-to-use-it-and-how-to-avoid-its-pitfalls-e42447962bbc
stackoverflow (n.d.-a) Writing a for loop with the output as a data frame in R. https://stackoverflow.com/questions/41889944/writing-a-for-loop-with-the-output-as-a-data-frame-in-r
StackExchange (n.d.-a) What distribution does my data follow? https://stats.stackexchange.com/questions/58220/what-distribution-does-my-data-follow
StackExchange (n.d.-b) ROSE and SMOTE oversampling methods. https://stats.stackexchange.com/questions/166458/rose-and-smote-oversampling-methods
stackoverflow (n.d.-p) Plot legend randomForest in R. https://stackoverflow.com/questions/39330728/plot-legend-random-forest-r
Stasinopoulos, D.M. and Rigby, R.A. (2007) Generalized Additive Models for Location Scale and Shape (GAMLSS) in R. J. Statistical Software Vol 23(7): 1-46. https://www.jstatsoft.org/article/download/v023i07/20764
Talk Stats (2014) When removing outliers, creates more, what then? https://www.talkstats.com/threads/when-removing-influential-outliers-creates-more-what-then.56919/
The R Foundation for Statistical Computing (2021) R: Software Development Life Cycle: A Description of R’s Development, Testing, Release and Maintenance Processes. https://www.r-project.org/doc/R-SDLC.pdf
Tomczak, S.K., and Wilimowska, Z. (2016) Testing the Probability Distribution of Financial Ratios. ISAT: Proc. 36th Intl. Conf. Infor. Sys. Arch. Tech.: 75-84.
UCI Machine Repository (n.d.-a) Polish companies bankruptcy data Dataset. https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data#
Vanschoren, J. (n.d.) Machine Learning in R. https://joaquinvanschoren.github.io/ML-course-R/TutorialMLR.slides.html#/
Webb, J. (2017) Course Notes for IS6489, Statistics and Predictive Analytics. Chap. 8 Logistic Regression. https://bookdown.org/jefftemplewebb/IS-6489/logistic-regression.html
Westfall, P. (2016) Re: What do I do if my data distribution is not Normal? Retrieved from: https://www.researchgate.net/post/What-do-I-do-if-my-data-distribution-is-not-Normal/584eae2feeae3934d93f477b/citation/download
Wikipedia (n.d.-a) Logistic regression. https://en.wikipedia.org/wiki/Logistic_regression
Zach (2020) The 6 Assumptions of Logistic Regression (with Examples). https://www.statology.org/assumptions-of-logistic-regression/
Zeya, L.T. (2021) Precision and Recall Made Simple. https://towardsdatascience.com/precision-and-recall-made-simple-afb5e098970f
Zhang, Y., Liu, R., Heidari, A.A., Wang, X., Chen, Y., Wang M., and Chen, H. (2021) Towards augmented kernel extreme learning models for bankruptcy prediction: Algorithmic behavior and comprehensive analysis. Neurocomputing 430: 185-212. https://doi.org/10.1016/j.neucom.2020.10.038
Zhou, V. (2019) Decision tree learning: Gini impurity. https://victorzhou.com/blog/gini-impurity/
Zieba, M., Tomczak, S.K., and Tomczak, J.M. (2016) Ensemble Boosted Trees with Synthetic Features Generation in Application to Bankruptcy Prediction. Expert Systems with Applications Vol. 58: 93-101. https://www.ii.pwr.edu.pl/~tomczak/PDF/%5BMZSTJT%5D.pdf

require(here)
## Loading required package: here
## here() starts at C:/WGU/post grad/final