Bankruptcy prediction, which aims to assess a company's financial condition and its prospects for continued operation on the market, is of great importance in economic decision making and has received extensive research attention. This report presents an empirical study of bankruptcy prediction on the Polish companies bankruptcy data, focusing on tackling class imbalance and on comparing and selecting models. First, Random Forest Imputation is used to fill in missing values, and the data are visualized with PCA and MDS. Ignoring the imbalance at first, we compare classifiers built on the original feature space and on principal component scores, using plain accuracy as the evaluation metric. We then apply three sampling techniques, the Synthetic Minority Over-sampling Technique (SMOTE), upsampling and downsampling, to the training set, add a new variable recording the number of missing values to the original features, and compare classifiers using the area under the ROC curve as the main evaluation metric. Based on these results, we compare the sampling techniques and select the classifier most appropriate for bankruptcy prediction. We find that the AdaBoost model with downsampling is a simple and reliable choice for a high-sensitivity classifier, and that more powerful and flexible classifiers can be obtained by tuning the decision threshold of the AdaBoost model with SMOTE sampling. The final model raises sensitivity from the initial 30% to 86% while retaining an 86% specificity.
Bankruptcy prediction, which aims to assess a company's financial condition and its prospects for long-term operation on the market, is of great importance in economic decision making. Early attempts at bankruptcy prediction were based on traditional statistical methods such as Logistic Regression and Discriminant Analysis. Later, artificial intelligence techniques including Support Vector Machines and Neural Networks were utilized. More recently, ensemble methods such as Bagging and Boosting have gained popularity and perform well on bankruptcy prediction. Nevertheless, the imbalanced nature of bankruptcy data makes the task difficult. This report is an empirical study of bankruptcy prediction on one real-world data set, focusing on tackling imbalance and on comparing different methods.
The data set is the Polish companies bankruptcy data, which contains 5910 observations and 65 variables. The first 64 variables are the predictors, describing various aspects of a company's current financial status such as profit, liabilities, and working capital relative to sales and to total assets. The 65th variable is the response, with values "Yes" and "No" indicating whether or not the company goes bankrupt in the following year. Only 410 of the 5910 observations are bankrupt, so the data are highly imbalanced. Moreover, the predictors contain 4666 missing values, so an imputation technique must be applied.
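For orientation, a minimal sketch of loading and inspecting the data is shown below. The file name is an assumption: the UCI distribution of this data set consists of ARFF files, and the counts quoted above match its 5th-year file.

```r
library(foreign)   # provides read.arff()

# Hypothetical file name: the observation counts in this report (5910 cases,
# 410 bankruptcies) match the 5th-year horizon file of the UCI data set.
bankruptcy <- read.arff("5year.arff")

dim(bankruptcy)            # 5910 x 65
table(bankruptcy[, 65])    # class distribution: 410 bankrupt vs 5500 not bankrupt
sum(is.na(bankruptcy))     # 4666 missing values among the predictors
```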
First, some preprocessing is applied to the original data set, and the preprocessed data are visualized. Next, the data set is split into a training set (70%) and a test set (30%). Different sampling methods are applied to the training set to tackle imbalance, and several classifiers are then constructed, including Linear Discriminant Analysis, Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Neural Network, and ensemble classifiers with decision trees as base learners, namely Random Forest, Adaptive Boosting (AdaBoost) and Gradient Boosting Machine (GBM). Finally, based on evaluation metrics such as accuracy, sensitivity, specificity and AUC, we compare the performance of the classifiers on the test set.
The rest of this report is organized as follows:
Preprocessing and Visualization. To deal with missing values in the original data set, we implement Random Forest Imputation. Some visualizations of the preprocessed data set are then presented.
Classification without Tackling Imbalance. We construct several classifiers on both the original feature space and the principal component space, and compare their performance based on accuracy, sensitivity and specificity.
Classification with Tackling Imbalance. In addition to adding a new variable numNA recording the number of missing values, we implement three sampling techniques and compare the performance of different classifiers.
Discussion. Based on the performance of the classifiers under the different sampling methods, we first compare the three sampling techniques and then select the classifier most suitable for bankruptcy prediction.
In the original data set there are 4666 missing values. Attr37 has the largest missing rate (43.11%), while all other variables have much smaller missing rates (less than 7%). We therefore simply delete Attr37 and impute the other variables. Compared with other popular imputation methods such as K-Nearest Neighbours Imputation and Multivariate Imputation by Chained Equations, Random Forest Imputation, which trains a random forest on the observed values of the data matrix to predict the missing values, is better suited to high-dimensional data, especially when there are complex interactions among variables. Considering the high-dimensional nature of the data set, we therefore apply Random Forest Imputation. Specifically, we impute missing values using all predictors except Attr37. The class label is not used in the imputation, as we assume that the true values of the missing data do not depend on class. A rough sketch of this step is given below.
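The imputation can be carried out with the missForest package; `bankruptcy` and the column names below are assumptions about the loaded object, not quotations from the report.

```r
library(missForest)

# Remove Attr37 (43.11% missing) and the class label before imputing,
# so the imputed values do not use the class information.
predictors <- bankruptcy[, setdiff(names(bankruptcy), c("Attr37", "class"))]

set.seed(1)
# Random Forest Imputation: iteratively trains a random forest on the observed
# part of each variable and predicts its missing entries.
imp <- missForest(predictors)

X_imputed <- imp$ximp    # completed matrix of the remaining 63 predictors
imp$OOBerror             # out-of-bag estimate of the imputation error (NRMSE)
```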
The left panel of Figure 1 visualizes the correlation matrix of the 63 predictors in the preprocessed data set, where colors correspond to different correlation levels. The plot shows that some of the predictors are highly correlated, suggesting that dimension reduction should be considered before classification.
The right panel of Figure 1 shows boxplots of the 63 predictors by class label, with outliers removed for better visualization. For most predictors there is no obvious difference in distribution between the two classes, indicating that no single predictor can separate the class labels on its own.
Numerous dimension reduction techniques could be applied for visualization; Principal Component Analysis and Multidimensional Scaling are among the most popular.
First, we visualize the observations in the preprocessed data set with Principal Component Analysis. The left panel of Figure 2, which projects the observations onto the first two principal components, shows that observations with different class labels are not well separated, indicating that classification cannot simply be carried out in a low-dimensional space.
Second, we visualize the predictors in the preprocessed data set with Multidimensional Scaling. The right panel of Figure 2, which embeds the predictors in a two-dimensional space, shows several clusters, again indicating that some predictors are highly correlated and that dimension reduction should be considered before classification.
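The two panels of Figure 2 can be reproduced roughly as follows. `X_imputed` and `y` are assumptions (the imputed predictors and the class labels), and using 1 − |correlation| as the dissimilarity for the MDS of the predictors is our reading of the plot, not a documented choice.

```r
# Left panel: observations projected onto the first two principal components
pca <- prcomp(X_imputed, center = TRUE, scale. = TRUE)
plot(pca$x[, 1:2], col = ifelse(y == "Yes", "red", "grey40"), pch = 20,
     xlab = "PC1", ylab = "PC2")

# Right panel: the 63 predictors embedded in 2D by classical MDS,
# so that highly correlated predictors end up close together
d   <- as.dist(1 - abs(cor(X_imputed)))
mds <- cmdscale(d, k = 2)
plot(mds, pch = 20, xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds, labels = colnames(X_imputed), cex = 0.6, pos = 3)
```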
The severe imbalance of the original data set argues against applying standard classification methods without any modification. For the sake of comparison, however, we run a few commonly used classifiers on the raw data set (after imputation) and report their performance on the training and test data. We then run the same set of classifiers on the first twenty principal components only and check whether there is a significant improvement. Note that the classifiers in this section already have their parameters, if any, tuned over grids of values using repeated cross-validation (10-fold, repeated 5 times) to obtain the best performance.
The classification methods in this section are LDA, logistic regression, kNN, SVM, neural network, random forest, AdaBoost, and gradient boosting machine. The accuracy, sensitivity (true positive rate) and specificity (true negative rate) of the predictions on the training and test sets are reported, with bankrupt companies taken as the positive class. A sketch of the tuning setup is given below.
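As a hedged sketch of the training procedure for this section, a single classifier could be fitted as follows; the tuning grid and the object names (`X_train`, `y_train`, `X_test`, `y_test`) are assumptions, since the report does not list its exact grids.

```r
library(caret)

set.seed(1)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Example: a random forest tuned over mtry with repeated cross-validation
rf_fit <- train(x = X_train, y = y_train,
                method    = "rf",
                metric    = "Accuracy",
                tuneGrid  = expand.grid(mtry = c(4, 8, 16, 32)),
                trControl = ctrl)

# Accuracy, sensitivity and specificity on the held-out test set
confusionMatrix(predict(rf_fit, newdata = X_test), y_test, positive = "Yes")
```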
Model | train.accu | train.sens | train.spec | test.accu | test.sens | test.spec |
---|---|---|---|---|---|---|
LDA | 0.9340 | 0.1568 | 0.9919 | 0.9278 | 0.1545 | 0.9855 |
Logistic | 0.8513 | 0.4355 | 0.8823 | 0.8421 | 0.4309 | 0.8727 |
kNN | 0.9330 | 0.1254 | 0.9932 | 0.9261 | 0.0894 | 0.9885 |
SVM | 0.9306 | 0.0000 | 1.0000 | 0.9306 | 0.0000 | 1.0000 |
NN | 0.9306 | 0.0000 | 1.0000 | 0.9306 | 0.0000 | 1.0000 |
RF | 1.0000 | 1.0000 | 1.0000 | 0.9447 | 0.3577 | 0.9885 |
AdaBoost | 0.9562 | 0.4460 | 0.9943 | 0.9380 | 0.3008 | 0.9855 |
GBM | 0.9575 | 0.4390 | 0.9961 | 0.9380 | 0.2602 | 0.9885 |
From this table we can see that all of these classifiers have decent accuracy on both the training and test sets. A closer look at the sensitivity and specificity, however, reveals that the high accuracy is mostly achieved by classifying almost every case as negative (not bankrupt); the extreme cases are SVM and the neural network, with 0% sensitivity and 100% specificity.
Among these classifiers, random forest performs fairly well on the original data. It fits the training set perfectly and achieves some ability to identify positive cases (35.77% sensitivity). AdaBoost and GBM yield similar results on the test set but have lower sensitivity on the training set. Logistic regression has the best test-set sensitivity (43.09%), but also lower specificity than the other models.
The first 20 principal components already account for 90% of the total variance. We now run the same classification methods on the PC scores, computed as sketched below.
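A short sketch of how the PC-score data set can be built; the object names are assumptions and the 90% figure is the report's.

```r
pca <- prcomp(X_train, center = TRUE, scale. = TRUE)
summary(pca)$importance["Cumulative Proportion", 20]   # roughly 0.90

# First 20 component scores for the training data, and the test data projected
# onto the same components
train_pc <- pca$x[, 1:20]
test_pc  <- predict(pca, newdata = X_test)[, 1:20]
```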
Model | train.accu | train.sens | train.spec | test.accu | test.sens | test.spec |
---|---|---|---|---|---|---|
LDA | 0.9299 | 0.0662 | 0.9943 | 0.4343 | 0.8780 | 0.4012 |
Logistic | 0.9314 | 0.0767 | 0.9951 | 0.2893 | 0.9024 | 0.2436 |
kNN | 0.9372 | 0.1986 | 0.9922 | 0.9306 | 0.0000 | 1.0000 |
SVM | 0.9497 | 0.2822 | 0.9995 | 0.9306 | 0.0000 | 1.0000 |
NN | 0.9449 | 0.2927 | 0.9935 | 0.8472 | 0.1463 | 0.8994 |
RF | 1.0000 | 1.0000 | 1.0000 | 0.9216 | 0.0325 | 0.9879 |
AdaBoost | 0.9862 | 0.8014 | 1.0000 | 0.9312 | 0.0081 | 1.0000 |
GBM | 0.9446 | 0.2404 | 0.9971 | 0.8951 | 0.1220 | 0.9527 |
The results show that PCA does not really improve the models; for many classifiers the accuracy actually gets worse. Random forest, for instance, now has much lower sensitivity. This may be because random forest does not usually suffer from high dimensionality; by reducing the dimension with PCA we lose part of the variability and obtain a worse random forest model. On the test set, LDA and logistic regression both achieve high sensitivity but low specificity and low overall accuracy.
Various techniques have been proposed for dealing with class imbalance, including upsampling the minority class, downsampling the majority class, and the Synthetic Minority Over-sampling Technique (SMOTE).
Upsampling, a.k.a. oversampling, mainly targets the minority class, while downsampling targets the majority class. SMOTE generates artificial minority samples based on feature-space similarities in order to reduce the learning bias against the minority class. A standalone sketch of the three techniques is given below.
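For illustration, the three techniques can be applied to a training data frame as follows. caret provides upSample() and downSample(); SMOTE() here comes from the DMwR package, and the percentages are assumptions rather than the settings used later in the report.

```r
library(caret)
library(DMwR)    # SMOTE(); the package is archived on CRAN but still usable

train_df <- data.frame(X_train, class = y_train)
pred_cols <- names(train_df) != "class"

# Upsampling: replicate minority cases until both classes have equal size
up_df   <- upSample(x = train_df[, pred_cols], y = train_df$class)

# Downsampling: randomly drop majority cases until both classes have equal size
down_df <- downSample(x = train_df[, pred_cols], y = train_df$class)

# SMOTE: synthesize new minority cases by interpolating between nearest
# neighbours, and subsample the majority class at the same time
smote_df <- SMOTE(class ~ ., data = train_df, perc.over = 200, perc.under = 200)

table(up_df$Class); table(down_df$Class); table(smote_df$class)
```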
We experimented with all three sampling methods on the data set. Each of them results in a significant improvement in identifying positive cases. We report their performance and discuss their pros and cons later.
In addition, for binary classification with imbalanced classes, the area under the Receiver Operating Characteristic curve (ROC AUC) is a more suitable evaluation metric. The ROC curve is a two-dimensional measure of classification performance that plots the True Positive Rate (sensitivity) against the False Positive Rate (one minus specificity). In terms of the confusion matrix, the two rates are calculated as below.
\[True\ Positive\ Rate = TP / (TP + FN)\] \[False\ Positive\ Rate = FP / (FP + TN)\]
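For reference, the ROC AUC used below can be computed from predicted class probabilities with the pROC package; this is a sketch, and `fit`, `X_test` and `y_test` are assumed objects.

```r
library(pROC)

# Predicted probability of the positive ("Yes") class on the test set
prob_yes <- predict(fit, newdata = X_test, type = "prob")[, "Yes"]

roc_obj <- roc(response = y_test, predictor = prob_yes,
               levels = c("No", "Yes"), direction = "<")
auc(roc_obj)      # area under the ROC curve
plot(roc_obj)     # the full sensitivity / specificity trade-off
```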
In our analysis, upsampling, downsampling and SMOTE are each applied to the preprocessed training data, and ROC AUC is used as the main comparison metric.
This strategy does not directly address the imbalance issue. However, after experimenting with different preprocessed data sets, we found that adding a new variable numNA, which records the number of missing values in each observation (excluding the deleted attribute), brings a considerable improvement to the classifiers' performance. The distribution of numNA (see Figure 3) shows that the vast majority of non-bankrupt companies have no missing values, whereas more than half of the bankrupt companies have at least one. In tree-based models in particular, this predictor can form very informative split rules. The variable is computed roughly as sketched below.
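The numNA predictor is derived from the raw data before imputation; `bankruptcy` again denotes the data as loaded, and its name is an assumption.

```r
# Number of missing values per observation, excluding the deleted Attr37
raw_predictors <- bankruptcy[, setdiff(names(bankruptcy), c("Attr37", "class"))]
numNA <- rowSums(is.na(raw_predictors))

X_imputed$numNA <- numNA             # append the new variable to the imputed predictors
table(numNA > 0, bankruptcy$class)   # missingness is far more common among "Yes" cases
```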
We tried most combinations of the preprocessing methods, sampling methods and metrics mentioned above. To avoid verbosity, only the results using the ROC AUC metric with numNA added to the imputed data set are shown, since this combination usually gives the best performance for each of the classifiers tested in this project. The three sampling methods are applied to the training set during model training so that their differences can be compared. Note that we use the train function in the R package caret for all model training: the sampling is carried out in the background, inside each cross-validation resample, so with repeated cross-validation (one of the train function's preset resampling methods) we obtain fairly accurate error-rate estimates whose CV hold-out sets are not affected by the sampling. A sketch of the call appears below.
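A sketch of the training call just described, with SMOTE applied inside every cross-validation resample and ROC AUC as the selection metric; replace sampling = "smote" with "up" or "down" for the other two techniques. The AdaBoost method label "ada" and the use of the default tuning grid are assumptions.

```r
library(caret)

set.seed(1)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     sampling = "smote",          # applied within each CV fold
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

ada_fit <- train(x = X_train, y = y_train,
                 method    = "ada",               # AdaBoost with tree base learners
                 metric    = "ROC",
                 trControl = ctrl)

# Cross-validated ROC, sensitivity, specificity and their SDs, as in the tables below
ada_fit$results[, c("ROC", "Sens", "Spec", "ROCSD", "SensSD", "SpecSD")]
```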
Model | train.accu | train.sens | train.spec | test.accu | test.sens | test.spec |
---|---|---|---|---|---|---|
LDA | 0.8920 | 0.6341 | 0.9112 | 0.8770 | 0.6098 | 0.8970 |
Logistic | 0.6222 | 0.9373 | 0.5987 | 0.5894 | 0.8699 | 0.5685 |
kNN | 0.7866 | 0.7491 | 0.7894 | 0.7716 | 0.6992 | 0.7770 |
SVM | 0.9188 | 0.9094 | 0.9195 | 0.8810 | 0.6423 | 0.8988 |
NN | 0.8047 | 0.6307 | 0.8177 | 0.8026 | 0.5772 | 0.8194 |
RF | 0.9430 | 1.0000 | 0.9387 | 0.9131 | 0.7561 | 0.9248 |
AdaBoost | 0.9497 | 1.0000 | 0.9460 | 0.9171 | 0.7236 | 0.9315 |
GBM | 0.9359 | 0.9443 | 0.9353 | 0.9188 | 0.7967 | 0.9279 |
Model | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
---|---|---|---|---|---|---|
LDA | 0.8161 | 0.5555 | 0.9077 | 0.0382 | 0.0891 | 0.0143 |
Logistic | 0.7092 | 0.6262 | 0.7885 | 0.0646 | 0.1885 | 0.1458 |
kNN | 0.7914 | 0.6444 | 0.7983 | 0.0487 | 0.0844 | 0.0225 |
SVM | 0.8834 | 0.7187 | 0.8583 | 0.0282 | 0.1062 | 0.0255 |
NN | 0.7979 | 0.6466 | 0.8206 | 0.0456 | 0.0977 | 0.0380 |
RF | 0.9227 | 0.7235 | 0.9195 | 0.0226 | 0.0893 | 0.0150 |
AdaBoost | 0.9352 | 0.7440 | 0.9326 | 0.0163 | 0.0755 | 0.0127 |
GBM | 0.9314 | 0.7445 | 0.9197 | 0.0188 | 0.0865 | 0.0136 |
As shown in Figure 4, with SMOTE sampling the Random Forest, AdaBoost and GBM models achieve high overall performance on both the training and test sets. The accuracy of the remaining methods still shows a clear effect of the class imbalance.
For most of the methods the overall accuracy is similar to the specificity (TN rate), and the sensitivity (TP rate) is lower than the specificity, meaning that if a company actually goes bankrupt, the models are less likely to detect it. Surprisingly, Logistic Regression shows a relatively high sensitivity compared with its overall accuracy and specificity.
In the ROC table, Random Forest, AdaBoost and GBM again perform well in terms of ROC as well as CV sensitivity and specificity, although the CV sensitivity is still much lower than the specificity. The CV standard deviations of the sensitivity and specificity of logistic regression are very high, indicating highly unreliable performance.
Model | train.accu | train.sens | train.spec | test.accu | test.sens | test.spec |
---|---|---|---|---|---|---|
LDA | 0.8617 | 0.6864 | 0.8748 | 0.8466 | 0.6179 | 0.8636 |
Logistic | 0.6374 | 0.7456 | 0.6294 | 0.6385 | 0.6585 | 0.6370 |
kNN | 0.6790 | 0.9547 | 0.6584 | 0.6413 | 0.7154 | 0.6358 |
SVM | 0.9181 | 0.8780 | 0.9210 | 0.8799 | 0.6179 | 0.8994 |
NN | 0.7883 | 0.6934 | 0.7953 | 0.7795 | 0.6179 | 0.7915 |
RF | 1.0000 | 1.0000 | 1.0000 | 0.9526 | 0.4959 | 0.9867 |
AdaBoost | 0.9993 | 1.0000 | 0.9992 | 0.9566 | 0.6260 | 0.9812 |
GBM | 0.9483 | 0.9861 | 0.9455 | 0.9239 | 0.7967 | 0.9333 |
Model | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
---|---|---|---|---|---|---|
LDA | 0.8038 | 0.6219 | 0.8713 | 0.0414 | 0.0878 | 0.0159 |
Logistic | 0.6786 | 0.5779 | 0.7792 | 0.0470 | 0.2362 | 0.2231 |
kNN | 0.7596 | 0.7763 | 0.6319 | 0.0537 | 0.0893 | 0.0205 |
SVM | 0.8797 | 0.5413 | 0.9179 | 0.0306 | 0.1050 | 0.0127 |
NN | 0.7946 | 0.7031 | 0.7711 | 0.0482 | 0.1181 | 0.0490 |
RF | 0.9269 | 0.4114 | 0.9828 | 0.0203 | 0.0928 | 0.0076 |
AdaBoost | 0.9413 | 0.5992 | 0.9774 | 0.0167 | 0.0962 | 0.0066 |
GBM | 0.9360 | 0.7467 | 0.9317 | 0.0186 | 0.0765 | 0.0120 |
According to Figure 5, and unlike the SMOTE results, the performance of Random Forest and AdaBoost is no longer consistent: their test-set sensitivity is relatively low. GBM remains stable across all the performance measures. The performance of logistic regression is even more unreliable with upsampling.
Model | train.accu | train.sens | train.spec | test.accu | test.sens | test.spec |
---|---|---|---|---|---|---|
LDA | 0.8564 | 0.8014 | 0.8605 | 0.8387 | 0.7317 | 0.8467 |
Logistic | 0.8999 | 0.6376 | 0.9195 | 0.8849 | 0.5285 | 0.9115 |
kNN | 0.7614 | 0.7317 | 0.7636 | 0.7614 | 0.6911 | 0.7667 |
SVM | 0.7742 | 0.8746 | 0.7668 | 0.7518 | 0.8455 | 0.7448 |
NN | 0.8098 | 0.6725 | 0.8200 | 0.8139 | 0.6098 | 0.8291 |
RF | 0.8366 | 1.0000 | 0.8244 | 0.8206 | 0.8862 | 0.8158 |
AdaBoost | 0.8760 | 1.0000 | 0.8668 | 0.8556 | 0.8618 | 0.8552 |
GBM | 0.8779 | 0.9826 | 0.8701 | 0.8624 | 0.8455 | 0.8636 |
Model | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
---|---|---|---|---|---|---|
LDA | 0.8220 | 0.6766 | 0.8628 | 0.0463 | 0.0827 | 0.0193 |
Logistic | 0.7302 | 0.6613 | 0.7965 | 0.0756 | 0.1794 | 0.1793 |
kNN | 0.7942 | 0.7241 | 0.7605 | 0.0474 | 0.1052 | 0.0294 |
SVM | 0.8699 | 0.8198 | 0.7502 | 0.0324 | 0.0712 | 0.0252 |
NN | 0.7937 | 0.7023 | 0.7705 | 0.0449 | 0.1085 | 0.0549 |
RF | 0.9084 | 0.8450 | 0.8263 | 0.0233 | 0.0624 | 0.0202 |
AdaBoost | 0.9224 | 0.8493 | 0.8517 | 0.0222 | 0.0683 | 0.0180 |
GBM | 0.9211 | 0.8422 | 0.8426 | 0.0228 | 0.0708 | 0.0197 |
With downsampling, Random Forest, AdaBoost and GBM (shown in Figure 6) are again the least affected by the imbalance and have the best overall performance. Logistic regression has the highest overall accuracy and specificity among the remaining methods on both the training and test sets, but its sensitivity is poor, meaning its ability to detect bankrupt companies is weak; if the goal is to detect as many bankrupt companies as possible, it is clearly not a desirable model. The results of the other non-ensemble methods are also unpromising.
Across the three sampling methods, SMOTE is the most robust to the imbalanced data, and Random Forest, AdaBoost and GBM have the best overall performance in dealing with the imbalance. Among these three classifiers, GBM is the most stable, maintaining high overall accuracy, sensitivity and specificity on both the training and test sets.
Comparing the results in section 3 and section 4, we can see that the performance of the classifiers improves massively after introducing the sampling techniques, the ROC metric and the new variable numNA. The results of course vary with the sampling technique used, so we next discuss the behavior of each sampling technique in this classification problem.
The advantage of upsampling is that it retains all the information in the training set. Since the positive cases are replicated roughly 15 times, a single misclassification of a positive case produces about 15 errors, so upsampling essentially assigns a higher cost to false negatives in the training set. Bear in mind, however, that there is no such replication in the test set, so the replication of minority cases can lead to overfitting. This can be seen in Table 5: for the RF and AdaBoost models, the training accuracy, sensitivity and specificity are almost perfect, yet the test sensitivity is far lower than with SMOTE sampling, despite some improvement in test accuracy and specificity. In addition, upsampling almost doubles the sample size in this problem, so the increase in computational burden is considerable.
In contrast, downsampling offers computational ease, and since there is no replication, overfitting is usually not a problem. From Table 7 and Table 8 we can see that models trained with downsampling tend to have higher sensitivity and lower specificity than with the other two sampling techniques. For the AdaBoost and GBM models, the CV estimates of both sensitivity and specificity are around 85%, so they are strong candidates when the goal is a model with high sensitivity that still retains reasonable specificity. The disadvantage of downsampling is, of course, the loss of information, although in the results of section 4.2.3 no significant drop in performance is visible other than slightly lower ROC and overall accuracy compared with the other two sampling methods.
SMOTE sampling is a combination of upsampling and downsampling: it uses the existing minority cases to synthesize new artificial samples and at the same time randomly samples a subset of the majority cases. The resulting balanced data set is usually smaller than the original data set but larger than the output of downsampling. SMOTE combines many advantages of upsampling and downsampling, such as computational ease and a limited loss of information. However, because it synthesizes new samples by interpolating between existing data points, it can only create samples within the current minority regions. In our models, SMOTE sampling usually gives high ROC and high overall accuracy; the sensitivity is usually not as high as with downsampling, but in return the specificity is not sacrificed as much, which means fewer false positives.
Next we determine which models would be the best choices in application. For any binary classification problem there is a trade-off between sensitivity and specificity, and the choice of model depends on the classification goal. In this case, misclassifying a soon-to-be bankrupt company as not bankrupt is usually assumed to be much more costly than the reverse, so the ideal model should correctly identify as many positive cases as possible. That said, specificity cannot be ignored either, as low specificity makes the positive predictions less credible. An example is the logistic regression model in section 3.2, whose test-set sensitivity reaches 90% while only one twelfth of its positive predictions are true positives (see the confusion matrix below). We are therefore looking for classifiers that maximize sensitivity without losing too much specificity.
Truth | Predicted Yes | Predicted No |
---|---|---|
Yes | 111 | 12 |
No | 1248 | 402 |
Based on the tables in section 4, the model with the highest cross-validation average sensitivity is the AdaBoost model with downsampling. Among all models using downsampling, the AdaBoost model also has the highest average ROC AUC and one of the highest specificities, and its small cross-validation standard deviations suggest that its prediction performance is quite stable. The confusion matrix below indicates that about one third of the positive predictions are true positives; not perfect, but acceptable considering the imbalance of the two classes. With both sensitivity and specificity around 85%, the AdaBoost model should be a decent classifier for this problem.
Truth | Predicted Yes | Predicted No |
---|---|---|
Yes | 106 | 17 |
No | 239 | 1411 |
Moreover, we can inspect the predicted class probabilities of the AdaBoost model to get a clearer idea of how well it separates the two classes. Figure 7 visualizes the predicted class probabilities on the training and test sets. The classifier separates the classes fairly well: it assigns most positive cases a probability very close to 1, even on the test set, although a few positive cases receive probabilities close to 0. Among the negative cases there are quite a lot of false positives, but that is the price to pay for the high sensitivity.
It is also viable to select the AdaBoost and GBM models with SMOTE sampling, since both have very high ROC and reasonable sensitivity. In fact, their sensitivity still has room for improvement: instead of naively using 0.5 as the decision threshold, we can choose thresholds that yield better sensitivity based on the repeated cross-validation results. Lowering the threshold classifies more cases as positive, increasing the sensitivity, but inevitably also lowers the specificity. The task is to find the sweet spot between the two, for example by sweeping a grid of thresholds as sketched below.
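A sketch of the threshold search over predicted class probabilities; `prob_yes` and `truth` are assumed to hold the positive-class probabilities and the corresponding true labels, for example from the repeated-CV hold-out predictions.

```r
thresholds <- seq(0.05, 0.95, by = 0.05)

perf <- t(sapply(thresholds, function(th) {
  # Classify as "Yes" whenever the predicted probability reaches the threshold
  pred <- factor(ifelse(prob_yes >= th, "Yes", "No"), levels = c("Yes", "No"))
  c(threshold   = th,
    sensitivity = mean(pred[truth == "Yes"] == "Yes"),
    specificity = mean(pred[truth == "No"]  == "No"))
}))

perf   # choose the threshold giving the desired sensitivity/specificity balance
```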
Figure 8 shows how sensitivity and specificity vary with the decision threshold. Based on this plot, one can select the most suitable threshold for the desired sensitivity and specificity levels. It is worth noting that the threshold does not affect the AdaBoost model as drastically as the GBM model, which implies that the AdaBoost model tends to produce class probabilities close to 0 or 1, with few cases in between. This is confirmed in Figure 9, where the class probabilities have a more spread-out distribution for the GBM model (note the difference between the violin plots of the two models).
Table 9 illustrates a few choices of threshold for the AdaBoost model with SMOTE sampling. It is quite surprising that we can set the threshold as low as 0.05 and still obtain sensitivity and specificity both exceeding 86%, which is even better than the AdaBoost model with downsampling. This shows the importance of not blindly using the default 0.5 threshold. Table 10 shows some examples for the GBM model; its performance is very close to, but slightly worse than, that of the AdaBoost model.
threshold | cv.sens | cv.spec | test.sens | test.spec |
---|---|---|---|---|
0.05 | 0.8647 | 0.8615 | 0.8780 | 0.8527 |
0.10 | 0.8374 | 0.8847 | 0.8374 | 0.8733 |
0.15 | 0.8186 | 0.8965 | 0.8211 | 0.8915 |
0.20 | 0.8040 | 0.9042 | 0.8211 | 0.9006 |
threshold | cv.sens | cv.spec | test.sens | test.spec |
---|---|---|---|---|
0.25 | 0.8815 | 0.8394 | 0.8780 | 0.8315 |
0.30 | 0.8445 | 0.8620 | 0.8699 | 0.8594 |
0.35 | 0.8200 | 0.8807 | 0.8293 | 0.8800 |
0.40 | 0.7927 | 0.8959 | 0.8211 | 0.8982 |
In conclusion, if the goal is a classifier with both high sensitivity and high specificity, the AdaBoost model with SMOTE sampling and a threshold of 0.05 yields over 86% sensitivity and specificity. If more weight is placed on identifying positive cases, a more fitting model can be obtained by lowering the threshold further; conversely, raising the threshold gives a model with fewer false positives. In general, the model choice is not definitive: it usually requires comparing multiple candidates against the specific classification needs.