These data were extracted from IRS Form 990, which some tax-exempt organizations are required to submit as part of their annual reporting. They offer a snapshot of the nonprofits above the $200,000 revenue threshold: in the Mohawk Valley, that is 328 nonprofits across Oneida and Herkimer counties in upstate New York for reporting years 2015, 2016, and 2017.
In previous deliverables, we used 100 and 300 organizations, respectively, from a single reporting year, and for the kernelized Support Vector Machines (SVMs) deliverable we omitted 28 of them because that group belonged to a national network that often fills a revenue void and therefore skews the data somewhat. With only one year of data, the model serves as a more straightforward classifier, but what we found was either a low kappa or evidence of overfitting. With a larger data set, the models may become stronger.
Additionally, there are more than 1,000 nonprofits in the region, and the decision was made to focus first on those that generate the most revenue, on the assumption that their assets and liabilities are more informative in a testing situation. For this data set, mid-sized nonprofits (revenue between $50,000 and $200,000) were then added, albeit with much lower asset and liability levels. The data can then be normalized, but based on previous experimentation, establishing each model’s strengths and weaknesses is the more responsible first step for prediction and classification.
Even though IRS Form 990 allows for considerable dimensionality with 32 features, we elected to use the four variables these organizations share with mid-sized organizations (which file Form 990-EZ), as they offer the most complete data with few missing values. As with for-profit companies, much can be gleaned from four major fiscal reporting categories (revenue, expenses, assets, and liabilities) to measure the overall health of a tax-exempt organization. What makes these models different is the addition of derived variables: the difference between revenue and expenses, labeled “diff,” which in turn allows us to classify each organization as either “healthy” or “not healthy”; and the difference between assets and liabilities, labeled “gap.” Together with the binary “healthy” label itself, three variables are added in total.
Ultimately, the “healthy” variable serves as the factor (the dependent variable) in all models and is crucial to the central question: how can we predict fiscally healthy nonprofits in the Mohawk Valley? The remaining variables are independent. Two will be watched closely: the “diff” variable, which is not an IRS reporting requirement but is the essential piece in deciding the binary classification, and the “gap” variable, the difference between assets and liabilities. These may prove to be the most important findings, as outlined above. There are 1,725 records in the data set.
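For illustration, the derived columns could be constructed from the four reported fields roughly as follows. This is only a sketch: the data frame raw is hypothetical and stands in for the raw 990 extract, since the actual preprocessing was done before the CSV used below was exported. In the rows shown by head() and tail() below, the healthy flag is 1 where diff is negative and 0 where it is positive, and the sketch assumes that coding.
# Sketch only; 'raw' is a hypothetical data frame holding the four reported 990 fields
raw$diff    <- raw$revenue - raw$expenses     # revenue minus expenses
raw$gap     <- raw$assets  - raw$liabilities  # assets minus liabilities
raw$healthy <- as.integer(raw$diff < 0)       # binary label, following the coding seen in the CSV below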
As part of the overall project, we will use these data and this methodology for ensemble learning analysis; this is the second part of the modeling.
df <- read.csv("C:/Users/bjorzech/Desktop/609_W7.csv",stringsAsFactors = FALSE)
head(df)
## revenue expenses liabilities assets diff gap healthy
## 1 72664896 83567663 232415 29604973 -10902767 29372558 1
## 2 72512263 81579736 140853 20701792 -9067473 20560939 1
## 3 1486216 7130439 310000 19107965 -5644223 18797965 1
## 4 115993449 120871280 71024983 115179428 -4877831 44154445 1
## 5 79121703 82677864 226736 17244423 -3556161 17017687 1
## 6 85724662 88871216 27876413 55501103 -3146554 27624690 1
tail(df)
## revenue expenses liabilities assets diff gap healthy
## 1720 72562193 62424461 1337986492 1481609252 10137732 143622760 0
## 1721 100285807 85487916 25230438 55648804 14797891 30418366 0
## 1722 199948992 182057650 276031568 1393517194 17891342 1117485626 0
## 1723 211794354 183021955 198431818 1372367274 28772399 1173935456 0
## 1724 37887060 4491510 20365734 12765296 33395550 -7600438 0
## 1725 238449944 173479574 294569835 1361460585 64970370 1066890750 0
str(df)
## 'data.frame': 1725 obs. of 7 variables:
## $ revenue : int 72664896 72512263 1486216 115993449 79121703 85724662 14759406 213519379 95714 4142767 ...
## $ expenses : int 83567663 81579736 7130439 120871280 82677864 88871216 17815375 216361021 2517640 6451369 ...
## $ liabilities: int 232415 140853 310000 71024983 226736 27876413 19526262 121947762 267343 4763098 ...
## $ assets : int 29604973 20701792 19107965 115179428 17244423 55501103 8184741 126538702 1941925 1052736 ...
## $ diff : int -10902767 -9067473 -5644223 -4877831 -3556161 -3146554 -3055969 -2841642 -2421926 -2308602 ...
## $ gap : int 29372558 20560939 18797965 44154445 17017687 27624690 -11341521 4590940 1674582 -3710362 ...
## $ healthy : int 1 1 1 1 1 1 1 1 1 1 ...
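# Cast the class label to a factor and the reported fields to numeric (double) before modeling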
df$healthy <- as.factor(df$healthy)
df$revenue <- as.numeric(df$revenue)
df$expenses <- as.numeric(df$expenses)
df$assets <- as.numeric(df$assets)
df$gap <- as.numeric(df$gap)
df$diff <- as.numeric(df$diff)
df$liabilities <- as.numeric(df$liabilities)
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
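The normalize() helper defines min-max scaling but is not applied in this run; if scaling were wanted before modeling, it could be applied to the six numeric predictors along these lines (a sketch only, producing a hypothetical df_norm):
# Sketch only: min-max scale the predictors into a separate data frame
num_cols <- c("revenue", "expenses", "liabilities", "assets", "diff", "gap")
df_norm <- df
df_norm[num_cols] <- lapply(df[num_cols], normalize)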
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
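# One trainControl object (10-fold cross-validation), reused by every model below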
fold <- trainControl(method = "cv", number = 10)
The primary motivation for boosting is this: because it is a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee” (Hastie, 2016, p. 337), a more holistic analysis can be conducted. When this modeling was first introduced, it seemed the most logical choice for two reasons. First, experimentation with algorithms like SVMs and decision trees proved inconclusive at times. Second, a holistic approach seemed more appropriate, as high dimensionality left some models inconclusive and others not accurate enough. This committee approach allows for a more robust comparison, especially between bagging and boosting techniques.
For instance, bagging improves the performance of a noisy classifier through averaging (Hastie, 2016, p. 365). In gradient boosted models (GBMs), sampling a fraction η of the training observations at each iteration reduces computing time by the same fraction η, and in many cases it actually produces a more accurate model (Hastie, 2016, p. 365). AdaBoost, meanwhile, has been popular since its introduction, and much has been written to explain its success in producing accurate classifiers; most of this work has centered on using classification trees as the “base learner” G(x), where improvements are often most dramatic (Hastie, 2016, p. 365).
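To make the committee idea concrete, the sketch below hand-rolls a small bagged classifier: rpart trees grown on bootstrap resamples and combined by majority vote. It is illustration only, since the caret treebag and boosting models fitted below handle the resampling, aggregation, and tuning internally.
# Sketch only: bagging by hand with rpart trees and a majority vote
library(rpart)
set.seed(42)
B <- 25                                               # number of bootstrap trees
votes <- replicate(B, {
  boot <- df[sample(nrow(df), replace = TRUE), ]      # bootstrap resample of the rows
  tree <- rpart(healthy ~ ., data = boot, method = "class")
  as.integer(as.character(predict(tree, df, type = "class")))
})
bagged <- ifelse(rowMeans(votes) > 0.5, 1, 0)         # majority vote across the B trees
mean(bagged == as.integer(as.character(df$healthy)))  # resubstitution accuracy (optimistic)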
It is essential to highlight the purpose of each technique above, because the findings that follow are critical in assessing overall accuracy and in weighing the predictive evidence, which connects back to the objective and hypotheses outlined above.
When creating models with all six independent predictor variables (revenue, expenses, liabilities, assets, diff, and gap), two classes (healthy and not healthy), and 1,725 records of nonprofit fiscal performance in the Mohawk Valley, the adjusted data set, the subsequent normalizing, and the other techniques implemented below appeared to be a sound approach. The following outlines each technique:
dt.cv <- train(healthy ~ revenue + expenses + assets + liabilities + diff + gap,
data= df,
trControl = fold,
method = "rpart")
dt.cv
## CART
##
## 1725 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1552, 1553, 1552, 1552, 1553, 1553, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.0000000 0.9959437 0.9916067
## 0.4957865 0.9959437 0.9916067
## 0.9915730 0.7900255 0.4916067
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.4957865.
bag.cv <- train(healthy ~ revenue + expenses + assets + liabilities + diff + gap,
data = df,
trControl = fold,
metric = 'Accuracy',
method = "treebag")
bag.cv
## Bagged CART
##
## 1725 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1553, 1552, 1552, 1552, 1553, 1553, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9953656 0.99042
gboost.cv <- train(healthy ~ revenue + expenses + assets + liabilities + diff + gap,
data= df,
trControl = fold,
metric = 'Accuracy',
method = "gbm")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1691 nan 0.1000 0.0934
## 2 1.0171 nan 0.1000 0.0752
## 3 0.8920 nan 0.1000 0.0621
## 4 0.7868 nan 0.1000 0.0527
## 5 0.6970 nan 0.1000 0.0449
## 6 0.6204 nan 0.1000 0.0388
## 7 0.5535 nan 0.1000 0.0334
## 8 0.4959 nan 0.1000 0.0293
## 9 0.4454 nan 0.1000 0.0252
## 10 0.4005 nan 0.1000 0.0224
## 20 0.1534 nan 0.1000 0.0071
## 40 0.0415 nan 0.1000 0.0009
## 60 0.0301 nan 0.1000 0.0000
## 80 0.0262 nan 0.1000 -0.0000
## 100 0.0235 nan 0.1000 -0.0001
## 120 0.0230 nan 0.1000 -0.0001
## 140 0.0214 nan 0.1000 -0.0001
## 150 0.0206 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1692 nan 0.1000 0.0939
## 2 1.0172 nan 0.1000 0.0755
## 3 0.8914 nan 0.1000 0.0630
## 4 0.7862 nan 0.1000 0.0526
## 5 0.6956 nan 0.1000 0.0448
## 6 0.6182 nan 0.1000 0.0387
## 7 0.5512 nan 0.1000 0.0335
## 8 0.4929 nan 0.1000 0.0288
## 9 0.4417 nan 0.1000 0.0258
## 10 0.3970 nan 0.1000 0.0224
## 20 0.1498 nan 0.1000 0.0069
## 40 0.0365 nan 0.1000 0.0010
## 60 0.0196 nan 0.1000 0.0002
## 80 0.0172 nan 0.1000 -0.0002
## 100 0.0149 nan 0.1000 -0.0002
## 120 0.0135 nan 0.1000 -0.0002
## 140 0.0122 nan 0.1000 -0.0000
## 150 0.0118 nan 0.1000 -0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1683 nan 0.1000 0.0942
## 2 1.0162 nan 0.1000 0.0755
## 3 0.8911 nan 0.1000 0.0613
## 4 0.7851 nan 0.1000 0.0525
## 5 0.6949 nan 0.1000 0.0448
## 6 0.6174 nan 0.1000 0.0382
## 7 0.5501 nan 0.1000 0.0335
## 8 0.4920 nan 0.1000 0.0291
## 9 0.4411 nan 0.1000 0.0256
## 10 0.3964 nan 0.1000 0.0222
## 20 0.1486 nan 0.1000 0.0071
## 40 0.0341 nan 0.1000 0.0009
## 60 0.0177 nan 0.1000 -0.0000
## 80 0.0138 nan 0.1000 0.0000
## 100 0.0121 nan 0.1000 -0.0000
## 120 0.0115 nan 0.1000 -0.0001
## 140 0.0111 nan 0.1000 -0.0000
## 150 0.0108 nan 0.1000 -0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1688 nan 0.1000 0.0922
## 2 1.0181 nan 0.1000 0.0756
## 3 0.8925 nan 0.1000 0.0623
## 4 0.7872 nan 0.1000 0.0517
## 5 0.6966 nan 0.1000 0.0450
## 6 0.6203 nan 0.1000 0.0384
## 7 0.5545 nan 0.1000 0.0329
## 8 0.4964 nan 0.1000 0.0291
## 9 0.4457 nan 0.1000 0.0253
## 10 0.4007 nan 0.1000 0.0222
## 20 0.1535 nan 0.1000 0.0071
## 40 0.0405 nan 0.1000 0.0008
## 60 0.0322 nan 0.1000 -0.0001
## 80 0.0274 nan 0.1000 -0.0003
## 100 0.0242 nan 0.1000 -0.0003
## 120 0.0218 nan 0.1000 0.0002
## 140 0.0208 nan 0.1000 -0.0000
## 150 0.0204 nan 0.1000 -0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1686 nan 0.1000 0.0918
## 2 1.0173 nan 0.1000 0.0757
## 3 0.8913 nan 0.1000 0.0634
## 4 0.7854 nan 0.1000 0.0520
## 5 0.6959 nan 0.1000 0.0451
## 6 0.6186 nan 0.1000 0.0387
## 7 0.5516 nan 0.1000 0.0333
## 8 0.4933 nan 0.1000 0.0291
## 9 0.4425 nan 0.1000 0.0254
## 10 0.3978 nan 0.1000 0.0224
## 20 0.1506 nan 0.1000 0.0068
## 40 0.0386 nan 0.1000 0.0008
## 60 0.0219 nan 0.1000 0.0000
## 80 0.0176 nan 0.1000 -0.0001
## 100 0.0151 nan 0.1000 -0.0000
## 120 0.0143 nan 0.1000 -0.0000
## 140 0.0135 nan 0.1000 -0.0001
## 150 0.0133 nan 0.1000 -0.0003
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1693 nan 0.1000 0.0943
## 2 1.0175 nan 0.1000 0.0763
## 3 0.8923 nan 0.1000 0.0618
## 4 0.7870 nan 0.1000 0.0524
## 5 0.6975 nan 0.1000 0.0452
## 6 0.6200 nan 0.1000 0.0386
## 7 0.5527 nan 0.1000 0.0335
## 8 0.4942 nan 0.1000 0.0286
## 9 0.4429 nan 0.1000 0.0254
## 10 0.3976 nan 0.1000 0.0221
## 20 0.1498 nan 0.1000 0.0069
## 40 0.0342 nan 0.1000 0.0009
## 60 0.0195 nan 0.1000 -0.0003
## 80 0.0146 nan 0.1000 0.0000
## 100 0.0126 nan 0.1000 -0.0002
## 120 0.0117 nan 0.1000 -0.0000
## 140 0.0111 nan 0.1000 -0.0000
## 150 0.0108 nan 0.1000 -0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1679 nan 0.1000 0.0960
## 2 1.0164 nan 0.1000 0.0760
## 3 0.8904 nan 0.1000 0.0627
## 4 0.7854 nan 0.1000 0.0516
## 5 0.6956 nan 0.1000 0.0452
## 6 0.6183 nan 0.1000 0.0381
## 7 0.5514 nan 0.1000 0.0331
## 8 0.4931 nan 0.1000 0.0290
## 9 0.4426 nan 0.1000 0.0251
## 10 0.3976 nan 0.1000 0.0223
## 20 0.1500 nan 0.1000 0.0072
## 40 0.0382 nan 0.1000 -0.0003
## 60 0.0279 nan 0.1000 -0.0000
## 80 0.0240 nan 0.1000 -0.0001
## 100 0.0231 nan 0.1000 -0.0000
## 120 0.0211 nan 0.1000 0.0002
## 140 0.0195 nan 0.1000 0.0001
## 150 0.0189 nan 0.1000 -0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1681 nan 0.1000 0.0921
## 2 1.0166 nan 0.1000 0.0751
## 3 0.8907 nan 0.1000 0.0633
## 4 0.7847 nan 0.1000 0.0532
## 5 0.6943 nan 0.1000 0.0445
## 6 0.6171 nan 0.1000 0.0379
## 7 0.5503 nan 0.1000 0.0332
## 8 0.4918 nan 0.1000 0.0290
## 9 0.4408 nan 0.1000 0.0253
## 10 0.3957 nan 0.1000 0.0225
## 20 0.1471 nan 0.1000 0.0072
## 40 0.0338 nan 0.1000 0.0009
## 60 0.0210 nan 0.1000 -0.0001
## 80 0.0169 nan 0.1000 0.0001
## 100 0.0149 nan 0.1000 -0.0002
## 120 0.0131 nan 0.1000 -0.0000
## 140 0.0122 nan 0.1000 -0.0001
## 150 0.0121 nan 0.1000 -0.0003
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1683 nan 0.1000 0.0949
## 2 1.0163 nan 0.1000 0.0757
## 3 0.8908 nan 0.1000 0.0621
## 4 0.7856 nan 0.1000 0.0524
## 5 0.6951 nan 0.1000 0.0450
## 6 0.6178 nan 0.1000 0.0387
## 7 0.5507 nan 0.1000 0.0336
## 8 0.4921 nan 0.1000 0.0295
## 9 0.4408 nan 0.1000 0.0255
## 10 0.3958 nan 0.1000 0.0224
## 20 0.1466 nan 0.1000 0.0069
## 40 0.0323 nan 0.1000 0.0009
## 60 0.0165 nan 0.1000 -0.0001
## 80 0.0129 nan 0.1000 0.0001
## 100 0.0122 nan 0.1000 0.0000
## 120 0.0111 nan 0.1000 -0.0002
## 140 0.0108 nan 0.1000 -0.0003
## 150 0.0107 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1687 nan 0.1000 0.0935
## 2 1.0169 nan 0.1000 0.0753
## 3 0.8901 nan 0.1000 0.0630
## 4 0.7841 nan 0.1000 0.0534
## 5 0.6938 nan 0.1000 0.0452
## 6 0.6166 nan 0.1000 0.0386
## 7 0.5494 nan 0.1000 0.0331
## 8 0.4913 nan 0.1000 0.0289
## 9 0.4402 nan 0.1000 0.0253
## 10 0.3955 nan 0.1000 0.0220
## 20 0.1484 nan 0.1000 0.0069
## 40 0.0357 nan 0.1000 0.0010
## 60 0.0230 nan 0.1000 -0.0000
## 80 0.0210 nan 0.1000 -0.0001
## 100 0.0194 nan 0.1000 0.0000
## 120 0.0186 nan 0.1000 0.0002
## 140 0.0175 nan 0.1000 -0.0002
## 150 0.0174 nan 0.1000 0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1674 nan 0.1000 0.0942
## 2 1.0156 nan 0.1000 0.0751
## 3 0.8894 nan 0.1000 0.0630
## 4 0.7835 nan 0.1000 0.0531
## 5 0.6930 nan 0.1000 0.0453
## 6 0.6156 nan 0.1000 0.0388
## 7 0.5483 nan 0.1000 0.0334
## 8 0.4899 nan 0.1000 0.0291
## 9 0.4387 nan 0.1000 0.0257
## 10 0.3936 nan 0.1000 0.0224
## 20 0.1452 nan 0.1000 0.0070
## 40 0.0326 nan 0.1000 0.0009
## 60 0.0192 nan 0.1000 0.0003
## 80 0.0148 nan 0.1000 0.0000
## 100 0.0128 nan 0.1000 0.0001
## 120 0.0116 nan 0.1000 -0.0001
## 140 0.0113 nan 0.1000 -0.0002
## 150 0.0111 nan 0.1000 -0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1678 nan 0.1000 0.0946
## 2 1.0157 nan 0.1000 0.0771
## 3 0.8899 nan 0.1000 0.0627
## 4 0.7841 nan 0.1000 0.0529
## 5 0.6940 nan 0.1000 0.0453
## 6 0.6156 nan 0.1000 0.0384
## 7 0.5484 nan 0.1000 0.0338
## 8 0.4901 nan 0.1000 0.0296
## 9 0.4387 nan 0.1000 0.0255
## 10 0.3938 nan 0.1000 0.0224
## 20 0.1455 nan 0.1000 0.0071
## 40 0.0305 nan 0.1000 0.0009
## 60 0.0151 nan 0.1000 0.0002
## 80 0.0116 nan 0.1000 -0.0000
## 100 0.0108 nan 0.1000 -0.0000
## 120 0.0100 nan 0.1000 0.0000
## 140 0.0098 nan 0.1000 0.0000
## 150 0.0098 nan 0.1000 0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1688 nan 0.1000 0.0931
## 2 1.0175 nan 0.1000 0.0758
## 3 0.8920 nan 0.1000 0.0624
## 4 0.7868 nan 0.1000 0.0529
## 5 0.6972 nan 0.1000 0.0440
## 6 0.6202 nan 0.1000 0.0379
## 7 0.5540 nan 0.1000 0.0333
## 8 0.4965 nan 0.1000 0.0288
## 9 0.4458 nan 0.1000 0.0249
## 10 0.4007 nan 0.1000 0.0225
## 20 0.1530 nan 0.1000 0.0068
## 40 0.0391 nan 0.1000 0.0009
## 60 0.0281 nan 0.1000 -0.0003
## 80 0.0233 nan 0.1000 -0.0001
## 100 0.0220 nan 0.1000 -0.0001
## 120 0.0205 nan 0.1000 -0.0001
## 140 0.0192 nan 0.1000 0.0002
## 150 0.0184 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1683 nan 0.1000 0.0952
## 2 1.0167 nan 0.1000 0.0765
## 3 0.8913 nan 0.1000 0.0626
## 4 0.7854 nan 0.1000 0.0525
## 5 0.6951 nan 0.1000 0.0441
## 6 0.6182 nan 0.1000 0.0381
## 7 0.5511 nan 0.1000 0.0334
## 8 0.4927 nan 0.1000 0.0294
## 9 0.4413 nan 0.1000 0.0253
## 10 0.3963 nan 0.1000 0.0225
## 20 0.1486 nan 0.1000 0.0068
## 40 0.0384 nan 0.1000 -0.0003
## 60 0.0226 nan 0.1000 -0.0000
## 80 0.0172 nan 0.1000 -0.0000
## 100 0.0143 nan 0.1000 -0.0002
## 120 0.0129 nan 0.1000 -0.0001
## 140 0.0116 nan 0.1000 -0.0000
## 150 0.0114 nan 0.1000 -0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1689 nan 0.1000 0.0948
## 2 1.0175 nan 0.1000 0.0754
## 3 0.8924 nan 0.1000 0.0617
## 4 0.7867 nan 0.1000 0.0524
## 5 0.6967 nan 0.1000 0.0449
## 6 0.6197 nan 0.1000 0.0383
## 7 0.5530 nan 0.1000 0.0333
## 8 0.4944 nan 0.1000 0.0294
## 9 0.4433 nan 0.1000 0.0255
## 10 0.3984 nan 0.1000 0.0225
## 20 0.1484 nan 0.1000 0.0070
## 40 0.0337 nan 0.1000 0.0010
## 60 0.0174 nan 0.1000 -0.0002
## 80 0.0129 nan 0.1000 0.0000
## 100 0.0109 nan 0.1000 -0.0000
## 120 0.0101 nan 0.1000 -0.0000
## 140 0.0100 nan 0.1000 -0.0000
## 150 0.0100 nan 0.1000 -0.0003
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1687 nan 0.1000 0.0937
## 2 1.0168 nan 0.1000 0.0747
## 3 0.8910 nan 0.1000 0.0627
## 4 0.7849 nan 0.1000 0.0528
## 5 0.6952 nan 0.1000 0.0448
## 6 0.6179 nan 0.1000 0.0384
## 7 0.5507 nan 0.1000 0.0333
## 8 0.4925 nan 0.1000 0.0291
## 9 0.4417 nan 0.1000 0.0252
## 10 0.3975 nan 0.1000 0.0218
## 20 0.1515 nan 0.1000 0.0069
## 40 0.0390 nan 0.1000 0.0010
## 60 0.0258 nan 0.1000 -0.0002
## 80 0.0226 nan 0.1000 -0.0001
## 100 0.0203 nan 0.1000 -0.0000
## 120 0.0202 nan 0.1000 -0.0000
## 140 0.0194 nan 0.1000 -0.0003
## 150 0.0192 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1682 nan 0.1000 0.0956
## 2 1.0167 nan 0.1000 0.0753
## 3 0.8910 nan 0.1000 0.0625
## 4 0.7852 nan 0.1000 0.0533
## 5 0.6947 nan 0.1000 0.0450
## 6 0.6172 nan 0.1000 0.0385
## 7 0.5502 nan 0.1000 0.0339
## 8 0.4922 nan 0.1000 0.0291
## 9 0.4406 nan 0.1000 0.0255
## 10 0.3958 nan 0.1000 0.0224
## 20 0.1485 nan 0.1000 0.0069
## 40 0.0371 nan 0.1000 0.0008
## 60 0.0198 nan 0.1000 0.0002
## 80 0.0155 nan 0.1000 0.0000
## 100 0.0143 nan 0.1000 -0.0001
## 120 0.0137 nan 0.1000 -0.0000
## 140 0.0132 nan 0.1000 -0.0000
## 150 0.0127 nan 0.1000 0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1679 nan 0.1000 0.0925
## 2 1.0166 nan 0.1000 0.0751
## 3 0.8909 nan 0.1000 0.0628
## 4 0.7853 nan 0.1000 0.0522
## 5 0.6950 nan 0.1000 0.0450
## 6 0.6171 nan 0.1000 0.0389
## 7 0.5502 nan 0.1000 0.0335
## 8 0.4918 nan 0.1000 0.0293
## 9 0.4405 nan 0.1000 0.0255
## 10 0.3957 nan 0.1000 0.0225
## 20 0.1469 nan 0.1000 0.0070
## 40 0.0334 nan 0.1000 0.0008
## 60 0.0166 nan 0.1000 -0.0001
## 80 0.0135 nan 0.1000 -0.0000
## 100 0.0125 nan 0.1000 -0.0001
## 120 0.0118 nan 0.1000 -0.0001
## 140 0.0116 nan 0.1000 -0.0001
## 150 0.0115 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1692 nan 0.1000 0.0927
## 2 1.0179 nan 0.1000 0.0762
## 3 0.8930 nan 0.1000 0.0629
## 4 0.7875 nan 0.1000 0.0526
## 5 0.6969 nan 0.1000 0.0448
## 6 0.6199 nan 0.1000 0.0385
## 7 0.5528 nan 0.1000 0.0333
## 8 0.4953 nan 0.1000 0.0283
## 9 0.4444 nan 0.1000 0.0255
## 10 0.3989 nan 0.1000 0.0225
## 20 0.1516 nan 0.1000 0.0073
## 40 0.0385 nan 0.1000 0.0009
## 60 0.0275 nan 0.1000 -0.0001
## 80 0.0227 nan 0.1000 0.0003
## 100 0.0208 nan 0.1000 -0.0001
## 120 0.0179 nan 0.1000 -0.0000
## 140 0.0176 nan 0.1000 -0.0001
## 150 0.0168 nan 0.1000 0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1686 nan 0.1000 0.0926
## 2 1.0167 nan 0.1000 0.0752
## 3 0.8910 nan 0.1000 0.0626
## 4 0.7850 nan 0.1000 0.0527
## 5 0.6945 nan 0.1000 0.0440
## 6 0.6173 nan 0.1000 0.0383
## 7 0.5503 nan 0.1000 0.0332
## 8 0.4920 nan 0.1000 0.0289
## 9 0.4406 nan 0.1000 0.0257
## 10 0.3957 nan 0.1000 0.0224
## 20 0.1486 nan 0.1000 0.0067
## 40 0.0337 nan 0.1000 0.0010
## 60 0.0174 nan 0.1000 0.0003
## 80 0.0142 nan 0.1000 -0.0001
## 100 0.0121 nan 0.1000 -0.0000
## 120 0.0104 nan 0.1000 -0.0001
## 140 0.0094 nan 0.1000 -0.0001
## 150 0.0093 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1682 nan 0.1000 0.0959
## 2 1.0161 nan 0.1000 0.0760
## 3 0.8901 nan 0.1000 0.0630
## 4 0.7846 nan 0.1000 0.0522
## 5 0.6943 nan 0.1000 0.0447
## 6 0.6169 nan 0.1000 0.0390
## 7 0.5497 nan 0.1000 0.0333
## 8 0.4914 nan 0.1000 0.0291
## 9 0.4403 nan 0.1000 0.0253
## 10 0.3954 nan 0.1000 0.0222
## 20 0.1477 nan 0.1000 0.0071
## 40 0.0308 nan 0.1000 0.0008
## 60 0.0149 nan 0.1000 0.0002
## 80 0.0112 nan 0.1000 -0.0002
## 100 0.0089 nan 0.1000 -0.0001
## 120 0.0080 nan 0.1000 -0.0001
## 140 0.0078 nan 0.1000 -0.0001
## 150 0.0076 nan 0.1000 -0.0003
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1693 nan 0.1000 0.0936
## 2 1.0179 nan 0.1000 0.0748
## 3 0.8923 nan 0.1000 0.0626
## 4 0.7867 nan 0.1000 0.0523
## 5 0.6969 nan 0.1000 0.0442
## 6 0.6200 nan 0.1000 0.0385
## 7 0.5547 nan 0.1000 0.0322
## 8 0.4959 nan 0.1000 0.0291
## 9 0.4447 nan 0.1000 0.0255
## 10 0.3996 nan 0.1000 0.0223
## 20 0.1539 nan 0.1000 0.0071
## 40 0.0405 nan 0.1000 0.0009
## 60 0.0297 nan 0.1000 -0.0001
## 80 0.0253 nan 0.1000 -0.0001
## 100 0.0240 nan 0.1000 -0.0001
## 120 0.0226 nan 0.1000 0.0002
## 140 0.0206 nan 0.1000 -0.0000
## 150 0.0201 nan 0.1000 -0.0003
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1687 nan 0.1000 0.0938
## 2 1.0173 nan 0.1000 0.0756
## 3 0.8916 nan 0.1000 0.0630
## 4 0.7861 nan 0.1000 0.0525
## 5 0.6961 nan 0.1000 0.0457
## 6 0.6188 nan 0.1000 0.0391
## 7 0.5520 nan 0.1000 0.0332
## 8 0.4938 nan 0.1000 0.0288
## 9 0.4429 nan 0.1000 0.0256
## 10 0.3979 nan 0.1000 0.0225
## 20 0.1498 nan 0.1000 0.0071
## 40 0.0360 nan 0.1000 0.0008
## 60 0.0217 nan 0.1000 -0.0000
## 80 0.0184 nan 0.1000 0.0000
## 100 0.0159 nan 0.1000 -0.0002
## 120 0.0142 nan 0.1000 0.0001
## 140 0.0130 nan 0.1000 -0.0001
## 150 0.0128 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1686 nan 0.1000 0.0954
## 2 1.0170 nan 0.1000 0.0756
## 3 0.8917 nan 0.1000 0.0631
## 4 0.7861 nan 0.1000 0.0538
## 5 0.6961 nan 0.1000 0.0446
## 6 0.6189 nan 0.1000 0.0383
## 7 0.5522 nan 0.1000 0.0333
## 8 0.4940 nan 0.1000 0.0293
## 9 0.4432 nan 0.1000 0.0250
## 10 0.3982 nan 0.1000 0.0225
## 20 0.1493 nan 0.1000 0.0069
## 40 0.0344 nan 0.1000 0.0011
## 60 0.0186 nan 0.1000 -0.0002
## 80 0.0154 nan 0.1000 -0.0000
## 100 0.0138 nan 0.1000 -0.0000
## 120 0.0132 nan 0.1000 -0.0001
## 140 0.0123 nan 0.1000 -0.0003
## 150 0.0117 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1685 nan 0.1000 0.0945
## 2 1.0165 nan 0.1000 0.0764
## 3 0.8911 nan 0.1000 0.0626
## 4 0.7861 nan 0.1000 0.0525
## 5 0.6962 nan 0.1000 0.0450
## 6 0.6188 nan 0.1000 0.0384
## 7 0.5514 nan 0.1000 0.0332
## 8 0.4926 nan 0.1000 0.0289
## 9 0.4417 nan 0.1000 0.0252
## 10 0.3972 nan 0.1000 0.0220
## 20 0.1510 nan 0.1000 0.0071
## 40 0.0385 nan 0.1000 0.0009
## 60 0.0259 nan 0.1000 -0.0002
## 80 0.0240 nan 0.1000 -0.0000
## 100 0.0216 nan 0.1000 0.0002
## 120 0.0208 nan 0.1000 -0.0000
## 140 0.0188 nan 0.1000 -0.0001
## 150 0.0188 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1679 nan 0.1000 0.0940
## 2 1.0157 nan 0.1000 0.0762
## 3 0.8904 nan 0.1000 0.0628
## 4 0.7844 nan 0.1000 0.0528
## 5 0.6943 nan 0.1000 0.0448
## 6 0.6170 nan 0.1000 0.0386
## 7 0.5502 nan 0.1000 0.0335
## 8 0.4914 nan 0.1000 0.0293
## 9 0.4404 nan 0.1000 0.0256
## 10 0.3954 nan 0.1000 0.0224
## 20 0.1492 nan 0.1000 0.0069
## 40 0.0341 nan 0.1000 0.0010
## 60 0.0185 nan 0.1000 -0.0000
## 80 0.0137 nan 0.1000 0.0001
## 100 0.0122 nan 0.1000 -0.0000
## 120 0.0113 nan 0.1000 -0.0001
## 140 0.0099 nan 0.1000 -0.0002
## 150 0.0095 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1673 nan 0.1000 0.0930
## 2 1.0156 nan 0.1000 0.0760
## 3 0.8899 nan 0.1000 0.0636
## 4 0.7838 nan 0.1000 0.0530
## 5 0.6938 nan 0.1000 0.0452
## 6 0.6166 nan 0.1000 0.0385
## 7 0.5496 nan 0.1000 0.0335
## 8 0.4913 nan 0.1000 0.0291
## 9 0.4402 nan 0.1000 0.0256
## 10 0.3950 nan 0.1000 0.0224
## 20 0.1470 nan 0.1000 0.0067
## 40 0.0313 nan 0.1000 0.0010
## 60 0.0134 nan 0.1000 0.0001
## 80 0.0105 nan 0.1000 -0.0001
## 100 0.0089 nan 0.1000 -0.0000
## 120 0.0082 nan 0.1000 -0.0001
## 140 0.0079 nan 0.1000 -0.0000
## 150 0.0078 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1676 nan 0.1000 0.0930
## 2 1.0160 nan 0.1000 0.0763
## 3 0.8914 nan 0.1000 0.0608
## 4 0.7862 nan 0.1000 0.0524
## 5 0.6970 nan 0.1000 0.0441
## 6 0.6198 nan 0.1000 0.0381
## 7 0.5533 nan 0.1000 0.0331
## 8 0.4952 nan 0.1000 0.0287
## 9 0.4446 nan 0.1000 0.0253
## 10 0.3999 nan 0.1000 0.0220
## 20 0.1541 nan 0.1000 0.0069
## 40 0.0443 nan 0.1000 0.0010
## 60 0.0294 nan 0.1000 -0.0001
## 80 0.0258 nan 0.1000 -0.0002
## 100 0.0233 nan 0.1000 0.0000
## 120 0.0228 nan 0.1000 -0.0001
## 140 0.0220 nan 0.1000 0.0001
## 150 0.0212 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1681 nan 0.1000 0.0932
## 2 1.0167 nan 0.1000 0.0752
## 3 0.8915 nan 0.1000 0.0626
## 4 0.7861 nan 0.1000 0.0529
## 5 0.6957 nan 0.1000 0.0451
## 6 0.6186 nan 0.1000 0.0384
## 7 0.5517 nan 0.1000 0.0333
## 8 0.4932 nan 0.1000 0.0290
## 9 0.4420 nan 0.1000 0.0256
## 10 0.3970 nan 0.1000 0.0223
## 20 0.1504 nan 0.1000 0.0068
## 40 0.0377 nan 0.1000 0.0009
## 60 0.0233 nan 0.1000 -0.0001
## 80 0.0182 nan 0.1000 0.0002
## 100 0.0164 nan 0.1000 -0.0003
## 120 0.0148 nan 0.1000 -0.0003
## 140 0.0132 nan 0.1000 -0.0000
## 150 0.0127 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1686 nan 0.1000 0.0930
## 2 1.0172 nan 0.1000 0.0754
## 3 0.8918 nan 0.1000 0.0625
## 4 0.7865 nan 0.1000 0.0523
## 5 0.6966 nan 0.1000 0.0447
## 6 0.6197 nan 0.1000 0.0381
## 7 0.5530 nan 0.1000 0.0334
## 8 0.4945 nan 0.1000 0.0291
## 9 0.4433 nan 0.1000 0.0258
## 10 0.3980 nan 0.1000 0.0226
## 20 0.1498 nan 0.1000 0.0071
## 40 0.0352 nan 0.1000 0.0006
## 60 0.0189 nan 0.1000 -0.0002
## 80 0.0146 nan 0.1000 0.0001
## 100 0.0128 nan 0.1000 -0.0000
## 120 0.0120 nan 0.1000 -0.0001
## 140 0.0115 nan 0.1000 -0.0000
## 150 0.0112 nan 0.1000 -0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.1683 nan 0.1000 0.0941
## 2 1.0162 nan 0.1000 0.0752
## 3 0.8904 nan 0.1000 0.0627
## 4 0.7848 nan 0.1000 0.0520
## 5 0.6947 nan 0.1000 0.0447
## 6 0.6173 nan 0.1000 0.0388
## 7 0.5503 nan 0.1000 0.0331
## 8 0.4922 nan 0.1000 0.0290
## 9 0.4409 nan 0.1000 0.0258
## 10 0.3961 nan 0.1000 0.0223
## 20 0.1492 nan 0.1000 0.0070
## 40 0.0365 nan 0.1000 0.0009
## 60 0.0197 nan 0.1000 -0.0001
## 80 0.0165 nan 0.1000 0.0000
## 100 0.0150 nan 0.1000 -0.0003
## 120 0.0129 nan 0.1000 -0.0000
## 140 0.0119 nan 0.1000 -0.0001
## 150 0.0121 nan 0.1000 -0.0003
gboost.cv
## Stochastic Gradient Boosting
##
## 1725 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1553, 1553, 1553, 1551, 1552, 1553, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9959470 0.9916220
## 1 100 0.9959470 0.9916220
## 1 150 0.9965284 0.9928238
## 2 50 0.9959470 0.9916220
## 2 100 0.9953689 0.9904351
## 2 150 0.9965317 0.9928387
## 3 50 0.9959503 0.9916319
## 3 100 0.9959503 0.9916369
## 3 150 0.9953689 0.9904351
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 2, shrinkage = 0.1 and n.minobsinnode = 10.
adaboost.cv <- train(healthy ~ revenue + expenses + assets + liabilities + diff + gap,
data= df,
trControl = fold,
metric = 'Accuracy',
method = "ada")
adaboost.cv
## Boosted Classification Trees
##
## 1725 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1552, 1553, 1553, 1553, 1552, 1551, ...
## Resampling results across tuning parameters:
##
## maxdepth iter Accuracy Kappa
## 1 50 0.9959403 0.9916071
## 1 100 0.9959403 0.9916071
## 1 150 0.9959403 0.9916071
## 2 50 0.9959436 0.9916170
## 2 100 0.9953656 0.9904302
## 2 150 0.9959436 0.9916170
## 3 50 0.9953622 0.9904152
## 3 100 0.9953622 0.9904152
## 3 150 0.9953622 0.9904152
##
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 50, maxdepth = 2 and nu = 0.1.
split <- log(ncol(df))
param_grid <- expand.grid(.mtry = split)
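# Note: log() is the natural log and ncol(df) counts the response column as well, so mtry is
# fixed at log(7), roughly 1.95, for this run; floor(sqrt(p)) over the p = 6 predictors is the
# more common heuristic for classification forests.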
rf <- train(healthy ~ revenue + expenses + assets + liabilities + diff + gap,
data= df,
trControl = fold,
tuneGrid = param_grid,
method = "rf")
rf
## Random Forest
##
## 1725 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1553, 1552, 1553, 1553, 1553, 1551, ...
## Resampling results:
##
## Accuracy Kappa
## 0.995937 0.9915869
##
## Tuning parameter 'mtry' was held constant at a value of 1.94591
comparison <- resamples(list(dt.cv = dt.cv, bag = bag.cv, gradient = gboost.cv, adaboost = adaboost.cv, Forest = rf))
summary(comparison)
##
## Call:
## summary.resamples(object = comparison)
##
## Models: dt.cv, bag, gradient, adaboost, Forest
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## dt.cv 0.9883721 0.9941944 0.9971098 0.9959437 1.0000000 1 0
## bag 0.9884393 0.9941860 0.9942197 0.9953656 0.9985549 1 0
## gradient 0.9885057 0.9941944 0.9971098 0.9965317 1.0000000 1 0
## adaboost 0.9883721 0.9941944 0.9942363 0.9959436 1.0000000 1 0
## Forest 0.9883721 0.9941860 0.9970930 0.9959370 1.0000000 1 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## dt.cv 0.9759137 0.9880069 0.9940406 0.9916067 1.0000000 1 0
## bag 0.9761116 0.9879821 0.9880302 0.9904200 0.9970203 1 0
## gradient 0.9762100 0.9880445 0.9940406 0.9928387 1.0000000 1 0
## adaboost 0.9759137 0.9880069 0.9881053 0.9916170 1.0000000 1 0
## Forest 0.9759137 0.9879821 0.9939911 0.9915869 1.0000000 1 0
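The same resamples object can also be inspected graphically. For example (not run here), caret’s lattice methods plot the accuracy and kappa distributions per model:
bwplot(comparison)   # box-and-whisker plots of Accuracy and Kappa for each model
dotplot(comparison)  # dot plots of the same resampling distributions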
The dramatic increase across all models relative to the earlier SVM and decision tree findings called for further analysis, and perhaps some skepticism, since SVMs are, at their core, linear binary classifiers and offered mixed results. Bagging and boosting, although applied correctly, produced what appeared to be inflated results while still improving model accuracy considerably. After running the same models with the same techniques multiple times, the ensemble learning approach now appears to be the most conclusive.
Three primary findings also stood out. In terms of accuracy, AdaBoost consistently lagged behind the other techniques across multiple iterations, while bagging did, in fact, cut through the “noise” and deliver consistency. Ultimately, GBMs appear to be the most accurate, consistent, and fastest technique, in keeping with their primary purpose, and they have emerged as the ideal technique for this model.