DSC609 Final MV Nonprofit SVM/DT

Project Overview

These data were extracted from IRS Form 990, which some tax-exempt organizations are required to submit as part of their annual reporting. They offer a snapshot of those nonprofits falling above the $200,000 threshold — which, in the Mohawk Valley, is 328 nonprofits between Oneida and Herkimer counties in upstate New York for reporting years 2015, 2016, and 2017.

In previous deliverables, we used 100 and 300 organizations from one reporting year, respectively, but we omitted 28 for the kernelized Support Vector Machines (SVMs) deliverable because that group belonged to a national network that often fills a revenue void and therefore skews the data a bit. With only one year, the model serves as a more straightforward classifier, but what we found was either a low kappa or overfitting. With a larger data set, perhaps the model will become stronger.

Additionally, there are more than 1,000 nonprofits in the region, and the decision was made to focus first on those that generate the most revenue, presumably because their assets and liabilities are more conclusive in a testing situation. Mid-sized nonprofits (between $50,000 and $200,000) were then added to this data set, albeit with much lower asset and liability levels. The data are then normalized, but based on previous experimentation, proving the models’ strengths and weaknesses is the more responsible first step for predictive and classification purposes.

Even though IRS Form 990 allows for considerable dimensionality with 32 features, we elected to use the four variables these organizations share with mid-sized organizations (Form 990-EZ), as they offer the most complete data with few missing values. As with for-profit companies, much can be gleaned from four major fiscal reporting categories (revenue, expenses, assets, and liabilities) to measure the overall health of a tax-exempt organization. What makes these models different is the addition of three derived variables. The first is the difference between revenue and expenses, labeled “diff” in the data set. The second, “healthy,” is a binary label derived from that difference, which allows us to classify each organization as either “healthy” or “not healthy.” The third is the difference between assets and liabilities, labeled “gap.”

Ultimately, the “healthy” variable becomes essential as the “factor” (the dependent variable) in all models and is crucial to the question: how can we predict fiscally healthy nonprofits in the Mohawk Valley? The remaining variables are independent. Two will be watched closely: the “difference” variable, which is not an IRS reporting requirement but is the essential piece in deciding the binary classification, and the “gap” variable, which is the difference between assets and liabilities. This may prove to be the most important finding, as outlined above. There are 1,725 records in the data set.
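The derived columns described above can be reproduced from the four reported ones. The following is a minimal sketch on a made-up three-row data frame, not the project data. The name `toy` and the coding of `healthy` (1 when revenue covers expenses) are assumptions for illustration; the coding used in the actual file should be checked against the printed rows.

```r
# Toy data (hypothetical values, not taken from the Form 990 extract)
toy <- data.frame(
  revenue     = c(120000, 80000, 250000),
  expenses    = c(100000, 95000, 250000),
  assets      = c(300000, 40000, 500000),
  liabilities = c(50000, 60000, 100000)
)

# diff: revenue minus expenses; gap: assets minus liabilities
toy$diff <- toy$revenue - toy$expenses
toy$gap  <- toy$assets - toy$liabilities

# Assumed coding: 1 = revenue covers expenses, 0 = deficit
toy$healthy <- factor(ifelse(toy$diff >= 0, 1, 0), levels = c(0, 1))

toy
```

Because the label is computed directly from `diff`, any model that receives `diff` as a predictor is given the information that defines the class, which is worth keeping in mind when reading the accuracy figures.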

As there are multiple parts of the overall project, we will first use these data and methodology for support vector machine (SVM) and decision tree (DT) analysis.

df <- read.csv("C:/Users/bjorzech/Desktop/609_W7.csv",stringsAsFactors = FALSE)
head(df)

##     revenue  expenses liabilities    assets      diff      gap healthy
## 1  72664896  83567663      232415  29604973 -10902767 29372558       1
## 2  72512263  81579736      140853  20701792  -9067473 20560939       1
## 3   1486216   7130439      310000  19107965  -5644223 18797965       1
## 4 115993449 120871280    71024983 115179428  -4877831 44154445       1
## 5  79121703  82677864      226736  17244423  -3556161 17017687       1
## 6  85724662  88871216    27876413  55501103  -3146554 27624690       1

tail(df)

##        revenue  expenses liabilities     assets     diff        gap healthy
## 1720  72562193  62424461  1337986492 1481609252 10137732  143622760       0
## 1721 100285807  85487916    25230438   55648804 14797891   30418366       0
## 1722 199948992 182057650   276031568 1393517194 17891342 1117485626       0
## 1723 211794354 183021955   198431818 1372367274 28772399 1173935456       0
## 1724  37887060   4491510    20365734   12765296 33395550   -7600438       0
## 1725 238449944 173479574   294569835 1361460585 64970370 1066890750       0

str(df)

## 'data.frame':    1725 obs. of  7 variables:
##  $ revenue    : int  72664896 72512263 1486216 115993449 79121703 85724662 14759406 213519379 95714 4142767 ...
##  $ expenses   : int  83567663 81579736 7130439 120871280 82677864 88871216 17815375 216361021 2517640 6451369 ...
##  $ liabilities: int  232415 140853 310000 71024983 226736 27876413 19526262 121947762 267343 4763098 ...
##  $ assets     : int  29604973 20701792 19107965 115179428 17244423 55501103 8184741 126538702 1941925 1052736 ...
##  $ diff       : int  -10902767 -9067473 -5644223 -4877831 -3556161 -3146554 -3055969 -2841642 -2421926 -2308602 ...
##  $ gap        : int  29372558 20560939 18797965 44154445 17017687 27624690 -11341521 4590940 1674582 -3710362 ...
##  $ healthy    : int  1 1 1 1 1 1 1 1 1 1 ...

df$healthy <- as.factor(df$healthy)
df$revenue <- as.numeric(df$revenue)
df$expenses <- as.numeric(df$expenses)
df$assets <- as.numeric(df$assets)
df$gap <- as.numeric(df$gap)
df$diff <- as.numeric(df$diff)
df$liabilities <- as.numeric(df$liabilities)

# Min-max normalization helper: rescales a numeric vector to the range [0, 1]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
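The helper above is only defined at this point; the linear model output further down reports "No pre-processing". As a hedged sketch of how such a helper is typically applied, the example below rescales every column of a small made-up data frame (`toy` and `toy_norm` are hypothetical names, and the values are invented).

```r
# Min-max helper, as defined in the report
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Toy numeric columns (hypothetical values)
toy <- data.frame(
  revenue  = c(50000, 200000, 1250000),
  expenses = c(60000, 150000, 900000)
)

# Apply column by column; assigning into toy_norm[] keeps the data frame shape
toy_norm <- toy
toy_norm[] <- lapply(toy, normalize)

range(toy_norm$revenue)  # 0 1
```

A column whose values are all identical would produce `NaN` here, since the denominator is zero. The `preProcess = c("center", "scale")` argument used in the radial model is a different rescaling (standardization) handled by caret inside each resample.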

Support Vector Machine Models

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

fold <- trainControl(method = "cv", number = 5)
svm.linear.cv <- train(healthy ~ revenue + expenses + liabilities + assets + diff + gap,
                       data = df,
                       trControl = fold,
                       method ="svmLinear")
svm.linear.cv

## Support Vector Machines with Linear Kernel 
## 
## 1725 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1379, 1381, 1381, 1380, 1379 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6487219  0.1715746
## 
## Tuning parameter 'C' was held constant at a value of 1

According to Hastie et al. (2016), an SVM produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space; the linear-kernel model fit here is the special case that draws its boundary in the original feature space. Based on five-fold cross-validation resampling, we find an adequate accuracy of 0.649, with a kappa of 0.172. Cross-validation partitions the data into five roughly equal parts; each part serves once as the test set while the model is trained on the remaining four (Müller & Guido, 2016).
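To make the five-fold arithmetic concrete, here is a small base-R sketch of the partitioning for 1,725 rows. It is an illustration only: caret builds its folds with class stratification, which is why the sample sizes it reports (1,379 to 1,381) differ slightly from the exact split below. The seed is arbitrary.

```r
set.seed(609)  # arbitrary seed, chosen only for reproducibility

n <- 1725  # number of records in the data set
k <- 5     # number of folds

# Assign every row to one of k folds, then shuffle the assignment
fold_id <- sample(rep(seq_len(k), length.out = n))

# Rows held out in each fold versus rows used for training
test_sizes  <- as.vector(table(fold_id))
train_sizes <- n - test_sizes

test_sizes   # 345 345 345 345 345
train_sizes  # 1380 1380 1380 1380 1380
```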

In this model, the accuracy results are stronger than the initial run weeks ago, but a stronger outcome is still desired.
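Accuracy and kappa are reported side by side because kappa discounts the agreement a classifier would reach by chance given the class balance. The sketch below computes both from a confusion matrix; `cohen_kappa` is a hypothetical helper written for this illustration, and the counts are invented rather than taken from the models in this report.

```r
# Cohen's kappa from a square confusion matrix (rows = predicted, cols = actual)
cohen_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                     # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Hypothetical counts; matrix() fills column by column
cm <- matrix(c(50, 20, 10, 20), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
kappa_value <- cohen_kappa(cm)

c(accuracy = accuracy, kappa = kappa_value)
```

In this toy case an accuracy of 0.70 corresponds to a kappa of roughly 0.35, which mirrors the pattern in the linear model, where a moderate accuracy sits alongside a low kappa.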

Kernelized SVM

To further analyze the fit of this model, a radial basis kernel is applied before the polynomial kernel, as it produces a boundary quite similar to the Bayes optimal boundary (Hastie et al., 2016, p. 424). It is also worth noting that different sigma and C values were tried to find the best fit for this model. Ultimately, it was decided to use a small, matching grid of values for sigma and C, because a large value of C will lead to an overfit, wiggly boundary in the original feature space, while a small value of C will encourage a small value of ‖β‖ and, in turn, a smoother boundary. The regularization parameter was chosen in both cases to achieve a good test error (Hastie et al., 2016, p. 424).
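For intuition about what sigma controls, the sketch below evaluates a Gaussian radial basis kernel in the parameterization exp(-sigma * ||x - y||^2), which is the form used by the kernlab package behind caret's "svmRadial" method. The function name `rbf_kernel` and the two points are made up for illustration.

```r
# Gaussian RBF kernel in the parameterization exp(-sigma * ||x - y||^2)
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

x <- c(0, 0)
y <- c(1, 1)  # squared distance from x is 2

# Similarity of the same two points under each sigma value in the grid
sigmas     <- c(0.001, 0.01, 0.1, 1, 10, 100)
similarity <- sapply(sigmas, function(s) rbf_kernel(x, y, s))

round(similarity, 4)
```

With a small sigma the two points look almost identical to the kernel (about 0.998 at sigma = 0.001), while at large sigma their similarity is effectively zero, so each training point influences only its immediate neighborhood. That is the sense in which large sigma, like large C, pushes the model toward a more flexible and potentially overfit boundary.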

param_grid <- expand.grid(sigma = c(0.001, 0.01, 0.1, 1, 10, 100),
                          C = c(0.001, 0.01, 0.1, 1, 10, 100))

svm.rbf.cv <- train(healthy ~ revenue + expenses + liabilities + assets + diff + gap,
                    data = df,
                    trControl = fold,
                    method = "svmRadial",
                    preProcess = c("center", "scale"),
                    tuneGrid = param_grid)
svm.rbf.cv

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1725 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## Pre-processing: centered (6), scaled (6) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1379, 1381, 1380, 1380, 1380 
## Resampling results across tuning parameters:
## 
##   sigma  C      Accuracy   Kappa     
##   1e-03  1e-03  0.5872467  0.00000000
##   1e-03  1e-02  0.5872467  0.00000000
##   1e-03  1e-01  0.5872467  0.00000000
##   1e-03  1e+00  0.5878264  0.00210858
##   1e-03  1e+01  0.5936252  0.01854025
##   1e-03  1e+02  0.6150712  0.07941584
##   1e-02  1e-03  0.5872467  0.00000000
##   1e-02  1e-02  0.5872467  0.00000000
##   1e-02  1e-01  0.5872467  0.00000000
##   1e-02  1e+00  0.5907216  0.01176514
##   1e-02  1e+01  0.6139067  0.07703916
##   1e-02  1e+02  0.6713201  0.23174055
##   1e-01  1e-03  0.5872467  0.00000000
##   1e-01  1e-02  0.5872467  0.00000000
##   1e-01  1e-01  0.5872467  0.00000000
##   1e-01  1e+00  0.6144915  0.07779194
##   1e-01  1e+01  0.6724778  0.23550254
##   1e-01  1e+02  0.8168294  0.59542737
##   1e+00  1e-03  0.5872467  0.00000000
##   1e+00  1e-02  0.5872467  0.00000000
##   1e+00  1e-01  0.6011581  0.03919228
##   1e+00  1e+00  0.6724778  0.23488278
##   1e+00  1e+01  0.8139325  0.58810026
##   1e+00  1e+02  0.9223354  0.83540457
##   1e+01  1e-03  0.5872467  0.00000000
##   1e+01  1e-02  0.5872467  0.00000000
##   1e+01  1e-01  0.6527675  0.18102055
##   1e+01  1e+00  0.8087202  0.57579380
##   1e+01  1e+01  0.9165416  0.82272936
##   1e+01  1e+02  0.9519074  0.89915510
##   1e+02  1e-03  0.5872467  0.00000000
##   1e+02  1e-02  0.5872467  0.00000000
##   1e+02  1e-01  0.7675539  0.47611836
##   1e+02  1e+00  0.8985688  0.78317446
##   1e+02  1e+01  0.9408878  0.87569613
##   1e+02  1e+02  0.9490055  0.89301425
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 10 and C = 100.

svm.poly.cv <- train(healthy ~ revenue + expenses + liabilities + assets + diff + gap,
                     data = df,
                     trControl = fold,
                     method = "svmPoly",
                     tuneLength = 4)
svm.poly.cv

## Support Vector Machines with Polynomial Kernel 
## 
## 1725 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1380, 1381, 1379, 1379, 1381 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     Accuracy   Kappa      
##   1       0.001  0.25  0.5872470  0.000000000
##   1       0.001  0.50  0.5872470  0.000000000
##   1       0.001  1.00  0.5872470  0.000000000
##   1       0.001  2.00  0.5884098  0.003300114
##   1       0.010  0.25  0.5878300  0.002142136
##   1       0.010  0.50  0.5901506  0.009211509
##   1       0.010  1.00  0.5918914  0.014138668
##   1       0.010  2.00  0.5930475  0.017878492
##   1       0.100  0.25  0.5947917  0.022762541
##   1       0.100  0.50  0.6000124  0.037440741
##   1       0.100  1.00  0.6092912  0.063371221
##   1       0.100  2.00  0.6185650  0.089054032
##   1       1.000  0.25  0.6214619  0.097049057
##   1       1.000  0.50  0.6295762  0.119662492
##   1       1.000  1.00  0.6539376  0.184536222
##   1       1.000  2.00  0.6753937  0.242420559
##   2       0.001  0.25  0.5872470  0.000000000
##   2       0.001  0.50  0.5872470  0.000000000
##   2       0.001  1.00  0.5884098  0.003300114
##   2       0.001  2.00  0.5907286  0.010367344
##   2       0.010  0.25  0.5895692  0.007567510
##   2       0.010  0.50  0.5930475  0.016929683
##   2       0.010  1.00  0.5953731  0.023925102
##   2       0.010  2.00  0.5988513  0.033688124
##   2       0.100  0.25  0.6023246  0.043946708
##   2       0.100  0.50  0.6121864  0.072398160
##   2       0.100  1.00  0.6179819  0.088351103
##   2       0.100  2.00  0.6255215  0.108575795
##   2       1.000  0.25  0.6301559  0.121675146
##   2       1.000  0.50  0.6533495  0.184592401
##   2       1.000  1.00  0.6800298  0.255389094
##   2       1.000  2.00  0.7154193  0.345769956
##   3       0.001  0.25  0.5872470  0.000000000
##   3       0.001  0.50  0.5878284  0.001652081
##   3       0.001  1.00  0.5884098  0.003796938
##   3       0.001  2.00  0.5907320  0.010859542
##   3       0.010  0.25  0.5930475  0.017405076
##   3       0.010  0.50  0.5942086  0.020673152
##   3       0.010  1.00  0.5988530  0.032770909
##   3       0.010  2.00  0.6040704  0.048868081
##   3       0.100  0.25  0.6104456  0.067143542
##   3       0.100  0.50  0.6168225  0.084720240
##   3       0.100  1.00  0.6226246  0.101058429
##   3       0.100  2.00  0.6382853  0.142983722
##   3       1.000  0.25  0.6510357  0.177199706
##   3       1.000  0.50  0.6713307  0.231292829
##   3       1.000  1.00  0.7119443  0.336466786
##   3       1.000  2.00  0.7484663  0.428797811
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 1 and C = 2.

comparison <- resamples(list(svm.linear = svm.linear.cv, svm.poly = svm.poly.cv, svm.rbf = svm.rbf.cv))
summary(comparison)

## 
## Call:
## summary.resamples(object = comparison)
## 
## Models: svm.linear, svm.poly, svm.rbf 
## Number of resamples: 5 
## 
## Accuracy 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## svm.linear 0.6358382 0.6376812 0.6445087 0.6487219 0.6598837 0.6656977    0
## svm.poly   0.7063953 0.7080925 0.7188406 0.7484663 0.7514451 0.8575581    0
## svm.rbf    0.9335260 0.9362319 0.9507246 0.9519074 0.9652174 0.9738372    0
## 
## Kappa 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## svm.linear 0.1366787 0.1442264 0.1602273 0.1715746 0.2006038 0.2161370    0
## svm.poly   0.3244147 0.3356021 0.3561823 0.4287978 0.4374835 0.6903065    0
## svm.rbf    0.8598894 0.8661612 0.8966283 0.8991551 0.9275768 0.9455198    0

Since SVMs are sensitive to “noisy” data and to the amount of training data, the risk of overfitting applies to the linear, radial, and polynomial models alike. The recommendation weeks ago was to increase the amount of training data or change the parameters going forward for a potentially stronger fit.

From expanding the data set to 1,725 records and experimenting with the parameters, we did, in fact, find stronger accuracy results, a stronger fit, and less bias. The results are much more conclusive: the linear SVM improved somewhat (mean accuracy 0.649, kappa 0.172), the polynomial kernel did better (0.748, 0.429), and the RBF model is the most improved and the strongest (0.952, 0.899). As in Hastie et al. (2016), the regularization parameter was chosen in both cases to achieve good test error, and the radial basis kernel produces a boundary quite similar to the Bayes optimal boundary (p. 424). This slight adjustment produced a much more desirable result.
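One caveat about the comparison: the three models were resampled on different random splits, as the differing "Summary of sample sizes" lines show. A common way to compare models on identical folds is to build the fold indices once and hand them to every `train()` call. The sketch below shows only the fold construction; the outcome vector `y` is a randomly generated stand-in (an assumption), and the seed is arbitrary.

```r
library(caret)

set.seed(609)  # arbitrary seed for reproducibility

# Stand-in outcome with the same length and a similar class balance
# (hypothetical; the real outcome would be df$healthy)
y <- factor(sample(c(0, 1), size = 1725, replace = TRUE, prob = c(0.59, 0.41)))

# One set of training indices, to be reused by every model
shared_index <- createFolds(y, k = 5, returnTrain = TRUE)
fold_shared  <- trainControl(method = "cv", index = shared_index)

sapply(shared_index, length)  # training-set size per fold
```

Passing `trControl = fold_shared` to the linear, radial, and polynomial fits would then give `resamples()` fold-by-fold results that are directly comparable.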

Decision Trees

A tree classifier will allow us to distinguish between “healthy” and “not healthy” when assessing the fiscal performance of Mohawk Valley nonprofits. After expanding the data set and keeping the dimensionality similar to that of the kernelized SVM models, though, we found a much different result.

library(rpart)
treeAnalysis <- rpart(df$healthy ~ df$revenue + df$expenses + df$liabilities + df$assets, data = df)
treeAnalysis

## n= 1725 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##      1) root 1725 712 0 (0.58724638 0.41275362)  
##        2) df$expenses< 61647 535 177 0 (0.66915888 0.33084112)  
##          4) df$revenue>=50851.5 144   6 0 (0.95833333 0.04166667) *
##          5) df$revenue< 50851.5 391 171 0 (0.56265985 0.43734015)  
##           10) df$expenses< 36526 307 103 0 (0.66449511 0.33550489)  
##             20) df$revenue>=25886.5 91   7 0 (0.92307692 0.07692308) *
##             21) df$revenue< 25886.5 216  96 0 (0.55555556 0.44444444)  
##               42) df$expenses< 24156.5 196  76 0 (0.61224490 0.38775510)  
##                 84) df$revenue>=18113 35   3 0 (0.91428571 0.08571429) *
##                 85) df$revenue< 18113 161  73 0 (0.54658385 0.45341615)  
##                  170) df$expenses< 14214 139  52 0 (0.62589928 0.37410072)  
##                    340) df$revenue>=9326 33   1 0 (0.96969697 0.03030303) *
##                    341) df$revenue< 9326 106  51 0 (0.51886792 0.48113208)  
##                      682) df$expenses< 149 41   7 0 (0.82926829 0.17073171) *
##                      683) df$expenses>=149 65  21 1 (0.32307692 0.67692308) *
##                  171) df$expenses>=14214 22   1 1 (0.04545455 0.95454545) *
##               43) df$expenses>=24156.5 20   0 1 (0.00000000 1.00000000) *
##           11) df$expenses>=36526 84  16 1 (0.19047619 0.80952381) *
##        3) df$expenses>=61647 1190 535 0 (0.55042017 0.44957983)  
##          6) df$revenue>=67218 1121 467 0 (0.58340767 0.41659233)  
##           12) df$expenses< 75062.5 59   4 0 (0.93220339 0.06779661) *
##           13) df$expenses>=75062.5 1062 463 0 (0.56403013 0.43596987)  
##             26) df$revenue>=91667.5 987 402 0 (0.59270517 0.40729483)  
##               52) df$expenses< 105728.5 65   4 0 (0.93846154 0.06153846) *
##               53) df$expenses>=105728.5 922 398 0 (0.56832972 0.43167028)  
##                106) df$revenue>=125976 856 340 0 (0.60280374 0.39719626)  
##                  212) df$expenses< 135227 58   2 0 (0.96551724 0.03448276) *
##                  213) df$expenses>=135227 798 338 0 (0.57644110 0.42355890)  
##                    426) df$revenue>=163504.5 733 283 0 (0.61391542 0.38608458)  
##                      852) df$expenses< 170720 35   0 0 (1.00000000 0.00000000) *
##                      853) df$expenses>=170720 698 283 0 (0.59455587 0.40544413)  
##                       1706) df$revenue>=183625.5 670 258 0 (0.61492537 0.38507463)  
##                         3412) df$expenses< 198894.5 27   1 0 (0.96296296 0.03703704) *
##                         3413) df$expenses>=198894.5 643 257 0 (0.60031104 0.39968896)  
##                           6826) df$revenue>=271518.5 577 212 0 (0.63258232 0.36741768)  
##                            13652) df$expenses< 301757.5 44   1 0 (0.97727273 0.02272727) *
##                            13653) df$expenses>=301757.5 533 211 0 (0.60412758 0.39587242)  
##                              27306) df$revenue>=339625.5 516 194 0 (0.62403101 0.37596899)  
##                                54612) df$liabilities< 108627 161  37 0 (0.77018634 0.22981366) *
##                                54613) df$liabilities>=108627 355 157 0 (0.55774648 0.44225352)  
##                                 109226) df$revenue>=1487243 239  89 0 (0.62761506 0.37238494) *
##                                 109227) df$revenue< 1487243 116  48 1 (0.41379310 0.58620690) *
##                              27307) df$revenue< 339625.5 17   0 1 (0.00000000 1.00000000) *
##                           6827) df$revenue< 271518.5 66  21 1 (0.31818182 0.68181818)  
##                            13654) df$expenses< 237076 33  14 0 (0.57575758 0.42424242)  
##                              27308) df$revenue>=204065.5 22   3 0 (0.86363636 0.13636364) *
##                              27309) df$revenue< 204065.5 11   0 1 (0.00000000 1.00000000) *
##                            13655) df$expenses>=237076 33   2 1 (0.06060606 0.93939394) *
##                       1707) df$revenue< 183625.5 28   3 1 (0.10714286 0.89285714) *
##                    427) df$revenue< 163504.5 65  10 1 (0.15384615 0.84615385) *
##                107) df$revenue< 125976 66   8 1 (0.12121212 0.87878788) *
##             27) df$revenue< 91667.5 75  14 1 (0.18666667 0.81333333) *
##          7) df$revenue< 67218 69   1 1 (0.01449275 0.98550725) *

library(rpart.plot)
rpart.plot(treeAnalysis, extra = 4)

Even with the reduction of dimensionality and the increased size of the data set, the only conclusive finding remains that liabilities for nonprofit organizations, appearing first on the “unhealthy” branch, offer some variation within the model. As in the first model, revenue and expenses are split in a binary fashion; in the tree printed above, liabilities enter at a single split and assets do not appear at all. This is important to note because little pruning is applied in this model, which perhaps makes the results more definitive.
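The long run of alternating revenue and expenses splits may reflect how a tree approximates a class boundary that lies along revenue = expenses: it can only cut parallel to the axes, so it builds a staircase. The sketch below reproduces that behavior on synthetic data and shows two habits that can make the output easier to work with: writing the formula with bare column names (so `predict()` works on new data) and pruning at the cross-validated complexity parameter. The data, the seed, and the label coding are all assumptions made for illustration.

```r
library(rpart)

set.seed(609)  # arbitrary seed for reproducibility

# Synthetic organizations (hypothetical values)
toy <- data.frame(
  revenue  = runif(500, 0, 1e6),
  expenses = runif(500, 0, 1e6)
)
# Assumed coding: 1 when revenue covers expenses, 0 otherwise
toy$healthy <- factor(ifelse(toy$revenue - toy$expenses >= 0, 1, 0))

# Bare column names in the formula; data = toy supplies the columns
fit <- rpart(healthy ~ revenue + expenses, data = toy, method = "class")

# Prune back to the subtree with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Class predictions and training accuracy of the pruned tree
pred <- predict(pruned, newdata = toy, type = "class")
mean(pred == toy$healthy)
```

Printing `pruned` shows splits labeled `revenue` and `expenses` rather than `df$revenue` and `df$expenses`, which keeps the node descriptions shorter.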

Additional Note/Pearson’s

The inconsistency of some models with all independent variables calls for further examination and most likely contributes to the reason why a universal baseline for understanding nonprofit organizations’ fiscal health does not exist yet. For instance, of the 32 potential independent variables found on IRS forms, only four are consistent across all three filing levels (990-N, 990-EZ, and 990). With the addition of two independent variables, the difference between revenue and expenses and the gap between assets and liabilities, the models improved considerably, but more data are needed to conclusively establish that these variables are essential. This finding also stemmed from previous work, when linear models and Pearson’s correlation within a plot offered some evidence of relationships.

The results prove to be adequate and somewhat expected, given the binary nature of the label and the perhaps small data set. The likelihood of a nonprofit being labeled “healthy” appears to rely mostly on revenue and expenses, and the confusion matrices, even with variances, show no distinguishable results that might offer more insight into the model. Just as an additional measure, the k-NN model’s variables are plotted with Pearson’s correlation to assess any relationship between them. The “diff” and “healthy” variables are also included in an effort to find additional insight beyond the assessed supervised algorithms.

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

pairs.panels(df[, 1:7],
             method = "pearson",
             hist.col = "#00AFBB",
             density = TRUE,
             ellipses = TRUE)
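The coefficients shown in the upper panels of that plot can also be obtained as a plain matrix with base R's `cor()`, which is convenient when the numbers need to be quoted. The sketch below uses a synthetic data frame; the column values and the relationships between them are invented, so the resulting coefficients say nothing about the Form 990 data.

```r
set.seed(609)  # arbitrary seed for reproducibility

# Synthetic stand-in for the six numeric columns (hypothetical values)
toy <- data.frame(revenue = runif(200, 5e4, 2e6))
toy$expenses    <- toy$revenue * runif(200, 0.8, 1.2)
toy$assets      <- runif(200, 1e4, 5e6)
toy$liabilities <- toy$assets * runif(200, 0.1, 1.1)
toy$diff        <- toy$revenue - toy$expenses
toy$gap         <- toy$assets - toy$liabilities

# Pairwise Pearson correlations between all numeric columns
pearson <- cor(toy, method = "pearson")
round(pearson, 2)
```

Since `diff` and `gap` are linear combinations of the other columns, some correlation with their components is built in by construction and should not be read as an independent finding.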