These data were extracted from Internal Revenue Service Form 990, which certain tax-exempt organizations must submit as part of their annual reporting. The Mohawk Valley has 328 tax-exempt organizations with annual revenues above $200,000, the threshold that requires a 990 filing. These data offer a snapshot of the 300 highest-grossing nonprofits in Oneida and Herkimer counties in upstate New York. We will explore these data throughout the term and eventually use the full data set; previously, we used 100 records. We omitted 28 organizations from this deliverable because they belong to a national network that often fills a revenue void and therefore skews the data somewhat; those records will return in future deliverables. Although longitudinal data are available, this deliverable focuses on the last full reporting year, 2018, since a single year lends itself to a more straightforward classifier model.
Even though IRS Form 990 allows for considerable dimensionality, with 32 features, we have elected to use four variables because they offer the most complete data with the fewest missing values. As with for-profit companies, much can be gleaned from four major fiscal reporting categories, revenue, expenses, assets, and liabilities, to measure the overall health of a tax-exempt organization. What makes these models different is the addition of two derived variables: the difference between revenue and expenses ("diff"), which in turn allows us to classify each organization as either "healthy" or "not healthy." Ultimately, the health variable becomes essential as the factor in all models and is crucial to the guiding question: how can we predict fiscally healthy nonprofits in the Mohawk Valley?
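Both derived fields follow directly from the reported figures. The sketch below shows one way to reconstruct them once the data frame is loaded, assuming, as the head and tail output below suggests, that "healthy" simply means the difference is positive:

# Illustrative sketch: deriving the two added variables, assuming
# "healthy" means revenue exceeded expenses (diff > 0) for the year.
df$diff    <- df$revenue - df$expenses
df$healthy <- ifelse(df$diff > 0, "Yes", "No")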
For this deliverable, we examine these data via Support Vector Machine (SVM) modeling and kernelized SVMs, since this classifier solves a function-fitting problem using a particular criterion and form of regularization (Hastie et al., 2016, p. 423). Two comparisons follow: one among different SVM models and another against two supervised classification models. We have elected to use decision trees, for their hierarchical approach to classification, and k-NN (k-nearest neighbors), whose simplest form considers exactly one nearest neighbor, the training data point closest to the point we want to predict (Müller & Guido, 2016). We can adjust the parameters of each, but since the overall objective is to compare on the same data set, these supervised models seem appropriate.
After running each model, a brief analysis is offered, followed by a holistic comparison of the results of the kernelized SVMs and the two subsequent classification models.
df <- read.csv("C:/Users/bjorzech/Desktop/609_W3.csv",stringsAsFactors = FALSE)
head(df)
## organization revenue expenses liabilities assets
## 1 Boy Scouts of America 415281 647008 100 7294994
## 2 Rob Esche Save of the Day 219112 300844 100 64292
## 3 Laborers 35 Training & Education Fund 256805 329000 100 100
## 4 John Bosco House Inc 67161 129547 100 236370
## 5 American Federation of Teachers 253214 308651 100 273630
## 6 Kuyahoora Volunteer Ambulance 212452 258437 100 223621
## diff healthy
## 1 -231727 No
## 2 -81732 No
## 3 -72195 No
## 4 -62386 No
## 5 -55437 No
## 6 -45985 No
tail(df)
## organization revenue expenses liabilities assets
## 295 Neighborhood Center Inc 12512039 12394221 10470836 14997488
## 296 Preswick Glen Inc 37887060 4491510 20365734 12765296
## 297 Upstate Cerebral Palsy Inc 93374322 91328400 20508445 48870808
## 298 Utica College 99799776 93885974 69080233 117018066
## 299 Trustees of Hamilton College 199948992 182057650 276031568 1393517194
## 300 NYS Chartered Credit Unions 72562193 62424461 1337986492 1481609252
## diff healthy
## 295 117818 Yes
## 296 33395550 Yes
## 297 2045922 Yes
## 298 5913802 Yes
## 299 17891342 Yes
## 300 10137732 Yes
str(df)
## 'data.frame': 300 obs. of 7 variables:
## $ organization: chr "Boy Scouts of America" "Rob Esche Save of the Day " "Laborers 35 Training & Education Fund" "John Bosco House Inc" ...
## $ revenue : int 415281 219112 256805 67161 253214 212452 138076 80333 182899 37205 ...
## $ expenses : int 647008 300844 329000 129547 308651 258437 176818 111244 209069 60687 ...
## $ liabilities : int 100 100 100 100 100 100 100 100 100 100 ...
## $ assets : int 7294994 64292 100 236370 273630 223621 448779 557945 57935 373935 ...
## $ diff : int -231727 -81732 -72195 -62386 -55437 -45985 -38742 -30911 -26170 -23482 ...
## $ healthy : chr "No" "No" "No" "No" ...
df$healthy     <- as.factor(df$healthy)
df$revenue     <- as.numeric(df$revenue)
df$expenses    <- as.numeric(df$expenses)
df$assets      <- as.numeric(df$assets)
df$liabilities <- as.numeric(df$liabilities)
For presentation purposes, after loading the data, the head and tail are shown to display the variables used in these models and the 300 nonprofits in the data set. Note the two derived variables, "diff" and "healthy." Although "diff" is not used in any supervised model, it is included in the Pearson correlation plot after the k-NN models to show potential variance and correlation with the other variables; this is addressed briefly in the overall analysis. Additionally, as a quick preprocessing step, the fiscal variables, which loaded as integers, were converted to numeric, reflecting their interval (numeric) nature, while "healthy" was converted to a factor, reflecting its nominal (categorical) and binary nature (Tan et al., 2019, p. 30).
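Before modeling, it is also worth recording the class balance, since it sets the majority-class baseline that every classifier below must beat. A quick check, not part of the original run (the tree output later confirms 125 "No" and 175 "Yes"):

# Class balance: 175/300 = 0.5833, the accuracy of always guessing "Yes"
table(df$healthy)
prop.table(table(df$healthy))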
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
fold <- trainControl(method = "cv", number = 5)  # five-fold cross-validation
svm.linear.cv <- train(healthy ~ revenue + expenses + liabilities + assets,
                       data = df,
                       trControl = fold,
                       method = "svmLinear")
svm.linear.cv
## Support Vector Machines with Linear Kernel
##
## 300 samples
## 4 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 240, 240, 240, 240, 240
## Resampling results:
##
## Accuracy Kappa
## 0.5833333 0.005298013
##
## Tuning parameter 'C' was held constant at a value of 1
According to Hastie et al. (2016), an SVM with a linear kernel constructs a linear boundary in the feature space, while kernelized SVMs produce nonlinear boundaries by constructing a linear boundary in a large, transformed version of that space. Here, the five-fold cross-validated accuracy of 0.5833 exactly matches the majority-class proportion (175 of 300 organizations are "healthy"), so the model does no better than always predicting "Yes." In five-fold cross-validation, the data are partitioned into five parts; each resample trains on four parts (240 rows) and validates on the held-out fifth (60 rows) (Müller & Guido, 2016).
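The "Summary of sample sizes: 240, 240, 240, 240, 240" line reflects this partitioning. A brief illustration with caret's fold builder:

# Each of the five training partitions holds roughly 4/5 of the 300 rows.
folds <- createFolds(df$healthy, k = 5, returnTrain = TRUE)
sapply(folds, length)  # approximately 240 indices per partition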
Although the accuracy may look serviceable, the kappa is concerning. Following Cohen's statistic (Altman, 1999), a kappa near zero indicates agreement no better than would be expected by chance given the marginal distributions, while a negative value indicates agreement worse than chance. A kappa of roughly 0.005, as here, therefore signals essentially chance-level performance, and the model needs to be re-evaluated.
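To make the statistic concrete, kappa can be recomputed by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the marginals. A short sketch using a hypothetical confusion matrix of the kind a majority-leaning classifier would produce:

# Cohen's kappa from a 2x2 confusion matrix:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
cohen_kappa <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
  (po - pe) / (1 - pe)
}
# Hypothetical matrix: a classifier that almost always predicts "Yes"
m <- matrix(c(5, 120, 5, 170), nrow = 2,
            dimnames = list(pred = c("No", "Yes"), actual = c("No", "Yes")))
cohen_kappa(m)  # about 0.013: barely better than chance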
To further analyze the fit of this model, a radial basis kernel is applied before the polynomial, as it can produce a boundary quite similar to the Bayes-optimal boundary (Hastie et al., 2016, p. 424). It is also worth noting that different sigma and C values were tried to understand the best fit for this model. Ultimately, a small, consistent set of values was used for both sigma and C, because a large value of C leads to an overfit, wiggly boundary in the original feature space, while a small value of C encourages a smoother boundary. The regularization parameter should be chosen to achieve a good test error (Hastie et al., 2016, p. 424).
param_grid <- expand.grid(sigma = c(0.001, 0.01, 0.1, 1, 10, 100),
                          C = c(0.001, 0.01, 0.1, 1, 10, 100))
svm.rbf.cv <- train(healthy ~ revenue + expenses + liabilities + assets,
                    data = df,
                    trControl = fold,
                    method = "svmRadial",
                    preProcess = c("center", "scale"),
                    tuneGrid = param_grid,
                    tuneLength = 10)  # ignored here: tuneGrid takes precedence
svm.rbf.cv
## Support Vector Machines with Radial Basis Function Kernel
##
## 300 samples
## 4 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 240, 240, 240, 240, 240
## Resampling results across tuning parameters:
##
## sigma C Accuracy Kappa
## 1e-03 1e-03 0.5833333 0.000000000
## 1e-03 1e-02 0.5833333 0.000000000
## 1e-03 1e-01 0.5833333 0.000000000
## 1e-03 1e+00 0.5833333 0.000000000
## 1e-03 1e+01 0.5833333 0.005349955
## 1e-03 1e+02 0.5900000 0.018543046
## 1e-02 1e-03 0.5833333 0.000000000
## 1e-02 1e-02 0.5833333 0.000000000
## 1e-02 1e-01 0.5833333 0.000000000
## 1e-02 1e+00 0.5833333 0.000000000
## 1e-02 1e+01 0.5900000 0.018543046
## 1e-02 1e+02 0.5833333 0.007826384
## 1e-01 1e-03 0.5833333 0.000000000
## 1e-01 1e-02 0.5833333 0.000000000
## 1e-01 1e-01 0.5833333 0.000000000
## 1e-01 1e+00 0.5833333 0.000000000
## 1e-01 1e+01 0.5766667 -0.007843137
## 1e-01 1e+02 0.6066667 0.069643040
## 1e+00 1e-03 0.5833333 0.000000000
## 1e+00 1e-02 0.5833333 0.000000000
## 1e+00 1e-01 0.5833333 0.000000000
## 1e+00 1e+00 0.5766667 -0.010389610
## 1e+00 1e+01 0.5933333 0.038355597
## 1e+00 1e+02 0.6400000 0.160414453
## 1e+01 1e-03 0.5833333 0.000000000
## 1e+01 1e-02 0.5833333 0.000000000
## 1e+01 1e-01 0.5833333 0.000000000
## 1e+01 1e+00 0.5733333 -0.006181378
## 1e+01 1e+01 0.5900000 0.045659868
## 1e+01 1e+02 0.6900000 0.304112641
## 1e+02 1e-03 0.5833333 0.000000000
## 1e+02 1e-02 0.5833333 0.000000000
## 1e+02 1e-01 0.5833333 0.000000000
## 1e+02 1e+00 0.5933333 0.035246180
## 1e+02 1e+01 0.6600000 0.229598545
## 1e+02 1e+02 0.7700000 0.506370529
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 100 and C = 100.
At small values of sigma and C, the radial results match the linear model; however, the selected model uses the largest sigma and C in the grid, which suggests overfitting even as it yields the best accuracy. As a result, we fit a polynomial kernel model for comparison purposes. As with other linear methods, we can make the procedure more flexible by enlarging the feature space using basis expansions such as polynomials (Hastie et al., 2016, p. 423).
svm.poly.cv <- train(healthy ~ revenue + expenses + liabilities + assets,
data = df,
trControl = fold,
method = "svmPoly",
tuneLength = 4)
svm.poly.cv
## Support Vector Machines with Polynomial Kernel
##
## 300 samples
## 4 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 240, 240, 240, 240, 240
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa
## 1 0.001 0.25 0.5833333 0.000000000
## 1 0.001 0.50 0.5800000 -0.006622517
## 1 0.001 1.00 0.5800000 -0.006622517
## 1 0.001 2.00 0.5766667 -0.013157895
## 1 0.010 0.25 0.5766667 -0.013157895
## 1 0.010 0.50 0.5766667 -0.013157895
## 1 0.010 1.00 0.5766667 -0.013157895
## 1 0.010 2.00 0.5766667 -0.013157895
## 1 0.100 0.25 0.5766667 -0.013157895
## 1 0.100 0.50 0.5766667 -0.013157895
## 1 0.100 1.00 0.5766667 -0.013157895
## 1 0.100 2.00 0.5766667 -0.013157895
## 1 1.000 0.25 0.5766667 -0.013157895
## 1 1.000 0.50 0.5800000 -0.003886372
## 1 1.000 1.00 0.5833333 0.005263158
## 1 1.000 2.00 0.5833333 0.005263158
## 2 0.001 0.25 0.5800000 -0.006622517
## 2 0.001 0.50 0.5766667 -0.013157895
## 2 0.001 1.00 0.5766667 -0.013157895
## 2 0.001 2.00 0.5766667 -0.013157895
## 2 0.010 0.25 0.5766667 -0.013157895
## 2 0.010 0.50 0.5766667 -0.013157895
## 2 0.010 1.00 0.5766667 -0.013157895
## 2 0.010 2.00 0.5766667 -0.013157895
## 2 0.100 0.25 0.5833333 0.000000000
## 2 0.100 0.50 0.5866667 0.009271523
## 2 0.100 1.00 0.5900000 0.018421053
## 2 0.100 2.00 0.5900000 0.018421053
## 2 1.000 0.25 0.5833333 0.007894737
## 2 1.000 0.50 0.5800000 0.001272220
## 2 1.000 1.00 0.5800000 0.001272220
## 2 1.000 2.00 0.5866667 0.017061694
## 3 0.001 0.25 0.5766667 -0.013157895
## 3 0.001 0.50 0.5766667 -0.013157895
## 3 0.001 1.00 0.5766667 -0.013157895
## 3 0.001 2.00 0.5766667 -0.013157895
## 3 0.010 0.25 0.5766667 -0.013157895
## 3 0.010 0.50 0.5766667 -0.013157895
## 3 0.010 1.00 0.5833333 0.000000000
## 3 0.010 2.00 0.5833333 0.000000000
## 3 0.100 0.25 0.5833333 0.000000000
## 3 0.100 0.50 0.5800000 -0.006622517
## 3 0.100 1.00 0.5833333 0.000000000
## 3 0.100 2.00 0.5833333 0.000000000
## 3 1.000 0.25 0.5766667 -0.010613454
## 3 1.000 0.50 0.5733333 -0.017148832
## 3 1.000 1.00 0.5833333 0.010508546
## 3 1.000 2.00 0.5866667 0.019641673
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 0.1 and C = 1.
The polynomial model reaches a similar accuracy with a much smaller C (degree = 2, scale = 0.1, C = 1), which suggests a more conservative fit. When comparing the linear, radial, and polynomial models, it may be worth sacrificing some accuracy for a better-regularized model rather than accepting a more accurate model that may be overfitted.
comparison <- resamples(list(svm.linear = svm.linear.cv, svm.poly = svm.poly.cv, svm.rbf = svm.rbf.cv))
summary(comparison)
##
## Call:
## summary.resamples(object = comparison)
##
## Models: svm.linear, svm.poly, svm.rbf
## Number of resamples: 5
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## svm.linear 0.5666667 0.5666667 0.5833333 0.5833333 0.6000000 0.6000000 0
## svm.poly 0.5833333 0.5833333 0.5833333 0.5900000 0.5833333 0.6166667 0
## svm.rbf 0.7333333 0.7500000 0.7666667 0.7700000 0.8000000 0.8000000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## svm.linear -0.03311258 -0.03311258 0.0000000 0.005298013 0.04635762 0.04635762
## svm.poly 0.00000000 0.00000000 0.0000000 0.018421053 0.00000000 0.09210526
## svm.rbf 0.43195266 0.47674419 0.4909091 0.506370529 0.55828221 0.57396450
## NA's
## svm.linear 0
## svm.poly 0
## svm.rbf 0
Since SVMs are sensitive to noisy data and to the amount of training data, the risk of overfitting appears consistent across the linear, radial, and polynomial models above. Going forward, we can increase the amount of training data or adjust the parameters for perhaps a stronger fit. In this comparison, the radial basis kernel produces by far the highest accuracy and kappa, which means less bias, but likely with overfitting. If required to choose a model, the radial basis could prove serviceable, though further analysis is needed; perhaps with a larger training set, the model can be refined.
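One way to probe the overfitting concern, sketched here rather than run for this deliverable, is to refit the selected radial model on a training partition and score it on a held-out test set; a large drop from the ~0.77 resampled accuracy would confirm the suspicion. The names below (svm.check, train_df, test_df) are illustrative:

# Sketch: holdout check of the chosen radial kernel (sigma = 100, C = 100)
set.seed(123)
idx       <- createDataPartition(df$healthy, p = 0.8, list = FALSE)
train_df  <- df[idx, ]
test_df   <- df[-idx, ]
svm.check <- train(healthy ~ revenue + expenses + liabilities + assets,
                   data = train_df,
                   method = "svmRadial",
                   preProcess = c("center", "scale"),
                   tuneGrid = data.frame(sigma = 100, C = 100),
                   trControl = trainControl(method = "none"))
confusionMatrix(predict(svm.check, test_df), test_df$healthy)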
As with other data mining approaches, algorithms tend to work better when dimensionality, the number of attributes in the data, is lower (Tan et al., 2019, p. 57). A tree classifier will allow us to distinguish "healthy" from "not healthy" when assessing the fiscal performance of Mohawk Valley nonprofits. More precisely, the tree is grown in the spirit of Hunt's algorithm, expanded across the variables by repeatedly applying a splitting criterion, an attribute test condition that partitions the records (Tan et al., 2019, p. 122).
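Concretely, rpart (used below) scores candidate attribute test conditions with the Gini index by default. A minimal sketch of that computation, using the class counts that appear at the tree's root node:

# Gini impurity: 1 - sum(p_i^2). A pure node scores 0; a 50/50 node scores 0.5.
gini <- function(counts) 1 - sum((counts / sum(counts))^2)
gini(c(125, 175))  # root node below: about 0.486, close to maximal impurity
# rpart greedily chooses the split that most reduces the size-weighted
# average Gini of the resulting child nodes.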
The following decision tree model uses the same four predictors. Although the resulting tree grows fairly deep, all four variables are essential to assessing fiscal health holistically. Further, if desired, revenue and expenses could be discretized into binary categories, much as liabilities and assets effectively are here. This is important to note because little pruning is applied in this model, which in the process offers perhaps more definitive results.
library(rpart)
treeAnalysis <- rpart(healthy ~ revenue + expenses + liabilities + assets, data = df)
treeAnalysis
## n= 300
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 300 125 Yes (0.4166667 0.5833333)
## 2) liabilities>=86182 140 69 Yes (0.4928571 0.5071429)
## 4) revenue< 924026.5 52 19 No (0.6346154 0.3653846)
## 8) expenses>=393377.5 24 5 No (0.7916667 0.2083333) *
## 9) expenses< 393377.5 28 14 No (0.5000000 0.5000000)
## 18) revenue< 251156.5 17 5 No (0.7058824 0.2941176) *
## 19) revenue>=251156.5 11 2 Yes (0.1818182 0.8181818) *
## 5) revenue>=924026.5 88 36 Yes (0.4090909 0.5909091)
## 10) expenses>=1321128 78 36 Yes (0.4615385 0.5384615)
## 20) revenue< 1529919 7 0 No (1.0000000 0.0000000) *
## 21) revenue>=1529919 71 29 Yes (0.4084507 0.5915493)
## 42) expenses>=1.729733e+07 19 8 No (0.5789474 0.4210526) *
## 43) expenses< 1.729733e+07 52 18 Yes (0.3461538 0.6538462)
## 86) revenue< 5844771 35 16 Yes (0.4571429 0.5428571)
## 172) expenses>=2797935 14 5 No (0.6428571 0.3571429) *
## 173) expenses< 2797935 21 7 Yes (0.3333333 0.6666667) *
## 87) revenue>=5844771 17 2 Yes (0.1176471 0.8823529) *
## 11) expenses< 1321128 10 0 Yes (0.0000000 1.0000000) *
## 3) liabilities< 86182 160 56 Yes (0.3500000 0.6500000)
## 6) revenue< 261415.5 103 45 Yes (0.4368932 0.5631068)
## 12) expenses>=164852 24 4 No (0.8333333 0.1666667) *
## 13) expenses< 164852 79 25 Yes (0.3164557 0.6835443)
## 26) revenue< 148628.5 60 25 Yes (0.4166667 0.5833333)
## 52) liabilities< 9539.5 53 25 Yes (0.4716981 0.5283019)
## 104) revenue< 38087 15 5 No (0.6666667 0.3333333) *
## 105) revenue>=38087 38 15 Yes (0.3947368 0.6052632)
## 210) expenses>=57768 28 13 No (0.5357143 0.4642857)
## 420) revenue< 82836 7 0 No (1.0000000 0.0000000) *
## 421) revenue>=82836 21 8 Yes (0.3809524 0.6190476) *
## 211) expenses< 57768 10 0 Yes (0.0000000 1.0000000) *
## 53) liabilities>=9539.5 7 0 Yes (0.0000000 1.0000000) *
## 27) revenue>=148628.5 19 0 Yes (0.0000000 1.0000000) *
## 7) revenue>=261415.5 57 11 Yes (0.1929825 0.8070175) *
library(rpart.plot)
rpart.plot(treeAnalysis, extra = 4)
As with the SVM models, accuracy in terms of gauging fiscal health is modest, and more training data may be in order. The model shows that revenue and expenses remain the stronger predictors of performance, but what is interesting here is that liabilities drive the first split. The "not healthy" branch remains binary in nature, but on the "healthy" side a division appears between organizations that may be less healthy because of their liabilities, separating the "healthy" from the perhaps "healthier." Although the effect on accuracy is modest, this may be a key finding for funders assessing fiscal health: revenue and expenses offer a definitive, straightforward answer, but liabilities, which often reflect fiscally conservative decision-making, also need to be assessed.
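As a quick check, sketched here rather than taken from the original output, the tree's resubstitution accuracy can be computed directly; note that it is measured on the training data and therefore flatters the model:

# Resubstitution accuracy of the fitted tree (training data, so optimistic)
tree.pred <- predict(treeAnalysis, df, type = "class")
mean(tree.pred == df$healthy)
table(predicted = tree.pred, actual = df$healthy)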
The following model illustrates the fourth characteristic of nearest-neighbor classifiers (Tan et al., 2019, pp. 210-211): their decision boundaries have high variability because they depend on the composition of training examples in the local neighborhood, and increasing the number of nearest neighbors may reduce that variability. Further, the data are normalized, but only the numeric variables; the "healthy" variable is excluded because it serves as the target (factor) variable in R. With any predictive or classification algorithm that relies on distance, the data should be normalized (Tan et al., 2019, p. 211).
In this classification model, the training and test sets follow a roughly 80:20 split (244 training rows, 56 test rows). We also run k-NN with k set to 1, 5, and 10, in the spirit of the stratified cross-validation discussion in Tan et al. (2019, p. 167), and apply a confusion matrix to each test.
# Min-max normalization to the [0, 1] range
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x))) }
df1 <- as.data.frame(lapply(df[2:5], normalize))  # shown for reference; df1 is not used again
head(df1)
## revenue expenses liabilities assets
## 1 0.0015004140 0.0021310536 5.680177e-08 4.923651e-03
## 2 0.0008525749 0.0009907152 5.680177e-08 4.334814e-05
## 3 0.0009770543 0.0010834671 5.680177e-08 2.227308e-08
## 4 0.0003507636 0.0004264262 5.680177e-08 1.594908e-04
## 5 0.0009651952 0.0010164331 5.680177e-08 1.846391e-04
## 6 0.0008305805 0.0008510174 5.680177e-08 1.508859e-04
num.vars <- sapply(df, is.numeric)
df[num.vars] <- lapply(df[num.vars], scale)  # z-score standardization; scale() returns
# one-column matrices, which is why the summary below shows ".V1" suffixes
myvars <- c("revenue", "expenses", "liabilities", "assets")
df.subset <- df[myvars]
summary(df.subset)
## revenue.V1 expenses.V1 liabilities.V1
## Min. :-0.236829 Min. :-0.230081 Min. :-0.097851
## 1st Qu.:-0.229826 1st Qu.:-0.225251 1st Qu.:-0.097838
## Median :-0.221471 Median :-0.216470 Median :-0.097034
## Mean : 0.000000 Mean : 0.000000 Mean : 0.000000
## 3rd Qu.:-0.175201 3rd Qu.:-0.170280 3rd Qu.:-0.089406
## Max. :10.841501 Max. :11.110433 Max. :16.763380
## assets.V1
## Min. :-0.134535
## 1st Qu.:-0.131494
## Median :-0.126341
## Mean : 0.000000
## 3rd Qu.:-0.106432
## Max. :12.336152
set.seed(123)  # knn() breaks ties at random, so the seed aids reproducibility
test <- 1:56   # first 56 rows; the data are ordered by diff, so every test case is "No"
train.df <- df.subset[-test,]
test.df <- df.subset[test,]
train.def <- df$healthy[-test]
test.def <- df$healthy[test]
library(class)
knn.1 <- knn(train.df, test.df, train.def, k=1)
knn.5 <- knn(train.df, test.df, train.def, k=5)
knn.10 <- knn(train.df, test.df, train.def, k=10)
56 * sum(test.def == knn.1)/56  # the count of correct predictions (here, 4 of 56)
## [1] 4
56 * sum(test.def == knn.5)/56
## [1] 3
56 * sum(test.def == knn.10)/56
## [1] 1
table(knn.1 ,test.def)
## test.def
## knn.1 No Yes
## No 4 0
## Yes 52 0
table(knn.5 ,test.def)
## test.def
## knn.5 No Yes
## No 3 0
## Yes 53 0
table(knn.10 ,test.def)
## test.def
## knn.10 No Yes
## No 1 0
## Yes 55 0
The results are poor, and in hindsight expected: because the rows are ordered by "diff," the first 56 observations, and therefore the entire test set, are labeled "No," so the confusion matrices contain no "Yes" cases, and the models, which mostly predict "Yes," classify very few correctly. The likelihood of a nonprofit being labeled "healthy" appears to rely mostly on revenue and expenses, but these matrices offer little further insight into the model. As an additional measure, the data are plotted with Pearson's correlation to assess any relationships between variables; "diff" and "healthy" are included in an effort to find insight beyond the supervised algorithms assessed above.
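A stratified split would avoid the degenerate holdout, since caret's createDataPartition preserves the class mix when sampling. A sketch, not run for this deliverable, of how the test set could be rebuilt (knn.5s is an illustrative name):

# Sketch: stratified 80:20 split so the test set contains both classes,
# instead of taking the first 56 (all "No") rows.
set.seed(123)
idx <- createDataPartition(df$healthy, p = 0.8, list = FALSE)
knn.5s <- knn(df.subset[idx, ], df.subset[-idx, ], df$healthy[idx], k = 5)
table(knn.5s, df$healthy[-idx])  # both "No" and "Yes" now appear in the columns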
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
pairs.panels(df[,2:7],
method = "pearson",
hist.col = "#00AFBB",
density = TRUE,
ellipses = TRUE
)
The Pearson correlations merit further examination. With "diff" and "healthy" included in the plot, two points must be made. First, among the key fiscal indicators, "diff" correlates most strongly with "assets"; the relationship is still weak, but it trends higher than with the other indicators. A plausible reading is that when revenue exceeds expenses in a given year, the nonprofit performed well, and the surplus is most likely shifted to assets, since nonprofits cannot retain those revenues in the same fashion as a for-profit company. Second, even though "healthy" shows weak relationships with the other five variables, its correlation with "diff" is the highest, which is unsurprising given that "healthy" is derived from "diff." More analysis is needed to determine whether this correlation should be higher or lower. A nonprofit should not post consistently high revenue growth, otherwise it is not devoting as much to operations, services, or programs; this can happen for a year or two, but longitudinally it may prove problematic. A multiple-year analysis, perhaps at a later date, may prove more conclusive.
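For reference, the coefficients behind these observations can be pulled directly; a small sketch (standardizing the columns above does not change Pearson's r):

# Correlation of "diff" with each fiscal variable; per the discussion
# above, assets is expected to show the strongest (if still weak) link.
sapply(df[, c("revenue", "expenses", "liabilities", "assets")],
       function(x) cor(as.numeric(x), as.numeric(df$diff)))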
Overall, a larger training set seems the logical first step for strengthening any of these classification models. Among the SVMs, the radial basis kernel offers the most accuracy, but other factors raise concern, chiefly the near-zero kappas elsewhere and the very high sigma and C of the selected model. If trading generalization for in-sample accuracy is acceptable, which is not advisable, then this model is appropriate. In the future, more experimentation is needed when establishing the model, especially with the sigma and C values.
However, when comparing SVMs with decision trees, the latter proves more useful for predictive purposes. Beyond variables such as revenue and expenses, the liabilities variable offered an interesting finding, as liabilities are often a "hidden" fiscal indicator in such assessments. For this exercise, given the hierarchical nature of decision trees, this model is decidedly the most useful for holistically assessing and predicting the fiscal health of Mohawk Valley nonprofits.
Altman, D. G. (1999). Practical statistics for medical research. Chapman & Hall/CRC Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2016). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.
Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python. O'Reilly Media.
Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2019). Introduction to data mining (2nd ed.). Pearson Education.