These data were extracted from Internal Revenue Service Form 990, which certain tax-exempt organizations must submit as part of their annual reporting. The Mohawk Valley has 328 tax-exempt organizations with annual revenues above $200,000, the threshold that requires a 990 filing. These data offer a snapshot of the 300 highest-grossing nonprofits in Oneida and Herkimer counties in upstate New York. We will explore these data throughout the term and eventually use the full data set; previously, we used 100 records. We omitted 28 organizations from this deliverable because they belong to a national network that often fills a revenue void and therefore skews the data somewhat; those records will return in future deliverables. Although longitudinal data are available, this deliverable focuses on the last full reporting year, 2018, since a single year lends itself to a more straightforward classifier model.
Even though IRS Form 990 allows for considerable dimensionality, with 32 features, we have elected to use four variables because they offer the most complete data with the fewest missing values. As with for-profit companies, much can be gleaned from four major fiscal reporting categories, revenue, expenses, assets, and liabilities, to measure the overall health of a tax-exempt organization. What makes these models different is the addition of two derived variables: the difference between revenue and expenses ("diff"), which in turn allows us to classify each organization as either "healthy" or "not healthy." Ultimately, the health variable becomes essential as the factor in all models and is crucial to the guiding question: how can we predict fiscally healthy nonprofits in the Mohawk Valley?
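Both derived fields follow directly from the reported figures. The sketch below shows one way to reconstruct them once the data frame is loaded, assuming, as the head and tail output below suggests, that "healthy" simply means the difference is positive:

# Illustrative sketch: deriving the two added variables, assuming
# "healthy" means revenue exceeded expenses (diff > 0) for the year.
df$diff    <- df$revenue - df$expenses
df$healthy <- ifelse(df$diff > 0, "Yes", "No")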
For this deliverable, we examine these data via Support Vector Machine (SVM) modeling and kernelized SVMs, since this classifier solves a function-fitting problem using a particular criterion and form of regularization (Hastie et al., 2016, p. 423). Two comparisons follow: one among different SVM models and another against two supervised classification models. We have elected to use decision trees, for their hierarchical approach to classification, and k-NN (k-nearest neighbors), whose simplest form considers exactly one nearest neighbor, the training data point closest to the point we want to predict (Müller & Guido, 2016). We can adjust the parameters of each, but since the overall objective is to compare on the same data set, these supervised models seem appropriate.
After running each model, a brief analysis is offered, followed by a holistic comparison of the results of the kernelized SVMs and the two subsequent classification models.
df <- read.csv("C:/Users/bjorzech/Desktop/609_W3.csv",stringsAsFactors = FALSE)
head(df)
## organization revenue expenses liabilities assets
## 1 Boy Scouts of America 415281 647008 100 7294994
## 2 Rob Esche Save of the Day 219112 300844 100 64292
## 3 Laborers 35 Training & Education Fund 256805 329000 100 100
## 4 John Bosco House Inc 67161 129547 100 236370
## 5 American Federation of Teachers 253214 308651 100 273630
## 6 Kuyahoora Volunteer Ambulance 212452 258437 100 223621
## diff healthy
## 1 -231727 No
## 2 -81732 No
## 3 -72195 No
## 4 -62386 No
## 5 -55437 No
## 6 -45985 No
tail(df)
## organization revenue expenses liabilities assets
## 295 Neighborhood Center Inc 12512039 12394221 10470836 14997488
## 296 Preswick Glen Inc 37887060 4491510 20365734 12765296
## 297 Upstate Cerebral Palsy Inc 93374322 91328400 20508445 48870808
## 298 Utica College 99799776 93885974 69080233 117018066
## 299 Trustees of Hamilton College 199948992 182057650 276031568 1393517194
## 300 NYS Chartered Credit Unions 72562193 62424461 1337986492 1481609252
## diff healthy
## 295 117818 Yes
## 296 33395550 Yes
## 297 2045922 Yes
## 298 5913802 Yes
## 299 17891342 Yes
## 300 10137732 Yes
str(df)
## 'data.frame': 300 obs. of 7 variables:
## $ organization: chr "Boy Scouts of America" "Rob Esche Save of the Day " "Laborers 35 Training & Education Fund" "John Bosco House Inc" ...
## $ revenue : int 415281 219112 256805 67161 253214 212452 138076 80333 182899 37205 ...
## $ expenses : int 647008 300844 329000 129547 308651 258437 176818 111244 209069 60687 ...
## $ liabilities : int 100 100 100 100 100 100 100 100 100 100 ...
## $ assets : int 7294994 64292 100 236370 273630 223621 448779 557945 57935 373935 ...
## $ diff : int -231727 -81732 -72195 -62386 -55437 -45985 -38742 -30911 -26170 -23482 ...
## $ healthy : chr "No" "No" "No" "No" ...
df$healthy     <- as.factor(df$healthy)
df$revenue     <- as.numeric(df$revenue)
df$expenses    <- as.numeric(df$expenses)
df$assets      <- as.numeric(df$assets)
df$liabilities <- as.numeric(df$liabilities)
For presentation purposes, after loading the data, the head and tail are shown to display the variables used in these models and the 300 nonprofits in the data set. Note the two derived variables, "diff" and "healthy." Although "diff" is not used in any supervised model, it is included in the Pearson correlation plot after the k-NN models to show potential variance and correlation with the other variables; this is addressed briefly in the overall analysis. Additionally, as a quick preprocessing step, the fiscal variables, which loaded as integers, were converted to numeric, reflecting their interval (numeric) nature, while "healthy" was converted to a factor, reflecting its nominal (categorical) and binary nature (Tan et al., 2019, p. 30).
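Before modeling, it is also worth recording the class balance, since it sets the majority-class baseline that every classifier below must beat. A quick check, not part of the original run (the tree output later confirms 125 "No" and 175 "Yes"):

# Class balance: 175/300 = 0.5833, the accuracy of always guessing "Yes"
table(df$healthy)
prop.table(table(df$healthy))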
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
fold <- trainControl(method = "cv", number = 5)  # five-fold cross-validation
svm.linear.cv <- train(healthy ~ revenue + expenses + liabilities + assets,
                       data = df,
                       trControl = fold,
                       method = "svmLinear")
svm.linear.cv
## Support Vector Machines with Linear Kernel
##
## 300 samples
## 4 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 240, 240, 240, 240, 240
## Resampling results:
##
## Accuracy Kappa
## 0.5833333 0.005298013
##
## Tuning parameter 'C' was held constant at a value of 1
According to Hastie et al. (2016), an SVM with a linear kernel constructs a linear boundary in the feature space, while kernelized SVMs produce nonlinear boundaries by constructing a linear boundary in a large, transformed version of that space. Here, the five-fold cross-validated accuracy of 0.5833 exactly matches the majority-class proportion (175 of 300 organizations are "healthy"), so the model does no better than always predicting "Yes." In five-fold cross-validation, the data are partitioned into five parts; each resample trains on four parts (240 rows) and validates on the held-out fifth (60 rows) (Müller & Guido, 2016).
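The "Summary of sample sizes: 240, 240, 240, 240, 240" line reflects this partitioning. A brief illustration with caret's fold builder:

# Each of the five training partitions holds roughly 4/5 of the 300 rows.
folds <- createFolds(df$healthy, k = 5, returnTrain = TRUE)
sapply(folds, length)  # approximately 240 indices per partition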
Although the accuracy may look serviceable, the kappa is concerning. Following Cohen's statistic (Altman, 1999), a kappa near zero indicates agreement no better than would be expected by chance given the marginal distributions, while a negative value indicates agreement worse than chance. A kappa of roughly 0.005, as here, therefore signals essentially chance-level performance, and the model needs to be re-evaluated.
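To make the statistic concrete, kappa can be recomputed by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the marginals. A short sketch using a hypothetical confusion matrix of the kind a majority-leaning classifier would produce:

# Cohen's kappa from a 2x2 confusion matrix:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
cohen_kappa <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
  (po - pe) / (1 - pe)
}
# Hypothetical matrix: a classifier that almost always predicts "Yes"
m <- matrix(c(5, 120, 5, 170), nrow = 2,
            dimnames = list(pred = c("No", "Yes"), actual = c("No", "Yes")))
cohen_kappa(m)  # about 0.013: barely better than chance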
To further analyze the fit of this model, a radial basis kernel is applied before the polynomial, as it can produce a boundary quite similar to the Bayes-optimal boundary (Hastie et al., 2016, p. 424). It is also worth noting that different sigma and C values were tried to understand the best fit for this model. Ultimately, a small, consistent set of values was used for both sigma and C, because a large value of C leads to an overfit, wiggly boundary in the original feature space, while a small value of C encourages a smoother boundary. The regularization parameter should be chosen to achieve a good test error (Hastie et al., 2016, p. 424).
param_grid <- expand.grid(sigma = c(0.001, 0.01, 0.1, 1, 10, 100),
                          C = c(0.001, 0.01, 0.1, 1, 10, 100))
svm.rbf.cv <- train(healthy ~ revenue + expenses + liabilities + assets,
                    data = df,
                    trControl = fold,
                    method = "svmRadial",
                    preProcess = c("center", "scale"),
                    tuneGrid = param_grid,
                    tuneLength = 10)  # ignored here: tuneGrid takes precedence
svm.rbf.cv
## Support Vector Machines with Radial Basis Function Kernel
##
## 300 samples
## 4 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 240, 240, 240, 240, 240
## Resampling results across tuning parameters:
##
## sigma C Accuracy Kappa
## 1e-03 1e-03 0.5833333 0.000000000
## 1e-03 1e-02 0.5833333 0.000000000
## 1e-03 1e-01 0.5833333 0.000000000
## 1e-03 1e+00 0.5833333 0.000000000
## 1e-03 1e+01 0.5833333 0.005349955
## 1e-03 1e+02 0.5900000 0.018543046
## 1e-02 1e-03 0.5833333 0.000000000
## 1e-02 1e-02 0.5833333 0.000000000
## 1e-02 1e-01 0.5833333 0.000000000
## 1e-02 1e+00 0.5833333 0.000000000
## 1e-02 1e+01 0.5900000 0.018543046
## 1e-02 1e+02 0.5833333 0.007826384
## 1e-01 1e-03 0.5833333 0.000000000
## 1e-01 1e-02 0.5833333 0.000000000
## 1e-01 1e-01 0.5833333 0.000000000
## 1e-01 1e+00 0.5833333 0.000000000
## 1e-01 1e+01 0.5766667 -0.007843137
## 1e-01 1e+02 0.6066667 0.069643040
## 1e+00 1e-03 0.5833333 0.000000000
## 1e+00 1e-02 0.5833333 0.000000000
## 1e+00 1e-01 0.5833333 0.000000000
## 1e+00 1e+00 0.5766667 -0.010389610
## 1e+00 1e+01 0.5933333 0.038355597
## 1e+00 1e+02 0.6400000 0.160414453
## 1e+01 1e-03 0.5833333 0.000000000
## 1e+01 1e-02 0.5833333 0.000000000
## 1e+01 1e-01 0.5833333 0.000000000
## 1e+01 1e+00 0.5733333 -0.006181378
## 1e+01 1e+01 0.5900000 0.045659868
## 1e+01 1e+02 0.6900000 0.304112641
## 1e+02 1e-03 0.5833333 0.000000000
## 1e+02 1e-02 0.5833333 0.000000000
## 1e+02 1e-01 0.5833333 0.000000000
## 1e+02 1e+00 0.5933333 0.035246180
## 1e+02 1e+01 0.6600000 0.229598545
## 1e+02 1e+02 0.7700000 0.506370529
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 100 and C = 100.
At small values of sigma and C, the radial results match the linear model; however, the selected model uses the largest sigma and C in the grid, which suggests overfitting even as it yields the best accuracy. As a result, we fit a polynomial kernel model for comparison purposes. As with other linear methods, we can make the procedure more flexible by enlarging the feature space using basis expansions such as polynomials (Hastie et al., 2016, p. 423).
svm.poly.cv <- train(healthy ~ revenue + expenses + liabilities + assets,
data = df,
trControl = fold,
method = "svmPoly",
tuneLength = 4)
svm.poly.cv
## Support Vector Machines with Polynomial Kernel
##
## 300 samples
## 4 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 240, 240, 240, 240, 240
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa
## 1 0.001 0.25 0.5833333 0.000000000
## 1 0.001 0.50 0.5800000 -0.006622517
## 1 0.001 1.00 0.5800000 -0.006622517
## 1 0.001 2.00 0.5766667 -0.013157895
## 1 0.010 0.25 0.5766667 -0.013157895
## 1 0.010 0.50 0.5766667 -0.013157895
## 1 0.010 1.00 0.5766667 -0.013157895
## 1 0.010 2.00 0.5766667 -0.013157895
## 1 0.100 0.25 0.5766667 -0.013157895
## 1 0.100 0.50 0.5766667 -0.013157895
## 1 0.100 1.00 0.5766667 -0.013157895
## 1 0.100 2.00 0.5766667 -0.013157895
## 1 1.000 0.25 0.5766667 -0.013157895
## 1 1.000 0.50 0.5800000 -0.003886372
## 1 1.000 1.00 0.5833333 0.005263158
## 1 1.000 2.00 0.5833333 0.005263158
## 2 0.001 0.25 0.5800000 -0.006622517
## 2 0.001 0.50 0.5766667 -0.013157895
## 2 0.001 1.00 0.5766667 -0.013157895
## 2 0.001 2.00 0.5766667 -0.013157895
## 2 0.010 0.25 0.5766667 -0.013157895
## 2 0.010 0.50 0.5766667 -0.013157895
## 2 0.010 1.00 0.5766667 -0.013157895
## 2 0.010 2.00 0.5766667 -0.013157895
## 2 0.100 0.25 0.5833333 0.000000000
## 2 0.100 0.50 0.5866667 0.009271523
## 2 0.100 1.00 0.5900000 0.018421053
## 2 0.100 2.00 0.5900000 0.018421053
## 2 1.000 0.25 0.5833333 0.007894737
## 2 1.000 0.50 0.5800000 0.001272220
## 2 1.000 1.00 0.5800000 0.001272220
## 2 1.000 2.00 0.5866667 0.017061694
## 3 0.001 0.25 0.5766667 -0.013157895
## 3 0.001 0.50 0.5766667 -0.013157895
## 3 0.001 1.00 0.5766667 -0.013157895
## 3 0.001 2.00 0.5766667 -0.013157895
## 3 0.010 0.25 0.5766667 -0.013157895
## 3 0.010 0.50 0.5766667 -0.013157895
## 3 0.010 1.00 0.5833333 0.000000000
## 3 0.010 2.00 0.5833333 0.000000000
## 3 0.100 0.25 0.5833333 0.000000000
## 3 0.100 0.50 0.5800000 -0.006622517
## 3 0.100 1.00 0.5833333 0.000000000
## 3 0.100 2.00 0.5833333 0.000000000
## 3 1.000 0.25 0.5766667 -0.010613454
## 3 1.000 0.50 0.5733333 -0.017148832
## 3 1.000 1.00 0.5833333 0.010508546
## 3 1.000 2.00 0.5866667 0.019641673
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 0.1 and C = 1.
The polynomial model reaches a similar accuracy with a much smaller C (degree = 2, scale = 0.1, C = 1), which suggests a more conservative fit. When comparing the linear, radial, and polynomial models, it may be worth sacrificing some accuracy for a better-regularized model rather than accepting a more accurate model that may be overfitted.
comparison <- resamples(list(svm.linear = svm.linear.cv, svm.poly = svm.poly.cv, svm.rbf = svm.rbf.cv))
summary(comparison)
##
## Call:
## summary.resamples(object = comparison)
##
## Models: svm.linear, svm.poly, svm.rbf
## Number of resamples: 5
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## svm.linear 0.5666667 0.5666667 0.5833333 0.5833333 0.6000000 0.6000000 0
## svm.poly 0.5833333 0.5833333 0.5833333 0.5900000 0.5833333 0.6166667 0
## svm.rbf 0.7333333 0.7500000 0.7666667 0.7700000 0.8000000 0.8000000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## svm.linear -0.03311258 -0.03311258 0.0000000 0.005298013 0.04635762 0.04635762
## svm.poly 0.00000000 0.00000000 0.0000000 0.018421053 0.00000000 0.09210526
## svm.rbf 0.43195266 0.47674419 0.4909091 0.506370529 0.55828221 0.57396450
## NA's
## svm.linear 0
## svm.poly 0
## svm.rbf 0
Since SVMs are sensitive to noisy data and to the amount of training data, the risk of overfitting appears consistent across the linear, radial, and polynomial models above. Going forward, we can increase the amount of training data or adjust the parameters for perhaps a stronger fit. In this comparison, the radial basis kernel produces by far the highest accuracy and kappa, which means less bias, but likely with overfitting. If required to choose a model, the radial basis could prove serviceable, though further analysis is needed; perhaps with a larger training set, the model can be refined.
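One way to probe the overfitting concern, sketched here rather than run for this deliverable, is to refit the selected radial model on a training partition and score it on a held-out test set; a large drop from the ~0.77 resampled accuracy would confirm the suspicion. The names below (svm.check, train_df, test_df) are illustrative:

# Sketch: holdout check of the chosen radial kernel (sigma = 100, C = 100)
set.seed(123)
idx       <- createDataPartition(df$healthy, p = 0.8, list = FALSE)
train_df  <- df[idx, ]
test_df   <- df[-idx, ]
svm.check <- train(healthy ~ revenue + expenses + liabilities + assets,
                   data = train_df,
                   method = "svmRadial",
                   preProcess = c("center", "scale"),
                   tuneGrid = data.frame(sigma = 100, C = 100),
                   trControl = trainControl(method = "none"))
confusionMatrix(predict(svm.check, test_df), test_df$healthy)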
As with other data mining approaches, algorithms tend to work better when dimensionality, the number of attributes in the data, is lower (Tan et al., 2019, p. 57). A tree classifier will allow us to distinguish "healthy" from "not healthy" when assessing the fiscal performance of Mohawk Valley nonprofits. More precisely, the tree is grown in the spirit of Hunt's algorithm, expanded across the variables by repeatedly applying a splitting criterion, an attribute test condition that partitions the records (Tan et al., 2019, p. 122).
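Concretely, rpart (used below) scores candidate attribute test conditions with the Gini index by default. A minimal sketch of that computation, using the class counts that appear at the tree's root node:

# Gini impurity: 1 - sum(p_i^2). A pure node scores 0; a 50/50 node scores 0.5.
gini <- function(counts) 1 - sum((counts / sum(counts))^2)
gini(c(125, 175))  # root node below: about 0.486, close to maximal impurity
# rpart greedily chooses the split that most reduces the size-weighted
# average Gini of the resulting child nodes.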
The following decision tree model uses the same four predictors. Although the resulting tree grows fairly deep, all four variables are essential to assessing fiscal health holistically. Further, if desired, revenue and expenses could be discretized into binary categories, much as liabilities and assets effectively are here. This is important to note because little pruning is applied in this model, which in the process offers perhaps more definitive results.
library(rpart)
treeAnalysis <- rpart(healthy ~ revenue + expenses + liabilities + assets, data = df)
treeAnalysis
## n= 300
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 300 125 Yes (0.4166667 0.5833333)
## 2) liabilities>=86182 140 69 Yes (0.4928571 0.5071429)
## 4) revenue< 924026.5 52 19 No (0.6346154 0.3653846)
## 8) expenses>=393377.5 24 5 No (0.7916667 0.2083333) *
## 9) expenses< 393377.5 28 14 No (0.5000000 0.5000000)
## 18) revenue< 251156.5 17 5 No (0.7058824 0.2941176) *
## 19) revenue>=251156.5 11 2 Yes (0.1818182 0.8181818) *
## 5) revenue>=924026.5 88 36 Yes (0.4090909 0.5909091)
## 10) expenses>=1321128 78 36 Yes (0.4615385 0.5384615)
## 20) revenue< 1529919 7 0 No (1.0000000 0.0000000) *
## 21) revenue>=1529919 71 29 Yes (0.4084507 0.5915493)
## 42) expenses>=1.729733e+07 19 8 No (0.5789474 0.4210526) *
## 43) expenses< 1.729733e+07 52 18 Yes (0.3461538 0.6538462)
## 86) revenue< 5844771 35 16 Yes (0.4571429 0.5428571)
## 172) expenses>=2797935 14 5 No (0.6428571 0.3571429) *
## 173) expenses< 2797935 21 7 Yes (0.3333333 0.6666667) *
## 87) revenue>=5844771 17 2 Yes (0.1176471 0.8823529) *
## 11) expenses< 1321128 10 0 Yes (0.0000000 1.0000000) *
## 3) liabilities< 86182 160 56 Yes (0.3500000 0.6500000)
## 6) revenue< 261415.5 103 45 Yes (0.4368932 0.5631068)
## 12) expenses>=164852 24 4 No (0.8333333 0.1666667) *
## 13) expenses< 164852 79 25 Yes (0.3164557 0.6835443)
## 26) revenue< 148628.5 60 25 Yes (0.4166667 0.5833333)
## 52) liabilities< 9539.5 53 25 Yes (0.4716981 0.5283019)
## 104) revenue< 38087 15 5 No (0.6666667 0.3333333) *
## 105) revenue>=38087 38 15 Yes (0.3947368 0.6052632)
## 210) expenses>=57768 28 13 No (0.5357143 0.4642857)
## 420) revenue< 82836 7 0 No (1.0000000 0.0000000) *
## 421) revenue>=82836 21 8 Yes (0.3809524 0.6190476) *
## 211) expenses< 57768 10 0 Yes (0.0000000 1.0000000) *
## 53) liabilities>=9539.5 7 0 Yes (0.0000000 1.0000000) *
## 27) revenue>=148628.5 19 0 Yes (0.0000000 1.0000000) *
## 7) revenue>=261415.5 57 11 Yes (0.1929825 0.8070175) *
library(rpart.plot)
rpart.plot(treeAnalysis, extra = 4)
As with the SVM models, accuracy in terms of gauging fiscal health is modest, and more training data may be in order. The model shows that revenue and expenses remain the stronger predictors of performance, but what is interesting here is that liabilities drive the first split. The "not healthy" branch remains binary in nature, but on the "healthy" side a division appears between organizations that may be less healthy because of their liabilities, separating the "healthy" from the perhaps "healthier." Although the effect on accuracy is modest, this may be a key finding for funders assessing fiscal health: revenue and expenses offer a definitive, straightforward answer, but liabilities, which often reflect fiscally conservative decision-making, also need to be assessed.
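As a quick check, sketched here rather than taken from the original output, the tree's resubstitution accuracy can be computed directly; note that it is measured on the training data and therefore flatters the model:

# Resubstitution accuracy of the fitted tree (training data, so optimistic)
tree.pred <- predict(treeAnalysis, df, type = "class")
mean(tree.pred == df$healthy)
table(predicted = tree.pred, actual = df$healthy)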
The following model illustrates the fourth characteristic of nearest-neighbor classifiers (Tan et al., 2019, pp. 210-211): their decision boundaries have high variability because they depend on the composition of training examples in the local neighborhood, and increasing the number of nearest neighbors may reduce that variability. Further, the data are normalized, but only the numeric variables; the "healthy" variable is excluded because it serves as the target (factor) variable in R. With any predictive or classification algorithm that relies on distance, the data should be normalized (Tan et al., 2019, p. 211).
In this classification model, the training and test sets follow a roughly 80:20 split (244 training rows, 56 test rows). We also run k-NN with k set to 1, 5, and 10, in the spirit of the stratified cross-validation discussion in Tan et al. (2019, p. 167), and apply a confusion matrix to each test.
# Min-max normalization to the [0, 1] range
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x))) }
df1 <- as.data.frame(lapply(df[2:5], normalize))  # shown for reference; df1 is not used again
head(df1)
## revenue expenses liabilities assets
## 1 0.0015004140 0.0021310536 5.680177e-08 4.923651e-03
## 2 0.0008525749 0.0009907152 5.680177e-08 4.334814e-05
## 3 0.0009770543 0.0010834671 5.680177e-08 2.227308e-08
## 4 0.0003507636 0.0004264262 5.680177e-08 1.594908e-04
## 5 0.0009651952 0.0010164331 5.680177e-08 1.846391e-04
## 6 0.0008305805 0.0008510174 5.680177e-08 1.508859e-04
num.vars <- sapply(df, is.numeric)
df[num.vars] <- lapply(df[num.vars], scale)  # z-score standardization; scale() returns
# one-column matrices, which is why the summary below shows ".V1" suffixes
myvars <- c("revenue", "expenses", "liabilities", "assets")
df.subset <- df[myvars]
summary(df.subset)
## revenue.V1 expenses.V1 liabilities.V1
## Min. :-0.236829 Min. :-0.230081 Min. :-0.097851
## 1st Qu.:-0.229826 1st Qu.:-0.225251 1st Qu.:-0.097838
## Median :-0.221471 Median :-0.216470 Median :-0.097034
## Mean : 0.000000 Mean : 0.000000 Mean : 0.000000
## 3rd Qu.:-0.175201 3rd Qu.:-0.170280 3rd Qu.:-0.089406
## Max. :10.841501 Max. :11.110433 Max. :16.763380
## assets.V1
## Min. :-0.134535
## 1st Qu.:-0.131494
## Median :-0.126341
## Mean : 0.000000
## 3rd Qu.:-0.106432
## Max. :12.336152
set.seed(123)  # knn() breaks ties at random, so the seed aids reproducibility
test <- 1:56   # first 56 rows; the data are ordered by diff, so every test case is "No"
train.df <- df.subset[-test,]
test.df <- df.subset[test,]
train.def <- df$healthy[-test]
test.def <- df$healthy[test]
library(class)
knn.1 <- knn(train.df, test.df, train.def, k=1)
knn.5 <- knn(train.df, test.df, train.def, k=5)
knn.10 <- knn(train.df, test.df, train.def, k=10)
56 * sum(test.def == knn.1)/56  # the count of correct predictions (here, 4 of 56)
## [1] 4
56 * sum(test.def == knn.5)/56
## [1] 3
56 * sum(test.def == knn.10)/56
## [1] 1
table(knn.1 ,test.def)
## test.def
## knn.1 No Yes
## No 4 0
## Yes 52 0
table(knn.5 ,test.def)
## test.def
## knn.5 No Yes
## No 3 0
## Yes 53 0
table(knn.10 ,test.def)
## test.def
## knn.10 No Yes
## No 1 0
## Yes 55 0
The results are poor, and in hindsight expected: because the rows are ordered by "diff," the first 56 observations, and therefore the entire test set, are labeled "No," so the confusion matrices contain no "Yes" cases, and the models, which mostly predict "Yes," classify very few correctly. The likelihood of a nonprofit being labeled "healthy" appears to rely mostly on revenue and expenses, but these matrices offer little further insight into the model. As an additional measure, the data are plotted with Pearson's correlation to assess any relationships between variables; "diff" and "healthy" are included in an effort to find insight beyond the supervised algorithms assessed above.
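A stratified split would avoid the degenerate holdout, since caret's createDataPartition preserves the class mix when sampling. A sketch, not run for this deliverable, of how the test set could be rebuilt (knn.5s is an illustrative name):

# Sketch: stratified 80:20 split so the test set contains both classes,
# instead of taking the first 56 (all "No") rows.
set.seed(123)
idx <- createDataPartition(df$healthy, p = 0.8, list = FALSE)
knn.5s <- knn(df.subset[idx, ], df.subset[-idx, ], df$healthy[idx], k = 5)
table(knn.5s, df$healthy[-idx])  # both "No" and "Yes" now appear in the columns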
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
pairs.panels(df[,2:7],
method = "pearson",
hist.col = "#00AFBB",
density = TRUE,
ellipses = TRUE
)
The Pearson correlations merit further examination. With "diff" and "healthy" included in the plot, two points must be made. First, among the key fiscal indicators, "diff" correlates most strongly with "assets"; the relationship is still weak, but it trends higher than with the other indicators. A plausible reading is that when revenue exceeds expenses in a given year, the nonprofit performed well, and the surplus is most likely shifted to assets, since nonprofits cannot retain those revenues in the same fashion as a for-profit company. Second, even though "healthy" shows weak relationships with the other five variables, its correlation with "diff" is the highest, which is unsurprising given that "healthy" is derived from "diff." More analysis is needed to determine whether this correlation should be higher or lower. A nonprofit should not post consistently high revenue growth, otherwise it is not devoting as much to operations, services, or programs; this can happen for a year or two, but longitudinally it may prove problematic. A multiple-year analysis, perhaps at a later date, may prove more conclusive.
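For reference, the coefficients behind these observations can be pulled directly; a small sketch (standardizing the columns above does not change Pearson's r):

# Correlation of "diff" with each fiscal variable; per the discussion
# above, assets is expected to show the strongest (if still weak) link.
sapply(df[, c("revenue", "expenses", "liabilities", "assets")],
       function(x) cor(as.numeric(x), as.numeric(df$diff)))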
Overall, a larger training set seems the logical first step for strengthening any of these classification models. Among the SVMs, the radial basis kernel offers the most accuracy, but other factors raise concern, chiefly the near-zero kappas elsewhere and the very high sigma and C of the selected model. If trading generalization for in-sample accuracy is acceptable, which is not advisable, then this model is appropriate. In the future, more experimentation is needed when establishing the model, especially with the sigma and C values.
However, when comparing SVMs with decision trees, the latter proves more useful for predictive purposes. Beyond variables such as revenue and expenses, the liabilities variable offered an interesting finding, as liabilities are often a "hidden" fiscal indicator in such assessments. For this exercise, given the hierarchical nature of decision trees, this model is decidedly the most useful for holistically assessing and predicting the fiscal health of Mohawk Valley nonprofits.
Altman, D. G. (1999). Practical statistics for medical research. Chapman & Hall/CRC Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2016). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.
Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python. O'Reilly Media.
Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2019). Introduction to data mining (2nd ed.). Pearson Education.