These data were extracted from Internal Revenue Service Form 990, which certain tax-exempt organizations must submit as part of their annual reporting. In the Mohawk Valley, 328 tax-exempt organizations have annual revenues of more than $200,000 and are therefore required to file a 990. Previous deliverables used a single year of filings; because the accuracy and bias results were mixed, additional years were added to improve performance. These data include all of the highest-grossing nonprofits in Oneida and Herkimer counties in upstate New York.
Although IRS Form 990 offers considerable dimensionality, with 32 features, we elected to use four variables because they provide the most complete data with limited missing values. As with for-profit companies, much can be gleaned from four major fiscal reporting categories – revenue, expenses, assets, and liabilities – to measure the overall health of a tax-exempt organization. Two additional variables are needed to build a single-layer neural network: a derived “difference” variable and the classifier “healthy”, which serves as the target for model creation.
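For context, one plausible construction of those two derived variables is sketched below (it would run once the CSV shown in the next code block has been read in). The sketch assumes “diff” is expenses minus revenue, so a negative value is a surplus, and that “healthy” flags organizations whose expenses do not exceed revenue; the actual rule used in the source file may differ.
# Illustrative sketch only: assumed definitions of the two derived variables
df$diff <- df$expenses - df$revenue               # negative = surplus, positive = deficit
df$healthy <- factor(ifelse(df$diff <= 0, 1, 0))  # assumed rule: surplus or break-even = healthy (1)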
In neural networks, the central idea is to extract linear combinations of the inputs as derived features and then model the target as a nonlinear function of these features (Hastie et al., 2016, p. 389). Previously there were concerns about the size of the data set and about statistical bias; the word appears in a second sense here, since neural network diagrams are sometimes drawn with an additional bias unit feeding into every unit in the hidden and output layers (Hastie et al., 2016, p. 392). The data set has since been enlarged, and statistical bias continues to be watched closely as the model is built.
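That idea can be made concrete with a single forward pass: each hidden unit forms a weighted sum of the inputs and passes it through a sigmoid, and the output is then a nonlinear function of those derived features. The weights below are random placeholders used purely to illustrate the computation nnet performs; the real ones are estimated during fitting.
# Forward pass of a 5-input, 4-hidden-unit, 1-output network (placeholder weights)
sigmoid <- function(z) 1 / (1 + exp(-z))
x  <- c(0.2, 0.1, 0.4, 0.3, 0.5)              # one normalized input record
W1 <- matrix(runif(4 * 5, -0.5, 0.5), 4, 5)   # input-to-hidden weights
b1 <- runif(4, -0.5, 0.5)                     # hidden bias terms
W2 <- runif(4, -0.5, 0.5)                     # hidden-to-output weights
b2 <- runif(1, -0.5, 0.5)                     # output bias
z  <- sigmoid(W1 %*% x + b1)                  # derived features: linear combinations, then nonlinearity
y  <- sigmoid(sum(W2 * z) + b2)               # predicted probability of "healthy"
During fitting, nnet adjusts these weights and biases to minimize its fitting criterion on the training data.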
df <- read.csv("C:/Users/bjorzech/Desktop/609_W6.csv", stringsAsFactors = FALSE)
df$healthy <- as.factor(df$healthy)           # target: fiscally healthy (1) or not (0)
df$revenue <- as.numeric(df$revenue)
df$expenses <- as.numeric(df$expenses)
df$liabilities <- as.numeric(df$liabilities)
df$assets <- as.numeric(df$assets)
df$diff <- as.numeric(df$diff)                # surplus/deficit derived from expenses and revenue
str(df)
## 'data.frame': 1094 obs. of 6 variables:
## $ revenue : num 7.27e+07 7.25e+07 1.49e+06 1.16e+08 7.91e+07 ...
## $ expenses : num 7.27e+07 7.25e+07 1.49e+06 1.16e+08 7.91e+07 ...
## $ liabilities: num 232415 140853 310000 71024983 226736 ...
## $ assets : num 2.96e+07 2.07e+07 1.91e+07 1.15e+08 1.72e+07 ...
## $ diff : num -10902767 -9067473 -5644223 -4877831 -3556161 ...
## $ healthy : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
normalize <- function(x) {
  # min-max scaling to the [0, 1] range
  return((x - min(x)) / (max(x) - min(x)))
}
The choice was given to create models in either Python or R; in an effort to explore (or to make life more difficult), this approach called for explicit preprocessing steps. Preprocessing also prompted deeper thinking about the input variables and the chosen output for a neural network model. A classifier is needed, so “healthy” is ultimately converted to a factor.
It’s also worth noting that the neural network model below was run and tested after normalizing the same data. When variables are measured on very different scales, such a transformation is often necessary to avoid having a variable with large values dominate the results of the analysis (Tan et al., 2019, p. 71). According to Hastie et al. (2016, p. 400), at the outset it is best to standardize all inputs to have mean zero and standard deviation one; this ensures all inputs are treated equally in the regularization process and allows one to choose a meaningful range for the random starting weights.
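The normalization step itself is not shown, so the lines below are a sketch of how the normalize() function defined above could be applied to the five numeric inputs; scale() is the equivalent call for the mean-zero, unit-variance standardization that Hastie et al. recommend.
num_cols <- c("revenue", "expenses", "liabilities", "assets", "diff")
df[num_cols] <- lapply(df[num_cols], normalize)                            # min-max scaling to [0, 1]
# df[num_cols] <- lapply(df[num_cols], function(x) as.numeric(scale(x)))   # alternative: standardize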
Additionally, multiple models were created to reduce the dimensionality; in these cases, the results did not show significant differences. “Healthy” therefore remains the target factor, while “revenue”, “expenses”, “assets”, “liabilities”, and “diff” (short for the difference between expenses and revenue, i.e., the surplus or deficit) serve as the input variables. At this point, it’s also worth noting that “interpretation of the fitted model is usually difficult, because each input enters into the model in a complex and multifaceted way” (Hastie et al., 2016, p. 390).
random_rows <- sort(sample(nrow(df), nrow(df) * 0.8))   # 80% of rows for training
training_data <- df[random_rows, ]
test_data <- df[-random_rows, ]
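Because sample() draws at random, each run produces a different partition; placing a fixed seed before the split, as in the sketch below (the seed value is arbitrary), keeps the 80/20 partition and the results that follow reproducible.
set.seed(609)                                            # arbitrary seed; any fixed value works
random_rows <- sort(sample(nrow(df), nrow(df) * 0.8))
With the partition fixed, differences between repeated runs reflect the random starting weights of the network rather than the draw of the split.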
library(nnet)
## Warning: package 'nnet' was built under R version 4.0.2
Similar to other models, the choice was made to split the data into training and testing sets at 80/20, although other splits were also tried to assess performance – first 70/30 and then 75/25. The 80/20 split remained the strongest.
model <- nnet(healthy ~ revenue + expenses + liabilities + assets + diff, data = training_data, size = 4, decay = 0.0001, maxit = 1000)
## # weights: 29
## initial value 816.518287
## iter 10 value 591.576547
## iter 20 value 590.575000
## final value 590.555368
## converged
At this stage of model building, it is best to explain the rationale for the settings above before turning to the outcomes. For instance, some researchers use cross-validation to estimate the optimal number of hidden units, but this seems unnecessary if cross-validation is used to estimate the regularization parameter (Hastie et al., 2016, p. 400). The choice of the number of hidden layers is likewise guided by background knowledge and experimentation: each layer extracts features of the input for regression or classification, and multiple hidden layers allow construction of hierarchical features at different levels of resolution (Hastie et al., 2016, p. 400).
The five inputs remain and the training data are set, but experimentation with the hidden layer offered mixed results. In this model the relevant setting is size, the number of units in the single hidden layer that nnet fits. For instance, two hidden units offered weak results, three increased the bias significantly, and five skewed the findings by making the model too proportionate, or a 1:1 testing situation. The decision was made to run with four hidden units. Additionally, a fixed weight decay parameter of 0.0001 was used, representing a mild amount of regularization (Hastie et al., 2016, p. 402). Raising this parameter, much like raising the number of hidden units, increased the bias.
Finally, it was difficult to gauge the maximum number of iterations, set through the maxit argument, since the default is 100. Using 500 produced less desirable results, so the value was doubled to 1,000 and doubled again to 2,000 before settling on 1,000.
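Those settings were chosen by trial and error; a small grid search along the lines of the sketch below (hypothetical, reusing the same training and test sets) is one way to make the comparison of size and decay more systematic.
# Hypothetical grid search over hidden-unit count and weight decay
grid <- expand.grid(size = c(2, 3, 4, 5), decay = c(0.0001, 0.001, 0.01))
grid$accuracy <- apply(grid, 1, function(p) {
  fit <- nnet(healthy ~ revenue + expenses + liabilities + assets + diff,
              data = training_data, size = p["size"], decay = p["decay"],
              maxit = 1000, trace = FALSE)
  mean(predict(fit, test_data, type = "class") == test_data$healthy)
})
grid[order(-grid$accuracy), ]                 # settings ranked by held-out accuracy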
When we run the model, it produces 29 weights: each of the four hidden units receives the five inputs plus a bias term, and the single output unit receives the four hidden units plus a bias, for (5 + 1) × 4 + (4 + 1) × 1 = 29. According to Hastie et al. (2016, p. 395), the neural network model has unknown parameters, often called weights; these are estimated by minimizing the fitting criterion with gradient descent, and each weight governs how much influence its input has on the output. Given the size of the data and the expectations of the model, some overfitting still exists, but the number of weights seems appropriate.
The summary and residuals below offer another perspective on the strength – or weakness – of the model.
summary(model)
## a 5-4-1 network with 29 weights
## options were - entropy fitting decay=1e-04
## b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1
## 0.26 0.24 0.16 -0.25 -0.39 -0.28
## b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2
## 0.36 0.61 -0.40 -0.08 -0.64 -0.06
## b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3
## -0.38 0.40 0.60 0.02 0.48 -0.24
## b->h4 i1->h4 i2->h4 i3->h4 i4->h4 i5->h4
## 2.49 0.02 -0.03 -0.34 -0.65 -0.24
## b->o h1->o h2->o h3->o h4->o
## 0.81 -0.16 0.46 -1.09 8.51
summary(model$residuals)
## V1
## Min. :-0.5063449
## 1st Qu.:-0.4306002
## Median :-0.3921620
## Mean :-0.0000009
## 3rd Qu.: 0.5693998
## Max. : 0.6078380
test_data$pred_nnet <- predict(model, test_data, type = "class")
mtab <- table(test_data$pred_nnet, test_data$healthy)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
confusionMatrix(mtab)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 109 87
## 1 16 7
##
## Accuracy : 0.5297
## 95% CI : (0.4613, 0.5973)
## No Information Rate : 0.5708
## P-Value [Acc > NIR] : 0.9024
##
## Kappa : -0.0591
##
## Mcnemar's Test P-Value : 5.3e-12
##
## Sensitivity : 0.87200
## Specificity : 0.07447
## Pos Pred Value : 0.55612
## Neg Pred Value : 0.30435
## Prevalence : 0.57078
## Detection Rate : 0.49772
## Detection Prevalence : 0.89498
## Balanced Accuracy : 0.47323
##
## 'Positive' Class : 0
##
After loading the caret library and running a confusion matrix, the results proved more favorable than in past models. The higher accuracy in both the model and the confusion matrix supported the earlier belief that more data were needed, which in turn produced stronger accuracy and less bias.
For instance, the classification of a fiscally “healthy” nonprofit in a true-positive capacity remained stronger than the false positives and other outcomes. Similarly, when run repeatedly, the accuracy of the model stays at 55% or higher and the kappa increases, although a stronger result is still desired in that area.
What also seems more conclusive are the additional findings this model produces. For instance, the sensitivity score, the rate of true positives, remains high, while other measures, such as balanced accuracy, are consistent with findings from previous supervised models; that number should be higher. Meanwhile, the specificity, the rate of true negatives, remains low. This is expected, but perhaps not this low on a consistent basis.
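To make those definitions concrete, the caret figures can be reproduced by hand from the table above; with “0” as the positive class, sensitivity is the true-positive rate for class 0 and specificity the true-negative rate for class 1.
# Recomputing the key rates from mtab (rows = predicted class, columns = actual class)
TP <- mtab["0", "0"]; FN <- mtab["1", "0"]        # actual class 0
TN <- mtab["1", "1"]; FP <- mtab["0", "1"]        # actual class 1
TP / (TP + FN)                                    # sensitivity: 109 / 125 = 0.872
TN / (TN + FP)                                    # specificity:   7 /  94 = 0.074
(TP / (TP + FN) + TN / (TN + FP)) / 2             # balanced accuracy = 0.473
These hand computations match the caret output above.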
library(devtools)
## Loading required package: usethis
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
## SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef
plot.nnet(model)
## Loading required package: scales
## Loading required package: reshape
Finally, by plotting our neural network model, we are able to see the flow of inputs to output through the hidden-layer neurons. Although this model is expected to change, the inputs (variables) tend to gravitate toward the third hidden neuron, while the initial bias often outweighs the first. Similarly, the weighted connections appear stronger beyond the first neuron and are calculated using the back-propagation algorithm.
In the future, although accuracy has risen and bias has been reduced, the same inputs and the same target classification will be used with a larger data set. The model is not yet as strong as it should be, and some overfitting remains, but a neural network allows considerable experimentation in choosing the hidden units, weights, and iteration limits. This flexibility is essential for strengthening any model.
Hastie, T., Tibshirani, R., & Friedman, J. (2016). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, Springer.
Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python. Sebastopol, CA: O’Reilly Media, Inc.