DSC609 MV Nonprofit Deep Learning Model with Keras

Project Overview

df <- read.csv("C:/Users/bjorzech/Desktop/DSC_7a.csv",stringsAsFactors = FALSE)

These data were extracted from IRS Form 990, which some tax-exempt organizations are required to submit as part of their annual reporting. They offer a snapshot of those nonprofits falling above the $200,000 threshold — which, in the Mohawk Valley, is 328 nonprofits between Oneida and Herkimer counties in upstate New York for reporting years 2015, 2016, and 2017.

The region has more than 1,000 nonprofits. In previous deliverables, we focused first on those that generate the most revenue, on the assumption that their assets and liabilities are more conclusive in a testing situation. To potentially strengthen outcomes, mid-sized nonprofits (between $50,000 and $200,000) were added to this data set, albeit with much lower asset and liability levels. More context on dimensionality and normalization is offered below.

Even though IRS Form 990 allows for considerable dimensionality with 32 features, we elected to use the four variables these organizations share with mid-sized organizations (Form 990-EZ), as they offer the most complete data with few missing values. As with for-profit companies, much can be gleaned about the overall health of a tax-exempt organization from four major fiscal reporting categories: revenue, expenses, assets, and liabilities. What makes these models different is the addition of three derived variables. The first is the difference between revenue and expenses, labeled “difference.” It, in turn, allows us to add the second, a label that classifies the organization as either “healthy” or “not healthy.” The third is another independent variable, the difference between assets and liabilities, labeled “gap” in the dataset.
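As a minimal sketch of how the three derived variables could be built: the data frame `toy` and its dollar figures are hypothetical, and the rule that “healthy” means revenue exceeds expenses is our assumption, since the exact cutoff is not stated here.

```r
# Toy data frame with the four fields shared by Form 990 and Form 990-EZ.
# The dollar figures are made up for illustration only.
toy <- data.frame(
  revenue     = c(250000, 90000, 1200000),
  expenses    = c(230000, 95000, 1100000),
  assets      = c(500000, 40000, 3000000),
  liabilities = c(100000, 60000, 2500000)
)

# "difference": revenue minus expenses
toy$difference <- toy$revenue - toy$expenses

# "healthy": ASSUMED rule, 1 when revenue exceeds expenses, otherwise 0
toy$healthy <- as.factor(ifelse(toy$difference > 0, 1, 0))

# "gap": assets minus liabilities
toy$gap <- toy$assets - toy$liabilities

print(toy)
```

Under this assumed rule, “healthy” is a direct function of “difference,” which is one reason to keep the derived columns out of the predictor set when the label is the target.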

For this deep learning deliverable, though, much had to be reconsidered in order for the model to produce desirable results. We detail the methodology and approach in the next section, but the opening question in the Schmidhuber piece (2014) is perhaps the most salient in approaching deep learning (DL) in neural networks (NN): “Which modifiable components of a learning system are responsible for its success or failure?” When we first began experimenting with the model and the given data set, the weight of this question became evident, and it was subsequently supported by Schmidhuber’s thinking. Depending on the problem and how the neurons are connected, such behavior may require long causal chains of computational stages, where each stage transforms (often in a non-linear way) the aggregate activation of the network. Deep learning is about accurately assigning credit across many such stages (Schmidhuber, 2014, p. 4).

Ultimately, a different data set could have been used to close this deliverable, but since these data are part of a larger project, we will continue to experiment with the Mohawk Valley nonprofit ecosystem in order to satisfy our overall objective of creating a stronger model.

Notes on Methodology

Two decisions had to be made even before any preprocessing or normalizing step.

First, it’s worth noting that these data were used with both Python and R. We ran the same models, using the same data set, with a heavy reliance on the Keras and TensorFlow packages. Ultimately, neither programming language produced results superior enough to conclude which is more appropriate. However, in the future, given the size of the data, Python will most likely be used. Keras, the open-source neural network library, is written in Python but integrates with other languages. It focuses on enabling fast experimentation and supports convolutional networks, recurrent networks, and combinations of the two (Falbel et al., 2020).

Second, as noted above, we decided to use the same Mohawk Valley nonprofit data because it is part of a larger project. Going forward, data are being collected on more than 300,000 United States nonprofits over an eight-year period, which will considerably increase the size of the data set. That context explains the approach taken here with a much smaller data set: we are examining just one region (the Mohawk Valley) with only about 1,000 nonprofits, many of which are parts of a regional or national network. In total, there are 1,725 records in this data set, a sample we build on to eventually experiment with the entire population.

Normalize the Data

normalize <- function(x) {
  # Min-max scaling: map the values of x onto the range [0, 1]
  return((x - min(x)) / (max(x) - min(x)))
}

Only the numeric variables are normalized. The “healthy” variable is excluded because it serves as the target and is stored as a factor in R. With any predictive or classification algorithm that relies on distance, the data should be normalized (Tan et al., 2019, p. 211).
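The scaling step can be sketched as follows. This is a minimal example under stated assumptions: the data frame `toy` and its values are hypothetical stand-ins for the nonprofit data, and applying the function column by column with `lapply` is one reasonable choice rather than the only one.

```r
# Min-max scaling, same formula as the normalize function above
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical stand-in for the nonprofit data (values are made up)
toy <- data.frame(
  revenue     = c(250000, 90000, 1200000),
  expenses    = c(230000, 95000, 1100000),
  assets      = c(500000, 40000, 3000000),
  liabilities = c(100000, 60000, 2500000),
  healthy     = factor(c(1, 0, 1))
)

# Scale only the numeric columns; the factor target is left untouched
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], normalize)

summary(toy)
```

After this step every numeric column runs from 0 to 1. A column whose values are all identical would divide by zero and return NaN, so constant columns should be dropped or handled separately.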

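df$healthy <- as.factor(df$healthy)
df$revenue <- as.numeric(df$revenue)
# Convert expenses from its own column. The recorded run read df$revenue here,
# which is why expenses mirrors revenue in the str() output below.
df$expenses <- as.numeric(df$expenses)
df$liabilities <- as.numeric(df$liabilities)
df$assets <- as.numeric(df$assets)
str(df)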

## 'data.frame':    1725 obs. of  5 variables:
##  $ revenue    : num  7.27e+07 7.25e+07 1.49e+06 1.16e+08 7.91e+07 ...
##  $ expenses   : num  7.27e+07 7.25e+07 1.49e+06 1.16e+08 7.91e+07 ...
##  $ liabilities: num  232415 140853 310000 71024983 226736 ...
##  $ assets     : num  2.96e+07 2.07e+07 1.91e+07 1.15e+08 1.72e+07 ...
##  $ healthy    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Here is where the model begins to weaken. With all of the variables originally described, the accuracy proves to be less than desirable in both Python and R. As in other deliverables, an additional variable labeled “gap” was first used to show the difference between assets and liabilities, and the variable “difference” was included for revenue and expenses.

In a deep learning environment, the additional variables seemed to soften the model, so dimensionality reduction was used.

The anecdotal documentation in Schmidhuber (2014) on matters involving dimensionality reduction served as a guide in understanding weak results (p. 24). Although the focus there was on images and the often-used MNIST handwritten digit data set, the thinking remained appropriate, especially the research of Hinton and Salakhutdinov (2006), who detailed the purpose of dimensionality reduction and its impact on error rate.

As a result, the “gap” and “difference” variables are removed, and the models run again. The following documentation offers a step-by-step approach in creating the model, followed by the results.

Data Input

x_data=matrix(data=runif(8625), nrow=1725, ncol=5)
y_data=ifelse(rowSums(x_data) > 1.5, 1, 0)
head(x_data)

##            [,1]      [,2]      [,3]      [,4]       [,5]
## [1,] 0.02061238 0.2532654 0.6550837 0.8549573 0.19502500
## [2,] 0.63553722 0.1606278 0.7979547 0.5001572 0.01276609
## [3,] 0.94308304 0.2227910 0.1221854 0.9690473 0.93533762
## [4,] 0.58366577 0.5218352 0.9154943 0.6224624 0.43708448
## [5,] 0.38611120 0.1942348 0.4218939 0.4303922 0.06336348
## [6,] 0.36295683 0.7686721 0.8735034 0.6792943 0.42837094

The input matrix holds five columns of values drawn from a continuous uniform distribution on [0, 1], with one row for each of the 1,725 records. Because the “runif” function is applied for random testing, the outcome is assigned by an arbitrary rule that lies between certain bounds: a row is labeled 1 when the sum of its five values exceeds 1.5, and 0 otherwise. Keeping the bound at 1.5 keeps expectations within reason.

head(y_data)

## [1] 1 1 1 1 0 1
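Because the label depends only on the row sum, the share of rows labeled 1 can be checked directly. The sketch below repeats the construction with a fixed seed (the seed and the names `x_sim` and `y_sim` are our assumptions; the run above does not set a seed) and compares the observed share with the theoretical value for the sum of five independent uniform variables.

```r
set.seed(609)  # assumed seed, used only to make this sketch reproducible

# Same construction as x_data and y_data above
x_sim <- matrix(runif(1725 * 5), nrow = 1725, ncol = 5)
y_sim <- ifelse(rowSums(x_sim) > 1.5, 1, 0)

# Observed share of each class
observed <- prop.table(table(y_sim))
print(observed)

# Theoretical share of class 1 from the Irwin-Hall distribution
# (sum of five independent uniform(0, 1) variables):
# P(S <= 1.5) = (1.5^5 - choose(5, 1) * 0.5^5) / 5!
p_zero <- (1.5^5 - choose(5, 1) * 0.5^5) / factorial(5)
p_one  <- 1 - p_zero
print(round(p_one, 4))
```

The theoretical share of class 1 is about 0.938, so a classifier that always predicts 1 would be right roughly 94% of the time on data built this way. That is a useful baseline when reading the accuracy figures later on.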

The Keras package is then loaded. Since it’s built on TensorFlow, an adequate input pipeline can be built. The tidyverse library is also loaded for its general-purpose data-handling functions.

library(keras)

## Warning: package 'keras' was built under R version 4.0.2

## 
## Attaching package: 'keras'

## The following object is masked _by_ '.GlobalEnv':
## 
##     normalize

library(tidyverse)

## -- Attaching packages ------------------------------------------ tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts --------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Transforming the Data

The shape of the array is vital in this model, as are the dimensions of the vectors and matrices. Keras also includes a helper for “one-hot” encoding in classification, which represents a categorical variable as binary vectors and can do so for more than two categories (Müller and Guido, 2016). In this model, however, we only use two. The results are then shown.

y_data_oneh=to_categorical(y_data, num_classes = 2)
head(y_data_oneh)

##      [,1] [,2]
## [1,]    0    1
## [2,]    0    1
## [3,]    0    1
## [4,]    0    1
## [5,]    1    0
## [6,]    0    1
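To make the encoding concrete, here is a minimal base R sketch that reproduces the matrix above without Keras. It is an illustration of the idea, not the internals of to_categorical; the names `y`, `one_hot`, and `decoded` are ours.

```r
# Class labels as printed by head(y_data) above
y <- c(1, 1, 1, 1, 0, 1)

# Row k of the 2 x 2 identity matrix is the one-hot vector for class k - 1,
# so indexing the rows by y + 1 builds the encoded matrix
one_hot <- diag(2)[y + 1, ]
print(one_hot)

# Decoding: the position of the 1 in each row, shifted back to 0/1 labels
decoded <- max.col(one_hot, ties.method = "first") - 1
print(decoded)
```

The first column flags class 0 and the second flags class 1, which is why the fifth row, the only 0 among the first six labels, reads 1 0.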

Sequential Model in Keras

model = keras_model_sequential() %>% 
  layer_dense(units = 64, activation = "relu", input_shape = ncol(x_data)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = ncol(y_data_oneh), activation = "softmax")

As a programming note, three layers are created here. The first is a dense layer of 64 neurons that receives the input variables, which are tensors in this case, with one value per column of x_data. The second is another layer of 64 neurons, and the third represents the output in the traditional sense of a neural network, with one softmax unit per class. Although sequential modeling is the most basic, it’s the most appropriate here. We then print the model.

Model

model

## Model
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 64)                      384         
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 64)                      4160        
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 2)                       130         
## ================================================================================
## Total params: 4,674
## Trainable params: 4,674
## Non-trainable params: 0
## ________________________________________________________________________________
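The parameter counts in the summary can be checked by hand. The helper `dense_params` below is a hypothetical name introduced for this sketch; it counts one weight per input-unit pair plus one bias per unit, which is the standard count for a fully connected layer.

```r
# A dense layer has one weight per (input, unit) pair plus one bias per unit
dense_params <- function(n_inputs, n_units) {
  n_inputs * n_units + n_units
}

layer_1 <- dense_params(5, 64)   # five input columns into 64 units
layer_2 <- dense_params(64, 64)  # 64 units into 64 units
layer_3 <- dense_params(64, 2)   # 64 units into two softmax outputs

c(layer_1 = layer_1, layer_2 = layer_2, layer_3 = layer_3,
  total = layer_1 + layer_2 + layer_3)
```

The three values, 384, 4,160, and 130, sum to the 4,674 trainable parameters reported in the summary.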

compile(model, loss = "categorical_crossentropy", optimizer = optimizer_rmsprop(), metrics = "accuracy")

history = fit(model,  x_data, y_data_oneh, epochs = 20, batch_size = 128, validation_split = 0.2)

It’s worth justifying some of the reasoning above. For the training and validation data, the customary 80/20 split is taken through the validation_split argument of fit. For the number of epochs, models were run with 10, then at intervals of 10 up to 100, before we settled on a conservative 20. An epoch is one pass of the learning algorithm through the training data set (Müller and Guido, 2016). We then add a plot to show the performance.
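The settings passed to fit translate into a fixed amount of training work, which can be worked out with simple arithmetic. The sketch assumes that Keras sets aside 20% of the 1,725 rows for validation and processes the remaining rows in batches of 128, with a smaller final batch; the variable names are ours.

```r
n_total    <- 1725
val_split  <- 0.2
batch_size <- 128
epochs     <- 20

# Rows held out for validation and rows left for training
n_val   <- round(n_total * val_split)
n_train <- n_total - n_val

# Weight updates per epoch: one per batch, counting the partial last batch
steps_per_epoch <- ceiling(n_train / batch_size)
last_batch_size <- n_train - (steps_per_epoch - 1) * batch_size

# Total number of weight updates over the whole run
total_updates <- steps_per_epoch * epochs

c(n_train = n_train, n_val = n_val, steps_per_epoch = steps_per_epoch,
  last_batch_size = last_batch_size, total_updates = total_updates)
```

Under these assumptions the model trains on 1,380 rows, validates on 345, and makes 11 updates per epoch, or 220 updates in total. This helps explain why 20 epochs is a conservative choice for a data set of this size.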

Plot

plot(history)

## `geom_smooth()` using formula 'y ~ x'

Validation & Prediction

x_data_test=matrix(data=runif(8625), nrow=1725, ncol=5)
dim(x_data_test)

## [1] 1725    5

y_data_pred=predict_classes(model, x_data_test)

glimpse(y_data_pred)

##  num [1:1725(1d)] 1 1 1 1 1 1 1 1 1 1 ...

For process purposes, it must be noted that predict_classes already decodes the “one-hot” output into a single class label, and the same model and inputs are used for both predictions. This is important for consistency: the class labels and the class probabilities should point to the same outcomes, and they do. The probabilities for each class are included below.

y_data_pred_oneh=predict(model, x_data_test)
dim(y_data_pred_oneh)

## [1] 1725    2

head(y_data_pred_oneh)

##             [,1]      [,2]
## [1,] 0.108112656 0.8918874
## [2,] 0.076320238 0.9236797
## [3,] 0.001935082 0.9980649
## [4,] 0.445442736 0.5545573
## [5,] 0.077563837 0.9224362
## [6,] 0.015285627 0.9847143
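The link between the probabilities and the class labels can be sketched in base R. The example uses the six probability rows printed above and takes the column with the larger probability as the predicted class. This illustrates the decoding step and is not the package's internal code; `probs` and `pred_class` are our names.

```r
# The six probability rows printed above
probs <- matrix(
  c(0.108112656, 0.8918874,
    0.076320238, 0.9236797,
    0.001935082, 0.9980649,
    0.445442736, 0.5545573,
    0.077563837, 0.9224362,
    0.015285627, 0.9847143),
  ncol = 2, byrow = TRUE
)

# Column 1 holds P(class 0) and column 2 holds P(class 1), so the
# predicted class is the index of the larger column minus 1
pred_class <- max.col(probs, ties.method = "first") - 1
print(pred_class)

# Each row of softmax output should sum to approximately 1
print(rowSums(probs))
```

All six rows decode to class 1, matching the first six labels returned by predict_classes. The fourth row is the least certain, with a probability of about 0.55 for class 1.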

Evaluation

y_data_real=ifelse(rowSums(x_data_test) > 1.5, 1, 0)
y_data_real_oneh=to_categorical(y_data_real)

evaluate(model, x_data, y_data_oneh, verbose = 0)

## $loss
## [1] 0.05059318
## 
## $accuracy
## [1] 0.9866667

evaluate(model, x_data_test, y_data_real_oneh, verbose = 0)

## $loss
## [1] 0.05514201
## 
## $accuracy
## [1] 0.9843478
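evaluate reports a single accuracy figure. As a hedged sketch of how that figure, along with a per-class view, could be derived from label vectors such as y_data_pred and y_data_real, the example below uses two short hypothetical vectors; the values are made up and are not results from the model.

```r
# Hypothetical true and predicted labels (made up for illustration)
y_true <- c(1, 1, 0, 1, 1, 0, 1, 1, 1, 1)
y_hat  <- c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1)

# Rows are true classes, columns are predicted classes
conf_mat <- table(truth     = factor(y_true, levels = c(0, 1)),
                  predicted = factor(y_hat,  levels = c(0, 1)))
print(conf_mat)

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Recall for class 0: correctly identified zeros over all true zeros
recall_zero <- conf_mat["0", "0"] / sum(conf_mat["0", ])

print(c(accuracy = accuracy, recall_zero = recall_zero))
```

In this toy case the accuracy is 0.9 while only half of the class 0 cases are recovered, which shows why a per-class check is worth running alongside overall accuracy when one class is much rarer than the other.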

For the unseen data, the same steps are taken, from generating the matrix to labeling and one-hot encoding the outcome. The final accuracy is much more desirable than under the original methodology, which included more dimensions and, at times, more epochs.

Summary

The Schmidhuber piece (2014) served as a more than helpful guide in walking through the history of deep learning and techniques used to advance the field. In previous weeks, we learned the importance of back propagation and errors and the above modeling is a result of this.

As Schmidhuber notes, “generally speaking, although BP allows for deep problems in principle, it seemed to work only for shallow problems” (2014, p. 12). However, his thinking is balanced with earlier documentation regarding additional hidden layers, which did not seem to offer empirical benefits. He continued that “many practitioners found solace in a theorem, stating that a neural network with a single layer of enough hidden units can approximate any multivariate continuous function with arbitrary accuracy” (2014, p. 12).

With this, the accuracy is suitable for what really amounts to a shallow problem, and it offers enough evidence to continue to push the boundaries of the project, which focuses on nonprofits and assessing their fiscal health.

References

Falbel, D. (2020). keras: R Interface to ‘Keras’. https://CRAN.R-project.org/package=keras

Hinton, G. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Müller, A.C., & Guido, S. (2016). Introduction to machine learning with python. Sebastopol, CA: O’Reilly Media, Inc.

Schmidhuber, J. (2014). Deep learning in neural networks: An overview. The Swiss AI Lab IDSIA, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale.

Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2019). Introduction to data mining. New York, NY: Pearson Education, Inc.