1 Introduction

Artificial neural networks (ANNs) are a class of machine learning algorithms designed to acquire their own knowledge by extracting useful patterns from data. Neural networks (NNs) are function approximators, mapping inputs to outputs, and are composed of many interconnected computational units, called neurons. Each individual neuron possesses little intrinsic approximation capability; however, when many neurons function cohesively together, their combined effects yield remarkable learning performance.

This tutorial begins with a brief description of the biologic neuron and its artificial counterpart, emphasizing the similarities between the two. Following the neuron discussion, we cover the general structure and nomenclature of artificial neural networks, before discussing how a neural network learns from data. The introduction concludes with a discussion of hyperparameters and neural network evaluation criteria. Section 2 provides the reader with a walk-through of two types of neural network analysis: regression and classification. Readers with a basic understanding of neural networks who would like to see their implementation in R may proceed directly to section 2.

1.1 Biologic Model

ANNs are engineered computational models inspired by the brain (human and animal). While some researchers have used ANNs to study animal brains, most researchers view neural networks as being inspired by, not models of, neurological systems. Figure 1 shows the basic functional unit of the brain, a biologic neuron.

Figure 1: Biologic Neuron. Source: Bruce Blaus, Wikipedia

ANN neurons are simple representations of their biologic counterparts. In the biologic neuron figure, note the dendrites, the cell body, and the axon with its synaptic terminals. In biologic systems, information (in the form of neuroelectric signals) flows into the neuron through the dendrites. If a sufficient number of input signals enter the neuron through the dendrites, the cell body generates a response signal and transmits it down the axon to the synaptic terminals. The specific number of input signals required for a response depends on the individual neuron. When the generated signal reaches the synaptic terminals, neurotransmitters flow out of the terminals and interact with the dendrites of adjoining neurons. There are three major takeaways from the biologic neuron:

  1. The neuron only generates a signal if a sufficient number of input signals enter the neuron's dendrites (all or nothing)
  2. Neurons receive inputs from many adjacent neurons upstream, and can transmit signals to many adjacent neurons downstream (cumulative inputs)
  3. Each neuron has its own threshold for activation (activation threshold).

1.2 Artificial Neuron

The artificial analog of the biologic neuron is shown below in figure 2. In the artificial model, the inputs correspond to the dendrites; the transfer function, net input, and activation function correspond to the cell body; and the activation corresponds to the axon and synaptic terminals.

Figure 2: Artificial Neuron. Source: Chrislb, Wikipedia

The inputs to the artificial neuron may correspond to raw data values, or in deeper architectures, may be outputs from preceding artificial neurons. The transfer function sums all the inputs together (cumulative inputs). If the summed input values reach a specified threshold, the activation function generates an output signal (all or nothing). The output signal then moves to a raw output or to other neurons, depending on the specific ANN architecture. This basic artificial neuron is combined with multiple other artificial neurons to create ANNs such as the ones shown in figure 3.
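Before combining neurons into networks, the single neuron of figure 2 can be sketched in a few lines of R. The function name, the example weights, and the step threshold below are illustrative assumptions, not part of any package.

# A single artificial neuron (illustrative sketch, not a package function):
# the transfer function sums the weighted inputs plus a bias, and a step
# activation fires only if that sum reaches the threshold (all or nothing)
artificial_neuron <- function(x, w, b, threshold = 0) {
  net_input <- sum(w * x) + b           # transfer function (cumulative inputs)
  as.numeric(net_input >= threshold)    # step activation (all or nothing)
}

# Example with two inputs and hypothetical weights
artificial_neuron(x = c(0.5, 0.8), w = c(0.4, 0.6), b = -0.5)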

Figure 3: Examples of Multi-Neuron ANNs. Sources: Top-Left: McStrother, Wikipedia; Top-Right: McStrother, Wikipedia; Bottom: Glosser, Wikipedia

ANNs are often described as having an input layer, a hidden layer, and an output layer. The input layer reads in data values from a user-provided input, the hidden layer is where a majority of the 'learning' takes place, and the output layer displays the results of the ANN. In the bottom plot of the figure, each of the red input nodes corresponds to an input vector \(\vec{x}_{i}\). Each of the black lines corresponds to a weight, \(w^{(l)}_{ij}\), and describes how artificial neurons are connected to one another within the ANN. The \(i\) subscript identifies the source, and the \(j\) subscript identifies the artificial neuron to which the weight connects the source. The green output nodes are the output vectors \(\vec{y}_{q}\).

Examination of the figure's top-left and top-right plots shows two possible ANN configurations. In the top-left, we see a network with one hidden layer of \(q\) artificial neurons that takes \(p\) input vectors \(\vec{x}\) and generates \(q\) output vectors \(\vec{y}\). Note the bias inputs to each hidden node, denoted by \(b_q\). The bias term is a constant input of 1 to each hidden node, acting akin to the grand mean in a simple linear regression, and each bias term in an ANN has its own associated weight \(w\). In the top-right ANN we have a network with two hidden layers. This network adds superscript notation to the bias terms and the weights to identify the layer to which each term belongs. Weights and biases with a superscript 1 connect the input layer to the first layer of artificial neurons, and terms with a superscript 2 connect the output of the second hidden layer to the output vectors.

The size and structure of ANNs are only limited by the imagination of the analyst.

1.3 Activation Functions

The capability of ANNs to learn approximately any function (given sufficient training data examples) is dependent on the appropriate selection of the activation function (or activation functions) present in the network. Activation functions enable the ANN to learn non-linear properties present in the data. We represent the activation function here as \(\phi(\cdot)\). The input into the activation function is the weighted sum of the input features from the preceding layer. Let \(o_j\) be the output from the \(j\)th neuron in a given layer for a network with \(p\) input features:

\[o_j=\phi\left(b_j+\sum\limits_{i=1}^p w_{ij}x_i\right)\]

The output (\(o_j\)) can feed into the output layer of a neural network, or in deeper architectures, may feed into additional hidden layers. The activation function determines if the sum of the weighted inputs plus a bias term is sufficiently large to trigger the firing of the neuron. There is no universal best choice of activation function; however, researchers have provided ample information regarding which activation functions work well for ANN solutions to many common problems. The choice of activation function also governs the data scaling required for ANN analysis. Below we present activation functions commonly seen in many ANNs.
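As a concrete reference, three commonly encountered activation functions, the logistic (sigmoid), the hyperbolic tangent, and the rectified linear unit, can each be written in one line of R. These are plain illustrative definitions, not package functions.

# Common activation functions as plain R functions (illustrative)
logistic <- function(x) 1 / (1 + exp(-x))  # output range (0, 1)
tanh_act <- function(x) tanh(x)            # output range (-1, 1)
relu     <- function(x) pmax(0, x)         # output range [0, Inf)

# Their differing output ranges drive the data scaling requirements noted above
logistic(0); tanh_act(0); relu(-2)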

1.4 How ANNs Learn

We have described the structure of ANNs; however, we have not touched on how these networks learn. For the purposes of this discussion we assume that we have a data set of labeled observations. Data sets in which we have some features (\(X\)) describing an output (\(\vec{y}\)) fall under machine learning techniques called Supervised Learning. To begin training our notional single-layer, one-neuron neural network, we randomly assign the initial weights. We then run the neural network with these random weights and record the outputs generated; this is called a forward pass. Output values, in our case \(\vec{\hat{y}}\), are a function of the input values (\(X\)), the random initial weights (\(\vec{w}\)), and our choice of the threshold function (\(T\)):

\[ \vec{\hat{y}} = f(X, \vec{w}, T) \]

Once we have our ANN output values (\(\vec{\hat{y}}\)) we can compare them to the data set output values (\(\vec{y}\)). To do this we use a performance function \(P\). The choice of performance function is up to the analyst; we choose the one-half square error cost function, otherwise known as the Sum of Squared Errors (SSE):

\[P = \frac{1}{2}\|\vec{y}-\vec{\hat{y}}\|^{2}_{2} \]

Now that we have our initial performance, we need a method to adjust the weights to improve it. For our performance function \(P\), maximizing the performance of the one-layer, one-neuron neural network means minimizing the difference between the ANN predicted output values (\(\vec{\hat{y}}\)) and the observed data set outputs (\(\vec{y}\)). Recall that our neural network is simply a function, \(\vec{\hat{y}} = f(X, \vec{w}, T)\). Thus we can minimize the SSE by differentiating the performance function with respect to the weights (\(\vec{w}\)). Because the weights in our ANN form a vector, we must update each weight individually, which requires partial derivatives. Additionally, we need to determine how large a step to take, so we add a scalar parameter \(r\), called the learning rate, that controls how far we move toward the optimum weight values. Because we are minimizing \(P\), the weights move in the direction opposite the gradient:

\[\Delta \vec{w} = -r\left(\frac{\partial P}{\partial w_0},\frac{\partial P}{\partial w_1}, \ldots ,\frac{\partial P}{\partial w_q}\right)\]

The previous equation describes how to adjust each of the weights associated with the \(q\) input features of \(X\) and the bias weight \(w_{0}\). We then update the weight values as prescribed by the equation; this process is called Back-Propagation. Once the weights are updated, we can re-run the neural network with the updated weight values. This entire process is repeated until either a set number of iterations occurs or we reach a pre-specified performance value (minimum error rate).
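A minimal R sketch of one forward pass and one weight update for a single neuron may help fix ideas. All names and data here are hypothetical, and the sketch assumes an identity threshold function so the partial derivatives of the SSE performance function take a simple closed form.

# One training iteration for a single neuron (illustrative sketch)
set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)  # 10 observations, 2 features
y <- X %*% c(2, -1) + rnorm(10, sd = 0.1)    # hypothetical observed outputs
w <- runif(2, min = -1, max = 1)             # random initial weights
b <- 0                                       # bias weight
r <- 0.05                                    # learning rate

y_hat <- X %*% w + b                         # forward pass
P     <- 0.5 * sum((y - y_hat)^2)            # one-half square error (SSE)

grad_w <- -t(X) %*% (y - y_hat)              # partial derivatives dP/dw
grad_b <- -sum(y - y_hat)                    # partial derivative dP/db
w <- w - r * grad_w                          # back-propagation weight update
b <- b - r * grad_b

Re-running the forward pass with the updated weights yields a smaller value of \(P\); looping over these lines is, in miniature, the training procedure described above.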

The back-propagation algorithm (described in the previous paragraphs) is the fundamental process by which an ANN learns. This brief example merely summarizes the high-level details of the procedure. For those math-minded individuals who would like to know more, please visit Patrick H. Winston's Neural Net Lecture. In addition to providing a good introduction to back-propagation, the lecture also provides an excellent overview of neural networks in general.

The back-propagation algorithm is the most computationally expensive component of many neural networks. Given an ANN, back-propagation requires \(O(l)\) operations for \(l\) hidden layers, and \(O(w^{2})\) operations for \(w\) input weights. We often describe ANNs in terms of depth and width, where the depth refers to the total number of layers and the width refers to the number of neurons within each layer.

Prior to moving on to ANN applications, we must touch on one more topic: neural network hyperparameters.

1.5 ANN Hyperparameters

ANN hyperparameters are settings used to control how a neural network performs. We have seen examples of hyperparameters already, for example the learning rate in back-propagation and the selection of SSE as the performance metric. Hyperparameters dictate how well neural networks are able to learn the underlying functions they approximate. Poor hyperparameter selection can lead to ANNs that fail to converge, exhibit chaotic behavior, or converge too quickly at local, rather than global, optima. Hyperparameters are initially selected based on historical knowledge about the data set being analyzed and/or the type of analysis being conducted. The optimum values of hyperparameters depend on the specific data set being analyzed; therefore, in a majority of neural network analyses, hyperparameters need to be 'tuned' for the best performance. The No Free Lunch Theorem states that no machine learning algorithm (neural networks included) is universally better at predicting new, unobserved data points. When building an ANN, we are looking at building a network that performs reasonably well on a specific data set, not on all possible data sets.

The ultimate goal of an ANN is to train the network on training data with the expectation that, given new data, the ANN will be able to predict the outputs accurately. The capability to predict new observations is called generalization. Generally, when ANNs are developed they are evaluated against one data set that has been split into a training data set and a test data set. The training data set is used to train the ANN, and the test data set is used to measure the neural network's capacity to generalize to new observations. When testing ANN hyperparameters we generally see multiple ANNs, created with different hyperparameters, trained on the training data set. Each of the ANNs is tested against the test data set, and the ANN with the lowest test data set error is assumed to be the neural network with the best capacity to generalize to new observations.

When testing ANNs we are concerned with two types of error: under-fitting and over-fitting. An ANN exhibiting under-fitting is a neural network in which the error rate on the training data set is very high. An ANN exhibiting over-fitting has a large gap between the error rate on the training data set and the error rate on the test data set. We expect to see a slight performance decrease between the training and test data set error rates; however, if this gap is large, over-fitting may be the cause. Researchers can always design an ANN with perfect performance on the training data set by increasing either the width or depth of the neural network. Adjusting these ANN hyperparameters is an adjustment of the neural network's capacity. In much the same way we can fit high-order polynomials in linear regression to perfectly match the output as a function of the regressors, ANNs can be 'gamed' by simply adding depth to the network. An over-capacity ANN is likely to show over-fitting when tested against the test data set. ANNs are function approximators, and as approximators we are looking for a neural network that is no larger or more complex than it needs to be for the required performance. Given two ANNs with equal test data set error performance, Occam's razor dictates that the simpler model be selected, given no additional information.

2 Building ANNs

2.1 Overview

We present two types of ANNs: a neural network used for regression and a neural network used for classification. For each type of neural network we provide the following:

  1. Required Package(s)
  2. Data Preparation
  3. Creation of an exemplar ANN
  4. Hyperparameter tuning
  5. Exercises

2.2 Regression Neural Networks

Regression ANNs predict an output variable as a function of the inputs. The input features (independent variables) can be categorical or numeric types; however, for regression ANNs we require a numeric dependent variable. If the output variable is categorical (or binary), the ANN will function as a classifier (see the next section).

2.2.1 Regression Required Packages

We require the following packages for the analysis.

library(tidyverse)
library(neuralnet)
library(GGally)

2.2.2 Data Preparation

Our regression ANN will use the Yacht Hydrodynamics data set from UCI's Machine Learning Repository. The yacht data was provided by Dr. Roberto Lopez. This data set contains the results of 308 full-scale experiments performed at the Delft Ship Hydromechanics Laboratory testing 22 different hull forms. The experiments tested the effect of variations in the hull geometry and the ship's Froude number on the craft's residuary resistance per unit weight of displacement.

To begin we download the data from UCI.

Yacht_Data <- read_table(file = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00243/yacht_hydrodynamics.data',
                       col_names = c('LongPos_COB', 'Prismatic_Coeff', 
                                     'Len_Disp_Ratio', 'Beam_Draut_Ratio', 
                                     'Length_Beam_Ratio','Froude_Num', 
                                     'Residuary_Resist')) %>%
  na.omit()

Prior to any data analysis, let's take a look at the data set.

ggpairs(Yacht_Data, title = "Scatterplot Matrix of the Features of the Yacht Data Set")

Here we see an excellent summary of the variation of each feature in our data set. Draw your attention to the bottom-most row of scatter plots, which shows the residuary resistance as a function of the other data set features (independent experimental values). The greatest variation appears with the Froude number feature. It will be interesting to see how this pattern appears in the subsequent regression ANNs.

Prior to regression ANN construction we must split the Yacht data set into test and training data sets. Before splitting, we first scale each feature to fall in the \([0,1]\) interval.

# Scale the Data
scale01 <- function(x){
  (x - min(x)) / (max(x) - min(x))
}
Yacht_Data <- Yacht_Data %>%
  mutate_all(scale01)

# Split into test and train sets
set.seed(12345)
Yacht_Data_Train <- sample_frac(tbl = Yacht_Data, replace = FALSE, size = 0.80)
Yacht_Data_Test <- anti_join(Yacht_Data, Yacht_Data_Train)

The scale01() function maps each data observation onto the \([0,1]\) interval; dplyr's mutate_all() applies it to every column. We then set a seed for reproducible results and randomly extracted (without replacement) 80% of the observations to build the Yacht_Data_Train data set. Using dplyr's anti_join() function, we extracted all the observations not within the Yacht_Data_Train data set as our test data set, Yacht_Data_Test.
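Before training, a quick sanity check (optional, but cheap) confirms that every feature now spans exactly \([0,1]\):

# Each column's minimum and maximum should be 0 and 1 after scaling
apply(Yacht_Data, 2, range)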

2.2.3 1st Regression ANN

To begin we construct a 1-hidden layer ANN with 1 neuron, the simplest of all neural networks.

set.seed(12321)
Yacht_NN1 <- neuralnet(Residuary_Resist ~ LongPos_COB + Prismatic_Coeff + 
                         Len_Disp_Ratio + Beam_Draut_Ratio + Length_Beam_Ratio +
                         Froude_Num, data = Yacht_Data_Train)

Yacht_NN1 is a list containing all parameters of the regression ANN as well as the results of the neural network on the training data set. To view a diagram of Yacht_NN1, use the plot() function.

plot(Yacht_NN1, rep = 'best')

This plot shows the weights learned by the Yacht_NN1 neural network, and displays the number of iterations before convergence, as well as the SSE of the training data set. To manually compute the SSE you can use the following:

NN1_Train_SSE <- sum((Yacht_NN1$net.result - Yacht_Data_Train[, 7])^2)/2
paste("SSE: ", round(NN1_Train_SSE, 4))
## [1] "SSE:  0.0361"

This SSE is the error associated with the training data set. A superior metric for estimating the generalization capability of the ANN would be the SSE of the test data set. Recall, the test data set contains observations not used to train the Yacht_NN1 ANN. To calculate the test error, we first must run our test observations through the Yacht_NN1 ANN. This is accomplished with the neuralnet package's compute() function, which takes as its first argument the neural network object created by the neuralnet() function, and as its second argument the test data set feature (independent variable) values.

Test_NN1_Output <- compute(Yacht_NN1, Yacht_Data_Test[, 1:6])$net.result
NN1_Test_SSE <- sum((Test_NN1_Output - Yacht_Data_Test[, 7])^2)/2
NN1_Test_SSE
## [1] 0.008417631461

The compute() function outputs the response variable, in our case Residuary_Resist, as estimated by the neural network. Once we have the ANN estimated response we can compute the test SSE. Comparing the test error of 0.0084 to the training error of 0.0361, we see that in our case the test error is smaller than the training error, an atypical result that likely reflects this particular random split of the data.

2.2.4 Regression Hyperparameters

We have constructed the most basic of regression ANNs without modifying any of the default hyperparameters associated with the neuralnet() function. We should try to improve the network by modifying its basic structure and hyperparameters. To begin, we will add depth to the network with a second hidden layer; then we will change the activation function from the logistic to the hyperbolic tangent (tanh) to determine if these modifications can improve the test data set SSE. When using the tanh activation function, we first must rescale the data from \([0,1]\) to \([-1,1]\). For the purposes of this exercise we will use the same random seed for reproducible results; generally this is not a best practice.

# 2-Hidden Layers, Layer-1 4-neurons, Layer-2, 1-neuron, logistic activation
# function
set.seed(12321)
Yacht_NN2 <- neuralnet(Residuary_Resist ~ LongPos_COB + Prismatic_Coeff + Len_Disp_Ratio + 
    Beam_Draut_Ratio + Length_Beam_Ratio + Froude_Num, data = Yacht_Data_Train, 
    hidden = c(4, 1), act.fct = "logistic")
## Training Error
NN2_Train_SSE <- sum((Yacht_NN2$net.result - Yacht_Data_Train[, 7])^2)/2
## Test Error
Test_NN2_Output <- compute(Yacht_NN2, Yacht_Data_Test[, 1:6])$net.result
NN2_Test_SSE <- sum((Test_NN2_Output - Yacht_Data_Test[, 7])^2)/2

# Rescale for tanh activation function
scale11 <- function(x) {
    (2 * ((x - min(x))/(max(x) - min(x)))) - 1
}
Yacht_Data_Train <- Yacht_Data_Train %>% mutate_all(scale11)
Yacht_Data_Test <- Yacht_Data_Test %>% mutate_all(scale11)

# 2-Hidden Layers, Layer-1 4-neurons, Layer-2, 1-neuron, tanh activation
# function
set.seed(12321)
Yacht_NN3 <- neuralnet(Residuary_Resist ~ LongPos_COB + Prismatic_Coeff + Len_Disp_Ratio + 
    Beam_Draut_Ratio + Length_Beam_Ratio + Froude_Num, data = Yacht_Data_Train, 
    hidden = c(4, 1), act.fct = "tanh")
## Training Error
NN3_Train_SSE <- sum((Yacht_NN3$net.result - Yacht_Data_Train[, 7])^2)/2
## Test Error
Test_NN3_Output <- compute(Yacht_NN3, Yacht_Data_Test[, 1:6])$net.result
NN3_Test_SSE <- sum((Test_NN3_Output - Yacht_Data_Test[, 7])^2)/2

# 1-Hidden Layer, 1-neuron, tanh activation function
set.seed(12321)
Yacht_NN4 <- neuralnet(Residuary_Resist ~ LongPos_COB + Prismatic_Coeff + Len_Disp_Ratio + 
    Beam_Draut_Ratio + Length_Beam_Ratio + Froude_Num, data = Yacht_Data_Train, 
    act.fct = "tanh")
## Training Error
NN4_Train_SSE <- sum((Yacht_NN4$net.result - Yacht_Data_Train[, 7])^2)/2
## Test Error
Test_NN4_Output <- compute(Yacht_NN4, Yacht_Data_Test[, 1:6])$net.result
NN4_Test_SSE <- sum((Test_NN4_Output - Yacht_Data_Test[, 7])^2)/2

# Bar plot of results
Regression_NN_Errors <- tibble(Network = rep(c("NN1", "NN2", "NN3", "NN4"), 
    each = 2), DataSet = rep(c("Train", "Test"), times = 4), SSE = c(NN1_Train_SSE, 
    NN1_Test_SSE, NN2_Train_SSE, NN2_Test_SSE, NN3_Train_SSE, NN3_Test_SSE, 
    NN4_Train_SSE, NN4_Test_SSE))
Regression_NN_Errors %>% ggplot(aes(Network, SSE, fill = DataSet)) + geom_col(position = "dodge") + 
    ggtitle("Regression ANN's SSE")

As evident from the plot, the best regression ANN we found was Yacht_NN2, with a training SSE of 0.0188 and a test SSE of 0.0057. We make this determination by the value of the training and test SSEs only. Yacht_NN2's structure is presented here:

plot(Yacht_NN2, rep = "best")

We have looked at one ANN for each of the hyperparameter settings. Generally, researchers look at more than one ANN for a given setting of hyperparameters. This capability is built into the neuralnet package using the rep argument in the neuralnet() function. Using the Yacht_NN2 hyperparameters, we construct 10 different ANNs and select the best of the 10.

set.seed(12321)
Yacht_NN2 <- neuralnet(Residuary_Resist ~ LongPos_COB + Prismatic_Coeff + Len_Disp_Ratio + 
    Beam_Draut_Ratio + Length_Beam_Ratio + Froude_Num, data = Yacht_Data_Train, 
    hidden = c(4, 1), act.fct = "logistic", rep = 10)
plot(Yacht_NN2, rep = "best")

By setting the same seed prior to running the 10 repetitions of ANNs, we force the software to reproduce the exact same Yacht_NN2 ANN for the first replication. The subsequent 9 generated ANNs use different random sets of starting weights. Comparing the 'best' of the 10 repetitions to the original Yacht_NN2, we observe a decrease in training set error, indicating we have a superior set of weights.
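If you would like to see the training error achieved by each repetition, the error row of the result.matrix in the fitted neuralnet object holds one column per repetition; the snippet below assumes the Yacht_NN2 object fit in the previous chunk.

# Training error of each of the 10 repetitions
Yacht_NN2$result.matrix["error", ]
# Index of the repetition with the smallest training error
which.min(Yacht_NN2$result.matrix["error", ])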

2.2.5 Exercises

  1. Why do we split the yacht data into training and test data sets?

! The training data set is used to train the neural network, and the test data set is used to measure the neural network's capacity to generalize.

  2. Re-load the yacht data from the UCI Machine Learning Repository without scaling. Run any regression ANN. What happens? Why do you think this happens?
# Load the data
Yacht_Data <- read_table(file = "http://archive.ics.uci.edu/ml/machine-learning-databases/00243/yacht_hydrodynamics.data", 
    col_names = c("LongPos_COB", "Prismatic_Coeff", "Len_Disp_Ratio", "Beam_Draut_Ratio", 
        "Length_Beam_Ratio", "Froude_Num", "Residuary_Resist")) %>% na.omit()

# Split into test and train sets
set.seed(12345)
Yacht_Data_Train <- sample_frac(tbl = Yacht_Data, replace = FALSE, size = 0.8)
Yacht_Data_Test <- anti_join(Yacht_Data, Yacht_Data_Train)
# Build a regression ANN
Yacht_NN1 <- neuralnet(Residuary_Resist ~ LongPos_COB + Prismatic_Coeff + Len_Disp_Ratio + 
    Beam_Draut_Ratio + Length_Beam_Ratio + Froude_Num, data = Yacht_Data_Train)

! With default parameters, the NN fails to converge, as indicated by the warning message. The weights of the neural network are initialized randomly, and the relative magnitudes of the different features result in the ANN not converging appropriately. The easiest solution to problems such as these is to ensure that the data is scaled prior to ANN analysis. Additionally, the ANN can be set to run for a larger number of iterations, although this is not the preferred solution.

  3. After completing exercise question 2, re-scale the yacht data. Perform a simple linear regression fitting Residuary_Resist as a function of all other features. Now run a regression neural network (see the 1st Regression ANN section). Plot the regression ANN and compare the weights on the features in the ANN to the p-values of the regressors.
# Scale the Data
scale01 <- function(x) {
    (x - min(x))/(max(x) - min(x))
}
Yacht_Data <- Yacht_Data %>% mutate_all(scale01)

# Split into test and train sets
set.seed(12345)
Yacht_Data_Train <- sample_frac(tbl = Yacht_Data, replace = FALSE, size = 0.8)
Yacht_Data_Test <- anti_join(Yacht_Data, Yacht_Data_Train)

# Simple linear regression
Yacht_lm <- lm(Residuary_Resist ~ LongPos_COB + Prismatic_Coeff + Len_Disp_Ratio + 
    Beam_Draut_Ratio + Length_Beam_Ratio + Froude_Num, data = Yacht_Data_Train)
summary(Yacht_lm)
## 
## Call:
## lm(formula = Residuary_Resist ~ LongPos_COB + Prismatic_Coeff + 
##     Len_Disp_Ratio + Beam_Draut_Ratio + Length_Beam_Ratio + Froude_Num, 
##     data = Yacht_Data_Train)
## 
## Residuals:
##         Min          1Q      Median          3Q         Max 
## -0.19128923 -0.11917494 -0.03038520  0.09332233  0.50426728 
## 
## Coefficients:
##                       Estimate   Std. Error  t value             Pr(>|t|)
## (Intercept)       -0.081283598  0.099487986 -0.81702              0.41473
## LongPos_COB        0.021088486  0.029464746  0.71572              0.47486
## Prismatic_Coeff   -0.002675947  0.054282812 -0.04930              0.96072
## Len_Disp_Ratio     0.108262763  0.199433775  0.54285              0.58774
## Beam_Draut_Ratio  -0.143068918  0.248657846 -0.57536              0.56559
## Length_Beam_Ratio -0.141086798  0.227408839 -0.62041              0.53558
## Froude_Num         0.630745525  0.029536672 21.35466 < 0.0000000000000002
##                      
## (Intercept)          
## LongPos_COB          
## Prismatic_Coeff      
## Len_Disp_Ratio       
## Beam_Draut_Ratio     
## Length_Beam_Ratio    
## Froude_Num        ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1426861 on 239 degrees of freedom
## Multiple R-squared:  0.656415,   Adjusted R-squared:  0.6477894 
## F-statistic: 76.10111 on 6 and 239 DF,  p-value: < 0.00000000000000022204
# Regression ANN
set.seed(12321)
Yacht_NN1 <- neuralnet(Residuary_Resist ~ LongPos_COB + Prismatic_Coeff + Len_Disp_Ratio + 
    Beam_Draut_Ratio + Length_Beam_Ratio + Froude_Num, data = Yacht_Data_Train)
plot(Yacht_NN1, rep = "best")

! Froude_Num is the only feature identified as significant in the linear regression. In the regression ANN, this feature has the largest-magnitude weight. Each of the other features has a non-significant p-value, and the magnitudes of their weights in the regression ANN are much smaller than the Froude_Num weight.

  4. Build your own regression ANN using the scaled yacht data, modifying one hyperparameter. Use ?neuralnet to see the function options. Plot your ANN.
# One Example
set.seed(12321)
Example_NN <- neuralnet(Residuary_Resist ~ LongPos_COB + Prismatic_Coeff + Len_Disp_Ratio + 
    Beam_Draut_Ratio + Length_Beam_Ratio + Froude_Num, data = Yacht_Data_Train, 
    algorithm = "rprop-", rep = 10)
plot(Example_NN, rep = "best")

  5. Think of a military application for regression ANNs.

2.3 Classification

Classification ANNs seek to classify an observation as belonging to some discrete class as a function of the inputs. The input features (independent variables) can be categorical or numeric types; however, we require a categorical feature as the dependent variable.

2.3.1 Classification Required Packages

The following packages are required for classification ANN analysis.

library(tidyverse)
library(neuralnet)
library(GGally)

2.3.2 Data Preparation

Our classification ANN will use Haberman's Survival data set from UCI's Machine Learning Repository. Haberman's data set was provided by Tjen-Sien Lim, and contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of 306 patients who had undergone surgery for breast cancer. We will use this data set to predict a patient's 5-year survival as a function of their age at the date of operation, the year of the operation, and the number of positive axillary nodes detected.

As before, we first download the data from UCI. When this data is imported, the Survival feature is read as an integer; this needs to be a categorical logical value, so we modify the feature using the mutate() function in the dplyr package. After recoding, a value of 1 in the Survival feature indicates that the patient survived for at least 5 years after the operation, and a value of 0 indicates that the patient died within 5 years.

Hab_Data <- read_csv(file = "http://archive.ics.uci.edu/ml/machine-learning-databases//haberman/haberman.data", 
    col_names = c("Age", "Operation_Year", "Number_Pos_Nodes", "Survival")) %>% 
    na.omit() %>% mutate(Survival = ifelse(Survival == 2, 0, 1)) %>% mutate(Survival = factor(Survival))

A brief examination of the data set…

ggpairs(Hab_Data, title = "Scatterplot Matrix of the Features of the Haberman's Survival Data Set")

shows that many more patients survived at least 5 years after the operation. Of the patients that survived (bottom subplots of the Survival row in the scatterplot matrix), we see that many of the patients have few positive axillary nodes detected. Examination of the Age feature shows that a few of the most elderly patients died within 5 years, while the youngest patients show increased 5-year survivability. We forego any more detailed visual inspection in favor of learning the relationships between the features using our classification ANN.

As in the regression ANN, we must scale our features to fall on the closed \([0,1]\) interval. For classification ANNs using the neuralnet package we will not use a training and test set for model evaluation; instead we will use the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) for final model selection. These metrics balance the error against the total number of model parameters, which in the ANN case grow with the total number of hidden nodes.
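For reference, both criteria trade the maximized likelihood \(\hat{L}\) against the number of estimated parameters \(k\), with BIC (for \(n\) observations) penalizing additional parameters more heavily in all but the smallest samples:

\[AIC = 2k - 2\ln(\hat{L}), \qquad BIC = k\ln(n) - 2\ln(\hat{L})\]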

scale01 <- function(x) {
    (x - min(x))/(max(x) - min(x))
}
Hab_Data <- Hab_Data %>% mutate(Age = scale01(Age), Operation_Year = scale01(Operation_Year), 
    Number_Pos_Nodes = scale01(Number_Pos_Nodes))
Hab_Data <- Hab_Data %>% mutate(Survival = as.numeric(Survival) - 1)

Classification ANNs in the neuralnet package require that the response feature, in this case Survival, be input as a Boolean (logical) feature. We modify the feature accordingly, then run the initial classification ANN.

Hab_Data <- Hab_Data %>% mutate(Survival = ifelse(Survival == 1, TRUE, FALSE))

2.3.3 1st Classification ANN

We construct a 1-hidden layer ANN with 1 neuron. The neuralnet package defaults to random initial weight values, so for reproducibility we set a seed before constructing the network. We have added three additional arguments for the classification ANN using the neuralnet package: linear.output, err.fct, and likelihood. Setting linear.output to FALSE and err.fct to "ce" indicates that we are performing a classification and forces the model to output what we may interpret as a probability of each observation belonging to Survival class 1. For classification ANNs the cross-entropy error metric is more appropriate than the SSE used in regression ANNs. The likelihood argument set to TRUE indicates to neuralnet that we would like to see the AIC and BIC metrics.
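For a binary Boolean response, the cross-entropy error takes the standard form below, where \(y_i\) is the observed class and \(\hat{y}_i\) the predicted probability for observation \(i\); this is, up to implementation details, the quantity minimized when err.fct = "ce":

\[E = -\sum_{i=1}^{n}\left[y_i\ln(\hat{y}_i) + (1-y_i)\ln(1-\hat{y}_i)\right]\]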

set.seed(123)
Hab_NN1 <- neuralnet(Survival ~ Age + Operation_Year + Number_Pos_Nodes, Hab_Data, 
    linear.output = FALSE, err.fct = "ce", likelihood = TRUE)

Hab_NN1 is a list containing all parameters of the classification ANN as well as the results of the neural network on the training data. To view a diagram of Hab_NN1, use the plot() function.

plot(Hab_NN1, rep = "best")

The error displayed in this plot is the cross-entropy error, which is a measure of the differences between the predicted and observed outputs for each of the observations in the Hab_Data data set. To view the Hab_NN1 AIC, BIC, and error metrics, run the following.

Hab_NN1_Train_Error <- Hab_NN1$result.matrix[1, 1]
paste("CE Error: ", round(Hab_NN1_Train_Error, 3))
## [1] "CE Error:  0.009"
Hab_NN1_AIC <- Hab_NN1$result.matrix[4, 1]
paste("AIC: ", round(Hab_NN1_AIC, 3))
## [1] "AIC:  12.017"
Hab_NN1_BIC <- Hab_NN1$result.matrix[5, 1]
paste("BIC: ", round(Hab_NN1_BIC, 3))
## [1] "BIC:  34.359"

2.3.4 Classification Hyperparameters

Classification ANNs within the neuralnet package require the use of the ce error. This forces us into using the default act.fct hyperparameter value. As a result, we will only change the structure of the classification ANNs using the hidden argument.

set.seed(123)
Hab_NN2 <- neuralnet(Survival ~ Age + Operation_Year + Number_Pos_Nodes, Hab_Data, 
    linear.output = FALSE, err.fct = "ce", likelihood = TRUE, hidden = c(2, 
        1))

set.seed(123)
Hab_NN3 <- neuralnet(Survival ~ Age + Operation_Year + Number_Pos_Nodes, 
    Hab_Data, linear.output = FALSE, err.fct = "ce", likelihood = TRUE, hidden = c(2, 
        2))
set.seed(123)
Hab_NN4 <- neuralnet(Survival ~ Age + Operation_Year + Number_Pos_Nodes, 
    Hab_Data, linear.output = FALSE, err.fct = "ce", likelihood = TRUE, hidden = c(1, 
        2))

# Bar plot of results
Class_NN_ICs <- tibble(Network = rep(c("NN1", "NN2", "NN3", "NN4"), each = 3), 
    Metric = rep(c("AIC", "BIC", "ce Error * 100"), length.out = 12), Value = c(Hab_NN1$result.matrix[4, 
        1], Hab_NN1$result.matrix[5, 1], 100 * Hab_NN1$result.matrix[1, 1], 
        Hab_NN2$result.matrix[4, 1], Hab_NN2$result.matrix[5, 1], 100 * Hab_NN2$result.matrix[1, 
            1], Hab_NN3$result.matrix[4, 1], Hab_NN3$result.matrix[5, 1], 100 * 
            Hab_NN3$result.matrix[1, 1], Hab_NN4$result.matrix[4, 1], Hab_NN4$result.matrix[5, 
            1], 100 * Hab_NN4$result.matrix[1, 1]))
Class_NN_ICs %>% ggplot(aes(Network, Value, fill = Metric)) + geom_col(position = "dodge") + 
    ggtitle("AIC, BIC, and Cross-Entropy Error of the Classification ANNs", 
        "Note: ce Error displayed is 100 times its true value")

The plot indicates that as we add hidden layers and nodes within those layers, our AIC and cross-entropy error grow. The BIC appears to remain relatively constant across the designs. Here we have a case where Occam's razor clearly applies: the 'best' classification ANN is the simplest, and is shown in section 2.3.3.

2.3.5 Exercises

  1. Build your own classification ANN using the Hab_Data data set.
# One example
Example_NN2 <- neuralnet(Survival ~ Age + Operation_Year + Number_Pos_Nodes, 
    Hab_Data, linear.output = FALSE, err.fct = "ce", likelihood = TRUE, hidden = 3, 
    rep = 10)
plot(Example_NN2, rep = "best")

  2. The iris data set contains 4 numeric features describing 3 plant species. Think about how we would need to modify the iris data set to prepare it for a classification ANN. Hint: the data set for classification will have 7 total features.

! The output class feature, Species, needs to be split into three binary indicator output columns, as shown below. Additionally, the data should be scaled on \([0,1]\) as with the other data sets used in ANNs.

iris_data <- iris %>% mutate(Setosa = ifelse(Species == "setosa", TRUE, FALSE), 
    Versicolor = ifelse(Species == "versicolor", TRUE, FALSE), Virginica = ifelse(Species == 
        "virginica", TRUE, FALSE)) %>% mutate(Sepal.Length = scale01(Sepal.Length), 
    Sepal.Width = scale01(Sepal.Width), Petal.Length = scale01(Petal.Length), 
    Petal.Width = scale01(Petal.Width))
# To build a 3-class classification ANN
iris_NN <- neuralnet(Setosa + Versicolor + Virginica ~ Sepal.Length + Sepal.Width + 
    Petal.Length + Petal.Width, iris_data, linear.output = FALSE, err.fct = "ce", 
    likelihood = TRUE, hidden = 4, rep = 10)
plot(iris_NN, rep = "best")

  3. Think of a military application for classification ANNs.
  4. The R package nnet has the capacity to build classification ANNs. Install it and take a look at the documentation of nnet to see how it compares with neuralnet.
  5. Install the autoencoder package for class.

3 Wrapping Up

We have briefly covered ANNs in this tutorial. The material presented here is a high-level overview of a topic that is currently undergoing rapid development and research. The data sets used here are minuscule in size compared to the real-world data sets used in research. For example, the ImageNet LSVRC-2010 data set contains 1.2 million high-resolution images stratified into 1,000 classes. Krizhevsky, Sutskever, and Hinton built a classification ANN for it with 60 million parameters and 650,000 neurons; their ANN required 6 days of training time. The full ImageNet data set is larger still, with 15 million labeled high-resolution images in 22,000 categories (Krizhevsky, Sutskever, and Hinton 2012).

There are countless military applications of ANNs. Imagine being able to analyze all satellite imagery using a computer with the capability to automatically classify aircraft on an airfield, bunkers in a desert, or ICBM launch sites hidden in mountainous terrain. Electro-optical infrared satellites detect launch plumes during missile firings; however, they are susceptible to interference that produces many false positives. Each launch detection must be analyzed by a highly trained imagery analyst to determine whether the detection is valid or anomalous. A neural network could be trained to automate this process, greatly reducing the workload of the intelligence community.

The neuralnet package used in this tutorial is one of many tools available for ANN implementation in R. Others include:

  • nnet

  • autoencoder

  • caret

  • RSNNS

  • h2o

Regardless of your research area, there is a strong possibility that there exists an R package implementing an ANN framework that will aid you in your work.

References

Bergmeir, Christoph, and José M. Benítez. 2012. “Neural Networks in R Using the Stuttgart Neural Network Simulator: RSNNS.” Journal of Statistical Software 46 (7): 1–26. http://www.jstatsoft.org/v46/i07/.

Dubossarsky, Eugene, and Yuriy Tyshetskiy. 2015. Autoencoder: Sparse Autoencoder for Automatic Learning of Representative Features from Unlabeled Data. https://CRAN.R-project.org/package=autoencoder.

Fritsch, Stefan, and Frauke Guenther. 2016. Neuralnet: Training of Neural Networks. https://CRAN.R-project.org/package=neuralnet.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Gunther, Frauke, and Stefan Fritsch. 2010. “Neuralnet: Training of Neural Networks.” The R Journal 2 (1). https://journal.r-project.org/archive/2010-1/RJournal_2010-1_Guenther+Fritsch.pdf.

Hastie, Trevor J., Robert J. Tibshirani, and Jerome H. Friedman. 2017. The Elements of Statistical Learning Data Mining, Inference, and Prediction. Springer.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25 (NIPS 2012), 1–9.

Kuhn, Max, with contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, et al. 2017. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.

Lichman, M. 2013a. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

———. 2013b. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Schloerke, Barret, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Joseph Larmarange. 2017. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.

The H2O.ai Team. 2017. H2o: R Interface for H2o. https://CRAN.R-project.org/package=h2o.

Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.

Walia, Anish Singh. 2017. “Activation Functions and It’s Types-Which Is Better?” Edited by Medium.com. https://medium.com/towards-data-science/activation-functions-and-its-types-which-is-better-a9a5310cc8f.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load ’Tidyverse’ Packages. https://CRAN.R-project.org/package=tidyverse.