A single-layer neural network takes an input vector of \(p\) variables \(X = (X_1, ..., X_p)\) and builds a nonlinear function \(f(X)\) to predict the response \(Y\). The structure of a simple feed-forward neural network can be seen in the Figure below, using \(p=4\) predictors and \(K=5\) hidden units, sometimes called activations.
Single Layer Neural Network.
The neural network has the form \[ \begin{align} f(X) &= \beta_0 + \sum_{k=1}^K \beta_k h_k(X)\\ &= \beta_0 + \sum_{k=1}^K \beta_k g(w_{k0} + \sum_{j=1}^p w_{kj}X_j) \end{align} \]
Step 1.) Compute the \(K\) activations \(A_k, k=1...K\) as functions of the input features \(X_1,...,X_p\). In particular,
\[ A_k = h_k(X) = g(w_{k0} + \sum_{j=1}^p w_{kj}X_j) \]
where \(g(z)\) is a nonlinear activation function that is specified in advance. For example, \(g(z)\) can be the sigmoid activation function, or perhaps the ReLU (rectified linear unit) activation function.
Step 2.) Feed these \(K\) activations from the hidden layer into the output layer. The output \(f(X)\) is a linear regression model in the \(K\) activations. In particular,
\[ f(X) = \beta_0 + \sum_{k=1}^K \beta_k h_k(X) \]
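To make the two steps concrete, here is a minimal base-R sketch of the forward pass for this \(p=4\), \(K=5\) network; the weights, the \(\beta\) coefficients, and the input vector below are made-up numbers purely for illustration.

```r
# Forward pass of a single-layer network with p = 4 inputs and K = 5 hidden units.
# All numbers below are made up purely for illustration.
set.seed(1)
p <- 4; K <- 5
W    <- matrix(rnorm(K * (p + 1)), nrow = K)  # row k holds w_k0, w_k1, ..., w_kp
beta <- rnorm(K + 1)                          # beta_0, beta_1, ..., beta_K
g    <- function(z) pmax(0, z)                # ReLU activation

x   <- c(0.5, -1.2, 0.3, 2.0)                 # one input vector X
A   <- g(W %*% c(1, x))                       # Step 1: A_k = g(w_k0 + sum_j w_kj X_j)
f_x <- beta[1] + sum(beta[-1] * A)            # Step 2: f(X) = beta_0 + sum_k beta_k A_k
f_x
```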
All the parameters will be estimated from the data. For this simple \(K=5\) and \(p=4\) example, we would need to estimate \(\beta_0,...,\beta_5\) and \(w_{10},...,w_{14}\) … and \(w_{50},...,w_{54}\). In total this is 6 + (5x5) = 31 parameters. Even in this tiny example the parameter count adds up quickly, so it is easy to see that applying a neural network to a real problem requires large datasets and strong computational tools. Thus, it wasn’t until the last 15-20 years that neural networks/deep learning took off in popularity in research and industry.
As with any parameter estimation problem, we must choose a criterion. For a quantitative response, squared-error loss is typically used, so that the parameters are chosen to minimize \(\sum_{i=1}^n (y_i - f(x_i))^2\).
Modern neural networks typically have more than one hidden layer. We will illustrate a large dense network on the famous and publicly available MNIST handwritten digit dataset. See Midterm Part 2 where we go into full detail, including source R code, for the MNIST digit dataset.
Every image has 784 pixels, each stored as an eight-bit grayscale value between 0 and 255 representing how much of the written digit falls in that tiny pixel square. Thus, an input vector is \(X = (X_1,...,X_p)\) where \(p = 784\) pixels.
The output is the class label, represented by a vector \(Y = (Y_0,...,Y_9)\) of 10 dummy variables, with a one in the position corresponding to the label, and zeros elsewhere.
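As a minimal sketch (assuming the keras R package; dividing by 255 is a common rescaling convention rather than part of the encoding itself), the data can be loaded and encoded exactly as described above:

```r
# Sketch (keras R package): load MNIST and encode it as described above.
library(keras)
mnist <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), 784)) / 255  # 784 pixels per image, rescaled to [0, 1]
y_train <- to_categorical(mnist$train$y, 10)                                # 10 dummy variables per label
dim(x_train)  # 60000 x 784
dim(y_train)  # 60000 x 10
```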
We will use two hidden layers \(L_1\) and \(L_2\). See the image below for a visual diagram of how our model will be constructed.
The input layer has \(p=784\) units (one for each pixel in an input image). The first hidden layer has \(K_1 = 256\) units and the second hidden layer has \(K_2 = 128\) units. The output layer has 10 units, one for each of the digits 0-9. One can show that the number of parameters is 235,146.
We have adopted some new notation.
\(W_1\) in the Figure below represents the entire matrix of weights that feed from the input layer into the first hidden layer. This matrix will have (784+1) x 256 = 200,960 elements/weights.
Secondly, \(W_2\) is the matrix of weights that feed from hidden layer 1 to hidden layer 2. This matrix will have (256+1) x 128 = 32,896 elements/weights.
Lastly, \(B\) is the matrix of weights from hidden layer 2 to output layer. This matrix will have (128 + 1) x 10 = 1,290 elements/weights.
Multilayer Neural Network for Digit Classification.
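A minimal sketch of this architecture using the keras R package (the ReLU hidden-layer activations are an assumption for illustration, not a prescription); `summary(model)` reports the same layer-by-layer parameter counts derived above.

```r
# Sketch of the 784-256-128-10 dense network; parameter counts match the totals above.
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%  # W_1: (784 + 1) x 256 = 200,960
  layer_dense(units = 128, activation = "relu") %>%                        # W_2: (256 + 1) x 128 =  32,896
  layer_dense(units = 10, activation = "softmax")                          # B:   (128 + 1) x 10  =   1,290
summary(model)  # Total params: 235,146
```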
One thing to note about the output layer: it first computes ten numbers \(Z_0,...,Z_9\), one for each class. Since we are dealing with classification, we would prefer that our estimates represent class probabilities (just as in logistic or multinomial regression). As in multinomial logistic regression, we use the special softmax activation function. This will ensure that the 10 numbers behave like probabilities. The classifier then assigns the image to the class with the highest probability.
\[ f_m(X) = Pr(Y = m | X) = \frac{e^{Z_m}}{\sum_{l=0}^9 e^{Z_l}} \]
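A minimal R sketch of the softmax calculation; the vector of ten scores below is invented for illustration (subtracting `max(z)` is a standard numerical-stability trick and does not change the result).

```r
softmax <- function(z) {
  ez <- exp(z - max(z))  # subtract max(z) for numerical stability; probabilities are unchanged
  ez / sum(ez)
}

z <- c(1.2, -0.4, 0.3, 2.5, 0.0, -1.1, 0.8, 0.1, -0.6, 1.9)  # hypothetical scores Z_0, ..., Z_9
p_hat <- softmax(z)
sum(p_hat)            # the 10 numbers sum to 1, so they behave like probabilities
which.max(p_hat) - 1  # predicted digit (subtract 1 because R indexes from 1)
```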
Our criterion will be to minimize the negative multinomial log-likelihood, also known as the cross-entropy.
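In symbols, for \(n\) training images with one-hot labels \(y_{im}\) (so \(y_{im} = 1\) if image \(i\) has label \(m\), and 0 otherwise), the cross-entropy is
\[ -\sum_{i=1}^n \sum_{m=0}^9 y_{im} \log\left(f_m(x_i)\right) \]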
For full code details and implementation see Midterm Part 2.
Convolutional Neural Networks (CNNs) mimic to some degree how humans classify images, by recognizing specific features or patterns anywhere in the image that distinguish each particular object class.
See Figure 10.6 below.
Figure 10.6 Convolutional Neural Network.
A CNN first identifies low-level features, such as edges, patches of color, etc. Then, these low-level features are combined to form higher-level features, such as parts of ears, eyes, and so on. Eventually, the presence or absence of these higher-level features contributes to the probability of any given output class.
A CNN builds up this hierarchy by combining two specialized types of hidden layers, called convolution layers and pooling layers. Convolution layers search for instances of small patterns in the image. Pooling layers downsample these to select a prominent subset.
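To make these two operations concrete, here is a minimal base-R sketch of a convolution and a 2x2 max-pool applied to a toy image; the image and the vertical-edge filter below are made up for illustration.

```r
# Convolve a small grayscale image with a filter (no padding, stride 1).
conv2d <- function(img, kernel) {
  kr <- nrow(kernel); kc <- ncol(kernel)
  out <- matrix(0, nrow(img) - kr + 1, ncol(img) - kc + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(img[i:(i + kr - 1), j:(j + kc - 1)] * kernel)
  out
}

# Downsample a feature map by taking the max over non-overlapping s x s blocks.
max_pool <- function(m, s = 2) {
  rows <- seq(1, nrow(m) - s + 1, by = s)
  cols <- seq(1, ncol(m) - s + 1, by = s)
  out <- matrix(0, length(rows), length(cols))
  for (i in seq_along(rows))
    for (j in seq_along(cols))
      out[i, j] <- max(m[rows[i]:(rows[i] + s - 1), cols[j]:(cols[j] + s - 1)])
  out
}

img <- matrix(runif(36), 6, 6)                                        # toy 6 x 6 "image"
vert_edge <- matrix(c(-1, 0, 1, -1, 0, 1, -1, 0, 1), 3, byrow = TRUE) # 3 x 3 vertical-edge filter
max_pool(conv2d(img, vert_edge))                                      # 4 x 4 feature map pooled to 2 x 2
```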
In a recurrent neural network (RNN), the input object \(X\) is a sequence. Consider a collection of documents, such as the collection of IMDb movie reviews. Each document can be expressed as a sequence of \(L\) words, so \(X = \{X_1, X_2, ..., X_L\}\), where each \(X_l\) represents a word. The order of the words, and closeness of certain words in a sentence, convey semantic meaning.
RNNs are designed to accommodate and take advantage of the sequential nature of such input objects, much like convolutional neural networks accommodate the spatial structure of image inputs.
Figure 10.12 below illustrates the structure of a very basic RNN with sequence \(X = \{X_1, X_2, ..., X_L\}\) as input, a simple output \(Y\), and a hidden-layer sequence \(A = \{A_1, A_2, ..., A_L\}\). Each \(X_l\) is a vector; in the document example \(X_l\) could represent a one-hot encoding for the \(l\)-th word based on the language dictionary for the collection of documents. As the sequence is processed, one vector \(X_l\) at a time, the network updates the activations \(A_l\) in the hidden layer, taking as input the vector \(X_l\) and the activation vector \(A_{l-1}\) from the previous step in the sequence. Each \(A_l\) feeds into the output layer and produces a prediction \(O_l\) for \(Y\).
Figure 10.12 Recurrent Neural Network.
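In symbols, one common way to write this update (following the textbook's notation, where the \(w\)'s, \(u\)'s, and \(\beta\)'s are the input, hidden-to-hidden, and output weights) is
\[ \begin{align} A_{lk} &= g\left(w_{k0} + \sum_{j=1}^p w_{kj}X_{lj} + \sum_{s=1}^K u_{ks}A_{l-1,s}\right)\\ O_l &= \beta_0 + \sum_{k=1}^K \beta_k A_{lk} \end{align} \]
Importantly, the same weights are reused at every step of the sequence; this weight sharing is what lets the network handle sequences of arbitrary length.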
The rest of Section 10.5 in the textbook goes into case studies of applications of RNNs, for example time series forecasting and document classification. For more details of the implementation of these examples in R, please see Midterm Part 2.
A guiding principle is Occam’s razor: when faced with several methods that give roughly equivalent performance, pick the simplest.
If we can produce models with the simpler tools that perform as well, they are likely to be easier to fit and understand and potentially less fragile than the more complex approaches.
Typically we expect deep learning to be an attractive choice when the sample size of the training set is extremely large, and when interpretability of the model is not a high priority.
This section covers the technicalities of fitting/optimizing a neural network and of introducing penalization. Due to the highly technical nature of this material, details are omitted and we encourage the reader to consult the actual textbook.
Neural networks are fit using backpropagation and gradient descent. The details of such can be found in the first half of Chapter 18. In our summary below we will focus on other tuning parameters.
We have many choices for our nonlinear activation functions \(g^{(k)}\). These include but are not limited to
the sigmoid function
rectified linear
leaky rectified linear
hyperbolic tangent (tanh) function
See the graph below for a visual comparison of the different activation functions.
Activation Functions.
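For reference, here is a minimal R sketch of the four activation functions listed above (the leaky-ReLU slope of 0.01 is just a common default, not a required value):

```r
sigmoid    <- function(z) 1 / (1 + exp(-z))
relu       <- function(z) pmax(0, z)
leaky_relu <- function(z, alpha = 0.01) ifelse(z > 0, z, alpha * z)
# base R already provides tanh() for the hyperbolic tangent

z <- seq(-4, 4, length.out = 200)
matplot(z, cbind(sigmoid(z), relu(z), leaky_relu(z), tanh(z)),
        type = "l", lty = 1, xlab = "z", ylab = "g(z)")
legend("topleft", legend = c("sigmoid", "ReLU", "leaky ReLU", "tanh"),
       col = 1:4, lty = 1)
```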
Penalization of the network weights is typically a mixture of \(l_2\) and \(l_1\) regularization, each of which requires a tuning parameter.
Dropout is a form of regularization that is performed when learning a network, typically at different rates at the different layers. It applies to all networks, not just convolutional; in fact, it appears to work better when applied at the deeper, denser layers.
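As a hedged sketch of how both ideas might be added to the dense MNIST model from earlier (keras R package; the penalty strengths and dropout rates below are placeholder values, not recommendations):

```r
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784),
              kernel_regularizer = regularizer_l1_l2(l1 = 1e-5, l2 = 1e-4)) %>%  # mixture of l1 and l2 penalties
  layer_dropout(rate = 0.4) %>%   # randomly drop 40% of this layer's units during training
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")
```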