This is our final project for the Fall 2017 class SDS 385: Statistical Models for Big Data taught by Professor James Scott at the University of Texas at Austin. For readers with time constraints, we recommend going through the Abstract, Introduction, Discussion, and Results of the report.
As part of the course, we were exposed to the mathematics behind the black-box optimizers we had been using to fit models. This insight into gradient-based optimizers gave us the confidence and the desire to build on the course material and write our own program to fit neural-network models.
We adopted a step-wise approach: we started with SoftMax regression on a dataset with \(10\) classes and then extended the work to accommodate multiple layers.
We studied the mathematics behind backpropagation and programmed a function that takes in the predictors together with lists of weights and biases and returns the cross-entropy loss, modified weights, and modified biases.
The dimensions and initial values of the weights and biases depend on the network design and must be chosen by the user.
The function is:
\[Neural\_network(X, y, weight, bias, step\_size, activation)\]
Where,
- \(X =\) predictors.
- \(y =\) one hot encoded label.
- \(weight =\) list of weights for each layer, in increasing order of layer index.
- \(bias =\) list of biases for each layer, in increasing order of layer index.
- \(step\_size =\) step size for gradient descent.
- \(activation = 0\) for sigmoid, \(1\) for ReLU (Rectified Linear Unit).
Return,
- \(Loss =\) cross entropy loss.
- \(weight =\) list of modified weights for each layer, in increasing order of layer index.
- \(bias =\) list of modified biases for each layer, in increasing order of layer index.
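A minimal sketch of this interface in Python/NumPy is shown below (the name is lower-cased to follow Python conventions); the body is only a placeholder for the forward pass, loss, and backpropagation developed in the following sections, and the exact argument handling in our program may differ.

```python
import numpy as np

def neural_network(X, y, weight, bias, step_size, activation):
    """Sketch of the interface described above (not the full implementation).

    X          : (n_samples, n_features) array of predictors
    y          : (n_samples, n_classes) one-hot encoded labels
    weight     : list of weight matrices, one per layer, in layer order
    bias       : list of bias vectors, one per layer, in layer order
    step_size  : step size for gradient descent
    activation : 0 for sigmoid, 1 for ReLU
    Returns (loss, weight, bias) with the weights and biases updated
    by one gradient-descent step.
    """
    # Forward pass, cross-entropy loss, and backpropagation go here
    # (see the derivations and the algorithm later in this report).
    raise NotImplementedError
```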
Our approach has two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels. We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.
The score function maps the pixel values of an image to confidence scores for each class. Let’s assume a training dataset of images \(x_i\in \Re^D\), each associated with a label \(y_i\). Here \(i=1,...,m\) and \(y_i\in \{1,...,K\}\). That is, we have \(m\) examples (each with a dimensionality \(D\)) and \(K\) distinct categories. For example, in MNIST we have a training set of \(m=60,000\) images, each with \(D=28\times 28=784\) pixels, and \(K=10\), since there are \(10\) distinct classes (the digits \(0\) through \(9\)). We will now define the score function \(f:\Re^D\mapsto \Re^K\) that linearly maps the raw image pixels to class scores:
\[f(x_i;W,b)=W^Tx_i+b\]
The parameters \(W\) are often called the weights, and \(b\) is called the bias vector because it influences the output scores, but without interacting with the actual data \(x_i\).
For the loss function (sometimes referred to as the cost function), there are several ways to define the details and measure our unhappiness with the outcomes. A commonly used choice is the SoftMax classifier with its cross-entropy loss.
SoftMax regression (or multinomial logistic regression) is a generalization of logistic regression to handle multiple classes. In logistic regression we assumed that the labels were binary: \(y_i \in \{0,1\}\). We used such a classifier to distinguish between two kinds of hand-written digits. SoftMax regression enables us to handle \(y_i \in \{1,\ldots,K\}\) where \(K\) is the number of classes.
In logistic regression, we had a training set \(\{ (x_1, y_1), \ldots, (x_m, y_m)\}\) of \(m\) labeled examples, where the input features are \(x_i\in \Re^{D}\). With logistic regression, we were in the binary classification setting, so the labels were \(y_i\in \{0,1\}\). The score function took the form:
\[f(x_i;W) = \frac{1}{1+\exp(-W^Tx_i)}\]
and the model parameters \(W\) were trained to minimize the loss function:
\[L(W) = -\left[ \sum_{i=1}^m y_i \log f(x_i;W) + (1-y_i)\log\bigl(1-f(x_i;W)\bigr) \right]\]
In the SoftMax regression setting, we are interested in multi-class classification and so the label \(y\) can take on \(K\) different values, rather than only two. Thus, in our training set \(\{ (x_1, y_1), \ldots, (x_m, y_m) \}\), we now have that \(y_i \in \{1, 2, \ldots, K\}\). (Note that our convention will be to index the classes starting from 1, rather than from 0). For example, in the MNIST digit recognition task, we would have \(K=10\) different classes.
Given a test input \(x\), we want our hypothesis to estimate the probability \(P(y=k | x)\) for each value of \(k = 1, \ldots, K\). I.e., we want to estimate the probability of the class label taking on each of the \(K\) different possible values. Thus, our hypothesis will output a \(K\)-dimensional vector (whose elements sum to 1) giving us our \(K\) estimated probabilities. Concretely, our score function \(f(x;W)\) takes the form:
\[\begin{align} f(x;W) = \begin{bmatrix} P(y = 1 | x; W) \\ P(y = 2 | x; W) \\ \vdots \\ P(y = K | x; W) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{K}{\exp(W^T_j x) }} \begin{bmatrix} \exp(W^T_1 x ) \\ \exp(W^T_2 x ) \\ \vdots \\ \exp(W^T_K x ) \\ \end{bmatrix} \end{align}\]Here \(W_1, W_2, \ldots, W_K \in \Re^{D}\) are the parameters and the term \(\frac{1}{ \sum_{j=1}^{K}{\exp(W^T_j x) } }\) normalizes the distribution such that it sums to one.
For convenience, we will also write \(W\) to denote all the parameters of our model. When we implement SoftMax regression, it is usually convenient to represent \(W\) as a \(D\)-by-\(K\) matrix obtained by concatenating \(W_1, W_2, \ldots, W_K\) into columns, so that \[W = \left[\begin{array}{cccc}| & | & | & | \\ W_1 & W_2 & \cdots & W_K \\ | & | & | & | \end{array}\right]\].
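As an illustration of the shapes involved (toy values of our own choosing, not the MNIST setup), the class scores for a batch of examples can be computed as:

```python
import numpy as np

D, K, m = 784, 10, 5                      # feature dimension, classes, batch size (toy values)
rng = np.random.default_rng(0)

X = rng.random((m, D))                    # each row is one example x_i
W = 0.01 * rng.standard_normal((D, K))    # columns are W_1, ..., W_K
b = np.zeros(K)                           # bias vector

scores = X @ W + b                        # (m, K): row i holds W_k^T x_i + b_k for every class k
```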
We now describe the loss function that we’ll use for SoftMax regression. In the equation below, \(I\{\cdot\}\) is the “indicator function”, so that \(I\{\hbox{a true statement}\}=1\), and \(I\{\hbox{a false statement}\}=0\). Our cross-entropy loss function will be:
\[L(W) = - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} I\left\{y_i = k\right\} \log \frac{\exp(W^T_k x_i)}{\sum_{j=1}^K \exp(W^T_j x_i)}\right]\]
For the SoftMax cost function, we sum over the \(K\) different possible values of the class label. Note also that in SoftMax regression, we have that \[P(y_i = k | x_i ; W) = \frac{\exp(W^T_k x_i)}{\sum_{j=1}^K \exp(W^T_j x_i)}\]
The cross-entropy between a “true” distribution \(p\) and an estimated distribution \(q\) is defined as: \[H(p,q) = - \sum_x p(x) \log q(x)\]
The SoftMax classifier minimizes the cross-entropy between the estimated class probabilities and the “true” distribution, which places all probability mass on the correct class (i.e., \(p=[0,\ldots,1,\ldots,0]\) contains a single \(1\) at the \(y_i\)-th position). Since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as \(H(p,q)=H(p)+D_{KL}(p||q)\), and the entropy of the delta distribution \(p\) is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of their dissimilarity).
When computing the SoftMax function in practice, the intermediate terms \(e^{W^T_k x_i}\) and \(\sum_{j=1}^K e^{W^T_j x_i}\) may be very large because of the exponentials. Dividing such large numbers can be numerically unstable, so it is important to use a normalization trick: we multiply the top and bottom of the fraction by a constant \(C\) to improve the numerical stability of the computation. Writing \(\log C = b\), we obtain the following mathematically equivalent expression: \[\frac{\exp(W^T_k x_i)}{\sum_{j=1}^K \exp(W^T_j x_i)} = \frac{C\exp(W^T_k x_i)}{C\sum_{j=1}^K \exp(W^T_j x_i)} = \frac{\exp(W^T_k x_i+b)}{\sum_{j=1}^K \exp(W^T_j x_i+b)}\]
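A minimal NumPy sketch of this normalization trick, using the common choice of \(C\) such that the largest shifted exponent is zero (the function name is ours):

```python
import numpy as np

def stable_softmax(scores):
    """Row-wise softmax of an (m, K) score matrix using the max-shift trick."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # largest entry per row becomes 0
    exp_scores = np.exp(shifted)                           # no overflow in np.exp
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
```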
Then our loss function will be \[L(W,b) = - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} I\left\{y_i = k\right\} \log \frac{\exp(W^T_k x_i+b)}{\sum_{j=1}^K \exp(W^T_j x_i+b)}\right]\] Because the SoftMax function has the desirable property of outputting a probability distribution, it is used as the final layer in Neural Networks. For this we need to calculate its derivative, or gradient, and pass it back to the previous layer during backpropagation, which we cover in the next section.
To make the derivation explicit, we will use the following graph to illustrate our Neural Network and the SoftMax function.
Let \(z_j=W^T_jx+b_j\) and \(p_j=\frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}\), then
If \(i=j\) \[\frac{\partial\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}}{\partial z_j} = \frac{e^{z_i} \left(\sum_{k=1}^K e^{z_k} - e^{z_j}\right)}{\left(\sum_{k=1}^K e^{z_k}\right)^2} =\frac{ e^{z_j} }{\sum_{k=1}^K e^{z_k} } \times \frac{\left( \sum_{k=1}^K e^{z_k} - e^{z_j}\right ) }{\sum_{k=1}^K e^{z_k} } =p_i(1-p_j)\]
If \(i\neq j\) \[\frac{\partial\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}}{\partial z_j} =\frac{0 - e^{z_j}e^{z_i}}{\left( \sum_{k=1}^K e^{z_k}\right)^2} =\frac{- e^{z_j} }{\sum_{k=1}^K e^{z_k} } \times \frac{e^{z_i} }{\sum_{k=1}^K e^{z_k} } =- p_j.p_i\]
So the derivative of the SoftMax function is given as \[\frac{\partial p_i}{\partial z_j} = \left\{ \begin{array}{ll} p_i(1-p_j) & \text{if } i=j \\ -p_j\,p_i & \text{if } i\neq j \end{array} \right. \]
Or, using the Kronecker delta \[ \delta_{ij} = \left\{ \begin{array}{ll} 1 & \text{if } i=j \\ 0 & \text{if } i\neq j \end{array} \right. \]
we can write this compactly as \[\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij}-p_j)\]
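In matrix form this Jacobian is \(\operatorname{diag}(p)-pp^T\); below is a short numerical check of the expression (a toy example of our own):

```python
import numpy as np

z = np.array([1.0, 2.0, 0.5])
p = np.exp(z) / np.exp(z).sum()            # softmax output

# Jacobian dp_i/dz_j = p_i * (delta_ij - p_j)
jacobian = np.diag(p) - np.outer(p, p)

# Finite-difference check of one entry, e.g. dp_0/dz_1
eps = 1e-6
z_pert = z.copy()
z_pert[1] += eps
p_pert = np.exp(z_pert) / np.exp(z_pert).sum()
assert abs((p_pert[0] - p[0]) / eps - jacobian[0, 1]) < 1e-4
```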
The cross-entropy loss with a SoftMax output layer is used extensively. We now use the derivative of the SoftMax function derived above to obtain the derivative of the cross-entropy loss. Let \(a_i\) denote the input to the SoftMax function (so \(a_i=z_i\) in the notation above); then
\[L = - \sum_i y_i \log(p_i)\] \[\frac{\partial L}{\partial a_i} = - \sum_k y_k \frac{\partial \log(p_k)}{\partial a_i } = - \sum_k y_k \frac{\partial \log(p_k)}{\partial p_k} \times \frac{\partial p_k}{ \partial a_i} = - \sum_k y_k \frac{1}{p_k} \times \frac{\partial p_k}{\partial a_i}\]
Using the derivative of the SoftMax function derived earlier,
\[ \frac{\partial L}{\partial a_i} = -y_i(1-p_i) + \sum_{k \neq i} y_k.p_i = - y_i + y_ip_i + \sum_{k \neq i} y_k.p_i = p_i\left( y_i + \sum_{k \neq i} y_k\right) - y_i\]
Since \(y\) is a one-hot encoded vector of labels, \(\sum_k y_k=1\) and hence \(y_i+\sum_{k\neq i}y_k=1\). So we have \[\frac{\partial L}{\partial a_i} = p_i - y_i\] which is a very simple and elegant expression.
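As a quick sanity check (a toy example of our own), the analytic gradient \(p-y\) agrees with a finite-difference gradient of the cross-entropy loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.3, -1.2, 2.0])
y = np.array([0.0, 0.0, 1.0])             # one-hot label

analytic = softmax(z) - y                  # dL/dz_i = p_i - y_i

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], y) - cross_entropy(z, y)) / eps
    for i in range(3)
])
assert np.allclose(analytic, numeric, atol=1e-4)
```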
The aim of our training algorithm is to minimize the loss \(L(W,b)\) as a function of the weights and biases. In other words, we want to find a set of weights and biases that makes the loss as small as possible. We cannot solve for the minimum of \(L(W,b)\) analytically, so we resort to iterative gradient descent, with the gradients computed by an algorithm known as backpropagation.
The goal of backpropagation is to compute the partial derivatives \(\frac{\partial L}{\partial W}\) and \(\frac{\partial L}{\partial b}\) of the loss function \(L\) with respect to any weight \(W\) or bias \(b\) in the network.
Let \[a^l=\sigma (W^la^{l-1}+b^l) \\ z^l = W^la^{l-1} + b^l\] So \[a^l = \sigma(z^l)\] where \(z^l\) has components \(z_j^l=\sum_kW_{jk}^la_k^{l-1}+b_j^l\) and \(z_j^l\) is the weighted input to the activation function for neuron \(j\) in layer \(l\).
Now we define the error \(\delta _j^l\) of neuron \(j\) in layer \(l\) by \(\delta _j^l=\frac{\partial L}{\partial z_j^l}\) and derive the equation for the error in the output layer, \[\delta _j^L=\frac{\partial L}{\partial a_j^L}\sigma^\prime(z_j^L)\]
It is easy to write the above equation in matrix-based form (\(\odot\) represents Hadamard product, i.e., element-wise product of two vectors) as \[\delta ^L=\nabla _aL\odot \sigma^\prime (z^L)\tag{BP1}\] The error \(\delta ^l\) in terms of the error in the next layer \(\delta ^{l+1}\) is \[\delta ^l=((W^{l+1})^T\delta^{l+1}) \odot\sigma^\prime (z^l)\tag{BP2}\] This equation appears complicated, but each element has a nice interpretation. Suppose we know the error \(\delta ^{l+1}\) at the \((l+1)^{\rm th}\) layer. When we apply the transpose weight matrix \((W^{l+1})^T\), we can think intuitively of this as moving the error backward through the network, giving us some sort of measure of the error at the output of the \(l^{\rm th}\) layer. We then take the Hadamard product \(\odot\sigma^\prime (z^l)\). This moves the error backward through the activation function in layer \(l\), giving us the error \(\delta ^l\) in the weighted input to layer \(l\).
For the rate of change of the loss with respect to any bias and any weight in the network, we have \[\frac{\partial L}{\partial b_j^l}=\delta _j^l\tag{BP3}\] \[\frac{\partial L}{\partial W_{jk}^l}=a_k^{l-1}\delta _j^l\tag{BP4}\]
\(BP1\), \(BP2\), \(BP3\) and \(BP4\) are the four fundamental equations of the backpropagation algorithm and are critical to designing Neural Networks: together they give us a way to compute the gradient of the loss function. Let’s write this out explicitly in the form of an algorithm:
Input \(x\): Set the corresponding activation \(a^1\) for the input layer.
Feedforward: For each \(l=2,3,..., L\) compute \(z^l=W^la^{l-1}+b^l\) and \(a^l=\sigma(z^l)\).
Output error \(\delta^L\): Compute the vector \(\delta^L=\nabla_aL\odot\sigma^\prime(z^L)\).
Backpropagate the error: For each \(l=L-1, L-2,..., 2\) compute \(\delta^l=((W^{l+1})^T\delta^{l+1})\odot\sigma^\prime(z^l)\)
Output: The gradient of the loss function is given by \(\frac{\partial L}{\partial W_{jk}^l}=a_k^{l-1}\delta _j^l\) and \(\frac{\partial L}{\partial b_j^l}=\delta _j^l\).
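The four steps above translate almost directly into NumPy. The following sketch (our own illustrative code with sigmoid hidden layers and a SoftMax output, not a verbatim excerpt of our program) performs one gradient-descent step; at the output layer it uses the \(p-y\) error derived earlier in place of the general \(\nabla_aL\odot\sigma^\prime(z^L)\) form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(X, Y, weights, biases, step_size):
    """One gradient-descent step for a fully connected network with sigmoid
    hidden layers and a SoftMax output layer (illustrative sketch only).

    X       : (m, D) inputs;          Y      : (m, K) one-hot labels
    weights : list of (n_in, n_out);  biases : list of (n_out,) vectors
    """
    m, L = X.shape[0], len(weights)

    # Feedforward: a[l] is the activation fed into layer l (a[0] is the input)
    a = [X]
    for l in range(L):
        z = a[-1] @ weights[l] + biases[l]
        if l < L - 1:
            a.append(sigmoid(z))
        else:                                    # SoftMax output layer
            e = np.exp(z - z.max(axis=1, keepdims=True))
            a.append(e / e.sum(axis=1, keepdims=True))

    loss = -np.sum(Y * np.log(a[-1])) / m        # averaged cross-entropy loss

    # Output error: delta^L = p - y (the SoftMax + cross-entropy result above)
    delta = (a[-1] - Y) / m

    # Backpropagate the error and apply BP3/BP4 layer by layer
    for l in reversed(range(L)):
        grad_W = a[l].T @ delta                  # BP4: dL/dW^l = a^{l-1} delta^l
        grad_b = delta.sum(axis=0)               # BP3: dL/db^l = delta^l
        if l > 0:                                # BP2, with sigma'(z) = a(1 - a) for sigmoid
            delta = (delta @ weights[l].T) * a[l] * (1.0 - a[l])
        weights[l] = weights[l] - step_size * grad_W
        biases[l] = biases[l] - step_size * grad_b

    return loss, weights, biases
```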
The next section includes some discussion of our Neural Networks and the implementation of the algorithm described above.
Finding the gradient for SoftMax regression required no modification to the gradient used for logistic regression. The gradient for logistic regression is given by \[X^T(P-y)\] where \(X\) is the data matrix, \(P\) holds the predicted probabilities, and \(y\) holds the one-hot encoded labels. For SoftMax regression the gradient takes the same form, except that \(P\) and \(y\), instead of being \(N\times 1\) vectors, are now \(N\times 10\) matrices, i.e., one column for each class.
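For illustration, one descent step for SoftMax regression using this gradient might look like the following sketch (our own code, not necessarily identical to our program):

```python
import numpy as np

def softmax_regression_step(X, Y, W, b, step_size):
    """One gradient-descent step for SoftMax regression (illustrative sketch).

    X : (N, D) data, Y : (N, 10) one-hot labels,
    W : (D, 10) weights, b : (10,) biases.
    """
    scores = X @ W + b
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    P = e / e.sum(axis=1, keepdims=True)      # (N, 10) predicted probabilities

    grad_W = X.T @ (P - Y)                    # same form as in logistic regression
    grad_b = (P - Y).sum(axis=0)

    return W - step_size * grad_W, b - step_size * grad_b
```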
For the implementation of gradient descent in Neural Networks, we start by finding the quantity \(\nabla_aL\), the rate of change of the loss with respect to the output of the SoftMax function. For the classification problem it is given by the following expression, where \(y\) is the label and \(a\) is the output of the SoftMax function: \[\nabla_aL=-\frac{y}{a}+\frac{1-y}{1-a}\]
After we compute \(\nabla_aL\), we move backwards through the network to find the gradient for each weight and bias matrix.
Implementing the mathematics discussed above led to several problems, most of which were related to computational limitations; the exception was the exploding-gradient problem.
Due to repeated dot products, as the gradient propagated backwards, the individual elements of the gradient matrix grew by several orders of magnitude compared to the elements of the weight matrix, and the growth compounded with each layer. We clipped the gradient by dividing each column by its \(L2\) norm.
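A small sketch of this column-wise clipping (illustrative only; the eps guard against zero columns is our addition):

```python
import numpy as np

def clip_columns(grad, eps=1e-12):
    """Divide each column of a gradient matrix by its L2 norm."""
    norms = np.linalg.norm(grad, axis=0, keepdims=True)   # one norm per column
    return grad / (norms + eps)                           # eps guards against zero columns
```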
Another, similar problem was that, due to repeated dot products, the outputs grew beyond the numerical limit of the np.exp function as the number of layers increased. To solve this, we normalized the outputs of all layers except the last one.
The complication is that we do not want the output to be too small either, since the SoftMax function would then give almost equal probability to all classes; theoretically, we want it as large as possible. So we also multiplied the normalized values by \(10\), which gives the SoftMax function enough magnitude to differentiate between classes while keeping the numbers manageable.
This scaling does not affect the backpropagation of gradients, as the operation is equivalent to dividing by a constant: it effectively changes the step size for that layer but does not change the backpropagation itself.
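A minimal sketch of this normalize-and-rescale step (illustrative; the axis along which we normalize here is an assumption, and the exact normalization in our program may differ):

```python
import numpy as np

def rescale_output(a, scale=10.0, eps=1e-12):
    """Normalize a hidden layer's output and rescale it so np.exp stays in range."""
    norms = np.linalg.norm(a, axis=1, keepdims=True)   # one norm per example (assumed axis)
    return scale * a / (norms + eps)                   # eps avoids division by zero
```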
The sigmoid-derivative (Sigmoid Prime) function, written with NumPy, returned NaN values when the output was very close to zero. To solve this, we added a small error term to the denominator.
\((1).\) We compared the results for SoftMax regression and the \(3\)-layer Neural Network with ReLU activation.
From the results of the comparison we concluded that the Neural Network model fits the dataset better and faster than the SoftMax regression model.
\((2).\) We also compared the performance of the program on the same dataset with a varying number of layers.
All configurations showed similar performance, but the number of iterations required for the loss to converge increased with the number of layers.
\((3).\) We wrote the feed function to perform gradient descent on mini-batches and compared the results for different batch sizes with \(2\) layers.
Mini-batch gradient descent works as well as full gradient descent, and the number of iterations required for convergence increases as the batch size decreases.
\((4).\) We used the MNIST dataset extensively while writing and testing our program. To make sure the program works with a different kind of dataset, we fit a \(3\)-layer Neural Network, similar to that in part (1), on the notMNIST dataset and observed the results.
From the loss we observed that the algorithm converged for the new dataset as well.
From the above tests we concluded that our program for fitting neural networks using gradient descent works efficiently for simple models and the behavior of loss is as expected.
We intend to further add exponentially decaying step size to the program. We also intend to incorporate various optimizers such as the Adagrad and Adam optimizers. We believe that a lot of different gradient based optimization methods could be implemented using back-propagation. We also intend to work with other activation functions and add functions to incorporate \(L1\) and \(L2\) penalty functions. Another possible addition could be that of convolutions and Max/Average pooling.