library(plotly)
Creating the best deep neural network model is an empirical and iterative process that works best on big datasets. Being empirical, iterative, and requiring big datasets, unfortunately, takes its toll on computer resources and time. It is therefore important to take steps to mitigate this triad of time and resource consumption.
The empirical nature refers to the network architecture that a model consists of. Gradient descent, which aims to minimize the cost function, is an important part of all models, but it can often slow down the learning that takes place. Algorithms called optimizers have been created to speed up the process of gradient descent.
This chapter discusses some of the steps that can be implemented to reduce the burden of creating a good model and to optimize gradient descent.
In many cases, the scale of the input features varies greatly. Where some variables might have data point values consisting of small integers, others might have values that range into the hundreds or thousands.
Scaling the data point values for each feature variable to a similar scale improves training by altering the cost function such that gradient descent becomes easier.
This can be visualized by considering a cost function of only two variables. If one has very large values and the other has small values, the resulting 3D graph will be very elongated (along the axis of the large values). Gradient descent must then find a long, convoluted path down this surface.
By creating a similar, small scale for each of the feature variables, the graph of the cost function becomes more uniform.
Scaling is most often achieved through normalization, perhaps more properly referred to as standardization. A mean and a variance are calculated for each feature variable in the training set. Each element of each feature variable then has the specific mean for that variable subtracted from it, and the result is divided by the variance for that variable, as shown in equation (1).
\[\frac{x_i - \mu}{\sigma^2} \tag{1}\]
The values for \(\mu\) and \(\sigma\) for each variable in the training set are retained and used to normalize the features in the test set too. This is an important step. It is incorrect to use an overall \(\mu\) and \(\sigma\) computed over the combined training and test sets, and it is also incorrect to use a separate \(\mu\) and \(\sigma\) for the test set.
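As a minimal sketch in R (the feature matrices x_train and x_test below are randomly generated stand-ins for real data), the training-set statistics are computed once per feature and reused for the test set.
# Minimal sketch: normalize train and test sets with training-set statistics only
# (x_train and x_test are randomly generated stand-ins for real feature matrices)
set.seed(42)
x_train <- matrix(rnorm(100 * 3, mean = 50, sd = 10), nrow = 100)
x_test  <- matrix(rnorm(20 * 3, mean = 50, sd = 10), nrow = 20)

mu     <- apply(x_train, 2, mean)  # per-feature mean from the training set
sigma2 <- apply(x_train, 2, var)   # per-feature variance from the training set

# Equation (1) applied to both sets using the training-set mu and sigma^2
x_train_norm <- sweep(sweep(x_train, 2, mu, "-"), 2, sigma2, "/")
x_test_norm  <- sweep(sweep(x_test,  2, mu, "-"), 2, sigma2, "/")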
To see how repeated multiplication through many layers can cause problems, consider the multiplication of two rational numbers in the interval \(\left( 0,1 \right)\). A few examples are shown in equation (2).
\[\frac{1}{2} \times \frac{2}{3} = \frac{1}{3} \\ \frac{1}{4} \times \frac{9}{10} = \frac{9}{40} \\ \frac{a}{b} \times \frac{c}{d} = \frac{ac}{bd}, \forall a < b, c < d,\left\{a,b,c,d\right\}\in\mathbb{Z}^+ \tag{2}\]
With these constraints it can be shown that the inequalities in equation (3) hold.
\[\left( \frac{a}{b} > \frac{ac}{bd} \right) \wedge \left( \frac{c}{d}>\frac{ac}{bd} \right) \tag{3}\]
If biases are omitted for the sake of simplicity and a linear activation function \(g \left( z \right) = z\) is used, the network output \(\hat{y}\) can be calculated as shown in equation (4).
\[\hat{y} = W^{\left[ l \right]} \cdot W^{\left[ l-1 \right]} \cdot W^{\left[ l-2 \right]} \cdot \ldots \cdot W^{\left[ 2 \right]} \cdot W^{\left[ 1 \right]} \cdot x \tag{4}\]
With appropriate dimensions and with weight values in the interval \(\left( 0,1 \right)\), this product of the \(W\) matrices will have element values approaching zero as \(l\) increases. In other words, given a sufficiently deep network, there is the threat of the values propagated through the network approaching zero. This is known as the vanishing gradient problem.
By a similar argument, for weight values larger than \(1\), the values grow ever larger, known as the exploding gradient problem.
A similar argument yet again holds for the derivatives during gradient descent (backpropagation).
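The effect can be illustrated with a short R sketch (the specific values 0.9 and 1.1 and the depth of 30 layers are arbitrary choices for illustration).
small <- rep(0.9, 30)   # weight-like values just below 1
large <- rep(1.1, 30)   # weight-like values just above 1
cumprod(small)[30]      # about 0.04: the product vanishes
cumprod(large)[30]      # about 17.4: the product explodes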
The problem of small weights is compounded by the resulting slower gradient descent, which in turn might take many epochs to converge.
A partial solution lies in the random selection of the initial weight values. The goal is to set the variance of each weight matrix to the reciprocal of the number of input nodes that the matrix multiplies. (When using the rectified linear unit (ReLU) activation function, \(\frac{2}{n}\), where \(n\) is the number of input nodes, is a better choice.)
Setting the variance of a matrix to \(\frac{2}{n}\) (for ReLU) is achieved by multiplying each randomly drawn element in the matrix by the square root of \(\frac{2}{n}\). For tanh activation, \(\frac{1}{n}\) is used; this is known as Xavier initialization. Note that Xavier initialization is no longer as commonly used, since ReLU has supplanted the hyperbolic tangent function, tanh, as the most popular activation function.
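As a small sketch in R (the layer sizes of 64 input and 32 output nodes are arbitrary), a weight matrix can be initialized by drawing from a standard normal distribution and scaling by the appropriate square root.
# Minimal sketch: He and Xavier initialization (layer sizes are arbitrary)
n <- 64   # number of input nodes feeding the layer
m <- 32   # number of nodes in the layer
set.seed(1)
W_he     <- matrix(rnorm(m * n), nrow = m) * sqrt(2 / n)  # ReLU: variance of 2/n
W_xavier <- matrix(rnorm(m * n), nrow = m) * sqrt(1 / n)  # tanh: variance of 1/n
var(as.vector(W_he))   # approximately 2/n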
Gradient descent is the iterative process that seeks the minimum of a function. This process updates the parameters at every step. The step itself can be one of three types.
In batch gradient descent, the parameter values are only updated after all of the samples have been used in computing the cost function.
While this type of learning allows for the inclusion of all of the training data and certainly maximizes the potential for properly approaching the global minimum, it can be computationally expensive. This happens when the dataset is too big to fit into memory and constant retrieval from slower storage must take place.
The process of gradient descent can be hastened if the training sample set can be broken up into parts, called mini-batches. During an epoch, gradient descent can take place after each mini-batch so that at the end of the epoch, a lot of progress can potentially be made. The process of forward propagation and backpropagation takes place during each mini-batch process. An epoch still refers to completing all of the mini-batches.
The term batch refers to the complete training set. Note, though, that in code, the argument batch_size = is used. This actually refers to the mini-batch sample size.
As a rule of thumb, the whole dataset can be used when it is relatively small and will not penalize the overall time taken to converge to a minimum value for the cost function. For bigger datasets, a power of \(2\) such as \(16, 32, 64, 128, 256, 512\) is useful as mini-batch size. In most cases this works well with parallel graphics card memory architecture and optimizes its use. Irrespective of the size used, it is important that a mini-batch fits within physical memory.
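The sketch below (the training-set size of 1000 samples is an arbitrary illustration) shuffles the sample indices and splits them into mini-batches of 64, which is how a single epoch would typically be organized.
# Minimal sketch: split a training set into mini-batches of size 64
# (the training-set size of 1000 is an arbitrary illustration)
n_samples  <- 1000
batch_size <- 64
set.seed(7)
shuffled <- sample(n_samples)   # shuffle the sample indices once per epoch
mini_batches <- split(shuffled, ceiling(seq_along(shuffled) / batch_size))
length(mini_batches)            # 16 mini-batches (the last one is smaller)
# each mini-batch would then undergo forward propagation, backpropagation, and an update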
The extreme form of a mini-batch size is \(1\). Every sample in the training dataset is its own mini-batch. The gradient descent that results is called stochastic gradient descent (SGD) and can vary quite considerably for each sample. The search for a minimum is therefore usually termed noisy.
SGD is able to explore a larger hypothesis space (different parts of the multidimensional cost function) and find global minima this way. It also requires less memory usage as only a single sample is required at each turn.
It is not always clear, however, if and how learning is progressing. Usually, the cost is only shown at the end of each epoch and it may take many epochs to confirm progress. It can also be computationally expensive to require so many updates during each epoch.
Optimizers are algorithms that can improve descent towards the global minimum.
Gradient descent with momentum uses the concept of an exponentially weighted moving average to increase the rate of gradient descent. Instead of updating the parameters during backpropagation with the usual learning rate times the partial derivative of the specific parameter, a record is kept of these partial derivatives. An exponentially weighted moving average (EWMA) of them is calculated, and it is this average that is multiplied by the learning rate.
An EWMA considers sequential values and calculates a moving average along each of the values such that there is an exponential decay in how previous values in the sequence contribute to the current average.
The code chunk below creates data point values along the \(x\)-axis and then calculates the sine of each of these values for the \(y\)-axis, but adds some random noise and the integer \(1\) to each value.
# Input values from 0 to 2*pi in steps of one degree
x <- seq(from = 0, to = 2*pi, by = pi/180)
# Sine of each input value, with random noise added and shifted up by 1
y <- sin(x) + rnorm(length(x), mean = 0, sd = 0.1) + 1
Figure 1 below shows the sine curve and the data with noise.
f1 <- plot_ly(x = x,
y = y,
name = "data",
type = "scatter",
mode = "markers") %>%
add_trace(x = x,
y = sin(x) + 1,
name = "sine",
type = "scatter",
mode = "lines") %>%
layout(title = "Data point values along sine curve",
xaxis = list(title = "Input values"),
yaxis = list(title = "Output values"))
f1
Fig 1 Random noise along sine curve
For a given coefficient, \(\beta\), a moving average, \(v_i\), over each of the output values \(y_i\), is given in equation (1).
\[v_i = \beta \times v_{i-1} + \left( 1 - \beta \right) \times y_i \tag{1}\]
The code chunk below uses a for loop over the output values. Figure 2 shows the EWMA for \(\beta = 0.9\).
N <- length(x)
beta <- 0.9        # EWMA coefficient
v <- numeric(N)    # moving average values, starting from zero
for (i in 2:N){
  # Equation (1): blend the previous average with the current output value
  v[i] <- (beta * v[i - 1]) + ((1 - beta) * y[i])
}
f1 <- f1 %>% add_trace(x = x,
y = v,
name = "EWMA",
type = "scatter",
mode = "lines")
f1
Fig 2 Added EWMA
Note the initial start at zero and the time taken to catch up. This is usually of no consequence in deep neural network training as training will occur over many epochs.
Expansion of equation (1) for a given value of \(\beta\) shows that the number of previous data points over which the average is effectively computed is approximately given in equation (2). For \(\beta = 0.9\), this amounts to roughly the last \(10\) values.
\[\approx \frac{1}{1 - \beta}\tag{2}\]
The effect of an EWMA is that components of the gradient orthogonal (in the higher-dimensional space) to the idealized direction of travel are averaged out over time, while those in the correct direction accumulate, hence the term momentum, i.e. gradient descent builds up momentum in the right direction.
Equation (3) below shows that instead of the partial derivative of the weight being used directly, it is accumulated into an exponentially weighted moving average for some coefficient \(\beta_v \in \left[ 0,1 \right]\).
\[V_{\partial W_{i}} = \beta_v V_{\partial W_{i-1}} + \left( 1 - \beta_v \right) \partial W_{i} \tag{3}\]
This exponentially weighted moving average of the derivative is then used to update the weight as shown in equation (4).
\[W_{i} = W_{i-1} - \alpha V_{\partial W_{i}} \tag{4}\]
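The sketch below is a hedged illustration in R: the elongated quadratic cost function, its gradient, the starting point, and the hyperparameter values are all made up for the example, but the update itself follows equations (3) and (4).
# Minimal sketch of gradient descent with momentum on a made-up cost function
# J(w) = 5 * w1^2 + 0.5 * w2^2 (an elongated bowl)
grad <- function(w) c(10 * w[1], 1 * w[2])   # partial derivatives of J

alpha  <- 0.05    # learning rate
beta_v <- 0.9     # momentum coefficient
w <- c(4, 4)      # arbitrary starting point
V <- c(0, 0)      # EWMA of the gradient, initialized at zero

for (i in 1:100) {
  dW <- grad(w)
  V  <- beta_v * V + (1 - beta_v) * dW   # equation (3)
  w  <- w - alpha * V                    # equation (4)
}
w   # approaches the minimum at (0, 0)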
Root mean square propagation (RMSprop) also attempts to speed up gradient descent. It differs from momentum in that it keeps a moving average of the square of the derivative and divides the update by its square root. The changes relative to equations (3) and (4) are reflected in equation (5).
\[S_{\partial W_{i}} = \beta_s S_{\partial W_{i-1}} + \left( 1 - \beta_s \right) \partial W_{i}^2 \\ W_{i} = W_{i-1} - \alpha \frac{\partial W_{i}}{\sqrt{S_{\partial W_{i}}}} \tag{5}\]
One of the most widely used optimization algorithms for gradient descent combines momentum and RMSprop into adaptive moment estimation (ADAM).
One addition to the combination of these two algorithms is the correction of the bias that exists at the start of an exponentially weighted moving average calculation. For any iteration, \(t\), the current average is simply divided by \(1 - \beta^t\), as shown in equation (6).
\[\rho_{\partial W}^{\text{corrected}} = \frac{\rho_{\partial W}}{1 - \beta^t}\tag{6}\]
Here \(\rho\) refers to either \(V\) as for momentum or \(S\) as for RMSprop. This corrects for small values of \(t\), but for larger values of \(t\) the denominator approaches \(1\) and makes very little difference.
The parameter update is shown in equation (7).
\[W_{i+1} = W_{i} - \alpha \frac{V_{\partial W_i}^{\text{corrected}}}{\sqrt{S_{\partial W_i}^{\text{corrected}}}} \tag{7}\]
Note that ADAM requires the setting of hyperparameters \(\alpha\), \(\beta_v\), and \(\beta_s\). Typical values include \(\beta_v = 0.9\) and \(\beta_s = 0.999\).
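The sketch below reuses the made-up quadratic cost from the momentum example, with the typical hyperparameter values above and a small \(\epsilon\) added to avoid division by zero (the value of \(\epsilon\) and all other values are illustrative assumptions). It combines the momentum and RMSprop averages with bias correction into the ADAM update.
# Minimal sketch of the ADAM update on the same made-up cost function
# J(w) = 5 * w1^2 + 0.5 * w2^2
grad <- function(w) c(10 * w[1], 1 * w[2])

alpha   <- 0.1      # learning rate
beta_v  <- 0.9      # momentum coefficient
beta_s  <- 0.999    # RMSprop coefficient
epsilon <- 1e-8     # small value added to avoid division by zero (an assumption)
w <- c(4, 4)
V <- c(0, 0)
S <- c(0, 0)

for (t in 1:200) {
  dW <- grad(w)
  V <- beta_v * V + (1 - beta_v) * dW      # momentum average, equation (3)
  S <- beta_s * S + (1 - beta_s) * dW^2    # RMSprop average, equation (5)
  V_corrected <- V / (1 - beta_v^t)        # bias correction, equation (6)
  S_corrected <- S / (1 - beta_s^t)
  w <- w - alpha * V_corrected / (sqrt(S_corrected) + epsilon)   # equation (7)
}
w   # approaches the minimum at (0, 0)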
A constant learning rate can lead to slow learning, requiring many epochs to minimize the cost function. Conversely, it may also lead to overshooting once the neighborhood of the global minimum is reached. The obvious solution lies in an initially larger learning rate, which decreases as the values of the parameters converge to the minimum. Equation (8) shows how to decrease the value of the learning rate, \(\alpha\), at a decay rate, \(\eta\), over each epoch, \(\mathscr{E}\).
\[\alpha_{\mathscr{E}+1} = \frac{1}{1 + \eta \left(\mathscr{E}+1\right)} \alpha_{\mathscr{E}}\tag{8}\]
The code below demonstrates this process. The learning rate starts at \(0.01\) and decays with \(\eta = 0.1\) at each epoch.
alpha <- c(0.01)   # initial learning rate
eta <- 0.1         # decay rate
for (i in 2:20){
  # Equation (8): each new learning rate is a decayed fraction of the previous one
  alpha <- append(alpha, 1 / (1 + eta * i) * alpha[i - 1])
}
f2 <- plot_ly(x = 1:20,
y = alpha,
name = "data",
type = "scatter",
mode = "markers") %>%
layout(title = "Decreasing decay rate",
xaxis = list(title = "Batch"),
yaxis = list(title = "Learning rate"))
f2
Fig 3 Decreasing learning rate
The decay rate, \(\eta\), is another hyperparameter that must be set.
Note that there are a number of decay types such as exponential decay and staircase decay. More sophisticated methods look at updating the learning rate differently for every parameter.
Just as with normalizing the input variables, the values of each of the nodes in a hidden layer can also be normalized. This is referred to as batch normalization.
It is most common to normalize the values of the nodes in a layer before applying the activation function, although normalization after applying the activation is also possible.
In standard form, every node value \(z^{\left( i \right)}\) is normalized as shown in equation (9).
\[z_{\text{norm}}^{\left( i \right)} = \frac{z^{\left( i \right)} - \mu}{\sigma}\tag{9}\]
The normalized values can then be scaled and shifted by two learnable parameters, \(\gamma\) and \(\beta\), as shown in equation (10).
\[\tilde{z}^{\left( i \right)} = \gamma z_{\text{norm}}^{\left( i \right)} + \beta\tag{10}\]
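As a small sketch in R (the node values, \(\gamma\), and \(\beta\) below are arbitrary illustrations), the pre-activation values of a layer for one mini-batch can be normalized and then scaled and shifted per equations (9) and (10).
# Minimal sketch of batch normalization for one layer's pre-activation values
z <- c(2.3, 5.1, 0.7, 3.9, 4.4)    # arbitrary node values over a mini-batch
mu    <- mean(z)
sigma <- sd(z)
z_norm <- (z - mu) / sigma         # equation (9)
gamma <- 1.5                       # learnable scale (arbitrary initial value)
beta  <- 0.2                       # learnable shift (arbitrary initial value)
z_tilde <- gamma * z_norm + beta   # equation (10)
z_tilde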
There are a variety of methods to improve the training of a network. These unfortunately bring a plethora of hyperparameters and network setups that take time, vigilance, and experience to implement.
The use of these improvements to a network is mathematically complex, but easy to express in code. TensorFlow and other neural network platforms have built-in functions that implement all the ideas mentioned in this chapter.
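As a hedged sketch using the keras package for R (the layer sizes, hyperparameter values, and the x_train and y_train objects in the commented line are illustrative assumptions, and argument names may differ slightly between package versions), several of the ideas in this chapter reduce to a single argument or layer.
# Hedged sketch with the keras R package: He initialization, batch normalization,
# the ADAM optimizer, and a mini-batch size are all available as built-ins
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10),
              kernel_initializer = "he_normal") %>%    # variance of 2/n
  layer_batch_normalization() %>%                      # normalize hidden node values
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001, beta_1 = 0.9, beta_2 = 0.999),
  loss = "mse"
)

# model %>% fit(x_train, y_train, epochs = 20, batch_size = 64)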