Gradient Descent

Gradient descent is an optimization technique that is used to minimize a function, as long as the derivative can be found.
It is often used when we can’t find the minimum of a function analytically.

The algorithm

Initialise the algorithm with the following parameters
- \(\theta\), the parameter value to start the search
- \(\alpha\), the learning rate
- max_iters, the maximum amount of iterations for the algorithm to do
- precision, the desired precison of the result - how small the step has to be for the algorithm to be considered converged.
Take steps in the direction of the negative of the gradient. The size of these steps are dictated by the learning rate. The iterations will look like:

\[ \theta_{n}=\theta_{n-1}-\alpha\frac{\partial}{\partial\theta}J(\theta_{n-1})\]
We can either go until max_iters or use a stopping criteria. This criteria is often in the form of length of the most recent step.
- Stop if the absolute value of the step is less than the precision:
\[ |\theta_n-\theta_{n-1}|<\text{precision}\]

Advanced Gradient Methods

Typical gradient descent can easily get stuck in local minima and can also take a long time to converge, therefore advanced optimisation techniques have been created to solve this issue.

Original Gradient Descent

Note original gradient descent weight increments:

\[ \Delta w_{ij} = \underbrace{\left(\eta*\frac{\partial E}{\partial w_{ij}}\right)}_{\text{Learning Rate*Weight Gradient}}\]

Momentum

Momentum speeds up the training process by utilising the magnitude of previous weight updates in the current weight update. This acts like an acceleration, converging faster when the weight updates are large.
- Consider a momentum rate \(\gamma\)

\[ \Delta w_{ij} = \left(\eta*\frac{\partial E}{\partial w_{ij}}\right) + \left(\gamma*\Delta w_{ij}^{t-1}\right)\]

Stochastic Gradient Descent

The original gradient descent uses the whole dataset as a single batch prior to weight updates.
SGD computes the gradient for an instance of the dataset and updates the weights after the gradients are computed. Note that the dataset is shuffled prior at the start of each iteration.
Minibatch SGD uses the same idea as SGD however minibatches are used rather than single instances
- This is useful for big datasets where the whole dataset can’t be loaded at once
- Uses batches reduces the variance of parameter updates and can lead to more stable convergence
  - More often updates can also lead to faster convergence
- Randomness can result in jumping out of local minima
- However, the error function is not as well optimised as with GD, where GD can find and stop at the minima while SGD will oscilate around the minima.

Adaptive Learning Rate Methods

Adapting the learning rate during learning allows different weights to have different rates and for the rate to increase/decrease as time goes on.
Note that adaptive gradients are not always a good choice, and all options should be considered for the specfic problem.

AdaGrad

Adagrad is an early adaptive learning algorithm that accumulates the sum of squared gradients from previous iterations
- Due to this accumulation, it can cause the learning rate to rapidly diminish. Therefore it is not good for deep learning problems
- However, it is very effective for simple problems such as linear regression.
Consider \(g_{t,i}\) as the partial derivative of the error function with respect to weights \(\theta_i\) at time step t:
- \(G_{ii,t}\) Is the diagonal matrix where each diagonal element is the sum of squared gradients from the beginning to \(t=0\)

\[ g_{t,i} = \Delta_\theta J(\theta_{t,i})\]

\[ \theta_{t+1,i} = \theta_{t,i}-\frac{\eta}{\sqrt{G_{t,ii}+\epsilon}}*g_{t,i}\]

AdaDelta/RMSProp

Adadelta restricts the window of accumulated past gradients to a fixed size rather than all of the gradients
- Less memory is used and this fights against the diminishing learning rate of Adagrad
Sum of gradients is recursively defined as a decaying average of all past squared gradients:

\[ \mathbb{E}[g^2]_t = \gamma*\mathbb{E}[g^2]_{t-1}+(1-\gamma)g^2_t \]

Therefore the weights:

\[ \theta_{t+1,i} = \theta_{t,i}-\frac{\eta}{\sqrt{\mathbb{E}[g^2]_t+\epsilon}}*g_{t,i} \]

Adam

Adam computes adapted learning rates from estimates of the first and second moments.
First moment:

\[ m_t = \beta_1*m_{t-1}+(1-\beta_1)g_t \]

Second moment:

\[ v_t = \beta_2*v_{t-1}+(1-\beta_2)g^2_t \]

These moments are updated to eliminate bias towards 0 in the initial time steps:

\[ \hat{m} = \frac{m_t}{1-\beta_1^t} \]

\[ \hat{v} = \frac{v_t}{1-\beta_2^t} \]

With weights:

\[ \theta_{t+1,i} = \theta_{t,i}-\frac{\eta}{\sqrt{v_t}+\epsilon}*\hat{m}_t \] ## Optimizer Summary

Optimizers

Optimizers 2

Activiation Functions

No activation on input layer
Default activation function for hidden layers is RELU as it does not saturate towards the extrema. However, due to the max nature of the function it can lead to dead neurons with no gradient, therefore alternatives such as Selu and leaky relu are often used.
- However, Relu is extremly fast to compute therefore it is still often used.
For classification tasks either sigmoid or softmax activation function is used for output.
For regression tasks, if it is unbounded then use no activation, if it is bounded use Tanh (-1,1)
Guidelines for typical neural network architectures:

Guidelines

Data Processing

One hot encode class output vectors and ordinal encode categorical inputs

Number of Hidden Layers

Need to determine the optimal amount of hidden layers and/or neurons for the specific problem
- We can alter the number of hidden layers until the neural network starts to overfit. Can also consider having a too large network and using heavy regularisation.
Adding neurons and layers increases the computation time and can lead to decreased performance due to overfitting.

Regularization

Weight Decay

Prevents overfitting and weight values getting too large by adding a penalty term:
Original weight update:

\[ w_i \leftarrow w_i-\eta*\frac{\partial E}{\partial w} \]

Penalised weight update:

\[ w_i \leftarrow w_i-\eta*\frac{\partial E}{\partial w} -\eta\lambda w_i\]

Dropout

Computationally cheap way to reduce overfitting and improve generalisation performance by dropping nodes from the network while training.
- This can slow down convergence, however the converged weights are normally superior to non-dropout models
Dropout can be seen as averaging over an ensemble of smaller networks.

Recurrent Neural Networks

Contain feedback connections, making them appropriate for modelling dynamic and temporal sequences.
- Utilise internal states/memory in a context layer to process variable-length sequences of inputs.
Gradients are calculated through backpropogation through time
- Network is unfolded in the temporal dimension and then conventional backpropogation is performed.
- Same set of weights are used for every time step of the computation
BPTT can become problematic when training very long sequences due to the vanishing gradient problem.
- One possible fix is truncated BPTT, an approximation of BPTT that can run forward and backward through chunks rather than the whole sequence
  - Forward pass in unaffected however the backward pass is calculated through the chunks.
Advantages:
- Can process any length of input
- Computation for step t can use information from multiple steps back
- Model size doesn’t increase for longer input
- Same weights for every time step
Disadvantages:
- Long computational time on the recurrent layers
- In practice it is difficult to access information from many steps back
- Vanishing gradients problem

LSTM

While RNNs have trouble learning long-term dependencies in data, LSTM was created to have better capacity for long time sequences.
LSTM uses memory cells and gates to pass through long and short hidden term states
However, these networks are often more computationally costly when compared to RNN

Neural Network Notes

Jake

29/11/2021