C. Donovan
11 April 2018
NB: If it's not in the lecture or lab, it's not in the exam
We'll fit a basic NN to some image data for classification and see how we did
[R: You'll do similar in the lab this week]
Simple in principle:
This is a gradient search, iterating over multiple dimensions (dictated by the number of parameters/weights).
Well-behaved error surfaces make the search easy; nasty ones (like NNs, with many weights and local optima) make it harder.
Refer to H, T & F sections 11.3 & 11.4. A simplified version follows.
Consider the following simple NN \[ y = \beta_0 + \beta_1z_1 + \beta_2z_2 \] where
\[ \begin{align*} z_1 &= \frac{1}{1+e^{-(\alpha_0 + \alpha_1x_1 + \alpha_2x_2)}}\\ z_2 &= \frac{1}{1+e^{-(\alpha_3 + \alpha_4x_1 + \alpha_5x_2)}} \end{align*} \]
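A minimal sketch of this forward pass in R (the weight values here are arbitrary illustrative numbers, not fitted ones):

```r
# Forward pass for the small NN above: two inputs -> two logistic
# hidden units (z1, z2) -> one linear output y
sigmoid <- function(u) 1 / (1 + exp(-u))

forward <- function(x1, x2, alpha, beta) {
  z1 <- sigmoid(alpha[1] + alpha[2] * x1 + alpha[3] * x2)   # alpha_0..alpha_2
  z2 <- sigmoid(alpha[4] + alpha[5] * x1 + alpha[6] * x2)   # alpha_3..alpha_5
  y_hat <- beta[1] + beta[2] * z1 + beta[3] * z2            # beta_0..beta_2
  list(z1 = z1, z2 = z2, y_hat = y_hat)
}

# Example with arbitrary weights
alpha <- c(0.1, 0.5, -0.3, 0.2, -0.4, 0.6)
beta  <- c(0.0, 1.0, -1.0)
forward(x1 = 1, x2 = 2, alpha, beta)
```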
We're seeking to optimise the weights (the \( \alpha \) and \( \beta \)).
\[ \frac{\partial R}{\partial \beta_1} = 2(y-(\beta_0+\beta_1 z_{1} + \beta_2 z_{2}))(-1) \times z_{1} \]
\[ \frac{\partial R}{\partial \alpha_1} = 2(y-(\beta_0+\beta_1 z_{1} + \beta_2 z_{2}))(-1) \times \beta_1 \times z_1(1-z_1) \times x_1 \]
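As a sanity check, these derivatives can be verified numerically with a finite-difference approximation. A sketch, reusing `forward()`, `alpha` and `beta` from the snippet above; `x1`, `x2` and `y` are arbitrary illustrative values:

```r
# Squared-error loss for a single observation
loss <- function(x1, x2, y, alpha, beta) {
  (y - forward(x1, x2, alpha, beta)$y_hat)^2
}

x1 <- 1; x2 <- 2; y <- 0.5

# Analytic derivative w.r.t. beta_1 (first expression above)
fw <- forward(x1, x2, alpha, beta)
dR_dbeta1 <- 2 * (y - fw$y_hat) * (-1) * fw$z1

# Finite-difference approximation of the same derivative
eps <- 1e-6
beta_eps <- beta; beta_eps[2] <- beta_eps[2] + eps
fd <- (loss(x1, x2, y, alpha, beta_eps) - loss(x1, x2, y, alpha, beta)) / eps
c(analytic = dR_dbeta1, finite_difference = fd)  # should agree closely
```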
The details change, particularly with the loss and activation functions (the combination function is usually the same).
The following are sometimes referred to as the errors (often denoted \( \delta \)) that have been “propagated backwards”:
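Extracting the common factors from the two derivatives above (in the style of H, T & F's \( \delta \) and \( s \), specialised to this simplified network), these propagated errors are
\[ \begin{align*} \delta &= -2\big(y-(\beta_0+\beta_1 z_1 + \beta_2 z_2)\big)\\ s_1 &= \delta \times \beta_1 \times z_1(1-z_1) \end{align*} \]
so that \( \partial R/\partial \beta_1 = \delta\, z_1 \) and \( \partial R/\partial \alpha_1 = s_1\, x_1 \).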
We need a forward pass to get the error, then, using this, we work backwards to evaluate the derivatives.
\[ \begin{align*} \beta_k^{r+1}&=\beta_k^{r}-\gamma\sum_{i=1}^{n}\frac{\partial R_i}{\partial \beta_k^r}\\ \alpha_s^{r+1}&=\alpha_s^{r}-\gamma\sum_{i=1}^{n}\frac{\partial R_i}{\partial \alpha_s^r} \end{align*} \]
where the size of the movements is controlled by \( \gamma \) (the learning rate). We alter the weights from the bottom up (starting with the \( \alpha \)).
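A sketch of one such gradient-descent step for this small network, for a single observation (continuing from the snippets above; \( \gamma \) is arbitrary, and in practice the gradients would be summed over \( i = 1, \dots, n \)):

```r
gamma <- 0.1   # learning rate (arbitrary illustrative value)

# Forward pass, then the propagated errors
fw    <- forward(x1, x2, alpha, beta)
delta <- 2 * (y - fw$y_hat) * (-1)            # error at the output
s1 <- delta * beta[2] * fw$z1 * (1 - fw$z1)   # error at hidden unit 1
s2 <- delta * beta[3] * fw$z2 * (1 - fw$z2)   # error at hidden unit 2

# Derivatives for every weight (bias, x1, x2 pattern for the alphas)
grad_beta  <- delta * c(1, fw$z1, fw$z2)
grad_alpha <- c(s1 * c(1, x1, x2), s2 * c(1, x1, x2))

# One gradient-descent update
beta  <- beta  - gamma * grad_beta
alpha <- alpha - gamma * grad_alpha
```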
Fitting can be thought of as a gradient search method, e.g. some variant of Newton's method, that seeks to minimise the error. So we have to consider:
This is a type of regularisation (penalised fitting): we now minimise
\[ R(\boldsymbol{\theta})+\lambda J(\boldsymbol{\theta}) \]
Here \( J \) is a measure of the size of the weights e.g. \( \sum \beta^2 + \sum \alpha^2 \) (usually excluding biases).
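Assuming the quadratic \( J \) above, the only change to the gradient-descent updates is an extra shrinkage term, e.g. for an output weight:
\[ \frac{\partial}{\partial \beta_k}\Big(R + \lambda J(\boldsymbol{\theta})\Big) = \frac{\partial R}{\partial \beta_k} + 2\lambda\beta_k \]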
Measuring generalisation error:
Use the validation data to:
These are all aspects of NN complexity.
Example via the caret package.
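A minimal sketch of what such a fit might look like with caret (the data frame `img_df`, the tuning grid and the seed are placeholders, not the lecture's actual example):

```r
library(caret)

set.seed(1)
# 'img_df' is a placeholder: pixel columns plus a factor outcome 'class'
fit <- train(class ~ ., data = img_df,
             method     = "nnet",                    # single-hidden-layer NN via the nnet package
             preProcess = c("center", "scale"),
             trControl  = trainControl(method = "cv", number = 5),
             tuneGrid   = expand.grid(size  = c(2, 5, 10),      # hidden units
                                      decay = c(0, 0.01, 0.1)), # weight decay (lambda)
             trace = FALSE, maxit = 200)

fit$bestTune   # the size/decay combination chosen by cross-validation
```

Note that `size` and `decay` correspond directly to the complexity controls above: the number of hidden units and the weight-decay penalty \( \lambda \).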
We've yet to cover: