ID5059 Lecture 15 - Neural Networks 3

C. Donovan
11 April 2018

Administrivia

Project 2:
- Using other people's code and ideas

NB: If it's not in the lecture or lab, it's not in the exam

Today

Example NN
BP calculations
Preventing over-fitting
NN pros/cons to date

Example NN on images

We'll fit a basic NN to some image data for classification and see how we did

Example NN on image - the data

Hand-written numbers from the MNIST dataset
60,000 training images of numbers, \( 28 \times 28 \) resolution
Also has a test set of 10,000
NNs can be quite good with images. This is also a multi-class response (10 categories), which is a good match for a NN

Example NN on image - the data

[R: You'll do similar in the lab this week]

Fitting NNs - a gradient search example (BP)

Simple in principle:

Given weights - NN gives a \( \hat{y} \)
\( \hat{y} \) compared to \( y \) gives an error measure (RSS say)
Changing the weights can make this bigger or smaller
Want to change weights to make this smaller
Error is a function of weights - so numerically optimise to reduce

It's a search over multiple dimensions (dictated by number of parameters/weights).
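
As a minimal sketch of "error is a function of weights", the snippet below computes the RSS for a one-input linear model at several candidate values of one weight. The toy data, the function name `rss` and the linear model itself are assumptions made for illustration; the same idea applies to the weights of a NN.

```python
def rss(weights, xs, ys):
    """Residual sum of squares for the toy model y_hat = w0 + w1 * x."""
    w0, w1 = weights
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data generated from y = 1 + 2x with no noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Hold w0 fixed and vary w1: the error changes as the weight changes.
for w1 in (1.0, 1.5, 2.0, 2.5):
    print(w1, rss((1.0, w1), xs, ys))
```

The error falls as `w1` approaches 2 and rises again beyond it, which is the behaviour a numerical optimiser exploits.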

Error Surface

Nasty ones (like NNs)

Maybe lots of local minima - starting locations are influential
The surface is less predictable and we have to search intensively/come up with tricks
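
To illustrate why starting locations matter, here is a small sketch using a made-up one-dimensional error function with two minima (it is not a NN error surface, just a stand-in). The same gradient descent routine is run from two different starting points.

```python
def error(w):
    """Hypothetical non-convex error function with two local minima."""
    return w ** 4 - 3 * w ** 2 + w

def gradient(w):
    """Derivative of error(w) with respect to w."""
    return 4 * w ** 3 - 6 * w + 1

def descend(w, step=0.01, n_iter=1000):
    """Plain gradient descent: repeatedly step against the gradient."""
    for _ in range(n_iter):
        w -= step * gradient(w)
    return w

left = descend(-2.0)   # start to the left of both minima
right = descend(2.0)   # start to the right of both minima
print(left, error(left))
print(right, error(right))
```

The two runs settle in different minima with different error values, so the answer depends on where the search began. One common response is to fit from several random starts and keep the best result.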


Back propagation - more detailed

Note:

The fine-scale details are not examinable
I discovered this is torturous, so will put a detailed set of calculations on Moodle, rather than in the lecture
Nonetheless, some elements follow

Back propagation - high level view

Simple in principle:

Set some initial weights (can't estimate error without a parameterised model) - software deals with this - probably random uniform.
Calculate an initial error (based on observed versus current predicted).
For each weight determine if increasing or decreasing the weight increases/decreases the error.
Move a bit in the correct direction. Recalculate error with new parameters. Repeat.
Stop at some point i.e. further weight alterations make no/little improvement.

This is a gradient search, iterating over multiple dimensions (dictated by number of parameters/weights).
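
The steps above can be sketched as a short loop. This is an illustration under stated assumptions, not how NN software actually does it: the model is a toy linear one, the data are made up, and the gradient is approximated by nudging each weight up and down (finite differences) rather than by back propagation proper. All function names are hypothetical.

```python
import random

def rss(weights, xs, ys):
    """Error measure: RSS for the toy model y_hat = w0 + w1 * x."""
    w0, w1 = weights
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

def numerical_gradient(error_fn, weights, eps=1e-6):
    """For each weight, estimate how the error changes as that weight changes."""
    grad = []
    for k in range(len(weights)):
        up = list(weights)
        up[k] += eps
        down = list(weights)
        down[k] -= eps
        grad.append((error_fn(up) - error_fn(down)) / (2 * eps))
    return grad

def gradient_search(error_fn, weights, step=0.01, tol=1e-10, max_iter=10000):
    """Move a bit downhill, recalculate the error, repeat until little improvement."""
    error = error_fn(weights)
    for _ in range(max_iter):
        grad = numerical_gradient(error_fn, weights)
        new_weights = [w - step * g for w, g in zip(weights, grad)]
        new_error = error_fn(new_weights)
        if error - new_error < tol:  # stopping rule: negligible improvement
            break
        weights, error = new_weights, new_error
    return weights, error

# Hypothetical toy data generated from y = 1 + 2x with no noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

random.seed(1)
initial = [random.uniform(-1, 1) for _ in range(2)]  # random uniform initial weights
fitted, final_error = gradient_search(lambda w: rss(w, xs, ys), initial)
print(fitted, final_error)
```

The step size and tolerance here are arbitrary choices for the toy problem; too large a step can make the search diverge, too small a step makes it slow.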

Back propagation - mid-level view

Refer to H, T & F sections 11.3 & 11.4. A simplified version follows.

Create little local problems to solve at each non-input node. Iteration \( r+1 \): \[ \beta^{r+1}=\beta^r-\gamma \frac{\partial R}{\partial \beta^r} \]
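So, if \( R \) increases with increasing \( \beta^r \), the derivative is positive and we decrease \( \beta^r \) to obtain \( \beta^{r+1} \); if \( R \) decreases, we increase it. The size of the step is scaled by \( \gamma \).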
Keep doing this until \( R \) gets small.
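
As a small worked illustration of the update rule (the error function and numbers are made up for the example, not taken from the lecture), take a single weight with \( R(\beta) = (\beta - 3)^2 \), so that \( \frac{\partial R}{\partial \beta} = 2(\beta - 3) \), and use \( \gamma = 0.1 \) with starting value \( \beta^0 = 0 \):

\[ \begin{align*} \beta^{1} &= 0 - 0.1 \times 2(0 - 3) = 0.6\\ \beta^{2} &= 0.6 - 0.1 \times 2(0.6 - 3) = 1.08\\ \beta^{3} &= 1.08 - 0.1 \times 2(1.08 - 3) = 1.464\\ \end{align*} \]

Each step moves \( \beta \) towards the minimiser at 3, and \( R \) falls from 9 to 5.76 to 3.6864 to about 2.36. The steps shrink as the gradient flattens near the minimum.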

Back propagation in more detail

Consider the following simple NN \[ y = \beta_0 + \beta_1z_1 + \beta_2z_2 \] where

\[ \begin{align*} z_1 &= \frac{1}{1+e^{-(\alpha_0 + \alpha_1x_1 + \alpha_2x_2)}}\\ z_2 &= \frac{1}{1+e^{-(\alpha_3 + \alpha_4x_1 + \alpha_5x_2)}}\\ \end{align*} \]

We're seeking to optimise the weights (the \( \alpha \) and \( \beta \)).
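
A minimal sketch of the forward pass for this network, assuming both hidden units use the standard logistic function \( 1/(1+e^{-u}) \). The function names and the example weights are hypothetical; `alpha` holds \( \alpha_0,\dots,\alpha_5 \) and `beta` holds \( \beta_0, \beta_1, \beta_2 \).

```python
import math

def sigmoid(u):
    """Logistic activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(x1, x2, alpha, beta):
    """Return (y_hat, z1, z2) for the two-hidden-unit network."""
    z1 = sigmoid(alpha[0] + alpha[1] * x1 + alpha[2] * x2)
    z2 = sigmoid(alpha[3] + alpha[4] * x1 + alpha[5] * x2)
    y_hat = beta[0] + beta[1] * z1 + beta[2] * z2  # identity output activation
    return y_hat, z1, z2

# Illustrative weights only: with every alpha at zero, both hidden units output 0.5.
alpha = [0.0] * 6
beta = [1.0, 2.0, 4.0]
y_hat, z1, z2 = forward(0.3, -1.2, alpha, beta)
print(y_hat, z1, z2)
```

Given fixed weights, the prediction is just a deterministic function of the inputs; fitting is the business of choosing `alpha` and `beta`.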

Back propagation in more detail

As discussed, this is a (non-linear) optimisation problem - we want to change the weights to better predict \( y \).
Define a loss function - let's say simple squared error (i.e. want to minimise the RSS). Use \( R \) for _R_esubstitution error \[ R_i = (y_i-\hat{y}_i)^2 \]
Set initial weights - just random numbers will do.
Now want to know whether increasing or decreasing a particular weight is good or bad WRT \( R \).

Back propagation in more detail

Use the initial weights and inputs to make predictions \( \hat{y} \).
\( \hat{y} \) (and \( R \)) is a function of many things, but we want to alter weights. So we want to determine \[ \frac{\partial R_i}{\partial \beta_k}\quad {\rm and}\quad \frac{\partial R_i}{\partial \alpha_s} \] for \( k=0,1,2 \) and \( s=0,\dots,5 \)
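
Before deriving these partial derivatives analytically, they can be approximated numerically by nudging one weight at a time and watching \( R_i \). The sketch below does this for a single observation, assuming logistic hidden units and an identity output; the function names, weights and data point are all made up for illustration. This is a finite-difference check, not back propagation itself.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def squared_error(alpha, beta, x1, x2, y):
    """R_i = (y - y_hat)^2 for one observation of the two-hidden-unit network."""
    z1 = sigmoid(alpha[0] + alpha[1] * x1 + alpha[2] * x2)
    z2 = sigmoid(alpha[3] + alpha[4] * x1 + alpha[5] * x2)
    y_hat = beta[0] + beta[1] * z1 + beta[2] * z2
    return (y - y_hat) ** 2

def partials(alpha, beta, x1, x2, y, eps=1e-6):
    """Central-difference estimates of dR/d(alpha_s) and dR/d(beta_k)."""
    d_alpha = []
    for s in range(len(alpha)):
        up = list(alpha)
        up[s] += eps
        down = list(alpha)
        down[s] -= eps
        diff = squared_error(up, beta, x1, x2, y) - squared_error(down, beta, x1, x2, y)
        d_alpha.append(diff / (2 * eps))
    d_beta = []
    for k in range(len(beta)):
        up = list(beta)
        up[k] += eps
        down = list(beta)
        down[k] -= eps
        diff = squared_error(alpha, up, x1, x2, y) - squared_error(alpha, down, x1, x2, y)
        d_beta.append(diff / (2 * eps))
    return d_alpha, d_beta

# Hypothetical weights and a single made-up observation.
alpha = [0.0] * 6
beta = [0.0, 1.0, 1.0]
d_alpha, d_beta = partials(alpha, beta, x1=1.0, x2=1.0, y=2.0)
print(d_alpha)
print(d_beta)
```

A negative partial derivative means increasing that weight reduces the error, so the update should increase it. Numerical estimates like these need two evaluations of the network per weight, which is one reason the analytic chain-rule route is preferred in practice.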

Back propagation in more detail

Start with the weights nearest \( y \), say \( \beta_1 \), noting \( \hat{y}=\beta_0+\beta_1 z_1 + \beta_2 z_2 \).
\( R \) is a function of \( y \) (fixed) and \( \hat{y} \): \( (y - \hat{y})^2 \), say \( h(y, \hat{y}) \)
\( f \)=identity (a placeholder for a potential activation function), \( g \)=linear combination function, so we have \( \hat{y} = f(g(z, \beta))=\beta_0+\beta_1 z_1 + \beta_2 z_2 \)
Meaning \( R=h(f(g(z, \beta))) \) - to get \( \frac{\partial R}{\partial \beta} \) we can apply the chain rule
Let's drop \( i \) for the time being. Also we're using an identity activation function for \( f \), which makes things easier: