ID5059 Lecture 15 - Neural Networks 3

C. Donovan
11 April 2018

Administrivia

  • Project 2:
    • Using other people's code and ideas

NB: If it's not in the lecture or lab, it's not in the exam

Today

  • Example NN
  • BP calculations
  • Preventing over-fitting
  • NN pros/cons to date

Example NN on images

We'll fit a basic NN to some image data for classification and see how we did

Example NN on image - the data

  • Hand-written numbers from the MNIST dataset
  • 60,000 training images of numbers, \( 28 \times 28 \) pixel resolution
  • Also has a test set of 10,000
  • NNs can be quite good with images. This is also a multi-class response (10 categories), which is a good match to a NN

Example NN on image - the data

[R: You'll do similar in the lab this week]

Fitting NNs - a gradient search example (BP)

Simple in principle:

  • Given weights - NN gives a \( \hat{y} \)
  • \( \hat{y} \) compared to \( y \) gives an error measure (RSS say)
  • Changing the weights can make this bigger or smaller
  • Want to change weights to make this smaller
  • Error is a function of weights - so numerically optimise to reduce

It's a search over multiple dimensions (dictated by number of parameters/weights).

Error Surface

Nasty ones (like NNs)

  • Maybe lots of local minima - starting locations are influential
  • The surface is less predictable and we have to search intensively/come up with tricks
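To illustrate why starting locations matter, here is a minimal sketch using a made-up one-weight error function with two minima. The function `error`, the step size and the starting points are all assumptions chosen for illustration; a real NN error surface has many more dimensions.

```python
def error(w):
    # Hypothetical one-weight error surface with two minima (illustration only)
    return w**4 - 3 * w**2 + w


def gradient(w):
    # Derivative of the hypothetical error function above
    return 4 * w**3 - 6 * w + 1


def descend(w, step=0.01, iterations=1000):
    # Plain gradient descent: repeatedly move a little downhill
    for _ in range(iterations):
        w = w - step * gradient(w)
    return w


left = descend(-2.0)   # start on the left of the surface
right = descend(2.0)   # start on the right of the surface
print("start -2.0 ends at w =", left, "with error", error(left))
print("start  2.0 ends at w =", right, "with error", error(right))
```

The two runs settle in different minima, and only one of them is the lower of the two, so the answer depends on where the search began.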


Back propagation - more detailed

Note:

  • The fine-scale details are not examinable
  • I discovered this is torturous, so will put a detailed set of calculations on Moodle, rather than in the lecture
  • Nonetheless, some elements follow

Back propagation - high level view

Simple in principle:

  • Set some initial weights (can't estimate error without a parameterised model) - software deals with this - probably random uniform.
  • Calculate an initial error (based on observed versus current predicted).
  • For each weight determine if increasing or decreasing the weight increases/decreases the error.
  • Move a bit in the correct direction. Recalculate error with new parameters. Repeat.
  • Stop at some point, i.e. when further weight alterations make little or no improvement.

This is a gradient search, iterating over multiple dimensions (dictated by number of parameters/weights).
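The steps above can be sketched roughly as follows. This is an illustration only, not real back propagation: it uses a made-up two-weight linear model and toy data, and it finds the downhill direction by nudging each weight in turn rather than by calculus. All names (`rss`, `gradient_search`, `probe`, `tol`) are hypothetical.

```python
import random

# Toy data, made up for illustration: y is roughly 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]


def rss(weights):
    # Error measure: residual sum of squares for y-hat = b0 + b1 * x
    b0, b1 = weights
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def gradient_search(weights, step=0.01, probe=1e-6, tol=1e-10, max_iter=20000):
    error = rss(weights)  # initial error
    for _ in range(max_iter):
        # For each weight, see how the error changes if it is nudged upwards
        slopes = []
        for k in range(len(weights)):
            nudged = list(weights)
            nudged[k] += probe
            slopes.append((rss(nudged) - error) / probe)
        # Move a bit in the direction that reduces the error
        weights = [w - step * s for w, s in zip(weights, slopes)]
        new_error = rss(weights)
        converged = abs(error - new_error) < tol
        error = new_error
        if converged:  # stop: further changes make little difference
            break
    return weights, error


random.seed(1)
start = [random.uniform(-1, 1), random.uniform(-1, 1)]  # random initial weights
fitted, final_error = gradient_search(start)
print("start:", start, "error:", rss(start))
print("fitted:", fitted, "error:", final_error)
```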

Back propagation - mid-level view

Refer to H, T & F sections 11.3 & 11.4. A simplified version follows.

  • Create little local problems to solve at each non-input node. At iteration \( r+1 \): \[ \beta^{r+1}=\beta^r-\gamma \frac{\partial R}{\partial \beta^r} \] where \( \gamma \) is the step size (learning rate).
  • So, if \( R \) increases with increasing \( \beta^r \), decrease it to create \( \beta^{r+1} \), by a step of \( \gamma \) times the gradient.
  • Keep doing this until \( R \) gets small (or stops improving).
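As a small worked instance of the update rule, the sketch below applies it to a single weight. The error function `R` (with its minimum at 3), the starting value and the value of `gamma` are assumptions made up for the example.

```python
def R(beta):
    # Hypothetical error function with its minimum at beta = 3
    return (beta - 3.0) ** 2


def dR_dbeta(beta):
    # Derivative of the hypothetical error function
    return 2.0 * (beta - 3.0)


gamma = 0.1     # step size (assumed value)
beta = 0.0      # initial weight (assumed value)
history = [beta]
for r in range(50):
    # beta^{r+1} = beta^r - gamma * dR/dbeta^r
    beta = beta - gamma * dR_dbeta(beta)
    history.append(beta)

print("first few iterates:", history[:4])
print("after 50 iterations:", beta, "with R =", R(beta))
```

Here \( R \) decreases with increasing \( \beta \) at the start (negative gradient), so each update increases \( \beta \), and the steps shrink as the gradient flattens near the minimum.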

Back propagation in more detail

Consider the following simple NN \[ y = \beta_0 + \beta_1z_1 + \beta_2z_2 \] where

\[ \begin{align*} z_1 &= \frac{1}{1+e^{-(\alpha_0 + \alpha_1x_1 + \alpha_2x_2)}}\\ z_2 &= \frac{1}{1+e^{-(\alpha_3 + \alpha_4x_1 + \alpha_5x_2)}}\\ \end{align*} \]

We're seeking to optimise the weights (the \( \alpha \) and \( \beta \)).
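For concreteness, a minimal sketch of this network's forward pass is given here, assuming both hidden nodes use the same logistic (sigmoid) activation. The function names and the example weights are made up for illustration.

```python
import math


def sigmoid(v):
    # Logistic activation: 1 / (1 + e^{-v})
    return 1.0 / (1.0 + math.exp(-v))


def predict(x1, x2, alpha, beta):
    # alpha = [alpha_0, ..., alpha_5], beta = [beta_0, beta_1, beta_2]
    z1 = sigmoid(alpha[0] + alpha[1] * x1 + alpha[2] * x2)
    z2 = sigmoid(alpha[3] + alpha[4] * x1 + alpha[5] * x2)
    y_hat = beta[0] + beta[1] * z1 + beta[2] * z2
    return y_hat, z1, z2


# Example weights (assumed): with all alpha at zero, both hidden nodes give 0.5
alpha = [0.0] * 6
beta = [1.0, 2.0, 3.0]
y_hat, z1, z2 = predict(0.7, -1.2, alpha, beta)
print(z1, z2, y_hat)
```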

Back propagation in more detail

  • As discussed, this is a (non-linear) optimisation problem - we want to change the weights to better predict \( y \).
  • Define a loss function - let's say simple squared error (i.e. want to minimise the RSS). Use \( R \) (for Resubstitution error): \[ R_i = (y_i-\hat{y}_i)^2 \]
  • Set initial weights - just random numbers will do.
  • Now want to know whether increasing or decreasing a particular weight is good or bad WRT \( R \).

Back propagation in more detail

  • Use the initial weights and inputs to make predictions \( \hat{y} \).
  • \( \hat{y} \) (and \( R \)) is a function of many things, but we want to alter weights. So we want to determine \[ \frac{\partial R_i}{\partial \beta_k}\quad {\rm and}\quad \frac{\partial R_i}{\partial \alpha_s} \] for \( k=0,1,2 \) and \( s=0,\ldots,5 \)
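Before doing any calculus, these partial derivatives can be approximated numerically by nudging one weight at a time. The sketch below does this for the simple NN, assuming logistic hidden nodes; the observation, the weight values and all function names are made up for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def y_hat(x, w):
    # w = [alpha_0, ..., alpha_5, beta_0, beta_1, beta_2]
    z1 = sigmoid(w[0] + w[1] * x[0] + w[2] * x[1])
    z2 = sigmoid(w[3] + w[4] * x[0] + w[5] * x[1])
    return w[6] + w[7] * z1 + w[8] * z2


def R(x, y, w):
    # Squared error for a single observation
    return (y - y_hat(x, w)) ** 2


def numerical_partials(x, y, w, h=1e-6):
    # Central differences: nudge each weight up and down, compare the errors
    grads = []
    for j in range(len(w)):
        up = list(w)
        up[j] += h
        down = list(w)
        down[j] -= h
        grads.append((R(x, y, up) - R(x, y, down)) / (2 * h))
    return grads


x_i, y_i = (0.5, -1.0), 2.0                              # one made-up observation
w = [0.1, -0.2, 0.3, 0.0, 0.4, -0.1, 0.5, 1.0, -1.0]     # made-up weights
partials = numerical_partials(x_i, y_i, w)
for name, g in zip(["a0", "a1", "a2", "a3", "a4", "a5", "b0", "b1", "b2"], partials):
    print(name, g)
```

A positive value means increasing that weight increases \( R_i \); a negative value means increasing it reduces \( R_i \).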

Back propagation in more detail

  • Start with the weights nearest \( y \), say \( \beta_1 \), noting \( \hat{y}=\beta_0+\beta_1 z_1 + \beta_2 z_2 \).
  • \( R \) is a function of \( y \) (fixed) and \( \hat{y} \): \( (y - \hat{y})^2 \), say \( h(y, \hat{y}) \)
  • \( f \)=identity (a placeholder for a potential activation function), \( g \)=linear combination function, so we have \( \hat{y} = f(g(z, \beta))=\beta_0+\beta_1 z_1 + \beta_2 z_2 \)
  • Meaning \( R=h(f(g(z, \beta))) \) - to get \( \frac{\partial R}{\partial \beta} \) we can apply the chain rule
  • Let's drop \( i \) for the time being. Also we're using an identity activation function for \( f \), which makes things easier: