ID5059 Lecture 14 - Neural Networks 2

C. Donovan
09 April 2018

Administrivia

  • Marking for project 1 - I'll aim for mid this week
  • Project 2:
    • Missing values
    • Imbalanced response classes
    • Factor covariates with many levels
    • Proto-typing on small data

NB: If it's not in the lecture or lab, it's not in the exam

(Artificial) Neural Network components

We need to have a common language to discuss these

  • So we have to be familiar with the general jargon surrounding these

A simple NN as a Mathematical Formula

\[ \hat{y} = \hat{\beta}_0 + \hat{\alpha}_1z_{2,1} + \hat{\alpha}_2z_{2,2} + \hat{\alpha}_3z_{2,3} \]

where

\[ \begin{align*} z_{2,1} &= \tanh( \hat{\alpha}_4 + \hat{\alpha}_5z_{1,1} + \hat{\alpha}_6z_{1,2})\\ z_{2,2} &= \tanh( \hat{\alpha}_7 + \hat{\alpha}_8z_{1,1} + \hat{\alpha}_9z_{1,2})\\ z_{2,3} &= \tanh( \hat{\alpha}_{10} + \hat{\alpha}_{11}z_{1,1} + \hat{\alpha}_{12}z_{1,2}) \end{align*} \]

and

\[ \begin{align*} z_{1,1} &= \tanh( \hat{\alpha}_{13} + \hat{\alpha}_{14}x_1 + \hat{\alpha}_{15}x_2)\\ z_{1,2} &= \tanh( \hat{\alpha}_{16} + \hat{\alpha}_{17}x_1 + \hat{\alpha}_{18}x_2)\\ \end{align*} \]
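As a concrete illustration, here is a minimal R sketch of the same formula written as a forward pass, using made-up weight values in place of the fitted \( \hat{\alpha} \)s and \( \hat{\beta}_0 \) (the numbers are purely hypothetical):

  # Forward pass for the network above, with hypothetical weights
  forward_pass <- function(x1, x2, a, b0) {
    # First hidden layer (two nodes): linear combination + tanh activation
    z11 <- tanh(a[13] + a[14] * x1 + a[15] * x2)
    z12 <- tanh(a[16] + a[17] * x1 + a[18] * x2)
    # Second hidden layer (three nodes)
    z21 <- tanh(a[4] + a[5] * z11 + a[6] * z12)
    z22 <- tanh(a[7] + a[8] * z11 + a[9] * z12)
    z23 <- tanh(a[10] + a[11] * z11 + a[12] * z12)
    # Output layer: identity output function
    b0 + a[1] * z21 + a[2] * z22 + a[3] * z23
  }

  set.seed(1)
  a <- runif(18, -1, 1)                      # stand-ins for alpha_1, ..., alpha_18
  forward_pass(x1 = 0.5, x2 = -1, a = a, b0 = 0.1)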

Conversion to a diagrammatic form

  • We have two inputs \( x_1 \) and \( x_2 \)
  • There are two hidden layers, with two and three nodes respectively
  • The activation functions are \( \tanh \)
  • The output function is the identity
  • The combination functions are linear/additive, with a bias within each node (one way to specify this architecture in R is sketched below)
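As a hedged sketch, one way to request exactly this architecture in R is via the neuralnet package; the data frame dat with columns y, x1 and x2 is hypothetical:

  # Assumed: the neuralnet package and a hypothetical data frame dat(y, x1, x2)
  library(neuralnet)
  fit <- neuralnet(y ~ x1 + x2,
                   data          = dat,
                   hidden        = c(2, 3),   # two hidden layers: 2 then 3 nodes
                   act.fct       = "tanh",    # tanh activations in the hidden layers
                   linear.output = TRUE)      # identity output function
  plot(fit)                                   # draws the network as a diagram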

Examples

[Sequence of example slides; figures not included in this text version]

Main components

  • Layers: input, hidden (multiple), output
  • Connections and their weights and biases
  • Combination functions: linear
  • Activation functions: Identity, tanh, exp, logistic… there are many more!
  • Output functions: Back to response scale - Identity, (multiple) Logistic.

NN components

  • Weights and biases: from a statistical perspective, these weights are simply parameters of a potentially non-linear function, and the biases are the intercept terms for the linear components.
  • Combination Functions: in our example equations above, these are the linear combinations (which can be written compactly in matrix form); they combine the input variables or the hidden-node values.
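For example, a single layer's combination functions can be written as one matrix product plus a bias vector; a small sketch with made-up numbers:

  # Combination functions for one layer in matrix form (hypothetical numbers)
  x <- c(0.5, -1)                          # two inputs x1, x2
  W <- matrix(c(0.2, -0.7,                 # weights: one row per hidden node
                1.1,  0.4), nrow = 2, byrow = TRUE)
  b <- c(0.1, -0.3)                        # biases (intercepts), one per node
  combo <- W %*% x + b                     # linear combinations for both nodes at once
  z <- tanh(combo)                         # activation applied elementwise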

NN components

  • Activation functions: these are the functions wrapping the combination functions. Common variants:
    • Identity Function: does not alter the value of the argument; the resulting values can lie anywhere in \( \mathbb{R} \).
    • Sigmoid Functions: \( S \)-shaped functions with the logistic or hyperbolic tangent functions being common. The resulting values will be bounded - \( (0,1) \) or \( (-1, 1) \) respectively.
  • Others: Gaussian functions (bell-shaped); exponential and reciprocal functions; rectified linear units (ReLU); step functions… there are many (a few are sketched in R below)
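A few of these are easy to sketch in R to see their shapes and output ranges (the logistic and ReLU are written out by hand here; tanh is built in):

  # Some common activation functions and their output ranges
  identity_act <- function(v) v                  # range: all of R
  logistic     <- function(v) 1 / (1 + exp(-v))  # range: (0, 1)
  relu         <- function(v) pmax(0, v)         # range: [0, Inf); tanh has range (-1, 1)

  v <- seq(-4, 4, length.out = 200)
  plot(v, tanh(v), type = "l", ylim = c(-1, 2), ylab = "activation")
  lines(v, logistic(v), lty = 2)
  lines(v, relu(v), lty = 3)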

NN components

  • Network Layers: as the hidden layers are contrivances under control of the analyst, the number of layers and units within these can be large.
    • The layering is partly for convenience, where all the nodes/units share similar characteristics such as their activation and combination functions.
    • When starting out, all the nodes in a layer are usually connected to all the nodes in the next layer (skip-layer connections are also possible).
  • Architecture: the design you settle on prior to fitting your NN e.g. number of hidden nodes/layers, connections, activation functions etc.

Digging around

Google has a great set of tools called TensorFlow, which you can also call from Python or R.

  • Go to this website to have a play around; it is really, really good:

www.playground.tensorflow.org

Digging around

Let's look again at some building blocks in R.

Fitting a Neural Net

Assume we have an architecture. We have data and lots of parameters - what value should the parameters take?

Fitting a Neural Net

Start with arbitrary weights and biases. Define an error function. Search for update values that reduce the error. Iterate until convergence (hopefully).

This is numerical optimisation

  • Non-linear problem with large numbers of parameters.
  • You will not find a general analytic solution for solving the weights.
  • All methods implemented are iterative numerical approaches - trial-and-error searches.
  • What we want to do is conceptually simple, once we define 'best'.

We have seen such things before (although perhaps not known it).

Numerical optimisation

Broadly, fitting is similar to that for other models we have seen:

  • We have some idea of “good”, which leads to an objective function (or loss function): basically the way we compare \( y \) and \( \hat{y} \). For OLS, for example, we have \[ \sum_i(y_i-\hat{y}_i)^2 \] which we minimise (objective functions are maximised or minimised as required); a small R sketch follows below
  • Changing the model parameters changes \( \hat{y} \), which in turn changes our evaluation of the objective function.
  • We want parameters that minimise/maximise the objective function

We need sensible ways to do this search through the parameter space
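A small R sketch of this for OLS, using simulated (hypothetical) data: the SSE is just a function of the parameters, and different parameter values give different objective values.

  # The SSE objective for a simple linear model, on simulated data
  set.seed(42)
  x <- runif(50)
  y <- 2 + 3 * x + rnorm(50, sd = 0.5)     # true beta0 = 2, beta1 = 3

  sse <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)

  sse(c(0, 0))                             # poor parameter values: large SSE
  sse(c(2, 3))                             # near the truth: much smaller SSE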

Error Surface

Many models are well-behaved

  • No local optima
  • Maybe we can arrange so the search is very fast
  • e.g. GLMs look a bit like this - we can pretty much jump directly to the solution

[Figure: a smooth, well-behaved error surface with a single optimum]

Error Surface

Nasty ones (like NNs)

  • Maybe lots of local minima - starting locations are influential
  • The surface is less predictable and we have to search intensively/come up with tricks

[Figure: an irregular error surface with many local optima]

Numerical optimisation - simple example OLS

Consider a simple set of numeric \( x \) and \( y \) data. We want \( y=f(x)=\beta_0 + \beta_1 x \), but how do we determine the \( \beta \)?

  • We know our idea of good (make \( \hat{y} \) close to \( y \))
  • Ignore the presence of an analytical solution: without looking at \( x \) vs \( y \), how do we change the \( \beta \)s to improve things?

[drawing ensues]

Numerical optimisation - generally

We can imagine ways to do this without much effort:

  • A crude grid-search, i.e. try lots of combinations of parameters and choose the best (not very efficient; a sketch follows after this list)
  • Usually we use a gradient search of some sort: We can maybe tell which direction is a good way to move
    • For some problems this is easy e.g. lots of stats models like GLMs are well behaved - the direction is clear, and we get to the max/min very quickly
    • Others can be complicated e.g. local minima can be found, but they are local hills/valleys - the global best is uncertain
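A crude grid search for the simple OLS example looks like this (same simulated data and sse() as the earlier sketch); it works here, but the number of grid points explodes as the number of parameters grows.

  # Crude grid search over (beta0, beta1), keeping the smallest SSE
  set.seed(42)
  x <- runif(50); y <- 2 + 3 * x + rnorm(50, sd = 0.5)
  sse <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)

  grid <- expand.grid(beta0 = seq(-5, 5, by = 0.25),
                      beta1 = seq(-5, 5, by = 0.25))
  grid$sse <- apply(grid[, c("beta0", "beta1")], 1,
                    function(b) sse(c(b["beta0"], b["beta1"])))
  grid[which.min(grid$sse), ]              # best combination found on the grid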

Numerical optimisation - NNs

There are a lot of optimisation approaches for NNs

  • Defining what is “good” is reasonably easy, i.e. the loss/objective functions (e.g. quadratic/mean-squared error/SSE for numeric responses, or cross-entropy for classification… several others; see the sketch after this list)
  • They are non-linear with lots of parameters and local minima/maxima i.e. horrible
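As a sketch, the two loss functions mentioned above are easy to write down in R (y, yhat and p are placeholders for observed values, numeric predictions and predicted class-1 probabilities respectively):

  # Sum-of-squares loss for a numeric response
  sse_loss <- function(y, yhat) sum((y - yhat)^2)

  # Cross-entropy loss for a two-class (0/1) response, p = predicted P(y = 1)
  cross_entropy <- function(y, p) -sum(y * log(p) + (1 - y) * log(1 - p))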

We look at one (the classic) - the back-propagation (BP) algorithm

Fitting NNs - a gradient search example (BP)

Simple in principle:

  • Given weights, the NN gives a \( \hat{y} \)
  • \( \hat{y} \) compared to \( y \) gives an error measure (RSS say)
  • Changing the weights can make this bigger or smaller
  • Want to change weights to make this smaller
  • Error is a function of weights - so numerically optimise to reduce

It's a search over multiple dimensions (dictated by number of parameters/weights).

Back propagation - high level view

Simple in principle:

  • Set some initial weights (can't estimate error without a parameterised model) - software deals with this - probably random uniform.
  • Calculate an initial error (based on observed versus current predicted).
  • For each weight determine if increasing or decreasing the weight increases/decreases the error.
  • Move a bit in the correct direction. Recalculate error with new parameters. Repeat.
  • Stop at some point i.e. further weight alterations make no/little improvement.

This is a gradient search, iterating over multiple dimensions (dictated by number of parameters/weights).

Back propagation - mid-level view

Refer to Hastie, Tibshirani & Friedman (H, T & F), sections 11.3 & 11.4. A simplified version follows.

  • Take \( R_\mathbf{\theta} \) as the residual error given parameters \( \theta_1, \theta_2... \) etc. (e.g. RSS).
  • Create little local problems to solve at each non-input node.
  • A little bit of calculus gets you there: application of the chain rule allows determination of error changes at each non-input node.
  • We want to know how \( R_\mathbf{\theta} \) (\( R \) for simplicity) changes for changes in each parameter, i.e. \( \frac{\partial R}{\partial \theta_i} \) for the \( i \)-th parameter.

Back propagation - mid-level view

Refer H, T & F sections 11.3 & 11.4. Simplified version follows.

  • Create little local problems to solve at each non-input node. At iteration \( r+1 \): \[ \beta^{r+1}=\beta^r-\gamma \frac{\partial R}{\partial \beta^r} \] (a sketch applying this update appears after this list)
  • So, if \( R \) increases with increasing \( \beta^r \), we decrease it to create \( \beta^{r+1} \), with the size of the move controlled by the step \( \gamma \).
  • Keep doing this until \( R \) gets small.
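To make the update rule concrete, here is a sketch applying it to the simple OLS example rather than a full NN (same simulated data as before); for OLS the gradient is available directly, whereas back-propagation organises the same chain-rule calculation through the layers of a network.

  # Gradient descent: beta^{r+1} = beta^r - gamma * dR/dbeta^r, for OLS
  set.seed(42)
  x <- runif(50); y <- 2 + 3 * x + rnorm(50, sd = 0.5)

  gamma <- 0.001                           # step size ('learning rate')
  beta  <- c(0, 0)                         # arbitrary starting values

  for (r in 1:2000) {
    resid <- y - (beta[1] + beta[2] * x)   # current residuals
    grad  <- c(-2 * sum(resid),            # dR/dbeta0
               -2 * sum(resid * x))        # dR/dbeta1
    beta  <- beta - gamma * grad           # move a little against the gradient
  }
  beta                                     # should end up close to c(2, 3)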

Back propagation - mid-level view

Refer H, T & F sections 11.3 & 11.4. Simplified version follows.

Note:

  • the NN starts simple (boring set of parameters), gets more complicated as we iterate.
  • the step size (\( \gamma \)) controls how rapidly we change the parameters (the 'learning rate'); a small illustration follows this list.
  • so complexity can be controlled by stopping the optimisation process.
  • one pass through all the data, changing weights, is called an epoch.
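A tiny illustration of the step-size point, reusing the OLS gradient-descent sketch (hypothetical data and a deliberately poor choice of \( \gamma \)): a small \( \gamma \) steadily reduces the error, while one that is too large overshoots and the error blows up.

  # Effect of the step size gamma on the OLS gradient descent
  set.seed(42)
  x <- runif(50); y <- 2 + 3 * x + rnorm(50, sd = 0.5)
  sse  <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)
  grad <- function(beta) {
    resid <- y - (beta[1] + beta[2] * x)
    c(-2 * sum(resid), -2 * sum(resid * x))
  }

  run <- function(gamma, n_steps = 50) {
    beta <- c(0, 0)
    for (r in 1:n_steps) beta <- beta - gamma * grad(beta)
    sse(beta)
  }

  run(gamma = 0.001)                       # modest steps: SSE shrinks towards its minimum
  run(gamma = 0.05)                        # steps too large: updates overshoot, SSE explodes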

A visual pass through

  • The following gives a flavour of what is involved.
  • The notation is not consistent with how we've been working - we'll do a worked example in the next lecture.
  • You should note:

    • How calculation intensive this potentially is.
    • How the objective is to alter weights by some amount to reduce prediction error.

Back propagation

[Sequence of slides stepping visually through the back-propagation calculations; figures not included in this text version. Source: Bernacki & Wlodarczyk]

Back propagation

You do not have to do this by hand for assessment in this course

  • We iterate until some criterion is met
  • Obviously your computer will deal with this, but it can still take a long time
  • You might have 1000s of inputs, many hidden nodes and thousands of observations e.g. fitting to images

Back propagation

Coming up:

  • Some example calculations for the BP algorithm
  • How do we control fitting generally and over-fitting in particular
  • Specific common loss functions, choices of activation functions, heuristics
  • Examples of NNs fitted to some simple problems
  • More complicated NNs: Convolutional Neural Networks (CNNs) and what deep learning is