ID5059 Lecture 13 - Neural Networks

C. Donovan
06 April 2018

Administrivia

  • Marking for project 1 - I'll aim for mid-next week

If it's not in the lecture or lab, it's not in the exam

Short exercise

Everybody on the Left

I want to build some software to predict someone's age of death

Everybody on the Right

I want to build some software to decide whether I should give someone a loan

Short exercise

Questions

  • What data do I need?
  • How do I get it?
  • How do I treat it?
  • What is the model going to look like?
  • How do I know it's any good?

Big picture

  • This is another class of model that has easily scaled complexity: a simple process can produce very complex functions
  • We use it, as before, for regression-like problems: we have a bunch of X/covariates/inputs/features/whatever, and we have a \( y \) we think is functionally related:

\[ \mathbf{y} = f(\mathbf{X}) + \text{noise} \] and we want to usefully approximate \( f \)

  • \( y \) may be categorical (ordinal or nominal) or numeric
  • \( x \) are numeric (we can relax this like we've done before)

Neural Nets - Suggested reading

This is a very brief overview of NNs (although the multitude of minor details makes a detailed view difficult). For further information:

  • Interestingly there is no section on Neural Nets in James et al.
  • Basheer & Hajmeer (2000) paper - quite a nice high-level overview (note terminology is very loose within NN literature)
  • Hastie et al. (2000) give a succinct statistical overview of fitting basic NNs + controlling complexity
  • See Nielsen, M.: http://neuralnetworksanddeeplearning.com/index.html for a nice online intro
  • See Goodfellow et al: http://www.deeplearningbook.org/ for much more depth than you need at this point

Neural Net intro - Ought-to-knows

  • How a general NN can be displayed graphically
  • The NN terminology exemplified by such a diagram
  • How a relatively simple single hidden-layer, two input NN can produce a complex non-linear prediction surface
  • The form of given activation functions (both the equation and sketch)
  • How NN weights and biases are derived

Neural Nets

Some contentions/comments to start:

  • NNs seem complex (indeed, a fitted one can be) and a bit 'magical'.
  • They are however built (as usual) from simple components.

The problem is similar to previous:

  • Create a model for signal that is capable of being complex.
  • Fit the model (estimate parameters) so it approximates some data (specify an objective function).
  • Ensure the generality of predictions by controlling complexity in the fitting process (e.g. optimise complexity using some measure of generalisation error).

All very familiar - let's begin.

A simple NN as a Mathematical Formula

\[ \ln\left(\frac{\hat{p}}{(1-\hat{p})}\right) = \hat{\beta}_0 + \hat{\beta}_1z_1 + \hat{\beta}_2z_2 + \hat{\beta}_3z_3 \]

where

\[ \begin{align*} z_1 &= \tanh( \hat{\alpha}_4 + \hat{\alpha}_5x_1 + \hat{\alpha}_6x_2)\\ z_2 &= \tanh( \hat{\alpha}_7 + \hat{\alpha}_8x_1 + \hat{\alpha}_9x_2)\\ z_3 &= \tanh( \hat{\alpha}_{10} + \hat{\alpha}_{11}x_1 + \hat{\alpha}_{12}x_2) \end{align*} \]

What did all that mean?

The output will be a (fitted) probability-like thing \[ \hat{p} = \frac{1}{1+e^{-\theta}} \]

where \( \theta \) is a linear weighted sum of \( z_i \) terms, with fitted parameters (weights) \( \hat{\beta}_i \)

There is an additional fitted weight \( \hat{\beta}_0 \) that is an intercept or bias term

The \( z_i \) are formed by

  • weighting inputs \( x_i \) with optimal \( \hat{\alpha}_k \)
  • adding another \( \hat{\alpha} \) bias term
  • taking the hyperbolic tangent of the sum
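To make this concrete, here is a minimal R sketch of the forward pass defined by the equations above. The weight values are made up purely for illustration (they are not fitted to anything).

```r
# Forward pass for the single-hidden-layer NN above, with made-up example weights.
# Each row of alpha is (bias, weight on x1, weight on x2) for one hidden node.
alpha <- matrix(c( 0.5,  1.2, -0.8,   # z1
                  -1.0,  0.3,  2.0,   # z2
                   0.2, -1.5,  0.7),  # z3
                nrow = 3, byrow = TRUE)
beta <- c(0.1, 1.0, -2.0, 0.5)        # output layer: bias, weights on z1, z2, z3

forward <- function(x1, x2) {
  z     <- tanh(alpha %*% c(1, x1, x2))  # hidden layer: tanh of weighted sums
  theta <- sum(beta * c(1, z))           # linear combination of hidden outputs
  1 / (1 + exp(-theta))                  # logistic output: a value in (0, 1)
}

forward(0.5, -1)                         # a single probability-like prediction
```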

Combining building blocks

Let's look at this graphically in R (example code is on Moodle)

  • Begin with 2 inputs (so we can plot them easily).
  • Specify 3 nodes, using tanh activation functions and a linear combination function.
  • Combine these for an output surface under different example weights.
  • How complex can we get?
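The Moodle code isn't reproduced here, but a minimal sketch along these lines (reusing the illustrative forward() function and weights from the previous slide) shows the kind of surface a 3-node network can produce:

```r
# Evaluate the sketch network over a grid of (x1, x2) values and
# draw the resulting output surface.
x1 <- seq(-3, 3, length.out = 50)
x2 <- seq(-3, 3, length.out = 50)
p  <- outer(x1, x2, Vectorize(forward))

persp(x1, x2, p, theta = 30, phi = 25,
      xlab = "x1", ylab = "x2", zlab = "p-hat")
# Changing the weights (or adding hidden nodes) changes how
# complex/wiggly this surface can be.
```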

Conversion to a diagrammatic form

For ease of understanding for non-mathematicians

  • We have two input sources, \( x_1 \) and \( x_2 \)
  • We have an input bias source
  • There is an internal layer of three \( z \) nodes, each taking in weighted inputs and outputting \( tanh \) of the summed inputs
  • There is an internal bias source for \( \hat{\beta}_0 \)
  • There is an output layer with one node, producing the logistic function of the weighted sum of the internal layer outputs
  • The number output is a probability between 0 and 1

Examples

(A series of example slides followed here; figures not reproduced.)

NN components

  • Weights and biases: from a statistical perspective, these weights are simply parameters of a potentially non-linear function, and the biases are the intercept terms for the linear components.

  • Combination Functions: in our example equations above these are the linear combinations, expressible in matrix form; they combine the input variables or the hidden-node outputs.

NN components

  • Activation functions: these are the functions wrapping the combination functions, and several variants are commonly used:

  • Identity Function - does not alter the value of the argument, so the resulting values can lie anywhere in \( \mathbb{R} \).

  • Sigmoid Functions - \( S \)-shaped functions with the logistic or hyperbolic tangent functions being common. The resulting values will be bounded - \( (0,1) \) or \( (-1, 1) \) respectively. The logistic is given by: \[ \phi(\theta)=\frac{1}{1+e^{-\theta}} \] for some argument value \( \theta \).

  • \( \tanh \) - hyperbolic tangent gives real values within \( (-1,1) \)

  • Others: Gaussian functions (bell-shaped); functions bounded below by zero but unbounded above, e.g. Exponential and Reciprocal Functions.
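A quick sketch of the two sigmoid activation functions in R (logistic and tanh on the same axes), just to fix their shapes and ranges:

```r
# Plot the logistic and tanh activation functions over a range of argument values.
theta    <- seq(-5, 5, length.out = 200)
logistic <- 1 / (1 + exp(-theta))        # bounded in (0, 1)

plot(theta, logistic, type = "l", ylim = c(-1, 1),
     xlab = expression(theta), ylab = "activation")
lines(theta, tanh(theta), lty = 2)       # tanh: bounded in (-1, 1)
abline(h = 0, col = "grey")
legend("topleft", legend = c("logistic", "tanh"), lty = 1:2, bty = "n")
```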

NN components

  • Network Layers: as the hidden layers are contrivances under control of the analyst, the number of layers and units within these can be large.
  • The layering is partly for convenience, where all the nodes/units share similar characteristics such as their activation and combination functions.
  • All the nodes in a layer are usually, when starting out, connected to all the nodes in the next.

Main components

  • Layers: input, hidden, output.
  • Connections and weights.
  • Combination functions: linear.
  • Activation functions: Identity, tanh, exp, logistic.
  • Output functions: Back to response scale - Identity, (multiple) Logistic.

Digging around

Google has a great set of tools called TensorFlow, which you can also call from Python or R.

  • Go to this website to have a play around; it is really, really good:

www.playground.tensorflow.org

Digging around

Let's look at some building blocks in R

Overview of our coverage

  • NNs are an 'art'.
  • Jargon can be inconsistent.
  • Huge number of decisions that can be made in their construction and the results are sensitive to these.
  • We'll look at the general ideas and very few specific implementations.

Fitting a Neural Net

Start with arbitrary weights and biases. Define an error function. Search for update values that reduce the error. Iterate until convergence (hopefully).

This is numerical optimisation

  • Non-linear problem with large numbers of parameters.
  • You will not find a general analytic solution for solving the weights.
  • All methods implemented are iterative numerical approaches - trial-and-error searches.
  • What we want to do is conceptually simple, once we define 'best'.
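In practice a package carries out this iterative search for us. As a minimal sketch (using the built-in iris data purely for illustration, not a dataset from this module), the nnet package fits a single-hidden-layer network of exactly the kind above:

```r
library(nnet)

set.seed(1)   # the search starts from random weights, so fix the seed
fit <- nnet(Species ~ Sepal.Length + Sepal.Width,
            data = iris, size = 3, maxit = 200)

# The printed trace shows the objective (error) decreasing over iterations;
# summary(fit) lists the fitted weights and biases for each connection.
summary(fit)
```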

we'll return to this!