ID5059 Lecture 13 - Neural Networks

C. Donovan
06 April 2018

Administrivia

  • Marking for project 1 - I'll aim for mid-next week

If it's not in the lecture or lab, it's not in the exam

Short exercise

Everybody on the Left

I want to build some software to predict someone's age of death

Everybody on the Right

I want to build some software to decide whether I should give someone a loan

Short exercise

Questions

  • What data do I need?
  • How do I get it?
  • How do I treat it?
  • What is the model going to look like?
  • How do I know it's any good?

Big picture

  • This is another class of model that has easily scaled complexity: a simple process can produce very complex functions
  • We use it, as before, for regression-like problems: we have a bunch of X/covariates/inputs/features/whatever, and we have a \( y \) we think is functionally related:

\[ \mathbf{y} = f(\mathbf{X}) + \text{noise} \] and we want to usefully approximate \( f \)

  • \( y \) may be categorical (ordinal or nominal) or numeric
  • \( x \) are numeric (we can relax this like we've done before)

Neural Nets - Suggested reading

This is a very brief overview of NNs (although the multitude of minor details makes a detailed view difficult). For further information:

  • Interestingly there is no section on Neural Nets in James et al.
  • Basheer & Hajmeer (2000) paper - quite a nice high-level overview (note terminology is very loose within NN literature)
  • Hastie et al. (2000) give a succinct statistical overview of fitting basic NNs + controlling complexity
  • See Nielsen, M.: http://neuralnetworksanddeeplearning.com/index.html for a nice online intro
  • See Goodfellow et al: http://www.deeplearningbook.org/ for much more depth than you need at this point

Neural Net intro - Ought-to-knows

  • How a general NN can be displayed graphically
  • The NN terminology exemplified by such a diagram
  • How a relatively simple single hidden-layer, two input NN can produce a complex non-linear prediction surface
  • The form of given activation functions (both the equation and sketch)
  • How NN weights and biases are derived

Neural Nets

Some contentions/comments to start:

  • NNs seem complex (indeed, a fitted one can be) and a bit 'magical'.
  • They are however built (as usual) from simple components.

The problem is similar to previous:

  • Create a model for signal that is capable of being complex.
  • Fit the model (estimate parameters) so it approximates some data (specify an objective function).
  • Ensure the generality of predictions by controlling complexity in the fitting process (e.g. optimise complexity using some measure of generalisation error).

All very familiar - let's begin.

A simple NN as a Mathematical Formula

\[ \ln\left(\frac{\hat{p}}{(1-\hat{p})}\right) = \hat{\beta}_0 + \hat{\beta}_1z_1 + \hat{\beta}_2z_2 + \hat{\beta}_3z_3 \]

where

\[ \begin{align*} z_1 &= \tanh( \hat{\alpha}_4 + \hat{\alpha}_5x_1 + \hat{\alpha}_6x_2)\\ z_2 &= \tanh( \hat{\alpha}_7 + \hat{\alpha}_8x_1 + \hat{\alpha}_9x_2)\\ z_3 &= \tanh( \hat{\alpha}_{10} + \hat{\alpha}_{11}x_1 + \hat{\alpha}_{12}x_2) \end{align*} \]

What did all that mean?

The output will be a (fitted) probability-like thing \[ \hat{p} = \frac{1}{1+e^{-\theta}} \]

where \( \theta \) is a linear weighted sum of \( z_i \) terms, with fitted parameters (weights) \( \hat{\beta}_i \)

There is an additional fitted weight \( \hat{\beta}_0 \) that is an intercept or bias term

The \( z_i \) are formed by

  • weighting inputs \( x_i \) with optimal \( \hat{\alpha}_k \)
  • adding another \( \hat{\alpha} \) bias term
  • taking the hyperbolic tangent of the sum
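To make this concrete, here is a minimal R sketch of the forward pass defined by the equations above. The weight values are made up purely for illustration (they are not fitted to anything).

```r
# Forward pass for the single-hidden-layer NN above, with made-up example weights.
# Each row of alpha is (bias, weight on x1, weight on x2) for one hidden node.
alpha <- matrix(c( 0.5,  1.2, -0.8,   # z1
                  -1.0,  0.3,  2.0,   # z2
                   0.2, -1.5,  0.7),  # z3
                nrow = 3, byrow = TRUE)
beta <- c(0.1, 1.0, -2.0, 0.5)        # output layer: bias, weights on z1, z2, z3

forward <- function(x1, x2) {
  z     <- tanh(alpha %*% c(1, x1, x2))  # hidden layer: tanh of weighted sums
  theta <- sum(beta * c(1, z))           # linear combination of hidden outputs
  1 / (1 + exp(-theta))                  # logistic output: a value in (0, 1)
}

forward(0.5, -1)                         # a single probability-like prediction
```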

Combining building blocks

Let's look at this graphically in R (example code is on Moodle)

  • Begin with 2 inputs (so we can plot them easily).
  • Specify 3 nodes, using tanh activation functions and a linear combination function.
  • Combine these for an output surface under different example weights.
  • How complex can we get?
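The Moodle code isn't reproduced here, but a minimal sketch along these lines (reusing the illustrative forward() function and weights from the previous slide) shows the kind of surface a 3-node network can produce:

```r
# Evaluate the sketch network over a grid of (x1, x2) values and
# draw the resulting output surface.
x1 <- seq(-3, 3, length.out = 50)
x2 <- seq(-3, 3, length.out = 50)
p  <- outer(x1, x2, Vectorize(forward))

persp(x1, x2, p, theta = 30, phi = 25,
      xlab = "x1", ylab = "x2", zlab = "p-hat")
# Changing the weights (or adding hidden nodes) changes how
# complex/wiggly this surface can be.
```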

Conversion to a diagrammatic form

For ease of understanding for non-mathematicians

  • We have two input sources, \( x_1 \) and \( x_2 \)
  • We have an input bias source
  • There is an internal layer of three \( z \) nodes, each taking in weighted inputs and outputting \( tanh \) of the summed inputs
  • There is an internal bias source for \( \hat{\beta}_0 \)
  • There is an output layer with one node, producing the logistic function of the weighted sum of the internal layer outputs
  • The number output is a probability between 0 and 1

Examples

(A series of example slides followed here; figures not reproduced.)

NN components

  • Weights and biases: from a statistical perspective, these weights are simply parameters of a potentially non-linear function, and the biases are the intercept terms for the linear components.

  • Combination Functions: in our example equations above these are the linear combinations, expressible in matrix form; they combine the input variables or the hidden-node outputs.

NN components

  • Activation functions: these are the functions wrapping the combination functions, and several variants are commonly used:

  • Identity Function - does not alter the value of the argument, so the resulting values can lie anywhere in \( \mathbb{R} \).

  • Sigmoid Functions - \( S \)-shaped functions with the logistic or hyperbolic tangent functions being common. The resulting values will be bounded - \( (0,1) \) or \( (-1, 1) \) respectively. The logistic is given by: \[ \phi(\theta)=\frac{1}{1+e^{-\theta}} \] for some argument value \( \theta \).

  • \( \tanh \) - hyperbolic tangent gives real values within \( (-1,1) \)

  • Others: Gaussian functions (bell-shaped); functions bounded below by zero but unbounded above, e.g. Exponential and Reciprocal Functions.
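A quick sketch of the two sigmoid activation functions in R (logistic and tanh on the same axes), just to fix their shapes and ranges:

```r
# Plot the logistic and tanh activation functions over a range of argument values.
theta    <- seq(-5, 5, length.out = 200)
logistic <- 1 / (1 + exp(-theta))        # bounded in (0, 1)

plot(theta, logistic, type = "l", ylim = c(-1, 1),
     xlab = expression(theta), ylab = "activation")
lines(theta, tanh(theta), lty = 2)       # tanh: bounded in (-1, 1)
abline(h = 0, col = "grey")
legend("topleft", legend = c("logistic", "tanh"), lty = 1:2, bty = "n")
```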

NN components

  • Network Layers: as the hidden layers are contrivances under control of the analyst, the number of layers and units within these can be large.
  • The layering is partly for convenience, where all the nodes/units share similar characteristics such as their activation and combination functions.
  • All the nodes in a layer are usually, when starting out, connected to all the nodes in the next.

Main components

  • Layers: input, hidden, output.
  • Connections and weights.
  • Combination functions: linear.
  • Activation functions: Identity, tanh, exp, logistic.
  • Output functions: Back to response scale - Identity, (multiple) Logistic.

Digging around

Google has a great set of tools called TensorFlow, which you can also call from Python or R.

  • Go to this website to have a play around; it is really, really good:

www.playground.tensorflow.org

Digging around

Let's look at some building blocks in R

Overview of our coverage

  • NNs are an 'art'.
  • Jargon can be inconsistent.
  • Huge number of decisions that can be made in their construction and the results are sensitive to these.
  • We'll look at the general ideas and very few specific implementations.

Fitting a Neural Net

Start with arbitrary weights and biases. Define an error function. Search for update values that reduce the error. Iterate until convergence (hopefully).

This is numerical optimisation

  • Non-linear problem with large numbers of parameters.
  • You will not find a general analytic solution for solving the weights.
  • All methods implemented are iterative numerical approaches - trial-and-error searches.
  • What we want to do is conceptually simple, once we define 'best'.
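In practice a package carries out this iterative search for us. As a minimal sketch (using the built-in iris data purely for illustration, not a dataset from this module), the nnet package fits a single-hidden-layer network of exactly the kind above:

```r
library(nnet)

set.seed(1)   # the search starts from random weights, so fix the seed
fit <- nnet(Species ~ Sepal.Length + Sepal.Width,
            data = iris, size = 3, maxit = 200)

# The printed trace shows the objective (error) decreasing over iterations;
# summary(fit) lists the fitted weights and biases for each connection.
summary(fit)
```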

we'll return to this!