ID5059 Lecture 18 - Wrapup

C. Donovan
20 April 2018

Administrivia

  • Presentations: time slots are on Moodle - mostly randomly allocated (via R)
  • Marking: I suck, there's lots, it'll be this week, watch Moodle
  • Pub: customarily I buy a round at course-end. Suggest end of next week.

NB: If it's not in the lecture or lab, it's not in the exam

Today

Wrap up

  • A little more on clustering
  • Some tidying up of Neural Networks & extensions

k-means algorithm in action

Agglomeration process

  • place all points that are zero distance apart into clusters - this gives up to \( n \) clusters.
  • a distance threshold governing fusion is raised until two existing clusters are similar enough to fuse.
  • the relevant clusters are fused, and the distance threshold is increased again - repeat the previous step until all clusters are fused into one super-cluster.

Hierarchical clustering

  • The history of the fusing process described above naturally gives rise to dendrograms
  • The single observation clusters are at one end.
  • The cluster of all observations at the other.
  • The vertical distance reflects the distance between clusters being fused/divided.
  • Big distances indicate heterogeneous clusters joining/splitting (see the sketch below)
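As a rough illustration (not lecture code), the agglomeration and dendrogram ideas might look like this in Python, using scipy on some made-up two-dimensional data:

    # Illustrative sketch only: agglomerative clustering and a dendrogram on made-up data
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (20, 2)),    # one cloud of points
                   rng.normal(5, 1, (20, 2))])   # a second, well-separated cloud

    Z = linkage(X, method="complete")   # fusion history: which clusters merge, and at what distance
    dendrogram(Z)                       # single observations at one end, one super-cluster at the other
    plt.ylabel("fusion distance")
    plt.show()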

Prediction and cluster interpretation

Have clusters - now what?

  • If we have hierarchical clustering - 'cut' the dendrogram
  • If \( k \)-means, run with desired \( k \) (both sketched below)
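Continuing in the same spirit (again a sketch with made-up data; scipy and scikit-learn assumed), cutting the dendrogram or running \( k \)-means with a chosen \( k \) might look like:

    # Sketch: 'cut' a dendrogram into k groups, or run k-means directly with the desired k
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

    Z = linkage(X, method="complete")
    labels_hier = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    labels_km = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)  # k-means, k = 2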

In keeping with the two over-arching uses of statistical models:

  • we will typically want to allocate new observations to a cluster/group - prediction
  • or investigate the nature of the clusters we have identified - description

Prediction

Prediction is conceptually easy

  • Allocate a new observation to the nearest cluster centroid.
  • For a vector of new values, apply the same simple approach (sketched below).
  • (although we can have fuzzy memberships, i.e. some numeric measure of membership in each cluster).
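For instance, a minimal sketch (made-up data; scikit-learn assumed) of allocating new observations to the nearest centroid:

    # Sketch: allocate new observations to the nearest cluster centroid
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
    km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

    X_new = np.array([[0.2, -0.1], [4.8, 5.3]])   # new observations to allocate
    dists = np.linalg.norm(X_new[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
    print(dists.argmin(axis=1))                   # index of the nearest centroid for each new row
    # equivalently: km.predict(X_new)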

Description

Description is relatively hard

  • Describing/interpreting the clusters is a more involved process.
  • Our clusters exist in \( p \)-dimensions which makes their description a daunting task.
  • Some of our \( p \) variables may be more important in defining some clusters than others
  • cluster definitions may also involve complex interactions between variables

Cluster description

  • Summary statistics: the predicted class memberships can be used as a simple index for subsetting the data (sketched below)
  • Reduced spaces: the data, and hence the clusters, will generally be \( p \)-dimensional; however, reduced-space methods do allow us to explore them visually.
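A minimal sketch of the summary-statistics idea (made-up data; pandas and scikit-learn assumed): treat the predicted memberships as an index and summarise each cluster.

    # Sketch: use predicted cluster memberships to subset/summarise the data
    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
    df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(df[["x1", "x2", "x3"]])

    print(df.groupby("cluster").mean())   # per-cluster means of each variable
    print(df.groupby("cluster").size())   # cluster sizes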

Principal Components Analysis (PCA)

(Covered if time permits.) We can use dimension reduction to explore the clusters and help with their description.
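A minimal sketch of the reduced-space idea (made-up data; scikit-learn assumed): project onto the first two principal components and colour the points by cluster.

    # Sketch: view p-dimensional clusters in a reduced (2-D) principal component space
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 6))                                  # pretend p = 6 variables
    X_std = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_std)

    scores = PCA(n_components=2).fit_transform(X_std)              # first two principal components
    plt.scatter(scores[:, 0], scores[:, 1], c=labels)
    plt.xlabel("PC1"); plt.ylabel("PC2")
    plt.show()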

Final NNs

  • Output functions
  • Categorical inputs
  • Scaling
  • Deep-learning
  • Convolutional Neural Networks

Output functions

  • Like everything in NNs, there are many. However, three are common for 'simple' NNs (as opposed to deep nets).
  • The type of response implies a type of output function (and loss function)

    • For numeric responses, likely go for squared error or absolute loss and an identity output (unconstrained numeric predictions)
    • For two-class responses, likely go for a logistic output to provide probability-scale predictions (0-1)

Output functions

For multi-class responses, likely go for softmax:

  • \( t_k \) is the net value in the \( k \)-th final-layer node, and \( K \) is the number of such nodes

\[ y_k = \frac{e^{t_k}}{\sum_{l=1}^K e^{t_l}} \]

Note, sums to 1 - so like a probability distribution over classes.

  • The loss functions for binary or multiclass problems are often log-loss/cross-entropy
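A small numerical sketch of softmax and the corresponding cross-entropy (numpy; the net values \( t_k \) here are made up):

    # Sketch: softmax output for K = 3 final-layer net values, and the log-loss/cross-entropy
    import numpy as np

    def softmax(t):
        e = np.exp(t - t.max())    # subtract the max for numerical stability
        return e / e.sum()

    t = np.array([2.0, 1.0, 0.1])  # net values t_k in the final-layer nodes
    y = softmax(t)                 # predicted class probabilities - sums to 1
    print(y, y.sum())

    true_class = 0
    print(-np.log(y[true_class]))  # cross-entropy/log-loss contribution for this observation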

Categorical inputs & scaling

  • NNs only accept numeric inputs (at least under the hood). So for categorical inputs:
    • if ordinal, perhaps treat as numeric
    • if nominal, dummy variable coding (one-hot encoding)
  • Generally it is a good idea to scale inputs to a NN, e.g. give them all mean 0 and variance 1 (both steps sketched below).
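A minimal sketch of both steps (made-up data; pandas and scikit-learn assumed):

    # Sketch: dummy/one-hot code a nominal input and standardise a numeric input
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({"colour": ["red", "blue", "red", "green"],
                       "size": [1.2, 3.4, 2.2, 0.7]})

    dummies = pd.get_dummies(df["colour"])                   # one-hot encoding of the nominal input
    size_std = StandardScaler().fit_transform(df[["size"]])  # mean 0, variance 1
    X = pd.concat([dummies, pd.DataFrame(size_std, columns=["size"])], axis=1)
    print(X)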

More complex NNs

Recent years have seen a NN renaissance, largely in image processing:

  • So-called Deep-learning
  • Relatedly, Convolutional Neural Networks (CNNs)

Beyond simple NNs: Go deep

  • There has been a recent upsurge in NN popularity, mainly due to great performance in image classification problems. The NNs in question can be massively complicated:

    • 2012 ImageNet classification - a NN with \( 1\times 10^9 \) parameters
    • Used 1,000 computers (16,000 cores) to fit
  • Hence the term deep-learning - very deep/complex NNs

  • The distinction between shallow and deep learning is a bit vague - but the extremes are clearly different

    • See Moodle - preprint of Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.

Convolutional NNs

These are typically used in the context of image processing. We have looked at a simple/naive approach already.

Recall MNIST numbers NN:

  • 60000 training images, 28x28 pixels each, clean and centered
  • Our basic NN:
    • Each input node is a pixel (784)
    • The output nodes are classes of image i.e. numbers 0-9
    • Even with a few hidden nodes and layers, a lot of parameters

This is clearly naive - the pixels are arranged spatially, which is not reflected in the architecture. (A sketch of such a basic NN follows below.)
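For reference, a sketch of such a basic, fully connected NN (assuming TensorFlow/Keras is available; illustrative only, not the lecture's code):

    # Sketch: a naive dense NN for MNIST - one input per pixel, 10 output classes
    import tensorflow as tf

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0        # scale pixels to [0, 1]

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),                           # 784 inputs; spatial layout is ignored
        tf.keras.layers.Dense(128, activation="relu"),       # ~100,000 parameters in this layer alone
        tf.keras.layers.Dense(10, activation="softmax"),     # probabilities over digits 0-9
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))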

Convolutional NNs

Basic idea is simple:

  • Don't fully connect all inputs to the first layer of the NN as though they are independent
  • Motivated by real vision - process blocks of inputs together based on their proximity (like scanning with your eye)
  • Amongst other things, this reduces the number of linkages

Convolutional NNs - overall architecture

source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Convolutional NNs - components

CNNs have many features we're familiar with from other NNs, but also:

  • kernels (a.k.a. convolutions); convolution layers
  • pooling layers
  • fully connected layers
  • strides & receptive fields

Convolutional NNs - convolving

Typically we're talking images - if colour, think 3 values per pixel (RGB).

  • So we have an input volume, say \( n\times n \) pixels by 3 channels (RGB)
  • We'll have 1000s of these for training with labels
  • Note each channel is a matrix
  • Key are the kernels, which are 'small' windows (e.g. a 4x4 matrix) that pass over the larger image/matrix, filtering the values
  • This is the convolving in the name - a set of matrix operations (sketched below)
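A bare-bones sketch of convolving (numpy; the image and kernel values are made up): one small kernel passed over a single channel with stride 1.

    # Sketch: a 3x3 kernel sliding over an 8x8 channel - windowed multiply-and-sum
    import numpy as np

    rng = np.random.default_rng(6)
    image = rng.normal(size=(8, 8))           # one channel; a colour image would be n x n x 3
    kernel = np.array([[1.0, 0.0, -1.0],      # a crude vertical-edge filter
                       [1.0, 0.0, -1.0],
                       [1.0, 0.0, -1.0]])

    out = np.zeros((6, 6))                    # output feature map: (8 - 3 + 1) x (8 - 3 + 1)
    for i in range(6):
        for j in range(6):
            window = image[i:i + 3, j:j + 3]      # the kernel's receptive field at this position
            out[i, j] = np.sum(window * kernel)   # elementwise multiply, then sum
    print(out.shape)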

Convolutional NNs - kernels

source: http://timdettmers.com/2015/03/26/convolution-deep-learning/

source: https://colah.github.io/posts/2014-07-Understanding-Convolutions/

Convolutional NNs - kernels

  • observe here the stride and possible overlap
source: https://xrds.acm.org/blog/2016/06/convolutional-neural-networks-cnns-illustrated-explanation/

Convolutional NNs - pooling layers

source: https://xrds.acm.org/blog/2016/06/convolutional-neural-networks-cnns-illustrated-explanation/
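For a rough feel (not from the slide figure): 2x2 max-pooling with stride 2 replaces each 2x2 block of a feature map with its maximum, halving each spatial dimension.

    # Sketch: 2x2 max-pooling on a made-up 6x6 feature map -> 3x3 output
    import numpy as np

    rng = np.random.default_rng(7)
    fmap = rng.normal(size=(6, 6))
    pooled = fmap.reshape(3, 2, 3, 2).max(axis=(1, 3))   # max over each 2x2 block
    print(fmap.shape, "->", pooled.shape)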

Convolutional NNs - fully connected layers

  • These are familiar to us i.e. inputs feed through hidden layers with full connectivity
  • They tend to come later in the CNN, once the complexity has been reduced by convolving and pooling.

Convolutional NNs - overall architecture

source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Convolutional NNs - notable example

Read Krizhevsky et al. (2012) - they describe their CNN for the ImageNet problem:

  • Achieved markedly better results than competing approaches at the time
  • Predicting 1000 classes from 1.2 million training images
  • 60 million parameters, 650,000 neurons
  • ReLU activation functions
  • 5 convolutional layers + max-pooling layers
  • 3 fully connected layers
  • Softmax output function (1000 classes)

Convolutional NNs

Read Krizhevsky et al. (2012) - they describe their CNN for the ImageNet problem:

  • Substantial time spent specifying architecture
  • Results were sensitive to architecture choices
  • Required fitting with GPUs - two here directed at different parts of the NN
  • Weight decay, momentum parameters and dropout
  • 6 days to fit

Tricky! Later CNNs made this look small (see previous)

Going beyond simple NNs

Further reading/resources - there is a massive amount of good stuff out there; a selection follows: