ID5059 Lecture 18 - Wrapup

C. Donovan
20 April 2018

Administrivia

  • Presentations: time slots are on Moodle - mostly randomly allocated (via R)
  • Marking: I suck, there's lots, it'll be this week, watch Moodle
  • Pub: customarily I buy a round at course-end. Suggest end of next week.

NB: If it's not in the lecture or lab, it's not in the exam

Today

Wrap up

  • A little more on clustering
  • Some tidying up of Neural Networks & extensions

k-means algorithm in action

Agglomeration process

  • place all points that are zero distance apart into clusters - this gives up to \( n \) clusters.
  • a distance threshold governing fusion is raised until two existing clusters are similar enough to fuse.
  • the relevant clusters are fused, and the distance threshold is increased again - repeat the previous step until all clusters are fused into one super-cluster.

Hierarchical clustering

  • The history of the fusing process described above naturally gives rise to dendrograms
  • The single observation clusters are at one end.
  • The cluster of all observations at the other.
  • The vertical distance reflects the distance between clusters being fused/divided.
  • Big distances indicate heterogeneous clusters joining/splitting (see the sketch below)
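As a rough illustration (not lecture code), the agglomeration and dendrogram ideas might look like this in Python, using scipy on some made-up two-dimensional data:

    # Illustrative sketch only: agglomerative clustering and a dendrogram on made-up data
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (20, 2)),    # one cloud of points
                   rng.normal(5, 1, (20, 2))])   # a second, well-separated cloud

    Z = linkage(X, method="complete")   # fusion history: which clusters merge, and at what distance
    dendrogram(Z)                       # single observations at one end, one super-cluster at the other
    plt.ylabel("fusion distance")
    plt.show()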

Prediction and cluster interpretation

Have clusters - now what?

  • If we have hierarchical clustering - 'cut' the dendrogram
  • If \( k \)-means, run with desired \( k \) (both sketched below)
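Continuing in the same spirit (again a sketch with made-up data; scipy and scikit-learn assumed), cutting the dendrogram or running \( k \)-means with a chosen \( k \) might look like:

    # Sketch: 'cut' a dendrogram into k groups, or run k-means directly with the desired k
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

    Z = linkage(X, method="complete")
    labels_hier = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    labels_km = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)  # k-means, k = 2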

In keeping with the two over-arching uses of statistical models:

  • we will typically want to allocate new observations to a cluster/group - prediction
  • or investigate the nature of the clusters we have identified - description

Prediction

Prediction is conceptually easy

  • Allocate a new observation to the nearest cluster centroid.
  • For a vector of new values, apply the same simple approach (sketched below).
  • (although we can have fuzzy memberships, i.e. some numeric measure of membership in each cluster).
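For instance, a minimal sketch (made-up data; scikit-learn assumed) of allocating new observations to the nearest centroid:

    # Sketch: allocate new observations to the nearest cluster centroid
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
    km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

    X_new = np.array([[0.2, -0.1], [4.8, 5.3]])   # new observations to allocate
    dists = np.linalg.norm(X_new[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
    print(dists.argmin(axis=1))                   # index of the nearest centroid for each new row
    # equivalently: km.predict(X_new)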

Description

Description is relatively hard

  • Describing/interpreting the clusters is a more involved process.
  • Our clusters exist in \( p \)-dimensions which makes their description a daunting task.
  • Some of our \( p \) variables may be more important in defining some clusters than others
  • cluster definitions may also involve complex interactions between variables

Cluster description

  • Summary statistics: the predicted class memberships can be used as a simple index for subsetting the data (sketched below)
  • Reduced spaces: the data, and hence the clusters, will generally be \( p \)-dimensional; however, reduced-space methods do allow us to explore them visually.
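A minimal sketch of the summary-statistics idea (made-up data; pandas and scikit-learn assumed): treat the predicted memberships as an index and summarise each cluster.

    # Sketch: use predicted cluster memberships to subset/summarise the data
    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
    df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(df[["x1", "x2", "x3"]])

    print(df.groupby("cluster").mean())   # per-cluster means of each variable
    print(df.groupby("cluster").size())   # cluster sizes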

Principal Components Analysis (PCA)

(Covered if time permits.) We can use dimension reduction to explore the clusters and help with their description.
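A minimal sketch of the reduced-space idea (made-up data; scikit-learn assumed): project onto the first two principal components and colour the points by cluster.

    # Sketch: view p-dimensional clusters in a reduced (2-D) principal component space
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 6))                                  # pretend p = 6 variables
    X_std = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_std)

    scores = PCA(n_components=2).fit_transform(X_std)              # first two principal components
    plt.scatter(scores[:, 0], scores[:, 1], c=labels)
    plt.xlabel("PC1"); plt.ylabel("PC2")
    plt.show()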

Final NNs

  • Output functions
  • Categorical inputs
  • Scaling
  • Deep-learning
  • Convolutional Neural Networks

Output functions

  • Like everything in NNs, there are many. However, three are common for 'simple' NNs (as opposed to deep nets).
  • The type of response implies a type of output function (and loss function)

    • For numeric responses, likely go for squared error or absolute loss and an identity output (unconstrained numeric predictions)
    • For two-class responses, likely go for a logistic output to provide probability-scale predictions (0-1)

Output functions

For multi-class responses, likely go for softmax:

  • \( t_k \) is the net value in the \( k \)-th final-layer node, and \( K \) is the number of such nodes

\[ y_k = \frac{e^{t_k}}{\sum_{l=1}^K e^{t_l}} \]

Note, sums to 1 - so like a probability distribution over classes.

  • The loss functions for binary or multiclass problems are often log-loss/cross-entropy
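A small numerical sketch of softmax and the corresponding cross-entropy (numpy; the net values \( t_k \) here are made up):

    # Sketch: softmax output for K = 3 final-layer net values, and the log-loss/cross-entropy
    import numpy as np

    def softmax(t):
        e = np.exp(t - t.max())    # subtract the max for numerical stability
        return e / e.sum()

    t = np.array([2.0, 1.0, 0.1])  # net values t_k in the final-layer nodes
    y = softmax(t)                 # predicted class probabilities - sums to 1
    print(y, y.sum())

    true_class = 0
    print(-np.log(y[true_class]))  # cross-entropy/log-loss contribution for this observation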

Categorical inputs & scaling

  • NNs only accept numeric inputs (at least under the hood). So for categorical inputs:
    • if ordinal, perhaps treat as numeric
    • if nominal, dummy variable coding (one-hot encoding)
  • Generally it is a good idea to scale inputs to a NN, e.g. give them all mean 0 and variance 1 (both steps sketched below).
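A minimal sketch of both steps (made-up data; pandas and scikit-learn assumed):

    # Sketch: dummy/one-hot code a nominal input and standardise a numeric input
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({"colour": ["red", "blue", "red", "green"],
                       "size": [1.2, 3.4, 2.2, 0.7]})

    dummies = pd.get_dummies(df["colour"])                   # one-hot encoding of the nominal input
    size_std = StandardScaler().fit_transform(df[["size"]])  # mean 0, variance 1
    X = pd.concat([dummies, pd.DataFrame(size_std, columns=["size"])], axis=1)
    print(X)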

More complex NNs

Recent years have seen a NN renaissance, largely in image processing:

  • So-called Deep-learning
  • Relatedly, Convolutional Neural Networks (CNNs)

Beyond simple NNs: Go deep

  • There has been a recent upsurge in NN popularity, mainly due to great performance in image classification problems. The NNs in question can be massively complicated:

    • 2012 ImageNet classification - a NN with \( 1\times 10^9 \) parameters
    • Used 1,000 computers (16,000 cores) to fit
  • Hence the term deep-learning - very deep/complex NNs

  • The distinction between shallow and deep learning is a bit vague - but the extremes are clearly different

    • See Moodle - preprint of Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.

Convolutional NNs

These are typically used in the context of image processing. We have looked at a simple/naive approach already.

Recall MNIST numbers NN:

  • 60000 training images, 28x28 pixels each, clean and centered
  • Our basic NN:
    • Each input node is a pixel (784)
    • The output nodes are classes of image i.e. numbers 0-9
    • Even with a few hidden nodes and layers, a lot of parameters

This is clearly naive - the pixels are arranged spatially, which is not reflected in the architecture. (A sketch of such a basic NN follows below.)
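For reference, a sketch of such a basic, fully connected NN (assuming TensorFlow/Keras is available; illustrative only, not the lecture's code):

    # Sketch: a naive dense NN for MNIST - one input per pixel, 10 output classes
    import tensorflow as tf

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0        # scale pixels to [0, 1]

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),                           # 784 inputs; spatial layout is ignored
        tf.keras.layers.Dense(128, activation="relu"),       # ~100,000 parameters in this layer alone
        tf.keras.layers.Dense(10, activation="softmax"),     # probabilities over digits 0-9
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))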

Convolutional NNs

Basic idea is simple:

  • Don't fully connect all inputs to the first layer of the NN as though they are independent
  • Motivated by real vision - process blocks of inputs together based on their proximity (like scanning with your eye)
  • Amongst other things, this reduces the number of linkages

Convolutional NNs - overall architecture

source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Convolutional NNs - components

CNNs have many features we're familiar with from other NNs, but also:

  • kernels (a.k.a. convolutions); convolution layers
  • pooling layers
  • fully connected layers
  • strides & receptive fields

Convolutional NNs - convolving

Typically we're talking images - if colour, think 3 values per pixel (RGB).

  • So we have an input volume, say \( n\times n \) pixels by 3 channels (RGB)
  • We'll have 1000s of these for training with labels
  • Note each channel is a matrix
  • Key are the kernels, which are 'small' windows (e.g. a 4x4 matrix) that pass over the larger image/matrix, filtering the values
  • This is the convolving in the name - a set of matrix operations (sketched below)
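A bare-bones sketch of convolving (numpy; the image and kernel values are made up): one small kernel passed over a single channel with stride 1.

    # Sketch: a 3x3 kernel sliding over an 8x8 channel - windowed multiply-and-sum
    import numpy as np

    rng = np.random.default_rng(6)
    image = rng.normal(size=(8, 8))           # one channel; a colour image would be n x n x 3
    kernel = np.array([[1.0, 0.0, -1.0],      # a crude vertical-edge filter
                       [1.0, 0.0, -1.0],
                       [1.0, 0.0, -1.0]])

    out = np.zeros((6, 6))                    # output feature map: (8 - 3 + 1) x (8 - 3 + 1)
    for i in range(6):
        for j in range(6):
            window = image[i:i + 3, j:j + 3]      # the kernel's receptive field at this position
            out[i, j] = np.sum(window * kernel)   # elementwise multiply, then sum
    print(out.shape)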

Convolutional NNs - kernels

source: http://timdettmers.com/2015/03/26/convolution-deep-learning/

source: https://colah.github.io/posts/2014-07-Understanding-Convolutions/

Convolutional NNs - kernels

  • observe here the stride and possible overlap
source: https://xrds.acm.org/blog/2016/06/convolutional-neural-networks-cnns-illustrated-explanation/

Convolutional NNs - pooling layers

source: https://xrds.acm.org/blog/2016/06/convolutional-neural-networks-cnns-illustrated-explanation/
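For a rough feel (not from the slide figure): 2x2 max-pooling with stride 2 replaces each 2x2 block of a feature map with its maximum, halving each spatial dimension.

    # Sketch: 2x2 max-pooling on a made-up 6x6 feature map -> 3x3 output
    import numpy as np

    rng = np.random.default_rng(7)
    fmap = rng.normal(size=(6, 6))
    pooled = fmap.reshape(3, 2, 3, 2).max(axis=(1, 3))   # max over each 2x2 block
    print(fmap.shape, "->", pooled.shape)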

Convolutional NNs - fully connected layers

  • These are familiar to us i.e. inputs feed through hidden layers with full connectivity
  • They tend to come later in the CNN, once the complexity has been reduced by convolving and pooling.

Convolutional NNs - overall architecture

source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Convolutional NNs - notable example

Read Krizhevsky et al. (2012) - they describe their CNN for the ImageNet problem:

  • Achieved markedly better results than competing approaches at the time
  • Predicting 1000 classes from 1.2 million training images
  • 60 million parameters, 650,000 neurons
  • ReLU activation functions
  • 5 convolutional layers + max-pooling layers
  • 3 fully connected layers
  • Softmax output function (1000 classes)

Convolutional NNs

Read Krizhevsky et al. (2012) - they describe their CNN for the ImageNet problem:

  • Substantial time spent specifying architecture
  • Results were sensitive to architecture choices
  • Required fitting with GPUs - two here directed at different parts of the NN
  • Weight decay, momentum parameters and dropout
  • 6 days to fit

Tricky! Later CNNs made this look small (see previous)

Going beyond simple NNs

Further reading/resources - there is a massive amount of good stuff out there; a selection follows: