Statchat - Machine Learning (ML) from the stats side

C. Donovan
20 August 2019

Objectives

  • Demystify the area for those outside ML
  • Fly over some common methods which are indicative
  • Moan a little

Some examples

Some examples

What actually is ML?

Definitions

The term was coined in the 50s - naively/tellingly: Machine + Learning

  • Machines - could be various things (cars, robots, an app), but ultimately now computers or things controlled by them
  • Learning - teaching said machines to perform some task. They “learn” by exposure to data.

Basically algorithms informed by data. This sounds familiar…

Definitions

Why not ask Wikipedia? (Dr Paxton - are you here?):

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. … Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task

Sounds very familiar…

A Venn diagram of sorts -

There are lots of these - great fun

A Venn diagram of sorts -

Predictive models

Machine learning is frequently statistical modelling, with some disregard for inference, focussed on predictive performance.

\[ y = f(x_1, ..., x_p; \theta_1, ..., \theta_k) + e \]

\( y \) is the target (response), the \( x \) are the features (covariates), \( f \) is the signal with parameters \( \theta \), and \( e \) is the noise

In stats:

  • We'd usually have a formal model for the signal \( f \) and for the noise \( e \)
  • Estimation of \( \hat{f} \) usually built from the distribution of \( e \) (e.g. the other ML: max likelihood)
  • Similarly for ML, but the focus is on an accurate \( \hat{f} \), and the distribution of \( e \) is often only roughly considered

Function estimation

This is the core of what we're doing, by whatever means necessary to get a good estimate

It's usually a standard problem

Usually a supervised problem (we have observations of both \( y \) and a set of \( x \) to estimate \( f \))

  • Assume a form for \( f \), estimate the best parameters using the data
  • Most statistical models would be claimed under the ML banner
  • Philosophically, ML tends more towards flexible classes of model for \( f \), rather than classical stats modelling with a fairly rigid model structure

How to approximate f?

  • Choose class for \( f \): polynomial regression, neural networks, boosted classification trees etc
  • Estimate best parameters:
    • define an objective/loss function e.g. \( \sum_i (y_i-\hat{y}_i)^2 \)
    • find parameters to minimise this for a specific \( f \) e.g. gradient search over \( \theta_1, ..., \theta_k \)
    • repeat, altering the complexity of \( f \), to minimise generalisation error e.g. the loss function on a validation set, \( k \)-fold cross-validation etc (sketched below)
  • Appropriate model complexity becomes something to estimate too

This all sounds very familiar…
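For concreteness, a minimal Python sketch of that loop (scikit-learn and numpy; the data, the polynomial class and the degree grid are all made up for illustration):

    # Sketch: choose a class for f (polynomials here), minimise squared-error
    # loss, and tune complexity (degree) against a held-out validation set.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 200)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)      # signal + noise

    x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=1)

    val_loss = {}
    for degree in range(1, 11):                              # complexity of f
        coefs = np.polyfit(x_tr, y_tr, degree)               # least-squares estimate of theta
        val_loss[degree] = np.mean((y_val - np.polyval(coefs, x_val)) ** 2)

    best = min(val_loss, key=val_loss.get)                   # lowest estimated generalisation error
    print(best, val_loss[best])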

Under/over-fitting

  • We'd like to estimate the complexity of \( f \)
  • too simple and we're underfitting; too complex and we're overfitting
  • both cases give poor generalisation (prediction to unseen data), hence new data is simulated by validation/test sets, \( k \)-fold cross-validation etc

This all sounds very familiar. Model selection is widespread across statistics, but we might use AIC rather than a hold-out approach (a staple of ML).
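A small sketch of the two routes to the same model-selection decision (the polynomial example again, with made-up data; statsmodels for AIC, scikit-learn for the \( k \)-fold machinery):

    # Sketch: choose polynomial complexity by AIC (a stats habit) vs 5-fold
    # cross-validated squared-error loss (an ML staple).
    import numpy as np
    import statsmodels.api as sm
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, 150)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 150)

    def design(x, degree):
        return sm.add_constant(np.column_stack([x ** d for d in range(1, degree + 1)]))

    for degree in range(1, 7):
        X = design(x, degree)
        aic = sm.OLS(y, X).fit().aic
        cv_mse = []
        for train, test in KFold(n_splits=5, shuffle=True, random_state=2).split(X):
            fit = sm.OLS(y[train], X[train]).fit()
            cv_mse.append(np.mean((y[test] - fit.predict(X[test])) ** 2))
        print(degree, round(aic, 1), round(float(np.mean(cv_mse)), 3))

The two criteria often (not always) favour a similar complexity, though they estimate different things.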

Methods

Methods

  • Observe what is considered here under the umbrella of ML (see picture) - this is commonplace.
  • There are some methods not particularly statistical, or very fringe, which might be considered more “pure” ML (e.g. Convolutional Neural Networks).
  • So ML is not so much defined by its tools as by how they're viewed/used.

Examples

Building complex models from simple blocks

  • Trees
  • Boosted trees
  • Random forests
  • Neural networks

Simple trees: An arbitrary 2-D space

An arbitrary 2-D space

Space splitting

A single split

Space splitting

Split of a subspace

Space splitting

Further splitting of a subspace

Space splitting

Further splitting

Space splitting

Potential 3-D surface

Binary partitioning process as a tree

An example tree diagram for a contrived partitioning
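A minimal scikit-learn sketch of the same idea, on contrived 2-D data (the variable names and depth are purely illustrative):

    # Sketch: recursive binary splitting of a 2-D covariate space with a
    # regression tree, printed as a text version of the tree diagram.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 1, size=(300, 2))                     # an arbitrary 2-D space
    y = np.where(X[:, 0] > 0.5, 2.0, 0.0) + X[:, 1] + rng.normal(0, 0.2, 300)

    tree = DecisionTreeRegressor(max_depth=3).fit(X, y)      # depth limits the splitting
    print(export_text(tree, feature_names=["x1", "x2"]))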

Boosting trees

  • These are a bit rubbish by themselves
  • However, using lots of such models as building blocks, we get better predictions
  • This is akin to model averaging, where we might combine predictions from competing models, weighted by AIC.
  • This is ensemble modelling in ML parlance; it is commonplace

Boosting trees

In short:

  • Fit a model, like the previous tree
  • Observations incorrectly/poorly predicted get increased weight - refit.
  • Do this many times - combine the models in a weighted fashion.
  • The number of models combined is related to complexity. Control this in the standard ML way (see the sketch below)
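A minimal sketch using scikit-learn's gradient boosting, a close cousin of the reweighting scheme described above (the settings and data are purely illustrative):

    # Sketch: combine many small trees, tuning the number of trees (the
    # complexity) against a held-out validation set.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(4)
    X = rng.uniform(0, 1, size=(500, 3))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, 500)

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=4)

    for n_trees in (10, 50, 200, 1000):
        gbm = GradientBoostingRegressor(n_estimators=n_trees, max_depth=2, learning_rate=0.1)
        gbm.fit(X_tr, y_tr)
        print(n_trees, round(float(np.mean((y_val - gbm.predict(X_val)) ** 2)), 3))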

Lots of Trees - Random Forests

In short:

  • Fit a model, like the previous tree, to a bootstrap resample of the data using randomly selected covariates.
  • Do this many times - combine the models.
  • The number of models combined is related to complexity. Control this in the standard ML way (see the sketch below)
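A matching scikit-learn sketch (illustrative settings; bootstrap=True gives the resampling, max_features the random covariate selection at each split):

    # Sketch: many trees, each fit to a bootstrap resample with a random subset
    # of covariates considered at each split; predictions are averaged.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(5)
    X = rng.uniform(0, 1, size=(500, 5))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.2, 500)

    rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", bootstrap=True)
    cv_mse = -cross_val_score(rf, X, y, cv=5, scoring="neg_mean_squared_error")
    print(round(float(cv_mse.mean()), 3))                    # generalisation error estimate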

A simple NN as a Mathematical Formula

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1z_1 + \hat{\beta}_2z_2 + \hat{\beta}_3z_3 \]

where

\[ \begin{align*} z_1 &= \tanh( \hat{\alpha}_4 + \hat{\alpha}_5x_1 + \hat{\alpha}_6x_2)\\ z_2 &= \tanh( \hat{\alpha}_7 + \hat{\alpha}_8x_1 + \hat{\alpha}_9x_2)\\ z_3 &= \tanh( \hat{\alpha}_{10} + \hat{\alpha}_{11}x_1 + \hat{\alpha}_{12}x_2) \end{align*} \]

  • So NNs are a complex thing built from lots of simple components.
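Written out directly as code (numpy only; the weight values below are arbitrary placeholders, just to show the forward pass):

    # Sketch: the simple NN above as a plain function - a linear combination
    # of tanh-transformed linear combinations of the inputs.
    import numpy as np

    def nn_predict(x1, x2, alpha, beta):
        # alpha: the nine hidden-layer weights (alpha_4 ... alpha_12 above)
        # beta:  the four output-layer weights (beta_0 ... beta_3 above)
        z1 = np.tanh(alpha[0] + alpha[1] * x1 + alpha[2] * x2)
        z2 = np.tanh(alpha[3] + alpha[4] * x1 + alpha[5] * x2)
        z3 = np.tanh(alpha[6] + alpha[7] * x1 + alpha[8] * x2)
        return beta[0] + beta[1] * z1 + beta[2] * z2 + beta[3] * z3

    # arbitrary weights, purely to show the computation
    print(nn_predict(0.5, -1.2, alpha=0.1 * np.arange(9), beta=[0.2, 1.0, -0.5, 0.3]))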

The ubiquitous NN diagram

Estimate parameters as you'd expect

  • Define an appropriate loss function comparing \( y \) to \( \hat{y} \) (we could include a logistic component to make it a classification problem)
  • Move over the space of weights (all the \( \alpha \) and \( \beta \)) in directions that improve this objective [actually a horrible optimisation problem]
  • Stop when your generalisation error is good
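A minimal Keras sketch of that recipe (made-up data and layer sizes; squared-error loss, gradient-based optimisation, stopping on validation error):

    # Sketch: squared-error loss, gradient descent over the weights, stop when
    # the validation (generalisation) error stops improving.
    import numpy as np
    from tensorflow import keras

    rng = np.random.default_rng(7)
    X = rng.uniform(-1, 1, size=(1000, 2))
    y = np.sin(np.pi * X[:, 0]) * X[:, 1] + rng.normal(0, 0.1, 1000)

    model = keras.Sequential([
        keras.Input(shape=(2,)),
        keras.layers.Dense(3, activation="tanh"),   # the hidden z's
        keras.layers.Dense(1),                      # linear output
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=200, validation_split=0.3, verbose=0,
              callbacks=[keras.callbacks.EarlyStopping(patience=10)])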

Tensorflow sandbox

If you want to get to grips with NNs as a newbie - this is highly recommended:

https://playground.tensorflow.org/

I'll now play with it…

Statisticians, hold my beer

NNs, getting deep and convoluted

Recent years have seen a NN renaissance, largely in image processing:

  • So-called deep-learning: add lots of layers to your NN and you're now doing deep-learning.
  • Relatedly, Convolutional Neural Networks (CNNs)

Beyond simple NNs: Go deep

  • There has been a recent upsurge in NN popularity, mainly due to great performance in image classification problems. The NNs in question can be massively complicated:

    • 2012 imageNet classification - NN with \( 1\times 10^9 \) parameters
    • Used 1000 computers - 16,000 cores to fit
  • Hence the term deep-learning - very deep/complex NNs

  • The distinction between shallow and deep learning is a bit vague - but the extremes are clearly different

    • See Schmidhuber, J. (2015) Deep learning in neural networks: An overview. Neural Networks Volume 61, Pages 85-117.

Convolutional NNs

Consider a simple handwriting problem (e.g. MNIST data - recognise numbers):

  • Unpack all the pixels to be inputs to an NN
  • The pattern of pixel intensities is related to the class (a number 0-9)

Convolutional NNs

  • MNIST data consists of 60000 training images, 28x28 pixels each, clean and centered
  • Our basic NN:
    • Each input node is a pixel (784)
    • The output nodes are classes of image i.e. numbers 0-9
    • Even with a few hidden nodes and layers, a lot of parameters

This is clearly naive - the pixels are arranged spatially! That structure is not reflected in the architecture.
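To give a feel for the parameter count, a hypothetical fully connected Keras network on the flattened 28x28 inputs:

    # Sketch: a naive dense network on unpacked pixels - even one modest hidden
    # layer means roughly 100,000 parameters.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(784,)),                      # 28 x 28 pixels unpacked
        keras.layers.Dense(128, activation="relu"),     # 784*128 + 128 = 100,480 weights
        keras.layers.Dense(10, activation="softmax"),   # 128*10 + 10 = 1,290 weights
    ])
    model.summary()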

Convolutional NNs

Basic idea is simple:

  • Don't fully connect all inputs to the first layer of the NN as though they are independent
  • Filter inputs in various ways, similar to Photoshop filters
  • Motivated by real vision - process blocks of inputs grouped by proximity (like scanning with your eye)
  • Do some spatial averaging
  • Amongst other things, this reduces the number of linkages

Convolutional NNs - overall architecture

source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Convolutional NNs - components

CNNs have many features we're familiar with from other NNs, but also:

  • kernels (/convolutions); convolution layers
  • pooling layers
  • fully connected layers
  • strides & receptive fields

Convolutional NNs - convolving

Typically we're talking images - if colour, think 3 values per pixel (RGB).

  • So we have an input volume, say \( n\times n \) pixels by 3 channels (RGB)
  • We'll have 1000s of these for training with labels
  • Note each channel is a matrix
  • Key are the kernels, which are 'small' windows (e.g. a 4x4 pixel matrix) that pass over the larger image/matrix, filtering the values
  • This is the convolving in the name - a set of matrix operations
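A minimal numpy sketch of the convolution itself (a hypothetical 3x3 kernel, one channel, stride 1, no padding):

    # Sketch: slide a small kernel over a single-channel image, taking an
    # elementwise product-and-sum at each position.
    import numpy as np

    def convolve2d(image, kernel):
        kh, kw = kernel.shape
        out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.rand(28, 28)                  # one channel of pixel intensities
    edge_kernel = np.array([[1, 0, -1]] * 3)        # a crude vertical-edge filter
    print(convolve2d(image, edge_kernel).shape)     # (26, 26)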

Convolutional NNs - kernels

source: http://timdettmers.com/2015/03/26/convolution-deep-learning/

source: https://colah.github.io/posts/2014-07-Understanding-Convolutions/

Convolutional NNs - kernels

  • observe here the stride and possible overlap
source: https://xrds.acm.org/blog/2016/06/convolutional-neural-networks-cnns-illustrated-explanation/

Convolutional NNs - pooling layers

source: https://xrds.acm.org/blog/2016/06/convolutional-neural-networks-cnns-illustrated-explanation/

Convolutional NNs - fully connected layers

  • These are familiar to us, i.e. inputs feed through hidden layers with full connectivity
  • They tend to come later in the CNN, after the complexity has been reduced by convolving and pooling.

Convolutional NNs - overall architecture

source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Convolutional NNs - notable example

See Krizhevsky et al (2012) - they describe their CNN for the imagenet problem:

  • Got remarkably good results for the time
  • Predicting 1000 classes from 1.2 million training images
  • 60 million parameters, 650,000 neurons
  • ReLU activation functions
  • 5 convolutional layers + max-pooling layers
  • 3 fully connected layers
  • Softmax output function (1000 class)
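Not AlexNet itself, but a toy Keras sketch with the same ingredients (ReLU convolutions, max-pooling, fully connected layers, softmax output), sized for something like MNIST rather than imagenet:

    # Sketch: convolution + pooling layers to reduce the spatial complexity,
    # then fully connected layers and a softmax over the classes.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),                             # 1 channel, not RGB
        keras.layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolution layer
        keras.layers.MaxPooling2D(pool_size=2),                     # pooling layer
        keras.layers.Conv2D(64, kernel_size=3, activation="relu"),
        keras.layers.MaxPooling2D(pool_size=2),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation="relu"),                 # fully connected
        keras.layers.Dense(10, activation="softmax"),               # 10 classes, not 1000
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()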

Convolutional NNs

See Krizhevsky et al (2012) - they describe their CNN for the imagenet problem:

  • Substantial time spent specifying architecture
  • Results were sensitive to architecture choices
  • Required fitting with GPUs - two here directed at different parts of the NN
  • Weight decay, momentum parameters and dropout
  • 6 days to fit

Tricky! Later CNNs made this look small (see previous)

NB: Transfer learning

Transfer learning is very popular now.

  • CNNs can be huge and take a looong time to fit.
  • So, take the first bits of an extant CNN which has various convolutions and pooling with parameters already determined.
  • Attach your bit to the end, so you only fit the final layers for your specific problem (sketched below).
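A minimal Keras sketch of the idea, using MobileNetV2 as a stand-in for the extant CNN (the input size and the 5-class head are made up):

    # Sketch: reuse a pre-trained CNN's convolution/pooling stack with its
    # weights frozen, and fit only a small new head for your own classes.
    from tensorflow import keras

    base = keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                          input_shape=(160, 160, 3))
    base.trainable = False                            # keep the extant parameters fixed

    model = keras.Sequential([
        base,
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(5, activation="softmax"),  # e.g. 5 classes of your own
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit(...) on your (much smaller) labelled image set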

General ML approach

General ML approach

  • Fiddle your data - “feature engineering”, data wrangling & munging
  • Focus on predictive accuracy, usually assessed by data unseen to the model fitting process
  • Be relatively agnostic about the function for the signal - choose broad classes that permit lots of complexity
  • Choose loss-functions/objectives that suit response type e.g. squared-error loss for numeric problems
  • Tune model complexity as part of the process - again focussing on data unseen to the modelling process
  • Try/compare lots of competing types - winning is clearly defined (lowest generalisation error)
  • [try “ensembling” your models when in trouble]
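A sketch of the "try/compare lots of competing types" step (scikit-learn; the contenders and data are purely illustrative):

    # Sketch: compare competing model classes on the same cross-validated
    # generalisation error; the lowest wins.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(8)
    X = rng.uniform(0, 1, size=(400, 4))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.3, 400)

    contenders = {"linear": LinearRegression(),
                  "forest": RandomForestRegressor(n_estimators=300),
                  "boosted": GradientBoostingRegressor()}
    for name, model in contenders.items():
        mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
        print(name, round(float(mse), 3))
    # if no single winner, "ensembling" could simply average their predictions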

Hype vs reality

Hype vs reality

ML is far from magical - you can get poor performance for all the usual reasons models don't work

  • Data is poor (errors, values not missing at random, etc.)
  • Model isn't right - over/underfitted, don't understand inputs properly, wrong sort
  • Low signal-to-noise
  • Processes are changing (future is different to the past)
  • Class imbalances
  • Tuning too hard e.g. NN architecture sensitivities

Examples

  • Banking
  • Simple classification problems (ID5059)
  • Kaggle competitions (www.kaggle.com)

Hype vs reality

Hammers and nails

“When your only tool is a hammer, every problem looks like a nail”

Corollary (pers. comm. D. O'Reilly 2019, MSc student)

“When your only tool is a hammer, it behooves you to treat each problem as a nail”

Hammers and nails

The excitement around NNs/RNNs/LSTM/CNNs seems to be creating an odd shift:

There is now actually (expensive) hardware built specifically for one class of models:

https://www.nvidia.com/en-us/deep-learning-ai/

https://www.nvidia.com/en-us/data-center/dgx-2/

They're great for images, often pretty poor elsewhere, yet progressively used everywhere.

Hammers and nails

  • Python has become the go-to tool for much ML
  • The Keras library is commonly used, often accessing Tensorflow
  • [NB you can access all this through R too]
  • Lots and lots of material/examples/etc. of vastly varying quality around (e.g. see stock-market prediction using RNN-LSTMs - yuk!)
  • The end result is that much ML is very “same-y”; everybody is throwing these at problems

Wrap up

Some caveats

I've talked about ML models - these are usually only a small part of making something useful:

  • Data acquisition/management: video streams, APIs, web-scraping
  • Delivery: apps & GUIs

There are aspects to love

  • A clear, sensible, measurable objective - generalisation error/future predictive error, using data unseen to the modelling process
  • Super-flexible model classes that can capture complexity without fine-scale management, e.g. interactions are a pain in vanilla regressions
  • Lots and lots of cool tools, practitioners & activity (CS folk are very engaged)
  • Computational efficiency is well-covered

Should you use them?

You might consider the ML “approach” (and/or somewhat non-statistical models) if:

  • Prediction is key, over inference
  • You're not particularly interested in interpreting pieces of the model
  • You don't have strong views about the underlying function for the signal
  • You're dealing with image/video classification problems (definitely)