Statchat - Machine Learning (ML) from the stats side

C. Donovan
20 August 2019

Objectives

  • Demystify the area for those outside ML
  • Fly over some common methods which are indicative
  • Moan a little

Some examples

Some examples

What actually is ML?

Definitions

The term was coined in the 50s - naively/tellingly: Machine + Learning

  • Machines - could be various things (cars, robots, an app), but ultimately now computers or things controlled by them
  • Learning - teaching said machines to perform some task. They “learn” by exposure to data.

Basically algorithms informed by data. This sounds familiar…

Definitions

Why not ask Wikipedia? (Dr Paxton - are you here?):

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. … Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task

Sounds very familiar…

A Venn diagram of sorts -

There are lots of these - great fun

A Venn diagram of sorts -

Predictive models

Machine learning is frequently statistical modelling, with some disregard for inference, focussed on predictive performance.

\[ y = f(x_1, ..., x_p; \theta_1, ..., \theta_k) + e \]

\( y \) is the target (response), the \( x \) are the features (covariates), \( f \) is the signal with parameters \( \theta \), and \( e \) is the noise

In stats:

  • We'd usually have a formal model for the signal \( f \) and for the noise \( e \)
  • Estimation of \( \hat{f} \) usually built from the distribution of \( e \) (e.g. the other ML: max likelihood)
  • Similarly for ML, but the focus is on an accurate \( \hat{f} \), and the distribution of \( e \) is often only roughly considered

Function estimation

This is the core of what we're doing, by whatever means necessary to get a good estimate

It's usually a standard problem

Usually a supervised problem (we have observations of both \( y \) and a set of \( x \) to estimate \( f \))

  • Assume a form for \( f \), estimate the best parameters using the data
  • Most statistical models would be claimed under the ML banner
  • Philosophically, ML tends more towards flexible classes of model for \( f \), rather than classical stats modelling with a fairly rigid model structure

How to approximate f?

  • Choose class for \( f \): polynomial regression, neural networks, boosted classification trees etc
  • Estimate best parameters:
    • define an objective/loss function e.g. \( \sum_i (y_i-\hat{y}_i)^2 \)
    • find parameters to minimise this for a specific \( f \) e.g. gradient search over \( \theta_1, ..., \theta_k \)
    • repeat, altering the complexity of \( f \), to minimise generalisation error e.g. the loss function on a validation set, \( k \)-fold cross-validation etc (sketched below)
  • Appropriate model complexity becomes something to estimate too

This all sounds very familiar…
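For concreteness, a minimal Python sketch of that loop (scikit-learn and numpy; the data, the polynomial class and the degree grid are all made up for illustration):

    # Sketch: choose a class for f (polynomials here), minimise squared-error
    # loss, and tune complexity (degree) against a held-out validation set.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 200)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)      # signal + noise

    x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=1)

    val_loss = {}
    for degree in range(1, 11):                              # complexity of f
        coefs = np.polyfit(x_tr, y_tr, degree)               # least-squares estimate of theta
        val_loss[degree] = np.mean((y_val - np.polyval(coefs, x_val)) ** 2)

    best = min(val_loss, key=val_loss.get)                   # lowest estimated generalisation error
    print(best, val_loss[best])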

Under/over-fitting

  • We'd like to estimate the complexity of \( f \)
  • too simple and we're underfitting; too complex and we're overfitting
  • both cases give poor generalisation (prediction to unseen data), hence new data is simulated by validation/test sets, \( k \)-fold cross-validation etc

This all sounds very familiar. Model selection is widespread across statistics, but we might use AIC rather than a hold-out approach (a staple of ML).
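A small sketch of the two routes to the same model-selection decision (the polynomial example again, with made-up data; statsmodels for AIC, scikit-learn for the \( k \)-fold machinery):

    # Sketch: choose polynomial complexity by AIC (a stats habit) vs 5-fold
    # cross-validated squared-error loss (an ML staple).
    import numpy as np
    import statsmodels.api as sm
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, 150)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 150)

    def design(x, degree):
        return sm.add_constant(np.column_stack([x ** d for d in range(1, degree + 1)]))

    for degree in range(1, 7):
        X = design(x, degree)
        aic = sm.OLS(y, X).fit().aic
        cv_mse = []
        for train, test in KFold(n_splits=5, shuffle=True, random_state=2).split(X):
            fit = sm.OLS(y[train], X[train]).fit()
            cv_mse.append(np.mean((y[test] - fit.predict(X[test])) ** 2))
        print(degree, round(aic, 1), round(float(np.mean(cv_mse)), 3))

The two criteria often (not always) favour a similar complexity, though they estimate different things.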

Methods

Methods

  • Observe what is considered here under the umbrella of ML (see picture) - this is commonplace.
  • There are some methods not particularly statistical, or very fringe, which might be considered more “pure” ML (e.g. Convolutional Neural Networks).
  • So ML is not so much defined by its tools as by how they're viewed/used.

Examples

Building complex models from simple blocks

  • Trees
  • Boosted trees
  • Random forests
  • Neural networks

Simple trees: An arbitrary 2-D space

An arbitrary 2-D space

Space splitting

A single split

Space splitting

Split of a subspace

Space splitting

Further splitting of a subspace

Space splitting

Further splitting

Space splitting

Potential 3-D surface

Binary partitioning process as a tree

An example tree diagram for a contrived partitioning
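A minimal scikit-learn sketch of the same idea, on contrived 2-D data (the variable names and depth are purely illustrative):

    # Sketch: recursive binary splitting of a 2-D covariate space with a
    # regression tree, printed as a text version of the tree diagram.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 1, size=(300, 2))                     # an arbitrary 2-D space
    y = np.where(X[:, 0] > 0.5, 2.0, 0.0) + X[:, 1] + rng.normal(0, 0.2, 300)

    tree = DecisionTreeRegressor(max_depth=3).fit(X, y)      # depth limits the splitting
    print(export_text(tree, feature_names=["x1", "x2"]))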

Boosting trees

  • These are a bit rubbish by themselves
  • However, using lots of such models as building blocks, we get better predictions
  • This is akin to model averaging, where we might combine predictions from competing models, weighted by AIC.
  • This is ensemble modelling in ML parlance; it is commonplace

Boosting trees

In short:

  • Fit a model, like the previous tree
  • Observations incorrectly/poorly predicted get increased weight - refit.
  • Do this many times - combine the models in a weighted fashion.
  • The number of models combined is related to complexity. Control this in the standard ML way (see the sketch below)
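A minimal sketch using scikit-learn's gradient boosting, a close cousin of the reweighting scheme described above (the settings and data are purely illustrative):

    # Sketch: combine many small trees, tuning the number of trees (the
    # complexity) against a held-out validation set.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(4)
    X = rng.uniform(0, 1, size=(500, 3))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, 500)

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=4)

    for n_trees in (10, 50, 200, 1000):
        gbm = GradientBoostingRegressor(n_estimators=n_trees, max_depth=2, learning_rate=0.1)
        gbm.fit(X_tr, y_tr)
        print(n_trees, round(float(np.mean((y_val - gbm.predict(X_val)) ** 2)), 3))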

Lots of Trees - Random Forests

In short:

  • Fit a model, like the previous tree, to a bootstrap resample of the data using randomly selected covariates.
  • Do this many times - combine the models.
  • The number of models combined is related to complexity. Control this in the standard ML way (see the sketch below)
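A matching scikit-learn sketch (illustrative settings; bootstrap=True gives the resampling, max_features the random covariate selection at each split):

    # Sketch: many trees, each fit to a bootstrap resample with a random subset
    # of covariates considered at each split; predictions are averaged.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(5)
    X = rng.uniform(0, 1, size=(500, 5))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.2, 500)

    rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", bootstrap=True)
    cv_mse = -cross_val_score(rf, X, y, cv=5, scoring="neg_mean_squared_error")
    print(round(float(cv_mse.mean()), 3))                    # generalisation error estimate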

A simple NN as a Mathematical Formula

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1z_1 + \hat{\beta}_2z_2 + \hat{\beta}_3z_3 \]

where

\[ \begin{align*} z_1 &= \tanh( \hat{\alpha}_4 + \hat{\alpha}_5x_1 + \hat{\alpha}_6x_2)\\ z_2 &= \tanh( \hat{\alpha}_7 + \hat{\alpha}_8x_1 + \hat{\alpha}_9x_2)\\ z_3 &= \tanh( \hat{\alpha}_{10} + \hat{\alpha}_{11}x_1 + \hat{\alpha}_{12}x_2) \end{align*} \]

  • So NNs are a complex thing built from lots of simple components.
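Written out directly as code (numpy only; the weight values below are arbitrary placeholders, just to show the forward pass):

    # Sketch: the simple NN above as a plain function - a linear combination
    # of tanh-transformed linear combinations of the inputs.
    import numpy as np

    def nn_predict(x1, x2, alpha, beta):
        # alpha: the nine hidden-layer weights (alpha_4 ... alpha_12 above)
        # beta:  the four output-layer weights (beta_0 ... beta_3 above)
        z1 = np.tanh(alpha[0] + alpha[1] * x1 + alpha[2] * x2)
        z2 = np.tanh(alpha[3] + alpha[4] * x1 + alpha[5] * x2)
        z3 = np.tanh(alpha[6] + alpha[7] * x1 + alpha[8] * x2)
        return beta[0] + beta[1] * z1 + beta[2] * z2 + beta[3] * z3

    # arbitrary weights, purely to show the computation
    print(nn_predict(0.5, -1.2, alpha=0.1 * np.arange(9), beta=[0.2, 1.0, -0.5, 0.3]))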

The ubiquitous NN diagram

Estimate parameters as you'd expect

  • Define an appropriate loss function comparing \( y \) to \( \hat{y} \) (we could include a logistic component to make it a classification problem)
  • Move over the space of weights (all the \( \alpha \) and \( \beta \)) in directions that improve this objective [actually a horrible optimisation problem]
  • Stop when your generalisation error is good
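A minimal Keras sketch of that recipe (made-up data and layer sizes; squared-error loss, gradient-based optimisation, stopping on validation error):

    # Sketch: squared-error loss, gradient descent over the weights, stop when
    # the validation (generalisation) error stops improving.
    import numpy as np
    from tensorflow import keras

    rng = np.random.default_rng(7)
    X = rng.uniform(-1, 1, size=(1000, 2))
    y = np.sin(np.pi * X[:, 0]) * X[:, 1] + rng.normal(0, 0.1, 1000)

    model = keras.Sequential([
        keras.Input(shape=(2,)),
        keras.layers.Dense(3, activation="tanh"),   # the hidden z's
        keras.layers.Dense(1),                      # linear output
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=200, validation_split=0.3, verbose=0,
              callbacks=[keras.callbacks.EarlyStopping(patience=10)])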

Tensorflow sandbox

If you want to get to grips with NNs as a newbie - this is highly recommended:

https://playground.tensorflow.org/

I'll now play with it…

Statisticians, hold my beer

NNs, getting deep and convoluted

Recent years have seen a NN renaissance, largely in image processing:

  • So-called deep-learning: add lots of layers to your NN and you're now doing deep-learning.
  • Relatedly, Convolutional Neural Networks (CNNs)

Beyond simple NNs: Go deep

  • There has been a recent upsurge in NN popularity, mainly due to great performance in image classification problems. The NNs in question can be massively complicated:

    • 2012 imageNet classification - NN with \( 1\times 10^9 \) parameters
    • Used 1000 computers - 16,000 cores to fit
  • Hence the term deep-learning - very deep/complex NNs

  • The distinction between shallow and deep learning is a bit vague - but the extremes are clearly different

    • See Schmidhuber, J. (2015) Deep learning in neural networks: An overview. Neural Networks Volume 61, Pages 85-117.

Convolutional NNs

Consider a simple handwriting problem (e.g. MNIST data - recognise numbers):

  • Unpack all the pixels to be inputs to an NN
  • The pattern of pixel intensities is related to the class (a number 0-9)

Convolutional NNs

  • MNIST data consists of 60000 training images, 28x28 pixels each, clean and centered
  • Our basic NN:
    • Each input node is a pixel (784)
    • The output nodes are classes of image i.e. numbers 0-9
    • Even with a few hidden nodes and layers, a lot of parameters

This is clearly naive - the pixels are arranged spatially! That structure is not reflected in the architecture.
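To give a feel for the parameter count, a hypothetical fully connected Keras network on the flattened 28x28 inputs:

    # Sketch: a naive dense network on unpacked pixels - even one modest hidden
    # layer means roughly 100,000 parameters.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(784,)),                      # 28 x 28 pixels unpacked
        keras.layers.Dense(128, activation="relu"),     # 784*128 + 128 = 100,480 weights
        keras.layers.Dense(10, activation="softmax"),   # 128*10 + 10 = 1,290 weights
    ])
    model.summary()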

Convolutional NNs

Basic idea is simple:

  • Don't fully connect all inputs to the first layer of the NN as though they are independent
  • Filter inputs in various ways, similar to Photoshop filters
  • Motivated by real vision - process blocks of inputs grouped by proximity (like scanning with your eye)
  • Do some spatial averaging
  • Amongst other things, this reduces the number of linkages

Convolutional NNs - overall architecture

source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Convolutional NNs - components

CNNs have many features we're familiar with from other NNs, but also:

  • kernels (/convolutions); convolution layers
  • pooling layers
  • fully connected layers
  • strides & receptive fields

Convolutional NNs - convolving

Typically we're talking images - if colour, think 3 values per pixel (RGB).

  • So we have an input volume, say \( n\times n \) pixels by 3 channels (RGB)
  • We'll have 1000s of these for training with labels
  • Note each channel is a matrix
  • Key are the kernels, which are 'small' windows (e.g. a 4x4 pixel matrix) that pass over the larger image/matrix, filtering the values
  • This is the convolving in the name - a set of matrix operations
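A minimal numpy sketch of the convolution itself (a hypothetical 3x3 kernel, one channel, stride 1, no padding):

    # Sketch: slide a small kernel over a single-channel image, taking an
    # elementwise product-and-sum at each position.
    import numpy as np

    def convolve2d(image, kernel):
        kh, kw = kernel.shape
        out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.rand(28, 28)                  # one channel of pixel intensities
    edge_kernel = np.array([[1, 0, -1]] * 3)        # a crude vertical-edge filter
    print(convolve2d(image, edge_kernel).shape)     # (26, 26)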

Convolutional NNs - kernels

source: http://timdettmers.com/2015/03/26/convolution-deep-learning/

source: https://colah.github.io/posts/2014-07-Understanding-Convolutions/

Convolutional NNs - kernels

  • observe here the stride and possible overlap
source: https://xrds.acm.org/blog/2016/06/convolutional-neural-networks-cnns-illustrated-explanation/

Convolutional NNs - pooling layers

source: https://xrds.acm.org/blog/2016/06/convolutional-neural-networks-cnns-illustrated-explanation/

Convolutional NNs - fully connected layers

  • These are familiar to us, i.e. inputs feed through hidden layers with full connectivity
  • They tend to come later in the CNN, after the complexity has been reduced by convolving and pooling.

Convolutional NNs - overall architecture

source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Convolutional NNs - notable example

See Krizhevsky et al (2012) - they describe their CNN for the imagenet problem:

  • Got remarkably good results for the time
  • Predicting 1000 classes from 1.2 million training images
  • 60 million parameters, 650,000 neurons
  • ReLU activation functions
  • 5 convolutional layers + max-pooling layers
  • 3 fully connected layers
  • Softmax output function (1000 class)
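Not AlexNet itself, but a toy Keras sketch with the same ingredients (ReLU convolutions, max-pooling, fully connected layers, softmax output), sized for something like MNIST rather than imagenet:

    # Sketch: convolution + pooling layers to reduce the spatial complexity,
    # then fully connected layers and a softmax over the classes.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),                             # 1 channel, not RGB
        keras.layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolution layer
        keras.layers.MaxPooling2D(pool_size=2),                     # pooling layer
        keras.layers.Conv2D(64, kernel_size=3, activation="relu"),
        keras.layers.MaxPooling2D(pool_size=2),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation="relu"),                 # fully connected
        keras.layers.Dense(10, activation="softmax"),               # 10 classes, not 1000
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()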

Convolutional NNs

See Krizhevsky et al (2012) - they describe their CNN for the imagenet problem:

  • Substantial time spent specifying architecture
  • Results were sensitive to architecture choices
  • Required fitting with GPUs - two here directed at different parts of the NN
  • Weight decay, momentum parameters and dropout
  • 6 days to fit

Tricky! Later CNNs made this look small (see previous)

NB: Transfer learning

Transfer learning is very popular now.

  • CNNs can be huge and take a looong time to fit.
  • So, take the first bits of an extant CNN which has various convolutions and pooling with parameters already determined.
  • Attach your bit to the end, so you only fit the final layers for your specific problem (sketched below).
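A minimal Keras sketch of the idea, using MobileNetV2 as a stand-in for the extant CNN (the input size and the 5-class head are made up):

    # Sketch: reuse a pre-trained CNN's convolution/pooling stack with its
    # weights frozen, and fit only a small new head for your own classes.
    from tensorflow import keras

    base = keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                          input_shape=(160, 160, 3))
    base.trainable = False                            # keep the extant parameters fixed

    model = keras.Sequential([
        base,
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(5, activation="softmax"),  # e.g. 5 classes of your own
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit(...) on your (much smaller) labelled image set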

General ML approach

General ML approach

  • Fiddle your data - “feature engineering”, data wrangling & munging
  • Focus on predictive accuracy, usually assessed by data unseen to the model fitting process
  • Be relatively agnostic about the function for the signal - choose broad classes that permit lots of complexity
  • Choose loss-functions/objectives that suit response type e.g. squared-error loss for numeric problems
  • Tune model complexity as part of the process - again focussing on data unseen to the modelling process
  • Try/compare lots of competing types - winning is clearly defined (lowest generalisation error)
  • [try “ensembling” your models when in trouble]
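A sketch of the "try/compare lots of competing types" step (scikit-learn; the contenders and data are purely illustrative):

    # Sketch: compare competing model classes on the same cross-validated
    # generalisation error; the lowest wins.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(8)
    X = rng.uniform(0, 1, size=(400, 4))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.3, 400)

    contenders = {"linear": LinearRegression(),
                  "forest": RandomForestRegressor(n_estimators=300),
                  "boosted": GradientBoostingRegressor()}
    for name, model in contenders.items():
        mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
        print(name, round(float(mse), 3))
    # if no single winner, "ensembling" could simply average their predictions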

Hype vs reality

Hype vs reality

ML is far from magical - you can get poor performance for all the usual reasons models don't work

  • Data is poor (errors, values not missing at random, etc.)
  • Model isn't right - over/underfitted, don't understand inputs properly, wrong sort
  • Low signal-to-noise
  • Processes are changing (future is different to the past)
  • Class imbalances
  • Tuning too hard e.g. NN architecture sensitivities

Examples

  • Banking
  • Simple classification problems (ID5059)
  • Kaggle competitions (www.kaggle.com)

Hype vs reality

Hammers and nails

“When your only tool is a hammer, every problem looks like a nail”

Corollary (pers. comm. D. O'Reilly 2019, MSc student)

“When your only tool is a hammer, it behooves you to treat each problem as a nail”

Hammers and nails

The excitement around NNs/RNNs/LSTM/CNNs seems to be creating an odd shift:

There is now actually (expensive) hardware built specifically for one class of models:

https://www.nvidia.com/en-us/deep-learning-ai/

https://www.nvidia.com/en-us/data-center/dgx-2/

They're great for images, often pretty poor elsewhere, yet progressively used everywhere.

Hammers and nails

  • Python has become the go-to tool for much ML
  • The Keras library is commonly used, often accessing Tensorflow
  • [NB you can access all this through R too]
  • Lots and lots of material/examples/etc. of vastly varying quality around (e.g. see stock-market prediction using RNN-LSTMs - yuk!)
  • The end result is that much ML is very “same-y”; everybody is throwing these at problems

Wrap up

Some caveats

I've talked about ML models - these are usually only a small part of making something useful:

  • Data acquisition/management: video streams, APIs, web-scraping
  • Delivery: apps & GUIs

There are aspects to love

  • A clear, sensible, measurable objective - generalisation error/future predictive error, using data unseen to the modelling process
  • Super-flexible model classes that can capture complexity without fine-scale management, e.g. interactions are a pain in vanilla regressions
  • Lots and lots of cool tools, practitioners & activity (CS folk are very engaged)
  • Computational efficiency is well-covered

Should you use them?

You might consider the ML “approach” (and/or somewhat non-statistical models) if:

  • Prediction is key, over inference
  • You're not particularly interested in interpreting pieces of the model
  • You don't have strong views about the underlying function for the signal
  • You're dealing with image/video classification problems (definitely)