Brynjólfur Gauti Jónsson, MS student in statistics
Geoffrey Hinton
To tackle large-scale problems we want bases that adapt to our data.
SVMs create basis functions from the data points themselves.
Neural networks use parametric non-linear activation functions to estimate bases.
\[ y = \varphi(\sum w_i x_i + b) = \begin{cases} 1, \hspace{5mm} \sum w_ix_i > -b \\ 0, \hspace{5mm} \sum w_ix_i \leq -b \end{cases} \]
What we call an MLP today is not literally a perceptron, since it doesn’t use the hard-threshold non-linearity above.
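As a minimal sketch, the perceptron rule above in base R (the AND-gate weights and bias below are made-up illustration values, not from the slides):

```r
# Perceptron: output 1 if the weighted sum exceeds the (negative) bias, else 0
perceptron <- function(x, w, b) {
  as.integer(sum(w * x) + b > 0)
}

# Illustrative values: a 2-input unit acting as an AND gate
w <- c(1, 1)
b <- -1.5
perceptron(c(1, 1), w, b)  # 1
perceptron(c(1, 0), w, b)  # 0
```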
Image from Computer Age Statistical Inference by Efron and Hastie
We have data \(X\) and a network with \(L\) layers. Each layer \(\ell\) has \(n_\ell\) hidden nodes, so the weight matrix \(W^{(\ell)}\) has dimension \(n_{\ell} \times n_{\ell - 1}\): it maps the \(n_{\ell - 1}\) outputs of the previous layer to \(n_{\ell}\) new values. Each column of \(A^{(\ell)}\) corresponds to one observation.
Set \(A^{(0)} = X\) and we get for each \(\ell\) in \(1:L\)
\[ \begin{aligned} Z^{(\ell)} &= W^{(\ell)}A^{(\ell - 1)} + b^{(\ell)} \\ A^{(\ell)} &= f^{(\ell)}(Z^{(\ell)}) \end{aligned} \]
In particular
\[ \hat Y = A^{(L)} \]
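A rough base-R sketch of this forward pass; the ReLU hidden layer and the example shapes are my own choices, and the columns of \(X\) are taken to be observations, as above:

```r
relu <- function(z) pmax(z, 0)

# Forward pass: A^(0) = X, then Z^(l) = W^(l) A^(l-1) + b^(l), A^(l) = f^(l)(Z^(l))
forward <- function(X, weights, biases, activations) {
  A <- X
  cache <- list(list(A = A))              # keep A^(0) and each (Z, A) for backprop
  for (l in seq_along(weights)) {
    Z <- weights[[l]] %*% A + biases[[l]] # b^(l) is recycled down each column
    A <- activations[[l]](Z)
    cache[[l + 1]] <- list(Z = Z, A = A)
  }
  list(Y_hat = A, cache = cache)
}

# Illustrative shapes (made up): 3 inputs, one hidden layer with 4 nodes, 1 output
set.seed(1)
X <- matrix(rnorm(3 * 5), nrow = 3)            # 5 observations as columns
weights <- list(matrix(rnorm(4 * 3), 4, 3),    # W^(1): 4 x 3
                matrix(rnorm(1 * 4), 1, 4))    # W^(2): 1 x 4
biases  <- list(rep(0, 4), 0)
activations <- list(relu, identity)
out <- forward(X, weights, biases, activations)
dim(out$Y_hat)                                 # 1 x 5
```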
Having obtained our predictions, our error for each observation is
\[ \mathcal L = (y_i - \hat y_i)^2 = (y_i - A^{(L)})^2 \]
For each weight parameter we need \(\frac{\partial \mathcal L}{\partial W^{(\ell)}}\). To compute this we use the chain rule:
\[ \begin{aligned} dZ^{(\ell)} &= dA^{(\ell)} * f^{(\ell)'}(Z^{(\ell)}) \\ dW^{(\ell)} &= dZ^{(\ell)} A^{(\ell - 1)T} \\ db^{(\ell)} &= \text{rowSums}(dZ^{(\ell)}) \\ dA^{(\ell - 1)} &= W^{(\ell)T}dZ^{(\ell)} \end{aligned} \]
To see why this is true:
\[ \begin{aligned} \frac{\partial \mathcal L}{\partial W^{(L)}} &= \frac{\partial \mathcal L}{\partial Z^{(L)}}\frac{\partial Z^{(L)}}{\partial W^{(L)}} \\ &= \frac{\partial \mathcal L}{\partial A^{(L)}}\frac{\partial A^{(L)}}{\partial Z^{(L)}}\frac{\partial Z^{(L)}}{\partial W^{(L)}} \\ &= \left[-2(y_i - A^{(L)}) * f^{(L)'}(Z^{(L)})\right]A^{(L - 1)T} \\ \frac{\partial \mathcal L}{\partial A^{(L - 1)}} &= \frac{\partial \mathcal L}{\partial A^{(L)}}\frac{\partial A^{(L)}}{\partial Z^{(L)}}\frac{\partial Z^{(L)}}{\partial A^{(L - 1)}} \\ &= W^{(L)T}\left[-2(y_i - A^{(L)}) * f^{(L)'}(Z^{(L)})\right] \end{aligned} \]
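A matching backward-pass sketch under the same conventions, for the squared-error loss above (variable names and the two-layer example are again my own, and no averaging over observations is added since the slides omit it):

```r
relu_grad <- function(z) (z > 0) * 1
identity_grad <- function(z) 1

# Backward pass matching the forward sketch, with L = (y - yhat)^2
backward <- function(y, out, weights, activation_grads) {
  cache <- out$cache
  grads <- vector("list", length(weights))
  dA <- -2 * (y - out$Y_hat)                  # dL/dA^(L)
  for (l in rev(seq_along(weights))) {
    Z <- cache[[l + 1]]$Z
    A_prev <- cache[[l]]$A
    dZ <- dA * activation_grads[[l]](Z)       # dZ^(l) = dA^(l) * f'(Z^(l))
    grads[[l]] <- list(dW = dZ %*% t(A_prev), # dW^(l) = dZ^(l) A^(l-1)T
                       db = rowSums(dZ))      # db^(l) = rowSums(dZ^(l))
    dA <- t(weights[[l]]) %*% dZ              # dA^(l-1) = W^(l)T dZ^(l)
  }
  grads
}

# Using the two-layer example from the forward-pass sketch
y <- matrix(rnorm(5), nrow = 1)
grads <- backward(y, out, weights, list(relu_grad, identity_grad))
```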
Both images from Deep Learning by Goodfellow, Bengio and Courville
Convolutional neural networks slide filters across the image to detect features.
Image from Computer Age Statistical Inference by Efron and Hastie
\[ F_h = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * A \\ F_v = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * A \\ F = \sqrt{F_h^2 + F_v^2} \]
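The Sobel edge detector above can be written out directly in base R. The naive double loop below is just for illustration; it is plain cross-correlation with no padding, not taken from any of the referenced books:

```r
# Slide a 3x3 filter K across an image A (valid region only, no padding)
convolve2d <- function(A, K) {
  out <- matrix(0, nrow(A) - 2, ncol(A) - 2)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      out[i, j] <- sum(A[i:(i + 2), j:(j + 2)] * K)
    }
  }
  out
}

sobel_h <- matrix(c(-1, 0, 1,
                    -2, 0, 2,
                    -1, 0, 1), nrow = 3, byrow = TRUE)
sobel_v <- matrix(c(-1, -2, -1,
                     0,  0,  0,
                     1,  2,  1), nrow = 3, byrow = TRUE)

# Edge magnitude F = sqrt(F_h^2 + F_v^2)
edge_magnitude <- function(A) {
  Fh <- convolve2d(A, sobel_h)
  Fv <- convolve2d(A, sobel_v)
  sqrt(Fh^2 + Fv^2)
}
```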
Instead of using hand-crafted filters, ConvNets learn the filter parameters from data.
“The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.” - Geoffrey Hinton
\[ y_t = f(y_{t-1}, x_t) \]
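A sketch of one recurrent step in base R; the tanh activation and the weight names \(W_a\), \(W_x\) are my own choices, not from the slides:

```r
# One step of a simple recurrent unit: the new state depends on the
# previous state and the current input
rnn_step <- function(a_prev, x_t, W_a, W_x, b) {
  tanh(W_a %*% a_prev + W_x %*% x_t + b)
}

# Unroll over a sequence whose columns are the inputs x^(1), ..., x^(T)
run_rnn <- function(X_seq, W_a, W_x, b, a0) {
  a <- a0
  for (t in seq_len(ncol(X_seq))) {
    a <- rnn_step(a, X_seq[, t, drop = FALSE], W_a, W_x, b)
  }
  a
}
```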
Image from Deep Learning with R by Abhijit Ghatak
\[ \begin{aligned} \Gamma_u &= \sigma(W_u[a^{(t - 1)}, x^{(t)}] + b_u) \\ \Gamma_f &= \sigma(W_f[a^{(t - 1)}, x^{(t)}] + b_f) \\ \Gamma_o &= \sigma(W_o[a^{(t - 1)}, x^{(t)}] + b_o) \\ \tilde c^{(t)} &= \tanh(W_c[a^{(t - 1)}, x^{(t)}] + b_c)\\ c^{(t)} &= \Gamma_u * \tilde c^{(t)} + \Gamma_f * c^{(t - 1)}\\ a^{(t)} &= \Gamma_o * \tanh(c^{(t)}) \end{aligned} \]
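A direct transcription of these equations into base R; \(\sigma\) is the logistic sigmoid, the concatenation \([a^{(t-1)}, x^{(t)}]\) is implemented by stacking the two column vectors, and storing the weights and biases as named lists is my own layout choice:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# One LSTM step following the equations above
lstm_step <- function(a_prev, c_prev, x_t, W, b) {
  v <- rbind(a_prev, x_t)                 # [a^(t-1), x^(t)]
  gamma_u <- sigmoid(W$u %*% v + b$u)     # update gate
  gamma_f <- sigmoid(W$f %*% v + b$f)     # forget gate
  gamma_o <- sigmoid(W$o %*% v + b$o)     # output gate
  c_tilde <- tanh(W$c %*% v + b$c)        # candidate cell state
  c_t <- gamma_u * c_tilde + gamma_f * c_prev
  a_t <- gamma_o * tanh(c_t)
  list(a = a_t, c = c_t)
}
```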
Encoder-Decoder
Can turn it into a denoising algorithm by applying gradually increasing Gaussian noise to the inputs!
Can use clustering algorithms to find similar pictures in this latent space.
We can linearly interpolate between two faces in the latent space.
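A sketch of this interpolation, assuming hypothetical encode() and decode() functions from a trained autoencoder (both are placeholders here, not real API calls):

```r
# Linear interpolation between two images in the latent space
interpolate_faces <- function(img1, img2, alpha, encode, decode) {
  z1 <- encode(img1)
  z2 <- encode(img2)
  z  <- (1 - alpha) * z1 + alpha * z2  # alpha = 0 gives img1, alpha = 1 gives img2
  decode(z)
}

# e.g. a sequence of in-between faces:
# frames <- lapply(seq(0, 1, by = 0.1),
#                  function(a) interpolate_faces(face1, face2, a, encode, decode))
```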
Steps: