Dive into Deep Learning

Sigrid Keydana, Trivadis
2017/01/08

Agenda

 

Part 1: Getting a feel for deep learning

 

  • Intro: What is Deep Learning and how does it work?
  • Implementing a neural network in NumPy
  • Linear regression using DL frameworks - meet Keras, TensorFlow, and PyTorch
  • Under the hood: Backpropagation in NumPy vs. TensorFlow vs. PyTorch

Agenda

 

Part 2: Going deeper with CNNs, LSTMs and hyperparameter tuning

 

  • Convolutional Neural Networks (CNNs) for image classification
  • Long Short-Term Memory (LSTM) for sequential data
  • Hyperparameter optimization with Keras and its scikit-learn API

About this course

 

  • Focus is on code and examples (Python)
  • All code is made available as Jupyter notebooks
  • Main framework used is Keras
  • But part 1 also demonstrates examples in TensorFlow and PyTorch, for comparison

Notebooks

 

Part 1

  • Implementing a deep neural network with NumPy: dl_with_numpy.ipynb
  • Linear regression, the usual way (using scikit-learn): linear_regression_sklearn.ipynb
  • Keras basics: keras_basics.ipynb
  • Linear regression with Keras: linear_regression_keras.ipynb
  • TensorFlow basics: tensorflow_basics.ipynb
  • Linear regression with TensorFlow: linear_regression_tensorflow.ipynb
  • PyTorch basics: pytorch_basics.ipynb
  • Linear regression with PyTorch: linear_regression_pytorch.ipynb
  • Backpropagation in NumPy vs TensorFlow vs PyTorch: backprop_numpy_tf_pytorch.ipynb

Notebooks

 

Part 2

  • Convolutional Neural Networks with Keras (1): cnn_keras.ipynb
  • Convolutional Neural Networks with Keras (2): cnn_keras_pretrained.ipynb
  • Convolutional Neural Networks with Keras (3): cnn_keras_pretrained_2.ipynb
  • Long Short-Term Memory (LSTM) with Keras (1): lstm_keras_1.ipynb
  • Long Short-Term Memory (LSTM) with Keras (2): lstm_keras_2.ipynb
  • Hyperparameter optimization using Keras and the scikit-learn API: optimization.ipynb

 

 

Why Deep Learning?

Easy? Difficult?

 

  • playing chess
  • solving matrix computations

  • recognizing objects

  • understanding speech

  • talking

  • walking

  • driving

Easy? Difficult?

 

What's difficult for us is easy for the machine, and vice versa.

Representations matter

 

Source: Goodfellow et al. 2016, Deep Learning

Just feed the network the right features?

 

What are the correct pixel values for a “bike” feature?

  • race bike, mountain bike, e-bike?
  • pixels in the shadow may be much darker
  • what if the bike is mostly obscured by the rider standing in front of it?

Let the network pick the features

 

… a layer at a time!

Source: Goodfellow et al. 2016, Deep Learning

Deep Learning, two ways to think about it

 

 

 

A short history of deep learning

The first wave: cybernetics (1940s - 1960s)

 

  • neuroscientific motivation
  • linear models

McCulloch-Pitts Neuron (MCP, 1943, a.k.a. Logic Circuit)

 

  • binary output (0 or 1)
  • neurons may have inhibiting (negative) and excitatory (positive) inputs
  • each neuron has a threshold: the sum of its excitatory inputs must reach it for the neuron to fire (output 1)
  • a single active inhibitory input prevents the neuron from firing
Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf
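
A minimal sketch of an MCP unit in R (the inputs and threshold below are made up for illustration; using >= for the threshold is one common convention):

mcp_neuron <- function(excitatory, inhibitory, threshold) {
  # a single active inhibitory input silences the neuron
  if (any(inhibitory == 1)) return(0)
  # otherwise it fires iff the summed excitatory inputs reach the threshold
  as.integer(sum(excitatory) >= threshold)
}

mcp_neuron(excitatory = c(1, 1), inhibitory = 0, threshold = 2)  # 1 -- behaves like AND
mcp_neuron(excitatory = c(1, 0), inhibitory = 1, threshold = 1)  # 0 -- inhibition wins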

Perceptron (Rosenblatt, 1958): Great expectations

 

  • compute linear combination of inputs
  • return +1 if result is positive, -1 if result is negative
Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf
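
In R, the prediction step looks like this (the weights and bias here are assumptions, not a trained model):

perceptron <- function(x, w, b) {
  # linear combination of the inputs, thresholded at zero
  if (sum(w * x) + b > 0) 1 else -1
}

perceptron(c(1, 1), w = c(1, 1), b = -1)  # +1
perceptron(c(0, 0), w = c(1, 1), b = -1)  # -1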

Minsky & Papert (1969), "Perceptrons": the great disappointment

 

  • Perceptrons can only solve linearly separable problems
  • Big loss of interest in neural networks

The second wave: Connectionism (1980s - mid-1990s)

 

  • distributed representations
  • backpropagation

Backpropagation, the magic ingredient

 

Several “origins” in different fields, see e.g.

  • Henry J. Kelley (1960). Gradient theory of optimal flight paths. Ars Journal, 30(10), 947-954.

  • Arthur E. Bryson (1961, April). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard Univ. Symposium on digital computers and their applications.

  • Paul Werbos (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.

  • Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). “Learning representations by back-propagating errors”. Nature. 323 (6088): 533-536.

Back then: How could the magic fail?

 

  • Only applicable in case of supervised learning
  • Doesn't scale well to multiple layers (or so they thought at the time)
  • Can converge to poor local minima (or so they thought at the time)

The third wave: Deep Learning

 

  • It all starts with: Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527-1554.
  • Deep neural networks can be trained efficiently, if the weights are initialized intelligently
  • Return of backpropagation

The architectures en vogue now (CNN, LSTM...) have mostly been around since the 1980s/1990s.

 

So why the hype, ehm, success now?

  • big data
  • big models (due to better hardware, mostly)
  • big incentives (deep learning is profitable!)

 

 

How does a deep network learn?

Feedforward Deep Neural Network

Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf

Why hidden layers? Learning XOR

 

We want to predict

  • 0 from [0,0]
  • 0 from [1,1]
  • 1 from [0,1]
  • 1 from [1,0]

Trying a linear model

 

\( f(\mathbf{x}; \mathbf{w}, b) = \mathbf{x}^T\mathbf{w} + b \)

  • with Mean Squared Error (MSE) cost, this leads to: \( \mathbf{w}=0, b=0.5 \)
  • mapping every point to 0.5 (see below)!
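
A quick check in R (this data frame and lm call are mine, not from the course notebooks): fitting ordinary least squares to the four XOR points indeed gives an intercept of 0.5 and zero weights.

xor_data <- data.frame(x1 = c(0, 0, 1, 1), x2 = c(0, 1, 0, 1), y = c(0, 1, 1, 0))
coef(lm(y ~ x1 + x2, data = xor_data))
# (Intercept)          x1          x2
#         0.5         0.0         0.0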

Introduce hidden layer

 

\( f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T (\mathbf{W}^T\mathbf{x} + \mathbf{c}) + b \)

Source: Goodfellow et al. 2016, Deep Learning

Calculation with hidden layer (1)

 

(X <- matrix(c(0,0,0,1,1,0,1,1), nrow = 4, ncol = 2, byrow = TRUE))
     [,1] [,2]
[1,]    0    0
[2,]    0    1
[3,]    1    0
[4,]    1    1
(W <- matrix(c(1,1,1,1), nrow = 2, ncol = 2, byrow = TRUE))
     [,1] [,2]
[1,]    1    1
[2,]    1    1
(c <- matrix(c(0,-1), nrow = 1, ncol = 2))
     [,1] [,2]
[1,]    0   -1

Calculation with hidden layer (2)

 

(matmul <- X %*% W)
     [,1] [,2]
[1,]    0    0
[2,]    1    1
[3,]    1    1
[4,]    2    2
(hidden <- matmul + rbind(c, c, c, c))
     [,1] [,2]
[1,]    0   -1
[2,]    1    0
[3,]    1    0
[4,]    2    1

Which gives us...

 

[Plot: the four XOR points after the linear hidden-layer transformation]

Introducing nonlinearity

 

\( f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T \max(0, \mathbf{W}^T\mathbf{x} + \mathbf{c}) + b \)
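
Reusing hidden from the previous slide, this can be computed with R's pmax (elementwise max); the exact notebook code may differ:

hidden_relu <- data.frame(pmax(hidden, 0))  # elementwise max(0, .)
names(hidden_relu) <- c("X1", "X2")
hidden_relu$CLASS <- c(0, 1, 1, 0)          # XOR targets for [0,0], [0,1], [1,0], [1,1]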

 

hidden_relu 
  X1 X2 CLASS
1  0  0     0
2  1  0     1
3  1  0     1
4  2  1     0

Introducing nonlinearity

 

[Plot: the points after applying the ReLU - classes 0 and 1 are now linearly separable]

The remaining hidden-to-output transformation is linear.

But the classes are already linearly separable.

Optimization

 

  • Like other machine learning algorithms, neural networks learn by minimizing a cost function.
  • Cost functions in neural networks are normally not convex and so cannot be optimized in closed form.
  • The solution is to do gradient descent.
Source: Goodfellow et al. 2016, Deep Learning

Local minima

 

Source: Goodfellow et al. 2016, Deep Learning

Closed-form vs. gradient descent optimization for Least Squares

 

  • Minimize the squared error \( f(\hat{\beta}) = ||\mathbf{X}\hat{\beta} - \mathbf{y}||^2_2 \)
  • Closed form: solve the normal equations \( \hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \)
  • Alternatively, follow the gradient \( \nabla_{\hat{\beta}} f(\hat{\beta}) = \mathbf{X}^T\mathbf{X}\hat{\beta} - \mathbf{X}^T\mathbf{y} \) downhill (illustrated below)
Source: Goodfellow et al. 2016, Deep Learning
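
A small illustration in R with simulated data (the step size and iteration count are arbitrary choices):

set.seed(42)
X <- cbind(1, matrix(rnorm(200), ncol = 2))   # 100 x 3 design matrix, with intercept column
beta_true <- c(1, 2, -3)
y <- X %*% beta_true + rnorm(100, sd = 0.1)

# closed form: solve the normal equations
beta_closed <- solve(t(X) %*% X, t(X) %*% y)

# gradient descent on the squared error
beta_gd <- rep(0, 3)
for (i in 1:500) {
  grad <- t(X) %*% (X %*% beta_gd - y)        # X^T X beta - X^T y
  beta_gd <- beta_gd - 0.001 * grad
}
cbind(beta_closed, beta_gd)                   # both close to (1, 2, -3)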

This gives us a way to train one weight matrix.

 

How about a net with several layers?

Backpropagation

 

Backprop example: logistic neuron

 

Source: Geoffrey Hinton, Neural Networks for Machine Learning Lec. 3
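
For a single logistic neuron with squared error \( E = \frac{1}{2}(t - y)^2 \), the chain rule gives \( \frac{\partial E}{\partial w_i} = -(t - y)\, y(1 - y)\, x_i \). A numerical check in R (inputs, weights and target are made up):

sigmoid <- function(z) 1 / (1 + exp(-z))

x <- c(1, 0.5); w <- c(0.2, -0.3); b <- 0.1; target <- 1
y <- sigmoid(sum(w * x) + b)

# analytic gradient from the chain rule
grad_analytic <- -(target - y) * y * (1 - y) * x

# finite-difference check
E <- function(w) 0.5 * (target - sigmoid(sum(w * x) + b))^2
eps <- 1e-6
grad_numeric <- c((E(w + c(eps, 0)) - E(w - c(eps, 0))) / (2 * eps),
                  (E(w + c(0, eps)) - E(w - c(0, eps))) / (2 * eps))
rbind(grad_analytic, grad_numeric)            # the two rows should agree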

Decisions (1): Which loss function should I choose?

 

The loss (or cost) function quantifies the cost incurred by wrong predictions / misclassifications.

  • probably the best-known loss function in machine learning is mean squared error:

    \( \frac{1}{n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2} \)

  • for classification tasks in deep learning, we mostly use cross entropy:

    \( - \sum_j{t_j \log(y_j)} \)

    This is the negative log probability of the right answer.
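
In R, for a single example with a one-hot target (the numbers are invented for illustration):

y_hat <- c(0.7, 0.2, 0.1)    # predicted class probabilities
t_onehot <- c(1, 0, 0)       # one-hot encoding of the right answer

mean((y_hat - t_onehot)^2)   # mean squared error
-sum(t_onehot * log(y_hat))  # cross entropy = -log(0.7), the negative log prob of the right class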

Decisions (2): Which activation function to choose?

 

  • for a long time, the sigmoid (logistic) activation function was the default choice: \( y = \frac{1}{1 + e^{-z}} \)
  • nowadays, rectified linear units (ReLUs) are preferred: \( y = \max(0, z) \)
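
Both are one-liners in R:

sigmoid <- function(z) 1 / (1 + exp(-z))  # squashes into (0, 1), saturates for large |z|
relu <- function(z) pmax(0, z)            # cheap to compute, gradient is 1 for z > 0

z <- c(-2, 0, 2)
sigmoid(z)  # ~0.12 0.50 0.88
relu(z)     # 0 0 2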

 

 

Deep Learning Architectures

Convolutional Neural Networks

 

  • standard feedforward networks require fixed-size input (images often don't come in one size!)
  • the convolution operation extracts image features independently of where in the image they occur
Source: http://cs231n.github.io/convolutional-networks/

The Convolution Operation

Source: http://cs231n.github.io/convolutional-networks/ (Live Demo on website!)
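
A minimal sketch of the operation in R (strictly, the cross-correlation most DL frameworks implement), using a made-up 4x4 "image" and a 2x2 filter:

convolve2d <- function(img, filt) {
  out_rows <- nrow(img) - nrow(filt) + 1
  out_cols <- ncol(img) - ncol(filt) + 1
  out <- matrix(0, out_rows, out_cols)
  for (i in 1:out_rows)
    for (j in 1:out_cols)
      # slide the filter over the image, take elementwise products and sum
      out[i, j] <- sum(img[i:(i + nrow(filt) - 1), j:(j + ncol(filt) - 1)] * filt)
  out
}

img <- matrix(c(1, 0, 0, 1,
                0, 1, 1, 0,
                0, 1, 1, 0,
                1, 0, 0, 1), nrow = 4, byrow = TRUE)
filt <- matrix(c(1, -1,
                -1, 1), nrow = 2, byrow = TRUE)   # responds to diagonal structure
convolve2d(img, filt)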

Play around in Gimp

 

Playing around with the convolution matrix filter in Gimp is a great way to gain intuition.

Source: https://docs.gimp.org/en/plug-in-convmatrix.html

Recurrent neural networks (RNNs)

 

Why add recurrence?

  • "normal" feedforward networks cannot process sequential data of varying length
  • in NLP, the n-gram approach cannot capture long-range dependencies

Jane walked into the room. John walked in too. It was late in the day, and everyone was walking home after a long day at work. Jane said hi to ___

(see Stanford CS 224D Deep Learning for NLP Lecture Notes)

Two representations of RNNs

 

Source: Goodfellow et al. 2016, Deep Learning
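
Both pictures describe the same recurrence: the hidden state is updated from the current input and the previous hidden state, e.g. \( \mathbf{h}_t = tanh(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b}) \). A minimal sketch in R (weights are random here, not trained):

set.seed(1)
input_size <- 3; hidden_size <- 2
W <- matrix(rnorm(hidden_size * input_size), hidden_size, input_size)
U <- matrix(rnorm(hidden_size * hidden_size), hidden_size, hidden_size)
b <- rep(0, hidden_size)

xs <- list(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1))  # a toy sequence of one-hot inputs
h <- rep(0, hidden_size)                        # initial hidden state
for (x in xs) {
  # the same W and U are reused at every time step -- that's the "unrolled" picture
  h <- tanh(W %*% x + U %*% h + b)
}
h  # final hidden state summarizes the whole sequence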

Long Short-Term Memory (LSTM): The need to forget

 

 

End of theory - time for practice!