Dive into Deep Learning

Sigrid Keydana, Trivadis
2017/01/08

Agenda

 

Part 1: Getting a feel for deep learning

 

  • Intro: What is Deep Learning and how does it work?
  • Implementing a neural network in NumPy
  • Linear regression using DL frameworks - meet Keras, TensorFlow, and PyTorch
  • Under the hood: Backpropagation in NumPy vs. TensorFlow vs. PyTorch

Agenda

 

Part 2: Going deeper with CNNs, LSTMs and hyperparameter tuning

 

  • Convolutional Neural Networks (CNNs) for image classification
  • Long Short-Term Memory (LSTM) for sequential data
  • Hyperparameter optimization with Keras and its scikit-learn API

About this course

 

  • Focus is on code and examples (Python)
  • All code is made available as Jupyter notebooks
  • Main framework used is Keras
  • But part 1 also demonstrates examples in TensorFlow and PyTorch, for comparison

Notebooks

 

Part 1

  • Implementing a deep neural network with NumPy: dl_with_numpy.ipynb
  • Linear regression, the usual way (using scikit-learn): linear_regression_sklearn.ipynb
  • Keras basics: keras_basics.ipynb
  • Linear regression with Keras: linear_regression_keras.ipynb
  • TensorFlow basics: tensorflow_basics.ipynb
  • Linear regression with TensorFlow: linear_regression_tensorflow.ipynb
  • PyTorch basics: pytorch_basics.ipynb
  • Linear regression with PyTorch: linear_regression_pytorch.ipynb
  • Backpropagation in NumPy vs TensorFlow vs PyTorch: backprop_numpy_tf_pytorch.ipynb

Notebooks

 

Part 2

  • Convolutional Neural Networks with Keras (1): cnn_keras.ipynb
  • Convolutional Neural Networks with Keras (2): cnn_keras_pretrained.ipynb
  • Convolutional Neural Networks with Keras (3): cnn_keras_pretrained_2.ipynb
  • Long Short-Term Memory (LSTM) with Keras (1): lstm_keras_1.ipynb
  • Long Short-Term Memory (LSTM) with Keras (2): lstm_keras_2.ipynb
  • Hyperparameter optimization using Keras and the scikit-learn API: optimization.ipynb

 

 

Why Deep Learning?

Easy? Difficult?

 

  • playing chess
  • solving matrix computations

  • recognizing objects

  • understanding speech

  • talking

  • walking

  • driving

Easy? Difficult?

 

What's difficult for us is easy for the machine, and vice versa.

Representations matter

 

Source: Goodfellow et al. 2016, Deep Learning

Just feed the network the right features?

 

What are the correct pixel values for a “bike” feature?

  • race bike, mountain bike, e-bike?
  • pixels in the shadow may be much darker
  • what if the bike is mostly obscured by the rider standing in front of it?

Let the network pick the features

 

… a layer at a time!

Source: Goodfellow et al. 2016, Deep Learning

Deep Learning, two ways to think about it

 

 

 

A short history of deep learning

The first wave: cybernetics (1940s - 1960s)

 

  • neuroscientific motivation
  • linear models

McCulloch-Pitts Neuron (MCP, 1943, a.k.a. Logic Circuit)

 

  • binary output (0 or 1)
  • neurons may have inhibiting (negative) and excitatory (positive) inputs
  • each neuron has a threshold: the sum of its excitatory inputs must reach it for the neuron to fire (output 1)
  • a single active inhibitory input prevents the neuron from firing
Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf
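
A minimal sketch of an MCP unit in R (the inputs and threshold below are made up for illustration; using >= for the threshold is one common convention):

mcp_neuron <- function(excitatory, inhibitory, threshold) {
  # a single active inhibitory input silences the neuron
  if (any(inhibitory == 1)) return(0)
  # otherwise it fires iff the summed excitatory inputs reach the threshold
  as.integer(sum(excitatory) >= threshold)
}

mcp_neuron(excitatory = c(1, 1), inhibitory = 0, threshold = 2)  # 1 -- behaves like AND
mcp_neuron(excitatory = c(1, 0), inhibitory = 1, threshold = 1)  # 0 -- inhibition wins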

Perceptron (Rosenblatt, 1958): Great expectations

 

  • compute linear combination of inputs
  • return +1 if result is positive, -1 if result is negative
Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf
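
In R, the prediction step looks like this (the weights and bias here are assumptions, not a trained model):

perceptron <- function(x, w, b) {
  # linear combination of the inputs, thresholded at zero
  if (sum(w * x) + b > 0) 1 else -1
}

perceptron(c(1, 1), w = c(1, 1), b = -1)  # +1
perceptron(c(0, 0), w = c(1, 1), b = -1)  # -1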

Minsky & Papert (1969), "Perceptrons": the great disappointment

 

  • Perceptrons can only solve linearly separable problems
  • Big loss of interest in neural networks

The second wave: Connectionism (1980s - mid-1990s)

 

  • distributed representations
  • backpropagation

Backpropagation, the magic ingredient

 

Several “origins” in different fields, see e.g.

  • Henry J. Kelley (1960). Gradient theory of optimal flight paths. Ars Journal, 30(10), 947-954.

  • Arthur E. Bryson (1961, April). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard Univ. Symposium on digital computers and their applications.

  • Paul Werbos (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.

  • Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). “Learning representations by back-propagating errors”. Nature. 323 (6088): 533-536.

Back then: How could the magic fail?

 

  • Only applicable in case of supervised learning
  • Doesn't scale well to multiple layers (or so they thought at the time)
  • Can converge to poor local minima (or so they thought at the time)

The third wave: Deep Learning

 

  • It all starts with: Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527-1554.
  • Deep neural networks can be trained efficiently, if the weights are initialized intelligently
  • Return of backpropagation

The architectures en vogue now (CNN, LSTM...) have mostly been around since the 1980s/1990s.

 

So why the hype, ehm, success now?

  • big data
  • big models (due to better hardware, mostly)
  • big incentives (deep learning is profitable!)

 

 

How does a deep network learn?

Feedforward Deep Neural Network

Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf

Why hidden layers? Learning XOR

 

We want to predict

  • 0 from [0,0]
  • 0 from [1,1]
  • 1 from [0,1]
  • 1 from [1,0]

Trying a linear model

 

\( f(\mathbf{x}; \mathbf{w}, b) = \mathbf{x}^T\mathbf{w} + b \)

  • with Mean Squared Error (MSE) cost, this leads to: \( \mathbf{w}=0, b=0.5 \)
  • mapping every point to 0.5 (see below)!
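
A quick check in R (this data frame and lm call are mine, not from the course notebooks): fitting ordinary least squares to the four XOR points indeed gives an intercept of 0.5 and zero weights.

xor_data <- data.frame(x1 = c(0, 0, 1, 1), x2 = c(0, 1, 0, 1), y = c(0, 1, 1, 0))
coef(lm(y ~ x1 + x2, data = xor_data))
# (Intercept)          x1          x2
#         0.5         0.0         0.0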

Introduce hidden layer

 

\( f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T (\mathbf{W}^T\mathbf{x} + \mathbf{c}) + b \)

Source: Goodfellow et al. 2016, Deep Learning

Calculation with hidden layer (1)

 

(X <- matrix(c(0,0,0,1,1,0,1,1), nrow = 4, ncol = 2, byrow = TRUE))
     [,1] [,2]
[1,]    0    0
[2,]    0    1
[3,]    1    0
[4,]    1    1
(W <- matrix(c(1,1,1,1), nrow = 2, ncol = 2, byrow = TRUE))
     [,1] [,2]
[1,]    1    1
[2,]    1    1
(c <- matrix(c(0,-1), nrow = 1, ncol = 2))
     [,1] [,2]
[1,]    0   -1

Calculation with hidden layer (2)

 

(matmul <- X %*% W)
     [,1] [,2]
[1,]    0    0
[2,]    1    1
[3,]    1    1
[4,]    2    2
(hidden <- matmul + rbind(c, c, c, c))
     [,1] [,2]
[1,]    0   -1
[2,]    1    0
[3,]    1    0
[4,]    2    1

Which gives us...

 

[Plot: the four XOR points after the linear hidden-layer transformation]

Introducing nonlinearity

 

\( f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T \max(0, \mathbf{W}^T\mathbf{x} + \mathbf{c}) + b \)
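
Reusing hidden from the previous slide, this can be computed with R's pmax (elementwise max); the exact notebook code may differ:

hidden_relu <- data.frame(pmax(hidden, 0))  # elementwise max(0, .)
names(hidden_relu) <- c("X1", "X2")
hidden_relu$CLASS <- c(0, 1, 1, 0)          # XOR targets for [0,0], [0,1], [1,0], [1,1]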

 

hidden_relu 
  X1 X2 CLASS
1  0  0     0
2  1  0     1
3  1  0     1
4  2  1     0

Introducing nonlinearity

 

[Plot: the points after applying the ReLU - classes 0 and 1 are now linearly separable]

The remaining hidden-to-output transformation is linear.

But the classes are already linearly separable.

Optimization

 

  • Like other machine learning algorithms, neural networks learn by minimizing a cost function.
  • Cost functions in neural networks are normally not convex and so cannot be optimized in closed form.
  • The solution is to do gradient descent.
Source: Goodfellow et al. 2016, Deep Learning

Local minima

 

Source: Goodfellow et al. 2016, Deep Learning

Closed-form vs. gradient descent optimization for Least Squares

 

  • Minimize the squared error \( f(\hat{\beta}) = ||\mathbf{X}\hat{\beta} - \mathbf{y}||^2_2 \)
  • Closed form: solve the normal equations \( \hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \)
  • Alternatively, follow the gradient \( \nabla_{\hat{\beta}} f(\hat{\beta}) = \mathbf{X}^T\mathbf{X}\hat{\beta} - \mathbf{X}^T\mathbf{y} \) downhill (illustrated below)
Source: Goodfellow et al. 2016, Deep Learning
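
A small illustration in R with simulated data (the step size and iteration count are arbitrary choices):

set.seed(42)
X <- cbind(1, matrix(rnorm(200), ncol = 2))   # 100 x 3 design matrix, with intercept column
beta_true <- c(1, 2, -3)
y <- X %*% beta_true + rnorm(100, sd = 0.1)

# closed form: solve the normal equations
beta_closed <- solve(t(X) %*% X, t(X) %*% y)

# gradient descent on the squared error
beta_gd <- rep(0, 3)
for (i in 1:500) {
  grad <- t(X) %*% (X %*% beta_gd - y)        # X^T X beta - X^T y
  beta_gd <- beta_gd - 0.001 * grad
}
cbind(beta_closed, beta_gd)                   # both close to (1, 2, -3)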

This gives us a way to train one weight matrix.

 

How about a net with several layers?

Backpropagation

 

Backprop example: logistic neuron

 

Source: Geoffrey Hinton, Neural Networks for Machine Learning Lec. 3
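
For a single logistic neuron with squared error \( E = \frac{1}{2}(t - y)^2 \), the chain rule gives \( \frac{\partial E}{\partial w_i} = -(t - y)\, y(1 - y)\, x_i \). A numerical check in R (inputs, weights and target are made up):

sigmoid <- function(z) 1 / (1 + exp(-z))

x <- c(1, 0.5); w <- c(0.2, -0.3); b <- 0.1; target <- 1
y <- sigmoid(sum(w * x) + b)

# analytic gradient from the chain rule
grad_analytic <- -(target - y) * y * (1 - y) * x

# finite-difference check
E <- function(w) 0.5 * (target - sigmoid(sum(w * x) + b))^2
eps <- 1e-6
grad_numeric <- c((E(w + c(eps, 0)) - E(w - c(eps, 0))) / (2 * eps),
                  (E(w + c(0, eps)) - E(w - c(0, eps))) / (2 * eps))
rbind(grad_analytic, grad_numeric)            # the two rows should agree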

Decisions (1): Which loss function should I choose?

 

The loss (or cost) function quantifies the cost incurred by wrong predictions / misclassifications.

  • probably the best-known loss function in machine learning is mean squared error:

    \( \frac{1}{n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2} \)

  • for classification tasks in deep learning, we mostly use cross entropy:

    \( - \sum_j{t_j \log(y_j)} \)

    This is the negative log probability of the right answer.
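
In R, for a single example with a one-hot target (the numbers are invented for illustration):

y_hat <- c(0.7, 0.2, 0.1)    # predicted class probabilities
t_onehot <- c(1, 0, 0)       # one-hot encoding of the right answer

mean((y_hat - t_onehot)^2)   # mean squared error
-sum(t_onehot * log(y_hat))  # cross entropy = -log(0.7), the negative log prob of the right class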

Decisions (2): Which activation function to choose?

 

  • for a long time, the sigmoid (logistic) activation function was the default choice: \( y = \frac{1}{1 + e^{-z}} \)
  • nowadays, rectified linear units (ReLUs) are preferred: \( y = \max(0, z) \)
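
Both are one-liners in R:

sigmoid <- function(z) 1 / (1 + exp(-z))  # squashes into (0, 1), saturates for large |z|
relu <- function(z) pmax(0, z)            # cheap to compute, gradient is 1 for z > 0

z <- c(-2, 0, 2)
sigmoid(z)  # ~0.12 0.50 0.88
relu(z)     # 0 0 2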

 

 

Deep Learning Architectures

Convolutional Neural Networks

 

  • standard feedforward networks require fixed-size input (images often don't come in one size!)
  • the convolution operation extracts image features independently of where in the image they occur
Source: http://cs231n.github.io/convolutional-networks/

The Convolution Operation

Source: http://cs231n.github.io/convolutional-networks/ (Live Demo on website!)
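
A minimal sketch of the operation in R (strictly, the cross-correlation most DL frameworks implement), using a made-up 4x4 "image" and a 2x2 filter:

convolve2d <- function(img, filt) {
  out_rows <- nrow(img) - nrow(filt) + 1
  out_cols <- ncol(img) - ncol(filt) + 1
  out <- matrix(0, out_rows, out_cols)
  for (i in 1:out_rows)
    for (j in 1:out_cols)
      # slide the filter over the image, take elementwise products and sum
      out[i, j] <- sum(img[i:(i + nrow(filt) - 1), j:(j + ncol(filt) - 1)] * filt)
  out
}

img <- matrix(c(1, 0, 0, 1,
                0, 1, 1, 0,
                0, 1, 1, 0,
                1, 0, 0, 1), nrow = 4, byrow = TRUE)
filt <- matrix(c(1, -1,
                -1, 1), nrow = 2, byrow = TRUE)   # responds to diagonal structure
convolve2d(img, filt)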

Play around in Gimp

 

Playing around with the convolution matrix filter in Gimp is a great way to gain intuition.

Source: https://docs.gimp.org/en/plug-in-convmatrix.html

Recurrent neural networks (RNNs)

 

Why add recurrence?

  • "normal" feedforward networks cannot process sequential data of varying length
  • in NLP, the n-gram approach cannot capture long-range dependencies

Jane walked into the room. John walked in too. It was late in the day, and everyone was walking home after a long day at work. Jane said hi to ___

(see Stanford CS 224D Deep Learning for NLP Lecture Notes)

Two representations of RNNs

 

Source: Goodfellow et al. 2016, Deep Learning
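
Both pictures describe the same recurrence: the hidden state is updated from the current input and the previous hidden state, e.g. \( \mathbf{h}_t = tanh(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b}) \). A minimal sketch in R (weights are random here, not trained):

set.seed(1)
input_size <- 3; hidden_size <- 2
W <- matrix(rnorm(hidden_size * input_size), hidden_size, input_size)
U <- matrix(rnorm(hidden_size * hidden_size), hidden_size, hidden_size)
b <- rep(0, hidden_size)

xs <- list(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1))  # a toy sequence of one-hot inputs
h <- rep(0, hidden_size)                        # initial hidden state
for (x in xs) {
  # the same W and U are reused at every time step -- that's the "unrolled" picture
  h <- tanh(W %*% x + U %*% h + b)
}
h  # final hidden state summarizes the whole sequence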

Long Short-Term Memory (LSTM): The need to forget

 

 

End of theory - time for practice!