Deep Learning in Action

Action!

Sources: [1],[3],[4],[5],[6]

So what is a neural network?

Biological neuron and artificial neuron

Source: [10]

Prototype of a neuron: the perceptron (Rosenblatt, 1958)

Source: [10]

Deep neural networks: introducing hidden layers

Source: [10]

Deep neural networks as function composition

 

A deep representation is a composition of many functions

 

\( x \xrightarrow{w_1} h_1 \xrightarrow{w_2} h_2 \xrightarrow{w_3} \dots \xrightarrow{w_n} h_n \xrightarrow{w_{n+1}} y \)
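To make the composition concrete, here is a minimal NumPy sketch; the layer sizes and the tanh nonlinearity are arbitrary choices for illustration:

```python
import numpy as np

def layer(W, b):
    """Return a function h(x) = tanh(W x + b) -- one layer of the composition."""
    return lambda x: np.tanh(W @ x + b)

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]  # input -> two hidden layers -> output (arbitrary)
layers = [layer(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=4)
h = x
for f in layers:      # y = f3(f2(f1(x))): a composition of functions
    h = f(h)
print(h)              # the network's output y
```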

Why go deep? A bit of background

 

Easy? Difficult?

  • walk
  • talk
  • play chess
  • solve matrix computations

Easy for us - difficult for computers

 

  • controlled movement
  • speech recognition
  • speech generation
  • object recognition and object localization

Representation matters

Source: [12]

Just feed the network the right features?

 

What are the correct pixel values for a “bike” feature?

  • race bike, mountain bike, e-bike?
  • pixels in shadow may be much darker
  • what if the bike is mostly obscured by the rider standing in front of it?

Let the network pick the features

… a layer at a time

Source: [12]

So how does a network learn?

 

Just a sec - let's meet a real neural network first!

Play around in the browser:

So how DOES a neural network learn?

 

We need:

  • a way to quantify our current (e.g., classification) error
  • a way to reduce error on subsequent iterations
  • a way to propagate the error signal from the output layer all the way back through the network!

Quantifying error: Loss functions

 

The loss (or cost) function quantifies the cost incurred by wrong predictions / misclassifications

Probably the best-known loss function in machine learning is mean squared error:

\( \frac{1}{n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2} \)

Most of the time, in deep learning we use cross entropy:

\( - \sum_j{t_j \log(y_j)} \)

This is the negative log probability of the right answer.
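As a concrete illustration, here are both losses in a few lines of NumPy (the vectors y and t below are made-up values):

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error: (1/n) * sum_i (y_hat_i - y_i)^2."""
    return np.mean((y_hat - y) ** 2)

def cross_entropy(y, t):
    """Cross entropy -sum_j t_j * log(y_j), for one-hot targets t
    and predicted class probabilities y."""
    return -np.sum(t * np.log(y))

y = np.array([0.7, 0.2, 0.1])   # predicted class probabilities
t = np.array([1.0, 0.0, 0.0])   # one-hot: the right answer is class 0
print(cross_entropy(y, t))      # = -log(0.7), the negative log prob of the right answer
```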

Learning from errors: Gradient Descent

Source: [12]
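A minimal gradient descent sketch: fit a one-parameter linear model to toy data by repeatedly stepping against the MSE gradient (the learning rate and the data are illustrative):

```python
import numpy as np

# Toy data: y = 3x + noise; we fit y_hat = w * x by gradient descent on MSE.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w, lr = 0.0, 0.1                         # initial weight, learning rate (illustrative)
for step in range(50):
    y_hat = w * x
    grad = np.mean(2 * (y_hat - y) * x)  # d/dw of (1/n) sum (w*x - y)^2
    w -= lr * grad                       # step against the gradient
print(w)                                 # close to 3.0
```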

Propagate back errors ... well: Backpropagation!

 

  • basically, just the chain rule: \( \frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx} \)
  • chained over several layers:
Source: [14]

Forward pass and backward pass: Intuition

  • imagine the output of \( f = (x + y) * z = -12 \) “wants” to get bigger
  • this could happen by \( q \) getting smaller -> \( q \) receives negative gradient \( \frac{df}{dq} = -4 \)
  • \( q \) just passes on this gradient to \( x \) and \( y \), as \( \frac{dq}{dx} = 1 \) and \( \frac{dq}{dy} = 1 \)
  • alternatively, it could happen by \( z \) getting bigger -> \( z \) receives positive gradient \( \frac{df}{dz} = 3 \)
Source: [13]
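The same computation in code (a minimal sketch; x = -2, y = 5, z = -4 is one assignment consistent with the slide's numbers):

```python
# Forward and backward pass for f = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# backward pass (chain rule, applied from the output back)
df_dq = z          # -4: q should get smaller for f to grow
df_dz = q          #  3: z should get bigger for f to grow
df_dx = df_dq * 1  # -4: q just passes its gradient on, since dq/dx = 1
df_dy = df_dq * 1  # -4: likewise, dq/dy = 1
print(df_dx, df_dy, df_dz)
```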

Applications by example

 

  • CNNs (Convolutional Neural Networks) for computer vision
  • RNNs (Recurrent Neural Networks) for Natural Language Processing
  • Deep Reinforcement Learning for real-life learning

Easy vs. hard, revisited

 

VISION

Why computer vision is hard

Source: [15]

Tasks in computer vision

Source: [13]

Classification

 

In classification, the required output is a probability for each class.

Source: [13]
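The standard way to turn the network's raw outputs into a probability per class is a softmax; a minimal sketch:

```python
import numpy as np

def softmax(logits):
    """Turn raw network outputs (logits) into a probability per class."""
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.66, 0.24, 0.10]; sums to 1
```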

Localization

 

In localization, the network needs to identify the position of an object in an image.

Source: [17]

Object Detection

 

In object detection, the network has to classify and localize multiple objects in an image.

Source: [18]

Segmentation

 

In segmentation, the network needs to predict a class value for each input pixel.

Source: [19]

Semantic vs. Instance Segmentation

 

Semantic segmentation assigns a class to every pixel; instance segmentation additionally distinguishes between individual instances of the same class.

Source: [20]

How do we identify the required features? Enter:

 

Convolutional Neural Networks

A convolutional neural network

 

Source: [13]

The convolution operation

 

(Strictly speaking, this is cross-correlation, not convolution, but for learned filters the distinction doesn't matter)

Source: [13]
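A minimal NumPy sketch of the operation (stride 1, no padding, so the output shrinks by kernel size minus one):

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image and take the elementwise
    product sum at each position (stride 1, 'valid' region only)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out
```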

Gimp demo

 

Blur: \( \frac{1}{9}\begin{bmatrix}1 & 1 & 1\\1 & 1 & 1\\1 & 1 & 1\end{bmatrix} \) (normalized so overall brightness is preserved), sharpen: \( \begin{bmatrix}0 & -1 & 0\\-1 & 5 & -1\\0 & -1 & 0\end{bmatrix} \), edge detect: \( \begin{bmatrix}0 & 1 & 0\\1 & -4 & 1\\0 & 1 & 0\end{bmatrix} \)

see: https://docs.gimp.org/en/plug-in-convmatrix.html
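To try the same kernels outside GIMP, here's a sketch using scipy.ndimage.convolve on a stand-in random grayscale image:

```python
import numpy as np
from scipy.ndimage import convolve

image = np.random.rand(64, 64)  # stand-in for a grayscale image

blur = np.ones((3, 3)) / 9.0    # box blur, normalized by 1/9
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)
edges = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

# ndimage.convolve flips the kernel (true convolution); these kernels are
# symmetric, so the result equals cross-correlation.
for name, k in [("blur", blur), ("sharpen", sharpen), ("edges", edges)]:
    print(name, convolve(image, k, mode="reflect").shape)
```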

Easy vs. hard, revisited

 

VISION

LANGUAGE

Until now, all we've seen are static snapshots

 

How do we handle sequences?

  • language: words, sentences, paragraphs…
  • all kinds of serial information: sensor data, stock prices…


 

Jane walked into the room. John walked in too. It was late in the day, and everyone was walking home after a long day at work. Jane said hi to ___

Source: [21]

How do we remember the past? Enter:

 

Recurrent neural networks

Hidden state

 

Sources: [22], [12]

Basic RNN closeup

 

The basic RNN at every step combines new input and existing state.

Source: [22]
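One step of such an RNN in NumPy; the sizes are arbitrary and the tanh nonlinearity is the usual choice:

```python
import numpy as np

def rnn_step(x, h, W_x, W_h, b):
    """One step of a basic RNN: combine new input x with existing state h."""
    return np.tanh(W_x @ x + W_h @ h + b)

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5                    # illustrative sizes
W_x = rng.normal(size=(n_hidden, n_in))
W_h = rng.normal(size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                   # initial state
for x in rng.normal(size=(4, n_in)):     # a sequence of 4 inputs
    h = rnn_step(x, h, W_x, W_h, b)      # the state carries the past forward
print(h)
```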

Remembering is not enough

 

Sometimes we also need to forget!

Two kinds of state: the LSTM "conveyor belt"

 

The LSTM (Long Short-Term Memory) architecture adds a second kind of state, the cell state

Source: [22]

LSTM cell state and the three gates

 

The LSTM cell state is protected by three gates, the forget, input, and output gates:

Source: [22]
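A sketch of one LSTM step in NumPy with the three gates made explicit; the weights are random, and stacking all four transforms into one matrix W is just a common convenience:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps [h; x] to the stacked pre-activations of the
    forget gate, input gate, candidate values, and output gate."""
    z = W @ np.concatenate([h, x]) + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # the three gates
    g = np.tanh(g)                                # candidate cell values
    c = f * c + i * g      # forget some old state, write some new state
    h = o * np.tanh(c)     # the output gate decides what the cell reveals
    return h, c

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W = rng.normal(size=(4 * n_h, n_h + n_in))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```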

RNN Example: Machine Translation

 

In translation, we have two sets of sequential data: one on the source side and one on the target side!

Enter: sequence-to-sequence models

Source: [23]

Real-life seq-2-seq: Google's Neural Machine Translation System

 

Source: [24]

Easy vs. hard, revisited

 

VISION

LANGUAGE

VISION <-> LANGUAGE

Generating image captions

 

Source: [26]

Easy vs. hard, revisited

 

VISION

LANGUAGE

VISION <-> LANGUAGE

VISION <-> SOUND

Linking video and sound: adding audio to silent film

Source: [27]

Easy vs. hard, revisited

 

VISION

LANGUAGE

VISION <-> LANGUAGE

VISION <-> SOUND

LIFE (SORT OF)

Life

 

Reinforcement learning

Source: [1]

Reinforcement learning: the task

 

Source: [1]

Reinforcement learning: The problem

 

If I get a reward many, many actions later…

… how do I find out which concrete action I'm being rewarded for? (This is the credit assignment problem.)

Reinforcement learning: The dilemma

 

EXPLOITATION vs. EXPLORATION

How to learn from delayed rewards: Q-Learning

 

Learn to maximize future (discounted) reward:

Source: [1]
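The referenced update is presumably the standard tabular Q-learning rule, \( Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \). A minimal sketch, with illustrative sizes and hyperparameters (deep variants replace the table with a network):

```python
import numpy as np

n_states, n_actions = 10, 4         # illustrative sizes
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99            # learning rate, discount factor

def q_update(s, a, r, s_next):
    """Tabular Q-learning: move Q(s, a) toward the bootstrapped target
    r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```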

The quest for real intelligence

 

  • supervised learning: reasoning by memorization
  • pure reinforcement learning: reasoning by trial-and-error
  • can we get less brute-force here?

 

 

“Reinforcement learning + deep learning = AI” (David Silver, Google DeepMind)

Deep Reinforcement Learning @AlphaGo

 

Source: [29]

Deep Learning, where to go next?

 

For structured reading:

Just wanna have some cool fun?

Thanks for your attention!

Sources (1)

Sources (2)

Sources (3)

Sources (4)