Deep learning, concepts and frameworks: Find your way through the jungle (Part 1)

Sigrid Keydana, Trivadis
02/06/2018

Why deep learning?

The idea itself is old ...

The Perceptron (Rosenblatt, 1958)

Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf

The magic ingredients have been around for a long time ...

Backpropagation (1980s)	Convolutional Neural Networks (1990s)	Long Short-Term Memory (1990s)

Sources: [1],[2],[3]

... but the hype is here NOW!



Sources: [4],[5],[6],[7],[8], [9]

Deep Learning in computer vision



Sources: [10], [11], [12], [13] (great reading list at: https://github.com/kjw0612/awesome-deep-vision)

Deep Learning in natural language processing



Sources: [14], [15], [16], [17]

Deep Learning for image (text/video/audio...) generation

Sources: [18], [19]

Deep Learning for winning games: AlphaGo


Sources: [20], [21]

Deep Learning for time series forecasting

Deep Learning in traditional data science

fraud detection
anomaly detection
churn analysis
recommender systems
…

Deep learning, how does it work?

Rule-based systems vs. machine learning vs. deep learning

	Pre-ML programs	Machine learning (ML)	Deep learning (DL)
in	data, rules	data	data
learn	-	mappings and functions	features, mappings and functions
out	conclusion	conclusion	conclusion

How does "normal machine learning" work?

Supervised learning:
- there exists a ground truth / labels we can use to train the algorithm
- train on training set, then test on test set
- regression / classification
Unsupervised learning:
- no ground truth exists
- clustering, principal components analysis…
Reinforcement learning:
- learn from (delayed!) rewards
- exploitation vs. exploration
- in practice, often “sped up” by deep learning

Cost function

Supervised learning (deep or not) works by minimizing a cost function that quantifies how far the predictions are from the ground truth.

In regression, this normally is mean squared error:

\( \frac{1}{n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2} \)

whereas in classification it is mostly cross entropy:

\( - \sum_j{t_j \log(y_j)} \) (to be averaged over the training set)
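
As a minimal sketch of the two cost functions above, the following plain-Python snippet evaluates them on made-up numbers. The function names and the sample values are illustrative assumptions, not part of the original slides.

```python
import math

def mean_squared_error(y_hat, y):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def cross_entropy(t, y):
    """Cross entropy for one example.

    t is a one-hot target vector, y the predicted class probabilities.
    Terms with t_j == 0 contribute nothing, so they are skipped
    (this also avoids evaluating log(0) for classes that are not the target).
    """
    return -sum(t_j * math.log(y_j) for t_j, y_j in zip(t, y) if t_j > 0)

# Made-up regression predictions and targets
mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])

# Made-up three-class example: the true class is the second one
ce = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])

print(mse, ce)
```

For a one-hot target, the cross entropy reduces to the negative log of the probability assigned to the true class, so a confident correct prediction gives a loss close to zero.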

How can we minimize that cost function?

Going to the minimum in one step: least squares example

Goal: Minimize squared error \( f(\mathbf{\hat{\beta}}) = ||\mathbf{X\hat{\beta} - y}||^2_2 \)
Solution: Solve normal equations \( \mathbf{\hat{\beta}} = (\mathbf{X}^T\mathbf{{X}})^{-1} \mathbf{X}^T \mathbf{y} \)
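
To make the one-step solution concrete, here is a small sketch for the simplest case, a straight line with intercept. It assumes a design matrix with the two columns (1, x), so that \( \mathbf{X}^T\mathbf{X} \) is a 2x2 matrix that can be inverted by hand. The function name and the toy data are assumptions made for illustration.

```python
def fit_least_squares(xs, ys):
    """Solve the normal equations for y ~ b0 + b1 * x.

    With design matrix columns (1, x):
      X^T X = [[n,  sx ],
               [sx, sxx]]
      X^T y = [sy, sxy]
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X
    # Multiply the explicit 2x2 inverse of X^T X by X^T y
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

# Toy data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_least_squares(xs, ys)
print(b0, b1)
```

In practice one would rather call a linear algebra routine than invert \( \mathbf{X}^T\mathbf{X} \) explicitly; the hand-written inverse is only meant to show that the minimum is reached in a single step.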


Iterative optimisation for least squares: gradient descent

Goal: Minimize squared error \( f(\mathbf{\hat{\beta}}) = ||\mathbf{X\hat{\beta} - y}||^2_2 \)
Solution: Iteratively follow the gradient “downhill”: \( \mathbf{\hat{\beta}} \leftarrow \mathbf{\hat{\beta}} - \epsilon \nabla f(\mathbf{\hat{\beta}}) = \mathbf{\hat{\beta}} - 2 \epsilon (\mathbf{X}^T\mathbf{X}\mathbf{\hat{\beta}} - \mathbf{X^Ty}) \)


Source: [22]
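
The iterative update for least squares can be sketched in a few lines of plain Python. The step size `epsilon`, the number of steps, and the toy data are assumptions chosen so that this small example converges; the gradient is written out per data point for a line with intercept, including the constant factor 2 that comes from differentiating the square.

```python
def gradient(beta, xs, ys):
    """Gradient of the summed squared error for the model y ~ beta[0] + beta[1] * x."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        residual = beta[0] + beta[1] * x - y
        g0 += 2.0 * residual      # derivative w.r.t. the intercept
        g1 += 2.0 * residual * x  # derivative w.r.t. the slope
    return g0, g1

def gradient_descent(xs, ys, epsilon=0.02, steps=2000):
    """Start at zero and repeatedly take a small step against the gradient."""
    beta = [0.0, 0.0]
    for _ in range(steps):
        g0, g1 = gradient(beta, xs, ys)
        beta[0] -= epsilon * g0
        beta[1] -= epsilon * g1
    return beta

# Toy data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta = gradient_descent(xs, ys)
print(beta)
```

If `epsilon` is chosen too large, the iterates overshoot and diverge; if it is too small, many more steps are needed. That trade-off is the price for not solving the problem in one step.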

So what about deep learning?

Again, we have

a cost function we want to minimize
an algorithm that does this minimization for us (some form of gradient descent)

But this time, we have to find out how to update the weights all through the network!

Enter: Backpropagation

basically, just the chain rule: \( \frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx} \)
chained over several layers:

Source: [23]

Backpropagation step-by-step

We'll gradually build up the gradient of the loss with respect to \( y_i \), the output of the last hidden layer, and the weights \( w_{ij} \), respectively.

Here

\( i \) and \( j \) are layers of the network (\( j \) being the output layer)
\( y_l \) is the output of layer \( l \)
\( z_l \) is the aggregated input going into layer \( l \) (before the activation function is applied)

Learning weights by backpropagation (1)

Step 1: Loss w.r.t. output (prediction)

At the output layer \( j \), we compare the class prediction \( y_j \) with the actual class \( t \), using binary cross entropy/logistic loss: \( E = - (t \log(y_j) + (1-t) \log(1-y_j)) \)

The gradient

\( \frac{dE}{dy_j} = - (\frac{t}{y_j} - \frac{1-t}{1-y_j}) = \frac{y_j-t}{y_j(1-y_j)} \)

describes how the error \( E \) changes as the prediction \( y_j \) changes.
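
One way to convince oneself of the gradient in step 1 is to compare it with a finite-difference approximation. The sketch below does this for a single, arbitrarily chosen prediction and label; the function names and the values of `y`, `t` and `h` are assumptions for illustration.

```python
import math

def bce(y, t):
    """Binary cross entropy for prediction y in (0, 1) and label t in {0, 1}."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def bce_grad(y, t):
    """Analytic gradient of the loss with respect to the prediction y."""
    return (y - t) / (y * (1 - y))

y, t, h = 0.8, 1.0, 1e-6

# Central finite difference: (E(y + h) - E(y - h)) / (2h)
numeric = (bce(y + h, t) - bce(y - h, t)) / (2 * h)
analytic = bce_grad(y, t)
print(numeric, analytic)
```

Both values should agree up to numerical error. The gradient is negative here because the prediction lies below the label, so increasing the prediction decreases the loss.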

Learning weights by backpropagation (2)

Step 2: How does the prediction/output \( y_j \) change as the input \( z_j \) to the final neuron changes?

In the case of a logistic (sigmoid) neuron with output \( y_j \), this is described by the gradient

\( \frac{dy_j}{dz_j} = y_j(1-y_j) \).
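
The sigmoid derivative from step 2 can be checked the same way. This is a minimal sketch with an arbitrarily chosen input `z`; the function names are assumptions.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real input to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid, expressed through its own output y."""
    y = sigmoid(z)
    return y * (1.0 - y)

z, h = 0.3, 1e-6

# Central finite difference as an independent check
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid_grad(z)
print(numeric, analytic)
```

Because the derivative only depends on the output \( y_j \), it can be computed during the backward pass without re-evaluating the exponential.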

Learning weights by backpropagation (3)

Step 3a: How does the input \( z_j \) to the final layer change as the weight \( w_{ij} \) changes?

Here the gradient is

\( \frac{dz_j}{dw_{ij}} = y_i \),

that is, the output of layer \( i \).

Learning weights by backpropagation (4)

Step 3b: How does the input \( z_j \) to the final layer change as the output of layer \( i \) changes?

Here we have to take into account all connections a neuron \( y_i \) has to the output layer: The gradient is

\( \sum_j \frac{dz_j}{dy_i} = \sum_j w_{ij} \),

that is, the sum of the weights going out of \( y_i \).
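
Both parts of step 3 follow from how the input to a neuron is formed. Assuming a plain fully connected layer and leaving out a bias term (the slides do not mention one), the input to neuron \( j \) is the weighted sum of the outputs of layer \( i \):

\( z_j = \sum_i w_{ij} y_i \)

Differentiating this sum with respect to a single weight leaves only the output that weight multiplies, and differentiating with respect to a single output leaves only the weight attached to it:

\( \frac{dz_j}{dw_{ij}} = y_i \qquad \frac{dz_j}{dy_i} = w_{ij} \)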

Learning weights by backpropagation: Putting it all together

Now that we have the individual gradients, we use the chain rule to backpropagate the error:

The gradient of the loss w.r.t. \( y_i \) is

\( \frac{dE}{dy_i} = \sum_j \frac{y_j-t}{y_j(1-y_j)} \, y_j(1-y_j) \, w_{ij} = \sum_j (y_j-t) \, w_{ij} \)

Accordingly, we get the gradient of the loss w.r.t. \( w_{ij} \) as

\( \frac{dE}{dw_{ij}} = \frac{y_j-t}{y_j(1-y_j)} \, y_j(1-y_j) \, y_i = (y_j-t) \, y_i \)
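
The steps can be chained together in a short sketch. It assumes the simplest possible setting: a single sigmoid output neuron without bias, fed by three hidden outputs, with logistic loss. All names and numbers are illustrative assumptions. A finite-difference check on one weight serves as a sanity test.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(y_hidden, w, t):
    """Forward pass through one sigmoid output neuron, then logistic loss."""
    z_j = sum(w_i * y_i for w_i, y_i in zip(w, y_hidden))
    y_j = sigmoid(z_j)
    return -(t * math.log(y_j) + (1 - t) * math.log(1 - y_j))

def gradients(y_hidden, w, t):
    """Backward pass: multiply the local gradients of steps 1 to 3."""
    z_j = sum(w_i * y_i for w_i, y_i in zip(w, y_hidden))
    y_j = sigmoid(z_j)
    dE_dy = (y_j - t) / (y_j * (1 - y_j))      # step 1: loss w.r.t. prediction
    dy_dz = y_j * (1 - y_j)                    # step 2: prediction w.r.t. input
    dE_dz = dE_dy * dy_dz                      # simplifies to y_j - t
    dE_dw = [dE_dz * y_i for y_i in y_hidden]  # step 3a: w.r.t. each weight
    dE_dy_hidden = [dE_dz * w_i for w_i in w]  # step 3b: w.r.t. each hidden output
    return dE_dw, dE_dy_hidden

y_hidden = [0.5, 0.1, 0.9]  # made-up outputs of the last hidden layer
w = [0.4, -0.6, 0.2]        # made-up weights into the output neuron
t = 1.0                     # true class

dE_dw, dE_dy_hidden = gradients(y_hidden, w, t)

# Finite-difference check for the first weight
h = 1e-6
w_plus = [w[0] + h] + w[1:]
w_minus = [w[0] - h] + w[1:]
numeric_dw0 = (loss(y_hidden, w_plus, t) - loss(y_hidden, w_minus, t)) / (2 * h)
print(dE_dw[0], numeric_dw0)
```

With more than one output neuron, the gradient with respect to a hidden output would additionally sum the contributions of all output neurons it feeds into.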

Continuation

see part 2!