Deep learning, concepts and frameworks: Find your way through the jungle (Part 1)

Sigrid Keydana, Trivadis
02/06/2018

 

 

Why deep learning?

The idea itself is old ...

 

The Perceptron (Rosenblatt, 1958)

Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf

The magic ingredients have been around for a long time ...

 

Backpropagation (1980s), Convolutional Neural Networks (1990s), Long Short-Term Memory (1990s)
Sources: [1], [2], [3]

... but the hype is here NOW!

 

Sources: [4],[5],[6],[7],[8], [9]

Deep Learning in computer vision

 

Sources: [10], [11], [12], [13]
(great reading list at: https://github.com/kjw0612/awesome-deep-vision)

Deep Learning in natural language processing

 

Sources: [14], [15], [16], [17]

Deep Learning for image (text/video/audio...) generation

 

Sources: [18], [19]

Deep Learning for winning games: AlphaGo

 

Sources: [20], [21]

Deep Learning for time series forecasting

 

Deep Learning in traditional data science

 

  • fraud detection
  • anomaly detection
  • churn analysis
  • recommender systems

 

 

Deep learning, how does it work?

Rule-based systems vs. machine learning vs. deep learning

 

         Pre-ML programs    Machine learning (ML)     Deep learning (DL)
in       data, rules        data                      data
learn    -                  mappings and functions    features, mappings and functions
out      conclusion         conclusion                conclusion

How does "normal machine learning" work?

 

  • Supervised learning:

    • there exists a ground truth / labels we can use to train the algorithm
    • train on training set, then test on test set
    • regression / classification
  • Unsupervised learning:

    • no ground truth exists
    • clustering, principal components analysis…
  • Reinforcement learning:

    • learn from (delayed!) rewards
    • exploitation vs. exploration
    • in practice, often “sped up” by deep learning

Cost function

 

Supervised learning (deep or not) works by minimizing a cost function that quantifies how much the predictions differ from the ground truth.

In regression, this is normally the mean squared error:

\( \frac{1}{n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2} \)

whereas in classification it is mostly the cross entropy:

\( - \sum_j{t_j \log(y_j)} \) (to be averaged over the training set)
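As a minimal sketch of the two cost functions above, in plain Python: the function names and the toy numbers are made up for illustration, and the cross entropy is computed for a single example with a one-hot target.

```python
import math

def mean_squared_error(y_hat, y):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def cross_entropy(t, y):
    """Cross entropy for one example.

    t: one-hot target vector, y: predicted class probabilities.
    Terms with t_j == 0 contribute nothing and are skipped.
    """
    return -sum(t_j * math.log(y_j) for t_j, y_j in zip(t, y) if t_j > 0)

# Toy values (hypothetical, for illustration only)
mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
ce = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
print(mse, ce)
```

With a one-hot target, the cross entropy reduces to the negative log of the probability assigned to the true class.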

How can we minimize that cost function?

 

Source: [31]

Going to the minimum in one step: least squares example

 

  • Goal: Minimize squared error \( f(\mathbf{\hat{\beta}}) = ||\mathbf{X\hat{\beta} - y}||^2_2 \)
  • Solution: Solve normal equations \( \mathbf{\hat{\beta}} = (\mathbf{X}^T\mathbf{{X}})^{-1} \mathbf{X}^T \mathbf{y} \)
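A small sketch of the one-step solution, assuming the simplest possible case: a design matrix with an intercept column and one predictor, so that \( \mathbf{X}^T\mathbf{X} \) is 2x2 and can be inverted by hand. The function name and the data are hypothetical.

```python
def fit_normal_equations(xs, ys):
    """Solve (X^T X) beta = X^T y for X with columns [1, x]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy];
    # invert the 2x2 matrix explicitly.
    det = n * sxx - sx * sx
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Toy data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_normal_equations(xs, ys)
print(intercept, slope)
```

For more than a handful of predictors one would normally call a linear algebra routine instead of inverting by hand.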


Iterative optimisation for least squares: gradient descent

 

  • Goal: Minimize squared error \( f(\mathbf{\hat{\beta}}) = ||\mathbf{X\hat{\beta} - y}||^2_2 \)
  • Solution: Iteratively follow the gradient “downhill”: \( \mathbf{\hat{\beta}} \leftarrow \mathbf{\hat{\beta}} - \epsilon \nabla_{\mathbf{\hat{\beta}}} f(\mathbf{\hat{\beta}}) = \mathbf{\hat{\beta}} - 2 \epsilon (\mathbf{X}^T\mathbf{X}\mathbf{\hat{\beta}} - \mathbf{X^Ty}) \)
Source: [22]
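The iterative variant can be sketched in the same toy setting (intercept plus one predictor). The gradient of the squared error is \( 2\mathbf{X}^T(\mathbf{X\hat{\beta}} - \mathbf{y}) \); the constant factor 2 could equally be absorbed into the step size. Step size, number of steps and data are arbitrary choices for this illustration.

```python
def gradient(beta, xs, ys):
    """Gradient of ||X beta - y||^2, i.e. 2 X^T (X beta - y), for X with columns [1, x]."""
    residuals = [beta[0] + beta[1] * x - y for x, y in zip(xs, ys)]
    return [2 * sum(residuals),
            2 * sum(r * x for r, x in zip(residuals, xs))]

def gradient_descent(xs, ys, epsilon=0.01, steps=5000):
    """Repeatedly step against the gradient, starting from beta = (0, 0)."""
    beta = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(beta, xs, ys)
        beta = [b - epsilon * g_k for b, g_k in zip(beta, g)]
    return beta

# Toy data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta = gradient_descent(xs, ys)
print(beta)
```

If the step size is chosen too large, the iteration diverges instead of approaching the least squares solution.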

So what about deep learning?

 

Again, we have

  • a cost function we want to minimize
  • an algorithm that does this minimization for us (some form of gradient descent)

But this time, we have to find out how to update the weights all through the network!

Enter: Backpropagation

 

  • basically, just the chain rule: \( \frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx} \)
  • chained over several layers:  
    Source: [23]
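To make the chain rule concrete, here is a tiny numerical check on an arbitrary composition, \( y = x^2 \) and \( z = \sin(y) \). These functions are chosen purely for illustration and have nothing to do with any particular network.

```python
import math

def y_of_x(x):
    return x ** 2

def z_of_y(y):
    return math.sin(y)

x = 1.3

# Chain rule: dz/dx = dz/dy * dy/dx
dz_dy = math.cos(y_of_x(x))
dy_dx = 2 * x
chain = dz_dy * dy_dx

# Central finite difference as an independent check
h = 1e-6
numeric = (z_of_y(y_of_x(x + h)) - z_of_y(y_of_x(x - h))) / (2 * h)
print(chain, numeric)
```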

Backpropagation step-by-step

 

We'll gradually build up the gradient of the loss with respect to \( y_i \), the output of the last hidden layer, and with respect to the weights \( w_{ij} \) that connect this layer to the output layer.

Here

  • \( i \) and \( j \) are layers of the network (\( i \) being the last hidden layer, \( j \) the output layer)
  • \( y_l \) is the output of layer \( l \)
  • \( z_l \) is the aggregated input going into layer \( l \) (before the activation function is applied)

 

Learning weights by backpropagation (1)

 

Step 1: Loss w.r.t. output (prediction)

At the output layer \( j \), we compare the class prediction \( y_j \) and the actual class \( t \), using binary cross entropy (logistic loss): \( E = - (t \log(y_j) + (1-t) \log(1-y_j)) \)

The gradient

\( \frac{dE}{dy_j} = - (\frac{t}{y_j} - \frac{1-t}{1-y_j}) = \frac{y_j-t}{y_j(1-y_j)} \)

describes how the error \( E \) changes as the prediction \( y_j \) changes.
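A quick numerical sanity check of step 1, comparing the closed-form gradient with a central finite difference. The prediction 0.8 and the target 1 are arbitrary values picked for this sketch.

```python
import math

def bce(y, t):
    """Binary cross entropy for one prediction y and one target t."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def dE_dy(y, t):
    """Closed-form gradient of the loss w.r.t. the prediction."""
    return (y - t) / (y * (1 - y))

y, t, h = 0.8, 1.0, 1e-6
analytic = dE_dy(y, t)
numeric = (bce(y + h, t) - bce(y - h, t)) / (2 * h)
print(analytic, numeric)
```

For \( t = 1 \) the loss is just \( -\log(y) \), so the gradient is \( -1/y \), here -1.25.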

Learning weights by backpropagation (2)

 

 

Step 2: How does the prediction/output \( y_j \) change as the input \( z_j \) to the final neuron changes?

In the case of a logistic (sigmoid) neuron with output \( y_j \), this is described by the gradient

\( \frac{dy_j}{dz_j} = y_j(1-y_j) \).
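The sigmoid derivative can be checked in the same way. The input value 0.5 is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z, h = 0.5, 1e-6
y = sigmoid(z)

analytic = y * (1 - y)  # dy/dz expressed through the output y
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(analytic, numeric)
```

That the derivative can be written in terms of the output alone is convenient: the backward pass can reuse values already computed in the forward pass.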

Learning weights by backpropagation (3)

 

 

Step 3a: How does the input \( z_j \) to the final layer change as the weight \( w_{ij} \) changes?

Here the gradient is

\( \frac{dz_j}{dw_{ij}} = y_i \),

that is, the output of layer \( i \).

Learning weights by backpropagation (4)

 

 

Step 3b: How does the input \( z_j \) to the final layer change as the output of layer \( i \) changes?

Here we have to take into account all connections a neuron \( y_i \) has to the output layer: The gradient is

\( \sum_j \frac{dz_j}{dy_i} = \sum_j w_{ij} \),

that is, the sum of the weights going out of \( y_i \).
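Steps 3a and 3b both follow from \( z_j = \sum_i w_{ij} y_i \). The sketch below assumes a hidden layer with two neurons feeding two output neurons, omits biases, and uses made-up weights; it nudges one weight and one hidden output and watches how the inputs \( z_j \) react.

```python
def layer_input(y_hidden, w):
    """z_j = sum_i w[i][j] * y_hidden[i] for every output neuron j (biases omitted)."""
    return [sum(w[i][j] * y_hidden[i] for i in range(len(y_hidden)))
            for j in range(len(w[0]))]

y_hidden = [0.2, 0.7]   # outputs y_i of the hidden layer (toy values)
w = [[0.5, -1.0],       # w[i][j]: weight from hidden neuron i to output neuron j
     [1.5, 2.0]]
h = 1e-6
z = layer_input(y_hidden, w)

# Step 3a: nudge the single weight w[1][0]; only z_0 reacts, with slope y_1.
w_bumped = [row[:] for row in w]
w_bumped[1][0] += h
dz0_dw10 = (layer_input(y_hidden, w_bumped)[0] - z[0]) / h

# Step 3b: nudge y_0; every z_j reacts with slope w[0][j],
# so the slopes add up to the sum of the weights going out of neuron 0.
y_bumped = y_hidden[:]
y_bumped[0] += h
z_bumped = layer_input(y_bumped, w)
sum_dz_dy0 = sum((zb - zj) / h for zb, zj in zip(z_bumped, z))
print(dz0_dw10, sum_dz_dy0)
```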

Learning weights by backpropagation: Putting it all together

 

Now that we have the individual gradients, we use the chain rule to backpropagate the error:

The gradient of the loss w.r.t. \( y_i \) is

\( \frac{dE}{dy_i} = \sum_j \frac{y_j-t}{y_j(1-y_j)} \, y_j(1-y_j) \, w_{ij} = \sum_j (y_j-t) \, w_{ij} \)

Accordingly, we get the gradient of the loss w.r.t. \( w_{ij} \) as

\( \frac{dE}{dw_{ij}} = \frac{y_j-t}{y_j(1-y_j)} \, y_j(1-y_j) \, y_i = (y_j-t) \, y_i \)
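Putting the steps together for the simplest case, a single sigmoid output neuron with binary cross entropy loss and no bias, might look like the following sketch. All names and numbers are hypothetical; the finite-difference helper is only there to check the hand-derived gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(y_hidden, w, t):
    """Forward pass: hidden outputs -> one sigmoid output neuron -> binary cross entropy."""
    z = sum(w_i * y_i for w_i, y_i in zip(w, y_hidden))
    y = sigmoid(z)
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def backprop(y_hidden, w, t):
    """Gradients of the loss w.r.t. the weights and w.r.t. the hidden outputs."""
    z = sum(w_i * y_i for w_i, y_i in zip(w, y_hidden))
    y = sigmoid(z)
    dE_dy = (y - t) / (y * (1 - y))                 # step 1
    dy_dz = y * (1 - y)                             # step 2
    delta = dE_dy * dy_dz                           # simplifies to y - t
    grad_w = [delta * y_i for y_i in y_hidden]      # step 3a: dz/dw_i = y_i
    grad_y_hidden = [delta * w_i for w_i in w]      # step 3b: dz/dy_i = w_i
    return grad_w, grad_y_hidden

def numeric_grad(f, values, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grads = []
    for k in range(len(values)):
        up, down = values[:], values[:]
        up[k] += h
        down[k] -= h
        grads.append((f(up) - f(down)) / (2 * h))
    return grads

y_hidden = [0.2, 0.7, 0.5]   # toy hidden-layer outputs
w = [0.5, -1.0, 1.5]         # toy weights into the output neuron
t = 1.0                      # actual class

grad_w, grad_y_hidden = backprop(y_hidden, w, t)
numeric_w = numeric_grad(lambda w_: loss(y_hidden, w_, t), w)
numeric_y = numeric_grad(lambda y_: loss(y_, w, t), y_hidden)
print(grad_w, numeric_w)
print(grad_y_hidden, numeric_y)
```

Note how the factors \( y_j(1-y_j) \) cancel, leaving the simple error term \( y_j - t \); this is one reason the sigmoid and cross entropy are commonly paired.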

Continuation

see part 2!