Sigrid Keydana, Trivadis
02/06/2018
The Perceptron (Rosenblatt, 1958)
Sources: [1], [2], [3]

Backpropagation (1980s)
Sources: [4], [5], [6], [7], [8], [9]

Convolutional Neural Networks (1990s)
Sources: [10], [11], [12], [13] (great reading list at: https://github.com/kjw0612/awesome-deep-vision)

Long Short-Term Memory (1990s)
Sources: [14], [15], [16], [17]

Sources: [18], [19]
Sources: [20], [21]
| | Pre-ML programs | Machine learning (ML) | Deep learning (DL) |
|---|---|---|---|
| in | data, rules | data | data |
| learn | | mappings and functions | features, mappings and functions |
| out | conclusion | conclusion | conclusion |
Supervised learning: learn a mapping from inputs to known targets (labels).
Unsupervised learning: discover structure in data without any labels.
Reinforcement learning: learn which actions to take by interacting with an environment and receiving rewards.
Supervised learning (deep or not) works by minimizing a cost function that quantifies how far the predictions are from the ground truth.
In regression, this normally is the mean squared error:
\( \frac{1}{n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2} \)
whereas in classification it mostly is cross entropy:
\( - \sum_j{t_j \log(y_j)} \) (to be averaged over the training set)
Source: [22]
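To make the two losses concrete, here is a minimal NumPy sketch (function and variable names are mine, not from the talk):

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error: 1/n * sum((y_hat - y)^2)."""
    return np.mean((y_hat - y) ** 2)

def cross_entropy(y_hat, t):
    """Categorical cross entropy, averaged over the training set.
    y_hat: predicted class probabilities, shape (n_samples, n_classes)
    t:     one-hot encoded targets, same shape
    """
    eps = 1e-12                      # avoid log(0)
    return -np.mean(np.sum(t * np.log(y_hat + eps), axis=1))

# toy regression example
print(mse(np.array([2.5, 0.0, 2.0]), np.array([3.0, -0.5, 2.0])))   # ~0.167

# toy 3-class classification example
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
t = np.array([[1, 0, 0],
              [0, 1, 0]])
print(cross_entropy(y_hat, t))       # ~0.29
```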
Again, we have a cost function to minimize.
But this time, we have to find out how to update the weights all through the network!
We'll gradually build up the gradient of the loss with respect to \( y_i \), the output of the last hidden layer, and the weights \( w_{ij} \), respectively.
Step 1: Loss w.r.t. output (prediction)
At the output layer, we compare the class prediction \( y_j \) and the actual class \( t \), using binary cross entropy/logistic loss: \( - (t\:\log(y_j) + (1-t)\:\log(1-y_j)) \)
The gradient
\( \frac{dE}{dy_j} = - \left(\frac{t}{y_j} + (-1)\,\frac{1-t}{1-y_j}\right) = \frac{y_j-t}{y_j(1-y_j)} \)
describes how the error \( E \) changes as the prediction \( y_j \) changes.
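To double-check this result, here is a quick finite-difference comparison (a sketch with arbitrary values, not part of the original slides):

```python
import numpy as np

def bce(y, t):
    # binary cross entropy / logistic loss
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

y, t = 0.8, 1.0            # arbitrary prediction and target
analytic = (y - t) / (y * (1 - y))

eps = 1e-6                 # central finite difference
numeric = (bce(y + eps, t) - bce(y - eps, t)) / (2 * eps)

print(analytic, numeric)   # both approximately -1.25
```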
Step 2: How does the prediction/output \( y_j \) change as the input \( z_j \) to the final neuron changes?
In the case of a logistic (sigmoid) neuron with output \( y_j \), this is described by the gradient
\( \frac{dy_j}{dz_j} = y_j(1-y_j) \).
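This follows from writing out the sigmoid (assuming the standard definition \( y_j = \sigma(z_j) = \frac{1}{1+e^{-z_j}} \), which the slides do not spell out):

\( \frac{dy_j}{dz_j} = \frac{e^{-z_j}}{(1+e^{-z_j})^2} = \frac{1}{1+e^{-z_j}} \cdot \frac{e^{-z_j}}{1+e^{-z_j}} = y_j\,(1-y_j) \)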
Step 3a: How does the input \( z_j \) to the final layer change as the weight \( w_{ij} \) changes?
Here the gradient is
\( \frac{dz_j}{dw_{ij}} = y_i \),
that is, the output of layer \( i \).
Step 3b: How does the input \( z_j \) to the final layer change as the output of layer \( i \) changes?
Here we have to take into account all connections a neuron \( y_i \) has to the output layer: The gradient is
\( \sum_j \frac{dz_j}{dy_i} = \sum_j w_{ij} \),
that is, the sum of the weights going out of \( y_i \).
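Both Step 3a and Step 3b follow directly once we write out the (assumed, but standard) weighted input to output neuron \( j \):

\( z_j = \sum_i w_{ij}\, y_i + b_j \)

so that \( \frac{\partial z_j}{\partial w_{ij}} = y_i \) and \( \frac{\partial z_j}{\partial y_i} = w_{ij} \).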
Now that we have the individual gradients, we use the chain rule to backpropagate the error:
The gradient of the loss w.r.t. \( y_i \) is
\( \frac{dE}{dy_i} = \sum_j \frac{y_j-t}{y_j(1-y_j)}\; y_j(1-y_j)\; w_{ij} = \sum_j (y_j - t)\, w_{ij} \)
Accordingly, we get the gradient of the loss w.r.t. \( w_{ij} \) as
\( \frac{dE}{dw_{ij}} = \frac{y_j-t}{y_j(1-y_j)}\; y_j(1-y_j)\; y_i = (y_j - t)\, y_i \)
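Putting the pieces together, here is a minimal sketch of these two gradients for a single sigmoid output neuron, checked against finite differences (all names and values are hypothetical, not from the talk); note how the factor \( y_j(1-y_j) \) cancels, leaving \( y_j - t \):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, t):
    # binary cross entropy / logistic loss from Step 1
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

# hypothetical setup: 3 neurons in the last hidden layer, 1 sigmoid output neuron
y_i = rng.uniform(0.1, 0.9, size=3)   # outputs of the last hidden layer
w_ij = rng.normal(size=3)             # weights from the hidden layer into the output neuron
t = 1.0                               # true class

def forward(y_i, w_ij):
    z_j = y_i @ w_ij                  # weighted input to the output neuron (bias omitted)
    return sigmoid(z_j)               # prediction y_j

y_j = forward(y_i, w_ij)

# gradients from the derivation above (the product of factors collapses to y_j - t)
dE_dw = (y_j - t) * y_i               # dE/dw_ij for each weight into the output neuron
dE_dy = (y_j - t) * w_ij              # dE/dy_i, to be propagated further back

# finite-difference check of dE/dw_ij
eps = 1e-6
dE_dw_num = np.zeros_like(w_ij)
for k in range(len(w_ij)):
    w_plus, w_minus = w_ij.copy(), w_ij.copy()
    w_plus[k] += eps
    w_minus[k] -= eps
    dE_dw_num[k] = (bce(forward(y_i, w_plus), t) - bce(forward(y_i, w_minus), t)) / (2 * eps)

print(np.allclose(dE_dw, dE_dw_num))  # True: analytic gradient matches the numeric one
```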
see part 2!