Brynjólfur Gauti Jónsson, MS student in statistics
Geoffrey Hinton
To tackle large-scale problems we want bases that adapt to our data.
SVMs create basis functions from the data points themselves.
Neural networks use parametric non-linear activation functions to estimate bases.
\[ y = \varphi(\sum w_i x_i + b) = \begin{cases} 1, \hspace{5mm} \sum w_ix_i > -b \\ 0, \hspace{5mm} \sum w_ix_i \leq -b \end{cases} \]
What we call an MLP today is not literally a perceptron, since it doesn’t use the hard-threshold non-linearity above.
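As a minimal sketch, the perceptron rule above in base R (the AND-gate weights and bias below are made-up illustration values, not from the slides):

```r
# Perceptron: output 1 if the weighted sum exceeds the (negative) bias, else 0
perceptron <- function(x, w, b) {
  as.integer(sum(w * x) + b > 0)
}

# Illustrative values: a 2-input unit acting as an AND gate
w <- c(1, 1)
b <- -1.5
perceptron(c(1, 1), w, b)  # 1
perceptron(c(1, 0), w, b)  # 0
```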
Image from Computer Age Statistical Inference by Efron and Hastie
We have data \(X\) and a network with \(L\) layers. Each layer \(\ell\) has \(n_\ell\) hidden nodes, so the weight matrix \(W^{(\ell)}\) has dimension \(n_{\ell} \times n_{\ell - 1}\): it maps the \(n_{\ell - 1}\) outputs of the previous layer to \(n_{\ell}\) new values. Each column of \(A^{(\ell)}\) corresponds to one observation.
Set \(A^{(0)} = X\) and we get for each \(\ell\) in \(1:L\)
\[ \begin{aligned} Z^{(\ell)} &= W^{(\ell)}A^{(\ell - 1)} + b^{(\ell)} \\ A^{(\ell)} &= f^{(\ell)}(Z^{(\ell)}) \end{aligned} \]
In particular
\[ \hat Y = A^{(L)} \]
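A rough base-R sketch of this forward pass; the ReLU hidden layer and the example shapes are my own choices, and the columns of \(X\) are taken to be observations, as above:

```r
relu <- function(z) pmax(z, 0)

# Forward pass: A^(0) = X, then Z^(l) = W^(l) A^(l-1) + b^(l), A^(l) = f^(l)(Z^(l))
forward <- function(X, weights, biases, activations) {
  A <- X
  cache <- list(list(A = A))              # keep A^(0) and each (Z, A) for backprop
  for (l in seq_along(weights)) {
    Z <- weights[[l]] %*% A + biases[[l]] # b^(l) is recycled down each column
    A <- activations[[l]](Z)
    cache[[l + 1]] <- list(Z = Z, A = A)
  }
  list(Y_hat = A, cache = cache)
}

# Illustrative shapes (made up): 3 inputs, one hidden layer with 4 nodes, 1 output
set.seed(1)
X <- matrix(rnorm(3 * 5), nrow = 3)            # 5 observations as columns
weights <- list(matrix(rnorm(4 * 3), 4, 3),    # W^(1): 4 x 3
                matrix(rnorm(1 * 4), 1, 4))    # W^(2): 1 x 4
biases  <- list(rep(0, 4), 0)
activations <- list(relu, identity)
out <- forward(X, weights, biases, activations)
dim(out$Y_hat)                                 # 1 x 5
```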
Having obtained our predictions, our error for each observation is
\[ \mathcal L = (y_i - \hat y_i)^2 = (y_i - A^{(L)})^2 \]
For each weight parameter we need \(\frac{\partial \mathcal L}{\partial W^{(\ell)}}\). To compute this we use the chain rule:
\[ \begin{aligned} dZ^{(\ell)} &= dA^{(\ell)} * f^{(\ell)'}(Z^{(\ell)}) \\ dW^{(\ell)} &= dZ^{(\ell)} A^{(\ell - 1)T} \\ db^{(\ell)} &= \text{rowSums}(dZ^{(\ell)}) \\ dA^{(\ell - 1)} &= W^{(\ell)T}dZ^{(\ell)} \end{aligned} \]
To see why this is true:
\[ \begin{aligned} \frac{\partial \mathcal L}{\partial W^{(L)}} &= \frac{\partial \mathcal L}{\partial Z^{(L)}}\frac{\partial Z^{(L)}}{\partial W^{(L)}} \\ &= \frac{\partial \mathcal L}{\partial A^{(L)}}\frac{\partial A^{(L)}}{\partial Z^{(L)}}\frac{\partial Z^{(L)}}{\partial W^{(L)}} \\ &= \left[-2(y_i - A^{(L)}) * f^{(L)'}(Z^{(L)})\right]A^{(L - 1)T} \\ \frac{\partial \mathcal L}{\partial A^{(L - 1)}} &= \frac{\partial \mathcal L}{\partial A^{(L)}}\frac{\partial A^{(L)}}{\partial Z^{(L)}}\frac{\partial Z^{(L)}}{\partial A^{(L - 1)}} \\ &= W^{(L)T}\left[-2(y_i - A^{(L)}) * f^{(L)'}(Z^{(L)})\right] \end{aligned} \]
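A matching backward-pass sketch under the same conventions, for the squared-error loss above (variable names and the two-layer example are again my own, and no averaging over observations is added since the slides omit it):

```r
relu_grad <- function(z) (z > 0) * 1
identity_grad <- function(z) 1

# Backward pass matching the forward sketch, with L = (y - yhat)^2
backward <- function(y, out, weights, activation_grads) {
  cache <- out$cache
  grads <- vector("list", length(weights))
  dA <- -2 * (y - out$Y_hat)                  # dL/dA^(L)
  for (l in rev(seq_along(weights))) {
    Z <- cache[[l + 1]]$Z
    A_prev <- cache[[l]]$A
    dZ <- dA * activation_grads[[l]](Z)       # dZ^(l) = dA^(l) * f'(Z^(l))
    grads[[l]] <- list(dW = dZ %*% t(A_prev), # dW^(l) = dZ^(l) A^(l-1)T
                       db = rowSums(dZ))      # db^(l) = rowSums(dZ^(l))
    dA <- t(weights[[l]]) %*% dZ              # dA^(l-1) = W^(l)T dZ^(l)
  }
  grads
}

# Using the two-layer example from the forward-pass sketch
y <- matrix(rnorm(5), nrow = 1)
grads <- backward(y, out, weights, list(relu_grad, identity_grad))
```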
Both images from Deep Learning by Goodfellow, Bengio and Courville
Convolutional neural networks slide filters across the image to detect features.
Image from Computer Age Statistical Inference by Efron and Hastie
\[ F_h = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * A \\ F_v = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * A \\ F = \sqrt{F_h^2 + F_v^2} \]
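The Sobel edge detector above can be written out directly in base R. The naive double loop below is just for illustration; it is plain cross-correlation with no padding, not taken from any of the referenced books:

```r
# Slide a 3x3 filter K across an image A (valid region only, no padding)
convolve2d <- function(A, K) {
  out <- matrix(0, nrow(A) - 2, ncol(A) - 2)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      out[i, j] <- sum(A[i:(i + 2), j:(j + 2)] * K)
    }
  }
  out
}

sobel_h <- matrix(c(-1, 0, 1,
                    -2, 0, 2,
                    -1, 0, 1), nrow = 3, byrow = TRUE)
sobel_v <- matrix(c(-1, -2, -1,
                     0,  0,  0,
                     1,  2,  1), nrow = 3, byrow = TRUE)

# Edge magnitude F = sqrt(F_h^2 + F_v^2)
edge_magnitude <- function(A) {
  Fh <- convolve2d(A, sobel_h)
  Fv <- convolve2d(A, sobel_v)
  sqrt(Fh^2 + Fv^2)
}
```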
Instead of using hand-crafted filters, ConvNets learn the filter parameters from data.
“The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.” - Geoffrey Hinton
\[ y_t = f(y_{t-1}, x_t) \]
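A sketch of one recurrent step in base R; the tanh activation and the weight names \(W_a\), \(W_x\) are my own choices, not from the slides:

```r
# One step of a simple recurrent unit: the new state depends on the
# previous state and the current input
rnn_step <- function(a_prev, x_t, W_a, W_x, b) {
  tanh(W_a %*% a_prev + W_x %*% x_t + b)
}

# Unroll over a sequence whose columns are the inputs x^(1), ..., x^(T)
run_rnn <- function(X_seq, W_a, W_x, b, a0) {
  a <- a0
  for (t in seq_len(ncol(X_seq))) {
    a <- rnn_step(a, X_seq[, t, drop = FALSE], W_a, W_x, b)
  }
  a
}
```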
Image from Deep Learning with R by Abhijit Ghatak
\[ \begin{aligned} \Gamma_u &= \sigma(W_u[a^{(t - 1)}, x^{(t)}] + b_u) \\ \Gamma_f &= \sigma(W_f[a^{(t - 1)}, x^{(t)}] + b_f) \\ \Gamma_o &= \sigma(W_o[a^{(t - 1)}, x^{(t)}] + b_o) \\ \tilde c^{(t)} &= \tanh(W_c[a^{(t - 1)}, x^{(t)}] + b_c)\\ c^{(t)} &= \Gamma_u * \tilde c^{(t)} + \Gamma_f * c^{(t - 1)}\\ a^{(t)} &= \Gamma_o * \tanh(c^{(t)}) \end{aligned} \]
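A direct transcription of these equations into base R; \(\sigma\) is the logistic sigmoid, the concatenation \([a^{(t-1)}, x^{(t)}]\) is implemented by stacking the two column vectors, and storing the weights and biases as named lists is my own layout choice:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# One LSTM step following the equations above
lstm_step <- function(a_prev, c_prev, x_t, W, b) {
  v <- rbind(a_prev, x_t)                 # [a^(t-1), x^(t)]
  gamma_u <- sigmoid(W$u %*% v + b$u)     # update gate
  gamma_f <- sigmoid(W$f %*% v + b$f)     # forget gate
  gamma_o <- sigmoid(W$o %*% v + b$o)     # output gate
  c_tilde <- tanh(W$c %*% v + b$c)        # candidate cell state
  c_t <- gamma_u * c_tilde + gamma_f * c_prev
  a_t <- gamma_o * tanh(c_t)
  list(a = a_t, c = c_t)
}
```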
Encoder-Decoder
Can turn it into a denoising algorithm by applying gradually increasing Gaussian noise to the inputs!
Can use clustering algorithms to find similar pictures in this latent space.
We can linearly interpolate between two faces in the latent space.
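A sketch of this interpolation, assuming hypothetical encode() and decode() functions from a trained autoencoder (both are placeholders here, not real API calls):

```r
# Linear interpolation between two images in the latent space
interpolate_faces <- function(img1, img2, alpha, encode, decode) {
  z1 <- encode(img1)
  z2 <- encode(img2)
  z  <- (1 - alpha) * z1 + alpha * z2  # alpha = 0 gives img1, alpha = 1 gives img2
  decode(z)
}

# e.g. a sequence of in-between faces:
# frames <- lapply(seq(0, 1, by = 0.1),
#                  function(a) interpolate_faces(face1, face2, a, encode, decode))
```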
Steps: