April 13, 2017

Failures of Deep Learning

  • Gradient information carries negligible information on the target function
  • Learning vs. Optimization. Decomposition vs. End-to-end.
  • Network architecture and the optimization algorithm v.s. training time.
  • Is gradient vanishing important?

Failure due to Non-Informative Gradients

Gradient Descent

\[\min\limits_w F_h(w) = \mathbb{E}_x\left[\mathcal{l}\left(p_w(x), h(x)\right)\right]\] - \(\nabla F_h(w)\) contains useful information regarding the target function \(h\).

Example

  • \(v \in \{0, 1\}^d\)
  • \(x \in \{0, 1\}^d\)
  • \(h(x) = (-1)^{x^T v}\)

Training Result

Analysis

  • The information contained by \(\nabla F_h(w)\) is measured by the variance of the gradient of \(F\) with respect to \(h\), when \(h\) is drawn uniformly at random from a collection of candidate target functions \(\mathcal{H}\).
    • 白話: gradient Variance越大的(曲面越陡峭的)越好學
  • This paper proofs that in the previous example, the variance decreases linearly with \(\left|\mathcal{H}\right|\).
  • Shamir (2016) shows that gradient-based methods cannot learn random parities and linear-periodic functions.
  • Further readings: Statistical Query Learning, Kearns (1998) and Blum et al. (1994)

Decomposition vs. End-to-end

Introduction

  • End-to-end : Joint Learning
  • Decomposition : Learning individually and then combining
  • Deep learning makes End-to-end training an attractive alternative. Question:
    • End-to-end + decomposition + added feedback…

Questions

  • What is the price of the rather appealing "End-to-end" approach?
  • Is letting a network "learn by itself" such a bad idea?
  • What is it necessary, or worth the effort, to "help" it?

Conclusion

  • When will SGD learn slowly?
  • Try to answer the following question first:
    • What exactly is the needed data?
    • What architecture is planned to be used?
    • What is the right distribution of development efforts?
  • Taking the wrong choice may be expensive

Exp 1

  • \(X_1\) denotes the space of \(28 \times 28\) binary images, with a distribution \(\mathcal{D}\) defined by the some sampleing procedure.

Exp 1

  • Intermediate target: \(y \in \{1, -1\}\), depends on the slope of the line.
  • Final target: \(X = X_1^k\), i.e. \(x = (x^{(1)}, ..., x^{(k)})\).

\[\tilde{y}(x) = \prod\limits_{i=1}^k y(x^{(i)})\]

Decomposition v.s. End-to-end

  • \(N^{(1)}\) : map image to score
  • \(N^{(2)}\) : concatenate \(k\) scores and map them to binary prediction
  • Decomposition takes the result of \(y\) into account

Result of Exp 1

  • x-axis: iteration, y-axis: accuracy
  • red: End-to-end, blue: Decomposition

Exp 2

Training a predictor, which given a "positive media reference" \(x\) to a certain stock option, will distribute our assets between the \(k = 500\) stocks in the S&P500 index in some manner.

  • End-to-end: train a deep network \(N_w\) that given \(x\) outputs a distribution over the \(k\) stocks. Objective: maximizing the gain obtained by allocating our money according to this distribution
  • Decomposition: train a ddep network \(N_w\) that given \(x\) outputs a single stock, \(y \in [k]\) whose future gains are the most positively correlated to \(x\). We need gather extra labeling for training \(N_w\) based on this criterion.

Theoretically Statement

Let \(X \times X \in \mathbb{R}^d \times \{\pm 1\}^{k}\) be the sample space, and let \(y : X \rightarrow [k]\) be some labelling function.

\[\min \limits_{w} L(w) := \mathbb{E}_{x, z \sim X \times Z} \left[-z^T N_w(x)\right]\]

  • End-to-end: the updated gradient is \(\nabla_w(-z^TN_w(x))\)
  • Decomposition: get \(y(x)\) and use \(\nabla_w(-e^T_{y(x)}N_w(x))\)

Result

  • x-axis: iteration, y-axis: loss
  • red: End-to-end, blue: Decomposition

Analysis: Signal to Noise Ratio (SNR) of Exp 1

  • The signal's strength is the squared norm of the gradients' mean estimated by SGD
  • The signal's variance is obtained by differentiating w.r.t. to a single sample

SNR of Exp 1

  • x-axis: \(k\), y-axis: estimated SNR
  • red: End-to-end, blue: Decomposition
  • The SNR of \(k \geq 3\) is below the precision obtained by a 32-bit floating point representation.

SNR of Exp 2

  • Please check the paper for details
  • The initial variance of the gradient of End-to-end is roughly \(k\) times larger than that of the decomposition estimator.

Q&A

Architecture and Conditioning

Piecewise Linear Curves

\[f(x) = b + \sum_{i=1}^k {a_i \left[x - \theta_i\right]_+}\]

Piecewise Linear Curves

get.f <- function(b, a, theta) {
  Vectorize(function(x) {
    plus <- x - theta
    plus[plus < 0] <- 0
    b + sum(a * plus)
  })
}
f <- get.f(1, c(1, -2, 3), c(0, 2, 6))
curve(f(x), -1, 8)

Fitting Piecewise Linear Curves

  • Input:
    • \(f = \left(f(0), f(1), ..., f(n-1)\right)\)
  • Output:
    • \(b, a, \theta\)

Model 1

\[\min_{\hat{U}} \mathbb{E}_f \left[\frac{1}{2} \left(W^{-1} f - \hat{U} f\right)\right]\]

  • \(W = \left(\left[i - j + 1\right]_+\right)_{i,j} \in \mathbb{R}^{n \times n} \Rightarrow f = Wp\)
  • Convex Optimization Problem
  • Unexpected slow rate of convergence

Why Model 1 is Slow

Let \(W = QSV^T\) be the singular value decomposition of \(W\). If \(\eta \lambda S^2_{1,1} \geq 1\) then \(\mathbb{E}\hat{U}_{t+1}\) diverges. Otherwise, we have

\[t + 1 \leq \frac{S^2{1,1}}{2 S^2_{n,n}} \Rightarrow \left\lVert \mathbb{E} \hat{U}_{t+1} - U\right\lVert \geq 0.5\]

  • \(\eta\) : learning rate
  • \(\mathbb{E}_f \left[Uff^TU^T\right] = \lambda I\)
  • The iterations require at least \(\Omega(n^{3.5})\) to make \(\hat{U} \rightarrow U\).

Model 2 : Convolution

  • \(Uf\) is one-dimensional convolution of \(f\) with the kernel \(\left(1, -2, 1\right)\):
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    0    0    0    0    0    0    0    0     0
##  [2,]   -2    1    0    0    0    0    0    0    0     0
##  [3,]    1   -2    1    0    0    0    0    0    0     0
##  [4,]    0    1   -2    1    0    0    0    0    0     0
##  [5,]    0    0    1   -2    1    0    0    0    0     0
##  [6,]    0    0    0    1   -2    1    0    0    0     0
##  [7,]    0    0    0    0    1   -2    1    0    0     0
##  [8,]    0    0    0    0    0    1   -2    1    0     0
##  [9,]    0    0    0    0    0    0    1   -2    1     0
## [10,]    0    0    0    0    0    0    0    1   -2     1

Model 2 : Convolution

  • The authors show that only \(\Theta(n^3)\) iterations is required to learn Convolution model

Model 3 : Additional Improvement through Explicit Conditioning

  • The authors did some math trick to create a optimization problem with condition number is approximately 1.
  • Then SGD converges using order of \(log(\frac{1}{\varepsilon})\) iterations, independently of \(n\).

Model 4 : Deep Audo-encoder

Flat Activations

Vanishing Gradient

  • Adding momentum, higher order methods, normalized gradients…
  • The authors use a family of activation functions which amplify the"vanishing gradient: piecewise flat.
    • They arrive at an efficient solution with these activations.

Exp Setup

  • The sample space \(x \in \mathbb{R}^d\)
  • The target function \(y : \mathbb{R}^d \rightarrow \mathbb{R}\), \(y(x) = u(v*^{T}x + b*)\)
    • \(v* \in \mathbb{R}^d\), \(b* \in \mathbb{R}\), and \(u : \mathbb{R} \rightarrow \mathbb{R}\) is monotonically non-decreasing function (activation function).

\[\min_w \mathbb{E}_x \left[l\left(u\left(N_w(x)\right), y(x)\right)\right]\]

  • \(N_w\) is some neural network parametrized by \(w\) and \(l\) is some loss function
  • \(u(r) = z_0 + \sum_{i \in [55]} {\mathbb{1}_{[r > z_i]} \times (z_i - z_{i-1})}\)
  • Gradient Descent will fail

Non-flat Approximation Exp

  • \(\tilde{u}(r) = z_0 + \sum_{i \in [55]} { (z_i - z_{i-1}) \times \sigma(c \times (r - z_i)) }\)

\[ \min_{v, b} \mathbb{E}_x \left[\left(\tilde{u}(v^T x + b) - y(x)\right)^2\right]\]

  • Training is still hard (converge slowly)
  • Sensitivity to the initialization of bias term is observed

Non-flat Approximation Exp

End-to-end Exp

\[\min_{w} \mathbb{E} \left[\left(N_w(x) - y(x)\right)^2\right]\]

  • Difficulty arises when regressing to non smooth functions. (\(u\) is not continuous)

End-to-end Exp

Multi-Class Experiment

  • Similar to end-to-end but the output is a classification.
  • Inaccuracies at the boundaries between classes.
  • Ignoring \(u\) results in higher sample complexity

Multi-Class Experiment

The Forward-Only Update Rule

Original gradient:

\[\nabla F(w) = \mathbb{E}_x \left[\left(u(w^Tx) - y(x)\right) \times u'(w^Tx) \times x\right]\]

  • \(u' = 0\) so \(\nabla F\) is meaningless.

The Forward-Only Update Rule

Replacement of the gradient from Kakade et al. (2011) and Kalai and Sastry (2009)

\[\tilde{\nabla} F(w) = \mathbb{E}_x \left[\left(u(w^Tx) - y(x)\right) \times x\right]\]

  • Replacing the backpropagation message for the activation function \(u\) with an identity message.

The Forward-Only Update Rule

Summary

Summary

Q&A

Reference

Blum, Avrim, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. 1994. “Weakly Learning Dnf and Characterizing Statistical Query Learning Using Fourier Analysis.” In Proceedings of the Twenty-Sixth Annual Acm Symposium on Theory of Computing, 253–62. STOC ’94. New York, NY, USA: ACM. doi:10.1145/195058.195147.

Kakade, Sham, Adam Tauman Kalai, Varun Kanade, and Ohad Shamir. 2011. “Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression.” CoRR abs/1104.2018. http://arxiv.org/abs/1104.2018.

Kalai, Adam Tauman, and Ravi Sastry. 2009. “The Isotron Algorithm: High-Dimensional Isotonic Regression.” In COLT. http://dblp.uni-trier.de/db/conf/colt/colt2009.html#KalaiS09.

Kearns, Michael. 1998. “Efficient Noise-Tolerant Learning from Statistical Queries.” J. ACM 45 (6). New York, NY, USA: ACM: 983–1006. doi:10.1145/293347.293351.

Shamir, Ohad. 2016. “Distribution-Specific Hardness of Learning Neural Networks.” CoRR abs/1609.01037. http://arxiv.org/abs/1609.01037.