- The gradient carries negligible information about the target function
- Learning vs. Optimization. Decomposition vs. End-to-end.
- Network architecture and the optimization algorithm vs. training time.
- Is gradient vanishing important?
April 13, 2017
\[\min\limits_w F_h(w) = \mathbb{E}_x\left[\ell\left(p_w(x), h(x)\right)\right]\]

- \(\nabla F_h(w)\) contains useful information regarding the target function \(h\) (probed empirically in the sketch below).
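To see what this means in practice, here is a small illustrative sketch in R (my own toy, not the paper's experiment): the predictor \(p_w(x) = \tanh(w^T x)\), the squared loss, \(d = 10\) input bits, and 5-bit parities are all choices made only for this illustration. The empirical gradient of \(F_h\) at a fixed \(w\) is computed for 20 random parity targets \(h\); the spread across targets is small and dominated by sampling noise (in the population it is vanishingly small, which is the point), so the gradient reveals very little about which parity we are trying to learn.

set.seed(1)
d <- 10; n <- 10000
X <- matrix(sample(c(-1, 1), n * d, replace = TRUE), n, d)  # x uniform on {-1, 1}^d
w <- rnorm(d) / sqrt(d)                                     # a fixed query point

# Empirical gradient of F_h(w) = E[(p_w(x) - h_S(x))^2] with p_w(x) = tanh(w'x),
# where h_S(x) = prod(x[S]) is the parity over the coordinates in S.
grad.F <- function(S) {
  h <- apply(X[, S, drop = FALSE], 1, prod)
  p <- tanh(X %*% w)
  colMeans(as.vector(2 * (p - h) * (1 - p^2)) * X)
}

grads <- sapply(1:20, function(i) grad.F(sample(d, 5)))  # 20 random 5-bit parities
round(apply(grads, 1, sd), 3)  # spread across targets: small, mostly sample noise
round(rowMeans(grads), 3)      # the common (target-independent) part of the gradient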

\[\tilde{y}(x) = \prod\limits_{i=1}^k y(x^{(i)})\]
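A back-of-the-envelope illustration (mine, not the paper's analysis): if each section label \(y(x^{(i)})\) is predicted correctly with probability \(p\), independently across sections, then the product of the predictions matches \(\tilde{y}(x)\) exactly when an even number of sections are wrong, which happens with probability \((1 + (2p - 1)^k)/2\). This decays to chance level as \(k\) grows, which is one way to see why the end-to-end signal becomes weak.

p <- 0.9                              # per-section accuracy (hypothetical)
k <- 1:10                             # number of sections
round((1 + (2 * p - 1)^k) / 2, 3)     # accuracy of the product of the k predictions

# Monte Carlo check for k = 5: each factor is +1 (correct) w.p. p, -1 (wrong) w.p. 1 - p
sim <- replicate(50000, prod(sample(c(1, -1), 5, replace = TRUE, prob = c(p, 1 - p))) == 1)
mean(sim)                             # close to (1 + (2 * p - 1)^5) / 2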
Training a predictor that, given a "positive media reference" \(x\) to a certain stock option, distributes our assets among the \(k = 500\) stocks in the S&P500 index in some manner.
Let \(X \times Z \subseteq \mathbb{R}^d \times \{\pm 1\}^{k}\) be the sample space, and let \(y : X \rightarrow [k]\) be some labelling function.
\[\min \limits_{w} L(w) := \mathbb{E}_{x, z \sim X \times Z} \left[-z^T N_w(x)\right]\]
\[f(x) = b + \sum_{i=1}^k {a_i \left[x - \theta_i\right]_+}\]
get.f <- function(b, a, theta) {
  # Returns f(x) = b + sum_i a_i * [x - theta_i]_+ as a vectorised R function.
  Vectorize(function(x) {
    plus <- x - theta
    plus[plus < 0] <- 0        # the positive part [x - theta_i]_+
    b + sum(a * plus)
  })
}
f <- get.f(1, c(1, -2, 3), c(0, 2, 6))
curve(f(x), -1, 8)
\[\min_{\hat{U}} \mathbb{E}_f \left[\frac{1}{2} \left\lVert W^{-1} f - \hat{U} f\right\rVert^2\right]\]
Let \(W = QSV^T\) be the singular value decomposition of \(W\). If \(\eta \lambda S^2_{1,1} \geq 1\) then \(\mathbb{E}\hat{U}_{t+1}\) diverges. Otherwise, we have
\[t + 1 \leq \frac{S^2_{1,1}}{2 S^2_{n,n}} \Rightarrow \left\lVert \mathbb{E} \hat{U}_{t+1} - U\right\rVert \geq 0.5\]
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    0    0    0    0    0    0    0    0     0
##  [2,]   -2    1    0    0    0    0    0    0    0     0
##  [3,]    1   -2    1    0    0    0    0    0    0     0
##  [4,]    0    1   -2    1    0    0    0    0    0     0
##  [5,]    0    0    1   -2    1    0    0    0    0     0
##  [6,]    0    0    0    1   -2    1    0    0    0     0
##  [7,]    0    0    0    0    1   -2    1    0    0     0
##  [8,]    0    0    0    0    0    1   -2    1    0     0
##  [9,]    0    0    0    0    0    0    1   -2    1     0
## [10,]    0    0    0    0    0    0    0    1   -2     1
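For reference, a short sketch of how the matrix printed above can be rebuilt in R, together with the quantities appearing in the statement above: the singular values \(S\) of \(W\) and the ratio \(S^2_{1,1} / (2 S^2_{n,n})\), the iteration lower bound. Here n = 10 only because that is the size shown; the ratio is already large and grows quickly with n.

n <- 10
W <- diag(1, n)
W[cbind(2:n, 1:(n - 1))] <- -2   # first sub-diagonal
W[cbind(3:n, 1:(n - 2))] <- 1    # second sub-diagonal

S <- svd(W)$d                    # singular values, in decreasing order
S[1]^2 / (2 * S[n]^2)            # iteration lower bound from the statement above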
\[\min_w \mathbb{E}_x \left[\ell\left(u\left(N_w(x)\right), y(x)\right)\right]\]
\[ \min_{v, b} \mathbb{E}_x \left[\left(\tilde{u}(v^T x + b) - y(x)\right)^2\right]\]
\[\min_{w} \mathbb{E}_x \left[\left(N_w(x) - y(x)\right)^2\right]\]
Original gradient:
\[\nabla F(w) = \mathbb{E}_x \left[\left(u(w^Tx) - y(x)\right) \times u'(w^Tx) \times x\right]\]
Replacement gradient from Kakade et al. (2011) and Kalai and Sastry (2009):
\[\tilde{\nabla} F(w) = \mathbb{E}_x \left[\left(u(w^Tx) - y(x)\right) \times x\right]\]
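A small toy in R (my own construction, in the spirit of the flat-activation example; the staircase u, the Gaussian inputs, the step size, and the iteration count are all assumptions of this sketch, not the paper's setup): because u is piecewise constant, \(u'(w^T x) = 0\) for almost every \(x\), so the original gradient is zero, while the replacement gradient \(\mathbb{E}_x\left[(u(w^T x) - y(x)) x\right]\) still drives the error down.

set.seed(1)
d <- 5; n <- 5000
X <- matrix(rnorm(n * d), n, d)
u <- function(t) ifelse(t < -1, -1, ifelse(t < 1, 0, 1))  # flat, piecewise-constant activation

w.star <- rnorm(d)
y <- u(X %*% w.star)                      # realizable labels y(x) = u(w*' x)
mse <- function(w) mean((u(X %*% w) - y)^2)

w <- rnorm(d)                             # random initial point
mse(w)                                    # error before any updates
for (step in 1:100) {
  r <- as.vector(u(X %*% w) - y)
  w <- w - 0.5 * colMeans(r * X)          # replacement-gradient update (u' factor dropped)
}
mse(w)                                    # error after the updates: much lower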
Blum, Avrim, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. 1994. “Weakly Learning DNF and Characterizing Statistical Query Learning Using Fourier Analysis.” In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, 253–62. STOC ’94. New York, NY, USA: ACM. doi:10.1145/195058.195147.
Kakade, Sham, Adam Tauman Kalai, Varun Kanade, and Ohad Shamir. 2011. “Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression.” CoRR abs/1104.2018. http://arxiv.org/abs/1104.2018.
Kalai, Adam Tauman, and Ravi Sastry. 2009. “The Isotron Algorithm: High-Dimensional Isotonic Regression.” In COLT. http://dblp.uni-trier.de/db/conf/colt/colt2009.html#KalaiS09.
Kearns, Michael. 1998. “Efficient Noise-Tolerant Learning from Statistical Queries.” J. ACM 45 (6). New York, NY, USA: ACM: 983–1006. doi:10.1145/293347.293351.
Shamir, Ohad. 2016. “Distribution-Specific Hardness of Learning Neural Networks.” CoRR abs/1609.01037. http://arxiv.org/abs/1609.01037.