March 14, 2017

Bergstra and Bengio (2012)

Hyperparameter Selection (Theory)

\[\begin{equation}\lambda^{(*)} = \underset{\lambda \in \Sigma}{argmin}{\mathbb{E}_{x \sim \mathcal{G}_{x}}\left[\mathcal{L}\left(x ; \mathcal{A}_{\lambda}\left(\mathcal{X}^{(train)}\right)\right)\right]}\label{eq:hyper-optimization}\end{equation}\]

  • learning algorithm \(\mathcal{A}\): finds a function \(f\) that minimizes the loss \(\mathcal{L}(x;f)\) over i.i.d. samples \(x\) from the ground-truth distribution \(\mathcal{G}_x\).
    • \(\mathcal{A} : \mathcal{X}^{(train)} \rightarrow f\)
  • Given a hyperparameter \(\lambda \in \Sigma\), the learned function is \(f = \mathcal{A}_\lambda\left(\mathcal{X}^{(train)}\right)\)
  • \(\eqref{eq:hyper-optimization}\) selects the \(\lambda\) that minimizes the expected loss

Hyperparameter Selection (Practice)

  • There is no efficient algorithm for solving \(\eqref{eq:hyper-optimization}\) directly
  • \(\mathcal{G}_x\) is approximated by the validation set \(\mathcal{X}^{(valid)}\)
  • Search a finite set of trial points \(\{\lambda^{(1)}, \lambda^{(2)}, ..., \lambda^{(S)}\}\) for the best hyperparameter, as sketched below

\[\begin{eqnarray} \lambda^{(*)} &\approx& \underset{\lambda \in \Sigma}{argmin} \underset{x \in \mathcal{X}^{(valid)}}{mean} \mathcal{L}\left( x ; \mathcal{A}_\lambda \left(\mathcal{X}^{(train)}\right) \right) \\ &=& \underset{\lambda \in \Sigma}{argmin} \Psi(\lambda) \\ &\approx& \underset{\lambda \in \{\lambda^{(1)}, \lambda^{(2)}, ..., \lambda^{(S)}\}}{argmin} \Psi(\lambda) = \hat{\lambda} \end{eqnarray}\]
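
A minimal sketch of this selection procedure in Python, assuming a generic `train` callable standing in for \(\mathcal{A}_\lambda\) (trial points are dicts of hyperparameters) and a per-example `loss`; both names are hypothetical stand-ins, not the paper's code:

```python
import numpy as np

def psi(lam, train, loss, X_train, X_valid):
    """Estimate Psi(lambda): mean validation loss of the model
    that the learning algorithm A_lambda produces on the training set."""
    f = train(X_train, **lam)                      # f = A_lambda(X^(train))
    return np.mean([loss(x, f) for x in X_valid])  # mean over X^(valid)

def select(trials, train, loss, X_train, X_valid):
    """Return lambda_hat: the trial point with the smallest Psi estimate."""
    scores = [psi(lam, train, loss, X_train, X_valid) for lam in trials]
    return trials[int(np.argmin(scores))]
```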

How to Choose the Trial Set

  • Grid Search
    • The Cartesian product of candidate values for each hyperparameter
  • Manual Search
    • Domain Knowledge
    • Intuition

Example: Gaussian kernel SVM

  • \(\lambda = (C, \sigma)\)
    • \(C\): regularization penalty
    • \(\sigma\): width of the Gaussian kernel
  • \(C = (0.01, 0.1, 1, 10)\)
  • \(\sigma = (0.001, 0.01, 0.1)\)
  • Trial Set: \(\{(0.01, 0.001), (0.01, 0.01), (0.01, 0.1), (0.1, 0.001), ..., (10, 0.1)\}\)
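
The grid is simply the Cartesian product of the candidate values; a minimal sketch with the standard library:

```python
from itertools import product

Cs     = (0.01, 0.1, 1, 10)   # regularization penalty C
sigmas = (0.001, 0.01, 0.1)   # Gaussian kernel width sigma

trial_set = list(product(Cs, sigmas))  # 4 * 3 = 12 (C, sigma) pairs
print(trial_set[:3])  # [(0.01, 0.001), (0.01, 0.01), (0.01, 0.1)]
```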

Why Manual Search + Grid Search

Challenge

  • Grid Search: the number of trials grows exponentially with the number of hyperparameters (curse of dimensionality)
  • Manual Search: results are hard to reproduce

Random Search

Why Random Search

Low Effective Dimensionality

  • \(\Psi\) is more sensitive to changes in some dimensions than in others.
  • For example, if \(f(x_1, x_2) \approx g(x_1)\), then \(f\) has a low effective dimension: only \(x_1\) really matters. See the toy sketch below.
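
A toy sketch of this effect, using a made-up response surface that depends almost entirely on \(x_1\): a \(3 \times 3\) grid spends 9 trials on only 3 distinct values of the important dimension, while 9 random points try 9 distinct values:

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x1: (x1 - 0.63) ** 2        # the important dimension; minimum at 0.63
f = lambda x1, x2: g(x1) + 1e-6 * x2   # x2 is nearly irrelevant

# Grid search: 9 trials but only 3 distinct values of x1.
axis = [0.25, 0.50, 0.75]
grid_best = min(f(a, b) for a in axis for b in axis)

# Random search: 9 trials and 9 distinct values of x1.
points = rng.uniform(0.0, 1.0, size=(9, 2))
rand_best = min(f(a, b) for a, b in points)

print(grid_best, rand_best)  # random search usually probes g more finely
```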

Example: Neural Networks

Estimating Generalization

  • \(\Psi^{(valid)}(\lambda) = \underset{x \in \mathcal{X}^{(valid)}}{mean} \mathcal{L}\left(x ; \mathcal{A}_\lambda \left( \mathcal{X}^{(train)} \right)\right)\)
  • \(\Psi^{(test)}(\lambda) = \underset{x \in \mathcal{X}^{(test)}}{mean} \mathcal{L}\left(x ; \mathcal{A}_\lambda \left( \mathcal{X}^{(train)} \right)\right)\)

Comparing Different Selection Methodologies

  • Pick the \(\hat{\lambda}\) that minimizes \(\Psi^{(valid)}(\lambda^{(s)})\) over the trial set
  • Compare methodologies by \(\Psi^{(test)}(\hat{\lambda})\)

Treat the Score as a Random Variable

  • Model \(\Psi^{(valid)}(\lambda)\) and \(\Psi^{(test)}(\lambda)\) as random variables
    • The uncertainty comes from approximating \(\mathcal{G}_x\) with the finite sets \(\mathcal{X}^{(valid)}\) and \(\mathcal{X}^{(test)}\)

Variance of \(\Psi(\lambda)\)

  • It depends on the loss. For example, the variance of the mean 0-1 loss is:
    • \(\mathbb{V}^{(valid)}(\lambda) = \frac{\Psi^{(valid)}(\lambda)\left(1 - \Psi^{(valid)}(\lambda)\right)}{\left|\mathcal{X}^{(valid)}\right| - 1}\)
    • \(\mathbb{V}^{(test)}(\lambda) = \frac{\Psi^{(test)}(\lambda)\left(1 - \Psi^{(test)}(\lambda)\right)}{\left|\mathcal{X}^{(test)}\right| - 1}\)

Probability of Picking \(\lambda^{(s)}\)

  • Let \(Z^{(i)} \sim \mathcal(N)\left( \Psi^{(valid)}(\lambda^{(i)}), \mathbb{V}^{(valid)}(\lambda^{(i)}) \right)\)
  • The probability of pick \(s\) as \(\hat{\lambda}\) is \(P\left(Z^{(s)} < Z^{(s')}, \forall s' \neq s\right)\)
    • Let \(w_s = P\left(Z^{(s)} < Z^{(s')}, \forall s' \neq s\right)\)
  • The mean of \(\Psi^{(test)}(\hat{\lambda})\): \[\mu_Z = \sum_{s=1}^{S}{w_s \Psi^{(test)}(\lambda^{(s)})}\]
  • The variance of \(\Psi^{(test)}(\hat{\lambda})\): \[\sigma_Z^2 = - \mu_Z^2 + \sum_{s=1}^{S} w_s\left(\Psi^{(test)}(\lambda^{(s)})^2 + \mathbb{V}^{(test)}(\lambda^{(s)})\right)\]
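
A Monte Carlo sketch of this analysis; the per-trial scores below are made-up illustrative numbers, not results from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
psi_valid = np.array([0.10, 0.12, 0.11])  # Psi^(valid)(lambda^(s)), illustrative
psi_test  = np.array([0.11, 0.12, 0.10])  # Psi^(test)(lambda^(s)), illustrative
n_valid, n_test = 2_000, 50_000

# Variance of the mean 0-1 loss (the formulas above).
v_valid = psi_valid * (1 - psi_valid) / (n_valid - 1)
v_test  = psi_test  * (1 - psi_test)  / (n_test  - 1)

# w_s = P(Z^(s) < Z^(s') for all s' != s), with Z^(s) ~ N(Psi^(valid), V^(valid)).
draws = rng.normal(psi_valid, np.sqrt(v_valid), size=(100_000, 3))
w = np.bincount(draws.argmin(axis=1), minlength=3) / len(draws)

mu_z  = np.sum(w * psi_test)                          # mean of Psi^(test)(lambda_hat)
var_z = np.sum(w * (psi_test**2 + v_test)) - mu_z**2  # variance of Psi^(test)(lambda_hat)
print(w, mu_z, var_z)
```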

Random Experiment Efficiency Curve

Experiments

MNIST handwritten digit dataset

Datasets

  • mnist basic: 10000 training, 2000 validation, 50000 testing
  • mnist background: 10000 training, 2000 validation, 50000 testing
  • mnist rotated: 10000 training, 2000 validation, 50000 testing
  • mnist rotated background: 10000 training, 2000 validation, 50000 testing
  • rectangles: 1000 training, 200 validation, 50000 testing
  • rectangles images: 10000 training, 2000 validation, 50000 testing

Hyperparameters

  • Similar to Larochelle et al. (2007)
  • The type of data preprocessing: none, normalization, or PCA (with various detailed settings)
  • Pretraining:
    • Distributions: uniform on (-1, 1), unit normal
    • Scaling heuristics: a coefficient between 0.1 and 10.0 divided by the square root of the number of inputs, or the square root of 6 divided by the square root of the number of inputs plus the number of hidden units
  • Number of hidden units is drawn geometrically from 18 to 1024
  • Activation function: sigmoid or tanh nonlinearity
  • Minibatch size: 20 or 100

Hyperparameters (Cont.)

  • Learning rate \(\varepsilon_0\) is drawn geometrically from 0.001 to 10
  • The annealing time \(t_0\) is drawn geometrically from 300 to 30000. The learning rate at step \(t\) is \(\varepsilon_t = \frac{t_0\varepsilon_0}{\max(t, t_0)}\)
  • The \(L_2\) regularization penalty is drawn geometrically from \(3.1 \times 10^{-7}\) to \(3.1 \times 10^{-5}\).
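
A minimal sketch of drawing one random trial point from this search space, reading "drawn geometrically" as log-uniform sampling; all names are illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng()

def log_uniform(lo, hi):
    """Geometric draw: uniform in log space."""
    return float(np.exp(rng.uniform(np.log(lo), np.log(hi))))

trial = {
    "preprocessing": str(rng.choice(["none", "normalize", "pca"])),
    "n_hidden":      int(round(log_uniform(18, 1024))),
    "activation":    str(rng.choice(["sigmoid", "tanh"])),
    "batch_size":    int(rng.choice([20, 100])),
    "lr0":           log_uniform(0.001, 10.0),     # epsilon_0
    "t0":            log_uniform(300, 30_000),     # annealing time
    "l2_penalty":    log_uniform(3.1e-7, 3.1e-5),
}
print(trial)
```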

Results: Random Experiment Efficiency Curve

Results: Automatic Relevance Determination

References

Bergstra, James, and Yoshua Bengio. 2012. “Random Search for Hyper-Parameter Optimization.” J. Mach. Learn. Res. 13 (February). JMLR.org: 281–305. http://dl.acm.org/citation.cfm?id=2188385.2188395.

Larochelle, Hugo, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. 2007. “An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation.” In Proceedings of the 24th International Conference on Machine Learning, 473–80. ICML ’07. New York, NY, USA: ACM. doi:10.1145/1273496.1273556.