March 14, 2017

Bergstra and Bengio (2012)

Hyperparameter Selection (Theory)

\[\begin{equation}\lambda^{(*)} = \underset{\lambda \in \Sigma}{argmin}{\mathbb{E}_{x \sim \mathcal{G}_{x}}\left[\mathcal{L}\left(x ; \mathcal{A}_{\lambda}\left(\mathcal{X}^{(train)}\right)\right)\right]}\label{eq:hyper-optimization}\end{equation}\]

  • learning algorithm \(\mathcal{A}\): finds a function \(f\) that minimizes the loss \(\mathcal{L}(x;f)\) over i.i.d. samples \(x\) from the ground-truth distribution \(\mathcal{G}_x\).
    • \(\mathcal{A} : \mathcal{X}^{(train)} \rightarrow f\)
  • Given a hyperparameter \(\lambda \in \Sigma\), the learned function is \(f = \mathcal{A}_\lambda\left(\mathcal{X}^{(train)}\right)\)
  • \(\eqref{eq:hyper-optimization}\) selects the \(\lambda\) that minimizes the expected loss

Hyperparameter Selection (Practice)

  • There is no efficient algorithm for solving \(\eqref{eq:hyper-optimization}\) directly
  • \(\mathcal{G}_x\) is approximated by the validation set \(\mathcal{X}^{(valid)}\)
  • Search a finite set of trial points \(\{\lambda^{(1)}, \lambda^{(2)}, ..., \lambda^{(S)}\}\) for the best hyperparameter, as sketched below

\[\begin{eqnarray} \lambda^{(*)} &\approx& \underset{\lambda \in \Sigma}{argmin} \underset{x \in \mathcal{X}^{(valid)}}{mean} \mathcal{L}\left( x ; \mathcal{A}_\lambda \left(\mathcal{X}^{(train)}\right) \right) \\ &=& \underset{\lambda \in \Sigma}{argmin} \Psi(\lambda) \\ &\approx& \underset{\lambda \in \{\lambda^{(1)}, \lambda^{(2)}, ..., \lambda^{(S)}\}}{argmin} \Psi(\lambda) = \hat{\lambda} \end{eqnarray}\]
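
A minimal sketch of this selection procedure in Python, assuming a generic `train` callable standing in for \(\mathcal{A}_\lambda\) (trial points are dicts of hyperparameters) and a per-example `loss`; both names are hypothetical stand-ins, not the paper's code:

```python
import numpy as np

def psi(lam, train, loss, X_train, X_valid):
    """Estimate Psi(lambda): mean validation loss of the model
    that the learning algorithm A_lambda produces on the training set."""
    f = train(X_train, **lam)                      # f = A_lambda(X^(train))
    return np.mean([loss(x, f) for x in X_valid])  # mean over X^(valid)

def select(trials, train, loss, X_train, X_valid):
    """Return lambda_hat: the trial point with the smallest Psi estimate."""
    scores = [psi(lam, train, loss, X_train, X_valid) for lam in trials]
    return trials[int(np.argmin(scores))]
```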

How to Choose the Trial Set

  • Grid Search
    • The Cartesian product of candidate values for each hyperparameter
  • Manual Search
    • Domain Knowledge
    • Intuition

Example: Gaussian kernel SVM

  • \(\lambda = (C, \sigma)\)
    • \(C\): regularization penalty
    • \(\sigma\): width of the Gaussian kernel
  • \(C = (0.01, 0.1, 1, 10)\)
  • \(\sigma = (0.001, 0.01, 0.1)\)
  • Trial Set: \(\{(0.01, 0.001), (0.01, 0.01), (0.01, 0.1), (0.1, 0.001), ..., (10, 0.1)\}\)
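
The grid is simply the Cartesian product of the candidate values; a minimal sketch with the standard library:

```python
from itertools import product

Cs     = (0.01, 0.1, 1, 10)   # regularization penalty C
sigmas = (0.001, 0.01, 0.1)   # Gaussian kernel width sigma

trial_set = list(product(Cs, sigmas))  # 4 * 3 = 12 (C, sigma) pairs
print(trial_set[:3])  # [(0.01, 0.001), (0.01, 0.01), (0.01, 0.1)]
```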

Why Manual Search + Grid Search

Challenge

  • Grid Search: the number of trials grows exponentially with the number of hyperparameters (curse of dimensionality)
  • Manual Search: results are hard to reproduce

Random Search

Why Random Search

Low Effective Dimensionality

  • \(\Psi\) is more sensitive to changes in some dimensions than in others.
  • For example, if \(f(x_1, x_2) \approx g(x_1)\), then \(f\) has a low effective dimension: only \(x_1\) really matters. See the toy sketch below.
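
A toy sketch of this effect, using a made-up response surface that depends almost entirely on \(x_1\): a \(3 \times 3\) grid spends 9 trials on only 3 distinct values of the important dimension, while 9 random points try 9 distinct values:

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x1: (x1 - 0.63) ** 2        # the important dimension; minimum at 0.63
f = lambda x1, x2: g(x1) + 1e-6 * x2   # x2 is nearly irrelevant

# Grid search: 9 trials but only 3 distinct values of x1.
axis = [0.25, 0.50, 0.75]
grid_best = min(f(a, b) for a in axis for b in axis)

# Random search: 9 trials and 9 distinct values of x1.
points = rng.uniform(0.0, 1.0, size=(9, 2))
rand_best = min(f(a, b) for a, b in points)

print(grid_best, rand_best)  # random search usually probes g more finely
```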

Example: Neural Networks

Estimating Generalization

  • \(\Psi^{(valid)}(\lambda) = \underset{x \in \mathcal{X}^{(valid)}}{mean} \mathcal{L}\left(x ; \mathcal{A}_\lambda \left( \mathcal{X}^{(train)} \right)\right)\)
  • \(\Psi^{(test)}(\lambda) = \underset{x \in \mathcal{X}^{(test)}}{mean} \mathcal{L}\left(x ; \mathcal{A}_\lambda \left( \mathcal{X}^{(train)} \right)\right)\)

Comparing Different Selection Methodologies

  • Pick the \(\hat{\lambda}\) that minimizes \(\Psi^{(valid)}(\lambda^{(s)})\) over the trial set
  • Compare methodologies by \(\Psi^{(test)}(\hat{\lambda})\)

Treat the Score as a Random Variable

  • Model \(\Psi^{(valid)}(\lambda)\) and \(\Psi^{(test)}(\lambda)\) as random variables
    • The uncertainty comes from approximating \(\mathcal{G}_x\) with the finite sets \(\mathcal{X}^{(valid)}\) and \(\mathcal{X}^{(test)}\)

Variance of \(\Psi(\lambda)\)

  • It depends on the loss. For example, the variance of the mean 0-1 loss is:
    • \(\mathbb{V}^{(valid)}(\lambda) = \frac{\Psi^{(valid)}(\lambda)\left(1 - \Psi^{(valid)}(\lambda)\right)}{\left|\mathcal{X}^{(valid)}\right| - 1}\)
    • \(\mathbb{V}^{(test)}(\lambda) = \frac{\Psi^{(test)}(\lambda)\left(1 - \Psi^{(test)}(\lambda)\right)}{\left|\mathcal{X}^{(test)}\right| - 1}\)

Probability of Picking \(\lambda^{(s)}\)

  • Let \(Z^{(i)} \sim \mathcal(N)\left( \Psi^{(valid)}(\lambda^{(i)}), \mathbb{V}^{(valid)}(\lambda^{(i)}) \right)\)
  • The probability of pick \(s\) as \(\hat{\lambda}\) is \(P\left(Z^{(s)} < Z^{(s')}, \forall s' \neq s\right)\)
    • Let \(w_s = P\left(Z^{(s)} < Z^{(s')}, \forall s' \neq s\right)\)
  • The mean of \(\Psi^{(test)}(\hat{\lambda})\): \[\mu_Z = \sum_{s=1}^{S}{w_s \Psi^{(test)}(\lambda^{(s)})}\]
  • The variance of \(\Psi^{(test)}(\hat{\lambda})\): \[\sigma_Z^2 = - \mu_Z^2 + \sum_{s=1}^{S} w_s\left(\Psi^{(test)}(\lambda^{(s)})^2 + \mathbb{V}^{(test)}(\lambda^{(s)})\right)\]
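
A Monte Carlo sketch of this analysis; the per-trial scores below are made-up illustrative numbers, not results from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
psi_valid = np.array([0.10, 0.12, 0.11])  # Psi^(valid)(lambda^(s)), illustrative
psi_test  = np.array([0.11, 0.12, 0.10])  # Psi^(test)(lambda^(s)), illustrative
n_valid, n_test = 2_000, 50_000

# Variance of the mean 0-1 loss (the formulas above).
v_valid = psi_valid * (1 - psi_valid) / (n_valid - 1)
v_test  = psi_test  * (1 - psi_test)  / (n_test  - 1)

# w_s = P(Z^(s) < Z^(s') for all s' != s), with Z^(s) ~ N(Psi^(valid), V^(valid)).
draws = rng.normal(psi_valid, np.sqrt(v_valid), size=(100_000, 3))
w = np.bincount(draws.argmin(axis=1), minlength=3) / len(draws)

mu_z  = np.sum(w * psi_test)                          # mean of Psi^(test)(lambda_hat)
var_z = np.sum(w * (psi_test**2 + v_test)) - mu_z**2  # variance of Psi^(test)(lambda_hat)
print(w, mu_z, var_z)
```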

Random Experiment Efficiency Curve

Experiments

MNIST handwritten digit dataset

Datasets

  • mnist basic: 10000 training, 2000 validation, 50000 testing
  • mnist background: 10000 training, 2000 validation, 50000 testing
  • mnist rotated: 10000 training, 2000 validation, 50000 testing
  • mnist rotated background: 10000 training, 2000 validation, 50000 testing
  • rectangles: 1000 training, 200 validation, 50000 testing
  • rectangles images: 10000 training, 2000 validation, 50000 testing

Hyperparameters

  • Similar to Larochelle et al. (2007)
  • The type of data preprocessing: none, normalization, or PCA (with various detailed settings)
  • Pretraining:
    • Distributions: uniform on (-1, 1), unit normal
    • Scaling heuristics: a coefficient between 0.1 and 10.0 divided by the square root of the number of inputs, or the square root of 6 divided by the square root of the number of inputs plus the number of hidden units
  • Number of hidden units is drawn geometrically from 18 to 1024
  • Activation function: sigmoid or tanh nonlinearity
  • Minibatch size: 20 or 100

Hyperparameters (Cont.)

  • Learning rate \(\varepsilon_0\) is drawn geometrically from 0.001 to 10
  • The annealing time \(t_0\) is drawn geometrically from 300 to 30000. The learning rate at step \(t\) is \(\varepsilon_t = \frac{t_0\varepsilon_0}{\max(t, t_0)}\)
  • The \(L_2\) regularization penalty is drawn geometrically from \(3.1 \times 10^{-7}\) to \(3.1 \times 10^{-5}\).
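
A minimal sketch of drawing one random trial point from this search space, reading "drawn geometrically" as log-uniform sampling; all names are illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng()

def log_uniform(lo, hi):
    """Geometric draw: uniform in log space."""
    return float(np.exp(rng.uniform(np.log(lo), np.log(hi))))

trial = {
    "preprocessing": str(rng.choice(["none", "normalize", "pca"])),
    "n_hidden":      int(round(log_uniform(18, 1024))),
    "activation":    str(rng.choice(["sigmoid", "tanh"])),
    "batch_size":    int(rng.choice([20, 100])),
    "lr0":           log_uniform(0.001, 10.0),     # epsilon_0
    "t0":            log_uniform(300, 30_000),     # annealing time
    "l2_penalty":    log_uniform(3.1e-7, 3.1e-5),
}
print(trial)
```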

Results: Random Experiment Efficiency Curve

Results: Automatic Relevance Determination

References

Bergstra, James, and Yoshua Bengio. 2012. “Random Search for Hyper-Parameter Optimization.” J. Mach. Learn. Res. 13 (February). JMLR.org: 281–305. http://dl.acm.org/citation.cfm?id=2188385.2188395.

Larochelle, Hugo, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. 2007. “An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation.” In Proceedings of the 24th International Conference on Machine Learning, 473–80. ICML ’07. New York, NY, USA: ACM. doi:10.1145/1273496.1273556.