Unconstrained Local Optimization: A Comparative Report of Quasi-Newton Methods Using the Multidimensional Rosenbrock Function
Introduction
This report compares four quasi-Newton methods for unconstrained local optimization: DFP, BFGS, L-BFGS, and L-BFGS-B. As the test problem we use the multidimensional Rosenbrock function, a standard optimization benchmark, and we run each method in 10, 20, 50, and 100 dimensions to measure run time, accuracy, and memory usage. The goal is to determine which method is best suited to which situation. We expect L-BFGS and its variant L-BFGS-B to outperform the other two as the number of dimensions increases.
Unconstrained Optimization Algorithm
- Require:
- Objective function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\)
- Gradient of the function \(\nabla f: \mathbb{R}^n \rightarrow \mathbb{R}^n\)
- Initial guess \(x_0\)
- Ensure: Local minimum \(x^*\) of \(f\)
- Set \(k = 0\)
- Choose an initial approximation \(H_0\) to the inverse Hessian, typically the identity matrix
- While not converged:
- Compute the search direction: \(p_k = -H_k \nabla f(x_k)\)
- Calculate step size \(\alpha_k\) using Line Search
- Update the new point: \(x_{k+1} = x_k + \alpha_k p_k\)
- Compute \(\delta x = x_{k+1} - x_k\)
- Compute \(\delta g = \nabla f(x_{k+1}) - \nabla f(x_k)\)
- Apply the method-specific update formula to obtain \(H_{k+1}\)
- \(k = k + 1\)
- Return \(x_k\) as the local minimum
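As an illustration, the loop above can be written in a few lines of R. This is a minimal sketch, not a production implementation: the function names (`quasi_newton`, `update_H`) are our own, and a simple backtracking (Armijo) line search stands in for the stronger Wolfe line search normally used with quasi-Newton methods.

```r
# Minimal sketch of the generic quasi-Newton loop; `update_H` is any
# inverse-Hessian update rule (DFP, BFGS, ...) supplied by the caller.
quasi_newton <- function(f, grad, x0, update_H, tol = 1e-6, max_iter = 1000) {
  x <- x0
  H <- diag(length(x0))                    # H_0: identity approximation to the inverse Hessian
  g <- grad(x)
  for (k in seq_len(max_iter)) {
    if (sqrt(sum(g^2)) < tol) break        # converged on the gradient norm
    p <- -as.vector(H %*% g)               # search direction p_k = -H_k grad f(x_k)
    fx <- f(x)
    alpha <- 1                             # backtracking (Armijo) line search
    while (f(x + alpha * p) > fx + 1e-4 * alpha * sum(g * p) && alpha > 1e-12) {
      alpha <- alpha / 2
    }
    x_new <- x + alpha * p                 # x_{k+1} = x_k + alpha_k p_k
    g_new <- grad(x_new)
    dx <- x_new - x                        # delta x
    dg <- g_new - g                        # delta g
    if (sum(dx * dg) > 1e-12) {            # curvature condition keeps H positive definite
      H <- update_H(H, dx, dg)
    }
    x <- x_new
    g <- g_new
  }
  list(par = x, value = f(x), iterations = k)
}
```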
Davidon–Fletcher–Powell (DFP) Update
The primary goal of DFP is to update the approximation of the inverse of the Hessian matrix. The update formula is: \[ H_{k+1} = H_k + \frac{\delta x \delta x^T}{\delta x^T \delta g} - \frac{H_k \delta g \delta g^T H_k}{\delta g^T H_k \delta g} \]
positive rank-1 term: \[\frac{\delta x \delta x^T}{\delta x^T \delta g} \]
negative rank-1 term: \[\frac{H_k \delta g \delta g^T H_k}{\delta g^T H_k \delta g} \]
Together these form a rank-2 correction to \(H_k\).
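Written directly from the formula, the update is only a few lines of R. This is a sketch in the form expected by the generic loop above; the function name `dfp_update` is our own.

```r
# DFP inverse-Hessian update:
# H_{k+1} = H_k + dx dx^T / (dx^T dg) - (H_k dg)(H_k dg)^T / (dg^T H_k dg)
dfp_update <- function(H, dx, dg) {
  Hdg <- as.vector(H %*% dg)
  H + outer(dx, dx) / sum(dx * dg) - outer(Hdg, Hdg) / sum(dg * Hdg)
}
```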
Broyden–Fletcher–Goldfarb–Shanno (BFGS) Update
The BFGS method is another quasi-Newton method that approximates the inverse Hessian, applying a rank-2 correction at each iteration. The formula is: \[ H_{k+1} = \left(I - \frac{\delta x \delta g^T}{\delta x^T \delta g} \right) H_k \left(I - \frac{\delta g \delta x^T}{\delta x^T \delta g} \right) + \frac{\delta x \delta x^T}{\delta x^T \delta g} \]
which can be expanded as: \[ H_{k+1} = H_k - \frac{H_k \delta g \delta x^T + \delta x \delta g^T H_k}{\delta x^T \delta g} + \left(1 + \frac{\delta g^T H_k \delta g}{\delta x^T \delta g}\right) \frac{\delta x \delta x^T}{\delta x^T \delta g} \]
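A matching R sketch of the BFGS update (function name again our own) applies the product form of the formula rather than the expansion:

```r
# BFGS inverse-Hessian update in product form:
# H_{k+1} = (I - rho dx dg^T) H_k (I - rho dg dx^T) + rho dx dx^T,  rho = 1 / (dx^T dg)
bfgs_update <- function(H, dx, dg) {
  rho <- 1 / sum(dx * dg)
  V <- diag(length(dx)) - rho * outer(dx, dg)
  V %*% H %*% t(V) + rho * outer(dx, dx)
}
```

Either update can be passed to the generic loop above, e.g. `quasi_newton(f, grad, x0, bfgs_update)`.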
Limited-Memory BFGS (L-BFGS)
L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) is a quasi-Newton algorithm for unconstrained optimization. It is particularly well suited to high-dimensional problems, such as those arising in machine learning, where the goal is to minimize a differentiable scalar function \(f(\mathbf{x})\).
Unlike the full BFGS algorithm, which requires storing a dense \(n \times n\) inverse Hessian matrix, L-BFGS is memory-efficient. It only stores a limited history of \(m\) updates of the gradient \(\nabla f(x)\) and position \(x\), generally with \(m < 10\).
The algorithm starts with an initial estimate \(\mathbf{x}_0\) and iteratively refines it, producing estimates \(\mathbf{x}_1, \mathbf{x}_2, \ldots\) The gradients \(g_k = \nabla f(\mathbf{x}_k)\) of recent iterations are used to build an implicit approximation of the inverse Hessian, which in turn determines the search direction. The L-BFGS update for the inverse Hessian \(H_{k+1}\) is given by:
\[ H_{k+1} = (I - \rho_k s_k y_k^\top) H_k (I - \rho_k y_k s_k^\top) + \rho_k s_k s_k^\top \]
where \(\rho_k = \frac{1}{y_k^\top s_k}\), \(s_k = x_{k+1} - x_k\), \(y_k = g_{k+1} - g_k\)
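In practice L-BFGS never forms \(H_{k+1}\) explicitly. Instead, the product \(H_k g_k\) is evaluated with the standard two-loop recursion over the \(m\) stored pairs \((s_i, y_i)\). The R sketch below illustrates this; the function name `lbfgs_direction` and the convention that the lists `S` and `Y` hold the history oldest pair first are our own choices.

```r
# Two-loop recursion: compute the search direction -H_k g without forming H_k.
lbfgs_direction <- function(g, S, Y) {
  m <- length(S)
  if (m == 0) return(-g)                       # no history yet: steepest descent
  rho <- vapply(seq_len(m), function(i) 1 / sum(Y[[i]] * S[[i]]), numeric(1))
  alpha <- numeric(m)
  q <- g
  for (i in m:1) {                             # first loop: newest to oldest pair
    alpha[i] <- rho[i] * sum(S[[i]] * q)
    q <- q - alpha[i] * Y[[i]]
  }
  gamma <- sum(S[[m]] * Y[[m]]) / sum(Y[[m]] * Y[[m]])
  r <- gamma * q                               # implicit initial matrix H_0 = gamma * I
  for (i in 1:m) {                             # second loop: oldest to newest pair
    beta <- rho[i] * sum(Y[[i]] * r)
    r <- r + S[[i]] * (alpha[i] - beta)
  }
  -r                                           # search direction p_k = -H_k g_k
}
```

The memory cost is therefore \(2mn\) numbers instead of the \(n^2\) required by a dense inverse-Hessian approximation.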
L-BFGS-B (L-BFGS with Box Constraints)
L-BFGS-B is an extension of L-BFGS that can handle bound constraints, i.e., constraints of the form \[ l \leq x \leq u \] The algorithm uses projected gradients to ensure that the bounds are obeyed at every iteration.
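For example, R's built-in `optim()` includes an L-BFGS-B implementation in which the bounds are supplied through the `lower` and `upper` arguments. The sketch below applies it to a 10-dimensional Rosenbrock objective on the box \([-2, 2]^{10}\); the box and the starting point are illustrative choices, not the settings used in the benchmark.

```r
rosen <- function(x) {                         # 10-D Rosenbrock objective
  n <- length(x)
  sum((1 - x[-n])^2 + 100 * (x[-1] - x[-n]^2)^2)
}
res <- optim(
  par     = rep(0, 10),                        # initial guess x_0
  fn      = rosen,
  method  = "L-BFGS-B",
  lower   = rep(-2, 10),                       # l
  upper   = rep(2, 10),                        # u
  control = list(maxit = 1000)
)
res$par                                        # bound-constrained local minimizer
```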
Multidimensional Rosenbrock Function
The Rosenbrock function is a commonly used test problem for optimization algorithms: it is non-convex, and its global minimum at \(\mathbf{x} = (1, \ldots, 1)\) lies in a long, narrow, parabolic valley that is easy to enter but difficult to traverse, which makes convergence to the minimum challenging.
10-Dimensional Rosenbrock Summation
\[ f(\mathbf{x}) = \sum_{i=1}^{9} \left[ (1 - x_i)^2 + 100 \cdot (x_{i+1} - x_i^2)^2 \right] \] where \(\mathbf{x} = (x_1, x_2, \ldots, x_{10})\) and \(\mathbf{x} \in \mathbb{R}^{10}\)
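For reference, a sketch of this function and its analytic gradient in R, written so that it works for any dimension \(n\):

```r
# n-dimensional Rosenbrock function; global minimum f(x) = 0 at x = (1, ..., 1).
rosenbrock <- function(x) {
  n <- length(x)
  sum((1 - x[-n])^2 + 100 * (x[-1] - x[-n]^2)^2)
}

# Analytic gradient: each term i contributes to the partial derivatives
# with respect to x_i and x_{i+1}.
rosenbrock_grad <- function(x) {
  n <- length(x)
  g <- numeric(n)
  g[-n] <- -2 * (1 - x[-n]) - 400 * x[-n] * (x[-1] - x[-n]^2)  # d/dx_i of term i
  g[-1] <- g[-1] + 200 * (x[-1] - x[-n]^2)                     # d/dx_{i+1} of term i
  g
}

rosenbrock(rep(1, 10))   # 0 at the global minimum
```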
We expect the resource usage of the algorithms to diverge as the number of dimensions increases. The limited-memory variant of BFGS, L-BFGS, stores only a few vectors that implicitly represent the approximation of the inverse Hessian matrix, so it should be particularly well suited to high-dimensional problems.
10 Dimensions
Time in milliseconds
Error
Because all four methods perform local minimization from the same starting point, they converge to the same local minimizer, so the recorded error is identical across algorithms.
# A tibble: 4 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 1.80e308 1.80e308 Inf 1.80e308 1.80e308
2 dfp 1.80e308 1.80e308 Inf 1.80e308 1.80e308
3 lbfgs 1.80e308 1.80e308 Inf 1.80e308 1.80e308
4 lbfgsb 1.80e308 1.80e308 Inf 1.80e308 1.80e308
Memory Usage
# A tibble: 4 × 6
Algorithm mean median sd min max
<chr> <dbl> <int> <dbl> <int> <int>
1 bfgs 42.7 16 46.2 16 96
2 dfp 58.7 32 46.2 32 112
3 lbfgs 219. 256 78.9 128 272
4 lbfgsb 96 96 32 64 128
20 Dimensions
Time in seconds
Error
As in the 10-dimensional case, all algorithms converge to the same local minimizer, so the error values are identical.
# A tibble: 4 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 1.80e308 1.80e308 Inf 1.80e308 1.80e308
2 dfp 1.80e308 1.80e308 Inf 1.80e308 1.80e308
3 lbfgs 1.80e308 1.80e308 Inf 1.80e308 1.80e308
4 lbfgsb 1.80e308 1.80e308 Inf 1.80e308 1.80e308
Memory Usage
# A tibble: 4 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 104 96 59.1 48 176
2 dfp 84 96 35.5 32 112
3 lbfgs 235. 208 60.6 192 304
4 lbfgsb 144 120 53.9 112 224
50 Dimensions
Time in seconds
Memory Usage
# A tibble: 4 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 795. 784 144. 656 944
2 dfp 885. 912 46.2 832 912
3 lbfgs 320 360 96.9 176 384
4 lbfgsb 155. 176 51.4 96 192
100 Dimensions
Time in seconds
Memory Usage
# A tibble: 4 × 6
Algorithm mean median sd min max
<chr> <dbl> <int> <dbl> <int> <int>
1 bfgs 1941. 1840 190. 1824 2160
2 dfp 1701. 1760 379. 1296 2048
3 lbfgs 448 464 42.3 400 480
4 lbfgsb 277. 288 64.7 208 336
All Dimensions (Concatenated Data)
Time in seconds
Memory Usage
# A tibble: 4 × 6
Algorithm mean median sd min max
<chr> <dbl> <int> <dbl> <int> <int>
1 bfgs 673. 176 790. 16 2160
2 dfp 636. 112 714. 32 2048
3 lbfgs 203. 208 138. 0 384
4 lbfgsb 166. 128 80.9 64 336
Data Analysis
Time
Run time increases markedly with dimensionality for all four algorithms. In 10D, L-BFGS-B is the fastest, followed by DFP, then L-BFGS, with BFGS the slowest. This ordering generally holds in higher dimensions as well, with L-BFGS-B consistently outperforming the others.
Error
Because the objective is local optimization from a common starting point, all algorithms converge to the same local minimizer, so the measured error is the same across algorithms and dimensions.
Memory
Memory usage shows the clearest separation between the methods. BFGS and DFP exhibit similar memory curves that grow rapidly with dimension, as expected from storing a dense \(n \times n\) approximation of the inverse Hessian, and they dominate memory consumption at 50D and 100D. L-BFGS, in contrast, uses more memory than either in the low-dimensional runs, but its footprint depends on the stored history size \(m\) rather than on \(n\), so it is far cheaper in high dimensions. L-BFGS-B offers a balanced profile, combining low memory usage with fast execution times.