Unconstrained Local Optimization: A Comparative Report of Quasi-Newton Methods Using the Multidimensional Rosenbrock Function

Author

Dominik Soós

Introduction

This report compares four quasi-Newton methods for unconstrained local optimization: DFP, BFGS, L-BFGS, and L-BFGS-B. We test them on the multidimensional Rosenbrock function, a standard benchmark for optimization algorithms, in 10, 20, 50, and 100 dimensions, measuring how fast they run, how accurate they are, and how much memory they use. Our goal is to determine which method is best suited to which situation. We expect L-BFGS and its variant L-BFGS-B to outperform the other two as the number of dimensions grows.

Unconstrained Optimization Algorithm

  • Require:
    • Objective function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\)
    • Gradient of the function \(\nabla f: \mathbb{R}^n \rightarrow \mathbb{R}^n\)
    • Initial guess \(x_0\)
  • Ensure: Local minimum \(x^*\) of \(f\)
  1. Set \(k = 0\)
  2. Choose an initial approximation \(H_0\) of the inverse Hessian, typically the identity matrix
  3. While not converged:
    1. Compute the search direction: \(p_k = -H_k \nabla f(x_k)\)
    2. Calculate step size \(\alpha_k\) using Line Search
    3. Update the new point: \(x_{k+1} = x_k + \alpha_k p_k\)
    4. Compute \(\delta x = x_{k+1} - x_k\)
    5. Compute \(\delta g = \nabla f(x_{k+1}) - \nabla f(x_k)\)
    6. Update formula for \(H_{k+1}\)
    7. \(k = k + 1\)
  4. Return \(x_k\) as the local minimum
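
For illustration, this loop translates into the following Python/NumPy sketch. It uses SciPy's line_search for step 3.2 and takes the method-specific inverse-Hessian update as a callback; the function name update_H, the tolerance, and the fallback step are illustrative choices, not the benchmarked implementation.

import numpy as np
from scipy.optimize import line_search

def quasi_newton(f, grad, x0, update_H, tol=1e-6, max_iter=500):
    # Generic quasi-Newton loop; update_H(H, dx, dg) returns H_{k+1}.
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                          # H_0 = identity (step 2)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:             # convergence test
            break
        p = -H @ g                              # search direction (step 3.1)
        alpha = line_search(f, grad, x, p)[0]   # Wolfe line search (step 3.2)
        if alpha is None:
            alpha = 1e-3                        # fallback if the line search fails
        x_new = x + alpha * p                   # step 3.3
        g_new = grad(x_new)
        dx, dg = x_new - x, g_new - g           # steps 3.4 and 3.5
        if dx @ dg > 1e-12:                     # curvature guard before updating H
            H = update_H(H, dx, dg)             # method-specific update (step 3.6)
        x, g = x_new, g_new
    return x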

Davidon–Fletcher–Powell update

The primary goal of DFP is to update the approximation of the inverse of the Hessian matrix. The update formula is: \[ H_{k+1} = H_k + \frac{\delta x \delta x^T}{\delta x^T \delta g} - \frac{H_k \delta g \delta g^T H_k}{\delta g^T H_k \delta g} \]

The first term is a positive rank-one update: \[\frac{\delta x \delta x^T}{\delta x^T \delta g} \]

The second term is a subtracted rank-one correction, so the overall DFP correction has rank two: \[\frac{H_k \delta g \delta g^T H_k}{\delta g^T H_k \delta g} \]
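
As a concrete illustration, the DFP formula maps onto NumPy as follows (a sketch only, with dx and dg standing for \(\delta x\) and \(\delta g\)); it plugs directly into the generic loop above as update_H.

import numpy as np

def dfp_update(H, dx, dg):
    # DFP update of the inverse-Hessian approximation H
    Hdg = H @ dg
    return (H
            + np.outer(dx, dx) / (dx @ dg)      # positive rank-one term
            - np.outer(Hdg, Hdg) / (dg @ Hdg))  # subtracted rank-one term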

Broyden–Fletcher–Goldfarb–Shanno update

The BFGS method is another Quasi-Newton method that approximates the inverse of the Hessian using a combination of rank-1 updates. The formula is: \[ H_{k+1} = \left(I - \frac{\delta x \delta g^T}{\delta x^T \delta g} \right) H_k \left(I - \frac{\delta g \delta x^T}{\delta x^T \delta g} \right) + \frac{\delta x \delta x^T}{\delta x^T \delta g} \]

which can be expanded as: \[ H_{k+1} = H_k - \frac{\delta x\, \delta g^T H_k + H_k\, \delta g\, \delta x^T}{\delta x^T \delta g} + \left(1 + \frac{\delta g^T H_k \delta g}{\delta x^T \delta g}\right) \frac{\delta x\, \delta x^T}{\delta x^T \delta g} \]
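
The product form above is the easiest to implement directly; a NumPy sketch under the same conventions as the DFP snippet:

import numpy as np

def bfgs_update(H, dx, dg):
    # BFGS update of the inverse-Hessian approximation H (product form)
    rho = 1.0 / (dx @ dg)
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(dx, dg)
    return V @ H @ V.T + rho * np.outer(dx, dx)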

Limited-memory BFGS or L-BFGS

L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) is a quasi-Newton algorithm for unconstrained optimization. It is particularly well suited to high-dimensional problems, such as those in machine learning, where the goal is to minimize a differentiable scalar function \(f(\mathbf{x})\).

Unlike the full BFGS algorithm, which requires storing a dense \(n \times n\) inverse Hessian matrix, L-BFGS is memory-efficient. It only stores a limited history of \(m\) updates of the gradient \(\nabla f(x)\) and position \(x\), generally with \(m < 10\).

The algorithm starts from an initial estimate \(\mathbf{x}_0\) and iteratively refines it through estimates \(\mathbf{x}_1, \mathbf{x}_2, \ldots\) The gradient \(g_k = \nabla f(\mathbf{x}_k)\) is used both to update the inverse-Hessian approximation and to form the search direction \(p_k = -H_k g_k\). The L-BFGS update for the inverse Hessian \(H_{k+1}\) is given by:

\[ H_{k+1} = (I - \rho_k s_k y_k^\top) H_k (I - \rho_k y_k s_k^\top) + \rho_k s_k s_k^\top \]

where \(\rho_k = \frac{1}{y_k^\top s_k}\), \(s_k = x_{k+1} - x_k\), and \(y_k = g_{k+1} - g_k\).
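
In practice L-BFGS never forms \(H_k\) explicitly. The standard two-loop recursion applies it to the current gradient using only the stored \((s_i, y_i)\) pairs. A sketch is given below, assuming history is a Python list of the m most recent pairs, oldest first; this illustrates the textbook recursion rather than any particular library's internals.

import numpy as np

def lbfgs_direction(g, history):
    # Two-loop recursion: returns p = -H_k g from the stored (s, y) pairs
    q = np.array(g, dtype=float)
    alphas = []
    for s, y in reversed(history):              # newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append((rho, a))
    if history:                                 # initial scaling H_0 = gamma * I
        s, y = history[-1]
        q *= (s @ y) / (y @ y)
    for (rho, a), (s, y) in zip(reversed(alphas), history):  # oldest pair first
        b = rho * (y @ q)
        q += (a - b) * s
    return -q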

L-BFGS-B (L-BFGS with Box Constraints)

L-BFGS-B is an extension of L-BFGS that can handle bound constraints, i.e., constraints of the form \[ l \leq x \leq u \] The algorithm uses projected gradients to ensure that the bounds are obeyed at every iteration.
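
SciPy's minimize exposes this solver directly. A minimal usage example on a 10-dimensional Rosenbrock problem follows; the starting point and the bounds of \([-2, 2]\) on every coordinate are arbitrary, illustrative choices rather than the benchmark setup.

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.full(10, -1.2)                          # arbitrary starting point
bounds = [(-2.0, 2.0)] * 10                     # box constraints l <= x <= u
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B", bounds=bounds)
print(res.x, res.fun, res.nit)                  # minimizer, objective value, iterations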

Multidimensional Rosenbrock Function

The Rosenbrock function is a commonly used test problem for optimization algorithms due to its non-convex nature and the difficulty in converging to the global minimum.

10-Dimensional Rosenbrock Summation

\[ f(\mathbf{x}) = \sum_{i=1}^{9} \left[ (1 - x_i)^2 + 100 \cdot (x_{i+1} - x_i^2)^2 \right] \] where \(\mathbf{x} = (x_1, x_2, \ldots, x_{10})\) and \(\mathbf{x} \in \mathbb{R}^{10}\).
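
A direct NumPy translation of this sum, written for arbitrary dimension, together with its analytic gradient (a sketch; not necessarily the implementation behind the benchmarks below):

import numpy as np

def rosenbrock(x):
    # f(x) = sum over i of (1 - x_i)^2 + 100 * (x_{i+1} - x_i^2)^2
    return np.sum((1.0 - x[:-1])**2 + 100.0 * (x[1:] - x[:-1]**2)**2)

def rosenbrock_grad(x):
    # Analytic gradient of the sum above
    g = np.zeros_like(x, dtype=float)
    g[:-1] = -2.0 * (1.0 - x[:-1]) - 400.0 * x[:-1] * (x[1:] - x[:-1]**2)
    g[1:] += 200.0 * (x[1:] - x[:-1]**2)
    return g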

We expect to see a difference in the amount of resources used by each algorithm as we increase the number of dimensions. The limited-memory variant of BFGS, L-BFGS, stores only a few vectors that implicitly represent the inverse-Hessian approximation, so it should be especially well suited to high-dimensional problems.

10 Dimensions

Time in milliseconds

Error

Since we use local minimization, each algorithm converges to the same local minimizer, so the error statistics are identical across methods.

# A tibble: 4 × 6
  Algorithm     mean   median    sd      min      max
  <chr>        <dbl>    <dbl> <dbl>    <dbl>    <dbl>
1 bfgs      1.80e308 1.80e308   Inf 1.80e308 1.80e308
2 dfp       1.80e308 1.80e308   Inf 1.80e308 1.80e308
3 lbfgs     1.80e308 1.80e308   Inf 1.80e308 1.80e308
4 lbfgsb    1.80e308 1.80e308   Inf 1.80e308 1.80e308

Memory Usage

# A tibble: 4 × 6
  Algorithm  mean median    sd   min   max
  <chr>     <dbl>  <int> <dbl> <int> <int>
1 bfgs       42.7     16  46.2    16    96
2 dfp        58.7     32  46.2    32   112
3 lbfgs     219.     256  78.9   128   272
4 lbfgsb     96       96  32      64   128

20 Dimensions

Time in seconds

Error

Same as in 10 dimensions: all algorithms converge to the same point, so the error statistics are unchanged.

# A tibble: 4 × 6
  Algorithm     mean   median    sd      min      max
  <chr>        <dbl>    <dbl> <dbl>    <dbl>    <dbl>
1 bfgs      1.80e308 1.80e308   Inf 1.80e308 1.80e308
2 dfp       1.80e308 1.80e308   Inf 1.80e308 1.80e308
3 lbfgs     1.80e308 1.80e308   Inf 1.80e308 1.80e308
4 lbfgsb    1.80e308 1.80e308   Inf 1.80e308 1.80e308

Memory Usage

# A tibble: 4 × 6
  Algorithm  mean median    sd   min   max
  <chr>     <dbl>  <dbl> <dbl> <int> <int>
1 bfgs       104      96  59.1    48   176
2 dfp         84      96  35.5    32   112
3 lbfgs      235.    208  60.6   192   304
4 lbfgsb     144     120  53.9   112   224

50 Dimensions

Time in seconds

Memory Usage

# A tibble: 4 × 6
  Algorithm  mean median    sd   min   max
  <chr>     <dbl>  <dbl> <dbl> <int> <int>
1 bfgs       795.    784 144.    656   944
2 dfp        885.    912  46.2   832   912
3 lbfgs      320     360  96.9   176   384
4 lbfgsb     155.    176  51.4    96   192

100 Dimensions

Time in seconds

Memory Usage

# A tibble: 4 × 6
  Algorithm  mean median    sd   min   max
  <chr>     <dbl>  <int> <dbl> <int> <int>
1 bfgs      1941.   1840 190.   1824  2160
2 dfp       1701.   1760 379.   1296  2048
3 lbfgs      448     464  42.3   400   480
4 lbfgsb     277.    288  64.7   208   336

All Dimensions (Concatenated Data)

Time in seconds

Memory Usage

# A tibble: 4 × 6
  Algorithm  mean median    sd   min   max
  <chr>     <dbl>  <int> <dbl> <int> <int>
1 bfgs       673.    176 790.     16  2160
2 dfp        636.    112 714.     32  2048
3 lbfgs      203.    208 138.      0   384
4 lbfgsb     166.    128  80.9    64   336

Data Analysis

Time

The time taken for the algorithms varies significantly as the dimensionality increases. In 10D, L-BFGS-B is the fastest, followed by DFP, then L-BFGS, with BFGS being the slowest. This trend generally holds for higher dimensions as well, with L-BFGS-B consistently outperforming the others.

Error

Since the objective is local optimization, the approximation error remains constant across all algorithms and dimensions.

Memory

Memory usage follows a different pattern. BFGS and DFP show similar memory curves that grow rapidly with dimension, consistent with storing a dense \(n \times n\) inverse-Hessian approximation. In contrast, L-BFGS uses noticeably more memory than either of them in low dimensions but scales far better, while L-BFGS-B offers a balanced profile, combining low memory usage with fast execution times.