Resource Consumption in Optimization Algorithms: A Focus on Quasi-Newton Methods
Overview of the Algorithm
Generic quasi-Newton algorithm for unconstrained optimization (an R sketch follows the pseudocode):
- Require:
- Objective function \[ f: \mathbb{R}^n \rightarrow \mathbb{R} \]
- Gradient of the function \[ \nabla f: \mathbb{R}^n \rightarrow \mathbb{R}^n \]
- Initial guess \[x_0\]
- Ensure: Local minimum \[ x^* \] of \[ f \]
- Set \[ k = 0 \]
- Choose an initial approximation \[ H_0 \] to the inverse Hessian (typically the identity matrix)
- While not converged:
- Compute the search direction: \[ p_k = -H_k \nabla f(x_k) \]
- Compute the step size \[ \alpha_k \] with a line search
- Update the new point: \[ x_{k+1} = x_k + \alpha_k p_k \]
- Compute \[ \delta x = x_{k+1} - x_k \]
- Compute \[ \delta g = \nabla f(x_{k+1}) - \nabla f(x_k) \]
- Update \[ H_{k+1} \] from \[ H_k \], \[ \delta x \], and \[ \delta g \] using a quasi-Newton update formula (DFP or BFGS below)
- \[ k = k + 1 \]
- Return \[ x_k \] as the local minimum
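The loop above can be made concrete with a short R sketch. This is a minimal illustration, assuming a simple backtracking (Armijo) line search; `f`, `grad`, and `h_update` are placeholders for the objective, its gradient, and one of the inverse-Hessian update formulas described below.

```r
# Minimal quasi-Newton driver (sketch, not the exact benchmarked implementation).
# `h_update(H, dx, dg)` supplies the DFP or BFGS formula.
quasi_newton <- function(f, grad, x0, h_update, tol = 1e-8, max_iter = 200) {
  x <- x0
  H <- diag(length(x0))               # H_0: identity approximation of the inverse Hessian
  g <- grad(x)
  for (k in seq_len(max_iter)) {
    if (sqrt(sum(g^2)) < tol) break   # converged: gradient norm below tolerance
    p <- drop(-H %*% g)               # search direction p_k = -H_k grad f(x_k)
    alpha <- 1                        # backtracking line search on the Armijo condition
    for (ls in 1:50) {
      if (f(x + alpha * p) <= f(x) + 1e-4 * alpha * sum(g * p)) break
      alpha <- alpha / 2
    }
    x_new <- x + alpha * p
    g_new <- grad(x_new)
    H <- h_update(H, x_new - x, g_new - g)   # quasi-Newton update of H
    x <- x_new
    g <- g_new
  }
  x                                   # approximate local minimum
}
```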
Davidon–Fletcher–Powell (DFP) update
The DFP method maintains an approximation \[ H_k \] of the inverse Hessian and updates it at every iteration. The update formula is: \[ H_{k+1} = H_k + \frac{\delta x \delta x^T}{\delta x^T \delta g} - \frac{H_k \delta g \delta g^T H_k}{\delta g^T H_k \delta g} \]
rank-1 term added: \[\frac{\delta x \delta x^T}{\delta x^T \delta g} \]
rank-1 term subtracted: \[\frac{H_k \delta g \delta g^T H_k}{\delta g^T H_k \delta g} \]
Together the two terms form a symmetric rank-2 correction to \[ H_k \].
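A direct R transcription of the update, assuming \[ \delta x \] and \[ \delta g \] are passed as plain numeric vectors `dx` and `dg`, might look like:

```r
# DFP update of the inverse-Hessian approximation H (sketch).
dfp_update <- function(H, dx, dg) {
  dx  <- matrix(dx, ncol = 1)                  # column vector: delta x
  dg  <- matrix(dg, ncol = 1)                  # column vector: delta g
  Hdg <- H %*% dg
  H + (dx %*% t(dx)) / drop(t(dx) %*% dg) -    # added rank-1 term
      (Hdg %*% t(Hdg)) / drop(t(dg) %*% Hdg)   # subtracted rank-1 term (H dg dg' H)
}
```

In practice the update is only applied when the curvature condition \[ \delta x^T \delta g > 0 \] holds, so that \[ H_{k+1} \] stays positive definite.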
Broyden–Fletcher–Goldfarb–Shanno (BFGS) update
The BFGS method is another quasi-Newton method; it maintains the inverse-Hessian approximation with a symmetric rank-2 update, usually written in product form: \[ H_{k+1} = \left(I - \frac{\delta x \delta g^T}{\delta x^T \delta g} \right) H_k \left(I - \frac{\delta g \delta x^T}{\delta x^T \delta g} \right) + \frac{\delta x \delta x^T}{\delta x^T \delta g} \]
which expands to: \[ H_{k+1} = H_k - \frac{H_k \delta g \delta x^T + \delta x \delta g^T H_k}{\delta x^T \delta g} + \left(1 + \frac{\delta g^T H_k \delta g}{\delta x^T \delta g}\right) \frac{\delta x \delta x^T}{\delta x^T \delta g} \]
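The same interface works for BFGS; here is a sketch of the product form above:

```r
# BFGS update of the inverse-Hessian approximation H (sketch, product form).
bfgs_update <- function(H, dx, dg) {
  dx  <- matrix(dx, ncol = 1)
  dg  <- matrix(dg, ncol = 1)
  rho <- 1 / drop(t(dx) %*% dg)       # 1 / (delta x' delta g)
  I   <- diag(nrow(H))
  (I - rho * dx %*% t(dg)) %*% H %*% (I - rho * dg %*% t(dx)) +
    rho * dx %*% t(dx)
}
```

Either function can be passed as the `h_update` argument of the `quasi_newton` sketch above, e.g. `quasi_newton(f, grad, x0, bfgs_update)`.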
Limited-memory BFGS (L-BFGS)
L-BFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno) is a quasi-Newton algorithm for unconstrained problems that is particularly well suited to high-dimensional machine learning tasks minimizing a differentiable scalar function \[ f(\mathbf{x}) \].
Unlike the full BFGS algorithm, which requires storing a dense \[ n \times n \] inverse-Hessian matrix, L-BFGS is memory-efficient: it stores only a limited history of \[ m \] update pairs of the position \[ x \] and the gradient \[ \nabla f(x) \], generally with \[ m < 10 \].
The algorithm starts from an initial estimate \[ \mathbf{x}_0 \] and iteratively refines it through estimates \[ \mathbf{x}_1, \mathbf{x}_2, \ldots \] The gradient \[ g_k = \nabla f(\mathbf{x}_k) \], together with the stored update history, defines an implicit approximation of the inverse Hessian and hence the search direction. The L-BFGS update for the inverse Hessian \[ H_{k+1} \] is given by:
\[ H_{k+1} = (I - \rho_k s_k y_k^\top) H_k (I - \rho_k y_k s_k^\top) + \rho_k s_k s_k^\top \]
where \[ \rho_k = \frac{1}{y_k^\top s_k}, \qquad s_k = x_{k+1} - x_k, \qquad y_k = g_{k+1} - g_k, \] so \[ s_k \] and \[ y_k \] play the same roles as \[ \delta x \] and \[ \delta g \] above.
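In practice \[ H_k \] is never formed explicitly: the product \[ H_k g_k \] is computed from the stored pairs \[ (s_i, y_i) \] with the standard two-loop recursion. A minimal R sketch, assuming `S` and `Y` are lists holding the most recent \[ m \] pairs (oldest first), is:

```r
# L-BFGS two-loop recursion: returns the search direction -H_k g using only
# the last m pairs (s_i, y_i), so storage is O(m n) instead of O(n^2).
lbfgs_direction <- function(g, S, Y) {
  m <- length(S)
  if (m == 0) return(-g)                      # no history yet: steepest descent
  rho   <- vapply(seq_len(m), function(i) 1 / sum(Y[[i]] * S[[i]]), numeric(1))
  alpha <- numeric(m)
  q <- g
  for (i in m:1) {                            # first loop: newest to oldest pair
    alpha[i] <- rho[i] * sum(S[[i]] * q)
    q <- q - alpha[i] * Y[[i]]
  }
  gamma <- sum(S[[m]] * Y[[m]]) / sum(Y[[m]] * Y[[m]])  # initial scaling H_k^0 = gamma I
  r <- gamma * q
  for (i in 1:m) {                            # second loop: oldest to newest pair
    beta <- rho[i] * sum(Y[[i]] * r)
    r <- r + (alpha[i] - beta) * S[[i]]
  }
  -r                                          # p_k = -H_k g_k
}
```

The direction returned here replaces the explicit \[ p_k = -H_k \nabla f(x_k) \] step of the full-matrix methods.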
L-BFGS-B (L-BFGS with Box Constraints)
L-BFGS-B is an extension of L-BFGS that can handle bound (box) constraints of the form \[ l \leq x \leq u \]. At each iteration it uses a gradient-projection step to identify the set of active bounds and then takes a limited-memory quasi-Newton step over the remaining free variables, so every iterate stays within the bounds.
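The full L-BFGS-B machinery is well beyond a short sketch; the following only illustrates its most basic ingredient, the elementwise projection onto the box \[ [l, u] \] that keeps an iterate feasible.

```r
# Projection onto the box l <= x <= u (elementwise clamping).
project_box <- function(x, l, u) pmin(pmax(x, l), u)

# Example: project_box(c(-2, 0.5, 7), l = rep(-1, 3), u = rep(5, 3))
# returns c(-1.0, 0.5, 5.0).
```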
Data
The data was generated by using a genetic algorithm to search the parameter space. The first configuration used a population size of 5 and a maximum of 5 generations, i.e. 5 × 5 = 25 individuals per experiment; I ran 5 such 5 × 5 experiments, then 5 experiments with 10 × 10 = 100 individuals, and finally 5 with 20 × 20 = 400 individuals. The number of individuals was quadrupled at each step to search the parameter space more thoroughly.
The 15 experiments were then concatenated to get an idea of the “average-case” resource consumption.
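For context, the experiment driver could be sketched along the following lines. This is a hypothetical reconstruction, not the code that produced the data: the helper `run_optimizer()`, the fitness definition, the search bounds, and the use of the CRAN `GA` package are all assumptions.

```r
library(GA)   # assumed: GA::ga() with popSize and maxiter arguments

configs <- list(c(5, 5), c(10, 10), c(20, 20))   # (population size, generations)
for (cfg in configs) {
  for (i in 1:5) {                               # 5 repeated experiments per configuration
    ga(type    = "real-valued",
       fitness = function(par) -run_optimizer(par)$error,   # hypothetical helper
       lower   = c(-5, -5), upper = c(5, 5),                # assumed search bounds
       popSize = cfg[1],
       maxiter = cfg[2])
  }
}
```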
Resources for Rastrigin Function
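For reference, the \[ n \]-dimensional Rastrigin function (with the usual constant \[ A = 10 \]) is \[ f(\mathbf{x}) = A\,n + \sum_{i=1}^{n}\left(x_i^2 - A\cos(2\pi x_i)\right), \] a highly multimodal benchmark whose global minimum is \[ f(\mathbf{0}) = 0 \].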
25 points
Time
Error
[Error density plot: 4 runs with non-finite values were excluded.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 3.24e+0 2.24e+0 2.28 0.995 6.18e+0
2 dfp 7.92e+0 8.49e+0 2.89 4.88 1.15e+1
3 lbfgsb 1.08e+308 1.80e+308 Inf 3.98 1.80e+308
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <int> <dbl> <int> <int>
1 bfgs 13107. 0 21362. 0 49152
2 dfp 9830. 0 14654. 0 32768
3 lbfgsb 127795. 65536 160279. 16384 409600
100 points
Time
Error
[Error density plot: 6 runs with non-finite values were excluded; at least one group had too few finite values for a density estimate.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 0.519 0.430 0.570 0 1.13
2 dfp 2.18 2.29 1.68 0.379 4.27
3 lbfgsb 3.23 3.98 2.21 0 4.97
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 16384 0 28378. 0 49152
2 dfp 3277. 0 7327. 0 16384
3 lbfgsb 217088 40960 380583. 0 786432
400 points
Time
Error
[Error density plot: 7 runs with non-finite values were excluded; at least one group had too few finite values for a density estimate.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 0.403 0.308 0.450 0.000500 0.995
2 dfp 0.499 0.462 0.472 0.00660 1.14
3 lbfgsb 1.74 1.49 1.70 0 3.98
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 24576 0 49152 0 98304
2 dfp 9830. 0 21981. 0 49152
3 lbfgsb 217088 40960 380583. 0 786432
Concatenated data
Time
Error
[Error density plot: 22 runs with non-finite values were excluded.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 1.61e+0 0.995 2.01 0 6.18e+0
2 dfp 3.54e+0 2.29 3.75 0.00660 1.15e+1
3 lbfgsb 4.15e+307 3.98 Inf 0 1.80e+308
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 21845. 8192 30718. 0 98304
2 dfp 7646. 0 17369. 0 49152
3 lbfgsb 187786. 32768 310981. 0 786432
Resources for Helical Valley Function
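For reference, the three-dimensional helical valley test function (Fletcher and Powell) is usually written as \[ f(\mathbf{x}) = 100\left[\left(x_3 - 10\,\theta(x_1, x_2)\right)^2 + \left(\sqrt{x_1^2 + x_2^2} - 1\right)^2\right] + x_3^2, \] with \[ 2\pi\,\theta(x_1, x_2) = \arctan(x_2 / x_1) \] for \[ x_1 > 0 \] and \[ \pi + \arctan(x_2 / x_1) \] for \[ x_1 < 0 \]; its global minimum is \[ f(1, 0, 0) = 0 \].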
25 points
Time
Error
[Error density plot: 4 runs with non-finite values were excluded.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 3.24e+0 2.24e+0 2.28 0.995 6.18e+0
2 dfp 7.92e+0 8.49e+0 2.89 4.88 1.15e+1
3 lbfgsb 1.08e+308 1.80e+308 Inf 3.98 1.80e+308
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <int> <dbl> <int> <int>
1 bfgs 3.2 0 7.16 0 16
2 dfp 3.2 0 7.16 0 16
3 lbfgsb 51.2 16 81.1 0 192
100 points
Time
Error
[Error density plot: 6 runs with non-finite values were excluded; at least one group had too few finite values for a density estimate.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 0.519 0.430 0.570 0 1.13
2 dfp 2.18 2.29 1.68 0.379 4.27
3 lbfgsb 3.23 3.98 2.21 0 4.97
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 5.33 0 9.24 0 16
2 dfp 9.6 0 14.3 0 32
3 lbfgsb 48 56 34.6 0 80
400 points
Time
Error
[Error density plot: 7 runs with non-finite values were excluded; at least one group had too few finite values for a density estimate.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 0.403 0.308 0.450 0.000500 0.995
2 dfp 0.499 0.462 0.472 0.00660 1.14
3 lbfgsb 1.74 1.49 1.70 0 3.98
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 12 0 24 0 48
2 dfp 41.6 0 76.4 0 176
3 lbfgsb 36 24 46.0 0 96
Concatenated data
Time
Error
[Error density plot: 22 runs with non-finite values were excluded.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 1.61e+0 0.995 2.01 0 6.18e+0
2 dfp 3.54e+0 2.29 3.75 0.00660 1.15e+1
3 lbfgsb 4.15e+307 3.98 Inf 0 1.80e+308
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 6.67 0 14.4 0 48
2 dfp 20.3 0 45.0 0 176
3 lbfgsb 40.6 48 46.5 0 144