Resource Consumption in Optimization Algorithms: A Focus on Quasi-Newton Methods
Overview of the Algorithm
Generic quasi-Newton algorithm for unconstrained optimization (an R sketch follows the pseudocode):
- Require:
- Objective function \[ f: \mathbb{R}^n \rightarrow \mathbb{R} \]
- Gradient of the function \[ \nabla f: \mathbb{R}^n \rightarrow \mathbb{R}^n \]
- Initial guess \[x_0\]
- Ensure: Local minimum \[ x^* \] of \[ f \]
- Set \[ k = 0 \]
- Choose an initial approximation \[ H_0 \] to the inverse Hessian (typically the identity matrix)
- While not converged:
- Compute the search direction: \[ p_k = -H_k \nabla f(x_k) \]
- Compute the step size \[ \alpha_k \] with a line search
- Update the new point: \[ x_{k+1} = x_k + \alpha_k p_k \]
- Compute \[ \delta x = x_{k+1} - x_k \]
- Compute \[ \delta g = \nabla f(x_{k+1}) - \nabla f(x_k) \]
- Update \[ H_{k+1} \] from \[ H_k \], \[ \delta x \], and \[ \delta g \] using a quasi-Newton update formula (DFP or BFGS below)
- \[ k = k + 1 \]
- Return \[ x_k \] as the local minimum
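The loop above can be made concrete with a short R sketch. This is a minimal illustration, assuming a simple backtracking (Armijo) line search; `f`, `grad`, and `h_update` are placeholders for the objective, its gradient, and one of the inverse-Hessian update formulas described below.

```r
# Minimal quasi-Newton driver (sketch, not the exact benchmarked implementation).
# `h_update(H, dx, dg)` supplies the DFP or BFGS formula.
quasi_newton <- function(f, grad, x0, h_update, tol = 1e-8, max_iter = 200) {
  x <- x0
  H <- diag(length(x0))               # H_0: identity approximation of the inverse Hessian
  g <- grad(x)
  for (k in seq_len(max_iter)) {
    if (sqrt(sum(g^2)) < tol) break   # converged: gradient norm below tolerance
    p <- drop(-H %*% g)               # search direction p_k = -H_k grad f(x_k)
    alpha <- 1                        # backtracking line search on the Armijo condition
    for (ls in 1:50) {
      if (f(x + alpha * p) <= f(x) + 1e-4 * alpha * sum(g * p)) break
      alpha <- alpha / 2
    }
    x_new <- x + alpha * p
    g_new <- grad(x_new)
    H <- h_update(H, x_new - x, g_new - g)   # quasi-Newton update of H
    x <- x_new
    g <- g_new
  }
  x                                   # approximate local minimum
}
```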
Davidon–Fletcher–Powell (DFP) update
The DFP method maintains an approximation \[ H_k \] of the inverse Hessian and updates it at every iteration. The update formula is: \[ H_{k+1} = H_k + \frac{\delta x \delta x^T}{\delta x^T \delta g} - \frac{H_k \delta g \delta g^T H_k}{\delta g^T H_k \delta g} \]
rank-1 term added: \[\frac{\delta x \delta x^T}{\delta x^T \delta g} \]
rank-1 term subtracted: \[\frac{H_k \delta g \delta g^T H_k}{\delta g^T H_k \delta g} \]
Together the two terms form a symmetric rank-2 correction to \[ H_k \].
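A direct R transcription of the update, assuming \[ \delta x \] and \[ \delta g \] are passed as plain numeric vectors `dx` and `dg`, might look like:

```r
# DFP update of the inverse-Hessian approximation H (sketch).
dfp_update <- function(H, dx, dg) {
  dx  <- matrix(dx, ncol = 1)                  # column vector: delta x
  dg  <- matrix(dg, ncol = 1)                  # column vector: delta g
  Hdg <- H %*% dg
  H + (dx %*% t(dx)) / drop(t(dx) %*% dg) -    # added rank-1 term
      (Hdg %*% t(Hdg)) / drop(t(dg) %*% Hdg)   # subtracted rank-1 term (H dg dg' H)
}
```

In practice the update is only applied when the curvature condition \[ \delta x^T \delta g > 0 \] holds, so that \[ H_{k+1} \] stays positive definite.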
Broyden–Fletcher–Goldfarb–Shanno (BFGS) update
The BFGS method is another quasi-Newton method; it maintains the inverse-Hessian approximation with a symmetric rank-2 update, usually written in product form: \[ H_{k+1} = \left(I - \frac{\delta x \delta g^T}{\delta x^T \delta g} \right) H_k \left(I - \frac{\delta g \delta x^T}{\delta x^T \delta g} \right) + \frac{\delta x \delta x^T}{\delta x^T \delta g} \]
which expands to: \[ H_{k+1} = H_k - \frac{H_k \delta g \delta x^T + \delta x \delta g^T H_k}{\delta x^T \delta g} + \left(1 + \frac{\delta g^T H_k \delta g}{\delta x^T \delta g}\right) \frac{\delta x \delta x^T}{\delta x^T \delta g} \]
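The same interface works for BFGS; here is a sketch of the product form above:

```r
# BFGS update of the inverse-Hessian approximation H (sketch, product form).
bfgs_update <- function(H, dx, dg) {
  dx  <- matrix(dx, ncol = 1)
  dg  <- matrix(dg, ncol = 1)
  rho <- 1 / drop(t(dx) %*% dg)       # 1 / (delta x' delta g)
  I   <- diag(nrow(H))
  (I - rho * dx %*% t(dg)) %*% H %*% (I - rho * dg %*% t(dx)) +
    rho * dx %*% t(dx)
}
```

Either function can be passed as the `h_update` argument of the `quasi_newton` sketch above, e.g. `quasi_newton(f, grad, x0, bfgs_update)`.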
Limited-memory BFGS (L-BFGS)
L-BFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno) is a quasi-Newton algorithm for unconstrained problems that is particularly well suited to high-dimensional machine learning tasks minimizing a differentiable scalar function \[ f(\mathbf{x}) \].
Unlike the full BFGS algorithm, which requires storing a dense \[ n \times n \] inverse-Hessian matrix, L-BFGS is memory-efficient: it stores only a limited history of \[ m \] update pairs of the position \[ x \] and the gradient \[ \nabla f(x) \], generally with \[ m < 10 \].
The algorithm starts from an initial estimate \[ \mathbf{x}_0 \] and iteratively refines it through estimates \[ \mathbf{x}_1, \mathbf{x}_2, \ldots \] The gradient \[ g_k = \nabla f(\mathbf{x}_k) \], together with the stored update history, defines an implicit approximation of the inverse Hessian and hence the search direction. The L-BFGS update for the inverse Hessian \[ H_{k+1} \] is given by:
\[ H_{k+1} = (I - \rho_k s_k y_k^\top) H_k (I - \rho_k y_k s_k^\top) + \rho_k s_k s_k^\top \]
where \[ \rho_k = \frac{1}{y_k^\top s_k}, \qquad s_k = x_{k+1} - x_k, \qquad y_k = g_{k+1} - g_k, \] so \[ s_k \] and \[ y_k \] play the same roles as \[ \delta x \] and \[ \delta g \] above.
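In practice \[ H_k \] is never formed explicitly: the product \[ H_k g_k \] is computed from the stored pairs \[ (s_i, y_i) \] with the standard two-loop recursion. A minimal R sketch, assuming `S` and `Y` are lists holding the most recent \[ m \] pairs (oldest first), is:

```r
# L-BFGS two-loop recursion: returns the search direction -H_k g using only
# the last m pairs (s_i, y_i), so storage is O(m n) instead of O(n^2).
lbfgs_direction <- function(g, S, Y) {
  m <- length(S)
  if (m == 0) return(-g)                      # no history yet: steepest descent
  rho   <- vapply(seq_len(m), function(i) 1 / sum(Y[[i]] * S[[i]]), numeric(1))
  alpha <- numeric(m)
  q <- g
  for (i in m:1) {                            # first loop: newest to oldest pair
    alpha[i] <- rho[i] * sum(S[[i]] * q)
    q <- q - alpha[i] * Y[[i]]
  }
  gamma <- sum(S[[m]] * Y[[m]]) / sum(Y[[m]] * Y[[m]])  # initial scaling H_k^0 = gamma I
  r <- gamma * q
  for (i in 1:m) {                            # second loop: oldest to newest pair
    beta <- rho[i] * sum(Y[[i]] * r)
    r <- r + (alpha[i] - beta) * S[[i]]
  }
  -r                                          # p_k = -H_k g_k
}
```

The direction returned here replaces the explicit \[ p_k = -H_k \nabla f(x_k) \] step of the full-matrix methods.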
L-BFGS-B (L-BFGS with Box Constraints)
L-BFGS-B is an extension of L-BFGS that can handle bound (box) constraints of the form \[ l \leq x \leq u \]. At each iteration it uses a gradient-projection step to identify the set of active bounds and then takes a limited-memory quasi-Newton step over the remaining free variables, so every iterate stays within the bounds.
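The full L-BFGS-B machinery is well beyond a short sketch; the following only illustrates its most basic ingredient, the elementwise projection onto the box \[ [l, u] \] that keeps an iterate feasible.

```r
# Projection onto the box l <= x <= u (elementwise clamping).
project_box <- function(x, l, u) pmin(pmax(x, l), u)

# Example: project_box(c(-2, 0.5, 7), l = rep(-1, 3), u = rep(5, 3))
# returns c(-1.0, 0.5, 5.0).
```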
Data
The data was generated by using a genetic algorithm to search the parameter space. The first configuration used a population size of 5 and a maximum of 5 generations, i.e. 5 × 5 = 25 individuals per experiment; I ran 5 such 5 × 5 experiments, then 5 experiments with 10 × 10 = 100 individuals, and finally 5 with 20 × 20 = 400 individuals. The number of individuals was quadrupled at each step to search the parameter space more thoroughly.
The 15 experiments were then concatenated to get an idea of the “average-case” resource consumption.
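For context, the experiment driver could be sketched along the following lines. This is a hypothetical reconstruction, not the code that produced the data: the helper `run_optimizer()`, the fitness definition, the search bounds, and the use of the CRAN `GA` package are all assumptions.

```r
library(GA)   # assumed: GA::ga() with popSize and maxiter arguments

configs <- list(c(5, 5), c(10, 10), c(20, 20))   # (population size, generations)
for (cfg in configs) {
  for (i in 1:5) {                               # 5 repeated experiments per configuration
    ga(type    = "real-valued",
       fitness = function(par) -run_optimizer(par)$error,   # hypothetical helper
       lower   = c(-5, -5), upper = c(5, 5),                # assumed search bounds
       popSize = cfg[1],
       maxiter = cfg[2])
  }
}
```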
Resources for Rastrigin Function
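For reference, the \[ n \]-dimensional Rastrigin function (with the usual constant \[ A = 10 \]) is \[ f(\mathbf{x}) = A\,n + \sum_{i=1}^{n}\left(x_i^2 - A\cos(2\pi x_i)\right), \] a highly multimodal benchmark whose global minimum is \[ f(\mathbf{0}) = 0 \].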
25 points
Time
Error
[Error density plot: 4 runs with non-finite values were excluded.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 3.24e+0 2.24e+0 2.28 0.995 6.18e+0
2 dfp 7.92e+0 8.49e+0 2.89 4.88 1.15e+1
3 lbfgsb 1.08e+308 1.80e+308 Inf 3.98 1.80e+308
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <int> <dbl> <int> <int>
1 bfgs 13107. 0 21362. 0 49152
2 dfp 9830. 0 14654. 0 32768
3 lbfgsb 127795. 65536 160279. 16384 409600
100 points
Time
Error
[Error density plot: 6 runs with non-finite values were excluded; at least one group had too few finite values for a density estimate.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 0.519 0.430 0.570 0 1.13
2 dfp 2.18 2.29 1.68 0.379 4.27
3 lbfgsb 3.23 3.98 2.21 0 4.97
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 16384 0 28378. 0 49152
2 dfp 3277. 0 7327. 0 16384
3 lbfgsb 217088 40960 380583. 0 786432
400 points
Time
Error
[Error density plot: 7 runs with non-finite values were excluded; at least one group had too few finite values for a density estimate.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 0.403 0.308 0.450 0.000500 0.995
2 dfp 0.499 0.462 0.472 0.00660 1.14
3 lbfgsb 1.74 1.49 1.70 0 3.98
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 24576 0 49152 0 98304
2 dfp 9830. 0 21981. 0 49152
3 lbfgsb 217088 40960 380583. 0 786432
Concatenated data
Time
Error
[Error density plot: 22 runs with non-finite values were excluded.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 1.61e+0 0.995 2.01 0 6.18e+0
2 dfp 3.54e+0 2.29 3.75 0.00660 1.15e+1
3 lbfgsb 4.15e+307 3.98 Inf 0 1.80e+308
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 21845. 8192 30718. 0 98304
2 dfp 7646. 0 17369. 0 49152
3 lbfgsb 187786. 32768 310981. 0 786432
Resources for Helical Valley Function
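For reference, the three-dimensional helical valley test function (Fletcher and Powell) is usually written as \[ f(\mathbf{x}) = 100\left[\left(x_3 - 10\,\theta(x_1, x_2)\right)^2 + \left(\sqrt{x_1^2 + x_2^2} - 1\right)^2\right] + x_3^2, \] with \[ 2\pi\,\theta(x_1, x_2) = \arctan(x_2 / x_1) \] for \[ x_1 > 0 \] and \[ \pi + \arctan(x_2 / x_1) \] for \[ x_1 < 0 \]; its global minimum is \[ f(1, 0, 0) = 0 \].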
25 points
Time
Error
[Error density plot: 4 runs with non-finite values were excluded.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 3.24e+0 2.24e+0 2.28 0.995 6.18e+0
2 dfp 7.92e+0 8.49e+0 2.89 4.88 1.15e+1
3 lbfgsb 1.08e+308 1.80e+308 Inf 3.98 1.80e+308
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <int> <dbl> <int> <int>
1 bfgs 3.2 0 7.16 0 16
2 dfp 3.2 0 7.16 0 16
3 lbfgsb 51.2 16 81.1 0 192
100 points
Time
Error
[Error density plot: 6 runs with non-finite values were excluded; at least one group had too few finite values for a density estimate.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 0.519 0.430 0.570 0 1.13
2 dfp 2.18 2.29 1.68 0.379 4.27
3 lbfgsb 3.23 3.98 2.21 0 4.97
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 5.33 0 9.24 0 16
2 dfp 9.6 0 14.3 0 32
3 lbfgsb 48 56 34.6 0 80
400 points
Time
Error
[Error density plot: 7 runs with non-finite values were excluded; at least one group had too few finite values for a density estimate.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 0.403 0.308 0.450 0.000500 0.995
2 dfp 0.499 0.462 0.472 0.00660 1.14
3 lbfgsb 1.74 1.49 1.70 0 3.98
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 12 0 24 0 48
2 dfp 41.6 0 76.4 0 176
3 lbfgsb 36 24 46.0 0 96
Concatenated data
Time
Error
[Error density plot: 22 runs with non-finite values were excluded.]
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bfgs 1.61e+0 0.995 2.01 0 6.18e+0
2 dfp 3.54e+0 2.29 3.75 0.00660 1.15e+1
3 lbfgsb 4.15e+307 3.98 Inf 0 1.80e+308
Memory Usage
# A tibble: 3 × 6
Algorithm mean median sd min max
<chr> <dbl> <dbl> <dbl> <int> <int>
1 bfgs 6.67 0 14.4 0 48
2 dfp 20.3 0 45.0 0 176
3 lbfgsb 40.6 48 46.5 0 144