
Limited-memory BFGS Method

Slide: Starting with the classical Newton's method, we will examine its limitations and the improvements introduced by quasi-Newton methods. Finally, we will delve into the L-BFGS algorithm, which is particularly suited for large-scale problems. Inspired by Dr. Zhepei Wang's Lecture "Numerical Optimization for Robotics".

Introduction

Unconstrained Optimization

Consider a smooth and twice-differentiable unconstrained optimization problem:

$$\min_{x} f(x)$$

Descent methods provide an iterative solution:

$$x_{k+1} = x_k + \alpha_k d_k$$

where $d_k$ is the search direction and $\alpha_k$ is the step size.

Newton's Method

By second-order Taylor expansion,

$$f(x) \approx f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{1}{2} (x - x_k)^\top \nabla^2 f(x_k) (x - x_k)$$

Minimizing the quadratic approximation gives

$$\nabla^2 f(x_k)(x - x_k) + \nabla f(x_k) = 0$$

For $\nabla^2 f(x_k) \succ 0$,

$$x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$$
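As a concrete illustration, here is a minimal sketch of the pure Newton iteration in Python with numpy; the test function, gradient, and Hessian are illustrative choices, not taken from the lecture:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    """Pure Newton iteration: x_{k+1} = x_k - [hess(x_k)]^{-1} grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve the Newton system hess(x) d = -g instead of forming the inverse.
        d = np.linalg.solve(hess(x), -g)
        x = x + d
    return x

# Illustrative example: minimize f(x) = (x0 - 1)^2 + 10 (x1 + 2)^4
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 40.0 * (x[1] + 2.0) ** 3])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 120.0 * (x[1] + 2.0) ** 2]])
print(newton(grad, hess, [5.0, 5.0]))   # approaches (1, -2)
```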

Courtesy: Ardian Umam

Damped Newton Method

For $\nabla^2 f(x_k) \nsucc 0$, the direction $d_k$ cannot be directly solved from $\nabla^2 f(x_k) d_k = -\nabla f(x_k)$. In such cases, a positive definite (PD) matrix $M_k$ must be constructed to approximate the Hessian.

If the function is convex, $\nabla^2 f(x_k)$ may be singular (positive semidefinite but not strictly positive definite). Adding a regularization term ensures positive definiteness:

$$M_k = \nabla^2 f(x_k) + \lambda I$$

where $\lambda > 0$ starts small and is increased until the Cholesky factorization of $M_k$ succeeds.

If the function is nonconvex, $\nabla^2 f(x_k)$ may be indefinite. To handle this, the Bunch-Kaufman factorization $L D L^\top$ is computed and its block-diagonal factor $D$ is modified into a positive definite $\tilde{D}$:

$$M_k = L \tilde{D} L^\top$$

Finally, the direction is obtained by solving $M_k d_k = -\nabla f(x_k)$.
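A minimal sketch of the damped-Newton direction for the convex case above, growing $\lambda$ until the Cholesky factorization of $M_k$ succeeds (function and parameter names are illustrative):

```python
import numpy as np

def damped_newton_direction(H, g, lam0=1e-6, growth=10.0, max_tries=30):
    """Solve (H + lam*I) d = -g, increasing lam until H + lam*I admits a
    Cholesky factorization, i.e. until M_k is positive definite."""
    n = g.shape[0]
    lam = 0.0                                    # first try the raw Hessian
    for _ in range(max_tries):
        M = H + lam * np.eye(n)
        try:
            L = np.linalg.cholesky(M)            # raises LinAlgError if M is not PD
            z = np.linalg.solve(L, -g)           # forward substitution
            return np.linalg.solve(L.T, z)       # backward substitution
        except np.linalg.LinAlgError:
            lam = lam0 if lam == 0.0 else lam * growth
    raise RuntimeError("failed to regularize the Hessian")

# Illustrative indefinite Hessian: the damped direction is still a descent direction.
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d = damped_newton_direction(H, g)
print(d, g @ d)                                  # g.d < 0
```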

Practical Newton Method

Moreover, we can select $\alpha_k$ by backtracking line search to ensure sufficient decrease in the objective function, satisfying the Armijo condition:

$$f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^\top d_k$$

where $c_1 \in (0, 1)$ is a small constant.

Courtesy: Cornell University
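A minimal sketch of a backtracking line search enforcing the Armijo condition; the step-halving factor and the quadratic test function in the usage lines are illustrative choices:

```python
import numpy as np

def backtracking_armijo(f, grad_k, x_k, d_k, alpha0=1.0, c1=1e-4, shrink=0.5, max_iter=50):
    """Shrink alpha until f(x_k + alpha*d_k) <= f(x_k) + c1*alpha*grad_k.dot(d_k)."""
    f_k = f(x_k)
    slope = grad_k @ d_k          # must be negative for a descent direction
    alpha = alpha0
    for _ in range(max_iter):
        if f(x_k + alpha * d_k) <= f_k + c1 * alpha * slope:
            return alpha
        alpha *= shrink
    return alpha                  # fallback: smallest alpha tried

# Usage on f(x) = x^T x with the steepest-descent direction (illustrative)
f = lambda x: x @ x
x = np.array([3.0, -4.0]); g = 2 * x
print(backtracking_armijo(f, g, x, -g))
```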

Quasi-Newton Methods

Newton's Method: Limitations

  • High Cost: Computing the Hessian and its inverse requires $O(n^3)$ operations, impractical for large problems.
  • Indefinite Hessian: In nonconvex cases, the Hessian may lead to steps toward saddle points.
  • Ill-Conditioning: Poorly conditioned Hessians amplify errors and hinder convergence.
  • Inaccurate Model: Local quadratic approximations may fail for complex functions, causing inefficiency or divergence.

Quasi-Newton Approximation

Newton approximation:

$$f(x) \approx f(x_k) + (x - x_k)^\top g_k + \frac{1}{2}(x - x_k)^\top H_k (x - x_k) \quad\Longrightarrow\quad H_k d_k = -g_k$$

where $g_k = \nabla f(x_k)$ and $H_k = \nabla^2 f(x_k)$.

Quasi-Newton approximation:

$$f(x) \approx f(x_k) + (x - x_k)^\top g_k + \frac{1}{2}(x - x_k)^\top B_k (x - x_k) \quad\Longrightarrow\quad B_k d_k = -g_k$$

The matrix $B_k$ should:

  • Avoid full second-order derivatives.
  • Have a closed-form solution for linear equations.
  • Retain first-order curvature information.
  • Preserve the descent direction.

BFGS Method

Descent Direction

The search direction $d_k$ should make an acute angle with the negative gradient:

$$(g_k)^\top d_k = -(g_k)^\top (B_k)^{-1} g_k < 0$$

$B_k$ must be PD, since then $(g_k)^\top (B_k)^{-1} g_k > 0$ for all nonzero $g_k$.

Courtesy: Active Calculus

Curvature Information

At the point $x_{k+1}$, the gradient is $g_{k+1}$. We want $B_{k+1}$ to satisfy the secant condition:

$$B_{k+1}(x_{k+1} - x_k) = g_{k+1} - g_k \quad\Longleftrightarrow\quad B_{k+1} s_k = y_k$$

where $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$.

The Optimal $B_{k+1}$?

Infinitely many $B_{k+1}$ satisfy the secant condition, so how do we choose the best one? We define the following weighted least-squares problem:

$$\min_{B} \|B - B_k\|_W^2 \quad \text{subject to} \quad B = B^\top, \quad B s_k = y_k$$

In BFGS, the weight matrix is selected to be the average Hessian along the step:

$$W = \int_0^1 \nabla^2 f\big[(1 - \tau) x_k + \tau x_{k+1}\big]\, d\tau$$

Closed-form Solution for $B_{k+1}$

To derive the optimal $B_{k+1}$, we construct the Lagrangian function:

$$\mathcal{L}(B, \Lambda) = \frac{1}{2}\|B - B_k\|_W^2 + \mathrm{tr}\big[\Lambda (B s_k - y_k)^\top\big]$$

Taking the derivative of the Lagrangian with respect to $B$ and setting it to zero gives:

$$\frac{\partial \mathcal{L}}{\partial B} = W (B - B_k) W + \Lambda (s_k)^\top = 0$$

Rearranging the terms, we express $B$ as:

$$B = B_k - W^{-1} \Lambda (s_k)^\top W^{-1}$$

Substituting this expression into the secant condition $B s_k = y_k$, we obtain:

$$\big(B_k - W^{-1} \Lambda (s_k)^\top W^{-1}\big) s_k = y_k$$

Solving for $\Lambda$, we find:

$$\Lambda = -W (y_k - B_k s_k) \big((s_k)^\top W^{-1} s_k\big)^{-1}$$

Finally, substituting $\Lambda$ back, the closed-form solution for $B_{k+1}$ is:

$$B_{k+1} = B_k + \frac{y_k (y_k)^\top}{(s_k)^\top y_k} - \frac{B_k s_k (B_k s_k)^\top}{(s_k)^\top B_k s_k}$$

Broyden-Fletcher-Goldfarb-Shanno (BFGS)

Given the initial value $B_0 = I$, the updates are performed iteratively:

$$B_{k+1} = B_k + \frac{y_k (y_k)^\top}{(s_k)^\top y_k} - \frac{B_k s_k (B_k s_k)^\top}{(s_k)^\top B_k s_k}$$

with $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$.

For computational efficiency, we often work with the inverse $C_k = (B_k)^{-1}$ directly. Its iterative update is given by:

$$C_{k+1} = \left(I - \frac{s_k (y_k)^\top}{(s_k)^\top y_k}\right) C_k \left(I - \frac{y_k (s_k)^\top}{(s_k)^\top y_k}\right) + \frac{s_k (s_k)^\top}{(s_k)^\top y_k}$$
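A minimal sketch of a single BFGS update on the inverse approximation $C_k$, checking the inverse form of the secant condition $C_{k+1} y_k = s_k$ on illustrative data:

```python
import numpy as np

def bfgs_inverse_update(C, s, y):
    """BFGS update of the inverse Hessian approximation C = B^{-1}."""
    rho = 1.0 / (y @ s)                 # requires the curvature condition y^T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ C @ V.T + rho * np.outer(s, s)

# Illustrative check: the updated C maps y to s (secant condition in inverse form).
rng = np.random.default_rng(0)
C = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)    # keeps y^T s > 0 here
C_next = bfgs_inverse_update(C, s, y)
print(np.allclose(C_next @ y, s))       # True
```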

Guaranteeing PD of $B_{k+1}$

To ensure that $B_{k+1}$ remains positive definite (PD), the following curvature condition must hold:

$$(y_k)^\top s_k > 0$$

For any nonzero vector $z$, using the Cauchy-Schwarz inequality:

$$z^\top B_{k+1} z = z^\top B_k z + \frac{(z^\top y_k)^2}{(y_k)^\top s_k} - \frac{(z^\top B_k s_k)^2}{(s_k)^\top B_k s_k} \;\ge\; \frac{z^\top B_k z \,\cdot\, (s_k)^\top B_k s_k - (z^\top B_k s_k)^2}{(s_k)^\top B_k s_k} \;\ge\; 0$$

Equality holds only when $z^\top y_k = 0$ and $z \parallel s_k$. Given that $(y_k)^\top s_k > 0$, these two conditions cannot hold simultaneously for $z \neq 0$. Therefore, if $B_k \succ 0$, it follows that $B_{k+1} \succ 0$.

Guaranteeing $(y_k)^\top s_k > 0$

The Armijo condition (AC) alone cannot guarantee $(y_k)^\top s_k > 0$; we also need the curvature condition (CC):

$$(d_k)^\top \nabla f(x_k + \alpha_k d_k) \ge c_2\, (d_k)^\top \nabla f(x_k)$$

Typically, $c_1 = 10^{-4}$ and $c_2 = 0.9$.

Courtesy: Ján Kopačka
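A minimal sketch of the AC and CC tests as standalone predicates, using the typical constants quoted above (the helper names and the quadratic test function are illustrative):

```python
import numpy as np

def armijo_ok(f, g, x, d, alpha, c1=1e-4):
    """AC: sufficient decrease of the objective along d."""
    return f(x + alpha * d) <= f(x) + c1 * alpha * (g(x) @ d)

def curvature_ok(g, x, d, alpha, c2=0.9):
    """CC: the directional derivative has increased enough."""
    return (d @ g(x + alpha * d)) >= c2 * (d @ g(x))

# Illustrative check on f(x) = x^T x along the steepest-descent direction
f = lambda x: x @ x
g = lambda x: 2 * x
x = np.array([1.0, 2.0]); d = -g(x)
print(armijo_ok(f, g, x, d, alpha=0.5), curvature_ok(g, x, d, alpha=0.5))   # True True
```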

L-BFGS Method

The Lewis-Overton line search is a bracketing line search designed specifically for quasi-Newton methods; it searches for a step satisfying both AC and CC (a minimal sketch in code follows the procedure below):

  1. Given search direction $d_k$, current point $x_k$ and gradient $g_k$
  2. Initialize trial step $\alpha := 1$, $\alpha_l := 0$, $\alpha_r := +\infty$
  3. Repeat
    1. Update bounds:
      • If AC($\alpha$) fails, set $\alpha_r := \alpha$
      • Else if CC($\alpha$) fails, set $\alpha_l := \alpha$
      • Otherwise, return $\alpha$
    2. Update $\alpha$:
      • If $\alpha_r < +\infty$, set $\alpha := \mathrm{CubicInterpolate}(\alpha_l, \alpha_r)$ or $\alpha := (\alpha_l + \alpha_r)/2$
      • Otherwise, set $\alpha := \mathrm{CubicExtrapolate}(\alpha_l, \alpha_r)$ or $\alpha := 2\alpha_l$
    3. Ensure $\alpha \in [\alpha_{\min}, \alpha_{\max}]$
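A minimal sketch of this procedure, with plain bisection and step-doubling standing in for the cubic interpolation/extrapolation options; the constants, safeguards, and the quadratic usage example are illustrative:

```python
import numpy as np

def lewis_overton_search(f, g, x, d, c1=1e-4, c2=0.9,
                         alpha_min=1e-16, alpha_max=1e16, max_iter=60):
    """Bracketing weak-Wolfe line search (bisection/doubling in place of the
    cubic interpolation/extrapolation steps described in the text)."""
    alpha, lo, hi = 1.0, 0.0, np.inf
    f0, slope0 = f(x), g(x) @ d
    for _ in range(max_iter):
        if f(x + alpha * d) > f0 + c1 * alpha * slope0:      # AC fails
            hi = alpha
        elif d @ g(x + alpha * d) < c2 * slope0:             # CC fails
            lo = alpha
        else:
            return alpha
        alpha = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * lo
        alpha = min(max(alpha, alpha_min), alpha_max)        # keep alpha in bounds
    return alpha

# Illustrative run on f(x) = x^T x with the steepest-descent direction
f = lambda x: x @ x
g = lambda x: 2 * x
x = np.array([2.0, -1.0])
print(lewis_overton_search(f, g, x, -g(x)))
```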

Cautious Update

Sometimes, when the line search is inexact or the function is poorly conditioned, $(y_k)^\top s_k > 0$ cannot be guaranteed. To ensure numerical stability and maintain a PD Hessian approximation, L-BFGS employs a cautious update strategy:

Skip update condition: If the curvature condition $(y_k)^\top s_k > \epsilon \|s_k\|^2$ is not satisfied, where $\epsilon$ is a small positive constant (e.g., $10^{-6}$), skip the update for this iteration: $B_{k+1} = B_k$.

Powell's Damping: If the curvature condition $(y_k)^\top s_k \ge \eta\, (s_k)^\top B_k s_k$ is not satisfied, where $\eta$ is a small positive constant (e.g., 0.2 or 0.25), replace $y_k$ with the damped vector $\tilde{y}_k$:

$$\tilde{y}_k = \theta y_k + (1 - \theta) B_k s_k, \qquad \theta = \frac{(1 - \eta)\,(s_k)^\top B_k s_k}{(s_k)^\top B_k s_k - (y_k)^\top s_k}$$

With cautious updates, the iterates are guaranteed to converge to a critical point provided the function has bounded sublevel sets and a Lipschitz continuous gradient.
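A minimal sketch of both safeguards; `Bs` stands for the product $B_k s_k$, which a full L-BFGS implementation would obtain implicitly rather than from an explicit matrix (names and example data are illustrative):

```python
import numpy as np

def cautious_skip(s, y, eps=1e-6):
    """Skip-update test: keep the pair only if y^T s > eps * ||s||^2."""
    return (y @ s) > eps * (s @ s)

def powell_damped_y(s, y, Bs, eta=0.2):
    """Powell's damping: if y^T s < eta * s^T B s, blend y with B s so that
    the damped y_tilde satisfies y_tilde^T s = eta * s^T B s."""
    sBs = s @ Bs
    if s @ y >= eta * sBs:
        return y                                  # enough curvature, keep y
    theta = (1.0 - eta) * sBs / (sBs - s @ y)
    return theta * y + (1.0 - theta) * Bs

# Illustrative: a pair with negative curvature is skipped, or damped back to eta * s^T B s.
B = np.eye(2)
s = np.array([1.0, 0.0]); y = np.array([-0.5, 0.3])
print(cautious_skip(s, y), s @ powell_damped_y(s, y, B @ s))   # False, 0.2
```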

Two-Loop Recursion

L-BFGS uses a two-loop recursion to compute the search direction without explicitly forming the Hessian approximation. The algorithm maintains a history of the most recent $m$ pairs $\{(s_i, y_i)\}_{i=k-m}^{k-1}$, where typically $m$ is between 5 and 20 (a code sketch follows the complexity note below).

  1. Initialize an empty array $A$ of length $m$, $d = -g_k$
  2. For $i = k-1, k-2, \ldots, k-m$:
    1. $A_{i+m-k} := \langle s_i, d \rangle / \langle s_i, y_i \rangle$
    2. $d := d - A_{i+m-k}\, y_i$
  3. $d := d \cdot \langle s_{k-1}, y_{k-1} \rangle / \langle y_{k-1}, y_{k-1} \rangle$
  4. For $i = k-m, k-m+1, \ldots, k-1$:
    1. $a := \langle y_i, d \rangle / \langle s_i, y_i \rangle$
    2. $d := d + s_i (A_{i+m-k} - a)$
  5. Return $d$

This approach reduces the storage requirement from $O(n^2)$ to $O(mn)$ and the computational cost per iteration from $O(n^2)$ to $O(mn)$.
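A minimal sketch of the two-loop recursion with the history kept as a Python list of $(s_i, y_i)$ pairs, oldest first; the demonstration data at the end are illustrative:

```python
import numpy as np

def lbfgs_direction(g, history):
    """Two-loop recursion: returns the search direction d ≈ -(B_k)^{-1} g
    using the stored pairs. `history` is a list of (s_i, y_i), oldest first."""
    d = -g.astype(float)
    alphas = []
    for s, y in reversed(history):                     # i = k-1, ..., k-m
        a = (s @ d) / (s @ y)
        alphas.append(a)
        d = d - a * y
    if history:                                        # initial scaling (step 3)
        s, y = history[-1]
        d = d * (s @ y) / (y @ y)
    for (s, y), a in zip(history, reversed(alphas)):   # i = k-m, ..., k-1
        b = (y @ d) / (s @ y)
        d = d + s * (a - b)
    return d

# With an empty history this is steepest descent; with stored pairs it is
# still a descent direction (g.d < 0).
g = np.array([1.0, -2.0, 0.5])
hist = [(np.array([0.3, 0.1, -0.2]), np.array([0.4, 0.2, -0.1]))]
print(lbfgs_direction(g, []), g @ lbfgs_direction(g, hist))
```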

Algorithm Summary

The complete L-BFGS algorithm with cautious update and Lewis-Overton line search:

  1. Initialize $x_0$, $g_0 := \nabla f(x_0)$, choose $m$
  2. For $k = 0, 1, 2, \ldots$ until convergence:
    1. Compute the search direction $d_k$ using the L-BFGS two-loop recursion
    2. Find the step size $\alpha_k$ using the Lewis-Overton line search
    3. Update: $x_{k+1} = x_k + \alpha_k d_k$
    4. Compute $s_k = x_{k+1} - x_k$, $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
    5. Apply the cautious update to $(s_k, y_k)$ before storing the pair in the history
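Putting the pieces together, a compact driver in the spirit of the summary above; it assumes the `lbfgs_direction`, `lewis_overton_search`, and `cautious_skip` sketches from the earlier snippets, and the Rosenbrock test function is an illustrative choice:

```python
import numpy as np

def lbfgs(f, grad, x0, m=10, tol=1e-6, max_iter=500):
    """L-BFGS with skip-based cautious updates; relies on lbfgs_direction,
    lewis_overton_search and cautious_skip defined in the sketches above."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    history = []                                   # list of (s, y) pairs, oldest first
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = lbfgs_direction(g, history)
        alpha = lewis_overton_search(f, grad, x, d)
        x_next = x + alpha * d
        g_next = grad(x_next)
        s, y = x_next - x, g_next - g
        if cautious_skip(s, y):                    # keep only well-behaved pairs
            history.append((s, y))
            if len(history) > m:
                history.pop(0)                     # drop the oldest pair
        x, g = x_next, g_next
    return x

# Illustrative run on the Rosenbrock function
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                           200*(x[1] - x[0]**2)])
print(lbfgs(f, grad, np.array([-1.2, 1.0])))       # approaches (1, 1)
```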

Open Source Implementation