
Limited-memory BFGS Method

Slide: Starting with the classical Newton's method, we will examine its limitations and the improvements introduced by quasi-Newton methods. Finally, we will delve into the L-BFGS algorithm, which is particularly suited for large-scale problems. Inspired by Dr. Zhepei Wang's Lecture "Numerical Optimization for Robotics".

Introduction

Unconstrained Optimization

Consider a smooth and twice-differentiable unconstrained optimization problem:

$$\min_{x} f(x)$$

Descent methods provide an iterative solution:

$$x_{k+1} = x_k + \alpha_k d_k$$

where $d_k$ is the search direction and $\alpha_k$ is the step size.

Newton's Method

By second-order Taylor expansion,

$$f(x) \approx f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{1}{2} (x - x_k)^\top \nabla^2 f(x_k) (x - x_k)$$

Minimizing the quadratic approximation gives

$$\nabla^2 f(x_k)(x - x_k) + \nabla f(x_k) = 0$$

For $\nabla^2 f(x_k) \succ 0$,

$$x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$$
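As a concrete illustration, here is a minimal sketch of the pure Newton iteration in Python with numpy; the test function, gradient, and Hessian are illustrative choices, not taken from the lecture:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    """Pure Newton iteration: x_{k+1} = x_k - [hess(x_k)]^{-1} grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve the Newton system hess(x) d = -g instead of forming the inverse.
        d = np.linalg.solve(hess(x), -g)
        x = x + d
    return x

# Illustrative example: minimize f(x) = (x0 - 1)^2 + 10 (x1 + 2)^4
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 40.0 * (x[1] + 2.0) ** 3])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 120.0 * (x[1] + 2.0) ** 2]])
print(newton(grad, hess, [5.0, 5.0]))   # approaches (1, -2)
```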

Courtesy: Ardian Umam

Damped Newton Method

For $\nabla^2 f(x_k) \nsucc 0$, the direction $d_k$ cannot be directly solved from $\nabla^2 f(x_k) d_k = -\nabla f(x_k)$. In such cases, a positive definite (PD) matrix $M_k$ must be constructed to approximate the Hessian.

If the function is convex, $\nabla^2 f(x_k)$ may be singular (positive semidefinite but not strictly positive definite). Adding a regularization term ensures positive definiteness:

$$M_k = \nabla^2 f(x_k) + \lambda I$$

where $\lambda > 0$ starts small and is increased until the Cholesky factorization of $M_k$ succeeds.

If the function is nonconvex, $\nabla^2 f(x_k)$ may be indefinite. To handle this, the Bunch-Kaufman factorization $L D L^\top$ is computed and its block-diagonal factor $D$ is modified into a positive definite $\tilde{D}$:

$$M_k = L \tilde{D} L^\top$$

Finally, the direction is obtained by solving $M_k d_k = -\nabla f(x_k)$.
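A minimal sketch of the damped-Newton direction for the convex case above, growing $\lambda$ until the Cholesky factorization of $M_k$ succeeds (function and parameter names are illustrative):

```python
import numpy as np

def damped_newton_direction(H, g, lam0=1e-6, growth=10.0, max_tries=30):
    """Solve (H + lam*I) d = -g, increasing lam until H + lam*I admits a
    Cholesky factorization, i.e. until M_k is positive definite."""
    n = g.shape[0]
    lam = 0.0                                    # first try the raw Hessian
    for _ in range(max_tries):
        M = H + lam * np.eye(n)
        try:
            L = np.linalg.cholesky(M)            # raises LinAlgError if M is not PD
            z = np.linalg.solve(L, -g)           # forward substitution
            return np.linalg.solve(L.T, z)       # backward substitution
        except np.linalg.LinAlgError:
            lam = lam0 if lam == 0.0 else lam * growth
    raise RuntimeError("failed to regularize the Hessian")

# Illustrative indefinite Hessian: the damped direction is still a descent direction.
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d = damped_newton_direction(H, g)
print(d, g @ d)                                  # g.d < 0
```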

Practical Newton Method

Moreover, we can select $\alpha_k$ by backtracking line search to ensure sufficient decrease in the objective function, satisfying the Armijo condition:

$$f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^\top d_k$$

where $c_1 \in (0, 1)$ is a small constant.

Courtesy: Cornell University
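A minimal sketch of a backtracking line search enforcing the Armijo condition; the step-halving factor and the quadratic test function in the usage lines are illustrative choices:

```python
import numpy as np

def backtracking_armijo(f, grad_k, x_k, d_k, alpha0=1.0, c1=1e-4, shrink=0.5, max_iter=50):
    """Shrink alpha until f(x_k + alpha*d_k) <= f(x_k) + c1*alpha*grad_k.dot(d_k)."""
    f_k = f(x_k)
    slope = grad_k @ d_k          # must be negative for a descent direction
    alpha = alpha0
    for _ in range(max_iter):
        if f(x_k + alpha * d_k) <= f_k + c1 * alpha * slope:
            return alpha
        alpha *= shrink
    return alpha                  # fallback: smallest alpha tried

# Usage on f(x) = x^T x with the steepest-descent direction (illustrative)
f = lambda x: x @ x
x = np.array([3.0, -4.0]); g = 2 * x
print(backtracking_armijo(f, g, x, -g))
```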

Quasi-Newton Methods

Newton's Method: Limitations

  • High Cost: Computing the Hessian and its inverse requires $O(n^3)$ operations, impractical for large problems.
  • Indefinite Hessian: In nonconvex cases, the Hessian may lead to steps toward saddle points.
  • Ill-Conditioning: Poorly conditioned Hessians amplify errors and hinder convergence.
  • Inaccurate Model: Local quadratic approximations may fail for complex functions, causing inefficiency or divergence.

Quasi-Newton Approximation

Newton approximation:

$$f(x) \approx f(x_k) + (x - x_k)^\top g_k + \frac{1}{2}(x - x_k)^\top H_k (x - x_k) \quad\Longrightarrow\quad H_k d_k = -g_k$$

where $g_k = \nabla f(x_k)$ and $H_k = \nabla^2 f(x_k)$.

Quasi-Newton approximation:

$$f(x) \approx f(x_k) + (x - x_k)^\top g_k + \frac{1}{2}(x - x_k)^\top B_k (x - x_k) \quad\Longrightarrow\quad B_k d_k = -g_k$$

The matrix $B_k$ should:

  • Avoid full second-order derivatives.
  • Have a closed-form solution for linear equations.
  • Retain first-order curvature information.
  • Preserve the descent direction.

BFGS Method

Descent Direction

The search direction $d_k$ should make an acute angle with the negative gradient:

$$(g_k)^\top d_k = -(g_k)^\top (B_k)^{-1} g_k < 0$$

$B_k$ must be PD, since then $(g_k)^\top (B_k)^{-1} g_k > 0$ for all nonzero $g_k$.

Courtesy: Active Calculus

Curvature Information

At the point $x_{k+1}$, the gradient is $g_{k+1}$. We want $B_{k+1}$ to satisfy the secant condition:

$$B_{k+1}(x_{k+1} - x_k) = g_{k+1} - g_k \quad\Longleftrightarrow\quad B_{k+1} s_k = y_k$$

where $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$.

The Optimal $B_{k+1}$?

Infinitely many $B_{k+1}$ satisfy the secant condition, so how do we choose the best one? We define the following weighted least-squares problem:

$$\min_{B} \|B - B_k\|_W^2 \quad \text{subject to} \quad B = B^\top, \quad B s_k = y_k$$

In BFGS, the weight matrix is selected to be the average Hessian along the step:

$$W = \int_0^1 \nabla^2 f\big[(1 - \tau) x_k + \tau x_{k+1}\big]\, d\tau$$

Closed-form Solution for $B_{k+1}$

To derive the optimal $B_{k+1}$, we construct the Lagrangian function:

$$\mathcal{L}(B, \Lambda) = \frac{1}{2}\|B - B_k\|_W^2 + \mathrm{tr}\big[\Lambda (B s_k - y_k)^\top\big]$$

Taking the derivative of the Lagrangian with respect to $B$ and setting it to zero gives:

$$\frac{\partial \mathcal{L}}{\partial B} = W (B - B_k) W + \Lambda (s_k)^\top = 0$$

Rearranging the terms, we express $B$ as:

$$B = B_k - W^{-1} \Lambda (s_k)^\top W^{-1}$$

Substituting this expression into the secant condition $B s_k = y_k$, we obtain:

$$\big(B_k - W^{-1} \Lambda (s_k)^\top W^{-1}\big) s_k = y_k$$

Solving for $\Lambda$, we find:

$$\Lambda = -W (y_k - B_k s_k) \big((s_k)^\top W^{-1} s_k\big)^{-1}$$

Finally, substituting $\Lambda$ back, the closed-form solution for $B_{k+1}$ is:

$$B_{k+1} = B_k + \frac{y_k (y_k)^\top}{(s_k)^\top y_k} - \frac{B_k s_k (B_k s_k)^\top}{(s_k)^\top B_k s_k}$$

Broyden-Fletcher-Goldfarb-Shanno (BFGS)

Given the initial value $B_0 = I$, the updates are performed iteratively:

$$B_{k+1} = B_k + \frac{y_k (y_k)^\top}{(s_k)^\top y_k} - \frac{B_k s_k (B_k s_k)^\top}{(s_k)^\top B_k s_k}$$

with $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$.

For computational efficiency, we often work with the inverse $C_k = (B_k)^{-1}$ directly. Its iterative update is given by:

$$C_{k+1} = \left(I - \frac{s_k (y_k)^\top}{(s_k)^\top y_k}\right) C_k \left(I - \frac{y_k (s_k)^\top}{(s_k)^\top y_k}\right) + \frac{s_k (s_k)^\top}{(s_k)^\top y_k}$$
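A minimal sketch of a single BFGS update on the inverse approximation $C_k$, checking the inverse form of the secant condition $C_{k+1} y_k = s_k$ on illustrative data:

```python
import numpy as np

def bfgs_inverse_update(C, s, y):
    """BFGS update of the inverse Hessian approximation C = B^{-1}."""
    rho = 1.0 / (y @ s)                 # requires the curvature condition y^T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ C @ V.T + rho * np.outer(s, s)

# Illustrative check: the updated C maps y to s (secant condition in inverse form).
rng = np.random.default_rng(0)
C = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)    # keeps y^T s > 0 here
C_next = bfgs_inverse_update(C, s, y)
print(np.allclose(C_next @ y, s))       # True
```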

Guaranteeing PD of $B_{k+1}$

To ensure that $B_{k+1}$ remains positive definite (PD), the following curvature condition must hold:

$$(y_k)^\top s_k > 0$$

For any nonzero vector $z$, using the Cauchy-Schwarz inequality:

$$z^\top B_{k+1} z = z^\top B_k z + \frac{(z^\top y_k)^2}{(y_k)^\top s_k} - \frac{(z^\top B_k s_k)^2}{(s_k)^\top B_k s_k} \;\ge\; \frac{z^\top B_k z \,\cdot\, (s_k)^\top B_k s_k - (z^\top B_k s_k)^2}{(s_k)^\top B_k s_k} \;\ge\; 0$$

Equality holds only when $z^\top y_k = 0$ and $z \parallel s_k$. Given that $(y_k)^\top s_k > 0$, these two conditions cannot hold simultaneously for $z \neq 0$. Therefore, if $B_k \succ 0$, it follows that $B_{k+1} \succ 0$.

Guaranteeing $(y_k)^\top s_k > 0$

The Armijo condition (AC) alone cannot guarantee $(y_k)^\top s_k > 0$; we also need the curvature condition (CC):

$$(d_k)^\top \nabla f(x_k + \alpha_k d_k) \ge c_2\, (d_k)^\top \nabla f(x_k)$$

Typically, $c_1 = 10^{-4}$ and $c_2 = 0.9$.

Courtesy: Ján Kopačka
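A minimal sketch of the AC and CC tests as standalone predicates, using the typical constants quoted above (the helper names and the quadratic test function are illustrative):

```python
import numpy as np

def armijo_ok(f, g, x, d, alpha, c1=1e-4):
    """AC: sufficient decrease of the objective along d."""
    return f(x + alpha * d) <= f(x) + c1 * alpha * (g(x) @ d)

def curvature_ok(g, x, d, alpha, c2=0.9):
    """CC: the directional derivative has increased enough."""
    return (d @ g(x + alpha * d)) >= c2 * (d @ g(x))

# Illustrative check on f(x) = x^T x along the steepest-descent direction
f = lambda x: x @ x
g = lambda x: 2 * x
x = np.array([1.0, 2.0]); d = -g(x)
print(armijo_ok(f, g, x, d, alpha=0.5), curvature_ok(g, x, d, alpha=0.5))   # True True
```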

L-BFGS Method

The Lewis-Overton line search is a bracketing line search designed specifically for quasi-Newton methods; it searches for a step satisfying both AC and CC (a minimal sketch in code follows the procedure below):

  1. Given search direction $d_k$, current point $x_k$ and gradient $g_k$
  2. Initialize trial step $\alpha := 1$, $\alpha_l := 0$, $\alpha_r := +\infty$
  3. Repeat
    1. Update bounds:
      • If AC($\alpha$) fails, set $\alpha_r := \alpha$
      • Else if CC($\alpha$) fails, set $\alpha_l := \alpha$
      • Otherwise, return $\alpha$
    2. Update $\alpha$:
      • If $\alpha_r < +\infty$, set $\alpha := \mathrm{CubicInterpolate}(\alpha_l, \alpha_r)$ or $\alpha := (\alpha_l + \alpha_r)/2$
      • Otherwise, set $\alpha := \mathrm{CubicExtrapolate}(\alpha_l, \alpha_r)$ or $\alpha := 2\alpha_l$
    3. Ensure $\alpha \in [\alpha_{\min}, \alpha_{\max}]$
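A minimal sketch of this procedure, with plain bisection and step-doubling standing in for the cubic interpolation/extrapolation options; the constants, safeguards, and the quadratic usage example are illustrative:

```python
import numpy as np

def lewis_overton_search(f, g, x, d, c1=1e-4, c2=0.9,
                         alpha_min=1e-16, alpha_max=1e16, max_iter=60):
    """Bracketing weak-Wolfe line search (bisection/doubling in place of the
    cubic interpolation/extrapolation steps described in the text)."""
    alpha, lo, hi = 1.0, 0.0, np.inf
    f0, slope0 = f(x), g(x) @ d
    for _ in range(max_iter):
        if f(x + alpha * d) > f0 + c1 * alpha * slope0:      # AC fails
            hi = alpha
        elif d @ g(x + alpha * d) < c2 * slope0:             # CC fails
            lo = alpha
        else:
            return alpha
        alpha = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * lo
        alpha = min(max(alpha, alpha_min), alpha_max)        # keep alpha in bounds
    return alpha

# Illustrative run on f(x) = x^T x with the steepest-descent direction
f = lambda x: x @ x
g = lambda x: 2 * x
x = np.array([2.0, -1.0])
print(lewis_overton_search(f, g, x, -g(x)))
```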

Cautious Update

Sometimes, when the line search is inexact or the function is poorly conditioned, $(y_k)^\top s_k > 0$ cannot be guaranteed. To ensure numerical stability and maintain a PD Hessian approximation, L-BFGS employs a cautious update strategy:

Skip update condition: If the curvature condition $(y_k)^\top s_k > \epsilon \|s_k\|^2$ is not satisfied, where $\epsilon$ is a small positive constant (e.g., $10^{-6}$), skip the update for this iteration: $B_{k+1} = B_k$.

Powell's Damping: If the curvature condition $(y_k)^\top s_k \ge \eta\, (s_k)^\top B_k s_k$ is not satisfied, where $\eta$ is a small positive constant (e.g., 0.2 or 0.25), replace $y_k$ with the damped vector $\tilde{y}_k$:

$$\tilde{y}_k = \theta y_k + (1 - \theta) B_k s_k, \qquad \theta = \frac{(1 - \eta)\,(s_k)^\top B_k s_k}{(s_k)^\top B_k s_k - (y_k)^\top s_k}$$

With cautious updates, the iterates are guaranteed to converge to a critical point provided the function has bounded sublevel sets and a Lipschitz continuous gradient.
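A minimal sketch of both safeguards; `Bs` stands for the product $B_k s_k$, which a full L-BFGS implementation would obtain implicitly rather than from an explicit matrix (names and example data are illustrative):

```python
import numpy as np

def cautious_skip(s, y, eps=1e-6):
    """Skip-update test: keep the pair only if y^T s > eps * ||s||^2."""
    return (y @ s) > eps * (s @ s)

def powell_damped_y(s, y, Bs, eta=0.2):
    """Powell's damping: if y^T s < eta * s^T B s, blend y with B s so that
    the damped y_tilde satisfies y_tilde^T s = eta * s^T B s."""
    sBs = s @ Bs
    if s @ y >= eta * sBs:
        return y                                  # enough curvature, keep y
    theta = (1.0 - eta) * sBs / (sBs - s @ y)
    return theta * y + (1.0 - theta) * Bs

# Illustrative: a pair with negative curvature is skipped, or damped back to eta * s^T B s.
B = np.eye(2)
s = np.array([1.0, 0.0]); y = np.array([-0.5, 0.3])
print(cautious_skip(s, y), s @ powell_damped_y(s, y, B @ s))   # False, 0.2
```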

Two-Loop Recursion

L-BFGS uses a two-loop recursion to compute the search direction without explicitly forming the Hessian approximation. The algorithm maintains a history of the most recent $m$ pairs $\{(s_i, y_i)\}_{i=k-m}^{k-1}$, where typically $m$ is between 5 and 20 (a code sketch follows the complexity note below).

  1. Initialize an empty array $A$ of length $m$, $d = -g_k$
  2. For $i = k-1, k-2, \ldots, k-m$:
    1. $A_{i+m-k} := \langle s_i, d \rangle / \langle s_i, y_i \rangle$
    2. $d := d - A_{i+m-k}\, y_i$
  3. $d := d \cdot \langle s_{k-1}, y_{k-1} \rangle / \langle y_{k-1}, y_{k-1} \rangle$
  4. For $i = k-m, k-m+1, \ldots, k-1$:
    1. $a := \langle y_i, d \rangle / \langle s_i, y_i \rangle$
    2. $d := d + s_i (A_{i+m-k} - a)$
  5. Return $d$

This approach reduces the storage requirement from $O(n^2)$ to $O(mn)$ and the computational cost per iteration from $O(n^2)$ to $O(mn)$.
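A minimal sketch of the two-loop recursion with the history kept as a Python list of $(s_i, y_i)$ pairs, oldest first; the demonstration data at the end are illustrative:

```python
import numpy as np

def lbfgs_direction(g, history):
    """Two-loop recursion: returns the search direction d ≈ -(B_k)^{-1} g
    using the stored pairs. `history` is a list of (s_i, y_i), oldest first."""
    d = -g.astype(float)
    alphas = []
    for s, y in reversed(history):                     # i = k-1, ..., k-m
        a = (s @ d) / (s @ y)
        alphas.append(a)
        d = d - a * y
    if history:                                        # initial scaling (step 3)
        s, y = history[-1]
        d = d * (s @ y) / (y @ y)
    for (s, y), a in zip(history, reversed(alphas)):   # i = k-m, ..., k-1
        b = (y @ d) / (s @ y)
        d = d + s * (a - b)
    return d

# With an empty history this is steepest descent; with stored pairs it is
# still a descent direction (g.d < 0).
g = np.array([1.0, -2.0, 0.5])
hist = [(np.array([0.3, 0.1, -0.2]), np.array([0.4, 0.2, -0.1]))]
print(lbfgs_direction(g, []), g @ lbfgs_direction(g, hist))
```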

Algorithm Summary

The complete L-BFGS algorithm with cautious update and Lewis-Overton line search:

  1. Initialize $x_0$, $g_0 := \nabla f(x_0)$, choose $m$
  2. For $k = 0, 1, 2, \ldots$ until convergence:
    1. Compute the search direction $d_k$ using the L-BFGS two-loop recursion
    2. Find the step size $\alpha_k$ using the Lewis-Overton line search
    3. Update: $x_{k+1} = x_k + \alpha_k d_k$
    4. Compute $s_k = x_{k+1} - x_k$, $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
    5. Apply the cautious update to $(s_k, y_k)$ before storing the pair in the history
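Putting the pieces together, a compact driver in the spirit of the summary above; it assumes the `lbfgs_direction`, `lewis_overton_search`, and `cautious_skip` sketches from the earlier snippets, and the Rosenbrock test function is an illustrative choice:

```python
import numpy as np

def lbfgs(f, grad, x0, m=10, tol=1e-6, max_iter=500):
    """L-BFGS with skip-based cautious updates; relies on lbfgs_direction,
    lewis_overton_search and cautious_skip defined in the sketches above."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    history = []                                   # list of (s, y) pairs, oldest first
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = lbfgs_direction(g, history)
        alpha = lewis_overton_search(f, grad, x, d)
        x_next = x + alpha * d
        g_next = grad(x_next)
        s, y = x_next - x, g_next - g
        if cautious_skip(s, y):                    # keep only well-behaved pairs
            history.append((s, y))
            if len(history) > m:
                history.pop(0)                     # drop the oldest pair
        x, g = x_next, g_next
    return x

# Illustrative run on the Rosenbrock function
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                           200*(x[1] - x[0]**2)])
print(lbfgs(f, grad, np.array([-1.2, 1.0])))       # approaches (1, 1)
```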

Open Source Implementation