Description
10. Unconstrained minimization
• terminology and assumptions
• gradient descent method
• steepest descent method
• Newton’s method
• self-concordant functions
• implementation
Unconstrained minimization
minimize f(x)
• f convex, twice continuously differentiable (hence domf open)
• we assume optimal value p⋆ = infx f(x) is attained (and finite) unconstrained minimization methods
• produce sequence of points x(k) ∈ domf, k = 0,1,… with
f(x(k)) → p⋆
• can be interpreted as iterative methods for solving optimality condition
∇f(x⋆) = 0
Initial point and sublevel set
algorithms in this chapter require a starting point x(0) such that
• x(0) ∈ domf
• sublevel set S = {x | f(x) ≤ f(x(0))} is closed
2nd condition is hard to verify, except when all sublevel sets are closed:
• equivalent to condition that epif is closed
• true if domf = Rn
• true if f(x) → ∞ as x → bddomf
examples of differentiable functions with closed sublevel sets:
Strong convexity and implications
f is strongly convex on S if there exists an m > 0 such that
for all x ∈ S
implications
• for x,y ∈ S,
hence, S is bounded
• p⋆ > −∞, and for x ∈ S,
useful as stopping criterion (if you know m)
Descent methods
x(k+1) = x(k) + t(k)∆x(k) with f(x(k+1)) < f(x(k))
• other notations: x+ = x + t∆x, x := x + t∆x
• ∆x is the step, or search direction; t is the step size, or step length
• from convexity, f(x+) < f(x) implies ∇f(x)T∆x < 0
(i.e., ∆x is a descent direction)
General descent method.
given a starting point x ∈domf.
repeat
1. Determine a descent direction ∆x.
2. Line search. Choose a step size t > 0.
3. Update. x := x + t∆x. until stopping criterion is satisfied.
Line search types
exact line search: t = argmint>0f(x + t∆x) backtracking line search (with parameters α ∈ (0,1/2), β ∈ (0,1))
• starting at t = 1, repeat t := βt until
f(x + t∆x) < f(x) + αt∇f(x)T∆x
• graphical interpretation: backtrack until t ≤ t0
Gradient descent method
general descent method with ∆x = −∇f(x)
given a starting point x ∈domf.
repeat
1. ∆x := −∇f(x).
2. Line search. Choose step size t via exact or backtracking line search.
3. Update. x := x + t∆x. until stopping criterion is satisfied.
• stopping criterion usually of the form k∇f(x)k2 ≤ ǫ
• convergence result: for strongly convex f,
f(x(k)) − p⋆ ≤ ck(f(x(0)) − p⋆)
c ∈ (0,1) depends on m, x(0), line search type
• very simple, but often very slow; rarely used in practice quadratic problem in R2
) (γ > 0)
with exact line search, starting at x(0) = (γ,1):
• very slow if γ ≫ 1 or γ ≪ 1
• example for γ = 10:
nonquadratic example
f(x1,x2) = ex1+3×2−0.1 + ex1−3×2−0.1 + e−x1−0.1
a problem in R100
500
f(x) = cTx − Xlog(bi − aTi x)
i=1
‘linear’ convergence, i.e., a straight line on a semilog plot
Steepest descent method
normalized steepest descent direction (at x, for norm k · k):
∆xnsd = argmin{∇f(x)Tv | kvk = 1}
interpretation: for small v, f(x + v) ≈ f(x) + ∇f(x)Tv; direction ∆xnsd is unit-norm step with most negative directional derivative
(unnormalized) steepest descent direction
∆xsd = k∇f(x)k∗∆xnsd
satisfies ∇f(x)T∆sd = −k∇f(x)k2∗ steepest descent method
• general descent method with ∆x = ∆xsd
• convergence properties similar to gradient descent
examples
• Euclidean norm: ∆xsd = −∇f(x)
• quadratic norm
• ℓ1-norm: ∆xsd = −(∂f(x)/∂xi)ei, where |∂f(x)/∂xi| = k∇f(x)k∞
unit balls and normalized steepest descent directions for a quadratic norm and the ℓ1-norm:
choice of norm for steepest descent
• steepest descent with backtracking line search for two quadratic norms
• ellipses show {x | kx − x(k)kP = 1}
• equivalent interpretation of steepest descent with quadratic norm k · kP: gradient descent after change of variables x¯ = P1/2x shows choice of P has strong effect on speed of convergence
Newton step
∆xnt = −∇2f(x)−1∇f(x)
interpretations
• x + ∆xnt minimizes second order approximation
• x + ∆xnt solves linearized optimality condition
• ∆xnt is steepest descent direction at x in local Hessian norm
dashed lines are contour lines of f; ellipse is {x + v | vT∇2f(x)v = 1} arrow shows −∇f(x)
Newton decrement
a measure of the proximity of x to x⋆
properties
• gives an estimate of f(x) − p⋆, using quadratic approximation fb:
• equal to the norm of the Newton step in the quadratic Hessian norm
• directional derivative in the Newton direction: ∇f(x)T∆xnt = −λ(x)2
• affine invariant (unlike k∇f(x)k2)
Newton’s method
given a starting point x ∈domf, tolerance ǫ > 0. repeat
1. Compute the Newton step and decrement.
∆xnt := −∇2f(x)−1∇f(x); λ2 := ∇f(x)T∇2f(x)−1∇f(x).
2. Stopping criterion. quit if λ2/2 ≤ ǫ.
3. Line search. Choose step size t by backtracking line search.
4. Update. x := x + t∆xnt.
affine invariant, i.e., independent of linear changes of coordinates: Newton iterates for f˜(y) = f(Ty) with starting point y(0) = T−1x(0) are
y(k) = T−1x(k)
Classical convergence analysis
assumptions
• f strongly convex on S with constant m
• ∇2f is Lipschitz continuous on S, with constant L > 0:
k∇2f(x) − ∇2f(y)k2 ≤ Lkx − yk2
(L measures how well f can be approximated by a quadratic function) outline: there exist constants η ∈ (0,m2/L), γ > 0 such that
• if k∇f(x)k2 ≥ η, then f(x(k+1)) − f(x(k)) ≤ −γ
• if k∇f(x)k2 < η, then
damped Newton phase (k∇f(x)k2 ≥ η)
• most iterations require backtracking steps
• function value decreases by at least γ
• if p⋆ > −∞, this phase ends after at most (f(x(0)) − p⋆)/γ iterations
quadratically convergent phase (k∇f(x)k2 < η)
• all iterations use step size t = 1
• k∇f(x)k2 converges to zero quadratically: if k∇f(x(k))k2 < η, then
conclusion: number of iterations until f(x) − p⋆ ≤ ǫ is bounded above by
• γ, ǫ0 are constants that depend on m, L, x(0)
• second term is small (of the order of 6) and almost constant for practical purposes
• in practice, constants m, L (hence γ, ǫ0) are usually unknown
• provides qualitative insight in convergence properties (i.e., explains two algorithm phases)
Examples
example in R2 (page 1–9)
k
• backtracking parameters α = 0.1, β = 0.7
• converges in only 5 steps
• quadratic local convergence
example in R100 (page 1–10)
10−15 0
0 2 4 6 8 10 0 2 4 6 8 k k
• backtracking parameters α = 0.01, β = 0.5
• backtracking line search almost as fast as exact l.s. (and much simpler)
• clearly shows two phases in algorithm
example in R10000 (with sparse ai)
10000 100000
f(x) = − X log(1 − x2i) − X log(bi − aTi x)
• backtracking parameters α = 0.01, β = 0.5.
• performance similar as for small examples
Self-concordance
shortcomings of classical convergence analysis
• depends on unknown constants (m, L, . . . )
• bound is not affinely invariant, although Newton’s method is
convergence analysis via self-concordance (Nesterov and Nemirovski)
• does not depend on any unknown constants
• gives affine-invariant bound
• applies to special class of convex functions (‘self-concordant’ functions)
• developed to analyze polynomial-time interior-point methods for convex optimization
Self-concordant functions
definition
• convex f : R → R is self-concordant if |f′′′(x)| ≤ 2f′′(x)3/2 for all x ∈ domf
• f : Rn → R is self-concordant if g(t) = f(x + tv) is self-concordant for all x ∈ domf, v ∈ Rn examples on R
• linear and quadratic functions
• negative logarithm f(x) = −logx
• negative entropy plus negative logarithm: f(x) = xlogx − logx affine invariance: if f : R → R is s.c., then f˜(y) = f(ay + b) is s.c.:
f˜′′′(y) = a3f′′′(ay + b), f˜′′(y) = a2f′′(ay + b)
Self-concordant calculus
properties
• preserved under positive scaling α ≥ 1, and sum
• preserved under composition with affine function
• if g is convex with domg = R++ and |g′′′(x)| ≤ 3g′′(x)/x then
f(x) = log(−g(x)) − logx
is self-concordant examples: properties can be used to show that the following are s.c.
• f(X) = −logdetX on Sn++
• f(x) = −log(y2 − xTx) on {(x,y) | kxk2 < y}
Convergence analysis for self-concordant functions
summary: there exist constants η ∈ (0,1/4], γ > 0 such that
• if λ(x) > η, then f(x(k+1)) − f(x(k)) ≤ −γ
• if λ(x) ≤ η, then
(η and γ only depend on backtracking parameters α, β) complexity bound: number of Newton iterations bounded by
for α = 0.1, β = 0.8, ǫ = 10−10, bound evaluates to 375(f(x(0)) − p⋆) + 6 numerical example: 150 randomly generated instances of
minimize
◦: m = 100, n = 50
: m = 1000, n = 500
♦: m = 1000, n = 50
• number of iterations much smaller than 375(f(x(0)) − p⋆) + 6
• bound of the form c(f(x(0)) − p⋆) + 6 with smaller c (empirically) valid
Implementation
main effort in each iteration: evaluate derivatives and solve Newton system
H∆x = g
where H = ∇2f(x), g = −∇f(x)
via Cholesky factorization
H = LLT, ∆xnt = L−TL−1g, λ(x) = kL−1gk2
• cost (1/3)n3 flops for unstructured system
• cost ≪ (1/3)n3 if H sparse, banded
example of dense Newton system with structure
• assume A ∈ Rp×n, dense, with p ≪ n
• D diagonal with diagonal elements method 1: form H, solve via dense Cholesky factorization: (cost (1/3)n3)
method 2 (page 9–15): factor ; write Newton system as
eliminate ∆x from first equation; compute w and ∆x from
cost: 2p2n (dominated by computation of
Reviews
There are no reviews yet.