## Description

+1/1/60+

Profs. Martin Jaggi and Nicolas Flammarion 1

For your examination, preferably print documents compiled from automultiple-choice.

+1/2/59+

First part, multiple choice

There is exactly one correct answer per question.

Smoothness and gradient descent

Question 1 Let us define f : x ∈ R 7→ cos(x). We consider xt ∈ R and xt+1 = xt − ∇f(xt). Assume that xt is not a critical point of f. Which one of the following statements is true:

There exists an xt such that f(xt+1) > f(xt)

None of the above

Question 2 Assume you want to minimize a function , where for each i, fi is convex and L-smooth over Rd. Which of the following statements is false:

If I use a constant step-size γ < L1 , then GD will converge but not SGD.

If n = 1, then SGD and GD correspond to the same recursion.

If n is very big then gradient descent can be computationally infeasible.

SGD corresponds to xt+1 = xt −γt∇fit(xt) where it is the remainder of t divided by .

Question 3 For a > 0 and b ∈ R, consider f(x) = a·x4+b, x ∈ R. Assume you perform gradient descent on f with a constant step-size γ. Which one of the following statements is true:

If |x0| ≤ 1 and 0 < γ ≤ 1 then the iterates converge to 0.

Depending on my starting point x0 and my step size, either my iterates xt converge to 0, or diverge |xt| −→ +∞. t→∞

For the iterates to converge, my step size must depend on b.

For a starting point then the iterates converge to 0.

For a starting point x0, whatever step size I pick, the iterates will never converge to 0.

Newton’s Method and Quasi-Newton

Question 4 How many steps does the Newton’s method require to reach an error smaller than ε > 0 when minimizing a strictly convex quadratic function:

It depends on the step size.

It depends on the condition number of the quadratic function.

+1/3/58+

Question 5 We apply Newton’s method to a function f with a critical point x⋆ starting from iterate x0.

Assume that f has bounded inverse Hessians and Lipschitz continuous Hessians. Among the following propositions, what is the extra assumption which allows to show that ∥xT −x⋆∥ < ε after T = O(loglog(1/ε)) steps?

Convexity.

Smoothness.

Taking the average iterate.

Decreasing step size. Strong convexity.

∥x0 − x⋆∥ should be small.

Function properties

Consider the function d : D → R with D ⊆ R defined as , where x and x are the coordinates of x. Let us consider three cases: (A) when D = R , (B) when D = {x ∈ R : ∥x∥ ≤ 1}, and (C) when D = {x ∈ R2 : x2 = 3}.

Question 6

C only.

A, B and C.

A and C only. A only.

A and B only.B and C only.

B only.

None of them.

Question 7

C only.

B and C only.

A and B only. A and C only.

A only.

B only.

A, B and C.

None of them.

For your examination, preferably print documents compiled from automultiple-choice.

+1/4/57+

Coordinate descent

Question 8 Compared to gradient descent, coordinate descent with gradient-based updates can speed up optimization when coordinate-wise gradients are cheap to compute, and when the coordinates i (i = 1,…,d) have varying smoothness constants Li. We use coordinate-dependent step sizes ηi, and make a gradient step on coordinate i with probability pi. To obtain a convergence rate that depends on instead of maxi Li, you would use

ηi < ηj and pi < pj ηi > ηj and pi < pj ηi < ηj and pi > pj

ηi > ηj and pi > pj

Subgradient descent

Question 9

Constrained optimization

Consider the Lasso regression

Question 10 the linear minimization oracle LMO(∇f(x )) where x

[−1,0]⊤

[0,0]⊤ [1,0]⊤ [0,1]⊤

Question 11

0

+1/5/56+

Proximal gradient descent

Question 12 For h(x) = |x|, the soft thresholding operator is defined by the proximal operator proxh,t(u).

Then for u ≥ t > 0, proxh,t(u) can be written as which of the following?

For your examination, preferably print documents compiled from automultiple-choice.

+1/6/55+

Second part, true/false questions

Question 13 (Convexity) A function f(x) is convex if and only if g(x) = −f(x) is non-convex.

Question 14

Question 15

Question 16

Question 17

2

Question 18 gradient descent with step-size 0 < γ <

Question 19 (Frank-Wolfe) Consider , if we apply the Frank-Wolfe algorithm with stepsize

+1/7/54+

Third part, open questions

Answer in the space provided! Your answer must be justified with all steps. Do not cross any checkboxes, they are reserved for correction.

+1/8/53+

Question 23: 2 points. Is Dh symmetric, i.e., Dh(x,y) = Dh(y,x)? Prove your answer.

0 1 2

+1/9/52+

Question 26: 3 points. Assume condition (S). Show that for any y,x,z ∈ Rd we have

+1/10/51+

We consider that the function h satisfies additionally the following properties:

• The gradient of h takes all possible values, i.e., ∇h(Rd) = Rd. We consider the optimization algorithm defined as x0 ∈ Rd and which iterates:

xt+1 := Tγ(xt) for t ∈ N. (MD)

+1/11/50+

Question 30: 2 points. Let u ∈ Rd and consider the iterates defined in equation (MD). We denote the average of the iterates x . Show the following inequality:

+1/12/49+

Question 32: 2 points. Does the inequality proved in Question 32 imply convergence f(xt) → f(u)? Prove your answer.

0 1 2

+1/13/48+

Application to Poison Linear Inverse Problems

Let us denote by R+ = {x ∈ R,x ≥ 0} and R+∗ = {x ∈ R,x > 0}. Given a matrix and a vector b , the goal is to reconstruct a signal x such that

+1/14/47+

Let us denote by ai the i-th row of the matrix A. We assume that ai ̸= 0 and for all j. Let us consider Burg’s entropy defined by .

Question 36: 5 points. Show that for any L satisfying L ≥ ∥b∥1, the function Lh − f is convex on

0 1 2 3 4 5

(HINT: you can compute the Hessian and show that it is positive semi-definite.)

+1/15/46+

We will minimize the function f using the Mirror Descent algorithm and the potential h whose update rule is defined similarly by

+1/16/45+

Question 38: 3 points. Assuming that you can apply the results derived in Question 34, which convergence rate do you obtain with this algorithm on this problem? Why is it surprising?

0 1 2 3

For your examination, preferably print documents compiled from automultiple-choice.

## Reviews

There are no reviews yet.