Error estimates for iterative algorithms for minimizing regularized quadratic subproblems

ABSTRACT We derive bounds for the objective errors and gradient residuals when finding approximations to the solution of common regularized quadratic optimization problems within evolving Krylov spaces. These provide upper bounds on the number of iterations required to achieve a given stated accuracy. We illustrate the quality of our bounds on given test examples.


Introduction
In this paper, we derive upper bounds for the number of iterations required to reach a certain level of optimality by subspace methods for solving the trust-region subproblem minimize x∈R n q(x) := g T x + 1 2 x T Hx subject to x ≤ δ (1) and its regularization variant minimize x∈R n q R (x, σ , p) := q(x) Here, we are given a gradient g, a symmetric, but possibly indefinite, Hessian H, a radius δ > 0, a weight σ > 0 and a power p > 2, and use the Euclidean norm · . Subproblems (1)-(2) lie at the heart of the step calculation in both trust-region and cubic-regularization methods for unconstrained optimization [9,11,31,32]. A typical requirement in the trust-region case is that the computed x should decrease the objective function, i.e. q(x) < q(0) ≡ 0, and that the gradient of the Lagrangian for the problem, g + Hx + μx, should be smaller than a prescribed tolerance in norm, i.e. g + Hx + μx ≤ for some given > 0, whose precise value determines the rate of convergence of the trust-region algorithm, and a suitable Lagrange multiplier, μ ≥ 0, for the trust-region constraint x ≤ δ. For regularization problems, a similar requirement is that q R (x, σ , p) < q R (0, σ , p) ≡ 0 and that the norm of the gradient of q R (x, σ , p) should be small. Since ∇ x q R (x, σ , p) = g + Hx + μx where μ = σ x p−2 , the latter requirement is identical to (3) but for a different μ. As the subspace methods we consider automatically ensure that their relevant objectives decrease, our intention is to provide bounds on the number of steps (actually products with H) required by such methods to achieve (3) for the problems under consideration. The subspaces of interest here are the nested Krylov spaces K k := K(H, g, k) for k ≥ 0, where, for general A and b, we define K(A, b, k) := span{A i b} k −1 i=0 . A sequence of estimates x k are generated so that for the trust-region subproblem, or for the regularization case; here and elsewhere arg min refers to the set of global minimizers of the function under consideration within the domain specified. 1 This is useful as the well-known GLTR method [18] for (1) and the GLRT approach [9] for (2), which exploit the evolving Lanczos basis for K k , use precisely these formulations. 2 However, we must be cautious as it is well known [18,Theorem 5.8] that such methods may fail to solve the problem if the sequence of Krylov subspaces lies in an unpropitious non-trivial invariant subspace of n , and in this case it may be necessary to enhance the search space with a specific eigenvector of H from outside the Krylov space. Fortunately, as we shall see, this is not necessary if our goal is merely to satisfy (3).
To put subspace methods in context, we briefly review other approaches. Early methods [16,30] were aimed at the trust-region case (1), and use the optimality conditions for the problem (see Theorem 2.1) to reduce it to a scalar root-finding problem involving a socalled secular equation. A safeguarded Newton process is applied, and the required value and derivatives of the related secular function at each scalar value μ require the direct solution of a symmetric, positive-definite linear system (H + μI)x = −g. These ideas may be extended to higher-order root-finding methods, and for problems with sparse Hessians [20].
The first methods designed to avoid factorization [37,38] are only able to find approximate solutions, based on properties of the conjugate-gradient method. GLTR [18] is the first method fully to explore this Krylov subspace. Other subspace methods [13,14,23] have focussed on evolving very low-dimensional (non-Krylov) subspaces. Another popular approach is to couch (1) as a parametric or generalized eigenvalue problem [1,26,34,35], and to harness existing, powerful software for this domain. A third suggestion is to formulate the problem as a semi-definite or second-order cone program [15,24,25,33], and as before to use sophisticated methods from this area. GLTR has a good reputation [1] unless the solution required lies outside the Krylov space, when, perhaps unsurprisingly, eigenvalue approaches may have the edge.
The regularization subproblem (5) has risen in importance in the 21st century, and obvious analogues [6,9,20] of the direct and subspace approaches have been proposed. There has also been a move towards simpler (accelerated, first-order methods) [5,7,40] that recognize that matrices may be too large to manipulate when the problem is enormous.
While all of these methods provide guarantees of convergence to a solution of their relevant subproblem, little has been said about how this convergence occurs. Recently there has been some effort to bound the decrease in the objective function value, and thus the overall number of iterations required to reach a specified objective accuracy, for first-order and subspace methods [8,24,25,41,42]. Our interest, though, is in trying to achieve (3), since this is more significant for the overall convergence of many nonlinear optimization methods [9,11,32]. This is the main focus of our paper.
In §2 we examine the benefits and limitations of Krylov approximations to the solutions we wish to find. We follow this, in §3, by deriving bounds both on the decrease in the model objective functions and on the norm of the violation of the first-order criticality residuals from the Krylov space under consideration. We note that bounds on the former have already been proposed [8,41,42], and indeed our bounds make use of some of those in [8]. The bounds we derive for the residuals generalize well-known ones for conjugate-gradient methods applied to definite linear systems. They behave just as in the conjugate-gradient case, but necessarily reflect the additional complications due to the potential indefiniteness of the matrices involved and the presence of explicit or implicit regularization. The bounds allow us to give an upper estimate of how many iterations will be required to satisfy (3), and thus on the work involved in finding a suitable approximation to the solution of the subproblem under consideration. We examine our residual bounds on test examples that are designed to illustrate a variety of spectral distributions in §4. Finally, we make concluding remarks in §5.

Solutions from the Krylov space and beyond
Let λ 1 ≤ · · · ≤ λ n be the eigenvalues of H, with the leftmost λ 1 having multiplicity n 1 , and let u i , i ∈ N := {1, . . . , n} be the corresponding orthonormal eigenvectors. Crucially, there are well-known characterizations of the global solutions of (1) and (2).
where the Lagrange multiplier μ * ≥ max(0, −λ 1 ) and μ * ( x * 2 − δ 2 ) = 0. Moreover the solution is unique and μ * > max(0, −λ 1 ) whenever g T u i = 0 for some 1 ≤ i ≤ n 1 .  (6), where the multiplier μ * = σ x * p−2 ≥ −λ 1 . Moreover the solution is unique and μ * > −λ 1 whenever g T u i = 0 for some We consider the evolving Krylov spaces K k , k ≥ 0, in more detail. Clearly we may decompose g = n j=1 (g T u j )u j in terms of the basis of eigenvectors {u j } j∈N of H. Let I + := {j | g T u j = 0}, I 0 := N \ I + and m := |I + |. 3 Thus Therefore K m = · · · = K n , since K m = span{u j } j∈I + and the vectors H i g for m < i ≤ n are dependent on those in K m . Hence, our Krylov methods will make no further progress beyond the mth iteration, and at that point provide estimates of their relevant solutions x m and multipliers μ m . We now contrast x m with a desired global minimizer x * . Let U + be the n by m matrix whose columns are the eigenvectors u j , j ∈ I + , U 0 be the n by n−m matrix whose columns are the remaining eigenvectors and U = (U + : U 0 ). Likewise let be the diagonal matrix of eigenvalues ordered as for U, and let + and 0 be its diagonal blocks. Thus If we defineḡ := U T g, and therefore g = Uḡ, this leads to sinceḡ 0 = 0 as u T j g = 0 for all j ∈ I 0 . Consider the trust-region subproblem (1), and the change of variables x = Ux. In this case, x * = Ux * , wherē The optimality conditions (6) for this are wherex * and the Lagrange multiplier satisfy and we have partitionedx By contrast, if x ∈ K m , then x = U +x+ for some vectorx + ∈ R m , in which case (4) gives x m = U +x + * , where uniquelŷ The optimality conditions (6) then imply that wherex + * and the Lagrange multiplier Givenx + * , letx 0 * = 0 and definê In this case, (14) and (16) become and Now comparex * and μ * from (10)-(12) withx * and μ + * from (15), (17) and (18). The only substantial difference is between (11) and (15). Indeed, if μ + * ≥ max(0, −λ 1 ), the two sets of conditions are identical, and in this case x m = x * and μ m = μ * , i.e. the solution from the subspace K m solves the full-space trust-region problem (1). This must occur if min j∈I + λ j = λ 1 or, equivalently I + ∩ {1, . . . , n 1 } = ∅, where we recall n 1 is the multiplicity of λ 1 , but may also happen if min j∈I + λ j > λ 1 . If μ + * < −λ 1 , I + ∩ {1, . . . , n 1 } = ∅, and x m cannot solve (1), but it is nonetheless a critical point of the problem. 4 This possibility is often called the 'hard case' [30] and μ * = −λ 1 ; the first block equation in (10) uniquely definesx + * , andx 0 * is a multiple of any eigenvector of the second (singular) block, the precise combination ensuring that x * = δ.
The main lesson here is that if we wish to solve (1) we shall have to look outside the Krylov space and may need to compute an eigenvector corresponding to λ 1 . If we are content simply in finding a critical point of (1), the Krylov space suffices. An essentially identical argument may be used in the case of the regularization subproblem (2) with the same conclusions.

Bounds on the decrease of the objective functions
In essence Carmon and Duchi [8] provide the following bounds. 5 where x k is given by (4), x * is a global minimizer of (1), and μ * is its corresponding Lagrange multiplier, and (ii) where x k is given by (5), x * is a global minimizer of (2) and μ * = σ x * .
Thus the error in the relevant objective function decreases at worst linearly as a function of the subspace dimension unless μ * = −λ 1 , in which case Theorem 3.1 provides no useful bound. As we have already mentioned, the unlikely 'hard case' μ * = −λ 1 only occurs if g is orthogonal to the space of eigenvectors corresponding to the eigenvalue λ 1 of H, and should this happen these eigenvectors will not occur in the Krylov spaces K k , except through numerical rounding. A simple expedient advocated by others [8] is to perturb g by a small random vector so that, with high probability, it will have some component in the space of eigenvectors corresponding to the eigenvalue λ 1 , and thus the hard case cannot occur.
We note that Carmon and Duchi actually provide a second, sublinear decrease estimate that may be less pessimistic for small k, but we shall not use this here. Zhang et al. [42,Theorem 4.2] suggest quantitatively-similar bounds in the trust-region case, but again we shall not need them.
We now restrict our attention to the best estimate x m available from the evolving Krylov space. We provide a bound on the decrease of the objective function at this point compared to that at the best estimate x k available from K k ; the bound indicates at worst a linear rate of convergence as the space expands. We exclude the special case g = 0 since then x = 0 is a critical point of both of the subproblems under consideration.
Then, for all k ≥ 0, where x k and x m are given by (4), and μ m is the Lagrange multiplier corresponding to x m , and (ii) where x k and x m are given by (5), κ m is given by (23) but now μ m = σ x m .
Proof: Since g = 0, I + = ∅, m > 0 and both λ min + and λ max + are well defined. Let H + be the matrix with eigenpairs {λ j , u j } for j ∈ I + and {λ max + , u j } for j ∈ I 0 = N \ I + . Then for k ≥ 0, and the iterates x k generated from the Krylov spaces K k for (4) and (5) for the problem with Hessian H + are identical to those with Hessian H. However the hard case cannot occur with the Hessian H + as none of the eigenvalues for j ∈ I 0 is smaller than the smallest for j ∈ I + . Hence, as we saw in §2, for this Hessian x * = x m and μ * = μ m . Thus we may apply Theorem 3.1 for the problem with Hessian H + to deduce (22) and (24).
As Carmon and Duchi mention, this then implies a worst-case estimate of iterations in order to guarantee q(
as V k has orthonormal columns. Furthermore pre-multiplying (26) by V T k yields and thus the definition (4) implies that Moreover, applying Theorem 2.1 to (4) shows that T k + μ k I is positive definite. Let It then follows from (26), (30) and (31) that Hence r k = γ k e T k y k v k+1 and Note that the definition of γ k > 0 as the (k, k + 1)-st entry of T k+1 and the Cauchy Schwarz inequality implies that Thus, aside from the term γ k g > 0, the residual norm decays with |e T k (T k + μ k I) −1 e 1 |, and we now focus on this.
We  Note that we shall prefer the slightly weaker, but simpler, bound and indeed c = 1/λ min (M) so long as κ(M) ≥ 1 + √ 2. For any k ≤ m, we have that T k from (27) is symmetric and tridiagonal with left-and right-most eigenvalues (Ritz values) respectively θ 1,k < θ k,k . As we have seem T k + μ k I is positive definite, and thus has distinct left-and right-most eigenvalues as well as spectral condition number We may apply Lemma 3.3 to T k + μ k I to deduce our main result.

Theorem 3.4:
The residual (32) for the kth iterate, x * k , generated by either the trust-region subproblem (4) or the regularization subproblem (5) satisfies the bound where κ k is given by (37) and γ k is the (k, k + 1)-st entry of T k+1 .
In the trust-region case, this leads to the following residual bound. 6 Corollary 3.5: The residual (32) for the kth iterate, x * k , generated by the trust-region subproblem (4) satisfies the bound where κ k is given by (37).
A similar result is possible for the regularization case.

Corollary 3.7:
The residual (32) for the kth iterate, x * k , generated by the regularization subproblem (5) satisfies the bound where κ k is given by (37).

Comments
We now comment on the bounds obtained in §3. 1 K(A, b, k), and this is easily shown to lead to the bound where P k = {polynomials ψ of degree k for which ψ(0) = 1} and λ i (A), i ∈ N , are a subset of the eigenvalues of A. Weakening the requirement that the maximum in (49) instead considers ψ(λ) over all λ ∈ [λ min (A), λ max (A)] and invoking a well-known bound from approximation theory relating to Chebyshev polynomials, it then follows that we thus obtain the bound The presence of κ m ≤ κ * in the bounds in §3.1 and 3.2 is strongly reminiscent of the CG case, and indicates that rescaling (preconditioning) the problem so that κ m or κ * = O(1) would be beneficial. In the strictly convex case when H is positive definite, κ m is no larger than the traditional condition number λ max + /λ min + obtained from (29). Although we know that θ k,k increases monotonically from (29), as does μ k [10,28], we have not been able to prove that κ k increases monotonically, 7 albeit in practice it appears to.
We need to be very cautious here as although such bounds accurately predict the worst possible case [8,22], they are often very pessimistic in general, a point stressed in [27]. We shall return to this in §4. Nevertheless, if one is interested in the worst-case, bounds such as (25), (44) and (48) are relevant.
We tried two other approaches to derive useful bounds on the norm of the residual, (32). The first aims to use the known decrease in the model objective given by Corollary 3.2 to deduce a similar bound on r k . Since x m from (4) is a critical point of (1), we have that and hence Elementary manipulation of these (see Appendix 1) then leads to the bound which exposes the dependence on the model objective decrease q(x k ) − q(x m ). Unfortunately, aside from the case where μ m = 0 for which ρ k = 0, we are not able to find a useful bound on ρ k ; ideally we would like to show that ρ k ≤ 0. Of course, even had we had succeeded in bounding ρ k , the overall bound we would have obtained via Corollary 3.2 for the q(x k ) − q(x m ) term would not have been substantially different from those given in Corollaries 3.6 and 3.8.
Our second approach tried to mimic that taken for the CG method for positive-definite linear systems. However, the argument relating the A-norm of the error to the A −1 -norm of the residual and the subsequent min-max characterization depends crucially on the definiteness of A, and thus this line of attack is not obvious for our case where H may be indefinite. Nevertheless it is easy to derive the bound on the residual (32) (see Appendix 2). Although we do not know how to derive a useful bound on ψ k (λ j + μ k ), as we see in §4, to do so might provide a much closer match to the true residual than provided by Corollaries 3.6 and 3.8.

Experiments
We consider nine examples that aim to illustrate our analysis; all nine are available as part of the CUTEst [21] set of test problems. Each is of the form for i = 1, . . . , n; in our tests we let n = 1000, and ignore the additional simple-bound constraints specified in the CUTEst examples. The first three are convex with Hessian spectra that bunch towards the bottom of the range, that are equispaced, and that coalesce towards the top of the range respectively. The second three shift the spectra of the first three downwards by roughly n/2, leading to indefinite Hessians, while the last three concave examples shift downwards by more or less n + 1/n. In Figure 1 we compare the true residual against bounds derived in Section 3 when running GALAHAD's [19] GLTR package [18] to solve the trust-region subproblem (1) on the first three test examples. Almost identical plots have been obtained for the remaining examples for the trust-region case since the residual shows that shifting the Hessian downwards by ω is compensated by shifting the multiplier upwards by the same amount once the trust-region constraint is active.
We observe that although Theorem 3.4 and Corollary 3.5 provide bounds on the residual, they may be far from sharp, especially when the spectrum is equispaced or bunched towards the top end. In particular, the bounds do not capture the superlinear behaviour of the residuals in these cases; the slopes best mimic those from the earlier iterations. This largely agrees with the observations made and conclusions drawn in the linearequation case [27]. The inferiority of the bound in Corollary 3.5 compared to that in Theorem 3.4 merely reflects the weakening that results when approximating unknown quantities (i.e. θ k,k + μ k ) by known ones (i. e. H, g, δ). By contrast, the bound provided by (53) is quantitatively far better, but, of course, this requires full knowledge of the spectrum.      produce essentially identical plots when moving from the convex via the indefinite to the concave cases.
Once again, we observe that the bounds (38), (46) may be far from sharp, and can fail to reflect the later superlinear convergence of the residuals. The behaviour is most extreme for the concave examples, and for those whose spectra coalesce at the top of their ranges. As before, (53) provides a much more faithful bound.
Finally Figures 5 illustrate the effect of changing the regularization weight, σ , when solving (2) for the example DIAGNQT, on the residual estimates. The subproblems become increasingly hard as σ shrinks, and the estimates correspondingly poorer. Indeed, the decrease predicted by (38) when σ = 100 barely indicates convergence, while although the rates for the actual residual and the prediction (53) are initially slow, they later accelerate.

Conclusions
We have derived bounds for the objective errors and gradient residuals when finding approximations to the solution of common regularized quadratic optimization problems within evolving Krylov spaces. Those for the objective errors are trivial extensions of existing ones [8], while those for the gradient residuals generalize well-known ones for conjugate-gradient methods applied to definite linear systems. Quantitatively the bounds behave just as in the conjugate-gradient case, but reflect additional complication of the subproblems, particularly the potential indefiniteness of the matrices involved.
We express some caution since in exceptional cases Krylov methods may not find the global solutions to our problems. If this is the goal, then additional precautions [8,18] that are not covered by our bounds may be necessary. If our goal is simply to find an approximation that yields a small gradient residual-and this is often the case when the subproblem occurs as component of a more general optimization calculation-then our bounds are appropriate, and provide upper bounds on the number of iterations required to achieve a given stated accuracy.
Our bounds do not reflect the 'superlinear' behaviour that is sometimes observed in practice that results from annihilation of extreme eigenvalues by the Krylov process. A more sophisticated analysis, akin to that by Axelsson, Kaporin and others [3,4], might provide this, but we have not attempted it.