On monotonic estimates of the norm of the minimizers of regularized quadratic functions in Krylov spaces

We show that the minimizers of regularized quadratic functions restricted to their natural Krylov spaces increase in Euclidean norm as the spaces expand.

We consider the problem

$$\min_{x\in\mathbb{R}^n} Q(x;\sigma,p) := \tfrac{1}{2}\, x^T H x + g^T x + \tfrac{1}{p}\,\sigma\,\|x\|^p, \qquad (1.1)$$

where σ > 0, p > 2 and ‖·‖ is the Euclidean norm; note that Q is bounded below over ℝⁿ, and all global minimizers have the same norm [7, §3]. Such methods have been advocated by a number of authors, e.g., [1-3, 8]. Here we are interested in how the norms of the estimates of the solution evolve as the Krylov process proceeds. The main utility is that these estimates provide useful predictions for the "multipliers" σ‖x‖^{p-2} as the Krylov subspace expands [9]. Our result is an analogue of that obtained by Lukšan, Matonoha and Vlček [11] for the trust-region subproblem.

By way of motivation and explanation, the solution x_* to (1.1) necessarily satisfies the first-order criticality condition ∇_x Q(x_*; σ, p) = 0, i.e.,

$$(H + \mu_* I)\, x_* = -g, \quad\text{where}\quad \mu_* = \sigma\,\|x_*\|^{p-2}. \qquad (1.2)$$

In addition, H + μ_* I is positive semi-definite at any global minimizer, and if H + μ_* I is positive definite, the minimizer is unique [3, Thm. 3.1]. If μ_* were known, the minimizer might be found simply by solving the linear system (1.2), and the skill is then in finding convergent estimates of μ_* by iteration [5]. Briefly, this is achieved by seeking the rightmost root of the secular equation

$$\sigma\,\|x(\mu)\|^{p-2} = \mu, \quad\text{where}\quad (H + \mu I)\, x(\mu) = -g, \qquad (1.3)$$

while ensuring that H + μI is positive semi-definite. This is always possible so long as g has a nonzero component in the subspace U of eigenvectors of H corresponding to the leftmost eigenvalue λ_min(H) of H, and in this case H + μ_* I is positive definite. The rare possibility that the latter does not happen is known colloquially as the "hard case" [12], and the solution to (1.1) in the hard case involves an additional component from U. We also make the connection between (1.1) and the regularized quadratic

$$Q_\nu(x) := \tfrac{1}{2}\, x^T (H + \nu I)\, x + g^T x, \qquad (1.4)$$

namely that if ν = μ_* and the hard case does not occur, then x_* minimizes Q_ν(x).
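To make the computation concrete, here is a minimal numerical sketch (ours, not from the paper; all names are illustrative) that finds the rightmost root of the secular equation in (1.3) by bisection for a small dense H, and then checks the optimality condition (1.2):

```python
import numpy as np

def solve_regularized(H, g, sigma=1.0, p=3.0):
    """Sketch: minimize (1.1) for small dense H via the secular equation
    sigma*||x(mu)||^(p-2) = mu with (H + mu I) x(mu) = -g (non-hard case)."""
    n = len(g)

    def phi(mu):
        x = np.linalg.solve(H + mu * np.eye(n), -g)
        return sigma * np.linalg.norm(x) ** (p - 2) - mu

    # phi decreases to the right of -lambda_min(H), so bracket and bisect
    lam_min = np.linalg.eigvalsh(H)[0]
    lo = max(0.0, -lam_min) + 1e-14
    hi = max(1.0, lo + 1.0)
    while phi(hi) > 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) > 0.0 else (lo, mid)
    mu = 0.5 * (lo + hi)
    return np.linalg.solve(H + mu * np.eye(n), -g), mu

H = np.diag([1.0, 2.0, 3.0])          # toy data, chosen arbitrarily
g = np.array([1.0, 1.0, 1.0])
x_star, mu_star = solve_regularized(H, g)
residual = (H + mu_star * np.eye(3)) @ x_star + g   # should vanish, cf. (1.2)
```

For p = 3 and σ = 1 the computed μ_* equals ‖x_*‖, matching μ_* = σ‖x_*‖^{p-2} in (1.2).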
This all presupposes that one can solve the linear system (1.3), and unfortunately in some applications matrix factorization is out of the question; indeed, H may only be available indirectly via Hessian-vector products Hv for given v. An attractive alternative in such cases is to find an approximation to the solution of (1.1) by restricting the search domain to a subspace or sequence of subspaces. A particularly appealing set of nested subspaces is provided by the Krylov space defined by H and g that we will define formally in the next section. Crucially, the k-th Krylov subspace, K_k(H, g), may be generated recursively through Hessian-vector products, and has an orthonormal basis V_k with the desirable properties that V_k^T g = ‖g‖ e_1, where e_1 ∈ ℝ^k is the first column of the identity matrix, and that T_k = V_k^T H V_k ∈ ℝ^{k×k} is tridiagonal. Since V_k has orthonormal columns, ‖V_k y‖ = ‖y‖, and the restriction of (1.1) to K_k(H, g) becomes

$$\min_{y\in\mathbb{R}^k} \tfrac{1}{2}\, y^T T_k\, y + \|g\|\, e_1^T y + \tfrac{1}{p}\,\sigma\,\|y\|^p, \qquad (1.5)$$

whose minimizer y_k yields the subspace solution x_k = V_k y_k. The tridiagonal structure implies that solving (1.5) is feasible via its optimality equations even when the dimension is large, since factorizing shifted tridiagonal matrices and solving linear systems involving them may be achieved in a few multiples of k floating-point operations.
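The recursion just described can be sketched as follows (a simplified Lanczos process with full reorthogonalization; this is our own illustrative code, not the paper's). It accesses H only through matrix-vector products and produces V_k and a tridiagonal T_k with V_k^T g = ‖g‖ e_1:

```python
import numpy as np

def lanczos(hess_vec, g, k):
    """Build an orthonormal basis V of K_k(H, g) and tridiagonal T = V^T H V,
    accessing H only through the product hess_vec(v) = H @ v."""
    n = len(g)
    V = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(max(k - 1, 0))
    V[:, 0] = g / np.linalg.norm(g)
    for j in range(k):
        w = hess_vec(V[:, j])                 # the only access to H
        alpha[j] = V[:, j] @ w
        w = w - alpha[j] * V[:, j]
        if j > 0:
            w = w - beta[j - 1] * V[:, j - 1]
        # full reorthogonalization for robustness (a luxury; in practice the
        # three-term recurrence alone is often used)
        w = w - V[:, :j + 1] @ (V[:, :j + 1].T @ w)
        if j < k - 1:
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return V, T

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = (A + A.T) / 2                             # random symmetric test matrix
g = rng.standard_normal(6)
V, T = lanczos(lambda v: H @ v, g, 4)
```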
Of course, we need to judge when x k is a meaningful approximation to x * as the subspaces evolve, and furthermore to solve each successive subproblem (1.5) efficiently. The former is addressed in [2,6] and requires estimates of μ k , while the latter appeals to the ideas in [5] and relies on a good starting "guess" for μ k . Thus generating a good starting guess provides motivation for our short paper.
In the next section we provide a set of lemmas leading to our main result, namely that so long as the evolving Krylov subspaces are of full dimensionality, the norms of the solution estimates s k and the corresponding "multipliers" μ k increase monotonically. We summarize, extend and discuss implications and limitations of our results in the concluding Sect. 3.

The main result
We start with four vital lemmas that we use to prove our main result. The first shows a simple property of the conjugate gradient method. We use the generic notation H ≻ 0 (resp. H ⪰ 0) to mean that the real, symmetric matrix H is positive definite (resp. positive semi-definite).

Lemma 1 Given a real symmetric matrix H ≻ 0 and real vector g, let x_k be the minimizer of the quadratic ½ x^T H x + g^T x over the Krylov space K_k(H, g), i.e., the k-th conjugate gradient iterate. Then the norms ‖x_k‖ increase monotonically with k.
Next, we state a crucial relation between the parameter ν that defines Q_ν(x) in (1.4) and the norm of the minimizer of Q_ν(x) within the k-th Krylov space: the larger ν, the smaller the norm. We define the grade of H and g, grade(H, g) ≤ n, to be the maximum dimension of the evolving Krylov spaces K_k(H, g), k = 1, …, n [10]. Our final lemma indicates that the evolving minimizers are unique.

Lemma 4 Let H, g and V_k be as in Lemma 3, and let μ_k be the rightmost root of the secular equation

$$\sigma\,\|y_k(\mu)\|^{p-2} = \mu, \quad\text{where}\quad (V_k^T H V_k + \mu I)\, y_k(\mu) = -V_k^T g. \qquad (2.3)$$

Then V_k^T H V_k + μ_k I ≻ 0 for all 1 ≤ k ≤ m := grade(H, g).

Proof
Using the Lanczos orthonormal basis, we have that V_k^T H V_k = T_k for an irreducible tridiagonal matrix T_k for k = 1, …, m. It then follows [4, Thm. 7.5.12] that the reduced vector V_k^T g = ‖g‖ e_1 is not orthogonal to the space of eigenvectors of T_k corresponding to the eigenvalue λ_min(T_k) (i.e., the "hard case" cannot occur for the reduced problem), and thus that the only permitted root μ_k of the secular equation (2.3) satisfies μ_k > -λ_min(T_k), where λ_min denotes the leftmost eigenvalue of its symmetric matrix argument [5, Sec. 2.2].
We are now in a position to state and prove our main theorem.

Theorem 1 Given a real symmetric matrix H, vector g and scalars σ > 0 and p > 2, let m = grade(H, g), and for each 1 ≤ k ≤ m let x_k = V_k y_k, where y_k solves the k-th subspace problem (1.5), and define the multipliers

$$\mu_k := \sigma\,\|x_k\|^{p-2}. \qquad (2.4)$$

Then ‖x_k‖ ≤ ‖x_ℓ‖ and μ_k ≤ μ_ℓ whenever 1 ≤ k ≤ ℓ ≤ m.

Proof Let V_j be as in the statement of Lemma 3. The vector x_j = V_j y_j is a minimizer of the j-th regularization subproblem if and only if

$$V_j^T (H + \mu_j I)\, V_j\, y_j = -V_j^T g \quad\text{with}\quad \mu_j = \sigma\,\|y_j\|^{p-2}. \qquad (2.5)$$

Since we have V_k^T (H + μ_k I) V_k ≻ 0 and V_ℓ^T (H + μ_ℓ I) V_ℓ ≻ 0 by Lemma 4, and as K_k(H + μ_k I, g) = K_k(H, g) by Lemma 2, it follows from (2.5) that x_k is also the (unique) solution of the subspace minimization problem of Q_{μ_k}(x) over K_k(H, g). Assume that μ_k > μ_ℓ, which implies that H + μ_k I is positive definite on K_ℓ(H, g). Then it follows from Lemma 1, applied to the quadratic Q_{μ_k}(x), that

$$\|x_k\| \le \|\bar{x}_\ell\|, \qquad (2.6)$$

where x̄_ℓ denotes the minimizer of Q_{μ_k}(x) over K_ℓ(H, g). But since μ_ℓ < μ_k, Lemma 3 gives that

$$\|\bar{x}_\ell\| \le \|x_\ell\|. \qquad (2.7)$$

Hence, using the definition (2.4) and combining the inequalities (2.6) and (2.7),

$$\mu_k = \sigma\,\|x_k\|^{p-2} \le \sigma\,\|x_\ell\|^{p-2} = \mu_\ell < \mu_k,$$

which is a contradiction. Thus μ_k ≤ μ_ℓ has to hold. It then follows from the definition (2.4) that ‖x_k‖ ≤ ‖x_ℓ‖.

The monotonic behaviour of the multipliers μ_k was predicted in [9, Lem. 2.6] when p = 3, but the proof suggested there relied on [11, Thm. 2.6], which appears to have a minor flaw: the proof depends on [13, Thm. 2.1], but applies this at one point to an indefinite H + μI. Lemma 1 avoids this issue, and the same result fixes the proof of [11, Thm. 2.6] that applies in the trust-region case.
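Theorem 1 can also be observed numerically. The following sketch (ours, not from the paper; the Krylov basis is built by QR of the Krylov matrix purely for brevity) solves the reduced problem (1.5) for k = 1, …, n on a random instance and records ‖x_k‖ and μ_k:

```python
import numpy as np

def reduced_solution(T, gnorm, sigma=1.0, p=3.0):
    """Rightmost root of sigma*||y(mu)||^(p-2) = mu for the reduced problem
    (T + mu I) y(mu) = -gnorm * e1, found by bisection (cf. (2.3))."""
    k = T.shape[0]
    rhs = np.zeros(k)
    rhs[0] = -gnorm

    def phi(mu):
        y = np.linalg.solve(T + mu * np.eye(k), rhs)
        return sigma * np.linalg.norm(y) ** (p - 2) - mu

    lam_min = np.linalg.eigvalsh(T)[0]
    lo = max(0.0, -lam_min) + 1e-13
    hi = max(1.0, lo + 1.0)
    while phi(hi) > 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) > 0.0 else (lo, mid)
    mu = 0.5 * (lo + hi)
    return np.linalg.solve(T + mu * np.eye(k), rhs), mu

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
H = (A + A.T) / 2
g = rng.standard_normal(6)
norms, mus = [], []
for k in range(1, 7):
    # orthonormal basis of K_k(H, g): QR of [g, Hg, ..., H^{k-1} g]
    K = np.column_stack([np.linalg.matrix_power(H, j) @ g for j in range(k)])
    V, _ = np.linalg.qr(K)
    y, mu = reduced_solution(V.T @ H @ V, np.linalg.norm(g))
    norms.append(np.linalg.norm(y))
    mus.append(mu)
```

Since the QR basis satisfies V_k^T g = ±‖g‖ e_1 and ‖V_k y‖ = ‖y‖, the recorded values coincide with the ‖x_k‖ and μ_k of Theorem 1 (the sign of the basis does not affect either norm), and both sequences increase with k.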

Comments and conclusions
We have shown that the norms of the approximations generated by well-known Krylov methods for solving the regularization problem (1.1) increase monotonically as the dimension of the Krylov spaces expands. This implies that the corresponding "multipliers" μ_k also increase, which is useful because estimates of these multipliers are crucial when solving the Krylov subproblem; in particular, as the multiplier for the k-th problem is a lower bound for the (k+1)-st, Newton-like iterations for the required root of the secular equation will converge both globally and rapidly to μ_{k+1} when started from μ_k if additionally μ_k > -λ_min(T_{k+1}) [5, §3]. In particular, Newton's method, the secant method, and methods based upon certain higher (odd-)order Taylor approximations or nonlinear rescalings of the term ‖x(μ)‖^{p-2} all converge monotonically from such a starting μ.
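As an illustration of this warm-started root finding (our sketch; the actual iterations of [5] are more sophisticated), Newton's method applied to φ(μ) = σ‖x(μ)‖^{p-2} - μ from a starting value below the root produces monotonically increasing iterates: for the default p = 3 here, φ is decreasing and convex to the right of -λ_min(H), so Newton steps from the left never overshoot.

```python
import numpy as np

def newton_secular(H, g, sigma=1.0, p=3.0, mu0=0.0, iters=50):
    """Newton's method on phi(mu) = sigma*||x(mu)||^(p-2) - mu, started from
    a lower estimate mu0 (e.g. the previous multiplier mu_k)."""
    n = len(g)
    mu, history = mu0, [mu0]
    for _ in range(iters):
        M = H + mu * np.eye(n)
        x = np.linalg.solve(M, -g)
        nx = np.linalg.norm(x)
        dnx = -(x @ np.linalg.solve(M, x)) / nx       # d||x(mu)||/dmu
        phi = sigma * nx ** (p - 2) - mu
        dphi = sigma * (p - 2) * nx ** (p - 3) * dnx - 1.0
        mu_next = mu - phi / dphi
        history.append(mu_next)
        converged = abs(mu_next - mu) < 1e-14
        mu = mu_next
        if converged:
            break
    return mu, history

H = np.diag([1.0, 2.0, 3.0])                          # toy data
g = np.ones(3)
mu_star, hist = newton_secular(H, g)
```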
Knowledge of the monotonic nature of these quantities is also important when deriving convergence bounds [6] for such methods. We warn readers that in exceptional circumstances, namely when g is orthogonal to the eigenspace corresponding to the leftmost eigenvalue of H and σ is not large enough, the global minimizer of (1.1) will not lie in K_m(H, g), and μ_m will underestimate the optimal multiplier. This (zero-probability) possibility is often referred to as the "hard case" [3, §6.1], [12], and might be viewed as an unavoidable defect of Krylov methods, despite their popularity.
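A tiny example (our construction) makes the warning concrete: with g orthogonal to the leftmost eigenvector of H and σ small, the Krylov multiplier μ_m falls short of the optimal multiplier μ_* = -λ_min(H).

```python
import numpy as np

# Hard-case illustration: H = diag(-2, 1, 2), g has no component along the
# leftmost eigenvector e_1, and sigma = 1 is small enough that the secular
# equation has no root to the right of -lambda_min(H) = 2.
H = np.diag([-2.0, 1.0, 2.0])
g = np.array([0.0, 1.0, 1.0])
sigma, p = 1.0, 3.0

# Every Krylov space K_k(H, g) stays inside span{e_2, e_3}, so the Krylov
# multiplier mu_m solves the secular equation of the 2x2 reduced problem.
T, gr = H[1:, 1:], g[1:]

def phi(mu):
    y = np.linalg.solve(T + mu * np.eye(2), -gr)
    return sigma * np.linalg.norm(y) ** (p - 2) - mu

lo, hi = 1e-12, 4.0                    # bracket of the reduced root, then bisect
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if phi(mid) > 0.0 else (lo, mid)
mu_m = 0.5 * (lo + hi)

mu_star = 2.0                          # = -lambda_min(H) in the hard case
```

Here σ‖x(-λ_min)‖^{p-2} < -λ_min, so the global minimizer needs an extra component along e_1 and μ_* = 2, while the Krylov process delivers only μ_m ≈ 0.7.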
The main result here may trivially be extended for Krylov methods to