next up previous contents
Next: 3. Linear Regression Up: 2. Conjugate Gradient Previous: 2.1.2 The Right Iterative   Contents


2.2 Minimizing Arbitrary Convex Functions With Conjugate Gradient

Setting the first derivative of a quadratic form equal to zero creates a system of linear equations. For this reason we refer to CG as linear CG when it is applied to a quadratic form. For differentiable convex functions in general the first derivative is not a linear system, and in this context we use nonlinear CG.

Let $ f$ be a quadratic form and $ g$ be an arbitrary differentiable convex function. Both have a symmetric positive semidefinite Hessian, which for $ f$ is constant and for $ g$ is not constant. Just like linear CG, nonlinear CG attempts to find the minimum of $ g$ by evaluating the gradient, computing a conjugate direction and finding the minium in that direction. However there are many differences between the two methods. Everywhere we used the matrix $ \mathbf{A}$ for linear CG, we must now use $ g''(\ensuremath{\mathbf{x}})$ or make a different computation. Where we were able to compute the minimum point on the line search in direction $ \ensuremath{\mathbf{d_i}}$ in Equation 2.11, we must now make a true search for the minimum. All methods we might have tried to compute the linear combination of Equation 2.12 would have resulted in the same coefficient $ \beta_i$ seen in Equation 2.13, but for nonlinear CG there are several reasonable formulas for $ \beta_i$.

These changes make nonlinear CG more expensive than linear CG. We can no longer define the residual $ \mathbf{r}$ as $ \ensuremath{\mathbf{b}} - \ensuremath{\mathbf{A}} \ensuremath{\mathbf{x}}$, and must rely on the more general definition $ \ensuremath{\mathbf{r}} = -g'(x)$. This definition requires gradient evaluations, but avoids computation of the Hessian. There are many ways to search for the minimum along a direction $ \ensuremath{\mathbf{d_i}}$, but almost every method requires evaluation or approximation of the gradient at several points along the line $ \ensuremath{\mathbf{x_i}} - \hat{\alpha} \ensuremath{\mathbf{d_i}}$ to find the optimal value for $ \alpha$. Some line searches require computation or estimation of $ g''$, which may be expensive.

For this thesis we use the Secant method for line searches in nonlinear CG. This method uses directional derivatives at two endpoints along the line to fit a quadratic and moves the next line search iteration to the minimum of that quadratic. The actual slope is computed at this point and is used to determine which endpoint should be eliminated. Since $ g$ is quadratic the true minimum will always be found between the endpoints, so long as the endpoints are initialized with the minimizer between them. This condition itself requires a search along the negative gradient, looking for a point with positive directional derivative.

Terminating the line search can be done in many ways. Many methods have been suggested such as checking the relative change of $ g$ at the line search iterates, using the relative difference of the slopes, observing the slopes themselves, or even stopping after a fixed number of iterations [41,31]. We have tried all of these approaches except the last. In our application of nonlinear CG to the LR maximum likelihood equations, we have selected a combination of relative and absolute slope checking.

There are several ways to compute conjugate search directions. In all cases a linear combination between the gradient and the previous search direction is used, but the value of the coefficient $ \beta$ in Equation 2.12 changes. Three methods of computing $ \beta$ are


$\displaystyle \beta_i^{\mbox{(FR)}}$ $\displaystyle =$ $\displaystyle \ensuremath{\mathbf{r_i}}^T \ensuremath{\mathbf{r_i}} / \ensuremath{\mathbf{r_{i-1}}}^T\ensuremath{\mathbf{r_{i-1}}}$ (2.17)
$\displaystyle \beta_i^{\mbox{(PR)}}$ $\displaystyle =$ $\displaystyle \ensuremath{\mathbf{r_i}}^T \left( \ensuremath{\mathbf{r_i}} - \e...
..._{i-1}}} \right) /
\ensuremath{\mathbf{r_{i-1}}}^T\ensuremath{\mathbf{r_{i-1}}}$ (2.18)
$\displaystyle \beta_i^{\mbox{(HS)}}$ $\displaystyle =$ $\displaystyle \ensuremath{\mathbf{r_i}}^T \left( \ensuremath{\mathbf{r_i}} - \e...
...i-1}^T \left( \ensuremath{\mathbf{r_i}} - \ensuremath{\mathbf{r_{i-1}}} \right)$ (2.19)

The first of these, $ \beta_i^{(\mbox{FR})}$, is known as Fletcher-Reeves and is the same formula used in linear CG. The second, $ \beta_i^{(\mbox{PR})}$, is called Polak-Ribiére and the remaining formula is Hestenes-Stiefel. Use of Fletcher-Reeves guarantees convergence so long $ \ensuremath{\mathbf{x_0}}$ is sufficiently close to $ \ensuremath{\mathbf{x^*}}$. Though functions have been devised for which Polak-Ribiére and Hestenes-Stiefel perform very poorly, use of these formulas often improves performance [41,31].

With nonlinear CG the Hessian changes with every iteration. While we can ensure $ \ensuremath{\mathbf{d_i}}$ and $ \ensuremath{\mathbf{d_{i+1}}}$ are conjugate using the Hessian at $ \ensuremath{\mathbf{x_{i+1}}}$, we cannot in general ensure that $ \ensuremath{\mathbf{d_i}}$ is conjugate to $ \ensuremath{\mathbf{d_{i+2}}}$. During initial iterations of nonlinear CG we can assume that we are not near $ \ensuremath{\mathbf{x^*}}$, and hence large steps are taken which create large differences in the Hessian. Once we are near $ \ensuremath{\mathbf{x^*}}$ we might assume that the Hessian will not change much between iterations, and conjugacy will again become meaningful. For this reason restarts are periodically employed.

A nonlinear CG restart at iteration $ i$ simply means setting $ \ensuremath{\mathbf{d_i}}
= -g'(\ensuremath{\mathbf{x_i}})$ as if we were starting from scratch with initial guess $ \ensuremath{\mathbf{x_i}}$. There are several popular methods for deciding when to restart. The simplest criterion derives from the observation that there cannot be more than $ n$ conjugate directions in $ \mathbb{R}^n$, and hence a restart should occur after every $ n$ iterations. Another popular restart criterion is Powell restarts, which checks the orthogonality of successive residuals. A restart is triggered when

$\displaystyle \frac{\ensuremath{\mathbf{r_i}}^T \ensuremath{\mathbf{r_{i-1}}}}{\ensuremath{\mathbf{r_{i-1}}}^T\ensuremath{\mathbf{r_{i-1}}}} > \nu$ (2.20)

for some small value of $ \nu$. One choice of $ \nu$ commonly used is $ \vert\vert \ensuremath{\mathbf{r_i}} \vert\vert^2 / \vert\vert \ensuremath{\mathbf{r_{i-1}}}^2 \vert\vert$, which corresponds to restarting when $ \beta_i^{\mbox{(PR)}} < 0$. Note that Powell restarts can be easily incorporated into Polak-Ribiére direction updates by setting $ \beta_i = \max(\beta_i^{\mbox{(PR)}}, 0)$. We will refer to this combination of direction update and restart criterion as ``modified Polak-Ribiére''. [41,31]

Algorithm 2 summarizes nonlinear CG. A line search method must be chosen for 2.1 of this algorithm, to replace the step length computation 1.1 of Algorithm 1. The direction update function must be selected from several possibilities, and a criteria must be chosen for when to perform a CG restart. Finally, note that the residual requires a generic gradient computation.

All of the issues discussed here will be raised again in Section 5.3, where we apply nonlinear CG to the convex but non-quadratic LR likelihood function described in Section 4.2. We will continue to use the line search and restart criteria described here, and will compare performance using all four direction update formulas.

.0
\begin{algorithm}
% latex2html id marker 4405
[tbp] \SetKwInOut{Input}{input} \S...
...:= -g'(\ensuremath{\mathbf{x_{i+1}}})$ \;
\par
$i := i+1$ \;
}\end{algorithm}


next up previous contents
Next: 3. Linear Regression Up: 2. Conjugate Gradient Previous: 2.1.2 The Right Iterative   Contents
Copyright 2004 Paul Komarek, komarek@cmu.edu