Setting the first derivative of a quadratic form equal to zero creates a system of linear equations. For this reason we refer to CG as linear CG when it is applied to a quadratic form. For differentiable convex functions in general the first derivative is not a linear system, and in this context we use nonlinear CG.
Let
be a quadratic form and
be an arbitrary differentiable
convex function. Both have a symmetric positive semidefinite Hessian,
which for
is constant and for
is not constant. Just like
linear CG, nonlinear CG attempts to find the minimum of
by
evaluating the gradient, computing a conjugate direction and finding
the minium in that direction. However there are many differences
between the two methods. Everywhere we used the matrix
for
linear CG, we must now use
or make a different
computation. Where we were able to compute the minimum point on the
line search in direction
in Equation 2.11, we
must now make a true search for the minimum. All methods we might
have tried to compute the linear combination of
Equation 2.12 would have resulted in the same
coefficient
seen in Equation 2.13, but for
nonlinear CG there are several reasonable formulas for
.
These changes make nonlinear CG more expensive than linear CG. We can
no longer define the residual
as
, and
must rely on the more general definition
. This
definition requires gradient evaluations, but avoids computation of
the Hessian. There are many ways to search for the minimum along a
direction
, but almost every method requires evaluation or
approximation of the gradient at several points along the line
to find the optimal value for
. Some line searches require computation or estimation of
, which may be expensive.
For this thesis we use the Secant method for line searches in
nonlinear CG. This method uses directional derivatives at two
endpoints along the line to fit a quadratic and moves the next line
search iteration to the minimum of that quadratic. The actual slope
is computed at this point and is used to determine which endpoint
should be eliminated. Since
is quadratic the true minimum will
always be found between the endpoints, so long as the endpoints are
initialized with the minimizer between them. This condition itself
requires a search along the negative gradient, looking for a point
with positive directional derivative.
Terminating the line search can be done in many ways. Many methods
have been suggested such as checking the relative change of
at the
line search iterates, using the relative difference of the slopes,
observing the slopes themselves, or even stopping after a fixed number
of iterations [41,31]. We have tried all of
these approaches except the last. In our application of nonlinear CG
to the LR maximum likelihood equations, we have selected a combination
of relative and absolute slope checking.
There are several ways to compute conjugate search directions. In all
cases a linear combination between the gradient and the previous
search direction is used, but the value of the coefficient
in
Equation 2.12 changes. Three methods of computing
are
| (2.17) | |||
![]() |
(2.18) | ||
| (2.19) |
With nonlinear CG the Hessian changes with every iteration. While we
can ensure
and
are conjugate using the
Hessian at
, we cannot in general ensure that
is conjugate to
. During initial iterations of
nonlinear CG we can assume that we are not near
, and hence
large steps are taken which create large differences in the Hessian.
Once we are near
we might assume that the Hessian will not
change much between iterations, and conjugacy will again become
meaningful. For this reason restarts are periodically
employed.
A nonlinear CG restart at iteration
simply means setting
as if we were starting from scratch with initial
guess
. There are several popular methods for deciding when
to restart. The simplest criterion derives from the observation that
there cannot be more than
conjugate directions in
,
and hence a restart should occur after every
iterations. Another
popular restart criterion is Powell restarts, which checks the
orthogonality of successive residuals. A restart is triggered when
![]() |
(2.20) |
. Note that Powell
restarts can be easily incorporated into Polak-Ribiére direction updates by
setting
. We will refer to
this combination of direction update and restart criterion as
``modified Polak-Ribiére''. [41,31]
Algorithm 2 summarizes nonlinear CG. A line search method must be chosen for 2.1 of this algorithm, to replace the step length computation 1.1 of Algorithm 1. The direction update function must be selected from several possibilities, and a criteria must be chosen for when to perform a CG restart. Finally, note that the residual requires a generic gradient computation.
All of the issues discussed here will be raised again in Section 5.3, where we apply nonlinear CG to the convex but non-quadratic LR likelihood function described in Section 4.2. We will continue to use the line search and restart criteria described here, and will compare performance using all four direction update formulas.