We would like to have a set of
linearly independent direction
vectors
and scalars
for which
Because we cannot know
without knowing the solution, we cannot
solve Equation 2.7 in any amount of time.
However, we can compute
, as shown in
Equation 2.6. Left multiplying
Equation 2.7 by
produces
If we assume we can find a set of linearly independent directions
which are mutually
-orthogonal, we may solve
for
by left-multiplying Equation 2.8
by
, since
![]() |
(2.9) | ||
| (2.10) |
Not only does there exist a set of linearly independent mutually
-orthogonal direction vectors
not in the null space
of
, but each vector can be computed incrementally. Let
. This allows computation of
and
, and all remaining direction vectors can be computed as
needed via
It is worth emphasizing that each direction vector is a linear
combination of the current residual and the previous direction vector.
Because the direction vectors are linearly independent, a simple
induction implies that
is a linear combination of
and
| (2.15) |
Figure 2.2 shows the conjugate directions used by CG
when started from the same place as Steepest Descent was in
Figure 2.1. Because CG will converge within
iterations, this is a worst-case starting point for CG requiring
steps. CG does so much better than Steepest Descent in our example
because it assumes, correctly, that
is a quadratic form. If
is not a quadratic form then adjustments must be made. The resulting
algorithm is often referred to as nonlinear CG, and is
discussed briefly in Section 2.2. Predictably,
the CG algorithm discussed so far is called linear CG.
We have omitted demonstration of the existence of a linearly
independent set of conjugate directions which are not in the null space
of
, and also omitted many interesting properties of CG. The
value of
in Equation 2.11 would be the same had
we defined it as the step length for minimizing line-search in
direction
starting at
. The minimizer is
. This is where the gradient, and hence the residual, is
perpendicular to
. Thus
is perpendicular to
. Equation 2.14 then implies that each new
residual is perpendicular to the space already searched. It also
follows that the residuals are mutually orthogonal and that
is minimized over
. These properties help explain the efficiency per
iteration of CG.
One may also observe that the computations needed in each CG iteration
depend only on
and the previous iteration, with the exception
of
as shown in Equation 2.11. In fact
can also be written in terms of the previous iteration
[41,31]. Therefore we only need to store a
few vectors and
. Because only matrix-vector computations are
needed, storage and computation may exploit the structure or sparsity
of
.
It was demonstrated above that CG should arrive at the exact solution
in no more than
iterations when
is
.
This property does not hold when rounded arithmetic is used, for
instance when CG is performed on a computer. Errors during
computations can reduce the conjugacy of the direction vectors and
decrease efficiency so that many more than
iterations are
required. [31]
The convergence rate for CG is not entirely straightforward. In Shewchuk [41] this rate is described as
![]() |
(2.16) |
There are many possibilities for deciding when to terminate CG besides the relative size of the error. Furthermore the relation between the true error and the computed error is complex, and the subject of ongoing numerical analysis research. Termination alternatives include minimizing the absolute size of the error, or relative or absolute size of the residual, or context-dependent quantities such as the relative change of the likelihood when computing a maximum likelihood estimate.
Finding a good termination criteria is very important to minimize
wasted time while ensuring sufficient optimality. When we discuss our
use of CG later in this thesis, the most important issue will be
proper termination. Because the worst-case time complexity of CG
depends on the spectrum of
, our CG termination methods should
to adapt to each dataset. We will also examine how the starting point
affects our methods in the one case where an informed choice
is possible. Overall, we aim hide all of these issues from the users
of our algorithms.