Throughout this section we have evaluated many parameters for stability, termination and speed. Most of the variations on IRLS were rejected. We have retained rrlambda for regularization, cgwindow to assure continued improvement, and cgbinit was kept for cgdeveps experiments due to a speed improvement. Due to the interesting and different behaviors of the cgeps and cgdeveps termination methods, we will treat each as a different classifier for the remainder of this thesis. In the following chapter we will compare IRLS performance to that of other classification algorithms and see if our fixed parameters are adequate.
Algorithm 5 shows our final version of IRLS.
The lrmax, rrlambda, and lreps parameters are embedded in the
pseudocode to emphasize that they are fixed at their default values.
There are two important changes from Algorithm 4. The
first is at line 5.1, where we call CG
to find an approximate solution to the linear weighted least squares
subproblem. We have have written this as a function call to
CUSTOMCG, since the details vary according to whether we are
using cgeps or cgdeveps. The second change is at
line 5.2. Instead of returning the
value of
at the final iteration, we choose the value of
which had the minimum deviance among IRLS iterations.
When using cgeps, we modify CG as shown in
Algorithm 6. To prevent confusion, we
have replaced the scalar
used in the original CG of
Algorithm 1 with
. The primary difference
is the addition of cgwindow termination. This may be seen at
line 6.1 and the lines beginning
at 6.2. Note that this version of
CG ignores the
parameter sent by IRLS.
Algorithm 7 shows our modification to
CG when using cgdeveps. Again we use
in place of the
original scalar
of Algorithm 1. Another
difference is the addition of the cgwindow technique, seen at or around
lines 7.2
and 7.3. The final difference
is at line 7.4, where we choose
the best
according to deviance. When compared to the CG
algorithm used with cgeps, described in the previous paragraph, there
are two important differences. The first is the nonzero
initialization of
at
line 7.1, which makes use of the
parameter passed by our IRLS algorithm. This
is the result of the cgbinit parameter we have chosen to use with
cgdeveps. A side-effect of the nonzero
is a change in the
computation of
, shown in the following line. The second
important difference with the cgeps version of CG is the selection
of the best
at line 7.4.
The lack of CG position history in Algorithm 6 is regrettable, since it complicates comparison of cgeps and cgdeveps. However, this difference is unlikely to have had any material effect. Had CG iterations caused the final iterate to be much worse than the best iterate, then cgdecay would have had greater positive effect in the experiments of Section 5.2.1.13. Instead, those experiments suggested that the deviance grew slowly after reaching a minimum. Since optimizing the likelihood does not necessarily correspond to maximizing the AUC score [46], it is not clear that choosing a slightly non-optimal iterate should have any negative effect at all.
We have neglected many interesting possibilities for IRLS with CG. It is possible to use CG with a preconditioning matrix, which may contribute to stability and speed if some or all of the covariance matrix can be estimated [41,31,7]. We made one experiment with a simple diagonal preconditioner [41]. The result was unsatisfactory but the failure was not investigated. Alternate techniques for IRLS or CG termination could be tried, including the use of validation sets [10]. Again we made an unsuccessful foray, and did not pursue a remedy. Though CG is designed for positive definite matrices, it has performed well on a variety of ill-conditioned data matrices arising from datasets such as ds1 with linearly dependent or duplicate columns. A version of CG known as biconjugate gradient is designed for positive semidefinite matrices, and this might further accelerate IRLS. Besides biconjugate gradient, Greenbaum [7] lists several more iterative methods for solving linear systems. Truncated-Newton methods should be investigated for finding the zeros of the LR score equations, and the result compared to our combination of IRLS and CG.