Many of the parameters in Section 5.2.1 affect termination and optimality, but were aimed at enhancing stability and providing a robust foundation for the remaining parameters. The parameters discussed in this section are primarily concerned with achieving optimal scores without wasted computation. These include the parameters lreps, cgeps and cgdeveps. There are an additional two parameters, lrmax and cgmax, used to set upper limits on the number of CG and IRLS iterations. These parameters are summarized in Table 5.3.
IRLS is terminated when the relative difference of the deviance falls below lreps or the number of IRLS iterations reaches lrmax. See Section 4.4 for details. Deviance-based termination is common, with the relative difference suggested by McIntosh [26]. Because relative deviances are scale-independent, it is not difficult to find a value for lreps which works well for most datasets.
Several options for CG termination were discussed in 2.1.2. Our focus is understanding whether and when LR is suitable for classification, especially in data mining applications, and we have examined very few termination criteria. The cgeps parameter is used for the traditional residual termination criteria. For linear CG inside IRLS, the residual vector is
x2html_comment_mark>511
figures/auc_lreps_cgeps_dsadec.ps figures/time_lreps_cgeps_dsb.ps figures/time_lreps_cgeps_dsadec.ps
|
The top half of Figure 5.5 shows how the AUC score varies as a function of lreps and cgeps for datasets ds2 and ds1.10pca. Each line on these plots represents a different lreps value, and the horizontal axis represents cgeps. For ds2 there is little difference between lreps lines so long as cgeps is not made too large. For ds1.10pca there is again little difference between lreps values, and no change for cgeps values. The ds1.10pca plot demonstrates that an absurdly low value of lreps, such as 0.5, can be somewhat harmful. This lreps sensitivity is dwarfed by the effect of too large a cgeps for ds2, and we will be very careful choosing a default cgeps value. The bottom half of Figure 5.5 illustrates how speed changes with lreps and cgeps. As one would expect, the shortest times occurring for the largest lreps and cgeps values. These results are representative of experiments on all six datasets used in our previous experiments.
tex2html_comment_mark>514
figures/time_cgeps_all.ps
|
The largest, and hence fastest, ``safe'' value of lreps appears to be 0.1. To be somewhat conservative, when using cgeps to terminate CG we will choose lreps=0.05. Choosing a value for cgeps is more difficult. Figure 5.6 shows how AUC and speed vary with cgeps in all six datasets used in previous experiments with lreps set to 0.05. Note that the vertical ``Time'' axis is logarithmic. It appears that any small value for cgeps will produce the same AUC score, but we incur a time penalty if cgeps is too small. This time penalty is severe for ds1, imdb and ds2. The largest sensible cgeps values for ds1, with respect to AUC , are quite different than those for imdb and ds2. Any conservative choice of cgeps will cause unduly slow results for ds1, which leads to a reconsideration of cgeps scaling.
The scaling used in every previous experiment multiplied the reported
cgeps by
, the number of attributes in the dataset. This choice
was made using the naive argument that the CG residual bound should
depend on its dimensionality. Perhaps the most obvious change is to
multiply by the square root of the dimensionality, since a Euclidean
norm is used. Another option is multiplying cgeps by the value of
the initial CG residual before any IRLS iterations, so that problems
which start with a large residual terminate CG iterations sooner. A
similar termination criterion is suggested in Shewchuk [41] and
discussed briefly in Section 2.1.2. When the binitmean parameter is disabled, the LR parameter vector
is
initialized to zero and the initial CG residual
of
Equation 5.1 simplifies to
.
tex2html_comment_mark>515
figures/time_cgeps_grad_all.ps
|
Figure 5.6 suggests that the most flattering
value of the
-scaled cgeps is 0.01 for ds1, 0.0005 for imdb,
and somewhere near 0.0001 for ds2. The table below
shows how these values scale under the number of attributes
,
the square root of the number of attributes
, and the
Euclidean norm of the initial residual vector
.
| Dataset |
|
|
||||
| ds1 | 6,348 | 0.01 | 79.67 | 0.79778 | 83,227 | 0.00076 |
| imdb | 685,569 | 0.0005 | 827.99 | 0.41399 | 92,503 | 0.00370 |
| ds2 | 1,143,054 | 0.0001 | 1069.14 | 0.10691 | 171,694 | 0.00066 |
We conclude that a
-cgeps of 0.005 is adequate for all
datasets, while 0.001 is safe for all datasets. Choosing
-cgeps=0.001 increases the time significantly. Nonetheless
we prefer
-cgeps
for autonomous
applications and will use this setting for later experiments in this
thesis.
|
|
ds1 | imdb | citeseer | ds2 | ds1.100pca | ds1.10pca | ||||||
| 0.05 | 0.925 | 23 | 0.965 | 139 | 0.928 | 36 | 0.673 | 843 | 0.889 | 22 | 0.788 | 4 |
| 0.01 | 0.947 | 42 | 0.970 | 193 | 0.947 | 41 | 0.717 | 1590 | 0.912 | 26 | 0.846 | 6 |
| 0.005 | 0.948 | 47 | 0.983 | 279 | 0.947 | 37 | 0.720 | 2372 | 0.917 | 31 | 0.846 | 5 |
| 0.001 | 0.949 | 86 | 0.983 | 370 | 0.946 | 44 | 0.720 | 3417 | 0.918 | 44 | 0.846 | 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0005 | 0.949 | 93 | 0.983 | 429 | 0.946 | 45 | 0.720 | 3548 | 0.918 | 46 | 0.846 | 7 |
| 0.0001 | 0.949 | 106 | 0.983 | 589 | 0.945 | 58 | 0.720 | 3838 | 0.918 | 50 | 0.846 | 7 |
| 0.00005 | 0.949 | 112 | 0.983 | 638 | 0.945 | 61 | 0.720 | 3936 | 0.918 | 52 | 0.846 | 7 |
| 0.00001 | 0.949 | 117 | 0.983 | 752 | 0.945 | 70 | 0.720 | 4211 | 0.918 | 56 | 0.846 | 7 |
The size of the CG residual
is influenced by many factors
including the number of positive dataset rows and the sparsity of the
data, and hence there are many reasonable scalings for cgeps. For
completeness we have performed experiments for other scale factors
including the number of nonzero elements
, the average number of
nonzero elements per dataset row, the dataset sparsity
, the
``dense'' size
, and the number of positive dataset rows. While
-scaling is both logically and empirically attractive, the
need to scale cgeps motivates exploration of CG termination criteria
which do not depend on the CG residual.
One alternative to residual-based termination methods for CG is to use
a context-sensitive measure such as the deviance. The cgdeveps parameter tests the relative difference of the deviance between CG
iterations, just as lreps tests this quantity for IRLS iterations.
While the CG residual is a natural part of CG computations, the LR
deviance is not. The deviance requires
O
time to compute.
However, as seen with lreps the value of cgdeveps is adaptive, and
may be more suitable for an autonomous classifier. We do not allow
cgeps and cgdeveps to be used simultaneously, though nothing
precludes this possibility.
x2html_comment_mark>544
figures/auc_lreps_cgdeveps_jlee.ps figures/time_lreps_cgdeveps_dsb.ps figures/time_lreps_cgdeveps_jlee.ps
|
The top half of Figure 5.8 illustrates AUC versus cgdeveps for ds2 and citeseer. The imdb dataset behaves something like citeseer on a smaller scale, while the others mimic ds2. The improvement in AUC for citeseer with looser cgdeveps comes from avoiding the overfitting we witnessed in earlier experiments with this dataset. Examining the deviances we see that the overfit experiments have reduced deviance an extra 1.5%, unlike the 500+% reductions seen in Table 5.7, and we are not concerned with the small AUC penalty in these experiments. It is possible that different stability parameter defaults would be appropriate for cgdeveps, but for this work we will keep our current stability parameters and avoid over-tight lreps and cgdeveps values.
Figure 5.8 suggests that an lreps of 0.1 is adequate and 0.05 is safe so long as cgdeveps is sufficiently small. As with cgeps, larger lreps and cgdeveps cause earlier termination and hence better speed. The bottom half of Figure 5.8 shows the relation between lreps, cgdeveps and time for datasets ds2 and citeseer. For these datasets and the others, the cost of lreps=0.05 is always very similar to the cost of lreps=0.1. Preferring to error on the side of safety, we will use lreps=0.05 for all cgdeveps experiments.
tex2html_comment_mark>556
figures/time_cgdeveps_all.ps
|
The relation between AUC , time and cgdeveps when lreps=0.05 is
shown for all six datasets in Figure 5.9.
Note that the time axis is logarithmic. The most interesting
cgdeveps values are 0.01, 0.005 and 0.001.
Table 5.20 shows AUC and times for these values.
If not for dataset ds1.10pca, setting cgdeveps=0.01 would be adequate,
perhaps even safe. While cgdeveps=0.001 appears safe for everyone,
it comes with a 10% to 30% speed penalty over cgdeveps=0.005.
Though 0.005 produces only adequate results for ds1.10pca, the
neighboring decrease in AUC is only 3%. Given these trade-offs and
since this only seems to affect one of our datasets, we accept the
risk of data mining on-the-edge and choose cgdeveps=0.005. We ran
each experiment from the cgdeveps=0.005 row of
Table 5.20, just as we did in the
-cgeps=0.001 row of Table 5.19. There was
little deviance and the original experiments were representative of
the mean.
The paragraphs above describe two very different strategies for terminating CG iterations. Because of the sizeable difference, and because of the challenges and trade-offs made when selecting default values for cgeps and cgdeveps, we will show results for both cgeps and cgdeveps throughout most of our later experiments.
Our experiments agree with the assertion in McCullagh and Nelder [25] that few IRLS iterations are needed to achieve optimality. Table 5.21 shows the maximum number of LR iterations for folds of the experiments reported above using default parameter values. For these experiments the maximums and minimums were always similar. In both the cgeps and cgdeveps experiments, no more than thirteen LR iterations were ever used. Therefore it is reasonable to set lrmax at thirty to prevent runaway experiments.
Table 5.21 also shows CG iteration maximums, taken within and then over the folds. From these counts we see that the number of CG iterations never exceeded fifty-five for our experiments with the default parameters. Because of the variability between CG iteration counts, we feel that setting cgmax to two-hundred is a reasonable restraint. While the counts in Table 5.21 are the maximum counts over the folds of each experiment, the minimum counts are not much different. We have added lreps, both CG termination parameters, lrmax and cgmax to our LR description tree, shown in Figure 5.10.
|