We have found that CG-MLE requires many of the same stability parameters, with nearly identical default values, as IRLS did. Notably different was the ease with which the termination epsilon cgeps was found in Section 5.3.2. Perhaps the most interesting results are in Section 5.3.3, which examined the effects of different CG direction update formulas and the BFGS-MLE.
Our final CG-MLE method is summarized in Algorithm 8. Starting at line 8.1, we set our nonlinear CG parameters as described above. Line 8.2 shows how the binitmean parameter changes the value of the initial parameter estimate. Several lines related to cgwindow are present. As with our IRLS implementation, shown in Algorithm 5, we return the parameter estimate which minimized the deviance. This may be seen in line 8.3. As in Algorithm 5, we have embedded the parameters cgmax, cgeps, and cgwindow to emphasize that we have fixed their values.
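The bookkeeping around cgmax, cgwindow, and the returned estimate can be illustrated with a minimal Python sketch. It is not Algorithm 8 itself. It assumes that cgwindow bounds the number of consecutive iterations allowed without an improvement in deviance, and that the optimizer is available as a stream of (estimate, deviance) pairs. The function name `best_iterate` and its arguments are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

def best_iterate(
    iterates: Iterable[Tuple[List[float], float]],
    cgmax: int,
    cgwindow: int,
) -> Tuple[Optional[List[float]], float]:
    """Return the estimate with the smallest deviance seen so far.

    Assumption: iteration stops after cgmax iterations, or once cgwindow
    consecutive iterations have failed to improve the best deviance.
    """
    best_beta: Optional[List[float]] = None
    best_dev = float("inf")
    since_improvement = 0
    for i, (beta, dev) in enumerate(iterates):
        if i >= cgmax:
            break
        if dev < best_dev:
            # New best deviance: remember this estimate and reset the window.
            best_beta, best_dev = list(beta), dev
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= cgwindow:
                break
    return best_beta, best_dev

# Hypothetical deviance trace; each estimate is a one-element vector.
trace = [([float(k)], d) for k, d in
         enumerate([10.0, 8.0, 9.0, 8.5, 7.0, 7.2, 7.3, 7.1, 6.0])]
beta, dev = best_iterate(trace, cgmax=100, cgwindow=3)
```

In this trace the loop stops after three consecutive non-improving iterations following the deviance of 7.0, so the later value 6.0 is never reached. The estimate returned is the one that achieved the best deviance, not the last one visited.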
It is worth emphasizing the difference between CG-MLE and IRLS with cgdeveps. Both use the relative difference as a termination criterion. However, IRLS is a different nonlinear optimization method from nonlinear CG. The first IRLS iteration starts from the same place as the first CG-MLE iteration. In IRLS, linear CG is applied to a linear weighted least squares problem. In CG-MLE, nonlinear CG is applied to the score equations. Linear CG in the first IRLS iteration is very unlikely to terminate at the same place as nonlinear CG applied to the LR score equations, though linear CG should terminate with far fewer computations. At that point there are more IRLS iterations to run, but CG-MLE is finished. While both algorithms should ultimately arrive at similar parameter estimates, there is no reason to believe they will take the same path to get there, or require the same amount of computation. That both algorithms apply the same termination criterion to their version of CG is at best a superficial similarity.
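For concreteness, a relative-difference test of the kind both methods share might look like the sketch below. The exact normalization is an assumption (here the change in deviance is divided by the current deviance), and the function name `converged` is hypothetical. The test is identical in both settings. What differs is the sequence of deviances it is applied to: the linear CG iterates inside one IRLS step, or the nonlinear CG iterates of CG-MLE.

```python
def converged(prev_dev: float, cur_dev: float, eps: float) -> bool:
    """Relative-difference termination test on successive deviances.

    Assumption: the relative difference is |prev - cur| / |cur|.
    """
    denom = abs(cur_dev)
    if denom == 0.0:
        # Avoid division by zero; treat an unchanged zero deviance as converged.
        return prev_dev == cur_dev
    return abs(prev_dev - cur_dev) / denom < eps

small_change = converged(100.0, 99.9999, eps=1e-3)  # tiny relative change
large_change = converged(100.0, 90.0, eps=1e-3)     # roughly 11% change
```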
We do not have many new ideas to propose for optimizing the LR MLE using numerical methods. Regularization was already suggested by Zhang and Oles [49]. Minka [27] made a brief comparison of several numerical methods, including a quasi-Newton algorithm called Böhning's method, in a short technical report. Minka mentioned the need for regularization, and in two of his three datasets found that CG outperformed other algorithms. Zhang et al. [48] preferred Hestenes-Stiefel direction updates when using CG for nonlinear convex optimization, which is somewhat at odds with the conclusions of this chapter. We do not have an explanation for this contradiction. In fact, the most promising line of exploration opened by this section is the possibility that Fletcher-Reeves updates, usually ignored for nonlinear CG, might work well in our environment.
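As a reminder of how the direction updates named above differ, the following sketch gives the standard textbook forms of the Fletcher-Reeves, Polak-Ribière, and Hestenes-Stiefel coefficients. It assumes a minimization convention, in which g is the gradient of the objective and the new direction is d_new = -g_new + beta * d_old. The function names are hypothetical and the code is not taken from the implementation discussed in this chapter.

```python
from typing import List, Sequence

def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def fletcher_reeves(g_new: Sequence[float], g_old: Sequence[float]) -> float:
    # beta_FR = (g_new . g_new) / (g_old . g_old)
    return dot(g_new, g_new) / dot(g_old, g_old)

def polak_ribiere(g_new: Sequence[float], g_old: Sequence[float]) -> float:
    # beta_PR = g_new . (g_new - g_old) / (g_old . g_old)
    y = [a - b for a, b in zip(g_new, g_old)]
    return dot(g_new, y) / dot(g_old, g_old)

def hestenes_stiefel(g_new: Sequence[float], g_old: Sequence[float],
                     d_old: Sequence[float]) -> float:
    # beta_HS = g_new . (g_new - g_old) / (d_old . (g_new - g_old))
    y = [a - b for a, b in zip(g_new, g_old)]
    return dot(g_new, y) / dot(d_old, y)

def next_direction(g_new: Sequence[float], d_old: Sequence[float],
                   beta: float) -> List[float]:
    # d_new = -g_new + beta * d_old
    return [-g + beta * d for g, d in zip(g_new, d_old)]

# Hypothetical gradients and previous direction, chosen so the three differ.
g_old, g_new, d_old = [2.0, 0.0], [1.0, 2.0], [-2.0, 0.0]
beta_fr = fletcher_reeves(g_new, g_old)
beta_pr = polak_ribiere(g_new, g_old)
beta_hs = hestenes_stiefel(g_new, g_old, d_old)
d_fr = next_direction(g_new, d_old, beta_fr)
```

The three coefficients coincide for a quadratic objective with exact line searches. They can differ, as in this example, once the objective is not quadratic or the line search is inexact, which is the situation for the LR likelihood.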