next up previous contents
Next: 5.3.5 Direct (CG-MLE) Summary Up: 5.3 CG-MLE Parameter Evaluation Previous: 5.3.3.1 Nonlinear CG Direction   Contents

5.3.4 BFGS-MLE

The second technique is an alternative to CG-MLE. Instead of using CG for nonlinear minimization we use a Newton-like method which approximates the Hessian matrix without forming it explicitly. Our choice of Hessian approximations is the BFGS update, named for authors Broyden, Fletcher, Goldfarb and Shanno. We will call this LR variant BFGS-MLE. We rely on the GNU Scientific Library [8], or GSL, for our BFGS-MLE experiments. While they call their method gsl_multimin_fdfminimizer_vector_bfgs(), the usual name for such algorithms is limited-memory quasi-Newton methods. Further information on these methods can be found in Nash and Sofer [31]. For the remainder of this section BFGS will mean a limited memory quasi-Newton method with a BFGS Hessian approximation.

We will use the term BFGS to refer to both the BFGS matrix update, and to a quasi-Newton method using the BFGS matrix update. Because BFGS is a quasi-Newton method, it is somewhat similar to IRLS in the context of LR. See Section 4.3 for further explanation of the relation between IRLS and Newton's method. Where our IRLS approximates the solution to linear subproblems using expressions involving the true Hessian, BFGS approximates the Hessian and finds true solutions given that approximation. Because BFGS is applied directly to the MLE system of equations, it is reminiscent of CG-MLE. Note that all three of these algorithms will follow a different path on the MLE surface and hence can be expected to perform differently.

BFGS can be terminated when the size of the estimated gradient is near zero, or using the relative difference of the deviances. The GSL BFGS implementation returns after every line search and after every line search iteration. Computing the deviance each time is too expensive and the change in deviance is too small to make the relative change approach work well. The alternatives are to identify line search completion by examining the sequence of points generated along the line search, or to use the gradient size for termination. The GSL documentation seems to encourage gradient-based termination, and hence this is our chosen termination method.

We do not intend to investigate BFGS-MLE thoroughly in this thesis, and only wish to make a rough comparison of the time it takes to converge to a competitive AUC . To this end we have run twelve experiments on each of our six datasets. These twelve experiments vary three parameters of GSL's BFGS implementation. The first of these is the gradient-termination epsilon gradeps for which we try values 0.01, 0.001 and 0.0001. The second parameter is initalpha, the initial step-length used by the line search. We do not anticipate that BFGS will be particularly sensitive to this parameter since our optimization surface is convex, and we only try values 0.01 and 0.000001 to be certain. The final parameter is lineeps, the tolerance applied to the relative orthogonality of the search direction and the gradient for the current line-search iteration. We were unsure how to tune this parameter, and tested its effect by setting it to 0.01 or 0.000001. All of these parameters are part of the GSL BFGS minimizer. We used the same cgwindow, rrlambda and binitmean settings chosen in our CG-MLE tests.


Table: Best BFGS AUC
  ds1 imdb citeseer ds2 ds1.100pca ds1.10pca
BFGS-MLE 0.943 258 0.983 2775 0.947 755 0.716 4651 0.912 403 0.831 89
CG-MLE 0.946 152 0.983 469 0.946 98 0.724 3784 0.916 506 0.844 50


Table: Fastest BFGS time for ``good'' AUC scores.
  ds1 imdb citeseer ds2 ds1.100pca ds1.10pca
BFGS-MLE 0.940 166 0.982 2628 0.947 740 0.716 4651 0.911 396 0.826 81
CG-MLE 0.946 152 0.983 469 0.946 98 0.724 3784 0.916 506 0.844 50

Our experiments confirm that the value of initalpha had little effect on time or AUC . The same can be said of lineeps, with the exception of experiments on dataset ds1. It seems likely that a self-tuning version of lineeps similar to $ \ensuremath{\mathbf{r}}_0$-cgeps of Section 5.2.2 would perform adequately. For gradeps the value 0.01 always produced poor AUC scores, while there was no obvious pattern to whether 0.001 or 0.0001 would produce the optimal AUC or speed. Thankfully there was little variation between AUC and speed for gradeps=0.001 and gradeps=0.0001.

The best results of our experiments with BFGS-MLE are summarized in Tables 5.41 and 5.42. The first of these tables shows the best AUC value obtained over the twelve parameter combinations for each dataset. The second table shows the fastest experiment of the twelve which produces an AUC similar to the best AUC values of the first table. BFGS achieved similar AUC scores to those found by IRLS in Tables 5.19 and 5.20 and CG-MLE in Table 5.37. The fastest times are not competitive with the other algorithms. The times for BFGS experiments with gradeps=0.01 are much faster than those shown in Table 5.42, but some AUC scores in that table are approaching sub-optimal and all scores are miserable once gradeps=0.01.

Our BFGS-MLE experiments lead us to believe that proper tuning and careful attention to detail will not produce significant gains over the other LR methods we have described. We will not explore BFGS-MLE further in this thesis.


next up previous contents
Next: 5.3.5 Direct (CG-MLE) Summary Up: 5.3 CG-MLE Parameter Evaluation Previous: 5.3.3.1 Nonlinear CG Direction   Contents
Copyright 2004 Paul Komarek, komarek@cmu.edu