We will use the term BFGS to refer to both the BFGS matrix update, and to a quasi-Newton method using the BFGS matrix update. Because BFGS is a quasi-Newton method, it is somewhat similar to IRLS in the context of LR. See Section 4.3 for further explanation of the relation between IRLS and Newton's method. Where our IRLS approximates the solution to linear subproblems using expressions involving the true Hessian, BFGS approximates the Hessian and finds true solutions given that approximation. Because BFGS is applied directly to the MLE system of equations, it is reminiscent of CG-MLE. Note that all three of these algorithms will follow a different path on the MLE surface and hence can be expected to perform differently.
BFGS can be terminated when the size of the estimated gradient is near zero, or using the relative difference of the deviances. The GSL BFGS implementation returns after every line search and after every line search iteration. Computing the deviance each time is too expensive and the change in deviance is too small to make the relative change approach work well. The alternatives are to identify line search completion by examining the sequence of points generated along the line search, or to use the gradient size for termination. The GSL documentation seems to encourage gradient-based termination, and hence this is our chosen termination method.
We do not intend to investigate BFGS-MLE thoroughly in this thesis, and only wish to make a rough comparison of the time it takes to converge to a competitive AUC . To this end we have run twelve experiments on each of our six datasets. These twelve experiments vary three parameters of GSL's BFGS implementation. The first of these is the gradient-termination epsilon gradeps for which we try values 0.01, 0.001 and 0.0001. The second parameter is initalpha, the initial step-length used by the line search. We do not anticipate that BFGS will be particularly sensitive to this parameter since our optimization surface is convex, and we only try values 0.01 and 0.000001 to be certain. The final parameter is lineeps, the tolerance applied to the relative orthogonality of the search direction and the gradient for the current line-search iteration. We were unsure how to tune this parameter, and tested its effect by setting it to 0.01 or 0.000001. All of these parameters are part of the GSL BFGS minimizer. We used the same cgwindow, rrlambda and binitmean settings chosen in our CG-MLE tests.
| ds1 | imdb | citeseer | ds2 | ds1.100pca | ds1.10pca | |||||||
| BFGS-MLE | 0.943 | 258 | 0.983 | 2775 | 0.947 | 755 | 0.716 | 4651 | 0.912 | 403 | 0.831 | 89 |
| CG-MLE | 0.946 | 152 | 0.983 | 469 | 0.946 | 98 | 0.724 | 3784 | 0.916 | 506 | 0.844 | 50 |
| ds1 | imdb | citeseer | ds2 | ds1.100pca | ds1.10pca | |||||||
| BFGS-MLE | 0.940 | 166 | 0.982 | 2628 | 0.947 | 740 | 0.716 | 4651 | 0.911 | 396 | 0.826 | 81 |
| CG-MLE | 0.946 | 152 | 0.983 | 469 | 0.946 | 98 | 0.724 | 3784 | 0.916 | 506 | 0.844 | 50 |
Our experiments confirm that the value of initalpha had little
effect on time or AUC . The same can be said of lineeps, with the
exception of experiments on dataset ds1. It seems likely that a
self-tuning version of lineeps similar to
-cgeps of
Section 5.2.2 would perform adequately. For
gradeps the value 0.01 always produced poor AUC scores, while
there was no obvious pattern to whether 0.001 or 0.0001 would produce
the optimal AUC or speed. Thankfully there was little variation
between AUC and speed for gradeps=0.001 and gradeps=0.0001.
The best results of our experiments with BFGS-MLE are summarized in Tables 5.41 and 5.42. The first of these tables shows the best AUC value obtained over the twelve parameter combinations for each dataset. The second table shows the fastest experiment of the twelve which produces an AUC similar to the best AUC values of the first table. BFGS achieved similar AUC scores to those found by IRLS in Tables 5.19 and 5.20 and CG-MLE in Table 5.37. The fastest times are not competitive with the other algorithms. The times for BFGS experiments with gradeps=0.01 are much faster than those shown in Table 5.42, but some AUC scores in that table are approaching sub-optimal and all scores are miserable once gradeps=0.01.
Our BFGS-MLE experiments lead us to believe that proper tuning and careful attention to detail will not produce significant gains over the other LR methods we have described. We will not explore BFGS-MLE further in this thesis.