next up previous contents
Next: 8. Conclusions Up: 4 Conclusion Previous: 4 Conclusion   Contents


7. Related Work

The earliest work we are aware of exploring the application of CG to LR is McIntosh [26]. In his thesis, McIntosh observes

The increasing power and decreasing price of small computers, especially ``personal computers'', has made them increasingly popular in statistical analysis. The day may not be too far off when every statistician has on his or her desktop computing power on a par with the large mainframe computers of 15 or 20 years ago.
McIntosh goes on to extol the limited memory required by CG compared to QR algorithms. The author modified GLIM [6] statistics software to use CG, and explored its application to several generalized linear models including LR. It appears that McIntosh replaced IRLS entirely, using nonlinear CG to create what we describe as CG-MLE:
...using a version of GLIM in which the standard fitting algorithm had been replaced by the Hestenes-Stiefel algorithm. ...Since fitting generalized linear models involves the minimization of a function that is not necessarily quadratic, the algorithms used were not exactly algorithms A.1 and A.3. The only difference is that the gradient $ g(\beta)$ is computed directly rather than recursively.
McIntosh's applications were traditional statistics problems with carefully chosen attributes and a few tens of data points. He comments on several numerical issues. This is the most detailed and comprehensive examination of CG-MLE and linear models to date.

The use of CG to directly maximize the LR likelihood has been reported more recently in several papers. Minka [27] compares the effectiveness of several numerical methods for maximizing the LR likelihood in a short technical report. He reports FLOP counts on three small, dense synthetic datasets. The author finds that CG is often very effective, but also states that Newton-Raphson may be more suitable for correlated data. Our approach to ridge regression for CG-MLE is similar to Minka's. This technical report describes traditional IRLS, coordinate-wise Newton-Raphson, nonlinear CG, annealed coordinate-wise MAP on the dual of the LR likelihood, the quasi-Newton Böhning's method with back-substitution, and two versions of iterative scaling. Implementation details are not discussed, except to say that Matlab was used.

Zhang et al. [48] describes a modified LR algorithm for use in text classification. Their approach yields an iterative approximation to an SVM. The authors use CG to find the MLE, and state that in their experiments the Hestenes-Stiefel direction updates for nonlinear CG were more efficient than Polak-Ribiére or Fletcher-Reeves. The authors claim that their algorithm is more efficient than SVM$ ^{\mbox{\emph{light}}}$ on ``large-scale text categorization collections''.

In the text categorization, information retrieval, document filtering and document routing literatures, many papers apply LR in some form. Besides the previously mentioned work by Zhang et al. [48], there is Zhang and Oles [49]. The authors discuss the reputation of LR as slow and numerically unstable in text categorization, referring to several papers. They suggest that the failures reported in Schutze et al. [40] could be the result of that paper's lack of regularization in general, and

[a]nother reason could be their choice of the Newton-Raphson method of numerical optimization, which in our experience could become unstable, especially without regularization.
The authors use the same regularization techniques that we have already discussed, related to ridge regression. They develop a common framework for describing several linear classifiers including linear regression, LR, and SVMs. They create specialized solvers for each of these using their framework and a modified Gauss-Seidel method. Intriguingly, they make an aside comment that
instead of Algorithm 1 [modified Gauss-Seidel], one may also apply Newton's method directly to (12) [the general framework for linear classifiers], as suggested in Hastie and Tibshirani (1990) and used in Schütze et al. (1995). The resulting linear system from Newton's method can be solved by an iterative solver such as the Gauss-Seidel iteration or CG.
which makes one wonder whether they are referring to something like our IRLS and CG combination. However, they continue
However, a properly implemented line search method, which can be complicated, is required to guarantee convergence. Such a line search method can also result in small step sizes, which slows down the convergence.
Thus they do not appear to be suggesting solving the weighted least square subproblem of IRLS with CG, since the application of CG to any linear system does not require a line search or small step sizes. Among their conclusions is that a properly regularized implementation of LR is competitive with SVMs for classification of documents, though they remark it is somewhat slower. This statement runs contrary to the results of this thesis, suggesting that they did not anticipate or properly evaluate the combination of IRLS and CG.

We are not aware of any other literature closely related to our work. Hastie et al. [10] comments that ``Logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables in explaining the outcome.'' Our work in applying LR to data mining and high-dimensional classification problems was first discussed in Komarek and Moore [20]. We have made some initial text classification experiments, similar to those of Yang and Liu [47]. These were promising in that our scores were competitive with those of SVM$ ^{\mbox{\emph{light}}}$, which is often considered state-of-the-art for text classification; and further our IRLS implementation ran nearly four times faster than SVM$ ^{\mbox{\emph{light}}}$. Our IRLS implementation participated in a short survey of approaches for solving the link classification problem described in Section 5.1.3.1. This paper, Kubica et al. [22], demonstrated that LR could be reasonably applied to large multiclass problems requiring as many models as there were independent variables.


next up previous contents
Next: 8. Conclusions Up: 4 Conclusion Previous: 4 Conclusion   Contents
Copyright 2004 Paul Komarek, komarek@cmu.edu