

6.4 Conclusions

This chapter has demonstrated that LR is a capable classifier for high-dimensional datasets. The LR algorithms were more consistent than any other classifier tested. Though the SVM classifiers made excellent initial predictions on some datasets, their performance fell behind LR in every case. Our results suggest that the modified IRLS techniques we have introduced in this thesis perform better than the traditional CG-MLE approach. It is difficult to choose between LR-CGEPS and LR-CGDEVEPS, though the latter performed exceptionally well on ds2.

It is reasonable to wonder whether the strong LR performance on the real-world datasets is due to having selected default LR values based on these same datasets. Unfortunately, we do not have additional real-world data available for further binary classification experiments. However, we have conducted some preliminary tests on the Reuters-21578 corpus, a dataset used in text classification experiments. These are multiclass experiments, unlike the two-class experiments we have presented. The nature of the classification task requires significant and subjective preprocessing. Because of these issues, and because we do not have much experience in the text classification discipline, we have not included our early results. Our macro- and micro-averaged F1 scores were on par with the best we were able to produce using SVM LINEAR and SVM RBF, and were similar to scores reported by the SVM$^{\mbox{\emph{light}}}$ author on his version of the Reuters-21578 corpus [17]. Our LR implementation computed these results in less than one-third the time of linear SVM with SVM$^{\mbox{\emph{light}}}$. This encourages us to believe that the results presented in this chapter reasonably represent the relative performance of SVMs and LR.
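For readers unfamiliar with the two averaging schemes, the following is a minimal sketch of how macro- and micro-averaged F1 differ, assuming single-label multiclass predictions. It is not the evaluation code used for the Reuters-21578 tests; the function names and the toy labels are illustrative assumptions only.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; taken to be 0.0 when all three counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts over all classes, so every record counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Toy labels, chosen only for illustration.
y_true = ["earn", "earn", "earn", "earn", "acq", "grain"]
y_pred = ["earn", "earn", "earn", "acq", "acq", "earn"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(round(macro, 3), round(micro, 3))  # 0.472 0.667
```

The rare class that is never predicted correctly pulls the macro average down sharply, while the micro average is dominated by the frequent class. This is why both numbers are usually reported together.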

It is reasonable to ask why we are concerned with fast, autonomous classifiers for use in science and business. Since the underlying research that produces large datasets is often time-consuming, what does it matter if the classifier finishes in one minute or one day? There are several answers to this question. The first is that the size of datasets is likely to grow faster than we imagine. Existing datasets often have attributes which are in some way chosen by humans, and are limited by the number of sensors on the assembly line, or the manner in which a molecule is simplified using binary features. Adding new sensors, or binary molecule descriptors, is often cheap and easy. It increases the power of the classifying model, but due care must be taken to avoid overfitting. Adding more data records is expensive when humans are running the experiments, but roboticization is decreasing those costs. We expect that both the number of attributes and the number of records will increase rapidly as adequate analysis tools become available.

When datasets are large enough that the tools available take a nontrivial amount of time to run, it might be expensive to search for a good set of algorithm parameters. For this reason we endeavor to build self-tuning algorithms, for instance with proper scaling for cgeps and cgdeveps. The use of validation sets, mentioned in Chapter 10 below, may prove even more adaptive when sufficient data is available. Of course, when the algorithm is sufficiently fast we can run cross-validations with many choices of parameters in a reasonable amount of time. Though not addressed here, this should allow brute-force tuning systems.
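A brute-force tuning system of the kind alluded to above might look like the following sketch. It is an assumption-laden illustration, not part of our software: the helper names (`cross_val_accuracy`, `grid_search`), the one-dimensional gradient-ascent LR used as a stand-in model, and the parameter names `ridge` and `steps` are all hypothetical.

```python
import itertools
import math

def cross_val_accuracy(fit, predict, X, y, params, k=5):
    """k-fold cross-validated accuracy using interleaved folds."""
    n, correct = len(X), 0
    for i in range(k):
        test = set(range(i, n, k))
        train = [j for j in range(n) if j not in test]
        model = fit([X[j] for j in train], [y[j] for j in train], **params)
        correct += sum(predict(model, X[j]) == y[j] for j in test)
    return correct / n

def grid_search(fit, predict, X, y, grid, k=5):
    """Try every parameter combination; return (best_score, best_params)."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(fit, predict, X, y, params, k)
        if best is None or score > best[0]:
            best = (score, params)
    return best

def fit_lr(X, y, ridge, steps):
    """Stand-in model: 1-D ridge-penalized LR fit by plain gradient ascent."""
    w = b = 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, t in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (t - p) * x
            gb += t - p
        w += 0.1 * (gw / len(X) - ridge * w)
        b += 0.1 * gb / len(X)
    return w, b

def predict_lr(model, x):
    w, b = model
    return 1 if w * x + b > 0 else 0

# Toy, well-separated data, for illustration only.
X = [-3.0, -2.5, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
grid = {"ridge": [0.0, 0.1, 1.0], "steps": [10, 100]}
score, params = grid_search(fit_lr, predict_lr, X, y, grid)
print(score, params)
```

The cost is the number of grid points times the number of folds times the cost of one fit, which is why this approach is only attractive when a single fit is fast.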

For those not interested in cross-validations, tuning, and large datasets, speed may still be of interest. Finding a good subset of $M$ attributes, as required for traditional interpretation of LR, can require a search over many possible models. Various exact and approximate approaches to pruning this model space have been created to reduce the amount of time required. With a fast version of LR, approximations can be improved and exact methods become more palatable.
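To give a sense of scale, the sketch below enumerates every non-empty attribute subset that an exhaustive search would have to consider. It assumes the search ranges over all subsets of the available attributes; the attribute names are hypothetical, and real subset-selection methods prune this space rather than enumerate it.

```python
from itertools import combinations

def candidate_subsets(attributes):
    """Yield every non-empty subset of the given attribute names."""
    for size in range(1, len(attributes) + 1):
        yield from combinations(attributes, size)

attrs = ["a1", "a2", "a3", "a4"]  # hypothetical attribute names
subsets = list(candidate_subsets(attrs))
print(len(subsets))  # 15, i.e. 2**4 - 1

# The count doubles with each added attribute.
for m in (10, 20, 30):
    print(m, 2**m - 1)
```

Each candidate subset implies at least one model fit, so the time per fit directly bounds how much of this space can be explored.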

Many large datasets can be represented with significantly fewer attributes than they possess. For instance, KNN performs nearly as well on the PCA-compressed dataset ds1.100pca as LR performs on the original ds1. Why should we work on high-dimensional problems when many can be transformed to low-dimensional problems? Transformations such as PCA are not free. For instance, PCA requires at least a partial singular value decomposition, which is itself tricky to implement correctly and efficiently [44]. Once the problem is transformed into only a few dimensions, it is likely that nonlinear classification boundaries will be required for optimal performance. Algorithms capable of reliably finding nonlinear boundaries, such as KNN and SVM RBF, often require more computation than linear classifiers. Recall from Tables 6.1 and 6.2 that LR with cgdeveps ran faster on ds1 than KNN or SVM RBF ran on ds1.100pca. Finally, the extra step required to project the raw dataset onto a small basis requires additional time and attention from the user. We conclude that using fast algorithms on the raw dataset is preferable to using dimensionality reduction techniques such as PCA.


Copyright 2004 Paul Komarek, komarek@cmu.edu