This chapter has demonstrated that LR is a capable classifier for high-dimensional datasets. The LR algorithms were more consistent than any other classifier tested. Though the SVM classifiers made excellent initial predictions on some datasets, their performance fell behind that of LR in every case. Our results suggest that the modified IRLS techniques we have introduced in this thesis outperform the traditional CG-MLE approach. It is difficult to choose between LR-CGEPS and LR-CGDEVEPS, though the latter performed exceptionally well on ds2.
It is reasonable to wonder whether the strong LR performance on the
real-world datasets is due to having selected default LR values based
on these same datasets. Unfortunately, we do not have additional
real-world data available for further binary classification
experiments. However, we have conducted some preliminary tests on the
Reuters-21578 corpus, a dataset used in text classification
experiments. These are multiclass experiments, unlike the two-class
experiments we have presented. The nature of the classification task
requires significant and subjective preprocessing. Because of these
issues, and because we do not have much experience in the text
classification discipline, we have not included our early results.
Our macro- and micro-averaged F1 scores were on par with the best we
were able to produce using SVM LINEAR and SVM RBF, and were similar to
scores reported by the SVM author on his version of the Reuters-21578
corpus [17]. Our LR implementation computed these results in less than
one-third the time of linear SVM with SVM. This encourages us to
believe that the results
presented in this chapter reasonably represent the relative
performance of SVMs and LR.
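For readers unfamiliar with the two averages, the following minimal sketch shows how macro- and micro-averaged F1 differ. It is not the code used for our Reuters-21578 tests. It assumes each record carries exactly one true label and one predicted label, which simplifies the multi-topic nature of that corpus, and the function names are hypothetical.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when all three counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the record is not p
            fn[t] += 1  # true class t was missed
    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes first, so every record counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


macro, micro = macro_micro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Macro-averaging is sensitive to performance on rare classes, while micro-averaging is dominated by the common ones, which is why both are usually reported.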
It is reasonable to ask why we are concerned with fast, autonomous classifiers for use in science and business. Since the underlying research that produces large datasets is often time-consuming, what does it matter if the classifier finishes in one minute or one day? There are several answers to this question. The first is that the size of datasets is likely to grow faster than we imagine. Existing datasets often have attributes which are in some way chosen by humans, and are limited by the number of sensors on the assembly line, or the manner in which a molecule is simplified using binary features. Adding new sensors, or binary molecule descriptors, is often cheap and easy. It increases the power of the classifying model, but due care must be taken to avoid overfitting. Adding more data records is expensive when humans are running the experiments, but roboticization is decreasing those costs. We expect that both the number of attributes and the number of records will increase rapidly as adequate analysis tools become available.
When datasets are large enough that the tools available take a nontrivial amount of time to run, it might be expensive to search for a good set of algorithm parameters. For this reason we endeavor to build self-tuning algorithms, for instance with proper scaling for cgeps and cgdeveps. The use of validation sets, mentioned in Chapter 10 below, may prove even more adaptive when sufficient data is available. Of course, when the algorithm is sufficiently fast we can run cross-validations with many choices of parameters in a reasonable amount of time. Though not addressed here, this should allow brute-force tuning systems.
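A brute-force tuning system of the kind alluded to above might be sketched as follows. This is an illustration under stated assumptions, not a system we have built: the helper names, the interleaved fold assignment, and the toy threshold "classifier" are all hypothetical, and in practice `fit` would wrap a fast LR training routine.

```python
from itertools import product


def k_folds(n, k):
    """Yield (train_idx, test_idx) pairs; fold i holds indices i, i+k, i+2k, ..."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test


def grid_search_cv(X, y, grid, fit, score, k=5):
    """Cross-validate every parameter combination; return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, test in k_folds(len(X), k):
            model = fit([X[j] for j in train], [y[j] for j in train], **params)
            fold_scores.append(score(model, [X[j] for j in test], [y[j] for j in test]))
        mean = sum(fold_scores) / len(fold_scores)
        if mean > best_score:  # ties keep the earlier combination
            best_params, best_score = params, mean
    return best_params, best_score


# Toy stand-ins (hypothetical): the "classifier" is a fixed threshold on one attribute.
def fit_threshold(X, y, threshold):
    return lambda x: x > threshold


def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)


X = list(range(20))
y = [x >= 10 for x in X]
best, acc = grid_search_cv(X, y, {"threshold": [4.5, 9.5, 14.5]},
                           fit_threshold, accuracy, k=5)
```

The cost is the number of parameter combinations times the number of folds times the cost of one training run, which is why such a search is only practical when a single run is fast.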
For those not interested in cross-validations, tuning, and large
datasets, speed may still be of interest. Finding a good subset of
attributes, as required for traditional interpretation of LR, can
require a search over many possible models. Various exact and
approximate approaches to pruning this model space have been created
to reduce the amount of time required. With a fast version of LR,
approximations can be improved and exact methods become more
palatable.
Many large datasets can be represented with significantly fewer attributes than they possess. For instance, KNN performs nearly as well on the PCA-compressed dataset ds1.100pca as LR performs on the original ds1. Why should we work on high-dimensional problems when many can be transformed to low-dimensional problems? Transformations such as PCA are not free. For instance, PCA requires at least a partial singular value decomposition, which is itself tricky to implement correctly and efficiently [44]. Once the problem is transformed into only a few dimensions, it is likely that nonlinear classification boundaries will be required for optimal performance. Algorithms capable of reliably finding nonlinear boundaries, such as KNN and SVM RBF, often require more computation than linear classifiers. Recall from Tables 6.1 and 6.2 that LR with cgdeveps ran faster on ds1 than KNN or SVM RBF ran on ds1.100pca. Finally, the extra step required to project the raw dataset onto a small basis requires additional time and attention from the user. We conclude that using fast algorithms on the raw dataset is preferable to using dimensionality reduction techniques such as PCA.
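To make the extra projection step concrete, the sketch below finds only the leading principal direction of a small dense dataset and projects the records onto it. It uses plain power iteration rather than the partial singular value decomposition discussed above, it is not the procedure used to build ds1.100pca, and the function names are hypothetical. Even this simplest case requires centering the data and repeated passes over every record.

```python
from math import sqrt


def leading_component(rows, iters=50):
    """Return (means, v): the column means and the leading principal direction."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Arbitrary unit start vector; assumed not orthogonal to the leading direction.
    v = [1.0 / sqrt(d)] * d
    for _ in range(iters):
        # One pass over the data computes X v, a second accumulates X^T (X v).
        scores = [sum(a * b for a, b in zip(r, v)) for r in centered]
        w = [sum(s * r[j] for s, r in zip(scores, centered)) for j in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate case: no variance along the current direction
        v = [x / norm for x in w]
    return means, v


def project(rows, means, v):
    """Center each record and return its coordinate along direction v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]


rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
means, v = leading_component(rows)
coords = project(rows, means, v)
```

Extracting 100 components, as in ds1.100pca, would additionally require deflation or a block method, and centering a sparse dataset destroys its sparsity unless handled implicitly.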