next up previous contents
Next: 9. Contributions Up: 4 Conclusion Previous: 7. Related Work   Contents


8. Conclusions

In this thesis we analyzed two approaches to fitting LR models for high-dimensional data for classification and data mining. The data was restricted to sparse binary or dense real-valued attributes, and binary outputs. The first approach used an iteratively re-weighted least squares algorithm, accelerated by CG, and regularized through an application of ridge regression. We believe the result is a novel LR fitting procedure, and we proposed two variants with different termination criteria.

The second approach was the traditional likelihood maximization with CG, regularized through a typical application of ridge regression. This combination is not novel, but we believe we are the first to make a large, careful empirical study of it.

For both approaches we discussed many possible modifications. These ideas were explored in a thorough, empirical setting. We eliminated all but a small set of parameterized modifications which enhanced stability, optimality and speed. For each of the chosen modifications a default value was chosen via analysis of a large collection of experiments. The values chosen offered robust performance under a wide variety of conditions and removed any need for tuning.

Our three LR variants were tested on synthetic datasets to establish the relation between performance and the number of dataset rows, the number of dataset attributes, the sparsity of the dataset, and the level of interdependence among the attributes. The same experiments were conducted on linear and radial basis function Support Vector Machines, an accelerated version of $ k$-nearest neighbor with two different values of $ k$, and a fast implementation of Bayes' classifier. The LR algorithms had predictive behavior similar to the other classifiers, with the exception of $ k$-nearest neighbor which performed poorly on these synthetic datasets. The LR algorithms were faster than the Support Vector Machines and the accelerated $ k$-nearest neighbor algorithm.

The set of eight classifiers just described were tested on six real-world datasets from life sciences data mining and link analysis applications. These datasets had up to one million attributes and up to one-hundred thousand rows. The sparsity varied from one-hundred percent down to two-thousandths of one percent, and the total number of nonzero dataset entries varied from thirty million to one-quarter million. The output class distribution was highly skewed. The tests were ten-fold cross-validations, scored using the ``area under curve'' (AUC ) metric to assess ranking ability. For all but the smallest datasets, the LR algorithms had AUC scores as high or higher than the competing algorithms. The LR algorithms also ran faster than all of the other algorithms, with the exception of Bayes' classifier which was rendered unusable by poor AUC scores.

We have demonstrated that LR is a capable and fast tool for data mining and high-dimensional classification problems. Our novel IRLS fitting procedures outperformed the traditional combination of CG and LR maximum likelihood estimation, as well as state-of-the-art techniques such as Support Vector Machines. This superior performance was demonstrated on many synthetic and real-world datasets.


next up previous contents
Next: 9. Contributions Up: 4 Conclusion Previous: 7. Related Work   Contents
Copyright 2004 Paul Komarek, komarek@cmu.edu