In this thesis we analyzed two approaches to fitting LR models for high-dimensional data for classification and data mining. The data was restricted to sparse binary or dense real-valued attributes, and binary outputs. The first approach used an iteratively re-weighted least squares algorithm, accelerated by CG, and regularized through an application of ridge regression. We believe the result is a novel LR fitting procedure, and we proposed two variants with different termination criteria.
The second approach was the traditional likelihood maximization with CG, regularized through a typical application of ridge regression. This combination is not novel, but we believe we are the first to make a large, careful empirical study of it.
For both approaches we discussed many possible modifications. These ideas were explored in a thorough, empirical setting. We eliminated all but a small set of parameterized modifications which enhanced stability, optimality and speed. For each of the chosen modifications a default value was chosen via analysis of a large collection of experiments. The values chosen offered robust performance under a wide variety of conditions and removed any need for tuning.
Our three LR variants were tested on synthetic datasets to establish
the relation between performance and the number of dataset rows, the
number of dataset attributes, the sparsity of the dataset, and the
level of interdependence among the attributes. The same experiments
were conducted on linear and radial basis function Support Vector
Machines, an accelerated version of
-nearest neighbor with two
different values of
, and a fast implementation of Bayes'
classifier. The LR algorithms had predictive behavior similar to the
other classifiers, with the exception of
-nearest neighbor which
performed poorly on these synthetic datasets. The LR algorithms were
faster than the Support Vector Machines and the accelerated
-nearest neighbor algorithm.
The set of eight classifiers just described were tested on six real-world datasets from life sciences data mining and link analysis applications. These datasets had up to one million attributes and up to one-hundred thousand rows. The sparsity varied from one-hundred percent down to two-thousandths of one percent, and the total number of nonzero dataset entries varied from thirty million to one-quarter million. The output class distribution was highly skewed. The tests were ten-fold cross-validations, scored using the ``area under curve'' (AUC ) metric to assess ranking ability. For all but the smallest datasets, the LR algorithms had AUC scores as high or higher than the competing algorithms. The LR algorithms also ran faster than all of the other algorithms, with the exception of Bayes' classifier which was rendered unusable by poor AUC scores.
We have demonstrated that LR is a capable and fast tool for data mining and high-dimensional classification problems. Our novel IRLS fitting procedures outperformed the traditional combination of CG and LR maximum likelihood estimation, as well as state-of-the-art techniques such as Support Vector Machines. This superior performance was demonstrated on many synthetic and real-world datasets.