We have tried several LR techniques in our quest for reliability and speed. These include
|
Initial experiments with unregularized IRLS implemented using a
Cholesky decomposition and triangular back-substitution failed due to
linear dependencies among the columns of our data. The linear
dependencies cause the matrix
to be
positive semidefinite instead of positive definite, and the Cholesky
decomposition fails. We tried a column-elimination strategy to remove
dependencies. In our first attempt, we removed dependencies from
sparse datasets using an exact version of Gaussian elimination.
Later, we combined the column elimination with the Cholesky
decomposition. This Modified Cholesky technique is described in
Komarek and Moore [20]. While the latter approach was faster
than the former, it was still very slow. We also encountered
stability and overfitting problems. In particular, we encountered the
model and weight saturation described in
Section 5.2.1.1.
Stepwise logistic regression is a model-selection technique where attributes are added one-at-a-time to the model. The model quality is assessed after addition, and the addition is accepted or rejected. Stepwise LR is described in Hosmer and Lemeshow [13]. We implemented this using our Modified Cholesky technique. There are two principal problems with Stepwise LR. The first is deciding in which order attributes should be added to the model, which affects which attributes survive for the final model. The second problem is the continual reduction of performance as attributes are added, and the fact that you must estimate the LR parameters for many models. Later we attempted a tree-based divide-and-conquer strategy. The dataset attributes were recursively partitioned and many two-attribute LR problems were solved. The result was no faster than our existing techniques, and therefore we did not explore optimal methods for recombination of the partitioned results.
Instead of abandoning IRLS, we chose to explore the effects of approximate solutions to the weighted least squares linear regression subproblem. Because this only required solving a linear system, CG appeared appropriate. This solved many of our numerical and speed problems, but introduced new complexity. We also tried using nonlinear CG to find the deviance minimizers directly, as suggested by McIntosh [26] and Minka [27]. While this removed some of the IRLS complexity and instability, it required nonlinear CG and introduced different complexities. To address numerical instability problems in both methods, we tried a variety of LR model modifications as well as ridge regression or similar coefficient shrinkage techniques.
The time complexity of IRLS with CG is simply the time complexity of
CG since Equation 4.16 is a linear equation, times the
maximum number of IRLS iterations allowed. According to
Shewchuk [41] the time complexity of CG is
O
, where
is the number of nonzero entries in the linear system's matrix
. For IRLS,
. We are ignoring
the condition number
described in
Section 2.1.2. This time complexity is a reflection of
the time to compute
, whose formula is shown in
Equation 2.11. Because we can compute
as a series of sparse matrix-vector products with dimension
, it is
not necessary to incur the
cost of computing
directly. Therefore CG is
O
. A quick way to understand this
complexity is by assuming that the number of CG iterations is
essentially constant relative to the size of our data, which has
always been true in our experiments, and the most expensive part of CG
is the
computation.
As we will discuss later, the number of IRLS iterations this number is
typically small, around five or ten. If we bound the maximum number
of IRLS iterations by thirty, for example, the worst-case complexity
is
O
=
O
. Perhaps it is deceptive to hide
the number of IRLS iterations in the asymptotic complexity. On the
other hand, we show in Chapter 6 that our IRLS
implementation is empirically linear in the number of dataset rows,
when the sparsity and number of attributes is fixed. That is, the
number of IRLS iterations remains relatively constant across a wide
range of dataset sizes. Therefore, we feel it is less misleading to
ignore the small number of IRLS iterations when considering asymptotic
complexity, especially when the number of IRLS iterations is bounded.
Our combination of IRLS and CG is very similar to
truncated-Newton methods. Let
be the vector
J
, as defined in
Section 4.3, Equation 4.13. The vector
is called the Newton direction, and it is often written
as the solution to the linear system
J
. This system can be solved with an
iterative method. Because
is itself an approximation, it is
not necessary to compute
exactly and the iterative method may
be stopped early. Truncated-Newton methods get their name from this
early stopping of the iterative approximation to the Newton direction
[31]. The combination of IRLS and CG described earlier
does not approximate the Newton direction. Instead, CG is applied to
the weighted least squares regression that arises from the combination
of the current position and the Newton direction. This linear system
is different than that which describes the Newton direction alone, and
we do not know whether the behavior of a truncated-Newton method would
be identical to our IRLS and CG combination. Furthermore, IRLS is not
equivalent to the Newton-Raphson method for all generalized linear
models [25]. Applying CG to IRLS may have a
better statistical foundation than truncated-Newton methods when
generalizing beyond LR.
In the sections below are reports of stability, termination criterion and speed experiments with LR. These experiments will identify features important for successful use of LR in data mining and other high-dimensional classification tasks. The following chapter contains characterization and comparison experiments for LR, SVM, KNN and BC. In that chapter we will empirically demonstrate the utility of LR for these tasks.
Throughout the remainder of this thesis, references to IRLS imply use of CG for the weighted least squares linear regression subproblem. All references to maximum likelihood estimation, which will share the abbreviation MLE with maximum likelihood estimates, use nonlinear CG to solve the LR score equations. Exceptions to these conventions will be explicitly noted. The linear and nonlinear CG implementations are our own. We tried the CG implementation in the GNU Scientific Library [8] and found it was somewhat slower and more difficult to use.