next up previous contents
Next: 5.1.3 Datasets Up: 5.1 Preliminaries Previous: 5.1.1 Why LR?   Contents


5.1.2 Our Approach

We have tried several LR techniques in our quest for reliability and speed. These include

The relationship of these methods to MLE and IRLS are depicted in Figure 5.1, and are described below.

Figure 5.1: Tree showing overview of our LR implementation efforts. As this chapter progresses, more information will be added to this tree. The left-most branch lists several initial implementations that were rejected.
\includegraphics{figures/treemap-methods.eps}

Initial experiments with unregularized IRLS implemented using a Cholesky decomposition and triangular back-substitution failed due to linear dependencies among the columns of our data. The linear dependencies cause the matrix $ \ensuremath{\mathbf{A}} = \ensuremath{\mathbf{X}}^T \ensuremath{\mathbf{W}} \ensuremath{\mathbf{X}}$ to be positive semidefinite instead of positive definite, and the Cholesky decomposition fails. We tried a column-elimination strategy to remove dependencies. In our first attempt, we removed dependencies from sparse datasets using an exact version of Gaussian elimination. Later, we combined the column elimination with the Cholesky decomposition. This Modified Cholesky technique is described in Komarek and Moore [20]. While the latter approach was faster than the former, it was still very slow. We also encountered stability and overfitting problems. In particular, we encountered the model and weight saturation described in Section 5.2.1.1.

Stepwise logistic regression is a model-selection technique where attributes are added one-at-a-time to the model. The model quality is assessed after addition, and the addition is accepted or rejected. Stepwise LR is described in Hosmer and Lemeshow [13]. We implemented this using our Modified Cholesky technique. There are two principal problems with Stepwise LR. The first is deciding in which order attributes should be added to the model, which affects which attributes survive for the final model. The second problem is the continual reduction of performance as attributes are added, and the fact that you must estimate the LR parameters for many models. Later we attempted a tree-based divide-and-conquer strategy. The dataset attributes were recursively partitioned and many two-attribute LR problems were solved. The result was no faster than our existing techniques, and therefore we did not explore optimal methods for recombination of the partitioned results.

Instead of abandoning IRLS, we chose to explore the effects of approximate solutions to the weighted least squares linear regression subproblem. Because this only required solving a linear system, CG appeared appropriate. This solved many of our numerical and speed problems, but introduced new complexity. We also tried using nonlinear CG to find the deviance minimizers directly, as suggested by McIntosh [26] and Minka [27]. While this removed some of the IRLS complexity and instability, it required nonlinear CG and introduced different complexities. To address numerical instability problems in both methods, we tried a variety of LR model modifications as well as ridge regression or similar coefficient shrinkage techniques.

The time complexity of IRLS with CG is simply the time complexity of CG since Equation 4.16 is a linear equation, times the maximum number of IRLS iterations allowed. According to Shewchuk [41] the time complexity of CG is O$ \left( m \right)$, where $ m$ is the number of nonzero entries in the linear system's matrix $ \mathbf{A}$. For IRLS, $ \ensuremath{\mathbf{A}} = \ensuremath{\mathbf{X}}^T \ensuremath{\mathbf{W}} \ensuremath{\mathbf{X}}$. We are ignoring the condition number $ \kappa$ described in Section 2.1.2. This time complexity is a reflection of the time to compute $ \alpha_i$, whose formula is shown in Equation 2.11. Because we can compute $ \ensuremath{\mathbf{A}} \ensuremath{\mathbf{d_k}}$ as a series of sparse matrix-vector products with dimension $ M$, it is not necessary to incur the $ M^2$ cost of computing $ \ensuremath{\mathbf{A}} \ensuremath{\mathbf{x}}$ directly. Therefore CG is O$ \left( m \right)$. A quick way to understand this complexity is by assuming that the number of CG iterations is essentially constant relative to the size of our data, which has always been true in our experiments, and the most expensive part of CG is the $ \alpha_i$ computation.

As we will discuss later, the number of IRLS iterations this number is typically small, around five or ten. If we bound the maximum number of IRLS iterations by thirty, for example, the worst-case complexity is O$ \left( 30 \times m \right)$ = O$ \left( m \right)$. Perhaps it is deceptive to hide the number of IRLS iterations in the asymptotic complexity. On the other hand, we show in Chapter 6 that our IRLS implementation is empirically linear in the number of dataset rows, when the sparsity and number of attributes is fixed. That is, the number of IRLS iterations remains relatively constant across a wide range of dataset sizes. Therefore, we feel it is less misleading to ignore the small number of IRLS iterations when considering asymptotic complexity, especially when the number of IRLS iterations is bounded.

Our combination of IRLS and CG is very similar to truncated-Newton methods. Let $ \mathbf{p}$ be the vector J$ _{\ensuremath{\mathbf{h}}}(\ensuremath{\hat{\ensuremath{\mathbf{\beta}}}}_i)^{-1} \ensuremath{\mathbf{h}}(\ensuremath{\hat{\ensuremath{\mathbf{\beta}}}}_i)$, as defined in Section 4.3, Equation 4.13. The vector $ \mathbf{p}$ is called the Newton direction, and it is often written as the solution to the linear system J$ _{\ensuremath{\mathbf{h}}}(\ensuremath{\hat{\ensuremath{\mathbf{\beta}}}}_i)
\...
...}} = -\ensuremath{\mathbf{h}}(\ensuremath{\hat{\ensuremath{\mathbf{\beta}}}}_i)$. This system can be solved with an iterative method. Because $ \hat{\ensuremath{\mathbf{\beta}}}$ is itself an approximation, it is not necessary to compute $ \mathbf{p}$ exactly and the iterative method may be stopped early. Truncated-Newton methods get their name from this early stopping of the iterative approximation to the Newton direction [31]. The combination of IRLS and CG described earlier does not approximate the Newton direction. Instead, CG is applied to the weighted least squares regression that arises from the combination of the current position and the Newton direction. This linear system is different than that which describes the Newton direction alone, and we do not know whether the behavior of a truncated-Newton method would be identical to our IRLS and CG combination. Furthermore, IRLS is not equivalent to the Newton-Raphson method for all generalized linear models [25]. Applying CG to IRLS may have a better statistical foundation than truncated-Newton methods when generalizing beyond LR.

In the sections below are reports of stability, termination criterion and speed experiments with LR. These experiments will identify features important for successful use of LR in data mining and other high-dimensional classification tasks. The following chapter contains characterization and comparison experiments for LR, SVM, KNN and BC. In that chapter we will empirically demonstrate the utility of LR for these tasks.

Throughout the remainder of this thesis, references to IRLS imply use of CG for the weighted least squares linear regression subproblem. All references to maximum likelihood estimation, which will share the abbreviation MLE with maximum likelihood estimates, use nonlinear CG to solve the LR score equations. Exceptions to these conventions will be explicitly noted. The linear and nonlinear CG implementations are our own. We tried the CG implementation in the GNU Scientific Library [8] and found it was somewhat slower and more difficult to use.


next up previous contents
Next: 5.1.3 Datasets Up: 5.1 Preliminaries Previous: 5.1.1 Why LR?   Contents
Copyright 2004 Paul Komarek, komarek@cmu.edu