Among the strengths of LR is the interpretability of the parameter
estimates. Because the expectation function
is the mean of a Bernoulli random variable,
is a probability. A
deeper analysis of fitted LR models is possible if the model's
attributes were chosen carefully to avoid certain types of
interactions and correlations. Hosmer and Lemeshow [13] and
Myers et al. [30] explain in detail how to create LR models and
measure the significance of a fitted LR model. Hosmer and Lemeshow [13]
covers in depth the interpretation of a fitted model's coefficients in
a wide variety of settings.
We are interested in LR for data-mining and high-dimensional
classification. In these settings it is unlikely that the dataset's
attributes were carefully chosen, or that one can sift through all
combinations of hundreds-of-thousands of attributes to avoid harmful
interactions and correlations. For this reason we must assume that
our datasets will have correlated or even identical variables. Our
experiment matrix
will have linear dependencies and
will only be positive semidefinite. The columns of
will be badly scaled. The outcomes
used for estimating
may have as many false positives as true positives.
Furthermore, all of these shortcomings should be handled without human
intervention.
For the reasons stated above, we will not explore model adequacy, significance or interpretability. Among the many interesting statistics available for LR model analysis, we are only interested in the loss function. This function will be part of the termination criterion for our parameter estimation methods described in Sections 4.2 and 4.3. Both LR and linear regression are members of a family of models called generalized linear models, and the loss functions for members of this family are defined as a scaled likelihood ratio known as the deviance. In particular, the deviance is defined as
DEV ![]() |
(4.17) |
| DEV |
(4.18) | ||
![]() |
(4.19) |
Since the deviance is just the negative log-likelihood, using either
quantity to terminate iterative parameter estimation techniques will
result in the same
. Because deficiencies in the data matrix
could cause the LR score equations shown in
Equation 4.12 to have multiple solutions
,
waiting until
converges to terminate iterative methods may
not be efficient. We prefer to terminate when the relative difference
of the deviance
DEV
DEV
DEV
is sufficiently small. Determining what
constitutes sufficiently small will be discussed later in this thesis.
LR is susceptible to the same parameter estimation problems caused by
correlated attributes as is linear regression
[13,27]. In Section 3.1 we
discussed a technique called ridge regression to prevent overly large
parameter estimates. A similar technique can be used when estimating
for LR. The simplest way to apply a ridge regression
penalty to LR maximum likelihood estimation is to add it to the loss
function, which is the deviance. This corresponds to subtracting it
from the log-likelihood function, allowing generic convergence
criteria to be applied for numerical techniques. When using IRLS for
parameter estimation there is a weighted least squares linear
regression subproblem to which traditional ridge regression penalties
may be applied. Unfortunately this is not equivalent to performing
the IRLS derivation of Section 4.3 with the penalty
applied to the LR log-likelihood function. Instead of deriving a
custom IRLS which incorporates the penalty in the likelihood function,
we will use the simpler approach of using traditional ridge regression
when solving the weighted least squares subproblem.
Now that we have defined a termination criterion and described methods of applying ridge regression, we may summarize our two methods of find the LR MLE. Algorithm 3 describes the direct maximum likelihood estimation approach of Section 4.2, while Algorithm 4 describes the IRLS approach of of Section 4.3. Lines 3.1 and 4.1 in these algorithms include the regularization methods described above. While the direct method appears much simpler, it can be difficult to find a good iterative method for the nonlinear objective function shown in line 3.1. Despite the additional lines of computation for IRLS, the weighted least squares subproblem in line 4.1 is simply a linear system of equations.
We will be discussing various implementations of LR throughout this thesis. We have created a tree-figure depicting the relationship of our methods and modifications. Figure 4.3 shows the beginning of this tree. Currently there is only a mention of LR and the nonlinear deviance minimization problem, followed by a choice of generic nonlinear methods on the right and IRLS on the left. Additional nodes will be added to this tree in in Chapter 5.
|