Figure 4.1: Logistic function with one attribute.

Figure 4.2: The logistic model's binomial error variance changes with the mean.

We can think of an experiment in $X$ as a Bernoulli trial with mean
parameter $\mu(x_i)$. Thus $y_i$ is a Bernoulli random variable with mean
$\mu(x_i)$ and variance $\mu(x_i)(1 - \mu(x_i))$. It is important to note
that the variance of $y_i$ depends on the mean and hence on the experiment
$x_i$. To model the relation between each experiment $x_i$ and the expected
value of its outcome, we will use the logistic function. This function is
written as

$$\mu(x, \beta) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)} \tag{4.1}$$
where $\beta$ is the vector of parameters, and its shape may be seen
in Figure 4.1. We assume that $x_0 = 1$ so that $\beta_0$ is a constant
term, just as we did for linear regression in Section 3.1. Thus our
regression model is

$$y = \mu(x, \beta) + \epsilon \tag{4.2}$$
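where $\epsilon$ is our error term. It may be easily seen in Figure 4.2
that the error term $\epsilon$ can take only one of two values. If $y = 1$
then $\epsilon = 1 - \mu(x, \beta)$, otherwise $\epsilon = -\mu(x, \beta)$.
Since $y$ is Bernoulli with mean $\mu(x, \beta)$ and variance
$\mu(x, \beta)(1 - \mu(x, \beta))$, the error $\epsilon$ has zero mean and
variance $\mu(x, \beta)(1 - \mu(x, \beta))$. This differs from the linear
regression case, where the error and outcome had constant variance
$\sigma^2$ independent of the mean.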
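
The claims about the error term can be checked numerically. The following is a minimal sketch, not the author's implementation: the names `logistic` and `error_moments` and the parameter values in `beta` are assumptions chosen for illustration. It evaluates the logistic mean for a single attribute plus a constant term, then computes the mean and variance of the two-valued error directly from its definition.

```python
import math

def logistic(beta, x):
    """Logistic mean exp(b.x) / (1 + exp(b.x)), evaluated in a stable form."""
    z = sum(b * xi for b, xi in zip(beta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def error_moments(mu):
    """Mean and variance of the error y - mu when y is Bernoulli(mu)."""
    e1 = 1.0 - mu   # error when y = 1, which happens with probability mu
    e0 = -mu        # error when y = 0, which happens with probability 1 - mu
    mean = mu * e1 + (1.0 - mu) * e0
    var = mu * e1 ** 2 + (1.0 - mu) * e0 ** 2 - mean ** 2
    return mean, var

# Hypothetical parameters: beta[0] multiplies the constant attribute x_0 = 1.
beta = [-1.0, 2.0]
for x1 in (-2.0, 0.5, 3.0):
    mu = logistic(beta, [1.0, x1])
    mean, var = error_moments(mu)
    print(f"x1={x1:5.1f}  mu={mu:.4f}  error mean={mean:.1e}  "
          f"error var={var:.4f}  mu(1-mu)={mu * (1.0 - mu):.4f}")
```

For every attribute value the error mean is zero up to rounding and the error variance matches the Bernoulli variance, which changes with the mean rather than staying constant.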
Because the LR model is nonlinear in $\beta$, minimizing the RSS
as defined for linear regression in Section 3.1 is not
appropriate. Not only is it difficult to compute the minimum of the
RSS, but the RSS minimizer will not correspond to maximizing the
likelihood function. It is possible to transform the logistic model
into one which is linear in its parameters using the logit
function $g(\mu)$, defined as

$$g(\mu) = \ln\left(\frac{\mu}{1 - \mu}\right) \tag{4.3}$$
We apply the logit to the outcome variable and expectation function of
the original model in Equation 4.2 to create the new model

$$g(y) = g(\mu(x, \beta)) + \tilde{\epsilon} = \beta^T x + \tilde{\epsilon} \tag{4.4}$$

However, we cannot use linear regression least squares techniques for
this new model because the error $\tilde{\epsilon}$ is not Normally
distributed and the variance is not independent of the mean. One
might further observe that this transformation is not well defined for
$y = 0$ or $y = 1$. Therefore we turn to parameter estimation methods
such as maximum likelihood or iteratively re-weighted least squares.
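
The two observations about the logit can be illustrated with a short sketch. It assumes the usual definition of the logit as the log-odds, $\ln(\mu / (1 - \mu))$; the function names, including the helper `logit_is_defined`, are hypothetical. The sketch shows that the logit inverts the logistic function, so the transformed mean is linear in the parameters, and that it cannot be evaluated at the binary outcomes 0 and 1.

```python
import math

def logistic(z):
    """Logistic function of a linear score z (moderate |z| assumed)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(mu):
    """Log-odds ln(mu / (1 - mu)), defined only for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))

def logit_is_defined(value):
    """Report whether the logit can be evaluated at the given value."""
    try:
        logit(value)
        return True
    except (ValueError, ZeroDivisionError):
        return False

# The logit undoes the logistic function, recovering the linear score.
for z in (-3.0, 0.0, 1.7):
    print(f"z={z:5.2f}  logit(logistic(z))={logit(logistic(z)):.6f}")

# The raw binary outcomes fall outside the logit's domain.
for y in (0, 1):
    print(f"logit defined at y={y}: {logit_is_defined(y)}")
```

Because observed outcomes are always 0 or 1, the transformed regression cannot be fitted by applying the logit to the data directly, which is consistent with turning to likelihood-based estimation instead.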
It is important to notice that LR is a linear classifier, a result of
the linear relation of the parameters and the components of the data.
This indicates that LR can be thought of as finding a hyperplane to
separate positive and negative data points. In high-dimensional
spaces, the common wisdom is that linear separators are almost always
adequate to separate the classes.
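
The hyperplane view can be made concrete with a small sketch. It assumes a classification threshold of 0.5 on the logistic mean, which the text does not specify, and uses hypothetical parameter values and function names. Under that assumption, predicting the positive class when the mean is at least 0.5 is the same as asking on which side of the hyperplane where the linear score is zero the point lies.

```python
import math

def score(beta, x):
    """Linear score: inner product of parameters and attributes."""
    return sum(b * xi for b, xi in zip(beta, x))

def predict_by_probability(beta, x, threshold=0.5):
    """Classify by thresholding the logistic mean (threshold is an assumption)."""
    mu = 1.0 / (1.0 + math.exp(-score(beta, x)))
    return 1 if mu >= threshold else 0

def predict_by_hyperplane(beta, x):
    """Classify by the side of the hyperplane score = 0 on which x lies."""
    return 1 if score(beta, x) >= 0.0 else 0

# Hypothetical parameters; the first attribute is the constant x_0 = 1.
beta = [-1.0, 2.0, 1.0]
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0), (0.0, 2.0), (-1.0, 1.0)]
for x1, x2 in points:
    x = [1.0, x1, x2]
    print((x1, x2), predict_by_probability(beta, x), predict_by_hyperplane(beta, x))
```

On these sample points the two rules return the same label, since the logistic function is increasing and equals 0.5 exactly when the score is zero.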