Suppose a linear regression model for average daily humidity contains
attributes for the day-of-month and the temperature. Suppose further
that in the data these attributes are correlated, perhaps
because the temperature rose one degree each day data was collected.
When computing the expectation function
$\mu(\mathbf{x}, \boldsymbol{\beta}) = \boldsymbol{\beta}^T \mathbf{x}$, overly large positive estimates of
$\beta_{\text{temperature}}$ could be transparently offset by large
negative values for $\beta_{\text{day-of-month}}$, since these
attributes change together. More complicated examples of this
phenomenon could include any number of correlated variables and more
realistic data scenarios.

Though the expectation function may hide the effects of correlated
data on the parameter estimates, computing the expectation for new data
may expose ridiculous parameter estimates. Furthermore, such
estimates may interfere with proper interpretation of the
regression results. Nonlinear regressions often use iterative methods
to estimate $\boldsymbol{\beta}$. Correlations may allow unbounded growth of
parameters, and consequently these methods may fail to converge.
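
The humidity example can be made concrete with a small sketch. Everything in it is hypothetical and chosen only for illustration: the ten days of data, the two coefficient vectors, and the helper `predict`. It assumes the temperature is exactly the day-of-month plus 15, so the two attributes are perfectly correlated.

```python
import numpy as np

# Hypothetical data: ten days, temperature rising one degree per day.
day = np.arange(1, 11, dtype=float)
temp = day + 15.0

def predict(beta, day, temp):
    """Linear expectation: intercept + beta_day * day + beta_temp * temp."""
    b0, b_day, b_temp = beta
    return b0 + b_day * day + b_temp * temp

# A modest parameter vector (intercept, beta_day, beta_temp).
modest = np.array([40.0, 0.5, 1.0])

# Shift beta_temp up by k and beta_day down by k. Because temp - day is
# the constant 15 in the collected data, the intercept absorbs the change.
k = 1000.0
extreme = modest + np.array([-15.0 * k, -k, k])

# Both vectors give the same expectations on the collected data.
same_fit = np.allclose(predict(modest, day, temp), predict(extreme, day, temp))

# A new day that breaks the correlation: day 11 at 20 degrees.
new_modest = predict(modest, 11.0, 20.0)
new_extreme = predict(extreme, 11.0, 20.0)

print(same_fit, new_modest, new_extreme)
```

On the collected data the two parameter vectors are indistinguishable. On the new day, the modest vector gives 65.5 while the inflated one gives -5934.5, which is the kind of exposure described above.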
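
The undesirable symptoms of correlated attributes can be reduced by
restraining the size of the parameter estimates, that is, by preferring small
parameter estimates. Ridge regression is a linear regression
technique which modifies the RSS computation of
Equation 3.7 to include a penalty for large parameter
estimates, for reasons explained below. This penalty is usually
written as $\lambda \tilde{\boldsymbol{\beta}}^T \tilde{\boldsymbol{\beta}}$, where
$\lambda$ is the ridge coefficient and $\tilde{\boldsymbol{\beta}}$ is
the vector of slope parameters $\beta_1, \beta_2, \ldots$ (the intercept is excluded). This
penalty is added to the RSS, and we now wish to minimize

$$\mathrm{RSS}_{\text{ridge}} = \mathrm{RSS} + \lambda \tilde{\boldsymbol{\beta}}^T \tilde{\boldsymbol{\beta}}$$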
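
As a rough sketch of the penalized objective, the function below computes the RSS plus the ridge penalty on the slope parameters. It assumes that the RSS of Equation 3.7 is the usual sum of squared residuals and that the first column of the data matrix is a column of ones for the intercept. The name `ridge_rss`, the data, and the two coefficient vectors are hypothetical.

```python
import numpy as np

def ridge_rss(X, y, beta, lam):
    """Sum of squared residuals plus lam times the squared slope norm.

    Assumes column 0 of X is the intercept column, so beta[0] is the
    intercept and is left out of the penalty.
    """
    residuals = y - X @ beta
    slopes = beta[1:]
    return residuals @ residuals + lam * (slopes @ slopes)

# Hypothetical, perfectly correlated attributes: temp = day + 15.
day = np.arange(1, 11, dtype=float)
temp = day + 15.0
X = np.column_stack([np.ones_like(day), day, temp])

# Two parameter vectors with identical fitted values on this data.
modest = np.array([40.0, 0.5, 1.0])
extreme = np.array([-14960.0, -999.5, 1001.0])
y = X @ modest  # noise-free outcomes, for illustration only

plain_modest = ridge_rss(X, y, modest, lam=0.0)
plain_extreme = ridge_rss(X, y, extreme, lam=0.0)
pen_modest = ridge_rss(X, y, modest, lam=0.1)
pen_extreme = ridge_rss(X, y, extreme, lam=0.1)

print(plain_modest, plain_extreme, pen_modest, pen_extreme)
```

With `lam=0.0` the plain RSS cannot tell the two vectors apart. With any positive `lam` the inflated vector is heavily penalized, so minimizing the penalized objective prefers the small estimates.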

Ridge regression was originally developed to overcome singularity when
inverting $\mathbf{X}^T \mathbf{X}$ to compute the covariance matrix. In this
setting the ridge coefficient $\lambda$ is a perturbation of the
diagonal entries of $\mathbf{X}^T \mathbf{X}$ to encourage non-singularity
[10]. The covariance matrix in a ridge regression
setting is $(\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1}$. Using this
formulation, ridge regression may be seen as belonging to an older
family of techniques called regularization [34].
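
The effect of the diagonal perturbation can be checked numerically. The sketch below uses hypothetical, perfectly collinear data, for which the matrix of attribute cross-products is singular, and shows that adding a small multiple of the identity makes it invertible. The value `lam = 0.1` and all variable names are illustrative assumptions. Following the covariance formula in the text, every diagonal entry is perturbed, including the one for the intercept.

```python
import numpy as np

# Hypothetical, perfectly correlated attributes: temp = day + 15.
day = np.arange(1, 11, dtype=float)
temp = day + 15.0
X = np.column_stack([np.ones_like(day), day, temp])
y = 55.0 + 1.5 * day  # illustrative noise-free outcomes

gram = X.T @ X                       # singular: columns of X are dependent
lam = 0.1
ridged = gram + lam * np.eye(3)      # perturb the diagonal entries

# Smallest eigenvalues: about 0 before the perturbation, about lam after.
min_eig_plain = np.linalg.eigvalsh(gram).min()
min_eig_ridged = np.linalg.eigvalsh(ridged).min()

# The perturbed system has a unique solution.
beta_ridge = np.linalg.solve(ridged, X.T @ y)

print(min_eig_plain, min_eig_ridged, beta_ridge)
```

Because the perturbation shifts every eigenvalue up by `lam`, the smallest eigenvalue moves from roughly zero to roughly `lam`, and the linear system becomes solvable.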

Another interpretation of ridge regression is available through
Bayesian point estimation. In this setting the belief that
$\boldsymbol{\beta}$ should be small is coded into a prior distribution. In
particular, suppose the parameters $\boldsymbol{\beta}$ are
treated as independent Normally-distributed random variables with zero
mean and known variance $\tau^2$. Suppose further that the outcome
vector $\mathbf{y}$ is distributed as described in
Section 3.1 but with known variance $\sigma^2$. The
prior distribution allows us to weight the likelihood
of the data by the probability of $\boldsymbol{\beta}$. After normalization, this weighted
likelihood is called the posterior distribution. The posterior
distribution of $\boldsymbol{\beta}$ is
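
$$f(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}, \sigma^2, \tau^2) \propto \exp\left( -\frac{(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}{2\sigma^2} - \frac{\boldsymbol{\beta}^T \boldsymbol{\beta}}{2\tau^2} \right) \qquad (3.15)$$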
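
The connection between this posterior and the ridge penalty can be sketched in a few lines. The derivation assumes that the model of Section 3.1 gives $\mathbf{y}$ a Normal distribution with mean $\mathbf{X}\boldsymbol{\beta}$ and variance $\sigma^2$; this is an assumption here, since that section is not reproduced. Taking the negative logarithm of the product of likelihood and prior, and collecting everything that does not depend on $\boldsymbol{\beta}$ into a constant:

```latex
\begin{aligned}
-\log f(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}, \sigma^2, \tau^2)
  &= \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
   + \frac{1}{2\tau^2} \boldsymbol{\beta}^T \boldsymbol{\beta} + \text{const} \\
\arg\max_{\boldsymbol{\beta}} f(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}, \sigma^2, \tau^2)
  &= \arg\min_{\boldsymbol{\beta}} \left[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
   + \frac{\sigma^2}{\tau^2} \boldsymbol{\beta}^T \boldsymbol{\beta} \right]
\end{aligned}
```

Under these assumptions the posterior mode minimizes an RSS plus a quadratic penalty with coefficient $\sigma^2 / \tau^2$, which plays the role of the ridge coefficient. A smaller prior variance $\tau^2$ expresses a stronger belief that the parameters are small and gives a larger penalty. Since the prior here is placed on all of $\boldsymbol{\beta}$, the penalty in this sketch covers every parameter.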