We will illustrate the strengths and weaknesses of LR for data mining
and high-dimensional classification through life sciences and link
detection datasets. We will characterize the performance of LR and
other classification algorithms using several synthetic datasets.
When discussing datasets in this thesis, each record belongs to, or is
predicted to belong to, the positive or negative class. A
positive row is a row belonging to, or predicted to belong to,
the positive class. A similar definition holds for a negative
row.
is the number of rows in the dataset, and
is the number
of attributes. The sparsity factor
is the proportion of
nonzero entries. Thus the total number of nonzero elements in a
dataset is the product
. Our datasets are summarized in
Tables 5.1 and 5.2.