

5.1.3 Datasets

We illustrate the strengths and weaknesses of LR for data mining and high-dimensional classification using life sciences and link detection datasets, and we characterize the performance of LR and other classification algorithms using several synthetic datasets. Throughout this thesis, every record in a dataset belongs to, or is predicted to belong to, either the positive or the negative class. A positive row is a row that belongs to, or is predicted to belong to, the positive class; a negative row is defined analogously. $R$ is the number of rows in the dataset, and $M$ is the number of attributes. The sparsity factor $F$ is the proportion of nonzero entries among the $MR$ entries of the dataset, so the total number of nonzero elements is the product $MRF$. Our datasets are summarized in Tables 5.1 and 5.2.
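As a small illustration of the definitions of $R$, $M$, and $F$, the following Python sketch computes them for a toy dataset and checks that the nonzero count equals $MRF$. The function name `sparsity_summary` and the toy matrix are hypothetical and chosen only for this example; they are not taken from the thesis or its datasets.

```python
def sparsity_summary(rows):
    """Return (R, M, F, nonzero) for a dense list-of-lists dataset.

    R is the number of rows, M the number of attributes, and F the
    proportion of nonzero entries. Illustrative helper only.
    """
    R = len(rows)
    M = len(rows[0]) if R else 0
    # Count entries that are not zero across the whole dataset.
    nonzero = sum(1 for row in rows for value in row if value != 0)
    total = M * R
    F = nonzero / total if total else 0.0
    return R, M, F, nonzero


# Toy dataset (made up for illustration): 3 rows, 4 attributes.
toy = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

R, M, F, nonzero = sparsity_summary(toy)
print(R, M, F, nonzero)
# The total number of nonzero elements is the product M * R * F.
print(M * R * F)
```

For real sparse datasets one would normally store only the nonzero entries rather than a dense list of lists; the dense form is used here only to keep the arithmetic visible.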



Copyright 2004 Paul Komarek, komarek@cmu.edu