

6.2.1 Number of Rows

Figure 6.1: AUC and time versus number of rows. Note that the vertical time axis in the right plot is logarithmic.
\includegraphics[width=\textwidth]{figures/perf_auc_numrows_all.ps}
\includegraphics{figures/perf_time_numrows_all.ps}

Figure 6.1 shows how the classifiers performed as the number of rows in the dataset increased. Note that there is no coupling between the attributes of this dataset, as shown in Table 5.2. Furthermore, half of all dataset rows are positive. This is quite unlike the real-world datasets we will see later in this chapter.

The most obvious feature is the poor AUC of KNN, shown on the left side of the figure. We will see that KNN does poorly on all of our synthetic datasets, and we attribute this to the independence of the attributes, the linear boundaries, and the equal number of positive and negative rows. We will see in Section 6.2.4 that KNN performs somewhat better as the coupling between attributes increases, and in Section 6.3 that KNN surpasses all of the other classifiers on our PCA-compressed datasets ds1.100pca and ds1.10pca. There is little to distinguish the other classifiers from one another in this AUC plot.

On the other hand, it is easy to differentiate the learners in the time plot, which is on the right side of Figure 6.1. Note that the vertical time axis is logarithmic, and hence a unit change vertically represents an order of magnitude change in the number of seconds required to make classifications. The learners capable of representing nonlinear classification boundaries, namely KNN and SVM RBF, are the slowest. In Section 6.3 we will be able to evaluate whether it is worthwhile to spend extra time on nonlinear boundaries for high-dimensional real-world data.

The LR classifiers appear to be faster than SVM LINEAR. The CG-MLE and SVM LINEAR curves do cross, as shown in Figure 6.2, but they then cross a second time. Nevertheless, the effectiveness of the SVM$^{\mbox{\emph{light}}}$ optimizations for solving the SVM quadratic program is remarkable. Of the LR classifiers, LR-CGEPS appears to have a small advantage. Careful inspection of this log-log plot shows that the slope of the LR curves is one, and hence we are observing the predicted linear performance of the LR algorithms.
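
To see why a unit slope on a log-log plot indicates linear scaling, suppose, purely as an illustrative assumption and not as a model fitted to these measurements, that the running time follows a power law in the number of rows $n$. The constants $c$ and $k$ below are introduced only for this sketch.
\begin{displaymath}
t(n) = c\,n^{k}
\quad\Longrightarrow\quad
\log_{10} t(n) = \log_{10} c + k \log_{10} n .
\end{displaymath}
Under this assumption the curve is a straight line on log-log axes with slope $k$ and intercept $\log_{10} c$. A slope of $k = 1$ gives $t(n) = c\,n$, so that multiplying the number of rows by ten multiplies the time by ten, which corresponds to one unit on each logarithmic axis.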

At the bottom of Figure 6.1 we see that BC is indeed a very fast algorithm. Though the attributes in this synthetic dataset do not satisfy BC's conditional independence assumptions, as described in Section 6.1.3, the noiseless linear boundary helps BC make good predictions.

Figure 6.2: Time versus number of rows, magnified.
\includegraphics{figures/perf_time_numrows_mag_all.ps}


Copyright 2004 Paul Komarek, komarek@cmu.edu