next up previous contents
Next: 6.4 Conclusions Up: 6. Characterizing Logistic Regression Previous: 6.2.4 Attribute Coupling   Contents


6.3 Real-world Datasets

We would like to compare performance of LR, SVM, KNN and BC on real-world datasets. Unfortunately, we do not have real-world datasets available other than those used for choosing the LR parameters in Chapter 5. While we have done our best to tune SVM LINEAR, SVM RBF and KNN for these experiments, we cannot claim to have been as thorough as we were in Chapter 5. There is no tuning to be done for BC, since it has no parameters. We must be careful when comparing timings on these datasets.

In the previous section we saw that BC was the fastest algorithm, the LR classifiers were next, followed by SVM and NEWKNN. These results were for synthetic datasets, on which NEWKNN did very poorly. The time spent on nonlinear boundaries by NEWKNN and SVM RBF was certainly wasted, since the synthetic datasets are linearly separable. The datasets used in this section, which are described in Section 5.1.3, may or may not be linearly separable. These datasets are noisy, and the class distribution is heavily skewed with few positive rows.


Table 6.1: Classifier performance on real-world datasets, 1 of 2.
   ds1 imdb ±citeseer ± 
Classifier  Time AUC Time AUC ±Time AUC ± 
LR-CGEPS  86 0.949 ±0.009 303 0.983 ±0.007 43 0.946 ±0.021
LR-CGDEVEPS  59 0.948 ±0.009 320 0.983 ±0.007 67 0.945 ±0.021
CG-MLE  151 0.946 ±0.008 369 0.983 ±0.009 97 0.946 ±0.021
SVM LINEAR  188 0.918 ±0.012 565 0.938 ±0.016 87 0.810 ±0.049
SVM RBF  1850 0.924 ±0.012 6553 0.954 ±0.011 1456 0.854 ±0.045
KNN K=1  424 0.790 ±0.029 NA NA ±NA NA NA ±NA
KNN K=9  782 0.909 ±0.016 NA NA ±NA NA NA ±NA
KNN K=129  2381 0.938 ±0.010 NA NA ±NA NA NA ±NA
BC  4 0.884 ±0.011 33 0.507 ±0.023 10 0.501 ±0.038


Table 6.2: Classifier performance on real-world datasets, 2 of 2.
   ds2 ds1.100pca ±ds1.10pca ± 
Classifier  Time AUC Time AUC ±Time AUC ± 
LR-CGEPS  2983 0.720 ±0.030 44 0.918 ±0.011 8 0.846 ±0.013
LR-CGDEVEPS  1647 0.722 ±0.032 35 0.913 ±0.011 9 0.842 ±0.015
CG-MLE  3198 0.724 ±0.030 364 0.916 ±0.012 48 0.844 ±0.014
SVM LINEAR  2536 0.693 ±0.034 130 0.874 ±0.012 68 0.582 ±0.048
SVM RBF  67117 0.700 ±0.034 1036 0.897 ±0.010 490 0.856 ±0.017
KNN K=1  NA NA ±NA 74 0.785 ±0.024 9 0.753 ±0.028
KNN K=9  NA NA ±NA 166 0.894 ±0.016 14 0.859 ±0.019
KNN K=129  NA NA ±NA 819 0.938 ±0.010 89 0.909 ±0.013
BC  127 0.533 ±0.020 8 0.890 ±0.012 2 0.863 ±0.015

Tables 6.1 and 6.2 show the AUC and timing results for our experiments. Looking first at the AUC scores, we see that the LR algorithms are performing comparably. SVM LINEAR usually scores a little lower than SVM RBF, except for ds1.10pca where it scores well below every classifier. Our assumption for the ds1.10pca experiment is that SVM$ ^{\mbox{\emph{light}}}$ is not adjusting properly for such a small dataset, and we are not sure how to compensate. We are tempted to conclude that the nonlinear classification boundaries used in SVM RBF worked better than the linear boundaries of SVM LINEAR. However, SVM RBF did not outscore the linear boundaries used by LR.

We were unable to build a useful ball tree for KNN within 4GB of memory for imdb, citeseer and ds2. Using the computers with more memory, creating a ball tree for ds2 required over 4.6GB, and for imdb and citeseer required over 8GB. We allowed a ds2 experiment to complete to determine whether we should adjust rmin just for this dataset so that the memory requirements were met. With k=9, KNN finished the ten-fold cross-validation in approximately 5 hours, and scored around 0.654. This score might improve with k=129, but the other KNN results suggest it won't improve enough to make it competitive with the LR scores. Furthermore, the time would increase with k=129 or a smaller ball tree, further reducing the appeal of KNN relative to LR for this dataset. That we can run KNN on this dataset is very impressive. However, the focus of this thesis is LR and, in this context, it does not make sense to spend time on KNN for these datasets.

BC put in a miserable performance on these real datasets. We cannot blame this on tuning since BC has no parameters, and the best explanation is that these datasets do not satisfy the conditional independence assumption made by BC. Consider imdb, in which each row represents an entertainment production and the output represents the participation of voice actor Mel Blanc. Suppose a link contains animation director Chuck Jones and composer/conductor Carl Stalling. Chuck and Carl are conditionally independent if

$\displaystyle P($Carl$\displaystyle \vert$   Mel$\displaystyle ,$   Chuck$\displaystyle ) = P($Carl$\displaystyle \vert$   Mel$\displaystyle )$ (6.10)

This statement would not be true if Carl and Chuck worked together regardless of whether Mel did the voices. While we cannot comment on the veracity of this particular scenario, it seems likely that the link datasets have many attributes which are not conditionally independent given the output.

LR generally outscored the other learners on these datasets, except on the PCA-compressed datasets. In several cases, SVM was not far below LR, and more comprehensive tuning could even the scores. KNN was much more competitive here than it was on the synthetic datasets, even beating LR on ds1.100pca and ds1.10pca. BC did poorly almost everywhere.

To draw final conclusions from these tables, we should examine the timings. Though BC was very fast, it's poor scores suggest it is not well-suited to this data. SVM RBF was very slow relative to SVM LINEAR and LR, and only scored as well as LR on ds1.10pca. Though SVM LINEAR was closer to the LR classifiers' speeds, it consistently scored well below LR. KNN was usually slow, but for the PCA datasets the time was not unreasonable and the scores were worthwhile. Our conclusion, on these datasets, is that the LR algorithms are the best choice. Furthermore, it seems that our LR-CGEPS and LR-CGDEVEPS perform better than our implementation of the traditional CG-MLE.

We conclude this section with some brief observations of the ROC curves for these experiments. We have organized these curves by dataset. Figures 6.7, 6.8 and 6.9 correspond to ds1, imdb and citeseer, while Figures 6.10, 6.11 and 6.12 correspond to ds2, ds1.100pca and ds1.10pca. Each figure has two graphs. The top graph has a linear ``False positives'' axis, while the bottom graph has a logarithmic ``False positives'' axis. Each plot is built from as many data points as there are rows in the corresponding dataset. For technical reasons these points are interpolated with cubic splines, and the splines are resampled where the bullets appear in the plots. One should not associate the bullets with the individual steps of the ROC curve.

The most interesting part of these figures is the bottom left corner of the logarithmic plot. It is here that we see how accurate the classifiers are for the first few predictions. In the ds1 and imdb graphs, we see LR and SVM close together for their top five predictions, but SVM appears to have an edge over LR for initial accuracy. In the citeseer graph, SVM's initial performance is even stronger. In the long run, we know that LR outperforms SVM as indicated by the AUC scores, and careful observation of the graphs confirms this. The ds1.100pca and ds1.10pca linear ROC curves show KNN outperforming the other algorithms, though the initial segment does not show any clear leaders. The most important conclusion from the ROC curves is that the classifier with the best initial ranking performance might not be the most consistent classifier.

Figure: ROC curves for ds1. Note that the bottom graph has a logarithmic ``False positives'' axis.
\includegraphics[height=0.45\textheight]{figures/perf_roc_ds1.ps}

tex2html_comment_mark>53 \includegraphics[height=0.45\textheight]{figures/perf_roc_logx_ds1.ps}

Figure: ROC curve for imdb. Note that the bottom graph has a logarithmic ``False positives'' axis. The ball trees used for the KNS2 algorithm exceeded 4GB of memory and were terminated.
\includegraphics[height=0.45\textheight]{figures/perf_roc_melblanc.ps}

tex2html_comment_mark>53 \includegraphics[height=0.45\textheight]{figures/perf_roc_logx_melblanc.ps}

Figure: ROC curve for citeseer. Note that the bottom graph has a logarithmic ``False positives'' axis. The ball trees used for the KNS2 algorithm exceeded 4GB of memory and were terminated.
\includegraphics[height=0.45\textheight]{figures/perf_roc_jlee.ps}

tex2html_comment_mark>53 \includegraphics[height=0.45\textheight]{figures/perf_roc_logx_jlee.ps}

Figure: ROC curve for ds2. Note that the bottom graph has a logarithmic ``False positives'' axis. The ball trees used for the KNS2 algorithm exceeded 4GB of memory and were terminated.
\includegraphics[height=0.45\textheight]{figures/perf_roc_ds2.ps}

tex2html_comment_mark>53 \includegraphics[height=0.45\textheight]{figures/perf_roc_logx_ds2.ps}

Figure: ROC curve for ds1.100pca. Note that the bottom graph has a logarithmic ``False positives'' axis.
\includegraphics[height=0.45\textheight]{figures/perf_roc_t100.ps}

tex2html_comment_mark>53 \includegraphics[height=0.45\textheight]{figures/perf_roc_logx_t100.ps}

Figure: ROC curve for ds1.10pca. Note that the bottom graph has a logarithmic ``False positives'' axis.
\includegraphics[height=0.45\textheight]{figures/perf_roc_t10.ps}

tex2html_comment_mark>53 \includegraphics[height=0.45\textheight]{figures/perf_roc_logx_t10.ps}


next up previous contents
Next: 6.4 Conclusions Up: 6. Characterizing Logistic Regression Previous: 6.2.4 Attribute Coupling   Contents
Copyright 2004 Paul Komarek, komarek@cmu.edu