We would like to compare performance of LR, SVM, KNN and BC on real-world datasets. Unfortunately, we do not have real-world datasets available other than those used for choosing the LR parameters in Chapter 5. While we have done our best to tune SVM LINEAR, SVM RBF and KNN for these experiments, we cannot claim to have been as thorough as we were in Chapter 5. There is no tuning to be done for BC, since it has no parameters. We must be careful when comparing timings on these datasets.
In the previous section we saw that BC was the fastest algorithm, the LR classifiers were next, followed by SVM and NEWKNN. These results were for synthetic datasets, on which NEWKNN did very poorly. The time spent on nonlinear boundaries by NEWKNN and SVM RBF was certainly wasted, since the synthetic datasets are linearly separable. The datasets used in this section, which are described in Section 5.1.3, may or may not be linearly separable. These datasets are noisy, and the class distribution is heavily skewed with few positive rows.
| ds1 | imdb | ± | citeseer | ± | |||||||||
| Classifier | Time | AUC | Time | AUC | ± | Time | AUC | ± | |||||
| LR-CGEPS | 86 | 0.949 | ± | 0.009 | 303 | 0.983 | ± | 0.007 | 43 | 0.946 | ± | 0.021 | |
| LR-CGDEVEPS | 59 | 0.948 | ± | 0.009 | 320 | 0.983 | ± | 0.007 | 67 | 0.945 | ± | 0.021 | |
| CG-MLE | 151 | 0.946 | ± | 0.008 | 369 | 0.983 | ± | 0.009 | 97 | 0.946 | ± | 0.021 | |
| SVM LINEAR | 188 | 0.918 | ± | 0.012 | 565 | 0.938 | ± | 0.016 | 87 | 0.810 | ± | 0.049 | |
| SVM RBF | 1850 | 0.924 | ± | 0.012 | 6553 | 0.954 | ± | 0.011 | 1456 | 0.854 | ± | 0.045 | |
| KNN K=1 | 424 | 0.790 | ± | 0.029 | NA | NA | ± | NA | NA | NA | ± | NA | |
| KNN K=9 | 782 | 0.909 | ± | 0.016 | NA | NA | ± | NA | NA | NA | ± | NA | |
| KNN K=129 | 2381 | 0.938 | ± | 0.010 | NA | NA | ± | NA | NA | NA | ± | NA | |
| BC | 4 | 0.884 | ± | 0.011 | 33 | 0.507 | ± | 0.023 | 10 | 0.501 | ± | 0.038 | |
| ds2 | ds1.100pca | ± | ds1.10pca | ± | |||||||||
| Classifier | Time | AUC | Time | AUC | ± | Time | AUC | ± | |||||
| LR-CGEPS | 2983 | 0.720 | ± | 0.030 | 44 | 0.918 | ± | 0.011 | 8 | 0.846 | ± | 0.013 | |
| LR-CGDEVEPS | 1647 | 0.722 | ± | 0.032 | 35 | 0.913 | ± | 0.011 | 9 | 0.842 | ± | 0.015 | |
| CG-MLE | 3198 | 0.724 | ± | 0.030 | 364 | 0.916 | ± | 0.012 | 48 | 0.844 | ± | 0.014 | |
| SVM LINEAR | 2536 | 0.693 | ± | 0.034 | 130 | 0.874 | ± | 0.012 | 68 | 0.582 | ± | 0.048 | |
| SVM RBF | 67117 | 0.700 | ± | 0.034 | 1036 | 0.897 | ± | 0.010 | 490 | 0.856 | ± | 0.017 | |
| KNN K=1 | NA | NA | ± | NA | 74 | 0.785 | ± | 0.024 | 9 | 0.753 | ± | 0.028 | |
| KNN K=9 | NA | NA | ± | NA | 166 | 0.894 | ± | 0.016 | 14 | 0.859 | ± | 0.019 | |
| KNN K=129 | NA | NA | ± | NA | 819 | 0.938 | ± | 0.010 | 89 | 0.909 | ± | 0.013 | |
| BC | 127 | 0.533 | ± | 0.020 | 8 | 0.890 | ± | 0.012 | 2 | 0.863 | ± | 0.015 | |
Tables 6.1 and 6.2 show the
AUC and timing results for our experiments. Looking first at the
AUC scores, we see that the LR algorithms are performing comparably.
SVM LINEAR usually scores a little lower than SVM RBF, except for
ds1.10pca where it scores well below every classifier. Our assumption
for the ds1.10pca experiment is that
SVM
is not adjusting
properly for such a small dataset, and we are not sure how to
compensate. We are tempted to conclude that the nonlinear
classification boundaries used in SVM RBF worked better than the
linear boundaries of SVM LINEAR. However, SVM RBF did not outscore
the linear boundaries used by LR.
We were unable to build a useful ball tree for KNN within 4GB of memory for imdb, citeseer and ds2. Using the computers with more memory, creating a ball tree for ds2 required over 4.6GB, and for imdb and citeseer required over 8GB. We allowed a ds2 experiment to complete to determine whether we should adjust rmin just for this dataset so that the memory requirements were met. With k=9, KNN finished the ten-fold cross-validation in approximately 5 hours, and scored around 0.654. This score might improve with k=129, but the other KNN results suggest it won't improve enough to make it competitive with the LR scores. Furthermore, the time would increase with k=129 or a smaller ball tree, further reducing the appeal of KNN relative to LR for this dataset. That we can run KNN on this dataset is very impressive. However, the focus of this thesis is LR and, in this context, it does not make sense to spend time on KNN for these datasets.
BC put in a miserable performance on these real datasets. We cannot blame this on tuning since BC has no parameters, and the best explanation is that these datasets do not satisfy the conditional independence assumption made by BC. Consider imdb, in which each row represents an entertainment production and the output represents the participation of voice actor Mel Blanc. Suppose a link contains animation director Chuck Jones and composer/conductor Carl Stalling. Chuck and Carl are conditionally independent if
| (6.10) |
LR generally outscored the other learners on these datasets, except on the PCA-compressed datasets. In several cases, SVM was not far below LR, and more comprehensive tuning could even the scores. KNN was much more competitive here than it was on the synthetic datasets, even beating LR on ds1.100pca and ds1.10pca. BC did poorly almost everywhere.
To draw final conclusions from these tables, we should examine the timings. Though BC was very fast, it's poor scores suggest it is not well-suited to this data. SVM RBF was very slow relative to SVM LINEAR and LR, and only scored as well as LR on ds1.10pca. Though SVM LINEAR was closer to the LR classifiers' speeds, it consistently scored well below LR. KNN was usually slow, but for the PCA datasets the time was not unreasonable and the scores were worthwhile. Our conclusion, on these datasets, is that the LR algorithms are the best choice. Furthermore, it seems that our LR-CGEPS and LR-CGDEVEPS perform better than our implementation of the traditional CG-MLE.
We conclude this section with some brief observations of the ROC curves for these experiments. We have organized these curves by dataset. Figures 6.7, 6.8 and 6.9 correspond to ds1, imdb and citeseer, while Figures 6.10, 6.11 and 6.12 correspond to ds2, ds1.100pca and ds1.10pca. Each figure has two graphs. The top graph has a linear ``False positives'' axis, while the bottom graph has a logarithmic ``False positives'' axis. Each plot is built from as many data points as there are rows in the corresponding dataset. For technical reasons these points are interpolated with cubic splines, and the splines are resampled where the bullets appear in the plots. One should not associate the bullets with the individual steps of the ROC curve.
The most interesting part of these figures is the bottom left corner of the logarithmic plot. It is here that we see how accurate the classifiers are for the first few predictions. In the ds1 and imdb graphs, we see LR and SVM close together for their top five predictions, but SVM appears to have an edge over LR for initial accuracy. In the citeseer graph, SVM's initial performance is even stronger. In the long run, we know that LR outperforms SVM as indicated by the AUC scores, and careful observation of the graphs confirms this. The ds1.100pca and ds1.10pca linear ROC curves show KNN outperforming the other algorithms, though the initial segment does not show any clear leaders. The most important conclusion from the ROC curves is that the classifier with the best initial ranking performance might not be the most consistent classifier.
tex2html_comment_mark>53
|
tex2html_comment_mark>53
|
tex2html_comment_mark>53
|
tex2html_comment_mark>53
|
tex2html_comment_mark>53
|
tex2html_comment_mark>53
|