next up previous contents
Next: 6.1.3 BC Up: 6.1 Classifiers Previous: 6.1.1 SVM   Contents


6.1.2 KNN

KNN is a well-known non-parametric classifier. The intuition for KNN is simply to use the $ k$ nearest data points to determine whether or not a test point $ \mathbf{x}$ is positive. If $ k>1$, some method of combining $ k$ binary values into a single prediction is needed. For instance, one might predict the class assigned to the majority of the $ k$ nearest neighbors.

This geometric idea requires a distance metric, and the Euclidean distance is commonly used. A naive implementation of KNN would keep the training data in memory, and compute the distance between all $ R$ data points and the test point. Since each distance computation requires $ M$ time, or $ MF$ time on average if the data is sparse, classifying a single test point is O$ \left( MRF \right)$. If another test point were used, another $ R$ distance evaluations are made. Thus the weakness of KNN is that it ignores structure in the data, and making more than a few classifications can be very expensive.

One non-naive approach to KNN uses one or more space-partitioning trees to organize the training data according to its geometric structure. In our comparisons, we use the ball tree-based KNS2 algorithm described in Liu et al. [24]. The KNS2 algorithms uses two ball trees to separately store the positive and negative training examples. If the data has a very uneven class distribution, then one of the ball trees will be small.

Suppose for concreteness that the positive class is small, as is the case with our real-world datasets. To decide whether a test point is positive or negative, KNS2 first finds the $ k$ nearest positive points to the test point. This is fast because the positive tree is small. Then KNS2 analyzes the negative points in the larger tree, stopping as soon as it can determine the exact number of negative points among the $ k$ nearest neighbors. The ratio of positive to negative nearest neighbors is used to rank the test point.

Building a ball tree requires O$ \left( FMR\log R \right)$ time if the tree is balanced. Once the positive and negative trees are built, the time required for the necessary queries will depend on the geometric structure of the data. If the data is clustered and the class distribution is skewed, queries will be very fast and KNS2 should vastly outperform naive KNN. In such cases KNS2 can scale to much larger datasets than have been successfully attempted with naive KNN. If the data is uniformly distributed and there are as many positive as negative points, the ball trees will be useless and queries will be relatively slow. This latter scenario is roughly the case for our synthetic datasets. A further problem occurs for non-clustered high-dimensional data. In such data all data points are equidistant from one another, invalidating the intuition behind KNN [9].

For our experiments, we tried $ k=1$, $ k=9$, and $ k=129$ due to our experience with KNS2 on real-world datasets. We can expect poor predictions with high variance from values of $ k$ which are too small, while large values of $ k$ increase the time required to compute the majority class.

We had some difficulty during ball tree construction in KNS2. On many synthetic datasets, and on the larger real-world datasets, ball trees with the default settings required more than 8GB of memory. We were able to decrease the size of the ball tree to under 4GB by increasing the rmin parameter, which controls the number of training rows stored in the leaf nodes. However, the resulting trees did not successfully accelerate KNS2. Such experiments were abandoned and are absent from our graphs and tables.


next up previous contents
Next: 6.1.3 BC Up: 6.1 Classifiers Previous: 6.1.1 SVM   Contents
Copyright 2004 Paul Komarek, komarek@cmu.edu