KNN is a well-known non-parametric classifier. The intuition for KNN
is simply to use the
nearest data points to determine whether or
not a test point
is positive. If
, some method of
combining
binary values into a single prediction is needed. For
instance, one might predict the class assigned to the majority of the
nearest neighbors.
This geometric idea requires a distance metric, and the Euclidean
distance is commonly used. A naive implementation of KNN would keep
the training data in memory, and compute the distance between all
data points and the test point. Since each distance computation
requires
time, or
time on average if the data is sparse,
classifying a single test point is
O
. If another test point
were used, another
distance evaluations are made. Thus the
weakness of KNN is that it ignores structure in the data, and making
more than a few classifications can be very expensive.
One non-naive approach to KNN uses one or more space-partitioning trees to organize the training data according to its geometric structure. In our comparisons, we use the ball tree-based KNS2 algorithm described in Liu et al. [24]. The KNS2 algorithms uses two ball trees to separately store the positive and negative training examples. If the data has a very uneven class distribution, then one of the ball trees will be small.
Suppose for concreteness that the positive class is small, as is the
case with our real-world datasets. To decide whether a test point is
positive or negative, KNS2 first finds the
nearest positive points
to the test point. This is fast because the positive tree is small.
Then KNS2 analyzes the negative points in the larger tree, stopping as
soon as it can determine the exact number of negative points among the
nearest neighbors. The ratio of positive to negative nearest
neighbors is used to rank the test point.
Building a ball tree requires
O
time if the tree is
balanced. Once the positive and negative trees are built, the time
required for the necessary queries will depend on the geometric
structure of the data. If the data is clustered and the class
distribution is skewed, queries will be very fast and KNS2 should
vastly outperform naive KNN. In such cases KNS2 can scale to much
larger datasets than have been successfully attempted with naive KNN.
If the data is uniformly distributed and there are as many positive as
negative points, the ball trees will be useless and queries will be
relatively slow. This latter scenario is roughly the case for our
synthetic datasets. A further problem occurs for non-clustered
high-dimensional data. In such data all data points are equidistant
from one another, invalidating the intuition behind KNN
[9].
For our experiments, we tried
,
, and
due to our
experience with KNS2 on real-world datasets. We can expect poor
predictions with high variance from values of
which are too small,
while large values of
increase the time required to compute the
majority class.
We had some difficulty during ball tree construction in KNS2. On many synthetic datasets, and on the larger real-world datasets, ball trees with the default settings required more than 8GB of memory. We were able to decrease the size of the ball tree to under 4GB by increasing the rmin parameter, which controls the number of training rows stored in the leaf nodes. However, the resulting trees did not successfully accelerate KNS2. Such experiments were abandoned and are absent from our graphs and tables.