From KNN to Nearest Neighbor Classification

A slecture by ECE student Jonathan Manring

Based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

**I. Introduction **

The K-nearest neighbor (KNN) and nearest neighbor (NN) approaches to data classification are closely related, with the nearest neighbor approach largely being a simplification of KNN. In this tutorial, we will explain first the concept of KNN, secondly the nearest neighbor approach, and thirdly discuss briefly the comparative advantages and disadvantages of each. The focus of this tutorial will be the concepts more than the rigorous mathematical definitions, though this will be touched on as well.

**II. The K-Nearest-Neighbor (KNN) Method **

The K-nearest neighbors (KNN) approach is one of the so-called non-parametric density estimation methods since it does not assume any particular “parametric” density form. A better category might be local density estimation, since KNN and other methods in this class estimate the density function locally.

The KNN method begins with a set of labeled training data (points) in the feature space. Note that it is essential that the training data reflect the prior probabilities of the possible classes! New points are classified by expanding a window around them until a pre-determined number of training data points, k, are contained within the window. Let the following definitions hold:

Point to be classified:

```
$ \vec{x_o} $
```

Volume of the window around the unclassified point that contains k neighbors:

```
$ V_k(\vec{x_o}) $
```

Number of total points in the training set: *N*

Then an estimate of the density function at $ \vec{x_o} $ is:

```
$ \bar{\rho_k}(\vec{x_o}) = \frac{k - 1}{NV_k(\vec{x_o})} $
```

It turns out that this is in fact an unbiased estimate of $ \vec{x_o} $!

$ E[\bar{\rho_k}(\vec{x_o})] = \rho(\vec{x_o}) $

Thus, increasing k, the number of nearest neighbors, will not cause KNN to become more accurate. It is perfectly unbiased already. Increasing k has the effect of reducing the variance of the estimate of $ \rho(\vec{x_o}) $, but the expectation of $ \bar{\rho_k}(\vec{x_o}) $ is always accurate.

**III. Classification with KNN **

KNN leads to a very easy and intuitive classification method. The class with the majority (or plurality, as the case may be) of points in the set of k neighbors becomes the class of the point in question. For example, if k is 5 and 3 of the neighbors belong to class 1 and 2 of them belong to class 2, the data point $ \vec{x_o} $ will be assigned to class 1, based on the majority vote.

**IV. The Nearest Neighbor method **

The nearest neighbor approach is similar to the KNN approach in that it relies on labeled training data. It is again necessary that the training data accurately reflect the prior probabilities of each class. The class of the “nearest” training point is taken as the class of the data point $ \vec{x_o} $. The “nearest” neighbor can be found using a distance metric, $ D(\vec{x_1},\vec{x_2}) $, that must satisfy the following conditions:

1)

$ D(\vec{x_1},\vec{x_2}) \geq 0 $

2)

$ D(\vec{x_1},\vec{x_2}) = D(\vec{x_2},\vec{x_1}) $

3)

$ D(\vec{x},\vec{x}) = 0 $

4)

$ D(\vec{x_1},\vec{x_2}) + D(\vec{x_2},\vec{x_3}) \geq D(\vec{x_1},\vec{x_3}) $

The last property, known as the triangle inequality, is often the most constrictive. Yet it allows for large data sets to be indexed such that large sections of the data need not even be considered when searching for the nearest neighbor of a new point.

Below are several example metrics for distance:

Euclidean distance:

$ D(\vec{x_1},\vec{x_2})=||\vec{x_1}-\vec{x_2}||_{L_2}=\sqrt{\sum_{i=1}^n ({x_1}^i-{x_2}^i)^2} $

Manhattan (cab driver) distance:

$ D(\vec{x_1},\vec{x_2})=||\vec{x_1}-\vec{x_2}||_{L_1}=\sum_{i=1}^n |{x_1}^i-{x_2}^i| $

Minkowski metric:

$ D(\vec{x_1},\vec{x_2})=||\vec{x_1}-\vec{x_2}||_{L_p}=(\sum_{i=1}^n ({x_1}^i-{x_2}^i)^p)^{\frac{1}{p}} $

Riemannian metric:

$ D(\vec{x_1},\vec{x_2})=\sqrt{(\vec{x_1}-\vec{x_2})^\top \mathbb{M}(\vec{x_1}-\vec{x_2})} $

Infinite norm:

$ D(\vec{x_1},\vec{x_2})=||\vec{x_1}-\vec{x_2}||_{\infty}=max_i |{x_1}^i-{x_2}^i| $

**V. Comparison**

The nearest neighbor method is very similar to the KNN method. However, only one neighbor is considered, leading to more instability. As noted above, the KNN method does not improve in average accuracy as k increases, but the variance of the accuracy does decrease with increasing k. An advantage of the nearest neighbor classification method is that it is very computationally efficient, compared to KNN and parametric methods since it is only necessary to locate one labeled neighbor. However, the nearest neighbor method is a classification method only, and will not allow the estimation of the actual density function, as the KNN method will. This is only a minor drawback however, as the goal in estimating density functions is usually classification.

**VI. References**

[1] Lecture notes from ECE662: Statistical Pattern Recognition and Decision Making Processes, Purdue University, Mireille Boutin

[2] Introduction to Statistical Pattern Recognition, 2nd edition, 1990, by K. Fukunaga.

## Questions and comments

If you have any questions, comments, reviews etc. please post them on this page.