22 Nearest Neighbor Estimates

22.1 Introduction

In the previous chapter we finished our discussion by listing two core ideas that are common to several nonparametric regression methods. On one hand, we typically invoke a notion of “Neighborhood” of nearby points \(x_i\) around a query point \(x_0\). On the other hand, we also require some sort of mechanism to get a “local” fit using the data points \((x_i, y_i)\) in the neighborhood.

Let’s now describe one flavor of procedures that is based on a narrow notion of neighborhood, and very simple local fitting mechanisms.

22.2 \(k\) Nearest Neighbors (KNN)

Perhaps the most popular approach that takes into account neighboring points to make predictions is \(k\) Nearest Neighbors, or KNN for short.

In KNN, we define a neighborhood, denoted \(\mathcal{N}_k(x_0)\), that consists of the \(k\) closest points around a query point \(x_0\). This neighborhood will give us a set of \(k\) points \((x_i, y_i)\), for \(i \in \mathcal{N}_k(x_0)\), that will be taken into account to compute \(\hat{y}_0\) with some local averaging mechanism.

To keep the notation simpler, sometimes we will use \(\mathcal{N}_0 = \mathcal{N}_k(x_0)\) to denote the neighborhood around \(x_0\).

The plain vanilla KNN estimator \(\hat{f}(x_0)\) is defined as the arithmetic mean of the \(y_i\) values associated to the \(k\) closest points \(x_i\) to \(x_0\):

\[ \textsf{running mean:} \quad \hat{f}(x_0) = \frac{1}{k} \sum_{i \in \mathcal{N}_0} y_i = \hat{y}_0 \]

A less common option of local averaging mechanism is the median of the \(y_i\) values associated to the \(k\) closest points \(x_i\) to \(x_0\):

\[ \textsf{running median:} \quad \hat{f}(x_0) = \text{median} (y_i | i \in \mathcal{N}_0) \]

A much less common approach is to use a local linear fit, which is to say, estimate a linear regression based in the neighboring points \((x_i, y_i), i \in \mathcal{N}_0\)

\[ \textsf{running line:} \quad \hat{f}(x_0) = b_{0,x_0} + b_{1,x_0} x_0 \]

where \(b_{0,x_0}\) and \(b_{1,x_0}\) are the least squares estimates using the data points \((x_i, y_i)\), with \(i \in \mathcal{N}_0\).

22.2.1 Distance Measures

One important consideration behind KNN has to do with the question of “How do we measure the closeness between two points \(x_0\) and \(x_i\)?” In the univariate case, we typically use the absolute distance:

\[ d(x_0, x_i) = |x_0 - x_i| \]

Notice that for the univariate case, there’s no difference between using the absolute distance or using the Euclidean distance. The \(k\) nearest neighbors should be the same regardless of the choice of distance.

In the multivariate case, common options for distances are Euclidean distance, and Manhattan distance (but many other distances are also available).

\[ \textsf{Euclidean:} \quad d(x_0, x_i) = \sqrt{\sum_{j=1}^{p} (x_{0j} - x_{ij})^2} \\ \textsf{Manhattan:} \quad d(x_0, x_i) = \sum_{j=1}^{p} |x_{0j} - x_{ij}| \]

For instance, say we have a data set of \(n = 10\) objects:

\[ \mathcal{D} = \{ (x_1, y_1), (x_2, y_2), \dots, (x_9, y_9), (x_{10}, y_{10}) \} \]

and suppose that we have a given query point \(x_0\) for which we want to compute its predicted response \(\hat{y}_0\). Let’s also assume that we are considering the \(k = 3\) closest neighbors to \(x_0\). What we have to do is compute all the distances \(d_i = d(x_0, x_i)\) between the query point and the \(n\) data points:

\[ \textsf{set of distances:} \quad (d_1, d_2, d_3, \dots, d_9, d_{10}) \]

and then select the 3 closest \(x_i\) points to \(x_0\). For illustrations purposes, let’s pretend that \(d_5\), \(d_2\), and \(d_7\) are the closest distances to \(x_0\), in that order. This results in the neighbor indices \(\mathcal{N}_3 (x_0) = (2, 5, 7)\), and the corresponding \(y_i\)-values: \((y_2, y_5, y_7)\). The predicted value \(\hat{y}_0\) is then computed as:

\[ \frac{y_2 + y_5 + y_7}{3} = \hat{y}_0 \]

22.2.2 KNN Estimator

In general, the plain KNN estimator can be seen as an average of \(y_i\)’s:

\[ \hat{f}(x_0) = \frac{1}{k} \sum_{i \in \mathcal{N}_0} y_i \quad \longrightarrow \quad \hat{f}(x_0) = \sum_{i \in \mathcal{N}_0} w_i(x_0) \hspace{1mm} y_i \]

where \(w_i (x_0)\) is a weight of the form:

\[ w_i (x_0) = \begin{cases} \frac{1}{k} \quad \text{if} & i \in \mathcal{N}_k (x_0) \\ 0 \quad \text{else} \end{cases} \]

Under this formulation, the standard KNN estimator is simply a constant average of the \(y_i\) (with \(i \in \mathcal{N}_{k(x_0)}\)).

22.2.3 How to find \(k\)?

As we mentioned above, \(k\) is a tuning parameter, and therefore it cannot be found analytically. Instead, you have to implement a trial and error process with various values for \(k\), and determine which seems to be a good one. How? Typically with cross-validation, or other type of resampling approach. Here’s a description of the steps to be carried out.

Assumet that we have a training data set consisting of \(n\) data points: \(\mathcal{D}_{train} = (\mathbf{x_1}, y_1), \dots, (\mathbf{x_n}, y_n)\). To avoid confusion between the number of neighbors \(k\), and the number of folds \(K\), here we are going to use \(Q = K\) to indicate the index of folds.

Using \(Q\)-fold cross-validation, we (randomly) split the data into \(Q\) folds:

\[ \mathcal{D}_{train} = \mathcal{D}_{fold-1} \cup \mathcal{D}_{fold-2} \dots \cup \mathcal{D}_{fold-Q} \]

Each fold set \(\mathcal{D}_{fold-q}\) will play the role of an evaluation set \(\mathcal{D}_{eval-q}\). Having defined \(Q\) fold sets, we form the corresponding \(Q\) retraining sets:

\(\mathcal{D}_{train-1} = \mathcal{D}_{train} \setminus \mathcal{D}_{fold-1}\)
\(\mathcal{D}_{train-2} = \mathcal{D}_{train} \setminus \mathcal{D}_{fold-2}\)
\(\dots\)
\(\mathcal{D}_{train-Q} = \mathcal{D}_{train} \setminus \mathcal{D}_{fold-Q}\)

The cross-validation procedure then repeats the following loops:

For \(k = 1, 3, \dots, B\), (number of neighbors)
- For \(q = 1, \dots, Q\), (fold number)
  - fit KNN \(h_{k,q}\) with \(k_b\)-neighbors on \(\mathcal{D}_{train-q}\)
  - compute and store \(E_{eval-q} (h_{k,q})\) using \(\mathcal{D}_{eval-q}\)
- end for \(q\)
- compute and store \(E_{cv_{k}} = \frac{1}{Q} \sum_q E_{eval-q}(h_{k,q})\)
end for \(k\)
Compare all cross-validation errors \(E_{cv_1}, E_{cv_2}, \dots, E_{cv_B}\) and choose the smallest of them, say \(E_{cv_{b^*}}\)
Use \(k^*\) to fit the (finalist) KNN model

Properties of KNN estimator

Here are some important remarks:

Notice that the function that “generates” the weights \(w_i (x_0)\) is discontinuous, which explains why the fitted curve is jagged.
Notice also that the formulation of \(\hat{f}(x_0)\) as a weighted average of neighboring \(y_i\) is linear with respect to the response \(Y\).
KNN, compared to a parametric model, is very difficult to interpret, since there are no coefficients or analytical parameters.
The parameter \(k\) is a tuning or hyperparameter.
Large values for \(k\) result in more inflexible fits (i.e. more bias), but
also less variance.
Small values for \(k\) result in more flexible fits (i.e. less bias), at the expense of increasing the variance.