23 Kernel Smoothers

23.1 Introduction

In the previous chapter, we described the KNN estimator as the arithmetic mean, or average, of the \(y_i\) values associated with the \(k\) observations \(x_i\) closest to the query point \(x_0\):

\[ \hat{f}(x_0) = \frac{1}{k} \sum_{i \in \mathcal{N}_0} y_i \quad \longrightarrow \quad \hat{f}(x_0) = \sum_{i \in \mathcal{N}_0} w_i(x_0) \hspace{1mm} y_i \]

As you can tell, such an estimate can be expressed in the generic form of a weighted average of the \(y_i\) (with \(i \in \mathcal{N}_0\)), in which all the weights are constant: \(w_i(x_0) = 1/k\).

This way of handling the weights \(w_i(x_0)\) gives the same importance to the \(k\) nearest neighbors of \(x_0\), regardless of how close or how far they are from the query point, which can sometimes be problematic.

Consider the figure below. Suppose that we are interested in predicting \(y_0\) using the \(k\) closest neighbors. With \(k = 10\), the local information from the \(x_i\) values is concentrated around the observed query point \(x_0\).

Figure 23.1: k = 10 Closest Neighbors

However, if we consider more neighbors \(k\), some of the \(x_i\) values are now considerably farther from the query point \(x_0\):

Figure 23.2: k = 20 Closest Neighbors

Instead of giving the same importance to all the neighbors, we could assign them variable importance depending on their proximity to \(x_0\). Values \(x_i\) that are closer to \(x_0\) should receive more importance via a bigger weight, while values \(x_i\) that are farther from \(x_0\) should receive less importance via a smaller weight.

Figure 23.3: k = 20 Closest Neighbors, Weighted

Consequently, we can modify the formula of the KNN estimator, replacing the constant weights with variable weights. This modification results in an estimator that computes \(\hat{f}(x_0)\) as a weighted average.

23.2 Kernel Smoothers

We are looking for a weighted average estimator

\[ \hat{f}(x_0) = \sum_{i \in \mathcal{N}_0} w_i(x_0) \hspace{1mm} y_i \]

in which the weights \(w_i(x_0)\) take into account the proximity of \(x_i\) to \(x_0\).

One very interesting option for determining the weights is with kernel functions: functions that assign each \(x_i\) an importance based on its distance from the query point \(x_0\).

23.2.1 Kernel Functions

What is a kernel? The term “kernel” is somewhat of an umbrella term in the statistical and machine learning world. You will often hear people talking about kernel functions, kernel methods, kernel regression, kernel classifiers, just to mention a few. With respect to nonparametric regression, and in particular to linear smoothers, a kernel is a function with special properties. Before looking at those properties, we should say that the type of kernel we are referring to in this chapter is different from (although related to) terms such as Mercer kernels and the so-called kernel trick.

A one-dimensional smoothing kernel is a symmetric function \(K: \mathbb{R} \rightarrow \mathbb{R}\) with the following properties:

  • finite support: \(K(u) = 0\) for \(|u| \geq 1\)

  • symmetry: \(K(u) = K(-u)\)

  • positive values: \(K(u) > 0\) for \(|u| < 1\)

  • normalization: \(\int K(u) \, du = 1\)

  • zero-midpoint: \(\int u \, K(u) \, du = 0\)

  • positive second moment: \(\int u^2 \, K(u) \, du > 0\)

Some of the most common kernel functions are the following (strictly speaking, the Gaussian kernel does not have finite support, but it is usually included in this family because its tails decay very quickly):

\[ \begin{aligned} \text{Uniform} & \qquad K(u) = \frac{1}{2}, \quad \text{for } |u| \leq 1 \\ & \\ \text{Gaussian} & \qquad K(u) = \frac{1}{\sqrt{2 \pi}} \exp(-u^2 / 2) \\ & \\ \text{Epanechnikov} & \qquad K(u) = \frac{3}{4} (1 - u^2), \quad \text{for } |u| \leq 1\\ & \\ \text{Biweight} & \qquad K(u) = \frac{15}{16} (1 - u^2)^2, \quad \text{for } |u| \leq 1 \\ & \\ \text{Triweight} & \qquad K(u) = \frac{35}{32} (1 - u^2)^3, \quad \text{for } |u| \leq 1 \end{aligned} \]

Figure 23.4: Various Kernels (source: wikipedia)

In case you are curious, Wikipedia’s entry about kernel functions has many more options: https://en.wikipedia.org/wiki/Kernel_(statistics)
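
To make these definitions concrete, here is a small Python sketch (the function names and the numerical check are ours, not part of the text) that implements a few of these kernels and verifies the normalization property numerically:

```python
import numpy as np

# A few common smoothing kernels, vectorized over u
def uniform(u):
    return np.where(np.abs(u) <= 1, 0.5, 0.0)

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u**2), 0.0)

def gaussian(u):
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

# Numerical check of the normalization property: integral of K(u) du = 1
u = np.linspace(-5, 5, 100001)
du = u[1] - u[0]
for name, K in [("uniform", uniform), ("epanechnikov", epanechnikov), ("gaussian", gaussian)]:
    print(name, round(np.sum(K(u)) * du, 4))   # each value should be (very close to) 1
```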

23.2.2 Weights from Kernels

Choosing any type of one-dimensional kernel \(K\), together with a bandwidth (or smoothing parameter) \(h > 0\), we can construct weights

\[ w_i (x_0) = \frac{K \big(\frac{x_i - x_0}{h} \big)}{\sum_{j=1}^{n} K \big( \frac{x_j - x_0}{h} \big)} \]

Weights obtained in this way have the property that they add up to 1:

\[ \sum_{i=1}^{n} w_i (x_0) = 1 \]
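
Continuing the sketch above, one possible way to compute these normalized weights for a query point \(x_0\) (the helper names, bandwidth value, and toy data are illustrative choices, not prescribed by the text):

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel: 3/4 (1 - u^2) for |u| <= 1, and 0 otherwise
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u**2), 0.0)

def kernel_weights(x, x0, h, kernel=epanechnikov):
    # w_i(x0) = K((x_i - x0)/h) / sum_j K((x_j - x0)/h)
    # assumes at least one x_i falls within the kernel support around x0
    k = kernel((x - x0) / h)
    return k / k.sum()

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=20))
w = kernel_weights(x, x0=5.0, h=2.0)
print(w.round(3))
print(w.sum())   # the weights add up to 1
```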

23.2.3 Kernel Estimator

The kernel estimator is

\[ \hat{f}(x_0) = \frac{\sum_{i=1}^{n} y_i K \big(\frac{x_i - x_0}{h} \big)}{\sum_{i=1}^{n} K \big(\frac{x_i - x_0}{h} \big)} = \sum_{i=1}^{n} w_i(x_0) y_i \]

The kernel estimator can also be obtained as the solution of a minimization problem: it is the value \(\theta\) that minimizes the localized squared error

\[ \sum_{i=1}^{n} K \left( \frac{x_i - x_0}{h} \right) (y_i - \theta)^2 \]
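
To make the omitted step explicit, we can differentiate this criterion with respect to \(\theta\) and set the derivative to zero:

\[ \frac{\partial}{\partial \theta} \sum_{i=1}^{n} K \left( \frac{x_i - x_0}{h} \right) (y_i - \theta)^2 = -2 \sum_{i=1}^{n} K \left( \frac{x_i - x_0}{h} \right) (y_i - \theta) = 0 \]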

Solving for \(\theta\), the minimizer \(\hat{\theta} = \hat{f}(x_0)\) is given by:

\[ \hat{f}(x_0) = \frac{\sum_{i=1}^{n} y_i K \big(\frac{x_i - x_0}{h} \big)}{\sum_{i=1}^{n} K \big(\frac{x_i - x_0}{h} \big)} = \sum_{i=1}^{n} w_i(x_0) y_i \]

This estimator is also known as the Nadaraya-Watson (NW) estimator, and it is a local constant estimator.

Why do we say that \(\hat{f}()\) is a local constant estimator? Because around each query point \(x_0\) it fits a single constant \(\theta\), which amounts to a local weighted average of the \(y_i\)’s.

If we use the Epanechnikov kernel, it turns out that the NW estimator is optimal, in the sense of minimizing the asymptotic mean integrated squared error. But this is not something to get obsessed about. In practice, you can use almost any flavor of kernel function and obtain very similar results.
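
As an illustration, here is a minimal Python sketch of the NW estimator (the Gaussian kernel, the bandwidth \(h = 0.5\), and the noisy-sine toy data are arbitrary choices for demonstration):

```python
import numpy as np

def gaussian(u):
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def nw_estimator(x0, x, y, h, kernel=gaussian):
    # Nadaraya-Watson estimate at x0: a kernel-weighted average of the y values
    k = kernel((x - x0) / h)
    return np.sum(k * y) / np.sum(k)

# Toy data: noisy sine curve
rng = np.random.default_rng(123)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Fitted curve on a grid of query points
x_grid = np.linspace(0, 2 * np.pi, 200)
y_hat = np.array([nw_estimator(x0, x, y, h=0.5) for x0 in x_grid])
```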

23.3 Local Polynomial Estimators

Let’s bring back the kernel estimator which minimizes a localized squared error function:

\[ \sum_{i=1}^{n} K \left( \frac{x_i - x_0}{h} \right) (y_i - \theta)^2 \]

Instead of using a local weighted average of the \(y_i\)’s, which is basically a local constant fit, why not use a more flexible fit such as a local linear fit or, more generally, a local polynomial fit?

This is precisely what local polynomial estimators are for.

If we use a polynomial of degree 1, that is, a line, the local minimization problem becomes:

\[ \sum_{i=1}^{n} K \left( \frac{x_i - x_0}{h} \right) \big( y_i - b_{0,x_0} - b_{1,x_0} (x_i - x_0) \big)^2 \]

The local fit around \(x_0\) is \(\hat{f}(x) \approx \hat{b}_{0,x_0} + \hat{b}_{1,x_0} (x - x_0)\), so the estimate at the query point itself is \(\hat{f}(x_0) = \hat{b}_{0,x_0}\).

If we arrange the terms in an \(n \times 2\) matrix \(\mathbf{X_0}\), and an \(n \times n\) diagonal matrix of weights \(\mathbf{W_0}\)

\[ \mathbf{X_0} = \begin{bmatrix} 1 & x_{1} - x_0 \\ 1 & x_{2} - x_0 \\ \vdots & \vdots \\ 1 & x_{n} - x_0 \end{bmatrix}, \qquad \mathbf{W_0} = \begin{bmatrix} K_h (x_1 - x_0) & 0 & \dots & 0\\ 0 & K_h (x_2 - x_0) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & K_h (x_n - x_0) \end{bmatrix} \]

where \(K_h(x_i - x_0) = K \big( \frac{x_i - x_0}{h} \big)\), then the vector of coefficients \(\mathbf{b}_{x_0} = (b_{0,x_0}, b_{1,x_0})^\mathsf{T}\) is obtained as the weighted least squares solution:

\[ \mathbf{b}_{x_0} = (\mathbf{X_0}^\mathsf{T} \mathbf{W_0 X_0})^{-1} \mathbf{X_0}^\mathsf{T} \mathbf{W_0 y} \]
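
A minimal Python sketch of this weighted least squares computation (the estimate at the query point is the intercept \(b_{0,x_0}\), i.e. the first element of \(\mathbf{b}_{x_0}\); the kernel, bandwidth, and toy data below are illustrative choices):

```python
import numpy as np

def gaussian(u):
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def local_linear(x0, x, y, h, kernel=gaussian):
    # Local linear estimate at x0 via weighted least squares:
    # X0 = [1, x_i - x0], W0 = diag(K_h(x_i - x0)),
    # b = (X0' W0 X0)^{-1} X0' W0 y, and f_hat(x0) = b[0] (the intercept).
    X0 = np.column_stack([np.ones_like(x), x - x0])
    w = kernel((x - x0) / h)     # diagonal of W0
    XtW = X0.T * w               # X0' W0 without forming the n x n matrix
    b = np.linalg.solve(XtW @ X0, XtW @ y)
    return b[0]

# Toy data: noisy sine curve
rng = np.random.default_rng(123)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

x_grid = np.linspace(0, 2 * np.pi, 200)
y_hat = np.array([local_linear(x0, x, y, h=0.5) for x0 in x_grid])
```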