# 9 MSE of Estimator

In this chapter we give a brief review of the Mean Squared Error (MSE) of an estimator. This will allow a gentler introduction to the next chapter on the famous Bias-Variance tradeoff.

## 9.1 MSE of an Estimator

In order to discuss the bias-variance decomposition of a regression function and
its expected MSE, we would like to first review the concept of the mean squared
error of an estimator. Recall that *estimation* consists of
providing an approximate value to the parameter of a population, using a
(random) sample of observations drawn from such population.

Say we have a population of \(n\) objects and we are interested in describing them
with some numeric characteristic \(\theta\). For example, our population may be
formed by all students in some college, and we want to know their average
height. We call this (theoretical) average the *parameter*.

To estimate the value of the parameter, we may draw a random sample of \(m < n\) students from the population and compute a statistic \(\hat{\theta}\). Ideally, we would use some statistic \(\hat{\theta}\) that approximates well the parameter \(\theta\).

In practice, this is the typical process that you would carry out:

- Get a random sample from a population.
- Use the limited amount of data in the sample to estimate \(\theta\) using some formula to compute \(\hat{\theta}\).
- Make a statement about how reliable an estimator \(\hat{\theta}\) is.

Now, for illustration purposes, let’s do the following mental experiment. Pretend that you can draw multiple random samples, all of the same size \(m\), from the population under study. In fact, you should pretend that you can get an infinite number of samples. And suppose that for each sample you compute a statistic \(\hat{\theta}\). A first random sample of size \(m\) would result in \(\hat{\theta}_1\). A second random sample of size \(m\) would result in \(\hat{\theta}_2\). And so on.
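This mental experiment is easy to mimic with a short simulation. The sketch below uses made-up numbers (the population of heights, the sample size, and the seed are all assumptions for illustration): it draws several random samples of the same size \(m\) and computes the sample mean of each one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: heights (in cm) of 10,000 students.
population = rng.normal(loc=170, scale=10, size=10_000)
theta = population.mean()  # the parameter: the true average height

m = 25  # size of each random sample

# Each random sample of size m yields its own estimate theta_hat_k.
for k in range(1, 6):
    sample = rng.choice(population, size=m, replace=False)
    theta_hat = sample.mean()
    print(f"theta_hat_{k} = {theta_hat:.2f}   (theta = {theta:.2f})")
```

Every pass through the loop plays the role of one sample in the mental experiment: the estimates \(\hat{\theta}_1, \hat{\theta}_2, \dots\) scatter around \(\theta\).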

A couple of important things to notice:

- An estimator is a random variable:
  - A first sample will result in \(\hat{\theta}_1\)
  - A second sample will result in \(\hat{\theta}_2\)
  - A third sample will result in \(\hat{\theta}_3\)
  - and so on …
- An estimator will not always hit the parameter:
  - Some samples will yield a \(\hat{\theta}_k\) that overestimates \(\theta\)
  - Other samples will yield a \(\hat{\theta}_k\) that underestimates \(\theta\)
  - Some samples will yield a \(\hat{\theta}_k\) matching \(\theta\)

In theory, we could get a very large number of samples, and visualize the distribution of \(\hat{\theta}\), like in the figure below:

As you would expect, some estimates will fall close to the parameter \(\theta\), while others will fall farther away.

Under general conditions, the estimator has an expected value \(\mathbb{E}(\hat{\theta})\) and a finite variance \(\text{Var}(\hat{\theta})\).
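Both quantities can be approximated by brute force. A minimal sketch, again with assumed values for the population mean, standard deviation, and sample size, where \(\hat{\theta}\) is the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 170.0  # assumed true population mean (cm)
sigma = 10.0   # assumed population standard deviation
m = 25         # sample size

# Draw 50,000 samples of size m and compute the sample mean of each one.
theta_hats = rng.normal(theta, sigma, size=(50_000, m)).mean(axis=1)

print(f"E(theta_hat)   ~ {theta_hats.mean():.2f}")  # close to theta
print(f"Var(theta_hat) ~ {theta_hats.var():.2f}")   # theory: sigma^2 / m = 4
```

For the sample mean of a normal population, theory gives \(\mathbb{E}(\hat{\theta}) = \theta\) and \(\text{Var}(\hat{\theta}) = \sigma^2 / m\), which the simulated values should approximate.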

An interesting question to consider is:

In general, how different (or how similar) is \(\hat{\theta}\) from \(\theta\)?

To be more concrete: on average, how close do we expect the estimator to be to the parameter? To answer this question, we look for a measure of the typical distance between estimators and the parameter.

This involves looking at the difference: \(\hat{\theta} - \theta\),
which is commonly referred to as the **estimation error**:

\[ \text{estimation error} = \hat{\theta} - \theta \tag{9.1} \]

We would like to measure the “size” of such difference. Notice that the estimation error is also a random variable:

- A first sample will result in an error \(\hat{\theta}_1 - \theta\)
- A second sample will result in an error \(\hat{\theta}_2 - \theta\)
- A third sample will result in an error \(\hat{\theta}_3 - \theta\)
- and so on …

So how do we measure the “size” of the estimation errors? The typical way to
quantify the amount of estimation error is by calculating the squared errors,
and then averaging over all the possible values of the estimators.
This is known as the **Mean Squared Error** (MSE) of \(\hat{\theta}\):

\[ \text{MSE}(\hat{\theta}) = \mathbb{E} [(\hat{\theta} - \theta)^2] \tag{9.2} \]

The MSE is the squared distance from our estimator \(\hat{\theta}\) to the true value \(\theta\), averaged over all possible samples.
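Equation (9.2) can be approximated by simulation: compute the squared error \((\hat{\theta}_k - \theta)^2\) for many samples and average. A sketch under the same assumed setup (sample mean of a normal population), for which theory gives \(\text{MSE} = \sigma^2 / m\):

```python
import numpy as np

rng = np.random.default_rng(2)

theta = 170.0  # assumed true population mean
sigma = 10.0   # assumed population standard deviation
m = 25         # sample size

# Sample mean for each of 100,000 simulated samples of size m.
theta_hats = rng.normal(theta, sigma, size=(100_000, m)).mean(axis=1)

# Average the squared estimation errors over all simulated samples.
mse_mc = np.mean((theta_hats - theta) ** 2)

# For the sample mean, MSE = sigma^2 / m = 4 here (the estimator is unbiased).
print(f"Monte Carlo MSE ~ {mse_mc:.3f}")
```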

It is convenient to express the estimation error, \(\hat{\theta} - \theta\), in terms of \(\mathbb{E}(\hat{\theta})\). In other words, the distance between \(\hat{\theta}\) and \(\theta\) can be decomposed by adding and subtracting the expected value \(\mathbb{E}(\hat{\theta})\):

Let’s rewrite \((\hat{\theta} - \theta)^2\) as \(( \hat{\theta} - \mathbb{E}(\hat{\theta}) + \mathbb{E}(\hat{\theta}) - \theta)^2\), and let \(\mathbb{E}(\hat{\theta}) = \mu_{\hat{\theta}}\). Then:

\[\begin{align*} (\hat{\theta} - \theta)^2 &= \left ( \hat{\theta} - \mathbb{E}(\hat{\theta}) + \mathbb{E}(\hat{\theta}) - \theta \right )^2 \\ &= ( \hat{\theta} - \mu_{\hat{\theta}} + \mu_{\hat{\theta}} - \theta )^2 \\ &= (\underbrace{\hat{\theta} - \mu_{\hat{\theta}}}_{a} + \underbrace{\mu_{\hat{\theta}} - \theta}_{b})^2 \\ &= a^2 + b^2 + 2ab \\ \Longrightarrow \mathbb{E} \left [ (\hat{\theta} - \theta)^2 \right ] &= \mathbb{E}[a^2 + b^2 + 2ab] \tag{9.3} \end{align*}\]

We have that \(\text{MSE}(\hat{\theta}) = \mathbb{E} [(\hat{\theta} - \theta)^2]\) can be decomposed as:

\[\begin{align*} \mathbb{E} \left [ (\hat{\theta} - \theta)^2 \right ] &= \mathbb{E}[a^2 + b^2 + 2ab] \\ &= \mathbb{E}(a^2) + \mathbb{E}(b^2) + 2\mathbb{E}(ab) \\ &= \mathbb{E} [ (\hat{\theta} - \mu_{\hat{\theta}})^2 ] + \mathbb{E} [ (\mu_{\hat{\theta}} - \theta)^2 ] + 2\mathbb{E}(ab) \tag{9.4} \end{align*}\]

Notice that the cross term \(\mathbb{E}(ab)\) vanishes. Because \(b = \mu_{\hat{\theta}} - \theta\) is a constant, and \(\mathbb{E}(\hat{\theta} - \mu_{\hat{\theta}}) = \mathbb{E}(\hat{\theta}) - \mu_{\hat{\theta}} = 0\):

\[ \mathbb{E}(ab) = \mathbb{E}[ (\hat{\theta} - \mu_{\hat{\theta}}) (\mu_{\hat{\theta}} - \theta) ] = (\mu_{\hat{\theta}} - \theta) \, \mathbb{E}(\hat{\theta} - \mu_{\hat{\theta}}) = 0 \tag{9.5} \]

Consequently

\[\begin{align*} \text{MSE}(\hat{\theta}) &= \mathbb{E} \left [ (\hat{\theta} - \theta)^2 \right ] \\ & \\ &= \mathbb{E} [ (\hat{\theta} - \mu_{\hat{\theta}})^2 ] + \mathbb{E} [ (\mu_{\hat{\theta}} - \theta)^2 ] \\ & \\ &= \mathbb{E} [(\hat{\theta} - \mu_{\hat{\theta}})^2] + \mathbb{E} [ (\mu_{\hat{\theta}} - \theta) ]^2 \\ & \\ &= \underbrace{\mathbb{E} [(\hat{\theta} - \mu_{\hat{\theta}})^2]}_{\text{Variance}} + (\underbrace{\mu_{\hat{\theta}} - \theta}_{\text{Bias}})^2 \\ & \\ &= \text{Var}(\hat{\theta}) + \text{Bias}^{2} (\hat{\theta}) \tag{9.6} \end{align*}\]

The MSE of an estimator can be decomposed in terms of Bias and Variance.

**Bias**, \(\mu_{\hat{\theta}} - \theta\), is the tendency of \(\hat{\theta}\) to systematically overestimate or underestimate \(\theta\) over all possible samples. **Variance**, \(\text{Var}(\hat{\theta})\), measures the average variability of the estimator around its expected value \(\mathbb{E}(\hat{\theta})\).

In summary, the MSE of an estimator can be described as the sum of a term measuring how far off the estimator is, on average, from the true parameter (the squared bias), and a term measuring the variability of the estimator around its own expected value (the variance).
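The decomposition in (9.6) can be checked numerically. The sketch below uses a deliberately biased statistic, the "plug-in" variance estimator (dividing by \(m\) rather than \(m - 1\)), so that both terms are nonzero; the parameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

sigma = 10.0        # assumed population standard deviation
theta = sigma ** 2  # parameter of interest: the population variance
m = 20              # sample size

# Biased "plug-in" estimator of the variance: divides by m, not m - 1.
samples = rng.normal(0.0, sigma, size=(200_000, m))
estimates = samples.var(axis=1, ddof=0)

mse = np.mean((estimates - theta) ** 2)  # E[(theta_hat - theta)^2]
variance = estimates.var()               # Var(theta_hat)
bias = estimates.mean() - theta          # theory: -sigma^2 / m = -5 here

print(f"MSE          ~ {mse:.1f}")
print(f"Var + Bias^2 ~ {variance + bias ** 2:.1f}")  # matches the MSE
```

The two printed values agree, confirming \(\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + \text{Bias}^2(\hat{\theta})\) on the simulated sampling distribution.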

### 9.1.1 Prototypical Cases of Bias and Variance

Depending on the type of estimator \(\hat{\theta}\), and the sample size \(m\), we can get statistics having different behaviors. The following diagram illustrates four classic scenarios contrasting low and high values for both the bias and the variance.