7 Regression via Maximum Likelihood

In this chapter we describe how Maximum Likelihood (ML) estimation is used to obtain the estimated parameters (i.e. the coefficients) of a Linear Regression model.

We have introduced Ordinary Least Squares (OLS) from a purely geometric perspective, as well as from a purely model-fitting perspective.

7.1 Linear Regression Reminder

As previously mentioned, in a regression model we use one or more features \(X\) to say something about the response \(Y\). With a linear regression model we combine the features in a linear way in order to approximate the response. In the univariate case, we have a linear equation:

\[ \hat{y}_i = b_0 + b_1 x_i \]

and, in the more general case when we have \(p > 1\) predictors:

\[ \hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} \]

In matrix notation, we define:

\[ \mathbf{X} = \ \begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \\ \end{bmatrix}, \qquad \mathbf{\hat{y}} = \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{n} \\ \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} b_{0} \\ b_{1} \\ \vdots \\ b_{p} \\ \end{bmatrix} \]

With the matrix of features, the vector of predicted responses, and the vector of coefficients, we have a compact expression for the predicted outcomes:

\[ \mathbf{\hat{y}} = \mathbf{Xb} \]
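As a quick numerical sketch (the data values and variable names below are made up for illustration, not taken from the text), the prediction \(\mathbf{\hat{y}} = \mathbf{Xb}\) is a single matrix-vector product:

```python
import numpy as np

# Hypothetical data: n = 5 observations and p = 2 features (values made up)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Design matrix X: a leading column of ones for the constant term b0
X = np.column_stack([np.ones_like(x1), x1, x2])   # shape (n, p + 1)

# A coefficient vector b = (b0, b1, b2), chosen arbitrarily for the example
b = np.array([0.5, 1.2, -0.7])

# Compact expression for the predicted outcomes: y_hat = X b
y_hat = X @ b
print(y_hat)   # one predicted value per observation
```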

In path diagram form, the linear model looks like this:


Figure 7.1: Linear combination with constant term

Notice that we haven’t said anything about \(Y\) and \(X_1, \dots, X_p\) other than that \(Y\) is a real-valued variable, and that the \(X\)-variables have been encoded numerically. In other words, we have not mentioned anything about stochastic assumptions, nor have we made any requirements about their distributions.

All we cared about in the Linear Regression chapter were the algebraic aspects, the minimization criterion, and its solution.

It is possible, however, to consider certain theoretical assumptions about the model \(f : \mathcal{X} \to \mathcal{Y}\).

Consider a linear regression model:

\[ y_i = b_0 x_{i0} + b_1 x_{i1} + \dots + b_p x_{ip} + \varepsilon_i = \mathbf{b^\mathsf{T} x_i} + \varepsilon_i \]

and assume that the noise terms \(\varepsilon_i\) are independent and have a Gaussian distribution with mean zero and constant variance \(\sigma^2\):

\[ \varepsilon_i \sim N(0, \sigma^2) \]

Under this assumption, how do we obtain the parameters \(\mathbf{b} = (b_0, b_1, \dots, b_p)\) of a linear regression model? We’ll describe how to obtain a solution via Maximum Likelihood (ML).

Let’s start with the following model:

\[ y = f\left( X_1, \dots, X_p \right) + \varepsilon \ \Longleftrightarrow \ y_i = f\left( x_{i1}, \dots, x_{ip} \right) + \varepsilon_i \]

Our goal is to estimate \(f\), somehow. Why do we include an error term? Well, it is unlikely that the input variables will be able to express all of the behavior of the response variable. Hence, we add back this “error” in the form of the \(\varepsilon\) term.

Now, to proceed with regression, we will need to make a few assumptions.

Soft Assumptions: (a.k.a. Gauss-Markov Assumptions)

  • Assume that \(\varepsilon_i\) is random. This requires us to establish a distributional assumption; we can pick any distribution so long as \(\mathrm{mean}(\varepsilon_i) = 0\) and \(\mathrm{var}(\varepsilon_i) = \sigma^2\) with \(\mathrm{cor}(\varepsilon_i, \varepsilon_{\ell}) = 0\) for \(i \neq \ell\).

  • These assumptions enable us to state the Gauss-Markov Theorem: \(\mathbf{\hat{b}}_{\mathrm{OLS}}\) is BLUE: the Best Linear Unbiased Estimator. We take “best” to mean “lowest variance.”

Hard Assumptions: (i.e. a bit more restrictive)

  • Assume that \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\). This in turn implies that \(y_i \sim \mathcal{N}(\mathbf{b}^{\mathsf{T}} \mathbf{x}_i, \ \sigma^2)\).

The benefit of these hard assumptions is that they enable us to perform Maximum Likelihood Estimation.
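To make the hard assumptions concrete, here is a small simulation sketch; the true coefficients and noise level are hypothetical choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 100, 2
sigma = 0.5                           # hypothetical noise standard deviation
b_true = np.array([1.0, 2.0, -1.5])   # hypothetical true coefficients (b0, b1, b2)

# Design matrix with an intercept column and p numeric features
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Hard assumption: eps_i ~ N(0, sigma^2), independent across observations
eps = rng.normal(loc=0.0, scale=sigma, size=n)

# Equivalently, y_i ~ N(b^T x_i, sigma^2)
y = X @ b_true + eps
```

The numerical sketches later in the chapter generate data in this same way.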

7.1.1 Maximum Likelihood

For this section, we take the “hard assumptions” as given. Then, the joint distribution of \(y_1, y_2, \dots, y_n\) is

\[\begin{align} P\left(\mathbf{y} \mid \mathbf{X}, \mathbf{b}, \sigma^2 \right) &= \prod_{i=1}^{n} f\left(y_i ; \mathbf{X}, \mathbf{b}, \sigma^2 \right) \\ &= \prod_{i=1}^{n} \frac{1}{\sqrt{ 2 \pi \sigma^2}} \mathrm{exp}\left\{ - \frac{1}{2 \sigma ^2} \left( y_i - \mathbf{b}^\mathsf{T} \mathbf{x_i} \right)^2 \right\} \\ &= \left( 2 \pi \sigma^2 \right)^{- \frac{n}{2} } \exp\left\{ - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} \left( y_i - \mathbf{b}^\mathsf{T} \mathbf{x_i} \right)^2 \right\} \\ \end{align}\]

Taking the logarithm, we get:

\[\begin{align} \ell &= \log\left[ P\left(\mathbf{y} \mid \mathbf{X}, \mathbf{b}, \sigma^2 \right) \right] \\ &= -\frac{n}{2} \ln\left( 2 \pi \sigma^2 \right) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} \left( y_i - \mathbf{b}^\mathsf{T} \mathbf{x_i} \right)^{2} \end{align}\]
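As a sanity-check sketch, this log-likelihood can be coded directly; the function and argument names are illustrative, not from the text.

```python
import numpy as np

def log_likelihood(b, sigma2, X, y):
    """Gaussian log-likelihood of the linear model, as derived above."""
    n = len(y)
    resid = y - X @ b                   # residuals y_i - b^T x_i
    rss = resid @ resid                 # sum of squared residuals
    return -0.5 * n * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)
```

Given a data set \((\mathbf{X}, \mathbf{y})\), different candidate coefficient vectors can then be compared by their log-likelihood values.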

Why did we do this? Well, suppose we want to find \(\mathbf{\hat{b}}_{\mathrm{ML}}\), the maximum likelihood estimate of \(\mathbf{b}\). To do so, we rewrite the log-likelihood in matrix form, take its derivative with respect to \(\mathbf{b}\), set it equal to \(\mathbf{0}\), and solve:

\[\begin{align} \ell &= -\frac{n}{2} \ln\left( 2 \pi \sigma^2 \right) - \frac{1}{2 \sigma^2} \left( \mathbf{y} - \mathbf{X} \mathbf{b} \right)^{\mathsf{T}} \left( \mathbf{y} - \mathbf{X} \mathbf{b} \right) \\ &= c - \frac{1}{2 \sigma^2} \left( \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} - 2 \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{y} + \mathbf{y}^\mathsf{T} \mathbf{y} \right) \\ \end{align}\]

where \(c = -\frac{n}{2} \ln\left( 2 \pi \sigma^2 \right)\) does not depend on \(\mathbf{b}\).
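The differentiation in the next step relies on two standard matrix-calculus identities, stated here for reference since they are not derived in the text:

\[ \frac{\partial}{\partial \mathbf{b}} \left( \mathbf{b}^\mathsf{T} \mathbf{A} \mathbf{b} \right) = 2 \mathbf{A} \mathbf{b} \quad \text{(for symmetric } \mathbf{A} \text{)}, \qquad \frac{\partial}{\partial \mathbf{b}} \left( \mathbf{b}^\mathsf{T} \mathbf{v} \right) = \mathbf{v} \]

Here they are applied with \(\mathbf{A} = \mathbf{X}^\mathsf{T} \mathbf{X}\) (which is symmetric) and \(\mathbf{v} = \mathbf{X}^\mathsf{T} \mathbf{y}\); the terms \(c\) and \(\mathbf{y}^\mathsf{T} \mathbf{y}\) vanish under differentiation.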

Taking partial derivatives:

\[ \frac{\partial \ell}{\partial \mathbf{b}} = -\frac{1}{2 \sigma^2} \left( 2 \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} - 2 \mathbf{X}^\mathsf{T} \mathbf{y} \right) = \mathbf{0} \\ \Rightarrow \ \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} = \mathbf{X}^\mathsf{T} \mathbf{y} \]

\[ \mathbf{\hat{b}}_{\mathrm{ML}} = (\mathbf{X^\mathsf{T} X})^{-1} \mathbf{X^\mathsf{T}y} \]

These are precisely the normal equations. That is, if \(\mathbf{X^\mathsf{T} X}\) is invertible, the maximum likelihood estimator of \(\mathbf{b}\) is exactly the same as the OLS estimate of \(\mathbf{b}\).
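A minimal numerical sketch of this equivalence, using simulated data with hypothetical true values (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data under the hard assumptions (hypothetical true values)
n, p, sigma = 200, 2, 0.5
b_true = np.array([1.0, 2.0, -1.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ b_true + rng.normal(scale=sigma, size=n)

# ML estimate via the normal equations: (X^T X) b = X^T y
b_ml = np.linalg.solve(X.T @ X, X.T @ y)

# OLS estimate from a standard least-squares routine
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(b_ml, b_ols))   # True: ML and OLS coincide
```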

7.1.2 ML Estimator of \(\sigma^2\)

Let us now determine the ML estimate of the other model parameter: \(\sigma^2\) (the constant variance).

The likelihood of \(\mathbf{y}\), that is, the joint distribution of \(\mathbf{y} = (y_1, \dots, y_n)\) given \(\mathbf{X}\), \(\mathbf{b}\), and \(\sigma\) is:

\[\begin{align*} p(\mathbf{y} \mid \mathbf{X}, \mathbf{b}, \sigma) &= \prod_{i=1}^{n} p(y_i \mid \mathbf{x_i}, \mathbf{b}, \sigma) \\ &= \prod_{i=1}^{n} \left(2 \pi \sigma^2\right)^{-1/2} \hspace{1mm} \exp\left\{ - \frac{1}{2}\left( \frac{y_i - \mathbf{b^\mathsf{T}x_i}}{\sigma} \right)^2 \right\} \\ &= \left(2 \pi \sigma^2\right)^{-n/2} \hspace{1mm} \exp \left\{-\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{Xb})^\mathsf{T}(\mathbf{y} - \mathbf{Xb}) \right\} \end{align*}\]

In order to find the ML estimate of \(\sigma\), we calculate the (partial) derivative of the log-likelihood with respect to \(\sigma\):

\[\begin{align*} \frac{\partial \ell(\sigma)}{\partial \sigma} &= \frac{\partial}{\partial \sigma} \left[ -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{Xb})^\mathsf{T}(\mathbf{y} - \mathbf{Xb}) \right] \\ &= -\frac{n}{2} \frac{\partial}{\partial \sigma} \left[ \ln(\sigma^2) \right] - \frac{\partial}{\partial \sigma} \left[ \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{Xb})^\mathsf{T}(\mathbf{y} - \mathbf{Xb}) \right] \\ &= -\frac{n}{\sigma} + \sigma^{-3} (\mathbf{y} - \mathbf{Xb})^\mathsf{T}(\mathbf{y} - \mathbf{Xb}) \end{align*}\]

Equating to zero and solving for \(\sigma^2\), we get: \[\begin{align*} \frac{n}{\sigma} &= \frac{1}{\sigma^3} (\mathbf{y} - \mathbf{Xb})^\mathsf{T}(\mathbf{y} - \mathbf{Xb}) \\ n \sigma^2 &= (\mathbf{y} - \mathbf{Xb})^\mathsf{T}(\mathbf{y} - \mathbf{Xb}) \\ \sigma^2 &= \frac{1}{n} (\mathbf{y} - \mathbf{Xb})^\mathsf{T}(\mathbf{y} - \mathbf{Xb}) \\ \sigma^2 &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{b^\mathsf{T} x_i})^2 \end{align*}\]

Plugging in \(\mathbf{\hat{b}}_{\mathrm{ML}}\) for \(\mathbf{b}\), the ML estimator of the variance is the average of the squared residuals: \(\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \mathbf{\hat{b}}^\mathsf{T} \mathbf{x}_i \right)^2\).
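Continuing the numerical sketch (again with hypothetical simulated data), the ML estimate of the variance is the average squared residual:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with hypothetical true values
n, sigma_true = 500, 0.5
b_true = np.array([1.0, 2.0, -1.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ b_true + rng.normal(scale=sigma_true, size=n)

# ML / OLS coefficients from the normal equations
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# ML estimate of sigma^2: RSS / n (note: divides by n, not by n - p - 1)
resid = y - X @ b_hat
sigma2_ml = (resid @ resid) / n
print(sigma2_ml, sigma_true**2)   # the two values should be close for large n
```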