3 Geometric Duality
Before discussing unsupervised and supervised learning methods, we want to give you a prelude about thinking of data in a geometric sense. This chapter will set the stage for most of the topics covered in later chapters.
Let’s suppose we have some data in the form of a data matrix. For convenience, let’s also suppose that all variables are measured on a real-valued scale. Obviously, not all data is expressed or even encoded numerically; you may have categorical or symbolic data. But for this illustration, let’s assume that any categorical or symbolic data has already been transformed into a numeric scale (e.g. dummy indicators, optimal scaling).
It’s very enlightening to look at a data matrix through the lens of geometry. The key idea is to think of the data in a matrix as elements living in a multidimensional space. In fact, we can regard a data matrix from two apparently different perspectives that, in reality, are intimately connected: the rows perspective and the columns perspective. To explain these perspectives, let’s use the following diagram of a data matrix \(\mathbf{X}\) with \(n\) rows and \(p\) columns, with \(x_{ij}\) representing the element in the \(i\)-th row and \(j\)-th column.
When we look at a data matrix from the columns perspective, we are focusing on the \(p\) variables. Likewise, when we look at it from the rows perspective, we are focusing on the \(n\) individuals. Like a coin, the matrix has two sides: a rows side and a columns side. These two views are, of course, not completely independent. This double perspective, or duality for short, is like two sides of the same coin.
3.1 Rows Space
We know that human vision is limited to three dimensions, but pretend that you have superpowers that let you visualize a space with any number of dimensions.
Because each row of the data matrix has \(p\) elements, we can regard individuals as objects that live in a \(p\)-dimensional space. For visualization purposes, think of each variable as playing the role of a dimension associated to a given axis in this space; likewise, consider each of the \(n\) individuals as being depicted as a point (or particle) in such space, like in the following diagram:
In the figure above, even though we are showing only three axes, you should pretend that you are visualizing a \(p\)-dimensional space (imagining that there are \(p\) axes). Each point in this space corresponds to a single individual, and together they form what we can call a cloud of points.
3.2 Columns Space
We can do the same visual exercise with the columns of a data matrix. Since each variable has \(n\) elements, we can regard the set of \(p\) variables as objects that live in an \(n\)-dimensional space. However, instead of representing each variable with a dot, it’s better to graphically represent them with an arrow (or vector). Why? For two reasons: one is to distinguish them from the individuals (dots); but more importantly, because the essential thing about a variable is not really its magnitude (and therefore its position) but its direction. Often, as part of the data preprocessing steps, we apply transformations on variables that change their scales (e.g. shrinking them, or stretching them) without modifying their directions. So it’s more convenient to focus primarily on their directions.
Analogously to the rows space and its cloud of individuals, you should also pretend that the image above is displaying an \(n\)-dimensional space with a bunch of blue arrows pointing in various directions.
What’s next?
Now that we know how to think of data from a geometric perspective, the next step is to discuss a handful of common operations that can be performed with points and vectors that live in some geometric space.
3.3 Cloud of Individuals
In the previous sections, we introduced the powerful idea of looking at the rows and columns of a data matrix through the lens of geometry. In general, we assume that the rows correspond to \(n\) individuals that lie in a \(p\)-dimensional space.
Let’s start by describing a set of common operations that we can apply on the individuals (living in a \(p\)-dimensional space).
3.3.1 Average Individual
We can ask about the typical or average individual.
If you only have one variable, then all the individual points lie in a one-dimensional space, which is basically a line. Here’s a simple example with five individuals described by one variable:
The most common way to think about the typical or average individual is in terms of the arithmetic average of the values, which geometrically corresponds to the “balancing point”. The diagram below shows three possible locations for a fulcrum (represented as a red triangle). Only the average value 5 results in the balancing point which keeps the values on the number line in equilibrium:
Algebraically we have: individuals \(x_1, x_2, \dots, x_n\), and the average is:
\[ \bar{x} = \frac{x_1 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{3.1} \]
In vector notation, the average can be calculated with an inner product between \(\mathbf{x} = (x_1, x_2, \dots, x_n)\) and a constant vector with \(n\) elements, \((1/n)\mathbf{1}\):
\[ \bar{x} = \frac{1}{n} \mathbf{x^\mathsf{T}1} \tag{3.2} \]
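To make this concrete, here is a minimal numerical sketch of equation (3.2), assuming Python with numpy; the values and names (x, ones) are purely illustrative.

```python
# Sketch of Eq. (3.2): the mean as an inner product with the summation vector.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 6.0, 8.0])   # five hypothetical values
n = len(x)
ones = np.ones(n)                          # the constant vector 1

x_bar = (x @ ones) / n                     # (1/n) x'1
print(x_bar)                               # 5.0
print(np.mean(x))                          # same result: 5.0
```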
What about the multivariate case? It turns out that we can also ask about the average individual of a cloud of points, like in the following figure:
The average individual in a \(p\)-dimensional space is the point \(\mathbf{\vec{g}}\) whose coordinates are the averages of all the variables:
\[ \mathbf{\vec{g}} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_p) \tag{3.3} \]
where \(\bar{x}_j\) is the average of the \(j\)-th variable.
This average individual \(\mathbf{\vec{g}}\) is also known as the centroid, barycenter, or center of gravity of the cloud of points.
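If it helps, here is a small sketch (Python with numpy, made-up data) of how the centroid in equation (3.3) is just the vector of column means.

```python
# Sketch of Eq. (3.3): the centroid g is the vector of variable (column) averages.
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [2.0, 3.0, 4.0]])   # n = 4 individuals, p = 3 variables

g = X.mean(axis=0)                # average of each variable
print(g)                          # [3.5 4.5 5.5]
```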
3.3.2 Centered Data
Often, it is convenient to transform the data in such a way that the centroid of a data set becomes the origin of the cloud of points. Geometrically, this type of transformation involves a shift of the axes in the \(p\)-dimensional space. Algebraically, this transformation corresponds to expressing the values of each variable in terms of deviations from their means.
In the unidimensional case, say we have \(n\) individuals \(\mathbf{x} = (x_1, x_2, \dots, x_n)\) with a mean of \(\bar{x} = (1/n) \sum_{i=1}^{n} x_i\). The vector of centered values is:
\[ \mathbf{\tilde{x}} = (x_1 - \bar{x}, x_2 - \bar{x}, \dots, x_n - \bar{x}) \tag{3.4} \]
In the multidimensional case, the set of centered data values are:
\[ \mathbf{x_1} - \mathbf{g}, \mathbf{x_2} - \mathbf{g}, \dots, \mathbf{x_n} - \mathbf{g} \tag{3.5} \]
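As a quick sketch (same toy matrix as before, numpy assumed), centering subtracts the centroid from every row, which places the centroid of the new cloud at the origin.

```python
# Sketch of Eq. (3.5): centering the data matrix around its centroid.
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [2.0, 3.0, 4.0]])

g = X.mean(axis=0)        # centroid
Xc = X - g                # each row becomes x_i - g (broadcasting)
print(Xc.mean(axis=0))    # ~[0. 0. 0.]: the centered cloud has its centroid at the origin
```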
3.3.3 Distance between individuals
Another common operation that we may be interested in is the distance between two individuals. Obviously, the notion of distance is not unique, since you can choose different types of distance measures. Perhaps the most common type of distance is the (squared) Euclidean distance. Unless otherwise mentioned, this will be the default distance used in this book.
If you have one variable \(X\), then the squared distance \(d^2(i,\ell)\) between two individuals \(x_i\) and \(x_\ell\) is:
\[ d^2(i,\ell) = (x_i - x_\ell)^2 \tag{3.6} \]
In general, with \(p\) variables, the squared distance between the \(i\)-th individual and the \(\ell\)-th individual is:
\[\begin{align*} d^2(i,\ell) &= (x_{i1} - x_{\ell 1})^2 + (x_{i2} - x_{\ell 2})^2 + \dots + (x_{ip} - x_{\ell p})^2 \\ &= (\mathbf{\vec{x}_i} - \mathbf{\vec{x}_\ell})^\mathsf{T} (\mathbf{\vec{x}_i} - \mathbf{\vec{x}_\ell}) \tag{3.7} \end{align*}\]
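Here is a minimal sketch of equation (3.7) with two made-up rows of a data matrix (numpy assumed); the inner product of the difference vector with itself gives the squared Euclidean distance.

```python
# Sketch of Eq. (3.7): squared Euclidean distance between two individuals.
import numpy as np

x_i = np.array([1.0, 2.0, 3.0])   # hypothetical i-th row
x_l = np.array([4.0, 5.0, 6.0])   # hypothetical l-th row

diff = x_i - x_l
d2 = diff @ diff                  # (x_i - x_l)'(x_i - x_l)
print(d2)                         # 27.0 = (-3)^2 + (-3)^2 + (-3)^2
```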
3.3.4 Distance to the centroid
A special case is the distance between any individual \(i\) and the average individual:
\[\begin{align*} d^2(i,g) &= (x_{i1} - \bar{x}_1)^2 + (x_{i2} - \bar{x}_2)^2 + \dots + (x_{ip} - \bar{x}_p)^2 \\ &= (\mathbf{\vec{x}_i} - \mathbf{\vec{g}})^\mathsf{T} (\mathbf{\vec{x}_i} - \mathbf{\vec{g}}) \tag{3.8} \end{align*}\]
3.3.5 Measures of Dispersion
What else can we calculate with the individuals? Think about it. So far we’ve seen how to calculate the average individual, as well as distances between individuals. The average individual or centroid plays the role of a measure of center. And every time you have a measure of center, it makes sense to also get a measure of spread.
Overall Dispersion
One way to compute a measure of scatter among individuals is to consider all the squared distances between pairs of individuals. For instance, say you have three individuals \(a\), \(b\), and \(c\). We can calculate all pairwise distances and add them up:
\[ d^2(a,a) + d^2(b,b) + d^2(c,c) + \\ d^2(a,b) + d^2(b,a) + \\ d^2(a,c) + d^2(c,a) + \\ d^2(b,c) + d^2(c,b) \tag{3.9} \]
In general, when you have \(n\) individuals, you can obtain up to \(n^2\) squared distances. We will give the generic name of Overall Dispersion to the sum of all squared pairwise distances:
\[ \text{overall dispersion} = \sum_{i=1}^{n} \sum_{\ell=1}^{n} d^2(i,\ell) \tag{3.10} \]
Inertia
Another measure of scatter among individuals can be computed by averaging the squared distances between all individuals and the centroid.
The average of the squared distances from each point to the centroid then becomes
\[ \frac{1}{n} \sum_{i=1}^{n} d^2(i,g) = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{\vec{x}_i} - \mathbf{\vec{g}})^\mathsf{T} (\mathbf{\vec{x}_i} - \mathbf{\vec{g}}) \tag{3.11} \]
We will name this measure Inertia, borrowing this term from the concept of inertia used in mechanics (in physics).
\[ \text{Inertia} = \frac{1}{n} \sum_{i=1}^{n} d^2(i,g) \tag{3.12} \]
What is the motivation behind this measure? Consider the \(p = 1\) case; i.e. when \(\mathbf{X}\) is simply a column vector
\[ \mathbf{X} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \\ \end{pmatrix} \tag{3.13} \]
The centroid will simply be the mean of these points: i.e. \(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\).
The sum of squared-distances from each point to the centroid then becomes:
\[ (x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{3.14} \]
Does the above formula look familiar? What if we take the average of the squared distances to the centroid?
\[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n} \tag{3.15} \]
Same question: do you recognize this formula? You’d better… This is nothing other than the formula for the variance of \(X\). And yes, we are dividing by \(n\) (not by \(n-1\)). Hence, you can think of inertia as a multidimensional extension of the variance, which gives the typical squared distance around the centroid.
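The following sketch (numpy, toy values) computes the inertia of equation (3.12) directly and checks that, in the \(p = 1\) case, it coincides with the variance computed with the \(1/n\) factor.

```python
# Sketch of Eq. (3.12): inertia as the average squared distance to the centroid.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 6.0, 8.0])           # one variable, n = 5
X = x.reshape(-1, 1)                              # as an n x 1 data matrix

g = X.mean(axis=0)                                # centroid (here, just the mean)
inertia = np.mean(np.sum((X - g) ** 2, axis=1))   # average squared distance to g

print(inertia)       # 4.0
print(np.var(x))     # np.var uses the 1/n factor by default: also 4.0
```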
Overall Dispersion and Inertia
Interestingly, the overall dispersion and the inertia are connected through the following relation:
\[\begin{align*} \text{overall dispersion} &= \sum_{i=1}^{n} \sum_{\ell=1}^{n} d^2(i,\ell) \\ &= 2n \sum_{i=1}^{n} d^2(i,g) \\ &= (2n^2) \text{Inertia} \tag{3.16} \end{align*}\]
The proof of this relation is left as a homework exercise.
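This is not the proof (that remains your exercise), just a quick numerical check of equation (3.16) on a small random cloud of points, assuming numpy.

```python
# Numerical check of Eq. (3.16): overall dispersion = 2 n^2 x inertia.
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))                      # random cloud of points

# overall dispersion: sum of all n^2 squared pairwise distances
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
overall = D2.sum()

# inertia: average squared distance to the centroid
g = X.mean(axis=0)
inertia = np.mean(np.sum((X - g) ** 2, axis=1))

print(np.isclose(overall, 2 * n**2 * inertia))   # True
```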
3.4 Cloud of Variables
The starting point when analyzing variables involves computing various summary measures, such as means and variances, to get an idea of the common or central values, and the amount of variability of each variable. In this section we will review how concepts like the mean of a variable, the variance, covariance, and correlation can be interpreted in a geometric sense, as well as their expressions in terms of vector-matrix operations.
3.4.1 Mean of a Variable
To describe one variable, we usually begin by calculating a “typical” value. The idea is to summarize the values of a variable with one or two representative values. You will find this notion under several terms like measures of center, location, central tendency, or centrality.
As mentioned in the previous section, the prototypical summary value of center is the mean, sometimes referred to as average. The mean of an \(n-\)element variable \(X = (x_1, x_2, \dots, x_n)\), represented by \(\bar{x}\), is obtained by adding all the \(x_i\) values and then dividing by their total number \(n\):
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \tag{3.17} \]
Using summation notation we can express \(\bar{x}\) in a very compact way as:
\[ \bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_i \tag{3.18} \]
If you associate a constant weight of \(1/n\) to each observation \(x_i\), you can look at the formula of the mean as a weighted sum:
\[ \bar{x} = \frac{1}{n} x_1 + \frac{1}{n} x_2 + \dots + \frac{1}{n} x_n \tag{3.19} \]
This is a slightly different way of looking at the mean that will allow you to generalize the concept of an “average” as a weighted aggregation of information. For example, if we denote the weight of the \(i\)-th individual as \(w_i\) (with the weights adding up to one), then the average can be expressed as:
\[\begin{align*} \bar{x} &= w_1 x_1 + w_2 x_2 + \dots + w_n x_n \\ &= \sum_{i=1}^{n} w_i x_i \\ &= \mathbf{w^\mathsf{T} x} \tag{3.20} \end{align*}\]
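As a small sketch of equation (3.20) (numpy, hypothetical values and weights): with uniform weights \(w_i = 1/n\) the weighted sum reduces to the ordinary mean, while non-uniform weights give a weighted average.

```python
# Sketch of Eq. (3.20): the mean as a weighted sum w'x (weights sum to one).
import numpy as np

x = np.array([2.0, 4.0, 5.0, 6.0, 8.0])
n = len(x)

w_uniform = np.full(n, 1.0 / n)           # uniform weights 1/n
print(w_uniform @ x)                      # 5.0, the ordinary mean

w = np.array([0.4, 0.3, 0.1, 0.1, 0.1])   # hypothetical non-uniform weights
print(w @ x)                              # 3.9, a weighted average
```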
3.4.2 Variance of a Variable
A measure of center such as the mean is not enough to summarize the information of a variable. We also need a measure of the amount of variability. Synonymous terms are variation, spread, scatter, and dispersion.
Because of its relevance and importance for statistical learning methods, we will focus on one particular measure of spread: the variance (and its square root the standard deviation).
Simply put, the variance is a measure of spread around the mean. The main idea behind the calculation of the variance is to quantify the typical concentration of values around the mean. The way this is done is by averaging the squared deviations from the mean.
\[\begin{align*} Var(X) &= \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n} \\ &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{3.21} \end{align*}\]
Let’s dissect the terms and operations involved in the formula of the variance:
- the main terms are the deviations from the mean \((x_i - \bar{x})\), that is, the difference between each observation \(x_i\) and the mean \(\bar{x}\).
- conceptually speaking, we want to know what is the average size of the deviations around the mean.
- simply averaging the deviations won’t work because their sum is zero (i.e. the sum of deviations around the mean cancels out because the mean is the balancing point).
- this is why we square each deviation: \((x_i - \bar{x})^2\), which literally means getting the squared distance from \(x_i\) to \(\bar{x}\).
- having squared all the deviations, we then average them to get the variance.
Because the variance has squared units, we need to take the square root to “recover” the original units in which \(X\) is expressed. This gives us the standard deviation:
\[ sd(X) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{3.22} \]
In this sense, you can say that the standard deviation is roughly the average distance that the data points vary from the mean.
Sample Variance
In practice, you will often find two versions of the formula for the variance: one in which the sum of squared deviations is divided by \(n\), and another one in which the division is done by \(n-1\). Each version is associated with the statistical inference view of variance, depending on whether the data comes from the entire population or from a sample of it.
The population variance is obtained dividing by \(n\):
\[ \textsf{population variance:} \quad \frac{1}{(n)} \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{3.23} \]
The sample variance is obtained dividing by \(n - 1\) instead of dividing by \(n\). The reason for doing this is to get an unbiased estimator of the population variance:
\[ \textsf{sample variance:} \quad \frac{1}{(n-1)} \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{3.24} \]
It is important to note that most statistical software computes the variance with the unbiased version. If you implement your own functions and are planning to compare them against other software, then it is crucial to know what other programmers are using for computing the variance. Otherwise, your results might be a bit different from those obtained with other people’s code.
In this book, unless indicated otherwise, we will use the factor \(\frac{1}{n}\) when introducing concepts of variance, and related measures. If needed, we will let you know when a formula needs to use the factor \(\frac{1}{n-1}\).
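A brief sketch of the two conventions (numpy assumed, toy values); numpy’s ddof argument switches between the \(1/n\) and \(1/(n-1)\) factors, which is worth knowing when comparing results across software.

```python
# The two variance conventions: population (1/n) vs sample (1/(n-1)).
import numpy as np

x = np.array([2.0, 4.0, 5.0, 6.0, 8.0])
n = len(x)
dev = x - x.mean()                 # deviations from the mean

print(dev @ dev / n)               # 4.0  population version, factor 1/n
print(dev @ dev / (n - 1))         # 5.0  sample (unbiased) version, factor 1/(n-1)

print(np.var(x))                   # 4.0  numpy's default is ddof=0, i.e. 1/n
print(np.var(x, ddof=1))           # 5.0  the version many stats packages report
```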
3.4.3 Variance with Vector Notation
In a similar way to expressing the mean with vector notation, you can also formulate the variance in terms of vector-matrix notation. First, notice that the formula of the variance consists of the addition of squared terms. Second, recall that a sum of numbers can be expressed with an inner product by using the unit vector (or summation operator). If we denote the mean vector as \(\mathbf{\bar{x}}\), then the variance of a vector \(\mathbf{x}\) can be obtained with the following inner product:
\[ Var(\mathbf{x}) = \frac{1}{n} (\mathbf{x} - \mathbf{\bar{x}})^\mathsf{T} (\mathbf{x} - \mathbf{\bar{x}}) \tag{3.25} \]
where \(\mathbf{\bar{x}}\) is an \(n\)-element vector of mean values \(\bar{x}\).
Assuming that \(\mathbf{x}\) is already mean-centered, the variance is proportional to the squared norm of \(\mathbf{x}\):
\[ Var(\mathbf{x}) = \frac{1}{n} \hspace{1mm} \mathbf{x}^{\mathsf{T}} \mathbf{x} = \frac{1}{n} \| \mathbf{x} \|^2 \tag{3.26} \]
This means that we can formulate the variance with the general notion of an inner product:
\[ Var(\mathbf{x}) = \frac{1}{n} \langle \mathbf{x}, \mathbf{x} \rangle \tag{3.27} \]
3.4.4 Standard Deviation as a Norm
If we use a metric matrix \(\mathbf{D} = diag(1/n)\) (that is, \(\mathbf{D} = \frac{1}{n} \mathbf{I}\)), then the variance of a mean-centered \(\mathbf{x}\) is given by a special type of inner product:
\[ Var(\mathbf{x}) = \langle \mathbf{x}, \mathbf{x} \rangle_{D} = \mathbf{x}^{\mathsf{T}} \mathbf{D x} \tag{3.28} \]
From this point of view, we can say that the variance of \(\mathbf{x}\) is equivalent to its squared norm when the vector space is endowed with a metric \(\mathbf{D}\). Consequently, the standard deviation is simply the length of \(\mathbf{x}\) in this particular geometric space.
\[ sd(\mathbf{x}) = \| \mathbf{x} \|_{D} \tag{3.29} \]
When looking at the standard deviation from this perspective, you can actually say that the amount of spread of a vector \(\mathbf{x}\) is actually its length (under the metric \(\mathbf{D}\)).
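Here is a minimal sketch of equations (3.28) and (3.29), assuming numpy and a mean-centered toy vector: the variance appears as the inner product under the metric \(\mathbf{D} = diag(1/n)\), and the standard deviation as the corresponding norm.

```python
# Sketch of Eqs. (3.28)-(3.29): variance as <x, x>_D and sd as the D-norm.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 6.0, 8.0])
n = len(x)

xc = x - x.mean()             # mean-centered vector
D = np.eye(n) / n             # metric matrix diag(1/n, ..., 1/n)

var_x = xc @ D @ xc           # <x, x>_D
sd_x = np.sqrt(var_x)         # ||x||_D

print(var_x, sd_x)            # 4.0 2.0
print(np.var(x), np.std(x))   # same values (numpy uses the 1/n factor)
```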
3.4.5 Covariance
The covariance generalizes the concept of variance for two variables. Recall that the formula for the covariance between \(\mathbf{x}\) and \(\mathbf{y}\) is:
\[ cov(\mathbf{x, y}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) (y_i - \bar{y}) \tag{3.30} \]
where \(\bar{x}\) is the mean value of \(\mathbf{x}\) obtained as:
\[ \bar{x} = \frac{1}{n} (x_1 + x_2 + \dots + x_n) = \frac{1}{n} \sum_{i = 1}^{n} x_i \tag{3.31} \]
and \(\bar{y}\) is the mean value of \(\mathbf{y}\):
\[ \bar{y} = \frac{1}{n} (y_1 + y_2 + \dots + y_n) = \frac{1}{n} \sum_{i = 1}^{n} y_i \tag{3.32} \]
Basically, the covariance is a statistical summary that is used to assess the linear association between pairs of variables.
Assuming that the variables are mean-centered, we can get a more compact expression of the covariance in vector notation:
\[ cov(\mathbf{x, y}) = \frac{1}{n} (\mathbf{x^{\mathsf{T}} y}) \tag{3.33} \]
Properties of covariance:
- the covariance is a symmetric index: \(cov(X,Y) = cov(Y,X)\)
- the covariance can take any real value (negative, null, positive)
- the covariance is linked to the variances through the Cauchy-Schwarz inequality (see the numerical check right after this list):
\[ cov(X,Y)^2 \leq var(X) var(Y) \tag{3.34} \]
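As promised above, here is a quick numerical check (numpy, made-up vectors) of the vector formula (3.33) for mean-centered variables and of the Cauchy-Schwarz bound (3.34).

```python
# Covariance via Eq. (3.33) and a check of the Cauchy-Schwarz inequality (3.34).
import numpy as np

x = np.array([2.0, 4.0, 5.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)

xc, yc = x - x.mean(), y - y.mean()     # mean-centered vectors

cov_xy = xc @ yc / n                    # (1/n) x'y
var_x, var_y = xc @ xc / n, yc @ yc / n

print(cov_xy)                           # 2.2
print(cov_xy**2 <= var_x * var_y)       # True: Cauchy-Schwarz holds
```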
3.4.6 Correlation
Although the covariance indicates the direction—positive or negative—of a possible linear relation, it does not tell us how big or small the relation might be. To have a more interpretable index, we must transform the covariance into a unit-free measure. To do this we must consider the standard deviations of the variables so we can normalize the covariance. The result of this normalization is the coefficient of linear correlation defined as:
\[ cor(X, Y) = \frac{cov(X, Y)}{\sqrt{var(X)} \sqrt{var(Y)}} \tag{3.35} \]
Representing \(X\) and \(Y\) as vectors \(\mathbf{x}\) and \(\mathbf{y}\), we can express the correlation as:
\[ cor(\mathbf{x}, \mathbf{y}) = \frac{cov(\mathbf{x}, \mathbf{y})}{\sqrt{var(\mathbf{x})} \sqrt{var(\mathbf{y})}} \tag{3.36} \]
Assuming that \(\mathbf{x}\) and \(\mathbf{y}\) are mean-centered, we can express the correlation as:
\[ cor(\mathbf{x, y}) = \frac{\mathbf{x^{\mathsf{T}} y}}{\|\mathbf{x}\| \hspace{1mm} \|\mathbf{y}\|} \tag{3.37} \]
As it turns out, the norm of a mean-centered variable \(\mathbf{x}\) is proportional to the square root of its variance (or standard deviation):
\[ \| \mathbf{x} \| = \sqrt{\mathbf{x^{\mathsf{T}} x}} = \sqrt{n} \hspace{1mm} \sqrt{var(\mathbf{x})} \tag{3.38} \]
Consequently, we can also express the correlation with inner products as:
\[ cor(\mathbf{x, y}) = \frac{\mathbf{x^{\mathsf{T}} y}}{\sqrt{(\mathbf{x^{\mathsf{T}} x})} \sqrt{(\mathbf{y^{\mathsf{T}} y})}} \tag{3.39} \]
or equivalently:
\[ cor(\mathbf{x, y}) = \frac{\mathbf{x^{\mathsf{T}} y}}{\| \mathbf{x} \| \hspace{1mm} \| \mathbf{y} \|} \tag{3.40} \]
In the case that both \(\mathbf{x}\) and \(\mathbf{y}\) are standardized (mean zero and unit variance), that is:
\[ \mathbf{x} = \begin{bmatrix} \frac{x_1 - \bar{x}}{\sigma_{x}} \\ \frac{x_2 - \bar{x}}{\sigma_{x}} \\ \vdots \\ \frac{x_n - \bar{x}}{\sigma_{x}} \end{bmatrix}, \hspace{5mm} \mathbf{y} = \begin{bmatrix} \frac{y_1 - \bar{y}}{\sigma_{y}} \\ \frac{y_2 - \bar{y}}{\sigma_{y}} \\ \vdots \\ \frac{y_n - \bar{y}}{\sigma_{y}} \end{bmatrix} \tag{3.41} \]
the correlation is simply the averaged inner product:
\[ cor(\mathbf{x, y}) = \frac{1}{n} \hspace{1mm} \mathbf{x^{\mathsf{T}} y} \hspace{5mm} \textsf{(standardized variables)} \tag{3.42} \]
3.4.7 Geometry of Correlation
Let’s look at two variables (i.e. vectors) from a geometric perspective.
The inner product of two mean-centered vectors \(\langle \mathbf{x}, \mathbf{y} \rangle\) is obtained with the following equation:
\[ \mathbf{x^{\mathsf{T}} y} = \|\mathbf{x}\| \hspace{1mm} \|\mathbf{y}\| \hspace{1mm} cos(\theta_{x,y}) \tag{3.43} \]
where \(\theta_{x,y}\) is the angle between \(\mathbf{x}\) and \(\mathbf{y}\). Rearranging the terms in the previous equation we get that:
\[ cos(\theta_{x,y}) = \frac{\mathbf{x^\mathsf{T} y}}{\|\mathbf{x}\| \hspace{1mm} \|\mathbf{y}\|} = cor(\mathbf{x, y}) \tag{3.44} \]
which means that the correlation between mean-centered vectors \(\mathbf{x}\) and \(\mathbf{y}\) turns out to be the cosine of the angle between \(\mathbf{x}\) and \(\mathbf{y}\).
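The following sketch (numpy, toy vectors) verifies equation (3.44): after mean-centering, the cosine of the angle between the two vectors matches the usual correlation coefficient.

```python
# Sketch of Eq. (3.44): correlation as the cosine of the angle between centered vectors.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

xc, yc = x - x.mean(), y - y.mean()

cos_theta = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(cos_theta)                  # ~0.778
print(np.corrcoef(x, y)[0, 1])    # same value: the correlation of x and y
```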
3.4.8 Orthogonal Projections
Last but not least, we finish this chapter with a discussion of projections. To be more specific, the statistical interpretation of orthogonal projections.
Let’s motivate this discussion with the following question: Consider two variables \(\mathbf{x}\) and \(\mathbf{y}\). Can we approximate one of the variables in terms of the other? This is an asymmetric type of association since we seek to say something about the variability of one variable, say \(\mathbf{y}\), in terms of the variability of \(\mathbf{x}\).
We can think of several ways to approximate \(\mathbf{y}\) in terms of \(\mathbf{x}\). The approximation of \(\mathbf{y}\), denoted by \(\mathbf{\hat{y}}\), means finding a scalar \(b\) such that:
\[ \mathbf{\hat{y}} = b \mathbf{x} \tag{3.45} \]
The common approach to obtain \(\mathbf{\hat{y}}\) in an optimal way is by minimizing the squared difference between \(\mathbf{y}\) and \(\mathbf{\hat{y}}\), that is, the squared norm \(\| \mathbf{y} - \mathbf{\hat{y}} \|^2\).
The answer to this question comes in the form of a projection. More precisely, we orthogonally project \(\mathbf{y}\) onto \(\mathbf{x}\):
\[ \mathbf{\hat{y}} = \mathbf{x} \left( \frac{\mathbf{y^\mathsf{T} x}}{\mathbf{x^\mathsf{T} x}} \right) \tag{3.46} \]
or equivalently:
\[ \mathbf{\hat{y}} = \mathbf{x} \left( \frac{\mathbf{y^\mathsf{T} x}}{\| \mathbf{x} \|^2} \right) \tag{3.47} \]
For convenience, we can rewrite the above equation in a slightly different format:
\[ \mathbf{\hat{y}} = \mathbf{x} (\mathbf{x^\mathsf{T}x})^{-1} \mathbf{x^\mathsf{T}y} \tag{3.48} \]
If you are familiar with linear regression, you should be able to recognize this equation. We’ll come back to this when we get to the chapter about Linear regression.
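Here is a minimal sketch of equations (3.46)-(3.48), with made-up vectors and numpy: the projection scalar \(b\) coincides with the coefficient of a least-squares fit of \(\mathbf{y}\) on \(\mathbf{x}\) without an intercept.

```python
# Sketch of Eqs. (3.46)-(3.48): orthogonal projection of y onto x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 4.0, 4.5])

b = (y @ x) / (x @ x)                     # scalar in y_hat = b x
y_hat = b * x                             # orthogonal projection of y onto x

print(b)                                  # 1.2333...
print(np.allclose(x @ (y - y_hat), 0))    # True: the residual is orthogonal to x

# same coefficient from the least-squares machinery (no intercept)
b_ls, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
print(b_ls)                               # [1.2333...]
```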
3.4.9 The mean as an orthogonal projection
Let’s go back to the concept of the mean of a variable. As we previously mentioned, a variable \(X = (x_1, \dots, x_n)\) can be thought of as a vector \(\mathbf{x}\) in an \(n\)-dimensional space. Furthermore, let’s also consider the constant vector \(\mathbf{1}\) of size \(n\). Here’s a conceptual diagram for this situation:
Out of curiosity, what happens when we ask about the orthogonal projection of \(\mathbf{x}\) onto \(\mathbf{1}\)? Something like in the following picture:
This projection is expressed in vector notation as:
\[ \mathbf{\hat{x}} = \mathbf{1} \left( \frac{\mathbf{x^\mathsf{T} 1}}{\mathbf{1^\mathsf{T} 1}} \right) \tag{3.49} \]
or equivalently:
\[ \mathbf{\hat{x}} = \mathbf{1} \left( \frac{\mathbf{x^\mathsf{T} 1}}{\| \mathbf{1} \|^2} \right) \tag{3.50} \]
Note that the term in parentheses is just a scalar, so we can actually express \(\mathbf{\hat{x}}\) as \(b \mathbf{1}\). This means that the projection amounts to multiplying \(\mathbf{1}\) by some number \(b\), such that \(\mathbf{\hat{x}} = b \mathbf{1}\) is a stretched or shrunken version of \(\mathbf{1}\). So, what is the scalar \(b\)? It is simply the mean of \(\mathbf{x}\):
\[ \mathbf{\hat{x}} = \mathbf{1} \left( \frac{\mathbf{x^\mathsf{T} 1}}{\| \mathbf{1} \|^2} \right) = \bar{x} \mathbf{1} \tag{3.51} \]
This is better appreciated in the following figure.
What this tells us is that the mean of the variable \(X\), denoted by \(\bar{x}\), has a very interesting geometric interpretation. As you can tell, \(\bar{x}\) is the scalar by which you would multiply \(\mathbf{1}\) in order to obtain the vector projection \(\mathbf{\hat{x}}\).
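To close, a short numerical sketch (numpy, hypothetical values) of equation (3.51): projecting \(\mathbf{x}\) onto the constant vector \(\mathbf{1}\) yields \(\bar{x} \mathbf{1}\), so the projection scalar is exactly the mean.

```python
# Sketch of Eq. (3.51): the mean as the scalar of the projection of x onto 1.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 6.0, 8.0])
ones = np.ones_like(x)

b = (x @ ones) / (ones @ ones)    # scalar of the projection
x_hat = b * ones                  # projection of x onto 1

print(b)                          # 5.0, the mean of x
print(x_hat)                      # [5. 5. 5. 5. 5.]
print(np.isclose(b, x.mean()))    # True
```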