2 Introduction

Picture a data set containing the scores of college students in several courses, for example matrix algebra, multivariable calculus, statistics, and probability. Suppose we also have historical data about a course in Statistical Learning: final scores measured on a scale from 0 to 100, final letter grades, and a third variable, “Pass - Non-Pass”, indicating whether the student passed Statistical Learning. Data like this fits naturally in a tabular format: the rows contain the records of individual students, and the columns correspond to the variables.

Student	LinAlg	Calculus	Statistics	StatLearn	Grade	P/NP
1	\(\bigcirc\)	\(\bigcirc\)	\(\bigcirc\)	92	A	P
2	\(\bigcirc\)	\(\bigcirc\)	\(\bigcirc\)	85	B	P
3	\(\bigcirc\)	\(\bigcirc\)	\(\bigcirc\)	40	F	NP
New	\(\bigcirc\)	\(\bigcirc\)	\(\bigcirc\)	?	?	?

Suppose that, based on this historical data, we wish to predict the Statistical Learning score of a new student whose Linear Algebra, Calculus, and Statistics scores are known. To do so, we would fit a model to our data; since the response is a numeric score, this task is called regression. This is a form of supervised learning, since our model is trained using known inputs (i.e. the LinAlg, Calculus, and Statistics scores) as well as known responses (i.e. the Statistical Learning scores of the previous students).
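
To make the idea concrete, here is a minimal sketch of such a regression using ordinary least squares in NumPy. The input scores in the table above are not given, so every number below is invented purely for illustration, and a linear model is only one of many models we could have chosen.

```python
import numpy as np

# Hypothetical scores (LinAlg, Calculus, Statistics) for six past students.
# These numbers are invented for illustration only.
X = np.array([
    [90.0, 85.0, 95.0],
    [80.0, 88.0, 84.0],
    [45.0, 50.0, 38.0],
    [70.0, 65.0, 72.0],
    [60.0, 72.0, 66.0],
    [95.0, 91.0, 89.0],
])

# Hypothetical Statistical Learning scores (the known responses).
y = np.array([92.0, 85.0, 40.0, 71.0, 64.0, 93.0])

# Prepend a column of ones so the linear model includes an intercept.
A = np.column_stack([np.ones(len(X)), X])

# Least-squares fit: coefficients that minimize the sum of squared errors.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the score of a new student whose input scores are known.
x_new = np.array([75.0, 80.0, 78.0])
y_new = coef[0] + x_new @ coef[1:]
print(f"Predicted StatLearn score: {y_new:.1f}")
```

The known responses `y` are what make this supervised: the coefficients are chosen so that the fitted values are as close as possible to the observed scores.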

Alternatively, we may be interested in studying the data not from a prediction-oriented perspective but from a purely exploratory one. For example, we may want to investigate the relationship among the courses Linear Algebra, Calculus, and Statistics; that is, explore the relationships between the features. Or we may want to study the resemblance among individuals: which students have similar scores, or whether there are “natural” groups of individuals based on their features. Both of these tasks are examples of unsupervised learning: we use the information in the data to discover patterns, without focusing on any single variable as a target response.
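
Both exploratory tasks can be sketched in a few lines of NumPy. The scores below are hypothetical, and the choice of correlation (for features) and Euclidean distance (for individuals) is an assumption made for this illustration; other measures of association and resemblance are possible.

```python
import numpy as np

# Hypothetical scores (LinAlg, Calculus, Statistics); invented for illustration.
X = np.array([
    [90.0, 85.0, 95.0],
    [80.0, 88.0, 84.0],
    [45.0, 50.0, 38.0],
    [70.0, 65.0, 72.0],
    [60.0, 72.0, 66.0],
    [95.0, 91.0, 89.0],
])

# Relationships between features: p x p correlation matrix
# (rowvar=False tells NumPy that the columns are the variables).
R = np.corrcoef(X, rowvar=False)

# Resemblance among individuals: n x n matrix of Euclidean distances
# between every pair of rows.
diffs = X[:, None, :] - X[None, :, :]
D = np.sqrt((diffs ** 2).sum(axis=2))

# Find the most similar pair of distinct students by ignoring the
# zero distances on the diagonal.
D_masked = D + np.diag(np.full(len(X), np.inf))
i, j = np.unravel_index(np.argmin(D_masked), D.shape)
print("Correlation matrix:\n", R.round(2))
print("Most similar students (0-based rows):", int(i), int(j))
```

Notice that no response variable appears anywhere in this sketch: only the inputs are used.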

In summary, we will focus on two types of learning paradigms:

Supervised Learning: where we have inputs, and one (or more) response variable(s).

Unsupervised Learning: where we have inputs, but not response variables.

Figure 2.1: Inputs and outputs in supervised and unsupervised learning

By the way, there are other types of learning paradigms (e.g. deep learning, reinforcement learning), but we won’t discuss them in this book.

To visualize the different types of learning, the different types of variables, and the methodology associated with each combination of learning/data types, we can use the following graphic:

Figure 2.2: Supervised and Unsupervised Corners

2.1 Basic Notation

In this book we are going to use a fair amount of math notation. Becoming familiar with the meaning of the different symbols as soon as possible should make the learning curve a little less steep.

The starting point is always the data, which we will assume to be in a tabular format that can be translated into a mathematical matrix object. Here’s an example of a data matrix \(\mathbf{X}\) of size \(n \times p\):

\[ \mathbf{X} = \ \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np} \\ \end{bmatrix} \tag{2.1} \]

By default, we will assume that the rows of a data matrix correspond to the individuals or objects. Likewise, we will also assume that the columns of a data matrix correspond to the variables or features observed on the individuals. In this sense, the symbol \(x_{ij}\) represents the value observed for the \(j\)-th variable on the \(i\)-th individual.
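
As a small sketch of this convention, here is a hypothetical data matrix stored as a NumPy array. The entries are invented so that each value spells out its own row and column position. One caveat: the math notation counts from 1, whereas Python counts from 0, so the helper `x(i, j)` (a name introduced only for this example) shifts the indices.

```python
import numpy as np

# Hypothetical data matrix: n = 4 individuals (rows), p = 3 variables (columns).
# Each entry is chosen so that its digits match its (row, column) position.
X = np.array([
    [11, 12, 13],
    [21, 22, 23],
    [31, 32, 33],
    [41, 42, 43],
])

n, p = X.shape  # number of rows, number of columns

def x(i, j):
    """Return x_ij using the 1-based indices of the math notation."""
    return X[i - 1, j - 1]

print(n, p)
print(x(2, 3))  # value of the 3rd variable on the 2nd individual
```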

Throughout this book, every time you see the letter \(i\), either alone or as an index attached to another symbol (as a superscript or subscript), it means that the term corresponds to an individual or a row of some data matrix. For instance, symbols like \(x_i\), \(\mathbf{x_i}\), and \(\alpha_i\) all refer to, or denote a connection with, individuals.

In turn, we will always use the letter \(j\) to convey association with variables or columns of some data matrix. For instance, \(x_j\), \(\mathbf{x_j}\), and \(\alpha_j\) all refer to, or denote a connection with, variables.

For better or for worse, we’ve made the decision to represent both the rows and the columns of a matrix as vectors using the same notation: bold lower case letters such as \(\mathbf{x_i}\) and \(\mathbf{x_j}\). Because there’s a risk of confusing a vector that corresponds to a row with a vector that corresponds to a column, we will sometimes use the arrow notation for vectors associated with a row of a data matrix: \(\mathbf{\vec{x}_i}\).

So, going back to the above data matrix \(\mathbf{X}\), we can represent the first variable as a vector \(\mathbf{x_1} = (x_{11}, x_{21}, \dots, x_{n1})\). Likewise, we can represent the first individual with the vector \(\mathbf{\vec{x}_1} = (x_{11}, x_{12}, \dots, x_{1p})\).
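
In code, the two kinds of vectors are simply a column slice and a row slice of the same array. The sketch below uses a hypothetical \(4 \times 3\) matrix with invented entries; the variable names `x_col_1` and `x_row_1` are our own.

```python
import numpy as np

# Hypothetical data matrix with n = 4 rows and p = 3 columns.
X = np.array([
    [11, 12, 13],
    [21, 22, 23],
    [31, 32, 33],
    [41, 42, 43],
])

# First variable: the column vector (x_11, x_21, ..., x_n1), of length n.
x_col_1 = X[:, 0]

# First individual: the row vector (x_11, x_12, ..., x_1p), of length p.
x_row_1 = X[0, :]

print(x_col_1)
print(x_row_1)
```

Both vectors share the element \(x_{11}\), but they have different lengths (\(n\) versus \(p\)), which is precisely why it is worth keeping track of whether a vector comes from a row or from a column.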

Here’s a reference table with the most common symbols and notation used throughout the book.

Symbol	Description
\(n\)	number of objects or individuals
\(p\)	number of variables or features
\(i\)	running index for rows or individuals
\(j\)	running index for columns or variables
\(k\)	running index determined by context
\(\ell, m, q\)	other auxiliary indexes
\(f()\), \(h()\), \(d()\)	functions
\(\lambda, \mu, \gamma, \alpha\)	Greek letters represent scalars
\(\mathbf{x}\), \(\mathbf{y}\)	variables, size determined by context
\(\mathbf{w}\), \(\mathbf{a}\), \(\mathbf{b}\)	vectors of weight coefficients
\(\mathbf{z}\), \(\mathbf{t}\), \(\mathbf{u}\)	components or latent variables
\(\mathbf{X} : n \times p\)	data matrix with \(n\) rows and \(p\) columns
\(x_{ij}\)	element of a matrix in \(i\)-th row and \(j\)-th column
\(\mathbf{1}\)	vector of ones, size determined by context
\(\mathbf{I}\)	identity matrix, size determined by context
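
To see why a vector of ones and an identity matrix earn a place in this table, consider the sketch below. It uses a small hypothetical matrix and relies on the standard fact that the column means of \(\mathbf{X}\) can be written as \(\frac{1}{n} \mathbf{X}^{\mathsf{T}} \mathbf{1}\); the variable names are our own.

```python
import numpy as np

# Hypothetical data matrix with n = 3 rows and p = 2 columns.
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 9.0],
])
n, p = X.shape

ones = np.ones(n)  # vector of ones; here its size is n
I = np.eye(n)      # identity matrix; here its size is n x n

# Column means written with the vector of ones: (1/n) X^T 1.
col_means = X.T @ ones / n

# Multiplying by the identity matrix leaves X unchanged.
same_X = I @ X

print(col_means)
```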

By the way, there are many more symbols that will appear in later chapters. But for now these are the fundamental ones.

Likewise, the table below contains some of the most common operators that we will use in subsequent chapters:

Symbol	Description
\(\mathbb{E}[X]\)	expected value of a random variable \(X\)
\(\|\mathbf{a}\|\)	Euclidean norm of a vector
\(\mathbf{a}^{\mathsf{T}}\)	transpose of a vector (or matrix)
\(\mathbf{a^{\mathsf{T}}b}\)	inner product of two vectors
\(\langle \mathbf{a}, \mathbf{b} \rangle\)	inner product of two vectors
\(det(\mathbf{A})\)	determinant of a square matrix
\(tr(\mathbf{A})\)	trace of a square matrix
\(\mathbf{A}^{-1}\)	inverse of a square matrix
\(diag(\mathbf{A})\)	diagonal of a square matrix
\(E()\)	overall error function
\(err()\)	pointwise error function
\(sign()\)	sign function
\(var()\)	variance function
\(cov()\)	covariance function
\(cor()\)	correlation function
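
For readers who like to experiment, most of these operators have a direct counterpart in NumPy. The sketch below uses small made-up vectors and matrices. One assumption to flag: we compute the variance and covariance with the sample convention (dividing by \(n - 1\)), which may or may not be the convention adopted in later chapters.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

norm_a = np.linalg.norm(a)  # Euclidean norm ||a||
inner_ab = a @ b            # inner product a^T b, also written <a, b>
A_t = A.T                   # transpose
det_A = np.linalg.det(A)    # determinant
tr_A = np.trace(A)          # trace
A_inv = np.linalg.inv(A)    # inverse
diag_A = np.diag(A)         # diagonal of A
signs = np.sign(np.array([-2.0, 0.0, 3.0]))  # sign function, elementwise

# Two made-up variables observed on four individuals.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 9.0])

var_x = np.var(x, ddof=1)         # sample variance (divides by n - 1)
cov_xy = np.cov(x, y)[0, 1]       # sample covariance (divides by n - 1)
cor_xy = np.corrcoef(x, y)[0, 1]  # correlation

print(norm_a, inner_ab, tr_A)
```

Note that `np.var` divides by \(n\) by default, while `np.cov` divides by \(n - 1\) by default, so `ddof=1` is passed to `np.var` to keep the two consistent.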