It is common knowledge that a matrix serves as a linear transformation or a linear function. That is, it can take an input and return another desired output. However, a matrix may also be viewed as data fed into some statistical model 1. Thus, this post is an attempt to shed light on these two aspects of a matrix. While viewing a matrix as a linear transformation is standard for a mathematician, the latter view might be more useful in the context of machine learning, since it provides insight into the concept of dimension reduction.
This section is a quick recap of this view; for a more geometric interpretation, check out an excellent video from Grant Sanderson (3Blue1Brown).
Let \(\mathbf{A}\) be an \(M \times N\) matrix (\(M\) rows and \(N\) columns). \(\mathbf{A}\) can be viewed as a linear function that maps an arbitrary vector in an \(N\)-dimensional vector space to a vector in an \(M\)-dimensional vector space. We can write this mapping mathematically as follows
\[ \mathbf{y} = \mathbf{A}\mathbf{x},\]
or in a diagrammatic way
\[ \mathbf{x} \rightarrow \mathbf{A} \rightarrow \mathbf{y}. \]
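Numerically, this mapping is just a matrix-vector product. Here is a minimal NumPy sketch; the matrix and vector below are arbitrary and chosen only to illustrate the shapes involved:

```python
import numpy as np

M, N = 3, 2                      # output and input dimensions
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 3.0]])       # an M x N matrix, i.e. a map from R^N to R^M
x = np.array([4.0, 5.0])         # an N-vector in the domain

y = A @ x                        # the transformed M-vector
print(y.shape)                   # (3,)
print(y)                         # [ 4.  5. 23.]
```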
In words, an \(N\)-vector \(\mathbf{x}\) is passed through the function \(\mathbf{A}\), which outputs an \(M\)-vector \(\mathbf{y}\). A central idea of linear algebra is that every vector can be represented in terms of the basis vectors of the space it lives in, with each basis vector scaled by some constant factor. Hence, it is instructive to describe the transformation through the standard basis vectors \(\{\mathbf{e}_1, \dots, \mathbf{e}_{N} \}\),
\[\mathbf{x} = x_1 \mathbf{e}_1 + \dots + x_{N} \mathbf{e}_{N}. \]
In a similar fashion, for each \(i = 1, \dots, N\), the product of \(\mathbf{A}\) and \(\mathbf{e}_i\) turns out to be the \(i^{th}\) column of \(\mathbf{A}\), denoted \(\mathbf{a}_{i}\) with the same subscript \(i\). The output \(\mathbf{y}\) can then be written as
\[\begin{align} \mathbf{y} &= \mathbf{A} \mathbf{x} \\ &= x_1 \mathbf{A}\mathbf{e}_1 + x_2 \mathbf{A}\mathbf{e}_2 + \dots + x_N \mathbf{A}\mathbf{e}_{N} \\ &= x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \dots + x_N \mathbf{a}_N. \end{align}\]
This derivation underpins the matrix-as-function view: the output \(\mathbf{y}\) of the transformation is a linear combination of the columns of \(\mathbf{A}\), which are exactly the images \(\mathbf{A}\mathbf{e}_i\) of the basis vectors \(\{\mathbf{e}_1, \dots, \mathbf{e}_N \}\) (and hence play the role of the basis vectors in the new vector space).
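This identity is easy to verify numerically: the product \(\mathbf{A}\mathbf{x}\) coincides with the linear combination of the columns of \(\mathbf{A}\) weighted by the entries of \(\mathbf{x}\). A small sketch, reusing the arbitrary matrix from above:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 3.0]])
x = np.array([4.0, 5.0])

# Matrix-vector product ...
y_product = A @ x

# ... versus the explicit linear combination x_1 a_1 + x_2 a_2 of the columns.
y_columns = sum(x[i] * A[:, i] for i in range(A.shape[1]))

print(np.allclose(y_product, y_columns))  # True
```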
Having discussed matrices in their functional role, we may now ask ourselves how these concepts interact with data.
In the context of machine learning and statistics, data are often represented by a matrix \(\mathbf{A}\) whose \(N\) columns are features and whose \(M\) rows are independent samples or observations (independence of observations is a standard assumption in statistics). Within this view, each column \(\mathbf{a}_{n}\) of \(\mathbf{A}\) is an \(M\)-vector recording a single feature across all observations and, conversely, each row of \(\mathbf{A}\) is an \(N\)-vector recording all features for a single observation.
Consider a simple example where our matrix \(\mathbf{A}\) records the newborn babies in \(N\) different cities over \(M\) different time points. More precisely, imagine that the columns are cities in Vietnam and the rows are years in the 2000-2022 period. Each cell is the number of newborn babies in a given city in a given year:
\[\begin{equation} \mathbf{A}_{\text{VietNam newborn babies}} = \begin{bmatrix} & \text{TPHCM} & \text{Da Lat} & \dots & \text{Ha Noi} \\ 2000 & a_{11} & a_{12} & \dots & a_{1N} \\ 2001 & a_{21} & a_{22} & \dots & a_{2N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 2022 & a_{M1} & a_{M2} & \dots & a_{MN} \end{bmatrix} \end{equation}\]
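A toy version of this table is easy to build, for instance with pandas. The city names below are taken from the table above, but the counts are randomly generated placeholders, not real birth statistics:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
years = range(2000, 2023)                 # M = 23 rows (observations)
cities = ["TPHCM", "Da Lat", "Ha Noi"]    # N = 3 columns (features)

# Synthetic counts, only to illustrate the layout of the matrix A.
A = pd.DataFrame(rng.integers(1_000, 10_000, size=(len(years), len(cities))),
                 index=years, columns=cities)
print(A.head())
```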
What is the advantage of this data representation compared with others? Suppose that we want to model the newborn babies in a new city that does not appear in our database, say a city called “Mong Mo”. Let the newborn babies in Mong Mo be denoted by an \(M\)-vector \(\mathbf{y}_{\text{Mong Mo}}\). For the sake of simplicity, we can represent it as a linear combination of the newborn babies in the recorded cities 2. Mathematically, \(\mathbf{y}_{\text{Mong Mo}}\) can be written via the matrix \(\mathbf{A}\) as:
\[\mathbf{y}_{\text{Mong Mo}} = x_1\mathbf{A}\mathbf{e}_1 + x_2 \mathbf{A}\mathbf{e}_2 + \dots + x_N \mathbf{A} \mathbf{e}_{N}. \]
The scalars \(\{x_1, \dots, x_{N}\}\) are simply coefficients chosen so that the equality holds. As discussed earlier, \(\mathbf{A}\mathbf{e}_{i}\) is the \(i\)-th column of the matrix \(\mathbf{A}\). Therefore, Mong Mo’s newborn babies can be represented as a linear combination of the recorded cities’ babies:
\[\begin{align} \mathbf{y}_{\text{Mong Mo}} &= x_1 \mathbf{A}\mathbf{e}_1 + x_2 \mathbf{A}\mathbf{e}_2 + \dots + x_N \mathbf{A} \mathbf{e}_{N} \\ & = x_1 \mathbf{a}_{\text{TPHCM}} + x_2 \mathbf{a}_{\text{Da Lat}} + \dots + x_N \mathbf{a}_{\text{Ha Noi}}. \end{align}\]
The quintessence of the matrix-as-data view lies in the last equation: a new feature (or datum in general) can be expressed as a combination of existing data. If we instead wanted to reason in the cross-sectional time domain (each row vector representing the \(N\) cities at a single time index \(m \in \{2000,\dots, 2022\}\)), we could simply transpose the matrix \(\mathbf{A}\) and follow the same line of reasoning.
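In practice the coefficients \(x_1, \dots, x_N\) are rarely known in advance; one common choice (an assumption here, the post does not prescribe a method) is to pick the ones that fit the observed vector best in the least-squares sense. A sketch with freshly generated synthetic data of the same shape as the table above:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 23, 3
A = rng.integers(1_000, 10_000, size=(M, N)).astype(float)      # recorded cities
y_mong_mo = rng.integers(1_000, 10_000, size=M).astype(float)    # the new city

# Least-squares coefficients x such that A @ x is as close to y_mong_mo as possible.
x, residuals, rank, _ = np.linalg.lstsq(A, y_mong_mo, rcond=None)
print(x)        # one coefficient per recorded city
print(A @ x)    # the linear-combination approximation of Mong Mo
```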
Viewing a matrix as data also underscores the essence of dimension reduction. Imagine that our data matrix \(\mathbf{A}\) of newborn babies exhibits many collinear relationships between cities; for instance, TPHCM and Ha Noi might have highly correlated newborn counts. Under such circumstances, the matrix \(\mathbf{A}\) is (approximately) low-rank. Hence, we can use a dimension reduction technique to transform \(\mathbf{A}\) into \(\mathbf{B}\), where the number of column vectors in \(\mathbf{B}\), \(K\), is less than that of \(\mathbf{A}\), \(N\) \((K < N)\). This can be described as follows:
\[\begin{equation} \mathbf{A} \in \mathbb{R}^{M \times N} \xrightarrow{\text{dimension reduction}} \mathbf{B} \in \mathbb{R}^{M \times K}. \end{equation}\]
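One concrete way to obtain such a \(\mathbf{B}\) (among several; this is only a sketch, not the unique recipe) is a truncated singular value decomposition: keeping the \(K\) largest singular values yields an \(M \times K\) matrix whose columns summarize the original \(N\) columns. Assuming a synthetic, nearly low-rank \(\mathbf{A}\):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, K = 23, 6, 2

# A nearly rank-K matrix: K "hidden" patterns shared across the N columns, plus noise.
patterns = rng.normal(size=(M, K))
mixing = rng.normal(size=(K, N))
A = patterns @ mixing + 0.01 * rng.normal(size=(M, N))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :K] * s[:K]            # M x K reduced representation of A

print(A.shape, B.shape)          # (23, 6) (23, 2)
```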
Returning to our example from the previous section, Mong Mo’s newborn babies over the examined period can be approximated as a linear combination of the column vectors of the matrix \(\mathbf{B}\):
\[\begin{equation} \mathbf{y}_{\text{Mong Mo}} = x_1 \mathbf{b}_1 + x_2 \mathbf{b}_2 + \dots + x_K \mathbf{b}_{K}. \end{equation}\]
Note that the indices no longer correspond to the features of \(\mathbf{A}\), since the mapping from \(\mathbf{A}\) to \(\mathbf{B}\) is not injective. Instead, each column \(\mathbf{b}_{k}\) can be interpreted as representing a group of cities that share similar patterns in newborn babies. Illustratively, we might have the following relationship:
\[\begin{equation} \mathbf{y}_{\text{Mong Mo}} = x_1\mathbf{b}_{\text{rural areas}} + x_2 \mathbf{b}_{\text{income average}} + \dots + x_{K} \mathbf{b}_{\text{industrial development}}. \end{equation}\]
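With \(\mathbf{B}\) in hand, the same least-squares idea applies, only now with \(K\) coefficients instead of \(N\). A short, self-contained sketch with a synthetic rank-\(K\) matrix (so the reduced fit loses essentially nothing):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, K = 23, 6, 2
A = rng.normal(size=(M, K)) @ rng.normal(size=(K, N))   # an exactly rank-K data matrix
y = A @ rng.normal(size=N)                               # a target built from A's columns

U, s, _ = np.linalg.svd(A, full_matrices=False)
B = U[:, :K] * s[:K]                                     # M x K reduced matrix

# Only K coefficients are needed, because the information in A's N columns
# already lives in K directions.
x_reduced, *_ = np.linalg.lstsq(B, y, rcond=None)
print(np.linalg.norm(B @ x_reduced - y))                 # close to zero
```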
Dimension reduction techniques such as Principal Component Analysis (PCA) rely on this fundamental idea. That is, the goal is to seek a low-rank matrix \(\mathbf{B}\) that reliably approximates the matrix \(\mathbf{A}\) while retaining the essential relationships in our data.
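For instance, scikit-learn’s PCA (assuming the library is available, and using a synthetic matrix as before) produces exactly such an \(M \times K\) matrix \(\mathbf{B}\), and its low-rank reconstruction stays close to \(\mathbf{A}\):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
M, N, K = 23, 6, 2
A = rng.normal(size=(M, K)) @ rng.normal(size=(K, N)) + 0.01 * rng.normal(size=(M, N))

pca = PCA(n_components=K)
B = pca.fit_transform(A)               # the M x K reduced matrix
A_approx = pca.inverse_transform(B)    # low-rank reconstruction of A

print(B.shape)                                             # (23, 2)
print(np.linalg.norm(A - A_approx) / np.linalg.norm(A))    # small relative error
```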