Reference: Introduction to Statistical Learning Chapter 10.1-10.2
The goal of PCA is to find low-dimensional summaries of high-dimensional data sets.
This is useful for compression, for denoising, for plotting, and for making sense of data sets that initially seem too complicated to understand.
It differs from clustering: rather than assigning each data point to a single group, PCA describes each point as a blend of shared “ingredients.” Consider a baking analogy:
Nestle Toll House Chocolate-chip cookies: 280 grams flour, 150 grams white sugar, 165 grams brown sugar, 225 grams butter, 2 eggs, 0 grams water…
Mary Berry's Victoria sponge cake: 225 grams flour, 225 grams white sugar, 0 grams brown sugar, 225 grams butter, 4 eggs, 0 grams water…
seriouseats.com old fashioned flaky pie dough: 225 grams flour, 15 grams white sugar, 0 grams brown sugar, 225 grams butter, 0 eggs, 115 grams water…
Each baked good is constructed by following a recipe: a combination of the same basic ingredients.
The amounts of each ingredient differ from one baked good to the next.
Our goal is to reverse-engineer both the ingredients and the amounts/recipes from an observed set of “baked goods” (i.e. original data points).
Alas, PCA is less delicious than baking, and it uses more linear algebra. Say that \( v \in \mathbb{R}^P \) is some vector. This defines a (one-dimensional) subspace of \( \mathbb{R}^P \): \[ \mathcal{V} = \left\{z: z = \alpha v, \; \alpha \in \mathbb{R} \right\} \]
Now let \( X \) be our usual \( N \times P \) data matrix with rows \( x_i^T \).
Suppose we project each \( x_i^T \) in our data matrix onto the subspace \( \mathcal{V} \).
Key idea 1: projection = summary
Key idea 2: the “best summary” is the one that preserves as much of the variance in the original data points as possible.
Let's walk through pca_intro.R
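Before opening the script, here is a minimal sketch of the projection step on made-up toy data (this is not the contents of pca_intro.R; all object names below are illustrative):

```r
# Project N points in R^P onto the line spanned by a unit vector v.
set.seed(1)
N <- 100; P <- 2
X <- matrix(rnorm(N * P), N, P) %*% matrix(c(2, 1, 1, 1), 2, 2)  # correlated toy data

v     <- c(1, 1) / sqrt(2)   # a candidate direction with length 1
alpha <- X %*% v             # scores: alpha_i = x_i . v  (an N x 1 matrix)
X_hat <- alpha %*% t(v)      # projected points alpha_i * v, one per row

plot(X, asp = 1, pch = 16, col = "grey")   # original 2-D cloud
points(X_hat, pch = 16, col = "red")       # its 1-D summary along v
```

The red points all lie on the line in direction \( v \); the scores \( \alpha_i \) say where along that line each data point lands.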
Given data points \( x_1, \ldots, x_N \), with each \( x_i \in \mathbb{R}^{P} \), and a candidate vector \( v_1 \), the variance of the projected points is \[ \mathrm{variance} = \frac{1}{N} \sum_{i=1}^N \left[ \alpha_i - \bar{\alpha} \right]^2 \] where \( \alpha_i = x_i \cdot v_1 \) and \( \bar{\alpha} = \frac{1}{N} \sum_{i=1}^N \alpha_i \).
So we solve: \[ \underset{v_1 \in \mathbb{R}^P, \; \Vert v_1 \Vert_2 = 1}{\mathrm{maximize}} \quad \sum_{i=1}^N \left[x_i \cdot v_1 - \left( \frac{1}{N} \sum_{k=1}^N x_k \cdot v_1 \right) \right]^2 \]
Note: we constrain \( v_1 \) to have length 1; otherwise we could make the variance of the projected points as large as we wanted simply by scaling \( v_1 \).
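As a quick sanity check (a sketch reusing the toy X from above; proj_var is an illustrative helper, not part of any package), we can evaluate this objective at a few random unit vectors and at the first loading vector reported by prcomp(), which should achieve the largest value:

```r
# Variance of the projected scores for a candidate unit vector v1.
proj_var <- function(X, v1) {
  alpha <- as.vector(X %*% v1)
  mean((alpha - mean(alpha))^2)
}

Xc <- scale(X, center = TRUE, scale = FALSE)   # centered columns

# Objective at a handful of random unit-length directions...
random_dirs <- replicate(5, { u <- rnorm(ncol(Xc)); u / sqrt(sum(u^2)) })
apply(random_dirs, 2, function(u) proj_var(Xc, u))

# ...versus at prcomp()'s first loading vector: no random direction beats it.
v1 <- prcomp(Xc)$rotation[, 1]
proj_var(Xc, v1)
```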
The solution \( v_1 \) to this optimization problem is called the first principal component; its entries \( v_{11}, \ldots, v_{1P} \) are the loadings.
The projected points \( \alpha_i = x_i \cdot v_1 \) are called the scores on the first principal component.
We can think of the projected location of \( x_i \) as an approximation to the original data point: \( x_i \approx \hat{x}_i = \alpha_i v_1 \).
Or to make the approximation error explicit: \[ \begin{aligned} x_{ij} &= \hat{x}_{ij} + e_{ij} \\ &= \alpha_i v_{1j} + e_{ij} \end{aligned} \] This is like a regression problem for the jth feature variable.
Thus PCA is like estimating \( P \) regression coefficients \( v_1 = (v_{11}, \ldots, v_{1P}) \) all at once, where the predictor variable (the score \( \alpha_i \)) is hidden and must be estimated along with the coefficients.
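To see the regression analogy concretely (a sketch reusing the toy X from above): for centered data, the no-intercept least-squares slope of the jth column on the first-component scores is exactly the loading \( v_{1j} \).

```r
pc    <- prcomp(X, center = TRUE, scale. = FALSE)
alpha <- pc$x[, 1]                               # scores on the first PC
Xc    <- scale(X, center = TRUE, scale = FALSE)  # the centering prcomp used

# Regress each feature column on the scores (no intercept); the slope is v_{1j}.
slopes <- sapply(seq_len(ncol(Xc)), function(j) coef(lm(Xc[, j] ~ alpha + 0)))
cbind(slope = slopes, loading = pc$rotation[, 1])
```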
We can write the approximation for the whole matrix as follows:
\[ \begin{aligned} X &\approx \left( \begin{array}{r r r r} \alpha_1 v_{11} & \alpha_1 v_{12} & \cdots & \alpha_1 v_{1P} \\ \alpha_2 v_{11} & \alpha_2 v_{12} & \cdots & \alpha_2 v_{1P} \\ \vdots & \vdots & & \vdots \\ \alpha_N v_{11} & \alpha_N v_{12} & \cdots & \alpha_N v_{1P} \\ \end{array} \right) \\ &= \alpha v_1^T \quad \mbox{(outer product of the score vector $\alpha = (\alpha_1, \ldots, \alpha_N)^T$ and $v_1$)} \end{aligned} \]
And if we explicitly include the error term:
\[ X = \alpha v_1^T + E \]
where \( E \) is a residual matrix with entries
\[ \begin{aligned} E_{ij} &= x_{ij} - \hat{x}_{ij} \\ &= x_{ij} - \alpha_i v_{1j} \end{aligned} \]
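In code, the rank-one approximation and its residual matrix look like this (a sketch, still using the toy X; prcomp() centers the data, so this approximates the centered matrix):

```r
pc    <- prcomp(X, center = TRUE, scale. = FALSE)
Xc    <- scale(X, center = TRUE, scale = FALSE)
alpha <- pc$x[, 1]            # scores alpha_1, ..., alpha_N
v1    <- pc$rotation[, 1]     # first principal component (the loadings)

X_hat <- outer(alpha, v1)     # N x P rank-one matrix, i.e. alpha v1^T
E     <- Xc - X_hat           # residual matrix: what PC 1 fails to capture

# The total sum of squares splits into an "explained" piece and a "residual" piece.
c(total = sum(Xc^2), explained = sum(X_hat^2), residual = sum(E^2))
```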
With this in place, we can now define principal components 2 and up! The second principal component \( v_2 \) is found by applying exactly the same variance-maximization procedure to the residual matrix \( E \) instead of to \( X \).
Thus principal component \( M \) is defined recursively, via the residuals left over from the fit of principal components 1 through \( M-1 \).
Note: in practice we often stop with far fewer than \( P \) principal components.
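A quick numerical check of the recursion (a sketch, reusing E and pc from the snippet above): running the same one-component procedure on the residual matrix recovers the second loading vector, up to an arbitrary sign flip.

```r
# E already has mean-zero columns, so no further centering is needed.
v2_from_residuals <- prcomp(E, center = FALSE)$rotation[, 1]
cbind(from_residuals = v2_from_residuals, direct = pc$rotation[, 2])
```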
Let's see the examples in congress109.R and NCI60.R.