PCA is a sophisticated technique for reducing the dimensionality of high dimensional datasets. It allows us to express the data in terms of a smaller set of new features, at the expense of distorting the original data, and unlike SVD we never decompress the data. It’s useful for extremely large amounts of very high dimensional data, but for run-of-the-mill data it’s hard to beat analysing a scatter plot matrix and manually removing the features that look most redundant.

Convention, terminology and more nuanced differences with SVD

Modern convention prefers the data matrix \(A\in\mathbb{R}^{m\times n}\), where the \(m\) samples (aka observations) - the grouped collections of single-value measurements for each individual subject - occupy the rows (i.e. data types are fixed within columns but may vary across a row), and the \(n\) features (aka variables) - the quantities measured across all subjects - occupy the columns (i.e. data types are fixed within individual columns). (Strang, 2016) uses the older, gene-expression-style transposed data matrix convention, which swaps the left and right singular vector matrices in the SVD, and performs PCA using the left \(U\) matrix (the eigenvectors of the covariance matrix), whereas modern PCA is typically computed using the right \(V\) matrix.

Universally, principal components calculated in the measurement/variable space are expressed as loadings, and their projection onto the sample/observation space gives the associated scores. Conversely, loadings may also be calculated in the sample/observation space and represented as scores in the measurement/variable space, but I haven’t ever come across this. Finally, in SVD the loadings calculated in one space correspond exactly to the scores projected from the singular vectors in the other; this is not the case for PCA, as manipulation of the data matrix (centring and scaling) breaks the parity between the spaces.

head(iris,2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
library(data.table)

# only use a manageable number of flowers
dt_iris_100_labelled <- as.data.table(iris)[
  sample(.N, 100, replace = FALSE)
]

# drop the non-numeric species column
dt_iris_100 <- dt_iris_100_labelled[, !"Species"]

A <- as.matrix(dt_iris_100)


SVD data matrix and singular vectors:

Singular value decomposition uses the raw data matrix: no centring or scaling is applied. Using the modern convention the SVD takes the form \(A=U\:\Sigma\:V^T\) where:

\[AA^T=U\Sigma V^T(U\Sigma V^T)^T=U\Sigma^2U^T\]

if \(U\) is an orthogonal eigenvector matrix this is the diagonalisation of the positive semi-definite \(AA^T\) matrix: \[U\Sigma^2U^T\sim Q\Lambda Q^{T} \sim S\Lambda S^{-1}\\\:\\ \therefore\quad AA^T\:S=S\Lambda\;\sim\;AA^T\:U=U\Sigma^2\]

So the left singular vectors \(\vec{\mathbb{u}}_i\in\mathbb{R}^m\) span the observation space: they are the eigenvectors of \(AA^T\), and the singular values are the square roots of the corresponding eigenvalues, \(\sigma_i=\sqrt{\lambda_i}\). \[ AA^T\:\vec{\mathbb{u}_i}=\sigma_i^2\vec{\mathbb{u}_i} \]

\[A^TA=(U\Sigma V^T)^TU\Sigma V^T=V\Sigma^2V^T\]

if \(V\) is an orthogonal eigenvector matrix this is the diagonalisation of the positive definite \(A^TA\) matrix: \[V\Sigma^2V^T\sim Q\Lambda Q^{T} \sim S\Lambda S^{-1}\\\:\\ \therefore\quad A^TA\:S=S\Lambda\;\sim\;A^TA\:V=V\Sigma^2\]

So the right singular vectors \(\vec{\mathbb{v}}_i\in\mathbb{R}^n\) span the feature space: they are the eigenvectors of \(A^TA\), with the same singular values \(\sigma_i\), again the square roots of the eigenvalues. \(V\) was orthogonal - lucky that! \[ A^TA\:\vec{\mathbb{v}_i}=\sigma_i^2\vec{\mathbb{v}_i} \]
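We can check these relationships numerically in base R; a quick sketch using the raw \(A\) matrix from above, comparing svd() with eigen():

s <- svd(A)
eig_AtA <- eigen(crossprod(A))   # A^T A  (n x n)
eig_AAt <- eigen(tcrossprod(A))  # A A^T  (m x m)

# singular values are the square roots of the shared non-zero eigenvalues
all.equal(s$d, sqrt(eig_AtA$values))                 # should be TRUE
all.equal(s$d^2, eig_AAt$values[seq_along(s$d)])     # should be TRUE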

PCA data preparation:

PCA nearly always involves centring and sometimes scaling of the data matrix. We generally centre each variable/feature column, \(x'_i = x_i - \mu_x\), so that the columns lie in a proper subspace of the sample space \(\mathbb{R}^m\); this corresponds to working with the sample covariance matrix \(\Sigma\in\mathbb{R}^{n\times n}\). If units vary across the features, or if two columns with the same units have very different spreads, we also scale the features: \(x'_i = \frac{x_i - \mu_x}{\sigma_x}\) 1. This prevents a widely distributed feature from completely dominating the analysis, and corresponds to working with the sample correlation matrix \(R\in\mathbb{R}^{n\times n}\).
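In R both preparations are a single call to scale(); a minimal sketch using the \(A\) matrix from above:

A_centred      <- scale(A, center = TRUE, scale = FALSE) # x' = x - mu
A_standardised <- scale(A, center = TRUE, scale = TRUE)  # x' = (x - mu) / sigma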



PCA with Right Singular Vectors

More often the number of observations in a data matrix greatly exceeds the number of features (\(m\gg n\)), and we perform PCA by first calculating the right singular vectors, i.e. the feature space representation, before projecting them into the observation space - finding the loadings and the scores respectively. In this case we normalise the data across the columns/features of the data matrix…

Sample Covariance Matrix: \(\Sigma = \frac{1}{m-1}A^TA\) (modern convention, \(A\) centred) \[ \begin{align} \Sigma&=\begin{bmatrix} \text{Var}(x) & \dots & \text{Cov}(x,y) \\ \vdots & \ddots & \vdots\\ \text{Cov}(x,y) & \dots & \text{Var}(y) \end{bmatrix}\\ &=\frac{1}{m-1}\begin{bmatrix} \sum_i(x_i-\mu_x)^2 & \dots & \sum_i(x_i-\mu_x)\cdot(y_i-\mu_y) \\ \vdots & \ddots & \vdots \\ \sum_i(x_i-\mu_x)\cdot(y_i-\mu_y) & \dots & \sum_i(y_i-\mu_y)^2 \end{bmatrix} \end{align} \]

sample_covariance_matrix <- cov(A)
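As a quick check, the same matrix computed by hand from the centred data (a sketch):

# centre the columns, then (1/(m-1)) * t(A_c) %*% A_c
A_c <- scale(A, center = TRUE, scale = FALSE)
all.equal(crossprod(A_c) / (nrow(A) - 1), sample_covariance_matrix,
          check.attributes = FALSE)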

Sample Correlation Matrix: \(R = \frac{1}{m-1}DA^TAD\) (modern convention, \(A\) centred, \(D=\text{diag}(1/\sigma_x,\dots,1/\sigma_y)\)) \[ \begin{align} R&=\begin{bmatrix} 1 & \dots & \text{Cor}(x,y) \\ \vdots & \ddots & \vdots \\ \text{Cor}(x,y) & \dots & 1 \end{bmatrix}\\ &=\begin{bmatrix} 1 & \dots & \frac{\text{Cov}(x,y)}{\sigma_x\sigma_y} \\ \vdots & \ddots & \vdots \\ \frac{\text{Cov}(x,y)}{\sigma_x\sigma_y} & \dots & 1 \end{bmatrix} \end{align} \]

sample_correlation_matrix <- cor(A)
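And the correlation matrix is just the covariance matrix rescaled by the feature standard deviations (the \(D\,\Sigma\,D\) above); cov2cor() performs exactly that rescaling:

all.equal(cov2cor(sample_covariance_matrix), sample_correlation_matrix)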

I’m seeing very strong positive correlations and a few negative ones due to two separate clusters in the feature space:

Aside: Setosa can be readily identified from its smaller petals, or even from the combination of the two sepal features, i.e. its fatter sepals compared to Versicolor and Virginica. This species is responsible for the second cluster.


PCA assumes that our data is centred and primarily captures variance within a continuous subspace of the feature space. Distinct clusters are the enemy of PCA: their presence can distort the analysis, as PCA may align with the separation between clusters rather than with meaningful variance within them.

I’ll resample and remove the Setosa species from the dataset. When such prior knowledge is unavailable, clustering techniques such as DBSCAN can be applied to identify and eliminate multiple clusters before performing PCA.
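A sketch of that resample (the exact sample isn’t shown here, so treat the sample size as illustrative):

# drop Setosa, then take a fresh random sample of the remaining flowers
dt_iris_100_labelled <- as.data.table(iris)[Species != "setosa"][
  sample(.N, 100, replace = FALSE)
]
dt_iris_100 <- dt_iris_100_labelled[, !"Species"]
A <- as.matrix(dt_iris_100)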

…this looks much better!



Dual formulation of PCA

Less often, the number of features in a data matrix greatly exceeds the number of observations (\(n\gg m\)) and we are interested in the similarity between observations rather than the variance of features - I’m not sure how common this is! Here we perform PCA by first calculating the left singular vectors, i.e. the observation space representation, before projecting them into the feature space - finding the loadings and the scores respectively. In this case we normalise the data across the rows/observations of the data matrix…

We can use the variable covariance matrix \(\Sigma\) or the variable correlation matrix \(R\), both in \(\mathbb{R}^{m\times m}\), to calculate the left (sample space) singular vectors.

Variable Covariance Matrix: \(\Sigma = \frac{1}{n-1}AA^T\) (modern convention)

Sigma <- cov(t(A))

Variable Correlation Matrix: \(R = \frac{1}{n-1}DAA^TD\) (modern convention)

R <- cor(t(A))


The directions of the singular vectors calculated from the observation space do not correspond to the left and right singular vectors calculated in feature space, because the normalisation (centring and scaling) behind the covariance and correlation matrices differs between the two cases: normalisation across features produces quite different data to normalisation across observations! (Contrast this with SVD, where the raw data matrix is used with no normalisation and the left and right singular vectors do match up.) In PCA we must therefore calculate the observation scores as linear combinations of the features, defined by the loadings from the right singular vectors, rather than through diagonalisation of \(AA^T\) as in SVD.

There is not enough data and all features in the sample space look terribly correlated - there is likely going to be one dominant singular value… I’m not going to progress this any more.

Singular Vectors & Principal Components

Modern convention has the left singular vectors as an orthogonal basis for the observation space (\(\mathbb{R}^m\), the column space of \(A\)) and the right singular vectors as an orthogonal basis for the feature space (\(\mathbb{R}^n\), the row space of \(A\)). Usually there are more observations than features, \(A\in\mathbb{R}^{m\times n}\) with \(m\gg n\), and \(r=\text{Rank}(A^TA)=\text{Rank}(AA^T)\). The first \(r\) left singular vectors (those with non-zero singular values) are paired with the first \(r\) right singular vectors via \(A\vec{\mathbb{v}}_i=\sigma_i\vec{\mathbb{u}}_i\): each pair describes the same principal direction, represented in observation space and feature space respectively. The subset of these singular vectors which have large (or at least appreciable) singular values are considered principal components.

The left singular vector matrix \(U\in\mathbb{R}^{m\times m}\) is typically a much larger matrix than the right singular vector matrix \(V\in\mathbb{R}^{n\times n}\), and its columns beyond the first \(r\) span the left null space of \(A\); in practice we only use the first \(r\) columns of each.
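Concretely, keeping only the first \(r\) columns of \(U\) and \(V\) (and the first \(r\) singular values) gives the familiar truncated form; a quick sketch, with \(r=2\) chosen purely for illustration:

s <- svd(A)
r <- 2
A_r <- s$u[, 1:r] %*% diag(s$d[1:r]) %*% t(s$v[, 1:r])
dim(A_r)   # same shape as A, but built from only r components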

In PCA we prioritise the principal component (observation) scores, the representation of the principal components in sample/observation space; however, the principal component loadings, their representation in feature/variable space, reveal the directions of maximum variance in the dataset.

Right Singular Vectors \(V\) (and their Loadings/Feature Space Representation)

To calculate the right singular vector matrix \(V\), we solve \(A^TAV=V\Sigma^2\) (in practice, by diagonalising the covariance or correlation matrix of the prepared data). This allows us to express each observation as a linear combination of the right singular vectors \(v_i\in\mathbb{R}^n\). Essentially this is a rotation of the axes, where each new axis captures the maximum remaining variance in the data. The singular values \(\sigma_i\) indicate the importance of each axis, with larger values reflecting greater variance; more specifically, we can directly compute the proportion of variance explained by each axis.

# Eigen-decomposition of the correlation matrix: the eigenvectors give the
# principal component directions and the eigenvalues their variances
diagonalisation <- eigen(cor(dt_iris_100))

# Order eigenvalues (decreasing), keeping values and vectors aligned
Sigma <- diagonalisation$values
Sigma <- Sigma[order(diagonalisation$values, decreasing = TRUE)]

# Order the right singular vectors to match
V <- diagonalisation$vectors
V <- V[, order(diagonalisation$values, decreasing = TRUE)]

# flip PC1 to correspond to flower size
V[, 1] <- -V[, 1]

# Proportion of variance explained by each component
pve <- Sigma / sum(Sigma)

Approximately \(74\%\) of the variance is explained by the first principal component, PC1, which is very strongly negatively correlated with all the features. We are of course free to flip PC1, \(\vec{v}_1\to-\vec{v}_1\), and name it Perianth.Size. The second component, PC2, captures \(\sim16\%\) and closely represents Sepal.Broadness. Associating meaningful names with the principal components is imprecise, as there is usually some overlap or crossover in the features they capture.

Feature Loadings represent the components of the singular vectors within the feature space. They indicate the specific contribution of each feature to the principal components, showing the linear combination of features that define each PC.
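For readability it helps to attach the feature names to \(V\) (a small convenience step; eigen() doesn’t set them):

rownames(V) <- colnames(dt_iris_100)          # rows of V are the features
colnames(V) <- paste0("PC", seq_len(ncol(V))) # columns are the principal components
round(V[, 1:2], 2)                            # loadings of PC1 and PC2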

Biplots are the conventional way to show the singular vectors in the original feature space.






A biplot shows the singular vector directions spatially, but a simple bar graph is perhaps a cleaner way to present the relative importance of the variables.
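For instance, a sketch of such a bar graph with ggplot2, using the loadings in V computed above:

library(ggplot2)

# long format: one row per (feature, PC) pair for the first two components
loadings_long <- data.table(
  feature = rep(colnames(dt_iris_100), times = 2),
  PC      = rep(c("PC1", "PC2"), each = ncol(dt_iris_100)),
  loading = c(V[, 1], V[, 2])
)

ggplot(loadings_long, aes(x = feature, y = loading, fill = PC)) +
  geom_col(position = "dodge") +
  labs(x = NULL, y = "Loading", title = "Feature loadings for PC1 and PC2")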

Observation Scores represent the principal components within the observation space: they are the projections of the observations (or samples) onto the principal components, so each observation’s score on a PC is the linear combination of its (centred and scaled) feature values weighted by that PC’s loadings.
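A sketch of that projection, using the standardised data and the loadings \(V\) from above:

A_std  <- scale(as.matrix(dt_iris_100)) # centre and scale each feature
scores <- A_std %*% V                   # one row per observation, one column per PC
head(scores, 3)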

Practical PCA workflow with prcomp

1. impute missing values
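iris has no missing values, but as a minimal sketch, mean imputation in data.table might look like this:

dt_iris_imputed <- copy(dt_iris_100)   # avoid modifying the original by reference
for (col in names(dt_iris_imputed)) {
  x <- dt_iris_imputed[[col]]
  set(dt_iris_imputed, i = which(is.na(x)), j = col, value = mean(x, na.rm = TRUE))
}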

2. standardise the data (can be deferred)

dt_iris_100_scaled <- scale(dt_iris_100) 

# scale() stores the mean of each feature used for centring as an attribute
attr(dt_iris_100_scaled, "scaled:center")
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        6.045        2.865        4.640        1.610
# plus a bunch of additional attributes
attributes(dt_iris_100_scaled)
## $dim
## [1] 20  4
## 
## $dimnames
## $dimnames[[1]]
## NULL
## 
## $dimnames[[2]]
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## 
## 
## $`scaled:center`
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        6.045        2.865        4.640        1.610 
## 
## $`scaled:scale`
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.6847781    0.2924938    0.7949843    0.4470959

3. perform PCA (& standardisation, if it was deferred)

pca_result <- prcomp(dt_iris_100_scaled, center = TRUE, scale. = TRUE)

V <- pca_result$rotation # singular vector loadings
Sigma <- pca_result$sdev # standard deviation of each PC (the diagonal, up to a constant factor)

mu_feature <- pca_result$center # feature means used for centring
sigma_feature <- pca_result$scale # feature std. devs. used for scaling
U <- pca_result$x # observation scores
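One quick sanity check (a sketch): because the rotation matrix is orthogonal, projecting the scores back through its transpose recovers the standardised data.

# scores %*% t(loadings) reproduces the centred/scaled data (within tolerance)
all.equal(U %*% t(V), dt_iris_100_scaled, check.attributes = FALSE)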

summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7557 0.6957 0.62015 0.22137
## Proportion of Variance 0.7706 0.1210 0.09615 0.01225
## Cumulative Proportion  0.7706 0.8916 0.98775 1.00000
# Biplot to visualize PCA
biplot(pca_result, scale = 0)

# Scree plot to visualize explained variance
variance_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
ggplot(data.frame(PC = 1:length(variance_explained), VarianceExplained = variance_explained), 
       aes(x = PC, y = VarianceExplained)) +
  geom_line() +
  geom_point() +
  xlab("Principal Component") +
  ylab("Proportion of Variance Explained") +
  ggtitle("Scree Plot")

References

Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press.

  1. Here \(\sigma_x^2 = \text{Var}(x_i)\), which is the variance, and is not a singular value.↩︎