3.2 - Principal component analysis

Motivation

  • Principal component analysis (PCA), like MDS, is a dimension reduction technique.
  • It is a way of representing high-dimensional data in lower dimensions, in order to:
    • Visualize
    • Identify characteristics of similar observations
    • Understand underlying structure
  • While classic metric multidimensional scaling \(\equiv\) PCA, the motivation is in general a bit different:
    • MDS: find a low-dimensional representation that preserves distance
    • PCA: find a low-dimensional representation that preserves explained variability

PCA is a linear combination of X’s

  • Starting point: \(n\times p\) data matrix \(X\), want to represent in \(k\leq p\) dimensions
  • Let \(X_{ij}\) represent the value of the \(j^{th}\) variable (column) for the \(i^{th}\) observation (row)
    • \(j=1,2,...,p\)
    • \(i = 1,2,...,n\)
  • For the \(i^{th}\) data observation, the \(k^{th}\) principal component (also known as the \(k^{th}\) score), \(PC_{ik}\), is a linear combination (recall: weighted sum) of the \(X\)’s:

\[PC_{ik} = v_{k1} X_{i1} + v_{k2} X_{i2} + ... + v_{kp} X_{ip}\]

  • The weights \((v_{k1},v_{k2},...,v_{kp})\) are called the loadings of the \(k^{th}\) principal component, where \(\sum_{j=1}^pv_{kj}^2 = 1\) (i.e., each vector of loadings has \(||v_k||_2 = 1\)); a small numeric example of computing a score follows this list.
  • \(PC_{1:n,k}\) and \(PC_{1:n, k'}\) are uncorrelated for \(k \neq k'\) (i.e., distinct principal components are uncorrelated)
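  • A tiny numeric sketch of one score as a weighted sum (hypothetical values, not from any data set used later):
x_i <- c(0.3, -1.1, 0.8)   # one (scaled) observation, p = 3
v_k <- c(0.6, 0.8, 0.0)    # a hypothetical loading vector
sum(v_k^2)                 # check: loadings have squared weights summing to 1
sum(v_k * x_i)             # the score PC_ik = 0.6*0.3 + 0.8*(-1.1) + 0*0.8 = -0.7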

Matrix form

\[ \scriptsize \begin{pmatrix} PC_{11} & PC_{12} & \dots & PC_{1k} \\ PC_{21} & PC_{22} & \dots & PC_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ PC_{n1} & PC_{n2} & \dots & PC_{nk} \end{pmatrix}_{n \times k} = \begin{pmatrix} X_{11} & X_{12} & \dots & X_{1p} \\ X_{21} & X_{22} & \dots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \dots & X_{np} \end{pmatrix}_{n \times p} \begin{pmatrix} v_{11} & v_{21} & \dots & v_{k1} \\ v_{12} & v_{22} & \dots & v_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ v_{1p} & v_{2p} & \dots & v_{kp} \end{pmatrix}_{p \times k} \]

  • Each score \(PC_{ik}\) is the product of the \(i^{th}\) row of \(X\) with the \(k^{th}\) column of the loadings matrix; equivalently, the \(k^{th}\) column of scores is \(X\) times the \(k^{th}\) loading vector.

Properties

\[\scriptsize \begin{pmatrix} \color{blue}{PC_{11}} & \color{orange}{PC_{12}} & \dots & PC_{1k} \\ \color{blue}{PC_{21}} & \color{orange}{PC_{22}} & \dots & PC_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \color{blue}{PC_{n1}} & \color{orange}{PC_{n2}} & \dots & PC_{nk} \end{pmatrix} = X \begin{pmatrix} \color{blue}{v_{11}} & \color{orange}{v_{21}} & \dots & v_{k1} \\ \color{blue}{v_{12}} & \color{orange}{v_{22}} & \dots & v_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ \color{blue}{v_{1p}} & \color{orange}{v_{2p}} & \dots & v_{kp} \end{pmatrix}\]

  • \(\color{blue}{PC_{1:n,1}}=X\color{blue}{v_{1,1:p}}\), \(\color{orange}{PC_{1:n,2}}=X\color{orange}{v_{2,1:p}}\), etc.
  • \(\color{blue}{\sum_{j=1}^pv_{1j}^2} =\color{orange}{\sum_{j=1}^pv_{2j}^2}= 1\), etc.
  • \(\color{blue}{PC_{1:n,1}}\) and \(\color{orange}{PC_{1:n,2}}\) are uncorrelated with each other and all other PCs

Goal of PCA

Find loadings to produce PCs that:

  • Maximize explained variation in the reduced \(k\)-dimensional space (what does this mean?!);
  • Are uncorrelated with each other, i.e. each score tells us something entirely unique about the \(k\)-dimensional representation.
  • Put together, this means:
    • \(\color{blue}{PC_{1:n,1}}\) explains as much variability in \(X\) as possible
    • \(\color{orange}{PC_{1:n,2}}\) explains as much remaining variability in \(X\) as possible that hasn’t already been explained by \(\color{blue}{PC_{1:n,1}}\)
    • \(\color{black}{PC_{1:n,3}}\) explains as much remaining variability in \(X\) as possible that hasn’t already been explained by \((\color{blue}{PC_{1:n,1}}, \color{orange}{PC_{1:n,2}})\)
    • etc.

Starting small

  • Let’s gain some intuition behind what it means to preserve “variability” by taking a look at the cars data set.
  • Here, of course, we have \(p=2\) dimensional data, so “dimension reduction” means going down to \(k=1\) dimension:
head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
  • Consider loadings, where:
    • \(V_1 = \begin{pmatrix}1 \\ 0 \end{pmatrix}\)
      • PC \(XV_1\) = speed
    • \(V_2 = \begin{pmatrix}0 \\ 1 \end{pmatrix}\)
      • PC \(XV_2\) = distance
    • \(V_3 = \begin{pmatrix} \sqrt{0.5} \\ \sqrt{0.5} \end{pmatrix}\)
      • PC \(XV_3\) = equal weight to speed and distance
    • Note that \(||V_1||_2 = ||V_2||_2 = ||V_3||_2 = 1\)

Plotting the reductions

2D plot:

V1 <- matrix(c(1,0), 2, 1)
V2 <- matrix(c(0,1), 2, 1)
V3 <- matrix(sqrt(c(0.5,0.5)), 2, 1)

XV1 <- (as.matrix(cars) %*% V1) %>% data.frame
XV2 <- (as.matrix(cars) %*% V2) %>% data.frame
XV3 <- (as.matrix(cars) %*% V3) %>% data.frame

1D principal components:
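A minimal sketch (assuming ggplot2 is loaded; the original figure is not reproduced here) of how the three candidate 1D reductions could be shown as strip plots:

library(ggplot2)
# stack the three 1-D reductions, one labeled row per candidate loading vector
reductions <- rbind(
  data.frame(loading = "V1 (speed only)",    score = as.numeric(as.matrix(cars) %*% V1)),
  data.frame(loading = "V2 (dist only)",     score = as.numeric(as.matrix(cars) %*% V2)),
  data.frame(loading = "V3 (equal weights)", score = as.numeric(as.matrix(cars) %*% V3))
)
ggplot(reductions, aes(x = score, y = loading)) +
  geom_point(alpha = 0.5) +
  theme_bw()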

Variance explained

  • So which \(2\rightarrow 1\) dimension reduction (\(XV_1\), \(XV_2\), or \(XV_3\)) is best?
  • Need to consider total amount of variability in full \(p\)-dimensional space
  • Recall: diagonals of \(\Sigma\), the covariance matrix of \(X\), measure variability along each axis in \(p\)-dimensional space.
(Sigma <- cov(cars))
          speed     dist
speed  27.95918 109.9469
dist  109.94694 664.0608
  • Total variability in 2 dimensions: 27.959 + 664.061 = 692.02
  • So, which of the 1D components explains the greatest proportion of this variability?
var(XV1)
         XV
XV 27.95918
var(XV2)
         XV
XV 664.0608
var(XV3)
         XV
XV 455.9569

…so \(V_2\) is the best loading matrix, and \(XV_2\) is the best score…right??

But it’s not fair!

  • Distance is just measured on a larger scale than speed!
  • It is driving the “total variability,” so any linear combination that weights distance more than speed will explain a larger proportion of variability.

  • If I rescale speed so it is recorded on a 60-times-larger scale (say, in a much finer unit), now giving all the weight to speed explains more total variance:
cars2 <- cars %>% mutate(speed = speed*60)
cov(cars2)
           speed      dist
speed 100653.061 6596.8163
dist    6596.816  664.0608
var(as.matrix(cars2) %*% V1)
         [,1]
[1,] 100653.1
var(as.matrix(cars2) %*% V2)
         [,1]
[1,] 664.0608
var(as.matrix(cars2) %*% V3)
         [,1]
[1,] 57255.38

Always scale!

  • Thus, in general, we should always mean-center and scale our data before performing PCA.
cars_scaled <- scale(cars)
cov(cars_scaled)
          speed      dist
speed 1.0000000 0.8068949
dist  0.8068949 1.0000000

  • The covariance of scaled data is just \(R\), the correlation matrix of \(X\)
  • Total “variability of scaled data” = \(p\), the full dimensionality of data matrix \(X\)
var(as.matrix(cars_scaled) %*% V1)
     [,1]
[1,]    1
var(as.matrix(cars_scaled) %*% V2)
     [,1]
[1,]    1
var(as.matrix(cars_scaled) %*% V3)
         [,1]
[1,] 1.806895

… So it looks like equal weights on the scaled speed and distance variables (\(V_3\)) explain the most variability.
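As a quick check of the proportion of (scaled) variability explained by \(V_3\), a small sketch using the objects defined above:

# total variability of the scaled data is p = 2, so divide the 1-D variance by 2
var(as.matrix(cars_scaled) %*% V3) / ncol(cars_scaled)
# about 0.90, i.e. roughly 90% of the total variability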

How do you find “optimal” loadings?

  • Recall:
    • \(V_1 = \begin{pmatrix}1 \\ 0 \end{pmatrix}\)
      • PC \(XV_1\) = speed
    • \(V_2 = \begin{pmatrix}0 \\ 1 \end{pmatrix}\)
      • PC \(XV_2\) = distance
    • \(V_3 = \begin{pmatrix} \sqrt{0.5} \\ \sqrt{0.5} \end{pmatrix}\)
      • PC \(XV_3\) = equal weight to speed and distance
  • It turns out \(V_3\) is best of these 3, for our scaled data example.
  • Is \(V_3\) the best there is, or is there one better?
  • How do we find the optimal loadings, in general?

SVD/eigen analysis of \(R\)!

  • Given a mean-centered and scaled data matrix \(X_S\), the optimal loadings for a reduced \(k\)-dimensional representation are the first \(k\) columns of \(V\) from the singular value decomposition:

\[X_S = UDV^T\]

  • Finding the optimal loadings for a \(2\rightarrow 1\) transformation:
V <- svd(cars_scaled)$v
V[,1]
[1] 0.7071068 0.7071068
  • Have we seen these before?
c(sqrt(0.5), sqrt(0.5))
[1] 0.7071068 0.7071068
  • These are also the first \(k\) eigenvectors of the correlation matrix!
eigen(cor(cars))$vector[,1]
[1] 0.7071068 0.7071068
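For reference, base R’s prcomp() wraps this same centering/scaling + SVD recipe; its rotation matrix holds the loadings (columns may differ from the SVD result by a sign flip):

pca_cars <- prcomp(cars, center = TRUE, scale. = TRUE)
pca_cars$rotation[, 1]   # first loading vector; same as V[,1] up to a possible sign flip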

Finding one score

  • The \(k^{th}\) score or principal component, \(S_k\), can also be found with the SVD ingredients:

\[S_k = U[,k]\cdot D[k]\]

svd_cars_scaled <- svd(cars_scaled)
U <- svd_cars_scaled$u
D <- svd_cars_scaled$d
V <- svd_cars_scaled$v
head(U[,1] * D[1])
[1] -2.648984 -2.429466 -2.192920 -1.699003 -1.729914 -1.760825
as.matrix(cars_scaled) %*% V3 %>% head
          [,1]
[1,] -2.648984
[2,] -2.429466
[3,] -2.192920
[4,] -1.699003
[5,] -1.729914
[6,] -1.760825
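Because the SVD identity \(X_S = UDV^T\) implies \(X_SV = UD\), the two computations above agree exactly; a quick sanity check:

# columns of U scaled by the singular values equal the scaled data times V
all.equal(U[, 1] * D[1], as.numeric(cars_scaled %*% V[, 1]))
# should be TRUE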

Finding first \(k\) scores

  • We will often be interested in the first \(k\) scores, \(k\leq p\).
  • For our 2D car example, the largest \(k\) can be is 2.
  • Finding the first \(k\) scores using \(U\) and \(D\) from SVD:

\[S_{1:k} = U[,1:k]diag(D[1:k])\]

(first2scores <- U[,1:2]%*% diag(D[1:2]) 
  %>% data.frame
) %>% head
         X1          X2
1 -2.648984 -0.40001729
2 -2.429466 -0.61953575
3 -2.192920 -0.05371253
4 -1.699003 -0.54762908
5 -1.729914 -0.24926210
6 -1.760825  0.04910488
  • Finding the first \(k\) scores using scaled \(X_S\) and first \(k\) loadings from \(V\):

\[S_{1:k} = X_SV[,1:k]\]

(first2scores <- cars_scaled %*% V[,1:2]
  %>% data.frame
 ) %>% head
         X1          X2
1 -2.648984 -0.40001729
2 -2.429466 -0.61953575
3 -2.192920 -0.05371253
4 -1.699003 -0.54762908
5 -1.729914 -0.24926210
6 -1.760825  0.04910488
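For comparison, prcomp() returns these same scores in its $x component (again, columns may be flipped in sign relative to the SVD-based calculation):

head(prcomp(cars, center = TRUE, scale. = TRUE)$x)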

Visualizing first \(k\) scores when \(k=p\)

  • If \(k=p\) (i.e., number of PCs = dimension of the data matrix), then:
    • Scores are a rotation of the scaled data, such that \(Cor(PC1, PC2) = 0\).
    • Maximum variability is along the first PC; second-most along the second PC; etc.
  • A sketch of this rotation, using the cars scores from above, is given below.
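A minimal sketch of this rotation (assumes ggplot2 plus the patchwork package, which is not used elsewhere in these notes, to arrange the two panels):

library(ggplot2)
library(patchwork)
# left panel: the scaled cars data; right panel: the same points rotated into PC space
p_data   <- ggplot(data.frame(cars_scaled), aes(speed, dist)) +
  geom_point() + coord_equal() + ggtitle("Scaled data") + theme_bw()
p_scores <- ggplot(first2scores, aes(X1, X2)) +
  geom_point() + coord_equal() +
  labs(x = "PC1", y = "PC2", title = "Scores (a rotation)") + theme_bw()
p_data + p_scores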

3D example: Prehistoric goblets

  • Suppose you have been asked to analyze data on 25 prehistoric goblets from Thailand (Professor C.F.W. Higham, University of Otago, as taken from Manly, B.F.J. 1986. Multivariate Statistical Methods: A Primer. Chapman and Hall, London, 159pp.)

  • Let’s consider three of these:

  • \(X_1\) = mouth width
  • \(X_2\) = total width
  • \(X_3\) = total height

Reading and plotting

goblets <- read.csv('Data/goblets.csv') %>% 
  select(MouthWidth:TotalHeight)
head(goblets)
  MouthWidth TotalWidth TotalHeight
1         13         21          23
2         14         14          24
3         19         23          24
4         17         18          16
5         19         20          16
6         12         20          24
  • Initial questions:
    • These variables are correlated. Do we need all 3 to characterize the goblets?
    • Is there a linear combination of these variables that reduces these variables to 2 dimensions that works well to characterize these goblets?
library(GGally)
ggpairs(goblets) + theme_bw(base_size = 14)

Finding the scores

  • Step 1: Mean-center and scale
goblets_scaled <- scale(goblets)
  • Step 2: SVD the scaled data
goblet_svd <- svd(goblets_scaled)
U <- goblet_svd$u
D <- diag(goblet_svd$d)
V <- goblet_svd$v
  • Step 3: Find the scores

    • Approach A: Use U and D
PCs <- U %*% D
head(PCs, 3)
           [,1]        [,2]         [,3]
[1,]  0.6708634 -0.39922090 -0.328157212
[2,] -0.1146955 -0.02834772  1.081974437
[3,]  2.0481499  1.08716822  0.009978735
  • Approach B: Apply the loading weights V to scaled \(X\)
PCs <- goblets_scaled %*% V
head(PCs, 3)
           [,1]        [,2]         [,3]
[1,]  0.6708634 -0.39922090 -0.328157212
[2,] -0.1146955 -0.02834772  1.081974437
[3,]  2.0481499  1.08716822  0.009978735

The 3D PCs are just rotated scaled data

  • The original variables are correlated with each other
  • The PCs are uncorrelated: they are a rotation of the scaled data (a sketch of the pairwise PC plots is below)
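A sketch of the missing pairwise plots, using GGally (already loaded above):

# pairwise plots of the goblet PCs: the off-diagonal correlations should all be ~0
ggpairs(data.frame(PCs)) + theme_bw(base_size = 14)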

Diminishing returns

  • An important fact about principal components:
    • Each component explains a diminishing amount of variability
    • Recall, total variability of all PCs = \(p\) (see the scree-plot sketch below)
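A minimal scree-plot sketch (assuming ggplot2) of the variance explained by each goblet PC:

# variance of each PC, plotted in order: the classic "diminishing returns" shape
scree <- data.frame(PC = factor(1:3), variance = apply(PCs, 2, var))
ggplot(scree, aes(x = PC, y = variance)) +
  geom_col() +
  labs(y = "Variance of PC (eigenvalue of R)") +
  theme_bw()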

Each PC explains something different from the others

  • Recall all PCs must be orthogonal (uncorrelated), as ours are:
cor(PCs) %>% round(2)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Measuring quality of 2D representation

  • We can use this information to measure the proportion of total variability explained by each PC:
(PCs
 %>% data.frame()
 %>% summarize(across(everything(), var))
)
        X1       X2         X3
1 2.228046 0.672099 0.09985496
(PCs
 %>% data.frame()
 %>% summarize(across(everything(), var))
 %>% mutate(across(everything(), \(x) x/3))
)
        X1       X2         X3
1 0.742682 0.224033 0.03328499

Takeaways:

  • PC1 alone explains 74.27% of the total variability (amount of information) in these 3 variables
  • PC2 explains an additional 22.40%
  • PC1 and PC2 together explain (74.27+22.40)=96.67% (!!) of the variability \(\implies\) we lose very little information representing these 3 variables in 2 dimensions
  • PC3 adds virtually no information, explaining the remaining 3.33% (the cumulative proportions are computed in the short sketch below).
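The cumulative proportions quoted above can be computed directly:

# cumulative proportion of total variability explained by the first k PCs
cumsum(apply(PCs, 2, var)) / ncol(PCs)
# approximately 0.74, 0.97, 1.00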

Variance of PCs = eigenvalues of \(R\)!

  • We’ve already seen how the loadings are the eigenvectors of \(R\), the correlation matrix (aka covariance of scaled data)
  • The eigenvalues of \(R\) are exactly equal to amount of variability explained by each PC:
(PCs
 %>% data.frame()
 %>% summarize(across(everything(), var))
)
        X1       X2         X3
1 2.228046 0.672099 0.09985496
R <- cor(goblets)
eigen(R)$values
[1] 2.22804607 0.67209898 0.09985496

PC interpretations

  • We can understand what each PC represents by analyzing the first couple loadings:
loadings <- goblet_svd$v
rownames(loadings) <- colnames(goblets)
loadings[,1:2]
                 [,1]       [,2]
MouthWidth  0.4926846  0.8186937
TotalWidth  0.6474517 -0.1183912
TotalHeight 0.5814364 -0.5618933
  • Loadings for PC1 (0.49, 0.65, 0.58):
    • Appear to give relatively equal weight to MouthWidth, TotalWidth, and TotalHeight
    • Large PC1 = wider, taller goblets
    • Small PC1 = narrower, shorter goblets
  • Loadings for PC2 (0.82, -0.12, -0.56):
    • Appears to contrast MouthWidth with TotalHeight
    • Large PC2 = large MouthWidth but small TotalHeight: short, stout goblets

PC1 vs PC2

Let’s see if we’re right by plotting just the first 2 PCs, using size and color to represent MouthWidth and TotalHeight (scatterplot created with ggplot, ellipses/arrows/annotations added with PPT):

loadings[,1:2]
                 [,1]       [,2]
MouthWidth  0.4926846  0.8186937
TotalWidth  0.6474517 -0.1183912
TotalHeight 0.5814364 -0.5618933
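A sketch of the described plot (assuming ggplot2; the ellipses, arrows, and annotations from the original figure are not reproduced):

# PC1 vs PC2, with point size mapped to MouthWidth and color mapped to TotalHeight
plot_df <- data.frame(PC1 = PCs[, 1], PC2 = PCs[, 2],
                      MouthWidth  = goblets$MouthWidth,
                      TotalHeight = goblets$TotalHeight)
ggplot(plot_df, aes(x = PC1, y = PC2, size = MouthWidth, color = TotalHeight)) +
  geom_point() +
  theme_bw()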

Summary

  • PCA is a dimension reduction technique.
  • PCA should always be done on scaled \(X\). Assume \(X\) scaled for rest of summary.
  • Each PC/score is a linear combination of the \(X\) columns.
  • All PCs are uncorrelated.
  • PC1 explains as much total variability in \(X\) as possible, PC2 as much remaining that has not been explained by PC1, etc.
  • There are up to \(p\) possible PCs. If all \(p\) are considered, the matrix of PCs is a rotation of \(X\) to be an “uncorrelated \(p\)-dimensional (American) football.”
  • Each PC is a linear combination of \(p\) data columns. The linear weights are called loadings.

Summary (continued)

  • Loadings are:
    • \(V\) from SVD of \(X\)
    • Eigenvectors of \(R\)
  • First \(k\) PCs:
    • First \(k\) columns of \(UD\) from SVD of \(X\)
    • First \(k\) columns of \(XV\)
  • Variance of \(PC_k\) = eigenvalue \(k\) of \(R\) (a compact code sketch of this recipe follows)
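A compact sketch (not from the original notes) wrapping the recipe above into one function:

pca_sketch <- function(X, k = 2) {
  Xs <- scale(X)            # Step 1: mean-center and scale
  s  <- svd(Xs)             # Step 2: Xs = U D V^T
  list(                     # Step 3: scores, loadings, and proportion of variance explained
    scores        = Xs %*% s$v[, 1:k, drop = FALSE],
    loadings      = s$v[, 1:k, drop = FALSE],
    var_explained = (s$d^2 / (nrow(Xs) - 1))[1:k] / ncol(Xs)
  )
}
pca_sketch(goblets, k = 2)$var_explained   # ~0.74 and ~0.22, as above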

Concept question 1

Consider the plots below of four mean-centered and scaled data sets:

The first PC for each data set is found and the variance of each PC1 shown below:

var(firstpc_1)
[1] 1.750179
var(firstpc_2)
[1] 1.942512
var(firstpc_3)
[1] 1.116269
var(firstpc_4)
[1] 1.377787

Match the PC1 to the data set from which it comes.

Concept question 2

Consider the plots below of four mean-centered and scaled data sets:

The second PC for each data set is found and the variance of each PC2 shown below:

var(secondpc_1)
[1] 0.2498212
var(secondpc_2)
[1] 0.05748802
var(secondpc_3)
[1] 0.8837308
var(secondpc_4)
[1] 0.6222128

Match the PC2 to the data set from which it comes.

Concept question 3

Consider the plot below:

4 possible sets of loadings for the first PC are:

A. (0.43, 0.90)
B. (0.90, -0.43)
C. (0.71, 0.71)
D. (-0.43, 0.90)

Which of these are the loadings for the first PC from this data set?

Concept question 4

Consider the following scaled data and loadings:

Scaled data:

   X1    X2    X3
  1.2  -0.8   0.5
  0.0   1.5   1.8

Loadings:

Variable     V1       V2       V3
X1          0.707   -0.500    0.500
X2          0.500    0.707    0.500
X3          0.500    0.500   -0.707

Reduce these data to one dimension using the first PC.

Concept question 5

Consider the following correlation matrix and subsequent eigendecomposition:

R
           Math Reading Science Hours Attendance
Math       1.00    0.72    0.85  0.65       0.43
Reading    0.72    1.00    0.68  0.58       0.39
Science    0.85    0.68    1.00  0.71       0.47
Hours      0.65    0.58    0.71  1.00       0.34
Attendance 0.43    0.39    0.47  0.34       1.00
eigen(R)
eigen() decomposition
$values
[1] 3.3829121 0.7271795 0.4280074 0.3227359 0.1391651

$vectors
           [,1]        [,2]        [,3]        [,4]        [,5]
[1,] -0.4943116 -0.14078526 -0.19307822  0.49961607  0.67002993
[2,] -0.4527383 -0.14549796 -0.64466890 -0.58799530 -0.11190154
[3,] -0.5009088 -0.09446515  0.07063200  0.46830772 -0.71823792
[4,] -0.4406478 -0.27485999  0.73046871 -0.42044049  0.14116213
[5,] -0.3249676  0.93516746  0.09246291 -0.09253642  0.05239731

A. What percent of the total variability is explained by reducing the dimension of these data to the first two principal components?

B. What would this code produce:

V <- eigen(R)$vectors
(scaled_data %*% V)[,3] %>% 
  var