PCA is one of the simplest multivariate methods, and it is widely used for different purposes. For example, in regression a principal component can be used as a response variable to account for the multivariate behaviour of the original response variables under the effect of some predictors, or as a way to fix collinearity problems in a set of predictors. In general, PCA is a method for dimensionality reduction of multivariate data.
The method consists of transforming, through linear combinations, the \(p\) original variables \(Y_1\), \(Y_2\), …, \(Y_p\) into \(p\) new latent variables (the principal components) \(Z_1\), \(Z_2\), …, \(Z_p\), which are uncorrelated, ordered by importance, and able to describe the variability of the original data set in a condensed form, so that a few (say 2 or 3) principal components hold most of that variability. The order of importance is such that
\[Var(Z_1) \geq Var(Z_2) \geq ... \geq Var(Z_p)\]
Then, if the reduction is effective,
\[Var(Z_1) + Var(Z_2) \approx Var(Y_1) + Var(Y_2) + ... + Var(Y_p)\]
Consider a multivariate data set \(\bf{Y}\), containing \(n\) observations (rows) on \(p\) variables (columns). The first principal component is a linear combination of those variables, so that
\[Z_1 = a_{11}Y_1 + a_{12}Y_2 + ... + a_{1p}Y_p\]
where \(a_{1j}\) is the coefficient, or loading, of \(Y_j\) on \(Z_1\).
PCA is based on the spectral decomposition (or, more generally, the singular value decomposition, which also applies to rectangular matrices) of the covariance matrix \({\bf R}\) (or of the correlation matrix, if the variables are standardized to zero mean and unit standard deviation)
\[{\bf R} = Cov({\bf Y})\] That decomposition consists of solving an eigenvalue problem for the eigenvalues (\(\lambda_j\)) and eigenvectors (\({\bf a}_j\)) of \({\bf R}\), where \(\lambda_j = Var(Z_j)\).
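As a quick check of that relationship, here is a minimal sketch with simulated data (the object Ysim and the numbers themselves are arbitrary, introduced only for illustration): the eigenvalues of the covariance matrix equal the variances of the components returned by prcomp(), and their sum equals the total variance of the original variables.

# Minimal sketch with simulated data
set.seed(1)
Ysim <- matrix(rnorm(100 * 3), ncol = 3)       # simulated data, 3 variables
Ysim[, 2] <- Ysim[, 1] + 0.5 * Ysim[, 2]       # induce some correlation
lambda <- eigen(cov(Ysim))$values              # eigenvalues of Cov(Y)
varZ   <- prcomp(Ysim)$sdev^2                  # Var(Z_j) from the PCA
round(cbind(lambda, varZ), 4)                  # the two columns match
c(sum(lambda), sum(diag(cov(Ysim))))           # total variance is preserved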
If there are no correlated original variables \(Y_j\), then the PCA is not effective, and probably all of the \(Z_j\) will have to be retained to explain the variability of the original data set. But if some of the \(Y_j\) are correlated, then the first few \(Z_j\) will condense most of the variability and allow for dimensionality reduction.
For example, take the matrix
M1 <- matrix(c(7, 0, 0, 2), nrow = 2)
M1

     [,1] [,2]
[1,]    7    0
[2,]    0    2
Suppose this is the covariance matrix of \(\bf{Y}\), containing the variables \(Y_1\) and \(Y_2\). The total variance in \(\bf{Y}\) is \(Var(Y_1) + Var(Y_2) = 7 + 2 = 9\). There is no covariance, or \(Cov(Y_1, Y_2) = 0\). Hence,
eigen(M1)

eigen() decomposition
$values
[1] 7 2

$vectors
     [,1] [,2]
[1,]   -1    0
[2,]    0   -1
Notice how \(\lambda_1 + \lambda_2 = 7 + 2 = 9\). Also, \(Z_1 = -1 \times Y_1 + 0 \times Y_2\) and \(Z_2 = 0 \times Y_1 - 1 \times Y_2\). In this case, then, there is no benefit in transforming the original data into PCs.
Now, consider this new matrix, which has a nonzero covariance
M2 <- matrix(c(7, -1.9, -1.9, 2), nrow = 2)
M2

     [,1] [,2]
[1,]  7.0 -1.9
[2,] -1.9  2.0
eigen(M2)

eigen() decomposition
$values
[1] 7.6 1.4

$vectors
      [,1]  [,2]
[1,] -0.95 -0.32
[2,]  0.32 -0.95
Well, \(\lambda_1 + \lambda_2 \approx 7.64 + 1.36 = 9\), which is still the total variance \(Var(Y_1) + Var(Y_2)\). But notice how \(\lambda_1 = 7.64\) condenses \(100 \times \frac{7.64}{9} = 85\%\) of the total variance in \(Z_1 = -0.95Y_1 + 0.32Y_2\). On the other hand, \(Z_2 = -0.32Y_1 - 0.95Y_2\) contains only \(100 \times \frac{1.36}{9} = 15\%\) of the total variance. We can use that information to evaluate the importance of the PCs.
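Those percentages can be computed directly from the eigenvalues; a minimal sketch:

lambda <- eigen(M2)$values
round(100 * lambda / sum(lambda), 1)   # proportion of total variance: about 85% and 15%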
Note: The scale of the original variables \(Y_j\) may be different. That affects the magnitude of their variances, and hence the result of a PCA. In order to avoid misleading results, it is advisable to first standardize the variables by subtracting the mean from each value and then dividing by the standard deviation, \[Y_j^* = \frac{Y_j - \bar{Y_j}}{s_j}\] This is equivalent to extracting the eigenvalues and eigenvectors from the correlation matrix instead of the covariance matrix. In R, use the function scale().
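To see this equivalence in practice, here is a minimal sketch with simulated data (the object Ysim and the numbers are arbitrary): the covariance matrix of the standardized variables is the correlation matrix of the original ones, so both lead to the same eigenvalues.

# Standardizing and then taking covariances is the same as taking
# the correlations of the original variables
set.seed(2)
Ysim <- matrix(rnorm(50 * 2), ncol = 2)
Ysim[, 2] <- Ysim[, 1] + Ysim[, 2]              # two correlated variables
all.equal(cov(scale(Ysim)), cor(Ysim))          # TRUE
all.equal(eigen(cov(scale(Ysim)))$values,
          eigen(cor(Ysim))$values)              # same eigenvalues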
Check the result with this correlation matrix, where \(Cor(Y_1, Y_2) = -0.65\).
M3 <- matrix(c(1, -0.65, -0.65, 1), nrow = 2)
eigen(M3)

eigen() decomposition
$values
[1] 1.65 0.35

$vectors
      [,1]  [,2]
[1,] -0.71 -0.71
[2,]  0.71 -0.71
The following data set contains observations on 20 garlic genotypes for production variables, namely bulb diameter, bulb length, bulb weight, average number of bulbils per bulb, average number of leaves per plant, and leaf area.
path <- 'http://arsilva.weebly.com/uploads/2/1/0/0/21008856/alho.txt'
garlic <- read.table(path, header = TRUE)
colnames(garlic) <- c('Genotype', 'DIAM', 'LENGTH', 'BW', 'NbB', 'NL', 'LA')
garlic$Genotype <- 1:nrow(garlic)
str(garlic)

'data.frame':  20 obs. of  7 variables:
 $ Genotype: int  1 2 3 4 5 6 7 8 9 10 ...
 $ DIAM    : num  38.8 45 45.6 48 44.4 ...
 $ LENGTH  : num  30 33.7 35.1 34.3 34 ...
 $ BW      : num  22.9 31.7 36.2 32.5 26 ...
 $ NbB     : num  29.8 22.2 51.4 41.5 15.5 ...
 $ NL      : num  6.5 7.5 6.75 6.75 7.25 7.25 6.5 6.5 7.5 7.5 ...
 $ LA      : num  95.3 108.5 93.4 109.2 121.4 ...
There are some correlated variables:
print(cor(garlic[, -1]), digits = 2)

       DIAM LENGTH   BW   NbB    NL   LA
DIAM   1.00   0.91 0.92 0.382 0.420 0.60
LENGTH 0.91   1.00 0.90 0.318 0.461 0.69
BW     0.92   0.90 1.00 0.580 0.355 0.66
NbB    0.38   0.32 0.58 1.000 0.099 0.30
NL     0.42   0.46 0.35 0.099 1.000 0.78
LA     0.60   0.69 0.66 0.298 0.779 1.00
Because the variables have different scales, let’s standardize them
garlic_s <- scale(garlic[, -1])
apply(garlic_s, 2, mean)  # means

     DIAM    LENGTH        BW       NbB        NL        LA
 -2.4e-16  -6.2e-16   1.7e-16  -1.5e-17  -6.9e-18  -2.7e-17

apply(garlic_s, 2, sd)  # standard deviations

DIAM LENGTH     BW    NbB     NL     LA
   1      1      1      1      1      1
Now each variable has unit variance. Thus, the total variance is 6.
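A quick check of that total:

sum(apply(garlic_s, 2, var))   # total variance of the standardized data: 6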
A PCA can be done using either princomp() or prcomp(). The latter is based on the singular value decomposition of the (centered) data matrix, whereas the former uses the spectral decomposition of the covariance (or correlation) matrix; for a given data set, both provide essentially the same result.
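As a minimal sketch of that equivalence on the standardized garlic data, the component standard deviations from the two functions can be compared (princomp() is given cor = TRUE here so that it also works on the correlation scale):

# SVD-based prcomp() on the standardized data versus
# eigen-based princomp() on the correlation matrix
sdev_prcomp   <- prcomp(garlic_s)$sdev
sdev_princomp <- princomp(garlic[, -1], cor = TRUE)$sdev
round(rbind(sdev_prcomp, sdev_princomp), 3)   # essentially identical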
pca <- prcomp(garlic_s)
summary(pca)

Importance of components:
                         PC1   PC2   PC3   PC4    PC5     PC6
Standard deviation     1.982 1.044 0.833 0.431 0.2695 0.16276
Proportion of Variance 0.655 0.182 0.116 0.031 0.0121 0.00442
Cumulative Proportion  0.655 0.837 0.952 0.983 0.9956 1.00000
The first two PCs together retain 83.7% of the total variance. Let's see the eigenvalues, i.e., the variances of the PCs
pca$sdev^2

[1] 3.929 1.091 0.695 0.186 0.073 0.026
It is noteworthy how the variance of PC1 (\(\lambda_1 = 3.93\)) condenses the variability of almost four original variables. In fact, this can be used as a criterion to select principal components (Kaiser's rule): retain those with \(\lambda > 1\), since a PC with \(\lambda \leq 1\) explains no more variance than a single original (standardized) variable, which would not justify the transformation.
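A minimal sketch of that selection rule, together with the usual scree plot of the component variances:

pca$sdev^2 > 1                              # keep components with variance > 1
screeplot(pca, type = 'lines', main = '')   # scree plot of the eigenvalues
abline(h = 1, lty = 2)                      # reference line at lambda = 1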
Now the eigenvectors, or loadings, are
print(pca$rotation, digits = 2)

        PC1    PC2   PC3    PC4    PC5   PC6
DIAM   0.46  0.146 -0.36  0.378 -0.516  0.48
LENGTH 0.47  0.044 -0.38 -0.045  0.786  0.14
BW     0.47  0.277 -0.13 -0.127 -0.226 -0.78
NbB    0.26  0.575  0.73  0.125  0.139  0.18
NL     0.32 -0.652  0.34  0.557  0.079 -0.21
LA     0.42 -0.380  0.25 -0.717 -0.196  0.25
In PC1, the weights of the original variables are all of similar magnitude, so PC1 can be interpreted as an index of general biomass or plant development. PC2, on the other hand, reflects a second dimension of the data, highlighting a contrast between leaf (NL, LA) and bulb (BW, NbB) variables.
Using matrix notation, a PC is of the form
\[{\bf Z}_j = {\bf Ya}_j\]
That is
Y <- garlic_s      # standardized data
a <- pca$rotation  # matrix of loadings (eigenvectors)
Z <- Y %*% a       # scores of the principal components
round(head(Z), 2)

       PC1   PC2   PC3   PC4   PC5   PC6
[1,] -1.82  0.39  0.80 -0.14  0.02 -0.03
[2,]  0.40 -0.43 -0.11  0.68  0.07 -0.30
[3,]  0.93  1.73  0.54  0.56  0.54 -0.13
[4,]  0.86  1.14  0.12  0.45 -0.04  0.43
[5,] -0.01 -0.78 -0.39  0.20  0.22  0.27
[6,]  1.84  0.87  0.67  0.40 -0.26 -0.03
which are called scores of the principal components. They have zero means
colMeans(Z)

      PC1       PC2       PC3       PC4       PC5       PC6
 -3.0e-16   0.0e+00   2.3e-16  -6.0e-17  -3.9e-16  -3.3e-16
and are not correlated
cor(Z)

          PC1      PC2      PC3      PC4      PC5      PC6
PC1   1.0e+00  1.0e-16 -1.2e-16 -1.1e-16  2.4e-15 -1.4e-15
PC2   1.0e-16  1.0e+00 -3.2e-15  7.1e-17  8.2e-16  4.6e-16
PC3  -1.2e-16 -3.2e-15  1.0e+00  2.8e-16 -8.9e-16  6.6e-16
PC4  -1.1e-16  7.1e-17  2.8e-16  1.0e+00  9.6e-16 -1.1e-15
PC5   2.4e-15  8.2e-16 -8.9e-16  9.6e-16  1.0e+00  1.2e-15
PC6  -1.4e-15  4.6e-16  6.6e-16 -1.1e-15  1.2e-15  1.0e+00
The scores also allow us to identify patterns among the observations (genotypes). For instance, \(Z_{11} = -1.82\) tells us that the first genotype has a very low biomass relative to the others. The opposite is observed for genotype no. 6 (\(Z_{61} = 1.84\), above the mean of zero). We can make better interpretations using a plot.
plot(Z, type = 'n')
abline(h = 0, v = 0, lty = 2)
text(x = Z[, 1], y = Z[, 2],
     labels = garlic$Genotype, cex = 0.8,
     col = sign(Z[, 1]) + 3)

A biplot is essentially a graphical method to represent the rows (observations) and columns (variables) of a rectangular matrix in two or three dimensions. It consists of approximating the original data \({\bf Y}\) through the singular value decomposition
\[\begin{array}{rcl} {\bf Y} & = & {\bf UDV}^T \\ & = & ({\bf UD}^{1-\alpha})({\bf VD}^{\alpha})^T \\ & = & {\bf GH}^T \end{array}\]where \({\bf D}\) is a diagonal matrix containing the \(m = \min(n, p)\) non-null singular values; \({\bf U}\) and \({\bf V}\) contain the singular vectors associated with the rows and columns, respectively; and \(\alpha\) is a scaling factor (usually 0, 0.5 or 1).
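As a sketch of that factorization for the standardized garlic data, taking \(\alpha = 0\) so that \({\bf G}\) holds the PC scores and \({\bf H}\) the loadings (the representation of scores and loadings used in the biplot below):

# Build G and H from the SVD of the standardized data, with alpha = 0
udv <- svd(garlic_s)                # Y = U D V^T
G <- udv$u %*% diag(udv$d)          # G = U D^(1 - alpha) = U D
H <- udv$v                          # H = V D^alpha       = V
max(abs(G %*% t(H) - garlic_s))     # ~0: G H^T reconstructs Y exactly
max(abs(abs(G) - abs(Z)))           # ~0: G equals the PC scores, up to sign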
A biplot is useful to represent scores and loadings from a PCA:
biplot(x = Z, y = pca$rotation, cex = c(0.8, 0.7))  # G, H
abline(h = 0, v = 0, lty = 2)

Or just biplot(pca, scale = 0).
Notice how it is easier to identify, for example, that genotype 13 is the most productive, or that genotype 3 has the greatest number of bulbils per bulb (NbB) as well as low production of leaves (NL, LA).
There are easier ways to get the scores Z and the biplot
Z <- pca$x
biplot(pca, scale = 0)

Exercise: import the soil data and compute its principal components. Use a biplot to detect which variables can be used to identify differences between soil layers (factor 'Camada'). Tip: try the argument xlabs to show the levels of 'Camada' on the biplot.
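A minimal sketch of how the xlabs argument can be used. The soil data set is not loaded here, so a small simulated data frame stands in for it; the object soil and its variables (pH, P, K) are made up for illustration and should be replaced by the actual data.

# Hypothetical stand-in for the soil data: a factor 'Camada' plus some
# made-up numeric variables (replace with the real data set)
soil <- data.frame(Camada = rep(c('0-20', '20-40'), each = 10),
                   pH = rnorm(20, 5.5, 0.5),
                   P  = rnorm(20, 12, 3),
                   K  = rnorm(20, 80, 15))
pca_soil <- prcomp(scale(soil[, -1]))
biplot(pca_soil, scale = 0, xlabs = soil$Camada)  # scores labelled by 'Camada'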