PCA is one of the simplest multivariate methods, and it is widely used for different purposes. For example, in regression a principal component can be used as a response variable to account for the multivariate behaviour of the original response variables under the effect of some predictors, or as a way to fix collinearity problems in a set of predictors. In general, PCA is a method for dimensionality reduction of multivariate data.
The method consists of transforming, through linear combinations, the \(p\) original variables \(Y_1\), \(Y_2\), …, \(Y_p\) into \(p\) new latent variables (the principal components) \(Z_1\), \(Z_2\), …, \(Z_p\), which are uncorrelated, ordered by importance, and able to describe the variability of the original data set in a condensed form, so that a few (say 2 or 3) principal components hold most of that variability. The order of importance is such that
\[Var(Z_1) \geq Var(Z_2) \geq ... \geq Var(Z_p)\]
Then, if the reduction is effective,
\[Var(Z_1) + Var(Z_2) \approx Var(Y_1) + Var(Y_2) + ... + Var(Y_p)\]
Consider a multivariate data set \(\bf{Y}\), containing \(n\) observations (rows) on \(p\) variables (columns). The first principal component is a linear combination of those variables, so that
\[Z_1 = a_{11}Y_1 + a_{12}Y_2 + ... + a_{1p}Y_p\]
where \(a_{1j}\) is the coefficient, or loading, of \(Y_j\) on \(Z_1\).
PCA is based on the spectral decomposition (or, more generally, the singular value decomposition, which also applies to rectangular matrices) of the covariance matrix \({\bf R}\) (or of the correlation matrix, if the variables are standardized to zero mean and unit standard deviation)
\[{\bf R} = Cov({\bf Y})\] That decomposition consists of solving an eigenvalue problem for the eigenvalues (\(\lambda_j\)) and eigenvectors (\({\bf a}_j\)) of \({\bf R}\), where \(\lambda_j = Var(Z_j)\).
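As a quick check of that relationship, here is a minimal sketch with simulated data (the object Ysim and the numbers themselves are arbitrary, introduced only for illustration): the eigenvalues of the covariance matrix equal the variances of the components returned by prcomp(), and their sum equals the total variance of the original variables.

# Minimal sketch with simulated data
set.seed(1)
Ysim <- matrix(rnorm(100 * 3), ncol = 3)       # simulated data, 3 variables
Ysim[, 2] <- Ysim[, 1] + 0.5 * Ysim[, 2]       # induce some correlation
lambda <- eigen(cov(Ysim))$values              # eigenvalues of Cov(Y)
varZ   <- prcomp(Ysim)$sdev^2                  # Var(Z_j) from the PCA
round(cbind(lambda, varZ), 4)                  # the two columns match
c(sum(lambda), sum(diag(cov(Ysim))))           # total variance is preserved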
If there are no correlated original variables \(Y_j\), then the PCA is not effective, and probably all of the \(Z_j\) will have to be retained to explain the variability of the original data set. But if some of the \(Y_j\) are correlated, then the first few \(Z_j\) will condense most of the variability and allow for dimensionality reduction.
For example, take the matrix
M1 <- matrix(c(7, 0, 0, 2), nrow = 2)
M1

     [,1] [,2]
[1,]    7    0
[2,]    0    2
Suppose this is the covariance matrix of \(\bf{Y}\), containing the variables \(Y_1\) and \(Y_2\). The total variance in \(\bf{Y}\) is \(Var(Y_1) + Var(Y_2) = 7 + 2 = 9\). There is no covariance, or \(Cov(Y_1, Y_2) = 0\). Hence,
eigen(M1)

eigen() decomposition
$values
[1] 7 2

$vectors
     [,1] [,2]
[1,]   -1    0
[2,]    0   -1
Notice how \(\lambda_1 + \lambda_2 = 7 + 2 = 9\). Also, \(Z_1 = -1 \times Y_1 + 0 \times Y_2\) and \(Z_2 = 0 \times Y_1 - 1 \times Y_2\). In this case, then, there is no benefit in transforming the original data into PCs.
Now, consider this new matrix, which has a nonzero covariance
M2 <- matrix(c(7, -1.9, -1.9, 2), nrow = 2)
M2

     [,1] [,2]
[1,]  7.0 -1.9
[2,] -1.9  2.0
eigen(M2)

eigen() decomposition
$values
[1] 7.6 1.4

$vectors
      [,1]  [,2]
[1,] -0.95 -0.32
[2,]  0.32 -0.95
Well, \(\lambda_1 + \lambda_2 \approx 7.64 + 1.36 = 9\), which is still the total variance \(Var(Y_1) + Var(Y_2)\). But notice how \(\lambda_1 = 7.64\) condenses \(100 \times \frac{7.64}{9} = 85\%\) of the total variance in \(Z_1 = -0.95Y_1 + 0.32Y_2\). On the other hand, \(Z_2 = -0.32Y_1 - 0.95Y_2\) contains only \(100 \times \frac{1.36}{9} = 15\%\) of the total variance. We can use that information to evaluate the importance of the PCs.
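Those percentages can be computed directly from the eigenvalues; a minimal sketch:

lambda <- eigen(M2)$values
round(100 * lambda / sum(lambda), 1)   # proportion of total variance: about 85% and 15%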
Note: The scale of the original variables \(Y_j\) may be different. That affects the magnitude of their variances, and hence the result of a PCA. In order to avoid misleading results, it is advisable to first standardize the variables by subtracting the mean from each value and then dividing by the standard deviation, \[Y_j^* = \frac{Y_j - \bar{Y_j}}{s_j}\] This is equivalent to extracting the eigenvalues and eigenvectors from the correlation matrix instead of the covariance matrix. In R, use the function scale().
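To see this equivalence in practice, here is a minimal sketch with simulated data (the object Ysim and the numbers are arbitrary): the covariance matrix of the standardized variables is the correlation matrix of the original ones, so both lead to the same eigenvalues.

# Standardizing and then taking covariances is the same as taking
# the correlations of the original variables
set.seed(2)
Ysim <- matrix(rnorm(50 * 2), ncol = 2)
Ysim[, 2] <- Ysim[, 1] + Ysim[, 2]              # two correlated variables
all.equal(cov(scale(Ysim)), cor(Ysim))          # TRUE
all.equal(eigen(cov(scale(Ysim)))$values,
          eigen(cor(Ysim))$values)              # same eigenvalues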
Check the result with this correlation matrix, where \(Cor(Y_1, Y_2) = -0.65\).
M3 <- matrix(c(1, -0.65, -0.65, 1), nrow = 2)
eigen(M3)

eigen() decomposition
$values
[1] 1.65 0.35

$vectors
      [,1]  [,2]
[1,] -0.71 -0.71
[2,]  0.71 -0.71
The following data set contains observations on 20 garlic genotypes for production variables, namely bulb diameter, bulb length, bulb weight, average number of bulbils per bulb, average number of leaves per plant, and leaf area.
path <- 'http://arsilva.weebly.com/uploads/2/1/0/0/21008856/alho.txt'
garlic <- read.table(path, header = TRUE)
colnames(garlic) <- c('Genotype', 'DIAM', 'LENGTH', 'BW', 'NbB', 'NL', 'LA')
garlic$Genotype <- 1:nrow(garlic)
str(garlic)

'data.frame':  20 obs. of  7 variables:
 $ Genotype: int  1 2 3 4 5 6 7 8 9 10 ...
 $ DIAM    : num  38.8 45 45.6 48 44.4 ...
 $ LENGTH  : num  30 33.7 35.1 34.3 34 ...
 $ BW      : num  22.9 31.7 36.2 32.5 26 ...
 $ NbB     : num  29.8 22.2 51.4 41.5 15.5 ...
 $ NL      : num  6.5 7.5 6.75 6.75 7.25 7.25 6.5 6.5 7.5 7.5 ...
 $ LA      : num  95.3 108.5 93.4 109.2 121.4 ...
There are some correlated variables:
print(cor(garlic[, -1]), digits = 2)

       DIAM LENGTH   BW   NbB    NL   LA
DIAM   1.00   0.91 0.92 0.382 0.420 0.60
LENGTH 0.91   1.00 0.90 0.318 0.461 0.69
BW     0.92   0.90 1.00 0.580 0.355 0.66
NbB    0.38   0.32 0.58 1.000 0.099 0.30
NL     0.42   0.46 0.35 0.099 1.000 0.78
LA     0.60   0.69 0.66 0.298 0.779 1.00
Because the variables have different scales, let’s standardize them
garlic_s <- scale(garlic[, -1])
apply(garlic_s, 2, mean)  # means

     DIAM    LENGTH        BW       NbB        NL        LA
 -2.4e-16  -6.2e-16   1.7e-16  -1.5e-17  -6.9e-18  -2.7e-17

apply(garlic_s, 2, sd)  # standard deviations

DIAM LENGTH     BW    NbB     NL     LA
   1      1      1      1      1      1
Now each variable has unit variance. Thus, the total variance is 6.
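A quick check of that total:

sum(apply(garlic_s, 2, var))   # total variance of the standardized data: 6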
A PCA can be done using either princomp() or prcomp(). The latter is based on the singular value decomposition of the (centered) data matrix, whereas the former uses the spectral decomposition of the covariance (or correlation) matrix; for a given data set, both provide essentially the same result.
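As a minimal sketch of that equivalence on the standardized garlic data, the component standard deviations from the two functions can be compared (princomp() is given cor = TRUE here so that it also works on the correlation scale):

# SVD-based prcomp() on the standardized data versus
# eigen-based princomp() on the correlation matrix
sdev_prcomp   <- prcomp(garlic_s)$sdev
sdev_princomp <- princomp(garlic[, -1], cor = TRUE)$sdev
round(rbind(sdev_prcomp, sdev_princomp), 3)   # essentially identical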
pca <- prcomp(garlic_s)
summary(pca)

Importance of components:
                         PC1   PC2   PC3   PC4    PC5     PC6
Standard deviation     1.982 1.044 0.833 0.431 0.2695 0.16276
Proportion of Variance 0.655 0.182 0.116 0.031 0.0121 0.00442
Cumulative Proportion  0.655 0.837 0.952 0.983 0.9956 1.00000
The first two PCs together retain 83.7% of the total variance. Let's see the eigenvalues, i.e., the variances of the PCs
pca$sdev^2

[1] 3.929 1.091 0.695 0.186 0.073 0.026
It is noteworthy how the variance of PC1 (\(\lambda_1 = 3.93\)) condenses the variability of almost four original variables. In fact, this can be used as a criterion to select principal components (Kaiser's rule): retain those with \(\lambda > 1\), since a PC with \(\lambda \leq 1\) explains no more variance than a single original (standardized) variable, which would not justify the transformation.
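A minimal sketch of that selection rule, together with the usual scree plot of the component variances:

pca$sdev^2 > 1                              # keep components with variance > 1
screeplot(pca, type = 'lines', main = '')   # scree plot of the eigenvalues
abline(h = 1, lty = 2)                      # reference line at lambda = 1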
Now the eigenvectors, or loadings, are
print(pca$rotation, digits = 2)

        PC1    PC2   PC3    PC4    PC5   PC6
DIAM   0.46  0.146 -0.36  0.378 -0.516  0.48
LENGTH 0.47  0.044 -0.38 -0.045  0.786  0.14
BW     0.47  0.277 -0.13 -0.127 -0.226 -0.78
NbB    0.26  0.575  0.73  0.125  0.139  0.18
NL     0.32 -0.652  0.34  0.557  0.079 -0.21
LA     0.42 -0.380  0.25 -0.717 -0.196  0.25
In PC1, the weights of the original variables are all of similar magnitude, so PC1 can be interpreted as an index of general biomass or plant development. PC2, on the other hand, reflects a second dimension of the data, highlighting a contrast between leaf (NL, LA) and bulb (BW, NbB) variables.
Using matrix notation, a PC is of the form
\[{\bf Z}_j = {\bf Ya}_j\]
That is
Y <- garlic_s      # standardized data
a <- pca$rotation  # matrix of loadings (eigenvectors)
Z <- Y %*% a       # scores of the principal components
round(head(Z), 2)

       PC1   PC2   PC3   PC4   PC5   PC6
[1,] -1.82  0.39  0.80 -0.14  0.02 -0.03
[2,]  0.40 -0.43 -0.11  0.68  0.07 -0.30
[3,]  0.93  1.73  0.54  0.56  0.54 -0.13
[4,]  0.86  1.14  0.12  0.45 -0.04  0.43
[5,] -0.01 -0.78 -0.39  0.20  0.22  0.27
[6,]  1.84  0.87  0.67  0.40 -0.26 -0.03
which are called scores of the principal components. They have zero means
colMeans(Z)

      PC1       PC2       PC3       PC4       PC5       PC6
 -3.0e-16   0.0e+00   2.3e-16  -6.0e-17  -3.9e-16  -3.3e-16
and are not correlated
cor(Z)

          PC1      PC2      PC3      PC4      PC5      PC6
PC1   1.0e+00  1.0e-16 -1.2e-16 -1.1e-16  2.4e-15 -1.4e-15
PC2   1.0e-16  1.0e+00 -3.2e-15  7.1e-17  8.2e-16  4.6e-16
PC3  -1.2e-16 -3.2e-15  1.0e+00  2.8e-16 -8.9e-16  6.6e-16
PC4  -1.1e-16  7.1e-17  2.8e-16  1.0e+00  9.6e-16 -1.1e-15
PC5   2.4e-15  8.2e-16 -8.9e-16  9.6e-16  1.0e+00  1.2e-15
PC6  -1.4e-15  4.6e-16  6.6e-16 -1.1e-15  1.2e-15  1.0e+00
The scores also allow us to identify patterns among the observations (genotypes). For instance, \(Z_{11} = -1.82\) tells us that the first genotype has a very low biomass relative to the others. The opposite is observed for genotype no. 6 (\(Z_{61} = 1.84\), above the mean of zero). We can make better interpretations using a plot.
plot(Z, type = 'n')
abline(h = 0, v = 0, lty = 2)
text(x = Z[, 1], y = Z[, 2],
     labels = garlic$Genotype, cex = 0.8,
     col = sign(Z[, 1]) + 3)

A biplot is essentially a graphical method to represent the rows (observations) and columns (variables) of a rectangular matrix in two or three dimensions. It consists of approximating the original data \({\bf Y}\) through the singular value decomposition
\[\begin{array}{rcl} {\bf Y} & = & {\bf UDV}^T \\ & = & ({\bf UD}^{1-\alpha})({\bf VD}^{\alpha})^T \\ & = & {\bf GH}^T \end{array}\]where \({\bf D}\) is a diagonal matrix containing the \(m = \min(n, p)\) non-null singular values; \({\bf U}\) and \({\bf V}\) contain the singular vectors associated with the rows and columns, respectively; and \(\alpha\) is a scaling factor (usually 0, 0.5 or 1).
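As a sketch of that factorization for the standardized garlic data, taking \(\alpha = 0\) so that \({\bf G}\) holds the PC scores and \({\bf H}\) the loadings (the representation of scores and loadings used in the biplot below):

# Build G and H from the SVD of the standardized data, with alpha = 0
udv <- svd(garlic_s)                # Y = U D V^T
G <- udv$u %*% diag(udv$d)          # G = U D^(1 - alpha) = U D
H <- udv$v                          # H = V D^alpha       = V
max(abs(G %*% t(H) - garlic_s))     # ~0: G H^T reconstructs Y exactly
max(abs(abs(G) - abs(Z)))           # ~0: G equals the PC scores, up to sign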
A biplot is useful to represent scores and loadings from a PCA:
biplot(x = Z, y = pca$rotation, cex = c(0.8, 0.7))  # G, H
abline(h = 0, v = 0, lty = 2)

Or just biplot(pca, scale = 0).
Notice how it is easier to identify, for example, that genotype 13 is the most productive, or that genotype 3 has the greatest number of bulbils per bulb (NbB) as well as low production of leaves (NL, LA).
There are easier ways to get the scores Z and the biplot
Z <- pca$x
biplot(pca, scale = 0)

Exercise: import the soil data and compute its principal components. Use a biplot to detect which variables can be used to identify differences between soil layers (factor 'Camada'). Tip: try the argument xlabs to show the levels of 'Camada' on the biplot.
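A minimal sketch of how the xlabs argument can be used. The soil data set is not loaded here, so a small simulated data frame stands in for it; the object soil and its variables (pH, P, K) are made up for illustration and should be replaced by the actual data.

# Hypothetical stand-in for the soil data: a factor 'Camada' plus some
# made-up numeric variables (replace with the real data set)
soil <- data.frame(Camada = rep(c('0-20', '20-40'), each = 10),
                   pH = rnorm(20, 5.5, 0.5),
                   P  = rnorm(20, 12, 3),
                   K  = rnorm(20, 80, 15))
pca_soil <- prcomp(scale(soil[, -1]))
biplot(pca_soil, scale = 0, xlabs = soil$Camada)  # scores labelled by 'Camada'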