Principal component analysis (PCA) distributes the variation in a multivariate dataset across components, allowing us to visualize patterns that would not be apparent with common graphical techniques. Linear algebra is at the heart of PCA, but this discussion will be light on mathematical theory. Instead, you can expect a gentle introduction to the topic, including how this ordination technique is carried out in R.
With the powerful tools available in R, there is no need to conduct a PCA manually: native functions can handle the heavy lifting in a single line of code. My goal for the manual PCA is to expose you to the terminology and concepts behind PCA, so that you will be better prepared to defend your analysis.
The original motivation for this analysis was to establish a standard algorithm for determining the “size” of a wolf spider. One way to accomplish this is with a PCA of morphometric characteristics: the variables that possess the highest degree of variation will be the best predictors of animal size.
Figure: carapace measurements. Blue: interocular distance; Green: carapace width; Red: carapace length.
Descriptive Statistics
# Load the full data set
load("morpho_complete.Rdata")
# Eliminate factor variables & untransformed weights from full dataset
morpho <- morpho_complete[,-c(1,2,6)]
morpho.stats <- round(stat.desc(morpho, basic = FALSE), digits = 3) # stat.desc() from the pastecs package
kable(morpho.stats, caption = "Descriptive Statistics") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)

|  | interoc | cwidth | clength | T.weight |
|---|---|---|---|---|
| median | 0.798 | 2.975 | 3.706 | 1.740 |
| mean | 0.799 | 2.991 | 3.691 | 1.742 |
| SE.mean | 0.005 | 0.020 | 0.020 | 0.004 |
| CI.mean.0.95 | 0.011 | 0.039 | 0.039 | 0.008 |
| var | 0.002 | 0.028 | 0.028 | 0.001 |
| std.dev | 0.046 | 0.166 | 0.167 | 0.033 |
| coef.var | 0.058 | 0.055 | 0.045 | 0.019 |
pairs.panels(
morpho,
main = "Wolf Spider Morphometrics - Correlation Summary",
gap = 0, # set to zero for no gap between plot panels
lm = TRUE, # draw linear regression lines for pairs
stars = TRUE, # display significance
bg = c("red", "blue")[morpho_complete$sex], # color based on sex
pch = 21) # data point shape

PCA is a dimensionality reduction technique that allows us to see latent patterns in the data, and it leans heavily on concepts from linear algebra: eigenvalues and eigenvectors are at the heart of the method. First, we need to establish whether the variables in our dataset share the same units of measure or are mixed. If the dataset contains the same units across all of the variables – e.g., all variables are weights in grams – we standardize the data via mean-centering and employ a covariance matrix. However, if our data contain mixed units – like the morphometric data in this example – we mean-center the data, divide by the standard deviation, and employ a correlation matrix. Which matrix you choose becomes essential when setting the parameters of the built-in PCA functions in R.
Our data contain mixed units: weight (mg) and linear measures (mm).
standardize <- function(x) {(x - mean(x))/sd(x)}
# Eliminate factor variables & untransformed weights from the scaled data
my.scaled.data <- as.data.frame(apply(morpho, 2, standardize))
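As a quick sanity check (a minimal sketch using the objects created above), standardizing makes the covariance matrix of the scaled data identical to the correlation matrix of the raw data, which is exactly why the two approaches converge:

# The covariance matrix of standardized data equals the
# correlation matrix of the raw data
all.equal(cov(my.scaled.data), cor(morpho), check.attributes = FALSE) # TRUE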
ggplot(my.scaled.data, aes(interoc, cwidth)) +
geom_point(size = 2) +
geom_smooth(method = 'lm') +
ggtitle("Plot of Interoccular Distance and Carapace Width") +
theme(plot.title = element_text(hjust = 0.5)) # center plot title# Calculate correlation matrix
my.cor <- cor(my.scaled.data)
# Save the eigenvalues and eigenvectors of the correlation matrix
my.eigen <- eigen(my.cor)
# Rename matrix rows and columns for easier interpretation
rownames(my.eigen$vectors) <- c("interoc", "cwidth", "clength", "T.weight")
colnames(my.eigen$vectors) <- c("PC1", "PC2", "PC3", "PC4")

|  | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| interoc | -0.4972563 | -0.2503917 | 0.8251161 | -0.0960398 |
| cwidth | -0.5318709 | -0.3465188 | -0.4947682 | -0.5935002 |
| clength | -0.5759897 | -0.0463106 | -0.2715924 | 0.7696290 |
| T.weight | -0.3715984 | 0.9028201 | 0.0250084 | -0.2149537 |
| PC | eigenvalues |
|---|---|
| PC1 | 2.7104296 |
| PC2 | 0.7607956 |
| PC3 | 0.4127988 |
| PC4 | 0.1159759 |
Each column in the table above (left) is an eigenvector, and each eigenvector defines a principal component (PC) with an associated eigenvalue; the eigenvalues are given in the right table above. The eigenvalues identify how much of the dataset's total variation each principal component captures: the higher the eigenvalue, the more variation that PC explains. Each eigenvector is a normalized linear combination of the variables interoc, cwidth, clength, and T.weight.
The coefficients are known as loadings and take values in \(\small[-1, 1]\). We can see that the variables of interest in PC1 trend together because all of their coefficients share the same sign.
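“Normalized” here means each eigenvector has unit length, which a quick check of the my.eigen object confirms:

# Each eigenvector (column) has a Euclidean norm of 1
apply(my.eigen$vectors, 2, function(v) sqrt(sum(v^2))) # all equal to 1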
Note that the sum of the eigenvalues equals the total variance of the scaled data:

sum(my.eigen$values)

## [1] 4
sum(
var(my.scaled.data[,1]),
var(my.scaled.data[,2]),
var(my.scaled.data[,3]),
var(my.scaled.data[,4]))

## [1] 4
Take each eigenvalue and divide it by the total of all eigenvalues – the total variation in the dataset – then multiply by 100. This calculation yields a table with the percent variation for each PC.
pc1.var <- 100*round(my.eigen$values[1]/sum(my.eigen$values), digits = 3)
pc2.var <- 100*round(my.eigen$values[2]/sum(my.eigen$values), digits = 3)
pc3.var <- 100*round(my.eigen$values[3]/sum(my.eigen$values), digits = 3)
pc4.var <- 100*round(my.eigen$values[4]/sum(my.eigen$values), digits = 3)
pc <- data.frame(PC = c("PC1", "PC2", "PC3", "PC4"),
Percentage = c(pc1.var, pc2.var, pc3.var, pc4.var))
kable(pc, caption = "Amount of Variation Captured by PCs") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)

| PC | Percentage |
|---|---|
| PC1 | 67.8 |
| PC2 | 19.0 |
| PC3 | 10.3 |
| PC4 | 2.9 |
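For what it's worth, the four assignments above can be collapsed into one vectorized expression (a sketch using the same my.eigen object):

# Vectorized version of the percent-variation calculation
pct <- 100 * round(my.eigen$values / sum(my.eigen$values), digits = 3)
pc <- data.frame(PC = paste0("PC", 1:4), Percentage = pct)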
The total variation should sum to ~100%, give or take rounding error:

sum(pc$Percentage)

## [1] 100
Express the loadings and scaled data as matrices, then multiply them together. The result is a new matrix which expresses the data in terms of the PCs. These are the PCA scores.
loadings <- my.eigen$vectors
my.scaled.matrix <- as.matrix(my.scaled.data)
# the function %*% is matrix multiplication
scores <- my.scaled.matrix %*% loadings
sd <- sqrt(my.eigen$values) # standard deviations of each PC (note: this masks base R's sd() function)
rownames(loadings) <- colnames(my.scaled.data)
kable(head(scores), caption = "PCA Scores - First Six Rows") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F)

| PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|
| -2.2600333 | 0.4795785 | 0.1963305 | -0.0511398 |
| -1.2140895 | 1.0327956 | 0.0184971 | 0.0859329 |
| -2.9229522 | 0.7366524 | -0.3902236 | 0.5873086 |
| -1.0989536 | 1.0164286 | 0.0283129 | 0.0653717 |
| -0.5060125 | 1.8150422 | 0.8229882 | 0.6191984 |
| -0.3073533 | 0.8777547 | 0.0583005 | -0.0498082 |
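One useful property to confirm: the variance of each column of scores equals the corresponding eigenvalue, which is another way of saying that the eigenvalues partition the total variance across the PCs:

# The variance of each score column equals its eigenvalue
apply(scores, 2, var) # matches my.eigen$values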
The function prcomp is the primary tool for PCA in base R.
pca_morpho <- prcomp(morpho, center = TRUE, scale. = TRUE)
# Show the variables in the class "prcomp"
ls(pca_morpho)

## [1] "center" "rotation" "scale" "sdev" "x"
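Before digging into the output, we can confirm that prcomp agrees with our manual PCA. The sign of each component is arbitrary, so we compare absolute values (a quick check using the scores matrix computed earlier):

# prcomp stores its scores in x; compare with the manual scores,
# ignoring the arbitrary sign of each column
all.equal(abs(unname(scores)), abs(unname(pca_morpho$x))) # TRUE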
Rotations are often referred to as the “loadings” of a PCA.
pca_loadings <- pca_morpho$rotation
kable(pca_loadings, caption = "PCA Loadings") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F)

|  | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| interoc | -0.4972563 | -0.2503917 | 0.8251161 | -0.0960398 |
| cwidth | -0.5318709 | -0.3465188 | -0.4947682 | -0.5935002 |
| clength | -0.5759897 | -0.0463106 | -0.2715924 | 0.7696290 |
| T.weight | -0.3715984 | 0.9028201 | 0.0250084 | -0.2149537 |
Summary output of the PCA:
pca_summary <- summary(pca_morpho)$importance %>%
as.data.frame()
kable(pca_summary, caption = "PCA Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)

|  | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Standard deviation | 1.646338 | 0.872236 | 0.6424942 | 0.3405524 |
| Proportion of Variance | 0.677610 | 0.190200 | 0.1032000 | 0.0289900 |
| Cumulative Proportion | 0.677610 | 0.867810 | 0.9710100 | 1.0000000 |
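Note that the “Proportion of Variance” row is exactly the eigenvalue calculation we performed by hand: each squared standard deviation divided by their sum.

# Reproduce the "Proportion of Variance" row of the summary
pca_morpho$sdev^2 / sum(pca_morpho$sdev^2)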
pairs.panels(
pca_morpho$x,
main = "PCA Correlation Summary",
gap = 0, # set to zero for no gap between plot panels
lm = TRUE, # draw linear regression lines for pairs
stars = TRUE, # display significance
bg = c("red", "blue")[morpho_complete$sex], # color based on sex
pch = 21) # data point shape

PCs are orthogonal vectors, so the correlation coefficient for every possible pairwise combination of PCs equals zero. For this reason, PCA is often used as a precursor to predictive modeling procedures: it eliminates multicollinearity among the inputs.
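We can verify the orthogonality directly; rounding suppresses floating-point noise:

# Pairwise correlations between the PC scores are zero
round(cor(pca_morpho$x), digits = 10)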
The biplot is the primary visualization tool for PCA.
# Plot without ID labels
g <- ggbiplot(pca_morpho,
obs.scale = 1,
var.scale = 1,
groups = morpho_complete$sex,
ellipse = TRUE,
circle = FALSE,
# draw ellipse around points (+/-) 1 standard deviation
ellipse.prob = 0.68) +
scale_color_discrete(name = '') +
theme(legend.direction = "horizontal", legend.position = "top", aspect.ratio = 1)
# Plot with ID labels
h <- ggbiplot(pca_morpho,
obs.scale = 1,
var.scale = 1,
groups = morpho_complete$sex,
ellipse = TRUE,
circle = FALSE,
# draw ellipse around points (+/-) 1 standard deviation
ellipse.prob = 0.68) +
scale_color_discrete(name = '') +
theme(legend.direction = "horizontal", legend.position = "top", aspect.ratio = 1) +
geom_text_repel(aes(label = morpho_complete$id))
figure <- ggarrange(g, h, ncol = 2)
annotate_figure(figure,
                top = text_grob("PCA Biplots - Morphometric Data",
                                color = "black", face = "bold", size = 14),
                fig.lab.face = "bold")

PCA biplots without and with spider ID labels. Pink: females, Red: males.
Each variable in the dataset is plotted as a vector (red arrows). The cosine of the angle between any two vectors approximates the correlation between those variables: the smaller the angle, the stronger the positive correlation. For instance, interocular distance and carapace width have a minimal angle between them, indicating a high correlation.
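This relationship can be verified numerically. When the loadings are scaled by each PC's standard deviation, the inner product of two rows over all four PCs reproduces the correlation exactly; the 2-D biplot draws only the first two components, so the angle there is an approximation (a sketch using objects defined above):

# Scale each loading by the standard deviation of its PC
scaled_loadings <- pca_morpho$rotation %*% diag(pca_morpho$sdev)
# Inner product across all 4 PCs recovers the correlation
sum(scaled_loadings["interoc", ] * scaled_loadings["cwidth", ])
cor(morpho$interoc, morpho$cwidth) # same value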
Regarding carapace length, let’s look at spider ID#336 (male) and ID#339 (female):
## id sex interoc cwidth clength weight.mg T.weight
## 1 339 f 0.857 3.342 3.916 99.7 1.786258
## 2 336 m 0.676 2.723 3.341 42.4 1.681624
paste("minimum carapace length =",
min(morpho_complete$clength),
"maximum carapace length =",
max(morpho_complete$clength),
sep = ' ')

## [1] "minimum carapace length = 3.336 maximum carapace length = 4.059"
Regarding weight, let’s look at spider ID#360 (male) and ID#55 (female):
## id sex interoc cwidth clength weight.mg T.weight
## 1 55 f 0.818 2.803 3.768 113.8 1.798779
## 2 360 m 0.870 2.909 3.688 46.5 1.695215
paste("minimum weight =",
min(morpho_complete$weight.mg),
"maximum weight =",
max(morpho_complete$weight.mg),
sep = ' ')

## [1] "minimum weight = 39.8 maximum weight = 133.9"
How many principal components should we retain? Many methods have been proposed to answer this question, but two are the most common: the Kaiser Criterion and parallel analysis.
If the eigenvalue associated with a PC is \(\small >1\), then you retain that component. To compute the eigenvalues from the PCA, square the standard deviations stored in the prcomp object:

pca_morpho$sdev^2

## [1] 2.7104296 0.7607956 0.4127988 0.1159759
The reason this works: with a correlation matrix, each standardized variable contributes exactly one unit of variance, and the sum of the eigenvalues equals the total variance in the dataset. A PC with an eigenvalue \(\small >1\) therefore explains more variation than any single original variable. Only the first eigenvalue is \(\small >1\), so according to the Kaiser Criterion, PC1 alone is sufficient to summarize the variation in the dataset.
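In code, the criterion is a one-liner:

# Kaiser Criterion: retain PCs whose eigenvalues exceed 1
which(pca_morpho$sdev^2 > 1) # PC1 only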
Parallel analysis is a technique designed to reduce the subjectivity of interpreting a scree plot.
A standard visual method is to find the most extreme “elbow” in the scree plot and retain that many PCs (x-axis). Accordingly, we would retain PC1 and PC2.
Parallel analysis, on the other hand, is a simulation-based method that generates thousands of datasets with the same number of variables and observations as the “real” dataset. We then retain the components whose observed eigenvalues are larger than those extracted from the simulated data.
paran(morpho, iterations = 5000, centile = 0, quietly = TRUE,
status = FALSE, all = TRUE, cfa = TRUE, graph = TRUE, color = TRUE,
col = c("black", "red", "blue"), lty = c(1, 2, 3), lwd = 1, legend = TRUE,
file = "", width = 640, height = 640, grdevice = "png", seed = 0)##
## Using eigendecomposition of correlation matrix.
The results show that we would retain only PC1, which agrees with the Kaiser Criterion method.
The morphometric dataset we used here is virtually “ideal” for running a generic PCA in R: it contains no missing values, zeroes, or extreme outliers. The Oak Woods dataset, however, is characterized by extreme variability, outliers, and zero values:
As such, a prudent course would be to conduct a robust PCA.
# Save data frame as a matrix and eliminate the first column (character)
oak.matrix <- as.matrix(oak2[,-1])
pca_oak_robust <- pca(oak.matrix, method = "robustPca", nPcs = 27,
                      center = TRUE, scale = "uv") # pcaMethods::pca; "uv" = unit-variance scaling
slplot(pca_oak_robust)

paran(oak.matrix, iterations = 5000, centile = 0, quietly = TRUE,
status = FALSE, all = TRUE, cfa = TRUE, graph = TRUE, color = TRUE,
col = c("black", "red", "blue"), lty = c(1, 2, 3), lwd = 1, legend = TRUE,
file = "", width = 640, height = 640, grdevice = "png", seed = 0)##
## Using eigendecomposition of correlation matrix.
In summary, a PCA of any type may not be an appropriate statistical approach for this dataset. PCoA (principal coordinates analysis) is a better choice, given its built-in flexibility with respect to distance measures. Alternatively, clustering based on any number of variables, such as stand, Elev.m, or AspClass, would likely yield more meaningful results.
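As a starting point, here is a minimal PCoA sketch. It assumes the vegan package is installed and reuses the oak.matrix object from above; Bray-Curtis is one reasonable distance choice for abundance data with many zeroes, though another measure may suit this dataset better:

library(vegan) # for vegdist()

# Bray-Curtis dissimilarity handles zero-heavy abundance data gracefully
oak.dist <- vegdist(oak.matrix, method = "bray")

# Classical multidimensional scaling (PCoA) on the distance matrix
oak.pcoa <- cmdscale(oak.dist, k = 2, eig = TRUE)
plot(oak.pcoa$points, xlab = "PCoA 1", ylab = "PCoA 2",
     main = "PCoA of Oak Woods Data")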