Principal component analysis (PCA) distributes the variation in a multivariate dataset across components, allowing us to visualize patterns that would not be apparent with common graphical techniques. Linear algebra is at the heart of PCA, but this discussion will be light on mathematical theory. Instead, you can expect a gentle introduction to the topic, including how this ordination technique is carried out in R.
With the powerful tools available in R, there is no need to conduct a PCA manually: native functions can handle the heavy lifting in a single line of code. My goal for the manual PCA is to expose you to the terminology and concepts behind PCA, so that you will be better prepared to defend your analysis.
The original motivation for this analysis was to establish a standard algorithm for determining the “size” of a wolf spider. One way to accomplish this is with a PCA of morphometric characteristics: the variables that possess the highest degree of variation will be the best predictors of animal size.
Figure: carapace measurements. Blue: interocular distance; Green: carapace width; Red: carapace length.
Descriptive Statistics
# Load the full data set
load("morpho_complete.Rdata")
# Eliminate factor variables & untransformed weights from full dataset
morpho <- morpho_complete[,-c(1,2,6)]
morpho.stats <- round(stat.desc(morpho, basic = FALSE), digits = 3) # stat.desc() from the pastecs package
kable(morpho.stats, caption = "Descriptive Statistics") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)

|  | interoc | cwidth | clength | T.weight |
|---|---|---|---|---|
| median | 0.798 | 2.975 | 3.706 | 1.740 |
| mean | 0.799 | 2.991 | 3.691 | 1.742 |
| SE.mean | 0.005 | 0.020 | 0.020 | 0.004 |
| CI.mean.0.95 | 0.011 | 0.039 | 0.039 | 0.008 |
| var | 0.002 | 0.028 | 0.028 | 0.001 |
| std.dev | 0.046 | 0.166 | 0.167 | 0.033 |
| coef.var | 0.058 | 0.055 | 0.045 | 0.019 |
pairs.panels(
morpho,
main = "Wolf Spider Morphometrics - Correlation Summary",
gap = 0, # set to zero for no gap between plot panels
lm = TRUE, # draw linear regression lines for pairs
stars = TRUE, # display significance
bg = c("red", "blue")[morpho_complete$sex], # color based on sex
pch = 21) # data point shape

PCA is a dimensionality reduction technique that allows us to see latent patterns in the data, and it leans heavily on concepts from linear algebra: eigenvalues and eigenvectors are at the heart of the method. First, we need to establish whether the variables in our dataset share the same units of measure or are mixed. If the dataset contains the same units across all of the variables – e.g., all variables are weights in grams – we standardize the data via mean-centering and employ a covariance matrix. However, if our data contain mixed units – like the morphometric data in this example – we mean-center the data, divide by the standard deviation, and employ a correlation matrix. Which matrix you choose becomes essential when setting the parameters of the built-in PCA functions in R.
Our data contain mixed units: weight (mg) and linear measures (mm).
standardize <- function(x) {(x - mean(x))/sd(x)}
# Eliminate factor variables & untransformed weights from the scaled data
my.scaled.data <- as.data.frame(apply(morpho, 2, standardize))
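As a quick sanity check (a minimal sketch using the objects created above), standardizing makes the covariance matrix of the scaled data identical to the correlation matrix of the raw data, which is exactly why the two approaches converge:

# The covariance matrix of standardized data equals the
# correlation matrix of the raw data
all.equal(cov(my.scaled.data), cor(morpho), check.attributes = FALSE) # TRUE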
ggplot(my.scaled.data, aes(interoc, cwidth)) +
geom_point(size = 2) +
geom_smooth(method = 'lm') +
ggtitle("Plot of Interoccular Distance and Carapace Width") +
theme(plot.title = element_text(hjust = 0.5)) # center plot title# Calculate correlation matrix
my.cor <- cor(my.scaled.data)
# Save the eigenvalues and eigenvectors of the correlation matrix
my.eigen <- eigen(my.cor)
# Rename matrix rows and columns for easier interpretation
rownames(my.eigen$vectors) <- c("interoc", "cwidth", "clength", "T.weight")
colnames(my.eigen$vectors) <- c("PC1", "PC2", "PC3", "PC4")

|  | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| interoc | -0.4972563 | -0.2503917 | 0.8251161 | -0.0960398 |
| cwidth | -0.5318709 | -0.3465188 | -0.4947682 | -0.5935002 |
| clength | -0.5759897 | -0.0463106 | -0.2715924 | 0.7696290 |
| T.weight | -0.3715984 | 0.9028201 | 0.0250084 | -0.2149537 |
| PC | eigenvalues |
|---|---|
| PC1 | 2.7104296 |
| PC2 | 0.7607956 |
| PC3 | 0.4127988 |
| PC4 | 0.1159759 |
Each column in the table above (left) is an eigenvector, and each eigenvector defines a principal component (PC) with an associated eigenvalue; the eigenvalues are given in the right table above. The eigenvalues identify how much of the dataset's total variation each principal component captures: the higher the eigenvalue, the more variation that PC explains. Each eigenvector is a normalized linear combination of the variables interoc, cwidth, clength, and T.weight.
The coefficients are known as loadings and take values in \(\small[-1, 1]\). We can see that the variables of interest in PC1 trend together because all of their coefficients share the same sign.
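“Normalized” here means each eigenvector has unit length, which a quick check of the my.eigen object confirms:

# Each eigenvector (column) has a Euclidean norm of 1
apply(my.eigen$vectors, 2, function(v) sqrt(sum(v^2))) # all equal to 1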
Note that the sum of the eigenvalues equals the total variance of the scaled data:

sum(my.eigen$values)

## [1] 4
sum(
var(my.scaled.data[,1]),
var(my.scaled.data[,2]),
var(my.scaled.data[,3]),
var(my.scaled.data[,4]))

## [1] 4
Take each eigenvalue and divide it by the total of all eigenvalues – the total variation in the dataset – then multiply by 100. This calculation yields a table with the percent variation for each PC.
pc1.var <- 100*round(my.eigen$values[1]/sum(my.eigen$values), digits = 3)
pc2.var <- 100*round(my.eigen$values[2]/sum(my.eigen$values), digits = 3)
pc3.var <- 100*round(my.eigen$values[3]/sum(my.eigen$values), digits = 3)
pc4.var <- 100*round(my.eigen$values[4]/sum(my.eigen$values), digits = 3)
pc <- data.frame(PC = c("PC1", "PC2", "PC3", "PC4"),
Percentage = c(pc1.var, pc2.var, pc3.var, pc4.var))
kable(pc, caption = "Amount of Variation Captured by PCs") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)

| PC | Percentage |
|---|---|
| PC1 | 67.8 |
| PC2 | 19.0 |
| PC3 | 10.3 |
| PC4 | 2.9 |
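For what it's worth, the four assignments above can be collapsed into one vectorized expression (a sketch using the same my.eigen object):

# Vectorized version of the percent-variation calculation
pct <- 100 * round(my.eigen$values / sum(my.eigen$values), digits = 3)
pc <- data.frame(PC = paste0("PC", 1:4), Percentage = pct)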
The total variation should sum to ~100%, give or take rounding error:

sum(pc$Percentage)

## [1] 100
Express the loadings and scaled data as matrices, then multiply them together. The result is a new matrix which expresses the data in terms of the PCs. These are the PCA scores.
loadings <- my.eigen$vectors
my.scaled.matrix <- as.matrix(my.scaled.data)
# the function %*% is matrix multiplication
scores <- my.scaled.matrix %*% loadings
sd <- sqrt(my.eigen$values) # standard deviations of each PC (note: this masks base R's sd() function)
rownames(loadings) <- colnames(my.scaled.data)
kable(head(scores), caption = "PCA Scores - First Six Rows") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F)

| PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|
| -2.2600333 | 0.4795785 | 0.1963305 | -0.0511398 |
| -1.2140895 | 1.0327956 | 0.0184971 | 0.0859329 |
| -2.9229522 | 0.7366524 | -0.3902236 | 0.5873086 |
| -1.0989536 | 1.0164286 | 0.0283129 | 0.0653717 |
| -0.5060125 | 1.8150422 | 0.8229882 | 0.6191984 |
| -0.3073533 | 0.8777547 | 0.0583005 | -0.0498082 |
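One useful property to confirm: the variance of each column of scores equals the corresponding eigenvalue, which is another way of saying that the eigenvalues partition the total variance across the PCs:

# The variance of each score column equals its eigenvalue
apply(scores, 2, var) # matches my.eigen$values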
The function prcomp is the primary tool for PCA in base R.
pca_morpho <- prcomp(morpho, center = TRUE, scale. = TRUE)
# Show the variables in the class "prcomp"
ls(pca_morpho)

## [1] "center" "rotation" "scale" "sdev" "x"
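Before digging into the output, we can confirm that prcomp agrees with our manual PCA. The sign of each component is arbitrary, so we compare absolute values (a quick check using the scores matrix computed earlier):

# prcomp stores its scores in x; compare with the manual scores,
# ignoring the arbitrary sign of each column
all.equal(abs(unname(scores)), abs(unname(pca_morpho$x))) # TRUE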
Rotations are often referred to as the “loadings” of a PCA.
pca_loadings <- pca_morpho$rotation
kable(pca_loadings, caption = "PCA Loadings") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F)

|  | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| interoc | -0.4972563 | -0.2503917 | 0.8251161 | -0.0960398 |
| cwidth | -0.5318709 | -0.3465188 | -0.4947682 | -0.5935002 |
| clength | -0.5759897 | -0.0463106 | -0.2715924 | 0.7696290 |
| T.weight | -0.3715984 | 0.9028201 | 0.0250084 | -0.2149537 |
Summary output of the PCA:
pca_summary <- summary(pca_morpho)$importance %>%
as.data.frame()
kable(pca_summary, caption = "PCA Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)

|  | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Standard deviation | 1.646338 | 0.872236 | 0.6424942 | 0.3405524 |
| Proportion of Variance | 0.677610 | 0.190200 | 0.1032000 | 0.0289900 |
| Cumulative Proportion | 0.677610 | 0.867810 | 0.9710100 | 1.0000000 |
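Note that the “Proportion of Variance” row is exactly the eigenvalue calculation we performed by hand: each squared standard deviation divided by their sum.

# Reproduce the "Proportion of Variance" row of the summary
pca_morpho$sdev^2 / sum(pca_morpho$sdev^2)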
pairs.panels(
pca_morpho$x,
main = "PCA Correlation Summary",
gap = 0, # set to zero for no gap between plot panels
lm = TRUE, # draw linear regression lines for pairs
stars = TRUE, # display significance
bg = c("red", "blue")[morpho_complete$sex], # color based on sex
pch = 21) # data point shape

PCs are orthogonal vectors, so the correlation coefficient for every possible pairwise combination of PCs equals zero. For this reason, PCA is often used as a precursor to predictive modeling procedures: it eliminates multicollinearity among the inputs.
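We can verify the orthogonality directly; rounding suppresses floating-point noise:

# Pairwise correlations between the PC scores are zero
round(cor(pca_morpho$x), digits = 10)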
The biplot is the primary visualization tool for PCA.
# Plot without ID labels
g <- ggbiplot(pca_morpho,
obs.scale = 1,
var.scale = 1,
groups = morpho_complete$sex,
ellipse = TRUE,
circle = FALSE,
# draw ellipse around points (+/-) 1 standard deviation
ellipse.prob = 0.68) +
scale_color_discrete(name = '') +
theme(legend.direction = "horizontal", legend.position = "top", aspect.ratio = 1)
# Plot with ID labels
h <- ggbiplot(pca_morpho,
obs.scale = 1,
var.scale = 1,
groups = morpho_complete$sex,
ellipse = TRUE,
circle = FALSE,
# draw ellipse around points (+/-) 1 standard deviation
ellipse.prob = 0.68) +
scale_color_discrete(name = '') +
theme(legend.direction = "horizontal", legend.position = "top", aspect.ratio = 1) +
geom_text_repel(aes(label = morpho_complete$id))
figure <- ggarrange(g, h, ncol = 2)
annotate_figure(figure,
                top = text_grob("PCA Biplots - Morphometric Data",
                                color = "black", face = "bold", size = 14),
                fig.lab.face = "bold")

PCA biplots without and with spider ID labels. Pink: females, Red: males.
Each variable in the dataset is plotted as a vector (red arrows). The cosine of the angle between any two vectors approximates the correlation between those variables: the smaller the angle, the stronger the positive correlation. For instance, interocular distance and carapace width have a minimal angle between them, indicating a high correlation.
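This relationship can be verified numerically. When the loadings are scaled by each PC's standard deviation, the inner product of two rows over all four PCs reproduces the correlation exactly; the 2-D biplot draws only the first two components, so the angle there is an approximation (a sketch using objects defined above):

# Scale each loading by the standard deviation of its PC
scaled_loadings <- pca_morpho$rotation %*% diag(pca_morpho$sdev)
# Inner product across all 4 PCs recovers the correlation
sum(scaled_loadings["interoc", ] * scaled_loadings["cwidth", ])
cor(morpho$interoc, morpho$cwidth) # same value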
Regarding carapace length, let’s look at spider ID#336 (male) and ID#339 (female):
## id sex interoc cwidth clength weight.mg T.weight
## 1 339 f 0.857 3.342 3.916 99.7 1.786258
## 2 336 m 0.676 2.723 3.341 42.4 1.681624
paste("minimum carapace length =",
min(morpho_complete$clength),
"maximum carapace length =",
max(morpho_complete$clength),
sep = ' ')

## [1] "minimum carapace length = 3.336 maximum carapace length = 4.059"
Regarding weight, let’s look at spider ID#360 (male) and ID#55 (female):
## id sex interoc cwidth clength weight.mg T.weight
## 1 55 f 0.818 2.803 3.768 113.8 1.798779
## 2 360 m 0.870 2.909 3.688 46.5 1.695215
paste("minimum weight =",
min(morpho_complete$weight.mg),
"maximum weight =",
max(morpho_complete$weight.mg),
sep = ' ')

## [1] "minimum weight = 39.8 maximum weight = 133.9"
How many principal components should we retain? Many methods have been proposed to answer this question, but two are the most common: the Kaiser Criterion and parallel analysis.
If the eigenvalue associated with a PC is \(\small >1\), then you retain that component. To compute the eigenvalues from the PCA, square the standard deviations stored in the prcomp object:

pca_morpho$sdev^2

## [1] 2.7104296 0.7607956 0.4127988 0.1159759
The reason this works: with a correlation matrix, each standardized variable contributes exactly one unit of variance, and the sum of the eigenvalues equals the total variance in the dataset. A PC with an eigenvalue \(\small >1\) therefore explains more variation than any single original variable. Only the first eigenvalue is \(\small >1\), so according to the Kaiser Criterion, PC1 alone is sufficient to summarize the variation in the dataset.
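In code, the criterion is a one-liner:

# Kaiser Criterion: retain PCs whose eigenvalues exceed 1
which(pca_morpho$sdev^2 > 1) # PC1 only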
Parallel analysis is a technique designed to reduce the subjectivity of interpreting a scree plot.
A standard visual method is to find the most extreme “elbow” in the scree plot and retain that many PCs (x-axis). Accordingly, we would retain PC1 and PC2.
Parallel analysis, on the other hand, is a simulation-based method that generates thousands of datasets with the same number of variables and observations as the “real” dataset. We then retain the components whose observed eigenvalues are larger than those extracted from the simulated data.
paran(morpho, iterations = 5000, centile = 0, quietly = TRUE,
status = FALSE, all = TRUE, cfa = TRUE, graph = TRUE, color = TRUE,
col = c("black", "red", "blue"), lty = c(1, 2, 3), lwd = 1, legend = TRUE,
file = "", width = 640, height = 640, grdevice = "png", seed = 0)##
## Using eigendecomposition of correlation matrix.
The results show that we would retain only PC1, which agrees with the Kaiser Criterion method.
The morphometric dataset we used here is virtually “ideal” for running a generic PCA in R: it contains no missing values, zeroes, or extreme outliers. The Oak Woods dataset, however, is characterized by extreme variability, outliers, and zero values:
As such, a prudent course would be to conduct a robust PCA.
# Save data frame as a matrix and eliminate the first column (character)
oak.matrix <- as.matrix(oak2[,-1])
pca_oak_robust <- pca(oak.matrix, method = "robustPca", nPcs = 27,
                      center = TRUE, scale = "uv") # pcaMethods::pca; "uv" = unit-variance scaling
slplot(pca_oak_robust)

paran(oak.matrix, iterations = 5000, centile = 0, quietly = TRUE,
status = FALSE, all = TRUE, cfa = TRUE, graph = TRUE, color = TRUE,
col = c("black", "red", "blue"), lty = c(1, 2, 3), lwd = 1, legend = TRUE,
file = "", width = 640, height = 640, grdevice = "png", seed = 0)##
## Using eigendecomposition of correlation matrix.
In summary, a PCA of any type may not be an appropriate statistical approach for this dataset. PCoA (principal coordinates analysis) is a better choice, given its built-in flexibility with respect to distance measures. Alternatively, clustering based on any number of variables, such as stand, Elev.m, or AspClass, would likely yield more meaningful results.
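As a starting point, here is a minimal PCoA sketch. It assumes the vegan package is installed and reuses the oak.matrix object from above; Bray-Curtis is one reasonable distance choice for abundance data with many zeroes, though another measure may suit this dataset better:

library(vegan) # for vegdist()

# Bray-Curtis dissimilarity handles zero-heavy abundance data gracefully
oak.dist <- vegdist(oak.matrix, method = "bray")

# Classical multidimensional scaling (PCoA) on the distance matrix
oak.pcoa <- cmdscale(oak.dist, k = 2, eig = TRUE)
plot(oak.pcoa$points, xlab = "PCoA 1", ylab = "PCoA 2",
     main = "PCoA of Oak Woods Data")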