Introduction

Principal component analysis (PCA) distributes the variation in a multivariate dataset across components. PCA allows us to visualize patterns that would not be apparent with common graphical techniques. Linear algebra is at the heart of PCA, but this discussion will be light on mathematical theory. Instead, you can expect a gentle introduction to the topic, including how this ordination technique is carried out in R.

Accomplishing the PCA Manually

With the powerful tools available to us in R, there is no need to conduct a PCA manually: R's native functions can handle the heavy lifting in a single line of code. My goal for the manual PCA is to expose you to the terminology and concepts behind PCA, so that you will be better prepared to defend your analysis.

Motivating example - wolf spider morphometrics

The original motivation for this analysis was to establish a standard algorithm for determining the “size” of a wolf spider. One way to accomplish this is with a PCA of morphometric characteristics. The variables that possess the highest degree of variation will be the best predictors of animal size.


[Figure: wolf spider carapace measurements. Blue: interocular distance, Green: carapace width, Red: carapace length]

Descriptive Statistics

             interoc  cwidth  clength  T.weight
median         0.798   2.975    3.706     1.740
mean           0.799   2.991    3.691     1.742
SE.mean        0.005   0.020    0.020     0.004
CI.mean.0.95   0.011   0.039    0.039     0.008
var            0.002   0.028    0.028     0.001
std.dev        0.046   0.166    0.167     0.033
coef.var       0.058   0.055    0.045     0.019


Covariance or Correlation?

PCA is a dimensionality-reduction technique that allows us to see latent patterns in the data. To do this, PCA relies heavily on concepts from linear algebra: eigenvalues and eigenvectors are at its heart. First, we need to establish whether the metrics in our dataset are alike or mixed. If the dataset contains the same unit of measure across all of the variables – e.g., all variables are weights in grams – we mean-center the data and employ a covariance matrix. However, if our data contain mixed units – like the morphometric data in this example – we mean-center the data, divide by the standard deviation (i.e., standardize), and employ a correlation matrix. Which matrix you choose becomes essential when setting the parameters of the built-in PCA functions in R.

Our data has mixed metrics - weight (mg) and linear measures (mm).
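
As a quick sanity check, standardizing the data and then taking its covariance is equivalent to taking the correlation of the raw data. A minimal sketch, assuming the measurements live in a hypothetical data frame named spiders with the four columns shown above:

```r
# Hypothetical data frame 'spiders' containing the morphometric variables
vars <- c("interoc", "cwidth", "clength", "T.weight")

# Standardize: mean-center and divide by the standard deviation
spiders_scaled <- scale(spiders[, vars])

# The covariance matrix of the standardized data equals the
# correlation matrix of the raw data
all.equal(cov(spiders_scaled), cor(spiders[, vars]))
```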

Find the Eigenvalues & Eigenvectors


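Both quantities come from an eigendecomposition of the correlation matrix. A minimal sketch with base R's eigen(), reusing the hypothetical spiders data frame and vars vector from above (note that the signs of eigenvectors are arbitrary and may be flipped relative to the tables below):

```r
# Eigendecomposition of the correlation matrix (mixed units)
e <- eigen(cor(spiders[, vars]))

e$vectors  # eigenvectors: one column per principal component
e$values   # eigenvalues: variance captured by each component
```
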
Eigenvectors
                PC1         PC2         PC3         PC4
interoc  -0.4972563  -0.2503917   0.8251161  -0.0960398
cwidth   -0.5318709  -0.3465188  -0.4947682  -0.5935002
clength  -0.5759897  -0.0463106  -0.2715924   0.7696290
T.weight -0.3715984   0.9028201   0.0250084  -0.2149537

Eigenvalues For PCs
PC    eigenvalue
PC1    2.7104296
PC2    0.7607956
PC3    0.4127988
PC4    0.1159759

Each column in the first table above (Eigenvectors) is an eigenvector. Each eigenvector defines a principal component (PC), and each PC has an associated eigenvalue; these eigenvalues are given in the second table. The eigenvalues are used to identify which principal component(s) have the strongest correlation with the overall dataset: the higher the eigenvalue, the stronger the correlation. Each eigenvector is a normalized linear combination of the variables interoc, cwidth, clength, and T.weight.

The coefficients are known as loadings, which take values in \(\small[-1, 1]\). We can see that the variables of interest in PC1 trend together, since all of the coefficients share the same sign.


Note that the sum of the eigenvalues equals the total variance of the scaled data:

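Both sides of that identity are easy to compute; a sketch, reusing e and spiders_scaled from above:

```r
sum(e$values)                   # sum of the eigenvalues
sum(diag(cov(spiders_scaled)))  # total variance of the scaled data
```
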
## [1] 4
## [1] 4

Amount of Variation Captured by the PCs in the Dataset

Take each eigenvalue and divide it by the total of all eigenvalues – the total variation in the dataset – then multiply by 100. This calculation yields a table with the percent variation for each PC.

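A sketch of that calculation, reusing the eigenvalues in e:

```r
# Percent of the total variation captured by each PC
pct <- 100 * e$values / sum(e$values)
round(pct, 1)
```
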
Amount of Variation Captured by PCs
PC    Percentage
PC1         67.8
PC2         19.0
PC3         10.3
PC4          2.9


The percentages should sum to ~100%, allowing for rounding error:

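Reusing pct from the sketch above:

```r
sum(round(pct, 1))  # should be ~100, subject to rounding
```
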
## [1] 100

What are the PCA “scores”?

Express the loadings and the scaled data as matrices, then multiply the scaled data by the loadings. The result is a new matrix that expresses the data in terms of the PCs. These are the PCA scores.

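A sketch of the matrix multiplication, reusing the scaled data and eigenvectors from above:

```r
# Project the scaled data onto the eigenvectors to obtain the scores
scores <- spiders_scaled %*% e$vectors
colnames(scores) <- paste0("PC", 1:ncol(scores))
head(scores)
```
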
PCA Scores - First Six Rows
       PC1        PC2        PC3        PC4
-2.2600333  0.4795785  0.1963305 -0.0511398
-1.2140895  1.0327956  0.0184971  0.0859329
-2.9229522  0.7366524 -0.3902236  0.5873086
-1.0989536  1.0164286  0.0283129  0.0653717
-0.5060125  1.8150422  0.8229882  0.6191984
-0.3073533  0.8777547  0.0583005 -0.0498082

Accomplishing the PCA with Native R Functions

The function prcomp is the primary tool for PCA in base R.

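A sketch of the call, again assuming the hypothetical spiders data frame; setting scale. = TRUE requests the correlation-based PCA that our mixed units demand:

```r
pca <- prcomp(spiders[, vars], scale. = TRUE)
names(pca)  # the components stored in a prcomp object
```
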
## [1] "center"   "rotation" "scale"    "sdev"     "x"

The rotation matrix is often referred to as the “loadings” of a PCA.

PCA Loadings
                PC1         PC2         PC3         PC4
interoc  -0.4972563  -0.2503917   0.8251161  -0.0960398
cwidth   -0.5318709  -0.3465188  -0.4947682  -0.5935002
clength  -0.5759897  -0.0463106  -0.2715924   0.7696290
T.weight -0.3715984   0.9028201   0.0250084  -0.2149537

Summary output of the PCA:

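Reusing the pca object from above:

```r
summary(pca)  # standard deviation and proportion of variance per PC
```
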
PCA Summary
                             PC1       PC2        PC3        PC4
Standard deviation      1.646338  0.872236  0.6424942  0.3405524
Proportion of Variance  0.677610  0.190200  0.1032000  0.0289900
Cumulative Proportion   0.677610  0.867810  0.9710100  1.0000000

Orthogonality of PCs

PCs are orthogonal vectors, so the correlation coefficients equal zero for all possible pairwise combinations of PCs. For this reason, PCA is often used as a precursor to predictive modeling procedures, since multicollinearity among the inputs is eliminated.
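
We can verify this numerically from the scores stored in pca$x:

```r
# Pairwise correlations between PC scores are zero
# (up to floating-point noise)
round(cor(pca$x), 10)
```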

Interpreting the Biplot

Each variable in the dataset is plotted as a vector (red arrows). The cosine of the angle between any two vectors approximates their correlation. For instance, interocular distance and carapace width have a minimal angle between them, indicating high correlation.
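
Base R can draw a biplot directly from the prcomp object (the figure in this post may have been produced with different plotting options):

```r
# Observations appear as points, variables as arrows
biplot(pca, cex = 0.6)
```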

Regarding carapace length, let’s look at spider ID#336 (male) and ID#339 (female):

##    id sex interoc cwidth clength weight.mg T.weight
## 1 339   f   0.857  3.342   3.916      99.7 1.786258
## 2 336   m   0.676  2.723   3.341      42.4 1.681624
## [1] "minimum carapace length = 3.336 maximum carapace length = 4.059"

Regarding weight, let’s look at spider ID#360 (male) and ID#55 (female):

##    id sex interoc cwidth clength weight.mg T.weight
## 1  55   f   0.818  2.803   3.768     113.8 1.798779
## 2 360   m   0.870  2.909   3.688      46.5 1.695215
## [1] "minimum weight = 39.8 maximum weight = 133.9"

How many PCs can I retain to explain “enough” variation?

Many methods have been proposed to answer this question, but two of the most common are the Kaiser Criterion and Parallel Analysis.

Kaiser Criterion

If an eigenvalue associated with a PC is \(\small >1\), then you retain that component. To compute the eigenvalues from the PCA, square the standard deviations in the prcomp object.

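Reusing the pca object:

```r
# Eigenvalues are the squared standard deviations stored by prcomp
pca$sdev^2
```
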
## [1] 2.7104296 0.7607956 0.4127988 0.1159759

The reason this works: the variables were standardized, so each contributes a variance of exactly 1, and the sum of the eigenvalues equals the total variance in the dataset. A PC with an eigenvalue \(\small >1\) therefore explains more variance than any single original variable. Only the first eigenvalue is \(\small >1\), so according to the Kaiser Criterion, PC1 is sufficient to explain the variation in the dataset.

Parallel Analysis

Parallel analysis is a technique designed to reduce the subjectivity of interpreting a scree plot.

A standard visual method is to find the point that creates the most extreme “elbow” and retain the PCs up to that point (x-axis). Accordingly, we would retain PC1 and PC2.
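
The scree plot itself can be drawn straight from the prcomp object:

```r
# Eigenvalues (variances) by component; look for the "elbow"
screeplot(pca, type = "lines", main = "Scree Plot")
```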

Parallel analysis, on the other hand, is a simulation-based method that generates thousands of datasets with the same dimensions as the “real” dataset. We then retain the components whose observed eigenvalues are larger than those extracted from the simulated data.

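The output below is consistent with the paran package; a sketch of such a call, assuming that package is installed and reusing the hypothetical spiders data frame:

```r
library(paran)

# Compare observed eigenvalues to those from simulated datasets
paran(spiders[, vars], iterations = 5000, graph = TRUE)
```
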
## 
## Using eigendecomposition of correlation matrix.

The results show that we would retain only PC1, which agrees with the Kaiser Criterion method.

PCA on the Oak Woods Dataset - Robust PCA

The morphometric dataset we used here is virtually “ideal” for running a generic PCA in R: it contains no missing values, zeroes, extreme values, or outliers. However, the Oak Woods dataset is characterized by extreme variability, outliers, and zero values.

As such, a prudent course would be to conduct a robust PCA.

## 
## Using eigendecomposition of correlation matrix.

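One option is the ROBPCA algorithm from the rrcov package; a minimal sketch, assuming a hypothetical data frame oak holding the numeric Oak Woods variables:

```r
library(rrcov)

# Robust PCA that downweights outlying observations
rpca <- PcaHubert(oak)
summary(rpca)
```
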
In summary, a PCA of any type may not be an appropriate statistical approach for this dataset. Principal coordinates analysis (PCoA) is a better choice, given its built-in flexibility with respect to distance measures. Clustering based on any number of variables, such as stand, Elev.m, or AspClass, would also likely yield more meaningful results.