2019-02-12

PCA

  • Principal component analysis (PCA) distributes the variation in a multivariate dataset across a set of components
  • It lets us visualize patterns that would not otherwise be apparent
  • Linear algebra is at the heart of PCA
  • This discussion will be light on mathematical theory

Accomplishing the PCA Manually

Accomplishing the PCA Manually

  • Goals for the manual PCA:
  • Become acquainted with the terminology and concepts of PCA
  • Be better prepared to defend your analysis

Motivating Example - Wolf Spider Morphometrics

Descriptive Statistics

             interoc  cwidth  clength  T.weight
median         0.798   2.975    3.706     1.740
mean           0.799   2.991    3.691     1.742
SE.mean        0.005   0.020    0.020     0.004
CI.mean.0.95   0.011   0.039    0.039     0.008
var            0.002   0.028    0.028     0.001
std.dev        0.046   0.166    0.167     0.033
coef.var       0.058   0.055    0.045     0.019
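
The call behind this table isn't shown on the slide, but the row names match the output of stat.desc() from the pastecs package; a sketch along those lines (the package and the column selection are assumptions):

# Hypothetical reconstruction -- assumes pastecs and that morpho holds
# the four wolf-spider measurements summarized above
library(pastecs)
round(stat.desc(morpho[, c("interoc", "cwidth", "clength", "T.weight")],
                basic = FALSE), digits = 3)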

Covariance or Correlation?

Covariance or Correlation?

  • Are the metrics in our dataset alike or mixed in their units and scales?
  • Alike: use the covariance matrix (mean-centering only)
  • Mixed: use the correlation matrix (mean-centering plus scaling to unit variance)
  • This choice becomes essential when using the built-in PCA functions in R (see the sketch below)
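
In prcomp() the choice comes down to the scale. argument; a minimal sketch using the same morpho data frame the deck fits later:

# Like units: mean-center only, i.e. work from the covariance matrix
pca_cov <- prcomp(morpho, center = TRUE, scale. = FALSE)

# Mixed units: also scale to unit variance, i.e. work from the correlation matrix
pca_cor <- prcomp(morpho, center = TRUE, scale. = TRUE)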

Find the Eigenvalues & Eigenvectors

Calculating Eigenvectors

standardize <- function(x) {(x - mean(x))/sd(x)}

# Standardize each variable (factor variables & untransformed weights assumed already removed from morpho)
my.scaled.data <- as.data.frame(apply(morpho, 2, standardize))

# Calculate correlation matrix
my.cor <- cor(my.scaled.data)

# Eigen-decomposition of the correlation matrix (eigenvalues & eigenvectors)
my.eigen <- eigen(my.cor)

# Rename matrix rows and columns for easier interpretation
rownames(my.eigen$vectors) <- c("interoc", "cwidth",
                                "clength", "T.weight")
colnames(my.eigen$vectors) <- c("PC1", "PC2", "PC3", "PC4")

Calculating Eigenvectors

              PC1      PC2      PC3      PC4
interoc   -0.4973  -0.2504   0.8251  -0.0960
cwidth    -0.5319  -0.3465  -0.4948  -0.5935
clength   -0.5760  -0.0463  -0.2716   0.7696
T.weight  -0.3716   0.9028   0.0250  -0.2150

Calculating Eigenvalues

PC    eigenvalue
PC1       2.7104
PC2       0.7608
PC3       0.4128
PC4       0.1160

Calculating Eigenvalues

Sum of the eigenvalues = total variance of the scaled data

sum(my.eigen$values)
## [1] 4
sum(
  var(my.scaled.data[,1]),
  var(my.scaled.data[,2]),
  var(my.scaled.data[,3]),
  var(my.scaled.data[,4]))
## [1] 4

Amount of Variation Captured by the PCs

Variation Captured by the PCs

pc1.var <- 100*round(my.eigen$values[1]/
                       sum(my.eigen$values), digits = 3)
pc2.var <- 100*round(my.eigen$values[2]/
                       sum(my.eigen$values), digits = 3)
pc3.var <- 100*round(my.eigen$values[3]/
                       sum(my.eigen$values), digits = 3)
pc4.var <- 100*round(my.eigen$values[4]/
                       sum(my.eigen$values), digits = 3)

pc <- data.frame(PC = c("PC1", "PC2", "PC3", "PC4"),
                 Percentage = c(pc1.var, pc2.var,
                                pc3.var, pc4.var))

Variation Captured by the PCs

PC    Percentage
PC1         67.8
PC2         19.0
PC3         10.3
PC4          2.9

Variation Captured by the PCs

Because of rounding, the percentages should sum to approximately 100%.
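
The call isn't shown on the slide; a one-liner like the following (using the pc data frame built above) reproduces the check:

sum(pc$Percentage)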

## [1] 100

What are PCA “scores”?

What are PCA “scores”?

  • Express the loadings and scaled data as matrices, then multiply them together
  • The result is a new matrix that expresses the data in terms of the PCs
  • These are the PCA scores

What are PCA “scores”?

loadings <- my.eigen$vectors
my.scaled.matrix <- as.matrix(my.scaled.data)
# %*% performs matrix multiplication
scores <- my.scaled.matrix %*% loadings
# Standard deviations of the PCs are the square roots of the eigenvalues
sd <- sqrt(my.eigen$values)
rownames(loadings) <- colnames(my.scaled.data)

What are PCA “scores”?

     PC1     PC2      PC3      PC4
 -2.2600  0.4796   0.1963  -0.0511
 -1.2141  1.0328   0.0185   0.0859
 -2.9230  0.7367  -0.3902   0.5873
 -1.0990  1.0164   0.0283   0.0654
 -0.5060  1.8150   0.8230   0.6192
 -0.3074  0.8778   0.0583  -0.0498

PCA with Native R Functions

PCA with Native R Functions

The function prcomp is the primary tool for PCA in base R

pca_morpho <- prcomp(morpho, center = TRUE, scale. = TRUE)

# List the elements stored in the prcomp object
ls(pca_morpho)
## [1] "center"   "rotation" "scale"    "sdev"     "x"

Summary output of the PCA

library(magrittr)  # provides the %>% pipe (may already be attached via dplyr/tidyverse)
pca_summary <- summary(pca_morpho)$importance %>%
  as.data.frame() %>%
  round(., digits = 3)

Summary output of the PCA

                         PC1    PC2    PC3    PC4
Standard deviation     1.646  0.872  0.642  0.341
Proportion of Variance 0.678  0.190  0.103  0.029
Cumulative Proportion  0.678  0.868  0.971  1.000

Orthogonality of PCs
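
The slide's figure isn't reproduced here; one way to check orthogonality numerically, sketched with the pca_morpho fit from above:

# Loadings are orthonormal, so their crossproduct is the identity matrix
round(t(pca_morpho$rotation) %*% pca_morpho$rotation, digits = 3)

# Equivalently, the PC scores are uncorrelated
round(cor(pca_morpho$x), digits = 3)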

Biplot
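
The figure itself isn't reproduced here; in base R a biplot of the fit is a single call:

biplot(pca_morpho)   # PC1 vs PC2 scores with the loading vectors overlaid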

How many PCs explain “enough” variation?

Kaiser Criterion

  • If an eigenvalue associated with a PC is \(\small >1\), then retain that component
  • Compute the eigenvalues from the PCA: square the SDs in the prcomp object.
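
For example, with the prcomp fit from above:

pca_morpho$sdev^2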
## [1] 2.7104296 0.7607956 0.4127988 0.1159759

Parallel Analysis

Designed to reduce the subjectivity of interpreting a scree plot

Parallel Analysis

  • Simulation-based method
  • Generates thousands of datasets analogous to the “real” dataset
  • Retain the components whose eigenvalues exceed those from the simulated data (see the sketch below)
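
The exact call isn't shown, but the message on the next slide resembles output from the paran package; a sketch assuming that package:

# Hypothetical reconstruction -- Horn's parallel analysis on the scaled data
library(paran)
paran(my.scaled.data, iterations = 5000, graph = TRUE)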

Parallel Analysis

## 
## Using eigendecomposition of correlation matrix.

PCA on the Oak Woods Dataset - Robust PCA

PCA: Oak Woods Dataset


In summary, a PCA of any type may not be an appropriate statistical approach for this dataset.

Questions?