Low-dimensional embeddings 1
Levi Waldron
Outline
Based on Biomedical Data Science by Irizarry and Love, chapter 8.
- Distances in high dimensions
- Principal Components Analysis and Singular Value Decomposition
- Multidimensional Scaling (Principal Coordinates Analysis)
- GenomicSuperSignatures Bioconductor package
Built slides at https://rpubs.com/lwaldron/916521
Metrics and distances
A metric satisfies the following five properties (see the short numeric check after this list):
- non-negativity \(d(a, b) \ge 0\)
- symmetry \(d(a, b) = d(b, a)\)
- identity \(d(a, a) = 0\)
- definiteness \(d(a, b) = 0\) if and only if \(a = b\)
- triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)
- A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
- A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar
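A minimal numeric check of the symmetry and triangle inequality properties for Euclidean distance, using made-up 2-D vectors (not part of the original slides):
euclid <- function(a, b) sqrt(sum((a - b) ^ 2)) # Euclidean distance
v1 <- c(1, 2); v2 <- c(4, 6); v3 <- c(0, -1)
euclid(v1, v2) == euclid(v2, v1) # symmetry
## [1] TRUE
euclid(v1, v2) + euclid(v2, v3) >= euclid(v1, v3) # triangle inequality
## [1] TRUE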
Euclidean distance (metric)

Euclidean \(d = \sqrt{(A_x-B_x)^2 + (A_y-B_y)^2}\).
- Side note: also referred to as \(L_2\) norm
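For example, a small sketch (hypothetical points A and B) comparing the formula to R's dist() function:
A <- c(2, 5) # (A_x, A_y)
B <- c(7, 1) # (B_x, B_y)
sqrt((A[1] - B[1]) ^ 2 + (A[2] - B[2]) ^ 2) # the formula above
## [1] 6.403124
dist(rbind(A, B)) # same value from dist()
##          A
## B 6.403124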
Euclidean distance in high dimensions
## BiocManager::install("genomicsclass/tissuesGeneExpression") #if needed
## BiocManager::install("genomicsclass/GSE5859") #if needed
library(GSE5859)
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215 189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver
##          38          34          15          31          39          26
##    placenta
##           6
We are interested in identifying similar samples and similar genes.
Notes about Euclidean distance in high dimensions
- Points are no longer on the Cartesian plane
- instead they are in higher dimensions. For example:
  - sample \(i\) is defined by a point in 22,215-dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\)
  - feature \(g\) is defined by a point in 189-dimensional space: \((Y_{g,1},\dots,Y_{g,189})^\top\)
Euclidean distance is calculated just as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:
\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j})^2 } \]
and the distance between two features \(h\) and \(g\) is:
\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]
Euclidean distance in matrix algebra notation
The Euclidean distance between samples \(i\) and \(j\) can be written as:
\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]
with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) the columns \(i\) and \(j\) of \(\mathbf{Y}\).
t(matrix(1:3, ncol = 1))
##      [,1] [,2] [,3]
## [1,]    1    2    3
matrix(1:3, ncol = 1)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
t(matrix(1:3, ncol = 1)) %*% matrix(1:3, ncol = 1)
##      [,1]
## [1,]   14
Note about matrix algebra in R
- R is very efficient at “vectorized” matrix algebra
- for very large matrices, see the:
2-sample example
kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2) ^ 2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1) ^ 2))
## [1] 122.8919
3-sample example using dist()
dim(e)
## [1] 22215 189
(d <- dist(t(e[, c(1, 2, 87)])))
##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546
## GSM92240.CEL.gz        122.8919        115.4773
## [1] "dist"
The dist() function
Excerpt from ?dist:
dist(x,
     method = "euclidean",
     diag = FALSE,
     upper = FALSE,
     p = 2)
- method: the distance measure to be used.
  - This must be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”.
- dist class output from dist() is used for many clustering algorithms and heatmap functions
- Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.
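For all pairwise sample distances, transpose first so the result stays small (a quick sketch; the object name is arbitrary):
d.samples <- dist(t(e)) # 189 x 189 sample-to-sample distances: manageable
dim(as.matrix(d.samples))
## [1] 189 189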
Note on standardization
- In practice, features (measures) should usually be “standardized” when calculating distances, for example by converting to z-score:
\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]
- This is done because the differences in overall amplitude between features may be technical, e.g.:
  - GC bias, PCR amplification efficiency, …
- Euclidean distance and \(1-r\) (Pearson cor) are related (see the sketch below):
  - \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)
  - \(m\) = # dimensions
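A minimal numeric check of this relationship on made-up vectors (not from the original slides). Note that scale() uses the sample standard deviation, so the exact denominator below is \(2(m-1)\); with \(2m\) the equality is approximate (or exact under population-sd scaling):
set.seed(2)
m <- 50
x <- as.vector(scale(rnorm(m))) # z-scored vector
y <- as.vector(scale(rnorm(m)))
sum((x - y) ^ 2) / (2 * (m - 1)) # squared Euclidean distance, rescaled
1 - cor(x, y) # same value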
Dimension reduction and PCA
- Motivation for dimension reduction
Simulate the heights of twin pairs:
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1] 2 100
cor(t(y))
##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000
Visualizing twin pairs data

Not much distance is lost in the second dimension

- Not much loss of height differences when just using average heights of twin pairs (see the rotation sketch below)
  - because twin heights are highly correlated
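A sketch of one such rotation of the simulated twin heights into (average, difference) coordinates; dividing by sqrt(2) keeps Euclidean distances unchanged (an illustrative rotation, not necessarily the exact one behind the figure):
z <- rbind((y[1, ] + y[2, ]) / sqrt(2), # average coordinate
           (y[2, ] - y[1, ]) / sqrt(2)) # difference coordinate
apply(z, 1, sd) # nearly all of the spread is in the first coordinate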
Singular Value Decomposition (SVD)
SVD generalizes the example rotation we looked at:
\[\mathbf{Y} = \mathbf{UDV}^\top\]

note: the above formulation is for \(m\) rows \(> n\) columns
\(\mathbf{Y}\): the \(m\) rows x \(n\) cols matrix of measurements
\(\mathbf{U}\): \(m \times n\) matrix relating original scores to PCA scores (loadings)
\(\mathbf{D}\): \(n \times n\) diagonal matrix (eigenvalues)
\(\mathbf{V}\): \(n \times n\) orthogonal matrix (eigenvectors or PCA scores)
- orthogonal = unit length and “perpendicular” in 3-D
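A minimal check on a small random matrix that svd() returns components reconstructing \(\mathbf{Y}\):
set.seed(3)
Y <- matrix(rnorm(20), nrow = 5, ncol = 4) # m = 5 rows > n = 4 columns
sv <- svd(Y)
all.equal(Y, sv$u %*% diag(sv$d) %*% t(sv$v)) # Y = U D V^T
## [1] TRUE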
SVD of gene expression dataset
Centering (scale = FALSE, so genes are centered but not variance-scaled):
e.standardized <- t(scale(t(e), center = TRUE, scale = FALSE))
SVD:
s <- svd(e.standardized)
names(s)
## [1] "d" "u" "v"
Components of SVD results
dim(s$u) # loadings
## [1] 22215 189
length(s$d) # eigenvalues
## [1] 189
dim(s$v) # d %*% vT = scores
## [1] 189 189
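As a check (may take a moment on the full matrix), the three components reconstruct the centered data:
all.equal(e.standardized, s$u %*% diag(s$d) %*% t(s$v),
          check.attributes = FALSE)
## [1] TRUE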

PCA is an SVD
rafalib::mypar()
p <- prcomp(t(e.standardized))
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: u and v can each be multiplied by -1 arbitrarily
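A quick check that the first loading vectors from svd() and prcomp() describe the same axis, up to a sign flip:
cor(s$u[, 1], p$rotation[, 1]) # +1 or -1: same axis, possibly flipped sign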
PCA interpretation: loadings

- \(\mathbf{U}\) (loadings): relate the principal component axes to the original variables
  - think of principal component axes as a weighted combination of original axes
Visualizing PCA loadings
plot(p$rotation[, 1],
     xlab = "Index of genes",
     ylab = "Loadings of PC1",
     main = "PC1 loadings of each gene")
abline(h = c(-0.03, 0.03), col = "red")

Genes with high PC1 loadings
e.pc1genes <-
  e.standardized[p$rotation[, 1] < -0.03 |
                   p$rotation[, 1] > 0.03, ]
dim(e.pc1genes)
## [1] 113 189

PCA interpretation: eigenvalues
- \(\mathbf{D}\) (eigenvalues): standard deviation scaling factor that each decomposed variable is multiplied by.
rafalib::mypar()
plot(
  p$sdev ^ 2 / sum(p$sdev ^ 2) * 100,
  xlim = c(0, 150),
  type = "b",
  ylab = "% variance explained",
  main = "Screeplot"
)

PCA interpretation: eigenvalues
Alternatively as cumulative % variance explained (using the cumsum() function):
rafalib::mypar()
plot(
  cumsum(p$sdev ^ 2) / sum(p$sdev ^ 2) * 100,
  ylab = "cumulative % variance explained",
  ylim = c(0, 100),
  type = "b",
  main = "Cumulative screeplot"
)

PCA interpretation: scores

- \(\mathbf{V}\) (scores): The “datapoints” in the reduced principal component space
- In some implementations (like prcomp()), scores are already scaled by eigenvalues: \(\mathbf{D V}^\top\) (see the sketch below)
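A sketch checking that, for PC1, the prcomp() scores equal the corresponding column of \(\mathbf{V}\) scaled by its entry in \(\mathbf{D}\), up to an arbitrary sign flip:
cor(p$x[, 1], s$v[, 1] * s$d[1]) # +1 or -1: same scores up to sign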
PCA interpretation: scores

Multi-dimensional Scaling (MDS)
- also referred to as Principal Coordinates Analysis (PCoA)
- a reduced SVD, performed on a distance matrix
- identify two (or more) eigenvalues/vectors that preserve distances
d <- as.dist(1 - cor(e.standardized))
mds <- cmdscale(d)
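A sketch of plotting the resulting two-dimensional coordinates, colored by the tissue vector loaded earlier:
plot(mds[, 1], mds[, 2],
     col = factor(tissue),
     xlab = "MDS dimension 1", ylab = "MDS dimension 2",
     main = "cmdscale() on 1 - correlation dissimilarities")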

Summary: distances and dimension reduction
- Signs of loadings and scores can be arbitrarily flipped
- Dimension reduction is only helpful if you have correlated variables
- Sensitive to skew and outliers, most useful for Gaussian (normal) data, although there are extensions
- PCA projection:
  - is the most interpretable form of dimension reduction
  - can be applied to new datasets using the inverse of the loadings matrix (see the sketch after this list)
  - screeplot will show signs of over-fitting, can be tested by cross-validation
- MDS (PCoA) projection gives flexibility in distance/dissimilarity measure, but less interpretable
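For example, a minimal sketch of projecting samples onto the fitted PCA space with predict() (here the first two samples of the same data stand in for a hypothetical new dataset):
new.scores <- predict(p, newdata = t(e.standardized)[1:2, ]) # rows = samples, columns = genes
dim(new.scores) # 2 samples x 189 PCs
## [1] 2 189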
GenomicSuperSignature example