Low-dimensional embeddings 1
Levi Waldron
Outline
Based on Biomedical Data Science by Irizarry and Love, chapter 8.
- Distances in high dimensions
- Principal Components Analysis and Singular Value Decomposition
- Multidimensional Scaling (Principal Coordinates Analysis)
- GenomicSuperSignatures Bioconductor package
Built slides at https://rpubs.com/lwaldron/916521
Metrics and distances
A metric satisfies the following five properties (see the short numeric check after this list):
- non-negativity \(d(a, b) \ge 0\)
- symmetry \(d(a, b) = d(b, a)\)
- identity \(d(a, a) = 0\)
- definiteness \(d(a, b) = 0\) if and only if \(a = b\)
- triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)
- A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
- A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar
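A minimal numeric check of the symmetry and triangle inequality properties for Euclidean distance, using made-up 2-D vectors (not part of the original slides):
euclid <- function(a, b) sqrt(sum((a - b) ^ 2)) # Euclidean distance
v1 <- c(1, 2); v2 <- c(4, 6); v3 <- c(0, -1)
euclid(v1, v2) == euclid(v2, v1) # symmetry
## [1] TRUE
euclid(v1, v2) + euclid(v2, v3) >= euclid(v1, v3) # triangle inequality
## [1] TRUE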
Euclidean distance (metric)

Euclidean \(d = \sqrt{(A_x-B_x)^2 + (A_y-B_y)^2}\).
- Side note: also referred to as \(L_2\) norm
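For example, a small sketch (hypothetical points A and B) comparing the formula to R's dist() function:
A <- c(2, 5) # (A_x, A_y)
B <- c(7, 1) # (B_x, B_y)
sqrt((A[1] - B[1]) ^ 2 + (A[2] - B[2]) ^ 2) # the formula above
## [1] 6.403124
dist(rbind(A, B)) # same value from dist()
##          A
## B 6.403124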
Euclidean distance in high dimensions
## BiocManager::install("genomicsclass/tissuesGeneExpression") #if needed
## BiocManager::install("genomicsclass/GSE5859") #if needed
library(GSE5859)
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215 189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver
##          38          34          15          31          39          26
##    placenta
##           6
We are interested in identifying similar samples and similar genes.
Notes about Euclidean distance in high dimensions
- Points are no longer on the Cartesian plane
- instead they are in higher dimensions. For example:
  - sample \(i\) is defined by a point in 22,215-dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\)
  - feature \(g\) is defined by a point in 189-dimensional space: \((Y_{g,1},\dots,Y_{g,189})^\top\)
Euclidean distance is calculated just as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:
\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j})^2 } \]
and the distance between two features \(h\) and \(g\) is:
\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]
Euclidean distance in matrix algebra notation
The Euclidean distance between samples \(i\) and \(j\) can be written as:
\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]
with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) the columns \(i\) and \(j\) of \(\mathbf{Y}\).
t(matrix(1:3, ncol = 1))
##      [,1] [,2] [,3]
## [1,]    1    2    3
matrix(1:3, ncol = 1)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
t(matrix(1:3, ncol = 1)) %*% matrix(1:3, ncol = 1)
##      [,1]
## [1,]   14
Note about matrix algebra in R
- R is very efficient at “vectorized” matrix algebra
- for very large matrices, see the:
2-sample example
kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2) ^ 2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1) ^ 2))
## [1] 122.8919
3-sample example using dist()
dim(e)
## [1] 22215 189
(d <- dist(t(e[, c(1, 2, 87)])))
##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546
## GSM92240.CEL.gz        122.8919        115.4773
## [1] "dist"
The dist() function
Excerpt from ?dist:
dist(x,
     method = "euclidean",
     diag = FALSE,
     upper = FALSE,
     p = 2)
- method: the distance measure to be used.
  - This must be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”.
- dist class output from dist() is used for many clustering algorithms and heatmap functions
- Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.
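For all pairwise sample distances, transpose first so the result stays small (a quick sketch; the object name is arbitrary):
d.samples <- dist(t(e)) # 189 x 189 sample-to-sample distances: manageable
dim(as.matrix(d.samples))
## [1] 189 189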
Note on standardization
- In practice, features (measures) should usually be “standardized” when calculating distances, for example by converting to z-score:
\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]
- This is done because the differences in overall amplitude between features may be technical, e.g.:
  - GC bias, PCR amplification efficiency, …
- Euclidean distance and \(1-r\) (Pearson cor) are related (see the sketch below):
  - \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)
  - \(m\) = # dimensions
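A minimal numeric check of this relationship on made-up vectors (not from the original slides). Note that scale() uses the sample standard deviation, so the exact denominator below is \(2(m-1)\); with \(2m\) the equality is approximate (or exact under population-sd scaling):
set.seed(2)
m <- 50
x <- as.vector(scale(rnorm(m))) # z-scored vector
y <- as.vector(scale(rnorm(m)))
sum((x - y) ^ 2) / (2 * (m - 1)) # squared Euclidean distance, rescaled
1 - cor(x, y) # same value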
Dimension reduction and PCA
- Motivation for dimension reduction
Simulate the heights of twin pairs:
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1] 2 100
cor(t(y))
##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000
Visualizing twin pairs data

Not much distance is lost in the second dimension

- Not much loss of height differences when just using average heights of twin pairs (see the rotation sketch below)
  - because twin heights are highly correlated
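A sketch of one such rotation of the simulated twin heights into (average, difference) coordinates; dividing by sqrt(2) keeps Euclidean distances unchanged (an illustrative rotation, not necessarily the exact one behind the figure):
z <- rbind((y[1, ] + y[2, ]) / sqrt(2), # average coordinate
           (y[2, ] - y[1, ]) / sqrt(2)) # difference coordinate
apply(z, 1, sd) # nearly all of the spread is in the first coordinate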
Singular Value Decomposition (SVD)
SVD generalizes the example rotation we looked at:
\[\mathbf{Y} = \mathbf{UDV}^\top\]

note: the above formulation is for \(m\) rows \(> n\) columns
\(\mathbf{Y}\): the \(m\) rows x \(n\) cols matrix of measurements
\(\mathbf{U}\): \(m \times n\) matrix relating original scores to PCA scores (loadings)
\(\mathbf{D}\): \(n \times n\) diagonal matrix (eigenvalues)
\(\mathbf{V}\): \(n \times n\) orthogonal matrix (eigenvectors or PCA scores)
- orthogonal = unit length and “perpendicular” in 3-D
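A minimal check on a small random matrix that svd() returns components reconstructing \(\mathbf{Y}\):
set.seed(3)
Y <- matrix(rnorm(20), nrow = 5, ncol = 4) # m = 5 rows > n = 4 columns
sv <- svd(Y)
all.equal(Y, sv$u %*% diag(sv$d) %*% t(sv$v)) # Y = U D V^T
## [1] TRUE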
SVD of gene expression dataset
Centering (scale = FALSE, so genes are centered but not variance-scaled):
e.standardized <- t(scale(t(e), center = TRUE, scale = FALSE))
SVD:
s <- svd(e.standardized)
names(s)
## [1] "d" "u" "v"
Components of SVD results
dim(s$u) # loadings
## [1] 22215 189
length(s$d) # eigenvalues
## [1] 189
dim(s$v) # d %*% vT = scores
## [1] 189 189
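As a check (may take a moment on the full matrix), the three components reconstruct the centered data:
all.equal(e.standardized, s$u %*% diag(s$d) %*% t(s$v),
          check.attributes = FALSE)
## [1] TRUE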

PCA is an SVD
rafalib::mypar()
p <- prcomp(t(e.standardized))
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: u and v can each be multiplied by -1 arbitrarily
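A quick check that the first loading vectors from svd() and prcomp() describe the same axis, up to a sign flip:
cor(s$u[, 1], p$rotation[, 1]) # +1 or -1: same axis, possibly flipped sign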
PCA interpretation: loadings

- \(\mathbf{U}\) (loadings): relate the principal component axes to the original variables
  - think of principal component axes as a weighted combination of original axes
Visualizing PCA loadings
plot(p$rotation[, 1],
     xlab = "Index of genes",
     ylab = "Loadings of PC1",
     main = "PC1 loadings of each gene")
abline(h = c(-0.03, 0.03), col = "red")

Genes with high PC1 loadings
e.pc1genes <-
  e.standardized[p$rotation[, 1] < -0.03 |
                   p$rotation[, 1] > 0.03, ]
dim(e.pc1genes)
## [1] 113 189

PCA interpretation: eigenvalues
- \(\mathbf{D}\) (eigenvalues): standard deviation scaling factor that each decomposed variable is multiplied by.
rafalib::mypar()
plot(
  p$sdev ^ 2 / sum(p$sdev ^ 2) * 100,
  xlim = c(0, 150),
  type = "b",
  ylab = "% variance explained",
  main = "Screeplot"
)

PCA interpretation: eigenvalues
Alternatively as cumulative % variance explained (using the cumsum() function):
rafalib::mypar()
plot(
  cumsum(p$sdev ^ 2) / sum(p$sdev ^ 2) * 100,
  ylab = "cumulative % variance explained",
  ylim = c(0, 100),
  type = "b",
  main = "Cumulative screeplot"
)

PCA interpretation: scores

- \(\mathbf{V}\) (scores): The “datapoints” in the reduced principal component space
- In some implementations (like prcomp()), scores are already scaled by eigenvalues: \(\mathbf{D V}^\top\) (see the sketch below)
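A sketch checking that, for PC1, the prcomp() scores equal the corresponding column of \(\mathbf{V}\) scaled by its entry in \(\mathbf{D}\), up to an arbitrary sign flip:
cor(p$x[, 1], s$v[, 1] * s$d[1]) # +1 or -1: same scores up to sign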
PCA interpretation: scores

Multi-dimensional Scaling (MDS)
- also referred to as Principal Coordinates Analysis (PCoA)
- a reduced SVD, performed on a distance matrix
- identify two (or more) eigenvalues/vectors that preserve distances
d <- as.dist(1 - cor(e.standardized))
mds <- cmdscale(d)
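A sketch of plotting the resulting two-dimensional coordinates, colored by the tissue vector loaded earlier:
plot(mds[, 1], mds[, 2],
     col = factor(tissue),
     xlab = "MDS dimension 1", ylab = "MDS dimension 2",
     main = "cmdscale() on 1 - correlation dissimilarities")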

Summary: distances and dimension reduction
- Signs of loadings and scores can be arbitrarily flipped
- Dimension reduction is only helpful if you have correlated variables
- Sensitive to skew and outliers, most useful for Gaussian (normal) data, although there are extensions
- PCA projection:
  - is the most interpretable form of dimension reduction
  - can be applied to new datasets using the inverse of the loadings matrix (see the sketch after this list)
  - screeplot will show signs of over-fitting, can be tested by cross-validation
- MDS (PCoA) projection gives flexibility in distance/dissimilarity measure, but less interpretable
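For example, a minimal sketch of projecting samples onto the fitted PCA space with predict() (here the first two samples of the same data stand in for a hypothetical new dataset):
new.scores <- predict(p, newdata = t(e.standardized)[1:2, ]) # rows = samples, columns = genes
dim(new.scores) # 2 samples x 189 PCs
## [1] 2 189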
GenomicSuperSignature example