Low-dimensional embeddings 1

Levi Waldron

Outline

Based on Biomedical Data Science by Irizarry and Love, chapter 8.

Built slides at https://rpubs.com/lwaldron/916521

Metrics and distances

A metric satisfies the following five properties:

  1. non-negativity \(d(a, b) \ge 0\)
  2. symmetry \(d(a, b) = d(b, a)\)
  3. identification mark \(d(a, a) = 0\)
  4. definiteness \(d(a, b) = 0\) if and only if \(a=b\)
  5. triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)

Euclidian distance (metric)

Euclidean d = \(\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\).

Euclidian distance in high dimensions

## BiocManager::install("genomicsclass/tissuesGeneExpression") #if needed
## BiocManager::install("genomicsclass/GSE5859") #if needed
library(GSE5859)
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215   189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

Notes about Euclidian distance in high dimensions

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

Euclidian distance in matrix algebra notation

The Euclidian distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).

t(matrix(1:3, ncol = 1))
##      [,1] [,2] [,3]
## [1,]    1    2    3
matrix(1:3, ncol = 1)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
t(matrix(1:3, ncol = 1)) %*% matrix(1:3, ncol = 1)
##      [,1]
## [1,]   14

Note about matrix algebra in R

2-sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2) ^ 2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1) ^ 2))
## [1] 122.8919

3-sample example using dist()

dim(e)
## [1] 22215   189
(d <- dist(t(e[, c(1, 2, 87)])))
##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773
class(d)
## [1] "dist"

The dist() function

Excerpt from ?dist:

dist(x,
    method = "euclidean",
    diag = FALSE,
    upper = FALSE,
    p = 2)

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.

Note on standardization

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

Dimension reduction and PCA

Simulate the heights of twin pairs:

set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1]   2 100
cor(t(y))
##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

Visualizing twin pairs data

Not much distance is lost in the second dimension

Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

SVD

SVD of gene expression dataset

Scaling:

e.standardized <- t(scale(t(e), center = TRUE, scale = FALSE))

SVD:

s <- svd(e.standardized)
names(s)
## [1] "d" "u" "v"

Components of SVD results

dim(s$u)     # loadings
## [1] 22215   189
length(s$d)  # eigenvalues
## [1] 189
dim(s$v)     # d %*% vT = scores
## [1] 189 189

SVD

PCA is a SVD

rafalib::mypar()
p <- prcomp(t(e.standardized))
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: u and v can each be multiplied by -1 arbitrarily

PCA interpretation: loadings

SVD

Visualizing PCA loadings

plot(p$rotation[, 1],
     xlab = "Index of genes",
     ylab = "Loadings of PC1",
     main = "PC1 loadings of each gene")
abline(h = c(-0.03, 0.03), col = "red")

Genes with high PC1 loadings

e.pc1genes <-
    e.standardized[p$rotation[, 1] < -0.03 |
                           p$rotation[, 1] > 0.03,]
dim(e.pc1genes)
## [1] 113 189

PCA interpretation: eigenvalues

rafalib::mypar()
plot(
    p$sdev ^ 2 / sum(p$sdev ^ 2) * 100,
    xlim = c(0, 150),
    type = "b",
    ylab = "% variance explained",
    main = "Screeplot"
)

PCA interpretation: eigenvalues

Alternatively as cumulative % variance explained (using cumsum() function)

rafalib::mypar()
plot(
    cumsum(p$sdev ^ 2) / sum(p$sdev ^ 2) * 100,
    ylab = "cumulative % variance explained",
    ylim = c(0, 100),
    type = "b",
    main = "Cumulative screeplot"
)

PCA interpretation: scores

SVD

PCA interpretation: scores

Multi-dimensional Scaling (MDS)

d <- as.dist(1 - cor(e.standardized))
mds <- cmdscale(d)

Summary: distances and dimension reduction

GenomicSuperSignature example

Google Slides:

https://bit.ly/3N7kEtJ