March 10, 2016

Session 5 outline

  • Distances in high dimensions
  • Principal Components Analysis
  • Multidimensional Scaling

Book chapter 7

Distances in high-dimensional data analysis

The importance of distance

  • High-dimensional data are complex and impossible to visualize in raw form
  • Data may have thousands of dimensions, but we can only visualize 2-3 at a time
  • Distances can summarize thousands of dimensions in a single number

The importance of distance (cont'd)

  • Distances can help organize samples and genomic features

The importance of distance (cont'd)

Metrics and distances

A metric satisfies the following five properties:

  1. non-negativity \(d(a, b) \ge 0\)
  2. symmetry \(d(a, b) = d(b, a)\)
  3. identity \(d(a, a) = 0\)
  4. definiteness \(d(a, b) = 0\) if and only if \(a=b\)
  5. triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)
  • A distance is only required to satisfy 1-3.
  • A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
  • A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar

Euclidean distance (metric)

  • Remember grade school:
    Euclidean \(d = \sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\) (toy example below).
  • Side note: also referred to as the \(L_2\) norm
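
A quick toy check of this formula in R (a sketch, with made-up points A and B):

A <- c(x = 0, y = 0)
B <- c(x = 3, y = 4)
sqrt(sum((A - B)^2))  # a 3-4-5 triangle, so the distance is 5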

Euclidean distance in high dimensions

##biocLite("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215   189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

Euclidean distance in high dimensions

  • Points are no longer on the Cartesian plane;
  • instead they are in higher dimensions. For example:
    • sample \(i\) is defined by a point in 22,215-dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\).
    • feature \(g\) is defined by a point in 189-dimensional space: \((Y_{g,1},\dots,Y_{g,189})^\top\)

Euclidian distance in high dimensions

The Euclidean distance is computed just as in two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j})^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

Matrix algebra notation

The distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) denoting columns \(i\) and \(j\) of the data matrix.

Matrix algebra notation

t(matrix(1:3, ncol=1))
##      [,1] [,2] [,3]
## [1,]    1    2    3
matrix(1:3, ncol=1)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
t(matrix(1:3, ncol=1)) %*% matrix(1:3, ncol=1)
##      [,1]
## [1,]   14

Note: R is very efficient at matrix algebra

3 sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2)^2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1)^2))
## [1] 122.8919
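
The same two distances written with the matrix algebra formula from the previous slide (a sketch; crossprod(x) computes \(\mathbf{x}^\top\mathbf{x}\)):

sqrt(crossprod(kidney1 - kidney2))  # same as sqrt(sum((kidney1 - kidney2)^2))
sqrt(crossprod(kidney1 - colon1))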

3 sample example using dist()

dim(e)
## [1] 22215   189
(d <- dist(t(e[, c(1, 2, 87)])))
##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773
class(d)
## [1] "dist"

The dist() function

Excerpt from ?dist:

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
  • method: the distance measure to be used (example below).
    • This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given.
  • The "dist"-class output from dist() is used by many clustering algorithms and heatmap functions

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.
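
For example, the same three samples with a different distance measure (a sketch):

dist(t(e[, c(1, 2, 87)]), method = "manhattan")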

Note on standardization

  • In practice, features (e.g. genes) are typically "standardized", i.e. converted to z-score:

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

  • This is done because the differences in overall levels between features are often not due to biological effects but technical ones, e.g.:
    • GC bias, PCR amplification efficiency, …
  • For z-scored vectors \(x\) and \(y\) of length \(m\), Euclidean distance and \(1-r\) (Pearson correlation) are related (see the sketch below):
    • \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)
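
A minimal numerical check of this identity (a sketch with simulated vectors; zscore() is a hypothetical helper using the population SD, so the identity holds exactly):

set.seed(1)
m <- 1000
a <- rnorm(m)
b <- 0.5 * a + rnorm(m)  # two correlated vectors
zscore <- function(x) (x - mean(x)) / sqrt(mean((x - mean(x))^2))
d2 <- sum((zscore(a) - zscore(b))^2)  # squared Euclidean distance
d2 / (2 * m)    # equals ...
1 - cor(a, b)   # ... 1 minus the Pearson correlation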

Distance Exercises: p. 322-323

Dimension reduction and PCA

Motivation for dimension reduction

Simulate the heights of twin pairs:

library(MASS)
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0,0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1]   2 100
cor(t(y))
##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

Motivation for dimension reduction

Motivation for dimension reduction

  • Little information about height differences is lost when each twin pair is summarized by its average height (see the sketch below)
    • because twin heights are highly correlated
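
A minimal sketch of that idea, reusing the simulated y from the previous slide: rotate each pair into (average, difference) coordinates and compare the spread.

avg <- (y[1, ] + y[2, ]) / 2  # average height of each twin pair
dif <- (y[1, ] - y[2, ]) / 2  # half the within-pair difference
sd(avg)  # captures almost all of the variation
sd(dif)  # small, because twin heights are highly correlated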

Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

  • note: the above formulation is for \(m > n\)

Singular Value Decomposition (SVD)

  • \(\mathbf{Y}\): the m rows × n columns matrix of measurements
  • \(\mathbf{U}\): m × n orthogonal matrix (scores)
    • orthogonal = columns of unit length, mutually "perpendicular" (the 3-D intuition carries over to any dimension)
  • \(\mathbf{D}\): n × n diagonal matrix of singular values (squared, these are the eigenvalues)
  • \(\mathbf{V}\): n × n orthogonal matrix (eigenvectors or loadings)
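
A small numerical check of \(\mathbf{Y} = \mathbf{UDV}^\top\) (a sketch with a random matrix, \(m > n\)):

set.seed(1)
Y <- matrix(rnorm(20), nrow = 5, ncol = 4)  # m = 5 rows, n = 4 columns
s.check <- svd(Y)
all.equal(Y, s.check$u %*% diag(s.check$d) %*% t(s.check$v))  # TRUE, up to rounding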

SVD of gene expression dataset

system.time(e.standardize.slow <- t(apply(e, 1, function(x) x - mean(x) )))
##    user  system elapsed 
##   1.939   0.137   2.118
system.time(e.standardize.fast <- t(scale(t(e), scale=FALSE)))
##    user  system elapsed 
##   0.169   0.015   0.185
## all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])
s <- svd(e.standardize.fast)
names(s)
## [1] "d" "u" "v"

SVD of gene expression dataset

dim(s$u)     # left singular vectors; scores = u %*% diag(d)
## [1] 22215   189
length(s$d)  # singular values
## [1] 189
dim(s$v)     # right singular vectors (loadings)
## [1] 189 189
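
The singular values give the fraction of variance captured by each component (a sketch; compare with the princomp() screeplot later):

variance.explained <- s$d^2 / sum(s$d^2)  # d^2 is proportional to the variance explained
head(round(variance.explained, 3))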

PCA of gene expression dataset

p <- princomp(e.standardize.fast)  # cor=TRUE to scale rows

PCA interpretation: scores

  • \(\mathbf{U}\) (scores): scaled by \(\mathbf{D}\), gives each observation's (gene's) coordinates along the principal component axes
    • think of each principal component axis as a weighted combination of the original axes

PCA interpretation: scores

plot(p$scores[, 1], xlab="Index of genes", ylab="Scores of PC1", 
     main="Scores of PC1") #or, predict(p)
abline(h=c(-35, 25), col="red")

PCA interpretation: scores

Genes with high PC1 scores:

e.pc1genes <- e[which(p$scores[, 1] < -35 | p$scores[, 1] > 25), ]
pheatmap::pheatmap(e.pc1genes, scale="none", show_rownames=FALSE, 
                  show_colnames = FALSE)

PCA interpretation: eigenvalues

  • \(\mathbf{D}\) (eigenvalues): the standard-deviation-like scaling factor applied to each component; squared, these values are proportional to the variance explained.
plot(p$sdev^2 / sum(p$sdev^2)*100, ylab="% variance explained")

PCA interpretation: eigenvalues

Alternatively as cumulative % variance explained (using cumsum() function):
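
A possible version of that plot (a sketch, reusing p from the princomp() call above):

plot(cumsum(p$sdev^2) / sum(p$sdev^2) * 100, xlab = "Principal component",
     ylab = "Cumulative % variance explained")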

PCA interpretation: loadings

  • \(\mathbf{V}\) (loadings): the weights relating each principal component to the original variables (the samples); scaled by \(\mathbf{D}\), they place the samples as "data points" in the reduced principal component space

PCA interpretation: loadings

PCA Lab

  1. Re-create PC plot with per-gene scaling
    1. by explicitly scaling the rows, and
    2. by setting cor=TRUE in the princomp() function
  2. Re-create screeplot using built-in plot function

Multi-dimensional Scaling (MDS)

  • also referred to as Principal Coordinates Analysis (PCoA)
  • a reduced SVD / eigendecomposition, performed on a (transformed) distance matrix
  • identify two (or more) eigenvalues/vectors that preserve distances
d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)
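
One way to look at the result (a sketch): plot the samples in the first two MDS dimensions, colored by the tissue vector loaded with the dataset above.

plot(mds[, 1], mds[, 2], col = factor(tissue),
     xlab = "MDS dimension 1", ylab = "MDS dimension 2")
legend("topleft", legend = levels(factor(tissue)),
       col = seq_along(levels(factor(tissue))), pch = 1, cex = 0.7)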

Multi-dimensional Scaling (MDS)

Conclusions

  • Note: the signs of the eigenvectors (loadings) and scores are arbitrary and may be flipped; the eigenvalues (square the singular values to get variances) are always non-negative
  • PCA and MDS are useful for dimension reduction when you have correlated variables
  • Variables are always centered.
  • Variables are also scaled unless you know they have the same scale in the population
  • PCA projection can be applied to new datasets if you know the matrix calculations (see the sketch below)
  • PCA is subject to over-fitting; the screeplot (choice of number of components) can be assessed by cross-validation
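
A minimal sketch of that projection, reusing the first five genes of e.standardize.fast as stand-in "new" observations (e.new is a hypothetical matrix introduced here):

e.new <- e.standardize.fast[1:5, ]  # pretend these are new observations
new.scores <- sweep(e.new, 2, p$center) %*% p$loadings  # center with the training means, apply loadings
## equivalently: predict(p, newdata = e.new)
new.scores[, 1:3]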

Lab

Links