March 10, 2016

Session 5 outline

  • Distances in high dimensions
  • Principal Components Analysis
  • Multidimensional Scaling

Book chapter 7

Distances in high-dimensional data analysis

The importance of distance

  • High-dimensional data are complex and impossible to visualize in raw form
  • Data may have thousands of dimensions, but we can only visualize 2-3 at a time
  • Distances can summarize thousands of dimensions in a single number

The importance of distance (cont'd)

  • Distances can help organize samples and genomic features

The importance of distance (cont'd)

Metrics and distances

A metric satisfies the following five properties:

  1. non-negativity \(d(a, b) \ge 0\)
  2. symmetry \(d(a, b) = d(b, a)\)
  3. identity \(d(a, a) = 0\)
  4. definiteness \(d(a, b) = 0\) if and only if \(a=b\)
  5. triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)
  • A distance is only required to satisfy 1-3.
  • A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
  • A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar

Euclidean distance (metric)

  • Remember grade school:
    Euclidean \(d = \sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\) (toy example below).
  • Side note: also referred to as the \(L_2\) norm
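
A quick toy check of this formula in R (a sketch, with made-up points A and B):

A <- c(x = 0, y = 0)
B <- c(x = 3, y = 4)
sqrt(sum((A - B)^2))  # a 3-4-5 triangle, so the distance is 5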

Euclidean distance in high dimensions

##biocLite("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215   189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

Euclidean distance in high dimensions

  • Points are no longer on the Cartesian plane;
  • instead they are in higher dimensions. For example:
    • sample \(i\) is defined by a point in 22,215-dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\).
    • feature \(g\) is defined by a point in 189-dimensional space: \((Y_{g,1},\dots,Y_{g,189})^\top\)

Euclidian distance in high dimensions

The Euclidean distance is computed just as in two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j})^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

Matrix algebra notation

The distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) denoting columns \(i\) and \(j\) of the data matrix.

Matrix algebra notation

t(matrix(1:3, ncol=1))
##      [,1] [,2] [,3]
## [1,]    1    2    3
matrix(1:3, ncol=1)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
t(matrix(1:3, ncol=1)) %*% matrix(1:3, ncol=1)
##      [,1]
## [1,]   14

Note: R is very efficient at matrix algebra

3 sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2)^2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1)^2))
## [1] 122.8919
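
The same two distances written with the matrix algebra formula from the previous slide (a sketch; crossprod(x) computes \(\mathbf{x}^\top\mathbf{x}\)):

sqrt(crossprod(kidney1 - kidney2))  # same as sqrt(sum((kidney1 - kidney2)^2))
sqrt(crossprod(kidney1 - colon1))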

3 sample example using dist()

dim(e)
## [1] 22215   189
(d <- dist(t(e[, c(1, 2, 87)])))
##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773
class(d)
## [1] "dist"

The dist() function

Excerpt from ?dist:

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
  • method: the distance measure to be used (example below).
    • This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given.
  • The "dist"-class output from dist() is used by many clustering algorithms and heatmap functions

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.
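
For example, the same three samples with a different distance measure (a sketch):

dist(t(e[, c(1, 2, 87)]), method = "manhattan")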

Note on standardization

  • In practice, features (e.g. genes) are typically "standardized", i.e. converted to z-score:

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

  • This is done because the differences in overall levels between features are often not due to biological effects but technical ones, e.g.:
    • GC bias, PCR amplification efficiency, …
  • For z-scored vectors \(x\) and \(y\) of length \(m\), Euclidean distance and \(1-r\) (Pearson correlation) are related (see the sketch below):
    • \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)
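
A minimal numerical check of this identity (a sketch with simulated vectors; zscore() is a hypothetical helper using the population SD, so the identity holds exactly):

set.seed(1)
m <- 1000
a <- rnorm(m)
b <- 0.5 * a + rnorm(m)  # two correlated vectors
zscore <- function(x) (x - mean(x)) / sqrt(mean((x - mean(x))^2))
d2 <- sum((zscore(a) - zscore(b))^2)  # squared Euclidean distance
d2 / (2 * m)    # equals ...
1 - cor(a, b)   # ... 1 minus the Pearson correlation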

Distance Exercises: p. 322-323

Dimension reduction and PCA

Motivation for dimension reduction

Simulate the heights of twin pairs:

library(MASS)
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0,0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1]   2 100
cor(t(y))
##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

Motivation for dimension reduction

Motivation for dimension reduction

  • Little information about height differences is lost when each twin pair is summarized by its average height (see the sketch below)
    • because twin heights are highly correlated
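
A minimal sketch of that idea, reusing the simulated y from the previous slide: rotate each pair into (average, difference) coordinates and compare the spread.

avg <- (y[1, ] + y[2, ]) / 2  # average height of each twin pair
dif <- (y[1, ] - y[2, ]) / 2  # half the within-pair difference
sd(avg)  # captures almost all of the variation
sd(dif)  # small, because twin heights are highly correlated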

Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

  • note: the above formulation is for \(m > n\)

Singular Value Decomposition (SVD)

  • \(\mathbf{Y}\): the m rows × n columns matrix of measurements
  • \(\mathbf{U}\): m × n orthogonal matrix (scores)
    • orthogonal = columns of unit length, mutually "perpendicular" (the 3-D intuition carries over to any dimension)
  • \(\mathbf{D}\): n × n diagonal matrix of singular values (squared, these are the eigenvalues)
  • \(\mathbf{V}\): n × n orthogonal matrix (eigenvectors or loadings)
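
A small numerical check of \(\mathbf{Y} = \mathbf{UDV}^\top\) (a sketch with a random matrix, \(m > n\)):

set.seed(1)
Y <- matrix(rnorm(20), nrow = 5, ncol = 4)  # m = 5 rows, n = 4 columns
s.check <- svd(Y)
all.equal(Y, s.check$u %*% diag(s.check$d) %*% t(s.check$v))  # TRUE, up to rounding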

SVD of gene expression dataset

system.time(e.standardize.slow <- t(apply(e, 1, function(x) x - mean(x) )))
##    user  system elapsed 
##   1.939   0.137   2.118
system.time(e.standardize.fast <- t(scale(t(e), scale=FALSE)))
##    user  system elapsed 
##   0.169   0.015   0.185
## all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])
s <- svd(e.standardize.fast)
names(s)
## [1] "d" "u" "v"

SVD of gene expression dataset

dim(s$u)     # left singular vectors; scores = u %*% diag(d)
## [1] 22215   189
length(s$d)  # singular values
## [1] 189
dim(s$v)     # right singular vectors (loadings)
## [1] 189 189
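
The singular values give the fraction of variance captured by each component (a sketch; compare with the princomp() screeplot later):

variance.explained <- s$d^2 / sum(s$d^2)  # d^2 is proportional to the variance explained
head(round(variance.explained, 3))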

PCA of gene expression dataset

p <- princomp(e.standardize.fast)  # cor=TRUE to scale rows

PCA interpretation: scores

  • \(\mathbf{U}\) (scores): scaled by \(\mathbf{D}\), gives each observation's (gene's) coordinates along the principal component axes
    • think of each principal component axis as a weighted combination of the original axes

PCA interpretation: scores

plot(p$scores[, 1], xlab="Index of genes", ylab="Scores of PC1", 
     main="Scores of PC1") #or, predict(p)
abline(h=c(-35, 25), col="red")

PCA interpretation: scores

Genes with high PC1 scores:

e.pc1genes <- e[which(p$scores[, 1] < -35 | p$scores[, 1] > 25), ]
pheatmap::pheatmap(e.pc1genes, scale="none", show_rownames=FALSE, 
                  show_colnames = FALSE)

PCA interpretation: eigenvalues

  • \(\mathbf{D}\) (eigenvalues): the standard-deviation-like scaling factor applied to each component; squared, these values are proportional to the variance explained.
plot(p$sdev^2 / sum(p$sdev^2)*100, ylab="% variance explained")

PCA interpretation: eigenvalues

Alternatively as cumulative % variance explained (using cumsum() function):
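
A possible version of that plot (a sketch, reusing p from the princomp() call above):

plot(cumsum(p$sdev^2) / sum(p$sdev^2) * 100, xlab = "Principal component",
     ylab = "Cumulative % variance explained")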

PCA interpretation: loadings

  • \(\mathbf{V}\) (loadings): the weights relating each principal component to the original variables (the samples); scaled by \(\mathbf{D}\), they place the samples as "data points" in the reduced principal component space

PCA interpretation: loadings

PCA Lab

  1. Re-create PC plot with per-gene scaling
    1. by explicitly scaling the rows, and
    2. by setting cor=TRUE in the princomp() function
  2. Re-create screeplot using built-in plot function

Multi-dimensional Scaling (MDS)

  • also referred to as Principal Coordinates Analysis (PCoA)
  • a reduced SVD / eigendecomposition, performed on a (transformed) distance matrix
  • identify two (or more) eigenvalues/vectors that preserve distances
d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)
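
One way to look at the result (a sketch): plot the samples in the first two MDS dimensions, colored by the tissue vector loaded with the dataset above.

plot(mds[, 1], mds[, 2], col = factor(tissue),
     xlab = "MDS dimension 1", ylab = "MDS dimension 2")
legend("topleft", legend = levels(factor(tissue)),
       col = seq_along(levels(factor(tissue))), pch = 1, cex = 0.7)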

Multi-dimensional Scaling (MDS)

Conclusions

  • Note: the signs of the eigenvectors (loadings) and scores are arbitrary and may be flipped; the eigenvalues (square the singular values to get variances) are always non-negative
  • PCA and MDS are useful for dimension reduction when you have correlated variables
  • Variables are always centered.
  • Variables are also scaled unless you know they have the same scale in the population
  • PCA projection can be applied to new datasets if you know the matrix calculations (see the sketch below)
  • PCA is subject to over-fitting; the screeplot (choice of number of components) can be assessed by cross-validation
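
A minimal sketch of that projection, reusing the first five genes of e.standardize.fast as stand-in "new" observations (e.new is a hypothetical matrix introduced here):

e.new <- e.standardize.fast[1:5, ]  # pretend these are new observations
new.scores <- sweep(e.new, 2, p$center) %*% p$loadings  # center with the training means, apply loadings
## equivalently: predict(p, newdata = e.new)
new.scores[, 1:3]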

Lab

Links