The importance of distance (cont’d)

Any clustering or classification of samples and/or genes involves combining or identifying objects that are close or similar.
Distances or similarities are mathematical representations of what we mean by close or similar.
The choice of distance is important and requires thought.
- choice is subject-matter specific

Source: http://master.bioconductor.org/help/course-materials/2002/Summer02Course/Distance/distance.pdf

Metrics and distances

A metric satisfies the following five properties:

non-negativity \(d(a, b) \ge 0\)
symmetry \(d(a, b) = d(b, a)\)
identification mark \(d(a, a) = 0\)
definiteness \(d(a, b) = 0\) if and only if \(a=b\)
triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)

A distance is only required to satisfy 1-3.
A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar

Euclidian distance (metric)

Remember grade school: Euclidean d = \(\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\).
Side note: also referred to as \(L_2\) norm

Euclidian distance in high dimensions

##biocLite("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data

## [1] 22215   189

table(tissue) ##tissue[i] corresponds to e[,i]

## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

Euclidian distance in high dimensions

Points are no longer on the Cartesian plane,
instead they are in higher dimensions. For example:
- sample \(i\) is defined by a point in 22,215 dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\).
- feature \(g\) is defined by a point in 189 dimensions \((Y_{g,189},\dots,Y_{g,189})^\top\)

Euclidian distance in high dimensions

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

Matrix algebra notation

The distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).

Matrix algebra notation

t(matrix(1:3, ncol=1))

##      [,1] [,2] [,3]
## [1,]    1    2    3

matrix(1:3, ncol=1)

##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3

t(matrix(1:3, ncol=1)) %*% matrix(1:3, ncol=1)

##      [,1]
## [1,]   14

Note: R is very efficient at matrix algebra

3 sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2)^2))

## [1] 85.8546

sqrt(sum((kidney1 - colon1)^2))

## [1] 122.8919

3 sample example using dist()

dim(e)

## [1] 22215   189

(d <- dist(t(e[, c(1, 2, 87)])))

##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773

class(d)

## [1] "dist"

The dist() function

Excerpt from ?dist:

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

method: the distance measure to be used.
- This must be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”. Any unambiguous substring can be given.
dist class output from dist() is used for many clustering algorithms and heatmap functions

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.

Note on standardization

In practice, features (e.g. genes) are typically “standardized”, i.e. converted to z-score:

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

This is done because the differences in overall levels between features are often not due to biological effects but technical ones, e.g.:
- GC bias, PCR amplification efficiency, …
Euclidian distance and \(1-r\) (Pearson cor) are related:
- \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)

Distance Exercises: p. 322-323

Dimension reduction and PCA

Motivation for dimension reduction

Simulate the heights of twin pairs:

library(MASS)
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0,0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)

## [1]   2 100

cor(t(y))

##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

Motivation for dimension reduction

Not much loss of height differences when just using average heights of twin pairs.
- because twin heights are highly correlated

Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

note: the above formulation is for \(m > n\)

Singular Value Decomposition (SVD)

\(\mathbf{Y}\): the m rows x n cols matrix of measurements
\(\mathbf{U}\): m x n matrix relating original scores to PCA scores (loadings)
\(\mathbf{D}\): n x n diagonal matrix (eigenvalues)
\(\mathbf{V}\): n × n orthogonal matrix (eigenvectors or PCA scores)
- orthogonal = unit length and “perpendicular” in 3-D

SVD of gene expression dataset

Scaling:

system.time(e.standardize.slow <- t(apply(e, 1, function(x) x - mean(x) )))

##    user  system elapsed 
##   1.429   0.115   1.568

system.time(e.standardize.fast <- t(scale(t(e), scale=FALSE)))

##    user  system elapsed 
##   0.116   0.013   0.130

all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])

## [1] TRUE

SVD:

s <- svd(e.standardize.fast)
names(s)

## [1] "d" "u" "v"

SVD of gene expression dataset

dim(s$u)     # loadings

## [1] 22215   189

length(s$d)  # eigenvalues

## [1] 189

dim(s$v)     # d %*% vT = scores

## [1] 189 189

PCA of gene expression dataset

rafalib::mypar()
p <- prcomp( t(e.standardize.fast) )
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: u and v can each be multiplied by -1 arbitrarily

PCA interpretation: loadings

\(\mathbf{U}\) (loadings): relate the principal component axes to the original variables
- think of principal component axes as a weighted combination of original axes

PCA interpretation: loadings

plot(p$rotation[, 1], xlab="Index of genes", ylab="Loadings of PC1", 
     main="Scores of PC1") #or, predict(p)
abline(h=c(-0.03, 0.04), col="red")

PCA interpretation: loadings

Genes with high PC1 loadings:

e.pc1genes <- e.standardize.fast[p$rotation[, 1] < -0.03 | p$rotation[, 1] > 0.03, ]
pheatmap::pheatmap(e.pc1genes, scale="none", show_rownames=TRUE, 
                  show_colnames = FALSE)

PCA interpretation: eigenvalues

\(\mathbf{D}\) (eigenvalues): standard deviation scaling factor that each decomposed variable is multiplied by.

PCA interpretation: eigenvalues

Alternatively as cumulative % variance explained (using cumsum() function):

PCA interpretation: scores

\(\mathbf{V}\) (scores): The “datapoints” in the reduced prinipal component space
In some implementations (like prcomp()), scores \(\mathbf{D V^T}\)

PCA interpretation: scores

Multi-dimensional Scaling (MDS)

also referred to as Principal Coordinates Analysis (PCoA)
a reduced SVD, performed on a distance matrix
identify two (or more) eigenvalues/vectors that preserve distances

d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)

Multi-dimensional Scaling (MDS)

Batch Effects

pervasive in genomics (e.g. Leek et al. )
affect DNA and RNA sequencing, proteomics, imaging, microarray…
have caused high-profile problems and retractions
- you can’t get rid of them
- but you can make sure they are not confounded with your experiment

Batch Effects - an example

Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
Title: Common genetic variants account for differences in gene expression among ethnic groups.
- “The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes.”

library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)

Batch Effects - an example

Sample metadata included date of processing:

head(table(sampleInfo$date))

## 
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27 
##          2          2          2          5          4          2

Batch Effects - an example

year <-  as.integer(format(sampleInfo$date,"%y") )
year <- year - min(year)
month = as.integer( format(sampleInfo$date,"%m") ) + 12*year
table(year, sampleInfo$ethnicity)

##     
## year ASN CEU HAN
##    0   0  32   0
##    1   0  54   0
##    2   0  13   0
##    3  80   3   0
##    4   2   0  24

Batch Effects: PCA

pc <- prcomp(t(geneExpression), scale.=TRUE)
boxplot(pc$x[, 1] ~ month, varwidth=TRUE, notch=TRUE,
        main="PC1 scores vs. month", xlab="Month", ylab="PC1")

Batch Effects: MDS

A starting point for a color palette:

RColorBrewer::display.brewer.all(n=3, colorblindFriendly = TRUE)

Interpolate one color per month on a quantitative palette:

col3 <- c(RColorBrewer::brewer.pal(n=3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space="Lab")(length(unique(month)))

Batch Effects: MDS

d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col=MYcols[as.integer(factor(month))],
     main="MDS shaded by batch")

Batch Effects: clustering

hcclass <- cutree(hclust(d), k=5)
table(hcclass, year)

##        year
## hcclass  0  1  2  3  4
##       1  0  2  8  1  0
##       2 25 43  0  2  0
##       3  6  0  0  0  0
##       4  1  7  0  0  0
##       5  0  2  5 80 26

Batch Effects - summary

tend to affect many or all measurements by a little bit
during experimental design:
- keep track of anything that might cause a batch effect for post-hoc analysis
- include control samples in each batch
batches can be corrected for if randomized in study design
- if confounded with treatment or outcome of interest, nothing can help you

Conclusions

Note: signs of eigenvalues (square to get variances) and eigenvectors (loadings) can be arbitrarily flipped
PCA and MDS are useful for dimension reduction when you have correlated variables
Variables are always centered.
Variables are also scaled unless you know they have the same scale in the population
PCA projection can be applied to new datasets if you know the matrix calculations
PCA is subject to over-fitting, screeplot can be tested by cross-validation

Final Project

Your project for the course will be to:

run this RNA-seq analysis workflow provided in the R Markdown file on your own computer. Make sure to set the knitr option {cache=TRUE} is set to speed up subsequent compiling of this R Markdown script.
Remove the text provided in this Rmd file, and replace it with your own basic explanations of what is going on. You do not need to explain things that we didn’t cover in class, although you are welcome to do so. For things that we did cover in class, write a short explanation that would be appropriate to someone at your own level before you started the course. You don’t need to write an essay here, the point is for you to identify familiar methods in this workflow and show that you’ve learned something about how to interpret their use in practice.

Applied Statistics for High-throughput Biology: Day 4

Day 4 outline

Distances in high-dimensional data analysis

The importance of distance

The importance of distance (cont’d)

The importance of distance (cont’d)

Metrics and distances

Euclidian distance (metric)

Euclidian distance in high dimensions

Euclidian distance in high dimensions

Euclidian distance in high dimensions

Matrix algebra notation

Matrix algebra notation

3 sample example

3 sample example using dist()

The dist() function

Note on standardization

Dimension reduction and PCA

Motivation for dimension reduction

Motivation for dimension reduction

Motivation for dimension reduction

Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD)

SVD of gene expression dataset

SVD of gene expression dataset

PCA of gene expression dataset

PCA interpretation: loadings

PCA interpretation: loadings

PCA interpretation: loadings

PCA interpretation: eigenvalues

PCA interpretation: eigenvalues

PCA interpretation: scores

PCA interpretation: scores

Multi-dimensional Scaling (MDS)

Multi-dimensional Scaling (MDS)

Batch Effects

Batch Effects

Batch Effects - an example

Batch Effects - an example

Batch Effects - an example

Batch Effects: PCA

Batch Effects: MDS

Batch Effects: MDS

Batch Effects: clustering

Batch Effects - summary

Conclusions

Final Project

Links