Applied Statistics for High-throughput Biology Day 4

Levi Waldron

Day 4 outline

Book chapter 8:

Metrics and distances

A metric satisfies the following five properties:

  1. non-negativity \(d(a, b) \ge 0\)
  2. symmetry \(d(a, b) = d(b, a)\)
  3. identification mark \(d(a, a) = 0\)
  4. definiteness \(d(a, b) = 0\) if and only if \(a=b\)
  5. triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)

Euclidian distance (metric)

Euclidian distance in high dimensions

## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'tissuesGeneExpression'
## * checking for file ‘/tmp/RtmpP0WEXN/remotes12787a4ea081c0/genomicsclass-tissuesGeneExpression-a43cf4b/DESCRIPTION’ ... OK
## * preparing ‘tissuesGeneExpression’:
## * checking DESCRIPTION meta-information ... OK
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * looking to see if a ‘data/datalist’ file should be added
## * creating default NAMESPACE file
## * building ‘tissuesGeneExpression_1.0.tar.gz’
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'GSE5859'
## * checking for file ‘/tmp/RtmpP0WEXN/remotes12787a282882c/genomicsclass-GSE5859-3d8cdd7/DESCRIPTION’ ... OK
## * preparing ‘GSE5859’:
## * checking DESCRIPTION meta-information ... OK
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * looking to see if a ‘data/datalist’ file should be added
## * creating default NAMESPACE file
## * building ‘GSE5859_1.0.tar.gz’
## BiocManager::install("genomicsclass/tissuesGeneExpression") #if needed
## BiocManager::install("genomicsclass/GSE5859") #if needed
library(GSE5859)
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215   189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

Notes about Euclidian distance in high dimensions

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

Euclidian distance in matrix algebra notation

The Euclidian distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).

t(matrix(1:3, ncol = 1))
##      [,1] [,2] [,3]
## [1,]    1    2    3
matrix(1:3, ncol = 1)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
t(matrix(1:3, ncol = 1)) %*% matrix(1:3, ncol = 1)
##      [,1]
## [1,]   14

Note about matrix algebra in R

3 sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2) ^ 2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1) ^ 2))
## [1] 122.8919

3 sample example using dist()

dim(e)
## [1] 22215   189
(d <- dist(t(e[, c(1, 2, 87)])))
##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773
class(d)
## [1] "dist"

The dist() function

Excerpt from ?dist:

dist(
    x,
    method = "euclidean",
    diag = FALSE,
    upper = FALSE,
    p = 2
)

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.

Note on standardization

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

Dimension reduction and PCA

Simulate the heights of twin pairs:

set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1]   2 100
cor(t(y))
##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

Visualizing twin pairs data

Not much distance is lost in the second dimension

Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

SVD

SVD of gene expression dataset

Scaling:

system.time(e.standardize.slow <-
                t(apply(e, 1, function(x)
                    x - mean(x))))
##    user  system elapsed 
##   0.578   0.080   0.657
system.time(e.standardize.fast <- t(scale(t(e), scale = FALSE)))
##    user  system elapsed 
##   0.131   0.041   0.171
all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])
## [1] TRUE

SVD:

s <- svd(e.standardize.fast)
names(s)
## [1] "d" "u" "v"

Components of SVD results

dim(s$u)     # loadings
## [1] 22215   189
length(s$d)  # eigenvalues
## [1] 189
dim(s$v)     # d %*% vT = scores
## [1] 189 189
SVD

PCA is a SVD

rafalib::mypar()
p <- prcomp(t(e.standardize.fast))
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: u and v can each be multiplied by -1 arbitrarily

PCA interpretation: loadings

SVD

Visualizing PCA loadings

plot(p$rotation[, 1],
     xlab = "Index of genes",
     ylab = "Loadings of PC1",
     main = "PC1 loadings of each gene") #or, predict(p)
abline(h = c(-0.03, 0.03), col = "red")

Genes with high PC1 loadings

e.pc1genes <-
    e.standardize.fast[p$rotation[, 1] < -0.03 |
                           p$rotation[, 1] > 0.03,]
pheatmap::pheatmap(
    e.pc1genes,
    scale = "none",
    show_rownames = TRUE,
    show_colnames = FALSE
)

PCA interpretation: eigenvalues

rafalib::mypar()
plot(
    p$sdev ^ 2 / sum(p$sdev ^ 2) * 100,
    xlim = c(0, 150),
    type = "b",
    ylab = "% variance explained",
    main = "Screeplot"
)

PCA interpretation: eigenvalues

Alternatively as cumulative % variance explained (using cumsum() function)

rafalib::mypar()
plot(
    cumsum(p$sdev ^ 2) / sum(p$sdev ^ 2) * 100,
    ylab = "cumulative % variance explained",
    ylim = c(0, 100),
    type = "b",
    main = "Cumulative screeplot"
)

PCA interpretation: scores

SVD

PCA interpretation: scores

Multi-dimensional Scaling (MDS)

d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)

Summary: distances and dimension reduction

PCA and MDS Exercises

  1. Repeat the PCA after removing half of the features with lowest variance. Does it look any different?
  2. Repeat the MDS, using Spearman correlation instead of Pearson correlation. Does it look any different?

Batch Effects

Batch Effects - an example

library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)

Date of processing as a proxy for batch

head(table(sampleInfo$date))
## 
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27 
##          2          2          2          5          4          2
year <-  as.integer(format(sampleInfo$date, "%y"))
year <- year - min(year)
month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
table(year, sampleInfo$ethnicity)
##     
## year ASN CEU HAN
##    0   0  32   0
##    1   0  54   0
##    2   0  13   0
##    3  80   3   0
##    4   2   0  24

Visualizing batch effects by PCA

pc <- prcomp(t(geneExpression), scale. = TRUE)
boxplot(
    pc$x[, 1] ~ month,
    varwidth = TRUE,
    notch = TRUE,
    main = "PC1 scores vs. month",
    xlab = "Month",
    ylab = "PC1"
)

Visualizing batch effects by MDS

A starting point for a color palette:

RColorBrewer::display.brewer.all(n = 3, colorblindFriendly = TRUE)

Interpolate one color per month on a quantitative palette:

col3 <- c(RColorBrewer::brewer.pal(n = 3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space = "Lab")(length(unique(month)))
d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col = MYcols[as.integer(factor(month))],
     main = "MDS shaded by month")

The batch effects impact clustering

hcclass <- cutree(hclust(d), k = 5)
table(hcclass, year)
##        year
## hcclass  0  1  2  3  4
##       1  0  2  8  1  0
##       2 25 43  0  2  0
##       3  6  0  0  0  0
##       4  1  7  0  0  0
##       5  0  2  5 80 26

Batch Effects - summary