Levi Waldron
June 23, 2017
Book chapter 7
Source: http://master.bioconductor.org/help/course-materials/2002/Summer02Course/Distance/distance.pdf
A metric satisfies the following five properties:
##biocLite("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215 189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
## cerebellum colon endometrium hippocampus kidney liver
## 38 34 15 31 39 26
## placenta
## 6
Interested in identifying similar samples and similar genes
Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:
\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]
and the distance between two features \(h\) and \(g\) is:
\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]
The distance between samples \(i\) and \(j\) can be written as:
\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]
with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).
t(matrix(1:3, ncol=1))
## [,1] [,2] [,3]
## [1,] 1 2 3
matrix(1:3, ncol=1)
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
t(matrix(1:3, ncol=1)) %*% matrix(1:3, ncol=1)
## [,1]
## [1,] 14
Note: R is very efficient at matrix algebra
kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2)^2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1)^2))
## [1] 122.8919
dim(e)
## [1] 22215 189
(d <- dist(t(e[, c(1, 2, 87)])))
## GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz 85.8546
## GSM92240.CEL.gz 122.8919 115.4773
class(d)
## [1] "dist"
Excerpt from ?dist:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
dist
class output from dist()
is used for many clustering algorithms and heatmap functionsCaution: dist(e)
creates a 22215 x 22215 matrix that will probably crash your R session.
\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]
Distance Exercises: p. 322-323
Simulate the heights of twin pairs:
library(MASS)
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0,0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1] 2 100
cor(t(y))
## [,1] [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000
SVD generalizes the example rotation we looked at:
\[\mathbf{Y} = \mathbf{UDV}^\top\]
Scaling:
system.time(e.standardize.slow <- t(apply(e, 1, function(x) x - mean(x) )))
## user system elapsed
## 1.429 0.115 1.568
system.time(e.standardize.fast <- t(scale(t(e), scale=FALSE)))
## user system elapsed
## 0.116 0.013 0.130
all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])
## [1] TRUE
SVD:
s <- svd(e.standardize.fast)
names(s)
## [1] "d" "u" "v"
dim(s$u) # loadings
## [1] 22215 189
length(s$d) # eigenvalues
## [1] 189
dim(s$v) # d %*% vT = scores
## [1] 189 189
rafalib::mypar()
p <- prcomp( t(e.standardize.fast) )
plot(s$u[, 1] ~ p$rotation[, 1])
Lesson: u and v can each be multiplied by -1 arbitrarily
plot(p$rotation[, 1], xlab="Index of genes", ylab="Loadings of PC1",
main="Scores of PC1") #or, predict(p)
abline(h=c(-0.03, 0.04), col="red")
Genes with high PC1 loadings:
e.pc1genes <- e.standardize.fast[p$rotation[, 1] < -0.03 | p$rotation[, 1] > 0.03, ]
pheatmap::pheatmap(e.pc1genes, scale="none", show_rownames=TRUE,
show_colnames = FALSE)
Alternatively as cumulative % variance explained (using cumsum()
function):
prcomp()
), scores \(\mathbf{D V^T}\)d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)
library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)
head(table(sampleInfo$date))
##
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27
## 2 2 2 5 4 2
year <- as.integer(format(sampleInfo$date,"%y") )
year <- year - min(year)
month = as.integer( format(sampleInfo$date,"%m") ) + 12*year
table(year, sampleInfo$ethnicity)
##
## year ASN CEU HAN
## 0 0 32 0
## 1 0 54 0
## 2 0 13 0
## 3 80 3 0
## 4 2 0 24
pc <- prcomp(t(geneExpression), scale.=TRUE)
boxplot(pc$x[, 1] ~ month, varwidth=TRUE, notch=TRUE,
main="PC1 scores vs. month", xlab="Month", ylab="PC1")
A starting point for a color palette:
RColorBrewer::display.brewer.all(n=3, colorblindFriendly = TRUE)
Interpolate one color per month on a quantitative palette:
col3 <- c(RColorBrewer::brewer.pal(n=3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space="Lab")(length(unique(month)))
d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col=MYcols[as.integer(factor(month))],
main="MDS shaded by batch")
hcclass <- cutree(hclust(d), k=5)
table(hcclass, year)
## year
## hcclass 0 1 2 3 4
## 1 0 2 8 1 0
## 2 25 43 0 2 0
## 3 6 0 0 0 0
## 4 1 7 0 0 0
## 5 0 2 5 80 26
Your project for the course will be to:
run this RNA-seq analysis workflow provided in the R Markdown file on your own computer. Make sure to set the knitr option {cache=TRUE} is set to speed up subsequent compiling of this R Markdown script.
Remove the text provided in this Rmd file, and replace it with your own basic explanations of what is going on. You do not need to explain things that we didn’t cover in class, although you are welcome to do so. For things that we did cover in class, write a short explanation that would be appropriate to someone at your own level before you started the course. You don’t need to write an essay here, the point is for you to identify familiar methods in this workflow and show that you’ve learned something about how to interpret their use in practice.