Levi Waldron
Book chapter 8:
A metric satisfies the following five properties:
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'tissuesGeneExpression'
## * checking for file ‘/tmp/RtmpP0WEXN/remotes12787a4ea081c0/genomicsclass-tissuesGeneExpression-a43cf4b/DESCRIPTION’ ... OK
## * preparing ‘tissuesGeneExpression’:
## * checking DESCRIPTION meta-information ... OK
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * looking to see if a ‘data/datalist’ file should be added
## * creating default NAMESPACE file
## * building ‘tissuesGeneExpression_1.0.tar.gz’
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'GSE5859'
## * checking for file ‘/tmp/RtmpP0WEXN/remotes12787a282882c/genomicsclass-GSE5859-3d8cdd7/DESCRIPTION’ ... OK
## * preparing ‘GSE5859’:
## * checking DESCRIPTION meta-information ... OK
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * looking to see if a ‘data/datalist’ file should be added
## * creating default NAMESPACE file
## * building ‘GSE5859_1.0.tar.gz’
## BiocManager::install("genomicsclass/tissuesGeneExpression") #if needed
## BiocManager::install("genomicsclass/GSE5859") #if needed
library(GSE5859)
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215 189
## tissue
## cerebellum colon endometrium hippocampus kidney liver
## 38 34 15 31 39 26
## placenta
## 6
Interested in identifying similar samples and similar genes
Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:
\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]
and the distance between two features \(h\) and \(g\) is:
\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]
The Euclidian distance between samples \(i\) and \(j\) can be written as:
\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]
with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).
## [,1] [,2] [,3]
## [1,] 1 2 3
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [,1]
## [1,] 14
## [1] 85.8546
## [1] 122.8919
## [1] 22215 189
## GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz 85.8546
## GSM92240.CEL.gz 122.8919 115.4773
## [1] "dist"
Excerpt from ?dist:
dist
class output from dist()
is used for
many clustering algorithms and heatmap functionsCaution: dist(e)
creates a 22215 x 22215 matrix
that will probably crash your R session.
\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]
Simulate the heights of twin pairs:
## [1] 2 100
## [,1] [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000
SVD generalizes the example rotation we looked at:
\[\mathbf{Y} = \mathbf{UDV}^\top\]
note: the above formulation is for \(m\) rows \(> n\) columns
\(\mathbf{Y}\): the \(m\) rows x \(n\) cols matrix of measurements
\(\mathbf{U}\): \(m \times n\) matrix relating original scores to PCA scores (loadings)
\(\mathbf{D}\): \(n \times n\) diagonal matrix (eigenvalues)
\(\mathbf{V}\): \(n \times n\) orthogonal matrix (eigenvectors or PCA scores)
Scaling:
## user system elapsed
## 0.578 0.080 0.657
## user system elapsed
## 0.131 0.041 0.171
## [1] TRUE
SVD:
## [1] "d" "u" "v"
## [1] 22215 189
## [1] 189
## [1] 189 189
Lesson: u and v can each be multiplied by -1 arbitrarily
plot(p$rotation[, 1],
xlab = "Index of genes",
ylab = "Loadings of PC1",
main = "PC1 loadings of each gene") #or, predict(p)
abline(h = c(-0.03, 0.03), col = "red")
e.pc1genes <-
e.standardize.fast[p$rotation[, 1] < -0.03 |
p$rotation[, 1] > 0.03,]
pheatmap::pheatmap(
e.pc1genes,
scale = "none",
show_rownames = TRUE,
show_colnames = FALSE
)
rafalib::mypar()
plot(
p$sdev ^ 2 / sum(p$sdev ^ 2) * 100,
xlim = c(0, 150),
type = "b",
ylab = "% variance explained",
main = "Screeplot"
)
Alternatively as cumulative % variance explained (using
cumsum()
function)
rafalib::mypar()
plot(
cumsum(p$sdev ^ 2) / sum(p$sdev ^ 2) * 100,
ylab = "cumulative % variance explained",
ylim = c(0, 100),
type = "b",
main = "Cumulative screeplot"
)
prcomp()
), scores are
already scaled by eigenvalues: \(\mathbf{D
V^T}\)library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)
ExpressionSet
object is obsolete, we use
SummarizedExperiment
now##
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27
## 2 2 2 5 4 2
year <- as.integer(format(sampleInfo$date, "%y"))
year <- year - min(year)
month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
table(year, sampleInfo$ethnicity)
##
## year ASN CEU HAN
## 0 0 32 0
## 1 0 54 0
## 2 0 13 0
## 3 80 3 0
## 4 2 0 24
pc <- prcomp(t(geneExpression), scale. = TRUE)
boxplot(
pc$x[, 1] ~ month,
varwidth = TRUE,
notch = TRUE,
main = "PC1 scores vs. month",
xlab = "Month",
ylab = "PC1"
)
A starting point for a color palette:
Interpolate one color per month on a quantitative palette:
col3 <- c(RColorBrewer::brewer.pal(n = 3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space = "Lab")(length(unique(month)))
d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col = MYcols[as.integer(factor(month))],
main = "MDS shaded by month")
## year
## hcclass 0 1 2 3 4
## 1 0 2 8 1 0
## 2 25 43 0 2 0
## 3 6 0 0 0 0
## 4 1 7 0 0 0
## 5 0 2 5 80 26