Applied Statistics for High-throughput Biology: Session 6

March 17, 2016

Session 6 outline

Clustering
- Book chapter 8 "Basic Machine Learning", Clustering section
Batch Effects
- Book chapter 9 "Batch Effects"

Review: distances

Recall that metrics have strict requirements
Distances have fewer requirements
Similarity and Dissimilarity have fewer yet:
1. non-negativity \(d(a, b) \ge 0\)
2. symmetry \(d(a, b) = d(b, a)\)
3. Increase or decrease for more "similar" vectors, by whatever definition

Tissue gene expression dataset

Recall the tissue gene expression dataset:

##biocLite("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data

## [1] 22215   189

table(tissue) ##tissue[i] corresponds to e[,i]

## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Sample pairwise Euclidian distance

d.euclidian <- dist( t(e) )

Sample pairwise 1-Pearson Correlation

d.pearson <- as.dist( 1 - cor(e) )

Hierarchical clustering

Clustering algorithms use dissimilarity between observations to form similar groups
Hierarchical clustering is one of many such algorithms
Agglomerative (bottom-up) - what we will use:
- Start with each point in own group
- Repeatedly merge the two groups with smallest dissimilarity, until one cluster remains
Divisive (top-down):
- Start with all points in a single group
- Repeatedly split the group into two resulting in the biggest dissimilarity, until each point is in own group
?hclust includes detailed information

Hierarchical clustering

** Using default hclust() settings ("complete" linkage)

Hierarchical clustering

** "single" linkage (uncommon)

Hierarchical clustering

Note that many equivalent re-orderings exist:

See ?reorder.dendrogram

Hierarchical clustering on tissues data

hclust() default, Euclidian distance

Hierarchical clustering on tissues data

hclust() default, 1 - Pearson Correlation distance

Generating actual clusters

Hierarchical clustering produces a dendrogram
To get clusters, you must "cut" the tree

Generating actual clusters

Using the above cutoff, how do clusters overlap with the actual tissues?

hclusters <- cutree(hc.pearson, h=0.125)
table(true=tissue, cluster=hclusters)

##              cluster
## true           1  2  3  4  5  6  7  8  9 10 11 12 13
##   cerebellum   0  0  0  0 31  0  0  2  0  0  5  0  0
##   colon        0  0  0  0  0 34  0  0  0  0  0  0  0
##   endometrium  0  0  0  0  0  0  0  0  0 15  0  0  0
##   hippocampus  0  0 29  2  0  0  0  0  0  0  0  0  0
##   kidney      19 18  0  0  0  0  0  2  0  0  0  0  0
##   liver        0  0  0  0  0  0 24  0  2  0  0  0  0
##   placenta     0  0  0  0  0  0  0  0  0  0  0  2  4

Generating actual clusters

Alternatively, specifying the number of clusters:

hclusters <- cutree(hc.pearson, k=length(unique(tissue)) + 1 )
table(true=tissue, cluster=hclusters)

##              cluster
## true           1  2  3  4  5  6  7  8
##   cerebellum   0  0 36  0  0  0  2  0
##   colon        0  0  0  0 34  0  0  0
##   endometrium 15  0  0  0  0  0  0  0
##   hippocampus  0  0 29  2  0  0  0  0
##   kidney      19 18  0  0  0  0  2  0
##   liver        0  0  0  0  0 24  2  0
##   placenta     0  0  0  0  0  0  0  6

K-means clustering

K-means clustering iteratively finds K clusters to minimize total within-cluster distance
- often (but not necessarily) using Euclidian distance

set.seed(1); km <- kmeans(t(e), centers=8)
table(Kmeans=km$cluster, Hclust=hclusters.euclidian)

##       Hclust
## Kmeans  1  2  3  4  5  6  7  8
##      1  0  0  0 34  0  0  0  0
##      2 37  0  0  0  0  0  0  0
##      3  0  0  0  0  0  0  5  0
##      4  0 12 19  0  0  0  0  0
##      5  0  0  0  0  0  6  0  0
##      6  0  0 31  0  0  0  0  0
##      7  0  0  0  0 24  0  0  0
##      8 15  0  0  0  0  0  0  6

How many clusters?

Common approaches for estimating the number of clusters are:
- Prediction Strength, Silhouette Width, Gap Statistic
- See library(cluster)
- A demonstration of all of these, combined with different dissimilarity measures for microbiome communities, in Koren et al. [1]

[1] Koren, O. et al. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Comput. Biol. 9, e1002863 (2013).

Recall the good old clustered heatmap

Exercises: Look at ?pheatmap::pheatmap, try example("pheatmap::pheatmap")

Summary: clustering

There are many algorithms, e.g.:
- Partitioning around Medoids (PAM, see ?cluster::pam, ?pamr::pamr.cv)
- Non-negative Matrix Factorization (NMF, see ?NMF::nmf)
Remember that the choice of distance function is a part of the clustering algorithm
Unsupervised clustering provides a quantitative approach to a fundamentally qualitative problem

Batch Effects

pervasive in genomics (e.g. Leek et al. )
affect DNA and RNA sequencing, proteomics, imaging, microarray…
have caused high-profile problems and retractions
- you can't get rid of them
- but you can make sure they are not confounded with your experiment

Batch Effects - an example

Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
Title: Common genetic variants account for differences in gene expression among ethnic groups.
- "The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes."

library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)

Batch Effects - an example

Sample metadata included date of processing:

head(table(sampleInfo$date))

## 
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27 
##          2          2          2          5          4          2

Batch Effects - an example

year <-  as.integer(format(sampleInfo$date,"%y") )
year <- year - min(year)
month = as.integer( format(sampleInfo$date,"%m") ) + 12*year
table(year, sampleInfo$ethnicity)

##     
## year ASN CEU HAN
##    0   0  32   0
##    1   0  54   0
##    2   0  13   0
##    3  80   3   0
##    4   2   0  24

Batch Effects: PCA

pc <- princomp(geneExpression, cor=TRUE)
boxplot(pc$loadings[, 2] ~ month, varwidth=TRUE, notch=TRUE,
        main="PC2 vs. month", xlab="Month", ylab="PC2")

Batch Effects: MDS

A starting point for a color palette:

RColorBrewer::display.brewer.all(n=3, colorblindFriendly = TRUE)

Interpolate one color per month on a quantitative palette:

col3 <- c(RColorBrewer::brewer.pal(n=3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space="Lab")(length(unique(month)))

Batch Effects: MDS

d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col=MYcols[as.integer(factor(month))],
     main="MDS shaded by batch")

Batch Effects: clustering

hcclass <- cutree(hclust(d), k=5)
table(hcclass, year)

##        year
## hcclass  0  1  2  3  4
##       1  0  2  8  1  0
##       2 25 43  0  2  0
##       3  6  0  0  0  0
##       4  1  7  0  0  0
##       5  0  2  5 80 26

Batch Effects - summary

tend to affect many or all measurements by a little bit
During experimental design:
- keep track of anything that might cause a batch effect for post-hoc analysis
- include control samples in each batch
batches can be corrected for if randomized in study design
- if confounded with treatment or outcome of interest, nothing can help you

Lab

Clustering and heatmaps exercises
Batches in your own dataset
- Created an MDS plot colored by potential batch variables (for TCGA you have "plate" and "batch_number" variables)
- Perform unsupervised clustering and produce a frequency table for batch vs. cluster
Review the clustering methods used in your TCGA publication method, post to class Google Group

Session 6 outline

Review: distances

Tissue gene expression dataset

Sample pairwise Euclidian distance

Sample pairwise 1-Pearson Correlation

Hierarchical clustering

Hierarchical clustering

Hierarchical clustering

Hierarchical clustering

Hierarchical clustering

Hierarchical clustering on tissues data

Hierarchical clustering on tissues data

Generating actual clusters

Generating actual clusters

Generating actual clusters

K-means clustering

K-means clustering

How many clusters?

Recall the good old clustered heatmap

Summary: clustering

Batch Effects

Batch Effects

Batch Effects - an example

Batch Effects - an example

Batch Effects - an example

Batch Effects: PCA

Batch Effects: MDS

Batch Effects: MDS

Batch Effects: clustering

Batch Effects - summary

Lab

Links