March 17, 2016

Session 6 outline

  • Clustering
    • Book chapter 8 "Basic Machine Learning", Clustering section
  • Batch Effects
    • Book chapter 9 "Batch Effects"

Review: distances

  • Recall that metrics have strict requirements
  • Distances have fewer requirements
  • Similarity and Dissimilarity have fewer yet:

    1. non-negativity \(d(a, b) \ge 0\)
    2. symmetry \(d(a, b) = d(b, a)\)
    3. Increase or decrease for more "similar" vectors, by whatever definition

Tissue gene expression dataset

Recall the tissue gene expression dataset:

##biocLite("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215   189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Sample pairwise Euclidian distance

d.euclidian <- dist( t(e) )

Sample pairwise 1-Pearson Correlation

d.pearson <- as.dist( 1 - cor(e) )

Hierarchical clustering

Hierarchical clustering

  • Clustering algorithms use dissimilarity between observations to form similar groups
  • Hierarchical clustering is one of many such algorithms
  • Agglomerative (bottom-up) - what we will use:
    • Start with each point in own group
    • Repeatedly merge the two groups with smallest dissimilarity, until one cluster remains
  • Divisive (top-down):
    • Start with all points in a single group
    • Repeatedly split the group into two resulting in the biggest dissimilarity, until each point is in own group
  • ?hclust includes detailed information

Hierarchical clustering

** Using default hclust() settings ("complete" linkage)

Hierarchical clustering

** "single" linkage (uncommon)

Hierarchical clustering

Note that many equivalent re-orderings exist:

  • See ?reorder.dendrogram

Hierarchical clustering on tissues data

  • hclust() default, Euclidian distance

Hierarchical clustering on tissues data

  • hclust() default, 1 - Pearson Correlation distance

Generating actual clusters

  • Hierarchical clustering produces a dendrogram
  • To get clusters, you must "cut" the tree

Generating actual clusters

Using the above cutoff, how do clusters overlap with the actual tissues?

hclusters <- cutree(hc.pearson, h=0.125)
table(true=tissue, cluster=hclusters)
##              cluster
## true           1  2  3  4  5  6  7  8  9 10 11 12 13
##   cerebellum   0  0  0  0 31  0  0  2  0  0  5  0  0
##   colon        0  0  0  0  0 34  0  0  0  0  0  0  0
##   endometrium  0  0  0  0  0  0  0  0  0 15  0  0  0
##   hippocampus  0  0 29  2  0  0  0  0  0  0  0  0  0
##   kidney      19 18  0  0  0  0  0  2  0  0  0  0  0
##   liver        0  0  0  0  0  0 24  0  2  0  0  0  0
##   placenta     0  0  0  0  0  0  0  0  0  0  0  2  4

Generating actual clusters

Alternatively, specifying the number of clusters:

hclusters <- cutree(hc.pearson, k=length(unique(tissue)) + 1 )
table(true=tissue, cluster=hclusters)
##              cluster
## true           1  2  3  4  5  6  7  8
##   cerebellum   0  0 36  0  0  0  2  0
##   colon        0  0  0  0 34  0  0  0
##   endometrium 15  0  0  0  0  0  0  0
##   hippocampus  0  0 29  2  0  0  0  0
##   kidney      19 18  0  0  0  0  2  0
##   liver        0  0  0  0  0 24  2  0
##   placenta     0  0  0  0  0  0  0  6

K-means clustering

K-means clustering

  • K-means clustering iteratively finds K clusters to minimize total within-cluster distance
    • often (but not necessarily) using Euclidian distance
set.seed(1); km <- kmeans(t(e), centers=8)
table(Kmeans=km$cluster, Hclust=hclusters.euclidian)
##       Hclust
## Kmeans  1  2  3  4  5  6  7  8
##      1  0  0  0 34  0  0  0  0
##      2 37  0  0  0  0  0  0  0
##      3  0  0  0  0  0  0  5  0
##      4  0 12 19  0  0  0  0  0
##      5  0  0  0  0  0  6  0  0
##      6  0  0 31  0  0  0  0  0
##      7  0  0  0  0 24  0  0  0
##      8 15  0  0  0  0  0  0  6

How many clusters?

Recall the good old clustered heatmap

Exercises: Look at ?pheatmap::pheatmap, try example("pheatmap::pheatmap")

Summary: clustering

  • There are many algorithms, e.g.:
    • Partitioning around Medoids (PAM, see ?cluster::pam, ?pamr::pamr.cv)
    • Non-negative Matrix Factorization (NMF, see ?NMF::nmf)
  • Remember that the choice of distance function is a part of the clustering algorithm
  • Unsupervised clustering provides a quantitative approach to a fundamentally qualitative problem

Batch Effects

Batch Effects

  • pervasive in genomics (e.g. Leek et al. )
  • affect DNA and RNA sequencing, proteomics, imaging, microarray…
  • have caused high-profile problems and retractions
    • you can't get rid of them
    • but you can make sure they are not confounded with your experiment

Batch Effects - an example

  • Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
  • Title: Common genetic variants account for differences in gene expression among ethnic groups.
    • "The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes."
library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)

Batch Effects - an example

  • Sample metadata included date of processing:
head(table(sampleInfo$date))
## 
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27 
##          2          2          2          5          4          2

Batch Effects - an example

year <-  as.integer(format(sampleInfo$date,"%y") )
year <- year - min(year)
month = as.integer( format(sampleInfo$date,"%m") ) + 12*year
table(year, sampleInfo$ethnicity)
##     
## year ASN CEU HAN
##    0   0  32   0
##    1   0  54   0
##    2   0  13   0
##    3  80   3   0
##    4   2   0  24

Batch Effects: PCA

pc <- princomp(geneExpression, cor=TRUE)
boxplot(pc$loadings[, 2] ~ month, varwidth=TRUE, notch=TRUE,
        main="PC2 vs. month", xlab="Month", ylab="PC2")

Batch Effects: MDS

A starting point for a color palette:

RColorBrewer::display.brewer.all(n=3, colorblindFriendly = TRUE)

Interpolate one color per month on a quantitative palette:

col3 <- c(RColorBrewer::brewer.pal(n=3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space="Lab")(length(unique(month)))

Batch Effects: MDS

d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col=MYcols[as.integer(factor(month))],
     main="MDS shaded by batch")

Batch Effects: clustering

hcclass <- cutree(hclust(d), k=5)
table(hcclass, year)
##        year
## hcclass  0  1  2  3  4
##       1  0  2  8  1  0
##       2 25 43  0  2  0
##       3  6  0  0  0  0
##       4  1  7  0  0  0
##       5  0  2  5 80 26

Batch Effects - summary

  • tend to affect many or all measurements by a little bit
  • During experimental design:
    • keep track of anything that might cause a batch effect for post-hoc analysis
    • include control samples in each batch
  • batches can be corrected for if randomized in study design
    • if confounded with treatment or outcome of interest, nothing can help you

Lab

  • Clustering and heatmaps exercises
  • Batches in your own dataset
    • Created an MDS plot colored by potential batch variables (for TCGA you have "plate" and "batch_number" variables)
    • Perform unsupervised clustering and produce a frequency table for batch vs. cluster
  • Review the clustering methods used in your TCGA publication method, post to class Google Group

Links