Hierarchical Cluster Analysis

Libraries

if(!require("ComplexHeatmap")){BiocManager::install("ComplexHeatmap")}

## 载入需要的程辑包：ComplexHeatmap

## 载入需要的程辑包：grid

## ========================================
## ComplexHeatmap version 2.9.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##   genomic data. Bioinformatics 2016.
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================

if(!require("matlab")){BiocManager::install("matlab")}

## 载入需要的程辑包：matlab

## 
## 载入程辑包：'matlab'

## The following object is masked from 'package:stats':
## 
##     reshape

## The following objects are masked from 'package:utils':
## 
##     find, fix

## The following object is masked from 'package:base':
## 
##     sum

library(ComplexHeatmap)
library(matlab)   # this library let's us use blue-red color spectrum for heatmap  (jet.colors)
library(matrixStats)

if(!require("dendextend")) install.packages("dendextend")

## 载入需要的程辑包：dendextend

## 
## ---------------------
## Welcome to dendextend version 1.15.1
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------

## 
## 载入程辑包：'dendextend'

## The following object is masked from 'package:stats':
## 
##     cutree

library(dendextend)

Load Prostate Cancer Data

load("data/JBC 2012/jbcdat.rda")

Dendrograms

I’m going to cluster the 100 most variable genes, because clustering the genes will show greater differences between the different cluster methods than clustering samples. 我要把100个最易变的基因聚在一起，因为聚在一起的基因会比聚在一起的样本显示出更大的差异。

rowscale <- function(x) {
      (x - rowMeans(x))/matrixStats::rowMads(x)
}

fmad <- matrixStats::rowMads(jbcdat$E)
rfilt <- rank(-fmad)
fidx <- which( rfilt <= 100)

X <- rowscale(jbcdat$E)[fidx,]
fd <- as.dist(1-cor(t(X)))

Average Linkage:

hc <- hclust(fd, method = "average") 
plot(hc,xlab=" ",main=" ",sub=" ",labels=FALSE)

Single Linkage:

hc <- hclust(fd, method = "single") 
plot(hc,xlab=" ",main=" ",sub=" ",labels=FALSE)

Complete Linkage:

hc <- hclust(fd, method = "complete") 
plot(hc,xlab=" ",main=" ",sub=" ",labels=FALSE)

Ward’s ESS:

hc <- hclust(fd, method = "ward.D2") 
plot(hc,xlab=" ",main=" ",sub=" ",labels=FALSE)

Complex Heatmap

There is a very flexible function in R for creating complex Heatmaps.

https://jokergoo.github.io/ComplexHeatmap-reference/book/introduction.html#general-design

?Heatmap

To cluster both genes and samples, we will standardize the genes first, and then the samples. Then I’ll use 1-correlation as pairwise dissimilarity and Ward’s distance for the clustering criterion. 为了将基因和样本同时聚类，我们将先标准化基因，然后标准化样本。然后我将使用1-correlation作为两两不相似度，Ward’s distance作为聚类标准。

fmad <- matrixStats::rowMads(jbcdat$E)
rfilt <- rank(-fmad)
fidx <- which( rfilt <= 500)
X <- rowscale(jbcdat$E)[fidx,]
scX <- scale(X,center=TRUE,scale=matrixStats::colMads(X))

First, I will create the color bars for annotating the samples.

jbcdat$targets$treatment <- factor(jbcdat$targets$treatment,
                                   levels=c("siNS","siCBP","sip300"))

# column heatmap annotation
colha <- ComplexHeatmap::HeatmapAnnotation(df = 
                       jbcdat$targets[,c("treatment","hour")],
                col = list(treatment = c(siNS = "pink", 
                                         siCBP = "purple",
                                         sip300 = "orange"),
                                 hour = c('0hr' = "grey",
                                        '16hr' = "lightgreen")
                          ), 
                which = "column")

Now call the heatmap.

500 features, 1-Pearson dissimilarity

ht <- ComplexHeatmap::Heatmap(scX, column_title = "Samples",
              row_title = "Features", 
              name = "log2(Expr)",
              clustering_distance_rows = "pearson",
              clustering_method_rows = "ward.D2",
              clustering_distance_columns = "pearson",
              clustering_method_columns = "ward.D2",
              col = jet.colors(32), 
              top_annotation = colha,
              show_column_names = FALSE,
              show_row_names = FALSE)

draw(ht)

In the above figure, the clustering is performed on the (gene- then sample- standardized) log2 expression data. Our sample labels show that the replicate samples cluster well. It also shows that the p300 knock-down cells are most different from the rest at both timepoints.

Comment: Since we standardized the columns, we get a very similar result using euclidean distance. They’d be identical if we’d standardized using standard deviation instead of median absolute deviation.

Let’s investigate using Euclidean distance for the (unstandardized) samples and changing the number of features. 在上图中，对(基因-然后样本-标准化)log2表达数据进行聚类。我们的样本标签显示，重复的样本很好地聚类。研究还表明，在这两个时间点，p300敲除细胞与其他细胞的差异最大。

注释:由于我们标准化了列，我们得到一个非常相似的结果使用欧几里得距离。如果我们用标准差而不是中位数绝对偏差标准化它们是一样的。

让我们研究使用欧几里得距离(非标准化)样本和改变特征的数量。 500 features, sample Euclidean dissimilarity

fmad <- matrixStats::rowMads(jbcdat$E)
rfilt <- rank(-fmad)
fidx <- which( rfilt <= 500)
X <- rowscale(jbcdat$E)[fidx,]

ht <- ComplexHeatmap::Heatmap(X, column_title = "Samples",
              row_title = "Features", 
              name = "log2(Expr)",
              clustering_distance_rows = "pearson",
              clustering_method_rows = "ward.D2",
              clustering_distance_columns = "euclidean",
              clustering_method_columns = "ward.D2",
              col = jet.colors(32), 
              top_annotation = colha,
              show_column_names = FALSE,
              show_row_names = FALSE)

draw(ht)

It looks like there is no noticeable difference if we use Euclidean distance on the unstandardized samples. [Note: the genes are still standardized.] That suggests this result is robust to scaling of the samples. Now let’s try this on 300 features. Let’s go back to using correlation for the column similarities. 在非标准化样本上使用欧几里得距离似乎没有明显的区别。[注:基因仍是标准化的。这表明这一结果对于样本的缩放是可靠的。现在让我们在300个功能上尝试一下。让我们回到对列相似性使用相关性的问题上来。 500 randomly sampled features, 1-correlation dissimilarity

set.seed(21)
idx <- sample(c(1:nrow(jbcdat$E)),500)
X <- rowscale(jbcdat$E)[idx,]
scX <- scale(X,center=TRUE,scale=matrixStats::colMads(X))

ht <- ComplexHeatmap::Heatmap(scX, column_title = "Samples",
              row_title = "Features", 
              name = "log2(Expr)",
              clustering_distance_rows = "pearson",
              clustering_method_rows = "ward.D2",
              clustering_distance_columns = "pearson",
              clustering_method_columns = "ward.D2",
              col = jet.colors(32), 
              top_annotation = colha,
              show_column_names = FALSE,
              show_row_names = FALSE)

draw(ht)

300 (top mad) features, 1-correlation dissimilarity

fidx <- which(rfilt <= 300)
X <- rowscale(jbcdat$E)[fidx,]
scX <- scale(X,center=TRUE,scale=matrixStats::colMads(X))

ht <- ComplexHeatmap::Heatmap(scX, column_title = "Samples",
              row_title = "Features", 
              name = "log2(Expr)",
              clustering_distance_rows = "pearson",
              clustering_method_rows = "ward.D2",
              clustering_distance_columns = "pearson",
              clustering_method_columns = "ward.D2",
              col = jet.colors(32), 
              top_annotation = colha,
              show_column_names = FALSE,
              show_row_names = FALSE)

draw(ht)

The bottom 6 clusters look like they’re the same as when using 500 features, but the top 2 do not. Top clusters are unstable in hierarchical clustering algorithms. 最下面的6个集群看起来和使用500个特性时一样，但是最上面的2个却不是。在层次聚类算法中，顶层聚类是不稳定的。 100 (top mad) features, 1-correlation dissimilarity

fidx <- which(rfilt <= 100)
X <- rowscale(jbcdat$E)[fidx,]
scX <- scale(X,center=TRUE,scale=matrixStats::colMads(X))

ht <- ComplexHeatmap::Heatmap(scX, column_title = " ",
              row_title = "100 Features", 
              name = " ",
              clustering_distance_rows = "pearson",
              clustering_method_rows = "ward.D2",
              clustering_distance_columns = "pearson",
              clustering_method_columns = "ward.D2",
              col = jet.colors(32), 
              top_annotation = colha,
              show_column_names = FALSE,
              show_row_names = FALSE)

draw(ht)

What has changed now?

In the figure above we used pairwise correlation for the column similarities. What if we used euclidean distance instead?

100 (top mad) features, sample Euclidean dissimilarity

fidx <- which(rfilt <= 100)
X <- rowscale(jbcdat$E)[fidx,]

hte <- ComplexHeatmap::Heatmap(X, column_title = " ",
              row_title = "100 Features", 
              name = "log2(Expr)",
              clustering_distance_rows = "pearson",
              clustering_method_rows = "ward.D2",
              clustering_distance_columns = "euclidean",
              clustering_method_columns = "ward.D2",
              col = jet.colors(32), 
              top_annotation = colha,
              show_column_names = FALSE,
              show_row_names = FALSE)

draw(hte)

And, since the row clustering isn’t changing, we should be able to plot these side-by-side using complex heatmap…

So what is the correct figure?

Extended Dendrograms

fidx <- which(rfilt <= 500)
X <- rowscale(jbcdat$E)[fidx,]
scX <- scale(X,center=TRUE,scale=matrixStats::colMads(X))

dend <- as.dist(1-cor(scX)) %>% # compute dissimilarity
        hclust(method = "ward.D2") %>% # Hierarchical clustering 
        as.dendrogram # Turn the object into a dendrogram.

Color the leafs:

cv <- c("black","yellow","red","blue","orange","purple","pink")
type_color <- cv[unclass(jbcdat$targets$type)][order.dendrogram(dend)]

dend %>% set("leaves_pch", c(16)) %>%  # node point type
  set("leaves_cex", 2) %>%  # node point size
  set("leaves_col", type_color) %>% #node point color
  set("labels", "") %>%
plot()

Add color bars:

trtcol  <- c("purple","pink","orange")
timecol <- c("grey","lightgreen")

trt_bar <- trtcol[as.numeric(factor(jbcdat$targets$treatment))]
trt_bar <- trt_bar[order.dendrogram(dend)]
time_bar <- timecol[as.numeric(factor(jbcdat$targets$hour))]
time_bar <- time_bar[order.dendrogram(dend)]
s_bar <- cbind(trt_bar,time_bar)

dend %>% set("labels", "") %>% plot
colored_bars(s_bar,rowLabels = c("Treatment","Time"))

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] dendextend_1.15.1    matrixStats_0.58.0   matlab_1.0.2        
## [4] ComplexHeatmap_2.9.0
## 
## loaded via a namespace (and not attached):
##  [1] circlize_0.4.13     shape_1.4.6         GetoptLong_1.0.5   
##  [4] tidyselect_1.1.1    xfun_0.23           purrr_0.3.4        
##  [7] colorspace_2.0-1    vctrs_0.3.8         generics_0.1.0     
## [10] viridisLite_0.4.0   htmltools_0.5.1.1   stats4_4.1.0       
## [13] yaml_2.2.1          utf8_1.2.1          rlang_0.4.11       
## [16] pillar_1.6.1        glue_1.4.2          DBI_1.1.1          
## [19] BiocGenerics_0.38.0 RColorBrewer_1.1-2  foreach_1.5.1      
## [22] lifecycle_1.0.0     stringr_1.4.0       munsell_0.5.0      
## [25] gtable_0.3.0        GlobalOptions_0.1.2 codetools_0.2-18   
## [28] evaluate_0.14       knitr_1.33          IRanges_2.26.0     
## [31] Cairo_1.5-12.2      doParallel_1.0.16   parallel_4.1.0     
## [34] fansi_0.5.0         highr_0.9           scales_1.1.1       
## [37] S4Vectors_0.30.0    gridExtra_2.3       rjson_0.2.20       
## [40] ggplot2_3.3.3       png_0.1-7           digest_0.6.27      
## [43] stringi_1.6.2       dplyr_1.0.6         clue_0.3-59        
## [46] tools_4.1.0         magrittr_2.0.1      tibble_3.1.2       
## [49] cluster_2.1.2       crayon_1.4.1        pkgconfig_2.0.3    
## [52] ellipsis_0.3.2      viridis_0.6.1       assertthat_0.2.1   
## [55] rmarkdown_2.8       iterators_1.0.13    R6_2.5.0           
## [58] compiler_4.1.0

Hierarchical Cluster Analysis

K Siegmund

6/3/2020

Libraries

Dendrograms

Complex Heatmap

Extended Dendrograms