In many Genomic paper, you will see heatmaps. Heatmaps is of no mystery. It is a way to visualize the data a.k.a. using colors to represent values. However one really needs to understand the details of heatmaps. I recommend you to read Points of view: Mapping quantitative data to color and Points of view: Heat maps from a series of articles from Nature Methods.

Usually one has a matrix and then plot the matrix using functions such as heatmap.2, pheatmap or Heatmap.

I will start with a very simple using case for heatmap. We have sequenced 20 samples and identified mutations in 10 genes. some samples have the mutation in this gene, some samples do not have it. In this case, it will be a simple 0 or 1 to represent each data point. I am going to use ggplot2 for this purpose, although the base R function rect can also draw rectangles.

Let’s simulate the data.

library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(1)
# repeat the sampling 
mut<- replicate(20, sample(c(0,1), 10, replace=TRUE))
mut<- as.data.frame(mut)
colnames(mut)<- paste0("sample", 1:20)
mut<- mut %>% mutate(gene=paste0("gene", 1:10))
head(mut)
##   sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9
## 1       0       0       1       0       1       0       1       0       0
## 2       0       0       0       1       1       1       0       1       1
## 3       1       1       1       0       1       0       0       0       0
## 4       1       0       0       0       1       0       0       0       0
## 5       0       1       0       1       1       0       1       0       1
## 6       1       0       0       1       1       0       0       1       0
##   sample10 sample11 sample12 sample13 sample14 sample15 sample16 sample17
## 1        0        1        1        1        1        1        1        0
## 2        0        0        1        0        0        1        1        1
## 3        1        0        0        0        0        0        0        0
## 4        1        1        0        0        1        0        0        1
## 5        1        1        0        1        1        1        1        1
## 6        1        0        0        0        1        0        0        0
##   sample18 sample19 sample20  gene
## 1        1        0        1 gene1
## 2        1        0        0 gene2
## 3        1        1        0 gene3
## 4        0        1        1 gene4
## 5        0        1        0 gene5
## 6        1        0        1 gene6

most of my codes follow a post Making Faceted Heatmaps with ggplot2

Tidy the data to the long format.

mut.tidy<- mut %>% tidyr::gather(sample, mutated, 1:20)

## change the levels for gene names and sample names so it goes 1,2,3,4... rather than 1, 10...
mut.tidy$gene<- factor(mut.tidy$gene, levels = paste0("gene", 1:10))
mut.tidy$sample<- factor(mut.tidy$sample, levels = paste0("sample", 1:20))

when fill the tiles with color, in this case, it is 0 or 1 discrete value. R thinks mutated is a numeric continuous value, change it to factor.

mut.tidy$mutated<- factor(mut.tidy$mutated)

## use a white border of size 0.5 unit to separate the tiles
gg<- ggplot(mut.tidy, aes(x=sample, y=gene, fill=mutated)) + geom_tile(color="white", size=0.5)
library(RColorBrewer) ## better color schema

## check all the color pallete and choose one
display.brewer.all()

mutated will have color red, unmutated have color blue.

gg<- gg + scale_fill_brewer(palette = "Set1", direction = -1)

geom_tile() draws rectangles, add coord_equal to draw squres.

gg<- gg + coord_equal()
## add title

gg<- gg + labs(x=NULL, y=NULL, title="mutation spectrum of 20 breast cancers")

library(ggthemes)
##starting with a base theme of theme_tufte() from the ggthemes package. It removes alot of chart junk without having to do it manually.
gg <- gg + theme_tufte(base_family="Helvetica")

#We don’t want any tick marks on the axes and I want the text to be slightly smaller than the default.

gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text.x=element_text(angle = 45, hjust = 1))
gg

If you want to mannually fill the color, you can use scale_fill_manual, and check http://colorbrewer2.org/ to get the HEX representation of the color.

ggplot(mut.tidy, aes(x=sample, y=gene, fill=mutated)) + geom_tile(color="white", size=0.5) +
         coord_equal() +
        labs(x=NULL, y=NULL, title="mutation spectrum of 20 breast cancers") +
        theme_tufte(base_family="Helvetica") +
        scale_fill_manual(values = c("#7570b3", "#1b9e77")) +
        theme(axis.ticks=element_blank()) + 
        theme(axis.text.x=element_text(angle = 45, hjust = 1))

ggplot(mut.tidy, aes(x=sample, y=gene, fill=mutated)) + geom_tile(color="white", size=0.5) +
         coord_equal() +
        labs(x=NULL, y=NULL, title="mutation spectrum of 20 breast cancers") +
        theme_tufte(base_family="Helvetica") +
        scale_fill_manual(values = c("gray", "red")) +
        theme(axis.ticks=element_blank()) + 
        theme(axis.text.x=element_text(angle = 45, hjust = 1))    

Note that in a real genomic experiment, thousands of genes will be assayed, and one can use tools such as CoMET to find the mutual exclusive mutations and plot as I just did. There is a so called oncoprint in many papers and essentially they are doing the same thing as I did here, but adding many details. see one example from ComplexHeatmap:

In this example, I showed an example of using heatmap to represent discrete values (yes or no mutation), in my following post, I will post how to use heatmap to represent continuous values and do clustering on rows and columns to find patterns (unsupervised clustering). ggplot2 it self does not have clustering built-in. We will have to use the functions I mentioned in the begining of this blog. There are three main points I will stress on plotting a bi-clustered heatmap:

  1. scale your data (center/standardize your data or not).
  2. range of the data and color mapping.
  3. clustering. (which distance measure and linkage method to use).

TO BE CONTINUED…

Using heatmap to represent continuous values.

In my last blog post, I showed an example to use heatmap to repreent discrete values. I am going to continue the theme to introduce heatmap to represent continuous values. As I mentioned before, I will stress on three main points:

  1. Whether or not scale your data (center/standardize your data or not).
  2. Make sure you know the range of the data and do reasonable color mapping.
  3. How to perform the clustering. (which distance measure and linkage method to use).

In order to make the example reproducible and interesting to biologist, I am going to use an RNAseq data set downloaded by the recount package. The data contains 347 single cell RNA-seq data of 11 different distinct populations. Please read this Nature biotechnology paper: Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. I will try to reproduce the figures in the paper.

Downlaod the data by recount

#source("https://bioconductor.org/biocLite.R")

## use development branch of bioc
# useDevel()
#biocLite("recount")
library(recount)
library(dplyr)

## if you go to the paper you can find the project name
project = 'SRP041736'

# I already downloaded it, so comment it out
# download_study(project)

## load the data
load(file.path(project, 'rse_gene.Rdata'))

#We can explore a bit this RangedSummarizedExperiment 
rse_gene
## class: RangedSummarizedExperiment 
## dim: 23779 694 
## metadata(0):
## assays(1): counts
## rownames(23779): 1 10 ... 9994 9997
## rowData names(3): gene_id bp_length symbol
## colnames(694): SRR1274192 SRR1274193 ... SRR1275349 SRR1275353
## colData names(21): project sample ... title characteristics
colData(rse_gene) %>% head()
## DataFrame with 6 rows and 21 columns
##                project      sample  experiment         run
##            <character> <character> <character> <character>
## SRR1274192   SRP041736   SRS603194   SRX534477  SRR1274192
## SRR1274193   SRP041736   SRS603194   SRX534477  SRR1274193
## SRR1274194   SRP041736   SRS603195   SRX534478  SRR1274194
## SRR1274195   SRP041736   SRS603195   SRX534478  SRR1274195
## SRR1274196   SRP041736   SRS603196   SRX534479  SRR1274196
## SRR1274197   SRP041736   SRS603196   SRX534479  SRR1274197
##            read_count_as_reported_by_sra reads_downloaded
##                                <integer>        <integer>
## SRR1274192                      16260602         16260602
## SRR1274193                         76776            76776
## SRR1274194                      13225500         13225500
## SRR1274195                         64552            64552
## SRR1274196                      16521630         16521630
## SRR1274197                         83534            83534
##            proportion_of_reads_reported_by_sra_downloaded paired_end
##                                                 <numeric>  <logical>
## SRR1274192                                              1       TRUE
## SRR1274193                                              1       TRUE
## SRR1274194                                              1       TRUE
## SRR1274195                                              1       TRUE
## SRR1274196                                              1       TRUE
## SRR1274197                                              1       TRUE
##            sra_misreported_paired_end mapped_read_count        auc
##                             <logical>         <integer>  <numeric>
## SRR1274192                      FALSE          16090529 1541642379
## SRR1274193                      FALSE             69879    2016766
## SRR1274194                      FALSE          13092059 1258982469
## SRR1274195                      FALSE             59210    1708001
## SRR1274196                      FALSE          16368219 1580611206
## SRR1274197                      FALSE             76808    2217637
##            sharq_beta_tissue sharq_beta_cell_type
##                  <character>          <character>
## SRR1274192                NA                   NA
## SRR1274193                NA                   NA
## SRR1274194                NA                   NA
## SRR1274195                NA                   NA
## SRR1274196                NA                   NA
## SRR1274197                NA                   NA
##            biosample_submission_date biosample_publication_date
##                          <character>                <character>
## SRR1274192   2014-04-22T17:28:44.407    2014-08-02T00:44:06.583
## SRR1274193   2014-04-22T17:28:44.407    2014-08-02T00:44:06.583
## SRR1274194   2014-04-22T17:28:45.247    2014-08-02T00:44:07.100
## SRR1274195   2014-04-22T17:28:45.247    2014-08-02T00:44:07.100
## SRR1274196   2014-04-22T17:28:45.343    2014-08-02T00:44:07.150
## SRR1274197   2014-04-22T17:28:45.343    2014-08-02T00:44:07.150
##              biosample_update_date avg_read_length geo_accession
##                        <character>       <integer>   <character>
## SRR1274192 2014-08-02T00:44:06.583             202            NA
## SRR1274193 2014-08-02T00:44:06.583              60            NA
## SRR1274194 2014-08-02T00:44:07.100             202            NA
## SRR1274195 2014-08-02T00:44:07.100              60            NA
## SRR1274196 2014-08-02T00:44:07.150             202            NA
## SRR1274197 2014-08-02T00:44:07.150              60            NA
##              bigwig_file       title characteristics
##              <character> <character> <CharacterList>
## SRR1274192 SRR1274192.bw          NA              NA
## SRR1274193 SRR1274193.bw          NA              NA
## SRR1274194 SRR1274194.bw          NA              NA
## SRR1274195 SRR1274195.bw          NA              NA
## SRR1274196 SRR1274196.bw          NA              NA
## SRR1274197 SRR1274197.bw          NA              NA
## At the gene level, the row data includes the gene ENTREZ ids, the gene
## symbols and the sum of the reduced exons widths, which can be used for 
## taking into account the gene length.
rowData(rse_gene)
## DataFrame with 23779 rows and 3 columns
##           gene_id bp_length      symbol
##       <character> <integer> <character>
## 1               1      4027        A1BG
## 2              10      1317        NAT2
## 3             100      1532         ADA
## 4            1000      4473        CDH2
## 5       100008589      5071     RNA28S5
## ...           ...       ...         ...
## 23775        9991      8234       PTBP3
## 23776        9992       803       KCNE2
## 23777        9993      4882       DGCR2
## 23778        9994      6763    CASP8AP2
## 23779        9997      1393        SCO2
# browse_study(project)

Preparing the metadata

It turns out that the metadata returned by recount are many NAs. I have to go to SRA run selector to find the metadata. great thanks to @Ayush.

search SRP041736 in the search box, and it will return all the runs for the project, and I downloaded the RunInfo Table.

SRAdb bioconductor package maybe able to search and download the metadata programmatically.

Aso check sra-search from NCBI.

Now, read in the metadata downloaded.

library(readr)
sra.runs<- read_tsv("~/Downloads/SraRunTable.txt")

## much more info I need!
head(sra.runs)
## Source: local data frame [6 x 35]
## 
##    BioSample_s Experiment_s Library_Name_s LoadDate_s MBases_l MBytes_l
##          (chr)        (chr)          (chr)     (date)    (int)    (int)
## 1 SAMN02731786    SRX534477         2338_1 2014-05-08     1566     1094
## 2 SAMN02731786    SRX534477         2338_1 2014-05-08        2        1
## 3 SAMN02731795    SRX534478        2338_10 2014-05-08     1273      892
## 4 SAMN02731795    SRX534478        2338_10 2014-05-08        1        1
## 5 SAMN02731796    SRX534479        2338_11 2014-05-08     1591     1114
## 6 SAMN02731796    SRX534479        2338_11 2014-05-08        2        1
## Variables not shown: Run_s (chr), SRA_Sample_s (chr), Sample_Name_s (chr),
##   age_s (chr), biomaterial_provider_s (chr), cell_line_s (chr), disease_s
##   (chr), disease_stage_s (chr), population_s (chr), sex_s (chr), tissue_s
##   (chr), Assay_Type_s (chr), AssemblyName_s (chr), BioProject_s (chr),
##   BioSampleModel_s (chr), Center_Name_s (chr), Consent_s (chr),
##   InsertSize_l (int), LibraryLayout_s (chr), LibrarySelection_s (chr),
##   LibrarySource_s (chr), Organism_s (chr), Platform_s (chr), ReleaseDate_s
##   (date), SRA_Study_s (chr), g1k_analysis_group_s (chr), g1k_pop_code_s
##   (chr), isolate_s (chr), source_s (chr)
# View(sra.runs)

The function scale_counts() helps you scale them in a way that is tailored to Rail-RNA output.

## Scale counts by taking into account the total coverage per sample
rse <- scale_counts(rse_gene)

## check the count table

assay(rse)[1:6, 1:20]
##           SRR1274192 SRR1274193 SRR1274194 SRR1274195 SRR1274196
## 1                330          0          6          0        265
## 10                94          0          0          0          0
## 100             1905       3451          0          0          4
## 1000               3          0          8          0          6
## 100008589       2026       1567        921       1358       1622
## 100009613          0          0          0          0          0
##           SRR1274197 SRR1274198 SRR1274199 SRR1274200 SRR1274201
## 1                  0          1          0        430          0
## 10                 0          0          0          0          0
## 100                0        730          0          0          0
## 1000               0          6          0          6          0
## 100008589       1046       1445       1809       1283       2364
## 100009613          0          0          0          0          0
##           SRR1274202 SRR1274203 SRR1274204 SRR1274205 SRR1274206
## 1                 54          0        931        959          0
## 10                 0          0          0          0          0
## 100               12          0          0          0        452
## 1000               0          0          0          0         12
## 100008589       1703       1455       1812        843       2942
## 100009613          0          0          0          0          0
##           SRR1274207 SRR1274208 SRR1274209 SRR1274210 SRR1274211
## 1                  0          5          0        141          0
## 10                 0          0          0          0          0
## 100             2400         23          0        218          0
## 1000               0          3          0          2          0
## 100008589       4737       1047          0       3030       1993
## 100009613          0          0          0          0          0

match the sra.runs$Run_s to the colData(rse)$run

sra.runs$Run_s %in% colData(rse)$run %>% table()
## .
## FALSE  TRUE 
##     2   694
colData(rse)$run  %in% sra.runs$Run_s %>% table()
## .
## TRUE 
##  694
# which runs are not avaiable in the count matrix?
sra.runs$Run_s[! sra.runs$Run_s %in% colData(rse)$run]
## [1] "SRR1274250" "SRR1274344"
## there should be two same samples, one is deep sequenced, the other is shallow sequenced (downsampled)
colData(rse) %>% as.data.frame() %>% group_by(sample) %>% summarise(n=n()) %>% arrange(n) %>% head()
## Source: local data frame [6 x 2]
## 
##      sample     n
##       (chr) (int)
## 1 SRS603223     1
## 2 SRS603270     1
## 3 SRS603194     2
## 4 SRS603195     2
## 5 SRS603196     2
## 6 SRS603197     2
sra.runs %>% dplyr::filter(Run_s == "SRR1274250") %>% select(SRA_Sample_s)
## Source: local data frame [1 x 1]
## 
##   SRA_Sample_s
##          (chr)
## 1    SRS603223
sra.runs %>% dplyr::filter(Run_s == "SRR1274344") %>% select(SRA_Sample_s)
## Source: local data frame [1 x 1]
## 
##   SRA_Sample_s
##          (chr)
## 1    SRS603270
colData(rse) %>% as.data.frame() %>% dplyr::filter(sample == "SRS603223")
##     project    sample experiment        run read_count_as_reported_by_sra
## 1 SRP041736 SRS603223  SRX534506 SRR1274251                         76576
##   reads_downloaded proportion_of_reads_reported_by_sra_downloaded
## 1            76576                                              1
##   paired_end sra_misreported_paired_end mapped_read_count     auc
## 1       TRUE                      FALSE             68808 1984754
##   sharq_beta_tissue sharq_beta_cell_type biosample_submission_date
## 1              <NA>                 <NA>   2014-04-22T17:28:47.330
##   biosample_publication_date   biosample_update_date avg_read_length
## 1    2014-08-02T00:44:08.360 2014-08-02T00:44:08.360              60
##   geo_accession   bigwig_file title characteristics
## 1          <NA> SRR1274251.bw  <NA>              NA
colData(rse) %>% as.data.frame() %>% dplyr::filter(sample == "SRS603270")
##     project    sample experiment        run read_count_as_reported_by_sra
## 1 SRP041736 SRS603270  SRX534553 SRR1274345                        169528
##   reads_downloaded proportion_of_reads_reported_by_sra_downloaded
## 1           169528                                              1
##   paired_end sra_misreported_paired_end mapped_read_count      auc
## 1       TRUE                      FALSE            150720 14263581
##   sharq_beta_tissue sharq_beta_cell_type biosample_submission_date
## 1              <NA>                 <NA>   2014-04-22T17:29:04.727
##   biosample_publication_date   biosample_update_date avg_read_length
## 1    2014-08-02T00:44:25.317 2014-08-02T00:44:25.317             196
##   geo_accession   bigwig_file title characteristics
## 1          <NA> SRR1274345.bw  <NA>              NA

These two samples are shallow sequenced ones according to the mapped_read_count number above. select out the smaller one to be the shallow sequenced ones.

shallow.run<- colData(rse) %>% as.data.frame() %>% group_by(sample) %>% dplyr::slice(which.min(mapped_read_count)) %>% ungroup()

boxplot(shallow.run$mapped_read_count)
title(main = "number of mapped reads for the shallow sequenced samples")

## merge the meta data
shallow.run <- shallow.run %>% left_join(sra.runs, by= c("run" = "Run_s"))

## summary of mapped_read_count in shallow ones. median of  ~0.2 million reads
summary(shallow.run$mapped_read_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11540  128900  201600  239800  323400  917300
## exclude the shallow runs to find the deep runs
shallow.runs<- shallow.run$run

deep.run<- colData(rse) %>% as.data.frame() %>% dplyr::filter(! run %in% shallow.runs) %>% ungroup()

boxplot(deep.run$mapped_read_count)
title(main = "number of mapped reads for the deep sequenced samples")

## merge the meta data
deep.run <- deep.run %>% left_join(sra.runs, by = c("run" = "Run_s"))

The data are always messy, the cell line name is spread to different columns. I will use regular expression to extract the first word until the space from the population_s column.

shallow.run %>% mutate(cell_line = gsub("([^ ]+).+", "\\1", population_s)) %>% select(cell_line) %>% table()
## .
##   2338   2339     BJ   chip   GW16   GW21 GW21+3   HL60    iPS   K562 
##     22     17     37     46     26      8     16     54     24     42 
##   Kera    NPC  plate 
##     40     15      1

The names start with chip are K562 cells if you check cell_line_s column. The one starts with plate is the bulk sample (100 cells for K562 cells).

from the method session of the paper:

Hierarchical clustering of the top 500 PCA genes across 301 cells was also performed in the Fluidigm SINGuLAR package. Genes are clustered on the basis of Pearson correlation. Samples are clustered on the basis of a Euclidian distance matrix with complete linkage.

excluding the samples start with chip and plate:

22 + 17 + 37 + 26 + 8 + 16 + 54 + 24 + 42 + 40 + 15
## [1] 301

It is 301, the same number of samples used in the paper, so I guess they excluded those k562 cells

exclude those k562 cells to get only 301 single cells.

shallow.run <- shallow.run %>% mutate(cell_line = gsub("([^ ]+).+", "\\1", population_s)) 
shallow.run<- shallow.run %>% filter(! cell_line %in% c("chip", "plate"))

To reproduce figure 2C in the paper, I need to make a PCA plot. I will use SVD sigular value decomposition to accomplish that. I had a pretty detailed write-up for SVD and PCA here. Of course, one can use the functions prcomp() and princomp() from the built-in R stats package or PCA() from FactoMineR package.

## a peek into the count matrix
assay(rse)[1:6, 1:6]
##           SRR1274192 SRR1274193 SRR1274194 SRR1274195 SRR1274196
## 1                330          0          6          0        265
## 10                94          0          0          0          0
## 100             1905       3451          0          0          4
## 1000               3          0          8          0          6
## 100008589       2026       1567        921       1358       1622
## 100009613          0          0          0          0          0
##           SRR1274197
## 1                  0
## 10                 0
## 100                0
## 1000               0
## 100008589       1046
## 100009613          0
## select out only the counts for shallow ones
shallow.counts<- assay(rse)[, colnames(rse) %in% shallow.run$run]

## I need the cell type, tissue type info for each run in the order of runs (each column is a run) in the matrix

shallow.run.cell<- left_join(data.frame(run = colnames(shallow.counts)), shallow.run) %>% select(run, cell_line, tissue_s)
## Joining by: "run"
## Warning in left_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
## tidy the dataframe a bit, set the NPC tissue from N/A to brain for comparisons with GW lines, and iPS tissue from N/A to pluripotent

shallow.run.cell<- shallow.run.cell %>% 
        mutate(tissue_s = gsub("(mammary gland);.+", "\\1", tissue_s)) %>% 
        mutate(tissue_s = replace(tissue_s, tissue_s== "N/A" & cell_line == "iPS", "pluripotent")) %>% 
        mutate(tissue_s = replace(tissue_s, tissue_s== "N/A" & cell_line == "NPC", "brain"))

PCA in action

## use log2 count for PCA and center the data for each sample.
X<- t(scale(t(log2(shallow.counts+1)),center=TRUE,scale=FALSE))
sv<- svd(t(X))
U<- sv$u
V<- sv$v
D<- sv$d
Z<- t(X)%*%V

## put PC1 and PC2 into a dataframe and add the cell line info side by side
pc_dat<- data.frame(cell.line = shallow.run.cell$cell_line, PC1 = Z[,1], PC2 = Z[,2])

## make figure with ggplot2
library(ggplot2)
library(ggthemes)
ggplot(pc_dat, aes(x=PC1, y=PC2, col=cell.line)) + 
        geom_point() + 
        theme_bw() +
        theme(panel.border = element_blank(), 
              panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(), 
              axis.line.x = element_line(color="black", size = 0.6),
              axis.line.y = element_line(color="black", size = 0.6))

Well, it is very similar to the original figure in the paper except the color and shape of the points. I mannually selected colors from http://www.color-hex.com/

p<- ggplot(pc_dat, aes(x=PC1, y=PC2)) + 
        geom_point(aes(color = cell.line, shape = cell.line)) + 
        scale_colour_manual(values=c("#24cdd4","#9e379f","#6673b7","#9FD175", "#9FD175", 
                                     "#9FD175", "#ea018d", "#0d060f", "#eede51", "#6673b7", "#9FD175")) +   
        scale_shape_manual(values=c(19, 19, 4, 19, 8, 3,19,19,19,19, 4))+
        theme_bw() +
        theme(panel.border = element_blank(), 
              panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(), 
              axis.line.x = element_line(color="black", size = 0.6),
              axis.line.y = element_line(color="black", size = 0.6))

p

Let’s fix the legend a bit to mimic the original figure.

p + guides(col = guide_legend(ncol=2)) + theme(legend.position = c(.9, .75))

It is still not quite the same as the original figure as the legend of cell types should be grouped together. I will leave it to you to google and find a way to change the order of the legend. I usually import the figure to inkscape and mannually change the positions and/or orders. You may also notice that the X and Y axis scales are different from the orignal paper, I guess it is because that the authors used RSEM to process the RNA-seq data and used TPM (rather than raw counts in my example) to plot PCA. Nevertheless, the distribution of the clusters are very similar.

Now it is time to reproduce figure 3b.

From the paper:

Hierarchical clustering of 65 single cells across 500 genes with the strongest PC1–PC3 loading scores identifies four major groups of cells (I–IV), and
k-means clustering identifies three clusters of genes (red, yellow and green)

So figure 3b only used cells that are from brain tissues. I want to show you how to do clustering for all the 11 cell types first then we can adventure ourselves by focusing on the brain tissue cells only (NPC, GW16, GW21, GW21+3). It is just a matter of subsetting only the cell lines that are of brain origin.

Before drawing a biclustered heatmap, one needs to select features (genes) first becasue hierachincal clustering can be quite time/memory consuming if you have too many features. In addition, many features do not contribute to the clustering, so we can drop them.

There are many ways to select the features, one of the most commonly used method is to use the most variable genes across samples.

library(genefilter)
rv<- rowVars(X)
## select the top 250 most variable genes for clustering
idx<- order(-rv)[1:250]

library(ComplexHeatmap) ## a very nice package from Zuguang Gu, he is very responsive.
Heatmap(X[idx,], name = "log2 RNAseq\ncounts scaled", 
        show_row_names = FALSE, show_column_names = FALSE, 
        row_dend_reorder = TRUE, column_dend_reorder = TRUE)

There are several things to note. First, the X matrix is centered, and if you check the distribution of the data, you will see there are no skewness. otherwise, you will need to creat a color mapping. see a post from Mick Recreating a famous visualisation (using gplots by Mick Watson as well)

Second, I used the default distance measure (Euclidean distance) and linkage method (complete). However, the choice of distance meause and likage method can have great effect on your clustering.

Make sure you read this paper in Nature biotechnology: How does gene expression clustering work?

some quotes from the paper:

Single-linkage clustering performs abysmally on most real-world data sets, and gene expression data is no exception. It is included in almost every single clustering package ‘for completeness,’ but should probably be removed altogether.

So, never use single linkage!

Contrary to conventional expectation, complete linkage seems to outperform average linkage.

use complete linkage

Euclidean distance, Pearson correlation and ‘uncentered’ correlation (angular separation) all seem to work reasonably well as distance measures. Euclidean distance may be more appropriate for log ratio data, whereas Pearson correlation seems to work better for absolute-valued (e.g., Affymetrix) data.

I want to add some points for the third quote. If one wants to find genes that have the same trend: all (relatively for the same gene) high in some samples and all low in other samples rather than have similar obsolute values (euclidean distance) for different genes, one should use 1 -cor(X) as distance measure.

I highly recommend you to read Chapter 10 for unsupervised clustering in the book of An Introduction to statistical learning. The book is free in PDF.

Also please read You probably don’t understand heatmaps by Mick Watson.

In the 2014 Nature biotechnology paper:

Genes are clustered on the basis of Pearson correlation. Samples are clustered on the basis of a Euclidian distance matrix with complete linkage.

Heatmap(X[idx,], name = "log2 RNAseq\ncounts scaled", 
        show_row_names = FALSE, show_column_names = FALSE, 
        row_dend_reorder = TRUE, column_dend_reorder = TRUE, 
        clustering_distance_rows = "pearson",
        clustering_distance_columns = "euclidean",
        clustering_method_rows = "complete",
        clustering_method_columns = "complete")

Now, I want to add the cell line name and tissue of origin as annotations.

df<- data.frame(cell.line = shallow.run.cell$cell_line, tissue = shallow.run.cell$tissue_s)
## choose some colors to repesent cell lines and tissues
library(RColorBrewer)

## how many we need?
table(df$cell.line)
## 
##   2338   2339     BJ   GW16   GW21 GW21+3   HL60    iPS   K562   Kera 
##     22     17     37     26      8     16     54     24     42     40 
##    NPC 
##     15
table(df$tissue)
## 
##            brain         foreskin    mammary gland peripheral blood 
##               65               77               22               71 
## pleural effusion      pluripotent 
##               42               24
##  6 for tissues and 11 for cell lines 
tissue.cols<- brewer.pal(6, "Dark2")

cell.cols<- brewer.pal(11, "Set3")

## make a named vector from two vectors
cell.cols.assigned<- setNames(cell.cols, unique(as.character(df$cell.line)))
tissue.cols.assigned<- setNames(tissue.cols, unique(as.character(df$tissue)))

## Heatmap annotation
ha<- HeatmapAnnotation(df = df, 
                       col = list(cell.line = cell.cols.assigned, 
                                  tissue = tissue.cols.assigned))

Heatmap(X[idx,], name = "log2 RNAseq\ncounts scaled", 
        show_row_names = FALSE, show_column_names = FALSE, 
        row_dend_reorder = TRUE, column_dend_reorder = TRUE, 
        clustering_distance_rows = "pearson",
        clustering_distance_columns = "euclidean",
        clustering_method_rows = "complete",
        clustering_method_columns = "complete",
        top_annotation = ha)

Indeed, shallow sequencing can still seprate cells of different origin very well.

Now, focus on the brain origin cells only (total 65 cells)

brain.run<- shallow.run %>% filter(cell_line %in% c("NPC", "GW16", "GW21", "GW21+3"))
brain.counts<- assay(rse)[, colnames(rse) %in% brain.run$run]

Y<- t(scale(t(log2(brain.counts+1)),center=TRUE,scale=FALSE))

##SVD to get PCs
sv.brain<- svd(t(Y))
U.brain<- sv.brain$u
V.brain<- sv.brain$v
D.brain<- sv.brain$d
Z.brain<- t(Y)%*%V.brain

Some facts of PCA:

k th column of Z, Z(k), is the k th PC.(the k th pattern)

PC loadings: V k th column of V, V(k), is the k th PC loading (feature weights). aka. the k th column of V encodes the associated k th pattern in feature space.

PC loadings: U k th column of U, U(k), is the k th PC loading (observation weights). aka. the k th column of U encodes the associated k th pattern in observation space.

Diagnal matrix: D diagnals in D: d(k) gives the strength of the k th pattern.

In the previous bi-clustering heatmap, I used most variable genes across samples to select genes (feature selection). I will use PC loadings to select genes for this example.

Variance explained by each PC

## x is the matrix D from SVD
variance_explained_each_PC<- function(x){
        var.list= list()
        varex = 0
        cumvar = 0
        denom = sum(x^2)
        for(i in 1:length(x)){
                varex[i] = x[i]^2/denom
                cumvar[i] = sum(x[1:i]^2)/denom
        }
        var.list$varex<- varex
        var.list$cumvar<- cumvar
        var.list
}

### screen plot 
screen.plot<- function(var.list){
        par(mfrow=c(1,2))
        plot(1:length(var.list$varex), var.list$varex *100,type="h",
             lwd=2,xlab="PC",ylab="% Variance Explained")
        plot(1:length(var.list$cumvar),var.list$cumvar,type="h",
             lwd=2,xlab="PC",ylab="Cummulative Variance Explained")
        
}

screen.plot(variance_explained_each_PC(D.brain))

You will see the first 3 PCs only explain 15% variance.

From the paper:

Hierarchical clustering of 65 single cells across 500 genes with the strongest PC1–PC3 loading scores identifies four major groups of cells (I–IV), and
k-means clustering identifies three clusters of genes (red, yellow and green)

The 500 topranked PCA genes were selected on the basis of the maximum absolute value of each gene loading score in the first three eigenvectors (PC1, PC2 and PC3)

select genes based on maximum absolute loadings for PC1, PC2 and PC3

## for PC1, PC2, and PC3, choose the maximum absulute loadings 
variations.3PC<- apply(abs(V.brain[,1:3]), 1, max)

## order according to the loadings and choose the top 500 genes
genes.3PC<- order(-variations.3PC)[1:500]

## cell type/tissue information
brain.run.cell<- shallow.run.cell %>% filter(cell_line %in% c("NPC", "GW16", "GW21", "GW21+3"))

df.brain<- data.frame(cell.line = brain.run.cell$cell_line)
## choose some colors to repesent cell lines
## how many we need?
table(df.brain$cell.line)
## 
##   GW16   GW21 GW21+3    NPC 
##     26      8     16     15
##  4 for cell lines, I select the color mannually http://www.color-hex.com/ to match the 
## color in the original figure

## olor space is important for interpolating colors. By default (in ComplexHeatmap pacakge), colors are linearly interpolated in LAB color space, but you can select the color space in colorRamp2() function.

brain.cell.cols<- c("#A6FFB6", "#33a1f6", "#d16303", "#322f64")
#brain.cell.cols<- brewer.pal(4, "Dark2")

## make a named vector from two vectors
brain.cell.cols.assigned<- setNames(brain.cell.cols, unique(as.character(df.brain$cell.line)))

## Heatmap annotation
ha.brain<- HeatmapAnnotation(df = df.brain, 
                             col = list(cell.line = brain.cell.cols.assigned))


library(circlize) ## for colorRamp2 color mapping
Heatmap(Y[genes.3PC, ], name = "log2 RNAseq\ncounts scaled", 
        col = colorRamp2(c(-10, 0, 10), c("Darkblue", "white", "red")),
        show_row_names = FALSE,
        show_column_names = FALSE, 
        row_dend_reorder = TRUE, column_dend_reorder = FALSE, 
        clustering_distance_rows = "pearson",
        clustering_distance_columns = "euclidean",
        clustering_method_rows = "complete",
        clustering_method_columns = "complete",
        top_annotation = ha.brain,
        gap = unit(0.5, "mm"))

Note that in the original figure, the rows are samples and columns are genes. We need to rotate our heatmap clock-wise 90 degree to be comparable with the original one This figure is the closest that I can get comparing to the figure 3b.

Sparse PCA

When p>>n (we have way more genes than the samples), many featuers(genes) are irrelevant. PCA can perform very badly. Sparse PCA zero out irrelevant features from PC loadings. The advantage is that we can find important features that contribute to major patterns. Tipically, opitmize PCA criterion with sparsity-encouraging penalty of matrix V. It is an active area of research!

In other words, sparse PCA does feature selection for us, it retains only the features that drive the major pattern in the data. Again, we may or may not need to do sparse PCA to do feature selection. We can use some pre-knowledges, say, oncogenes and tumor-suppressors, differentially expressed genes across observations and most variable genes across samples etc.

library("PMA")
## we also look at the first 3 PCs, we centered the data before hand
## sumabsv  
## How sparse do you want v to be? This is the sum of absolute values of elements of v. It must be between 1 and square root of number of columns (No. of genes in this case) of data. The smaller it is, the sparser v will be.

spc<- SPC(t(Y),sumabsv=10, K=3, center = FALSE)
## 1234567891011121314151617181920
## 1234567891011121314151617181920
## 1234567891011121314151617181920

how many genes selected? we look at the V matrix again, if the weights are zeros, they are not important features. sparse PCA zero them out. For each of the four PCs, how many features are retained?

apply(spc$v!=0, 2, sum)
## [1] 183 180 178
## genes retained for the PC1, PC2 and PC3
PC1.genes<- which(spc$v[,1] !=0)
PC2.genes<- which(spc$v[,2] !=0)
PC3.genes<- which(spc$v[,3] !=0)

selected.genes<- unique(c(PC1.genes, PC2.genes, PC3.genes))
## total 493 genes for PC1, PC2 and PC3 were selected

## heatmap using these 493 genes, kmeans for genes (k=4), I visually selected for k=4 after testing around.

Heatmap(Y[selected.genes, ], name = "log2 RNAseq\ncounts scaled", 
        col = colorRamp2(c(-10, 0, 10), c("Darkblue", "white", "red")),
        show_row_names = FALSE,
        show_column_names = FALSE, 
        row_dend_reorder = TRUE, column_dend_reorder = TRUE, 
        clustering_distance_rows = "pearson",
        clustering_distance_columns = "euclidean",
        clustering_method_rows = "complete",
        clustering_method_columns = "complete",
        top_annotation = ha.brain,
        gap = unit(0.5, "mm"),
        km = 4)

Let’s make the heatmap with most variable genes.

## select genes based on variance across samples
rv.brain<- rowVars(Y)
## select the top 500 most variable genes for clustering
idx.brain<- order(-rv.brain)[1:500]
library(circlize)  ## for colorRamp2 function to map values to colors  
 
## kmeans for genes k=3
Heatmap(Y[idx.brain, ], name = "log2 RNAseq\ncounts scaled", 
        col = colorRamp2(c(-10, 0, 10), c("Darkblue", "white", "red")),
        show_row_names = FALSE,
        show_column_names = FALSE, 
        row_dend_reorder = TRUE, column_dend_reorder = TRUE, 
        clustering_distance_rows = "pearson",
        clustering_distance_columns = "euclidean",
        clustering_method_rows = "ward.D",
        clustering_method_columns = "ward.D",
        top_annotation = ha.brain,
        gap = unit(0.5, "mm"),
        km = 3)

It is not the same as the figure in the paper. Now, you probably understand how important the feature selection is (which genes we choose to generate heatmaps); how important to center the data or not; and how important to choose clustering distance measures and linkage methods.

I am still waiting to see how to exactly reproduce figure 3b in the paper. Any comments are very wellcome!

better color scheme with viridis

Just for fun and for better color scheme, I will use viridis package to draw the heatmaps.

library(viridis)

Heatmap(Y[genes.3PC, ], name = "log2 RNAseq\ncounts scaled", 
        col = colorRamp2(c(-10, 0, 10), viridis(3)),
        show_row_names = FALSE,
        show_column_names = FALSE, 
        row_dend_reorder = TRUE, column_dend_reorder = FALSE, 
        clustering_distance_rows = "pearson",
        clustering_distance_columns = "euclidean",
        clustering_method_rows = "complete",
        clustering_method_columns = "complete",
        top_annotation = ha.brain,
        gap = unit(0.5, "mm"))

Further readings:
Principal component analysis : the basics you should read - R software and data mining

Cluster Analysis in R - Unsupervised machine learning

You probably don’t understand heatmaps by Mick Watson

Recreating a famous visualisation (using gplots by Mick Watson as well)

Recreating the vaccination heatmaps in R using ggplot2

Why do you look at the speck in your sister’s quilt plot and pay no attention to the plank in your own heat map? by Lior Patcher

posts on biostars: understanding the clustering in heatmap

Making a heatmap with R by Dave Tang

Using RColorBrewer to colour your figures in R by R bloggers

how to change heatmap2 color range in r

Customizing gplots heatmap.2 - color range for heatmap and legend for RowSideColors

map colour to values in heatmap