In many Genomic paper, you will see heatmaps. Heatmaps is of no mystery. It is a way to visualize the data a.k.a. using colors to represent values. However one really needs to understand the details of heatmaps. I recommend you to read Points of view: Mapping quantitative data to color and Points of view: Heat maps from a series of articles from Nature Methods.
Usually one has a matrix and then plot the matrix using functions such as heatmap.2, pheatmap or Heatmap.
I will start with a very simple using case for heatmap. We have sequenced 20 samples and identified mutations in 10 genes. some samples have the mutation in this gene, some samples do not have it. In this case, it will be a simple 0 or 1 to represent each data point. I am going to use ggplot2 for this purpose, although the base R function rect can also draw rectangles.
Let’s simulate the data.
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(1)
# repeat the sampling
mut<- replicate(20, sample(c(0,1), 10, replace=TRUE))
mut<- as.data.frame(mut)
colnames(mut)<- paste0("sample", 1:20)
mut<- mut %>% mutate(gene=paste0("gene", 1:10))
head(mut)
## sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9
## 1 0 0 1 0 1 0 1 0 0
## 2 0 0 0 1 1 1 0 1 1
## 3 1 1 1 0 1 0 0 0 0
## 4 1 0 0 0 1 0 0 0 0
## 5 0 1 0 1 1 0 1 0 1
## 6 1 0 0 1 1 0 0 1 0
## sample10 sample11 sample12 sample13 sample14 sample15 sample16 sample17
## 1 0 1 1 1 1 1 1 0
## 2 0 0 1 0 0 1 1 1
## 3 1 0 0 0 0 0 0 0
## 4 1 1 0 0 1 0 0 1
## 5 1 1 0 1 1 1 1 1
## 6 1 0 0 0 1 0 0 0
## sample18 sample19 sample20 gene
## 1 1 0 1 gene1
## 2 1 0 0 gene2
## 3 1 1 0 gene3
## 4 0 1 1 gene4
## 5 0 1 0 gene5
## 6 1 0 1 gene6
most of my codes follow a post Making Faceted Heatmaps with ggplot2
Tidy the data to the long format.
mut.tidy<- mut %>% tidyr::gather(sample, mutated, 1:20)
## change the levels for gene names and sample names so it goes 1,2,3,4... rather than 1, 10...
mut.tidy$gene<- factor(mut.tidy$gene, levels = paste0("gene", 1:10))
mut.tidy$sample<- factor(mut.tidy$sample, levels = paste0("sample", 1:20))
when fill the tiles with color, in this case, it is 0 or 1 discrete value. R thinks mutated is a numeric continuous value, change it to factor.
mut.tidy$mutated<- factor(mut.tidy$mutated)
## use a white border of size 0.5 unit to separate the tiles
gg<- ggplot(mut.tidy, aes(x=sample, y=gene, fill=mutated)) + geom_tile(color="white", size=0.5)
library(RColorBrewer) ## better color schema
## check all the color pallete and choose one
display.brewer.all()
mutated will have color red, unmutated have color blue.
gg<- gg + scale_fill_brewer(palette = "Set1", direction = -1)
geom_tile() draws rectangles, add coord_equal to draw squres.
gg<- gg + coord_equal()
## add title
gg<- gg + labs(x=NULL, y=NULL, title="mutation spectrum of 20 breast cancers")
library(ggthemes)
##starting with a base theme of theme_tufte() from the ggthemes package. It removes alot of chart junk without having to do it manually.
gg <- gg + theme_tufte(base_family="Helvetica")
#We don’t want any tick marks on the axes and I want the text to be slightly smaller than the default.
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text.x=element_text(angle = 45, hjust = 1))
gg
If you want to mannually fill the color, you can use scale_fill_manual, and check http://colorbrewer2.org/ to get the HEX representation of the color.
ggplot(mut.tidy, aes(x=sample, y=gene, fill=mutated)) + geom_tile(color="white", size=0.5) +
coord_equal() +
labs(x=NULL, y=NULL, title="mutation spectrum of 20 breast cancers") +
theme_tufte(base_family="Helvetica") +
scale_fill_manual(values = c("#7570b3", "#1b9e77")) +
theme(axis.ticks=element_blank()) +
theme(axis.text.x=element_text(angle = 45, hjust = 1))
ggplot(mut.tidy, aes(x=sample, y=gene, fill=mutated)) + geom_tile(color="white", size=0.5) +
coord_equal() +
labs(x=NULL, y=NULL, title="mutation spectrum of 20 breast cancers") +
theme_tufte(base_family="Helvetica") +
scale_fill_manual(values = c("gray", "red")) +
theme(axis.ticks=element_blank()) +
theme(axis.text.x=element_text(angle = 45, hjust = 1))
Note that in a real genomic experiment, thousands of genes will be assayed, and one can use tools such as CoMET to find the mutual exclusive mutations and plot as I just did. There is a so called oncoprint in many papers and essentially they are doing the same thing as I did here, but adding many details. see one example from ComplexHeatmap:
In this example, I showed an example of using heatmap to represent discrete values (yes or no mutation), in my following post, I will post how to use heatmap to represent continuous values and do clustering on rows and columns to find patterns (unsupervised clustering). ggplot2 it self does not have clustering built-in. We will have to use the functions I mentioned in the begining of this blog. There are three main points I will stress on plotting a bi-clustered heatmap:
TO BE CONTINUED…
In my last blog post, I showed an example to use heatmap to repreent discrete values. I am going to continue the theme to introduce heatmap to represent continuous values. As I mentioned before, I will stress on three main points:
In order to make the example reproducible and interesting to biologist, I am going to use an RNAseq data set downloaded by the recount package. The data contains 347 single cell RNA-seq data of 11 different distinct populations. Please read this Nature biotechnology paper: Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. I will try to reproduce the figures in the paper.
#source("https://bioconductor.org/biocLite.R")
## use development branch of bioc
# useDevel()
#biocLite("recount")
library(recount)
library(dplyr)
## if you go to the paper you can find the project name
project = 'SRP041736'
# I already downloaded it, so comment it out
# download_study(project)
## load the data
load(file.path(project, 'rse_gene.Rdata'))
#We can explore a bit this RangedSummarizedExperiment
rse_gene
## class: RangedSummarizedExperiment
## dim: 23779 694
## metadata(0):
## assays(1): counts
## rownames(23779): 1 10 ... 9994 9997
## rowData names(3): gene_id bp_length symbol
## colnames(694): SRR1274192 SRR1274193 ... SRR1275349 SRR1275353
## colData names(21): project sample ... title characteristics
colData(rse_gene) %>% head()
## DataFrame with 6 rows and 21 columns
## project sample experiment run
## <character> <character> <character> <character>
## SRR1274192 SRP041736 SRS603194 SRX534477 SRR1274192
## SRR1274193 SRP041736 SRS603194 SRX534477 SRR1274193
## SRR1274194 SRP041736 SRS603195 SRX534478 SRR1274194
## SRR1274195 SRP041736 SRS603195 SRX534478 SRR1274195
## SRR1274196 SRP041736 SRS603196 SRX534479 SRR1274196
## SRR1274197 SRP041736 SRS603196 SRX534479 SRR1274197
## read_count_as_reported_by_sra reads_downloaded
## <integer> <integer>
## SRR1274192 16260602 16260602
## SRR1274193 76776 76776
## SRR1274194 13225500 13225500
## SRR1274195 64552 64552
## SRR1274196 16521630 16521630
## SRR1274197 83534 83534
## proportion_of_reads_reported_by_sra_downloaded paired_end
## <numeric> <logical>
## SRR1274192 1 TRUE
## SRR1274193 1 TRUE
## SRR1274194 1 TRUE
## SRR1274195 1 TRUE
## SRR1274196 1 TRUE
## SRR1274197 1 TRUE
## sra_misreported_paired_end mapped_read_count auc
## <logical> <integer> <numeric>
## SRR1274192 FALSE 16090529 1541642379
## SRR1274193 FALSE 69879 2016766
## SRR1274194 FALSE 13092059 1258982469
## SRR1274195 FALSE 59210 1708001
## SRR1274196 FALSE 16368219 1580611206
## SRR1274197 FALSE 76808 2217637
## sharq_beta_tissue sharq_beta_cell_type
## <character> <character>
## SRR1274192 NA NA
## SRR1274193 NA NA
## SRR1274194 NA NA
## SRR1274195 NA NA
## SRR1274196 NA NA
## SRR1274197 NA NA
## biosample_submission_date biosample_publication_date
## <character> <character>
## SRR1274192 2014-04-22T17:28:44.407 2014-08-02T00:44:06.583
## SRR1274193 2014-04-22T17:28:44.407 2014-08-02T00:44:06.583
## SRR1274194 2014-04-22T17:28:45.247 2014-08-02T00:44:07.100
## SRR1274195 2014-04-22T17:28:45.247 2014-08-02T00:44:07.100
## SRR1274196 2014-04-22T17:28:45.343 2014-08-02T00:44:07.150
## SRR1274197 2014-04-22T17:28:45.343 2014-08-02T00:44:07.150
## biosample_update_date avg_read_length geo_accession
## <character> <integer> <character>
## SRR1274192 2014-08-02T00:44:06.583 202 NA
## SRR1274193 2014-08-02T00:44:06.583 60 NA
## SRR1274194 2014-08-02T00:44:07.100 202 NA
## SRR1274195 2014-08-02T00:44:07.100 60 NA
## SRR1274196 2014-08-02T00:44:07.150 202 NA
## SRR1274197 2014-08-02T00:44:07.150 60 NA
## bigwig_file title characteristics
## <character> <character> <CharacterList>
## SRR1274192 SRR1274192.bw NA NA
## SRR1274193 SRR1274193.bw NA NA
## SRR1274194 SRR1274194.bw NA NA
## SRR1274195 SRR1274195.bw NA NA
## SRR1274196 SRR1274196.bw NA NA
## SRR1274197 SRR1274197.bw NA NA
## At the gene level, the row data includes the gene ENTREZ ids, the gene
## symbols and the sum of the reduced exons widths, which can be used for
## taking into account the gene length.
rowData(rse_gene)
## DataFrame with 23779 rows and 3 columns
## gene_id bp_length symbol
## <character> <integer> <character>
## 1 1 4027 A1BG
## 2 10 1317 NAT2
## 3 100 1532 ADA
## 4 1000 4473 CDH2
## 5 100008589 5071 RNA28S5
## ... ... ... ...
## 23775 9991 8234 PTBP3
## 23776 9992 803 KCNE2
## 23777 9993 4882 DGCR2
## 23778 9994 6763 CASP8AP2
## 23779 9997 1393 SCO2
# browse_study(project)
It turns out that the metadata returned by recount are many NAs. I have to go to SRA run selector to find the metadata. great thanks to @Ayush.
search SRP041736 in the search box, and it will return all the runs for the project, and I downloaded the RunInfo Table.
SRAdb bioconductor package maybe able to search and download the metadata programmatically.
Aso check sra-search from NCBI.
Now, read in the metadata downloaded.
library(readr)
sra.runs<- read_tsv("~/Downloads/SraRunTable.txt")
## much more info I need!
head(sra.runs)
## Source: local data frame [6 x 35]
##
## BioSample_s Experiment_s Library_Name_s LoadDate_s MBases_l MBytes_l
## (chr) (chr) (chr) (date) (int) (int)
## 1 SAMN02731786 SRX534477 2338_1 2014-05-08 1566 1094
## 2 SAMN02731786 SRX534477 2338_1 2014-05-08 2 1
## 3 SAMN02731795 SRX534478 2338_10 2014-05-08 1273 892
## 4 SAMN02731795 SRX534478 2338_10 2014-05-08 1 1
## 5 SAMN02731796 SRX534479 2338_11 2014-05-08 1591 1114
## 6 SAMN02731796 SRX534479 2338_11 2014-05-08 2 1
## Variables not shown: Run_s (chr), SRA_Sample_s (chr), Sample_Name_s (chr),
## age_s (chr), biomaterial_provider_s (chr), cell_line_s (chr), disease_s
## (chr), disease_stage_s (chr), population_s (chr), sex_s (chr), tissue_s
## (chr), Assay_Type_s (chr), AssemblyName_s (chr), BioProject_s (chr),
## BioSampleModel_s (chr), Center_Name_s (chr), Consent_s (chr),
## InsertSize_l (int), LibraryLayout_s (chr), LibrarySelection_s (chr),
## LibrarySource_s (chr), Organism_s (chr), Platform_s (chr), ReleaseDate_s
## (date), SRA_Study_s (chr), g1k_analysis_group_s (chr), g1k_pop_code_s
## (chr), isolate_s (chr), source_s (chr)
# View(sra.runs)
The function scale_counts() helps you scale them in a way that is tailored to Rail-RNA output.
## Scale counts by taking into account the total coverage per sample
rse <- scale_counts(rse_gene)
## check the count table
assay(rse)[1:6, 1:20]
## SRR1274192 SRR1274193 SRR1274194 SRR1274195 SRR1274196
## 1 330 0 6 0 265
## 10 94 0 0 0 0
## 100 1905 3451 0 0 4
## 1000 3 0 8 0 6
## 100008589 2026 1567 921 1358 1622
## 100009613 0 0 0 0 0
## SRR1274197 SRR1274198 SRR1274199 SRR1274200 SRR1274201
## 1 0 1 0 430 0
## 10 0 0 0 0 0
## 100 0 730 0 0 0
## 1000 0 6 0 6 0
## 100008589 1046 1445 1809 1283 2364
## 100009613 0 0 0 0 0
## SRR1274202 SRR1274203 SRR1274204 SRR1274205 SRR1274206
## 1 54 0 931 959 0
## 10 0 0 0 0 0
## 100 12 0 0 0 452
## 1000 0 0 0 0 12
## 100008589 1703 1455 1812 843 2942
## 100009613 0 0 0 0 0
## SRR1274207 SRR1274208 SRR1274209 SRR1274210 SRR1274211
## 1 0 5 0 141 0
## 10 0 0 0 0 0
## 100 2400 23 0 218 0
## 1000 0 3 0 2 0
## 100008589 4737 1047 0 3030 1993
## 100009613 0 0 0 0 0
match the sra.runs$Run_s to the colData(rse)$run
sra.runs$Run_s %in% colData(rse)$run %>% table()
## .
## FALSE TRUE
## 2 694
colData(rse)$run %in% sra.runs$Run_s %>% table()
## .
## TRUE
## 694
# which runs are not avaiable in the count matrix?
sra.runs$Run_s[! sra.runs$Run_s %in% colData(rse)$run]
## [1] "SRR1274250" "SRR1274344"
## there should be two same samples, one is deep sequenced, the other is shallow sequenced (downsampled)
colData(rse) %>% as.data.frame() %>% group_by(sample) %>% summarise(n=n()) %>% arrange(n) %>% head()
## Source: local data frame [6 x 2]
##
## sample n
## (chr) (int)
## 1 SRS603223 1
## 2 SRS603270 1
## 3 SRS603194 2
## 4 SRS603195 2
## 5 SRS603196 2
## 6 SRS603197 2
sra.runs %>% dplyr::filter(Run_s == "SRR1274250") %>% select(SRA_Sample_s)
## Source: local data frame [1 x 1]
##
## SRA_Sample_s
## (chr)
## 1 SRS603223
sra.runs %>% dplyr::filter(Run_s == "SRR1274344") %>% select(SRA_Sample_s)
## Source: local data frame [1 x 1]
##
## SRA_Sample_s
## (chr)
## 1 SRS603270
colData(rse) %>% as.data.frame() %>% dplyr::filter(sample == "SRS603223")
## project sample experiment run read_count_as_reported_by_sra
## 1 SRP041736 SRS603223 SRX534506 SRR1274251 76576
## reads_downloaded proportion_of_reads_reported_by_sra_downloaded
## 1 76576 1
## paired_end sra_misreported_paired_end mapped_read_count auc
## 1 TRUE FALSE 68808 1984754
## sharq_beta_tissue sharq_beta_cell_type biosample_submission_date
## 1 <NA> <NA> 2014-04-22T17:28:47.330
## biosample_publication_date biosample_update_date avg_read_length
## 1 2014-08-02T00:44:08.360 2014-08-02T00:44:08.360 60
## geo_accession bigwig_file title characteristics
## 1 <NA> SRR1274251.bw <NA> NA
colData(rse) %>% as.data.frame() %>% dplyr::filter(sample == "SRS603270")
## project sample experiment run read_count_as_reported_by_sra
## 1 SRP041736 SRS603270 SRX534553 SRR1274345 169528
## reads_downloaded proportion_of_reads_reported_by_sra_downloaded
## 1 169528 1
## paired_end sra_misreported_paired_end mapped_read_count auc
## 1 TRUE FALSE 150720 14263581
## sharq_beta_tissue sharq_beta_cell_type biosample_submission_date
## 1 <NA> <NA> 2014-04-22T17:29:04.727
## biosample_publication_date biosample_update_date avg_read_length
## 1 2014-08-02T00:44:25.317 2014-08-02T00:44:25.317 196
## geo_accession bigwig_file title characteristics
## 1 <NA> SRR1274345.bw <NA> NA
These two samples are shallow sequenced ones according to the mapped_read_count number above. select out the smaller one to be the shallow sequenced ones.
shallow.run<- colData(rse) %>% as.data.frame() %>% group_by(sample) %>% dplyr::slice(which.min(mapped_read_count)) %>% ungroup()
boxplot(shallow.run$mapped_read_count)
title(main = "number of mapped reads for the shallow sequenced samples")
## merge the meta data
shallow.run <- shallow.run %>% left_join(sra.runs, by= c("run" = "Run_s"))
## summary of mapped_read_count in shallow ones. median of ~0.2 million reads
summary(shallow.run$mapped_read_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11540 128900 201600 239800 323400 917300
## exclude the shallow runs to find the deep runs
shallow.runs<- shallow.run$run
deep.run<- colData(rse) %>% as.data.frame() %>% dplyr::filter(! run %in% shallow.runs) %>% ungroup()
boxplot(deep.run$mapped_read_count)
title(main = "number of mapped reads for the deep sequenced samples")
## merge the meta data
deep.run <- deep.run %>% left_join(sra.runs, by = c("run" = "Run_s"))
The data are always messy, the cell line name is spread to different columns. I will use regular expression to extract the first word until the space from the population_s column.
shallow.run %>% mutate(cell_line = gsub("([^ ]+).+", "\\1", population_s)) %>% select(cell_line) %>% table()
## .
## 2338 2339 BJ chip GW16 GW21 GW21+3 HL60 iPS K562
## 22 17 37 46 26 8 16 54 24 42
## Kera NPC plate
## 40 15 1
The names start with chip are K562 cells if you check cell_line_s column. The one starts with plate is the bulk sample (100 cells for K562 cells).
from the method session of the paper:
Hierarchical clustering of the top 500 PCA genes across 301 cells was also performed in the Fluidigm SINGuLAR package. Genes are clustered on the basis of Pearson correlation. Samples are clustered on the basis of a Euclidian distance matrix with complete linkage.
excluding the samples start with chip and plate:
22 + 17 + 37 + 26 + 8 + 16 + 54 + 24 + 42 + 40 + 15
## [1] 301
It is 301, the same number of samples used in the paper, so I guess they excluded those k562 cells
exclude those k562 cells to get only 301 single cells.
shallow.run <- shallow.run %>% mutate(cell_line = gsub("([^ ]+).+", "\\1", population_s))
shallow.run<- shallow.run %>% filter(! cell_line %in% c("chip", "plate"))
To reproduce figure 2C in the paper, I need to make a PCA plot. I will use SVD sigular value decomposition to accomplish that. I had a pretty detailed write-up for SVD and PCA here. Of course, one can use the functions prcomp() and princomp() from the built-in R stats package or PCA() from FactoMineR package.
## a peek into the count matrix
assay(rse)[1:6, 1:6]
## SRR1274192 SRR1274193 SRR1274194 SRR1274195 SRR1274196
## 1 330 0 6 0 265
## 10 94 0 0 0 0
## 100 1905 3451 0 0 4
## 1000 3 0 8 0 6
## 100008589 2026 1567 921 1358 1622
## 100009613 0 0 0 0 0
## SRR1274197
## 1 0
## 10 0
## 100 0
## 1000 0
## 100008589 1046
## 100009613 0
## select out only the counts for shallow ones
shallow.counts<- assay(rse)[, colnames(rse) %in% shallow.run$run]
## I need the cell type, tissue type info for each run in the order of runs (each column is a run) in the matrix
shallow.run.cell<- left_join(data.frame(run = colnames(shallow.counts)), shallow.run) %>% select(run, cell_line, tissue_s)
## Joining by: "run"
## Warning in left_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
## tidy the dataframe a bit, set the NPC tissue from N/A to brain for comparisons with GW lines, and iPS tissue from N/A to pluripotent
shallow.run.cell<- shallow.run.cell %>%
mutate(tissue_s = gsub("(mammary gland);.+", "\\1", tissue_s)) %>%
mutate(tissue_s = replace(tissue_s, tissue_s== "N/A" & cell_line == "iPS", "pluripotent")) %>%
mutate(tissue_s = replace(tissue_s, tissue_s== "N/A" & cell_line == "NPC", "brain"))
## use log2 count for PCA and center the data for each sample.
X<- t(scale(t(log2(shallow.counts+1)),center=TRUE,scale=FALSE))
sv<- svd(t(X))
U<- sv$u
V<- sv$v
D<- sv$d
Z<- t(X)%*%V
## put PC1 and PC2 into a dataframe and add the cell line info side by side
pc_dat<- data.frame(cell.line = shallow.run.cell$cell_line, PC1 = Z[,1], PC2 = Z[,2])
## make figure with ggplot2
library(ggplot2)
library(ggthemes)
ggplot(pc_dat, aes(x=PC1, y=PC2, col=cell.line)) +
geom_point() +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color="black", size = 0.6),
axis.line.y = element_line(color="black", size = 0.6))
Well, it is very similar to the original figure in the paper except the color and shape of the points. I mannually selected colors from http://www.color-hex.com/
p<- ggplot(pc_dat, aes(x=PC1, y=PC2)) +
geom_point(aes(color = cell.line, shape = cell.line)) +
scale_colour_manual(values=c("#24cdd4","#9e379f","#6673b7","#9FD175", "#9FD175",
"#9FD175", "#ea018d", "#0d060f", "#eede51", "#6673b7", "#9FD175")) +
scale_shape_manual(values=c(19, 19, 4, 19, 8, 3,19,19,19,19, 4))+
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line.x = element_line(color="black", size = 0.6),
axis.line.y = element_line(color="black", size = 0.6))
p
Let’s fix the legend a bit to mimic the original figure.
p + guides(col = guide_legend(ncol=2)) + theme(legend.position = c(.9, .75))
It is still not quite the same as the original figure as the legend of cell types should be grouped together. I will leave it to you to google and find a way to change the order of the legend. I usually import the figure to inkscape and mannually change the positions and/or orders. You may also notice that the X and Y axis scales are different from the orignal paper, I guess it is because that the authors used RSEM to process the RNA-seq data and used TPM (rather than raw counts in my example) to plot PCA. Nevertheless, the distribution of the clusters are very similar.
Now it is time to reproduce figure 3b.
From the paper:
Hierarchical clustering of 65 single cells across 500 genes with the strongest PC1–PC3 loading scores identifies four major groups of cells (I–IV), and
k-means clustering identifies three clusters of genes (red, yellow and green)
So figure 3b only used cells that are from brain tissues. I want to show you how to do clustering for all the 11 cell types first then we can adventure ourselves by focusing on the brain tissue cells only (NPC, GW16, GW21, GW21+3). It is just a matter of subsetting only the cell lines that are of brain origin.
Before drawing a biclustered heatmap, one needs to select features (genes) first becasue hierachincal clustering can be quite time/memory consuming if you have too many features. In addition, many features do not contribute to the clustering, so we can drop them.
There are many ways to select the features, one of the most commonly used method is to use the most variable genes across samples.
library(genefilter)
rv<- rowVars(X)
## select the top 250 most variable genes for clustering
idx<- order(-rv)[1:250]
library(ComplexHeatmap) ## a very nice package from Zuguang Gu, he is very responsive.
Heatmap(X[idx,], name = "log2 RNAseq\ncounts scaled",
show_row_names = FALSE, show_column_names = FALSE,
row_dend_reorder = TRUE, column_dend_reorder = TRUE)
There are several things to note. First, the X matrix is centered, and if you check the distribution of the data, you will see there are no skewness. otherwise, you will need to creat a color mapping. see a post from Mick Recreating a famous visualisation (using gplots by Mick Watson as well)
Second, I used the default distance measure (Euclidean distance) and linkage method (complete). However, the choice of distance meause and likage method can have great effect on your clustering.
Make sure you read this paper in Nature biotechnology: How does gene expression clustering work?
some quotes from the paper:
Single-linkage clustering performs abysmally on most real-world data sets, and gene expression data is no exception. It is included in almost every single clustering package ‘for completeness,’ but should probably be removed altogether.
So, never use single linkage!
Contrary to conventional expectation, complete linkage seems to outperform average linkage.
use complete linkage
Euclidean distance, Pearson correlation and ‘uncentered’ correlation (angular separation) all seem to work reasonably well as distance measures. Euclidean distance may be more appropriate for log ratio data, whereas Pearson correlation seems to work better for absolute-valued (e.g., Affymetrix) data.
I want to add some points for the third quote. If one wants to find genes that have the same trend: all (relatively for the same gene) high in some samples and all low in other samples rather than have similar obsolute values (euclidean distance) for different genes, one should use 1 -cor(X) as distance measure.
I highly recommend you to read Chapter 10 for unsupervised clustering in the book of An Introduction to statistical learning. The book is free in PDF.
Also please read You probably don’t understand heatmaps by Mick Watson.
In the 2014 Nature biotechnology paper:
Genes are clustered on the basis of Pearson correlation. Samples are clustered on the basis of a Euclidian distance matrix with complete linkage.
Heatmap(X[idx,], name = "log2 RNAseq\ncounts scaled",
show_row_names = FALSE, show_column_names = FALSE,
row_dend_reorder = TRUE, column_dend_reorder = TRUE,
clustering_distance_rows = "pearson",
clustering_distance_columns = "euclidean",
clustering_method_rows = "complete",
clustering_method_columns = "complete")
Now, I want to add the cell line name and tissue of origin as annotations.
df<- data.frame(cell.line = shallow.run.cell$cell_line, tissue = shallow.run.cell$tissue_s)
## choose some colors to repesent cell lines and tissues
library(RColorBrewer)
## how many we need?
table(df$cell.line)
##
## 2338 2339 BJ GW16 GW21 GW21+3 HL60 iPS K562 Kera
## 22 17 37 26 8 16 54 24 42 40
## NPC
## 15
table(df$tissue)
##
## brain foreskin mammary gland peripheral blood
## 65 77 22 71
## pleural effusion pluripotent
## 42 24
## 6 for tissues and 11 for cell lines
tissue.cols<- brewer.pal(6, "Dark2")
cell.cols<- brewer.pal(11, "Set3")
## make a named vector from two vectors
cell.cols.assigned<- setNames(cell.cols, unique(as.character(df$cell.line)))
tissue.cols.assigned<- setNames(tissue.cols, unique(as.character(df$tissue)))
## Heatmap annotation
ha<- HeatmapAnnotation(df = df,
col = list(cell.line = cell.cols.assigned,
tissue = tissue.cols.assigned))
Heatmap(X[idx,], name = "log2 RNAseq\ncounts scaled",
show_row_names = FALSE, show_column_names = FALSE,
row_dend_reorder = TRUE, column_dend_reorder = TRUE,
clustering_distance_rows = "pearson",
clustering_distance_columns = "euclidean",
clustering_method_rows = "complete",
clustering_method_columns = "complete",
top_annotation = ha)
Indeed, shallow sequencing can still seprate cells of different origin very well.
brain.run<- shallow.run %>% filter(cell_line %in% c("NPC", "GW16", "GW21", "GW21+3"))
brain.counts<- assay(rse)[, colnames(rse) %in% brain.run$run]
Y<- t(scale(t(log2(brain.counts+1)),center=TRUE,scale=FALSE))
##SVD to get PCs
sv.brain<- svd(t(Y))
U.brain<- sv.brain$u
V.brain<- sv.brain$v
D.brain<- sv.brain$d
Z.brain<- t(Y)%*%V.brain
k th column of Z, Z(k), is the k th PC.(the k th pattern)
PC loadings: V k th column of V, V(k), is the k th PC loading (feature weights). aka. the k th column of V encodes the associated k th pattern in feature space.
PC loadings: U k th column of U, U(k), is the k th PC loading (observation weights). aka. the k th column of U encodes the associated k th pattern in observation space.
Diagnal matrix: D diagnals in D: d(k) gives the strength of the k th pattern.
In the previous bi-clustering heatmap, I used most variable genes across samples to select genes (feature selection). I will use PC loadings to select genes for this example.
## x is the matrix D from SVD
variance_explained_each_PC<- function(x){
var.list= list()
varex = 0
cumvar = 0
denom = sum(x^2)
for(i in 1:length(x)){
varex[i] = x[i]^2/denom
cumvar[i] = sum(x[1:i]^2)/denom
}
var.list$varex<- varex
var.list$cumvar<- cumvar
var.list
}
### screen plot
screen.plot<- function(var.list){
par(mfrow=c(1,2))
plot(1:length(var.list$varex), var.list$varex *100,type="h",
lwd=2,xlab="PC",ylab="% Variance Explained")
plot(1:length(var.list$cumvar),var.list$cumvar,type="h",
lwd=2,xlab="PC",ylab="Cummulative Variance Explained")
}
screen.plot(variance_explained_each_PC(D.brain))
You will see the first 3 PCs only explain 15% variance.
From the paper:
Hierarchical clustering of 65 single cells across 500 genes with the strongest PC1–PC3 loading scores identifies four major groups of cells (I–IV), and
k-means clustering identifies three clusters of genes (red, yellow and green)
The 500 topranked PCA genes were selected on the basis of the maximum absolute value of each gene loading score in the first three eigenvectors (PC1, PC2 and PC3)
select genes based on maximum absolute loadings for PC1, PC2 and PC3
## for PC1, PC2, and PC3, choose the maximum absulute loadings
variations.3PC<- apply(abs(V.brain[,1:3]), 1, max)
## order according to the loadings and choose the top 500 genes
genes.3PC<- order(-variations.3PC)[1:500]
## cell type/tissue information
brain.run.cell<- shallow.run.cell %>% filter(cell_line %in% c("NPC", "GW16", "GW21", "GW21+3"))
df.brain<- data.frame(cell.line = brain.run.cell$cell_line)
## choose some colors to repesent cell lines
## how many we need?
table(df.brain$cell.line)
##
## GW16 GW21 GW21+3 NPC
## 26 8 16 15
## 4 for cell lines, I select the color mannually http://www.color-hex.com/ to match the
## color in the original figure
## olor space is important for interpolating colors. By default (in ComplexHeatmap pacakge), colors are linearly interpolated in LAB color space, but you can select the color space in colorRamp2() function.
brain.cell.cols<- c("#A6FFB6", "#33a1f6", "#d16303", "#322f64")
#brain.cell.cols<- brewer.pal(4, "Dark2")
## make a named vector from two vectors
brain.cell.cols.assigned<- setNames(brain.cell.cols, unique(as.character(df.brain$cell.line)))
## Heatmap annotation
ha.brain<- HeatmapAnnotation(df = df.brain,
col = list(cell.line = brain.cell.cols.assigned))
library(circlize) ## for colorRamp2 color mapping
Heatmap(Y[genes.3PC, ], name = "log2 RNAseq\ncounts scaled",
col = colorRamp2(c(-10, 0, 10), c("Darkblue", "white", "red")),
show_row_names = FALSE,
show_column_names = FALSE,
row_dend_reorder = TRUE, column_dend_reorder = FALSE,
clustering_distance_rows = "pearson",
clustering_distance_columns = "euclidean",
clustering_method_rows = "complete",
clustering_method_columns = "complete",
top_annotation = ha.brain,
gap = unit(0.5, "mm"))
Note that in the original figure, the rows are samples and columns are genes. We need to rotate our heatmap clock-wise 90 degree to be comparable with the original one This figure is the closest that I can get comparing to the figure 3b.
When p>>n (we have way more genes than the samples), many featuers(genes) are irrelevant. PCA can perform very badly. Sparse PCA zero out irrelevant features from PC loadings. The advantage is that we can find important features that contribute to major patterns. Tipically, opitmize PCA criterion with sparsity-encouraging penalty of matrix V. It is an active area of research!
In other words, sparse PCA does feature selection for us, it retains only the features that drive the major pattern in the data. Again, we may or may not need to do sparse PCA to do feature selection. We can use some pre-knowledges, say, oncogenes and tumor-suppressors, differentially expressed genes across observations and most variable genes across samples etc.
library("PMA")
## we also look at the first 3 PCs, we centered the data before hand
## sumabsv
## How sparse do you want v to be? This is the sum of absolute values of elements of v. It must be between 1 and square root of number of columns (No. of genes in this case) of data. The smaller it is, the sparser v will be.
spc<- SPC(t(Y),sumabsv=10, K=3, center = FALSE)
## 1234567891011121314151617181920
## 1234567891011121314151617181920
## 1234567891011121314151617181920
how many genes selected? we look at the V matrix again, if the weights are zeros, they are not important features. sparse PCA zero them out. For each of the four PCs, how many features are retained?
apply(spc$v!=0, 2, sum)
## [1] 183 180 178
## genes retained for the PC1, PC2 and PC3
PC1.genes<- which(spc$v[,1] !=0)
PC2.genes<- which(spc$v[,2] !=0)
PC3.genes<- which(spc$v[,3] !=0)
selected.genes<- unique(c(PC1.genes, PC2.genes, PC3.genes))
## total 493 genes for PC1, PC2 and PC3 were selected
## heatmap using these 493 genes, kmeans for genes (k=4), I visually selected for k=4 after testing around.
Heatmap(Y[selected.genes, ], name = "log2 RNAseq\ncounts scaled",
col = colorRamp2(c(-10, 0, 10), c("Darkblue", "white", "red")),
show_row_names = FALSE,
show_column_names = FALSE,
row_dend_reorder = TRUE, column_dend_reorder = TRUE,
clustering_distance_rows = "pearson",
clustering_distance_columns = "euclidean",
clustering_method_rows = "complete",
clustering_method_columns = "complete",
top_annotation = ha.brain,
gap = unit(0.5, "mm"),
km = 4)
## select genes based on variance across samples
rv.brain<- rowVars(Y)
## select the top 500 most variable genes for clustering
idx.brain<- order(-rv.brain)[1:500]
library(circlize) ## for colorRamp2 function to map values to colors
## kmeans for genes k=3
Heatmap(Y[idx.brain, ], name = "log2 RNAseq\ncounts scaled",
col = colorRamp2(c(-10, 0, 10), c("Darkblue", "white", "red")),
show_row_names = FALSE,
show_column_names = FALSE,
row_dend_reorder = TRUE, column_dend_reorder = TRUE,
clustering_distance_rows = "pearson",
clustering_distance_columns = "euclidean",
clustering_method_rows = "ward.D",
clustering_method_columns = "ward.D",
top_annotation = ha.brain,
gap = unit(0.5, "mm"),
km = 3)
It is not the same as the figure in the paper. Now, you probably understand how important the feature selection is (which genes we choose to generate heatmaps); how important to center the data or not; and how important to choose clustering distance measures and linkage methods.
I am still waiting to see how to exactly reproduce figure 3b in the paper. Any comments are very wellcome!
Just for fun and for better color scheme, I will use viridis package to draw the heatmaps.
library(viridis)
Heatmap(Y[genes.3PC, ], name = "log2 RNAseq\ncounts scaled",
col = colorRamp2(c(-10, 0, 10), viridis(3)),
show_row_names = FALSE,
show_column_names = FALSE,
row_dend_reorder = TRUE, column_dend_reorder = FALSE,
clustering_distance_rows = "pearson",
clustering_distance_columns = "euclidean",
clustering_method_rows = "complete",
clustering_method_columns = "complete",
top_annotation = ha.brain,
gap = unit(0.5, "mm"))
Further readings:
Principal component analysis : the basics you should read - R software and data mining
Cluster Analysis in R - Unsupervised machine learning
You probably don’t understand heatmaps by Mick Watson
Recreating a famous visualisation (using gplots by Mick Watson as well)
Recreating the vaccination heatmaps in R using ggplot2
Why do you look at the speck in your sister’s quilt plot and pay no attention to the plank in your own heat map? by Lior Patcher
posts on biostars: understanding the clustering in heatmap
Making a heatmap with R by Dave Tang
Using RColorBrewer to colour your figures in R by R bloggers
how to change heatmap2 color range in r
Customizing gplots heatmap.2 - color range for heatmap and legend for RowSideColors