1 Introduction

An Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), initially described by Buenrostro et al. (2013), captures open chromatin sites and is able to reveal the interplay between genomic locations of open chromatin, DNA-binding proteins, individual nucleosomes and chromatin compaction at nucleotide resolution .

Later, Corces et al. (2017) described Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility able to generates chromatin accessibility profiles from archival frozen tissue samples, which was recently used to investigate the genome-wide chromatin accessibility profiles of 410 tumor samples spanning 23 cancer types from The Cancer Genome Atlas (TCGA) (Corces et al. 2018). The integration of this data with other TCGA multi-omic data was able to provide numerous discoveries which helped to understand the noncoding genome in cancer to advance diagnosis and therapy.

2 How to cite the data

If you use TCGA ATAC-Seq data in your analysis, please cite the following paper:

CORCES, M. Ryan, et al. The chromatin accessibility landscape of primary human cancers. Science, 2018, vol. 362, no 6413, p. eaav1898. https://doi.org/10.1126/science.aav1898

3 Workshop description

In this workshop, we present in more detail this TCGA ATAC-seq data, which is available to the public through the Genomic Data Commons Portal (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG), and demonstrate how this data can be analyzed within the R/Bioconductor environment.

For more information about the data please visit GDC publication website and read the paper:

3.1 Pre-requisites

3.2 Workshop Participation

Students will have a chance to download ATAC-Seq cancer-specific peaks from GDC and import to R. After, esophageal adenocarcinoma (ESAD) vs esophageal squamous cell carcinoma (ESCC) analysis is performed and the results are visualized as a volcano plot and a heatmap.

3.3 Goals and objectives

  • Download and understand the ATAC-Seq data
  • Compare two different groups of samples ATAC-Seq data

3.4 Enviroment: R libraries

The code below will load all the required R libraries to perform the workshop. Their version is available at the session information section.

# to read txt files
library(readr)

# to transform data into GenomicRanges
library(GenomicRanges)

# other ones used to prepare the data
library(tidyr)
library(dplyr)
library(SummarizedExperiment)

# For the t.test loop
library(plyr)

# For easy volcano plot
library(TCGAbiolinks)

# For heatmap plot
library(ComplexHeatmap)
library(circlize)
library(pheatmap)

# For the bigwig plot
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(ELMER)

If one of them is not available in your R environment, you can easily install them with BiocManager as shown below.

if (!requireNamespace("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")
}
BiocManager::install(version = "3.13",ask = FALSE) # Install last version of Bioconductor

list.of.packages <- c(
  "karyoploteR",
  "circlize",
  "ComplexHeatmap",
  "TxDb.Hsapiens.UCSC.hg38.knownGene",
  "TCGAbiolinks",
  "plyr",
  "dplyr",
  "tidyr",
  "GenomicRanges",
  "readr",
  "ELMER"
)
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) BiocManager::install(new.packages)

4 Data

The files used in this workshop are available at at google drive which contains some of the TCGA ATAC-seq data from https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.

5 Understanding the data: peaks sets

The ATAC-Seq data used in this workshop is available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG. And, it is important to highlight, that this data has been aligned against Homo sapiens (human) genome assembly GRCh38 (hg38).

The paper introduces 3 sets of peaks: 1) sample peak set, 2) cancer type-specific peak set and 3) pan-cancer peak set.

  1. Sample peak set description from the paper:

Overlapping peaks called within a single sample were handled using an iterative removal procedure. First, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then, this process iterates to the next most significant peak and so on until all peaks have either been kept or removed due to direct overlap with a more significant peak. This resulted in a set of fixed-width peaks for each sample which we refer to here as a “sample peak set.”

Since the samples varied in read depth or quality, to compare samples they developed a robust method to normalize peak significance scores across samples and cancer types, which uses MACS2 peak scores (-log10(p-value)) for each sample and converted them to a “score per million” by dividing each individual peak score by the sum of all of the peak scores in the given sample divided by 1 million.

  1. “cancer type-specific peak set” description:

First, all “sample peak sets” from a given cancer type were combined into a cumulative peak set and trimmed for overlap using the same iterative procedure mentioned above. Contains all of the reproducible peaks observed in an individual cancer type, filtered to those which were observed in at least two samples with a score per million value \(>=5\).

  1. “pan-cancer peak set” description:

First, there was a re-normalization of the score per million scores for each “cancer type-specific peak set” to prevent cancer types with more samples or higher quality samples from comprising a disproportionate share of the union peak set. The resulting peak sets were combined with the same iterative procedure mentioned above. These peaks are reproducible peaks from all cancer types that could then be used for cross-cancer comparisons.

All these descriptions can be found at the supplemental material from Corces et al. (2018) in the Peak calling section.

5.1 Comparing pan-cancer peak set and cancer type-specific peak set

If we check the both sets (Files downloaded from GDC: “All cancer type-specific peak sets. [ZIP]” and “Pan-cancer peak set. [TXT]”), the set of peaks “pan-cancer peak set” consists of \(~562K\) peaks, and it contains a subset of each “cancer type-specific peak set.” We show an example for Esophageal carcinoma (ESCA) below.

# ESCA specific peaks set
atac_esca <- readr::read_tsv("Data/ESCA_peakCalls.txt")
head(atac_esca)
## # A tibble: 6 x 8
##   seqnames   start     end name     score annotation percentGC percentAT
##   <chr>      <dbl>   <dbl> <chr>    <dbl> <chr>          <dbl>     <dbl>
## 1 chr1     1290095 1290596 ESCA_107  2.46 3' UTR         0.677     0.323
## 2 chr1     1291115 1291616 ESCA_108  2.59 3' UTR         0.703     0.297
## 3 chr1     1291753 1292254 ESCA_109  7.58 3' UTR         0.639     0.361
## 4 chr1     1440824 1441325 ESCA_160  4.47 3' UTR         0.659     0.341
## 5 chr1     1630188 1630689 ESCA_179 20.6  3' UTR         0.729     0.271
## 6 chr1     2030218 2030719 ESCA_341 15.6  3' UTR         0.619     0.381
message("Number of peaks: ", nrow(atac_esca))
## Number of peaks: 127055
message("Range score: ", paste(range(atac_esca$score),collapse = " - "))
## Range score: 1.82339062289625 - 1476.55719343776
# pan-cancer peak set
atac_pan <- readr::read_tsv("Data/TCGA-ATAC_PanCancer_PeakSet.txt")
head(atac_pan)
## # A tibble: 6 x 7
##   seqnames     start       end name      score annotation percentGC
##   <chr>        <dbl>     <dbl> <chr>     <dbl> <chr>          <dbl>
## 1 chr1        906012    906513 ACC_10     7.17 Intron         0.613
## 2 chr2     112541661 112542162 ACC_10008 22.0  Promoter       0.555
## 3 chr1      21673421  21673922 ACC_1001   6.46 Distal         0.509
## 4 chr2     112584205 112584706 ACC_10013 43.2  Promoter       0.587
## 5 chr2     112596243 112596744 ACC_10016  5.43 Intron         0.491
## 6 chr1      21725692  21726193 ACC_1002   5.20 Intron         0.405
# check the names of the PanCAN set of peaks
plyr::count(stringr::str_split(atac_pan$name,"_",simplify = T)[,1])
##       x  freq
## 1   ACC 29311
## 2  BLCA 25337
## 3  BRCA 49748
## 4  CESC 14358
## 5  CHOL 11819
## 6  COAD 25404
## 7  ESCA 13237
## 8   GBM 15394
## 9  HNSC 16651
## 10 KIRC 15067
## 11 KIRP 24324
## 12  LGG 23836
## 13 LIHC 35787
## 14 LUAD 23729
## 15 LUSC 22195
## 16 MESO 22958
## 17 PCPG 31372
## 18 PRAD 30067
## 19 SKCM 36591
## 20 STAD 17358
## 21 TGCT 26120
## 22 THCA 31568
## 23 UCEC 20478
message("Range score: ", paste(range(atac_pan$score),collapse = " - "))
## Range score: 1.022347327 - 2527.059401

However, it is important to highlight that the “pan-cancer peak set” will keep the most significant peaks (highest score) for the overlapping peaks. In other words, the name in the “pan-cancer peak set” consists of the one cancer-specific one with highest score. If we check the regions overlap of the ESCA peaks, we can see that the majority of the peaks are still within the PAN-can, but they are higher in another cancer type.

# We will be using a GRanges object a data structure to hangle genomic ranges
# This data structure provides access to useful functions such as 
# promoters, resize, findOverlps, subsetByOverlaps
# More information at:
# https://www.bioconductor.org/help/course-materials/2015/Uruguay2015/V3-GenomicRanges.html#genomicranges
atac_esca.gr <- makeGRangesFromDataFrame(atac_esca,keep.extra.columns = T)
atac_pan.gr <- makeGRangesFromDataFrame(atac_pan,keep.extra.columns = T)
length(subsetByOverlaps(atac_esca.gr,atac_pan.gr))
## [1] 126935

5.2 Checking an overlapping peak

So, we will check an overlaping peak. The named ESCA_17603" peak is not within the PanCAN set of peaks, because it overlaps the “ACC_10008” peak, which has a higher score.

"ESCA_17603" %in% atac_pan.gr$name
## [1] FALSE
subsetByOverlaps(atac_pan.gr[atac_pan.gr$name == "ACC_10008"],atac_esca.gr)
## GRanges object with 1 range and 4 metadata columns:
##       seqnames              ranges strand |        name     score  annotation
##          <Rle>           <IRanges>  <Rle> | <character> <numeric> <character>
##   [1]     chr2 112541661-112542162      * |   ACC_10008   22.0306    Promoter
##       percentGC
##       <numeric>
##   [1]   0.55489
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
subsetByOverlaps(atac_esca.gr,atac_pan.gr[atac_pan.gr$name == "ACC_10008"])
## GRanges object with 1 range and 5 metadata columns:
##       seqnames              ranges strand |        name     score  annotation
##          <Rle>           <IRanges>  <Rle> | <character> <numeric> <character>
##   [1]     chr2 112541649-112542150      * |  ESCA_17603   6.25799    Promoter
##       percentGC percentAT
##       <numeric> <numeric>
##   [1]  0.556886  0.443114
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

5.3 Peaks size

Also, it is important to note that the peaks size is the same. Each peak has a size of 502bp. The main reason to use fixed-width peaks is stated in the supplemental material:

We chose to use fixed-width peaks because (i) it makes count based and motif focused analyses less biased to large peaks and (ii) with large datasets merging peak sets to obtain a union peak set can lead to many peaks being merged into one very large peak, limiting our ability to resolve independent peaks.

unique(width(atac_pan.gr))
## [1] 502
unique(width(atac_esca.gr))
## [1] 502

5.4 Summary

In summary, in the pan-can set the ESCA named peaks will be the ones that have the strongest signal on the ESCA samples when compared to the other cancer types, which will be a subset of all ESCA peaks. So, if you are looking for all ATAC-Seq ESCA peaks identified in at least two samples the cancer-specific set should be used.

6 Analysis: Using ATAC-Seq counts to compare two groups

Through the next section we will load the normalized counts data and compare two groups of samples, to identify which peaks are stronger in a given group compared to the other one.

The main two files used in this section are:

  1. Normalized ATAC-Seq insertion counts within the pan-cancer peak set. Recommended format. [RDS]
  2. All cancer type-specific count matrices in normalized counts. [ZIP]

In the code below, we are showing the beginning of the objects. It is important to highlight that the samples are using Stanford UUID instead of TCGA barcodes and each patient normally has two replicates.

atac_esca_norm_ct <- readr::read_tsv("Data/ESCA_log2norm.txt")
atac_esca_norm_ct[1:4,1:8]
## # A tibble: 4 x 8
##   seqnames  start    end name   score ESCA_1E6AC686_96C2_4… ESCA_1E6AC686_96C2_…
##   <chr>     <dbl>  <dbl> <chr>  <dbl>                 <dbl>                <dbl>
## 1 chr1      10180  10679 ESCA_1  2.08                  1.44                 1.22
## 2 chr1     180641 181140 ESCA_2  2.17                  2.55                 2.23
## 3 chr1     181193 181692 ESCA_3  6.63                  2.10                 1.98
## 4 chr1     184246 184745 ESCA_4  2.59                  1.44                 1.16
## # … with 1 more variable:
## #   ESCA_1FEF5E19_2351_4D46_99B4_63F0637ABF17_X010_S08_L063_B1_T1_P016 <dbl>

We will change the samples names to TCGA barcodes using the file “Lookup table for various TCGA sample identifiers. [TXT]” from GDC.

gdc.file <- "https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c"
samples.ids <- readr::read_tsv(gdc.file, col_types = readr::cols())
samples.ids$sample <- substr(samples.ids$Case_ID,1,16) # Get first 16 characters
head(samples.ids)
## # A tibble: 6 x 6
##   bam_prefix        stanfordUUID    aliquot_id    Case_UUID    Case_ID   sample 
##   <chr>             <chr>           <chr>         <chr>        <chr>     <chr>  
## 1 BRCA-000CFD9F-AD… 000CFD9F-ADDF-… TCGA-A7-A13F… 2cf68894-16… TCGA-A7-… TCGA-A…
## 2 BRCA-000CFD9F-AD… 000CFD9F-ADDF-… TCGA-A7-A13F… 2cf68894-16… TCGA-A7-… TCGA-A…
## 3 PCPG-007124EC-1F… 007124EC-1F9B-… TCGA-RM-A68W… 1a1cf490-8b… TCGA-RM-… TCGA-R…
## 4 PCPG-007124EC-1F… 007124EC-1F9B-… TCGA-RM-A68W… 1a1cf490-8b… TCGA-RM-… TCGA-R…
## 5 STAD-00DFAA4D-DE… 00DFAA4D-DE64-… TCGA-BR-A4J1… e9a98a44-83… TCGA-BR-… TCGA-B…
## 6 STAD-00DFAA4D-DE… 00DFAA4D-DE64-… TCGA-BR-A4J1… e9a98a44-83… TCGA-BR-… TCGA-B…
# match the Stanford UUIDs from the atac_esca_norm_ct column names to the
# column bam_prefix in the samples.ids table and get the TCGA ID (column Case_ID)
# from it to replace as column names
# We remove the first 5 columns since the are not samples names
standfordUUID <- colnames(atac_esca_norm_ct)[-c(1:5)]
colnames(atac_esca_norm_ct)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",standfordUUID),samples.ids$bam_prefix)]
atac_esca_norm_ct[1:4,1:8]
## # A tibble: 4 x 8
##   seqnames  start    end name   score `TCGA-IG-A51D-01A-31… `TCGA-IG-A51D-01A-3…
##   <chr>     <dbl>  <dbl> <chr>  <dbl>                 <dbl>                <dbl>
## 1 chr1      10180  10679 ESCA_1  2.08                  1.44                 1.22
## 2 chr1     180641 181140 ESCA_2  2.17                  2.55                 2.23
## 3 chr1     181193 181692 ESCA_3  6.63                  2.10                 1.98
## 4 chr1     184246 184745 ESCA_4  2.59                  1.44                 1.16
## # … with 1 more variable: TCGA-LN-A9FQ-01A-11-A617-42 <dbl>
# It is important to notice that our mapping produced replicated column names,
# for that reason the code using dplyr R package (which is more elegant) would not work 
atac_esca_norm_ct <- atac_esca_norm_ct %>% rename_with(
  .fn = function(standfordUUID){
    samples.ids$Case_ID[match(gsub("_","-",standfordUUID),samples.ids$bam_prefix)]
  },.cols = starts_with("ESCA_"))

We will next get the TCGA metadata information from GDC using the function TCGAbiolinks:::colDataPrepare.

atac <- atac_esca_norm_ct
non.cts.idx <- 1:5
samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(non.cts.idx)]))

samples.map <- paste0(
      samples.info$primary_diagnosis,"-",
      samples.info$sample
    ) %>% 
      { gsub(",| |NOS","",.) }  %>% # substitutes spaces by no space, and Remove NOS
      { gsub("Adenocarcinoma","ESAD",.) }  %>% # substitutes Adenocarcinoma to ESAD
      { gsub("Squamous cell carcinoma","ESCC",.) }  # substitutes Squamous cell carcinoma to ESCC

colnames(atac)[-c(non.cts.idx)] <- samples.map[match(substr(colnames(atac)[-c(non.cts.idx)],1,16),substr(samples.map,6,21))]

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# create Summarized Experiment object  
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

# Matrix with data information
counts <- atac[,-c(1:5)]

# Genomic ranges information
rowRanges <- makeGRangesFromDataFrame(atac)
rowRanges$score <- atac$score
rowRanges$name <- atac$name
names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")

# samples information
colData <- DataFrame(unique(left_join(samples.info,samples.ids)))

esca.rse <- SummarizedExperiment(
  assays = SimpleList(
    log2norm = as.matrix(counts)
  ),
  rowRanges = rowRanges, 
  colData = colData
)

# Since we have two samples for each patient we will rename them as rep1 and rep2
duplicated.idx <- duplicated(colnames(esca.rse))
colnames(esca.rse)[!duplicated.idx] <- paste0(colnames(esca.rse)[!duplicated.idx],"_rep1")
colnames(esca.rse)[duplicated.idx] <- paste0(colnames(esca.rse)[duplicated.idx],"_rep2")
esca.rse
## class: RangedSummarizedExperiment 
## dim: 126935 33 
## metadata(0):
## assays(1): log2norm
## rownames(126935): ESCA_1_chr1_10180_10679 ESCA_2_chr1_180641_181140 ...
##   ESCA_126943_chrX_155985136_155985635
##   ESCA_126944_chrX_156030008_156030507
## rowData names(2): score name
## colnames(33): ESCC-TCGA-IG-A51D-01A_rep1 ESCC-TCGA-IG-A51D-01A_rep2 ...
##   ESAD-TCGA-M9-A5M8-01A_rep1 ESAD-TCGA-M9-A5M8-01A_rep2
## colData names(140): sample patient ... Case_UUID Case_ID
colnames(esca.rse)
##  [1] "ESCC-TCGA-IG-A51D-01A_rep1" "ESCC-TCGA-IG-A51D-01A_rep2"
##  [3] "ESCC-TCGA-LN-A9FQ-01A_rep1" "ESCC-TCGA-LN-A9FQ-01A_rep2"
##  [5] "ESAD-TCGA-L7-A6VZ-01A_rep1" "ESAD-TCGA-L7-A6VZ-01A_rep2"
##  [7] "ESAD-TCGA-L5-A4OE-01A_rep1" "ESAD-TCGA-L5-A4OE-01A_rep2"
##  [9] "ESCC-TCGA-LN-A49O-01A_rep1" "ESCC-TCGA-LN-A49O-01A_rep2"
## [11] "ESAD-TCGA-IG-A7DP-01A_rep1" "ESAD-TCGA-IG-A7DP-01A_rep2"
## [13] "ESCC-TCGA-LN-A4A5-01A_rep1" "ESCC-TCGA-LN-A4A5-01A_rep2"
## [15] "ESCC-TCGA-IG-A4P3-01A_rep1" "ESCC-TCGA-IG-A3YA-01A_rep1"
## [17] "ESCC-TCGA-IG-A3YA-01A_rep2" "ESAD-TCGA-IG-A4QS-01A_rep1"
## [19] "ESAD-TCGA-IG-A4QS-01A_rep2" "ESCC-TCGA-LN-A7HX-01A_rep1"
## [21] "ESCC-TCGA-LN-A7HX-01A_rep2" "ESCC-TCGA-LN-A8HZ-01A_rep1"
## [23] "ESCC-TCGA-LN-A4A2-01A_rep1" "ESCC-TCGA-LN-A4A2-01A_rep2"
## [25] "ESCC-TCGA-LN-A4MR-01A_rep1" "ESCC-TCGA-IG-A625-01A_rep1"
## [27] "ESCC-TCGA-IG-A625-01A_rep2" "ESCC-TCGA-LN-A49W-01A_rep1"
## [29] "ESCC-TCGA-LN-A49W-01A_rep2" "ESAD-TCGA-IC-A6RE-01A_rep1"
## [31] "ESAD-TCGA-IC-A6RE-01A_rep2" "ESAD-TCGA-M9-A5M8-01A_rep1"
## [33] "ESAD-TCGA-M9-A5M8-01A_rep2"

6.1 Comparing ESCC vs ESAD ATAC-Seq

A t-test will be used to identify the peaks that have a significant different mean counts between the ESCC and ESAD samples.

escc.idx <- which(esca.rse$primary_diagnosis == "Squamous cell carcinoma, NOS")
esad.idx <- which(esca.rse$primary_diagnosis == "Adenocarcinoma, NOS")

result <- plyr::adply(
  .data = assay(esca.rse),
  .margins = 1,
  .fun = function(peak){
    results <- t.test(peak[escc.idx],peak[esad.idx],conf.level = TRUE)
    return(
      tibble::tibble(
        "raw_p_value"= results$p.value,
        "ESCC_minus_ESAD" = results$estimate[1] - results$estimate[2])
    )
  }, .progress = "time", .id = "peak")

result$FDR <- stats::p.adjust(result$raw_p_value,method = "fdr")

6.2 Volcano plot of t-test analysis

It is possible to visualize the results of a t.test as a volcano plot, which can be used to better select a cut-off for the significant ATAC-Seq peaks.

fdr.cut.off <- 0.0001
diff.cut.off <- 2.5

TCGAbiolinks:::TCGAVisualize_volcano(
  x = result$ESCC_minus_ESAD,
  y = result$FDR, 
  title =  paste0(
    "Volcano plot - ATAC-Seq peaks ",
    "difference in ", 
    "ESCC vs ESAD\n"
  ),
  filename = NULL,
  label =  c(
    "Not Significant",
    paste0("High in ESCC (vs ESAD)"),
    paste0("Low in ESCC (vs ESAD)")
  ),
  ylab =  expression( paste(
    -Log[10]," (FDR) [two tailed t-test] - cut-off FDR < ",fdr.cut.off
  )),
  xlab =  expression(paste(
    "Log2(Counts) difference - cut-off log2 delta(cts) > ",diff.cut.off
  )),
  x.cut = diff.cut.off, 
  y.cut = fdr.cut.off
)

# How many peaks pass our cut-offs
table(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)
## 
##  FALSE   TRUE 
## 125988    947

6.3 Heatmap of differential significant peaks

It is possible to visualize the results of a t.test as a volcano plot, which can be used to better select a cut-off for the significant ATAC-Seq peaks. In this example we will use the \(FDR < 0.01\) and \(\Delta Log_2 Counts > 2\).

Also, we will be using TCGAbiolinks package to plot it, but you can do it using ggplot2 or plotly for an interactive volcano plot. You can also find some examples at https://huntsmancancerinstitute.github.io/hciR/volcano.html.

library(ComplexHeatmap)
library(circlize)
# Colors of the heatmap
pal_atac <- colorRampPalette(
  c('#3361A5',
    '#248AF3',
    '#14B3FF',
    '#88CEEF',
    '#C1D5DC',
    '#EAD397',
    '#FDB31A',
    '#E42A2A',
    '#A31D1D')
)(100)

# Upper track with the samples annotation
ha = HeatmapAnnotation(
  df = data.frame(
    "Group" = esca.rse$primary_diagnosis, 
    "Replicate" = stringr::str_match(colnames(esca.rse),"rep[0-9]?")
  ),
  show_annotation_name = T,
  col = list(
    Group = c(
      "Squamous cell carcinoma, NOS" =  "red", 
      "Adenocarcinoma, NOS" = "blue"
    )
  ),
  show_legend = T,
  annotation_name_side = "left",
  annotation_name_gp = gpar(fontsize = 6)
)

# Select significant peals to plot
plot.atac <- assay(esca.rse)[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]

# Define the color scale
col <- colorRamp2(
  seq(
    min(plot.atac), 
    max(plot.atac), 
    by = (max(plot.atac) - min(plot.atac))/99), 
  pal_atac
)

# Show the names of a peaks enriched in ESCA and on in ESCC
idx1 <- which(result$FDR < fdr.cut.off & result$ESCC_minus_ESAD > diff.cut.off)[1]
idx2 <- which(result$FDR < fdr.cut.off & result$ESCC_minus_ESAD < diff.cut.off)[1]

rows.annot <- rowAnnotation(foo = anno_mark(at = c(idx1,idx2), labels = rownames(plot.atac)[c(idx1,idx2)]))

# Plot the ATAC-Seq signals
ht_list <- Heatmap(
  plot.atac,
  name = "ATAC-seq log2(counts)", 
  col = col,
  column_names_gp = gpar(fontsize = 8),
  show_column_names = F,
  heatmap_legend_param = list(
    legend_direction = "horizontal",
    labels_gp = gpar(fontsize = 12), 
    title_gp = gpar(fontsize = 12)
  ),
  show_row_names = FALSE,
  cluster_columns = TRUE,
  use_raster = TRUE,
  raster_device = c("png"),
  raster_quality = 2,
  cluster_rows = T,
  right_annotation = rows.annot,
  row_title = paste0(
    sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)," ATAC-seq peaks"
  ),
  row_names_gp = gpar(fontsize = 4),
  top_annotation = ha,
  column_title_gp = gpar(fontsize = 12), 
  row_title_gp = gpar(fontsize = 12)
) 

options(repr.plot.width = 15, repr.plot.height = 8)
draw(
  ht_list,
  newpage = TRUE, 
  merge_legend = TRUE,
  column_title = paste0(
    "ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
    ",  Diff mean log2 Count > ",diff.cut.off,")"
  ),
  column_title_gp = gpar(fontsize = 12, fontface = "bold"),
  heatmap_legend_side = "bottom",
  annotation_legend_side = "bottom"
)