1 Introduction

An Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), initially described by Buenrostro et al. (2013), captures open chromatin sites and is able to reveal the interplay between genomic locations of open chromatin, DNA-binding proteins, individual nucleosomes and chromatin compaction at nucleotide resolution .

Later, Corces et al. (2017) described Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility able to generates chromatin accessibility profiles from archival frozen tissue samples, which was recently used to investigate the genome-wide chromatin accessibility profiles of 410 tumor samples spanning 23 cancer types from The Cancer Genome Atlas (TCGA) (Corces et al. 2018). The integration of this data with other TCGA multi-omic data was able to provide numerous discoveries which helped to understand the noncoding genome in cancer to advance diagnosis and therapy.

2 How to cite the data

If you use TCGA ATAC-Seq data in your analysis, please cite the following paper:

CORCES, M. Ryan, et al. The chromatin accessibility landscape of primary human cancers. Science, 2018, vol. 362, no 6413, p. eaav1898. https://doi.org/10.1126/science.aav1898

3 Workshop description

In this workshop, we present in more detail this TCGA ATAC-seq data, which is available to the public through the Genomic Data Commons Portal (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG), and demonstrate how this data can be analyzed within the R/Bioconductor environment.

For more information about the data please visit GDC publication website and read the paper:

CORCES, M. Ryan, et al. The chromatin accessibility landscape of primary human cancers. Science, 2018, vol. 362, no 6413, p. eaav1898. https://doi.org/10.1126/science.aav1898

3.1 Pre-requisites

Basic knowledge of R syntax
R version 4.0 or higher (https://cloud.r-project.org/)
Rstudio installed (https://www.rstudio.com/products/rstudio/download/#download)
Last version of bioconductor installed
In the end of this document there is also the version of the libraries used

3.2 Workshop Participation

Students will have a chance to download ATAC-Seq cancer-specific peaks from GDC and import to R. After, esophageal adenocarcinoma (ESAD) vs esophageal squamous cell carcinoma (ESCC) analysis is performed and the results are visualized as a volcano plot and a heatmap.

3.3 Goals and objectives

Download and understand the ATAC-Seq data
Compare two different groups of samples ATAC-Seq data

3.4 Enviroment: R libraries

The code below will load all the required R libraries to perform the workshop. Their version is available at the session information section.

# to read txt files
library(readr)

# to transform data into GenomicRanges
library(GenomicRanges)

# other ones used to prepare the data
library(tidyr)
library(dplyr)
library(SummarizedExperiment)

# For the t.test loop
library(plyr)

# For easy volcano plot
library(TCGAbiolinks)

# For heatmap plot
library(ComplexHeatmap)
library(circlize)
library(pheatmap)

# For the bigwig plot
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(ELMER)

If one of them is not available in your R environment, you can easily install them with BiocManager as shown below.

if (!requireNamespace("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")
}
BiocManager::install(version = "3.13",ask = FALSE) # Install last version of Bioconductor

list.of.packages <- c(
  "karyoploteR",
  "circlize",
  "ComplexHeatmap",
  "TxDb.Hsapiens.UCSC.hg38.knownGene",
  "TCGAbiolinks",
  "plyr",
  "dplyr",
  "tidyr",
  "GenomicRanges",
  "readr",
  "ELMER"
)
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) BiocManager::install(new.packages)

4 Data

The files used in this workshop are available at at google drive which contains some of the TCGA ATAC-seq data from https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.

5 Understanding the data: peaks sets

The ATAC-Seq data used in this workshop is available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG. And, it is important to highlight, that this data has been aligned against Homo sapiens (human) genome assembly GRCh38 (hg38).

The paper introduces 3 sets of peaks: 1) sample peak set, 2) cancer type-specific peak set and 3) pan-cancer peak set.

Sample peak set description from the paper:

Overlapping peaks called within a single sample were handled using an iterative removal procedure. First, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then, this process iterates to the next most significant peak and so on until all peaks have either been kept or removed due to direct overlap with a more significant peak. This resulted in a set of fixed-width peaks for each sample which we refer to here as a “sample peak set.”

Since the samples varied in read depth or quality, to compare samples they developed a robust method to normalize peak significance scores across samples and cancer types, which uses MACS2 peak scores (-log10(p-value)) for each sample and converted them to a “score per million” by dividing each individual peak score by the sum of all of the peak scores in the given sample divided by 1 million.

“cancer type-specific peak set” description:

First, all “sample peak sets” from a given cancer type were combined into a cumulative peak set and trimmed for overlap using the same iterative procedure mentioned above. Contains all of the reproducible peaks observed in an individual cancer type, filtered to those which were observed in at least two samples with a score per million value $>=5$ .

“pan-cancer peak set” description:

First, there was a re-normalization of the score per million scores for each “cancer type-specific peak set” to prevent cancer types with more samples or higher quality samples from comprising a disproportionate share of the union peak set. The resulting peak sets were combined with the same iterative procedure mentioned above. These peaks are reproducible peaks from all cancer types that could then be used for cross-cancer comparisons.

All these descriptions can be found at the supplemental material from Corces et al. (2018) in the Peak calling section.

5.1 Comparing pan-cancer peak set and cancer type-specific peak set

If we check the both sets (Files downloaded from GDC: “All cancer type-specific peak sets. [ZIP]” and “Pan-cancer peak set. [TXT]”), the set of peaks “pan-cancer peak set” consists of $~562K$ peaks, and it contains a subset of each “cancer type-specific peak set.” We show an example for Esophageal carcinoma (ESCA) below.

# ESCA specific peaks set
atac_esca <- readr::read_tsv("Data/ESCA_peakCalls.txt")
head(atac_esca)

## # A tibble: 6 x 8
##   seqnames   start     end name     score annotation percentGC percentAT
##   <chr>      <dbl>   <dbl> <chr>    <dbl> <chr>          <dbl>     <dbl>
## 1 chr1     1290095 1290596 ESCA_107  2.46 3' UTR         0.677     0.323
## 2 chr1     1291115 1291616 ESCA_108  2.59 3' UTR         0.703     0.297
## 3 chr1     1291753 1292254 ESCA_109  7.58 3' UTR         0.639     0.361
## 4 chr1     1440824 1441325 ESCA_160  4.47 3' UTR         0.659     0.341
## 5 chr1     1630188 1630689 ESCA_179 20.6  3' UTR         0.729     0.271
## 6 chr1     2030218 2030719 ESCA_341 15.6  3' UTR         0.619     0.381

message("Number of peaks: ", nrow(atac_esca))

## Number of peaks: 127055

message("Range score: ", paste(range(atac_esca$score),collapse = " - "))

## Range score: 1.82339062289625 - 1476.55719343776

# pan-cancer peak set
atac_pan <- readr::read_tsv("Data/TCGA-ATAC_PanCancer_PeakSet.txt")
head(atac_pan)

## # A tibble: 6 x 7
##   seqnames     start       end name      score annotation percentGC
##   <chr>        <dbl>     <dbl> <chr>     <dbl> <chr>          <dbl>
## 1 chr1        906012    906513 ACC_10     7.17 Intron         0.613
## 2 chr2     112541661 112542162 ACC_10008 22.0  Promoter       0.555
## 3 chr1      21673421  21673922 ACC_1001   6.46 Distal         0.509
## 4 chr2     112584205 112584706 ACC_10013 43.2  Promoter       0.587
## 5 chr2     112596243 112596744 ACC_10016  5.43 Intron         0.491
## 6 chr1      21725692  21726193 ACC_1002   5.20 Intron         0.405

# check the names of the PanCAN set of peaks
plyr::count(stringr::str_split(atac_pan$name,"_",simplify = T)[,1])

##       x  freq
## 1   ACC 29311
## 2  BLCA 25337
## 3  BRCA 49748
## 4  CESC 14358
## 5  CHOL 11819
## 6  COAD 25404
## 7  ESCA 13237
## 8   GBM 15394
## 9  HNSC 16651
## 10 KIRC 15067
## 11 KIRP 24324
## 12  LGG 23836
## 13 LIHC 35787
## 14 LUAD 23729
## 15 LUSC 22195
## 16 MESO 22958
## 17 PCPG 31372
## 18 PRAD 30067
## 19 SKCM 36591
## 20 STAD 17358
## 21 TGCT 26120
## 22 THCA 31568
## 23 UCEC 20478

message("Range score: ", paste(range(atac_pan$score),collapse = " - "))

## Range score: 1.022347327 - 2527.059401

However, it is important to highlight that the “pan-cancer peak set” will keep the most significant peaks (highest score) for the overlapping peaks. In other words, the name in the “pan-cancer peak set” consists of the one cancer-specific one with highest score. If we check the regions overlap of the ESCA peaks, we can see that the majority of the peaks are still within the PAN-can, but they are higher in another cancer type.

# We will be using a GRanges object a data structure to hangle genomic ranges
# This data structure provides access to useful functions such as 
# promoters, resize, findOverlps, subsetByOverlaps
# More information at:
# https://www.bioconductor.org/help/course-materials/2015/Uruguay2015/V3-GenomicRanges.html#genomicranges
atac_esca.gr <- makeGRangesFromDataFrame(atac_esca,keep.extra.columns = T)
atac_pan.gr <- makeGRangesFromDataFrame(atac_pan,keep.extra.columns = T)
length(subsetByOverlaps(atac_esca.gr,atac_pan.gr))

## [1] 126935

5.2 Checking an overlapping peak

So, we will check an overlaping peak. The named ESCA_17603" peak is not within the PanCAN set of peaks, because it overlaps the “ACC_10008” peak, which has a higher score.

"ESCA_17603" %in% atac_pan.gr$name

## [1] FALSE

subsetByOverlaps(atac_pan.gr[atac_pan.gr$name == "ACC_10008"],atac_esca.gr)

## GRanges object with 1 range and 4 metadata columns:
##       seqnames              ranges strand |        name     score  annotation
##          <Rle>           <IRanges>  <Rle> | <character> <numeric> <character>
##   [1]     chr2 112541661-112542162      * |   ACC_10008   22.0306    Promoter
##       percentGC
##       <numeric>
##   [1]   0.55489
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

subsetByOverlaps(atac_esca.gr,atac_pan.gr[atac_pan.gr$name == "ACC_10008"])

## GRanges object with 1 range and 5 metadata columns:
##       seqnames              ranges strand |        name     score  annotation
##          <Rle>           <IRanges>  <Rle> | <character> <numeric> <character>
##   [1]     chr2 112541649-112542150      * |  ESCA_17603   6.25799    Promoter
##       percentGC percentAT
##       <numeric> <numeric>
##   [1]  0.556886  0.443114
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

5.3 Peaks size

Also, it is important to note that the peaks size is the same. Each peak has a size of 502bp. The main reason to use fixed-width peaks is stated in the supplemental material:

We chose to use fixed-width peaks because (i) it makes count based and motif focused analyses less biased to large peaks and (ii) with large datasets merging peak sets to obtain a union peak set can lead to many peaks being merged into one very large peak, limiting our ability to resolve independent peaks.

unique(width(atac_pan.gr))

## [1] 502

unique(width(atac_esca.gr))

## [1] 502

5.4 Summary

In summary, in the pan-can set the ESCA named peaks will be the ones that have the strongest signal on the ESCA samples when compared to the other cancer types, which will be a subset of all ESCA peaks. So, if you are looking for all ATAC-Seq ESCA peaks identified in at least two samples the cancer-specific set should be used.

6 Analysis: Using ATAC-Seq counts to compare two groups

Through the next section we will load the normalized counts data and compare two groups of samples, to identify which peaks are stronger in a given group compared to the other one.

The main two files used in this section are:

Normalized ATAC-Seq insertion counts within the pan-cancer peak set. Recommended format. [RDS]
All cancer type-specific count matrices in normalized counts. [ZIP]

In the code below, we are showing the beginning of the objects. It is important to highlight that the samples are using Stanford UUID instead of TCGA barcodes and each patient normally has two replicates.

atac_esca_norm_ct <- readr::read_tsv("Data/ESCA_log2norm.txt")

atac_esca_norm_ct[1:4,1:8]

## # A tibble: 4 x 8
##   seqnames  start    end name   score ESCA_1E6AC686_96C2_4… ESCA_1E6AC686_96C2_…
##   <chr>     <dbl>  <dbl> <chr>  <dbl>                 <dbl>                <dbl>
## 1 chr1      10180  10679 ESCA_1  2.08                  1.44                 1.22
## 2 chr1     180641 181140 ESCA_2  2.17                  2.55                 2.23
## 3 chr1     181193 181692 ESCA_3  6.63                  2.10                 1.98
## 4 chr1     184246 184745 ESCA_4  2.59                  1.44                 1.16
## # … with 1 more variable:
## #   ESCA_1FEF5E19_2351_4D46_99B4_63F0637ABF17_X010_S08_L063_B1_T1_P016 <dbl>

We will change the samples names to TCGA barcodes using the file “Lookup table for various TCGA sample identifiers. [TXT]” from GDC.

gdc.file <- "https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c"
samples.ids <- readr::read_tsv(gdc.file, col_types = readr::cols())
samples.ids$sample <- substr(samples.ids$Case_ID,1,16) # Get first 16 characters
head(samples.ids)

## # A tibble: 6 x 6
##   bam_prefix        stanfordUUID    aliquot_id    Case_UUID    Case_ID   sample 
##   <chr>             <chr>           <chr>         <chr>        <chr>     <chr>  
## 1 BRCA-000CFD9F-AD… 000CFD9F-ADDF-… TCGA-A7-A13F… 2cf68894-16… TCGA-A7-… TCGA-A…
## 2 BRCA-000CFD9F-AD… 000CFD9F-ADDF-… TCGA-A7-A13F… 2cf68894-16… TCGA-A7-… TCGA-A…
## 3 PCPG-007124EC-1F… 007124EC-1F9B-… TCGA-RM-A68W… 1a1cf490-8b… TCGA-RM-… TCGA-R…
## 4 PCPG-007124EC-1F… 007124EC-1F9B-… TCGA-RM-A68W… 1a1cf490-8b… TCGA-RM-… TCGA-R…
## 5 STAD-00DFAA4D-DE… 00DFAA4D-DE64-… TCGA-BR-A4J1… e9a98a44-83… TCGA-BR-… TCGA-B…
## 6 STAD-00DFAA4D-DE… 00DFAA4D-DE64-… TCGA-BR-A4J1… e9a98a44-83… TCGA-BR-… TCGA-B…

# match the Stanford UUIDs from the atac_esca_norm_ct column names to the
# column bam_prefix in the samples.ids table and get the TCGA ID (column Case_ID)
# from it to replace as column names
# We remove the first 5 columns since the are not samples names
standfordUUID <- colnames(atac_esca_norm_ct)[-c(1:5)]
colnames(atac_esca_norm_ct)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",standfordUUID),samples.ids$bam_prefix)]
atac_esca_norm_ct[1:4,1:8]

## # A tibble: 4 x 8
##   seqnames  start    end name   score `TCGA-IG-A51D-01A-31… `TCGA-IG-A51D-01A-3…
##   <chr>     <dbl>  <dbl> <chr>  <dbl>                 <dbl>                <dbl>
## 1 chr1      10180  10679 ESCA_1  2.08                  1.44                 1.22
## 2 chr1     180641 181140 ESCA_2  2.17                  2.55                 2.23
## 3 chr1     181193 181692 ESCA_3  6.63                  2.10                 1.98
## 4 chr1     184246 184745 ESCA_4  2.59                  1.44                 1.16
## # … with 1 more variable: TCGA-LN-A9FQ-01A-11-A617-42 <dbl>

# It is important to notice that our mapping produced replicated column names,
# for that reason the code using dplyr R package (which is more elegant) would not work 
atac_esca_norm_ct <- atac_esca_norm_ct %>% rename_with(
  .fn = function(standfordUUID){
    samples.ids$Case_ID[match(gsub("_","-",standfordUUID),samples.ids$bam_prefix)]
  },.cols = starts_with("ESCA_"))

We will next get the TCGA metadata information from GDC using the function TCGAbiolinks:::colDataPrepare.

atac <- atac_esca_norm_ct
non.cts.idx <- 1:5
samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(non.cts.idx)]))

samples.map <- paste0(
      samples.info$primary_diagnosis,"-",
      samples.info$sample
    ) %>% 
      { gsub(",| |NOS","",.) }  %>% # substitutes spaces by no space, and Remove NOS
      { gsub("Adenocarcinoma","ESAD",.) }  %>% # substitutes Adenocarcinoma to ESAD
      { gsub("Squamous cell carcinoma","ESCC",.) }  # substitutes Squamous cell carcinoma to ESCC

colnames(atac)[-c(non.cts.idx)] <- samples.map[match(substr(colnames(atac)[-c(non.cts.idx)],1,16),substr(samples.map,6,21))]

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# create Summarized Experiment object  
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

# Matrix with data information
counts <- atac[,-c(1:5)]

# Genomic ranges information
rowRanges <- makeGRangesFromDataFrame(atac)
rowRanges$score <- atac$score
rowRanges$name <- atac$name
names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")

# samples information
colData <- DataFrame(unique(left_join(samples.info,samples.ids)))

esca.rse <- SummarizedExperiment(
  assays = SimpleList(
    log2norm = as.matrix(counts)
  ),
  rowRanges = rowRanges, 
  colData = colData
)

# Since we have two samples for each patient we will rename them as rep1 and rep2
duplicated.idx <- duplicated(colnames(esca.rse))
colnames(esca.rse)[!duplicated.idx] <- paste0(colnames(esca.rse)[!duplicated.idx],"_rep1")
colnames(esca.rse)[duplicated.idx] <- paste0(colnames(esca.rse)[duplicated.idx],"_rep2")

esca.rse

## class: RangedSummarizedExperiment 
## dim: 126935 33 
## metadata(0):
## assays(1): log2norm
## rownames(126935): ESCA_1_chr1_10180_10679 ESCA_2_chr1_180641_181140 ...
##   ESCA_126943_chrX_155985136_155985635
##   ESCA_126944_chrX_156030008_156030507
## rowData names(2): score name
## colnames(33): ESCC-TCGA-IG-A51D-01A_rep1 ESCC-TCGA-IG-A51D-01A_rep2 ...
##   ESAD-TCGA-M9-A5M8-01A_rep1 ESAD-TCGA-M9-A5M8-01A_rep2
## colData names(140): sample patient ... Case_UUID Case_ID

colnames(esca.rse)

##  [1] "ESCC-TCGA-IG-A51D-01A_rep1" "ESCC-TCGA-IG-A51D-01A_rep2"
##  [3] "ESCC-TCGA-LN-A9FQ-01A_rep1" "ESCC-TCGA-LN-A9FQ-01A_rep2"
##  [5] "ESAD-TCGA-L7-A6VZ-01A_rep1" "ESAD-TCGA-L7-A6VZ-01A_rep2"
##  [7] "ESAD-TCGA-L5-A4OE-01A_rep1" "ESAD-TCGA-L5-A4OE-01A_rep2"
##  [9] "ESCC-TCGA-LN-A49O-01A_rep1" "ESCC-TCGA-LN-A49O-01A_rep2"
## [11] "ESAD-TCGA-IG-A7DP-01A_rep1" "ESAD-TCGA-IG-A7DP-01A_rep2"
## [13] "ESCC-TCGA-LN-A4A5-01A_rep1" "ESCC-TCGA-LN-A4A5-01A_rep2"
## [15] "ESCC-TCGA-IG-A4P3-01A_rep1" "ESCC-TCGA-IG-A3YA-01A_rep1"
## [17] "ESCC-TCGA-IG-A3YA-01A_rep2" "ESAD-TCGA-IG-A4QS-01A_rep1"
## [19] "ESAD-TCGA-IG-A4QS-01A_rep2" "ESCC-TCGA-LN-A7HX-01A_rep1"
## [21] "ESCC-TCGA-LN-A7HX-01A_rep2" "ESCC-TCGA-LN-A8HZ-01A_rep1"
## [23] "ESCC-TCGA-LN-A4A2-01A_rep1" "ESCC-TCGA-LN-A4A2-01A_rep2"
## [25] "ESCC-TCGA-LN-A4MR-01A_rep1" "ESCC-TCGA-IG-A625-01A_rep1"
## [27] "ESCC-TCGA-IG-A625-01A_rep2" "ESCC-TCGA-LN-A49W-01A_rep1"
## [29] "ESCC-TCGA-LN-A49W-01A_rep2" "ESAD-TCGA-IC-A6RE-01A_rep1"
## [31] "ESAD-TCGA-IC-A6RE-01A_rep2" "ESAD-TCGA-M9-A5M8-01A_rep1"
## [33] "ESAD-TCGA-M9-A5M8-01A_rep2"

6.1 Comparing ESCC vs ESAD ATAC-Seq

A t-test will be used to identify the peaks that have a significant different mean counts between the ESCC and ESAD samples.

escc.idx <- which(esca.rse$primary_diagnosis == "Squamous cell carcinoma, NOS")
esad.idx <- which(esca.rse$primary_diagnosis == "Adenocarcinoma, NOS")

result <- plyr::adply(
  .data = assay(esca.rse),
  .margins = 1,
  .fun = function(peak){
    results <- t.test(peak[escc.idx],peak[esad.idx],conf.level = TRUE)
    return(
      tibble::tibble(
        "raw_p_value"= results$p.value,
        "ESCC_minus_ESAD" = results$estimate[1] - results$estimate[2])
    )
  }, .progress = "time", .id = "peak")

result$FDR <- stats::p.adjust(result$raw_p_value,method = "fdr")

6.2 Volcano plot of t-test analysis

It is possible to visualize the results of a t.test as a volcano plot, which can be used to better select a cut-off for the significant ATAC-Seq peaks.

fdr.cut.off <- 0.0001
diff.cut.off <- 2.5

TCGAbiolinks:::TCGAVisualize_volcano(
  x = result$ESCC_minus_ESAD,
  y = result$FDR, 
  title =  paste0(
    "Volcano plot - ATAC-Seq peaks ",
    "difference in ", 
    "ESCC vs ESAD\n"
  ),
  filename = NULL,
  label =  c(
    "Not Significant",
    paste0("High in ESCC (vs ESAD)"),
    paste0("Low in ESCC (vs ESAD)")
  ),
  ylab =  expression( paste(
    -Log[10]," (FDR) [two tailed t-test] - cut-off FDR < ",fdr.cut.off
  )),
  xlab =  expression(paste(
    "Log2(Counts) difference - cut-off log2 delta(cts) > ",diff.cut.off
  )),
  x.cut = diff.cut.off, 
  y.cut = fdr.cut.off
)

# How many peaks pass our cut-offs
table(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)

## 
##  FALSE   TRUE 
## 125988    947

6.3 Heatmap of differential significant peaks

It is possible to visualize the results of a t.test as a volcano plot, which can be used to better select a cut-off for the significant ATAC-Seq peaks. In this example we will use the $FDR < 0.01$ and $\Delta Log_2 Counts > 2$ .

Also, we will be using TCGAbiolinks package to plot it, but you can do it using ggplot2 or plotly for an interactive volcano plot. You can also find some examples at https://huntsmancancerinstitute.github.io/hciR/volcano.html.

library(ComplexHeatmap)
library(circlize)

# Colors of the heatmap
pal_atac <- colorRampPalette(
  c('#3361A5',
    '#248AF3',
    '#14B3FF',
    '#88CEEF',
    '#C1D5DC',
    '#EAD397',
    '#FDB31A',
    '#E42A2A',
    '#A31D1D')
)(100)

# Upper track with the samples annotation
ha = HeatmapAnnotation(
  df = data.frame(
    "Group" = esca.rse$primary_diagnosis, 
    "Replicate" = stringr::str_match(colnames(esca.rse),"rep[0-9]?")
  ),
  show_annotation_name = T,
  col = list(
    Group = c(
      "Squamous cell carcinoma, NOS" =  "red", 
      "Adenocarcinoma, NOS" = "blue"
    )
  ),
  show_legend = T,
  annotation_name_side = "left",
  annotation_name_gp = gpar(fontsize = 6)
)

# Select significant peals to plot
plot.atac <- assay(esca.rse)[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]

# Define the color scale
col <- colorRamp2(
  seq(
    min(plot.atac), 
    max(plot.atac), 
    by = (max(plot.atac) - min(plot.atac))/99), 
  pal_atac
)

# Show the names of a peaks enriched in ESCA and on in ESCC
idx1 <- which(result$FDR < fdr.cut.off & result$ESCC_minus_ESAD > diff.cut.off)[1]
idx2 <- which(result$FDR < fdr.cut.off & result$ESCC_minus_ESAD < diff.cut.off)[1]

rows.annot <- rowAnnotation(foo = anno_mark(at = c(idx1,idx2), labels = rownames(plot.atac)[c(idx1,idx2)]))

# Plot the ATAC-Seq signals
ht_list <- Heatmap(
  plot.atac,
  name = "ATAC-seq log2(counts)", 
  col = col,
  column_names_gp = gpar(fontsize = 8),
  show_column_names = F,
  heatmap_legend_param = list(
    legend_direction = "horizontal",
    labels_gp = gpar(fontsize = 12), 
    title_gp = gpar(fontsize = 12)
  ),
  show_row_names = FALSE,
  cluster_columns = TRUE,
  use_raster = TRUE,
  raster_device = c("png"),
  raster_quality = 2,
  cluster_rows = T,
  right_annotation = rows.annot,
  row_title = paste0(
    sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)," ATAC-seq peaks"
  ),
  row_names_gp = gpar(fontsize = 4),
  top_annotation = ha,
  column_title_gp = gpar(fontsize = 12), 
  row_title_gp = gpar(fontsize = 12)
) 

options(repr.plot.width = 15, repr.plot.height = 8)
draw(
  ht_list,
  newpage = TRUE, 
  merge_legend = TRUE,
  column_title = paste0(
    "ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
    ",  Diff mean log2 Count > ",diff.cut.off,")"
  ),
  column_title_gp = gpar(fontsize = 12, fontface = "bold"),
  heatmap_legend_side = "bottom",
  annotation_legend_side = "bottom"
)

6.4 Heatmap of differential significant peaks (z-score)

A better way to visualize a heatmap is using the z-score transformation on the rows. Z-scores are centered and normalized, so the user can interpret a color as x standard deviations from the mean and have an intuitive idea of the relative variation of that value. This will make the visibility of the heatmap better since it will reduce the range of the values plots. For more information, please read the discussion here.

In R the function scale can be used, since it works by column we have to transpose the matrix so it is applied to the peaks instead of the samples and then transpose it back.

plot.atac.row.z.score <- pheatmap:::scale_rows(plot.atac) # row z-score
col.zscore <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)

ht_list <- Heatmap(
  plot.atac.row.z.score,
  name = "Row z-score (ATAC-Seq log2(counts))", 
  col = col.zscore,
  column_names_gp = gpar(fontsize = 8),
  show_column_names = FALSE,
  heatmap_legend_param = list(
    legend_direction = "horizontal",
    labels_gp = gpar(fontsize = 12), 
    title_gp = gpar(fontsize = 12)
  ),
  show_row_names = FALSE,
  cluster_columns = TRUE,
  use_raster = TRUE,
  right_annotation = rows.annot,
  raster_device = c("png"),
  raster_quality = 2,
  cluster_rows = TRUE,
  row_title = paste0(
    sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)," ATAC-Seq peaks"
  ),
  row_names_gp = gpar(fontsize = 4),
  top_annotation = ha,
  column_title_gp = gpar(fontsize = 12), 
  row_title_gp = gpar(fontsize = 12)
) 

draw(
  ht_list,
  newpage = TRUE, 
  merge_legend = TRUE,
  column_title = paste0(
    "ATAC-Seq ESCC vs ESAD (FDR < ", 
    fdr.cut.off,",  Diff mean log2 Count > ",
    diff.cut.off,")"
  ),
  column_title_gp = gpar(fontsize = 12, fontface = "bold"),
  heatmap_legend_side = "bottom",
  annotation_legend_side = "bottom"
)

6.5 Merging ATAC-Seq replicates

If you want to instead of plot all replicates, to plot a single value for each patient you can get the mean of the values.

# This function will calculate the Means of the peaks for a given group
# in our case we will calculate the mean of the replicates of each patient.
groupMeans <- function(mat, groups = NULL, na.rm = TRUE){
  stopifnot(!is.null(groups))
  gm <- lapply(unique(groups), function(x){
    rowMeans(mat[,which(groups == x),drop = F], na.rm=na.rm)
  }) %>% Reduce("cbind",.)
  colnames(gm) <- unique(groups)
  return(gm)
}
matMerged <- groupMeans(mat = assays(esca.rse)$log2norm, groups = colData(esca.rse)$sample)

# keep only metadata for replicate 1
metadata <- colData(esca.rse)[grep("rep1",rownames(colData(esca.rse))),]

# Create the upper annotation track for the samples
ha <- HeatmapAnnotation(
  df = data.frame(
    "Group" = metadata$primary_diagnosis
  ),
  show_annotation_name = TRUE,
  col = list(
    Group = c(
      "Squamous cell carcinoma, NOS" =  "red", 
      "Adenocarcinoma, NOS" = "blue")
  ),
  show_legend = TRUE,
  annotation_name_side = "left",
  annotation_name_gp = gpar(fontsize = 6)
)

# Select the significant peaks to be plotted
plot.atac <- matMerged[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]

# Define the color scheme based on the values
col <- colorRamp2(
  breaks = seq(
    min(plot.atac),
    max(plot.atac), 
    by = (max(plot.atac) - min(plot.atac))/99), 
  colors = pal_atac
)


# Plot ATAC-Seq signal values
ht_list <- Heatmap(
  matrix = plot.atac,
  name = "ATAC-Seq log2(counts)", 
  col = col,
  column_names_gp = gpar(fontsize = 8),
  show_column_names = F,
  heatmap_legend_param = list(
    legend_direction = "horizontal",
    labels_gp = gpar(fontsize = 12), 
    title_gp = gpar(fontsize = 12)
  ),
  show_row_names = FALSE,
  cluster_columns = TRUE,
  use_raster = TRUE,
  raster_device = c("png"),
  raster_quality = 2,
  cluster_rows = T,
  right_annotation = rows.annot,
  row_title = paste0(
    sum(result$FDR < fdr.cut.off &  abs(result$ESCC_minus_ESAD) > diff.cut.off), " ATAC-seq peaks"
  ),
  row_names_gp = gpar(fontsize = 4),
  top_annotation = ha,
  column_title_gp = gpar(fontsize = 12), 
  row_title_gp = gpar(fontsize = 12)
) 

options(repr.plot.width=15, repr.plot.height=8)
draw(
  ht_list,
  newpage = TRUE, 
  merge_legend = TRUE,
  column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,",  
                           Diff mean log2 Count > ",diff.cut.off,")"),
  column_title_gp = gpar(fontsize = 12, fontface = "bold"),
  heatmap_legend_side = "bottom",
  annotation_legend_side = "bottom"
)

# Start to plot z-score heatmap
# To calculate the z-score we will use the function scale from R
# By default this function is applied on the columns so we need to tranpose the
# matrix to scale by rows and transpose back
plot.atac.row.z.score <- pheatmap:::scale_rows(plot.atac) # row z-score

# Recreate color scheme based on the z-score levels we will truncate it from -2 to 2.
col.zscore <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)

ht_list <- Heatmap(
  matrix = plot.atac.row.z.score,
  name = "Row z-score (ATAC-seq log2(counts))", 
  col = col.zscore,
  column_names_gp = gpar(fontsize = 8),
  show_column_names = F,
  heatmap_legend_param = list(
    legend_direction = "horizontal",
    labels_gp = gpar(fontsize = 12), 
    title_gp = gpar(fontsize = 12)
  ),
  show_row_names = FALSE,
  cluster_columns = TRUE,
  use_raster = TRUE,
  right_annotation = rows.annot,
  raster_device = c("png"),
  raster_quality = 2,
  cluster_rows = TRUE,
  row_title = paste0(
    sum(result$FDR < fdr.cut.off &   abs(result$ESCC_minus_ESAD) > diff.cut.off)," ATAC-seq peaks"
  ),
  row_names_gp = gpar(fontsize = 4),
  top_annotation = ha,
  column_title_gp = gpar(fontsize = 12), 
  row_title_gp = gpar(fontsize = 12)
) 

options(repr.plot.width = 15, repr.plot.height=8)
draw(
  object = ht_list,
  merge_legend = TRUE,
  newpage = TRUE, 
  column_title = paste0(
    "ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,",  Diff mean log2 Count > ",diff.cut.off,")"
  ),
  column_title_gp = gpar(fontsize = 12, fontface = "bold"),
  heatmap_legend_side = "bottom",
  annotation_legend_side = "bottom"
)

7 Other useful codes

7.1 Get all TCGA cancer-specific peaks peaks

The code below will download all cancer specific peaks and transform them into an R object.

library(readr)
library(GenomicRanges)
library(tidyr)
library(dplyr)
library(SummarizedExperiment)
library(plyr)
#-------------------------------
# Cancer specific peaks as SummaerizedExperiment object
#-------------------------------

zip.file <- "export.zip"
if(!file.exists(zip.file)) { 
  download("https://api.gdc.cancer.gov/data/38b8f311-f3a4-4746-9829-b8e3edb9c157",zip.file)
  unzip(zip.file)
}

prepareCancerSpecificPeaks <- function(file){
  output <- gsub(".txt",".rda",file)
  if(file.exists(output)) return()
  atac  <- readr::read_tsv(file)
  
  file.samples.ids <- "TCGA_identifier_mapping.txt"
  if(!file.exists(file.samples.ids)){
    file.url <- "https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c"
    downloader::download(file.url,file.samples.ids)
  }  
  samples.ids <- readr::read_tsv(file.samples.ids)
  samples.ids$sample <- substr(samples.ids$Case_ID,1,16)
  colnames(atac)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",colnames(atac)[-c(1:5)]),samples.ids$bam_prefix)]
  
  samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(1:5)]))
  
  if(grepl("ESCA",file)){
    samples.map <- paste0(samples.info$primary_diagnosis,"-",samples.info$sample) %>%
      { gsub(",| |NOS","",.) }  %>% # Remove spaces, and Remove NOS
      { gsub("Adenocarcinoma","ESAD",.) }  %>% # change Adenocarcinoma to ESAD
      { gsub("Squamous cell carcinoma","ESCC",.) }  # change Squamous cell carcinoma to ESCC
    
    colnames(atac)[-c(1:5)] <- samples.map[match(substr(colnames(atac)[-c(1:5)],1,16),substr(samples.map,6,21))]
  } else {
    colnames(atac)[-c(1:5)] <- samples.info$sample[match(substr(colnames(atac)[-c(1:5)],1,16),samples.info$sample)]
  }
  
  # create SE object  
  counts <- atac[,-c(1:5)]
  rowRanges <- makeGRangesFromDataFrame(atac)
  rowRanges$score <- atac$score
  rowRanges$name <- atac$name
  names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
  colData <- DataFrame(unique(left_join(samples.info,samples.ids)))
  rse <- SummarizedExperiment(
    assays = SimpleList(
      log2norm = as.matrix(counts)
    ),
    rowRanges = rowRanges, 
    colData = colData
  )
  save(rse,file = output, compress = "xz")
}

plyr::a_ply(dir(pattern=".txt"),1, function(f) 
  tryCatch({prepareCancerSpecificPeaks(f)}, error = function(e){message(e)}),
  .progress = "time")

7.2 ATAC-Seq Bigwig

The ATAC-Seq bigwig files available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.

Here is some information about the bigwig files.

All bigWig files for each cancer type are compressed using tar and gzip. As such, each of the .tgz files contains all of the individual bigWig files for each technical replicate.

We recommend extracting the files using the following command: tar -zxvf file_name.tgz –strip-components 8 where the “–strip-components 8” extracts the files without copying their original directory structure

The provided bigWig files have been normalized by the total insertions in peaks and then binned into 100-bp bins. Each 100-bp bin represents the normalized number of insertions that occurred within the corresponding 100 bp.

The bigwig names also use Stanford UUIDs. The script below will help to rename the bigwifiles with TCGA barcodes. First, we get the path to the downloaded bigwig files after uncompressing them and read the information file from the ATAC-Seq website.

bigwig.files <- dir(
  path = "ATAC-Seq_data/ESCA_bigWigs/",
  pattern = ".bw",
  all.files = T,
  full.names = T
)
table <- readr::read_tsv("https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c")
head(table)

Then, for each file, we will map them in the downloaded table and rename them.

# rename bigwig files add more info in the nanem
plyr::a_ply(bigwig.files,1, function(file) {
  
  file.uuid <- stringr::str_extract(file,
  "[:alnum:]{8}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{12}")
  
  idx <- grep(file.uuid,gsub("-","_",table$stanfordUUID))
  
  barcode <- unique(table[idx,]$Case_ID)
  
  if (grepl("ESCA",file)) {
    samples.info <- TCGAbiolinks:::colDataPrepare(barcode)
    barcode <- paste0(
      samples.info$primary_diagnosis,"-",
      samples.info$sample
    ) %>% 
      { gsub(",| |NOS","",.) }  %>% # Remove spaces, and Remove NOS
      { gsub("Adenocarcinoma","ESAD",.) }  %>% # change Adenocarcinoma to ESAD
      { gsub("Squamous cell carcinoma","ESCC",.) }  # change Squamous cell carcinoma to ESCC
    
    
  }
  # change UUID to barcode
  to <- gsub(file.uuid,barcode,file)
  file.rename(file, to)
})

Since loading several bigWigs might be pretty slow in software like IGV users might want to reduce the bigwig files to a single chromosome (i.e. chr20). The Rscript below can do it by transforming the bigWig to a wig with only chr20 then converting the wig back to bigWig.

You can download at the executable for bigWigToWig and wigToBigWig at ENCODE (http://hgdownload.cse.ucsc.edu/admin/exe/) and the hg38.chrom.sizes is available at GitHub (https://raw.githubusercontent.com/igvteam/igv/master/genomes/sizes/hg38.chrom.sizes)

chr <- 20
dirout <- paste0("chr",chr)
dir.create(dirout)
files <- dir(path = ".",pattern = "bw",full.names = T)
for(f in files){
  f.in <- f
  f.out <- gsub("bw","wig",f)
  f.out.chr <- file.path(dirout,gsub("\\.bw",paste0("_chr",chr,".bw"),f))
  cmd <- paste0("bigWigToWig -chrom=chr",chr," ", f.in," ", f.out)
  system(cmd)
  cmd <- paste0("wigToBigWig ", f.out," hg38.chrom.sizes ", f.out.chr)
  system(cmd)
}

7.2.1 Visualizing the bigwig files in R

We upload some samples (chromosome 20 only) at the google drive that can be plotted in R using the karyoploteR package, , which main documentation can be found at https://bernatgel.github.io/karyoploter_tutorial/.

# Load required libraries
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

# Plot parameters, only to look better
pp <- getDefaultPlotParams(plot.type = 1)
pp$leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0

# Get transcrupts annotation to get HNF4A regions
tssAnnot <- ELMER::getTSS(genome = "hg38")
tssAnnot <- tssAnnot[tssAnnot$external_gene_name == "HNF4A"]

# plot will be at  the HNF4A range +- 50Kb
HNF4A.region <- range(c(tssAnnot)) + 50000
HNF4A.region

## GRanges object with 1 range and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr20 44305700-44484596      +
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

# Start by plotting gene tracks
kp <- plotKaryotype(zoom = HNF4A.region, genome = "hg38", cex = 0.5, plot.params = pp)
genes.data <- makeGenesDataFromTxDb(
  TxDb.Hsapiens.UCSC.hg38.knownGene,
  karyoplot = kp,
  plot.transcripts = TRUE, 
  plot.transcripts.structure = TRUE
)

##   228 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.

genes.data <- addGeneNames(genes.data = genes.data)

## Loading required package: org.Hs.eg.db

##

genes.data <- mergeTranscripts(genes.data = genes.data)


kp <- plotKaryotype(
  zoom = HNF4A.region,
  genome = "hg38",
  cex = 0.5,
  plot.params = pp
)
kpAddBaseNumbers(
  kp,
  tick.dist = 20000,
  minor.tick.dist = 5000,
  add.units = TRUE,
  cex = 0.4,
  tick.len = 3
)
kpPlotGenes(
  kp,
  data = genes.data,
  r0 = 0,
  r1 = 0.25,
  gene.name.cex = 0.5
)


# Start to plot bigwig files
big.wig.files <- dir(
  path = "Data/ESCA_bigwig_chr20/",
  pattern = ".bw",
  all.files = T,
  full.names = T
)
big.wig.files

## [1] "Data/ESCA_bigwig_chr20//ESCA_ESAD-TCGA-L5-A4OE-01A_X015_S05_L033_B1_T1_P034.insertions_chr20.bw"
## [2] "Data/ESCA_bigwig_chr20//ESCA_ESAD-TCGA-L7-A6VZ-01A_X018_S02_L027_B1_T1_P040.insertions_chr20.bw"
## [3] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-IG-A51D-01A_X012_S02_L027_B1_T1_P024.insertions_chr20.bw"
## [4] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-LN-A49O-01A_X008_S08_L015_B1_T1_P017.insertions_chr20.bw"
## [5] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-LN-A4A5-01A_X007_S06_L065_B1_T1_P012.insertions_chr20.bw"
## [6] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-LN-A9FQ-01A_X010_S08_L063_B1_T1_P016.insertions_chr20.bw"

# Reserve area to plot the bigwig files
out.at <- autotrack(
  1:length(big.wig.files), 
  length(big.wig.files), 
  margin = 0.3, 
  r0 = 0.3,
  r1 = 1
)

# Add ATAC-seq label from 0.3 to 1 which should cover all ATAC-seq tracks
kpAddLabels(
  kp, 
  labels = "ATAC-Seq", 
  r0 = out.at$r0, 
  r1 = out.at$r1, 
  side = "left",
  cex = 1,
  srt = 90, 
  pos = 3, 
  label.margin = 0.1
)


for(i in seq_len(length(big.wig.files))) {
  bigwig.file <- big.wig.files[i]
  
  # Define where the track will be ploted
  # autotrack will simple get the reserved space (from out.at$r0 up to out.at$r1)
  # and split in equal sizes for each bigwifile, i the index, will control which 
  # one is being plotted
  at <- autotrack(i, length(big.wig.files), r0 = out.at$r0, r1 = out.at$r1, margin = 0.2)
  
  # Plot bigwig
  kp <- kpPlotBigWig(
    kp, 
    data = bigwig.file, 
    ymax = "visible.region", 
    r0 = at$r0, 
    col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
    r1 = at$r1
  )
  computed.ymax <- ceiling(kp$latest.plot$computed.values$ymax)
  
  # Add track axis
  kpAxis(
    kp, 
    ymin = 0, 
    ymax = computed.ymax, 
    numticks = 2,
    r0 = at$r0, 
    r1 = at$r1,
    cex = 0.5
  )
  
  # Add track label
  kpAddLabels(
    kp, 
    labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
    r0 = at$r0, 
    r1 = at$r1, 
    cex = 0.5, 
    label.margin = 0.01
  )
}

7.2.2 Visualizing links

One of the probe-gene links identified using HM450 was cg03326606-PARD6B, which will be plotted below.

library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

pp <- getDefaultPlotParams(plot.type = 1)
pp$leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0

cg03326606 <- ELMER:::getInfiniumAnnotation()["cg03326606"]
cg03326606

## GRanges object with 1 range and 52 metadata columns:
##              seqnames            ranges strand | address_A address_B
##                 <Rle>         <IRanges>  <Rle> | <integer> <integer>
##   cg03326606    chr20 50729030-50729031      + |  35670461      <NA>
##                  channel  designType    nextBase nextBaseRef   probeType
##              <character> <character> <character> <character> <character>
##   cg03326606        Both          II         G/A           C          cg
##              orientation probeCpGcnt context35  probeBeg  probeEnd
##              <character>   <integer> <integer> <integer> <numeric>
##   cg03326606          up           1         2  50728981  50729030
##                          ProbeSeq_A  ProbeSeq_B        gene   gene_HGNC
##                         <character> <character> <character> <character>
##   cg03326606 TTTATCAAATTCCAACTATT..                    <NA>        <NA>
##                   chrm_A     beg_A    flag_A    mapQ_A     cigar_A      NM_A
##              <character> <integer> <integer> <integer> <character> <integer>
##   cg03326606       chr20  50728981         0        60         50M         0
##                   chrm_B     beg_B    flag_B    mapQ_B     cigar_B      NM_B
##              <character> <integer> <integer> <integer> <character> <integer>
##   cg03326606        <NA>      <NA>      <NA>      <NA>        <NA>      <NA>
##              wDecoy_chrm_A wDecoy_beg_A wDecoy_flag_A wDecoy_mapQ_A
##                <character>    <integer>     <integer>     <integer>
##   cg03326606         chr20     50728981             0            60
##              wDecoy_cigar_A wDecoy_NM_A wDecoy_chrm_B wDecoy_beg_B
##                 <character>   <integer>   <character>    <integer>
##   cg03326606            50M           0          <NA>         <NA>
##              wDecoy_flag_B wDecoy_mapQ_B wDecoy_cigar_B wDecoy_NM_B  posMatch
##                  <integer>     <integer>    <character>   <integer> <logical>
##   cg03326606          <NA>          <NA>           <NA>        <NA>      <NA>
##              MASK_mapping MASK_typeINextBaseSwitch MASK_rmsk15 MASK_sub40_copy
##                 <logical>                <logical>   <logical>       <logical>
##   cg03326606        FALSE                    FALSE       FALSE           FALSE
##              MASK_sub35_copy MASK_sub30_copy MASK_sub25_copy MASK_snp5_common
##                    <logical>       <logical>       <logical>        <logical>
##   cg03326606           FALSE           FALSE           FALSE            FALSE
##              MASK_snp5_GMAF1p MASK_extBase MASK_general
##                     <logical>    <logical>    <logical>
##   cg03326606            FALSE        FALSE        FALSE
##   -------
##   seqinfo: 26 sequences from an unspecified genome; no seqlengths

tssAnnot <- ELMER::getTSS(genome = "hg38")
tssAnnot <- tssAnnot[tssAnnot$external_gene_name == "PARD6B"]

pair.region <- range(c(tssAnnot,cg03326606)) + 200

suppressWarnings({
  kp <- plotKaryotype(zoom = pair.region, genome = "hg38", cex = 0.5, plot.params = pp)
})



genes.data <-
  makeGenesDataFromTxDb(
    TxDb.Hsapiens.UCSC.hg38.knownGene,
    karyoplot = kp,
    plot.transcripts = TRUE,
    plot.transcripts.structure = TRUE
  )

##   228 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.

genes.data <- addGeneNames(genes.data)
genes.data <- mergeTranscripts(genes.data)

promoter <- promoters(range(tssAnnot))

kp <-
  plotKaryotype(
    zoom = pair.region,
    genome = "hg38",
    cex = 0.5,
    plot.params = pp
  )
kpPlotRegions(
  kp,
  toGRanges("chr20:50729030-50729031"),
  r0 = 0,
  r1 = 0.02,
  col = "#ff8d92"
)
kpPlotRegions(
  kp,
  promoter,
  r0 = 0,
  r1 = 0.02,
  col = "#8d9aff"
)

kpPlotLinks(
  kp,
  data = toGRanges("chr20:50729030-50729031"),
  data2 = promoter,
  col = "#fac7ffaa",
  r0 = 0.02,
  arch.height = 0.1
)
kpAddBaseNumbers(
  kp,
  tick.dist = 10000,
  minor.tick.dist = 2000,
  add.units = TRUE,
  cex = 0.5,
  tick.len = 3
)
kpPlotGenes(
  kp,
  data = genes.data,
  r0 = 0.12,
  r1 = 0.25,
  gene.name.cex = 0.5
)



big.wig.files <- dir(
  path = "Data/ESCA_bigwig_chr20/",
  pattern = ".bw",
  all.files = T,
  full.names = T
)

out.at <- autotrack(
  1:length(big.wig.files), 
  length(big.wig.files), 
  margin = 0.3, 
  r0 = 0.3,
  r1 = 1
)

kpAddLabels(
  kp, 
  labels = "ATAC-Seq", 
  r0 = out.at$r0, 
  r1 = out.at$r1, 
  side = "left",
  cex = 1,
  srt = 90, 
  pos = 3, 
  label.margin = 0.1
)

for(i in seq_len(length(big.wig.files))) {
  bigwig.file <- big.wig.files[i]
  at <- autotrack(i, length(big.wig.files), r0 = out.at$r0, r1 = out.at$r1, margin = 0.2)
  kp <- kpPlotBigWig(
    kp, 
    data = bigwig.file, 
    ymax = 523,
    r0 = at$r0, 
    col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
    r1 = at$r1
  )
  computed.ymax <- ceiling(kp$latest.plot$computed.values$ymax)
  kpAxis(
    kp, 
    ymin = 0, 
    ymax = computed.ymax, 
    numticks = 2,
    r0 = at$r0, 
    r1 = at$r1,
    cex = 0.5
  )
  kpAddLabels(
    kp, 
    labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
    r0 = at$r0, 
    r1 = at$r1, 
    cex = 1, 
    label.margin = 0.01
  )
}

8 Session information

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.13.0                     
##  [2] ELMER_2.16.0                            
##  [3] ELMER.data_2.16.0                       
##  [4] TxDb.Hsapiens.UCSC.hg38.knownGene_3.13.0
##  [5] GenomicFeatures_1.44.0                  
##  [6] AnnotationDbi_1.54.1                    
##  [7] karyoploteR_1.18.0                      
##  [8] regioneR_1.24.0                         
##  [9] pheatmap_1.0.12                         
## [10] circlize_0.4.13                         
## [11] ComplexHeatmap_2.8.0                    
## [12] TCGAbiolinks_2.20.0                     
## [13] plyr_1.8.6                              
## [14] SummarizedExperiment_1.22.0             
## [15] Biobase_2.52.0                          
## [16] MatrixGenerics_1.4.0                    
## [17] matrixStats_0.59.0                      
## [18] dplyr_1.0.6                             
## [19] tidyr_1.1.3                             
## [20] GenomicRanges_1.44.0                    
## [21] GenomeInfoDb_1.28.0                     
## [22] IRanges_2.26.0                          
## [23] S4Vectors_0.30.0                        
## [24] BiocGenerics_0.38.0                     
## [25] readr_1.4.0                             
## [26] DT_0.18                                 
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.1                  R.utils_2.10.1             
##   [3] tidyselect_1.1.1            RSQLite_2.2.7              
##   [5] htmlwidgets_1.5.3           BiocParallel_1.26.0        
##   [7] munsell_0.5.0               codetools_0.2-18           
##   [9] colorspace_2.0-1            filelock_1.0.2             
##  [11] highr_0.9                   knitr_1.33                 
##  [13] rstudioapi_0.13             ggsignif_0.6.2             
##  [15] labeling_0.4.2              GenomeInfoDbData_1.2.6     
##  [17] farver_2.1.0                bit64_4.0.5                
##  [19] downloader_0.4              vctrs_0.3.8                
##  [21] generics_0.1.0              xfun_0.24                  
##  [23] biovizBase_1.40.0           BiocFileCache_2.0.0        
##  [25] R6_2.5.0                    doParallel_1.0.16          
##  [27] clue_0.3-59                 AnnotationFilter_1.16.0    
##  [29] bitops_1.0-7                cachem_1.0.5               
##  [31] reshape_0.8.8               DelayedArray_0.18.0        
##  [33] assertthat_0.2.1            BiocIO_1.2.0               
##  [35] scales_1.1.1                nnet_7.3-16                
##  [37] gtable_0.3.0                Cairo_1.5-12.2             
##  [39] ensembldb_2.16.0            rlang_0.4.11               
##  [41] GlobalOptions_0.1.2         splines_4.1.0              
##  [43] rtracklayer_1.52.0          rstatix_0.7.0              
##  [45] lazyeval_0.2.2              dichromat_2.0-0            
##  [47] broom_0.7.7                 checkmate_2.0.0            
##  [49] reshape2_1.4.4              yaml_2.2.1                 
##  [51] abind_1.4-5                 backports_1.2.1            
##  [53] Hmisc_4.5-0                 tools_4.1.0                
##  [55] ggplot2_3.3.4               ellipsis_0.3.2             
##  [57] jquerylib_0.1.4             RColorBrewer_1.1-2         
##  [59] MultiAssayExperiment_1.18.0 Rcpp_1.0.6                 
##  [61] base64enc_0.1-3             progress_1.2.2             
##  [63] zlibbioc_1.38.0             purrr_0.3.4                
##  [65] RCurl_1.98-1.3              prettyunits_1.1.1          
##  [67] ggpubr_0.4.0                rpart_4.1-15               
##  [69] GetoptLong_1.0.5            haven_2.4.1                
##  [71] ggrepel_0.9.1               cluster_2.1.2              
##  [73] magrittr_2.0.1              magick_2.7.2               
##  [75] data.table_1.14.0           openxlsx_4.2.4             
##  [77] ProtGenerics_1.24.0         hms_1.1.0                  
##  [79] TCGAbiolinksGUI.data_1.12.0 evaluate_0.14              
##  [81] XML_3.99-0.6                rio_0.5.26                 
##  [83] jpeg_0.1-8.1                readxl_1.3.1               
##  [85] gridExtra_2.3               shape_1.4.6                
##  [87] compiler_4.1.0              biomaRt_2.48.1             
##  [89] tibble_3.1.2                crayon_1.4.1               
##  [91] R.oo_1.24.0                 htmltools_0.5.1.1          
##  [93] Formula_1.2-4               DBI_1.1.1                  
##  [95] dbplyr_2.1.1                rappdirs_0.3.3             
##  [97] Matrix_1.3-4                car_3.0-10                 
##  [99] cli_2.5.0                   R.methodsS3_1.8.1          
## [101] Gviz_1.36.1                 forcats_0.5.1              
## [103] pkgconfig_2.0.3             prettydoc_0.4.1            
## [105] GenomicAlignments_1.28.0    foreign_0.8-81             
## [107] plotly_4.9.4                xml2_1.3.2                 
## [109] foreach_1.5.1               bslib_0.2.5.1              
## [111] XVector_0.32.0              rvest_1.0.0                
## [113] stringr_1.4.0               bezier_1.1.2               
## [115] VariantAnnotation_1.38.0    digest_0.6.27              
## [117] Biostrings_2.60.1           rmarkdown_2.9              
## [119] cellranger_1.1.0            htmlTable_2.2.1            
## [121] restfulr_0.0.13             curl_4.3.1                 
## [123] Rsamtools_2.8.0             rjson_0.2.20               
## [125] lifecycle_1.0.0             jsonlite_1.7.2             
## [127] carData_3.0-4               viridisLite_0.4.0          
## [129] BSgenome_1.60.0             fansi_0.5.0                
## [131] pillar_1.6.1                lattice_0.20-44            
## [133] KEGGREST_1.32.0             fastmap_1.1.0              
## [135] httr_1.4.2                  survival_3.2-11            
## [137] glue_1.4.2                  zip_2.2.0                  
## [139] bamsignals_1.24.0           png_0.1-7                  
## [141] iterators_1.0.13            bit_4.0.4                  
## [143] stringi_1.6.2               sass_0.4.0                 
## [145] blob_1.2.1                  latticeExtra_0.6-29        
## [147] memoise_2.0.0

9 Workshop materials

9.1 Workshops HTMLs

ELMER data Workshop HTML: http://rpubs.com/tiagochst/elmer-data-workshop-2019
ELMER analysis Workshop HTML: http://rpubs.com/tiagochst/ELMER_workshop
ATAC-seq Workshop HTML: http://rpubs.com/tiagochst/atac_seq_workshop

9.2 Workshop videos

We have a set of recorded videos, explaining some of the workshops.

All videos playlist: https://www.youtube.com/playlist?list=PLoDzAKMJh15kNpCSIxpSuZgksZbJNfmMt
ELMER algorithm: https://youtu.be/PzC31K9vfu0
ELMER data: https://youtu.be/R00wG--tGo8
ELMER analysis part1 : https://youtu.be/bcd4uyxrZCw
ELMER analysis part2: https://youtu.be/vcJ_DSCt4Mo
ELMER summarizing several analysis: https://youtu.be/moLeik7JjLk
ATAC-Seq workshop: https://youtu.be/3ftZecz0lU4

References

Buenrostro, Jason D, Paul G Giresi, Lisa C Zaba, Howard Y Chang, and William J Greenleaf. 2013. “Transposition of Native Chromatin for Fast and Sensitive Epigenomic Profiling of Open Chromatin, DNA-Binding Proteins and Nucleosome Position.” Nature Methods 10 (12): 1213.

Corces, M Ryan, Jeffrey M Granja, Shadi Shams, Bryan H Louie, Jose A Seoane, Wanding Zhou, Tiago C Silva, et al. 2018. “The Chromatin Accessibility Landscape of Primary Human Cancers.” Science 362 (6413): eaav1898.

Corces, M Ryan, Alexandro E Trevino, Emily G Hamilton, Peyton G Greenside, Nicholas A Sinnott-Armstrong, Sam Vesuna, Ansuman T Satpathy, et al. 2017. “An Improved ATAC-Seq Protocol Reduces Background and Enables Interrogation of Frozen Tissues.” Nature Methods 14 (10): 959.

Workshop for ATAC-Seq analysis

Benjamin P. Berman, Tiago C. Silva

2021-06-16