1 Introduction
An Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), initially described by Buenrostro et al. (2013), captures open chromatin sites and is able to reveal the interplay between genomic locations of open chromatin, DNA-binding proteins, individual nucleosomes and chromatin compaction at nucleotide resolution .
Later, Corces et al. (2017) described Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility able to generates chromatin accessibility profiles from archival frozen tissue samples, which was recently used to investigate the genome-wide chromatin accessibility profiles of 410 tumor samples spanning 23 cancer types from The Cancer Genome Atlas (TCGA) (Corces et al. 2018). The integration of this data with other TCGA multi-omic data was able to provide numerous discoveries which helped to understand the noncoding genome in cancer to advance diagnosis and therapy.
2 How to cite the data
If you use TCGA ATAC-Seq data in your analysis, please cite the following paper:
CORCES, M. Ryan, et al. The chromatin accessibility landscape of primary human cancers. Science, 2018, vol. 362, no 6413, p. eaav1898. https://doi.org/10.1126/science.aav1898
3 Workshop description
In this workshop, we present in more detail this TCGA ATAC-seq data, which is available to the public through the Genomic Data Commons Portal (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG), and demonstrate how this data can be analyzed within the R/Bioconductor environment.
For more information about the data please visit GDC publication website and read the paper:
- CORCES, M. Ryan, et al. The chromatin accessibility landscape of primary human cancers. Science, 2018, vol. 362, no 6413, p. eaav1898. https://doi.org/10.1126/science.aav1898
3.1 Pre-requisites
- Basic knowledge of R syntax
3.2 Workshop Participation
Students will have a chance to download ATAC-Seq cancer-specific peaks from GDC and import to R. After, esophageal adenocarcinoma (ESAD) vs esophageal squamous cell carcinoma (ESCC) analysis is performed and the results are visualized as a volcano plot and a heatmap.
3.3 Goals and objectives
- Download and understand the ATAC-Seq data
- Compare two different groups of samples ATAC-Seq data
3.4 Enviroment: R libraries
The code below will load all the required R libraries to perform the workshop. Their version is available at the session information section.
## Warning: package 'GenomeInfoDb' was built under R version 4.0.4
# other ones used to prepare the data
library(tidyr)
library(dplyr)
library(SummarizedExperiment)
# For the t.test loop
library(plyr)
# For easy volcano plot
library(TCGAbiolinks)
# For heatmap plot
library(ComplexHeatmap)
library(circlize)
# For the bigwig plot
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
## Warning: package 'GenomicFeatures' was built under R version 4.0.4
If one of them is not available in your R enviroment, you can easily install them with BiocManager
as shown below.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
list.of.packages <- c(
"karyoploteR",
"circlize",
"ComplexHeatmap",
"TxDb.Hsapiens.UCSC.hg38.knownGene",
"TCGAbiolinks",
"plyr",
"dplyr",
"tidyr",
"GenomicRanges",
"readr"
)
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) BiocManager::install(new.packages)
4 Data
The files used in this workshop are available at at google drive which contains some of the TCGA ATAC-seq data from https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.
5 Understanding the data: peaks sets
The ATAC-Seq data used in this workshop is available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG. And, it is important to highlight, that this data has been aligned against Homo sapiens (human) genome assembly GRCh38 (hg38).
The paper introduces 3 sets of peaks: 1) sample peak set, 2) cancer type-specific peak set and 3) pan-cancer peak set.
- Sample peak set description from the paper:
Overlapping peaks called within a single sample were handled using an iterative removal procedure. First, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then, this process iterates to the next most significant peak and so on until all peaks have either been kept or removed due to direct overlap with a more significant peak. This resulted in a set of fixed-width peaks for each sample which we refer to here as a “sample peak set”.
Since the samples varied in read depth or quality, to compare samples they developed a robust method to normalize peak significance scores across samples and cancer types, which uses MACS2 peak scores (-log10(p-value)) for each sample and converted them to a “score per million” by dividing each individual peak score by the sum of all of the peak scores in the given sample divided by 1 million.
- “cancer type-specific peak set” description:
First, all “sample peak sets” from a given cancer type were combined into a cumulative peak set and trimmed for overlap using the same iterative procedure mentioned above. Contains all of the reproducible peaks observed in an individual cancer type, filtered to those which were observed in at least two samples with a score per million value \(>=5\).
- “pan-cancer peak set” description:
First, there was a re-normalization of the score per million scores for each “cancer type-specific peak set” to prevent cancer types with more samples or higher quality samples from comprising a disproportionate share of the union peak set. The resulting peak sets were combined with the same iterative procedure mentioned above. These peaks are reproducible peaks from all cancer types that could then be used for cross-cancer comparisons.
All these descriptions can be found at the supplemental material from Corces et al. (2018) in the Peak calling section.
5.1 Comparing pan-cancer peak set and cancer type-specific peak set
If we check the both sets (Files downloaded from GDC: “All cancer type-specific peak sets. [ZIP]” and “Pan-cancer peak set. [TXT]”), the set of peaks “pan-cancer peak set” consists of \(~562K\) peaks, and it contains a subset of each “cancer type-specific peak set”. We show an example for Esophageal carcinoma (ESCA) below.
## # A tibble: 6 x 8
## seqnames start end name score annotation percentGC percentAT
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 chr1 1290095 1290596 ESCA_107 2.46 3' UTR 0.677 0.323
## 2 chr1 1291115 1291616 ESCA_108 2.59 3' UTR 0.703 0.297
## 3 chr1 1291753 1292254 ESCA_109 7.58 3' UTR 0.639 0.361
## 4 chr1 1440824 1441325 ESCA_160 4.47 3' UTR 0.659 0.341
## 5 chr1 1630188 1630689 ESCA_179 20.6 3' UTR 0.729 0.271
## 6 chr1 2030218 2030719 ESCA_341 15.6 3' UTR 0.619 0.381
## Number of peaks: 127055
## Range score: 1.82339062289625 - 1476.55719343776
# pan-cancer peak set
atac_pan <- readr::read_tsv("Data/TCGA-ATAC_PanCancer_PeakSet.txt")
head(atac_pan)
## # A tibble: 6 x 7
## seqnames start end name score annotation percentGC
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 chr1 906012 906513 ACC_10 7.17 Intron 0.613
## 2 chr2 112541661 112542162 ACC_10008 22.0 Promoter 0.555
## 3 chr1 21673421 21673922 ACC_1001 6.46 Distal 0.509
## 4 chr2 112584205 112584706 ACC_10013 43.2 Promoter 0.587
## 5 chr2 112596243 112596744 ACC_10016 5.43 Intron 0.491
## 6 chr1 21725692 21726193 ACC_1002 5.20 Intron 0.405
# check the names of the PanCAN set of peaks
plyr::count(stringr::str_split(atac_pan$name,"_",simplify = T)[,1])
## x freq
## 1 ACC 29311
## 2 BLCA 25337
## 3 BRCA 49748
## 4 CESC 14358
## 5 CHOL 11819
## 6 COAD 25404
## 7 ESCA 13237
## 8 GBM 15394
## 9 HNSC 16651
## 10 KIRC 15067
## 11 KIRP 24324
## 12 LGG 23836
## 13 LIHC 35787
## 14 LUAD 23729
## 15 LUSC 22195
## 16 MESO 22958
## 17 PCPG 31372
## 18 PRAD 30067
## 19 SKCM 36591
## 20 STAD 17358
## 21 TGCT 26120
## 22 THCA 31568
## 23 UCEC 20478
## Range score: 1.022347327 - 2527.059401
However, it is important to highlight that the “pan-cancer peak set” will keep the most significant peaks (highest score) for the overlapping peaks. In other words, the name in the “pan-cancer peak set” consists of the one cancer-specific one with highest score. If we check the regions overlap of the ESCA peaks, we can see that the majority of the peaks are still within the PAN-can, but they are higher in another cancer type.
atac_esca.gr <- makeGRangesFromDataFrame(atac_esca,keep.extra.columns = T)
atac_pan.gr <- makeGRangesFromDataFrame(atac_pan,keep.extra.columns = T)
length(subsetByOverlaps(atac_esca.gr,atac_pan.gr))
## [1] 126935
5.2 Checking an overlapping peak
So, we will check an overlaping peak. The named ESCA_17603" peak is not within the PanCAN set of peaks, because it overlaps the “ACC_10008” peak, which has a higher score.
## [1] FALSE
## GRanges object with 1 range and 4 metadata columns:
## seqnames ranges strand | name score annotation
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## [1] chr2 112541661-112542162 * | ACC_10008 22.0306 Promoter
## percentGC
## <numeric>
## [1] 0.55489
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
## GRanges object with 1 range and 5 metadata columns:
## seqnames ranges strand | name score annotation
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## [1] chr2 112541649-112542150 * | ESCA_17603 6.25799 Promoter
## percentGC percentAT
## <numeric> <numeric>
## [1] 0.556886 0.443114
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
5.3 Peaks size
Also, it is important to note that the peaks size is the same. Each peak has a size of 502bp. The main reason to use fixed-width peaks is stated in the supplemental material:
We chose to use fixed-width peaks because (i) it makes count based and motif focused analyses less biased to large peaks and (ii) with large datasets merging peak sets to obtain a union peak set can lead to many peaks being merged into one very large peak, limiting our ability to resolve independent peaks.
## [1] 502
## [1] 502
5.4 Summary
In summary, in the pan-can set the ESCA named peaks will be the ones that have the strongest signal on the ESCA samples when compared to the other cancer types, which will be a subset of all ESCA peaks. So, if you are looking for all ATAC-Seq ESCA peaks identified in at least two samples the cancer-specific set should be used.
6 Analysis: Using ATAC-Seq counts to compare two groups
Through the next section we will load the normalized counts data and compare two groups of samples, to identify which peaks are stronger in a given group compared to the other one.
The main two files used in this section are:
- Normalized ATAC-Seq insertion counts within the pan-cancer peak set. Recommended format. [RDS]
- All cancer type-specific count matrices in normalized counts. [ZIP]
In the code below, we are showing the beginning of the objects. It is important to highlight that the samples are using Stanford UUID
instead of TCGA barcodes
and each patient normally has two replicates.
## # A tibble: 4 x 8
## seqnames start end name score ESCA_1E6AC686_96C2_4… ESCA_1E6AC686_96C2_…
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 chr1 10180 10679 ESCA_1 2.08 1.44 1.22
## 2 chr1 180641 181140 ESCA_2 2.17 2.55 2.23
## 3 chr1 181193 181692 ESCA_3 6.63 2.10 1.98
## 4 chr1 184246 184745 ESCA_4 2.59 1.44 1.16
## # … with 1 more variable:
## # ESCA_1FEF5E19_2351_4D46_99B4_63F0637ABF17_X010_S08_L063_B1_T1_P016 <dbl>
We will change the samples names to TCGA barcodes using the file “Lookup table for various TCGA sample identifiers. [TXT]” from GDC.
gdc.file <- "https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c"
samples.ids <- readr::read_tsv(gdc.file, col_types = readr::cols())
samples.ids$sample <- substr(samples.ids$Case_ID,1,16)
head(samples.ids)
## # A tibble: 6 x 6
## bam_prefix stanfordUUID aliquot_id Case_UUID Case_ID sample
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 BRCA-000CFD9F-AD… 000CFD9F-ADDF-… TCGA-A7-A13F… 2cf68894-16… TCGA-A7-… TCGA-A…
## 2 BRCA-000CFD9F-AD… 000CFD9F-ADDF-… TCGA-A7-A13F… 2cf68894-16… TCGA-A7-… TCGA-A…
## 3 PCPG-007124EC-1F… 007124EC-1F9B-… TCGA-RM-A68W… 1a1cf490-8b… TCGA-RM-… TCGA-R…
## 4 PCPG-007124EC-1F… 007124EC-1F9B-… TCGA-RM-A68W… 1a1cf490-8b… TCGA-RM-… TCGA-R…
## 5 STAD-00DFAA4D-DE… 00DFAA4D-DE64-… TCGA-BR-A4J1… e9a98a44-83… TCGA-BR-… TCGA-B…
## 6 STAD-00DFAA4D-DE… 00DFAA4D-DE64-… TCGA-BR-A4J1… e9a98a44-83… TCGA-BR-… TCGA-B…
colnames(atac_esca_norm_ct)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",colnames(atac_esca_norm_ct)[-c(1:5)]),samples.ids$bam_prefix)]
atac_esca_norm_ct[1:4,1:8]
## # A tibble: 4 x 8
## seqnames start end name score `TCGA-IG-A51D-01A-31… `TCGA-IG-A51D-01A-3…
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 chr1 10180 10679 ESCA_1 2.08 1.44 1.22
## 2 chr1 180641 181140 ESCA_2 2.17 2.55 2.23
## 3 chr1 181193 181692 ESCA_3 6.63 2.10 1.98
## 4 chr1 184246 184745 ESCA_4 2.59 1.44 1.16
## # … with 1 more variable: TCGA-LN-A9FQ-01A-11-A617-42 <dbl>
atac <- atac_esca_norm_ct
non.cts.idx <- 1:5
samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(non.cts.idx)]))
samples.map <- gsub(",| |NOS","",
gsub("Adenocarcinoma","ESAD",
gsub("Squamous cell carcinoma","ESCC",
paste0(samples.info$primary_diagnosis,"-",samples.info$sample)
)
)
)
colnames(atac)[-c(non.cts.idx)] <- samples.map[match(substr(colnames(atac)[-c(non.cts.idx)],1,16),substr(samples.map,6,21))]
# create SE object
counts <- atac[,-c(1:5)] %>% as.matrix()
rowRanges <- makeGRangesFromDataFrame(atac)
rowRanges$score <- atac$score
rowRanges$name <- atac$name
names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
rownames(counts) <- names(rowRanges)
colData <- DataFrame(unique(left_join(samples.info,samples.ids)))
rownames(colData) <- colnames(counts)
esca.rse <- SummarizedExperiment(
assays = SimpleList(log2norm = as.matrix(counts)),
rowRanges = rowRanges,
colData = colData
)
# Since we have two samples for each patient we will rename tham as rep1 and rep2
duplicated.idx <- duplicated(colnames(esca.rse))
colnames(esca.rse)[!duplicated.idx] <- paste0(colnames(esca.rse)[!duplicated.idx],"_rep1")
colnames(esca.rse)[duplicated.idx] <- paste0(colnames(esca.rse)[duplicated.idx],"_rep2")
## class: RangedSummarizedExperiment
## dim: 126935 33
## metadata(0):
## assays(1): log2norm
## rownames(126935): ESCA_1_chr1_10180_10679 ESCA_2_chr1_180641_181140 ...
## ESCA_126943_chrX_155985136_155985635
## ESCA_126944_chrX_156030008_156030507
## rowData names(2): score name
## colnames(33): ESCC-TCGA-IG-A51D-01A_rep1 ESCC-TCGA-IG-A51D-01A_rep2 ...
## ESAD-TCGA-M9-A5M8-01A_rep1 ESAD-TCGA-M9-A5M8-01A_rep2
## colData names(140): sample patient ... Case_UUID Case_ID
## [1] "ESCC-TCGA-IG-A51D-01A_rep1" "ESCC-TCGA-IG-A51D-01A_rep2"
## [3] "ESCC-TCGA-LN-A9FQ-01A_rep1" "ESCC-TCGA-LN-A9FQ-01A_rep2"
## [5] "ESAD-TCGA-L7-A6VZ-01A_rep1" "ESAD-TCGA-L7-A6VZ-01A_rep2"
## [7] "ESAD-TCGA-L5-A4OE-01A_rep1" "ESAD-TCGA-L5-A4OE-01A_rep2"
## [9] "ESCC-TCGA-LN-A49O-01A_rep1" "ESCC-TCGA-LN-A49O-01A_rep2"
## [11] "ESAD-TCGA-IG-A7DP-01A_rep1" "ESAD-TCGA-IG-A7DP-01A_rep2"
## [13] "ESCC-TCGA-LN-A4A5-01A_rep1" "ESCC-TCGA-LN-A4A5-01A_rep2"
## [15] "ESCC-TCGA-IG-A4P3-01A_rep1" "ESCC-TCGA-IG-A3YA-01A_rep1"
## [17] "ESCC-TCGA-IG-A3YA-01A_rep2" "ESAD-TCGA-IG-A4QS-01A_rep1"
## [19] "ESAD-TCGA-IG-A4QS-01A_rep2" "ESCC-TCGA-LN-A7HX-01A_rep1"
## [21] "ESCC-TCGA-LN-A7HX-01A_rep2" "ESCC-TCGA-LN-A8HZ-01A_rep1"
## [23] "ESCC-TCGA-LN-A4A2-01A_rep1" "ESCC-TCGA-LN-A4A2-01A_rep2"
## [25] "ESCC-TCGA-LN-A4MR-01A_rep1" "ESCC-TCGA-IG-A625-01A_rep1"
## [27] "ESCC-TCGA-IG-A625-01A_rep2" "ESCC-TCGA-LN-A49W-01A_rep1"
## [29] "ESCC-TCGA-LN-A49W-01A_rep2" "ESAD-TCGA-IC-A6RE-01A_rep1"
## [31] "ESAD-TCGA-IC-A6RE-01A_rep2" "ESAD-TCGA-M9-A5M8-01A_rep1"
## [33] "ESAD-TCGA-M9-A5M8-01A_rep2"
6.1 Comparing ESCC vs ESAD ATAC-Seq
A t-test will be used to identify the peaks that have a significant different mean counts between the ESCC and ESAD samples.
escc.idx <- which(esca.rse$primary_diagnosis == "Squamous cell carcinoma, NOS")
esad.idx <- which(esca.rse$primary_diagnosis == "Adenocarcinoma, NOS")
doParallel::registerDoParallel(cores = 4)
result <- plyr::adply(assay(esca.rse),.margins = 1,.fun = function(peak){
results <- t.test(peak[escc.idx],peak[esad.idx],conf.level = TRUE)
return(tibble::tibble("raw_p_value"= results$p.value,
"ESCC_minus_ESAD" = results$estimate[1] - results$estimate[2]))
}, .progress = "time", .id = "peak",.parallel = TRUE)
result$FDR <- stats::p.adjust(result$raw_p_value,method = "fdr")
6.2 Volcano plot of t-test analysis
It is possible to visualize the results of a t.test
as a volcano plot, which can be used to better select a cut-off for the significant ATAC-Seq peaks.
fdr.cut.off <- 0.01
diff.cut.off <- 2
TCGAbiolinks:::TCGAVisualize_volcano(
x = result$ESCC_minus_ESAD,
y = result$FDR,
title = paste0("Volcano plot - ATAC-Seq peaks ",
"difference in ",
"ESCC vs ESAD\n"),
filename = NULL,
label = c("Not Significant",
paste0("High in ESCC (vs ESAD)"),
paste0("Low in ESCC (vs ESAD)")),
ylab = expression(paste(-Log[10],
" (FDR) [two tailed t-test] - cut-off FDR < ",fdr.cut.off
)),
xlab = expression(paste(
"Log2(Counts) difference - cut-off log2 delta(cts) > ",diff.cut.off
)),
x.cut = diff.cut.off,
y.cut = fdr.cut.off
)
# How many peaks pass our cut-offs
table(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)
##
## FALSE TRUE
## 122868 4067
6.3 Heatmap of differential significant peaks
It is possible to visualize the results of a t.test
as a volcano plot, which can be used to better select a cut-off for the significant ATAC-Seq peaks. In this example we will use the \(FDR < 0.01\) and \(\Delta Log_2 Counts > 2\).
Also, we will be using TCGAbiolinks
package to plot it, but you can do it using ggplot2
or plotly
for an interactive volcano plot. You can also find some examples at https://huntsmancancerinstitute.github.io/hciR/volcano.html.
# Colors of the heatmap
pal_atac <- colorRampPalette(c('#3361A5',
'#248AF3',
'#14B3FF',
'#88CEEF',
'#C1D5DC',
'#EAD397',
'#FDB31A',
'#E42A2A',
'#A31D1D'))(100)
# Upper track with the samples annotation
ha = HeatmapAnnotation(
df = data.frame(
"Group" = esca.rse$primary_diagnosis,
"Replicate" = stringr::str_match(colnames(esca.rse),"rep[0-9]?")
),
show_annotation_name = T,
col = list(Group = c("Squamous cell carcinoma, NOS" = "red",
"Adenocarcinoma, NOS" = "blue")),
show_legend = T,
annotation_name_side = "left",
annotation_name_gp = gpar(fontsize = 6)
)
# Select significant peals to plot
plot.atac <- assay(esca.rse)[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]
# Define the color scale
col <- colorRamp2(seq(min(plot.atac), max(plot.atac),
by = (max(plot.atac) - min(plot.atac))/99), pal_atac)
# Show the names of the samples 1 and 18
rows.annot <- rowAnnotation(foo = anno_mark(at = c(1,18), labels = rownames(plot.atac)[c(1,18)]))
# Plot the ATAC-Seq signals
ht_list <-
Heatmap(plot.atac,
name = "ATAC-seq log2(counts)",
col = col,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(
legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)
),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
right_annotation = rows.annot,
row_title = paste0(sum(result$FDR < fdr.cut.off &
abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12)
)
options(repr.plot.width = 15, repr.plot.height=8)
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
", Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")
6.4 Heatmap of differential significant peaks (z-score)
A better way to visualize a heatmap is using the z-score transformation on the rows. Z-scores are centered and normalized, so the user can interpret a color as x standard deviations from the mean and have an intuitive idea of the relative variation of that value. This will make the visibility of the heatmap better since it will reduce the range of the values plots. For more information, please read the discussion here.
In R the function scale
can be used, since it works by column we have to transpose the matrix so it is applied to the peaks instead of the samples and then transpose it back.
plot.atac.row.z.score <- t(scale(t(plot.atac))) # row z-score
col.zscore <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)
ht_list <-
Heatmap(plot.atac.row.z.score,
name = "Row z-score (ATAC-Seq log2(counts))",
col = col.zscore,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
right_annotation = rows.annot,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
row_title = paste0(sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-Seq peaks"),
#column_order = cols.order,
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
#width = unit(15, "cm"),
#column_title = paste0("RNA-seq z-score (n = ", ncol(plot.exp),")"),
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-Seq ESCC vs ESAD (FDR < ",
fdr.cut.off,", Diff mean log2 Count > ",
diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")
6.5 Merging ATAC-Seq replicates
If you want to instead of plot all replicates, to plot a single value for each patient you can get the mean of the values.
# This function will calculate the Means of the peaks for a given group
# in our case we will calculate the mean of the replicates of each patient.
groupMeans <- function(mat, groups = NULL, na.rm = TRUE){
stopifnot(!is.null(groups))
gm <- lapply(unique(groups), function(x){
rowMeans(mat[,which(groups == x),drop = F], na.rm=na.rm)
}) %>% Reduce("cbind",.)
colnames(gm) <- unique(groups)
return(gm)
}
matMerged <- groupMeans(
mat = assays(esca.rse)$log2norm,
groups = colData(esca.rse)$sample
)
# keep only metadata for replicate 1
metadata <- colData(esca.rse)[grep("rep1",rownames(colData(esca.rse))),]
# Create the upper annotation track for the samples
ha = HeatmapAnnotation(
df = data.frame("Group" = metadata$primary_diagnosis),
show_annotation_name = TRUE,
col = list(Group = c("Squamous cell carcinoma, NOS" = "red",
"Adenocarcinoma, NOS" = "blue")),
show_legend = TRUE,
annotation_name_side = "left",
annotation_name_gp = gpar(fontsize = 6))
# Select the significant peaks to be plotted
plot.atac <- matMerged[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]
# Define the color scheme based on the values
col <- colorRamp2(seq(min(plot.atac), max(plot.atac),
by = (max(plot.atac) - min(plot.atac))/99), pal_atac)
# Plot ATAC-Seq signal values
ht_list <-
Heatmap(plot.atac,
name = "ATAC-Seq log2(counts)",
col = col,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
right_annotation = rows.annot,
row_title = paste0(sum(result$FDR < fdr.cut.off &
abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,",
Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")
# Start to plot z-score heatmap
plot.atac.row.z.score <- t(scale(t(plot.atac))) # row z-score
# Recreate color scheme based on the z-score levels we will truncate it from -2 to 2.
col.zscore <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)
ht_list <-
Heatmap(plot.atac.row.z.score,
name = "Row z-score (ATAC-seq log2(counts))",
col = col.zscore,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
right_annotation = rows.annot,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
row_title = paste0(sum(result$FDR < fdr.cut.off &
abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
options(repr.plot.width = 15, repr.plot.height=8)
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
", Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")
7 Other useful codes
7.1 Get all TCGA cancer-specific peaks peaks
The code below will download all cancer specific peaks and transform them into an R object.
library(readr)
library(GenomicRanges)
library(tidyr)
library(dplyr)
library(SummarizedExperiment)
library(plyr)
#-------------------------------
# Cancer specific peaks as SummaerizedExperiment object
#-------------------------------
zip.file <- "export.zip"
if(!file.exists(zip.file)) {
download("https://api.gdc.cancer.gov/data/38b8f311-f3a4-4746-9829-b8e3edb9c157",zip.file)
unzip(zip.file)
}
prepareCancerSpecificPeaks <- function(file){
output <- gsub(".txt",".rda",file)
if(file.exists(output)) return()
atac <- readr::read_tsv(file)
file.samples.ids <- "TCGA_identifier_mapping.txt"
if(!file.exists(file.samples.ids)) downloader::download("https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c",file.samples.ids)
samples.ids <- readr::read_tsv(file.samples.ids)
samples.ids$sample <- substr(samples.ids$Case_ID,1,16)
colnames(atac)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",colnames(atac)[-c(1:5)]),samples.ids$bam_prefix)]
samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(1:5)]))
if(grepl("ESCA",file)){
samples.map <- gsub(",| |NOS","",gsub("Adenocarcinoma","ESAD",gsub("Squamous cell carcinoma","ESCC",paste0(samples.info$primary_diagnosis,"-",samples.info$sample))))
colnames(atac)[-c(1:5)] <- samples.map[match(substr(colnames(atac)[-c(1:5)],1,16),substr(samples.map,6,21))]
} else {
colnames(atac)[-c(1:5)] <- samples.info$sample[match(substr(colnames(atac)[-c(1:5)],1,16),samples.info$sample)]
}
# create SE object
counts <- atac[,-c(1:5)]
rowRanges <- makeGRangesFromDataFrame(atac)
rowRanges$score <- atac$score
rowRanges$name <- atac$name
names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
colData <- DataFrame(unique(left_join(samples.info,samples.ids)))
rse <- SummarizedExperiment(assays=SimpleList(log2norm=as.matrix(counts)),
rowRanges = rowRanges,
colData = colData)
save(rse,file = output, compress = "xz")
}
plyr::a_ply(dir(pattern=".txt"),1, function(f)
tryCatch({prepareCancerSpecificPeaks(f)}, error = function(e){message(e)}),
.progress = "time")
7.2 ATAC-Seq Bigwig
The ATAC-Seq bigwig files available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.
Here is some information about the bigwig files.
All bigWig files for each cancer type are compressed using tar and gzip. As such, each of the .tgz files contains all of the individual bigWig files for each technical replicate.
We recommend extracting the files using the following command: tar -zxvf file_name.tgz –strip-components 8 where the “–strip-components 8” extracts the files without copying their original directory structure
The provided bigWig files have been normalized by the total insertions in peaks and then binned into 100-bp bins. Each 100-bp bin represents the normalized number of insertions that occurred within the corresponding 100 bp.
The bigwig names also use Stanford UUIDs. The script below will help to rename the bigwifiles with TCGA barcodes. First, we get the path to the downloaded bigwig files after uncompressing them and read the information file from the ATAC-Seq website.
bigwig.files <- dir(path = "ATAC-Seq_data/ESCA_bigWigs/",
pattern = ".bw",
all.files = T,
full.names = T)
table <- readr::read_tsv("https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c")
head(table)
Then, for each file, we will map them in the downloaded table and rename them.
# rename bigwig files add more info in the nanem
plyr::a_ply(bigwig.files,1, function(file) {
file.uuid <- stringr::str_extract(file,
"[:alnum:]{8}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{12}")
idx <- grep(file.uuid,gsub("-","_",table$stanfordUUID))
barcode <- unique(table[idx,]$Case_ID)
if(grepl("ESCA",file)){
samples.info <- TCGAbiolinks:::colDataPrepare(barcode)
barcode <- gsub(",| |NOS","",
gsub("Adenocarcinoma","ESAD",
gsub("Squamous cell carcinoma","ESCC",
paste0(samples.info$primary_diagnosis,"-",
samples.info$sample)
)
)
)
}
# change UUID to barcode
to <- gsub(file.uuid,barcode,file)
file.rename(file, to)
})
Since loading several bigWigs might be pretty slow in software like IGV users might want to reduce the bigwig files to a single chromosome (i.e. chr20). The Rscript below can do it by transforming the bigWig to a wig with only chr20 then converting the wig back to bigWig.
You can download at the executable for bigWigToWig
and wigToBigWig
at ENCODE (http://hgdownload.cse.ucsc.edu/admin/exe/) and the hg38.chrom.sizes is available at GitHub (https://raw.githubusercontent.com/igvteam/igv/master/genomes/sizes/hg38.chrom.sizes)
chr <- 20
dirout <- paste0("chr",chr)
dir.create(dirout)
files <- dir(path = ".",pattern = "bw",full.names = T)
for(f in files){
f.in <- f
f.out <- gsub("bw","wig",f)
f.out.chr <- file.path(dirout,gsub("\\.bw",paste0("_chr",chr,".bw"),f))
cmd <- paste0("bigWigToWig -chrom=chr",chr," ", f.in," ", f.out)
system(cmd)
cmd <- paste0("wigToBigWig ", f.out," hg38.chrom.sizes ", f.out.chr)
system(cmd)
}
7.2.1 Visualizing the bigwig files in R
We upload some samples (chromosome 20 only) at the google drive that can be plotted in R using the karyoploteR
package, , which main documentation can be found at https://bernatgel.github.io/karyoploter_tutorial/.
# Load required libraries
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# Plot parameters, only to look better
pp <- getDefaultPlotParams(plot.type = 1)
pp$leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0
# Get transcrupts annotation to get HNF4A regions
tssAnnot <- ELMER::getTSS(genome = "hg38")
tssAnnot <- tssAnnot[tssAnnot$external_gene_name == "HNF4A"]
# plot will be at the HNF4A range +- 50Kb
HNF4A.region <- range(c(tssAnnot)) + 50000
# Start by plotting gene tracks
kp <- plotKaryotype(zoom = HNF4A.region,genome = "hg38", cex = 0.5, plot.params = pp)
genes.data <- makeGenesDataFromTxDb(TxDb.Hsapiens.UCSC.hg38.knownGene,
karyoplot = kp,
plot.transcripts = TRUE,
plot.transcripts.structure = TRUE)
## 1613 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## Loading required package: org.Hs.eg.db
##
genes.data <- mergeTranscripts(genes.data)
kp <- plotKaryotype(zoom = HNF4A.region,genome = "hg38", cex = 0.5, plot.params = pp)
kpAddBaseNumbers(kp, tick.dist = 20000, minor.tick.dist = 5000,
add.units = TRUE, cex = 0.4, tick.len = 3)
kpPlotGenes(kp, data = genes.data, r0 = 0, r1 = 0.25, gene.name.cex = 0.5)
# Start to plot bigwig files
big.wig.files <- dir(path = "Data/ESCA_bigwig_chr20/",
pattern = ".bw",
all.files = T,
full.names = T)
big.wig.files
## [1] "Data/ESCA_bigwig_chr20//ESCA_ESAD-TCGA-L5-A4OE-01A_X015_S05_L033_B1_T1_P034.insertions_chr20.bw"
## [2] "Data/ESCA_bigwig_chr20//ESCA_ESAD-TCGA-L7-A6VZ-01A_X018_S02_L027_B1_T1_P040.insertions_chr20.bw"
## [3] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-IG-A51D-01A_X012_S02_L027_B1_T1_P024.insertions_chr20.bw"
## [4] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-LN-A49O-01A_X008_S08_L015_B1_T1_P017.insertions_chr20.bw"
## [5] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-LN-A4A5-01A_X007_S06_L065_B1_T1_P012.insertions_chr20.bw"
## [6] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-LN-A9FQ-01A_X010_S08_L063_B1_T1_P016.insertions_chr20.bw"
# Reserve area to plot the bigwig files
out.at <- autotrack(1:length(big.wig.files),
length(big.wig.files),
margin = 0.3,
r0 = 0.3,
r1 = 1)
# Add ATAC-seq label from 0.3 to 1 which should cover all ATAC-seq tracks
kpAddLabels(kp,
labels = "ATAC-Seq",
r0 = out.at$r0,
r1 = out.at$r1,
side = "left",
cex = 1,
srt = 90,
pos = 3,
label.margin = 0.1)
for(i in seq_len(length(big.wig.files))) {
bigwig.file <- big.wig.files[i]
# Define where the track will be ploted
# autotrack will simple get the reserved space (from out.at$r0 up to out.at$r1)
# and split in equal sizes for each bigwifile, i the index, will control which
# one is being plotted
at <- autotrack(i, length(big.wig.files), r0 = out.at$r0, r1 = out.at$r1, margin = 0.2)
# Plot bigwig
kp <- kpPlotBigWig(kp,
data = bigwig.file,
ymax = "visible.region",
r0 = at$r0,
col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
r1 = at$r1)
computed.ymax <- ceiling(kp$latest.plot$computed.values$ymax)
# Add track axis
kpAxis(kp,
ymin = 0,
ymax = computed.ymax,
numticks = 2,
r0 = at$r0,
r1 = at$r1,
cex = 0.5)
# Add track label
kpAddLabels(kp,
labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
r0 = at$r0,
r1 = at$r1,
cex = 0.5,
label.margin = 0.01)
}
7.2.2 Visualizing links
One of the probe-gene links identified using HM450 was cg03326606-PARD6B
, which will be plotted below.
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
pp <- getDefaultPlotParams(plot.type = 1)
pp$leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0
cg03326606 <- ELMER:::getInfiniumAnnotation()["cg03326606"]
cg03326606
## GRanges object with 1 range and 52 metadata columns:
## seqnames ranges strand | address_A address_B
## <Rle> <IRanges> <Rle> | <integer> <integer>
## cg03326606 chr20 50729030-50729031 + | 35670461 <NA>
## channel designType nextBase nextBaseRef probeType
## <character> <character> <character> <character> <character>
## cg03326606 Both II G/A C cg
## orientation probeCpGcnt context35 probeBeg probeEnd
## <character> <integer> <integer> <integer> <numeric>
## cg03326606 up 1 2 50728981 50729030
## ProbeSeq_A ProbeSeq_B gene gene_HGNC
## <character> <character> <character> <character>
## cg03326606 TTTATCAAATTCCAACTATT.. <NA> <NA>
## chrm_A beg_A flag_A mapQ_A cigar_A NM_A
## <character> <integer> <integer> <integer> <character> <integer>
## cg03326606 chr20 50728981 0 60 50M 0
## chrm_B beg_B flag_B mapQ_B cigar_B NM_B
## <character> <integer> <integer> <integer> <character> <integer>
## cg03326606 <NA> <NA> <NA> <NA> <NA> <NA>
## wDecoy_chrm_A wDecoy_beg_A wDecoy_flag_A wDecoy_mapQ_A
## <character> <integer> <integer> <integer>
## cg03326606 chr20 50728981 0 60
## wDecoy_cigar_A wDecoy_NM_A wDecoy_chrm_B wDecoy_beg_B
## <character> <integer> <character> <integer>
## cg03326606 50M 0 <NA> <NA>
## wDecoy_flag_B wDecoy_mapQ_B wDecoy_cigar_B wDecoy_NM_B posMatch
## <integer> <integer> <character> <integer> <logical>
## cg03326606 <NA> <NA> <NA> <NA> <NA>
## MASK_mapping MASK_typeINextBaseSwitch MASK_rmsk15 MASK_sub40_copy
## <logical> <logical> <logical> <logical>
## cg03326606 FALSE FALSE FALSE FALSE
## MASK_sub35_copy MASK_sub30_copy MASK_sub25_copy MASK_snp5_common
## <logical> <logical> <logical> <logical>
## cg03326606 FALSE FALSE FALSE FALSE
## MASK_snp5_GMAF1p MASK_extBase MASK_general
## <logical> <logical> <logical>
## cg03326606 FALSE FALSE FALSE
## -------
## seqinfo: 26 sequences from an unspecified genome; no seqlengths
tssAnnot <- ELMER::getTSS(genome = "hg38")
tssAnnot <- tssAnnot[tssAnnot$external_gene_name == "PARD6B"]
pair.region <- range(c(tssAnnot,cg03326606)) + 200
suppressWarnings({
kp <- plotKaryotype(zoom = pair.region, genome = "hg38", cex = 0.5, plot.params = pp)
})
genes.data <- makeGenesDataFromTxDb(TxDb.Hsapiens.UCSC.hg38.knownGene,
karyoplot = kp,
plot.transcripts = TRUE,
plot.transcripts.structure = TRUE)
## 1613 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
genes.data <- addGeneNames(genes.data)
genes.data <- mergeTranscripts(genes.data)
promoter <- promoters(range(tssAnnot))
kp <- plotKaryotype(zoom = pair.region,genome = "hg38", cex = 0.5, plot.params = pp)
kpPlotRegions(kp, toGRanges("chr20:50729030-50729031"), r0=0, r1=0.02, col="#ff8d92")
kpPlotRegions(kp, promoter, r0 = 0, r1 = 0.02, col = "#8d9aff")
kpPlotLinks(kp,
data = toGRanges("chr20:50729030-50729031"),
data2 = promoter,
col = "#fac7ffaa",
r0 = 0.02,
arch.height = 0.1)
kpAddBaseNumbers(kp, tick.dist = 10000, minor.tick.dist = 2000,
add.units = TRUE, cex = 0.5, tick.len = 3)
kpPlotGenes(kp, data = genes.data, r0 = 0.12, r1 = 0.25, gene.name.cex = 0.5)
big.wig.files <- dir(path = "Data/ESCA_bigwig_chr20/",
pattern = ".bw",
all.files = T,
full.names = T)
out.at <- autotrack(1:length(big.wig.files),
length(big.wig.files),
margin = 0.3,
r0 = 0.3,
r1 = 1)
kpAddLabels(kp,
labels = "ATAC-Seq",
r0 = out.at$r0,
r1 = out.at$r1,
side = "left",
cex = 1,
srt = 90,
pos = 3,
label.margin = 0.1)
for(i in seq_len(length(big.wig.files))) {
bigwig.file <- big.wig.files[i]
at <- autotrack(i, length(big.wig.files), r0 = out.at$r0, r1 = out.at$r1, margin = 0.2)
kp <- kpPlotBigWig(kp,
data = bigwig.file,
ymax = 523,
r0 = at$r0,
col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
r1 = at$r1)
computed.ymax <- ceiling(kp$latest.plot$computed.values$ymax)
kpAxis(kp,
ymin = 0,
ymax = computed.ymax,
numticks = 2,
r0 = at$r0,
r1 = at$r1,
cex = 0.5)
kpAddLabels(kp,
labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
r0 = at$r0,
r1 = at$r1,
cex = 1,
label.margin = 0.01)
}
8 Session information
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] org.Hs.eg.db_3.12.0
## [2] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0
## [3] GenomicFeatures_1.42.2
## [4] AnnotationDbi_1.52.0
## [5] karyoploteR_1.16.0
## [6] regioneR_1.22.0
## [7] circlize_0.4.12
## [8] ComplexHeatmap_2.7.9.1000
## [9] TCGAbiolinks_2.19.0
## [10] plyr_1.8.6
## [11] SummarizedExperiment_1.20.0
## [12] Biobase_2.50.0
## [13] MatrixGenerics_1.2.1
## [14] matrixStats_0.58.0
## [15] dplyr_1.0.5
## [16] tidyr_1.1.3
## [17] GenomicRanges_1.42.0
## [18] GenomeInfoDb_1.26.4
## [19] IRanges_2.24.1
## [20] S4Vectors_0.28.1
## [21] BiocGenerics_0.36.0
## [22] readr_1.4.0
## [23] DT_0.17
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.1 R.utils_2.10.1
## [3] tidyselect_1.1.0 RSQLite_2.2.4
## [5] htmlwidgets_1.5.3 BiocParallel_1.24.1
## [7] ELMER_2.14.0 munsell_0.5.0
## [9] codetools_0.2-18 colorspace_2.0-0
## [11] highr_0.8 knitr_1.31
## [13] rstudioapi_0.13 ggsignif_0.6.1
## [15] labeling_0.4.2 GenomeInfoDbData_1.2.4
## [17] bit64_4.0.5 farver_2.1.0
## [19] downloader_0.4 vctrs_0.3.6
## [21] generics_0.1.0 xfun_0.22
## [23] biovizBase_1.38.0 BiocFileCache_1.14.0
## [25] ELMER.data_2.14.0 R6_2.5.0
## [27] doParallel_1.0.16 clue_0.3-58
## [29] AnnotationFilter_1.14.0 bitops_1.0-6
## [31] cachem_1.0.4 reshape_0.8.8
## [33] DelayedArray_0.16.2 assertthat_0.2.1
## [35] scales_1.1.1 nnet_7.3-15
## [37] gtable_0.3.0 Cairo_1.5-12.2
## [39] ensembldb_2.14.0 rlang_0.4.10
## [41] GlobalOptions_0.1.2 splines_4.0.3
## [43] rtracklayer_1.50.0 rstatix_0.7.0
## [45] lazyeval_0.2.2 dichromat_2.0-0
## [47] broom_0.7.5 checkmate_2.0.0
## [49] yaml_2.2.1 reshape2_1.4.4
## [51] abind_1.4-5 backports_1.2.1
## [53] Hmisc_4.5-0 tools_4.0.3
## [55] ggplot2_3.3.3 ellipsis_0.3.1
## [57] jquerylib_0.1.3 RColorBrewer_1.1-2
## [59] MultiAssayExperiment_1.16.0 Rcpp_1.0.6
## [61] base64enc_0.1-3 progress_1.2.2
## [63] zlibbioc_1.36.0 purrr_0.3.4
## [65] RCurl_1.98-1.3 prettyunits_1.1.1
## [67] ggpubr_0.4.0 rpart_4.1-15
## [69] openssl_1.4.3 GetoptLong_1.0.5
## [71] haven_2.3.1 ggrepel_0.9.1
## [73] cluster_2.1.1 magrittr_2.0.1
## [75] data.table_1.14.0 magick_2.7.0
## [77] openxlsx_4.2.3 ProtGenerics_1.22.0
## [79] hms_1.0.0 TCGAbiolinksGUI.data_1.10.0
## [81] evaluate_0.14 XML_3.99-0.6
## [83] rio_0.5.26 jpeg_0.1-8.1
## [85] readxl_1.3.1 gridExtra_2.3
## [87] shape_1.4.5 compiler_4.0.3
## [89] biomaRt_2.46.3 tibble_3.1.0
## [91] crayon_1.4.1 R.oo_1.24.0
## [93] htmltools_0.5.1.1 Formula_1.2-4
## [95] DBI_1.1.1 dbplyr_2.1.0
## [97] rappdirs_0.3.3 Matrix_1.3-2
## [99] car_3.0-10 cli_2.3.1
## [101] R.methodsS3_1.8.1 Gviz_1.34.1
## [103] forcats_0.5.1 pkgconfig_2.0.3
## [105] prettydoc_0.4.1 GenomicAlignments_1.26.0
## [107] foreign_0.8-81 plotly_4.9.3
## [109] xml2_1.3.2 foreach_1.5.1
## [111] bslib_0.2.4 XVector_0.30.0
## [113] rvest_1.0.0 stringr_1.4.0
## [115] bezier_1.1.2 VariantAnnotation_1.36.0
## [117] digest_0.6.27 Biostrings_2.58.0
## [119] rmarkdown_2.7 cellranger_1.1.0
## [121] htmlTable_2.1.0 curl_4.3
## [123] Rsamtools_2.6.0 rjson_0.2.20
## [125] lifecycle_1.0.0 jsonlite_1.7.2
## [127] carData_3.0-4 viridisLite_0.3.0
## [129] askpass_1.1 BSgenome_1.58.0
## [131] fansi_0.4.2 pillar_1.5.1
## [133] lattice_0.20-41 fastmap_1.1.0
## [135] httr_1.4.2 survival_3.2-10
## [137] glue_1.4.2 zip_2.1.1
## [139] bamsignals_1.22.0 png_0.1-7
## [141] iterators_1.0.13 bit_4.0.4
## [143] stringi_1.5.3 sass_0.3.1
## [145] blob_1.2.1 latticeExtra_0.6-29
## [147] memoise_2.0.0
9 Workshop materials
9.1 Workshops HTMLs
- ELMER data Workshop HTML: http://rpubs.com/tiagochst/elmer-data-workshop-2019
- ELMER analysis Workshop HTML: http://rpubs.com/tiagochst/ELMER_workshop
- ATAC-seq Workshop HTML: http://rpubs.com/tiagochst/atac_seq_workshop
9.2 Workshop videos
We have a set of recorded videos, explaining some of the workshops.
- All videos playlist: https://www.youtube.com/playlist?list=PLoDzAKMJh15kNpCSIxpSuZgksZbJNfmMt
- ELMER algorithm: https://youtu.be/PzC31K9vfu0
- ELMER data: https://youtu.be/R00wG--tGo8
- ELMER analysis part1 : https://youtu.be/bcd4uyxrZCw
- ELMER analysis part2: https://youtu.be/vcJ_DSCt4Mo
- ELMER summarizing several analysis: https://youtu.be/moLeik7JjLk
- ATAC-Seq workshop: https://youtu.be/0hx20S-JuLI
References
Buenrostro, Jason D, Paul G Giresi, Lisa C Zaba, Howard Y Chang, and William J Greenleaf. 2013. “Transposition of Native Chromatin for Fast and Sensitive Epigenomic Profiling of Open Chromatin, Dna-Binding Proteins and Nucleosome Position.” Nature Methods 10 (12): 1213.
Corces, M Ryan, Jeffrey M Granja, Shadi Shams, Bryan H Louie, Jose A Seoane, Wanding Zhou, Tiago C Silva, et al. 2018. “The Chromatin Accessibility Landscape of Primary Human Cancers.” Science 362 (6413): eaav1898.
Corces, M Ryan, Alexandro E Trevino, Emily G Hamilton, Peyton G Greenside, Nicholas A Sinnott-Armstrong, Sam Vesuna, Ansuman T Satpathy, et al. 2017. “An Improved Atac-Seq Protocol Reduces Background and Enables Interrogation of Frozen Tissues.” Nature Methods 14 (10): 959.