1 Introduction
An Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), initially described by Buenrostro et al. (2013), captures open chromatin sites and is able to reveal the interplay between genomic locations of open chromatin, DNA-binding proteins, individual nucleosomes and chromatin compaction at nucleotide resolution .
Later, Corces et al. (2017) described Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility able to generates chromatin accessibility profiles from archival frozen tissue samples, which was recently used to investigate the genome-wide chromatin accessibility profiles of 410 tumor samples spanning 23 cancer types from The Cancer Genome Atlas (TCGA) (Corces et al. 2018). The integration of this data with other TCGA multi-omic data was able to provide numerous discoveries which helped to understand the noncoding genome in cancer to advance diagnosis and therapy.
2 How to cite the data
If you use TCGA ATAC-Seq data in your analysis, please cite the following paper:
CORCES, M. Ryan, et al. The chromatin accessibility landscape of primary human cancers. Science, 2018, vol. 362, no 6413, p. eaav1898. https://doi.org/10.1126/science.aav1898
3 Workshop description
In this workshop, we present in more detail this TCGA ATAC-seq data, which is available to the public through the Genomic Data Commons Portal (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG), and demonstrate how this data can be analyzed within the R/Bioconductor environment.
For more information about the data please visit GDC publication website and read the paper:
- CORCES, M. Ryan, et al. The chromatin accessibility landscape of primary human cancers. Science, 2018, vol. 362, no 6413, p. eaav1898. https://doi.org/10.1126/science.aav1898
3.1 Pre-requisites
- Basic knowledge of R syntax
- R version 4.0 or higher (https://cloud.r-project.org/)
- Rstudio installed (https://www.rstudio.com/products/rstudio/download/#download)
- Last version of bioconductor installed
- In the end of this document there is also the version of the libraries used
3.2 Workshop Participation
Students will have a chance to download ATAC-Seq cancer-specific peaks from GDC and import to R. After, esophageal adenocarcinoma (ESAD) vs esophageal squamous cell carcinoma (ESCC) analysis is performed and the results are visualized as a volcano plot and a heatmap.
3.3 Goals and objectives
- Download and understand the ATAC-Seq data
- Compare two different groups of samples ATAC-Seq data
3.4 Enviroment: R libraries
The code below will load all the required R libraries to perform the workshop. Their version is available at the session information section.
# to read txt files
library(readr)
# to transform data into GenomicRanges
library(GenomicRanges)
# other ones used to prepare the data
library(tidyr)
library(dplyr)
library(SummarizedExperiment)
# For the t.test loop
library(plyr)
# For easy volcano plot
library(TCGAbiolinks)
# For heatmap plot
library(ComplexHeatmap)
library(circlize)
library(pheatmap)
# For the bigwig plot
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(ELMER)
If one of them is not available in your R environment, you can easily install them with BiocManager
as shown below.
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}::install(version = "3.13",ask = FALSE) # Install last version of Bioconductor
BiocManager
<- c(
list.of.packages "karyoploteR",
"circlize",
"ComplexHeatmap",
"TxDb.Hsapiens.UCSC.hg38.knownGene",
"TCGAbiolinks",
"plyr",
"dplyr",
"tidyr",
"GenomicRanges",
"readr",
"ELMER"
)<- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
new.packages if(length(new.packages)) BiocManager::install(new.packages)
4 Data
The files used in this workshop are available at at google drive which contains some of the TCGA ATAC-seq data from https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.
5 Understanding the data: peaks sets
The ATAC-Seq data used in this workshop is available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG. And, it is important to highlight, that this data has been aligned against Homo sapiens (human) genome assembly GRCh38 (hg38).
The paper introduces 3 sets of peaks: 1) sample peak set, 2) cancer type-specific peak set and 3) pan-cancer peak set.
- Sample peak set description from the paper:
Overlapping peaks called within a single sample were handled using an iterative removal procedure. First, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then, this process iterates to the next most significant peak and so on until all peaks have either been kept or removed due to direct overlap with a more significant peak. This resulted in a set of fixed-width peaks for each sample which we refer to here as a “sample peak set.”
Since the samples varied in read depth or quality, to compare samples they developed a robust method to normalize peak significance scores across samples and cancer types, which uses MACS2 peak scores (-log10(p-value)) for each sample and converted them to a “score per million” by dividing each individual peak score by the sum of all of the peak scores in the given sample divided by 1 million.
- “cancer type-specific peak set” description:
First, all “sample peak sets” from a given cancer type were combined into a cumulative peak set and trimmed for overlap using the same iterative procedure mentioned above. Contains all of the reproducible peaks observed in an individual cancer type, filtered to those which were observed in at least two samples with a score per million value >=5.
- “pan-cancer peak set” description:
First, there was a re-normalization of the score per million scores for each “cancer type-specific peak set” to prevent cancer types with more samples or higher quality samples from comprising a disproportionate share of the union peak set. The resulting peak sets were combined with the same iterative procedure mentioned above. These peaks are reproducible peaks from all cancer types that could then be used for cross-cancer comparisons.
All these descriptions can be found at the supplemental material from Corces et al. (2018) in the Peak calling section.
5.1 Comparing pan-cancer peak set and cancer type-specific peak set
If we check the both sets (Files downloaded from GDC: “All cancer type-specific peak sets. [ZIP]” and “Pan-cancer peak set. [TXT]”), the set of peaks “pan-cancer peak set” consists of 562K peaks, and it contains a subset of each “cancer type-specific peak set.” We show an example for Esophageal carcinoma (ESCA) below.
# ESCA specific peaks set
<- readr::read_tsv("Data/ESCA_peakCalls.txt")
atac_esca head(atac_esca)
## # A tibble: 6 x 8
## seqnames start end name score annotation percentGC percentAT
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 chr1 1290095 1290596 ESCA_107 2.46 3' UTR 0.677 0.323
## 2 chr1 1291115 1291616 ESCA_108 2.59 3' UTR 0.703 0.297
## 3 chr1 1291753 1292254 ESCA_109 7.58 3' UTR 0.639 0.361
## 4 chr1 1440824 1441325 ESCA_160 4.47 3' UTR 0.659 0.341
## 5 chr1 1630188 1630689 ESCA_179 20.6 3' UTR 0.729 0.271
## 6 chr1 2030218 2030719 ESCA_341 15.6 3' UTR 0.619 0.381
message("Number of peaks: ", nrow(atac_esca))
## Number of peaks: 127055
message("Range score: ", paste(range(atac_esca$score),collapse = " - "))
## Range score: 1.82339062289625 - 1476.55719343776
# pan-cancer peak set
<- readr::read_tsv("Data/TCGA-ATAC_PanCancer_PeakSet.txt")
atac_pan head(atac_pan)
## # A tibble: 6 x 7
## seqnames start end name score annotation percentGC
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 chr1 906012 906513 ACC_10 7.17 Intron 0.613
## 2 chr2 112541661 112542162 ACC_10008 22.0 Promoter 0.555
## 3 chr1 21673421 21673922 ACC_1001 6.46 Distal 0.509
## 4 chr2 112584205 112584706 ACC_10013 43.2 Promoter 0.587
## 5 chr2 112596243 112596744 ACC_10016 5.43 Intron 0.491
## 6 chr1 21725692 21726193 ACC_1002 5.20 Intron 0.405
# check the names of the PanCAN set of peaks
::count(stringr::str_split(atac_pan$name,"_",simplify = T)[,1]) plyr
## x freq
## 1 ACC 29311
## 2 BLCA 25337
## 3 BRCA 49748
## 4 CESC 14358
## 5 CHOL 11819
## 6 COAD 25404
## 7 ESCA 13237
## 8 GBM 15394
## 9 HNSC 16651
## 10 KIRC 15067
## 11 KIRP 24324
## 12 LGG 23836
## 13 LIHC 35787
## 14 LUAD 23729
## 15 LUSC 22195
## 16 MESO 22958
## 17 PCPG 31372
## 18 PRAD 30067
## 19 SKCM 36591
## 20 STAD 17358
## 21 TGCT 26120
## 22 THCA 31568
## 23 UCEC 20478
message("Range score: ", paste(range(atac_pan$score),collapse = " - "))
## Range score: 1.022347327 - 2527.059401
However, it is important to highlight that the “pan-cancer peak set” will keep the most significant peaks (highest score) for the overlapping peaks. In other words, the name in the “pan-cancer peak set” consists of the one cancer-specific one with highest score. If we check the regions overlap of the ESCA peaks, we can see that the majority of the peaks are still within the PAN-can, but they are higher in another cancer type.
# We will be using a GRanges object a data structure to hangle genomic ranges
# This data structure provides access to useful functions such as
# promoters, resize, findOverlps, subsetByOverlaps
# More information at:
# https://www.bioconductor.org/help/course-materials/2015/Uruguay2015/V3-GenomicRanges.html#genomicranges
<- makeGRangesFromDataFrame(atac_esca,keep.extra.columns = T)
atac_esca.gr <- makeGRangesFromDataFrame(atac_pan,keep.extra.columns = T)
atac_pan.gr length(subsetByOverlaps(atac_esca.gr,atac_pan.gr))
## [1] 126935
5.2 Checking an overlapping peak
So, we will check an overlaping peak. The named ESCA_17603" peak is not within the PanCAN set of peaks, because it overlaps the “ACC_10008” peak, which has a higher score.
"ESCA_17603" %in% atac_pan.gr$name
## [1] FALSE
subsetByOverlaps(atac_pan.gr[atac_pan.gr$name == "ACC_10008"],atac_esca.gr)
## GRanges object with 1 range and 4 metadata columns:
## seqnames ranges strand | name score annotation
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## [1] chr2 112541661-112542162 * | ACC_10008 22.0306 Promoter
## percentGC
## <numeric>
## [1] 0.55489
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
subsetByOverlaps(atac_esca.gr,atac_pan.gr[atac_pan.gr$name == "ACC_10008"])
## GRanges object with 1 range and 5 metadata columns:
## seqnames ranges strand | name score annotation
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## [1] chr2 112541649-112542150 * | ESCA_17603 6.25799 Promoter
## percentGC percentAT
## <numeric> <numeric>
## [1] 0.556886 0.443114
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
5.3 Peaks size
Also, it is important to note that the peaks size is the same. Each peak has a size of 502bp. The main reason to use fixed-width peaks is stated in the supplemental material:
We chose to use fixed-width peaks because (i) it makes count based and motif focused analyses less biased to large peaks and (ii) with large datasets merging peak sets to obtain a union peak set can lead to many peaks being merged into one very large peak, limiting our ability to resolve independent peaks.
unique(width(atac_pan.gr))
## [1] 502
unique(width(atac_esca.gr))
## [1] 502
5.4 Summary
In summary, in the pan-can set the ESCA named peaks will be the ones that have the strongest signal on the ESCA samples when compared to the other cancer types, which will be a subset of all ESCA peaks. So, if you are looking for all ATAC-Seq ESCA peaks identified in at least two samples the cancer-specific set should be used.
6 Analysis: Using ATAC-Seq counts to compare two groups
Through the next section we will load the normalized counts data and compare two groups of samples, to identify which peaks are stronger in a given group compared to the other one.
The main two files used in this section are:
- Normalized ATAC-Seq insertion counts within the pan-cancer peak set. Recommended format. [RDS]
- All cancer type-specific count matrices in normalized counts. [ZIP]
In the code below, we are showing the beginning of the objects. It is important to highlight that the samples are using Stanford UUID
instead of TCGA barcodes
and each patient normally has two replicates.
<- readr::read_tsv("Data/ESCA_log2norm.txt") atac_esca_norm_ct
1:4,1:8] atac_esca_norm_ct[
## # A tibble: 4 x 8
## seqnames start end name score ESCA_1E6AC686_96C2_4… ESCA_1E6AC686_96C2_…
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 chr1 10180 10679 ESCA_1 2.08 1.44 1.22
## 2 chr1 180641 181140 ESCA_2 2.17 2.55 2.23
## 3 chr1 181193 181692 ESCA_3 6.63 2.10 1.98
## 4 chr1 184246 184745 ESCA_4 2.59 1.44 1.16
## # … with 1 more variable:
## # ESCA_1FEF5E19_2351_4D46_99B4_63F0637ABF17_X010_S08_L063_B1_T1_P016 <dbl>
We will change the samples names to TCGA barcodes using the file “Lookup table for various TCGA sample identifiers. [TXT]” from GDC.
<- "https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c"
gdc.file <- readr::read_tsv(gdc.file, col_types = readr::cols())
samples.ids $sample <- substr(samples.ids$Case_ID,1,16) # Get first 16 characters
samples.idshead(samples.ids)
## # A tibble: 6 x 6
## bam_prefix stanfordUUID aliquot_id Case_UUID Case_ID sample
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 BRCA-000CFD9F-AD… 000CFD9F-ADDF-… TCGA-A7-A13F… 2cf68894-16… TCGA-A7-… TCGA-A…
## 2 BRCA-000CFD9F-AD… 000CFD9F-ADDF-… TCGA-A7-A13F… 2cf68894-16… TCGA-A7-… TCGA-A…
## 3 PCPG-007124EC-1F… 007124EC-1F9B-… TCGA-RM-A68W… 1a1cf490-8b… TCGA-RM-… TCGA-R…
## 4 PCPG-007124EC-1F… 007124EC-1F9B-… TCGA-RM-A68W… 1a1cf490-8b… TCGA-RM-… TCGA-R…
## 5 STAD-00DFAA4D-DE… 00DFAA4D-DE64-… TCGA-BR-A4J1… e9a98a44-83… TCGA-BR-… TCGA-B…
## 6 STAD-00DFAA4D-DE… 00DFAA4D-DE64-… TCGA-BR-A4J1… e9a98a44-83… TCGA-BR-… TCGA-B…
# match the Stanford UUIDs from the atac_esca_norm_ct column names to the
# column bam_prefix in the samples.ids table and get the TCGA ID (column Case_ID)
# from it to replace as column names
# We remove the first 5 columns since the are not samples names
<- colnames(atac_esca_norm_ct)[-c(1:5)]
standfordUUID colnames(atac_esca_norm_ct)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",standfordUUID),samples.ids$bam_prefix)]
1:4,1:8] atac_esca_norm_ct[
## # A tibble: 4 x 8
## seqnames start end name score `TCGA-IG-A51D-01A-31… `TCGA-IG-A51D-01A-3…
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 chr1 10180 10679 ESCA_1 2.08 1.44 1.22
## 2 chr1 180641 181140 ESCA_2 2.17 2.55 2.23
## 3 chr1 181193 181692 ESCA_3 6.63 2.10 1.98
## 4 chr1 184246 184745 ESCA_4 2.59 1.44 1.16
## # … with 1 more variable: TCGA-LN-A9FQ-01A-11-A617-42 <dbl>
# It is important to notice that our mapping produced replicated column names,
# for that reason the code using dplyr R package (which is more elegant) would not work
<- atac_esca_norm_ct %>% rename_with(
atac_esca_norm_ct .fn = function(standfordUUID){
$Case_ID[match(gsub("_","-",standfordUUID),samples.ids$bam_prefix)]
samples.ids.cols = starts_with("ESCA_")) },
We will next get the TCGA metadata information from GDC using the function TCGAbiolinks:::colDataPrepare
.
<- atac_esca_norm_ct
atac <- 1:5
non.cts.idx <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(non.cts.idx)]))
samples.info
<- paste0(
samples.map $primary_diagnosis,"-",
samples.info$sample
samples.info%>%
) gsub(",| |NOS","",.) } %>% # substitutes spaces by no space, and Remove NOS
{ gsub("Adenocarcinoma","ESAD",.) } %>% # substitutes Adenocarcinoma to ESAD
{ gsub("Squamous cell carcinoma","ESCC",.) } # substitutes Squamous cell carcinoma to ESCC
{
colnames(atac)[-c(non.cts.idx)] <- samples.map[match(substr(colnames(atac)[-c(non.cts.idx)],1,16),substr(samples.map,6,21))]
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# create Summarized Experiment object
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# Matrix with data information
<- atac[,-c(1:5)]
counts
# Genomic ranges information
<- makeGRangesFromDataFrame(atac)
rowRanges $score <- atac$score
rowRanges$name <- atac$name
rowRangesnames(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
# samples information
<- DataFrame(unique(left_join(samples.info,samples.ids)))
colData
<- SummarizedExperiment(
esca.rse assays = SimpleList(
log2norm = as.matrix(counts)
),rowRanges = rowRanges,
colData = colData
)
# Since we have two samples for each patient we will rename them as rep1 and rep2
<- duplicated(colnames(esca.rse))
duplicated.idx colnames(esca.rse)[!duplicated.idx] <- paste0(colnames(esca.rse)[!duplicated.idx],"_rep1")
colnames(esca.rse)[duplicated.idx] <- paste0(colnames(esca.rse)[duplicated.idx],"_rep2")
esca.rse
## class: RangedSummarizedExperiment
## dim: 126935 33
## metadata(0):
## assays(1): log2norm
## rownames(126935): ESCA_1_chr1_10180_10679 ESCA_2_chr1_180641_181140 ...
## ESCA_126943_chrX_155985136_155985635
## ESCA_126944_chrX_156030008_156030507
## rowData names(2): score name
## colnames(33): ESCC-TCGA-IG-A51D-01A_rep1 ESCC-TCGA-IG-A51D-01A_rep2 ...
## ESAD-TCGA-M9-A5M8-01A_rep1 ESAD-TCGA-M9-A5M8-01A_rep2
## colData names(140): sample patient ... Case_UUID Case_ID
colnames(esca.rse)
## [1] "ESCC-TCGA-IG-A51D-01A_rep1" "ESCC-TCGA-IG-A51D-01A_rep2"
## [3] "ESCC-TCGA-LN-A9FQ-01A_rep1" "ESCC-TCGA-LN-A9FQ-01A_rep2"
## [5] "ESAD-TCGA-L7-A6VZ-01A_rep1" "ESAD-TCGA-L7-A6VZ-01A_rep2"
## [7] "ESAD-TCGA-L5-A4OE-01A_rep1" "ESAD-TCGA-L5-A4OE-01A_rep2"
## [9] "ESCC-TCGA-LN-A49O-01A_rep1" "ESCC-TCGA-LN-A49O-01A_rep2"
## [11] "ESAD-TCGA-IG-A7DP-01A_rep1" "ESAD-TCGA-IG-A7DP-01A_rep2"
## [13] "ESCC-TCGA-LN-A4A5-01A_rep1" "ESCC-TCGA-LN-A4A5-01A_rep2"
## [15] "ESCC-TCGA-IG-A4P3-01A_rep1" "ESCC-TCGA-IG-A3YA-01A_rep1"
## [17] "ESCC-TCGA-IG-A3YA-01A_rep2" "ESAD-TCGA-IG-A4QS-01A_rep1"
## [19] "ESAD-TCGA-IG-A4QS-01A_rep2" "ESCC-TCGA-LN-A7HX-01A_rep1"
## [21] "ESCC-TCGA-LN-A7HX-01A_rep2" "ESCC-TCGA-LN-A8HZ-01A_rep1"
## [23] "ESCC-TCGA-LN-A4A2-01A_rep1" "ESCC-TCGA-LN-A4A2-01A_rep2"
## [25] "ESCC-TCGA-LN-A4MR-01A_rep1" "ESCC-TCGA-IG-A625-01A_rep1"
## [27] "ESCC-TCGA-IG-A625-01A_rep2" "ESCC-TCGA-LN-A49W-01A_rep1"
## [29] "ESCC-TCGA-LN-A49W-01A_rep2" "ESAD-TCGA-IC-A6RE-01A_rep1"
## [31] "ESAD-TCGA-IC-A6RE-01A_rep2" "ESAD-TCGA-M9-A5M8-01A_rep1"
## [33] "ESAD-TCGA-M9-A5M8-01A_rep2"
6.1 Comparing ESCC vs ESAD ATAC-Seq
A t-test will be used to identify the peaks that have a significant different mean counts between the ESCC and ESAD samples.
<- which(esca.rse$primary_diagnosis == "Squamous cell carcinoma, NOS")
escc.idx <- which(esca.rse$primary_diagnosis == "Adenocarcinoma, NOS")
esad.idx
<- plyr::adply(
result .data = assay(esca.rse),
.margins = 1,
.fun = function(peak){
<- t.test(peak[escc.idx],peak[esad.idx],conf.level = TRUE)
results return(
::tibble(
tibble"raw_p_value"= results$p.value,
"ESCC_minus_ESAD" = results$estimate[1] - results$estimate[2])
).progress = "time", .id = "peak")
},
$FDR <- stats::p.adjust(result$raw_p_value,method = "fdr") result
6.2 Volcano plot of t-test analysis
It is possible to visualize the results of a t.test
as a volcano plot, which can be used to better select a cut-off for the significant ATAC-Seq peaks.
<- 0.0001
fdr.cut.off <- 2.5
diff.cut.off
:::TCGAVisualize_volcano(
TCGAbiolinksx = result$ESCC_minus_ESAD,
y = result$FDR,
title = paste0(
"Volcano plot - ATAC-Seq peaks ",
"difference in ",
"ESCC vs ESAD\n"
),filename = NULL,
label = c(
"Not Significant",
paste0("High in ESCC (vs ESAD)"),
paste0("Low in ESCC (vs ESAD)")
),ylab = expression( paste(
-Log[10]," (FDR) [two tailed t-test] - cut-off FDR < ",fdr.cut.off
)),xlab = expression(paste(
"Log2(Counts) difference - cut-off log2 delta(cts) > ",diff.cut.off
)),x.cut = diff.cut.off,
y.cut = fdr.cut.off
)
# How many peaks pass our cut-offs
table(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)
##
## FALSE TRUE
## 125988 947
6.3 Heatmap of differential significant peaks
It is possible to visualize the results of a t.test
as a volcano plot, which can be used to better select a cut-off for the significant ATAC-Seq peaks. In this example we will use the FDR<0.01 and ΔLog2Counts>2.
Also, we will be using TCGAbiolinks
package to plot it, but you can do it using ggplot2
or plotly
for an interactive volcano plot. You can also find some examples at https://huntsmancancerinstitute.github.io/hciR/volcano.html.
library(ComplexHeatmap)
library(circlize)
# Colors of the heatmap
<- colorRampPalette(
pal_atac c('#3361A5',
'#248AF3',
'#14B3FF',
'#88CEEF',
'#C1D5DC',
'#EAD397',
'#FDB31A',
'#E42A2A',
'#A31D1D')
100)
)(
# Upper track with the samples annotation
= HeatmapAnnotation(
ha df = data.frame(
"Group" = esca.rse$primary_diagnosis,
"Replicate" = stringr::str_match(colnames(esca.rse),"rep[0-9]?")
),show_annotation_name = T,
col = list(
Group = c(
"Squamous cell carcinoma, NOS" = "red",
"Adenocarcinoma, NOS" = "blue"
)
),show_legend = T,
annotation_name_side = "left",
annotation_name_gp = gpar(fontsize = 6)
)
# Select significant peals to plot
<- assay(esca.rse)[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]
plot.atac
# Define the color scale
<- colorRamp2(
col seq(
min(plot.atac),
max(plot.atac),
by = (max(plot.atac) - min(plot.atac))/99),
pal_atac
)
# Show the names of a peaks enriched in ESCA and on in ESCC
<- which(result$FDR < fdr.cut.off & result$ESCC_minus_ESAD > diff.cut.off)[1]
idx1 <- which(result$FDR < fdr.cut.off & result$ESCC_minus_ESAD < diff.cut.off)[1]
idx2
<- rowAnnotation(foo = anno_mark(at = c(idx1,idx2), labels = rownames(plot.atac)[c(idx1,idx2)]))
rows.annot
# Plot the ATAC-Seq signals
<- Heatmap(
ht_list
plot.atac,name = "ATAC-seq log2(counts)",
col = col,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(
legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)
),show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
right_annotation = rows.annot,
row_title = paste0(
sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)," ATAC-seq peaks"
),row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12)
)
options(repr.plot.width = 15, repr.plot.height = 8)
draw(
ht_list,newpage = TRUE,
merge_legend = TRUE,
column_title = paste0(
"ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
", Diff mean log2 Count > ",diff.cut.off,")"
),column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "bottom"
)
6.4 Heatmap of differential significant peaks (z-score)
A better way to visualize a heatmap is using the z-score transformation on the rows. Z-scores are centered and normalized, so the user can interpret a color as x standard deviations from the mean and have an intuitive idea of the relative variation of that value. This will make the visibility of the heatmap better since it will reduce the range of the values plots. For more information, please read the discussion here.
In R the function scale
can be used, since it works by column we have to transpose the matrix so it is applied to the peaks instead of the samples and then transpose it back.
<- pheatmap:::scale_rows(plot.atac) # row z-score
plot.atac.row.z.score <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)
col.zscore
<- Heatmap(
ht_list
plot.atac.row.z.score,name = "Row z-score (ATAC-Seq log2(counts))",
col = col.zscore,
column_names_gp = gpar(fontsize = 8),
show_column_names = FALSE,
heatmap_legend_param = list(
legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)
),show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
right_annotation = rows.annot,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = TRUE,
row_title = paste0(
sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)," ATAC-Seq peaks"
),row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12)
)
draw(
ht_list,newpage = TRUE,
merge_legend = TRUE,
column_title = paste0(
"ATAC-Seq ESCC vs ESAD (FDR < ",
", Diff mean log2 Count > ",
fdr.cut.off,")"
diff.cut.off,
),column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "bottom"
)
6.5 Merging ATAC-Seq replicates
If you want to instead of plot all replicates, to plot a single value for each patient you can get the mean of the values.
# This function will calculate the Means of the peaks for a given group
# in our case we will calculate the mean of the replicates of each patient.
<- function(mat, groups = NULL, na.rm = TRUE){
groupMeans stopifnot(!is.null(groups))
<- lapply(unique(groups), function(x){
gm rowMeans(mat[,which(groups == x),drop = F], na.rm=na.rm)
%>% Reduce("cbind",.)
}) colnames(gm) <- unique(groups)
return(gm)
}<- groupMeans(mat = assays(esca.rse)$log2norm, groups = colData(esca.rse)$sample)
matMerged
# keep only metadata for replicate 1
<- colData(esca.rse)[grep("rep1",rownames(colData(esca.rse))),] metadata
# Create the upper annotation track for the samples
<- HeatmapAnnotation(
ha df = data.frame(
"Group" = metadata$primary_diagnosis
),show_annotation_name = TRUE,
col = list(
Group = c(
"Squamous cell carcinoma, NOS" = "red",
"Adenocarcinoma, NOS" = "blue")
),show_legend = TRUE,
annotation_name_side = "left",
annotation_name_gp = gpar(fontsize = 6)
)
# Select the significant peaks to be plotted
<- matMerged[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]
plot.atac
# Define the color scheme based on the values
<- colorRamp2(
col breaks = seq(
min(plot.atac),
max(plot.atac),
by = (max(plot.atac) - min(plot.atac))/99),
colors = pal_atac
)
# Plot ATAC-Seq signal values
<- Heatmap(
ht_list matrix = plot.atac,
name = "ATAC-Seq log2(counts)",
col = col,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(
legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)
),show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
right_annotation = rows.annot,
row_title = paste0(
sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off), " ATAC-seq peaks"
),row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12)
)
options(repr.plot.width=15, repr.plot.height=8)
draw(
ht_list,newpage = TRUE,
merge_legend = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,",
Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "bottom"
)
# Start to plot z-score heatmap
# To calculate the z-score we will use the function scale from R
# By default this function is applied on the columns so we need to tranpose the
# matrix to scale by rows and transpose back
<- pheatmap:::scale_rows(plot.atac) # row z-score
plot.atac.row.z.score
# Recreate color scheme based on the z-score levels we will truncate it from -2 to 2.
<- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)
col.zscore
<- Heatmap(
ht_list matrix = plot.atac.row.z.score,
name = "Row z-score (ATAC-seq log2(counts))",
col = col.zscore,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(
legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)
),show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
right_annotation = rows.annot,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = TRUE,
row_title = paste0(
sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)," ATAC-seq peaks"
),row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12)
)
options(repr.plot.width = 15, repr.plot.height=8)
draw(
object = ht_list,
merge_legend = TRUE,
newpage = TRUE,
column_title = paste0(
"ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,", Diff mean log2 Count > ",diff.cut.off,")"
),column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "bottom"
)
7 Other useful codes
7.1 Get all TCGA cancer-specific peaks peaks
The code below will download all cancer specific peaks and transform them into an R object.
library(readr)
library(GenomicRanges)
library(tidyr)
library(dplyr)
library(SummarizedExperiment)
library(plyr)
#-------------------------------
# Cancer specific peaks as SummaerizedExperiment object
#-------------------------------
<- "export.zip"
zip.file if(!file.exists(zip.file)) {
download("https://api.gdc.cancer.gov/data/38b8f311-f3a4-4746-9829-b8e3edb9c157",zip.file)
unzip(zip.file)
}
<- function(file){
prepareCancerSpecificPeaks <- gsub(".txt",".rda",file)
output if(file.exists(output)) return()
<- readr::read_tsv(file)
atac
<- "TCGA_identifier_mapping.txt"
file.samples.ids if(!file.exists(file.samples.ids)){
<- "https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c"
file.url ::download(file.url,file.samples.ids)
downloader
} <- readr::read_tsv(file.samples.ids)
samples.ids $sample <- substr(samples.ids$Case_ID,1,16)
samples.idscolnames(atac)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",colnames(atac)[-c(1:5)]),samples.ids$bam_prefix)]
<- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(1:5)]))
samples.info
if(grepl("ESCA",file)){
<- paste0(samples.info$primary_diagnosis,"-",samples.info$sample) %>%
samples.map gsub(",| |NOS","",.) } %>% # Remove spaces, and Remove NOS
{ gsub("Adenocarcinoma","ESAD",.) } %>% # change Adenocarcinoma to ESAD
{ gsub("Squamous cell carcinoma","ESCC",.) } # change Squamous cell carcinoma to ESCC
{
colnames(atac)[-c(1:5)] <- samples.map[match(substr(colnames(atac)[-c(1:5)],1,16),substr(samples.map,6,21))]
else {
} colnames(atac)[-c(1:5)] <- samples.info$sample[match(substr(colnames(atac)[-c(1:5)],1,16),samples.info$sample)]
}
# create SE object
<- atac[,-c(1:5)]
counts <- makeGRangesFromDataFrame(atac)
rowRanges $score <- atac$score
rowRanges$name <- atac$name
rowRangesnames(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
<- DataFrame(unique(left_join(samples.info,samples.ids)))
colData <- SummarizedExperiment(
rse assays = SimpleList(
log2norm = as.matrix(counts)
),rowRanges = rowRanges,
colData = colData
)save(rse,file = output, compress = "xz")
}
::a_ply(dir(pattern=".txt"),1, function(f)
plyrtryCatch({prepareCancerSpecificPeaks(f)}, error = function(e){message(e)}),
.progress = "time")
7.2 ATAC-Seq Bigwig
The ATAC-Seq bigwig files available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.
Here is some information about the bigwig files.
All bigWig files for each cancer type are compressed using tar and gzip. As such, each of the .tgz files contains all of the individual bigWig files for each technical replicate.
We recommend extracting the files using the following command: tar -zxvf file_name.tgz –strip-components 8 where the “–strip-components 8” extracts the files without copying their original directory structure
The provided bigWig files have been normalized by the total insertions in peaks and then binned into 100-bp bins. Each 100-bp bin represents the normalized number of insertions that occurred within the corresponding 100 bp.
The bigwig names also use Stanford UUIDs. The script below will help to rename the bigwifiles with TCGA barcodes. First, we get the path to the downloaded bigwig files after uncompressing them and read the information file from the ATAC-Seq website.
<- dir(
bigwig.files path = "ATAC-Seq_data/ESCA_bigWigs/",
pattern = ".bw",
all.files = T,
full.names = T
)<- readr::read_tsv("https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c")
table head(table)
Then, for each file, we will map them in the downloaded table and rename them.
# rename bigwig files add more info in the nanem
::a_ply(bigwig.files,1, function(file) {
plyr
<- stringr::str_extract(file,
file.uuid "[:alnum:]{8}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{12}")
<- grep(file.uuid,gsub("-","_",table$stanfordUUID))
idx
<- unique(table[idx,]$Case_ID)
barcode
if (grepl("ESCA",file)) {
<- TCGAbiolinks:::colDataPrepare(barcode)
samples.info <- paste0(
barcode $primary_diagnosis,"-",
samples.info$sample
samples.info%>%
) gsub(",| |NOS","",.) } %>% # Remove spaces, and Remove NOS
{ gsub("Adenocarcinoma","ESAD",.) } %>% # change Adenocarcinoma to ESAD
{ gsub("Squamous cell carcinoma","ESCC",.) } # change Squamous cell carcinoma to ESCC
{
}# change UUID to barcode
<- gsub(file.uuid,barcode,file)
to file.rename(file, to)
})
Since loading several bigWigs might be pretty slow in software like IGV users might want to reduce the bigwig files to a single chromosome (i.e. chr20). The Rscript below can do it by transforming the bigWig to a wig with only chr20 then converting the wig back to bigWig.
You can download at the executable for bigWigToWig
and wigToBigWig
at ENCODE (http://hgdownload.cse.ucsc.edu/admin/exe/) and the hg38.chrom.sizes is available at GitHub (https://raw.githubusercontent.com/igvteam/igv/master/genomes/sizes/hg38.chrom.sizes)
<- 20
chr <- paste0("chr",chr)
dirout dir.create(dirout)
<- dir(path = ".",pattern = "bw",full.names = T)
files for(f in files){
<- f
f.in <- gsub("bw","wig",f)
f.out <- file.path(dirout,gsub("\\.bw",paste0("_chr",chr,".bw"),f))
f.out.chr <- paste0("bigWigToWig -chrom=chr",chr," ", f.in," ", f.out)
cmd system(cmd)
<- paste0("wigToBigWig ", f.out," hg38.chrom.sizes ", f.out.chr)
cmd system(cmd)
}
7.2.1 Visualizing the bigwig files in R
We upload some samples (chromosome 20 only) at the google drive that can be plotted in R using the karyoploteR
package, , which main documentation can be found at https://bernatgel.github.io/karyoploter_tutorial/.
# Load required libraries
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# Plot parameters, only to look better
<- getDefaultPlotParams(plot.type = 1)
pp $leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0
pp
# Get transcrupts annotation to get HNF4A regions
<- ELMER::getTSS(genome = "hg38")
tssAnnot <- tssAnnot[tssAnnot$external_gene_name == "HNF4A"]
tssAnnot
# plot will be at the HNF4A range +- 50Kb
<- range(c(tssAnnot)) + 50000
HNF4A.region HNF4A.region
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr20 44305700-44484596 +
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
# Start by plotting gene tracks
<- plotKaryotype(zoom = HNF4A.region, genome = "hg38", cex = 0.5, plot.params = pp)
kp <- makeGenesDataFromTxDb(
genes.data
TxDb.Hsapiens.UCSC.hg38.knownGene,karyoplot = kp,
plot.transcripts = TRUE,
plot.transcripts.structure = TRUE
)
## 228 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
<- addGeneNames(genes.data = genes.data) genes.data
## Loading required package: org.Hs.eg.db
##
<- mergeTranscripts(genes.data = genes.data)
genes.data
<- plotKaryotype(
kp zoom = HNF4A.region,
genome = "hg38",
cex = 0.5,
plot.params = pp
)kpAddBaseNumbers(
kp,tick.dist = 20000,
minor.tick.dist = 5000,
add.units = TRUE,
cex = 0.4,
tick.len = 3
)kpPlotGenes(
kp,data = genes.data,
r0 = 0,
r1 = 0.25,
gene.name.cex = 0.5
)
# Start to plot bigwig files
<- dir(
big.wig.files path = "Data/ESCA_bigwig_chr20/",
pattern = ".bw",
all.files = T,
full.names = T
) big.wig.files
## [1] "Data/ESCA_bigwig_chr20//ESCA_ESAD-TCGA-L5-A4OE-01A_X015_S05_L033_B1_T1_P034.insertions_chr20.bw"
## [2] "Data/ESCA_bigwig_chr20//ESCA_ESAD-TCGA-L7-A6VZ-01A_X018_S02_L027_B1_T1_P040.insertions_chr20.bw"
## [3] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-IG-A51D-01A_X012_S02_L027_B1_T1_P024.insertions_chr20.bw"
## [4] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-LN-A49O-01A_X008_S08_L015_B1_T1_P017.insertions_chr20.bw"
## [5] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-LN-A4A5-01A_X007_S06_L065_B1_T1_P012.insertions_chr20.bw"
## [6] "Data/ESCA_bigwig_chr20//ESCA_ESCC-TCGA-LN-A9FQ-01A_X010_S08_L063_B1_T1_P016.insertions_chr20.bw"
# Reserve area to plot the bigwig files
<- autotrack(
out.at 1:length(big.wig.files),
length(big.wig.files),
margin = 0.3,
r0 = 0.3,
r1 = 1
)
# Add ATAC-seq label from 0.3 to 1 which should cover all ATAC-seq tracks
kpAddLabels(
kp, labels = "ATAC-Seq",
r0 = out.at$r0,
r1 = out.at$r1,
side = "left",
cex = 1,
srt = 90,
pos = 3,
label.margin = 0.1
)
for(i in seq_len(length(big.wig.files))) {
<- big.wig.files[i]
bigwig.file
# Define where the track will be ploted
# autotrack will simple get the reserved space (from out.at$r0 up to out.at$r1)
# and split in equal sizes for each bigwifile, i the index, will control which
# one is being plotted
<- autotrack(i, length(big.wig.files), r0 = out.at$r0, r1 = out.at$r1, margin = 0.2)
at
# Plot bigwig
<- kpPlotBigWig(
kp
kp, data = bigwig.file,
ymax = "visible.region",
r0 = at$r0,
col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
r1 = at$r1
)<- ceiling(kp$latest.plot$computed.values$ymax)
computed.ymax
# Add track axis
kpAxis(
kp, ymin = 0,
ymax = computed.ymax,
numticks = 2,
r0 = at$r0,
r1 = at$r1,
cex = 0.5
)
# Add track label
kpAddLabels(
kp, labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
r0 = at$r0,
r1 = at$r1,
cex = 0.5,
label.margin = 0.01
) }
7.2.2 Visualizing links
One of the probe-gene links identified using HM450 was cg03326606-PARD6B
, which will be plotted below.
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
<- getDefaultPlotParams(plot.type = 1)
pp $leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0
pp
<- ELMER:::getInfiniumAnnotation()["cg03326606"]
cg03326606 cg03326606
## GRanges object with 1 range and 52 metadata columns:
## seqnames ranges strand | address_A address_B
## <Rle> <IRanges> <Rle> | <integer> <integer>
## cg03326606 chr20 50729030-50729031 + | 35670461 <NA>
## channel designType nextBase nextBaseRef probeType
## <character> <character> <character> <character> <character>
## cg03326606 Both II G/A C cg
## orientation probeCpGcnt context35 probeBeg probeEnd
## <character> <integer> <integer> <integer> <numeric>
## cg03326606 up 1 2 50728981 50729030
## ProbeSeq_A ProbeSeq_B gene gene_HGNC
## <character> <character> <character> <character>
## cg03326606 TTTATCAAATTCCAACTATT.. <NA> <NA>
## chrm_A beg_A flag_A mapQ_A cigar_A NM_A
## <character> <integer> <integer> <integer> <character> <integer>
## cg03326606 chr20 50728981 0 60 50M 0
## chrm_B beg_B flag_B mapQ_B cigar_B NM_B
## <character> <integer> <integer> <integer> <character> <integer>
## cg03326606 <NA> <NA> <NA> <NA> <NA> <NA>
## wDecoy_chrm_A wDecoy_beg_A wDecoy_flag_A wDecoy_mapQ_A
## <character> <integer> <integer> <integer>
## cg03326606 chr20 50728981 0 60
## wDecoy_cigar_A wDecoy_NM_A wDecoy_chrm_B wDecoy_beg_B
## <character> <integer> <character> <integer>
## cg03326606 50M 0 <NA> <NA>
## wDecoy_flag_B wDecoy_mapQ_B wDecoy_cigar_B wDecoy_NM_B posMatch
## <integer> <integer> <character> <integer> <logical>
## cg03326606 <NA> <NA> <NA> <NA> <NA>
## MASK_mapping MASK_typeINextBaseSwitch MASK_rmsk15 MASK_sub40_copy
## <logical> <logical> <logical> <logical>
## cg03326606 FALSE FALSE FALSE FALSE
## MASK_sub35_copy MASK_sub30_copy MASK_sub25_copy MASK_snp5_common
## <logical> <logical> <logical> <logical>
## cg03326606 FALSE FALSE FALSE FALSE
## MASK_snp5_GMAF1p MASK_extBase MASK_general
## <logical> <logical> <logical>
## cg03326606 FALSE FALSE FALSE
## -------
## seqinfo: 26 sequences from an unspecified genome; no seqlengths
<- ELMER::getTSS(genome = "hg38")
tssAnnot <- tssAnnot[tssAnnot$external_gene_name == "PARD6B"]
tssAnnot
<- range(c(tssAnnot,cg03326606)) + 200
pair.region
suppressWarnings({
<- plotKaryotype(zoom = pair.region, genome = "hg38", cex = 0.5, plot.params = pp)
kp
})
<-
genes.data makeGenesDataFromTxDb(
TxDb.Hsapiens.UCSC.hg38.knownGene,karyoplot = kp,
plot.transcripts = TRUE,
plot.transcripts.structure = TRUE
)
## 228 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
<- addGeneNames(genes.data)
genes.data <- mergeTranscripts(genes.data)
genes.data
<- promoters(range(tssAnnot))
promoter
<-
kp plotKaryotype(
zoom = pair.region,
genome = "hg38",
cex = 0.5,
plot.params = pp
)kpPlotRegions(
kp,toGRanges("chr20:50729030-50729031"),
r0 = 0,
r1 = 0.02,
col = "#ff8d92"
)kpPlotRegions(
kp,
promoter,r0 = 0,
r1 = 0.02,
col = "#8d9aff"
)
kpPlotLinks(
kp,data = toGRanges("chr20:50729030-50729031"),
data2 = promoter,
col = "#fac7ffaa",
r0 = 0.02,
arch.height = 0.1
)kpAddBaseNumbers(
kp,tick.dist = 10000,
minor.tick.dist = 2000,
add.units = TRUE,
cex = 0.5,
tick.len = 3
)kpPlotGenes(
kp,data = genes.data,
r0 = 0.12,
r1 = 0.25,
gene.name.cex = 0.5
)
<- dir(
big.wig.files path = "Data/ESCA_bigwig_chr20/",
pattern = ".bw",
all.files = T,
full.names = T
)
<- autotrack(
out.at 1:length(big.wig.files),
length(big.wig.files),
margin = 0.3,
r0 = 0.3,
r1 = 1
)
kpAddLabels(
kp, labels = "ATAC-Seq",
r0 = out.at$r0,
r1 = out.at$r1,
side = "left",
cex = 1,
srt = 90,
pos = 3,
label.margin = 0.1
)
for(i in seq_len(length(big.wig.files))) {
<- big.wig.files[i]
bigwig.file <- autotrack(i, length(big.wig.files), r0 = out.at$r0, r1 = out.at$r1, margin = 0.2)
at <- kpPlotBigWig(
kp
kp, data = bigwig.file,
ymax = 523,
r0 = at$r0,
col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
r1 = at$r1
)<- ceiling(kp$latest.plot$computed.values$ymax)
computed.ymax kpAxis(
kp, ymin = 0,
ymax = computed.ymax,
numticks = 2,
r0 = at$r0,
r1 = at$r1,
cex = 0.5
)kpAddLabels(
kp, labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
r0 = at$r0,
r1 = at$r1,
cex = 1,
label.margin = 0.01
) }
8 Session information
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] org.Hs.eg.db_3.13.0
## [2] ELMER_2.16.0
## [3] ELMER.data_2.16.0
## [4] TxDb.Hsapiens.UCSC.hg38.knownGene_3.13.0
## [5] GenomicFeatures_1.44.0
## [6] AnnotationDbi_1.54.1
## [7] karyoploteR_1.18.0
## [8] regioneR_1.24.0
## [9] pheatmap_1.0.12
## [10] circlize_0.4.13
## [11] ComplexHeatmap_2.8.0
## [12] TCGAbiolinks_2.20.0
## [13] plyr_1.8.6
## [14] SummarizedExperiment_1.22.0
## [15] Biobase_2.52.0
## [16] MatrixGenerics_1.4.0
## [17] matrixStats_0.59.0
## [18] dplyr_1.0.6
## [19] tidyr_1.1.3
## [20] GenomicRanges_1.44.0
## [21] GenomeInfoDb_1.28.0
## [22] IRanges_2.26.0
## [23] S4Vectors_0.30.0
## [24] BiocGenerics_0.38.0
## [25] readr_1.4.0
## [26] DT_0.18
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.1 R.utils_2.10.1
## [3] tidyselect_1.1.1 RSQLite_2.2.7
## [5] htmlwidgets_1.5.3 BiocParallel_1.26.0
## [7] munsell_0.5.0 codetools_0.2-18
## [9] colorspace_2.0-1 filelock_1.0.2
## [11] highr_0.9 knitr_1.33
## [13] rstudioapi_0.13 ggsignif_0.6.2
## [15] labeling_0.4.2 GenomeInfoDbData_1.2.6
## [17] farver_2.1.0 bit64_4.0.5
## [19] downloader_0.4 vctrs_0.3.8
## [21] generics_0.1.0 xfun_0.24
## [23] biovizBase_1.40.0 BiocFileCache_2.0.0
## [25] R6_2.5.0 doParallel_1.0.16
## [27] clue_0.3-59 AnnotationFilter_1.16.0
## [29] bitops_1.0-7 cachem_1.0.5
## [31] reshape_0.8.8 DelayedArray_0.18.0
## [33] assertthat_0.2.1 BiocIO_1.2.0
## [35] scales_1.1.1 nnet_7.3-16
## [37] gtable_0.3.0 Cairo_1.5-12.2
## [39] ensembldb_2.16.0 rlang_0.4.11
## [41] GlobalOptions_0.1.2 splines_4.1.0
## [43] rtracklayer_1.52.0 rstatix_0.7.0
## [45] lazyeval_0.2.2 dichromat_2.0-0
## [47] broom_0.7.7 checkmate_2.0.0
## [49] reshape2_1.4.4 yaml_2.2.1
## [51] abind_1.4-5 backports_1.2.1
## [53] Hmisc_4.5-0 tools_4.1.0
## [55] ggplot2_3.3.4 ellipsis_0.3.2
## [57] jquerylib_0.1.4 RColorBrewer_1.1-2
## [59] MultiAssayExperiment_1.18.0 Rcpp_1.0.6
## [61] base64enc_0.1-3 progress_1.2.2
## [63] zlibbioc_1.38.0 purrr_0.3.4
## [65] RCurl_1.98-1.3 prettyunits_1.1.1
## [67] ggpubr_0.4.0 rpart_4.1-15
## [69] GetoptLong_1.0.5 haven_2.4.1
## [71] ggrepel_0.9.1 cluster_2.1.2
## [73] magrittr_2.0.1 magick_2.7.2
## [75] data.table_1.14.0 openxlsx_4.2.4
## [77] ProtGenerics_1.24.0 hms_1.1.0
## [79] TCGAbiolinksGUI.data_1.12.0 evaluate_0.14
## [81] XML_3.99-0.6 rio_0.5.26
## [83] jpeg_0.1-8.1 readxl_1.3.1
## [85] gridExtra_2.3 shape_1.4.6
## [87] compiler_4.1.0 biomaRt_2.48.1
## [89] tibble_3.1.2 crayon_1.4.1
## [91] R.oo_1.24.0 htmltools_0.5.1.1
## [93] Formula_1.2-4 DBI_1.1.1
## [95] dbplyr_2.1.1 rappdirs_0.3.3
## [97] Matrix_1.3-4 car_3.0-10
## [99] cli_2.5.0 R.methodsS3_1.8.1
## [101] Gviz_1.36.1 forcats_0.5.1
## [103] pkgconfig_2.0.3 prettydoc_0.4.1
## [105] GenomicAlignments_1.28.0 foreign_0.8-81
## [107] plotly_4.9.4 xml2_1.3.2
## [109] foreach_1.5.1 bslib_0.2.5.1
## [111] XVector_0.32.0 rvest_1.0.0
## [113] stringr_1.4.0 bezier_1.1.2
## [115] VariantAnnotation_1.38.0 digest_0.6.27
## [117] Biostrings_2.60.1 rmarkdown_2.9
## [119] cellranger_1.1.0 htmlTable_2.2.1
## [121] restfulr_0.0.13 curl_4.3.1
## [123] Rsamtools_2.8.0 rjson_0.2.20
## [125] lifecycle_1.0.0 jsonlite_1.7.2
## [127] carData_3.0-4 viridisLite_0.4.0
## [129] BSgenome_1.60.0 fansi_0.5.0
## [131] pillar_1.6.1 lattice_0.20-44
## [133] KEGGREST_1.32.0 fastmap_1.1.0
## [135] httr_1.4.2 survival_3.2-11
## [137] glue_1.4.2 zip_2.2.0
## [139] bamsignals_1.24.0 png_0.1-7
## [141] iterators_1.0.13 bit_4.0.4
## [143] stringi_1.6.2 sass_0.4.0
## [145] blob_1.2.1 latticeExtra_0.6-29
## [147] memoise_2.0.0
9 Workshop materials
9.1 Workshops HTMLs
- ELMER data Workshop HTML: http://rpubs.com/tiagochst/elmer-data-workshop-2019
- ELMER analysis Workshop HTML: http://rpubs.com/tiagochst/ELMER_workshop
- ATAC-seq Workshop HTML: http://rpubs.com/tiagochst/atac_seq_workshop
9.2 Workshop videos
We have a set of recorded videos, explaining some of the workshops.
- All videos playlist: https://www.youtube.com/playlist?list=PLoDzAKMJh15kNpCSIxpSuZgksZbJNfmMt
- ELMER algorithm: https://youtu.be/PzC31K9vfu0
- ELMER data: https://youtu.be/R00wG--tGo8
- ELMER analysis part1 : https://youtu.be/bcd4uyxrZCw
- ELMER analysis part2: https://youtu.be/vcJ_DSCt4Mo
- ELMER summarizing several analysis: https://youtu.be/moLeik7JjLk
- ATAC-Seq workshop: https://youtu.be/3ftZecz0lU4