ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a cutting-edge technique in genomics that allows researchers to explore the accessibility of chromatin regions within a genome. Chromatin accessibility determines the extent to which genes can be activated or silenced.
Simply, ATAC-seq leverages the use of a transposase enzyme that targets open chromatin regions and “tags” them with adapters for sequencing. By sequencing the tagged DNA fragments, we gain insights into the regions of chromatin that are more accessible.
Here, we specifically use ATAC-seq for Differential Accessibility Analysis by comparing ATAC-seq profiles across different conditions or cell types. These changes in chromatin accessibility may be associated with various biological processes, such as development, disease, or response to stimuli.
Our workflow relies on the nf-core ATAC-seq pipeline v1.2.2.
Briefly, reads were mapped to the reference genome using bwa and peaks were called for each sample independently using MACS2. Sample-specific peaks were combined into a consensus peak-set with bedtools. featureCounts was used to count the number of reads relative to the consensus peak-set across all samples. The resulting raw count matrix served as input to DESeq2 for differential accessibility analysis. Downstream analyses, including plotting and GSEA, were performed in R.
A detailed results description can be found under pipeline_info/results_description.html. A full list of software versions can be found at pipeline_info/software_versions.csv
Results are organized in the atacseq-results directory, which includes:
peakannotatedresults.xlsx: Annotated differential accessibility data for all contrasts.Other subdirectories contain additional information and visualizations:
bigwig: BigWig files for visualization.DiffAnalysis: Files related to differential accessibility analysis.pipeline_info: Information about the pipeline and software versions.qc: Quality control reports and visualizations. atacseq-results
├── peakannotatedresults.xlsx
├── bigwig
│ ├── ...
│ └── <sample>.mLb.clN.bigWig
├── DiffAnalysis
│ ├── FDR
│ ├── consensus_peaks.mRp.clN.annotatePeaks.txt
│ ├── consensus_peaks.mRp.clN.log
│ └── consensus_peaks.mRp.clN.results.txt
├── pipeline_info
│ ├── design_reads.csv
│ ├── pipeline_report.html
│ ├── pipeline_report.txt
│ ├── results_description.html
│ └── software_versions.csv
└── qc
├── mergedLibrary
│ ├── macs_annotatePeaks.mLb.clN.plots.pdf
│ ├── macs_annotatePeaks.mLb.clN.summary.txt
│ └── macs_peak.mLb.clN.plots.pdf
├── mergedReplicate
│ ├── macs_annotatePeaks.mRp.clN.plots.pdf
│ ├── macs_annotatePeaks.mRp.clN.summary.txt
│ └── macs_peak.mRp.clN.plots.pdf
└── multiqc_report.html
The file peakannotatedresults.xlsx contains several columns. Here is a guide to what each represents:
| Column Name | Content |
|---|---|
| interval_id | Peak ID |
| Chr, Start, End, Strand | Location of the peak by chromosome, start and end position, and strand |
| Length | Length of the peak region |
| Annotation | Location within the gene. Often, a given position will overlap with multiple possible annotations. In this case, annotations are prioritized as explained in the Homer Manual |
| Distance to TSS | Distance to nearest transcription start site. Negative values mean upstream of the TSS, positive values mean downstream |
| Nearest PromoterID, EntrezID, Nearest Unigene, Gene Name | Various ID and info on the gene |
<comparison>.baseMean |
Mean number of reads within the peak |
<comparison>.log2FoldChange |
log2 of the ratio of counts in group 1 vs group 2 |
<comparison>.lfcSE |
Standard error for the log2 fold-change |
<comparison>.stat |
Test statistic (used to calculate p-value) |
<comparison>.pvalue |
p-value (not corrected for multiple testing) |
<comparison>.padj |
False discovery rate adjusted p-value |
<sample>.raw |
Raw counts per peak and sample |
<sample>.pseudo |
Normalized counts per peak and sample. These counts have been adjusted to account for differences in sequencing depth between samples. |
Please note: The Annotation field refers to the gene with which the center of the peak overlaps. Distance to TSS, Nearest PromoterID, EntrezID, Nearest Unigene, and Gene Name give information about the gene whose transcription start site (TSS) is closest to the center of the peak.
In the example below, the peak of interest (indicated by the red arrow) falls within the blue gene. In this case, the Annotation field will contain information about this blue gene. However, the nearest TSS is that of the green gene. As a result, Distance to TSS, Nearest PromoterID, EntrezID, Nearest Unigene, and Gene Name will refer to the green gene.
Note, this example project is taken from the following published work: Shep at al. 2015.
ATAC-seq was performed on S. cerevisiae exposed to osmotic stress (0.6 M increase in the NaCl concentration over 15 min). In this work, researchers wish to assess changes in chromatin accessibility specifically in promoter regions. Previously, osmotic stress has been shown to induce transient gene expression changes that peak after 15 min (Ni et al. 2009), suggesting that increased accessibility in promoter regions may be occurring. Researchers also wish to identify whether promoter regions showing both increased expression and increased accessibility are significantly enriched for terms relating to osmotic stress response. If so, this would suggest that the up-regulation of key genes in the response to osmotic stress is modulated through changes in chromatin architecture, specifically in promoter regions.
The following table provides an annotation summary for all peaks identified across samples. Peaks may be annotated as “exon”, “intergenic”, “promoter.TSS” or “TTS”.
| Sample | Exon | Intergenic | promoter.TSS | TTS | Total |
|---|---|---|---|---|---|
| G1_Treatment_T15 | 6 | 16 | 1336 | 53 | 1411 |
| G2_Control_T0 | 23 | 16 | 1059 | 55 | 1153 |
The following table summarizes the differential peak analyses for each comparison.
| G1_Treatment_T15vsG2_Control_T0 genes with FDR <= 0.01: 27 (up=5, down=22) |
| G1_Treatment_T15vsG2_Control_T0 genes with FDR <= 0.01 & FC > 2: 27 (up=5, down=22) |
| G1_Treatment_T15vsG2_Control_T0 genes with FDR <= 0.05: 45 (up=11, down=34) |
| G1_Treatment_T15vsG2_Control_T0 genes with FDR <= 0.05 & FC > 2: 44 (up=11, down=33) |
A volcano plot is a common visualization used in genomics and bioinformatics to simultaneously display both the statistical significance (usually represented by p-values) and the magnitude of change (such as log2 fold change) for a set of features (i.e., genes or peaks). It’s particularly useful for highlighting genes or elements that are both statistically significant and biologically relevant.
A volcano plot is divided into three regions:
Upper Right Quadrant (Significantly Changed): Points in this quadrant represent features (genes, peaks, etc.) that are both statistically significant (small p-value) and exhibit a considerable magnitude of change (large log2 fold change). These are the elements of high interest and are often the focus of downstream analysis.
Upper Left Quadrant (Significant but Opposite Direction): Points in this quadrant are statistically significant but show a change in the opposite direction (negative log2 fold change). These can indicate interesting biological phenomena, such as genes that are down-regulated despite being highly significant.
Lower Half (Not Statistically Significant): Points in the lower half of the plot are those that are not statistically significant (large p-value), regardless of the magnitude of change. They might not be biologically relevant for the current analysis.
Below we highlight peaks where FDR <= 0.05. The volcano plot(s) are interactive, feel free to mouse over the plot to see annotation details for each point.
Gene Set Enrichment Analysis (GSEA) is a computational method used in genomics to extract meaningful biological insights from high-throughput data, such as gene expression or chromatin accessibility. Instead of focusing solely on individual genes, GSEA operates at a higher level by analyzing sets of genes that share common biological functions, pathways, or regulatory mechanisms. The gene sets are defined based on prior biological knowledge (e.g. published biochemical pathways). We performed GSEA using the following R-package: https://github.com/ctlab/fgsea/.
Ontology gene sets
We use the Ontology collection for gene set enrichment analysis. This collection is divided into two sub-collections, the first derived from the Gene Ontology resource (GO) which contains BP (biological process), CC (cellular component), and MF (molecular function) components, and a second derived from the Human Phenotype Ontology (HPO). We specifically assess gene sets that contain at least 3 genes. Below we display pathways that (1) have a p-value <= 0.5 and (2) have a p-adj <= 0.05.
Note: We have performed GSEA across all peaks as well as independently for peaks found in each of the 4 annotation categories.
In this example report, we are working with data from Saccharomyces cerevisiae. The gene set used for enrichment analysis was taken from https://www.yeastgenome.org/.
Understanding the graphs
The graphs below reveal whether a specific set of genes (a gene set) is enriched or over-represented within the data.
The first graph highlights gene sets that have a p-adj <= 0.05. Here, the x-axis represents the position of genes along a ranked list (ranked by fold change). Genes are ordered from the most up-regulated to the least, according to the metric. NES (Normalized Enrichment Score) is a metric that accounts for the size of the gene set and the distribution of genes across the ranked list. This score quantifies how strongly the gene set is enriched at a particular position along the ranked list. Positive enrichment scores indicate that the genes in the set are predominantly found towards the top of the list (highly active), while negative scores indicate enrichment towards the bottom (low activity).
The second graph displays gene sets that have a p-value <= 0.05 while highlighting those that also have a p-adj <= 0.05.
## [[1]]
##
## [[2]]
## NULL
The “Promoter-TSS” annotation category corresponds to chromatin regions that are located near the transcription start site (TSS) of genes. These regions play a critical role in gene regulation, as they often contain regulatory elements that control the initiation of transcription. Analyzing this category in ATAC-seq data helps us understand changes in the accessibility of promoter regions, which can directly influence gene expression levels.
## [[1]]
##
## [[2]]
## NULL
The “Exon” annotation category includes chromatin regions that overlap with exons, which are the coding regions of genes. While exons are traditionally associated with protein-coding sequences, recent studies have highlighted their involvement in various regulatory processes, such as alternative splicing and transcriptional regulation. Exploring changes in exon accessibility through ATAC-seq can reveal insights into the complex interplay between chromatin accessibility and gene regulation.
## [1] "Minimum padj = 0.4654 is above the 0.05 adjusted p-value threshold"
The “Intergenic” annotation category covers chromatin regions that lie between genes and do not overlap with known coding or regulatory elements. Despite their lack of direct association with gene structures, intergenic regions can harbor important regulatory elements, such as enhancers, that influence the expression of nearby genes. Investigating accessibility changes in intergenic regions using ATAC-seq can uncover novel regulatory elements and their roles in gene regulation.
## [1] "Minimum padj = 0.2654 is above the 0.05 adjusted p-value threshold"
The “TTS” annotation category corresponds to chromatin regions near the transcription termination site of genes. These regions play a role in regulating transcriptional termination and gene expression. Changes in accessibility at TTS regions can provide insights into transcriptional dynamics and the coordination of transcriptional processes. Studying TTS accessibility through ATAC-seq contributes to our understanding of gene expression regulation beyond the initiation phase.
## [1] "Minimum padj = 0.5327 is above the 0.05 adjusted p-value threshold"
Our primary objective was to assess changes in chromatin accessibility in S. cerevisiae following osmotic stress. Previous work showed that osmotic stress induced transient gene expression changes that peak after 15 minutes. For this project, our focus was directed toward identifying whether these dynamics extended to increased accessibility in promoter regions.
Our analysis revealed a notable increase in accessibility of promoter regions following 15 minutes of osmotic stress, aligning with the previously observed transient gene expression changes. Furthermore, leveraging Gene Set Enrichment Analysis (GSEA), we noticed a significant enrichment of terms associated with osmotic stress response within these more accessible promoter regions. These findings suggest a compelling link between changes in chromatin accessibility and the regulatory mechanisms underlying osmotic stress response.”