For this tutorial, we will be analyzing a single-cell ATAC-seq dataset of adult mouse brain cells provided by 10x Genomics. The following files are used in this vignette, all available through the 10x Genomics website.
To download the required data, run the following lines in a shell:
This vignette echoes the commands run in the introductory Signac vignette on human PBMC. We provide the same analysis in a different system to demonstrate performance and applicability to other tissue types, and to provide an example from another species.
First load in Signac, Seurat, and some other packages we will be using for analyzing mouse data.
library(Signac)
library(Seurat)
library(GenomeInfoDb)
library(EnsDb.Mmusculus.v79)
library(ggplot2)
library(patchwork)
set.seed(1234)
We can also add gene annotations to the brain object for
the mouse genome. This will allow downstream functions to pull the gene
annotation information directly from the object.
Next we compute some useful per-cell QC metrics.
We can look at the fragment length periodicity for all the cells, and group by cells with high or low nucleosomal signal strength. You can see that cells which are outliers for the mononucleosomal/ nucleosome-free ratio have different banding patterns. The remaining cells exhibit a pattern that is typical for a successful ATAC-seq experiment.
The enrichment of Tn5 integration events at transcriptional start
sites (TSSs) can also be an important quality control metric to assess
the targeting of Tn5 in ATAC-seq experiments. The ENCODE consortium
defined a TSS enrichment score as the number of Tn5 integration site
around the TSS normalized to the number of Tn5 integration sites in
flanking regions. See the ENCODE documentation for more information
about the TSS enrichment score (https://www.encodeproject.org/data-standards/terms/). We
can calculate the TSS enrichment score for each cell using the
TSSEnrichment() function in Signac.
We remove cells that are outliers for these QC metrics.
The first LSI component often captures sequencing depth (technical
variation) rather than biological variation. If this is the case, the
component should be removed from downstream analysis. We can assess the
correlation between each LSI component and sequencing depth using the
DepthCor() function:
Here we see there is a very strong correlation between the first LSI component and the total number of counts for the cell, so we will perform downstream steps without this component.
Now that the cells are embedded in a low-dimensional space, we can
use methods commonly applied for the analysis of scRNA-seq data to
perform graph-based clustering, and non-linear dimension reduction for
visualization. The functions RunUMAP(),
FindNeighbors(), and FindClusters() all come
from the Seurat package.
To help interpret the scATAC-seq data, we can classify cells based on an scRNA-seq experiment from the same biological system (the adult mouse brain). We utilize methods for cross-modality integration and label transfer, described here, with a more in-depth tutorial here.
You can download the raw data for this experiment from the Allen Institute website, and view the code used to construct this object on GitHub. Alternatively, you can download the pre-processed Seurat object here.
We changed default parameters for
FindIntegrationAnchors() and
FindVariableFeatures() (including more features and
dimensions). You can run the analysis both ways, and observe very
similar results. However, when using default parameters we mislabel
cluster 11 cells as Vip-interneurons, when they are in fact a Meis2
expressing CGE-derived interneuron population recently described by us and others. The
reason is that this subset is exceptionally rare in the scRNA-seq data
(0.3%), and so the genes define this subset (for example,
Meis2) were too lowly expressed to be selected in the initial
set of variable features. We therefore need more genes and dimensions to
facilitate cross-modality mapping. Interestingly, this subset is 10-fold
more abundant in the scATAC-seq data compared to the scRNA-seq data (see
this
paper for possible explanations.)
You can see that the RNA-based classifications are entirely consistent with the UMAP visualization, computed only on the ATAC-seq data. We can now easily annotate our scATAC-seq derived clusters (alternately, we could use the RNA classifications themselves). We note three small clusters (13, 20, 21) which represent subdivisions of the scRNA-seq labels. Try transferring the cluster label (which shows finer distinctions) from the allen scRNA-seq dataset, to annotate them!
Here, we find differentially accessible regions between excitatory neurons in different layers of the cortex.
We can also create coverage plots grouped by cluster, cell type, or
any other metadata stored in the object for any genomic region using the
CoveragePlot() function. These represent pseudo-bulk
accessibility tracks, where signal from all cells within a group have
been averaged together to visualize the DNA accessibility in a
region.