library(Seurat)
library(patchwork)
library(dplyr)

Establish the path to the directory containing the outputs from Kallisto Bustools:
pbmc.data <- Read10X(data.dir = "/data/github/barrydigby.github.io/scRNA-Seq/seurat/")

Create the Seurat object. Filtering can be applied in this step, where we can select:
min.features: Include cells where at least this many features are detected.
min.cells: Include features detected in at least this many cells.
For the tutorial we will select conservative filtering values:
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc1k", min.cells = 2, min.features = 100)

Inspect the pbmc object by using the $ or @ symbol to access RNA counts/features and metadata, respectively.
Check the number of genes and samples (cells) in the experiment:
pbmc

## An object of class Seurat
## 21514 features across 722 samples within 1 assay
## Active assay: RNA (21514 features, 0 variable features)
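As an optional sanity check (not part of the original workflow), you can compare the dimensions of the raw count matrix with those of the filtered Seurat object to see how many genes and cells the min.cells/min.features thresholds removed:

dim(pbmc.data)  # genes x cells before filtering
dim(pbmc)       # genes x cells after CreateSeuratObject() filtering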
One can check the nCount_RNA and nFeature_RNA values per cell by using pbmc$nCount_RNA and pbmc$nFeature_RNA, respectively. Alternatively, use pbmc@meta.data to obtain an overview of both:
head(pbmc@meta.data)

## orig.ident nCount_RNA nFeature_RNA
## AAACCCAAGTGGTCAG pbmc1k 5109.5 3461
## AAAGGTATCAACTACG pbmc1k 4604.0 3112
## AAAGTCCAGCGTGTCC pbmc1k 4807.0 3294
## AACACACTCAAGAGTA pbmc1k 5039.9 3578
## AACACACTCGACGAGA pbmc1k 4234.0 3096
## AACAGGGCAGGAGGTT pbmc1k 8773.0 4950
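If you want a quick numerical overview rather than the first few rows, summary() on these columns works as well (a small aside, not in the original text):

summary(pbmc$nCount_RNA)
summary(pbmc$nFeature_RNA)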
In contrast to bulk RNA-Seq, where zero counts are written out as 0, scRNA-Seq counts are held in a sparse matrix in which zeros are displayed as a ., which significantly reduces memory costs. The code below shows the count matrix for three genes over the first 10 cells:
pbmc.data[c("TRDD2", "DR1", "TARBP1"), 1:10]## 3 x 10 sparse Matrix of class "dgCMatrix"
##
## TRDD2 . . . . . . . . . .
## DR1 0.154589 1.125 . . . 2.000873 1.025641 0.006536 2 1.142857
## TARBP1 . . 1 . 1 . . . . .
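To see why sparse storage matters, here is a rough comparison of memory use (an illustrative sketch; exact sizes depend on your dataset):

sparse.size <- object.size(pbmc.data)            # dgCMatrix (sparse) storage
dense.size  <- object.size(as.matrix(pbmc.data)) # the same data as a dense matrix
sparse.size
dense.size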
We have already performed a pre-processing step using CreateSeuratObject(), whereby cells with a low number of detected genes and lowly expressed genes were filtered using min.features and min.cells, respectively.
Further common filtering steps include removing cells with an unusually high number of detected genes or total counts (potential doublets) and removing cells with a high percentage of mitochondrial counts (often low-quality or dying cells).
Calculate the percentage of counts coming from mitochondrial genes in each cell using PercentageFeatureSet() and a regular expression, appending the results to the meta.data of the Seurat object:
pbmc[["percent_mt"]] = PercentageFeatureSet(pbmc, pattern="^MT-")Inspect the results:
head(pbmc@meta.data)

## orig.ident nCount_RNA nFeature_RNA percent_mt
## AAACCCAAGTGGTCAG pbmc1k 5109.5 3461 11.459047
## AAAGGTATCAACTACG pbmc1k 4604.0 3112 8.079931
## AAAGTCCAGCGTGTCC pbmc1k 4807.0 3294 15.921226
## AACACACTCAAGAGTA pbmc1k 5039.9 3578 7.212445
## AACACACTCGACGAGA pbmc1k 4234.0 3096 2.928673
## AACAGGGCAGGAGGTT pbmc1k 8773.0 4950 10.727992
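As a quick check that the regular expression behaved as expected (an optional aside, not part of the original tutorial), you can list the gene names that matched the "^MT-" pattern:

grep("^MT-", rownames(pbmc), value = TRUE)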
Let’s make a plot of these statistics to inform cut-off values:
VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol = 3)

Based on the plots, perform filtering to remove:
Cells with fewer than 300 or more than 6,500 detected genes.
Cells with more than 20,000 total counts.
Cells with a mitochondrial percentage above 10%.

pbmc <- subset(pbmc, subset = nFeature_RNA > 300 & nFeature_RNA < 6500 & nCount_RNA < 20000 & percent_mt < 10)

Plot the filtered object:
VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol = 3)

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method, "LogNormalize", which normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[["RNA"]]@data:
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

In order to extract meaningful biological signal from the dataset, Seurat aims to identify a subset of features (e.g., genes) exhibiting high variability across cells, which therefore represent heterogeneous features to prioritize for downstream analysis. Choosing genes solely on the basis of their log-normalized single-cell variance fails to account for the mean-variance relationship that is inherent to single-cell RNA-Seq. Therefore, a variance-stabilizing transformation is applied to correct for this before the mean-variance relationship is calculated; this is implemented in the FindVariableFeatures() function.
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

Plot the top 20 variable genes:
top20 <- head(VariableFeatures(pbmc), 20)
init_plot <- VariableFeaturePlot(pbmc)
plot_var <- LabelPoints(plot = init_plot, points = top20, repel = TRUE)
plot_var

The dataset now needs to be centered and scaled prior to performing dimensionality reduction techniques such as PCA. To reiterate what we mean by scaling, the ScaleData() function:
Shifts the expression of each gene so that the mean expression across cells is 0.
Scales the expression of each gene so that the variance across cells is 1, giving equal weight to each gene so that highly expressed genes do not dominate downstream analyses.
Perform scaling; the results are stored in pbmc[["RNA"]]@scale.data:
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

Perform PCA on the variable features of the pbmc object. If you wish to run PCA on a different set of genes, define them using the features argument:
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

Inspect the loadings on each PC:
print(pbmc[["pca"]], dims = 1:2, nfeatures = 5)## PC_ 1
## Positive: CST3, LYZ, S100A9, FCN1, CTSS
## Negative: IFITM1, CD3E, CD69, IL32, TRAC
## PC_ 2
## Positive: IFITM1, CD3E, IFITM2, CD7, GZMM
## Negative: CD79A, MS4A1, IGHM.1, IGKC, IGHM
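The embeddings and loadings can also be pulled out of the reduction object directly if you want to work with the matrices themselves (illustrative, not required for the workflow):

head(Embeddings(pbmc[["pca"]])[, 1:5])  # per-cell PC coordinates
head(Loadings(pbmc[["pca"]])[, 1:2])    # per-gene loadings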
Plot the loadings on each PC (note that the plots appear to rank features by their absolute loadings, which is why genes with negative loadings dominate here):
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca", nfeatures = 5)

Plot a 2D scatter plot of the cells, where each cell's position is determined by its embeddings from the reduction technique:
DimPlot(pbmc, reduction = "pca")DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores.
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

The reason we performed PCA was to compress the dataset into a robust representation of its heterogeneity for downstream clustering. However, this raises a question: how many PCs should we include in downstream analyses?
Seurat implements a resampling test inspired by the JackStraw procedure. Seurat randomly permutes a subset of the data (1% by default), reruns PCA to construct a 'null distribution' of feature scores, and repeats this procedure. Seurat then identifies 'significant' PCs as those that have a strong enrichment of low p-value features:
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:15)

The JackStrawPlot() function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). 'Significant' PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line). In this case there appears to be a sharp drop-off in significance after the first 7 PCs.
JackStrawPlot(pbmc, dims = 1:15)

An alternative heuristic method generates an 'Elbow plot': a ranking of principal components based on the percentage of variance explained by each PC. In the plot below we might conclude that taking the top 10 PCs makes the most sense.
ElbowPlot(pbmc)

Seurat applies a graph-based clustering approach, first constructing a KNN graph based on the Euclidean distance in PCA space and then refining the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function and takes as input the previously defined dimensionality of the dataset (we will include the top 10 PCs here).
To cluster the cells, Seurat next applies modularity optimization techniques such as the Louvain algorithm (default) or SLM, to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets:
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 447
## Number of edges: 12763
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8515
## Number of communities: 5
## Elapsed time: 0 seconds
The clusters can be found using the Idents() function:
head(Idents(pbmc), 5)

## AAAGGTATCAACTACG AACACACTCAAGAGTA AACACACTCGACGAGA AACAGGGCAGTGTATC
## 0 2 3 3
## AACCTGAAGATGGTCG
## 0
## Levels: 0 1 2 3 4
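To see how many cells fall into each cluster (a small addition, not in the original text):

table(Idents(pbmc))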
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne")Seurat can help you find markers that define clusters via differential expression. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.
In the example below we test for differentially expressed features of cluster 0 versus all other cells, specifying that genes must be detected in at least 25% of cells in either group:
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, min.pct = 0.25)
head(cluster0.markers, n = 5)

## p_val avg_log2FC pct.1 pct.2 p_val_adj
## BCL11B 1.311813e-53 2.003492 0.819 0.103 2.822234e-49
## IL7R 5.753797e-50 2.186796 0.890 0.209 1.237872e-45
## CD74 1.937419e-49 -3.727260 0.458 0.932 4.168164e-45
## CD3D 9.055952e-49 1.568799 0.852 0.106 1.948297e-44
## TRAC 3.958761e-48 1.918506 0.845 0.134 8.516879e-44
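If you are only interested in genes enriched in (rather than depleted from) a cluster, FindMarkers() accepts an only.pos argument; this is an optional tweak that is not run in this tutorial:

cluster0.pos.markers <- FindMarkers(pbmc, ident.1 = 0, min.pct = 0.25, only.pos = TRUE)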
In the example below we test for differentially expressed features of cluster 1 versus clusters 0 and 4, using the same percentage cut-off:
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, ident.2 = c(0, 4), min.pct = 0.25)
head(cluster1.markers, n = 5)

## p_val avg_log2FC pct.1 pct.2 p_val_adj
## S100A12 3.416364e-66 4.652120 1.00 0.000 7.349965e-62
## VCAN 2.562430e-65 4.314659 0.99 0.000 5.512813e-61
## MNDA 2.562430e-65 3.856130 0.99 0.000 5.512813e-61
## S100A8 1.968865e-63 7.247090 1.00 0.033 4.235816e-59
## CST3 4.254010e-63 4.315201 1.00 0.038 9.152076e-59
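By default FindMarkers() uses a Wilcoxon rank-sum test, but the test.use argument lets you swap in alternatives, for example a ROC test that ranks genes by their power to classify the two groups. This is an optional aside not run in the tutorial:

cluster1.roc <- FindMarkers(pbmc, ident.1 = 1, ident.2 = c(0, 4), min.pct = 0.25, test.use = "roc")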
In the example below we test differentially expressed markers in each cluster vs all other cells iteratively:
pbmc.markers <- FindAllMarkers(pbmc, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)

## # A tibble: 10 x 7
## # Groups: cluster [5]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.75e-50 2.19 0.89 0.209 1.24e-45 0 IL7R
## 2 9.92e- 3 3.14 0.406 0.305 1 e+ 0 0 SEC11C
## 3 1.27e-67 4.71 1 0.187 2.73e-63 1 S100A8
## 4 3.09e-63 4.32 1 0.23 6.64e-59 1 S100A9
## 5 3.12e-86 4.96 0.915 0.005 6.71e-82 2 IGHM.1
## 6 2.66e-75 5.85 0.803 0.003 5.71e-71 2 IGKC
## 7 9.61e-34 2.01 1 0.374 2.07e-29 3 IFI30
## 8 1.86e-17 2.17 0.4 0.055 4.01e-13 3 FCGR3A
## 9 1.88e-76 4.99 1 0.051 4.04e-72 4 NKG7
## 10 1.21e-51 5.57 0.702 0.031 2.60e-47 4 GNLY
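Because pbmc.markers is an ordinary data frame, it can also be filtered with dplyr, for example keeping only markers with an adjusted p-value below 0.05 (an illustrative sketch):

sig.markers <- pbmc.markers %>% filter(p_val_adj < 0.05)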
Plot individual gene expression counts across clusters:
VlnPlot(pbmc, features = c("IGKC", "GNLY"), slot = "counts")Plot differentially expressed features in the reduced dimensional embedding plot (PCA, UMAP, tSNE), set interactive = TRUE for tutorial.
FeaturePlot(pbmc, features = c("GNLY", "IGKC")) # add interactive = TRUE for an interactive plot

Finally, plot the top 10 markers for each cluster in a heatmap:

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
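If you want to reuse the processed object later without re-running the workflow, it can be written to disk (the file name below is illustrative):

saveRDS(pbmc, file = "pbmc1k_seurat.rds")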