library(Seurat)
library(patchwork)
library(dplyr)

Establish the path to the directory containing the outputs from Kallisto Bustools:
pbmc.data <- Read10X(data.dir = "/data/github/barrydigby.github.io/scRNA-Seq/seurat/")

Create the Seurat object. Filtering can be applied in this step, where we can select:
min.features: Include cells where at least this many features are detected.
min.cells: Include features detected in at least this many cells.
For the tutorial we will select conservative filtering values:
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc1k", min.cells = 2, min.features = 100)

Inspect the pbmc object by using the $ or @ symbol to access RNA counts/features and metadata, respectively.
Check the number of genes and samples (cells) in the experiment:
pbmc

## An object of class Seurat
## 21514 features across 722 samples within 1 assay
## Active assay: RNA (21514 features, 0 variable features)
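As an optional sanity check (not part of the original workflow), you can compare the dimensions of the raw count matrix with those of the filtered Seurat object to see how many genes and cells the min.cells/min.features thresholds removed:

dim(pbmc.data)  # genes x cells before filtering
dim(pbmc)       # genes x cells after CreateSeuratObject() filtering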
One can check the nCount_RNA and nFeature_RNA values per cell by using pbmc$nCount_RNA and pbmc$nFeature_RNA, respectively. Alternatively, use pbmc@meta.data to obtain an overview of both:
head(pbmc@meta.data)

## orig.ident nCount_RNA nFeature_RNA
## AAACCCAAGTGGTCAG pbmc1k 5109.5 3461
## AAAGGTATCAACTACG pbmc1k 4604.0 3112
## AAAGTCCAGCGTGTCC pbmc1k 4807.0 3294
## AACACACTCAAGAGTA pbmc1k 5039.9 3578
## AACACACTCGACGAGA pbmc1k 4234.0 3096
## AACAGGGCAGGAGGTT pbmc1k 8773.0 4950
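If you want a quick numerical overview rather than the first few rows, summary() on these columns works as well (a small aside, not in the original text):

summary(pbmc$nCount_RNA)
summary(pbmc$nFeature_RNA)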
In contrast to bulk RNA-Seq, where zero counts are written out as 0, scRNA-Seq counts are held in a sparse matrix in which zeros are displayed as a ., which significantly reduces memory costs. The code below shows the count matrix for three genes over the first 10 cells:
pbmc.data[c("TRDD2", "DR1", "TARBP1"), 1:10]## 3 x 10 sparse Matrix of class "dgCMatrix"
##
## TRDD2 . . . . . . . . . .
## DR1 0.154589 1.125 . . . 2.000873 1.025641 0.006536 2 1.142857
## TARBP1 . . 1 . 1 . . . . .
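To see why sparse storage matters, here is a rough comparison of memory use (an illustrative sketch; exact sizes depend on your dataset):

sparse.size <- object.size(pbmc.data)            # dgCMatrix (sparse) storage
dense.size  <- object.size(as.matrix(pbmc.data)) # the same data as a dense matrix
sparse.size
dense.size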
We have already performed a pre-processing step using CreateSeuratObject(), whereby cells with a low number of detected genes and lowly expressed genes were filtered using min.features and min.cells, respectively.
Further common filtering steps include removing cells with an unusually high number of detected genes or total counts (potential doublets) and removing cells with a high percentage of mitochondrial counts (often low-quality or dying cells).
Calculate the percentage of counts coming from mitochondrial genes in each cell using PercentageFeatureSet() and a regular expression, appending the results to the meta.data of the Seurat object:
pbmc[["percent_mt"]] = PercentageFeatureSet(pbmc, pattern="^MT-")Inspect the results:
head(pbmc@meta.data)

## orig.ident nCount_RNA nFeature_RNA percent_mt
## AAACCCAAGTGGTCAG pbmc1k 5109.5 3461 11.459047
## AAAGGTATCAACTACG pbmc1k 4604.0 3112 8.079931
## AAAGTCCAGCGTGTCC pbmc1k 4807.0 3294 15.921226
## AACACACTCAAGAGTA pbmc1k 5039.9 3578 7.212445
## AACACACTCGACGAGA pbmc1k 4234.0 3096 2.928673
## AACAGGGCAGGAGGTT pbmc1k 8773.0 4950 10.727992
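As a quick check that the regular expression behaved as expected (an optional aside, not part of the original tutorial), you can list the gene names that matched the "^MT-" pattern:

grep("^MT-", rownames(pbmc), value = TRUE)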
Let’s make a plot of these statistics to inform cut-off values:
VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol = 3)

Based on the plots, perform filtering to remove:
Cells with fewer than 300 or more than 6,500 detected genes.
Cells with more than 20,000 total counts.
Cells with a mitochondrial percentage above 10%.

pbmc <- subset(pbmc, subset = nFeature_RNA > 300 & nFeature_RNA < 6500 & nCount_RNA < 20000 & percent_mt < 10)

Plot the filtered object:
VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol = 3)

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method, "LogNormalize", which normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[["RNA"]]@data:
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

In order to extract meaningful biological signal from the dataset, Seurat aims to identify a subset of features (e.g., genes) exhibiting high variability across cells, which therefore represent heterogeneous features to prioritize for downstream analysis. Choosing genes solely on the basis of their log-normalized single-cell variance fails to account for the mean-variance relationship that is inherent to single-cell RNA-Seq. Therefore, a variance-stabilizing transformation is applied to correct for this before the mean-variance relationship is calculated; this is implemented in the FindVariableFeatures() function.
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

Plot the top 20 variable genes:
top20 <- head(VariableFeatures(pbmc), 20)
init_plot <- VariableFeaturePlot(pbmc)
plot_var <- LabelPoints(plot = init_plot, points = top20, repel = TRUE)
plot_var

The dataset now needs to be centered and scaled prior to performing dimensionality reduction techniques such as PCA. To reiterate what we mean by scaling, the ScaleData() function:
Shifts the expression of each gene so that the mean expression across cells is 0.
Scales the expression of each gene so that the variance across cells is 1, giving equal weight to each gene so that highly expressed genes do not dominate downstream analyses.
Perform scaling; the results are stored in pbmc[["RNA"]]@scale.data:
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

Perform PCA on the variable features of the pbmc object. If you wish to run PCA on a different set of genes, define them using the features argument:
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

Inspect the loadings on each PC:
print(pbmc[["pca"]], dims = 1:2, nfeatures = 5)## PC_ 1
## Positive: CST3, LYZ, S100A9, FCN1, CTSS
## Negative: IFITM1, CD3E, CD69, IL32, TRAC
## PC_ 2
## Positive: IFITM1, CD3E, IFITM2, CD7, GZMM
## Negative: CD79A, MS4A1, IGHM.1, IGKC, IGHM
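The embeddings and loadings can also be pulled out of the reduction object directly if you want to work with the matrices themselves (illustrative, not required for the workflow):

head(Embeddings(pbmc[["pca"]])[, 1:5])  # per-cell PC coordinates
head(Loadings(pbmc[["pca"]])[, 1:2])    # per-gene loadings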
Plot the loadings on each PC (note that the plots appear to rank features by their absolute loadings, which is why genes with negative loadings dominate here):
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca", nfeatures = 5)

Plot a 2D scatter plot of the cells, where each cell's position is determined by its embeddings from the reduction technique:
DimPlot(pbmc, reduction = "pca")DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores.
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

The reason we performed PCA was to compress the dataset into a robust representation of its heterogeneity for downstream clustering. However, this raises a question: how many PCs should we include in downstream analyses?
Seurat implements a resampling test inspired by the JackStraw procedure. Seurat randomly permutes a subset of the data (1% by default), reruns PCA to construct a 'null distribution' of feature scores, and repeats this procedure. Seurat then identifies 'significant' PCs as those that have a strong enrichment of low p-value features:
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:15)

The JackStrawPlot() function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). 'Significant' PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line). In this case there appears to be a sharp drop-off in significance after the first 7 PCs.
JackStrawPlot(pbmc, dims = 1:15)

An alternative heuristic method generates an 'Elbow plot': a ranking of principal components based on the percentage of variance explained by each PC. In the plot below we might conclude that taking the top 10 PCs makes the most sense.
ElbowPlot(pbmc)

Seurat applies a graph-based clustering approach, first constructing a KNN graph based on the Euclidean distance in PCA space and then refining the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function and takes as input the previously defined dimensionality of the dataset (we will include the top 10 PCs here).
To cluster the cells, Seurat next applies modularity optimization techniques such as the Louvain algorithm (default) or SLM, to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets:
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 447
## Number of edges: 12763
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8515
## Number of communities: 5
## Elapsed time: 0 seconds
The clusters can be found using the Idents() function:
head(Idents(pbmc), 5)

## AAAGGTATCAACTACG AACACACTCAAGAGTA AACACACTCGACGAGA AACAGGGCAGTGTATC
## 0 2 3 3
## AACCTGAAGATGGTCG
## 0
## Levels: 0 1 2 3 4
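To see how many cells fall into each cluster (a small addition, not in the original text):

table(Idents(pbmc))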
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne")Seurat can help you find markers that define clusters via differential expression. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.
In the example below we test for differentially expressed features of cluster 0 versus all other cells, specifying that genes must be detected in at least 25% of cells in either group:
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, min.pct = 0.25)
head(cluster0.markers, n = 5)

## p_val avg_log2FC pct.1 pct.2 p_val_adj
## BCL11B 1.311813e-53 2.003492 0.819 0.103 2.822234e-49
## IL7R 5.753797e-50 2.186796 0.890 0.209 1.237872e-45
## CD74 1.937419e-49 -3.727260 0.458 0.932 4.168164e-45
## CD3D 9.055952e-49 1.568799 0.852 0.106 1.948297e-44
## TRAC 3.958761e-48 1.918506 0.845 0.134 8.516879e-44
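If you are only interested in genes enriched in (rather than depleted from) a cluster, FindMarkers() accepts an only.pos argument; this is an optional tweak that is not run in this tutorial:

cluster0.pos.markers <- FindMarkers(pbmc, ident.1 = 0, min.pct = 0.25, only.pos = TRUE)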
In the example below we test for differentially expressed features of cluster 1 versus clusters 0 and 4, using the same percentage cut-off:
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, ident.2 = c(0, 4), min.pct = 0.25)
head(cluster1.markers, n = 5)

## p_val avg_log2FC pct.1 pct.2 p_val_adj
## S100A12 3.416364e-66 4.652120 1.00 0.000 7.349965e-62
## VCAN 2.562430e-65 4.314659 0.99 0.000 5.512813e-61
## MNDA 2.562430e-65 3.856130 0.99 0.000 5.512813e-61
## S100A8 1.968865e-63 7.247090 1.00 0.033 4.235816e-59
## CST3 4.254010e-63 4.315201 1.00 0.038 9.152076e-59
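By default FindMarkers() uses a Wilcoxon rank-sum test, but the test.use argument lets you swap in alternatives, for example a ROC test that ranks genes by their power to classify the two groups. This is an optional aside not run in the tutorial:

cluster1.roc <- FindMarkers(pbmc, ident.1 = 1, ident.2 = c(0, 4), min.pct = 0.25, test.use = "roc")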
In the example below we test differentially expressed markers in each cluster vs all other cells iteratively:
pbmc.markers <- FindAllMarkers(pbmc, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)

## # A tibble: 10 x 7
## # Groups: cluster [5]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.75e-50 2.19 0.89 0.209 1.24e-45 0 IL7R
## 2 9.92e- 3 3.14 0.406 0.305 1 e+ 0 0 SEC11C
## 3 1.27e-67 4.71 1 0.187 2.73e-63 1 S100A8
## 4 3.09e-63 4.32 1 0.23 6.64e-59 1 S100A9
## 5 3.12e-86 4.96 0.915 0.005 6.71e-82 2 IGHM.1
## 6 2.66e-75 5.85 0.803 0.003 5.71e-71 2 IGKC
## 7 9.61e-34 2.01 1 0.374 2.07e-29 3 IFI30
## 8 1.86e-17 2.17 0.4 0.055 4.01e-13 3 FCGR3A
## 9 1.88e-76 4.99 1 0.051 4.04e-72 4 NKG7
## 10 1.21e-51 5.57 0.702 0.031 2.60e-47 4 GNLY
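Because pbmc.markers is an ordinary data frame, it can also be filtered with dplyr, for example keeping only markers with an adjusted p-value below 0.05 (an illustrative sketch):

sig.markers <- pbmc.markers %>% filter(p_val_adj < 0.05)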
Plot individual gene expression counts across clusters:
VlnPlot(pbmc, features = c("IGKC", "GNLY"), slot = "counts")Plot differentially expressed features in the reduced dimensional embedding plot (PCA, UMAP, tSNE), set interactive = TRUE for tutorial.
FeaturePlot(pbmc, features = c("GNLY", "IGKC")) # add interactive = TRUE for an interactive plot

Finally, plot the top 10 markers for each cluster in a heatmap:

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
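If you want to reuse the processed object later without re-running the workflow, it can be written to disk (the file name below is illustrative):

saveRDS(pbmc, file = "pbmc1k_seurat.rds")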