MA5112 scRNA-Seq

Raw Data Processing

1. Nextflow Script

All of the files required to run the tutorial are located at /data/MSc/2020/MA5112/scRNA-Seq.

Navigate to your directory on /data/MSc/2022/ and copy the nextflow script to your directory:

cp /data/MSc/2020/MA5112/scRNA-Seq/kallisto_bustools.nf .

Load Java and Singularity modules:

module load java/1.8.0
module load singularity/3.4.1

Run the analysis:

nextflow -bg run kallisto_bustools.nf --outdir "." -with-singularity '/data/MSc/2020/MA5112/scRNA-Seq/container/scRNA.img'

A successful run of the pipeline will produce an output directory analysis/ containing:

bdigby@lugh:/data/MSc/2020/MA5112/scRNA-Seq$ tree analysis/
analysis/
├── counts
│   └── output.sort.txt
├── raw
│   ├── bus_output
│   │   ├── matrix.ec
│   │   ├── output.bus
│   │   ├── run_info.json
│   │   └── transcripts.txt
│   └── kallisto.log
└── sorted
    ├── output.corrected.bus
    └── output.sort.bus

4 directories, 8 files

2. Generate Count Matrix

2(a) Install Python Dependencies

If you have Python >= 3 with numpy, matplotlib, and pandas installed you don’t need to install any dependencies to run the script - skip to part 2(b).

For users without these libraries, we can install them via pip in a conda environment easily (recall week 1..)

environment.yml:

name: python_deps
channels:
 - conda-forge
dependencies:
 - python
 - pip
 - pip:
   - numpy
   - matplotlib
   - pandas

Copy the contents of the .yml file above and save it as environment.yml.

Next, create an environment using conda:

conda env create -f environment.yml

Activate the environment:

conda activate python_deps

2(b) BUS to Count Matrix

To generate a single cell count matrix using the processed BUS file, copy the python script make_mtx.py and assets/ directory to your directory containing the results:

cp /data/MSc/2020/MA5112/scRNA-Seq/make_mtx.py .
cp -R /data/MSc/2020/MA5112/scRNA-Seq/assets/ .

Your directory should now have the following structure:

analysis/
assets/
kallisto_bustools.nf
make_mtx.py
work/

Before running the python script, you need to update line 8 to reflect your current directory containing the analysis results:

#!/usr/bin/env python

gene_min = 200
gene_max = 10000

#setup working directory
import os
os.chdir("/data/MSc/2020/MA5112/scRNA-Seq")

Change the line os.chdir("/data/MSc/2020/MA5112/scRNA-Seq") to reflect the PATH containing your results before running the script:

python make_mtx.py

You should now have a directory called seurat/ that contains:

(python_deps) bdigby@lugh:/data/MSc/2020/MA5112/scRNA-Seq$ tree seurat
seurat
├── barcodes.tsv
├── genes.tsv
└── matrix.mtx

0 directories, 3 files

There is an ENSEMBL gene ID in the seurat/genes.tsv file that is missing a gene symbol. This will create an error in R, we can manually add the gene symbol ourselves.

Check which ID has a blank 2nd column i.e no gene symbol:

awk 'NF!=2' seurat/genes.tsv
ENSG00000237235.2

Google the ENSEMBL ID, the gene ID is TRDD2. Now use awk to insert the gene in place of the empty cell. This is ok for one missing symbol, if you have multiple gene symbols missing you will have to come up with a better strategy for this

awk -F"\t" 'BEGIN {OFS="\t"}; $1 == "ENSG00000237235.2" {$2 = "TRDD2"};1' seurat/genes.tsv > seurat/tmp.tsv && rm seurat/genes.tsv && mv seurat/tmp.tsv seurat/genes.tsv

You can delete the work/ directory now to avoid storage bloat.


A description of the raw data processing steps are given at the end of this document.

3. Download Results

You will need to compress and download the seurat/ directory for downstream analysis in R.

tar -cvzf seurat.tar.gz seurat/

On your local machine, use scp to download the file from Lugh:

scp <username>@lugh.nuigalway.ie:/data/.../seurat.tar.gz .

Decompress the directory:

tar -xvf seurat.tar.gz 

Analysis in R

library(Seurat)
library(patchwork)
library(dplyr)

Load Data

Establish path to directory containing outputs from Kallisto Bustools:

pbmc.data <- Read10X(data.dir = "/data/work/work/foo/seurat/")

Create Seurat Object: Filtering can be applied in this step, where we can select:

  • min.features: Include cells where at least this many features are detected.
  • min.cells: Include features detected in at least this many cells.

For the tutorial we will select conservative filtering values:

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc1k", min.cells = 2, min.features = 100)

Seurat Object

Inspect the pbmc object by using a $ or @ symbol to access RNA counts/features and metadata, respectively.

Check the number of genes and samples (cells) in the experiment:

pbmc
## An object of class Seurat 
## 21514 features across 722 samples within 1 assay 
## Active assay: RNA (21514 features, 0 variable features)

One can check the nCount_RNA and nFeature_RNA values per cell by using pbmc$nFeature_RNA and pbmc$nFeature_RNA. Alternatively, use pbmc@meta.data to obtain an overview of both:

head(pbmc@meta.data)
##                  orig.ident nCount_RNA nFeature_RNA
## AAACCCAAGTGGTCAG     pbmc1k     5109.5         3461
## AAAGGTATCAACTACG     pbmc1k     4604.0         3112
## AAAGTCCAGCGTGTCC     pbmc1k     4807.0         3294
## AACACACTCAAGAGTA     pbmc1k     5039.9         3578
## AACACACTCGACGAGA     pbmc1k     4234.0         3096
## AACAGGGCAGGAGGTT     pbmc1k     8773.0         4950

Count Matrix

In contrast to RNA-Seq where empty cells are represented with a 0, scRNA-seq use . in place of 0 (sparse matrix representation) to significantly reduce memory costs. The code below shows the count matrix for three genes over the first 10 cells:

pbmc.data[c("TRDD2", "DR1", "TARBP1"), 1:10]
## 3 x 10 sparse Matrix of class "dgCMatrix"
##                                                                  
## TRDD2  .        .     . . . .        .        .        . .       
## DR1    0.154589 1.125 . . . 2.000873 1.025641 0.006536 2 1.142857
## TARBP1 .        .     1 . 1 .        .        .        . .

Pre-processing

We have already performed a pre-processing step using CreateSeruatObject(), whereby cells with low numbers of genes expressed and lowly expressed genes were filtered using min.features and min.cells, respectively.

Further common filtering steps include:

  • Removing cells with aberrantly high gene counts
  • Removing low quality/dying cells (such cells exhibit mitochondrial contamination)

Mitochondrial Genes

Calculate the fraction of mitochondrial genes expressed in each cell using PercentageFeatureSet() and REGEX, appending the results to the meta.data of the seurat object:

pbmc[["percent_mt"]] = PercentageFeatureSet(pbmc, pattern="^MT-")

Inspect the results:

head(pbmc@meta.data)
##                  orig.ident nCount_RNA nFeature_RNA percent_mt
## AAACCCAAGTGGTCAG     pbmc1k     5109.5         3461  11.459047
## AAAGGTATCAACTACG     pbmc1k     4604.0         3112   8.079931
## AAAGTCCAGCGTGTCC     pbmc1k     4807.0         3294  15.921226
## AACACACTCAAGAGTA     pbmc1k     5039.9         3578   7.212445
## AACACACTCGACGAGA     pbmc1k     4234.0         3096   2.928673
## AACAGGGCAGGAGGTT     pbmc1k     8773.0         4950  10.727992

Let’s make a plot of these statistics to inform cut-off values:

VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol = 3)

Based on the plots, perform filtering to remove:

  • Cells with mitochondrial gene % above 10%
  • Cells with gene counts above 20,000
  • Cells with unique feature counts above 6500 and below 300
pbmc <- subset(pbmc, subset = nFeature_RNA > 300 & nFeature_RNA < 6500 & nCount_RNA < 20000 & percent_mt < 10)

Plot the filtered object:

VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol = 3)

Normalization

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[["RNA"]]@data:

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

Feature Selection

In order to extract meaningful biological signals from the dataset, Seurat aims to identify a subset of features (e.g., genes) exhibiting high variability across cells, and therefore represent heterogeneous features to prioritize for downstream analysis. Choosing genes solely based on their log-normalized single-cell variance fails to account for the mean-variance relationship that is inherent to single-cell RNA-seq. Therefore, a variance-stabilizing transformation is applied to correct for this before calculating the mean-variance relationship, implemented in the FindVariableFeatures() function.

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

Plot variable genes

Plot the top 20 variable genes:

top20 <- head(VariableFeatures(pbmc), 20)
init_plot <- VariableFeaturePlot(pbmc)
plot_var <- LabelPoints(plot = init_plot, points = top20, repel = TRUE)
plot_var

Data Scaling

The dataset now needs to be centered and scaed prior to performing dimensionality reduction techniques such as PCA. To reiterate what we mean by scaling:

  • Shift the expression of each gene, so that the mean expression across cells is 0
  • Scale the expression of each gene, so that the variance across cells is 1
    • This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate

Perform scaling, the results are stored in pbmc[["RNA"]]@scale.data:

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

Dimension Reduction

PCA

Perform PCA on the selected features in the pbmc object. If you wish to provide a different assay to PCA, define it using features:

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

Inspect Loadings

Inspect the loadings on each PC:

print(pbmc[["pca"]], dims = 1:2, nfeatures = 5)
## PC_ 1 
## Positive:  CST3, LYZ, S100A9, FCN1, CTSS 
## Negative:  IFITM1, CD3E, CD69, IL32, TRAC 
## PC_ 2 
## Positive:  IFITM1, CD3E, IFITM2, CD7, GZMM 
## Negative:  CD79A, MS4A1, IGHM.1, IGKC, IGHM

Plot Loadings

Plot the loadings on each PC (note the plots seem to rank features according to absolute values, hence why negative genes are dominating the plots):

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca", nfeatures = 5)

Plot 2D Scatter

Plot a 2D scatter plot of each cell whereby the cells position is determined by the embeddings determined by the reduction technique:

DimPlot(pbmc, reduction = "pca")

Heatmap of loadings

DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores.

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

Choose PCs

The reason we performed PCA is to compress the dataset into a robust representation of the heterogeneity present in the dataset for downstream clustering, however now we are faced with an issue: how many PCs to include for downstream analyses?

JackStraw

Seurat implements a resampling test inspired by the JackStraw procedure. Seurat randomly permutes a subset of the data (1% by default) and reruns PCA, constructing a ‘null distribution’ of feature scores, and repeats the procedure. Seurat then identifies ‘significant’ PCs as those who have a strong enrichment of low p-value features:

pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:15)

Plot JackStraw

The JackStrawPlot() function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). ‘Significant’ PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line). In this case it appears that there is a sharp drop-off in significance after the first 7 PCs.

JackStrawPlot(pbmc, dims = 1:15)

Elbow Plot

An alternative heuristic method generates an ‘Elbow plot’: a ranking of principle components based on the percentage of variance explained by each PC. In the below plot we might conclude that taking the the top 10 PCs makes the most sense.

Ask Sarah which method is most appropriate, should significant PCs in JackStraw be included?

ElbowPlot(pbmc)

Cluster Cells

Seurat applies a graph-based clustering approach by first constructing a KNN graph based on the euclidean distance in PCA space, and refining the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (we will include the top 10 PCs here).

To cluster the cells, Seurat next applies modularity optimization techniques such as the Louvain algorithm (default) or SLM, to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets:

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 447
## Number of edges: 12763
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8515
## Number of communities: 5
## Elapsed time: 0 seconds

The clusters can be found using the Idents() function:

head(Idents(pbmc), 5)
## AAAGGTATCAACTACG AACACACTCAAGAGTA AACACACTCGACGAGA AACAGGGCAGTGTATC 
##                0                2                3                3 
## AACCTGAAGATGGTCG 
##                0 
## Levels: 0 1 2 3 4

UMAP/tSNE

Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.

pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")

pbmc <- RunTSNE(pbmc, dims = 1:10)

DimPlot(pbmc, reduction = "tsne")

Differentially Expressed features

Seurat can help you find markers that define clusters via differential expression. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

FindMarkers

In the example below we will test the differentially expressed features of cluster 0, specifying that the genes must be expressed in 25% of both cell groups:

cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, min.pct = 0.25)
head(cluster0.markers, n = 5)
##               p_val avg_log2FC pct.1 pct.2    p_val_adj
## BCL11B 1.311813e-53   2.003492 0.819 0.103 2.822234e-49
## IL7R   5.753797e-50   2.186796 0.890 0.209 1.237872e-45
## CD74   1.937419e-49  -3.727260 0.458 0.932 4.168164e-45
## CD3D   9.055952e-49   1.568799 0.852 0.106 1.948297e-44
## TRAC   3.958761e-48   1.918506 0.845 0.134 8.516879e-44

In the example below we will test differentially expressed features of cluster 1 vs cluster 0 & 4, with the same percentage cut-off:

cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, ident.2 = c(0, 4), min.pct = 0.25)
head(cluster1.markers, n = 5)
##                p_val avg_log2FC pct.1 pct.2    p_val_adj
## S100A12 3.416364e-66   4.652120  1.00 0.000 7.349965e-62
## VCAN    2.562430e-65   4.314659  0.99 0.000 5.512813e-61
## MNDA    2.562430e-65   3.856130  0.99 0.000 5.512813e-61
## S100A8  1.968865e-63   7.247090  1.00 0.033 4.235816e-59
## CST3    4.254010e-63   4.315201  1.00 0.038 9.152076e-59

FindAllMarkers

In the example below we test differentially expressed markers in each cluster vs all other cells iteratively:

pbmc.markers <- FindAllMarkers(pbmc, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
## # A tibble: 10 × 7
## # Groups:   cluster [5]
##       p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene  
##       <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr> 
##  1 5.75e-50       2.19 0.89  0.209  1.24e-45 0       IL7R  
##  2 9.92e- 3       3.14 0.406 0.305  1   e+ 0 0       SEC11C
##  3 1.27e-67       4.71 1     0.187  2.73e-63 1       S100A8
##  4 3.09e-63       4.32 1     0.23   6.64e-59 1       S100A9
##  5 3.12e-86       4.96 0.915 0.005  6.71e-82 2       IGHM.1
##  6 2.66e-75       5.85 0.803 0.003  5.71e-71 2       IGKC  
##  7 9.61e-34       2.01 1     0.374  2.07e-29 3       IFI30 
##  8 1.86e-17       2.17 0.4   0.055  4.01e-13 3       FCGR3A
##  9 1.88e-76       4.99 1     0.051  4.04e-72 4       NKG7  
## 10 1.21e-51       5.57 0.702 0.031  2.60e-47 4       GNLY

Violin Plot

Plot individual gene expression counts across clusters:

VlnPlot(pbmc, features = c("IGKC", "GNLY"), slot = "counts")

Features Plot

Plot differentially expressed features in the reduced dimensional embedding plot (PCA, UMAP, tSNE), set interactive = TRUE for tutorial.

FeaturePlot(pbmc, features = c( "GNLY", "IGKC"))#, interactive = TRUE)

Heatmap

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

Description of Raw Data Processing

scRNA-Seq file format

Single cell genomics sequencing files are different to canonical paired-end sequencing reads. The R1 file contains the Unique Molecular Identifier (UMI) and the Cell Barcode whilst the R2 file contains the cDNA sequence. UMIs are are sequences that identify the molecule of origin for each read, Cell Barcodes are short sequences that identify cells of origin for each read.

As the starting input RNA is typically much lower in an scRNA-Seq experiment, the reads fragments must undergo amplification before they can be sequenced. UMIs resolve PCR duplicates by ensuring that amplicons of the same read are only counted once.

R1:

bdigby@lugh:/data/MSc/2020/MA5112/scRNA-Seq$ zcat reads/pbmc_1k_protein_v3_gex_S1_L001_R1_001.fastq.gz | head
@A00228:290:H3FVWDRXX:1:1101:18069:1031 1:N:0:NATGATTC
TNACGGGGTACGAGTGTTTATTACCTTA
+
F#FFFFFFFFFFFFFF::FFFFFFFFFF
@A00228:290:H3FVWDRXX:1:1101:19605:1031 1:N:0:NATGATTC
CNGATCCTCCGATCTCTCATTGATAGTG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:290:H3FVWDRXX:1:1101:30671:1031 1:N:0:NATGATTC
TNCAATCCACTGCACGTAGGCCACCTAC

R2:

bdigby@lugh:/data/MSc/2020/MA5112/scRNA-Seq$ zcat reads/pbmc_1k_protein_v3_gex_S1_L001_R2_001.fastq.gz | head
@A00228:290:H3FVWDRXX:1:1101:18069:1031 2:N:0:NATGATTC
NACAATTTTGAGAATGAAAGGGGGTGCTATAATGATTACACAGAAGGAATAGTTTAAACTGGAATTGTTCTAGACAAATTAGGATTTATAG
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF
@A00228:290:H3FVWDRXX:1:1101:19605:1031 2:N:0:NATGATTC
AAGCAGTGGTATCAACGCAGAGTACATGGGGGTATTTGGGTGAATGTTGTTGCTCTGTGCCTGTTGAATGTCCATGCTACAAGAAGTTATG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFF
@A00228:290:H3FVWDRXX:1:1101:30671:1031 2:N:0:NATGATTC
GTCCCACCAACAGTGTAAAAGTGTTTCTATTTCTCCACATCCTCTCCAGCATCTGTTGTTTCCTGACTTTTTAATGATCGCCATACTAACT

Kallisto bus

kallisto bus works with raw FASTQ files for single-cell RNA-Seq datasets. For each read the cell barcode, UMI information and the equivalence class resulting from pseudoalignment are stored in a BUS file output.bus stored in the output directory specified by -o, along with matrix.ec and transcripts.txt which store information about the equivalence classes and transcript names for downstream processing.

Run the following commands in your directory, and specify the path to the indexed transcriptome and fastq.gz files given at the top of the page.

Inputs
  • Genome Index file
  • FASTQ reads
kallisto bus \
         -i Homo_sapiens.cDNA.idx \
         -o bus_output/ \
         -x 10xv3 \
         -t 2 \
         pbmc_1k_protein_v3_gex_S1_L001_R1_001.fastq.gz pbmc_1k_protein_v3_gex_S1_L001_R2_001.fastq.gz
Flags
  • -i Genome Index file generated by kallisto index
  • -o Directory to write output to
  • -x Single-cell technology used (string)
  • -t n threads to use
Outputs

Kallisto bus generates an output directory with the following files:

  • matrix.ec
  • output.bus
  • run_info.json
  • transcripts.txt

Bustools correct

BUS files can be barcode error corrected with regards to a technology specific whitelist of barcodes. The bustools correct command will correct all barcodes that are at Hamming distance 1 (i.e. one substitution) away from a single barcode in the whitelist.

To run the bustools correct command, you will need the 10xv3_whitelist.txt file and the output.bus files generated by kallisto bus.

Inputs
  • Whitelisted barcodes file
  • BUS file generated by kallisto bus
bustools correct \
         -w 10xv3_whitelist.txt \
         bus_output/output.bus \
         -o output_corrected.bus
Flags
  • -o File for corrected bus output
  • -w File of whitelisted barcodes to correct against
Outputs

bustools correct outputs a corrected BUS file.


Bustools sort

Raw BUS output from kallisto bus pseudoalignment may be unsorted. To accelerate downstream processing BUS files can be sorted using bustools sort. This will create a new BUS file where the BUS records are sorted by barcode first, UMI second, and equivalence class third.

Before running the command, make a tmp/ directory in your directory. Run the following commands, referecing the output_corrected.bus file generated in the previous step.

Inputs
  • corrected BUS file generated by bustools correct
bustools sort \
         -T tmp/ \
         -t 2 \
         -o output_sort.bus \
         output_corrected.bus
Flags
  • -t Number of threads to use
  • -m Maximum memory used
  • -T Location for scratch files
  • -o Sorted output file
Outputs

Bustools sort outputs a sorted BUS file.


Bustools text

BUS files can be converted to a tab-separated format for easy inspection and processing using shell scripts or high level languages. We will be using an R script on the output file generated.

Run the following commands, referencing the sorted bus file generated in the previous step.

Inputs
  • Corrected + Sorted BUS file
  • Transcripts to Gene mapping file
bustools text \
         -o output_sort.txt \
         output.sort.bus
Flags
  • -o File for text output
Outputs

Bustools text produces a BUS file in text format.