All of the files required to run the tutorial are located at /data/MSc/2020/MA5112/scRNA-Seq.
Navigate to your directory on /data/MSc/2022/ and copy the nextflow script to your directory:
cp /data/MSc/2020/MA5112/scRNA-Seq/kallisto_bustools.nf .Load Java and Singularity modules:
module load java/1.8.0
module load singularity/3.4.1Run the analysis:
nextflow -bg run kallisto_bustools.nf --outdir "." -with-singularity '/data/MSc/2020/MA5112/scRNA-Seq/container/scRNA.img'A successful run of the pipeline will produce an output directory analysis/ containing:
bdigby@lugh:/data/MSc/2020/MA5112/scRNA-Seq$ tree analysis/
analysis/
├── counts
│ └── output.sort.txt
├── raw
│ ├── bus_output
│ │ ├── matrix.ec
│ │ ├── output.bus
│ │ ├── run_info.json
│ │ └── transcripts.txt
│ └── kallisto.log
└── sorted
├── output.corrected.bus
└── output.sort.bus
4 directories, 8 files
If you have Python >= 3 with numpy, matplotlib, and pandas installed you don’t need to install any dependencies to run the script - skip to part 2(b).
For users without these libraries, we can install them via pip in a conda environment easily (recall week 1..)
environment.yml:
name: python_deps
channels:
- conda-forge
dependencies:
- python
- pip
- pip:
- numpy
- matplotlib
- pandasCopy the contents of the .yml file above and save it as environment.yml.
Next, create an environment using conda:
conda env create -f environment.ymlActivate the environment:
conda activate python_depsTo generate a single cell count matrix using the processed BUS file, copy the python script make_mtx.py and assets/ directory to your directory containing the results:
cp /data/MSc/2020/MA5112/scRNA-Seq/make_mtx.py .
cp -R /data/MSc/2020/MA5112/scRNA-Seq/assets/ .Your directory should now have the following structure:
analysis/
assets/
kallisto_bustools.nf
make_mtx.py
work/
Before running the python script, you need to update line 8 to reflect your current directory containing the analysis results:
#!/usr/bin/env python
gene_min = 200
gene_max = 10000
#setup working directory
import os
os.chdir("/data/MSc/2020/MA5112/scRNA-Seq")Change the line os.chdir("/data/MSc/2020/MA5112/scRNA-Seq") to reflect the PATH containing your results before running the script:
python make_mtx.pyYou should now have a directory called seurat/ that contains:
(python_deps) bdigby@lugh:/data/MSc/2020/MA5112/scRNA-Seq$ tree seurat
seurat
├── barcodes.tsv
├── genes.tsv
└── matrix.mtx
0 directories, 3 files
There is an ENSEMBL gene ID in the seurat/genes.tsv file that is missing a gene symbol. This will create an error in R, we can manually add the gene symbol ourselves.
Check which ID has a blank 2nd column i.e no gene symbol:
awk 'NF!=2' seurat/genes.tsv
ENSG00000237235.2Google the ENSEMBL ID, the gene ID is TRDD2. Now use awk to insert the gene in place of the empty cell. This is ok for one missing symbol, if you have multiple gene symbols missing you will have to come up with a better strategy for this
awk -F"\t" 'BEGIN {OFS="\t"}; $1 == "ENSG00000237235.2" {$2 = "TRDD2"};1' seurat/genes.tsv > seurat/tmp.tsv && rm seurat/genes.tsv && mv seurat/tmp.tsv seurat/genes.tsvYou can delete the
work/directory now to avoid storage bloat.
A description of the raw data processing steps are given at the end of this document.
You will need to compress and download the seurat/ directory for downstream analysis in R.
tar -cvzf seurat.tar.gz seurat/On your local machine, use scp to download the file from Lugh:
scp <username>@lugh.nuigalway.ie:/data/.../seurat.tar.gz .Decompress the directory:
tar -xvf seurat.tar.gz library(Seurat)
library(patchwork)
library(dplyr)Establish path to directory containing outputs from Kallisto Bustools:
pbmc.data <- Read10X(data.dir = "/data/work/work/foo/seurat/")Create Seurat Object: Filtering can be applied in this step, where we can select:
min.features: Include cells where at least this many features are detected.min.cells: Include features detected in at least this many cells.For the tutorial we will select conservative filtering values:
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc1k", min.cells = 2, min.features = 100)Inspect the pbmc object by using a $ or @ symbol to access RNA counts/features and metadata, respectively.
Check the number of genes and samples (cells) in the experiment:
pbmc## An object of class Seurat
## 21514 features across 722 samples within 1 assay
## Active assay: RNA (21514 features, 0 variable features)
One can check the nCount_RNA and nFeature_RNA values per cell by using pbmc$nFeature_RNA and pbmc$nFeature_RNA. Alternatively, use pbmc@meta.data to obtain an overview of both:
head(pbmc@meta.data)## orig.ident nCount_RNA nFeature_RNA
## AAACCCAAGTGGTCAG pbmc1k 5109.5 3461
## AAAGGTATCAACTACG pbmc1k 4604.0 3112
## AAAGTCCAGCGTGTCC pbmc1k 4807.0 3294
## AACACACTCAAGAGTA pbmc1k 5039.9 3578
## AACACACTCGACGAGA pbmc1k 4234.0 3096
## AACAGGGCAGGAGGTT pbmc1k 8773.0 4950
In contrast to RNA-Seq where empty cells are represented with a 0, scRNA-seq use . in place of 0 (sparse matrix representation) to significantly reduce memory costs. The code below shows the count matrix for three genes over the first 10 cells:
pbmc.data[c("TRDD2", "DR1", "TARBP1"), 1:10]## 3 x 10 sparse Matrix of class "dgCMatrix"
##
## TRDD2 . . . . . . . . . .
## DR1 0.154589 1.125 . . . 2.000873 1.025641 0.006536 2 1.142857
## TARBP1 . . 1 . 1 . . . . .
We have already performed a pre-processing step using CreateSeruatObject(), whereby cells with low numbers of genes expressed and lowly expressed genes were filtered using min.features and min.cells, respectively.
Further common filtering steps include:
Calculate the fraction of mitochondrial genes expressed in each cell using PercentageFeatureSet() and REGEX, appending the results to the meta.data of the seurat object:
pbmc[["percent_mt"]] = PercentageFeatureSet(pbmc, pattern="^MT-")Inspect the results:
head(pbmc@meta.data)## orig.ident nCount_RNA nFeature_RNA percent_mt
## AAACCCAAGTGGTCAG pbmc1k 5109.5 3461 11.459047
## AAAGGTATCAACTACG pbmc1k 4604.0 3112 8.079931
## AAAGTCCAGCGTGTCC pbmc1k 4807.0 3294 15.921226
## AACACACTCAAGAGTA pbmc1k 5039.9 3578 7.212445
## AACACACTCGACGAGA pbmc1k 4234.0 3096 2.928673
## AACAGGGCAGGAGGTT pbmc1k 8773.0 4950 10.727992
Let’s make a plot of these statistics to inform cut-off values:
VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol = 3)Based on the plots, perform filtering to remove:
% above 10%pbmc <- subset(pbmc, subset = nFeature_RNA > 300 & nFeature_RNA < 6500 & nCount_RNA < 20000 & percent_mt < 10)Plot the filtered object:
VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"), ncol = 3)After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[["RNA"]]@data:
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)In order to extract meaningful biological signals from the dataset, Seurat aims to identify a subset of features (e.g., genes) exhibiting high variability across cells, and therefore represent heterogeneous features to prioritize for downstream analysis. Choosing genes solely based on their log-normalized single-cell variance fails to account for the mean-variance relationship that is inherent to single-cell RNA-seq. Therefore, a variance-stabilizing transformation is applied to correct for this before calculating the mean-variance relationship, implemented in the FindVariableFeatures() function.
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)Plot the top 20 variable genes:
top20 <- head(VariableFeatures(pbmc), 20)
init_plot <- VariableFeaturePlot(pbmc)
plot_var <- LabelPoints(plot = init_plot, points = top20, repel = TRUE)
plot_varThe dataset now needs to be centered and scaed prior to performing dimensionality reduction techniques such as PCA. To reiterate what we mean by scaling:
Perform scaling, the results are stored in pbmc[["RNA"]]@scale.data:
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)Perform PCA on the selected features in the pbmc object. If you wish to provide a different assay to PCA, define it using features:
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))Inspect the loadings on each PC:
print(pbmc[["pca"]], dims = 1:2, nfeatures = 5)## PC_ 1
## Positive: CST3, LYZ, S100A9, FCN1, CTSS
## Negative: IFITM1, CD3E, CD69, IL32, TRAC
## PC_ 2
## Positive: IFITM1, CD3E, IFITM2, CD7, GZMM
## Negative: CD79A, MS4A1, IGHM.1, IGKC, IGHM
Plot the loadings on each PC (note the plots seem to rank features according to absolute values, hence why negative genes are dominating the plots):
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca", nfeatures = 5)Plot a 2D scatter plot of each cell whereby the cells position is determined by the embeddings determined by the reduction technique:
DimPlot(pbmc, reduction = "pca")DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores.
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)The reason we performed PCA is to compress the dataset into a robust representation of the heterogeneity present in the dataset for downstream clustering, however now we are faced with an issue: how many PCs to include for downstream analyses?
Seurat implements a resampling test inspired by the JackStraw procedure. Seurat randomly permutes a subset of the data (1% by default) and reruns PCA, constructing a ‘null distribution’ of feature scores, and repeats the procedure. Seurat then identifies ‘significant’ PCs as those who have a strong enrichment of low p-value features:
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:15)The JackStrawPlot() function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). ‘Significant’ PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line). In this case it appears that there is a sharp drop-off in significance after the first 7 PCs.
JackStrawPlot(pbmc, dims = 1:15)An alternative heuristic method generates an ‘Elbow plot’: a ranking of principle components based on the percentage of variance explained by each PC. In the below plot we might conclude that taking the the top 10 PCs makes the most sense.
Ask Sarah which method is most appropriate, should significant PCs in JackStraw be included?
ElbowPlot(pbmc)Seurat applies a graph-based clustering approach by first constructing a KNN graph based on the euclidean distance in PCA space, and refining the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (we will include the top 10 PCs here).
To cluster the cells, Seurat next applies modularity optimization techniques such as the Louvain algorithm (default) or SLM, to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets:
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 447
## Number of edges: 12763
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8515
## Number of communities: 5
## Elapsed time: 0 seconds
The clusters can be found using the Idents() function:
head(Idents(pbmc), 5)## AAAGGTATCAACTACG AACACACTCAAGAGTA AACACACTCGACGAGA AACAGGGCAGTGTATC
## 0 2 3 3
## AACCTGAAGATGGTCG
## 0
## Levels: 0 1 2 3 4
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne")Seurat can help you find markers that define clusters via differential expression. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.
In the example below we will test the differentially expressed features of cluster 0, specifying that the genes must be expressed in 25% of both cell groups:
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, min.pct = 0.25)
head(cluster0.markers, n = 5)## p_val avg_log2FC pct.1 pct.2 p_val_adj
## BCL11B 1.311813e-53 2.003492 0.819 0.103 2.822234e-49
## IL7R 5.753797e-50 2.186796 0.890 0.209 1.237872e-45
## CD74 1.937419e-49 -3.727260 0.458 0.932 4.168164e-45
## CD3D 9.055952e-49 1.568799 0.852 0.106 1.948297e-44
## TRAC 3.958761e-48 1.918506 0.845 0.134 8.516879e-44
In the example below we will test differentially expressed features of cluster 1 vs cluster 0 & 4, with the same percentage cut-off:
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, ident.2 = c(0, 4), min.pct = 0.25)
head(cluster1.markers, n = 5)## p_val avg_log2FC pct.1 pct.2 p_val_adj
## S100A12 3.416364e-66 4.652120 1.00 0.000 7.349965e-62
## VCAN 2.562430e-65 4.314659 0.99 0.000 5.512813e-61
## MNDA 2.562430e-65 3.856130 0.99 0.000 5.512813e-61
## S100A8 1.968865e-63 7.247090 1.00 0.033 4.235816e-59
## CST3 4.254010e-63 4.315201 1.00 0.038 9.152076e-59
In the example below we test differentially expressed markers in each cluster vs all other cells iteratively:
pbmc.markers <- FindAllMarkers(pbmc, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)## # A tibble: 10 × 7
## # Groups: cluster [5]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.75e-50 2.19 0.89 0.209 1.24e-45 0 IL7R
## 2 9.92e- 3 3.14 0.406 0.305 1 e+ 0 0 SEC11C
## 3 1.27e-67 4.71 1 0.187 2.73e-63 1 S100A8
## 4 3.09e-63 4.32 1 0.23 6.64e-59 1 S100A9
## 5 3.12e-86 4.96 0.915 0.005 6.71e-82 2 IGHM.1
## 6 2.66e-75 5.85 0.803 0.003 5.71e-71 2 IGKC
## 7 9.61e-34 2.01 1 0.374 2.07e-29 3 IFI30
## 8 1.86e-17 2.17 0.4 0.055 4.01e-13 3 FCGR3A
## 9 1.88e-76 4.99 1 0.051 4.04e-72 4 NKG7
## 10 1.21e-51 5.57 0.702 0.031 2.60e-47 4 GNLY
Plot individual gene expression counts across clusters:
VlnPlot(pbmc, features = c("IGKC", "GNLY"), slot = "counts")Plot differentially expressed features in the reduced dimensional embedding plot (PCA, UMAP, tSNE), set interactive = TRUE for tutorial.
FeaturePlot(pbmc, features = c( "GNLY", "IGKC"))#, interactive = TRUE)top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()Single cell genomics sequencing files are different to canonical paired-end sequencing reads. The R1 file contains the Unique Molecular Identifier (UMI) and the Cell Barcode whilst the R2 file contains the cDNA sequence. UMIs are are sequences that identify the molecule of origin for each read, Cell Barcodes are short sequences that identify cells of origin for each read.
As the starting input RNA is typically much lower in an scRNA-Seq experiment, the reads fragments must undergo amplification before they can be sequenced. UMIs resolve PCR duplicates by ensuring that amplicons of the same read are only counted once.
R1:
bdigby@lugh:/data/MSc/2020/MA5112/scRNA-Seq$ zcat reads/pbmc_1k_protein_v3_gex_S1_L001_R1_001.fastq.gz | head
@A00228:290:H3FVWDRXX:1:1101:18069:1031 1:N:0:NATGATTC
TNACGGGGTACGAGTGTTTATTACCTTA
+
F#FFFFFFFFFFFFFF::FFFFFFFFFF
@A00228:290:H3FVWDRXX:1:1101:19605:1031 1:N:0:NATGATTC
CNGATCCTCCGATCTCTCATTGATAGTG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:290:H3FVWDRXX:1:1101:30671:1031 1:N:0:NATGATTC
TNCAATCCACTGCACGTAGGCCACCTACR2:
bdigby@lugh:/data/MSc/2020/MA5112/scRNA-Seq$ zcat reads/pbmc_1k_protein_v3_gex_S1_L001_R2_001.fastq.gz | head
@A00228:290:H3FVWDRXX:1:1101:18069:1031 2:N:0:NATGATTC
NACAATTTTGAGAATGAAAGGGGGTGCTATAATGATTACACAGAAGGAATAGTTTAAACTGGAATTGTTCTAGACAAATTAGGATTTATAG
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF
@A00228:290:H3FVWDRXX:1:1101:19605:1031 2:N:0:NATGATTC
AAGCAGTGGTATCAACGCAGAGTACATGGGGGTATTTGGGTGAATGTTGTTGCTCTGTGCCTGTTGAATGTCCATGCTACAAGAAGTTATG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFF
@A00228:290:H3FVWDRXX:1:1101:30671:1031 2:N:0:NATGATTC
GTCCCACCAACAGTGTAAAAGTGTTTCTATTTCTCCACATCCTCTCCAGCATCTGTTGTTTCCTGACTTTTTAATGATCGCCATACTAACTkallisto bus works with raw FASTQ files for single-cell RNA-Seq datasets. For each read the cell barcode, UMI information and the equivalence class resulting from pseudoalignment are stored in a BUS file output.bus stored in the output directory specified by -o, along with matrix.ec and transcripts.txt which store information about the equivalence classes and transcript names for downstream processing.
Run the following commands in your directory, and specify the path to the indexed transcriptome and fastq.gz files given at the top of the page.
kallisto bus \
-i Homo_sapiens.cDNA.idx \
-o bus_output/ \
-x 10xv3 \
-t 2 \
pbmc_1k_protein_v3_gex_S1_L001_R1_001.fastq.gz pbmc_1k_protein_v3_gex_S1_L001_R2_001.fastq.gz-i Genome Index file generated by kallisto index-o Directory to write output to-x Single-cell technology used (string)-t n threads to useKallisto bus generates an output directory with the following files:
matrix.ecoutput.busrun_info.jsontranscripts.txtBUS files can be barcode error corrected with regards to a technology specific whitelist of barcodes. The bustools correct command will correct all barcodes that are at Hamming distance 1 (i.e. one substitution) away from a single barcode in the whitelist.
To run the bustools correct command, you will need the 10xv3_whitelist.txt file and the output.bus files generated by kallisto bus.
busbustools correct \
-w 10xv3_whitelist.txt \
bus_output/output.bus \
-o output_corrected.bus-o File for corrected bus output-w File of whitelisted barcodes to correct againstbustools correct outputs a corrected BUS file.
Raw BUS output from kallisto bus pseudoalignment may be unsorted. To accelerate downstream processing BUS files can be sorted using bustools sort. This will create a new BUS file where the BUS records are sorted by barcode first, UMI second, and equivalence class third.
Before running the command, make a tmp/ directory in your directory. Run the following commands, referecing the output_corrected.bus file generated in the previous step.
bustools correctbustools sort \
-T tmp/ \
-t 2 \
-o output_sort.bus \
output_corrected.bus-t Number of threads to use-m Maximum memory used-T Location for scratch files-o Sorted output fileBustools sort outputs a sorted BUS file.
BUS files can be converted to a tab-separated format for easy inspection and processing using shell scripts or high level languages. We will be using an R script on the output file generated.
Run the following commands, referencing the sorted bus file generated in the previous step.
bustools text \
-o output_sort.txt \
output.sort.bus-o File for text outputBustools text produces a BUS file in text format.