This Markdown document uses code and data from a Satija Lab Seurat tutorial to show how to create a plot showing cell type clusters obtained from single-cell mRNA-Seq data.
Note that the tutorial is quite long and includes a lot of code demonstrating quality-control and quality-checking procedures. This shorter Markdown document does not include those steps.
Also please note that many of the methods are computationally-intensive. You may find that running the methods using RStudio Cloud is painfully slow. If so, you should try installing R and RStudio onto a desktop computer and running this Markdown there, instead of RStudio Cloud.
In addition to getting to know the Seurat library, we are also going to determine whether the analysis presented in the tutorial can be reproduced.
Question: If we re-run the analysis steps, can we obtain the same or similar cell type clusters?
The next sections walk through using the Seurat library to cluster cells based on gene expression patterns from a single-cell RNA-Seq experiment.
Data are from 10X Genomics, a company that makes instruments and reagents for single-cell RNA-Seq experiments.
The basic pattern of how this works is that we read some data into an object of type Seurat, named after the library.
Why this name? George Seurat was a French painter who lived in the late 1800s. He is famous for using and improving a painting technique called pointillism, where the painter applies little dots of paint to a canvas. When you look at the painting from far away, you can see shapes and figures. When you look at the painting up close, all you see are the dots.
The entire point of the Seurat library is to identify cell types from single-cell RNA-Seq data, like you saw done in Dr. Regev’s video lecture. This walk-through shows you how to produce the two-dimensional scatter plot cluster images from Dr. Regev’s slides.
One thing you need to keep in mind as you follow the tutorial is that each command applies some function to a Seurat object and then returns the same object, but with some new data added to it.
This is a bit weird and takes some getting used to, but it’s a strategy used by many other packages in R. Everything focuses on adding new information to a large and complex object.
Unfortunately, this leads to some confusion during interactive, exploratory sometimes because it requires you to keep track of what functions you’ve already called.
The first time you run this Markdown, you need to install the Seurat library. To do this, use the install.packages command with argument Seurat. Un-comment the following line to install the Seurat library:
#install.packages("Seurat")
Once you have install the library, load it into your environment:
library(Seurat)
Next, get the data for the tutorial. Download it from:
Unpack the file. Unpacking the file creates a new folder named pbmc3k_filtered_gene_bc_matrices which contains three files:
All three files together define the data from an experiment investigating gene expression in peripheral blood mononuclear Cells (PBMC).
Make sure you unpack the data into the same folder as your RStudio project file. This ensures that the next step will work properly.
Use the tutorial as guide to create the Seurat object that will be progressively modified in the rest of the Markdown:
pbmc.data = Read10X(data.dir = "pbmc3k_filtered_gene_bc_matrices")
pbmc = CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)#copied from tutorial
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
The object pbmc is an instance of an S4 class with class name Seurat. Documentation about the class is here: https://github.com/satijalab/seurat/wiki.
A Seurat object stores information about the experiment together with the experimental data.
The experimental data is essentially a gigantic matrix, where rows represent genes and columns represent cells.
In this experiment, the matrix has:
The experimental procedure for generating the single-cell mRNA-Seq data is not perfect. To avoid making wrong conclusions, we need to remove cells (columns) with low quality data.
The Seurat authors recommend filtering based on the following criteria gleaned from many people’s experience analyzing single-cell data.
As advised, we remove cells with:
Calculate the percentage of mitochondrial genes detected using PercentageFeatureSet and save the result by adding a value named “percent.mt” to the Seurat object:
pbmc[["percent.mt"]] = PercentageFeatureSet(pbmc, pattern = "^MT-")#copied from tutorial
When we do this, the data get stored in Seurat slot meta.data.
Next, remove the low-quality cells, using the tutorial as a guide:
pbmc = subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)#copied from tutorial
The newly filtered matrix contains:
The ultimate goal of the analysis is to identify & visualize groupings of cells, grouped based on the similarity of their gene expression patterns. Doing this requires that we can compare data from different cells.
Because different cells were sequenced to different depths, we can’t compare them without somehow accounting for this. Normalization accounts for different sequencing depths between cells and therefore lets us compare data across cells.
As of this writing, the Seurat file preprocessing.R from https://github.com/satijalab/seurat implements the normalization method.
Normalize by applying a method NormalizeData, using the tutorial as a guide:
pbmc = NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000 )#copied from tutorial
According to the tutorial authors, many groups have found that focusing on the most highly-variable genes improves downstream analysis.
For this next step, we identify the most variable genes across the experiment and ignore everything else.
Use the same method and parameters used in the tutorial - FindVariableFeatures.
pbmc = FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)#copied from tutorial
The data are all nicely normalized with respect cells, represented by columns. However, the scales are different across genes, represented by rows. That is, some genes have high absolute expression levels and others have low absolute expression values.
To make a good cluster plots, we need to express every gene’s expression values per cell in terms of the gene’s average expression value across all cells.
This is called “scaling the data”. Use the ScaleData function with variable all.genes as shown in the tutorial:
all.genes = rownames(pbmc)#copied from tutorial
pbmc = ScaleData(pbmc, features = all.genes)
## Centering and scaling data matrix
Now that the individual genes’ expression values are all expressed in terms of their own means across all the cells, we use principal components analysis to further reduce the data set. Use the function RunPCA to add principal components to the Seurat object pbmc, using the tutorial as a guide:
pbmc = RunPCA(pbmc,features = VariableFeatures(object = pbmc))#copied from tutorial
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
## FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP
## PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
## Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW
## CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A
## MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
## BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
## Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
## CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
## TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA
## HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
## PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
## HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
## NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A
## SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC
## GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL
## AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7
## LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
## GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5
## RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX
## Negative: LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
## PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B
## FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB
This step identifies principal components, which you can think of as collections of genes that capture most of the variation in the data set. They are typically ranked in terms of how much of the variation they capture. The first principal component captures the most variation, the second capture the next most, and so on.
Once this is done, you (the investigator) would need to visualize the components and decide how many of these principal components to include in the next phase of the analysis.
In the tutorial, the authors have already done this for us. They advise using the first 10 principal components.
Using these first ten principal components, we need to identify clusters. The methods for doing this are complex, but they boil down to using the principal components to compute distances between cells suing the FindNeighbors function.
Then we use FindClusters to identify clusters. Fill in the same methods described in the tutorial:
pbmc = FindNeighbors(pbmc, dims = 1:10)#copied from tutorial
## Computing nearest neighbor graph
## Computing SNN
pbmc = FindClusters(pbmc, resolution = 0.5)#copied from tutorial
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 96033
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds
Next, use the principal components from the previous step as inputs to the UMAP algorithm, which clusters cells in a two-dimensional plot. Do this using the RunUMAP methods as shown in the tutorial.
pbmc = RunUMAP(pbmc, dims = 1:10)#copied from tutorial
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 04:05:56 UMAP embedding parameters a = 0.9922 b = 1.112
## 04:05:56 Read 2638 rows and found 10 numeric columns
## 04:05:56 Using Annoy for neighbor search, n_neighbors = 30
## 04:05:56 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 04:05:57 Writing NN index file to temp file C:\Users\aboud\AppData\Local\Temp\RtmpkzU3Lt\file20c83fc53bf7
## 04:05:57 Searching Annoy index using 1 thread, search_k = 3000
## 04:05:57 Annoy recall = 100%
## 04:05:57 Commencing smooth kNN distance calibration using 1 thread
## 04:05:58 Initializing from normalized Laplacian + noise
## 04:05:58 Commencing optimization for 500 epochs, with 105132 positive edges
## 04:06:05 Optimization finished
DimPlot(pbmc, reduction = "umap")#copied from tutorial
If the above code is correct, you should see an image with eight color-coded clusters, numbered 1 through 8, as shown in the tutorial section labeled Run non-linear dimensional reduction (UMAP/tSNE).
The tutorial describes how to identify gene markers for each cluster and how to use these markers to determine cell type identify for each cluster.
This is the ultimate goal of this type of analysis: Identification of distinct cell types based on gene expression.
Use the code from the tutorial to re-draw the plot with clusters labeled by cell type:
new.cluster.ids = c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")#copied from tutorial
names(new.cluster.ids) = levels(pbmc)#copied from tutorial
pbmc = RenameIdents(pbmc, new.cluster.ids)#copied from tutorial
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()#copied from tutorial
If the tutorial is reproducible, then you will see a plot that looks similar the last plot in the tutorial.
What do you think? Did your plot perfectly match what was shown in the tutorial? How was it different, or similar?
The plot did not perfectly match the one shown in the tutorial. However, there were a lot of similarities such as similar gene names and gene grouping. In addition, the amount of dots for each color-coded clusters seems to be the same. The only big difference is the spatial arrangement of the three shapes formed from the clusters of genes. It seems like the entire plot is inverted and can be compared to a reflection seen from a mirror.This definitely has an effect on the values and distances for the entire gene clusters.Thus, since the locations of the genes being expressed is changed so is the final product. In conclusions, I do not think that this tutorial is reproducible but it provided a great deal of knowledge regarding tools for single cell genomics.