Hopefully, following the Seurat workflow on clustering 2.7K bcs has given you a “feel” for what scRNA-seq analysis entails. In this tutorial, we will be analyzing the a slightly larger dataset of Retinal Bipolar Cells (BCs) sequenced using the Drop-seq method from the publication Shekhar et al., Cell, 2016 (a copy of which has been made available to you). We will use the Seurat package, and follow roughly the same steps that were applied for pbmcs in the earlier tutorial. Retinal Bipolar Cells are a heterogenous class of interneurons in the retina involved in the processing of visual signals. In the paper, the authors identified 15 molecularly distinct types of bipolar neurons, and matched them against cell morphology. The paper identified all 12 types of bipolar neurons that had been described earlier, in addition to three novel types.
The original publication began with nearly 44000 cells, of which ~28,000 cells passed QC and featured in the final analysis. In this dataset, you will begin with 8000 cells. I have indicated some key steps below, but your task is to carve out your own analysis path and compare results against those reported in the paper. Even though the figures might look different because of the small size of the dataset and the different methods used here compared to those in the paper, ask yourself if you are able to recover cell types with similar molecular signatures. You are encouraged (although not obligated) to work in groups.
Load necessary packages
library(Seurat)
library(dplyr)
library(Matrix)
library(MASS)
library(topGO)
library(xtable)
source("../sc_workshop/utilities.R")
Load the data matrix
# Load the bc dataset (loads a sparse matrix Count.mat)
load("../sc_workshop/bipolar8000.Rdata")
Initialize a Seurat Object
# Initialize the Seurat object with the raw (non-normalized data). Keep all
# genes expressed in >= 10 cells. Keep all cells with at
# least 500 detected genes
bc <- CreateSeuratObject(raw.data = Count.mat, min.cells = 10, min.genes = 500,
project = "Dropseq_bipolar")
As before, bc@raw.data
is a slot that stores the original gene expression matrix. We can visualize the first 20 rows (genes) and the first 10 columns (cells),
bc@raw.data[1:20,1:10]
## 20 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'Bipolar4_CGGGACGGCTAA', 'Bipolar2_TACCGCAGCTTA', 'Bipolar4_AGCGGATCTAGA' ... ]]
##
## 0610007P14Rik . . . . . . . . . .
## 0610009B22Rik . . . . . . . . . .
## 0610009E02Rik . . . . . . . . . .
## 0610009L18Rik . . . . . . . . . .
## 0610009O20Rik . . . . . . . . . .
## 0610010F05Rik . 2 . . . . . . . .
## 0610010K14Rik . . . . . . . . . .
## 0610011F06Rik . . . 1 . 1 . . . .
## 0610030E20Rik . . . . . . . . . .
## 0610037L13Rik . . 1 . . . . . . .
## 0610040B10Rik . . . . 1 . . . . .
## 0610040J01Rik . . . . . . . . . .
## 0610043K17Rik . . . . . . . . . .
## 1110001J03Rik . . . . . . . . 2 .
## 1110002L01Rik . . . . . . . . 1 .
## 1110004E09Rik . . . . . . . . . .
## 1110004F10Rik 1 2 . 1 . 1 2 . . .
## 1110007C09Rik . . . . . . . . . .
## 1110008F13Rik . . . . . . 1 . . .
## 1110008L16Rik . . . . . . . . . .
Following the same procedure as before, we compute percent.mito
for each cell. Note that the search string input to the grep
function below is slightly modified to accomodate mouse gene names.
# The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat. For non-UMI
# data, nUMI represents the sum of the non-normalized values within a cell We calculate the percentage of mitochondrial
# genes here and store it in percent.mito using AddMetaData. We use object@raw.data since this represents
# non-transformed and non-log-normalized counts The % of UMI mapping to mt-genes is a common scRNA-seq QC metric.
mito.genes <- grep(pattern = "^mt-", x = rownames(x = bc@data), value = TRUE)
percent.mito <- Matrix::colSums(bc@raw.data[mito.genes, ])/Matrix::colSums(bc@raw.data)
# AddMetaData adds columns to object@meta.data, and is a great place to stash QC stats
bc <- AddMetaData(object = bc, metadata = percent.mito, col.name = "percent.mito")
VlnPlot(object = bc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
# GenePlot is typically used to visualize gene-gene relationships, but can be used for anything calculated by the object,
# i.e. columns in object@meta.data, PC scores etc. Since there is a rare subset of cells with an outlier level of high
# mitochondrial percentage and also low UMI content, we filter these as well
par(mfrow = c(1, 2))
GenePlot(object = bc, gene1 = "nUMI", gene2 = "percent.mito")
GenePlot(object = bc, gene1 = "nUMI", gene2 = "nGene")
Filter out cells with fewer than 500 genes detected and percent.mito
higher than 0.1 (10%),
bc <- FilterCells(object = bc, subset.names = c("nGene", "percent.mito"),
low.thresholds = c(500, -Inf), high.thresholds = c(2500, 0.1))
print(dim(bc@data))
## [1] 13416 6154
We see that we are left with 6154 cells. Normalize the data as before, using “LogNormalize”, which normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor equal to the median counts of all genes, and log-transforms the result.
med_trans = median(Matrix::colSums(bc@raw.data[,bc@cell.names]))
med_trans
## [1] 1143
bc <- NormalizeData(object = bc, normalization.method = "LogNormalize",
scale.factor = med_trans)
In this exercise, we will use an alternative method to compute variable genes than the one in Seurat. This is implemented in the NB.var.genes
function. Briefly, it uses a Negative Binomial Model (more precisely, a Poisson-Gamma mixture) to estimate a lower bound for the coefficient of variation of any gene (CV, defined as the ratio of the standard deviation and the mean) as a function of the mean (the “null” model). This null model describes the minimum CV that would be exhibited by any gene based on statistical noise alone. Thus, this model provides a rational means to rank genes based on their “excess” CV from the null model,
var.genes.NB <- NB.var.genes(bc, do.idents = FALSE, set.var.genes = FALSE, num.sd=1, x.high.cutoff = 15, x.low.cutoff = 0.005)
## [1] "Identifying variable genes based on UMI Counts. Warning - use this only for UMI based data"
## [1] "Using diffCV = 0.21 as the cutoff"
## [1] "Considering only genes with mean counts less than 15 and more than 0.005"
## [1] "Found 1082 variable genes"
We can also compute variable genes using Seurat’s FindVariableGenes
method as before. Let’s compare the output of the two different methods?
var.genes.Seurat <- FindVariableGenes(object = bc, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5, set.var.genes = FALSE)
print(length(var.genes.NB))
## [1] 1082
print(length(var.genes.Seurat))
## [1] 2074
print(length(intersect(var.genes.NB, var.genes.Seurat)))
## [1] 791
We have to set the variable genes to proceed. I’m partial to my method, so I’m following with that. But you should feel free to use either of the two set of variable genes, their union or their intersection (or a completely different method). Feel free to explore!
bc@var.genes = var.genes.NB
Seurat uses linear regression to remove unwanted sources of variation that can be specified by the user.
bc <- ScaleData(object = bc, vars.to.regress = c("nUMI","percent.mito"), genes.use = bc@var.genes)
## [1] "Regressing out nUMI" "Regressing out percent.mito"
##
|
| | 0%
|
|====== | 9%
|
|============ | 18%
|
|================== | 27%
|
|======================== | 36%
|
|============================== | 45%
|
|=================================== | 55%
|
|========================================= | 64%
|
|=============================================== | 73%
|
|===================================================== | 82%
|
|=========================================================== | 91%
|
|=================================================================| 100%
## [1] "Scaling data matrix"
##
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
Next we perform PCA on the scaled data. By default, the genes in object@var.genes are used as input, but can be defined using pc.genes.
bc <- RunPCA(object = bc, pc.genes = bc@var.genes, do.print = TRUE, pcs.print = 1:5,
genes.print = 5, pcs.compute = 50)
## [1] "PC1"
## [1] "Apoe" "Dkk3" "Rlbp1" "Glul" "Slc1a3"
## [1] ""
## [1] "Gng13" "Cplx3" "Snap25" "Nme1" "Gnao1"
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "App" "Gria2" "Scgn" "Lbh" "Gucy1a3"
## [1] ""
## [1] "Pcp2" "Chgb" "Car8" "Prkca" "Trpm1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "Slit2" "Tacr3" "A730046J19Rik" "Glra1"
## [5] "A930003A15Rik"
## [1] ""
## [1] "Cplx4" "Medag" "BC046251" "Hs3st4" "Atp2b1"
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "Rho" "Sag" "Pdc" "Pde6g" "Gnat1"
## [1] ""
## [1] "Tacr3" "Slit2" "Gabra1" "A730046J19Rik"
## [5] "Zfhx4"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "Vsx1" "Cck" "Pcp4l1" "Tnnt1" "Reln"
## [1] ""
## [1] "Gsg1" "Cabp5" "Pcp4" "Nnat" "Lhx4"
## [1] ""
## [1] ""
# Examine and visualize PCA results a few different ways
PrintPCA(object = bc, pcs.print = 1:5, genes.print = 5, use.full = FALSE)
## [1] "PC1"
## [1] "Apoe" "Dkk3" "Rlbp1" "Glul" "Slc1a3"
## [1] ""
## [1] "Gng13" "Cplx3" "Snap25" "Nme1" "Gnao1"
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "App" "Gria2" "Scgn" "Lbh" "Gucy1a3"
## [1] ""
## [1] "Pcp2" "Chgb" "Car8" "Prkca" "Trpm1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "Slit2" "Tacr3" "A730046J19Rik" "Glra1"
## [5] "A930003A15Rik"
## [1] ""
## [1] "Cplx4" "Medag" "BC046251" "Hs3st4" "Atp2b1"
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "Rho" "Sag" "Pdc" "Pde6g" "Gnat1"
## [1] ""
## [1] "Tacr3" "Slit2" "Gabra1" "A730046J19Rik"
## [5] "Zfhx4"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "Vsx1" "Cck" "Pcp4l1" "Tnnt1" "Reln"
## [1] ""
## [1] "Gsg1" "Cabp5" "Pcp4" "Nnat" "Lhx4"
## [1] ""
## [1] ""
VizPCA(object = bc, pcs.use = 1:2)
PCAPlot(object = bc, dim.1 = 1, dim.2 = 2)
PCHeatmap
allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and genes are ordered according to their PCA scores. Setting cells.use to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated gene sets.
PCHeatmap(object = bc, pc.use = 1:12, cells.use = 500, do.balanced = TRUE,
label.columns = FALSE, use.full = FALSE)
Determining number of significant PCs by the PCA Elbow plot
PCElbowPlot(object = bc, num.pc = 30)
Determine clusters based on graph clustering
# save.SNN = T saves the SNN so that the clustering algorithm can be rerun
# using the same graph but with a different resolution value (see docs for
# full details)
bc <- FindClusters(object = bc, reduction.type = "pca", dims.use = 1:25,
resolution = 0.5, print.output = 0, save.SNN = TRUE, force.recalc = TRUE)
How many clusters does this yield? (check bc@ident
). Visualize the clusters using t-SNE
bc <- RunTSNE(object = bc, dims.use = 1:25, do.fast = TRUE)
# note that you can set do.label=T to help label individual clusters
TSNEPlot(object = bc)
# find markers for every cluster compared to all remaining cells, report
# only the positive ones
bc.markers <- FindAllMarkers(object = bc, only.pos = TRUE, min.pct = 0.25,
thresh.use = 0.25)
bc.markers %>% group_by(cluster) %>% top_n(4, avg_logFC)
## # A tibble: 60 x 7
## # Groups: cluster [15]
## p_val avg_… pct.1 pct.2 p_val_adj clus… gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 0 0.990 1.00 0.905 0 0 Calm1
## 2 0 0.955 1.00 0.643 0 0 Pcp2
## 3 0 0.869 0.982 0.519 0 0 Chgb
## 4 0 0.749 0.991 0.665 0 0 Trpm1
## 5 3.07e⁻³⁰³ 0.691 1.00 0.995 4.12e⁻²⁹⁹ 1 mt-R…
## 6 1.39e⁻¹⁴⁸ 0.545 0.931 0.764 1.87e⁻¹⁴⁴ 1 mt-N…
## 7 2.89e⁻¹⁴⁴ 0.582 0.996 0.924 3.87e⁻¹⁴⁰ 1 Calm1
## 8 1.30e⁻¹³³ 0.514 0.978 0.730 1.74e⁻¹²⁹ 1 Trpm1
## 9 0 3.02 1.00 0.0970 0 2 Apoe
## 10 0 2.74 0.996 0.172 0 2 Glul
## # ... with 50 more rows
Check out Fig. 1F in Shekhar et al., 2016. Are you able to distinguish clusters corresponding to bipolar neurons vs. those corresponding to non-bipolar neurons? (Hint: Bipolar cells were enriched in this study using a transgenic line that expresses GFP driven by a transgene corresponding to the marker Vsx2)
DoHeatmap
generates an expression heatmap for given cells and genes. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
top10 <- bc.markers %>% group_by(cluster) %>% top_n(10, avg_logFC)
# setting slim.col.label to TRUE will print just the cluster IDS instead of
# every cell name
DoHeatmap(object = bc, genes.use = top10$gene, slim.col.label = TRUE, remove.key = TRUE)
Check out Figs. 2A, 3A and 4A? How do the clusters you obtain correspond to the 15 Bipolar Neuronal types published in the study? Also, how do these results compare to Fig. S2N, where they authors performed a clustering on a 5000 cell subsample? Can you see some similarities in the results?
Lastly, let us perform a GO analysis of the cell type specific genes to see what’s enriched.
bc.cell.type.genes <- unique(bc.markers$gene) # Takes all the unique cell type specific genes
GOterms.bc = topGOterms(fg.genes = bc.cell.type.genes, bg.genes = rownames(bc@data), organism = "Mouse")
## [1] "Total 13416 genes. 635 genes in the foreground"
## [1] "Using Ontology : BP"
## [1] "Using the fisher statistic with the weight01 algorithm"
Examine the GO table to see if there’s anything that’s retina specific? How are these different from the terms you saw in the PBMC example?
GOterms.bc$res.table
## GO.ID Term
## 1 GO:0007601 visual perception
## 2 GO:0007602 phototransduction
## 3 GO:0050896 response to stimulus
## 4 GO:0009584 detection of visible light
## 5 GO:0045665 negative regulation of neuron differentiation
## 6 GO:0060041 retina development in camera-type eye
## 7 GO:0007186 G-protein coupled receptor signaling pathway
## 8 GO:0007263 nitric oxide mediated signal transduction
## 9 GO:0071482 cellular response to light stimulus
## 10 GO:0007155 cell adhesion
## 11 GO:0009886 post-embryonic animal morphogenesis
## 12 GO:0021871 forebrain regionalization
## 13 GO:0042462 eye photoreceptor cell development
## 14 GO:0045762 positive regulation of adenylate cyclase activity
## 15 GO:0007411 axon guidance
## 16 GO:0003407 neural retina development
## 17 GO:0050731 positive regulation of peptidyl-tyrosine phosphorylation
## 18 GO:0060749 mammary gland alveolus development
## 19 GO:0055075 potassium ion homeostasis
## 20 GO:0014009 glial cell proliferation
## Annotated Significant Expected pval
## 1 97 42 4.92 1.7e-25
## 2 21 14 1.07 5.4e-14
## 3 4288 339 217.56 3.7e-09
## 4 23 13 1.17 6.0e-09
## 5 177 30 8.98 1.4e-08
## 6 116 34 5.89 1.7e-08
## 7 306 54 15.53 4.3e-08
## 8 10 7 0.51 8.8e-08
## 9 67 8 3.40 2.4e-07
## 10 711 82 36.07 4.8e-07
## 11 12 7 0.61 5.3e-07
## 12 13 7 0.66 1.1e-06
## 13 29 12 1.47 1.1e-06
## 14 15 7 0.76 3.8e-06
## 15 121 23 6.14 5.3e-06
## 16 39 14 1.98 5.8e-06
## 17 83 19 4.21 7.9e-06
## 18 12 6 0.61 1.2e-05
## 19 12 6 0.61 1.2e-05
## 20 29 8 1.47 1.6e-05