SINGLE CELL RNA-SEQ WORKSHOP MODULE 2 (CLUSTERING 8K Retinal Bipolar Cells)

Introduction

Hopefully, following the Seurat workflow on clustering 2.7K bcs has given you a “feel” for what scRNA-seq analysis entails. In this tutorial, we will be analyzing the a slightly larger dataset of Retinal Bipolar Cells (BCs) sequenced using the Drop-seq method from the publication Shekhar et al., Cell, 2016 (a copy of which has been made available to you). We will use the Seurat package, and follow roughly the same steps that were applied for pbmcs in the earlier tutorial. Retinal Bipolar Cells are a heterogenous class of interneurons in the retina involved in the processing of visual signals. In the paper, the authors identified 15 molecularly distinct types of bipolar neurons, and matched them against cell morphology. The paper identified all 12 types of bipolar neurons that had been described earlier, in addition to three novel types.

The original publication began with nearly 44000 cells, of which ~28,000 cells passed QC and featured in the final analysis. In this dataset, you will begin with 8000 cells. I have indicated some key steps below, but your task is to carve out your own analysis path and compare results against those reported in the paper. Even though the figures might look different because of the small size of the dataset and the different methods used here compared to those in the paper, ask yourself if you are able to recover cell types with similar molecular signatures. You are encouraged (although not obligated) to work in groups.

Read the count matrix and setup the Seurat object

Load necessary packages

library(Seurat)
library(dplyr)
library(Matrix)
library(MASS)
library(topGO)
library(xtable)
source("../sc_workshop/utilities.R")

Load the data matrix

# Load the bc dataset (loads a sparse matrix Count.mat)
load("../sc_workshop/bipolar8000.Rdata")

Initialize a Seurat Object

# Initialize the Seurat object with the raw (non-normalized data).  Keep all
# genes expressed in >= 10 cells. Keep all cells with at
# least 500 detected genes
bc <- CreateSeuratObject(raw.data = Count.mat, min.cells = 10, min.genes = 500, 
                           project = "Dropseq_bipolar")

As before, bc@raw.data is a slot that stores the original gene expression matrix. We can visualize the first 20 rows (genes) and the first 10 columns (cells),

bc@raw.data[1:20,1:10]

## 20 x 10 sparse Matrix of class "dgCMatrix"

##    [[ suppressing 10 column names 'Bipolar4_CGGGACGGCTAA', 'Bipolar2_TACCGCAGCTTA', 'Bipolar4_AGCGGATCTAGA' ... ]]

##                                  
## 0610007P14Rik . . . . . . . . . .
## 0610009B22Rik . . . . . . . . . .
## 0610009E02Rik . . . . . . . . . .
## 0610009L18Rik . . . . . . . . . .
## 0610009O20Rik . . . . . . . . . .
## 0610010F05Rik . 2 . . . . . . . .
## 0610010K14Rik . . . . . . . . . .
## 0610011F06Rik . . . 1 . 1 . . . .
## 0610030E20Rik . . . . . . . . . .
## 0610037L13Rik . . 1 . . . . . . .
## 0610040B10Rik . . . . 1 . . . . .
## 0610040J01Rik . . . . . . . . . .
## 0610043K17Rik . . . . . . . . . .
## 1110001J03Rik . . . . . . . . 2 .
## 1110002L01Rik . . . . . . . . 1 .
## 1110004E09Rik . . . . . . . . . .
## 1110004F10Rik 1 2 . 1 . 1 2 . . .
## 1110007C09Rik . . . . . . . . . .
## 1110008F13Rik . . . . . . 1 . . .
## 1110008L16Rik . . . . . . . . . .

Preprocessing step 1 : Filter out unhealthy cells

Following the same procedure as before, we compute percent.mito for each cell. Note that the search string input to the grep function below is slightly modified to accomodate mouse gene names.

# The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat.  For non-UMI
# data, nUMI represents the sum of the non-normalized values within a cell We calculate the percentage of mitochondrial
# genes here and store it in percent.mito using AddMetaData.  We use object@raw.data since this represents
# non-transformed and non-log-normalized counts The % of UMI mapping to mt-genes is a common scRNA-seq QC metric.
mito.genes <- grep(pattern = "^mt-", x = rownames(x = bc@data), value = TRUE)
percent.mito <- Matrix::colSums(bc@raw.data[mito.genes, ])/Matrix::colSums(bc@raw.data)

# AddMetaData adds columns to object@meta.data, and is a great place to stash QC stats
bc <- AddMetaData(object = bc, metadata = percent.mito, col.name = "percent.mito")
VlnPlot(object = bc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)

# GenePlot is typically used to visualize gene-gene relationships, but can be used for anything calculated by the object,
# i.e. columns in object@meta.data, PC scores etc.  Since there is a rare subset of cells with an outlier level of high
# mitochondrial percentage and also low UMI content, we filter these as well
par(mfrow = c(1, 2))
GenePlot(object = bc, gene1 = "nUMI", gene2 = "percent.mito")
GenePlot(object = bc, gene1 = "nUMI", gene2 = "nGene")

Filter out cells with fewer than 500 genes detected and percent.mito higher than 0.1 (10%),

bc <- FilterCells(object = bc, subset.names = c("nGene", "percent.mito"), 
    low.thresholds = c(500, -Inf), high.thresholds = c(2500, 0.1))
print(dim(bc@data))

## [1] 13416  6154

We see that we are left with 6154 cells. Normalize the data as before, using “LogNormalize”, which normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor equal to the median counts of all genes, and log-transforms the result.

med_trans = median(Matrix::colSums(bc@raw.data[,bc@cell.names]))
med_trans

## [1] 1143

bc <- NormalizeData(object = bc, normalization.method = "LogNormalize", 
                      scale.factor = med_trans)

Detection of variable genes

In this exercise, we will use an alternative method to compute variable genes than the one in Seurat. This is implemented in the NB.var.genes function. Briefly, it uses a Negative Binomial Model (more precisely, a Poisson-Gamma mixture) to estimate a lower bound for the coefficient of variation of any gene (CV, defined as the ratio of the standard deviation and the mean) as a function of the mean (the “null” model). This null model describes the minimum CV that would be exhibited by any gene based on statistical noise alone. Thus, this model provides a rational means to rank genes based on their “excess” CV from the null model,

var.genes.NB <- NB.var.genes(bc, do.idents = FALSE, set.var.genes = FALSE, num.sd=1, x.high.cutoff = 15, x.low.cutoff = 0.005)

## [1] "Identifying variable genes based on UMI Counts. Warning - use this only for UMI based data"

## [1] "Using diffCV = 0.21 as the cutoff"
## [1] "Considering only genes with mean counts less than 15 and more than 0.005"
## [1] "Found 1082 variable genes"

We can also compute variable genes using Seurat’s FindVariableGenes method as before. Let’s compare the output of the two different methods?

var.genes.Seurat <- FindVariableGenes(object = bc, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5, set.var.genes = FALSE)

print(length(var.genes.NB))

## [1] 1082

print(length(var.genes.Seurat))

## [1] 2074

print(length(intersect(var.genes.NB, var.genes.Seurat)))

## [1] 791

We have to set the variable genes to proceed. I’m partial to my method, so I’m following with that. But you should feel free to use either of the two set of variable genes, their union or their intersection (or a completely different method). Feel free to explore!

bc@var.genes = var.genes.NB

Z-scoring the data and removing unwanted sources of variation

Seurat uses linear regression to remove unwanted sources of variation that can be specified by the user.

bc <- ScaleData(object = bc, vars.to.regress = c("nUMI","percent.mito"), genes.use = bc@var.genes)

## [1] "Regressing out nUMI"         "Regressing out percent.mito"
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |============                                                     |  18%
  |                                                                       
  |==================                                               |  27%
  |                                                                       
  |========================                                         |  36%
  |                                                                       
  |==============================                                   |  45%
  |                                                                       
  |===================================                              |  55%
  |                                                                       
  |=========================================                        |  64%
  |                                                                       
  |===============================================                  |  73%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |===========================================================      |  91%
  |                                                                       
  |=================================================================| 100%
## [1] "Scaling data matrix"
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================================| 100%

Perform linear dimensionality reduction using PCA

Next we perform PCA on the scaled data. By default, the genes in object@var.genes are used as input, but can be defined using pc.genes.

bc <- RunPCA(object = bc, pc.genes = bc@var.genes, do.print = TRUE, pcs.print = 1:5, 
               genes.print = 5, pcs.compute = 50)

## [1] "PC1"
## [1] "Apoe"   "Dkk3"   "Rlbp1"  "Glul"   "Slc1a3"
## [1] ""
## [1] "Gng13"  "Cplx3"  "Snap25" "Nme1"   "Gnao1" 
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "App"     "Gria2"   "Scgn"    "Lbh"     "Gucy1a3"
## [1] ""
## [1] "Pcp2"  "Chgb"  "Car8"  "Prkca" "Trpm1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "Slit2"         "Tacr3"         "A730046J19Rik" "Glra1"        
## [5] "A930003A15Rik"
## [1] ""
## [1] "Cplx4"    "Medag"    "BC046251" "Hs3st4"   "Atp2b1"  
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "Rho"   "Sag"   "Pdc"   "Pde6g" "Gnat1"
## [1] ""
## [1] "Tacr3"         "Slit2"         "Gabra1"        "A730046J19Rik"
## [5] "Zfhx4"        
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "Vsx1"   "Cck"    "Pcp4l1" "Tnnt1"  "Reln"  
## [1] ""
## [1] "Gsg1"  "Cabp5" "Pcp4"  "Nnat"  "Lhx4" 
## [1] ""
## [1] ""

# Examine and visualize PCA results a few different ways
PrintPCA(object = bc, pcs.print = 1:5, genes.print = 5, use.full = FALSE)

## [1] "PC1"
## [1] "Apoe"   "Dkk3"   "Rlbp1"  "Glul"   "Slc1a3"
## [1] ""
## [1] "Gng13"  "Cplx3"  "Snap25" "Nme1"   "Gnao1" 
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "App"     "Gria2"   "Scgn"    "Lbh"     "Gucy1a3"
## [1] ""
## [1] "Pcp2"  "Chgb"  "Car8"  "Prkca" "Trpm1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "Slit2"         "Tacr3"         "A730046J19Rik" "Glra1"        
## [5] "A930003A15Rik"
## [1] ""
## [1] "Cplx4"    "Medag"    "BC046251" "Hs3st4"   "Atp2b1"  
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "Rho"   "Sag"   "Pdc"   "Pde6g" "Gnat1"
## [1] ""
## [1] "Tacr3"         "Slit2"         "Gabra1"        "A730046J19Rik"
## [5] "Zfhx4"        
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "Vsx1"   "Cck"    "Pcp4l1" "Tnnt1"  "Reln"  
## [1] ""
## [1] "Gsg1"  "Cabp5" "Pcp4"  "Nnat"  "Lhx4" 
## [1] ""
## [1] ""

VizPCA(object = bc, pcs.use = 1:2)

PCAPlot(object = bc, dim.1 = 1, dim.2 = 2)

PCHeatmap allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and genes are ordered according to their PCA scores. Setting cells.use to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated gene sets.

PCHeatmap(object = bc, pc.use = 1:12, cells.use = 500, do.balanced = TRUE, 
    label.columns = FALSE, use.full = FALSE)

Determining number of significant PCs by the PCA Elbow plot

PCElbowPlot(object = bc, num.pc = 30)

Determine clusters based on graph clustering

# save.SNN = T saves the SNN so that the clustering algorithm can be rerun
# using the same graph but with a different resolution value (see docs for
# full details)
bc <- FindClusters(object = bc, reduction.type = "pca", dims.use = 1:25, 
    resolution = 0.5, print.output = 0, save.SNN = TRUE, force.recalc = TRUE)

How many clusters does this yield? (check bc@ident). Visualize the clusters using t-SNE

bc <- RunTSNE(object = bc, dims.use = 1:25, do.fast = TRUE)
# note that you can set do.label=T to help label individual clusters
TSNEPlot(object = bc)

Finding Cluster-specific markers for all clusters

# find markers for every cluster compared to all remaining cells, report
# only the positive ones
bc.markers <- FindAllMarkers(object = bc, only.pos = TRUE, min.pct = 0.25, 
    thresh.use = 0.25)
bc.markers %>% group_by(cluster) %>% top_n(4, avg_logFC)

## # A tibble: 60 x 7
## # Groups:   cluster [15]
##                   p_val avg_… pct.1  pct.2           p_val_adj clus… gene 
##                   <dbl> <dbl> <dbl>  <dbl>               <dbl> <fct> <chr>
##  1            0         0.990 1.00  0.905            0         0     Calm1
##  2            0         0.955 1.00  0.643            0         0     Pcp2 
##  3            0         0.869 0.982 0.519            0         0     Chgb 
##  4            0         0.749 0.991 0.665            0         0     Trpm1
##  5            3.07e⁻³⁰³ 0.691 1.00  0.995            4.12e⁻²⁹⁹ 1     mt-R…
##  6            1.39e⁻¹⁴⁸ 0.545 0.931 0.764            1.87e⁻¹⁴⁴ 1     mt-N…
##  7            2.89e⁻¹⁴⁴ 0.582 0.996 0.924            3.87e⁻¹⁴⁰ 1     Calm1
##  8            1.30e⁻¹³³ 0.514 0.978 0.730            1.74e⁻¹²⁹ 1     Trpm1
##  9            0         3.02  1.00  0.0970           0         2     Apoe 
## 10            0         2.74  0.996 0.172            0         2     Glul 
## # ... with 50 more rows

Check out Fig. 1F in Shekhar et al., 2016. Are you able to distinguish clusters corresponding to bipolar neurons vs. those corresponding to non-bipolar neurons? (Hint: Bipolar cells were enriched in this study using a transgenic line that expresses GFP driven by a transgene corresponding to the marker Vsx2)

Visualize a heatmap of markers

DoHeatmap generates an expression heatmap for given cells and genes. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.

top10 <- bc.markers %>% group_by(cluster) %>% top_n(10, avg_logFC)
# setting slim.col.label to TRUE will print just the cluster IDS instead of
# every cell name
DoHeatmap(object = bc, genes.use = top10$gene, slim.col.label = TRUE, remove.key = TRUE)

Check out Figs. 2A, 3A and 4A? How do the clusters you obtain correspond to the 15 Bipolar Neuronal types published in the study? Also, how do these results compare to Fig. S2N, where they authors performed a clustering on a 5000 cell subsample? Can you see some similarities in the results?

Lastly, let us perform a GO analysis of the cell type specific genes to see what’s enriched.

bc.cell.type.genes <- unique(bc.markers$gene) # Takes all the unique cell type specific genes
GOterms.bc = topGOterms(fg.genes = bc.cell.type.genes, bg.genes = rownames(bc@data), organism = "Mouse")

## [1] "Total 13416 genes. 635 genes in the foreground"
## [1] "Using Ontology : BP"
## [1] "Using the fisher statistic with the weight01 algorithm"

Examine the GO table to see if there’s anything that’s retina specific? How are these different from the terms you saw in the PBMC example?

GOterms.bc$res.table

##         GO.ID                                                     Term
## 1  GO:0007601                                        visual perception
## 2  GO:0007602                                        phototransduction
## 3  GO:0050896                                     response to stimulus
## 4  GO:0009584                               detection of visible light
## 5  GO:0045665            negative regulation of neuron differentiation
## 6  GO:0060041                    retina development in camera-type eye
## 7  GO:0007186             G-protein coupled receptor signaling pathway
## 8  GO:0007263                nitric oxide mediated signal transduction
## 9  GO:0071482                      cellular response to light stimulus
## 10 GO:0007155                                            cell adhesion
## 11 GO:0009886                      post-embryonic animal morphogenesis
## 12 GO:0021871                                forebrain regionalization
## 13 GO:0042462                       eye photoreceptor cell development
## 14 GO:0045762        positive regulation of adenylate cyclase activity
## 15 GO:0007411                                            axon guidance
## 16 GO:0003407                                neural retina development
## 17 GO:0050731 positive regulation of peptidyl-tyrosine phosphorylation
## 18 GO:0060749                       mammary gland alveolus development
## 19 GO:0055075                                potassium ion homeostasis
## 20 GO:0014009                                 glial cell proliferation
##    Annotated Significant Expected    pval
## 1         97          42     4.92 1.7e-25
## 2         21          14     1.07 5.4e-14
## 3       4288         339   217.56 3.7e-09
## 4         23          13     1.17 6.0e-09
## 5        177          30     8.98 1.4e-08
## 6        116          34     5.89 1.7e-08
## 7        306          54    15.53 4.3e-08
## 8         10           7     0.51 8.8e-08
## 9         67           8     3.40 2.4e-07
## 10       711          82    36.07 4.8e-07
## 11        12           7     0.61 5.3e-07
## 12        13           7     0.66 1.1e-06
## 13        29          12     1.47 1.1e-06
## 14        15           7     0.76 3.8e-06
## 15       121          23     6.14 5.3e-06
## 16        39          14     1.98 5.8e-06
## 17        83          19     4.21 7.9e-06
## 18        12           6     0.61 1.2e-05
## 19        12           6     0.61 1.2e-05
## 20        29           8     1.47 1.6e-05