Introduction

Hopefully, following the Seurat workflow on clustering 2.7K bcs has given you a “feel” for what scRNA-seq analysis entails. In this tutorial, we will be analyzing a slightly larger dataset of Retinal Bipolar Cells (BCs) sequenced using the Drop-seq method from the publication Shekhar et al., Cell, 2016 (download a copy here : https://www.dropbox.com/s/cpw4xysbs170jc8/Bipolar2016_maintext.pdf?dl=0). We will use the Seurat package, and follow roughly the same steps that were applied for pbmcs in the earlier tutorial. Retinal Bipolar Cells are a heterogenous class of interneurons in the retina involved in the processing of visual signals. In the paper, the authors identified 15 molecularly distinct types of bipolar neurons, and matched them against cell morphology. The paper identified all 12 types of bipolar neurons that had been described earlier, in addition to three novel types.

The original publication began with nearly 44000 cells, of which ~28,000 cells passed QC and featured in the final analysis. In this dataset, you will begin with 8000 cells. I have indicated some key steps below, but your task is to carve out your own analysis path and compare results against those reported in the paper. Even though the figures might look different because of the small size of the dataset and the different methods used here compared to those in the paper, ask yourself if you are able to recover cell types with similar molecular signatures. You are encouraged (although not obligated) to work in groups.

Read the count matrix and setup the Seurat object

Load necessary packages

library(Seurat)
library(dplyr)
library(Matrix)
library(MASS)
library(topGO)
library(xtable)
source("../sc_workshop/utilities.R")

Load the data matrix

# Load the bc dataset (loads a sparse matrix Count.mat)
load("../sc_workshop/bipolar8000.Rdata")

Analyze the data by following these steps (or not)

  1. Initialize your Seurat Object using the count matrix Count.mat. Name this object bc (for Bipolar Cells)

  2. Follow preprocessing steps 1-4 as before, but consider the following points about the original paper that differ from the PBMC analysis,
    1. This is mouse, not human data - so gene names follow a different convention. For example, human GAPDH is mouse Gapdh
    2. The original paper only considered cells that expressed at least 500 genes.
    3. Cells expressing > 10% mitochondrial transcripts were discarded
    4. Rather than normalizing all counts to 1e4, the authors normalized all counts to the median observed counts in the data. Estimate this normalization factor as norm.fac = median(Matrix::colSums(AnyCountMatrix)
    5. Play with the parameters of FindVariableGenes to obtain at least 1000-1500 genes for dimensionality reduction. Also consider an alternative model that appears to be more relevant for UMI-based data to estimate variable genes, implemented in the provided NB.var.genes defined in the script utilities.R. Sample invocation : var.genes.NB <- NB.var.genes(bc, do.idents = FALSE, set.var.genes = FALSE, num.sd=1, x.high.cutoff = 15, x.low.cutoff = 0.005)
  3. How different is the set of variable genes output by NB.var.genes compared to FindVariableGenes? Use either, their intersection or union for further analysis (or an entirely different method if you feel courageous). Make sure you set bc@var.genes to the set of variable genes.

  4. Perform PCA, and select an appropriate number of PCs. Use either PCAPlot, PCHeatmap or PCElbowPlot to make an informed choice. Make sure you use this number as input to the pcs.use parameter in FindClusters and RunTSNE

  5. Find differentially expressed genes for the clusters and generate a heatmap.

  6. Compare your results with Shekhar et al.
    1. How many clusters do you find? How sensitive is your cluster output to the parameters in FindClusters?
    2. Are you able to distinguish bipolar neuron clusters vs. non-bipolar cells? (Hint: Bipolar cells and Muller Glia express the gene Vsx2, whereas other retinal neurons do not. See Fig. 1F)
    3. Label your clusters. Do the markers and the clusters found by your unbiased clustering and DE analysis match the clusters reported in the paper? Can you label your clusters? If some clusters happen to be mixed, then label them to indicate that (e.g. BC8_9). Do the clusters that are “mixed” happen to be closely related types based on the dendrogram in Fig. 1G?
    4. Can you make a dot plot using the function DotPlot summarizing the clusters and the top markers?
    5. This analysis uses only 20% of the cells in the original dataset. So it’s possible that some rare molecular distinctions are not very obvious. But can you see any substructure in the data that could indicate that some of your clusters might be a mixture? You can examine such substructure by using FeaturePlot to plot the expression of individual genes on the tSNE map