Introduction

In the following tutorial, we describe the basic computational steps for identifying molecularly distinct cell types from single-cell RNA-seq data. We use the R programming language, which is a versatile platform for many kinds of genomic analyses and benefits from a wide array of statistical and bioinformatic libraries. Over the years, a number of software packages have been developed for single-cell transcriptomic analysis (https://github.com/seandavi/awesome-single-cell), and many of them are available through Bioconductor (https://www.bioconductor.org/), an open-source archive of bioinformatic R libraries with an active user community. This tutorial uses the packages Seurat (Satija et al., 2015), PAGODA (Fan et al., 2016), and MAST (Finak et al., 2015).

Here we analyze single-nucleus RNA-seq data covering human frontal cortex, visual cortex and cerebellum (Lake et al., 2018). While the main text mostly refers to single “cells”, all the concepts discussed are equally applicable to RNA-seq data collected from single nuclei (snRNA-seq). Although RNA-seq data from nuclei differ from whole-cell data in some key respects, such as an increased abundance of unspliced mRNA, there is ample evidence that the two are highly correlated and equally representative of a cell’s molecular identity. Moreover, scRNA-seq and snRNA-seq data share similar statistical properties. Finally, although not discussed here, many of the tools applied in this tutorial, ranging from dimensionality reduction to classification, are also applicable to other single-cell measurements, such as epigenomic and protein (e.g. mass cytometry) data. It must, however, be stressed that these modalities have key statistical differences that might warrant approaches not applicable to sc- or snRNA-seq analysis.

Preprocessing : Read the count matrix and set up the Seurat object

Load necessary packages

library(Seurat)        # single-cell analysis toolkit used throughout
library(dplyr)         # data manipulation verbs, used for the marker tables
library(Matrix)        # sparse matrix classes and methods
library(MASS)          # classical statistical functions
library(topGO)         # gene ontology enrichment
library(xtable)        # table formatting
library(Matrix.utils)  # rBind.fill, for merging sparse matrices
source("utilities.R")  # helper functions accompanying this tutorial

We first read in the individual data matrices corresponding to the Frontal Cortex, Visual Cortex and Cerebellum, downloaded from the Gene Expression Omnibus submission of Lake et al., 2018 (NCBI Gene Expression Omnibus, GSE97942). Since the majority of the entries of these expression matrices are “0”, we immediately convert them to the sparse matrix format using the Matrix package to reduce the memory footprint.

FrontCor_counts = read.table("Data/GSE97930_FrontalCortex_snDrop-seq_UMI_Count_Matrix_08-01-2017.txt.gz", header=TRUE)
FrontCor_counts = Matrix(as.matrix(FrontCor_counts), sparse=TRUE)
VisCor_counts = read.table("Data/GSE97930_VisualCortex_snDrop-seq_UMI_Count_Matrix_08-01-2017.txt.gz", header=TRUE)
VisCor_counts = Matrix(as.matrix(VisCor_counts), sparse=TRUE)
Cereb_counts = read.table("Data/GSE97930_CerebellarHem_snDrop-seq_UMI_Count_Matrix_08-01-2017.txt.gz", header=TRUE)
Cereb_counts = Matrix(as.matrix(Cereb_counts), sparse=TRUE)
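As a quick check of the savings, one can compare the size of a matrix in sparse and dense form (note that as.matrix temporarily materializes the full dense matrix, which itself requires several gigabytes of memory),

# Memory footprint of the sparse representation vs. its dense equivalent
print(object.size(FrontCor_counts), units = "Mb")
print(object.size(as.matrix(FrontCor_counts)), units = "Mb")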

Next, we add a “tissue of origin” tag to the column (cell) names of the three tissue matrices and bind them into a single matrix. The rows of the final matrix correspond to the union of the genes in the three tissue matrices; genes missing from a given matrix are assumed to be unexpressed. We use the rBind.fill function from the Matrix.utils package to fill in the missing genes. Since rBind.fill combines matrices row-wise, we transpose the matrices so that cells are rows, and transpose back at the end,

colnames(FrontCor_counts) = paste0("FrontalCortex_",colnames(FrontCor_counts))
colnames(VisCor_counts) = paste0("VisualCortex_",colnames(VisCor_counts))
colnames(Cereb_counts) = paste0("Cerebellum_",colnames(Cereb_counts))
Count.mat_sndrop = Matrix.utils::rBind.fill(t(VisCor_counts), t(FrontCor_counts), fill=0)
Count.mat_sndrop = Matrix.utils::rBind.fill(Count.mat_sndrop, t(Cereb_counts), fill=0)
Count.mat_sndrop = t(Count.mat_sndrop)
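It is good practice to verify that the combined matrix has the expected shape: its rows should be the union of the gene sets, and its columns should cover all nuclei from the three tissues,

# Rows = union of genes across tissues; columns = all nuclei
dim(Count.mat_sndrop)
length(union(union(rownames(FrontCor_counts), rownames(VisCor_counts)),
             rownames(Cereb_counts)))
ncol(FrontCor_counts) + ncol(VisCor_counts) + ncol(Cereb_counts)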

Next, we initialize an S4 R object of the class Seurat. The various downstream computations will be performed on this object.

# Initialize the Seurat object with the raw (non-normalized) data. Keep all
# genes expressed in >= 20 cells, and all cells with at least 300 detected
# genes
snd <- CreateSeuratObject(raw.data = Count.mat_sndrop, min.cells = 20, min.genes = 300, 
                         project = "snDropBrain")

snd@raw.data is a slot in the Seurat object that stores the original gene expression matrix. We can visualize the first 10 rows (genes) and the first 10 columns (cells),

snd@raw.data[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 'VisualCortex_Ex1_occ17_AAGTGAGTGACC', 'VisualCortex_Ex1_occ23_CCAGTACGCATC', 'VisualCortex_Ex1_occ24_CCAGGCCTTTCG' ... ]]
##                              
## A1BG-AS1  . . . . . . . . . .
## A1CF      . . . . . . . . . .
## A2M       . . . . . . . . . .
## A2M-AS1   . . . . . . . . . .
## A2ML1     . 1 . . . . . . . .
## A2ML1-AS1 . . . . . . . 2 . 1
## A2MP1     . . . . . . . . . .
## A3GALT2   . . . . . . . . . .
## A4GALT    . . . . . . . . . .
## AAAS      . . . . . . . . . .

Let’s check the dimensions of the raw expression matrix and the number of cells from each sample. Here snd@ident stores the sample IDs of the cells.

dim(snd@raw.data)
## [1] 23413 34324
table(snd@ident)
## 
##    Cerebellum FrontalCortex  VisualCortex 
##          4637         10319         19368

Thus, we have 23,413 genes and 34,324 cells, with 19,368 cells from the Visual Cortex, 10,319 cells from the Frontal Cortex and 4,637 cells from the Cerebellum, respectively. We can visualize common metrics such as the number of genes per cell (nGene) and the number of transcripts/UMIs per cell (nUMI) as “violin plots” (a fancier version of the good old box-and-whisker plot) using the Seurat plotting command VlnPlot,

VlnPlot(object = snd, features.plot = c("nGene", "nUMI"), nCol = 2, point.size.use = 0.1)

Analysis step 1 : Normalize the data

Because of technical differences in cell lysis and mRNA capture efficiency, the count vectors of two equivalent cells can differ in the total number of transcripts/UMIs across all genes. It is therefore necessary to first normalize the data to attenuate these differences, which is carried out in two steps.

  1. We re-scale the counts in every cell to sum to a constant value - here, we choose the median of the total transcripts per cell as the scaling factor. This is often referred to as “library-size normalization”.

  2. We apply a logarithmic transformation to the scaled expression values, E <- log(E+1) (the addition of 1 ensures that zeros map to zero values). This transformation has two desirable properties: first, it compresses the dynamic range, so that a handful of highly expressed genes do not dominate downstream analyses; second, differences in log-transformed values correspond to fold changes, a natural scale for comparing gene expression.

med_trans = median(Matrix::colSums(snd@raw.data[,snd@cell.names]))
print(med_trans)
## [1] 959
snd <- NormalizeData(object = snd, normalization.method = "LogNormalize", 
                    scale.factor = med_trans)
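For intuition, here is a minimal sketch of the equivalent manual computation on the sparse count matrix (illustrative only; NormalizeData performs the same computation and stores its result in snd@data),

# Manual equivalent of LogNormalize: scale each cell's counts to sum to
# med_trans, then log-transform
counts <- snd@raw.data[, snd@cell.names]
sf <- med_trans / Matrix::colSums(counts)          # per-cell scaling factors
norm_counts <- counts %*% Matrix::Diagonal(x = sf) # rescale each column (cell)
log_norm <- log1p(norm_counts)                     # log(E + 1); zeros stay zero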

Analysis step 2 : Detection of variable genes

As described in the main text, prior to dimensionality reduction (step 3) it is common to choose features that are likely to be informative over features that represent statistical noise, a step known as “feature selection”. In scRNA-seq data, this is accomplished by choosing genes that are “highly variable”, under the assumption that variability in most genes does not represent meaningful biology. An additional challenge is that the level of variability of a gene is related to its mean expression (a phenomenon known as heteroscedasticity), which has to be explicitly accounted for. We perform variable gene selection using Seurat’s FindVariableGenes method, which bins the genes based on their mean expression and looks for the genes with the highest dispersion (the ratio of the variance to the mean) in each bin. The function accepts additional parameters that exclude lowly or highly expressed genes (parameters x.low.cutoff and x.high.cutoff, respectively), and a cutoff for the scaled dispersion in each bin (parameter y.cutoff). We refer the reader to other variable gene selection methods, e.g. M3Drop (Andrews and Hemberg, 2016), mean-CV regression (Brennecke et al., 2013) or a recently published Poisson-Gamma mixture model (Pandey et al., 2018).

var.genes <- FindVariableGenes(object = snd, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5, set.var.genes = FALSE)

print(length(var.genes))
## [1] 2353

Thus, we find 2,353 variable genes in the data.
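For intuition, the quantities underlying this selection can be computed directly; a simplified sketch (omitting the binning, the log-scaling of the statistics, and the per-bin z-scoring that FindVariableGenes performs) is,

# Per-gene mean and dispersion (variance/mean) of the normalized data,
# computed after undoing the log(E + 1) transformation
expr <- expm1(snd@data)                               # back to normalized counts
gene_means <- Matrix::rowMeans(expr)
gene_vars <- Matrix::rowMeans(expr^2) - gene_means^2  # population variance
dispersion <- gene_vars / gene_means
head(sort(dispersion, decreasing = TRUE))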

Analysis step 3 : Z-scoring the data and removing unwanted sources of variation

Variation in scRNA-seq data that is relevant to cell identity can be masked by many unwanted sources of variation. A common challenge is batch effects, which can be reflected in both transcriptomic differences and cell-type compositional differences between equivalent experimental batches. As mentioned earlier, variations in lysis efficiency, mRNA capture and amplification can result in substantial differences between the transcriptomes of equivalent cells. There can be additional sources of variation resulting from biological processes such as cell cycle, response to dissociation, stress and apoptosis that might dominate the measured transcriptomic state of the cell.

Correcting for such effects continues to be an active area of research, and many sophisticated approaches have recently been introduced [REF], but a comprehensive overview is beyond our scope. Here, for demonstrative purposes, we remove variation in gene expression that is highly correlated with library size (nUMI). Seurat performs a linear fit to the expression level of every gene using nUMI as a predictor, and returns the residuals as the “corrected” expression values. Next, the expression values are z-scored, or standardized, along every gene,

\[ E_{ij} \leftarrow \frac{E_{ij} - \bar E_{i}}{\sigma_i} \] Here \(E_{ij}\) is the corrected expression value of gene \(i\) in cell \(j\), and \(\bar E_i\) and \(\sigma_i\) are the mean and the standard deviation of gene \(i\)’s expression across all cells. After this transformation, every gene has zero mean and unit standard deviation across cells.

Removing the effects of nUMI and z-scoring are performed together using Seurat’s function ScaleData, which then stores the transformed gene expression values in the slot snd@scale.data.
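A call consistent with this description (here restricted to the variable genes for speed, since only they feed into the PCA below) is,

# Regress out nUMI from each gene, then z-score the residuals;
# the result is stored in snd@scale.data
snd <- ScaleData(object = snd, genes.use = var.genes, vars.to.regress = "nUMI")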

Analysis step 4 : Perform linear dimensionality reduction using PCA

Although we substantially reduced the number of relevant features in step 2 (from 23,413 to 2,353), we have not yet accounted for the fact that many of these genes are correlated with each other and collectively define cell identity. This enables a further reduction of the dimensionality of the data. Here, we accomplish this using Principal Component Analysis (PCA), a classical and extremely versatile dimensionality reduction method that identifies a linear subspace that most accurately captures the variance in the data. The individual axes of this subspace, known as principal vectors (PVs), are linear combinations of the original genes, and the projections of the original data onto these axes are known as principal components (PCs).

snd <- RunPCA(object = snd, pc.genes = var.genes, do.print = TRUE, pcs.print = 1:5, 
             genes.print = 5, pcs.compute = 50)
## [1] "PC1"
## [1] "PLP1"   "CTNNA3" "RNF220" "ST18"   "MOBP"  
## [1] ""
## [1] "MEG3"    "CDH18"   "GRIK2"   "KCND2"   "ZNF385D"
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "SLC1A2"     "GPC5"       "SLC1A3"     "ADGRV1"     "RNF219-AS1"
## [1] ""
## [1] "MEG3"    "NRXN3"   "CNTNAP2" "SLC24A2" "ASIC2"  
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "LRRTM4"  "DPP10"   "CHN1"    "ASIC2"   "SPARCL1"
## [1] ""
## [1] "RP11-649A16.1" "FSTL5"         "ZNF385D"       "CHN2"         
## [5] "TIAM1"        
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "NRGN"   "CHN1"   "CADPS2" "ENC1"   "RORB"  
## [1] ""
## [1] "LHFPL3" "ADARB2" "PCDH15" "NXPH1"  "GRIK1" 
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "ERBB4"         "CXCL14"        "DLX6-AS1"      "SYNPR"        
## [5] "RP11-123O10.4"
## [1] ""
## [1] "APBB1IP"  "DOCK8"    "C10orf11" "ST6GAL1"  "ADAM28"  
## [1] ""
## [1] ""

Each PV is defined by a set of weights over the genes (known as the “loadings”). A PV is said to be “driven” by genes with high weights (positive or negative), and any two PVs represent independent, orthogonal directions. The printed output of RunPCA lists the genes with the highest-magnitude positive and negative loadings along the top PVs.
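The loadings and the PC coordinates of each cell can also be accessed directly from the Seurat (v2) object for custom analyses; for example,

# Gene loadings (genes x PCs) and cell embeddings (cells x PCs) live
# in the 'dr' slot of the Seurat object
loadings <- snd@dr$pca@gene.loadings
embeddings <- snd@dr$pca@cell.embeddings
# Genes with the largest absolute loadings on PC1
head(rownames(loadings)[order(abs(loadings[, 1]), decreasing = TRUE)], 10)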

Visualize the PCA results

Seurat offers multiple ways to visualize the PCA results, and these are useful for gaining biological intuition. VizPCA shows the genes with the highest absolute loadings along any number of user-specified PVs. PCAPlot plots the cells in a reduced-dimensional space of PCs, and can often highlight subpopulation structure.

VizPCA(object = snd, pcs.use = 1:2)

PCAPlot(object = snd, dim.1 = 1, dim.2 = 2)

Here, by examining the results of both VizPCA and PCAPlot, we see that the cells with low values of PC1 are oligodendrocytes, characterized by the high loadings of characteristic genes such as Proteolipid Protein 1 (PLP1) and Myelin-Associated Oligodendrocyte Basic Protein (MOBP). On the other hand, cells with low values of PC2 are astrocytes, characterized by the expression of the transporters SLC1A2 and SLC1A3. Next, PCHeatmap allows easy visualization of the gene expression variation along each PC in the data, and can be particularly useful when deciding which PCs to include in further downstream analyses. Both cells and genes are ordered according to their PCA scores and loadings, respectively, along each PC. Setting cells.use to a number plots the “extreme” cells on both ends of the spectrum.

PCHeatmap(object = snd, pc.use = 1:12, cells.use = 500, do.balanced = TRUE, 
          label.columns = FALSE, use.full = FALSE)

While there are many formal methods to determine the number of statistically significant PCs, a particularly easy and popular approach is to examine the variance captured by successive PCs and identify an “elbow” beyond which additional PCs are of marginal utility (the flat region is often called the “noise floor”). We do this using the Seurat function PCElbowPlot.

PCElbowPlot(object = snd, num.pc = 50)
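The same information can be examined numerically from the standard deviations of the PCs stored in the object; a minimal sketch,

# Fraction of the captured variance attributable to each of the 50
# computed PCs; the "elbow" is where this curve flattens out
pc.var <- snd@dr$pca@sdev^2
plot(pc.var / sum(pc.var), type = "b", xlab = "PC", ylab = "Fraction of variance")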

We choose 25 PCs based on the elbow plot. Thus, for every cell in the data, we have reduced its representation from ~23,000 genes to 25 PCs (a nearly 1,000-fold reduction in dimensionality!).

Next, we determine subpopulations in the data using graph-based clustering, via the Seurat function FindClusters. Graph clustering has been widely used in recent scRNA-seq papers and has many desirable properties compared to other methods such as k-means clustering, hierarchical clustering and density-based clustering. We first build a k-nearest neighbor graph on the data, connecting each cell to its k nearest neighbors based on transcriptional similarity, where the neighbors are determined by proximity in PC space under a Euclidean distance metric. Next, similar to the strategy employed in Levine et al., 2015 and Shekhar et al., 2016, the graph edge weights are refined based on the Jaccard similarity metric, which removes spurious edges between clusters. FindClusters then finds clusters that maximize a function known as the “modularity” on the Jaccard-weighted k-nearest neighbor graph. The function contains a resolution parameter that tunes the granularity of the clustering, with increased values leading to a greater number of clusters. We use a value of 1, but variations in this parameter should be tested to check for robustness.

# save.SNN = T saves the SNN so that the clustering algorithm can be rerun
# using the same graph but with a different resolution value (see docs for
# full details)
snd <- FindClusters(object = snd, reduction.type = "pca", dims.use = 1:25, 
                   resolution = 1, print.output = 0, save.SNN = TRUE, force.recalc = TRUE)
table(snd@ident)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 4372 4079 3624 2483 2002 1974 1866 1822 1473 1439 1381 1223 1074  996  908 
##   15   16   17   18   19   20   21   22   23   24 
##  723  670  607  553  380  291  210   99   47   28

This yields 25 clusters (labeled 0-24 above). We visualize the clusters using t-SNE,

snd <- RunTSNE(object = snd, dims.use = 1:25, do.fast = TRUE)
# do.label = TRUE labels each cluster at its centroid on the t-SNE map
TSNEPlot(object = snd, do.label = TRUE)
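Since we set save.SNN = TRUE, alternative values of the resolution parameter can be explored cheaply; the stored nearest-neighbor graph is reused as long as force.recalc is not set. For example, assigning to a copy so that the current cluster assignments are kept,

# A coarser clustering of the same SNN graph (resolution 0.5)
snd_coarse <- FindClusters(object = snd, resolution = 0.5, print.output = 0)
table(snd_coarse@ident)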

Finding cluster-specific markers for all clusters

# find markers for every cluster compared to all remaining cells, report
# only the positive ones
snd.markers <- FindAllMarkers(object = snd, only.pos = TRUE, min.pct = 0.25, 
                             thresh.use = 0.25)
snd.markers %>% group_by(cluster) %>% top_n(4, avg_logFC)
## # A tibble: 76 x 7
## # Groups:   cluster [19]
##    p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene   
##    <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>  
##  1    0.     0.842 0.973 0.609        0. 0       LRRTM4 
##  2    0.     0.700 0.995 0.843        0. 0       KCNIP4 
##  3    0.     0.628 0.988 0.799        0. 0       MEG3   
##  4    0.     0.605 0.702 0.108        0. 0       CBLN2  
##  5    0.     0.453 0.867 0.571        0. 1       DPP10  
##  6    0.     0.446 0.922 0.764        0. 1       SYT1   
##  7    0.     0.436 0.779 0.572        0. 1       PHACTR1
##  8    0.     0.431 0.880 0.743        0. 1       LRP1B  
##  9    0.     0.584 0.948 0.755        0. 2       PTPRD  
## 10    0.     0.543 0.688 0.239        0. 2       NRGN   
## # ... with 66 more rows
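The markers can be inspected visually before saving, e.g. as a heatmap of the top markers per cluster, or as expression overlays on the t-SNE map,

# Heatmap of the top 4 markers per cluster, cells grouped by cluster
top4 <- snd.markers %>% group_by(cluster) %>% top_n(4, avg_logFC)
DoHeatmap(object = snd, genes.use = top4$gene, slim.col.label = TRUE, remove.key = TRUE)
# Expression of individual markers overlaid on the t-SNE coordinates
FeaturePlot(object = snd, features.plot = c("PLP1", "SLC1A2"), cols.use = c("grey", "blue"))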
Finally, we save the Seurat object and the marker table for downstream analyses,

save(list=c("snd","snd.markers"), file="LakeSeuratAnalysisObject.Rdata")