Introduction

The basic goal of almost all single-cell RNA-seq (scRNA-seq) data analysis is to explore heterogeneity in high dimensional space, and identify patterns of variation in cell state and gene networks. There is an array of machine learning and statistical tools that can be applied for this purpose, and covering all of them in one sitting is impossible. For this introductory lab session, we will focus on a rather common task in scRNA-seq analysis - clustering, wherein the goal is to identify discrete groups of cells in the data, where members of a single group share a molecular identity. More formally, if one thinks of each cell’s molecular identity as a vector in the high dimensional space of gene expression, the task is one of identifying well-separated “clouds” or “islands” of cells in this space as well as the genes that distinguish them.

Why is this important? There is ample evidence suggesting that cells that differ in molecular characteristics also differ in their function. scRNA-seq thus offers a high-throughput and unbiased approach to discover functional categories of cells.

For this tutorial, we will be analyzing the a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics, using the Seurat R package (http://satijalab.org/seurat/), a popular and powerful set of tools to conduct scRNA-seq analysis in R. In this dataset, there are 2,700 single cells that were sequenced on the Illumina NextSeq 500. We will cover a commonly used clustering workflow, beginning with data normalization, through dimensionality reduction and clustering, and ending with differential gene expression analysis. It is important to note that clustering, while commonly used, may not always be the most appropriate approach to analyze a given dataset. There are many biological situations where the underlying space of cell states exhibits a more continuous variation rather than a discrete variation (e.g. in the case of development or differentiation). In such circumstances, one might need to pose alternative questions and apply different techniques than the ones introduced here.

Read the count matrix and setup the Seurat object

Load necessary packages

library(Seurat)
library(dplyr)
library(Matrix)
library(xtable)
library(topGO)
source("../sc_workshop/utilities.R")

To preprocess their data, 10X genomics provides a software called cellranger. cellranger aligns the raw reads and generates count matrices. Seurat’s Read10X function reads these count matrices in the format that 10X provides.

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../sc_workshop/hg19/")

# Examine the memory savings between regular and sparse matrices
dense.size <- object.size(x = as.matrix(x = pbmc.data))
dense.size
## 709264728 bytes

Seurat stores the count matrix in the sparse format. For matrices where a majority of the entries are zero, which is generally the case for scRNA-seq data (remember dropouts?), it is far more memory efficient to only remember the non-zero entries of the matrix, rather than the entire matrix (which is mostly zeros). This is essentially the basis of the sparse representation, a format that is supported by all programming languages. Notice below how much memory we save due to the sparse format,

sparse.size <- object.size(x = pbmc.data)
sparse.size
## 38715120 bytes
dense.size/sparse.size
## 18.3 bytes

As is the case in a general R workflow, we center all our analysis on a single “object”, in this case an object of the class Seurat that we will call pbmc. This object will contain various “slots” that will store not only the raw input data, but also the results from various computations below. This has the advantage that we do not need to keep track of inidividual variables of interest - they can all be collapsed into a single object as long as these slots are pre-defined.

# Initialize the Seurat object with the raw (non-normalized data).  Keep all
# genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at
# least 200 detected genes
pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200, 
                           project = "10X_PBMC")

pbmc@raw.data is a slot that stores the original gene expression matrix. We can visualize the first 20 rows (genes) and the first 10 columns (cells),

pbmc@raw.data[1:20,1:10]
## 20 x 10 sparse Matrix of class "dgTMatrix"
##    [[ suppressing 10 column names 'AAACATACAACCAC', 'AAACATTGAGCTAC', 'AAACATTGATCAGC' ... ]]
##                                  
## AL627309.1    . . . . . . . . . .
## AP006222.2    . . . . . . . . . .
## RP11-206L10.2 . . . . . . . . . .
## RP11-206L10.9 . . . . . . . . . .
## LINC00115     . . . . . . . . . .
## NOC2L         . . . . . . . . . .
## KLHL17        . . . . . . . . . .
## PLEKHN1       . . . . . . . . . .
## RP11-54O7.17  . . . . . . . . . .
## HES4          . . . . . . . . . .
## RP11-54O7.11  . . . . . . . . . .
## ISG15         . . 1 9 . 1 . . . 3
## AGRN          . . . . . . . . . .
## C1orf159      . . . . . . . . . .
## TNFRSF18      . 2 . . . . . . . .
## TNFRSF4       . . . . . . . . 1 .
## SDF4          . . 1 . . . . . . .
## B3GALT6       . . . . . . 1 . . .
## FAM132A       . . . . . . . . . .
## UBE2J2        . . . . . . . . 1 .

Preprocessing step 1 : Filter out unhealthy cells

The object initialization step above only considered cells that express at least 200 genes. Additionally, we would like to exclude cells that are unhealthy. A common metric to judge this (although by no means the only one ) is the relative expression of mitochondrially derived genes. When the cells apoptose due to stress, their mitochondria becomes leaky and there is widespread RNA-degradation. Thus a relative enrichment of mitochondrially derived genes can be a tell-tale sign of cell stress. Here, we compute the proportion of transcripts that are of mitochondrial origin for every cell (percent.mito), and visualize its distribution as a violin plot. We also use the GenePlot function to observe how percent.mito correlates with other metrics

# The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat.  For non-UMI
# data, nUMI represents the sum of the non-normalized values within a cell We calculate the percentage of mitochondrial
# genes here and store it in percent.mito using AddMetaData.  We use object@raw.data since this represents
# non-transformed and non-log-normalized counts The % of UMI mapping to MT-genes is a common scRNA-seq QC metric.
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ])/Matrix::colSums(pbmc@raw.data)

# AddMetaData adds columns to object@meta.data, and is a great place to stash QC stats
pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito")
VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)