Seurat Walk-through

Analysis

Introduction to the analysis

The next sections walk through using the Seurat library to cluster cells based on gene expression patterns from a single-cell RNA-Seq experiment.

Data are from 10X Genomics, a company that makes instruments and reagents for single-cell RNA-Seq experiments.

The basic pattern of how this works is that we read some data into an object of type Seurat, named after the library.

Why this name? George Seurat was a French painter who lived in the late 1800s. He is famous for using and improving a painting technique called pointillism, where the painter applies little dots of paint to a canvas. When you look at the painting from far away, you can see shapes and figures. When you look at the painting up close, all you see are the dots.

The entire point of the Seurat library is to identify cell types from single-cell RNA-Seq data, like you saw done in Dr. Regev’s video lecture. This walk-through shows you how to produce the two-dimensional scatter plot cluster images from Dr. Regev’s slides.

One thing you need to keep in mind as you follow the tutorial is that each command applies some function to a Seurat object and then returns the same object, but with some new data added to it.

This is a bit weird and takes some getting used to, but it’s a strategy used by many other packages in R. Everything focuses on adding new information to a large and complex object.

Unfortunately, this leads to some confusion during interactive, exploratory sometimes because it requires you to keep track of what functions you’ve already called.

Setting up

The first time you run this Markdown, you need to install the Seurat library. To do this, use the install.packages command with argument Seurat. Un-comment the following line to install the Seurat library:

#install.packages("Seurat")

Once you have install the library, load it into your environment:

library(Seurat)

Next, get the data for the tutorial. Download it from:

https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz.

Unpack the file. Unpacking the file creates a new folder named pbmc3k_filtered_gene_bc_matrices which contains three files:

barcodes.tsv
genes.tsv
matrix.mtx

All three files together define the data from an experiment investigating gene expression in peripheral blood mononuclear Cells (PBMC).

Make sure you unpack the data into the same folder as your RStudio project file. This ensures that the next step will work properly.

Load the data

Use the tutorial as guide to create the Seurat object that will be progressively modified in the rest of the Markdown:

pbmc.data = Read10X(data.dir = "pbmc3k_filtered_gene_bc_matrices")
pbmc =  CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)#copied from tutorial

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

pbmc

## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

The object pbmc is an instance of an S4 class with class name Seurat. Documentation about the class is here: https://github.com/satijalab/seurat/wiki.

A Seurat object stores information about the experiment together with the experimental data.

The experimental data is essentially a gigantic matrix, where rows represent genes and columns represent cells.

In this experiment, the matrix has:

13714 rows (genes)
2700 columns (cells)

Remove low-quality cells

The experimental procedure for generating the single-cell mRNA-Seq data is not perfect. To avoid making wrong conclusions, we need to remove cells (columns) with low quality data.

The Seurat authors recommend filtering based on the following criteria gleaned from many people’s experience analyzing single-cell data.

As advised, we remove cells with:

mitochondrial genes detected, indicating contamination from mitochondrial RNA
too few or zero counts, indicating cells were sick or damaged
too many counts, indicating droplets contained more than 1 cell

Calculate the percentage of mitochondrial genes detected using PercentageFeatureSet and save the result by adding a value named “percent.mt” to the Seurat object:

pbmc[["percent.mt"]] = PercentageFeatureSet(pbmc, pattern = "^MT-")#copied from tutorial

When we do this, the data get stored in Seurat slot meta.data.

Next, remove the low-quality cells, using the tutorial as a guide:

pbmc = subset(pbmc,  subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)#copied from tutorial

The newly filtered matrix contains:

13714 rows (genes)
2638 columns (cells)

Normalize

The ultimate goal of the analysis is to identify & visualize groupings of cells, grouped based on the similarity of their gene expression patterns. Doing this requires that we can compare data from different cells.

Because different cells were sequenced to different depths, we can’t compare them without somehow accounting for this. Normalization accounts for different sequencing depths between cells and therefore lets us compare data across cells.

As of this writing, the Seurat file preprocessing.R from https://github.com/satijalab/seurat implements the normalization method.

Normalize by applying a method NormalizeData, using the tutorial as a guide:

pbmc = NormalizeData(pbmc,  normalization.method = "LogNormalize", scale.factor = 10000 )#copied from tutorial

Focus on the most variable genes

According to the tutorial authors, many groups have found that focusing on the most highly-variable genes improves downstream analysis.

For this next step, we identify the most variable genes across the experiment and ignore everything else.

Use the same method and parameters used in the tutorial - FindVariableFeatures.

pbmc = FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)#copied from tutorial

Scale genes

The data are all nicely normalized with respect cells, represented by columns. However, the scales are different across genes, represented by rows. That is, some genes have high absolute expression levels and others have low absolute expression values.

To make a good cluster plots, we need to express every gene’s expression values per cell in terms of the gene’s average expression value across all cells.

This is called “scaling the data”. Use the ScaleData function with variable all.genes as shown in the tutorial:

all.genes = rownames(pbmc)#copied from tutorial
pbmc = ScaleData(pbmc, features = all.genes)

## Centering and scaling data matrix

Perform linear dimension reduction

Now that the individual genes’ expression values are all expressed in terms of their own means across all the cells, we use principal components analysis to further reduce the data set. Use the function RunPCA to add principal components to the Seurat object pbmc, using the tutorial as a guide:

pbmc = RunPCA(pbmc,features = VariableFeatures(object = pbmc))#copied from tutorial

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
##     FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
##     PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
##     CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
##     MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
##     HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
##     BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
##     CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
##     TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
##     HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
##     PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
##     HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
##     NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A 
##     SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC 
##     GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
##     AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
##     LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
##     GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5 
##     RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
##     PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B 
##     FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB

This step identifies principal components, which you can think of as collections of genes that capture most of the variation in the data set. They are typically ranked in terms of how much of the variation they capture. The first principal component captures the most variation, the second capture the next most, and so on.

Once this is done, you (the investigator) would need to visualize the components and decide how many of these principal components to include in the next phase of the analysis.

In the tutorial, the authors have already done this for us. They advise using the first 10 principal components.

Cluster the cells

Using these first ten principal components, we need to identify clusters. The methods for doing this are complex, but they boil down to using the principal components to compute distances between cells suing the FindNeighbors function.

Then we use FindClusters to identify clusters. Fill in the same methods described in the tutorial:

pbmc = FindNeighbors(pbmc,  dims = 1:10)#copied from tutorial

## Computing nearest neighbor graph

## Computing SNN

pbmc = FindClusters(pbmc, resolution = 0.5)#copied from tutorial

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds

Run non-linear dimensional reduction (UMAP/tSNE)

Next, use the principal components from the previous step as inputs to the UMAP algorithm, which clusters cells in a two-dimensional plot. Do this using the RunUMAP methods as shown in the tutorial.

 pbmc = RunUMAP(pbmc,  dims = 1:10)#copied from tutorial

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 04:05:56 UMAP embedding parameters a = 0.9922 b = 1.112

## 04:05:56 Read 2638 rows and found 10 numeric columns

## 04:05:56 Using Annoy for neighbor search, n_neighbors = 30

## 04:05:56 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 04:05:57 Writing NN index file to temp file C:\Users\aboud\AppData\Local\Temp\RtmpkzU3Lt\file20c83fc53bf7
## 04:05:57 Searching Annoy index using 1 thread, search_k = 3000
## 04:05:57 Annoy recall = 100%
## 04:05:57 Commencing smooth kNN distance calibration using 1 thread
## 04:05:58 Initializing from normalized Laplacian + noise
## 04:05:58 Commencing optimization for 500 epochs, with 105132 positive edges
## 04:06:05 Optimization finished

 DimPlot(pbmc,  reduction = "umap")#copied from tutorial

If the above code is correct, you should see an image with eight color-coded clusters, numbered 1 through 8, as shown in the tutorial section labeled Run non-linear dimensional reduction (UMAP/tSNE).

Add gene names

The tutorial describes how to identify gene markers for each cluster and how to use these markers to determine cell type identify for each cluster.

This is the ultimate goal of this type of analysis: Identification of distinct cell types based on gene expression.

Use the code from the tutorial to re-draw the plot with clusters labeled by cell type:

 new.cluster.ids = c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")#copied from tutorial
 names(new.cluster.ids) = levels(pbmc)#copied from tutorial
 pbmc = RenameIdents(pbmc,  new.cluster.ids)#copied from tutorial
 DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()#copied from tutorial

If the tutorial is reproducible, then you will see a plot that looks similar the last plot in the tutorial.