Seurat-Guided Clustering Tutorial

This tutorial is only for practice purpose. This tutorial is from Satijalab - always thank you for sharing a great tool vignettes.

Setup the seurat object

The dataset of Peripheral Blood Mononuclear Cells(PBMC) freely available from 10X Genomics. There 2,700 single cell that were sequenced on the ILLumina NextSeq500.

start by reading in the data. The Read10X() function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified-UMI count matrix.The values in this matrix represent the number of molecules for each feature(i.e gene-row) that are detected in each cell(column).

Note that more recent versions of cellranger now also output using the h5 file format, which can be read in using Read10X_h5() function in Seurat.

We next use the count matrix to create a Seurat object. The object serves as a container that contains both data and analysis for single cell dataset.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Seurat)

## Attaching SeuratObject

library(patchwork)

# load PBMC dataset
pbmc.data <- Read10X(data.dir='/Users/chunjienan/Desktop/scRNA/Seurat-Guided clustering tutorial/filtered_gene_bc_matrices/hg19')

# initialize the seurat object with the raw, non-normalized data
pbmc <- CreateSeuratObject(counts = pbmc.data, project = 'pbmc3k', min.cells = 3, min.features = 200)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

pbmc

## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

Quiz: What does data in a count matrix look like?

# let's examine a few genes in the first thrity cells
pbmc.data[c('CD3D','TCL1A','MS4A1'), 1:30]

## 3 x 30 sparse Matrix of class "dgCMatrix"

##   [[ suppressing 30 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]

##                                                                    
## CD3D  4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2  3 . . . . . 3 4 1 5
## TCL1A . .  . . . . . . 1 . . . . . . . . . . .  . 1 . . . . . . . .
## MS4A1 . 6  . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .

The . value in the matrix represent 0s(no molecules detected). Since most values in an scRNA-seq matrix are 0, seurat uses a sparse-matrix representation whenever possible. This results in significant memory and speed savings for Drop-seq/10x data.

Standard pre-processing workflow.

A few QC metrics commonly used by the community include: 1. The number of unique genes detected in each cell. low-quality cells or empty droplets will often have very few genes. cell doublets or multiplets may exhibit an aberrantly high gene count.

The total number of molecules detected within a cell
The percentage of reads that map to the mitochondrial genome. low-quality/dying cells often exhibit extensive mitochondrial contamination. calculate mitochondrial QC metrics with PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features. use the set of all genes start with MT- as a set of mitochondrial genes

# The [[ ]] operator can add columns to object metadata.This is a great place to start QC stats
pbmc[['percent.mt']] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

Quiz: Where are QC metrics stored in Seurat?

The number of unique genes and total molecules are automatically calculated during CreateSeuratObject(), you can find them stored in the object meta data.

head(pbmc@meta.data)

##                  orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
## AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
## AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
## AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
## AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898
## AAACGCACTGGTAC-1     pbmc3k       2163          781  1.6643551

nFeature_RNA is the total number of genes detected in each cell. Low nFeature_RNA for a cell indicates that it may be dead/dying or an empty droplet.

nCount_RNA is the total number of molecules detected within a cell.

High nCount_RNA and/or nFeature_RNA indicates that the ‘cell’ may in fact be a doublet or multiplet. In combination with %mitochondrial reads, removing outliers from these groups removes most doublets/dead cells/empty droplets.

In this example below, we visualize QC metrics, and use these to filter cells.

filter cells that have unique feature counts over 2500 or less than 200. (doublet and dying cell)
filter cells that have 5%> mitochondrial counts

VlnPlot(pbmc, features = c('nFeature_RNA','nCount_RNA','percent.mt'), ncol = 3)

#FeatureScatter is used to visualize feature-feature relationships.
plot1<-FeatureScatter(pbmc, feature1 = 'nCount_RNA', feature2 = 'percent.mt')
plot2<-FeatureScatter(pbmc, feature1 = 'nCount_RNA', feature2 = 'nFeature_RNA')
plot1 + plot2

Filtering with subset function

pbmc <-subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA <2500 & percent.mt<5)

Normalizing the data

After removing unwanted cells, the next step is normalize the data. By default, employ a global-scaling normalization method ‘LogNormalize’ that normalize the feature expression measurements for each cell by total expression, multiplies this by a scale factor 10,000 by default, and log-transforms the resullt. In seurat v5, normalized value are stored in pbmc[[‘RNA’]]$data

pbmc <- NormalizeData(pbmc, normalization.method = 'LogNormalize', scale.factor = 10000)

The above code is same with pbmc<-NormalizeData(pbmc) if no parameter changes in the function.

The LogNormalize assumption: each cell orginally contains the same number of RNA molecules.

Instead, an alternative is SCTransform(), it replaces the NormalizeData, FindVariableFeatures, ScaleData functions described below.

Identification of highly variable feature(feature selection with FindVariableFeatures())

Calculate a subset of features that exhibit high cell-to-cell variation in the dataset, genes that are highly expressed in some cells, and lowly expressed in others. Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

FindVariableFeatures() by default return 2000 features per dataset. These will be used in dwonstream analysis like PCA.

# 2000 variable genes
pbmc <- FindVariableFeatures(pbmc, selection.method = 'vst', nfeatures = 2000)


# Identify the 10 most highly variable genes
top10<-head(VariableFeatures(pbmc), 10)


# plot variable features with and without labels
plot1<- VariableFeaturePlot(pbmc)
plot2<-LabelPoints(plot = plot1, points = top10, repel = TRUE)

## When using repel, set xnudge and ynudge to 0 for optimal results

plot1 + plot2

## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis

Scaling the data

Next, apply a linear transformation ‘scaling’ that is standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function: shifts the expression of each gene, so that the mean expression across cells is 0. scale the expression of each gene, so that the variance across cell is 1. This step gives equal weight in downstream analysis, so that highly-expressed genes do not dominate The results of this are stored in pbmc[[‘RNA’]]$scale.data By default, only variable features are scaled. you can specify the features argument to scale addtional features

all.genes<-rownames(pbmc)
pbmc <-ScaleData(pbmc, feature = all.genes)

## Centering and scaling data matrix

Perform linear dimensional reduction

pbmc <-RunPCA(pbmc, features = VariableFeatures(object = pbmc))

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
##     FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
##     PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
##     CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
##     MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
##     HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
##     BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
##     CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
##     TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
##     HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
##     PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
##     HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
##     NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A 
##     SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC 
##     GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
##     AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
##     LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
##     GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5 
##     RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
##     PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B 
##     FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB

Seurat provides several useful ways of visualizing both cells and features that define PCAm including VizDimReduction(), DimPlot(), and DimHeatmap()

print(pbmc[['pca']], dims = 1:5, nfeatures = 5)

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7

VizDimLoadings(pbmc, dims = 1:2, reduction = 'pca')

DimPlot(pbmc, reduction = 'pca') + NoLegend()

In particular DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analysis.Both cells and features are ordered according to their PCA scores.

DimHeatmap(pbmc, dims =1, cells = 500, balanced = TRUE)

DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)

Determine the ‘dimensionality’ of the dataset

To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. An alternative heuristic method generates an ‘Elbow plot’: a ranking of principle components based on the percentage of variance explained by each one (ElbowPlot() function). In this example, we can observe an ‘elbow’ around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.

ElbowPlot(pbmc)

Cluster the cells

we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (first 10 PCs).

pbmc <- FindNeighbors(pbmc, dims = 1:10)

## Computing nearest neighbor graph

## Computing SNN

pbmc<- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 95965
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8723
## Number of communities: 9
## Elapsed time: 0 seconds

head(Idents(pbmc))

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                2                3                2                1 
## AAACCGTGTATGCG-1 AAACGCACTGGTAC-1 
##                6                2 
## Levels: 0 1 2 3 4 5 6 7 8

Run non-linear dimensional reduction(UMAP/tsne)

The goal of these algorithm is to learn underlying structure in the dataset, in order to place simuilar cells together in low-dimensional space.

pbmc <- RunUMAP(pbmc, dims = 1:10)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 23:24:52 UMAP embedding parameters a = 0.9922 b = 1.112

## 23:24:52 Read 2638 rows and found 10 numeric columns

## 23:24:52 Using Annoy for neighbor search, n_neighbors = 30

## 23:24:52 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 23:24:52 Writing NN index file to temp file /var/folders/pp/3r7qbzhd57g__7xv_rgx8pqc0000gn/T//Rtmp3mCH3N/file142c701eca63
## 23:24:52 Searching Annoy index using 1 thread, search_k = 3000
## 23:24:53 Annoy recall = 100%
## 23:24:53 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 23:24:53 Initializing from normalized Laplacian + noise (using irlba)
## 23:24:53 Commencing optimization for 500 epochs, with 105124 positive edges
## 23:24:57 Optimization finished

DimPlot(pbmc, reduction = 'umap', label =TRUE)

you can save the object at this point, so that it can easily be loaded back in withtout having to rerun the computationally intensive steps performed above.

saveRDS(pbmc, file = '/Users/chunjienan/Desktop/scRNA/Seurat-Guided clustering tutorial/pbmc_tutorial.rds')

Finding differetially expressed features(cluster biomarkers)

Seurat can help you find markers that define clusters via differential expression(DE). By default, it identifies positive and negative markers of a single cluster(specified in ident.1), compared to all other cells. FindAllMarkers() automates this process for all clusters.

# find all markers of cluster2
cluster2.marker<-FindMarkers(pbmc, ident.1 = 2)

## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session

head(cluster2.marker, n =5)

##             p_val avg_log2FC pct.1 pct.2    p_val_adj
## IL32 2.593535e-91  1.2154360 0.949 0.466 3.556774e-87
## LTB  7.994465e-87  1.2828597 0.981 0.644 1.096361e-82
## CD3D 3.922451e-70  0.9359210 0.922 0.433 5.379250e-66
## IL7R 1.130870e-66  1.1776027 0.748 0.327 1.550876e-62
## LDHB 4.082189e-65  0.8837324 0.953 0.614 5.598314e-61

# find all markers distinguishing cluster 5 from cluster 0 and 3
cluster5.markers<-FindMarkers(pbmc,ident.1 = 5, ident.2 = c(0,3))
head(cluster5.markers, n =5)

##                       p_val avg_log2FC pct.1 pct.2     p_val_adj
## FCGR3A        2.150929e-209   4.267579 0.975 0.039 2.949784e-205
## IFITM3        6.103366e-199   3.877105 0.975 0.048 8.370156e-195
## CFD           8.891428e-198   3.411039 0.938 0.037 1.219370e-193
## CD68          2.374425e-194   3.014535 0.926 0.035 3.256286e-190
## RP11-290F20.3 9.308287e-191   2.722684 0.840 0.016 1.276538e-186

# find markers for every cluster compared to all remaining cells, report only the positive one
pbmc.markers<-FindAllMarkers(pbmc, only.pos = TRUE)

## Calculating cluster 0

## Calculating cluster 1

## Calculating cluster 2

## Calculating cluster 3

## Calculating cluster 4

## Calculating cluster 5

## Calculating cluster 6

## Calculating cluster 7

## Calculating cluster 8

pbmc.markers %>%
  group_by(cluster) %>%
  dplyr::filter(avg_log2FC >1)

## # A tibble: 928 × 7
## # Groups:   cluster [9]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene     
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>    
##  1 1.74e-109       1.07 0.897 0.593 2.39e-105 0       LDHB     
##  2 1.17e- 83       1.33 0.435 0.108 1.60e- 79 0       CCR7     
##  3 3.28e- 49       1.05 0.333 0.103 4.50e- 45 0       LEF1     
##  4 9.31e- 44       1.03 0.328 0.11  1.28e- 39 0       PRKCQ-AS1
##  5 5.46e- 29       1.09 0.242 0.085 7.48e- 25 0       LDLRAP1  
##  6 0               5.57 0.996 0.215 0         1       S100A9   
##  7 0               5.48 0.975 0.121 0         1       S100A8   
##  8 0               3.80 0.908 0.059 0         1       LGALS2   
##  9 0               3.39 0.952 0.151 0         1       FCN1     
## 10 2.86e-294       2.82 0.667 0.028 3.92e-290 1       CD14     
## # ℹ 918 more rows

Seurat has several tests for differential expression which can be set with the test.use parameter. For example, the ROC test return the ‘classification power’ for any individual marker(ranging from 0 - random, to 1 - perfect)

cluster0.markers<-FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = 'roc', only.pos = TRUE)

Several tools for visualizing marker expression. VlnPlot() show expression probability distributions across clusters and FeaturePlot() visualizes feature expression on a tSNE or PCA plot are the most commonly used visualization. Also RidgePlot(), CellScatter(), and DotPlot() as additional method.

VlnPlot(pbmc, features = c("MS4A1","CD79A"))

# you can plot raw counts as well
VlnPlot(pbmc, features = c('NKG7','PF4'), slot = 'counts', log =TRUE)

Assigning cell type identity to clusters

we can use canonical markers to easily match the unbiased clustering to know cell types

FeaturePlot(pbmc, features = c('MS4A1','GNLY','CD3E','CD14','FCER1A','FCGR3A','LYZ','PPBP','CD8A'))

DoHeatmap() generates an expression heatmap for given cells and features. In this case, we are plotting the top 20 markers for each cluster

pbmc.markers %>%
  group_by(cluster) %>%
  dplyr::filter(avg_log2FC > 1) %>%
  slice_head(n =10) %>%
  ungroup() -> top10
DoHeatmap(pbmc, features = top10$gene)

new.cluster.ids<-c('Naive CD4 T','CD14+ Mono','Memory CD T','B','CD8 T','FCGR3A+ Mono','NK','DC', 'Platelet')

names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = 'umap', label = TRUE, pt.size =0.5)

library(ggplot2)
plot <- DimPlot(pbmc, reduction = 'umap', label =TRUE, label.size = 4.5) + xlab('UMAP 1') +ylab('UMAP 2') +
  theme(axis.title = element_text(size = 18), legend.text = element_text(size=18))+guides(color = guide_legend(override.aes = list(size = 10)))

ggsave(filename = '/Users/chunjienan/Desktop/scRNA/Seurat-Guided clustering tutorial/pbmc3k_umap.jpg', height =7,width = 12, plot = plot, quality = 50)

saveRDS(pbmc, file ='/Users/chunjienan/Desktop/scRNA/Seurat-Guided clustering tutorial/pbmc3k_final.rds')