Here, we will run through the Seurat Guided Clustering Tutorial, which uses the Peripheral Blood Mononuclear Cells (pbmc) dataset obtained from 10x Genomics. The link for the dataset is below: https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
The first step done below is setting up the Seurat Object by loading the necessary libraries and the dataset.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Seurat)
library(patchwork)
pbmc.data <- Read10X(data.dir = "~/Downloads/filtered_gene_bc_matrices/hg19/")
Here, the Seurat object was initialized with the raw data
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
The below steps shows what the data in the count matrix looks like. The dots are used to represent zeros, and they are used to save memory space.
# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
## 3 x 30 sparse Matrix of class "dgCMatrix"
## [[ suppressing 30 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]
##
## CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
## TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
## MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
dense.size <- object.size(as.matrix(pbmc.data))
dense.size
## 709591472 bytes
sparse.size <- object.size(pbmc.data)
sparse.size
## 29905192 bytes
dense.size/sparse.size
## 23.7 bytes
The command below calculates mitochondrial quality control (QC) metrics. The first five lines of the dataset can be visualized using the following command: head(pbmc@meta.data, 5)
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
This next command filters cells that have unique feature counts over 2,500 or less than 200 and greater than 5% mitochondrial counts, which are shown in violin plots.
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
The below command normalizes the data based on a log scale, multiplies the result by a scale factor of 10,000 and log-transforms the result. This dataset is stored as pbmc[["RNA"]]@data. Running the default command pbmc <- NormalizeData(pbmc) would yield the same reuslts
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
The function FindVariableFeatures is used to calculate the features that exhibit high cell-to-cell variation in the pbmc dataset. The two plots show these features, and plot 2 has the top 10 features labeled.
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot1
## Warning: Transformation introduced infinite values in continuous x-axis
plot2
## Warning: Transformation introduced infinite values in continuous x-axis
# Plot 1 and Plot 2 were graphed separately in order to avoid errors in the formatting of the x-axis
The below step scales the data so that a PCA analysis can be done. This scales all the genes, but an alternative command that can be run would be pbmc <- ScaleData(pbmc) which would only perform scaling on previously identified variables.
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
## Centering and scaling data matrix
Here, a PCA analysis is done, on the variable features that were determined previously. There are a few different ways that the results can be visualized, including a summary table, VizDimReduction, DimPlot, and DimHeatmap, which are shown below.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
## FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP
## PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
## Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW
## CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A
## MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
## BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
## Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
## CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
## TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA
## HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
## PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
## HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
## NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A
## SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC
## GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL
## AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7
## LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
## GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5
## RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX
## Negative: LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
## PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B
## FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
## Negative: LTB, IL7R, CKB, VIM, MS4A7
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
This is done using the JackStraw function and is used in order to determine which principal components are the most significant, thereby showing strong enrichment of low p-values.
# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)
## Warning: Removed 23496 rows containing missing values (geom_point).
An elbow plot is an alternative way to visualize principal components based on the percentage of variance from each one. In this graph, the “elbow,” or where the data seems to plateau out, occurs at around PC 9-10.
ElbowPlot(pbmc)
The FindNeighbors function implements the K-nearest neighbor procedure, whereas the FindClusters function groups cells together using modularity optimization techniques.
pbmc <- FindNeighbors(pbmc, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 96033
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 1 3 1 2
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
Here, a DimPlot was generated using the UMAP reduction technique.
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 19:06:42 UMAP embedding parameters a = 0.9922 b = 1.112
## 19:06:42 Read 2638 rows and found 10 numeric columns
## 19:06:42 Using Annoy for neighbor search, n_neighbors = 30
## 19:06:42 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 19:06:42 Writing NN index file to temp file /var/folders/6s/vkstf1ss7j90z_cy3llfdwtw0000gp/T//Rtmpsx30fz/file8dd945c81d9f
## 19:06:42 Searching Annoy index using 1 thread, search_k = 3000
## 19:06:43 Annoy recall = 100%
## 19:06:43 Commencing smooth kNN distance calibration using 1 thread
## 19:06:44 Initializing from normalized Laplacian + noise
## 19:06:44 Commencing optimization for 500 epochs, with 105132 positive edges
## 19:06:48 Optimization finished
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap")
The following commands are used to define clusters using differential expression.
# find all markers of cluster 1
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
head(cluster1.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## IL32 1.894810e-92 0.8373872 0.948 0.464 2.598542e-88
## LTB 7.953303e-89 0.8921170 0.981 0.642 1.090716e-84
## CD3D 1.655937e-70 0.6436286 0.919 0.431 2.270951e-66
## IL7R 3.688893e-68 0.8147082 0.747 0.325 5.058947e-64
## LDHB 2.292819e-67 0.6253110 0.950 0.613 3.144372e-63
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## FCGR3A 7.583625e-209 2.963144 0.975 0.037 1.040018e-204
## IFITM3 2.500844e-199 2.698187 0.975 0.046 3.429657e-195
## CFD 1.763722e-195 2.362381 0.938 0.037 2.418768e-191
## CD68 4.612171e-192 2.087366 0.926 0.036 6.325132e-188
## RP11-290F20.3 1.846215e-188 1.886288 0.840 0.016 2.531900e-184
# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat
## # A tibble: 18 x 7
## # Groups: cluster [9]
## p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.96e-107 0.730 0.901 0.594 2.69e-103 0 LDHB
## 2 1.61e- 82 0.922 0.436 0.11 2.20e- 78 0 CCR7
## 3 7.95e- 89 0.892 0.981 0.642 1.09e- 84 1 LTB
## 4 1.85e- 60 0.859 0.422 0.11 2.54e- 56 1 AQP3
## 5 0. 3.86 0.996 0.215 0. 2 S100A9
## 6 0. 3.80 0.975 0.121 0. 2 S100A8
## 7 0. 2.99 0.936 0.041 0. 3 CD79A
## 8 9.48e-271 2.49 0.622 0.022 1.30e-266 3 TCL1A
## 9 2.96e-189 2.12 0.985 0.24 4.06e-185 4 CCL5
## 10 2.57e-158 2.05 0.587 0.059 3.52e-154 4 GZMK
## 11 3.51e-184 2.30 0.975 0.134 4.82e-180 5 FCGR3A
## 12 2.03e-125 2.14 1 0.315 2.78e-121 5 LST1
## 13 7.95e-269 3.35 0.961 0.068 1.09e-264 6 GZMB
## 14 3.13e-191 3.69 0.961 0.131 4.30e-187 6 GNLY
## 15 1.48e-220 2.68 0.812 0.011 2.03e-216 7 FCER1A
## 16 1.67e- 21 1.99 1 0.513 2.28e- 17 7 HLA-DPB1
## 17 7.73e-200 5.02 1 0.01 1.06e-195 8 PF4
## 18 3.68e-110 5.94 1 0.024 5.05e-106 8 PPBP
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
Here, VlnPlot and FeaturePlot are used to visualize marker expression.
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
Lastly, a heatmap is generated of the top ten markers
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
Here, the clustering is assigned to cell types, and these are annotated using the DimPlot function.
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()