R Markdown

Here, we will run through the Seurat Guided Clustering Tutorial, which uses the Peripheral Blood Mononuclear Cells (pbmc) dataset obtained from 10x Genomics. The link for the dataset is below: https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

The first step done below is setting up the Seurat Object by loading the necessary libraries and the dataset.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Seurat)
library(patchwork)
pbmc.data <- Read10X(data.dir = "~/Downloads/filtered_gene_bc_matrices/hg19/")

Here, the Seurat object was initialized with the raw data

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

pbmc

## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

The below steps shows what the data in the count matrix looks like. The dots are used to represent zeros, and they are used to save memory space.

# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]

## 3 x 30 sparse Matrix of class "dgCMatrix"

##    [[ suppressing 30 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]

##                                                                    
## CD3D  4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2  3 . . . . . 3 4 1 5
## TCL1A . .  . . . . . . 1 . . . . . . . . . . .  . 1 . . . . . . . .
## MS4A1 . 6  . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .

dense.size <- object.size(as.matrix(pbmc.data))
dense.size

## 709591472 bytes

sparse.size <- object.size(pbmc.data)
sparse.size

## 29905192 bytes

dense.size/sparse.size

## 23.7 bytes

Pre-processing workflow

The command below calculates mitochondrial quality control (QC) metrics. The first five lines of the dataset can be visualized using the following command: head(pbmc@meta.data, 5)

# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

This next command filters cells that have unique feature counts over 2,500 or less than 200 and greater than 5% mitochondrial counts, which are shown in violin plots.

# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

Normalizing the Data

The below command normalizes the data based on a log scale, multiplies the result by a scale factor of 10,000 and log-transforms the result. This dataset is stored as pbmc[["RNA"]]@data. Running the default command pbmc <- NormalizeData(pbmc) would yield the same reuslts

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

Identification of Highly Variable Features

The function FindVariableFeatures is used to calculate the features that exhibit high cell-to-cell variation in the pbmc dataset. The two plots show these features, and plot 2 has the top 10 features labeled.

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)

## When using repel, set xnudge and ynudge to 0 for optimal results

plot1

## Warning: Transformation introduced infinite values in continuous x-axis

plot2

## Warning: Transformation introduced infinite values in continuous x-axis

# Plot 1 and Plot 2 were graphed separately in order to avoid errors in the formatting of the x-axis

Scaling the Data

The below step scales the data so that a PCA analysis can be done. This scales all the genes, but an alternative command that can be run would be pbmc <- ScaleData(pbmc) which would only perform scaling on previously identified variables.

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

## Centering and scaling data matrix

Perform Linear Dimensional Reduction

Here, a PCA analysis is done, on the variable features that were determined previously. There are a few different ways that the results can be visualized, including a summary table, VizDimReduction, DimPlot, and DimHeatmap, which are shown below.

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
##     FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
##     PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
##     CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
##     MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
##     HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
##     BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
##     CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
##     TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
##     HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
##     PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
##     HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
##     NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A 
##     SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC 
##     GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
##     AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
##     LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
##     GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5 
##     RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
##     PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B 
##     FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB

# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

DimPlot(pbmc, reduction = "pca")

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)

Determining the “dimensionality” of the dataset

This is done using the JackStraw function and is used in order to determine which principal components are the most significant, thereby showing strong enrichment of low p-values.

# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)

JackStrawPlot(pbmc, dims = 1:15)

## Warning: Removed 23496 rows containing missing values (geom_point).

An elbow plot is an alternative way to visualize principal components based on the percentage of variance from each one. In this graph, the “elbow,” or where the data seems to plateau out, occurs at around PC 9-10.

ElbowPlot(pbmc)

Cluster the cells

The FindNeighbors function implements the K-nearest neighbor procedure, whereas the FindClusters function groups cells together using modularity optimization techniques.

pbmc <- FindNeighbors(pbmc, dims = 1:10)

## Computing nearest neighbor graph

## Computing SNN

pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds

# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                1                3                1                2 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

Run non-linear dimensional reduction

Here, a DimPlot was generated using the UMAP reduction technique.

# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 19:06:42 UMAP embedding parameters a = 0.9922 b = 1.112

## 19:06:42 Read 2638 rows and found 10 numeric columns

## 19:06:42 Using Annoy for neighbor search, n_neighbors = 30

## 19:06:42 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 19:06:42 Writing NN index file to temp file /var/folders/6s/vkstf1ss7j90z_cy3llfdwtw0000gp/T//Rtmpsx30fz/file8dd945c81d9f
## 19:06:42 Searching Annoy index using 1 thread, search_k = 3000
## 19:06:43 Annoy recall = 100%
## 19:06:43 Commencing smooth kNN distance calibration using 1 thread
## 19:06:44 Initializing from normalized Laplacian + noise
## 19:06:44 Commencing optimization for 500 epochs, with 105132 positive edges
## 19:06:48 Optimization finished

# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap")

Finding Differentially Expressed Features

The following commands are used to define clusters using differential expression.

# find all markers of cluster 1
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)

## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session

head(cluster1.markers, n = 5)

##             p_val avg_logFC pct.1 pct.2    p_val_adj
## IL32 1.894810e-92 0.8373872 0.948 0.464 2.598542e-88
## LTB  7.953303e-89 0.8921170 0.981 0.642 1.090716e-84
## CD3D 1.655937e-70 0.6436286 0.919 0.431 2.270951e-66
## IL7R 3.688893e-68 0.8147082 0.747 0.325 5.058947e-64
## LDHB 2.292819e-67 0.6253110 0.950 0.613 3.144372e-63

# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)

##                       p_val avg_logFC pct.1 pct.2     p_val_adj
## FCGR3A        7.583625e-209  2.963144 0.975 0.037 1.040018e-204
## IFITM3        2.500844e-199  2.698187 0.975 0.046 3.429657e-195
## CFD           1.763722e-195  2.362381 0.938 0.037 2.418768e-191
## CD68          4.612171e-192  2.087366 0.926 0.036 6.325132e-188
## RP11-290F20.3 1.846215e-188  1.886288 0.840 0.016 2.531900e-184

# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

## Calculating cluster 0

## Calculating cluster 1

## Calculating cluster 2

## Calculating cluster 3

## Calculating cluster 4

## Calculating cluster 5

## Calculating cluster 6

## Calculating cluster 7

## Calculating cluster 8

pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)

## Registered S3 method overwritten by 'cli':
##   method     from    
##   print.boxx spatstat

## # A tibble: 18 x 7
## # Groups:   cluster [9]
##        p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene    
##        <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
##  1 1.96e-107     0.730 0.901 0.594 2.69e-103 0       LDHB    
##  2 1.61e- 82     0.922 0.436 0.11  2.20e- 78 0       CCR7    
##  3 7.95e- 89     0.892 0.981 0.642 1.09e- 84 1       LTB     
##  4 1.85e- 60     0.859 0.422 0.11  2.54e- 56 1       AQP3    
##  5 0.            3.86  0.996 0.215 0.        2       S100A9  
##  6 0.            3.80  0.975 0.121 0.        2       S100A8  
##  7 0.            2.99  0.936 0.041 0.        3       CD79A   
##  8 9.48e-271     2.49  0.622 0.022 1.30e-266 3       TCL1A   
##  9 2.96e-189     2.12  0.985 0.24  4.06e-185 4       CCL5    
## 10 2.57e-158     2.05  0.587 0.059 3.52e-154 4       GZMK    
## 11 3.51e-184     2.30  0.975 0.134 4.82e-180 5       FCGR3A  
## 12 2.03e-125     2.14  1     0.315 2.78e-121 5       LST1    
## 13 7.95e-269     3.35  0.961 0.068 1.09e-264 6       GZMB    
## 14 3.13e-191     3.69  0.961 0.131 4.30e-187 6       GNLY    
## 15 1.48e-220     2.68  0.812 0.011 2.03e-216 7       FCER1A  
## 16 1.67e- 21     1.99  1     0.513 2.28e- 17 7       HLA-DPB1
## 17 7.73e-200     5.02  1     0.01  1.06e-195 8       PF4     
## 18 3.68e-110     5.94  1     0.024 5.05e-106 8       PPBP

cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)

Here, VlnPlot and FeaturePlot are used to visualize marker expression.

VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", 
    "CD8A"))

Lastly, a heatmap is generated of the top ten markers

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

Assigning Cell Type Identity to Clusters

Here, the clustering is assigned to cell types, and these are annotated using the DimPlot function.

new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

Seurat_example