We will be analyzing the a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.
Single cell RNA-seq data used here is generated using the 10X genomics platform. The cellranger pipeline outputs two types of feature-barcode matrices described in the table below. Each element of the matrix is the number of UMIs associated with a feature (row) and a barcode (column).
Filtered feature-barcode matrix Contains only detected cellular barcodes. For Targeted Gene Expression samples, non-targeted genes are removed from the filtered matrix.
Each matrix is stored in the Market Exchange Format (MEX) for sparse matrices. It also contains gzipped TSV files with feature and barcode sequences corresponding to row and column indices respectively.
We start by reading in the data. The Read10X() function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).
We next use the count matrix to create a Seurat object. The object serves as a container that contains both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset. For example, the count matrix is stored in pbmc[[“RNA”]]@counts.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Seurat)
## Attaching SeuratObject
library(patchwork)
# Load the PBMC dataset
pbmc.data = Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc = CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
Here the parameters used are:
counts: Un normalized data such as raw counts or TPMs
min.cells: Include features detected in at least this many cells. Will subset the counts matrix as well. To reintroduce excluded features, create a new object with a lower cutoff.
min.features: Include cells where at least this many features are detected.
The . values in the matrix represent 0s (no molecules detected). Since most values in an scRNA-seq matrix are 0, Seurat uses a sparse-matrix representation whenever possible. This results in significant memory and speed savings for Drop-seq/inDrop/10x data.
pbmc.data[1:50, 1:10]
## 50 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]
##
## MIR1302-10 . . . . . . . . . .
## FAM138A . . . . . . . . . .
## OR4F5 . . . . . . . . . .
## RP11-34P13.7 . . . . . . . . . .
## RP11-34P13.8 . . . . . . . . . .
## AL627309.1 . . . . . . . . . .
## RP11-34P13.14 . . . . . . . . . .
## RP11-34P13.9 . . . . . . . . . .
## AP006222.2 . . . . . . . . . .
## RP4-669L17.10 . . . . . . . . . .
## OR4F29 . . . . . . . . . .
## RP4-669L17.2 . . . . . . . . . .
## RP5-857K21.15 . . . . . . . . . .
## RP5-857K21.1 . . . . . . . . . .
## RP5-857K21.2 . . . . . . . . . .
## RP5-857K21.3 . . . . . . . . . .
## RP5-857K21.4 . . . . . . . . . .
## RP5-857K21.5 . . . . . . . . . .
## OR4F16 . . . . . . . . . .
## RP11-206L10.3 . . . . . . . . . .
## RP11-206L10.5 . . . . . . . . . .
## RP11-206L10.4 . . . . . . . . . .
## RP11-206L10.2 . . . . . . . . . .
## RP11-206L10.9 . . . . . . . . . .
## AL669831.1 . . . . . . . . . .
## FAM87B . . . . . . . . . .
## LINC00115 . . . . . . . . . .
## FAM41C . . . . . . . . . .
## AL645608.2 . . . . . . . . . .
## RP11-54O7.16 . . . . . . . . . .
## RP11-54O7.1 . . . . . . . . . .
## RP11-54O7.2 . . . . . . . . . .
## RP11-54O7.3 . . . . . . . . . .
## SAMD11 . . . . . . . . . .
## AL645608.1 . . . . . . . . . .
## NOC2L . . . . . . . . . .
## KLHL17 . . . . . . . . . .
## PLEKHN1 . . . . . . . . . .
## C1orf170 . . . . . . . . . .
## RP11-54O7.17 . . . . . . . . . .
## HES4 . . . . . . . . . .
## RP11-54O7.11 . . . . . . . . . .
## ISG15 . . 1 9 . 1 . . . 3
## AGRN . . . . . . . . . .
## RP11-54O7.18 . . . . . . . . . .
## RNF223 . . . . . . . . . .
## C1orf159 . . . . . . . . . .
## RP11-465B22.5 . . . . . . . . . .
## RP11-465B22.8 . . . . . . . . . .
## TTLL10-AS1 . . . . . . . . . .
Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include:
The number of unique genes detected in each cell.
Low-quality cells or empty droplets will often have very few genes
Cell doublets or multiplets may exhibit an aberrantly high gene count
Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)
The percentage of reads that map to the mitochondrial genome
Low-quality / dying cells often exhibit extensive mitochondrial contamination
We calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features
We use the set of all genes starting with MT- as a set of mitochondrial genes
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] = PercentageFeatureSet(pbmc, pattern = "^MT-")
head(pbmc@meta.data)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 SeuratProject 2419 779 3.0177759
## AAACATTGAGCTAC-1 SeuratProject 4903 1352 3.7935958
## AAACATTGATCAGC-1 SeuratProject 3147 1129 0.8897363
## AAACCGTGCTTCCG-1 SeuratProject 2639 960 1.7430845
## AAACCGTGTATGCG-1 SeuratProject 980 521 1.2244898
## AAACGCACTGGTAC-1 SeuratProject 2163 781 1.6643551
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 = FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 = FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1
plot2
pbmc = subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc
## An object of class Seurat
## 13714 features across 2638 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[[“RNA”]]@data.
pbmc = NormalizeData(pbmc)
pbmc = FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 = head(VariableFeatures(pbmc), 10)
top10
## [1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" "PF4" "FTH1"
## [9] "GNG11" "S100A8"
# plot variable features with and without labels
plot1 = VariableFeaturePlot(pbmc)
plot2 = LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2
## Warning: Transformation introduced infinite values in continuous x-axis
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets. By default, Seurat return 2,000 features per dataset. These will be used in downstream analysis, like PCA.
Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:
b.Scales the expression of each gene, so that the variance across cells is 1
This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
all.genes = rownames(pbmc)
pbmc = ScaleData(pbmc, features = all.genes)
## Centering and scaling data matrix
pbmc@assays$RNA@scale.data[1:50, 1:5]
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## AL627309.1 -0.05812316 -0.05812316 -0.05812316
## AP006222.2 -0.03357571 -0.03357571 -0.03357571
## RP11-206L10.2 -0.04166819 -0.04166819 -0.04166819
## RP11-206L10.9 -0.03364562 -0.03364562 -0.03364562
## LINC00115 -0.08223981 -0.08223981 -0.08223981
## NOC2L -0.31717081 -0.31717081 -0.31717081
## KLHL17 -0.05344722 -0.05344722 -0.05344722
## PLEKHN1 -0.05082183 -0.05082183 -0.05082183
## RP11-54O7.17 -0.03308805 -0.03308805 -0.03308805
## HES4 -0.23376818 -0.23376818 -0.23376818
## RP11-54O7.11 -0.03768905 -0.03768905 -0.03768905
## ISG15 -0.83530282 -0.83530282 0.39223510
## AGRN -0.05316879 -0.05316879 -0.05316879
## C1orf159 -0.09498396 -0.09498396 -0.09498396
## TNFRSF18 -0.18550619 4.86702930 -0.18550619
## TNFRSF4 -0.24234683 -0.24234683 -0.24234683
## SDF4 -0.37183662 -0.37183662 2.12159252
## B3GALT6 -0.14723675 -0.14723675 -0.14723675
## FAM132A -0.02751112 -0.02751112 -0.02751112
## UBE2J2 -0.34558273 -0.34558273 -0.34558273
## ACAP3 -0.15080728 -0.15080728 -0.15080728
## PUSL1 -0.14892932 -0.14892932 -0.14892932
## CPSF3L -0.27411451 -0.27411451 -0.27411451
## GLTPD1 -0.16714870 -0.16714870 -0.16714870
## DVL1 -0.04354166 -0.04354166 -0.04354166
## MXRA8 -0.04341365 -0.04341365 -0.04341365
## AURKAIP1 -0.61424026 0.72972362 -0.61424026
## CCNL2 -0.23151897 -0.23151897 -0.23151897
## RP4-758J18.2 -0.19162256 -0.19162256 -0.19162256
## MRPL20 1.50566156 -0.56259931 1.24504892
## ATAD3C -0.04970561 -0.04970561 -0.04970561
## ATAD3B -0.10150202 -0.10150202 -0.10150202
## ATAD3A -0.13088200 -0.13088200 -0.13088200
## SSU72 -0.68454728 0.58087748 -0.68454728
## AL645728.1 -0.09968522 -0.09968522 -0.09968522
## C1orf233 -0.05620767 -0.05620767 -0.05620767
## RP11-345P4.9 -0.11658359 -0.11658359 -0.11658359
## MIB2 -0.25893505 3.45840675 -0.25893505
## MMP23B -0.08858444 -0.08858444 -0.08858444
## CDK11B -0.14875529 -0.14875529 -0.14875529
## SLC35E2B -0.12692854 -0.12692854 -0.12692854
## CDK11A -0.24707389 -0.24707389 -0.24707389
## SLC35E2 -0.07710150 -0.07710150 -0.07710150
## NADK -0.21621570 -0.21621570 -0.21621570
## GNB1 -0.35639526 -0.35639526 -0.35639526
## RP1-140A9.1 -0.03335980 -0.03335980 -0.03335980
## TMEM52 -0.03363839 -0.03363839 -0.03363839
## PRKCZ -0.10937706 -0.10937706 -0.10937706
## RP5-892K4.1 -0.06025369 -0.06025369 -0.06025369
## C1orf86 -0.46390647 -0.46390647 -0.46390647
## AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## AL627309.1 -0.05812316 -0.05812316
## AP006222.2 -0.03357571 -0.03357571
## RP11-206L10.2 -0.04166819 -0.04166819
## RP11-206L10.9 -0.03364562 -0.03364562
## LINC00115 -0.08223981 -0.08223981
## NOC2L -0.31717081 -0.31717081
## KLHL17 -0.05344722 -0.05344722
## PLEKHN1 -0.05082183 -0.05082183
## RP11-54O7.17 -0.03308805 -0.03308805
## HES4 -0.23376818 -0.23376818
## RP11-54O7.11 -0.03768905 -0.03768905
## ISG15 2.21976210 -0.83530282
## AGRN -0.05316879 -0.05316879
## C1orf159 -0.09498396 -0.09498396
## TNFRSF18 -0.18550619 -0.18550619
## TNFRSF4 -0.24234683 -0.24234683
## SDF4 -0.37183662 -0.37183662
## B3GALT6 -0.14723675 -0.14723675
## FAM132A -0.02751112 -0.02751112
## UBE2J2 -0.34558273 -0.34558273
## ACAP3 -0.15080728 -0.15080728
## PUSL1 -0.14892932 -0.14892932
## CPSF3L -0.27411451 -0.27411451
## GLTPD1 -0.16714870 -0.16714870
## DVL1 -0.04354166 -0.04354166
## MXRA8 -0.04341365 -0.04341365
## AURKAIP1 1.27938115 -0.61424026
## CCNL2 -0.23151897 -0.23151897
## RP4-758J18.2 -0.19162256 -0.19162256
## MRPL20 -0.56259931 -0.56259931
## ATAD3C -0.04970561 -0.04970561
## ATAD3B -0.10150202 -0.10150202
## ATAD3A -0.13088200 -0.13088200
## SSU72 2.17830781 -0.68454728
## AL645728.1 -0.09968522 -0.09968522
## C1orf233 -0.05620767 -0.05620767
## RP11-345P4.9 -0.11658359 -0.11658359
## MIB2 -0.25893505 -0.25893505
## MMP23B -0.08858444 -0.08858444
## CDK11B -0.14875529 -0.14875529
## SLC35E2B -0.12692854 -0.12692854
## CDK11A -0.24707389 -0.24707389
## SLC35E2 -0.07710150 -0.07710150
## NADK 4.40918818 -0.21621570
## GNB1 -0.35639526 -0.35639526
## RP1-140A9.1 -0.03335980 -0.03335980
## TMEM52 -0.03363839 -0.03363839
## PRKCZ -0.10937706 -0.10937706
## RP5-892K4.1 -0.06025369 -0.06025369
## C1orf86 -0.46390647 -0.46390647
Next we perform PCA on the scaled data. By default, only the previously determined variable features are used as input, but can be defined using features argument if you wish to choose a different subset.
pbmc = RunPCA(pbmc, features = VariableFeatures(object = pbmc))
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
## FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP
## PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
## Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW
## CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A
## MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
## BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
## Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
## CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
## TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA
## HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
## PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
## HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
## NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A
## SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC
## GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL
## AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7
## LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
## GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5
## RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX
## Negative: LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
## PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B
## FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
ElbowPlot(pbmc)
DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting cells to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets.
An alternative heuristic method generates an ‘Elbow plot’: a ranking of principle components based on the percentage of variance explained by each one (ElbowPlot() function). In this example, we can observe an ‘elbow’ around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.
Seurat v3 applies a graph-based clustering approach, building upon initial strategies in (Macosko et al). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, the approach to partitioning the cellular distance matrix into clusters has dramatically improved.
Seurat first constructs a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (first 10 PCs).
To cluster the cells, Seurat next applies modularity optimization techniques such as the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics], to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters can be found using the Idents() function.
pbmc = FindNeighbors(pbmc, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
pbmc = FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95965
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8723
## Number of communities: 9
## Elapsed time: 0 seconds
head(pbmc@meta.data)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 SeuratProject 2419 779 3.0177759
## AAACATTGAGCTAC-1 SeuratProject 4903 1352 3.7935958
## AAACATTGATCAGC-1 SeuratProject 3147 1129 0.8897363
## AAACCGTGCTTCCG-1 SeuratProject 2639 960 1.7430845
## AAACCGTGTATGCG-1 SeuratProject 980 521 1.2244898
## AAACGCACTGGTAC-1 SeuratProject 2163 781 1.6643551
## RNA_snn_res.0.5 seurat_clusters
## AAACATACAACCAC-1 2 2
## AAACATTGAGCTAC-1 3 3
## AAACATTGATCAGC-1 2 2
## AAACCGTGCTTCCG-1 1 1
## AAACCGTGTATGCG-1 6 6
## AAACGCACTGGTAC-1 2 2
pbmc = RunUMAP(pbmc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 22:11:37 UMAP embedding parameters a = 0.9922 b = 1.112
## 22:11:37 Read 2638 rows and found 10 numeric columns
## 22:11:37 Using Annoy for neighbor search, n_neighbors = 30
## 22:11:37 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 22:11:37 Writing NN index file to temp file /tmp/Rtmp3BO577/file6ce02b6c79b2
## 22:11:37 Searching Annoy index using 1 thread, search_k = 3000
## 22:11:38 Annoy recall = 100%
## 22:11:38 Commencing smooth kNN distance calibration using 1 thread
## 22:11:38 Initializing from normalized Laplacian + noise
## 22:11:38 Commencing optimization for 500 epochs, with 105382 positive edges
## 22:11:47 Optimization finished
DimPlot(pbmc, reduction = "umap")
DimPlot(pbmc, reduction = "umap", label = T)
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, they suggest using the same PCs as input to the clustering analysis.
pbmc.markers = FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
head(pbmc.markers)
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## RPS12 1.806317e-144 0.7350248 1.000 0.991 2.477183e-140 0 RPS12
## RPS6 7.135900e-142 0.6798622 1.000 0.995 9.786173e-138 0 RPS6
## RPS27 5.257820e-140 0.7207819 0.999 0.992 7.210575e-136 0 RPS27
## RPL32 4.229582e-136 0.6115515 0.999 0.995 5.800448e-132 0 RPL32
## RPS14 1.799019e-130 0.6199183 1.000 0.994 2.467175e-126 0 RPS14
## RPS25 5.507298e-123 0.7442491 0.997 0.975 7.552709e-119 0 RPS25
a = pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
a
## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat.geom
## # A tibble: 18 x 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.74e-109 1.07 0.897 0.593 2.39e-105 0 LDHB
## 2 1.17e- 83 1.33 0.435 0.108 1.60e- 79 0 CCR7
## 3 0 5.57 0.996 0.215 0 1 S100A9
## 4 0 5.48 0.975 0.121 0 1 S100A8
## 5 7.99e- 87 1.28 0.981 0.644 1.10e- 82 2 LTB
## 6 2.61e- 59 1.24 0.424 0.111 3.58e- 55 2 AQP3
## 7 0 4.31 0.936 0.041 0 3 CD79A
## 8 9.48e-271 3.59 0.622 0.022 1.30e-266 3 TCL1A
## 9 1.17e-178 2.97 0.957 0.241 1.60e-174 4 CCL5
## 10 4.93e-169 3.01 0.595 0.056 6.76e-165 4 GZMK
## 11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A
## 12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1
## 13 1.05e-265 4.89 0.986 0.071 1.44e-261 6 GZMB
## 14 6.82e-175 4.92 0.958 0.135 9.36e-171 6 GNLY
## 15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A
## 16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1
## 17 7.73e-200 7.24 1 0.01 1.06e-195 8 PF4
## 18 3.68e-110 8.58 1 0.024 5.05e-106 8 PPBP
genes = a %>% pull(gene)
genes
## [1] "LDHB" "CCR7" "S100A9" "S100A8" "LTB" "AQP3"
## [7] "CD79A" "TCL1A" "CCL5" "GZMK" "FCGR3A" "LST1"
## [13] "GZMB" "GNLY" "FCER1A" "HLA-DPB1" "PF4" "PPBP"
FeaturePlot(pbmc, features = genes[1:2])
FeaturePlot(pbmc, features = genes[1:2], cols = c("white", "red"))