The ccAFv2 classifier requires a thoroughly quality-controlled Seurat object, with an example QC pipeline available here. SCTransform is preferred, standard normalization only applies to the highly variable genes. This can exclude genes needed for the accurate classification of the cell cycle. To address this, the PredictCellCycle function re-runs SCTransform to retain all genes in the dataset
The U5 human neural stem cell (hNSC) dataset used to train the ccAFv2 is available for testing purposes here. This data has been QC’d and normalized using SCTransform following our best practices described above.
devtools::install_github("plaisier-lab/ccafv2_R/ccAFv2")
## Skipping install of 'ccAFv2' from a github remote, the SHA1 (2c3746cd) has not changed since last install.
## Use `force = TRUE` to force installation
library(ccAFv2)
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## 'SeuratObject' was built under R 4.4.0 but the current version is
## 4.4.2; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed
## 'SeuratObject' was built with package 'Matrix' 1.7.0 but the current
## version is 1.7.1; it is recomended that you reinstall 'SeuratObject' as
## the ABI for 'Matrix' may have changed
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
seurat_obj = readRDS('U5_normalized_ensembl.rds')
seurat_obj = PredictCellCycle(seurat_obj)
## Running ccAFv2:
## Redoing SCTransform to ensure maximum overlap with classifier genes...
## Total possible marker genes for this classifier: 861
## Marker genes present in this dataset: 861
## Missing marker genes in this dataset: 0
## Predicting cell cycle state probabilities...
## 93/93 - 0s - 467ms/epoch - 5ms/step
## Choosing cell cycle state...
## Adding probabilities and predictions to metadata
## Done
Examining marker genes is crucial for accuracy. Predictions significantly worsen if fewer than 689 marker genes (80%) are present. Setting do_sctransform = TRUE maximizes marker gene overlap. Timing and 93/93 values may vary by dataset, which is normal.
PredictCellCycle(seurat_obj,
threshold=0.5,
do_sctransform=TRUE,
assay='SCT',
species='human',
gene_id='ensembl',
spatial = FALSE)
## Running ccAFv2:
## Redoing SCTransform to ensure maximum overlap with classifier genes...
## Total possible marker genes for this classifier: 861
## Marker genes present in this dataset: 861
## Missing marker genes in this dataset: 0
## Predicting cell cycle state probabilities...
## 93/93 - 0s - 408ms/epoch - 4ms/step
## Choosing cell cycle state...
## Adding probabilities and predictions to metadata
## Done
## An object of class Seurat
## 28600 features across 2962 samples within 2 assays
## Active assay: SCT (13866 features, 3000 variable features)
## 3 layers present: counts, data, scale.data
## 1 other assay present: RNA
## 2 dimensional reductions calculated: pca, umap
Several options can be passed to the PredictCellCycle function:
seurat_obj: a seurat object must be supplied to classify, no default
threshold: the value used to threshold the likelihoods, default is 0.5
do_sctransform: whether to do SCTransform before classifying, default is TRUE
assay: which seurat_obj assay to use for classification, helpful if data is pre-normalized, default is ‘SCT’
species: from which species did the samples originate, either ‘human’ or ‘mouse’, defaults to ‘human’
gene_id: what type of gene ID is used, either ‘ensembl’ or ‘symbol’, defaults to ‘ensembl’
spatial: whether the data is spatial, defaults to FALSE
The results of the cell cycle classification are stored in the Seurat object metadata. The likelihoods for each cell cycle state can be found with the labels of each cell cycle state (‘qG0’, ‘G1’, ‘Late.G1’, ‘S’, ‘S.G2’, ‘G2.M’, and ‘M.Early.G1’) and the classification for each cell can be found in ‘ccAFv2’. Here are the first 10 rows of the U5-hNSC predictions:
head(seurat_obj@meta.data)
## orig.ident nCount_RNA nFeature_RNA ccAF percent.mito
## AAACATACTAACCG-1 SeuratProject 3147 1466 G1 0.05274865
## AAACATTGAGTTCG-1 SeuratProject 7621 2654 Neural G0 0.02611206
## AAACATTGCACTGA-1 SeuratProject 7297 2616 S/G2 0.02686035
## AAACATTGCTCAGA-1 SeuratProject 3426 1568 M/Early G1 0.02597782
## AAACATTGGTTTCT-1 SeuratProject 2384 1269 G1 0.01971477
## AAACCGTGTAACGC-1 SeuratProject 11846 3463 G2/M 0.02439642
## nCount_SCT nFeature_SCT G1 G2.M Late.G1
## AAACATACTAACCG-1 5436 1541 4.541956e-02 5.182016e-05 2.565520e-04
## AAACATTGAGTTCG-1 6562 2650 1.050646e-06 2.003324e-08 1.028142e-07
## AAACATTGCACTGA-1 6533 2616 4.067950e-06 1.503383e-04 6.808179e-07
## AAACATTGCTCAGA-1 5416 1608 1.011723e-02 2.443393e-01 1.525741e-02
## AAACATTGGTTTCT-1 5437 1480 6.657162e-01 1.492154e-04 8.235564e-02
## AAACCGTGTAACGC-1 6860 3082 4.368711e-15 9.999999e-01 1.369408e-19
## M.Early.G1 Neural.G0 S S.G2 ccAFv2
## AAACATACTAACCG-1 4.129805e-05 9.541291e-01 5.882413e-05 4.282115e-05 G0/G1
## AAACATTGAGTTCG-1 7.923065e-10 9.998178e-01 1.796801e-04 1.260789e-06 G0/G1
## AAACATTGCACTGA-1 4.342752e-09 1.337398e-06 6.851402e-07 9.998429e-01 S/G2
## AAACATTGCTCAGA-1 6.143703e-01 6.462537e-02 4.660092e-03 4.663026e-02 M/Early G1
## AAACATTGGTTTCT-1 2.045607e-05 1.033008e-03 1.750577e-01 7.566766e-02 G0/G1
## AAACCGTGTAACGC-1 1.583236e-18 2.520220e-16 4.863750e-23 1.438320e-15 G2/M
#Plot a UMAP with cell cycle states
DimPlot.ccAFv2(seurat_obj)
For certain data sets, adjusting the likelihood threshold may be necessary. The ThresholdPlot plots the relative proportions of cell cycle states across a range of likelihood thresholds from 0 to 0.9 with intervals of 0.1.
ThresholdPlot(seurat_obj)
The cell cycle strongly influences gene expression, so it’s common to regress out cell cycle effects and use residual variance for further analysis. We support this using ccAFv2 marker genes, starting by calculating expression module scores for Late G1, S, S/G2, G2/M, and M/Early G1 phases.
Collect expression module scores for the cell cycle states Late G1, S, S/G2, G2/M, and M/Early G1.
seurat_obj = PrepareForCellCycleRegression(seurat_obj)
Regress these signatures out of the expression data:
seurat_obj = SCTransform(seurat_obj, vars.to.regress = c("Late.G1_exprs1", "S_exprs2", "S.G2_exprs3", "G2.M_exprs4", "M.Early.G1_exprs5"))
## Running SCTransform on assay: RNA
## vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
## Calculating cell attributes from input UMI matrix: log_umi
## Variance stabilizing transformation of count matrix of size 13866 by 2962
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 2962 cells
## Found 133 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 13866 genes
## Computing corrected count matrix for 13866 genes
## Calculating gene attributes
## Wall clock passed: Time difference of 19.90896 secs
## Determine variable features
## Regressing out Late.G1_exprs1, S_exprs2, S.G2_exprs3, G2.M_exprs4, M.Early.G1_exprs5
## Centering data matrix
## Place corrected count matrix in counts slot
## Set default assay to SCT
seurat_obj = RunPCA(seurat_obj)
## PC_ 1
## Positive: ENSG00000140988, ENSG00000205542, ENSG00000229117, ENSG00000198712, ENSG00000231500, ENSG00000137818, ENSG00000137309, ENSG00000108518, ENSG00000197958, ENSG00000198886
## ENSG00000105372, ENSG00000109475, ENSG00000034510, ENSG00000105640, ENSG00000122026, ENSG00000198899, ENSG00000167526, ENSG00000142541, ENSG00000161016, ENSG00000137154
## ENSG00000146674, ENSG00000089157, ENSG00000112306, ENSG00000198938, ENSG00000149591, ENSG00000143947, ENSG00000198804, ENSG00000197756, ENSG00000198763, ENSG00000198918
## Negative: ENSG00000167552, ENSG00000164434, ENSG00000105894, ENSG00000133169, ENSG00000120885, ENSG00000137509, ENSG00000046653, ENSG00000160307, ENSG00000136156, ENSG00000080824
## ENSG00000137285, ENSG00000113140, ENSG00000026025, ENSG00000164106, ENSG00000206503, ENSG00000068697, ENSG00000124766, ENSG00000118523, ENSG00000123560, ENSG00000106278
## ENSG00000182481, ENSG00000150991, ENSG00000137267, ENSG00000135404, ENSG00000116209, ENSG00000117450, ENSG00000167614, ENSG00000074696, ENSG00000105223, ENSG00000111716
## PC_ 2
## Positive: ENSG00000105894, ENSG00000133169, ENSG00000167996, ENSG00000160307, ENSG00000198712, ENSG00000118523, ENSG00000198830, ENSG00000164434, ENSG00000164106, ENSG00000046653
## ENSG00000163453, ENSG00000106278, ENSG00000137509, ENSG00000204287, ENSG00000205542, ENSG00000123560, ENSG00000157150, ENSG00000087086, ENSG00000198804, ENSG00000164032
## ENSG00000229117, ENSG00000139910, ENSG00000109475, ENSG00000124766, ENSG00000109099, ENSG00000106366, ENSG00000198886, ENSG00000173114, ENSG00000019582, ENSG00000006327
## Negative: ENSG00000118785, ENSG00000075624, ENSG00000204542, ENSG00000149591, ENSG00000167552, ENSG00000101335, ENSG00000100097, ENSG00000026025, ENSG00000092841, ENSG00000101955
## ENSG00000111640, ENSG00000067225, ENSG00000184009, ENSG00000131981, ENSG00000134333, ENSG00000182718, ENSG00000080824, ENSG00000115641, ENSG00000074800, ENSG00000206503
## ENSG00000108518, ENSG00000234745, ENSG00000120885, ENSG00000175183, ENSG00000096384, ENSG00000182481, ENSG00000213719, ENSG00000163191, ENSG00000117592, ENSG00000135404
## PC_ 3
## Positive: ENSG00000118785, ENSG00000102172, ENSG00000172020, ENSG00000174721, ENSG00000198168, ENSG00000120437, ENSG00000164434, ENSG00000120885, ENSG00000136943, ENSG00000110092
## ENSG00000120068, ENSG00000153551, ENSG00000106236, ENSG00000104549, ENSG00000144485, ENSG00000197746, ENSG00000108551, ENSG00000137726, ENSG00000139352, ENSG00000064300
## ENSG00000067064, ENSG00000137309, ENSG00000133639, ENSG00000251562, ENSG00000185088, ENSG00000165474, ENSG00000132475, ENSG00000186480, ENSG00000160752, ENSG00000171951
## Negative: ENSG00000163453, ENSG00000149591, ENSG00000122786, ENSG00000101335, ENSG00000107796, ENSG00000108691, ENSG00000075624, ENSG00000204287, ENSG00000182718, ENSG00000106366
## ENSG00000109099, ENSG00000106211, ENSG00000140416, ENSG00000115461, ENSG00000164111, ENSG00000181458, ENSG00000106483, ENSG00000113140, ENSG00000117318, ENSG00000167996
## ENSG00000196126, ENSG00000134333, ENSG00000154277, ENSG00000146674, ENSG00000137509, ENSG00000149212, ENSG00000163191, ENSG00000128595, ENSG00000130635, ENSG00000139278
## PC_ 4
## Positive: ENSG00000115461, ENSG00000204542, ENSG00000117318, ENSG00000167552, ENSG00000122861, ENSG00000118785, ENSG00000108691, ENSG00000130720, ENSG00000100097, ENSG00000166710
## ENSG00000148677, ENSG00000143333, ENSG00000164434, ENSG00000132386, ENSG00000118523, ENSG00000134333, ENSG00000184232, ENSG00000198712, ENSG00000146674, ENSG00000160307
## ENSG00000111640, ENSG00000101955, ENSG00000168461, ENSG00000184203, ENSG00000257704, ENSG00000137880, ENSG00000164056, ENSG00000198938, ENSG00000198804, ENSG00000100234
## Negative: ENSG00000172020, ENSG00000149591, ENSG00000105894, ENSG00000251562, ENSG00000165474, ENSG00000110092, ENSG00000197746, ENSG00000107796, ENSG00000108551, ENSG00000135404
## ENSG00000205542, ENSG00000143933, ENSG00000120885, ENSG00000080824, ENSG00000139910, ENSG00000155368, ENSG00000156475, ENSG00000046653, ENSG00000182985, ENSG00000174721
## ENSG00000096384, ENSG00000213626, ENSG00000131016, ENSG00000101335, ENSG00000239672, ENSG00000106211, ENSG00000135046, ENSG00000272841, ENSG00000198668, ENSG00000163584
## PC_ 5
## Positive: ENSG00000109475, ENSG00000117318, ENSG00000140988, ENSG00000143947, ENSG00000167552, ENSG00000122026, ENSG00000080824, ENSG00000181163, ENSG00000145425, ENSG00000137154
## ENSG00000164434, ENSG00000138326, ENSG00000118523, ENSG00000161970, ENSG00000231500, ENSG00000198918, ENSG00000089009, ENSG00000147604, ENSG00000186468, ENSG00000144381
## ENSG00000142541, ENSG00000096384, ENSG00000161960, ENSG00000134419, ENSG00000105640, ENSG00000112306, ENSG00000160307, ENSG00000144713, ENSG00000026025, ENSG00000177954
## Negative: ENSG00000204542, ENSG00000251562, ENSG00000115461, ENSG00000163453, ENSG00000102804, ENSG00000198727, ENSG00000165474, ENSG00000146674, ENSG00000198804, ENSG00000197061
## ENSG00000197746, ENSG00000198168, ENSG00000101335, ENSG00000272841, ENSG00000137726, ENSG00000041982, ENSG00000198886, ENSG00000205927, ENSG00000124762, ENSG00000133639
## ENSG00000172020, ENSG00000105894, ENSG00000166710, ENSG00000136943, ENSG00000106211, ENSG00000102265, ENSG00000113140, ENSG00000156265, ENSG00000139289, ENSG00000142669
seurat_obj = RunUMAP(seurat_obj, dims=1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 11:12:54 UMAP embedding parameters a = 0.9922 b = 1.112
## 11:12:54 Read 2962 rows and found 10 numeric columns
## 11:12:54 Using Annoy for neighbor search, n_neighbors = 30
## 11:12:54 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 11:12:54 Writing NN index file to temp file /var/folders/hv/xklkw9n15xgbktqz58g7gtbw0000gn/T//RtmpUteCI3/file6b234dfa4a5c
## 11:12:54 Searching Annoy index using 1 thread, search_k = 3000
## 11:12:55 Annoy recall = 100%
## 11:12:56 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 11:12:57 Initializing from normalized Laplacian + noise (using RSpectra)
## 11:12:57 Commencing optimization for 500 epochs, with 108358 positive edges
## 11:13:03 Optimization finished
DimPlot.ccAFv2(seurat_obj)
Removing the cell cycle from the U5 hNSCs leads to a random distribution because the cell cycle is the primary biological signal in this in vitro-grown cell line.