Input for classification

The ccAFv2 classifier requires a thoroughly quality-controlled Seurat object, with an example QC pipeline available here. SCTransform is preferred, standard normalization only applies to the highly variable genes. This can exclude genes needed for the accurate classification of the cell cycle. To address this, the PredictCellCycle function re-runs SCTransform to retain all genes in the dataset

Test Data

The U5 human neural stem cell (hNSC) dataset used to train the ccAFv2 is available for testing purposes here. This data has been QC’d and normalized using SCTransform following our best practices described above.

devtools::install_github("plaisier-lab/ccafv2_R/ccAFv2")
## Skipping install of 'ccAFv2' from a github remote, the SHA1 (2c3746cd) has not changed since last install.
##   Use `force = TRUE` to force installation
library(ccAFv2)
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## 'SeuratObject' was built under R 4.4.0 but the current version is
## 4.4.2; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed
## 'SeuratObject' was built with package 'Matrix' 1.7.0 but the current
## version is 1.7.1; it is recomended that you reinstall 'SeuratObject' as
## the ABI for 'Matrix' may have changed
## 
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
## 
##     intersect, t
seurat_obj = readRDS('U5_normalized_ensembl.rds')
seurat_obj = PredictCellCycle(seurat_obj)
## Running ccAFv2:
##   Redoing SCTransform to ensure maximum overlap with classifier genes...
##   Total possible marker genes for this classifier: 861
##     Marker genes present in this dataset: 861
##     Missing marker genes in this dataset: 0
##   Predicting cell cycle state probabilities...
## 93/93 - 0s - 467ms/epoch - 5ms/step
##   Choosing cell cycle state...
##   Adding probabilities and predictions to metadata
## Done

Examining marker genes is crucial for accuracy. Predictions significantly worsen if fewer than 689 marker genes (80%) are present. Setting do_sctransform = TRUE maximizes marker gene overlap. Timing and 93/93 values may vary by dataset, which is normal.

PredictCellCycle(seurat_obj,
                 threshold=0.5,
                 do_sctransform=TRUE,
                 assay='SCT',
                 species='human',
                 gene_id='ensembl',
                 spatial = FALSE)
## Running ccAFv2:
##   Redoing SCTransform to ensure maximum overlap with classifier genes...
##   Total possible marker genes for this classifier: 861
##     Marker genes present in this dataset: 861
##     Missing marker genes in this dataset: 0
##   Predicting cell cycle state probabilities...
## 93/93 - 0s - 408ms/epoch - 4ms/step
##   Choosing cell cycle state...
##   Adding probabilities and predictions to metadata
## Done
## An object of class Seurat 
## 28600 features across 2962 samples within 2 assays 
## Active assay: SCT (13866 features, 3000 variable features)
##  3 layers present: counts, data, scale.data
##  1 other assay present: RNA
##  2 dimensional reductions calculated: pca, umap

Several options can be passed to the PredictCellCycle function:

Cell cycle classification results

The results of the cell cycle classification are stored in the Seurat object metadata. The likelihoods for each cell cycle state can be found with the labels of each cell cycle state (‘qG0’, ‘G1’, ‘Late.G1’, ‘S’, ‘S.G2’, ‘G2.M’, and ‘M.Early.G1’) and the classification for each cell can be found in ‘ccAFv2’. Here are the first 10 rows of the U5-hNSC predictions:

head(seurat_obj@meta.data)
##                     orig.ident nCount_RNA nFeature_RNA       ccAF percent.mito
## AAACATACTAACCG-1 SeuratProject       3147         1466         G1   0.05274865
## AAACATTGAGTTCG-1 SeuratProject       7621         2654  Neural G0   0.02611206
## AAACATTGCACTGA-1 SeuratProject       7297         2616       S/G2   0.02686035
## AAACATTGCTCAGA-1 SeuratProject       3426         1568 M/Early G1   0.02597782
## AAACATTGGTTTCT-1 SeuratProject       2384         1269         G1   0.01971477
## AAACCGTGTAACGC-1 SeuratProject      11846         3463       G2/M   0.02439642
##                  nCount_SCT nFeature_SCT           G1         G2.M      Late.G1
## AAACATACTAACCG-1       5436         1541 4.541956e-02 5.182016e-05 2.565520e-04
## AAACATTGAGTTCG-1       6562         2650 1.050646e-06 2.003324e-08 1.028142e-07
## AAACATTGCACTGA-1       6533         2616 4.067950e-06 1.503383e-04 6.808179e-07
## AAACATTGCTCAGA-1       5416         1608 1.011723e-02 2.443393e-01 1.525741e-02
## AAACATTGGTTTCT-1       5437         1480 6.657162e-01 1.492154e-04 8.235564e-02
## AAACCGTGTAACGC-1       6860         3082 4.368711e-15 9.999999e-01 1.369408e-19
##                    M.Early.G1    Neural.G0            S         S.G2     ccAFv2
## AAACATACTAACCG-1 4.129805e-05 9.541291e-01 5.882413e-05 4.282115e-05      G0/G1
## AAACATTGAGTTCG-1 7.923065e-10 9.998178e-01 1.796801e-04 1.260789e-06      G0/G1
## AAACATTGCACTGA-1 4.342752e-09 1.337398e-06 6.851402e-07 9.998429e-01       S/G2
## AAACATTGCTCAGA-1 6.143703e-01 6.462537e-02 4.660092e-03 4.663026e-02 M/Early G1
## AAACATTGGTTTCT-1 2.045607e-05 1.033008e-03 1.750577e-01 7.566766e-02      G0/G1
## AAACCGTGTAACGC-1 1.583236e-18 2.520220e-16 4.863750e-23 1.438320e-15       G2/M

Plotting cell cycle states

#Plot a UMAP with cell cycle states
DimPlot.ccAFv2(seurat_obj)

Plotting the impact of varying likelihood thresholds

For certain data sets, adjusting the likelihood threshold may be necessary. The ThresholdPlot plots the relative proportions of cell cycle states across a range of likelihood thresholds from 0 to 0.9 with intervals of 0.1.

ThresholdPlot(seurat_obj)

Cell cycle regression

The cell cycle strongly influences gene expression, so it’s common to regress out cell cycle effects and use residual variance for further analysis. We support this using ccAFv2 marker genes, starting by calculating expression module scores for Late G1, S, S/G2, G2/M, and M/Early G1 phases.

Collect expression module scores for the cell cycle states Late G1, S, S/G2, G2/M, and M/Early G1.

seurat_obj = PrepareForCellCycleRegression(seurat_obj)

Regress these signatures out of the expression data:

seurat_obj = SCTransform(seurat_obj, vars.to.regress = c("Late.G1_exprs1", "S_exprs2", "S.G2_exprs3", "G2.M_exprs4", "M.Early.G1_exprs5"))
## Running SCTransform on assay: RNA
## vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
## Calculating cell attributes from input UMI matrix: log_umi
## Variance stabilizing transformation of count matrix of size 13866 by 2962
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 2962 cells
## Found 133 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 13866 genes
## Computing corrected count matrix for 13866 genes
## Calculating gene attributes
## Wall clock passed: Time difference of 19.90896 secs
## Determine variable features
## Regressing out Late.G1_exprs1, S_exprs2, S.G2_exprs3, G2.M_exprs4, M.Early.G1_exprs5
## Centering data matrix
## Place corrected count matrix in counts slot
## Set default assay to SCT
seurat_obj = RunPCA(seurat_obj)
## PC_ 1 
## Positive:  ENSG00000140988, ENSG00000205542, ENSG00000229117, ENSG00000198712, ENSG00000231500, ENSG00000137818, ENSG00000137309, ENSG00000108518, ENSG00000197958, ENSG00000198886 
##     ENSG00000105372, ENSG00000109475, ENSG00000034510, ENSG00000105640, ENSG00000122026, ENSG00000198899, ENSG00000167526, ENSG00000142541, ENSG00000161016, ENSG00000137154 
##     ENSG00000146674, ENSG00000089157, ENSG00000112306, ENSG00000198938, ENSG00000149591, ENSG00000143947, ENSG00000198804, ENSG00000197756, ENSG00000198763, ENSG00000198918 
## Negative:  ENSG00000167552, ENSG00000164434, ENSG00000105894, ENSG00000133169, ENSG00000120885, ENSG00000137509, ENSG00000046653, ENSG00000160307, ENSG00000136156, ENSG00000080824 
##     ENSG00000137285, ENSG00000113140, ENSG00000026025, ENSG00000164106, ENSG00000206503, ENSG00000068697, ENSG00000124766, ENSG00000118523, ENSG00000123560, ENSG00000106278 
##     ENSG00000182481, ENSG00000150991, ENSG00000137267, ENSG00000135404, ENSG00000116209, ENSG00000117450, ENSG00000167614, ENSG00000074696, ENSG00000105223, ENSG00000111716 
## PC_ 2 
## Positive:  ENSG00000105894, ENSG00000133169, ENSG00000167996, ENSG00000160307, ENSG00000198712, ENSG00000118523, ENSG00000198830, ENSG00000164434, ENSG00000164106, ENSG00000046653 
##     ENSG00000163453, ENSG00000106278, ENSG00000137509, ENSG00000204287, ENSG00000205542, ENSG00000123560, ENSG00000157150, ENSG00000087086, ENSG00000198804, ENSG00000164032 
##     ENSG00000229117, ENSG00000139910, ENSG00000109475, ENSG00000124766, ENSG00000109099, ENSG00000106366, ENSG00000198886, ENSG00000173114, ENSG00000019582, ENSG00000006327 
## Negative:  ENSG00000118785, ENSG00000075624, ENSG00000204542, ENSG00000149591, ENSG00000167552, ENSG00000101335, ENSG00000100097, ENSG00000026025, ENSG00000092841, ENSG00000101955 
##     ENSG00000111640, ENSG00000067225, ENSG00000184009, ENSG00000131981, ENSG00000134333, ENSG00000182718, ENSG00000080824, ENSG00000115641, ENSG00000074800, ENSG00000206503 
##     ENSG00000108518, ENSG00000234745, ENSG00000120885, ENSG00000175183, ENSG00000096384, ENSG00000182481, ENSG00000213719, ENSG00000163191, ENSG00000117592, ENSG00000135404 
## PC_ 3 
## Positive:  ENSG00000118785, ENSG00000102172, ENSG00000172020, ENSG00000174721, ENSG00000198168, ENSG00000120437, ENSG00000164434, ENSG00000120885, ENSG00000136943, ENSG00000110092 
##     ENSG00000120068, ENSG00000153551, ENSG00000106236, ENSG00000104549, ENSG00000144485, ENSG00000197746, ENSG00000108551, ENSG00000137726, ENSG00000139352, ENSG00000064300 
##     ENSG00000067064, ENSG00000137309, ENSG00000133639, ENSG00000251562, ENSG00000185088, ENSG00000165474, ENSG00000132475, ENSG00000186480, ENSG00000160752, ENSG00000171951 
## Negative:  ENSG00000163453, ENSG00000149591, ENSG00000122786, ENSG00000101335, ENSG00000107796, ENSG00000108691, ENSG00000075624, ENSG00000204287, ENSG00000182718, ENSG00000106366 
##     ENSG00000109099, ENSG00000106211, ENSG00000140416, ENSG00000115461, ENSG00000164111, ENSG00000181458, ENSG00000106483, ENSG00000113140, ENSG00000117318, ENSG00000167996 
##     ENSG00000196126, ENSG00000134333, ENSG00000154277, ENSG00000146674, ENSG00000137509, ENSG00000149212, ENSG00000163191, ENSG00000128595, ENSG00000130635, ENSG00000139278 
## PC_ 4 
## Positive:  ENSG00000115461, ENSG00000204542, ENSG00000117318, ENSG00000167552, ENSG00000122861, ENSG00000118785, ENSG00000108691, ENSG00000130720, ENSG00000100097, ENSG00000166710 
##     ENSG00000148677, ENSG00000143333, ENSG00000164434, ENSG00000132386, ENSG00000118523, ENSG00000134333, ENSG00000184232, ENSG00000198712, ENSG00000146674, ENSG00000160307 
##     ENSG00000111640, ENSG00000101955, ENSG00000168461, ENSG00000184203, ENSG00000257704, ENSG00000137880, ENSG00000164056, ENSG00000198938, ENSG00000198804, ENSG00000100234 
## Negative:  ENSG00000172020, ENSG00000149591, ENSG00000105894, ENSG00000251562, ENSG00000165474, ENSG00000110092, ENSG00000197746, ENSG00000107796, ENSG00000108551, ENSG00000135404 
##     ENSG00000205542, ENSG00000143933, ENSG00000120885, ENSG00000080824, ENSG00000139910, ENSG00000155368, ENSG00000156475, ENSG00000046653, ENSG00000182985, ENSG00000174721 
##     ENSG00000096384, ENSG00000213626, ENSG00000131016, ENSG00000101335, ENSG00000239672, ENSG00000106211, ENSG00000135046, ENSG00000272841, ENSG00000198668, ENSG00000163584 
## PC_ 5 
## Positive:  ENSG00000109475, ENSG00000117318, ENSG00000140988, ENSG00000143947, ENSG00000167552, ENSG00000122026, ENSG00000080824, ENSG00000181163, ENSG00000145425, ENSG00000137154 
##     ENSG00000164434, ENSG00000138326, ENSG00000118523, ENSG00000161970, ENSG00000231500, ENSG00000198918, ENSG00000089009, ENSG00000147604, ENSG00000186468, ENSG00000144381 
##     ENSG00000142541, ENSG00000096384, ENSG00000161960, ENSG00000134419, ENSG00000105640, ENSG00000112306, ENSG00000160307, ENSG00000144713, ENSG00000026025, ENSG00000177954 
## Negative:  ENSG00000204542, ENSG00000251562, ENSG00000115461, ENSG00000163453, ENSG00000102804, ENSG00000198727, ENSG00000165474, ENSG00000146674, ENSG00000198804, ENSG00000197061 
##     ENSG00000197746, ENSG00000198168, ENSG00000101335, ENSG00000272841, ENSG00000137726, ENSG00000041982, ENSG00000198886, ENSG00000205927, ENSG00000124762, ENSG00000133639 
##     ENSG00000172020, ENSG00000105894, ENSG00000166710, ENSG00000136943, ENSG00000106211, ENSG00000102265, ENSG00000113140, ENSG00000156265, ENSG00000139289, ENSG00000142669
seurat_obj = RunUMAP(seurat_obj, dims=1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 11:12:54 UMAP embedding parameters a = 0.9922 b = 1.112
## 11:12:54 Read 2962 rows and found 10 numeric columns
## 11:12:54 Using Annoy for neighbor search, n_neighbors = 30
## 11:12:54 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 11:12:54 Writing NN index file to temp file /var/folders/hv/xklkw9n15xgbktqz58g7gtbw0000gn/T//RtmpUteCI3/file6b234dfa4a5c
## 11:12:54 Searching Annoy index using 1 thread, search_k = 3000
## 11:12:55 Annoy recall = 100%
## 11:12:56 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 11:12:57 Initializing from normalized Laplacian + noise (using RSpectra)
## 11:12:57 Commencing optimization for 500 epochs, with 108358 positive edges
## 11:13:03 Optimization finished
DimPlot.ccAFv2(seurat_obj)

Removing the cell cycle from the U5 hNSCs leads to a random distribution because the cell cycle is the primary biological signal in this in vitro-grown cell line.