scry:
Count-Based Dimensionality Reduction for scRNA-seqStandard PCA works well for normally distributed data. But single-cell RNA-seq counts are not Gaussian. They are:
Though Log-normalization partially can partially help, it introduces
its own biases. scry takes a statistically principled
approach by modeling raw counts directly using Generalized
Linear Models (GLMs).
scry provides three key tools:
| Function | Purpose |
|---|---|
devianceFeatureSelection() |
Select informative genes using binomial deviance |
nullResiduals() |
Compute GLM residuals → feed into standard PCA |
GLMPCA() |
Full Generalized PCA on raw counts |
# Run once to install
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("scry", "scRNAseq", "scater", "scran",
"SingleCellExperiment", "BiocSingular"))
install.packages(c("ggplot2", "patchwork", "dplyr"))library(scry)
library(scRNAseq)
library(scater)
library(SingleCellExperiment)
library(ggplot2)
library(patchwork)
library(dplyr)
set.seed(42)We are using the Zeisel 2015 mouse brain dataset.
This scRNA-seq dataset consists with ~3,000 cells and known cell-type
labels. It ships with the scRNAseq package.
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
##
## astrocytes_ependymal endothelial-mural interneurons
## 224 235 290
## microglia oligodendrocytes pyramidal CA1
## 98 820 939
## pyramidal SS
## 399
The dataset has 7 broad cell types: astrocytes, endothelial cells, interneurons, microglia, mural cells, oligodendrocytes, and pyramidal neurons.
Before analysis, we are removing low-quality cells and lowly expressed genes.
sce <- addPerCellQCMetrics(sce)
# keep cells with at least 500 detected genes
sce <- sce[, sce$detected >= 500]
# keep genes expressed in at least 10 cells
gene_counts <- rowSums(counts(sce) > 0)
sce <- sce[gene_counts >= 10, ]
cat("After filtering:\n")## After filtering:
## Cells: 3005
## Genes: 16484
devianceFeatureSelection()Instead of selecting highly variable genes by variance (which is
noise-sensitive), scry uses binomial
deviance.
The idea is to fit a null model where each gene’s expression is proportional only to library size. Genes that deviate most from this null ( genes that carry real biological signal) get the highest deviance scores.
Higher deviance = more informative gene
# Computing binomial deviance for all genes
sce <- devianceFeatureSelection(sce, assay = "counts", fam = "binomial")
# Deviance scores are stored in rowData
head(rowData(sce))## DataFrame with 6 rows and 2 columns
## featureType binomial_deviance
## <character> <numeric>
## Tspan12 endogenous 3355.53
## Tshz1 endogenous 4808.60
## Fnbp1l endogenous 6450.69
## Adamts15 endogenous 1002.91
## Cldn12 endogenous 3415.53
## Rxfp1 endogenous 1070.39
deviance_scores <- rowData(sce)$binomial_deviance
# Plot deviance distribution
data.frame(deviance = deviance_scores) %>%
ggplot(aes(x = deviance)) +
geom_histogram(bins = 80, fill = "#2c7bb6", color = "white", alpha = 0.85) +
geom_vline(xintercept = sort(deviance_scores, decreasing = TRUE)[2000],
color = "firebrick", linetype = "dashed", linewidth = 0.8) +
annotate("text", x = sort(deviance_scores, decreasing = TRUE)[2000] + 200,
y = 500, label = "Top 2000\ncutoff", color = "firebrick", size = 3.5) +
labs(title = "Binomial Deviance Scores",
subtitle = "Each point represents one gene",
x = "Deviance", y = "Number of Genes") +
theme_classic(base_size = 13)Distribution of binomial deviance scores across genes. The right tail contains the most informative genes.
# Select top 2000 most deviant (informative) genes
top_genes <- order(deviance_scores, decreasing = TRUE)[1:2000]
sce_sub <- sce[top_genes, ]
cat("Selected", nrow(sce_sub), "top deviant genes for downstream analysis.\n")## Selected 2000 top deviant genes for downstream analysis.
nullResiduals()nullResiduals() fits a Poisson or Binomial null
model per gene and returns the residuals. These residuals
represent expression above and beyond what’s expected from library size
alone. The residuals are approximately Gaussian. So we can then run
standard PCA on them.
This is faster than full GLM-PCA and often works just as well.
# Computing Pearson residuals under a Poisson null model
sce_sub <- nullResiduals(sce_sub,
assay = "counts",
fam = "poisson",
type = "pearson")
# Residuals stored as a new assay: "poisson_pearson_residuals"
assayNames(sce_sub)## [1] "counts" "poisson_pearson_residuals"
# Run PCA on the residuals (using BiocSingular for speed)
sce_sub <- runPCA(sce_sub,
assay.type = "poisson_pearson_residuals",
ncomponents = 30,
name = "PCA_residuals",
BSPARAM = BiocSingular::IrlbaParam())
# Compute UMAP on top of residuals PCA
sce_sub <- runUMAP(sce_sub,
dimred = "PCA_residuals",
name = "UMAP_residuals",
n_neighbors = 20)plotReducedDim(sce_sub,
dimred = "PCA_residuals",
colour_by = "level1class",
point_size = 0.8) +
labs(title = "PCA on Poisson Null Residuals",
subtitle = "Top 2000 genes by binomial deviance") +
theme_classic(base_size = 13)PCA on null model residuals. Color = cell type.
plotReducedDim(sce_sub,
dimred = "UMAP_residuals",
colour_by = "level1class",
point_size = 0.8) +
labs(title = "UMAP (from Null Residuals PCA)",
subtitle = "Zeisel Mouse Brain Dataset") +
theme_classic(base_size = 13)UMAP on null residuals PCA shows clear cell-type separation.
Here, cell types are well separated, especially the rare microglia and mural populations. This demonstrates that the deviance-based feature selection preserved biologically meaningful genes.
GLMPCA()GLMPCA() or Generalized PCA fits a
latent factor model directly on raw counts using a Poisson (or NB)
likelihood. No normalization required. This is the most statistically
rigorous approach.
GLMPCA is slower than the residuals approach. Use it for final figures or smaller datasets.
# Run GLM-PCA with 10 latent dimensions on a subset of cells (for speed in demo)
sce_glm <- GLMPCA(sce_sub,
L = 10, # num of latent dimensions
fam = "poi", # Poisson likelihood
minibatch = "stochastic") # fast optimization
# The result stored as a reducedDim: "GLMPCA"
reducedDimNames(sce_glm)# UMAP on GLM-PCA factors
sce_glm <- runUMAP(sce_glm,
dimred = "GLMPCA",
name = "UMAP_GLMPCA",
n_neighbors = 20)plotReducedDim(sce_glm,
dimred = "UMAP_GLMPCA",
colour_by = "level1class",
point_size = 0.8) +
labs(title = "UMAP (from GLM-PCA)",
subtitle = "Raw counts, Poisson likelihood, 10 latent dimensions") +
theme_classic(base_size = 13)UMAP derived from GLM-PCA latent factors on raw counts.
Now, comparing the two scry approaches visually:
p1 <- plotReducedDim(sce_sub,
dimred = "UMAP_residuals",
colour_by = "level1class",
point_size = 0.7) +
labs(title = "Null Residuals → PCA → UMAP") +
theme_classic(base_size = 11) +
theme(legend.position = "bottom")
p2 <- plotReducedDim(sce_glm,
dimred = "UMAP_GLMPCA",
colour_by = "level1class",
point_size = 0.7) +
labs(title = "GLM-PCA → UMAP") +
theme_classic(base_size = 11) +
theme(legend.position = "bottom")
p1 + p2Left: Null Residuals PCA → UMAP (fast). Right: GLM-PCA → UMAP (rigorous). Both reveal the same biological structure.
| Step | Function | Key Argument | Output |
|---|---|---|---|
| Feature selection | devianceFeatureSelection() |
fam = "binomial" |
Deviance score per gene in rowData |
| Residuals | nullResiduals() |
fam = "poisson", type = "pearson" |
New assay of residuals |
| Standard PCA | runPCA() |
assay.type = "poisson_pearson_residuals" |
reducedDim slot |
| Full GLM-PCA | GLMPCA() |
L = 10, fam = "poi" |
reducedDim slot |
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_1.2.0 patchwork_1.3.2
## [3] scater_1.36.0 ggplot2_4.0.2
## [5] scuttle_1.18.0 scRNAseq_2.22.0
## [7] SingleCellExperiment_1.30.1 SummarizedExperiment_1.38.1
## [9] Biobase_2.68.0 GenomicRanges_1.60.0
## [11] GenomeInfoDb_1.44.3 IRanges_2.42.0
## [13] S4Vectors_0.46.0 BiocGenerics_0.54.1
## [15] generics_0.1.4 MatrixGenerics_1.20.0
## [17] matrixStats_1.5.0 scry_1.20.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.18.0 jsonlite_2.0.0
## [4] magrittr_2.0.4 ggbeeswarm_0.7.3 GenomicFeatures_1.60.0
## [7] gypsum_1.4.0 farver_2.1.2 rmarkdown_2.30
## [10] BiocIO_1.18.0 vctrs_0.7.1 memoise_2.0.1
## [13] Rsamtools_2.24.1 RCurl_1.98-1.17 htmltools_0.5.9
## [16] S4Arrays_1.8.1 AnnotationHub_3.16.1 curl_7.0.0
## [19] BiocNeighbors_2.2.0 Rhdf5lib_1.30.0 SparseArray_1.8.1
## [22] rhdf5_2.52.1 sass_0.4.10 alabaster.base_1.8.1
## [25] bslib_0.10.0 alabaster.sce_1.8.0 httr2_1.2.2
## [28] cachem_1.1.0 GenomicAlignments_1.44.0 lifecycle_1.0.5
## [31] pkgconfig_2.0.3 rsvd_1.0.5 Matrix_1.7-4
## [34] R6_2.6.1 fastmap_1.2.0 GenomeInfoDbData_1.2.14
## [37] digest_0.6.39 AnnotationDbi_1.70.0 RSpectra_0.16-2
## [40] irlba_2.3.7 ExperimentHub_2.16.1 RSQLite_2.4.6
## [43] beachmat_2.24.0 labeling_0.4.3 filelock_1.0.3
## [46] httr_1.4.8 abind_1.4-8 compiler_4.5.1
## [49] bit64_4.6.0-1 withr_3.0.2 S7_0.2.1
## [52] BiocParallel_1.42.2 viridis_0.6.5 DBI_1.3.0
## [55] HDF5Array_1.36.0 alabaster.ranges_1.8.0 alabaster.schemas_1.8.0
## [58] MASS_7.3-65 glmpca_0.2.0 rappdirs_0.3.4
## [61] DelayedArray_0.34.1 rjson_0.2.23 tools_4.5.1
## [64] vipor_0.4.7 otel_0.2.0 beeswarm_0.4.0
## [67] glue_1.8.0 h5mread_1.0.1 restfulr_0.0.16
## [70] rhdf5filters_1.20.0 grid_4.5.1 gtable_0.3.6
## [73] ensembldb_2.32.0 BiocSingular_1.24.0 ScaledMatrix_1.16.0
## [76] XVector_0.48.0 ggrepel_0.9.7 BiocVersion_3.21.1
## [79] pillar_1.11.1 BiocFileCache_2.16.2 lattice_0.22-9
## [82] FNN_1.1.4.1 rtracklayer_1.68.0 bit_4.6.0
## [85] tidyselect_1.2.1 Biostrings_2.76.0 knitr_1.51
## [88] gridExtra_2.3 ProtGenerics_1.40.0 xfun_0.56
## [91] UCSC.utils_1.4.0 lazyeval_0.2.2 yaml_2.3.12
## [94] evaluate_1.0.5 codetools_0.2-20 tibble_3.3.1
## [97] alabaster.matrix_1.8.0 BiocManager_1.30.27 cli_3.6.5
## [100] uwot_0.2.4 jquerylib_0.1.4 Rcpp_1.1.1
## [103] dbplyr_2.5.2 png_0.1-8 XML_3.99-0.22
## [106] parallel_4.5.1 blob_1.3.0 AnnotationFilter_1.32.0
## [109] bitops_1.0-9 viridisLite_0.4.3 alabaster.se_1.8.0
## [112] scales_1.4.0 crayon_1.5.3 rlang_1.1.7
## [115] cowplot_1.2.0 KEGGREST_1.48.1