Introduction:

Standard PCA works well for normally distributed data. But single-cell RNA-seq counts are not Gaussian. They are:

  • Sparse (many zeros)
  • Over-dispersed (variance >> mean)
  • Compositional (total counts vary per cell)

Though Log-normalization partially can partially help, it introduces its own biases. scry takes a statistically principled approach by modeling raw counts directly using Generalized Linear Models (GLMs).

scry provides three key tools:

Function Purpose
devianceFeatureSelection() Select informative genes using binomial deviance
nullResiduals() Compute GLM residuals → feed into standard PCA
GLMPCA() Full Generalized PCA on raw counts

1. Install & Load Packages

# Run once to install
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install(c("scry", "scRNAseq", "scater", "scran",
                        "SingleCellExperiment", "BiocSingular"))
install.packages(c("ggplot2", "patchwork", "dplyr"))
library(scry)
library(scRNAseq)
library(scater)
library(SingleCellExperiment)
library(ggplot2)
library(patchwork)
library(dplyr)

set.seed(42)

2. Load Dataset

We are using the Zeisel 2015 mouse brain dataset. This scRNA-seq dataset consists with ~3,000 cells and known cell-type labels. It ships with the scRNAseq package.

sce <- ZeiselBrainData()
sce
## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
# Cell-type labels are stored in the "level1class" column
table(sce$level1class)
## 
## astrocytes_ependymal    endothelial-mural         interneurons 
##                  224                  235                  290 
##            microglia     oligodendrocytes        pyramidal CA1 
##                   98                  820                  939 
##         pyramidal SS 
##                  399

The dataset has 7 broad cell types: astrocytes, endothelial cells, interneurons, microglia, mural cells, oligodendrocytes, and pyramidal neurons.


3. Basic Quality Control & Filtering

Before analysis, we are removing low-quality cells and lowly expressed genes.

sce <- addPerCellQCMetrics(sce)

# keep cells with at least 500 detected genes
sce <- sce[, sce$detected >= 500]

# keep genes expressed in at least 10 cells
gene_counts <- rowSums(counts(sce) > 0)
sce <- sce[gene_counts >= 10, ]

cat("After filtering:\n")
## After filtering:
cat(" Cells:", ncol(sce), "\n")
##  Cells: 3005
cat(" Genes:", nrow(sce), "\n")
##  Genes: 16484

4. Feature Selection: devianceFeatureSelection()

Instead of selecting highly variable genes by variance (which is noise-sensitive), scry uses binomial deviance.

The idea is to fit a null model where each gene’s expression is proportional only to library size. Genes that deviate most from this null ( genes that carry real biological signal) get the highest deviance scores.

Higher deviance = more informative gene

# Computing binomial deviance for all genes
sce <- devianceFeatureSelection(sce, assay = "counts", fam = "binomial")

# Deviance scores are stored in rowData
head(rowData(sce))
## DataFrame with 6 rows and 2 columns
##          featureType binomial_deviance
##          <character>         <numeric>
## Tspan12   endogenous           3355.53
## Tshz1     endogenous           4808.60
## Fnbp1l    endogenous           6450.69
## Adamts15  endogenous           1002.91
## Cldn12    endogenous           3415.53
## Rxfp1     endogenous           1070.39
deviance_scores <- rowData(sce)$binomial_deviance

# Plot deviance distribution
data.frame(deviance = deviance_scores) %>%
  ggplot(aes(x = deviance)) +
  geom_histogram(bins = 80, fill = "#2c7bb6", color = "white", alpha = 0.85) +
  geom_vline(xintercept = sort(deviance_scores, decreasing = TRUE)[2000],
             color = "firebrick", linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = sort(deviance_scores, decreasing = TRUE)[2000] + 200,
           y = 500, label = "Top 2000\ncutoff", color = "firebrick", size = 3.5) +
  labs(title   = "Binomial Deviance Scores",
       subtitle = "Each point represents one gene",
       x = "Deviance", y = "Number of Genes") +
  theme_classic(base_size = 13)
Distribution of binomial deviance scores across genes. The right tail contains the most informative genes.

Distribution of binomial deviance scores across genes. The right tail contains the most informative genes.

# Select top 2000 most deviant (informative) genes
top_genes <- order(deviance_scores, decreasing = TRUE)[1:2000]
sce_sub   <- sce[top_genes, ]

cat("Selected", nrow(sce_sub), "top deviant genes for downstream analysis.\n")
## Selected 2000 top deviant genes for downstream analysis.

5. Null Residuals + PCA: nullResiduals()

nullResiduals() fits a Poisson or Binomial null model per gene and returns the residuals. These residuals represent expression above and beyond what’s expected from library size alone. The residuals are approximately Gaussian. So we can then run standard PCA on them.

This is faster than full GLM-PCA and often works just as well.

# Computing Pearson residuals under a Poisson null model
sce_sub <- nullResiduals(sce_sub,
                          assay  = "counts",
                          fam    = "poisson",
                          type   = "pearson")

# Residuals stored as a new assay: "poisson_pearson_residuals"
assayNames(sce_sub)
## [1] "counts"                    "poisson_pearson_residuals"
# Run PCA on the residuals (using BiocSingular for speed)
sce_sub <- runPCA(sce_sub,
                  assay.type = "poisson_pearson_residuals",
                  ncomponents = 30,
                  name = "PCA_residuals",
                  BSPARAM = BiocSingular::IrlbaParam())

# Compute UMAP on top of residuals PCA
sce_sub <- runUMAP(sce_sub,
                   dimred = "PCA_residuals",
                   name   = "UMAP_residuals",
                   n_neighbors = 20)
plotReducedDim(sce_sub,
               dimred    = "PCA_residuals",
               colour_by = "level1class",
               point_size = 0.8) +
  labs(title    = "PCA on Poisson Null Residuals",
       subtitle = "Top 2000 genes by binomial deviance") +
  theme_classic(base_size = 13)
PCA on null model residuals. Color = cell type.

PCA on null model residuals. Color = cell type.

plotReducedDim(sce_sub,
               dimred    = "UMAP_residuals",
               colour_by = "level1class",
               point_size = 0.8) +
  labs(title    = "UMAP (from Null Residuals PCA)",
       subtitle = "Zeisel Mouse Brain Dataset") +
  theme_classic(base_size = 13)
UMAP on null residuals PCA shows clear cell-type separation.

UMAP on null residuals PCA shows clear cell-type separation.

Here, cell types are well separated, especially the rare microglia and mural populations. This demonstrates that the deviance-based feature selection preserved biologically meaningful genes.


6. GLM-PCA: GLMPCA()

GLMPCA() or Generalized PCA fits a latent factor model directly on raw counts using a Poisson (or NB) likelihood. No normalization required. This is the most statistically rigorous approach.

GLMPCA is slower than the residuals approach. Use it for final figures or smaller datasets.

# Run GLM-PCA with 10 latent dimensions on a subset of cells (for speed in demo)
sce_glm <- GLMPCA(sce_sub,
                  L        = 10,      # num of latent dimensions
                  fam      = "poi",   # Poisson likelihood
                  minibatch = "stochastic")  # fast optimization

# The result stored as a reducedDim: "GLMPCA"
reducedDimNames(sce_glm)
# UMAP on GLM-PCA factors
sce_glm <- runUMAP(sce_glm,
                   dimred = "GLMPCA",
                   name   = "UMAP_GLMPCA",
                   n_neighbors = 20)
plotReducedDim(sce_glm,
               dimred    = "UMAP_GLMPCA",
               colour_by = "level1class",
               point_size = 0.8) +
  labs(title    = "UMAP (from GLM-PCA)",
       subtitle = "Raw counts, Poisson likelihood, 10 latent dimensions") +
  theme_classic(base_size = 13)
UMAP derived from GLM-PCA latent factors on raw counts.

UMAP derived from GLM-PCA latent factors on raw counts.


7. Side-by-Side Comparison

Now, comparing the two scry approaches visually:

p1 <- plotReducedDim(sce_sub,
                     dimred    = "UMAP_residuals",
                     colour_by = "level1class",
                     point_size = 0.7) +
  labs(title = "Null Residuals → PCA → UMAP") +
  theme_classic(base_size = 11) +
  theme(legend.position = "bottom")

p2 <- plotReducedDim(sce_glm,
                     dimred    = "UMAP_GLMPCA",
                     colour_by = "level1class",
                     point_size = 0.7) +
  labs(title = "GLM-PCA → UMAP") +
  theme_classic(base_size = 11) +
  theme(legend.position = "bottom")

p1 + p2
Left: Null Residuals PCA → UMAP (fast). Right: GLM-PCA → UMAP (rigorous). Both reveal the same biological structure.

Left: Null Residuals PCA → UMAP (fast). Right: GLM-PCA → UMAP (rigorous). Both reveal the same biological structure.


8. Conclusion

Step Function Key Argument Output
Feature selection devianceFeatureSelection() fam = "binomial" Deviance score per gene in rowData
Residuals nullResiduals() fam = "poisson", type = "pearson" New assay of residuals
Standard PCA runPCA() assay.type = "poisson_pearson_residuals" reducedDim slot
Full GLM-PCA GLMPCA() L = 10, fam = "poi" reducedDim slot

Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] dplyr_1.2.0                 patchwork_1.3.2            
##  [3] scater_1.36.0               ggplot2_4.0.2              
##  [5] scuttle_1.18.0              scRNAseq_2.22.0            
##  [7] SingleCellExperiment_1.30.1 SummarizedExperiment_1.38.1
##  [9] Biobase_2.68.0              GenomicRanges_1.60.0       
## [11] GenomeInfoDb_1.44.3         IRanges_2.42.0             
## [13] S4Vectors_0.46.0            BiocGenerics_0.54.1        
## [15] generics_0.1.4              MatrixGenerics_1.20.0      
## [17] matrixStats_1.5.0           scry_1.20.0                
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3       rstudioapi_0.18.0        jsonlite_2.0.0          
##   [4] magrittr_2.0.4           ggbeeswarm_0.7.3         GenomicFeatures_1.60.0  
##   [7] gypsum_1.4.0             farver_2.1.2             rmarkdown_2.30          
##  [10] BiocIO_1.18.0            vctrs_0.7.1              memoise_2.0.1           
##  [13] Rsamtools_2.24.1         RCurl_1.98-1.17          htmltools_0.5.9         
##  [16] S4Arrays_1.8.1           AnnotationHub_3.16.1     curl_7.0.0              
##  [19] BiocNeighbors_2.2.0      Rhdf5lib_1.30.0          SparseArray_1.8.1       
##  [22] rhdf5_2.52.1             sass_0.4.10              alabaster.base_1.8.1    
##  [25] bslib_0.10.0             alabaster.sce_1.8.0      httr2_1.2.2             
##  [28] cachem_1.1.0             GenomicAlignments_1.44.0 lifecycle_1.0.5         
##  [31] pkgconfig_2.0.3          rsvd_1.0.5               Matrix_1.7-4            
##  [34] R6_2.6.1                 fastmap_1.2.0            GenomeInfoDbData_1.2.14 
##  [37] digest_0.6.39            AnnotationDbi_1.70.0     RSpectra_0.16-2         
##  [40] irlba_2.3.7              ExperimentHub_2.16.1     RSQLite_2.4.6           
##  [43] beachmat_2.24.0          labeling_0.4.3           filelock_1.0.3          
##  [46] httr_1.4.8               abind_1.4-8              compiler_4.5.1          
##  [49] bit64_4.6.0-1            withr_3.0.2              S7_0.2.1                
##  [52] BiocParallel_1.42.2      viridis_0.6.5            DBI_1.3.0               
##  [55] HDF5Array_1.36.0         alabaster.ranges_1.8.0   alabaster.schemas_1.8.0 
##  [58] MASS_7.3-65              glmpca_0.2.0             rappdirs_0.3.4          
##  [61] DelayedArray_0.34.1      rjson_0.2.23             tools_4.5.1             
##  [64] vipor_0.4.7              otel_0.2.0               beeswarm_0.4.0          
##  [67] glue_1.8.0               h5mread_1.0.1            restfulr_0.0.16         
##  [70] rhdf5filters_1.20.0      grid_4.5.1               gtable_0.3.6            
##  [73] ensembldb_2.32.0         BiocSingular_1.24.0      ScaledMatrix_1.16.0     
##  [76] XVector_0.48.0           ggrepel_0.9.7            BiocVersion_3.21.1      
##  [79] pillar_1.11.1            BiocFileCache_2.16.2     lattice_0.22-9          
##  [82] FNN_1.1.4.1              rtracklayer_1.68.0       bit_4.6.0               
##  [85] tidyselect_1.2.1         Biostrings_2.76.0        knitr_1.51              
##  [88] gridExtra_2.3            ProtGenerics_1.40.0      xfun_0.56               
##  [91] UCSC.utils_1.4.0         lazyeval_0.2.2           yaml_2.3.12             
##  [94] evaluate_1.0.5           codetools_0.2-20         tibble_3.3.1            
##  [97] alabaster.matrix_1.8.0   BiocManager_1.30.27      cli_3.6.5               
## [100] uwot_0.2.4               jquerylib_0.1.4          Rcpp_1.1.1              
## [103] dbplyr_2.5.2             png_0.1-8                XML_3.99-0.22           
## [106] parallel_4.5.1           blob_1.3.0               AnnotationFilter_1.32.0 
## [109] bitops_1.0-9             viridisLite_0.4.3        alabaster.se_1.8.0      
## [112] scales_1.4.0             crayon_1.5.3             rlang_1.1.7             
## [115] cowplot_1.2.0            KEGGREST_1.48.1