Contents

1 Introduction

In the investigation of molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along a continuous cell trajectory, which can be estimated by pseudotime inference (also callsed trajectory inference) from single-cell RNA-sequencing (scRNA-seq) data. However, the uncertainty in pseudotime inference is ignored in existing methods. PseudotimeDE is designed to generate well-calibrated \(p\)-values. PseudotimeDE is flexible in allowing users to specify the pseudotime inference method and to choose the appropriate model for scRNA-seq data.

suppressPackageStartupMessages(library(PseudotimeDE))
suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(slingshot))
suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(irlba))

In this quickstart guide, we demonstrate the basic functionality of the PseudotimeDE package. PseudotimeDE package allows users to specify the pseudotime inference method, but we will use slingshot as the example in our analysis.

2 LPS dataset

data(LPS_sce, package = "PseudotimeDE")

This Smart-seq dataset contains primary mouse dendritic cells (DCs) stimulated with lipopolysaccharide (LPS). The original data is available at Gene Expression Omnibus (GEO) under accession ID GSE45719. Here the dataset has been stored as a SingleCellExperiment object. For more information on how to construct it, please check SingleCellExperiment.

3 Perform pseudotime inference on the original dataset

We use slingshot to infer the pseudotime on the original LPS_sce. Since we have the prior knowledge that this dataset should be a single lineage, we let slingshot to generate only one lineage.

rd <- irlba::prcomp_irlba(t(logcounts(LPS_sce)), scale. = FALSE)$x[, 1:2]

reducedDims(LPS_sce) <- SimpleList(PCA = rd)
colData(LPS_sce)$cl <- 1

fit_ori <- slingshot(LPS_sce, reducedDim = 'PCA', clusterLabels = "cl")
LPS_ori_tbl <- tibble(cell = colnames(LPS_sce), pseudotime = rescale(colData(fit_ori)$slingPseudotime_1))

Check the input format.

head(LPS_ori_tbl)
## # A tibble: 6 × 2
##   cell       pseudotime
##   <chr>           <dbl>
## 1 LPS_1h_S10      0.386
## 2 LPS_1h_S11      0.311
## 3 LPS_1h_S12      0.395
## 4 LPS_1h_S13      0.346
## 5 LPS_1h_S14      0.158
## 6 LPS_1h_S15      0.370

4 Perform pseudotime inference on subsamples

The key step of PseudotimeDE is to use subsampling to get the uncertainty of inferred pseudotime. The default is using 80% cells. To save time, here we only generate 2 subsamples. The defulat is using 1000 subsamples, although 100 subsamples show similar performance in our empirical study. Of course, more subsamples will lead to more accurate results.

set.seed(123)
## Set the cores for parallelization. Note that mclapply doesnot work on Windows.
options(mc.cores = 2)

## Number of subsmaples
n = 2

## Ganerate random subsamples
LPS_index <- mclapply(seq_len(n), function(x) {
  sample(x = c(1:dim(LPS_sce)[2]), size = 0.8*dim(LPS_sce)[2], replace = FALSE)
})

Users should use the exact same procedure as they used in pseudotime inference on the original dataset.

LPS_sub_tbl <- mclapply(LPS_index, function(x, sce) {
  sce <- sce[, x]
  rd <- irlba::prcomp_irlba(t(logcounts(sce)), scale. = FALSE)$x[, 1:2]
  reducedDims(sce) <- SimpleList(PCA = rd)

  fit <- slingshot(sce, reducedDim = 'PCA', clusterLabels = "cl")
  tbl <- tibble(cell = colnames(sce), pseudotime = rescale(colData(fit)$slingPseudotime_1))
  
  ## Make sure the direction of pseudotime is the same as the original pseudotime
  merge.tbl <- left_join(tbl, LPS_ori_tbl, by = "cell")
  
  if(cor(merge.tbl$pseudotime.x, merge.tbl$pseudotime.y) < 0) {
    tbl <- dplyr::mutate(tbl, pseudotime = 1-pseudotime)
  }
  tbl
}, sce = LPS_sce)

We load the example LPS_ori_tbl and LPS_sub_tbl for the next DE test. The LPS_sub_tbl contains the 1000 subsample-pseudotimes.

data(LPS_ori_tbl, package = "PseudotimeDE")
data(LPS_sub_tbl, package = "PseudotimeDE")

Users can check the spreadness of the inferred pseudotime on subsamples.

PseudotimeDE::plotUncertainty(LPS_ori_tbl, LPS_sub_tbl)

5 Perform DE test

To save time, we only run the DE test on two example genes (CCL5, CXCL10) and 100 subsamples. Note that the DE test can be time-consuming since it is a permutation test; we strongly encourage users to allocate at least 10 cores. We specify the distribution as Negative Binomial (nb) here.

system.time(res <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"),
                                     ori.tbl = LPS_ori_tbl,
                                     sub.tbl = LPS_sub_tbl[1:100], ## To save time, use 100 subsamples
                                     mat = LPS_sce, ## You can also use a matrix or SeuratObj as the input
                                     model = "nb",
                                     mc.cores = 2))
##    user  system elapsed 
##   5.970   5.500   5.635

We now check the output.

print(res)
## # A tibble: 2 × 12
##   gene   fix.pv  emp.pv   para.pv ad.pv  rank gam.fit zinf    aic expv.quantile
##   <chr>   <dbl>   <dbl>     <dbl> <dbl> <dbl> <list>  <lgl> <dbl> <list>       
## 1 CCL5        0 0.00990 1.02e-140 0.569  4.13 <gam>   FALSE 7665. <dbl [5]>    
## 2 CXCL10      0 0.00990 1.49e- 19 0.416  4.23 <gam>   FALSE 6388. <dbl [5]>    
## # … with 2 more variables: expv.mean <dbl>, expv.zero <dbl>

para.pv is the most important output - the \(p\)-values of DE test. rank measures how wiggling the gene trajectory is. gam.fit is the fitted Generalized Additive Model (GAM).

Although we do not encourage this, users may choose completely ignore the uncertainty of inferred pseudotime, and use the \(p\)-values based on an asymptotic distribution (fix.pv). Note that fix.pv usually do not behaive correctly under the null (i.g., following a \(\operatorname{Uniform}(0, 1)\) distribution). The good thing is that only caluclating fix.pv makes the DE test very fast.

system.time(res_fix <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"),
                                     ori.tbl = LPS_ori_tbl,
                                     sub.tbl = NULL, # Set as NULL to only get fix.pv
                                     mat = LPS_sce,
                                     model = "nb"))
##    user  system elapsed 
##   1.957   5.198   0.909
print(res_fix)
## # A tibble: 2 × 12
##   gene   fix.pv emp.pv para.pv ad.pv  rank gam.fit zinf    aic expv.quantile
##   <chr>   <dbl> <lgl>  <lgl>   <lgl> <dbl> <list>  <lgl> <dbl> <list>       
## 1 CCL5        0 NA     NA      NA     4.13 <gam>   FALSE 7665. <dbl [5]>    
## 2 CXCL10      0 NA     NA      NA     4.23 <gam>   FALSE 6388. <dbl [5]>    
## # … with 2 more variables: expv.mean <dbl>, expv.zero <dbl>

We can visualize the gene trajectories estimated by gam.fit.

PseudotimeDE::plotCurve(gene.vec = res$gene,
                                        ori.tbl = LPS_ori_tbl,
                                        mat = LPS_sce,
                                        model.fit = res$gam.fit)

The “dropout” problem (extra zeros) can be captured by using Zero-Inflated Negative-Binomial model (ZINB). In general, we believe model = 'nb' should be used in most cases since using ZINB may cause lower power (see this paper Naught all zeros in sequence count data are the same). Another disadvantage is that ZINB is more time-consuming than NB. However, if needed, users may choose the auto decision (model = 'auto') or force to use ZINB (model = 'zinb').

system.time(res_zinb <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"),
                                     ori.tbl = LPS_ori_tbl,
                                     sub.tbl = LPS_sub_tbl[1:100],
                                     mat = LPS_sce,
                                     model = "zinb"))
##    user  system elapsed 
##  14.573  12.353   8.615
print(res_zinb)
## # A tibble: 2 × 12
##   gene   fix.pv  emp.pv  para.pv ad.pv  rank gam.fit zinf    aic expv.quantile
##   <chr>   <dbl>   <dbl>    <dbl> <dbl> <dbl> <list>  <lgl> <dbl> <list>       
## 1 CCL5        0 0.00990 1.05e-76 0.272  4.51 <gam>   TRUE  7356. <dbl [5]>    
## 2 CXCL10      0 0.00990 2.27e-20 0.587  4.26 <gam>   TRUE  6358. <dbl [5]>    
## # … with 2 more variables: expv.mean <dbl>, expv.zero <dbl>

We can check the fitted curve by zinb.

PseudotimeDE::plotCurve(gene.vec = res_zinb$gene,
                                        ori.tbl = LPS_ori_tbl,
                                        mat = LPS_sce,
                                        model.fit = res_zinb$gam.fit)

We also added Gaussian (Normal) distribution as an distribution option into PseudotimeDE. If your input is log-transformed counts (so they are close to Gaussian), you can set model = 'gaussian'. Gaussian is usually faster than NB.

system.time(res_gaussian <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"),
                                     ori.tbl = LPS_ori_tbl,
                                     sub.tbl = LPS_sub_tbl[1:100],
                                     mat = LPS_sce,
                                     model = "gaussian", 
                                     assay.use = "logcounts"))
##    user  system elapsed 
##   2.394   5.042   1.373
print(res_gaussian)
## # A tibble: 2 × 12
##   gene   fix.pv  emp.pv  para.pv ad.pv  rank gam.fit zinf    aic expv.quantile
##   <chr>   <dbl>   <dbl>    <dbl> <dbl> <dbl> <list>  <lgl> <dbl> <list>       
## 1 CCL5        0 0.00990 9.96e-49 0.886  4.55 <gam>   FALSE 2200. <dbl [5]>    
## 2 CXCL10      0 0.00990 1.44e-14 0.559  4.95 <gam>   FALSE 2270. <dbl [5]>    
## # … with 2 more variables: expv.mean <dbl>, expv.zero <dbl>

We can check the fitted curve by gaussian.

PseudotimeDE::plotCurve(gene.vec = res_gaussian$gene,
                                        ori.tbl = LPS_ori_tbl,
                                        mat = LPS_sce,
                                        model.fit = res_gaussian$gam.fit, assay.use = "logcounts")

6 Session Info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] irlba_2.3.3                 Matrix_1.3-4               
##  [3] scales_1.1.1                dplyr_1.0.7                
##  [5] tibble_3.1.6                slingshot_2.0.0            
##  [7] TrajectoryUtils_1.0.0       princurve_2.1.6            
##  [9] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
## [11] Biobase_2.52.0              GenomicRanges_1.44.0       
## [13] GenomeInfoDb_1.28.4         IRanges_2.26.0             
## [15] S4Vectors_0.30.2            BiocGenerics_0.38.0        
## [17] MatrixGenerics_1.4.3        matrixStats_0.61.0         
## [19] PseudotimeDE_0.9.0          BiocStyle_2.20.2           
## 
## loaded via a namespace (and not attached):
##  [1] mixtools_1.2.0         tidyr_1.1.4            sass_0.4.0            
##  [4] jsonlite_1.7.2         splines_4.1.1          bslib_0.3.1           
##  [7] assertthat_0.2.1       highr_0.9              BiocManager_1.30.16   
## [10] GenomeInfoDbData_1.2.6 yaml_2.2.1             pillar_1.6.4          
## [13] lattice_0.20-45        glue_1.5.1             digest_0.6.29         
## [16] XVector_0.32.0         colorspace_2.0-2       htmltools_0.5.2       
## [19] pkgconfig_2.0.3        magick_2.7.3           bookdown_0.24         
## [22] zlibbioc_1.38.0        purrr_0.3.4            BiocParallel_1.26.2   
## [25] mgcv_1.8-38            farver_2.1.0           generics_0.1.1        
## [28] ggplot2_3.3.5          ellipsis_0.3.2         cli_3.1.0             
## [31] survival_3.2-13        magrittr_2.0.1         crayon_1.4.2          
## [34] evaluate_0.14          fansi_0.5.0            nlme_3.1-152          
## [37] MASS_7.3-54            segmented_1.3-4        tools_4.1.1           
## [40] fitdistrplus_1.1-6     lifecycle_1.0.1        stringr_1.4.0         
## [43] kernlab_0.9-29         munsell_0.5.0          DelayedArray_0.18.0   
## [46] compiler_4.1.1         jquerylib_0.1.4        rlang_0.4.12          
## [49] grid_4.1.1             RCurl_1.98-1.5         rstudioapi_0.13       
## [52] igraph_1.2.9           labeling_0.4.2         bitops_1.0-7          
## [55] rmarkdown_2.11         gtable_0.3.0           codetools_0.2-18      
## [58] DBI_1.1.1              R6_2.5.1               knitr_1.36            
## [61] fastmap_1.1.0          utf8_1.2.2             stringi_1.7.6         
## [64] Rcpp_1.0.7             vctrs_0.3.8            tidyselect_1.1.1      
## [67] xfun_0.28