The workflow and explanations bellow are from OSCA
library(SingleCellExperiment)
library(scran) # For feature selcetion
library(scater) # for pca
library(ggplot2) # modify plots
library(here) #reproducible paths
age <- "old"
sce <- readRDS(here("processed", age, "sce_norm_01.RDS"))
We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between a pair of cells. The choice of genes to use in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system (Highly variable genes, HVGs) while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.
We quantify per-gene variation computing the variance of the log-normalized expression values (referred to as “log-counts” for simplicity) for each gene across all cells in the population (A. T. L. Lun, McCarthy, and Marioni 2016). We use modelGeneVar() that does also corrects for the abundance of each gene.
#The density weights are removed because the genes
# with highest mean abundance are also HVG, this avoids overfiting
gene_var_df <- modelGeneVar(sce, density.weights=FALSE )
gene_var <- metadata(gene_var_df)
plot(gene_var$mean, gene_var$var, xlab= "Mean of log-expression", ylab= "Variance of log-expression")
curve(gene_var$trend(x), lwd=2, add=T, col = "red")
The next step is to select the subset of HVGs to use in downstream analyses. The simplest HVG selection strategy is to take the top X genes with the largest values for the relevant variance metric. Here I select the top 15 % of genes.
hvgs <- getTopHVGs(gene_var_df, prop=0.15)
# save them in the object
rowSubset(sce) <- hvgs
This leaves us with 2061 highly variable genes.
plot(gene_var$mean, gene_var$var, xlab= "Mean of log-expression", ylab= "Variance of log-expression")
points(gene_var$mean[hvgs], gene_var$var[hvgs], col = "red")
curve(gene_var$trend(x), lwd=2, add=T, col = "red")
Principal components analysis (PCA) discovers axes in high-dimensional space that capture the largest amount of variation. When applying PCA to scRNA-seq data, our assumption is that biological processes affect multiple genes in a coordinated manner. This means that the earlier PCs are likely to represent biological structure as more variation can be captured by considering the correlated behavior of many genes. By comparison, random technical or biological noise is expected to affect each gene independently. There is unlikely to be an axis that can capture random variation across many genes, meaning that noise should mostly be concentrated in the later PCs. This motivates the use of the earlier PCs in our downstream analyses, which concentrates the biological signal to simultaneously reduce computational work and remove noise.
set.seed(1000)
sce <- runPCA(sce)
pct_var <- attr(reducedDim(sce, "PCA"), "percentVar")
plot(pct_var, log="y", xlab="PC", ylab="pct variance explained")
I tried to use some of the PCAtools features to estimate the number of PCs, such as the elbow and Horn’s parallel analysis, but both yield a result lower than 15 PCs. These methods tend to underestimate the number of PCs to use, and it is recommended using between 20 and 40 PCs. I will keep 25 PCs, as there is a drop between 25 and 26 PCs in the elbow plot.
#keep the previous dimensional reduction just in case
reducedDim(sce, "PCA_all") <- reducedDim(sce, "PCA")
# And replace the default PCA with the reduced PCs
reducedDim(sce, "PCA") <- reducedDim(sce, "PCA")[,1:25]
For visualisation, reduce to 2 dimensions. Non linear reductions.
sce <- runUMAP(sce, dimred="PCA")
plotReducedDim(sce, dimred="UMAP", colour_by="genotype", point_size=0.1, point_alpha = 0.3)
sce <- runTSNE(sce, dimred="PCA")
plotReducedDim(sce, dimred="TSNE", colour_by="genotype", point_size=0.1, point_alpha = 0.3)
saveRDS(sce, here("processed", age, "sce_dimred_01.RDS"))
Click to expand
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
## [4] LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scran_1.18.5 MASS_7.3-53.1 org.Mm.eg.db_3.12.0
## [4] AnnotationDbi_1.52.0 scater_1.18.6 ggplot2_3.3.3
## [7] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 Biobase_2.50.0
## [10] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2 IRanges_2.24.1
## [13] S4Vectors_0.28.1 BiocGenerics_0.36.0 MatrixGenerics_1.2.1
## [16] matrixStats_0.58.0 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_4.0.5 RcppAnnoy_0.0.18
## [4] rprojroot_2.0.2 tools_4.0.4 bslib_0.2.4
## [7] utf8_1.1.4 R6_2.5.0 irlba_2.3.3
## [10] vipor_0.4.5 uwot_0.1.10 DBI_1.1.1
## [13] colorspace_2.0-0 withr_2.4.1 tidyselect_1.1.0
## [16] gridExtra_2.3 bit_4.0.4 compiler_4.0.4
## [19] BiocNeighbors_1.8.2 DelayedArray_0.16.2 labeling_0.4.2
## [22] sass_0.3.1 scales_1.1.1 stringr_1.4.0
## [25] digest_0.6.27 rmarkdown_2.7 XVector_0.30.0
## [28] pkgconfig_2.0.3 htmltools_0.5.1.1 sparseMatrixStats_1.2.1
## [31] highr_0.8 limma_3.46.0 fastmap_1.1.0
## [34] rlang_0.4.10 RSQLite_2.2.3 DelayedMatrixStats_1.12.3
## [37] farver_2.1.0 jquerylib_0.1.3 generics_0.1.0
## [40] jsonlite_1.7.2 BiocParallel_1.24.1 dplyr_1.0.5
## [43] RCurl_1.98-1.2 magrittr_2.0.1 BiocSingular_1.6.0
## [46] GenomeInfoDbData_1.2.4 scuttle_1.0.4 Matrix_1.3-3
## [49] Rcpp_1.0.6 ggbeeswarm_0.6.0 munsell_0.5.0
## [52] fansi_0.4.2 viridis_0.5.1 lifecycle_1.0.0
## [55] stringi_1.5.3 yaml_2.2.1 edgeR_3.32.1
## [58] zlibbioc_1.36.0 Rtsne_0.15 grid_4.0.4
## [61] blob_1.2.1 dqrng_0.2.1 crayon_1.4.1
## [64] lattice_0.20-41 cowplot_1.1.1 beachmat_2.6.4
## [67] locfit_1.5-9.4 knitr_1.31 pillar_1.5.1
## [70] igraph_1.2.6 codetools_0.2-18 glue_1.4.2
## [73] evaluate_0.14 vctrs_0.3.6 gtable_0.3.0
## [76] purrr_0.3.4 assertthat_0.2.1 cachem_1.0.4
## [79] xfun_0.21 rsvd_1.0.3 RSpectra_0.16-0
## [82] rsconnect_0.8.18 viridisLite_0.3.0 tibble_3.1.0
## [85] tinytex_0.30 beeswarm_0.3.1 memoise_2.0.0
## [88] statmod_1.4.35 bluster_1.0.0 ellipsis_0.3.1