Set-up

The workflow and explanations bellow are from OSCA

library(SingleCellExperiment)
library(scran) # For feature selcetion
library(scater) # for pca
library(ggplot2) # modify plots
library(here) #reproducible paths

project <- "fire-mice"

sce <- readRDS(here("processed", project, "sce_norm_01.RDS"))

Feature selection

Motivation

We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between a pair of cells. The choice of genes to use in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system (Highly variable genes, HVGs) while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.

Quantify per-gene variation

We quantify per-gene variation computing the variance of the log-normalized expression values (referred to as “log-counts” for simplicity) for each gene across all cells in the population (A. T. L. Lun, McCarthy, and Marioni 2016). We use modelGeneVar() that does also corrects for the abundance of each gene.

#The density weights are removed because the genes
# with highest mean abundance are also HVG, this avoids overfiting
gene_var_df <- modelGeneVar(sce,density.weights=FALSE )
gene_var <- metadata(gene_var_df)
plot(gene_var$mean, gene_var$var, xlab= "Mean of log-expression", ylab= "Variance of log-expression")
curve(gene_var$trend(x), lwd=2, add=T, col = "red")

Select the HVGs

The next step is to select the subset of HVGs to use in downstream analyses. The simplest HVG selection strategy is to take the top X genes with the largest values for the relevant variance metric. Here I select the top 15 % of genes.

hvgs <- getTopHVGs(gene_var_df, prop=0.15)
# save them in the object
rowSubset(sce) <- hvgs

This leaves us with 2025 highly variable genes.

plot(gene_var$mean, gene_var$var, xlab= "Mean of log-expression", ylab= "Variance of log-expression")
points(gene_var$mean[hvgs], gene_var$var[hvgs], col = "red")
curve(gene_var$trend(x), lwd=2, add=T, col = "red")

Dimensionality reduction

Motivation

Principal components analysis (PCA) discovers axes in high-dimensional space that capture the largest amount of variation. When applying PCA to scRNA-seq data, our assumption is that biological processes affect multiple genes in a coordinated manner. This means that the earlier PCs are likely to represent biological structure as more variation can be captured by considering the correlated behavior of many genes. By comparison, random technical or biological noise is expected to affect each gene independently. There is unlikely to be an axis that can capture random variation across many genes, meaning that noise should mostly be concentrated in the later PCs. This motivates the use of the earlier PCs in our downstream analyses, which concentrates the biological signal to simultaneously reduce computational work and remove noise.

Run PCA and choose PCs

set.seed(1000)
sce <- runPCA(sce)

pct_var <- attr(reducedDim(sce, "PCA"), "percentVar")
plot(pct_var, log="y", xlab="PC", ylab="pct variance explained")

I tried to use some of the PCAtools features to estimate the number of PCs, such as the elbow and Horn’s parallel analysis, but both yield a result lower than 15 PCs. These methods tend to underestimate the number of PCs to use, and it is recommended using between 20 and 40 PCs. I will keep 26 PCs, as there is a drop between 26 and 27 PCs in the elbow plot.

#keep the previous dimensional reduction just in case
reducedDim(sce, "PCA_all") <- reducedDim(sce, "PCA")
# And replace the default PCA with the reduced PCs
reducedDim(sce, "PCA") <- reducedDim(sce, "PCA")[,1:26]

Visualisation

For visualisation, reduce to 2 dimensions. Non linear reductions.

UMAP

set.seed(1000)
sce <- runUMAP(sce, dimred="PCA")
plotReducedDim(sce, dimred="UMAP", colour_by="genotype", point_size=0.1, point_alpha = 0.3)

tSNE

set.seed(1000)
sce <- runTSNE(sce, dimred="PCA")
plotReducedDim(sce, dimred="TSNE", colour_by="genotype")

plotReducedDim(sce, dimred="TSNE", colour_by="chip")

Reducing the number of genes to the most variables already improves the dimensional reduction

Save objects

saveRDS(sce, here("processed", project, "sce_dimred_01.RDS"))

Session Info

Click to expand

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] here_1.0.1                  scater_1.23.5              
##  [3] ggplot2_3.3.5               scran_1.22.1               
##  [5] scuttle_1.4.0               SingleCellExperiment_1.16.0
##  [7] SummarizedExperiment_1.24.0 Biobase_2.54.0             
##  [9] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
## [11] IRanges_2.28.0              S4Vectors_0.32.3           
## [13] BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
## [15] matrixStats_0.61.0         
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              RcppAnnoy_0.0.19         
##  [3] rprojroot_2.0.2           tools_4.1.1              
##  [5] bslib_0.3.1               utf8_1.2.2               
##  [7] R6_2.5.1                  irlba_2.3.5              
##  [9] vipor_0.4.5               uwot_0.1.11              
## [11] DBI_1.1.2                 colorspace_2.0-2         
## [13] withr_2.4.3               gridExtra_2.3            
## [15] tidyselect_1.1.1          compiler_4.1.1           
## [17] cli_3.2.0                 BiocNeighbors_1.12.0     
## [19] DelayedArray_0.20.0       labeling_0.4.2           
## [21] sass_0.4.0                scales_1.1.1             
## [23] stringr_1.4.0             digest_0.6.29            
## [25] rmarkdown_2.11            XVector_0.34.0           
## [27] pkgconfig_2.0.3           htmltools_0.5.2          
## [29] sparseMatrixStats_1.6.0   highr_0.9                
## [31] fastmap_1.1.0             limma_3.50.0             
## [33] rlang_1.0.1               rstudioapi_0.13          
## [35] DelayedMatrixStats_1.16.0 farver_2.1.0             
## [37] jquerylib_0.1.4           generics_0.1.2           
## [39] jsonlite_1.7.3            BiocParallel_1.28.3      
## [41] dplyr_1.0.8               RCurl_1.98-1.6           
## [43] magrittr_2.0.2            BiocSingular_1.10.0      
## [45] GenomeInfoDbData_1.2.7    Matrix_1.4-0             
## [47] Rcpp_1.0.8                ggbeeswarm_0.6.0         
## [49] munsell_0.5.0             fansi_1.0.2              
## [51] viridis_0.6.2             lifecycle_1.0.1          
## [53] stringi_1.7.6             yaml_2.2.2               
## [55] edgeR_3.36.0              zlibbioc_1.40.0          
## [57] Rtsne_0.15                grid_4.1.1               
## [59] ggrepel_0.9.1             parallel_4.1.1           
## [61] dqrng_0.3.0               crayon_1.5.0             
## [63] lattice_0.20-45           cowplot_1.1.1            
## [65] beachmat_2.10.0           locfit_1.5-9.4           
## [67] metapod_1.2.0             knitr_1.37               
## [69] pillar_1.7.0              igraph_1.2.11            
## [71] codetools_0.2-18          ScaledMatrix_1.2.0       
## [73] glue_1.6.1                evaluate_0.14            
## [75] vctrs_0.3.8               gtable_0.3.0             
## [77] purrr_0.3.4               assertthat_0.2.1         
## [79] xfun_0.29                 rsvd_1.0.5               
## [81] RSpectra_0.16-0           viridisLite_0.4.0        
## [83] tibble_3.1.6              beeswarm_0.4.0           
## [85] cluster_2.1.2             bluster_1.4.0            
## [87] statmod_1.4.36            ellipsis_0.3.2

Feature Selection and Dimensional reduction

NadineBestard

10/07/2022

Set-up

Feature selection

Motivation

Quantify per-gene variation

Select the HVGs

Dimensionality reduction

Motivation

Run PCA and choose PCs

Visualisation

UMAP

tSNE

Save objects

Session Info