Cluster quality control

Set-up

Introduction

As in [publication], we perform a cluster-level QC to remove clusters of poorer quality. Quality is assessed using the number of UMI counts, the mitochondrial percentage, doublet analysis, ribosomal gene content and the number of mice that contribute to each cluster. We also keep our experimental groups in mind to ensure that biological effects are not lost. To do so, we use a small cluster resolution (5).

Number of molecules per cluster

Lower UMI counts and fewer detected genes can be associated with lower-quality cells. Cells can also express fewer genes because of their biological state or cell type.

Select clusters in which at least 50% of the cells have fewer than 3000 UMI counts.

The clusters flagged are 5, 6, 8, 12, 16, 23, 27, 37, 38, 42, 44, 48.
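The flagging rule above can be sketched in base R. This is only an illustration on invented data: the data frame `cells` and its columns `cluster` and `umi` are hypothetical stand-ins for the per-cell metadata of the real object, and the "at least 50%" reading of the threshold is an assumption.

```r
# Toy per-cell metadata; column names and values are hypothetical.
cells <- data.frame(
  cluster = factor(c("a", "a", "a", "a", "b", "b", "b", "b")),
  umi     = c(1200, 2500, 2900, 8000, 3500, 2000, 9000, 12000)
)

umi_threshold <- 3000  # UMI cut-off stated in the text
cell_fraction <- 0.5   # assumed reading: at least half of the cells below it

# Fraction of cells per cluster with fewer than 3000 UMI counts
frac_low_umi <- tapply(cells$umi < umi_threshold, cells$cluster, mean)

# Clusters flagged as low UMI
low_umi_clusters <- names(frac_low_umi)[frac_low_umi >= cell_fraction]
low_umi_clusters
```

Working with the fraction of cells per cluster, rather than the cluster mean, keeps the rule robust to a few cells with very high counts.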

Mitochondrial genes

A high percentage of mitochondrial genes is associated with stressed, lower-quality cells.

Select clusters in which at least 50% of the cells have more than 10% mitochondrial genes.

The clusters flagged are 6, 8, 15, 27, 32, 34, 36, 48.
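As a minimal sketch of how a per-cell mitochondrial percentage and the corresponding cluster flag could be computed, the example below uses a small invented count matrix. It assumes that mouse mitochondrial genes can be recognised by the `mt-` prefix; the matrix, gene names and cluster labels are hypothetical.

```r
# Toy count matrix (genes x cells); gene names and values are invented.
counts <- matrix(
  c(10,  5, 0,  2,
     5,  5, 1,  1,
    85, 40, 9, 97),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("mt-Co1", "mt-Nd1", "Actb"), paste0("cell", 1:4))
)
cluster <- factor(c("a", "a", "b", "b"))

# Assumption: mitochondrial genes are identified by the "mt-" prefix.
is_mito  <- grepl("^mt-", rownames(counts))
pct_mito <- 100 * colSums(counts[is_mito, , drop = FALSE]) / colSums(counts)

# Fraction of cells per cluster above the 10% threshold from the text
frac_high_mt <- tapply(pct_mito > 10, cluster, mean)

# Clusters in which at least half of the cells exceed the threshold
high_mt_clusters <- names(frac_high_mt)[frac_high_mt >= 0.5]
high_mt_clusters
```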

Ribosomal genes

To visualise the ribosomal content across the whole dataset, we plotted the cells according to their ribosomal content. High ribosomal content in a cluster that expresses a mixed profile could indicate that the cells are clustering based on ribosomal gene content.

Some clusters have higher ribosomal content; these are immune clusters. Ribosomal genes can be highly expressed in active cells. Further investigation confirmed that this is a biological difference: the cells still cluster together after the ribosomal genes are removed from the variable features.
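The check described above relies on dropping ribosomal genes from the variable features before re-clustering. A minimal sketch of that filtering step follows. The gene vector is hypothetical, and the `Rpl*`/`Rps*` naming pattern is an assumption that can also catch a few non-ribosomal genes (for example kinases whose symbols start with `Rps6k`), so the matches are worth inspecting.

```r
# Hypothetical vector of highly variable genes used for clustering
variable_features <- c("Rpl13", "Rps6", "Cx3cr1", "Gfap", "Mbp", "Rplp0", "Rpsa")

# Assumption: ribosomal protein genes follow the Rpl*/Rps* naming scheme.
is_ribo <- grepl("^Rp[ls]", variable_features)

# Genes that would be dropped; inspect these before filtering
variable_features[is_ribo]

# Variable features without ribosomal genes, to be used for re-clustering
variable_features_no_ribo <- variable_features[!is_ribo]
variable_features_no_ribo
```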

Number of mice per cluster

How many mice contribute to each cluster?

##     
##      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
##   KO 3 3 3 3 1 3 3 3 2  3  1  3  3  3  3  3  3  3  3  3  2  3  3  3  3  3  3  3
##   WT 3 3 3 3 3 3 3 3 3  3  3  3  3  3  3  3  3  3  3  3  2  3  3  3  3  3  3  3
##     
##      29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
##   KO  3  0  3  2  3  3  3  3  3  3  3  3  0  3  2  2  3  2  3  3  3  2  3  2  0
##   WT  3  3  3  1  3  3  3  3  3  3  3  2  3  3  0  2  3  2  3  3  3  1  3  2  3
##     
##      54 55
##   KO  3  1
##   WT  2  0

Apart from the obvious microglia clusters, to which very few or no FIRE mice contribute, nothing stands out.
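A table of this kind can be produced by counting each mouse once per cluster. The sketch below uses invented data; the column names `cluster`, `mouse` and `genotype` are hypothetical.

```r
# Toy per-cell metadata; one row per cell (values are invented).
cells <- data.frame(
  cluster  = c("1", "1", "1", "1", "2", "2", "2"),
  mouse    = c("m1", "m1", "m2", "m4", "m4", "m5", "m5"),
  genotype = c("KO", "KO", "KO", "WT", "WT", "WT", "WT")
)

# Keep one row per (cluster, mouse) pair so each mouse is counted once
mouse_by_cluster <- unique(cells[, c("cluster", "mouse", "genotype")])

# Rows: genotype, columns: cluster, values: number of contributing mice
mice_per_cluster <- table(mouse_by_cluster$genotype, mouse_by_cluster$cluster)
mice_per_cluster
```

Deduplicating first matters: tabulating the raw per-cell metadata would count cells, not mice.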

Clusters of doublet cells

Detection of clusters formed by doublets/multiplets (i.e. multiple cells captured within the same droplet or reaction volume). The function tests each cluster against the null hypothesis that it does consist of doublets. The null is rejected if a cluster has many DE genes that lie outside the expression limits defined by the “source” clusters.

The analysis of the results includes:

filtering for the clusters where the null hypothesis (of being a doublet) was not rejected at a significance level of 5%;
filtering for the clusters that are, on average, formed by cells with larger library sizes than their source clusters (the UMI count of a doublet is expected to be larger than that of a single cell);
finally, plotting the clusters that were not already detected as poorer quality, along with their source clusters, to have a closer look at these potential doublet cells.
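The filtering steps listed above can be sketched on a toy results table. The column names are assumed to mimic the output of `findDoubletClusters()` from scDblFinder, where `lib.size1` and `lib.size2` are taken to be the ratios of each source cluster's median library size to that of the query cluster. The values and the `already_flagged` vector are invented for illustration.

```r
# Toy results table; layout is an assumption, values are invented.
dbl <- data.frame(
  cluster   = c("A", "B", "C", "D"),
  source1   = c("X", "X", "Y", "Z"),
  source2   = c("Y", "Z", "Z", "X"),
  p.value   = c(0.60, 0.001, 0.30, 0.08),
  lib.size1 = c(0.7, 0.6, 1.4, 0.8),
  lib.size2 = c(0.8, 0.9, 0.9, 0.5)
)
already_flagged <- c("D")  # hypothetical clusters flagged by earlier QC steps

# 1. Null hypothesis (the cluster is a doublet) not rejected at the 5% level
not_rejected <- dbl$p.value > 0.05

# 2. Both source clusters have smaller library sizes than the query cluster
bigger_library <- dbl$lib.size1 < 1 & dbl$lib.size2 < 1

# 3. Keep only clusters not already flagged as poorer quality
candidates <- dbl[not_rejected & bigger_library &
                    !(dbl$cluster %in% already_flagged), ]
candidates
```

The remaining rows are only candidates: each one still needs to be inspected against its source clusters, since closely related cell types can produce the same signal.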

Cluster 20 is detected as a doublet cluster between 45 and 24. All three clusters are oligodendrocytes, so this “doublet” call might simply be due to the close similarity between the clusters.

Cluster 31 is potentially just a subtype of Endothelial/Pericyte cells, as it is flagged as a possible doublet of 29 (Endothelial/Pericyte) and 6 (low-quality cluster, unknown cell type).

Cluster 9 is a doublet between astrocytes (17) and microglia (30).

Cluster 53 is a doublet between oligodendrocytes (21) and microglia (11).

Control vs FIRE mice

Before deleting any clusters, we want to take a closer look at the clusters that differ between the knockout and the wild type.

Proportion KO-WT

To visualise the proportions of KO and WT cells in each cluster, we exclude the microglia clusters, as these are only present in the control, and we normalise by the number of cells per cluster.
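A minimal sketch of this computation on invented data is given below. It assumes that the normalisation is a within-cluster proportion; the data frame and the `microglia_clusters` vector are hypothetical.

```r
# Toy per-cell metadata (values are invented).
cells <- data.frame(
  cluster  = c("1", "1", "1", "1", "2", "2", "2", "2", "2", "30", "30"),
  genotype = c("KO", "KO", "KO", "WT", "KO", "WT", "WT", "WT", "WT", "WT", "WT")
)
microglia_clusters <- c("30")  # hypothetical list of microglia clusters

# Exclude the microglia clusters before computing proportions
keep <- !(cells$cluster %in% microglia_clusters)
n_cells <- table(cells$cluster[keep], cells$genotype[keep])

# Normalise by the number of cells per cluster (each row sums to 1)
prop <- prop.table(n_cells, margin = 1)

# Clusters with more than 60% of their cells coming from KO animals
ko_enriched <- rownames(prop)[prop[, "KO"] > 0.6]
ko_enriched
```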

Visualise in a plot

The clusters with more than 60% of cells from the KO animals that were previously flagged as lower quality are:

6, 32, 23, 27, 37

Clusters 6 and 27 are very low-quality, unidentified cells that most likely cluster together only because of their quality; they will be deleted.

Cluster 37 is close to the large, high-mitochondrial oligodendrocyte cluster that we are deleting. It does not have a high mitochondrial percentage but does have very low UMI counts. It will also be deleted.

Clusters 32 and 23 are two small neuronal clusters (part of cluster 15 at k60; approx. 70% KO). The other small neuronal clusters nearby, 36 and 12 (approx. 55% KO), would be deleted with the chosen thresholds. However, as cluster 15 (at k60) is the only neuronal cluster, this whole cluster will be kept even though it is borderline in terms of quality. More information supporting this choice can be found in the cluster_QC_k60 and the first annotation files.

Cluster QC

We filter out the clusters highlighted as:

low UMI: the majority of cells have fewer than 3000 UMI counts
high mt: the majority of cells have more than 10% mitochondrial genes
doublets: their expression profile lies between two other “source” clusters between which intermediate cell types are not expected.

There are 4649 cells filtered out and the final object has 13975 cells.
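The final filtering step can be sketched as follows. All cluster labels and vectors here are invented. In particular, `rescued` is a hypothetical name for borderline clusters that are kept for biological reasons, such as the small neuronal clusters discussed earlier.

```r
# Toy per-cell metadata (values are invented).
cells <- data.frame(
  barcode = paste0("cell", 1:10),
  cluster = c("1", "1", "2", "2", "2", "3", "3", "4", "4", "4")
)

# Hypothetical flags from the individual QC steps
low_umi  <- c("2")
high_mt  <- c("2", "3")
doublets <- c("4")
rescued  <- c("3")  # borderline clusters kept for biological reasons

# Clusters to delete: every flagged cluster except the rescued ones
to_remove <- setdiff(union(union(low_umi, high_mt), doublets), rescued)

keep      <- !(cells$cluster %in% to_remove)
filtered  <- cells[keep, ]
n_removed <- sum(!keep)

# Sanity check: removed and retained cells add up to the input
stopifnot(nrow(filtered) + n_removed == nrow(cells))
```

The same sanity check applied to the numbers reported above implies an input object of 4649 + 13975 = 18624 cells.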

Session Info

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scDblFinder_1.4.0           tibble_3.1.2               
##  [3] dplyr_1.0.7                 scater_1.18.6              
##  [5] ggplot2_3.3.4               here_1.0.1                 
##  [7] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
##  [9] Biobase_2.50.0              GenomicRanges_1.42.0       
## [11] GenomeInfoDb_1.26.7         IRanges_2.24.1             
## [13] S4Vectors_0.28.1            BiocGenerics_0.36.1        
## [15] MatrixGenerics_1.2.1        matrixStats_0.59.0         
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              rprojroot_2.0.2          
##  [3] tools_4.0.4               bslib_0.2.5.1            
##  [5] utf8_1.2.1                R6_2.5.0                 
##  [7] irlba_2.3.3               vipor_0.4.5              
##  [9] DBI_1.1.1                 colorspace_2.0-2         
## [11] withr_2.4.2               tidyselect_1.1.1         
## [13] gridExtra_2.3             compiler_4.0.4           
## [15] BiocNeighbors_1.8.2       DelayedArray_0.16.3      
## [17] labeling_0.4.2            sass_0.4.0               
## [19] scales_1.1.1              stringr_1.4.0            
## [21] digest_0.6.27             rmarkdown_2.9            
## [23] XVector_0.30.0            pkgconfig_2.0.3          
## [25] htmltools_0.5.1.1         sparseMatrixStats_1.2.1  
## [27] limma_3.46.0              highr_0.9                
## [29] rlang_0.4.10              DelayedMatrixStats_1.12.3
## [31] jquerylib_0.1.4           generics_0.1.0           
## [33] farver_2.1.0              jsonlite_1.7.2           
## [35] BiocParallel_1.24.1       RCurl_1.98-1.3           
## [37] magrittr_2.0.1            BiocSingular_1.6.0       
## [39] GenomeInfoDbData_1.2.4    scuttle_1.0.4            
## [41] Matrix_1.3-4              Rcpp_1.0.6               
## [43] ggbeeswarm_0.6.0          munsell_0.5.0            
## [45] fansi_0.5.0               viridis_0.6.1            
## [47] lifecycle_1.0.0           stringi_1.6.2            
## [49] yaml_2.2.1                edgeR_3.32.1             
## [51] zlibbioc_1.36.0           grid_4.0.4               
## [53] dqrng_0.3.0               crayon_1.4.1             
## [55] lattice_0.20-44           cowplot_1.1.1            
## [57] beachmat_2.6.4            locfit_1.5-9.4           
## [59] knitr_1.33                pillar_1.6.1             
## [61] igraph_1.2.6              xgboost_1.4.1.1          
## [63] glue_1.4.2                evaluate_0.14            
## [65] scran_1.18.7              data.table_1.14.0        
## [67] vctrs_0.3.8               gtable_0.3.0             
## [69] purrr_0.3.4               assertthat_0.2.1         
## [71] xfun_0.21                 rsvd_1.0.5               
## [73] viridisLite_0.4.0         beeswarm_0.4.0           
## [75] bluster_1.0.0             statmod_1.4.36           
## [77] ellipsis_0.3.2