Set-up

The object has 32285 genes and 23007 cells before filtering

Add cell QC metrics to the sce

First we need to sort the gene names and gene symbols, because the default ensembl notation is not very handy. And then save the mitochondrial genes as such.

Then we can use the scater package to add the quality per cell. This computes for each cell some useful metrics such as the number of umi counts (library size), the number of detected genes and the percentage of mitochondiral genes.

Then we use the automatic isOutlier function from the same package that determine which values in a numeric vector are outliers based on the median absolute deviation (MAD). When using this function with low number a log transformation is added, that prevents negative thresholds.

##                X lib_size_high expression_high lib_size_low expression_low     mt_pct total
## 1 Cells filtered        956.00          93.000       0.0000       773.0000 2724.00000  4381
## 2      Threshold      16021.98        5867.663     314.1608       162.4078   21.74347    NA

Plots before QC

Diagnostic plots to visualize the data distribution. The orange cells are marked as outliers by the automatic detection from scater. These thresholds will probably be manually adjusted to better fit the distribution of our data.

Violin plots

Histograms

In the x axis we can see the total number of umi (library size) per cell, the number of detected genes per cell and the mitochondrial percentage per cell; with the number of cells for each measure in the y axis.

This object had already been filtrated with the cell-calling algorithm from CellRanger, that is meant to remove empty droplets. Therefore it is expected to see the total sum of umi skewed as in the plot above.

A bimodal plot is not something usual for the detected number of genes. This bimodality was already visible in the violin plots.

There is a very heavy tail of cells with high mitochondrial genes.

Scatter plots

PCA

Here we run a PCA using the information in the metadata instead of the gene expression. It is useful to visualize the QC parametres.

Ratio between sum and gene counts

This measures the number of detected genes per cell dividided by its lirary size. This will be very useful to delete the cells that have low gene counts but a relatively high umi count.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.005707 0.339102 0.393226 0.394572 0.455339 0.828814

The isoutlier function can but used to find the outliers of any distribution, as far as it is roughly normal. Bellow we use it with the ratio between the number of genes expressed and the number of umi.

##     lower    higher 
## 0.1365831 0.6498693

The suggestions here are very reasonable, it is not enough to delete all the cells we do not trust but we can definitely take this into consideration as a threshold parameter.

Decide the thresholds

The upper thresholds from the sum of umi counts, that allows filtering doublet cells, might be a bit too harsh. This could be due to the fact the distribution was not normal. We could start with 25,000 instead of 1.6022^{4}.

The other thresholds seem more reasonable. These are deleting cells with less than 5868 or more than 162 detected genes, and less than 314 umi counts.

Finally, we will also take in consideration the detected genes/umi counts ratio, that filter out cells with relatively high umi counts but very few detected genes ( lots of copies from the same genes?). As well as the automatic mitochondrial threshold, 21.74 %, this is a bit high compared with other datasets but it is already deleting 2724 cells. Lower down to only 10 % would delete 7297 cells.

Here we decide the new thersholds.

## NULL

##                X lib_size_high expression_high lib_size_low expression_low     mt_pct total
## 1 Cells filtered           196          93.000       0.0000       773.0000 2724.00000  4383
## 2      Threshold         25000        5867.663     314.1608       162.4078   21.74347    NA

QC with new thresholds

Bellow we are going to plot the same plots as before with the new thresholds

Violin plots

Scatter plots

Gene QC

It is typically a good idea to remove genes whose expression level is considered “undetectable”. Here we define a gene as detectable if at least two cells contain a transcript from the gene. It is important to keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor quality cells.

## [1] 9604

This way we deleted 9604 genes and kept 22681 genes

We can look at a plot that shows the top 50 (by default) most-expressed features. Each row in the plot below corresponds to a gene; each bar corresponds to the expression of a gene in a single cell; and the circle indicates the median expression of each gene, with which genes are sorted. We expect to see the “usual suspects”, i.e., mitochondrial genes, actin, ribosomal protein, MALAT1. A large number of pseudo-genes or predicted genes may indicate problems with alignment.

Create new filetered object

The object has 22681 genes and 18624 cells after filtering

Session Info

Click to expand

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] org.Mm.eg.db_3.12.0         AnnotationDbi_1.52.0        scater_1.18.6              
##  [4] ggplot2_3.3.4               SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
##  [7] Biobase_2.50.0              GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
## [10] IRanges_2.24.1              S4Vectors_0.28.1            BiocGenerics_0.36.1        
## [13] MatrixGenerics_1.2.1        matrixStats_0.59.0          here_1.0.1                 
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.6.1             sass_0.4.0                BiocSingular_1.6.0       
##  [4] bit64_4.0.5               jsonlite_1.7.2            viridisLite_0.4.0        
##  [7] DelayedMatrixStats_1.12.3 scuttle_1.0.4             bslib_0.2.5.1            
## [10] assertthat_0.2.1          highr_0.9                 blob_1.2.1               
## [13] GenomeInfoDbData_1.2.4    vipor_0.4.5               yaml_2.2.1               
## [16] pillar_1.6.1              RSQLite_2.2.7             lattice_0.20-44          
## [19] glue_1.4.2                beachmat_2.6.4            digest_0.6.27            
## [22] XVector_0.30.0            colorspace_2.0-2          cowplot_1.1.1            
## [25] htmltools_0.5.1.1         Matrix_1.3-4              pkgconfig_2.0.3          
## [28] zlibbioc_1.36.0           purrr_0.3.4               scales_1.1.1             
## [31] BiocParallel_1.24.1       tibble_3.1.2              farver_2.1.0             
## [34] generics_0.1.0            ellipsis_0.3.2            cachem_1.0.5             
## [37] withr_2.4.2               magrittr_2.0.1            crayon_1.4.1             
## [40] memoise_2.0.0             evaluate_0.14             fansi_0.5.0              
## [43] beeswarm_0.4.0            rsconnect_0.8.18          tools_4.0.4              
## [46] lifecycle_1.0.0           stringr_1.4.0             munsell_0.5.0            
## [49] DelayedArray_0.16.3       irlba_2.3.3               compiler_4.0.4           
## [52] jquerylib_0.1.4           rsvd_1.0.5                rlang_0.4.10             
## [55] grid_4.0.4                RCurl_1.98-1.3            BiocNeighbors_1.8.2      
## [58] labeling_0.4.2            bitops_1.0-7              rmarkdown_2.9            
## [61] gtable_0.3.0              DBI_1.1.1                 R6_2.5.0                 
## [64] gridExtra_2.3             knitr_1.33                dplyr_1.0.7              
## [67] fastmap_1.1.0             bit_4.0.4                 utf8_1.2.1               
## [70] rprojroot_2.0.2           stringi_1.6.2             ggbeeswarm_0.6.0         
## [73] Rcpp_1.0.6                vctrs_0.3.8               tidyselect_1.1.1         
## [76] xfun_0.21                 sparseMatrixStats_1.2.1

Gene and Cell QC

NadineBestard

27/02/2021

Set-up

Add cell QC metrics to the sce

Plots before QC

Violin plots

Histograms

Scatter plots

PCA

Ratio between sum and gene counts

Decide the thresholds

QC with new thresholds

Violin plots

Scatter plots

Gene QC

Create new filetered object

Session Info