The object has 32285 genes and 90750 cells before filtering.
First we need to sort out the gene names and gene symbols, because the default Ensembl notation is not very handy. We then flag the mitochondrial genes as such.
Then we can use the scater package to add per-cell quality metrics. This computes some useful metrics for each cell, such as the number of UMI counts (library size), the number of detected genes and the percentage of mitochondrial genes.
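As a rough illustration of what these per-cell metrics mean, the sketch below computes them by hand in base R on a tiny made-up count matrix. The matrix, the gene names and the rule of spotting mitochondrial genes by the mouse `mt-` symbol prefix are assumptions for illustration only; the report itself relies on scater for this step.

```r
# Toy count matrix (genes x cells); the values are invented for illustration.
counts <- matrix(
  c(5, 0, 2,
    0, 0, 1,
    3, 1, 0,
    2, 4, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Actb", "Gapdh", "mt-Co1", "mt-Nd1"),
                  c("cell1", "cell2", "cell3"))
)

# Assumption: mitochondrial genes are recognised by the "mt-" symbol prefix.
is_mito <- grepl("^mt-", rownames(counts))

qc <- data.frame(
  sum      = colSums(counts),      # library size (total UMI counts per cell)
  detected = colSums(counts > 0),  # genes with at least one count per cell
  mt_pct   = 100 * colSums(counts[is_mito, , drop = FALSE]) / colSums(counts)
)
print(qc)
```

Each row of `qc` describes one cell, which is the same shape of information that the QC thresholds in the rest of the report operate on.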
Then we use the automatic isOutlier function from the same package, which determines which values in a numeric vector are outliers based on the median absolute deviation (MAD). When this function is used for the lower thresholds, a log transformation is applied first, which prevents negative thresholds. Only the first three batches are taken into consideration when computing the thresholds, as the last one is of lower quality.
##                X lib_size_high expression_high lib_size_low expression_low      mt_pct total
## 1 Cells filtered       3083.00          99.000       0.0000       4234.000 18347.00000 23331
## 2      Threshold      19023.52        7111.237     462.2212        428.559    14.37436    NA
This data is saved in outs/fire-mice/autofilter_summary.csv
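To make the MAD rule behind these thresholds more concrete, here is a minimal base-R sketch of the idea: take the median plus or minus a number of MADs, optionally on the log scale, and derive the cut-offs from a reference subset of cells only. The helper `mad_thresholds`, the default of 3 MADs and the toy library sizes are assumptions for illustration; this is not scater's own implementation, and the exact isOutlier arguments used for this report are not shown in the text.

```r
# Minimal sketch of MAD-based thresholds (hypothetical helper, not scater code).
mad_thresholds <- function(x, nmads = 3, log = FALSE, subset = NULL) {
  vals <- if (log) log(x) else x
  ref <- if (is.null(subset)) vals else vals[subset]  # cells used for the cut-offs
  centre <- median(ref)
  spread <- mad(ref)  # scaled MAD (default constant 1.4826)
  limits <- c(lower = centre - nmads * spread,
              higher = centre + nmads * spread)
  # Back-transforming from the log scale guarantees a positive lower limit.
  if (log) exp(limits) else limits
}

# Hypothetical library sizes; the last two cells come from a lower-quality batch.
lib_size <- c(900, 1000, 1100, 1200, 1000, 950, 50, 40)
batch <- c(rep("good", 6), rep("poor", 2))

# Thresholds come from the "good" batch only, but are applied to every cell.
lims <- mad_thresholds(lib_size, log = TRUE, subset = batch == "good")
low_outlier <- lib_size < lims["lower"]
print(lims)
print(low_outlier)
```

Deriving the limits from a reference subset keeps a poor batch from dragging the median and MAD towards its own low values, which would otherwise hide its cells from the filter.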
The diagnostic plots visualize the data distribution. The orange cells are those marked as outliers by the automatic detection from scater.
We can see that chip 6 has a larger number of cells with low UMI counts than the other three chips. This chip was not taken into consideration when calculating the overall thresholds.
The x axis shows the total number of UMI counts (library size) per cell, the number of detected genes per cell and the mitochondrial percentage per cell; the y axis shows the number of cells for each measure.
This object had already been filtered with the cell-calling algorithm from CellRanger, which is meant to remove empty droplets. It is therefore expected that the total sum of UMI counts is skewed, as in the plot above.
The bimodality present in the number of counts was already visible in the violin plots.
There is a very heavy tail of cells with a high percentage of mitochondrial genes.
Here we run a PCA using the QC information in the metadata instead of the gene expression. It is useful for visualizing the QC parameters.
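A PCA on QC metrics, rather than on expression values, can be sketched in base R with `prcomp`. The simulated metrics and their column names below are assumptions for illustration only; the report's own PCA was run on the real cell metadata, and the function it used is not shown in the text.

```r
set.seed(1)  # simulated toy data, for illustration only

# Hypothetical per-cell QC metrics for 100 cells.
qc <- data.frame(
  sum      = rlnorm(100, meanlog = 8, sdlog = 0.5),  # library size
  detected = rlnorm(100, meanlog = 7, sdlog = 0.4),  # detected genes
  mt_pct   = runif(100, min = 0, max = 20)           # mitochondrial percentage
)

# Centre and scale, because the metrics are on very different scales.
pca <- prcomp(qc, center = TRUE, scale. = TRUE)

# Coordinates of the first cells on PC1 and PC2, and variance explained per PC.
print(head(pca$x[, 1:2]))
print(summary(pca)$importance["Proportion of Variance", ])
```

Cells that sit far from the bulk in this space are unusual in some combination of QC metrics, which makes the plot a convenient visual check on the outlier calls.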
This ratio is the number of detected genes per cell divided by the cell's library size. It is very useful for removing cells that have few detected genes but a relatively high UMI count (visible in the scatter plots).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.008078 0.323264 0.395775 0.406782 0.482403 0.878004
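The ratio summarised above is a simple per-cell division. A minimal sketch, assuming a QC table with hypothetical column names `sum` and `detected` and invented values:

```r
# Hypothetical per-cell QC values, for illustration only.
qc <- data.frame(
  sum      = c(1000, 5000, 4000, 800),  # library size (UMI counts)
  detected = c(450, 2000, 120, 400)     # number of detected genes
)

# Detected genes per UMI: low values mean many counts spread over few genes.
qc$ratio_detected_sum <- qc$detected / qc$sum
print(qc)
```

The third toy cell has a large library but very few detected genes, so its ratio is far below the others; that is the kind of cell this metric is meant to expose.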
The isOutlier function can be used to find the outliers of any distribution, as long as it is roughly normal. Below we use it with the ratio between the number of detected genes and the number of UMI counts. Again, only the first three chips are considered when calculating the cut-offs.
## lower higher
## 0.05459917 0.72380852
This filters out 1532 cells.
The upper threshold for the sum of UMI counts is 1.9024 × 10^4 and the lower threshold is 462 UMI counts.
We also remove cells with more than 7111 or fewer than 429 detected genes.
Finally, we also take into consideration the detected genes/UMI counts ratio, which filters out cells with relatively high UMI counts but very few detected genes (many copies of the same genes), and we remove any cell with more than 14.37% mitochondrial genes.
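The way these cut-offs combine into one discard decision per cell can be sketched as follows. The threshold values are the ones reported in this document; the five cells, the column names and the `thr` list are assumptions for illustration, with one cell constructed per failure mode.

```r
# Cut-offs as reported in this document.
thr <- list(
  sum_low = 462.2212, sum_high = 19023.52,
  detected_low = 428.559, detected_high = 7111.237,
  mt_pct_high = 14.37436,
  ratio_low = 0.05459917, ratio_high = 0.72380852
)

# Hypothetical cells: one that passes, plus one per failure mode.
qc <- data.frame(
  sum      = c(5000, 300, 25000, 9000, 6000),
  detected = c(2000, 250, 6000, 450, 2500),
  mt_pct   = c(3, 5, 4, 2, 30),
  row.names = c("ok", "low_umi", "high_umi", "low_ratio", "high_mito")
)
qc$ratio <- qc$detected / qc$sum

# A cell is discarded as soon as it fails any single criterion.
discard <- qc$sum < thr$sum_low | qc$sum > thr$sum_high |
  qc$detected < thr$detected_low | qc$detected > thr$detected_high |
  qc$mt_pct > thr$mt_pct_high |
  qc$ratio < thr$ratio_low | qc$ratio > thr$ratio_high
names(discard) <- rownames(qc)
print(discard)
```

Because the criteria are joined with a logical OR, the number of discarded cells is smaller than the sum of the per-criterion counts whenever a cell fails more than one test.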
It is typically a good idea to remove genes whose expression level is considered “undetectable”. Here we define a gene as detectable if at least two cells contain a transcript from the gene. It is important to keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor quality cells.
## [1] 8566
This way we removed 8566 genes and kept 23719 genes.
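The detectability rule described above (a gene is kept if at least two cells contain a transcript from it) can be sketched in base R on a toy matrix. The counts and names are invented for illustration, and the matrix is assumed to contain only the cells that survived cell filtering.

```r
# Toy count matrix (genes x cells), assumed to be already cell-filtered.
counts <- matrix(
  c(1, 0, 2, 0,
    0, 0, 3, 0,
    0, 0, 0, 0,
    4, 1, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:4))
)

# Number of cells in which each gene has at least one count.
n_cells_expressing <- rowSums(counts > 0)

# Keep genes detected in at least two cells.
keep_gene <- n_cells_expressing >= 2
filtered <- counts[keep_gene, , drop = FALSE]

print(sum(!keep_gene))  # number of genes removed
print(dim(filtered))    # genes and cells remaining
```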
We can look at a plot that shows the top 50 (by default) most-expressed features. Each row in the plot below corresponds to a gene; each bar corresponds to the expression of a gene in a single cell; and the circle indicates the median expression of each gene, with which genes are sorted. We expect to see the “usual suspects”, i.e., mitochondrial genes, actin, ribosomal protein, MALAT1. A large number of pseudo-genes or predicted genes may indicate problems with alignment.
The object has 23719 genes and 66826 cells after filtering.
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] org.Mm.eg.db_3.12.0 AnnotationDbi_1.52.0
## [3] scater_1.18.6 ggplot2_3.3.5
## [5] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
## [7] Biobase_2.50.0 GenomicRanges_1.42.0
## [9] GenomeInfoDb_1.26.7 IRanges_2.24.1
## [11] S4Vectors_0.28.1 BiocGenerics_0.36.1
## [13] MatrixGenerics_1.2.1 matrixStats_0.59.0
## [15] here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] viridis_0.6.1 sass_0.4.0
## [3] BiocSingular_1.6.0 bit64_4.0.5
## [5] jsonlite_1.7.2 viridisLite_0.4.0
## [7] DelayedMatrixStats_1.12.3 scuttle_1.0.4
## [9] bslib_0.2.5.1 assertthat_0.2.1
## [11] highr_0.9 blob_1.2.1
## [13] GenomeInfoDbData_1.2.4 vipor_0.4.5
## [15] yaml_2.2.1 pillar_1.6.1
## [17] RSQLite_2.2.7 lattice_0.20-44
## [19] glue_1.4.2 beachmat_2.6.4
## [21] digest_0.6.27 XVector_0.30.0
## [23] colorspace_2.0-2 cowplot_1.1.1
## [25] htmltools_0.5.1.1 Matrix_1.3-4
## [27] pkgconfig_2.0.3 zlibbioc_1.36.0
## [29] purrr_0.3.4 scales_1.1.1
## [31] BiocParallel_1.24.1 tibble_3.1.2
## [33] farver_2.1.0 generics_0.1.0
## [35] ellipsis_0.3.2 cachem_1.0.5
## [37] withr_2.4.2 magrittr_2.0.1
## [39] crayon_1.4.1 memoise_2.0.0
## [41] evaluate_0.14 fansi_0.5.0
## [43] beeswarm_0.4.0 tools_4.0.4
## [45] lifecycle_1.0.0 stringr_1.4.0
## [47] munsell_0.5.0 DelayedArray_0.16.3
## [49] irlba_2.3.3 compiler_4.0.4
## [51] jquerylib_0.1.4 rsvd_1.0.5
## [53] rlang_0.4.10 grid_4.0.4
## [55] RCurl_1.98-1.3 BiocNeighbors_1.8.2
## [57] labeling_0.4.2 bitops_1.0-7
## [59] rmarkdown_2.9 gtable_0.3.0
## [61] DBI_1.1.1 R6_2.5.0
## [63] gridExtra_2.3 knitr_1.33
## [65] dplyr_1.0.7 fastmap_1.1.0
## [67] bit_4.0.4 utf8_1.2.1
## [69] rprojroot_2.0.2 stringi_1.7.3
## [71] ggbeeswarm_0.6.0 Rcpp_1.0.6
## [73] vctrs_0.3.8 tidyselect_1.1.1
## [75] xfun_0.21 sparseMatrixStats_1.2.1