The aim of this analysis is to estimate the cellular composition of bulk-RNAseq reads from mouse left ventricular tissue. We’ll find cell-type-specific gene expression reference profiles with snRNAseq. These two single nucleus samples are aggregates of several adult, male B6 from Christoph’s post-doc.
This document explains the workflow in the order in which the data processing occurred. You can jump to portions with the table of contents on the left.
Some info on our unprocessed snRNAseq dataset:
## An object of class Seurat
## 19466 features across 39996 samples within 1 assay
## Active assay: RNA (19466 features, 0 variable features)
And our bulk RNAseq dataset, which I’ve done some basic summary statistics on:
This PCA illustrates clustering by treatment and genotype that we would expect. Also, we can see the outlier you mentioned, wt_Lx3. It has more counts per gene than any other sample, but there is so much variability in that metric anyway. Regardless, gene expression in each sample has no bearing on the deconvolution of the other samples, so I’ve decided to keep it in the analysis just to see how it goes. If we end up doing group comparison statistics it might be best to remove it.
After loading our data and packages, we some standard single cell processing, including QC, normaliziation, scaling, neighbor analysis, and dimensional reduction. We’ll also exclude nuclei above 5% mitocondrial gene expression.
sn.clust.1 <- ClusterSeurat(sn, max.rna.ft = 4000,
max.mt.pt = 5, res = 0.1)
sn.clust.1
## An object of class Seurat
## 19466 features across 24936 samples within 1 assay
## Active assay: RNA (19466 features, 2000 variable features)
## 3 dimensional reductions calculated: pca, harmony, umap
Now, we can visualize our data plotted on UMAPs. The top two plots are showing us how our clustering decided to group the nuclei and the right is how well we integrated our two samples.
The lower two feature plots are showing percent mitocondrial RNA and number of transcripts counted in each nuclei. High mitocondrial gene content indicates cytoplasmic contamination, which would confound deconvolution methods.
We’re going to look at how many nuclei in each cluster contain more than 3% mitocondrial RNA. We’ll set a cutoff off 20% of nuclei within a cluster and repeat the clustering without those nuclei. After I remove that cluster, I still set a mitocondrial gene percent cutoff at 2% for the reclustering, since the initial clustering and MT% threshold largely serve to bait highly cytoplasmic nuclei together.
To annotate the clusters with actual cell types, I’ve decided to reference existing mouse single cell datasets.
McLellan 2020 (DOI: 10.1161/CIRCULATIONAHA.119.045115) has a reasonably accessible list of marker genes for their clusters. They repeated their clustering, yielding both a broad and fine classification. So we’ll be doing the process twice.
I’ve reworked their data to be formatted as a data frame with 10 cell-type clusters matched with marker genes.
In their original data, they also included AUC scores and percent of nuclei in and outside of the cluster which express the gene. There’s a lot of differences between their data and ours, so I’ve left that extra info out.
Here’s how that data looks and what cell types are included in the broad classification:
head(markers.broad)
## cluster gene
## 1 B cells Cd79a
## 2 B cells Ly6d
## 3 B cells H2-DMb2
## 4 B cells Cd37
## 5 B cells Cd79b
## 6 B cells Ms4a1
unique(markers.broad$cluster)
## [1] "B cells" "cardiomyocytes" "endocardium"
## [4] "endothelial cells" "epicardium" "fibroblasts"
## [7] "granulocytes" "lymphatic ECs" "macrophages"
## [10] "pericytes" "Schwann cells"
To apply the markers to our data, I’m using the AUCell package. They have a wonderful vignette here if you’re interested.
Basically, it uses the gene sets I’ve supplied (40 - 150 genes per cell-type) and looks for cells where they’re enriched via AUC. Single cell/nucleus is noisy, sparse, and variable, and AUCell does a few smart things to account for that. AUC scores are generally higher in more abundant genes, so it sets a threshold value for each gene. Also, each cell is considered independently to generate it’s AUC score, so count variation has minimal effect on scores.
It outputs histograms of the AUC scores from every nuclei, for each cell-type marker set we provide. Here is an example with our cardiomyocyte gene set:
Under the hood, AUCell looks for a bi-modal distribution. Like we see above, a bi-modal distribution reflects that there are a discrete population of nuclei which express the cardiomyocyte gene set, while the majority of nuclei do not.
And here is a less aesthetically pleasing example from B-cells:
With AUCell, we can end up having multiple significant labels on a given nuclei.
To get around this, I’ve considered each AUC value for each cell (there is an AUC for each cell type it could be categorized). I then compared these values to the threshold value for the corresponding cell type (0.197 in the CM example earlier). For each value The nuclei were then tagged with whatever AUC/threshold value produced the highest ratio.
An aside: This has its drawbacks, like labeling cell types which don’t have a clearly bi-nomial AUC distribution. I’d like to add a filtration step to exclude normally or single-tailed distributed cell-types. Or maybe exclude multi-label nuclei altogether.
After repeating with the fine cell types, we can add all of the annotations to our original single nucleus data and visualize the tags:
And these are the clusters we found earlier, for reference:
From the DimPlots above, it looks like the most clear cell types can be defined from the broad classification. To pin those down to our own clusters, we’ll apply a Chi-Squared test between the Seurat clusters and our newly applied broad cell-type labels.
However, broad labels don’t completely define our clusters:
For this reason, I’ve decided to include both the fine and broad classifiers. Here’s how that looks:
We can then look for the largest positive standardized residual for each seurat cluster and label it with the associated cell type. Sometimes, more than one cluster will be labeled as the same cell type. In these cases, I’ve just decided to make their labels unique by adding a number after (i.e. neuron vs. neuron.1)
Here’s what our final clusters and labels look like:
As I use it, MuSiC takes our single nucleus clusters and looks for deconvolution markers itself. It prioritizes markers that are expressed in most of the nuclei in a given cluster, consistent within the cluster, and variable between all of the clusters. It is also possible for us to supply our own markers. Here’s how the results look:
There is still a lot of room for improvement on this pipeline. For one, I don’t trust the specifics of the reported ratios to be accurate. Fibroblasts are underrepresented and I suspect a lot of signal from immune cell infiltration is being labeled as macrophages. Also I had to remove “cardiomyocyte.1” as a reference. If I leave it, >90% of the samples are reported as it. If I had unlimited time, I’d try some of these things next:
However, I still do think that there are some interesting snippets. I find it exciting that we see, anecdotally, minimal difference between the genotypes while at baseline, but not at all after treatment. And, if we buy into these estimates, it looks like the KO keeps closer to the baseline composition than the WT during treatment.
I had a lot of fun with this project. I got to learn a lot, as I ended up applying a few methods I’ve never used before (like the automated cell-type cluster annotation). Plus, this is the first time I’ve written up a document in this style. Let me know if you have any feedback on how things are communicated, or need any clarification/information not shown here.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux 8.7 (Ootpa)
##
## Matrix products: default
## BLAS/LAPACK: /nas/longleaf/rhel8/apps/r/4.2.1/lib/libopenblas_haswellp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scCustomize_1.1.1 gridExtra_2.3
## [3] ggrepel_0.9.2 scales_1.2.1
## [5] gplots_3.1.3 viridis_0.6.2
## [7] viridisLite_0.4.1 AUCell_1.20.2
## [9] shiny_1.7.4 SCpubr_1.1.1.9000
## [11] harmony_0.1.1 Rcpp_1.0.10
## [13] SingleCellExperiment_1.20.0 lubridate_1.9.0
## [15] timechange_0.1.1 forcats_0.5.2
## [17] stringr_1.5.0 purrr_1.0.0
## [19] readr_2.1.3 tidyr_1.2.1
## [21] tibble_3.1.8 tidyverse_2.0.0
## [23] reshape2_1.4.4 MuSiC_1.0.0
## [25] TOAST_1.12.0 quadprog_1.5-8
## [27] limma_3.54.0 EpiDISH_2.14.1
## [29] nnls_1.4 SeuratDisk_0.0.0.9020
## [31] patchwork_1.1.2 DESeq2_1.38.2
## [33] SummarizedExperiment_1.28.0 Biobase_2.58.0
## [35] MatrixGenerics_1.10.0 matrixStats_0.63.0
## [37] GenomicRanges_1.50.2 GenomeInfoDb_1.34.6
## [39] IRanges_2.32.0 S4Vectors_0.36.1
## [41] BiocGenerics_0.44.0 ggplot2_3.4.0
## [43] SeuratObject_4.1.3 Seurat_4.3.0
## [45] dplyr_1.0.10
##
## loaded via a namespace (and not attached):
## [1] ggprism_1.0.4 SparseM_1.81
## [3] scattermore_0.8 GGally_2.1.2
## [5] R.methodsS3_1.8.2 coda_0.19-4
## [7] bit64_4.0.5 knitr_1.41
## [9] irlba_2.3.5.1 DelayedArray_0.24.0
## [11] R.utils_2.12.2 data.table_1.14.6
## [13] KEGGREST_1.38.0 RCurl_1.98-1.9
## [15] doParallel_1.0.17 generics_0.1.3
## [17] cowplot_1.1.1 RSQLite_2.2.20
## [19] RANN_2.6.1 proxy_0.4-27
## [21] future_1.30.0 bit_4.0.5
## [23] tzdb_0.3.0 spatstat.data_3.0-0
## [25] httpuv_1.6.7 assertthat_0.2.1
## [27] xfun_0.36 hms_1.1.2
## [29] jquerylib_0.1.4 evaluate_0.19
## [31] promises_1.2.0.1 fansi_1.0.4
## [33] caTools_1.18.2 igraph_1.3.5
## [35] DBI_1.1.3 geneplotter_1.76.0
## [37] htmlwidgets_1.6.1 mcmc_0.9-7
## [39] reshape_0.8.9 spatstat.geom_3.0-3
## [41] paletteer_1.5.0 ellipsis_0.3.2
## [43] annotate_1.76.0 MCMCpack_1.6-3
## [45] locfdr_1.1-8 deldir_1.0-6
## [47] sparseMatrixStats_1.10.0 vctrs_0.5.1
## [49] quantreg_5.94 ROCR_1.0-11
## [51] abind_1.4-5 cachem_1.0.7
## [53] withr_2.5.0 progressr_0.12.0
## [55] sctransform_0.3.5 goftest_1.2-3
## [57] cluster_2.1.4 lazyeval_0.2.2
## [59] crayon_1.5.2 hdf5r_1.3.7
## [61] spatstat.explore_3.0-5 pkgconfig_2.0.3
## [63] labeling_0.4.2 nlme_3.1-161
## [65] vipor_0.4.5 rlang_1.0.6
## [67] globals_0.16.2 lifecycle_1.0.3
## [69] miniUI_0.1.1.1 MatrixModels_0.5-1
## [71] ggrastr_1.0.1 polyclip_1.10-4
## [73] lmtest_0.9-40 graph_1.76.0
## [75] Matrix_1.5-3 zoo_1.8-11
## [77] beeswarm_0.4.0 ggridges_0.5.4
## [79] GlobalOptions_0.1.2 png_0.1-8
## [81] bitops_1.0-7 R.oo_1.25.0
## [83] KernSmooth_2.23-20 Biostrings_2.66.0
## [85] blob_1.2.3 DelayedMatrixStats_1.20.0
## [87] shape_1.4.6 parallelly_1.34.0
## [89] spatstat.random_3.0-1 gridGraphics_0.5-1
## [91] memoise_2.0.1 GSEABase_1.58.0
## [93] magrittr_2.0.3 plyr_1.8.8
## [95] ica_1.0-3 zlibbioc_1.44.0
## [97] compiler_4.2.1 RColorBrewer_1.1-3
## [99] fitdistrplus_1.1-8 snakecase_0.11.0
## [101] cli_3.5.0 XVector_0.38.0
## [103] listenv_0.9.0 pbapply_1.7-0
## [105] MASS_7.3-58.1 tidyselect_1.2.0
## [107] stringi_1.7.12 highr_0.10
## [109] yaml_2.3.6 locfit_1.5-9.7
## [111] grid_4.2.1 sass_0.4.5
## [113] tools_4.2.1 future.apply_1.10.0
## [115] parallel_4.2.1 circlize_0.4.15
## [117] rstudioapi_0.14 foreach_1.5.2
## [119] janitor_2.2.0 farver_2.1.1
## [121] Rtsne_0.16 digest_0.6.31
## [123] later_1.3.0 RcppAnnoy_0.0.20
## [125] httr_1.4.4 AnnotationDbi_1.60.0
## [127] colorspace_2.1-0 XML_3.99-0.13
## [129] tensor_1.5 reticulate_1.27
## [131] splines_4.2.1 yulab.utils_0.0.6
## [133] uwot_0.1.14 rematch2_2.1.2
## [135] spatstat.utils_3.0-1 sp_1.5-1
## [137] ggplotify_0.1.0 plotly_4.10.1
## [139] xtable_1.8-4 jsonlite_1.8.4
## [141] corpcor_1.6.10 R6_2.5.1
## [143] pillar_1.8.1 htmltools_0.5.4
## [145] mime_0.12 glue_1.6.2
## [147] fastmap_1.1.1 BiocParallel_1.32.5
## [149] class_7.3-21 codetools_0.2-19
## [151] utf8_1.2.3 lattice_0.20-45
## [153] bslib_0.4.2 spatstat.sparse_3.0-0
## [155] ggbeeswarm_0.7.1 leiden_0.4.3
## [157] gtools_3.9.4 survival_3.4-0
## [159] rmarkdown_2.19 munsell_0.5.0
## [161] e1071_1.7-12 GenomeInfoDbData_1.2.9
## [163] iterators_1.0.14 gtable_0.3.1