Introduction and Context

The aim of this analysis is to estimate the cellular composition of bulk-RNAseq reads from mouse left ventricular tissue. We’ll find cell-type-specific gene expression reference profiles with snRNAseq. These two single nucleus samples are aggregates of several adult, male B6 from Christoph’s post-doc.

This document explains the workflow in the order in which the data processing occurred. You can jump to portions with the table of contents on the left.

Load data and packages

Some info on our unprocessed snRNAseq dataset:

## An object of class Seurat 
## 19466 features across 39996 samples within 1 assay 
## Active assay: RNA (19466 features, 0 variable features)


And our bulk RNAseq dataset, which I’ve done some basic summary statistics on:


This PCA illustrates clustering by treatment and genotype that we would expect. Also, we can see the outlier you mentioned, wt_Lx3. It has more counts per gene than any other sample, but there is so much variability in that metric anyway. Regardless, gene expression in each sample has no bearing on the deconvolution of the other samples, so I’ve decided to keep it in the analysis just to see how it goes. If we end up doing group comparison statistics it might be best to remove it.

Clustering nuclei

After loading our data and packages, we some standard single cell processing, including QC, normaliziation, scaling, neighbor analysis, and dimensional reduction. We’ll also exclude nuclei above 5% mitocondrial gene expression.

sn.clust.1 <- ClusterSeurat(sn, max.rna.ft = 4000,
                          max.mt.pt = 5, res = 0.1)
sn.clust.1
## An object of class Seurat 
## 19466 features across 24936 samples within 1 assay 
## Active assay: RNA (19466 features, 2000 variable features)
##  3 dimensional reductions calculated: pca, harmony, umap


Now, we can visualize our data plotted on UMAPs. The top two plots are showing us how our clustering decided to group the nuclei and the right is how well we integrated our two samples.

The lower two feature plots are showing percent mitocondrial RNA and number of transcripts counted in each nuclei. High mitocondrial gene content indicates cytoplasmic contamination, which would confound deconvolution methods.

Removing mitocondrial clusters

We’re going to look at how many nuclei in each cluster contain more than 3% mitocondrial RNA. We’ll set a cutoff off 20% of nuclei within a cluster and repeat the clustering without those nuclei. After I remove that cluster, I still set a mitocondrial gene percent cutoff at 2% for the reclustering, since the initial clustering and MT% threshold largely serve to bait highly cytoplasmic nuclei together. 


Defining cell types

To annotate the clusters with actual cell types, I’ve decided to reference existing mouse single cell datasets.
McLellan 2020 (DOI: 10.1161/CIRCULATIONAHA.119.045115) has a reasonably accessible list of marker genes for their clusters. They repeated their clustering, yielding both a broad and fine classification. So we’ll be doing the process twice.

How it works

I’ve reworked their data to be formatted as a data frame with 10 cell-type clusters matched with marker genes.
In their original data, they also included AUC scores and percent of nuclei in and outside of the cluster which express the gene. There’s a lot of differences between their data and ours, so I’ve left that extra info out.

Here’s how that data looks and what cell types are included in the broad classification:

head(markers.broad)
##   cluster    gene
## 1 B cells   Cd79a
## 2 B cells    Ly6d
## 3 B cells H2-DMb2
## 4 B cells    Cd37
## 5 B cells   Cd79b
## 6 B cells   Ms4a1
unique(markers.broad$cluster)
##  [1] "B cells"           "cardiomyocytes"    "endocardium"      
##  [4] "endothelial cells" "epicardium"        "fibroblasts"      
##  [7] "granulocytes"      "lymphatic ECs"     "macrophages"      
## [10] "pericytes"         "Schwann cells"

To apply the markers to our data, I’m using the AUCell package. They have a wonderful vignette here if you’re interested.

Basically, it uses the gene sets I’ve supplied (40 - 150 genes per cell-type) and looks for cells where they’re enriched via AUC. Single cell/nucleus is noisy, sparse, and variable, and AUCell does a few smart things to account for that. AUC scores are generally higher in more abundant genes, so it sets a threshold value for each gene. Also, each cell is considered independently to generate it’s AUC score, so count variation has minimal effect on scores.

It outputs histograms of the AUC scores from every nuclei, for each cell-type marker set we provide. Here is an example with our cardiomyocyte gene set:

Under the hood, AUCell looks for a bi-modal distribution. Like we see above, a bi-modal distribution reflects that there are a discrete population of nuclei which express the cardiomyocyte gene set, while the majority of nuclei do not.

And here is a less aesthetically pleasing example from B-cells:

Narrowing labels


With AUCell, we can end up having multiple significant labels on a given nuclei.

To get around this, I’ve considered each AUC value for each cell (there is an AUC for each cell type it could be categorized). I then compared these values to the threshold value for the corresponding cell type (0.197 in the CM example earlier). For each value The nuclei were then tagged with whatever AUC/threshold value produced the highest ratio.

An aside: This has its drawbacks, like labeling cell types which don’t have a clearly bi-nomial AUC distribution. I’d like to add a filtration step to exclude normally or single-tailed distributed cell-types. Or maybe exclude multi-label nuclei altogether.

After repeating with the fine cell types, we can add all of the annotations to our original single nucleus data and visualize the tags:


And these are the clusters we found earlier, for reference:


Defining Seurat clusters

From the DimPlots above, it looks like the most clear cell types can be defined from the broad classification. To pin those down to our own clusters, we’ll apply a Chi-Squared test between the Seurat clusters and our newly applied broad cell-type labels.

However, broad labels don’t completely define our clusters:


For this reason, I’ve decided to include both the fine and broad classifiers. Here’s how that looks:

We can then look for the largest positive standardized residual for each seurat cluster and label it with the associated cell type. Sometimes, more than one cluster will be labeled as the same cell type. In these cases, I’ve just decided to make their labels unique by adding a number after (i.e. neuron vs. neuron.1)

Here’s what our final clusters and labels look like:

MuSiC and composition estimates

As I use it, MuSiC takes our single nucleus clusters and looks for deconvolution markers itself. It prioritizes markers that are expressed in most of the nuclei in a given cluster, consistent within the cluster, and variable between all of the clusters. It is also possible for us to supply our own markers. Here’s how the results look:


Takeaways and next steps

There is still a lot of room for improvement on this pipeline. For one, I don’t trust the specifics of the reported ratios to be accurate. Fibroblasts are underrepresented and I suspect a lot of signal from immune cell infiltration is being labeled as macrophages. Also I had to remove “cardiomyocyte.1” as a reference. If I leave it, >90% of the samples are reported as it. If I had unlimited time, I’d try some of these things next:

  • Look more closely at differentially expressed genes between clusters in our snRNAseq
    • This might help us figure out what is driving the possible overestimation of cell types which are sparse in the snRNAseq data.
  • Compare the marker genes and weights being used for deconvolution by MuSiC to known cell-type markers.
  • Incorporate more reference datasets
    • There are a few papers with snRNAseq from male B6 we could try.
    • We could also think of using scRNAseq for non-myocyte fractions and integrate it with our cardiomyocyte nuclei. This might be challenging to justify methodologically, so we should think on it.
  • Once we see a believable ratio of cell types, we could aim to classify into cellular sub-types.
    • At this point it might be best to label and define the types of each cluster by marker expression, rather than pre-annotated references
  • We could also use the DirichletReg package to look for statistically significant differences in composition.
  • MuSiC2 expands upon MuSiC by offering a method to account for treatment conditions of the decon target. We could think about shifting to that method for the treatment/genotype comparisons

However, I still do think that there are some interesting snippets. I find it exciting that we see, anecdotally, minimal difference between the genotypes while at baseline, but not at all after treatment. And, if we buy into these estimates, it looks like the KO keeps closer to the baseline composition than the WT during treatment.

I had a lot of fun with this project. I got to learn a lot, as I ended up applying a few methods I’ve never used before (like the automated cell-type cluster annotation). Plus, this is the first time I’ve written up a document in this style. Let me know if you have any feedback on how things are communicated, or need any clarification/information not shown here.

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux 8.7 (Ootpa)
## 
## Matrix products: default
## BLAS/LAPACK: /nas/longleaf/rhel8/apps/r/4.2.1/lib/libopenblas_haswellp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scCustomize_1.1.1           gridExtra_2.3              
##  [3] ggrepel_0.9.2               scales_1.2.1               
##  [5] gplots_3.1.3                viridis_0.6.2              
##  [7] viridisLite_0.4.1           AUCell_1.20.2              
##  [9] shiny_1.7.4                 SCpubr_1.1.1.9000          
## [11] harmony_0.1.1               Rcpp_1.0.10                
## [13] SingleCellExperiment_1.20.0 lubridate_1.9.0            
## [15] timechange_0.1.1            forcats_0.5.2              
## [17] stringr_1.5.0               purrr_1.0.0                
## [19] readr_2.1.3                 tidyr_1.2.1                
## [21] tibble_3.1.8                tidyverse_2.0.0            
## [23] reshape2_1.4.4              MuSiC_1.0.0                
## [25] TOAST_1.12.0                quadprog_1.5-8             
## [27] limma_3.54.0                EpiDISH_2.14.1             
## [29] nnls_1.4                    SeuratDisk_0.0.0.9020      
## [31] patchwork_1.1.2             DESeq2_1.38.2              
## [33] SummarizedExperiment_1.28.0 Biobase_2.58.0             
## [35] MatrixGenerics_1.10.0       matrixStats_0.63.0         
## [37] GenomicRanges_1.50.2        GenomeInfoDb_1.34.6        
## [39] IRanges_2.32.0              S4Vectors_0.36.1           
## [41] BiocGenerics_0.44.0         ggplot2_3.4.0              
## [43] SeuratObject_4.1.3          Seurat_4.3.0               
## [45] dplyr_1.0.10               
## 
## loaded via a namespace (and not attached):
##   [1] ggprism_1.0.4             SparseM_1.81             
##   [3] scattermore_0.8           GGally_2.1.2             
##   [5] R.methodsS3_1.8.2         coda_0.19-4              
##   [7] bit64_4.0.5               knitr_1.41               
##   [9] irlba_2.3.5.1             DelayedArray_0.24.0      
##  [11] R.utils_2.12.2            data.table_1.14.6        
##  [13] KEGGREST_1.38.0           RCurl_1.98-1.9           
##  [15] doParallel_1.0.17         generics_0.1.3           
##  [17] cowplot_1.1.1             RSQLite_2.2.20           
##  [19] RANN_2.6.1                proxy_0.4-27             
##  [21] future_1.30.0             bit_4.0.5                
##  [23] tzdb_0.3.0                spatstat.data_3.0-0      
##  [25] httpuv_1.6.7              assertthat_0.2.1         
##  [27] xfun_0.36                 hms_1.1.2                
##  [29] jquerylib_0.1.4           evaluate_0.19            
##  [31] promises_1.2.0.1          fansi_1.0.4              
##  [33] caTools_1.18.2            igraph_1.3.5             
##  [35] DBI_1.1.3                 geneplotter_1.76.0       
##  [37] htmlwidgets_1.6.1         mcmc_0.9-7               
##  [39] reshape_0.8.9             spatstat.geom_3.0-3      
##  [41] paletteer_1.5.0           ellipsis_0.3.2           
##  [43] annotate_1.76.0           MCMCpack_1.6-3           
##  [45] locfdr_1.1-8              deldir_1.0-6             
##  [47] sparseMatrixStats_1.10.0  vctrs_0.5.1              
##  [49] quantreg_5.94             ROCR_1.0-11              
##  [51] abind_1.4-5               cachem_1.0.7             
##  [53] withr_2.5.0               progressr_0.12.0         
##  [55] sctransform_0.3.5         goftest_1.2-3            
##  [57] cluster_2.1.4             lazyeval_0.2.2           
##  [59] crayon_1.5.2              hdf5r_1.3.7              
##  [61] spatstat.explore_3.0-5    pkgconfig_2.0.3          
##  [63] labeling_0.4.2            nlme_3.1-161             
##  [65] vipor_0.4.5               rlang_1.0.6              
##  [67] globals_0.16.2            lifecycle_1.0.3          
##  [69] miniUI_0.1.1.1            MatrixModels_0.5-1       
##  [71] ggrastr_1.0.1             polyclip_1.10-4          
##  [73] lmtest_0.9-40             graph_1.76.0             
##  [75] Matrix_1.5-3              zoo_1.8-11               
##  [77] beeswarm_0.4.0            ggridges_0.5.4           
##  [79] GlobalOptions_0.1.2       png_0.1-8                
##  [81] bitops_1.0-7              R.oo_1.25.0              
##  [83] KernSmooth_2.23-20        Biostrings_2.66.0        
##  [85] blob_1.2.3                DelayedMatrixStats_1.20.0
##  [87] shape_1.4.6               parallelly_1.34.0        
##  [89] spatstat.random_3.0-1     gridGraphics_0.5-1       
##  [91] memoise_2.0.1             GSEABase_1.58.0          
##  [93] magrittr_2.0.3            plyr_1.8.8               
##  [95] ica_1.0-3                 zlibbioc_1.44.0          
##  [97] compiler_4.2.1            RColorBrewer_1.1-3       
##  [99] fitdistrplus_1.1-8        snakecase_0.11.0         
## [101] cli_3.5.0                 XVector_0.38.0           
## [103] listenv_0.9.0             pbapply_1.7-0            
## [105] MASS_7.3-58.1             tidyselect_1.2.0         
## [107] stringi_1.7.12            highr_0.10               
## [109] yaml_2.3.6                locfit_1.5-9.7           
## [111] grid_4.2.1                sass_0.4.5               
## [113] tools_4.2.1               future.apply_1.10.0      
## [115] parallel_4.2.1            circlize_0.4.15          
## [117] rstudioapi_0.14           foreach_1.5.2            
## [119] janitor_2.2.0             farver_2.1.1             
## [121] Rtsne_0.16                digest_0.6.31            
## [123] later_1.3.0               RcppAnnoy_0.0.20         
## [125] httr_1.4.4                AnnotationDbi_1.60.0     
## [127] colorspace_2.1-0          XML_3.99-0.13            
## [129] tensor_1.5                reticulate_1.27          
## [131] splines_4.2.1             yulab.utils_0.0.6        
## [133] uwot_0.1.14               rematch2_2.1.2           
## [135] spatstat.utils_3.0-1      sp_1.5-1                 
## [137] ggplotify_0.1.0           plotly_4.10.1            
## [139] xtable_1.8-4              jsonlite_1.8.4           
## [141] corpcor_1.6.10            R6_2.5.1                 
## [143] pillar_1.8.1              htmltools_0.5.4          
## [145] mime_0.12                 glue_1.6.2               
## [147] fastmap_1.1.1             BiocParallel_1.32.5      
## [149] class_7.3-21              codetools_0.2-19         
## [151] utf8_1.2.3                lattice_0.20-45          
## [153] bslib_0.4.2               spatstat.sparse_3.0-0    
## [155] ggbeeswarm_0.7.1          leiden_0.4.3             
## [157] gtools_3.9.4              survival_3.4-0           
## [159] rmarkdown_2.19            munsell_0.5.0            
## [161] e1071_1.7-12              GenomeInfoDbData_1.2.9   
## [163] iterators_1.0.14          gtable_0.3.1