Estimate cellular composition in mammalian hearts. I’m using sn/scRNAseq as a reference for deconvolution of bulk RNAseq. We’re interested in alpha-1 adrenergic receptor as a mediator of heart failure (HF). In analysis I’m going to estimate composition in bulk RNAseq from wild type and KO mice with and without HF. I’m using a combination of sn/sc references. One of these is from Christoph, but all of the others are publically available.
This top bit is largely for my own organization, feel free to skip it as it includes a lot of technical stuff I was trying to show a peer.
These plots show how distinct and segregated some of our datasets are. It’s important to note that Wu and Martini both are specifically enriched for immune cells. And Tabula Muris is deficient in cardiomyoctes generally. These clusters aren’t yet annotated, but keep that in mind when we assign cell-types to these clusters later. I’m still concerned about the limited overlap in cells in clusters 0 and 2. I’d like to find more sc/sn datasets that overlap between the clusters so we can better integrate the data.
I’d also like to note that I’ve run the pipeline in two additional ways to check things. I tried each combination of datasets (8 total if Rau is included every time) and also downsampled the counts to 5k (to check if the difference in the “Counts by origin” plot effect things). For the former, more datasets improve the results (I’ll include a plot from this analysis in an email) and for the later, there isn’t any notable difference when using more uniform counts between the datasets.
markers.broad <- read.csv("jensen/data/processed/external/mclellan_2020/mclellan_cell_markers_broad.csv")
head(markers.broad)
## cluster gene
## 1 B cells Cd79a
## 2 B cells Ly6d
## 3 B cells H2-DMb2
## 4 B cells Cd37
## 5 B cells Cd79b
## 6 B cells Ms4a1
As we talked about last time, AUCell assigns an AUC score (think enrichment) for each cell based on expression and cell types with clear bimodal distributions are retained. The second part, retaining only cell-types with distinct groups, is a change from the last workflow that has been very helpful. Here’s how those AUC distributions look for the cell types we retained:
I think that the odd ’tri-modal’distribution for the cardiomyocytes is from ambient RNA in the Rau dataset. I also have a way I’d like to fix that, and should also improve our marker detection. Next we look for which clusters these cell-type labels tend to associate with using basic correlations.
We set a threshold (0.4 in this case) and assign cell types based on that. Here’s how our final labels look, along with some canonical markers to confirm our annotations:
The ambient expression of markers in clusters dominated by the Rau dataset make me think that there is still ambient RNA (RNA from cells that broke open during handling and are present in most droplets). I’ll just need to go back to the raw dataset, remap it, and apply some newer methods. It would just take a bit of time and I didn’t want to delay providing an update.
Experimentally purified, bulk sequenced cell types have been used as a ground truth. This is how I’ve been testing accuracy of the methods. Specifically, I’ve been trying to minimize the Aitchison distance, a measure of composition distance that is apparently widely used in fields like geology, ecology, and microbiome research. The plot I mentioned about dataset combinations uses that as a metric.
The fractions are from Christoph and a publicly available dataset, which is why you can see two distinct groups in the endothelial cell graph.
Here’s the actual results you’re intereted in. The KO vs. WT labels were based on the ‘id’ column of the file you gave me (labels include WT5, AKO5, Ako_Lx1, wt_Lx1, and Ako_Lx4), rather than the ‘condition’ column (labels include: WT.sham, KO.sham, WT.MI, KO.MI, and WT.outlier). Just let me know if I got this right, since I know you had mentioned that they were switched.
I also made a different way to visualize these results, split by condition. This way might work better if we want to assign significance values:
For the statistics, I’m using the DirichletReg package. It uses a linear model, but instead of a normal distribution, it uses a Dirichlet distribution. I had to talk with Mike Love for an hour and read a chapter from a textbook on compositional statistics to even vaguely understand how that works. I’m going to show the code for this and I’ll talk about it here too. Basically, I’m making two different models to compare the compositional changes. Model.2 assumes that changes are based on the effect of treatment, whereas model.1 also includes the interaction of geneotype and treatment. The ANOVA I’m doing between them has a value for Pr(>Chi), which is highly significant. Our null hypothesis is, “The simple model (model.2) is better”, so the high significance for Pr(>Chi) is telling use that the null hypothesis isn’t true, that the interaction model (model.1) better describes what we’re seeing.
I’m also including the summary of model.1, which covers which factors are signficant for each cell type.
# prep DirichletReg matrix
dir.cols <- colnames(decon.melt)[c(1:6)]
dir.mat <- decon.melt |>
subset(type == "whole" & Version == "new",
select = dir.cols) |>
dcast(Sub + genotype + treatment ~ CellType, value.var = "Prop")
# Convert data to DirichletRegData object
dir.mat$CellTypes <- DR_data(dir.mat[,c(4:length(dir.mat))])
# Run Dirichlet regression
model.1 <- DirichReg(CellTypes ~ treatment + genotype + treatment * genotype, data = dir.mat)
model.2 <- DirichReg(CellTypes ~ treatment + genotype, data = dir.mat)
#model.3 <- DirichReg(CellTypes ~ treatment + genotype + treatment * genotype + Version, data = dir.mat)
# compare models, find if interaction term improves model
anova(model.1, model.2)
## Analysis of Deviance Table
##
## Model 1: DirichReg(formula = CellTypes ~ treatment + genotype + treatment *
## genotype, data = dir.mat)
## Model 2: DirichReg(formula = CellTypes ~ treatment + genotype, data =
## dir.mat)
##
## Deviance N. par Difference df Pr(>Chi)
## Model 1 -461.67 20
## Model 2 -442.06 15 19.614 5 0.001476 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#anova(model.3, model.1)
# Pr(>Chi) is highly significant (p = 0.001476), which means there's strong evidence against the null hypothesis (the simple model is better). model.1 provides a significantly better fit to the data than model.2.
# this implies that the effect of the treatment may be different for different genotypes.
summary(model.1)
## Call:
## DirichReg(formula = CellTypes ~ treatment + genotype + treatment * genotype,
## data = dir.mat)
##
## Standardized Residuals:
## Min 1Q Median 3Q Max
## `endothelial cells` -0.3519 -0.0864 -0.0758 -0.0610 0.4767
## fibroblasts -2.5714 -1.4155 0.4612 1.5621 3.3793
## cardiomyocytes -1.8225 -0.7589 -0.1643 0.8880 1.7152
## `B cells` -0.1707 -0.1496 -0.0811 -0.0647 -0.0610
## `endothelial cells`_1 -1.2187 -0.6378 -0.0781 0.3448 1.5813
##
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 1: `endothelial cells`
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.7562 0.4134 9.087 < 2e-16 ***
## treatmentSham -1.5763 0.5573 -2.828 0.00468 **
## genotypeB6-WT -0.2628 0.5857 -0.449 0.65362
## treatmentSham:genotypeB6-WT 2.2740 0.7809 2.912 0.00359 **
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 2: fibroblasts
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.60333 0.40947 13.684 < 2e-16 ***
## treatmentSham -2.50701 0.54751 -4.579 4.67e-06 ***
## genotypeB6-WT 0.04758 0.57914 0.082 0.9345
## treatmentSham:genotypeB6-WT 2.06326 0.77032 2.678 0.0074 **
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 3: cardiomyocytes
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.7655 0.4088 18.996 < 2e-16 ***
## treatmentSham -1.6253 0.5423 -2.997 0.00272 **
## genotypeB6-WT -0.3197 0.5783 -0.553 0.58035
## treatmentSham:genotypeB6-WT 2.4553 0.7659 3.206 0.00135 **
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 4: `B cells`
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.7562 0.4134 9.087 < 2e-16 ***
## treatmentSham -1.6404 0.5580 -2.940 0.00329 **
## genotypeB6-WT -0.2628 0.5857 -0.449 0.65362
## treatmentSham:genotypeB6-WT 2.3381 0.7814 2.992 0.00277 **
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 5: `endothelial cells`_1
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.5051 0.4095 13.442 < 2e-16 ***
## treatmentSham -1.7383 0.5452 -3.189 0.00143 **
## genotypeB6-WT -0.1797 0.5794 -0.310 0.75647
## treatmentSham:genotypeB6-WT 2.2710 0.7686 2.955 0.00313 **
## ------------------------------------------------------------------
## Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: 230.8 on 20 df (195 BFGS + 1 NR Iterations)
## AIC: -421.7, BIC: -408.9
## Number of Observations: 14
## Link: Log
## Parametrization: common
A lot has changed since the last analysis we covered. Back then, we saw more of a rainbow of cell types in the deconvolution. But now it’s ~90% cardiomyocytes. I think that this new estimate is more trustworthy. The heart is majority CM by volume. Our test is really asking what proportion of RNA is coming from each cell type, so it makes sense that most of the RNA is coming from CMs. This is an intrinsic limitation of this type of analysis in this system. Deconvolution methods struggle with narrow cell proportions, and we’re trying to look at changes in 10 - 15% of RNA.
Still, these results illustrate the effect of your KO. Alsom I think that we can use them as co-factors in a revised DE analysis. Here’s a paper that illustrates how compositional differences can wildly change DE results if not accounted for (Figure 4).
As far as next steps, I’d still like to download the raw data for each of my references and map them myself. This would help me account for ambient RNA, doublets, and ensure it’s all as standardized as it can be. However, I don’t want to hold up the work on your end. Since we’re limited to mostly seeing changes in fibroblast content, basic stains for fibrosis might be enough.
Let me know if anything is unclear or if you want to meet to talk about things. Again, thanks for the fun problem!