Intro

Project Goals

Estimate cellular composition in mammalian hearts. I’m using sn/scRNAseq as a reference for deconvolution of bulk RNAseq. We’re interested in alpha-1 adrenergic receptor as a mediator of heart failure (HF). In analysis I’m going to estimate composition in bulk RNAseq from wild type and KO mice with and without HF. I’m using a combination of sn/sc references. One of these is from Christoph, but all of the others are publically available.

Overview of Workflow:

This top bit is largely for my own organization, feel free to skip it as it includes a lot of technical stuff I was trying to show a peer.

  1. Load the data and save it
    • Currently four sources totalling to ~70k droplets/cells, down to 33K after QC
    • Data downloaded as GSM from GEO
    • QC’d post-10x prior to receiving
  2. QC and filtering
    • Remove doublets and ambient RNA with dropletUtils
    • Filter each dataset by quantiles of counts, features, and mitocondrial gene percent
      • No longer a harsh mitocondrial percent filter
  3. Merge datasets and cluster
    • Integrate with Harmony
      • Currently correcting by dataset origin
  4. Label and exclude clusters
    • Use AUC and known markers to annotate cells
      • Labelling based on bimodal distributions from all datasets integrated
      • Using AUCell and markers from McKellan 2020
    • Downsampling cell to save time (currently 10K out of 33K)
    • Label clusters with cell types
      • Correlate scores of cells to assigned clusters
    • Exclude clusters based on:
      • <200 cells
      • Corr between seurat cluster and label > 0.4
  5. Deconvolute with MuSiC
    • Automatic marker selection
    • Deconvolution of fractions and whole tissue samples
  6. Compare compositional distances
    • Measure deconvolution performance with Aitchinson distance between:
      • Fraction composition estimates from MuSiC outputs
      • Simulated composition based on likely purity of our ground truths
    • Model compositional shifts with Dirichlet regression

Comparisons I’ve tried

  1. Effect of downsampling UMI per droplet to 5K
    • Asking if variation in average counts by dataset effects results
    • Not much difference
    • Run before doublet/ambient removal
  2. Results with permutations of every dataset
    • More datasets generally perform better on fraction deconvolution
    • Run before doublet/ambient removal
  3. Effect of doublet/ambient RNA removal and flexible filtering of datasets
    • Flexible filtering is better
    • Doublet removal doesn’t do all that much

The Analysis

How the dataset looks


These plots show how distinct and segregated some of our datasets are. It’s important to note that Wu and Martini both are specifically enriched for immune cells. And Tabula Muris is deficient in cardiomyoctes generally. These clusters aren’t yet annotated, but keep that in mind when we assign cell-types to these clusters later. I’m still concerned about the limited overlap in cells in clusters 0 and 2. I’d like to find more sc/sn datasets that overlap between the clusters so we can better integrate the data.

I’d also like to note that I’ve run the pipeline in two additional ways to check things. I tried each combination of datasets (8 total if Rau is included every time) and also downsampled the counts to 5k (to check if the difference in the “Counts by origin” plot effect things). For the former, more datasets improve the results (I’ll include a plot from this analysis in an email) and for the later, there isn’t any notable difference when using more uniform counts between the datasets.

Cells are annotated at by expression of canonical markers

markers.broad <- read.csv("jensen/data/processed/external/mclellan_2020/mclellan_cell_markers_broad.csv")
head(markers.broad)
##   cluster    gene
## 1 B cells   Cd79a
## 2 B cells    Ly6d
## 3 B cells H2-DMb2
## 4 B cells    Cd37
## 5 B cells   Cd79b
## 6 B cells   Ms4a1

As we talked about last time, AUCell assigns an AUC score (think enrichment) for each cell based on expression and cell types with clear bimodal distributions are retained. The second part, retaining only cell-types with distinct groups, is a change from the last workflow that has been very helpful. Here’s how those AUC distributions look for the cell types we retained:


I think that the odd ’tri-modal’distribution for the cardiomyocytes is from ambient RNA in the Rau dataset. I also have a way I’d like to fix that, and should also improve our marker detection. Next we look for which clusters these cell-type labels tend to associate with using basic correlations.

Correlations between single-cell AUC assignments and membership are used to annotate clusters to cell-types

We set a threshold (0.4 in this case) and assign cell types based on that. Here’s how our final labels look, along with some canonical markers to confirm our annotations:

The ambient expression of markers in clusters dominated by the Rau dataset make me think that there is still ambient RNA (RNA from cells that broke open during handling and are present in most droplets). I’ll just need to go back to the raw dataset, remap it, and apply some newer methods. It would just take a bit of time and I didn’t want to delay providing an update.

Deconvolution

Composition estimates on pure cell fractions

Experimentally purified, bulk sequenced cell types have been used as a ground truth. This is how I’ve been testing accuracy of the methods. Specifically, I’ve been trying to minimize the Aitchison distance, a measure of composition distance that is apparently widely used in fields like geology, ecology, and microbiome research. The plot I mentioned about dataset combinations uses that as a metric.

The fractions are from Christoph and a publicly available dataset, which is why you can see two distinct groups in the endothelial cell graph.

Composition estimates in Jensen data

Here’s the actual results you’re intereted in. The KO vs. WT labels were based on the ‘id’ column of the file you gave me (labels include WT5, AKO5, Ako_Lx1, wt_Lx1, and Ako_Lx4), rather than the ‘condition’ column (labels include: WT.sham, KO.sham, WT.MI, KO.MI, and WT.outlier). Just let me know if I got this right, since I know you had mentioned that they were switched.

I also made a different way to visualize these results, split by condition. This way might work better if we want to assign significance values:

Using DirichletReg to model compositional changes

For the statistics, I’m using the DirichletReg package. It uses a linear model, but instead of a normal distribution, it uses a Dirichlet distribution. I had to talk with Mike Love for an hour and read a chapter from a textbook on compositional statistics to even vaguely understand how that works. I’m going to show the code for this and I’ll talk about it here too. Basically, I’m making two different models to compare the compositional changes. Model.2 assumes that changes are based on the effect of treatment, whereas model.1 also includes the interaction of geneotype and treatment. The ANOVA I’m doing between them has a value for Pr(>Chi), which is highly significant. Our null hypothesis is, “The simple model (model.2) is better”, so the high significance for Pr(>Chi) is telling use that the null hypothesis isn’t true, that the interaction model (model.1) better describes what we’re seeing.

I’m also including the summary of model.1, which covers which factors are signficant for each cell type.

# prep DirichletReg matrix
dir.cols <- colnames(decon.melt)[c(1:6)]

dir.mat <- decon.melt |>
  subset(type == "whole" & Version == "new",
         select = dir.cols) |>
  dcast(Sub + genotype + treatment ~ CellType, value.var = "Prop")

# Convert data to DirichletRegData object
dir.mat$CellTypes <- DR_data(dir.mat[,c(4:length(dir.mat))])

# Run Dirichlet regression
model.1 <- DirichReg(CellTypes ~ treatment + genotype + treatment * genotype, data = dir.mat)

model.2 <- DirichReg(CellTypes ~ treatment + genotype, data = dir.mat)

#model.3 <- DirichReg(CellTypes ~ treatment + genotype + treatment * genotype + Version, data = dir.mat)

# compare models, find if interaction term improves model
anova(model.1, model.2)
## Analysis of Deviance Table
## 
## Model 1: DirichReg(formula = CellTypes ~ treatment + genotype + treatment *
##   genotype, data = dir.mat)
## Model 2: DirichReg(formula = CellTypes ~ treatment + genotype, data =
##   dir.mat)
## 
##         Deviance N. par Difference df Pr(>Chi)   
## Model 1  -461.67     20                          
## Model 2  -442.06     15     19.614  5 0.001476 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#anova(model.3, model.1)

# Pr(>Chi) is highly significant (p =  0.001476), which means there's strong evidence against the null hypothesis (the simple model is better). model.1 provides a significantly better fit to the data than model.2.
# this implies that the effect of the treatment may be different for different genotypes.
summary(model.1)
## Call:
## DirichReg(formula = CellTypes ~ treatment + genotype + treatment * genotype,
## data = dir.mat)
## 
## Standardized Residuals:
##                            Min       1Q   Median       3Q      Max
## `endothelial cells`    -0.3519  -0.0864  -0.0758  -0.0610   0.4767
## fibroblasts            -2.5714  -1.4155   0.4612   1.5621   3.3793
## cardiomyocytes         -1.8225  -0.7589  -0.1643   0.8880   1.7152
## `B cells`              -0.1707  -0.1496  -0.0811  -0.0647  -0.0610
## `endothelial cells`_1  -1.2187  -0.6378  -0.0781   0.3448   1.5813
## 
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 1: `endothelial cells`
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   3.7562     0.4134   9.087  < 2e-16 ***
## treatmentSham                -1.5763     0.5573  -2.828  0.00468 ** 
## genotypeB6-WT                -0.2628     0.5857  -0.449  0.65362    
## treatmentSham:genotypeB6-WT   2.2740     0.7809   2.912  0.00359 ** 
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 2: fibroblasts
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  5.60333    0.40947  13.684  < 2e-16 ***
## treatmentSham               -2.50701    0.54751  -4.579 4.67e-06 ***
## genotypeB6-WT                0.04758    0.57914   0.082   0.9345    
## treatmentSham:genotypeB6-WT  2.06326    0.77032   2.678   0.0074 ** 
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 3: cardiomyocytes
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   7.7655     0.4088  18.996  < 2e-16 ***
## treatmentSham                -1.6253     0.5423  -2.997  0.00272 ** 
## genotypeB6-WT                -0.3197     0.5783  -0.553  0.58035    
## treatmentSham:genotypeB6-WT   2.4553     0.7659   3.206  0.00135 ** 
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 4: `B cells`
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   3.7562     0.4134   9.087  < 2e-16 ***
## treatmentSham                -1.6404     0.5580  -2.940  0.00329 ** 
## genotypeB6-WT                -0.2628     0.5857  -0.449  0.65362    
## treatmentSham:genotypeB6-WT   2.3381     0.7814   2.992  0.00277 ** 
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 5: `endothelial cells`_1
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   5.5051     0.4095  13.442  < 2e-16 ***
## treatmentSham                -1.7383     0.5452  -3.189  0.00143 ** 
## genotypeB6-WT                -0.1797     0.5794  -0.310  0.75647    
## treatmentSham:genotypeB6-WT   2.2710     0.7686   2.955  0.00313 ** 
## ------------------------------------------------------------------
## Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-likelihood: 230.8 on 20 df (195 BFGS + 1 NR Iterations)
## AIC: -421.7, BIC: -408.9
## Number of Observations: 14
## Link: Log
## Parametrization: common

Outro Notes

A lot has changed since the last analysis we covered. Back then, we saw more of a rainbow of cell types in the deconvolution. But now it’s ~90% cardiomyocytes. I think that this new estimate is more trustworthy. The heart is majority CM by volume. Our test is really asking what proportion of RNA is coming from each cell type, so it makes sense that most of the RNA is coming from CMs. This is an intrinsic limitation of this type of analysis in this system. Deconvolution methods struggle with narrow cell proportions, and we’re trying to look at changes in 10 - 15% of RNA.

Still, these results illustrate the effect of your KO. Alsom I think that we can use them as co-factors in a revised DE analysis. Here’s a paper that illustrates how compositional differences can wildly change DE results if not accounted for (Figure 4).

As far as next steps, I’d still like to download the raw data for each of my references and map them myself. This would help me account for ambient RNA, doublets, and ensure it’s all as standardized as it can be. However, I don’t want to hold up the work on your end. Since we’re limited to mostly seeing changes in fibroblast content, basic stains for fibrosis might be enough.

Let me know if anything is unclear or if you want to meet to talk about things. Again, thanks for the fun problem!