Project Goals

Estimate cellular composition in mammalian hearts. I’m using sn/scRNAseq as a reference for deconvolution of bulk RNAseq. For this walkthrough, I’m going to cover a collaboration I’m doing with Brian Jensen.

Jensen Collab

They’re interested in alpha-1 adronergic receptor as a mediator of heart failure (HF). He asked me to estimate composition of his bulk RNAseq from wild type and KO mice with and without HF. I’m using a combination of sn/sc references. One of these is from Christoph, but all of the others are publically available.

Overview of Workflow:

  1. Load the data and save it
    • Currently four sources totalling to ~70k droplets
    • Data downloaded as GSM from GEO
    • QC’d post-10x prior to receiving
  2. QC and filtering
    • Remove doublets and ambient RNA with dropletUtils
      • ? I would need the raw matricies from SRA to do this correctly?
    • Filter each dataset by quantiles of counts, features, and mitocondrial gene percent
  3. Merge datasets and cluster
    • Integrate with Harmony
      • Currently correcting by dataset origin
  4. Label and exclude clusters
    • Use AUC and known markers to annotate cells
      • Labelling based on bimodal distributions from all datasets integrated
      • Using AUCell and markers from McKellan 2020
    • Downsampling cell to save time (currently 10K out of 33K)
    • Label clusters with cell types
      • Correlate scores of cells to assigned clusters
    • Exclude clusters based on:
      • <200 cells
      • Corr between seurat cluster and label > 0.4
  5. Deconvolute with MuSiC
    • Automatic marker selection
    • Deconvolution of fractions and whole tissue samples
  6. Compare compositional distances
    • Measure deconvolution performance with Aitchinson distance between:
      • Fraction composition estimates from MuSiC outputs
      • Simulated composition based on likely purity of our ground truths
    • Model compositional shifts with Dirichlet regression

Comparisons I’ve tried

  1. Effect of downsampling UMI per droplet to 5K
    • Asking if variation in average counts by dataset effects results
    • Not much difference
    • Run before doublet/ambient removal
  2. Results with permutations of every dataset
    • More datasets generally perform better on fraction deconvolution
    • Run before doublet/ambient removal
  3. Effect of doublet/ambient RNA removal and flexible filtering of datasets
    • Flexible filtering is better
    • Doublet removal doesn’t do all that much

Visuals (to-do):

Clustering/QC

  1. VlnPlots of datasets of origin (counts/features)
  2. barplot of total umi per dataset
  3. ?Dimplot of sn.clust.new w/ and w/o harmony
  4. Dimplots post clustering of clusters, origin, cell types
  5. Features (counts, mito, features, doublet score)

Ground truth comparisons

  1. composition barplot from sn.clust.new
  2. Aitchison plots of dataset combos and doublet filtering

Composition estimates

  1. four panel geno x treat plots
  2. zoomed plots of geno comparison in sham and mi
  3. Dirlchet model code and outputs

Cell type annotations

  1. AUC plot example (good and bad)
  2. Feature plots of good cell types
  3. Heatmap of cluster-label correlations
  4. final cell type labels
  5. Expression of canonical markers plot
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] TRUE
## 
## [[6]]
## [1] TRUE
## 
## [[7]]
## [1] TRUE
## 
## [[8]]
## [1] TRUE
## 
## [[9]]
## [1] TRUE
## 
## [[10]]
## [1] TRUE
## 
## [[11]]
## [1] TRUE
## 
## [[12]]
## [1] TRUE
## 
## [[13]]
## [1] TRUE
## 
## [[14]]
## [1] TRUE
## 
## [[15]]
## [1] TRUE
## 
## [[16]]
## [1] TRUE
## 
## [[17]]
## [1] TRUE
## 
## [[18]]
## [1] TRUE
## 
## [[19]]
## [1] TRUE
## 
## [[20]]
## [1] TRUE
## 
## [[21]]
## [1] TRUE
## 
## [[22]]
## [1] TRUE
## 
## [[23]]
## [1] TRUE
## 
## [[24]]
## [1] TRUE
## An object of class Seurat 
## 25405 features across 33078 samples within 1 assay 
## Active assay: RNA (25405 features, 2000 variable features)
##  3 dimensional reductions calculated: harmony, pca, umap

How the dataset looks

Cells are annotated at by expression of canonical markers

markers.broad <- read.csv("jensen/data/processed/external/mclellan_2020/mclellan_cell_markers_broad.csv")
head(markers.broad)
##   cluster    gene
## 1 B cells   Cd79a
## 2 B cells    Ly6d
## 3 B cells H2-DMb2
## 4 B cells    Cd37
## 5 B cells   Cd79b
## 6 B cells   Ms4a1

AUCell assigns an AUC score for each cell based on expression and cell types with clear bimodal distributions are retained

Correlations between single-cell AUC assignments and membership are used to annotate clusters to cell-types

## Creating Relative Abudance Matrix...
## Creating Variance Matrix...
## Creating Library Size Matrix...
## Used 11608 common genes...
## Used 3 cell types in deconvolution...
## B6.F.CM.RNA.1 has common genes 11608 ...
## B6.F.CM.RNA.2 has common genes 11608 ...
## B6.F.CM.RNA.3 has common genes 11608 ...
## B6.F.CM.RNA.4 has common genes 11608 ...
## B6.F.CM.RNA.5 has common genes 11608 ...
## B6.F.Fib.RNA.1 has common genes 11608 ...
## B6.F.Fib.RNA.3 has common genes 11608 ...
## B6.F.Fib.RNA.4 has common genes 11608 ...
## B6.F.Fib.RNA.5 has common genes 11608 ...
## B6.M.CM.RNA.1 has common genes 11608 ...
## B6.M.CM.RNA.3 has common genes 11608 ...
## B6.M.CM.RNA.4 has common genes 11608 ...
## B6.M.CM.RNA.5 has common genes 11608 ...
## B6.M.Fib.RNA.2 has common genes 11608 ...
## B6.M.Fib.RNA.3 has common genes 11608 ...
## B6.M.Endo.RNA.1 has common genes 11608 ...
## B6.M.Endo.RNA.2 has common genes 11608 ...
## B6.M.Endo.RNA.3 has common genes 11608 ...
## WT5 has common genes 11608 ...
## WT6 has common genes 11608 ...
## WT7 has common genes 11608 ...
## WT8 has common genes 11608 ...
## AKO5 has common genes 11608 ...
## AKO6 has common genes 11608 ...
## AKO7 has common genes 11608 ...
## AKO8 has common genes 11608 ...
## Ako_Lx1 has common genes 11608 ...
## Ako_Lx2 has common genes 11608 ...
## Ako_Lx3 has common genes 11608 ...
## wt_Lx1 has common genes 11608 ...
## wt_Lx2 has common genes 11608 ...
## wt_Lx4 has common genes 11608 ...
## Ako_Lx4 has common genes 11608 ...
## CM_Sham_19 has common genes 11608 ...
## CM_Sham_20 has common genes 11608 ...
## CM_Sham_21 has common genes 11608 ...
## FB_Sham_10 has common genes 11608 ...
## FB_Sham_11 has common genes 11608 ...
## FB_Sham_12 has common genes 11608 ...
## EC_Sham_1 has common genes 11608 ...
## EC_Sham_2 has common genes 11608 ...
## EC_Sham_3 has common genes 11608 ...

Composition estimates on pure cell fractions

Experimentally purified, bulk sequenced. Used as a ground truth

## [[1]]

### Composition estimates in Jensen data

A different way to visualize

Using DirichletReg to model compositional changes

# prep DirichletReg matrix
dir.cols <- colnames(decon.melt)[c(1:6)]

dir.mat <- decon.melt |>
  subset(type == "whole" & Version == "new",
         select = dir.cols) |>
  dcast(Sub + genotype + treatment ~ CellType, value.var = "Prop")

# Convert data to DirichletRegData object
dir.mat$CellTypes <- DR_data(dir.mat[,c(4:length(dir.mat))])

# Run Dirichlet regression
model.1 <- DirichReg(CellTypes ~ treatment + genotype + treatment * genotype, data = dir.mat)

model.2 <- DirichReg(CellTypes ~ treatment + genotype, data = dir.mat)

#model.3 <- DirichReg(CellTypes ~ treatment + genotype + treatment * genotype + Version, data = dir.mat)

# compare models, find if interaction term improves model
anova(model.1, model.2)
## Analysis of Deviance Table
## 
## Model 1: DirichReg(formula = CellTypes ~ treatment + genotype + treatment *
##   genotype, data = dir.mat)
## Model 2: DirichReg(formula = CellTypes ~ treatment + genotype, data =
##   dir.mat)
## 
##         Deviance N. par Difference df Pr(>Chi)
## Model 1  -178.86     12                       
## Model 2  -177.67      9     1.1937  3   0.7545
#anova(model.3, model.1)

# Pr(>Chi) is highly significant (p = 8.573e-05), which means there's strong evidence against the null hypothesis (the simple model is better). model.1 provides a significantly better fit to the data than model.2.
# this implies that the effect of the treatment may be different for different genotypes.
summary(model.1)
## Call:
## DirichReg(formula = CellTypes ~ treatment + genotype + treatment * genotype,
## data = dir.mat)
## 
## Standardized Residuals:
##                            Min       1Q   Median      3Q     Max
## `endothelial cells`    -1.9939  -0.8494   0.1448  0.9326  1.8674
## `B cells`              -0.7110  -0.5048  -0.1774  0.1102  1.0896
## `endothelial cells`_1  -2.0922  -1.1587   0.1633  1.1136  2.5153
## 
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 1: `endothelial cells`
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   4.4079     0.5859   7.523 5.34e-14 ***
## treatmentSham                 1.6631     0.7714   2.156   0.0311 *  
## genotypeB6-WT                 0.5596     0.7739   0.723   0.4696    
## treatmentSham:genotypeB6-WT   0.3623     1.0495   0.345   0.7300    
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 2: `B cells`
##                             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                   1.4011     0.5830   2.403   0.0162 *
## treatmentSham                 1.3966     0.7691   1.816   0.0694 .
## genotypeB6-WT                 0.2650     0.7707   0.344   0.7310  
## treatmentSham:genotypeB6-WT   0.7047     1.0470   0.673   0.5009  
## ------------------------------------------------------------------
## Beta-Coefficients for variable no. 3: `endothelial cells`_1
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   2.6404     0.5853   4.511 6.44e-06 ***
## treatmentSham                 1.2185     0.7709   1.581    0.114    
## genotypeB6-WT                 0.1903     0.7732   0.246    0.806    
## treatmentSham:genotypeB6-WT   0.4751     1.0490   0.453    0.651    
## ------------------------------------------------------------------
## Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-likelihood: 89.43 on 12 df (99 BFGS + 1 NR Iterations)
## AIC: -154.9, BIC: -146.4
## Number of Observations: 15
## Link: Log
## Parametrization: common