SoupX to remove ambinet RNA

The vignette
Estimate the composition of this soup, i.e. what fraction of UMIs are derived from the soup in each droplet and produce a corrected count table with the soup based expression removed.

The method to do this consists of three parts:

  1. Calculate the profile of the soup.
  2. Estimate the cell specific contamination fraction.
  3. Infer a corrected expression matrix.
library(SeuratObject, lib.loc = "/usr/local/biotools/rpackages/R-4.2.2-2023-02-01")
library(Seurat, lib.loc = "/usr/local/biotools/rpackages/R-4.2.2-2023-02-01")
library(dplyr)
library(patchwork)
library(knitr)
library(xlsx)
library(ggplot2)
library(gridExtra)
library(grid)
library(lattice)
library(dittoSeq)
##library(matrixStats)
library(SoupX)
library(data.table)
Sys.setenv(RSTUDIO_PANDOC="/usr/local/biotools/pandoc/3.1.2/bin")

Variables

color.panel <- dittoColors()
sampleID <- c("v4_PA_mod0.6x")
sample_colors <- c("#009E73")
pip_results <- "/research/labs/neurology/fryer/m239830/LBD_CWOW/snRNA/pipseq_test_run/pipseeker_results/"
sample_dir <- "HR20231207Mayo_FR_9_results/"
sens <- "sensitivity_3/"

Soup Channel

Read in both the raw and the filtered counts matrix. Create a SoupChannel that contains both the raw and filtered counts matrix.

filtered <- Seurat::Read10X(file.path(paste0(pip_results, sample_dir, "/filtered_matrix/", sens)))
raw <- Seurat::Read10X(file.path(paste0(pip_results, sample_dir, "/raw_matrix/")))

# Create soup channel 
sc <- SoupChannel(raw, filtered)
sc # check the soup

Cluster & UMAP embedding information

Be sure to match the sensitivity used
fread is fast read and works with .gz files

barcodes <- fread(paste0(pip_results, sample_dir, "/filtered_matrix/", sens, 
                         "barcodes.tsv.gz"), header = F)
clusters <- fread(paste0(pip_results, sample_dir, "/clustering/", sens, 
                         "clusters.csv"))
umap_embed <- fread(paste0(pip_results, sample_dir, "/clustering/", sens, 
                           "umap_embedding.csv"))

Reformat the cluster information

Pipseq is special in how they organize the cluster information and umap embedding. We can’t use standard load10x approaches as this won’t work with pipseeker outputs.

filtered_metadata <- cbind(umap_embed, clusters$graph)
filtered_metadata <- as.data.frame(filtered_metadata)
colnames(filtered_metadata) <- c('RD1','RD2', 'Cluster')
row.names(filtered_metadata) <- barcodes$V1

Set clusters and embedding information

Some metadata is essential to include in the SoupChannel object. In general you can add any meta data by providing a data.frame with row names equal to the column names of the toc when building the SoupChannel object.

There are some bits of meta data that are so essential that they have their own special loading functions. The most essential is clustering information. Without it, SoupX will still work, but you won’t be able to automatically estimate the contamination fraction and the correction step will be far less effective.

RD is reduced dimensional. May be either umap, pca, or tsne. The default with the pipseeker output is umap so we will use umap.

sc <- setClusters(sc, setNames(filtered_metadata$Cluster, rownames(filtered_metadata)))
sc <- setDR(sc, filtered_metadata[colnames(sc$toc), c("RD1", "RD2")])

Plot the clusters

Check - does the umap match the pipseeker html report for that sensitivity?

dd <- filtered_metadata[colnames(sc$toc), ]
mids <- aggregate(cbind(RD1, RD2) ~ Cluster, data = dd, FUN = mean)
cluster_plot <- ggplot(dd, aes(RD1, RD2)) + 
  geom_point(aes(colour = Cluster), size = 0.2) + 
  geom_label(data = mids, aes(label = Cluster)) + 
  ggtitle(paste0(sens,  " clusters")) +
  theme_bw()
plot(cluster_plot)

Canonical cell-type markers

Visualize which cells express the gene of interest (goi) by extracting the counts for it from the SoupChannel object.

Is the expression more than expected in the other clusters?
Checking if the expression of the goi in these scattered cells is more than we would expect by chance from the soup. To really answer this properly, we need to know how much contamination is present in each cell. To get a rough idea, we will first calculate how many counts we would expect for the goi in each cell, by assuming that cell contained nothing but soup. The function plotMarkerMap allows us to visualize the ratio of observed counts for a gene (or set of genes) to this expectation value.

genes <- c("AQP4", "RBFOX3", "MOG", "MBP", "VCAN")

for (gene in genes) {
  dd[[gene]] <- sc$toc[gene, ]
  plot <- ggplot(dd, aes(RD1, RD2)) + 
    geom_point(aes(colour = .data[[gene]] > 0)) +
    theme_bw() + 
    labs(title = gene)
  print(plot)
}

Soup or not?

The resulting plot in the chunk below, we see that the cells in the astrocyte cluster have a reddish colour, indicating that they are expressed far more than we would expect by chance, even if the cell was nothing but background. The other clusters all have decidedly bluish hue, indicating that is completely plausible that the expression of AQP4 in these cells is due to contamination from the soup. Those cells that are shown as black tiny dots have zero expression for AQP4.

for (gene in genes) {
  soup_plot <- plotMarkerMap(sc, gene)+
  theme_bw() + 
  labs(title = gene)
  print(soup_plot)
}

Marker list

Given a set of genes that we suspect may be useful, the function plotMarkerDistribution can be used to visualize how this gene’s expression is distributed across cells.

To prevent accidentally including cells that genuinely express one of the estimation genes, SoupX will by default exclude any cluster where even one gene has evidence that it expresses a gene.

To get a more accurate estimate, groups with a similar biological function are grouped together so they’re either used or excluded as a group. This is why the parameter nonExpressedGeneList is given as a list. Each entry in the list is a group of genes that are grouped biologically.

astrocyte_markers <- c("CLU", "GFAP", "AQP4", "GJA1")
nonExpressedGeneList <- list(Astrocyte = c("AQP4", "GJA1", "CLU", "GFAP"), 
                             Microglia  = c("C1QA", "C1QC", "C1QB", "HEXB", 
                                            "TMEM119", "ITGAM", "TYROBP","P2RY12", "AIF1"), 
                             Neuron =  c("RBFOX3", "SNAP25","SYT1", "GAD1", "GAD2"), 
                             Oligodendrocyte = c("PLP1","MBP", "MOG"), 
                             OPC = c("OLIG1","PDGFRA", "VCAN", "TNR"))
plotMarkerDistribution(sc, nonExpressedGeneList) + theme_bw()

It doesn’t matter if some of these are not expressed in our data, they will then just not contribute to the estimate.

Estimate rho

Estimating the contamination fraction.
Probably the most difficult part of using SoupX is accurately estimating the level of background contamination (represented as rho) in each channel. There are two ways to do this: using the automatic autoEstCont method, or manually providing a list of “non expressed genes”.

The idea that underpins both approaches; identifying genes that are not expressed by some cells in our data and the expression that we observe for these genes in these cells must be due to contamination.

The key thing to understand is that the contamination fraction estimate is the fraction of your data that will be discarded. If this value is set too low, your “corrected” data will potentially still be highly contaminated. If you set it too high, you will discard real data, although there are good reasons to want to do this at times. If the contamination fraction is in the right ball park, SoupX will remove most of the contamination. It will generally not matter if this number if off by a few percent.

Automatic

sc <- autoEstCont(sc)

Identifying soup specific genes

Some times it is not obvious in advance which genes are highly specific to just one population of cells.

# Genes that are expressed most highly in the background.
head(sc$soupProfile[order(sc$soupProfile$est, decreasing = TRUE), ], n = 20)
##                        est counts
## MT-RNR2        0.030200499 576317
## CH507-528H12.1 0.026253903 501004
## MALAT1         0.021361389 407640
## CH507-513H4.1  0.013159441 251122
## MT-CO3         0.010164634 193972
## MT-ATP6        0.005524752 105429
## MT-CO2         0.005448978 103983
## MT-ND4         0.005115383  97617
## MT-CO1         0.004720844  90088
## MT-ND3         0.004567723  87166
## KCNIP4         0.003923329  74869
## MT-CYB         0.003775554  72049
## PCDH9          0.003720531  70999
## MAP1B          0.003319861  63353
## MT-RNR1        0.003232139  61679
## MT-TV          0.003084940  58870
## RBFOX1         0.002829739  54000
## MT-ND1         0.002779066  53033
## SNHG14         0.002708532  51687
## LRP1B          0.002670750  50966

Most of the most highly expressed genes in this case are ubiquitously expressed (mitochondrial genes). So we need some further criteria to aid our selection process.

Visualize the distribution of expression (relative to what would be expected were each cell pure background) across all cells in the data set. When no geneset is provided, the function will try and guess which genes might be useful.

# automatic - “guessing” is done using the quickMarkers function to find marker genes of each cluster 
plotMarkerDistribution(sc) + theme_bw()

# nonExpressedGeneList is the list of our canonical markers
plotMarkerDistribution(sc, nonExpressedGeneList) + theme_bw()

The plot shows the distribution of log10 ratios of observed counts to expected if the cell contained nothing but soup. A guess at which cells definitely express each gene is made and the background contamination is calculated. The red line shows the global estimate (i.e., assuming the same contamination fraction for all cells) of the contamination fraction using just that gene or gene list.

For the automatic “guessing” of genes it is important to note that this is a heuristic set of genes that is intended to help develop our biological intuition. It absolutely must not be used to automatically select a set of genes to estimate the background contamination fraction. If you select the top N genes from this list and use those to estimate the contamination, you will over-estimate the contamination fraction!

Estimating non-expression nuclei given list of genes

Having decided on a set of genes with which to estimate the contamination, we next need to decide which cells genuinely express these genes and should not be used for estimating the contamination, and which do not and should.
The following produces a matrix indicating which cells (rows) should use which sets of genes (columns) to estimate the contamination.

Calculate the contamination fraction from list

To estimate the contamination fraction you need only pass your set of genes and which cells in which to use those sets of genes to calculateContaminationFraction.
The function will modify the metaData table of sc object to add a table giving the contamination fraction estimate. This approach gives a contamination fraction very close to the automatic method.

sc <- calculateContaminationFraction(sc, list(Astrocyte = c("AQP4", "GJA1", "CLU", "GFAP")), useToEst = useToEst_Astrocyte)
head(sc$metaData)
##                  nUMIs clusters        RD1       RD2      rho    rhoLow
## AAAAAAACCATGAATT  7782        8  5.6954870  4.631757 0.160367 0.1582958
## AAAAAAAGAGTGCATC  2767        9 11.3356610 15.880696 0.160367 0.1582958
## AAAAAAAGCATACCAG  5722        7 14.6039030  2.495945 0.160367 0.1582958
## AAAAAAATAACAACAA  2654        1  2.3344529 14.153996 0.160367 0.1582958
## AAAAAACAATATCAGT  2827       12 -0.7713541  5.922867 0.160367 0.1582958
## AAAAAACACAGTCCAT  3833        1  1.3936919 15.335578 0.160367 0.1582958
##                   rhoHigh
## AAAAAAACCATGAATT 0.162456
## AAAAAAAGAGTGCATC 0.162456
## AAAAAAAGCATACCAG 0.162456
## AAAAAAATAACAACAA 0.162456
## AAAAAACAATATCAGT 0.162456
## AAAAAACACAGTCCAT 0.162456

Correcting the expression profile from a list

We have now calculated or set the contamination fraction for each cell and would like to use this to remove the contamination from the original count matrix. As with estimating the contamination, this procedure is made much more robust by providing clustering information. This is because there is much more power to separate true expression from contaminating expression when counts are aggregated into clusters.
However, this approach can only work with one cell type at a time.

out <- adjustCounts(sc)
# Inspect the changes in expression profile 
cntSoggy <- rowSums(sc$toc > 0)
cntStrained <- rowSums(out > 0)
mostZeroed <- tail(sort((cntSoggy - cntStrained)/cntSoggy), n = 10)
mostZeroed
##   RP11-69H7.2   CTC-391G2.1     KRTAP4-12         NTF6G CTD-3187F8.15 
##             1             1             1             1             1 
##         BIRC8       KANK1P1       HSFY1P1  RP11-266I3.7        MT-ND6 
##             1             1             1             1             1
# MT - This is often the case as mitochondrial genes are over represented in the background compared to cells, presumably as a result of the soup being generated from distressed cells.
tail(sort(rowSums(sc$toc > out)/rowSums(sc$toc > 0)), n = 20)
##   MT-TD  MT-CO2   MT-TK MT-ATP8 MT-ATP6  MT-CO3   MT-TG  MT-ND3   MT-TR MT-ND4L 
##       1       1       1       1       1       1       1       1       1       1 
##  MT-ND4   MT-TH  MT-TS2  MT-TL2  MT-ND5  MT-ND6   MT-TE  MT-CYB   MT-TT   MT-TP 
##       1       1       1       1       1       1       1       1       1       1
# Now that we've corrected our data, we can see how that compares to our corrected data. 
# By default the function plots the fraction of expression in each cell that has been deemed to be soup and removed.

plotChangeMap(sc, out, "AQP4") + theme_bw()

plotChangeMap(sc, out, "RBFOX3") + theme_bw()

plotChangeMap(sc, out, "MOG") + theme_bw()

plotChangeMap(sc, out, "VCAN") + theme_bw()

Manually set estimated contamination fraction

Manually fixing the contamination fraction at a certain value. This seems like a bad thing to do intuitively, but there are actually good reasons you might want to. When the contamination fraction is set too high, true expression will be removed from your data. However, this is done in such a way that the counts that are most specific to a subset of cells (i.e., good marker genes) will be the absolute last thing to be removed. Because of this, it can be a sensible thing to set a high contamination fraction for a set of experiments and be confident that the vast majority of the contamination has been removed.

Even when you have a good estimate of the contamination fraction, you may want to set the value used artificially higher. SoupX has been designed to be conservative in the sense that it errs on the side of retaining true expression at the cost of letting some contamination to creep through. Our tests show that a well estimated contamination fraction will remove 80-90% of the contamination (i.e. the soup is reduced by an order of magnitude). For most applications this is sufficient. However, in cases where complete removal of contamination is essential, it can be worthwhile to increase the contamination fraction used by SoupX to remove a greater portion of the contamination.

Our experiments indicate that adding 5% extra removes 90-95% of the soup, 10% gets rid of 95-98% and 20% removes 99% or more.

Explicitly setting the contamination fraction can be done by running:

rm(out)
sc <- setContaminationFraction(sc, 0.2) #  set the contamination fraction to 20% for all cells.

Clean the data

Produce a new matrix that has had the contaminating reads removed. This can then be used in any downstream analysis in place of the original matrix. Note that by default adjustCounts will return non-integer counts. If you require integers for downstream processing either pass out through round or set roundToInt=TRUE when running adjustCounts.

out <- adjustCounts(sc)
DropletUtils:::write10xCounts(paste0(pip_results, sample_dir, "SoupX/"), out)

Inspect the manual clean data

cntSoggy <- rowSums(sc$toc > 0)
cntStrained <- rowSums(out > 0)
mostZeroed <- tail(sort((cntSoggy - cntStrained)/cntSoggy), n = 10)
mostZeroed
##        BIRC8      KANK1P1  SLC24A3-AS1   AF241725.4      HSFY1P1 RP11-266I3.7 
##            1            1            1            1            1            1 
##        MT-TQ        MT-TM       MT-ND6        MT-TP 
##            1            1            1            1
# MT - presumably as a result of the soup being generated from distressed cells.
tail(sort(rowSums(sc$toc > out)/rowSums(sc$toc > 0)), n = 20)
##   MT-TD  MT-CO2   MT-TK MT-ATP8 MT-ATP6  MT-CO3   MT-TG  MT-ND3   MT-TR MT-ND4L 
##       1       1       1       1       1       1       1       1       1       1 
##  MT-ND4   MT-TH  MT-TS2  MT-TL2  MT-ND5  MT-ND6   MT-TE  MT-CYB   MT-TT   MT-TP 
##       1       1       1       1       1       1       1       1       1       1

Visualize the manual clean data

# Now that we've corrected our data, we can see how that compares to our corrected data. 
# By default the function plots the fraction of expression in each cell that has been deemed to be soup and removed.

plotChangeMap(sc, out, "AQP4") + theme_bw()

plotChangeMap(sc, out, "RBFOX3") + theme_bw()

plotChangeMap(sc, out, "MOG") + theme_bw()

plotChangeMap(sc, out, "VCAN") + theme_bw()