The vignette
Estimate the composition of this soup, i.e. what fraction of UMIs are
derived from the soup in each droplet and produce a corrected count
table with the soup based expression removed.
The method to do this consists of three parts:
library(SeuratObject, lib.loc = "/usr/local/biotools/rpackages/R-4.2.2-2023-02-01")
library(Seurat, lib.loc = "/usr/local/biotools/rpackages/R-4.2.2-2023-02-01")
library(dplyr)
library(patchwork)
library(knitr)
library(xlsx)
library(ggplot2)
library(gridExtra)
library(grid)
library(lattice)
library(dittoSeq)
##library(matrixStats)
library(SoupX)
library(data.table)
Sys.setenv(RSTUDIO_PANDOC="/usr/local/biotools/pandoc/3.1.2/bin")
Variables
color.panel <- dittoColors()
sampleID <- c("v4_PA_mod0.6x")
sample_colors <- c("#009E73")
pip_results <- "/research/labs/neurology/fryer/m239830/LBD_CWOW/snRNA/pipseq_test_run/pipseeker_results/"
sample_dir <- "HR20231207Mayo_FR_9_results/"
sens <- "sensitivity_3/"
Read in both the raw and the filtered counts matrix. Create a SoupChannel that contains both the raw and filtered counts matrix.
filtered <- Seurat::Read10X(file.path(paste0(pip_results, sample_dir, "/filtered_matrix/", sens)))
raw <- Seurat::Read10X(file.path(paste0(pip_results, sample_dir, "/raw_matrix/")))
# Create soup channel
sc <- SoupChannel(raw, filtered)
sc # check the soup
Be sure to match the sensitivity used
fread is fast read and works with .gz files
barcodes <- fread(paste0(pip_results, sample_dir, "/filtered_matrix/", sens,
"barcodes.tsv.gz"), header = F)
clusters <- fread(paste0(pip_results, sample_dir, "/clustering/", sens,
"clusters.csv"))
umap_embed <- fread(paste0(pip_results, sample_dir, "/clustering/", sens,
"umap_embedding.csv"))
Pipseq is special in how they organize the cluster information and umap embedding. We can’t use standard load10x approaches as this won’t work with pipseeker outputs.
filtered_metadata <- cbind(umap_embed, clusters$graph)
filtered_metadata <- as.data.frame(filtered_metadata)
colnames(filtered_metadata) <- c('RD1','RD2', 'Cluster')
row.names(filtered_metadata) <- barcodes$V1
Some metadata is essential to include in the SoupChannel object. In
general you can add any meta data by providing a data.frame with row
names equal to the column names of the toc when building the SoupChannel
object.
There are some bits of meta data that are so essential that they have
their own special loading functions. The most essential is clustering
information. Without it, SoupX will still work, but you won’t be able to
automatically estimate the contamination fraction and the correction
step will be far less effective.
RD is reduced dimensional. May be either umap, pca, or tsne. The
default with the pipseeker output is umap so we will use umap.
sc <- setClusters(sc, setNames(filtered_metadata$Cluster, rownames(filtered_metadata)))
sc <- setDR(sc, filtered_metadata[colnames(sc$toc), c("RD1", "RD2")])
Check - does the umap match the pipseeker html report for that sensitivity?
dd <- filtered_metadata[colnames(sc$toc), ]
mids <- aggregate(cbind(RD1, RD2) ~ Cluster, data = dd, FUN = mean)
cluster_plot <- ggplot(dd, aes(RD1, RD2)) +
geom_point(aes(colour = Cluster), size = 0.2) +
geom_label(data = mids, aes(label = Cluster)) +
ggtitle(paste0(sens, " clusters")) +
theme_bw()
plot(cluster_plot)
Visualize which cells express the gene of interest (goi) by
extracting the counts for it from the SoupChannel object.
Is the expression more than expected in the other clusters?
Checking if the expression of the goi in these scattered cells is more
than we would expect by chance from the soup. To really answer this
properly, we need to know how much contamination is present in each
cell. To get a rough idea, we will first calculate how many counts we
would expect for the goi in each cell, by assuming that cell contained
nothing but soup. The function plotMarkerMap allows us to visualize the
ratio of observed counts for a gene (or set of genes) to this
expectation value.
genes <- c("AQP4", "RBFOX3", "MOG", "MBP", "VCAN")
for (gene in genes) {
dd[[gene]] <- sc$toc[gene, ]
plot <- ggplot(dd, aes(RD1, RD2)) +
geom_point(aes(colour = .data[[gene]] > 0)) +
theme_bw() +
labs(title = gene)
print(plot)
}
The resulting plot in the chunk below, we see that the cells in the astrocyte cluster have a reddish colour, indicating that they are expressed far more than we would expect by chance, even if the cell was nothing but background. The other clusters all have decidedly bluish hue, indicating that is completely plausible that the expression of AQP4 in these cells is due to contamination from the soup. Those cells that are shown as black tiny dots have zero expression for AQP4.
for (gene in genes) {
soup_plot <- plotMarkerMap(sc, gene)+
theme_bw() +
labs(title = gene)
print(soup_plot)
}
Given a set of genes that we suspect may be useful, the function
plotMarkerDistribution can be used to visualize how this gene’s
expression is distributed across cells.
To prevent accidentally including cells that genuinely express one of
the estimation genes, SoupX will by default exclude any cluster where
even one gene has evidence that it expresses a gene.
To get a more accurate estimate, groups with a similar biological function are grouped together so they’re either used or excluded as a group. This is why the parameter nonExpressedGeneList is given as a list. Each entry in the list is a group of genes that are grouped biologically.
astrocyte_markers <- c("CLU", "GFAP", "AQP4", "GJA1")
nonExpressedGeneList <- list(Astrocyte = c("AQP4", "GJA1", "CLU", "GFAP"),
Microglia = c("C1QA", "C1QC", "C1QB", "HEXB",
"TMEM119", "ITGAM", "TYROBP","P2RY12", "AIF1"),
Neuron = c("RBFOX3", "SNAP25","SYT1", "GAD1", "GAD2"),
Oligodendrocyte = c("PLP1","MBP", "MOG"),
OPC = c("OLIG1","PDGFRA", "VCAN", "TNR"))
plotMarkerDistribution(sc, nonExpressedGeneList) + theme_bw()
It doesn’t matter if some of these are not expressed in our data, they will then just not contribute to the estimate.
Estimating the contamination fraction.
Probably the most difficult part of using SoupX is accurately estimating
the level of background contamination (represented as rho) in each
channel. There are two ways to do this: using the automatic autoEstCont
method, or manually providing a list of “non expressed genes”.
The idea that underpins both approaches; identifying genes that are
not expressed by some cells in our data and the expression that we
observe for these genes in these cells must be due to
contamination.
The key thing to understand is that the contamination fraction
estimate is the fraction of your data that will be discarded. If this
value is set too low, your “corrected” data will potentially still be
highly contaminated. If you set it too high, you will discard real data,
although there are good reasons to want to do this at times. If the
contamination fraction is in the right ball park, SoupX will remove most
of the contamination. It will generally not matter if this number if off
by a few percent.
Automatic
sc <- autoEstCont(sc)
Some times it is not obvious in advance which genes are highly specific to just one population of cells.
# Genes that are expressed most highly in the background.
head(sc$soupProfile[order(sc$soupProfile$est, decreasing = TRUE), ], n = 20)
## est counts
## MT-RNR2 0.030200499 576317
## CH507-528H12.1 0.026253903 501004
## MALAT1 0.021361389 407640
## CH507-513H4.1 0.013159441 251122
## MT-CO3 0.010164634 193972
## MT-ATP6 0.005524752 105429
## MT-CO2 0.005448978 103983
## MT-ND4 0.005115383 97617
## MT-CO1 0.004720844 90088
## MT-ND3 0.004567723 87166
## KCNIP4 0.003923329 74869
## MT-CYB 0.003775554 72049
## PCDH9 0.003720531 70999
## MAP1B 0.003319861 63353
## MT-RNR1 0.003232139 61679
## MT-TV 0.003084940 58870
## RBFOX1 0.002829739 54000
## MT-ND1 0.002779066 53033
## SNHG14 0.002708532 51687
## LRP1B 0.002670750 50966
Most of the most highly expressed genes in this case are ubiquitously expressed (mitochondrial genes). So we need some further criteria to aid our selection process.
Visualize the distribution of expression (relative to what would be expected were each cell pure background) across all cells in the data set. When no geneset is provided, the function will try and guess which genes might be useful.
# automatic - “guessing” is done using the quickMarkers function to find marker genes of each cluster
plotMarkerDistribution(sc) + theme_bw()
# nonExpressedGeneList is the list of our canonical markers
plotMarkerDistribution(sc, nonExpressedGeneList) + theme_bw()
The plot shows the distribution of log10 ratios of observed counts to expected if the cell contained nothing but soup. A guess at which cells definitely express each gene is made and the background contamination is calculated. The red line shows the global estimate (i.e., assuming the same contamination fraction for all cells) of the contamination fraction using just that gene or gene list.
For the automatic “guessing” of genes it is important to note that this is a heuristic set of genes that is intended to help develop our biological intuition. It absolutely must not be used to automatically select a set of genes to estimate the background contamination fraction. If you select the top N genes from this list and use those to estimate the contamination, you will over-estimate the contamination fraction!
Having decided on a set of genes with which to estimate the
contamination, we next need to decide which cells genuinely express
these genes and should not be used for estimating the contamination, and
which do not and should.
The following produces a matrix indicating which cells (rows) should use
which sets of genes (columns) to estimate the contamination.
To estimate the contamination fraction you need only pass your set of
genes and which cells in which to use those sets of genes to
calculateContaminationFraction.
The function will modify the metaData table of sc object to add a table
giving the contamination fraction estimate. This approach gives a
contamination fraction very close to the automatic method.
sc <- calculateContaminationFraction(sc, list(Astrocyte = c("AQP4", "GJA1", "CLU", "GFAP")), useToEst = useToEst_Astrocyte)
head(sc$metaData)
## nUMIs clusters RD1 RD2 rho rhoLow
## AAAAAAACCATGAATT 7782 8 5.6954870 4.631757 0.160367 0.1582958
## AAAAAAAGAGTGCATC 2767 9 11.3356610 15.880696 0.160367 0.1582958
## AAAAAAAGCATACCAG 5722 7 14.6039030 2.495945 0.160367 0.1582958
## AAAAAAATAACAACAA 2654 1 2.3344529 14.153996 0.160367 0.1582958
## AAAAAACAATATCAGT 2827 12 -0.7713541 5.922867 0.160367 0.1582958
## AAAAAACACAGTCCAT 3833 1 1.3936919 15.335578 0.160367 0.1582958
## rhoHigh
## AAAAAAACCATGAATT 0.162456
## AAAAAAAGAGTGCATC 0.162456
## AAAAAAAGCATACCAG 0.162456
## AAAAAAATAACAACAA 0.162456
## AAAAAACAATATCAGT 0.162456
## AAAAAACACAGTCCAT 0.162456
We have now calculated or set the contamination fraction for each
cell and would like to use this to remove the contamination from the
original count matrix. As with estimating the contamination, this
procedure is made much more robust by providing clustering information.
This is because there is much more power to separate true expression
from contaminating expression when counts are aggregated into
clusters.
However, this approach can only work with one cell type at a time.
out <- adjustCounts(sc)
# Inspect the changes in expression profile
cntSoggy <- rowSums(sc$toc > 0)
cntStrained <- rowSums(out > 0)
mostZeroed <- tail(sort((cntSoggy - cntStrained)/cntSoggy), n = 10)
mostZeroed
## RP11-69H7.2 CTC-391G2.1 KRTAP4-12 NTF6G CTD-3187F8.15
## 1 1 1 1 1
## BIRC8 KANK1P1 HSFY1P1 RP11-266I3.7 MT-ND6
## 1 1 1 1 1
# MT - This is often the case as mitochondrial genes are over represented in the background compared to cells, presumably as a result of the soup being generated from distressed cells.
tail(sort(rowSums(sc$toc > out)/rowSums(sc$toc > 0)), n = 20)
## MT-TD MT-CO2 MT-TK MT-ATP8 MT-ATP6 MT-CO3 MT-TG MT-ND3 MT-TR MT-ND4L
## 1 1 1 1 1 1 1 1 1 1
## MT-ND4 MT-TH MT-TS2 MT-TL2 MT-ND5 MT-ND6 MT-TE MT-CYB MT-TT MT-TP
## 1 1 1 1 1 1 1 1 1 1
# Now that we've corrected our data, we can see how that compares to our corrected data.
# By default the function plots the fraction of expression in each cell that has been deemed to be soup and removed.
plotChangeMap(sc, out, "AQP4") + theme_bw()
plotChangeMap(sc, out, "RBFOX3") + theme_bw()
plotChangeMap(sc, out, "MOG") + theme_bw()
plotChangeMap(sc, out, "VCAN") + theme_bw()
Manually fixing the contamination fraction at a certain value. This
seems like a bad thing to do intuitively, but there are actually good
reasons you might want to. When the contamination fraction is set too
high, true expression will be removed from your data. However, this is
done in such a way that the counts that are most specific to a subset of
cells (i.e., good marker genes) will be the absolute last thing to be
removed. Because of this, it can be a sensible thing to set a high
contamination fraction for a set of experiments and be confident that
the vast majority of the contamination has been removed.
Even when you have a good estimate of the contamination fraction, you
may want to set the value used artificially higher. SoupX has been
designed to be conservative in the sense that it errs on the side of
retaining true expression at the cost of letting some contamination to
creep through. Our tests show that a well estimated contamination
fraction will remove 80-90% of the contamination (i.e. the soup is
reduced by an order of magnitude). For most applications this is
sufficient. However, in cases where complete removal of contamination is
essential, it can be worthwhile to increase the contamination fraction
used by SoupX to remove a greater portion of the contamination.
Our experiments indicate that adding 5% extra removes 90-95% of the soup, 10% gets rid of 95-98% and 20% removes 99% or more.
Explicitly setting the contamination fraction can be done by running:
rm(out)
sc <- setContaminationFraction(sc, 0.2) # set the contamination fraction to 20% for all cells.
Produce a new matrix that has had the contaminating reads removed. This can then be used in any downstream analysis in place of the original matrix. Note that by default adjustCounts will return non-integer counts. If you require integers for downstream processing either pass out through round or set roundToInt=TRUE when running adjustCounts.
out <- adjustCounts(sc)
DropletUtils:::write10xCounts(paste0(pip_results, sample_dir, "SoupX/"), out)
cntSoggy <- rowSums(sc$toc > 0)
cntStrained <- rowSums(out > 0)
mostZeroed <- tail(sort((cntSoggy - cntStrained)/cntSoggy), n = 10)
mostZeroed
## BIRC8 KANK1P1 SLC24A3-AS1 AF241725.4 HSFY1P1 RP11-266I3.7
## 1 1 1 1 1 1
## MT-TQ MT-TM MT-ND6 MT-TP
## 1 1 1 1
# MT - presumably as a result of the soup being generated from distressed cells.
tail(sort(rowSums(sc$toc > out)/rowSums(sc$toc > 0)), n = 20)
## MT-TD MT-CO2 MT-TK MT-ATP8 MT-ATP6 MT-CO3 MT-TG MT-ND3 MT-TR MT-ND4L
## 1 1 1 1 1 1 1 1 1 1
## MT-ND4 MT-TH MT-TS2 MT-TL2 MT-ND5 MT-ND6 MT-TE MT-CYB MT-TT MT-TP
## 1 1 1 1 1 1 1 1 1 1
# Now that we've corrected our data, we can see how that compares to our corrected data.
# By default the function plots the fraction of expression in each cell that has been deemed to be soup and removed.
plotChangeMap(sc, out, "AQP4") + theme_bw()
plotChangeMap(sc, out, "RBFOX3") + theme_bw()
plotChangeMap(sc, out, "MOG") + theme_bw()
plotChangeMap(sc, out, "VCAN") + theme_bw()