library(miloR) # compositional analysis
library(here) # reproducible paths
library(scater) # sc plots
library(dplyr) # modify design df
This analysis follows the [miloR vignette](https://rawcdn.githack.com/MarioniLab/miloR/7c7f906b94a73e62e36e095ddb3e3567b414144e/vignettes/milo_gastrulation.html#5_Finding_markers_of_DA_populations).
For this study we load our processed SingleCellExperiment object.
project <- "old"
fig_path <- here("outs", project, "plots","DA_miloR", "/")
sce <- readRDS(here("processed", project, "sce_anno_02.RDS"))
plotReducedDim(sce, colour_by="genotype", dimred = "TSNE", text_by = "clusters_named")
We will test for significant differences in abundance of cells between WT and KO, and the associated gene signatures.
For differential abundance analysis on graph neighbourhoods we first construct a Milo object. This extends the SingleCellExperiment class to store information about neighbourhoods on the KNN graph.
milo <- Milo(sce)
milo
## class: Milo
## dim: 18827 13499
## metadata(1): Samples
## assays(3): counts logcounts logcounts_raw
## rownames(18827): Xkr4 Gm19938 ... CAAA01147332.1 AC149090.1
## rowData names(6): ID Symbol ... gene_name subset
## colnames(13499): 1_AAACCCAAGTATGGAT-1 1_AAACCCAGTTATGACC-1 ...
## 6_TTTGGTTGTATAGGGC-1 6_TTTGGTTTCCTCTTTC-1
## colData names(33): Sample Barcode ... filter_out clusters_named
## reducedDimNames(5): PCA_coldata PCA PCA_all UMAP TSNE
## mainExpName: NULL
## altExpNames(0):
## nhoods dimensions(2): 1 1
## nhoodCounts dimensions(2): 1 1
## nhoodDistances dimension(1): 0
## graph names(0):
## nhoodIndex names(1): 0
## nhoodExpression dimension(2): 1 1
## nhoodReducedDim names(0):
## nhoodGraph names(0):
## nhoodAdjacency dimension(2): 1 1
We need to add the KNN graph to the Milo object. This is stored in the graph slot, in igraph format. The miloR package includes functionality to build and store the graph from the PCA dimensions stored in the reducedDim slot.
For graph building you need to define a few parameters:
d: the number of reduced dimensions to use for KNN refinement. We recommend using the same \(d\) used for KNN graph building. In our case 26 dimensions (see the feature_selection_dimred_02 script).
k: this affects the power of DA testing, since we need to have enough cells from each sample represented in a neighbourhood to estimate the variance between replicates. On the other hand, increasing \(k\) too much might lead to over-smoothing. We suggest starting with the same value of \(k\) used for KNN graph building for clustering and UMAP visualization, in our case k = 20. We will later use some heuristics to evaluate whether the value of \(k\) should be increased.
# k modified after checking neighbourhoods
milo <- buildGraph(milo, k = 30, d = 26, reduced.dim = "PCA")
## Constructing kNN graph with k:30
Alternatively, one can add a precomputed KNN graph (for example constructed with Seurat or scanpy) to the graph slot using the adjacency matrix, through the helper function buildFromAdjacency.
We define the neighbourhood of a cell, the index, as the group of cells connected by an edge in the KNN graph to the index cell. For efficiency, we don’t test for DA in the neighbourhood of every cell, but we sample as indices a subset of representative cells, using a KNN sampling algorithm used by Gut et al. 2015.
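As a minimal sketch of this definition, the toy example below builds a small undirected graph as an adjacency matrix and extracts the neighbourhood of an index cell. The helper `nhood_of` and the toy edges are hypothetical and not part of miloR; counting the index cell itself as a member of its neighbourhood is an assumption of this sketch.

```r
# Toy undirected graph on 5 cells, stored as a symmetric adjacency matrix
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
adj <- matrix(0, nrow = 5, ncol = 5)
adj[edges] <- 1            # fill the (i, j) entries
adj[edges[, 2:1]] <- 1     # fill the (j, i) entries to make it symmetric

# Hypothetical helper: cells connected by an edge to the index cell,
# plus the index cell itself (an assumption of this sketch)
nhood_of <- function(adj, index) {
  sort(unique(as.integer(c(index, which(adj[index, ] > 0)))))
}

nhood_of(adj, 3)
nhood_of(adj, 5)
```

Only a subset of cells is used as indices in the real workflow, so most cells belong to several overlapping neighbourhoods.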
As well as \(d\) and \(k\), for sampling we need to define a few additional parameters:
prop: the proportion of cells to randomly sample to start with. We suggest using prop=0.1 for datasets of less than 30k cells. For bigger datasets using prop=0.05 should be sufficient (and makes computation faster).
refined: indicates whether you want to use the sampling refinement algorithm, or just pick cells at random. The default and recommended way to go is to use refinement. The only situation in which you might consider using random instead, is if you have batch corrected your data with a graph based correction algorithm, such as BBKNN, but the results of DA testing will be suboptimal.
set.seed(1)
milo <- makeNhoods(milo, prop = 0.1, k = 30, d=26, refined = TRUE, reduced_dims = "PCA")
## Checking valid object
Once we have defined neighbourhoods, we plot the distribution of neighbourhood sizes (i.e. how many cells form each neighbourhood) to evaluate whether the value of \(k\) used for graph building was appropriate. We can check this using the plotNhoodSizeHist function.
As a rule of thumb we want an average neighbourhood size over 5 x N_samples, or a distribution peaking between 50 and 100. Otherwise you might consider rerunning makeNhoods increasing k and/or prop. In our case, with 6 samples, the average should be at least 30 (5 x 6 samples), so we reran makeNhoods with increasing k until this was reached.
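The rule of thumb above can be written down as a small check. This is only a sketch: the helper names `min_mean_nhood_size` and `nhood_size_ok` are hypothetical, and the neighbourhood sizes are made-up numbers rather than values from this dataset.

```r
# Hypothetical helpers, not part of miloR
min_mean_nhood_size <- function(n_samples, per_sample = 5) {
  per_sample * n_samples
}

nhood_size_ok <- function(sizes, n_samples) {
  mean(sizes) >= min_mean_nhood_size(n_samples)
}

toy_sizes <- c(22, 35, 41, 28, 50)   # made-up neighbourhood sizes
min_mean_nhood_size(6)               # threshold for 6 samples
nhood_size_ok(toy_sizes, n_samples = 6)
```

With a real Milo object the sizes could presumably be taken from the column sums of `nhoods(milo)`; that layout (cells in rows, neighbourhoods in columns) is an assumption here.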
plotNhoodSizeHist(milo)
Milo leverages the variation in cell numbers between replicates for the same experimental condition to test for differential abundance. Therefore we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata and specify which column contains the sample information.
milo <- countCells(milo, meta.data = as.data.frame(colData(milo)), sample="Sample")
## Checking meta.data validity
## Counting cells in neighbourhoods
This adds to the Milo object a \(n \times m\) matrix, where \(n\) is the number of neighbourhoods and \(m\) is the number of experimental samples. Values indicate the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing.
head(nhoodCounts(milo))
## 6 x 6 sparse Matrix of class "dgCMatrix"
## S1823 S1824 S1825 S1826 S1827 S1828
## 1 37 27 32 11 19 24
## 2 4 7 5 32 6 2
## 3 42 31 7 . . .
## 4 10 5 18 3 5 16
## 5 36 3 8 2 5 21
## 6 9 12 13 13 8 15
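To make the structure of this matrix concrete, here is a toy sketch in base R of how a neighbourhood-by-sample count matrix can be obtained from a cell-by-neighbourhood membership matrix and a vector of sample labels. All names and values are made up for illustration, and this is not necessarily how countCells is implemented.

```r
# Toy data: 6 cells from 2 samples, 2 overlapping neighbourhoods
sample_id <- factor(c("S1", "S1", "S2", "S2", "S2", "S1"))
membership <- matrix(
  c(1, 1, 1, 0, 0, 0,    # nh1 contains cells 1-3
    0, 0, 1, 1, 1, 1),   # nh2 contains cells 3-6
  ncol = 2,
  dimnames = list(NULL, c("nh1", "nh2"))
)

# One indicator column per sample (cells x samples)
sample_indicator <- model.matrix(~ 0 + sample_id)
colnames(sample_indicator) <- levels(sample_id)

# n x m count matrix: neighbourhoods in rows, samples in columns
toy_counts <- t(membership) %*% sample_indicator
toy_counts
```

Cell 3 belongs to both neighbourhoods, so it is counted twice; this overlap is why the multiple testing correction later accounts for shared cells.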
Now we are all set to test for differential abundance in neighbourhoods. We implement this hypothesis testing in a generalized linear model (GLM) framework, specifically using the Negative Binomial GLM implementation in edgeR.
We first need to think about our experimental design. The design matrix should match each sample to the experimental condition of interest for DA testing. In this case, we want to detect DA between genotypes, stored in the genotype column of the dataset colData. We also include the chip column in the design matrix. This represents a known technical covariate that we want to account for in DA testing.
design <- data.frame(colData(milo))[,c("Sample", "genotype", "chip")]
## Convert info from integers to factor
design$chip <- as.factor(design$chip)
design$genotype <- as.factor(design$genotype)
design$genotype <- relevel(design$genotype, "WT")
# simplify data frame to only distinct combinations of conditions
design <- distinct(design)
rownames(design) <- design$Sample
design
## Sample genotype chip
## S1823 S1823 WT 1
## S1824 S1824 WT 2
## S1825 S1825 WT 1
## S1826 S1826 KO 2
## S1827 S1827 KO 2
## S1828 S1828 KO 1
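As an illustration of what the design encodes, the sketch below rebuilds the table printed above by hand and expands the formula `~ chip + genotype` with base R's `model.matrix`. The object names `design_toy` and `mm` are hypothetical. We believe the last coefficient (here genotypeKO, with WT as the reference level) is the one tested when no contrast is given, but that is an assumption rather than something stated in this report.

```r
# Design table copied from the output above
design_toy <- data.frame(
  Sample   = c("S1823", "S1824", "S1825", "S1826", "S1827", "S1828"),
  genotype = factor(c("WT", "WT", "WT", "KO", "KO", "KO")),
  chip     = factor(c(1, 2, 1, 2, 2, 1))
)
design_toy$genotype <- relevel(design_toy$genotype, "WT")  # WT as reference
rownames(design_toy) <- design_toy$Sample

# Expand the formula into a numeric design matrix
mm <- model.matrix(~ chip + genotype, data = design_toy)
mm
```

Each row is a sample; the chip2 column absorbs the chip effect, so the genotypeKO column captures the KO versus WT difference after adjusting for chip.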
Milo uses an adaptation of the Spatial FDR correction introduced by cydar, where we correct p-values accounting for the amount of overlap between neighbourhoods. Specifically, each hypothesis test P-value is weighted by the reciprocal of the kth nearest neighbour distance. To use this statistic we first need to store the distances between nearest neighbours in the Milo object. This is done by the calcNhoodDistance function (N.B. this step is the most time consuming of the analysis workflow and might take a couple of minutes for large datasets).
milo <- calcNhoodDistance(milo, d=26, reduced.dim = "PCA")
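The weighting idea can be sketched as a weighted Benjamini-Hochberg procedure. The function `weighted_bh` below is a hypothetical illustration written for this note, not the miloR or cydar implementation, and the distances are made-up numbers. With equal weights it reduces to the ordinary BH adjustment.

```r
# Hypothetical sketch of a weighted Benjamini-Hochberg adjustment
weighted_bh <- function(pvalues, weights) {
  stopifnot(length(pvalues) == length(weights), all(weights > 0))
  o <- order(pvalues)
  p_sorted <- pvalues[o]
  w_sorted <- weights[o]
  # Weighted analogue of n * p_(i) / i
  raw <- sum(w_sorted) * p_sorted / cumsum(w_sorted)
  # Enforce monotonicity, starting from the largest p-value
  adj_sorted <- rev(cummin(rev(raw)))
  adj <- numeric(length(pvalues))
  adj[o] <- pmin(adj_sorted, 1)   # back to input order, capped at 1
  adj
}

p <- c(0.001, 0.02, 0.03, 0.5)
kth_dist <- c(0.5, 1, 2, 4)             # made-up kth nearest neighbour distances
weighted_bh(p, weights = 1 / kth_dist)  # reciprocal-distance weights
```

Neighbourhoods in dense regions have a small kth nearest neighbour distance and therefore a large weight, which reflects that they overlap with many other neighbourhoods.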
Now we can do the DA test, explicitly defining our experimental design. In this case, we want to test for differences between genotype WT and KO, while accounting for the variability between technical batches (You can find more info on how to use formulas to define a testing design in R here)
da_results <- testNhoods(milo, design = ~ chip + genotype, design.df = design)
## Using TMM normalisation
## Performing spatial FDR correction withk-distance weighting
head(da_results)
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 1 -0.57828416 11.25797 0.955212249 3.420499e-01 0.779898186 1 0.7971967792
## 2 0.40707592 10.04075 0.185535947 6.720499e-01 0.922694641 2 0.9358128038
## 3 -7.87478493 10.31614 35.964926980 2.402185e-05 0.001154598 3 0.0008773739
## 4 0.02159023 10.02610 0.000753043 9.784259e-01 0.985382352 4 0.9859959345
## 5 0.26404708 10.20971 0.114358208 7.393609e-01 0.957971464 5 0.9666211159
## 6 0.16164738 10.30371 0.059348870 8.104328e-01 0.958908049 6 0.9667316338
This calculates a Fold-change and corrected P-value for each neighbourhood, which indicates whether there is significant differential abundance between conditions. The main statistics we consider here are:
logFC: indicates the log fold change in cell numbers between samples from WT and KO
PValue: reports P-values before FDR correction
SpatialFDR: reports P-values corrected for multiple testing accounting for overlap between neighbourhoods
da_results %>%
arrange(SpatialFDR) %>%
head()
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 3 -7.874785 10.31614 35.96493 2.402185e-05 0.001154598 3 0.0008773739
## 49 -8.068135 10.64394 39.99231 1.338283e-05 0.001154598 49 0.0008773739
## 91 -7.710790 10.16092 37.87310 1.810314e-05 0.001154598 91 0.0008773739
## 145 -5.586251 10.66431 33.95387 1.990537e-05 0.001154598 145 0.0008773739
## 192 -8.058749 10.44420 44.82005 7.015453e-06 0.001154598 192 0.0008773739
## 196 -8.101508 10.79911 40.72519 1.208867e-05 0.001154598 196 0.0008773739
We can start inspecting the results of our DA analysis from a couple of standard diagnostic plots. We first inspect the distribution of uncorrected P values, to verify that the test was balanced.
ggplot(da_results, aes(PValue)) + geom_histogram(bins=50)
Then we visualize the test results with a volcano plot (remember that each point here represents a neighbourhood, not a cell).
ggplot(da_results, aes(logFC, -log10(SpatialFDR))) +
geom_point() +
geom_hline(yintercept = 1) ## Mark significance threshold (10% FDR)
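The horizontal line sits at 1 because \(-\log_{10}(0.1) = 1\). As a small sketch, the same 10% threshold can be applied directly to the table. The data frame `toy_da` below holds the six rows printed by head(da_results) above, rounded by hand; labelling negative logFC as "depleted in KO" assumes WT is the reference level and genotype is the tested term.

```r
# Six neighbourhoods from the output above (values rounded)
toy_da <- data.frame(
  Nhood      = 1:6,
  logFC      = c(-0.578, 0.407, -7.875, 0.022, 0.264, 0.162),
  SpatialFDR = c(0.7972, 0.9358, 0.00088, 0.9860, 0.9666, 0.9667)
)

fdr_cutoff <- 0.1
-log10(fdr_cutoff)   # position of the horizontal line in the volcano plot

# Flag significant neighbourhoods and their direction of change
toy_da$is_da <- toy_da$SpatialFDR < fdr_cutoff
toy_da$direction <- ifelse(
  !toy_da$is_da, "not DA",
  ifelse(toy_da$logFC < 0, "depleted in KO", "enriched in KO")
)
table(toy_da$direction)
```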
The neighbourhoods most strongly depleted in KO (large negative logFC) are the microglia.
To visualize DA results in relation to the embedding of single cells, we can build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding. Here each node represents a neighbourhood, while edges indicate how many cells two neighbourhoods have in common. The layout of nodes is determined by the position of the index cell in the TSNE embedding of all single cells. The neighbourhoods displaying significant DA are coloured by their log fold change.
milo <- buildNhoodGraph(milo)
## Plot single-cell TSNE
tsne_pl <- plotReducedDim(milo, dimred = "TSNE", colour_by="genotype", text_by = "clusters_named",
text_size = 3, point_size=0.5) +
guides(fill="none")
## Plot neighbourhood graph
nh_graph_pl <- plotNhoodGraphDA(milo, da_results, layout="TSNE",alpha=0.1) + scale_fill_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue"))
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
tsne_pl + nh_graph_pl# +
# plot_layout(guides="collect")
We might also be interested in visualizing whether DA is particularly evident in certain clusters. To do this, we assign a cluster label to each neighbourhood by finding the most abundant cluster within cells in each neighbourhood. We can label neighbourhoods in the results data.frame using the function annotateNhoods. This also saves the fraction of cells harbouring the label.
da_results <- annotateNhoods(milo, da_results, coldata_col = "clusters_named")
da_results <- annotateNhoods(milo, da_results, coldata_col = "celltype")
head(da_results)
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 1 -0.57828416 11.25797 0.955212249 3.420499e-01 0.779898186 1 0.7971967792
## 2 0.40707592 10.04075 0.185535947 6.720499e-01 0.922694641 2 0.9358128038
## 3 -7.87478493 10.31614 35.964926980 2.402185e-05 0.001154598 3 0.0008773739
## 4 0.02159023 10.02610 0.000753043 9.784259e-01 0.985382352 4 0.9859959345
## 5 0.26404708 10.20971 0.114358208 7.393609e-01 0.957971464 5 0.9666211159
## 6 0.16164738 10.30371 0.059348870 8.104328e-01 0.958908049 6 0.9667316338
## clusters_named clusters_named_fraction celltype celltype_fraction
## 1 Astrocyte_3 1 Astrocyte 1
## 2 Oligo_1 1 Oligo 1
## 3 Microglia 1 Microglia 1
## 4 Astrocyte_3 1 Astrocyte 1
## 5 Astrocyte_3 1 Astrocyte 1
## 6 Astrocyte_3 1 Astrocyte 1
While neighbourhoods tend to be homogeneous, we can define a threshold for celltype_fraction to exclude neighbourhoods that are a mix of cell types.
ggplot(da_results, aes(clusters_named_fraction)) + geom_histogram(bins=50)
da_results$celltype <- ifelse(da_results$celltype_fraction < 0.7, "Mixed", da_results$celltype)
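The labelling logic described above (most abundant label plus the fraction of cells carrying it, then a 0.7 cut-off) can be sketched on toy data. The helper `majority_label` and the example labels are hypothetical; this is an illustration of the idea, not the annotateNhoods implementation.

```r
# Hypothetical helper: most abundant label and the fraction of cells with it
majority_label <- function(labels) {
  tab <- table(labels)
  list(
    label    = names(tab)[which.max(tab)],
    fraction = as.numeric(max(tab)) / length(labels)
  )
}

# Made-up labels for the cells of two neighbourhoods
pure_nhood  <- c("Microglia", "Microglia", "Microglia", "Microglia")
mixed_nhood <- c("Astrocyte", "Astrocyte", "Oligo")

pure  <- majority_label(pure_nhood)
mixed <- majority_label(mixed_nhood)

# Same rule as above: below 0.7 the neighbourhood is called "Mixed"
final_label <- function(x) ifelse(x$fraction < 0.7, "Mixed", x$label)
final_label(pure)
final_label(mixed)
```

The second neighbourhood has a majority fraction of 2/3, which falls below the 0.7 cut-off, so it is relabelled as "Mixed".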
Now we can visualize the distribution of DA fold changes in different cell types or clusters:
plotDAbeeswarm(da_results, group.by = "celltype") +
scale_colour_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue")) + xlab("")
## Converting group.by to factor...
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
plotDAbeeswarm(da_results, group.by = "clusters_named") +
scale_colour_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue")) + xlab("")
## Converting group.by to factor...
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
#save results
saveRDS(da_results, here("processed", project, "da_results.rds"))
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] dplyr_1.0.7 scater_1.20.1
## [3] ggplot2_3.3.5 scuttle_1.2.1
## [5] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
## [7] Biobase_2.52.0 GenomicRanges_1.44.0
## [9] GenomeInfoDb_1.28.4 IRanges_2.26.0
## [11] S4Vectors_0.30.2 BiocGenerics_0.38.0
## [13] MatrixGenerics_1.4.3 matrixStats_0.61.0
## [15] here_1.0.1 miloR_1.2.0
## [17] edgeR_3.34.1 limma_3.48.3
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 RColorBrewer_1.1-2
## [3] rprojroot_2.0.2 tools_4.1.1
## [5] bslib_0.3.1 utf8_1.2.2
## [7] R6_2.5.1 irlba_2.3.3
## [9] vipor_0.4.5 DBI_1.1.1
## [11] colorspace_2.0-2 withr_2.4.2
## [13] tidyselect_1.1.1 gridExtra_2.3
## [15] compiler_4.1.1 BiocNeighbors_1.10.0
## [17] DelayedArray_0.18.0 labeling_0.4.2
## [19] sass_0.4.0 scales_1.1.1
## [21] stringr_1.4.0 digest_0.6.28
## [23] rmarkdown_2.11 XVector_0.32.0
## [25] pkgconfig_2.0.3 htmltools_0.5.2
## [27] sparseMatrixStats_1.4.2 highr_0.9
## [29] fastmap_1.1.0 rlang_0.4.12
## [31] DelayedMatrixStats_1.14.3 jquerylib_0.1.4
## [33] farver_2.1.0 generics_0.1.1
## [35] jsonlite_1.7.2 gtools_3.9.2
## [37] BiocParallel_1.26.2 RCurl_1.98-1.5
## [39] magrittr_2.0.1 BiocSingular_1.8.1
## [41] GenomeInfoDbData_1.2.6 patchwork_1.1.1
## [43] Matrix_1.3-4 Rcpp_1.0.7
## [45] ggbeeswarm_0.6.0 munsell_0.5.0
## [47] fansi_0.5.0 viridis_0.6.2
## [49] lifecycle_1.0.1 stringi_1.7.5
## [51] yaml_2.2.1 ggraph_2.0.5
## [53] MASS_7.3-54 zlibbioc_1.38.0
## [55] grid_4.1.1 ggrepel_0.9.1
## [57] crayon_1.4.2 lattice_0.20-45
## [59] splines_4.1.1 graphlayouts_0.8.0
## [61] cowplot_1.1.1 beachmat_2.8.1
## [63] locfit_1.5-9.4 knitr_1.36
## [65] pillar_1.6.4 igraph_1.2.7
## [67] ScaledMatrix_1.0.0 glue_1.4.2
## [69] evaluate_0.14 vctrs_0.3.8
## [71] tweenr_1.0.2 gtable_0.3.0
## [73] purrr_0.3.4 polyclip_1.10-0
## [75] tidyr_1.1.4 assertthat_0.2.1
## [77] xfun_0.27 ggforce_0.3.3
## [79] rsvd_1.0.5 tidygraph_1.2.0
## [81] viridisLite_0.4.0 tibble_3.1.5
## [83] beeswarm_0.4.0 statmod_1.4.36
## [85] ellipsis_0.3.2