library(miloR) # compositional analysis
library(here) # reproducible paths
library(scater) # sc plots
library(dplyr) # modify design df
library(ggplot2) #plots
This analysis have been done following the MiloR vignette
For this study we will load our processed single cell experiment with subsetted oligos
project <- "old"
sce <- readRDS(here("processed", project, "sce_oligo_clusters_01.RDS"))
plotReducedDim(sce, colour_by="genotype", dimred = "UMAP", text_by = "clusters_named")
## Warning: Removed 17 rows containing missing values (geom_text).
We will test for significant differences in abundance of cells between WT and KO, and the associated gene signatures.
For differential abundance analysis on graph neighbourhoods we first construct a Milo object. This extends the SingleCellExperiment class to store information about neighbourhoods on the KNN graph.
milo <- Milo(sce)
milo
## class: Milo
## dim: 17234 3061
## metadata(1): Samples
## assays(3): counts logcounts logcounts_raw
## rownames(17234): Xkr4 Gm19938 ... CAAA01147332.1 AC149090.1
## rowData names(6): ID Symbol ... gene_name subset
## colnames(3061): 1_AAACGAATCCTTACCG-1 1_AAAGGATAGTGAATAC-1 ...
## 6_TTTGACTTCCTCGCAT-1 6_TTTGATCGTTCCGTTC-1
## colData names(39): Sample Barcode ... cluster_oligo_k150
## cluster_oligo_k200
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: NULL
## altExpNames(0):
## nhoods dimensions(2): 1 1
## nhoodCounts dimensions(2): 1 1
## nhoodDistances dimension(1): 0
## graph names(0):
## nhoodIndex names(1): 0
## nhoodExpression dimension(2): 1 1
## nhoodReducedDim names(0):
## nhoodGraph names(0):
## nhoodAdjacency dimension(2): 1 1
We need to add the KNN graph to the Milo object. This is stored in the graph slot, in igraph format. The miloR package includes functionality to build and store the graph from the PCA dimensions stored in the reducedDim slot.
For graph building you need to define a few parameters:
d: the number of reduced dimensions to use for KNN refinement. We recommend using the same \(d\) used for KNN graph building. In our case 26 dimensions (see feature_selection_dimred_02 script)k: this affects the power of DA testing, since we need to have enough cells from each sample represented in a neighbourhood to estimate the variance between replicates. On the other side, increasing \(k\) too much might lead to over-smoothing. We suggest to start by using the same value for \(k\) used for KNN graph building for clustering and UMAP visualization. In our case k10, that is the clustering resolution that shows the small cluster, more abundant in KO, as a separate cluster.# k modified after checking neighbourhoods
milo <- buildGraph(milo, k = 10, d = 25, reduced.dim = "PCA")
## Constructing kNN graph with k:10
Alternatively, one can add a precomputed KNN graph (for example constructed with Seurat or scanpy) to the graph slot using the adjacency matrix, through the helper function buildFromAdjacency.
We define the neighbourhood of a cell, the index, as the group of cells connected by an edge in the KNN graph to the index cell. For efficiency, we don’t test for DA in the neighbourhood of every cell, but we sample as indices a subset of representative cells, using a KNN sampling algorithm used by Gut et al. 2015.
As well as \(d\) and \(k\), for sampling we need to define a few additional parameters:
prop: the proportion of cells to randomly sample to start with. We suggest using prop=0.1 for datasets of less than 30k cells. For bigger datasets using prop=0.05 should be sufficient (and makes computation faster).refined: indicates whether you want to use the sampling refinement algorithm, or just pick cells at random. The default and recommended way to go is to use refinement. The only situation in which you might consider using random instead, is if you have batch corrected your data with a graph based correction algorithm, such as BBKNN, but the results of DA testing will be suboptimal.milo <- makeNhoods(milo, prop = 0.1, k = 10, d=26, refined = TRUE, reduced_dims = "PCA")
## Checking valid object
## Warning in makeNhoods(milo, prop = 0.1, k = 10, d = 26, refined = TRUE, : Specified d is higher than the total number of dimensions in reducedDim(x, reduced_dims). Falling back to using25dimensions
Once we have defined neighbourhoods, we plot the distribution of neighbourhood sizes (i.e. how many cells form each neighbourhood) to evaluate whether the value of \(k\) used for graph building was appropriate. We can check this out using the plotNhoodSizeHist function.
As a rule of thumb we want to have an average neighbourhood size over 5 x N_samples or to have a distribution peaking between 50 and 100. Otherwise you might consider rerunning makeNhoods increasing k and/or prop. In our case, 6 samples, an average of minimum 30 is expected, so we rerun makeNhood increasing k until we have an average of minimum 30 (5 x 6samples).
plotNhoodSizeHist(milo)
Milo leverages the variation in cell numbers between replicates for the same experimental condition to test for differential abundance. Therefore we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata and specify which column contains the sample information.
milo <- countCells(milo, meta.data = as.data.frame(colData(milo)), sample="Sample")
## Checking meta.data validity
## Counting cells in neighbourhoods
This adds to the Milo object a \(n \times m\) matrix, where \(n\) is the number of neighbourhoods and \(m\) is the number of experimental samples. Values indicate the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing.
head(nhoodCounts(milo))
## 6 x 6 sparse Matrix of class "dgCMatrix"
## S1823 S1824 S1825 S1826 S1827 S1828
## 1 . 7 1 8 2 .
## 2 10 6 6 10 7 6
## 3 3 11 3 34 4 3
## 4 . . . 7 2 9
## 5 4 4 6 4 . 4
## 6 6 4 3 8 3 2
Now we are all set to test for differential abundance in neighbourhoods. We implement this hypothesis testing in a generalized linear model (GLM) framework, specifically using the Negative Binomial GLM implementation in edgeR.
We first need to think about our experimental design. The design matrix should match each sample to the experimental condition of interest for DA testing. In this case, we want to detect DA between genotypes, stored in the genotype column of the dataset colData. We also include the chip column in the design matrix. This represents a known technical covariate that we want to account for in DA testing.
design <- data.frame(colData(milo))[,c("Sample", "genotype", "chip")]
## Convert info from integers to factor
design$chip <- as.factor(design$chip)
design$genotype <- as.factor(design$genotype)
design$genotype <- relevel(design$genotype, "WT")
# simplify data frame to only distinct combinations conditions
design <- distinct(design)
rownames(design) <- design$Sample
design
## Sample genotype chip
## S1823 S1823 WT 1
## S1824 S1824 WT 2
## S1825 S1825 WT 1
## S1826 S1826 KO 2
## S1827 S1827 KO 2
## S1828 S1828 KO 1
Milo uses an adaptation of the Spatial FDR correction introduced by cydar, where we correct p-values accounting for the amount of overlap between neighbourhoods. Specifically, each hypothesis test P-value is weighted by the reciprocal of the kth nearest neighbour distance. To use this statistic we first need to store the distances between nearest neighbors in the Milo object. This is done by the calcNhoodDistance function (N.B. this step is the most time consuming of the analysis workflow and might take a couple of minutes for large datasets).
milo <- calcNhoodDistance(milo, d=25, reduced.dim = "PCA")
Now we can do the DA test, explicitly defining our experimental design. In this case, we want to test for differences between genotype WT and KO, while accounting for the variability between technical batches (You can find more info on how to use formulas to define a testing design in R here)
da_results <- testNhoods(milo, design = ~ chip + genotype, design.df = design)
## Using TMM normalisation
## Performing spatial FDR correction withk-distance weighting
head(da_results)
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 1 -0.99196465 12.07753 0.733562844 0.401901618 0.78251709 1 0.80993246
## 2 -0.04454733 13.14233 0.005410233 0.942097532 0.96246222 2 0.96438122
## 3 0.27091321 13.21357 0.173065329 0.681842805 0.89934360 3 0.90294079
## 4 6.14915086 12.17319 13.937258221 0.001527721 0.04073923 4 0.04655878
## 5 -0.68046752 12.38738 0.525241312 0.477029088 0.80809685 5 0.82453297
## 6 -0.38497202 12.51019 0.196945780 0.661973342 0.89759097 6 0.90128880
This calculates a Fold-change and corrected P-value for each neighbourhood, which indicates whether there is significant differential abundance between conditions. The main statistics we consider here are:
logFC: indicates the log-Fold change in cell numbers between samples from WT and KOPValue: reports P-values before FDR correctionSpatialFDR: reports P-values corrected for multiple testing accounting for overlap between neighbourhoodsda_results %>%
arrange(SpatialFDR) %>%
head()
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 63 6.407494 12.55481 21.97682 0.0001844990 0.01690373 63 0.01908809
## 123 6.735876 12.61518 25.44835 0.0000849896 0.01690373 123 0.01908809
## 157 6.185890 12.85446 21.40031 0.0002112966 0.01690373 157 0.01908809
## 30 6.066755 12.73343 17.55957 0.0005527346 0.03316408 30 0.03738006
## 4 6.149151 12.17319 13.93726 0.0015277213 0.04073923 4 0.04655878
## 70 5.737011 12.66392 14.72173 0.0012130686 0.04073923 70 0.04655878
We can start inspecting the results of our DA analysis from a couple of standard diagnostic plots. We first inspect the distribution of uncorrected P values, to verify that the test was balanced.
ggplot(da_results, aes(PValue)) + geom_histogram(bins=50)
Then we visualize the test results with a volcano plot (remember that each point here represents a neighbourhood, not a cell).
ggplot(da_results, aes(logFC, -log10(SpatialFDR))) +
geom_point() +
geom_hline(yintercept = 1) ## Mark significance threshold (10% FDR)
To visualize DA results relating them to the embedding of single cells, we can build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding. Here each node represents a neighbourhood, while edges indicate how many cells two neighbourhoods have in common. Here the layout of nodes is determined by the position of the index cell in the UMAP embedding of all single-cells. The neighbourhoods displaying significant DA are coloured by their log-Fold Change.
milo <- buildNhoodGraph(milo)
## Plot single-cell UMAP
UMAP_pl <- plotReducedDim(milo, dimred = "UMAP", colour_by="genotype", text_by = "clusters_named",
text_size = 3, point_size=0.5) +
guides(fill="none")
TSNE_pl <- plotReducedDim(milo, dimred = "TSNE", colour_by="genotype", text_by = "clusters_named",
text_size = 3, point_size=0.5) +
guides(fill="none")
## Plot neighbourhood graph
nh_graph_pl <- plotNhoodGraphDA(milo, da_results, layout="UMAP",alpha=0.1) + scale_fill_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue"))
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
nh_graph_tsne_pl <- plotNhoodGraphDA(milo, da_results, layout="TSNE",alpha=0.1) + scale_fill_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue"))
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
UMAP_pl + nh_graph_pl
## Warning: Removed 17 rows containing missing values (geom_text).
TSNE_pl + nh_graph_tsne_pl
## Warning: Removed 17 rows containing missing values (geom_text).
# plot_layout(guides="collect")
We might also be interested in visualizing wheather DA is particularly evident in certain clusters. To do this, we assign a cluster label to each neighbourhood by finding the most abundant cluster within cells in each neighbourhood. We can label neighbourhoods in the results data.frame using the function annotateNhoods. This also saves the fraction of cells harbouring the label.
We will visualise this for:
The 10k clustering done exclusively with the oligos, this was overclustered, but it separated the group of cells that were different between wild type and KO into a separate cluster
The clusters named like the first clustering done with all the cell types.
Simply separating oligodendrocytes from OPCs
k10_tsne <- plotReducedDim(milo, dimred = "TSNE", colour_by="cluster_oligo_k10", text_by = "cluster_oligo_k10",
text_size = 3, point_size=0.5) +
guides(fill="none")
celltype_tsne <- plotReducedDim(milo, dimred = "TSNE", colour_by="celltype", text_by = "celltype",
text_size = 3, point_size=0.5) +
guides(fill="none")
clusters_named_tsne <- plotReducedDim(milo, dimred = "TSNE", colour_by="clusters_named", text_by = "clusters_named",
text_size = 3, point_size=0.5) +
guides(fill="none")
k10_tsne + clusters_named_tsne + celltype_tsne
## Warning: Removed 17 rows containing missing values (geom_text).
## Warning: Removed 13 rows containing missing values (geom_text).
da_results <- annotateNhoods(milo, da_results, coldata_col = "cluster_oligo_k10")
da_results <- annotateNhoods(milo, da_results, coldata_col = "celltype")
da_results <- annotateNhoods(milo, da_results, coldata_col = "clusters_named")
head(da_results)
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 1 -0.99196465 12.07753 0.733562844 0.401901618 0.78251709 1 0.80993246
## 2 -0.04454733 13.14233 0.005410233 0.942097532 0.96246222 2 0.96438122
## 3 0.27091321 13.21357 0.173065329 0.681842805 0.89934360 3 0.90294079
## 4 6.14915086 12.17319 13.937258221 0.001527721 0.04073923 4 0.04655878
## 5 -0.68046752 12.38738 0.525241312 0.477029088 0.80809685 5 0.82453297
## 6 -0.38497202 12.51019 0.196945780 0.661973342 0.89759097 6 0.90128880
## cluster_oligo_k10 cluster_oligo_k10_fraction celltype celltype_fraction
## 1 11 0.8333333 Oligo 1
## 2 9 0.8888889 Oligo 1
## 3 12 0.5517241 Oligo 1
## 4 10 0.9444444 Oligo 1
## 5 9 0.9090909 Oligo 1
## 6 6 0.9615385 OPCs 1
## clusters_named clusters_named_fraction
## 1 Oligo_1 1
## 2 Oligo_1 1
## 3 Oligo_1 1
## 4 Oligo_1 1
## 5 Oligo_1 1
## 6 OPCs 1
We can define a threshold to exclude neighbourhoods that are a mix of cell types.
ggplot(da_results, aes(cluster_oligo_k10_fraction)) + geom_histogram(bins=50)+
ggplot(da_results, aes(clusters_named_fraction)) + geom_histogram(bins=50) +
ggplot(da_results, aes(celltype_fraction)) + geom_histogram(bins=50)
da_results$cluster_oligo_k10 <- ifelse(da_results$cluster_oligo_k10_fraction < 0.6, "Mixed", da_results$cluster_oligo_k10)
da_results$cluster_oligo_k10 <- ifelse(da_results$celltype_fraction < 0.9, "Mixed", da_results$cluster_oligo_k10)
da_results$cluster_oligo_k10 <- ifelse(da_results$clusters_named_fraction < 0.8, "Mixed", da_results$cluster_oligo_k10)
Now we can visualize the distribution of DA Fold Changes in different cell types or clusters
plotDAbeeswarm(da_results, group.by = "cluster_oligo_k10") + scale_colour_distiller(direction = -1, palette = "RdBu")
## Converting group.by to factor...
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
plotDAbeeswarm(da_results, group.by = "celltype") + scale_colour_distiller(direction = -1, palette = "RdBu")
## Converting group.by to factor...
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
plotDAbeeswarm(da_results, group.by = "clusters_named") + scale_colour_distiller(direction = -1, palette = "RdBu")
## Converting group.by to factor...
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
saveRDS(da_results, here("processed",project, "da_results_oligos.rds"))
–> –> –> –>
Session Info
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] dplyr_1.0.7 scater_1.20.1
## [3] ggplot2_3.3.5 scuttle_1.2.1
## [5] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
## [7] Biobase_2.52.0 GenomicRanges_1.44.0
## [9] GenomeInfoDb_1.28.4 IRanges_2.26.0
## [11] S4Vectors_0.30.2 BiocGenerics_0.38.0
## [13] MatrixGenerics_1.4.3 matrixStats_0.61.0
## [15] here_1.0.1 miloR_1.0.0
## [17] edgeR_3.34.1 limma_3.48.3
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 RColorBrewer_1.1-2
## [3] rprojroot_2.0.2 tools_4.1.1
## [5] bslib_0.3.1 utf8_1.2.2
## [7] R6_2.5.1 irlba_2.3.3
## [9] vipor_0.4.5 DBI_1.1.1
## [11] colorspace_2.0-2 withr_2.4.2
## [13] tidyselect_1.1.1 gridExtra_2.3
## [15] compiler_4.1.1 BiocNeighbors_1.10.0
## [17] DelayedArray_0.18.0 labeling_0.4.2
## [19] sass_0.4.0 scales_1.1.1
## [21] stringr_1.4.0 digest_0.6.28
## [23] rmarkdown_2.11 XVector_0.32.0
## [25] pkgconfig_2.0.3 htmltools_0.5.2
## [27] sparseMatrixStats_1.4.2 highr_0.9
## [29] fastmap_1.1.0 rlang_0.4.12
## [31] DelayedMatrixStats_1.14.3 jquerylib_0.1.4
## [33] farver_2.1.0 generics_0.1.1
## [35] jsonlite_1.7.2 gtools_3.9.2
## [37] BiocParallel_1.26.2 RCurl_1.98-1.5
## [39] magrittr_2.0.1 BiocSingular_1.8.1
## [41] GenomeInfoDbData_1.2.6 patchwork_1.1.1
## [43] Matrix_1.3-4 Rcpp_1.0.7
## [45] ggbeeswarm_0.6.0 munsell_0.5.0
## [47] fansi_0.5.0 viridis_0.6.2
## [49] lifecycle_1.0.1 stringi_1.7.5
## [51] yaml_2.2.1 ggraph_2.0.5
## [53] MASS_7.3-54 zlibbioc_1.38.0
## [55] grid_4.1.1 ggrepel_0.9.1
## [57] crayon_1.4.2 lattice_0.20-45
## [59] splines_4.1.1 graphlayouts_0.7.1
## [61] cowplot_1.1.1 beachmat_2.8.1
## [63] locfit_1.5-9.4 knitr_1.36
## [65] pillar_1.6.4 igraph_1.2.7
## [67] ScaledMatrix_1.0.0 glue_1.4.2
## [69] evaluate_0.14 vctrs_0.3.8
## [71] tweenr_1.0.2 gtable_0.3.0
## [73] purrr_0.3.4 polyclip_1.10-0
## [75] tidyr_1.1.4 assertthat_0.2.1
## [77] xfun_0.27 ggforce_0.3.3
## [79] rsvd_1.0.5 tidygraph_1.2.0
## [81] viridisLite_0.4.0 tibble_3.1.5
## [83] beeswarm_0.4.0 statmod_1.4.36
## [85] ellipsis_0.3.2