Packages

library(miloR) # compositional analysis
library(here) # reproducible paths
library(scater) # sc plots
library(dplyr) # modify design df

This analysis follows the miloR vignette: https://rawcdn.githack.com/MarioniLab/miloR/7c7f906b94a73e62e36e095ddb3e3567b414144e/vignettes/milo_gastrulation.html#5_Finding_markers_of_DA_populations

Load data

For this study we load our processed SingleCellExperiment object.

project <- "old"
fig_path <- here("outs", project, "plots","DA_miloR", "/")
sce <- readRDS(here("processed", project, "sce_anno_02.RDS"))

Visualize the data

plotReducedDim(sce, colour_by="genotype", dimred = "TSNE", text_by = "clusters_named")

We will test for significant differences in abundance of cells between WT and KO, and the associated gene signatures.

Differential abundance testing

Create a Milo object

For differential abundance analysis on graph neighbourhoods we first construct a Milo object. This extends the SingleCellExperiment class to store information about neighbourhoods on the KNN graph.

milo <- Milo(sce)
milo

## class: Milo 
## dim: 18827 13499 
## metadata(1): Samples
## assays(3): counts logcounts logcounts_raw
## rownames(18827): Xkr4 Gm19938 ... CAAA01147332.1 AC149090.1
## rowData names(6): ID Symbol ... gene_name subset
## colnames(13499): 1_AAACCCAAGTATGGAT-1 1_AAACCCAGTTATGACC-1 ...
##   6_TTTGGTTGTATAGGGC-1 6_TTTGGTTTCCTCTTTC-1
## colData names(33): Sample Barcode ... filter_out clusters_named
## reducedDimNames(5): PCA_coldata PCA PCA_all UMAP TSNE
## mainExpName: NULL
## altExpNames(0):
## nhoods dimensions(2): 1 1
## nhoodCounts dimensions(2): 1 1
## nhoodDistances dimension(1): 0
## graph names(0):
## nhoodIndex names(1): 0
## nhoodExpression dimension(2): 1 1
## nhoodReducedDim names(0):
## nhoodGraph names(0):
## nhoodAdjacency dimension(2): 1 1

Construct KNN graph

We need to add the KNN graph to the Milo object. This is stored in the graph slot, in igraph format. The miloR package includes functionality to build and store the graph from the PCA dimensions stored in the reducedDim slot.

For graph building you need to define a few parameters:

d: the number of reduced dimensions to use for KNN refinement. We recommend using the same \(d\) used for KNN graph building. In our case, 26 dimensions (see the feature_selection_dimred_02 script).
k: this affects the power of DA testing, since we need enough cells from each sample represented in a neighbourhood to estimate the variance between replicates. On the other hand, increasing \(k\) too much might lead to over-smoothing. We suggest starting with the same value of \(k\) used for KNN graph building for clustering and UMAP visualization, in our case k = 20. We later use some heuristics to evaluate whether \(k\) should be increased; after checking the neighbourhood sizes we settled on k = 30, which is the value used in the code below.

# k modified after checking neighbourhoods
milo <- buildGraph(milo, k = 30, d = 26, reduced.dim = "PCA")

## Constructing kNN graph with k:30

Alternatively, one can add a precomputed KNN graph (for example constructed with Seurat or scanpy) to the graph slot using the adjacency matrix, through the helper function buildFromAdjacency.

Defining representative neighbourhoods on the KNN graph

We define the neighbourhood of a cell, the index, as the group of cells connected by an edge in the KNN graph to the index cell. For efficiency, we don’t test for DA in the neighbourhood of every cell, but we sample as indices a subset of representative cells, using a KNN sampling algorithm used by Gut et al. 2015.

As well as \(d\) and \(k\), for sampling we need to define a few additional parameters:

prop: the proportion of cells to randomly sample to start with. We suggest using prop=0.1 for datasets of less than 30k cells. For bigger datasets using prop=0.05 should be sufficient (and makes computation faster).
refined: indicates whether to use the sampling refinement algorithm or just pick cells at random. The default and recommended option is refinement. The only situation in which you might consider random sampling instead is if you have batch-corrected your data with a graph-based correction algorithm, such as BBKNN, but the results of DA testing will then be suboptimal.

set.seed(1)
milo <- makeNhoods(milo, prop = 0.1, k = 30, d=26, refined = TRUE, reduced_dims = "PCA")

## Checking valid object
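To give a feel for what prop means in practice, the sketch below computes how many cells would be drawn in the initial random sample for this dataset. It only illustrates the arithmetic and a plain random draw; it is not miloR's code, the variable names are ours, and the number of neighbourhoods left after refinement may differ from this starting number.

```r
# Number of cells (columns) reported for the Milo object above
n_cells <- 13499
# Proportion of cells sampled as initial neighbourhood indices
prop <- 0.1

# Size of the initial random sample
n_initial <- floor(prop * n_cells)
n_initial

# A plain random draw of that many cell positions (illustration only)
set.seed(1)
initial_idx <- sample(seq_len(n_cells), size = n_initial)
head(initial_idx)
```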

Once we have defined neighbourhoods, we plot the distribution of neighbourhood sizes (i.e. how many cells form each neighbourhood) to evaluate whether the value of \(k\) used for graph building was appropriate. We can check this out using the plotNhoodSizeHist function.

As a rule of thumb, we want an average neighbourhood size above 5 x N_samples, or a distribution peaking between 50 and 100. Otherwise, consider rerunning makeNhoods with a larger k and/or prop. With our 6 samples the average should be at least 30 (5 x 6 samples), so we reran buildGraph and makeNhoods with increasing k until this was met (k = 30).

plotNhoodSizeHist(milo)
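The rule of thumb can also be checked numerically rather than by eye. The helpers below are hypothetical (they are not part of miloR) and are run here on made-up sizes. The last comment shows how the sizes might be extracted from the real object, assuming nhoods(milo) is a cell-by-neighbourhood membership matrix.

```r
# Hypothetical helper: minimum mean neighbourhood size under the 5 x N_samples rule
min_mean_nhood_size <- function(n_samples, per_sample = 5) {
  n_samples * per_sample
}

# Hypothetical helper: TRUE when the mean neighbourhood size meets the rule
nhood_sizes_ok <- function(sizes, n_samples) {
  mean(sizes) >= min_mean_nhood_size(n_samples)
}

# Toy neighbourhood sizes (made-up numbers, not taken from this dataset)
toy_sizes <- c(22, 35, 41, 58, 64)

min_mean_nhood_size(6)        # threshold for 6 samples
nhood_sizes_ok(toy_sizes, 6)  # compares mean(toy_sizes) with that threshold

# On the real object the sizes could be obtained as column sums of the
# membership matrix (assumption about its layout):
# sizes <- Matrix::colSums(nhoods(milo))
```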

Counting cells in neighbourhoods

Milo leverages the variation in cell numbers between replicates for the same experimental condition to test for differential abundance. Therefore we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata and specify which column contains the sample information.

milo <- countCells(milo, meta.data = as.data.frame(colData(milo)), sample="Sample")

## Checking meta.data validity

## Counting cells in neighbourhoods

This adds to the Milo object a \(n \times m\) matrix, where \(n\) is the number of neighbourhoods and \(m\) is the number of experimental samples. Values indicate the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing.

head(nhoodCounts(milo))

## 6 x 6 sparse Matrix of class "dgCMatrix"
##   S1823 S1824 S1825 S1826 S1827 S1828
## 1    37    27    32    11    19    24
## 2     4     7     5    32     6     2
## 3    42    31     7     .     .     .
## 4    10     5    18     3     5    16
## 5    36     3     8     2     5    21
## 6     9    12    13    13     8    15
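Conceptually, this count matrix is a cross-tabulation of neighbourhood membership against sample of origin. The base R sketch below reproduces the idea on made-up data with 6 cells, 2 overlapping neighbourhoods and 3 samples. It illustrates the bookkeeping only and is not how countCells is implemented.

```r
# Toy membership matrix: 6 cells (rows) x 2 neighbourhoods (columns);
# 1 means the cell belongs to the neighbourhood. cell3 belongs to both.
membership <- matrix(
  c(1, 1, 1, 0, 0, 0,
    0, 0, 1, 1, 1, 1),
  nrow = 6,
  dimnames = list(paste0("cell", 1:6), c("nh1", "nh2"))
)

# Sample of origin for each cell (made-up labels)
cell_sample <- factor(c("A", "A", "B", "B", "B", "C"))

# One-hot encode the samples (cells x samples)
sample_onehot <- model.matrix(~ 0 + cell_sample)
colnames(sample_onehot) <- levels(cell_sample)

# Neighbourhoods x samples: number of cells from each sample in each neighbourhood
toy_counts <- t(membership) %*% sample_onehot
toy_counts
```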

Defining experimental design

Now we are all set to test for differential abundance in neighbourhoods. We implement this hypothesis testing in a generalized linear model (GLM) framework, specifically using the Negative Binomial GLM implementation in edgeR.

We first need to think about our experimental design. The design matrix should match each sample to the experimental condition of interest for DA testing. In this case, we want to detect DA between genotypes, stored in the genotype column of the dataset colData. We also include the chip column in the design matrix. This represents a known technical covariate that we want to account for in DA testing.

design <- data.frame(colData(milo))[,c("Sample", "genotype", "chip")]
## Convert info from integers to factor
design$chip <- as.factor(design$chip) 
design$genotype <- as.factor(design$genotype)
design$genotype <- relevel(design$genotype, "WT")
# simplify data frame to only distinct combinations conditions
design <- distinct(design)
rownames(design) <- design$Sample
design

##       Sample genotype chip
## S1823  S1823       WT    1
## S1824  S1824       WT    2
## S1825  S1825       WT    1
## S1826  S1826       KO    2
## S1827  S1827       KO    2
## S1828  S1828       KO    1
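To see what the formula ~ chip + genotype does with this table, it can help to expand it into a model matrix. The sketch below rebuilds the design shown above by hand (the object names design_toy and mm are ours) and prints the resulting columns. As far as we understand, testNhoods tests the last coefficient of the design when no contrast is given, which here would be the KO versus WT effect; treat that as an assumption to check against the miloR documentation.

```r
# Rebuild the design table printed above
design_toy <- data.frame(
  Sample   = c("S1823", "S1824", "S1825", "S1826", "S1827", "S1828"),
  genotype = factor(c("WT", "WT", "WT", "KO", "KO", "KO")),
  chip     = factor(c(1, 2, 1, 2, 2, 1))
)
# Make WT the reference level, as in the analysis
design_toy$genotype <- relevel(design_toy$genotype, "WT")
rownames(design_toy) <- design_toy$Sample

# Expand the formula into a model matrix: intercept, chip effect, genotype effect
mm <- model.matrix(~ chip + genotype, data = design_toy)
colnames(mm)
mm[, "genotypeKO"]  # 0 for WT samples, 1 for KO samples
```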

Computing neighbourhood connectivity

Milo uses an adaptation of the Spatial FDR correction introduced by cydar, where we correct p-values accounting for the amount of overlap between neighbourhoods. Specifically, each hypothesis test P-value is weighted by the reciprocal of the kth nearest neighbour distance. To use this statistic we first need to store the distances between nearest neighbors in the Milo object. This is done by the calcNhoodDistance function (N.B. this step is the most time consuming of the analysis workflow and might take a couple of minutes for large datasets).

milo <- calcNhoodDistance(milo, d=26, reduced.dim = "PCA")
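The weighting idea can be illustrated with a small weighted Benjamini-Hochberg function. This is a sketch in the spirit of the spatial FDR, not miloR's exact code; the function name weighted_bh and the toy P-values and distances are ours. With equal weights it reduces to the ordinary BH adjustment, which makes it easy to sanity-check.

```r
# Weighted Benjamini-Hochberg adjustment (illustrative sketch)
weighted_bh <- function(pvalues, weights) {
  stopifnot(length(pvalues) == length(weights), all(weights > 0))
  o <- order(pvalues)
  p_sorted <- pvalues[o]
  w_sorted <- weights[o]
  # Weighted analogue of n * p_(i) / i, made monotone from the largest P-value down
  adj_sorted <- rev(cummin(rev(sum(w_sorted) * p_sorted / cumsum(w_sorted))))
  adj <- numeric(length(pvalues))
  adj[o] <- pmin(adj_sorted, 1)
  adj
}

# Toy P-values and made-up kth nearest neighbour distances
p_toy <- c(0.001, 0.20, 0.03, 0.80, 0.04)
kth_dist <- c(0.5, 2, 1, 4, 1)

# Equal weights: matches p.adjust(p_toy, method = "BH")
weighted_bh(p_toy, rep(1, length(p_toy)))

# Weights are the reciprocal of the kth nearest neighbour distance
weighted_bh(p_toy, 1 / kth_dist)
```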

Testing

Now we can run the DA test, explicitly defining our experimental design. In this case, we want to test for differences between genotypes (WT vs KO) while accounting for the variability between technical batches (chip). More information on how to use formulas to define a testing design is available in R's help page for formula (?formula).

da_results <- testNhoods(milo, design = ~ chip + genotype, design.df = design)

## Using TMM normalisation

## Performing spatial FDR correction withk-distance weighting

head(da_results)

##         logFC   logCPM            F       PValue         FDR Nhood   SpatialFDR
## 1 -0.57828416 11.25797  0.955212249 3.420499e-01 0.779898186     1 0.7971967792
## 2  0.40707592 10.04075  0.185535947 6.720499e-01 0.922694641     2 0.9358128038
## 3 -7.87478493 10.31614 35.964926980 2.402185e-05 0.001154598     3 0.0008773739
## 4  0.02159023 10.02610  0.000753043 9.784259e-01 0.985382352     4 0.9859959345
## 5  0.26404708 10.20971  0.114358208 7.393609e-01 0.957971464     5 0.9666211159
## 6  0.16164738 10.30371  0.059348870 8.104328e-01 0.958908049     6 0.9667316338

This calculates a Fold-change and corrected P-value for each neighbourhood, which indicates whether there is significant differential abundance between conditions. The main statistics we consider here are:

logFC: the log2 fold change in cell numbers between KO and WT samples; because WT is the reference level of genotype, negative values indicate neighbourhoods depleted in KO
PValue: reports P-values before FDR correction
SpatialFDR: reports P-values corrected for multiple testing accounting for overlap between neighbourhoods

da_results %>%
  arrange(SpatialFDR) %>%
  head()

##         logFC   logCPM        F       PValue         FDR Nhood   SpatialFDR
## 3   -7.874785 10.31614 35.96493 2.402185e-05 0.001154598     3 0.0008773739
## 49  -8.068135 10.64394 39.99231 1.338283e-05 0.001154598    49 0.0008773739
## 91  -7.710790 10.16092 37.87310 1.810314e-05 0.001154598    91 0.0008773739
## 145 -5.586251 10.66431 33.95387 1.990537e-05 0.001154598   145 0.0008773739
## 192 -8.058749 10.44420 44.82005 7.015453e-06 0.001154598   192 0.0008773739
## 196 -8.101508 10.79911 40.72519 1.208867e-05 0.001154598   196 0.0008773739

Inspecting DA testing results

We can start inspecting the results of our DA analysis from a couple of standard diagnostic plots. We first inspect the distribution of uncorrected P values, to verify that the test was balanced.

ggplot(da_results, aes(PValue)) + geom_histogram(bins=50)

Then we visualize the test results with a volcano plot (remember that each point here represents a neighbourhood, not a cell).

ggplot(da_results, aes(logFC, -log10(SpatialFDR))) + 
  geom_point() +
  geom_hline(yintercept = 1) ## Mark significance threshold (10% FDR)
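The horizontal line sits at 1 because -log10(0.1) = 1, so points above it pass a 10% SpatialFDR threshold. The sketch below applies the same cut-off to the six neighbourhoods printed by head(da_results) above. The data frame da_toy is rebuilt by hand for illustration, and the direction labels assume that logFC is KO relative to WT, which is our reading of the design with WT as reference level.

```r
# The six neighbourhoods shown by head(da_results), rebuilt by hand
da_toy <- data.frame(
  Nhood = 1:6,
  logFC = c(-0.57828416, 0.40707592, -7.87478493,
            0.02159023, 0.26404708, 0.16164738),
  SpatialFDR = c(0.7971967792, 0.9358128038, 0.0008773739,
                 0.9859959345, 0.9666211159, 0.9667316338)
)

# 10% FDR threshold and its position on the volcano plot's y axis
alpha <- 0.1
-log10(alpha)

# Keep significant neighbourhoods and label the direction of change
sig <- da_toy[da_toy$SpatialFDR < alpha, ]
sig$direction <- ifelse(sig$logFC < 0, "depleted in KO", "enriched in KO")
sig
```

On the full results, the same filter would be applied to da_results instead of da_toy.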

The neighbourhoods with the strongest depletion in KO (large negative logFC) correspond to microglia.

To visualize DA results in relation to the embedding of single cells, we can build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding. Each node represents a neighbourhood, while edges indicate how many cells two neighbourhoods have in common. The layout of nodes is determined by the position of the index cell in the TSNE embedding of all single cells. The neighbourhoods displaying significant DA are coloured by their log fold change.

milo <- buildNhoodGraph(milo)
## Plot single-cell TSNE
tsne_pl <- plotReducedDim(milo, dimred = "TSNE", colour_by="genotype", text_by = "clusters_named", 
                          text_size = 3, point_size=0.5) +
  guides(fill="none")
## Plot neighbourhood graph
nh_graph_pl <- plotNhoodGraphDA(milo, da_results, layout="TSNE",alpha=0.1) +  scale_fill_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue"))

## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

## Show both panels side by side
tsne_pl + nh_graph_pl # + plot_layout(guides = "collect")

We might also be interested in visualizing whether DA is particularly evident in certain clusters. To do this, we assign a cluster label to each neighbourhood by finding the most abundant cluster among the cells in each neighbourhood. We can label neighbourhoods in the results data.frame using the function annotateNhoods. This also saves the fraction of cells harbouring the label.

da_results <- annotateNhoods(milo, da_results, coldata_col = "clusters_named")
da_results <- annotateNhoods(milo, da_results, coldata_col = "celltype")
head(da_results)

##         logFC   logCPM            F       PValue         FDR Nhood   SpatialFDR
## 1 -0.57828416 11.25797  0.955212249 3.420499e-01 0.779898186     1 0.7971967792
## 2  0.40707592 10.04075  0.185535947 6.720499e-01 0.922694641     2 0.9358128038
## 3 -7.87478493 10.31614 35.964926980 2.402185e-05 0.001154598     3 0.0008773739
## 4  0.02159023 10.02610  0.000753043 9.784259e-01 0.985382352     4 0.9859959345
## 5  0.26404708 10.20971  0.114358208 7.393609e-01 0.957971464     5 0.9666211159
## 6  0.16164738 10.30371  0.059348870 8.104328e-01 0.958908049     6 0.9667316338
##   clusters_named clusters_named_fraction  celltype celltype_fraction
## 1    Astrocyte_3                       1 Astrocyte                 1
## 2        Oligo_1                       1     Oligo                 1
## 3      Microglia                       1 Microglia                 1
## 4    Astrocyte_3                       1 Astrocyte                 1
## 5    Astrocyte_3                       1 Astrocyte                 1
## 6    Astrocyte_3                       1 Astrocyte                 1

While neighbourhoods tend to be homogeneous, we can define a threshold for celltype_fraction to exclude neighbourhoods that are a mix of cell types.

ggplot(da_results, aes(clusters_named_fraction)) + geom_histogram(bins=50)

da_results$celltype <- ifelse(da_results$celltype_fraction < 0.7, "Mixed", da_results$celltype)
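The labelling and filtering logic can be sketched on toy data: take the most abundant label in each neighbourhood, record the fraction of cells carrying it, and relabel neighbourhoods below the 0.7 threshold as "Mixed". The helper annotate_toy and the cell labels below are made up for illustration; this is not the implementation of annotateNhoods.

```r
# Toy data: cell type label of every cell in two neighbourhoods (made-up)
nhood_cells <- list(
  nh1 = c("Microglia", "Microglia", "Microglia", "Microglia"),
  nh2 = c("Astrocyte", "Astrocyte", "Oligo", "Astrocyte", "Oligo")
)

# Hypothetical helper: majority label and the fraction of cells carrying it
annotate_toy <- function(labels) {
  tab <- table(labels)
  top <- which.max(tab)
  data.frame(
    celltype = names(tab)[top],
    celltype_fraction = tab[[top]] / length(labels),
    stringsAsFactors = FALSE
  )
}

anno <- do.call(rbind, lapply(nhood_cells, annotate_toy))

# Same rule as in the analysis: below 0.7 the neighbourhood is called "Mixed"
anno$celltype <- ifelse(anno$celltype_fraction < 0.7, "Mixed", anno$celltype)
anno
```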

Now we can visualize the distribution of DA Fold Changes in different cell types or clusters

plotDAbeeswarm(da_results, group.by = "celltype") +
  scale_colour_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue")) + xlab("")

## Converting group.by to factor...

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

plotDAbeeswarm(da_results, group.by = "clusters_named") +
  scale_colour_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue")) + xlab("")

## Converting group.by to factor...
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

#save results
saveRDS(da_results, here("processed", project, "da_results.rds"))
