Packages

library(miloR) # compositional analysis
library(here) # reproducible paths
library(scater) # sc plots
library(dplyr) # modify design df

This analysis follows the [miloR vignette](https://rawcdn.githack.com/MarioniLab/miloR/7c7f906b94a73e62e36e095ddb3e3567b414144e/vignettes/milo_gastrulation.html#5_Finding_markers_of_DA_populations).

Load data

For this study we load our processed and annotated SingleCellExperiment object.

source(here("src/colours.R"))
project <- "fire-mice"
fig_path <- here("outs", project, "plots","DA_miloR", "/")
sce <- readRDS(here("processed", project, "sce_anno_02.RDS"))

Visualize the data

plotReducedDim(sce, colour_by="genotype", dimred = "TSNE", text_by = "clusters_named")

We will test for significant differences in abundance of cells between WT and KO, and the associated gene signatures.

Differential abundance testing

Create a Milo object

For differential abundance analysis on graph neighbourhoods we first construct a Milo object. This extends the SingleCellExperiment class to store information about neighbourhoods on the KNN graph.

milo <- Milo(sce)
milo

## class: Milo 
## dim: 19148 15418 
## metadata(1): Samples
## assays(3): counts logcounts logcounts_raw
## rownames(19148): Xkr4 Gm19938 ... CAAA01147332.1 AC149090.1
## rowData names(6): ID Symbol ... gene_name subset
## colnames(15418): 1_AAACCCAAGGCCTGCT-1 1_AAAGAACAGCATTTGC-1 ...
##   12_TTTGGAGTCACTCACC-1 12_TTTGGAGTCGAGCCAC-1
## colData names(38): Sample Barcode ... originalexp_snn_res.0.9
##   originalexp_snn_res.1
## reducedDimNames(5): PCA_coldata PCA PCA_all UMAP TSNE
## mainExpName: NULL
## altExpNames(0):
## nhoods dimensions(2): 1 1
## nhoodCounts dimensions(2): 1 1
## nhoodDistances dimension(1): 0
## graph names(0):
## nhoodIndex names(1): 0
## nhoodExpression dimension(2): 1 1
## nhoodReducedDim names(0):
## nhoodGraph names(0):
## nhoodAdjacency dimension(2): 1 1

Construct KNN graph

We need to add the KNN graph to the Milo object. This is stored in the graph slot, in igraph format. The miloR package includes functionality to build and store the graph from the PCA dimensions stored in the reducedDim slot.

For graph building you need to define a few parameters:

d: the number of reduced dimensions to use for KNN refinement. We recommend using the same \(d\) used for KNN graph building. In our case this was 26 dimensions (see the feature_selection_dimred_02 script); note that the calls below pass d = 25.
k: this affects the power of DA testing, since we need enough cells from each sample represented in a neighbourhood to estimate the variance between replicates. On the other hand, increasing \(k\) too much might lead to over-smoothing. We suggest starting with the same value of \(k\) used for KNN graph building for clustering and UMAP visualization, in our case k = 20. We will later use some heuristics to evaluate whether the value of \(k\) should be increased; here it was raised to k = 30 after checking the neighbourhood sizes.

# k modified after checking neighbourhoods
milo <- buildGraph(milo, k = 30, d = 25, reduced.dim = "PCA")

## Constructing kNN graph with k:30

Alternatively, one can add a precomputed KNN graph (for example constructed with Seurat or scanpy) to the graph slot using the adjacency matrix, through the helper function buildFromAdjacency.

Defining representative neighbourhoods on the KNN graph

We define the neighbourhood of a cell, the index, as the group of cells connected by an edge in the KNN graph to the index cell. For efficiency, we don’t test for DA in the neighbourhood of every cell, but we sample as indices a subset of representative cells, using a KNN sampling algorithm used by Gut et al. 2015.

As well as \(d\) and \(k\), for sampling we need to define a few additional parameters:

prop: the proportion of cells to randomly sample to start with. We suggest using prop=0.1 for datasets of less than 30k cells. For bigger datasets using prop=0.05 should be sufficient (and makes computation faster).
refined: indicates whether you want to use the sampling refinement algorithm, or just pick cells at random. The default and recommended way to go is to use refinement. The only situation in which you might consider using random instead, is if you have batch corrected your data with a graph based correction algorithm, such as BBKNN, but the results of DA testing will be suboptimal.

set.seed(1)
milo <- makeNhoods(milo, prop = 0.1, k = 30, d=25, refined = TRUE, reduced_dims = "PCA")

## Checking valid object

## Running refined sampling with reduced_dim

Once we have defined neighbourhoods, we plot the distribution of neighbourhood sizes (i.e. how many cells form each neighbourhood) to evaluate whether the value of \(k\) used for graph building was appropriate. We can check this out using the plotNhoodSizeHist function.

As a rule of thumb, we want an average neighbourhood size over 5 x N_samples, or a distribution peaking between 50 and 100. Otherwise, consider rerunning makeNhoods with a larger k and/or prop. In our case, with 6 samples, the average should be at least 30 (5 x 6 samples), so we reran makeNhoods with increasing k until this was reached. Note that the Sample column used for counting below has 12 levels, for which the same rule would give 5 x 12 = 60.

plotNhoodSizeHist(milo)
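
The rule of thumb above can also be checked numerically. The following is a minimal sketch on a simulated cells-by-neighbourhoods membership matrix; `nhood_membership`, `check_nhood_sizes` and `min_per_sample` are hypothetical names introduced for illustration, and the simulated values stand in for the real data.

```r
# Toy check of the "mean neighbourhood size >= 5 x N_samples" rule of thumb.
# The membership matrix is simulated; all names here are illustrative.
set.seed(1)
n_cells  <- 500
n_nhoods <- 20
# 1 = the cell belongs to the neighbourhood, 0 = it does not
nhood_membership <- matrix(rbinom(n_cells * n_nhoods, size = 1, prob = 0.15),
                           nrow = n_cells, ncol = n_nhoods)

check_nhood_sizes <- function(membership, n_samples, min_per_sample = 5) {
  sizes <- colSums(membership > 0)  # number of cells in each neighbourhood
  list(mean_size = mean(sizes),
       target    = min_per_sample * n_samples,
       ok        = mean(sizes) >= min_per_sample * n_samples)
}

size_check <- check_nhood_sizes(nhood_membership, n_samples = 6)
size_check$target  # 30
size_check$ok
```

On the real object, the sizes could presumably be taken from the column sums of `nhoods(milo)`, assuming that accessor returns the cells-by-neighbourhoods membership matrix.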

Counting cells in neighbourhoods

Milo leverages the variation in cell numbers between replicates for the same experimental condition to test for differential abundance. Therefore we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata and specify which column contains the sample information.

milo <- countCells(milo, meta.data = as.data.frame(colData(milo)), sample="Sample")

## Checking meta.data validity

## Counting cells in neighbourhoods

This adds to the Milo object a \(n \times m\) matrix, where \(n\) is the number of neighbourhoods and \(m\) is the number of experimental samples. Values indicate the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing.

head(nhoodCounts(milo))

## 6 x 12 sparse Matrix of class "dgCMatrix"

##    [[ suppressing 12 column names 'WTvo1_1', 'WTvo1_2', 'KOvo2_1' ... ]]

##                               
## 1 5 4 2 1 13 12 1 .  5 7 11  4
## 2 4 3 . 1  4  6 4 3  4 6  9  3
## 3 1 2 . . 18 18 3 2  7 1  4  3
## 4 4 7 3 .  .  1 4 1 22 .  .  1
## 5 3 3 2 4  6  2 4 . 10 4 14 14
## 6 . 1 1 1 21 23 2 5  . .  .  .
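
To make the structure of this matrix concrete, here is a small sketch that builds a neighbourhood-by-sample count matrix by hand. All data are made up, and `membership`, `sample_id` and `toy_counts` are hypothetical names; this illustrates the idea and is not how countCells is implemented.

```r
# Toy version of the neighbourhood-by-sample count matrix (all data made up).
# Rows of `membership` are cells, columns are neighbourhoods.
membership <- matrix(c(1, 1, 0,
                       1, 0, 0,
                       0, 1, 1,
                       0, 1, 1,
                       1, 0, 1),
                     ncol = 3, byrow = TRUE,
                     dimnames = list(paste0("cell", 1:5), paste0("nhood", 1:3)))
sample_id <- factor(c("S1", "S1", "S2", "S2", "S2"))

# One indicator column per sample, then sum cells per neighbourhood and sample
sample_indicator <- model.matrix(~ 0 + sample_id)
colnames(sample_indicator) <- levels(sample_id)
toy_counts <- t(membership) %*% sample_indicator
toy_counts
```

Each row of `toy_counts` sums to the size of the corresponding neighbourhood, since every member cell comes from exactly one sample.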

Defining experimental design

Now we are all set to test for differential abundance in neighbourhoods. We implement this hypothesis testing in a generalized linear model (GLM) framework, specifically using the Negative Binomial GLM implementation in edgeR.

We first need to think about our experimental design. The design matrix should match each sample to the experimental condition of interest for DA testing. In this case, we want to detect DA between genotypes, stored in the genotype column of the dataset colData. We also include the chip column in the design matrix. This represents a known technical covariate that we want to account for in DA testing.

design <- data.frame(colData(milo))[,c("Sample", "genotype", "chip")]
## Convert info from integers to factor
design$chip <- as.factor(design$chip) 
design$genotype <- as.factor(design$genotype)
design$genotype <- relevel(design$genotype, "WT")
# simplify the data frame to one row per distinct sample/condition combination
design <- distinct(design)
rownames(design) <- design$Sample
design

##          Sample genotype chip
## WTvo1_1 WTvo1_1       WT    7
## WTvo1_2 WTvo1_2       WT    7
## KOvo2_1 KOvo2_1       KO    8
## KOvo2_2 KOvo2_2       KO    8
## KOvo3_1 KOvo3_1       KO    8
## KOvo3_2 KOvo3_2       KO    8
## KOvo1_1 KOvo1_1       KO    7
## KOvo1_2 KOvo1_2       KO    7
## WTvo2_1 WTvo2_1       WT    8
## WTvo2_2 WTvo2_2       WT    8
## WTvo3_1 WTvo3_1       WT    8
## WTvo3_2 WTvo3_2       WT    8
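
It can help to look at how a formula such as `~ chip + genotype` expands into a model matrix for this design. The sketch below rebuilds the table printed above by hand (`design_toy` and `mm` are hypothetical names) and uses only base R. To our understanding the genotype term, placed last, is the one tested by default, but that is an assumption worth checking in the testNhoods documentation.

```r
# Rebuild the design table printed above and inspect the resulting model matrix.
design_toy <- data.frame(
  Sample   = c("WTvo1_1", "WTvo1_2", "KOvo2_1", "KOvo2_2", "KOvo3_1", "KOvo3_2",
               "KOvo1_1", "KOvo1_2", "WTvo2_1", "WTvo2_2", "WTvo3_1", "WTvo3_2"),
  genotype = c("WT", "WT", "KO", "KO", "KO", "KO",
               "KO", "KO", "WT", "WT", "WT", "WT"),
  chip     = c(7, 7, 8, 8, 8, 8, 7, 7, 8, 8, 8, 8),
  stringsAsFactors = FALSE
)
design_toy$chip     <- factor(design_toy$chip)
design_toy$genotype <- relevel(factor(design_toy$genotype), ref = "WT")
rownames(design_toy) <- design_toy$Sample

# WT is the reference level, so the genotype column encodes KO versus WT
mm <- model.matrix(~ chip + genotype, data = design_toy)
colnames(mm)  # "(Intercept)" "chip8" "genotypeKO"
```

With WT as the reference level, a positive coefficient for `genotypeKO` corresponds to higher abundance in KO than in WT.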

Computing neighbourhood connectivity

Milo uses an adaptation of the Spatial FDR correction introduced by cydar, where we correct p-values accounting for the amount of overlap between neighbourhoods. Specifically, each hypothesis test P-value is weighted by the reciprocal of the kth nearest neighbour distance. To use this statistic we first need to store the distances between nearest neighbors in the Milo object. This is done by the calcNhoodDistance function (N.B. this step is the most time consuming of the analysis workflow and might take a couple of minutes for large datasets).

milo <- calcNhoodDistance(milo, d=25, reduced.dim = "PCA")
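
The weighting described above can be illustrated with a weighted Benjamini-Hochberg adjustment. This is a minimal sketch of our understanding of the idea, not the package's code; `weighted_bh`, `kth_distance` and `p_toy` are hypothetical names, and the distances are made up.

```r
# Sketch of a weighted Benjamini-Hochberg correction (illustrative only).
# `kth_distance` is a hypothetical vector of k-th nearest neighbour distances,
# one per neighbourhood; smaller distances (denser regions) get larger weights.
weighted_bh <- function(pvalues, kth_distance) {
  w <- 1 / kth_distance
  w[!is.finite(w)] <- 0
  o <- order(pvalues)
  p_sorted <- pvalues[o]
  w_sorted <- w[o]
  adj <- numeric(length(pvalues))
  # weighted analogue of p * n / rank, made monotone from the largest p-value down
  adj[o] <- rev(cummin(rev(sum(w_sorted) * p_sorted / cumsum(w_sorted))))
  pmin(adj, 1)
}

p_toy <- c(0.001, 0.20, 0.03, 0.04, 0.50)
# With equal weights this reduces to the ordinary BH adjustment
all.equal(weighted_bh(p_toy, rep(1, 5)), p.adjust(p_toy, method = "BH"))
# With unequal (made-up) distances the adjustment shifts towards dense regions
weighted_bh(p_toy, c(0.5, 1, 2, 1, 0.8))
```

The equal-weights comparison is a quick sanity check: if every neighbourhood had the same k-th neighbour distance, the weighted correction and the standard FDR would coincide.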

Testing

Now we can do the DA test, explicitly defining our experimental design. In this case, we want to test for differences between genotype WT and KO, while accounting for the variability between technical batches (You can find more info on how to use formulas to define a testing design in R here)

da_results <- testNhoods(milo, design = ~ chip + genotype, design.df = design)

## Using TMM normalisation

## Performing spatial FDR correction withk-distance weighting

head(da_results)

##        logFC    logCPM         F       PValue         FDR Nhood  SpatialFDR
## 1 -0.7685872 10.036900  1.325872 0.2623770329 0.514222353     1 0.516848597
## 2 -0.8280734  9.806620  1.838366 0.1894273594 0.421241435     2 0.424414888
## 3  0.8020197  9.867048  1.035801 0.3204156533 0.570248461     3 0.574221853
## 4 -2.0212901  9.848748  3.987672 0.0589964213 0.200588358     4 0.197370330
## 5 -1.3723491 10.100008  4.922181 0.0375723568 0.148048269     5 0.144676628
## 6  4.5784177  9.767864 19.229852 0.0002588503 0.003380103     6 0.002966256

This calculates a Fold-change and corrected P-value for each neighbourhood, which indicates whether there is significant differential abundance between conditions. The main statistics we consider here are:

logFC: indicates the log-Fold change in cell numbers between samples from KO and WT. WT is the reference level of genotype, so positive values indicate higher abundance in KO
PValue: reports P-values before FDR correction
SpatialFDR: reports P-values corrected for multiple testing accounting for overlap between neighbourhoods

da_results %>%
  arrange(SpatialFDR) %>%
  head()

##         logFC   logCPM        F       PValue          FDR Nhood   SpatialFDR
## 15  -6.914346 10.57783 66.23125 4.109587e-07 2.884416e-05    15 2.105539e-05
## 177 -7.309169 10.94815 72.43165 2.268748e-07 2.884416e-05   177 2.105539e-05
## 178 -6.817482 10.55910 68.62413 3.250590e-07 2.884416e-05   178 2.105539e-05
## 262 -7.029305 10.68119 73.20424 2.113142e-07 2.884416e-05   262 2.105539e-05
## 265 -5.599947 10.42492 57.12232 1.917699e-07 2.884416e-05   265 2.105539e-05
## 274 -6.895875 10.53473 69.95337 2.861839e-07 2.884416e-05   274 2.105539e-05

Inspecting DA testing results

We can start inspecting the results of our DA analysis from a couple of standard diagnostic plots. We first inspect the distribution of uncorrected P values, to verify that the test was balanced.

ggplot(da_results, aes(PValue)) + geom_histogram(bins=50)

Then we visualize the test results with a volcano plot (remember that each point here represents a neighbourhood, not a cell).

ggplot(da_results, aes(logFC, -log10(SpatialFDR))) + 
  geom_point() +
  geom_hline(yintercept = 1) ## Mark significance threshold (10% FDR)
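
The horizontal line sits at 1 because -log10(0.1) = 1. As a small worked example, the sketch below applies the same 10% threshold to the first six rows of da_results shown earlier; `da_toy`, `fdr_threshold` and the direction labels are hypothetical names, and the labels assume that positive logFC means higher abundance in KO.

```r
# Values copied from the first six rows of da_results printed earlier
da_toy <- data.frame(
  logFC      = c(-0.7685872, -0.8280734, 0.8020197,
                 -2.0212901, -1.3723491, 4.5784177),
  SpatialFDR = c(0.516848597, 0.424414888, 0.574221853,
                 0.197370330, 0.144676628, 0.002966256)
)
fdr_threshold <- 0.1
-log10(fdr_threshold)  # height of the horizontal line in the volcano plot

# Label each neighbourhood by significance and direction of change
da_toy$direction <- ifelse(da_toy$SpatialFDR >= fdr_threshold, "not significant",
                           ifelse(da_toy$logFC > 0, "enriched in KO", "depleted in KO"))
table(da_toy$direction)
```

Among these six rows only neighbourhood 6 passes the threshold; the same labelling could be applied to the full da_results table.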

The neighbourhoods with the strongest negative log-Fold change, i.e. those depleted in KO, are the microglia.

To visualize DA results in relation to the embedding of single cells, we can build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding. Here each node represents a neighbourhood, while edges indicate how many cells two neighbourhoods have in common. The layout of nodes is determined by the position of the index cell in the t-SNE embedding of all single cells. The neighbourhoods displaying significant DA are coloured by their log-Fold Change.

milo <- buildNhoodGraph(milo)
## Plot single-cell TSNE
tsne_pl <- plotReducedDim(milo, dimred = "TSNE", colour_by="genotype",
                         #  text_by = "clusters_named",  text_size = 3, 
                          point_size=0.5) + 
  guides(fill="none")  + scale_color_manual(values = c(col_wt_ko[1],col_wt_ko[2])) + labs(color="genotype")

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

## Plot neighbourhood graph
nh_graph_pl <- plotNhoodGraphDA(milo, da_results, layout = "TSNE", alpha = 0.1) +
  scale_fill_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue")) +
  labs(fill = "logFC")

## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

tsne_pl + nh_graph_pl #+

 #plot_layout(guides="collect")

We might also be interested in visualizing whether DA is particularly evident in certain clusters. To do this, we assign a cluster label to each neighbourhood by finding the most abundant cluster among the cells in each neighbourhood. We can label neighbourhoods in the results data.frame using the function annotateNhoods. This also saves the fraction of cells harbouring the label.

da_results <- annotateNhoods(milo, da_results, coldata_col = "clusters_named")
da_results <- annotateNhoods(milo, da_results, coldata_col = "celltype")
head(da_results)

##        logFC    logCPM         F       PValue         FDR Nhood  SpatialFDR
## 1 -0.7685872 10.036900  1.325872 0.2623770329 0.514222353     1 0.516848597
## 2 -0.8280734  9.806620  1.838366 0.1894273594 0.421241435     2 0.424414888
## 3  0.8020197  9.867048  1.035801 0.3204156533 0.570248461     3 0.574221853
## 4 -2.0212901  9.848748  3.987672 0.0589964213 0.200588358     4 0.197370330
## 5 -1.3723491 10.100008  4.922181 0.0375723568 0.148048269     5 0.144676628
## 6  4.5784177  9.767864 19.229852 0.0002588503 0.003380103     6 0.002966256
##   clusters_named clusters_named_fraction        celltype celltype_fraction
## 1        mOligo1               1.0000000 Oligodendrocyte         1.0000000
## 2  BAMs&DCs&Mono               1.0000000          Immune         1.0000000
## 3  BAMs&DCs&Mono               1.0000000          Immune         1.0000000
## 4         Astro1               0.8837209       Astrocyte         0.9302326
## 5         Astro3               1.0000000       Astrocyte         1.0000000
## 6        mOligo2               0.9629630 Oligodendrocyte         1.0000000

While neighbourhoods tend to be homogeneous, we can define a threshold for celltype_fraction to exclude neighbourhoods that are a mix of cell types.

ggplot(da_results, aes(clusters_named_fraction)) + geom_histogram(bins=50)

da_results$celltype <- ifelse(da_results$celltype_fraction < 0.7, "Mixed", da_results$celltype)
da_results$clusters_named <- ifelse(da_results$clusters_named_fraction < 0.7, "Mixed", da_results$clusters_named)
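
The annotation and filtering steps above can be illustrated on a toy example. This is a minimal base R sketch with made-up data; `nhood_cells`, `cell_label` and `annotate_toy` are hypothetical names, and the function only mimics the majority-label idea rather than reproducing annotateNhoods.

```r
# Toy majority-label annotation of neighbourhoods (made-up data).
# Rows are cells, columns are neighbourhoods; 1 marks membership.
nhood_cells <- matrix(c(1, 0,
                        1, 0,
                        1, 1,
                        0, 1,
                        0, 1,
                        1, 1), ncol = 2, byrow = TRUE)
cell_label <- c("Astro", "Astro", "Astro", "Oligo", "Immune", "Oligo")

annotate_toy <- function(membership, labels) {
  do.call(rbind, lapply(seq_len(ncol(membership)), function(j) {
    tab <- table(labels[membership[, j] > 0])  # label counts in neighbourhood j
    data.frame(Nhood = j,
               label = names(tab)[which.max(tab)],
               label_fraction = max(tab) / sum(tab),
               stringsAsFactors = FALSE)
  }))
}

toy_anno <- annotate_toy(nhood_cells, cell_label)
# Same 0.7 cut-off as above: neighbourhoods below it are flagged as "Mixed"
toy_anno$label <- ifelse(toy_anno$label_fraction < 0.7, "Mixed", toy_anno$label)
toy_anno
```

Here the first neighbourhood is 75% "Astro" and keeps its label, while the second has no label above the cut-off and becomes "Mixed".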

Now we can visualize the distribution of DA log-Fold Changes across cell types or clusters.

# reorder factor
da_results$clusters_named <- factor(da_results$clusters_named, levels =c("Mixed", rev(levels(sce$clusters_named))))

plotDAbeeswarm(da_results, group.by = "clusters_named") +
  scale_colour_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue")) + xlab("")

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

#save results
saveRDS(da_results, here("processed", project, "da_results.rds"))

Session Info

sessionInfo()

## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] pals_1.7                    dplyr_1.0.9                
##  [3] scater_1.24.0               ggplot2_3.3.6              
##  [5] scuttle_1.6.2               SingleCellExperiment_1.18.0
##  [7] SummarizedExperiment_1.26.1 Biobase_2.56.0             
##  [9] GenomicRanges_1.48.0        GenomeInfoDb_1.32.3        
## [11] IRanges_2.30.0              S4Vectors_0.34.0           
## [13] BiocGenerics_0.42.0         MatrixGenerics_1.8.1       
## [15] matrixStats_0.62.0          here_1.0.1                 
## [17] miloR_1.4.0                 edgeR_3.38.4               
## [19] limma_3.52.2               
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              RColorBrewer_1.1-3       
##  [3] rprojroot_2.0.3           tools_4.2.1              
##  [5] bslib_0.3.1               utf8_1.2.2               
##  [7] R6_2.5.1                  irlba_2.3.5              
##  [9] vipor_0.4.5               DBI_1.1.3                
## [11] colorspace_2.0-3          withr_2.5.0              
## [13] tidyselect_1.1.2          gridExtra_2.3            
## [15] compiler_4.2.1            cli_3.3.0                
## [17] BiocNeighbors_1.14.0      DelayedArray_0.22.0      
## [19] labeling_0.4.2            sass_0.4.1               
## [21] scales_1.2.0              stringr_1.4.0            
## [23] digest_0.6.29             rmarkdown_2.14           
## [25] XVector_0.36.0            dichromat_2.0-0.1        
## [27] pkgconfig_2.0.3           htmltools_0.5.2          
## [29] sparseMatrixStats_1.8.0   highr_0.9                
## [31] maps_3.4.0                fastmap_1.1.0            
## [33] rlang_1.0.3               rstudioapi_0.13          
## [35] DelayedMatrixStats_1.18.0 jquerylib_0.1.4          
## [37] farver_2.1.1              generics_0.1.3           
## [39] jsonlite_1.8.0            gtools_3.9.3             
## [41] BiocParallel_1.30.3       RCurl_1.98-1.8           
## [43] magrittr_2.0.3            BiocSingular_1.12.0      
## [45] GenomeInfoDbData_1.2.8    patchwork_1.1.1          
## [47] Matrix_1.4-1              ggbeeswarm_0.6.0         
## [49] Rcpp_1.0.9                munsell_0.5.0            
## [51] fansi_1.0.3               viridis_0.6.2            
## [53] lifecycle_1.0.1           stringi_1.7.8            
## [55] yaml_2.3.5                ggraph_2.0.5             
## [57] MASS_7.3-57               zlibbioc_1.42.0          
## [59] grid_4.2.1                parallel_4.2.1           
## [61] ggrepel_0.9.1             crayon_1.5.1             
## [63] lattice_0.20-45           splines_4.2.1            
## [65] cowplot_1.1.1             graphlayouts_0.8.0       
## [67] beachmat_2.12.0           mapproj_1.2.8            
## [69] locfit_1.5-9.6            knitr_1.39               
## [71] pillar_1.7.0              igraph_1.3.4             
## [73] codetools_0.2-18          ScaledMatrix_1.4.0       
## [75] glue_1.6.2                evaluate_0.15            
## [77] vctrs_0.4.1               tweenr_1.0.2             
## [79] gtable_0.3.0              purrr_0.3.4              
## [81] polyclip_1.10-0           tidyr_1.2.0              
## [83] assertthat_0.2.1          xfun_0.31                
## [85] ggforce_0.3.3             rsvd_1.0.5               
## [87] tidygraph_1.2.1           viridisLite_0.4.0        
## [89] tibble_3.1.7              beeswarm_0.4.0           
## [91] statmod_1.4.37            ellipsis_0.3.2

Compositional analysis with Milo

Nadine Bestard

2023-02-09