Package

library(miloR) # compositional analysis
library(here) # reproducible paths
library(scater) # sc plots
library(dplyr) # modify design df
library(ggplot2) #plots

This analysis have been done following the MiloR vignette

Load data

For this study we will load our processed single cell experiment with subsetted oligos

project <- "old"
sce <- readRDS(here("processed", project, "sce_oligo_clusters_01.RDS"))

Visualize the data

plotReducedDim(sce, colour_by="genotype", dimred = "UMAP", text_by = "clusters_named")

## Warning: Removed 17 rows containing missing values (geom_text).

We will test for significant differences in abundance of cells between WT and KO, and the associated gene signatures.

Differential abundance testing

Create a Milo object

For differential abundance analysis on graph neighbourhoods we first construct a Milo object. This extends the SingleCellExperiment class to store information about neighbourhoods on the KNN graph.

milo <- Milo(sce)
milo

## class: Milo 
## dim: 17234 3061 
## metadata(1): Samples
## assays(3): counts logcounts logcounts_raw
## rownames(17234): Xkr4 Gm19938 ... CAAA01147332.1 AC149090.1
## rowData names(6): ID Symbol ... gene_name subset
## colnames(3061): 1_AAACGAATCCTTACCG-1 1_AAAGGATAGTGAATAC-1 ...
##   6_TTTGACTTCCTCGCAT-1 6_TTTGATCGTTCCGTTC-1
## colData names(39): Sample Barcode ... cluster_oligo_k150
##   cluster_oligo_k200
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: NULL
## altExpNames(0):
## nhoods dimensions(2): 1 1
## nhoodCounts dimensions(2): 1 1
## nhoodDistances dimension(1): 0
## graph names(0):
## nhoodIndex names(1): 0
## nhoodExpression dimension(2): 1 1
## nhoodReducedDim names(0):
## nhoodGraph names(0):
## nhoodAdjacency dimension(2): 1 1

Construct KNN graph

We need to add the KNN graph to the Milo object. This is stored in the graph slot, in igraph format. The miloR package includes functionality to build and store the graph from the PCA dimensions stored in the reducedDim slot.

For graph building you need to define a few parameters:

d: the number of reduced dimensions to use for KNN refinement. We recommend using the same \(d\) used for KNN graph building. In our case 26 dimensions (see feature_selection_dimred_02 script)
k: this affects the power of DA testing, since we need to have enough cells from each sample represented in a neighbourhood to estimate the variance between replicates. On the other side, increasing \(k\) too much might lead to over-smoothing. We suggest to start by using the same value for \(k\) used for KNN graph building for clustering and UMAP visualization. In our case k10, that is the clustering resolution that shows the small cluster, more abundant in KO, as a separate cluster.

# k modified after checking neighbourhoods
milo <- buildGraph(milo, k = 10, d = 25, reduced.dim = "PCA")

## Constructing kNN graph with k:10

Alternatively, one can add a precomputed KNN graph (for example constructed with Seurat or scanpy) to the graph slot using the adjacency matrix, through the helper function buildFromAdjacency.

Defining representative neighbourhoods on the KNN graph

We define the neighbourhood of a cell, the index, as the group of cells connected by an edge in the KNN graph to the index cell. For efficiency, we don’t test for DA in the neighbourhood of every cell, but we sample as indices a subset of representative cells, using a KNN sampling algorithm used by Gut et al. 2015.

As well as \(d\) and \(k\), for sampling we need to define a few additional parameters:

prop: the proportion of cells to randomly sample to start with. We suggest using prop=0.1 for datasets of less than 30k cells. For bigger datasets using prop=0.05 should be sufficient (and makes computation faster).
refined: indicates whether you want to use the sampling refinement algorithm, or just pick cells at random. The default and recommended way to go is to use refinement. The only situation in which you might consider using random instead, is if you have batch corrected your data with a graph based correction algorithm, such as BBKNN, but the results of DA testing will be suboptimal.

milo <- makeNhoods(milo, prop = 0.1, k = 10, d=26, refined = TRUE, reduced_dims = "PCA")

## Checking valid object

## Warning in makeNhoods(milo, prop = 0.1, k = 10, d = 26, refined = TRUE, : Specified d is higher than the total number of dimensions in reducedDim(x, reduced_dims). Falling back to using25dimensions

Once we have defined neighbourhoods, we plot the distribution of neighbourhood sizes (i.e. how many cells form each neighbourhood) to evaluate whether the value of \(k\) used for graph building was appropriate. We can check this out using the plotNhoodSizeHist function.

As a rule of thumb we want to have an average neighbourhood size over 5 x N_samples or to have a distribution peaking between 50 and 100. Otherwise you might consider rerunning makeNhoods increasing k and/or prop. In our case, 6 samples, an average of minimum 30 is expected, so we rerun makeNhood increasing k until we have an average of minimum 30 (5 x 6samples).

plotNhoodSizeHist(milo)

Counting cells in neighbourhoods

Milo leverages the variation in cell numbers between replicates for the same experimental condition to test for differential abundance. Therefore we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata and specify which column contains the sample information.

milo <- countCells(milo, meta.data = as.data.frame(colData(milo)), sample="Sample")

## Checking meta.data validity

## Counting cells in neighbourhoods

This adds to the Milo object a \(n \times m\) matrix, where \(n\) is the number of neighbourhoods and \(m\) is the number of experimental samples. Values indicate the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing.

head(nhoodCounts(milo))

## 6 x 6 sparse Matrix of class "dgCMatrix"
##   S1823 S1824 S1825 S1826 S1827 S1828
## 1     .     7     1     8     2     .
## 2    10     6     6    10     7     6
## 3     3    11     3    34     4     3
## 4     .     .     .     7     2     9
## 5     4     4     6     4     .     4
## 6     6     4     3     8     3     2

Defining experimental design

Now we are all set to test for differential abundance in neighbourhoods. We implement this hypothesis testing in a generalized linear model (GLM) framework, specifically using the Negative Binomial GLM implementation in edgeR.

We first need to think about our experimental design. The design matrix should match each sample to the experimental condition of interest for DA testing. In this case, we want to detect DA between genotypes, stored in the genotype column of the dataset colData. We also include the chip column in the design matrix. This represents a known technical covariate that we want to account for in DA testing.

design <- data.frame(colData(milo))[,c("Sample", "genotype", "chip")]
## Convert info from integers to factor
design$chip <- as.factor(design$chip) 
design$genotype <- as.factor(design$genotype)
design$genotype <- relevel(design$genotype, "WT")
# simplify data frame to only distinct combinations conditions
design <- distinct(design)
rownames(design) <- design$Sample
design

##       Sample genotype chip
## S1823  S1823       WT    1
## S1824  S1824       WT    2
## S1825  S1825       WT    1
## S1826  S1826       KO    2
## S1827  S1827       KO    2
## S1828  S1828       KO    1

Computing neighbourhood connectivity

Milo uses an adaptation of the Spatial FDR correction introduced by cydar, where we correct p-values accounting for the amount of overlap between neighbourhoods. Specifically, each hypothesis test P-value is weighted by the reciprocal of the kth nearest neighbour distance. To use this statistic we first need to store the distances between nearest neighbors in the Milo object. This is done by the calcNhoodDistance function (N.B. this step is the most time consuming of the analysis workflow and might take a couple of minutes for large datasets).

milo <- calcNhoodDistance(milo, d=25, reduced.dim = "PCA")

Testing

Now we can do the DA test, explicitly defining our experimental design. In this case, we want to test for differences between genotype WT and KO, while accounting for the variability between technical batches (You can find more info on how to use formulas to define a testing design in R here)

da_results <- testNhoods(milo, design = ~ chip + genotype, design.df = design)

## Using TMM normalisation

## Performing spatial FDR correction withk-distance weighting

head(da_results)

##         logFC   logCPM            F      PValue        FDR Nhood SpatialFDR
## 1 -0.99196465 12.07753  0.733562844 0.401901618 0.78251709     1 0.80993246
## 2 -0.04454733 13.14233  0.005410233 0.942097532 0.96246222     2 0.96438122
## 3  0.27091321 13.21357  0.173065329 0.681842805 0.89934360     3 0.90294079
## 4  6.14915086 12.17319 13.937258221 0.001527721 0.04073923     4 0.04655878
## 5 -0.68046752 12.38738  0.525241312 0.477029088 0.80809685     5 0.82453297
## 6 -0.38497202 12.51019  0.196945780 0.661973342 0.89759097     6 0.90128880

This calculates a Fold-change and corrected P-value for each neighbourhood, which indicates whether there is significant differential abundance between conditions. The main statistics we consider here are:

logFC: indicates the log-Fold change in cell numbers between samples from WT and KO
PValue: reports P-values before FDR correction
SpatialFDR: reports P-values corrected for multiple testing accounting for overlap between neighbourhoods

da_results %>%
  arrange(SpatialFDR) %>%
  head()

##        logFC   logCPM        F       PValue        FDR Nhood SpatialFDR
## 63  6.407494 12.55481 21.97682 0.0001844990 0.01690373    63 0.01908809
## 123 6.735876 12.61518 25.44835 0.0000849896 0.01690373   123 0.01908809
## 157 6.185890 12.85446 21.40031 0.0002112966 0.01690373   157 0.01908809
## 30  6.066755 12.73343 17.55957 0.0005527346 0.03316408    30 0.03738006
## 4   6.149151 12.17319 13.93726 0.0015277213 0.04073923     4 0.04655878
## 70  5.737011 12.66392 14.72173 0.0012130686 0.04073923    70 0.04655878

Inspecting DA testing results

We can start inspecting the results of our DA analysis from a couple of standard diagnostic plots. We first inspect the distribution of uncorrected P values, to verify that the test was balanced.

ggplot(da_results, aes(PValue)) + geom_histogram(bins=50)

Then we visualize the test results with a volcano plot (remember that each point here represents a neighbourhood, not a cell).

ggplot(da_results, aes(logFC, -log10(SpatialFDR))) + 
  geom_point() +
  geom_hline(yintercept = 1) ## Mark significance threshold (10% FDR)

To visualize DA results relating them to the embedding of single cells, we can build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding. Here each node represents a neighbourhood, while edges indicate how many cells two neighbourhoods have in common. Here the layout of nodes is determined by the position of the index cell in the UMAP embedding of all single-cells. The neighbourhoods displaying significant DA are coloured by their log-Fold Change.

milo <- buildNhoodGraph(milo)
## Plot single-cell UMAP
UMAP_pl <- plotReducedDim(milo, dimred = "UMAP", colour_by="genotype", text_by = "clusters_named", 
                          text_size = 3, point_size=0.5) +
  guides(fill="none")

TSNE_pl <- plotReducedDim(milo, dimred = "TSNE", colour_by="genotype", text_by = "clusters_named", 
                          text_size = 3, point_size=0.5) +
  guides(fill="none")
## Plot neighbourhood graph
nh_graph_pl <- plotNhoodGraphDA(milo, da_results, layout="UMAP",alpha=0.1) + scale_fill_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue"))

## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

nh_graph_tsne_pl <- plotNhoodGraphDA(milo, da_results, layout="TSNE",alpha=0.1) + scale_fill_gradient2(high = scales::muted("red"), mid = "white", low = scales::muted("blue"))

## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

UMAP_pl + nh_graph_pl

## Warning: Removed 17 rows containing missing values (geom_text).

TSNE_pl + nh_graph_tsne_pl

## Warning: Removed 17 rows containing missing values (geom_text).

#  plot_layout(guides="collect")

We might also be interested in visualizing wheather DA is particularly evident in certain clusters. To do this, we assign a cluster label to each neighbourhood by finding the most abundant cluster within cells in each neighbourhood. We can label neighbourhoods in the results data.frame using the function annotateNhoods. This also saves the fraction of cells harbouring the label.

We will visualise this for:

The 10k clustering done exclusively with the oligos, this was overclustered, but it separated the group of cells that were different between wild type and KO into a separate cluster
The clusters named like the first clustering done with all the cell types.
Simply separating oligodendrocytes from OPCs

k10_tsne <- plotReducedDim(milo, dimred = "TSNE", colour_by="cluster_oligo_k10", text_by = "cluster_oligo_k10", 
                          text_size = 3, point_size=0.5) +
  guides(fill="none")

celltype_tsne <- plotReducedDim(milo, dimred = "TSNE", colour_by="celltype", text_by = "celltype", 
                          text_size = 3, point_size=0.5) +
  guides(fill="none")

clusters_named_tsne <- plotReducedDim(milo, dimred = "TSNE", colour_by="clusters_named", text_by = "clusters_named", 
                          text_size = 3, point_size=0.5) +
  guides(fill="none")

k10_tsne + clusters_named_tsne + celltype_tsne

## Warning: Removed 17 rows containing missing values (geom_text).

## Warning: Removed 13 rows containing missing values (geom_text).

da_results <- annotateNhoods(milo, da_results, coldata_col = "cluster_oligo_k10")
da_results <- annotateNhoods(milo, da_results, coldata_col = "celltype")
da_results <- annotateNhoods(milo, da_results, coldata_col = "clusters_named")
head(da_results)

##         logFC   logCPM            F      PValue        FDR Nhood SpatialFDR
## 1 -0.99196465 12.07753  0.733562844 0.401901618 0.78251709     1 0.80993246
## 2 -0.04454733 13.14233  0.005410233 0.942097532 0.96246222     2 0.96438122
## 3  0.27091321 13.21357  0.173065329 0.681842805 0.89934360     3 0.90294079
## 4  6.14915086 12.17319 13.937258221 0.001527721 0.04073923     4 0.04655878
## 5 -0.68046752 12.38738  0.525241312 0.477029088 0.80809685     5 0.82453297
## 6 -0.38497202 12.51019  0.196945780 0.661973342 0.89759097     6 0.90128880
##   cluster_oligo_k10 cluster_oligo_k10_fraction celltype celltype_fraction
## 1                11                  0.8333333    Oligo                 1
## 2                 9                  0.8888889    Oligo                 1
## 3                12                  0.5517241    Oligo                 1
## 4                10                  0.9444444    Oligo                 1
## 5                 9                  0.9090909    Oligo                 1
## 6                 6                  0.9615385     OPCs                 1
##   clusters_named clusters_named_fraction
## 1        Oligo_1                       1
## 2        Oligo_1                       1
## 3        Oligo_1                       1
## 4        Oligo_1                       1
## 5        Oligo_1                       1
## 6           OPCs                       1

We can define a threshold to exclude neighbourhoods that are a mix of cell types.

ggplot(da_results, aes(cluster_oligo_k10_fraction)) + geom_histogram(bins=50)+
ggplot(da_results, aes(clusters_named_fraction)) + geom_histogram(bins=50) +
ggplot(da_results, aes(celltype_fraction)) + geom_histogram(bins=50)

da_results$cluster_oligo_k10 <- ifelse(da_results$cluster_oligo_k10_fraction < 0.6, "Mixed", da_results$cluster_oligo_k10)
da_results$cluster_oligo_k10 <- ifelse(da_results$celltype_fraction < 0.9, "Mixed", da_results$cluster_oligo_k10)
da_results$cluster_oligo_k10 <- ifelse(da_results$clusters_named_fraction < 0.8, "Mixed", da_results$cluster_oligo_k10)

Now we can visualize the distribution of DA Fold Changes in different cell types or clusters

plotDAbeeswarm(da_results, group.by = "cluster_oligo_k10") + scale_colour_distiller(direction = -1, palette = "RdBu")

## Converting group.by to factor...

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

plotDAbeeswarm(da_results, group.by = "celltype") + scale_colour_distiller(direction = -1, palette = "RdBu")

## Converting group.by to factor...

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

plotDAbeeswarm(da_results, group.by = "clusters_named") + scale_colour_distiller(direction = -1, palette = "RdBu")

## Converting group.by to factor...
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

saveRDS(da_results, here("processed",project, "da_results_oligos.rds"))

–> –> –> –>

Session Info

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] dplyr_1.0.7                 scater_1.20.1              
##  [3] ggplot2_3.3.5               scuttle_1.2.1              
##  [5] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
##  [7] Biobase_2.52.0              GenomicRanges_1.44.0       
##  [9] GenomeInfoDb_1.28.4         IRanges_2.26.0             
## [11] S4Vectors_0.30.2            BiocGenerics_0.38.0        
## [13] MatrixGenerics_1.4.3        matrixStats_0.61.0         
## [15] here_1.0.1                  miloR_1.0.0                
## [17] edgeR_3.34.1                limma_3.48.3               
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              RColorBrewer_1.1-2       
##  [3] rprojroot_2.0.2           tools_4.1.1              
##  [5] bslib_0.3.1               utf8_1.2.2               
##  [7] R6_2.5.1                  irlba_2.3.3              
##  [9] vipor_0.4.5               DBI_1.1.1                
## [11] colorspace_2.0-2          withr_2.4.2              
## [13] tidyselect_1.1.1          gridExtra_2.3            
## [15] compiler_4.1.1            BiocNeighbors_1.10.0     
## [17] DelayedArray_0.18.0       labeling_0.4.2           
## [19] sass_0.4.0                scales_1.1.1             
## [21] stringr_1.4.0             digest_0.6.28            
## [23] rmarkdown_2.11            XVector_0.32.0           
## [25] pkgconfig_2.0.3           htmltools_0.5.2          
## [27] sparseMatrixStats_1.4.2   highr_0.9                
## [29] fastmap_1.1.0             rlang_0.4.12             
## [31] DelayedMatrixStats_1.14.3 jquerylib_0.1.4          
## [33] farver_2.1.0              generics_0.1.1           
## [35] jsonlite_1.7.2            gtools_3.9.2             
## [37] BiocParallel_1.26.2       RCurl_1.98-1.5           
## [39] magrittr_2.0.1            BiocSingular_1.8.1       
## [41] GenomeInfoDbData_1.2.6    patchwork_1.1.1          
## [43] Matrix_1.3-4              Rcpp_1.0.7               
## [45] ggbeeswarm_0.6.0          munsell_0.5.0            
## [47] fansi_0.5.0               viridis_0.6.2            
## [49] lifecycle_1.0.1           stringi_1.7.5            
## [51] yaml_2.2.1                ggraph_2.0.5             
## [53] MASS_7.3-54               zlibbioc_1.38.0          
## [55] grid_4.1.1                ggrepel_0.9.1            
## [57] crayon_1.4.2              lattice_0.20-45          
## [59] splines_4.1.1             graphlayouts_0.7.1       
## [61] cowplot_1.1.1             beachmat_2.8.1           
## [63] locfit_1.5-9.4            knitr_1.36               
## [65] pillar_1.6.4              igraph_1.2.7             
## [67] ScaledMatrix_1.0.0        glue_1.4.2               
## [69] evaluate_0.14             vctrs_0.3.8              
## [71] tweenr_1.0.2              gtable_0.3.0             
## [73] purrr_0.3.4               polyclip_1.10-0          
## [75] tidyr_1.1.4               assertthat_0.2.1         
## [77] xfun_0.27                 ggforce_0.3.3            
## [79] rsvd_1.0.5                tidygraph_1.2.0          
## [81] viridisLite_0.4.0         tibble_3.1.5             
## [83] beeswarm_0.4.0            statmod_1.4.36           
## [85] ellipsis_0.3.2

Compositional analysis with Milo - Subset Oligos

Nadine Bestard

02/11/2021