Session 1: Spatial analyses of sequencing data

Objective

Introduce the Bioconductor framework with a focus on the SpatialExperiment package. Guide participants through installation, data acquisition, and basic data manipulation and visualisation.

Session Structure

Introduction to Bioconductor

What is Bioconductor? Its role in spatial omics analysis.

Getting Started with SpatialExperiment

Installation of SpatialExperiment and dependencies. Overview of the package’s capabilities.

Downloading Example Dataset

Step-by-step guide on acquiring a sample dataset. Explanation of dataset characteristics.

Data Visualisation and Manipulation

Basic visualisation techniques. Manipulating spatial omics data using SpatialExperiment.

Exercises and Q&A

Q&A session for clarifications and deeper insights.

1. Introduction to Bioconductor

In this section of our R Markdown workshop, we delve into the world of Bioconductor and its significance in spatial omics analysis. Understanding Bioconductor is fundamental for anyone venturing into the complex and fascinating field of bioinformatics, especially in analysing spatially resolved data.

What is Bioconductor?

Bioconductor is a pioneering software project that specialises in the analysis and comprehension of genomic data. It is an open-source, open-development platform primarily based on the R statistical programming language.

The key features of Bioconductor include:

Extensive Range of Packages: Bioconductor offers a vast collection of packages specifically designed to analyse high-throughput genomic data. This includes genomics, transcriptomics, proteomics and other high-throughput data domains
Community-Driven Development: Bioconductor is developed by a global community of scientists and programmers, ensuring it stays at the forefront of genomic research and bioinformatics.
Reproducible Research: It emphasises reproducibility and provides tools that facilitate the sharing of data, code, and research workflows.

Bioconductor’s role in Spatial Omics Analysis

This emerging field combines traditional omics data (like genomics and transcriptomics) with spatial context. This means understanding how the expression of genes varies across different physical locations within a tissue or cell. How Bioconductor contributes to this field:

Specialised Packages: Bioconductor offers packages tailored for spatial omics analysis. These packages can handle spatially resolved transcriptomics data, enabling researchers to analyse and visualize the spatial distribution of gene expression.
Data Management: It provides robust solutions for managing complex and large datasets typical in spatial omics, ensuring efficient data handling and processing.
Statistical Analysis Tools: Bioconductor contains a range of statistical tools specifically designed for the unique challenges of spatial data analysis, such as accounting for spatial autocorrelation and integrating spatial and non-spatial data.
Visualisation Capabilities: With its advanced graphical functionalities, Bioconductor aids in creating insightful visual representations of spatial omics data, which is crucial for understanding spatial patterns in gene expression.

Slack channel: #community-bioc Support forum: https://support.bioconductor.org/

In the following sections of this workshop, we will explore practical examples and dive deeper into how Bioconductor is used in spatial omics analyses, including hands-on coding examples. Stay tuned for an engaging journey through spatial omics with Bioconductor!

2. Getting Started with SpatialExperiment

To begin working with SpatialExperiment, it’s essential to install the necessary packages from Bioconductor, using the BiocManager package. This setup ensures that you have the required tools to manage and analyse spatial omics data effectively.

SpatialExperiment defines an S4 class designed to hold data from spatial-omics experiments. This class builds on the SingleCellExperiment by adding functionality to store and access extra information from both spot-based and cell-based platforms, such as spatial coordinates, images, and their metadata.

Righelli et al. doi: 10.1093/bioinformatics/btac299

## here() starts at /Users/mangiola.s/Documents/tidySpatialWorkshop2024

# Install BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# Install SpatialExperiment package
BiocManager::install("SpatialExperiment")

3. Downloading Example Dataset

Once the packages are installed, we will guide you through the process of working with spatial omics data using the SpatialExperiment package.

# Load the SpatialExperiment package

# For SpatialExperiment the following might be needed
# 1. bash: module load ImageMagick/6.9.11-22
# 2. R: devtools::install_github("ropensci/magick")

library(SpatialExperiment)

Visualisation functions for spatial transcriptomics data.

library(ggspavis)

Utility packages for single-cell data.

library(scuttle)
library(scater)
library(scran)

In this section, we explore the handling and processing of spatial transcriptomics data using the spatialLIBD and ExperimentHub packages. The following R code block retrieves a specific dataset from the ExperimentHub, a Bioconductor project designed to manage and distribute large biological data sets. The code efficiently fetches the data, removes any existing dimensional reductions, and filters the dataset to include only selected samples. This approach is essential for analysing spatial patterns in gene expression across multiple samples, and the code below exemplifies how to manipulate these datasets in preparation for further analysis. This process is adapted from the Banksy package’s vignette, which provides advanced methods for multi-sample spatial transcriptomics.

Maynard and Torres et al., doi: 10.1038/s41593-020-00787-0

library(spatialLIBD)
library(ExperimentHub)

spatial_data <- 
  ExperimentHub::ExperimentHub() |> 
  spatialLIBD::fetch_data( eh = _, type = "spe")

names(libd_layer_colors) = gsub("ayer", "", names(libd_layer_colors))

# Clear the reductions
reducedDims(spatial_data) = NULL 

# Select only 3 samples
spatial_data = spatial_data[,spatial_data$sample_id %in% c("151673", "151675", "151676")]

# Display the object
spatial_data

## class: SpatialExperiment 
## dim: 33538 10691 
## metadata(0):
## assays(2): counts logcounts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(9): source type ... gene_search is_top_hvg
## colnames(10691): AAACAAGTATCTCCCA-1 AAACAATCTACTAGCA-1 ...
##   TTGTTTCCATACAACT-1 TTGTTTGTGTAAATTC-1
## colData names(69): sample_id Cluster ... array_row array_col
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
## imgData names(4): sample_id image_id data scaleFactor

From: https://bookdown.org/sjcockell/ismb-tutorial-2023/practical-session-2.html

We shows metadata for each cell, helping understand the dataset’s structure.

col_data = colData(spatial_data)

knitr::kable(
  head(col_data),
  format = "html"
)

	sample_id	Cluster	sum_umi	sum_gene	subject	replicate	subject_position	discard	key	cell_count	SNN_k50_k4	SNN_k50_k5	SNN_k50_k6	SNN_k50_k7	SNN_k50_k8	SNN_k50_k9	SNN_k50_k10	SNN_k50_k11	SNN_k50_k12	SNN_k50_k13	SNN_k50_k14	SNN_k50_k15	SNN_k50_k16	SNN_k50_k17	SNN_k50_k18	SNN_k50_k19	SNN_k50_k20	SNN_k50_k21	SNN_k50_k22	SNN_k50_k23	SNN_k50_k24	SNN_k50_k25	SNN_k50_k26	SNN_k50_k27	SNN_k50_k28	GraphBased	Maynard	Martinowich	layer_guess	layer_guess_reordered	layer_guess_reordered_short	expr_chrM	expr_chrM_ratio	SpatialDE_PCA	SpatialDE_pool_PCA	HVG_PCA	pseudobulk_PCA	markers_PCA	SpatialDE_UMAP	SpatialDE_pool_UMAP	HVG_UMAP	pseudobulk_UMAP	markers_UMAP	SpatialDE_PCA_spatial	SpatialDE_pool_PCA_spatial	HVG_PCA_spatial	pseudobulk_PCA_spatial	markers_PCA_spatial	SpatialDE_UMAP_spatial	SpatialDE_pool_UMAP_spatial	HVG_UMAP_spatial	pseudobulk_UMAP_spatial	markers_UMAP_spatial	spatialLIBD	ManualAnnotation	in_tissue	array_row	array_col
AAACAAGTATCTCCCA-1	151673	7	8458	3586	Br8100	1	Br8100_pos0	FALSE	151673_AAACAAGTATCTCCCA-1	6	1	1	1	1	1	1	1	4	4	4	4	3	3	3	2	2	2	2	1	1	1	6	5	5	4	7	2_3	2	Layer3	Layer3	L3	1407	0.1663514	3	3	2	2	2	1	1	3	1	1	3	1	1	3	1	7	1	1	2	1	L3	NA	TRUE	50	102
AAACAATCTACTAGCA-1	151673	4	1667	1150	Br8100	1	Br8100_pos0	FALSE	151673_AAACAATCTACTAGCA-1	16	1	1	1	1	1	1	1	4	4	4	4	3	3	3	2	2	2	2	1	1	1	10	10	10	10	4	2_3	3	Layer1	Layer1	L1	204	0.1223755	2	5	3	7	8	2	2	2	2	1	7	5	2	2	3	2	1	4	2	3	L1	NA	TRUE	3	43
AAACACCAATAACTGC-1	151673	8	3769	1960	Br8100	1	Br8100_pos0	FALSE	151673_AAACACCAATAACTGC-1	5	2	3	4	5	8	9	10	11	10	9	10	11	11	12	13	12	11	11	12	20	20	21	22	21	22	8	WM	WM	WM	WM	WM	430	0.1140886	4	4	7	5	5	5	5	6	6	1	5	4	4	5	3	5	7	5	3	2	WM	NA	TRUE	59	19
AAACAGAGCGACTCCT-1	151673	6	5433	2424	Br8100	1	Br8100_pos0	FALSE	151673_AAACAGAGCGACTCCT-1	2	1	1	1	1	1	1	1	2	2	2	2	1	1	1	4	4	4	4	3	3	3	2	2	2	7	6	6	6	Layer3	Layer3	L3	1316	0.2422234	3	3	2	3	6	1	1	3	1	3	3	3	1	2	2	3	4	2	1	1	L3	NA	TRUE	14	94
AAACAGCTTTCAGAAG-1	151673	3	4278	2264	Br8100	1	Br8100_pos0	FALSE	151673_AAACAGCTTTCAGAAG-1	4	1	1	1	1	1	3	3	3	3	3	3	2	2	2	1	1	1	1	6	6	6	5	4	4	3	3	5	5	Layer5	Layer5	L5	651	0.1521739	2	1	1	4	2	1	1	7	3	4	2	1	2	4	1	3	3	8	4	4	L5	NA	TRUE	43	9
AAACAGGGTCTATATT-1	151673	3	4004	2178	Br8100	1	Br8100_pos0	FALSE	151673_AAACAGGGTCTATATT-1	6	2	4	5	6	5	6	7	8	12	12	13	14	15	16	17	17	17	16	17	16	16	17	18	17	18	3	5	5	Layer6	Layer6	L6	621	0.1550949	6	7	8	8	4	1	7	1	3	4	6	8	7	7	7	3	3	3	4	4	L6	NA	TRUE	47	13

We access and display feature-related information from the dataset.

row_data = rowData(spatial_data)

knitr::kable(
  head(row_data),
  format = "html"
)

	source	type	gene_id	gene_version	gene_name	gene_source	gene_biotype	gene_search	is_top_hvg
ENSG00000243485	havana	gene	ENSG00000243485	5	MIR1302-2HG	havana	lincRNA	MIR1302-2HG; ENSG00000243485	FALSE
ENSG00000237613	havana	gene	ENSG00000237613	2	FAM138A	havana	lincRNA	FAM138A; ENSG00000237613	FALSE
ENSG00000186092	ensembl_havana	gene	ENSG00000186092	6	OR4F5	ensembl_havana	protein_coding	OR4F5; ENSG00000186092	FALSE
ENSG00000238009	ensembl_havana	gene	ENSG00000238009	6	AL627309.1	ensembl_havana	lincRNA	AL627309.1; ENSG00000238009	FALSE
ENSG00000239945	havana	gene	ENSG00000239945	1	AL627309.3	havana	lincRNA	AL627309.3; ENSG00000239945	FALSE
ENSG00000239906	havana	gene	ENSG00000239906	1	AL627309.2	havana	antisense	AL627309.2; ENSG00000239906	FALSE

Here, we perform a preliminary examination of the assay data contained within the spatial dataset.

assay(spatial_data)[1:20, 75:100]

## 20 x 26 sparse Matrix of class "dgCMatrix"

##   [[ suppressing 26 column names 'AACCAGTATCACTCTT-1', 'AACCATAGGGTTGAAC-1', 'AACCATGGGATCGCTA-1' ... ]]

##                                                                    
## ENSG00000243485 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000237613 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000186092 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000238009 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000239945 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000239906 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000241599 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000236601 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000284733 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000235146 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000284662 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000229905 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000237491 . . . . . . . . . . . . . 1 . . . . . . 1 . . . . 2
## ENSG00000177757 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000225880 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000230368 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000272438 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000230699 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000241180 . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSG00000223764 . . . . . . . . . . . . . . . . . . . . . . . . . .

4. Data Visualisation and Manipulation

Understanding the spatial arrangement of your data is fundamental. We’ll demonstrate straightforward methods to visualise spatial data, helping you appreciate the underlying spatial patterns and distributions.

We will use ggpavis package to visualise the data.

# image data
imgData(spatial_data)

## DataFrame with 3 rows and 4 columns
##     sample_id    image_id   data scaleFactor
##   <character> <character> <list>   <numeric>
## 1      151673      lowres   ####   0.0450045
## 2      151675      lowres   ####   0.0450045
## 3      151676      lowres   ####   0.0450045

# Simple visualization of spatial data
ggspavis::plotSpots(spatial_data) + 
  facet_wrap(~sample_id)

Enhance the visualisation by annotating the plots to provide more context within the spatial data.

Layers = L1-6, white matter = WM

# plot spots with annotation
ggspavis::plotSpots(
  spatial_data, 
  annotate = "spatialLIBD"
) + 
  facet_wrap(~sample_id)

Explore additional visualisation features offered by the Visium platform, exposing the H&E (hematoxylin and eosin) image.

ggspavis::plotVisium(spatial_data)

This visualisation focuses on specific tissue features within the dataset, emphasising areas of interest.

ggspavis::plotVisium(
  spatial_data, 
  annotate = "spatialLIBD", 
  highlight = "in_tissue"
) + 
  facet_wrap(~sample_id)

5. Quality control and filtering

We will use the scater package McCarthy et al. 2017 to compute the three primary QC metrics we discussed earlWe’llUsing the scater Package for QC Metrics: We’ll apply the scater package to compute three primary quality control metrics. We’ll also use ggspavis for visualisation along with some custom plotting techniques.

Previously, we visualised both on- and off-tissue spots. Moving forward, we focus on on-tissue spots for more relevant analyses. This block shows how to filter out off-tissue spots to refine the dataset.

Source OSTAWorkshopBioc2021

## Dataset dimensions before the filtering
dim(spatial_data)

## [1] 33538 10691

Filtering Dataset to Retain Only On-Tissue Spots: We then refine our dataset by keeping only those spots that are on-tissue, aligning with our focus for subsequent analysis. The dimensions of the dataset after filtering are displayed to confirm the changes.

## Subset to keep only on-tissue spots
spatial_data <- spatial_data[, colData(spatial_data)$in_tissue == 1]
dim(spatial_data)

## [1] 33538 10691

Mitochondrial

Next, we identify mitochondrial genes, as they are indicative of cell health. Cells with high mitochondrial gene expression typically indicate poor health or dying cells, which we aim to exclude.

## Classify genes as "mitochondrial" (is_mito == TRUE) 
## or not (is_mito == FALSE)
is_gene_mitochondrial <- grepl("(^MT-)|(^mt-)", rowData(spatial_data)$gene_name)
rowData(spatial_data)$gene_name[is_gene_mitochondrial]

##  [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
##  [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB"

After identifying mitochondrial genes, we apply quality control metrics to further clean the dataset. This involves adding per-cell QC measures and setting a threshold to exclude cells with high mitochondrial transcription.

## Add per-cell QC metrics to spatial data using identified mitochondrial genes
spatial_data <- scater::addPerCellQC(
  spatial_data, 
  subsets = list(mito = is_gene_mitochondrial)
)

## Select expressed genes threshold
qc_mitochondrial_transcription <- colData(spatial_data)$subsets_mito_percent > 30

## Check how many spots are filtered out
table(qc_mitochondrial_transcription)

## qc_mitochondrial_transcription
## FALSE  TRUE 
## 10599    92

After applying the QC metrics, it’s crucial to visually assess their impact. This step involves checking the spatial pattern of the spots removed based on high mitochondrial transcription, helping us understand the distribution and quality of the remaining dataset.

## Add threshold in colData
colData(spatial_data)$qc_mitochondrial_transcription <- qc_mitochondrial_transcription

## Check for putative spatial pattern of removed spots
plotSpotQC(
  spatial_data, 
  plot_type = "spot",  
  annotate = "qc_mitochondrial_transcription", 
) + 
  facet_wrap(~sample_id)

Library size

This analysis focuses on examining the distribution of library sizes across different spots. It uses a histogram and density plot to visualise the range and commonality of library sizes in the dataset.

## Density and histogram of library sizes
data.frame(colData(spatial_data)) |> 
  ggplot(aes(x = sum)) +
  geom_histogram(aes(y = after_stat(density)), bins = 60) +
  geom_density() +
  scale_x_log10() +
  xlab("Library size") + 
  ylab("Density") + 
  theme_classic()

Setting Library Size Threshold: After examining the library sizes, a threshold is applied to identify spots with library sizes below 700, which are considered for potential exclusion from further analysis.

## Select library size threshold
qc_total_counts <- colData(spatial_data)$sum < 700

## Check how many spots are filtered out
table(qc_total_counts)

## qc_total_counts
## FALSE  TRUE 
## 10639    52

Incorporating Library Size Threshold in Dataset: This step involves adding the library size threshold to the dataset’s metadata and examining the spatial pattern of the spots that have been removed based on this criterion.

## Add threshold in colData
colData(spatial_data)$qc_total_counts <- qc_total_counts

## Check for putative spatial pattern of removed spots
plotSpotQC(
  spatial_data, 
  plot_type = "spot",  
  annotate = "qc_total_counts", 
) + 
  facet_wrap(~sample_id)

Detected genes

This analysis examines how many genes are expressed per spot, using a histogram and density plot to visualise the distribution of gene counts across the dataset.

## Density and histogram of library sizes
data.frame(colData(spatial_data) ) |> 
  ggplot(aes(x = detected)) +
  geom_histogram(aes(y = after_stat(density)), bins = 60) +
  geom_density() +
  scale_x_log10() +
  xlab("Number of genes with > 0 counts") + 
  ylab("Density") + 
  theme_classic()

Setting Gene Expression Threshold: This block applies a threshold to identify spots with fewer than 500 detected genes, considering these for exclusion to ensure data quality.

## Select expressed genes threshold
qc_detected_genes <- colData(spatial_data)$detected < 500
## Check how many spots are filtered out
table(qc_detected_genes)

## qc_detected_genes
## FALSE  TRUE 
## 10642    49

Incorporating Gene Expression Threshold in Dataset: After setting the gene expression threshold, it is added to the dataset’s metadata. The spatial pattern of spots removed based on this threshold is then examined.

## Add threshold in colData
colData(spatial_data)$qc_detected_genes <- qc_detected_genes

## Check for putative spatial pattern of removed spots
plotSpotQC(
  spatial_data, 
  plot_type = "spot",  
  annotate = "qc_detected_genes", 
) + 
  facet_wrap(~sample_id)

Exploring the Relationship Between Library Size and Number of Genes Detected: This analysis explores the correlation between library size and the number of genes detected in each spot, providing insights into data quality and sequencing depth.

## Density and histogram of library sizes
data.frame(colData(spatial_data)) |> 
  ggplot(aes(sum, detected)) +
  geom_point(shape=".") +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Library size") + 
  ylab("Number of genes with > 0 counts") + 
  theme_classic()

Combined filtering

After applying all QC filters, this block combines them and stores the results in the dataset. The spatial patterns of all discarded spots are then reviewed to ensure comprehensive quality control.

## Store the set in the object
colData(spatial_data)$discard <- qc_total_counts | qc_detected_genes | qc_mitochondrial_transcription

## Check the spatial pattern of combined set of discarded spots
plotSpotQC(
  spatial_data, 
  plot_type = "spot",  
  annotate = "discard", 
) + 
  facet_wrap(~sample_id)

The final step in data preprocessing involves removing all spots identified as low-quality based on the previously applied thresholds, refining the dataset for subsequent analyses.

spatial_data = spatial_data[,!colData(spatial_data)$discard ]

6. Dimensionality reduction

Dimensionality reduction is essential in spatial transcriptomics due to the high-dimensional nature of the data, which includes vast gene expression profiles across various spatial locations. Techniques such as PCA (Principal Component Analysis) and UMAP (Uniform Manifold Approximation and Projection) are particularly valuable. PCA helps to reduce noise and highlight the most significant variance in the data, making it simpler to uncover underlying patterns and correlations. UMAP, ofen calculated from principal components (and not directly from features) preserves both global and local data structures, enabling more nuanced visualisations of complex cellular landscapes. Together, these methods facilitate a deeper understanding of spatial gene expression, helping to reveal biological insights such as cellular heterogeneity and tissue structure, which are crucial for both basic biological research and clinical applications.

Variable gene identification

genes <- !grepl(pattern = "^Rp[l|s]|Mt", x = rownames(spatial_data))
dec = scran::modelGeneVar(spatial_data, subset.row = genes) 


# Visualisation
plot(dec$mean, dec$total, xlab = "Mean log-expression", ylab = "Variance")
curve(metadata(dec)$trend(x), col = "blue", add = TRUE)

# Get top variable genes 
dec = scran::modelGeneVar(spatial_data, subset.row = genes, block = spatial_data$sample_id) 
hvg = scran::getTopHVGs(dec, n = 1000)

rowData(spatial_data[head(hvg),])[,c("gene_id", "gene_name")]

## DataFrame with 6 rows and 2 columns
##                         gene_id   gene_name
##                     <character> <character>
## ENSG00000123560 ENSG00000123560        PLP1
## ENSG00000197971 ENSG00000197971         MBP
## ENSG00000110484 ENSG00000110484     SCGB2A2
## ENSG00000131095 ENSG00000131095        GFAP
## ENSG00000173786 ENSG00000173786         CNP
## ENSG00000109846 ENSG00000109846       CRYAB

PCA

With the highly variable genes, we perform PCA to reduce dimensionality, followed by UMAP to visualise the data in a lower-dimensional space, enhancing our ability to observe clustering and patterns in the data.

spatial_data <- 
  spatial_data |> 
  scuttle::logNormCounts() |> 
  scater::runPCA(subset_row = hvg)

## Check correctness - names
reducedDimNames(spatial_data)

## [1] "PCA"

reducedDim(spatial_data, "PCA")[1:5, 1:5]

##                          PC1        PC2        PC3        PC4         PC5
## AAACAAGTATCTCCCA-1 -4.323918  0.9600003 -0.3514095  2.2100416  0.05912948
## AAACAATCTACTAGCA-1 -1.560326 -2.1527877 -0.8244904 -0.3135456  0.76248069
## AAACACCAATAACTGC-1 19.168276  2.2825249 -0.1300977  0.2591148  2.48548951
## AAACAGAGCGACTCCT-1 -3.070884 -1.0284710 -1.0845869  2.6911326 -0.88723306
## AAACAGCTTTCAGAAG-1 -1.588394  2.2461043  2.9044546  0.5902962 -0.13658943

As for single-cell data, we need to verify that there is not significant batch effect. If so we need to adjust for it (a.k.a. integration) before calculating principal component. Many adjustment methods to output adjusted principal components directly.

UMAP

You can appreciate that, in this case, selecting within-sample variable genes, we do not see major batch effects across samples. We see two major pixel clusters.

We can appreciate that there are no major batch effects across samples, and we don’t see grouping driven by sample_id.

set.seed(42)
spatial_data <- scater::runUMAP(spatial_data, dimred = "PCA")

scater::plotUMAP(spatial_data, colour_by = "sample_id", point_size = 0.2)

Exercise 1.1

Visualise where the two macro clusters are located spatially. We will take a very pragmatic approach and get cluster label from splitting the UMAP coordinated in two (colData() and reducedDim() will help us, see above), and then visualise it with ggspavis.

Modify the SpatialExperiment object based on the UMAP1 dimension so to divide those 2 cluster
Visualise the UMAP colouring by the new labelling
Visualise the Visium slide colouring by the new labelling

7. Clustering

Clustering in spatial transcriptomics is crucial for understanding the intricate cellular composition of tissues. This technique groups cells/pixels based on similar gene expression profiles, enabling the identification of distinct cell types and states within a tissue’s spatial context. Clustering reveals patterns of cellular organisation and differentiation, and interactions in the microenvironment.

Transcriptome based nearest neighbours

First, we establish the number of nearest neighbors to use in the k-NN graph. This graph forms the basis for clustering, using the Walktrap algorithm to detect community structures that suggest natural groupings or clusters in the data. buildSNNGraph is from the scan package.

## Set number of Nearest-Neighbours (NNs)
k <- 10

## Build the k-NN graph
g_walk <- 
  spatial_data |> 
  scran::buildSNNGraph( 
    k = 10, 
    use.dimred = "PCA"
  ) |> 
  igraph::cluster_walktrap()

clus <- g_walk$membership
## Check how many
table(clus)

## clus
##    1    2    3    4    5    6    7 
##  856 3557 2332  857 2386  276  283

Applying Clustering Labels and Visualising Results: After determining clusters, we apply these labels back to the spatial data and visualise the results using UMAP. This allows us to observe how the data clusters in a reduced dimension space, and further visualise how these clusters map onto the physical tissue sections for context.

We can appreciate here that we get two main pixel clusters.

colLabels(spatial_data) <- factor(clus)

scater::plotUMAP(spatial_data, colour_by = "label") + scale_color_brewer(palette = "Paired")

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

Those two clusters group the white matter from the rest of the layers.

## Plot in tissue map
ggspavis::plotSpots(spatial_data, annotate = "label") + 
  facet_wrap(~sample_id) +
  scale_color_brewer(palette = "Paired")

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

As for comparison, we show the manually annotated regions. We can see that while the single cell style clustering catchers, the overall tissue, architecture, a lot of details are not retrieved. We clusters cannot faithfully recapitulate the tissue morphology. However, they might represent specific cell types within morphological regions.

## Plot ground truth in tissue map
ggspavis::plotSpots(spatial_data, annotate = "spatialLIBD") +
  facet_wrap(~sample_id) +
  scale_color_manual(values = libd_layer_colors)

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

Spatially-aware clustering

To cluster spatial regions (i.e. tissue domain) rather than single-cell types, the clustering algorithms need to take spatial context into account. For example what is the transcriptional profile of the neighbouring pixels or neighbouring cells.

BANKSY combines molecular and spatial information. BANKSY leverages the fact that a cell’s state can be more fully represented by considering both its own transcriptome “nd that of its local m”croenvironment.This algorithm embeds cells within a combined space that incorporates their own transcriptome and that of their locell’svironment, representing both the cell state and the surrounding microenvironment.

Overview of the algorithm

* Construct a neighborhood graph between cells in physical space (k-nearest neighbors or radius nearest neighbors).
* We use neighborhood graph to compute two matrices:

– ** Average neighborhood expression matrix

– ** “Azimuthal Gabor filter” matrix. It represents the transcriptomic microenvironment around each cell. It measures the gradient of gene expression in each cell’s neighborhood.

* These matrices are then scaled on the basis of a mixing parameter λ, which controls their relative weighting
* Concatenate these two matrices with the original gene–cell expression matrix
* Combine these three matrices by direct product

Singhal et al., 2024

Material source

library(Banksy)

# scale the counts, without log transformation
spatial_data = spatial_data |> logNormCounts(log=FALSE, name = "normcounts")

Highly-variable genes

The Banksy documentation, suggest the use of Seurat for the detection of highly variable genes.

library(Seurat)

# Convert to list
spatial_data_list_for_seurat <- lapply(unique(spatial_data$sample_id), function(x) 
  spatial_data[,  spatial_data$sample_id == x  ]
)

seu_list <- lapply(spatial_data_list_for_seurat, function(x) {
    x <- as.Seurat(x, data = NULL)
    NormalizeData(x, scale.factor = 5000, normalization.method = 'RC')
})

# Compute HVGs
hvgs <- lapply(seu_list, function(x) {
    VariableFeatures(FindVariableFeatures(x, nfeatures = 2000))
})
hvgs <- Reduce(union, hvgs)
rm(seu_list, spatial_data_list_for_seurat)

rowData(spatial_data[head(hvgs),])[,c("gene_id", "gene_name")]

## DataFrame with 6 rows and 2 columns
##                         gene_id   gene_name
##                     <character> <character>
## ENSG00000122585 ENSG00000122585         NPY
## ENSG00000123560 ENSG00000123560        PLP1
## ENSG00000211592 ENSG00000211592        IGKC
## ENSG00000244734 ENSG00000244734         HBB
## ENSG00000188536 ENSG00000188536        HBA2
## ENSG00000197971 ENSG00000197971         MBP

We now split the data by sample, to compute the neighbourhood matrices.

# Convert to list
spatial_data_list <- lapply(unique(spatial_data$sample_id), function(x) 
  spatial_data[
    hvgs, 
    spatial_data$sample_id == x
    ]
)

spatial_data_list <- lapply(
  spatial_data_list, 
  computeBanksy, # Compute the component neighborhood matrices
  assay_name = "normcounts"
)

# Rejoin the datasets
spe_joint <- do.call(cbind, spatial_data_list)

Here, we perform PCA using the BANKSY algorithm on the joint dataset. The group argument specifies how to treat different samples, ensuring that features are scaled separately per sample group to account for variations among them.

Note: this step takes long time

spe_joint <- runBanksyPCA( # Run PCA on the Banskly matrix
  spe_joint, 
  lambda = 0.2, # spatial weighting parameter. Larger values (e.g. 0.8) incorporate more spatial neighborhood
  group = "sample_id", # Features belonging to the grouping variable will be z-scaled separately. 
  seed = 42
)

Once the dimensional reduction is complete, we cluster the spots across all samples and use connectClusters to visually compare these new BANKSY clusters against manual annotations.

spe_joint <- clusterBanksy( # clustering on the principal components computed on the BANKSY matrix
  spe_joint, 
  lambda = 0.2, # spatial weighting parameter. Larger values (e.g. 0.8) incorporate more spatial neighborhood
  resolution = 0.7, # numeric vector specifying resolution used for clustering (louvain / leiden).
  seed = 42
)
colData(spe_joint)$clust_annotation  = colData(spe_joint)$Cluster

spe_joint <- connectClusters(spe_joint)

As an optional step, we smooth the cluster labels for each sample independently, which can enhance the visual coherence of clusters, especially in heterogeneous biological samples.

From SpiceMix paper Chidester et al., 2023

spatial_data_list <- lapply(
  unique(spe_joint$sample_id), 
  function(x) 
    spe_joint[, spe_joint$sample_id == x]
)

spatial_data_list <- lapply(
  spatial_data_list, 
  smoothLabels, 
  cluster_names = "clust_M0_lam0.2_k50_res0.7",
  k = 6L, 
  verbose = FALSE
)
names(spatial_data_list) <- paste0("sample_", unique(spe_joint$sample_id))

The raw and smoothed cluster labels are stored in the colData slot of each SingleCellExperiment or SpatialExperiment object.

cluster_metadata = 
  colData(spatial_data_list$sample_151673)[, c("clust_M0_lam0.2_k50_res0.7", "clust_M0_lam0.2_k50_res0.7_smooth")]

 
knitr::kable(head(cluster_metadata), format = "html")

	clust_M0_lam0.2_k50_res0.7	clust_M0_lam0.2_k50_res0.7_smooth
AAACAAGTATCTCCCA-1	3	3
AAACAATCTACTAGCA-1	7	9
AAACACCAATAACTGC-1	5	5
AAACAGAGCGACTCCT-1	3	3
AAACAGCTTTCAGAAG-1	1	1
AAACAGGGTCTATATT-1	6	6

Using cluster comparison metrics like the adjusted Rand index (ARI) we evaluate the performance of our clustering approach. This statistical analysis helps validate the clustering results against known labels or pathologies.

The Adjusted Rand Index (ARI) is a measure of the similarity between two data clusterings. Measures degree of overlapping between two partitions.

compareClusters(spatial_data_list$sample_151673, func = 'ARI')

##                                   clust_M0_lam0.2_k50_res0.7 clust_annotation
## clust_M0_lam0.2_k50_res0.7                             1.000            0.345
## clust_annotation                                       0.345            1.000
## clust_M0_lam0.2_k50_res0.7_smooth                      0.903            0.333
##                                   clust_M0_lam0.2_k50_res0.7_smooth
## clust_M0_lam0.2_k50_res0.7                                    0.903
## clust_annotation                                              0.333
## clust_M0_lam0.2_k50_res0.7_smooth                             1.000

We calculate the ARI for each sample to assess the consistency and accuracy of our clustering across different samples.

ari <- sapply(spatial_data_list, function(x) as.numeric(tail(compareClusters(x, func = "ARI")[, 1], n = 1)))
ari

## sample_151673 sample_151675 sample_151676 
##         0.903         0.862         0.888

Visualising Clusters and Annotations on Spatial Coordinates: We utilise the ggspavis package to visually map BANKSY clusters and manual annotations onto the spatial coordinates of the dataset, providing a comprehensive visual overview of spatial and clustering data relationships.

# Use scater:::.get_palette('tableau10medium')
library(cowplot)

pal <- c(
  "#1abc9c", "#3498db", "#9b59b6", "#e74c3c", "#34495e",
  "#f39c12", "#d35400", "#7f8c8d", "#2ecc71", "#e67e22"
)

 ggspavis::plotSpots(
   do.call(cbind, spatial_data_list), 
   annotate = sprintf("%s_smooth", "clust_M0_lam0.2_k50_res0.7"), 
   pal = pal
  ) +
    facet_wrap(~sample_id) +
    theme(legend.position = "none") +
    labs(title = "BANKSY clusters")

ggspavis::plotSpots(spatial_data, annotate = "spatialLIBD") +
  facet_wrap(~sample_id) +
  scale_color_manual(values = libd_layer_colors) +
  theme(legend.position = "none") +
  labs(title = "spatialLIBD regions")

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

Exercise 1.2

We have applied cluster smoothing using smoothLabels. How much do you think this operation has affected the cluster labels. To find out,

Plot the non smoothed cluster
identify the pixel that have been smoothed, and
visualise them using plotSpotQC that we have used above.

8. Deconvolution of pixel-based spatial data

One of the popular algorithms for spatial deconvolution is SPOTlight. Elosua-Bayes et al., 2021.

Source

SPOTlight uses a seeded non-negative matrix factorization regression, initialized using cell-type marker genes and non-negative least squares.

Producing the reference for single-cell databases

CuratedAtlasQuery is a query interface that allow the programmatic exploration and retrieval of the harmonised, curated and reannotated CELLxGENE single-cell human cell atlas. Data can be retrieved at cell, sample, or dataset levels based on filtering criteria.

Harmonised data is stored in the ARDC Nectar Research Cloud, and most CuratedAtlasQuery functions interact with Nectar via web requests, so a network connection is required for most functionality.

Mangiola et al., 2024 doi doi.org/10.1101/2023.06.08.542671

# Get reference
library(CuratedAtlasQueryR)
library(HDF5Array)

tmp_file_path = tempfile()

brain_reference =
  
  # Query metadata across 30M cells
  get_metadata() |>
  
  # Filter your data of interest
  dplyr::filter(tissue_harmonised=="brain", disease == "normal", organism == "Mus musculus") |> 
  
  # Collect pseudobulk as SummarizedExperiment
  get_pseudobulk() |> 
  
  # Normalise for Spotlight
  scuttle::logNormCounts() |> 
  
  # Save for fast reading
  HDF5Array::saveHDF5SummarizedExperiment(tmp_file_path, replace = TRUE)

library(HDF5Array)

brain_reference = HDF5Array::loadHDF5SummarizedExperiment(tmp_file_path)

my_metadata = colData(brain_reference)

knitr::kable(head(my_metadata), format = "html")

	cell_type_harmonised	file_id	sample_	sex	age_days	ethnicity	assay	organism	tissue	tissue_harmonised	disease	dataset_id	collection_id	cell_count	is_primary_data_x	X_sample_name	assay_ontology_term_id	development_stage	development_stage_ontology_term_id	disease_ontology_term_id	ethnicity_ontology_term_id	organism_ontology_term_id	sample_placeholder	sex_ontology_term_id	tissue_ontology_term_id	dataset_deployments	is_primary_data_y	is_valid	linked_genesets	mean_genes_per_cell	name	published	revision	schema_version	tombstone	x_normalization	created_at_x	published_at	revised_at	updated_at_x	filename	filetype	s3_uri	user_submitted	created_at_y	updated_at_y	sample_identifier
67dae03d451e04031cf10e111175cbaa___neuron	neuron	42b46a1c-fd98-42f8-9cfd-48521a6edb6c	67dae03d451e04031cf10e111175cbaa	female	9125	na	10x 3’ v3	Mus musculus	primary motor cortex	brain	normal	b6203114-e133-458a-aed5-eed1028378b4	367d95c0-0eb0-4dae-8276-9407239421ee	24213	FALSE	F003_primary motor cortex_early adult stage_10x 3’ v3_	EFO:0009922	early adult stage	MmusDv:0000061	PATO:0000461	na	NCBITaxon:10090	NA	PATO:0000383	UBERON:0001384	https://cellxgene.cziscience.com/e/b6203114-e133-458a-aed5-eed1028378b4.cxg/	SECONDARY	FALSE	NA	4394.937	Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration excitory neurons	TRUE	1	2.0.0	FALSE	log1p, base e	2021-02-17	2021-02-17	2021-10-14	2021-10-14	local.h5ad	H5AD	s3://corpora-data-prod/aa0ccbc9-21b1-4a41-8e28-190b10119794/local.h5ad	TRUE	2021-09-23	2021-09-23	67dae03d451e04031cf10e111175cbaa___neuron
feef718adb81776c446192754e0f6e62___neuron	neuron	42b46a1c-fd98-42f8-9cfd-48521a6edb6c	feef718adb81776c446192754e0f6e62	female	9125	na	10x 3’ v3	Mus musculus	primary motor cortex	brain	normal	b6203114-e133-458a-aed5-eed1028378b4	367d95c0-0eb0-4dae-8276-9407239421ee	24213	FALSE	F004_primary motor cortex_early adult stage_10x 3’ v3_	EFO:0009922	early adult stage	MmusDv:0000061	PATO:0000461	na	NCBITaxon:10090	NA	PATO:0000383	UBERON:0001384	https://cellxgene.cziscience.com/e/b6203114-e133-458a-aed5-eed1028378b4.cxg/	SECONDARY	FALSE	NA	4394.937	Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration excitory neurons	TRUE	1	2.0.0	FALSE	log1p, base e	2021-02-17	2021-02-17	2021-10-14	2021-10-14	local.h5ad	H5AD	s3://corpora-data-prod/aa0ccbc9-21b1-4a41-8e28-190b10119794/local.h5ad	TRUE	2021-09-23	2021-09-23	feef718adb81776c446192754e0f6e62___neuron
0fcee3d9b39ca26467043e9bf52e485f___neuron	neuron	42b46a1c-fd98-42f8-9cfd-48521a6edb6c	0fcee3d9b39ca26467043e9bf52e485f	female	9125	na	10x 3’ v3	Mus musculus	primary motor cortex	brain	normal	b6203114-e133-458a-aed5-eed1028378b4	367d95c0-0eb0-4dae-8276-9407239421ee	24213	FALSE	F005_primary motor cortex_early adult stage_10x 3’ v3_	EFO:0009922	early adult stage	MmusDv:0000061	PATO:0000461	na	NCBITaxon:10090	NA	PATO:0000383	UBERON:0001384	https://cellxgene.cziscience.com/e/b6203114-e133-458a-aed5-eed1028378b4.cxg/	SECONDARY	FALSE	NA	4394.937	Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration excitory neurons	TRUE	1	2.0.0	FALSE	log1p, base e	2021-02-17	2021-02-17	2021-10-14	2021-10-14	local.h5ad	H5AD	s3://corpora-data-prod/aa0ccbc9-21b1-4a41-8e28-190b10119794/local.h5ad	TRUE	2021-09-23	2021-09-23	0fcee3d9b39ca26467043e9bf52e485f___neuron
a9b6084008b432ae72d76a250260ae81___neuron	neuron	42b46a1c-fd98-42f8-9cfd-48521a6edb6c	a9b6084008b432ae72d76a250260ae81	female	9125	na	10x 3’ v3	Mus musculus	primary motor cortex	brain	normal	b6203114-e133-458a-aed5-eed1028378b4	367d95c0-0eb0-4dae-8276-9407239421ee	24213	FALSE	F007_primary motor cortex_early adult stage_10x 3’ v3_	EFO:0009922	early adult stage	MmusDv:0000061	PATO:0000461	na	NCBITaxon:10090	NA	PATO:0000383	UBERON:0001384	https://cellxgene.cziscience.com/e/b6203114-e133-458a-aed5-eed1028378b4.cxg/	SECONDARY	FALSE	NA	4394.937	Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration excitory neurons	TRUE	1	2.0.0	FALSE	log1p, base e	2021-02-17	2021-02-17	2021-10-14	2021-10-14	local.h5ad	H5AD	s3://corpora-data-prod/aa0ccbc9-21b1-4a41-8e28-190b10119794/local.h5ad	TRUE	2021-09-23	2021-09-23	a9b6084008b432ae72d76a250260ae81___neuron
2c19773cda2d898c8868560b7721b3b3___neuron	neuron	42b46a1c-fd98-42f8-9cfd-48521a6edb6c	2c19773cda2d898c8868560b7721b3b3	female	9125	na	10x 3’ v3	Mus musculus	primary motor cortex	brain	normal	b6203114-e133-458a-aed5-eed1028378b4	367d95c0-0eb0-4dae-8276-9407239421ee	24213	FALSE	F008_primary motor cortex_early adult stage_10x 3’ v3_	EFO:0009922	early adult stage	MmusDv:0000061	PATO:0000461	na	NCBITaxon:10090	NA	PATO:0000383	UBERON:0001384	https://cellxgene.cziscience.com/e/b6203114-e133-458a-aed5-eed1028378b4.cxg/	SECONDARY	FALSE	NA	4394.937	Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration excitory neurons	TRUE	1	2.0.0	FALSE	log1p, base e	2021-02-17	2021-02-17	2021-10-14	2021-10-14	local.h5ad	H5AD	s3://corpora-data-prod/aa0ccbc9-21b1-4a41-8e28-190b10119794/local.h5ad	TRUE	2021-09-23	2021-09-23	2c19773cda2d898c8868560b7721b3b3___neuron
cc3ced848db3c0ea2e50a313bbaa2f8b___neuron	neuron	42b46a1c-fd98-42f8-9cfd-48521a6edb6c	cc3ced848db3c0ea2e50a313bbaa2f8b	male	9125	na	10x 3’ v3	Mus musculus	primary motor cortex	brain	normal	b6203114-e133-458a-aed5-eed1028378b4	367d95c0-0eb0-4dae-8276-9407239421ee	24213	FALSE	M002_primary motor cortex_early adult stage_10x 3’ v3_	EFO:0009922	early adult stage	MmusDv:0000061	PATO:0000461	na	NCBITaxon:10090	NA	PATO:0000384	UBERON:0001384	https://cellxgene.cziscience.com/e/b6203114-e133-458a-aed5-eed1028378b4.cxg/	SECONDARY	FALSE	NA	4394.937	Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse 3-species integration excitory neurons	TRUE	1	2.0.0	FALSE	log1p, base e	2021-02-17	2021-02-17	2021-10-14	2021-10-14	local.h5ad	H5AD	s3://corpora-data-prod/aa0ccbc9-21b1-4a41-8e28-190b10119794/local.h5ad	TRUE	2021-09-23	2021-09-23	cc3ced848db3c0ea2e50a313bbaa2f8b___neuron

These are the cell types included in our reference, and the number of pseudobulk samples we have for each cell type.

table(brain_reference$cell_type_harmonised)

## 
##                      astrocyte               endothelial_cell 
##                             12                             12 
##            leptomeningeal_cell                microglial_cell 
##                             12                             12 
##                    muscle_cell                         neuron 
##                             11                             36 
##                oligodendrocyte oligodendrocyte_precursor_cell 
##                             12                             12 
##                  pericyte_cell 
##                             12

These are the number of samples we have for each of the three data sets.

table(brain_reference$dataset_id)

## 
## 6acb6637-ac08-4a65-b2d1-581e51dc7ccf 9b686bb6-1427-4e13-b451-7ee961115cf9 
##                                   12                                   95 
## b6203114-e133-458a-aed5-eed1028378b4 f7a068f1-0fdb-48e8-8029-db870ff11d9e 
##                                   12                                   12

The collection_id can be used to gather information on the cell database. e.g. https://cellxgene.cziscience.com/collections/

table(brain_reference$collection_id)

## 
## 367d95c0-0eb0-4dae-8276-9407239421ee 
##                                  131

Now, we identify the variable genes within each dataset, to not capture technical effects, and identify the union of variable genes for further analysis.

genes <- !grepl(pattern = "^Rp[l|s]|Mt", x = rownames(brain_reference))

# Convert to list
brain_reference_list <- lapply(unique(brain_reference$dataset_id), function(x) brain_reference[, brain_reference$dataset_id == x])

dec = scran::modelGeneVar(brain_reference, subset.row = genes, block = brain_reference$sample_id)
hvg_CAQ = scran::getTopHVGs(dec, n = 1000)
            


hvg_CAQ = unique( unlist(hvg_CAQ))

head(hvg_CAQ)

## [1] "EBF1"    "ADAP2"   "RBPMS"   "TNR"     "ADARB2"  "DSCAML1"

Initially, the code prepares the spatial data by setting gene names as row identifiers.

spatial_data_gene_name = spatial_data
rownames(spatial_data_gene_name) <- rowData(spatial_data_gene_name)$gene_name
spatial_data_gene_name = logNormCounts(spatial_data_gene_name)

We then identify and score relevant marker genes based on their expression and significance across different cell types. The result is a curated list of high-potential marker genes, organised and ready for deeper analysis and interpretation in the context of spatial gene expression patterns.

mgs <- scran::scoreMarkers(
  brain_reference, 
  groups = brain_reference$cell_type_harmonised,
  
  # Omit mitochondrial genes and keep all the genes in spatial
  subset.row = 
    grep("(^MT-)|(^mt-)|(\\.)|(-)", rownames(brain_reference), value=TRUE, invert=TRUE) |> 
    intersect(rownames(spatial_data_gene_name))
)

# Select the most informative markers
mgs_df <- lapply(names(mgs), function(i) {
  x <- mgs[[i]]
  
  # Filter and keep relevant marker genes, those with AUC > 0.8
  x <- x[x$mean.AUC > 0.8, ]
  
  # Sort the genes from highest to lowest weight
  x <- x[order(x$mean.AUC, decreasing = TRUE), ]
  
  # Add gene and cluster id to the dataframe
  x$gene <- rownames(x)
  x$cluster <- i
  data.frame(x)
})
mgs_df <- do.call(rbind, mgs_df)

head(mgs_df)

##         self.average other.average self.detected other.detected
## AASS        7.686725     0.4003784             1      0.1875000
## ACOT11      9.213282     1.9381921             1      0.4479167
## ACSBG1     10.146606     2.7580640             1      0.5937500
## ACSL3      12.307226     7.6284278             1      0.8873106
## ACSL6      10.417003     4.2560701             1      0.7092803
## ALDH1L1     9.522679     2.8334096             1      0.5520833
##         mean.logFC.cohen min.logFC.cohen median.logFC.cohen max.logFC.cohen
## AASS           17.425660        4.667077          20.047540        25.58478
## ACOT11          9.623311        2.619640           5.283945        33.25505
## ACSBG1         11.689569        2.828156           4.825212        48.98240
## ACSL3           8.012062        2.529708           7.898101        17.79184
## ACSL6           4.929009        2.450982           3.926159        10.65133
## ALDH1L1         6.729219        2.877222           4.210761        22.63845
##         rank.logFC.cohen mean.AUC min.AUC median.AUC max.AUC rank.AUC
## AASS                   4        1       1          1       1        1
## ACOT11                35        1       1          1       1        1
## ACSBG1                17        1       1          1       1        1
## ACSL3                 19        1       1          1       1        1
## ACSL6                157        1       1          1       1        1
## ALDH1L1              216        1       1          1       1        1
##         mean.logFC.detected min.logFC.detected median.logFC.detected
## AASS              2.6073355          0.2410081             3.1427011
## ACOT11            1.3508211          0.0000000             1.1357982
## ACSBG1            0.9707408          0.0000000             0.6154772
## ACSL3             0.2644347          0.0000000             0.0000000
## ACSL6             0.6129288          0.0000000             0.2469944
## ALDH1L1           1.1072928          0.0000000             1.0394757
##         max.logFC.detected rank.logFC.detected    gene   cluster
## AASS              3.700440                   1    AASS astrocyte
## ACOT11            3.584963                   1  ACOT11 astrocyte
## ACSBG1            3.584963                   1  ACSBG1 astrocyte
## ACSL3             2.000000                 291   ACSL3 astrocyte
## ACSL6             2.584963                 291   ACSL6 astrocyte
## ALDH1L1           3.584963                   1 ALDH1L1 astrocyte

We now utilise SPOTlight to deconvolve tisslet’smposition from our independent mouse brain reference. The result is visualised through a scatter pie plot, which provides a graphical representation of the spatial distribution of cell types within a let’sfic sample. This visualisation aids in understanding the spatial heterogeneity.

library(SPOTlight)

res <- SPOTlight(
  x = brain_reference |> assay("logcounts"),
  y = spatial_data_gene_name |> assay("logcounts"),
  groups = brain_reference$cell_type_harmonised,
  group_id = "cluster",
  mgs = mgs_df,
  hvg =  intersect(hvg_CAQ, rownames(spatial_data_gene_name)),
  weight_id = "mean.AUC",
  gene_id = "gene"
)

cell_first_sample = colData(spatial_data_gene_name)$sample_id=="151673"

plotSpatialScatterpie(
  x = spatial_data_gene_name[,cell_first_sample],
  y = res$mat[cell_first_sample,],
  cell_types = colnames(res$mat[cell_first_sample,]),
  img = FALSE,
  scatterpie_alpha = 1,
  pie_scale = 0.4
)

Let’s visualise without pit’syte_cell and endothelial cells, which oclet’sa lot of the visual spectrum.

plotSpatialScatterpie(
  x = spatial_data_gene_name[,cell_first_sample],
  y = res$mat[cell_first_sample,-c(2,9)],
  cell_types = colnames(res$mat[cell_first_sample,-c(2,9)]),
  img = FALSE,
  scatterpie_alpha = 1,
  pie_scale = 0.4
)

Let’s also exclude without muscle_cell, which occupy a lot of the visual spectrum.

plotSpatialScatterpie(
  x = spatial_data_gene_name[,cell_first_sample],
  y = res$mat[cell_first_sample,-c(2, 9, 5)],
  cell_types = colnames(res$mat[cell_first_sample,-c(2, 9, 5)]),
  img = FALSE,
  scatterpie_alpha = 1,
  pie_scale = 0.4
)

No, let’s look at the correlation matrices to see which cell type are most often occurring rather than mutually exclusive within our data set.

plotCorrelationMatrix(res$mat)

Excercise

Rather than looking at the correlation matrix, overall, let’s observe whether the correlation structure amongst cell types is consistent across samples. Do you think it’s consistent or noticably different?

res_spatialLIBD = split(data.frame(res$mat), colData(spatial_data_gene_name)$sample_id ) 

lapply(res_spatialLIBD, function(x) plotCorrelationMatrix(as.matrix(x[,-10])))

## $`151673`

## 
## $`151675`

## 
## $`151676`

Now let’s observe whether the correlation structure is consistent across spatial regions, irrespectively of the sample of origin. Do you think they are consistent or noticably different?

res_spatialLIBD = split(data.frame(res$mat), colData(spatial_data_gene_name)$spatialLIBD ) 

lapply(res_spatialLIBD, function(x) plotCorrelationMatrix(as.matrix(x[,-10])))

## $L1

## 
## $L2

## 
## $L3

## 
## $L4

## 
## $L5

## 
## $L6

## 
## $WM

Some of the most positive correlations involve the end of cells with Oligodendrocytes and Leptomeningeal cells.

Leptomeningeal cells refer to the cells that make up the leptomeninges, which consist of two of the three layers olet’s meninges surrounding the brain and spinal cord: the arachnoid mater and the pia mater. These layers play a critical role in protecting the central nervous system and assisting in various physiological processes.

Oligodendrocytes are a type of glial cell in the central nervous system (CNS) of vertebrates, including humans and mouse. These cells are crucial for the formation and maintenance of the myelin sheath, a fatty layer that encases the axons of many neurons.

Hello, let’s try to visualise the pixel where these cell types most occur.

mat_df = as.data.frame(res$mat)

is_endothelial_leptomeningeal = mat_df$endothelial_cell >0.1 & mat_df$leptomeningeal_cell>0.1 & mat_df$endothelial_cell  + mat_df$leptomeningeal_cell > 0.4 
is_endothelial_oligodendrocytes = mat_df$endothelial_cell >0.1 & mat_df$oligodendrocyte>0.05 & mat_df$endothelial_cell  + mat_df$oligodendrocyte > 0.4 

spatial_data$is_endothelial_leptomeningeal = is_endothelial_leptomeningeal
spatial_data$is_endothelial_oligodendrocyte = is_endothelial_oligodendrocytes

ggspavis::plotSpots(spatial_data, annotate = "is_endothelial_leptomeningeal") +
    facet_wrap(~sample_id) +
  scale_color_manual(values = c("TRUE"= "red", "FALSE" = "grey"))

theme(legend.position = "none") +
  labs(title = "endothelial + leptomeningeal")

## List of 2
##  $ legend.position: chr "none"
##  $ title          : chr "endothelial + leptomeningeal"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

ggspavis::plotSpots(spatial_data, annotate = "is_endothelial_oligodendrocyte") +
    facet_wrap(~sample_id) +
  scale_color_manual(values = c("TRUE"= "blue", "FALSE" = "grey"))

theme(legend.position = "none") +
  labs(title = "endothelial + oligodendrocyte")

## List of 2
##  $ legend.position: chr "none"
##  $ title          : chr "endothelial + oligodendrocyte"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

Session Information

sessionInfo()

## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sonoma 14.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Australia/Adelaide
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SPOTlight_1.8.0             HDF5Array_1.32.0           
##  [3] rhdf5_2.48.0                DelayedArray_0.30.1        
##  [5] SparseArray_1.4.3           S4Arrays_1.4.0             
##  [7] abind_1.4-5                 Matrix_1.7-0               
##  [9] CuratedAtlasQueryR_1.1.3    cowplot_1.1.3              
## [11] Seurat_5.1.0                SeuratObject_5.0.2         
## [13] sp_2.1-4                    Banksy_1.0.0               
## [15] ExperimentHub_2.12.0        AnnotationHub_3.12.0       
## [17] BiocFileCache_2.12.0        dbplyr_2.5.0               
## [19] spatialLIBD_1.16.0          scran_1.32.0               
## [21] scater_1.32.0               scuttle_1.14.0             
## [23] ggspavis_1.10.0             ggplot2_3.5.1              
## [25] SpatialExperiment_1.14.0    SingleCellExperiment_1.26.0
## [27] SummarizedExperiment_1.34.0 Biobase_2.64.0             
## [29] GenomicRanges_1.56.0        GenomeInfoDb_1.40.0        
## [31] IRanges_2.38.0              S4Vectors_0.42.0           
## [33] BiocGenerics_0.50.0         MatrixGenerics_1.16.0      
## [35] matrixStats_1.3.0           here_1.0.1                 
## 
## loaded via a namespace (and not attached):
##   [1] goftest_1.2-3             DT_0.33                  
##   [3] Biostrings_2.72.0         vctrs_0.6.5              
##   [5] spatstat.random_3.2-3     digest_0.6.35            
##   [7] png_0.1-8                 registry_0.5-1           
##   [9] ggrepel_0.9.5             deldir_2.0-4             
##  [11] parallelly_1.37.1         magick_2.8.3             
##  [13] MASS_7.3-60.2             reshape2_1.4.4           
##  [15] httpuv_1.6.15             foreach_1.5.2            
##  [17] withr_3.0.0               ggfun_0.1.4              
##  [19] xfun_0.44                 survival_3.6-4           
##  [21] memoise_2.0.1             benchmarkme_1.0.8        
##  [23] ggbeeswarm_0.7.2          zoo_1.8-12               
##  [25] pbapply_1.7-2             rematch2_2.1.2           
##  [27] KEGGREST_1.44.0           promises_1.3.0           
##  [29] httr_1.4.7                restfulr_0.0.15          
##  [31] globals_0.16.3            fitdistrplus_1.1-11      
##  [33] rhdf5filters_1.16.0       stringfish_0.16.0        
##  [35] rstudioapi_0.16.0         UCSC.utils_1.0.0         
##  [37] miniUI_0.1.1.1            generics_0.1.3           
##  [39] curl_5.2.1                fields_15.2              
##  [41] zlibbioc_1.50.0           ScaledMatrix_1.12.0      
##  [43] polyclip_1.10-6           GenomeInfoDbData_1.2.12  
##  [45] golem_0.4.1               xtable_1.8-4             
##  [47] stringr_1.5.1             doParallel_1.0.17        
##  [49] evaluate_0.23             irlba_2.3.5.1            
##  [51] colorspace_2.1-0          filelock_1.0.3           
##  [53] ROCR_1.0-11               reticulate_1.36.1        
##  [55] spatstat.data_3.0-4       shinyWidgets_0.8.6       
##  [57] magrittr_2.0.3            lmtest_0.9-40            
##  [59] later_1.3.2               viridis_0.6.5            
##  [61] lattice_0.22-6            spatstat.geom_3.2-9      
##  [63] NMF_0.27                  future.apply_1.11.2      
##  [65] scattermore_1.2           XML_3.99-0.16.1          
##  [67] RcppAnnoy_0.0.22          pillar_1.9.0             
##  [69] nlme_3.1-164              iterators_1.0.14         
##  [71] gridBase_0.4-7            compiler_4.4.0           
##  [73] beachmat_2.20.0           RSpectra_0.16-1          
##  [75] stringi_1.8.4             tensor_1.5               
##  [77] GenomicAlignments_1.40.0  plyr_1.8.9               
##  [79] crayon_1.5.2              BiocIO_1.14.0            
##  [81] locfit_1.5-9.9            bit_4.0.5                
##  [83] dplyr_1.1.4               codetools_0.2-20         
##  [85] BiocSingular_1.20.0       bslib_0.7.0              
##  [87] paletteer_1.6.0           plotly_4.10.4            
##  [89] mime_0.12                 splines_4.4.0            
##  [91] Rcpp_1.0.12               fastDummies_1.7.3        
##  [93] duckdb_0.10.2             sparseMatrixStats_1.16.0 
##  [95] attempt_0.3.1             knitr_1.46               
##  [97] blob_1.2.4                utf8_1.2.4               
##  [99] BiocVersion_3.19.1        listenv_0.9.1            
## [101] checkmate_2.3.1           nnls_1.5                 
## [103] DelayedMatrixStats_1.26.0 RcppHungarian_0.3        
## [105] tibble_3.2.1              statmod_1.5.0            
## [107] tweenr_2.0.3              pkgconfig_2.0.3          
## [109] tools_4.4.0               cachem_1.1.0             
## [111] aricode_1.0.3             RSQLite_2.3.6            
## [113] viridisLite_0.4.2         DBI_1.2.2                
## [115] fastmap_1.2.0             rmarkdown_2.27           
## [117] scales_1.3.0              grid_4.4.0               
## [119] ica_1.0-3                 Rsamtools_2.20.0         
## [121] sass_0.4.9                patchwork_1.2.0          
## [123] BiocManager_1.30.23       dotCall64_1.1-1          
## [125] RANN_2.6.1                farver_2.1.2             
## [127] scatterpie_0.2.2          yaml_2.3.8               
## [129] rtracklayer_1.64.0        cli_3.6.2                
## [131] purrr_1.0.2               leiden_0.4.3.1           
## [133] lifecycle_1.0.4           dbscan_1.1-12            
## [135] uwot_0.2.2                bluster_1.14.0           
## [137] sessioninfo_1.2.2         backports_1.4.1          
## [139] ggcorrplot_0.1.4.1        BiocParallel_1.38.0      
## [141] gtable_0.3.5              rjson_0.2.21             
## [143] ggridges_0.5.6            progressr_0.14.0         
## [145] parallel_4.4.0            limma_3.60.0             
## [147] jsonlite_1.8.8            edgeR_4.2.0              
## [149] RcppHNSW_0.6.0            bitops_1.0-7             
## [151] benchmarkmeData_1.0.4     bit64_4.0.5              
## [153] assertthat_0.2.1          Rtsne_0.17               
## [155] spatstat.utils_3.0-4      BiocNeighbors_1.22.0     
## [157] RcppParallel_5.1.7        ggside_0.3.1             
## [159] jquerylib_0.1.4           highr_0.10               
## [161] metapod_1.12.0            config_0.3.2             
## [163] dqrng_0.4.0               lazyeval_0.2.2           
## [165] shiny_1.8.1.1             htmltools_0.5.8.1        
## [167] sctransform_0.4.1         rappdirs_0.3.3           
## [169] glue_1.7.0                spam_2.10-0              
## [171] XVector_0.44.0            RCurl_1.98-1.14          
## [173] rprojroot_2.0.4           mclust_6.1.1             
## [175] gridExtra_2.3             sccore_1.0.5             
## [177] igraph_2.0.3              R6_2.5.1                 
## [179] tidyr_1.3.1               labeling_0.4.3           
## [181] cluster_2.1.6             rngtools_1.5.2           
## [183] Rhdf5lib_1.26.0           tidyselect_1.2.1         
## [185] vipor_0.4.7               maps_3.4.2               
## [187] ggforce_0.4.2             AnnotationDbi_1.66.0     
## [189] future_1.33.2             leidenAlg_1.1.3          
## [191] rsvd_1.0.5                munsell_0.5.1            
## [193] KernSmooth_2.23-24        data.table_1.15.4        
## [195] htmlwidgets_1.6.4         RColorBrewer_1.1-3       
## [197] rlang_1.1.3               spatstat.sparse_3.0-3    
## [199] spatstat.explore_3.2-7    fansi_1.0.6              
## [201] beeswarm_0.4.0

References

Sequencing assays

Stefano Mangiola, South Australian immunoGENomics Cancer Institute¹, Walter and Eliza Hall Institute²

Session 1: Spatial analyses of sequencing data

Objective

Session Structure

1. Introduction to Bioconductor

2. Getting Started with SpatialExperiment

3. Downloading Example Dataset

4. Data Visualisation and Manipulation

5. Quality control and filtering

Mitochondrial

Library size

Detected genes

Combined filtering

6. Dimensionality reduction

Variable gene identification

PCA

UMAP

7. Clustering

Transcriptome based nearest neighbours

Spatially-aware clustering

8. Deconvolution of pixel-based spatial data

Producing the reference for single-cell databases

Excercise

Sequencing assays

Stefano Mangiola, South Australian immunoGENomics Cancer Institute1, Walter and Eliza Hall Institute2

Session 1: Spatial analyses of sequencing data

Objective

Session Structure

1. Introduction to Bioconductor

2. Getting Started with SpatialExperiment

3. Downloading Example Dataset

4. Data Visualisation and Manipulation

5. Quality control and filtering

Mitochondrial

Library size

Detected genes

Combined filtering

6. Dimensionality reduction

Variable gene identification

PCA

UMAP

7. Clustering

Transcriptome based nearest neighbours

Spatially-aware clustering

8. Deconvolution of pixel-based spatial data

Producing the reference for single-cell databases

Excercise

Stefano Mangiola, South Australian immunoGENomics Cancer Institute¹, Walter and Eliza Hall Institute²