GAM-clustering Tutorial Notebook For Reanalysis Of Metabolic Tabula Muris Senis Single Cell RNA-Seq Data

Required packages to be installed:

if (!require("devtools", quietly = TRUE))
    install.packages("BiocManager")

devtools::install_github("ctlab/GAMclust")
devtools::install_github("ctlab/gatom")
devtools::install_github("ctlab/mwcsr")
devtools::install_github("ctlab/fgsea")

library(GAMclust)
library(gatom)
library(mwcsr)
library(fgsea)
library(data.table)
library(Seurat)

set.seed(42)

Preparing working environment

First, please load and initialize all objects required for GAM-clustering analysis:

load KEGG metabolic network network.kegg.rds and its metabolites annotation met.kegg.db.rds;

network <- readRDS(url("http://artyomovlab.wustl.edu/publications/supp_materials/GATOM/network.kegg.rds"))
metabolites.annotation <- readRDS(url("http://artyomovlab.wustl.edu/publications/supp_materials/GATOM/met.kegg.db.rds"))

load species-specific network annotation: org.Hs.eg.gatom.anno for human data or org.Mm.eg.gatom.anno for mouse data;

network.annotation <- readRDS(url("http://artyomovlab.wustl.edu/publications/supp_materials/GATOM/org.Mm.eg.gatom.anno.rds"))

load provided list of metabolites that should not be considered during the analysis as connections between reactions (e.g., CO2, HCO3-, etc);

met.to.filter <- data.table::fread(system.file("mets2mask.lst", package="GAMclust"))$ID

initialize SMGWCS solver:

(4.1.) we recommend to use here either heuristic relax-and-cut solver rnc_solver from mwcsr package,

solver <- mwcsr::rnc_solver()

(4.2.) either proprietary CPLEX solver (free for academy);

cplex.dir <-  "/opt/ibm/ILOG/CPLEX_Studio1271"

solver <- mwcsr::virgo_solver(cplex_dir = cplex.dir)

set working directory where the results will be saved to.

work.dir <- "results"
dir.create(work.dir, showWarnings = F, recursive = T)

TEMPORARY: collecting logs while developing the tool.

stats.dir <- paste0(work.dir, "/stats")
dir.create(stats.dir, showWarnings = F, recursive = T)

log_file <- file(paste0(stats.dir, "/log.txt"), open = "wt")
sink(log_file, type = "output", append = TRUE)
sink(log_file, type = "message", append = TRUE)

Preparing objects for the analysis

Preparing data

GAMclust works with bulk, single cell and spatial RNA-seq data.

This vignette shows how to process single cell RNA-seq data on the example of Tabula Muris Senis data reanalysis.

For single cell data, take 6,000-12,000 genes for GAM-clustering analysis. To do this while preprocessing data with Seurat pipeline, set variable.features.n = 12000 in SCTransform() function. In case of preprocessing multi-sample data, set nfeatures=12000 in SelectIntegrationFeatures()).

Let’s load already preprocessed data.

seurat_object <- readRDS(url("http://artyomovlab.wustl.edu/publications/supp_materials/GAMclust/tms12k.rds"))

E <- as.matrix(Seurat::GetAssayData(object = seurat_object,
                                    assay = "SCT",
                                    slot = "scale.data"))

nrow(E) # ! make sure this value is in range from 6,000 to 12,000

# [1] 12000

E[1:3, 1:3]

#        AACCATGAGATCCCAT-1-0-0-0 AACCATGTCAGAGGTG-1-0-0-0
# Sox17                -0.1865187               -0.2478802
# Mrpl15               -0.4440852                0.3323707
# Lypla1               -0.4182570               -0.5818529
#        AAGGTTCTCCTAGAAC-1-0-0-0
# Sox17                -0.2390427
# Mrpl15               -0.5901840
# Lypla1               -0.5602515

Genes in your dataset may be named as Symbol, Entrez, Ensembl or RefSeq IDs. One of these names should be specified as value of gene.id.type parameter in prepareData().

If you analyse singe cell or spatial RNA-seq data, please set use.PCA=TRUE in prepareData().

E.prep <- prepareData(E = E,
                      gene.id.type = "Symbol",
                      use.PCA = TRUE,
                      use.PCA.n = 50,
                      network.annotation = network.annotation)

E.prep[1:3, 1:3]

#                PC1        PC2         PC3
# 16176   0.07415203 2.20087518 -0.85846106
# 278180  1.21894952 2.84602959  1.64928876
# 17394  -5.31802334 0.07568864  0.08151284

Preparing network

The prepareNetwork() function defines the structure of the final metabolic modules.

network.prep <- prepareNetwork(E = E.prep,
                               network = network,
                               met.to.filter = met.to.filter,
                               network.annotation = network.annotation)

# > Global metabolite network contains 5267 edges.

# > Largest connected component of this global network contains 1177 nodes and 4294 edges.

Preclustering

The preClustering() function defines initial patterns using k-medoids clustering on gene expression matrix. It is strongly recommended to do initial clustering with no less than 32 clusters (initial.number.of.clusters = 32).

You can visualize the initial heatmap as shown below.

cur.centers <- preClustering(E.prep = E.prep,
                             network.prep = network.prep,
                             initial.number.of.clusters = 32,
                             network.annotation = network.annotation)

# > 929 metabolic genes from the analysed dataset mapped to this component.

cur.centers[1:3, 1:3]

#           PC1        PC2       PC3
# 1 2.821065907 -2.1195663 -1.750319
# 2 0.007682082 -0.6166972  1.703774
# 3 3.483677116 -1.6683076  2.582279

pheatmap::pheatmap(
      GAMclust:::normalize.rows(cur.centers),
      cluster_rows=F, cluster_cols=F,
      show_rownames=F, show_colnames=T)

GAM-clustering analysis

Now you have everything prepared for the GAM-clustering analysis.

Initial patterns will be now refined in an iterative process. The output of gamClustering() function presents a set of specific subnetworks (also called metabolic modules) that reflect metabolic variability within a given transcriptional dataset.

Note, that it may take a long time to derive metabolic modules by gamClustering() function (tens of minutes).

There is a set of parameters which determine the size and number of your final modules. We recommend you to start with the default settings, however you can adjust them based on your own preferences:

If you consider final modules to bee too small or too big and it complicates interpretation for you, you can either increase or reduce by 10 units the max.module.size parameter.
If among final modules you consider presence of any modules with too similar patterns, you can reduce by 0.1 units the cor.threshold parameter.
If among final modules you consider presence of any uninformative modules, you can reduce by 10 times the p.adj.val.threshold parameter.

results <- gamClustering(E.prep = E.prep,
                         network.prep = network.prep,
                         cur.centers = cur.centers,
                         
                         start.base = 0.5,
                         base.dec = 0.05,
                         max.module.size = 50,

                         cor.threshold = 0.8,
                         p.adj.val.threshold = 1e-5,

                         batch.solver = seq_batch_solver(solver),
                         work.dir = work.dir,
                         
                         show.intermediate.clustering = FALSE,
                         verbose = FALSE,
                         collect.stats = TRUE)

Visualizing and exploring the GAM-clustering results

Each metabolic module is a connected piece of metabolic network whose genes expression is correlated across all dataset.

The following functions will help you to visualize and explore the obtained results.

Get graphs of modules:

getGraphs(modules = results$modules,
          network.annotation = network.annotation,
          metabolites.annotation = metabolites.annotation,
          seed.for.layout = 42,
          work.dir = work.dir)

# Graphs for module 1 are built

# Graphs for module 2 are built

# Graphs for module 3 are built

# Graphs for module 4 are built

# Graphs for module 5 are built

# Graphs for module 6 are built

# Graphs for module 7 are built

Example of the graph of the third module:

Get gene tables:

The table contains gene list. Each gene has two descriptive values: i) gene’s correlation value with the modules pattern and ii) gene’s score. High score means that this gene’s expression is similar to the module’s pattern and not similar to other modules’ patterns.

m.gene.list <- getGeneTables(modules = results$modules,
                             nets = results$nets,
                             patterns = results$patterns.pos,
                             gene.exprs = E.prep,
                             network.annotation = network.annotation,
                             work.dir = work.dir)

# Gene tables for module 1 are produced

# Gene tables for module 2 are produced

# Gene tables for module 3 are produced

# Gene tables for module 4 are produced

# Gene tables for module 5 are produced

# Gene tables for module 6 are produced

# Gene tables for module 7 are produced

Example of the gene table of the third module:

head(GAMclust:::read.tsv(paste0(work.dir, "/m.3.genes.tsv")))

Get plots of patterns:

for(i in 1:length(m.gene.list)){
  
  plot(fgsea::plotCoregulationProfileReduction(m.gene.list[[i]], 
                                               seurat_object, 
                                               title = paste0("module ", i),
                                               raster = TRUE,
                                               reduction = "pumap"))
}

Get plots of individual genes expression (example for the 3d module):

Seurat::DefaultAssay(seurat_object) <- "SCT"

i <- 3

Seurat::FeaturePlot(seurat_object, 
                    slot = "data",
                    reduction = "pumap",
                    features = m.gene.list[[i]],
                    ncol = 6,
                    raster = TRUE,
                    combine = TRUE)

Get tables and plots with annotation of modules:

Functional annotation of obtained modules is done based on KEGG and Reactome canonical metabolic pathways.

getAnnotationTables(network.annotation = network.annotation,
                    nets = results$nets,
                    work.dir = work.dir)

# Pathway annotation for module 1 is produced

# Pathway annotation for module 2 is produced

# Pathway annotation for module 3 is produced

# Pathway annotation for module 4 is produced

# Pathway annotation for module 5 is produced

# Pathway annotation for module 6 is produced

# Pathway annotation for module 7 is produced

Example of the annotation table of the third module:

head(GAMclust:::read.tsv(paste0(work.dir, "/m.3.pathways.tsv")))

Annotation heatmap for all modules:

getAnnotationHeatmap(work.dir = work.dir)

# Processing module 1

# Module size: 40

# Processing module 2

# Module size: 18

# Processing module 3

# Module size: 16

# Processing module 4

# Module size: 20

# Processing module 5

# Module size: 13

# Processing module 6

# Module size: 12

# Processing module 7

# Module size: 5

Compare modules obtained in different runs:

You may also compare two results of running GAM-clustering on the same dataset (e.g. runs with different parameters) or compare two results of running GAM-clustering on different datasets (then set same.data=FALSE).

modulesSimilarity(dir1 = work.dir,
                  dir2 = "old.dir",
                  name1 = "new",
                  name2 = "old",
                  same.data = TRUE,
                  use.genes.with.pos.score = TRUE,
                  work.dir = work.dir,
                  file.name = "comparison.png")