We present DiSTect (Disease-Specific Gene Detection), a Bayesian shrinkage model designed to address the challenges of high-dimensionality, spatial correlation within tissue spots, multiple correlated samples, and missing data. The full paper can be accessed at https://arxiv.org/abs/2409.02397.
This tutorial provides a quick guide to running DiSTect in R for two main scenarios:
Single sample analysis
Multiple correlated sample analysis
Additionally, it covers several applications such as disease-specific genes identification, gene interaction, missing data imputation, and disease prediction.
To use DiSTect, R packages rstan and rjags are required for Bayesian modelling and for JAGS-based modelling, respectively.
install.packages("rstan", dependencies = TRUE)
install.packages("rjags", type = "source")
Next, download the file named “DiSTect” from from our GitHub repository https://github.com/StaGill/DiSTect, and install it locally:
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'DiSTect'
## The following object is masked from 'package:stats':
##
## predict
install.packages("/YourPath/DiSTect", repos = NULL, type = "source")
library(DiSTect)
We begin with a toy dataset to illustrate basic usage:
data(toy)
head(toy)
## gene1 gene2 gene3 x y disease
## 1 -0.56047565 -0.7152422 1.07401226 1.5526414 6.298418 1
## 2 -0.23017749 -0.7526890 -0.02734697 8.4585101 3.534138 0
## 3 1.55870831 -0.9385387 -0.03333034 2.1438043 4.247147 0
## 4 0.07050839 -1.0525133 -1.51606762 6.6987324 9.637688 0
## 5 0.12928774 -0.4371595 0.79038534 6.1775645 6.809985 1
## 6 1.71506499 0.3311792 -0.21073418 0.4999978 7.184639 1
The dataset is divided into two components:
A covariate matrix (\(X\)), containing log-transformed gene expression counts and spatial coordinates.
A response vector (\(Y\)), representing disease status.
x <- as.matrix(toy[, c('gene1', 'gene2', 'gene3', 'x', 'y')])
y <- toy$disease
Use the dsgd
function to fit the model. Key arguments
include includes:
list_y
: A vector containing disease status information
for each spatial location or cell.matrix_x
: A matrix of gene expression values and
spatial coordinates (note: spatial coordinates must be in the last two
columns)DiSTect offers two estimation methods to balance computational efficiency and accuracy:
Notably, dsgd
provides two ways for parameter estimation
considering the trade-off between accuracy and computational
feasibility: - NUTS (No-U-Turn Sampler): A robust MCMC sampling method
for Bayesian hierarchical models. - VI (Variational Inference): A more
computationally efficient way by approximation.
By default, dsgd
uses VI
, but this can be
modified via the method
argument.
model.single <- dsgd(list_y = y, matrix_x = x)
## Loading required package: StanHeaders
##
## rstan version 2.32.7 (Stan version 2.32.2)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
## Chain 1: ------------------------------------------------------------
## Chain 1: EXPERIMENTAL ALGORITHM:
## Chain 1: This procedure has not been thoroughly tested and may be unstable
## Chain 1: or buggy. The interface is subject to change.
## Chain 1: ------------------------------------------------------------
## Chain 1:
## Chain 1:
## Chain 1:
## Chain 1: Gradient evaluation took 0.034543 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 345.43 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Begin eta adaptation.
## Chain 1: Iteration: 1 / 250 [ 0%] (Adaptation)
## Chain 1: Iteration: 50 / 250 [ 20%] (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%] (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%] (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%] (Adaptation)
## Chain 1: Iteration: 250 / 250 [100%] (Adaptation)
## Chain 1: Success! Found best value [eta = 0.1].
## Chain 1:
## Chain 1: Begin stochastic gradient ascent.
## Chain 1: iter ELBO delta_ELBO_mean delta_ELBO_med notes
## Chain 1: 100 -187648.407 1.000 1.000
## Chain 1: 200 -187390.397 0.501 1.000
## Chain 1: 300 -187305.355 0.334 0.001
## Chain 1: 400 -187274.348 0.250 0.001
## Chain 1: 500 -187260.340 0.200 0.000
## Chain 1: 600 -187254.045 0.167 0.000
## Chain 1: 700 -187251.827 0.143 0.000
## Chain 1: 800 -187250.218 0.125 0.000
## Chain 1: 900 -187249.011 0.111 0.000
## Chain 1: 1000 -187248.869 0.100 0.000
## Chain 1: 1100 -187249.032 0.000 0.000
## Chain 1: 1200 -187249.197 0.000 0.000
## Chain 1: 1300 -187248.310 0.000 0.000 MEDIAN ELBO CONVERGED
## Chain 1:
## Chain 1: Drawing a sample of size 1000 from the approximate posterior...
## Chain 1: COMPLETED.
summary(model.single)
## $summary
## mean se_mean sd 2.5% 25% 50%
## beta[1] 1.14693159 NaN 0.26021562 0.639396475 0.97224400 1.1501950
## beta[2] 2.04435341 NaN 0.31281740 1.402095250 1.84248250 2.0519200
## beta[3] 3.60590780 NaN 0.44174590 2.757847250 3.31296500 3.5968450
## eta 0.02864907 NaN 0.02631248 0.004744418 0.01284007 0.0206586
## beta_gamma[1] 9.57544647 NaN 4.49177254 3.752121500 6.44646000 8.7228050
## beta_gamma[2] 9.69880656 NaN 4.35458523 3.956789500 6.72456250 8.8275900
## beta_gamma[3] 10.57683452 NaN 4.72950566 4.034471500 7.03883250 9.6363100
## w 0.77253801 NaN 0.19313513 0.286981375 0.66706025 0.8380010
## lp__ 0.00000000 NaN 0.00000000 0.000000000 0.00000000 0.0000000
## 75% 97.5% n_eff khat
## beta[1] 1.3289725 1.64770500 NaN 0.3937108
## beta[2] 2.2466775 2.62984200 NaN 0.4125598
## beta[3] 3.9013675 4.42382400 NaN 0.3736457
## eta 0.0360505 0.09630858 NaN 0.4431482
## beta_gamma[1] 11.8491750 19.51849750 NaN 0.6288305
## beta_gamma[2] 11.6481750 19.90184250 NaN 0.5780287
## beta_gamma[3] 13.2920750 21.27337250 NaN 0.5136678
## w 0.9248757 0.98358528 NaN 0.3975397
## lp__ 0.0000000 0.00000000 NaN 0.4435280
##
## $c_summary
## , , chains = chain:1
##
## stats
## parameter mean sd 2.5% 25% 50%
## beta[1] 1.14693159 0.26021562 0.639396475 0.97224400 1.1501950
## beta[2] 2.04435341 0.31281740 1.402095250 1.84248250 2.0519200
## beta[3] 3.60590780 0.44174590 2.757847250 3.31296500 3.5968450
## eta 0.02864907 0.02631248 0.004744418 0.01284007 0.0206586
## beta_gamma[1] 9.57544647 4.49177254 3.752121500 6.44646000 8.7228050
## beta_gamma[2] 9.69880656 4.35458523 3.956789500 6.72456250 8.8275900
## beta_gamma[3] 10.57683452 4.72950566 4.034471500 7.03883250 9.6363100
## w 0.77253801 0.19313513 0.286981375 0.66706025 0.8380010
## lp__ 0.00000000 0.00000000 0.000000000 0.00000000 0.0000000
## stats
## parameter 75% 97.5%
## beta[1] 1.3289725 1.64770500
## beta[2] 2.2466775 2.62984200
## beta[3] 3.9013675 4.42382400
## eta 0.0360505 0.09630858
## beta_gamma[1] 11.8491750 19.51849750
## beta_gamma[2] 11.6481750 19.90184250
## beta_gamma[3] 13.2920750 21.27337250
## w 0.9248757 0.98358528
## lp__ 0.0000000 0.00000000
The output includes estimated coefficients that reflect the relationship between gene expression and disease status. For example, a coefficient of 3.16 for gene3 suggests that it likely promotes disease development.
The function plot_coef
visualizes the estimated effect
sizes of genes on disease status. By default, the top 30 genes are
displayed, but this can be adjusted using the n
parameter:
plot_coef(fit = model.single, matrix_x = x)
To examine gene–gene interaction effects, specify the genes of
interest using the interaction
argument. This generates all
two-way interactions among the specified genes:
model.interaction <- dsgd(list_y = toy$disease,
matrix_x = toy[, c('gene1', 'gene2', 'gene3', 'x', 'y')],
interaction = c("gene1", "gene2", "gene3"))
## Chain 1: ------------------------------------------------------------
## Chain 1: EXPERIMENTAL ALGORITHM:
## Chain 1: This procedure has not been thoroughly tested and may be unstable
## Chain 1: or buggy. The interface is subject to change.
## Chain 1: ------------------------------------------------------------
## Chain 1:
## Chain 1:
## Chain 1:
## Chain 1: Gradient evaluation took 0.032382 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 323.82 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Begin eta adaptation.
## Chain 1: Iteration: 1 / 250 [ 0%] (Adaptation)
## Chain 1: Iteration: 50 / 250 [ 20%] (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%] (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%] (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%] (Adaptation)
## Chain 1: Iteration: 250 / 250 [100%] (Adaptation)
## Chain 1: Success! Found best value [eta = 0.1].
## Chain 1:
## Chain 1: Begin stochastic gradient ascent.
## Chain 1: iter ELBO delta_ELBO_mean delta_ELBO_med notes
## Chain 1: 100 -188300.157 1.000 1.000
## Chain 1: 200 -187639.583 0.502 1.000
## Chain 1: 300 -187409.306 0.335 0.004
## Chain 1: 400 -187303.490 0.251 0.004
## Chain 1: 500 -187274.332 0.201 0.001
## Chain 1: 600 -187264.889 0.168 0.001
## Chain 1: 700 -187261.204 0.144 0.001
## Chain 1: 800 -187259.132 0.126 0.001
## Chain 1: 900 -187258.307 0.112 0.000
## Chain 1: 1000 -187257.306 0.101 0.000
## Chain 1: 1100 -187257.033 0.001 0.000
## Chain 1: 1200 -187257.024 0.000 0.000
## Chain 1: 1300 -187257.233 0.000 0.000
## Chain 1: 1400 -187256.379 0.000 0.000 MEDIAN ELBO CONVERGED
## Chain 1:
## Chain 1: Drawing a sample of size 1000 from the approximate posterior...
## Chain 1: COMPLETED.
## Warning: Pareto k diagnostic value is 0.84. Resampling is unreliable.
## Increasing the number of draws or decreasing tol_rel_obj may help.
summary(model.interaction)
## $summary
## mean se_mean sd 2.5% 25% 50%
## beta[1] 1.27415358 NaN 0.32114878 0.61430967 1.077265000 1.2778350
## beta[2] 2.16726443 NaN 0.29348443 1.59699025 1.966682500 2.1738250
## beta[3] 4.07646214 NaN 0.54740646 3.00120150 3.726657500 4.0692550
## beta[4] 0.22470792 NaN 0.34703426 -0.46569468 -0.006714307 0.2225025
## beta[5] 0.34829647 NaN 0.42166120 -0.43676082 0.070075750 0.3461285
## beta[6] 0.75908832 NaN 0.32068193 0.13651760 0.541916750 0.7506000
## eta 0.05642943 NaN 0.03476532 0.01659194 0.032339700 0.0464880
## beta_gamma[1] 9.98324293 NaN 4.39299340 4.01876075 6.819897500 9.2284400
## beta_gamma[2] 9.80257006 NaN 4.18539613 3.86590700 6.805070000 9.0097900
## beta_gamma[3] 10.14502446 NaN 4.71980945 4.02690175 6.896002500 9.1184150
## beta_gamma[4] 9.47573234 NaN 4.42042055 3.52936225 6.218140000 8.6841950
## beta_gamma[5] 9.59881552 NaN 4.72043700 3.23101975 6.190965000 8.6142500
## beta_gamma[6] 9.89954330 NaN 4.59613095 3.79943000 6.720890000 8.8760450
## w 0.87832432 NaN 0.11611932 0.53440915 0.837256250 0.9157430
## lp__ 0.00000000 NaN 0.00000000 0.00000000 0.000000000 0.0000000
## 75% 97.5% n_eff khat
## beta[1] 1.4956000 1.8706900 NaN 0.8331018
## beta[2] 2.3751875 2.7246907 NaN 0.8100776
## beta[3] 4.4518925 5.1375105 NaN 0.8636810
## beta[4] 0.4642457 0.9077431 NaN 0.8302727
## beta[5] 0.6234523 1.2085175 NaN 0.8066371
## beta[6] 0.9920918 1.3826538 NaN 0.8434977
## eta 0.0704525 0.1530873 NaN 0.8379704
## beta_gamma[1] 12.3323500 20.5485025 NaN 0.7093822
## beta_gamma[2] 12.0749000 19.9275550 NaN 0.8716651
## beta_gamma[3] 12.2957500 22.5033775 NaN 0.8339456
## beta_gamma[4] 11.4274750 21.2797425 NaN 0.8029808
## beta_gamma[5] 11.8661500 20.9298275 NaN 1.1123361
## beta_gamma[6] 12.1556750 21.6436375 NaN 0.8350353
## w 0.9583097 0.9904899 NaN 0.8476742
## lp__ 0.0000000 0.0000000 NaN 0.8372063
##
## $c_summary
## , , chains = chain:1
##
## stats
## parameter mean sd 2.5% 25% 50%
## beta[1] 1.27415358 0.32114878 0.61430967 1.077265000 1.2778350
## beta[2] 2.16726443 0.29348443 1.59699025 1.966682500 2.1738250
## beta[3] 4.07646214 0.54740646 3.00120150 3.726657500 4.0692550
## beta[4] 0.22470792 0.34703426 -0.46569468 -0.006714307 0.2225025
## beta[5] 0.34829647 0.42166120 -0.43676082 0.070075750 0.3461285
## beta[6] 0.75908832 0.32068193 0.13651760 0.541916750 0.7506000
## eta 0.05642943 0.03476532 0.01659194 0.032339700 0.0464880
## beta_gamma[1] 9.98324293 4.39299340 4.01876075 6.819897500 9.2284400
## beta_gamma[2] 9.80257006 4.18539613 3.86590700 6.805070000 9.0097900
## beta_gamma[3] 10.14502446 4.71980945 4.02690175 6.896002500 9.1184150
## beta_gamma[4] 9.47573234 4.42042055 3.52936225 6.218140000 8.6841950
## beta_gamma[5] 9.59881552 4.72043700 3.23101975 6.190965000 8.6142500
## beta_gamma[6] 9.89954330 4.59613095 3.79943000 6.720890000 8.8760450
## w 0.87832432 0.11611932 0.53440915 0.837256250 0.9157430
## lp__ 0.00000000 0.00000000 0.00000000 0.000000000 0.0000000
## stats
## parameter 75% 97.5%
## beta[1] 1.4956000 1.8706900
## beta[2] 2.3751875 2.7246907
## beta[3] 4.4518925 5.1375105
## beta[4] 0.4642457 0.9077431
## beta[5] 0.6234523 1.2085175
## beta[6] 0.9920918 1.3826538
## eta 0.0704525 0.1530873
## beta_gamma[1] 12.3323500 20.5485025
## beta_gamma[2] 12.0749000 19.9275550
## beta_gamma[3] 12.2957500 22.5033775
## beta_gamma[4] 11.4274750 21.2797425
## beta_gamma[5] 11.8661500 20.9298275
## beta_gamma[6] 12.1556750 21.6436375
## w 0.9583097 0.9904899
## lp__ 0.0000000 0.0000000
Visualize the interaction network using plot_network
(defaults:
label_size = 13
,node_color = "#C1D8E7"
):
# Load the required packages
library(ggplot2)
library(GGally)
library(network)
library(sna)
plot_network(model.interaction, c("gene1", "gene2", "gene3"))
Generate a dataset with missingness.
data(toy)
x <- toy[, c('gene1', 'gene2', 'gene3', 'x', 'y')]
y <- toy$disease
missing_indices <- sample(length(y), 10)
y[missing_indices] <- NA
coord <- as.matrix(toy[, c("x", "y")])
To handle missing spots, DiSTect offers
missing_imputation
. The function returns the estimated
probability for each missing observation, which can be used to impute
values:
imputed_mean <- missing_imputation(y, coord)[missing_indices]
y[missing_indices] <- rbinom(10, 1, imputed_mean) # Impute the missing values by the corresponding estimated mean
Subsequently, you can run the dsgd
function on the
imputed dataset:
dsgd(list_y = y, matrix_x = x)
DiSTect allows prediction of disease status using a trained model and
a new dataset. Specifically, the predict
function takes as
input a fitted model and a dataframe containing gene expression and
spatial coordinate information for each observation. The spatial
coordinates should be placed in the last two columns of the
dataframe.
y_pred <- predict(model.single, toy[, c('gene1', 'gene2', 'gene3', 'x', 'y')])
sum(y_pred == toy$disease) / nrow(toy) # calculate the accuracy
## [1] 0.85