Introduction

We present DiSTect (Disease-Specific Gene Detection), a Bayesian shrinkage model designed to address the challenges of high-dimensionality, spatial correlation within tissue spots, multiple correlated samples, and missing data. The full paper can be accessed at https://arxiv.org/abs/2409.02397.

This tutorial provides a quick guide to running DiSTect in R for two main scenarios:

Additionally, it covers several applications such as disease-specific genes identification, gene interaction, missing data imputation, and disease prediction.

Outline

Installation

To use DiSTect, R packages rstan and rjags are required for Bayesian modelling and for JAGS-based modelling, respectively.

install.packages("rstan", dependencies = TRUE)
install.packages("rjags", type = "source")

Next, download the file named “DiSTect” from from our GitHub repository https://github.com/StaGill/DiSTect, and install it locally:

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'DiSTect'
## The following object is masked from 'package:stats':
## 
##     predict
install.packages("/YourPath/DiSTect", repos = NULL, type = "source")
library(DiSTect)

Single Tissue Scenario

We begin with a toy dataset to illustrate basic usage:

data(toy)
head(toy)
##         gene1      gene2       gene3         x        y disease
## 1 -0.56047565 -0.7152422  1.07401226 1.5526414 6.298418       1
## 2 -0.23017749 -0.7526890 -0.02734697 8.4585101 3.534138       0
## 3  1.55870831 -0.9385387 -0.03333034 2.1438043 4.247147       0
## 4  0.07050839 -1.0525133 -1.51606762 6.6987324 9.637688       0
## 5  0.12928774 -0.4371595  0.79038534 6.1775645 6.809985       1
## 6  1.71506499  0.3311792 -0.21073418 0.4999978 7.184639       1

The dataset is divided into two components:

x <- as.matrix(toy[, c('gene1', 'gene2', 'gene3', 'x', 'y')])
y <- toy$disease

Use the dsgd function to fit the model. Key arguments include includes:

DiSTect offers two estimation methods to balance computational efficiency and accuracy:

Notably, dsgd provides two ways for parameter estimation considering the trade-off between accuracy and computational feasibility: - NUTS (No-U-Turn Sampler): A robust MCMC sampling method for Bayesian hierarchical models. - VI (Variational Inference): A more computationally efficient way by approximation.

By default, dsgd uses VI, but this can be modified via the method argument.

model.single <- dsgd(list_y = y, matrix_x = x)
## Loading required package: StanHeaders
## 
## rstan version 2.32.7 (Stan version 2.32.2)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
## Chain 1: ------------------------------------------------------------
## Chain 1: EXPERIMENTAL ALGORITHM:
## Chain 1:   This procedure has not been thoroughly tested and may be unstable
## Chain 1:   or buggy. The interface is subject to change.
## Chain 1: ------------------------------------------------------------
## Chain 1: 
## Chain 1: 
## Chain 1: 
## Chain 1: Gradient evaluation took 0.034543 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 345.43 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Begin eta adaptation.
## Chain 1: Iteration:   1 / 250 [  0%]  (Adaptation)
## Chain 1: Iteration:  50 / 250 [ 20%]  (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%]  (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%]  (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%]  (Adaptation)
## Chain 1: Iteration: 250 / 250 [100%]  (Adaptation)
## Chain 1: Success! Found best value [eta = 0.1].
## Chain 1: 
## Chain 1: Begin stochastic gradient ascent.
## Chain 1:   iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
## Chain 1:    100      -187648.407             1.000            1.000
## Chain 1:    200      -187390.397             0.501            1.000
## Chain 1:    300      -187305.355             0.334            0.001
## Chain 1:    400      -187274.348             0.250            0.001
## Chain 1:    500      -187260.340             0.200            0.000
## Chain 1:    600      -187254.045             0.167            0.000
## Chain 1:    700      -187251.827             0.143            0.000
## Chain 1:    800      -187250.218             0.125            0.000
## Chain 1:    900      -187249.011             0.111            0.000
## Chain 1:   1000      -187248.869             0.100            0.000
## Chain 1:   1100      -187249.032             0.000            0.000
## Chain 1:   1200      -187249.197             0.000            0.000
## Chain 1:   1300      -187248.310             0.000            0.000   MEDIAN ELBO CONVERGED
## Chain 1: 
## Chain 1: Drawing a sample of size 1000 from the approximate posterior... 
## Chain 1: COMPLETED.
summary(model.single)
## $summary
##                      mean se_mean         sd        2.5%        25%       50%
## beta[1]        1.14693159     NaN 0.26021562 0.639396475 0.97224400 1.1501950
## beta[2]        2.04435341     NaN 0.31281740 1.402095250 1.84248250 2.0519200
## beta[3]        3.60590780     NaN 0.44174590 2.757847250 3.31296500 3.5968450
## eta            0.02864907     NaN 0.02631248 0.004744418 0.01284007 0.0206586
## beta_gamma[1]  9.57544647     NaN 4.49177254 3.752121500 6.44646000 8.7228050
## beta_gamma[2]  9.69880656     NaN 4.35458523 3.956789500 6.72456250 8.8275900
## beta_gamma[3] 10.57683452     NaN 4.72950566 4.034471500 7.03883250 9.6363100
## w              0.77253801     NaN 0.19313513 0.286981375 0.66706025 0.8380010
## lp__           0.00000000     NaN 0.00000000 0.000000000 0.00000000 0.0000000
##                      75%       97.5% n_eff      khat
## beta[1]        1.3289725  1.64770500   NaN 0.3937108
## beta[2]        2.2466775  2.62984200   NaN 0.4125598
## beta[3]        3.9013675  4.42382400   NaN 0.3736457
## eta            0.0360505  0.09630858   NaN 0.4431482
## beta_gamma[1] 11.8491750 19.51849750   NaN 0.6288305
## beta_gamma[2] 11.6481750 19.90184250   NaN 0.5780287
## beta_gamma[3] 13.2920750 21.27337250   NaN 0.5136678
## w              0.9248757  0.98358528   NaN 0.3975397
## lp__           0.0000000  0.00000000   NaN 0.4435280
## 
## $c_summary
## , , chains = chain:1
## 
##                stats
## parameter              mean         sd        2.5%        25%       50%
##   beta[1]        1.14693159 0.26021562 0.639396475 0.97224400 1.1501950
##   beta[2]        2.04435341 0.31281740 1.402095250 1.84248250 2.0519200
##   beta[3]        3.60590780 0.44174590 2.757847250 3.31296500 3.5968450
##   eta            0.02864907 0.02631248 0.004744418 0.01284007 0.0206586
##   beta_gamma[1]  9.57544647 4.49177254 3.752121500 6.44646000 8.7228050
##   beta_gamma[2]  9.69880656 4.35458523 3.956789500 6.72456250 8.8275900
##   beta_gamma[3] 10.57683452 4.72950566 4.034471500 7.03883250 9.6363100
##   w              0.77253801 0.19313513 0.286981375 0.66706025 0.8380010
##   lp__           0.00000000 0.00000000 0.000000000 0.00000000 0.0000000
##                stats
## parameter              75%       97.5%
##   beta[1]        1.3289725  1.64770500
##   beta[2]        2.2466775  2.62984200
##   beta[3]        3.9013675  4.42382400
##   eta            0.0360505  0.09630858
##   beta_gamma[1] 11.8491750 19.51849750
##   beta_gamma[2] 11.6481750 19.90184250
##   beta_gamma[3] 13.2920750 21.27337250
##   w              0.9248757  0.98358528
##   lp__           0.0000000  0.00000000

The output includes estimated coefficients that reflect the relationship between gene expression and disease status. For example, a coefficient of 3.16 for gene3 suggests that it likely promotes disease development.

Multiple Correlated Tissue Scenario

For multiple correlated samples, we use a second toy dataset that includes a label column indicating tissue origin:

data(toy2)
head(toy2)
##        gene1       gene2       gene3        x        y disease label
## 1 -0.7400457  0.30878658  0.51980559 5.447189 1.724372       1     1
## 2  0.5030884  0.38848124  0.44233855 1.959015 2.130551       1     1
## 3 -0.8790444 -0.02834531 -0.31232314 9.302594 0.248819       0     1
## 4 -0.8813801 -0.59399920  0.09072223 1.638433 9.264955       1     1
## 5  0.1820235  0.48096339  1.01203654 3.286397 2.392358       1     1
## 6 -1.1870649 -0.05291483 -0.50489724 4.259975 2.055192       1     1

The same dsgd function is used, with the addition of the label_list argument:

x2 <- toy2[, c('gene1', 'gene2', 'gene3', 'x', 'y')]  
y2 <- toy2$disease  
label<- toy2$label

model.multiple <- dsgd(list_y = y2, matrix_x = x2, label_list = label)
## Chain 1: ------------------------------------------------------------
## Chain 1: EXPERIMENTAL ALGORITHM:
## Chain 1:   This procedure has not been thoroughly tested and may be unstable
## Chain 1:   or buggy. The interface is subject to change.
## Chain 1: ------------------------------------------------------------
## Chain 1: 
## Chain 1: 
## Chain 1: 
## Chain 1: Gradient evaluation took 0.011076 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 110.76 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Begin eta adaptation.
## Chain 1: Iteration:   1 / 250 [  0%]  (Adaptation)
## Chain 1: Iteration:  50 / 250 [ 20%]  (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%]  (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%]  (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%]  (Adaptation)
## Chain 1: Iteration: 250 / 250 [100%]  (Adaptation)
## Chain 1: Success! Found best value [eta = 0.1].
## Chain 1: 
## Chain 1: Begin stochastic gradient ascent.
## Chain 1:   iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
## Chain 1:    100       -83579.648             1.000            1.000
## Chain 1:    200       -83448.534             0.501            1.000
## Chain 1:    300       -83404.031             0.334            0.002
## Chain 1:    400       -83386.394             0.251            0.002
## Chain 1:    500       -83377.813             0.200            0.001
## Chain 1:    600       -83374.287             0.167            0.001
## Chain 1:    700       -83372.288             0.143            0.000
## Chain 1:    800       -83371.136             0.125            0.000
## Chain 1:    900       -83370.623             0.111            0.000
## Chain 1:   1000       -83370.001             0.100            0.000
## Chain 1:   1100       -83369.309             0.000            0.000
## Chain 1:   1200       -83369.612             0.000            0.000
## Chain 1:   1300       -83369.861             0.000            0.000
## Chain 1:   1400       -83369.382             0.000            0.000   MEDIAN ELBO CONVERGED
## Chain 1: 
## Chain 1: Drawing a sample of size 1000 from the approximate posterior... 
## Chain 1: COMPLETED.
summary(model.multiple)
## $summary
##                      mean se_mean         sd        2.5%        25%         50%
## beta[1]        0.18689327     NaN 0.18240347 -0.17104610  0.0583527  0.18848000
## beta[2]        0.22577005     NaN 0.15441134 -0.06766604  0.1198502  0.21873450
## beta[3]       -0.19609097     NaN 0.17577472 -0.54253235 -0.3137100 -0.19456800
## eta            0.17204008     NaN 0.09262078  0.05634969  0.1089117  0.15272800
## beta_gamma[1]  9.88473305     NaN 4.66576083  3.64703250  6.7319125  8.75719000
## beta_gamma[2]  9.92663531     NaN 4.73722790  3.65293350  6.7981700  8.83614500
## beta_gamma[3]  9.87966418     NaN 4.46870373  3.95360625  6.6239975  8.97535500
## w              0.77509402     NaN 0.17932053  0.30942570  0.6788353  0.82799200
## U[1]          -0.08179391     NaN 0.25417792 -0.57074945 -0.2542985 -0.08623435
## U[2]          -0.23363364     NaN 0.22951739 -0.64131605 -0.3970305 -0.24196950
## sigma_square   0.96472861     NaN 0.02549667  0.89747315  0.9554245  0.97151150
## rho            0.54023293     NaN 0.28152411  0.04064403  0.3038515  0.56233650
## lp__           0.00000000     NaN 0.00000000  0.00000000  0.0000000  0.00000000
##                       75%      97.5% n_eff      khat
## beta[1]        0.31524825  0.5246449   NaN 0.6866302
## beta[2]        0.32660125  0.5342455   NaN 0.6498638
## beta[3]       -0.07677243  0.1473626   NaN 0.6465643
## eta            0.21288550  0.4025303   NaN 0.6588228
## beta_gamma[1] 12.06897500 21.2718450   NaN 0.6169870
## beta_gamma[2] 12.23427500 21.6006050   NaN 0.7020414
## beta_gamma[3] 12.08697500 21.0877125   NaN 0.6555084
## w              0.91553850  0.9803454   NaN 0.7218508
## U[1]           0.09345088  0.4207357   NaN 0.6712253
## U[2]          -0.08061250  0.2381882   NaN 0.6426643
## sigma_square   0.98156300  0.9920669   NaN 0.6858993
## rho            0.78466650  0.9688958   NaN 0.5402273
## lp__           0.00000000  0.0000000   NaN 0.6880947
## 
## $c_summary
## , , chains = chain:1
## 
##                stats
## parameter              mean         sd        2.5%        25%         50%
##   beta[1]        0.18689327 0.18240347 -0.17104610  0.0583527  0.18848000
##   beta[2]        0.22577005 0.15441134 -0.06766604  0.1198502  0.21873450
##   beta[3]       -0.19609097 0.17577472 -0.54253235 -0.3137100 -0.19456800
##   eta            0.17204008 0.09262078  0.05634969  0.1089117  0.15272800
##   beta_gamma[1]  9.88473305 4.66576083  3.64703250  6.7319125  8.75719000
##   beta_gamma[2]  9.92663531 4.73722790  3.65293350  6.7981700  8.83614500
##   beta_gamma[3]  9.87966418 4.46870373  3.95360625  6.6239975  8.97535500
##   w              0.77509402 0.17932053  0.30942570  0.6788353  0.82799200
##   U[1]          -0.08179391 0.25417792 -0.57074945 -0.2542985 -0.08623435
##   U[2]          -0.23363364 0.22951739 -0.64131605 -0.3970305 -0.24196950
##   sigma_square   0.96472861 0.02549667  0.89747315  0.9554245  0.97151150
##   rho            0.54023293 0.28152411  0.04064403  0.3038515  0.56233650
##   lp__           0.00000000 0.00000000  0.00000000  0.0000000  0.00000000
##                stats
## parameter               75%      97.5%
##   beta[1]        0.31524825  0.5246449
##   beta[2]        0.32660125  0.5342455
##   beta[3]       -0.07677243  0.1473626
##   eta            0.21288550  0.4025303
##   beta_gamma[1] 12.06897500 21.2718450
##   beta_gamma[2] 12.23427500 21.6006050
##   beta_gamma[3] 12.08697500 21.0877125
##   w              0.91553850  0.9803454
##   U[1]           0.09345088  0.4207357
##   U[2]          -0.08061250  0.2381882
##   sigma_square   0.98156300  0.9920669
##   rho            0.78466650  0.9688958
##   lp__           0.00000000  0.0000000

Disease-specific Genes Identification

The function plot_coef visualizes the estimated effect sizes of genes on disease status. By default, the top 30 genes are displayed, but this can be adjusted using the n parameter:

plot_coef(fit = model.single, matrix_x = x)

Gene Interaction

To examine gene–gene interaction effects, specify the genes of interest using the interaction argument. This generates all two-way interactions among the specified genes:

model.interaction <- dsgd(list_y = toy$disease, 
                          matrix_x = toy[, c('gene1', 'gene2', 'gene3', 'x', 'y')], 
                          interaction = c("gene1", "gene2", "gene3"))
## Chain 1: ------------------------------------------------------------
## Chain 1: EXPERIMENTAL ALGORITHM:
## Chain 1:   This procedure has not been thoroughly tested and may be unstable
## Chain 1:   or buggy. The interface is subject to change.
## Chain 1: ------------------------------------------------------------
## Chain 1: 
## Chain 1: 
## Chain 1: 
## Chain 1: Gradient evaluation took 0.032382 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 323.82 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Begin eta adaptation.
## Chain 1: Iteration:   1 / 250 [  0%]  (Adaptation)
## Chain 1: Iteration:  50 / 250 [ 20%]  (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%]  (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%]  (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%]  (Adaptation)
## Chain 1: Iteration: 250 / 250 [100%]  (Adaptation)
## Chain 1: Success! Found best value [eta = 0.1].
## Chain 1: 
## Chain 1: Begin stochastic gradient ascent.
## Chain 1:   iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
## Chain 1:    100      -188300.157             1.000            1.000
## Chain 1:    200      -187639.583             0.502            1.000
## Chain 1:    300      -187409.306             0.335            0.004
## Chain 1:    400      -187303.490             0.251            0.004
## Chain 1:    500      -187274.332             0.201            0.001
## Chain 1:    600      -187264.889             0.168            0.001
## Chain 1:    700      -187261.204             0.144            0.001
## Chain 1:    800      -187259.132             0.126            0.001
## Chain 1:    900      -187258.307             0.112            0.000
## Chain 1:   1000      -187257.306             0.101            0.000
## Chain 1:   1100      -187257.033             0.001            0.000
## Chain 1:   1200      -187257.024             0.000            0.000
## Chain 1:   1300      -187257.233             0.000            0.000
## Chain 1:   1400      -187256.379             0.000            0.000   MEDIAN ELBO CONVERGED
## Chain 1: 
## Chain 1: Drawing a sample of size 1000 from the approximate posterior... 
## Chain 1: COMPLETED.
## Warning: Pareto k diagnostic value is 0.84. Resampling is unreliable.
## Increasing the number of draws or decreasing tol_rel_obj may help.
summary(model.interaction)
## $summary
##                      mean se_mean         sd        2.5%          25%       50%
## beta[1]        1.27415358     NaN 0.32114878  0.61430967  1.077265000 1.2778350
## beta[2]        2.16726443     NaN 0.29348443  1.59699025  1.966682500 2.1738250
## beta[3]        4.07646214     NaN 0.54740646  3.00120150  3.726657500 4.0692550
## beta[4]        0.22470792     NaN 0.34703426 -0.46569468 -0.006714307 0.2225025
## beta[5]        0.34829647     NaN 0.42166120 -0.43676082  0.070075750 0.3461285
## beta[6]        0.75908832     NaN 0.32068193  0.13651760  0.541916750 0.7506000
## eta            0.05642943     NaN 0.03476532  0.01659194  0.032339700 0.0464880
## beta_gamma[1]  9.98324293     NaN 4.39299340  4.01876075  6.819897500 9.2284400
## beta_gamma[2]  9.80257006     NaN 4.18539613  3.86590700  6.805070000 9.0097900
## beta_gamma[3] 10.14502446     NaN 4.71980945  4.02690175  6.896002500 9.1184150
## beta_gamma[4]  9.47573234     NaN 4.42042055  3.52936225  6.218140000 8.6841950
## beta_gamma[5]  9.59881552     NaN 4.72043700  3.23101975  6.190965000 8.6142500
## beta_gamma[6]  9.89954330     NaN 4.59613095  3.79943000  6.720890000 8.8760450
## w              0.87832432     NaN 0.11611932  0.53440915  0.837256250 0.9157430
## lp__           0.00000000     NaN 0.00000000  0.00000000  0.000000000 0.0000000
##                      75%      97.5% n_eff      khat
## beta[1]        1.4956000  1.8706900   NaN 0.8331018
## beta[2]        2.3751875  2.7246907   NaN 0.8100776
## beta[3]        4.4518925  5.1375105   NaN 0.8636810
## beta[4]        0.4642457  0.9077431   NaN 0.8302727
## beta[5]        0.6234523  1.2085175   NaN 0.8066371
## beta[6]        0.9920918  1.3826538   NaN 0.8434977
## eta            0.0704525  0.1530873   NaN 0.8379704
## beta_gamma[1] 12.3323500 20.5485025   NaN 0.7093822
## beta_gamma[2] 12.0749000 19.9275550   NaN 0.8716651
## beta_gamma[3] 12.2957500 22.5033775   NaN 0.8339456
## beta_gamma[4] 11.4274750 21.2797425   NaN 0.8029808
## beta_gamma[5] 11.8661500 20.9298275   NaN 1.1123361
## beta_gamma[6] 12.1556750 21.6436375   NaN 0.8350353
## w              0.9583097  0.9904899   NaN 0.8476742
## lp__           0.0000000  0.0000000   NaN 0.8372063
## 
## $c_summary
## , , chains = chain:1
## 
##                stats
## parameter              mean         sd        2.5%          25%       50%
##   beta[1]        1.27415358 0.32114878  0.61430967  1.077265000 1.2778350
##   beta[2]        2.16726443 0.29348443  1.59699025  1.966682500 2.1738250
##   beta[3]        4.07646214 0.54740646  3.00120150  3.726657500 4.0692550
##   beta[4]        0.22470792 0.34703426 -0.46569468 -0.006714307 0.2225025
##   beta[5]        0.34829647 0.42166120 -0.43676082  0.070075750 0.3461285
##   beta[6]        0.75908832 0.32068193  0.13651760  0.541916750 0.7506000
##   eta            0.05642943 0.03476532  0.01659194  0.032339700 0.0464880
##   beta_gamma[1]  9.98324293 4.39299340  4.01876075  6.819897500 9.2284400
##   beta_gamma[2]  9.80257006 4.18539613  3.86590700  6.805070000 9.0097900
##   beta_gamma[3] 10.14502446 4.71980945  4.02690175  6.896002500 9.1184150
##   beta_gamma[4]  9.47573234 4.42042055  3.52936225  6.218140000 8.6841950
##   beta_gamma[5]  9.59881552 4.72043700  3.23101975  6.190965000 8.6142500
##   beta_gamma[6]  9.89954330 4.59613095  3.79943000  6.720890000 8.8760450
##   w              0.87832432 0.11611932  0.53440915  0.837256250 0.9157430
##   lp__           0.00000000 0.00000000  0.00000000  0.000000000 0.0000000
##                stats
## parameter              75%      97.5%
##   beta[1]        1.4956000  1.8706900
##   beta[2]        2.3751875  2.7246907
##   beta[3]        4.4518925  5.1375105
##   beta[4]        0.4642457  0.9077431
##   beta[5]        0.6234523  1.2085175
##   beta[6]        0.9920918  1.3826538
##   eta            0.0704525  0.1530873
##   beta_gamma[1] 12.3323500 20.5485025
##   beta_gamma[2] 12.0749000 19.9275550
##   beta_gamma[3] 12.2957500 22.5033775
##   beta_gamma[4] 11.4274750 21.2797425
##   beta_gamma[5] 11.8661500 20.9298275
##   beta_gamma[6] 12.1556750 21.6436375
##   w              0.9583097  0.9904899
##   lp__           0.0000000  0.0000000

Visualize the interaction network using plot_network (defaults: label_size = 13,node_color = "#C1D8E7"):

# Load the required packages
library(ggplot2)
library(GGally)
library(network)
library(sna)

plot_network(model.interaction, c("gene1", "gene2", "gene3"))

Missing Data Imputation

Generate a dataset with missingness.

data(toy)
x <- toy[, c('gene1', 'gene2', 'gene3', 'x', 'y')]  
y <- toy$disease  

missing_indices <- sample(length(y), 10)
y[missing_indices] <- NA

coord <- as.matrix(toy[, c("x", "y")])

To handle missing spots, DiSTect offers missing_imputation. The function returns the estimated probability for each missing observation, which can be used to impute values:

imputed_mean <- missing_imputation(y, coord)[missing_indices]
y[missing_indices] <- rbinom(10, 1, imputed_mean) # Impute the missing values by the corresponding estimated mean

Subsequently, you can run the dsgd function on the imputed dataset:

dsgd(list_y = y, matrix_x = x)

Disease Prediction

DiSTect allows prediction of disease status using a trained model and a new dataset. Specifically, the predict function takes as input a fitted model and a dataframe containing gene expression and spatial coordinate information for each observation. The spatial coordinates should be placed in the last two columns of the dataframe.

y_pred <- predict(model.single, toy[, c('gene1', 'gene2', 'gene3', 'x', 'y')])
sum(y_pred == toy$disease) / nrow(toy) # calculate the accuracy
## [1] 0.85