An Introduction to rgm

This vignette is designed to guide users through the process of simulating data and then running an experiment with the RGM package. We’ll start by demonstrating how to simulate data, which is a crucial preliminary step. Following this, we’ll showcase how to run the estimation process on the simulated data. The focus will be on interpreting various plots generated from the experiment, providing insights into the performance and accuracy of the RGM package.

Simulating Data

The first step in our analysis is to simulate data, which forms the basis for our subsequent experiment. The RGM package offers functionalities for simulating complex network data, which we will utilize here.

The initial step involves simulating network data using the RGM package. This process sets the stage for our experiment by generating data that mimics real-world complex network structures.

Consider \(B=13\) distinct body sites, each representing a different environment, and \(p=87\) microbes identified as Operational Taxonomic Units (OTUs). For each environment \(k\), where \(k=1,\ldots,B\), let \(\mathbf{Y}^{(k)} = (Y^{(k)}_1, \ldots, Y^{(k)}_p)\) denote the \(p\)-dimensional random vector of OTU abundances. The relationship among these OTUs within each environment is modeled using a graphical model (GM): \[\begin{equation} \mathbf{Y}^{(k)} | G^{(k)} \sim \mathcal{L}_{G^{(k)}}(\boldsymbol{\Omega}^{(k)}), \end{equation}\] where \(G^{(k)}\) represents the conditional independence graph for environment \(k\), and \(\boldsymbol{\Omega}^{(k)}\) denotes the associated parameters.

Moreover, the collection of graphs \(G = \{G^{(k)}\}_k\) across all environments is assumed to follow a random graph model, characterized as: \[\begin{equation} G^{(k)} \sim P(\boldsymbol{\Theta}), \quad k=1,\ldots,B, \end{equation}\] with \(\boldsymbol{\Theta}\) representing the parameter vector for this joint distribution.

# Running the simulation with specific parameters
a <- sim.rgm(p=87, B=13, n=346, mcmc_iter = 100, seed=1234)

Estimation

Once we have our simulated data, the next phase is to apply the RGM model to this data. This step involves estimating the network structure and relationships from the data we generated.


# Fitting the model
res <- rgm(a$data, X=a$X, iter=10000)

This section provides an in-depth analysis of the simulation results using various plots. We will interpret the plots to understand the model’s behavior and accuracy.

\[\begin{equation} \label{eq:latentprobit} P({G_{j_1,j_2}}^{(k)}=1~|~G_{j_1,j_2}^{(-k)}, \Theta, w)= \Phi\Big(\alpha_k+{w_{j_1,j_2}}^t \beta+\mathbf{c}_k^t\sum_{k' \ne k}\mathbf{c}_{k'}1_{\{{G_{j_1,j_2}}^{(k')}=1\}}\Big), \end{equation}\]

Results

We proceed with the diagnostics plots


ps = post_processing_rgm(a,res)

We first observe the convergence of \(\beta\)

ps$beta_convergence

The simulation study results are based on the analysis of the last 2500 MCMC iterations. We compare the true probit probabilities, as derived from Equation above, with those calculated using the mean posterior estimates of parameters \(\boldsymbol{\alpha}\), \(\beta\), and \(\mathbf{c}\).

ps$rgm_recovery

Receiver Operating Characteristic (ROC) curves illustrate the performance of the recovered graphs relative to the true graphs for each of the 13 environments. These curves are generated by varying thresholds on the posterior edge probabilities from Equation (\(\ref{eq:posteriorprob}\)). Different colors represent each of the 13 environments.

ps$roc_plot

The estimation of the \(\alpha\) parameter is a crucial step in our analysis. The following R code chunk executes a procedure to estimate \(\alpha\) using the data obtained from the previous steps of the model. The estimation process leverages advanced statistical techniques to accurately capture the underlying characteristics of the network model, reflecting the sparsity levels in the environment-specific graphs. This step is essential for understanding the overall structure and connectivity patterns within each of the environments.

ps$estimation_of_alpha

In this section, we focus on examining the posterior distribution of the \(\beta\) parameter. The subsequent R code chunk carries out the computation to obtain and visualize this distribution. The posterior distribution provides valuable insights into the variability and uncertainty associated with the \(\beta\) parameter. Analyzing this distribution is pivotal for assessing the robustness of our model and for making reliable inferences about the network interactions influenced by the \(\beta\) parameter.

ps$posterior_distribution