An Introduction to the RandomVerse

Juan Pablo Acuña González

10 octubre 2022

Abstract

The broad philosophy of the RandomVerse is to capture new developments of in silico systems by integrating quantitative methods into biology with an emphasis into molecular biology. The audience appeals to students, researchers, and professionals in the mathematical, statistical, computational and biological sciences in all levels of scholarity from secondary to tertiary education. The RandomVerse is built by moving towards synthesizing information to generate and test new hypotheses about how network biology works and focusing on stochastic models for biochemical reactions using the standard algorithm for simulating these dynamics: ‘the Gillespie algorithm’ (Gillespie 1977), since the kinetics of biological processes at the intra-cellular level keep showing a stochastic behaviour that is better understood by building in silico models (Schneider 2016). This is still an active research area, so the main purpose here is to make it more accessible for transdisciplinary research and to fill in the gap by providing actual materials used, develop training with case studies and stress the construction of models with a discrete stochastic simulation perspective. This article illustrates several case studies of well–mixed stochastic biochemical reaction network models to investigate its behavior, accompanied with working code examples by implementing them in a shiny app (Chang et al. 2022).

1 Introduction and general background

In silico systems biology refers to the modeling in biology at the systems level through synthetic devices. Advances in the study of biological systems rely on the assumed underlying predictive factor that models provide by giving insights into the functionality of the whole as well as the different parts in any system. The systematic approach allows to extend across species to predict behavior of similar systems in humans which in turn can be applied to medical studies and develop health solutions. When models utilize simulations experimentation can be validated with a saving of time and resources.

The construction and analysis of models simplify the description of a system based on experience combined with data. Empirically, models explore what happens at the organism level when the individual parts represented as variables are disturbed. These perturbations lead to new hypothesis about phenomena to justify such discrepancies and so, the output of the initial model becomes the input of a new model and the cycle starts again.

To attack in silico modeling proper computing technology is nedeed. In the last three decades, the R programming language (R Core Team 2022) has become the standard tool in statistics by using a balance of a high-level language with one of the best support for stochastic simulation and statistical analysis. The great advantages of R against other technologies are its completely free open-source software nature and its increased use in bioinformatics and several applications of systems biology (Giorgi, Ceraolo, and Mercatelli 2022).

In the field of stochastic modeling for in silico biology several efforts have been made in the form of R packages. smfmsb (D. Wilkinson 2018) is the main reference for combining modeling, biological applications and R developement. Packages adaptivetau (Johnson 2019) and GillespieSSA (Pineda-Krch and Cannoodt 2022) implement Gillespie’s exact stochastic simulation algorithm (direct method) and several approximate methods. GillespieSSA2 (Cannoodt et al. 2020) and ssar (still at its developmental stage) present improvements over GillespieSSA with faster implementations of Gillespie’s algorithm. The package lnar (Giagos 2010) provides likelihood-based inference of stochastic kinetic models using a Linear Noise Approximation. In regards to the shinyverse, the app hosted at https://edbonneville.shinyapps.io/gillespie_shiny/ allows the user to simulate and explore infectious disease epidemics stochastically with the Gillespie algorithm.

As for the literature, some of the textbooks in stochastic modeling that deal with a strong view concerning biological applications are (D. J. Wilkinson 2020) and (Fuchs 2013). Another very recent reference is (Zobitz 2022) which is written in a very modern style employing the tidyverse (Wickham et al. 2019) and allied libraries.

All of the aforementioned material has been developed towards a quantitative focus, therefore, the RandomVerse proposes to unify previous software into a more qualitative approach in order to give deeper understanding into the biology rather than into the algorithms, and in doing so, providing a better tool for decision making in the areas of drug development and health improvement.

2 Stochastic modeling in molecular biology, biochemistry and genetics

Stochastic models are an indispensable tool for the understanding of complex natural phenomena such as the mechanisms of heredity and variation of living organisms, senescence and the emergence of diseases such as viral infections. Nowadays one knows that these phenomena are based on chemical processes in living organisms and the structures and functions of living cells.

The overlapping areas of molecular biology, biochemistry and genetics comprise an enormous variety of different applications and models. Hence, the RandomVerse focuses on addressing one specific branch of modeling, that is, applications which utilize the framework of biochemical reactions for the representation of selected key processes.

The concept of biological systems as networks of coupled biochemical reactions are sufficiently general that they can be simulated in different ways using different algorithms depending on assumptions made about the underlying kinetics, and are manageable enough to be used directly to construct full dynamic simulations of the system behavior on a computer.

Biochemical reactions are typically specified by reaction equations of the form

$a_1 A_1 + . . . + a_k A_k → b_1 B_1 + . . . + b_l B_l,$

where $r$ is the number of reactants and $p$ is the number of products. $R_i$ represents the $i$th reactant molecule and $P_j$ is the $j$th product molecule. $m_i$ is the number of molecules of $R_i$ consumed in a single reaction step, and $n_j $is the number of molecules of $P_j$ produced in a single reaction step. The coefficients $m_i$ and $n_j$ are known as stoichiometries. The stoichiometries are assumed to be natural numbers with greatest common divisor equal to one.

Reaction equations and their associated mathematical theory are convenient tools in biological phenomena. They are particularly used to describe the natural laws which underlie the essential process of information transfer between information-carrying molecules; the central dogma of molecular biology.

The human body is made up of billions of cells. These cells specialize in different tasks. Yet, all these different kinds of cells come from a single-celled embryo. All the instructions to make different kinds of cells are contained within that single cell and with every division of that cell, those instructions are transmitted to new cells.

These instructions can be coded into a string – a molecule of DNA, a polymer made of recurring units called nucleotides. The four nucleotides in DNA molecules, Adenine, Guanine, Cytosine and Thymine (coded as four letters: A, C, G, and T) in a speciﬁc sequence, store the information for life.

All cells use their hereditary information in the same way most of the time; the DNA is replicated to transfer the information to new cells. If activated, the genes are transcribed into messenger RNAs (mRNAs) in the nucleus (in eukaryotes), followed by mRNAs (if the gene is protein coding) getting translated into proteins in the cytoplasm.

Proteins are essential elements for life. The growth and repair, functioning and structure of all living cells depend on them. This is why the gene is a central concept for information encoding of proteins and other functional molecules.

Although biological modelling can be applied to biological systems at a variety of different scales, it turns out that stochastic effects are particularly important and prevalent at the scale of genetic and biochemical networks.

Auto-regulatory networks are a simplified way to generate very simple models of key processes involved in gene expression and regulation.

Within each cell, there are several thousand types of interacting proteins. Depending on its environment, a cell determines the required amount of each protein by means of transcription networks. Transcription is one out of several regulatory mechanisms in genetic networks and can be described by a set of coupled elementary reactions (D. J. Wilkinson 2020). At a less detailed level, transcription and other key processes can be assembled to construct genetic networks. For example, the following components of a prokaryotic auto-regulatory network are summarised

\[ \begin{aligned} g \rightarrow g + r \,\ (transcription) \\ g + P_2 \leftarrow \rightarrow g · P_2 \,\ (repression) \\ r \rightarrow r + P \,\ (translation) \\ 2P \leftarrow\rightarrow P_2 \,\ (dimerisation) \\ r \rightarrow ∅ \,\ (mRNA \,\ degradation) \\ P \rightarrow ∅ \,\ (protein \,\ degradation). \end{aligned} \]

In these equations, P stands for a protein, P 2 for the compound of two of these proteins, g for a gene and r for a transcript of g. The empty set ∅ indicates that the product of a reaction is not part of the model, and a dot represents the compound of two components.

The above mentioned application is intrinsically stochastic.

3 Basic illustration through the shiny app

3.1 Auto-regulatory genetic network

For this application it is needed to specify as the body of the function the following text input, comma delimited with no blanks in between entries:

$\texttt{th1*x2*x5,th2*x1,th3*x2,th4*x3,th5*0.5*x4*(x4-1),th6*x5,th7*x3,th8*x4}$

Next the arguments of the function are specified as as the following text input:

$\texttt{x1,x2,x3,x4,x5,th1,th2,th3,th4,th5,th6,th7,th8}$

The next input adds the values for the arguments, also as a text input:

$\texttt{0,1,2,10,12,1,10,0.01,10,1,1,0.1,0.01}$

The last inputs refer to the Pre reaction ans Post reaction matrices of the system and their dimensions specifiying the number of reactions (8) and species (3) in the model:

Pre matrix:

$\texttt{0,1,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0}$

Post matrix:

$\texttt{1,0,0,0,0,0,1,0,0,1,0,1,1,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0}$

Thus producing the following representation as a stochastic process:

Auto-regulatory genetik network.

It’s important to add the inputs with the comma delimited format and with no blanks in between and choose an appropriate time horizon in order to not produce invalid probabilities in the model.

3.2 Lotka–Volterra system

The reaction equations between the competing species can be represented as:

\[ \begin{aligned} Y_1 \rightarrow 2Y_1 \\ Y_1+Y_2 \rightarrow 2Y_2 \\ Y_2 \rightarrow ∅. \end{aligned} \] For this example, the text input are as follow. Body of the function:

$\texttt{k1*x1-k2*x1*x2,k2*x1*x2-k3*x2}$

Arguments of the function:

$\texttt{x1,x2,th1,th2,th3}$

Values for the arguments:

$\texttt{50,100,1,0.05,0.6}$

Pre matrix:

$\texttt{1,0,1,1,0,1}$

Post matrix:

$\texttt{2,0,0,2,0,0}$

Reactions:

$\texttt{3}$

Species:

$\texttt{2}$

A realization for this model is:

Lotka–Volterra system.

3.3 Dimerisation kinetics

This problem considerS the dimerisation kinetics of a protein, P, at very low concentrations. The forward and backward reactions, respectively, are:

\[ \begin{aligned} 2P \rightarrow P_2 \\ P_2 \rightarrow P. \\ \end{aligned} \] Body of the function:

$\texttt{th1*x1*(x1-1)/2,th2*x2}$

Arguments of the function:

$\texttt{x1,x2,th1,th2}$

Values for the arguments:

$\texttt{301,0,1.66e-3,0.2}$

Pre matrix:

$\texttt{2,0,0,1}$

Post matrix:

$\texttt{0,1,2,0}$

Reactions:

$\texttt{2}$

Species:

$\texttt{2}$

And the graphical output is:

Dimerisation kinetics.

3.4 Michaelis–Menten enzyme kinetics

Here a substrate S is converted to a product P only in the presence of a catalyst E (for enzyme). A plausible model for this is:

\[ \begin{aligned} S + E \rightarrow SE \\ SE \rightarrow S + E \\ SE \rightarrow P + E. \\ \end{aligned} \] Body of the function:

$\texttt{th1*x1*x2,th2*x3,th3*x3}$

Arguments of the function:

$\texttt{x1,x2,x3,x4,th1,th2,th3}$

Values for the arguments:

$\texttt{301,120,100,10,0.00166,1e-04,0.1}$

Pre matrix:

$\texttt{1,1,0,0,0,0,1,0,0,0,1,0}$

Post matrix:

$\texttt{0,0,1,0,1,1,0,0,0,1,0,1}$

Reactions:

$\texttt{3}$

Species:

$\texttt{4}$

The graphical representation of the model is:

Michaelis–Menten enzyme kineticsk.

Session info

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=es_MX.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_MX.UTF-8        LC_COLLATE=es_MX.UTF-8    
##  [5] LC_MONETARY=es_MX.UTF-8    LC_MESSAGES=es_MX.UTF-8   
##  [7] LC_PAPER=es_MX.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_MX.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.24.0
## 
## loaded via a namespace (and not attached):
##  [1] bookdown_0.29       digest_0.6.29       R6_2.5.1           
##  [4] jsonlite_1.8.0      magrittr_2.0.3      evaluate_0.16      
##  [7] stringi_1.7.8       cachem_1.0.6        rlang_1.0.5        
## [10] cli_3.4.0           rstudioapi_0.14     jquerylib_0.1.4    
## [13] bslib_0.4.0         rmarkdown_2.16      tools_4.2.1        
## [16] stringr_1.4.0       xfun_0.33           yaml_2.3.5         
## [19] fastmap_1.1.0       compiler_4.2.1      BiocManager_1.30.18
## [22] htmltools_0.5.3     knitr_1.40          sass_0.4.2

References

Cannoodt, Robrecht, Wouter Saelens, Louise Deconinck, and Yvan Saeys. 2020. “Dyngen: A Multi-Modal Simulator for Spearheading New Single-Cell Omics Analyses.” bioRxiv, February. https://doi.org/10.1101/2020.02.06.936971.

Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2022. Shiny: Web Application Framework for r. https://CRAN.R-project.org/package=shiny.

Fuchs, Christiane. 2013. Inference for Diffusion Processes: With Applications in Life Sciences. Springer.

Giagos, Vasileios. 2010. “Inference for Auto-Regulatory Genetic Networks Using Diffusion Process Approximations.” PhD thesis.

Gillespie, Daniel T. 1977. “Exact Stochastic Simulation of Coupled Chemical Reactions.” The Journal of Physical Chemistry 81 (25): 2340–61. https://doi.org/10.1021/j100540a008.

Giorgi, Federico M., Carmine Ceraolo, and Daniele Mercatelli. 2022. “The r Language: An Engine for Bioinformatics and Data Science.” Life 12 (5). https://doi.org/10.3390/life12050648.

Johnson, Philip. 2019. Adaptivetau: Tau-Leaping Stochastic Simulation. https://CRAN.R-project.org/package=adaptivetau.

Pineda-Krch, Mario, and Robrecht Cannoodt. 2022. GillespieSSA: Gillespie’s Stochastic Simulation Algorithm (SSA). https://CRAN.R-project.org/package=GillespieSSA.

R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Schneider, Maria Victoria. 2016. In Silico Systems Biology. Humana Press.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wilkinson, Darren. 2018. Smfsb: Stochastic Modelling for Systems Biology. https://CRAN.R-project.org/package=smfsb.

Wilkinson, Darren James. 2020. Stochastic Modelling for Systems Biology. Chapman & Hall/CRC.

Zobitz, John M. 2022. Adding Stochasticity to Parameters | Exploring Modeling with Data and Differential Equations Using r. github. https://jmzobitz.github.io/ModelingWithR/adding-stochasticity-to-parameters.html.