We will start with a very basic probabilistic model. Let’s say we want to simulate a sequence 10 nucleotides long. For this example, nucelotide bases can appear with equal p=0.25 probability at every position. This part is inspired by Chapter 10 in (Krijnen 2009).
sample(c("A","C","T","G"),10,rep=TRUE,prob=c(0.25,0.25,0.25,0.25))
## [1] "T" "G" "A" "A" "G" "G" "T" "A" "G" "T"
If we would like to simulate a GC rich region we could alter the probabilities as follow:
sample(c("A","C","T","G"),10,rep=TRUE,prob=c(0.15,0.35,0.15,0.35))
## [1] "C" "T" "C" "C" "C" "G" "A" "G" "A" "G"
We will simulate two independent parental chromatids and use them to produce a haploid chromosome with recombination events.
paternal<-sample(c("A","C","T","G"),10,rep=TRUE,prob=c(0.25,0.25,0.25,0.25))
maternal<-sample(c("A","C","T","G"),10,rep=TRUE,prob=c(0.25,0.25,0.25,0.25))
paternal
## [1] "T" "G" "G" "A" "G" "T" "C" "A" "A" "A"
maternal
## [1] "C" "A" "A" "C" "C" "C" "C" "G" "T" "C"
Let’s say recombination events occur with a 0.3 probability in any position of our 10 nucleotides genome. We can simulate recombination events in this way.
recombination.event<-sample(c(TRUE,FALSE),10,prob=c(0.3,0.7),rep=TRUE)
recombination.event
## [1] FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
Then we can use those simulated recombination events to switch from an initial paternal or maternal chromatide when generating the child chromatid (as in gamete formation).
# assigning the starting parental genome source
current.genome<-sample(c('paternal','maternal'),1,prob=c(0.5,0.5))
# initializing empty arrays for the child genome and parental path
child<-c();
parent.path<-c();
# this loop will genereate the child genome based on the parents and
# the recombination events calculated previously
for (base.index in 1:length(paternal)) {
if (recombination.event[base.index]){ #switching current parental genome
if(current.genome=='paternal'){
current.genome<-'maternal'
} else {
current.genome<-'paternal'
}
}
if(current.genome=='paternal'){
child[base.index]<-paternal[base.index]
} else {
child[base.index]<-maternal[base.index]
}
parent.path[base.index]<-current.genome
}
the child genome would look like
child
## [1] "T" "G" "A" "A" "G" "C" "C" "G" "T" "A"
and the recorded parental source of the nucleotides will be
parent.path
## [1] "paternal" "paternal" "maternal" "paternal" "paternal" "maternal"
## [7] "maternal" "maternal" "maternal" "paternal"
Krijnen WP. 2009. Applied Statistics for Bioinformatics using R. GNU Free Document License.