by Christopher Eeles
A Markov model is a stochastic model which describes a sequence of possible events (states) in which the probability of each event depends on a subset of previous events [1]. Such models are useful in analysis of biological sequence data such as those from gene and protein sequencing experimentss [1]. The probability of a state change from one nucleotide or amino acid residue in a sequence can often be derived from experimental observations and used to calculate the probability of finding a specific state (i.e., nucleotide or amino acid residue) some series of steps from the current state [1]. Markov models possess orders, representing the number of previous states considered when determing the probability of the next state [1]. This report will focus on First-Order Markov Chains, in which the probability of a future state depends only on the current state [1]. Included will be hypothetical initial and transition matrices which will be used to generate a transition diagram of the Markov model [2]. These resources will then be used to explain how the distribution of states change as the number of steps (the number of samples) of the Markov chain are increased [2].
Figure 1 displays a transition diagram for Markov chains derived from the hypothetical Markov model created for this report. In this diagram, nodes represent possible states transitions, beginning at ‘START’ and transitioning through states ‘A’, ‘C’, ‘G’ and ‘T’ until for the number of steps selected for a given Chain have completed. The lines represent state transitions betweent two respective nodes, with the probability of a transition represented as the number closest to the terminal arrow. Loops represent the probability of no state change, with the probability of occurence written in the center of the loop. Table 1 shows the transition matrix corresponding to the diagram in Figure 1, with each cell representing the probability of transitioning from the row state to a specific column state.
## START A C G T
## START 0 0.30 0.20 0.20 0.30
## A 0 0.10 0.60 0.20 0.10
## C 0 0.50 0.10 0.10 0.30
## G 0 0.20 0.05 0.05 0.70
## T 0 0.05 0.70 0.20 0.05
Examing Table 1, we find the probability of any transition from one state to another in the cells of this matrix. To predict the frequency of any one state overall in a sequence, the average of each column must be taken:
## START A C G T
## 0.00 0.23 0.33 0.15 0.29
This represents the overall probability of each state occuring for a Markov chain with the Transition Matrix in Table 1. These values correspond to the expected frequency of occurence of each nucleotide in the plots in Figure 2. For the dinucleotide plots, however, things are slightly more complex. The probability of each dinucleotide pair is represented by the probability of transitioning from state 1 to state 2 (i.e., first nucleotide to second nucleotide) and therefore the frequency values correspond to those in transition matrix from Table 1.
## A C G T
## A 0.10 0.60 0.20 0.10
## C 0.50 0.10 0.10 0.30
## G 0.20 0.05 0.05 0.70
## T 0.05 0.70 0.20 0.05
Considering these expected values, it becomes clear that as the sample size (i.e., number of steps in chain) increases, the observed frequency approach those from the frequency distribution of either the nucleotide or dinucleotide distributions displayed above. Extrapolating from these results suggests that as sample size approaches infinity our observed frequency distributions for each Markov chain should approach the true frequency distribution of the population. In the statistical context this result is logical: just as a sampling distribution approaches the population distribution as sample size increases, so too does the observed vs the true frequency distribution of a Markov. Further analysis is required to verify these properties for higher-order Markov models.
[1] School of Biological Sciences and Applied Chemistry. (2019).Topic 6: Markov Chains. Seneca College: Toronto, ON.
[2] School of Biological Sciences and Applied Chemistry. (2019). Lab 6: Markov Chains. Seneca College: Toronto, ON.
library(seqinr)
library(markovchain)
library(igraph)
set.seed(2345) # To ensure the results aren't different every time I run the program
######## Calculate 1st order Markov Model ########
alph<-c('A','C','G','T') # Char vector of alphabet of possible states
firstOrderMat<-matrix(NA,nr=4,nc=4) # Initialize empty matrix with nr rows and nc columns
colnames(firstOrderMat)<-rownames(firstOrderMat)<-alph # Naming rows and columns after the states
# Find faster way to do this!
firstOrderMat[1,]<-c(0.1,0.6,0.2,0.1)
firstOrderMat[2,]<-c(0.5,0.1,0.1,0.3)
firstOrderMat[3,]<-c(0.2,0.05,0.05,0.7)
firstOrderMat[4,]<-c(0.05,0.7,0.2,0.05)
inProb <-c(0.3,0.2,0.2,0.3); names(inProb)<-alph # Initial probabilities; naming initial probabilities
# Plotting transition diagram
alphInit<-c('START','A', 'C','G','T')
transMat <- cbind(matrix(data=c(0,0,0,0),byrow=FALSE,ncol=1,nrow=4), firstOrderMat)
transMat2 <- rbind(c(0,0.3,0.2,0.2,0.3), transMat)
colnames(transMat2)<-rownames(transMat2)<-alphInit
markovTransitionMat <- new("markovchain", states = alphInit, transitionMatrix = transMat2, name = "Amino Acid Transtion Matrix")
plot(markovTransitionMat, curved=TRUE, main="Figure 1 - Transition Diagram", loop.angle=0.5)
# Printing transition matrix
print('Table 1 - Transition Matrix')
print(transMat2)
# Can't just sample() command this time because there are two pieces of information needed to calculate the probability
generateFirstOrderSeq<-function(lengthSeq,alphabet,initialProb,firstOrderMatrix) {
outputSeq<-rep(NA,length=lengthSeq) # Set to be dynamic because this is not yet known, initialize empty vector with seqLength elements
outputSeq[1]<-sample(alphabet,1,prob=initialProb) # Similar to command for the 0th order Markov Model
for( i in 2:length(outputSeq)){ # Want to know what previous nucleotide was because next value depends on it
prevNuc<-outputSeq[i-1] # Store previous value
currentProb<-firstOrderMatrix[prevNuc,] # Get probabilities from last nucleotide
outputSeq[i]<-sample(alphabet,1,prob=currentProb) # Assigns results to the matrix for each i
}
return(outputSeq) # What the function passes back when run
}
#### First Order Markov Model: 10 Samples ####
# Generate a new sequence using our function
firstOrderSeq<-generateFirstOrderSeq(10,alph,inProb,firstOrderMat) # lengthSeq, alphabet, initialProb, firstOrderMartix
# Calculate
firstOrderFreq<-count(firstOrderSeq,1,alphabet=alph,freq=TRUE)
firstOrderFreqDin<-count(firstOrderSeq,2,alphabet=alph,freq=TRUE)
#### First Order Markov Model: 100 Samples ####
# Generate a new sequence using our function
firstOrderSeq2<-generateFirstOrderSeq(100,alph,inProb,firstOrderMat) # lengthSeq, alphabet, initialProb, firstOrderMartix
# Calculate frequencies
firstOrderFreq2<-count(firstOrderSeq2,1,alphabet=alph,freq=TRUE)
firstOrderFreqDin2<-count(firstOrderSeq2,2,alphabet=alph,freq=TRUE)
#### First Order Markov Model: 1000 Samples ####
# Generate a new sequence using our function
firstOrderSeq3<-generateFirstOrderSeq(1000,alph,inProb,firstOrderMat) # lengthSeq, alphabet, initialProb, firstOrderMartix
# Calculate frequencies
firstOrderFreq3<-count(firstOrderSeq3,1,alphabet=alph,freq=TRUE)
firstOrderFreqDin3<-count(firstOrderSeq3,2,alphabet=alph,freq=TRUE)
######## Drawing Plots #########
par(mfrow=c(2,3)) # Now we have 2 x 3 table, so six plots.
## Draw first order nucleotide plots with 10, 100, 1000 samples
# FirstOrder: Nucleotide plot
barplot(firstOrderFreq,col=1:4, main="Nucleotide: 10 Samples", # col specifies respective colors, main specifies title
xlab="Base",ylab="Base proportion") # label axes
# Nucleotide plot
barplot(firstOrderFreq2,col=1:4, main="Nucleotide: 100 Samples", # col specifies respective colors, main specifies title
xlab="Base",ylab="Base proportion") # label axes
# Nucleotide plot
barplot(firstOrderFreq3,col=1:4, main="Nucleotide: 1000 Samples", # col specifies respective colors, main specifies title
xlab="Base",ylab="Base proportion") # label axes
## Draw first order dinucleotide plots with 10, 100, 1000 samples
# FirstOrder: Dinucleotide, won't get a good distribution because there are only 10 items in the sequence
barplot(firstOrderFreqDin,col=rainbow(16), #raindbow(n) pulls n colours from the rainbow palette
main="Dinucelotide: 10 Samples",
xlab="Base",ylab="Base proportion")
# Dinucleotide, 100 samples
barplot(firstOrderFreqDin2,col=rainbow(16), #raindbow(n) pulls n colours from the rainbow palette
main="Dinucleotide: 100 Samples",
xlab="Base",ylab="Base proportion")
# Dinucleotide, 1000 samples
barplot(firstOrderFreqDin3,col=rainbow(16), #raindbow(n) pulls n colours from the rainbow palette
main="Dinucleotide: 1000 Samples",
xlab="Base",ylab="Base proportion")
#### Discussion Calculations ####
averages <- c(mean(transMat2[,1]),mean(transMat2[,2]),mean(transMat2[,3]),mean(transMat2[,4]),mean(transMat2[,5]))
names(averages) <- alphInit
averages
firstOrderMat