Date: 04/12/2022

To: The Head Biologist

From: Md Ashraful Islam Bhuiya

Subject: Identification of the putative site of the origin of replication in the CMV genome by computational analysis.

To suggest the putative regions of origin of replication in the CMV genome, we took the approach to detect tight clusters of complementary palindromes in the CMV genome. CMV genome is 229,354 base pair (bp) long and contains 296 palindromes that are at least 10 bp long. We considered the statistical model, the Homogeneous Poisson process (HPP), as the base model. HPP is a natural model for uniform random scatter. It assumes that events are non-overlapping, occur at a known constant rate, and are independent of each other. According to the HPP model, palindromes are scattered uniformly as well as randomly across the CMV genome. The number of palindromes in any location of the genome is independent of the number of palindromes in a non-overlapping, another location. If we observe the histogram in graph A below, where the number of palindromes in each 5000 bp is plotted, we find that a ~ similar number of palindromes are scattered all over the genome with some heterogeneity. For example, we can see only two locations where there are excess numbers of palindromes, 18 and 12, compared to other regions. Therefore, we assumed that the distribution of palindromes in the CMV genome broadly fits the HPP model, but with some heterogeneity. A small deviation from HPP (for example, a tight cluster of a number of palindromes) might give us our target locations. We analyzed the CMV genome to identify any locations with tight clusters of palindromes by considering the HPP as a baseline.

Finally, with this model, we have detected a number of tight clusters of palindromes at the locations from 87717 bp - 97488 bp, and 189810 bp - 198709 bp in the CMV genomes. The tightest clusters we have detected are from 90251 - 94174 bp. Anyone or a group of these clusters has significant potential to possess the origin of replication.

cmv<-read.csv("http://www.uwyo.edu/buerkle/compbio/statlabs/data/hcmv.data.txt")$location

par(mfrow=c(2,1), mar= c(5, 6, 3.5, 0))

 hist(cmv, breaks=seq(0,232000, by=5000), # creating a histogram, where each bar shows the counts of palindromes in every 5000 bp.
                 labels=TRUE, 
                 main = "Graph A: Distribution of palindromes in each 5000 bp in CMV genome", 
                 cex.main= 3, 
                 cex.axis=2.5, 
                 xlab = "Locations in the CMV genome", 
                 ylab = "Counts of the palindromes",
                 cex.lab=2.5
                 )

abline(h = qpois((0.95), lambda=5000*296/229354), col= "purple", lwd=3) #Add a horizontal line to the histogram that shows 95% quantile using the quantile function of the Poisson distribution qpois().




slide.window<-function(seqdata, total.length, blocksize=500, incr=100){  # creating a function for the sliding window, where block size can be specified and each block moves a given number of bp. 
  block.right<-seq(blocksize,total.length, by=incr)

  block.left<-block.right-blocksize+1

  block.mid<-(block.right+block.left)/2

  nblocks<-length(block.right)

  count<-numeric(nblocks)

  rate<-numeric(nblocks)

  

  for(i in 1:nblocks) {

    focal<-seqdata[seqdata>block.left[i] & seqdata < block.right[i]] # assigning the locations from seqdata (CMV data in our case) when locations are greater than left block and smaller than the right block. 

    count[i]<-length(focal) # total number of locations fall inside ith block. 

    rate[i]<-count[i] / blocksize

    # alternatively

    #count[i] <- sum(seqdata>block.left[i] & seqdata < block.right[i])

  }

  data.frame(count, rate, block.left, block.mid, block.right) # Creating a dataframe including the variables mentioned here. 

}

plot(rate ~ block.mid, data=slide.window(cmv, 230000, blocksize=5000, incr=250), # Plotting rate vs the value of calculated middle block, when block size is 5000 bp and each block move only 250 bp.   
     type="l", 
     col= "black",
     lwd = 2,
     main = "Graph B: Sliding window with 250 bp increment",
     cex.main= 3, 
     cex.axis=2.5, 
     xlab = "Locations in the CMV genome", 
     ylab = "Rates (count/blocksize)",
     cex.lab=2.5)

lines(rate ~ block.mid, data=slide.window(cmv, 230000, blocksize=9000, incr=250), # creating a line plot over the previous plot with different color when block size is 9000 bp and each block move only 250 bp.
      col="purple",
      lwd = 2)
      
lines(rate ~ block.mid, data=slide.window(cmv, 230000, blocksize=18000, incr=250), 
      col="orange",
      lwd = 2)

lines(rate ~ block.mid, data=slide.window(cmv, 230000, blocksize=36000, incr=250), 
      col="red",
      lwd=2)

In graph A, a total number of palindromes in every 5000 bp is plotted. We see that mostly 3 to 10 palindromes in each 5000 bp are spread all over the genome. Only two locations we found that contain 18 and 12 palindromes. The purple line in the graph shows 95% quantile. From this line, we understand that there is only a 5% chance that we get 12 or more palindromes in each 5000 bp. Graph B is a sliding window that shows the rate of palindromes in 5000 bp block when we move blocks 250 bp. From the sliding window, we find a similar number of counts as shown in graph A when the block size is 5000 bp. We also found that if we count palindromes every 2000 to 10000 bp or produce a sliding window with a 2000 to 10000 bp block size, the distribution of palindromes all over the genome is similar. The sliding window also shows that larger block sizes dilute the signal of palindrome count significantly as we can see from the purple, orange, and red lines with 9000bp, 18000 bp, and 36000 bp block sizes respectively in graph B (especially with 18000 and 36000 bp). Because increasing block size increases the probability of repeated count of the same palindromes in blocks each time. By comparing graph A and the sliding window (graph B), we showed that the count of the number of palindromes in each 5000 bp is accurate and robust as we move blocks only 250 bp. We did not lose any significant amount of palindrome from the counting due to the block size we considered. Therefore, the locations from graph A, where we find the cluster of 18 and 12 palindromes are our potential candidates for the sites of origin of replication, as we supposed that they might be more tightly clustered together compare to other regions.


par(mfrow=c(1,2), mar=c(5,5.5,5.5,4.5))
plot(cmv[1:(296-14)], diff(cmv, lag=14), # plotting the locations of palindromes in X- axis and difference between 1st and 15th palindromes of each consecutive/ overlapping set of 15 palindromes using lag value inside diff() function. 
     main = "Graph C: Locations and distances between two terminal palindromes
     of all sets of 15  palidromes",
     cex.main=2.5,
     cex.lab = 2.5,
     cex.axis = 2.5,
     xlab="Locations in the CMV genome", 
     ylab="Observed distance (bp)", 
     ylim=c(0,25000),
     lwd=3)
abline(h=qgamma(c(0.02, 0.0002),rate =296/229354, shape=14 ), # Add two horizontal lines on the plot for 2% and 0.02% quantile using the quantile function of the gamma distribution.
       col=c("purple", "blue"), 
       lwd=2)
plot(dgamma(0:25000, rate =296/229354, shape=14), 0:25000, # Density plot of the gamma distribution of the distances between two terminal palindromes of sets of 15 palindromes. 
     main = "Graph D: Density plot of the gamma distribution",
     cex.main=3,
     cex.lab = 2.5,
     cex.axis = 2.5,
     type="l", 
     ylab="", 
     xlab="Density of the gamma",
     lwd=3)
abline(h=qgamma( c(0.02, 0.0002),rate =296/229354, shape=14 ), 
       col=c("purple","blue"), 
       lwd=2)
mtext(c(0.02, 0.0002), # Add text to the right hand side of horizontal lines. 
      side=4, 
      at=qgamma(c(0.02, 0.0002),
      rate =296/229354, shape=14 ), 
      col=c("purple","blue"), 
      las=1, 
      cex=2)

In graph C we observe the distance (y-axis) between 1st and 15th palindromes of each set of 15 palindromes in different locations (x-axis) of the genome. We found that the differences between the two terminal palindromes of most of the sets are from ~6000 to ~ 17000 bp. There are only two locations where distances are ~6000 bp or less (below the purple line in graph C). I analyzed sets of the different numbers of palindromes (sets of 10 to 18 palindromes). Each time I found a similar pattern of distribution of variable distances between two terminal palindromes of each set. However, analysis with the set of 15 palindromes resulted in a decent number of locations where distances between two terminal palindromes are highly (probability <0.02) or extremely (probability <0.0002) improbable. Therefore, I used the set of 15 palindromes (lag=14) in this analysis. Graph D is the density plot of the gamma distribution. In both graphs, purple and blue lines are showing 2% and 0.02% quantile (extremely small distances) of the gamma distribution. This indicates that there is an extremely low probability (<0.02) of the occurrence of 15 palindromes in these small distances by chance. If we compare the locations of these two regions with graph A, we find that they fall inside the similar regions we detected in graph A with 95% quantile. These seemingly tight clusters (sets of 15) of palindromes might have some significant biological roles, for example, may possess the origin of replication.

## save indexes of improbably small distances
myprecious.indexes<-which(pgamma(diff(cmv, lag=14), rate =296/229354, shape=14) < 0.02) # Creating a vector that contain indexes of the locations of 1st palindromes of sets of 15 palindromes, of which, the probability of distances between 1st and 15th palindrome is less than 0.02 using the cumulative density of function, pgamma(), of the gamma distribution.   
mystrictprecious.indexes<-which(pgamma(diff(cmv, lag=14), rate =296/229354, shape=14) < 0.0002)

par(mar=c(5, 5, 2,2))
plot(cmv[1:(296-14)], 1:(296-14), type='n', # Creating an empty plot. 
     ylab="Sets of 15 palindromes", 
     xlab="Locations in the CMV genome", 
     main = "Graph E: Distances beween the two terminal palindromes of each set of 15 palindomes at different positions in CMV genome",
     cex.main= 2.2,
     cex.lab= 2.5,
     cex.axis = 2.5,
     lwd=3)
segments(cmv[1:(296-14)], 1:(296-14),  # Add distances between two terminal palindromes of each set of 15 palindromes as line segments.
         cmv[(1+14):296], 1:(296-14))
segments(cmv[myprecious.indexes], myprecious.indexes, # Add distances between two terminal palindromes of sets of palindromes indexed by myprecious as line segments with red color. This segments override the previous segments at the same locations. 
         cmv[myprecious.indexes+14], myprecious.indexes, col="red", lwd = 3) 

segments(cmv[mystrictprecious.indexes], mystrictprecious.indexes, 
         cmv[mystrictprecious.indexes+14], mystrictprecious.indexes, col= adjustcolor( "yellow", alpha.f = 0.7), lwd = 3)

To visualize distances as shown in graph C more clearly, we developed Graph E. It shows the distances between two terminal palindromes of sets of 15 palindromes by line segments. Here, each line indicates a set of 15 palindromes and the length of the line indicates the distance between the 1st and 15th palindromes of that set. From the x-axis, we can also understand the position of each set of palindromes in the CMV genome. We can visualize that there are two places in the genome (colored with both yellow and red) where lengths of sets of palindromes are significantly less (probability is less than 0.02) compared to the rest of the genome as also indicated in graph C. As I discussed before, with this extremely low probability (<0.02 (red+yellow regions) or <0.0002 (only yellow regions)) this is highly unlikely that these palindromes are in these positions of the genome just by chance. Based on these observations, primarily I made two recommendations below to start experimentally searching for the origin of replication.

Final recommendations

I would like to make two alternate recommendations. You may start with any one of these based on your overall goal, available time, and resources.

1. If you want to start analysis focusing on broader regions (with probability < 0.02), marked with red and yellow colors in the graph E.

library(kableExtra)

index_cmv1<- which(pgamma(diff(cmv, lag=14), rate =296/229354, shape=14) < 0.02) # Same as myprecious.indexes

broad_regions <- data.frame(Index=NA,  # Creating a dataframe called broad_regions.
                            Start=NA,
                            End=NA,
                            Distance=NA)

for (i in 1:length(index_cmv1)) {
  loc = index_cmv1[i]
  
  broad_regions[i,1] <- loc    # adding indexes to the first column of the dataframe. 
  broad_regions[i,2] <- cmv[loc] # adding starting locations of sets of 15 palindromes to the second column of the dataframe. 
  broad_regions[i,3] <- cmv[loc+14] # adding end locations of sets of 15 palindromes to the third column of the dataframe. 
  broad_regions[i,4] <- cmv[loc+14] - cmv[loc] # # adding distances between two terminals palindromes of sets 15 palindromes in the fourth column of the dataframe.  
  
}
tbl1<- kable(broad_regions, caption = "Table 1: locations of two terminal palindromes and distances between them of sets of 15 palindromes in the CMV genome (probability < 0.02)")
column_spec(tbl1, 1:4, width = "1in") # creating a table with a caption and defined width of columns. 
Table 1: locations of two terminal palindromes and distances between them of sets of 15 palindromes in the CMV genome (probability < 0.02)
Index Start End Distance
106 87717 92783 5066
107 88803 92859 4056
108 89586 93110 3524
109 90251 93250 2999
110 90763 93511 2748
111 91490 93601 2111
112 91637 94174 2537
113 91953 95975 4022
114 92526 97488 4962
243 189810 195262 5452
244 190918 195835 4917
248 192527 198195 5668
249 193447 198709 5262
NA

In the table above, each row shows the 1st (Start) and 15th (End) positions of a set of 15 palindromes and the distance (Distance) spanned by those palindromes. From the table, we find that the maximum distance a set of 15 palindromes span is ~5668 bp. The probability of these distances is < 0.02, which is highly improbable, which indicates that compared to other distances with a probability > 0.02, these distances are significantly smaller. So, we can say that 15 palindromes in a set are tightly packed together in these positions. And this is highly unlikely that they packed together by chance (with the probability of 0.02). This shows a clear deviation from the homogeneous Poisson process. Therefore, any one of these sets or a group of these sets of 15 palindromes may possess the site of the origin of replication. Based on the consecutive sets of palindromes (see Index in table 1), I would suggest starting the analysis focusing on these two regions, 87717 bp - 97488 bp, and 189810 bp - 198709 bp (table 1).

2. If you want to start the analysis by focusing on regions with extremely improbable distances (probability < 0.0002), marked with only yellow in the graph E.


index_cmv2<- which(pgamma(diff(cmv, lag=14), rate =296/229354, shape=14) < 0.0002) # same as above

smaller_regions <- data.frame(Index=NA,   #same as above
                            Start=NA,
                            End=NA,
                            Distance=NA)

for (i in 1:length(index_cmv2)) {
  loc = index_cmv2[i]
  
  smaller_regions[i,1] <- loc               #same as above.
  smaller_regions[i,2] <- cmv[loc]
  smaller_regions[i,3] <- cmv[loc+14]
  smaller_regions[i,4] <- cmv[loc+14] - cmv[loc]
}
tbl2 <- kable(smaller_regions, caption = "Table 2: locations of two terminal palindromes and distances between them of sets of 15 palindromes in the CMV genome (probability < 0.0002)")
column_spec(tbl2, 1:4, width = "1in")
Table 2: locations of two terminal palindromes and distances between them of sets of 15 palindromes in the CMV genome (probability < 0.0002)
Index Start End Distance
109 90251 93250 2999
110 90763 93511 2748
111 91490 93601 2111
112 91637 94174 2537
NA

If you want to start searching for the origin of replication within smaller regions, you can start with these regions as indicated in table 2. The exact locations of four sets of extremely packed 15 palindromes are given here. The maximum distance between two terminal palindromes of a set is 2999 bp and the minimum is only 2111 bp, which is extremely unlikely (with probability < 0.0002) that the palindromes in a set are very tightly clustered together just by chance. These extremely small distances also show a highly significant deviation from the Poisson process. Therefore, region from 90251 bp to 94174 bp has the highest probability to have the site of the origin of replication. So, it is worth starting to search with these regions for the origin of replication in the laboratory and if necessary, explore the other regions as mentioned in table 1.

---
output:
  html_notebook:
    code_folding: hide
  html_document:
    df_print: paged
---


#### Date: 04/12/2022

#### To:    The Head Biologist 

#### From:  Md Ashraful Islam Bhuiya

### _***Subject:*** Identification of the putative site of the origin of replication in the CMV genome by computational analysis._

To suggest the putative regions of origin of replication in the CMV genome, we took the approach to detect tight clusters of complementary palindromes in the CMV genome. CMV genome is 229,354 base pair (bp) long and contains 296 palindromes that are at least 10 bp long. We considered the statistical model, the Homogeneous Poisson process (HPP), as the base model.  HPP is a natural model for uniform random scatter. It assumes that events are non-overlapping, occur at a known constant rate, and are independent of each other. According to the HPP model, palindromes are scattered uniformly as well as randomly across the CMV genome. The number of palindromes in any location of the genome is independent of the number of palindromes in a non-overlapping, another location. If we observe the histogram in graph A below, where the number of palindromes in each 5000 bp is plotted, we find that a ~ similar number of palindromes are scattered all over the genome with some heterogeneity. For example, we can see only two locations where there are excess numbers of palindromes, 18 and 12, compared to other regions. Therefore, we assumed that the distribution of palindromes in the CMV genome broadly fits the HPP model, but with some heterogeneity. A small deviation from HPP (for example, a tight cluster of a number of palindromes) might give us our target locations. We analyzed the CMV genome to identify any locations with tight clusters of palindromes by considering the HPP as a baseline.  

Finally, with this model, we have detected a number of tight clusters of palindromes at the locations from 87717 bp - 97488 bp, and 189810 bp - 198709 bp in the CMV genomes. The tightest clusters we have detected are from 90251 - 94174 bp. Anyone or a group of these clusters has significant potential to possess the origin of replication. 

```{r, fig.height=5}
cmv<-read.csv("http://www.uwyo.edu/buerkle/compbio/statlabs/data/hcmv.data.txt")$location

par(mfrow=c(2,1), mar= c(5, 6, 3.5, 0))

 hist(cmv, breaks=seq(0,232000, by=5000), # creating a histogram, where each bar shows the counts of palindromes in every 5000 bp.
                 labels=TRUE, 
                 main = "Graph A: Distribution of palindromes in each 5000 bp in CMV genome", 
                 cex.main= 3, 
                 cex.axis=2.5, 
                 xlab = "Locations in the CMV genome", 
                 ylab = "Counts of the palindromes",
                 cex.lab=2.5
                 )

abline(h = qpois((0.95), lambda=5000*296/229354), col= "purple", lwd=3) #Add a horizontal line to the histogram that shows 95% quantile using the quantile function of the Poisson distribution qpois().




slide.window<-function(seqdata, total.length, blocksize=500, incr=100){  # creating a function for the sliding window, where block size can be specified and each block moves a given number of bp. 
  block.right<-seq(blocksize,total.length, by=incr)

  block.left<-block.right-blocksize+1

  block.mid<-(block.right+block.left)/2

  nblocks<-length(block.right)

  count<-numeric(nblocks)

  rate<-numeric(nblocks)

  

  for(i in 1:nblocks) {

    focal<-seqdata[seqdata>block.left[i] & seqdata < block.right[i]] # assigning the locations from seqdata (CMV data in our case) when locations are greater than left block and smaller than the right block. 

    count[i]<-length(focal) # total number of locations fall inside ith block. 

    rate[i]<-count[i] / blocksize

    # alternatively

    #count[i] <- sum(seqdata>block.left[i] & seqdata < block.right[i])

  }

  data.frame(count, rate, block.left, block.mid, block.right) # Creating a dataframe including the variables mentioned here. 

}

plot(rate ~ block.mid, data=slide.window(cmv, 230000, blocksize=5000, incr=250), # Plotting rate vs the value of calculated middle block, when block size is 5000 bp and each block move only 250 bp.   
     type="l", 
     col= "black",
     lwd = 2,
     main = "Graph B: Sliding window with 250 bp increment",
     cex.main= 3, 
     cex.axis=2.5, 
     xlab = "Locations in the CMV genome", 
     ylab = "Rates (count/blocksize)",
     cex.lab=2.5)

lines(rate ~ block.mid, data=slide.window(cmv, 230000, blocksize=9000, incr=250), # creating a line plot over the previous plot with different color when block size is 9000 bp and each block move only 250 bp.
      col="purple",
      lwd = 2)
      
lines(rate ~ block.mid, data=slide.window(cmv, 230000, blocksize=18000, incr=250), 
      col="orange",
      lwd = 2)

lines(rate ~ block.mid, data=slide.window(cmv, 230000, blocksize=36000, incr=250), 
      col="red",
      lwd=2)

```
In graph A, a total number of palindromes in every 5000 bp is plotted. We see that mostly 3 to 10 palindromes in each 5000 bp are spread all over the genome. Only two locations we found that contain 18 and 12 palindromes. The purple line in the graph shows 95% quantile. From this line, we understand that there is only a 5% chance that we get 12 or more palindromes in each 5000 bp. Graph B is a sliding window that shows the rate of palindromes in 5000 bp block when we move blocks 250 bp. From the sliding window, we find a similar number of counts as shown in graph A when the block size is 5000 bp. We also found that if we count palindromes every 2000 to 10000 bp or produce a sliding window with a 2000 to 10000 bp block size, the distribution of palindromes all over the genome is similar. The sliding window also shows that larger block sizes dilute the signal of palindrome count significantly as we can see from the purple, orange, and red lines with 9000bp, 18000 bp, and 36000 bp block sizes respectively in graph B (especially with 18000 and 36000 bp). Because increasing block size increases the probability of repeated count of the same palindromes in blocks each time. By comparing graph A and the sliding window (graph B), we showed that the count of the number of palindromes in each 5000 bp is accurate and robust as we move blocks only 250 bp. We did not lose any significant amount of palindrome from the counting due to the block size we considered. Therefore, the locations from graph A, where we find the cluster of 18 and 12 palindromes are our potential candidates for the sites of origin of replication, as we supposed that they might be more tightly clustered together compare to other regions.         
       


```{r, fig.height=5, fig.width=10 }

par(mfrow=c(1,2), mar=c(5,5.5,5.5,4.5))
plot(cmv[1:(296-14)], diff(cmv, lag=14), # plotting the locations of palindromes in X- axis and difference between 1st and 15th palindromes of each consecutive/ overlapping set of 15 palindromes using lag value inside diff() function. 
     main = "Graph C: Locations and distances between two terminal palindromes
     of all sets of 15  palidromes",
     cex.main=2.5,
     cex.lab = 2.5,
     cex.axis = 2.5,
     xlab="Locations in the CMV genome", 
     ylab="Observed distance (bp)", 
     ylim=c(0,25000),
     lwd=3)
abline(h=qgamma(c(0.02, 0.0002),rate =296/229354, shape=14 ), # Add two horizontal lines on the plot for 2% and 0.02% quantile using the quantile function of the gamma distribution.
       col=c("purple", "blue"), 
       lwd=2)
plot(dgamma(0:25000, rate =296/229354, shape=14), 0:25000, # Density plot of the gamma distribution of the distances between two terminal palindromes of sets of 15 palindromes. 
     main = "Graph D: Density plot of the gamma distribution",
     cex.main=3,
     cex.lab = 2.5,
     cex.axis = 2.5,
     type="l", 
     ylab="", 
     xlab="Density of the gamma",
     lwd=3)
abline(h=qgamma( c(0.02, 0.0002),rate =296/229354, shape=14 ), 
       col=c("purple","blue"), 
       lwd=2)
mtext(c(0.02, 0.0002), # Add text to the right hand side of horizontal lines. 
      side=4, 
      at=qgamma(c(0.02, 0.0002),
      rate =296/229354, shape=14 ), 
      col=c("purple","blue"), 
      las=1, 
      cex=2)

```
In graph C we observe the distance (y-axis) between 1st and 15th palindromes of each set of 15 palindromes in different locations (x-axis) of the genome. We found that the differences between the two terminal palindromes of most of the sets are from ~6000 to ~ 17000 bp. There are only two locations where distances are ~6000 bp or less (below the purple line in graph C). I analyzed sets of the different numbers of palindromes (sets of 10 to 18 palindromes). Each time I found a similar pattern of distribution of variable distances between two terminal palindromes of each set. However, analysis with the set of 15 palindromes resulted in a decent number of locations where distances between two terminal palindromes are highly (probability <0.02) or extremely (probability <0.0002) improbable. Therefore, I used the set of 15 palindromes (lag=14) in this analysis. Graph D is the density plot of the gamma distribution. In both graphs, purple and blue lines are showing 2% and 0.02% quantile (extremely small distances) of the gamma distribution. This indicates that there is an extremely low probability (<0.02) of the occurrence of 15 palindromes in these small distances by chance. If we compare the locations of these two regions with graph A, we find that they fall inside the similar regions we detected in graph A with 95% quantile. These seemingly tight clusters (sets of 15) of palindromes might have some significant biological roles, for example, may possess the origin of replication.  



```{r, fig.height=5}
## save indexes of improbably small distances
myprecious.indexes<-which(pgamma(diff(cmv, lag=14), rate =296/229354, shape=14) < 0.02) # Creating a vector that contain indexes of the locations of 1st palindromes of sets of 15 palindromes, of which, the probability of distances between 1st and 15th palindrome is less than 0.02 using the cumulative density of function, pgamma(), of the gamma distribution.   
mystrictprecious.indexes<-which(pgamma(diff(cmv, lag=14), rate =296/229354, shape=14) < 0.0002)

par(mar=c(5, 5, 2,2))
plot(cmv[1:(296-14)], 1:(296-14), type='n', # Creating an empty plot. 
     ylab="Sets of 15 palindromes", 
     xlab="Locations in the CMV genome", 
     main = "Graph E: Distances beween the two terminal palindromes of each set of 15 palindomes at different positions in CMV genome",
     cex.main= 2.2,
     cex.lab= 2.5,
     cex.axis = 2.5,
     lwd=3)
segments(cmv[1:(296-14)], 1:(296-14),  # Add distances between two terminal palindromes of each set of 15 palindromes as line segments.
         cmv[(1+14):296], 1:(296-14))
segments(cmv[myprecious.indexes], myprecious.indexes, # Add distances between two terminal palindromes of sets of palindromes indexed by myprecious as line segments with red color. This segments override the previous segments at the same locations. 
         cmv[myprecious.indexes+14], myprecious.indexes, col="red", lwd = 3) 

segments(cmv[mystrictprecious.indexes], mystrictprecious.indexes, 
         cmv[mystrictprecious.indexes+14], mystrictprecious.indexes, col= adjustcolor( "yellow", alpha.f = 0.7), lwd = 3)
```
To visualize distances as shown in graph C more clearly, we developed Graph E. It shows the distances between two terminal palindromes of sets of 15 palindromes by line segments. Here, each line indicates a set of 15 palindromes and the length of the line indicates the distance between the 1st and 15th palindromes of that set. From the x-axis, we can also understand the position of each set of palindromes in the CMV genome. We can visualize that there are two places in the genome (colored with both yellow and red) where lengths of sets of palindromes are significantly less (probability is less than 0.02) compared to the rest of the genome as also indicated in graph C. As I discussed before, with this extremely low probability (<0.02 (red+yellow regions)  or <0.0002 (only yellow regions)) this is highly unlikely that these palindromes are in these positions of the genome just by chance. Based on these observations, primarily I made two recommendations below to start experimentally searching for the origin of replication. 

## *Final recommendations*
I would like to make two alternate recommendations. You may start with any one of these based on your overall goal, available time, and resources. 

### 1. If you want to start analysis focusing on broader regions (with probability < 0.02), marked with red and yellow colors in the graph E.
```{r}
library(kableExtra)

index_cmv1<- which(pgamma(diff(cmv, lag=14), rate =296/229354, shape=14) < 0.02) # Same as myprecious.indexes

broad_regions <- data.frame(Index=NA,  # Creating a dataframe called broad_regions.
                            Start=NA,
                            End=NA,
                            Distance=NA)

for (i in 1:length(index_cmv1)) {
  loc = index_cmv1[i]
  
  broad_regions[i,1] <- loc    # adding indexes to the first column of the dataframe. 
  broad_regions[i,2] <- cmv[loc] # adding starting locations of sets of 15 palindromes to the second column of the dataframe. 
  broad_regions[i,3] <- cmv[loc+14] # adding end locations of sets of 15 palindromes to the third column of the dataframe. 
  broad_regions[i,4] <- cmv[loc+14] - cmv[loc] # # adding distances between two terminals palindromes of sets 15 palindromes in the fourth column of the dataframe.  
  
}
tbl1<- kable(broad_regions, caption = "Table 1: locations of two terminal palindromes and distances between them of sets of 15 palindromes in the CMV genome (probability < 0.02)")
column_spec(tbl1, 1:4, width = "1in") # creating a table with a caption and defined width of columns. 

```
In the table above, each row shows the 1st (Start) and 15th (End) positions of a set of 15 palindromes and the distance (Distance) spanned by those palindromes. From the table, we find that the maximum distance a set of 15 palindromes span is ~5668 bp. The probability of these distances is < 0.02, which is highly improbable, which indicates that compared to other distances with a probability > 0.02, these distances are significantly smaller. So, we can say that 15 palindromes in a set are tightly packed together in these positions. And this is highly unlikely that they packed together by chance (with the probability of 0.02). This shows a clear deviation from the homogeneous Poisson process. Therefore, any one of these sets or a group of these sets of 15 palindromes may possess the site of the origin of replication. Based on the consecutive sets of palindromes (see Index in table 1), I would suggest starting the analysis focusing on these two regions, 87717 bp - 97488 bp, and 189810 bp - 198709 bp (table 1).   

### 2.  If you want to start the analysis by focusing on regions with extremely improbable distances (probability < 0.0002), marked with only yellow in the graph E.  
```{r}

index_cmv2<- which(pgamma(diff(cmv, lag=14), rate =296/229354, shape=14) < 0.0002) # same as above

smaller_regions <- data.frame(Index=NA,   #same as above
                            Start=NA,
                            End=NA,
                            Distance=NA)

for (i in 1:length(index_cmv2)) {
  loc = index_cmv2[i]
  
  smaller_regions[i,1] <- loc               #same as above.
  smaller_regions[i,2] <- cmv[loc]
  smaller_regions[i,3] <- cmv[loc+14]
  smaller_regions[i,4] <- cmv[loc+14] - cmv[loc]
}
tbl2 <- kable(smaller_regions, caption = "Table 2: locations of two terminal palindromes and distances between them of sets of 15 palindromes in the CMV genome (probability < 0.0002)")
column_spec(tbl2, 1:4, width = "1in")

```
If you want to start searching for the origin of replication within smaller regions, you can start with these regions as indicated in table 2. The exact locations of four sets of extremely packed 15 palindromes are given here. The maximum distance between two terminal palindromes of a set is  2999 bp and the minimum is only 2111 bp, which is extremely unlikely (with probability < 0.0002) that the palindromes in a set are very tightly clustered together just by chance. These extremely small distances also show a highly significant deviation from the Poisson process. Therefore, region from 90251 bp to 94174 bp has the highest probability to have the site of the origin of replication. So, it is worth starting to search with these regions for the origin of replication in the laboratory and if necessary, explore the other regions as mentioned in table 1.   





