Number of ORFs by sliding window example

I made this example in response to a friend's request to emulate the plots here, with the genome position on the \( X \) axis and the number of ORFs on the \( Y \) axis.

#### Basic data of #ORFs by window

## Sliding windows of size 100, shifted by 1/4 of the window size (25)
set.seed(101)
df <- data.frame(
    start=seq(1, 9901, by=25),
    end=seq(100, 10000, by=25),
    nORFs=round(runif(397, max = 200))
)
head(df)
##   start end nORFs
## 1     1 100    74
## 2    26 125     9
## 3    51 150   142
## 4    76 175   132
## 5   101 200    50
## 6   126 225    60
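
The start and end coordinates above follow from just three numbers: the genome length, the window size, and the step. As a minimal sketch, assuming a genome of length 10000 as in this example, the same layout can be generated from those values. `make_windows()` and its argument names are hypothetical, introduced here only for illustration.

```r
## Hypothetical helper: sliding windows over a genome of a given length.
## The step defaults to 1/4 of the window size, as in the example.
make_windows <- function(genome_length, window_size=100, step=window_size / 4) {
    ## The last window has to end at or before genome_length
    start <- seq(1, genome_length - window_size + 1, by=step)
    data.frame(start=start, end=start + window_size - 1)
}

## Same coordinates as the df used in this post
windows <- make_windows(10000)
head(windows)
```

Changing `window_size` or `step` then gives a different amount of overlap without having to recompute the `seq()` end points by hand.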

#### Summarize information for making the plot

## Start by saving the positions of interest
data <- data.frame(
    ## Could use something simpler since we know the actual window size
    pos=df$start + round((df$end - df$start)/2)
)

## For each position, get the nORFs of all the windows that overlap it
subsets <- lapply(data$pos, function(x) { 
    subset(df, start <= x & end >= x)$nORFs
})
## Complete data.frame of interest
data <- cbind(data, data.frame(
    mean=sapply(subsets, mean),
    min=sapply(subsets, min),
    max=sapply(subsets, max)
))
head(data)
##   pos  mean min max
## 1  51 75.00   9 142
## 2  76 89.25   9 142
## 3 101 83.25   9 142
## 4 126 96.00  50 142
## 5 151 89.75  50 132
## 6 176 73.50  50 117
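
As the comment in the code notes, the mid position could be computed more simply when the window size is known. A minimal sketch, assuming every window spans exactly 100 bases as in this example (`window_size` and `pos_simple` are names I'm introducing here for illustration):

```r
## Window coordinates from the example
df <- data.frame(
    start=seq(1, 9901, by=25),
    end=seq(100, 10000, by=25)
)

## Assumed constant: all windows have the same size
window_size <- 100

## Mid position as computed in the example
pos <- df$start + round((df$end - df$start)/2)

## Simpler version: shift the start by half the window size
pos_simple <- df$start + window_size / 2

## Check that both approaches agree for this layout
all(pos == pos_simple)
```

The general version based on `start` and `end` is still the safer choice if the windows can have different sizes.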

#### Make the plot
library(ggplot2)
ggplot(data, aes(x=pos, y=mean)) +
    geom_ribbon(aes(ymin=min, ymax=max), alpha=0.2) +
    geom_line()

[Figure: mean number of ORFs by genome position (line), with a ribbon spanning the min and max of the overlapping windows]

The plot doesn't look great because I generated random data. The idea is that, at the mid position of each window, you look at the mean number of ORFs along with some information (here the min and max) from all the windows that overlap that position. Depending on how many windows actually overlap, you might be interested in using the mean ± the standard deviation instead.
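
Here is a rough sketch of that last idea, using the same simulated data. It also counts how many windows overlap each mid position, since the standard deviation is only informative when there are several of them. The `sd` and `n` columns are additions for this sketch and are not part of the data above.

```r
library(ggplot2)

## Same simulated windows as in the example
set.seed(101)
df <- data.frame(
    start=seq(1, 9901, by=25),
    end=seq(100, 10000, by=25),
    nORFs=round(runif(397, max = 200))
)

## Mid position of each window
data <- data.frame(pos=df$start + round((df$end - df$start)/2))

## nORFs of all the windows that overlap each mid position
subsets <- lapply(data$pos, function(x) {
    subset(df, start <= x & end >= x)$nORFs
})

## Mean, standard deviation and number of overlapping windows
data$mean <- sapply(subsets, mean)
data$sd <- sapply(subsets, sd)
data$n <- lengths(subsets)

## How many positions are covered by each number of windows
table(data$n)

## The ribbon now spans mean - sd to mean + sd
p <- ggplot(data, aes(x=pos, y=mean)) +
    geom_ribbon(aes(ymin=mean - sd, ymax=mean + sd), alpha=0.2) +
    geom_line()
print(p)
```

With this layout most positions are covered by 4 windows and the ones near the edges by fewer, so the standard deviation is based on very few values there.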