Introduction

Having seen the basics of the R language and environment, we will now try to put some of what we've learned into a real-world application. This will be a first practical aiming at showcasing the concept of pattern identification through the calculation of observed-over-expected frequencies/probabilities.

Part I. The problem

We will deal with a real life biological problem: the identification of clusters of genes with similar properties in the linear genome. We will use data from this publication by our group.

The problem here is quite straightforward. We have transiently de-activated topoisomerase II, a major genome structural protein in yeast (and all eukaryotes), and have measured gene expression changes genome-wide, in the form of Genome Run-On (GRO) values. We are interested in how the topological stress that accumulates during topoisomerase II inactivation propagates along the genome.
* Do genes become severely affected?
* In what way? Are they up- or down-regulated?
* Most importantly, does this de-regulation occur in tandem? Do genes become more affected if they are next to each other?

Some initial observations on gene expression changes show that genes are both up- (red) and down-regulated (blue). Moreover, we may be tempted to hypothesize that they do so in the form of linear clusters: series of genes placed in tandem.

The question now becomes: How can we identify these clusters and then how can we be sure they are really there because of some dependence between consecutive genes and not out of chance?

In short, we need a way to estimate the expected sizes of the clusters and then compare them with the sizes of the clusters we observe.

Part II. Data and Methods

The data

We will first load the necessary data into R. We need a file containing all the NRO values and the coordinates of the genes. This file is made available by the authors, so we use the link in the supplementary data to retrieve it and load it directly into our environment.

# read the tab-delimited file straight from its URL
grodata <- read.delim("https://raw.githubusercontent.com/christoforos-nikolaou/MolBioMedClass/master/Datasets/SacCer2_All_NRO_values.tsv", header = TRUE, sep = "\t")

We can take a look at it and make sure it contains all we need.

str(grodata)
## 'data.frame':    5414 obs. of  6 variables:
##  $ chrom     : chr  "chrI" "chrI" "chrI" "chrI" ...
##  $ startCoord: int  12045 31566 33447 35154 36508 37463 39258 42177 42881 45899 ...
##  $ endCoord  : int  12426 32940 34701 36303 37147 38972 41902 42720 45023 48251 ...
##  $ GeneName  : chr  "YAL064W-B" "YAL062W" "YAL061W" "YAL060W" ...
##  $ strand    : chr  "+" "+" "+" "+" ...
##  $ GROValue  : num  -0.475 0.683 2.303 0.508 -0.46 ...
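
Before moving on, we can also run a couple of quick sanity checks (just a sketch; any equivalent inspection works) to confirm there are no missing values and to peek at the first rows:

anyNA(grodata)   # FALSE would mean no missing values anywhere in the table
head(grodata, 3) # the first three genes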

In preparation for the analysis we will next load the necessary packages and libraries.

Loading Libraries

Next we will have to load the necessary libraries and packages into R. Remember that if a package is not already installed on your machine you will need to install it with install.packages("nameOfPackage"). Otherwise you can load it directly as we show below.
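
For example, a common pattern (a sketch, using tidyverse as the package in question) is to install a package only when it is missing:

# install tidyverse only if it is not already available
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}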

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

As you see, tidyverse is a meta-package that loads a number of other packages and libraries that help us treat, analyze and visualize data.

Part III. Analysis

Visual inspection of the data

It is always good to have a visual idea of what the data look like before any analysis. Since we are dealing with a single numerical variable (the GRO values) we can first perform some basic descriptive statistics on it, to calculate central tendencies, variance etc.

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(grodata$GROValue)
##    vars    n mean   sd median trimmed  mad   min  max range skew kurtosis   se
## X1    1 5414 0.01 0.41      0   -0.01 0.33 -1.93 3.35  5.28  0.9      4.2 0.01

and then we can visualize the actual values in a histogram or a boxplot

library(ggplot2)
ggplot(grodata, aes(x = GROValue)) +
  geom_histogram(binwidth = 0.1, fill = "olivedrab", color = "black") +
  labs(title = "GRO Value Distribution", x = "GRO Values", y = "Frequency")

#
grovals<-data.frame(x = factor(""), y = grodata$GROValue) # this is just creating a dummy data.frame
ggplot(grovals, aes(x = x, y = y)) +
  geom_boxplot(fill = "olivedrab", color = "black", width = 0.3) +
  labs(title = "GRO Value Boxplot", x= NULL, y = "GRO Values")

Stop and think. What information do you gather from the above?

Extracting clusters

We will start by going after exactly what we wanted to identify. That is, the clusters of consecutive positive or negative GRO values and their sizes.
To this end we will use a very useful built-in R function called rle(). rle() takes a vector of (usually binary) values and compresses it into series of homogeneous characters. For example, given a series of:
\[S=[1,1,1,0,0,1,0,1,1]\]

rle will create 5 subseries (which it calls “runs”) of homogeneous characters and their corresponding lengths in two separate vectors:
The values of each “run” will be: 1, 0, 1, 0, 1
And the lengths of each will be: 3, 2, 1, 1, 2.
Basically, the original series S is translated into 3 1s, 2 0s, 1 1, 1 0 and 2 1s.
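
We can verify this directly in R:

S <- c(1, 1, 1, 0, 0, 1, 0, 1, 1)
rle(S)
## Run Length Encoding
##   lengths: int [1:5] 3 2 1 1 2
##   values : num [1:5] 1 0 1 0 1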

Below we load the rle package (whose methods let us combine rle objects with c()) and run rle() on our GRO values, after we have created a binary transformation of GRO values into 1 (for GRO>=0) and 0 (for GRO<0).

library(rle) # provides c() and other methods for rle objects
chroms <- as.factor(grodata$chrom)
runs <- rle(c("0")) # seed object to append to (note: this adds one dummy length-1 "0" run)
for (chromosome in levels(chroms)) { # we have to split values from each chromosome
  i <- which(grodata$chrom == chromosome)
  values <- grodata$GROValue[i]
  values <- ifelse(values >= 0, 1, 0) # binarize: 1 for positive, 0 for negative GRO values
  runs <- c(runs, rle(values)) # run rle and append the output to the previous chromosomes' runs
}

We can see the output of this by simply calling the variable runs.

runs
## Run Length Encoding
##   lengths: int [1:2304] 1 1 3 3 3 1 1 8 2 1 ...
##   values : chr [1:2304] "0" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" ...

From which we see that we have identified 2304 runs of consecutive positive or negative values, of any size k.
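
As a quick consistency check, the run lengths should add up to the number of genes, plus one for the dummy run we used to seed the runs object:

length(runs$lengths) # total number of runs (2304)
sum(runs$lengths)    # should equal nrow(grodata) + 1 (the dummy seed run)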

We can now start looking into their size distribution. We may also do this separately for positive and negative values, as we see below,
first by creating two separate vectors holding the sizes of positive (=1) and negative (=0) clusters.

posclusters<-runs$lengths[which(runs$values==1)]
negclusters<-runs$lengths[which(runs$values==0)]

Next, we can calculate the times each cluster size occurs in the data. We can use the table() function for this.

tabp<-as.data.frame(table(posclusters))
tabn<-as.data.frame(table(negclusters))
#
tabp
##    posclusters Freq
## 1            1  539
## 2            2  268
## 3            3  131
## 4            4   88
## 5            5   48
## 6            6   29
## 7            7   21
## 8            8   10
## 9            9    6
## 10          10    5
## 11          11    2
## 12          12    3
## 13          13    2
## 14          15    2
## 15          18    1
## 16          21    1
## 17          31    1

Next, we need to calculate the frequencies of these clusters. We will thus divide the number of times each cluster size occurs in our data by the total number of clusters (runs) reported by rle().

tabp$Freq<-tabp$Freq/length(runs$lengths)
colnames(tabp)[1]<-"ClusterSize"
tabn$Freq<-tabn$Freq/length(runs$lengths)
colnames(tabn)[1]<-"ClusterSize"
tabp$type<-"positive"
tabn$type<-"negative"
tab<-rbind(tabp, tabn)
tab$ClusterSize<-as.numeric(as.character(tab$ClusterSize))

Last, we will plot the two distributions on the same graph against the corresponding k.

#
library(ggplot2)
ggplot(tab, aes(x = as.numeric(ClusterSize), y = Freq, color= type)) +
  geom_point(size = 5, alpha = 0.6) +
  scale_color_manual(values = c("positive" = "firebrick4", "negative" = "steelblue4")) +
  scale_x_continuous(breaks = sort(unique(tab$ClusterSize))) +
  labs(title = "Cluster Size Distribution", x = "Cluster Size", y = "Frequency") +
  theme_minimal()

Notice how both types of clusters show very similar behaviour. Since there are no clear qualitative differences between the two distributions, and for better statistics, we can merge the counts for positive and negative clusters and simply create one distribution of observed cluster sizes. The code below does exactly that.

observedclusters<-runs$lengths
tabo<-as.data.frame(table(observedclusters))
tabo$Freq<-tabo$Freq/length(runs$lengths)
colnames(tabo)[1]<-"ClusterSize"
tabo$type<-"observed"
tabo$ClusterSize<-as.numeric(as.character(tabo$ClusterSize))
#
ggplot(tabo, aes(x = as.numeric(ClusterSize), y = Freq, color= type)) +
  geom_point(size = 5, alpha = 0.6, col="forestgreen") +
  scale_x_continuous(breaks = sort(unique(tabo$ClusterSize))) +
  labs(title = "Cluster Size Distribution", x = "Cluster Size", y = "Frequency") +
  theme_minimal()

There appears to be an exponential decay. This is plausible: the longer the cluster, the less likely we would expect it to be. The question is just how unlikely.

Remember two things:
1. What we see above is the observed frequency of clusters of any given size.
2. But also remember that our initial hypothesis is that there is some degree of dependence between consecutive genes which leads to the formation of large clusters; in fact, clusters that are larger than the ones expected by chance.

The question remains:
* How can we know if the cluster sizes we see are indeed larger than the ones expected by chance?

The key now rests on two things:
1. To calculate the expected frequencies of clusters based on our data.
2. To compare these frequencies with the actual, observed ones we have just extracted.

Getting a background model. Expected Values

In order to create a background model we can work in various ways. For instance, we could simulate the data, that is, create random sets of GRO values, but we will leave this for another practical.

There is a more straightforward way, and it is to test the null hypothesis that we can derive from our initial hypothesis. Remember that we have assumed some dependence between GRO values. The opposite would then be to examine the case of independence. Under independence, every GRO value will be independent of the previous one. Using simple probability theory we can calculate the expected probability of any cluster size. We only need the base probability of a gene having a positive or a negative value.

Remember this plot:

library(ggplot2)
ggplot(grodata, aes(x = GROValue)) +
  geom_histogram(binwidth = 0.1, fill = "olivedrab", color = "black") +
  labs(title = "GRO Value Distribution", x = "GRO Values", y = "Frequency")

From which we can see that the distribution of GRO values is symmetrical around zero. We can easily calculate the percentage of positive/negative values like this:

length(which(grodata$GROValue>=0))/length(grodata$GROValue)
## [1] 0.5077577
#
length(which(grodata$GROValue<0))/length(grodata$GROValue)
## [1] 0.4922423

from which we see that the frequencies of positive/negative values are 0.508/0.492. Since this is very close to 50%-50%, we will simplify the frequencies to 0.5 for both positive and negative GRO values.
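
If we want to be more careful about this simplification (a quick sketch, not part of the original analysis), a binomial test can tell us whether the observed proportion deviates significantly from 0.5:

# test whether the proportion of positive GRO values differs from 0.5
binom.test(sum(grodata$GROValue >= 0), nrow(grodata), p = 0.5)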

What can we do next? We can create a function that will calculate the probability of a k-gene cluster of positive or negative GRO-value genes. Remember that, assuming independence, the probability of k consecutive genes with a positive GRO value is: \[P^{pos}_{k}=P[positive]^k\]
and since we have set P[positive]=P[negative]=0.5, this becomes the same for any cluster of k consecutive genes:

\[P_{k}=P^k=0.5^k\]

We may therefore easily calculate the probabilities of clusters up to any k; for example, a cluster of k=7 genes has an expected probability of \(0.5^7 \approx 0.008\). Below we do this for an extreme case of 30 genes. (Why is this extreme? Think about it).

p <- vector(mode = "numeric", length = 30)
for (k in 1:30) {
  p[k] <- 0.5^k # probability of a cluster of size k under independence
}
tabr <- data.frame("ClusterSize" = 1:30, "Freq" = p, "type" = "expected")
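
As a side note, since R arithmetic is vectorized, the same probabilities can be computed without an explicit loop:

p <- 0.5^(1:30) # identical result, vectorized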

We can see how the probability really decays exponentially with k in a simple plot

ggplot(tabr, aes(x = ClusterSize, y = Freq)) +
  #geom_line(color = "forestgreen", linewidth = 3) +   # Line color can be customized
  geom_point(color = "grey40", size=5, alpha=0.5) +   # Add points to see individual values
  labs(title = "Cluster Size vs Expected Probability",
       x = "Cluster Size",
       y = "Frequency") +
  theme_minimal()

All that is left now is to compare this plot with the one we got from the actual data.

Comparing Observed over Expected Values

The first thing we can do is plot the two distributions (observed and expected) on the same plot.

tab<-rbind(tabo, tabr)
#
ggplot(tab, aes(x = as.numeric(ClusterSize), y = Freq, color= type)) +
  geom_point(size = 5, alpha = 0.7) +
  scale_color_manual(values = c("observed" = "forestgreen", "expected"="grey40")) +
  scale_x_continuous(breaks = sort(unique(tab$ClusterSize))) +
  labs(title = "Cluster Size Distribution", x = "Cluster Size", y = "Frequency") +
  theme_minimal()

Some difference is visible, initially in favour of the expected values, which rapidly shifts towards the observed ones. As we are considering probabilities, the values decay really fast and after a given point they appear indistinguishable from zero. In such cases, we can use a simple log-transformation to increase the dynamic range of the values. This is what the above picture looks like after simply taking the log10 of the probabilities.

tab<-rbind(tabo, tabr)
#
ggplot(tab, aes(x = as.numeric(ClusterSize), y = log10(Freq), color= type)) +
  geom_point(size = 5, alpha = 0.7) +
  scale_color_manual(values = c("observed" = "forestgreen", "expected"="grey40")) +
  scale_x_continuous(breaks = sort(unique(tab$ClusterSize))) +
  labs(title = "Cluster Size Distribution", x = "Cluster Size", y = "log10(Frequency)") +
  theme_minimal()

Differences between observed and expected values really stand out now. Expected values follow a straight line, as would be expected for an exponential decay. The observed values, however, do not show an exponential decay but have a long tail, suggesting a different type of distribution (they actually resemble a power law, but we will leave this for another practical). What happens with the real clusters is that large-sized clusters appear much more frequently than expected.

Starting from k=4, observed cluster frequencies begin to deviate from the expected ones. By k=7 they are clearly non-overlapping and at k=10 they are >3-fold more frequent than expected by chance.

As a last step we can try to quantify these differences more precisely. We will use a simple approach of probability ratios. Below we take all cluster sizes from k=1 to k=15 and calculate the ratio of observed over expected frequencies.

clusterTable <- matrix(0, nrow = 15, ncol = 2)
for (i in 1:15) {
  clusterTable[i, 1] <- i
  # match by cluster size rather than by row number, in case a size is absent from the observed table
  clusterTable[i, 2] <- tabo$Freq[match(i, tabo$ClusterSize)] / tabr$Freq[i]
}
clusterTable<-as.data.frame(clusterTable)
colnames(clusterTable)<-c("Cluster Size", "Probability Ratio")

And then we plot the value of this ratio against the baseline of 1. We may call this value an enrichment, meaning that it reflects the degree to which clusters of a given size appear more frequently than expected by chance.

ggplot(clusterTable, aes(x = as.numeric(`Cluster Size`), y = `Probability Ratio`)) +
  geom_point(size = 5, alpha = 0.8, color="firebrick4") +
  labs(title = "Cluster Size Obs/Exp Enrichment", x = "Cluster Size", y = "Probability Ratio") +
  geom_hline(yintercept=1, color="grey50", linetype="dashed", linewidth=0.5) +
  theme_minimal()

We see that from k=5 and above, all cluster sizes are enriched in frequency in the real, observed data compared to the expected values. For k>=7 there is an at least 2-fold (ratio>=2) enrichment. This is exactly the threshold we set in our paper, at which we considered all clusters of at least 7 genes to be significantly enriched (albeit through a slightly more robust simulation analysis).
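
To close the loop, here is a minimal sketch (the function name getClusterGenes is just an illustrative choice, not from the paper) of how one could map the runs back to the actual genes, retrieving the members of every cluster of at least 7 genes on a given chromosome:

getClusterGenes <- function(grodata, chromosome, minSize = 7) {
  sub <- grodata[grodata$chrom == chromosome, ] # genes of one chromosome, in genome order
  r <- rle(ifelse(sub$GROValue >= 0, 1, 0))     # same binarization as before
  ends <- cumsum(r$lengths)                     # index of the last gene in each run
  starts <- ends - r$lengths + 1                # index of the first gene in each run
  keep <- which(r$lengths >= minSize)           # runs at or above the size threshold
  do.call(rbind, lapply(keep, function(i) {
    data.frame(sub[starts[i]:ends[i], c("GeneName", "GROValue")],
               clusterSize = r$lengths[i],
               type = ifelse(r$values[i] == 1, "positive", "negative"))
  }))
}
getClusterGenes(grodata, "chrI")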