gene_lengths_v2.txt shows lengths of genes and more for all human genes from the refseq database, covered next week. It has four columns:

In this homework, we will try to figure out what the difference is between these, and what the most extreme genes are.

0. Load the dataset

The given file has a header row, so we need to specify this in the function to load the dataset correctly.
The dataset is stored as a dataframe genes_df.
To get the sense of what we are dealing with, we use summary to get summary statistics of each column in the dataframe genes_df.

genes_df <- read.table("gene_lengths_v2.txt", header=TRUE)
summary(genes_df)

1. The most common number of exons

Make a histogram that shows what the typical number of exons is. Adjust the bins so that we can pinpoint exactly what number of exons that is the most common. Comment the plot.

hist(genes_df$exon_count,
     breaks = c(0:max(genes_df$exon_count)), 
     xlim = c(0,max(genes_df$exon_count)),
     main = "Histogram of the numbers of exons per gene",
     xlab = "number of exons")

# from https://www.tutorialspoint.com/r/r_mean_median_mode.htm
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

getmode(genes_df$exon_count)

The histogram shows that the most common number of exons per gene is 4 exons, which is also confirmed by the mode function. The distribution of the number of exons is right-skewed. Half of the genes have less than 10 exons (median).

2. Total length of introns

Add an additional column to the dataframe that contains the total length of introns for each gene

genes_df$intron_length <- genes_df$genome_length - genes_df$mrna_length
summary(genes_df$intron_length)

3. The length of total exon and total intron

Make histograms and boxplots showing the distribution of total exon and total intron lengths, all as subplots in the same larger plot, where each dataset have a different color. On the histograms, the number of bins should be exactly the same, and the x-axis should have the same scale. Comment the plot – are exons larger than introns or vice versa?

We chose not to make subplots because they are not as illustrative as the overlaying plots below

hist(genes_df$mrna_length, 
     col=rgb(0,0,1,0.5), 
     breaks=seq(0,2500000,by=10000), 
     main="The length of total exon (blue) and total intron (red)", xlab="length"
     )
hist(genes_df$intron_length, 
     col=rgb(1,0,0,0.5), 
     breaks=seq(0,2500000,by=10000), 
     add=TRUE)

hist(genes_df$mrna_length[genes_df$mrna_length<50000], 
     col=rgb(0,0,1,0.5), 
     breaks=seq(0,50000,by=1000),
     main="The length of total exon (blue) and total intron (red)", xlab="length")
hist(genes_df$intron_length[genes_df$intron_length<50000], 
     col=rgb(1,0,0,0.5), 
     breaks=seq(0,50000,by=1000), 
     add=TRUE)

boxplot(genes_df$mrna_length, genes_df$intron_length, col=c('blue', 'red'),
        names=c("exon length", "intron length"))

Histograms and boxplots show that introns are mostly longer than exons. The boxplots show that the intron lengths span a larger range.

4. Hypothesis testing

*Are the mRNA lengths significantly longer than the total intron lengths, or is it the other way around?

par(mfrow = c(1,2))
qqnorm(genes_df$mrna_length, main="exon length")
qqline(genes_df$mrna_length, col="red")

qqnorm(genes_df$intron_length, main="intron length")
qqline(genes_df$intron_length, col="red")

They deviate a lot from normality, so we cannot use parametric t-test. Two-sample Wilcoxon test, as known as Mann-Whitney test, is used instead.

H0: The mRNA lengths are equal to or longer than the intron lengths.
H1: The mRNA lengths are significantly shorter than the intron lengths.

wilcox.test(genes_df$mrna_length, genes_df$intron_length, alternative = "less")

As the p-value is 2.2*10^-16, which is less than 0.05, the null hypothesis is rejected.
Therefore, the mRNA lengths are significantly shorter than the intron lengths.

5. Correlation

Continuing on the same question: is the total exon length more correlated to the total intron length than the number of exons? Show this both with a plot and with correlation scores. Comment on your result.

plot(mrna_length ~ intron_length, data = genes_df)
exonL_intronL = lm(mrna_length ~ intron_length, data = genes_df)
abline(exonL_intronL, col="red")

cor(genes_df$mrna_length, genes_df$intron_length, method="spearman")

plot(mrna_length ~ exon_count, data = genes_df)
exonL_exonN = lm(mrna_length ~ exon_count, data = genes_df)
abline(exonL_exonN, col="red")

cor(genes_df$mrna_length, genes_df$exon_count, method="spearman")

As all parameters (mRNA lenghts, the number of exons per gene, and intron lengths) are not normally distributed, we use non-parametric Spearman's correlation.

The correlation between mRNA lenghts and intron lengths is 0.5367173. The correlation between mRNA lenghts and the number of exons per gene is 0.5407801.

The correlation values are highly similar. However, from the plots, the mRNA lenghts appear to be more correlated to the number of exons than intron length, but this is not represented by the correlation values.

6. Longest total exon length

What gene has the longest (total) exon length? How long is this mRNA and how many exons does it have? Do this in a single line of R (without using “;”).

genes_df[genes_df$mrna_length == max(genes_df$mrna_length), c(1,2,4)]

7. Extremes removal

In genomics, we often want to fish out extreme examples – like all very short genes, or all very long genes. It is often helpful to make a function to do these tasks – it saves time in the long run.

Make a function called “count_genes” that takes two inputs: a. A vector with mRNA lengths b. A cutoff x1 which by default should be set to 0 c. A cutoff x2 which by default should be set to the longest (total) mrna length of the input vector, as you did in “6)”. d. Then, the function should count the number of mRNAs that are no less than (<=) x2 but larger than (>) x1; and finally return the fraction of this count over the total count of mRNAs.

count_genes <- function(mrna_length_vec, x1=0, x2= max(mrna_length_vec)){
  return(length(mrna_length_vec[(x1 < mrna_length_vec) & (mrna_length_vec <= x2)])
         /length(mrna_length_vec))
}

Test this function with the mRNA lengths using the the five settings below: i) Using the default of x1 and x2; ii) Using the default of x2 and set x1=10000; iii) x1=1000 and x2=10000; iv) x1=100 and x2=1000; v) x1=0 and x2=100.

count_genes(genes_df$mrna_length)

count_genes(genes_df$mrna_length, x1=10000)

count_genes(genes_df$mrna_length, x1=1000, x2=10000)

count_genes(genes_df$mrna_length, x1=100, x2=1000)

count_genes(genes_df$mrna_length, x1=0, x2=100)

Homework 1