It has previously been shown in rice that genome annotation led to a large number of transposable elements being misidentified as genes. The analysis that showed the discrepancy explored GC content in the context of each gene and at codon positions. Here I begin to use this approach to evaluate our gene prediction in the Gynandropsis gynandra draft genome.
First, I used codonW to analyse the predicted coding sequences from v3 of our draft unpublished annotation. Outputs were GC content, and GC3s (GC% of the third codon position), per CDS.
The genes were coded according to whether they had a conditioned reciprocal best BLAST(CRBB) hit against Arabidopsis thaliana (TAIR v10 release).
gcfile <- "/home/rds45/experiments/cg_genome/g.gynandra.cds.agis.out"
gcdata <- read.csv(gcfile, as.is = T, sep = "\t")
gcdata$has_rb <- grepl(gcdata$title, pattern = "_AT")
First observation: genes that had a CRBB hit have a different GC distribution than those without a hit.
library(ggplot2)
ggplot(gcdata, aes(x = GC, colour = has_rb)) + geom_density()
The same holds for GC3s (as we might expect), but the distributions don't follow the GC pattern.
ggplot(gcdata, aes(x = GC3s, colour = has_rb)) + geom_density()
Since the distributions weren't the same for GC and GC3s, the next question is whether the GC/GC3s relationship is the same for genes with and without CRBB hits.
ggplot(gcdata, aes(x = GC, y = GC3s, colour = has_rb)) + geom_point() + stat_smooth(method = "lm")
They are more different than I was expecting. The extremes look like they were drawn from different populations. Not expecting to find anything new here, but are the means different?
library(plyr)
std <- function(x) sd(x)/sqrt(length(x))
means <- ddply(gcdata[, c("has_rb", "GC3s", "GC")], .(has_rb), numcolwise(mean))
stds <- ddply(gcdata[, c("has_rb", "GC3s", "GC")], .(has_rb), numcolwise(std))
library(reshape2)
meansm <- melt(means, id = "has_rb")
names(meansm)[3] <- "mean"
stdsm <- melt(stds, id = "has_rb")
names(stdsm)[3] <- "stderr"
all <- merge(meansm, stdsm)
ggplot(all, aes(x = variable, y = mean, fill = has_rb)) + geom_bar(stat = "identity",
position = "dodge") + geom_errorbar(aes(ymin = mean - stderr, ymax = mean +
stderr), position = "dodge")
And let's test that, because p-values:
t.test(gcdata$GC ~ gcdata$has_rb)
##
## Welch Two Sample t-test
##
## data: gcdata$GC by gcdata$has_rb
## t = -5.581, df = 7079, p-value = 2.479e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.006081 -0.002920
## sample estimates:
## mean in group FALSE mean in group TRUE
## 0.4619 0.4664
t.test(gcdata$GC3s ~ gcdata$has_rb)
##
## Welch Two Sample t-test
##
## data: gcdata$GC3s by gcdata$has_rb
## t = -6.605, df = 8028, p-value = 4.226e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.011928 -0.006468
## sample estimates:
## mean in group FALSE mean in group TRUE
## 0.4439 0.4531
to be continued…