Gynandra gynandropsis genome true gene/GC exploration

It has previously been shown in rice that genome annotation led to a large number of transposable elements being misidentified as genes. The analysis that showed the discrepancy explored GC content in the context of each gene and at codon positions. Here I begin to use this approach to evaluate our gene prediction in the Gynandropsis gynandra draft genome.

CodonW analysis - per gene GC% and GC3s

First, I used codonW to analyse the predicted coding sequences from v3 of our draft unpublished annotation. Outputs were GC content, and GC3s (GC% of the third codon position), per CDS.

The genes were coded according to whether they had a conditioned reciprocal best BLAST(CRBB) hit against Arabidopsis thaliana (TAIR v10 release).

gcfile <- "/home/rds45/experiments/cg_genome/g.gynandra.cds.agis.out"
gcdata <- read.csv(gcfile, as.is = T, sep = "\t")
gcdata$has_rb <- grepl(gcdata$title, pattern = "_AT")

First observation: genes that had a CRBB hit have a different GC distribution than those without a hit.

library(ggplot2)
ggplot(gcdata, aes(x = GC, colour = has_rb)) + geom_density()

plot of chunk unnamed-chunk-2

The same holds for GC3s (as we might expect), but the distributions don't follow the GC pattern.

ggplot(gcdata, aes(x = GC3s, colour = has_rb)) + geom_density()

plot of chunk unnamed-chunk-3

Since the distributions weren't the same for GC and GC3s, the next question is whether the GC/GC3s relationship is the same for genes with and without CRBB hits.

ggplot(gcdata, aes(x = GC, y = GC3s, colour = has_rb)) + geom_point() + stat_smooth(method = "lm")

plot of chunk unnamed-chunk-4

They are more different than I was expecting. The extremes look like they were drawn from different populations. Not expecting to find anything new here, but are the means different?

library(plyr)
std <- function(x) sd(x)/sqrt(length(x))
means <- ddply(gcdata[, c("has_rb", "GC3s", "GC")], .(has_rb), numcolwise(mean))
stds <- ddply(gcdata[, c("has_rb", "GC3s", "GC")], .(has_rb), numcolwise(std))

library(reshape2)
meansm <- melt(means, id = "has_rb")
names(meansm)[3] <- "mean"
stdsm <- melt(stds, id = "has_rb")
names(stdsm)[3] <- "stderr"
all <- merge(meansm, stdsm)

ggplot(all, aes(x = variable, y = mean, fill = has_rb)) + geom_bar(stat = "identity", 
    position = "dodge") + geom_errorbar(aes(ymin = mean - stderr, ymax = mean + 
    stderr), position = "dodge")

plot of chunk unnamed-chunk-5

And let's test that, because p-values:

t.test(gcdata$GC ~ gcdata$has_rb)
## 
##  Welch Two Sample t-test
## 
## data:  gcdata$GC by gcdata$has_rb
## t = -5.581, df = 7079, p-value = 2.479e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.006081 -0.002920
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##              0.4619              0.4664
t.test(gcdata$GC3s ~ gcdata$has_rb)
## 
##  Welch Two Sample t-test
## 
## data:  gcdata$GC3s by gcdata$has_rb
## t = -6.605, df = 8028, p-value = 4.226e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.011928 -0.006468
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##              0.4439              0.4531

to be continued…