library(ggplot2)
setwd("~/projects/gsize/")

Start with RIMMAs because Paul’s script can’t handle hapmap2 paired end reads. Mapped RIMMAs to v3 version 22 of the cDNA with only “known” cDNAs, then counted abundance only of genes with proportion read mapping < 5E-5

Get data

#genome size
rimma_gs<-read.csv("~/projects/gsize/data/gsize_rimma.csv",header=T)
#percent mapping, change columns, fix filenames
rimma_perc<-read.csv("~/projects/gsize/results/fixed_genes_percent_rimma.txt",header=F)
colnames(rimma_perc)=c("Line","total","mapped","percent")
rimma_perc$Line=gsub(".fastq.bam.txt","",gsub("abundance.","",rimma_perc$Line))
rimma_data<-merge(rimma_gs,rimma_perc,by="Line")

Correlation and plot.

ggplot(data=rimma_data,aes(x=gsize,y=percent))+geom_point()+geom_smooth(method="lm")+xlab("1C genome size")+ylab("Fraction Reads Aligning to corrected cDNA")
## Warning: Removed 8 rows containing missing values (stat_smooth).
## Warning: Removed 8 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-3

cor.test(rimma_data$gsize,rimma_data$percent)
## 
##  Pearson's product-moment correlation
## 
## data:  rimma_data$gsize and rimma_data$percent
## t = -5.798, df = 41, p-value = 8.391e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8085 -0.4645
## sample estimates:
##     cor 
## -0.6712

Solid negative correlation, but not fantastic \(r^2\). The genome sizes for these lines come from a different individual from the same accession, so we don’t expect great correlation.

Get some data on relative genome size from HapMap2 and those reads mapped to Ensembl v3 version 22 of the cDNA with only “known” cDNAs

#Genome size from Chia
g=pipe(paste("curl -o - https://gist.githubusercontent.com/rossibarra/7638160/raw/62c4dab3f8a96a7d86cfe01a524dc1df2d03784f/genome_size"))
gsize<-read.table(g,header=T,colClasses=c("character","numeric","numeric"))

#Mapping to v3 cDNA ("known" cdna only)
#m=pipe(paste("ssh farm 'cat ~/projects/genomesize/results/percent_hm2.txt'"))

#percent mapping, change columns, fix filenames
hm2_perc<-read.csv("~/projects/gsize/results/fixed_genes_percent_hm2.txt",header=F)
colnames(hm2_perc)=c("Line","total","mapped","percent")
hm2_perc$Line=gsub("_merged.fq.bam.txt","",gsub("abundance.","",hm2_perc$Line))

#Munge away: get lines with both genome size and percent mapping
data<-merge(gsize,hm2_perc,by="Line")

Test correlation

ggplot(data=data,aes(x=GenomeSize,y=percent))+geom_point()+geom_smooth(method="lm")+xlab("1C genome size")+ylab("Fraction Reads Aligning to cDNA")

plot of chunk unnamed-chunk-5

cor.test(data$GenomeSize,data$percent)
## 
##  Pearson's product-moment correlation
## 
## data:  data$GenomeSize and data$percent
## t = -0.0805, df = 33, p-value = 0.9364
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3456  0.3207
## sample estimates:
##    cor 
## -0.014

Significant is good, but positive correlation is not what we predict. Let’s remove TILs and see what happens

x=data[c(1:24,34:35),]
ggplot(data=x,aes(x=GenomeSize,y=percent))+geom_point()+geom_smooth(method="lm")+xlab("1C genome size")+ylab("Fraction Reads Aligning to cDNA")

plot of chunk unnamed-chunk-6

cor.test(x$GenomeSize,x$percent)
## 
##  Pearson's product-moment correlation
## 
## data:  x$GenomeSize and x$percent
## t = -1.153, df = 24, p-value = 0.2604
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5661  0.1737
## sample estimates:
##    cor 
## -0.229

Still sucky. Will need to go back and revisit filtering. For now, ugh.