Prior To Beginning

Welcome to the R markdown walk through for the paper Multiple P450s and Variation in Neuronal Genes Underpins the Response to the insecticide Imidacloprid in a Population of Drosophila melanogaster (Denecke et. al 2017). This document is designed to guide the reader through every step of the analysis accomplished in the paper and allow for the reproducibility of the study.

Before beginning a few housekeeping issues must be addressed. Firstly, all code demonstrated here was optimized on a linux machine but has been tested on a mac and a PC. While it should work the particular specifications of the environment in which it was written are listed below.

version
##                _                           
## platform       x86_64-redhat-linux-gnu     
## arch           x86_64                      
## os             linux-gnu                   
## system         x86_64, linux-gnu           
## status                                     
## major          3                           
## minor          3.3                         
## year           2017                        
## month          03                          
## day            06                          
## svn rev        72310                       
## language       R                           
## version.string R version 3.3.3 (2017-03-06)
## nickname       Another Canoe

Next the analysis will require several packages. These packages can be installed and loaded into the R environment with the following chunk of code.

list.of.packages <- c("ggplot2","dplyr","ggplot2","tidyr","qqman","lme4","knitr","RCurl","R.utils")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(data.table)
library(tidyr)
library(dplyr)
library(ggplot2)
library(lme4)
library(qqman)
library(knitr)
library(RCurl)
library(R.utils)

Finally, the data will have to be loaded into the R enviroment. All raw data is available on github in .csv and binary forms. Also, during the analysis there are several custom functions that made the writing of the analysis far easier. These functions can all be found in the “functions.list.R” file which we must run before proceeding with the analysis. The code used for graphing aesthetics used in all figures can also be bulky and therefore make the communication of results difficult. I have therefore encoded all ggtheme parameters in a separate file called “themes.R”. Lastly, raw DGRP transcriptome data (Section 4) and ModEncode transcriptome data (Section 3) can be downloaded from their sources.

## Source R codes for plot themes and a few custom functions
themes <- getURL("https://raw.githubusercontent.com/shanedenecke/Rmarkdown_DGRP_Test/master/themes.R")
eval(parse(text = themes))
functions <- getURL("https://raw.githubusercontent.com/shanedenecke/Rmarkdown_DGRP_Test/master/functions.list.R")
eval(parse(text = functions))

# Load the raw data from github
download.file("https://raw.githubusercontent.com/shanedenecke/Rmarkdown_DGRP_Test/master/Denecke_et.al_2017_Raw_Data_Binary",destfile="~/test_DGRP/raw.data",quiet=T)
load("~/test_DGRP/raw.data")


## Import external Mod encode RNAseq dataset if not already present
if(!("mod_encode.tsv" %in% list.files())){
download.file("http://flybase.org/static_pages/downloads/FB2013_03/genes/gene_rpkm_report_fb_2013_03.tsv.gz",destfile="Mod_Encode_Raw.gz")
gunzip(filename="Mod_Encode_Raw.gz",destname="mod_encode.tsv",overwrite=F)
}
mod.encode.raw <- fread("mod_encode.tsv",sep="\t",header=T,showProgress = F)


## Download and read transcriptomic data if not already present
if(!("dgrp.array.exp.female.txt" %in% list.files())){
      download.file("http://dgrp2.gnets.ncsu.edu/data/website/dgrp.array.exp.female.txt",
                    destfile="~/test_DGRP/female.transcriptome.txt",quiet=T)
      download.file("http://dgrp2.gnets.ncsu.edu/data/website/dgrp.array.exp.male.txt",
                    destfile="~/test_DGRP/male.transcriptome.txt",quiet=T)
}

dgrp.female.transcriptome <- fread("female.transcriptome.txt",header=T,showProgress=F)
dgrp.male.transcriptome <-  fread("male.transcriptome.txt",header=T,showProgress=F)

The analysis can now begin in earnest. Each section of this document will walk through the analysis of 1 of the 6 results sections presented in the paper. Through the document all Figures and Tables presented in the paper will be generated in what approaches their final form.

Section 1

Data used:

  • DGRP_raw_wiggle_data.csv (as dgrp.wiggle.raw)

Section 1 is focussed on looking at the phenotypic distribution of DGRP genotypes. The first thing that must be accomplished is to summarize the wiggle data. After importing the data, it can be made interpretable using the wiggle.parse() and wiggle.rmr() custom functions. It can then be summarized using the dplyr package. We store this result in sum.wiggle.60 containing the mean RMRs (and standard errors) for each genotype at each dose at 60 minutes. Remember RMRs represent: \[RMR=\frac{\text{Quantity of Movement Without Imidacloprid}}{\text{Quantity of Movement After 1 Hour of Imidacloprid Exposure}}\]

sum.wiggle.60 <- dgrp.wiggle.raw %>% wiggle.parse(write=F) %>% 
      wiggle.rmr(write=F) %>% filter(time==60) %>%  group_by(genotype,dose) %>% 
      summarize(rmr.ci=se(rmr),rmr=mean(rmr),raw=mean(raw)) %>% 
      arrange (desc(dose),rmr)

Next some modifications will need to be made for plotting Figure 3.1, which reflects the distribution of the DGRP lines. In particular a data frame will be created called “annot” which will be used to map aesthetics to individual facets of the ggplot graph. This will be a common practice throughout the analysis.

# Add some modifications that will help for plotting
sum.wiggle.60$rank <- factor(as.character(rep(1:(nrow(sum.wiggle.60)/2),2)),
                             levels=unique(rep(1:(nrow(sum.wiggle.60)/2),2)))
sum.wiggle.60$dose <- factor(sum.wiggle.60$dose,levels=c("25ppm","100ppm"))
sum.wiggle.60$colour <- rep(colorRampPalette(c("steelblue","orangered"))(nrow(sum.wiggle.60)/2),2)
annot <- data.frame(x = rep(-Inf, length(unique(sum.wiggle.60$dose))), 
                    y = rep(Inf, length(unique(sum.wiggle.60$dose))), 
                    dose=as.character(unique(sum.wiggle.60$dose)),
                    labs=LETTERS[1:length(unique(sum.wiggle.60$dose))])

Now Figure 1 can be plotted. This figure shows each DGRP genotype (in order) on the x axis and the mean RMR on the y axis. Remember the higher the RMR the more resistant the DGRP line. At both doses there is a range of responses from the super susceptible (steelblue) to the super resistant (orangered).

gp <- ggplot(data=sum.wiggle.60,aes(x=rank,y=rmr,fill=rank),stat="identity")
gp <- gp+facet_grid(~dose)
gp <- gp+geom_bar(position="dodge",stat="identity",colour="black",size=.1)
gp <- gp+geom_errorbar(aes(ymin=rmr+rmr.ci,ymax=rmr-rmr.ci),width=.8,size=.3)
gp <- gp+scale_fill_manual(values=sum.wiggle.60$colour)
gp <- gp+scale_y_continuous(expand = c(0, 0),limits=c(0,1.3),
                            breaks=c(seq(0,1.2,.2)),labels=c(seq(0,1.2,.2)))
gp <- gp+labs(y="Relative Movement Ratio\n",x="\nOrdered Genotype")
gp <- gp+geom_text(size=7,data=annot,aes(x, y, label=labs,fontface="bold",family="serif"), 
                   inherit.aes=FALSE,vjust=1.2,hjust=-.2)
gp <- gp+theme_bw(base_size=14,base_family="serif")
gp <- gp+spread_theme

print(gp)
This is figure 1

This is figure 1

Before moving on to trying to understanding the genetics underlying these differences, some basic checks must be performed on the data. One such check is to see how related our two doses are. This is shown in figure S1 by creating a scatterplot showing the RMRs at 25ppm vrs the RMRs at 100ppm.

sum.wiggle.60 <- arrange(sum.wiggle.60,genotype)
dose.compare <- data.frame(genotype=unique(sum.wiggle.60$genotype),
                           high.dose=filter(sum.wiggle.60,dose=="100ppm")$rmr,
                           low.dose=filter(sum.wiggle.60,dose=="25ppm")$rmr)
gp <- ggplot(dose.compare,aes(x=high.dose,y=low.dose))
gp <- gp+geom_point(size=2)
gp <- gp+geom_smooth(method="lm")
gp <- gp+labs(y="100ppm Relative Movement Ratio\n",x="\n25ppm Relative Movement Ratio")
gp <- gp+theme_bw(base_size=14,base_family="serif")
gp <- gp+corr_theme
print(gp) 

Another point of interest would be to see if there is any correlation between the initial motility (in the absence of insecticides) and the final RMR. As the Wiggle Index is a relatively new bioassay, we must be sure that our RMR is not an artefact of some lines moving more than others in general. Figure S2 shows the initial motility of each genotype on the x axis and the final RMR (imidacloprid response) on the y axis.

sum.wiggle <- dgrp.wiggle.raw %>% wiggle.parse() %>% wiggle.rmr() %>% group_by(genotype,time,dose) %>% 
      summarize(rmr=mean(rmr),raw=mean(raw)) %>% arrange (desc(time),genotype)

initial.compare <- sum.wiggle %>% gather(mes,output,rmr,raw) %>% filter((time=="0" & mes=="raw") | (time=="60" & mes=="rmr")) 
initial.compare <- data.frame(genotype=initial.compare$genotype,initial.motility=filter(initial.compare,mes=="raw")$output,wiggle.response=filter(initial.compare,mes=="rmr")$output)

gp <- ggplot(initial.compare,aes(x=initial.motility,y=wiggle.response))
gp <- gp+geom_point(size=2)
gp <- gp+geom_smooth(method="lm")
gp <- gp+ylab("Wiggle Response\n")
gp <- gp+xlab("\nInitial Motility")
gp <- gp+theme_bw(base_size=14,base_family="serif")
gp <- gp+corr_theme
print(gp) 

There doesn’t seem to be much of an effect. This means that the estimate of imidacloprid resistance is not affected by the baseline differences in motility between genotypes.

Next heritability can be estimated. This compares the variance between genotypes to variance within genotypes.

corrected.wiggle.data <- dgrp.wiggle.raw %>% wiggle.parse() %>% wiggle.rmr()  %>% filter(time==60)
heritability.list=list()
doses <- c("25ppm","100ppm")
for (a in doses){
      heritability.list[[a]]=list()
      sub.dose <- filter(corrected.wiggle.data,dose==a)
      sub.rmr <- sub.dose$rmr
      test <- lmer(sub.rmr~1+(1|sub.dose$genotype))
      heritability.list[[a]] <- c(genetic.variance=NA,environmental.variance=NA,heritability=NA)
      heritability.list[[a]]["genetic.variance"] <- VarCorr(test)$'sub.dose$genotype'[1]
      heritability.list[[a]]["environmental.variance"] <- heritability.list[[a]]["genetic.variance"]+
            attr(VarCorr(test),"sc")^2
      heritability.list[[a]]["heritability"] <- heritability.list[[a]]['environmental.variance']/
            (heritability.list[[a]]["environmental.variance"]+heritability.list[[a]]["genetic.variance"])
} 

heritability.list <- data.frame(heritability.list)
colnames(heritability.list) <- c("25ppm","100ppm")
kable(heritability.list,align="c")
25ppm 100ppm
genetic.variance 0.0197388 0.0302555
environmental.variance 0.0451750 0.0509368
heritability 0.6959225 0.6273602

The final part of this section will clean the data in a format that can be read by the GWAS pipeline. The mean RMRs for each genotype was submitted separately to http://dgrp2.gnets.ncsu.edu/ for further processing. The first few rows of the submission files are printed below. These are tables S1 and S2.

for(d in unique(sum.wiggle.60$dose)){
      single.dose <- sum.wiggle.60 %>% filter(dose==d) %>% select(genotype,rmr)
      single.dose[['genotype']] <-gsub("RAL_","",single.dose[['genotype']]) 
      clean.matrix <- matrix(as.numeric(as.matrix(single.dose)),ncol=2)
      print(head(clean.matrix))
}             
##      [,1]      [,2]
## [1,]  100 0.7069796
## [2,]  101 0.9097897
## [3,]  129 0.5623023
## [4,]  136 0.7559724
## [5,]  138 0.6220350
## [6,]  142 0.5728108
##      [,1]      [,2]
## [1,]  100 0.5537778
## [2,]  101 0.6879195
## [3,]  129 0.1671853
## [4,]  136 0.4303668
## [5,]  138 0.7021481
## [6,]  142 0.3072478

Section 2

Data Used:

  • DGRP_HPLC_data.csv (as dgrp.hplc.data)

In Section 2 a subset of the DGRP lines was used to assess only imidacloprid metabolism in the DGRP. For 18 genotypes (9 resistant and 9 susceptible) amounts of imidacloprid, IMI-5-OH and IMI-Olefin were measured in both media and larval bodies.

Before plotting the raw data needs cleaning. The code chunk below accomplishes this nicely.

dgrp.hplc.data.rearrange <- dgrp.hplc.data %>% 
      gather("mes.type","mes.value",larvae,media) %>%
      filter(!is.na(mes.value)) %>%
      arrange(rank) 

dgrp.hplc.summary <- dgrp.hplc.data.rearrange %>% 
      group_by(genotype,metabolite,mes.type) %>%
      summarize(mes.value=mean(mes.value)) 
wiggle.hplc.subset <- sum.wiggle.60 %>% 
      filter(dose=="25ppm") %>% 
      filter(genotype %in% 
                   unique(dgrp.hplc.summary$genotype)) %>% select(genotype,rmr)

dgrp.hplc.summary <- full_join(wiggle.hplc.subset,dgrp.hplc.summary,by="genotype")

dgrp.hplc.plot.data <- dgrp.hplc.data.rearrange[-intersect(which(dgrp.hplc.data.rearrange$metabolite=="Imidacloprid"),which(dgrp.hplc.data.rearrange$mes.type=="media")),]
dgrp.hplc.plot.data$genotype <- factor(dgrp.hplc.plot.data$genotype,levels=unique(dgrp.hplc.plot.data$genotype))
dgrp.hplc.plot.data$metabolite <- factor(dgrp.hplc.plot.data$metabolite,levels=c("Imidacloprid","5-OH","Olefin"))
dgrp.hplc.plot.data$mes.type <- factor(dgrp.hplc.plot.data$mes.type)

Some further variables are needed for plotting Figure 3. As was done previously, the annot data frame will map data to each facet. In this case the significance of the linear model predicting the level of a chemical with RMR at 25ppm is represented by significance stars. The following code accomplishes this task.

vars <- data.frame(expand.grid(levels(dgrp.hplc.plot.data$metabolite), levels(dgrp.hplc.plot.data$mes.type)))
colnames(vars) <- c("metabolite", "mes.type")

annot <- data.frame(x = rep(Inf, dim(vars)[1]), 
                    y = rep(Inf, dim(vars)[1]), 
                    vars,
                    labs=LETTERS[1:dim(vars)[1]])
equation.list <- vector("character",6)
for(m in unique(dgrp.hplc.plot.data$metabolite)){
      for(t in (unique(dgrp.hplc.plot.data$mes.type))){
            sub <- subset(dgrp.hplc.summary,metabolite==m & mes.type==t)
            equation.list[intersect(which(annot$metabolite==m),which(annot$mes.type==t))] <- equation.extract.hplc("mes.value","rmr",data=sub,sub=F)
      }}
equation.list[intersect(which(annot$metabolite=="Imidacloprid"),which(annot$mes.type=="media"))] <- ""
annot <- mutate(annot,equation.list=equation.list)
set1 <- c(rep("Susceptible",180),rep("Resistant",175))
dgrp.hplc.plot.data$Subset <- set1

Now Figure 2 can be plotted. This figure will be in the same format as Figure 5 and Figure S5. Each will have 6 panels with 2 columns representing measurements in the larval body or media and 3 rows reflecting the 3 chemicals measured.

gp <- ggplot(dgrp.hplc.plot.data,aes(x=genotype,y=mes.value)) 
gp <- gp+facet_grid(metabolite~mes.type,scales="free_y")
gp <- gp+geom_dotplot(binaxis='y', stackdir='center',colour="black",dotsize=1.2,aes(fill=Subset)) 
gp <- gp+labs(y="Quantity of Chemical (ppb)",x="Genotype")
gp <- gp+annotate("rect",xmin=-Inf,xmax=9.5,ymin=-Inf,ymax=Inf,fill="steelblue2",alpha=.3)
gp <- gp+annotate("rect",xmin=9.5,xmax=Inf,ymin=-Inf,ymax=Inf,fill="orangered2",alpha=.3)
gp <- gp+stat_summary(geom="errorbar",colour='black',size=.5,width=.6,fun.data="mean_se")
gp <- gp+stat_summary(fun.y="mean",fun.ymax="mean",fun.ymin="mean", geom="crossbar", width=.6,size=.3)
gp <- gp+scale_fill_manual(values=c("orangered2","steelblue2"))
gp <- gp+geom_dotplot(binaxis='y', stackdir='center',colour="black",dotsize=1.2,alpha=.25,aes(fill=Subset)) 
gp <- gp+geom_text(size=7,data=annot,aes(x, y, label=labs,fontface="bold",family="serif"),vjust=1.2,hjust=1.2)
gp <- gp+geom_text(size=7,data=annot,aes(9.5, y, label=equation.list,fontface="bold",family="serif"),vjust=1.2)
gp <- gp+theme_bw(base_size=14,base_family="serif")
gp <- gp+dgrp.hplc_theme
print(gp)

Section 3

Data Used:

  • gwas.top_25ppm_annot (as gwas.top.candidates.25ppm)
  • gwas.top_100ppm_annot (as gwas.top.candidates.100ppm)
  • gwas.all.associations.100ppm (no .csv available)
  • gwas.all.associations.25ppm (no .csv available)
  • mod.enocde.raw (downloaded above)

This section will cover section 3 in the paper which deals with the genome wide association study. First, the data must be cleaned and annotated in a format that can be plotted.

assoc.list <- vector('list',2)
names(assoc.list) <- c("25ppm","100ppm")

for (d in doses){  
      all.associations <- get(paste("gwas.all.associations.",d,sep=""))
      all.associations <- select(all.associations,ID,SinglePval)
      variant <- unlist(strsplit(all.associations$ID,split="_"))
      chromosome <- variant[seq(1, length(variant), 3)]
      base <- as.numeric(variant[seq(2, length(variant), 3)])
      chromosome <- multi.gsub(chromosome,find=c("X",'2L','2R','3L','3R'),replace=c(1,2,3,4,5))
      gwas.plot.data <- all.associations %>% 
            mutate(CHR=as.numeric(chromosome),BP=base) %>% 
            select(SinglePval,ID,CHR,BP)
      colnames(gwas.plot.data)=c('P','SNP','CHR','BP')
      variant.sig <- gwas.plot.data$SNP[which(gwas.plot.data$P<=.05/length(gwas.plot.data$SNP))]
      #gwas.plot.data <- filter(gwas.plot.data,P<.0001)
      assoc.list[[d]] <- gwas.plot.data
}

Now Figure S3 can be created which show the QQ plots of the GWAS data.

par(mar=c(5,6,2,1),mgp=c(3,1,0),mfrow=c(1,2),bty="l") 

qq(assoc.list[["25ppm"]]$P, main=NULL, family="serif",cex.main=2,font.axis=2,cex.lab=1.5,cex.axis=1.5,
      ylab=list(expression(bold(paste("Observed -log"[10]*"(p)"))),font=2),
      xlab=list(expression(bold(paste("Expected -log"[10]*"(p)"))),font=2))
      title("A",adj=0,family="serif",cex.main=1.5)
qq(assoc.list[["100ppm"]]$P,main=NULL, family="serif",cex.main=2,font.axis=2,cex.lab=1.5,cex.axis=1.5,
      ylab=list(expression(bold(paste("Observed -log"[10]*"(p)"))),font=2), 
      xlab=list(expression(bold(paste("Expected -log"[10]*"(p)"))),font=2)) 
      title("B",adj=0,family="serif",cex.main=1.5)

Now the data can be visualized. First, a Manhattan plot (Figure S4) can be created which shows every variant in order according to its chromosomal position along the x axis. The y axis displays the significance of each association.

par(mar=c(5,6,2,1),mgp=c(3,1,0),mfrow=c(1,2))
manhattan(assoc.list[["25ppm"]],col=c("black","orange"),
            genomewideline=-log10(.05/length(assoc.list[["25ppm"]]$SNP)),
            chrlabs=c("X",'2L','2R','3L','3R'),
            highlight=variant.sig,main=NULL,cex.main=2,ylim=c(0,-log10(1e-8)),
            family="serif",font=2,cex.lab=1.5,cex.axis=1.5,
            xlab="",ylab=list(expression(bold(paste("-log"[10]*"(p)"))),font=2))  
            title("A",adj=0,family="serif",cex.main=1.5)
manhattan(assoc.list[["100ppm"]],col=c("black","orange"),
            genomewideline=-log10(.05/length(assoc.list[["100ppm"]]$SNP)),
            chrlabs=c("X",'2L','2R','3L','3R'),
            highlight=variant.sig,main=NULL,cex.main=2,ylim=c(0,-log10(1e-8)),
            family="serif",font=2,cex.lab=1.5,cex.axis=1.5,
            xlab="",ylab=list(expression(bold(paste("-log"[10]*"(p)"))),font=2))
            title("B",adj=0,family="serif",cex.main=1.5)

These plots show that some variants are significantly associated with imidacloprid resistance. By narrowing the search to anything below p= 10^-5 the most significant variants can be associated along with their nearest gene.

low.dose.gwas <- dgrp.gwas.parse(gwas.top.candidates.25ppm) %>% mutate(dose="25ppm")
high.dose.gwas <- dgrp.gwas.parse(gwas.top.candidates.100ppm) %>% mutate(dose="100ppm")
table.1 <- rbind(low.dose.gwas,high.dose.gwas) %>% select(dose,ID,Gene.Name,MAF,SingleEff,SinglePval)
colnames(table.1) <- c("dose","Variant ID","gene","maf","effect size","p-value")
kable(table.1)
dose Variant ID gene maf effect size p-value
25ppm 2L_19928580_SNP sick 0.122 0.09295 0.0e+00
25ppm 2L_19375473_DEL intergenic 0.0578 0.1238 1.0e-07
25ppm 2L_16917294_SNP tweek 0.09859 0.0999 1.4e-06
25ppm 2L_19376418_SNP intergenic 0.07602 0.1017 7.0e-07
25ppm 2L_3227580_SNP intergenic 0.05917 0.1081 3.3e-06
25ppm 2R_10886753_SNP intergenic 0.06024 0.1079 4.3e-06
25ppm 3L_22083995_DEL intergenic 0.05952 0.1078 2.9e-06
25ppm 3L_9910712_DEL CG34356 0.06433 0.1022 4.2e-06
25ppm 3L_5797410_SNP CG42272 0.1314 0.07233 5.4e-06
25ppm 2R_16416362_SNP CG30151 0.05161 0.1173 4.5e-06
25ppm 2R_14614239_SNP Hs3st-A 0.08721 0.0873 5.3e-06
25ppm 2L_19391228_SNP CG17544 0.05202 0.1088 8.6e-06
25ppm 2L_19391829_SNP CG17544 0.05233 0.1086 9.4e-06
25ppm 2R_4562894_SNP Acsl 0.05172 0.1084 8.7e-06
25ppm X_2949168_SNP kirre 0.3879 0.05049 8.3e-06
25ppm 2L_19380943_SNP CG17350 0.1091 0.07875 8.2e-06
25ppm X_6908817_DEL CG4593 0.106 0.08299 8.3e-06
100ppm 2R_12858731_SNP intergenic 0.06667 -0.1208 6.3e-06
100ppm 2L_19928580_SNP sick 0.1242 0.09213 8.0e-06
100ppm 2L_14249499_SNP intergenic 0.1938 0.07545 7.1e-06
100ppm 2R_18329249_SNP intergenic 0.25 -0.07409 5.3e-06
100ppm 2R_18329255_SNP intergenic 0.2645 -0.07233 4.8e-06
100ppm 2L_1836932_SNP wry 0.1479 -0.0848 3.8e-06
100ppm X_6819592_SNP CG14431 0.07059 -0.1281 4.0e-07
100ppm 2L_5105071_SNP Msp-300 0.1824 -0.08335 1.9e-06
100ppm 2R_18329251_SNP intergenic 0.2532 -0.07259 6.3e-06
100ppm 2L_1836962_SNP wry 0.1479 -0.08367 6.4e-06
100ppm 3L_2495753_SNP CG33232 0.2722 -0.06788 3.6e-06
100ppm 2L_13896989_SNP cenG1A 0.05848 0.1284 3.9e-06
100ppm 2L_2925754_SNP lilli 0.08824 -0.1036 5.4e-06
100ppm 3R_8576914_SNP beat-Vc 0.4969 -0.06109 6.1e-06
100ppm 2R_14899318_SNP intergenic 0.2321 -0.07042 5.1e-06
100ppm 3L_10159678_SNP dpr10 0.1856 -0.07608 5.5e-06

The genes in this list are the best estimate of the genes involved in imidacloprid resistance within the DGRP. Because there are no genes known to have roles in resistance, looking at this list as a group is helpful. For instance these genes tend to be upregulated in the third instar larvae central nervous system compared to the genome wide average. This can be assessed using the custom mod.encode.enrich() function.

unique.gwas.list <- unique(c(low.dose.gwas$Gene.Name,high.dose.gwas$Gene.Name))
unique.gwas.list <- unique.gwas.list[unique.gwas.list!="intergenic"]
l3.enrichment <- mod.encode.enrich(raw.data.file=mod.encode.raw,
                  gene.list = unique.gwas.list, 
                  parent.library="modENCODE_mRNA-Seq_tissues",
                  stage.search.string="L3",
                  tissue.search.string="CNS")
l3.enrichment <- as.data.frame(t(as.matrix(l3.enrichment)))
rownames(l3.enrichment) <- NULL
kable(l3.enrichment)
L3 CNS genome.wide p.value
0.5 0.1964264 0.0015378

As the last thing that will be done in section 3, a few of the larger objects present in the workspace can be cleared.

rm(list=c("mod.encode.raw","gwas.all.associations.25ppm","gwas.all.associations.100ppm","assoc.list"))

Section 4

Data Used:

  • dgrp.male.transcriptome (downloaded above)
  • dgrp.female.transcriptome (downloaded above)
  • GeneID_Symbol_Key.csv (as fbgid.symbol.key)
  • cyp6g1_haplotypes.csv (as g1.haplotypes)

Section 4 considers the transcriptomic dataset generated by Huang et. al (2015). The object here is to correlate transcriptomic variation with phenotypic variation, much like was done in the previous section with genomic variation.

The transcriptomic datasets can be downloaded from http://dgrp2.gnets.ncsu.edu/data.html and are composed of two replicates of each transcriptome for each sex. Because this study considered larval samples that were not sexed, all 4 replicates of the transcriptomes were averaged together with the custom function transcript.combine() and a series of operations connected by the %>% operator.

combined.transcriptomes <- transcript.combine(male=dgrp.male.transcriptome,female=dgrp.female.transcriptome)
combined.transcriptomes <- merge(combined.transcriptomes,fbgid.symbol.key,by="gene")

clean.transcriptome.data <- combined.transcriptomes %>% 
      gather(key="Genotype",value="RPKM",RAL_21:RAL_913) %>%
      group_by(symbol,Genotype) %>% 
      summarize(avg=mean(RPKM,na.rm=T)) %>%
      spread(key=Genotype,value=avg)

A few modifications need to be performed on this dataset including combining it with the phenotypic data.

# Subset the genotypes which are in both the phenotyic and genotypic datasets
transcriptome.genotypes <- colnames(clean.transcriptome.data)[-1]
wiggle.genotypes <- as.character(sum.wiggle.60$genotype)
common.genotypes <- intersect(transcriptome.genotypes,wiggle.genotypes)
phenotype.data.subset <- sum.wiggle.60[wiggle.genotypes %in% common.genotypes,]
transcriptome.subset.symbol <- data.table(data.frame(clean.transcriptome.data)[,c(TRUE,transcriptome.genotypes %in% common.genotypes)])

## Run a linear model predicting RMR with each transcript in the transcriptome
symbol <- transcriptome.subset.symbol$symbol
transcriptome.subset <- select(transcriptome.subset.symbol,-symbol)
transcriptome.test <- sapply(select(transcriptome.subset,starts_with("RAL")),as.numeric)
table_2 <- vector('list',2)
names(table_2) <- doses

Once this is done the transcriptome wide association study can be run. This code fits a linear model predicting the mean RMR at either 25 or 100ppm with the magnitude of the expression of a given transcript.

for (d in doses){
      sub.phen.dose <- as.numeric(subset(phenotype.data.subset,dose==d)$rmr)
      output=matrix(NA,nrow=length(transcriptome.test[,1]),ncol=3)
      colnames(output)=c("estimate","pval","adjrsq")
      for (i in 1:nrow(transcriptome.test)) {
            test <- summary(lm(sub.phen.dose~transcriptome.test[i,]))
            est <- test$coefficients[2,1]
            pval <- test$coefficients[2,4]
            rsq <- test$adj.r.squared
            output[i,] <- c(est,pval,rsq)
      }
      
      rownames(output)=symbol
      twas.summary <- data.frame(output) %>%
            mutate(gene=symbol,dose=d) %>%
            filter(pval<1e-3) %>% arrange(pval) %>%
            select(dose,gene,pval,estimate,adjrsq)
      colnames(twas.summary) <- c("dose","gene","p-value","effect size","adjusted R squared")
      table_2[[d]] <- twas.summary
}
kable(rbindlist(table_2,use.names=T))
dose gene p-value effect size adjusted R squared
25ppm Cyp6g1 0.0000048 0.0843760 0.1232249
25ppm Jon65Aiii 0.0001460 0.0811607 0.0848229
25ppm T3dh 0.0003004 -0.2021680 0.0766034
25ppm CG11504 0.0003585 0.3175544 0.0745854
25ppm CG13151 0.0005755 0.3178061 0.0691738
25ppm Atf3 0.0009299 -0.2101590 0.0636809
25ppm snoRNA:Or-aca3 0.0009973 0.0552007 0.0628790
100ppm Cyp6g1 0.0000252 0.0934486 0.1047115
100ppm CG30345 0.0000798 -0.2185058 0.0916800
100ppm CG3876 0.0001986 -0.2836854 0.0813252
100ppm CG13026 0.0002389 0.1566551 0.0792154
100ppm CG18343 0.0003504 -0.1363508 0.0748450
100ppm RagA-B 0.0004364 -0.3338131 0.0723369
100ppm CG1764 0.0004904 -0.2338882 0.0710032
100ppm CG13088 0.0004960 -0.1372643 0.0708738
100ppm Cyp6g2 0.0007665 0.1552494 0.0658926
100ppm nbs 0.0007895 -0.2547038 0.0655544

At each dose Cyp6g1 appears to be the most significant association. Also present is its genomic neighbour Cyp6g2. To plot these associations individually a few operations can be used to create a data frame that contains just the expression of the two genes and the phenolic Wiggle Index data.

g1.haplotype.sub <- g1.haplotypes %>% filter(genotype%in%common.genotypes)
individual.transcripts <- transcriptome.subset.symbol %>% filter(symbol=="Cyp6g1"|symbol=="Cyp6g2") %>%
      gather(genotype, val, 2:(length(common.genotypes)+1)) %>%  
      spread(symbol, val) %>% arrange(genotype) %>% group_by(genotype) %>% summarise_each(funs(sum(., na.rm=TRUE))) %>%
      full_join(select(filter(phenotype.data.subset,dose=="100ppm"),genotype,rmr),by="genotype") %>% 
      full_join(g1.haplotype.sub,by="genotype") %>% gather(key="transcript",value="expression",Cyp6g1,Cyp6g2)


individual.transcripts$transcript <- factor(individual.transcripts$transcript,levels=c("Cyp6g1","Cyp6g2"))
equation.list <- vector("character",2)
for (t in unique(individual.transcripts$transcript)) equation.list[which(t==unique(individual.transcripts$transcript))] <- equation.extract("rmr","expression",individual.transcripts,"transcript",t)
annot <- data.frame(x = rep(-Inf, length(unique(individual.transcripts$transcript))), 
                    y = rep(Inf, length(unique(individual.transcripts$transcript))), 
                    transcript=as.character(unique(individual.transcripts$transcript)),
                    equation=equation.list,
                    labs=LETTERS[1:length(unique(individual.transcripts$transcript))])
annot$transcript <- factor(annot$transcript,levels=annot$transcript)

Now the associations of both Cyp6g1 and Cyp6g2 can be visualized according to haplotypes in Figure 3.

gp <- ggplot(individual.transcripts,aes(x=expression,y=rmr))
      gp <- ggplot(individual.transcripts,aes(x=expression,y=rmr))
      gp <- ggplot(individual.transcripts,aes(x=expression,y=rmr))
      gp <- gp+facet_grid(facets=.~transcript,scales="free_x")
      gp <- gp+geom_point(size=4,aes(shape=haplotype,colour=haplotype))
      gp <- gp+scale_shape_identity()
      gp <- gp+geom_smooth(method="lm")
      gp <- gp+labs(y="Relative Movement Ratio\n",x="\nTranscript Expression Expression (log2 RPKM)")
      gp <- gp+scale_colour_manual(values=c("red2","blue","green4","black"))
      gp <- gp+ylim(c(0,1.2))
      gp <- gp+geom_text(size=7,data=annot,aes(x, y, label=labs,fontface="bold",family="serif"), inherit.aes=FALSE,vjust=1.2,hjust=-.2)
      gp <- gp+geom_text(size=4,data=annot,aes(Inf, Inf, label=equation,family="serif"), inherit.aes=FALSE,vjust=1.2,hjust=1.2)
      gp <- gp+theme_bw(base_size=14,base_family="serif")
      gp <- gp+cyp6g1.ind_theme
      print(gp) 

Lastly, as was done for section 4, some of the larger objects can be removed from the R environment.

rm(list=c("dgrp.male.transcriptome","dgrp.female.transcriptome"))

Section 5

Data used:

  • total_cyp6g1KO_wiggle.csv (as cyp6g1ko.wiggle)

From section 4 it was shown that Cyp6g1 is the transcript most significantly associated with imidacloprid resistance. Removing Cyp6g1 then follows on as a logical next step.

First the Wiggle data can be considered. The usual cleaning steps can be applied to this dataset as well.

cyp6g1.ko.wiggle.plot <- cyp6g1.ko.wiggle %>% filter(time==60) %>% select(genotype,dose,time,rmr,identifier)
cyp6g1.ko.wiggle.plot$genotype <- factor(cyp6g1.ko.wiggle.plot$genotype,
                                         levels=c("RAL_517","RAL_517-Cyp6g1KO","Canton-S","Canton-S-Cyp6g1KO","Wxac","Wxac-Cyp6g1KO"))
cyp6g1.ko.wiggle.plot$identifier <- factor(cyp6g1.ko.wiggle.plot$identifier,
                                           levels=c("RAL_517 (25ppm)","RAL_517 (100ppm)","Canton-S (5ppm)","Wxac (5ppm)"))
vars <- data.frame(expand.grid(levels(cyp6g1.ko.wiggle.plot$identifier)))
colnames(vars) <- c("identifier")
annot <- data.frame(x = rep(Inf, dim(vars)[1]), y = rep(Inf, dim(vars)[1]), vars,labs=LETTERS[1:dim(vars)[1]])

equation.list <- vector("character",4)
for(i in unique(levels(cyp6g1.ko.wiggle.plot$identifier))){
            sub <- subset(cyp6g1.ko.wiggle.plot,identifier==i)
            equation.list[which(annot$identifier==i)] <- equation.extract.hplc("rmr","genotype",data=sub,sub=F,bonf=4)
      }
annot$pval <- equation.list

Now Figure 4 can be plotted. This figure will show the wiggle response of of Cyp6g1KOs and their controls in 3 different backgrounds. Panels A and B show two doses of RAL517-Cyp6g1KO and C and D show Canton-S-Cyp6g1KO and Wxac-Cyp6g1KO respectively.

gp <- ggplot(cyp6g1.ko.wiggle.plot,aes(x=genotype,y=rmr,fill=genotype))
gp <- gp+facet_grid(.~identifier,scales="free_x")
gp <- gp+geom_dotplot(binaxis='y',stackdir="center",dotsize=1.2)
gp <- gp+stat_summary(geom="errorbar",colour='black',width=.4,fun.data="mean_se",size=1)
gp <- gp+stat_summary(fun.y="mean",fun.ymax="mean",fun.ymin="mean", geom="crossbar", width=.4,colour='black',size=.75)
gp <- gp+geom_dotplot(binaxis='y',stackdir="center",dotsize=1.2,alpha=.25)
gp <- gp+scale_fill_manual(values=c("blue","lightblue","purple4","purple1","gold4","gold1")) 
gp <- gp+labs(y="Relative Movement Ratio\n",x="\nGenotype",size=3)
gp <- gp+ylim(c(0,1.4))
gp <- gp+geom_text(size=7,data=annot,aes(x, y, label=labs,fontface="bold",family="serif"),inherit.aes=F,vjust=1.2,hjust=-.2)
gp <- gp+geom_text(size=7,data=annot,aes(1.5, y, label=pval,fontface="bold",family="serif"),inherit.aes=F,vjust=1.2)
gp <- gp+theme_bw(base_size=14,base_family="serif")
gp <- gp+g1KO.wiggle_theme
print(gp)

Section 6

Data Used:

  • P450_real_time_data.csv (as P450.real.time.data)
  • cyp6g12_wiggle_data.csv( as uas.cyp6g12.wiggle)
  • UAS_cyp6g2_HPLC_data.csv (as uas.cyp6g2.hplc.data)

In Section 6 the effect of other P450s on imidacloprid resistance is examined. in particular the genomic neighbours of Cyp6g1, Cyp6g2 and Cyp6t3.

The expression of Cyp6g1, Cyp6g2 and Cyp6t3 was measured in the digestive tissues (midguts and tubules) of 4 DGRP lines, two with M haplotypes and two with AA haplotypes. The custom function rt.data.parse takes raw output from the real-time run and converts it into a usable format. Biological replicates 3 and 4 from RAL_360 were removed due to the high variability in the housekeeper gene RP49.

parsed.rt.data <- rt.data.parse(data=P450.real.time.data)
filtered.data <- parsed.rt.data[which(!((parsed.rt.data$biological.rep=="3" | parsed.rt.data$biological.rep=="4") & parsed.rt.data$Sample=="360")),]

Next the real time PCR data can be processed in parallel with expression data from section 4. Cleaning this data and doing some basic conversions can reveal what the fold upregulation of each gene is compared to a line that expresses low levels of each (RAL_776).

sum.rt.data <- filtered.data %>% 
      group_by(Target,Sample,biological.rep) %>% 
      summarize(mean.ct=mean(Cq)) %>% 
      spread(key=Target,value=mean.ct) %>% 
      mutate(g1dct=-(RP49-CYP6G1),g2dct=-(RP49-CYP6G2),t3dct=-(RP49-CYP6T3)) %>% 
      mutate(genotype=as.character(paste("RAL_",Sample,sep=""))) %>%
      mutate(haplotype=allele.ad(genotype)) %>% data.table() %>%
      mutate(g1ddct=g1dct-mean(.[which(.[["genotype"]]=="RAL_776"),][["g1dct"]],na.rm=T)) %>% 
      mutate(g2ddct=g2dct-mean(.[which(.[["genotype"]]=="RAL_776"),][["g2dct"]],na.rm=T)) %>% 
      mutate(t3ddct=t3dct-mean(.[which(.[["genotype"]]=="RAL_776"),][["t3dct"]],na.rm=T)) %>% 
      mutate(Cyp6g1=2^-g1ddct,Cyp6g2=2^-g2ddct,Cyp6t3=2^-t3ddct) %>% 
      mutate(type="Digestive Tissues Larvae") %>%
      select(genotype,haplotype,type,Cyp6g1:Cyp6t3) %>%
      gather(key="gene",value="relative.expression",Cyp6g1:Cyp6t3)


dgrp.p450.trans <- combined.transcriptomes %>%
      select(symbol,RAL_138,RAL_360,RAL_776,RAL_843) %>%
      filter(symbol %in% c("Cyp6g1","Cyp6g2","Cyp6t3")) %>% 
      transpose.collapse(column="symbol") %>%
      mutate(haplotype=allele.ad(genotype)) %>%
      mutate(Cyp6g1=2^(Cyp6g1-mean(.[which(.[["genotype"]]=="RAL_776"),][["Cyp6g1"]],na.rm=T))) %>%
      mutate(Cyp6g2=2^(Cyp6g2-mean(.[which(.[["genotype"]]=="RAL_776"),][["Cyp6g2"]],na.rm=T))) %>%
      mutate(Cyp6t3=2^(Cyp6t3-mean(.[which(.[["genotype"]]=="RAL_776"),][["Cyp6t3"]],na.rm=T))) %>%
      mutate(type="Whole Body Adults") %>%
      select(genotype,haplotype,type,Cyp6g1:Cyp6t3) %>%
      gather(key="gene",value="relative.expression",Cyp6g1:Cyp6t3) 

Next a few operations can be applied to prepare the data for plotting.

combined.data <- rbindlist(list(sum.rt.data,dgrp.p450.trans),use.names=T)
combined.data <- data.frame(combined.data)[complete.cases(combined.data),]
combined.data$genotype <- factor(combined.data$genotype,levels=c("RAL_776","RAL_843","RAL_138","RAL_360"))
combined.data$haplotype <- factor(combined.data$haplotype,levels=c("M","AA"))

vars <- data.frame(expand.grid(unique(combined.data$gene), unique(combined.data$type)))
colnames(vars) <- c("gene", "type")
annot <- data.frame(x = rep(-Inf, dim(vars)[1]), y = rep(Inf, dim(vars)[1]), vars,labs=LETTERS[1:dim(vars)[1]])

Finally, Figure 5 can be plotted. This figure will reflect the fold increase of each measurement compared to the mean of the measurement in RAL_776. Each gene will be considered in a separate panel along with information on whether the measurements were made by real time PCR in the digestive tissues (measured in this study) or whether it was obtained from Huang et. al (2015) and reflects whole body adult transcriptomes.

gp <- ggplot(data=combined.data,aes(x=genotype,y=relative.expression,fill=haplotype))
gp <- gp+facet_grid(type~gene,scales="free_y")
gp <- gp+geom_dotplot(binaxis='y',stackdir="center",dotsize=1.2)
gp <- gp+annotate("rect",xmin=-Inf,xmax=2.5,ymin=-Inf,ymax=Inf,fill="steelblue2",alpha=.3)
gp <- gp+annotate("rect",xmin=2.5,xmax=Inf,ymin=-Inf,ymax=Inf,fill="orangered2",alpha=.3)
gp <- gp+stat_summary(geom="errorbar",colour='black',size=.85,width=.5,fun.data="mean_se")
gp <- gp+stat_summary(fun.y="mean",fun.ymax="mean",fun.ymin="mean", geom="crossbar", width=.5,size=.6)
gp <- gp+geom_dotplot(binaxis='y',stackdir="center",dotsize=1.2,alpha=.25)
gp <- gp+scale_fill_manual(values=c("steelblue","orangered"))
gp <- gp+labs(y="Expression Relative to RAL_776\n",x="\nGenotype")
gp <- gp+geom_text(size=7,data=annot,aes(x, y, label=labs,fontface="bold",family="serif"),inherit.aes=F,vjust=1.2,hjust=-.2)
gp <- gp+rt_theme
print(gp)

The difference between genotypes for Figure 5 can be best estimated by using ANOVA followed by Tukey’s honestly significant difference. The value in each column represents the p-value reflected by each comparison.

TableS4 <- list()
for(t in unique(combined.data$type)){
      for (g in unique(combined.data$gene)){
            sub <- subset(combined.data,type==t & gene==g)
            mod <- with(sub,aov(relative.expression~genotype))
            TableS4[[paste(t,g)]] <- TukeyHSD(x=mod, which='genotype', conf.level=0.95)$genotype[,"p adj"]
      }}
TableS4 <- data.frame(TableS4)
colnames(TableS4) <- c("Dig.Tis.6g1","Dig.Tis.6g2","Dig.Tis.t3","Body.6g1","Body.6g2","Body.t3")
kable(TableS4)
Dig.Tis.6g1 Dig.Tis.6g2 Dig.Tis.t3 Body.6g1 Body.6g2 Body.t3
RAL_843-RAL_776 0.9859360 0.9286098 0.8209813 0.9999962 0.9957079 0.8369315
RAL_138-RAL_776 0.0000000 0.0000123 0.7891801 0.0163567 0.0100758 0.6228155
RAL_360-RAL_776 0.0000001 0.0000548 0.0716257 0.0355823 0.1810873 0.4739834
RAL_138-RAL_843 0.0000000 0.0000311 0.9999078 0.0157518 0.0149550 0.9786830
RAL_360-RAL_843 0.0000001 0.0001298 0.2501097 0.0342727 0.2539638 0.9126743
RAL_360-RAL_138 0.9797418 0.9999089 0.2709829 0.9695555 0.3673628 0.9934465

Moving on from expression data, the capacities of Cyp6g1 and Cyp6g2 to confer resistance can be directly measured by expressing each genes from a common insertion site using GAL4-UAS.

First the data must be clenaed

uas.wiggle <- uas.cyp6g12.wiggle %>% wiggle.parse(write=F) %>% wiggle.rmr(write=F) %>% filter(time==60)
uas.wiggle$genotype <- multi.gsub(uas.wiggle$genotype,c("x"),c(" x "))
uas.wiggle$genotype <- factor(uas.wiggle$genotype,levels=c("HRG4 x UAS-NULL","HRG4 x UAS-Cyp6g1","HRG4 x UAS-Cyp6g2"))

annot <- data.frame(x = rep(-Inf, length(unique(uas.wiggle$dose))), 
                    y = rep(Inf, length(unique(uas.wiggle$dose))), 
                    dose=as.character(unique(uas.wiggle$dose)),
                    labs=LETTERS[1:length(unique(uas.wiggle$dose))])

Then the data can be plotted considering two discriminatory doses (20 and 40ppm) separately.

gp <- ggplot(uas.wiggle,aes(x=genotype,y=rmr,fill=genotype))
gp <- gp+facet_grid(.~dose,scales="fixed")
gp <- gp+geom_dotplot(binaxis='y',stackdir="center",dotsize=1.2)
gp <- gp+stat_summary(geom="errorbar",colour='black',size=.75,width=.4,fun.data="mean_se")
gp <- gp+stat_summary(fun.y="mean",fun.ymax="mean",fun.ymin="mean", geom="crossbar", width=.4,colour='black',size=.6)
gp <- gp+geom_dotplot(binaxis='y',stackdir="center",dotsize=1.2,alpha=.25)
gp <- gp+scale_fill_manual(values=c("grey50","magenta1","orange1")) 
gp <- gp+labs(y="Relative Movement Ratio\n",x="\nGenotype")
gp <- gp+geom_text(size=7,data=annot,aes(x, y, label=labs,fontface="bold",family="serif"),inherit.aes=FALSE,vjust=1.2,hjust=-.2)
gp <- gp+theme_bw(base_size=14,base_family="serif")
gp <- gp+g1KO.wiggle_theme
print(gp)

As was done for Table S4 Tukey’s post hoc HSD can be used to observe the significant differences (or lack thereof) between the three genotypes.

TableS5 <- list()
for (d in unique(uas.wiggle$dose)){
      sub <- subset(uas.wiggle,dose==d)
      mod <- with(sub,aov(rmr~genotype))
      TableS5[[d]] <- TukeyHSD(x=mod, which='genotype', conf.level=0.95)$genotype[,"p adj"] 
      
}
TableS5 <- data.frame(TableS5)
colnames(TableS5) <- c("20ppm","40ppm")
kable(TableS5)
20ppm 40ppm
HRG4 x UAS-Cyp6g1-HRG4 x UAS-NULL 0.0000010 0.0000000
HRG4 x UAS-Cyp6g2-HRG4 x UAS-NULL 0.0001974 0.0004148
HRG4 x UAS-Cyp6g2-HRG4 x UAS-Cyp6g1 0.0706013 0.0000175

It appears that Cyp6g2 can confer resistance to imidacloprid. To see whether this observation is mediated by metabolism HPLC-MS can be applied to a genotype expressing Cyp6g2.

First, the data must be cleaned.

uas.cyp6g2.hplc.data.rearrange <- gather(uas.cyp6g2.hplc.data,key="measurement_type",
                               value="measurement_value",larvae,media)
plot.data <- uas.cyp6g2.hplc.data.rearrange[complete.cases(uas.cyp6g2.hplc.data.rearrange),]
plot.data <- uas.cyp6g2.hplc.data.rearrange[-intersect(which(uas.cyp6g2.hplc.data.rearrange$metabolite=="Imidacloprid"),
                                             which(uas.cyp6g2.hplc.data.rearrange$measurement_type=="media")),]
plot.data$genotype <- factor(plot.data$genotype,levels=c("w1118","Cyp6g2-3a","Cyp6g2-3d"))
plot.data$metabolite <- factor(plot.data$metabolite,levels=c("Imidacloprid","5-OH","Olefin"))
plot.data$measurement_type <- factor(plot.data$measurement_type)
plot.data$measurement_value <- as.numeric(plot.data$measurement_value)
vars <- data.frame(expand.grid(levels(plot.data$metabolite), levels(plot.data$measurement_type)))
colnames(vars) <- c("metabolite", "measurement_type")
annot <- data.frame(x = rep(-Inf, dim(vars)[1]), y = rep(Inf, dim(vars)[1]), vars,labs=LETTERS[1:dim(vars)[1]])


equation.list <- vector("character",6)
for(m in unique(uas.cyp6g2.hplc.data.rearrange$metabolite)){
      for(t in (unique(uas.cyp6g2.hplc.data.rearrange$measurement_type))){
            sub <- subset(uas.cyp6g2.hplc.data.rearrange,metabolite==m & measurement_type==t)
            equation.list[intersect(which(annot$metabolite==m),which(annot$measurement_type==t))] <- equation.extract.hplc("measurement_value","genotype",data=sub,sub=F,bonf=6)
      }}
equation.list[intersect(which(annot$metabolite=="Imidacloprid"),which(annot$mes.type=="media"))] <- ""
annot <- mutate(annot,pval=equation.list)

Then it can be plotted in a similar way to Figure 3.

gp <- ggplot(plot.data,aes(x=genotype,y=measurement_value,fill=genotype))
gp <- gp+facet_grid(metabolite~measurement_type)
gp <- gp+ylim(c(0,2200000))
gp <- gp+geom_dotplot(binaxis='y', stackdir='center',dotsize=1.5)
gp <- gp+ylab("Quantity of Chemical (area under peak)\n")
gp <- gp+stat_summary(geom="errorbar",colour='black',size=.75,width=.4,fun.data="mean_se")
gp <- gp+stat_summary(fun.y="mean",fun.ymax="mean",fun.ymin="mean", geom="crossbar", width=.4,size=.6)
gp <- gp+geom_dotplot(binaxis='y', stackdir='center',dotsize=1.5,alpha=.25) 
gp <- gp+geom_text(size=7,data=annot,aes(x, y, label=labs,fontface="bold",family="serif"),inherit.aes=FALSE,vjust=1.2,hjust=-.2)
gp <- gp+geom_text(size=7,data=annot,aes(1.5, y, label=pval,fontface="bold",family="serif"),inherit.aes=F,vjust=1.2)
gp <- gp+scale_fill_manual(values=c("grey50","orange4")) 
gp <- gp+theme_bw(base_size=14,base_family="serif")
gp <- gp+g1KO.wiggle_theme
print(gp)