CodingQuarryInit.rmd

CodingQuarry: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1344-4

This is an annotation system that I was directed to by a review suggesting that this program can more accurately predict and annotate effector proteins. It is aimed specifically at fungal genomes and is primarily designed to incorporate RNA-seq data, which certainly makes sense with regard to accuracy.

It has a “new” pathogen mode, which takes an existing annotation (or RNA-seq data) and searches between existing annotations for additional genes that have been missed by a classic pipeline like MAKER or Augustus. Note that it does NOT annotate; it only predicts coding regions.

Section 2.4.1 of the manual, which is downloaded along with the program from https://sourceforge.net/projects/codingquarry/, describes a method for dealing with an already-annotated genome. This will “mine the intergenic genome regions for effector-like genes that may have been overlooked.”

This appears to be the only mode that this program will run!

This program is NOT very user friendly; take care when reading and interpreting error messages.

1 Dependencies

1.1 Biopython

sudo pip install biopython

1.2 SignalP

This must be SignalP 4.1.

Modify the signalp script to point to the SignalP 4.1 directory, as described in the instructions.

Move the “signalp” script into /usr/bin or an equivalent directory on the global path. That’s it.

2 Installation

See Section 2.2 of the manual.

From within the program directory, run:

make

Move the binary to /usr/bin

Set a new environment variable that points the scripts to the program-specific library:

export QUARRY_PATH=/home/robert/Tools/CodingQuarry_v2.0/QuarryFiles

or equivalent. Modify your .bashrc file to make this permanent, OR just run the export command every time you run the program.

sudo nano ~/.bashrc

Use with caution!

2.1 Optional: Python version problems

I had to install Biopython and run CodingQuarry through a Python 2 conda environment due to (unresolved) problems with the Biopython installation.

3 Input data

  1. genome fasta
  2. gff file; I have two versions of this: the original 2016 version that I inherited from whoever worked on this project before, and the CLC version which I exported out of CLC. The CLC version did not work, perhaps because it is a newer gff version; a quick visual comparison did not reveal the difference, but CodingQuarry did not read it properly (CDS features were not extracted). The original gff could be read. A quick sanity check for this is sketched below.
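
For reference, here is a minimal R sketch (not part of the original workflow) for checking whether a gff actually contains CDS features; the file name is a stand-in for whichever gff version is being tested.

# Quick gff sanity check: tabulate feature types (column 3) and confirm CDS rows exist.
# "g.morbida.original.2016.gff" is a stand-in for whichever gff version is being tested.
gff.check <- read.delim("g.morbida.original.2016.gff", header = FALSE,
                        comment.char = "#", stringsAsFactors = FALSE)
table(gff.check$V3)        # should include "CDS" alongside "gene", "mRNA", etc.
sum(gff.check$V3 == "CDS") # zero here would explain CodingQuarry failing to extract CDS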

4 Running CodingQuarry-PM

/home/robert/Tools/CodingQuarry_v2.0/run_CQ-PM_mine.sh \
/home/robert/Dropbox/Projects/Geosmithia/CodingQuarry/genome.resources/G.morbida.refgenome.fa \
/home/robert/Dropbox/Projects/Geosmithia/CodingQuarry/genome.resources/g.morbida.original.2016.gff

5 Output

Training..
Intron training complete
Begin pathogen mode
There are only 333 secreted genes to train from.
Pathogen mode cannot run with less than 500 secreted genes.

6 Recompiling the main binary

Can I force it to proceed? This is silly.

Yes: I can try recompiling after altering lines 367 and 971 of CodingQuarry.cpp, replacing “500” with “300”, and simply replacing the existing binary before running it again. This sets the minimum allowable size of the secretome to 300 rather than 500. I’m not sure what effect this has, but I’m not overly concerned; we are in the same order of magnitude here.
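
The edit can also be scripted; a minimal sketch in R, assuming the literal “500” appears on those two lines of CodingQuarry.cpp as described above:

# Sketch: patch the secretome-size threshold in the source before recompiling.
# Assumes the literal 500 occurs on lines 367 and 971 of CodingQuarry.cpp.
src <- readLines("CodingQuarry.cpp")
src[c(367, 971)] <- gsub("500", "300", src[c(367, 971)])
writeLines(src, "CodingQuarry.cpp")
# then re-run `make` and replace the existing binary as in the Installation section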

7 Results

CodingQuarry-PM detected 1560 new genes. Brief manual inspection confirms that they are indeed “between” existing gene annotations. A quick investigation of annotations was done with eggNOG; “real” and/or “normal” genes will fit into ortholog groups, and ~50% of these new genes do so. Many are quite traditional. I also did a quick screening at dbCAN (CAZy) and got ~50 positive hits. In other words, they are there, they are real, and they are important.

JGI compared their own pipeline to MAKER and found a similar proportion of additional genes; https://genome.jgi.doe.gov/programs/fungi/benchmarks.jsf

8 Positive selection

Given the current analysis, these new genes should be screened for positive selection using CodeML model tests (M1 vs M2 and M7 vs M8, plus M0 for overall dN/dS), the same as was done for the previously annotated genes (the original MAKER annotation).

8.1 Collect genes and proteins

This is handled in CLC as was done previously. The original BAM files are already in the CLC environment and do not need to be altered in any way.

First, the reference genome needs to be imported again: import the genome fasta and the gff as tracks.

Then you run a pipeline consisting of:

  1. Extract consensus sequence

  2. Align contigs

  3. Annotate from reference

  4. Extract annotations, using type “CDS” and searching for “PGN” (PGN is the gene-name prefix used by CQ-PM). This will extract only the new genes

  5. Save these

  6. Translate to Protein

  7. Save these

  8. Change names such that they only include the strain name and export as a set of multifastas

8.2 Reformat multifastas

Now, each multifasta contains all genes/proteins for one strain. This has to be reconfigured such that each multifasta contains all strain variants of a single gene/protein.

library(stringr)
library(knitr) # for kable()
gff <- read.csv("CQPM.results/PGN_predictedPass.gff3", sep = "\t", row.names = NULL, skip = 2, header = FALSE, stringsAsFactors = F)
gff <- gff[gff$V3 == "gene",]
gff <- gff[c(1,4,5,9)]
colnames(gff) <- c("scaffold","start","stop","gene")
gene.split <- str_split_fixed(gff$gene, ";",2)
gene.split <- str_split_fixed(gene.split[,1], "=",2)
gff$gene <- as.character(gene.split[,2])
prot.template <- gff[c(4,1)]
prot.template <- unique(prot.template)
prot.template <- prot.template[prot.template$gene!="",] #remove artifactual lines where gene is blank, small problem which becomes amplified later?
kable(head(prot.template))
gene scaffold
2 PGN.00007 scaffold_0
4 PGN.00008 scaffold_0
9 PGN.00012 scaffold_0
12 PGN.00013 scaffold_0
15 PGN.00020 scaffold_0
19 PGN.00022 scaffold_0

8.2.1 Import genome protein lists and create a fasta for each protein

8.2.1.1 Create a large data frame, gene X strain, populated by sequence

library(Biostrings)
library(dplyr)
proteins <- list.files("strain.prot/")
strain.prot.concat <- prot.template
for (x in c(1:length(proteins))) {
  strain <- Biostrings::readAAStringSet(paste("strain.prot/",proteins[x],sep=""))
  gene <- names(strain)
  sequence <- paste(strain)
  strain <- data.frame(gene,sequence,stringsAsFactors = F)
  strain <- unique(strain) #for unclear reasons, a small subset of genes and sequences is duplicated; must remove duplicates
  strain <- strain[!duplicated(strain$gene),] #there is a tiny minority of genes which are split, two different seqs for a gene, this creates exponential havoc in the loop
  strain.prot.concat <- dplyr::left_join(strain.prot.concat, strain, by = "gene")
  colnames(strain.prot.concat)[2+x] <- gsub("\\.fa","",paste(proteins[x]))
}

strain.prot.concat <- strain.prot.concat[-2] #remove the scaffold column, which was only needed for the template

8.2.2 A simple function for writing fasta from data frame

from: http://bioinfo.umassmed.edu/bootstrappers/guides/main/r_writeFasta.html

# write a data frame with "name" and "seq" columns out as a multifasta file
writeFasta<-function(data, filename){
  fastaLines = c()
  for (rowNum in 1:nrow(data)){
    fastaLines = c(fastaLines, as.character(paste(">", data[rowNum,"name"], sep = "")))
    fastaLines = c(fastaLines,as.character(data[rowNum,"seq"]))
  }
  fileConn<-file(filename)
  writeLines(fastaLines, fileConn)
  close(fileConn)
}

8.2.3 Pop off rows and turn each into a fasta file for a protein

At this stage, I also filter out proteins that have been improperly pulled (bad annotation transfers), which usually show up as an early termination followed by gibberish. This means the pipeline totally ignores early terminations, but it otherwise greatly improves the analysis:

  1. remove “*” from the ends of the sequences
  2. remove any sequence that still contains “*”
  3. remove any sequence row which now contains an “NA”
library(seqinr)
for (x in c(1:nrow(strain.prot.concat))) { 
  prot.pop <- strain.prot.concat[x,]
  prot.pop.t <- data.frame(t(prot.pop),stringsAsFactors = F)
  prot.pop.t$name <- rownames(prot.pop.t)
  prot.pop.t <- prot.pop.t[-1,]
  colnames(prot.pop.t) <- c("seq","name")
  rownames(prot.pop.t) <- NULL
  prot.pop.t.x <- data.frame(lapply(prot.pop.t, function(x) gsub("\\*$","",x)), stringsAsFactors = FALSE) #remove the terminal "*" from each sequence
  prot.pop.t.x2 <- prot.pop.t.x[!(grepl("\\*",prot.pop.t.x$seq)==TRUE),] #delete rows with seqs that contain internal stop codons
  prot.pop.t.x3 <- prot.pop.t.x2[!is.na(prot.pop.t.x2$seq)==TRUE,] #delete rows with missing sequences
  writeFasta(prot.pop.t.x3, paste("proteins/",strain.prot.concat[x,1],sep=""))
}

8.2.4 Repeat for genes

library(Biostrings)
library(dplyr)
proteins <- list.files("strain.genes/")
strain.prot.concat <- prot.template
for (x in c(1:length(proteins))) {
  strain <- Biostrings::readDNAStringSet(paste("strain.genes/",proteins[x],sep="")) #these are nucleotide sequences, so readDNAStringSet rather than readAAStringSet
  gene <- names(strain)
  sequence <- paste(strain)
  strain <- data.frame(gene,sequence,stringsAsFactors = F)
  strain <- unique(strain) #for unclear reasons, a small subset of genes and sequences is duplicated; must remove duplicates
  strain <- strain[!duplicated(strain$gene),] #there is a tiny minority of genes which are split, two different seqs for a gene, this creates exponential havoc in the loop
  strain.prot.concat <- dplyr::left_join(strain.prot.concat, strain, by = "gene")
  colnames(strain.prot.concat)[2+x] <- gsub("\\.fa","",paste(proteins[x]))
}

strain.prot.concat <- strain.prot.concat[-2] #remove the scaffold column, which was only needed for the template
library(seqinr)
for (x in c(1:nrow(strain.prot.concat))) { 
  prot.pop <- strain.prot.concat[x,]
  prot.pop.t <- data.frame(t(prot.pop),stringsAsFactors = F)
  prot.pop.t$name <- rownames(prot.pop.t)
  prot.pop.t <- prot.pop.t[-1,]
  colnames(prot.pop.t) <- c("seq","name")
  rownames(prot.pop.t) <- NULL
  prot.pop.t.x <- data.frame(lapply(prot.pop.t, function(x) gsub("\\*$","",x)), stringsAsFactors = FALSE) #remove the terminal "*" from each sequence
  prot.pop.t.x2 <- prot.pop.t.x[!(grepl("\\*",prot.pop.t.x$seq)==TRUE),] #delete rows with seqs that contain internal stop codons
  prot.pop.t.x3 <- prot.pop.t.x2[!is.na(prot.pop.t.x2$seq)==TRUE,] #delete rows with missing sequences
  writeFasta(prot.pop.t.x3, paste("genes/",strain.prot.concat[x,1],sep=""))
}

There are now two folders, “proteins” and “genes”, each of which contains ~1,550 multifastas with a sequence for each strain. The gene multifastas are fed into the selection pipeline.

8.3 Build alignments

Each multifasta is aligned using MAFFT in --auto mode.

The script is described here: https://github.com/rwmurdoch/population.gene.trees/blob/master/align.iterate.sh
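
For reference, here is a minimal R sketch of the same loop (the actual run used the bash script linked above); it assumes mafft is on the PATH, the unaligned gene multifastas are in “genes/”, and alignments are written to “alignments/”.

# Sketch of the alignment loop; the real run used align.iterate.sh (linked above).
# Assumes mafft is on the PATH and the unaligned multifastas are in "genes/".
dir.create("alignments", showWarnings = FALSE)
for (f in list.files("genes", full.names = TRUE)) {
  system2("mafft", args = c("--auto", f),
          stdout = file.path("alignments", basename(f)))
}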

8.4 Build trees

These are tree’d with FastTree, which builds approximately-maximum-likelihood trees quickly and accurately (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/).

This script is described here; minor modifications were made for the current project.
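
A minimal R sketch of the tree-building loop, assuming FastTree is on the PATH, the nucleotide alignments from the previous step are in “alignments/”, and trees are written to “trees/”:

# Sketch of the FastTree loop; assumes FastTree is on the PATH and that the
# alignments from the previous step are nucleotide alignments in "alignments/".
dir.create("trees", showWarnings = FALSE)
for (f in list.files("alignments", full.names = TRUE)) {
  system2("FastTree", args = c("-nt", "-gtr", f),
          stdout = file.path("trees", paste0(basename(f), ".nwk")))
}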

8.5 CodeML via ETE3

ETE3 includes a simpler interface for running CodeML. Installation instructions are here: http://etetoolkit.org/download/.

I have created iterative scripts (https://github.com/rwmurdoch/ETE3_CodeML_testing) which do dN/dS testing and positive selection tests.

I could not get ETE3 to run on my Ubuntu 18 system; I had to use my previous OS, Ubuntu 16. I don’t think it was an Ubuntu issue though, more likely an issue with my Python/Anaconda installation. In any case, it ran successfully, taking ~4 days to complete the PS tests.

Before proceeding, you must manually remove any instances of "*" from the M2vM1 and M8vM7 results files.
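
Alternatively, the same cleanup can be scripted; a minimal sketch, using the file names read in the next section:

# Alternative to manual editing: strip stray "*" characters from the results
# files before importing them (file names match those used below).
for (f in c("ETE3_genetree_12_results.csv", "ETE3_results_genetree_87.csv")) {
  writeLines(gsub("\\*", "", readLines(f)), f)
}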

8.6 Importing CodeML results

My ETE3 CodeML scripts produced a single large CSV table for each test (M0, M2 vs M1, M8 vs M7).

M0 <- read.csv("ETE3_results_genetree_M0.csv", stringsAsFactors = F, header = F)
colnames(M0) <- c("gene","dNdS")
M12 <- read.csv("ETE3_genetree_12_results.csv", stringsAsFactors = F, header = F)
colnames(M12) <- c("gene","lnL_M1","lnL_M2","p_value_12","w_M1","w_M2")
M12$p_value_12 <- as.numeric(M12$p_value_12)
M78 <- read.csv("ETE3_results_genetree_87.csv", stringsAsFactors = F, header = F)
colnames(M78) <- c("gene","lnL_M7","lnL_M8","p_value_78","w_M7","w_M8")
M78$p_value_78 <- as.numeric(M78$p_value_78)
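
As an optional sanity check (not part of the original workflow), the likelihood-ratio p-values can be recomputed from the lnL columns; both nested comparisons (M1 vs M2 and M7 vs M8) are conventionally tested against a chi-squared distribution with 2 degrees of freedom:

# Optional sanity check: recompute the LRT p-values from the log-likelihoods.
# Kept in separate vectors so the column indices used below are unchanged.
p.check.12 <- pchisq(2 * (as.numeric(M12$lnL_M2) - as.numeric(M12$lnL_M1)),
                     df = 2, lower.tail = FALSE)
p.check.78 <- pchisq(2 * (as.numeric(M78$lnL_M8) - as.numeric(M78$lnL_M7)),
                     df = 2, lower.tail = FALSE)
# these should roughly agree with M12$p_value_12 and M78$p_value_78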

8.7 Calculate FDR and reformat table

M12$'M1-vs-M2' <- M12$lnL_M2 - M12$lnL_M1
M12.s <- M12[c(1,7,4)]
M12.s$fdr_12 <- p.adjust(M12$p_value_12, method = "fdr")
M78$'M7-vs-M8' <- M78$lnL_M8 - M78$lnL_M7
M78.s <- M78[c(1,7,4)]
M78.s$fdr_78 <- p.adjust(M78$p_value_78, method = "fdr")

8.8 Combine all of the new results

The reference genome protein sequences are also included.

library(dplyr)
PM.PS <- dplyr::left_join(M12.s, M78.s, by="gene")
PM.PS <- left_join(PM.PS, M0, by="gene")
PM.PS <- PM.PS[order(PM.PS$fdr_78),]

library(Biostrings)
library(DT) # for datatable()
PM.proteins <- readAAStringSet("CQPM.results/PGN_predicted_CDS.faa")
PM.seqs <- paste(PM.proteins)
PM.names <- names(PM.proteins)
PM.proteins <- data.frame(PM.names, PM.seqs, stringsAsFactors = F)
colnames(PM.proteins) <- c("gene","seq")

PM.PS <- left_join(PM.PS, PM.proteins, by="gene")
write.csv(PM.PS,"PM.PS.full.results.csv",row.names = F)
datatable(PM.PS)

8.9 Filtering out only significant PS-selected genes

PS genes are judged as those with an M7-vs-M8 test FDR ≤ 0.05.

Write the sub-table so it can be combined with the rest of the PS genes, and write a protein multifasta.

PM.PS.sig <- PM.PS[complete.cases(PM.PS$fdr_78),]
PM.PS.sig <- PM.PS.sig[PM.PS.sig$fdr_78 <= 0.05,]
datatable(PM.PS.sig)
write.csv(PM.PS.sig,"PM.PS.significant.csv",row.names = F)

writeFasta<-function(data, filename){
  fastaLines = c()
  for (rowNum in 1:nrow(data)){
    fastaLines = c(fastaLines, as.character(paste(">", data[rowNum,"name"], sep = "")))
    fastaLines = c(fastaLines,as.character(data[rowNum,"seq"]))
  }
  fileConn<-file(filename)
  writeLines(fastaLines, fileConn)
  close(fileConn)
}

name <- PM.PS.sig$gene
seq <- PM.PS.sig$seq
PS.signif.proteins <- data.frame(name,seq)
writeFasta(PS.signif.proteins,"PM.PS.proteins.faa")

9 Annotation of PM PS proteins

The 42 PM PS genes are annotated at WebMGA for COG, KOG, TIGRFAM, and Pfam, and at KEGG GhostKOALA for KO assignments.

The KEGG and COG results are added to the PS results table:

library(dplyr)
KEGG <- read.csv("PM.PS.annotations/KEGG/user_ko.txt",sep="\t",header=F)
colnames(KEGG) = c("gene","KO")
COG <- read.csv("PM.PS.annotations/cog/cog-raw.txt",sep="\t",header=F)
COG<-COG[c(1,13,15)]
colnames(COG) <- c("gene","COG","COG.desc")
PS.signif.proteins.2 <- dplyr::left_join(PM.PS.sig,KEGG,by="gene")
PS.signif.proteins.2 <- dplyr::left_join(PS.signif.proteins.2,COG,by="gene")
PS.signif.proteins.2 <- PS.signif.proteins.2[c(1,2,3,4,5,6,7,8,10,11,12,9)]
write.csv(PS.signif.proteins.2,"PM.PS.significant.csv",row.names = F)
datatable(PS.signif.proteins.2)

RWMurdoch

May 7, 2019