CodingQuarryInit.rmd
CodingQuarryInit.rmd
CodingQuarry: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1344-4
This is an annotation system which I was directed to by a review which suggested that this program/system can more accurately predict and annotate effector proteins. It is aimed specifically at fungal genomes. It is primarily designed to incorporate RNA-seq data, which certainly makes sense in regard to accuracy.
It has a “new” pathogen mode, which takes an existing annotation (or RNA-seq data) and searches between existing annotations for more genes that have been missed by a classic pipeline like MAKER or Augustus. It does NOT annotate, only predict coding regions
Section 2.4.1, in the manual which is downloaded along with the program from https://sourceforge.net/projects/codingquarry/, is aimed at a method for dealing with an already-annotated genome. This will “mine the intergenic genome regions for effector-like genes that may have been overlooked”
This seems to be all that this program will run!
This program is NOT so user friendly. Take care reading and interpreting error messages
1 Dependencies
1.1 Biopython
sudo pip install biopython
1.2 SignalP
This must be SignalP 4.1
Modify the signalP script to direct to the signalP4.1 directory as described in the instructions.
Move the “signalp” script into /usr/bin or equivalent directory in global path. That’s it.
2 Installation
section 2.2 of manual
From within the directory
make
Move the binary to /usr/bin
Set a new environmental variable that directs the scripts to a program-specific library:
export QUARRY_PATH=/home/robert/Tools/CodingQuarry_v2.0/QuarryFiles
or equivalent. modify your bashrc file to make this permanent. OR just run it every time you run the program.
sudo nano ~/.bashrc
use with caution!
2.1 Optional python version problems
I had to install biopython and run CodingQuarry throught a python2 conda environment due to…. problems… with biopython installation
3 Input data
- genome fasta
- gff file; I have two versions of this. The original 2016 version that I inherited by whoever worked on this project before and the CLC version which I had exported out of CLC. The CLC version did not work, maybe because it is a newer gff version I don’t know, quick visual comparison did not reveal the different, but CodeingQuarry did not read it properly (CDS were not extracted). The original gff could be read.
4 Running CodingQuarry-PM
/home/robert/Tools/CodingQuarry_v2.0/run_CQ-PM_mine.sh \
/home/robert/Dropbox/Projects/Geosmithia/CodingQuarry/genome.resources/G.morbida.refgenome.fa \
/home/robert/Dropbox/Projects/Geosmithia/CodingQuarry/genome.resources/g.morbida.original.2016.gff
5 Output
Training.. Intron training complete Begin pathogen mode There are only 333 secreted genes to train from. Pathogen mode cannot run with less than 500 secreted genes.
6 Recompiling main binary
can I force it to proceed? This is silly.
Yes, I can try recompiling with altering lines 367 and 971 of CodingQuarry.cpp and simply replace the existing binary, replace “500” with “300” and run it again. This will set the minimum allowable size of the secretome to 300 rather than 500… I’m not sure what this would do, but I’m not overly concerned; we are in the same order of magnitude here.
7 Results
CodingQuarry-PM has detected 1560 new genes… Brief manual inspection confirms that they are indeed “between” existing gene anntotations. Quick investigation of annotatations was done by eggnog; “real” and/or “normal” genes will fit into ortholog groups. ~50% of these new genes do so. Many are quite traditional. I did also a quick screening at dbcan (CaZY) and got ~50 positive hits. In other words, they are there, they are real, and they are important.
JGI compared their own pipeline to MAKER and found a similar proportion of additional genes; https://genome.jgi.doe.gov/programs/fungi/benchmarks.jsf
8 Positive selection
Given the current analysis, these new genes should be screened for positive selction using CodeML M0/M1 and M7/M8 model tests, same as has been done for previous genes (original Maker annotation)
8.1 Collect genes and proteins
This is handled in CLC as was done previously. The original BAM files are already in the CLC environment and do not need to be altered in any way.
First, the reference genome needs to be imported again: import as track the genome fasta and gff.
Then you run a pipeline consisting of:
Extract consensus sequence
Align contigs
Annotate from reference
Extract annotations; using type “CDS” and search “PGN” (PGN is the name used by CQPM). This will only extract the new genes
Save these
Translate to Protein
Save these
Change names such that they only include the strain name and export as a set of multifastas
8.2 Reformat multifastas
Now, each multifasta contains all genes/proteins for each strain. This has to be reconfigured such that each multifasta represents all strain varieties of a single gene/protein.
library(stringr)
gff <- read.csv("CQPM.results/PGN_predictedPass.gff3", sep = "\t", row.names = NULL, skip = 2, header = FALSE, stringsAsFactors = F)
gff <- gff[gff$V3 == "gene",]
gff <- gff[c(1,4,5,9)]
colnames(gff) <- c("scaffold","start","stop","gene")
gene.split <- str_split_fixed(gff$gene, ";",2)
gene.split <- str_split_fixed(gene.split[,1], "=",2)
gff$gene <- as.character(gene.split[,2])
prot.template <- gff[c(4,1)]
prot.template <- unique(prot.template)
prot.template <- prot.template[prot.template$gene!="",] #remove artifactual lines where gene is blank, small problem which becomes amplified later?
kable(head(prot.template))| gene | scaffold | |
|---|---|---|
| 2 | PGN.00007 | scaffold_0 |
| 4 | PGN.00008 | scaffold_0 |
| 9 | PGN.00012 | scaffold_0 |
| 12 | PGN.00013 | scaffold_0 |
| 15 | PGN.00020 | scaffold_0 |
| 19 | PGN.00022 | scaffold_0 |
8.2.1 Import genome protein lists and create a fasta for each protein
8.2.1.1 Create a large data frame, gene X strain, populated by sequence
library(Biostrings)
library(dplyr)
proteins <- list.files("strain.prot/")
strain.prot.concat <- prot.template
for (x in c(1:length(proteins))) {
strain <- Biostrings::readAAStringSet(paste("strain.prot/",proteins[x],sep=""))
gene <- names(strain)
sequence <- paste(strain)
strain <- data.frame(gene,sequence,stringsAsFactors = F)
strain <- unique(strain) #for unclear reasons, a small subset of genes and sequences is duplicated; must remove duplicates
strain <- strain[!duplicated(strain$gene),] #there is a tiny minority of genes which are split, two different seqs for a gene, this creates exponential havoc in the loop
strain.prot.concat <- dplyr::left_join(strain.prot.concat, strain, by = "gene")
colnames(strain.prot.concat)[2+x] <- gsub("\\.fa","",paste(proteins[x]))
}
strain.prot.concat <- strain.prot.concat[-2] #remove a dummy column8.2.2 A simple function for writing fasta from data frame
from: http://bioinfo.umassmed.edu/bootstrappers/guides/main/r_writeFasta.html
writeFasta<-function(data, filename){
fastaLines = c()
for (rowNum in 1:nrow(data)){
fastaLines = c(fastaLines, as.character(paste(">", data[rowNum,"name"], sep = "")))
fastaLines = c(fastaLines,as.character(data[rowNum,"seq"]))
}
fileConn<-file(filename)
writeLines(fastaLines, fileConn)
close(fileConn)
}8.2.3 Pop off rows and turn each into a fasta file for a protein
At this stage, I should try to filter out proteins that have been improperly pulled, bad annotation transfers, results usually in an early termination and then gibberish. This does result in a pipeline that totally ignores early termination, but otherwise greatly improves the analysis
- remove “*" from the ends of the sequences
- remove any sequence that still contains “*"
- remove any sequence row which now contains an “NA”
library(seqinr)
for (x in c(1:nrow(strain.prot.concat))) {
prot.pop <- strain.prot.concat[x,]
prot.pop.t <- data.frame(t(prot.pop),stringsAsFactors = F)
prot.pop.t$name <- rownames(prot.pop.t)
prot.pop.t <- prot.pop.t[-1,]
colnames(prot.pop.t) <- c("seq","name")
rownames(prot.pop.t) <- NULL
prot.pop.t.x <- data.frame(lapply(prot.pop.t, function(x) gsub("\\*$","",x)), stringsAsFactors = FALSE) #remove the terminal "*" from each sequence
prot.pop.t.x2 <- prot.pop.t.x[!(grepl("\\*",prot.pop.t.x$seq)==TRUE),] #delete rows with seqs that contain internal stop codons
prot.pop.t.x3 <- prot.pop.t.x2[!is.na(prot.pop.t.x2$seq)==TRUE,] #delete rows with missing sequences
writeFasta(prot.pop.t.x3, paste("proteins/",strain.prot.concat[x,1],sep=""))
}8.2.4 Repeat for genes
library(Biostrings)
library(dplyr)
proteins <- list.files("strain.genes/")
strain.prot.concat <- prot.template
for (x in c(1:length(proteins))) {
strain <- Biostrings::readAAStringSet(paste("strain.genes/",proteins[x],sep=""))
gene <- names(strain)
sequence <- paste(strain)
strain <- data.frame(gene,sequence,stringsAsFactors = F)
strain <- unique(strain) #for unclear reasons, a small subset of genes and sequences is duplicated; must remove duplicates
strain <- strain[!duplicated(strain$gene),] #there is a tiny minority of genes which are split, two different seqs for a gene, this creates exponential havoc in the loop
strain.prot.concat <- dplyr::left_join(strain.prot.concat, strain, by = "gene")
colnames(strain.prot.concat)[2+x] <- gsub("\\.fa","",paste(proteins[x]))
}
strain.prot.concat <- strain.prot.concat[-2] #remove a dummy columnlibrary(seqinr)
for (x in c(1:nrow(strain.prot.concat))) {
prot.pop <- strain.prot.concat[x,]
prot.pop.t <- data.frame(t(prot.pop),stringsAsFactors = F)
prot.pop.t$name <- rownames(prot.pop.t)
prot.pop.t <- prot.pop.t[-1,]
colnames(prot.pop.t) <- c("seq","name")
rownames(prot.pop.t) <- NULL
prot.pop.t.x <- data.frame(lapply(prot.pop.t, function(x) gsub("\\*$","",x)), stringsAsFactors = FALSE) #remove the terminal "*" from each sequence
prot.pop.t.x2 <- prot.pop.t.x[!(grepl("\\*",prot.pop.t.x$seq)==TRUE),] #delete rows with seqs that contain internal stop codons
prot.pop.t.x3 <- prot.pop.t.x2[!is.na(prot.pop.t.x2$seq)==TRUE,] #delete rows with missing sequences
writeFasta(prot.pop.t.x3, paste("genes/",strain.prot.concat[x,1],sep=""))
}There are now two folders, “proteins” and “genes”, each of which contains ~1,550 multifastas with a sequence for each strain. The genes are fed into the selection pipeline
8.3 Build alignments
Each multifasta is aligned using MAFFT in –auto mode
The script is described here: https://github.com/rwmurdoch/population.gene.trees/blob/master/align.iterate.sh
8.4 Build trees
These are tree’d with FastTree, which is the best in non-ML tree-building programs (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/).
This script is described here, minor modifications having been made for the current project.
8.5 CodeML via ETE3
ETE3 includes a simpler implementation of CodeML. Installation instructions here: http://etetoolkit.org/download/.
I have created iterative scripts (https://github.com/rwmurdoch/ETE3_CodeML_testing) which do dN/dS testing and positive selection tests.
I could not get ETE3 to run on my Ubutnu 18 system; I had to use my previous OS, Ubuntu 16. I don’t think it was an ubuntu issue though, more likely an issue with my Python/Anaconda installation. In any case, it ran successfully, taking ~4 days to complete the PS tests.
Before proceeding, you must manually remove any instances of "*" from the M2vM1 and M8vM7 results files.
8.6 Importing CodeML results
My ETE3 CodeML scripts produced a single large csv table for each test (M0,M2vM1,M8vM7).
M0 <- read.csv("ETE3_results_genetree_M0.csv", stringsAsFactors = F, header = F)
colnames(M0) <- c("gene","dNdS")
M12 <- read.csv("ETE3_genetree_12_results.csv", stringsAsFactors = F, header = F)
colnames(M12) <- c("gene","lnL_M1","lnL_M2","p_value_12","w_M1","w_M2")
M12$p_value_12 <- as.numeric(M12$p_value_12)
M78 <- read.csv("ETE3_results_genetree_87.csv", stringsAsFactors = F, header = F)
colnames(M78) <- c("gene","lnL_M7","lnL_M8","p_value_78","w_M7","w_M8")
M78$p_value_78 <- as.numeric(M78$p_value_78)8.7 Calculate FDR and reformat table
M12$'M1-vs-M2' <- M12$lnL_M2 - M12$lnL_M1
M12.s <- M12[c(1,7,4)]
M12.s$fdr_12 <- p.adjust(M12$p_value_12, method = "fdr")
M78$'M7-vs-M8' <- M78$lnL_M8 - M78$lnL_M7
M78.s <- M78[c(1,7,4)]
M78.s$fdr_78 <- p.adjust(M78$p_value_78, method = "fdr")8.8 Combine all of the new results
include reference genome protein sequence also
library(dplyr)
PM.PS <- dplyr::left_join(M12.s, M78.s, by="gene")
PM.PS <- left_join(PM.PS, M0, by="gene")
PM.PS <- PM.PS[order(PM.PS$fdr_78),]
library(Biostrings)
PM.proteins <- readAAStringSet("CQPM.results/PGN_predicted_CDS.faa")
PM.seqs <- paste(PM.proteins)
PM.names <- names(PM.proteins)
PM.proteins <- data.frame(PM.names, PM.seqs)
colnames(PM.proteins) <- c("gene","seq")
PM.PS <- left_join(PM.PS, PM.proteins, by="gene")
write.csv(PM.PS,"PM.PS.full.results.csv",row.names = F)
datatable(PM.PS)8.9 Filtering out only signficant PS-selected genes
PS genes are judged as those with M87 test result less than FDR 0.05.
Write the sub-table for working in with the rest of the PS genes and write a protein multifasta
PM.PS.sig <- PM.PS[complete.cases(PM.PS$fdr_78),]
PM.PS.sig <- PM.PS.sig[PM.PS.sig$fdr_78 <= 0.05,]
datatable(PM.PS.sig)write.csv(PM.PS.sig,"PM.PS.significant.csv",row.names = F)
writeFasta<-function(data, filename){
fastaLines = c()
for (rowNum in 1:nrow(data)){
fastaLines = c(fastaLines, as.character(paste(">", data[rowNum,"name"], sep = "")))
fastaLines = c(fastaLines,as.character(data[rowNum,"seq"]))
}
fileConn<-file(filename)
writeLines(fastaLines, fileConn)
close(fileConn)
}
name <- PM.PS.sig$gene
seq <- PM.PS.sig$seq
PS.signif.proteins <- data.frame(name,seq)
writeFasta(PS.signif.proteins,"PM.PS.proteins.faa")9 Annotation of PM PS proteins
The 42 PM PS genes are annoated at WebMGA for COG, KOG, TigrFAM, and pfam and at KEGG:GhostKoala for KO.
The KEGG and COG results are added to the PS results table
library(dplyr)
KEGG <- read.csv("PM.PS.annotations/KEGG/user_ko.txt",sep="\t",header=F)
colnames(KEGG) = c("gene","KO")
COG <- read.csv("PM.PS.annotations/cog/cog-raw.txt",sep="\t",header=F)
COG<-COG[c(1,13,15)]
colnames(COG) <- c("gene","COG","COG.desc")
PS.signif.proteins.2 <- dplyr::left_join(PM.PS.sig,KEGG,by="gene")
PS.signif.proteins.2 <- dplyr::left_join(PS.signif.proteins.2,COG,by="gene")
PS.signif.proteins.2 <- PS.signif.proteins.2[c(1,2,3,4,5,6,7,8,10,11,12,9)]
write.csv(PS.signif.proteins.2,"PM.PS.significant.csv",row.names = F)
datatable(PS.signif.proteins.2)