This script is used for the AA (non-synonomous) variants pipeline
The goal of this pipeline is to characterize alleles in terms of diversity and functional annotation
This script takes the output from the “5.loci.to.genes.aa.rmd” script, which is a map of variant loci to genes. Then this script creates gene to KEGG and COG annotations (COG is currently not used) which is used later in the script to place genes into pathways.
This script also determines allelic counts across the project by generating locus-fingerprints across each gene. This allows for determination of unique fingerprint (allele) counts for each gene.
The variant genes are explored functionally by placement into KEGG pathways (modules)
Inputs: * geosmithia_KEGG.txt: these are gene to KEGG annotations that were obtained at the KEGG mapper web service * KEGG_BRITE2.csv: this is a map of KO to higher hierarchicical categories. This was downloaded from KEGG/BRITE and pre-processed in Excel * cog-raw.txt: This maps genes to COGs and COG categories; this was obtained through the EggNOG sequencemapper webservice * aa.loci.to.genes.csv: this is the output from loci.to.genes.rmd, a map of variant loci to gene names along with a matrix of occurence in strains
Outputs: * full_gene_allele_count_aa.csv: this is a simple two column table that shows allele count for each gene. This includes genes with no variants. This is fed later into the sequence diversity pipeline for further characterization * allele count histograms (not written, save out of the html rmarkdown)
library(stringr)
gff <- read.csv("../g.morbida_CDS.gff3", sep = "\t", row.names = NULL, skip = 2, header = FALSE)
gff <- gff[c(1,4,5,9)]
colnames(gff) <- c("scaffold","start","stop","gene")
gene.split <- str_split_fixed(gff$gene, ";",2)
gene.split <- str_split_fixed(gene.split[,1], "=",2)
gene.split <- str_split_fixed(gene.split[,2], ":",2)
gff$gene <- as.character(gene.split[,1])
kable(head(gff),format="markdown")
| scaffold | start | stop | gene |
|---|---|---|---|
| scaffold_0 | 674 | 1516 | augustus_masked-scaffold_0-processed-gene-0.2274-mRNA-1 |
| scaffold_0 | 9081 | 12438 | maker-scaffold_0-augustus-gene-0.2665-mRNA-1 |
| scaffold_0 | 12504 | 13645 | maker-scaffold_0-augustus-gene-0.2665-mRNA-1 |
| scaffold_0 | 14732 | 15859 | augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1 |
| scaffold_0 | 16219 | 16581 | maker-scaffold_0-augustus-gene-0.2819-mRNA-1 |
| scaffold_0 | 16781 | 17263 | maker-scaffold_0-augustus-gene-0.2819-mRNA-1 |
KEGG forms a good foundation for building the database because every gene has an associated annotation (in most cases empty)
kegg.raw <- read.csv("annotations/geosmithia_KEGG.txt", sep ="\t", row.names = NULL, header = FALSE)
colnames(kegg.raw) <- c("gene","KO")
kable(kegg.raw[1:10,])
| gene | KO |
|---|---|
| snap_masked-scaffold_25-processed-gene-0.20-mRNA-1 | K14838 |
| maker-scaffold_25-augustus-gene-0.4-mRNA-1 | |
| snap_masked-scaffold_25-processed-gene-0.23-mRNA-1 | |
| augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1 | |
| snap_masked-scaffold_25-processed-gene-0.21-mRNA-1 | |
| augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1 | |
| augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1 | |
| maker-scaffold_25-augustus-gene-0.7-mRNA-1 | K16261 |
| augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1 | |
| augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1 | K22384 |
The BRITE KEGG hierarchcial map was created manually from the BRITE download.
BRITE <- read.csv("annotations/KEGG_BRITE2.csv",header=TRUE,row.names=NULL)
colnames(BRITE)[c(1,2,3)] <- c("KEGG.tier1","KEGG.tier2","KEGG.pathway")
BRITE <- BRITE[c(4,5,1,2,3)]
BRITE <- BRITE[-which(BRITE$KO == ""), ] #remove any rows with empty KO, removes ~ 1000 rows
kable(head(BRITE))
| KO | KEGG.name | KEGG.tier1 | KEGG.tier2 | KEGG.pathway |
|---|---|---|---|---|
| K00844 | HK; hexokinase [EC:2.7.1.1] | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| K12407 | GCK; glucokinase [EC:2.7.1.2] | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| K00845 | glk; glucokinase [EC:2.7.1.2] | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| K01810 | GPI | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| K06859 | pgi1; glucose-6-phosphate isomerase | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| K13810 | tal-pgi; transaldolase / glucose-6-phosphate isomerase [EC:2.2.1.2 5.3.1.9] | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| BRITE con | tains redundancies because a single KO can map to multiple pathways. This mea | ns that pathw | ay mapping is really not possibl | e at this stage. |
| We can at | tempt to do a one-to-one KO to name mapping for now. We use this much later in | the pipeline | . |
BRITE2 <- data.frame(BRITE[c(1,2)])
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
kegg.raw2 <- kegg.raw[c(2,1)]
gene.annot <- dplyr::left_join(kegg.raw2,BRITE,by="KO")
## Warning: Column `KO` joining factors with different levels, coercing to
## character vector
gene.annot <- gene.annot[c(2,1,3:6)]
kable(gene.annot[1:10,])
| gene | KO | KEGG.name | KEGG.tier1 | KEGG.tier2 | KEGG.pathway |
|---|---|---|---|---|---|
| snap_masked-scaffold_25-processed-gene-0.20-mRNA-1 | K14838 | NOP15; nucleolar protein 15 | BriteHierarchies | 09182 Protein families: genetic information processing | 03009 Ribosome biogenesis [BR:ko03009] |
| maker-scaffold_25-augustus-gene-0.4-mRNA-1 | NA | NA | NA | NA | |
| snap_masked-scaffold_25-processed-gene-0.23-mRNA-1 | NA | NA | NA | NA | |
| augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1 | NA | NA | NA | NA | |
| snap_masked-scaffold_25-processed-gene-0.21-mRNA-1 | NA | NA | NA | NA | |
| augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1 | NA | NA | NA | NA | |
| augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1 | NA | NA | NA | NA | |
| maker-scaffold_25-augustus-gene-0.7-mRNA-1 | K16261 | YAT; yeast amino acid transporter | BriteHierarchies | 09183 Protein families: signaling and cellular processes | 02000 Transporters [BR:ko02000] |
| augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1 | NA | NA | NA | NA | |
| augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1 | K22384 | WRB | BriteHierarchies | 09183 Protein families: signaling and cellular processes | 02000 Transporters [BR:ko02000] |
cog.raw <- read.csv("annotations/cog-raw.txt", sep = "\t", row.names = NULL, header = FALSE)
cog.raw <- cog.raw[c(1,13,14,15)]
colnames(cog.raw) <- c("gene","COG","COG.name","COG.desc")
kable(cog.raw[1:10,])
| gene | COG | COG.name | COG.desc |
|---|---|---|---|
| maker-scaffold_25-augustus-gene-0.4-mRNA-1 | COG1063 | Tdh | Threonine dehydrogenase and related Zn-dependent dehydrogenases |
| augustus_masked-scaffold_7-processed-gene-0.364-mRNA-1 | COG5309 | COG5309 | Exo-beta-1,3-glucanase |
| maker-scaffold_7-augustus-gene-0.1355-mRNA-1 | COG0199 | RpsN | Ribosomal protein S14 |
| maker-scaffold_7-augustus-gene-0.1443-mRNA-1 | COG2515 | Acd | 1-aminocyclopropane-1-carboxylate deaminase |
| maker-scaffold_7-augustus-gene-0.1444-mRNA-1 | COG0300 | DltE | Short-chain dehydrogenases of various substrate specificities |
| maker-scaffold_18-augustus-gene-0.242-mRNA-1 | COG5277 | COG5277 | Actin and related proteins |
| maker-scaffold_19-augustus-gene-0.251-mRNA-1 | COG2046 | MET3 | ATP sulfurylase (sulfate adenylyltransferase) |
| maker-scaffold_19-augustus-gene-0.251-mRNA-1 | COG0529 | CysC | Adenylylsulfate kinase and related kinases |
| maker-scaffold_3-augustus-gene-0.4-mRNA-1 | COG0474 | MgtA | Cation transport ATPase |
| snap_masked-scaffold_3-processed-gene-0.1103-mRNA-1 | COG1946 | TesB | Acyl-CoA thioesterase |
gene.annot2 <- dplyr::left_join(gene.annot,cog.raw,by="gene")
## Warning: Column `gene` joining factors with different levels, coercing to
## character vector
gene.annot2 <- data.frame(sapply(gene.annot2, as.character),stringsAsFactors = FALSE)
gene.annot2[is.na(gene.annot2)] <- ""
write.csv(gene.annot2,"gene.annotations.csv")
kable(gene.annot2[1:10,])
| gene | KO | KEGG.name | KEGG.tier1 | KEGG.tier2 | KEGG.pathway | COG | COG.name | COG.desc |
|---|---|---|---|---|---|---|---|---|
| snap_masked-scaffold_25-processed-gene-0.20-mRNA-1 | K14838 | NOP15; nucleolar protein 15 | BriteHierarchies | 09182 Protein families: genetic information processing | 03009 Ribosome biogenesis [BR:ko03009] | |||
| maker-scaffold_25-augustus-gene-0.4-mRNA-1 | COG1063 | Tdh | Threonine dehydrogenase and related Zn-dependent dehydrogenases | |||||
| snap_masked-scaffold_25-processed-gene-0.23-mRNA-1 | ||||||||
| augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1 | ||||||||
| snap_masked-scaffold_25-processed-gene-0.21-mRNA-1 | COG1196 | Smc | Chromosome segregation ATPases | |||||
| augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1 | ||||||||
| augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1 | ||||||||
| maker-scaffold_25-augustus-gene-0.7-mRNA-1 | K16261 | YAT; yeast amino acid transporter | BriteHierarchies | 09183 Protein families: signaling and cellular processes | 02000 Transporters [BR:ko02000] | COG0833 | LysP | Amino acid transporters |
| augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1 | ||||||||
| augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1 | K22384 | WRB | BriteHierarchies | 09183 Protein families: signaling and cellular processes | 02000 Transporters [BR:ko02000] |
fingerprints for each gene/strain combination are generated by pasting together the 1s and 0s for each strain across each gene. Example output can be seen after the code chunk.
the code is ugly but fast:
aa.var.genes <- read.csv("aa.loci.to.genes.csv", row.names = 1)
aa.allele.barcode <- aa.var.genes[-c(2:5)]
p <- function(v) {
Reduce(f=paste0, x = v)} #define a paste function
aa.allele.bygen <- dplyr::group_by(aa.allele.barcode, gene)
#this can be used to create the fingerprint. unfortunately it does not take arguments in place of names
#of variables to be summarized; thus I cannot loop through it at all
aa.allele.counts <- dplyr::summarize(aa.allele.bygen, GM307 = p(as.character(GM307)),
GM236 = p(as.character(GM236)),
GM219 = p(as.character(GM219)),
GM216 = p(as.character(GM216)),
GM205 = p(as.character(GM205)),
GM196 = p(as.character(GM196)),
GM188 = p(as.character(GM188)),
GM179 = p(as.character(GM179)),
GM178 = p(as.character(GM178)),
GM177 = p(as.character(GM177)),
GM170 = p(as.character(GM170)),
GM158 = p(as.character(GM158)),
GM140 = p(as.character(GM140)),
GM118 = p(as.character(GM118)),
GM108 = p(as.character(GM108)),
GM64 = p(as.character(GM64)),
GM59 = p(as.character(GM59)),
GM39 = p(as.character(GM39)),
GM25 = p(as.character(GM25)),
GM20 = p(as.character(GM20)),
GM17 = p(as.character(GM17)),
GM11 = p(as.character(GM11)))
aa.allele.counts2 <- aa.allele.counts
aa.allele.counts2$count <- apply(aa.allele.counts2[2:23], 1, function(x)length(unique(x))) #this counts unique values in the fingerprint section
kable(aa.allele.counts2[c(1:5),c(1:3)])
| gene | GM307 | GM236 |
|---|---|---|
| augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1 | 000000000000000000000000000 | 000000000000000000000000000 |
| augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1 | 00 | 00 |
| augustus_masked-scaffold_0-processed-gene-0.1938-mRNA-1 | 000000000000000 | 000000000000000 |
| augustus_masked-scaffold_0-processed-gene-0.1944-mRNA-1 | 00 | 00 |
| augustus_masked-scaffold_0-processed-gene-0.1953-mRNA-1 | 0000000000 | 0000000000 |
The table above demonstrates the variant loci fingerprint process.
Next we create a list of number of distinct alleles (different fingerprints) for each gene:
rm(aa.allele.counts3)
## Warning in rm(aa.allele.counts3): object 'aa.allele.counts3' not found
aa.allele.counts3 <- aa.allele.counts2[,c(1,24)]
write.csv(aa.allele.counts3, "aa.allele.counts.csv")
kable(head(aa.allele.counts3))
| gene | count |
|---|---|
| augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1 | 12 |
| augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1 | 3 |
| augustus_masked-scaffold_0-processed-gene-0.1938-mRNA-1 | 6 |
| augustus_masked-scaffold_0-processed-gene-0.1944-mRNA-1 | 1 |
| augustus_masked-scaffold_0-processed-gene-0.1953-mRNA-1 | 4 |
| augustus_masked-scaffold_0-processed-gene-0.1957-mRNA-1 | 3 |
We need to be sure to add in genes with no variant alleles. Up until now, only variants are included. Make an empty table from the gff and join it with the counts, then replace NA values with 1
library(dplyr)
#make the empty table with all genes
aa.gff.build <- gff
aa.gff.build$prox <- 0
aa.gff.build <- aa.gff.build[c(4,5)]
aa.gff.build <- unique(aa.gff.build)
#join in the counts
aa.gff.build <- dplyr::left_join(aa.gff.build, aa.allele.counts3, by = "gene")
## Warning: Column `gene` joining character vector and factor, coercing into
## character vector
#replace NA with 1
aa.gff.build[is.na(aa.gff.build)] <- 1
aa.gff.build <- aa.gff.build[c(1,3)]
From which we can make a distribution histogram of allele counts.
library(ggplot2)
hist.allele.count.aa <- ggplot(aa.gff.build, aes(aa.gff.build$count)) + geom_histogram(binwidth=1, color = "white") + theme_classic()
hist.allele.count.aa <- hist.allele.count.aa + xlab("Alleles") + ylab("Genes")
hist.allele.count.aa
write the full gene non-synonymous allele count table to a file
aa.full.gene.allele.count <- dplyr::left_join(kegg.raw, aa.allele.counts3, by="gene")
## Warning: Column `gene` joining factors with different levels, coercing to
## character vector
aa.full.gene.allele.count[is.na(aa.full.gene.allele.count)] <- 1 #replace NA with 1
write.csv(aa.full.gene.allele.count,"full_gene_allele_count_aa.csv")
Let’s attach functional information where available and restrict ourselves to genes with predicted function (and sort by allele counts)
library(dplyr)
aa.full.gene.allele.count$gene <- as.character(aa.full.gene.allele.count$gene)
allele.counts.KEGG <- dplyr::left_join(aa.full.gene.allele.count, gene.annot2[c(1:6)])
## Joining, by = c("gene", "KO")
## Warning: Column `KO` joining factor and character vector, coercing into
## character vector
allele.counts.KEGG <- allele.counts.KEGG[order(allele.counts.KEGG$count, decreasing = TRUE),]
kable(head(allele.counts.KEGG))
| gene | KO | count | KEGG.name | KEGG.tier1 | KEGG.tier2 | KEGG.pathway | |
|---|---|---|---|---|---|---|---|
| 9945 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | Metabolism | 09104 Nucleotide metabolism | 00230 Purine metabolism [PATH:ko00230] |
| 9946 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | Metabolism | 09104 Nucleotide metabolism | 00230 Purine metabolism [PATH:ko00230] |
| 9947 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | CellularProcesses | 09143 Cell growth and death | 04113 Meiosis - yeast [PATH:ko04113] |
| 9948 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | CellularProcesses | 09143 Cell growth and death | 04113 Meiosis - yeast [PATH:ko04113] |
| 9949 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | CellularProcesses | 09145 Cellular community - prokaryotes | 02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025] |
| 9950 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | CellularProcesses | 09145 Cellular community - prokaryotes | 02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025] |
First, we attach pathway information: note that that at this stage, we lose a hard count of variant genes since many KO can map to multiple pathways
allele.counts.KEGG.pres <- apply(allele.counts.KEGG, 2, function(x) gsub("^$|^ $", NA, x)) #this makes sure that any empty cells are NA
allele.counts.KEGG.pres <- data.frame(na.omit(allele.counts.KEGG.pres),stringsAsFactors = FALSE) #this removes all rows with no KEGG
allele.counts.KEGG.pres$count <- as.numeric(allele.counts.KEGG.pres$count)
write.csv(allele.counts.KEGG.pres, "aas.gene.allele.counts.csv")
kable(head(allele.counts.KEGG.pres))
| gene | KO | count | KEGG.name | KEGG.tier1 | KEGG.tier2 | KEGG.pathway | |
|---|---|---|---|---|---|---|---|
| 9945 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | Metabolism | 09104 Nucleotide metabolism | 00230 Purine metabolism [PATH:ko00230] |
| 9946 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | Metabolism | 09104 Nucleotide metabolism | 00230 Purine metabolism [PATH:ko00230] |
| 9947 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | CellularProcesses | 09143 Cell growth and death | 04113 Meiosis - yeast [PATH:ko04113] |
| 9948 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | CellularProcesses | 09143 Cell growth and death | 04113 Meiosis - yeast [PATH:ko04113] |
| 9949 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | CellularProcesses | 09145 Cellular community - prokaryotes | 02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025] |
| 9950 | augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | CellularProcesses | 09145 Cellular community - prokaryotes | 02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025] |
cat("KEGG PATHWAY MAPPING SUMMARY: of",nrow(aa.full.gene.allele.count), "total genes,",
nrow(aa.allele.counts3),"show at least one variant relative to the reference genome.",
length(unique(allele.counts.KEGG.pres$gene)), "can be assigned to a KO family. These",
length(unique(allele.counts.KEGG.pres$gene)), "genes can be mapped to",
nrow(allele.counts.KEGG.pres),"specific KEGG pathway connections representing",
length(unique(allele.counts.KEGG.pres$KEGG.pathway)),"unique pathways")
## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 5564 show at least one variant relative to the reference genome. 3270 can be assigned to a KO family. These 3270 genes can be mapped to 9805 specific KEGG pathway connections representing 422 unique pathways
This should be further narrowed down to more highly allelic genes, but first, lets summarize by pathway and determine the most highly impacted pathways in terms of counts of variant genes present. This code chunks just counts pathway name occurances using a summarize function.
KEGG.pathway.var.summary <- allele.counts.KEGG.pres
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
#this can be used to create the fingerprint. unfortunately it does not take arguments in place of names
#of variables to be summarized; thus I cannot loop through it at all
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.no.min <-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary[c(1:25),])
| KEGG.pathway | variant.genes |
|---|---|
| 04131 Membrane trafficking [BR:ko04131] | 336 |
| 03036 Chromosome and associated proteins [BR:ko03036] | 273 |
| 02000 Transporters [BR:ko02000] | 213 |
| 04147 Exosome [BR:ko04147] | 203 |
| 03009 Ribosome biogenesis [BR:ko03009] | 196 |
| 03029 Mitochondrial biogenesis [BR:ko03029] | 182 |
| 03019 Messenger RNA Biogenesis [BR:ko03019] | 160 |
| 03021 Transcription machinery [BR:ko03021] | 148 |
| 03400 DNA repair and recombination proteins [BR:ko03400] | 147 |
| 03041 Spliceosome [BR:ko03041] | 129 |
| 04121 Ubiquitin system [BR:ko04121] | 127 |
| 03011 Ribosome [BR:ko03011] | 123 |
| 03016 Transfer RNA biogenesis [BR:ko03016] | 121 |
| 01002 Peptidases [BR:ko01002] | 111 |
| 03010 Ribosome [PATH:ko03010] | 108 |
| 00230 Purine metabolism [PATH:ko00230] | 96 |
| 99980 Enzymes with EC numbers | 96 |
| 03013 RNA transport [PATH:ko03013] | 84 |
| 03040 Spliceosome [PATH:ko03040] | 81 |
| 03110 Chaperones and folding catalysts [BR:ko03110] | 81 |
| 01009 Protein phosphatase and associated proteins [BR:ko01009] | 79 |
| 04714 Thermogenesis [PATH:ko04714] | 79 |
| 03032 DNA replication proteins [BR:ko03032] | 78 |
| 00240 Pyrimidine metabolism [PATH:ko00240] | 75 |
| 04141 Protein processing in endoplasmic reticulum [PATH:ko04141] | 75 |
| This summary does not take allele count into consideration: a gene | with only 1 or 2 alleles is not very meaningful for this analysis |
This is the same as above but modified to exclude genes that have lower variant counts.
#50% minimum alleles (>10)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 10,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.50 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.50) <- c("KEGG.pathway","50%.alleles")
#75% minimum alleles (>16)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.75 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.75) <- c("KEGG.pathway","75%.alleles")
#90% minimum alleles (>19)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.9,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.90 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.90) <- c("KEGG.pathway","90%.alleles")
#join the pathway counts together with different allelic variation thresholds
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary.no.min, KEGG.pathway.var.summary.50, by = "KEGG.pathway")
## Warning: Column `KEGG.pathway` joining factors with different levels,
## coercing to character vector
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary, KEGG.pathway.var.summary.75, by = "KEGG.pathway")
## Warning: Column `KEGG.pathway` joining character vector and factor,
## coercing into character vector
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary, KEGG.pathway.var.summary.90, by = "KEGG.pathway")
## Warning: Column `KEGG.pathway` joining character vector and factor,
## coercing into character vector
kable(KEGG.pathway.var.summary[c(1:25),])
| KEGG.pathway | variant.genes | 50%.alleles | 75%.alleles | 90%.alleles |
|---|---|---|---|---|
| 04131 Membrane trafficking [BR:ko04131] | 336 | 3 | NA | NA |
| 03036 Chromosome and associated proteins [BR:ko03036] | 273 | 9 | 2 | NA |
| 02000 Transporters [BR:ko02000] | 213 | 2 | NA | NA |
| 04147 Exosome [BR:ko04147] | 203 | 1 | NA | NA |
| 03009 Ribosome biogenesis [BR:ko03009] | 196 | 13 | 2 | NA |
| 03029 Mitochondrial biogenesis [BR:ko03029] | 182 | 3 | 1 | NA |
| 03019 Messenger RNA Biogenesis [BR:ko03019] | 160 | 6 | 1 | NA |
| 03021 Transcription machinery [BR:ko03021] | 148 | 4 | 1 | NA |
| 03400 DNA repair and recombination proteins [BR:ko03400] | 147 | 9 | 1 | NA |
| 03041 Spliceosome [BR:ko03041] | 129 | 3 | NA | NA |
| 04121 Ubiquitin system [BR:ko04121] | 127 | 6 | 1 | NA |
| 03011 Ribosome [BR:ko03011] | 123 | NA | NA | NA |
| 03016 Transfer RNA biogenesis [BR:ko03016] | 121 | 4 | 1 | NA |
| 01002 Peptidases [BR:ko01002] | 111 | 2 | NA | NA |
| 03010 Ribosome [PATH:ko03010] | 108 | NA | NA | NA |
| 00230 Purine metabolism [PATH:ko00230] | 96 | 2 | 2 | 2 |
| 99980 Enzymes with EC numbers | 96 | 3 | NA | NA |
| 03013 RNA transport [PATH:ko03013] | 84 | 1 | NA | NA |
| 03040 Spliceosome [PATH:ko03040] | 81 | NA | NA | NA |
| 03110 Chaperones and folding catalysts [BR:ko03110] | 81 | 2 | NA | NA |
| 01009 Protein phosphatase and associated proteins [BR:ko01009] | 79 | 1 | NA | NA |
| 04714 Thermogenesis [PATH:ko04714] | 79 | NA | NA | NA |
| 03032 DNA replication proteins [BR:ko03032] | 78 | 2 | 1 | NA |
| 00240 Pyrimidine metabolism [PATH:ko00240] | 75 | NA | NA | NA |
| 04141 Protein processing in endoplasmic reticulum [PATH:ko04141] | 75 | 2 | NA | NA |
allele.75 <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
cat("KEGG PATHWAY MAPPING SUMMARY: of",nrow(aa.full.gene.allele.count), "total genes,",
length(unique(allele.75$gene)),"show at least 17 amino-acid or splice site alleles across the 22 strains and can be assigned to a KO family. These",
length(unique(allele.75$gene)), "genes can be mapped to",
nrow(allele.75),"specific KEGG pathway connections representing",
length(unique(allele.counts.KEGG.pres$KEGG.pathway)),"unique pathways")
## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 11 show at least 17 amino-acid or splice site alleles across the 22 strains and can be assigned to a KO family. These 11 genes can be mapped to 29 specific KEGG pathway connections representing 422 unique pathways
KEGG.pathway.75.summary <- KEGG.pathway.var.summary[c(1,4)]
KEGG.pathway.75.summary <- KEGG.pathway.75.summary[order(KEGG.pathway.75.summary$`75%.alleles`,decreasing=TRUE),]
kable(KEGG.pathway.75.summary[1:20,])
| KEGG.pathway | 75%.alleles |
|---|---|
| 04113 Meiosis - yeast [PATH:ko04113] | 3 |
| 03036 Chromosome and associated proteins [BR:ko03036] | 2 |
| 03009 Ribosome biogenesis [BR:ko03009] | 2 |
| 00230 Purine metabolism [PATH:ko00230] | 2 |
| 04213 Longevity regulating pathway - multiple species [PATH:ko04213] | 2 |
| 01008 Polyketide biosynthesis proteins [BR:ko01008] | 2 |
| 02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025] | 2 |
| 03029 Mitochondrial biogenesis [BR:ko03029] | 1 |
| 03019 Messenger RNA Biogenesis [BR:ko03019] | 1 |
| 03021 Transcription machinery [BR:ko03021] | 1 |
| 03400 DNA repair and recombination proteins [BR:ko03400] | 1 |
| 04121 Ubiquitin system [BR:ko04121] | 1 |
| 03016 Transfer RNA biogenesis [BR:ko03016] | 1 |
| 03032 DNA replication proteins [BR:ko03032] | 1 |
| 04111 Cell cycle - yeast [PATH:ko04111] | 1 |
| 03012 Translation factors [BR:ko03012] | 1 |
| 03008 Ribosome biogenesis in eukaryotes [PATH:ko03008] | 1 |
| 01007 Amino acid related enzymes [BR:ko01007] | 1 |
| 04110 Cell cycle [PATH:ko04110] | 1 |
| 00970 Aminoacyl-tRNA biosynthesis [PATH:ko00970] | 1 |
| To be clear, this is a representation of where the most highly allelic | genes are mapped according to the KEGG system. |
This is fascinating, but lets try to put this in two frames of reference: 1. average number of variants per gene across the strains in each pathway 2. total number of genes placing into each pathway in the reference genome, % of pathway genes subject to variation
Let’s create two more tables representing higher levels of organization: X
#tier1
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.tier1 <- as.factor(KEGG.pathway.var.summary$KEGG.tier1)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.tier1) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.tier1<-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary.tier1[c(1:10),])
| KEGG.tier1 | variant.genes |
|---|---|
| BriteHierarchies | 15 |
| CellularProcesses | 7 |
| GeneticInformationProcessing | 3 |
| Metabolism | 2 |
| OrganismalSystems | 2 |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
#tier 2
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.tier2 <- as.factor(KEGG.pathway.var.summary$KEGG.tier2)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.tier2) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.tier2<-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary.tier2[c(1:25),])
| KEGG.tier2 | variant.genes |
|---|---|
| 09182 Protein families: genetic information processing | 12 |
| 09143 Cell growth and death | 5 |
| 09181 Protein families: metabolism | 3 |
| 09104 Nucleotide metabolism | 2 |
| 09122 Translation | 2 |
| 09145 Cellular community - prokaryotes | 2 |
| 09149 Aging | 2 |
| 09124 Replication and repair | 1 |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 5564 show at least one variant relative to the reference genome. 3270 can be assigned to a KO family. These 3270 genes can be mapped to 9805 specific KEGG pathway connections representing 422 unique pathways
## HIGHLY VARIANT ALLELES: of 6273 total genes, 11 show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These 11 genes can be mapped to 29 specific KEGG pathway connections representing 422 unique pathways
kable(KEGG.pathway.75.summary[1:20,])
| KEGG.pathway | 75%.alleles |
|---|---|
| 04113 Meiosis - yeast [PATH:ko04113] | 3 |
| 03036 Chromosome and associated proteins [BR:ko03036] | 2 |
| 03009 Ribosome biogenesis [BR:ko03009] | 2 |
| 00230 Purine metabolism [PATH:ko00230] | 2 |
| 04213 Longevity regulating pathway - multiple species [PATH:ko04213] | 2 |
| 01008 Polyketide biosynthesis proteins [BR:ko01008] | 2 |
| 02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025] | 2 |
| 03029 Mitochondrial biogenesis [BR:ko03029] | 1 |
| 03019 Messenger RNA Biogenesis [BR:ko03019] | 1 |
| 03021 Transcription machinery [BR:ko03021] | 1 |
| 03400 DNA repair and recombination proteins [BR:ko03400] | 1 |
| 04121 Ubiquitin system [BR:ko04121] | 1 |
| 03016 Transfer RNA biogenesis [BR:ko03016] | 1 |
| 03032 DNA replication proteins [BR:ko03032] | 1 |
| 04111 Cell cycle - yeast [PATH:ko04111] | 1 |
| 03012 Translation factors [BR:ko03012] | 1 |
| 03008 Ribosome biogenesis in eukaryotes [PATH:ko03008] | 1 |
| 01007 Amino acid related enzymes [BR:ko01007] | 1 |
| 04110 Cell cycle [PATH:ko04110] | 1 |
| 00970 Aminoacyl-tRNA biosynthesis [PATH:ko00970] | 1 |
write.csv(KEGG.pathway.75.summary, "KEGG.pathway.75")
kable(KEGG.pathway.var.summary.tier1[c(1:8),])
| KEGG.tier1 | variant.genes |
|---|---|
| BriteHierarchies | 15 |
| CellularProcesses | 7 |
| GeneticInformationProcessing | 3 |
| Metabolism | 2 |
| OrganismalSystems | 2 |
| NA | NA |
| NA | NA |
| NA | NA |
write.csv(KEGG.pathway.var.summary.tier1, "KEGG.tier1.75")
kable(KEGG.pathway.var.summary.tier2[c(1:25),])
| KEGG.tier2 | variant.genes |
|---|---|
| 09182 Protein families: genetic information processing | 12 |
| 09143 Cell growth and death | 5 |
| 09181 Protein families: metabolism | 3 |
| 09104 Nucleotide metabolism | 2 |
| 09122 Translation | 2 |
| 09145 Cellular community - prokaryotes | 2 |
| 09149 Aging | 2 |
| 09124 Replication and repair | 1 |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
write.csv(KEGG.pathway.var.summary.tier2, "KEGG.tier2.75")
At this point we want to write a full file that includes gene names with allele counts. Annotation information is more difficult to attach since genes can map to many pathways… however we can restrict it to gene descriptions and ortholog IDs from KEGG and COG.
aa.allelic.genes <- allele.counts.KEGG[c(1:4)]
aa.allelic.genes <- unique(aa.allelic.genes)
aa.allelic.genes <- aa.allelic.genes[order(-aa.allelic.genes$count),]
cog.map <- cog.raw
aa.allelic.genes.annot <- dplyr::left_join(aa.allelic.genes, cog.map, "gene")
## Warning: Column `gene` joining character vector and factor, coercing into
## character vector
write.csv(aa.allelic.genes.annot, "aa.allelic.genes.csv")
kable(head(aa.allelic.genes.annot))
| gene | KO | count | KEGG.name | COG | COG.name | COG.desc |
|---|---|---|---|---|---|---|
| augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | COG4886 | COG4886 | Leucine-rich repeat (LRR) protein |
| augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1 | K01768 | 22 | E4.6.1.1; adenylate cyclase [EC:4.6.1.1] | COG4886 | COG4886 | Leucine-rich repeat (LRR) protein |
| augustus_masked-scaffold_26-processed-gene-0.258-mRNA-1 | 21 | COG0419 | SbcC | ATPase involved in DNA repair | ||
| maker-scaffold_13-augustus-gene-0.555-mRNA-1 | 19 | NA | NA | NA | ||
| augustus_masked-scaffold_15-processed-gene-0.28-mRNA-1 | 19 | NA | NA | NA | ||
| maker-scaffold_11-augustus-gene-0.502-mRNA-1 | K15026 | 19 | EIF2A; translation initiation factor 2A | COG5354 | COG5354 | Uncharacterized protein, contains Trp-Asp (WD) repeat |
I can also attach protein sequences to the table here, although the fasta names need some parsing
library(Biostrings)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind,
## colMeans, colnames, colSums, dirname, do.call, duplicated,
## eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
## Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
g.all.protein.biostrings <- readAAStringSet("../geosmithia_annotations/g.morbida.proteins.fa")
name <- data.frame(names(g.all.protein.biostrings),stringsAsFactors = F)
name.clean <- data.frame(do.call('rbind', strsplit(name$names.g.all.protein.biostrings., "_protein_Name:",fixed=TRUE)))
seqs <- paste(g.all.protein.biostrings)
length <- width(g.all.protein.biostrings)
g.all.protein <- data.frame(name.clean[1],length,seqs, stringsAsFactors = FALSE)
colnames(g.all.protein)[1] <- "gene"
#kable(head(g.all.protein))
then we can join that to the gene allele count annotation table
aa.allelic.genes.annot <- dplyr::left_join(aa.allelic.genes.annot, g.all.protein, "gene")
## Warning: Column `gene` joining character vector and factor, coercing into
## character vector
write.csv(aa.allelic.genes.annot, "aa.allelic.genes.csv")
#kable(head(aa.allelic.genes.annot))
runt this after to upload
library(markdown) library(rmarkdown) library(knitr) rpubsUpload(“G.morbida.nonsynonymous.alles.and.functions”, “/home/rmurdoch/Dropbox/Projects/Geosmithia/dartr/6.loci.gene.alleles.aa.rmd”)