(this is a work in progress)
The goal of this pipeline is to characterize alleles in terms of diversity and functional annotation
This script takes the output from the “loci.gene.alleles.rmd” script, which is a map of variant loci to genes. Then this script creates gene to KEGG and COG annotations (COG is currently not used) which is used later in the script to place genes into pathways.
This script also determines allelic counts across the project by generating locus-fingerprints across each gene. This allows for determination of unique fingerprint (allele) counts for each gene.
The variant genes are explored functionally by placement into KEGG pathways (modules)
Inputs: * geosmithia_KEGG.txt: these are gene to KEGG annotations that were obtained at the KEGG mapper web service * KEGG_BRITE2.csv: this is a map of KO to higher hierarchicical categories. This was downloaded from KEGG/BRITE and pre-processed in Excel * cog-raw.txt: This maps genes to COGs and COG categories; this was obtained through the EggNOG sequencemapper webservice * loci.to.genes.csv: this is the output from loci.to.genes.rmd, a map of variant loci to gene names along with a matrix of occurence in strains
Outputs: * full_gene_allele_count.csv: this is a simple two column table that shows allele count for each gene. This includes genes with no variants * allele count histograms (not written, save out of the html rmarkdown)
library(stringr)
gff <- read.csv("../g.morbida_CDS.gff3", sep = "\t", row.names = NULL, skip = 2, header = FALSE)
gff <- gff[c(1,4,5,9)]
colnames(gff) <- c("scaffold","start","stop","gene")
gene.split <- str_split_fixed(gff$gene, ";",2)
gene.split <- str_split_fixed(gene.split[,1], "=",2)
gene.split <- str_split_fixed(gene.split[,2], ":",2)
gff$gene <- as.character(gene.split[,1])
kable(head(gff),format="markdown")
| scaffold | start | stop | gene |
|---|---|---|---|
| scaffold_0 | 674 | 1516 | augustus_masked-scaffold_0-processed-gene-0.2274-mRNA-1 |
| scaffold_0 | 9081 | 12438 | maker-scaffold_0-augustus-gene-0.2665-mRNA-1 |
| scaffold_0 | 12504 | 13645 | maker-scaffold_0-augustus-gene-0.2665-mRNA-1 |
| scaffold_0 | 14732 | 15859 | augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1 |
| scaffold_0 | 16219 | 16581 | maker-scaffold_0-augustus-gene-0.2819-mRNA-1 |
| scaffold_0 | 16781 | 17263 | maker-scaffold_0-augustus-gene-0.2819-mRNA-1 |
KEGG forms a good foundation for building the database because every gene has an associated annotation (in most cases empty)
kegg.raw <- read.csv("annotations/geosmithia_KEGG.txt", sep ="\t", row.names = NULL, header = FALSE)
colnames(kegg.raw) <- c("gene","KO")
kable(kegg.raw[1:10,])
| gene | KO |
|---|---|
| snap_masked-scaffold_25-processed-gene-0.20-mRNA-1 | K14838 |
| maker-scaffold_25-augustus-gene-0.4-mRNA-1 | |
| snap_masked-scaffold_25-processed-gene-0.23-mRNA-1 | |
| augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1 | |
| snap_masked-scaffold_25-processed-gene-0.21-mRNA-1 | |
| augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1 | |
| augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1 | |
| maker-scaffold_25-augustus-gene-0.7-mRNA-1 | K16261 |
| augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1 | |
| augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1 | K22384 |
The BRITE KEGG hierarchcial map was created manually from the BRITE download.
BRITE <- read.csv("annotations/KEGG_BRITE2.csv",header=TRUE,row.names=NULL)
colnames(BRITE)[c(1,2,3)] <- c("KEGG.tier1","KEGG.tier2","KEGG.pathway")
BRITE <- BRITE[c(4,5,1,2,3)]
BRITE <- BRITE[-which(BRITE$KO == ""), ] #remove any rows with empty KO, removes ~ 1000 rows
kable(head(BRITE))
| KO | KEGG.name | KEGG.tier1 | KEGG.tier2 | KEGG.pathway |
|---|---|---|---|---|
| K00844 | HK; hexokinase [EC:2.7.1.1] | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| K12407 | GCK; glucokinase [EC:2.7.1.2] | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| K00845 | glk; glucokinase [EC:2.7.1.2] | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| K01810 | GPI | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| K06859 | pgi1; glucose-6-phosphate isomerase | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| K13810 | tal-pgi; transaldolase / glucose-6-phosphate isomerase [EC:2.2.1.2 5.3.1.9] | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| BRITE con | tains redundancies because a single KO can map to multiple pathways. This mea | ns that pathw | ay mapping is really not possibl | e at this stage. |
| We can at | tempt to do a one-to-one KO to name mapping for now. We use this much later in | the pipeline | . |
BRITE2 <- data.frame(BRITE[c(1,2)])
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
kegg.raw2 <- kegg.raw[c(2,1)]
gene.annot <- dplyr::left_join(kegg.raw2,BRITE,by="KO")
## Warning: Column `KO` joining factors with different levels, coercing to
## character vector
gene.annot <- gene.annot[c(2,1,3:6)]
kable(gene.annot[1:10,])
| gene | KO | KEGG.name | KEGG.tier1 | KEGG.tier2 | KEGG.pathway |
|---|---|---|---|---|---|
| snap_masked-scaffold_25-processed-gene-0.20-mRNA-1 | K14838 | NOP15; nucleolar protein 15 | BriteHierarchies | 09182 Protein families: genetic information processing | 03009 Ribosome biogenesis [BR:ko03009] |
| maker-scaffold_25-augustus-gene-0.4-mRNA-1 | NA | NA | NA | NA | |
| snap_masked-scaffold_25-processed-gene-0.23-mRNA-1 | NA | NA | NA | NA | |
| augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1 | NA | NA | NA | NA | |
| snap_masked-scaffold_25-processed-gene-0.21-mRNA-1 | NA | NA | NA | NA | |
| augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1 | NA | NA | NA | NA | |
| augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1 | NA | NA | NA | NA | |
| maker-scaffold_25-augustus-gene-0.7-mRNA-1 | K16261 | YAT; yeast amino acid transporter | BriteHierarchies | 09183 Protein families: signaling and cellular processes | 02000 Transporters [BR:ko02000] |
| augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1 | NA | NA | NA | NA | |
| augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1 | K22384 | WRB | BriteHierarchies | 09183 Protein families: signaling and cellular processes | 02000 Transporters [BR:ko02000] |
cog.raw <- read.csv("annotations/cog-raw.txt", sep = "\t", row.names = NULL, header = FALSE)
cog.raw <- cog.raw[c(1,13,14,15)]
colnames(cog.raw) <- c("gene","COG","COG.name","COG.desc")
kable(cog.raw[1:10,])
| gene | COG | COG.name | COG.desc |
|---|---|---|---|
| maker-scaffold_25-augustus-gene-0.4-mRNA-1 | COG1063 | Tdh | Threonine dehydrogenase and related Zn-dependent dehydrogenases |
| augustus_masked-scaffold_7-processed-gene-0.364-mRNA-1 | COG5309 | COG5309 | Exo-beta-1,3-glucanase |
| maker-scaffold_7-augustus-gene-0.1355-mRNA-1 | COG0199 | RpsN | Ribosomal protein S14 |
| maker-scaffold_7-augustus-gene-0.1443-mRNA-1 | COG2515 | Acd | 1-aminocyclopropane-1-carboxylate deaminase |
| maker-scaffold_7-augustus-gene-0.1444-mRNA-1 | COG0300 | DltE | Short-chain dehydrogenases of various substrate specificities |
| maker-scaffold_18-augustus-gene-0.242-mRNA-1 | COG5277 | COG5277 | Actin and related proteins |
| maker-scaffold_19-augustus-gene-0.251-mRNA-1 | COG2046 | MET3 | ATP sulfurylase (sulfate adenylyltransferase) |
| maker-scaffold_19-augustus-gene-0.251-mRNA-1 | COG0529 | CysC | Adenylylsulfate kinase and related kinases |
| maker-scaffold_3-augustus-gene-0.4-mRNA-1 | COG0474 | MgtA | Cation transport ATPase |
| snap_masked-scaffold_3-processed-gene-0.1103-mRNA-1 | COG1946 | TesB | Acyl-CoA thioesterase |
gene.annot2 <- dplyr::left_join(gene.annot,cog.raw,by="gene")
## Warning: Column `gene` joining factors with different levels, coercing to
## character vector
gene.annot2 <- data.frame(sapply(gene.annot2, as.character),stringsAsFactors = FALSE)
gene.annot2[is.na(gene.annot2)] <- ""
write.csv(gene.annot2,"gene.annotations.csv")
kable(gene.annot2[1:10,])
| gene | KO | KEGG.name | KEGG.tier1 | KEGG.tier2 | KEGG.pathway | COG | COG.name | COG.desc |
|---|---|---|---|---|---|---|---|---|
| snap_masked-scaffold_25-processed-gene-0.20-mRNA-1 | K14838 | NOP15; nucleolar protein 15 | BriteHierarchies | 09182 Protein families: genetic information processing | 03009 Ribosome biogenesis [BR:ko03009] | |||
| maker-scaffold_25-augustus-gene-0.4-mRNA-1 | COG1063 | Tdh | Threonine dehydrogenase and related Zn-dependent dehydrogenases | |||||
| snap_masked-scaffold_25-processed-gene-0.23-mRNA-1 | ||||||||
| augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1 | ||||||||
| snap_masked-scaffold_25-processed-gene-0.21-mRNA-1 | COG1196 | Smc | Chromosome segregation ATPases | |||||
| augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1 | ||||||||
| augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1 | ||||||||
| maker-scaffold_25-augustus-gene-0.7-mRNA-1 | K16261 | YAT; yeast amino acid transporter | BriteHierarchies | 09183 Protein families: signaling and cellular processes | 02000 Transporters [BR:ko02000] | COG0833 | LysP | Amino acid transporters |
| augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1 | ||||||||
| augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1 | K22384 | WRB | BriteHierarchies | 09183 Protein families: signaling and cellular processes | 02000 Transporters [BR:ko02000] |
fingerprints for each gene/strain combination are generated by pasting together the 1s and 0s for each strain across each gene. Example output can be seen after the code chunk.
the code is ugly but fast:
var.genes <- read.csv("loci.to.genes.csv", row.names = 1)
allele.barcode <- var.genes[-c(2:5)]
p <- function(v) {
Reduce(f=paste0, x = v)} #define a paste function
allele.bygen <- dplyr::group_by(allele.barcode, gene)
#this can be used to create the fingerprint. unfortunately it does not take arguments in place of names
#of variables to be summarized; thus I cannot loop through it at all
allele.counts <- dplyr::summarize(allele.bygen, GM307 = p(as.character(GM307)),
GM236 = p(as.character(GM236)),
GM219 = p(as.character(GM219)),
GM216 = p(as.character(GM216)),
GM205 = p(as.character(GM205)),
GM196 = p(as.character(GM196)),
GM188 = p(as.character(GM188)),
GM179 = p(as.character(GM179)),
GM178 = p(as.character(GM178)),
GM177 = p(as.character(GM177)),
GM170 = p(as.character(GM170)),
GM158 = p(as.character(GM158)),
GM140 = p(as.character(GM140)),
GM118 = p(as.character(GM118)),
GM108 = p(as.character(GM108)),
GM64 = p(as.character(GM64)),
GM59 = p(as.character(GM59)),
GM39 = p(as.character(GM39)),
GM25 = p(as.character(GM25)),
GM20 = p(as.character(GM20)),
GM17 = p(as.character(GM17)),
GM11 = p(as.character(GM11)))
allele.counts2 <- allele.counts
allele.counts2$count <- apply(allele.counts2[2:23], 1, function(x)length(unique(x))) #this counts unique values in the fingerprint section
kable(allele.counts2[c(1:5),c(1:3)])
| gene | GM307 | GM236 |
|---|---|---|
| augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1 | 000000000000000000000000000000000000000000000 | 000000000000000000000000000000000000000000000 |
| augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1 | 00000000 | 00000000 |
| augustus_masked-scaffold_0-processed-gene-0.1938-mRNA-1 | 00000000000000000000 | 00000000000000000000 |
| augustus_masked-scaffold_0-processed-gene-0.1944-mRNA-1 | 000000000 | 000000000 |
| augustus_masked-scaffold_0-processed-gene-0.1953-mRNA-1 | 000000000000000000 | 000000000000000000 |
The table above demonstrates the variant loci fingerprint process.
Next we create a list of number of distinct alleles (different fingerprints) for each gene:
rm(allele.counts3)
## Warning in rm(allele.counts3): object 'allele.counts3' not found
allele.counts3 <- allele.counts2[,c(1,24)]
write.csv(allele.counts3, "allele.counts.csv")
kable(head(allele.counts3))
| gene | count |
|---|---|
| augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1 | 15 |
| augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1 | 4 |
| augustus_masked-scaffold_0-processed-gene-0.1938-mRNA-1 | 8 |
| augustus_masked-scaffold_0-processed-gene-0.1944-mRNA-1 | 4 |
| augustus_masked-scaffold_0-processed-gene-0.1953-mRNA-1 | 9 |
| augustus_masked-scaffold_0-processed-gene-0.1956-mRNA-1 | 2 |
We need to be sure to add in genes with no variant alleles. Up until now, only variants are included. Make an empty table from the gff and join it with the counts, then replace NA values with 1
library(dplyr)
#make the empty table with all genes
gff.build <- gff
gff.build$prox <- 0
gff.build <- gff.build[c(4,5)]
gff.build <- unique(gff.build)
#join in the counts
gff.build <- dplyr::left_join(gff.build, allele.counts3, by = "gene")
## Warning: Column `gene` joining character vector and factor, coercing into
## character vector
#replace NA with 1
gff.build[is.na(gff.build)] <- 1
gff.build <- gff.build[c(1,3)]
From which we can make a distribution histogram of allele counts.
library(ggplot2)
hist.allele.count <- ggplot(gff.build, aes(gff.build$count)) + geom_histogram(binwidth=1, color = "white") + theme_classic()
hist.allele.count <- hist.allele.count + xlab("Alleles") + ylab("Genes")+ ylim(c(0,1800))
hist.allele.count
```
For the sake of resolving this ambiguity, we will add in genes with no variants and redraw the histogram. To do this, we have to go back the the full list of genes found in kegg.raw and join in this allele.counts3 table
full.gene.allele.count <- dplyr::left_join(kegg.raw, allele.counts3, by="gene")
## Warning: Column `gene` joining factors with different levels, coercing to
## character vector
full.gene.allele.count[is.na(full.gene.allele.count)] <- 1 #replace NA with 1
hist(full.gene.allele.count$count, breaks=22, main = paste("distribution of nucleotide alleles across all",nrow(full.gene.allele.count),"genes"),
xlab = "allele count among all 22 strains", ylab = "genes")
write the full gene non-synonymous allele count table to a file
write.csv(full.gene.allele.count,"full_gene_allele_count.csv")
Let’s attach functional information where available and restrict ourselves to genes with predicted function (and sort by allele counts)
library(dplyr)
allele.counts3$gene <- as.character(allele.counts3$gene)
allele.counts.KEGG <- dplyr::left_join(allele.counts3, gene.annot2[c(1:6)])
## Joining, by = "gene"
allele.counts.KEGG <- allele.counts.KEGG[order(allele.counts.KEGG$KO),]
kable(head(allele.counts.KEGG))
| gene | count | KO | KEGG.name | KEGG.tier1 | KEGG.tier2 | KEGG.pathway |
|---|---|---|---|---|---|---|
| augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1 | 4 | |||||
| augustus_masked-scaffold_0-processed-gene-0.1960-mRNA-1 | 4 | |||||
| augustus_masked-scaffold_0-processed-gene-0.1967-mRNA-1 | 6 | |||||
| augustus_masked-scaffold_0-processed-gene-0.1978-mRNA-1 | 9 | |||||
| augustus_masked-scaffold_0-processed-gene-0.1979-mRNA-1 | 5 | |||||
| augustus_masked-scaffold_0-processed-gene-0.1982-mRNA-1 | 2 |
First, we attach pathway information: note that that at this stage, we lose a hard count of variant genes since many KO can map to multiple pathways
allele.counts.KEGG.pres <- apply(allele.counts.KEGG, 2, function(x) gsub("^$|^ $", NA, x)) #this makes sure that any empty cells are NA
allele.counts.KEGG.pres <- data.frame(na.omit(allele.counts.KEGG.pres),stringsAsFactors = FALSE) #this removes all rows with no KEGG
allele.counts.KEGG.pres$count <- as.numeric(allele.counts.KEGG.pres$count)
write.csv(allele.counts.KEGG.pres, "gene.allele.counts.csv")
kable(head(allele.counts.KEGG.pres))
| gene | count | KO | KEGG.name | KEGG.tier1 | KEGG.tier2 | KEGG.pathway |
|---|---|---|---|---|---|---|
| augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1 | 3 | K00002 | AKR1A1 | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
| augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1 | 3 | K00002 | AKR1A1 | Metabolism | 09101 Carbohydrate metabolism | 00040 Pentose and glucuronate interconversions [PATH:ko00040] |
| augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1 | 3 | K00002 | AKR1A1 | Metabolism | 09103 Lipid metabolism | 00561 Glycerolipid metabolism [PATH:ko00561] |
| augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1 | 3 | K00002 | AKR1A1 | Metabolism | 09111 Xenobiotics biodegradation and metabolism | 00930 Caprolactam degradation [PATH:ko00930] |
| augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1 | 3 | K00002 | AKR1A1 | BriteHierarchies | 09183 Protein families: signaling and cellular processes | 04147 Exosome [BR:ko04147] |
| augustus_masked-scaffold_26-processed-gene-0.177-mRNA-1 | 7 | K00002 | AKR1A1 | Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] |
cat("KEGG PATHWAY MAPPING SUMMARY: of",nrow(full.gene.allele.count), "total genes,",
nrow(allele.counts3),"show at least one variant relative to the reference genome.",
length(unique(allele.counts.KEGG.pres$gene)), "can be assigned to a KO family. These",
length(unique(allele.counts.KEGG.pres$gene)), "genes can be mapped to",
nrow(allele.counts.KEGG.pres),"specific KEGG pathway connections representing",
length(unique(allele.counts.KEGG.pres$KEGG.pathway)),"unique pathways")
## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 6229 show at least one variant relative to the reference genome. 3239 can be assigned to a KO family. These 3239 genes can be mapped to 9604 specific KEGG pathway connections representing 422 unique pathways
This should be further narrowed down to more highly allelic genes, but first, lets summarize by pathway and determine the most highly impacted pathways in terms of counts of variant genes present. This code chunks just counts pathway name occurances using a summarize function.
KEGG.pathway.var.summary <- allele.counts.KEGG.pres
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
#this can be used to create the fingerprint. unfortunately it does not take arguments in place of names
#of variables to be summarized; thus I cannot loop through it at all
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.no.min <-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary[c(1:25),])
| KEGG.pathway | variant.genes |
|---|---|
| 04131 Membrane trafficking [BR:ko04131] | 328 |
| 03036 Chromosome and associated proteins [BR:ko03036] | 269 |
| 02000 Transporters [BR:ko02000] | 213 |
| 04147 Exosome [BR:ko04147] | 198 |
| 03009 Ribosome biogenesis [BR:ko03009] | 196 |
| 03029 Mitochondrial biogenesis [BR:ko03029] | 181 |
| 03019 Messenger RNA Biogenesis [BR:ko03019] | 160 |
| 03021 Transcription machinery [BR:ko03021] | 147 |
| 03400 DNA repair and recombination proteins [BR:ko03400] | 146 |
| 03041 Spliceosome [BR:ko03041] | 129 |
| 04121 Ubiquitin system [BR:ko04121] | 126 |
| 03016 Transfer RNA biogenesis [BR:ko03016] | 121 |
| 03011 Ribosome [BR:ko03011] | 117 |
| 01002 Peptidases [BR:ko01002] | 111 |
| 03010 Ribosome [PATH:ko03010] | 102 |
| 99980 Enzymes with EC numbers | 96 |
| 00230 Purine metabolism [PATH:ko00230] | 95 |
| 03013 RNA transport [PATH:ko03013] | 84 |
| 03040 Spliceosome [PATH:ko03040] | 81 |
| 03110 Chaperones and folding catalysts [BR:ko03110] | 79 |
| 03032 DNA replication proteins [BR:ko03032] | 78 |
| 01009 Protein phosphatase and associated proteins [BR:ko01009] | 77 |
| 04714 Thermogenesis [PATH:ko04714] | 76 |
| 04141 Protein processing in endoplasmic reticulum [PATH:ko04141] | 74 |
| 00240 Pyrimidine metabolism [PATH:ko00240] | 73 |
| This summary does not take allele count into consideration: a gene | with only 1 or 2 alleles is not very meaningful for this analysis |
This is the same as above but modified to exclude genes that have lower variant counts.
#50% minimum alleles (>10)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 10,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.50 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.50) <- c("KEGG.pathway","50%.alleles")
#75% minimum alleles (>16)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.75 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.75) <- c("KEGG.pathway","75%.alleles")
#90% minimum alleles (>19)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.9,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.90 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.90) <- c("KEGG.pathway","90%.alleles")
#join the pathway counts together with different allelic variation thresholds
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary.no.min, KEGG.pathway.var.summary.50, by = "KEGG.pathway")
## Warning: Column `KEGG.pathway` joining factors with different levels,
## coercing to character vector
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary, KEGG.pathway.var.summary.75, by = "KEGG.pathway")
## Warning: Column `KEGG.pathway` joining character vector and factor,
## coercing into character vector
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary, KEGG.pathway.var.summary.90, by = "KEGG.pathway")
## Warning: Column `KEGG.pathway` joining character vector and factor,
## coercing into character vector
kable(KEGG.pathway.var.summary[c(1:25),])
| KEGG.pathway | variant.genes | 50%.alleles | 75%.alleles | 90%.alleles |
|---|---|---|---|---|
| 04131 Membrane trafficking [BR:ko04131] | 328 | 41 | 2 | NA |
| 03036 Chromosome and associated proteins [BR:ko03036] | 269 | 52 | 6 | NA |
| 02000 Transporters [BR:ko02000] | 213 | 8 | NA | NA |
| 04147 Exosome [BR:ko04147] | 198 | 8 | 2 | NA |
| 03009 Ribosome biogenesis [BR:ko03009] | 196 | 35 | 6 | 1 |
| 03029 Mitochondrial biogenesis [BR:ko03029] | 181 | 16 | 1 | NA |
| 03019 Messenger RNA Biogenesis [BR:ko03019] | 160 | 26 | 3 | NA |
| 03021 Transcription machinery [BR:ko03021] | 147 | 22 | 2 | NA |
| 03400 DNA repair and recombination proteins [BR:ko03400] | 146 | 24 | 3 | NA |
| 03041 Spliceosome [BR:ko03041] | 129 | 16 | NA | NA |
| 04121 Ubiquitin system [BR:ko04121] | 126 | 21 | 2 | 1 |
| 03016 Transfer RNA biogenesis [BR:ko03016] | 121 | 18 | 2 | NA |
| 03011 Ribosome [BR:ko03011] | 117 | 2 | NA | NA |
| 01002 Peptidases [BR:ko01002] | 111 | 9 | NA | NA |
| 03010 Ribosome [PATH:ko03010] | 102 | 1 | NA | NA |
| 99980 Enzymes with EC numbers | 96 | 11 | 1 | NA |
| 00230 Purine metabolism [PATH:ko00230] | 95 | 10 | 2 | 2 |
| 03013 RNA transport [PATH:ko03013] | 84 | 13 | 1 | NA |
| 03040 Spliceosome [PATH:ko03040] | 81 | 8 | NA | NA |
| 03110 Chaperones and folding catalysts [BR:ko03110] | 79 | 4 | NA | NA |
| 03032 DNA replication proteins [BR:ko03032] | 78 | 10 | 1 | NA |
| 01009 Protein phosphatase and associated proteins [BR:ko01009] | 77 | 5 | NA | NA |
| 04714 Thermogenesis [PATH:ko04714] | 76 | NA | NA | NA |
| 04141 Protein processing in endoplasmic reticulum [PATH:ko04141] | 74 | 6 | NA | NA |
| 00240 Pyrimidine metabolism [PATH:ko00240] | 73 | 14 | 6 | NA |
allele.75 <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
cat("KEGG PATHWAY MAPPING SUMMARY: of",nrow(full.gene.allele.count), "total genes,",
length(unique(allele.75$gene)),"show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These",
length(unique(allele.75$gene)), "genes can be mapped to",
nrow(allele.75),"specific KEGG pathway connections representing",
length(unique(allele.counts.KEGG.pres$KEGG.pathway)),"unique pathways")
## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 33 show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These 33 genes can be mapped to 101 specific KEGG pathway connections representing 422 unique pathways
KEGG.pathway.75.summary <- KEGG.pathway.var.summary[c(1,4)]
KEGG.pathway.75.summary <- KEGG.pathway.75.summary[order(KEGG.pathway.75.summary$`75%.alleles`,decreasing=TRUE),]
kable(KEGG.pathway.75.summary[1:20,])
| KEGG.pathway | 75%.alleles |
|---|---|
| 03036 Chromosome and associated proteins [BR:ko03036] | 6 |
| 03009 Ribosome biogenesis [BR:ko03009] | 6 |
| 00240 Pyrimidine metabolism [PATH:ko00240] | 6 |
| 00250 Alanine | 6 |
| 03008 Ribosome biogenesis in eukaryotes [PATH:ko03008] | 4 |
| 03019 Messenger RNA Biogenesis [BR:ko03019] | 3 |
| 03400 DNA repair and recombination proteins [BR:ko03400] | 3 |
| 04113 Meiosis - yeast [PATH:ko04113] | 3 |
| 00300 Lysine biosynthesis [PATH:ko00300] | 3 |
| 04131 Membrane trafficking [BR:ko04131] | 2 |
| 04147 Exosome [BR:ko04147] | 2 |
| 03021 Transcription machinery [BR:ko03021] | 2 |
| 04121 Ubiquitin system [BR:ko04121] | 2 |
| 03016 Transfer RNA biogenesis [BR:ko03016] | 2 |
| 00230 Purine metabolism [PATH:ko00230] | 2 |
| 04812 Cytoskeleton proteins [BR:ko04812] | 2 |
| 04110 Cell cycle [PATH:ko04110] | 2 |
| 03015 mRNA surveillance pathway [PATH:ko03015] | 2 |
| 04213 Longevity regulating pathway - multiple species [PATH:ko04213] | 2 |
| 03460 Fanconi anemia pathway [PATH:ko03460] | 2 |
| To be clear, this is a representation of where the most highly allelic | genes are mapped according to the KEGG system. |
This is fascinating, but lets try to put this in two frames of reference: 1. average number of variants per gene across the strains in each pathway 2. total number of genes placing into each pathway in the reference genome, % of pathway genes subject to variation
Let’s create two more tables representing higher levels of organization: X
#tier1
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.tier1 <- as.factor(KEGG.pathway.var.summary$KEGG.tier1)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.tier1) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.tier1<-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary.tier1[c(1:10),])
| KEGG.tier1 | variant.genes |
|---|---|
| BriteHierarchies | 38 |
| Metabolism | 21 |
| CellularProcesses | 12 |
| GeneticInformationProcessing | 12 |
| HumanDiseases | 8 |
| EnvironmentalInformationProcessing | 6 |
| OrganismalSystems | 3 |
| NotIncludedinPathwayorBrite | 1 |
| NA | NA |
| NA | NA |
#tier 2
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.tier2 <- as.factor(KEGG.pathway.var.summary$KEGG.tier2)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.tier2) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.tier2<-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary.tier2[c(1:25),])
| KEGG.tier2 | variant.genes |
|---|---|
| 09182 Protein families: genetic information processing | 30 |
| 09105 Amino acid metabolism | 9 |
| 09143 Cell growth and death | 9 |
| 09104 Nucleotide metabolism | 8 |
| 09122 Translation | 8 |
| 09132 Signal transduction | 6 |
| 09167 Infectious diseases | 5 |
| 09181 Protein families: metabolism | 4 |
| 09183 Protein families: signaling and cellular processes | 4 |
| 09124 Replication and repair | 3 |
| 09145 Cellular community - prokaryotes | 2 |
| 09149 Aging | 2 |
| 09161 Cancers | 2 |
| 09101 Carbohydrate metabolism | 1 |
| 09106 Metabolism of other amino acids | 1 |
| 09108 Metabolism of cofactors and vitamins | 1 |
| 09110 Biosynthesis of other secondary metabolites | 1 |
| 09123 Folding | 1 |
| 09141 Transport and catabolism | 1 |
| 09155 Excretory system | 1 |
| 09168 Drug resistance | 1 |
| 09191 Unclassified: metabolism | 1 |
| NA | NA |
| NA | NA |
| NA | NA |
full.gene.allele.count <- dplyr::left_join(kegg.raw, allele.counts3, by="gene")
## Warning: Column `gene` joining factor and character vector, coercing into
## character vector
full.gene.allele.count[is.na(full.gene.allele.count)] <- 0 #replace NA with 0
hist(full.gene.allele.count$count, breaks=22, main = paste("distribution of nucleotide alleles across all",nrow(full.gene.allele.count),"genes"),
xlab = "allele count among all 22 strains", ylab = "genes")
## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 6229 show at least one variant relative to the reference genome. 3239 can be assigned to a KO family. These 3239 genes can be mapped to 9604 specific KEGG pathway connections representing 422 unique pathways
## HIGHLY VARIANT ALLELES: of 6273 total genes, 33 show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These 33 genes can be mapped to 101 specific KEGG pathway connections representing 422 unique pathways
kable(KEGG.pathway.75.summary[1:20,])
| KEGG.pathway | 75%.alleles |
|---|---|
| 03036 Chromosome and associated proteins [BR:ko03036] | 6 |
| 03009 Ribosome biogenesis [BR:ko03009] | 6 |
| 00240 Pyrimidine metabolism [PATH:ko00240] | 6 |
| 00250 Alanine | 6 |
| 03008 Ribosome biogenesis in eukaryotes [PATH:ko03008] | 4 |
| 03019 Messenger RNA Biogenesis [BR:ko03019] | 3 |
| 03400 DNA repair and recombination proteins [BR:ko03400] | 3 |
| 04113 Meiosis - yeast [PATH:ko04113] | 3 |
| 00300 Lysine biosynthesis [PATH:ko00300] | 3 |
| 04131 Membrane trafficking [BR:ko04131] | 2 |
| 04147 Exosome [BR:ko04147] | 2 |
| 03021 Transcription machinery [BR:ko03021] | 2 |
| 04121 Ubiquitin system [BR:ko04121] | 2 |
| 03016 Transfer RNA biogenesis [BR:ko03016] | 2 |
| 00230 Purine metabolism [PATH:ko00230] | 2 |
| 04812 Cytoskeleton proteins [BR:ko04812] | 2 |
| 04110 Cell cycle [PATH:ko04110] | 2 |
| 03015 mRNA surveillance pathway [PATH:ko03015] | 2 |
| 04213 Longevity regulating pathway - multiple species [PATH:ko04213] | 2 |
| 03460 Fanconi anemia pathway [PATH:ko03460] | 2 |
write.csv(KEGG.pathway.75.summary, "KEGG.pathway.75")
kable(KEGG.pathway.var.summary.tier1[c(1:8),])
| KEGG.tier1 | variant.genes |
|---|---|
| BriteHierarchies | 38 |
| Metabolism | 21 |
| CellularProcesses | 12 |
| GeneticInformationProcessing | 12 |
| HumanDiseases | 8 |
| EnvironmentalInformationProcessing | 6 |
| OrganismalSystems | 3 |
| NotIncludedinPathwayorBrite | 1 |
write.csv(KEGG.pathway.var.summary.tier1, "KEGG.tier1.75")
kable(KEGG.pathway.var.summary.tier2[c(1:25),])
| KEGG.tier2 | variant.genes |
|---|---|
| 09182 Protein families: genetic information processing | 30 |
| 09105 Amino acid metabolism | 9 |
| 09143 Cell growth and death | 9 |
| 09104 Nucleotide metabolism | 8 |
| 09122 Translation | 8 |
| 09132 Signal transduction | 6 |
| 09167 Infectious diseases | 5 |
| 09181 Protein families: metabolism | 4 |
| 09183 Protein families: signaling and cellular processes | 4 |
| 09124 Replication and repair | 3 |
| 09145 Cellular community - prokaryotes | 2 |
| 09149 Aging | 2 |
| 09161 Cancers | 2 |
| 09101 Carbohydrate metabolism | 1 |
| 09106 Metabolism of other amino acids | 1 |
| 09108 Metabolism of cofactors and vitamins | 1 |
| 09110 Biosynthesis of other secondary metabolites | 1 |
| 09123 Folding | 1 |
| 09141 Transport and catabolism | 1 |
| 09155 Excretory system | 1 |
| 09168 Drug resistance | 1 |
| 09191 Unclassified: metabolism | 1 |
| NA | NA |
| NA | NA |
| NA | NA |
write.csv(KEGG.pathway.var.summary.tier2, "KEGG.tier2.75")
library(markdown) library(rmarkdown) library(knitr) rpubsUpload(“G.morbida.synonymous.alles.and.functions”, “/home/rmurdoch/Dropbox/Projects/Geosmithia/dartr/3.loci.gene.alleles.html”) ```