Overview

(this is a work in progress)

The goal of this pipeline is to characterize alleles in terms of diversity and functional annotation

This script takes the output from the “loci.gene.alleles.rmd” script, which is a map of variant loci to genes. Then this script creates gene to KEGG and COG annotations (COG is currently not used) which is used later in the script to place genes into pathways.

This script also determines allelic counts across the project by generating locus-fingerprints across each gene. This allows for determination of unique fingerprint (allele) counts for each gene.

The variant genes are explored functionally by placement into KEGG pathways (modules)

Inputs: * geosmithia_KEGG.txt: these are gene to KEGG annotations that were obtained at the KEGG mapper web service * KEGG_BRITE2.csv: this is a map of KO to higher hierarchicical categories. This was downloaded from KEGG/BRITE and pre-processed in Excel * cog-raw.txt: This maps genes to COGs and COG categories; this was obtained through the EggNOG sequencemapper webservice * loci.to.genes.csv: this is the output from loci.to.genes.rmd, a map of variant loci to gene names along with a matrix of occurence in strains

Outputs: * full_gene_allele_count.csv: this is a simple two column table that shows allele count for each gene. This includes genes with no variants * allele count histograms (not written, save out of the html rmarkdown)

importing and reformatting the gff3 file

library(stringr)
gff <- read.csv("../g.morbida_CDS.gff3", sep = "\t", row.names = NULL, skip = 2, header = FALSE)
gff <- gff[c(1,4,5,9)]
colnames(gff) <- c("scaffold","start","stop","gene")
gene.split <- str_split_fixed(gff$gene, ";",2)
gene.split <- str_split_fixed(gene.split[,1], "=",2)
gene.split <- str_split_fixed(gene.split[,2], ":",2)
gff$gene <- as.character(gene.split[,1])
kable(head(gff),format="markdown")
scaffold start stop gene
scaffold_0 674 1516 augustus_masked-scaffold_0-processed-gene-0.2274-mRNA-1
scaffold_0 9081 12438 maker-scaffold_0-augustus-gene-0.2665-mRNA-1
scaffold_0 12504 13645 maker-scaffold_0-augustus-gene-0.2665-mRNA-1
scaffold_0 14732 15859 augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1
scaffold_0 16219 16581 maker-scaffold_0-augustus-gene-0.2819-mRNA-1
scaffold_0 16781 17263 maker-scaffold_0-augustus-gene-0.2819-mRNA-1

Reading the KEGG annotations

KEGG forms a good foundation for building the database because every gene has an associated annotation (in most cases empty)

kegg.raw <- read.csv("annotations/geosmithia_KEGG.txt", sep ="\t", row.names = NULL, header = FALSE)
colnames(kegg.raw) <- c("gene","KO")
kable(kegg.raw[1:10,])
gene KO
snap_masked-scaffold_25-processed-gene-0.20-mRNA-1 K14838
maker-scaffold_25-augustus-gene-0.4-mRNA-1
snap_masked-scaffold_25-processed-gene-0.23-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1
snap_masked-scaffold_25-processed-gene-0.21-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1
maker-scaffold_25-augustus-gene-0.7-mRNA-1 K16261
augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1 K22384

Adding KEGG descriptions to the gene to KO map

The BRITE KEGG hierarchcial map was created manually from the BRITE download.

reading the BRITE map

BRITE <- read.csv("annotations/KEGG_BRITE2.csv",header=TRUE,row.names=NULL)
colnames(BRITE)[c(1,2,3)] <- c("KEGG.tier1","KEGG.tier2","KEGG.pathway")
BRITE <- BRITE[c(4,5,1,2,3)]
BRITE <- BRITE[-which(BRITE$KO == ""), ] #remove any rows with empty KO, removes ~ 1000 rows
kable(head(BRITE))
KO KEGG.name KEGG.tier1 KEGG.tier2 KEGG.pathway
K00844 HK; hexokinase [EC:2.7.1.1] Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K12407 GCK; glucokinase [EC:2.7.1.2] Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K00845 glk; glucokinase [EC:2.7.1.2] Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K01810 GPI Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K06859 pgi1; glucose-6-phosphate isomerase Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K13810 tal-pgi; transaldolase / glucose-6-phosphate isomerase [EC:2.2.1.2 5.3.1.9] Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
BRITE con tains redundancies because a single KO can map to multiple pathways. This mea ns that pathw ay mapping is really not possibl e at this stage.
We can at tempt to do a one-to-one KO to name mapping for now. We use this much later in the pipeline .
BRITE2 <- data.frame(BRITE[c(1,2)])

Attaching KO names, pathways, and tiers (this section is not used for now)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
kegg.raw2 <- kegg.raw[c(2,1)]
gene.annot <- dplyr::left_join(kegg.raw2,BRITE,by="KO")
## Warning: Column `KO` joining factors with different levels, coercing to
## character vector
gene.annot <- gene.annot[c(2,1,3:6)]
kable(gene.annot[1:10,])
gene KO KEGG.name KEGG.tier1 KEGG.tier2 KEGG.pathway
snap_masked-scaffold_25-processed-gene-0.20-mRNA-1 K14838 NOP15; nucleolar protein 15 BriteHierarchies 09182 Protein families: genetic information processing 03009 Ribosome biogenesis [BR:ko03009]
maker-scaffold_25-augustus-gene-0.4-mRNA-1 NA NA NA NA
snap_masked-scaffold_25-processed-gene-0.23-mRNA-1 NA NA NA NA
augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1 NA NA NA NA
snap_masked-scaffold_25-processed-gene-0.21-mRNA-1 NA NA NA NA
augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1 NA NA NA NA
augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1 NA NA NA NA
maker-scaffold_25-augustus-gene-0.7-mRNA-1 K16261 YAT; yeast amino acid transporter BriteHierarchies 09183 Protein families: signaling and cellular processes 02000 Transporters [BR:ko02000]
augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1 NA NA NA NA
augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1 K22384 WRB BriteHierarchies 09183 Protein families: signaling and cellular processes 02000 Transporters [BR:ko02000]

Reading the COG

cog.raw <- read.csv("annotations/cog-raw.txt", sep = "\t", row.names = NULL, header = FALSE)
cog.raw <- cog.raw[c(1,13,14,15)]
colnames(cog.raw) <- c("gene","COG","COG.name","COG.desc")
kable(cog.raw[1:10,])
gene COG COG.name COG.desc
maker-scaffold_25-augustus-gene-0.4-mRNA-1 COG1063 Tdh Threonine dehydrogenase and related Zn-dependent dehydrogenases
augustus_masked-scaffold_7-processed-gene-0.364-mRNA-1 COG5309 COG5309 Exo-beta-1,3-glucanase
maker-scaffold_7-augustus-gene-0.1355-mRNA-1 COG0199 RpsN Ribosomal protein S14
maker-scaffold_7-augustus-gene-0.1443-mRNA-1 COG2515 Acd 1-aminocyclopropane-1-carboxylate deaminase
maker-scaffold_7-augustus-gene-0.1444-mRNA-1 COG0300 DltE Short-chain dehydrogenases of various substrate specificities
maker-scaffold_18-augustus-gene-0.242-mRNA-1 COG5277 COG5277 Actin and related proteins
maker-scaffold_19-augustus-gene-0.251-mRNA-1 COG2046 MET3 ATP sulfurylase (sulfate adenylyltransferase)
maker-scaffold_19-augustus-gene-0.251-mRNA-1 COG0529 CysC Adenylylsulfate kinase and related kinases
maker-scaffold_3-augustus-gene-0.4-mRNA-1 COG0474 MgtA Cation transport ATPase
snap_masked-scaffold_3-processed-gene-0.1103-mRNA-1 COG1946 TesB Acyl-CoA thioesterase

Adding COG to the gene annotation table

gene.annot2 <- dplyr::left_join(gene.annot,cog.raw,by="gene")
## Warning: Column `gene` joining factors with different levels, coercing to
## character vector
gene.annot2 <- data.frame(sapply(gene.annot2, as.character),stringsAsFactors = FALSE)
gene.annot2[is.na(gene.annot2)] <- ""
write.csv(gene.annot2,"gene.annotations.csv")
kable(gene.annot2[1:10,])
gene KO KEGG.name KEGG.tier1 KEGG.tier2 KEGG.pathway COG COG.name COG.desc
snap_masked-scaffold_25-processed-gene-0.20-mRNA-1 K14838 NOP15; nucleolar protein 15 BriteHierarchies 09182 Protein families: genetic information processing 03009 Ribosome biogenesis [BR:ko03009]
maker-scaffold_25-augustus-gene-0.4-mRNA-1 COG1063 Tdh Threonine dehydrogenase and related Zn-dependent dehydrogenases
snap_masked-scaffold_25-processed-gene-0.23-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1
snap_masked-scaffold_25-processed-gene-0.21-mRNA-1 COG1196 Smc Chromosome segregation ATPases
augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1
maker-scaffold_25-augustus-gene-0.7-mRNA-1 K16261 YAT; yeast amino acid transporter BriteHierarchies 09183 Protein families: signaling and cellular processes 02000 Transporters [BR:ko02000] COG0833 LysP Amino acid transporters
augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1 K22384 WRB BriteHierarchies 09183 Protein families: signaling and cellular processes 02000 Transporters [BR:ko02000]

Fingerprinting genes to define and count alleles

fingerprints for each gene/strain combination are generated by pasting together the 1s and 0s for each strain across each gene. Example output can be seen after the code chunk.

the code is ugly but fast:

var.genes <- read.csv("loci.to.genes.csv", row.names = 1)
allele.barcode <- var.genes[-c(2:5)]
p <- function(v) {
  Reduce(f=paste0, x = v)} #define a paste function
allele.bygen <- dplyr::group_by(allele.barcode, gene)
#this can be used to create the fingerprint. unfortunately it does not take arguments in place of names
#of variables to be summarized; thus I cannot loop through it at all
allele.counts <- dplyr::summarize(allele.bygen, GM307 = p(as.character(GM307)), 
                                                GM236 = p(as.character(GM236)),
                                                GM219 = p(as.character(GM219)),
                                                GM216 = p(as.character(GM216)),
                                  GM205 = p(as.character(GM205)),
                                  GM196 = p(as.character(GM196)),
                                  GM188 = p(as.character(GM188)),
                                  GM179 = p(as.character(GM179)),
                                  GM178 = p(as.character(GM178)),
                                  GM177 = p(as.character(GM177)),
                                  GM170 = p(as.character(GM170)),
                                  GM158 = p(as.character(GM158)),
                                  GM140 = p(as.character(GM140)),
                                  GM118 = p(as.character(GM118)),
                                  GM108 = p(as.character(GM108)),
                                  GM64 = p(as.character(GM64)),
                                  GM59 = p(as.character(GM59)),
                                  GM39 = p(as.character(GM39)),
                                  GM25 = p(as.character(GM25)),
                                  GM20 = p(as.character(GM20)),
                                  GM17 = p(as.character(GM17)),
                                  GM11 = p(as.character(GM11)))

allele.counts2 <- allele.counts
allele.counts2$count <- apply(allele.counts2[2:23], 1, function(x)length(unique(x))) #this counts unique values in the fingerprint section
kable(allele.counts2[c(1:5),c(1:3)])
gene GM307 GM236
augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1 000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000
augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1 00000000 00000000
augustus_masked-scaffold_0-processed-gene-0.1938-mRNA-1 00000000000000000000 00000000000000000000
augustus_masked-scaffold_0-processed-gene-0.1944-mRNA-1 000000000 000000000
augustus_masked-scaffold_0-processed-gene-0.1953-mRNA-1 000000000000000000 000000000000000000

The table above demonstrates the variant loci fingerprint process.

Next we create a list of number of distinct alleles (different fingerprints) for each gene:

rm(allele.counts3)
## Warning in rm(allele.counts3): object 'allele.counts3' not found
allele.counts3 <- allele.counts2[,c(1,24)]
write.csv(allele.counts3, "allele.counts.csv")
kable(head(allele.counts3))
gene count
augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1 15
augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1 4
augustus_masked-scaffold_0-processed-gene-0.1938-mRNA-1 8
augustus_masked-scaffold_0-processed-gene-0.1944-mRNA-1 4
augustus_masked-scaffold_0-processed-gene-0.1953-mRNA-1 9
augustus_masked-scaffold_0-processed-gene-0.1956-mRNA-1 2

We need to be sure to add in genes with no variant alleles. Up until now, only variants are included. Make an empty table from the gff and join it with the counts, then replace NA values with 1

library(dplyr)
#make the empty table with all genes
gff.build <- gff
gff.build$prox <- 0
gff.build <- gff.build[c(4,5)]
gff.build <- unique(gff.build)

#join in the counts
gff.build <- dplyr::left_join(gff.build, allele.counts3, by = "gene")
## Warning: Column `gene` joining character vector and factor, coercing into
## character vector
#replace NA with 1
gff.build[is.na(gff.build)] <- 1
gff.build <- gff.build[c(1,3)]

From which we can make a distribution histogram of allele counts.

library(ggplot2)

hist.allele.count <- ggplot(gff.build, aes(gff.build$count)) + geom_histogram(binwidth=1, color = "white") + theme_classic()
hist.allele.count <- hist.allele.count + xlab("Alleles") + ylab("Genes")+ ylim(c(0,1800))
hist.allele.count

```

For the sake of resolving this ambiguity, we will add in genes with no variants and redraw the histogram. To do this, we have to go back the the full list of genes found in kegg.raw and join in this allele.counts3 table

full.gene.allele.count <- dplyr::left_join(kegg.raw, allele.counts3, by="gene")
## Warning: Column `gene` joining factors with different levels, coercing to
## character vector
full.gene.allele.count[is.na(full.gene.allele.count)] <- 1 #replace NA with 1
hist(full.gene.allele.count$count, breaks=22, main = paste("distribution of nucleotide alleles across all",nrow(full.gene.allele.count),"genes"),
     xlab = "allele count among all 22 strains", ylab = "genes")

write the full gene non-synonymous allele count table to a file

write.csv(full.gene.allele.count,"full_gene_allele_count.csv")

Summarizing highly allelic genes by mapping to pathways

Let’s attach functional information where available and restrict ourselves to genes with predicted function (and sort by allele counts)

library(dplyr)
allele.counts3$gene <- as.character(allele.counts3$gene)
allele.counts.KEGG <- dplyr::left_join(allele.counts3, gene.annot2[c(1:6)])
## Joining, by = "gene"
allele.counts.KEGG <- allele.counts.KEGG[order(allele.counts.KEGG$KO),]
kable(head(allele.counts.KEGG))
gene count KO KEGG.name KEGG.tier1 KEGG.tier2 KEGG.pathway
augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1 4
augustus_masked-scaffold_0-processed-gene-0.1960-mRNA-1 4
augustus_masked-scaffold_0-processed-gene-0.1967-mRNA-1 6
augustus_masked-scaffold_0-processed-gene-0.1978-mRNA-1 9
augustus_masked-scaffold_0-processed-gene-0.1979-mRNA-1 5
augustus_masked-scaffold_0-processed-gene-0.1982-mRNA-1 2

Exploring variant genes that map to KEGG pathways

First, we attach pathway information: note that that at this stage, we lose a hard count of variant genes since many KO can map to multiple pathways

allele.counts.KEGG.pres <- apply(allele.counts.KEGG, 2, function(x) gsub("^$|^ $", NA, x)) #this makes sure that any empty cells are NA
allele.counts.KEGG.pres <- data.frame(na.omit(allele.counts.KEGG.pres),stringsAsFactors = FALSE) #this removes all rows with no KEGG 
allele.counts.KEGG.pres$count <- as.numeric(allele.counts.KEGG.pres$count)
write.csv(allele.counts.KEGG.pres, "gene.allele.counts.csv")
kable(head(allele.counts.KEGG.pres))
gene count KO KEGG.name KEGG.tier1 KEGG.tier2 KEGG.pathway
augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1 3 K00002 AKR1A1 Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1 3 K00002 AKR1A1 Metabolism 09101 Carbohydrate metabolism 00040 Pentose and glucuronate interconversions [PATH:ko00040]
augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1 3 K00002 AKR1A1 Metabolism 09103 Lipid metabolism 00561 Glycerolipid metabolism [PATH:ko00561]
augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1 3 K00002 AKR1A1 Metabolism 09111 Xenobiotics biodegradation and metabolism 00930 Caprolactam degradation [PATH:ko00930]
augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1 3 K00002 AKR1A1 BriteHierarchies 09183 Protein families: signaling and cellular processes 04147 Exosome [BR:ko04147]
augustus_masked-scaffold_26-processed-gene-0.177-mRNA-1 7 K00002 AKR1A1 Metabolism 09101 Carbohydrate metabolism 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
cat("KEGG PATHWAY MAPPING SUMMARY: of",nrow(full.gene.allele.count), "total genes,",
  nrow(allele.counts3),"show at least one variant relative to the reference genome.",
  length(unique(allele.counts.KEGG.pres$gene)), "can be assigned to a KO family. These",
  length(unique(allele.counts.KEGG.pres$gene)), "genes can be mapped to",
  nrow(allele.counts.KEGG.pres),"specific KEGG pathway connections representing",
  length(unique(allele.counts.KEGG.pres$KEGG.pathway)),"unique pathways")
## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 6229 show at least one variant relative to the reference genome. 3239 can be assigned to a KO family. These 3239 genes can be mapped to 9604 specific KEGG pathway connections representing 422 unique pathways

This should be further narrowed down to more highly allelic genes, but first, lets summarize by pathway and determine the most highly impacted pathways in terms of counts of variant genes present. This code chunks just counts pathway name occurances using a summarize function.

KEGG.pathway.var.summary <- allele.counts.KEGG.pres
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr

#this can be used to create the fingerprint. unfortunately it does not take arguments in place of names
#of variables to be summarized; thus I cannot loop through it at all
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.no.min <-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary[c(1:25),])
KEGG.pathway variant.genes
04131 Membrane trafficking [BR:ko04131] 328
03036 Chromosome and associated proteins [BR:ko03036] 269
02000 Transporters [BR:ko02000] 213
04147 Exosome [BR:ko04147] 198
03009 Ribosome biogenesis [BR:ko03009] 196
03029 Mitochondrial biogenesis [BR:ko03029] 181
03019 Messenger RNA Biogenesis [BR:ko03019] 160
03021 Transcription machinery [BR:ko03021] 147
03400 DNA repair and recombination proteins [BR:ko03400] 146
03041 Spliceosome [BR:ko03041] 129
04121 Ubiquitin system [BR:ko04121] 126
03016 Transfer RNA biogenesis [BR:ko03016] 121
03011 Ribosome [BR:ko03011] 117
01002 Peptidases [BR:ko01002] 111
03010 Ribosome [PATH:ko03010] 102
99980 Enzymes with EC numbers 96
00230 Purine metabolism [PATH:ko00230] 95
03013 RNA transport [PATH:ko03013] 84
03040 Spliceosome [PATH:ko03040] 81
03110 Chaperones and folding catalysts [BR:ko03110] 79
03032 DNA replication proteins [BR:ko03032] 78
01009 Protein phosphatase and associated proteins [BR:ko01009] 77
04714 Thermogenesis [PATH:ko04714] 76
04141 Protein processing in endoplasmic reticulum [PATH:ko04141] 74
00240 Pyrimidine metabolism [PATH:ko00240] 73
This summary does not take allele count into consideration: a gene with only 1 or 2 alleles is not very meaningful for this analysis

This is the same as above but modified to exclude genes that have lower variant counts.

#50% minimum alleles (>10)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 10,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.50 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.50) <- c("KEGG.pathway","50%.alleles")

#75% minimum alleles (>16)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.75 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.75) <- c("KEGG.pathway","75%.alleles")

#90% minimum alleles (>19)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.9,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.90 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.90) <- c("KEGG.pathway","90%.alleles")

#join the pathway counts together with different allelic variation thresholds
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary.no.min, KEGG.pathway.var.summary.50, by = "KEGG.pathway")
## Warning: Column `KEGG.pathway` joining factors with different levels,
## coercing to character vector
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary, KEGG.pathway.var.summary.75, by = "KEGG.pathway")
## Warning: Column `KEGG.pathway` joining character vector and factor,
## coercing into character vector
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary, KEGG.pathway.var.summary.90, by = "KEGG.pathway")
## Warning: Column `KEGG.pathway` joining character vector and factor,
## coercing into character vector
kable(KEGG.pathway.var.summary[c(1:25),])
KEGG.pathway variant.genes 50%.alleles 75%.alleles 90%.alleles
04131 Membrane trafficking [BR:ko04131] 328 41 2 NA
03036 Chromosome and associated proteins [BR:ko03036] 269 52 6 NA
02000 Transporters [BR:ko02000] 213 8 NA NA
04147 Exosome [BR:ko04147] 198 8 2 NA
03009 Ribosome biogenesis [BR:ko03009] 196 35 6 1
03029 Mitochondrial biogenesis [BR:ko03029] 181 16 1 NA
03019 Messenger RNA Biogenesis [BR:ko03019] 160 26 3 NA
03021 Transcription machinery [BR:ko03021] 147 22 2 NA
03400 DNA repair and recombination proteins [BR:ko03400] 146 24 3 NA
03041 Spliceosome [BR:ko03041] 129 16 NA NA
04121 Ubiquitin system [BR:ko04121] 126 21 2 1
03016 Transfer RNA biogenesis [BR:ko03016] 121 18 2 NA
03011 Ribosome [BR:ko03011] 117 2 NA NA
01002 Peptidases [BR:ko01002] 111 9 NA NA
03010 Ribosome [PATH:ko03010] 102 1 NA NA
99980 Enzymes with EC numbers 96 11 1 NA
00230 Purine metabolism [PATH:ko00230] 95 10 2 2
03013 RNA transport [PATH:ko03013] 84 13 1 NA
03040 Spliceosome [PATH:ko03040] 81 8 NA NA
03110 Chaperones and folding catalysts [BR:ko03110] 79 4 NA NA
03032 DNA replication proteins [BR:ko03032] 78 10 1 NA
01009 Protein phosphatase and associated proteins [BR:ko01009] 77 5 NA NA
04714 Thermogenesis [PATH:ko04714] 76 NA NA NA
04141 Protein processing in endoplasmic reticulum [PATH:ko04141] 74 6 NA NA
00240 Pyrimidine metabolism [PATH:ko00240] 73 14 6 NA
allele.75 <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]

cat("KEGG PATHWAY MAPPING SUMMARY: of",nrow(full.gene.allele.count), "total genes,",
  length(unique(allele.75$gene)),"show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These",
  length(unique(allele.75$gene)), "genes can be mapped to",
  nrow(allele.75),"specific KEGG pathway connections representing",
  length(unique(allele.counts.KEGG.pres$KEGG.pathway)),"unique pathways")
## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 33 show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These 33 genes can be mapped to 101 specific KEGG pathway connections representing 422 unique pathways
KEGG.pathway.75.summary <- KEGG.pathway.var.summary[c(1,4)]
KEGG.pathway.75.summary <- KEGG.pathway.75.summary[order(KEGG.pathway.75.summary$`75%.alleles`,decreasing=TRUE),]
kable(KEGG.pathway.75.summary[1:20,])
KEGG.pathway 75%.alleles
03036 Chromosome and associated proteins [BR:ko03036] 6
03009 Ribosome biogenesis [BR:ko03009] 6
00240 Pyrimidine metabolism [PATH:ko00240] 6
00250 Alanine 6
03008 Ribosome biogenesis in eukaryotes [PATH:ko03008] 4
03019 Messenger RNA Biogenesis [BR:ko03019] 3
03400 DNA repair and recombination proteins [BR:ko03400] 3
04113 Meiosis - yeast [PATH:ko04113] 3
00300 Lysine biosynthesis [PATH:ko00300] 3
04131 Membrane trafficking [BR:ko04131] 2
04147 Exosome [BR:ko04147] 2
03021 Transcription machinery [BR:ko03021] 2
04121 Ubiquitin system [BR:ko04121] 2
03016 Transfer RNA biogenesis [BR:ko03016] 2
00230 Purine metabolism [PATH:ko00230] 2
04812 Cytoskeleton proteins [BR:ko04812] 2
04110 Cell cycle [PATH:ko04110] 2
03015 mRNA surveillance pathway [PATH:ko03015] 2
04213 Longevity regulating pathway - multiple species [PATH:ko04213] 2
03460 Fanconi anemia pathway [PATH:ko03460] 2
To be clear, this is a representation of where the most highly allelic genes are mapped according to the KEGG system.

This is fascinating, but lets try to put this in two frames of reference: 1. average number of variants per gene across the strains in each pathway 2. total number of genes placing into each pathway in the reference genome, % of pathway genes subject to variation

Let’s create two more tables representing higher levels of organization: X

#tier1
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.tier1 <- as.factor(KEGG.pathway.var.summary$KEGG.tier1)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.tier1) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.tier1<-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary.tier1[c(1:10),])
KEGG.tier1 variant.genes
BriteHierarchies 38
Metabolism 21
CellularProcesses 12
GeneticInformationProcessing 12
HumanDiseases 8
EnvironmentalInformationProcessing 6
OrganismalSystems 3
NotIncludedinPathwayorBrite 1
NA NA
NA NA
#tier 2
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.tier2 <- as.factor(KEGG.pathway.var.summary$KEGG.tier2)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.tier2) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.tier2<-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary.tier2[c(1:25),])
KEGG.tier2 variant.genes
09182 Protein families: genetic information processing 30
09105 Amino acid metabolism 9
09143 Cell growth and death 9
09104 Nucleotide metabolism 8
09122 Translation 8
09132 Signal transduction 6
09167 Infectious diseases 5
09181 Protein families: metabolism 4
09183 Protein families: signaling and cellular processes 4
09124 Replication and repair 3
09145 Cellular community - prokaryotes 2
09149 Aging 2
09161 Cancers 2
09101 Carbohydrate metabolism 1
09106 Metabolism of other amino acids 1
09108 Metabolism of cofactors and vitamins 1
09110 Biosynthesis of other secondary metabolites 1
09123 Folding 1
09141 Transport and catabolism 1
09155 Excretory system 1
09168 Drug resistance 1
09191 Unclassified: metabolism 1
NA NA
NA NA
NA NA

Key results

full.gene.allele.count <- dplyr::left_join(kegg.raw, allele.counts3, by="gene")
## Warning: Column `gene` joining factor and character vector, coercing into
## character vector
full.gene.allele.count[is.na(full.gene.allele.count)] <- 0 #replace NA with 0
hist(full.gene.allele.count$count, breaks=22, main = paste("distribution of nucleotide alleles across all",nrow(full.gene.allele.count),"genes"),
     xlab = "allele count among all 22 strains", ylab = "genes")

## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 6229 show at least one variant relative to the reference genome. 3239 can be assigned to a KO family. These 3239 genes can be mapped to 9604 specific KEGG pathway connections representing 422 unique pathways
## HIGHLY VARIANT ALLELES: of 6273 total genes, 33 show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These 33 genes can be mapped to 101 specific KEGG pathway connections representing 422 unique pathways

Pathways (KEGG) of genes with at least 17 alleles

kable(KEGG.pathway.75.summary[1:20,])
KEGG.pathway 75%.alleles
03036 Chromosome and associated proteins [BR:ko03036] 6
03009 Ribosome biogenesis [BR:ko03009] 6
00240 Pyrimidine metabolism [PATH:ko00240] 6
00250 Alanine 6
03008 Ribosome biogenesis in eukaryotes [PATH:ko03008] 4
03019 Messenger RNA Biogenesis [BR:ko03019] 3
03400 DNA repair and recombination proteins [BR:ko03400] 3
04113 Meiosis - yeast [PATH:ko04113] 3
00300 Lysine biosynthesis [PATH:ko00300] 3
04131 Membrane trafficking [BR:ko04131] 2
04147 Exosome [BR:ko04147] 2
03021 Transcription machinery [BR:ko03021] 2
04121 Ubiquitin system [BR:ko04121] 2
03016 Transfer RNA biogenesis [BR:ko03016] 2
00230 Purine metabolism [PATH:ko00230] 2
04812 Cytoskeleton proteins [BR:ko04812] 2
04110 Cell cycle [PATH:ko04110] 2
03015 mRNA surveillance pathway [PATH:ko03015] 2
04213 Longevity regulating pathway - multiple species [PATH:ko04213] 2
03460 Fanconi anemia pathway [PATH:ko03460] 2
write.csv(KEGG.pathway.75.summary, "KEGG.pathway.75")

Tier 1 classification (KEGG) of genes with at least 17 alleles

kable(KEGG.pathway.var.summary.tier1[c(1:8),])
KEGG.tier1 variant.genes
BriteHierarchies 38
Metabolism 21
CellularProcesses 12
GeneticInformationProcessing 12
HumanDiseases 8
EnvironmentalInformationProcessing 6
OrganismalSystems 3
NotIncludedinPathwayorBrite 1
write.csv(KEGG.pathway.var.summary.tier1, "KEGG.tier1.75")

Tier 2 classification (KEGG) of genes with at least 17 alleles

kable(KEGG.pathway.var.summary.tier2[c(1:25),])
KEGG.tier2 variant.genes
09182 Protein families: genetic information processing 30
09105 Amino acid metabolism 9
09143 Cell growth and death 9
09104 Nucleotide metabolism 8
09122 Translation 8
09132 Signal transduction 6
09167 Infectious diseases 5
09181 Protein families: metabolism 4
09183 Protein families: signaling and cellular processes 4
09124 Replication and repair 3
09145 Cellular community - prokaryotes 2
09149 Aging 2
09161 Cancers 2
09101 Carbohydrate metabolism 1
09106 Metabolism of other amino acids 1
09108 Metabolism of cofactors and vitamins 1
09110 Biosynthesis of other secondary metabolites 1
09123 Folding 1
09141 Transport and catabolism 1
09155 Excretory system 1
09168 Drug resistance 1
09191 Unclassified: metabolism 1
NA NA
NA NA
NA NA
write.csv(KEGG.pathway.var.summary.tier2, "KEGG.tier2.75")

library(markdown) library(rmarkdown) library(knitr) rpubsUpload(“G.morbida.synonymous.alles.and.functions”, “/home/rmurdoch/Dropbox/Projects/Geosmithia/dartr/3.loci.gene.alleles.html”) ```