Overview

Skip to the end for main results

(this is a work in progress)

The goal of this pipeline is to characterize alleles in terms of diversity and functional annotation

This script takes the output from the “loci.gene.alleles.rmd” script, which is a map of variant loci to genes. Then this script creates gene to KEGG and COG annotations (COG is currently not used) which is used later in the script to place genes into pathways.

This script also determines allelic counts across the project by generating locus-fingerprints across each gene. This allows for determination of unique fingerprint (allele) counts for each gene.

The variant genes are explored functionally by placement into KEGG pathways (modules)

Inputs: * geosmithia_KEGG.txt: these are gene to KEGG annotations that were obtained at the KEGG mapper web service * KEGG_BRITE2.csv: this is a map of KO to higher hierarchicical categories. This was downloaded from KEGG/BRITE and pre-processed in Excel * cog-raw.txt: This maps genes to COGs and COG categories; this was obtained through the EggNOG sequencemapper webservice * loci.to.genes.csv: this is the output from loci.to.genes.rmd, a map of variant loci to gene names along with a matrix of occurence in strains

Outputs: * full_gene_allele_count.csv: this is a simple two column table that shows allele count for each gene. This includes genes with no variants * allele count histograms (not written, save out of the html rmarkdown)

importing and reformatting the gff3 file

library(stringr)
gff <- read.csv("../g.morbida_CDS.gff3", sep = "\t", row.names = NULL, skip = 2, header = FALSE)
gff <- gff[c(1,4,5,9)]
colnames(gff) <- c("scaffold","start","stop","gene")
gene.split <- str_split_fixed(gff$gene, ";",2)
gene.split <- str_split_fixed(gene.split[,1], "=",2)
gene.split <- str_split_fixed(gene.split[,2], ":",2)
gff$gene <- as.character(gene.split[,1])
kable(head(gff),format="markdown")

scaffold	start	stop	gene
scaffold_0	674	1516	augustus_masked-scaffold_0-processed-gene-0.2274-mRNA-1
scaffold_0	9081	12438	maker-scaffold_0-augustus-gene-0.2665-mRNA-1
scaffold_0	12504	13645	maker-scaffold_0-augustus-gene-0.2665-mRNA-1
scaffold_0	14732	15859	augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1
scaffold_0	16219	16581	maker-scaffold_0-augustus-gene-0.2819-mRNA-1
scaffold_0	16781	17263	maker-scaffold_0-augustus-gene-0.2819-mRNA-1

Reading the KEGG annotations

KEGG forms a good foundation for building the database because every gene has an associated annotation (in most cases empty)

kegg.raw <- read.csv("annotations/geosmithia_KEGG.txt", sep ="\t", row.names = NULL, header = FALSE)
colnames(kegg.raw) <- c("gene","KO")
kable(kegg.raw[1:10,])

gene	KO
snap_masked-scaffold_25-processed-gene-0.20-mRNA-1	K14838
maker-scaffold_25-augustus-gene-0.4-mRNA-1
snap_masked-scaffold_25-processed-gene-0.23-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1
snap_masked-scaffold_25-processed-gene-0.21-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1
maker-scaffold_25-augustus-gene-0.7-mRNA-1	K16261
augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1	K22384

Adding KEGG descriptions to the gene to KO map

The BRITE KEGG hierarchcial map was created manually from the BRITE download.

reading the BRITE map

BRITE <- read.csv("annotations/KEGG_BRITE2.csv",header=TRUE,row.names=NULL)
colnames(BRITE)[c(1,2,3)] <- c("KEGG.tier1","KEGG.tier2","KEGG.pathway")
BRITE <- BRITE[c(4,5,1,2,3)]
BRITE <- BRITE[-which(BRITE$KO == ""), ] #remove any rows with empty KO, removes ~ 1000 rows
kable(head(BRITE))

KO	KEGG.name	KEGG.tier1	KEGG.tier2	KEGG.pathway
K00844	HK; hexokinase [EC:2.7.1.1]	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K12407	GCK; glucokinase [EC:2.7.1.2]	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K00845	glk; glucokinase [EC:2.7.1.2]	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K01810	GPI	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K06859	pgi1; glucose-6-phosphate isomerase	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K13810	tal-pgi; transaldolase / glucose-6-phosphate isomerase [EC:2.2.1.2 5.3.1.9]	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
BRITE con	tains redundancies because a single KO can map to multiple pathways. This mea	ns that pathw	ay mapping is really not possibl	e at this stage.
We can at	tempt to do a one-to-one KO to name mapping for now. We use this much later in	the pipeline	.

BRITE2 <- data.frame(BRITE[c(1,2)])

Attaching KO names, pathways, and tiers (this section is not used for now)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

kegg.raw2 <- kegg.raw[c(2,1)]
gene.annot <- dplyr::left_join(kegg.raw2,BRITE,by="KO")

## Warning: Column `KO` joining factors with different levels, coercing to
## character vector

gene.annot <- gene.annot[c(2,1,3:6)]
kable(gene.annot[1:10,])

gene	KO	KEGG.name	KEGG.tier1	KEGG.tier2	KEGG.pathway
snap_masked-scaffold_25-processed-gene-0.20-mRNA-1	K14838	NOP15; nucleolar protein 15	BriteHierarchies	09182 Protein families: genetic information processing	03009 Ribosome biogenesis [BR:ko03009]
maker-scaffold_25-augustus-gene-0.4-mRNA-1		NA	NA	NA	NA
snap_masked-scaffold_25-processed-gene-0.23-mRNA-1		NA	NA	NA	NA
augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1		NA	NA	NA	NA
snap_masked-scaffold_25-processed-gene-0.21-mRNA-1		NA	NA	NA	NA
augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1		NA	NA	NA	NA
augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1		NA	NA	NA	NA
maker-scaffold_25-augustus-gene-0.7-mRNA-1	K16261	YAT; yeast amino acid transporter	BriteHierarchies	09183 Protein families: signaling and cellular processes	02000 Transporters [BR:ko02000]
augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1		NA	NA	NA	NA
augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1	K22384	WRB	BriteHierarchies	09183 Protein families: signaling and cellular processes	02000 Transporters [BR:ko02000]

Reading the COG

cog.raw <- read.csv("annotations/cog-raw.txt", sep = "\t", row.names = NULL, header = FALSE)
cog.raw <- cog.raw[c(1,13,14,15)]
colnames(cog.raw) <- c("gene","COG","COG.name","COG.desc")
kable(cog.raw[1:10,])

gene	COG	COG.name	COG.desc
maker-scaffold_25-augustus-gene-0.4-mRNA-1	COG1063	Tdh	Threonine dehydrogenase and related Zn-dependent dehydrogenases
augustus_masked-scaffold_7-processed-gene-0.364-mRNA-1	COG5309	COG5309	Exo-beta-1,3-glucanase
maker-scaffold_7-augustus-gene-0.1355-mRNA-1	COG0199	RpsN	Ribosomal protein S14
maker-scaffold_7-augustus-gene-0.1443-mRNA-1	COG2515	Acd	1-aminocyclopropane-1-carboxylate deaminase
maker-scaffold_7-augustus-gene-0.1444-mRNA-1	COG0300	DltE	Short-chain dehydrogenases of various substrate specificities
maker-scaffold_18-augustus-gene-0.242-mRNA-1	COG5277	COG5277	Actin and related proteins
maker-scaffold_19-augustus-gene-0.251-mRNA-1	COG2046	MET3	ATP sulfurylase (sulfate adenylyltransferase)
maker-scaffold_19-augustus-gene-0.251-mRNA-1	COG0529	CysC	Adenylylsulfate kinase and related kinases
maker-scaffold_3-augustus-gene-0.4-mRNA-1	COG0474	MgtA	Cation transport ATPase
snap_masked-scaffold_3-processed-gene-0.1103-mRNA-1	COG1946	TesB	Acyl-CoA thioesterase

Adding COG to the gene annotation table

gene.annot2 <- dplyr::left_join(gene.annot,cog.raw,by="gene")

## Warning: Column `gene` joining factors with different levels, coercing to
## character vector

gene.annot2 <- data.frame(sapply(gene.annot2, as.character),stringsAsFactors = FALSE)
gene.annot2[is.na(gene.annot2)] <- ""
write.csv(gene.annot2,"gene.annotations.csv")
kable(gene.annot2[1:10,])

gene	KO	KEGG.name	KEGG.tier1	KEGG.tier2	KEGG.pathway	COG	COG.name	COG.desc
snap_masked-scaffold_25-processed-gene-0.20-mRNA-1	K14838	NOP15; nucleolar protein 15	BriteHierarchies	09182 Protein families: genetic information processing	03009 Ribosome biogenesis [BR:ko03009]
maker-scaffold_25-augustus-gene-0.4-mRNA-1						COG1063	Tdh	Threonine dehydrogenase and related Zn-dependent dehydrogenases
snap_masked-scaffold_25-processed-gene-0.23-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1
snap_masked-scaffold_25-processed-gene-0.21-mRNA-1						COG1196	Smc	Chromosome segregation ATPases
augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1
maker-scaffold_25-augustus-gene-0.7-mRNA-1	K16261	YAT; yeast amino acid transporter	BriteHierarchies	09183 Protein families: signaling and cellular processes	02000 Transporters [BR:ko02000]	COG0833	LysP	Amino acid transporters
augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1	K22384	WRB	BriteHierarchies	09183 Protein families: signaling and cellular processes	02000 Transporters [BR:ko02000]

Fingerprinting genes to define and count alleles

fingerprints for each gene/strain combination are generated by pasting together the 1s and 0s for each strain across each gene. Example output can be seen after the code chunk.

the code is ugly but fast:

var.genes <- read.csv("loci.to.genes.csv", row.names = 1)
allele.barcode <- var.genes[-c(2:5)]
p <- function(v) {
  Reduce(f=paste0, x = v)} #define a paste function
allele.bygen <- dplyr::group_by(allele.barcode, gene)
#this can be used to create the fingerprint. unfortunately it does not take arguments in place of names
#of variables to be summarized; thus I cannot loop through it at all
allele.counts <- dplyr::summarize(allele.bygen, GM307 = p(as.character(GM307)), 
                                                GM236 = p(as.character(GM236)),
                                                GM219 = p(as.character(GM219)),
                                                GM216 = p(as.character(GM216)),
                                  GM205 = p(as.character(GM205)),
                                  GM196 = p(as.character(GM196)),
                                  GM188 = p(as.character(GM188)),
                                  GM179 = p(as.character(GM179)),
                                  GM178 = p(as.character(GM178)),
                                  GM177 = p(as.character(GM177)),
                                  GM170 = p(as.character(GM170)),
                                  GM158 = p(as.character(GM158)),
                                  GM140 = p(as.character(GM140)),
                                  GM118 = p(as.character(GM118)),
                                  GM108 = p(as.character(GM108)),
                                  GM64 = p(as.character(GM64)),
                                  GM59 = p(as.character(GM59)),
                                  GM39 = p(as.character(GM39)),
                                  GM25 = p(as.character(GM25)),
                                  GM20 = p(as.character(GM20)),
                                  GM17 = p(as.character(GM17)),
                                  GM11 = p(as.character(GM11)))

allele.counts2 <- allele.counts
allele.counts2$count <- apply(allele.counts2[2:23], 1, function(x)length(unique(x))) #this counts unique values in the fingerprint section
kable(allele.counts2[c(1:5),c(1:3)])

gene	GM307	GM236
augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1	000000000000000000000000000000000000000000000	000000000000000000000000000000000000000000000
augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1	00000000	00000000
augustus_masked-scaffold_0-processed-gene-0.1938-mRNA-1	00000000000000000000	00000000000000000000
augustus_masked-scaffold_0-processed-gene-0.1944-mRNA-1	000000000	000000000
augustus_masked-scaffold_0-processed-gene-0.1953-mRNA-1	000000000000000000	000000000000000000

The table above demonstrates the variant loci fingerprint process.

Next we create a list of number of distinct alleles (different fingerprints) for each gene:

rm(allele.counts3)

## Warning in rm(allele.counts3): object 'allele.counts3' not found

allele.counts3 <- allele.counts2[,c(1,24)]
write.csv(allele.counts3, "allele.counts.csv")
kable(head(allele.counts3))

gene	count
augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1	15
augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1	4
augustus_masked-scaffold_0-processed-gene-0.1938-mRNA-1	8
augustus_masked-scaffold_0-processed-gene-0.1944-mRNA-1	4
augustus_masked-scaffold_0-processed-gene-0.1953-mRNA-1	9
augustus_masked-scaffold_0-processed-gene-0.1956-mRNA-1	2

We need to be sure to add in genes with no variant alleles. Up until now, only variants are included. Make an empty table from the gff and join it with the counts, then replace NA values with 1

library(dplyr)
#make the empty table with all genes
gff.build <- gff
gff.build$prox <- 0
gff.build <- gff.build[c(4,5)]
gff.build <- unique(gff.build)

#join in the counts
gff.build <- dplyr::left_join(gff.build, allele.counts3, by = "gene")

## Warning: Column `gene` joining character vector and factor, coercing into
## character vector

#replace NA with 1
gff.build[is.na(gff.build)] <- 1
gff.build <- gff.build[c(1,3)]

From which we can make a distribution histogram of allele counts.

library(ggplot2)

hist.allele.count <- ggplot(gff.build, aes(gff.build$count)) + geom_histogram(binwidth=1, color = "white") + theme_classic()
hist.allele.count <- hist.allele.count + xlab("Alleles") + ylab("Genes")+ ylim(c(0,1800))
hist.allele.count

```

For the sake of resolving this ambiguity, we will add in genes with no variants and redraw the histogram. To do this, we have to go back the the full list of genes found in kegg.raw and join in this allele.counts3 table

full.gene.allele.count <- dplyr::left_join(kegg.raw, allele.counts3, by="gene")

## Warning: Column `gene` joining factors with different levels, coercing to
## character vector

full.gene.allele.count[is.na(full.gene.allele.count)] <- 1 #replace NA with 1
hist(full.gene.allele.count$count, breaks=22, main = paste("distribution of nucleotide alleles across all",nrow(full.gene.allele.count),"genes"),
     xlab = "allele count among all 22 strains", ylab = "genes")

write the full gene non-synonymous allele count table to a file

write.csv(full.gene.allele.count,"full_gene_allele_count.csv")

Summarizing highly allelic genes by mapping to pathways

Let’s attach functional information where available and restrict ourselves to genes with predicted function (and sort by allele counts)

library(dplyr)
allele.counts3$gene <- as.character(allele.counts3$gene)
allele.counts.KEGG <- dplyr::left_join(allele.counts3, gene.annot2[c(1:6)])

## Joining, by = "gene"

allele.counts.KEGG <- allele.counts.KEGG[order(allele.counts.KEGG$KO),]
kable(head(allele.counts.KEGG))

gene	count	KO	KEGG.name	KEGG.tier1	KEGG.tier2	KEGG.pathway
augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1	4
augustus_masked-scaffold_0-processed-gene-0.1960-mRNA-1	4
augustus_masked-scaffold_0-processed-gene-0.1967-mRNA-1	6
augustus_masked-scaffold_0-processed-gene-0.1978-mRNA-1	9
augustus_masked-scaffold_0-processed-gene-0.1979-mRNA-1	5
augustus_masked-scaffold_0-processed-gene-0.1982-mRNA-1	2

Exploring variant genes that map to KEGG pathways

First, we attach pathway information: note that that at this stage, we lose a hard count of variant genes since many KO can map to multiple pathways

allele.counts.KEGG.pres <- apply(allele.counts.KEGG, 2, function(x) gsub("^$|^ $", NA, x)) #this makes sure that any empty cells are NA
allele.counts.KEGG.pres <- data.frame(na.omit(allele.counts.KEGG.pres),stringsAsFactors = FALSE) #this removes all rows with no KEGG 
allele.counts.KEGG.pres$count <- as.numeric(allele.counts.KEGG.pres$count)
write.csv(allele.counts.KEGG.pres, "gene.allele.counts.csv")
kable(head(allele.counts.KEGG.pres))

gene	count	KO	KEGG.name	KEGG.tier1	KEGG.tier2	KEGG.pathway
augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1	3	K00002	AKR1A1	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1	3	K00002	AKR1A1	Metabolism	09101 Carbohydrate metabolism	00040 Pentose and glucuronate interconversions [PATH:ko00040]
augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1	3	K00002	AKR1A1	Metabolism	09103 Lipid metabolism	00561 Glycerolipid metabolism [PATH:ko00561]
augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1	3	K00002	AKR1A1	Metabolism	09111 Xenobiotics biodegradation and metabolism	00930 Caprolactam degradation [PATH:ko00930]
augustus_masked-scaffold_13-processed-gene-0.10-mRNA-1	3	K00002	AKR1A1	BriteHierarchies	09183 Protein families: signaling and cellular processes	04147 Exosome [BR:ko04147]
augustus_masked-scaffold_26-processed-gene-0.177-mRNA-1	7	K00002	AKR1A1	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]

cat("KEGG PATHWAY MAPPING SUMMARY: of",nrow(full.gene.allele.count), "total genes,",
  nrow(allele.counts3),"show at least one variant relative to the reference genome.",
  length(unique(allele.counts.KEGG.pres$gene)), "can be assigned to a KO family. These",
  length(unique(allele.counts.KEGG.pres$gene)), "genes can be mapped to",
  nrow(allele.counts.KEGG.pres),"specific KEGG pathway connections representing",
  length(unique(allele.counts.KEGG.pres$KEGG.pathway)),"unique pathways")

## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 6229 show at least one variant relative to the reference genome. 3239 can be assigned to a KO family. These 3239 genes can be mapped to 9604 specific KEGG pathway connections representing 422 unique pathways

This should be further narrowed down to more highly allelic genes, but first, lets summarize by pathway and determine the most highly impacted pathways in terms of counts of variant genes present. This code chunks just counts pathway name occurances using a summarize function.

KEGG.pathway.var.summary <- allele.counts.KEGG.pres
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr

#this can be used to create the fingerprint. unfortunately it does not take arguments in place of names
#of variables to be summarized; thus I cannot loop through it at all
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.no.min <-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary[c(1:25),])

KEGG.pathway	variant.genes
04131 Membrane trafficking [BR:ko04131]	328
03036 Chromosome and associated proteins [BR:ko03036]	269
02000 Transporters [BR:ko02000]	213
04147 Exosome [BR:ko04147]	198
03009 Ribosome biogenesis [BR:ko03009]	196
03029 Mitochondrial biogenesis [BR:ko03029]	181
03019 Messenger RNA Biogenesis [BR:ko03019]	160
03021 Transcription machinery [BR:ko03021]	147
03400 DNA repair and recombination proteins [BR:ko03400]	146
03041 Spliceosome [BR:ko03041]	129
04121 Ubiquitin system [BR:ko04121]	126
03016 Transfer RNA biogenesis [BR:ko03016]	121
03011 Ribosome [BR:ko03011]	117
01002 Peptidases [BR:ko01002]	111
03010 Ribosome [PATH:ko03010]	102
99980 Enzymes with EC numbers	96
00230 Purine metabolism [PATH:ko00230]	95
03013 RNA transport [PATH:ko03013]	84
03040 Spliceosome [PATH:ko03040]	81
03110 Chaperones and folding catalysts [BR:ko03110]	79
03032 DNA replication proteins [BR:ko03032]	78
01009 Protein phosphatase and associated proteins [BR:ko01009]	77
04714 Thermogenesis [PATH:ko04714]	76
04141 Protein processing in endoplasmic reticulum [PATH:ko04141]	74
00240 Pyrimidine metabolism [PATH:ko00240]	73
This summary does not take allele count into consideration: a gene	with only 1 or 2 alleles is not very meaningful for this analysis

This is the same as above but modified to exclude genes that have lower variant counts.

#50% minimum alleles (>10)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 10,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.50 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.50) <- c("KEGG.pathway","50%.alleles")

#75% minimum alleles (>16)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.75 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.75) <- c("KEGG.pathway","75%.alleles")

#90% minimum alleles (>19)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.9,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.90 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.90) <- c("KEGG.pathway","90%.alleles")

#join the pathway counts together with different allelic variation thresholds
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary.no.min, KEGG.pathway.var.summary.50, by = "KEGG.pathway")

## Warning: Column `KEGG.pathway` joining factors with different levels,
## coercing to character vector

KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary, KEGG.pathway.var.summary.75, by = "KEGG.pathway")

## Warning: Column `KEGG.pathway` joining character vector and factor,
## coercing into character vector

KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary, KEGG.pathway.var.summary.90, by = "KEGG.pathway")

## Warning: Column `KEGG.pathway` joining character vector and factor,
## coercing into character vector

kable(KEGG.pathway.var.summary[c(1:25),])

KEGG.pathway	variant.genes	50%.alleles	75%.alleles	90%.alleles
04131 Membrane trafficking [BR:ko04131]	328	41	2	NA
03036 Chromosome and associated proteins [BR:ko03036]	269	52	6	NA
02000 Transporters [BR:ko02000]	213	8	NA	NA
04147 Exosome [BR:ko04147]	198	8	2	NA
03009 Ribosome biogenesis [BR:ko03009]	196	35	6	1
03029 Mitochondrial biogenesis [BR:ko03029]	181	16	1	NA
03019 Messenger RNA Biogenesis [BR:ko03019]	160	26	3	NA
03021 Transcription machinery [BR:ko03021]	147	22	2	NA
03400 DNA repair and recombination proteins [BR:ko03400]	146	24	3	NA
03041 Spliceosome [BR:ko03041]	129	16	NA	NA
04121 Ubiquitin system [BR:ko04121]	126	21	2	1
03016 Transfer RNA biogenesis [BR:ko03016]	121	18	2	NA
03011 Ribosome [BR:ko03011]	117	2	NA	NA
01002 Peptidases [BR:ko01002]	111	9	NA	NA
03010 Ribosome [PATH:ko03010]	102	1	NA	NA
99980 Enzymes with EC numbers	96	11	1	NA
00230 Purine metabolism [PATH:ko00230]	95	10	2	2
03013 RNA transport [PATH:ko03013]	84	13	1	NA
03040 Spliceosome [PATH:ko03040]	81	8	NA	NA
03110 Chaperones and folding catalysts [BR:ko03110]	79	4	NA	NA
03032 DNA replication proteins [BR:ko03032]	78	10	1	NA
01009 Protein phosphatase and associated proteins [BR:ko01009]	77	5	NA	NA
04714 Thermogenesis [PATH:ko04714]	76	NA	NA	NA
04141 Protein processing in endoplasmic reticulum [PATH:ko04141]	74	6	NA	NA
00240 Pyrimidine metabolism [PATH:ko00240]	73	14	6	NA

allele.75 <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]

cat("KEGG PATHWAY MAPPING SUMMARY: of",nrow(full.gene.allele.count), "total genes,",
  length(unique(allele.75$gene)),"show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These",
  length(unique(allele.75$gene)), "genes can be mapped to",
  nrow(allele.75),"specific KEGG pathway connections representing",
  length(unique(allele.counts.KEGG.pres$KEGG.pathway)),"unique pathways")

## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 33 show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These 33 genes can be mapped to 101 specific KEGG pathway connections representing 422 unique pathways

KEGG.pathway.75.summary <- KEGG.pathway.var.summary[c(1,4)]
KEGG.pathway.75.summary <- KEGG.pathway.75.summary[order(KEGG.pathway.75.summary$`75%.alleles`,decreasing=TRUE),]
kable(KEGG.pathway.75.summary[1:20,])

KEGG.pathway	75%.alleles
03036 Chromosome and associated proteins [BR:ko03036]	6
03009 Ribosome biogenesis [BR:ko03009]	6
00240 Pyrimidine metabolism [PATH:ko00240]	6
00250 Alanine	6
03008 Ribosome biogenesis in eukaryotes [PATH:ko03008]	4
03019 Messenger RNA Biogenesis [BR:ko03019]	3
03400 DNA repair and recombination proteins [BR:ko03400]	3
04113 Meiosis - yeast [PATH:ko04113]	3
00300 Lysine biosynthesis [PATH:ko00300]	3
04131 Membrane trafficking [BR:ko04131]	2
04147 Exosome [BR:ko04147]	2
03021 Transcription machinery [BR:ko03021]	2
04121 Ubiquitin system [BR:ko04121]	2
03016 Transfer RNA biogenesis [BR:ko03016]	2
00230 Purine metabolism [PATH:ko00230]	2
04812 Cytoskeleton proteins [BR:ko04812]	2
04110 Cell cycle [PATH:ko04110]	2
03015 mRNA surveillance pathway [PATH:ko03015]	2
04213 Longevity regulating pathway - multiple species [PATH:ko04213]	2
03460 Fanconi anemia pathway [PATH:ko03460]	2
To be clear, this is a representation of where the most highly allelic	genes are mapped according to the KEGG system.

This is fascinating, but lets try to put this in two frames of reference: 1. average number of variants per gene across the strains in each pathway 2. total number of genes placing into each pathway in the reference genome, % of pathway genes subject to variation

Let’s create two more tables representing higher levels of organization: X

#tier1
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.tier1 <- as.factor(KEGG.pathway.var.summary$KEGG.tier1)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.tier1) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.tier1<-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary.tier1[c(1:10),])

KEGG.tier1	variant.genes
BriteHierarchies	38
Metabolism	21
CellularProcesses	12
GeneticInformationProcessing	12
HumanDiseases	8
EnvironmentalInformationProcessing	6
OrganismalSystems	3
NotIncludedinPathwayorBrite	1
NA	NA
NA	NA

#tier 2
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.tier2 <- as.factor(KEGG.pathway.var.summary$KEGG.tier2)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.tier2) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.tier2<-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary.tier2[c(1:25),])

KEGG.tier2	variant.genes
09182 Protein families: genetic information processing	30
09105 Amino acid metabolism	9
09143 Cell growth and death	9
09104 Nucleotide metabolism	8
09122 Translation	8
09132 Signal transduction	6
09167 Infectious diseases	5
09181 Protein families: metabolism	4
09183 Protein families: signaling and cellular processes	4
09124 Replication and repair	3
09145 Cellular community - prokaryotes	2
09149 Aging	2
09161 Cancers	2
09101 Carbohydrate metabolism	1
09106 Metabolism of other amino acids	1
09108 Metabolism of cofactors and vitamins	1
09110 Biosynthesis of other secondary metabolites	1
09123 Folding	1
09141 Transport and catabolism	1
09155 Excretory system	1
09168 Drug resistance	1
09191 Unclassified: metabolism	1
NA	NA
NA	NA
NA	NA

Key results

full.gene.allele.count <- dplyr::left_join(kegg.raw, allele.counts3, by="gene")

## Warning: Column `gene` joining factor and character vector, coercing into
## character vector

full.gene.allele.count[is.na(full.gene.allele.count)] <- 0 #replace NA with 0
hist(full.gene.allele.count$count, breaks=22, main = paste("distribution of nucleotide alleles across all",nrow(full.gene.allele.count),"genes"),
     xlab = "allele count among all 22 strains", ylab = "genes")

## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 6229 show at least one variant relative to the reference genome. 3239 can be assigned to a KO family. These 3239 genes can be mapped to 9604 specific KEGG pathway connections representing 422 unique pathways

## HIGHLY VARIANT ALLELES: of 6273 total genes, 33 show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These 33 genes can be mapped to 101 specific KEGG pathway connections representing 422 unique pathways

Pathways (KEGG) of genes with at least 17 alleles

kable(KEGG.pathway.75.summary[1:20,])

KEGG.pathway	75%.alleles
03036 Chromosome and associated proteins [BR:ko03036]	6
03009 Ribosome biogenesis [BR:ko03009]	6
00240 Pyrimidine metabolism [PATH:ko00240]	6
00250 Alanine	6
03008 Ribosome biogenesis in eukaryotes [PATH:ko03008]	4
03019 Messenger RNA Biogenesis [BR:ko03019]	3
03400 DNA repair and recombination proteins [BR:ko03400]	3
04113 Meiosis - yeast [PATH:ko04113]	3
00300 Lysine biosynthesis [PATH:ko00300]	3
04131 Membrane trafficking [BR:ko04131]	2
04147 Exosome [BR:ko04147]	2
03021 Transcription machinery [BR:ko03021]	2
04121 Ubiquitin system [BR:ko04121]	2
03016 Transfer RNA biogenesis [BR:ko03016]	2
00230 Purine metabolism [PATH:ko00230]	2
04812 Cytoskeleton proteins [BR:ko04812]	2
04110 Cell cycle [PATH:ko04110]	2
03015 mRNA surveillance pathway [PATH:ko03015]	2
04213 Longevity regulating pathway - multiple species [PATH:ko04213]	2
03460 Fanconi anemia pathway [PATH:ko03460]	2

write.csv(KEGG.pathway.75.summary, "KEGG.pathway.75")

Tier 1 classification (KEGG) of genes with at least 17 alleles

kable(KEGG.pathway.var.summary.tier1[c(1:8),])

KEGG.tier1	variant.genes
BriteHierarchies	38
Metabolism	21
CellularProcesses	12
GeneticInformationProcessing	12
HumanDiseases	8
EnvironmentalInformationProcessing	6
OrganismalSystems	3
NotIncludedinPathwayorBrite	1

write.csv(KEGG.pathway.var.summary.tier1, "KEGG.tier1.75")

Tier 2 classification (KEGG) of genes with at least 17 alleles

kable(KEGG.pathway.var.summary.tier2[c(1:25),])

KEGG.tier2	variant.genes
09182 Protein families: genetic information processing	30
09105 Amino acid metabolism	9
09143 Cell growth and death	9
09104 Nucleotide metabolism	8
09122 Translation	8
09132 Signal transduction	6
09167 Infectious diseases	5
09181 Protein families: metabolism	4
09183 Protein families: signaling and cellular processes	4
09124 Replication and repair	3
09145 Cellular community - prokaryotes	2
09149 Aging	2
09161 Cancers	2
09101 Carbohydrate metabolism	1
09106 Metabolism of other amino acids	1
09108 Metabolism of cofactors and vitamins	1
09110 Biosynthesis of other secondary metabolites	1
09123 Folding	1
09141 Transport and catabolism	1
09155 Excretory system	1
09168 Drug resistance	1
09191 Unclassified: metabolism	1
NA	NA
NA	NA
NA	NA

write.csv(KEGG.pathway.var.summary.tier2, "KEGG.tier2.75")

library(markdown) library(rmarkdown) library(knitr) rpubsUpload(“G.morbida.synonymous.alles.and.functions”, “/home/rmurdoch/Dropbox/Projects/Geosmithia/dartr/3.loci.gene.alleles.html”) ```

Genome position to annotation

Robert W. Murdoch

January 14, 2019