Overview

This script is used for the AA (non-synonomous) variants pipeline

The goal of this pipeline is to characterize alleles in terms of diversity and functional annotation

This script takes the output from the “5.loci.to.genes.aa.rmd” script, which is a map of variant loci to genes. Then this script creates gene to KEGG and COG annotations (COG is currently not used) which is used later in the script to place genes into pathways.

This script also determines allelic counts across the project by generating locus-fingerprints across each gene. This allows for determination of unique fingerprint (allele) counts for each gene.

The variant genes are explored functionally by placement into KEGG pathways (modules)

Inputs: * geosmithia_KEGG.txt: these are gene to KEGG annotations that were obtained at the KEGG mapper web service * KEGG_BRITE2.csv: this is a map of KO to higher hierarchicical categories. This was downloaded from KEGG/BRITE and pre-processed in Excel * cog-raw.txt: This maps genes to COGs and COG categories; this was obtained through the EggNOG sequencemapper webservice * aa.loci.to.genes.csv: this is the output from loci.to.genes.rmd, a map of variant loci to gene names along with a matrix of occurence in strains

Outputs: * full_gene_allele_count_aa.csv: this is a simple two column table that shows allele count for each gene. This includes genes with no variants. This is fed later into the sequence diversity pipeline for further characterization * allele count histograms (not written, save out of the html rmarkdown)

importing and reformatting the gff3 file

library(stringr)
gff <- read.csv("../g.morbida_CDS.gff3", sep = "\t", row.names = NULL, skip = 2, header = FALSE)
gff <- gff[c(1,4,5,9)]
colnames(gff) <- c("scaffold","start","stop","gene")
gene.split <- str_split_fixed(gff$gene, ";",2)
gene.split <- str_split_fixed(gene.split[,1], "=",2)
gene.split <- str_split_fixed(gene.split[,2], ":",2)
gff$gene <- as.character(gene.split[,1])
kable(head(gff),format="markdown")

scaffold	start	stop	gene
scaffold_0	674	1516	augustus_masked-scaffold_0-processed-gene-0.2274-mRNA-1
scaffold_0	9081	12438	maker-scaffold_0-augustus-gene-0.2665-mRNA-1
scaffold_0	12504	13645	maker-scaffold_0-augustus-gene-0.2665-mRNA-1
scaffold_0	14732	15859	augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1
scaffold_0	16219	16581	maker-scaffold_0-augustus-gene-0.2819-mRNA-1
scaffold_0	16781	17263	maker-scaffold_0-augustus-gene-0.2819-mRNA-1

Reading the KEGG annotations

KEGG forms a good foundation for building the database because every gene has an associated annotation (in most cases empty)

kegg.raw <- read.csv("annotations/geosmithia_KEGG.txt", sep ="\t", row.names = NULL, header = FALSE)
colnames(kegg.raw) <- c("gene","KO")
kable(kegg.raw[1:10,])

gene	KO
snap_masked-scaffold_25-processed-gene-0.20-mRNA-1	K14838
maker-scaffold_25-augustus-gene-0.4-mRNA-1
snap_masked-scaffold_25-processed-gene-0.23-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1
snap_masked-scaffold_25-processed-gene-0.21-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1
maker-scaffold_25-augustus-gene-0.7-mRNA-1	K16261
augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1	K22384

Adding KEGG descriptions to the gene to KO map

The BRITE KEGG hierarchcial map was created manually from the BRITE download.

reading the BRITE map

BRITE <- read.csv("annotations/KEGG_BRITE2.csv",header=TRUE,row.names=NULL)
colnames(BRITE)[c(1,2,3)] <- c("KEGG.tier1","KEGG.tier2","KEGG.pathway")
BRITE <- BRITE[c(4,5,1,2,3)]
BRITE <- BRITE[-which(BRITE$KO == ""), ] #remove any rows with empty KO, removes ~ 1000 rows
kable(head(BRITE))

KO	KEGG.name	KEGG.tier1	KEGG.tier2	KEGG.pathway
K00844	HK; hexokinase [EC:2.7.1.1]	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K12407	GCK; glucokinase [EC:2.7.1.2]	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K00845	glk; glucokinase [EC:2.7.1.2]	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K01810	GPI	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K06859	pgi1; glucose-6-phosphate isomerase	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
K13810	tal-pgi; transaldolase / glucose-6-phosphate isomerase [EC:2.2.1.2 5.3.1.9]	Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
BRITE con	tains redundancies because a single KO can map to multiple pathways. This mea	ns that pathw	ay mapping is really not possibl	e at this stage.
We can at	tempt to do a one-to-one KO to name mapping for now. We use this much later in	the pipeline	.

BRITE2 <- data.frame(BRITE[c(1,2)])

Attaching KO names, pathways, and tiers (this section is not used for now)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

kegg.raw2 <- kegg.raw[c(2,1)]
gene.annot <- dplyr::left_join(kegg.raw2,BRITE,by="KO")

## Warning: Column `KO` joining factors with different levels, coercing to
## character vector

gene.annot <- gene.annot[c(2,1,3:6)]
kable(gene.annot[1:10,])

gene	KO	KEGG.name	KEGG.tier1	KEGG.tier2	KEGG.pathway
snap_masked-scaffold_25-processed-gene-0.20-mRNA-1	K14838	NOP15; nucleolar protein 15	BriteHierarchies	09182 Protein families: genetic information processing	03009 Ribosome biogenesis [BR:ko03009]
maker-scaffold_25-augustus-gene-0.4-mRNA-1		NA	NA	NA	NA
snap_masked-scaffold_25-processed-gene-0.23-mRNA-1		NA	NA	NA	NA
augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1		NA	NA	NA	NA
snap_masked-scaffold_25-processed-gene-0.21-mRNA-1		NA	NA	NA	NA
augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1		NA	NA	NA	NA
augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1		NA	NA	NA	NA
maker-scaffold_25-augustus-gene-0.7-mRNA-1	K16261	YAT; yeast amino acid transporter	BriteHierarchies	09183 Protein families: signaling and cellular processes	02000 Transporters [BR:ko02000]
augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1		NA	NA	NA	NA
augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1	K22384	WRB	BriteHierarchies	09183 Protein families: signaling and cellular processes	02000 Transporters [BR:ko02000]

Reading the COG

cog.raw <- read.csv("annotations/cog-raw.txt", sep = "\t", row.names = NULL, header = FALSE)
cog.raw <- cog.raw[c(1,13,14,15)]
colnames(cog.raw) <- c("gene","COG","COG.name","COG.desc")
kable(cog.raw[1:10,])

gene	COG	COG.name	COG.desc
maker-scaffold_25-augustus-gene-0.4-mRNA-1	COG1063	Tdh	Threonine dehydrogenase and related Zn-dependent dehydrogenases
augustus_masked-scaffold_7-processed-gene-0.364-mRNA-1	COG5309	COG5309	Exo-beta-1,3-glucanase
maker-scaffold_7-augustus-gene-0.1355-mRNA-1	COG0199	RpsN	Ribosomal protein S14
maker-scaffold_7-augustus-gene-0.1443-mRNA-1	COG2515	Acd	1-aminocyclopropane-1-carboxylate deaminase
maker-scaffold_7-augustus-gene-0.1444-mRNA-1	COG0300	DltE	Short-chain dehydrogenases of various substrate specificities
maker-scaffold_18-augustus-gene-0.242-mRNA-1	COG5277	COG5277	Actin and related proteins
maker-scaffold_19-augustus-gene-0.251-mRNA-1	COG2046	MET3	ATP sulfurylase (sulfate adenylyltransferase)
maker-scaffold_19-augustus-gene-0.251-mRNA-1	COG0529	CysC	Adenylylsulfate kinase and related kinases
maker-scaffold_3-augustus-gene-0.4-mRNA-1	COG0474	MgtA	Cation transport ATPase
snap_masked-scaffold_3-processed-gene-0.1103-mRNA-1	COG1946	TesB	Acyl-CoA thioesterase

Adding COG to the gene annotation table

gene.annot2 <- dplyr::left_join(gene.annot,cog.raw,by="gene")

## Warning: Column `gene` joining factors with different levels, coercing to
## character vector

gene.annot2 <- data.frame(sapply(gene.annot2, as.character),stringsAsFactors = FALSE)
gene.annot2[is.na(gene.annot2)] <- ""
write.csv(gene.annot2,"gene.annotations.csv")
kable(gene.annot2[1:10,])

gene	KO	KEGG.name	KEGG.tier1	KEGG.tier2	KEGG.pathway	COG	COG.name	COG.desc
snap_masked-scaffold_25-processed-gene-0.20-mRNA-1	K14838	NOP15; nucleolar protein 15	BriteHierarchies	09182 Protein families: genetic information processing	03009 Ribosome biogenesis [BR:ko03009]
maker-scaffold_25-augustus-gene-0.4-mRNA-1						COG1063	Tdh	Threonine dehydrogenase and related Zn-dependent dehydrogenases
snap_masked-scaffold_25-processed-gene-0.23-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.9-mRNA-1
snap_masked-scaffold_25-processed-gene-0.21-mRNA-1						COG1196	Smc	Chromosome segregation ATPases
augustus_masked-scaffold_25-processed-gene-0.15-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.18-mRNA-1
maker-scaffold_25-augustus-gene-0.7-mRNA-1	K16261	YAT; yeast amino acid transporter	BriteHierarchies	09183 Protein families: signaling and cellular processes	02000 Transporters [BR:ko02000]	COG0833	LysP	Amino acid transporters
augustus_masked-scaffold_25-processed-gene-0.13-mRNA-1
augustus_masked-scaffold_25-processed-gene-0.16-mRNA-1	K22384	WRB	BriteHierarchies	09183 Protein families: signaling and cellular processes	02000 Transporters [BR:ko02000]

Fingerprinting genes to define and count alleles

fingerprints for each gene/strain combination are generated by pasting together the 1s and 0s for each strain across each gene. Example output can be seen after the code chunk.

the code is ugly but fast:

aa.var.genes <- read.csv("aa.loci.to.genes.csv", row.names = 1)
aa.allele.barcode <- aa.var.genes[-c(2:5)]
p <- function(v) {
  Reduce(f=paste0, x = v)} #define a paste function
aa.allele.bygen <- dplyr::group_by(aa.allele.barcode, gene)
#this can be used to create the fingerprint. unfortunately it does not take arguments in place of names
#of variables to be summarized; thus I cannot loop through it at all
aa.allele.counts <- dplyr::summarize(aa.allele.bygen, GM307 = p(as.character(GM307)), 
                                                GM236 = p(as.character(GM236)),
                                                GM219 = p(as.character(GM219)),
                                                GM216 = p(as.character(GM216)),
                                  GM205 = p(as.character(GM205)),
                                  GM196 = p(as.character(GM196)),
                                  GM188 = p(as.character(GM188)),
                                  GM179 = p(as.character(GM179)),
                                  GM178 = p(as.character(GM178)),
                                  GM177 = p(as.character(GM177)),
                                  GM170 = p(as.character(GM170)),
                                  GM158 = p(as.character(GM158)),
                                  GM140 = p(as.character(GM140)),
                                  GM118 = p(as.character(GM118)),
                                  GM108 = p(as.character(GM108)),
                                  GM64 = p(as.character(GM64)),
                                  GM59 = p(as.character(GM59)),
                                  GM39 = p(as.character(GM39)),
                                  GM25 = p(as.character(GM25)),
                                  GM20 = p(as.character(GM20)),
                                  GM17 = p(as.character(GM17)),
                                  GM11 = p(as.character(GM11)))

aa.allele.counts2 <- aa.allele.counts
aa.allele.counts2$count <- apply(aa.allele.counts2[2:23], 1, function(x)length(unique(x))) #this counts unique values in the fingerprint section
kable(aa.allele.counts2[c(1:5),c(1:3)])

gene	GM307	GM236
augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1	000000000000000000000000000	000000000000000000000000000
augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1	00	00
augustus_masked-scaffold_0-processed-gene-0.1938-mRNA-1	000000000000000	000000000000000
augustus_masked-scaffold_0-processed-gene-0.1944-mRNA-1	00	00
augustus_masked-scaffold_0-processed-gene-0.1953-mRNA-1	0000000000	0000000000

The table above demonstrates the variant loci fingerprint process.

Next we create a list of number of distinct alleles (different fingerprints) for each gene:

rm(aa.allele.counts3)

## Warning in rm(aa.allele.counts3): object 'aa.allele.counts3' not found

aa.allele.counts3 <- aa.allele.counts2[,c(1,24)]
write.csv(aa.allele.counts3, "aa.allele.counts.csv")
kable(head(aa.allele.counts3))

gene	count
augustus_masked-scaffold_0-processed-gene-0.1933-mRNA-1	12
augustus_masked-scaffold_0-processed-gene-0.1934-mRNA-1	3
augustus_masked-scaffold_0-processed-gene-0.1938-mRNA-1	6
augustus_masked-scaffold_0-processed-gene-0.1944-mRNA-1	1
augustus_masked-scaffold_0-processed-gene-0.1953-mRNA-1	4
augustus_masked-scaffold_0-processed-gene-0.1957-mRNA-1	3

We need to be sure to add in genes with no variant alleles. Up until now, only variants are included. Make an empty table from the gff and join it with the counts, then replace NA values with 1

library(dplyr)
#make the empty table with all genes
aa.gff.build <- gff
aa.gff.build$prox <- 0
aa.gff.build <- aa.gff.build[c(4,5)]
aa.gff.build <- unique(aa.gff.build)

#join in the counts
aa.gff.build <- dplyr::left_join(aa.gff.build, aa.allele.counts3, by = "gene")

## Warning: Column `gene` joining character vector and factor, coercing into
## character vector

#replace NA with 1
aa.gff.build[is.na(aa.gff.build)] <- 1
aa.gff.build <- aa.gff.build[c(1,3)]

From which we can make a distribution histogram of allele counts.

library(ggplot2)

hist.allele.count.aa <- ggplot(aa.gff.build, aes(aa.gff.build$count)) + geom_histogram(binwidth=1, color = "white") + theme_classic()
hist.allele.count.aa <- hist.allele.count.aa + xlab("Alleles") + ylab("Genes")
hist.allele.count.aa

write the full gene non-synonymous allele count table to a file

aa.full.gene.allele.count <- dplyr::left_join(kegg.raw, aa.allele.counts3, by="gene")

## Warning: Column `gene` joining factors with different levels, coercing to
## character vector

aa.full.gene.allele.count[is.na(aa.full.gene.allele.count)] <- 1 #replace NA with 1
write.csv(aa.full.gene.allele.count,"full_gene_allele_count_aa.csv")

Summarizing highly allelic genes by mapping to pathways

Let’s attach functional information where available and restrict ourselves to genes with predicted function (and sort by allele counts)

library(dplyr)
aa.full.gene.allele.count$gene <- as.character(aa.full.gene.allele.count$gene)
allele.counts.KEGG <- dplyr::left_join(aa.full.gene.allele.count, gene.annot2[c(1:6)])

## Joining, by = c("gene", "KO")

## Warning: Column `KO` joining factor and character vector, coercing into
## character vector

allele.counts.KEGG <- allele.counts.KEGG[order(allele.counts.KEGG$count, decreasing = TRUE),]
kable(head(allele.counts.KEGG))

	gene	KO	count	KEGG.name	KEGG.tier1	KEGG.tier2	KEGG.pathway
9945	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	Metabolism	09104 Nucleotide metabolism	00230 Purine metabolism [PATH:ko00230]
9946	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	Metabolism	09104 Nucleotide metabolism	00230 Purine metabolism [PATH:ko00230]
9947	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	CellularProcesses	09143 Cell growth and death	04113 Meiosis - yeast [PATH:ko04113]
9948	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	CellularProcesses	09143 Cell growth and death	04113 Meiosis - yeast [PATH:ko04113]
9949	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	CellularProcesses	09145 Cellular community - prokaryotes	02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025]
9950	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	CellularProcesses	09145 Cellular community - prokaryotes	02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025]

Exploring variant genes that map to KEGG pathways

First, we attach pathway information: note that that at this stage, we lose a hard count of variant genes since many KO can map to multiple pathways

allele.counts.KEGG.pres <- apply(allele.counts.KEGG, 2, function(x) gsub("^$|^ $", NA, x)) #this makes sure that any empty cells are NA
allele.counts.KEGG.pres <- data.frame(na.omit(allele.counts.KEGG.pres),stringsAsFactors = FALSE) #this removes all rows with no KEGG 
allele.counts.KEGG.pres$count <- as.numeric(allele.counts.KEGG.pres$count)
write.csv(allele.counts.KEGG.pres, "aas.gene.allele.counts.csv")
kable(head(allele.counts.KEGG.pres))

	gene	KO	count	KEGG.name	KEGG.tier1	KEGG.tier2	KEGG.pathway
9945	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	Metabolism	09104 Nucleotide metabolism	00230 Purine metabolism [PATH:ko00230]
9946	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	Metabolism	09104 Nucleotide metabolism	00230 Purine metabolism [PATH:ko00230]
9947	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	CellularProcesses	09143 Cell growth and death	04113 Meiosis - yeast [PATH:ko04113]
9948	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	CellularProcesses	09143 Cell growth and death	04113 Meiosis - yeast [PATH:ko04113]
9949	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	CellularProcesses	09145 Cellular community - prokaryotes	02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025]
9950	augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	CellularProcesses	09145 Cellular community - prokaryotes	02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025]

cat("KEGG PATHWAY MAPPING SUMMARY: of",nrow(aa.full.gene.allele.count), "total genes,",
  nrow(aa.allele.counts3),"show at least one variant relative to the reference genome.",
  length(unique(allele.counts.KEGG.pres$gene)), "can be assigned to a KO family. These",
  length(unique(allele.counts.KEGG.pres$gene)), "genes can be mapped to",
  nrow(allele.counts.KEGG.pres),"specific KEGG pathway connections representing",
  length(unique(allele.counts.KEGG.pres$KEGG.pathway)),"unique pathways")

## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 5564 show at least one variant relative to the reference genome. 3270 can be assigned to a KO family. These 3270 genes can be mapped to 9805 specific KEGG pathway connections representing 422 unique pathways

This should be further narrowed down to more highly allelic genes, but first, lets summarize by pathway and determine the most highly impacted pathways in terms of counts of variant genes present. This code chunks just counts pathway name occurances using a summarize function.

KEGG.pathway.var.summary <- allele.counts.KEGG.pres
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr

#this can be used to create the fingerprint. unfortunately it does not take arguments in place of names
#of variables to be summarized; thus I cannot loop through it at all
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.no.min <-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary[c(1:25),])

KEGG.pathway	variant.genes
04131 Membrane trafficking [BR:ko04131]	336
03036 Chromosome and associated proteins [BR:ko03036]	273
02000 Transporters [BR:ko02000]	213
04147 Exosome [BR:ko04147]	203
03009 Ribosome biogenesis [BR:ko03009]	196
03029 Mitochondrial biogenesis [BR:ko03029]	182
03019 Messenger RNA Biogenesis [BR:ko03019]	160
03021 Transcription machinery [BR:ko03021]	148
03400 DNA repair and recombination proteins [BR:ko03400]	147
03041 Spliceosome [BR:ko03041]	129
04121 Ubiquitin system [BR:ko04121]	127
03011 Ribosome [BR:ko03011]	123
03016 Transfer RNA biogenesis [BR:ko03016]	121
01002 Peptidases [BR:ko01002]	111
03010 Ribosome [PATH:ko03010]	108
00230 Purine metabolism [PATH:ko00230]	96
99980 Enzymes with EC numbers	96
03013 RNA transport [PATH:ko03013]	84
03040 Spliceosome [PATH:ko03040]	81
03110 Chaperones and folding catalysts [BR:ko03110]	81
01009 Protein phosphatase and associated proteins [BR:ko01009]	79
04714 Thermogenesis [PATH:ko04714]	79
03032 DNA replication proteins [BR:ko03032]	78
00240 Pyrimidine metabolism [PATH:ko00240]	75
04141 Protein processing in endoplasmic reticulum [PATH:ko04141]	75
This summary does not take allele count into consideration: a gene	with only 1 or 2 alleles is not very meaningful for this analysis

This is the same as above but modified to exclude genes that have lower variant counts.

#50% minimum alleles (>10)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 10,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.50 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.50) <- c("KEGG.pathway","50%.alleles")

#75% minimum alleles (>16)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.75 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.75) <- c("KEGG.pathway","75%.alleles")

#90% minimum alleles (>19)
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.9,]
KEGG.pathway.var.summary$KEGG.pathway <- as.factor(KEGG.pathway.var.summary$KEGG.pathway)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.pathway) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.90 <- KEGG.pathway.var.summary
colnames(KEGG.pathway.var.summary.90) <- c("KEGG.pathway","90%.alleles")

#join the pathway counts together with different allelic variation thresholds
KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary.no.min, KEGG.pathway.var.summary.50, by = "KEGG.pathway")

## Warning: Column `KEGG.pathway` joining factors with different levels,
## coercing to character vector

KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary, KEGG.pathway.var.summary.75, by = "KEGG.pathway")

## Warning: Column `KEGG.pathway` joining character vector and factor,
## coercing into character vector

KEGG.pathway.var.summary <- dplyr::left_join(KEGG.pathway.var.summary, KEGG.pathway.var.summary.90, by = "KEGG.pathway")

## Warning: Column `KEGG.pathway` joining character vector and factor,
## coercing into character vector

kable(KEGG.pathway.var.summary[c(1:25),])

KEGG.pathway	variant.genes	50%.alleles	75%.alleles	90%.alleles
04131 Membrane trafficking [BR:ko04131]	336	3	NA	NA
03036 Chromosome and associated proteins [BR:ko03036]	273	9	2	NA
02000 Transporters [BR:ko02000]	213	2	NA	NA
04147 Exosome [BR:ko04147]	203	1	NA	NA
03009 Ribosome biogenesis [BR:ko03009]	196	13	2	NA
03029 Mitochondrial biogenesis [BR:ko03029]	182	3	1	NA
03019 Messenger RNA Biogenesis [BR:ko03019]	160	6	1	NA
03021 Transcription machinery [BR:ko03021]	148	4	1	NA
03400 DNA repair and recombination proteins [BR:ko03400]	147	9	1	NA
03041 Spliceosome [BR:ko03041]	129	3	NA	NA
04121 Ubiquitin system [BR:ko04121]	127	6	1	NA
03011 Ribosome [BR:ko03011]	123	NA	NA	NA
03016 Transfer RNA biogenesis [BR:ko03016]	121	4	1	NA
01002 Peptidases [BR:ko01002]	111	2	NA	NA
03010 Ribosome [PATH:ko03010]	108	NA	NA	NA
00230 Purine metabolism [PATH:ko00230]	96	2	2	2
99980 Enzymes with EC numbers	96	3	NA	NA
03013 RNA transport [PATH:ko03013]	84	1	NA	NA
03040 Spliceosome [PATH:ko03040]	81	NA	NA	NA
03110 Chaperones and folding catalysts [BR:ko03110]	81	2	NA	NA
01009 Protein phosphatase and associated proteins [BR:ko01009]	79	1	NA	NA
04714 Thermogenesis [PATH:ko04714]	79	NA	NA	NA
03032 DNA replication proteins [BR:ko03032]	78	2	1	NA
00240 Pyrimidine metabolism [PATH:ko00240]	75	NA	NA	NA
04141 Protein processing in endoplasmic reticulum [PATH:ko04141]	75	2	NA	NA

allele.75 <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]

cat("KEGG PATHWAY MAPPING SUMMARY: of",nrow(aa.full.gene.allele.count), "total genes,",
  length(unique(allele.75$gene)),"show at least 17 amino-acid or splice site alleles across the 22 strains and can be assigned to a KO family. These",
  length(unique(allele.75$gene)), "genes can be mapped to",
  nrow(allele.75),"specific KEGG pathway connections representing",
  length(unique(allele.counts.KEGG.pres$KEGG.pathway)),"unique pathways")

## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 11 show at least 17 amino-acid or splice site alleles across the 22 strains and can be assigned to a KO family. These 11 genes can be mapped to 29 specific KEGG pathway connections representing 422 unique pathways

KEGG.pathway.75.summary <- KEGG.pathway.var.summary[c(1,4)]
KEGG.pathway.75.summary <- KEGG.pathway.75.summary[order(KEGG.pathway.75.summary$`75%.alleles`,decreasing=TRUE),]
kable(KEGG.pathway.75.summary[1:20,])

KEGG.pathway	75%.alleles
04113 Meiosis - yeast [PATH:ko04113]	3
03036 Chromosome and associated proteins [BR:ko03036]	2
03009 Ribosome biogenesis [BR:ko03009]	2
00230 Purine metabolism [PATH:ko00230]	2
04213 Longevity regulating pathway - multiple species [PATH:ko04213]	2
01008 Polyketide biosynthesis proteins [BR:ko01008]	2
02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025]	2
03029 Mitochondrial biogenesis [BR:ko03029]	1
03019 Messenger RNA Biogenesis [BR:ko03019]	1
03021 Transcription machinery [BR:ko03021]	1
03400 DNA repair and recombination proteins [BR:ko03400]	1
04121 Ubiquitin system [BR:ko04121]	1
03016 Transfer RNA biogenesis [BR:ko03016]	1
03032 DNA replication proteins [BR:ko03032]	1
04111 Cell cycle - yeast [PATH:ko04111]	1
03012 Translation factors [BR:ko03012]	1
03008 Ribosome biogenesis in eukaryotes [PATH:ko03008]	1
01007 Amino acid related enzymes [BR:ko01007]	1
04110 Cell cycle [PATH:ko04110]	1
00970 Aminoacyl-tRNA biosynthesis [PATH:ko00970]	1
To be clear, this is a representation of where the most highly allelic	genes are mapped according to the KEGG system.

This is fascinating, but lets try to put this in two frames of reference: 1. average number of variants per gene across the strains in each pathway 2. total number of genes placing into each pathway in the reference genome, % of pathway genes subject to variation

Let’s create two more tables representing higher levels of organization: X

#tier1
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.tier1 <- as.factor(KEGG.pathway.var.summary$KEGG.tier1)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.tier1) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.tier1<-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary.tier1[c(1:10),])

KEGG.tier1	variant.genes
BriteHierarchies	15
CellularProcesses	7
GeneticInformationProcessing	3
Metabolism	2
OrganismalSystems	2
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA

#tier 2
KEGG.pathway.var.summary <- allele.counts.KEGG.pres[allele.counts.KEGG.pres$count > 22*0.75,]
KEGG.pathway.var.summary$KEGG.tier2 <- as.factor(KEGG.pathway.var.summary$KEGG.tier2)
KEGG.pathway.var.summary$forsum <- 1
KEGG.pathway.var.summary <- dplyr::group_by(KEGG.pathway.var.summary, KEGG.tier2) #pre-sort the table for use by dplyr
KEGG.pathway.var.summary <- dplyr::summarize(KEGG.pathway.var.summary, variant.genes = n())
KEGG.pathway.var.summary <- KEGG.pathway.var.summary[order(KEGG.pathway.var.summary$variant.genes, decreasing = TRUE),]
KEGG.pathway.var.summary.tier2<-KEGG.pathway.var.summary
kable(KEGG.pathway.var.summary.tier2[c(1:25),])

KEGG.tier2	variant.genes
09182 Protein families: genetic information processing	12
09143 Cell growth and death	5
09181 Protein families: metabolism	3
09104 Nucleotide metabolism	2
09122 Translation	2
09145 Cellular community - prokaryotes	2
09149 Aging	2
09124 Replication and repair	1
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA

Key results

## KEGG PATHWAY MAPPING SUMMARY: of 6273 total genes, 5564 show at least one variant relative to the reference genome. 3270 can be assigned to a KO family. These 3270 genes can be mapped to 9805 specific KEGG pathway connections representing 422 unique pathways

## HIGHLY VARIANT ALLELES: of 6273 total genes, 11 show at least 17 nucleotide alleles across the 22 strains and can be assigned to a KO family. These 11 genes can be mapped to 29 specific KEGG pathway connections representing 422 unique pathways

Pathways (KEGG) of genes with at least 17 alleles

kable(KEGG.pathway.75.summary[1:20,])

KEGG.pathway	75%.alleles
04113 Meiosis - yeast [PATH:ko04113]	3
03036 Chromosome and associated proteins [BR:ko03036]	2
03009 Ribosome biogenesis [BR:ko03009]	2
00230 Purine metabolism [PATH:ko00230]	2
04213 Longevity regulating pathway - multiple species [PATH:ko04213]	2
01008 Polyketide biosynthesis proteins [BR:ko01008]	2
02025 Biofilm formation - Pseudomonas aeruginosa [PATH:ko02025]	2
03029 Mitochondrial biogenesis [BR:ko03029]	1
03019 Messenger RNA Biogenesis [BR:ko03019]	1
03021 Transcription machinery [BR:ko03021]	1
03400 DNA repair and recombination proteins [BR:ko03400]	1
04121 Ubiquitin system [BR:ko04121]	1
03016 Transfer RNA biogenesis [BR:ko03016]	1
03032 DNA replication proteins [BR:ko03032]	1
04111 Cell cycle - yeast [PATH:ko04111]	1
03012 Translation factors [BR:ko03012]	1
03008 Ribosome biogenesis in eukaryotes [PATH:ko03008]	1
01007 Amino acid related enzymes [BR:ko01007]	1
04110 Cell cycle [PATH:ko04110]	1
00970 Aminoacyl-tRNA biosynthesis [PATH:ko00970]	1

write.csv(KEGG.pathway.75.summary, "KEGG.pathway.75")

Tier 1 classification (KEGG) of genes with at least 17 alleles

kable(KEGG.pathway.var.summary.tier1[c(1:8),])

KEGG.tier1	variant.genes
BriteHierarchies	15
CellularProcesses	7
GeneticInformationProcessing	3
Metabolism	2
OrganismalSystems	2
NA	NA
NA	NA
NA	NA

write.csv(KEGG.pathway.var.summary.tier1, "KEGG.tier1.75")

Tier 2 classification (KEGG) of genes with at least 17 alleles

kable(KEGG.pathway.var.summary.tier2[c(1:25),])

KEGG.tier2	variant.genes
09182 Protein families: genetic information processing	12
09143 Cell growth and death	5
09181 Protein families: metabolism	3
09104 Nucleotide metabolism	2
09122 Translation	2
09145 Cellular community - prokaryotes	2
09149 Aging	2
09124 Replication and repair	1
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA
NA	NA

write.csv(KEGG.pathway.var.summary.tier2, "KEGG.tier2.75")

At this point we want to write a full file that includes gene names with allele counts. Annotation information is more difficult to attach since genes can map to many pathways… however we can restrict it to gene descriptions and ortholog IDs from KEGG and COG.

aa.allelic.genes <- allele.counts.KEGG[c(1:4)]
aa.allelic.genes <- unique(aa.allelic.genes)
aa.allelic.genes <- aa.allelic.genes[order(-aa.allelic.genes$count),]
cog.map <- cog.raw
aa.allelic.genes.annot <- dplyr::left_join(aa.allelic.genes, cog.map, "gene")

## Warning: Column `gene` joining character vector and factor, coercing into
## character vector

write.csv(aa.allelic.genes.annot, "aa.allelic.genes.csv")
kable(head(aa.allelic.genes.annot))

gene	KO	count	KEGG.name	COG	COG.name	COG.desc
augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	COG4886	COG4886	Leucine-rich repeat (LRR) protein
augustus_masked-scaffold_12-processed-gene-0.454-mRNA-1	K01768	22	E4.6.1.1; adenylate cyclase [EC:4.6.1.1]	COG4886	COG4886	Leucine-rich repeat (LRR) protein
augustus_masked-scaffold_26-processed-gene-0.258-mRNA-1		21		COG0419	SbcC	ATPase involved in DNA repair
maker-scaffold_13-augustus-gene-0.555-mRNA-1		19		NA	NA	NA
augustus_masked-scaffold_15-processed-gene-0.28-mRNA-1		19		NA	NA	NA
maker-scaffold_11-augustus-gene-0.502-mRNA-1	K15026	19	EIF2A; translation initiation factor 2A	COG5354	COG5354	Uncharacterized protein, contains Trp-Asp (WD) repeat

I can also attach protein sequences to the table here, although the fasta names need some parsing

library(Biostrings)

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:dplyr':
## 
##     first, rename

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice

## Loading required package: XVector

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

g.all.protein.biostrings <- readAAStringSet("../geosmithia_annotations/g.morbida.proteins.fa")
name <- data.frame(names(g.all.protein.biostrings),stringsAsFactors = F)
name.clean <- data.frame(do.call('rbind', strsplit(name$names.g.all.protein.biostrings., "_protein_Name:",fixed=TRUE)))
seqs <- paste(g.all.protein.biostrings)
length <- width(g.all.protein.biostrings)
g.all.protein <- data.frame(name.clean[1],length,seqs, stringsAsFactors = FALSE)
colnames(g.all.protein)[1] <- "gene"
#kable(head(g.all.protein))

then we can join that to the gene allele count annotation table

aa.allelic.genes.annot <- dplyr::left_join(aa.allelic.genes.annot, g.all.protein, "gene")

## Warning: Column `gene` joining character vector and factor, coercing into
## character vector

write.csv(aa.allelic.genes.annot, "aa.allelic.genes.csv")
#kable(head(aa.allelic.genes.annot))

runt this after to upload

library(markdown) library(rmarkdown) library(knitr) rpubsUpload(“G.morbida.nonsynonymous.alles.and.functions”, “/home/rmurdoch/Dropbox/Projects/Geosmithia/dartr/6.loci.gene.alleles.aa.rmd”)

loci genes to alleles for non-synonymous variants

Robert W. Murdoch

January 14, 2019