COG mapping files
**The clean COG-to-category map generated by this script can be downloaded from https://www.dropbox.com/s/l8t3zv9syiehrqi/COG_function_map.csv**
The goal of this script is to build a clean table that maps each COG ID to its functional category, which can then be used to generate your own COG category summary tables and figures.
The information is all contained in files here: ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data
- ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data/fun2003-2014.tab This file has the category descriptions.
- ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data/cognames2003-2014.tab This file has the COG-to-category map. The problem is that some COG IDs map to multiple categories, and these are stored awkwardly: all of the category letters for a COG ID are run together in a single string.
1 Import the cognames map
library(DT)  # provides datatable()
cognames <- read.csv("cognames2003-2014.tab", sep="\t", header=TRUE)
datatable(cognames[1:100,])
1.1 The challenge
datatable(cognames[c(74,28,58),])
These three rows, for example, need to be split into six rows, one for each individual mapping.
2 Are there any COG IDs with more than two mappings?
cognames$func <- as.character(cognames$func)
cognames$mappings <- sapply(cognames$func, nchar)
datatable(cognames[1:100,])
hist(cognames$mappings)
paste("The maximum number of mappings for a COG ID is",max(cognames$mappings))
## [1] "The maximum number of mappings for a COG ID is 4"
There are at most 4 mappings per COG ID, so any approach I apply will have to accommodate this.
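As a quick sanity check on the counting step, here is a minimal sketch on a small made-up table. The IDs and category strings are hypothetical, not taken from the NCBI file. It shows how `nchar()` turns the `func` string into a mapping count, and how the sum of those counts predicts the number of rows the cleaned map should end up with.

```r
# Hypothetical stand-in for cognames; IDs and func strings are invented
toy <- data.frame(
  COG  = c("COG_A", "COG_B", "COG_C"),
  func = c("E", "EH", "EHJ"),
  stringsAsFactors = FALSE
)

# Each category is a single letter, so string length = number of mappings
toy$mappings <- nchar(toy$func)

# How many COG IDs have 1, 2, 3, ... mappings
mapping_counts <- table(toy$mappings)
print(mapping_counts)

# Number of rows expected once every multi-letter entry is split
expected_rows <- sum(toy$mappings)
print(expected_rows)
```

Running the same `sum()` on the real `cognames$mappings` column gives a number to compare against `nrow(COG.map)` after the cleaning step.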
3 Cleaning the multiple mappings
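The general idea is to check each row in turn. If it has one function, add it to the new map as is. If it has two, write two rows, one for each function; rows with three or four functions are handled the same way. Try this straightforward approach first.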
COG.map <- data.frame()
for (n in seq_len(nrow(cognames))) {
  # One category: copy the row unchanged
  if (cognames[n,"mappings"]==1) {
    COG.map <- rbind(COG.map,cognames[n,])
  }
  # Two categories: one copy of the row per letter
  if (cognames[n,"mappings"]==2) {
    test1 <- cognames[n,]
    test1$func <- substr(test1$func,1,1)
    test2 <- cognames[n,]
    test2$func <- substr(test2$func,2,2)
    COG.map <- rbind(COG.map,test1,test2)
  }
  # Three categories
  if (cognames[n,"mappings"]==3) {
    test1 <- cognames[n,]
    test1$func <- substr(test1$func,1,1)
    test2 <- cognames[n,]
    test2$func <- substr(test2$func,2,2)
    test3 <- cognames[n,]
    # Take the third letter from test3 itself; test2$func is already a single letter
    test3$func <- substr(test3$func,3,3)
    COG.map <- rbind(COG.map,test1,test2,test3)
  }
  # Four categories
  if (cognames[n,"mappings"]==4) {
    test1 <- cognames[n,]
    test1$func <- substr(test1$func,1,1)
    test2 <- cognames[n,]
    test2$func <- substr(test2$func,2,2)
    test3 <- cognames[n,]
    test3$func <- substr(test3$func,3,3)
    test4 <- cognames[n,]
    test4$func <- substr(test4$func,4,4)
    COG.map <- rbind(COG.map,test1,test2,test3,test4)
  }
}
COG.map$mappings <- sapply(COG.map$func, nchar)
paste("the maximum number of mappings is now",max(COG.map$mappings))
## [1] "the maximum number of mappings is now 1"
The mapping file makes sense now: every row holds exactly one COG-to-category mapping. There are 4631 COG IDs and 4920 unique mappings.
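The same one-row-per-category expansion can also be written without a loop or a fixed upper limit on the number of mappings. The sketch below is an alternative, not the method used above: `expand_cog_map()` is a hypothetical helper name, and the toy rows are invented for illustration. It assumes, as in the COG file, that every category is a single letter.

```r
# Hypothetical helper: expand a cognames-style data frame so that every
# row carries exactly one category letter in `func`
expand_cog_map <- function(df) {
  # Split each func string into its individual letters
  letter_list <- strsplit(as.character(df$func), "")
  # Repeat each original row once per letter it carries
  out <- df[rep(seq_len(nrow(df)), lengths(letter_list)), , drop = FALSE]
  # Overwrite func with the single letters, in matching order
  out$func <- unlist(letter_list)
  rownames(out) <- NULL
  out
}

# Invented example rows (not real COG entries)
toy <- data.frame(
  COG  = c("COG_A", "COG_B", "COG_C"),
  func = c("E", "EH", "EHJ"),
  name = c("one category", "two categories", "three categories"),
  stringsAsFactors = FALSE
)

toy_map <- expand_cog_map(toy)
print(toy_map)
```

Because the number of output rows is driven by `lengths(letter_list)`, this version would keep working if a future release of the file contained an ID with five or more categories.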
4 Writing the file
Take a final look at the table and write it to a CSV file (without the mappings column and with a cleaned header).
colnames(COG.map)[1] <- "COG"
COG.map <- COG.map[c(1:3)]
datatable(COG.map[c(1:100),])
write.csv(COG.map, "COG_function_map.csv", row.names = FALSE)
The file can be downloaded from https://www.dropbox.com/s/l8t3zv9syiehrqi/COG_function_map.csv?dl=0
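To illustrate how the cleaned map might feed into a category summary table, here is a minimal sketch in base R. Everything in it is made up for the example: the `cog_map` rows stand in for `COG_function_map.csv`, and `my_counts` is a hypothetical table of per-COG counts from your own annotation.

```r
# Hypothetical cleaned map (one row per COG/category pair)
cog_map <- data.frame(
  COG  = c("COG_A", "COG_B", "COG_B", "COG_C"),
  func = c("E", "E", "H", "J"),
  stringsAsFactors = FALSE
)

# Hypothetical per-COG counts from your own data set
my_counts <- data.frame(
  COG   = c("COG_A", "COG_B", "COG_C"),
  count = c(10, 4, 7),
  stringsAsFactors = FALSE
)

# Attach category letters; a COG with two categories yields two rows
annotated <- merge(my_counts, cog_map, by = "COG")

# Total count per category letter
category_summary <- aggregate(count ~ func, data = annotated, FUN = sum)
print(category_summary)
```

Note that in this sketch a COG belonging to two categories contributes its full count to both of them. Whether that is appropriate, or whether counts should be divided between categories, is a choice for your own analysis.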