COG mapping files

**The clean COG-to-category map generated by this script can be downloaded from https://www.dropbox.com/s/l8t3zv9syiehrqi/COG_function_map.csv**

The goal of this script is to make a clean table mapping each COG ID to its functional category, which can then be used to generate your own COG category summary tables and figures.

The information is all contained in files here: ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data

1 Import the cognames map

library(DT)  # provides datatable() for the interactive tables
cognames <- read.csv("cognames2003-2014.tab", sep="\t", header=TRUE)
datatable(cognames[1:100,])

1.1 The challenge

datatable(cognames[c(74,28,58),])

Some COG IDs carry several category letters in a single func string. These rows, for example, need to be split into six rows, one for each individual mapping.
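As a minimal illustration of the target shape, the sketch below splits one multi-letter category string into its individual letters. The string "EH" is a made-up stand-in for a func value, not a row taken from the cognames file.

```r
# Hypothetical func value holding two category letters
func <- "EH"

# Splitting on the empty string yields one element per character
single_letters <- strsplit(func, "")[[1]]
single_letters
```

Each element of `single_letters` would become its own row in the cleaned map.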

2 Are there any COG IDs with more than two mappings?

cognames$func <- as.character(cognames$func)
cognames$mappings <- sapply(cognames$func, nchar)
datatable(cognames[1:100,])
hist(cognames$mappings)

paste("The maximum number of mappings for a COG ID is",max(cognames$mappings))
## [1] "The maximum number of mappings for a COG ID is 4"

There are 4 mappings at most, so any system I apply will have to accommodate this.
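Alongside the histogram, a frequency table gives the exact number of COG IDs at each mapping count. The sketch below uses a small made-up vector of func strings in place of cognames$func.

```r
# Toy func strings (assumed values, for illustration only)
func <- c("J", "EH", "K", "MNO", "GEPR")

# nchar() is vectorised, so no sapply() is needed here
mapping_counts <- table(nchar(func))
mapping_counts
```

The names of `mapping_counts` are the number of category letters, and the values are how many entries have that many.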

3 Cleaning the multiple mappings

The general idea is to check each line. If it has 1 function, add it to the new map unchanged. If it has 2, write two lines, one for each function, and likewise for 3 and 4.

COG.map <- data.frame()
for (n in c(1:nrow(cognames))) {
  if(cognames[n,"mappings"]==1) {
    COG.map <- rbind(COG.map,cognames[n,])
  }
  if(cognames[n,"mappings"]==2) {
    test1 <- cognames[n,]
    test1$func <- substr(test1$func,1,1)
    test2 <- cognames[n,]
    test2$func <- substr(test2$func,2,2)
    COG.map <- rbind(COG.map,test1,test2)
  }
  if(cognames[n,"mappings"]==3) {
    test1 <- cognames[n,]
    test1$func <- substr(test1$func,1,1)
    test2 <- cognames[n,]
    test2$func <- substr(test2$func,2,2)
    test3 <- cognames[n,]
    test3$func <- substr(test3$func,3,3)
    COG.map <- rbind(COG.map,test1,test2,test3)
  }
  if(cognames[n,"mappings"]==4) {
    test1 <- cognames[n,]
    test1$func <- substr(test1$func,1,1)
    test2 <- cognames[n,]
    test2$func <- substr(test2$func,2,2)
    test3 <- cognames[n,]
    test3$func <- substr(test3$func,3,3)
    test4 <- cognames[n,]
    test4$func <- substr(test4$func,4,4)
    COG.map <- rbind(COG.map,test1,test2,test3,test4)
  }
}

COG.map$mappings <-  sapply(COG.map$func, nchar)
paste("the maximum number of mappings is now",max(COG.map$mappings))
## [1] "the maximum number of mappings is now 1"

The mapping file makes sense now: every row holds exactly one COG ID to category mapping. There are 4631 COG IDs and 4920 unique mappings.
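The same expansion can also be written without a loop, which avoids one branch per mapping count and handles any number of category letters. This is a sketch of an alternative, not the method used above; the helper name `expand_cog_map` and the toy data frame are assumptions made for illustration.

```r
# Hypothetical helper: repeat each row once per character in `col`,
# then overwrite `col` with the individual characters.
expand_cog_map <- function(df, col = "func") {
  parts <- strsplit(as.character(df[[col]]), "")
  out <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
  out[[col]] <- unlist(parts)
  rownames(out) <- NULL
  out
}

# Toy input with 1, 2 and 4 category letters (made-up IDs and names)
toy <- data.frame(
  COG  = c("COG_A", "COG_B", "COG_C"),
  func = c("J", "EH", "GEPR"),
  name = c("toy protein 1", "toy protein 2", "toy protein 3"),
  stringsAsFactors = FALSE
)

toy_map <- expand_cog_map(toy)
toy_map
```

Applied to the real table, the call would be `expand_cog_map(cognames)`, assuming the category letters are in a column named func as above.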

4 Writing the file

Glance again at the table, then write it to a CSV file without the mappings column and with a cleaned header for the COG ID column.

colnames(COG.map)[1] <- "COG"
COG.map <- COG.map[c(1:3)]
datatable(COG.map[c(1:100),])
write.csv(COG.map, "COG_function_map.csv", row.names = FALSE)

The file can be downloaded from https://www.dropbox.com/s/l8t3zv9syiehrqi/COG_function_map.csv?dl=0
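One way the finished map might be used to build a category summary table is sketched below. The `hits` table of gene-to-COG assignments is hypothetical, and the small `cog_map` stands in for the contents of COG_function_map.csv, using its COG and func columns.

```r
# Stand-in for read.csv("COG_function_map.csv"); values are made up
cog_map <- data.frame(
  COG  = c("COG_A", "COG_B", "COG_B", "COG_C"),
  func = c("J", "E", "H", "K"),
  stringsAsFactors = FALSE
)

# Hypothetical annotation result: one COG ID per gene
hits <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  COG  = c("COG_A", "COG_B", "COG_B", "COG_C"),
  stringsAsFactors = FALSE
)

# Join on COG ID; a gene whose COG has two categories yields two rows
annotated <- merge(hits, cog_map, by = "COG")

# Count genes per category
category_counts <- as.data.frame(table(annotated$func), stringsAsFactors = FALSE)
names(category_counts) <- c("category", "n_genes")
category_counts
```

Because a gene is counted once for every category its COG belongs to, the counts can sum to more than the number of genes.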

RWMurdoch

April 18, 2019