Overview

This rmd file describes how to take the output files from a CD-hit clustering to produce a searchable database of sequences. Ultimately, the user can quickly return a full set of sequences represented by 50% cluster representatives.

The input files are: -cluster.clstr file: this file is a mapping file produced by CD-hit. This is an unfortunately difficult to work with format.
-K01560.proteins.xls: this is a database which maps proteins to genomes

Required packages

Reading the cluster.clstr CD-hit mapping file

clstr <- read.csv("dev.set/K01560_50_from80.clstr", sep = "\t", row.names = NULL, header = FALSE, stringsAsFactors = FALSE)
kable(head(clstr))
V1 V2
>Cluster 0
0 266aa, >638151288… at 61.28%
1 240aa, >642444289… at 61.28%,96.25%
2 240aa, >2546955856… at 61.28%,88.33%
3 240aa, >2546955055… at 61.28%,88.33%
4 240aa, >2612092855… at 61.28%,84.17%

This is clearly an extremely awkward data format. First, we need to loop through the data set line by line and apply the follwing function to column 1: if the cell contains characters, leave it alone if the cell contains only numbers, set it equal to the cell above

clstr2 <- clstr
n = nrow(clstr)
x = 0
numbers_only <- function(x) !grepl("\\D", x)
for (row in c(1:n)) {
  if (numbers_only(clstr2[row,1]) == TRUE) {
    clstr2[row,1] <- x}
  else {NULL}
  x <- clstr2[row,1]
}
kable(head(clstr2))
V1 V2
>Cluster 0
>Cluster 0 266aa, >638151288… at 61.28%
>Cluster 0 240aa, >642444289… at 61.28%,96.25%
>Cluster 0 240aa, >2546955856… at 61.28%,88.33%
>Cluster 0 240aa, >2546955055… at 61.28%,88.33%
>Cluster 0 240aa, >2612092855… at 61.28%,84.17%

It appears that the first column has been filled down, but we need to make sure that the loop executed as planned. let’s locate a transition point between clusters

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
clstr.sums <- data.frame(dplyr::count(clstr2,V1))
kable(head(clstr.sums))
V1 n
>Cluster 0 1139
>Cluster 1 48
>Cluster 10 9
>Cluster 100 62
>Cluster 101 13
>Cluster 102 2

and then focus on the first transition to make sure the cluster name switched as planned:

switch <- clstr.sums[1,2]
clstr3 <- cbind(clstr2[1], clstr)
kable(clstr3[c((switch-5):(switch+5)),])
V1 V1 V2
1134 >Cluster 0 1132 237aa, >2739664423… at 53.16%
1135 >Cluster 0 1133 236aa, >2654235313… at 52.12%
1136 >Cluster 0 1134 240aa, >641705304… at 70.00%
1137 >Cluster 0 1135 240aa, >2579721802… at 66.67%
1138 >Cluster 0 1136 241aa, >2719050008… at 52.28%
1139 >Cluster 0 1137 241aa, >2753145995… at 55.19%
1140 >Cluster 1 >Cluster 1
1141 >Cluster 1 0 263aa, >2502332413… at 56.65%
1142 >Cluster 1 1 263aa, >2756756840… at 56.65%,100.00%
1143 >Cluster 1 2 283aa, >2595101017… at 52.30%
1144 >Cluster 1 3 248aa, >2618424100… at 50.81%

Yes, the switch properly took place. Now we need to get rid of rows that have empty V2

clstr4 <- clstr2[-which(clstr2$V2 == ""), ]
kable(clstr4[c(1:5,(switch-5):(switch+5)),])
V1 V2
2 >Cluster 0 266aa, >638151288… at 61.28%
3 >Cluster 0 240aa, >642444289… at 61.28%,96.25%
4 >Cluster 0 240aa, >2546955856… at 61.28%,88.33%
5 >Cluster 0 240aa, >2546955055… at 61.28%,88.33%
6 >Cluster 0 240aa, >2612092855… at 61.28%,84.17%
1135 >Cluster 0 236aa, >2654235313… at 52.12%
1136 >Cluster 0 240aa, >641705304… at 70.00%
1137 >Cluster 0 240aa, >2579721802… at 66.67%
1138 >Cluster 0 241aa, >2719050008… at 52.28%
1139 >Cluster 0 241aa, >2753145995… at 55.19%
1141 >Cluster 1 263aa, >2502332413… at 56.65%
1142 >Cluster 1 263aa, >2756756840… at 56.65%,100.00%
1143 >Cluster 1 283aa, >2595101017… at 52.30%
1144 >Cluster 1 248aa, >2618424100… at 50.81%
1145 >Cluster 1 248aa, >2600763407… at 50.81%,100.00%
1146 >Cluster 1 248aa, >2641522503… at 50.81%,100.00%

Yes we have now disposed of the rows with empty V2.

Now we proceed to properly formatting the table.

We need to * remove “>” * split V2 by “aa,” and by “… at” and by “%,”

library(stringr)
clstr5 <- clstr4
clstr5[] <- lapply(clstr5, gsub, pattern='>', replacement='')
clstr5.2 <- data.frame(str_split_fixed(clstr5$V2, "aa, ", 2))
clstr5.3 <- data.frame(str_split_fixed(clstr5.2$X2, "... ", 2))
clstr6 <- cbind(clstr5[1],clstr5.2[1],clstr5.3[1:2])
colnames(clstr6) <- c("cluster","aa","gene","stat")
kable(head(clstr6))
cluster aa gene stat
2 Cluster 0 266 638151288 at 61.28%
3 Cluster 0 240 642444289 at 61.28%,96.25%
4 Cluster 0 240 2546955856 at 61.28%,88.33%
5 Cluster 0 240 2546955055 at 61.28%,88.33%
6 Cluster 0 240 2612092855 at 61.28%,84.17%
7 Cluster 0 240 2633535176 at 61.28%,91.25%

The “stat” column is messy. It contains identity to the representative sequence. Maybe more importantly, the representative sequence is indicated by “*" in this column. We don’t really need that now, but we will leave the column in for future use.

We now have a map between gene ID and cluster.

Next we will enrich this with genome information

Reading in the gene information table

I am using gene tables downloaded from IMG. I have included genome information in the table (via the IMG website)

gene.genome <- read.csv("dev.set/K01560.proteins.xls", sep = "\t", header = TRUE)
kable(head(gene.genome[c(1,3:5)]))
Gene.ID Gene.Product.Name Genome.ID Genome.Name
648940921 648028011 Burkholderia sp. CCGE1003
651714005 650716004 Acidovorax avenae avenae ATCC 19860
639637225 (S)-2-haloacid dehalogenase 639633056 Roseobacter denitrificans OCh 114
651229011 (s)-2-haloacid dehalogenase 651053062 Ralstonia solanacearum Po82
639130300 (S)-2-haloacid dehalogenase 638341056 Candidatus Pelagibacter ubique SAR11 HTCC1002
640816332 (S)-2-haloacid dehalogenase 640753030 Janthinobacterium sp. Marseille

at this point we can do some simple reformatting and use a left join to attach the genome data to the cluster map table

clstr7 <- clstr6[c(3,1,2,4)]
colnames(gene.genome)[1] <- "gene"
clstr7$gene <- as.numeric(as.character(clstr7$gene))
clstr8 <- dplyr::left_join(clstr7,gene.genome,by="gene")
clstr9 <- clstr8[1:8]
write.csv(clstr9,"seq.clstr.genome.map.csv")
kable(head(clstr9))
gene cluster aa stat Locus.Tag Gene.Product.Name Genome.ID Genome.Name
638151288 Cluster 0 266 at 61.28% Bamb_3446 haloacid dehalogenase, type II 637000047 Burkholderia cepacia AMMD
642444289 Cluster 0 240 at 61.28%,96.25% BamMEX5DRAFT_6182 haloacid dehalogenase, type II 641736149 Burkholderia ambifaria MEX-5
2546955856 Cluster 0 240 at 61.28%,88.33% BMF7_03358 2-haloacid dehalogenase 2546825540 Burkholderia cenocepacia7
2546955055 Cluster 0 240 at 61.28%,88.33% BMF7_02556 2-haloacid dehalogenase 2546825540 Burkholderia cenocepacia7
2612092855 Cluster 0 240 at 61.28%,84.17% Ga0055110_104731 2-haloacid dehalogenase 2609460236 Burkholderia paludis MSh1
2633535176 Cluster 0 240 at 61.28%,91.25% Ga0077607_1006125 2-haloacid dehalogenase 2630968861 Burkholderia sp. USM B20

Our map is finished now.

Next step is to add the results of the oxygen preference map.

END