This rmd file describes how to take the output files from a CD-hit clustering to produce a searchable database of sequences. Ultimately, the user can quickly return a full set of sequences represented by 50% cluster representatives.
The input files are: -cluster.clstr file: this file is a mapping file produced by CD-hit. This is an unfortunately difficult to work with format.
-K01560.proteins.xls: this is a database which maps proteins to genomes
clstr <- read.csv("dev.set/K01560_50_from80.clstr", sep = "\t", row.names = NULL, header = FALSE, stringsAsFactors = FALSE)
kable(head(clstr))
V1 | V2 |
---|---|
>Cluster 0 | |
0 | 266aa, >638151288… at 61.28% |
1 | 240aa, >642444289… at 61.28%,96.25% |
2 | 240aa, >2546955856… at 61.28%,88.33% |
3 | 240aa, >2546955055… at 61.28%,88.33% |
4 | 240aa, >2612092855… at 61.28%,84.17% |
This is clearly an extremely awkward data format. First, we need to loop through the data set line by line and apply the follwing function to column 1: if the cell contains characters, leave it alone if the cell contains only numbers, set it equal to the cell above
clstr2 <- clstr
n = nrow(clstr)
x = 0
numbers_only <- function(x) !grepl("\\D", x)
for (row in c(1:n)) {
if (numbers_only(clstr2[row,1]) == TRUE) {
clstr2[row,1] <- x}
else {NULL}
x <- clstr2[row,1]
}
kable(head(clstr2))
V1 | V2 |
---|---|
>Cluster 0 | |
>Cluster 0 | 266aa, >638151288… at 61.28% |
>Cluster 0 | 240aa, >642444289… at 61.28%,96.25% |
>Cluster 0 | 240aa, >2546955856… at 61.28%,88.33% |
>Cluster 0 | 240aa, >2546955055… at 61.28%,88.33% |
>Cluster 0 | 240aa, >2612092855… at 61.28%,84.17% |
It appears that the first column has been filled down, but we need to make sure that the loop executed as planned. let’s locate a transition point between clusters
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
clstr.sums <- data.frame(dplyr::count(clstr2,V1))
kable(head(clstr.sums))
V1 | n |
---|---|
>Cluster 0 | 1139 |
>Cluster 1 | 48 |
>Cluster 10 | 9 |
>Cluster 100 | 62 |
>Cluster 101 | 13 |
>Cluster 102 | 2 |
and then focus on the first transition to make sure the cluster name switched as planned:
switch <- clstr.sums[1,2]
clstr3 <- cbind(clstr2[1], clstr)
kable(clstr3[c((switch-5):(switch+5)),])
V1 | V1 | V2 | |
---|---|---|---|
1134 | >Cluster 0 | 1132 | 237aa, >2739664423… at 53.16% |
1135 | >Cluster 0 | 1133 | 236aa, >2654235313… at 52.12% |
1136 | >Cluster 0 | 1134 | 240aa, >641705304… at 70.00% |
1137 | >Cluster 0 | 1135 | 240aa, >2579721802… at 66.67% |
1138 | >Cluster 0 | 1136 | 241aa, >2719050008… at 52.28% |
1139 | >Cluster 0 | 1137 | 241aa, >2753145995… at 55.19% |
1140 | >Cluster 1 | >Cluster 1 | |
1141 | >Cluster 1 | 0 | 263aa, >2502332413… at 56.65% |
1142 | >Cluster 1 | 1 | 263aa, >2756756840… at 56.65%,100.00% |
1143 | >Cluster 1 | 2 | 283aa, >2595101017… at 52.30% |
1144 | >Cluster 1 | 3 | 248aa, >2618424100… at 50.81% |
Yes, the switch properly took place. Now we need to get rid of rows that have empty V2
clstr4 <- clstr2[-which(clstr2$V2 == ""), ]
kable(clstr4[c(1:5,(switch-5):(switch+5)),])
V1 | V2 | |
---|---|---|
2 | >Cluster 0 | 266aa, >638151288… at 61.28% |
3 | >Cluster 0 | 240aa, >642444289… at 61.28%,96.25% |
4 | >Cluster 0 | 240aa, >2546955856… at 61.28%,88.33% |
5 | >Cluster 0 | 240aa, >2546955055… at 61.28%,88.33% |
6 | >Cluster 0 | 240aa, >2612092855… at 61.28%,84.17% |
1135 | >Cluster 0 | 236aa, >2654235313… at 52.12% |
1136 | >Cluster 0 | 240aa, >641705304… at 70.00% |
1137 | >Cluster 0 | 240aa, >2579721802… at 66.67% |
1138 | >Cluster 0 | 241aa, >2719050008… at 52.28% |
1139 | >Cluster 0 | 241aa, >2753145995… at 55.19% |
1141 | >Cluster 1 | 263aa, >2502332413… at 56.65% |
1142 | >Cluster 1 | 263aa, >2756756840… at 56.65%,100.00% |
1143 | >Cluster 1 | 283aa, >2595101017… at 52.30% |
1144 | >Cluster 1 | 248aa, >2618424100… at 50.81% |
1145 | >Cluster 1 | 248aa, >2600763407… at 50.81%,100.00% |
1146 | >Cluster 1 | 248aa, >2641522503… at 50.81%,100.00% |
Yes we have now disposed of the rows with empty V2.
Now we proceed to properly formatting the table.
We need to * remove “>” * split V2 by “aa,” and by “… at” and by “%,”
library(stringr)
clstr5 <- clstr4
clstr5[] <- lapply(clstr5, gsub, pattern='>', replacement='')
clstr5.2 <- data.frame(str_split_fixed(clstr5$V2, "aa, ", 2))
clstr5.3 <- data.frame(str_split_fixed(clstr5.2$X2, "... ", 2))
clstr6 <- cbind(clstr5[1],clstr5.2[1],clstr5.3[1:2])
colnames(clstr6) <- c("cluster","aa","gene","stat")
kable(head(clstr6))
cluster | aa | gene | stat | |
---|---|---|---|---|
2 | Cluster 0 | 266 | 638151288 | at 61.28% |
3 | Cluster 0 | 240 | 642444289 | at 61.28%,96.25% |
4 | Cluster 0 | 240 | 2546955856 | at 61.28%,88.33% |
5 | Cluster 0 | 240 | 2546955055 | at 61.28%,88.33% |
6 | Cluster 0 | 240 | 2612092855 | at 61.28%,84.17% |
7 | Cluster 0 | 240 | 2633535176 | at 61.28%,91.25% |
The “stat” column is messy. It contains identity to the representative sequence. Maybe more importantly, the representative sequence is indicated by “*" in this column. We don’t really need that now, but we will leave the column in for future use.
Next we will enrich this with genome information
I am using gene tables downloaded from IMG. I have included genome information in the table (via the IMG website)
gene.genome <- read.csv("dev.set/K01560.proteins.xls", sep = "\t", header = TRUE)
kable(head(gene.genome[c(1,3:5)]))
Gene.ID | Gene.Product.Name | Genome.ID | Genome.Name |
---|---|---|---|
648940921 | 648028011 | Burkholderia sp. CCGE1003 | |
651714005 | 650716004 | Acidovorax avenae avenae ATCC 19860 | |
639637225 | (S)-2-haloacid dehalogenase | 639633056 | Roseobacter denitrificans OCh 114 |
651229011 | (s)-2-haloacid dehalogenase | 651053062 | Ralstonia solanacearum Po82 |
639130300 | (S)-2-haloacid dehalogenase | 638341056 | Candidatus Pelagibacter ubique SAR11 HTCC1002 |
640816332 | (S)-2-haloacid dehalogenase | 640753030 | Janthinobacterium sp. Marseille |
at this point we can do some simple reformatting and use a left join to attach the genome data to the cluster map table
clstr7 <- clstr6[c(3,1,2,4)]
colnames(gene.genome)[1] <- "gene"
clstr7$gene <- as.numeric(as.character(clstr7$gene))
clstr8 <- dplyr::left_join(clstr7,gene.genome,by="gene")
clstr9 <- clstr8[1:8]
write.csv(clstr9,"seq.clstr.genome.map.csv")
kable(head(clstr9))
gene | cluster | aa | stat | Locus.Tag | Gene.Product.Name | Genome.ID | Genome.Name |
---|---|---|---|---|---|---|---|
638151288 | Cluster 0 | 266 | at 61.28% | Bamb_3446 | haloacid dehalogenase, type II | 637000047 | Burkholderia cepacia AMMD |
642444289 | Cluster 0 | 240 | at 61.28%,96.25% | BamMEX5DRAFT_6182 | haloacid dehalogenase, type II | 641736149 | Burkholderia ambifaria MEX-5 |
2546955856 | Cluster 0 | 240 | at 61.28%,88.33% | BMF7_03358 | 2-haloacid dehalogenase | 2546825540 | Burkholderia cenocepacia7 |
2546955055 | Cluster 0 | 240 | at 61.28%,88.33% | BMF7_02556 | 2-haloacid dehalogenase | 2546825540 | Burkholderia cenocepacia7 |
2612092855 | Cluster 0 | 240 | at 61.28%,84.17% | Ga0055110_104731 | 2-haloacid dehalogenase | 2609460236 | Burkholderia paludis MSh1 |
2633535176 | Cluster 0 | 240 | at 61.28%,91.25% | Ga0077607_1006125 | 2-haloacid dehalogenase | 2630968861 | Burkholderia sp. USM B20 |
Our map is finished now.
Next step is to add the results of the oxygen preference map.
END