cluster.mapping.rmd

Overview

This rmd file describes how to take the output files from a CD-hit clustering to produce a searchable database of sequences. Ultimately, the user can quickly return a full set of sequences represented by 50% cluster representatives.

The input files are: -cluster.clstr file: this file is a mapping file produced by CD-hit. This is an unfortunately difficult to work with format.
-K01560.proteins.xls: this is a database which maps proteins to genomes

Required packages

stringr: quick text manipulation
dplyr: table and data management

Reading the cluster.clstr CD-hit mapping file

clstr <- read.csv("dev.set/K01560_50_from80.clstr", sep = "\t", row.names = NULL, header = FALSE, stringsAsFactors = FALSE)
kable(head(clstr))

V1	V2
>Cluster 0
0	266aa, >638151288… at 61.28%
1	240aa, >642444289… at 61.28%,96.25%
2	240aa, >2546955856… at 61.28%,88.33%
3	240aa, >2546955055… at 61.28%,88.33%
4	240aa, >2612092855… at 61.28%,84.17%

This is clearly an extremely awkward data format. First, we need to loop through the data set line by line and apply the follwing function to column 1: if the cell contains characters, leave it alone if the cell contains only numbers, set it equal to the cell above

clstr2 <- clstr
n = nrow(clstr)
x = 0
numbers_only <- function(x) !grepl("\\D", x)
for (row in c(1:n)) {
  if (numbers_only(clstr2[row,1]) == TRUE) {
    clstr2[row,1] <- x}
  else {NULL}
  x <- clstr2[row,1]
}
kable(head(clstr2))

V1	V2
>Cluster 0
>Cluster 0	266aa, >638151288… at 61.28%
>Cluster 0	240aa, >642444289… at 61.28%,96.25%
>Cluster 0	240aa, >2546955856… at 61.28%,88.33%
>Cluster 0	240aa, >2546955055… at 61.28%,88.33%
>Cluster 0	240aa, >2612092855… at 61.28%,84.17%

It appears that the first column has been filled down, but we need to make sure that the loop executed as planned. let’s locate a transition point between clusters

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

clstr.sums <- data.frame(dplyr::count(clstr2,V1))
kable(head(clstr.sums))

V1	n
>Cluster 0	1139
>Cluster 1	48
>Cluster 10	9
>Cluster 100	62
>Cluster 101	13
>Cluster 102	2

and then focus on the first transition to make sure the cluster name switched as planned:

switch <- clstr.sums[1,2]
clstr3 <- cbind(clstr2[1], clstr)
kable(clstr3[c((switch-5):(switch+5)),])

	V1	V1	V2
1134	>Cluster 0	1132	237aa, >2739664423… at 53.16%
1135	>Cluster 0	1133	236aa, >2654235313… at 52.12%
1136	>Cluster 0	1134	240aa, >641705304… at 70.00%
1137	>Cluster 0	1135	240aa, >2579721802… at 66.67%
1138	>Cluster 0	1136	241aa, >2719050008… at 52.28%
1139	>Cluster 0	1137	241aa, >2753145995… at 55.19%
1140	>Cluster 1	>Cluster 1
1141	>Cluster 1	0	263aa, >2502332413… at 56.65%
1142	>Cluster 1	1	263aa, >2756756840… at 56.65%,100.00%
1143	>Cluster 1	2	283aa, >2595101017… at 52.30%
1144	>Cluster 1	3	248aa, >2618424100… at 50.81%

Yes, the switch properly took place. Now we need to get rid of rows that have empty V2

clstr4 <- clstr2[-which(clstr2$V2 == ""), ]
kable(clstr4[c(1:5,(switch-5):(switch+5)),])

	V1	V2
2	>Cluster 0	266aa, >638151288… at 61.28%
3	>Cluster 0	240aa, >642444289… at 61.28%,96.25%
4	>Cluster 0	240aa, >2546955856… at 61.28%,88.33%
5	>Cluster 0	240aa, >2546955055… at 61.28%,88.33%
6	>Cluster 0	240aa, >2612092855… at 61.28%,84.17%
1135	>Cluster 0	236aa, >2654235313… at 52.12%
1136	>Cluster 0	240aa, >641705304… at 70.00%
1137	>Cluster 0	240aa, >2579721802… at 66.67%
1138	>Cluster 0	241aa, >2719050008… at 52.28%
1139	>Cluster 0	241aa, >2753145995… at 55.19%
1141	>Cluster 1	263aa, >2502332413… at 56.65%
1142	>Cluster 1	263aa, >2756756840… at 56.65%,100.00%
1143	>Cluster 1	283aa, >2595101017… at 52.30%
1144	>Cluster 1	248aa, >2618424100… at 50.81%
1145	>Cluster 1	248aa, >2600763407… at 50.81%,100.00%
1146	>Cluster 1	248aa, >2641522503… at 50.81%,100.00%

Yes we have now disposed of the rows with empty V2.

Now we proceed to properly formatting the table.

We need to * remove “>” * split V2 by “aa,” and by “… at” and by “%,”

library(stringr)
clstr5 <- clstr4
clstr5[] <- lapply(clstr5, gsub, pattern='>', replacement='')
clstr5.2 <- data.frame(str_split_fixed(clstr5$V2, "aa, ", 2))
clstr5.3 <- data.frame(str_split_fixed(clstr5.2$X2, "... ", 2))
clstr6 <- cbind(clstr5[1],clstr5.2[1],clstr5.3[1:2])
colnames(clstr6) <- c("cluster","aa","gene","stat")
kable(head(clstr6))

	cluster	aa	gene	stat
2	Cluster 0	266	638151288	at 61.28%
3	Cluster 0	240	642444289	at 61.28%,96.25%
4	Cluster 0	240	2546955856	at 61.28%,88.33%
5	Cluster 0	240	2546955055	at 61.28%,88.33%
6	Cluster 0	240	2612092855	at 61.28%,84.17%
7	Cluster 0	240	2633535176	at 61.28%,91.25%

The “stat” column is messy. It contains identity to the representative sequence. Maybe more importantly, the representative sequence is indicated by “*" in this column. We don’t really need that now, but we will leave the column in for future use.

We now have a map between gene ID and cluster.

Next we will enrich this with genome information

Reading in the gene information table

I am using gene tables downloaded from IMG. I have included genome information in the table (via the IMG website)

gene.genome <- read.csv("dev.set/K01560.proteins.xls", sep = "\t", header = TRUE)
kable(head(gene.genome[c(1,3:5)]))

Gene.ID	Gene.Product.Name	Genome.ID	Genome.Name
648940921		648028011	Burkholderia sp. CCGE1003
651714005		650716004	Acidovorax avenae avenae ATCC 19860
639637225	(S)-2-haloacid dehalogenase	639633056	Roseobacter denitrificans OCh 114
651229011	(s)-2-haloacid dehalogenase	651053062	Ralstonia solanacearum Po82
639130300	(S)-2-haloacid dehalogenase	638341056	Candidatus Pelagibacter ubique SAR11 HTCC1002
640816332	(S)-2-haloacid dehalogenase	640753030	Janthinobacterium sp. Marseille

at this point we can do some simple reformatting and use a left join to attach the genome data to the cluster map table

clstr7 <- clstr6[c(3,1,2,4)]
colnames(gene.genome)[1] <- "gene"
clstr7$gene <- as.numeric(as.character(clstr7$gene))
clstr8 <- dplyr::left_join(clstr7,gene.genome,by="gene")
clstr9 <- clstr8[1:8]
write.csv(clstr9,"seq.clstr.genome.map.csv")
kable(head(clstr9))

gene	cluster	aa	stat	Locus.Tag	Gene.Product.Name	Genome.ID	Genome.Name
638151288	Cluster 0	266	at 61.28%	Bamb_3446	haloacid dehalogenase, type II	637000047	Burkholderia cepacia AMMD
642444289	Cluster 0	240	at 61.28%,96.25%	BamMEX5DRAFT_6182	haloacid dehalogenase, type II	641736149	Burkholderia ambifaria MEX-5
2546955856	Cluster 0	240	at 61.28%,88.33%	BMF7_03358	2-haloacid dehalogenase	2546825540	Burkholderia cenocepacia7
2546955055	Cluster 0	240	at 61.28%,88.33%	BMF7_02556	2-haloacid dehalogenase	2546825540	Burkholderia cenocepacia7
2612092855	Cluster 0	240	at 61.28%,84.17%	Ga0055110_104731	2-haloacid dehalogenase	2609460236	Burkholderia paludis MSh1
2633535176	Cluster 0	240	at 61.28%,91.25%	Ga0077607_1006125	2-haloacid dehalogenase	2630968861	Burkholderia sp. USM B20

Our map is finished now.

Next step is to add the results of the oxygen preference map.

END