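Currently, the assembly score is affected by how highly expressed a contig is: the number of reads mapping to a contig influences how much that contig contributes to the score.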
library(ggplot2)
ggplot(data, aes(x=good, y=diff)) +
geom_point() +
xlim(0,800) +
ylim(0,0.00006) +
xlab("Number of good mappings to contig") +
ylab("Change in assembly score by removing contig")
Changing to a kmer-based approach means that the number of reads mapping to a contig won’t affect how much the contig contributes to the assembly score.
ggplot(kmer_data, aes(x=unique, y=change)) +
geom_point() +
xlim(0,5000) +
ylim(0,0.00004) +
xlab("Number of unique kmers in contig") +
ylab("Change in assembly score by removing contig")
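To illustrate why a kmer count does not depend on expression, here is a minimal sketch that counts the distinct kmers in a single contig sequence. The result depends only on the sequence itself, not on how many reads map to it. The function name `count_distinct_kmers`, the value of `k`, and the reading of "unique" as "distinct within the contig" are all assumptions made for illustration; this is not necessarily how `kmer_data` was produced.

```r
# Hypothetical helper (not from the original analysis): count the distinct
# kmers of length k in one contig sequence given as a character string.
count_distinct_kmers <- function(contig_seq, k) {
  n <- nchar(contig_seq)
  if (n < k) return(0L)                 # sequence too short to hold any kmer
  starts <- seq_len(n - k + 1)          # start position of every kmer window
  kmers <- substring(contig_seq, starts, starts + k - 1)
  length(unique(kmers))                 # repeated kmers are counted once
}

count_distinct_kmers("ACGTACGT", 4)     # 4 distinct kmers: ACGT, CGTA, GTAC, TACG
```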
# Join the read-mapping table and the kmer table on their contig identifier columns
merged <- merge(x=data, y=kmer_data, by.x="key", by.y="name")
The count of good mappings plotted against the number of unique kmers in each contig
ggplot(merged, aes(x=good, y=unique)) +
geom_point() +
xlim(0,800) +
ylim(0,5000) +
xlab("Number of good mappings to contig") +
ylab("Number of unique kmers in contig")
The change in assembly score under the current method plotted against the change under the kmer-based method
ggplot(merged, aes(x=diff, y=change)) +
geom_point() +
xlim(0,0.0001) +
ylim(0,0.0001) +
xlab("Change in assembly score in current method") +
ylab("Change in assembly score with new kmer based method")
Both sets of data on the same plot
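`data2` is not defined in this section. A minimal sketch of how such a table could be assembled, assuming it simply stacks the two per-contig tables into long format with a `method` column: the column names `good`, `diff`, `unique` and `change` are taken from the plots in this document, while the function name `combine_methods` and the method labels are hypothetical.

```r
# Sketch only: stack the read-mapping and kmer tables into one long table
# with columns x, diff and method, matching the aesthetics of the combined plot.
combine_methods <- function(data, kmer_data) {
  rbind(
    data.frame(x = data$good,                      # good mappings per contig
               diff = data$diff,                   # score change, current method
               method = rep("read mapping", nrow(data))),
    data.frame(x = kmer_data$unique,               # unique kmers per contig
               diff = kmer_data$change,            # score change, kmer method
               method = rep("kmer", nrow(kmer_data)))
  )
}

# Assumed usage, given the data frames used earlier in this document:
# data2 <- combine_methods(data, kmer_data)
```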
ggplot(data2, aes(x=x, y=diff, colour=method)) +
geom_point() +
xlim(0,5000) +
ylim(0,0.0002) +
xlab("Number of good mappings or unique kmers in contig") +
ylab("Change in assembly score by removing contig")
Using the kmer-based method, the contig score (sorted) against the assembly score as contigs are removed
ggplot(kmer_data, aes(x=score, y=a_score)) +
geom_point() +
xlim(0,1.00) +
ylim(0,0.40) +
xlab("Contig score") +
ylab("Assembly score")
The assembly score against the sorted contig number
kmer_data$idx <- as.numeric(row.names(kmer_data))
ggplot(kmer_data, aes(x=idx, y=a_score)) +
geom_point() +
xlim(0,155000) +
ylim(0,0.40) +
xlab("Contig number (sorted by score)") +
ylab("Assembly score")