Transrate comparison of optimal score calculations

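Currently, the assembly score depends on how highly expressed a contig is. The plot below shows the change in assembly score from removing each contig against the number of good mappings to that contig.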

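# ggplot2 provides ggplot(), aes(), geom_point() and the axis helpers used throughout
library(ggplot2)

ggplot(data, aes(x=good, y=diff)) +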
  geom_point() +
  xlim(0,800) +
  ylim(0,0.00006) +
  xlab("Number of good mappings to contig") +
  ylab("Change in assembly score by removing contig")

Changing to a kmer-based approach means that the number of reads mapping to a contig no longer affects how much the contig contributes to the assembly score.
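As a rough illustration of why a kmer count is independent of read depth, the sketch below counts the distinct kmers in a contig sequence. It is only a sketch under stated assumptions: the function name `count_distinct_kmers` and the choice of `k` are hypothetical, and "unique" is read here as "distinct within the contig", which may not match how the `unique` column in `kmer_data` was actually computed.

```r
# Hypothetical helper: number of distinct kmers of length k in one sequence.
# Assumes "unique" means distinct within the contig (an assumption, not
# necessarily the definition used to build kmer_data).
count_distinct_kmers <- function(seq, k) {
  n <- nchar(seq)
  if (n < k) {
    return(0L)  # sequence too short to contain any kmer
  }
  starts <- seq_len(n - k + 1)                     # start position of each kmer
  kmers <- substring(seq, starts, starts + k - 1)  # all overlapping kmers
  length(unique(kmers))                            # count distinct ones
}

# The count depends only on the sequence, not on how many reads map to it.
count_distinct_kmers("ACGTACGT", 4)
```

Because the input is just the contig sequence, two contigs with identical sequence content get the same count regardless of their expression level.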

ggplot(kmer_data, aes(x=unique, y=change)) +
  geom_point() +
  xlim(0,5000) +
  ylim(0,0.00004) +
  xlab("Number of unique kmers in contig") +
  ylab("Change in assembly score by removing contig")

merged <- merge(x=data, y=kmer_data, by.x="key", by.y="name")
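Note that `merge()` performs an inner join by default, so contigs present in only one of the two tables are dropped from `merged`. The sketch below shows this on toy data; the data frames `data_toy` and `kmer_toy` and their values are made up for illustration and only mimic the column names used above.

```r
# Toy stand-ins for data and kmer_data (values are invented for illustration)
data_toy <- data.frame(key    = c("contig1", "contig2", "contig3"),
                       good   = c(10, 250, 40),
                       diff   = c(1e-6, 3e-5, 4e-6))
kmer_toy <- data.frame(name   = c("contig2", "contig3", "contig4"),
                       unique = c(1200, 300, 800),
                       change = c(2e-6, 1e-6, 2e-6))

# Default merge: inner join, keeps only contig2 and contig3
merged_toy <- merge(x = data_toy, y = kmer_toy, by.x = "key", by.y = "name")

# all = TRUE keeps every contig and fills the missing side with NA
merged_all <- merge(x = data_toy, y = kmer_toy, by.x = "key", by.y = "name",
                    all = TRUE)

nrow(merged_toy)  # rows present in both tables
nrow(merged_all)  # rows present in either table
```

Comparing `nrow(merged)` with `nrow(data)` and `nrow(kmer_data)` is a quick way to check whether any contigs were lost in the join.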

The count of good mappings versus the number of unique kmers in a contig

ggplot(merged, aes(x=good, y=unique)) +
  geom_point() +
  xlim(0,800) +
  ylim(0,5000) +
  xlab("Number of good mappings to contig") +
  ylab("Number of unique kmers in contig")

The change in assembly score under each of the two methods, plotted against each other

ggplot(merged, aes(x=diff, y=change)) +
  geom_point() +
  xlim(0,0.0001) +
  ylim(0,0.0001) +
  xlab("Change in assembly score in current method") +
  ylab("Change in assembly score with new kmer based method")

Both sets of data on the same plot

ggplot(data2, aes(x=x, y=diff, colour=method)) +
  geom_point() +
  xlim(0,5000) +
  ylim(0,0.0002) +
  xlab("Number of good mappings or unique kmers in contig") +
  ylab("Change in assembly score by removing contig")
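The plot above uses `data2`, which is not defined in this section. One plausible way to build such a table is to stack the two data sets in long form with a `method` label, as sketched below. The toy data frames and the labels `"reads"` and `"kmers"` are assumptions chosen for illustration, not the values used in the original analysis.

```r
# Toy stand-ins for data and kmer_data (values are invented for illustration)
data_toy <- data.frame(key    = c("contig1", "contig2"),
                       good   = c(10, 250),
                       diff   = c(1e-6, 3e-5))
kmer_toy <- data.frame(name   = c("contig1", "contig2"),
                       unique = c(1200, 300),
                       change = c(2e-6, 1e-6))

# Stack both methods into one long table with the columns the plot expects:
# x (count on the x axis), diff (change in assembly score), method (colour)
data2_toy <- rbind(
  data.frame(x = data_toy$good,   diff = data_toy$diff,   method = "reads"),
  data.frame(x = kmer_toy$unique, diff = kmer_toy$change, method = "kmers")
)

data2_toy
```

With this layout, `aes(x=x, y=diff, colour=method)` colours each point by the method it came from.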

Using the kmer-based method, the sorted contig score against the assembly score as contigs are removed.

ggplot(kmer_data, aes(x=score, y=a_score)) +
  geom_point() +
  xlim(0,1.00) +
  ylim(0,0.40) +
  xlab("Contig score") +
  ylab("Assembly score")

The assembly score against the sorted contig number

kmer_data$idx <- as.numeric(row.names(kmer_data))
ggplot(kmer_data, aes(x=idx, y=a_score)) +
  geom_point() +
  xlim(0,155000) +
  ylim(0,0.40) +
  xlab("Contig number (sorted by score)") +
  ylab("Assembly score")
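The `idx` column above is taken from the row names, which only gives the sorted position if `kmer_data` was already in score order when it was read in. If a data frame is sorted inside R, the row names keep their original values, so an explicit index is safer. A minimal sketch on invented toy data (`kmer_toy` and its values are hypothetical):

```r
# Toy data in arbitrary order (values are invented for illustration)
kmer_toy <- data.frame(name  = c("c1", "c2", "c3"),
                       score = c(0.9, 0.2, 0.5))

# Sort by contig score, lowest first
sorted <- kmer_toy[order(kmer_toy$score), ]

# Row names still reflect the pre-sort positions
as.numeric(row.names(sorted))

# An explicit index gives the position in sorted order instead
sorted$idx <- seq_len(nrow(sorted))
sorted
```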