1 Introduction

In this project we use the change in the relative abundance of genes as a proxy for fitness. The change in abundance is characterized using the enrichment ratio: the proportion of all reads with a given sequence after a given number of generations, divided by the initial proportion. For ease of comparison, we work with the log2 of the enrichment ratio: genes becoming more common have a log2 enrichment ratio > 0, and genes being depleted have a negative one.
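As an illustration of the definition above, here is a minimal Python sketch of the log2 enrichment ratio. The function name, its arguments and the choice to return None (standing in for "NA") when a sequence has zero reads at either time point are assumptions made for this example, not a description of the actual pipeline.

```python
import math

def log2_enrichment(count_before, total_before, count_after, total_after):
    """log2 of (final proportion / initial proportion) for one sequence.

    Returns None when the ratio is undefined or infinite (zero reads at
    either time point). Treating those cases as missing is an assumption.
    """
    if count_before == 0 or count_after == 0:
        return None
    # (count_after / total_after) / (count_before / total_before),
    # rearranged so that only one division is needed.
    ratio = (count_after * total_before) / (count_before * total_after)
    return math.log2(ratio)

# A sequence going from 1% to 4% of the reads is enriched (positive value);
# one going from 4% to 1% is depleted (negative value).
enriched = log2_enrichment(10, 1000, 40, 1000)
depleted = log2_enrichment(40, 1000, 10, 1000)
```

With this convention, equal enrichment and depletion factors give values of equal magnitude and opposite sign, which is what makes the log2 scale convenient for comparison.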

If changes in the relative abundance of genes (as seen in the enrichment ratio) are due to the effect of the encoded peptides on fitness, there should be a positive correlation between the enrichment ratios of genes encoding the same peptide. A given peptide may be encoded by multiple genes for two reasons: the degeneracy of the codons used, or the circularity of the peptides. (For example, CDCL and CLCD are the same peptide after splicing.)
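The circularity argument can be sketched in Python as follows: two linear sequences describe the same circular peptide when one is a rotation of the other, so each peptide can be reduced to a single canonical rotation before grouping genes. The function name and the choice of the alphabetically smallest rotation as the representative are assumptions made for this example.

```python
def canonical_rotation(peptide):
    """Return one fixed representative for all rotations of a circular peptide.

    The representative chosen here is the alphabetically smallest rotation;
    any consistent choice would do. Only rotations are considered, not
    reversals, since a peptide read backwards is a different molecule.
    """
    if not peptide:
        return peptide
    rotations = (peptide[i:] + peptide[:i] for i in range(len(peptide)))
    return min(rotations)

# CDCL and CLCD are rotations of each other, so they share a representative.
same = canonical_rotation("CDCL") == canonical_rotation("CLCD")
```

Genes whose translated peptides share the same canonical rotation can then be treated as synonymous, in addition to genes that differ only by codon choice.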

This note aims to test that.

1.1 Correlation coefficient

For each case (unique combination of peptide length, repetition and induction condition) we compute the correlation coefficient of the enrichment ratios in a pairwise fashion. That is, if four sequences A, B, C and D encode the same peptide, we add all six possible pairs to the list of enrichment ratio pairs to test, as in the table below.

All possible pairs from 4 sequences A, B, C and D.
Seq1 Seq2
A B
A C
A D
B C
B D
C D
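The pairing scheme in the table above can be sketched in Python as follows. The input layout (a dictionary mapping each peptide to the log2 enrichment ratios of its encoding genes), the function names and the use of the Pearson coefficient are assumptions made for this example; the note does not state which correlation coefficient was used.

```python
from itertools import combinations
import math

def synonymous_pairs(ratios_by_peptide):
    """Collect every unordered pair of ratios among genes encoding one peptide."""
    pairs = []
    for ratios in ratios_by_peptide.values():
        # n synonymous genes contribute n * (n - 1) / 2 pairs; peptides with
        # a single gene contribute none.
        pairs.extend(combinations(ratios, 2))
    return pairs

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Four sequences give the six pairs listed in the table, in the same order.
table_pairs = list(combinations("ABCD", 2))
```

Note that the coefficient is undefined (division by zero) if all values on one axis are identical, so very small sets of pairs would need a guard.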

The correlation coefficients are discussed for the short and long peptides separately.

1.2 Short peptides

For the short peptides (see below) there is no correlation at all between enrichment ratios, as seen in the table below: all correlation coefficients have an absolute value < 0.01.

Correlation coefficient of the log2(enrichment ratio) for the short peptides.
case coefficient
nnb/induced 0.0006412
nnb/repressed -0.0031178
nnk/induced 0.0030854
nnk/repressed 0.0099953

Plotting these pairs shows that the trendlines are indeed flat.

1.3 Long peptides

For the long peptides, the pattern is very similar, with one exception: the induced NNK library comes closer to what we’d expect, with a coefficient of 0.65.

Correlation coefficient of the log2(enrichment ratio) for the long peptides.
case coefficient
nnb/induced -0.0203679
nnb/repressed 0.0083027
nnk/induced 0.6489829
nnk/repressed -0.0315734

The graph shows a neat trendline (although not quite the 45° we’d expect).
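One possible reading of the shallower-than-45° trendline, assuming the trendline is an ordinary least-squares fit and both axes are drawn to the same scale (neither is stated in this note): the fitted slope is the correlation coefficient scaled by the ratio of the standard deviations of the two axes. The symbols b (slope), r (correlation coefficient), s_x and s_y (standard deviations) and θ (angle of the trendline) are introduced here for illustration.

```latex
b = r \,\frac{s_y}{s_x}
\quad\Longrightarrow\quad
b \approx r \approx 0.65 \ \text{ when } s_x \approx s_y,
\qquad
\theta = \arctan(0.65) \approx 33^\circ
```

Since both axes show the same quantity for synonymous genes, similar spreads seem plausible, in which case a 45° line would only be expected for a perfect correlation (r = 1).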

1.3.1 Number of multiple reads

One explanation might simply be that the number of pairs available for the regression varies between conditions. Far fewer peptides are encoded by multiple genes with a non-“NA” enrichment ratio in the induced NNK condition, despite a similar number of overall reads. This might be due to the large number of entirely purged peptides. It would also explain the skew: the correlation graph shows highly enriched peptides with ratios > 8 for NNK induced, but not for NNB.

Number of peptide sequences with multiple encoding genes for the long peptides.
case count
nnb/induced 687
nnb/repressed 688
nnk/induced 206
nnk/repressed 5725
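A count like the one in the table above could be obtained along these lines. This is a minimal Python sketch: the record layout (peptide, gene sequence, log2 enrichment ratio or None for "NA") and the function name are assumptions made for this example.

```python
from collections import defaultdict

def count_multi_gene_peptides(records):
    """Count peptides encoded by at least two genes with a usable ratio.

    records: iterable of (peptide, gene_sequence, log2_ratio), where
    log2_ratio is None when the enrichment ratio is "NA".
    """
    genes_by_peptide = defaultdict(set)
    for peptide, gene, ratio in records:
        if ratio is not None:  # genes with an "NA" ratio cannot form a pair
            genes_by_peptide[peptide].add(gene)
    return sum(1 for genes in genes_by_peptide.values() if len(genes) >= 2)

# Hypothetical records: only CDCL has two genes with a non-NA ratio.
example = [
    ("CDCL", "TGTGATTGTCTT", 1.2),
    ("CDCL", "TGCGATTGTCTG", 0.9),
    ("ACAC", "GCTTGTGCTTGT", -0.4),
    ("ACAC", "GCGTGTGCGTGT", None),
]
n_multi = count_multi_gene_peptides(example)
```

Filtering on the ratio before counting matters here: a peptide whose synonymous genes were mostly purged drops out of the comparison even if it was encoded by several genes initially.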

2 Conclusion

Neither the NNB library nor the shorter library shows a clear correlation between synonymous sequences. This hints that the ratios seen there are essentially random, with no fitness effect due to the peptide. This could be explained by poor splicing of the libraries. For the NNK7 library under repressed conditions, we see a similar pattern.

One thing to note is that, while the range of enrichment ratios in those null conditions is wide, the distribution still centers on low absolute values, with the values of log(enrichment ratio) following a normal distribution with a standard deviation of ~1.5.