In this project we want to use the change in the relative abundance of genes as a proxy for fitness. The change in abundance is characterized using the enrichment ratio: the proportion of all reads with a given sequence after a given number of generations, divided by the initial proportion. For ease of comparison, we work with the log2 of the enrichment ratio: genes becoming more common have a log2 enrichment ratio > 0, while genes being depleted have a negative one.
If changes in the relative abundance of genes (as seen in the enrichment ratio) are due to the effect of the encoded peptides on fitness, there should be a positive correlation between the enrichment ratios of genes encoding the same peptide. A given peptide may be encoded by multiple genes for two reasons: the degeneracy of the genetic code, or the circularity of the peptides. (For example, CDCL and CLCD are the same peptide after splicing.)
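As a minimal sketch of the definition above, the log2 enrichment ratio can be computed from raw read counts as follows. The function and parameter names are hypothetical, and returning `None` for zero counts is an assumption about how missing ("NA") values are handled, not something this note specifies.

```python
import math

def log2_enrichment(count_initial, total_initial, count_final, total_final):
    """Return log2(final proportion / initial proportion) for one sequence.

    Returns None when either count is zero, because the ratio is then
    undefined or infinitely negative (assumed here to be reported as NA).
    """
    if count_initial == 0 or count_final == 0:
        return None
    p_initial = count_initial / total_initial  # proportion of reads at the start
    p_final = count_final / total_final        # proportion after the generations
    return math.log2(p_final / p_initial)

# Hypothetical counts: a sequence going from 10 to 40 reads per 1000 is enriched
# (positive value), one going from 40 to 10 is depleted (negative value).
print(log2_enrichment(10, 1000, 40, 1000))
print(log2_enrichment(40, 1000, 10, 1000))
```

Working in proportions rather than raw counts keeps the ratio comparable when the two time points have different sequencing depths.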
This note aims to test that.
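To group genes by the peptide they encode, circular peptides such as CDCL and CLCD need a common label. One possible approach, sketched here with hypothetical function names, is to use the alphabetically smallest rotation of the sequence as a canonical form. This assumes that only rotations are equivalent; reversed sequences are treated as different peptides.

```python
from collections import defaultdict

def canonical_rotation(peptide):
    """Return the alphabetically smallest rotation of a circular peptide."""
    if not peptide:
        return peptide
    return min(peptide[i:] + peptide[:i] for i in range(len(peptide)))

def group_by_circular_peptide(peptides):
    """Map each canonical form to the linear peptide sequences that match it."""
    groups = defaultdict(list)
    for peptide in peptides:
        groups[canonical_rotation(peptide)].append(peptide)
    return dict(groups)

# CDCL and CLCD end up in the same group; ACDL stays on its own.
print(group_by_circular_peptide(["CDCL", "CLCD", "ACDL"]))
```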
For each case (unique combination of peptide length, repetition and induction condition) we’re going to compute the correlation coefficient of the enrichment ratios in a pairwise fashion. That is, if we have four sequences A, B, C and D encoding the same peptide, we’ll add all six possible pairs to the list of enrichment ratio pairs to test, as in the table below.
| Seq1 | Seq2 |
|---|---|
| A | B |
| A | C |
| A | D |
| B | C |
| B | D |
| C | D |
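
The pairing described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for this note: the function names are hypothetical, the ratios are made-up values, NA ratios are assumed to be stored as `None` and dropped, and the coefficient is assumed to be Pearson's.

```python
from itertools import combinations
from math import sqrt

def pairs_for_peptide(ratios):
    """Build all unordered pairs of ratios for sequences encoding one peptide.

    ratios maps a sequence name to its log2 enrichment ratio (None = NA).
    """
    usable = {name: r for name, r in ratios.items() if r is not None}
    return [(usable[a], usable[b]) for a, b in combinations(sorted(usable), 2)]

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined without variation
    return cov / sqrt(var_x * var_y)

# Made-up ratios for four sequences A-D encoding the same peptide.
example = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
pairs = pairs_for_peptide(example)
print(len(pairs))  # 4 sequences give 4 * 3 / 2 = 6 pairs
```

For one case, the pairs from every peptide would be pooled into a single list before calling `pearson`. Because the pairs are unordered, which member ends up on which axis is arbitrary; sorting the sequence names, as done here, is only one possible convention.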
The correlation coefficients are discussed for the short and long peptides separately.
For the short peptides there is no correlation at all between enrichment ratios, as seen in the table below: all correlation coefficients have an absolute value < 0.01.
| case | coefficient |
|---|---|
| nnb/induced | 0.0006412 |
| nnb/repressed | -0.0031178 |
| nnk/induced | 0.0030854 |
| nnk/repressed | 0.0099953 |
Graphing this, it can be seen that the trendlines are indeed flat.
For the long peptides, the pattern is very similar, with one exception: the induced NNK library shows something closer to what we’d expect, with a coefficient of 0.65.
| case | coefficient |
|---|---|
| nnb/induced | -0.0203679 |
| nnb/repressed | 0.0083027 |
| nnk/induced | 0.6489829 |
| nnk/repressed | -0.0315734 |
The graph shows a neat trendline (although not quite the 45° line we’d expect).
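For reference, a least-squares trendline has slope r · (s_y / s_x), where r is the correlation coefficient and s_x, s_y are the standard deviations of the two axes. A 45° line therefore requires both r close to 1 and equal spread on both axes. The sketch below uses hypothetical function names and made-up points to compute the slope and angle of such a trendline. It assumes an ordinary least-squares fit, which may not be what the plotting tool used, and whether this accounts for the angle seen in the graph has not been checked here.

```python
from math import atan, degrees

def ols_slope(pairs):
    """Slope of the ordinary least-squares line of y on x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx

def trendline_angle(pairs):
    """Angle of the trendline in degrees, assuming equal axis scales."""
    return degrees(atan(ols_slope(pairs)))

# Made-up points lying exactly on y = x give a 45 degree trendline.
print(trendline_angle([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]))

# If both axes had the same spread, a coefficient of 0.65 would correspond
# to a slope of 0.65, i.e. an angle of about 33 degrees.
print(degrees(atan(0.65)))
```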
One explanation might simply be that the number of pairs available for the regression varies between conditions (see the table below). Far fewer peptides are encoded by multiple genes with non-“NA” enrichment ratios in the induced NNK condition, despite a similar overall number of reads. This might be due to the large number of entirely purged peptides. It would also explain the skew: the correlation graph shows highly enriched peptides with ratios > 8 for NNK induced, but not for NNB.
| case | count |
|---|---|
| nnb/induced | 687 |
| nnb/repressed | 688 |
| nnk/induced | 206 |
| nnk/repressed | 5725 |
Neither the NNB library nor the shorter library shows a clear correlation between synonymous sequences. This hints that the ratios seen there are essentially random, with no fitness effect due to the peptide. This could be explained by poor splicing of the libraries. For the NNK7 library under repression, we see a similar pattern.
One thing to note is that, while the range of enrichment ratios in those null conditions is wide, the distribution is still centered on low absolute values: the values of log(enrichment ratio) follow a normal distribution with a standard deviation of ~1.5.
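To give a feel for what a standard deviation of ~1.5 implies, the sketch below computes the fraction of values expected within a given distance of zero. It assumes the distribution is exactly normal and centered on zero, which the note only states approximately, and the function name is hypothetical.

```python
from math import erf, sqrt

def fraction_within(threshold, sd):
    """P(|X| <= threshold) for X following a normal distribution centered on 0."""
    return erf(threshold / (sd * sqrt(2)))

# With sd = 1.5, about 68% of log ratios fall within +/-1.5
# and about 95% within +/-3.
for threshold in (1.5, 3.0):
    print(threshold, round(fraction_within(threshold, 1.5), 3))
```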