This is the analysis of the peptides from the drift experiment. The required datasets were created in the “peptideAnalysis.R” file. To speed things up, I will first run it on a subset of reads for both the short and large datasets.
Let’s have a look at the distribution of the enrichment ratio:
$induced
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.42 0.74 0.84 0.98 1.12 116.15 38643
$repressed
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.26 0.52 0.80 0.92 1.03 87.92 94513
Let’s look at the subset of sequences with stop codons. 104977 sequences (21.65% of the total) contain stop codons. With peptides containing 3 randomized NNK codons, I would expect 9.09% of the total (TAG is the only stop among the 32 NNK codons, so 1 - (31/32)^3), so this is much higher than expected. Looking at the sequences, something is odd: many of them don’t have a starting cysteine.
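As a sanity check on the 9.09% figure, here is a minimal Python sketch (not the code from peptideAnalysis.R) that enumerates the NNK codons and computes the expected fraction of sequences with at least one stop. It assumes the three randomized codons are independent and that every NNK codon is equally likely, which real libraries only approximate.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

# NNK: N is any of the four bases, K is G or T.
nnk_codons = ["".join(c) for c in product("ACGT", "ACGT", "GT")]

# Fraction of single NNK codons that are stops (only TAG ends in G or T).
p_stop = sum(c in STOP_CODONS for c in nnk_codons) / len(nnk_codons)

# Probability that at least one of three independent NNK codons is a stop.
p_any_stop = 1 - (1 - p_stop) ** 3

print(f"{len(nnk_codons)} NNK codons, {p_stop * len(nnk_codons):.0f} of them a stop")
print(f"expected fraction of sequences with a stop: {p_any_stop:.2%}")
```

Under these assumptions the result is 2977/32768, which rounds to the 9.09% quoted above.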
I will remove the sequences that do not start with the correct codon (TGC).
This is a surprisingly large number: 235186 out of 484868 (48.51%).
Looking only at the sequences starting with TGC, 10.66% contain stop codons, much closer to the expected 9.09%. We’ll keep only the correct ones for downstream analysis and work with the NNK3_correct data set.
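The filtering step can be illustrated with a small Python sketch. The reads below are made-up toy sequences, and I am assuming each read is the TGC start codon followed in frame by the three NNK codons; the real layout of the reads may differ.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons(seq):
    """Split a nucleotide sequence into in-frame codons, ignoring a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def has_stop(seq):
    """True if any in-frame codon of the sequence is a stop codon."""
    return any(c in STOP_CODONS for c in codons(seq))

# Hypothetical reads, for illustration only.
reads = ["TGCAAGTAGCTT", "TGCGCTTGGAAT", "AGCTAGTTTGGG", "TGCCCGATGTCT", "GGGTAGTAGAAG"]

# Keep only the reads that start with the expected cysteine codon.
correct = [s for s in reads if s.startswith("TGC")]

frac_wrong_start = 1 - len(correct) / len(reads)
frac_stop_all = sum(map(has_stop, reads)) / len(reads)
frac_stop_correct = sum(map(has_stop, correct)) / len(correct)

print(f"wrong start: {frac_wrong_start:.0%}")
print(f"with stop (all reads): {frac_stop_all:.0%}")
print(f"with stop (TGC reads only): {frac_stop_correct:.0%}")
```

In this toy set, as in the real data, the stop-codon rate drops once the reads with a wrong start are removed.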
NAs here are the cases where there were no reads at generation 1 but some after five generations. 18% of sequences in the induced set and 35% in the repressed set are such NAs. Interestingly, far more sequences are NAs in the repressed set, despite it being generally less enriched.
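To make the origin of these NAs concrete, here is a hedged Python sketch. The exact definition of the enrichment ratio lives in peptideAnalysis.R and is not repeated here, so I am assuming it is the frequency at generation 5 divided by the frequency at generation 1; the function name, sequences and counts are all hypothetical.

```python
import math

def enrichment_ratio(count_g1, count_g5, total_g1, total_g5):
    """Frequency at generation 5 over frequency at generation 1 (assumed definition).

    Returns NaN when there are no reads at generation 1, because the
    ratio is undefined in that case.
    """
    if count_g1 == 0:
        return math.nan
    return (count_g5 / total_g5) / (count_g1 / total_g1)

# Hypothetical counts: sequence -> (reads at generation 1, reads at generation 5).
counts = {
    "TGCAAGGCTCTT": (40, 20),
    "TGCGCTTGGAAT": (10, 30),
    "TGCCCGATGTCT": (0, 2),  # absent at generation 1, so the ratio is NA
}
total_g1 = sum(g1 for g1, _ in counts.values())
total_g5 = sum(g5 for _, g5 in counts.values())

ratios = {
    seq: enrichment_ratio(g1, g5, total_g1, total_g5)
    for seq, (g1, g5) in counts.items()
}
na_fraction = sum(math.isnan(r) for r in ratios.values()) / len(ratios)
print(ratios)
print(f"fraction of NAs: {na_fraction:.0%}")
```

A sequence seen only a couple of times at generation five and never at generation 1 ends up as an NA, which is the situation described above.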
Compared to the rest of the counts, the NAs usually have very low sequence counts even at generation five, so they probably represent rare peptides. That would explain the higher number of NAs in the repressed set.
$induced
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 2.591 2.000 169.000
$repressed
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 3.042 3.000 255.000
$induced
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 2.0 4.0 657.3 37.0 2607562.0
$repressed
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 2 4 865 33 3453856
If those are rare, “fake” sequences that are artifacts, I expect a very high rate of sequences that don’t start with TGC. Let’s check that.
Out of 133156 sequences with NAs, only 41692 start with TGC. That leaves 68.69% of wrong sequences, well above the overall rate of 48.51% (about 1.4 times)! This fits with the idea that rare sequences are the ones causing issues.
enrichment_ratio
Min. :0.2587
1st Qu.:0.6262
Median :0.8095
Mean :0.7489
3rd Qu.:0.8817
Max. :1.0000
Most peptides here are only weakly purged. Of the 207990 peptides for which an enrichment ratio was computed, 137997 are purged, but only 15845 (7.62%) have seen their frequency reduced at least twofold.
With NNK3 peptides, we expect 20^3 = 8000 different possible peptide sequences (not counting cyclisation). In my case, I have 8018 peptides in the repressed condition and 8011 in the induced condition, so coverage is essentially complete. Those numbers are 8006 and 8006 if one removes NAs.
No sequence seems to be missing or highly repressed, so there is no strong purge. What could have happened is that some of the peptides missing from the induced set are actually very strongly purged there compared to the repressed set. If true, I expect them to be purged in the repressed set too.
The absent peptides all contain relatively rare codons, with a lot of phenylalanine, tryptophan, cysteine… That could be the explanation. Looking at the data sets, all eleven peptides are “NAs” in the repressed condition too, so they are relatively rare.
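One way to check the "rare codon" idea is to count how many NNK codons encode each amino acid. This is a Python sketch using the standard genetic code, assuming a uniform NNK mix; it is not taken from peptideAnalysis.R.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Count the NNK codons (third base G or T) per amino acid; "*" is a stop.
nnk_counts = Counter(aa for codon, aa in CODON_TABLE.items() if codon[2] in "GT")

for aa in "FWC":
    print(aa, nnk_counts[aa])

single = sorted(aa for aa, n in nnk_counts.items() if n == 1 and aa != "*")
print("amino acids with a single NNK codon:", "".join(single))
```

Phenylalanine, tryptophan and cysteine each have a single NNK codon, against three for leucine, serine and arginine. A peptide made only of single-codon residues is therefore expected at 1/27 of the frequency of one made only of three-codon residues, before any selection. Note that nine other amino acids also have a single NNK codon, so this alone may not explain everything.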
I would expect the enrichment ratios for the same peptide sequence to correlate pretty well.
There is a fairly good correlation (0.79) between the induced and repressed conditions. The plot isn’t fully symmetrical, which I don’t fully understand.
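For reference, this is roughly how such a correlation can be computed on peptides present in both conditions. It is a Python sketch with made-up ratios and peptide names; I am assuming a Pearson correlation on the raw ratios after dropping NAs, which may not be what produced the 0.79 above (a rank correlation or log-transformed ratios would be equally plausible).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical enrichment ratios per peptide in each condition.
induced = {"CAKL": 0.8, "CRRW": 0.4, "CGSP": 1.1, "CFWC": math.nan}
repressed = {"CAKL": 0.9, "CRRW": 0.5, "CGSP": 1.0, "CMTE": 0.7}

# Keep peptides measured in both conditions with a defined ratio.
shared = [
    p for p in induced
    if p in repressed
    and not math.isnan(induced[p])
    and not math.isnan(repressed[p])
]

r = pearson([induced[p] for p in shared], [repressed[p] for p in shared])
print(f"{len(shared)} shared peptides, r = {r:.2f}")
```

Because ratios are bounded below by 0 but unbounded above, a scatter plot on the raw scale is naturally lopsided; plotting on a log scale is one thing I could try when looking at the asymmetry.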
RR motifs are known to interfere with the membrane. Let’s look at the peptides with those motifs.
$induced
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.42 0.74 0.84 0.98 1.12 116.15 38643
$repressed
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.26 0.52 0.80 0.92 1.03 87.92 94513
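The RR-motif selection mentioned above could be sketched as follows in Python. The peptides and ratios are invented for illustration, and I am assuming an "RR motif" simply means two adjacent arginines anywhere in the translated peptide.

```python
from statistics import median

def has_rr_motif(peptide):
    """True if the peptide contains two adjacent arginines."""
    return "RR" in peptide

# Hypothetical peptide -> enrichment ratio mapping.
ratios = {"CRRA": 0.45, "CARR": 0.60, "CRAR": 0.95, "CGSP": 1.05}

with_rr = {p: r for p, r in ratios.items() if has_rr_motif(p)}
without_rr = {p: r for p, r in ratios.items() if not has_rr_motif(p)}

median_rr = median(with_rr.values())
median_other = median(without_rr.values())
print(f"RR peptides: {sorted(with_rr)}, median ratio {median_rr:.3f}")
print(f"other peptides: {sorted(without_rr)}, median ratio {median_other:.3f}")
```

Comparing the two medians, or the two full distributions, is a simple way to see whether RR-containing peptides are purged more strongly than the rest.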