Step 05 - Model fitness cost of producing constructs by examining DNA levels

Motivation

Let's assume that making lots RNA (and thus protein) puts some significant fitness cost on the cell, decreasing it's growth rate. The more RNA is made, the slower the cell grows, increasing it's division time. Because every time the cell divides, it's RNA abundance is cut in half, the amount of RNA per cell is inflated in cells that grow slowly - it builds up quicker than it can be diluted.

Thus, RNA abundance per cell is a compound measure of RNA production and the fitness cost of producing the RNA. It is possible that RNA abundance is inflated for constructs that pose a greater than average fitness cost on their host cells.

Given the current data, the best measure we have of fitness cost is DNA abundance.

Variation in DNA abundance can come from two sources. The first source is variability in the synthesis of oligos off of the chip. The other source of variation comes from growth rate.

DNA abundance should decrease for cells that grow slowly, since fewer cells means fewer plasmids within them.

However, cells with impaired fitness might also have trouble replicating the plasmid, which would mean fewer plasmids per cell, and thus lower DNA abundance.

Lee and Bailey (Lee & Bailey, 2002) created a model for all of these factors in a 1983 paper (published online 2002), so I might be able to use their model. I'm not sure if there is a more recent model. They explicitly model plasmid replication based on the initiator and repressor complexes of a plasmid (λdv), which I am not using.

Looking at our data

It is clear that DNA abudance is affected by RNA and Protein Levels.

ggplot(ngs, aes(x = log10(RNA), y = log10(Count.DNA), color = Promoter)) + geom_point(alpha = 0.1) + 
    geom_density2d() + theme_bw() + geom_hline(yintercept = 3, linetype = "dashed")

plot of chunk 5-plot_DNA_abundance

ggplot(ngs, aes(x = log10(Prot), y = log10(Count.DNA), color = Promoter)) + 
    geom_point(alpha = 0.1) + geom_density2d() + theme_bw() + facet_grid(Promoter ~ 
    .) + geom_hline(yintercept = 3, linetype = "dashed")

plot of chunk 5-plot_DNA_abundance

ggplot(ngs, aes(x = log10(Prot), y = log10(RNA), color = log10(Count.DNA) < 
    3)) + geom_point(alpha = 0.1) + theme_bw() + facet_grid(Promoter ~ .) + 
    stat_smooth(data = subset(ngs, log10(Prot) > 3.125 & log10(Prot) < 5.25), 
        method = lm, se = F, linetype = "dashed", color = "black", aes(x = log10(Prot), 
            y = log10(RNA)))

plot of chunk 5-plot_DNA_abundance

The DNA level here is clearly reduced where protein and RNA abundances are high. I drew horizontal lines in the first two plots showing that for lower abundance RNA and protein, the DNA counts remain high, but this DNA count dips for high expression of Protein and RNA.

In the third plot, red means above this dotted line, and blue means below. Keep in mind that Protein levels have maximum and minimum measureable limits. Within these limits, log10 Protein ~ log10 RNA ratios hold pretty constant (dotted black line). RNA level has no such limit, and is a much cleaner measure.

Comparing DNA Read Count and Protein Read Count

We also have a separate parallel measure of DNA; the number of total reads seen per construct after sorting. This measure should track linearly to the DNA count.

It won't track linearly without some renormalization, since we sort a different number of cells in each bin, and each barcoded bin has slightly different percentages in the library (separate from cells sorted). If we normalize for both of these things, we will see a stronger correlation.

# I got these numbers from step 04
prot.bin_pcts <- c(37.93333333, 9.266666667, 10.86666667, 10.16666667, 7.666666667, 
    5.8, 4.533333333, 3.566666667, 2.833333333, 2.266666667, 1.733333333, 3.2)
bin_breaks <- c(0, 1250, 2027, 3287, 5331, 8645, 14019, 22735, 36870, 59791, 
    96963, 157243, 255000)

# get constructs with protein counts
prot.has_counts <- with(ngs, Count.Prot > 0 & !is.na(Count.Prot))

# get the read counts, adjusted for reads per bin and cells per bin
ngs[prot.has_counts, "Count.Prot.Norm"] <- rowSums(t(get_adj_read_counts(subset(ngs, 
    prot.has_counts), prot.bin_pcts, bin_breaks)))

# get the 'best' bin; i.e. the bin that the avg prot level falls in
ngs$Prot.BestBin <- with(ngs, cut(Prot, bin_breaks, labels = 1:12))

# bin average values
prot.bin_avgs <- ddply(ngs, "Prot.BestBin", summarize, Count.Prot.Norm = exp(mean(log(Count.Prot.Norm))), 
    Count.Prot = exp(mean(log(Count.Prot))), Count.DNA = exp(mean(log(Count.DNA))))

# compare before and after normalization
ggplot(ngs, aes(x = log10(Count.Prot), y = log10(Count.Prot.Norm), color = Prot.BestBin)) + 
    geom_point() + theme_bw() + opts(title = "Protein Reads before(X)/after(Y) cell/read normalization")

plot of chunk 5.01-normalize_prot_count


ggplot(ngs, aes(x = log10(Count.DNA), y = log10(Count.Prot.Norm), group = Prot.BestBin, 
    color = Prot.BestBin)) + geom_point(alpha = 0.1) + geom_smooth(method = lm, 
    se = F) + geom_point(data = prot.bin_avgs, size = 8) + theme_bw() + opts(title = "Normalized Protein Reads vs. DNA Reads")

plot of chunk 5.01-normalize_prot_count


ggplot(subset(ngs, !is.na(Prot.BestBin)), aes(x = log10(Count.DNA), y = log10(Count.Prot.Norm), 
    color = Prot.BestBin)) + geom_density2d()

plot of chunk 5.01-normalize_prot_count


# Create this new fitness metric, scaled so the mean is 0 and the sd is 1
ngs$Fitness.Read_Ratio <- scale(log(ngs$Count.Prot.Norm)/log(ngs$Count.DNA))

Here we see a plot comparing the normalized reads from the Protein Measurement (Y axis) and DNA reads on the X axis. This comparison is useful because the Protein data was not only grown after transfection, like the DNA, but also after cell sorting, so the ratio of normalized Protein Reads to DNA reads gives us a proxy measurement for the growth rate. Constructs where this ratio is lower compared to the average have a slower growth rate.

It makes sense then that cells with higher expression levels (those falling into higher numbered bins) have a lower protein read:DNA read ratio and thus are slower growing, as one would expect!

So now we have three related measures for growth rate/cell fitness:

Direct DNA abundance
Indirect DNA abundance from from Protein reads
The ratio of of Protein : DNA Reads.

This new measure in 3 we will call Read-Ratio Fitness. 0 is the mean fitness and +/- 1 is one standard deviation away. This is measuring the difference in relative DNA abundance before and after sorting, because there is additional cell growth after sorting.

Unlike looking directly at Protein and DNA read counts, this ratio of Protein to DNA reads should not be affected by the number of reads synthesized per construct on the chip.

Obviously this measurement tracks most closely with the amount of protein produced:

ggplot(ngs, aes(x = log10(Prot), y = Fitness.Read_Ratio, color = Promoter)) + 
    geom_point(alpha = 0.2) + theme_bw() + labs(title = "Read Ratio Fitness versus Protein Abundance", 
    x = "log10 Protein Abundance", y = "Fitness based on Read Ratio")

plot of chunk 5.01a-read_ratio_v_prot_lvl


ggplot(ngs, aes(x = log10(Prot), y = scale(Fitness.Read_Ratio * log10(Prot)), 
    color = Promoter)) + geom_point(alpha = 0.2) + theme_bw() + labs(title = "Read Ratio Fitness versus Protein Abundance", 
    x = "log10 Protein Abundance", y = "Fitness based on Read Ratio")

plot of chunk 5.01a-read_ratio_v_prot_lvl

It also shows a strong correlation with DNA abundance, which also is a proxy for fitness (but subject to differences in synthesis rates per oligo):

ggplot(ngs, aes(x = log10(Count.DNA), y = Fitness.Read_Ratio, color = Promoter)) + 
    geom_point(alpha = 0.2) + theme_bw() + ggtitle("") + labs(title = "Comparing Fitness Measures: DNA vs DNA/Protein Ratio", 
    x = "log10 DNA Abundance", y = "Fitness based on DNA/Prot Read Ratio")

plot of chunk 5.01b-read_ratio_v_dna

Here is the distribution of this measurement across each gene, normalized for protein level:


ggplot(subset(ngs, !is.na(Fitness.Read_Ratio)), aes(x = scale(Fitness.Read_Ratio/Prot), 
    y = reorder(Gene, Prot, mean))) + geom_point(alpha = 0.2) + theme_bw() + 
    facet_grid(RBS ~ Promoter) + scale_y_discrete(labels = NULL) + theme(panel.grid.major = element_line(colour = NA), 
    panel.grid.minor = element_line(colour = NA)) + labs(title = "Fitness per Gene, RBS, Promoter", 
    x = "Fitness normalized by Protein Abundance", y = "Gene, ordered by mean Protein Abundance")

plot of chunk 5.01c-read_ratio_v_gene


ggplot(ddply(subset(ngs, !is.na(Fitness.Read_Ratio)), c("Promoter", "RBS", "Gene"), 
    summarize, Fitness = mean(Fitness.Read_Ratio/Prot, na.rm = T), Fitness.SD = sd(Fitness.Read_Ratio/Prot, 
        na.rm = T), Protein = mean(log10(Prot), na.rm = T), Protein.SD = sd(log10(Prot), 
        na.rm = T)), aes(x = Protein, xmin = Protein - Protein.SD, xmax = Protein + 
    Protein.SD, y = Fitness, ymin = Fitness - Fitness.SD, ymax = Fitness + Fitness.SD)) + 
    facet_grid(RBS ~ Promoter) + geom_errorbar(alpha = 0.2) + geom_errorbarh(alpha = 0.2) + 
    theme_bw() + labs(title = "Fitness Variation by Gene", y = "Fitness normalized by Protein Abundance", 
    x = "Protein Abundance")

plot of chunk 5.01c-read_ratio_v_gene

Some final concerns regarding DNA/RNA dilution

Does the amount of Protein and RNA per cell get inflated significantly because of longer cell divison times and slower dilution? Is this significantly cofounding our signal?

I'm not quite sure how to answer this question with the data I have. Here are some plots:

ggplot(subset(ngs), aes(x = log10(RNA), y = log10(Prot), color = cut(Fitness.Read_Ratio, 
    breaks = c(Inf, 1, -1, -Inf), labels = c(">1 SD Above", "Within 1 SD", ">1 SD Below")))) + 
    geom_point(alpha = 0.2) + theme_bw() + scale_colour_manual(name = "Fitness", 
    values = c("red", "gray", "blue"))

plot of chunk unnamed-chunk-1


ggplot(subset(ngs), aes(x = log10(RNA), y = log10(Prot), color = cut(Fitness.Read_Ratio, 
    breaks = c(Inf, 1, -1, -Inf), labels = c(">1 SD Above", "Within 1 SD", ">1 SD Below")))) + 
    geom_point(alpha = 0.2) + theme_bw() + scale_colour_manual(name = "Fitness", 
    values = c("red", "gray", "blue")) + facet_grid(RBS ~ Promoter)

plot of chunk unnamed-chunk-1


ggplot(subset(ngs), aes(x = log10(Trans), y = Fitness.Read_Ratio, color = cut(Fitness.Read_Ratio, 
    breaks = c(Inf, 1, -1, -Inf), labels = c(">1 SD Above", "Within 1 SD", ">1 SD Below")))) + 
    geom_point(alpha = 0.2) + theme_bw() + scale_colour_manual(name = "Fitness", 
    values = c("red", "gray", "blue")) + facet_grid(RBS ~ Promoter)

plot of chunk unnamed-chunk-1


# finally, add fitness as a factor state
ngs$Fitness <- with(ngs, cut(Fitness.Read_Ratio, breaks = c(Inf, 1, -1, -Inf), 
    labels = c("Fit", "Avg", "Unfit")))

In all three plots, I'm coloring 'low' fitness constructs, those greater than 1 stdev below mean fitness, as red, and those who grow faster by more than 1 stdev in blue.

The first plot shows Prot vs RNA and coloring by fitness, which shows that the translation efficiency seems to decrease for low-fitness constructs.

The second plot is the first, split by promoter and RBS. You see a general negative trend under the strong promoter - increased translation efficiency decreases fitness. Decreased fitness seems to correlate with less protein.

The third plot shows the correspondence between translation efficiency and fitness. It does look like the translation efficiency is lower than expected for strongly expressed constructs with low fitness (bottom left panel). I think this is actually because the protein level is underestimated, not because of fitness effects on translation efficiency.

Do the number of plasmids per cell also increase (or decrease) significantly given longer cell division times and slower dilution?

I'm not sure how to measure this except to examine total DNA counts per construct, and we've already seen that more DNA means more fit, not less fit.

What about segregational stability?

Plasmids get distributed randomly to each cell during division, and the cell with fewer plasmids gets a growth advantage, and can outcompete cells with the plasmid. This could cause a decrease in RNA levels. This assumes that cells with fewer plasmids do not get a growth disadvantage due to the production of less antibiotic resistance protein. I would also expect to see a bimodal distribution in Protein levels per cell.

I will look at this when I examine the construct distributions, next.

Conclusions

By comparing the amount of DNA per construct before and after sorting, I can get a measure of how fast each construct grows. This is because there were many additional doublings while the sorted cells were grown up before being prepped for DNA. Constructs that caused the cells they inhabited to grow more slowly will be relatively less prevalent before and after sorting.
I need to get the growth rates for a variety of constructs to se if my growth rate measurement holds. I can do this with time-series fluorescent measurements in 96-well plates. I could also use this to perhaps extrapolate cell doubling times.
Very strongly expressed constructs have much slower growth rates. This is as expected. There are some constructs that are weakly expressed which also grow slowly. I can correlate this measure with other things also, like codon usage, secondary structure, etc.