Step 09: Running dotblot controls

Picking genes where overexpression does not affect fitness

I want to use this fitness measure to pick 8 genes where overexpression
does not have an effect on fitness. A plot of average fitness per gene
should do it:

ggplot(ddply(ngs, .(Prot.Windows), transform, Fitness.FoldChange = log2(Fitness.Read_Ratio) - 
    log2(mean(Fitness.Read_Ratio, na.rm = T))), aes(x = reorder(Gene, Fitness.FoldChange, 
    median, na.rm = T), y = Fitness.FoldChange)) + geom_boxplot() + theme_bw() + 
    scale_y_continuous(limits = c(-8, 8))

plot of chunk unnamed-chunk-1


gene_fitness_means <- ddply(ddply(ngs, .(Prot.Windows), transform, Fitness.FoldChange = log2(Fitness.Read_Ratio) - 
    log2(mean(Fitness.Read_Ratio, na.rm = T))), "Gene", summarize, Fitness.AvgFoldChange = mean(Fitness.FoldChange, 
    na.rm = T), Fitness.StdevFoldChange = sd(Fitness.FoldChange, na.rm = T))

gene_ranked_by_fitness <- with(gene_fitness_means, Gene[order(Fitness.AvgFoldChange)])

fitness.best_genes <- gene_ranked_by_fitness[128:136]
fitness.mid_genes <- gene_ranked_by_fitness[64:72]
fitness.worst_genes <- gene_ranked_by_fitness[1:8]
fitness.best_stdev <- with(gene_fitness_means, Gene[order(Fitness.StdevFoldChange)])[1:8]

categorize_points <- with(ngs, (Gene %in% fitness.best_genes) + 2 * (Gene %in% 
    fitness.mid_genes) + 3 * (Gene %in% fitness.worst_genes) + 4 * (Gene %in% 
    fitness.best_stdev & !(Gene %in% fitness.mid_genes)))


ggplot(ngs, aes(x = log10(Prot), y = Fitness.Read_Ratio, color = factor(categorize_points))) + 
    geom_point(aes(alpha = factor(categorize_points))) + theme_bw() + scale_colour_manual(name = "Category", 
    values = c("gray", "blue", "yellow", "red", "green"), labels = c("All", 
        "Best Fitness", "Mid Fitness", "Worst Fitness", "Steady Fitness")) + 
    scale_alpha_manual(name = "Category", values = c(0.05, 0.2, 0.2, 0.2, 0.2)) + 
    labs(title = "Read Ratio Fitness versus Protein Abundance", x = "log10 Protein Abundance", 
        y = "Fitness based on Read Ratio")

plot of chunk unnamed-chunk-1

In the first plot, the ones on the low end look bad at every expression level, while the ones on the high end are more fit at every expression level. In the second plot we show the best 8 in blue and the worst 8 in red, fitness versus expression level. Here are the worst 8:

grpE, proS, ispD, dxr, ftsA, infA, hemH, folE

And the worst 8:

cysS, serS, lptD, rplS, ispG, ftsZ, acpP, kdsB, eno

The 8 with the least change in fitness across all variants:

rpsA, thiL, fabD, ftsQ, folA, thrS, pyrG, folK

The green ones seem to have the fewest very unfit constructs, corresponding to
those genes where the SD of fits is lowest.

For the last 8, I will PCR them out of the genome and express them with a
medium strength promoter/RBS and choose the four that have the growth curves
that look the most like the plasmid-free cells.

Picking the genes to do dotblots on

I want to pick a subset of genes to use as a western blot/dotblot control of my codon variants. Here are some properties that I want in my variants:

If I do 4 genes * 6 codon variants * 4 replicates, that's a whole plate (96) of westerns.

We also want to choose constructs that are within the range of our assay, so that we can accurately compare.

We should pick a single promoter/RBS. The strong promoter we will clearly use since we're looking for large changes from overexpression, but we could either use the wild-type RBS or the strongest RBS (BB0030).


ggplot(subset(ngs, Promoter == "BBaJ23100" & RBS == "BB0030"), aes(x = Fitness.Read_Ratio, 
    y = log10(Prot), colour = CDS.type)) + geom_jitter(size = 2, position = position_jitter(height = 0.3, 
    width = 0)) + theme_bw() + opts(axis.text.y = theme_text(size = 8), title = "Protein level vs. Fitness")

plot of chunk unnamed-chunk-2

One criteria should be based on fitness. The constructs that grow the slowest might be difficult to clone and cause problems when we quantitate. It might also be true however, that the ones that grow slowly do so because they make tons of protein, above the limit of our assay.

We should pick constructs that run the full range of protein expression levels across their codon variants, and do not have fitness effects. So as a first criteria, let's see what each gene's min/max fitness looks like.

ggplot(ddply(subset(ngs, Promoter == "BBaJ23100" & RBS == "BB0030" & !is.na(Prot)), 
    "Gene", transform, min_fitness = min(Fitness.Read_Ratio, na.rm = T), max_fitness = max(Fitness.Read_Ratio, 
        na.rm = T)), aes(ymin = min_fitness, ymax = max_fitness, x = reorder(Gene, 
    min_fitness))) + geom_linerange(alpha = 0.2) + coord_flip() + geom_point(aes(y = Fitness.Read_Ratio, 
    colour = CDS.type)) + labs(title = "Construct fitness sorted by Gene Min Fitness")

plot of chunk unnamed-chunk-3


ggplot(ddply(subset(ngs, Promoter == "BBaJ23100" & RBS == "BB0030" & !is.na(Prot)), 
    "Gene", transform, min_fitness = min(Fitness.Read_Ratio, na.rm = T), max_fitness = max(Fitness.Read_Ratio, 
        na.rm = T), min_prot = min(log10(Prot), na.rm = T), max_prot = max(log10(Prot), 
        na.rm = T)), aes(ymin = min_prot, ymax = max_prot, x = reorder(Gene, 
    min_fitness))) + geom_linerange(alpha = 0.2) + coord_flip() + geom_point(aes(y = log10(Prot), 
    colour = CDS.type)) + labs(title = "Construct Protein Level sorted by Gene Min Fitness")

plot of chunk unnamed-chunk-3

One thing worth noting is how many of the minimum fitness variants are due to maximum rare codon usage. If we set our cutoff to about 2.5 fitness SDs, we can keep some interesting ones (i.e. ones with strong effects) and probably be OK. This gives us the following constructs:

pick8.minfit <- subset(
    ddply(subset(ngs, Promoter=='BBaJ23100' & RBS=='BB0030' 
        & !is.na(Prot)), 'Gene', transform, 
        min_fitness= min(Fitness.Read_Ratio, na.rm=T),
        max_fitness= max(Fitness.Read_Ratio, na.rm=T),
        min_prot=min(log10(Prot), na.rm=T),
        max_prot=max(log10(Prot), na.rm=T)), min_fitness > -2.5)

multiplot(
    ggplot(pick8.minfit,
           aes(ymin=min_prot, 
               ymax=max_prot, 
               x=reorder(Gene, min_fitness))) + 
       geom_linerange(alpha=0.2) +
       coord_flip() + 
       geom_point(aes(y=log10(Prot), colour=CDS.type)),

    ggplot(pick8.minfit,
           aes(ymin=min_prot, 
               ymax=max_prot, 
               x=reorder(Gene, min_fitness))) + 
       geom_linerange(alpha=0.2) +
       coord_flip() + 
       geom_point(aes(y=log10(Prot), colour=GC)) +
       scale_colour_gradient2(low="blue", mid="beige", high="red",
            midpoint=mean(ngs$GC), limits=range(ngs$GC)),

    ggplot(pick8.minfit,
           aes(ymin=min_prot, 
               ymax=max_prot, 
               x=reorder(Gene, min_fitness))) + 
       geom_linerange(alpha=0.2) +
       coord_flip() + 
       geom_point(aes(y=log10(Prot), colour=dG)) +
       scale_colour_gradient2(low="red", mid="beige", high="blue",
            midpoint=mean(ngs$dG, na.rm=T), limits=range(ngs$dG, na.rm=T)),

    ggplot(pick8.minfit,
           aes(ymin=min_prot, 
               ymax=max_prot, 
               x=reorder(Gene, min_fitness))) + 
       geom_linerange(alpha=0.2) +
       coord_flip() + 
       geom_point(aes(y=log10(Prot), colour=Rel.Codon.Freq)) +
       scale_colour_gradient2(low="blue", mid="beige", high="red",
            midpoint=mean(ngs$Rel.Codon.Freq), 
            limits=range(ngs$Rel.Codon.Freq)),

    cols=2)

plot of chunk unnamed-chunk-4

More criteria - expression range and low/mid WT

I spoke with Sri and he agrees that we should pick constructs that have a wide range of expression. We should also choose constructs where the WT codon usage is in the mid-to-low range, so that we can show that this works as a heterologous expression strategy.

pick8.minexpr <- subset(
    ddply(subset(ngs, Promoter=='BBaJ23100' & RBS=='BB0030' 
        & !(Gene %in% Gene[which(is.na(Prot))])), 
        'Gene', transform, 
        min_fitness= min(Fitness.Read_Ratio, na.rm=T),
        max_fitness= max(Fitness.Read_Ratio, na.rm=T),
        min_prot=min(log10(Prot), na.rm=T),
        max_prot=max(log10(Prot), na.rm=T),
        wt_prot=log10(Prot)[which(CDS.type=='WT')]), 
        min_fitness > -3 & 
        min_prot < 3.5 &
        max_prot > 5 &
        wt_prot < 4.75)

multiplot(
    ggplot(pick8.minexpr,
           aes(ymin=min_prot, 
               ymax=max_prot, 
               x=reorder(Gene, min_fitness))) + 
    geom_linerange(alpha=0.2) +
    coord_flip() + 
    geom_point(aes(y=log10(Prot), colour= CDS.type, 
        size=(CDS.type == 'WT'))) +
    scale_size_manual(values=c(2,5)) + theme_bw(),

    ggplot(pick8.minexpr,
           aes(ymin=min_prot, 
               ymax=max_prot, 
               x=reorder(Gene, min_fitness))) + 
    geom_linerange(alpha=0.2) +
    coord_flip() + 
    geom_point(aes(y=log10(Prot), colour=GC,
        size=(CDS.type == 'WT'))) +
    scale_size_manual(values=c(2,5)) +
    scale_colour_gradient2(low="blue", mid="beige", high="red",
        midpoint=mean(ngs$GC), limits=range(ngs$GC)) + 
    theme_bw(),

    ggplot(pick8.minexpr,
           aes(ymin=min_prot, 
               ymax=max_prot, 
               x=reorder(Gene, min_fitness))) + 
    geom_linerange(alpha=0.2) +
    coord_flip() + 
    geom_point(aes(y=log10(Prot), colour=dG,
        size=(CDS.type == 'WT'))) +
    scale_size_manual(values=c(2,5)) +
    scale_colour_gradient2(low="red", mid="beige", high="blue",
        midpoint=mean(ngs$dG, na.rm=T), limits=range(ngs$dG, na.rm=T)) +
    theme_bw(),

    ggplot(pick8.minexpr,
           aes(ymin=min_prot, 
               ymax=max_prot, 
               x=reorder(Gene, min_fitness))) + 
       geom_linerange(alpha=0.2) +
       coord_flip() + 
    geom_point(aes(y=log10(Prot), colour=Rel.Codon.Freq,
        size=(CDS.type == 'WT'))) +
    scale_size_manual(values=c(2,5)) +
    scale_colour_gradient2(low="blue", mid="beige", high="red",
        midpoint=mean(ngs$Rel.Codon.Freq), 
        limits=range(ngs$Rel.Codon.Freq)) + theme_bw(),

    cols=2)

plot of chunk unnamed-chunk-5