I want to use this fitness measure to pick 8 genes where overexpression
does not have an effect on fitness. A plot of average fitness per gene
should do it:
ggplot(ddply(ngs, .(Prot.Windows), transform, Fitness.FoldChange = log2(Fitness.Read_Ratio) -
log2(mean(Fitness.Read_Ratio, na.rm = T))), aes(x = reorder(Gene, Fitness.FoldChange,
median, na.rm = T), y = Fitness.FoldChange)) + geom_boxplot() + theme_bw() +
scale_y_continuous(limits = c(-8, 8))
gene_fitness_means <- ddply(ddply(ngs, .(Prot.Windows), transform, Fitness.FoldChange = log2(Fitness.Read_Ratio) -
log2(mean(Fitness.Read_Ratio, na.rm = T))), "Gene", summarize, Fitness.AvgFoldChange = mean(Fitness.FoldChange,
na.rm = T), Fitness.StdevFoldChange = sd(Fitness.FoldChange, na.rm = T))
gene_ranked_by_fitness <- with(gene_fitness_means, Gene[order(Fitness.AvgFoldChange)])
fitness.best_genes <- gene_ranked_by_fitness[128:136]
fitness.mid_genes <- gene_ranked_by_fitness[64:72]
fitness.worst_genes <- gene_ranked_by_fitness[1:8]
fitness.best_stdev <- with(gene_fitness_means, Gene[order(Fitness.StdevFoldChange)])[1:8]
categorize_points <- with(ngs, (Gene %in% fitness.best_genes) + 2 * (Gene %in%
fitness.mid_genes) + 3 * (Gene %in% fitness.worst_genes) + 4 * (Gene %in%
fitness.best_stdev & !(Gene %in% fitness.mid_genes)))
ggplot(ngs, aes(x = log10(Prot), y = Fitness.Read_Ratio, color = factor(categorize_points))) +
geom_point(aes(alpha = factor(categorize_points))) + theme_bw() + scale_colour_manual(name = "Category",
values = c("gray", "blue", "yellow", "red", "green"), labels = c("All",
"Best Fitness", "Mid Fitness", "Worst Fitness", "Steady Fitness")) +
scale_alpha_manual(name = "Category", values = c(0.05, 0.2, 0.2, 0.2, 0.2)) +
labs(title = "Read Ratio Fitness versus Protein Abundance", x = "log10 Protein Abundance",
y = "Fitness based on Read Ratio")
In the first plot, the ones on the low end look bad at every expression level, while the ones on the high end are more fit at every expression level. In the second plot we show the best 8 in blue and the worst 8 in red, fitness versus expression level. Here are the worst 8:
grpE, proS, ispD, dxr, ftsA, infA, hemH, folE
And the worst 8:
cysS, serS, lptD, rplS, ispG, ftsZ, acpP, kdsB, eno
The 8 with the least change in fitness across all variants:
rpsA, thiL, fabD, ftsQ, folA, thrS, pyrG, folK
The green ones seem to have the fewest very unfit constructs, corresponding to
those genes where the SD of fits is lowest.
For the last 8, I will PCR them out of the genome and express them with a
medium strength promoter/RBS and choose the four that have the growth curves
that look the most like the plasmid-free cells.
I want to pick a subset of genes to use as a western blot/dotblot control of my codon variants. Here are some properties that I want in my variants:
If I do 4 genes * 6 codon variants * 4 replicates, that's a whole plate (96) of westerns.
We also want to choose constructs that are within the range of our assay, so that we can accurately compare.
We should pick a single promoter/RBS. The strong promoter we will clearly use since we're looking for large changes from overexpression, but we could either use the wild-type RBS or the strongest RBS (BB0030).
ggplot(subset(ngs, Promoter == "BBaJ23100" & RBS == "BB0030"), aes(x = Fitness.Read_Ratio,
y = log10(Prot), colour = CDS.type)) + geom_jitter(size = 2, position = position_jitter(height = 0.3,
width = 0)) + theme_bw() + opts(axis.text.y = theme_text(size = 8), title = "Protein level vs. Fitness")
One criteria should be based on fitness. The constructs that grow the slowest might be difficult to clone and cause problems when we quantitate. It might also be true however, that the ones that grow slowly do so because they make tons of protein, above the limit of our assay.
We should pick constructs that run the full range of protein expression levels across their codon variants, and do not have fitness effects. So as a first criteria, let's see what each gene's min/max fitness looks like.
ggplot(ddply(subset(ngs, Promoter == "BBaJ23100" & RBS == "BB0030" & !is.na(Prot)),
"Gene", transform, min_fitness = min(Fitness.Read_Ratio, na.rm = T), max_fitness = max(Fitness.Read_Ratio,
na.rm = T)), aes(ymin = min_fitness, ymax = max_fitness, x = reorder(Gene,
min_fitness))) + geom_linerange(alpha = 0.2) + coord_flip() + geom_point(aes(y = Fitness.Read_Ratio,
colour = CDS.type)) + labs(title = "Construct fitness sorted by Gene Min Fitness")
ggplot(ddply(subset(ngs, Promoter == "BBaJ23100" & RBS == "BB0030" & !is.na(Prot)),
"Gene", transform, min_fitness = min(Fitness.Read_Ratio, na.rm = T), max_fitness = max(Fitness.Read_Ratio,
na.rm = T), min_prot = min(log10(Prot), na.rm = T), max_prot = max(log10(Prot),
na.rm = T)), aes(ymin = min_prot, ymax = max_prot, x = reorder(Gene,
min_fitness))) + geom_linerange(alpha = 0.2) + coord_flip() + geom_point(aes(y = log10(Prot),
colour = CDS.type)) + labs(title = "Construct Protein Level sorted by Gene Min Fitness")
One thing worth noting is how many of the minimum fitness variants are due to maximum rare codon usage. If we set our cutoff to about 2.5 fitness SDs, we can keep some interesting ones (i.e. ones with strong effects) and probably be OK. This gives us the following constructs:
pick8.minfit <- subset(
ddply(subset(ngs, Promoter=='BBaJ23100' & RBS=='BB0030'
& !is.na(Prot)), 'Gene', transform,
min_fitness= min(Fitness.Read_Ratio, na.rm=T),
max_fitness= max(Fitness.Read_Ratio, na.rm=T),
min_prot=min(log10(Prot), na.rm=T),
max_prot=max(log10(Prot), na.rm=T)), min_fitness > -2.5)
multiplot(
ggplot(pick8.minfit,
aes(ymin=min_prot,
ymax=max_prot,
x=reorder(Gene, min_fitness))) +
geom_linerange(alpha=0.2) +
coord_flip() +
geom_point(aes(y=log10(Prot), colour=CDS.type)),
ggplot(pick8.minfit,
aes(ymin=min_prot,
ymax=max_prot,
x=reorder(Gene, min_fitness))) +
geom_linerange(alpha=0.2) +
coord_flip() +
geom_point(aes(y=log10(Prot), colour=GC)) +
scale_colour_gradient2(low="blue", mid="beige", high="red",
midpoint=mean(ngs$GC), limits=range(ngs$GC)),
ggplot(pick8.minfit,
aes(ymin=min_prot,
ymax=max_prot,
x=reorder(Gene, min_fitness))) +
geom_linerange(alpha=0.2) +
coord_flip() +
geom_point(aes(y=log10(Prot), colour=dG)) +
scale_colour_gradient2(low="red", mid="beige", high="blue",
midpoint=mean(ngs$dG, na.rm=T), limits=range(ngs$dG, na.rm=T)),
ggplot(pick8.minfit,
aes(ymin=min_prot,
ymax=max_prot,
x=reorder(Gene, min_fitness))) +
geom_linerange(alpha=0.2) +
coord_flip() +
geom_point(aes(y=log10(Prot), colour=Rel.Codon.Freq)) +
scale_colour_gradient2(low="blue", mid="beige", high="red",
midpoint=mean(ngs$Rel.Codon.Freq),
limits=range(ngs$Rel.Codon.Freq)),
cols=2)
I spoke with Sri and he agrees that we should pick constructs that have a wide range of expression. We should also choose constructs where the WT codon usage is in the mid-to-low range, so that we can show that this works as a heterologous expression strategy.
pick8.minexpr <- subset(
ddply(subset(ngs, Promoter=='BBaJ23100' & RBS=='BB0030'
& !(Gene %in% Gene[which(is.na(Prot))])),
'Gene', transform,
min_fitness= min(Fitness.Read_Ratio, na.rm=T),
max_fitness= max(Fitness.Read_Ratio, na.rm=T),
min_prot=min(log10(Prot), na.rm=T),
max_prot=max(log10(Prot), na.rm=T),
wt_prot=log10(Prot)[which(CDS.type=='WT')]),
min_fitness > -3 &
min_prot < 3.5 &
max_prot > 5 &
wt_prot < 4.75)
multiplot(
ggplot(pick8.minexpr,
aes(ymin=min_prot,
ymax=max_prot,
x=reorder(Gene, min_fitness))) +
geom_linerange(alpha=0.2) +
coord_flip() +
geom_point(aes(y=log10(Prot), colour= CDS.type,
size=(CDS.type == 'WT'))) +
scale_size_manual(values=c(2,5)) + theme_bw(),
ggplot(pick8.minexpr,
aes(ymin=min_prot,
ymax=max_prot,
x=reorder(Gene, min_fitness))) +
geom_linerange(alpha=0.2) +
coord_flip() +
geom_point(aes(y=log10(Prot), colour=GC,
size=(CDS.type == 'WT'))) +
scale_size_manual(values=c(2,5)) +
scale_colour_gradient2(low="blue", mid="beige", high="red",
midpoint=mean(ngs$GC), limits=range(ngs$GC)) +
theme_bw(),
ggplot(pick8.minexpr,
aes(ymin=min_prot,
ymax=max_prot,
x=reorder(Gene, min_fitness))) +
geom_linerange(alpha=0.2) +
coord_flip() +
geom_point(aes(y=log10(Prot), colour=dG,
size=(CDS.type == 'WT'))) +
scale_size_manual(values=c(2,5)) +
scale_colour_gradient2(low="red", mid="beige", high="blue",
midpoint=mean(ngs$dG, na.rm=T), limits=range(ngs$dG, na.rm=T)) +
theme_bw(),
ggplot(pick8.minexpr,
aes(ymin=min_prot,
ymax=max_prot,
x=reorder(Gene, min_fitness))) +
geom_linerange(alpha=0.2) +
coord_flip() +
geom_point(aes(y=log10(Prot), colour=Rel.Codon.Freq,
size=(CDS.type == 'WT'))) +
scale_size_manual(values=c(2,5)) +
scale_colour_gradient2(low="blue", mid="beige", high="red",
midpoint=mean(ngs$Rel.Codon.Freq),
limits=range(ngs$Rel.Codon.Freq)) + theme_bw(),
cols=2)