R Notebook: Provides reproducible analysis for Broad Mutational Scanning in the following manuscript:
Citation: Romanowicz KJ, Resnick C, Hinton SR, Plesa C. Exploring antibiotic resistance in diverse homologs of the dihydrofolate reductase protein family through broad mutational scanning. bioRxiv, 2025.
GitHub Repository: https://github.com/PlesaLab/DHFR
NCBI BioProject: https://www.ncbi.nlm.nih.gov/bioproject/1189478
This pipeline processes a library of 1,536 DHFR homologs and their associated mutants, with two-fold redundancy (two codon variants per sequence). Fitness scores are derived from a multiplexed in-vivo assay using a trimethoprim concentration gradient, assessing the ability of these homologs and their mutants to complement functionality in an E. coli knockout strain and their tolerance to trimethoprim treatment. This analysis provides insights into how antibiotic resistance evolves across a range of evolutionary starting points. Sequence data were generated using the Illumina NovaSeq platform with 100 bp paired-end sequencing of amplicons.
The following R packages must be installed prior to loading into the R session. See the Reproducibility tab for a complete list of packages and their versions used in this workflow.
# Load the latest version of python (3.10.14) for downstream use:
library(reticulate)
use_python("/Users/krom/miniforge3/bin/python3")
# Make a vector of required packages
required.packages <- c("ape", "bio3d", "Biostrings", "castor", "cowplot", "devtools", "dplyr", "ggExtra", "ggnewscale", "ggplot2", "ggridges", "ggtree", "ggtreeExtra", "glmnet", "gridExtra","igraph", "knitr", "matrixStats", "patchwork", "pheatmap", "purrr", "pscl", "RColorBrewer", "reshape","reshape2", "ROCR", "seqinr", "scales", "stringr", "stringi", "tidyr", "tidytree", "viridis")
# Load required packages with error handling
loaded.packages <- lapply(required.packages, function(package) {
if (!require(package, character.only = TRUE)) {
install.packages(package, dependencies = TRUE)
if (!require(package, character.only = TRUE)) {
message("Package ", package, " could not be installed and loaded.")
return(NULL)
}
}
return(package)
})
# Remove NULL entries from loaded packages
loaded.packages <- loaded.packages[!sapply(loaded.packages, is.null)]
Loaded packages: ape, bio3d, Biostrings, castor, cowplot, devtools, dplyr, ggExtra, ggnewscale, ggplot2, ggridges, ggtree, ggtreeExtra, glmnet, gridExtra, igraph, knitr, matrixStats, patchwork, pheatmap, purrr, pscl, RColorBrewer, reshape, reshape2, ROCR, seqinr, scales, stringr, stringi, tidyr, tidytree, viridis
Import MUTANTS files generated from DHFR.2.Counts.RMD relevant for downstream analysis.
# mut_collapse_15_good_filtered
mut_collapse_15_good_filtered <- read.csv("Mutants/mutants_files_formatted/mut_collapse_15_good_filtered.csv",
header = TRUE, stringsAsFactors = FALSE)
This section is based on the R file: “R_BMS_all.Lib15.R”. It describes how to generate MSA mapping files for each homolog and map their fitness values by sequential amino acid positions aligned to E. coli as the reference sequence. The end result is a heatmap displaying median fitness values for each amino acid at each position along the protein sequence.
Broad Mutational Scanning (BMS) Analysis Script: This set of scripts facilitates broad mutational scanning analysis where data from many related protein homologs and their mutants is combined and collapsed for further analysis and visualization. This script is based on the approach used in: Plesa C, Sidore AM, Lubock N, Zhang D, Kosuri S. Multiplexed Gene Synthesis in Emulsions for Exploring Protein Functional Landscapes. Science 359, 343–347 (2018), DOI: 10.1126/science.aao5167.
Procedure The analysis can be carried out by running the following scripts, which include python and bash:
clustalo Alignment python script based on fasta files
generated for each librarymap.aligned.residues.py Residue mapping from aligned
fasta file generated for each libraryparse_dssp.py Generate RSA and SS information files
from the WT E. coli DHFR dssp filescore_conservation.py Generate Conservation score
information file from the alignment fileAlignment: Use the clustalo executable
to align the protein sequences associated with the filtered perfects
(designed homologs) with the following criteria: numprunedBCs >= 5,
fitD05D03 > -1, ID mutations = 0. This will align the following FASTA
file: mutant.collapse.good.5AA.Lib15.fasta for use in
BMS analysis.
SKIP THIS ALIGNMENT IF ALREADY RUN.
# May need to enable permissions to run the executable:
#chmod +x ./clustalo
./Scripts/clustalo -i Mutants/mutants_files_formatted/Lib15.mutant.collapse.good.5AA.fasta -o BMS/OUTPUT/Complementation/Lib15.mutant.collapse.good.5AA.tree.aligned.fasta.aln --outfmt=clustal --force
Mapping Residues: Use the following
map.aligned.residues.py python script to generate csv files
for each designed homolog and associated mutant that maps residue
positions of each A.A. from alignment fasta:
map_aligned_residues.py This will parse the alignments
and generate tables of which homolog’s residue corresponds to which
position in the alignment table so that co-aligned residues can be
determined.reticulate::repl_python()
# Verify current python version
import sys
print(f"Python version: {sys.version}")
Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:57:05) [Clang 16.0.6 ]
import time
import csv
##################################
#INPUTS:
base_path = ""
trees_path_prefix = base_path+""
#clustal format alignment file
align_file_in = [trees_path_prefix+"BMS/OUTPUT/Lib15.mutant.collapse.good.5AA.tree.aligned.fasta.aln"] #New aligned FASTA
#number of seqs in each alignment file
num_samples_in_file = [418] #New FASTA (+1 from actual file count)
##################################
#OUTPUTS:
msa_map_out_path = [trees_path_prefix+"BMS/MSA_Comp/"]
This chunk can be SKIPPED if already ran once. Generates .csv files for each homolog in the “mut_collapse_15_good_filtered” collection:
Define the directory containing the parse MSA mapping files (.csv) for each perfect homolog (Lib15):
quit
BMS_MSA_directory = "BMS/MSA_Comp"
# AA map + X
BMS_aa_list<-data.frame(aa=c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T','X'),
aanum=c(1:21))
BMS_aa_dim <- length(BMS_aa_list$aa)
# Wildtype E. coli is:
#NP_414590
# AA length of homolog which collapsing on
BMS_ref_len <- 159
# Minimum number of BCs to use a mutant
BMS_min_BCs = 1
# Largest number of mutations to use
#currently limited to 5 due to mutation naming scheme
BMS_max_mutations = 5
# Minimum fitness to include the mutation
BMS_min_fitness <- -1.2 # This lower range includes NP_414590 (WT E. coli DHFR)
# Grab perfects (mut_collapse_15_good_filtered)
BMS_homolog_15_ID_list <- mut_collapse_15_good_filtered %>%
filter(mutations == 0 & numprunedBCs >= BMS_min_BCs) %>%
filter(fitD05D03 > BMS_min_fitness & !is.na(fitD05D03)) %>%
dplyr::select(ID)
write.table(BMS_homolog_15_ID_list, file = paste("BMS/OUTPUT/Complementation/BMS_homolog_15_ID_list_min_fitness_",as.character(BMS_min_fitness),".csv",sep=""), sep = ",", row.names = F, quote=F, col.names = F)
# Count the number of perfects retained after filtering by BMS_min_fitness (fitness > -1.2, which includes NP_414590)
length(BMS_homolog_15_ID_list$ID)
[1] 417
#make sure all of the perfects are in the MSA
BMS_index_of_mut_to_drop = numeric()
missing_files = character()
for (i in 1:length(BMS_homolog_15_ID_list$ID)){
mutant_current_temp <- BMS_homolog_15_ID_list$ID[i]
file_path <- paste(BMS_MSA_directory,"/",mutant_current_temp,".csv",sep="")
if(!file.exists(file_path)){
print(paste("Missing file:", mutant_current_temp))
BMS_index_of_mut_to_drop[length(BMS_index_of_mut_to_drop)+1] <-
which(BMS_homolog_15_ID_list$ID==mutant_current_temp)
missing_files <- c(missing_files, mutant_current_temp)
}
}
if (length(BMS_index_of_mut_to_drop) > 0){
BMS_homolog_15_ID_list <- BMS_homolog_15_ID_list[-BMS_index_of_mut_to_drop,]
print(paste("Total missing files:", length(missing_files)))
print("Missing files:")
print(missing_files)
} else {
print("All files are present in the directory.")
}
[1] "All files are present in the directory."
# Print the number of remaining files
print(paste("Number of files present:", nrow(BMS_homolog_15_ID_list)))
[1] "Number of files present: 417"
Convert BMS_homolog_15_ID_list to a data frame with a column named “ID”
# Convert BMS_homolog_15_ID_list to a data frame with a column named "ID"
BMS_homolog_15_ID_list_df <- data.frame(ID = BMS_homolog_15_ID_list)
Make a copy of the mut_collapse_15_good_filtered object
filtered by the BMS_homolog_15_ID_list_df (min. fit =
-1.2):
BMS_mutants15_temp <- mut_collapse_15_good_filtered %>%
filter(numprunedBCs >= BMS_min_BCs & !is.na(fitD05D03) & mutations == 0) %>%
semi_join(BMS_homolog_15_ID_list_df,by="ID") %>% #filtering join
ungroup() %>%
dplyr::rename(IDalign=ID)
Make a new data frame which will keep the mapping info:
BMS_fitness15_map <- data.frame(position=numeric(),
aa=character(),
mutations=numeric(),
fitness=numeric(),
posortho=numeric(),
ingap=character(),
mutID=character(),
ID=character())
Load the E coli MSA mapping file:
#load ecoli MSA mapping
BMS_ecoli_map <- read.csv(file=paste(BMS_MSA_directory,"/NP_414590.csv",sep=""),
head=TRUE,
sep=",")
#E.coli DHFR seq: # Do we need to update this to include the starting "M"?
BMS_ecoli_seq <- "ISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLNKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDADAQNSHSYCFEILERR"
Loop over all perfects in MSA directory with fitness scores for each a.a. position and determine if it maps to an E. coli residue. SKIP THIS CHUNK IF LOOP HAS ALREADY BEEN RUN.
for (i in 1:nrow(BMS_mutants15_temp)) {
#current homolog:
mutant_current_temp <- BMS_mutants15_temp$IDalign[i]
#get the MSA mapping:
mutant_map_temp <- read.csv(file=paste(BMS_MSA_directory,"/",mutant_current_temp,".csv",sep=""), head=TRUE, sep=",")
#load and define the full seq for each homolog:
BMS_seq_temp <- as.character(BMS_mutants15_temp$seq[i])
#get fitness for this homolog
BMS_fit_temp <- BMS_mutants15_temp$fitD05D03[i]
#loop over all residues
for (j in 1:nchar(BMS_seq_temp)) {
#find the corresponding residue in E.coli using MSA
BMS_cons_aanum <- mutant_map_temp$msa_aanum[which(mutant_map_temp$orth_aanum == j)]
#does this map to a non-gap residue in E.coli?
if (BMS_cons_aanum %in% BMS_ecoli_map$msa_aanum) {
#get the E.coli residue
e_coli_residue <- BMS_ecoli_map$orth_aanum[which(BMS_ecoli_map$msa_aanum == BMS_cons_aanum)]
#aa at this residue
BMS_aa_temp <- substr(BMS_seq_temp, j, j)
#add info for this residue to df
BMS_fitness15_map <- rbind(BMS_fitness15_map,
data.frame(position=e_coli_residue,
aa=BMS_aa_temp,
mutations=0,
fitness=BMS_fit_temp,
posortho=j,
ingap="No",
mutID=mutant_current_temp,
ID=mutant_current_temp))
} else {
#if it's here it maps to a gap
BMS_aa_temp <- substr(BMS_seq_temp, j, j)
BMS_fitness15_map <- rbind(BMS_fitness15_map,
data.frame(position=-1,
aa=BMS_aa_temp,
mutations=0,
fitness=BMS_fit_temp,
posortho=j,
ingap="Yes",
mutID=mutant_current_temp,
ID=mutant_current_temp))
}
}
}
#Save the `BMS_fitness15_map` object to re-load in future sessions (saves substantial computational time):
write.csv(BMS_fitness15_map, "BMS/OUTPUT/Complementation/BMS_fitness15_complementation_map.csv", row.names = FALSE, quote = FALSE)
Load the BMS_fitness15_map dataset to continue the BMS
analysis (if the loop step above was skipped).
SKIP THIS CHUNK IF LOOP ABOVE WAS RUN.
# Update which fitness map to read in Complementation (see loop above)
BMS_fitness15_map <- read.csv("BMS/OUTPUT/Complementation/BMS_fitness15_complementation_map.csv",
header = TRUE,
stringsAsFactors = FALSE,
check.names = FALSE)
Confirm that correct number of unique mutIDs were integrated into the BMS_fitness15_map dataframe (should match count above):
# Count unique mutID in BMS_fitness15_map
length(unique(BMS_fitness15_map$mutID))
[1] 417
Collapse (median) fitness values for each position onto the BMS Fitness Map:
BMS_fitness15_collapsed <- BMS_fitness15_map %>%
filter(position > 0) %>%
group_by(position, aa) %>%
summarise(fitval=median(fitness),
numpoints=n(),
stdfit=sd(fitness))
`summarise()` has grouped output by 'position'. You can override using the `.groups` argument.
These matrices have the fitness for each aa at each position:
BMS_matrix15_perfects = matrix(rep(NA, BMS_ref_len*BMS_aa_dim),nrow=BMS_aa_dim,ncol=BMS_ref_len)
BMS_matrix15_perfects_num = matrix(rep(NA, BMS_ref_len*BMS_aa_dim),nrow=BMS_aa_dim,ncol=BMS_ref_len)
BMS_matrix15_perfects_sd = matrix(rep(NA, BMS_ref_len*BMS_aa_dim),nrow=BMS_aa_dim,ncol=BMS_ref_len)
Populate matrix:
for (i in 1:nrow(BMS_fitness15_collapsed)){
BMS_matrix15_perfects[which(BMS_aa_list$aa==as.character(BMS_fitness15_collapsed$aa[i])),
BMS_fitness15_collapsed$position[i]] <- as.numeric(BMS_fitness15_collapsed$fitval[i])
BMS_matrix15_perfects_num[which(BMS_aa_list$aa==as.character(BMS_fitness15_collapsed$aa[i])),
BMS_fitness15_collapsed$position[i]] <- as.numeric(BMS_fitness15_collapsed$numpoints[i])
BMS_matrix15_perfects_sd[which(BMS_aa_list$aa==as.character(BMS_fitness15_collapsed$aa[i])),
BMS_fitness15_collapsed$position[i]] <- as.numeric(BMS_fitness15_collapsed$stdfit[i])
}
Assign AA names and position numbers to matrices:
rownames(BMS_matrix15_perfects)<-BMS_aa_list$aa
colnames(BMS_matrix15_perfects)<-c(1:BMS_ref_len)
rownames(BMS_matrix15_perfects_num)<-BMS_aa_list$aa
colnames(BMS_matrix15_perfects_num)<-c(1:BMS_ref_len)
rownames(BMS_matrix15_perfects_sd)<-BMS_aa_list$aa
colnames(BMS_matrix15_perfects_sd)<-c(1:BMS_ref_len)
BMS_matrix15_perfects_melt <- melt(BMS_matrix15_perfects)
BMS_matrix15_perfects_num_melt <- melt(BMS_matrix15_perfects_num)
BMS_matrix15_perfects_sd_melt <- melt(BMS_matrix15_perfects_sd)
Calculate the minimum and maximum values of
BMS_matrix15_perfects_melt prior to plotting:
# Min
min(BMS_matrix15_perfects_melt$value, na.rm=T)
[1] -0.9986179
# Max
max(BMS_matrix15_perfects_melt$value, na.rm=T)
[1] 0.7070916
Plot the fitness data for the perfects:
# Rename columns to "X1" and "X2"
names(BMS_matrix15_perfects_melt)[names(BMS_matrix15_perfects_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
BMS_matrix15_perfects_melt_plot <- ggplot(data = BMS_matrix15_perfects_melt, aes(x=X2, y=X1, fill=value)) +
geom_tile()+ labs(x = "Position (aa)", y ="Amino acid",color="") +
scale_fill_gradient2(low = "blue", high = "red", mid="gold",name="Fitness",na.value="grey", limit = c(-1,1)) +
theme_minimal() +
scale_x_continuous(breaks=seq(0,155,5))
print(BMS_matrix15_perfects_melt_plot)
Plot the coverage data:
# Rename columns to "X1" and "X2"
names(BMS_matrix15_perfects_num_melt)[names(BMS_matrix15_perfects_num_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
BMS_matrix15_perfects_num_melt_plot <- ggplot(data = BMS_matrix15_perfects_num_melt, aes(x=X2, y=X1, fill=log(value))) +
geom_tile()+ labs(x = "Position (aa)", y ="Amino acid",color="") +
scale_fill_gradient(low = "blue", high = "red",name="log(#points)",na.value="grey", limit = c(0,max(BMS_matrix15_perfects_num_melt$value))) +
theme_minimal() +
scale_x_continuous(breaks=seq(0,155,5))
print(BMS_matrix15_perfects_num_melt_plot)
Plot the STD Data:
# Rename columns to "X1" and "X2"
names(BMS_matrix15_perfects_sd_melt)[names(BMS_matrix15_perfects_sd_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
BMS_matrix15_perfects_sd_melt_plot <- ggplot(data = BMS_matrix15_perfects_sd_melt, aes(x=X2, y=X1, fill=value)) +
geom_tile()+ labs(x = "Position (aa)", y ="Amino acid",color="") +
scale_fill_gradient2(low = "blue", high = "red", mid="gold",name="std(Fitness)",na.value="grey", limit = c(0,1.1*max(BMS_matrix15_perfects_sd_melt$value))) +
theme_minimal() +
scale_x_continuous(breaks=seq(0,155,5))
print(BMS_matrix15_perfects_sd_melt_plot)
Count the total number of mutants in the
mut_collapse_15_good_filtered dataset:
# Count mutants (up to 5 AA) associated with perfects after filtering by minimum fitness (-1)
mut_collapse_15_good_filtered %>%
filter(mutations > 0 & mutations < 6 & numprunedBCs >= BMS_min_BCs & !is.na(fitD05D03)) %>%
semi_join(BMS_homolog_15_ID_list,by="ID") %>%
ungroup() %>%
dplyr::rename(IDalign=ID) %>%
nrow(.)
[1] 5419
Make a copy of the BMS_fitness15_map object for adding the mutants:
# Make a copy of the dataframe for mutants
BMS_fitness15_map_all <- BMS_fitness15_map
Loop over mutants at some distance (up to 5 aa distance):
for (BMS_cur_mut_num in 1:BMS_max_mutations){
#grab the mutants (remove any mutants with "NA" fitness value for fitD05D03)
BMS_mutants15_temp <- mut_collapse_15_good_filtered %>%
filter(mutations > 0 & mutations < 6 & numprunedBCs >= BMS_min_BCs & !is.na(fitD05D03)) %>%
filter(mutations == BMS_cur_mut_num) %>%
semi_join(BMS_homolog_15_ID_list,by="ID") %>% #filtering join
ungroup() %>%
dplyr::rename(IDalign=ID)
#initialize new df
BMS_fitness15_map1 <- data.frame(position=numeric(),
aa=character(),
mutations=numeric(),
fitness=numeric(),
posortho=numeric(),
ingap=character(),
mutID=character(),
ID=character())
#loop over mutants
for (i in 1:nrow(BMS_mutants15_temp)){
#mutant base name
mutant_current_temp <- BMS_mutants15_temp$IDalign[i]
#length of the name
name_size = nchar(paste(mutant_current_temp,"_",sep=""))
#residue mappings
mutant_map_temp <- read.csv(file=paste(BMS_MSA_directory,"/",mutant_current_temp,".csv",sep=""),head=TRUE,sep=",")
#this mutants fitness
BMS_fit_temp <- BMS_mutants15_temp$fitD05D03[i]
#grab the mut name
mutations15_names <- as.character(BMS_mutants15_temp$mutID[i])
#grab only the relevant portion of the name
mutations15_names <- substr(mutations15_names, name_size+1, nchar(mutations15_names))
## split mutation string at non-digits
s <- strsplit(mutations15_names, "_")
mut_total_length<- nchar(BMS_mutants15_temp$seq[i])
for (mutnum in 1:BMS_cur_mut_num){
#grab the corresponding mutation string
mutcurr<-s[[1]][mutnum]
#get the position
mutpos <- as.numeric(str_extract(mutcurr, "[0-9]+"))
if (mutpos<=mut_total_length){
#get ending aa
to_aa <- substr(mutcurr, nchar(mutpos)+2, nchar(mutcurr))
#find the number in the consensus seq
BMS_cons_aanum <- mutant_map_temp$msa_aanum[which(mutant_map_temp$orth_aanum == mutpos)]
#does this map to a non-gap
if (BMS_cons_aanum %in% BMS_ecoli_map$msa_aanum){
#the corresponding e.coli residue
e_coli_residue <- BMS_ecoli_map$orth_aanum[which(BMS_ecoli_map$msa_aanum == BMS_cons_aanum)]
#add this point to the data
BMS_fitness15_map1 <- rbind(BMS_fitness15_map1,
data.frame(position=e_coli_residue,
aa=to_aa,
mutations=BMS_cur_mut_num,
fitness=BMS_fit_temp,
posortho=mutpos,
ingap="No",
mutID=as.character(BMS_mutants15_temp$mutID[i]),
ID=mutant_current_temp))
} else {
#if it's here it maps to a gap
#add this point to the data
BMS_fitness15_map1 <- rbind(BMS_fitness15_map1,
data.frame(position=-1,
aa=to_aa,
mutations=BMS_cur_mut_num,
fitness=BMS_fit_temp,
posortho=mutpos,
ingap="Yes",
mutID=as.character(BMS_mutants15_temp$mutID[i]),
ID=mutant_current_temp))
}
}
}
}
#Add these mutants onto the existing data:
BMS_fitness15_map_all <- rbind(BMS_fitness15_map_all,BMS_fitness15_map1)
write.table(BMS_fitness15_map1, file = paste("BMS/OUTPUT/Complementation/BMS_fitness15_map_",
as.character(BMS_cur_mut_num),".csv",sep=""),
sep = ",", row.names = F,quote=F,col.names = T)
#Calculate all data and stats
BMS_fitness15_collapsed_all <- BMS_fitness15_map_all %>%
filter(position > 0) %>%
group_by(position, aa) %>%
summarise(fitval=median(fitness),
numpoints=n(),
stdfit=sd(fitness))
#these matrices have the fitness/num/sd for each aa at each position:
BMS_matrix15_perfects_and_1 = matrix(rep(NA, BMS_ref_len*BMS_aa_dim),nrow=BMS_aa_dim,ncol=BMS_ref_len)
BMS_matrix15_perfects_and_1_num = matrix(rep(NA, BMS_ref_len*BMS_aa_dim),nrow=BMS_aa_dim,ncol=BMS_ref_len)
BMS_matrix15_perfects_and_1_sd = matrix(rep(NA, BMS_ref_len*BMS_aa_dim),nrow=BMS_aa_dim,ncol=BMS_ref_len)
#populate matrix
for (i in 1:nrow(BMS_fitness15_collapsed_all)){
BMS_matrix15_perfects_and_1[which(BMS_aa_list$aa==as.character(BMS_fitness15_collapsed_all$aa[i])),BMS_fitness15_collapsed_all$position[i]] <- as.numeric(BMS_fitness15_collapsed_all$fitval[i])
BMS_matrix15_perfects_and_1_num[which(BMS_aa_list$aa==as.character(BMS_fitness15_collapsed_all$aa[i])),BMS_fitness15_collapsed_all$position[i]] <- as.numeric(BMS_fitness15_collapsed_all$numpoints[i])
BMS_matrix15_perfects_and_1_sd[which(BMS_aa_list$aa==as.character(BMS_fitness15_collapsed_all$aa[i])),BMS_fitness15_collapsed_all$position[i]] <- as.numeric(BMS_fitness15_collapsed_all$stdfit[i])
}
rownames(BMS_matrix15_perfects_and_1)<-BMS_aa_list$aa
colnames(BMS_matrix15_perfects_and_1)<-c(1:BMS_ref_len)
rownames(BMS_matrix15_perfects_and_1_num)<-BMS_aa_list$aa
colnames(BMS_matrix15_perfects_and_1_num)<-c(1:BMS_ref_len)
rownames(BMS_matrix15_perfects_and_1_sd)<-BMS_aa_list$aa
colnames(BMS_matrix15_perfects_and_1_sd)<-c(1:BMS_ref_len)
BMS_matrix15_perfects_and_1_melt <- melt(BMS_matrix15_perfects_and_1)
BMS_matrix15_perfects_and_1_num_melt <- melt(BMS_matrix15_perfects_and_1_num)
BMS_matrix15_perfects_and_1_sd_melt <- melt(BMS_matrix15_perfects_and_1_sd)
# Rename columns to "X1" and "X2"
names(BMS_matrix15_perfects_and_1_melt)[names(BMS_matrix15_perfects_and_1_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(BMS_matrix15_perfects_and_1_num_melt)[names(BMS_matrix15_perfects_and_1_num_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(BMS_matrix15_perfects_and_1_sd_melt)[names(BMS_matrix15_perfects_and_1_sd_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
#Calculate minimum and maximum fitness values for each mutation level:
min(BMS_matrix15_perfects_and_1_melt$value, na.rm=T)
max(BMS_matrix15_perfects_and_1_melt$value, na.rm=T)
#plot the data from these mutants:
ggplot(data = BMS_matrix15_perfects_and_1_melt, aes(x=X2, y=X1, fill=value)) +
geom_tile()+ labs(x = "Position (aa)", y ="Amino acid",color="") +
scale_fill_gradient2(low = "blue", high = "red", mid="gold",name="Fitness",na.value="grey", limit = c(min(BMS_matrix15_perfects_and_1_sd_melt$value),max(BMS_matrix15_perfects_and_1_sd_melt$value))) +
theme_minimal() +
scale_x_continuous(breaks=seq(0,155,5))
ggsave(file=paste("BMS/PLOTS/Complementation/BMS_allpos15_min",toString(BMS_min_BCs),"_max",toString(BMS_cur_mut_num),"_fit.pdf",sep=""))
write.table(BMS_matrix15_perfects_and_1_melt,
file = paste("BMS/OUTPUT/Complementation/BMS_matrix_perfects_and_",
as.character(BMS_cur_mut_num),"_fit_melt.csv",sep=""),
sep = ",", row.names = F,quote=F,col.names = T)
ggplot(data = BMS_matrix15_perfects_and_1_num_melt, aes(x=X2, y=X1, fill=log(value))) +
geom_tile()+ labs(x = "Position (aa)", y ="Amino acid",color="") +
scale_fill_gradient(low = "blue", high = "red",name="log(#points)",na.value="grey",
limit = c(0,max(BMS_matrix15_perfects_and_1_num_melt$value))) +
theme_minimal() +
scale_x_continuous(breaks=seq(0,155,5))
ggsave(file=paste("BMS/PLOTS/Complementation/BMS_allpos15_min",toString(BMS_min_BCs),"_max",toString(BMS_cur_mut_num),"_num.pdf",sep=""))
write.table(BMS_matrix15_perfects_and_1_num_melt,
file = paste("BMS/OUTPUT/Complementation/BMS_matrix15_perfects_and_",
as.character(BMS_cur_mut_num),"_num_melt.csv",sep=""),
sep = ",", row.names = F,quote=F,col.names = T)
ggplot(data = BMS_matrix15_perfects_and_1_sd_melt, aes(x=X2, y=X1, fill=value)) +
geom_tile()+ labs(x = "Position (aa)", y ="Amino acid",color="") +
scale_fill_gradient2(low = "blue", high = "red", mid="gold",name="std(Fitness)",na.value="grey", limit = c(0,1.1*max(BMS_matrix15_perfects_and_1_sd_melt$value))) +
theme_minimal() +
scale_x_continuous(breaks=seq(0,155,5))
ggsave(file=paste("BMS/PLOTS/Complementation/BMS_allpos15_min",toString(BMS_min_BCs),"_max",toString(BMS_cur_mut_num),"_sd.pdf",sep=""))
write.table(BMS_matrix15_perfects_and_1_sd_melt,
file = paste("BMS/OUTPUT/Complementation/BMS_matrix15_perfects_and_",
as.character(BMS_cur_mut_num),"_sd_melt.csv",sep=""),
sep = ",", row.names = F,quote=F,col.names = T)
}
`summarise()` has grouped output by 'position'. You can override using the `.groups` argument.Saving 7 x 7 in image
Import the BMS_matrix15_perfects_and_5_melt dataset
based on 5 a.a. mutations:
# Import the BMS_matrix15 dataset based on 1 a.a. mutation:
BMS_matrix15_perfects_and_1_melt <- read.csv("BMS/OUTPUT/Complementation/BMS_matrix_perfects_and_5_fit_melt.csv",
header = TRUE)
Begin by adding a new column for the WT residues:
BMS_matrix15_perfects_and_1_melt$WTcolor <- NA
#X1 is aa, X2 is position
BMS_matrix15_perfects_and_1_melt$aanum <- 0
for (i in 1:nrow(BMS_matrix15_perfects_and_1_melt)){
#find the corresponding residue in E.coli using MSA
BMS_cons_aanum <- BMS_matrix15_perfects_and_1_melt$X2[i]
#is this the WT E.coli residue?
if (BMS_matrix15_perfects_and_1_melt$X1[i] == substr(BMS_ecoli_seq, BMS_cons_aanum, BMS_cons_aanum)){
#assign WT color
BMS_matrix15_perfects_and_1_melt$WTcolor[i] <- "red"
}
BMS_matrix15_perfects_and_1_melt$aanum[i] <- which(BMS_aa_list$aa == BMS_matrix15_perfects_and_1_melt$X1[i])
}
Read in the RSA, SS, and Conservation files:
RSA_1H1T <- read.table("BMS/DSSP/4KJK.RSAs.txt", skip = 2, header = FALSE)
SS_1H1T <- read.table("BMS/DSSP/4KJK.SSs.txt", skip = 2, header = FALSE)
cons_1H1T <- read.table("BMS/DSSP/Lib15.JSD.out.txt", skip = 5, header = FALSE)
In the conservation score file, replace values of -1000 with NA to ignore downstream:
# Replace values of -1000 with NA:
cons_1H1T$V2[cons_1H1T$V2 == -1000] <- NA
Add in the RSA, SS, and Conservation scores for plotting:
colnames(cons_1H1T) <- c("pos","cons","col")
cons_1H1T$pos <- cons_1H1T$pos+1
protein_info_1H1T <- RSA_1H1T %>%
right_join(SS_1H1T, by="V1")
colnames(protein_info_1H1T) <- c("pos","RSA","SS")
protein_info_1H1T$cons <- 0
for (i in 1:nrow(cons_1H1T)){
#get MSA position: cons_1H1T$pos[i]
#find if this position exists in the e.coli MSA
if (cons_1H1T$pos[i] %in% BMS_ecoli_map$msa_aanum){
#get corresponding residue
e_coli_residue <- BMS_ecoli_map$orth_aanum[which(BMS_ecoli_map$msa_aanum==cons_1H1T$pos[i])]
protein_info_1H1T$cons[which(protein_info_1H1T$pos==e_coli_residue)] <- cons_1H1T$cons[i]
}
}
Merge protein_info_1H1T with the
BMS_matrix15_perfects_and_1_melt fitness scores for each
a.a. at each position:
protein_info_1H1T <- BMS_matrix15_perfects_and_1_melt %>%
filter(X1 != "X") %>%
group_by(X2) %>%
dplyr::rename(pos=X2) %>%
summarise(avgfit=mean(value,na.rm=TRUE),
numcov=20-sum(is.na(value))) %>%
right_join(protein_info_1H1T, by="pos")
protein_info_1H1T_melt <- melt(as.data.frame(protein_info_1H1T),
id=c("pos"))
protein_info_1H1T_melt$yval <- 0
protein_info_1H1T_melt$yval[which(protein_info_1H1T_melt$variable=="avgfit")] <- 23
protein_info_1H1T_melt$yval[which(protein_info_1H1T_melt$variable=="RSA")] <- 24
protein_info_1H1T_melt$yval[which(protein_info_1H1T_melt$variable=="SS")] <- 25
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="X")] <- 21.3
Calculate the min. and max. avgfit values in
protein_info_1H1T:
# Minimum avgfit:
min(protein_info_1H1T$avgfit, na.rm = T)
[1] -1.040356
# Maximium avgfit:
max(protein_info_1H1T$avgfit, na.rm = T)
[1] 0.2375597
Generate a fitness values file from the protein_info_1H1T object:
sink("BMS/structures/avg_fit_4KJK_complementation_attribute.defattr")
cat("# BMS data for 4KJK\n")
cat("attribute: percentExposed\n")
cat("match mode: 1-to-1\n")
cat("recipient: residues\n")
for (i in 1:nrow(protein_info_1H1T)){
cat(paste("\t:",
as.character(protein_info_1H1T$pos[i]),
"\t",
as.character(protein_info_1H1T$avgfit[i]),
"\n",
sep=""))
}
sink()
Determine coverage:
(nrow(BMS_matrix15_perfects_and_1_melt %>%
filter(X1 != "X") %>%
dplyr::select(value)) -
sum(is.na(BMS_matrix15_perfects_and_1_melt %>%
filter(X1 != "X") %>%
dplyr::select(value))))/nrow(BMS_matrix15_perfects_and_1_melt %>%
filter(X1 != "X") %>%
dplyr::select(value))
[1] 0.7927673
BMS_Fig_yminlim <- -2
BMS_Fig_ymaxlim <- 30
BMS_Fig_xminlim <- -1
BMS_Fig_xmaxlim <- 160
# Read in BMS_fitness15_map_1 with 1 a.a. mutation
BMS_1mut15_del <- read.table("BMS/OUTPUT/Complementation/BMS_fitness15_map_1.csv", skip = 0 , sep=",",header = TRUE )
# Calculate all data and stats
BMS_1mut15_del_collapsed_all <- BMS_1mut15_del %>%
filter(position > 0) %>%
group_by(position, aa) %>%
summarise(fitval=median(fitness),
numpoints=n(),
stdfit=sd(fitness))
`summarise()` has grouped output by 'position'. You can override using the `.groups` argument.
# These matrices have the fitness/num/sd for each aa at each position:
BMS_1mut15_del_matrix = matrix(rep(NA, BMS_ref_len*BMS_aa_dim),nrow=BMS_aa_dim,ncol=BMS_ref_len)
# Populate matrix
for (i in 1:nrow(BMS_1mut15_del_collapsed_all)){
BMS_1mut15_del_matrix[which(BMS_aa_list$aa==as.character(BMS_1mut15_del_collapsed_all$aa[i])),BMS_1mut15_del_collapsed_all$position[i]] <- as.numeric(BMS_1mut15_del_collapsed_all$fitval[i])
}
rownames(BMS_1mut15_del_matrix)<-BMS_aa_list$aa
colnames(BMS_1mut15_del_matrix)<-c(1:BMS_ref_len)
BMS_1mut15_del_matrix_melt <- melt(BMS_1mut15_del_matrix)
# Rename columns to "X1" and "X2"
names(BMS_1mut15_del_matrix_melt)[names(BMS_1mut15_del_matrix_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
BMS_1mut15_del_matrix_melt <- BMS_1mut15_del_matrix_melt %>%
filter(X1=="X")
BMS_1mut15_del_matrix_melt$WTcolor <- NA
BMS_1mut15_del_matrix_melt$aanum <- 21.3
BMS_matrix15_perfects_and_1_melt <- BMS_matrix15_perfects_and_1_melt %>%
filter(X1!="X")
BMS_matrix15_perfects_and_1_melt <- rbind(BMS_matrix15_perfects_and_1_melt,BMS_1mut15_del_matrix_melt)
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="A")] <- 12
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="C")] <- 10
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="D")] <- 5
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="E")] <- 4
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="F")] <- 19
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="G")] <- 11
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="H")] <- 3
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="I")] <- 15
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="K")] <- 1
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="L")] <- 14
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="M")] <- 16
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="N")] <- 6
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="P")] <- 17
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="Q")] <- 7
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="R")] <- 2
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="S")] <- 9
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="T")] <- 8
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="V")] <- 13
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="W")] <- 20
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="Y")] <- 18
BMS_matrix15_perfects_and_1_melt$aanum[which(BMS_matrix15_perfects_and_1_melt$X1=="X")] <- 22.6
BMS_matrix15_perfects_and_1_melt_WT <- BMS_matrix15_perfects_and_1_melt %>%
filter(WTcolor=="red")
Calculate the min. and max. value in
BMS_matrix15_perfects_and_1_melt for plotting below:
# Minimum value:
min(BMS_matrix15_perfects_and_1_melt$value, na.rm = T)
[1] -5.990958
# Maximium value:
max(BMS_matrix15_perfects_and_1_melt$value, na.rm = T)
[1] 1.705215
Adjust the index position to shift to the right:
BMS_matrix15_perfects_and_1_melt_withM <- BMS_matrix15_perfects_and_1_melt %>%
filter(X2==1) %>%
mutate(X2=0, value=NA)
BMS_matrix15_perfects_and_1_melt_fix <- rbind(BMS_matrix15_perfects_and_1_melt_withM, BMS_matrix15_perfects_and_1_melt)
Bring it all together to build a custom BMS plot for COMPLEMENTATION (fitD05D03) for LIBRARY 15:
BMS_plot15 <- ggplot() +
#BMS matrix
geom_rect(data=BMS_matrix15_perfects_and_1_melt_fix,aes(xmin=X2+1, xmax=X2+2, ymin=aanum, ymax=aanum+1, fill=value))+
#WT seq
geom_rect(data=BMS_matrix15_perfects_and_1_melt_WT,aes(xmin=X2+1, xmax=X2+2, ymin=aanum, ymax=aanum+1), color="black",alpha=0)+
#avg fit
geom_rect(data=protein_info_1H1T,aes(xmin=pos+1, xmax=pos+2, ymin=21.3, ymax=22.3,fill=avgfit)) +
labs(x = "Position (aa)", y ="Amino acid",color="") +
scale_fill_gradient2(low = "blue", high = "red", mid="gold",name="Fitness",na.value="grey", limit = c(-6,3)) +
geom_text(data=BMS_matrix15_perfects_and_1_melt[1:21,], aes(x=-1, y=aanum+0.5, label=X1), size=3.5)+
geom_text(data=data.frame(pos=seq(0,150,30)),aes(x=pos+0.5,y=0,label=pos))+
geom_segment(aes(x = 14.5, y = 1, xend = 14.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 15.5, y = 1, xend = 15.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 16.5, y = 1, xend = 16.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 17.5, y = 1, xend = 17.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 20.5, y = 1, xend = 20.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 27.5, y = 1, xend = 27.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 31.5, y = 1, xend = 31.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 32.5, y = 1, xend = 32.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 46.5, y = 1, xend = 46.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 52.5, y = 1, xend = 52.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 57.5, y = 1, xend = 57.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 94.5, y = 1, xend = 94.5, yend = 0), colour = "blue")+
geom_segment(aes(x = 100.5, y = 1, xend = 100.5, yend = 0), colour = "blue")+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.line=element_blank())+
xlim(BMS_Fig_xminlim,BMS_Fig_xmaxlim)+
ylim(BMS_Fig_yminlim,BMS_Fig_ymaxlim)
BMS_plot15
Plot the SS string to go above the custom BMS plot:
protein_info_1H1T_no_loop <- protein_info_1H1T %>%
filter(SS != "loop")
# Plot SS
SS_plot15 <- ggplot()+
geom_segment(aes(x = 1, y = 24.5, xend = 160, yend = 24.5), colour = "black")+
geom_rect(data=protein_info_1H1T_no_loop,aes(xmin=pos, xmax=pos+1, ymin=24, ymax=25, fill=SS))+
xlim(BMS_Fig_xminlim,BMS_Fig_xmaxlim)+
ylim(BMS_Fig_yminlim,BMS_Fig_ymaxlim)+
labs(x = "", y ="",color="") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.line=element_blank())
SS_plot15
Plot the RSA (protein) string:
RSA_plot15 <- ggplot(protein_info_1H1T)+
geom_rect(aes(xmin=pos, xmax=pos+1, ymin=25.3, ymax=26.3, fill=RSA))+
xlim(BMS_Fig_xminlim,BMS_Fig_xmaxlim)+
ylim(BMS_Fig_yminlim,BMS_Fig_ymaxlim)+
labs(x = "", y ="",color="") +
scale_fill_gradient(low = "white", high = "red",name="RSA",na.value="grey") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.line=element_blank())
RSA_plot15
Plot the conservation score:
cons_plot15 <- ggplot(protein_info_1H1T)+
geom_rect(aes(xmin=pos, xmax=pos+1, ymin=26.6, ymax=27.6, fill=cons))+
xlim(BMS_Fig_xminlim,BMS_Fig_xmaxlim)+
ylim(BMS_Fig_yminlim,BMS_Fig_ymaxlim)+
labs(x = "", y ="",color="") +
scale_fill_gradient(low = "white", high = "red",name="Cons",na.value="grey") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.line=element_blank())
cons_plot15
Plot average fitness versus position coverage:
coverage_fit_plot15 <- ggplot(protein_info_1H1T,aes(x=avgfit,y=numcov/20*100))+
geom_smooth(fill="#0072B2")+
geom_point()+
labs(x = "Average fitness at position", y ="Position mutational coverage (%)",color="") +
theme(title = element_text(size = 18),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
panel.background = element_blank()) +
panel_border(color = "black")
coverage_fit_plot15
Correlation Test for coverage versus average fit:
cor.test(protein_info_1H1T$avgfit,protein_info_1H1T$numcov/20*100)
Pearson's product-moment correlation
data: protein_info_1H1T$avgfit and protein_info_1H1T$numcov/20 * 100
t = 6.9751, df = 153, p-value = 8.63e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3615376 0.6022063
sample estimates:
cor
0.4911901
cor(protein_info_1H1T$avgfit,protein_info_1H1T$numcov, use = "complete.obs")
[1] 0.4911901
Plot Average Fit versus SS:
SS_fit_plot15 <- ggplot(protein_info_1H1T,aes(x=SS,y=avgfit))+
geom_boxplot(color="black", fill="#0072B2", alpha = 0.8)+
geom_jitter()+
labs(x = "Secondary structure", y ="Average fitness at position",color="")+
theme(title = element_text(size = 18),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
panel.background = element_blank()) +
panel_border(color = "black")
SS_fit_plot15
Plot Average Fit versus Conservation:
Cons_fit_plot15 <- ggplot(protein_info_1H1T,aes(x=cons,y=avgfit))+
geom_smooth()+
geom_point(alpha=0.7)+
labs(x = "Position Conservation", y ="Average fitness at position",color="")+
xlim(0.1,0.8)+
theme(title = element_text(size = 18),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
panel.background = element_blank()) +
panel_border(color = "black")
Cons_fit_plot15 <- ggExtra::ggMarginal(Cons_fit_plot15,type = "histogram",
xparams = list(bins=30),
yparams = list(bins=20),
col = 'black',
fill = '#0072B2', alpha = 0.8)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'Warning: Removed 43 rows containing non-finite outside the scale range (`stat_smooth()`).`geom_smooth()` using method = 'loess' and formula = 'y ~ x'Warning: Removed 43 rows containing non-finite outside the scale range (`stat_smooth()`).Warning: Removed 43 rows containing missing values or values outside the scale range (`geom_point()`).
Cons_fit_plot15
Correlation Tests for Cons versus average fit:
cor.test(protein_info_1H1T$cons,protein_info_1H1T$avgfit)
Pearson's product-moment correlation
data: protein_info_1H1T$cons and protein_info_1H1T$avgfit
t = -6.2874, df = 111, p-value = 6.539e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.6368952 -0.3619969
sample estimates:
cor
-0.5124578
cor(protein_info_1H1T$cons,protein_info_1H1T$avgfit, use = "complete.obs")
[1] -0.5124578
Plot Average Fit versus RSA:
RSA_fit_plot15 <- ggplot(protein_info_1H1T,aes(x=RSA,y=avgfit))+
geom_smooth()+
geom_point(alpha=0.7)+
labs(x = "Relative solvent accesibility", y ="Average fitness at position",color="")+
#ylim(-2.7,2.7)+
theme(title = element_text(size = 18),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
panel.background = element_blank()) +
panel_border(color = "black")
RSA_fit_plot15 <- ggExtra::ggMarginal(RSA_fit_plot15,type = "histogram",
xparams = list(bins=30),
yparams = list(bins=20),
col = 'black',
fill = '#0072B2', alpha = 0.8)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'Warning: Removed 1 row containing non-finite outside the scale range (`stat_smooth()`).`geom_smooth()` using method = 'loess' and formula = 'y ~ x'Warning: Removed 1 row containing non-finite outside the scale range (`stat_smooth()`).Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
RSA_fit_plot15
Correlation Tests for RSA versus average fit:
cor.test(protein_info_1H1T$RSA,protein_info_1H1T$avgfit)
Pearson's product-moment correlation
data: protein_info_1H1T$RSA and protein_info_1H1T$avgfit
t = 3.0691, df = 153, p-value = 0.002541
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.08645328 0.38389317
sample estimates:
cor
0.2408193
cor(protein_info_1H1T$RSA,protein_info_1H1T$avgfit, use = "complete.obs")
[1] 0.2408193
Plot the average fitness for WT E. coli DHFR vs. All remaining DHFR homologs:
WT_fit_plot15 <- ggplot(BMS_matrix15_perfects_and_1_melt,aes(x=WTcolor,y=value))+
geom_boxplot(color="black", fill="#0072B2", alpha = 0.8)+
labs(x = "Residue", y ="Average Fitness (LogFC)",color="")+
scale_x_discrete(name ="Residue type",
labels=c("Wildtype E. coli \nDHFR Homolog","All Remaining \nDHFR Homologs"))+
theme(title = element_text(size = 18),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
panel.background = element_blank()) +
panel_border(color = "black")
WT_fit_plot15
Summary Stats for WT E. coli DHFR homolog and All remaining DHFR homologs:
### WT Summary Stats
# WT Mean Fitness
mean(as.numeric(unlist(BMS_matrix15_perfects_and_1_melt %>% filter(WTcolor=="red") %>% dplyr::select(value))))
[1] -0.01281394
# WT Median Fitness
median(as.numeric(unlist(BMS_matrix15_perfects_and_1_melt %>% filter(WTcolor=="red") %>% dplyr::select(value))))
[1] 0.02925269
# WT SD Fitness
sd(as.numeric(unlist(BMS_matrix15_perfects_and_1_melt %>% filter(WTcolor=="red") %>% dplyr::select(value))))
[1] 0.1978762
### Non-WT Summary Stats
# Non-WT Mean Fitness
mean(as.numeric(unlist(BMS_matrix15_perfects_and_1_melt %>% filter(is.na(WTcolor) & !is.na(value)) %>% dplyr::select(value))))
[1] -0.1707993
# Non-WT Median Fitness
median(as.numeric(unlist(BMS_matrix15_perfects_and_1_melt %>% filter(is.na(WTcolor) & !is.na(value)) %>% dplyr::select(value))))
[1] -0.05846074
# Non-WT SD Fitness
sd(as.numeric(unlist(BMS_matrix15_perfects_and_1_melt %>% filter(is.na(WTcolor) & !is.na(value)) %>% dplyr::select(value))))
[1] 0.5990878
Establish plotting parameters:
# Import the BMS_matrix15 dataset based on 1 a.a. mutation:
BMS_matrix15_perfects_and_1_sd_melt <- read.csv("BMS/OUTPUT/Complementation/BMS_matrix15_perfects_and_1_sd_melt.csv",
header = TRUE)
BMS_matrix15_perfects_and_1_sd_melt <- BMS_matrix15_perfects_and_1_sd_melt %>%
filter(X1!="X")
BMS_matrix15_perfects_and_1_sd_melt$aanum <- 0
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="A")] <- 12
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="C")] <- 10
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="D")] <- 5
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="E")] <- 4
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="F")] <- 19
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="G")] <- 11
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="H")] <- 3
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="I")] <- 15
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="K")] <- 1
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="L")] <- 14
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="M")] <- 16
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="N")] <- 6
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="P")] <- 17
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="Q")] <- 7
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="R")] <- 2
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="S")] <- 9
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="T")] <- 8
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="V")] <- 13
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="W")] <- 20
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="Y")] <- 18
BMS_matrix15_perfects_and_1_sd_melt$aanum[which(BMS_matrix15_perfects_and_1_sd_melt$X1=="X")] <- 22.6
Plot the STD values of the BMS analysis:
BMS_std_plot15 <- ggplot() +
#BMS matrix
geom_rect(data=BMS_matrix15_perfects_and_1_sd_melt,aes(xmin=X2, xmax=X2+1, ymin=aanum, ymax=aanum+1, fill=value))+
#WT seq
geom_rect(data=BMS_matrix15_perfects_and_1_melt_WT,aes(xmin=X2, xmax=X2+1, ymin=aanum, ymax=aanum+1), color="green",alpha=0)+
labs(x = "Position (aa)", y ="Amino acid",color="") +
scale_fill_gradient2(low = "blue", high = "red", mid="gold",name="Std",na.value="grey", midpoint = 3, limit = c(0,1.1*max(BMS_matrix15_perfects_and_1_sd_melt$value))) +#
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.line=element_blank())
BMS_std_plot15
Plot epistatis:
BMS_epistasis15 <- BMS_matrix15_perfects_and_1_sd_melt %>%
dplyr::rename(sd=value) %>%
inner_join(BMS_matrix15_perfects_and_1_num_melt, by=c("X1","X2")) %>%
dplyr::rename(num=value) %>%
filter(num>4)
BMS_epistasis_std_plot15 <- ggplot() +
#BMS matrix
geom_rect(data=BMS_epistasis15,aes(xmin=X2, xmax=X2+1, ymin=aanum, ymax=aanum+1, fill=sd))+
#WT seq
geom_rect(data=BMS_matrix15_perfects_and_1_melt_WT,aes(xmin=X2, xmax=X2+1, ymin=aanum, ymax=aanum+1), color="green",alpha=0)+
geom_text(data=BMS_matrix15_perfects_and_1_melt[1:21,], aes(x=-1, y=aanum+0.5, label=X1), size=3.5)+
geom_text(data=data.frame(pos=seq(0,150,30)),aes(x=pos+0.5,y=0,label=pos))+
labs(x = "Position (aa)", y ="Amino acid",color="") +
scale_fill_gradient(low = "gold", high = "red", name="Std",na.value="grey", limit = c(3,1.1*max(BMS_epistasis15$sd))) +#
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.line=element_blank())
BMS_epistasis_std_plot15
Plot potential epistatic interactions:
BMS_epistasis15_pos <- BMS_epistasis15 %>%
group_by(X2) %>%
summarise(numaa=n(),meansd=mean(sd),stdsd=sd(sd))
# ___
ggplot(BMS_epistasis15_pos,aes(x=X2,y=numaa)) +
geom_point()+
theme(title = element_text(size = 18),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
panel.background = element_blank()) +
panel_border(color = "black")
# ___
ggplot(BMS_epistasis15_pos,aes(x=X2,y=meansd)) +
geom_point()+
theme(title = element_text(size = 18),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
panel.background = element_blank()) +
panel_border(color = "black")
# ___
ggplot(BMS_epistasis15_pos,aes(x=X2,y=stdsd)) +
geom_point()+
theme(title = element_text(size = 18),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
panel.background = element_blank()) +
panel_border(color = "black")
Save BMS Info as dataframe (and write to csv file):
BMS_info15 <- right_join(BMS_matrix15_perfects_and_1_melt %>%
select(X1,X2,value) %>%
dplyr::rename(AA=X1,Pos=X2,avgfitness=value),
BMS_matrix15_perfects_and_1_num_melt %>%
select(X1,X2,value) %>%
dplyr::rename(AA=X1,Pos=X2,numpoints=value),
by=c("AA", "Pos"))
BMS_info15 <- right_join(BMS_info15,
BMS_matrix15_perfects_and_1_sd_melt%>%
select(X1,X2,value) %>%
dplyr::rename(AA=X1,Pos=X2,sd=value),
by=c("AA", "Pos"))
Save the formatted BMS files to import for downstream analyses
# protein_info_1H1T
write.csv(protein_info_1H1T,
"BMS/bms_files_formatted/protein_info_1H1T.csv", row.names = FALSE)
# BMS_matrix15_perfects_and_1_melt
write.csv(BMS_matrix15_perfects_and_1_melt,
"BMS/bms_files_formatted/BMS_matrix15_perfects_and_1_melt.csv", row.names = FALSE)
# BMS_matrix15_perfects_and_1_num_melt
write.csv(BMS_matrix15_perfects_and_1_num_melt,
"BMS/bms_files_formatted/BMS_matrix15_perfects_and_1_num_melt.csv", row.names = FALSE)
The session information is provided for full reproducibility.
devtools::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31)
os macOS 15.2
system aarch64, darwin20
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2025-01-23
rstudio 2024.09.0+375 Cranberry Hibiscus (desktop)
pandoc 3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
ade4 1.7-22 2023-02-06 [1] CRAN (R 4.3.0)
ape * 5.8 2024-04-11 [1] CRAN (R 4.3.1)
aplot 0.2.2 2023-10-06 [1] CRAN (R 4.3.1)
bio3d * 2.4-5 2024-10-29 [1] CRAN (R 4.3.3)
BiocGenerics * 0.46.0 2023-06-04 [1] Bioconductor
Biostrings * 2.68.1 2023-05-21 [1] Bioconductor
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.3.0)
cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.0)
castor * 1.8.0 2024-01-09 [1] CRAN (R 4.3.1)
cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.1)
codetools 0.2-20 2024-03-31 [1] CRAN (R 4.3.1)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
cowplot * 1.1.3 2024-01-22 [1] CRAN (R 4.3.1)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
devtools * 2.4.5 2022-10-11 [1] CRAN (R 4.3.0)
digest 0.6.35 2024-03-11 [1] CRAN (R 4.3.1)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.1)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.0)
evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.1)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.1)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
foreach 1.5.2 2022-02-02 [1] CRAN (R 4.3.0)
fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
GenomeInfoDb * 1.36.4 2023-10-08 [1] Bioconductor
GenomeInfoDbData 1.2.10 2023-09-13 [1] Bioconductor
ggExtra * 0.10.1 2023-08-21 [1] CRAN (R 4.3.0)
ggfun 0.1.4 2024-01-19 [1] CRAN (R 4.3.1)
ggnewscale * 0.4.10 2024-02-08 [1] CRAN (R 4.3.1)
ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.3.1)
ggplotify 0.1.2 2023-08-09 [1] CRAN (R 4.3.0)
ggridges * 0.5.6 2024-01-23 [1] CRAN (R 4.3.1)
ggtree * 3.8.2 2023-07-30 [1] Bioconductor
ggtreeExtra * 1.10.0 2023-04-25 [1] Bioconductor
glmnet * 4.1-8 2023-08-22 [1] CRAN (R 4.3.0)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.1)
gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.3.0)
gridGraphics 0.5-1 2020-12-13 [1] CRAN (R 4.3.0)
gtable 0.3.5 2024-04-22 [1] CRAN (R 4.3.1)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.3.1)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.1)
httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.3.1)
igraph * 2.0.3 2024-03-13 [1] CRAN (R 4.3.1)
IRanges * 2.34.1 2023-07-02 [1] Bioconductor
iterators 1.0.14 2022-02-05 [1] CRAN (R 4.3.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.1)
knitr * 1.45 2023-10-30 [1] CRAN (R 4.3.1)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.0)
later 1.3.2 2023-12-06 [1] CRAN (R 4.3.1)
lattice 0.22-6 2024-03-20 [1] CRAN (R 4.3.1)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.3.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
MASS 7.3-60.0.1 2024-01-13 [1] CRAN (R 4.3.1)
Matrix * 1.6-5 2024-01-11 [1] CRAN (R 4.3.1)
matrixStats * 1.3.0 2024-04-11 [1] CRAN (R 4.3.1)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.0)
mgcv 1.9-1 2023-12-21 [1] CRAN (R 4.3.1)
mime 0.12 2021-09-28 [1] CRAN (R 4.3.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
munsell 0.5.1 2024-04-01 [1] CRAN (R 4.3.1)
naturalsort 0.1.3 2016-08-30 [1] CRAN (R 4.3.0)
nlme 3.1-164 2023-11-27 [1] CRAN (R 4.3.1)
patchwork * 1.2.0 2024-01-08 [1] CRAN (R 4.3.1)
pheatmap * 1.0.12 2019-01-04 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgbuild 1.4.4 2024-03-17 [1] CRAN (R 4.3.1)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
pkgload 1.3.4 2024-01-16 [1] CRAN (R 4.3.1)
plyr 1.8.9 2023-10-02 [1] CRAN (R 4.3.1)
png 0.1-8 2022-11-29 [1] CRAN (R 4.3.0)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.0)
promises 1.3.0 2024-04-05 [1] CRAN (R 4.3.1)
pscl * 1.5.9 2024-01-31 [1] CRAN (R 4.3.1)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
ragg 1.3.0 2024-03-13 [1] CRAN (R 4.3.1)
RColorBrewer * 1.1-3 2022-04-03 [1] CRAN (R 4.3.0)
Rcpp * 1.0.13 2024-07-17 [1] CRAN (R 4.3.3)
RCurl 1.98-1.14 2024-01-09 [1] CRAN (R 4.3.1)
remotes 2.5.0 2024-03-17 [1] CRAN (R 4.3.1)
reshape * 0.8.9 2022-04-12 [1] CRAN (R 4.3.0)
reshape2 * 1.4.4 2020-04-09 [1] CRAN (R 4.3.0)
reticulate * 1.36.1 2024-04-22 [1] CRAN (R 4.3.1)
rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.1)
rmarkdown 2.26 2024-03-05 [1] CRAN (R 4.3.1)
ROCR * 1.0-11 2020-05-02 [1] CRAN (R 4.3.0)
RSpectra 0.16-1 2022-04-24 [1] CRAN (R 4.3.0)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.3.1)
S4Vectors * 0.38.2 2023-09-24 [1] Bioconductor
scales * 1.3.0 2023-11-28 [1] CRAN (R 4.3.1)
seqinr * 4.2-36 2023-12-08 [1] CRAN (R 4.3.1)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
shape 1.4.6.1 2024-02-23 [1] CRAN (R 4.3.1)
shiny 1.8.1.1 2024-04-02 [1] CRAN (R 4.3.1)
stringi * 1.8.3 2023-12-11 [1] CRAN (R 4.3.1)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.1)
survival 3.5-8 2024-02-14 [1] CRAN (R 4.3.1)
systemfonts 1.0.6 2024-03-07 [1] CRAN (R 4.3.1)
textshaping 0.3.7 2023-10-09 [1] CRAN (R 4.3.1)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.1)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.1)
tidytree * 0.4.6 2023-12-12 [1] CRAN (R 4.3.1)
treeio 1.24.3 2023-07-30 [1] Bioconductor
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.0)
usethis * 2.2.3 2024-02-19 [1] CRAN (R 4.3.1)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.1)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1)
viridis * 0.6.5 2024-01-29 [1] CRAN (R 4.3.1)
viridisLite * 0.4.2 2023-05-02 [1] CRAN (R 4.3.0)
withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.1)
xfun 0.43 2024-03-25 [1] CRAN (R 4.3.1)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.0)
XVector * 0.40.0 2023-05-08 [1] Bioconductor
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.1)
yulab.utils 0.1.4 2024-01-28 [1] CRAN (R 4.3.1)
zlibbioc 1.46.0 2023-05-08 [1] Bioconductor
[1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
─ Python configuration ─────────────────────────────────────────────────────────────────────────────────────────────────────
python: /Users/krom/miniforge3/bin/python3
libpython: /Users/krom/miniforge3/lib/libpython3.10.dylib
pythonhome: /Users/krom/miniforge3:/Users/krom/miniforge3
version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
numpy: [NOT FOUND]
NOTE: Python version was forced by use_python() function
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────