R Notebook: Provides reproducible analysis for Gain-of-Function Mutants in the following manuscript:

Citation: Romanowicz KJ, Resnick C, Hinton SR, Plesa C. Exploring antibiotic resistance in diverse homologs of the dihydrofolate reductase protein family through broad mutational scanning. bioRxiv, 2025.

GitHub Repository: https://github.com/PlesaLab/DHFR

NCBI BioProject: https://www.ncbi.nlm.nih.gov/bioproject/1189478

Experiment

This pipeline processes a library of 1,536 DHFR homologs and their associated mutants, with two-fold redundancy (two codon variants per sequence). Fitness scores are derived from a multiplexed in-vivo assay using a trimethoprim concentration gradient, assessing the ability of these homologs and their mutants to complement functionality in an E. coli knockout strain and their tolerance to trimethoprim treatment. This analysis provides insights into how antibiotic resistance evolves across a range of evolutionary starting points. Sequence data were generated using the Illumina NovaSeq platform with 100 bp paired-end sequencing of amplicons.

Methods overview to achieve a broad-mutational scan for DHFR homologs.

Packages

The following R packages must be installed prior to loading into the R session. See the Reproducibility tab for a complete list of packages and their versions used in this workflow.

# Load the latest version of python (3.10.14) for downstream use:
library(reticulate)
use_python("/Users/krom/miniforge3/bin/python3")

# Make a vector of required packages
required.packages <- c("ape", "bio3d", "Biostrings", "castor", "cowplot", "devtools", "dplyr", "ggExtra", "ggnewscale", "ggplot2", "ggridges", "ggtree", "ggtreeExtra", "glmnet", "gridExtra","igraph", "knitr", "matrixStats", "patchwork", "pheatmap", "purrr", "pscl", "RColorBrewer", "reshape","reshape2", "ROCR", "seqinr", "scales", "stringr", "stringi", "tidyr", "tidytree", "viridis")

# Load required packages with error handling
loaded.packages <- lapply(required.packages, function(package) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package, dependencies = TRUE)
    if (!require(package, character.only = TRUE)) {
      message("Package ", package, " could not be installed and loaded.")
      return(NULL)
    }
  }
  return(package)
})

# Remove NULL entries from loaded packages
loaded.packages <- loaded.packages[!sapply(loaded.packages, is.null)]
Loaded packages: ape, bio3d, Biostrings, castor, cowplot, devtools, dplyr, ggExtra, ggnewscale, ggplot2, ggridges, ggtree, ggtreeExtra, glmnet, gridExtra, igraph, knitr, matrixStats, patchwork, pheatmap, purrr, pscl, RColorBrewer, reshape, reshape2, ROCR, seqinr, scales, stringr, stringi, tidyr, tidytree, viridis 
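
Note that install.packages() only reaches CRAN, while Biostrings, ggtree, and ggtreeExtra are distributed through Bioconductor and are normally installed with BiocManager::install(). The sketch below is not part of the original workflow: it assumes a hypothetical helper, check_packages(), that reports which packages are missing without attaching or installing anything, so that missing ones can be installed from the right repository by hand.

```r
# Hypothetical helper (not in the original notebook): split a package
# vector into installed and missing without attaching or installing anything.
check_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  list(installed = pkgs[installed], missing = pkgs[!installed])
}

# "notARealPackage123" is a made-up name used only to show the missing case
status <- check_packages(c("stats", "notARealPackage123"))
print(status$missing)
```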

Import Data Files

Import MUTANTS files generated from DHFR.4.Mutants.RMD relevant for downstream analysis.

# mut_collapse_15
mut_collapse_15 <- read.csv("Mutants/mutants_files_formatted/mut_collapse_15.csv", 
                         header = TRUE, stringsAsFactors = FALSE)

Import BMS files generated from DHFR.5.BMS.RMD relevant for downstream analysis.

# protein_info_1H1T
protein_info_1H1T <- read.csv("BMS/bms_files_formatted/protein_info_1H1T.csv", 
                         header = TRUE, stringsAsFactors = FALSE)

# BMS_matrix15_perfects_and_1_melt
BMS_matrix15_perfects_and_1_melt <- read.csv("BMS/bms_files_formatted/BMS_matrix15_perfects_and_1_melt.csv", 
                         header = TRUE, stringsAsFactors = FALSE)

# BMS_matrix15_perfects_and_1_num_melt
BMS_matrix15_perfects_and_1_num_melt <- read.csv("BMS/bms_files_formatted/BMS_matrix15_perfects_and_1_num_melt.csv", 
                         header = TRUE, stringsAsFactors = FALSE)

GOF Mutants Analysis

This section is based on the R file: “R_dropout_GOF.R”. It describes how to determine if certain mutant versions of a designed homolog increase in fitness under trimethoprim selection (i.e., gain of function after mutation).

Complementation

Start with two smoothed histograms (density plots): (1) the distribution of perfects by fitness and (2) the distribution of mutants by fitness. In both, the x-axis is fitness, the y-axis is the density scaled to a percentage, and the area is colored blue where fitness < -1 and gold where fitness > -1.

Perfects: Smooth Histogram

# Subset perfects (mutations == 0)
L15_perfects_complementation <- mut_collapse_15 %>%
  filter(mutations == 0)

# Remove NA and infinite values for x-axis scaling
fitD05D03_perf_clean <- L15_perfects_complementation$fitD05D03[is.finite(L15_perfects_complementation$fitD05D03)]

# Calculate the range of the data
x_min_perf <- floor(min(fitD05D03_perf_clean, na.rm = TRUE))
x_max_perf <- ceiling(max(fitD05D03_perf_clean, na.rm = TRUE))

# Plot smooth density curve
L15_perfects_complementation_density <- ggplot(L15_perfects_complementation, aes(x = fitD05D03)) +
  geom_density(aes(y = after_stat(density * 100), 
                   fill = ifelse(fitD05D03 <= -1, "darkblue", "gold")), 
               alpha = 0.75) +
  geom_vline(xintercept = -1, linetype = "dashed", color = "black") +
  scale_fill_identity() +
  scale_x_continuous(breaks = seq(x_min_perf, x_max_perf, by = 1)) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(x = "Median Fitness (LogFC)", y = "Percentage") +
  theme_minimal() +
  theme(axis.line = element_line(colour = 'black', linewidth = 1.0), 
        axis.ticks = element_line(colour = "black", linewidth = 1.0),
        axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.position = "none")

# Display plot
print(L15_perfects_complementation_density)

# Subset perfects (mutations == 0)
L15_perfects_complementation <- mut_collapse_15 %>%
  filter(mutations == 0)

# Remove NA and infinite values for x-axis scaling
fitD05D03_perf_clean <- L15_perfects_complementation$fitD05D03[is.finite(L15_perfects_complementation$fitD05D03)]

# Calculate the range of the data
x_min_perf <- floor(min(fitD05D03_perf_clean, na.rm = TRUE))
x_max_perf <- ceiling(max(fitD05D03_perf_clean, na.rm = TRUE))

# Create density data
density_data <- density(L15_perfects_complementation$fitD05D03, na.rm = TRUE)

# Create a data frame from the density data
df_density <- data.frame(x = density_data$x, y = density_data$y)

# Split the data at x = -1
df_left <- df_density[df_density$x <= -1, ]
df_right <- df_density[df_density$x >= -1, ]

# Ensure the split point is included in both datasets
df_left <- rbind(df_left, data.frame(x = -1, y = df_left$y[nrow(df_left)]))
df_right <- rbind(data.frame(x = -1, y = df_right$y[1]), df_right)

# Plot using geom_area
L15_perfects_complementation_density <- ggplot() +
  geom_area(data = df_left, aes(x = x, y = y * 100), fill = "darkblue", alpha = 0.75) +
  geom_area(data = df_right, aes(x = x, y = y * 100), fill = "gold", alpha = 0.75) +
  geom_line(data = df_density, aes(x = x, y = y * 100), color = "black", linewidth = 0.5) +  # Add black outline
  geom_vline(xintercept = -1, linetype = "dashed", color = "black") +
  scale_x_continuous(breaks = seq(x_min_perf, x_max_perf, by = 1)) +
  scale_y_continuous(labels = function(x) paste0(x, "%"), 
                     limits = c(0, max(df_density$y) * 100)) +  # Set y-axis limits
  labs(x = "Median Fitness (LogFC)", y = "Percentage") +
  theme_minimal() +
  theme(axis.line = element_line(colour = 'black', linewidth = 1.0), 
        axis.ticks = element_line(colour = "black", linewidth = 1.0),
        axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.position = "none")

# Display plot
print(L15_perfects_complementation_density)

Percent of Perfects with Strong Depletion (fitness < -2.5)

# Calculate the percentages and counts
results <- L15_perfects_complementation %>%
  filter(!is.na(fitD05D03)) %>%  # Remove rows where fitD05D03 is NA
  summarise(
    total_unique_IDs = n_distinct(ID),
    IDs_below_neg2_5 = n_distinct(ID[fitD05D03 < -2.5]),
    percentage_below_neg2_5 = (IDs_below_neg2_5 / total_unique_IDs) * 100,
    IDs_below_neg1 = n_distinct(ID[fitD05D03 < -1]),
    IDs_above_or_equal_neg1 = n_distinct(ID[fitD05D03 >= -1]),
    percentage_below_neg1 = (IDs_below_neg1 / total_unique_IDs) * 100,
    percentage_above_or_equal_neg1 = (IDs_above_or_equal_neg1 / total_unique_IDs) * 100
  )

# Print the results
print(paste0("Percentage of unique IDs with fitD05D03 < -2.5: ", 
             round(results$percentage_below_neg2_5, 2), "%"))
[1] "Percentage of unique IDs with fitD05D03 < -2.5: 17%"
print(paste("Total unique IDs:", results$total_unique_IDs))
[1] "Total unique IDs: 794"
print(paste("Unique IDs with fitD05D03 < -2.5:", results$IDs_below_neg2_5))
[1] "Unique IDs with fitD05D03 < -2.5: 135"
print("Additional fitness categories:")
[1] "Additional fitness categories:"
print(paste("Unique IDs with fitD05D03 < -1:", results$IDs_below_neg1))
[1] "Unique IDs with fitD05D03 < -1: 378"
print(paste0("Percentage: ", round(results$percentage_below_neg1, 2), "%"))
[1] "Percentage: 47.61%"
print(paste("Unique IDs with fitD05D03 >= -1:", results$IDs_above_or_equal_neg1))
[1] "Unique IDs with fitD05D03 >= -1: 416"
print(paste0("Percentage: ", round(results$percentage_above_or_equal_neg1, 2), "%"))
[1] "Percentage: 52.39%"
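
The same summary can be expressed as a small reusable function. The sketch below uses base R and toy values; the helper name fraction_below(), the column name fitness, and the example data are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: percentage of unique IDs whose fitness is below a cutoff.
# NA fitness values are dropped first, mirroring the filter(!is.na(...)) step.
fraction_below <- function(df, cutoff) {
  valid <- df[!is.na(df$fitness), ]
  n_total <- length(unique(valid$ID))
  n_below <- length(unique(valid$ID[valid$fitness < cutoff]))
  c(total = n_total, below = n_below, percent = 100 * n_below / n_total)
}

# Invented toy data: five IDs, one with missing fitness
toy <- data.frame(ID = c("A", "B", "C", "D", "E"),
                  fitness = c(-3.0, -1.5, -0.2, 0.4, NA),
                  stringsAsFactors = FALSE)

print(fraction_below(toy, -1))    # 2 of 4 valid IDs fall below -1
print(fraction_below(toy, -2.5))  # 1 of 4 valid IDs falls below -2.5
```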

Mutants: Smooth Histogram

# Subset mutants (1-5 mutations)
L15_mutants_complementation <- mut_collapse_15 %>%
  filter(mutations > 0 & mutations < 6)

# Remove NA and infinite values for x-axis scaling
fitD05D03_mut_clean <- L15_mutants_complementation$fitD05D03[is.finite(L15_mutants_complementation$fitD05D03)]

# Calculate the range of the data
x_min_mut <- floor(min(fitD05D03_mut_clean, na.rm = TRUE))
x_max_mut <- ceiling(max(fitD05D03_mut_clean, na.rm = TRUE))

# Plot smooth density curve
L15_mutants_complementation_density <- ggplot(L15_mutants_complementation, aes(x = fitD05D03)) +
  geom_density(aes(y = after_stat(density * 100), 
                   fill = ifelse(fitD05D03 <= -1, "darkblue", "gold")), 
               alpha = 0.75) +
  geom_vline(xintercept = -1, linetype = "dashed", color = "black") +
  scale_fill_identity() +
  scale_x_continuous(breaks = seq(x_min_mut, x_max_mut, by = 1)) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(x = "Median Fitness (LogFC)", y = "Percentage") +
  theme_minimal() +
  theme(axis.line = element_line(colour = 'black', linewidth = 1.0), 
        axis.ticks = element_line(colour = "black", linewidth = 1.0),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.position = "none")

# Display plot
print(L15_mutants_complementation_density)

Perfects & Mutants: Smooth Histogram

# Subset perfects and mutants (0-5 mutations)
L15_perfects_mutants_complementation <- mut_collapse_15 %>%
  filter(mutations >= 0 & mutations < 6)

# Remove NA and infinite values for x-axis scaling
fitD05D03_perf_mut_clean <- L15_perfects_mutants_complementation$fitD05D03[is.finite(L15_perfects_mutants_complementation$fitD05D03)]

# Calculate the range of the data
x_min_perf_mut <- floor(min(fitD05D03_perf_mut_clean, na.rm = TRUE))
x_max_perf_mut <- ceiling(max(fitD05D03_perf_mut_clean, na.rm = TRUE))

# Plot smooth density curve
L15_perfects_mutants_complementation_density <- ggplot(L15_perfects_mutants_complementation, aes(x = fitD05D03)) +
  geom_density(aes(y = after_stat(density * 100), 
                   fill = ifelse(fitD05D03 <= -1, "darkblue", "gold")), 
               alpha = 0.75) +
  geom_vline(xintercept = -1, linetype = "dashed", color = "black") +
  scale_fill_identity() +
  scale_x_continuous(breaks = seq(x_min_perf_mut, x_max_perf_mut, by = 1)) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(x = "Median Fitness (LogFC)", y = "Percentage") +
  theme_minimal() +
  theme(axis.line = element_line(colour = 'black', linewidth = 1.0), 
        axis.ticks = element_line(colour = "black", linewidth = 1.0),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.position = "none")

# Display plot
print(L15_perfects_mutants_complementation_density)

Mutant Counts

# Step 1: Identify IDs that meet the criteria (mutations == 0 and fitD05D03 >= -1)
valid_IDs <- mut_collapse_15 %>%
  filter(mutations == 0 & fitD05D03 >= -1) %>%
  pull(ID)

# Step 2: Filter mutIDs based on the criteria and association with valid IDs
unique_mutIDinfo15_5AA_count <- mut_collapse_15 %>%
  filter(ID %in% valid_IDs) %>%  # Keep only rows associated with valid IDs
  filter(mutations > 0 & mutations < 6) %>%  # Keep mutIDs with 1-5 mutations
  distinct(mutID) %>%
  nrow()

# Format and print the result
formatted_count <- format(unique_mutIDinfo15_5AA_count, big.mark = ",")
print(paste("Number of unique mutIDs with 1-5 mutations associated with complementing perfects:", formatted_count))
[1] "Number of unique mutIDs with 1-5 mutations associated with complementing perfects: 6,697"
# Step 1: Identify IDs that meet the criteria (mutations == 0 and fitD05D03 < -1)
LOF_IDs <- mut_collapse_15 %>%
  filter(mutations == 0 & fitD05D03 < -1) %>%
  pull(ID)

# Step 2: Filter mutIDs based on the criteria and association with LOF_IDs
GOF_mutants <- mut_collapse_15 %>%
  filter(ID %in% LOF_IDs) %>%  # Keep only rows associated with LOF_IDs
  filter(mutations == 1) %>%  # Keep mutIDs with 1 mutation
  filter(fitD05D03 >= -1)  # Keep GOF mutants with fitness >= -1

# Count unique mutIDs
GOF_mutIDinfo15_1AA_count <- GOF_mutants %>%
  distinct(mutID) %>%
  nrow()

# Step 3: Count how many of the original LOF_IDs have associated GOF mutants
LOF_IDs_with_GOF_mutants <- GOF_mutants %>%
  distinct(ID) %>%
  nrow()

# Format and print the results
formatted_LOF_IDs_count <- format(length(LOF_IDs), big.mark = ",")
formatted_mutants_count <- format(GOF_mutIDinfo15_1AA_count, big.mark = ",")
formatted_LOF_IDs_with_GOF_mutants <- format(LOF_IDs_with_GOF_mutants, big.mark = ",")

print(paste("Number of original IDs (mutations == 0) with fitD05D03 < -1:", formatted_LOF_IDs_count))
[1] "Number of original IDs (mutations == 0) with fitD05D03 < -1: 378"
print(paste("Number of unique mutIDs with 1 mutation associated with these IDs and fitD05D03 >= -1:", formatted_mutants_count))
[1] "Number of unique mutIDs with 1 mutation associated with these IDs and fitD05D03 >= -1: 476"
print(paste("Number of original IDs that have associated mutants meeting the criteria:", formatted_LOF_IDs_with_GOF_mutants))
[1] "Number of original IDs that have associated mutants meeting the criteria: 196"
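
To make the gain-of-function criterion concrete: a single mutant counts as GOF when its parent (mutations == 0) has fitness below -1 and the mutant itself has fitness of at least -1. The following is a minimal base-R sketch on made-up data; the IDs, mutation labels, and fitness values are invented for illustration only.

```r
# Toy table: P1 fails to complement, P2 complements; each has mutants.
toy <- data.frame(
  ID        = c("P1", "P1", "P1", "P2", "P2"),
  mutID     = c("P1", "P1_A10V", "P1_G20S", "P2", "P2_L5F"),
  mutations = c(0, 1, 1, 0, 1),
  fitness   = c(-2.0, 0.3, -1.8, 0.1, 0.5),
  stringsAsFactors = FALSE
)

# Parents below the complementation threshold
lof_ids <- toy$ID[toy$mutations == 0 & toy$fitness < -1]

# Single mutants of those parents that recover fitness >= -1
gof <- toy[toy$ID %in% lof_ids & toy$mutations == 1 & toy$fitness >= -1, ]
print(gof$mutID)
```

Only P1_A10V qualifies here: P1_G20S stays below the threshold, and P2_L5F is excluded because its parent already complements.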

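Both Plots: Combine the perfects and mutants plots side by side (patch1), then stack the combined perfects-and-mutants plot beneath them (patch2):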

patch1 <- (L15_perfects_complementation_density | L15_mutants_complementation_density)
patch1


patch2 <- (L15_perfects_complementation_density | L15_mutants_complementation_density) / L15_perfects_mutants_complementation_density
patch2

Count the number of unique perfects with fitness > -1 and fitness < -1:

# Assuming L15_perfects_complementation is already defined
unique_ID_counts_perfects <- L15_perfects_complementation %>%
  group_by(ID) %>%
  summarise(
    min_fitness = min(fitD05D03, na.rm = TRUE),
    max_fitness = max(fitD05D03, na.rm = TRUE),
    all_na = all(is.na(fitD05D03))
  ) %>%
  mutate(fitness_category = case_when(
    all_na ~ "All NA",
    !is.finite(min_fitness) & !is.finite(max_fitness) ~ "All NA",
    max_fitness < -1 ~ "All less than -1",
    min_fitness > -1 ~ "All greater than -1",
    TRUE ~ "Spans -1"
  )) %>%
  group_by(fitness_category) %>%
  summarise(count = n())
Warning: There were 6 warnings in `summarise()`.
The first warning was:
ℹ In argument: `min_fitness = min(fitD05D03, na.rm = TRUE)`.
ℹ In group 1: `ID = "NP_065309"`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
ℹ Run dplyr::last_dplyr_warnings() to see the 5 remaining warnings.
print("Unique ID counts for perfects by fitness category:")
[1] "Unique ID counts for perfects by fitness category:"
print(unique_ID_counts_perfects)

total_unique_IDs_perfects <- n_distinct(L15_perfects_complementation$ID)
print(paste("Total number of unique IDs in perfects:", total_unique_IDs_perfects))
[1] "Total number of unique IDs in perfects: 797"
IDs_with_valid_fitness_perfects <- L15_perfects_complementation %>%
  filter(!is.na(fitD05D03)) %>%
  pull(ID) %>%   # Count distinct IDs only, not distinct rows
  n_distinct()

print(paste("Number of unique IDs in perfects with at least one valid fitD05D03:", IDs_with_valid_fitness_perfects))
[1] "Number of unique IDs in perfects with at least one valid fitD05D03: 794"
# Additional information: count of rows for each fitness category
rows_per_category_perfects <- L15_perfects_complementation %>%
  mutate(fitness_category = case_when(
    is.na(fitD05D03) ~ "NA",
    fitD05D03 < -1 ~ "Less than -1",
    fitD05D03 > -1 ~ "Greater than -1",
    TRUE ~ "Equal to -1"
  )) %>%
  group_by(fitness_category) %>%
  summarise(row_count = n())

print("Number of rows in each fitness category for perfects:")
[1] "Number of rows in each fitness category for perfects:"
print(rows_per_category_perfects)
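
The summarise() warnings in this section arise because min() and max() return Inf and -Inf when every value in a group is NA and na.rm = TRUE. One way to avoid them, sketched here with a hypothetical helper safe_min() that is not part of the original notebook, is to return NA for all-missing groups. The toy data are invented.

```r
# Hypothetical helper: minimum that yields NA (and no warning) for all-NA input.
safe_min <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else min(x)
}

toy <- data.frame(ID = c("A", "A", "B"),
                  fitness = c(-1.2, 0.4, NA),
                  stringsAsFactors = FALSE)

# One minimum per ID; group "B" has no valid values and becomes NA
per_id <- tapply(toy$fitness, toy$ID, safe_min)
print(per_id)
```

With NA instead of Inf, the downstream is.finite() checks in case_when() could be replaced by a plain is.na() test.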

Count the number of unique mutants with fitness > -1 and fitness < -1:

# Count mutants from L15_mutants_complementation dataset
unique_mutID_counts <- L15_mutants_complementation %>%
  group_by(mutID) %>%
  summarise(
    min_fitness = min(fitD05D03, na.rm = TRUE),
    max_fitness = max(fitD05D03, na.rm = TRUE),
    all_na = all(is.na(fitD05D03))
  ) %>%
  mutate(fitness_category = case_when(
    all_na ~ "All NA",
    !is.finite(min_fitness) & !is.finite(max_fitness) ~ "All NA",
    max_fitness < -1 ~ "All less than -1",
    min_fitness > -1 ~ "All greater than -1",
    TRUE ~ "Spans -1"
  )) %>%
  group_by(fitness_category) %>%
  summarise(count = n())
Warning: There were 7222 warnings in `summarise()`.
The first warning was:
ℹ In argument: `min_fitness = min(fitD05D03, na.rm = TRUE)`.
ℹ In group 2: `mutID = "NP_065309_G27V_L54X_P55X"`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
ℹ Run dplyr::last_dplyr_warnings() to see the 7221 remaining warnings.
print(unique_mutID_counts)

total_unique_mutIDs <- n_distinct(L15_mutants_complementation$mutID)
print(paste("Total number of unique mutIDs:", total_unique_mutIDs))
[1] "Total number of unique mutIDs: 12146"
mutIDs_with_valid_fitness <- L15_mutants_complementation %>%
  filter(!is.na(fitD05D03)) %>%
  pull(mutID) %>%   # Count distinct mutIDs only, not distinct rows
  n_distinct()

print(paste("Number of unique mutIDs with at least one valid fitD05D03:", mutIDs_with_valid_fitness))
[1] "Number of unique mutIDs with at least one valid fitD05D03: 8535"
# Additional information: count of rows for each fitness category
rows_per_category <- L15_mutants_complementation %>%
  mutate(fitness_category = case_when(
    is.na(fitD05D03) ~ "NA",
    fitD05D03 < -1 ~ "Less than -1",
    fitD05D03 > -1 ~ "Greater than -1",
    TRUE ~ "Equal to -1"
  )) %>%
  group_by(fitness_category) %>%
  summarise(row_count = n())

print("Number of rows in each fitness category:")
[1] "Number of rows in each fitness category:"
print(rows_per_category)

Dropout Perfects

Start by retrieving all dropout perfects (log-fold change below -1.0) and their corresponding GoF mutants (log-fold change above -1.0). Use the mut_collapse_15 dataset, which includes 797 perfects (mutations = 0; numprunedBCs = 5) and 12,174 mutants with up to 5 AA distance and at least 1 BC (numprunedBCs = 1) matching a perfect variant in the dataset.

# Step 1: Identify IDs that have rows where mutations == 0 and fitD05D03 < -1.0
dropout15_ids_with_zero_mutations <- mut_collapse_15 %>%
  filter(mutations == 0 & fitD05D03 < -1.0) %>%
  distinct(ID) %>%
  pull(ID)

# Step 2: Filter the main dataset to keep mutants with fitness > -1.0 if they match a corresponding perfect ID
dropout_mutants15_GOF <- mut_collapse_15 %>%
  filter(
    (mutations == 0 & fitD05D03 < -1.0) |
    #(mutations != 0 & ID %in% dropout15_ids_with_zero_mutations & !is.na(fitD05D03))) %>%
    (mutations != 0 & fitD05D03 > -1.0 & ID %in% dropout15_ids_with_zero_mutations)) %>%
  dplyr::select(ID, mutID, numprunedBCs, mutations, fitD05D03, seq)

Validate that rows where mutations != 0 have an ID that matches rows where mutations == 0 and fitD05D03 < -1.0.

# 1. First, create two subsets of the data
zero_mutation_rows <- dropout_mutants15_GOF %>%
  filter(mutations == 0 & fitD05D03 < -1.0)

non_zero_mutation_rows <- dropout_mutants15_GOF %>%
  #filter(mutations != 0)
  filter(mutations != 0 & fitD05D03 > -1.0)

# 2. Check that all IDs in non_zero_mutation_rows are present in zero_mutation_rows
all_valid_ids <- all(non_zero_mutation_rows$ID %in% zero_mutation_rows$ID)
print(paste("All non-zero mutation rows have a matching zero mutation row:", all_valid_ids))
[1] "All non-zero mutation rows have a matching zero mutation row: TRUE"
# 3. If the above is FALSE, find the problematic IDs
if (!all_valid_ids) {
  problematic_ids <- setdiff(non_zero_mutation_rows$ID, zero_mutation_rows$ID)
  print("IDs with non-zero mutations but no matching zero mutation row:")
  print(problematic_ids)
}

# 4. Check for any IDs in zero_mutation_rows that don't have a corresponding non-zero mutation row
ids_without_non_zero <- setdiff(zero_mutation_rows$ID, non_zero_mutation_rows$ID)
print("IDs with zero mutations but no corresponding non-zero mutation rows:")
[1] "IDs with zero mutations but no corresponding non-zero mutation rows:"
print(ids_without_non_zero)
  [1] "NP_459092"    "NP_600072"    "NP_719188"    "NP_773150"    "NP_951629"    "WP_000175745" "WP_000312550" "WP_000312552"
  [9] "WP_000587144" "WP_000637207" "WP_001566069" "WP_001718077" "WP_002136083" "WP_002306927" "WP_002365665" "WP_002384507"
 [17] "WP_002386810" "WP_002451172" "WP_002459164" "WP_002464119" "WP_002470989" "WP_002480941" "WP_002516137" "WP_002635775"
 [25] "WP_002931755" "WP_002950147" "WP_002983619" "WP_003040947" "WP_003140871" "WP_003148742" "WP_003397230" "WP_003660945"
 [33] "WP_003690263" "WP_003710023" "WP_003769360" "WP_003924348" "WP_003937743" "WP_003946393" "WP_004043660" "WP_004057195"
 [41] "WP_004104442" "WP_004223430" "WP_004371455" "WP_004392428" "WP_004435037" "WP_004561476" "WP_004617034" "WP_004629118"
 [49] "WP_004636328" "WP_004680534" "WP_004742552" "WP_004756163" "WP_004813248" "WP_004821027" "WP_004846132" "WP_004855820"
 [57] "WP_004890475" "WP_004958883" "WP_005051336" "WP_005061314" "WP_005180205" "WP_005195010" "WP_005197432" "WP_005200258"
 [65] "WP_005220873" "WP_005236875" "WP_005254207" "WP_005280603" "WP_005291404" "WP_005382259" "WP_005395774" "WP_005544483"
 [73] "WP_005577110" "WP_006015200" "WP_006060602" "WP_006158942" "WP_006277558" "WP_006365071" "WP_006371387" "WP_006464375"
 [81] "WP_006590259" "WP_006731693" "WP_006732882" "WP_006735633" "WP_006743443" "WP_006770119" "WP_006806603" "WP_006817099"
 [89] "WP_006866478" "WP_006896237" "WP_006935896" "WP_007084769" "WP_007115148" "WP_007173507" "WP_007262186" "WP_007378735"
 [97] "WP_007409527" "WP_007548842" "WP_007745442" "WP_007994849" "WP_008039364" "WP_008407301" "WP_008469667" "WP_008494113"
[105] "WP_008640485" "WP_008678172" "WP_008788229" "WP_008856751" "WP_008880180" "WP_008912732" "WP_008914812" "WP_008927974"
[113] "WP_008974514" "WP_009010779" "WP_009226174" "WP_009248943" "WP_009272085" "WP_009332621" "WP_009361210" "WP_009390512"
[121] "WP_009418917" "WP_009541956" "WP_009570978" "WP_009602063" "WP_009707799" "WP_009795644" "WP_009808745" "WP_009873944"
[129] "WP_009887849" "WP_010025427" "WP_010253752" "WP_010361519" "WP_010488407" "WP_010495720" "WP_010622194" "WP_010689012"
[137] "WP_010705407" "WP_010732391" "WP_010752163" "WP_010753941" "WP_010768697" "WP_010976722" "WP_011006950" "WP_011077508"
[145] "WP_011091268" "WP_011095362" "WP_011283540" "WP_011412710" "WP_011466839"
# 5. Summary statistics
print(paste("Number of unique IDs in zero mutation rows:", n_distinct(zero_mutation_rows$ID)))
[1] "Number of unique IDs in zero mutation rows: 378"
print(paste("Number of unique IDs in non-zero mutation rows:", n_distinct(non_zero_mutation_rows$ID)))
[1] "Number of unique IDs in non-zero mutation rows: 229"
# 6. Distribution of mutation counts for non-zero mutation rows
mutation_distribution <- non_zero_mutation_rows %>%
  group_by(mutations) %>%
  summarise(count = n()) %>%
  arrange(mutations)

print("Distribution of mutation counts:")
[1] "Distribution of mutation counts:"
print(mutation_distribution)

# 7. Check for any unexpected mutation values
unexpected_mutations <- dropout_mutants15_GOF %>%
  filter(mutations < 0 | mutations > 5)  # Adjust the upper bound as needed

if (nrow(unexpected_mutations) > 0) {
  print("Rows with unexpected mutation values:")
  print(unexpected_mutations)
} else {
  print("No unexpected mutation values found.")
}
[1] "No unexpected mutation values found."

Remove unique perfect IDs that have no corresponding mutants with fitness greater than -1.0:

# Step 1: Identify IDs with zero mutations
zero_mutation_ids <- dropout_mutants15_GOF %>%
  filter(mutations == 0 & fitD05D03 < -1.0) %>%
  pull(ID)

# Step 2: Identify IDs with non-zero mutations
non_zero_mutation_ids <- dropout_mutants15_GOF %>%
  filter(mutations != 0 & fitD05D03 > -1.0) %>%
  pull(ID)

# Step 3: Find IDs that have zero mutations but no corresponding non-zero mutation rows
ids_to_remove <- setdiff(zero_mutation_ids, non_zero_mutation_ids)

# Step 4: Remove the rows with these IDs
dropout_mutants15_GOF_cleaned <- dropout_mutants15_GOF %>%
  filter(!(ID %in% ids_to_remove))

# Print summary
print(paste("Number of rows before cleaning:", nrow(dropout_mutants15_GOF)))
[1] "Number of rows before cleaning: 1065"
print(paste("Number of rows after cleaning:", nrow(dropout_mutants15_GOF_cleaned)))
[1] "Number of rows after cleaning: 916"
print(paste("Number of rows removed:", nrow(dropout_mutants15_GOF) - nrow(dropout_mutants15_GOF_cleaned)))
[1] "Number of rows removed: 149"
print(paste("Number of unique IDs removed:", length(ids_to_remove)))
[1] "Number of unique IDs removed: 149"
# Optionally, you can print the removed IDs
print("IDs removed:")
[1] "IDs removed:"
print(ids_to_remove)
  [1] "NP_459092"    "NP_600072"    "NP_719188"    "NP_773150"    "NP_951629"    "WP_000175745" "WP_000312550" "WP_000312552"
  [9] "WP_000587144" "WP_000637207" "WP_001566069" "WP_001718077" "WP_002136083" "WP_002306927" "WP_002365665" "WP_002384507"
 [17] "WP_002386810" "WP_002451172" "WP_002459164" "WP_002464119" "WP_002470989" "WP_002480941" "WP_002516137" "WP_002635775"
 [25] "WP_002931755" "WP_002950147" "WP_002983619" "WP_003040947" "WP_003140871" "WP_003148742" "WP_003397230" "WP_003660945"
 [33] "WP_003690263" "WP_003710023" "WP_003769360" "WP_003924348" "WP_003937743" "WP_003946393" "WP_004043660" "WP_004057195"
 [41] "WP_004104442" "WP_004223430" "WP_004371455" "WP_004392428" "WP_004435037" "WP_004561476" "WP_004617034" "WP_004629118"
 [49] "WP_004636328" "WP_004680534" "WP_004742552" "WP_004756163" "WP_004813248" "WP_004821027" "WP_004846132" "WP_004855820"
 [57] "WP_004890475" "WP_004958883" "WP_005051336" "WP_005061314" "WP_005180205" "WP_005195010" "WP_005197432" "WP_005200258"
 [65] "WP_005220873" "WP_005236875" "WP_005254207" "WP_005280603" "WP_005291404" "WP_005382259" "WP_005395774" "WP_005544483"
 [73] "WP_005577110" "WP_006015200" "WP_006060602" "WP_006158942" "WP_006277558" "WP_006365071" "WP_006371387" "WP_006464375"
 [81] "WP_006590259" "WP_006731693" "WP_006732882" "WP_006735633" "WP_006743443" "WP_006770119" "WP_006806603" "WP_006817099"
 [89] "WP_006866478" "WP_006896237" "WP_006935896" "WP_007084769" "WP_007115148" "WP_007173507" "WP_007262186" "WP_007378735"
 [97] "WP_007409527" "WP_007548842" "WP_007745442" "WP_007994849" "WP_008039364" "WP_008407301" "WP_008469667" "WP_008494113"
[105] "WP_008640485" "WP_008678172" "WP_008788229" "WP_008856751" "WP_008880180" "WP_008912732" "WP_008914812" "WP_008927974"
[113] "WP_008974514" "WP_009010779" "WP_009226174" "WP_009248943" "WP_009272085" "WP_009332621" "WP_009361210" "WP_009390512"
[121] "WP_009418917" "WP_009541956" "WP_009570978" "WP_009602063" "WP_009707799" "WP_009795644" "WP_009808745" "WP_009873944"
[129] "WP_009887849" "WP_010025427" "WP_010253752" "WP_010361519" "WP_010488407" "WP_010495720" "WP_010622194" "WP_010689012"
[137] "WP_010705407" "WP_010732391" "WP_010752163" "WP_010753941" "WP_010768697" "WP_010976722" "WP_011006950" "WP_011077508"
[145] "WP_011091268" "WP_011095362" "WP_011283540" "WP_011412710" "WP_011466839"
# Assign the cleaned data back to dropout_mutants15_GOF if you want to update the original variable
dropout_mutants15_GOF <- dropout_mutants15_GOF_cleaned
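
The cleaning step keeps a parent only when at least one of its mutants passes the fitness cutoff. Here is a compact toy illustration in base R, with all IDs and values invented:

```r
toy <- data.frame(
  ID        = c("P1", "P1", "P2", "P3", "P3"),
  mutations = c(0, 1, 0, 0, 2),
  fitness   = c(-2.1, 0.2, -1.6, -3.0, -0.4),
  stringsAsFactors = FALSE
)

# Parents below the cutoff, and IDs that have at least one mutant above it
parents <- unique(toy$ID[toy$mutations == 0 & toy$fitness < -1])
rescued <- unique(toy$ID[toy$mutations != 0 & toy$fitness > -1])

# Parents with no rescued mutant are dropped
ids_to_remove <- setdiff(parents, rescued)
cleaned <- toy[!(toy$ID %in% ids_to_remove), ]
print(ids_to_remove)
print(nrow(cleaned))
```

In this toy table only P2 is removed, because it has no mutant rows at all; P1 and P3 each keep their parent row and one rescued mutant.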

Dropout Mutants

Summarize the number of perfects and mutants at each AA distance after filtering:

# Create a function to count unique mutIDs for a given number of mutations
dropout_mutants15_GOF_count <- function(data, mutation_count) {
  length(unique(subset(data, mutations == mutation_count)$mutID))
}

# Create a vector of counts for mutations 1-5
dropout15_counts <- sapply(1:5, function(x) dropout_mutants15_GOF_count(dropout_mutants15_GOF, x))

# Count perfects separately
perfects_count <- length(unique(subset(dropout_mutants15_GOF, mutations == 0 & fitD05D03 < -1.0)$mutID))

# Create a data frame with the results, including the summary row
dropout_mutants15_GOF_table <- data.frame(
  Mutations = c("Perfects (fit < -1.0)", "1 Mutation", "2 Mutations", "3 Mutations", "4 Mutations", "5 Mutations", "Total Mutants"),
  Count = c(perfects_count, dropout15_counts, sum(dropout15_counts))
)

# Print the table
print(dropout_mutants15_GOF_table)
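
An equivalent tally can be sketched with table() in base R. Fixing the factor levels to 0:5 is a design choice made here so that mutation counts with no variants still appear as zero; the toy data are invented.

```r
toy <- data.frame(
  mutID     = c("P1", "P1_A10V", "P1_A10V", "P1_G20S_L30F", "P2_K7R"),
  mutations = c(0, 1, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Drop repeated mutIDs first so each variant is counted once
uniq <- toy[!duplicated(toy$mutID), ]

# Fixed levels keep empty categories (3, 4, 5 mutations) in the output
counts <- table(factor(uniq$mutations, levels = 0:5))
print(counts)
```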

GOF Fitness

GoF Fitness: Separate the dropout_mutants15_GOF dataset into two new dataframes, where DF1 contains perfects and DF2 contains mutants:

# Create a dataframe with mutations == 0
dropout_mutants15_GOF_no_mutations <- dropout_mutants15_GOF[dropout_mutants15_GOF$mutations == 0, ]

# Create a dataframe with mutations != 0
dropout_mutants15_GOF_with_mutations <- dropout_mutants15_GOF[dropout_mutants15_GOF$mutations != 0, ]

# ALTERNATIVE: Create a dataframe with mutations == 1 (only uses 1 aa mutation variants)
#dropout_mutants15_GOF_with_mutations <- dropout_mutants15_GOF[dropout_mutants15_GOF$mutations == 1, ]

Now, re-combine these dataframes to calculate the fitness change (delta) between mutants and their parent homologs:

# Step 1: Prepare the reference dataframe
df_reference <- dropout_mutants15_GOF_no_mutations %>%
  dplyr::select(ID, fitD05D03) %>%  # Namespace-qualified to avoid masking by other loaded packages
  dplyr::rename(reference_fitD05D03 = fitD05D03)

# Step 2: Join and calculate the difference (mutant fitness minus parent fitness)
dropout_mutants15_GOF_fitness <- dropout_mutants15_GOF_with_mutations %>%
  left_join(df_reference, by = "ID") %>%
  mutate(fitD05D03 = fitD05D03 - reference_fitD05D03) %>%
  dplyr::select(ID, mutID, mutations, fitD05D03)

# Print summary statistics
print(paste("Number of Mutants:", nrow(dropout_mutants15_GOF_fitness)))
[1] "Number of Mutants: 687"
print(paste("Unique IDs:", length(unique(dropout_mutants15_GOF_fitness$ID))))
[1] "Unique IDs: 229"
print(paste("Range of fitD05D03:", 
            paste(round(range(dropout_mutants15_GOF_fitness$fitD05D03, na.rm = TRUE), 1), collapse = " to ")))
[1] "Range of fitD05D03: 0 to 5.9"
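
The delta calculation above is a lookup of each mutant's parent fitness followed by a subtraction. A minimal sketch of the same arithmetic in Python (which this notebook already uses for the residue-mapping script), with hypothetical IDs and fitness values that are not taken from the dataset:

```python
# Hypothetical fitness values, for illustration only
parent_fitness = {"homolog_A": -2.0, "homolog_B": -1.5}
mutants = [
    {"ID": "homolog_A", "mutID": "homolog_A_A26T", "fitness": 0.5},
    {"ID": "homolog_B", "mutID": "homolog_B_G30S", "fitness": -1.0},
    {"ID": "homolog_C", "mutID": "homolog_C_L5P", "fitness": 0.2},  # no parent measured
]

def delta_fitness(mutant, parents):
    """Return mutant fitness minus parent fitness, or None if no parent exists."""
    reference = parents.get(mutant["ID"])
    if reference is None:
        return None  # mirrors the NA produced by an unmatched left join
    return mutant["fitness"] - reference

deltas = {m["mutID"]: delta_fitness(m, parent_fitness) for m in mutants}
print(deltas)
# {'homolog_A_A26T': 2.5, 'homolog_B_G30S': 0.5, 'homolog_C_L5P': None}
```

A positive delta means the mutant outperforms its parent homolog; a mutant whose parent is missing from the reference table has no delta.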

Boxplot: Plot mutant fitness relative to parent variant by number of mutations:

GOF_muts_fitness_by_muts_plot <- ggplot(dropout_mutants15_GOF_fitness, 
                                        aes(x = factor(mutations), y = fitD05D03)) +
  geom_boxplot() +
  labs(title = "fitD05D03 by Number of Mutations", x = "Number of Mutations", y = "fitD05D03")

print(GOF_muts_fitness_by_muts_plot)

Histogram of Mutant Fitness: Show the distribution of mutant fitness relative to the parent homolog (range approximately 0 to 5.9).

GOF_muts_fitness_dist_plot <- ggplot(dropout_mutants15_GOF_fitness, aes(x = fitD05D03)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "Distribution of fitD05D03", x = "fitD05D03", y = "Count")

print(GOF_muts_fitness_dist_plot)

GOF Alignment

FASTA: Generate a FASTA file of the perfects in dropout_mutants15_GOF whose IDs are shared with at least one 1-AA mutant, for use in GoF analysis:

# First, let's ensure we have the correct unique IDs for mutations == 1
dropout_mutants15_GOF_1mut_unique_ids <- dropout_mutants15_GOF_fitness %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  pull(ID)

# Now, let's use these IDs to filter the dropout_mutants15_GOF dataset
dropout_mutants15_GOF_1mut_unique_id_seq <- dropout_mutants15_GOF %>%
  filter(ID %in% dropout_mutants15_GOF_1mut_unique_ids & mutations == 0) %>%
  select(ID, seq)

# Create the sequences in FASTA format
dropout_mutants15_GOF_fasta_content <- paste(">", dropout_mutants15_GOF_1mut_unique_id_seq$ID, "\n", dropout_mutants15_GOF_1mut_unique_id_seq$seq, "\n", sep = "", collapse = "")

# Define the file path in the working directory
dropout_mutants15_GOF_fasta_file_path <- file.path(getwd(), "GOF/MSA_Dropouts/Comp/FASTA/Lib15.GoF.perfects.complementation.fasta")

# Write the FASTA content to the file
writeLines(dropout_mutants15_GOF_fasta_content, 
           con = dropout_mutants15_GOF_fasta_file_path)

Alignment: Use the clustalo executable to align the protein sequences associated with the dropout perfects. This will align the FASTA file: Lib15.GoF.perfects.complementation.fasta for use in GoF analysis.

./Scripts/clustalo -i GOF/MSA_Dropouts/Comp/FASTA/Lib15.GoF.perfects.complementation.fasta -o GOF/MSA_Dropouts/Comp/FASTA/Lib15.GoF.perfects.complementation.tree.aligned.mod.aln --outfmt=clustal --force
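
The mapping script in the next step takes the number of aligned sequences as a hard-coded input (num_samples_in_file). As a sanity check, that count could be derived from the FASTA file instead. A minimal sketch, assuming standard FASTA headers that begin with ">"; the function name count_fasta_records and the toy records are hypothetical:

```python
def count_fasta_records(fasta_text):
    """Count sequences in FASTA-formatted text by counting header lines."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))

# Toy FASTA content (hypothetical sequences, not from the DHFR library)
toy_fasta = ">homolog_A\nMKTLI\n>homolog_B\nMSTLI\n"
n_records = count_fasta_records(toy_fasta)
print(n_records)  # 2
```

For a file on disk, the same function can be applied to the text returned by reading the FASTA file.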

Mapping Residues: Use the following map.aligned.residues.py Python script to generate one CSV file per homolog that maps each amino acid's residue position in the homolog to its column in the Clustal alignment:

import time
import csv

##################################
#INPUTS:

base_path = ""
trees_path_prefix = base_path+""

#clustal format alignment file
align_file_in = [trees_path_prefix+"GOF/MSA_Dropouts/Comp/FASTA/Lib15.GoF.perfects.complementation.tree.aligned.mod.aln"]

#number of seqs in each alignment file
num_samples_in_file = [197] #New FASTA w/ mutant fit > -1 (+1 from actual file count)

##################################
#OUTPUTS:

msa_map_out_path = [trees_path_prefix+"GOF/MSA_Dropouts/Comp/"]

# Loop to generate .csv files for each ID
for alni in range(1):#len(align_file_in)):
    #print(alni)
    
    ##################################
    #VARIABLES:
    
    #ID as key, align as value
    align_dict = dict()
    
    #num_samples = 419
    num_samples = num_samples_in_file[alni]
    
    #pos key, consensus pos val
    IDaadictlist = [dict() for x in range(num_samples)]
    
    IDtoindexdict = dict()
    indexdtoIDict = dict()
    
    ##################################
    #CODE:
    
    line_count = 0
    #loop over all alignments:
    print(align_file_in[alni])
    for line in open(align_file_in[alni]):
        #skip header
        if line_count > 1:
            listWords = line.split('    ')
            ID = listWords[0]
            align = line[16:].rstrip()
            if ID.strip() != "":
                align_dict[ID] = align_dict.get(ID, "") + align.replace(" ", "")
        line_count += 1
    
    #print("NP_414590")
    #print(align_dict["NP_414590"])

    counter = 0
    for ID in align_dict:
        #print(ID)
        #print(align_dict[ID])
        IDtoindexdict[ID] = counter
        indexdtoIDict[counter]=ID
        align = align_dict[ID]
        
        aacounter = 1
        
        
        for i in range(len(align)):
            if align[i] != "-":
                
                #print(str(counter)+" "+str(aacounter))
                IDaadictlist[counter][aacounter]=i+1
                aacounter += 1
        counter += 1
        
    #print(len(IDaadictlist))
    # write one CSV per sequence actually read from the alignment
    for i in range(len(indexdtoIDict)):
        #print(indexdtoIDict[i])
        #print(i)
        #print(alni)
        #print(indexdtoIDict[i])
        csvfile = open(str(msa_map_out_path[alni]+indexdtoIDict[i]+".csv"), 'w')
        fieldnames = ['orth_aanum','msa_aanum']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for j in IDaadictlist[i]:
            #print(str(j)+" "+str(IDaadictlist[i][j]))
            #save all data:
            writer.writerow({'orth_aanum':str(j),'msa_aanum':str(IDaadictlist[i][j])})
        csvfile.close()
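
The core of the script is a walk along each aligned sequence that numbers residues while skipping gap characters. A minimal sketch of that single step, using a made-up two-sequence alignment; the function name map_residues and the toy sequences are illustrative and not part of the original script:

```python
def map_residues(aligned_seq, gap="-"):
    """Map 1-based ungapped residue numbers to 1-based alignment columns."""
    mapping = {}
    residue_number = 1
    for column, char in enumerate(aligned_seq, start=1):
        if char != gap:
            mapping[residue_number] = column
            residue_number += 1
    return mapping

# Toy alignment (hypothetical sequences, not from the DHFR library)
toy_alignment = {
    "homolog_A": "MK-TLI",
    "homolog_B": "M-STLI",
}
toy_maps = {seq_id: map_residues(seq) for seq_id, seq in toy_alignment.items()}
print(toy_maps["homolog_A"])  # {1: 1, 2: 2, 3: 4, 4: 5, 5: 6}
```

Each key corresponds to orth_aanum and each value to msa_aanum in the CSV files written by the script.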

GOF Plots

Find GoF Perfects for Dropouts

# Create a data frame of unique IDs
mutants15_to_plot <- dropout_mutants15_GOF_fitness %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  select(ID)
# Initialize an empty vector to store IDs of mutants to be removed
mutants_to_remove <- character()

# Check for missing MSA files
for (i in 1:nrow(mutants15_to_plot)) {
  mutant15_current_temp <- mutants15_to_plot$ID[i]
  if (!file.exists(paste("GOF/MSA_Dropouts/Comp/", mutant15_current_temp, ".csv", sep = ""))) {
    mutants_to_remove <- c(mutants_to_remove, mutant15_current_temp)
  }
}

# Output the results
if (length(mutants_to_remove) > 0) {
  cat("The following mutants will be removed due to missing MSA files:\n")
  print(mutants_to_remove)
  cat("\nTotal number of mutants to be removed:", length(mutants_to_remove), "\n")
  
  # Remove the mutants without MSA files
  # drop = FALSE keeps the one-column data frame from collapsing to a vector
  mutants15_to_plot <- mutants15_to_plot[!mutants15_to_plot$ID %in% mutants_to_remove, , drop = FALSE]
  cat("\nMutants remaining:", nrow(mutants15_to_plot), "\n")
} else {
  cat("All mutants have corresponding MSA files. No mutants will be removed.\n")
}
All mutants have corresponding MSA files. No mutants will be removed.
# If you want to see the remaining mutants
print(mutants15_to_plot)

Read in the E. coli map:

ecoli_map <- read.csv(file=paste("GOF/MSA_Dropouts/Comp/NP_414590.csv", sep=""), head=TRUE, sep=",")

Make a new data frame which will keep all info

GOF_fitness_map <- data.frame(position=numeric(),
                              aa=character(),
                              mutations=numeric(),
                              fitness=numeric(),
                              posortho=numeric(),
                              ingap=character(),
                              mutID=character(),
                              ID=character())

aminoacids <- data.frame(aa=c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T','X'),
                         aanum=c(1:21))

MSA Mapping: Map the mutants (fitness gain over their parent perfect >= 0.5) onto all perfects (fit < -1) for GoF analysis:

#loop over all perfects
for (iii in 1:nrow(mutants15_to_plot)){
  
  #current ortholog:
  mutant_current <- as.character(mutants15_to_plot$ID[iii])
  
  #length of name
  name_size = nchar(paste(mutant_current,"_",sep=""))
  
  #get the MSA mapping
  mutant_map <- read.csv(file=paste("GOF/MSA_Dropouts/Comp/",mutant_current,".csv",sep=""),
                         head=TRUE,sep=",")
  
  #grab the mutants with a fitness increase (GoF) of at least 0.5 (do not include perfects from dataset)
  GOFmutIDinfo_temp <- dropout_mutants15_GOF_fitness %>% ###UPDATED CODE
    filter(ID == mutant_current) %>%
    filter(mutations != 0) %>% ###UPDATED CODE
    filter(fitD05D03 >= 0.5) ###CHANGE VALUE AS NEEDED (>= 0.5 is default)
  
  # Check if GOFmutIDinfo_temp is empty
  if(nrow(GOFmutIDinfo_temp) == 0) {
    warning(paste("No non-zero mutation data found for ID:", mutant_current))
    next  # Skip to the next iteration of the outer loop
  }
  
  #loop over all mutants for this construct:
  for (mn in 1:nrow(GOFmutIDinfo_temp)) {
    
    #this mutants fitness
    gof_fit_temp <- GOFmutIDinfo_temp$fitD05D03[mn]  # or whichever fitness column you're using
    
    #grab the mut name
    mutations_names <- as.character(GOFmutIDinfo_temp$mutID[mn])
    
    #grab only the relevant portion of the name
    mutations_names <- substr(mutations_names, name_size+1, nchar(mutations_names))
    
    ## split the mutation string into individual mutations at each underscore
    s <- strsplit(mutations_names, "_")
    
    for (mutnum in 1:GOFmutIDinfo_temp$mutations[mn]){
      
      #grab the corresponding mutation string
      mutcurr<-s[[1]][mutnum]
      
      #get the position
      mutpos <- as.numeric(str_extract(mutcurr, "[0-9]+"))
      
      #get ending aa
      to_aa <- substr(mutcurr, nchar(mutpos)+2, nchar(mutcurr))
      
      #find the number in the consensus seq
      gof_cons_aanum_index <- which(mutant_map$orth_aanum == mutpos)
      
      if (length(gof_cons_aanum_index) > 0) {
        gof_cons_aanum <- mutant_map$msa_aanum[gof_cons_aanum_index]
        
        #does this map to a non-gap
        if (gof_cons_aanum %in% ecoli_map$msa_aanum){
          
          #the corresponding e.coli residue
          e_coli_residue <- ecoli_map$orth_aanum[which(ecoli_map$msa_aanum == gof_cons_aanum)]
          
          #add this point to the data
          GOF_fitness_map <- rbind(GOF_fitness_map,
                                   data.frame(position=e_coli_residue,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp$mutations[mn],
                                              fitness=gof_fit_temp,
                                              posortho=mutpos,
                                              ingap="No",
                                              mutID=GOFmutIDinfo_temp$mutID[mn],
                                              ID=GOFmutIDinfo_temp$ID[mn]))
          
        } else {
          #if it's here it maps to a gap
          
          #add this point to the data
          GOF_fitness_map <- rbind(GOF_fitness_map,
                                   data.frame(position=-1,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp$mutations[mn],
                                              fitness=gof_fit_temp,
                                              posortho=mutpos,
                                              ingap="Yes",
                                              mutID=GOFmutIDinfo_temp$mutID[mn],
                                              ID=GOFmutIDinfo_temp$ID[mn]))
          
        }
      } else {
        warning(paste("No matching orth_aanum found for mutpos:", mutpos, "in ID:", mutant_current))
        # You might want to handle this case, perhaps by skipping this mutation or adding it to a separate list for review
      }
    }
  }
}
Warning: No non-zero mutation data found for ID: WP_000175752
Warning: No matching orth_aanum found for mutpos: 162 in ID: WP_000637215
Warning: No non-zero mutation data found for ID: WP_001408245
Warning: No matching orth_aanum found for mutpos: 170 in ID: WP_002820451
Warning: No matching orth_aanum found for mutpos: 162 in ID: WP_004570408
Warning: No non-zero mutation data found for ID: WP_004832905
Warning: No matching orth_aanum found for mutpos: 166 in ID: WP_004907189
Warning: No matching orth_aanum found for mutpos: 160 in ID: WP_005917048
Warning: No matching orth_aanum found for mutpos: 169 in ID: WP_006923903
Warning: No matching orth_aanum found for mutpos: 163 in ID: WP_008170691
Warning: No non-zero mutation data found for ID: WP_008985437
Warning: No matching orth_aanum found for mutpos: 161 in ID: WP_009435065
Warning: No non-zero mutation data found for ID: WP_010934590
Warning: No matching orth_aanum found for mutpos: 167 in ID: WP_011581087
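
The inner loop above does two things for every mutation: it parses a token into a position and a substituted amino acid, then projects that position onto E. coli numbering through the shared alignment column. A compact Python sketch of that logic, assuming tokens of the form one letter, digits, then the new residue (for example "A26T"); the function names and the toy maps are hypothetical:

```python
import re

def parse_mutation(token):
    """Split a token such as 'A26T' into (from_aa, position, to_aa)."""
    match = re.fullmatch(r"([A-Za-z*])(\d+)(.+)", token)
    if match is None:
        raise ValueError(f"Unrecognized mutation token: {token}")
    return match.group(1), int(match.group(2)), match.group(3)

def to_reference_position(position, homolog_map, reference_map):
    """Project a homolog residue number onto the reference numbering.

    Returns None if the residue is absent from the homolog map, -1 if its
    alignment column is a gap in the reference, else the reference residue.
    """
    column = homolog_map.get(position)
    if column is None:
        return None
    column_to_reference = {col: res for res, col in reference_map.items()}
    return column_to_reference.get(column, -1)

# Hypothetical maps: residue number -> alignment column
reference_map = {1: 1, 2: 2, 3: 4, 4: 5}   # reference has a gap at column 3
homolog_map = {1: 1, 2: 3, 3: 4, 4: 5}     # homolog has a gap at column 2

results = {}
for token in ["A1T", "K2R", "G3S", "L9P"]:
    _, pos, to_aa = parse_mutation(token)
    results[token] = (to_aa, to_reference_position(pos, homolog_map, reference_map))
print(results)
# {'A1T': ('T', 1), 'K2R': ('R', -1), 'G3S': ('S', 3), 'L9P': ('P', None)}
```

In this sketch the -1 case corresponds to the rows stored with ingap = "Yes", and the None case corresponds to the "No matching orth_aanum" warnings.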

Collapse the GOF fitness values by aa position along the protein sequence:

GOF_fitness_collapsed_by_pos <- GOF_fitness_map %>%
  filter(position > 0) %>%
  group_by(position) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))

GOF Mutant No. Plot

Plot the number of gain-of-function mutants recovered at each aa position along the protein sequence, with dashed reference lines at the mean number of GOF mutants per position (dark blue) and at 2 SD above the mean (red). Positions at or above the 2 SD line are considered significant positions that positively influence the parent variant's ability to complement metabolic function in the E. coli knockout model.

GoF_plot <- ggplot(GOF_fitness_collapsed_by_pos, aes(x=position, y=numpoints, color=numortho)) +
  geom_segment(aes(x = 0, y = mean(numpoints)+2*sd(numpoints), 
                   xend = 160, 
                   yend = mean(numpoints)+2*sd(numpoints)),linetype=2,colour = "red2")+
  geom_segment(aes(x = 0, y = mean(numpoints), xend = 160, yend = mean(numpoints)),linetype=2,colour = "darkblue")+
  geom_point(size=1.8)+
  labs(x = "Position (aa)", y ="Number of gain-of-function mutants",color="") +
  scale_color_gradientn(colours = c("darkblue", "red"),
                        name="Num.\nUniq.\nHomo.",
                        na.value="grey", 
                        limits = c(0,1.1*max(GOF_fitness_collapsed_by_pos$numortho))) +
  scale_x_continuous(breaks=seq(0,160,20))+
  theme_minimal() +
  theme(axis.line = element_line(colour = 'black', size = 1.0), 
        axis.ticks = element_line(colour = "black", size = 1.0),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.position = "none")

# Add marginal plot
GoF_plot_with_marginal <- ggExtra::ggMarginal(GoF_plot,
                                              type = "histogram",
                                              margins = "y",
                                              bins=21,
                                              col = 'black',
                                              fill = 'red2')
Warning: All aesthetics have length 1, but the data has 153 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
Warning: All aesthetics have length 1, but the data has 153 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
Warning: All aesthetics have length 1, but the data has 153 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
Warning: All aesthetics have length 1, but the data has 153 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
# Display the plot
print(GoF_plot_with_marginal)

Print a summary table of the significant aa positions (at or above the 2 SD cutoff) along the protein sequence:

GOF_fitness_collapsed_by_pos_2sigma <- GOF_fitness_collapsed_by_pos %>%
  filter(numpoints >= (mean(GOF_fitness_collapsed_by_pos$numpoints)+2*sd(GOF_fitness_collapsed_by_pos$numpoints)))
print(GOF_fitness_collapsed_by_pos_2sigma)
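
The significance rule used here is a simple threshold: a position is kept when its mutant count is at least the mean plus two sample standard deviations of the per-position counts. A minimal sketch of that rule in Python, using made-up counts that are not results from this analysis:

```python
from statistics import mean, stdev

# Hypothetical counts of GOF mutants per position (not real results)
counts_by_position = {10: 2, 20: 3, 30: 1, 40: 2, 50: 20, 60: 3, 70: 2, 80: 1, 90: 2}

values = list(counts_by_position.values())
cutoff = mean(values) + 2 * stdev(values)  # stdev() is the sample SD, like R's sd()

significant = sorted(pos for pos, n in counts_by_position.items() if n >= cutoff)
print(significant)  # [50]
```

Because the mean and SD are computed from all positions, including the outliers themselves, a strong hotspot inflates the cutoff, which makes this rule conservative.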

Calculate all Data and Stats:

GOF_fitness_collapsed_all <- GOF_fitness_map %>%
  filter(position > 0) %>%
  group_by(position, aa) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))
`summarise()` has grouped output by 'position'. You can override using the `.groups` argument.
gof_aa_dim <- nrow(aminoacids)
gof_ref_len <- nrow(ecoli_map)
#these matrices have the fitness/num/sd for each aa at each position:
gof_matrix = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_num = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_sd = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_numortho = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)

#populate matrix
for (i in 1:nrow(GOF_fitness_collapsed_all)){
  
  gof_matrix[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all$aa[i])),GOF_fitness_collapsed_all$position[i]] <- as.numeric(GOF_fitness_collapsed_all$fitval[i])
  gof_matrix_num[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all$aa[i])),GOF_fitness_collapsed_all$position[i]] <- as.numeric(GOF_fitness_collapsed_all$numpoints[i])
  gof_matrix_sd[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all$aa[i])),GOF_fitness_collapsed_all$position[i]] <- as.numeric(GOF_fitness_collapsed_all$stdfit[i])
  gof_matrix_numortho[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all$aa[i])),GOF_fitness_collapsed_all$position[i]] <- as.numeric(GOF_fitness_collapsed_all$numortho[i])
}

rownames(gof_matrix)<-aminoacids$aa
colnames(gof_matrix)<-c(1:gof_ref_len)
rownames(gof_matrix_num)<-aminoacids$aa
colnames(gof_matrix_num)<-c(1:gof_ref_len)
rownames(gof_matrix_sd)<-aminoacids$aa
colnames(gof_matrix_sd)<-c(1:gof_ref_len)
rownames(gof_matrix_numortho)<-aminoacids$aa
colnames(gof_matrix_numortho)<-c(1:gof_ref_len)

gof_matrix_melt <- melt(gof_matrix)
gof_matrix_num_melt <- melt(gof_matrix_num)
gof_matrix_sd_melt <- melt(gof_matrix_sd)
gof_matrix_numortho_melt <- melt(gof_matrix_numortho)

# Rename columns to "X1" and "X2"
names(gof_matrix_melt)[names(gof_matrix_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_num_melt)[names(gof_matrix_num_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_sd_melt)[names(gof_matrix_sd_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_numortho_melt)[names(gof_matrix_numortho_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")

gof_matrix_melt_only_GOFpos <- gof_matrix_melt %>%
  filter(X2 == 17 |
         X2 == 97 |
         X2 == 98 |
         X2 == 102 |
         X2 == 103 |
         X2 == 104 |
         X2 == 107)

gof_matrix_melt_only_GOFpos$mutposnum <- 0
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==17)] <- 1
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==97)] <- 2
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==98)] <- 3
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==102)] <- 4
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==103)] <- 5
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==104)] <- 6
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==107)] <- 7

gof_matrix_melt_only_GOFpos$aanum <- 0
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="A")] <- 12
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="C")] <- 10
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="D")] <- 5
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="E")] <- 4
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="F")] <- 19
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="G")] <- 11
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="H")] <- 3
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="I")] <- 15
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="K")] <- 1
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="L")] <- 14
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="M")] <- 16
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="N")] <- 6
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="P")] <- 17
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="Q")] <- 7
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="R")] <- 2
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="S")] <- 9
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="T")] <- 8
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="V")] <- 13
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="W")] <- 20
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="Y")] <- 18
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="X")] <- 21

gof_matrix_melt_only_GOFpos_wnum <- gof_matrix_melt_only_GOFpos %>%
  inner_join(gof_matrix_num_melt,by=c("X1","X2")) %>%
  dplyr::rename(mutnum=value.y,value=value.x)
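
The hard-coded aanum assignments above give each amino acid its rank in the order used for the y-axis labels of the plot that follows. The same lookup can be derived from that order directly, which is also what the notebook does with match() when building rect_data. A short Python sketch of the idea; the variable names are illustrative:

```python
# Plot order used for the y-axis labels, copied from the notebook
aa_plot_order = ["K", "R", "H", "E", "D", "N", "Q", "T", "S", "C", "G",
                 "A", "V", "L", "I", "M", "P", "Y", "F", "W", "X"]

# 1-based rank of each amino acid in the plot order
aa_to_index = {aa: rank for rank, aa in enumerate(aa_plot_order, start=1)}

# Spot-check against a few of the hard-coded values
print(aa_to_index["A"], aa_to_index["K"], aa_to_index["X"])  # 12 1 21
```

Deriving the index from a single ordered vector keeps the tile positions and the axis labels in agreement if the order is ever changed.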

GOF Position Plot

Plot the median fitness of each GoF mutation at the significant positions, with the number of mutants observed at each AA:

# Define the order of amino acids for the rectangles
rect_order <- c("E", "G", "R", "Q", "F", "L", "A")

# Create a data frame for the rectangles
rect_data <- data.frame(
  aanum = match(rect_order, c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X")),
  xmin = seq(0.5, by = 1, length.out = length(rect_order)),
  xmax = seq(1.5, by = 1, length.out = length(rect_order)))

#plot the data from all mutants:
GOF_fit_nummut_plot <- ggplot(gof_matrix_melt_only_GOFpos_wnum, 
       aes(x=mutposnum, y=aanum,
           fill=value,
           label=mutnum)) +
  geom_tile() +
  geom_text() +
  # Add black rectangles
  geom_rect(data = rect_data,
            aes(xmin = xmin, xmax = xmax, ymin = aanum - 0.5, ymax = aanum + 0.5),
            fill = NA, color = "black", inherit.aes = FALSE) +
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                      high = "red",
                      name="Fitness",
                      na.value="grey",
                      limit = c(0, max(gof_matrix_melt_only_GOFpos_wnum$value))) +
  theme_minimal()+
  scale_x_continuous(name="Position (aa)",
                     breaks=c(1,2,3,4,5,6,7),
                     labels=c("17","97","98","102","103","104","107"))+
  scale_y_continuous(name="Amino acid", 
                     breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21),
                     labels=c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X"))

print(GOF_fit_nummut_plot)

Plot the GOF mutant fitness across the protein sequence:

ggplot(data = gof_matrix_melt, aes(x=X2, y=X1, fill=value)) +
  geom_tile() +
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                       high = "red",
                       name="Fitness",
                       na.value="grey",
                       limit = c(0,max(gof_matrix_melt$value))) +
  theme_minimal() + 
  scale_x_continuous(breaks=seq(0,150,10))

Plot the number of mutants observed at each position along the protein sequence:

ggplot(data = gof_matrix_num_melt, aes(x=X2, y=X1, fill=value)) +
  geom_tile()+ 
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                      high = "red",
                      name="#points",
                      na.value="grey", 
                      limit = c(0,max(gof_matrix_num_melt$value))) +
  theme_minimal() + 
  scale_x_continuous(breaks=seq(0,150,10))

Plot the number of unique homologs contributing mutants at each position along the protein sequence:

ggplot(data = gof_matrix_numortho_melt, aes(x=X2, y=X1, fill=value)) +
  geom_tile()+ 
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                      high = "red",
                      name="#homologs",
                      na.value="grey", 
                      limit = c(0,max(gof_matrix_numortho_melt$value))) +
  theme_minimal() + 
  scale_x_continuous(breaks=seq(0,150,10))

Plot the standard deviation of fitness for each position along the protein sequence:

ggplot(data = gof_matrix_sd_melt, aes(x=X2, y=X1, fill=value)) +
  geom_tile()+ labs(x = "Position (aa)", y ="Amino acid",color="") +
  scale_fill_gradient2(low = "blue", 
                       high = "red", 
                       mid="gold",
                       name="std(Fitness)",
                       na.value="grey", 
                       limit = c(0,1.1*max(gof_matrix_sd_melt$value))) +
  theme_minimal() + 
  scale_x_continuous(breaks=seq(0,150,10))

GOF BMS Position Plot

GOF_fitness_collapsed_by_pos_2sigma_protein_info <- protein_info_1H1T %>%
  dplyr::rename(position=pos) %>%
  filter(position %in% GOF_fitness_collapsed_by_pos_2sigma$position) %>%
  right_join(GOF_fitness_collapsed_by_pos_2sigma,by="position") %>%
  arrange(position)
print(GOF_fitness_collapsed_by_pos_2sigma_protein_info)
rsa_vs_cons_plot <- ggplot(GOF_fitness_collapsed_by_pos_2sigma_protein_info,
             aes(x=cons, y=RSA, color=as.factor(position), fill=as.factor(position), shape=as.factor(position))) +
  geom_point(alpha=0.9, size=4, stroke=1) +
  labs(x = "Site Conservation", y ="Relative Solvent Accessibility", color="Residue", fill="Residue", shape="Residue") +
  scale_color_manual(name = "Residue",
                     values = c("red", "green", "blue", "purple", "orange", "cyan", "magenta")) +
  scale_fill_manual(name = "Residue",
                    values = c("red", "green", "blue", "purple", "orange", "cyan", "magenta")) +
  scale_shape_manual(name = "Residue",
                     values = c(21, 22, 23, 24, 25, 21, 22)) +
  theme_minimal() +
  theme(legend.position = "right")

rsa_vs_cons_plot

Extract the fitness values for each significant aa position from the BMS analysis:

BMS_matrix_perfects_and_1_melt_GOFonly <- BMS_matrix15_perfects_and_1_melt %>%
  filter(X1 != "X") %>%
  filter(X2 == 17 | X2 == 97 | X2 == 98 | X2 == 102 | X2 == 103 | X2 == 104 | X2 == 107)

BMS_matrix_perfects_and_1_melt_GOFonly$mutposnum <- 0
BMS_matrix_perfects_and_1_melt_GOFonly$mutposnum[which(BMS_matrix_perfects_and_1_melt_GOFonly$X2==17)] <- 1
BMS_matrix_perfects_and_1_melt_GOFonly$mutposnum[which(BMS_matrix_perfects_and_1_melt_GOFonly$X2==97)] <- 2
BMS_matrix_perfects_and_1_melt_GOFonly$mutposnum[which(BMS_matrix_perfects_and_1_melt_GOFonly$X2==98)] <- 3
BMS_matrix_perfects_and_1_melt_GOFonly$mutposnum[which(BMS_matrix_perfects_and_1_melt_GOFonly$X2==102)] <- 4
BMS_matrix_perfects_and_1_melt_GOFonly$mutposnum[which(BMS_matrix_perfects_and_1_melt_GOFonly$X2==103)] <- 5
BMS_matrix_perfects_and_1_melt_GOFonly$mutposnum[which(BMS_matrix_perfects_and_1_melt_GOFonly$X2==104)] <- 6
BMS_matrix_perfects_and_1_melt_GOFonly$mutposnum[which(BMS_matrix_perfects_and_1_melt_GOFonly$X2==107)] <- 7

BMS_matrix_perfects_and_1_melt_GOFonly_wnum <- BMS_matrix_perfects_and_1_melt_GOFonly %>%
  inner_join(BMS_matrix15_perfects_and_1_num_melt,by=c("X1","X2")) %>%
  dplyr::rename(mutnum=value.y,value=value.x)

names(BMS_matrix_perfects_and_1_melt_GOFonly_wnum)
[1] "X1"        "X2"        "value"     "WTcolor"   "aanum"     "mutposnum" "mutnum"   

Determine the minimum and maximum fitness values for plotting:

min(BMS_matrix_perfects_and_1_melt_GOFonly$value, na.rm = TRUE)
[1] -2.83005
max(BMS_matrix_perfects_and_1_melt_GOFonly$value, na.rm = TRUE)
[1] 1.214291

Plot the fitness values of the significant aa positions based on the BMS analysis for Complementation. Black rectangles indicate the aa corresponding to the WT DHFR homolog. White rectangles indicate the aa with the highest number of mutants for each position along the protein sequence.

# Define the order of amino acids for the black rectangles
rect_order <- c("E", "G", "R", "Q", "F", "L", "A")

# Create a data frame for the black rectangles
rect_data <- data.frame(
  aanum = match(rect_order, c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W")),
  xmin = seq(0.5, by = 1, length.out = length(rect_order)),
  xmax = seq(1.5, by = 1, length.out = length(rect_order)))

# Find the amino acid with the highest mutnum for each position
highest_mutnum <- BMS_matrix_perfects_and_1_melt_GOFonly_wnum %>%
  group_by(mutposnum) %>%
  slice_max(order_by = as.numeric(mutnum), n = 1) %>%
  ungroup()

# Create the plot
BMS_GoF_fit_plot <- ggplot(BMS_matrix_perfects_and_1_melt_GOFonly_wnum, 
       aes(x=mutposnum, y=aanum,
           fill=value,
           label=mutnum)) +
  geom_tile() +
  geom_text() +
  # Add black rectangles
  geom_rect(data = rect_data,
            aes(xmin = xmin, xmax = xmax, ymin = aanum - 0.5, ymax = aanum + 0.5),
            fill = NA, color = "black", inherit.aes = FALSE) +
  # Add white rectangles around the highest mutnum
  geom_rect(data = highest_mutnum,
            aes(xmin = mutposnum - 0.5, xmax = mutposnum + 0.5, 
                ymin = aanum - 0.5, ymax = aanum + 0.5),
            fill = NA, color = "white", size = 1, inherit.aes = FALSE) +
  labs(x = "Position (aa)",
       y ="Amino acid", color="") +
  scale_fill_gradient2(low = "blue", high = "red", mid="gold",
                       name="Fitness", na.value="grey", 
                       limit = c(-3, 1.25)) +  # upper bound covers the observed maximum (1.21)
  theme_minimal() +
  scale_x_continuous(name="Position (aa)", 
                     breaks=c(1,2,3,4,5,6,7),
                     labels=c("17","97","98","102","103","104","107")) +
  scale_y_continuous(name="Amino acid", 
                     breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
                     labels=c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W"))

print(BMS_GoF_fit_plot)

MIC (0.5 ug/mL TMP)

Dropout Perfects

Retrieve all perfects that complement (log-fold change greater than -1.0 in the COMPLEMENTATION dataset, fitD05D03) but drop out under 0.5 ug/mL TMP (log-fold change less than -1.0 in the MIC dataset, fitD07D03). Then retrieve all mutants of those perfects (matched by ID) whose MIC fitness is greater than -1.0. Use the mut_collapse_15 dataset, which includes 797 perfects (mutations = 0; numprunedBCs = 5) and 12,174 mutants with up to 5 AA distance and at least 1 BC (numprunedBCs = 1) matching a perfect variant in the dataset.

# Step 1: Identify IDs that have rows where mutations == 0 and fitD05D03 > -1.0 in COMPLEMENTATION
dropout15_ids_with_zero_mutations_complement <- mut_collapse_15 %>%
  filter(mutations == 0 & fitD05D03 > -1.0) %>%
  distinct(ID) %>%
  pull(ID)

# Step 2: Identify the same IDs that have mutations == 0 and fitD07D03 < -1.0 in MIC
dropout15_ids_with_zero_mutations_mic <- mut_collapse_15 %>%
  filter(ID %in% dropout15_ids_with_zero_mutations_complement & 
         mutations == 0 & 
         fitD07D03 < -1.0) %>%
  distinct(ID) %>%
  pull(ID)

# Step 3: Retrieve the rows for these IDs
result_rows <- mut_collapse_15 %>%
  filter(ID %in% dropout15_ids_with_zero_mutations_mic & mutations == 0)

# Step 4: Keep perfects meeting both conditions, plus mutants (any AA distance) with fitD07D03 > -1.0 that match a retained perfect ID
dropout_mutants15_GOF_mic <- mut_collapse_15 %>%
  filter(
    (mutations == 0 & !is.na(fitD05D03) & fitD05D03 > -1.0 & 
     !is.na(fitD07D03) & fitD07D03 < -1.0) |
    (mutations != 0 & fitD07D03 > -1.0 & ID %in% dropout15_ids_with_zero_mutations_mic)) %>%
  dplyr::select(ID, mutID, numprunedBCs, mutations, fitD05D03, fitD07D03, seq)
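
As a minimal sketch of the selection logic above, the following toy example uses base R with hypothetical IDs and fitness values (not the notebook's data). It keeps perfects that complement (fitD05D03 > -1.0) but drop out at MIC (fitD07D03 < -1.0), plus any of their mutants that recover at MIC (fitD07D03 > -1.0). The column names follow mut_collapse_15; everything else is an assumption made for illustration.

```r
# Toy data: hypothetical IDs and fitness values, for illustration only
toy <- data.frame(
  ID        = c("A", "A", "A", "B", "B", "C"),
  mutID     = c("A", "A_G10S", "A_L20F", "B", "B_T5A", "C"),
  mutations = c(0, 1, 1, 0, 1, 0),
  fitD05D03 = c(0.2, 0.1, 0.0, 0.3, 0.2, -2.0),
  fitD07D03 = c(-2.5, 0.4, -3.0, 0.1, 0.2, -2.2),
  stringsAsFactors = FALSE
)

# Perfects that complement but drop out at MIC
is_dropout_perfect <- toy$mutations == 0 &
  !is.na(toy$fitD05D03) & toy$fitD05D03 > -1.0 &
  !is.na(toy$fitD07D03) & toy$fitD07D03 < -1.0
dropout_ids <- unique(toy$ID[is_dropout_perfect])

# Mutants of those perfects that recover fitness at MIC
is_gof_mutant <- toy$mutations != 0 &
  !is.na(toy$fitD07D03) & toy$fitD07D03 > -1.0 &
  toy$ID %in% dropout_ids

toy_gof <- toy[is_dropout_perfect | is_gof_mutant, ]
print(toy_gof)
```

In this toy set only parent A qualifies as a dropout perfect, and only one of its two mutants is retained.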

Validate that all retained perfects (mutations == 0) have fitD05D03 > -1.0 and fitD07D03 < -1.0.

# Verification step
verification_result_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations == 0) %>%
  mutate(
    condition_met = fitD05D03 > -1.0 & fitD07D03 < -1,
    fitD05D03_check = fitD05D03 > -1.0,
    fitD07D03_check = fitD07D03 < -1
  )

# Check if all rows meet the condition
all_conditions_met_mic <- all(verification_result_mic$condition_met)

# Summary of the verification
verification_summary_mic <- verification_result_mic %>%
  summarise(
    total_rows = n(),
    rows_meeting_both_conditions = sum(condition_met),
    rows_meeting_fitD05D03 = sum(fitD05D03_check),
    rows_meeting_fitD07D03 = sum(fitD07D03_check)
  )

# Print results
print("Verification Summary:")
[1] "Verification Summary:"
print(verification_summary_mic)
print(paste("All conditions met:", all_conditions_met_mic))
[1] "All conditions met: TRUE"
# If there are any rows not meeting the conditions, display them
if (!all_conditions_met_mic) {
  print("Rows not meeting both conditions:")
  print(verification_result_mic %>% filter(!condition_met) %>% select(ID, fitD05D03, fitD07D03))
}
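
The verification above relies on all() returning a plain TRUE or FALSE. Because the upstream filter already excludes missing fitness values for perfects, no NA is expected here. The sketch below, using hypothetical flags, only illustrates how the check would behave if that assumption did not hold.

```r
# Hypothetical verification flags; the NA mimics a missing fitness value
condition_met <- c(TRUE, TRUE, NA)

# With an NA and no FALSE present, all() returns NA rather than TRUE or FALSE
raw_result <- all(condition_met)

# Ignoring missing values gives TRUE for the remaining rows
result_no_na <- all(condition_met, na.rm = TRUE)

# isTRUE() yields a single non-missing logical that is safe inside if()
all_ok <- isTRUE(raw_result)
if (!all_ok) {
  message("At least one row failed or could not be evaluated.")
}
```

Wrapping the result in isTRUE() treats an undecidable check as a failure, which is the more conservative choice for a validation step.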

Validate that rows where mutations != 0 have an ID that matches rows where mutations == 0 and fitD07D03 < -1.0.

# 1. First, create two subsets of the data
zero_mutation_rows_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations == 0 & fitD07D03 < -1.0)

non_zero_mutation_rows_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations != 0 & fitD07D03 > -1.0)

# 2. Check that all IDs in non_zero_mutation_rows are present in zero_mutation_rows
all_valid_ids_mic <- all(non_zero_mutation_rows_mic$ID %in% zero_mutation_rows_mic$ID)
print(paste("All non-zero mutation rows have a matching zero mutation row:", all_valid_ids_mic))
[1] "All non-zero mutation rows have a matching zero mutation row: TRUE"
# 3. If the above is FALSE, find the problematic IDs
if (!all_valid_ids_mic) {
  problematic_ids_mic <- setdiff(non_zero_mutation_rows_mic$ID, zero_mutation_rows_mic$ID)
  print("IDs with non-zero mutations but no matching zero mutation row:")
  print(problematic_ids_mic)
}

# 4. Check for any IDs in zero_mutation_rows that don't have a corresponding non-zero mutation row
ids_without_non_zero_mic <- setdiff(zero_mutation_rows_mic$ID, non_zero_mutation_rows_mic$ID)
print("IDs with zero mutations but no corresponding non-zero mutation rows:")
[1] "IDs with zero mutations but no corresponding non-zero mutation rows:"
print(ids_without_non_zero_mic)
 [1] "NP_229441"    "NP_390064"    "NP_569589"    "WP_000162313" "WP_000162482" "WP_000175741" "WP_000175742" "WP_000175753"
 [9] "WP_002308859" "WP_002602832" "WP_002687527" "WP_003027976" "WP_003035097" "WP_003072921" "WP_003218222" "WP_003321978"
[17] "WP_003325705" "WP_003633095" "WP_003654807" "WP_003748519" "WP_003758501" "WP_004286630" "WP_004357589" "WP_004375525"
[25] "WP_004547470" "WP_004797168" "WP_004838134" "WP_004916485" "WP_004918315" "WP_004920127" "WP_004955442" "WP_004957782"
[33] "WP_005162425" "WP_005166996" "WP_005398549" "WP_005549928" "WP_005593154" "WP_005605215" "WP_005622496" "WP_005635192"
[41] "WP_005647059" "WP_005707354" "WP_005708598" "WP_005765106" "WP_005773221" "WP_006318421" "WP_006493548" "WP_006578126"
[49] "WP_006688837" "WP_006689329" "WP_006711975" "WP_006814098" "WP_006946351" "WP_007121618" "WP_007134290" "WP_007144234"
[57] "WP_007478890" "WP_007500805" "WP_007546669" "WP_008073041" "WP_008106982" "WP_008112253" "WP_008125961" "WP_008136828"
[65] "WP_008174809" "WP_008302089" "WP_008583481" "WP_008754633" "WP_009133007" "WP_009500922" "WP_009677188" "WP_009725265"
[73] "WP_009847829" "WP_010021426" "WP_010737807" "WP_010759385" "WP_010761388" "WP_010780948" "WP_010850622" "WP_011000898"
[81] "WP_011201021" "WP_011272274" "WP_011303097" "WP_011312551" "WP_011399565" "WP_011489445" "WP_011584955" "WP_011608342"
[89] "WP_011627026"
# 5. Summary statistics
print(paste("Number of unique IDs in zero mutation rows:", n_distinct(zero_mutation_rows_mic$ID)))
[1] "Number of unique IDs in zero mutation rows: 162"
print(paste("Number of unique IDs in non-zero mutation rows:", n_distinct(non_zero_mutation_rows_mic$ID)))
[1] "Number of unique IDs in non-zero mutation rows: 73"
# 6. Distribution of mutation counts for non-zero mutation rows
mutation_distribution_mic <- non_zero_mutation_rows_mic %>%
  group_by(mutations) %>%
  summarise(count = n()) %>%
  arrange(mutations)

print("Distribution of mutation counts:")
[1] "Distribution of mutation counts:"
print(mutation_distribution_mic)

# 7. Check for any unexpected mutation values
unexpected_mutations_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations < 0 | mutations > 5)  # Adjust the upper bound as needed

if (nrow(unexpected_mutations_mic) > 0) {
  print("Rows with unexpected mutation values:")
  print(unexpected_mutations_mic)
} else {
  print("No unexpected mutation values found.")
}
[1] "No unexpected mutation values found."

Remove perfect IDs that have no corresponding mutants:

# Step 1: Identify IDs with zero mutations
zero_mutation_ids_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations == 0 & fitD07D03 < -1.0) %>%
  pull(ID)

# Step 2: Identify IDs with non-zero mutations
non_zero_mutation_ids_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations != 0 & fitD07D03 > -1.0) %>%
  pull(ID)

# Step 3: Find IDs that have zero mutations but no corresponding non-zero mutation rows
ids_to_remove_mic <- setdiff(zero_mutation_ids_mic, non_zero_mutation_ids_mic)

# Step 4: Remove the rows with these IDs
dropout_mutants15_GOF_mic_cleaned <- dropout_mutants15_GOF_mic %>%
  filter(!(ID %in% ids_to_remove_mic))

# Print summary
print(paste("Number of rows before cleaning:", nrow(dropout_mutants15_GOF_mic)))
[1] "Number of rows before cleaning: 323"
print(paste("Number of rows after cleaning:", nrow(dropout_mutants15_GOF_mic_cleaned)))
[1] "Number of rows after cleaning: 234"
print(paste("Number of rows removed:", nrow(dropout_mutants15_GOF_mic) - nrow(dropout_mutants15_GOF_mic_cleaned)))
[1] "Number of rows removed: 89"
print(paste("Number of unique IDs removed:", length(ids_to_remove_mic)))
[1] "Number of unique IDs removed: 89"
# Optionally, you can print the removed IDs
print("IDs removed:")
[1] "IDs removed:"
print(ids_to_remove_mic)
 [1] "NP_229441"    "NP_390064"    "NP_569589"    "WP_000162313" "WP_000162482" "WP_000175741" "WP_000175742" "WP_000175753"
 [9] "WP_002308859" "WP_002602832" "WP_002687527" "WP_003027976" "WP_003035097" "WP_003072921" "WP_003218222" "WP_003321978"
[17] "WP_003325705" "WP_003633095" "WP_003654807" "WP_003748519" "WP_003758501" "WP_004286630" "WP_004357589" "WP_004375525"
[25] "WP_004547470" "WP_004797168" "WP_004838134" "WP_004916485" "WP_004918315" "WP_004920127" "WP_004955442" "WP_004957782"
[33] "WP_005162425" "WP_005166996" "WP_005398549" "WP_005549928" "WP_005593154" "WP_005605215" "WP_005622496" "WP_005635192"
[41] "WP_005647059" "WP_005707354" "WP_005708598" "WP_005765106" "WP_005773221" "WP_006318421" "WP_006493548" "WP_006578126"
[49] "WP_006688837" "WP_006689329" "WP_006711975" "WP_006814098" "WP_006946351" "WP_007121618" "WP_007134290" "WP_007144234"
[57] "WP_007478890" "WP_007500805" "WP_007546669" "WP_008073041" "WP_008106982" "WP_008112253" "WP_008125961" "WP_008136828"
[65] "WP_008174809" "WP_008302089" "WP_008583481" "WP_008754633" "WP_009133007" "WP_009500922" "WP_009677188" "WP_009725265"
[73] "WP_009847829" "WP_010021426" "WP_010737807" "WP_010759385" "WP_010761388" "WP_010780948" "WP_010850622" "WP_011000898"
[81] "WP_011201021" "WP_011272274" "WP_011303097" "WP_011312551" "WP_011399565" "WP_011489445" "WP_011584955" "WP_011608342"
[89] "WP_011627026"
# Assign the cleaned data back to dropout_mutants15_GOF_mic to update the original variable
dropout_mutants15_GOF_mic <- dropout_mutants15_GOF_mic_cleaned

Dropout Mutants

Summarize the number of perfects and mutants at each AA distance after filtering:

# Create a function to count unique mutIDs for a given number of mutations
dropout_mutants15_GOF_count_mic <- function(data, mutation_count) {
  length(unique(subset(data, mutations == mutation_count)$mutID))
}

# Create a vector of counts for mutations 1-5
dropout15_counts_mic <- sapply(1:5, function(x) dropout_mutants15_GOF_count_mic(dropout_mutants15_GOF_mic, x))

# Count perfects separately
perfects_count_mic <- length(unique(subset(dropout_mutants15_GOF_mic, mutations == 0 & fitD07D03 < -1.0)$mutID))

# Create a data frame with the results, including the summary row
dropout_mutants15_GOF_table_mic <- data.frame(
  Mutations = c("Perfects (fit < -1.0)", "1 Mutation", "2 Mutations", "3 Mutations", "4 Mutations", "5 Mutations", "Total Mutants"),
  Count = c(perfects_count_mic, dropout15_counts_mic, sum(dropout15_counts_mic))
)

# Print the table
print(dropout_mutants15_GOF_table_mic)
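
The same per-distance counts can be cross-checked with a one-step tabulation. This is a sketch on hypothetical mutIDs, not the notebook's data: it de-duplicates by mutID and then tabulates by AA distance.

```r
# Toy data: hypothetical mutIDs; one mutant appears twice to show de-duplication
toy <- data.frame(
  mutID     = c("A", "A_G10S", "A_G10S", "A_L20F_T5A", "B", "B_T5A"),
  mutations = c(0, 1, 1, 2, 0, 1),
  stringsAsFactors = FALSE
)

# Keep one row per mutID, then tabulate by AA distance (0 = perfect)
toy_unique <- toy[!duplicated(toy$mutID), ]
counts <- table(factor(toy_unique$mutations, levels = 0:5))
print(counts)
```

Fixing the factor levels to 0:5 keeps distances with no mutants in the table as explicit zeros instead of dropping them.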

GOF Fitness

GOF Fitness: Separate the dropout_mutants15_GOF_mic dataset into two new dataframes, where DF1 contains perfects and DF2 contains mutants:

# Create a dataframe with mutations == 0
dropout_mutants15_GOF_no_mutations_mic <- dropout_mutants15_GOF_mic[dropout_mutants15_GOF_mic$mutations == 0, ]

# Create a dataframe with mutations != 0
dropout_mutants15_GOF_with_mutations_mic <- dropout_mutants15_GOF_mic[dropout_mutants15_GOF_mic$mutations != 0, ]

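Now, join each mutant to its parent perfect by ID and express mutant fitness as the difference from the parent (mutant fitD07D03 minus perfect fitD07D03), retaining only the mutants: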

# Step 1: Prepare the reference dataframe
df_reference_mic <- dropout_mutants15_GOF_no_mutations_mic %>%
  select(ID, fitD07D03) %>%
  rename(reference_fitD07D03 = fitD07D03)

# Step 2: Join and calculate the difference
dropout_mutants15_GOF_fitness_mic <- dropout_mutants15_GOF_with_mutations_mic %>%
  left_join(df_reference_mic, by = "ID") %>%
  mutate(fitD07D03 = fitD07D03 - reference_fitD07D03) %>%
  select(ID, mutID, mutations, fitD07D03)

# Print summary statistics
print(paste("Number of Mutants:", nrow(dropout_mutants15_GOF_fitness_mic)))
[1] "Number of Mutants: 161"
print(paste("Unique IDs:", length(unique(dropout_mutants15_GOF_fitness_mic$ID))))
[1] "Unique IDs: 73"
print(paste("Range of fitD07D03:", 
            paste(round(range(dropout_mutants15_GOF_fitness_mic$fitD07D03, na.rm = TRUE), 1), collapse = " to ")))
[1] "Range of fitD07D03: 0.1 to 9.2"
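
To make the relative-fitness calculation concrete, here is a small worked example in base R with hypothetical values. It stores the difference in a separate, hypothetical column (delta_fit) rather than overwriting fitD07D03, so both values stay visible.

```r
# Hypothetical parents (perfects) and their mutants
perfects <- data.frame(ID = c("A", "B"),
                       fitD07D03 = c(-2.5, -1.5),
                       stringsAsFactors = FALSE)
mutants <- data.frame(ID = c("A", "A", "B"),
                      mutID = c("A_G10S", "A_L20F", "B_T5A"),
                      fitD07D03 = c(0.5, -0.5, 1.0),
                      stringsAsFactors = FALSE)

# Rename the parent fitness so it does not collide with the mutant column
names(perfects)[names(perfects) == "fitD07D03"] <- "reference_fitD07D03"

# Left join on ID, then subtract the parent fitness from each mutant
joined <- merge(mutants, perfects, by = "ID", all.x = TRUE)
joined$delta_fit <- joined$fitD07D03 - joined$reference_fitD07D03
joined <- joined[order(joined$mutID), ]
print(joined[, c("mutID", "delta_fit")])
```

Because retained mutants have fitD07D03 > -1.0 and their parents have fitD07D03 < -1.0, the difference is positive by construction, which is consistent with the reported range starting above zero.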

Boxplot: Plot mutant fitness relative to parent variant by number of mutations:

GOF_muts_fitness_by_muts_plot_mic <- ggplot(dropout_mutants15_GOF_fitness_mic, 
                                        aes(x = factor(mutations), y = fitD07D03)) +
  geom_boxplot() +
  labs(title = "fitD07D03 by Number of Mutations", x = "Number of Mutations", y = "fitD07D03")

print(GOF_muts_fitness_by_muts_plot_mic)

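Histogram of Mutant Fitness: Shows the distribution of mutant fitness relative to the parent variant, which appears approximately normally distributed by visual inspection (no formal test is applied).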

GOF_muts_fitness_dist_plot_mic <- ggplot(dropout_mutants15_GOF_fitness_mic, aes(x = fitD07D03)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "Distribution of fitD07D03", x = "fitD07D03", y = "Count")

print(GOF_muts_fitness_dist_plot_mic)
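
A histogram gives only a visual impression of distribution shape. If a formal check were wanted, one option is a Shapiro-Wilk test from base R's stats package. The sketch below runs it on simulated values that stand in for the relative fitness column; the simulated data and sample size are assumptions and say nothing about the notebook's actual result.

```r
set.seed(1)

# Simulated stand-in for relative fitness values (not the notebook's data)
sim_fit <- rexp(161, rate = 0.5)

# Shapiro-Wilk test: a small p-value argues against normality
sw <- shapiro.test(sim_fit)
print(sw$p.value)
```

On the real data this would be called as shapiro.test(dropout_mutants15_GOF_fitness_mic$fitD07D03), ideally alongside a Q-Q plot (qqnorm() and qqline()).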

GOF Alignment

FASTA: Generate a FASTA file of the perfects from the filtered dropout_mutants15_GOF_mic dataset for GoF analysis:

# First, let's ensure we have the correct unique IDs for mutations == 1
dropout_mutants15_GOF_mic_1mut_unique_ids <- dropout_mutants15_GOF_fitness_mic %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  pull(ID)

# Now, let's use these IDs to filter the dropout_mutants15_GOF_mic dataset
dropout_mutants15_GOF_mic_1mut_unique_id_seq <- dropout_mutants15_GOF_mic %>%
  filter(ID %in% dropout_mutants15_GOF_mic_1mut_unique_ids & mutations == 0) %>%
  select(ID, seq)

# Create the sequences in FASTA format
dropout_mutants15_GOF_mic_fasta_content <- paste(">", dropout_mutants15_GOF_mic_1mut_unique_id_seq$ID, "\n", dropout_mutants15_GOF_mic_1mut_unique_id_seq$seq, "\n", sep = "", collapse = "")

# Define the file path in the working directory
dropout_mutants15_GOF_mic_fasta_file_path <- file.path(getwd(), "GOF/MSA_Dropouts/MIC/FASTA/Lib15.GoF.perfects.mic.fasta")

# Write the FASTA content to the file (57 unique IDs)
writeLines(dropout_mutants15_GOF_mic_fasta_content, 
           con = dropout_mutants15_GOF_mic_fasta_file_path)
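
An alternative way to build the FASTA file is to write a vector of lines instead of one concatenated string. The sketch below uses hypothetical IDs and sequences and writes to a temporary file, then reads it back to confirm one header per sequence.

```r
# Hypothetical IDs and sequences, for illustration only
seqs <- data.frame(ID  = c("seqA", "seqB"),
                   seq = c("MKLV", "MRIA"),
                   stringsAsFactors = FALSE)

# Interleave header and sequence lines: >ID, sequence, >ID, sequence, ...
fasta_lines <- as.vector(rbind(paste0(">", seqs$ID), seqs$seq))

# Write to a temporary file so nothing in the project is overwritten
fasta_path <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, con = fasta_path)

# Read back and count header lines
read_back <- readLines(fasta_path)
n_headers <- sum(startsWith(read_back, ">"))
```

Writing a vector of lines avoids the extra blank line at the end of the file that results when a single string already ending in "\n" is passed to writeLines().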

Alignment: Use the clustalo executable to align the protein sequences associated with the dropout perfects. This will align the FASTA file Lib15.GoF.perfects.mic.fasta for use in GoF analysis.

./Scripts/clustalo -i GOF/MSA_Dropouts/MIC/FASTA/Lib15.GoF.perfects.mic.fasta -o GOF/MSA_Dropouts/MIC/FASTA/Lib15.GoF.perfects.mic.tree.aligned.mod.aln --outfmt=clustal --force

Mapping Residues: Use the following map.aligned.residues.py Python script to generate one CSV file per designed homolog, mapping each residue position in the homolog (orth_aanum) to its column in the Clustal alignment (msa_aanum):

import time
import csv

##################################
#INPUTS:

base_path = ""
trees_path_prefix = base_path+""

#clustal format alignment file
align_file_in = [trees_path_prefix+"GOF/MSA_Dropouts/MIC/FASTA/Lib15.GoF.perfects.mic.tree.aligned.mod.aln"]

#number of seqs in each alignment file
num_samples_in_file = [58] #New FASTA w/ mutant fit > -1 (+1 from actual file count)

##################################
#OUTPUTS:

msa_map_out_path = [trees_path_prefix+"GOF/MSA_Dropouts/MIC/"]

# Loop to generate .csv files for each ID
for alni in range(1):#len(align_file_in)):
    #print(alni)
    
    ##################################
    #VARIABLES:
    
    #ID as key, align as value
    align_dict = dict()
    
    #num_samples = 419
    num_samples = num_samples_in_file[alni]
    
    #pos key, consensus pos val
    IDaadictlist = [dict() for x in range(num_samples)]
    
    IDtoindexdict = dict()
    indexdtoIDict = dict()
    
    ##################################
    #CODE:
    
    line_count = 0
    #loop over all alignments:
    print(align_file_in[alni])
    for line in open(align_file_in[alni]):
        #skip header
        if line_count > 1:
            listWords = line.split('    ')
            ID = listWords[0]
            align = line[16:].rstrip()
            if ID.strip() != "":
                align_dict[ID] = align_dict.get(ID, "") + align.replace(" ", "")
        line_count += 1
    
    #print("NP_414590")
    #print(align_dict["NP_414590"])

    counter = 0
    for ID in align_dict:
        #print(ID)
        #print(align_dict[ID])
        IDtoindexdict[ID] = counter
        indexdtoIDict[counter]=ID
        align = align_dict[ID]
        
        aacounter = 1
        
        
        for i in range(len(align)):
            if align[i] != "-":
                
                #print(str(counter)+" "+str(aacounter))
                IDaadictlist[counter][aacounter]=i+1
                aacounter += 1
        counter += 1
        
    #print(len(IDaadictlist))
    for i in range(len(IDaadictlist)-1):
        #print(indexdtoIDict[i])
        #print(i)
        #print(alni)
        #print(indexdtoIDict[i])
        csvfile = open(str(msa_map_out_path[alni]+indexdtoIDict[i]+".csv"), 'w')
        fieldnames = ['orth_aanum','msa_aanum']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for j in IDaadictlist[i]:
            #print(str(j)+" "+str(IDaadictlist[i][j]))
            #save all data:
            writer.writerow({'orth_aanum':str(j),'msa_aanum':str(IDaadictlist[i][j])})
        csvfile.close()
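
The core of the mapping step, shown here as a minimal R sketch for a single hypothetical aligned sequence: every non-gap character receives the next residue number (orth_aanum) and is paired with its 1-based alignment column (msa_aanum). This mirrors the inner loop of the Python script and is not a replacement for it.

```r
# Hypothetical aligned sequence; "-" marks an alignment gap
aligned <- "MK--LV-A"

chars    <- strsplit(aligned, "")[[1]]
msa_cols <- which(chars != "-")          # alignment columns holding a residue

residue_map <- data.frame(
  orth_aanum = seq_along(msa_cols),      # 1-based position in the ungapped sequence
  msa_aanum  = msa_cols                  # 1-based column in the alignment
)
print(residue_map)
```

Two homologs can then be compared position by position through their shared msa_aanum values.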

GOF Plots

Find GoF Perfects for Dropouts

# Create a data frame of unique IDs
mutants15_to_plot_mic <- dropout_mutants15_GOF_fitness_mic %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  select(ID)
# Initialize an empty vector to store IDs of mutants to be removed
mutants_to_remove_mic <- character()

# Check for missing MSA files
for (i in 1:nrow(mutants15_to_plot_mic)) {
  mutant15_current_temp <- mutants15_to_plot_mic$ID[i]
  if (!file.exists(paste("GOF/MSA_Dropouts/MIC/", mutant15_current_temp, ".csv", sep = ""))) {
    mutants_to_remove_mic <- c(mutants_to_remove_mic, mutant15_current_temp)
  }
}

# Output the results
if (length(mutants_to_remove_mic) > 0) {
  cat("The following mutants will be removed due to missing MSA files:\n")
  print(mutants_to_remove_mic)
  cat("\nTotal number of mutants to be removed:", length(mutants_to_remove_mic), "\n")
  
  # Remove the mutants without MSA files
  # drop = FALSE keeps the single-column data frame from collapsing to a vector
  mutants15_to_plot_mic <- mutants15_to_plot_mic[!mutants15_to_plot_mic$ID %in% mutants_to_remove_mic, , drop = FALSE]
  cat("\nMutants remaining:", nrow(mutants15_to_plot_mic), "\n")
} else {
  cat("All mutants have corresponding MSA files. No mutants will be removed.\n")
}
All mutants have corresponding MSA files. No mutants will be removed.
# If you want to see the remaining mutants
print(mutants15_to_plot_mic)

Read in the E. coli map:

ecoli_map <- read.csv(file=paste("GOF/MSA_Dropouts/Comp/NP_414590.csv", sep=""), head=TRUE, sep=",")

Initialize an empty data frame to collect the mapped GoF mutations:

GOF_fitness_map_mic <- data.frame(position=numeric(),
                              aa=character(),
                              mutations=numeric(),
                              fitness=numeric(),
                              posortho=numeric(),
                              ingap=character(),
                              mutID=character(),
                              ID=character())

aminoacids <- data.frame(aa=c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T','X'),
                         aanum=c(1:21))

MSA Mapping: Map the mutants (fitness difference from perfects >= 0.5) over all perfects (fit < -1) for GOF analysis:

#loop over all perfects
for (iii in 1:nrow(mutants15_to_plot_mic)){
  
  #current ortholog:
  mutant_current_mic <- as.character(mutants15_to_plot_mic$ID[iii])
  
  #length of name
  name_size_mic = nchar(paste(mutant_current_mic,"_",sep=""))
  
  #get the MSA mapping
  mutant_map_mic <- read.csv(file=paste("GOF/MSA_Dropouts/MIC/",mutant_current_mic,".csv",sep=""),head=TRUE,sep=",")
  
  #grab the mutants with a fitness increase (GoF) of at least 0.5 relative to the perfect (perfects are excluded)
  GOFmutIDinfo_temp_mic <- dropout_mutants15_GOF_fitness_mic %>%
    filter(ID == mutant_current_mic) %>%
    filter(mutations != 0) %>%
    filter(fitD07D03 >= 0.5) ###CHANGE VALUE AS NEEDED (>= 0.5 is default)
  
  # Check if GOFmutIDinfo_temp is empty
  if(nrow(GOFmutIDinfo_temp_mic) == 0) {
    warning(paste("No non-zero mutation data found for ID:", mutant_current_mic))
    next  # Skip to the next iteration of the outer loop
  }
  
  #loop over all mutants for this construct:
  for (mn in 1:nrow(GOFmutIDinfo_temp_mic)) {
    
    #this mutants fitness
    gof_fit_temp_mic <- GOFmutIDinfo_temp_mic$fitD07D03[mn]  # or whichever fitness column you're using
    
    #grab the mut name
    mutations_names_mic <- as.character(GOFmutIDinfo_temp_mic$mutID[mn])
    
    #grab only the relevant portion of the name
    mutations_names_mic <- substr(mutations_names_mic, name_size_mic+1, nchar(mutations_names_mic))
    
    ## split mutation string at non-digits
    s <- strsplit(mutations_names_mic, "_")
    
    for (mutnum in 1:GOFmutIDinfo_temp_mic$mutations[mn]){
      
      #grab the corresponding mutation string
      mutcurr<-s[[1]][mutnum]
      
      #get the position
      mutpos <- as.numeric(str_extract(mutcurr, "[0-9]+"))
      
      #get ending aa
      to_aa <- substr(mutcurr, nchar(mutpos)+2, nchar(mutcurr))
      
      #find the number in the consensus seq
      gof_cons_aanum_index <- which(mutant_map_mic$orth_aanum == mutpos)
      
      if (length(gof_cons_aanum_index) > 0) {
        gof_cons_aanum <- mutant_map_mic$msa_aanum[gof_cons_aanum_index]
        
        #does this map to a non-gap
        if (gof_cons_aanum %in% ecoli_map$msa_aanum){
          
          #the corresponding e.coli residue
          e_coli_residue <- ecoli_map$orth_aanum[which(ecoli_map$msa_aanum == gof_cons_aanum)]
          
          #add this point to the data
          GOF_fitness_map_mic <- rbind(GOF_fitness_map_mic,
                                   data.frame(position=e_coli_residue,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp_mic$mutations[mn],
                                              fitness=gof_fit_temp_mic,
                                              posortho=mutpos,
                                              ingap="No",
                                              mutID=GOFmutIDinfo_temp_mic$mutID[mn],
                                              ID=GOFmutIDinfo_temp_mic$ID[mn]))
          
        } else {
          #if it's here it maps to a gap
          
          #add this point to the data
          GOF_fitness_map_mic <- rbind(GOF_fitness_map_mic,
                                   data.frame(position=-1,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp_mic$mutations[mn],
                                              fitness=gof_fit_temp_mic,
                                              posortho=mutpos,
                                              ingap="Yes",
                                              mutID=GOFmutIDinfo_temp_mic$mutID[mn],
                                              ID=GOFmutIDinfo_temp_mic$ID[mn]))
          
        }
      } else {
        warning(paste("No matching orth_aanum found for mutpos:", mutpos, "in ID:", mutant_current_mic))
        # You might want to handle this case, perhaps by skipping this mutation or adding it to a separate list for review
      }
    }
  }
}
Warning: No matching orth_aanum found for mutpos: NA in ID: NP_269082
Warning: No non-zero mutation data found for ID: WP_005083623
Warning: No non-zero mutation data found for ID: WP_005385658
Warning: No non-zero mutation data found for ID: WP_007130009
Warning: No non-zero mutation data found for ID: WP_007277186
Warning: No matching orth_aanum found for mutpos: 169 in ID: WP_007476896
Warning: No matching orth_aanum found for mutpos: 162 in ID: WP_008688576
Warning: No matching orth_aanum found for mutpos: 161 in ID: WP_010945031
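
The loop above does two things for each mutation: it parses the position and substituted residue out of the mutID, and it translates the homolog position to E. coli numbering through the shared alignment column. The sketch below shows both steps on hypothetical inputs. The mutID layout "<ID>_<from><pos><to>[_...]" is an assumption inferred from the parsing code, and the IDs and maps are invented for illustration.

```r
# Assumed mutID layout: "<ID>_<from><pos><to>[_<from><pos><to>...]"
parent_id <- "WP_000000001"              # hypothetical ID
mut_id    <- "WP_000000001_A26T_L28R"    # hypothetical double mutant

# Strip the "<ID>_" prefix, then split the remainder into single mutations
suffix <- substr(mut_id, nchar(parent_id) + 2, nchar(mut_id))
muts   <- strsplit(suffix, "_", fixed = TRUE)[[1]]

# Position is the run of digits; the substituted residue follows it
mut_pos <- as.integer(sub("^[A-Za-z*]+([0-9]+).*$", "\\1", muts))
to_aa   <- sub("^[A-Za-z*]+[0-9]+", "", muts)

# Hypothetical residue maps (orth_aanum -> msa_aanum) for the homolog and E. coli
homolog_map   <- data.frame(orth_aanum = 25:29, msa_aanum = 30:34)
ecoli_map_toy <- data.frame(orth_aanum = c(27, 28, 29), msa_aanum = c(31, 32, 34))

# Homolog position -> alignment column -> E. coli position (NA if E. coli has a gap)
msa_col   <- homolog_map$msa_aanum[match(mut_pos, homolog_map$orth_aanum)]
ecoli_pos <- ecoli_map_toy$orth_aanum[match(msa_col, ecoli_map_toy$msa_aanum)]
```

Here the first mutation lands on E. coli position 27, while the second falls in an alignment column where the toy E. coli map has no residue, which corresponds to the ingap = "Yes" branch of the loop. Note that the prefix length depends on the length of the current ID, so it has to be recomputed for each parent.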

Collapse the GOF fitness values by aa position along the protein sequence:

GOF_fitness_collapsed_by_pos_mic <- GOF_fitness_map_mic %>%
  filter(position > 0) %>%
  group_by(position) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))

GOF Mutant No. Plot

Plot the number of gain-of-function mutants recovered at each aa position along the protein sequence, with dashed cutoff lines at the mean number of GOF mutants (red) and at 2 SD above the mean (blue). Positions with GOF mutant counts at or above the 2 SD line are considered significant positions positively influencing the parent variant's ability to tolerate trimethoprim at the MIC (0.5 ug/mL) in the E. coli knockout model.

GoF_plot_mic <- ggplot(GOF_fitness_collapsed_by_pos_mic, aes(x=position, y=numpoints, color=numortho)) +
  geom_segment(aes(x = 0, y = mean(numpoints)+2*sd(numpoints), 
                   xend = 160, 
                   yend = mean(numpoints)+2*sd(numpoints)),linetype=2,colour = "blue")+
  geom_segment(aes(x = 0, y = mean(numpoints), xend = 160, yend = mean(numpoints)),linetype=2,colour = "red")+
  geom_point(size=1.8)+
  labs(x = "Position (aa)", y ="Number of gain-of-function mutants",color="") +
  scale_color_gradient(low = "blue", 
                       high = "red",
                       name="Num.\nUniq.\nHomo.",
                       na.value="grey", 
                       limit = c(0,1.1*max(GOF_fitness_collapsed_by_pos_mic$numortho))) +
  scale_x_continuous(breaks=seq(0,160,20))+
  theme(legend.position="left")
GoF_plot_mic <- ggExtra::ggMarginal(GoF_plot_mic,type = "histogram",
                    margins = "y",
                    bins=21,
                    col = 'black',
                    fill = 'red')
Warning: All aesthetics have length 1, but the data has 65 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
Warning: All aesthetics have length 1, but the data has 65 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
Warning: All aesthetics have length 1, but the data has 65 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
Warning: All aesthetics have length 1, but the data has 65 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
GoF_plot_mic

Print a summary table of the significant aa positions along the protein sequence:

GOF_fitness_collapsed_by_pos_2sigma_mic <- GOF_fitness_collapsed_by_pos_mic %>%
  filter(numpoints >= (mean(GOF_fitness_collapsed_by_pos_mic$numpoints) + 2*sd(GOF_fitness_collapsed_by_pos_mic$numpoints)))
print(GOF_fitness_collapsed_by_pos_2sigma_mic)
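
As a small worked example of the 2 SD cutoff, using hypothetical per-position counts rather than the notebook's data:

```r
# Hypothetical counts of GoF mutants at positions 1-10
numpoints <- c(1, 2, 1, 3, 2, 1, 12, 2, 1, 2)
names(numpoints) <- 1:10

# Cutoff: mean plus two sample standard deviations (sd() uses n - 1)
cutoff <- mean(numpoints) + 2 * sd(numpoints)

# Positions at or above the cutoff
sig_positions <- names(numpoints)[numpoints >= cutoff]
print(cutoff)
print(sig_positions)
```

Only position 7 passes in this toy set. The outlier itself inflates both the mean and the SD, so this rule becomes more conservative as more positions carry high counts.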

Calculate summary statistics for each amino acid at each position:

GOF_fitness_collapsed_all_mic <- GOF_fitness_map_mic %>%
  filter(position > 0) %>%
  group_by(position, aa) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))
`summarise()` has grouped output by 'position'. You can override using the `.groups` argument.
gof_aa_dim <- nrow(aminoacids)
gof_ref_len <- nrow(ecoli_map)
#these matrices have the fitness/num/sd for each aa at each position:
gof_matrix = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_num = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_sd = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_numortho = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)

#populate matrix
for (i in 1:nrow(GOF_fitness_collapsed_all_mic)){
  
  gof_matrix[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_mic$aa[i])),GOF_fitness_collapsed_all_mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_mic$fitval[i])
  gof_matrix_num[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_mic$aa[i])),GOF_fitness_collapsed_all_mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_mic$numpoints[i])
  gof_matrix_sd[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_mic$aa[i])),GOF_fitness_collapsed_all_mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_mic$stdfit[i])
  gof_matrix_numortho[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_mic$aa[i])),GOF_fitness_collapsed_all_mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_mic$numortho[i])
}

rownames(gof_matrix)<-aminoacids$aa
colnames(gof_matrix)<-c(1:gof_ref_len)
rownames(gof_matrix_num)<-aminoacids$aa
colnames(gof_matrix_num)<-c(1:gof_ref_len)
rownames(gof_matrix_sd)<-aminoacids$aa
colnames(gof_matrix_sd)<-c(1:gof_ref_len)
rownames(gof_matrix_numortho)<-aminoacids$aa
colnames(gof_matrix_numortho)<-c(1:gof_ref_len)

gof_matrix_melt <- melt(gof_matrix)
gof_matrix_num_melt <- melt(gof_matrix_num)
gof_matrix_sd_melt <- melt(gof_matrix_sd)
gof_matrix_numortho_melt <- melt(gof_matrix_numortho)

# Rename columns to "X1" and "X2"
names(gof_matrix_melt)[names(gof_matrix_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_num_melt)[names(gof_matrix_num_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_sd_melt)[names(gof_matrix_sd_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_numortho_melt)[names(gof_matrix_numortho_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")

# E. coli-numbered positions retained for the position plot
gof_sig_positions_mic <- c(89, 102, 103, 121, 128, 129)

gof_matrix_melt_only_GOFpos <- gof_matrix_melt %>%
  filter(X2 %in% gof_sig_positions_mic)

# Index each retained position (1-6) for the x-axis
gof_matrix_melt_only_GOFpos$mutposnum <- match(gof_matrix_melt_only_GOFpos$X2, gof_sig_positions_mic)

# Map each amino acid to its row index on the plot's y-axis (same order as the y-axis labels)
gof_aa_plot_order <- c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X")
gof_matrix_melt_only_GOFpos$aanum <- match(as.character(gof_matrix_melt_only_GOFpos$X1), gof_aa_plot_order)

gof_matrix_melt_only_GOFpos_wnum <- gof_matrix_melt_only_GOFpos %>%
  inner_join(gof_matrix_num_melt,by=c("X1","X2")) %>%
  dplyr::rename(mutnum=value.y,value=value.x)
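
The populate loop above fills one cell per iteration. An equivalent approach is base R's matrix indexing, in which a two-column (row, column) index matrix assigns all cells in one step. The sketch below uses a shortened, hypothetical amino acid alphabet and reference length to keep the output small.

```r
aa_levels <- c("G", "A", "V")            # shortened alphabet for illustration
ref_len   <- 4                           # hypothetical reference length

# Hypothetical collapsed summary: one row per (position, aa) pair
collapsed <- data.frame(position = c(1, 3, 3),
                        aa       = c("A", "G", "V"),
                        fitval   = c(0.8, 1.5, 2.1),
                        stringsAsFactors = FALSE)

# Rows are amino acids, columns are reference positions; unobserved cells stay NA
fit_matrix <- matrix(NA_real_, nrow = length(aa_levels), ncol = ref_len,
                     dimnames = list(aa_levels, seq_len(ref_len)))

# A two-column (row, column) index matrix assigns all cells in one step
idx <- cbind(match(collapsed$aa, aa_levels), collapsed$position)
fit_matrix[idx] <- collapsed$fitval
print(fit_matrix)
```

Rows whose residue is not in the alphabet would produce an NA row index and would need to be filtered out first.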

GOF Position Plot

Plot the median fitness of each GoF mutation at the significant positions, with the number of mutants observed at each AA:

# Define the order of amino acids for the rectangles
rect_order <- c("P", "Q", "F", "G", "Y", "E")

# Create a data frame for the rectangles
rect_data <- data.frame(
aanum = match(rect_order, c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X")),
xmin = seq(0.5, by = 1, length.out = length(rect_order)),
xmax = seq(1.5, by = 1, length.out = length(rect_order)))

#plot the data from all mutants:
GOF_fit_nummut_plot_mic <- ggplot(gof_matrix_melt_only_GOFpos_wnum, 
       aes(x=mutposnum, y=aanum,
           fill=value,
           label=mutnum)) +
  geom_tile() +
  geom_text() +
  # Add black rectangles
  geom_rect(data = rect_data,
           aes(xmin = xmin, xmax = xmax, ymin = aanum - 0.5, ymax = aanum + 0.5),
           fill = NA, color = "black", inherit.aes = FALSE) +
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                      high = "red",
                      name="Fitness",
                      na.value="grey",
                      limits = c(0, max(gof_matrix_melt_only_GOFpos_wnum$value, na.rm = TRUE))) +
  theme_minimal()+
  scale_x_continuous(name="Position (aa)",
                     breaks=c(1,2,3,4,5,6),
                     labels=c("89","102","103","121","128","129"))+
  scale_y_continuous(name="Amino acid", 
                     breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21),
                     labels=c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X"))

print(GOF_fit_nummut_plot_mic)

400x MIC (200 ug/mL TMP)

Dropout Perfects

Retrieve all dropout perfects with a log-fold change value (fitD05D03) greater than -1.0 from the COMPLEMENTATION dataset. Then, retrieve the same perfects (ID) from the 400x MIC dataset and retain only those with fitness (fitD11D03) less than -1.0. Also, retrieve all corresponding mutants with fitness greater than -1.0 at 400x MIC. Use the mut_collapse_15 dataset, which includes 797 perfects (mutations = 0; numprunedBCs = 5).

# Step 1: Identify IDs that have rows where mutations == 0 and fitD05D03 > -1.0 in COMPLEMENTATION
dropout15_ids_with_zero_mutations_complement <- mut_collapse_15 %>%
  filter(mutations == 0 & fitD05D03 > -1.0) %>%
  distinct(ID) %>%
  pull(ID)

# Step 2: Identify the same IDs that have mutations == 0 and fitD11D03 < -1.0 in 400x MIC
dropout15_ids_with_zero_mutations_400mic <- mut_collapse_15 %>%
  filter(ID %in% dropout15_ids_with_zero_mutations_complement & 
         mutations == 0 & 
         fitD11D03 < -1.0 &
         !is.na(fitD11D03)) %>%
  distinct(ID) %>%
  pull(ID)

# Step 3: Retrieve the rows for these IDs
result_rows_400mic <- mut_collapse_15 %>%
  filter(ID %in% dropout15_ids_with_zero_mutations_400mic & mutations == 0)

# Step 4: Filter the main dataset to keep mutants if they match a corresponding perfect ID
dropout_mutants15_GOF_400mic <- mut_collapse_15 %>%
  filter(
    (mutations == 0 & !is.na(fitD05D03) & fitD05D03 > -1.0 & 
     !is.na(fitD11D03) & fitD11D03 < -1.0) |
    (mutations != 0 & fitD11D03 > -1.0 & ID %in% dropout15_ids_with_zero_mutations_400mic)) %>%
  dplyr::select(ID, mutID, numprunedBCs, mutations, fitD05D03, fitD11D03, seq)
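The selection logic above can be illustrated on a toy table. This is a sketch only: the IDs and fitness values are invented, and base R indexing stands in for the dplyr pipeline.

```r
# Invented example data: three homologs (A, B, C) with perfects and mutants
toy <- data.frame(
  ID        = c("A", "A", "A", "B", "B", "C"),
  mutations = c(0, 1, 1, 0, 1, 0),
  fitD05D03 = c(0.2, 0.1, 0.0, 0.3, 0.2, -2.0),
  fitD11D03 = c(-3.0, 0.5, -2.5, 0.4, 0.6, -3.0),
  stringsAsFactors = FALSE
)

# Perfects that complement (fitD05D03 > -1) but drop out at 400x MIC (fitD11D03 < -1)
is_dropout_perfect <- toy$mutations == 0 & toy$fitD05D03 > -1 & toy$fitD11D03 < -1
dropout_ids <- unique(toy$ID[is_dropout_perfect])

# Mutants of those perfects that stay above the -1 cutoff at 400x MIC
is_rescued_mutant <- toy$mutations != 0 & toy$fitD11D03 > -1 & toy$ID %in% dropout_ids

toy_gof <- toy[is_dropout_perfect | is_rescued_mutant, ]
print(toy_gof)
```

Only homolog A qualifies here: B's perfect survives 400x MIC and C's perfect fails complementation, so neither contributes candidate gain-of-function mutants.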

Validate that all retained perfects (mutations == 0) have fitD05D03 > -1.0 in COMPLEMENTATION and fitD11D03 < -1.0 at 400x MIC.

# Verification step
verification_result_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations == 0) %>%
  mutate(
    condition_met = fitD05D03 > -1.0 & fitD11D03 < -1,
    fitD05D03_check = fitD05D03 > -1.0,
    fitD11D03_check = fitD11D03 < -1
  )

# Check if all rows meet the condition
all_conditions_met_400mic <- all(verification_result_400mic$condition_met)

# Summary of the verification
verification_summary_400mic <- verification_result_400mic %>%
  summarise(
    total_rows = n(),
    rows_meeting_both_conditions = sum(condition_met),
    rows_meeting_fitD05D03 = sum(fitD05D03_check),
    rows_meeting_fitD11D03 = sum(fitD11D03_check)
  )

# Print results
print("Verification Summary:")
[1] "Verification Summary:"
print(verification_summary_400mic)
print(paste("All conditions met:", all_conditions_met_400mic))
[1] "All conditions met: TRUE"
# If there are any rows not meeting the conditions, display them
if (!all_conditions_met_400mic) {
  print("Rows not meeting both conditions:")
  print(verification_result_400mic %>% filter(!condition_met) %>% select(ID, fitD05D03, fitD11D03))
}

Validate that rows where mutations != 0 have an ID that matches rows where mutations == 0 and fitD11D03 < -1.0.

# 1. First, create two subsets of the data
zero_mutation_rows_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations == 0 & fitD11D03 < -1.0)

non_zero_mutation_rows_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations != 0 & fitD11D03 > -1.0)

# 2. Check that all IDs in non_zero_mutation_rows are present in zero_mutation_rows
all_valid_ids_400mic <- all(non_zero_mutation_rows_400mic$ID %in% zero_mutation_rows_400mic$ID)
print(paste("All non-zero mutation rows have a matching zero mutation row:", all_valid_ids_400mic))
[1] "All non-zero mutation rows have a matching zero mutation row: TRUE"
# 3. If the above is FALSE, find the problematic IDs
if (!all_valid_ids_400mic) {
  problematic_ids_400mic <- setdiff(non_zero_mutation_rows_400mic$ID, zero_mutation_rows_400mic$ID)
  print("IDs with non-zero mutations but no matching zero mutation row:")
  print(problematic_ids_400mic)
}

# 4. Check for any IDs in zero_mutation_rows that don't have a corresponding non-zero mutation row
ids_without_non_zero_400mic <- setdiff(zero_mutation_rows_400mic$ID, non_zero_mutation_rows_400mic$ID)
print("IDs with zero mutations but no corresponding non-zero mutation rows:")
[1] "IDs with zero mutations but no corresponding non-zero mutation rows:"
print(ids_without_non_zero_400mic)
  [1] "NP_269082"    "NP_390064"    "NP_721345"    "NP_747233"    "NP_944205"    "WP_000162304" "WP_000162306" "WP_000162307"
  [9] "WP_000162309" "WP_000162310" "WP_000162311" "WP_000162312" "WP_000162313" "WP_000162437" "WP_000162448" "WP_000162450"
 [17] "WP_000162451" "WP_000162457" "WP_000162462" "WP_000162470" "WP_000162471" "WP_000162477" "WP_000162478" "WP_000162482"
 [25] "WP_000162485" "WP_000162499" "WP_000162501" "WP_000162502" "WP_000162504" "WP_000175741" "WP_000175742" "WP_000175753"
 [33] "WP_000312553" "WP_000637196" "WP_000637199" "WP_000637203" "WP_000637204" "WP_000637211" "WP_000637216" "WP_000973544"
 [41] "WP_001035150" "WP_002065185" "WP_002114723" "WP_002118976" "WP_002216373" "WP_002256361" "WP_002569855" "WP_002578253"
 [49] "WP_002583107" "WP_002587016" "WP_002596555" "WP_002602832" "WP_002664533" "WP_002670343" "WP_002811844" "WP_002841202"
 [57] "WP_002889081" "WP_002904437" "WP_002906638" "WP_002909821" "WP_002917312" "WP_002924855" "WP_002934971" "WP_002984955"
 [65] "WP_002990062" "WP_003011321" "WP_003016297" "WP_003027976" "WP_003032260" "WP_003035097" "WP_003054187" "WP_003065197"
 [73] "WP_003072921" "WP_003088134" "WP_003093756" "WP_003218222" "WP_003268911" "WP_003321978" "WP_003325705" "WP_003441833"
 [81] "WP_003474891" "WP_003479843" "WP_003607688" "WP_003633095" "WP_003649012" "WP_003654807" "WP_003666061" "WP_003674233"
 [89] "WP_003683166" "WP_003717025" "WP_003747727" "WP_003748519" "WP_003758501" "WP_003762131" "WP_003773894" "WP_003779699"
 [97] "WP_003795223" "WP_004079202" "WP_004138800" "WP_004187031" "WP_004232047" "WP_004252176" "WP_004286630" "WP_004288595"
[105] "WP_004297954" "WP_004315768" "WP_004321824" "WP_004336126" "WP_004354535" "WP_004357589" "WP_004359715" "WP_004362212"
[113] "WP_004366090" "WP_004368566" "WP_004375525" "WP_004382106" "WP_004456775" "WP_004547470" "WP_004585598" "WP_004614090"
[121] "WP_004762151" "WP_004794794" "WP_004818861" "WP_004830208" "WP_004838134" "WP_004852952" "WP_004855114" "WP_004916485"
[129] "WP_004920127" "WP_004955442" "WP_004974792" "WP_004993976" "WP_005033778" "WP_005083623" "WP_005166996" "WP_005210368"
[137] "WP_005357877" "WP_005364041" "WP_005369895" "WP_005379345" "WP_005392293" "WP_005398549" "WP_005427260" "WP_005430428"
[145] "WP_005433217" "WP_005450473" "WP_005549928" "WP_005558437" "WP_005593154" "WP_005605215" "WP_005612722" "WP_005622496"
[153] "WP_005635192" "WP_005647059" "WP_005648179" "WP_005707354" "WP_005708598" "WP_005742812" "WP_005765106" "WP_005790642"
[161] "WP_005797943" "WP_005808285" "WP_005823311" "WP_005838503" "WP_005845866" "WP_005858889" "WP_005878541" "WP_005933374"
[169] "WP_005935635" "WP_005956098" "WP_005998635" "WP_006036807" "WP_006044971" "WP_006145153" "WP_006148032" "WP_006154666"
[177] "WP_006256429" "WP_006266370" "WP_006318421" "WP_006427068" "WP_006451480" "WP_006460949" "WP_006493548" "WP_006531276"
[185] "WP_006565566" "WP_006578126" "WP_006596310" "WP_006681821" "WP_006711975" "WP_006716908" "WP_006784524" "WP_006794536"
[193] "WP_006891260" "WP_006907657" "WP_006948792" "WP_006956699" "WP_006995561" "WP_007007746" "WP_007130009" "WP_007134290"
[201] "WP_007231246" "WP_007277186" "WP_007416017" "WP_007420749" "WP_007484265" "WP_007500805" "WP_007517024" "WP_007546669"
[209] "WP_007563269" "WP_007639786" "WP_007667931" "WP_007766775" "WP_007788671" "WP_007836218" "WP_007894152" "WP_007972129"
[217] "WP_008024774" "WP_008044785" "WP_008073041" "WP_008106982" "WP_008112253" "WP_008136828" "WP_008174809" "WP_008209042"
[225] "WP_008244029" "WP_008254783" "WP_008466551" "WP_008487653" "WP_008506181" "WP_008556995" "WP_008664019" "WP_008754633"
[233] "WP_008761062" "WP_008808665" "WP_008811335" "WP_008826323" "WP_008925559" "WP_008989833" "WP_009124418" "WP_009133007"
[241] "WP_009175149" "WP_009193507" "WP_009244499" "WP_009262904" "WP_009268572" "WP_009317804" "WP_009372837" "WP_009412440"
[249] "WP_009416511" "WP_009417176" "WP_009437540" "WP_009443405" "WP_009461124" "WP_009496614" "WP_009500922" "WP_009526511"
[257] "WP_009586372" "WP_009589199" "WP_009641684" "WP_009669573" "WP_009677188" "WP_009731090" "WP_009843318" "WP_009850000"
[265] "WP_010073895" "WP_010359207" "WP_010371780" "WP_010381642" "WP_010735128" "WP_010759385" "WP_010761388" "WP_010780948"
[273] "WP_010850622" "WP_011000898" "WP_011009589" "WP_011162224" "WP_011183282" "WP_011201021" "WP_011213001" "WP_011242231"
[281] "WP_011264555" "WP_011275775" "WP_011282675" "WP_011286345" "WP_011303097" "WP_011346477" "WP_011399565" "WP_011482754"
[289] "WP_011489445" "WP_011584955" "WP_011608342" "WP_011627026" "WP_011643962"
# 5. Summary statistics
print(paste("Number of unique IDs in zero mutation rows:", n_distinct(zero_mutation_rows_400mic$ID)))
[1] "Number of unique IDs in zero mutation rows: 337"
print(paste("Number of unique IDs in non-zero mutation rows:", n_distinct(non_zero_mutation_rows_400mic$ID)))
[1] "Number of unique IDs in non-zero mutation rows: 44"
# 6. Distribution of mutation counts for non-zero mutation rows
mutation_distribution_400mic <- non_zero_mutation_rows_400mic %>%
  group_by(mutations) %>%
  summarise(count = n()) %>%
  arrange(mutations)

print("Distribution of mutation counts:")
[1] "Distribution of mutation counts:"
print(mutation_distribution_400mic)

# 7. Check for any unexpected mutation values
unexpected_mutations_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations < 0 | mutations > 5)  # Adjust the upper bound as needed

if (nrow(unexpected_mutations_400mic) > 0) {
  print("Rows with unexpected mutation values:")
  print(unexpected_mutations_400mic)
} else {
  print("No unexpected mutation values found.")
}
[1] "No unexpected mutation values found."

Remove perfect IDs that have no corresponding mutants:

# Step 1: Identify IDs with zero mutations
zero_mutation_ids_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations == 0 & fitD11D03 < -1.0) %>%
  pull(ID)

# Step 2: Identify IDs with non-zero mutations
non_zero_mutation_ids_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations != 0 & fitD11D03 > -1.0) %>%
  pull(ID)

# Step 3: Find IDs that have zero mutations but no corresponding non-zero mutation rows
ids_to_remove_400mic <- setdiff(zero_mutation_ids_400mic, non_zero_mutation_ids_400mic)

# Step 4: Remove the rows with these IDs
dropout_mutants15_GOF_400mic_cleaned <- dropout_mutants15_GOF_400mic %>%
  filter(!(ID %in% ids_to_remove_400mic))

# Print summary
print(paste("Number of rows before cleaning:", nrow(dropout_mutants15_GOF_400mic)))
[1] "Number of rows before cleaning: 403"
print(paste("Number of rows after cleaning:", nrow(dropout_mutants15_GOF_400mic_cleaned)))
[1] "Number of rows after cleaning: 110"
print(paste("Number of rows removed:", nrow(dropout_mutants15_GOF_400mic) - nrow(dropout_mutants15_GOF_400mic_cleaned)))
[1] "Number of rows removed: 293"
print(paste("Number of unique IDs removed:", length(ids_to_remove_400mic)))
[1] "Number of unique IDs removed: 293"
# Optionally, you can print the removed IDs
print("IDs removed:")
[1] "IDs removed:"
print(ids_to_remove_400mic)
  [1] "NP_269082"    "NP_390064"    "NP_721345"    "NP_747233"    "NP_944205"    "WP_000162304" "WP_000162306" "WP_000162307"
  [9] "WP_000162309" "WP_000162310" "WP_000162311" "WP_000162312" "WP_000162313" "WP_000162437" "WP_000162448" "WP_000162450"
 [17] "WP_000162451" "WP_000162457" "WP_000162462" "WP_000162470" "WP_000162471" "WP_000162477" "WP_000162478" "WP_000162482"
 [25] "WP_000162485" "WP_000162499" "WP_000162501" "WP_000162502" "WP_000162504" "WP_000175741" "WP_000175742" "WP_000175753"
 [33] "WP_000312553" "WP_000637196" "WP_000637199" "WP_000637203" "WP_000637204" "WP_000637211" "WP_000637216" "WP_000973544"
 [41] "WP_001035150" "WP_002065185" "WP_002114723" "WP_002118976" "WP_002216373" "WP_002256361" "WP_002569855" "WP_002578253"
 [49] "WP_002583107" "WP_002587016" "WP_002596555" "WP_002602832" "WP_002664533" "WP_002670343" "WP_002811844" "WP_002841202"
 [57] "WP_002889081" "WP_002904437" "WP_002906638" "WP_002909821" "WP_002917312" "WP_002924855" "WP_002934971" "WP_002984955"
 [65] "WP_002990062" "WP_003011321" "WP_003016297" "WP_003027976" "WP_003032260" "WP_003035097" "WP_003054187" "WP_003065197"
 [73] "WP_003072921" "WP_003088134" "WP_003093756" "WP_003218222" "WP_003268911" "WP_003321978" "WP_003325705" "WP_003441833"
 [81] "WP_003474891" "WP_003479843" "WP_003607688" "WP_003633095" "WP_003649012" "WP_003654807" "WP_003666061" "WP_003674233"
 [89] "WP_003683166" "WP_003717025" "WP_003747727" "WP_003748519" "WP_003758501" "WP_003762131" "WP_003773894" "WP_003779699"
 [97] "WP_003795223" "WP_004079202" "WP_004138800" "WP_004187031" "WP_004232047" "WP_004252176" "WP_004286630" "WP_004288595"
[105] "WP_004297954" "WP_004315768" "WP_004321824" "WP_004336126" "WP_004354535" "WP_004357589" "WP_004359715" "WP_004362212"
[113] "WP_004366090" "WP_004368566" "WP_004375525" "WP_004382106" "WP_004456775" "WP_004547470" "WP_004585598" "WP_004614090"
[121] "WP_004762151" "WP_004794794" "WP_004818861" "WP_004830208" "WP_004838134" "WP_004852952" "WP_004855114" "WP_004916485"
[129] "WP_004920127" "WP_004955442" "WP_004974792" "WP_004993976" "WP_005033778" "WP_005083623" "WP_005166996" "WP_005210368"
[137] "WP_005357877" "WP_005364041" "WP_005369895" "WP_005379345" "WP_005392293" "WP_005398549" "WP_005427260" "WP_005430428"
[145] "WP_005433217" "WP_005450473" "WP_005549928" "WP_005558437" "WP_005593154" "WP_005605215" "WP_005612722" "WP_005622496"
[153] "WP_005635192" "WP_005647059" "WP_005648179" "WP_005707354" "WP_005708598" "WP_005742812" "WP_005765106" "WP_005790642"
[161] "WP_005797943" "WP_005808285" "WP_005823311" "WP_005838503" "WP_005845866" "WP_005858889" "WP_005878541" "WP_005933374"
[169] "WP_005935635" "WP_005956098" "WP_005998635" "WP_006036807" "WP_006044971" "WP_006145153" "WP_006148032" "WP_006154666"
[177] "WP_006256429" "WP_006266370" "WP_006318421" "WP_006427068" "WP_006451480" "WP_006460949" "WP_006493548" "WP_006531276"
[185] "WP_006565566" "WP_006578126" "WP_006596310" "WP_006681821" "WP_006711975" "WP_006716908" "WP_006784524" "WP_006794536"
[193] "WP_006891260" "WP_006907657" "WP_006948792" "WP_006956699" "WP_006995561" "WP_007007746" "WP_007130009" "WP_007134290"
[201] "WP_007231246" "WP_007277186" "WP_007416017" "WP_007420749" "WP_007484265" "WP_007500805" "WP_007517024" "WP_007546669"
[209] "WP_007563269" "WP_007639786" "WP_007667931" "WP_007766775" "WP_007788671" "WP_007836218" "WP_007894152" "WP_007972129"
[217] "WP_008024774" "WP_008044785" "WP_008073041" "WP_008106982" "WP_008112253" "WP_008136828" "WP_008174809" "WP_008209042"
[225] "WP_008244029" "WP_008254783" "WP_008466551" "WP_008487653" "WP_008506181" "WP_008556995" "WP_008664019" "WP_008754633"
[233] "WP_008761062" "WP_008808665" "WP_008811335" "WP_008826323" "WP_008925559" "WP_008989833" "WP_009124418" "WP_009133007"
[241] "WP_009175149" "WP_009193507" "WP_009244499" "WP_009262904" "WP_009268572" "WP_009317804" "WP_009372837" "WP_009412440"
[249] "WP_009416511" "WP_009417176" "WP_009437540" "WP_009443405" "WP_009461124" "WP_009496614" "WP_009500922" "WP_009526511"
[257] "WP_009586372" "WP_009589199" "WP_009641684" "WP_009669573" "WP_009677188" "WP_009731090" "WP_009843318" "WP_009850000"
[265] "WP_010073895" "WP_010359207" "WP_010371780" "WP_010381642" "WP_010735128" "WP_010759385" "WP_010761388" "WP_010780948"
[273] "WP_010850622" "WP_011000898" "WP_011009589" "WP_011162224" "WP_011183282" "WP_011201021" "WP_011213001" "WP_011242231"
[281] "WP_011264555" "WP_011275775" "WP_011282675" "WP_011286345" "WP_011303097" "WP_011346477" "WP_011399565" "WP_011482754"
[289] "WP_011489445" "WP_011584955" "WP_011608342" "WP_011627026" "WP_011643962"
# Assign the cleaned data back to dropout_mutants15_GOF_400mic so downstream steps use the filtered set
dropout_mutants15_GOF_400mic <- dropout_mutants15_GOF_400mic_cleaned

Dropout Mutants

Summarize the number of perfects and mutants at each AA distance after filtering:

# Create a function to count unique mutIDs for a given number of mutations
dropout_mutants15_GOF_count_400mic <- function(data, mutation_count) {
  length(unique(subset(data, mutations == mutation_count)$mutID))
}

# Create a vector of counts for mutations 1-5
dropout15_counts_400mic <- sapply(1:5, function(x) dropout_mutants15_GOF_count_400mic(dropout_mutants15_GOF_400mic, x))

# Count perfects separately
perfects_count_400mic <- length(unique(subset(dropout_mutants15_GOF_400mic, mutations == 0 & fitD11D03 < -1.0)$mutID))

# Create a data frame with the results, including the summary row
dropout_mutants15_GOF_table_400mic <- data.frame(
  Mutations = c("Perfects (fit < -1.0)", "1 Mutation", "2 Mutations", "3 Mutations", "4 Mutations", "5 Mutations", "Total Mutants"),
  Count = c(perfects_count_400mic, dropout15_counts_400mic, sum(dropout15_counts_400mic))
)

# Print the table
print(dropout_mutants15_GOF_table_400mic)

GOF Fitness

GOF Fitness: Separate the dropout_mutants15_GOF_400mic dataset into two new dataframes, where DF1 contains perfects and DF2 contains mutants:

# Create a dataframe with mutations == 0
dropout_mutants15_GOF_no_mutations_400mic <- dropout_mutants15_GOF_400mic[dropout_mutants15_GOF_400mic$mutations == 0, ]

# Create a dataframe with mutations != 0
dropout_mutants15_GOF_with_mutations_400mic <- dropout_mutants15_GOF_400mic[dropout_mutants15_GOF_400mic$mutations != 0, ]

Now, re-combine these dataframes to retain only the mutants, with each mutant's fitness expressed relative to its parent perfect:

# Step 1: Prepare the reference dataframe
df_reference_400mic <- dropout_mutants15_GOF_no_mutations_400mic %>%
  select(ID, fitD11D03) %>%
  rename(reference_fitD11D03 = fitD11D03)

# Step 2: Join and calculate the difference
dropout_mutants15_GOF_fitness_400mic <- dropout_mutants15_GOF_with_mutations_400mic %>%
  left_join(df_reference_400mic, by = "ID") %>%
  mutate(fitD11D03 = fitD11D03 - reference_fitD11D03) %>%
  select(ID, mutID, mutations, fitD11D03)

# Print summary statistics
print(paste("Number of Mutants:", nrow(dropout_mutants15_GOF_fitness_400mic)))
[1] "Number of Mutants: 66"
print(paste("Unique IDs:", length(unique(dropout_mutants15_GOF_fitness_400mic$ID))))
[1] "Unique IDs: 44"
print(paste("Range of fitD11D03:", 
            paste(round(range(dropout_mutants15_GOF_fitness_400mic$fitD11D03, na.rm = TRUE), 1), collapse = " to ")))
[1] "Range of fitD11D03: 0.5 to 14.9"
# Count the unique parent IDs represented among the mutants
unique_ids_zero_mutations <- dropout_mutants15_GOF_fitness_400mic %>%
  distinct(ID) %>%
  nrow()

print(paste("Number of unique IDs:", unique_ids_zero_mutations))
[1] "Number of unique IDs: 44"
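The relative-fitness step above amounts to subtracting each parent's fitness from its mutants' fitness. A minimal base R sketch, using invented IDs, hypothetical mutID labels, and made-up values:

```r
# Invented parents (perfects) and mutants; mutID labels are hypothetical
parents <- data.frame(ID = c("A", "B"), fitD11D03 = c(-3.0, -2.0),
                      stringsAsFactors = FALSE)
mutants <- data.frame(ID = c("A", "A", "B"),
                      mutID = c("A_L28R", "A_W30G", "B_P21L"),
                      fitD11D03 = c(0.5, -0.5, 1.0),
                      stringsAsFactors = FALSE)

# Rename the parent column so it does not clash with the mutant column
names(parents)[names(parents) == "fitD11D03"] <- "reference_fitD11D03"

# Left join on ID, then compute mutant fitness minus parent fitness
rel <- merge(mutants, parents, by = "ID", all.x = TRUE)
rel$delta_fit <- rel$fitD11D03 - rel$reference_fitD11D03
print(rel[, c("ID", "mutID", "delta_fit")])
```

A positive `delta_fit` means the mutant outperforms its parent at this condition. The sketch keeps the difference in a separate column, whereas the notebook overwrites `fitD11D03` with it.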

Boxplot: Plot mutant fitness relative to parent variant by number of mutations:

GOF_muts_fitness_by_muts_plot_400mic <- ggplot(dropout_mutants15_GOF_fitness_400mic, 
                                        aes(x = factor(mutations), y = fitD11D03)) +
  geom_boxplot() +
  labs(title = "fitD11D03 by Number of Mutations", x = "Number of Mutations", y = "fitD11D03")

print(GOF_muts_fitness_by_muts_plot_400mic)

Histogram of Mutant Fitness: The distribution of mutant fitness (relative to the parent variant) appears approximately normal on visual inspection.

GOF_muts_fitness_dist_plot_400mic <- ggplot(dropout_mutants15_GOF_fitness_400mic, aes(x = fitD11D03)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "Distribution of fitD11D03", x = "fitD11D03", y = "Count")

print(GOF_muts_fitness_dist_plot_400mic)
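A visual impression of normality can be complemented with a formal check. The sketch below runs a Shapiro-Wilk test with base R's `shapiro.test()` on simulated values; `toy_fitness` is a hypothetical stand-in, and in the notebook the vector would be the relative fitness column.

```r
# Simulated stand-in for the relative fitness values (not the study's data)
set.seed(1)
toy_fitness <- rnorm(66, mean = 5, sd = 2)

# Shapiro-Wilk test: a small p-value argues against normality
sw <- shapiro.test(toy_fitness)
print(sw)
```

Because the mutants were selected with a fitness cutoff, the real distribution may be truncated on the left, which such a test would be expected to pick up.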

GOF Alignment

FASTA: Generate a FASTA file from the filtered dropout_mutants15_GoF perfects dataset for GoF analysis:

# First, let's ensure we have the correct unique IDs for mutations == 1
dropout_mutants15_GOF_400mic_1mut_unique_ids <- dropout_mutants15_GOF_fitness_400mic %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  pull(ID)

# Now, let's use these IDs to filter the dropout_mutants15_GOF dataset
dropout_mutants15_GOF_400mic_1mut_unique_id_seq <- dropout_mutants15_GOF_400mic %>%
  filter(ID %in% dropout_mutants15_GOF_400mic_1mut_unique_ids & mutations == 0) %>%
  select(ID, seq)

# Create the sequences in FASTA format
dropout_mutants15_GOF_400mic_fasta_content <- paste(">", dropout_mutants15_GOF_400mic_1mut_unique_id_seq$ID, "\n", dropout_mutants15_GOF_400mic_1mut_unique_id_seq$seq, "\n", sep = "", collapse = "")

# Define the file path in the working directory
dropout_mutants15_GOF_400mic_fasta_file_path <- file.path(getwd(), "GOF/MSA_Dropouts/400xMIC/FASTA/Lib15.GoF.perfects.400mic.fasta")

# Write the FASTA content to the file (37 unique IDs)
writeLines(dropout_mutants15_GOF_400mic_fasta_content, 
           con = dropout_mutants15_GOF_400mic_fasta_file_path)
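As a small illustration of the FASTA layout written above, the sketch below builds one record per sequence and writes it to a temporary file. The IDs and sequences are invented, and the temporary path is only for demonstration.

```r
# Invented sequences for illustration
toy_seqs <- data.frame(ID = c("seqA", "seqB"), seq = c("MKT", "MSL"),
                       stringsAsFactors = FALSE)

# One string per record: header line, newline, sequence line
fasta_records <- paste0(">", toy_seqs$ID, "\n", toy_seqs$seq)

# writeLines() appends a newline after each element
tmp <- tempfile(fileext = ".fasta")
writeLines(fasta_records, con = tmp)
print(readLines(tmp))
```

Keeping one element per record lets `writeLines()` supply the line breaks. The collapsed single-string form used above ends in `"\n"` and then receives one more newline from `writeLines()`, leaving a blank final line.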

Alignment: Use the clustalo executable to align the protein sequences associated with the dropout perfects. This will align the FASTA file Lib15.GoF.perfects.400mic.fasta for use in GoF analysis.

./Scripts/clustalo -i GOF/MSA_Dropouts/400xMIC/FASTA/Lib15.GoF.perfects.400mic.fasta -o GOF/MSA_Dropouts/400xMIC/FASTA/Lib15.GoF.perfects.400mic.tree.aligned.mod.aln --outfmt=clustal --force

Mapping Residues: Use the following map.aligned.residues.py python script to generate one csv file per homolog that maps each residue position in the homolog (orth_aanum) to its column in the Clustal alignment (msa_aanum):

import time
import csv

##################################
#INPUTS:

base_path = ""
trees_path_prefix = base_path+""

#clustal format alignment file
align_file_in = [trees_path_prefix+"GOF/MSA_Dropouts/400xMIC/FASTA/Lib15.GoF.perfects.400mic.tree.aligned.mod.aln"]

#number of seqs in each alignment file
num_samples_in_file = [38] #New FASTA w/ mutant fit > -1 (+1 from actual file count)

##################################
#OUTPUTS:

msa_map_out_path = [trees_path_prefix+"GOF/MSA_Dropouts/400xMIC/"]

# Loop to generate .csv files for each ID
for alni in range(1):#len(align_file_in)):
    #print(alni)
    
    ##################################
    #VARIABLES:
    
    #ID as key, align as value
    align_dict = dict()
    
    #num_samples
    num_samples = num_samples_in_file[alni]
    
    #pos key, consensus pos val
    IDaadictlist = [dict() for x in range(num_samples)]
    
    IDtoindexdict = dict()
    indexdtoIDict = dict()
    
    ##################################
    #CODE:
    
    line_count = 0
    #loop over all alignments:
    print(align_file_in[alni])
    for line in open(align_file_in[alni]):
        #skip header
        if line_count > 1:
            listWords = line.split('    ')
            ID = listWords[0]
            align = line[16:].rstrip()
            if ID.strip() != "":
                align_dict[ID] = align_dict.get(ID, "") + align.replace(" ", "")
        line_count += 1
    
    #print("NP_414590")
    #print(align_dict["NP_414590"])

    counter = 0
    for ID in align_dict:
        #print(ID)
        #print(align_dict[ID])
        IDtoindexdict[ID] = counter
        indexdtoIDict[counter]=ID
        align = align_dict[ID]
        
        aacounter = 1
        
        
        for i in range(len(align)):
            if align[i] != "-":
                
                #print(str(counter)+" "+str(aacounter))
                IDaadictlist[counter][aacounter]=i+1
                aacounter += 1
        counter += 1
        
    #print(len(IDaadictlist))
    for i in range(len(IDaadictlist)-1):
        #print(indexdtoIDict[i])
        #print(i)
        #print(alni)
        #print(indexdtoIDict[i])
        csvfile = open(str(msa_map_out_path[alni]+indexdtoIDict[i]+".csv"), 'w')
        fieldnames = ['orth_aanum','msa_aanum']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for j in IDaadictlist[i]:
            #print(str(j)+" "+str(IDaadictlist[i][j]))
            #save all data:
            writer.writerow({'orth_aanum':str(j),'msa_aanum':str(IDaadictlist[i][j])})
        csvfile.close()
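The core of the mapping performed by the script can be sketched in a few lines of R, shown here for consistency with the rest of the notebook. The aligned string is invented; the column names follow the csv header written by the script.

```r
# Invented aligned sequence: "-" marks alignment gaps
aligned <- "MK-T--AL"
chars <- strsplit(aligned, "")[[1]]

# Alignment columns that hold a residue (not a gap)
msa_cols <- which(chars != "-")

# Residue i of the ungapped sequence sits in alignment column msa_cols[i]
toy_map <- data.frame(orth_aanum = seq_along(msa_cols), msa_aanum = msa_cols)
print(toy_map)
```

Both columns are 1-based, matching the `i+1` and `aacounter = 1` conventions in the Python script.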

GOF Plots

Find GoF Perfects for Dropouts

# Create a data frame of unique IDs
mutants15_to_plot_400mic <- dropout_mutants15_GOF_fitness_400mic %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  select(ID)
# Initialize an empty vector to store IDs of mutants to be removed
mutants_to_remove_400mic <- character()

# Check for missing MSA files
for (i in 1:nrow(mutants15_to_plot_400mic)) {
  mutant15_current_temp <- mutants15_to_plot_400mic$ID[i]
  if (!file.exists(paste("GOF/MSA_Dropouts/400xMIC/", mutant15_current_temp, ".csv", sep = ""))) {
    mutants_to_remove_400mic <- c(mutants_to_remove_400mic, mutant15_current_temp)
  }
}

# Output the results
if (length(mutants_to_remove_400mic) > 0) {
  cat("The following mutants will be removed due to missing MSA files:\n")
  print(mutants_to_remove_400mic)
  cat("\nTotal number of mutants to be removed:", length(mutants_to_remove_400mic), "\n")
  
  # Remove the mutants without MSA files (drop = FALSE keeps the single-column data frame from collapsing to a vector)
  mutants15_to_plot_400mic <- mutants15_to_plot_400mic[!mutants15_to_plot_400mic$ID %in% mutants_to_remove_400mic, , drop = FALSE]
  cat("\nMutants remaining:", nrow(mutants15_to_plot_400mic), "\n")
} else {
  cat("All mutants have corresponding MSA files. No mutants will be removed.\n")
}
All mutants have corresponding MSA files. No mutants will be removed.
# If you want to see the remaining mutants
print(mutants15_to_plot_400mic)

Read in the E. coli map:

ecoli_map <- read.csv(file=paste("GOF/MSA_Dropouts/Comp/NP_414590.csv", sep=""), 
                      head=TRUE, sep=",")
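The E. coli map is what lets a mutation in any homolog be expressed in E. coli numbering: the homolog position is looked up in the homolog's map to get an alignment column, and that column is then looked up in the E. coli map. A minimal sketch with invented maps follows; `map_to_ecoli` is a hypothetical helper, not a function from the notebook.

```r
# Invented maps: residue number (orth_aanum) -> alignment column (msa_aanum)
homolog_map <- data.frame(orth_aanum = 1:4, msa_aanum = c(1L, 2L, 4L, 5L))
ecoli_toy   <- data.frame(orth_aanum = 1:4, msa_aanum = c(1L, 3L, 4L, 5L))

# Hypothetical helper: homolog position -> reference position
# Returns NA if the position is not in the homolog map,
# and -1 if the alignment column is a gap in the reference
map_to_ecoli <- function(pos, homolog_map, ref_map) {
  col <- homolog_map$msa_aanum[homolog_map$orth_aanum == pos]
  if (length(col) == 0) return(NA_integer_)
  hit <- ref_map$orth_aanum[ref_map$msa_aanum == col]
  if (length(hit) == 0) return(-1L)
  as.integer(hit)
}

print(map_to_ecoli(3, homolog_map, ecoli_toy))  # column 4 is shared by both maps
print(map_to_ecoli(2, homolog_map, ecoli_toy))  # column 2 is a gap in the reference
```

The lookup is only meaningful when both maps were generated from the same alignment, since `msa_aanum` values are column indices of one specific alignment.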

Make a new data frame which will keep all info

GOF_fitness_map_400mic <- data.frame(position=numeric(),
                              aa=character(),
                              mutations=numeric(),
                              fitness=numeric(),
                              posortho=numeric(),
                              ingap=character(),
                              mutID=character(),
                              ID=character())

aminoacids <- data.frame(aa=c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T','X'),
                         aanum=c(1:21))

MSA Mapping: Map the mutants (fitness difference from perfects >= 0.5) over all perfects (fit < -1) for GOF analysis:

#loop over all perfects
for (iii in 1:nrow(mutants15_to_plot_400mic)){
  
  #current ortholog:
  mutant_current_400mic <- as.character(mutants15_to_plot_400mic$ID[iii])
  
  #length of name
  name_size_400mic = nchar(paste(mutant_current_400mic,"_",sep=""))
  
  #get the MSA mapping
  mutant_map_400mic <- read.csv(file=paste("GOF/MSA_Dropouts/400xMIC/",mutant_current_400mic,".csv",sep=""),head=TRUE,sep=",")
  
  #grab the mutants with a fitness increase (GoF) greater than zero (do not include perfects from dataset)
  GOFmutIDinfo_temp_400mic <- dropout_mutants15_GOF_fitness_400mic %>%
    filter(ID == mutant_current_400mic) %>%
    filter(mutations != 0) %>%
    filter(fitD11D03 >= 0.5) ###CHANGE VALUE AS NEEDED (default = 0.5)
  
  # Check if GOFmutIDinfo_temp is empty
  if(nrow(GOFmutIDinfo_temp_400mic) == 0) {
    warning(paste("No non-zero mutation data found for ID:", mutant_current_400mic))
    next  # Skip to the next iteration of the outer loop
  }
  
  #loop over all mutants for this construct:
  for (mn in 1:nrow(GOFmutIDinfo_temp_400mic)) {
    
    #this mutants fitness
    gof_fit_temp_400mic <- GOFmutIDinfo_temp_400mic$fitD11D03[mn]  # or whichever fitness column you're using
    
    #grab the mut name
    mutations_names_400mic <- as.character(GOFmutIDinfo_temp_400mic$mutID[mn])
    
    #grab only the relevant portion of the name (strip the "<ID>_" prefix of the current ortholog)
    mutations_names_400mic <- substr(mutations_names_400mic, name_size_400mic+1, nchar(mutations_names_400mic))
    
    ## split the mutation string at underscores (one element per mutation)
    s <- strsplit(mutations_names_400mic, "_")
    
    for (mutnum in 1:GOFmutIDinfo_temp_400mic$mutations[mn]){
      
      #grab the corresponding mutation string
      mutcurr<-s[[1]][mutnum]
      
      #get the position
      mutpos <- as.numeric(str_extract(mutcurr, "[0-9]+"))
      
      #get ending aa
      to_aa <- substr(mutcurr, nchar(mutpos)+2, nchar(mutcurr))
      
      #find the number in the consensus seq
      gof_cons_aanum_index <- which(mutant_map_400mic$orth_aanum == mutpos)
      
      if (length(gof_cons_aanum_index) > 0) {
        gof_cons_aanum <- mutant_map_400mic$msa_aanum[gof_cons_aanum_index]
        
        #does this map to a non-gap
        if (gof_cons_aanum %in% ecoli_map$msa_aanum){
          
          #the corresponding e.coli residue
          e_coli_residue <- ecoli_map$orth_aanum[which(ecoli_map$msa_aanum == gof_cons_aanum)]
          
          #add this point to the data
          GOF_fitness_map_400mic <- rbind(GOF_fitness_map_400mic,
                                   data.frame(position=e_coli_residue,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp_400mic$mutations[mn],
                                              fitness=gof_fit_temp_400mic,
                                              posortho=mutpos,
                                              ingap="No",
                                              mutID=GOFmutIDinfo_temp_400mic$mutID[mn],
                                              ID=GOFmutIDinfo_temp_400mic$ID[mn]))
          
        } else {
          #if it's here it maps to a gap
          
          #add this point to the data
          GOF_fitness_map_400mic <- rbind(GOF_fitness_map_400mic,
                                   data.frame(position=-1,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp_400mic$mutations[mn],
                                              fitness=gof_fit_temp_400mic,
                                              posortho=mutpos,
                                              ingap="Yes",
                                              mutID=GOFmutIDinfo_temp_400mic$mutID[mn],
                                              ID=GOFmutIDinfo_temp_400mic$ID[mn]))
          
        }
      } else {
        warning(paste("No matching orth_aanum found for mutpos:", mutpos, "in ID:", mutant_current_400mic))
        # You might want to handle this case, perhaps by skipping this mutation or adding it to a separate list for review
      }
    }
  }
}
Warning: No matching orth_aanum found for mutpos: NA in ID: NP_636179
Warning: No matching orth_aanum found for mutpos: NA in ID: NP_831957
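The name parsing inside the loop can be isolated into a small helper, which makes it easier to test. The sketch below assumes the mutID format implied by the loop, `<ID>_<from><pos><to>` with multiple mutations joined by underscores. Both `parse_mutations` and the example mutID are hypothetical.

```r
# Hypothetical helper: split a mutID into one row per mutation
parse_mutations <- function(mutID, ID) {
  # Strip the "<ID>_" prefix using the length of this ID itself
  mut_part <- substr(mutID, nchar(ID) + 2, nchar(mutID))
  parts <- strsplit(mut_part, "_")[[1]]
  data.frame(
    from_aa = sub("[0-9].*$", "", parts),          # letters before the digits
    pos     = as.integer(gsub("[^0-9]", "", parts)), # the digits
    to_aa   = sub("^.*[0-9]", "", parts),          # letters after the digits
    stringsAsFactors = FALSE
  )
}

parsed <- parse_mutations("NP_636179_A23V_G45D", "NP_636179")
print(parsed)
```

Deriving the prefix length from the ID passed in, rather than from a variable set elsewhere, avoids truncating mutation names when IDs differ in length (for example `NP_` versus `WP_` accessions).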

Collapse the GOF fitness values by aa position along the protein sequence:

GOF_fitness_collapsed_by_pos_400mic <- GOF_fitness_map_400mic %>%
  filter(position > 0) %>%
  group_by(position) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))

GOF Mutant No. Plot

Plot the number of gain-of-function mutants recovered for each aa position along the protein sequence and include reference lines for the mean number of GOF mutants per position (red) and 2 SD above the mean (blue). All positions with GOF mutants at or above the 2 SD line are considered significant positions positively influencing the parent variant's ability to tolerate trimethoprim at 400x MIC.

GoF_plot_400mic <- ggplot(GOF_fitness_collapsed_by_pos_400mic, aes(x=position, y=numpoints, color=numortho)) +
  geom_segment(aes(x = 0, y = mean(numpoints)+2*sd(numpoints), 
                   xend = 160, 
                   yend = mean(numpoints)+2*sd(numpoints)),linetype=2,colour = "blue")+
  geom_segment(aes(x = 0, y = mean(numpoints), xend = 160, yend = mean(numpoints)),linetype=2,colour = "red")+
  geom_point(size=1.8)+
  labs(x = "Position (aa)", y ="Number of gain-of-function mutants",color="") +
  scale_color_gradient(low = "blue", 
                       high = "red",
                       name="Num.\nUniq.\nHomo.",
                       na.value="grey", 
                       limit = c(0,1.1*max(GOF_fitness_collapsed_by_pos_400mic$numortho))) +
  scale_x_continuous(breaks=seq(0,160,20))+
  theme(legend.position="left")
GoF_plot_400mic <- ggExtra::ggMarginal(GoF_plot_400mic,type = "histogram",
                    margins = "y",
                    bins=21,
                    col = 'black',
                    fill = 'red')
Warning: All aesthetics have length 1, but the data has 46 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
Warning: All aesthetics have length 1, but the data has 46 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
Warning: All aesthetics have length 1, but the data has 46 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
Warning: All aesthetics have length 1, but the data has 46 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
GoF_plot_400mic

Print a summary table of the significant aa positions along the protein sequence:

GOF_fitness_collapsed_by_pos_2sigma_400mic <- GOF_fitness_collapsed_by_pos_400mic %>%
  filter(numpoints >= (mean(GOF_fitness_collapsed_by_pos_400mic$numpoints) +
                         2*sd(GOF_fitness_collapsed_by_pos_400mic$numpoints)))

print(GOF_fitness_collapsed_by_pos_2sigma_400mic)

Calculate all Data and Stats:

GOF_fitness_collapsed_all_400mic <- GOF_fitness_map_400mic %>%
  filter(position > 0) %>%
  group_by(position, aa) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))
`summarise()` has grouped output by 'position'. You can override using the `.groups` argument.
gof_aa_dim <- nrow(aminoacids)
gof_ref_len <- nrow(ecoli_map)
#these matrices have the fitness/num/sd for each aa at each position:
gof_matrix = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_num = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_sd = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_numortho = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)

#populate matrix
for (i in 1:nrow(GOF_fitness_collapsed_all_400mic)){
  
  gof_matrix[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_400mic$aa[i])),GOF_fitness_collapsed_all_400mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_400mic$fitval[i])
  gof_matrix_num[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_400mic$aa[i])),GOF_fitness_collapsed_all_400mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_400mic$numpoints[i])
  gof_matrix_sd[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_400mic$aa[i])),GOF_fitness_collapsed_all_400mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_400mic$stdfit[i])
  gof_matrix_numortho[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_400mic$aa[i])),GOF_fitness_collapsed_all_400mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_400mic$numortho[i])
}

rownames(gof_matrix)<-aminoacids$aa
colnames(gof_matrix)<-c(1:gof_ref_len)
rownames(gof_matrix_num)<-aminoacids$aa
colnames(gof_matrix_num)<-c(1:gof_ref_len)
rownames(gof_matrix_sd)<-aminoacids$aa
colnames(gof_matrix_sd)<-c(1:gof_ref_len)
rownames(gof_matrix_numortho)<-aminoacids$aa
colnames(gof_matrix_numortho)<-c(1:gof_ref_len)

gof_matrix_melt <- melt(gof_matrix)
gof_matrix_num_melt <- melt(gof_matrix_num)
gof_matrix_sd_melt <- melt(gof_matrix_sd)
gof_matrix_numortho_melt <- melt(gof_matrix_numortho)

# Rename columns to "X1" and "X2"
names(gof_matrix_melt)[names(gof_matrix_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_num_melt)[names(gof_matrix_num_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_sd_melt)[names(gof_matrix_sd_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_numortho_melt)[names(gof_matrix_numortho_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")

gof_matrix_melt_only_GOFpos <- gof_matrix_melt %>%
  filter(X2 == 35)

gof_matrix_melt_only_GOFpos$mutposnum <- 0
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==35)] <- 1

gof_matrix_melt_only_GOFpos$aanum <- 0
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="A")] <- 12
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="C")] <- 10
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="D")] <- 5
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="E")] <- 4
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="F")] <- 19
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="G")] <- 11
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="H")] <- 3
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="I")] <- 15
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="K")] <- 1
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="L")] <- 14
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="M")] <- 16
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="N")] <- 6
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="P")] <- 17
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="Q")] <- 7
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="R")] <- 2
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="S")] <- 9
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="T")] <- 8
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="V")] <- 13
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="W")] <- 20
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="Y")] <- 18
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="X")] <- 21

gof_matrix_melt_only_GOFpos_wnum <- gof_matrix_melt_only_GOFpos %>%
  inner_join(gof_matrix_num_melt,by=c("X1","X2")) %>%
  dplyr::rename(mutnum=value.y,value=value.x)

GOF Position Plot

Plot the median fitness of each GoF mutation at the significant positions, labeled with the number of mutants observed for each amino acid:

# Define the order of amino acids for the rectangles
rect_order <- c("T")

# Create a data frame for the rectangles
rect_data <- data.frame(
aanum = match(rect_order, c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X")),
xmin = seq(0.5, by = 1, length.out = length(rect_order)),
xmax = seq(1.5, by = 1, length.out = length(rect_order)))

#plot the data from all mutants:
GOF_fit_nummut_plot_400mic <- ggplot(gof_matrix_melt_only_GOFpos_wnum, 
       aes(x=mutposnum, y=aanum,
           fill=value,
           label=mutnum)) +
  geom_tile() +
  geom_text() +
  # Add black rectangles
  geom_rect(data = rect_data,
           aes(xmin = xmin, xmax = xmax, ymin = aanum - 0.5, ymax = aanum + 0.5),
           fill = NA, color = "black", inherit.aes = FALSE) +
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                      high = "red",
                      name="Fitness",
                      na.value="grey",
                      limit = c(0, max(gof_matrix_melt_only_GOFpos_wnum$value))) +
  theme_minimal()+
  scale_x_continuous(name="Position (aa)",
                     breaks=c(1),
                     labels=c("35"))+
  scale_y_continuous(name="Amino acid", 
                     breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21),
                     labels=c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X"))

print(GOF_fit_nummut_plot_400mic)

Reproducibility

The session information is provided for full reproducibility.

devtools::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS 15.2
 system   aarch64, darwin20
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2025-01-23
 rstudio  2024.09.0+375 Cranberry Hibiscus (desktop)
 pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)

─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package          * version    date (UTC) lib source
 ade4               1.7-22     2023-02-06 [1] CRAN (R 4.3.0)
 ape              * 5.8        2024-04-11 [1] CRAN (R 4.3.1)
 aplot              0.2.2      2023-10-06 [1] CRAN (R 4.3.1)
 bio3d            * 2.4-5      2024-10-29 [1] CRAN (R 4.3.3)
 BiocGenerics     * 0.46.0     2023-06-04 [1] Bioconductor
 Biostrings       * 2.68.1     2023-05-21 [1] Bioconductor
 bitops             1.0-7      2021-04-24 [1] CRAN (R 4.3.0)
 cachem             1.0.8      2023-05-01 [1] CRAN (R 4.3.0)
 castor           * 1.8.0      2024-01-09 [1] CRAN (R 4.3.1)
 cli                3.6.2      2023-12-11 [1] CRAN (R 4.3.1)
 codetools          0.2-20     2024-03-31 [1] CRAN (R 4.3.1)
 colorspace         2.1-0      2023-01-23 [1] CRAN (R 4.3.0)
 cowplot          * 1.1.3      2024-01-22 [1] CRAN (R 4.3.1)
 crayon             1.5.2      2022-09-29 [1] CRAN (R 4.3.0)
 devtools         * 2.4.5      2022-10-11 [1] CRAN (R 4.3.0)
 digest             0.6.35     2024-03-11 [1] CRAN (R 4.3.1)
 dplyr            * 1.1.4      2023-11-17 [1] CRAN (R 4.3.1)
 ellipsis           0.3.2      2021-04-29 [1] CRAN (R 4.3.0)
 evaluate           0.23       2023-11-01 [1] CRAN (R 4.3.1)
 fansi              1.0.6      2023-12-08 [1] CRAN (R 4.3.1)
 farver             2.1.1      2022-07-06 [1] CRAN (R 4.3.0)
 fastmap            1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
 foreach            1.5.2      2022-02-02 [1] CRAN (R 4.3.0)
 fs                 1.6.3      2023-07-20 [1] CRAN (R 4.3.0)
 generics           0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
 GenomeInfoDb     * 1.36.4     2023-10-08 [1] Bioconductor
 GenomeInfoDbData   1.2.10     2023-09-13 [1] Bioconductor
 ggExtra          * 0.10.1     2023-08-21 [1] CRAN (R 4.3.0)
 ggfun              0.1.4      2024-01-19 [1] CRAN (R 4.3.1)
 ggnewscale       * 0.4.10     2024-02-08 [1] CRAN (R 4.3.1)
 ggplot2          * 3.5.1      2024-04-23 [1] CRAN (R 4.3.1)
 ggplotify          0.1.2      2023-08-09 [1] CRAN (R 4.3.0)
 ggridges         * 0.5.6      2024-01-23 [1] CRAN (R 4.3.1)
 ggtree           * 3.8.2      2023-07-30 [1] Bioconductor
 ggtreeExtra      * 1.10.0     2023-04-25 [1] Bioconductor
 glmnet           * 4.1-8      2023-08-22 [1] CRAN (R 4.3.0)
 glue               1.7.0      2024-01-09 [1] CRAN (R 4.3.1)
 gridExtra        * 2.3        2017-09-09 [1] CRAN (R 4.3.0)
 gridGraphics       0.5-1      2020-12-13 [1] CRAN (R 4.3.0)
 gtable             0.3.5      2024-04-22 [1] CRAN (R 4.3.1)
 htmltools          0.5.8.1    2024-04-04 [1] CRAN (R 4.3.1)
 htmlwidgets        1.6.4      2023-12-06 [1] CRAN (R 4.3.1)
 httpuv             1.6.15     2024-03-26 [1] CRAN (R 4.3.1)
 igraph           * 2.0.3      2024-03-13 [1] CRAN (R 4.3.1)
 IRanges          * 2.34.1     2023-07-02 [1] Bioconductor
 iterators          1.0.14     2022-02-05 [1] CRAN (R 4.3.0)
 jsonlite           1.8.8      2023-12-04 [1] CRAN (R 4.3.1)
 knitr            * 1.45       2023-10-30 [1] CRAN (R 4.3.1)
 labeling           0.4.3      2023-08-29 [1] CRAN (R 4.3.0)
 later              1.3.2      2023-12-06 [1] CRAN (R 4.3.1)
 lattice            0.22-6     2024-03-20 [1] CRAN (R 4.3.1)
 lazyeval           0.2.2      2019-03-15 [1] CRAN (R 4.3.0)
 lifecycle          1.0.4      2023-11-07 [1] CRAN (R 4.3.1)
 magrittr           2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
 MASS               7.3-60.0.1 2024-01-13 [1] CRAN (R 4.3.1)
 Matrix           * 1.6-5      2024-01-11 [1] CRAN (R 4.3.1)
 matrixStats      * 1.3.0      2024-04-11 [1] CRAN (R 4.3.1)
 memoise            2.0.1      2021-11-26 [1] CRAN (R 4.3.0)
 mime               0.12       2021-09-28 [1] CRAN (R 4.3.0)
 miniUI             0.1.1.1    2018-05-18 [1] CRAN (R 4.3.0)
 munsell            0.5.1      2024-04-01 [1] CRAN (R 4.3.1)
 naturalsort        0.1.3      2016-08-30 [1] CRAN (R 4.3.0)
 nlme               3.1-164    2023-11-27 [1] CRAN (R 4.3.1)
 patchwork        * 1.2.0      2024-01-08 [1] CRAN (R 4.3.1)
 pheatmap         * 1.0.12     2019-01-04 [1] CRAN (R 4.3.0)
 pillar             1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
 pkgbuild           1.4.4      2024-03-17 [1] CRAN (R 4.3.1)
 pkgconfig          2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
 pkgload            1.3.4      2024-01-16 [1] CRAN (R 4.3.1)
 plyr               1.8.9      2023-10-02 [1] CRAN (R 4.3.1)
 png                0.1-8      2022-11-29 [1] CRAN (R 4.3.0)
 profvis            0.3.8      2023-05-02 [1] CRAN (R 4.3.0)
 promises           1.3.0      2024-04-05 [1] CRAN (R 4.3.1)
 pscl             * 1.5.9      2024-01-31 [1] CRAN (R 4.3.1)
 purrr            * 1.0.2      2023-08-10 [1] CRAN (R 4.3.0)
 R6                 2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
 ragg               1.3.0      2024-03-13 [1] CRAN (R 4.3.1)
 RColorBrewer     * 1.1-3      2022-04-03 [1] CRAN (R 4.3.0)
 Rcpp             * 1.0.13     2024-07-17 [1] CRAN (R 4.3.3)
 RCurl              1.98-1.14  2024-01-09 [1] CRAN (R 4.3.1)
 remotes            2.5.0      2024-03-17 [1] CRAN (R 4.3.1)
 reshape          * 0.8.9      2022-04-12 [1] CRAN (R 4.3.0)
 reshape2         * 1.4.4      2020-04-09 [1] CRAN (R 4.3.0)
 reticulate       * 1.36.1     2024-04-22 [1] CRAN (R 4.3.1)
 rlang              1.1.3      2024-01-10 [1] CRAN (R 4.3.1)
 rmarkdown          2.26       2024-03-05 [1] CRAN (R 4.3.1)
 ROCR             * 1.0-11     2020-05-02 [1] CRAN (R 4.3.0)
 RSpectra           0.16-1     2022-04-24 [1] CRAN (R 4.3.0)
 rstudioapi         0.16.0     2024-03-24 [1] CRAN (R 4.3.1)
 S4Vectors        * 0.38.2     2023-09-24 [1] Bioconductor
 scales           * 1.3.0      2023-11-28 [1] CRAN (R 4.3.1)
 seqinr           * 4.2-36     2023-12-08 [1] CRAN (R 4.3.1)
 sessioninfo        1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
 shape              1.4.6.1    2024-02-23 [1] CRAN (R 4.3.1)
 shiny              1.8.1.1    2024-04-02 [1] CRAN (R 4.3.1)
 stringi          * 1.8.3      2023-12-11 [1] CRAN (R 4.3.1)
 stringr          * 1.5.1      2023-11-14 [1] CRAN (R 4.3.1)
 survival           3.5-8      2024-02-14 [1] CRAN (R 4.3.1)
 systemfonts        1.0.6      2024-03-07 [1] CRAN (R 4.3.1)
 textshaping        0.3.7      2023-10-09 [1] CRAN (R 4.3.1)
 tibble             3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
 tidyr            * 1.3.1      2024-01-24 [1] CRAN (R 4.3.1)
 tidyselect         1.2.1      2024-03-11 [1] CRAN (R 4.3.1)
 tidytree         * 0.4.6      2023-12-12 [1] CRAN (R 4.3.1)
 treeio             1.24.3     2023-07-30 [1] Bioconductor
 urlchecker         1.0.1      2021-11-30 [1] CRAN (R 4.3.0)
 usethis          * 2.2.3      2024-02-19 [1] CRAN (R 4.3.1)
 utf8               1.2.4      2023-10-22 [1] CRAN (R 4.3.1)
 vctrs              0.6.5      2023-12-01 [1] CRAN (R 4.3.1)
 viridis          * 0.6.5      2024-01-29 [1] CRAN (R 4.3.1)
 viridisLite      * 0.4.2      2023-05-02 [1] CRAN (R 4.3.0)
 withr              3.0.0      2024-01-16 [1] CRAN (R 4.3.1)
 xfun               0.43       2024-03-25 [1] CRAN (R 4.3.1)
 xtable             1.8-4      2019-04-21 [1] CRAN (R 4.3.0)
 XVector          * 0.40.0     2023-05-08 [1] Bioconductor
 yaml               2.3.8      2023-12-11 [1] CRAN (R 4.3.1)
 yulab.utils        0.1.4      2024-01-28 [1] CRAN (R 4.3.1)
 zlibbioc           1.46.0     2023-05-08 [1] Bioconductor

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

─ Python configuration ─────────────────────────────────────────────────────────────────────────────────────────────────────
 python:         /Users/krom/miniforge3/bin/python3
 libpython:      /Users/krom/miniforge3/lib/libpython3.10.dylib
 pythonhome:     /Users/krom/miniforge3:/Users/krom/miniforge3
 version:        3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
 numpy:           [NOT FOUND]
 
 NOTE: Python version was forced by use_python() function

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
---
title: "Gain-of-Function Mutants Analysis"
author: 'Authors: [Karl J. Romanowicz](https://kromanowicz.github.io/), Carmen Resnick, Samuel R. Hinton, Calin Plesa'
output:
  html_notebook:
    theme: spacelab
    toc: yes
    toc_depth: 5
    toc_float:
      collapsed: yes
      smooth_scroll: yes
  html_document:
    toc: yes
    toc_depth: '5'
    df_print: paged
  pdf_document:
    toc: yes
    toc_depth: '5'
---

**R Notebook:** <font color="green">Provides reproducible analysis for **Gain-of-Function Mutants** in the following manuscript:</font>

**Citation:** Romanowicz KJ, Resnick C, Hinton SR, Plesa C. Exploring antibiotic resistance in diverse homologs of the dihydrofolate reductase protein family through broad mutational scanning. ***bioRxiv***, 2025.

**GitHub Repository:** [https://github.com/PlesaLab/DHFR](https://github.com/PlesaLab/DHFR)

**NCBI BioProject:** [https://www.ncbi.nlm.nih.gov/bioproject/1189478](https://www.ncbi.nlm.nih.gov/bioproject/1189478)

# Experiment

This pipeline processes a library of 1,536 DHFR homologs and their associated mutants, with two-fold redundancy (two codon variants per sequence). Fitness scores are derived from a multiplexed in-vivo assay using a trimethoprim concentration gradient, assessing the ability of these homologs and their mutants to complement functionality in an *E. coli* knockout strain and their tolerance to trimethoprim treatment. This analysis provides insights into how antibiotic resistance evolves across a range of evolutionary starting points. Sequence data were generated using the Illumina NovaSeq platform with 100 bp paired-end sequencing of amplicons.

![Methods overview to achieve a broad-mutational scan for DHFR homologs.](Images/DHFR.Diagram.png)

```{css}
.badCode {
background-color: lightpink;
font-weight: bold;
}

.goodCode {
background-color: lightgreen;
font-weight: bold;
}

.sharedCode {
background-color: lightblue;
font-weight: bold;
}

table {
  margin: auto;
  border-top: 1px solid #666;
  border-bottom: 1px solid #666;
}
table thead th { border-bottom: 1px solid #ddd; }
th, td { padding: 5px; }
thead, tfoot, tr:nth-child(even) { background: #eee; }
```

```{r setup, include=FALSE}
# Set global options for notebook
knitr::opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE,
                      class.source = "bg-success")

# Getting the path of your current open file and set as wd
current_path = rstudioapi::getActiveDocumentContext()$path 
setwd(dirname(current_path))
print(getwd())
```

# Packages
The following R packages must be installed prior to loading into the R session. See the **Reproducibility** tab for a complete list of packages and their versions used in this workflow.
```{r message=FALSE, warning=FALSE, results='hide'}
# Load the latest version of python (3.10.14) for downstream use:
library(reticulate)
use_python("/Users/krom/miniforge3/bin/python3")

# Make a vector of required packages
required.packages <- c("ape", "bio3d", "Biostrings", "castor", "cowplot", "devtools", "dplyr", "ggExtra", "ggnewscale", "ggplot2", "ggridges", "ggtree", "ggtreeExtra", "glmnet", "gridExtra","igraph", "knitr", "matrixStats", "patchwork", "pheatmap", "purrr", "pscl", "RColorBrewer", "reshape","reshape2", "ROCR", "seqinr", "scales", "stringr", "stringi", "tidyr", "tidytree", "viridis")

# Load required packages with error handling
loaded.packages <- lapply(required.packages, function(package) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package, dependencies = TRUE)
    if (!require(package, character.only = TRUE)) {
      message("Package ", package, " could not be installed and loaded.")
      return(NULL)
    }
  }
  return(package)
})

# Remove NULL entries from loaded packages
loaded.packages <- loaded.packages[!sapply(loaded.packages, is.null)]
```
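
One caveat with the loader above: `install.packages()` only searches CRAN, so packages that the session information lists with a Bioconductor source (for example Biostrings, ggtree, and ggtreeExtra) would not be installed by it; those are typically installed with `BiocManager::install()`. The following is a minimal sketch of a pre-flight check that only reports missing packages and installs nothing. `find_missing_packages` is a hypothetical helper, not part of the original workflow.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
find_missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE/FALSE without attaching the package
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Example with two base packages, which should always be available
missing_pkgs <- find_missing_packages(c("stats", "utils"))
if (length(missing_pkgs) > 0) {
  message("Install these first (CRAN or Bioconductor): ",
          paste(missing_pkgs, collapse = ", "))
}
```

Running this check before the loader makes it easier to see which packages need a manual installation step.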

```{r class.output="sharedCode", echo=FALSE}
# Print loaded packages
cat("Loaded packages:", paste(loaded.packages, collapse = ", "), "\n")
```

```{r include=FALSE}
# set.seed is used to fix the random number generation to make the results repeatable
set.seed(123)
```

# Import Data Files

Import **MUTANTS** files generated from [DHFR.4.Mutants.RMD](https://github.com/PlesaLab/DHFR) relevant for downstream analysis.
```{r}
# mut_collapse_15
mut_collapse_15 <- read.csv("Mutants/mutants_files_formatted/mut_collapse_15.csv", 
                         header = TRUE, stringsAsFactors = FALSE)
```

Import **BMS** files generated from [DHFR.5.BMS.RMD](https://github.com/PlesaLab/DHFR) relevant for downstream analysis.
```{r}
# protein_info_1H1T
protein_info_1H1T <- read.csv("BMS/bms_files_formatted/protein_info_1H1T.csv", 
                         header = TRUE, stringsAsFactors = FALSE)

# BMS_matrix15_perfects_and_1_melt
BMS_matrix15_perfects_and_1_melt <- read.csv("BMS/bms_files_formatted/BMS_matrix15_perfects_and_1_melt.csv", 
                         header = TRUE, stringsAsFactors = FALSE)

# BMS_matrix15_perfects_and_1_num_melt
BMS_matrix15_perfects_and_1_num_melt <- read.csv("BMS/bms_files_formatted/BMS_matrix15_perfects_and_1_num_melt.csv", 
                         header = TRUE, stringsAsFactors = FALSE)
```

# GOF Mutants Analysis

<font color="blue">**This section is based on the R file: "R_dropout_GOF.R".**</font> It describes how to determine if certain mutant versions of a designed homolog increase in fitness under trimethoprim selection (i.e., gain of function after mutation).
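
As a rough illustration of the gain-of-function idea, the sketch below flags a mutant as GOF when its parent falls below a fitness cutoff and the mutant reaches it. The helper name `is_gain_of_function` is hypothetical, the default cutoff of -1 is an assumption borrowed from the complementation threshold used in the chunks of this notebook, and the fitness values are invented.

```r
# Hypothetical helper: flag mutants that rescue a non-functional parent.
# `threshold` is an assumption taken from the -1 cutoff used in this notebook.
is_gain_of_function <- function(parent_fitness, mutant_fitness, threshold = -1) {
  rescued <- parent_fitness < threshold & mutant_fitness >= threshold
  # A missing fitness value on either side cannot support a GOF call
  rescued[is.na(rescued)] <- FALSE
  rescued
}

# Invented parent/mutant fitness pairs for illustration
parent_fitness <- c(-3.2, -3.2, 0.4, -1.8, NA)
mutant_fitness <- c(-0.3, -2.9, 0.9, NA, 0.5)
gof_calls <- is_gain_of_function(parent_fitness, mutant_fitness)
```

Only the first pair is flagged: the parent is below the cutoff and the mutant is above it. Pairs with a missing value are treated as not GOF rather than propagated as `NA`.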

## Complementation

Start with two smoothed histograms (density plots): (1) the distribution of perfects (y-axis) by fitness (x-axis), with fitness ≤ -1 colored blue and fitness > -1 colored gold, and (2) the distribution of mutants (y-axis) by fitness (x-axis), using the same coloring.

**Perfects:** Smooth Histogram
```{r}
# Subset perfects (mutations == 0)
L15_perfects_complementation <- mut_collapse_15 %>%
  filter(mutations == 0)

# Remove NA and infinite values for x-axis scaling
fitD05D03_perf_clean <- L15_perfects_complementation$fitD05D03[is.finite(L15_perfects_complementation$fitD05D03)]

# Calculate the range of the data
x_min_perf <- floor(min(fitD05D03_perf_clean, na.rm = TRUE))
x_max_perf <- ceiling(max(fitD05D03_perf_clean, na.rm = TRUE))

# Plot smooth density curve
L15_perfects_complementation_density <- ggplot(L15_perfects_complementation, aes(x = fitD05D03)) +
  geom_density(aes(y = after_stat(density * 100), 
                   fill = ifelse(fitD05D03 <= -1, "darkblue", "gold")), 
               alpha = 0.75) +
  geom_vline(xintercept = -1, linetype = "dashed", color = "black") +
  scale_fill_identity() +
  scale_x_continuous(breaks = seq(x_min_perf, x_max_perf, by = 1)) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(x = "Median Fitness (LogFC)", y = "Percentage") +
  theme_minimal() +
  theme(axis.line = element_line(colour = 'black', linewidth = 1.0), 
        axis.ticks = element_line(colour = "black", linewidth = 1.0),
        axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.position = "none")

# Display plot
print(L15_perfects_complementation_density)
```

```{r echo=FALSE}
#Saving 6 x 5 in images
ggsave(file="GOF/PLOTS/Lib15_Complementation_Perfects_low_high_fitness_density.png", plot=L15_perfects_complementation_density,
       width=6, height=5, units="in")
```

```{r}
# Subset perfects (mutations == 0)
L15_perfects_complementation <- mut_collapse_15 %>%
  filter(mutations == 0)

# Remove NA and infinite values for x-axis scaling
fitD05D03_perf_clean <- L15_perfects_complementation$fitD05D03[is.finite(L15_perfects_complementation$fitD05D03)]

# Calculate the range of the data
x_min_perf <- floor(min(fitD05D03_perf_clean, na.rm = TRUE))
x_max_perf <- ceiling(max(fitD05D03_perf_clean, na.rm = TRUE))

# Create density data
density_data <- density(L15_perfects_complementation$fitD05D03, na.rm = TRUE)

# Create a data frame from the density data
df_density <- data.frame(x = density_data$x, y = density_data$y)

# Split the data at x = -1
df_left <- df_density[df_density$x <= -1, ]
df_right <- df_density[df_density$x >= -1, ]

# Ensure the split point is included in both datasets
df_left <- rbind(df_left, data.frame(x = -1, y = df_left$y[nrow(df_left)]))
df_right <- rbind(data.frame(x = -1, y = df_right$y[1]), df_right)

# Plot using geom_area
L15_perfects_complementation_density <- ggplot() +
  geom_area(data = df_left, aes(x = x, y = y * 100), fill = "darkblue", alpha = 0.75) +
  geom_area(data = df_right, aes(x = x, y = y * 100), fill = "gold", alpha = 0.75) +
  geom_line(data = df_density, aes(x = x, y = y * 100), color = "black", linewidth = 0.5) +  # Add black outline
  geom_vline(xintercept = -1, linetype = "dashed", color = "black") +
  scale_x_continuous(breaks = seq(x_min_perf, x_max_perf, by = 1)) +
  scale_y_continuous(labels = function(x) paste0(x, "%"), 
                     limits = c(0, max(df_density$y) * 100)) +  # Set y-axis limits
  labs(x = "Median Fitness (LogFC)", y = "Percentage") +
  theme_minimal() +
  theme(axis.line = element_line(colour = 'black', linewidth = 1.0), 
        axis.ticks = element_line(colour = "black", linewidth = 1.0),
        axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.position = "none")

# Display plot
print(L15_perfects_complementation_density)
```
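
The split in the chunk above closes each half at x = -1 with the nearest grid value of the density on that side, so the two areas can meet at slightly different heights. A minimal sketch of an alternative follows, assuming linear interpolation with base R's `approx()` is acceptable for this plot. `split_density_at` is a hypothetical helper and the simulated fitness values are illustrative only.

```r
# Hypothetical helper: estimate a density, then split the curve at `threshold`
# so both halves share one interpolated point and meet without a gap.
split_density_at <- function(x, threshold = -1) {
  d <- density(x[is.finite(x)])
  # Linearly interpolate the density height at the threshold
  y_thr <- approx(d$x, d$y, xout = threshold)$y
  left  <- data.frame(x = c(d$x[d$x < threshold], threshold),
                      y = c(d$y[d$x < threshold], y_thr))
  right <- data.frame(x = c(threshold, d$x[d$x > threshold]),
                      y = c(y_thr, d$y[d$x > threshold]))
  list(left = left, right = right)
}

# Simulated bimodal fitness values, including non-finite entries
set.seed(1)
toy_fitness <- c(rnorm(200, mean = -3), rnorm(300, mean = 0.5), NA, -Inf)
halves <- split_density_at(toy_fitness, threshold = -1)
```

Each element of `halves` can be passed to `geom_area()` in the same way as `df_left` and `df_right`.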

```{r echo=FALSE}
#Saving 6 x 5 in images
ggsave(file="GOF/PLOTS/Lib15_Complementation_Perfects_low_high_fitness_density.v2.png", plot=L15_perfects_complementation_density,
       width=6, height=5, units="in")
```


**Percent of Perfects with Strong Depletion (fitness < -2.5)**
```{r class.output="goodCode"}
# Calculate the percentages and counts
results <- L15_perfects_complementation %>%
  filter(!is.na(fitD05D03)) %>%  # Remove rows where fitD05D03 is NA
  summarise(
    total_unique_IDs = n_distinct(ID),
    IDs_below_neg2_5 = n_distinct(ID[fitD05D03 < -2.5]),
    percentage_below_neg2_5 = (IDs_below_neg2_5 / total_unique_IDs) * 100,
    IDs_below_neg1 = n_distinct(ID[fitD05D03 < -1]),
    IDs_above_or_equal_neg1 = n_distinct(ID[fitD05D03 >= -1]),
    percentage_below_neg1 = (IDs_below_neg1 / total_unique_IDs) * 100,
    percentage_above_or_equal_neg1 = (IDs_above_or_equal_neg1 / total_unique_IDs) * 100
  )

# Print the results
print(paste0("Percentage of unique IDs with fitD05D03 < -2.5: ", 
             round(results$percentage_below_neg2_5, 2), "%"))
print(paste("Total unique IDs:", results$total_unique_IDs))
print(paste("Unique IDs with fitD05D03 < -2.5:", results$IDs_below_neg2_5))

cat("\nAdditional fitness categories:\n")
print(paste("Unique IDs with fitD05D03 < -1:", results$IDs_below_neg1))
print(paste0("Percentage: ", round(results$percentage_below_neg1, 2), "%"))

print(paste("Unique IDs with fitD05D03 >= -1:", results$IDs_above_or_equal_neg1))
print(paste0("Percentage: ", round(results$percentage_above_or_equal_neg1, 2), "%"))
```
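
To make the counting logic above concrete, here is a toy example in base R. It assumes, as the chunk does, that an ID counts toward a category if any of its rows falls in that category. The data frame and its values are invented for illustration.

```r
# Toy data: one row per measured variant; IDs and fitness values are made up.
toy <- data.frame(
  ID        = c("A", "A", "B", "C", "D", "D"),
  fitD05D03 = c(-3.0, -0.5, -2.7, 0.2, NA, -1.4),
  stringsAsFactors = FALSE
)
toy <- toy[!is.na(toy$fitD05D03), ]  # drop rows without a fitness value

total_ids       <- length(unique(toy$ID))                       # A, B, C, D
ids_below       <- length(unique(toy$ID[toy$fitD05D03 < -1]))   # A, B, D
ids_at_or_above <- length(unique(toy$ID[toy$fitD05D03 >= -1]))  # A, C

pct_below       <- 100 * ids_below / total_ids
pct_at_or_above <- 100 * ids_at_or_above / total_ids
```

ID `A` has rows on both sides of the cutoff and is counted in both categories, so the two percentages need not sum to 100 whenever an ID has more than one row.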

**Mutants:** Smooth Histogram
```{r}
# Subset mutants with 1-5 mutations
L15_mutants_complementation <- mut_collapse_15 %>%
  filter(mutations > 0 & mutations < 6)

# Remove NA and infinite values for x-axis scaling
fitD05D03_mut_clean <- L15_mutants_complementation$fitD05D03[is.finite(L15_mutants_complementation$fitD05D03)]

# Calculate the range of the data
x_min_mut <- floor(min(fitD05D03_mut_clean, na.rm = TRUE))
x_max_mut <- ceiling(max(fitD05D03_mut_clean, na.rm = TRUE))

# Plot smooth density curve
L15_mutants_complementation_density <- ggplot(L15_mutants_complementation, aes(x = fitD05D03)) +
  geom_density(aes(y = after_stat(density * 100), 
                   fill = ifelse(fitD05D03 <= -1, "darkblue", "gold")), 
               alpha = 0.75) +
  geom_vline(xintercept = -1, linetype = "dashed", color = "black") +
  scale_fill_identity() +
  scale_x_continuous(breaks = seq(x_min_mut, x_max_mut, by = 1)) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(x = "Median Fitness (LogFC)", y = "Percentage") +
  theme_minimal() +
  theme(axis.line = element_line(colour = 'black', linewidth = 1.0), 
        axis.ticks = element_line(colour = "black", linewidth = 1.0),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.position = "none")

# Display plot
print(L15_mutants_complementation_density)
```

```{r echo=FALSE}
#Saving 5 x 5 in images
ggsave(file="GOF/PLOTS/Lib15_Complementation_Mutants_low_high_fitness_density.png", plot=L15_mutants_complementation_density,
       width=5, height=5, units="in")
```

**Perfects & Mutants:** Smooth Histogram
```{r}
# Subset perfects and mutants (0-5 mutations)
L15_perfects_mutants_complementation <- mut_collapse_15 %>%
  filter(mutations >= 0 & mutations < 6)

# Remove NA and infinite values for x-axis scaling
fitD05D03_perf_mut_clean <- L15_perfects_mutants_complementation$fitD05D03[is.finite(L15_perfects_mutants_complementation$fitD05D03)]

# Calculate the range of the data
x_min_perf_mut <- floor(min(fitD05D03_perf_mut_clean, na.rm = TRUE))
x_max_perf_mut <- ceiling(max(fitD05D03_perf_mut_clean, na.rm = TRUE))

# Plot smooth density curve
L15_perfects_mutants_complementation_density <- ggplot(L15_perfects_mutants_complementation, aes(x = fitD05D03)) +
  geom_density(aes(y = after_stat(density * 100), 
                   fill = ifelse(fitD05D03 <= -1, "darkblue", "gold")), 
               alpha = 0.75) +
  geom_vline(xintercept = -1, linetype = "dashed", color = "black") +
  scale_fill_identity() +
  scale_x_continuous(breaks = seq(x_min_perf_mut, x_max_perf_mut, by = 1)) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(x = "Median Fitness (LogFC)", y = "Percentage") +
  theme_minimal() +
  theme(axis.line = element_line(colour = 'black', linewidth = 1.0), 
        axis.ticks = element_line(colour = "black", linewidth = 1.0),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.position = "none")

# Display plot
print(L15_perfects_mutants_complementation_density)
```

```{r echo=FALSE}
# Save as a 5 x 5 in image
ggsave(file="GOF/PLOTS/Lib15_Complementation_Perfects_Mutants_low_high_fitness_density.png", plot=L15_perfects_mutants_complementation_density,
       width=5, height=5, units="in")
```

**Mutant Counts:**
```{r class.output="goodCode"}
# Step 1: Identify IDs that meet the criteria (mutations == 0 and fitD05D03 >= -1)
valid_IDs <- mut_collapse_15 %>%
  filter(mutations == 0 & fitD05D03 >= -1) %>%
  pull(ID)

# Step 2: Filter mutIDs based on the criteria and association with valid IDs
unique_mutIDinfo15_5AA_count <- mut_collapse_15 %>%
  filter(ID %in% valid_IDs) %>%  # Keep only rows associated with valid IDs
  filter(mutations > 0 & mutations < 6) %>%  # Keep mutIDs with 1-5 mutations
  distinct(mutID) %>%
  nrow()

# Format and print the result
formatted_count <- format(unique_mutIDinfo15_5AA_count, big.mark = ",")
print(paste("Number of unique mutIDs with 1-5 mutations associated with complementing perfects:", formatted_count))
```

```{r class.output="goodCode"}
# Step 1: Identify IDs that meet the criteria (mutations == 0 and fitD05D03 < -1)
LOF_IDs <- mut_collapse_15 %>%
  filter(mutations == 0 & fitD05D03 < -1) %>%
  pull(ID)

# Step 2: Filter mutIDs based on the criteria and association with LOF_IDs
GOF_mutants <- mut_collapse_15 %>%
  filter(ID %in% LOF_IDs) %>%  # Keep only rows associated with LOF_IDs
  filter(mutations == 1) %>%  # Keep mutIDs with 1 mutation
  filter(fitD05D03 >= -1)  # Keep GOF mutants with fitness >= -1

# Count unique mutIDs
GOF_mutIDinfo15_1AA_count <- GOF_mutants %>%
  distinct(mutID) %>%
  nrow()

# Step 3: Count how many of the original LOF_IDs have associated GOF mutants
LOF_IDs_with_GOF_mutants <- GOF_mutants %>%
  distinct(ID) %>%
  nrow()

# Format and print the results
formatted_LOF_IDs_count <- format(length(LOF_IDs), big.mark = ",")
formatted_mutants_count <- format(GOF_mutIDinfo15_1AA_count, big.mark = ",")
formatted_LOF_IDs_with_GOF_mutants <- format(LOF_IDs_with_GOF_mutants, big.mark = ",")

print(paste("Number of original IDs (mutations == 0) with fitD05D03 < -1:", formatted_LOF_IDs_count))
print(paste("Number of unique mutIDs with 1 mutation associated with these IDs and fitD05D03 >= -1:", formatted_mutants_count))
print(paste("Number of original IDs that have associated mutants meeting the criteria:", formatted_LOF_IDs_with_GOF_mutants))
```

**Both Plots:** Plot both graphs together:
```{r}
patch1 <- (L15_perfects_complementation_density | L15_mutants_complementation_density)
patch1

patch2 <- (L15_perfects_complementation_density | L15_mutants_complementation_density) / L15_perfects_mutants_complementation_density
patch2
```

```{r echo=FALSE}
# Save as a 10 x 5 in image
ggsave(file="GOF/PLOTS/Lib15_Complementation_Perfects_Mutants_Together_low_high_fitness_density.png", plot=patch1,
       width=10, height=5, units="in")

# Save as a 5 x 5 in image
ggsave(file="GOF/PLOTS/Lib15_Complementation_Perfects_Mutants_Both_low_high_fitness_density.png", plot=patch2,
       width=5, height=5, units="in")
```

Count the number of unique perfects with fitness > -1 and fitness < -1:
```{r class.output="goodCode"}
# Assuming L15_perfects_complementation is already defined
unique_ID_counts_perfects <- L15_perfects_complementation %>%
  group_by(ID) %>%
  summarise(
    min_fitness = min(fitD05D03, na.rm = TRUE),
    max_fitness = max(fitD05D03, na.rm = TRUE),
    all_na = all(is.na(fitD05D03))
  ) %>%
  mutate(fitness_category = case_when(
    all_na ~ "All NA",
    !is.finite(min_fitness) & !is.finite(max_fitness) ~ "All NA",
    max_fitness < -1 ~ "All less than -1",
    min_fitness > -1 ~ "All greater than -1",
    TRUE ~ "Spans -1"
  )) %>%
  group_by(fitness_category) %>%
  summarise(count = n())

print("Unique ID counts for perfects by fitness category:")
print(unique_ID_counts_perfects)

total_unique_IDs_perfects <- n_distinct(L15_perfects_complementation$ID)
print(paste("Total number of unique IDs in perfects:", total_unique_IDs_perfects))

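# Pull the ID column first: `n_distinct(.$ID)` at the end of a pipe would also
# receive the whole data frame and count distinct rows rather than distinct IDs
IDs_with_valid_fitness_perfects <- L15_perfects_complementation %>%
  filter(!is.na(fitD05D03)) %>%
  pull(ID) %>%
  n_distinct()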

print(paste("Number of unique IDs in perfects with at least one valid fitD05D03:", IDs_with_valid_fitness_perfects))

# Additional information: count of rows for each fitness category
rows_per_category_perfects <- L15_perfects_complementation %>%
  mutate(fitness_category = case_when(
    is.na(fitD05D03) ~ "NA",
    fitD05D03 < -1 ~ "Less than -1",
    fitD05D03 > -1 ~ "Greater than -1",
    TRUE ~ "Equal to -1"
  )) %>%
  group_by(fitness_category) %>%
  summarise(row_count = n())

print("Number of rows in each fitness category for perfects:")
print(rows_per_category_perfects)
```
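
The per-ID categories used above can be illustrated with a short Python sketch. It is illustrative only and not run as part of the pipeline; the helper name `fitness_category` and the toy values are assumptions. As a simplification, it treats infinite values the same as missing ones.

```python
import math

def fitness_category(values, threshold=-1.0):
    """Classify one ID's fitness values relative to a threshold (hypothetical helper)."""
    # Keep only finite numbers; None, NaN, and +/-Inf are treated as missing
    finite = [v for v in values if v is not None and math.isfinite(v)]
    if not finite:
        return "All NA"
    if max(finite) < threshold:
        return "All less than -1"
    if min(finite) > threshold:
        return "All greater than -1"
    # Values fall on both sides of the threshold, or at least one equals it
    return "Spans -1"

# Toy data: made-up IDs and fitness values
toy = {
    "id_A": [-2.3, -1.5],
    "id_B": [0.1, -0.4],
    "id_C": [-1.8, 0.2],
    "id_D": [None],
}
categories = {name: fitness_category(vals) for name, vals in toy.items()}
print(categories)
```

An ID lands in "Spans -1" only when it has several rows that disagree about the threshold, or a value exactly equal to -1.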

Count the number of unique mutants with fitness > -1 and fitness < -1:
```{r class.output="goodCode"}
# Count mutants from L15_mutants_complementation dataset
unique_mutID_counts <- L15_mutants_complementation %>%
  group_by(mutID) %>%
  summarise(
    min_fitness = min(fitD05D03, na.rm = TRUE),
    max_fitness = max(fitD05D03, na.rm = TRUE),
    all_na = all(is.na(fitD05D03))
  ) %>%
  mutate(fitness_category = case_when(
    all_na ~ "All NA",
    !is.finite(min_fitness) & !is.finite(max_fitness) ~ "All NA",
    max_fitness < -1 ~ "All less than -1",
    min_fitness > -1 ~ "All greater than -1",
    TRUE ~ "Spans -1"
  )) %>%
  group_by(fitness_category) %>%
  summarise(count = n())

print(unique_mutID_counts)

total_unique_mutIDs <- n_distinct(L15_mutants_complementation$mutID)
print(paste("Total number of unique mutIDs:", total_unique_mutIDs))

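# Pull the mutID column first: `n_distinct(.$mutID)` at the end of a pipe would also
# receive the whole data frame and count distinct rows rather than distinct mutIDs
mutIDs_with_valid_fitness <- L15_mutants_complementation %>%
  filter(!is.na(fitD05D03)) %>%
  pull(mutID) %>%
  n_distinct()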

print(paste("Number of unique mutIDs with at least one valid fitD05D03:", mutIDs_with_valid_fitness))

# Additional information: count of rows for each fitness category
rows_per_category <- L15_mutants_complementation %>%
  mutate(fitness_category = case_when(
    is.na(fitD05D03) ~ "NA",
    fitD05D03 < -1 ~ "Less than -1",
    fitD05D03 > -1 ~ "Greater than -1",
    TRUE ~ "Equal to -1"
  )) %>%
  group_by(fitness_category) %>%
  summarise(row_count = n())

print("Number of rows in each fitness category:")
print(rows_per_category)
```

### Dropout Perfects

Start by retrieving all dropout perfects (log-fold change less than -1.0) together with their corresponding GoF mutants (log-fold change greater than -1.0). Use the `mut_collapse_15` dataset, which includes 797 perfects (mutations = 0; numprunedBCs = 5) and 12,174 mutants that are within 5 AA of a perfect variant in the dataset and supported by at least 1 BC (numprunedBCs = 1).
```{r}
# Step 1: Identify IDs that have rows where mutations == 0 and fitD05D03 < -1.0
dropout15_ids_with_zero_mutations <- mut_collapse_15 %>%
  filter(mutations == 0 & fitD05D03 < -1.0) %>%
  distinct(ID) %>%
  pull(ID)

# Step 2: Filter the main dataset to keep mutants with fitness > -1.0 if they match a corresponding perfect ID
dropout_mutants15_GOF <- mut_collapse_15 %>%
  filter(
    (mutations == 0 & fitD05D03 < -1.0) |
    #(mutations != 0 & ID %in% dropout15_ids_with_zero_mutations & !is.na(fitD05D03))) %>%
    (mutations != 0 & fitD05D03 > -1.0 & ID %in% dropout15_ids_with_zero_mutations)) %>%
  dplyr::select(ID, mutID, numprunedBCs, mutations, fitD05D03, seq)
```

Validate that rows where mutations != 0 have an ID that matches rows where mutations == 0 and fitD05D03 < -1.0.
```{r class.output="goodCode"}
# 1. First, create two subsets of the data
zero_mutation_rows <- dropout_mutants15_GOF %>%
  filter(mutations == 0 & fitD05D03 < -1.0)

non_zero_mutation_rows <- dropout_mutants15_GOF %>%
  #filter(mutations != 0)
  filter(mutations != 0 & fitD05D03 > -1.0)

# 2. Check that all IDs in non_zero_mutation_rows are present in zero_mutation_rows
all_valid_ids <- all(non_zero_mutation_rows$ID %in% zero_mutation_rows$ID)
print(paste("All non-zero mutation rows have a matching zero mutation row:", all_valid_ids))

# 3. If the above is FALSE, find the problematic IDs
if (!all_valid_ids) {
  problematic_ids <- setdiff(non_zero_mutation_rows$ID, zero_mutation_rows$ID)
  print("IDs with non-zero mutations but no matching zero mutation row:")
  print(problematic_ids)
}

# 4. Check for any IDs in zero_mutation_rows that don't have a corresponding non-zero mutation row
ids_without_non_zero <- setdiff(zero_mutation_rows$ID, non_zero_mutation_rows$ID)
print("IDs with zero mutations but no corresponding non-zero mutation rows:")
print(ids_without_non_zero)

# 5. Summary statistics
print(paste("Number of unique IDs in zero mutation rows:", n_distinct(zero_mutation_rows$ID)))
print(paste("Number of unique IDs in non-zero mutation rows:", n_distinct(non_zero_mutation_rows$ID)))

# 6. Distribution of mutation counts for non-zero mutation rows
mutation_distribution <- non_zero_mutation_rows %>%
  group_by(mutations) %>%
  summarise(count = n()) %>%
  arrange(mutations)

print("Distribution of mutation counts:")
print(mutation_distribution)

# 7. Check for any unexpected mutation values
unexpected_mutations <- dropout_mutants15_GOF %>%
  filter(mutations < 0 | mutations > 5)  # Adjust the upper bound as needed

if (nrow(unexpected_mutations) > 0) {
  print("Rows with unexpected mutation values:")
  print(unexpected_mutations)
} else {
  print("No unexpected mutation values found.")
}
```

Remove unique perfect IDs that have no corresponding mutants with fitness greater than -1.0:
```{r class.output="goodCode"}
# Step 1: Identify IDs with zero mutations
zero_mutation_ids <- dropout_mutants15_GOF %>%
  filter(mutations == 0 & fitD05D03 < -1.0) %>%
  pull(ID)

# Step 2: Identify IDs with non-zero mutations
non_zero_mutation_ids <- dropout_mutants15_GOF %>%
  filter(mutations != 0 & fitD05D03 > -1.0) %>%
  pull(ID)

# Step 3: Find IDs that have zero mutations but no corresponding non-zero mutation rows
ids_to_remove <- setdiff(zero_mutation_ids, non_zero_mutation_ids)

# Step 4: Remove the rows with these IDs
dropout_mutants15_GOF_cleaned <- dropout_mutants15_GOF %>%
  filter(!(ID %in% ids_to_remove))

# Print summary
print(paste("Number of rows before cleaning:", nrow(dropout_mutants15_GOF)))
print(paste("Number of rows after cleaning:", nrow(dropout_mutants15_GOF_cleaned)))
print(paste("Number of rows removed:", nrow(dropout_mutants15_GOF) - nrow(dropout_mutants15_GOF_cleaned)))
print(paste("Number of unique IDs removed:", length(ids_to_remove)))

# Optionally, you can print the removed IDs
print("IDs removed:")
print(ids_to_remove)

# Assign the cleaned data back to dropout_mutants15_GOF if you want to update the original variable
dropout_mutants15_GOF <- dropout_mutants15_GOF_cleaned
```
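
The selection and cleaning logic of this section can be summarized in a minimal Python sketch. It is illustrative only; the toy rows and the names `dropout_ids`, `gof_rows`, and `rescued_ids` are assumptions and not objects from the pipeline. A perfect is kept only if it drops out (fitness < -1.0) and at least one of its mutants recovers (fitness > -1.0).

```python
THRESHOLD = -1.0

# Toy rows: (ID, mutID, mutations, fitness); all values are made up
rows = [
    ("hom1", "hom1", 0, -2.0),
    ("hom1", "hom1_A10G", 1, 0.3),
    ("hom1", "hom1_L5P", 1, -1.7),
    ("hom2", "hom2", 0, -1.4),
    ("hom2", "hom2_K3R", 1, -2.2),
    ("hom3", "hom3", 0, 0.5),
    ("hom3", "hom3_D7E", 1, 0.9),
]

# Step 1: perfects (0 mutations) that fail to complement
dropout_ids = {r[0] for r in rows if r[2] == 0 and r[3] < THRESHOLD}

# Step 2: mutants of those perfects whose fitness is above the threshold
gof_rows = [r for r in rows if r[2] != 0 and r[3] > THRESHOLD and r[0] in dropout_ids]

# Step 3: keep only the dropout perfects that have at least one such mutant
rescued_ids = {r[0] for r in gof_rows}
kept_perfects = [r for r in rows if r[2] == 0 and r[0] in rescued_ids]

result = kept_perfects + gof_rows
print(result)
```

In this toy example `hom2` is removed because none of its mutants recover, and `hom3` never enters the set because its perfect already complements.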

### Dropout Mutants
Summarize the number of perfects and mutants at each AA distance after filtering:
```{r class.output="goodCode"}
# Create a function to count unique mutIDs for a given number of mutations
dropout_mutants15_GOF_count <- function(data, mutation_count) {
  length(unique(subset(data, mutations == mutation_count)$mutID))
}

# Create a vector of counts for mutations 1-5
dropout15_counts <- sapply(1:5, function(x) dropout_mutants15_GOF_count(dropout_mutants15_GOF, x))

# Count perfects separately
perfects_count <- length(unique(subset(dropout_mutants15_GOF, mutations == 0 & fitD05D03 < -1.0)$mutID))

# Create a data frame with the results, including the summary row
dropout_mutants15_GOF_table <- data.frame(
  Mutations = c("Perfects (fit < -1.0)", "1 Mutation", "2 Mutations", "3 Mutations", "4 Mutations", "5 Mutations", "Total Mutations"),
  Count = c(perfects_count, dropout15_counts, sum(dropout15_counts))
)

# Print the table
print(dropout_mutants15_GOF_table)
```

```{r echo=FALSE}
# Save as .csv for use in other RMD files:
write.csv(dropout_mutants15_GOF, 'GOF/OUTPUT/Comp/Lib15.D05D03.gof.1numprunedBCs.5AA.muts.fitness-1.csv',
          row.names = FALSE, quote=FALSE)
```

### GOF Fitness

**GoF Fitness:** Separate the `dropout_mutants15_GOF` dataset into two new dataframes, where DF1 contains perfects and DF2 contains mutants:
```{r}
# Create a dataframe with mutations == 0
dropout_mutants15_GOF_no_mutations <- dropout_mutants15_GOF[dropout_mutants15_GOF$mutations == 0, ]

# Create a dataframe with mutations != 0
dropout_mutants15_GOF_with_mutations <- dropout_mutants15_GOF[dropout_mutants15_GOF$mutations != 0, ]

# ALTERNATIVE: Create a dataframe with mutations == 1 (only uses 1 aa mutation variants)
#dropout_mutants15_GOF_with_mutations <- dropout_mutants15_GOF[dropout_mutants15_GOF$mutations == 1, ]
```

Now, re-combine these dataframes to calculate the fitness change (delta) between mutants and their parent homologs:
```{r class.output="goodCode"}
# Step 1: Prepare the reference dataframe
df_reference <- dropout_mutants15_GOF_no_mutations %>%
  select(ID, fitD05D03) %>%
  rename(reference_fitD05D03 = fitD05D03)

# Step 2: Join and calculate the difference
dropout_mutants15_GOF_fitness <- dropout_mutants15_GOF_with_mutations %>%
  left_join(df_reference, by = "ID") %>%
  mutate(fitD05D03 = fitD05D03 - reference_fitD05D03) %>%
  select(ID, mutID, mutations, fitD05D03)

# Print summary statistics
print(paste("Number of Mutants:", nrow(dropout_mutants15_GOF_fitness)))
print(paste("Unique IDs:", length(unique(dropout_mutants15_GOF_fitness$ID))))
print(paste("Range of fitD05D03:", 
            paste(round(range(dropout_mutants15_GOF_fitness$fitD05D03, na.rm = TRUE), 1), collapse = " to ")))
```
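
The delta computed above is simply mutant fitness minus parent fitness, matched by ID. A minimal Python sketch with made-up values (the names `parent_fitness`, `mutants`, and `deltas` are assumptions for illustration):

```python
# Toy parent (perfect) fitness values keyed by homolog ID; values are made up
parent_fitness = {"hom1": -2.0, "hom2": -1.5}

# Toy mutants: (ID, mutID, fitness); values are made up
mutants = [
    ("hom1", "hom1_A10G", 0.5),
    ("hom1", "hom1_L5P", -0.5),
    ("hom2", "hom2_K3R", -0.75),
]

# Delta = mutant fitness minus the fitness of its parent homolog
deltas = {mut_id: fit - parent_fitness[hom_id] for hom_id, mut_id, fit in mutants}
print(deltas)
```

Because every retained mutant has fitness above -1.0 and every retained parent has fitness below -1.0, each delta is positive by construction; the magnitude of the gain is what varies.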

**Boxplot:** Plot mutant fitness relative to parent variant by number of mutations:
```{r}
GOF_muts_fitness_by_muts_plot <- ggplot(dropout_mutants15_GOF_fitness, 
                                        aes(x = factor(mutations), y = fitD05D03)) +
  geom_boxplot() +
  labs(title = "fitD05D03 by Number of Mutations", x = "Number of Mutations", y = "fitD05D03")

print(GOF_muts_fitness_by_muts_plot)
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/Comp/GOF.mut.fitness.by.mutations.complementation.png", 
       plot =GOF_muts_fitness_by_muts_plot,
       width = 4.5, height = 4.5, units = "in")
```

**Histogram of Mutant Fitness:** Shows the distribution of mutant fitness relative to the parent variant, which appears approximately normal.
```{r}
GOF_muts_fitness_dist_plot <- ggplot(dropout_mutants15_GOF_fitness, aes(x = fitD05D03)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "Distribution of fitD05D03", x = "fitD05D03", y = "Count")

print(GOF_muts_fitness_dist_plot)
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/Comp/GOF.mut.fitness.distribution.complementation.v2.png", 
       plot =GOF_muts_fitness_dist_plot,
       width = 4.5, height = 4.5, units = "in")
```

### GOF Alignment

**FASTA:** Generate a FASTA file of the perfects in the filtered `dropout_mutants15_GOF` dataset, keeping only perfects whose ID is shared with at least one 1-AA mutant, for GoF analysis:
```{r}
# First, let's ensure we have the correct unique IDs for mutations == 1
dropout_mutants15_GOF_1mut_unique_ids <- dropout_mutants15_GOF_fitness %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  pull(ID)

# Now, let's use these IDs to filter the dropout_mutants15_GOF dataset
dropout_mutants15_GOF_1mut_unique_id_seq <- dropout_mutants15_GOF %>%
  filter(ID %in% dropout_mutants15_GOF_1mut_unique_ids & mutations == 0) %>%
  select(ID, seq)

# Create the sequences in FASTA format
dropout_mutants15_GOF_fasta_content <- paste(">", dropout_mutants15_GOF_1mut_unique_id_seq$ID, "\n", dropout_mutants15_GOF_1mut_unique_id_seq$seq, "\n", sep = "", collapse = "")

# Define the file path in the working directory
dropout_mutants15_GOF_fasta_file_path <- file.path(getwd(), "GOF/MSA_Dropouts/Comp/FASTA/Lib15.GoF.perfects.complementation.fasta")

# Write the FASTA content to the file
writeLines(dropout_mutants15_GOF_fasta_content, 
           con = dropout_mutants15_GOF_fasta_file_path)
```

**Alignment:** Use the `clustalo` executable to align the protein sequences associated with the dropout perfects. This will align the FASTA file: **Lib15.GoF.perfects.complementation.fasta** for use in GoF analysis.
```{bash}
./Scripts/clustalo -i GOF/MSA_Dropouts/Comp/FASTA/Lib15.GoF.perfects.complementation.fasta -o GOF/MSA_Dropouts/Comp/FASTA/Lib15.GoF.perfects.complementation.tree.aligned.mod.aln --outfmt=clustal --force
```

**Mapping Residues:** Use the following `map.aligned.residues.py` Python script to generate one CSV file per designed homolog that maps each residue position in the homolog to its column in the Clustal alignment:
```{python}
import time
import csv

##################################
#INPUTS:

base_path = ""
trees_path_prefix = base_path+""

#clustal format alignment file
align_file_in = [trees_path_prefix+"GOF/MSA_Dropouts/Comp/FASTA/Lib15.GoF.perfects.complementation.tree.aligned.mod.aln"]

#number of seqs in each alignment file
num_samples_in_file = [197] #New FASTA w/ mutant fit > -1 (+1 from actual file count)

##################################
#OUTPUTS:

msa_map_out_path = [trees_path_prefix+"GOF/MSA_Dropouts/Comp/"]

# Loop to generate .csv files for each ID
for alni in range(1):#len(align_file_in)):
    #print(alni)
    
    ##################################
    #VARIABLES:
    
    #ID as key, align as value
    align_dict = dict()
    
    #num_samples = 419
    num_samples = num_samples_in_file[alni]
    
    #pos key, consensus pos val
    IDaadictlist = [dict() for x in range(num_samples)]
    
    IDtoindexdict = dict()
    indexdtoIDict = dict()
    
    ##################################
    #CODE:
    
    line_count = 0
    #loop over all alignments:
    print(align_file_in[alni])
    for line in open(align_file_in[alni]):
        #skip header
        if line_count > 1:
            listWords = line.split('    ')
            ID = listWords[0]
            align = line[16:].rstrip()
            if ID.strip() != "":
                align_dict[ID] = align_dict.get(ID, "") + align.replace(" ", "")
        line_count += 1
    
    #print("NP_414590")
    #print(align_dict["NP_414590"])

    counter = 0
    for ID in align_dict:
        #print(ID)
        #print(align_dict[ID])
        IDtoindexdict[ID] = counter
        indexdtoIDict[counter]=ID
        align = align_dict[ID]
        
        aacounter = 1
        
        
        for i in range(len(align)):
            if align[i] != "-":
                
                #print(str(counter)+" "+str(aacounter))
                IDaadictlist[counter][aacounter]=i+1
                aacounter += 1
        counter += 1
        
    #print(len(IDaadictlist))
    for i in range(len(IDaadictlist)-1):
        #print(indexdtoIDict[i])
        #print(i)
        #print(alni)
        #print(indexdtoIDict[i])
        csvfile = open(str(msa_map_out_path[alni]+indexdtoIDict[i]+".csv"), 'w')
        fieldnames = ['orth_aanum','msa_aanum']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for j in IDaadictlist[i]:
            #print(str(j)+" "+str(IDaadictlist[i][j]))
            #save all data:
            writer.writerow({'orth_aanum':str(j),'msa_aanum':str(IDaadictlist[i][j])})
        csvfile.close()
```
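
The core of the mapping step is independent of file parsing: walk along one aligned sequence and record, for every non-gap character, its residue number and its alignment column. A minimal Python sketch on a made-up two-sequence alignment (the function name `map_residues` and the toy sequences are assumptions, not part of the script above):

```python
def map_residues(aligned_seq, gap="-"):
    """Map 1-based ungapped residue numbers to 1-based alignment columns."""
    mapping = {}
    residue_number = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != gap:
            residue_number += 1
            mapping[residue_number] = column
    return mapping

# Toy alignment: two made-up sequences of equal aligned length
toy_alignment = {"seqA": "MK-LV", "seqB": "M-RLV"}
maps = {name: map_residues(seq) for name, seq in toy_alignment.items()}
print(maps)
```

Here residue 3 of `seqA` and residue 3 of `seqB` both sit in alignment column 4, which is what lets a position in one homolog be translated into the numbering of another. The keys and values correspond to the `orth_aanum` and `msa_aanum` columns written by the script.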

### GOF Plots

Find the dropout perfects that have at least one single-mutation GoF mutant:
```{r}
# Create a data frame of unique IDs
mutants15_to_plot <- dropout_mutants15_GOF_fitness %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  select(ID)
```

```{r class.output="goodCode"}
# Initialize an empty vector to store IDs of mutants to be removed
mutants_to_remove <- character()

# Check for missing MSA files
for (i in seq_len(nrow(mutants15_to_plot))) {
  mutant15_current_temp <- mutants15_to_plot$ID[i]
  if (!file.exists(paste("GOF/MSA_Dropouts/Comp/", mutant15_current_temp, ".csv", sep = ""))) {
    mutants_to_remove <- c(mutants_to_remove, mutant15_current_temp)
  }
}

# Output the results
if (length(mutants_to_remove) > 0) {
  cat("The following mutants will be removed due to missing MSA files:\n")
  print(mutants_to_remove)
  cat("\nTotal number of mutants to be removed:", length(mutants_to_remove), "\n")
  
  # Remove the mutants without MSA files
  # drop = FALSE keeps a one-column data frame from collapsing to a vector
  mutants15_to_plot <- mutants15_to_plot[!mutants15_to_plot$ID %in% mutants_to_remove, , drop = FALSE]
  cat("\nMutants remaining:", nrow(mutants15_to_plot), "\n")
} else {
  cat("All mutants have corresponding MSA files. No mutants will be removed.\n")
}

# If you want to see the remaining mutants
print(mutants15_to_plot)
```

Read in the E. coli map:
```{r}
ecoli_map <- read.csv(file=paste("GOF/MSA_Dropouts/Comp/NP_414590.csv", sep=""), head=TRUE, sep=",")
```

Create an empty data frame to collect all mapped GoF mutations, along with an amino acid lookup table:
```{r}
GOF_fitness_map <- data.frame(position=numeric(),
                              aa=character(),
                              mutations=numeric(),
                              fitness=numeric(),
                              posortho=numeric(),
                              ingap=character(),
                              mutID=character(),
                              ID=character())

aminoacids <- data.frame(aa=c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T','X'),
                         aanum=c(1:21))
```

**MSA Mapping:** Map the mutants (fitness gain over the parent perfect >= 0.5) of all dropout perfects (fit < -1) onto E. coli residue positions for GoF analysis:
```{r class.output="goodCode"}
#loop over all perfects
for (iii in seq_len(nrow(mutants15_to_plot))){
  
  #current ortholog:
  mutant_current <- as.character(mutants15_to_plot$ID[iii])
  
  #length of name
  name_size = nchar(paste(mutant_current,"_",sep=""))
  
  #get the MSA mapping
  mutant_map <- read.csv(file=paste("GOF/MSA_Dropouts/Comp/",mutant_current,".csv",sep=""),
                         head=TRUE,sep=",")
  
  #grab the mutants with a fitness increase (GoF) of at least 0.5 (do not include perfects from dataset)
  GOFmutIDinfo_temp <- dropout_mutants15_GOF_fitness %>% ###UPDATED CODE
    filter(ID == mutant_current) %>%
    filter(mutations != 0) %>% ###UPDATED CODE
    filter(fitD05D03 >= 0.5) ###CHANGE VALUE AS NEEDED (>= 0.5 is default)
  
  # Check if GOFmutIDinfo_temp is empty
  if(nrow(GOFmutIDinfo_temp) == 0) {
    warning(paste("No non-zero mutation data found for ID:", mutant_current))
    next  # Skip to the next iteration of the outer loop
  }
  
  #loop over all mutants for this construct:
  for (mn in 1:nrow(GOFmutIDinfo_temp)) {
    
    #this mutants fitness
    gof_fit_temp <- GOFmutIDinfo_temp$fitD05D03[mn]  # or whichever fitness column you're using
    
    #grab the mut name
    mutations_names <- as.character(GOFmutIDinfo_temp$mutID[mn])
    
    #grab only the relevant portion of the name
    mutations_names <- substr(mutations_names, name_size+1, nchar(mutations_names))
    
    ## split mutation string at underscores (one token per mutation)
    s <- strsplit(mutations_names, "_")
    
    for (mutnum in 1:GOFmutIDinfo_temp$mutations[mn]){
      
      #grab the corresponding mutation string
      mutcurr<-s[[1]][mutnum]
      
      #get the position
      mutpos <- as.numeric(str_extract(mutcurr, "[0-9]+"))
      
      #get ending aa
      to_aa <- substr(mutcurr, nchar(mutpos)+2, nchar(mutcurr))
      
      #find the number in the consensus seq
      gof_cons_aanum_index <- which(mutant_map$orth_aanum == mutpos)
      
      if (length(gof_cons_aanum_index) > 0) {
        gof_cons_aanum <- mutant_map$msa_aanum[gof_cons_aanum_index]
        
        #does this map to a non-gap
        if (gof_cons_aanum %in% ecoli_map$msa_aanum){
          
          #the corresponding e.coli residue
          e_coli_residue <- ecoli_map$orth_aanum[which(ecoli_map$msa_aanum == gof_cons_aanum)]
          
          #add this point to the data
          GOF_fitness_map <- rbind(GOF_fitness_map,
                                   data.frame(position=e_coli_residue,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp$mutations[mn],
                                              fitness=gof_fit_temp,
                                              posortho=mutpos,
                                              ingap="No",
                                              mutID=GOFmutIDinfo_temp$mutID[mn],
                                              ID=GOFmutIDinfo_temp$ID[mn]))
          
        } else {
          #if it's here it maps to a gap
          
          #add this point to the data
          GOF_fitness_map <- rbind(GOF_fitness_map,
                                   data.frame(position=-1,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp$mutations[mn],
                                              fitness=gof_fit_temp,
                                              posortho=mutpos,
                                              ingap="Yes",
                                              mutID=GOFmutIDinfo_temp$mutID[mn],
                                              ID=GOFmutIDinfo_temp$ID[mn]))
          
        }
      } else {
        warning(paste("No matching orth_aanum found for mutpos:", mutpos, "in ID:", mutant_current))
        # You might want to handle this case, perhaps by skipping this mutation or adding it to a separate list for review
      }
    }
  }
}
```
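
The loop above does two things for each mutation: it parses a token into a position and a substituted amino acid, and it translates that position into E. coli numbering through the shared alignment column. The Python sketch below illustrates both steps. It is illustrative only: the mutID format `<ID>_<from><pos><to>_...` is inferred from the parsing code, and the helper names and toy maps are assumptions.

```python
import re

def parse_mutations(mut_id, parent_id):
    """Return [(position, to_aa), ...] parsed from a mutID (hypothetical helper)."""
    # Strip the parent ID and its trailing underscore, then split the remainder
    tokens = mut_id[len(parent_id) + 1:].split("_")
    parsed = []
    for token in tokens:
        match = re.fullmatch(r"[A-Za-z](\d+)(.+)", token)
        if match is None:
            raise ValueError(f"Unexpected mutation token: {token!r}")
        parsed.append((int(match.group(1)), match.group(2)))
    return parsed

def to_reference(position, homolog_map, reference_map):
    """Translate a homolog residue number to a reference residue number.

    Returns -1 when the alignment column is a gap in the reference and
    None when the position is absent from the homolog map.
    """
    column = homolog_map.get(position)
    if column is None:
        return None
    column_to_reference = {col: res for res, col in reference_map.items()}
    return column_to_reference.get(column, -1)

# Toy residue -> alignment column maps for a made-up 5-column alignment
reference_map = {1: 1, 2: 2, 3: 4, 4: 5}      # reference has a gap at column 3
homolog_map = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}  # homolog has no gaps

mapped = [
    (pos, aa, to_reference(pos, homolog_map, reference_map))
    for pos, aa in parse_mutations("homX_R3K_L4P", "homX")
]
print(mapped)
```

In the toy case, homolog position 3 falls in a reference gap and is reported as -1 (the same sentinel the R loop stores in `position`), while homolog position 4 maps to reference residue 3.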

```{r echo=FALSE}
write.table(GOF_fitness_map, file = "GOF/OUTPUT/Comp/GOF_fitness_map.csv", 
            sep = ",", row.names = F,quote=F,col.names = T)
```

Collapse the GOF fitness values by aa position along the protein sequence:
```{r}
GOF_fitness_collapsed_by_pos <- GOF_fitness_map %>%
  filter(position > 0) %>%
  group_by(position) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))
```

#### GOF Mutant No. Plot

Plot the number of gain-of-function mutants recovered at each aa position along the protein sequence, with dashed reference lines at the mean number of GOF mutants (blue) and at 2 SD above the mean (red). Positions with GOF mutant counts above the 2 SD line are considered significant positions that positively influence the parent variant's ability to complement metabolic function in the E. coli knockout model.
```{r}
GoF_plot <- ggplot(GOF_fitness_collapsed_by_pos, aes(x=position, y=numpoints, color=numortho)) +
  geom_segment(aes(x = 0, y = mean(numpoints)+2*sd(numpoints), 
                   xend = 160, 
                   yend = mean(numpoints)+2*sd(numpoints)),linetype=2,colour = "red2")+
  geom_segment(aes(x = 0, y = mean(numpoints), xend = 160, yend = mean(numpoints)),linetype=2,colour = "darkblue")+
  geom_point(size=1.8)+
  labs(x = "Position (aa)", y ="Number of gain-of-function mutants",color="") +
  scale_color_gradientn(colours = c("darkblue", "red"),
                        name="Num.\nUniq.\nHomo.",
                        na.value="grey", 
                        limits = c(0,1.1*max(GOF_fitness_collapsed_by_pos$numortho))) +
  scale_x_continuous(breaks=seq(0,160,20))+
  theme_minimal() +
  theme(axis.line = element_line(colour = 'black', size = 1.0), 
        axis.ticks = element_line(colour = "black", size = 1.0),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.position = "none")

# Add marginal plot
GoF_plot_with_marginal <- ggExtra::ggMarginal(GoF_plot,
                                              type = "histogram",
                                              margins = "y",
                                              bins=21,
                                              col = 'black',
                                              fill = 'red2')

# Display the plot
print(GoF_plot_with_marginal)
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/Comp/GOF.Mutant.by.AA.Position.2sigma.complementation.png", 
       plot = GoF_plot_with_marginal,
       width = 4, height = 4.5, units = "in")
```

Print a summary table of the significant aa positions along the protein sequence:
```{r class.output="goodCode"}
GOF_fitness_collapsed_by_pos_2sigma <- GOF_fitness_collapsed_by_pos %>%
  filter(numpoints >= (mean(GOF_fitness_collapsed_by_pos$numpoints)+2*sd(GOF_fitness_collapsed_by_pos$numpoints)))
print(GOF_fitness_collapsed_by_pos_2sigma)
```
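
The 2 SD cutoff used above is the mean of the per-position counts plus twice their sample standard deviation. A minimal Python sketch with made-up counts (the positions, counts, and variable names are assumptions for illustration); `statistics.stdev` uses the n - 1 denominator, as does `sd()` in R:

```python
import statistics

# Toy counts of GoF mutants per position; values are made up
counts_by_position = {1: 5, 2: 4, 3: 6, 4: 5, 5: 3, 6: 7, 7: 4, 8: 40}

counts = list(counts_by_position.values())
mean_count = statistics.mean(counts)
sd_count = statistics.stdev(counts)  # sample standard deviation (n - 1)
cutoff = mean_count + 2 * sd_count

# Positions whose count reaches or exceeds the cutoff
significant = [pos for pos, n in counts_by_position.items() if n >= cutoff]
print(round(cutoff, 2), significant)
```

Note that a single extreme position inflates both the mean and the SD, so the cutoff itself moves with the outliers it is meant to detect.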

Calculate all Data and Stats:
```{r}
GOF_fitness_collapsed_all <- GOF_fitness_map %>%
  filter(position > 0) %>%
  group_by(position, aa) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))

gof_aa_dim <- nrow(aminoacids)
gof_ref_len <- nrow(ecoli_map)
```

```{r warning=FALSE}
#these matrices have the fitness/num/sd for each aa at each position:
gof_matrix = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_num = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_sd = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_numortho = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)

#populate matrix
for (i in 1:nrow(GOF_fitness_collapsed_all)){
  
  gof_matrix[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all$aa[i])),GOF_fitness_collapsed_all$position[i]] <- as.numeric(GOF_fitness_collapsed_all$fitval[i])
  gof_matrix_num[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all$aa[i])),GOF_fitness_collapsed_all$position[i]] <- as.numeric(GOF_fitness_collapsed_all$numpoints[i])
  gof_matrix_sd[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all$aa[i])),GOF_fitness_collapsed_all$position[i]] <- as.numeric(GOF_fitness_collapsed_all$stdfit[i])
  gof_matrix_numortho[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all$aa[i])),GOF_fitness_collapsed_all$position[i]] <- as.numeric(GOF_fitness_collapsed_all$numortho[i])
}

rownames(gof_matrix)<-aminoacids$aa
colnames(gof_matrix)<-c(1:gof_ref_len)
rownames(gof_matrix_num)<-aminoacids$aa
colnames(gof_matrix_num)<-c(1:gof_ref_len)
rownames(gof_matrix_sd)<-aminoacids$aa
colnames(gof_matrix_sd)<-c(1:gof_ref_len)
rownames(gof_matrix_numortho)<-aminoacids$aa
colnames(gof_matrix_numortho)<-c(1:gof_ref_len)

gof_matrix_melt <- melt(gof_matrix)
gof_matrix_num_melt <- melt(gof_matrix_num)
gof_matrix_sd_melt <- melt(gof_matrix_sd)
gof_matrix_numortho_melt <- melt(gof_matrix_numortho)

# Rename columns to "X1" and "X2"
names(gof_matrix_melt)[names(gof_matrix_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_num_melt)[names(gof_matrix_num_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_sd_melt)[names(gof_matrix_sd_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_numortho_melt)[names(gof_matrix_numortho_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")

# Significant GoF positions, in x-axis order
gof_sig_positions <- c(17, 97, 98, 102, 103, 104, 107)

# Amino-acid order used on the y-axis (index 1 = "K", ..., index 21 = "X")
gof_aa_plot_order <- c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X")

gof_matrix_melt_only_GOFpos <- gof_matrix_melt %>%
  filter(X2 %in% gof_sig_positions)

# Consecutive x-axis index (1-7) for each GoF position; 0 if unmatched
gof_matrix_melt_only_GOFpos$mutposnum <- match(gof_matrix_melt_only_GOFpos$X2, gof_sig_positions, nomatch = 0)

# y-axis index (1-21) for each amino acid; 0 if unmatched
gof_matrix_melt_only_GOFpos$aanum <- match(gof_matrix_melt_only_GOFpos$X1, gof_aa_plot_order, nomatch = 0)

gof_matrix_melt_only_GOFpos_wnum <- gof_matrix_melt_only_GOFpos %>%
  inner_join(gof_matrix_num_melt,by=c("X1","X2")) %>%
  dplyr::rename(mutnum=value.y,value=value.x)
```
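
The reshaping from a wide matrix to a long table and the join on (`X1`, `X2`) can be illustrated on a toy matrix. This is a minimal base-R sketch rather than the notebook's code: `toy_fit`, `toy_num`, `to_long`, and all values are made up for illustration, and `as.data.frame(as.table())` stands in for `melt()`.

```r
# Toy 2 x 3 matrices: rows are amino acids, columns are positions (values are made up)
toy_fit <- matrix(c(0.5, NA, 1.2, 0.8, NA, 2.0), nrow = 2,
                  dimnames = list(c("A", "G"), c("1", "2", "3")))
toy_num <- matrix(c(3, 0, 5, 1, 0, 7), nrow = 2,
                  dimnames = list(c("A", "G"), c("1", "2", "3")))

# Long format: one row per (amino acid, position) pair, analogous to melt()
to_long <- function(m) {
  long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
  names(long) <- c("X1", "X2", "value")
  long$X2 <- as.integer(long$X2)
  long
}

toy_fit_long <- to_long(toy_fit)
toy_num_long <- to_long(toy_num)

# Join fitness and counts on the shared keys, then rename the value columns
toy_joined <- merge(toy_fit_long, toy_num_long, by = c("X1", "X2"),
                    suffixes = c(".x", ".y"))
names(toy_joined)[names(toy_joined) == "value.x"] <- "value"
names(toy_joined)[names(toy_joined) == "value.y"] <- "mutnum"
print(toy_joined)
```

Cells with no fitness value stay as `NA` in the long table, which is what lets the heatmaps draw them in the `na.value` colour.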

#### GOF Position Plot

Plot the mean fitness of each GoF mutation at the significant positions, with the number of mutants observed at each AA:
```{r}
# Define the order of amino acids for the rectangles
rect_order <- c("E", "G", "R", "Q", "F", "L", "A")

# Create a data frame for the rectangles
rect_data <- data.frame(
  aanum = match(rect_order, c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X")),
  xmin = seq(0.5, by = 1, length.out = length(rect_order)),
  xmax = seq(1.5, by = 1, length.out = length(rect_order)))

#plot the data from all mutants:
GOF_fit_nummut_plot <- ggplot(gof_matrix_melt_only_GOFpos_wnum, 
       aes(x=mutposnum, y=aanum,
           fill=value,
           label=mutnum)) +
  geom_tile() +
  geom_text() +
  # Add black rectangles
  geom_rect(data = rect_data,
            aes(xmin = xmin, xmax = xmax, ymin = aanum - 0.5, ymax = aanum + 0.5),
            fill = NA, color = "black", inherit.aes = FALSE) +
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                      high = "red",
                      name="Fitness",
                      na.value="grey",
                      limit = c(0, max(gof_matrix_melt_only_GOFpos_wnum$value))) +
  theme_minimal()+
  scale_x_continuous(name="Position (aa)",
                     breaks=c(1,2,3,4,5,6,7),
                     labels=c("17","97","98","102","103","104","107"))+
  scale_y_continuous(name="Amino acid", 
                     breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21),
                     labels=c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X"))

print(GOF_fit_nummut_plot)
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/Comp/GOF_Fitness_Number_Mutants_Complementation.pdf", 
       plot = GOF_fit_nummut_plot,
       width = 4.5, height = 4.5, units = "in")
```
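
The black rectangles rely on `match()` to translate each outlined residue into its row index on the y-axis. A small standalone sketch of that lookup follows; the names `aa_axis_order`, `outlined_residues`, and `outline_rect` are illustrative and not part of the notebook.

```r
# Amino-acid order used on the y-axis of the heatmap (index 1 = "K")
aa_axis_order <- c("K","R","H","E","D","N","Q","T","S","C","G","A",
                   "V","L","I","M","P","Y","F","W","X")

# Residues outlined in black, one per GoF position, in x-axis order
outlined_residues <- c("E", "G", "R", "Q", "F", "L", "A")

# match() returns the y-axis index of each residue; each x index i spans [i - 0.5, i + 0.5]
outline_rect <- data.frame(
  residue = outlined_residues,
  aanum   = match(outlined_residues, aa_axis_order),
  xmin    = seq_along(outlined_residues) - 0.5,
  xmax    = seq_along(outlined_residues) + 0.5
)
print(outline_rect)
```

Because the index comes from the same vector that labels the axis, reordering the axis only requires editing that one vector.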

Plot the GOF mutant fitness across the protein sequence:
```{r}
ggplot(data = gof_matrix_melt, aes(x=X2, y=X1, fill=value)) +
  geom_tile() +
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                       high = "red",
                       name="Fitness",
                       na.value="grey",
                       limit = c(0,max(gof_matrix_melt$value))) +
  theme_minimal() + 
  scale_x_continuous(breaks=seq(0,150,10))
```

Plot the number of mutants observed at each position along the protein sequence:
```{r}
ggplot(data = gof_matrix_num_melt, aes(x=X2, y=X1, fill=value)) +
  geom_tile()+ 
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                      high = "red",
                      name="#points",
                      na.value="grey", 
                      limit = c(0,max(gof_matrix_num_melt$value))) +
  theme_minimal() + 
  scale_x_continuous(breaks=seq(0,150,10))
```

Plot the number of unique homologs contributing mutants at each position along the protein sequence:
```{r}
ggplot(data = gof_matrix_numortho_melt, aes(x=X2, y=X1, fill=value)) +
  geom_tile()+ 
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                      high = "red",
                      name="#homologs",
                      na.value="grey", 
                      limit = c(0,max(gof_matrix_numortho_melt$value))) +
  theme_minimal() + 
  scale_x_continuous(breaks=seq(0,150,10))
```

Plot the standard deviation of fitness for each position along the protein sequence:
```{r}
ggplot(data = gof_matrix_sd_melt, aes(x=X2, y=X1, fill=value)) +
  geom_tile()+ labs(x = "Position (aa)", y ="Amino acid",color="") +
  scale_fill_gradient2(low = "blue", 
                       high = "red", 
                       mid="gold",
                       name="std(Fitness)",
                       na.value="grey", 
                       limit = c(0,1.1*max(gof_matrix_sd_melt$value))) +
  theme_minimal() + 
  scale_x_continuous(breaks=seq(0,150,10))
```

#### GOF BMS Position Plot

```{r class.output="goodCode"}
GOF_fitness_collapsed_by_pos_2sigma_protein_info <- protein_info_1H1T %>%
  dplyr::rename(position=pos) %>%
  filter(position %in% GOF_fitness_collapsed_by_pos_2sigma$position) %>%
  right_join(GOF_fitness_collapsed_by_pos_2sigma,by="position") %>%
  arrange(position)
print(GOF_fitness_collapsed_by_pos_2sigma_protein_info)
```

```{r}
rsa_vs_cons_plot <- ggplot(GOF_fitness_collapsed_by_pos_2sigma_protein_info,
             aes(x=cons, y=RSA, color=as.factor(position), fill=as.factor(position), shape=as.factor(position))) +
  geom_point(alpha=0.9, size=4, stroke=1) +
  labs(x = "Site Conservation", y ="Relative Solvent Accessibility", color="Residue", fill="Residue", shape="Residue") +
  scale_color_manual(name = "Residue",
                     values = c("red", "green", "blue", "purple", "orange", "cyan", "magenta")) +
  scale_fill_manual(name = "Residue",
                    values = c("red", "green", "blue", "purple", "orange", "cyan", "magenta")) +
  scale_shape_manual(name = "Residue",
                     values = c(21, 22, 23, 24, 25, 21, 22)) +
  theme_minimal() +
  theme(legend.position = "right")

rsa_vs_cons_plot
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/Comp/GOF_RSA_vs_Cons_Complementation.pdf", 
       plot = rsa_vs_cons_plot,
       width = 4.5, height = 4.5, units = "in")
```

Extract the fitness values for each significant aa position from the BMS analysis:
```{r class.output="goodCode"}
# Significant GoF positions, in x-axis order
gof_sig_positions <- c(17, 97, 98, 102, 103, 104, 107)

BMS_matrix_perfects_and_1_melt_GOFonly <- BMS_matrix15_perfects_and_1_melt %>%
  filter(X1 != "X") %>%
  filter(X2 %in% gof_sig_positions)

# Consecutive x-axis index (1-7) for each GoF position; 0 if unmatched
BMS_matrix_perfects_and_1_melt_GOFonly$mutposnum <- match(BMS_matrix_perfects_and_1_melt_GOFonly$X2, gof_sig_positions, nomatch = 0)

BMS_matrix_perfects_and_1_melt_GOFonly_wnum <- BMS_matrix_perfects_and_1_melt_GOFonly %>%
  inner_join(BMS_matrix15_perfects_and_1_num_melt,by=c("X1","X2")) %>%
  dplyr::rename(mutnum=value.y,value=value.x)

names(BMS_matrix_perfects_and_1_melt_GOFonly_wnum)
```

Determine the minimum and maximum fitness values for plotting:
```{r class.output="goodCode"}
min(BMS_matrix_perfects_and_1_melt_GOFonly$value, na.rm = TRUE)
max(BMS_matrix_perfects_and_1_melt_GOFonly$value, na.rm = TRUE)
```

Plot the fitness values of the significant aa positions based on the BMS analysis for Complementation. Black rectangles indicate the aa corresponding to the WT DHFR homolog. White rectangles indicate the aa with the highest number of mutants for each position along the protein sequence.
```{r}
# Define the order of amino acids for the black rectangles
rect_order <- c("E", "G", "R", "Q", "F", "L", "A")

# Create a data frame for the black rectangles
rect_data <- data.frame(
  aanum = match(rect_order, c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W")),
  xmin = seq(0.5, by = 1, length.out = length(rect_order)),
  xmax = seq(1.5, by = 1, length.out = length(rect_order)))

# Find the amino acid with the highest mutnum for each position
highest_mutnum <- BMS_matrix_perfects_and_1_melt_GOFonly_wnum %>%
  group_by(mutposnum) %>%
  slice_max(order_by = as.numeric(mutnum), n = 1) %>%
  ungroup()

# Create the plot
BMS_GoF_fit_plot <- ggplot(BMS_matrix_perfects_and_1_melt_GOFonly_wnum, 
       aes(x=mutposnum, y=aanum,
           fill=value,
           label=mutnum)) +
  geom_tile() +
  geom_text() +
  # Add black rectangles
  geom_rect(data = rect_data,
            aes(xmin = xmin, xmax = xmax, ymin = aanum - 0.5, ymax = aanum + 0.5),
            fill = NA, color = "black", inherit.aes = FALSE) +
  # Add white rectangles around the highest mutnum
  geom_rect(data = highest_mutnum,
            aes(xmin = mutposnum - 0.5, xmax = mutposnum + 0.5, 
                ymin = aanum - 0.5, ymax = aanum + 0.5),
            fill = NA, color = "white", size = 1, inherit.aes = FALSE) +
  labs(x = "Position (aa)",
       y ="Amino acid", color="") +
  scale_fill_gradient2(low = "blue", high = "red", mid="gold",
                       name="Fitness", na.value="grey", 
                       limit = c(-3,1)) +
  theme_minimal() +
  scale_x_continuous(name="Position (aa)", 
                     breaks=c(1,2,3,4,5,6,7),
                     labels=c("17","97","98","102","103","104","107")) +
  scale_y_continuous(name="Amino acid", 
                     breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
                     labels=c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W"))

print(BMS_GoF_fit_plot)
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/Comp/GOF_BMS_Fitness_Number_Mutants_Complementation.png", 
       plot = BMS_GoF_fit_plot,
       width = 4.5, height = 4.5, units = "in")
```
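
The white rectangles mark, for each position, the amino acid with the largest mutant count. The sketch below reproduces that per-group maximum in base R on made-up counts; `toy`, `group_max`, and `toy_highest` are illustrative names. Ties are kept, matching the default behaviour of `slice_max()` (`with_ties = TRUE`).

```r
# Toy long-format table: two positions, three amino acids each (counts are made up)
toy <- data.frame(
  mutposnum = c(1, 1, 1, 2, 2, 2),
  aa        = c("A", "G", "L", "A", "G", "L"),
  mutnum    = c(4, 9, 2, 6, 6, 1)
)

# Broadcast each position's maximum count back onto its rows
group_max <- ave(toy$mutnum, toy$mutposnum, FUN = max)

# Keep every row whose count equals the maximum for its position (ties included)
toy_highest <- toy[toy$mutnum == group_max, ]
print(toy_highest)
```

When two amino acids tie at a position, both are outlined, so a position can carry more than one white rectangle.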

## MIC (0.5 ug/mL TMP)

### Dropout Perfects

Retrieve all perfects that pass the **COMPLEMENTATION** assay (log-fold change greater than -1.0) but drop out in the **MIC** dataset (log-fold change less than -1.0 at 0.5 ug/mL TMP), then retrieve all mutants that share an ID with those perfects. Use the `mut_collapse_15` dataset, which includes 797 perfects (mutations = 0; numprunedBCs = 5) and 12,174 mutants with up to 5 AA distance and at least 1 BC (numprunedBCs = 1) matching a perfect variant in the dataset.
```{r}
# Step 1: Identify IDs whose perfect (mutations == 0) passes COMPLEMENTATION (fitD05D03 > -1.0)
dropout15_ids_with_zero_mutations_complement <- mut_collapse_15 %>%
  filter(mutations == 0 & fitD05D03 > -1.0) %>%
  distinct(ID) %>%
  pull(ID)

# Step 2: Identify the same IDs that have mutations == 0 and fitD07D03 < -1.0 in MIC
dropout15_ids_with_zero_mutations_mic <- mut_collapse_15 %>%
  filter(ID %in% dropout15_ids_with_zero_mutations_complement & 
         mutations == 0 & 
         fitD07D03 < -1.0) %>%
  distinct(ID) %>%
  pull(ID)

# Step 3: Retrieve the rows for these IDs
result_rows <- mut_collapse_15 %>%
  filter(ID %in% dropout15_ids_with_zero_mutations_mic & mutations == 0)

# Step 4: Keep the dropout perfects plus any mutants (mutations != 0) with fitD07D03 > -1.0 whose ID matches a dropout perfect
dropout_mutants15_GOF_mic <- mut_collapse_15 %>%
  filter(
    (mutations == 0 & !is.na(fitD05D03) & fitD05D03 > -1.0 & 
     !is.na(fitD07D03) & fitD07D03 < -1.0) |
    (mutations != 0 & fitD07D03 > -1.0 & ID %in% dropout15_ids_with_zero_mutations_mic)) %>%
  dplyr::select(ID, mutID, numprunedBCs, mutations, fitD05D03, fitD07D03, seq)
```
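
The selection logic can be traced on a small made-up table. This is a base-R sketch under the assumption that each ID has exactly one perfect row; the data frame `toy` and its values are invented, while the column names and the -1.0 thresholds follow the chunk above.

```r
toy <- data.frame(
  ID        = c("h1", "h1", "h1", "h2", "h2", "h3"),
  mutations = c(0, 1, 2, 0, 1, 0),
  fitD05D03 = c(0.2, 0.1, -0.3, -2.5, 0.4, 0.0),   # complementation
  fitD07D03 = c(-2.0, 0.3, -1.8, -3.0, 0.1, 0.5)   # 0.5 ug/mL TMP
)

is_perfect <- toy$mutations == 0

# Perfects that complement (> -1.0) but drop out under TMP (< -1.0)
dropout_ids <- toy$ID[is_perfect & toy$fitD05D03 > -1.0 & toy$fitD07D03 < -1.0]

# Keep those perfects plus their mutants that stay above -1.0 under TMP
keep <- toy$ID %in% dropout_ids &
  (is_perfect | (!is_perfect & toy$fitD07D03 > -1.0))
toy_gof <- toy[keep, ]
print(toy_gof)
```

Here only `h1` qualifies: `h2` already fails complementation and `h3` survives TMP, so neither is a MIC dropout. Of the two `h1` mutants, only the one with `fitD07D03 > -1.0` is retained.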

Validate that every perfect (mutations == 0) retained in the filtered dataset has fitD05D03 > -1.0 and fitD07D03 < -1.0.
```{r class.output="goodCode"}
# Verification step
verification_result_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations == 0) %>%
  mutate(
    condition_met = fitD05D03 > -1.0 & fitD07D03 < -1,
    fitD05D03_check = fitD05D03 > -1.0,
    fitD07D03_check = fitD07D03 < -1
  )

# Check if all rows meet the condition
all_conditions_met_mic <- all(verification_result_mic$condition_met)

# Summary of the verification
verification_summary_mic <- verification_result_mic %>%
  summarise(
    total_rows = n(),
    rows_meeting_both_conditions = sum(condition_met),
    rows_meeting_fitD05D03 = sum(fitD05D03_check),
    rows_meeting_fitD07D03 = sum(fitD07D03_check)
  )

# Print results
print("Verification Summary:")
print(verification_summary_mic)
print(paste("All conditions met:", all_conditions_met_mic))

# If there are any rows not meeting the conditions, display them
if (!all_conditions_met_mic) {
  print("Rows not meeting both conditions:")
  print(verification_result_mic %>% filter(!condition_met) %>% select(ID, fitD05D03, fitD07D03))
}
```

Validate that rows where mutations != 0 have an ID that matches rows where mutations == 0 and fitD07D03 < -1.0.
```{r class.output="goodCode"}
# 1. First, create two subsets of the data
zero_mutation_rows_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations == 0 & fitD07D03 < -1.0)

non_zero_mutation_rows_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations != 0 & fitD07D03 > -1.0)

# 2. Check that all IDs in non_zero_mutation_rows are present in zero_mutation_rows
all_valid_ids_mic <- all(non_zero_mutation_rows_mic$ID %in% zero_mutation_rows_mic$ID)
print(paste("All non-zero mutation rows have a matching zero mutation row:", all_valid_ids_mic))

# 3. If the above is FALSE, find the problematic IDs
if (!all_valid_ids_mic) {
  problematic_ids_mic <- setdiff(non_zero_mutation_rows_mic$ID, zero_mutation_rows_mic$ID)
  print("IDs with non-zero mutations but no matching zero mutation row:")
  print(problematic_ids_mic)
}

# 4. Check for any IDs in zero_mutation_rows that don't have a corresponding non-zero mutation row
ids_without_non_zero_mic <- setdiff(zero_mutation_rows_mic$ID, non_zero_mutation_rows_mic$ID)
print("IDs with zero mutations but no corresponding non-zero mutation rows:")
print(ids_without_non_zero_mic)

# 5. Summary statistics
print(paste("Number of unique IDs in zero mutation rows:", n_distinct(zero_mutation_rows_mic$ID)))
print(paste("Number of unique IDs in non-zero mutation rows:", n_distinct(non_zero_mutation_rows_mic$ID)))

# 6. Distribution of mutation counts for non-zero mutation rows
mutation_distribution_mic <- non_zero_mutation_rows_mic %>%
  group_by(mutations) %>%
  summarise(count = n()) %>%
  arrange(mutations)

print("Distribution of mutation counts:")
print(mutation_distribution_mic)

# 7. Check for any unexpected mutation values
unexpected_mutations_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations < 0 | mutations > 5)  # Adjust the upper bound as needed

if (nrow(unexpected_mutations_mic) > 0) {
  print("Rows with unexpected mutation values:")
  print(unexpected_mutations_mic)
} else {
  print("No unexpected mutation values found.")
}
```

Remove perfect IDs that have no corresponding mutants:
```{r class.output="goodCode"}
# Step 1: Identify IDs with zero mutations
zero_mutation_ids_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations == 0 & fitD07D03 < -1.0) %>%
  pull(ID)

# Step 2: Identify IDs with non-zero mutations
non_zero_mutation_ids_mic <- dropout_mutants15_GOF_mic %>%
  filter(mutations != 0 & fitD07D03 > -1.0) %>%
  pull(ID)

# Step 3: Find IDs that have zero mutations but no corresponding non-zero mutation rows
ids_to_remove_mic <- setdiff(zero_mutation_ids_mic, non_zero_mutation_ids_mic)

# Step 4: Remove the rows with these IDs
dropout_mutants15_GOF_mic_cleaned <- dropout_mutants15_GOF_mic %>%
  filter(!(ID %in% ids_to_remove_mic))

# Print summary
print(paste("Number of rows before cleaning:", nrow(dropout_mutants15_GOF_mic)))
print(paste("Number of rows after cleaning:", nrow(dropout_mutants15_GOF_mic_cleaned)))
print(paste("Number of rows removed:", nrow(dropout_mutants15_GOF_mic) - nrow(dropout_mutants15_GOF_mic_cleaned)))
print(paste("Number of unique IDs removed:", length(ids_to_remove_mic)))

# Optionally, you can print the removed IDs
print("IDs removed:")
print(ids_to_remove_mic)

# Assign the cleaned data back to dropout_mutants15_GOF_mic to update the original variable
dropout_mutants15_GOF_mic <- dropout_mutants15_GOF_mic_cleaned
```
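
The pruning step reduces to a set difference between the IDs that appear as perfects and the IDs that appear as mutants. A toy base-R sketch, with invented IDs and illustrative variable names:

```r
toy <- data.frame(
  ID        = c("h1", "h1", "h2", "h3", "h3"),
  mutations = c(0, 1, 0, 0, 2)
)

perfect_ids <- unique(toy$ID[toy$mutations == 0])
mutant_ids  <- unique(toy$ID[toy$mutations != 0])

# Perfects with no retained mutant cannot contribute to the GoF analysis
orphan_ids <- setdiff(perfect_ids, mutant_ids)

toy_cleaned <- toy[!(toy$ID %in% orphan_ids), ]
print(orphan_ids)
print(toy_cleaned)
```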

### Dropout Mutants
Summarize the number of perfects and mutants at each AA distance after filtering:
```{r class.output="goodCode"}
# Create a function to count unique mutIDs for a given number of mutations
dropout_mutants15_GOF_count_mic <- function(data, mutation_count) {
  length(unique(subset(data, mutations == mutation_count)$mutID))
}

# Create a vector of counts for mutations 1-5
dropout15_counts_mic <- sapply(1:5, function(x) dropout_mutants15_GOF_count_mic(dropout_mutants15_GOF_mic, x))

# Count perfects separately
perfects_count_mic <- length(unique(subset(dropout_mutants15_GOF_mic, mutations == 0 & fitD07D03 < -1.0)$mutID))

# Create a data frame with the results, including the summary row
dropout_mutants15_GOF_table_mic <- data.frame(
  Mutations = c("Perfects (fit < -1.0)", "1 Mutation", "2 Mutations", "3 Mutations", "4 Mutations", "5 Mutations", "Total Mutants"),
  Count = c(perfects_count_mic, dropout15_counts_mic, sum(dropout15_counts_mic))
)

# Print the table
print(dropout_mutants15_GOF_table_mic)
```

```{r echo=FALSE}
# Save as .csv for use in other RMD files:
write.csv(dropout_mutants15_GOF_mic, 'GOF/OUTPUT/MIC/Lib15.D07D03.gof.1numprunedBCs.5AA.muts.fitness-1.csv',
          row.names = FALSE, quote=FALSE)
```

### GOF Fitness

**GOF Fitness:** Separate the `dropout_mutants15_GOF_mic` dataset into two new dataframes, where DF1 contains perfects and DF2 contains mutants:
```{r}
# Create a dataframe with mutations == 0
dropout_mutants15_GOF_no_mutations_mic <- dropout_mutants15_GOF_mic[dropout_mutants15_GOF_mic$mutations == 0, ]

# Create a dataframe with mutations != 0
dropout_mutants15_GOF_with_mutations_mic <- dropout_mutants15_GOF_mic[dropout_mutants15_GOF_mic$mutations != 0, ]
```

Now re-combine these dataframes, retaining only the mutants and expressing each mutant's fitness as the difference from its parent perfect:
```{r class.output="goodCode"}
# Step 1: Prepare the reference dataframe
df_reference_mic <- dropout_mutants15_GOF_no_mutations_mic %>%
  dplyr::select(ID, fitD07D03) %>%
  dplyr::rename(reference_fitD07D03 = fitD07D03)

# Step 2: Join on ID and express each mutant's fitness relative to its parent perfect
dropout_mutants15_GOF_fitness_mic <- dropout_mutants15_GOF_with_mutations_mic %>%
  left_join(df_reference_mic, by = "ID") %>%
  mutate(fitD07D03 = fitD07D03 - reference_fitD07D03) %>%
  dplyr::select(ID, mutID, mutations, fitD07D03)

# Print summary statistics
print(paste("Number of Mutants:", nrow(dropout_mutants15_GOF_fitness_mic)))
print(paste("Unique IDs:", length(unique(dropout_mutants15_GOF_fitness_mic$ID))))
print(paste("Range of fitD07D03:", 
            paste(round(range(dropout_mutants15_GOF_fitness_mic$fitD07D03, na.rm = TRUE), 1), collapse = " to ")))
```
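
After this step, `fitD07D03` holds a fitness gain, that is, mutant fitness minus the fitness of the parent perfect with the same ID. A toy base-R sketch of the same arithmetic follows; the IDs, the `mutID` strings, and the `gain` column are hypothetical.

```r
perfects <- data.frame(ID = c("h1", "h2"), fitD07D03 = c(-2.0, -1.5))
mutants  <- data.frame(
  ID        = c("h1", "h1", "h2"),
  mutID     = c("h1_A17G", "h1_L28R", "h2_Q102H"),   # hypothetical names
  fitD07D03 = c(0.5, -0.5, -1.0)
)

# Look up each mutant's parent fitness by ID, then subtract it
parent_fit   <- perfects$fitD07D03[match(mutants$ID, perfects$ID)]
mutants$gain <- mutants$fitD07D03 - parent_fit
print(mutants)
```

Because every retained parent has fitness below -1.0 and every retained mutant has fitness above -1.0, the gain is always positive in the filtered dataset.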

**Boxplot:** Plot mutant fitness relative to parent variant by number of mutations:
```{r}
GOF_muts_fitness_by_muts_plot_mic <- ggplot(dropout_mutants15_GOF_fitness_mic, 
                                        aes(x = factor(mutations), y = fitD07D03)) +
  geom_boxplot() +
  labs(title = "fitD07D03 by Number of Mutations", x = "Number of Mutations", y = "fitD07D03")

print(GOF_muts_fitness_by_muts_plot_mic)
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/MIC/GOF.MIC.Mutant.Fitness.png", 
       plot =GOF_muts_fitness_by_muts_plot_mic,
       width = 4.5, height = 4.5, units = "in")
```

**Histogram of Mutant Fitness:** The distribution of mutant fitness relative to the parent perfect appears approximately normal.
```{r}
GOF_muts_fitness_dist_plot_mic <- ggplot(dropout_mutants15_GOF_fitness_mic, aes(x = fitD07D03)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "Distribution of fitD07D03", x = "fitD07D03", y = "Count")

print(GOF_muts_fitness_dist_plot_mic)
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/MIC/GOF.MIC.Mutant.Fitness.Distribution.png", 
       plot =GOF_muts_fitness_dist_plot_mic,
       width = 4.5, height = 4.5, units = "in")
```
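
A visual impression of normality can be checked more formally. The sketch below uses simulated values in place of the notebook's data (`toy_gain` is a stand-in, not a real result) and applies `shapiro.test()` and `qqnorm()` from base R's `stats` package. `shapiro.test()` accepts between 3 and 5000 observations, so a larger dataset would need subsampling or a different test.

```r
set.seed(1)
toy_gain <- rnorm(200, mean = 1, sd = 0.8)   # simulated stand-in for relative fitness

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(toy_gain)
print(sw$p.value)

# Q-Q coordinates; a correlation near 1 indicates points close to a straight line
qq <- qqnorm(toy_gain, plot.it = FALSE)
print(cor(qq$x, qq$y))
```

To run this on the notebook's data, `dropout_mutants15_GOF_fitness_mic$fitD07D03` would replace `toy_gain`.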

### GOF Alignment

**FASTA:** Generate a FASTA file from the perfects in the filtered `dropout_mutants15_GOF_mic` dataset for GoF analysis:
```{r}
# First, let's ensure we have the correct unique IDs for mutations == 1
dropout_mutants15_GOF_mic_1mut_unique_ids <- dropout_mutants15_GOF_fitness_mic %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  pull(ID)

# Now, let's use these IDs to filter the dropout_mutants15_GOF_mic dataset
dropout_mutants15_GOF_mic_1mut_unique_id_seq <- dropout_mutants15_GOF_mic %>%
  filter(ID %in% dropout_mutants15_GOF_mic_1mut_unique_ids & mutations == 0) %>%
  select(ID, seq)

# Create the sequences in FASTA format
dropout_mutants15_GOF_mic_fasta_content <- paste(">", dropout_mutants15_GOF_mic_1mut_unique_id_seq$ID, "\n", dropout_mutants15_GOF_mic_1mut_unique_id_seq$seq, "\n", sep = "", collapse = "")

# Define the file path in the working directory
dropout_mutants15_GOF_mic_fasta_file_path <- file.path(getwd(), "GOF/MSA_Dropouts/MIC/FASTA/Lib15.GoF.perfects.mic.fasta")

# Write the FASTA content to the file (57 unique IDs)
writeLines(dropout_mutants15_GOF_mic_fasta_content, 
           con = dropout_mutants15_GOF_mic_fasta_file_path)
```
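
The FASTA layout written above is one `>ID` header line followed by one sequence line per record. The sketch below builds the same layout as a vector of lines for two invented records and writes it to a temporary file; `toy`, `fasta_lines`, and `fasta_path` are illustrative names.

```r
toy <- data.frame(ID = c("h1", "h2"), seq = c("MKLV", "MRLI"),
                  stringsAsFactors = FALSE)

# Interleave header lines (">ID") with sequence lines, record by record
fasta_lines <- as.vector(rbind(paste0(">", toy$ID), toy$seq))

# Write to a temporary file and read it back to confirm the layout
fasta_path <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, con = fasta_path)
print(readLines(fasta_path))
```

Passing one element per line to `writeLines()` avoids embedding `"\n"` in the strings and leaves no trailing blank line in the file.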

**Alignment:** Use the `clustalo` executable to align the protein sequences associated with the dropout perfects. This will align the FASTA file: **Lib15.GoF.perfects.mic.fasta** for use in GoF analysis.
```{bash}
./Scripts/clustalo -i GOF/MSA_Dropouts/MIC/FASTA/Lib15.GoF.perfects.mic.fasta -o GOF/MSA_Dropouts/MIC/FASTA/Lib15.GoF.perfects.mic.tree.aligned.mod.aln --outfmt=clustal --force
```

**Mapping Residues:** Use the following `map.aligned.residues.py` Python script to generate one CSV file per designed homolog that maps each residue position in the homolog to its column in the alignment:
```{python}
import time
import csv

##################################
#INPUTS:

base_path = ""
trees_path_prefix = base_path+""

#clustal format alignment file
align_file_in = [trees_path_prefix+"GOF/MSA_Dropouts/MIC/FASTA/Lib15.GoF.perfects.mic.tree.aligned.mod.aln"]

#number of seqs in each alignment file
num_samples_in_file = [58] #New FASTA w/ mutant fit > -1 (+1 from actual file count)

##################################
#OUTPUTS:

msa_map_out_path = [trees_path_prefix+"GOF/MSA_Dropouts/MIC/"]

# Loop to generate .csv files for each ID
for alni in range(1):#len(align_file_in)):
    #print(alni)
    
    ##################################
    #VARIABLES:
    
    #ID as key, align as value
    align_dict = dict()
    
    #num_samples = 419
    num_samples = num_samples_in_file[alni]
    
    #pos key, consensus pos val
    IDaadictlist = [dict() for x in range(num_samples)]
    
    IDtoindexdict = dict()
    indexdtoIDict = dict()
    
    ##################################
    #CODE:
    
    line_count = 0
    #loop over all alignments:
    print(align_file_in[alni])
    for line in open(align_file_in[alni]):
        #skip header
        if line_count > 1:
            listWords = line.split('    ')
            ID = listWords[0]
            align = line[16:].rstrip()
            if ID.strip() != "":
                align_dict[ID] = align_dict.get(ID, "") + align.replace(" ", "")
        line_count += 1
    
    #print("NP_414590")
    #print(align_dict["NP_414590"])

    counter = 0
    for ID in align_dict:
        #print(ID)
        #print(align_dict[ID])
        IDtoindexdict[ID] = counter
        indexdtoIDict[counter]=ID
        align = align_dict[ID]
        
        aacounter = 1
        
        
        for i in range(len(align)):
            if align[i] != "-":
                
                #print(str(counter)+" "+str(aacounter))
                IDaadictlist[counter][aacounter]=i+1
                aacounter += 1
        counter += 1
        
    #print(len(IDaadictlist))
    for i in range(len(IDaadictlist)-1):
        #print(indexdtoIDict[i])
        #print(i)
        #print(alni)
        #print(indexdtoIDict[i])
        csvfile = open(str(msa_map_out_path[alni]+indexdtoIDict[i]+".csv"), 'w', newline='')
        fieldnames = ['orth_aanum','msa_aanum']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for j in IDaadictlist[i]:
            #print(str(j)+" "+str(IDaadictlist[i][j]))
            #save all data:
            writer.writerow({'orth_aanum':str(j),'msa_aanum':str(IDaadictlist[i][j])})
        csvfile.close()
```
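
The core of the mapping is that the i-th residue of a homolog sits in the i-th non-gap column of its aligned sequence. An R sketch of that idea on a made-up aligned string is shown below; it is an illustration of the concept, not a replacement for the script above, and `aligned`, `msa_cols`, and `residue_map` are illustrative names.

```r
# Toy aligned sequence: "-" marks a gap column
aligned <- "M-KL--V"

# Split into single characters and keep the columns that hold a residue
chars    <- strsplit(aligned, "")[[1]]
msa_cols <- which(chars != "-")

# Residue i of the ungapped sequence sits in alignment column msa_cols[i]
residue_map <- data.frame(orth_aanum = seq_along(msa_cols),
                          msa_aanum  = msa_cols)
print(residue_map)
```

The column names mirror the `orth_aanum` and `msa_aanum` fields written by the script, so two homologs can be compared by joining their maps on `msa_aanum`.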

### GOF Plots

Identify the dropout perfects (unique IDs) that have at least one single-mutation mutant:
```{r}
# Create a data frame of unique IDs
mutants15_to_plot_mic <- dropout_mutants15_GOF_fitness_mic %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  select(ID)
```

```{r class.output="goodCode"}
# Initialize an empty vector to store IDs of mutants to be removed
mutants_to_remove_mic <- character()

# Check for missing MSA files
for (i in 1:nrow(mutants15_to_plot_mic)) {
  mutant15_current_temp <- mutants15_to_plot_mic$ID[i]
  if (!file.exists(paste("GOF/MSA_Dropouts/MIC/", mutant15_current_temp, ".csv", sep = ""))) {
    mutants_to_remove_mic <- c(mutants_to_remove_mic, mutant15_current_temp)
  }
}

# Output the results
if (length(mutants_to_remove_mic) > 0) {
  cat("The following mutants will be removed due to missing MSA files:\n")
  print(mutants_to_remove_mic)
  cat("\nTotal number of mutants to be removed:", length(mutants_to_remove_mic), "\n")
  
  # Remove the mutants without MSA files
  # drop = FALSE keeps the one-column data frame from collapsing to a vector
  mutants15_to_plot_mic <- mutants15_to_plot_mic[!mutants15_to_plot_mic$ID %in% mutants_to_remove_mic, , drop = FALSE]
  cat("\nMutants remaining:", nrow(mutants15_to_plot_mic), "\n")
} else {
  cat("All mutants have corresponding MSA files. No mutants will be removed.\n")
}

# If you want to see the remaining mutants
print(mutants15_to_plot_mic)
```

Read in the E. coli residue map (NP_414590, stored under the Complementation MSA folder):
```{r}
ecoli_map <- read.csv(file=paste("GOF/MSA_Dropouts/Comp/NP_414590.csv", sep=""), head=TRUE, sep=",")
```

Create an empty data frame to collect every mapped GoF mutation, along with the amino-acid lookup table:
```{r}
GOF_fitness_map_mic <- data.frame(position=numeric(),
                              aa=character(),
                              mutations=numeric(),
                              fitness=numeric(),
                              posortho=numeric(),
                              ingap=character(),
                              mutID=character(),
                              ID=character())

aminoacids <- data.frame(aa=c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T','X'),
                         aanum=c(1:21))
```

**MSA Mapping:** Map the mutants (fitness gain over the parent perfect >= 0.5) of all dropout perfects (fit < -1) onto E. coli residue positions for GOF analysis:
```{r class.output="goodCode"}
#loop over all perfects
for (iii in 1:nrow(mutants15_to_plot_mic)){
  
  #current ortholog:
  mutant_current_mic <- as.character(mutants15_to_plot_mic$ID[iii])
  
  #length of name
  name_size_mic = nchar(paste(mutant_current_mic,"_",sep=""))
  
  #get the MSA mapping
  mutant_map_mic <- read.csv(file=paste("GOF/MSA_Dropouts/MIC/",mutant_current_mic,".csv",sep=""),head=TRUE,sep=",")
  
  #grab the mutants with a fitness gain (GoF) of at least 0.5 over the parent (perfects are not in this dataset)
  GOFmutIDinfo_temp_mic <- dropout_mutants15_GOF_fitness_mic %>%
    filter(ID == mutant_current_mic) %>%
    filter(mutations != 0) %>%
    filter(fitD07D03 >= 0.5) ###CHANGE VALUE AS NEEDED (>= 0.5 is default)
  
  # Check if GOFmutIDinfo_temp_mic is empty
  if(nrow(GOFmutIDinfo_temp_mic) == 0) {
    warning(paste("No non-zero mutation data found for ID:", mutant_current_mic))
    next  # Skip to the next iteration of the outer loop
  }
  
  #loop over all mutants for this construct:
  for (mn in 1:nrow(GOFmutIDinfo_temp_mic)) {
    
    #this mutants fitness
    gof_fit_temp_mic <- GOFmutIDinfo_temp_mic$fitD07D03[mn]  # or whichever fitness column you're using
    
    #grab the mut name
    mutations_names_mic <- as.character(GOFmutIDinfo_temp_mic$mutID[mn])
    
    #grab only the relevant portion of the name (strip the "<ID>_" prefix)
    mutations_names_mic <- substr(mutations_names_mic, name_size_mic+1, nchar(mutations_names_mic))
    
    ## split the mutation string at underscores (one element per mutation)
    s <- strsplit(mutations_names_mic, "_")
    
    for (mutnum in 1:GOFmutIDinfo_temp_mic$mutations[mn]){
      
      #grab the corresponding mutation string
      mutcurr<-s[[1]][mutnum]
      
      #get the position
      mutpos <- as.numeric(str_extract(mutcurr, "[0-9]+"))
      
      #get ending aa
      to_aa <- substr(mutcurr, nchar(mutpos)+2, nchar(mutcurr))
      
      #find the number in the consensus seq
      gof_cons_aanum_index <- which(mutant_map_mic$orth_aanum == mutpos)
      
      if (length(gof_cons_aanum_index) > 0) {
        gof_cons_aanum <- mutant_map_mic$msa_aanum[gof_cons_aanum_index]
        
        #does this map to a non-gap
        if (gof_cons_aanum %in% ecoli_map$msa_aanum){
          
          #the corresponding e.coli residue
          e_coli_residue <- ecoli_map$orth_aanum[which(ecoli_map$msa_aanum == gof_cons_aanum)]
          
          #add this point to the data
          GOF_fitness_map_mic <- rbind(GOF_fitness_map_mic,
                                   data.frame(position=e_coli_residue,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp_mic$mutations[mn],
                                              fitness=gof_fit_temp_mic,
                                              posortho=mutpos,
                                              ingap="No",
                                              mutID=GOFmutIDinfo_temp_mic$mutID[mn],
                                              ID=GOFmutIDinfo_temp_mic$ID[mn]))
          
        } else {
          #if it's here it maps to a gap
          
          #add this point to the data
          GOF_fitness_map_mic <- rbind(GOF_fitness_map_mic,
                                   data.frame(position=-1,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp_mic$mutations[mn],
                                              fitness=gof_fit_temp_mic,
                                              posortho=mutpos,
                                              ingap="Yes",
                                              mutID=GOFmutIDinfo_temp_mic$mutID[mn],
                                              ID=GOFmutIDinfo_temp_mic$ID[mn]))
          
        }
      } else {
        warning(paste("No matching orth_aanum found for mutpos:", mutpos, "in ID:", mutant_current_mic))
        # You might want to handle this case, perhaps by skipping this mutation or adding it to a separate list for review
      }
    }
  }
}
```

```{r echo=FALSE}
write.table(GOF_fitness_map_mic, file = "GOF/OUTPUT/MIC/GOF_Fitness_Map_MIC.csv", 
            sep = ",", row.names = F,quote=F,col.names = T)
```

Collapse the GOF fitness values by aa position along the protein sequence:
```{r}
GOF_fitness_collapsed_by_pos_mic <- GOF_fitness_map_mic %>%
  filter(position > 0) %>%
  group_by(position) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))
```
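
As a minimal sketch of what this collapse produces, the base R snippet below applies the same four summaries to a small made-up table. The values and the names `toy_map`, `toy_kept`, and `toy_collapsed` are illustrative assumptions, not data from the assay.

```r
# Toy stand-in for GOF_fitness_map_mic (values are invented for illustration)
toy_map <- data.frame(
  position = c(10, 10, 10, 25, -1),
  fitness  = c(1.0, 2.0, 4.0, 0.8, 3.0),
  ID       = c("A", "A", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Drop gap-mapped rows (position == -1), then summarise each position
toy_kept <- toy_map[toy_map$position > 0, ]
toy_collapsed <- do.call(rbind, lapply(split(toy_kept, toy_kept$position), function(d) {
  data.frame(position  = d$position[1],
             fitval    = median(d$fitness),      # median fitness gain
             numpoints = nrow(d),                # number of GOF mutants
             stdfit    = sd(d$fitness),          # NA when only one mutant
             numortho  = length(unique(d$ID)))   # number of distinct homologs
}))
print(toy_collapsed)
```

Positions supported by a single mutant get `stdfit = NA`, because the standard deviation of one value is undefined.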

#### GOF Mutant No. Plot

Plot the number of gain-of-function mutants recovered at each aa position along the protein sequence. Dashed lines mark the mean number of GOF mutants per position (red) and 2 SD above the mean (blue). Positions with a GOF mutant count at or above the 2 SD line are treated as significant positions that positively influence the parent variant's ability to complement metabolic function in the E. coli knockout model.
```{r class.output="goodCode"}
GoF_plot_mic <- ggplot(GOF_fitness_collapsed_by_pos_mic, aes(x=position, y=numpoints, color=numortho)) +
  geom_segment(aes(x = 0, y = mean(numpoints)+2*sd(numpoints), 
                   xend = 160, 
                   yend = mean(numpoints)+2*sd(numpoints)),linetype=2,colour = "blue")+
  geom_segment(aes(x = 0, y = mean(numpoints), xend = 160, yend = mean(numpoints)),linetype=2,colour = "red")+
  geom_point(size=1.8)+
  labs(x = "Position (aa)", y ="Number of gain-of-function mutants",color="") +
  scale_color_gradient(low = "blue", 
                       high = "red",
                       name="Num.\nUniq.\nHomo.",
                       na.value="grey", 
                       limit = c(0,1.1*max(GOF_fitness_collapsed_by_pos_mic$numortho))) +
  scale_x_continuous(breaks=seq(0,160,20))+
  theme(legend.position="left")
GoF_plot_mic <- ggExtra::ggMarginal(GoF_plot_mic,type = "histogram",
                    margins = "y",
                    bins=21,
                    col = 'black',
                    fill = 'red')
GoF_plot_mic
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/MIC/GOF.MIC.Mutant.by.AA.Pos.2sigma.png", 
       plot = GoF_plot_mic,
       width = 4.5, height = 4.5, units = "in")
```

Print a summary table of the significant aa positions along the protein sequence:
```{r class.output="goodCode"}
GOF_fitness_collapsed_by_pos_2sigma_mic <- GOF_fitness_collapsed_by_pos_mic %>%
  filter(numpoints >= (mean(GOF_fitness_collapsed_by_pos_mic$numpoints) + 2*sd(GOF_fitness_collapsed_by_pos_mic$numpoints)))
print(GOF_fitness_collapsed_by_pos_2sigma_mic)
```
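
The cutoff used in this filter is simple arithmetic on the per-position counts. The sketch below works through it on invented counts; `toy_counts`, `cutoff`, and `toy_significant` are hypothetical names used only for this illustration.

```r
# Invented per-position counts of GOF mutants (illustration only)
toy_counts <- data.frame(position  = 1:10,
                         numpoints = c(2, 3, 2, 4, 3, 2, 3, 2, 3, 20))

# Cutoff = mean + 2 * sample standard deviation of the per-position counts
cutoff <- mean(toy_counts$numpoints) + 2 * sd(toy_counts$numpoints)

# Keep positions at or above the cutoff
toy_significant <- toy_counts[toy_counts$numpoints >= cutoff, ]
print(cutoff)
print(toy_significant)
```

Because the mean and SD are computed over all positions, including the high-count positions themselves, a few dominant positions raise the cutoff for everything else.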

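Collapse the GOF fitness values by position and substituted amino acid, and record the matrix dimensions (number of amino acids by E. coli reference length):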
```{r}
GOF_fitness_collapsed_all_mic <- GOF_fitness_map_mic %>%
  filter(position > 0) %>%
  group_by(position, aa) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))

gof_aa_dim <- nrow(aminoacids)
gof_ref_len <- nrow(ecoli_map)
```

```{r warning=FALSE}
#these matrices have the fitness/num/sd for each aa at each position:
gof_matrix = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_num = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_sd = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_numortho = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)

#populate matrix
for (i in 1:nrow(GOF_fitness_collapsed_all_mic)){
  
  gof_matrix[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_mic$aa[i])),GOF_fitness_collapsed_all_mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_mic$fitval[i])
  gof_matrix_num[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_mic$aa[i])),GOF_fitness_collapsed_all_mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_mic$numpoints[i])
  gof_matrix_sd[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_mic$aa[i])),GOF_fitness_collapsed_all_mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_mic$stdfit[i])
  gof_matrix_numortho[which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_mic$aa[i])),GOF_fitness_collapsed_all_mic$position[i]] <- as.numeric(GOF_fitness_collapsed_all_mic$numortho[i])
}

rownames(gof_matrix)<-aminoacids$aa
colnames(gof_matrix)<-c(1:gof_ref_len)
rownames(gof_matrix_num)<-aminoacids$aa
colnames(gof_matrix_num)<-c(1:gof_ref_len)
rownames(gof_matrix_sd)<-aminoacids$aa
colnames(gof_matrix_sd)<-c(1:gof_ref_len)
rownames(gof_matrix_numortho)<-aminoacids$aa
colnames(gof_matrix_numortho)<-c(1:gof_ref_len)

gof_matrix_melt <- melt(gof_matrix)
gof_matrix_num_melt <- melt(gof_matrix_num)
gof_matrix_sd_melt <- melt(gof_matrix_sd)
gof_matrix_numortho_melt <- melt(gof_matrix_numortho)

# Rename columns to "X1" and "X2"
names(gof_matrix_melt)[names(gof_matrix_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_num_melt)[names(gof_matrix_num_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_sd_melt)[names(gof_matrix_sd_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_numortho_melt)[names(gof_matrix_numortho_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")

gof_matrix_melt_only_GOFpos <- gof_matrix_melt %>%
  filter(X2 == 89 |
         X2 == 102 |
         X2 == 103 |
         X2 == 121 |
         X2 == 128 |
         X2 == 129 )

gof_matrix_melt_only_GOFpos$mutposnum <- 0
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==89)] <- 1
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==102)] <- 2
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==103)] <- 3
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==121)] <- 4
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==128)] <- 5
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==129)] <- 6

gof_matrix_melt_only_GOFpos$aanum <- 0
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="A")] <- 12
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="C")] <- 10
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="D")] <- 5
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="E")] <- 4
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="F")] <- 19
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="G")] <- 11
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="H")] <- 3
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="I")] <- 15
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="K")] <- 1
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="L")] <- 14
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="M")] <- 16
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="N")] <- 6
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="P")] <- 17
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="Q")] <- 7
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="R")] <- 2
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="S")] <- 9
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="T")] <- 8
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="V")] <- 13
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="W")] <- 20
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="Y")] <- 18
gof_matrix_melt_only_GOFpos$aanum[which(gof_matrix_melt_only_GOFpos$X1=="X")] <- 21

gof_matrix_melt_only_GOFpos_wnum <- gof_matrix_melt_only_GOFpos %>%
  inner_join(gof_matrix_num_melt,by=c("X1","X2")) %>%
  dplyr::rename(mutnum=value.y,value=value.x)
```
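
The population loop in the chunk above fills one matrix cell per iteration. As a hedged alternative sketch, base R's two-column matrix indexing can fill every (amino acid, position) cell in a single assignment. The table `toy_all` and the reference length `ref_len` are invented for this example; the amino acid order is copied from the `aminoacids` data frame used in this notebook.

```r
# Amino acid row order, matching the `aminoacids` data frame in this notebook
aa_levels <- c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T','X')

# Invented collapsed table: one row per (position, aa) pair
toy_all <- data.frame(position = c(3, 3, 5),
                      aa       = c("A", "W", "G"),
                      fitval   = c(1.2, 0.7, 2.5),
                      stringsAsFactors = FALSE)

ref_len <- 6  # hypothetical reference length for this toy case
toy_matrix <- matrix(NA_real_, nrow = length(aa_levels), ncol = ref_len,
                     dimnames = list(aa_levels, seq_len(ref_len)))

# cbind(row, col) addresses each (aa, position) cell directly
idx <- cbind(match(toy_all$aa, aa_levels), toy_all$position)
toy_matrix[idx] <- toy_all$fitval

print(toy_matrix[c("A", "W", "G"), ])
```

The same index matrix could be reused for the count, SD, and homolog-number matrices, since all four share one layout.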

#### GOF Position Plot

Plot the median fitness of each GoF mutation at the significant positions, labeled with the number of mutants observed for each amino acid:
```{r}
# Define the order of amino acids for the rectangles
rect_order <- c("P", "Q", "F", "G", "Y", "E")

# Create a data frame for the rectangles
rect_data <- data.frame(
aanum = match(rect_order, c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X")),
xmin = seq(0.5, by = 1, length.out = length(rect_order)),
xmax = seq(1.5, by = 1, length.out = length(rect_order)))

#plot the data from all mutants:
GOF_fit_nummut_plot_mic <- ggplot(gof_matrix_melt_only_GOFpos_wnum, 
       aes(x=mutposnum, y=aanum,
           fill=value,
           label=mutnum)) +
  geom_tile() +
  geom_text() +
  # Add black rectangles
  geom_rect(data = rect_data,
           aes(xmin = xmin, xmax = xmax, ymin = aanum - 0.5, ymax = aanum + 0.5),
           fill = NA, color = "black", inherit.aes = FALSE) +
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                      high = "red",
                      name="Fitness",
                      na.value="grey",
                      limit = c(0, max(gof_matrix_melt_only_GOFpos_wnum$value, na.rm = TRUE))) +
  theme_minimal()+
  scale_x_continuous(name="Position (aa)",
                     breaks=c(1,2,3,4,5,6),
                     labels=c("89","102","103","121","128","129"))+
  scale_y_continuous(name="Amino acid", 
                     breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21),
                     labels=c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X"))

print(GOF_fit_nummut_plot_mic)
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/MIC/GOF.MIC.Fitness.Number.Mutants.png", 
       plot = GOF_fit_nummut_plot_mic,
       width = 4.5, height = 4.5, units = "in")
```

## 400x MIC (200 ug/mL TMP)

### Dropout Perfects

Retrieve all perfects with a log-fold change greater than -1.0 in the **COMPLEMENTATION** dataset (`fitD05D03`). Of these, retain only the perfects (by ID) whose fitness in the **400x MIC** dataset (`fitD11D03`) is less than -1.0, i.e. perfects that complement but drop out under 400x MIC. Also retrieve all corresponding mutants with `fitD11D03` greater than -1.0. Use the `mut_collapse_15` dataset, which includes 797 perfects (mutations = 0; numprunedBCs = 5).
```{r}
# Step 1: Identify IDs that have rows where mutations == 0 and fitD05D03 > -1.0 in COMPLEMENTATION
dropout15_ids_with_zero_mutations_complement <- mut_collapse_15 %>%
  filter(mutations == 0 & fitD05D03 > -1.0) %>%
  distinct(ID) %>%
  pull(ID)

# Step 2: Identify the same IDs that have mutations == 0 and fitD11D03 < -1.0 in 400x MIC
dropout15_ids_with_zero_mutations_400mic <- mut_collapse_15 %>%
  filter(ID %in% dropout15_ids_with_zero_mutations_complement & 
         mutations == 0 & 
         fitD11D03 < -1.0 &
         !is.na(fitD11D03)) %>%
  distinct(ID) %>%
  pull(ID)

# Step 3: Retrieve the rows for these IDs
result_rows_400mic <- mut_collapse_15 %>%
  filter(ID %in% dropout15_ids_with_zero_mutations_400mic & mutations == 0)

# Step 4: Filter the main dataset to keep mutants if they match a corresponding perfect ID
dropout_mutants15_GOF_400mic <- mut_collapse_15 %>%
  filter(
    (mutations == 0 & !is.na(fitD05D03) & fitD05D03 > -1.0 & 
     !is.na(fitD11D03) & fitD11D03 < -1.0) |
    (mutations != 0 & fitD11D03 > -1.0 & ID %in% dropout15_ids_with_zero_mutations_400mic)) %>%
  dplyr::select(ID, mutID, numprunedBCs, mutations, fitD05D03, fitD11D03, seq)
```
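
To make the selection logic concrete, here is a minimal base R sketch on invented rows. The data frame `toy` and its values are assumptions for illustration; only the column names and the -1.0 thresholds come from the chunk above.

```r
# Invented rows; fitD05D03 = complementation, fitD11D03 = 400x MIC
toy <- data.frame(
  ID        = c("h1", "h1", "h1", "h2", "h2", "h3"),
  mutations = c(0,    1,    2,    0,    1,    0),
  fitD05D03 = c(0.2,  0.1, -0.3,  0.0,  0.1, -2.0),
  fitD11D03 = c(-3.0, 0.5, -2.5,  0.4,  0.6, -3.5),
  stringsAsFactors = FALSE
)

# Perfects that complement (> -1) but drop out at 400x MIC (< -1)
is_perfect  <- toy$mutations == 0
dropout_ids <- toy$ID[is_perfect & toy$fitD05D03 > -1 & toy$fitD11D03 < -1]

# Keep those perfects plus their mutants that stay above -1 at 400x MIC
keep <- (is_perfect & toy$ID %in% dropout_ids) |
        (!is_perfect & toy$fitD11D03 > -1 & toy$ID %in% dropout_ids)
toy_gof <- toy[keep, ]
print(toy_gof)
```

Only `h1` qualifies as a dropout perfect here, and only its single mutant is retained. Unlike `dplyr::filter()`, base R logical subsetting does not drop rows whose condition evaluates to `NA`, so real data with missing fitness values would need explicit `!is.na()` guards.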

Validate that every retained perfect (mutations == 0) satisfies both conditions: fitD05D03 > -1.0 and fitD11D03 < -1.0.
```{r class.output="goodCode"}
# Verification step
verification_result_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations == 0) %>%
  mutate(
    condition_met = fitD05D03 > -1.0 & fitD11D03 < -1,
    fitD05D03_check = fitD05D03 > -1.0,
    fitD11D03_check = fitD11D03 < -1
  )

# Check if all rows meet the condition
all_conditions_met_400mic <- all(verification_result_400mic$condition_met)

# Summary of the verification
verification_summary_400mic <- verification_result_400mic %>%
  summarise(
    total_rows = n(),
    rows_meeting_both_conditions = sum(condition_met),
    rows_meeting_fitD05D03 = sum(fitD05D03_check),
    rows_meeting_fitD11D03 = sum(fitD11D03_check)
  )

# Print results
print("Verification Summary:")
print(verification_summary_400mic)
print(paste("All conditions met:", all_conditions_met_400mic))

# If there are any rows not meeting the conditions, display them
if (!all_conditions_met_400mic) {
  print("Rows not meeting both conditions:")
  print(verification_result_400mic %>% filter(!condition_met) %>% select(ID, fitD05D03, fitD11D03))
}
```

Validate that rows where mutations != 0 have an ID that matches rows where mutations == 0 and fitD11D03 < -1.0.
```{r class.output="goodCode"}
# 1. First, create two subsets of the data
zero_mutation_rows_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations == 0 & fitD11D03 < -1.0)

non_zero_mutation_rows_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations != 0 & fitD11D03 > -1.0)

# 2. Check that all IDs in non_zero_mutation_rows are present in zero_mutation_rows
all_valid_ids_400mic <- all(non_zero_mutation_rows_400mic$ID %in% zero_mutation_rows_400mic$ID)
print(paste("All non-zero mutation rows have a matching zero mutation row:", all_valid_ids_400mic))

# 3. If the above is FALSE, find the problematic IDs
if (!all_valid_ids_400mic) {
  problematic_ids_400mic <- setdiff(non_zero_mutation_rows_400mic$ID, zero_mutation_rows_400mic$ID)
  print("IDs with non-zero mutations but no matching zero mutation row:")
  print(problematic_ids_400mic)
}

# 4. Check for any IDs in zero_mutation_rows that don't have a corresponding non-zero mutation row
ids_without_non_zero_400mic <- setdiff(zero_mutation_rows_400mic$ID, non_zero_mutation_rows_400mic$ID)
print("IDs with zero mutations but no corresponding non-zero mutation rows:")
print(ids_without_non_zero_400mic)

# 5. Summary statistics
print(paste("Number of unique IDs in zero mutation rows:", n_distinct(zero_mutation_rows_400mic$ID)))
print(paste("Number of unique IDs in non-zero mutation rows:", n_distinct(non_zero_mutation_rows_400mic$ID)))

# 6. Distribution of mutation counts for non-zero mutation rows
mutation_distribution_400mic <- non_zero_mutation_rows_400mic %>%
  group_by(mutations) %>%
  summarise(count = n()) %>%
  arrange(mutations)

print("Distribution of mutation counts:")
print(mutation_distribution_400mic)

# 7. Check for any unexpected mutation values
unexpected_mutations_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations < 0 | mutations > 5)  # Adjust the upper bound as needed

if (nrow(unexpected_mutations_400mic) > 0) {
  print("Rows with unexpected mutation values:")
  print(unexpected_mutations_400mic)
} else {
  print("No unexpected mutation values found.")
}
```

Remove perfect IDs that have no corresponding mutants:
```{r class.output="goodCode"}
# Step 1: Identify IDs with zero mutations
zero_mutation_ids_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations == 0 & fitD11D03 < -1.0) %>%
  pull(ID)

# Step 2: Identify IDs with non-zero mutations
non_zero_mutation_ids_400mic <- dropout_mutants15_GOF_400mic %>%
  filter(mutations != 0 & fitD11D03 > -1.0) %>%
  pull(ID)

# Step 3: Find IDs that have zero mutations but no corresponding non-zero mutation rows
ids_to_remove_400mic <- setdiff(zero_mutation_ids_400mic, non_zero_mutation_ids_400mic)

# Step 4: Remove the rows with these IDs
dropout_mutants15_GOF_400mic_cleaned <- dropout_mutants15_GOF_400mic %>%
  filter(!(ID %in% ids_to_remove_400mic))

# Print summary
print(paste("Number of rows before cleaning:", nrow(dropout_mutants15_GOF_400mic)))
print(paste("Number of rows after cleaning:", nrow(dropout_mutants15_GOF_400mic_cleaned)))
print(paste("Number of rows removed:", nrow(dropout_mutants15_GOF_400mic) - nrow(dropout_mutants15_GOF_400mic_cleaned)))
print(paste("Number of unique IDs removed:", length(ids_to_remove_400mic)))

# Optionally, you can print the removed IDs
print("IDs removed:")
print(ids_to_remove_400mic)

# Assign the cleaned data back to dropout_mutants15_GOF_400mic for downstream use
dropout_mutants15_GOF_400mic <- dropout_mutants15_GOF_400mic_cleaned
```
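
The cleaning step reduces to a set difference between perfect IDs and mutant IDs. A minimal sketch on invented data, where `toy`, `orphan_ids`, and `toy_cleaned` are hypothetical names:

```r
# Invented table: h2 has a perfect but no mutants
toy <- data.frame(ID        = c("h1", "h1", "h2", "h3", "h3"),
                  mutations = c(0,    1,    0,    0,    2),
                  stringsAsFactors = FALSE)

perfect_ids <- unique(toy$ID[toy$mutations == 0])
mutant_ids  <- unique(toy$ID[toy$mutations != 0])

# Perfects with no mutant sharing their ID
orphan_ids <- setdiff(perfect_ids, mutant_ids)

# Drop every row belonging to an orphaned perfect
toy_cleaned <- toy[!(toy$ID %in% orphan_ids), ]
print(orphan_ids)
print(toy_cleaned)
```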

### Dropout Mutants
Summarize the number of perfects and mutants at each AA distance after filtering:
```{r class.output="goodCode"}
# Create a function to count unique mutIDs for a given number of mutations
dropout_mutants15_GOF_count_400mic <- function(data, mutation_count) {
  length(unique(subset(data, mutations == mutation_count)$mutID))
}

# Create a vector of counts for mutations 1-5
dropout15_counts_400mic <- sapply(1:5, function(x) dropout_mutants15_GOF_count_400mic(dropout_mutants15_GOF_400mic, x))

# Count perfects separately
perfects_count_400mic <- length(unique(subset(dropout_mutants15_GOF_400mic, mutations == 0 & fitD11D03 < -1.0)$mutID))

# Create a data frame with the results, including the summary row
dropout_mutants15_GOF_table_400mic <- data.frame(
  Mutations = c("Perfects (fit < -1.0)", "1 Mutation", "2 Mutations", "3 Mutations", "4 Mutations", "5 Mutations", "Total Mutations"),
  Count = c(perfects_count_400mic, dropout15_counts_400mic, sum(dropout15_counts_400mic))
)

# Print the table
print(dropout_mutants15_GOF_table_400mic)
```

```{r echo=FALSE}
# Save as .csv for use in other RMD files:
write.csv(dropout_mutants15_GOF_400mic, 'GOF/OUTPUT/400xMIC/Lib15.D11D03.gof.1numprunedBCs.5AA.muts.fitness-1.csv',
          row.names = FALSE, quote=FALSE)
```

### GOF Fitness

**GOF Fitness:** Separate the `dropout_mutants15_GOF_400mic` dataset into two new dataframes, where DF1 contains perfects and DF2 contains mutants:
```{r}
# Create a dataframe with mutations == 0
dropout_mutants15_GOF_no_mutations_400mic <- dropout_mutants15_GOF_400mic[dropout_mutants15_GOF_400mic$mutations == 0, ]

# Create a dataframe with mutations != 0
dropout_mutants15_GOF_with_mutations_400mic <- dropout_mutants15_GOF_400mic[dropout_mutants15_GOF_400mic$mutations != 0, ]
```

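Now join the perfects back onto the mutants by ID and express each mutant's fitness relative to its parent perfect (mutant `fitD11D03` minus perfect `fitD11D03`), retaining only the mutant rows: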
```{r class.output="goodCode"}
# Step 1: Prepare the reference dataframe
df_reference_400mic <- dropout_mutants15_GOF_no_mutations_400mic %>%
  select(ID, fitD11D03) %>%
  rename(reference_fitD11D03 = fitD11D03)

# Step 2: Join and calculate the difference
dropout_mutants15_GOF_fitness_400mic <- dropout_mutants15_GOF_with_mutations_400mic %>%
  left_join(df_reference_400mic, by = "ID") %>%
  mutate(fitD11D03 = fitD11D03 - reference_fitD11D03) %>%
  select(ID, mutID, mutations, fitD11D03)

# Print summary statistics
print(paste("Number of Mutants:", nrow(dropout_mutants15_GOF_fitness_400mic)))
print(paste("Unique IDs:", length(unique(dropout_mutants15_GOF_fitness_400mic$ID))))
print(paste("Range of fitD11D03:", 
            paste(round(range(dropout_mutants15_GOF_fitness_400mic$fitD11D03, na.rm = TRUE), 1), collapse = " to ")))
```
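
The relative-fitness calculation is a lookup followed by a subtraction. The sketch below shows it in base R on invented values. The column `rel_fit` is a hypothetical name chosen so the raw fitness stays visible, whereas the chunk above overwrites `fitD11D03` with the difference.

```r
# Invented parent (perfect) and mutant fitness values at 400x MIC
toy_perfects <- data.frame(ID = c("h1", "h2"),
                           reference_fitD11D03 = c(-3.0, -2.0),
                           stringsAsFactors = FALSE)
toy_mutants <- data.frame(ID        = c("h1", "h1", "h2"),
                          mutID     = c("h1_A5G", "h1_T9S", "h2_L3P"),
                          fitD11D03 = c(0.5, -0.5, 1.0),
                          stringsAsFactors = FALSE)

# A parent ID should appear only once; duplicates would multiply rows in a join
stopifnot(!anyDuplicated(toy_perfects$ID))

# Look up each mutant's parent by ID and subtract the parent's fitness
parent_row <- match(toy_mutants$ID, toy_perfects$ID)
toy_mutants$rel_fit <- toy_mutants$fitD11D03 -
  toy_perfects$reference_fitD11D03[parent_row]
print(toy_mutants)
```

A mutant at 0.5 whose parent sits at -3.0 therefore has a relative gain of 3.5, so the relative values are positive whenever the mutant outperforms its dropout parent.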

```{r class.output="goodCode"}
# Count unique parent IDs represented among the mutants
unique_ids_zero_mutations <- dropout_mutants15_GOF_fitness_400mic %>%
  distinct(ID) %>%
  nrow()

print(paste("Number of unique IDs:", unique_ids_zero_mutations))
```

**Boxplot:** Plot mutant fitness relative to parent variant by number of mutations:
```{r}
GOF_muts_fitness_by_muts_plot_400mic <- ggplot(dropout_mutants15_GOF_fitness_400mic, 
                                        aes(x = factor(mutations), y = fitD11D03)) +
  geom_boxplot() +
  labs(title = "fitD11D03 by Number of Mutations", x = "Number of Mutations", y = "fitD11D03")

print(GOF_muts_fitness_by_muts_plot_400mic)
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/400xMIC/GOF.400xMIC.Mutant.Fitness.by.Mutations.png", 
       plot =GOF_muts_fitness_by_muts_plot_400mic,
       width = 4.5, height = 4.5, units = "in")
```

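**Histogram of Mutant Fitness:** Plot the distribution of mutant fitness relative to the parent variant. The distribution appears approximately normal by eye; no formal normality test is applied in this section.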
```{r}
GOF_muts_fitness_dist_plot_400mic <- ggplot(dropout_mutants15_GOF_fitness_400mic, aes(x = fitD11D03)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "Distribution of fitD11D03", x = "fitD11D03", y = "Count")

print(GOF_muts_fitness_dist_plot_400mic)
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/400xMIC/GOF.400xMIC.Mutant.Fitness.Distribution.png", 
       plot =GOF_muts_fitness_dist_plot_400mic,
       width = 4.5, height = 4.5, units = "in")
```

### GOF Alignment

**FASTA:** Generate a FASTA file of the perfects in the filtered `dropout_mutants15_GOF_400mic` dataset that have at least one single-mutation mutant, for GoF analysis:
```{r}
# First, let's ensure we have the correct unique IDs for mutations == 1
dropout_mutants15_GOF_400mic_1mut_unique_ids <- dropout_mutants15_GOF_fitness_400mic %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  pull(ID)

# Now, let's use these IDs to filter the dropout_mutants15_GOF dataset
dropout_mutants15_GOF_400mic_1mut_unique_id_seq <- dropout_mutants15_GOF_400mic %>%
  filter(ID %in% dropout_mutants15_GOF_400mic_1mut_unique_ids & mutations == 0) %>%
  select(ID, seq)

# Create the sequences in FASTA format
dropout_mutants15_GOF_400mic_fasta_content <- paste(">", dropout_mutants15_GOF_400mic_1mut_unique_id_seq$ID, "\n", dropout_mutants15_GOF_400mic_1mut_unique_id_seq$seq, "\n", sep = "", collapse = "")

# Define the file path in the working directory
dropout_mutants15_GOF_400mic_fasta_file_path <- file.path(getwd(), "GOF/MSA_Dropouts/400xMIC/FASTA/Lib15.GoF.perfects.400mic.fasta")

# Write the FASTA content to the file (37 unique ID)
writeLines(dropout_mutants15_GOF_400mic_fasta_content, 
           con = dropout_mutants15_GOF_400mic_fasta_file_path)
```
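
As a small illustration of the FASTA layout being written, the sketch below builds the records as one vector element per line and writes them to a temporary file. The sequences and the names `toy_seqs`, `fasta_lines`, and `fasta_path` are invented for this example.

```r
# Invented IDs and sequences (illustration only)
toy_seqs <- data.frame(ID  = c("h1", "h2"),
                       seq = c("MKT", "MRS"),
                       stringsAsFactors = FALSE)

# Interleave header and sequence lines: ">h1", "MKT", ">h2", "MRS"
fasta_lines <- as.vector(rbind(paste0(">", toy_seqs$ID), toy_seqs$seq))

# Write to a temporary file and read it back to confirm the layout
fasta_path <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, con = fasta_path)
print(readLines(fasta_path))
```

Writing one element per line avoids the extra blank line at the end of the file that results when a single newline-terminated string is passed to `writeLines()`, which appends its own newline.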

**Alignment:** Use the `clustalo` executable to align the protein sequences associated with the dropout perfects. This will align the FASTA file: **Lib15.GoF.perfects.400mic.fasta** for use in GoF analysis.
```{bash}
./Scripts/clustalo -i GOF/MSA_Dropouts/400xMIC/FASTA/Lib15.GoF.perfects.400mic.fasta -o GOF/MSA_Dropouts/400xMIC/FASTA/Lib15.GoF.perfects.400mic.tree.aligned.mod.aln --outfmt=clustal --force
```

**Mapping Residues:** Use the following `map.aligned.residues.py` Python script to generate one CSV file per designed homolog that maps each residue position in the homolog (`orth_aanum`) to its column in the Clustal alignment (`msa_aanum`):
```{python}
import time
import csv

##################################
#INPUTS:

base_path = ""
trees_path_prefix = base_path+""

#clustal format alignment file
align_file_in = [trees_path_prefix+"GOF/MSA_Dropouts/400xMIC/FASTA/Lib15.GoF.perfects.400mic.tree.aligned.mod.aln"]

#number of seqs in each alignment file
num_samples_in_file = [38] #New FASTA w/ mutant fit > -1 (+1 from actual file count)

##################################
#OUTPUTS:

msa_map_out_path = [trees_path_prefix+"GOF/MSA_Dropouts/400xMIC/"]

# Loop to generate .csv files for each ID
for alni in range(1):#len(align_file_in)):
    #print(alni)
    
    ##################################
    #VARIABLES:
    
    #ID as key, align as value
    align_dict = dict()
    
    #num_samples
    num_samples = num_samples_in_file[alni]
    
    #pos key, consensus pos val
    IDaadictlist = [dict() for x in range(num_samples)]
    
    IDtoindexdict = dict()
    indexdtoIDict = dict()
    
    ##################################
    #CODE:
    
    line_count = 0
    #loop over all alignments:
    print(align_file_in[alni])
    for line in open(align_file_in[alni]):
        #skip header
        if line_count > 1:
            listWords = line.split('    ')
            ID = listWords[0]
            align = line[16:].rstrip()
            if ID.strip() != "":
                align_dict[ID] = align_dict.get(ID, "") + align.replace(" ", "")
        line_count += 1
    
    #print("NP_414590")
    #print(align_dict["NP_414590"])

    counter = 0
    for ID in align_dict:
        #print(ID)
        #print(align_dict[ID])
        IDtoindexdict[ID] = counter
        indexdtoIDict[counter]=ID
        align = align_dict[ID]
        
        aacounter = 1
        
        
        for i in range(len(align)):
            if align[i] != "-":
                
                #print(str(counter)+" "+str(aacounter))
                IDaadictlist[counter][aacounter]=i+1
                aacounter += 1
        counter += 1
        
    #print(len(IDaadictlist))
    for i in range(len(IDaadictlist)-1):
        #print(indexdtoIDict[i])
        #print(i)
        #print(alni)
        #print(indexdtoIDict[i])
        csvfile = open(str(msa_map_out_path[alni]+indexdtoIDict[i]+".csv"), 'w')
        fieldnames = ['orth_aanum','msa_aanum']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for j in IDaadictlist[i]:
            #print(str(j)+" "+str(IDaadictlist[i][j]))
            #save all data:
            writer.writerow({'orth_aanum':str(j),'msa_aanum':str(IDaadictlist[i][j])})
        csvfile.close()
```
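
Independent of the file parsing, the core of the mapping is to number the non-gap characters of each aligned sequence and record the alignment column each one occupies. A minimal sketch of that step in R, using two invented aligned sequences; `toy_aln` and `map_residues()` are hypothetical names, not part of the script above.

```r
# Invented aligned sequences ("-" marks an alignment gap)
toy_aln <- c(h1 = "MK-TA", ecoli = "M-STA")

# For one aligned sequence, return residue number -> alignment column
map_residues <- function(aligned) {
  chars <- strsplit(aligned, "")[[1]]
  cols  <- which(chars != "-")            # alignment columns holding a residue
  data.frame(orth_aanum = seq_along(cols), msa_aanum = cols)
}

toy_maps <- lapply(toy_aln, map_residues)
print(toy_maps$h1)
print(toy_maps$ecoli)
```

In this toy case residue 3 of `h1` sits in alignment column 4, which is also where residue 3 of `ecoli` sits, so the two residues are treated as equivalent positions.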

### GOF Plots

Find GoF Perfects for Dropouts
```{r}
# Create a data frame of unique IDs
mutants15_to_plot_400mic <- dropout_mutants15_GOF_fitness_400mic %>%
  filter(mutations == 1) %>%
  distinct(ID) %>%
  select(ID)
```

```{r class.output="goodCode"}
# Initialize an empty vector to store IDs of mutants to be removed
mutants_to_remove_400mic <- character()

# Check for missing MSA files
for (i in 1:nrow(mutants15_to_plot_400mic)) {
  mutant15_current_temp <- mutants15_to_plot_400mic$ID[i]
  if (!file.exists(paste("GOF/MSA_Dropouts/400xMIC/", mutant15_current_temp, ".csv", sep = ""))) {
    mutants_to_remove_400mic <- c(mutants_to_remove_400mic, mutant15_current_temp)
  }
}

# Output the results
if (length(mutants_to_remove_400mic) > 0) {
  cat("The following mutants will be removed due to missing MSA files:\n")
  print(mutants_to_remove_400mic)
  cat("\nTotal number of mutants to be removed:", length(mutants_to_remove_400mic), "\n")
  
  # Remove the mutants without MSA files
  mutants15_to_plot_400mic <- mutants15_to_plot_400mic[!mutants15_to_plot_400mic$ID %in% mutants_to_remove_400mic, , drop = FALSE]
  cat("\nMutants remaining:", nrow(mutants15_to_plot_400mic), "\n")
} else {
  cat("All mutants have corresponding MSA files. No mutants will be removed.\n")
}

# If you want to see the remaining mutants
print(mutants15_to_plot_400mic)
```
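
The file check can also be written without a loop, since `file.exists()` is vectorized over paths. A hedged sketch using a temporary directory; the directory, IDs, and file contents are invented for illustration.

```r
# Temporary directory standing in for the MSA map folder
toy_dir <- file.path(tempdir(), "toy_msa_maps")
dir.create(toy_dir, showWarnings = FALSE)

# Only "h1" gets a map file in this toy setup
write.csv(data.frame(orth_aanum = 1:2, msa_aanum = c(1, 3)),
          file.path(toy_dir, "h1.csv"), row.names = FALSE)

toy_ids <- c("h1", "h2")

# One logical per ID: TRUE when the corresponding CSV exists
has_map     <- file.exists(file.path(toy_dir, paste0(toy_ids, ".csv")))
missing_ids <- toy_ids[!has_map]
kept_ids    <- toy_ids[has_map]
print(missing_ids)
print(kept_ids)
```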

Read in the E. coli map:
```{r}
ecoli_map <- read.csv(file=paste("GOF/MSA_Dropouts/Comp/NP_414590.csv", sep=""), 
                      head=TRUE, sep=",")
```

Create an empty data frame to collect the mapped GOF data, and define the amino acid index:
```{r}
GOF_fitness_map_400mic <- data.frame(position=numeric(),
                              aa=character(),
                              mutations=numeric(),
                              fitness=numeric(),
                              posortho=numeric(),
                              ingap=character(),
                              mutID=character(),
                              ID=character())

aminoacids <- data.frame(aa=c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T','X'),
                         aanum=c(1:21))
```

**MSA Mapping:** Map the mutants (fitness gain over the parent perfect >= 0.5) over all perfects (fit < -1) for GOF analysis:
```{r}
#loop over all perfects
for (iii in 1:nrow(mutants15_to_plot_400mic)){
  
  #current ortholog:
  mutant_current_400mic <- as.character(mutants15_to_plot_400mic$ID[iii])
  
  #length of name
  name_size_400mic = nchar(paste(mutant_current_400mic,"_",sep=""))
  
  #get the MSA mapping
  mutant_map_400mic <- read.csv(file=paste("GOF/MSA_Dropouts/400xMIC/",mutant_current_400mic,".csv",sep=""),head=TRUE,sep=",")
  
  #grab the mutants with a fitness increase (GoF) of at least 0.5 (perfects are not included in this dataset)
  GOFmutIDinfo_temp_400mic <- dropout_mutants15_GOF_fitness_400mic %>%
    filter(ID == mutant_current_400mic) %>%
    filter(mutations != 0) %>%
    filter(fitD11D03 >= 0.5) ###CHANGE VALUE AS NEEDED (default = 0.5)
  
  # Check if GOFmutIDinfo_temp is empty
  if(nrow(GOFmutIDinfo_temp_400mic) == 0) {
    warning(paste("No non-zero mutation data found for ID:", mutant_current_400mic))
    next  # Skip to the next iteration of the outer loop
  }
  
  #loop over all mutants for this construct:
  for (mn in 1:nrow(GOFmutIDinfo_temp_400mic)) {
    
    #this mutants fitness
    gof_fit_temp_400mic <- GOFmutIDinfo_temp_400mic$fitD11D03[mn]  # or whichever fitness column you're using
    
    #grab the mut name
    mutations_names_400mic <- as.character(GOFmutIDinfo_temp_400mic$mutID[mn])
    
    #grab only the relevant portion of the name
    mutations_names_400mic <- substr(mutations_names_400mic, name_size_400mic+1, nchar(mutations_names_400mic))
    
    ## split mutation string at non-digits
    s <- strsplit(mutations_names_400mic, "_")
    
    for (mutnum in 1:GOFmutIDinfo_temp_400mic$mutations[mn]){
      
      #grab the corresponding mutation string
      mutcurr<-s[[1]][mutnum]
      
      #get the position
      mutpos <- as.numeric(str_extract(mutcurr, "[0-9]+"))
      
      #get ending aa
      to_aa <- substr(mutcurr, nchar(mutpos)+2, nchar(mutcurr))
      
      #find the number in the consensus seq
      gof_cons_aanum_index <- which(mutant_map_400mic$orth_aanum == mutpos)
      
      if (length(gof_cons_aanum_index) > 0) {
        gof_cons_aanum <- mutant_map_400mic$msa_aanum[gof_cons_aanum_index]
        
        #does this map to a non-gap
        if (gof_cons_aanum %in% ecoli_map$msa_aanum){
          
          #the corresponding e.coli residue
          e_coli_residue <- ecoli_map$orth_aanum[which(ecoli_map$msa_aanum == gof_cons_aanum)]
          
          #add this point to the data
          GOF_fitness_map_400mic <- rbind(GOF_fitness_map_400mic,
                                   data.frame(position=e_coli_residue,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp_400mic$mutations[mn],
                                              fitness=gof_fit_temp_400mic,
                                              posortho=mutpos,
                                              ingap="No",
                                              mutID=GOFmutIDinfo_temp_400mic$mutID[mn],
                                              ID=GOFmutIDinfo_temp_400mic$ID[mn]))
          
        } else {
          #if it's here it maps to a gap
          
          #add this point to the data
          GOF_fitness_map_400mic <- rbind(GOF_fitness_map_400mic,
                                   data.frame(position=-1,
                                              aa=to_aa,
                                              mutations=GOFmutIDinfo_temp_400mic$mutations[mn],
                                              fitness=gof_fit_temp_400mic,
                                              posortho=mutpos,
                                              ingap="Yes",
                                              mutID=GOFmutIDinfo_temp_400mic$mutID[mn],
                                              ID=GOFmutIDinfo_temp_400mic$ID[mn]))
          
        }
      } else {
        warning(paste("No matching orth_aanum found for mutpos:", mutpos, "in ID:", mutant_current_400mic))
        # You might want to handle this case, perhaps by skipping this mutation or adding it to a separate list for review
      }
    }
  }
}
```
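
Each pass through the inner loop does two things: it parses one mutation out of the `mutID`, then carries its position from the homolog through the alignment to the E. coli numbering. The sketch below shows both steps on invented inputs. It assumes a `mutID` layout of the form `<ID>_<from><pos><to>_...` (for example `h1_A2G_T4S`), which is inferred from how the loop slices the string rather than confirmed elsewhere; all `toy_` and `_toy` names are hypothetical.

```r
# Hypothetical mutID layout assumed here: "<ID>_<from><pos><to>_..."
toy_id    <- "h1"
toy_mutID <- "h1_A2G_T4S"

# Strip the "<ID>_" prefix, then split into single mutations
mut_strings <- strsplit(substring(toy_mutID, nchar(toy_id) + 2), "_")[[1]]
mut_pos <- as.numeric(sub("^[A-Za-z*]([0-9]+).*$", "\\1", mut_strings))
to_aa   <- sub("^[A-Za-z*][0-9]+", "", mut_strings)

# Invented residue -> alignment-column maps for the homolog and for E. coli
homolog_map <- data.frame(orth_aanum = 1:4, msa_aanum = c(1, 2, 4, 5))
ecoli_toy   <- data.frame(orth_aanum = 1:4, msa_aanum = c(1, 3, 4, 5))

# Homolog residue -> alignment column -> E. coli residue
msa_col   <- homolog_map$msa_aanum[match(mut_pos, homolog_map$orth_aanum)]
ecoli_pos <- ecoli_toy$orth_aanum[match(msa_col, ecoli_toy$msa_aanum)]

# Columns where E. coli has a gap get position -1, as in the loop above
position <- ifelse(is.na(ecoli_pos), -1, ecoli_pos)
print(data.frame(mut_strings, mut_pos, to_aa, position))
```

In this toy alignment the first mutation lands in a column where the E. coli sequence has a gap, so it is flagged with `-1`; the second maps to E. coli residue 4.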

```{r echo=FALSE}
write.table(GOF_fitness_map_400mic, 
            file = "GOF/OUTPUT/400xMIC/GOF_Fitness_Map_400mic.csv", 
            sep = ",", row.names = F,quote=F,col.names = T)
```

Collapse the GOF fitness values by aa position along the protein sequence:
```{r}
GOF_fitness_collapsed_by_pos_400mic <- GOF_fitness_map_400mic %>%
  filter(position > 0) %>%
  group_by(position) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))
```

#### GOF Mutant No. Plot

Plot the number of gain-of-function mutants recovered at each aa position along the protein sequence. Dashed lines mark the mean number of GOF mutants per position (red) and 2 SD above the mean (blue). Positions with a GOF mutant count at or above the 2 SD line are treated as significant positions that positively influence the parent variant's ability to complement metabolic function in the E. coli knockout model.
```{r class.output="goodCode"}
GoF_plot_400mic <- ggplot(GOF_fitness_collapsed_by_pos_400mic, aes(x=position, y=numpoints, color=numortho)) +
  geom_segment(aes(x = 0, y = mean(numpoints)+2*sd(numpoints), 
                   xend = 160, 
                   yend = mean(numpoints)+2*sd(numpoints)),linetype=2,colour = "blue")+
  geom_segment(aes(x = 0, y = mean(numpoints), xend = 160, yend = mean(numpoints)),linetype=2,colour = "red")+
  geom_point(size=1.8)+
  labs(x = "Position (aa)", y ="Number of gain-of-function mutants",color="") +
  scale_color_gradient(low = "blue", 
                       high = "red",
                       name="Num.\nUniq.\nHomo.",
                       na.value="grey", 
                       limits = c(0,1.1*max(GOF_fitness_collapsed_by_pos_400mic$numortho))) +
  scale_x_continuous(breaks=seq(0,160,20))+
  theme(legend.position="left")
GoF_plot_400mic <- ggExtra::ggMarginal(GoF_plot_400mic,type = "histogram",
                    margins = "y",
                    bins=21,
                    col = 'black',
                    fill = 'red')
GoF_plot_400mic
```

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/400xMIC/GOF.400xMIC.Mutant.by.AA.Pos.2sigma.png", 
       plot = GoF_plot_400mic,
       width = 4.5, height = 4.5, units = "in")
```

Print a summary table of the significant aa positions along the protein sequence:
```{r class.output="goodCode"}
GOF_fitness_collapsed_by_pos_2sigma_400mic <- GOF_fitness_collapsed_by_pos_400mic %>%
  filter(numpoints >= (mean(GOF_fitness_collapsed_by_pos_400mic$numpoints) +
                         2*sd(GOF_fitness_collapsed_by_pos_400mic$numpoints)))

print(GOF_fitness_collapsed_by_pos_2sigma_400mic)
```
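
The cutoff used above is the mean plus two sample standard deviations of the per-position mutant counts. A small worked sketch with hypothetical counts (not the real data; all `toy_` names are illustrative) shows how the threshold is derived and applied:

```r
# Hypothetical per-position GOF mutant counts
toy_counts <- data.frame(position  = 1:10,
                         numpoints = c(1, 2, 1, 3, 2, 1, 2, 1, 2, 30))

# Cutoff = mean + 2 * sample standard deviation of the counts
toy_cutoff <- mean(toy_counts$numpoints) + 2 * sd(toy_counts$numpoints)

# Keep positions at or above the cutoff
toy_sig <- toy_counts[toy_counts$numpoints >= toy_cutoff, ]

print(toy_cutoff)
print(toy_sig)
```

In this toy case only position 10 passes. Note that the extreme count also inflates the standard deviation, so the cutoff itself moves with the outliers it is meant to detect.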

Calculate summary statistics for each amino acid at each position:
```{r}
GOF_fitness_collapsed_all_400mic <- GOF_fitness_map_400mic %>%
  filter(position > 0) %>%
  group_by(position, aa) %>%
  summarise(fitval=median(fitness),
            numpoints=n(),
            stdfit=sd(fitness),
            numortho=length(unique(ID)))

gof_aa_dim <- nrow(aminoacids)
gof_ref_len <- nrow(ecoli_map)
```

```{r warning=FALSE}
#these matrices have the fitness/num/sd for each aa at each position:
gof_matrix = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_num = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_sd = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)
gof_matrix_numortho = matrix(rep(NA, gof_ref_len*gof_aa_dim),nrow=gof_aa_dim,ncol=gof_ref_len)

#populate matrix
for (i in seq_len(nrow(GOF_fitness_collapsed_all_400mic))){
  
  # row index = amino acid, column index = position
  aa_row <- which(aminoacids$aa==as.character(GOF_fitness_collapsed_all_400mic$aa[i]))
  pos_col <- GOF_fitness_collapsed_all_400mic$position[i]
  
  gof_matrix[aa_row,pos_col] <- as.numeric(GOF_fitness_collapsed_all_400mic$fitval[i])
  gof_matrix_num[aa_row,pos_col] <- as.numeric(GOF_fitness_collapsed_all_400mic$numpoints[i])
  gof_matrix_sd[aa_row,pos_col] <- as.numeric(GOF_fitness_collapsed_all_400mic$stdfit[i])
  gof_matrix_numortho[aa_row,pos_col] <- as.numeric(GOF_fitness_collapsed_all_400mic$numortho[i])
}

rownames(gof_matrix)<-aminoacids$aa
colnames(gof_matrix)<-c(1:gof_ref_len)
rownames(gof_matrix_num)<-aminoacids$aa
colnames(gof_matrix_num)<-c(1:gof_ref_len)
rownames(gof_matrix_sd)<-aminoacids$aa
colnames(gof_matrix_sd)<-c(1:gof_ref_len)
rownames(gof_matrix_numortho)<-aminoacids$aa
colnames(gof_matrix_numortho)<-c(1:gof_ref_len)

gof_matrix_melt <- melt(gof_matrix)
gof_matrix_num_melt <- melt(gof_matrix_num)
gof_matrix_sd_melt <- melt(gof_matrix_sd)
gof_matrix_numortho_melt <- melt(gof_matrix_numortho)

# Rename columns to "X1" and "X2"
names(gof_matrix_melt)[names(gof_matrix_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_num_melt)[names(gof_matrix_num_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_sd_melt)[names(gof_matrix_sd_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")
names(gof_matrix_numortho_melt)[names(gof_matrix_numortho_melt) %in% c("Var1", "Var2")] <- c("X1", "X2")

gof_matrix_melt_only_GOFpos <- gof_matrix_melt %>%
  filter(X2 == 35)

gof_matrix_melt_only_GOFpos$mutposnum <- 0
gof_matrix_melt_only_GOFpos$mutposnum[which(gof_matrix_melt_only_GOFpos$X2==35)] <- 1

# Map each amino acid to its y-axis index (1 = K ... 21 = X); unmatched values get 0
gof_aa_order <- c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X")
gof_matrix_melt_only_GOFpos$aanum <- match(as.character(gof_matrix_melt_only_GOFpos$X1),
                                           gof_aa_order, nomatch = 0)

gof_matrix_melt_only_GOFpos_wnum <- gof_matrix_melt_only_GOFpos %>%
  inner_join(gof_matrix_num_melt,by=c("X1","X2")) %>%
  dplyr::rename(mutnum=value.y,value=value.x)
```
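
The loop above fills one matrix cell at a time. As an alternative sketch (not the approach used in this notebook), base R's two-column matrix indexing can assign all cells in one step. The names `toy_aa`, `toy_summary`, and `toy_matrix` are hypothetical stand-ins for `aminoacids$aa`, `GOF_fitness_collapsed_all_400mic`, and `gof_matrix`.

```r
# Hypothetical stand-ins: 3 amino acids, 4 reference positions
toy_aa <- c("K", "R", "T")
toy_summary <- data.frame(position = c(2, 4, 4),
                          aa       = c("T", "K", "T"),
                          fitval   = c(1.5, 0.7, 2.2))

# Empty amino-acid-by-position matrix with named rows and columns
toy_matrix <- matrix(NA_real_, nrow = length(toy_aa), ncol = 4,
                     dimnames = list(toy_aa, as.character(1:4)))

# Each row of idx is a (row, column) pair; one assignment fills all pairs
idx <- cbind(match(toy_summary$aa, toy_aa), toy_summary$position)
toy_matrix[idx] <- toy_summary$fitval

print(toy_matrix)
```

Cells with no observed mutant stay `NA`, which is what the heatmap later renders in grey via `na.value`.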

#### GOF Position Plot

Plot the median fitness of each GOF mutation at the significant position (position 35), labeled with the number of mutants observed for each amino acid:
```{r}
# Define the order of amino acids for the rectangles
rect_order <- c("T")

# Create a data frame for the rectangles
rect_data <- data.frame(
  aanum = match(rect_order, c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X")),
  xmin = seq(0.5, by = 1, length.out = length(rect_order)),
  xmax = seq(1.5, by = 1, length.out = length(rect_order)))

#plot the data from all mutants:
GOF_fit_nummut_plot_400mic <- ggplot(gof_matrix_melt_only_GOFpos_wnum, 
       aes(x=mutposnum, y=aanum,
           fill=value,
           label=mutnum)) +
  geom_tile() +
  geom_text() +
  # Add black rectangles
  geom_rect(data = rect_data,
           aes(xmin = xmin, xmax = xmax, ymin = aanum - 0.5, ymax = aanum + 0.5),
           fill = NA, color = "black", inherit.aes = FALSE) +
  labs(x = "Position (aa)",
       y ="Amino acid",color="") +
  scale_fill_gradient(low = "blue", 
                      high = "red",
                      name="Fitness",
                      na.value="grey",
                      limits = c(0, max(gof_matrix_melt_only_GOFpos_wnum$value, na.rm = TRUE))) +
  theme_minimal()+
  scale_x_continuous(name="Position (aa)",
                     breaks=c(1),
                     labels=c("35"))+
  scale_y_continuous(name="Amino acid", 
                     breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21),
                     labels=c("K","R","H","E","D","N","Q","T","S","C","G","A","V","L","I","M","P","Y","F","W","X"))

print(GOF_fit_nummut_plot_400mic)
```
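
The plot above is drawn for a single position (35), which is set by hand in the filtering step and in the x-axis labels. If more than one position were of interest, the column index could instead be derived from a vector of positions. The sketch below uses hypothetical data and names (`toy_sig_positions`, `toy_melt`, `toy_sel`) and is not part of the analysis.

```r
# Hypothetical positions of interest and a toy melted matrix
toy_sig_positions <- c(35, 98)
toy_melt <- data.frame(X1    = c("K", "T", "K", "T", "K"),
                       X2    = c(35, 35, 98, 98, 10),
                       value = c(0.4, 1.2, NA, 0.9, 0.3))

# Keep only the positions of interest
toy_sel <- toy_melt[toy_melt$X2 %in% toy_sig_positions, ]

# Column index on the x-axis: 1 for the first position, 2 for the second, ...
toy_sel$mutposnum <- match(toy_sel$X2, toy_sig_positions)

print(toy_sel)
```

With this layout, the x-axis could be labeled using `breaks = seq_along(toy_sig_positions)` and `labels = toy_sig_positions` in `scale_x_continuous()`.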

```{r echo=FALSE}
ggsave(filename = "GOF/PLOTS/400xMIC/GOF_400xMIC.Fitness.by.Number.Mutants.png", 
       plot = GOF_fit_nummut_plot_400mic,
       width = 4.5, height = 4.5, units = "in")
```

# Reproducibility

The session information is provided for full reproducibility.
```{r}
devtools::session_info()
```