Firstly, CHOPCHOP and CRISPOR platforms were used to identify multiple top sgRNA sequences targeting critical regions of the BRCA2 and PIK3CA genes for CRISPR-Cas9-mediated knockout. These tools were employed to rank potential sgRNAs based on critical parameters such as specificity, predicted cutting efficiency, and off-target risk. The top three results were selected from both genes and further analysed in R studio to refine the selection process, where efficiency predictions, off-target profiles, and GC content were comprehensively compared.

# Install the conflicted package if it's not already installed
if (!requireNamespace("conflicted", quietly = TRUE)) {
  install.packages("conflicted")
}

# Install and Load necessary packages
# Install Bioconductor if not installed already
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")

}
install.packages("seqinr")
## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("GGally")
## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
# Force reinstallation of Biostrings
BiocManager::install("Biostrings", force = TRUE)
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
##     CRAN: https://cloud.r-project.org
## Bioconductor version 3.16 (BiocManager 1.30.25), R 4.2.2 Patched (2022-11-10
##   r83330)
## Installing package(s) 'Biostrings'
## Installation paths not writeable, unable to update packages
##   path: /usr/lib/R/library
##   packages:
##     boot, class, cluster, codetools, foreign, KernSmooth, lattice, mgcv, nlme,
##     nnet, rpart, spatial, survival
# Load necessary packages
library(Biostrings)   # For DNA sequence handling
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
library(tidyverse)    # Includes ggplot2 and dplyr for data manipulation and plotting
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%within%() masks IRanges::%within%()
## ✖ dplyr::collapse()     masks Biostrings::collapse(), IRanges::collapse()
## ✖ dplyr::combine()      masks BiocGenerics::combine()
## ✖ purrr::compact()      masks XVector::compact()
## ✖ dplyr::desc()         masks IRanges::desc()
## ✖ tidyr::expand()       masks S4Vectors::expand()
## ✖ dplyr::filter()       masks stats::filter()
## ✖ dplyr::first()        masks S4Vectors::first()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce()       masks IRanges::reduce()
## ✖ dplyr::rename()       masks S4Vectors::rename()
## ✖ lubridate::second()   masks S4Vectors::second()
## ✖ lubridate::second<-() masks S4Vectors::second<-()
## ✖ dplyr::slice()        masks XVector::slice(), IRanges::slice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(conflicted)    # Resolve conflicts explicitly
library(seqinr)        # For sequence analysis
library(GGally)        # For advanced visualizations
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# Set preference for conflicted functions
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
# Read the CSV file into a DataFrame
file_path <- "Project_1_BRCA2_PIK3CA.csv"
df <- read_csv(file_path)
## Rows: 6 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Gene, Coordinate, sgRNA Sequence, PAM, Off-Target Summary, GC Conte...
## dbl (3): Cutting Efficiency, Specificity Score, Number of Off-Targets
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the DataFrame to ensure it's loaded correctly
head(df)
## # A tibble: 6 × 10
##   Gene   Coordinate               `sgRNA Sequence`    PAM   `Cutting Efficiency`
##   <chr>  <chr>                    <chr>               <chr>                <dbl>
## 1 BRCA2  chr13:32398926-32398948  CGTTTTGCCCGATTCCGT… NGG                   0.38
## 2 BRCA2  chr13:32387964-32387986  ACCGGTCTCCCGCGTCTT… NGG                   0.34
## 3 BRCA2  chr13:32387945-32387967  CGGTAGCGGCCCGAACGG… NGG                   0.53
## 4 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG                   0.55
## 5 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG                   0.58
## 6 PIK3CA chr3:179148598-179148620 GTCCGCGCTCTACTCACG… NNG                   0.44
## # ℹ 5 more variables: `Specificity Score` <dbl>, `Number of Off-Targets` <dbl>,
## #   `Off-Target Summary` <chr>, `GC Content` <chr>, Annotation <chr>
# Create a new column that combines the gene name with the index so we can identify the different sequences. 
df <- df %>%
  mutate(Gene_Index = paste(Gene, row_number() - 1, sep = "_")) 
# Plot a bar graph of Number of Off-Targets for each gene sgRNA combination
ggplot(df, aes(x = Gene_Index, y = `Number of Off-Targets`, fill = Gene_Index)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = scales::hue_pal()(length(unique(df$Gene_Index)))) + # Customize colors
  labs(title = "Number of Off-Targets for Each Gene sgRNA Combination",
       x = "Gene (with Index)",
       y = "Number of Off-Targets") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x-axis labels for readability

Figure 1 illustrates the number of off-target sites for each sgRNA combination targeting the BRCA2 and PIK3CA genes. The X-axis represents the gene and the specific sgRNA index (e.g., BRCA2_0, BRCA2_1, PIK3CA_3), where the indices correspond to individual sgRNA sequences evaluated for each gene. The Y-axis indicates the total number of off-target sites identified for each sgRNA, as determined through computational analysis. Different colours correspond to each gene-sgRNA pair, visually representing the relative off-target risks. The data highlights that BRCA2_2 has the highest number of off-targets, whereas BRCA2_0 and PIK3CA_3 exhibit lower off-target counts, suggesting they may be more suitable for precise genome editing.

# Calculate GC content and create a specificity score placeholder
df <- df %>%
  mutate(
    GC_content = str_count(`sgRNA Sequence`, "G|C") / str_length(`sgRNA Sequence`),
    Specificity_score = 1 / `Number of Off-Targets`  # Assuming higher specificity with fewer off-targets
  )

# Preview the modified dataset
head(df)
## # A tibble: 6 × 13
##   Gene   Coordinate               `sgRNA Sequence`    PAM   `Cutting Efficiency`
##   <chr>  <chr>                    <chr>               <chr>                <dbl>
## 1 BRCA2  chr13:32398926-32398948  CGTTTTGCCCGATTCCGT… NGG                   0.38
## 2 BRCA2  chr13:32387964-32387986  ACCGGTCTCCCGCGTCTT… NGG                   0.34
## 3 BRCA2  chr13:32387945-32387967  CGGTAGCGGCCCGAACGG… NGG                   0.53
## 4 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG                   0.55
## 5 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG                   0.58
## 6 PIK3CA chr3:179148598-179148620 GTCCGCGCTCTACTCACG… NNG                   0.44
## # ℹ 8 more variables: `Specificity Score` <dbl>, `Number of Off-Targets` <dbl>,
## #   `Off-Target Summary` <chr>, `GC Content` <chr>, Annotation <chr>,
## #   Gene_Index <chr>, GC_content <dbl>, Specificity_score <dbl>
# Rank sgRNAs by efficiency and specificity for each gene
ranked_sgRNA <- df %>%
  group_by(Gene) %>%
  arrange(desc(`Cutting Efficiency`), Specificity_score) %>%
  dplyr::slice(1)  # Select the top sgRNA per gene

# Display the ranked sgRNAs
ranked_sgRNA
## # A tibble: 2 × 13
## # Groups:   Gene [2]
##   Gene   Coordinate               `sgRNA Sequence`    PAM   `Cutting Efficiency`
##   <chr>  <chr>                    <chr>               <chr>                <dbl>
## 1 BRCA2  chr13:32387945-32387967  CGGTAGCGGCCCGAACGG… NGG                   0.53
## 2 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG                   0.58
## # ℹ 8 more variables: `Specificity Score` <dbl>, `Number of Off-Targets` <dbl>,
## #   `Off-Target Summary` <chr>, `GC Content` <chr>, Annotation <chr>,
## #   Gene_Index <chr>, GC_content <dbl>, Specificity_score <dbl>
# Split the 'Off-Target Summary' column into separate columns for mismatches
df <- df %>%
  separate(`Off-Target Summary`, into = c("Mismatches_2", "Mismatches_3", "Mismatches_4"), sep = "\\|") %>%
  mutate(
    Mismatches_2 = as.numeric(gsub("2:", "", Mismatches_2)),
    Mismatches_3 = as.numeric(gsub("3:", "", Mismatches_3)),
    Mismatches_4 = as.numeric(gsub("4:", "", Mismatches_4))
  )

# Check the structure of the modified DataFrame
head(df)
## # A tibble: 6 × 15
##   Gene   Coordinate               `sgRNA Sequence`    PAM   `Cutting Efficiency`
##   <chr>  <chr>                    <chr>               <chr>                <dbl>
## 1 BRCA2  chr13:32398926-32398948  CGTTTTGCCCGATTCCGT… NGG                   0.38
## 2 BRCA2  chr13:32387964-32387986  ACCGGTCTCCCGCGTCTT… NGG                   0.34
## 3 BRCA2  chr13:32387945-32387967  CGGTAGCGGCCCGAACGG… NGG                   0.53
## 4 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG                   0.55
## 5 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG                   0.58
## 6 PIK3CA chr3:179148598-179148620 GTCCGCGCTCTACTCACG… NNG                   0.44
## # ℹ 10 more variables: `Specificity Score` <dbl>,
## #   `Number of Off-Targets` <dbl>, Mismatches_2 <dbl>, Mismatches_3 <dbl>,
## #   Mismatches_4 <dbl>, `GC Content` <chr>, Annotation <chr>, Gene_Index <chr>,
## #   GC_content <dbl>, Specificity_score <dbl>
# Identify the best sgRNA for each gene based on minimal off-target mismatches
best_sgRNA <- df %>%
  group_by(Gene) %>%
  arrange(Mismatches_2, Mismatches_3, Mismatches_4) %>%
  dplyr::slice(1)  # Select sgRNA with the fewest off-targets

# Display the best sgRNAs based on off-target mismatches
best_sgRNA
## # A tibble: 2 × 15
## # Groups:   Gene [2]
##   Gene   Coordinate               `sgRNA Sequence`    PAM   `Cutting Efficiency`
##   <chr>  <chr>                    <chr>               <chr>                <dbl>
## 1 BRCA2  chr13:32387945-32387967  CGGTAGCGGCCCGAACGG… NGG                   0.53
## 2 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG                   0.55
## # ℹ 10 more variables: `Specificity Score` <dbl>,
## #   `Number of Off-Targets` <dbl>, Mismatches_2 <dbl>, Mismatches_3 <dbl>,
## #   Mismatches_4 <dbl>, `GC Content` <chr>, Annotation <chr>, Gene_Index <chr>,
## #   GC_content <dbl>, Specificity_score <dbl>
# Calculate GC content and create a specificity score placeholder
df <- df %>%
  mutate(
    GC_content = str_count(`sgRNA Sequence`, "G|C") / str_length(`sgRNA Sequence`),
    Specificity_score = 1 / `Number of Off-Targets`  # Assuming higher specificity with fewer off-targets
  )

# Preview the modified dataset
head(df)
## # A tibble: 6 × 15
##   Gene   Coordinate               `sgRNA Sequence`    PAM   `Cutting Efficiency`
##   <chr>  <chr>                    <chr>               <chr>                <dbl>
## 1 BRCA2  chr13:32398926-32398948  CGTTTTGCCCGATTCCGT… NGG                   0.38
## 2 BRCA2  chr13:32387964-32387986  ACCGGTCTCCCGCGTCTT… NGG                   0.34
## 3 BRCA2  chr13:32387945-32387967  CGGTAGCGGCCCGAACGG… NGG                   0.53
## 4 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG                   0.55
## 5 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG                   0.58
## 6 PIK3CA chr3:179148598-179148620 GTCCGCGCTCTACTCACG… NNG                   0.44
## # ℹ 10 more variables: `Specificity Score` <dbl>,
## #   `Number of Off-Targets` <dbl>, Mismatches_2 <dbl>, Mismatches_3 <dbl>,
## #   Mismatches_4 <dbl>, `GC Content` <chr>, Annotation <chr>, Gene_Index <chr>,
## #   GC_content <dbl>, Specificity_score <dbl>
# Load ggrepel for better label management
library(ggrepel)

# Scatter plot of GC content vs Specificity Score with Gene_Index labels
ggplot(df, aes(x = GC_content, y = Specificity_score, color = Gene)) +
  geom_point(size = 3) +
  geom_text_repel(aes(label = Gene_Index), size = 3, max.overlaps = Inf) +  # Add Gene_Index labels to points
  labs(
    title = "GC Content vs Specificity Score for sgRNAs",
    x = "GC Content",
    y = "Specificity Score"
  ) +
  scale_color_manual(values = scales::hue_pal()(length(unique(df$Gene)))) + # Different colors for each gene
  theme_minimal()

Figure 2 shows a scatterplot of the relationship between GC content and specificity scores for sgRNAs targeting the BRCA2 and PIK3CA genes. The X-axis represents the GC content of each sgRNA sequence, which affects stability and efficiency in CRISPR experiments. The Y-axis displays the specificity score, which indicates how precisely the sgRNA targets the intended sequence while minimizing off-target effects. Points are colour-coded by gene, with red representing BRCA2 sgRNAs and blue representing PIK3CA sgRNAs. Notably, PIK3CA_3 exhibits a high GC content and a strong specificity score, making it a promising candidate, while BRCA2_2 also shows high GC content but lower specificity.

# Rank sgRNAs by Cutting Efficiency for all gene indexes in descending order
df_sorted <- df %>%
  arrange(desc(`Cutting Efficiency`))

# Display the sorted DataFrame
df_sorted
## # A tibble: 6 × 15
##   Gene   Coordinate               `sgRNA Sequence`    PAM   `Cutting Efficiency`
##   <chr>  <chr>                    <chr>               <chr>                <dbl>
## 1 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG                   0.58
## 2 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG                   0.55
## 3 BRCA2  chr13:32387945-32387967  CGGTAGCGGCCCGAACGG… NGG                   0.53
## 4 PIK3CA chr3:179148598-179148620 GTCCGCGCTCTACTCACG… NNG                   0.44
## 5 BRCA2  chr13:32398926-32398948  CGTTTTGCCCGATTCCGT… NGG                   0.38
## 6 BRCA2  chr13:32387964-32387986  ACCGGTCTCCCGCGTCTT… NGG                   0.34
## # ℹ 10 more variables: `Specificity Score` <dbl>,
## #   `Number of Off-Targets` <dbl>, Mismatches_2 <dbl>, Mismatches_3 <dbl>,
## #   Mismatches_4 <dbl>, `GC Content` <chr>, Annotation <chr>, Gene_Index <chr>,
## #   GC_content <dbl>, Specificity_score <dbl>
# Load the ggplot2 library for plotting
library(ggplot2)

# Create a lollipop chart showing Cutting Efficiency for each Gene_Index
ggplot(df_sorted, aes(x = reorder(Gene_Index, `Cutting Efficiency`), y = `Cutting Efficiency`, color = Gene)) +
  geom_segment(aes(xend = reorder(Gene_Index, `Cutting Efficiency`), yend = 0), size = 1) + # Line from x-axis to point
  geom_point(size = 4) + # Points for each Cutting Efficiency value
  labs(
    title = "Ranking of sgRNAs by Cutting Efficiency",
    x = "Gene Index (Ranked by Cutting Efficiency)",
    y = "Cutting Efficiency"
  ) +
  coord_flip() + # Flip coordinates for better readability
  theme_minimal() + # Minimal theme for clarity
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + # Rotate x-axis labels if needed
  scale_color_manual(values = scales::hue_pal()(length(unique(df_sorted$Gene)))) # Unique colors for each Gene
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Figure 3 bar plot displays the ranking of sgRNAs based on their cutting efficiency for the BRCA2 and PIK3CA genes. The X-axis represents the cutting efficiency score, which measures the likelihood of successful cleavage at the target site by CRISPR-Cas9. The Y-axis shows the sgRNA indices for each gene, ranked by their cutting efficiency. Points are color-coded by gene, with red indicating BRCA2 sgRNAs and blue indicating PIK3CA sgRNAs. Among the sgRNAs, PIK3CA_4 and PIK3CA_3 exhibit the highest cutting efficiency, close to 0.6, making them strong candidates for precise genome editing. In contrast, BRCA2_1 and BRCA2_0 show lower cutting efficiencies, suggesting they may be less effective.

# Create a lollipop chart showing Number of Off-Targets for each Gene_Index
ggplot(df_sorted, aes(x = reorder(Gene_Index, `Number of Off-Targets`), y = `Number of Off-Targets`, color = Gene)) +
  geom_segment(aes(xend = reorder(Gene_Index, `Number of Off-Targets`), yend = 0), size = 1) + # Line from x-axis to point
  geom_point(size = 4) + # Points for each Number of Off-Targets value
  labs(
    title = "Ranking of sgRNAs by Number of Off-Targets",
    x = "Gene Index (Ranked by Off-Targets)",
    y = "Number of Off-Targets"
  ) +
  coord_flip() + # Flip coordinates for better readability
  theme_minimal() + # Minimal theme for clarity
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + # Rotate x-axis labels if needed
  scale_color_manual(values = scales::hue_pal()(length(unique(df_sorted$Gene)))) # Unique colors for each Gene

Figure 4 bar plot ranks sgRNAs by the number of off-target sites for the BRCA2 and PIK3CA genes. The X-axis represents the number of off-target sites identified for each sgRNA, which indicates potential unintended edits elsewhere in the genome. The Y-axis shows the sgRNA indices for each gene, ranked by their off-target counts. Points are colour-coded by gene, with red representing BRCA2 sgRNAs and blue representing PIK3CA sgRNAs. BRCA2_2 exhibits the highest number of off-targets, exceeding 30, while BRCA2_0 and PIK3CA_3 show the fewest off-targets, around 20, making them more favourable for precise genome editing.

# Efficiency prediction visualization
# This will create a bar plot to show the predicted efficiency for each sgRNA
ggplot(df, aes(x = reorder(Gene_Index, `Cutting Efficiency`), y = `Cutting Efficiency`, fill = Gene)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Predicted Efficiency for Each sgRNA",
    x = "sgRNA (Gene Index)",
    y = "Predicted Efficiency"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + # Rotate x-axis labels for readability
  scale_fill_viridis_d() # Use the viridis color palette

Figure 5 bar plot shows the predicted efficiency of sgRNAs targeting the BRCA2 and PIK3CA genes. The X-axis represents the sgRNA indices for each gene, and the Y-axis indicates the predicted cutting efficiency, which reflects the likelihood of successful DNA cleavage at the target site. Bars are colour-coded by gene, purple representing BRCA2 sgRNAs and yellow representing PIK3CA sgRNAs. PIK3CA_4 and PIK3CA_3 exhibit the highest predicted efficiencies, close to 0.6, making them strong candidates for CRISPR-Cas9 experiments. BRCA2_2 also shows relatively high efficiency compared to BRCA2_1 and BRCA2_0, which have lower predicted efficiencies.

# Compare Off-Target Risks and Efficiency Together
# This scatter plot will visualize the trade-off between off-target risk and efficiency
ggplot(df, aes(x = `Cutting Efficiency`, y = `Number of Off-Targets`, color = Gene_Index)) +
  geom_point(size = 3) +
  labs(
    title = "Off-Target Risk vs. Efficiency for sgRNAs",
    x = "Predicted Efficiency",
    y = "Off-Target Risk (Number of Off-Targets)"
  ) +
  scale_color_viridis_d(option = "D") + # Use rainbow-like color palette
  theme_minimal() +
  theme(legend.position = "bottom") # Place legend at the bottom for better visualization

Figure 6 scatter plot illustrates the relationship between off-target risk and predicted efficiency for sgRNAs targeting the BRCA2 and PIK3CA genes. The X-axis represents the predicted efficiency of each sgRNA, indicating the likelihood of successful DNA cleavage. The Y-axis shows the off-target risk, measured as the number of off-target sites identified for each sgRNA. Points are colour-coded by sgRNA index and gene, with shades of purple representing BRCA2 sgRNAs and green representing PIK3CA sgRNAs. Notably, BRCA2_0 exhibits a low off-target risk (~20 sites) and relatively low efficiency, while PIK3CA_4 and PIK3CA_3 demonstrate high predicted efficiencies (~0.5–0.6) with moderate off-target risks. BRCA2_2 has the highest off-target risk (~35 sites) despite moderate efficiency, suggesting potential trade-offs between precision and effectiveness.