Firstly, CHOPCHOP and CRISPOR platforms were used to identify
multiple top sgRNA sequences targeting critical regions of the BRCA2 and
PIK3CA genes for CRISPR-Cas9-mediated knockout. These tools were
employed to rank potential sgRNAs based on critical parameters such as
specificity, predicted cutting efficiency, and off-target risk. The top
three results were selected from both genes and further analysed in R
studio to refine the selection process, where efficiency predictions,
off-target profiles, and GC content were comprehensively compared.
# Install the conflicted package if it's not already installed
if (!requireNamespace("conflicted", quietly = TRUE)) {
install.packages("conflicted")
}
# Install and Load necessary packages
# Install Bioconductor if not installed already
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
install.packages("seqinr")
## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("GGally")
## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
# Force reinstallation of Biostrings
BiocManager::install("Biostrings", force = TRUE)
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
## CRAN: https://cloud.r-project.org
## Bioconductor version 3.16 (BiocManager 1.30.25), R 4.2.2 Patched (2022-11-10
## r83330)
## Installing package(s) 'Biostrings'
## Installation paths not writeable, unable to update packages
## path: /usr/lib/R/library
## packages:
## boot, class, cluster, codetools, foreign, KernSmooth, lattice, mgcv, nlme,
## nnet, rpart, spatial, survival
# Load necessary packages
library(Biostrings) # For DNA sequence handling
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
library(tidyverse) # Includes ggplot2 and dplyr for data manipulation and plotting
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%within%() masks IRanges::%within%()
## ✖ dplyr::collapse() masks Biostrings::collapse(), IRanges::collapse()
## ✖ dplyr::combine() masks BiocGenerics::combine()
## ✖ purrr::compact() masks XVector::compact()
## ✖ dplyr::desc() masks IRanges::desc()
## ✖ tidyr::expand() masks S4Vectors::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks S4Vectors::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce() masks IRanges::reduce()
## ✖ dplyr::rename() masks S4Vectors::rename()
## ✖ lubridate::second() masks S4Vectors::second()
## ✖ lubridate::second<-() masks S4Vectors::second<-()
## ✖ dplyr::slice() masks XVector::slice(), IRanges::slice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(conflicted) # Resolve conflicts explicitly
library(seqinr) # For sequence analysis
library(GGally) # For advanced visualizations
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Set preference for conflicted functions
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
# Read the CSV file into a DataFrame
file_path <- "Project_1_BRCA2_PIK3CA.csv"
df <- read_csv(file_path)
## Rows: 6 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Gene, Coordinate, sgRNA Sequence, PAM, Off-Target Summary, GC Conte...
## dbl (3): Cutting Efficiency, Specificity Score, Number of Off-Targets
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the DataFrame to ensure it's loaded correctly
head(df)
## # A tibble: 6 × 10
## Gene Coordinate `sgRNA Sequence` PAM `Cutting Efficiency`
## <chr> <chr> <chr> <chr> <dbl>
## 1 BRCA2 chr13:32398926-32398948 CGTTTTGCCCGATTCCGT… NGG 0.38
## 2 BRCA2 chr13:32387964-32387986 ACCGGTCTCCCGCGTCTT… NGG 0.34
## 3 BRCA2 chr13:32387945-32387967 CGGTAGCGGCCCGAACGG… NGG 0.53
## 4 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG 0.55
## 5 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG 0.58
## 6 PIK3CA chr3:179148598-179148620 GTCCGCGCTCTACTCACG… NNG 0.44
## # ℹ 5 more variables: `Specificity Score` <dbl>, `Number of Off-Targets` <dbl>,
## # `Off-Target Summary` <chr>, `GC Content` <chr>, Annotation <chr>
# Create a new column that combines the gene name with the index so we can identify the different sequences.
df <- df %>%
mutate(Gene_Index = paste(Gene, row_number() - 1, sep = "_"))
# Plot a bar graph of Number of Off-Targets for each gene sgRNA combination
ggplot(df, aes(x = Gene_Index, y = `Number of Off-Targets`, fill = Gene_Index)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = scales::hue_pal()(length(unique(df$Gene_Index)))) + # Customize colors
labs(title = "Number of Off-Targets for Each Gene sgRNA Combination",
x = "Gene (with Index)",
y = "Number of Off-Targets") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x-axis labels for readability

Figure 1 illustrates the number of off-target sites for each sgRNA
combination targeting the BRCA2 and PIK3CA genes. The X-axis represents
the gene and the specific sgRNA index (e.g., BRCA2_0, BRCA2_1,
PIK3CA_3), where the indices correspond to individual sgRNA sequences
evaluated for each gene. The Y-axis indicates the total number of
off-target sites identified for each sgRNA, as determined through
computational analysis. Different colours correspond to each gene-sgRNA
pair, visually representing the relative off-target risks. The data
highlights that BRCA2_2 has the highest number of off-targets, whereas
BRCA2_0 and PIK3CA_3 exhibit lower off-target counts, suggesting they
may be more suitable for precise genome editing.
# Calculate GC content and create a specificity score placeholder
df <- df %>%
mutate(
GC_content = str_count(`sgRNA Sequence`, "G|C") / str_length(`sgRNA Sequence`),
Specificity_score = 1 / `Number of Off-Targets` # Assuming higher specificity with fewer off-targets
)
# Preview the modified dataset
head(df)
## # A tibble: 6 × 13
## Gene Coordinate `sgRNA Sequence` PAM `Cutting Efficiency`
## <chr> <chr> <chr> <chr> <dbl>
## 1 BRCA2 chr13:32398926-32398948 CGTTTTGCCCGATTCCGT… NGG 0.38
## 2 BRCA2 chr13:32387964-32387986 ACCGGTCTCCCGCGTCTT… NGG 0.34
## 3 BRCA2 chr13:32387945-32387967 CGGTAGCGGCCCGAACGG… NGG 0.53
## 4 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG 0.55
## 5 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG 0.58
## 6 PIK3CA chr3:179148598-179148620 GTCCGCGCTCTACTCACG… NNG 0.44
## # ℹ 8 more variables: `Specificity Score` <dbl>, `Number of Off-Targets` <dbl>,
## # `Off-Target Summary` <chr>, `GC Content` <chr>, Annotation <chr>,
## # Gene_Index <chr>, GC_content <dbl>, Specificity_score <dbl>
# Rank sgRNAs by efficiency and specificity for each gene
ranked_sgRNA <- df %>%
group_by(Gene) %>%
arrange(desc(`Cutting Efficiency`), Specificity_score) %>%
dplyr::slice(1) # Select the top sgRNA per gene
# Display the ranked sgRNAs
ranked_sgRNA
## # A tibble: 2 × 13
## # Groups: Gene [2]
## Gene Coordinate `sgRNA Sequence` PAM `Cutting Efficiency`
## <chr> <chr> <chr> <chr> <dbl>
## 1 BRCA2 chr13:32387945-32387967 CGGTAGCGGCCCGAACGG… NGG 0.53
## 2 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG 0.58
## # ℹ 8 more variables: `Specificity Score` <dbl>, `Number of Off-Targets` <dbl>,
## # `Off-Target Summary` <chr>, `GC Content` <chr>, Annotation <chr>,
## # Gene_Index <chr>, GC_content <dbl>, Specificity_score <dbl>
# Split the 'Off-Target Summary' column into separate columns for mismatches
df <- df %>%
separate(`Off-Target Summary`, into = c("Mismatches_2", "Mismatches_3", "Mismatches_4"), sep = "\\|") %>%
mutate(
Mismatches_2 = as.numeric(gsub("2:", "", Mismatches_2)),
Mismatches_3 = as.numeric(gsub("3:", "", Mismatches_3)),
Mismatches_4 = as.numeric(gsub("4:", "", Mismatches_4))
)
# Check the structure of the modified DataFrame
head(df)
## # A tibble: 6 × 15
## Gene Coordinate `sgRNA Sequence` PAM `Cutting Efficiency`
## <chr> <chr> <chr> <chr> <dbl>
## 1 BRCA2 chr13:32398926-32398948 CGTTTTGCCCGATTCCGT… NGG 0.38
## 2 BRCA2 chr13:32387964-32387986 ACCGGTCTCCCGCGTCTT… NGG 0.34
## 3 BRCA2 chr13:32387945-32387967 CGGTAGCGGCCCGAACGG… NGG 0.53
## 4 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG 0.55
## 5 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG 0.58
## 6 PIK3CA chr3:179148598-179148620 GTCCGCGCTCTACTCACG… NNG 0.44
## # ℹ 10 more variables: `Specificity Score` <dbl>,
## # `Number of Off-Targets` <dbl>, Mismatches_2 <dbl>, Mismatches_3 <dbl>,
## # Mismatches_4 <dbl>, `GC Content` <chr>, Annotation <chr>, Gene_Index <chr>,
## # GC_content <dbl>, Specificity_score <dbl>
# Identify the best sgRNA for each gene based on minimal off-target mismatches
best_sgRNA <- df %>%
group_by(Gene) %>%
arrange(Mismatches_2, Mismatches_3, Mismatches_4) %>%
dplyr::slice(1) # Select sgRNA with the fewest off-targets
# Display the best sgRNAs based on off-target mismatches
best_sgRNA
## # A tibble: 2 × 15
## # Groups: Gene [2]
## Gene Coordinate `sgRNA Sequence` PAM `Cutting Efficiency`
## <chr> <chr> <chr> <chr> <dbl>
## 1 BRCA2 chr13:32387945-32387967 CGGTAGCGGCCCGAACGG… NGG 0.53
## 2 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG 0.55
## # ℹ 10 more variables: `Specificity Score` <dbl>,
## # `Number of Off-Targets` <dbl>, Mismatches_2 <dbl>, Mismatches_3 <dbl>,
## # Mismatches_4 <dbl>, `GC Content` <chr>, Annotation <chr>, Gene_Index <chr>,
## # GC_content <dbl>, Specificity_score <dbl>
# Calculate GC content and create a specificity score placeholder
df <- df %>%
mutate(
GC_content = str_count(`sgRNA Sequence`, "G|C") / str_length(`sgRNA Sequence`),
Specificity_score = 1 / `Number of Off-Targets` # Assuming higher specificity with fewer off-targets
)
# Preview the modified dataset
head(df)
## # A tibble: 6 × 15
## Gene Coordinate `sgRNA Sequence` PAM `Cutting Efficiency`
## <chr> <chr> <chr> <chr> <dbl>
## 1 BRCA2 chr13:32398926-32398948 CGTTTTGCCCGATTCCGT… NGG 0.38
## 2 BRCA2 chr13:32387964-32387986 ACCGGTCTCCCGCGTCTT… NGG 0.34
## 3 BRCA2 chr13:32387945-32387967 CGGTAGCGGCCCGAACGG… NGG 0.53
## 4 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG 0.55
## 5 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG 0.58
## 6 PIK3CA chr3:179148598-179148620 GTCCGCGCTCTACTCACG… NNG 0.44
## # ℹ 10 more variables: `Specificity Score` <dbl>,
## # `Number of Off-Targets` <dbl>, Mismatches_2 <dbl>, Mismatches_3 <dbl>,
## # Mismatches_4 <dbl>, `GC Content` <chr>, Annotation <chr>, Gene_Index <chr>,
## # GC_content <dbl>, Specificity_score <dbl>
# Load ggrepel for better label management
library(ggrepel)
# Scatter plot of GC content vs Specificity Score with Gene_Index labels
ggplot(df, aes(x = GC_content, y = Specificity_score, color = Gene)) +
geom_point(size = 3) +
geom_text_repel(aes(label = Gene_Index), size = 3, max.overlaps = Inf) + # Add Gene_Index labels to points
labs(
title = "GC Content vs Specificity Score for sgRNAs",
x = "GC Content",
y = "Specificity Score"
) +
scale_color_manual(values = scales::hue_pal()(length(unique(df$Gene)))) + # Different colors for each gene
theme_minimal()

Figure 2 shows a scatterplot of the relationship between GC content
and specificity scores for sgRNAs targeting the BRCA2 and PIK3CA genes.
The X-axis represents the GC content of each sgRNA sequence, which
affects stability and efficiency in CRISPR experiments. The Y-axis
displays the specificity score, which indicates how precisely the sgRNA
targets the intended sequence while minimizing off-target effects.
Points are colour-coded by gene, with red representing BRCA2 sgRNAs and
blue representing PIK3CA sgRNAs. Notably, PIK3CA_3 exhibits a high GC
content and a strong specificity score, making it a promising candidate,
while BRCA2_2 also shows high GC content but lower specificity.
# Rank sgRNAs by Cutting Efficiency for all gene indexes in descending order
df_sorted <- df %>%
arrange(desc(`Cutting Efficiency`))
# Display the sorted DataFrame
df_sorted
## # A tibble: 6 × 15
## Gene Coordinate `sgRNA Sequence` PAM `Cutting Efficiency`
## <chr> <chr> <chr> <chr> <dbl>
## 1 PIK3CA chr3:179148561-179148583 ACCCGATGCGGTTAGAGC… NGG 0.58
## 2 PIK3CA chr3:179148731-179148753 ATCGCAGGCGAACTCCGC… NGG 0.55
## 3 BRCA2 chr13:32387945-32387967 CGGTAGCGGCCCGAACGG… NGG 0.53
## 4 PIK3CA chr3:179148598-179148620 GTCCGCGCTCTACTCACG… NNG 0.44
## 5 BRCA2 chr13:32398926-32398948 CGTTTTGCCCGATTCCGT… NGG 0.38
## 6 BRCA2 chr13:32387964-32387986 ACCGGTCTCCCGCGTCTT… NGG 0.34
## # ℹ 10 more variables: `Specificity Score` <dbl>,
## # `Number of Off-Targets` <dbl>, Mismatches_2 <dbl>, Mismatches_3 <dbl>,
## # Mismatches_4 <dbl>, `GC Content` <chr>, Annotation <chr>, Gene_Index <chr>,
## # GC_content <dbl>, Specificity_score <dbl>
# Load the ggplot2 library for plotting
library(ggplot2)
# Create a lollipop chart showing Cutting Efficiency for each Gene_Index
ggplot(df_sorted, aes(x = reorder(Gene_Index, `Cutting Efficiency`), y = `Cutting Efficiency`, color = Gene)) +
geom_segment(aes(xend = reorder(Gene_Index, `Cutting Efficiency`), yend = 0), size = 1) + # Line from x-axis to point
geom_point(size = 4) + # Points for each Cutting Efficiency value
labs(
title = "Ranking of sgRNAs by Cutting Efficiency",
x = "Gene Index (Ranked by Cutting Efficiency)",
y = "Cutting Efficiency"
) +
coord_flip() + # Flip coordinates for better readability
theme_minimal() + # Minimal theme for clarity
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + # Rotate x-axis labels if needed
scale_color_manual(values = scales::hue_pal()(length(unique(df_sorted$Gene)))) # Unique colors for each Gene
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Figure 6 scatter plot illustrates the relationship between
off-target risk and predicted efficiency for sgRNAs targeting the BRCA2
and PIK3CA genes. The X-axis represents the predicted efficiency of each
sgRNA, indicating the likelihood of successful DNA cleavage. The Y-axis
shows the off-target risk, measured as the number of off-target sites
identified for each sgRNA. Points are colour-coded by sgRNA index and
gene, with shades of purple representing BRCA2 sgRNAs and green
representing PIK3CA sgRNAs. Notably, BRCA2_0 exhibits a low off-target
risk (~20 sites) and relatively low efficiency, while PIK3CA_4 and
PIK3CA_3 demonstrate high predicted efficiencies (~0.5–0.6) with
moderate off-target risks. BRCA2_2 has the highest off-target risk (~35
sites) despite moderate efficiency, suggesting potential trade-offs
between precision and effectiveness.