The primary aim of this project is to investigate the functional
parallels between RAD52 in yeast and BRCA1 in humans, focusing on
homologous recombination (HR) as a DNA repair mechanism. Using the
CRISPR-Cas9 system, the project seeks to knock out RAD52, a critical
homolog of BRCA1 in yeast, to study its role in genome stability and
response to DNA damage. This research will provide insights into
cancer-related mechanisms in human cells by leveraging yeast as a model
organism.
To identify optimal sgRNA sequences targeting the RAD52 gene,
CHOPCHOP, CRISPOR, and Off-Spotter were used. CHOPCHOP ranked sgRNAs
based on efficiency, specificity, GC content, and off-target
predictions, while CRISPOR provided specificity scores (MIT and CFD),
cleavage efficiency, and off-target risks. Off-Spotter confirmed
genome-wide off-target predictions by analysing mismatches and their
genomic contexts. Among the candidates, Rank 1 sgRNA
(AGAACAATGATAAAGAACTGGGG) had an efficiency score of 77, specificity
scores of 99 (MIT) and 100 (CFD), minimal off-targets, and a 25%
frameshift probability, making it a highly efficient and specific
choice. Rank 2 sgRNA (ACAAAACGATGACCACCGCGAGG) demonstrated a 72
efficiency score, perfect specificity (100 MIT and CFD), no off-targets,
and a 32% frameshift probability, offering high specificity and
reliability. Rank 3 sgRNA (GGATGTACTACCTTAGAAGGCGG) showed a 71
efficiency score, perfect specificity (100 MIT and CFD), one off-target
with reduced cleavage likelihood, and a 67% frameshift probability.
Next Steps: Further Analysis in RStudio - After evaluating the sgRNA
candidates, the top results have been selected for detailed statistical
and graphical analysis using RStudio.
# Load the dataset from CSV
gRNA_data <- read_csv("Project 2 RAD52(Sheet1).csv")
## Rows: 3 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Coordinate, sgRNA Sequence, PAM, Off-Target Summary
## dbl (8): Rank, Cutting Efficiency (Doench '16), MIT Specificity Score, CFD S...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview the data structure
glimpse(gRNA_data)
## Rows: 3
## Columns: 12
## $ Rank <dbl> 1, 2, 3
## $ Coordinate <chr> "chrXIII:212743", "chrXIII:212815", …
## $ `sgRNA Sequence` <chr> "AGAACAATGATAAAGAACTGGGG", "ACAAAACG…
## $ PAM <chr> "GGG", "AGG", "CGG"
## $ `Cutting Efficiency (Doench '16)` <dbl> 77, 72, 71
## $ `MIT Specificity Score` <dbl> 99, 100, 100
## $ `CFD Specifictiy Score` <dbl> 100, 100, 100
## $ `Number of Off-Target` <dbl> 4, 0, 1
## $ `Off-Target Summary` <chr> "0:0|1:1|3:3", "0:0|1:0|3:0", "0:0|1…
## $ `GC Content (%)` <dbl> 30, 55, 45
## $ `Out-of-Frame Indel Rate` <dbl> 25, 32, 67
## $ `Lindel Score` <dbl> 56, 50, 31
kable(head(gRNA_data), caption = "Preview of the Imported gRNA Dataset")
Preview of the Imported gRNA Dataset
1 |
chrXIII:212743 |
AGAACAATGATAAAGAACTGGGG |
GGG |
77 |
99 |
100 |
4 |
0:0|1:1|3:3 |
30 |
25 |
56 |
2 |
chrXIII:212815 |
ACAAAACGATGACCACCGCGAGG |
AGG |
72 |
100 |
100 |
0 |
0:0|1:0|3:0 |
55 |
32 |
50 |
3 |
chrXIII:212618 |
GGATGTACTACCTTAGAAGGCGG |
CGG |
71 |
100 |
100 |
1 |
0:0|1:0|3:1 |
45 |
67 |
31 |
# Normalise key metrics using z-scores with updated column names
gRNA_data <- gRNA_data %>%
mutate(
Efficiency_z = scale(`Cutting Efficiency (Doench '16)`),
Specificity_z = scale((`MIT Specificity Score` + `CFD Specifictiy Score`) / 2), # Averaging both specificity scores
Off_Targets_z = -scale(`Number of Off-Target`), # Invert to favor fewer off-targets
GC_Content_z = -scale(abs(`GC Content (%)` - 50)), # Closer to 50% is better
Indel_Rate_z = scale(`Out-of-Frame Indel Rate`)
)
# Summing z-scores for an overall ranking (higher score is better)
gRNA_data <- gRNA_data %>%
mutate(Total_Score = Efficiency_z + Specificity_z + Off_Targets_z + GC_Content_z + Indel_Rate_z) %>%
arrange(desc(Total_Score))
# Display the ranked gRNA data
kable(gRNA_data, caption = "Ranked CRISPR gRNA Candidates with Normalized Scores")
Ranked CRISPR gRNA Candidates with Normalized Scores
3 |
chrXIII:212618 |
GGATGTACTACCTTAGAAGGCGG |
CGG |
71 |
100 |
100 |
1 |
0:0|1:0|3:1 |
45 |
67 |
31 |
-0.7258662 |
0.5773503 |
0.3202563 |
0.5773503 |
1.1406469 |
1.889738 |
2 |
chrXIII:212815 |
ACAAAACGATGACCACCGCGAGG |
AGG |
72 |
100 |
100 |
0 |
0:0|1:0|3:0 |
55 |
32 |
50 |
-0.4147807 |
0.5773503 |
0.8006408 |
0.5773503 |
-0.4147807 |
1.125780 |
1 |
chrXIII:212743 |
AGAACAATGATAAAGAACTGGGG |
GGG |
77 |
99 |
100 |
4 |
0:0|1:1|3:3 |
30 |
25 |
56 |
1.1406469 |
-1.1547005 |
-1.1208971 |
-1.1547005 |
-0.7258662 |
-3.015517 |
Table 1 ranks CRISPR gRNA candidates targeting the RAD52 gene in
yeast based on normalised z-scores for cutting efficiency, specificity,
off-target risks, GC content, and indel rates. The genomic coordinates,
sequences, and Protospacer Adjacent Motif (PAM) are listed alongside key
metrics such as predicted cutting efficiency (Doench 2016 model),
specificity scores (MIT and CFD), number of off-targets, off-target
mismatch summary, GC content, out-of-frame indel rate, and Lindel score.
Normalised z-scores are provided for each metric: higher scores indicate
better performance, with the total score summing all z-scores to
determine the overall ranking. Efficiency_z reflects the cutting
likelihood, Specificity_z averages MIT and CFD scores, Off_Targets_z
favors fewer off-targets, GC_Content_z prioritises sequences with GC
content near 50%, and Indel_Rate_z highlights frameshift potential.
# Compute correlation matrix for z-scores
cor_matrix <- gRNA_data %>%
select(Efficiency_z, Specificity_z, Off_Targets_z, GC_Content_z, Indel_Rate_z) %>%
cor()
# Plot correlation matrix
ggcorrplot(cor_matrix, method = "circle", type = "lower", title = "Correlation Between gRNA Metrics")

Figure 1 is a correlation plot that visualises the relationships
between normalised z-scores of metrics used to evaluate CRISPR gRNA
candidates targeting the RAD52 gene. Both the x-axis and y-axis
represent z-scores for key metrics, including specificity (average of
MIT and CFD scores), off-target predictions (fewer off-targets yield
higher scores), GC content (favouring values near 50%), predicted
cutting efficiency (Doench ’16 model), and out-of-frame indel rates
(indicating knockout potential). The size of each bubble indicates the
strength of the correlation, with larger circles representing stronger
relationships, while the colour shows the direction: red for positive
correlations (metrics increasing together), blue for negative
correlations (one metric increases as the other decreases), and white
for little to no correlation.
# Cutting Efficiency and Specificity Scores
ggplot(gRNA_data, aes(x = reorder(`Rank`, -Total_Score), y = `Cutting Efficiency (Doench '16)`, fill = as.factor(Rank))) +
geom_bar(stat = "identity") +
labs(title = "Cutting Efficiency of gRNAs", x = "gRNA Sequence by Rank", y = "Cutting Efficiency (Doench '16)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 4 bar chart depicts the GC content of three ranked gRNA
sequences targeting the RAD52 gene, measured as a percentage. The x-axis
represents the rank of the gRNA sequences (Rank 1, 2, and 3), and the
y-axis shows the GC content (%), with an optimal range typically around
40-60% for efficient and stable binding during CRISPR-Cas9 targeting.
Each bar reflects the GC content of a specific gRNA, with Rank 2
exhibiting the highest GC content at 55%, Rank 3 at 45%, and Rank 1 the
lowest at 30%.
# Extracting the best gRNA
best_gRNA <- gRNA_data %>%
filter(Total_Score == max(Total_Score))
## Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
## ℹ Please use one dimensional logical vectors instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
kable(best_gRNA, caption = "Top gRNA Candidate for RAD52 Knockout Analysis")
Top gRNA Candidate for RAD52 Knockout Analysis
3 |
chrXIII:212618 |
GGATGTACTACCTTAGAAGGCGG |
CGG |
71 |
100 |
100 |
1 |
0:0|1:0|3:1 |
45 |
67 |
31 |
-0.7258662 |
0.5773503 |
0.3202563 |
0.5773503 |
1.140647 |
1.889738 |
# Interpretation of the best candidate
cat("### Analysis of the Best gRNA Candidate")
## ### Analysis of the Best gRNA Candidate
cat("- **Optimal Cutting Efficiency**:", best_gRNA$Cutting_Efficiency, "\n")
## Warning: Unknown or uninitialised column: `Cutting_Efficiency`.
## - **Optimal Cutting Efficiency**:
cat("- **High Specificity**: MIT and CFD scores indicate minimal off-target potential.\n")
## - **High Specificity**: MIT and CFD scores indicate minimal off-target potential.
cat("- **Balanced GC Content**:", best_gRNA$GC_Content, "%, ideal for stability and target binding.\n")
## Warning: Unknown or uninitialised column: `GC_Content`.
## - **Balanced GC Content**: %, ideal for stability and target binding.
cat("- **High Knockout Potential**:", best_gRNA$Out_of_Frame_Indel_Rate, "%, likely to induce a functional knockout.\n")
## Warning: Unknown or uninitialised column: `Out_of_Frame_Indel_Rate`.
## - **High Knockout Potential**: %, likely to induce a functional knockout.
Table 2 showcases the top gRNA candidate for targeting the RAD52
gene, ranked based on its total score, which incorporates metrics like
cutting efficiency, specificity, off-target risks, GC content, and indel
rate.
Summary of Advanced gRNA Analysis
Through normalised scores and in-depth metric analysis, we identified
the top candidate for RAD52 knockout:
- Optimal Cutting Efficiency: The selected gRNA has
the highest efficiency of all candidates, vital for effective
knockout.
- High Specificity: A combined high score in both MIT
and CFD specificity reduces the likelihood of off-target effects,
critical in targeted genome editing.
- Ideal GC Content: A balanced GC content close to
50% optimizes binding and stability, enhancing CRISPR-Cas9
interaction.
- High Indel Likelihood: A high out-of-frame indel
rate suggests a strong potential for a functional knockout, essential
for gene disruption studies.