To identify optimal sgRNA sequences targeting the RAD52 gene, three tools were used: CHOPCHOP, CRISPOR, and Off-Spotter. CHOPCHOP ranked sgRNAs by efficiency, specificity, GC content, and off-target predictions, while CRISPOR provided specificity scores (MIT and CFD), cleavage efficiency, and off-target risks. Off-Spotter confirmed the genome-wide off-target predictions by analysing mismatches and their genomic contexts. Among the candidates, the Rank 1 sgRNA (AGAACAATGATAAAGAACTGGGG) had an efficiency score of 77, specificity scores of 99 (MIT) and 100 (CFD), four predicted off-targets, and a 25% frameshift probability, making it the most efficient candidate. The Rank 2 sgRNA (ACAAAACGATGACCACCGCGAGG) had an efficiency score of 72, perfect specificity (100 for both MIT and CFD), no off-targets, and a 32% frameshift probability, making it the most specific candidate. The Rank 3 sgRNA (GGATGTACTACCTTAGAAGGCGG) had an efficiency score of 71, perfect specificity (100 for both MIT and CFD), one off-target with reduced cleavage likelihood, and a 67% frameshift probability, the highest of the three.
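As a quick sanity check on the candidates described above, each 23-nt sequence can be split into a 20-nt protospacer and a 3-nt PAM in base R. This is only an illustrative sketch: the object names (`sgrnas`, `protospacer`, `pam`, `pam_is_ngg`, `gc_pct`) are hypothetical, and it assumes that GC content is computed over the 20-nt protospacer only, which the design tools do not state explicitly here.

```r
# Candidate sgRNAs as reported above (20-nt protospacer followed by a 3-nt PAM)
sgrnas <- c(
  rank1 = "AGAACAATGATAAAGAACTGGGG",
  rank2 = "ACAAAACGATGACCACCGCGAGG",
  rank3 = "GGATGTACTACCTTAGAAGGCGG"
)

# Split each sequence into protospacer (positions 1-20) and PAM (positions 21-23)
protospacer <- substr(sgrnas, 1, 20)
pam         <- substr(sgrnas, 21, 23)

# The canonical SpCas9 PAM is NGG: any base followed by two guanines
pam_is_ngg <- grepl("^[ACGT]GG$", pam)

# GC content (%) over the protospacer (assumed convention, see lead-in)
gc_pct <- vapply(
  strsplit(protospacer, ""),
  function(bases) 100 * sum(bases %in% c("G", "C")) / length(bases),
  numeric(1)
)

print(data.frame(protospacer, pam, pam_is_ngg, gc_pct))
```

Under that assumption the computed GC percentages can be compared directly with the `GC Content (%)` column of the imported dataset.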
Next Steps: Further Analysis in RStudio

After evaluating the sgRNA candidates, the top results were selected for detailed statistical and graphical analysis in RStudio.
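# Load the required packages and the dataset from CSV
library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(knitr)       # kable()
library(ggcorrplot)  # ggcorrplot()
gRNA_data <- read_csv("Project 2 RAD52(Sheet1).csv")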
## Rows: 3 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Coordinate, sgRNA Sequence, PAM, Off-Target Summary
## dbl (8): Rank, Cutting Efficiency (Doench '16), MIT Specificity Score, CFD S...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview the data structure
glimpse(gRNA_data)
## Rows: 3
## Columns: 12
## $ Rank                              <dbl> 1, 2, 3
## $ Coordinate                        <chr> "chrXIII:212743", "chrXIII:212815", …
## $ `sgRNA Sequence`                  <chr> "AGAACAATGATAAAGAACTGGGG", "ACAAAACG…
## $ PAM                               <chr> "GGG", "AGG", "CGG"
## $ `Cutting Efficiency (Doench '16)` <dbl> 77, 72, 71
## $ `MIT Specificity Score`           <dbl> 99, 100, 100
## $ `CFD Specifictiy Score`           <dbl> 100, 100, 100
## $ `Number of Off-Target`            <dbl> 4, 0, 1
## $ `Off-Target Summary`              <chr> "0:0|1:1|3:3", "0:0|1:0|3:0", "0:0|1…
## $ `GC Content (%)`                  <dbl> 30, 55, 45
## $ `Out-of-Frame Indel Rate`         <dbl> 25, 32, 67
## $ `Lindel Score`                    <dbl> 56, 50, 31
kable(head(gRNA_data), caption = "Preview of the Imported gRNA Dataset")
Preview of the Imported gRNA Dataset
| Rank | Coordinate | sgRNA Sequence | PAM | Cutting Efficiency (Doench '16) | MIT Specificity Score | CFD Specifictiy Score | Number of Off-Target | Off-Target Summary | GC Content (%) | Out-of-Frame Indel Rate | Lindel Score |
|---:|:---|:---|:---|---:|---:|---:|---:|:---|---:|---:|---:|
| 1 | chrXIII:212743 | AGAACAATGATAAAGAACTGGGG | GGG | 77 | 99 | 100 | 4 | 0:0\|1:1\|3:3 | 30 | 25 | 56 |
| 2 | chrXIII:212815 | ACAAAACGATGACCACCGCGAGG | AGG | 72 | 100 | 100 | 0 | 0:0\|1:0\|3:0 | 55 | 32 | 50 |
| 3 | chrXIII:212618 | GGATGTACTACCTTAGAAGGCGG | CGG | 71 | 100 | 100 | 1 | 0:0\|1:0\|3:1 | 45 | 67 | 31 |
# Normalise key metrics using z-scores with updated column names
# scale() returns a one-column matrix, so as.numeric() converts each result to a plain vector
gRNA_data <- gRNA_data %>%
  mutate(
    Efficiency_z = as.numeric(scale(`Cutting Efficiency (Doench '16)`)),
    Specificity_z = as.numeric(scale((`MIT Specificity Score` + `CFD Specifictiy Score`) / 2)), # Averaging both specificity scores
    Off_Targets_z = -as.numeric(scale(`Number of Off-Target`)), # Invert to favour fewer off-targets
    GC_Content_z = -as.numeric(scale(abs(`GC Content (%)` - 50))), # Closer to 50% is better
    Indel_Rate_z = as.numeric(scale(`Out-of-Frame Indel Rate`))
  )

# Summing z-scores for an overall ranking (higher score is better)
gRNA_data <- gRNA_data %>%
  mutate(Total_Score = Efficiency_z + Specificity_z + Off_Targets_z + GC_Content_z + Indel_Rate_z) %>%
  arrange(desc(Total_Score))

# Display the ranked gRNA data
kable(gRNA_data, caption = "Ranked CRISPR gRNA Candidates with Normalized Scores")
Ranked CRISPR gRNA Candidates with Normalized Scores
| Rank | Coordinate | sgRNA Sequence | PAM | Cutting Efficiency (Doench '16) | MIT Specificity Score | CFD Specifictiy Score | Number of Off-Target | Off-Target Summary | GC Content (%) | Out-of-Frame Indel Rate | Lindel Score | Efficiency_z | Specificity_z | Off_Targets_z | GC_Content_z | Indel_Rate_z | Total_Score |
|---:|:---|:---|:---|---:|---:|---:|---:|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 3 | chrXIII:212618 | GGATGTACTACCTTAGAAGGCGG | CGG | 71 | 100 | 100 | 1 | 0:0\|1:0\|3:1 | 45 | 67 | 31 | -0.7258662 | 0.5773503 | 0.3202563 | 0.5773503 | 1.1406469 | 1.889738 |
| 2 | chrXIII:212815 | ACAAAACGATGACCACCGCGAGG | AGG | 72 | 100 | 100 | 0 | 0:0\|1:0\|3:0 | 55 | 32 | 50 | -0.4147807 | 0.5773503 | 0.8006408 | 0.5773503 | -0.4147807 | 1.125780 |
| 1 | chrXIII:212743 | AGAACAATGATAAAGAACTGGGG | GGG | 77 | 99 | 100 | 4 | 0:0\|1:1\|3:3 | 30 | 25 | 56 | 1.1406469 | -1.1547005 | -1.1208971 | -1.1547005 | -0.7258662 | -3.015517 |
Table 1 ranks the CRISPR gRNA candidates targeting the RAD52 gene in yeast by the sum of their normalised z-scores for cutting efficiency, specificity, off-target risk, GC content, and indel rate. The genomic coordinates, sequences, and Protospacer Adjacent Motif (PAM) are listed alongside the raw metrics: predicted cutting efficiency (Doench 2016 model), specificity scores (MIT and CFD), number of off-targets, off-target mismatch summary, GC content, out-of-frame indel rate, and Lindel score. Each z-score is oriented so that a higher value indicates better performance, and the total score is the unweighted sum of the five z-scores. Efficiency_z reflects the cutting likelihood, Specificity_z is based on the mean of the MIT and CFD scores, Off_Targets_z favours fewer off-targets, GC_Content_z favours GC content near 50%, and Indel_Rate_z reflects frameshift potential. The Lindel score is listed for reference but does not contribute to the total score.
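To make the scoring concrete, the z-scores and total score can be reproduced by hand in base R. The sketch below is illustrative only: the helper `z()` and the `scores` data frame are hypothetical names, the metric values are copied from the dataset shown earlier, and `z()` mirrors what `scale()` does with its default arguments (centre on the mean, divide by the sample standard deviation).

```r
# Metric values copied from the dataset above, in the order Rank 1, 2, 3
efficiency  <- c(77, 72, 71)
specificity <- (c(99, 100, 100) + c(100, 100, 100)) / 2  # mean of MIT and CFD
off_targets <- c(4, 0, 1)
gc_content  <- c(30, 55, 45)
indel_rate  <- c(25, 32, 67)

# z-score: centre on the mean, divide by the sample standard deviation
z <- function(x) (x - mean(x)) / sd(x)

scores <- data.frame(
  Rank          = 1:3,
  Efficiency_z  = z(efficiency),
  Specificity_z = z(specificity),
  Off_Targets_z = -z(off_targets),           # fewer off-targets scores higher
  GC_Content_z  = -z(abs(gc_content - 50)),  # closer to 50% scores higher
  Indel_Rate_z  = z(indel_rate)
)

# Unweighted sum of the five z-scores
scores$Total_Score <- rowSums(scores[, -1])

# Highest total first
scores <- scores[order(-scores$Total_Score), ]
print(round(scores, 3))
```

Because the sum is unweighted, every metric has the same influence on the ranking; with only three candidates, a single outlying value (such as the 67% out-of-frame indel rate) can move a candidate to the top.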
# Compute correlation matrix for z-scores
cor_matrix <- gRNA_data %>% 
  select(Efficiency_z, Specificity_z, Off_Targets_z, GC_Content_z, Indel_Rate_z) %>%
  cor()

# Plot correlation matrix
ggcorrplot(cor_matrix, method = "circle", type = "lower", title = "Correlation Between gRNA Metrics")

Figure 1 is a correlation plot that visualises the relationships between normalised z-scores of metrics used to evaluate CRISPR gRNA candidates targeting the RAD52 gene. Both the x-axis and y-axis represent z-scores for key metrics, including specificity (average of MIT and CFD scores), off-target predictions (fewer off-targets yield higher scores), GC content (favouring values near 50%), predicted cutting efficiency (Doench ’16 model), and out-of-frame indel rates (indicating knockout potential). The size of each bubble indicates the strength of the correlation, with larger circles representing stronger relationships, while the colour shows the direction: red for positive correlations (metrics increasing together), blue for negative correlations (one metric increases as the other decreases), and white for little to no correlation.
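The numbers behind a plot like this can be inspected directly. The following base R sketch recomputes the correlation matrix from the metric values in the dataset; `z()` and `z_scores` are hypothetical names used only for illustration. With three candidates, each coefficient is estimated from just three points, so the values are best read as descriptive rather than as evidence of a general relationship.

```r
# z-score helper equivalent to scale() with default arguments
z <- function(x) (x - mean(x)) / sd(x)

# Metric values copied from the dataset (Rank 1, 2, 3)
z_scores <- cbind(
  Efficiency_z  = z(c(77, 72, 71)),
  Specificity_z = z((c(99, 100, 100) + c(100, 100, 100)) / 2),
  Off_Targets_z = -z(c(4, 0, 1)),
  GC_Content_z  = -z(abs(c(30, 55, 45) - 50)),
  Indel_Rate_z  = z(c(25, 32, 67))
)

# Pearson correlation between every pair of z-score columns
cor_matrix <- cor(z_scores)
print(round(cor_matrix, 2))
```

In this dataset Specificity_z and GC_Content_z happen to be identical vectors, so their correlation is exactly 1, while Efficiency_z and Indel_Rate_z are negatively correlated.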
# Cutting efficiency by gRNA (x-axis ordered by descending Total_Score)
ggplot(gRNA_data, aes(x = reorder(`Rank`, -Total_Score), y = `Cutting Efficiency (Doench '16)`, fill = as.factor(Rank))) +
  geom_bar(stat = "identity") +
  labs(title = "Cutting Efficiency of gRNAs", x = "gRNA Sequence by Rank", y = "Cutting Efficiency (Doench '16)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 2 is a bar chart of the cutting efficiency of the three ranked gRNA sequences targeting the RAD52 gene, as predicted by the Doench 2016 model. The x-axis shows the original rank of each gRNA (Rank 1, 2, and 3), ordered by descending total score, and the y-axis shows the cutting efficiency score, with higher values reflecting better predicted efficiency in inducing double-strand breaks. Each bar is colour-coded by rank: red for Rank 1, green for Rank 2, and blue for Rank 3. Rank 1 has the highest cutting efficiency score (77), followed by Rank 2 (72) and Rank 3 (71).
# Plot MIT (points) and CFD (dashed line) specificity scores by gRNA
ggplot(gRNA_data, aes(x = reorder(`Rank`, -Total_Score), y = `MIT Specificity Score`, color = as.factor(Rank))) +
  geom_point(size = 4) +
  # group = 1 joins all gRNAs into one line; a dashed line needs one constant colour
  geom_line(aes(y = `CFD Specifictiy Score`, group = 1), linetype = "dashed", linewidth = 1, colour = "grey40") +
  labs(title = "MIT and CFD Specificity Scores by gRNA", x = "gRNA Sequence by Rank", y = "Specificity Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 3 is a scatter plot of the specificity scores (MIT and CFD) of the three ranked gRNA sequences targeting the RAD52 gene. The x-axis shows the rank of each gRNA (Rank 1, 2, and 3), and the y-axis shows the specificity score, where higher values reflect greater target specificity and fewer predicted off-target effects. Each point shows the MIT score and is colour-coded by rank: red for Rank 1, green for Rank 2, and blue for Rank 3; the dashed line shows the CFD score. Ranks 2 and 3 have perfect scores of 100 on both measures, while Rank 1 has a slightly lower MIT score of 99 and a CFD score of 100.
# GC Content Comparison
ggplot(gRNA_data, aes(x = reorder(`Rank`, -Total_Score), y = `GC Content (%)`)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "GC Content of gRNAs", x = "gRNA Sequence by Rank", y = "GC Content (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 4 is a bar chart of the GC content of the three ranked gRNA sequences targeting the RAD52 gene, measured as a percentage. The x-axis shows the rank of each gRNA (Rank 1, 2, and 3), and the y-axis shows the GC content (%); a range of roughly 40-60% is typically considered optimal for efficient and stable binding during CRISPR-Cas9 targeting. Rank 2 has the highest GC content at 55%, followed by Rank 3 at 45%, while Rank 1 has the lowest at 30%, below that range.
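The 40-60% window mentioned above can be checked programmatically. This is a minimal base R sketch with hypothetical names (`gc_content`, `gc_min`, `gc_max`, `in_window`, `dist_from_50`); the bounds are simply the range quoted in the text, not a threshold enforced by any of the design tools.

```r
# GC content (%) copied from the dataset
gc_content <- c(`Rank 1` = 30, `Rank 2` = 55, `Rank 3` = 45)

# Window bounds taken from the 40-60% range quoted above (an assumption, not a hard rule)
gc_min <- 40
gc_max <- 60

# TRUE where the gRNA falls inside the window
in_window <- gc_content >= gc_min & gc_content <= gc_max

# Absolute distance from 50%, the quantity that GC_Content_z is built on
dist_from_50 <- abs(gc_content - 50)

print(data.frame(gc_content, in_window, dist_from_50))
```

Ranks 2 and 3 are equally distant from 50%, which is why they receive the same GC_Content_z value in the ranking table.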
# Extracting the best gRNA
# as.numeric() guards against Total_Score being stored as a one-column matrix (scale() output)
best_gRNA <- gRNA_data %>%
  filter(as.numeric(Total_Score) == max(Total_Score))
kable(best_gRNA, caption = "Top gRNA Candidate for RAD52 Knockout Analysis")
Top gRNA Candidate for RAD52 Knockout Analysis
| Rank | Coordinate | sgRNA Sequence | PAM | Cutting Efficiency (Doench '16) | MIT Specificity Score | CFD Specifictiy Score | Number of Off-Target | Off-Target Summary | GC Content (%) | Out-of-Frame Indel Rate | Lindel Score | Efficiency_z | Specificity_z | Off_Targets_z | GC_Content_z | Indel_Rate_z | Total_Score |
|---:|:---|:---|:---|---:|---:|---:|---:|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 3 | chrXIII:212618 | GGATGTACTACCTTAGAAGGCGG | CGG | 71 | 100 | 100 | 1 | 0:0\|1:0\|3:1 | 45 | 67 | 31 | -0.7258662 | 0.5773503 | 0.3202563 | 0.5773503 | 1.140647 | 1.889738 |
# Interpretation of the best candidate
# Column names containing spaces or symbols must be wrapped in backticks after `$`
cat("### Analysis of the Best gRNA Candidate\n")
## ### Analysis of the Best gRNA Candidate
cat("- **Cutting Efficiency (Doench '16)**:", best_gRNA$`Cutting Efficiency (Doench '16)`, "\n")
## - **Cutting Efficiency (Doench '16)**: 71
cat("- **High Specificity**: MIT and CFD scores indicate minimal off-target potential.\n")
## - **High Specificity**: MIT and CFD scores indicate minimal off-target potential.
cat("- **Balanced GC Content**:", best_gRNA$`GC Content (%)`, "%, ideal for stability and target binding.\n")
## - **Balanced GC Content**: 45 %, ideal for stability and target binding.
cat("- **High Knockout Potential**:", best_gRNA$`Out-of-Frame Indel Rate`, "%, likely to induce a functional knockout.\n")
## - **High Knockout Potential**: 67 %, likely to induce a functional knockout.
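Table 2 shows the top gRNA candidate for targeting the RAD52 gene, the Rank 3 sgRNA (GGATGTACTACCTTAGAAGGCGG), selected by its total score of 1.89, which combines cutting efficiency, specificity, off-target risk, GC content, and indel rate. Although it has the lowest predicted cutting efficiency of the three (71), it scores highest overall because of its perfect specificity, single off-target, near-50% GC content, and 67% out-of-frame indel rate.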
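Selecting and summarising the top candidate can also be sketched in base R. The example below is illustrative: `gRNA_demo`, `best`, and `summary_line` are hypothetical names, and the values are copied from the ranked table. It shows the two ways of reaching a column whose name contains spaces or symbols, backticks after `$` or a quoted name inside `[[ ]]`.

```r
# check.names = FALSE keeps the original, non-syntactic column names
gRNA_demo <- data.frame(
  Rank = c(3, 2, 1),
  `Cutting Efficiency (Doench '16)` = c(71, 72, 77),
  `GC Content (%)` = c(45, 55, 30),
  `Out-of-Frame Indel Rate` = c(67, 32, 25),
  Total_Score = c(1.889738, 1.125780, -3.015517),
  check.names = FALSE
)

# Row with the highest total score
best <- gRNA_demo[which.max(gRNA_demo$Total_Score), ]

# Backticks (with $) or quoted names (with [[ ]]) reach non-syntactic columns
summary_line <- sprintf(
  "Rank %.0f: efficiency %.0f, GC %.0f%%, out-of-frame indel rate %.0f%%",
  best$Rank,
  best$`Cutting Efficiency (Doench '16)`,
  best[["GC Content (%)"]],
  best[["Out-of-Frame Indel Rate"]]
)
cat(summary_line, "\n")
```

Referring to a column by a name that does not exist (for example `best$GC_Content`) returns `NULL` rather than an error, so the summary would print with the value missing.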

Summary of Advanced gRNA Analysis

Through normalised scores and in-depth metric analysis, we identified the top candidate for RAD52 knockout: