Guide RNA Analysis for CRISPR Epigenetic Modification of RAS2 Promoter

This analysis aims to identify the optimal guide RNA sequence among the top four ranked guides, focusing on specificity, efficiency, and off-target effects.

# Load necessary libraries
install.packages("ggcorrplot")
## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("kableExtra")
## Installing package into '/home/catherinetaylor35/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(readr)
library(ggcorrplot)
## Loading required package: ggplot2
library(tidyr)
library(readxl)      # For reading Excel files
library(dplyr)       # For data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)     # For data visualisation
library(knitr)       # For rendering tables in Markdown
library(kableExtra)  # For enhanced table aesthetics
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(gridExtra)   # For multi-plot layout
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Load the data
guides_df <- read_csv("Guide_RNA_Analysis_RAS2.xlsx - Sheet1.csv")
## Rows: 4 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Target Sequence, Genomic Location, Strand, Off-Targets, Restriction...
## dbl (9): Rank, GC Content (%), Self-Complementarity, MM0 (Exact Match), MM1,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the data table
guides_df
## # A tibble: 4 × 14
##    Rank `Target Sequence`       `Genomic Location` Strand `GC Content (%)`
##   <dbl> <chr>                   <chr>              <chr>             <dbl>
## 1     1 TAGGCGATAAGAGAAGACAACGG chrXIV:440767      +                    40
## 2     2 CAGAAGAGTCAGATAAAGAGTGG chrXIV:440685      +                    40
## 3     3 ATTTCAAGCGTAACGCAATCCGG chrXIV:440735      -                    40
## 4     4 CGCTTGAAATGAGCGATTCTAGG chrXIV:440748      +                    45
## # ℹ 9 more variables: `Self-Complementarity` <dbl>, `MM0 (Exact Match)` <dbl>,
## #   MM1 <dbl>, MM2 <dbl>, MM3 <dbl>, `Doench '16 Efficiency` <dbl>,
## #   `Mor.-Mateos Efficiency` <dbl>, `Off-Targets` <chr>,
## #   `Restriction Enzymes` <chr>
# Specificity Visualisation
ggplot(guides_df, aes(x = factor(Rank), y = `Self-Complementarity`, fill = factor(Rank))) +
  geom_bar(stat = "identity") +
  labs(title = "Self-Complementarity by Guide Rank", x = "Guide Rank", y = "Self-Complementarity") +
  theme_minimal() +
  theme(legend.position = "none")

Figure 1 shows a bar chart illustrating the Self-Complementarity by Guide Rank, a measure of potential unintended folding within each guide RNA sequence, which can impact efficiency and specificity. The x-axis represents the guide RNA rank (Rank 1, Rank 2, etc.), ordered based on their evaluation on Chopchop and CRISPOR. The y-axis shows the self-complementarity score, where higher values indicate a greater likelihood of self-pairing within the sequence. Each bar is colour-coded to differentiate the ranks. Lower self-complementarity is preferred as it reduces the risk of guide RNA forming secondary structures, which may hinder its binding to the intended target.
# Summarise MM counts
mm_long <- guides_df %>%
  select(Rank, `MM0 (Exact Match)`, MM1, MM2, MM3) %>%
  pivot_longer(cols = starts_with("MM"), names_to = "Mismatch", values_to = "Count")

# View the data to ensure it's in long format
head(mm_long)
## # A tibble: 6 × 3
##    Rank Mismatch          Count
##   <dbl> <chr>             <dbl>
## 1     1 MM0 (Exact Match)     0
## 2     1 MM1                   0
## 3     1 MM2                   0
## 4     1 MM3                   1
## 5     2 MM0 (Exact Match)     0
## 6     2 MM1                   0
Table 1 summarises mismatch (MM) counts for guide RNA sequences across four mismatch categories: MM0 (Exact Match), MM1, MM2, and MM3. The Rank column indicates the rank of each guide RNA, ordered based on suitability for targeting. The Mismatch column specifies the mismatch category, where MM0 represents exact matches with no mismatches, while MM1, MM2, and MM3 represent increasing levels of allowable mismatches. The Count column provides the frequency of mismatches for each guide RNA in the respective category.
ggplot(mm_long, aes(x = factor(Rank), y = Count, fill = Mismatch)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Mismatch Distribution by Guide Rank", x = "Guide Rank", y = "Mismatch Count") +
  theme_minimal()

Figure 2 bar chart illustrates the distribution of mismatch counts across guide RNA ranks. The x-axis represents the ranks of guide RNAs (Rank 1, Rank 3, and Rank 4). The y-axis shows the mismatch count for each guide RNA in different mismatch categories. Each bar is color-coded to represent a specific mismatch type: MM0 (Exact Match) for perfect matches, MM1, MM2, and MM3, reflecting increasing tolerance of mismatches. The chart highlights the relative frequency of mismatches in each category, with higher-ranked guide RNAs typically having fewer mismatches, indicating better specificity and lower off-target risks.
# Efficiency Comparison Plot
efficiency_long <- guides_df %>%
  select(Rank, `Doench '16 Efficiency`, `Mor.-Mateos Efficiency`) %>%
  pivot_longer(cols = starts_with("Doench"), names_to = "Efficiency_Metric", values_to = "Score")

ggplot(efficiency_long, aes(x = factor(Rank), y = Score, fill = Efficiency_Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Efficiency Scores by Guide Rank", x = "Guide Rank", y = "Efficiency Score") +
  theme_minimal()

##### Figure 3 bar chart illustrates the efficiency scores of guide RNAs (gRNAs) ranked by their suitability for targeting. The x-axis represents the guide rank (1 to 4). The y-axis shows the efficiency score derived from the Doench ’16 scoring model, which predicts the likelihood of effective cleavage by the gRNA. The bars are colour-coded to correspond to the efficiency metric, emphasizing rank differences. Higher-ranked gRNAs display superior efficiency scores, with Rank 1 achieving the highest value, reflecting its greater suitability for precise gene targeting.

# GC Content by Rank
ggplot(guides_df, aes(x = factor(Rank), y = `GC Content (%)`, fill = factor(Rank))) +
  geom_bar(stat = "identity") +
  labs(title = "GC Content by Guide Rank", x = "Guide Rank", y = "GC Content (%)") +
  theme_minimal() +
  theme(legend.position = "none")

Figure 4 bar chart visualises the GC Content (%) for guide RNAs (gRNAs) ranked from 1 to 4 based on their overall performance. The x-axis represents the rank of each gRNA. At the same time, the y-axis shows the percentage of guanine (G) and cytosine (C) bases in the sequence, an essential factor for stability and binding efficiency. Each rank is distinguished by a unique bar colour, clearly comparing GC content across gRNAs. The chart reveals slight variation, with Rank 4 displaying marginally higher GC content than the others, ensuring all sequences fall within the optimal range (40–60%) necessary for effective CRISPR targeting.
# Table summarizing off-targets for each guide
off_target_summary <- guides_df %>%
  select(Rank, `Off-Targets`)

kable(off_target_summary, col.names = c("Rank", "Off-Target Summary")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Rank Off-Target Summary
1 1 (CDC9)
2 0
3 1 (HSP31-FIT1 Intergenic)
4 0
Table 2 summarises the off-target analysis for the four ranked guide RNAs (gRNAs). The first column lists the rank of each gRNA, while the second column provides details on off-target effects. Rank 1 and Rank 3 each have one off-target, with Rank 1 targeting the CDC9 gene and Rank 3 located in the intergenic region between HSP31 and FIT1. Ranks 2 and 4 demonstrate no off-target effects, ensuring high specificity. This comparison highlights the importance of selecting gRNAs with minimal off-target activity to enhance CRISPR-Cas9 precision and reduce unintended genomic modifications.
# Combined plot
p1 <- ggplot(guides_df, aes(x = factor(Rank), y = `Doench '16 Efficiency`, fill = factor(Rank))) +
  geom_bar(stat = "identity") +
  labs(title = "Doench '16 Efficiency", x = "Guide Rank", y = "Efficiency Score") +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(guides_df, aes(x = factor(Rank), y = `GC Content (%)`, fill = factor(Rank))) +
  geom_bar(stat = "identity") +
  labs(title = "GC Content (%)", x = "Guide Rank", y = "GC Content (%)") +
  theme_minimal() +
  theme(legend.position = "none")

grid.arrange(p1, p2, ncol = 2)

Figure 5 presents a comparison of two key metrics for the four ranked guide RNAs (gRNAs): Doench ’16 Efficiency (left panel) and GC Content (right panel). The left panel highlights the efficiency scores. Rank 1 shows the highest efficiency (above 70), indicating strong targeting potential, followed by Ranks 2 and 3 with moderate efficiency, and Rank 4 with the lowest efficiency. The right panel shows the GC Content percentages, where all gRNAs are within the optimal range (40%-50%) for stability and binding, with Rank 4 exhibiting the highest GC content.