Preceding work for the conservation analysis is omitted aside from histograms plotting window’s scores and percent scores. Here I attempt to define some characteristics of highly conserved R-loops and distinguish them from those which are not.

In the analysis I work with a set of 10 kilo-basepair windows - rl_cons. Each window is scored based on the number of r-loops which it overlaps with.

Score indicates the number of r-loops which overlap with each window, and pct_cons - or percent score - indicates the percentage of r-loops which overlap each window. There are 252 r-loops which were used to arrive at these metrics.

Metaplot Analysis

Metaplots are used to plot the average profile of a set of peaks. In this case, the average distance from a transcription start site (TSS) for a set of windows is displayed.

Below are the metaplots for three subsets of rl_cons; the whole set, windows with a percent score greater than 60, and top 12 scoring windows.

#preparing annotations
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)

#preparing rl_cons_60; windows with having more than 60 percent of Rloops aligned to it
limit <- 60
rl_cons_60 <- rl_cons[which(rl_cons$pct_cons >limit)]

tagMatrix_rl_cons <- getTagMatrix(rl_cons, windows = promoter)
tagMatrix_rl_cons_60 <- getTagMatrix(rl_cons_60,windows = promoter)
tagMatrix_top_ranges <- getTagMatrix(rl_cons[top_ranges], windows = promoter)

#creating metaplots for the three groups above
plotAvgProf(tagMatrix_rl_cons, xlim=c(-3000, 3000), xlab="Genomic Region (5'->3') rl_cons")
plotAvgProf(tagMatrix_rl_cons_60, xlim=c(-3000,3000), xlab = "Genomic Region (5'->3') rl_cons_60")
plotAvgProf(tagMatrix_top_ranges,xlim = c(-3000,3000), xlab = "Genomic Region (5' -> 3') top_ranges")

#exclude windows with a percent score less than 10
rl_cons <- rl_cons[which(rl_cons$pct_cons >=  10)]

splitVia <- function(peaks, percent, mode){
  #function takes:
    #GRanges object with percent conservation, 
    #percent interval to split on,
    #mode (either percent "interval" or "strata")
  #return a GRanges list 
  if(percent == 0){return("Must pass an integer greater than 0")}
  
  if(mode == "interval"){ 
    #if user chooses interval, split peaks according to percent intervals
    return (split(peaks, ceiling(peaks$pct_cons/percent)))
  } else if (mode == "strata"){
    #Otherwise, if user chooses strata, split according to number of strata requested -- percent
    lim <- length(peaks)
    denominator <- as.integer(lim/percent)
    return(split(peaks,ceiling(seq_along(peaks)/denominator)))
  }
  
}

rankWindows <- function(peaks){
  #function takes a GRanges object with percent conservation
  #returns the GRanges object, ranked according to percent score
  peaksDF <- as.data.frame(peaks)
  return(makeGRangesFromDataFrame(peaksDF%>%arrange(pct_cons), keep.extra.columns = TRUE))
}

perm_test <- function(peaks, cores, genome, rlfs,chrom_sizes) {
  
  peaks <- regioneR::toGRanges(peaks)
  rlfs <- regioneR::toGRanges(rlfs)
  # Prevent stranded assignment
  GenomicRanges::strand(rlfs) <- "*"
  rlfs <- GenomicRanges::reduce(rlfs)
  # Mask makes no difference. Turns out R-ChIP is finding genuine R-loops.
  ##chrom_sizes <- readr::read_tsv(paste0('http://hgdownload.soe.ucsc.edu/goldenPath/',
  ##                                      genome, '/bigZips/', genome, '.chrom.sizes'),
  ##                               col_names = FALSE)
  pt <- regioneR::permTest(A=peaks, B=rlfs, ntimes=10000, force.parallel = TRUE, 
                           genome=as.data.frame(chrom_sizes), mc.cores=cores, allow.overlaps = FALSE,
                           randomize.function=regioneR::circularRandomizeRegions, 
                           evaluate.function=regioneR::numOverlaps, alternative = "greater")
  lz <- regioneR::localZScore(pt=pt, A=peaks, B=rlfs, window = 100000, step = 2500) 
  return(list(pt,lz))
}

addGenes <- function(annoData, peaks){
  #function takes a txdb object and GRanges object
  #returns the GRanges object with gene Ids added
  peak_anno <- annotatePeakInBatch(peaks, 
                               AnnotationData=toGRanges(annoData), 
                               output="overlapping",
                               bindingRegion=c(-3000, 3000))
  peak_anno_2 <- addGeneIDs(peak_anno,
                        "org.Hs.eg.db",
                        IDs2Add = c("symbol","genename"),
                        feature_id_type = "entrez_id")
  names(peak_anno_2) = make.names(names(peak_anno_2), unique = TRUE)
  return(peak_anno_2)
}

gene_table<- function(rLoopRanges){
  
  #function takes GRanges object and generates a datatable,
  #uncomment the block of code below to generate csv and bed file of rLoopRanges
  currentCaption <- paste(round(min(rLoopRanges$pct_cons)),"-",round(max(rLoopRanges$pct_cons)),"% score",sep= "")
  currentData <- datatable(subset(as.data.frame(rLoopRanges),select = -c(width,strand,score,peak)),
                            caption = currentCaption,
                            extensions = 'FixedColumns',
                            options = list(
                            dom = 't',
                            scrollX = TRUE,
                            fixedColumns = list(leftColumns = 2, rightColumns = 2),
                            scrollCollapse = TRUE))
  
  #files should exist in repo already
  # current_file <- paste("gene_datatables/",currentCaption,".bed",sep="")
  # current_file_csv <- paste("gene_datatables/",currentCaption,".csv",sep="")
  # if(!file.exists(current_file)){
  #   export.bed(rLoopRanges,current_file)
  # }
  # if(!file.exists(current_file_csv)){
  #   write.csv(as.data.frame(rLoopRanges), current_file_csv)
  # }
  
  currentData
  
}

#create a granges list by splitting according to 10 percent intervals
rl_list<-splitVia(rl_cons, 10, "interval")

names(rl_list) <- c("10-20% score",
                    "20-30% score",
                    "30-40% score",
                    "40-50% score",
                    "50-60% score",
                    "60-70% score",
                    "70-80% score",
                    "80-87% score")

rl_listTagMatricies <- lapply(rl_list,getTagMatrix,windows=promoter)

As expected, the metaplot for all peaks produces a very noisy plot. The other two metaplots are more like those which we might see when analyzing protein coding genes (like NFE2L2 or BCRA1). However, the peaks are more normally distributed around the TSS and produce a curve which is “flatter” than what a protein coding gene might produce.

After transcription, if the nascent RNA had been modified, (ie. spliced, capped and tailed), the annealing of the RNA to the DNA template would displace regions which had previously coded the premature RNA’s exons. These displaced regions form loops which are known as R-loops[1].

R-loops will form across the whole length of a gene being transcribed, and so r-loops will align at windows close to a TSS, but also in surrounding regions since the DNA template for these genes may span at least thousands of base pairs. This may explain the “flatter” curve presented by the metaplots of rl_cons_60 and the top 12 ranges.

Knowing this, I would expect the see groups of windows with higher percent scores present increasingly flat curves - if more peaks align to the higher scoring windows, then there should be more peaks which are further upstream from the TSS. Windows with lower percent scores would produce curves which are just noise, as in the first meta plot.

I repeat the process for percent levels, where each level increases 10 points. Windows with a percent score less than 10% are omitted from this point forward.

returntagmatrix <- TRUE 
upsetplots <- FALSE
permfigs <- FALSE

Percent metaplots

10-20% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

20-30% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

30-40% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

40-50% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

50-60% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

60-70% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

70-80% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

80-87% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

X labels describe the range of percent score of each plot.

My hypothesis was incorrect. Instead of the curve becoming more flat as I plotted higher scoring windows, each group after 40% showed a simlar curve, normally distributed around the TSS. The same remarks hold true for the higher scoring groups, that peaks would be found at and near the TSS. But as the percent score increases, there doesn’t appear to be any change in where peaks are located.

I was also incorrect with lower scoring windows. Between percent scores of 0-39%, curves are inversions of what is seen with higher scoring peaks. It seems then that lesser conserved r-loops are unlikely to be found near the TSS of a gene and more conserved r-loops are more likely to be found near the TSS.

Strata metaplots

Following this, the whole set of windows were ranked in ascending order of percent score, then split into 10 groups, each with an equal number of windows. Metaplots were then generated from these new groups.

10-11% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

11-13% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

13-14% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

14-17% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

17-20% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

20-23% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

23-28% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

28-35% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

35-46% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

46-86% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

The same trend shown by the metaplots of the percent level groups is true for the group of strata as well; lower percent scoring windows tend to be further from the TSS compared to higher percent scoring windows. However, a large portion of the whole set of windows are located hundreds of base pairs from a TSS.

This raises the question of whether this large portion of windows, as well as the whole set of windows, are located near r-loop forming regions(RLFRs). Given that - unlike lower percent scoring windows - higher percent scoring windows tend to be close to a TSS, could there exist a discrepancy between higher and lower scoring windows in their relationship to RLFRs?

Permutation tests

To explore the previous question, a permutation test over rl_cons is run against a region set describing r-loop forming regions. This was done first on the whole set of windows,

As well as each strata.

Figures are displayed in ascending order of percent score:

10-11% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

11-13% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

13-14% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

14-17% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

17-20% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

20-23% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

23-28% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

28-35% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

35-46% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

46-86% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

In reviewing the results of the test for rl_cons it can no longer be assumed that there is no relationship between the whole set of windows and r-loop forming regions (given a significance threshold of 0.05). The same can also be said for each strata in the ranked set.

The z score tends to increase with the percent score of each strata.

Local z score across the first six strata (10-23% score) appear to show some fluctuation but there is no indication that the exact location of these windows have an association with the set of r-loop forming regions. This could be due to the large size of the windows. The remaining 4 strata, with the last one in particular begin to show a notable peak in the local z-score, which may indicate a more legitimate relationship between these windows and the set of RLFRs.

Peak annotation and upset plots

Following this, I annotated the windows (according to percent levels) and used upset plots to display the data. From our understanding of R-loops, I would expect that higher scoring windows would have a greater amount of peaks which span all of 3’ and 5’ untranslated regions (UTR), promoter, exons and introns. Lower scoring groups would have most peaks overlapping 1 or two of these regions.

permfigs <- FALSE 
upsetplots <- TRUE

10-20% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

## Warning: Removed 6 rows containing non-finite values (stat_count).

20-30% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

## Warning: Removed 1 rows containing non-finite values (stat_count).

30-40% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

40-50% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

50-60% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

60-70% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

70-80% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

80-87% score

plotAvgProf(setNow,
            xlim = c(-3000,3000),
            xlab = paste(names(setNow)))

annotateAndUpset(setNow)

My hypothesis seems to hold true per the upset plots; higher scoring windows have more peaks which span the 5 relevant regions (introns, exons, promoters, 3’ and 5’ UTRs), whereas lower scoring peaks have fewer peaks spanning the 5 regions. This divide between higher and lower scoring windows is the same as the meta plots - after a 40% score, there is appears to be a divide in the trends which each group shows.

Gene data tables

Data tables storing genetic information (start, end, width, name and description) for genes contained by each window are stored in the repository and are too large to display here.

Data tables were generated for each percent level and strata

library(enrichR)
setEnrichrSite("Enrichr")
websiteLive<- TRUE

if(is.null(listEnrichrDbs()))
  print("DATABASES ARE NULL")
  websiteLive<-FALSE

Enrichment analysis (percent levels)

Using the enrichr tool, we learn from pathway enrichment that groups which have a higher percent score tend to be associated with characteristics expected of r-loops.

Links to enrichment results via the enrichr website are included under each tab.

dataLst0 <- list(
  "10-20% Conservation" = c(rlAnnoList[[1]]$symbol),
  "20-30% Conservation" = c(rlAnnoList[[2]]$symbol),
  "30-40% Conservation" = c(rlAnnoList[[3]]$symbol),
  "40-50% Conservation" = c(rlAnnoList[[4]]$symbol),
  "50-60% Conservation" = c(rlAnnoList[[5]]$symbol),
  "60-70% Conservation" = c(rlAnnoList[[6]]$symbol),
  "70-80% Conservation" = c(rlAnnoList[[7]]$symbol),
  "80-87% Conservation" = c(rlAnnoList[[8]]$symbol)
)

enrichLinksLevels <- lapply(names(dataLst0), function(group) {
  genes<- dataLst0[[group]]
  response <- httr::POST(url = 'https://maayanlab.cloud/Enrichr/addList', body = list(
    'list' = paste0(genes, collapse = "\n"),
    'description' = group
  ))
  jsonlite::fromJSON(httr::content(response, as = "text"))
})
names(enrichLinksLevels) <- names(dataLst0)
#for(i in enrichLinks){print(paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=",i$shortId))}

10-20% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 10-20% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

20-30% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 20-30% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

30-40% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 30-40% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

40-50% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 40-50% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

50-60% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 50-60% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

60-70% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 60-70% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

70-80% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 70-80% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

80-87% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 80-87% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

Enrichment analysis (strata)

dataLst <- list(
  "10-13% Conservation" = c(rlAnnoRankedList[[1]]$symbol),
  "13-17% Conservation" = c(rlAnnoRankedList[[2]]$symbol),
  "17-22% Conservation" = c(rlAnnoRankedList[[3]]$symbol),
  "22-27% Conservation" = c(rlAnnoRankedList[[4]]$symbol),
  "27-33% Conservation" = c(rlAnnoRankedList[[5]]$symbol),
  "33-39% Conservation" = c(rlAnnoRankedList[[6]]$symbol),
  "39-46% Conservation" = c(rlAnnoRankedList[[7]]$symbol),
  "46-54% Conservation" = c(rlAnnoRankedList[[8]]$symbol),
  "54-63% Conservation" = c(rlAnnoRankedList[[9]]$symbol),
  "63-87% Conservation " = c(rlAnnoRankedList[[10]]$symbol)
)

enrichLinksStrata <- lapply(names(dataLst), function(group) {
  genes<- dataLst[[group]]
  response <- httr::POST(url = 'https://maayanlab.cloud/Enrichr/addList', body = list(
    'list' = paste0(genes, collapse = "\n"),
    'description' = group
  ))
  jsonlite::fromJSON(httr::content(response, as = "text"))
})
names(enrichLinksStrata) <- names(dataLst)
#for(i in enrichLinks){print(paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=",i$shortId))}

10-13% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 10-13% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

13-17% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 13-17% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

17-22% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 17-22% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

22-27% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 22-27% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

27-33% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 27-33% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

33-39% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 33-39% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

39-46% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 39-46% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

46-54% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 46-54% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

54-63% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 54-63% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

63-87% Conservation

permalink <- paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=", permalinkNow[1])

Pathway enrichment was calculated for 63-87% Conservation using enrichr. The permalink to this analysis is available here. Here are some selected pathways which represents these results:

dbsNow <- c("KEGG_2021_Human", "GO_Biological_Process_2018",
            "ChEA_2016", "BioPlanet_2019")
log <- capture.output(enres <- enrichr(genesNow, databases = dbsNow))
plotLst <- lapply(names(enres), function(enrNow) {
  enrNowData <- enres[[enrNow]]
  enrNowData %>%
    slice_max(order_by = Combined.Score, n = 10) %>%
    dplyr::mutate(logP = -log10(Adjusted.P.value)) %>%
    dplyr::mutate(Term = substr(Term, 1, 50)) %>%
    dplyr::mutate(Term = stringr::str_wrap(Term,  width = 40)) %>%
    arrange(Combined.Score) %>%
    dplyr::mutate(Term = factor(Term, levels = unique(Term))) %>%
    ggplot(aes(x = Term, y = Combined.Score, fill = logP)) +
    geom_col() +
    ggpubr::rotate() +
    xlab(NULL) +
    scale_y_continuous(expand = c(0, 1)) +
    labs(title = enrNow, subtitle = paste0("DRLA: ", groupNow)) +
    theme_classic(base_size = 11)
})
ggpubr::ggarrange(plotlist = plotLst, align = "v")

General observations:

Many of the higher ranking pathways within each conservation group play a role in cell differentiation or cell cycle progression. As well, many of these pathways are implicated in the regulation of tumor suppressing or proliferating pathways.

A trend within the overall data set is that there is a decrease in the number of enriched pathways as percent score decreases.

Discussion/Conclusion

Using the percent score as a metric for what is a highly conserved R loop proves to be an effective tool which aligns with what we currently know about r-loops. Windows with higher percent scores are more likely to share characteristics which are expected of r-loop regions. High percent scoring windows are more likely to be closer to the TSS of gene and still have a notable presence surrounding it. Conversely, lower percent scoring windows have little presence near the TSS. Upset plots show a similar distinction between lower percent scoring windows and high percent scoring windows; as the percent score increases there is a greater presence of peaks which overlap whole genes. This is expected as the region which r-loops span may overlap large portions of a gene on a template strand. Finally, we see that pathwys which are implicated in r-loop formation - such as the M phase and transcription and translational roles - are more present in windows with higher percent scores.

From here there are several avenues on which to continue this sort of work. Some next steps might be to analyze how r-loops are conserved among different species. Another might be to analyze the correlation between conserved genes and r-loops which may form from them - this idea comes the the presence of the Rho GTPase pathway in many percent score groups. More over, one could explore the presence of neuronal pathways in less conserved r-loops, as r-loops are implicated in many neurological diseases [1]. Finally, one could look at the difference in CG content between lesser and highly conserved r-loops. R-loops are found to form over unmethylated CpG islands [4] (which are characteristic of euchromatic regions) and so perhaps this may be another characteristic of highly conserved r-loops.

Although the analysis has been fruitful, the late addition of the permutation test points to a flaw in the current approach - the 10kb sized windows. The large size may be too encompassing of the r-loop data, which was used here, to indicate any biologically significant relationship with r-loop forming regions. Further work may explore smaller sized windows.

It is important to note that any trend extracted from analyzing the 70-100% percent scoring groups may have a bias due to their sample size (there are about 2000 windows with a percent score greater than 70).

Conservation Analysis Report

Jebriel Abdul

June 15, 2021

Metaplot Analysis

Percent metaplots

10-20% score

20-30% score

30-40% score

40-50% score

50-60% score

60-70% score

70-80% score

80-87% score

Strata metaplots

10-11% score

11-13% score

13-14% score

14-17% score

17-20% score

20-23% score

23-28% score

28-35% score

35-46% score

46-86% score

Permutation tests

10-11% score

11-13% score

13-14% score

14-17% score

17-20% score

20-23% score

23-28% score

28-35% score

35-46% score

46-86% score

Peak annotation and upset plots

10-20% score

20-30% score

30-40% score

40-50% score

50-60% score

60-70% score

70-80% score

80-87% score

Gene data tables

Enrichment analysis (percent levels)

10-20% Conservation

20-30% Conservation

30-40% Conservation

40-50% Conservation

50-60% Conservation

60-70% Conservation

70-80% Conservation

80-87% Conservation

Enrichment analysis (strata)

10-13% Conservation

13-17% Conservation

17-22% Conservation

22-27% Conservation

27-33% Conservation

33-39% Conservation

39-46% Conservation

46-54% Conservation

54-63% Conservation

63-87% Conservation

General observations:

Discussion/Conclusion