So Hollie gave us an interesting side note to look at when she did both an RRBS sequencing run as well as a whole genome sequencing run on Sample EPI-135. I wanted to just explore it a little bit and see if there are any interesting patterns.
The two main things that leap to my mind is A) how much more data do you get out of WG vs RRBS and B) how does that data differ (Does RRBS normally read a lower, equal, or greater methylation rate at sites shared between sequencing runs).
This may be all just satisfying my own curisoity, as I’m not sure if it’s possible to parse out the sequencing effects vs the RRBS/WG process effects but it’s fairly easy to do and may be interesting!
As always, first we load some packages and read in some data. Also, because I’m using .bed files, I add some column names to make life easier later on. I convert them from data frames to data tables to make operating on them more efficient also.
library(readr)
library(data.table)
WG <- read_delim("~/Documents/RobertsLab/Methylomev2/EPI-135WG_percmeth.bed",
"\t", escape_double = FALSE, trim_ws = TRUE, col_names = FALSE)
colnames(WG)[1:4] <- c("scaff", "start", "end", "meth")
WG <- as.data.table(WG)
RRBS <- read_delim("~/Documents/RobertsLab/Methylomev2/EPI-135_percmeth.bed",
"\t", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)
colnames(RRBS)[1:4] <- c("scaff", "start", "end", "meth")
RRBS <- as.data.table(RRBS)
nrow(WG)
[1] 991923
nrow(RRBS)
[1] 93373
Well, thats a pretty big difference. It looks like Whole Genome bisulfite sequencing yields an order of magnitude more methylated spots.
Next I weld together the Scaffold number and location to make an easily searchable vector.
RRBS$loc <- paste0(RRBS$scaff, "-", RRBS$start)
WG$loc <- paste0(WG$scaff, "-", WG$start)
length(which(WG$loc %in% RRBS$loc == TRUE))
[1] 82848
Interesting. So it looks like there’s ~10,000 loci found in the RRBS data set that aren’t found in the WG sequence.
Lets pull out just the WG loci that are found in the RRBS sample.
merge(RRBS, WG, by = "loc") -> temp
matches <- as.data.table(cbind(temp$loc, temp$meth.x, temp$meth.y))
colnames(matches)[1:3] <- c("loc", "rrbs", "wg")
matches$rrbs <- as.numeric(matches$rrbs)
matches$wg <- as.numeric(matches$wg)
matches$diff <- matches$rrbs - matches$wg
head(matches)
So now I’ve got a data table that has shared location, RRBS methylation %, WG methaylation %, and the difference between RRBS - WG
I’m going to start with some pretty basic stuff, looking at the min, max, median, mean, and determining outliers based upon the 1.5x IQR away from the 1st and 3rd quartiles.
summary(matches$diff)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.880000 -0.037080 0.000000 -0.007731 0.029410 0.769600
upper.outlier <- (IQR(matches$diff) * 1.5) + quantile(matches$diff)[[4]]
lower.outlier = quantile(matches$diff)[[2]] - (IQR(matches$diff) * 1.5)
length(matches$diff[which(matches$diff < lower.outlier)])
[1] 8579
length(matches$diff[which(matches$diff > upper.outlier)])
[1] 6691
boxplot(matches$diff)

So those ~15,000 loci would be considered outliers based upon and the boxplot supports that idea, that most fall well within the 0 difference range, but there are some to either side.
AFter talking to Steven, it looks like I should mention that the 135WG raw data file is ~10.7gb, while the RRBS EPI-135 file is ~ 1.4gb, so that 10x difference in data sounds more believable.
Steven also asked about difference in coverage, and I realized I can’t answer that from the .bedgraph files I’ve been using, as they’re just percet methylation by loci. Back to the CG.output files from MethylExtract.
Below in a very messy way I Read them in and combine the Watson and Crick strands in to single columns of Methylation and Total Coverage.
Next, I want to look at total coverage depth between the two samples. This includes all sites in both samples.
mean(as.numeric(EPI_135$totcov))
[1] 21.17662
mean(as.numeric(EPI_135WG$totcov))
[1] 37.99779
So whe we look at all sites, the Whole Genome has nearly 2x the coverage as the RRBS. Next lets just look at sites that co-occur.
As always, we start by attaching scaffold and start index.
Since this is just a quick glance, I won’t pretty things up.
summary(as.numeric(merged_135$totcov.x))
Min. 1st Qu. Median Mean 3rd Qu. Max.
11.0 12.0 14.0 21.4 19.0 3323.0
summary(as.numeric(merged_135$totcov.y))
Min. 1st Qu. Median Mean 3rd Qu. Max.
11.0 41.0 65.0 105.4 101.0 13120.0
boxplot(as.numeric(merged_135$totcov.x))

boxplot(as.numeric(merged_135$totcov.y))

Hm. That’s very odd. Everything looks fairly believable until the third quartile, and the maxes are way crazy. After talking to Steven it sort of sounds like this may either be an artifact of Bismark, MethylExtract, or some deficiency in our reference genome. Still, something interesting to look at!
---
title: "Exploring the EPI-135 vs. EPI-135WG samples methylation patterns"
output: html_notebook
---


So Hollie gave us an interesting side note to look at when she did both an RRBS sequencing run as well as a whole genome sequencing run on Sample EPI-135. I wanted to just explore it a little bit and see if there are any interesting patterns.

The two main things that leap to my mind is A) how much more data do you get out of WG vs RRBS and B) how does that data differ (Does RRBS normally read a lower, equal, or greater methylation rate at sites shared between sequencing runs). 

This may be all just satisfying my own curisoity, as I'm not sure if it's possible to parse out the sequencing effects vs the RRBS/WG process effects but it's fairly easy to do and may be interesting!

As always, first we load some packages and read in some data. Also, because I'm using .bed files, I add some column names to make life easier later on. I convert them from data frames to data tables to make operating on them more efficient also. 

```{r}
library(readr)
library(data.table)

WG <- read_delim("~/Documents/RobertsLab/Methylomev2/EPI-135WG_percmeth.bed", 
    "\t", escape_double = FALSE, trim_ws = TRUE, col_names = FALSE)

colnames(WG)[1:4] <- c("scaff", "start", "end", "meth")

WG <- as.data.table(WG)

RRBS <- read_delim("~/Documents/RobertsLab/Methylomev2/EPI-135_percmeth.bed", 
    "\t", escape_double = FALSE, col_names = FALSE, 
    trim_ws = TRUE)

colnames(RRBS)[1:4] <- c("scaff", "start", "end", "meth")

RRBS <- as.data.table(RRBS)

```


```{r}

nrow(WG)

nrow(RRBS)



```

Well, thats a pretty big difference. It looks like Whole Genome bisulfite sequencing yields an order of magnitude more methylated spots.

Next I weld together the Scaffold number and location to make an easily searchable vector. 

```{r}
RRBS$loc <- paste0(RRBS$scaff, "-", RRBS$start)

WG$loc <- paste0(WG$scaff, "-", WG$start)


```


```{r}

length(which(WG$loc %in% RRBS$loc == TRUE))

```

Interesting. So it looks like there's ~10,000 loci found in the RRBS data set that aren't found in the WG sequence. 

Lets pull out just the WG loci that are found in the RRBS sample.

```{r}
merge(RRBS, WG, by = "loc") -> temp

matches <- as.data.table(cbind(temp$loc, temp$meth.x, temp$meth.y))

colnames(matches)[1:3] <- c("loc", "rrbs", "wg")

matches$rrbs <- as.numeric(matches$rrbs)
matches$wg <- as.numeric(matches$wg)

matches$diff <- matches$rrbs - matches$wg


head(matches)
```

So now I've got a data table that has shared location, RRBS methylation %, WG methaylation %, and the difference between RRBS - WG


I'm going to start with some pretty basic stuff, looking at the min, max, median, mean, and determining outliers based upon the 1.5x IQR away from the 1st and 3rd quartiles. 
```{r}

summary(matches$diff)

upper.outlier <- (IQR(matches$diff) * 1.5) + quantile(matches$diff)[[4]]

lower.outlier = quantile(matches$diff)[[2]] - (IQR(matches$diff) * 1.5)

```


```{r}

length(matches$diff[which(matches$diff < lower.outlier)])
length(matches$diff[which(matches$diff > upper.outlier)])

boxplot(matches$diff)

```

So those ~15,000 loci would be considered outliers based upon and the boxplot supports that idea, that most fall well within the 0 difference range, but there are some to either side.


AFter talking to Steven, it looks like I should mention that the 135WG raw data file is ~10.7gb, while the RRBS EPI-135 file is ~ 1.4gb, so that 10x difference in data sounds more believable. 

Steven also asked about difference in coverage, and I realized I can't answer that from the .bedgraph files I've been using, as they're just percet methylation by loci. Back to the CG.output files from MethylExtract. 

Below in a very messy way I Read them in and combine the Watson and Crick strands in to single columns of Methylation and Total Coverage. 
```{r}

setwd("~/Documents/RobertsLab/Methylomev2/")
  
  temp <- read_delim("~/Documents/RobertsLab/Methylomev2/EPI-135_CG.output.txt", 
    "\t", escape_double = FALSE, trim_ws = TRUE, na = ".")
  
  temp <- na.omit(temp)
  
  temp.2 <- as.data.table(cbind(temp$`#CHROM`))
  temp.2$sloc <-temp$POS - 1
  temp.2$eloc <- temp$POS - 1
  
  temp.2$meth <- temp$`Watson METH` + temp$`Crick METH`
  temp.2$totcov <- temp$`Watson COVERAGE` + temp$`Crick COVERAGE`  
  temp.3 <- temp.2[temp.2$totcov > 10,]    

  EPI_135 <- as.data.table(cbind(temp.3$V1, temp.3$sloc, temp.3$eloc, temp.3$meth, temp.3$totcov))  
  colnames(EPI_135)[1:5] <- c("scaff", "start", "end", "meth", "totcov")
  
  
  
 temp <- read_delim(paste0("~/Documents/RobertsLab/Methylomev2/EPI-135WG_CG.output"), 
    "\t", escape_double = FALSE, trim_ws = TRUE, na = ".")
  
  temp <- na.omit(temp)
  
  temp.2 <- as.data.table(cbind(temp$`#CHROM`))
  temp.2$sloc <-temp$POS - 1
  temp.2$eloc <- temp$POS - 1
  
  temp.2$meth <- temp$`Watson METH` + temp$`Crick METH`
  temp.2$totcov <- temp$`Watson COVERAGE` + temp$`Crick COVERAGE`  
  temp.3 <- temp.2[temp.2$totcov > 10,]    

  EPI_135WG <- as.data.table(cbind(temp.3$V1, temp.3$sloc, temp.3$eloc, temp.3$meth, temp.3$totcov))  
  colnames(EPI_135WG)[1:5] <- c("scaff", "start", "end", "meth", "totcov")
```

```{r}

head(EPI_135)
head(EPI_135WG)
```

Next, I want to look at total coverage depth between the two samples. This includes all sites in both samples.

```{r}

mean(as.numeric(EPI_135$totcov))

mean(as.numeric(EPI_135WG$totcov))

```

So whe we look at all sites, the Whole Genome has nearly 2x the coverage as the RRBS. Next lets just look at sites that co-occur.

As always, we start by attaching scaffold and start index. 
```{r}

EPI_135$loc <- paste0(EPI_135$scaff, "-", EPI_135$start)

EPI_135WG$loc <- paste0(EPI_135WG$scaff, "-", EPI_135WG$start)

merged_135 <- merge(EPI_135, EPI_135WG, by = "loc") 

head(merged_135)

```

Since this is just a quick glance, I won't pretty things up.

```{r}

summary(as.numeric(merged_135$totcov.x))
summary(as.numeric(merged_135$totcov.y))

boxplot(as.numeric(merged_135$totcov.x))
boxplot(as.numeric(merged_135$totcov.y))
```

Hm. That's very odd. Everything looks fairly believable until the third quartile, and the maxes are way crazy. After talking to Steven it sort of sounds like this may either be an artifact of Bismark, MethylExtract, or some deficiency in our reference genome. Still, something interesting to look at!


