We’ve been using Methyl Extract for our SNP and Methylation status calling, and it seems to be working well, except the methylation calling output is not that great. We plan to use a modeling software called MACAU, which expects two count files, one a matrix of methylation counts by location and sample, the second a matrix of total reads by location and sample. Methyl Extract provides this in a single matrix, but both counts are split by which strand they came from.

The goal of this notebook is to figure out how to combine the different samples in two ways. First, We need all of the samples to be in a single file, and second, we need the +/- strands summed so we have total counts by sample. This would seem easy at first glance, but there’s the additional problem that each sample may not have had a read happen at each position in each scaffold, so it’s likely there will be dissimilar numbers of rows per sample file.

write.csv(test.meth5, file = "methylation", sep = " ")
attempt to set 'sep' ignored
colnames(test.total7) <- c("loc", "EPI-104", "EPI-111", "EPI-113", "EPI-119")
Error in colnames(test.total7) <- c("loc", "EPI-104", "EPI-111", "EPI-113",  : 
  object 'test.total7' not found
---
title: "Merging Methylation Count files by sample."
output: html_notebook
---

We've been using Methyl Extract for our SNP and Methylation status calling, and it seems to be working well, except the methylation calling output is not that great. We plan to use a modeling software called MACAU, which expects two count files, one a matrix of methylation counts by location and sample, the second a matrix of total reads by location and sample. Methyl Extract provides this in a single matrix, but both counts are split by which strand they came from.

The goal of this notebook is to figure out how to combine the different samples in two ways. First, We need all of the samples to be in a single file, and second, we need the +/- strands summed so we have total counts by sample. This would seem easy at first glance, but there's the additional problem that each sample may not have had a read happen at each position in each scaffold, so it's likely there will be dissimilar numbers of rows per sample file.


```{r}
library(readr)

## Reads in some of the sample files, it also changes any "." characters to NAs.
EPI_111_CG <- read_delim("~/Documents/Roberts Lab/Hollie-Geoduck/Merge-test/VCF-Test/EPI-111-CG.output", "\t", escape_double = FALSE, trim_ws = TRUE, na = ".")

EPI_113_CG <- read_delim("~/Documents/Roberts Lab/Hollie-Geoduck/Merge-test/VCF-Test/EPI-113-CG.output", "\t", escape_double = FALSE, trim_ws = TRUE, na = ".")

EPI_104_CG <- read_delim("~/Documents/Roberts Lab/Hollie-Geoduck/Merge-test/VCF-Test/EPI-104-CG.output", "\t", escape_double = FALSE, trim_ws = TRUE, na = ".")

EPI_119_CG <- read_delim("~/Documents/Roberts Lab/Hollie-Geoduck/Merge-test/VCF-Test/EPI-119-CG.output", "\t", escape_double = FALSE, trim_ws = TRUE, na = ".")


## Sums the meth and coverage columns between both strands, since MethExtract outputs them separate
EPI_104_CG[,10] <- rowSums(cbind(EPI_104_CG$`Watson METH`, EPI_104_CG$`Crick METH`), na.rm = TRUE)
EPI_104_CG[,11] <- rowSums(cbind(EPI_104_CG$`Watson COVERAGE`, EPI_104_CG$`Crick COVERAGE`), na.rm = TRUE)
EPI_104_CG[,12] <- "EPI_104"

EPI_111_CG[,10] <- rowSums(cbind(EPI_111_CG$`Watson METH`, EPI_111_CG$`Crick METH`), na.rm = TRUE)
EPI_111_CG[,11] <- rowSums(cbind(EPI_111_CG$`Watson COVERAGE`, EPI_111_CG$`Crick COVERAGE`), na.rm = TRUE)
EPI_111_CG[,12] <- "EPI_111"

EPI_113_CG[,10] <- rowSums(cbind(EPI_113_CG$`Watson METH`, EPI_113_CG$`Crick METH`), na.rm = TRUE)
EPI_113_CG[,11] <- rowSums(cbind(EPI_113_CG$`Watson COVERAGE`, EPI_113_CG$`Crick COVERAGE`), na.rm = TRUE)
EPI_113_CG[,12] <- "EPI_113"

EPI_119_CG[,10] <- rowSums(cbind(EPI_119_CG$`Watson METH`, EPI_119_CG$`Crick METH`), na.rm = TRUE)
EPI_119_CG[,11] <- rowSums(cbind(EPI_119_CG$`Watson COVERAGE`, EPI_119_CG$`Crick COVERAGE`), na.rm = TRUE)
EPI_119_CG[,12] <- "EPI_119"


## Names the columns
colnames(EPI_104_CG)[10:12] <- c("TotalMeth104", "TotalCov104", "SampleName")
colnames(EPI_111_CG)[10:12] <- c("TotalMeth111", "TotalCov111", "SampleName")
colnames(EPI_113_CG)[10:12] <- c("TotalMeth113", "TotalCov113", "SampleName")
colnames(EPI_119_CG)[10:12] <- c("TotalMeth119", "TotalCov119", "SampleName")





## Merges the different samples into a single file. the merge() function in R only allows two data frames to be merged at a time, so they
## have to be done sequentially. A more efficient soltuion will probably need to be found prior to analyzing the full sample set
test.meth <- merge(EPI_104_CG[,c(1, 2, 3, 10)], EPI_111_CG[,c(1, 2, 3, 10)], by = names(EPI_111_CG[,c(1,2,3)]), all = T)

test.meth2 <- merge(test.meth, EPI_113_CG[,c(1, 2, 3, 10)], all = T)

test.meth3 <- merge(test.meth2, EPI_119_CG[,c(1, 2, 3, 10)], all = T)

write.table(test.meth5, file = "methylation.txt", sep = " ")


test.total1 <- merge(EPI_104_CG[,c(1, 2, 3, 11)], EPI_111_CG[,c(1, 2, 3, 11)], by = names(EPI_111_CG[,c(1,2,3)]), all = T)

test.total2 <- merge(test.meth, EPI_113_CG[,c(1, 2, 3, 11)], all = T)

test.total3 <- merge(test.meth2, EPI_119_CG[,c(1, 2, 3, 11)], all = T)

head(test.meth3)
```

```{r}

## Makes a new label column, concatenating the scaffold name and location. Also drops all of the other non-important columns
test.meth4 <- as.data.frame(paste0(test.meth3[,1], "-", test.meth3[,2]))

test.meth4[,c(2,3,4)] <- test.meth3[,c(4,5,6)]
colnames(test.meth4)[1] <- "Scaff"

test.meth5 <- as.matrix(test.meth4[,c(2,3,4)])


impute.knn(data = test.meth5)

write.csv(test.meth5, file = "test.meth5.csv")


## Calculates percent missing by sample, as well as total missing. ~40% 

perc.missing.s1 <- sum(is.na(test.meth3[,4])) / nrow(test.meth3)
perc.missing.s2 <- sum(is.na(test.meth3[,5])) / nrow(test.meth3)
perc.missing.s3 <- sum(is.na(test.meth3[,6])) / nrow(test.meth3)
perc.missing.s4 <- sum(is.na(test.meth3[,7])) / nrow(test.meth3)

total.missing <- sum(!is.na(test.meth3[,c(4,5,6,7)])) / (4 * nrow(test.meth3))


## After talking to Steven, we decided to drop all loci that did not have 10x coverage in every sample 
test.meth3[complete.cases(test.meth3[, c(4,5,6,7)]),] -> test.meth4

test.meth5 <- test.meth4[apply(test.meth4[, c(4,5,6,7)], MARGIN = 1, function(x) all(x > 10)), ]

test.total3[complete.cases(test.total3[, c(4,5,6,7)]),] -> test.total4

test.total5 <- test.meth4[apply(test.total4[, c(4,5,6,7)], MARGIN = 1, function(x) all(x > 10)), ]





## Makes the same concatenated label column again, oops. 
test.meth5[,8] <- paste0(test.meth5[,1], "-", test.meth5[,2])

test.total5[,8] <- paste0(test.total5[,1], "-", test.total5[,2])

## Removes all loci present in the total coverage file, but not in the methylation file.
test.total6 <- test.total5[test.total5[,8] %in% test.meth5[,8],]

## Creates the final column for Macau input, with labels, then writes them to white-space separated files. 

test.meth6 <- cbind(test.meth5[,8], test.meth5[,4], test.meth5[,5], test.meth5[,6], test.meth5[,7])

colnames(test.meth6) <- c("loc", "EPI-104", "EPI-111", "EPI-113", "EPI-119")

test.total7 <- cbind(test.total6[,8], test.total6[,4], test.total6[,5], test.total6[,6], test.total6[,7])
colnames(test.total7) <- c("loc", "EPI-104", "EPI-111", "EPI-113", "EPI-119")

write.table(test.meth6, file = "meth.txt", sep = " ")
write.table(test.total7, file = "totalcov.txt", sep = " ")

```


