The below cells are for reading in and formatting the data in a single matrix. After the break is where the exciting stuff happens.


setwd("~/Documents/RobertsLab/Clusters")

list.files(path = ".")
library(data.table)
data.table 1.10.2
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com

EPI_103_percmeth_bedgraph <- read_delim("~/Documents/RobertsLab/Clusters/EPI-103-percmeth.bedgraph.txt", 
    "\t", escape_double = FALSE, col_names = FALSE, 
    col_types = cols(X4 = col_double()), 
    trim_ws = TRUE)

EPI_103_percmeth_bedgraph <- as.data.table(EPI_103_percmeth_bedgraph)

head(EPI_103_percmeth_bedgraph)

print(EPI_103_percmeth_bedgraph[7,])

EPI_103_DMRs <- EPI_103_percmeth_bedgraph[EPI_103_percmeth_bedgraph$X1 %in% DiffMethRegions$X1,]


EPI_103_DMRs$Loc <- paste0(EPI_103_DMRs$X1, "-", EPI_103_DMRs$X2)


head(EPI_103_DMRs)

EPI_103_DMRs <- EPI_103_DMRs[EPI_103_DMRs$Loc %in% DiffMethRegions$Loc,]

head(EPI_103_DMRs)

EPI_DMRs <- as.data.table(cbind(EPI_103_DMRs$Loc, EPI_103_DMRs$X4))



colnames(EPI_DMRs)[1] <- "DMRLoc"
colnames(EPI_DMRs)[2] <- "EPI_103"
head(EPI_DMRs)

EPI_DMRs$EPI_103 <- as.numeric(EPI_DMRs$EPI_103)

for(i in 2:length(sample.files))   {
  
  temp <- read_delim(paste0("~/Documents/RobertsLab/Clusters/", sample.files[i]), 
    "\t", escape_double = FALSE, col_names = FALSE, 
    col_types = cols(X4 = col_double()), 
    trim_ws = TRUE)

  temp <- as.data.table(temp)
  
  temp_DMRs <- temp[temp$X1 %in% DiffMethRegions$X1,]


  temp_DMRs$Loc <- paste0(temp_DMRs$X1, "-", temp_DMRs$X2)


  temp_DMRs <- temp_DMRs[temp_DMRs$Loc %in% DiffMethRegions$Loc,]

  EPI_DMRs$temp <- temp_DMRs$X4
  
  colnames(EPI_DMRs)[ncol(EPI_DMRs)] <- substr(sample.files[i], 1, 7)
  
  
}

head(EPI_DMRs)

First I make a new data frame that’s re-ordered such that the 4 samples by treatment are together, as opposed to ordered by sample number.

DMRs <- as.data.frame(cbind(EPI_DMRs$DMRLoc, EPI_DMRs[,c(6,7,10,11, 2, 3, 8, 9, 4, 5, 12, 13)]))

Everything after this is looking at different ways to identify the proper number of clusters (K) for a K-means clustering algorithm.

First I want to try a silhouette method using a K-medoid method. This clusters data points by distance from a “medoid”, which is a data point who’s average dissimilarity from all the other data points in the cluster is minimized. This differs from the typical K-means clustering algoritm because it uses a data point as a center, as opposed to a centroid in K-means, which is just a point in space which may not have a data point associated with it.

K-medoid is generally considered to be more robust against noise and outliers, so I felt this would be a good route to try.

library(cluster)
library(fpc)
pamk.best <- pamk(as.matrix(DMRs[1:267,2:13]))
cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
number of clusters estimated by optimum average silhouette width: 2 
plot(pam(as.matrix(DMRs[1:267,2:13]), pamk.best$nc), main = "Silhouette plot for K-medoid clustering")

The first plot shows general grouping for two “clusters” of data, but lets delve further in to it, and look at our data graphed with two means. I’m a little confused by the second graph. What should show up are grey bars with distances between points in groups, but instead it’s… nothing. I’m not sure if thats an artifact of the data, or if it’s indicitive of a lack of true grouping.

set.seed(123)
kmeans2 <- kmeans(DMRs[,2:13], centers = 2, nstart = 100)
plot(0, type="n", xlim=c(0,13), ylim=range(0, 1.01))
for(i in 1:nrow(DMRs))  {
points(1:12, DMRs[i,2:13], type="b", col="gray", lwd=0.5)
}
abline(v=c(4.5, 8.5), lty=2, col="blue") #add vertical lines between treatment groups
mtext(side=3, at=2, "Ambient", cex=0.5) #add text to each treatment group
mtext(side=3, at=6.5, "Low", cex=0.5) #add text to each treatment group
mtext(side=3, at=11, "Super Low", cex=0.5) #add text to each treatment group
points(1:12, kmeans2$centers[1,], col = "red", type = "b", lwd = 3)
points(1:12, kmeans2$centers[2,], col = "red", type = "b", lwd = 3)

Thats a little hard to view raw, so lets scale and center it and re-run the k-medoid test.

That… looks nearly the same. Which I guess makes sense, as K-medoid is supposed to be robust against scale and outlier issues. Again, I’m not sure whats going on with the second graph, but will try and figure it out.

scaled.kmeans2 <- kmeans(scaled.data, centers = 2, nstart = 100)
plot(0, type="n", xlim=c(0,13), ylim=range(-3, 3))
for(i in 1:nrow(scaled.data))  {
points(1:12, scaled.data[i,1:12], type="b", col="gray", lwd=0.5)
}
abline(v=c(4.5, 8.5), lty=2, col="blue") #add vertical lines between treatment groups
mtext(side=3, at=2, "Ambient", cex=0.5) #add text to each treatment group
mtext(side=3, at=6.5, "Low", cex=0.5) #add text to each treatment group
mtext(side=3, at=11, "Super Low", cex=0.5) #add text to each treatment group
points(1:12, scaled.kmeans2$centers[1,], col = "red", type = "b", lwd = 3)
points(1:12, scaled.kmeans2$centers[2,], col = "red", type = "b", lwd = 3)

A second way to calculate the number of clusters is via a gap statistic. I’m first going to try it using a K-means algorithm, and see how it fares. I feed it the scaled data, because the gap statistic behaves better with centered data, because scale has a large impact on the effectiveness of clustering.

Well. That’s not good. What the graph shows is that K never converges on some acceptable number (shown by the gap statistic never flattening out as K increases). Lets try it with a K-medoid approach, just for fun.

gapstat <- clusGap(scaled.data, FUN = pam, K.max = 100, B = 500)
Clustering k = 1,2,..., K.max (= 100): .. done
Bootstrapping, b = 1,2,..., B (= 500)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
.................................................. 150 
.................................................. 200 
.................................................. 250 
.................................................. 300 
.................................................. 350 
.................................................. 400 
.................................................. 450 
.................................................. 500 

Two and a half hours later… lets see the results!

plot(gapstat)

Welp, yeah. Again, no convergence on the gap statistic for any value of K between 0 and 100. Which means our data likely doesn’t cluster in any meaningful way.

Given that the gap statistics don’t converge for any meaningful number of groups, I’m going to say that there isn’t likely any clusters of effects?

---
title: "Clustering for fixed Day 10 data"
output: html_notebook
---

The below cells are for reading in and formatting the data in a single matrix. After the break is where the exciting stuff happens.

```{r}

setwd("~/Documents/RobertsLab/Clusters")

list.files(path = ".")

```

```{r}

library(data.table)
library(readr)
DiffMethRegions <- read_delim("~/Documents/RobertsLab/Clusters/Day10-10xCov-DiffMethRegions.bed", 
    "\t", escape_double = FALSE, col_names = FALSE, 
    trim_ws = TRUE)

DiffMethRegions$Loc <- paste0(DiffMethRegions$X1, "-", DiffMethRegions$X2)

head(DiffMethRegions)

sample.files <- list.files(path = ".", pattern = "*bedgraph.txt")

print(sample.files)

```


```{r}

EPI_103_percmeth_bedgraph <- read_delim("~/Documents/RobertsLab/Clusters/EPI-103-percmeth.bedgraph.txt", 
    "\t", escape_double = FALSE, col_names = FALSE, 
    col_types = cols(X4 = col_double()), 
    trim_ws = TRUE)

EPI_103_percmeth_bedgraph <- as.data.table(EPI_103_percmeth_bedgraph)

head(EPI_103_percmeth_bedgraph)

print(EPI_103_percmeth_bedgraph[7,])

```

```{r}

EPI_103_DMRs <- EPI_103_percmeth_bedgraph[EPI_103_percmeth_bedgraph$X1 %in% DiffMethRegions$X1,]


EPI_103_DMRs$Loc <- paste0(EPI_103_DMRs$X1, "-", EPI_103_DMRs$X2)


head(EPI_103_DMRs)

EPI_103_DMRs <- EPI_103_DMRs[EPI_103_DMRs$Loc %in% DiffMethRegions$Loc,]

head(EPI_103_DMRs)

EPI_DMRs <- as.data.table(cbind(EPI_103_DMRs$Loc, EPI_103_DMRs$X4))



colnames(EPI_DMRs)[1] <- "DMRLoc"
colnames(EPI_DMRs)[2] <- "EPI_103"
head(EPI_DMRs)

EPI_DMRs$EPI_103 <- as.numeric(EPI_DMRs$EPI_103)
```


```{r}

for(i in 2:length(sample.files))   {
  
  temp <- read_delim(paste0("~/Documents/RobertsLab/Clusters/", sample.files[i]), 
    "\t", escape_double = FALSE, col_names = FALSE, 
    col_types = cols(X4 = col_double()), 
    trim_ws = TRUE)

  temp <- as.data.table(temp)
  
  temp_DMRs <- temp[temp$X1 %in% DiffMethRegions$X1,]


  temp_DMRs$Loc <- paste0(temp_DMRs$X1, "-", temp_DMRs$X2)


  temp_DMRs <- temp_DMRs[temp_DMRs$Loc %in% DiffMethRegions$Loc,]

  EPI_DMRs$temp <- temp_DMRs$X4
  
  colnames(EPI_DMRs)[ncol(EPI_DMRs)] <- substr(sample.files[i], 1, 7)
  
  
}

head(EPI_DMRs)

```

---

First I make a new data frame that's re-ordered such that the 4 samples by treatment are together, as opposed to ordered by sample number.
```{r}

DMRs <- as.data.frame(cbind(EPI_DMRs$DMRLoc, EPI_DMRs[,c(6,7,10,11, 2, 3, 8, 9, 4, 5, 12, 13)]))
```

Everything after this is looking at different ways to identify the proper number of clusters (K) for a K-means clustering algorithm.

First I want to try a silhouette method using a K-medoid method. This clusters data points by distance from a "medoid", which is a data point who's average dissimilarity from all the other data points in the cluster is minimized. This differs from the typical K-means clustering algoritm because it uses a data point as a center, as opposed to a centroid in K-means, which is just a point in space which may not have a data point associated with it.

K-medoid is generally considered to be more robust against noise and outliers, so I felt this would be a good route to try.

```{r}
library(cluster)
library(fpc)
pamk.best <- pamk(as.matrix(DMRs[1:267,2:13]))

cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")

plot(pam(as.matrix(DMRs[1:267,2:13]), pamk.best$nc), main = "Silhouette plot for K-medoid clustering")

```

The first plot shows general grouping for two "clusters" of data, but lets delve further in to it, and look at our data graphed with two means. I'm a little confused by the second graph. What should show up are grey bars with distances between points in groups, but instead it's... nothing. I'm not sure if thats an artifact of the data, or if it's indicitive of a lack of true grouping.

```{r}

set.seed(123)

kmeans2 <- kmeans(DMRs[,2:13], centers = 2, nstart = 100)

plot(0, type="n", xlim=c(0,13), ylim=range(0, 1.01))

for(i in 1:nrow(DMRs))  {
points(1:12, DMRs[i,2:13], type="b", col="gray", lwd=0.5)

}
abline(v=c(4.5, 8.5), lty=2, col="blue") #add vertical lines between treatment groups
mtext(side=3, at=2, "Ambient", cex=0.5) #add text to each treatment group
mtext(side=3, at=6.5, "Low", cex=0.5) #add text to each treatment group
mtext(side=3, at=11, "Super Low", cex=0.5) #add text to each treatment group


points(1:12, kmeans2$centers[1,], col = "red", type = "b", lwd = 3)
points(1:12, kmeans2$centers[2,], col = "red", type = "b", lwd = 3)
```

Thats a little hard to view raw, so lets scale and center it and re-run the k-medoid test.

```{r}

scaled.data <- scale(as.matrix(DMRs[1:267,2:13]), center = TRUE, scale = TRUE)

scaled.pamk <- pamk(scaled.data)

cat("number of clusters estimated by optimum average silhouette width:", scaled.pamk$nc, "\n")

plot(pam(scaled.data, scaled.pamk$nc), main = "Silhouette plot for K-medoid clustering")

```

That... looks nearly the same. Which I guess makes sense, as K-medoid is supposed to be robust against scale and outlier issues. Again, I'm not sure whats going on with the second graph, but will try and figure it out.

```{r}

scaled.kmeans2 <- kmeans(scaled.data, centers = 2, nstart = 100)

plot(0, type="n", xlim=c(0,13), ylim=range(-3, 3))

for(i in 1:nrow(scaled.data))  {
points(1:12, scaled.data[i,1:12], type="b", col="gray", lwd=0.5)

}
abline(v=c(4.5, 8.5), lty=2, col="blue") #add vertical lines between treatment groups
mtext(side=3, at=2, "Ambient", cex=0.5) #add text to each treatment group
mtext(side=3, at=6.5, "Low", cex=0.5) #add text to each treatment group
mtext(side=3, at=11, "Super Low", cex=0.5) #add text to each treatment group


points(1:12, scaled.kmeans2$centers[1,], col = "red", type = "b", lwd = 3)
points(1:12, scaled.kmeans2$centers[2,], col = "red", type = "b", lwd = 3)

```

A second way to calculate the number of clusters is via a gap statistic. I'm first going to try it using a K-means algorithm, and see how it fares. I feed it the scaled data, because the gap statistic behaves better with centered data, because scale has a large impact on the effectiveness of clustering.

```{r}

gapstat <- clusGap(scaled.data, FUN = kmeans, nstart = 25, K.max = 100, B = 500)

plot(gapstat)

```

Well. That's not good. What the graph shows is that K never converges on some acceptable number (shown by the gap statistic never flattening out as K increases). Lets try it with a K-medoid approach, just for fun.


```{r}

gapstat <- clusGap(scaled.data, FUN = pam, K.max = 100, B = 500)



```

Two and a half hours later... lets see the results!

```{r}

plot(gapstat)

```

Welp, yeah. Again, no convergence on the gap statistic for any value of K between 0 and 100. Which means our data likely doesn't cluster in any meaningful way.

Given that the gap statistics don't converge for any meaningful number of groups, I'm going to say that there isn't likely any clusters of effects? 

