This pipeline works towards some large-scale variant visualization options. I did some work with Circos pipelines but was unimpressed; circular plots think will become confusing with 22 strains. This begins with a table for a single strain, GM11, which includes all varaints per genome range. This data was generated using the bedtools coverage program configured to calculate variants per 10kbp. The table looks like such:
sample.name <- "GM11" #change this every time
GM.covA <- read.table("coverage.windows/GM11.coverage.txt") #import the coverage file
cov.col.names <- c("scaffold","start","end","count") #create a list of appropriate column names, not necessary but helpful
colnames(GM.covA) <- cov.col.names
head(GM.covA)
## scaffold start end count
## 1 scaffold_0 0 10000 41
## 2 scaffold_0 10000 20000 15
## 3 scaffold_0 20000 30000 23
## 4 scaffold_0 30000 40000 7
## 5 scaffold_0 40000 50000 11
## 6 scaffold_0 50000 60000 13
An individual genome can be plotted as a as a barplot (except with points instead of bars):
library(ggplot2)
GM.con <- GM.covA
GM.con$loc <- paste(GM.con$scaffold, GM.con$start, sep = "-")
p <- ggplot(GM.con, aes(x=loc, y=count))
p <- p + geom_point(size=1, alpha = 0.5, color = "darkblue")
p <- p + ggtitle(sample.name)
p <- p + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
p <- p + ylab("variants per 10kbp") + xlab("position in genome")
p <- p + theme(plot.title = element_text(hjust = .02, vjust = -7))
p <- p + theme(panel.border = element_rect(colour = "black", fill=NA, size=1))
p
This becomes a fairly clean way to plot variants across the course of the genome. The goal here is to get a sense of distribution patterns within and between species.
I would like to add a bar at the bottom representing the scaffolds, but there are just too many. this can be added in manually later I think. Also the question is will I be able to actually compare 22 strains at once?
Here is a heatmap as a more efficient alternative for visualizing this data. I have combined the variants per 10kb files into a single matrix for heatmap generation:
matrix.var <- read.table("coverage.windows/matrix.coverage.csv", header = T, row.names = 1)
matrix.loc <- matrix.var[,c(4:26)]
head(matrix.loc)
## GM11 GM17 GM20 GM25 GM39 GM59 GM64 GM104 GM108 GM118
## scaffold_0-10000 41 37 23 56 28 29 27 192 33 49
## scaffold_0-20000 15 12 8 38 17 22 14 172 40 43
## scaffold_0-30000 23 13 14 43 17 29 18 196 40 40
## scaffold_0-40000 7 5 14 43 15 11 9 124 38 49
## scaffold_0-50000 11 16 8 43 5 13 13 152 30 34
## scaffold_0-60000 13 11 6 45 7 9 10 165 42 55
## GM140 GM158 GM170 GM177 GM178 GM179 GM188 GM196 GM205
## scaffold_0-10000 55 48 56 54 47 68 35 26 21
## scaffold_0-20000 41 42 42 34 25 38 33 15 12
## scaffold_0-30000 40 45 40 40 36 40 26 19 16
## scaffold_0-40000 49 44 40 30 9 39 11 6 7
## scaffold_0-50000 33 33 35 40 16 34 13 9 9
## scaffold_0-60000 54 54 48 52 10 43 9 16 10
## GM216 GM219 GM236 GM307
## scaffold_0-10000 56 56 24 33
## scaffold_0-20000 48 33 9 13
## scaffold_0-30000 48 42 17 19
## scaffold_0-40000 45 34 11 8
## scaffold_0-50000 34 38 14 12
## scaffold_0-60000 49 57 10 10
Now I can build a heatmap. Note that this plot is very large: all of the data is embedded in this .html file. You can zoom and get actual counts!
This type of figure would need some cleanup for transition to a flat (non .html) plot, mainly the y-axis. Also, the sample grouping is irrelevant at th is point. Grouping them by metadata or by position in clusters would be interesting.
Clearly some patterns start to emerge. GM104 is crazy like always. It can be removed easily in a later iteration.
library(plotly)
matrix.loc$GM104 <- NULL
data <- as.matrix(matrix.loc)
hm <- plot_ly(x=colnames(data), y=rownames(data), z=data, type="heatmap",zmax = 200, zmin=0) %>%
layout(autosize=F, width = 800, height = 1000, margin = c(50,50,50,50))
hm
this is nice, but the samples need to be clustered AND I need some sort of column labelling/categories that indicates the metadata
Start with a script from https://plot.ly/ggplot2/ggdendro-dendrograms/ which addresses adding dendrograms
library(ggplot2) library(ggdendro) library(plotly)
library(ggplot2)
library(ggdendro)
library(plotly)
#dendogram data
x <- as.matrix(scale(matrix.loc))
#dd.col <- as.dendrogram(hclust(dist(x)))
dd.row <- as.dendrogram(hclust(dist(t(x))))
dx <- dendro_data(dd.row)
#dy <- dendro_data(dd.col)
# helper function for creating dendograms
ggdend <- function(df) {
ggplot() +
geom_segment(data = df, aes(x=x, y=y, xend=xend, yend=yend)) +
labs(x = "", y = "") + theme_minimal() +
theme(axis.text = element_blank(), axis.ticks = element_blank(),
panel.grid = element_blank())
}
# create the dendogram
px <- ggdend(dx$segments)
# determine the strain order
row.ord <- order.dendrogram(dd.row)
#reorder the strains and convert to matrix
matrix.reorder <- matrix.loc[row.ord]
data <- as.matrix(matrix.reorder)
# set up the template plot?
eaxis <- list(
showticklabels = FALSE,
showgrid = FALSE,
zeroline = FALSE
)
p_empty <- plot_ly(filename="r-docs/dendrogram") %>%
# note that margin applies to entire plot, so we can
# add it here to make tick labels more readable
layout(margin = list(l = 200),
xaxis = eaxis,
yaxis = eaxis)
## Warning in plot_ly(filename = "r-docs/dendrogram"): Ignoring filename. Use
## `plotly_POST()` if you want to post figures to plotly.
#create the heatmap
hm <- plot_ly(x=colnames(data), y=rownames(data), z=data, type="heatmap",zmax = 200, zmin=0) %>%
layout(autosize=F, width = 800, height = 1000, margin = c(50,50,50,50)) %>%
layout(legend = list(x = 100, y = 0.5))
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
#place plots into the template
hm.dendro <- subplot(px, hm, nrows = 2, margin = 0.001)
## Warning: Can only have one: config
this is ugly; there does not appear to be a way to make the subplots different sizes, nor can I find a way to move the legend.
Try another solution, a package called heatmaply
library(ggplot2)
library(plotly)
library(heatmaply)
## Loading required package: viridis
## Loading required package: viridisLite
##
## ======================
## Welcome to heatmaply version 0.15.2
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================
hcols <- colorRampPalette(c("#2c0e5e","#29b7c6","#8ad657","#ffff00","#fcff82"))
hm <- heatmaply(matrix.loc, Rowv = FALSE, titleY = FALSE,
plot_method = "plotly",
showticklabels = c(T,F),
col = hcols(256))
hm
This looks decent now. How can I add in some sort of metadata color coding?
library(ggplot2)
library(plotly)
library(heatmaply)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#get the metadata file which has been amended with hex color codes to go with metadata
g.metadata <- read.csv("../SNP.tree/metadata.simple.csv", stringsAsFactors = F)
loc.col <- g.metadata[c(1,4)]
hcols <- colorRampPalette(c("#2c0e5e","#29b7c6","#8ad657","#ffff00","#fcff82"))
#compute heatmat in order to get the dendrogram
hm <- heatmaply(matrix.loc, Rowv = FALSE, titleY = FALSE,
plot_method = "plotly",
showticklabels = c(T,F),
col = hcols(256))
#extract the final vector of strain names
hm.xaxis <- as.vector(hm$x$layout$xaxis$ticktext)
hm.xaxis <- data.frame(hm.xaxis, stringsAsFactors = F)
colnames(hm.xaxis)[1] <- "Samples"
hm.xaxis.col <- dplyr::left_join(hm.xaxis,loc.col, by="Samples")
#define vectors of colors
location.col <- as.vector(hm.xaxis.col$loc.col)
#rebuild heatmap with metadata
hm <- heatmaply(matrix.loc, Rowv = FALSE, titleY = FALSE,
plot_method = "plotly",
showticklabels = c(T,F),
col = hcols(256),
ColSideColors = location.col,
column_text_angle = 90,
width = 1600,
height = 2000)
## Warning in heatmaply.heatmapr(hm, colors = colors, limits = limits,
## scale_fill_gradient_fun = scale_fill_gradient_fun, : The hover text for
## col_side_colors is currently not implemented (due to an issue in plotly).
## We hope this would get resolved in future releases.
hm
This is fine except that the colors do not display correctly. I will have to brute force change colors and make a legend.
library(markdown) library(rmarkdown) library(knitr) rpubsUpload(“G.morbida.variant.heat.map”, “variant.hm.plot.html”)