Geosmithia morbida resequencing: Visualizing Variant Patterns Across Genomes

This pipeline works towards some large-scale variant visualization options. I did some work with Circos pipelines but was unimpressed; circular plots think will become confusing with 22 strains. This begins with a table for a single strain, GM11, which includes all varaints per genome range. This data was generated using the bedtools coverage program configured to calculate variants per 10kbp. The table looks like such:

sample.name <- "GM11" #change this every time
GM.covA <- read.table("coverage.windows/GM11.coverage.txt") #import the coverage file
cov.col.names <- c("scaffold","start","end","count") #create a list of appropriate column names, not necessary but helpful
colnames(GM.covA) <- cov.col.names
head(GM.covA)

##     scaffold start   end count
## 1 scaffold_0     0 10000    41
## 2 scaffold_0 10000 20000    15
## 3 scaffold_0 20000 30000    23
## 4 scaffold_0 30000 40000     7
## 5 scaffold_0 40000 50000    11
## 6 scaffold_0 50000 60000    13

An individual genome can be plotted as a as a barplot (except with points instead of bars):

library(ggplot2)

GM.con <- GM.covA
GM.con$loc <- paste(GM.con$scaffold, GM.con$start, sep = "-")
p <- ggplot(GM.con, aes(x=loc, y=count))
p <- p + geom_point(size=1, alpha = 0.5, color = "darkblue")
p <- p + ggtitle(sample.name)
p <- p + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
p <- p + ylab("variants per 10kbp") + xlab("position in genome")
p <- p + theme(plot.title = element_text(hjust = .02, vjust = -7))
p <- p + theme(panel.border = element_rect(colour = "black", fill=NA, size=1))
p

This becomes a fairly clean way to plot variants across the course of the genome. The goal here is to get a sense of distribution patterns within and between species.

I would like to add a bar at the bottom representing the scaffolds, but there are just too many. this can be added in manually later I think. Also the question is will I be able to actually compare 22 strains at once?

Here is a heatmap as a more efficient alternative for visualizing this data. I have combined the variants per 10kb files into a single matrix for heatmap generation:

matrix.var <- read.table("coverage.windows/matrix.coverage.csv", header = T, row.names = 1)
matrix.loc <- matrix.var[,c(4:26)]
head(matrix.loc)

##                  GM11 GM17 GM20 GM25 GM39 GM59 GM64 GM104 GM108 GM118
## scaffold_0-10000   41   37   23   56   28   29   27   192    33    49
## scaffold_0-20000   15   12    8   38   17   22   14   172    40    43
## scaffold_0-30000   23   13   14   43   17   29   18   196    40    40
## scaffold_0-40000    7    5   14   43   15   11    9   124    38    49
## scaffold_0-50000   11   16    8   43    5   13   13   152    30    34
## scaffold_0-60000   13   11    6   45    7    9   10   165    42    55
##                  GM140 GM158 GM170 GM177 GM178 GM179 GM188 GM196 GM205
## scaffold_0-10000    55    48    56    54    47    68    35    26    21
## scaffold_0-20000    41    42    42    34    25    38    33    15    12
## scaffold_0-30000    40    45    40    40    36    40    26    19    16
## scaffold_0-40000    49    44    40    30     9    39    11     6     7
## scaffold_0-50000    33    33    35    40    16    34    13     9     9
## scaffold_0-60000    54    54    48    52    10    43     9    16    10
##                  GM216 GM219 GM236 GM307
## scaffold_0-10000    56    56    24    33
## scaffold_0-20000    48    33     9    13
## scaffold_0-30000    48    42    17    19
## scaffold_0-40000    45    34    11     8
## scaffold_0-50000    34    38    14    12
## scaffold_0-60000    49    57    10    10

Now I can build a heatmap. Note that this plot is very large: all of the data is embedded in this .html file. You can zoom and get actual counts!

This type of figure would need some cleanup for transition to a flat (non .html) plot, mainly the y-axis. Also, the sample grouping is irrelevant at th is point. Grouping them by metadata or by position in clusters would be interesting.

Clearly some patterns start to emerge. GM104 is crazy like always. It can be removed easily in a later iteration.

library(plotly)
matrix.loc$GM104 <- NULL
data <- as.matrix(matrix.loc)
hm <- plot_ly(x=colnames(data), y=rownames(data), z=data, type="heatmap",zmax = 200, zmin=0) %>%
  layout(autosize=F, width = 800, height = 1000, margin = c(50,50,50,50))
hm

this is nice, but the samples need to be clustered AND I need some sort of column labelling/categories that indicates the metadata

Start with a script from https://plot.ly/ggplot2/ggdendro-dendrograms/ which addresses adding dendrograms

library(ggplot2) library(ggdendro) library(plotly)

library(ggplot2)
library(ggdendro)
library(plotly)

#dendogram data
x <- as.matrix(scale(matrix.loc))
#dd.col <- as.dendrogram(hclust(dist(x)))
dd.row <- as.dendrogram(hclust(dist(t(x))))
dx <- dendro_data(dd.row)
#dy <- dendro_data(dd.col)

# helper function for creating dendograms
ggdend <- function(df) {
  ggplot() +
    geom_segment(data = df, aes(x=x, y=y, xend=xend, yend=yend)) +
    labs(x = "", y = "") + theme_minimal() +
    theme(axis.text = element_blank(), axis.ticks = element_blank(),
          panel.grid = element_blank())
}

# create the dendogram
px <- ggdend(dx$segments)

# determine the strain order
row.ord <- order.dendrogram(dd.row)

#reorder the strains and convert to matrix
matrix.reorder <- matrix.loc[row.ord]
data <- as.matrix(matrix.reorder)

# set up the template plot?
eaxis <- list(
  showticklabels = FALSE,
  showgrid = FALSE,
  zeroline = FALSE
)

p_empty <- plot_ly(filename="r-docs/dendrogram") %>%
  # note that margin applies to entire plot, so we can
  # add it here to make tick labels more readable
  layout(margin = list(l = 200),
         xaxis = eaxis,
         yaxis = eaxis)

## Warning in plot_ly(filename = "r-docs/dendrogram"): Ignoring filename. Use
## `plotly_POST()` if you want to post figures to plotly.

#create the heatmap
hm <- plot_ly(x=colnames(data), y=rownames(data), z=data, type="heatmap",zmax = 200, zmin=0) %>%
  layout(autosize=F, width = 800, height = 1000, margin = c(50,50,50,50)) %>% 
  layout(legend = list(x = 100, y = 0.5))

## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

#place plots into the template
hm.dendro <- subplot(px, hm, nrows = 2, margin = 0.001)

## Warning: Can only have one: config

this is ugly; there does not appear to be a way to make the subplots different sizes, nor can I find a way to move the legend.

Try another solution, a package called heatmaply

library(ggplot2)
library(plotly)
library(heatmaply)

## Loading required package: viridis

## Loading required package: viridisLite

## 
## ======================
## Welcome to heatmaply version 0.15.2
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================

hcols <- colorRampPalette(c("#2c0e5e","#29b7c6","#8ad657","#ffff00","#fcff82"))


hm <- heatmaply(matrix.loc, Rowv = FALSE, titleY = FALSE, 
          plot_method = "plotly", 
          showticklabels = c(T,F),
          col = hcols(256))
hm

This looks decent now. How can I add in some sort of metadata color coding?

library(ggplot2)
library(plotly)
library(heatmaply)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#get the metadata file which has been amended with hex color codes to go with metadata
g.metadata <- read.csv("../SNP.tree/metadata.simple.csv", stringsAsFactors = F)
loc.col <- g.metadata[c(1,4)]


hcols <- colorRampPalette(c("#2c0e5e","#29b7c6","#8ad657","#ffff00","#fcff82"))

#compute heatmat in order to get the dendrogram
hm <- heatmaply(matrix.loc, Rowv = FALSE, titleY = FALSE, 
          plot_method = "plotly", 
          showticklabels = c(T,F),
          col = hcols(256))

#extract the final vector of strain names
hm.xaxis <- as.vector(hm$x$layout$xaxis$ticktext)
hm.xaxis <- data.frame(hm.xaxis, stringsAsFactors = F)
colnames(hm.xaxis)[1] <- "Samples"
hm.xaxis.col <- dplyr::left_join(hm.xaxis,loc.col, by="Samples")

#define vectors of colors

location.col <- as.vector(hm.xaxis.col$loc.col)

#rebuild heatmap with metadata
hm <- heatmaply(matrix.loc, Rowv = FALSE, titleY = FALSE, 
          plot_method = "plotly", 
          showticklabels = c(T,F),
          col = hcols(256),
          ColSideColors = location.col,
          column_text_angle = 90,
          width = 1600,
          height = 2000)

## Warning in heatmaply.heatmapr(hm, colors = colors, limits = limits,
## scale_fill_gradient_fun = scale_fill_gradient_fun, : The hover text for
## col_side_colors is currently not implemented (due to an issue in plotly).
## We hope this would get resolved in future releases.

hm

This is fine except that the colors do not display correctly. I will have to brute force change colors and make a legend.

library(markdown) library(rmarkdown) library(knitr) rpubsUpload(“G.morbida.variant.heat.map”, “variant.hm.plot.html”)

Geosmithia morbida resequencing: Visualizing Variant Patterns Across Genomes

Robert W. Murdoch

October 26, 2018