Geosmithia morbida resequencing: Visualizing Variant Patterns Across Genomes

This pipeline works towards some large-scale variant visualization options. I did some work with Circos pipelines but was unimpressed; circular plots think will become confusing with 22 strains. This begins with a table for a single strain, GM11, which includes all varaints per genome range. This data was generated using the bedtools coverage program configured to calculate variants per 10kbp. The table looks like such:

sample.name <- "GM11" #change this every time
GM.covA <- read.table("coverage.windows/GM11.coverage.txt") #import the coverage file
cov.col.names <- c("scaffold","start","end","count") #create a list of appropriate column names, not necessary but helpful
colnames(GM.covA) <- cov.col.names
head(GM.covA)

##     scaffold start   end count
## 1 scaffold_0     0 10000    41
## 2 scaffold_0 10000 20000    15
## 3 scaffold_0 20000 30000    23
## 4 scaffold_0 30000 40000     7
## 5 scaffold_0 40000 50000    11
## 6 scaffold_0 50000 60000    13

An individual genome can be plotted as a as a barplot (except with points instead of bars):

library(ggplot2)

GM.con <- GM.covA
GM.con$loc <- paste(GM.con$scaffold, GM.con$start, sep = "-")
p <- ggplot(GM.con, aes(x=loc, y=count))
p <- p + geom_point(size=1, alpha = 0.5, color = "darkblue")
p <- p + ggtitle(sample.name)
p <- p + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
p <- p + ylab("variants per 10kbp") + xlab("position in genome")
p <- p + theme(plot.title = element_text(hjust = .02, vjust = -7))
p <- p + theme(panel.border = element_rect(colour = "black", fill=NA, size=1))
p

This becomes a fairly clean way to plot variants across the course of the genome. The goal here is to get a sense of distribution patterns within and between species.

I would like to add a bar at the bottom representing the scaffolds, but there are just too many. this can be added in manually later I think. Also the question is will I be able to actually compare 22 strains at once?

Here is a heatmap as a more efficient alternative for visualizing this data. I have combined the variants per 10kb files into a single matrix for heatmap generation:

matrix.var <- read.table("coverage.windows/matrix.coverage.csv", header = T, row.names = 1)
matrix.loc <- matrix.var[,c(4:26)]
head(matrix.loc)

##                  GM11 GM17 GM20 GM25 GM39 GM59 GM64 GM104 GM108 GM118
## scaffold_0-10000   41   37   23   56   28   29   27   192    33    49
## scaffold_0-20000   15   12    8   38   17   22   14   172    40    43
## scaffold_0-30000   23   13   14   43   17   29   18   196    40    40
## scaffold_0-40000    7    5   14   43   15   11    9   124    38    49
## scaffold_0-50000   11   16    8   43    5   13   13   152    30    34
## scaffold_0-60000   13   11    6   45    7    9   10   165    42    55
##                  GM140 GM158 GM170 GM177 GM178 GM179 GM188 GM196 GM205
## scaffold_0-10000    55    48    56    54    47    68    35    26    21
## scaffold_0-20000    41    42    42    34    25    38    33    15    12
## scaffold_0-30000    40    45    40    40    36    40    26    19    16
## scaffold_0-40000    49    44    40    30     9    39    11     6     7
## scaffold_0-50000    33    33    35    40    16    34    13     9     9
## scaffold_0-60000    54    54    48    52    10    43     9    16    10
##                  GM216 GM219 GM236 GM307
## scaffold_0-10000    56    56    24    33
## scaffold_0-20000    48    33     9    13
## scaffold_0-30000    48    42    17    19
## scaffold_0-40000    45    34    11     8
## scaffold_0-50000    34    38    14    12
## scaffold_0-60000    49    57    10    10

Now I can build a heatmap. Note that this plot is very large: all of the data is embedded in this .html file. You can zoom and get actual counts!

This type of figure would need some cleanup for transition to a flat (non .html) plot, mainly the y-axis. Also, the sample grouping is irrelevant at th is point. Grouping them by metadata or by position in clusters would be interesting.

Clearly some patterns start to emerge. GM104 is crazy like always. It can be removed easily in a later iteration.

library(plotly)
data <- as.matrix(matrix.loc)
hm <- plot_ly(x=colnames(data), y=rownames(data), z=data, type="heatmap",zmax = 200, zmin=0) %>%
  layout(autosize=F, width = 800, height = 1000, margin = c(50,50,50,50))
hm

Geosmithia morbida resequencing: Visualizing Variant Patterns Across Genomes

Robert W. Murdoch

October 26, 2018