Overview

While there are many well-established pre-existing packages for processing variant/SNP data, such as DartR, pegas, adegenet, and vcfR, these are not well adapted to working with this particular data format. In this pipeline, we will use base R package to read and format the multi-vcf file that was output from the CLC variant calling system. The vcf file is actually quite a straight-forward tab-delimited table. However, there is some curiously formatted data such as coverage at each variant location, which is challenging to access without the aid of the vcfR package (we will use this later to quickly get some nice descriptive statistics of the variants). The goal of the present script is to move towards ordinations based on all variants. The DartR package does perform PCoA analysis of variant data packed into a “genlight” object; however, getting from vcf to genlight while maintaining metadata and nature of variants is extremely time-consuming and for the purposes of ordination, entirely unecessary. For more specialized population genetics approaches, DartR and pegas might become useful, but for now, I will ignore them.

Inputs: * multi.vcf file

Outputs: * variant.table.csv: a table which shows variant, in a parsable code (e.g. “scaffold_0_1000237|A/G”) and presence/absence (1 or 0) in each strain. This is the interpreted multi.vcf and is passed to the next script for interpretation.

1. Import the data

We will simply read in the multi-vcf file as a data table. Note that a vcf has several lines at the beginning that must be skipped.

mvcf <- read.csv("../all.vcf/multi_all.vcf", skip=13, sep="\t")

2. Perform some basic column name cleanup

colnames(mvcf)[1] <- "Scaffold" #make first colum name more appropriate
strain.names <- colnames(mvcf)[10:32] #pull strain names
clean.names <- gsub("\\_.*","",strain.names) #clean up strain names
colnames(mvcf)[10:32] <- clean.names #replace old with new strain names

3. Remove strain GM104

Strain GM104 has proven repeatedly to be a massive outlier. This manually removes it from further analysis.

mvcf$GM104 <- NULL
kable(mvcf[c(1:10),])
Scaffold POS ID REF ALT QUAL FILTER INFO FORMAT GM11 GM17 GM20 GM25 GM39 GM59 GM64 GM108 GM118 GM140 GM158 GM170 GM177 GM178 GM179 GM188 GM196 GM205 GM216 GM219 GM236 GM307
scaffold_12 1 . CT TC . . . GT:CLCAD2:DP . . . . . . . . . . . . . . 0/1:7,7:14 . . . . . . .
scaffold_14 1 . GG AA . . . GT:CLCAD2:DP . . . . . 1:0,11:18 . . . . . . . . . . . . . . . .
scaffold_17 1 . GG TT . . . GT:CLCAD2:DP . . . . . 1:0,6:14 . . . . . . . . . . . . . . . .
scaffold_26 1 . G GAG, GGGAG . . . GT:CLCAD2:DP . . . . . 1:0,8:10 . . . . . . . . . . . . . . 2:0,4:10 .
scaffold_29 1 . G A . . . GT:CLCAD2:DP . . . . . 1:0,7:14 . . . . . . . . . . . . . . . .
scaffold_43 1 . CC AA . . . GT:CLCAD2:DP . . . . . 1:0,8:20 . . . . . . . . . . . . . . . .
scaffold_46 1 . A T . . . GT:CLCAD2:DP . . . . . . . . . . 0/1:5,5:10 . . . . . . . . . . .
scaffold_51 1 . C A, G . . . GT:CLCAD2:DP . . . . . . . . . . . . 1/2:0,4,5:10 . . . . . . . . .
scaffold_52 1 . TG CC . . . GT:CLCAD2:DP . . . . . 1:0,12:15 . . . . . . . . . . . . . . . .
scaffold_6 1 . G C . . . GT:CLCAD2:DP . . . . . . . . . . 1:0,4:10 . . . . . . . . . . .

4. Creating locus names with scaffold, position, ref, and var details

Essentially, I want all of this information concatenated into a single “Locus” cell:

kable((mvcf)[1:10,c(1:2,4:5)])
Scaffold POS REF ALT
scaffold_12 1 CT TC
scaffold_14 1 GG AA
scaffold_17 1 GG TT
scaffold_26 1 G GAG, GGGAG
scaffold_29 1 G A
scaffold_43 1 CC AA
scaffold_46 1 A T
scaffold_51 1 C A, G
scaffold_52 1 TG CC
scaffold_6 1 G C

Becomes:

mvcf$locus <- paste ( paste(mvcf$Scaffold, mvcf$POS, sep="_") , paste(mvcf$REF, mvcf$ALT, sep="/") , sep="|")
kable((mvcf)[1:10,c(1:2,4:5,ncol(mvcf))])
Scaffold POS REF ALT locus
scaffold_12 1 CT TC scaffold_12_1|CT/TC
scaffold_14 1 GG AA scaffold_14_1|GG/AA
scaffold_17 1 GG TT scaffold_17_1|GG/TT
scaffold_26 1 G GAG, GGGAG scaffold_26_1|G/GAG, GGGAG
scaffold_29 1 G A scaffold_29_1|G/A
scaffold_43 1 CC AA scaffold_43_1|CC/AA
scaffold_46 1 A T scaffold_46_1|A/T
scaffold_51 1 C A, G scaffold_51_1|C/A, G
scaffold_52 1 TG CC scaffold_52_1|TG/CC
scaffold_6 1 G C scaffold_6_1|G/C

5. complex loci

There is a small fraction of loci at which two or more variants are observed in the population or even in a single strain. As an example, scaffold_64_1 we see both A -> G and A -> T depending on the strain.

Some of these are fairly dramatic, for example G -> A vs. G -> ACACACC.

These are monstrously difficult to parse. They make up only about 2.5% of the variants across the population however.

I am going to leave them in the set as-is. Kicking them out seems like a mistake, while teasing them apart seems like an un-necessary burden. The goal of this pipeline is to get a general profile and to work towards identifying highly-allelic genes, or foci of adaptation.

The big drawback of leaving them as-is is that clearly different mutations will be considered as being the same.

h.mvcf <- mvcf[!grepl(",", mvcf$locus),] #includes all rows that do not have $locus with ","
p.mvcf <- mvcf[grep(",", mvcf$locus),] #this grabs the polyploid loci for later use
write.csv(p.mvcf, "polyploid_loci.csv") 
h.mvcf <- h.mvcf[,c(32,1:31)] #place $ locus as the first column
kable(head(p.mvcf))
Scaffold POS ID REF ALT QUAL FILTER INFO FORMAT GM11 GM17 GM20 GM25 GM39 GM59 GM64 GM108 GM118 GM140 GM158 GM170 GM177 GM178 GM179 GM188 GM196 GM205 GM216 GM219 GM236 GM307 locus
4 scaffold_26 1 . G GAG, GGGAG . . . GT:CLCAD2:DP . . . . . 1:0,8:10 . . . . . . . . . . . . . . 2:0,4:10 . scaffold_26_1|G/GAG, GGGAG
8 scaffold_51 1 . C A, G . . . GT:CLCAD2:DP . . . . . . . . . . . . 1/2:0,4,5:10 . . . . . . . . . scaffold_51_1|C/A, G
12 scaffold_64 1 . A G, T . . . GT:CLCAD2:DP 1:0,45:101 . 1:0,21:45 . . . . 2:0,11:26 . . . . 1:0,97:213 . . . 1:0,120:244 . . . . 1:0,79:196 scaffold_64_1|A/G, T
25 scaffold_3 3 . G A, T . . . GT:CLCAD2:DP . . . . . . . . . . . . . . . . . . . . . 1/2:0,4,4:10 scaffold_3_3|G/A, T
38 scaffold_49 4 . G A, ACACACC . . . GT:CLCAD2:DP . . . . . . . . . . . . . . . . . . . . 2:0,9:19 . scaffold_49_4|G/A, ACACACC
48 scaffold_13 8 . A G, ATG . . . GT:CLCAD2:DP . . . . . 0/1:13,17:30 1:0,9:11 . . . . . . . . . 1:0,7:11 . . . 0/2:7,5:12 . scaffold_13_8|A/G, ATG
paste("there are", nrow(mvcf)-nrow(h.mvcf),"poly-allelic loci among the", nrow(mvcf),"variant loci.")
## [1] "there are 16914 poly-allelic loci among the 591211 variant loci."

6. Create a simple data matrix suitable for ordination

For ordinations, we should have only locus and presence/absence information, 1 or 0, for each strain

library(dplyr)
o.mvcf <- mvcf[,c(1,10:32)] # remove un-needed columns
amvcf <- o.mvcf
amvcf <- amvcf %>% mutate_if(is.factor, as.character)
bmvcf <- amvcf[-1]
bmvcf <- data.frame(lapply(bmvcf, function(x) {gsub(".", "X", fixed=TRUE, x)}),stringsAsFactors = FALSE) #convert "." to "X"
bmvcf[bmvcf != "X"] <- 1 #next convert any other cell to a 1
bmvcf <- data.frame(lapply(bmvcf, function(x) {gsub("X", "0", fixed=TRUE, x)}),stringsAsFactors = FALSE) #convert "X" to zero
bmvcf <- bmvcf %>% mutate_all(as.numeric)
cmvcf <- cbind(amvcf$locus,bmvcf)
dmvcf <- cmvcf[rowSums(cmvcf[, -1])>0, ] #remove all rows with no variants
colnames(dmvcf)[1] <- "locus"

Any remaining duplicate loci names have to be summed at this point. There are very few of these and I do not know why they arise. for example:

kable(dmvcf[dmvcf$locus=="scaffold_20_3090|G/GA",1:10])
locus GM11 GM17 GM20 GM25 GM39 GM59 GM64 GM108 GM118
8432 scaffold_20_3090|G/GA 0 0 0 0 0 0 0 0 0
8433 scaffold_20_3090|G/GA 0 0 0 0 0 0 0 0 0
8434 scaffold_20_3090|G/GA 0 0 0 0 0 0 0 0 0

we have to group them by locus

test <- dmvcf
emvcf <- aggregate(test[2:23], by=list(Category=test$locus), FUN=sum)
colnames(emvcf)[1] <- "locus"
write.csv(emvcf, "variant.table.csv")
kable(emvcf[emvcf$locus=="scaffold_20_3090|G/GA",1:8])
locus GM11 GM17 GM20 GM25 GM39 GM59 GM64
268406 scaffold_20_3090|G/GA 0 0 0 0 0 0 0
The locus has to be converted to rowname s inste ad
fmvcf <- emvcf[-c(1)]
rownames(fmvcf) <- emvcf[,1]
kable(fmvcf[1:10,1:8])
GM11 GM17 GM20 GM25 GM39 GM59 GM64 GM108
scaffold_0_1000237|A/G 0 0 0 0 0 0 0 0
scaffold_0_1000350|T/C 0 0 0 0 0 0 0 0
scaffold_0_1000374|G/A 0 0 0 0 0 0 0 0
scaffold_0_1000453|C/T 0 0 0 0 0 0 0 0
scaffold_0_1000478|C/T 0 0 0 0 0 0 0 0
scaffold_0_1000494|C/T 0 0 0 0 0 0 0 0
scaffold_0_1000560|C/G 0 0 0 0 0 0 1 0
scaffold_0_1000566|T/C 0 0 0 0 0 0 0 0
scaffold_0_1000576|G/C 0 0 0 0 0 1 0 0
scaffold_0_1000669|T/A 0 0 0 0 0 0 0 0

7. Running an ordination

We now have a matrix of variant loci (with details) by strain. We can run a simple ordination for now. However, first we have to decide on a distance measure. Some guidance here: https://www.ncbi.nlm.nih.gov/pubmed/20067366. They recommend that Reynolds genetic distance provides the highest sensitivity for highly divergent populations.

https://rdrr.io/cran/dartR/man/gl.pcoa.pop.html Dart R uses: euclidean“,”maximum“,”manhattan“,”canberra“,”binary" or “minkowski”. These are actually the standard measures included in R (not very good probably)

For now we will just use Eucludean since I have no real insight into better methods

testvcf <- t(fmvcf)
dist.vcf <- dist(testvcf, method = "euclidean")

And run a principle coordinate analysis

library(ape)
pcoa.vcf <- pcoa(dist.vcf) #, correction ="cailliez"

Here are the eigenvalues for the ordination (relative eignvalue ~= % variance explained)

kable(pcoa.vcf$values)
Eigenvalues Relative_eig Broken_stick Cumul_eig Cumul_br_stick
258430.93 0.3452107 0.1735885 0.3452107 0.1735885
55556.10 0.0742116 0.1259695 0.4194223 0.2995580
37557.04 0.0501685 0.1021599 0.4695908 0.4017179
34199.80 0.0456839 0.0862869 0.5152747 0.4880048
33218.81 0.0443735 0.0743822 0.5596482 0.5623870
30053.18 0.0401449 0.0648584 0.5997931 0.6272453
27744.69 0.0370612 0.0569218 0.6368543 0.6841672
26375.60 0.0352324 0.0501191 0.6720867 0.7342863
25441.12 0.0339841 0.0441667 0.7060708 0.7784531
23243.72 0.0310488 0.0388757 0.7371196 0.8173288
22633.88 0.0302342 0.0341138 0.7673538 0.8514426
20587.14 0.0275002 0.0297848 0.7948540 0.8812274
19641.39 0.0262369 0.0258166 0.8210909 0.9070440
18697.10 0.0249755 0.0221536 0.8460664 0.9291976
18201.07 0.0243129 0.0187522 0.8703793 0.9479498
18056.53 0.0241198 0.0155776 0.8944991 0.9635274
17745.93 0.0237049 0.0126014 0.9182040 0.9761288
16250.23 0.0217070 0.0098003 0.9399110 0.9859291
15654.93 0.0209118 0.0071548 0.9608228 0.9930839
14944.05 0.0199622 0.0046485 0.9807850 0.9977324
14384.72 0.0192150 0.0022676 1.0000000 1.0000000

I can run a built-in plotting (absent of metadata)

biplot(pcoa.vcf, Y=NULL, plot.axes = c(1,2))

8. Plotting with metadata

First I have to create and import a metadata table

metadata <- read.csv("../metadata.csv", header = TRUE)
kable(head(metadata))
strain Host Location Lat Long
GM11 J. nigra Tennessee 35.74214 -83.97303
GM17 J. nigra Tennessee 35.71088 -83.88363
GM20 J. nigra Tennessee 35.75190 -83.97479
GM25 J. nigra Tennessee 35.79385 -84.26230
GM39 J. nigra Tennessee 35.94879 -83.93850
GM59 J. nigra North Carolina 35.61705 -83.12252

Next we pull out the coordinates from the ordination and add a row with the rownames

pcoa.coords <- data.frame(pcoa.vcf$vectors)
pcoa.coords$strain <- rownames(pcoa.coords)
pcoa.coords <- pcoa.coords[,c(22,1:21)]
kable(pcoa.coords[1:5,1:5])
strain Axis.1 Axis.2 Axis.3 Axis.4
GM11 GM11 -114.9047 -3.256782 9.470657 -15.639465
GM17 GM17 -113.2756 -7.873132 11.979418 6.328735
GM20 GM20 -112.4888 -8.404958 11.762947 6.643858
GM25 GM25 122.6842 42.058987 31.585040 14.458111
GM39 GM39 -112.8341 -3.085305 8.760090 -13.531470
Now I b ind in th e metadata
library(dplyr)
pcoa.meta <- dplyr::left_join(metadata,pcoa.coords,by="strain")
## Warning: Column `strain` joining factor and character vector, coercing into
## character vector
kable(pcoa.meta[1:5,1:8])
strain Host Location Lat Long Axis.1 Axis.2 Axis.3
GM11 J. nigra Tennessee 35.74214 -83.97303 -114.9047 -3.256782 9.470657
GM17 J. nigra Tennessee 35.71088 -83.88363 -113.2756 -7.873132 11.979418
GM20 J. nigra Tennessee 35.75190 -83.97479 -112.4888 -8.404958 11.762947
GM25 J. nigra Tennessee 35.79385 -84.26230 122.6842 42.058987 31.585040
GM39 J. nigra Tennessee 35.94879 -83.93850 -112.8341 -3.085305 8.760090

Now I can plot desired axes and use metadata to color/shape

library(ggplot2)
cols = c("#30842a","#f07df2","#f74f4f","#70f0ff","#2657ff","#ffab1c")
ggplot(pcoa.meta, aes(x=Axis.1, y=Axis.2, color=Location, shape=Host)) + 
    geom_point(size=5, alpha=0.70) + theme_classic() + scale_color_manual(values=cols)

This creates a rather unpleasant grouping; piles of samples separated by long distances. This I assume is a result of the distance measure and the ordination measure. This does however support the general explanation that Arizona is distinct from Colorado, that Colorado has the widest degree of genetic variation, and that there have been at least two very distinct events leading to colonization of other areas into non-native regions.

NMDS

An NMDS should tease apart the clumps; as is, I cannot display names at all, we cannot make sense of the clump on the left (even if it is accurate, totally in line with the trees so far)

the vegan NMDS (function metaMDS) also calculates distances.

However, this errors out due to memory constraints, it calls for a 1.3TB vector. So NMDS is out of the question.

Alternative; zoom in on the cluster

library(ggplot2)
cols = c("#30842a","#f07df2","#f74f4f","#70f0ff","#2657ff","#ffab1c")
ggplot(pcoa.meta, aes(x=Axis.1, y=Axis.2, color=Location, shape=Host)) + 
    geom_point(size=8
               , alpha=0.70) + theme_classic() + scale_color_manual(values=cols) +
  scale_x_continuous(limits = c(-118, -108)) +
  scale_y_continuous(limits = c(-10, 0))

This will be presented as a subplot of the main ordination.

( rpubsUpload(“G.morbida.vcf.to.ordination”, “/home/rmurdoch/Dropbox/Projects/Geosmithia/dartr/1.geosmithia.vcf.to.ordnation.html”)) )