While there are many well-established pre-existing packages for processing variant/SNP data, such as DartR, pegas, adegenet, and vcfR, these are not well adapted to working with this particular data format. In this pipeline, we will use base R package to read and format the multi-vcf file that was output from the CLC variant calling system. The vcf file is actually quite a straight-forward tab-delimited table. However, there is some curiously formatted data such as coverage at each variant location, which is challenging to access without the aid of the vcfR package (we will use this later to quickly get some nice descriptive statistics of the variants). The goal of the present script is to move towards ordinations based on all variants. The DartR package does perform PCoA analysis of variant data packed into a “genlight” object; however, getting from vcf to genlight while maintaining metadata and nature of variants is extremely time-consuming and for the purposes of ordination, entirely unecessary. For more specialized population genetics approaches, DartR and pegas might become useful, but for now, I will ignore them.
Inputs: * multi.vcf file
Outputs: * variant.table.csv: a table which shows variant, in a parsable code (e.g. “scaffold_0_1000237|A/G”) and presence/absence (1 or 0) in each strain. This is the interpreted multi.vcf and is passed to the next script for interpretation.
We will simply read in the multi-vcf file as a data table. Note that a vcf has several lines at the beginning that must be skipped.
mvcf <- read.csv("../all.vcf/multi_all.vcf", skip=13, sep="\t")
colnames(mvcf)[1] <- "Scaffold" #make first colum name more appropriate
strain.names <- colnames(mvcf)[10:32] #pull strain names
clean.names <- gsub("\\_.*","",strain.names) #clean up strain names
colnames(mvcf)[10:32] <- clean.names #replace old with new strain names
Strain GM104 has proven repeatedly to be a massive outlier. This manually removes it from further analysis.
mvcf$GM104 <- NULL
kable(mvcf[c(1:10),])
| Scaffold | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | GM11 | GM17 | GM20 | GM25 | GM39 | GM59 | GM64 | GM108 | GM118 | GM140 | GM158 | GM170 | GM177 | GM178 | GM179 | GM188 | GM196 | GM205 | GM216 | GM219 | GM236 | GM307 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| scaffold_12 | 1 | . | CT | TC | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | . | . | . | . | . | . | . | . | . | 0/1:7,7:14 | . | . | . | . | . | . | . |
| scaffold_14 | 1 | . | GG | AA | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | 1:0,11:18 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
| scaffold_17 | 1 | . | GG | TT | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | 1:0,6:14 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
| scaffold_26 | 1 | . | G | GAG, GGGAG | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | 1:0,8:10 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | 2:0,4:10 | . |
| scaffold_29 | 1 | . | G | A | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | 1:0,7:14 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
| scaffold_43 | 1 | . | CC | AA | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | 1:0,8:20 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
| scaffold_46 | 1 | . | A | T | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | . | . | . | . | . | 0/1:5,5:10 | . | . | . | . | . | . | . | . | . | . | . |
| scaffold_51 | 1 | . | C | A, G | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | . | . | . | . | . | . | . | 1/2:0,4,5:10 | . | . | . | . | . | . | . | . | . |
| scaffold_52 | 1 | . | TG | CC | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | 1:0,12:15 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . |
| scaffold_6 | 1 | . | G | C | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | . | . | . | . | . | 1:0,4:10 | . | . | . | . | . | . | . | . | . | . | . |
Essentially, I want all of this information concatenated into a single “Locus” cell:
kable((mvcf)[1:10,c(1:2,4:5)])
| Scaffold | POS | REF | ALT |
|---|---|---|---|
| scaffold_12 | 1 | CT | TC |
| scaffold_14 | 1 | GG | AA |
| scaffold_17 | 1 | GG | TT |
| scaffold_26 | 1 | G | GAG, GGGAG |
| scaffold_29 | 1 | G | A |
| scaffold_43 | 1 | CC | AA |
| scaffold_46 | 1 | A | T |
| scaffold_51 | 1 | C | A, G |
| scaffold_52 | 1 | TG | CC |
| scaffold_6 | 1 | G | C |
Becomes:
mvcf$locus <- paste ( paste(mvcf$Scaffold, mvcf$POS, sep="_") , paste(mvcf$REF, mvcf$ALT, sep="/") , sep="|")
kable((mvcf)[1:10,c(1:2,4:5,ncol(mvcf))])
| Scaffold | POS | REF | ALT | locus |
|---|---|---|---|---|
| scaffold_12 | 1 | CT | TC | scaffold_12_1|CT/TC |
| scaffold_14 | 1 | GG | AA | scaffold_14_1|GG/AA |
| scaffold_17 | 1 | GG | TT | scaffold_17_1|GG/TT |
| scaffold_26 | 1 | G | GAG, GGGAG | scaffold_26_1|G/GAG, GGGAG |
| scaffold_29 | 1 | G | A | scaffold_29_1|G/A |
| scaffold_43 | 1 | CC | AA | scaffold_43_1|CC/AA |
| scaffold_46 | 1 | A | T | scaffold_46_1|A/T |
| scaffold_51 | 1 | C | A, G | scaffold_51_1|C/A, G |
| scaffold_52 | 1 | TG | CC | scaffold_52_1|TG/CC |
| scaffold_6 | 1 | G | C | scaffold_6_1|G/C |
There is a small fraction of loci at which two or more variants are observed in the population or even in a single strain. As an example, scaffold_64_1 we see both A -> G and A -> T depending on the strain.
Some of these are fairly dramatic, for example G -> A vs. G -> ACACACC.
These are monstrously difficult to parse. They make up only about 2.5% of the variants across the population however.
I am going to leave them in the set as-is. Kicking them out seems like a mistake, while teasing them apart seems like an un-necessary burden. The goal of this pipeline is to get a general profile and to work towards identifying highly-allelic genes, or foci of adaptation.
The big drawback of leaving them as-is is that clearly different mutations will be considered as being the same.
h.mvcf <- mvcf[!grepl(",", mvcf$locus),] #includes all rows that do not have $locus with ","
p.mvcf <- mvcf[grep(",", mvcf$locus),] #this grabs the polyploid loci for later use
write.csv(p.mvcf, "polyploid_loci.csv")
h.mvcf <- h.mvcf[,c(32,1:31)] #place $ locus as the first column
kable(head(p.mvcf))
| Scaffold | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | GM11 | GM17 | GM20 | GM25 | GM39 | GM59 | GM64 | GM108 | GM118 | GM140 | GM158 | GM170 | GM177 | GM178 | GM179 | GM188 | GM196 | GM205 | GM216 | GM219 | GM236 | GM307 | locus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | scaffold_26 | 1 | . | G | GAG, GGGAG | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | 1:0,8:10 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | 2:0,4:10 | . | scaffold_26_1|G/GAG, GGGAG |
| 8 | scaffold_51 | 1 | . | C | A, G | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | . | . | . | . | . | . | . | 1/2:0,4,5:10 | . | . | . | . | . | . | . | . | . | scaffold_51_1|C/A, G |
| 12 | scaffold_64 | 1 | . | A | G, T | . | . | . | GT:CLCAD2:DP | 1:0,45:101 | . | 1:0,21:45 | . | . | . | . | 2:0,11:26 | . | . | . | . | 1:0,97:213 | . | . | . | 1:0,120:244 | . | . | . | . | 1:0,79:196 | scaffold_64_1|A/G, T |
| 25 | scaffold_3 | 3 | . | G | A, T | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | 1/2:0,4,4:10 | scaffold_3_3|G/A, T |
| 38 | scaffold_49 | 4 | . | G | A, ACACACC | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | 2:0,9:19 | . | scaffold_49_4|G/A, ACACACC |
| 48 | scaffold_13 | 8 | . | A | G, ATG | . | . | . | GT:CLCAD2:DP | . | . | . | . | . | 0/1:13,17:30 | 1:0,9:11 | . | . | . | . | . | . | . | . | . | 1:0,7:11 | . | . | . | 0/2:7,5:12 | . | scaffold_13_8|A/G, ATG |
paste("there are", nrow(mvcf)-nrow(h.mvcf),"poly-allelic loci among the", nrow(mvcf),"variant loci.")
## [1] "there are 16914 poly-allelic loci among the 591211 variant loci."
For ordinations, we should have only locus and presence/absence information, 1 or 0, for each strain
library(dplyr)
o.mvcf <- mvcf[,c(1,10:32)] # remove un-needed columns
amvcf <- o.mvcf
amvcf <- amvcf %>% mutate_if(is.factor, as.character)
bmvcf <- amvcf[-1]
bmvcf <- data.frame(lapply(bmvcf, function(x) {gsub(".", "X", fixed=TRUE, x)}),stringsAsFactors = FALSE) #convert "." to "X"
bmvcf[bmvcf != "X"] <- 1 #next convert any other cell to a 1
bmvcf <- data.frame(lapply(bmvcf, function(x) {gsub("X", "0", fixed=TRUE, x)}),stringsAsFactors = FALSE) #convert "X" to zero
bmvcf <- bmvcf %>% mutate_all(as.numeric)
cmvcf <- cbind(amvcf$locus,bmvcf)
dmvcf <- cmvcf[rowSums(cmvcf[, -1])>0, ] #remove all rows with no variants
colnames(dmvcf)[1] <- "locus"
Any remaining duplicate loci names have to be summed at this point. There are very few of these and I do not know why they arise. for example:
kable(dmvcf[dmvcf$locus=="scaffold_20_3090|G/GA",1:10])
| locus | GM11 | GM17 | GM20 | GM25 | GM39 | GM59 | GM64 | GM108 | GM118 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 8432 | scaffold_20_3090|G/GA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8433 | scaffold_20_3090|G/GA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8434 | scaffold_20_3090|G/GA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
we have to group them by locus
test <- dmvcf
emvcf <- aggregate(test[2:23], by=list(Category=test$locus), FUN=sum)
colnames(emvcf)[1] <- "locus"
write.csv(emvcf, "variant.table.csv")
kable(emvcf[emvcf$locus=="scaffold_20_3090|G/GA",1:8])
| locus | GM11 | GM17 | GM20 | GM25 | GM39 | GM59 | GM64 | |
|---|---|---|---|---|---|---|---|---|
| 268406 | scaffold_20_3090|G/GA | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| The locus | has to be converted to | rowname | s inste | ad |
fmvcf <- emvcf[-c(1)]
rownames(fmvcf) <- emvcf[,1]
kable(fmvcf[1:10,1:8])
| GM11 | GM17 | GM20 | GM25 | GM39 | GM59 | GM64 | GM108 | |
|---|---|---|---|---|---|---|---|---|
| scaffold_0_1000237|A/G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| scaffold_0_1000350|T/C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| scaffold_0_1000374|G/A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| scaffold_0_1000453|C/T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| scaffold_0_1000478|C/T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| scaffold_0_1000494|C/T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| scaffold_0_1000560|C/G | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| scaffold_0_1000566|T/C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| scaffold_0_1000576|G/C | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| scaffold_0_1000669|T/A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We now have a matrix of variant loci (with details) by strain. We can run a simple ordination for now. However, first we have to decide on a distance measure. Some guidance here: https://www.ncbi.nlm.nih.gov/pubmed/20067366. They recommend that Reynolds genetic distance provides the highest sensitivity for highly divergent populations.
https://rdrr.io/cran/dartR/man/gl.pcoa.pop.html Dart R uses: euclidean“,”maximum“,”manhattan“,”canberra“,”binary" or “minkowski”. These are actually the standard measures included in R (not very good probably)
For now we will just use Eucludean since I have no real insight into better methods
testvcf <- t(fmvcf)
dist.vcf <- dist(testvcf, method = "euclidean")
And run a principle coordinate analysis
library(ape)
pcoa.vcf <- pcoa(dist.vcf) #, correction ="cailliez"
Here are the eigenvalues for the ordination (relative eignvalue ~= % variance explained)
kable(pcoa.vcf$values)
| Eigenvalues | Relative_eig | Broken_stick | Cumul_eig | Cumul_br_stick |
|---|---|---|---|---|
| 258430.93 | 0.3452107 | 0.1735885 | 0.3452107 | 0.1735885 |
| 55556.10 | 0.0742116 | 0.1259695 | 0.4194223 | 0.2995580 |
| 37557.04 | 0.0501685 | 0.1021599 | 0.4695908 | 0.4017179 |
| 34199.80 | 0.0456839 | 0.0862869 | 0.5152747 | 0.4880048 |
| 33218.81 | 0.0443735 | 0.0743822 | 0.5596482 | 0.5623870 |
| 30053.18 | 0.0401449 | 0.0648584 | 0.5997931 | 0.6272453 |
| 27744.69 | 0.0370612 | 0.0569218 | 0.6368543 | 0.6841672 |
| 26375.60 | 0.0352324 | 0.0501191 | 0.6720867 | 0.7342863 |
| 25441.12 | 0.0339841 | 0.0441667 | 0.7060708 | 0.7784531 |
| 23243.72 | 0.0310488 | 0.0388757 | 0.7371196 | 0.8173288 |
| 22633.88 | 0.0302342 | 0.0341138 | 0.7673538 | 0.8514426 |
| 20587.14 | 0.0275002 | 0.0297848 | 0.7948540 | 0.8812274 |
| 19641.39 | 0.0262369 | 0.0258166 | 0.8210909 | 0.9070440 |
| 18697.10 | 0.0249755 | 0.0221536 | 0.8460664 | 0.9291976 |
| 18201.07 | 0.0243129 | 0.0187522 | 0.8703793 | 0.9479498 |
| 18056.53 | 0.0241198 | 0.0155776 | 0.8944991 | 0.9635274 |
| 17745.93 | 0.0237049 | 0.0126014 | 0.9182040 | 0.9761288 |
| 16250.23 | 0.0217070 | 0.0098003 | 0.9399110 | 0.9859291 |
| 15654.93 | 0.0209118 | 0.0071548 | 0.9608228 | 0.9930839 |
| 14944.05 | 0.0199622 | 0.0046485 | 0.9807850 | 0.9977324 |
| 14384.72 | 0.0192150 | 0.0022676 | 1.0000000 | 1.0000000 |
I can run a built-in plotting (absent of metadata)
biplot(pcoa.vcf, Y=NULL, plot.axes = c(1,2))
First I have to create and import a metadata table
metadata <- read.csv("../metadata.csv", header = TRUE)
kable(head(metadata))
| strain | Host | Location | Lat | Long |
|---|---|---|---|---|
| GM11 | J. nigra | Tennessee | 35.74214 | -83.97303 |
| GM17 | J. nigra | Tennessee | 35.71088 | -83.88363 |
| GM20 | J. nigra | Tennessee | 35.75190 | -83.97479 |
| GM25 | J. nigra | Tennessee | 35.79385 | -84.26230 |
| GM39 | J. nigra | Tennessee | 35.94879 | -83.93850 |
| GM59 | J. nigra | North Carolina | 35.61705 | -83.12252 |
Next we pull out the coordinates from the ordination and add a row with the rownames
pcoa.coords <- data.frame(pcoa.vcf$vectors)
pcoa.coords$strain <- rownames(pcoa.coords)
pcoa.coords <- pcoa.coords[,c(22,1:21)]
kable(pcoa.coords[1:5,1:5])
| strain | Axis.1 | Axis.2 | Axis.3 | Axis.4 | |
|---|---|---|---|---|---|
| GM11 | GM11 | -114.9047 | -3.256782 | 9.470657 | -15.639465 |
| GM17 | GM17 | -113.2756 | -7.873132 | 11.979418 | 6.328735 |
| GM20 | GM20 | -112.4888 | -8.404958 | 11.762947 | 6.643858 |
| GM25 | GM25 | 122.6842 | 42.058987 | 31.585040 | 14.458111 |
| GM39 | GM39 | -112.8341 | -3.085305 | 8.760090 | -13.531470 |
| Now I b | ind in th | e metadata |
library(dplyr)
pcoa.meta <- dplyr::left_join(metadata,pcoa.coords,by="strain")
## Warning: Column `strain` joining factor and character vector, coercing into
## character vector
kable(pcoa.meta[1:5,1:8])
| strain | Host | Location | Lat | Long | Axis.1 | Axis.2 | Axis.3 |
|---|---|---|---|---|---|---|---|
| GM11 | J. nigra | Tennessee | 35.74214 | -83.97303 | -114.9047 | -3.256782 | 9.470657 |
| GM17 | J. nigra | Tennessee | 35.71088 | -83.88363 | -113.2756 | -7.873132 | 11.979418 |
| GM20 | J. nigra | Tennessee | 35.75190 | -83.97479 | -112.4888 | -8.404958 | 11.762947 |
| GM25 | J. nigra | Tennessee | 35.79385 | -84.26230 | 122.6842 | 42.058987 | 31.585040 |
| GM39 | J. nigra | Tennessee | 35.94879 | -83.93850 | -112.8341 | -3.085305 | 8.760090 |
Now I can plot desired axes and use metadata to color/shape
library(ggplot2)
cols = c("#30842a","#f07df2","#f74f4f","#70f0ff","#2657ff","#ffab1c")
ggplot(pcoa.meta, aes(x=Axis.1, y=Axis.2, color=Location, shape=Host)) +
geom_point(size=5, alpha=0.70) + theme_classic() + scale_color_manual(values=cols)
This creates a rather unpleasant grouping; piles of samples separated by long distances. This I assume is a result of the distance measure and the ordination measure. This does however support the general explanation that Arizona is distinct from Colorado, that Colorado has the widest degree of genetic variation, and that there have been at least two very distinct events leading to colonization of other areas into non-native regions.
An NMDS should tease apart the clumps; as is, I cannot display names at all, we cannot make sense of the clump on the left (even if it is accurate, totally in line with the trees so far)
the vegan NMDS (function metaMDS) also calculates distances.
However, this errors out due to memory constraints, it calls for a 1.3TB vector. So NMDS is out of the question.
library(ggplot2)
cols = c("#30842a","#f07df2","#f74f4f","#70f0ff","#2657ff","#ffab1c")
ggplot(pcoa.meta, aes(x=Axis.1, y=Axis.2, color=Location, shape=Host)) +
geom_point(size=8
, alpha=0.70) + theme_classic() + scale_color_manual(values=cols) +
scale_x_continuous(limits = c(-118, -108)) +
scale_y_continuous(limits = c(-10, 0))
This will be presented as a subplot of the main ordination.
( rpubsUpload(“G.morbida.vcf.to.ordination”, “/home/rmurdoch/Dropbox/Projects/Geosmithia/dartr/1.geosmithia.vcf.to.ordnation.html”)) )