Processing Geosmithia vcf data to create ordinations

Overview

While there are many well-established pre-existing packages for processing variant/SNP data, such as DartR, pegas, adegenet, and vcfR, these are not well adapted to working with this particular data format. In this pipeline, we will use base R package to read and format the multi-vcf file that was output from the CLC variant calling system. The vcf file is actually quite a straight-forward tab-delimited table. However, there is some curiously formatted data such as coverage at each variant location, which is challenging to access without the aid of the vcfR package (we will use this later to quickly get some nice descriptive statistics of the variants). The goal of the present script is to move towards ordinations based on all variants. The DartR package does perform PCoA analysis of variant data packed into a “genlight” object; however, getting from vcf to genlight while maintaining metadata and nature of variants is extremely time-consuming and for the purposes of ordination, entirely unecessary. For more specialized population genetics approaches, DartR and pegas might become useful, but for now, I will ignore them.

Inputs: * multi.vcf file

Outputs: * variant.table.csv: a table which shows variant, in a parsable code (e.g. “scaffold_0_1000237|A/G”) and presence/absence (1 or 0) in each strain. This is the interpreted multi.vcf and is passed to the next script for interpretation.

1. Import the data

We will simply read in the multi-vcf file as a data table. Note that a vcf has several lines at the beginning that must be skipped.

mvcf <- read.csv("../all.vcf/multi_all.vcf", skip=13, sep="\t")

2. Perform some basic column name cleanup

colnames(mvcf)[1] <- "Scaffold" #make first colum name more appropriate
strain.names <- colnames(mvcf)[10:32] #pull strain names
clean.names <- gsub("\\_.*","",strain.names) #clean up strain names
colnames(mvcf)[10:32] <- clean.names #replace old with new strain names

3. Remove strain GM104

Strain GM104 has proven repeatedly to be a massive outlier. This manually removes it from further analysis.

mvcf$GM104 <- NULL
kable(mvcf[c(1:10),])

Scaffold	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	GM11	GM17	GM20	GM25	GM39	GM59	GM64	GM108	GM118	GM140	GM158	GM170	GM177	GM178	GM179	GM188	GM196	GM205	GM216	GM219	GM236	GM307
scaffold_12	1	.	CT	TC	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	.	.	.	.	.	.	.	.	.	0/1:7,7:14	.	.	.	.	.	.	.
scaffold_14	1	.	GG	AA	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	1:0,11:18	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
scaffold_17	1	.	GG	TT	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	1:0,6:14	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
scaffold_26	1	.	G	GAG, GGGAG	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	1:0,8:10	.	.	.	.	.	.	.	.	.	.	.	.	.	.	2:0,4:10	.
scaffold_29	1	.	G	A	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	1:0,7:14	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
scaffold_43	1	.	CC	AA	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	1:0,8:20	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
scaffold_46	1	.	A	T	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	.	.	.	.	.	0/1:5,5:10	.	.	.	.	.	.	.	.	.	.	.
scaffold_51	1	.	C	A, G	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	.	.	.	.	.	.	.	1/2:0,4,5:10	.	.	.	.	.	.	.	.	.
scaffold_52	1	.	TG	CC	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	1:0,12:15	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
scaffold_6	1	.	G	C	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	.	.	.	.	.	1:0,4:10	.	.	.	.	.	.	.	.	.	.	.

4. Creating locus names with scaffold, position, ref, and var details

Essentially, I want all of this information concatenated into a single “Locus” cell:

kable((mvcf)[1:10,c(1:2,4:5)])

Scaffold	POS	REF	ALT
scaffold_12	1	CT	TC
scaffold_14	1	GG	AA
scaffold_17	1	GG	TT
scaffold_26	1	G	GAG, GGGAG
scaffold_29	1	G	A
scaffold_43	1	CC	AA
scaffold_46	1	A	T
scaffold_51	1	C	A, G
scaffold_52	1	TG	CC
scaffold_6	1	G	C

Becomes:

mvcf$locus <- paste ( paste(mvcf$Scaffold, mvcf$POS, sep="_") , paste(mvcf$REF, mvcf$ALT, sep="/") , sep="|")
kable((mvcf)[1:10,c(1:2,4:5,ncol(mvcf))])

Scaffold	POS	REF	ALT	locus
scaffold_12	1	CT	TC	scaffold_12_1\|CT/TC
scaffold_14	1	GG	AA	scaffold_14_1\|GG/AA
scaffold_17	1	GG	TT	scaffold_17_1\|GG/TT
scaffold_26	1	G	GAG, GGGAG	scaffold_26_1\|G/GAG, GGGAG
scaffold_29	1	G	A	scaffold_29_1\|G/A
scaffold_43	1	CC	AA	scaffold_43_1\|CC/AA
scaffold_46	1	A	T	scaffold_46_1\|A/T
scaffold_51	1	C	A, G	scaffold_51_1\|C/A, G
scaffold_52	1	TG	CC	scaffold_52_1\|TG/CC
scaffold_6	1	G	C	scaffold_6_1\|G/C

5. complex loci

There is a small fraction of loci at which two or more variants are observed in the population or even in a single strain. As an example, scaffold_64_1 we see both A -> G and A -> T depending on the strain.

Some of these are fairly dramatic, for example G -> A vs. G -> ACACACC.

These are monstrously difficult to parse. They make up only about 2.5% of the variants across the population however.

I am going to leave them in the set as-is. Kicking them out seems like a mistake, while teasing them apart seems like an un-necessary burden. The goal of this pipeline is to get a general profile and to work towards identifying highly-allelic genes, or foci of adaptation.

The big drawback of leaving them as-is is that clearly different mutations will be considered as being the same.

h.mvcf <- mvcf[!grepl(",", mvcf$locus),] #includes all rows that do not have $locus with ","
p.mvcf <- mvcf[grep(",", mvcf$locus),] #this grabs the polyploid loci for later use
write.csv(p.mvcf, "polyploid_loci.csv") 
h.mvcf <- h.mvcf[,c(32,1:31)] #place $ locus as the first column
kable(head(p.mvcf))

	Scaffold	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	GM11	GM17	GM20	GM25	GM39	GM59	GM64	GM108	GM118	GM140	GM158	GM170	GM177	GM178	GM179	GM188	GM196	GM205	GM216	GM219	GM236	GM307	locus
4	scaffold_26	1	.	G	GAG, GGGAG	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	1:0,8:10	.	.	.	.	.	.	.	.	.	.	.	.	.	.	2:0,4:10	.	scaffold_26_1\|G/GAG, GGGAG
8	scaffold_51	1	.	C	A, G	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	.	.	.	.	.	.	.	1/2:0,4,5:10	.	.	.	.	.	.	.	.	.	scaffold_51_1\|C/A, G
12	scaffold_64	1	.	A	G, T	.	.	.	GT:CLCAD2:DP	1:0,45:101	.	1:0,21:45	.	.	.	.	2:0,11:26	.	.	.	.	1:0,97:213	.	.	.	1:0,120:244	.	.	.	.	1:0,79:196	scaffold_64_1\|A/G, T
25	scaffold_3	3	.	G	A, T	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	1/2:0,4,4:10	scaffold_3_3\|G/A, T
38	scaffold_49	4	.	G	A, ACACACC	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	2:0,9:19	.	scaffold_49_4\|G/A, ACACACC
48	scaffold_13	8	.	A	G, ATG	.	.	.	GT:CLCAD2:DP	.	.	.	.	.	0/1:13,17:30	1:0,9:11	.	.	.	.	.	.	.	.	.	1:0,7:11	.	.	.	0/2:7,5:12	.	scaffold_13_8\|A/G, ATG

paste("there are", nrow(mvcf)-nrow(h.mvcf),"poly-allelic loci among the", nrow(mvcf),"variant loci.")

## [1] "there are 16914 poly-allelic loci among the 591211 variant loci."

6. Create a simple data matrix suitable for ordination

For ordinations, we should have only locus and presence/absence information, 1 or 0, for each strain

library(dplyr)
o.mvcf <- mvcf[,c(1,10:32)] # remove un-needed columns
amvcf <- o.mvcf
amvcf <- amvcf %>% mutate_if(is.factor, as.character)
bmvcf <- amvcf[-1]
bmvcf <- data.frame(lapply(bmvcf, function(x) {gsub(".", "X", fixed=TRUE, x)}),stringsAsFactors = FALSE) #convert "." to "X"
bmvcf[bmvcf != "X"] <- 1 #next convert any other cell to a 1
bmvcf <- data.frame(lapply(bmvcf, function(x) {gsub("X", "0", fixed=TRUE, x)}),stringsAsFactors = FALSE) #convert "X" to zero
bmvcf <- bmvcf %>% mutate_all(as.numeric)
cmvcf <- cbind(amvcf$locus,bmvcf)
dmvcf <- cmvcf[rowSums(cmvcf[, -1])>0, ] #remove all rows with no variants
colnames(dmvcf)[1] <- "locus"

Any remaining duplicate loci names have to be summed at this point. There are very few of these and I do not know why they arise. for example:

kable(dmvcf[dmvcf$locus=="scaffold_20_3090|G/GA",1:10])

	locus	GM11	GM17	GM20	GM25	GM39	GM59	GM64	GM108	GM118
8432	scaffold_20_3090\|G/GA	0	0	0	0	0	0	0	0	0
8433	scaffold_20_3090\|G/GA	0	0	0	0	0	0	0	0	0
8434	scaffold_20_3090\|G/GA	0	0	0	0	0	0	0	0	0

we have to group them by locus

test <- dmvcf
emvcf <- aggregate(test[2:23], by=list(Category=test$locus), FUN=sum)
colnames(emvcf)[1] <- "locus"
write.csv(emvcf, "variant.table.csv")
kable(emvcf[emvcf$locus=="scaffold_20_3090|G/GA",1:8])

	locus	GM11	GM17	GM20	GM25	GM39	GM59	GM64
268406	scaffold_20_3090\|G/GA	0	0	0	0	0	0	0
The locus	has to be converted to	rowname	s inste	ad

fmvcf <- emvcf[-c(1)]
rownames(fmvcf) <- emvcf[,1]
kable(fmvcf[1:10,1:8])

	GM59	GM64
scaffold_0_1000237\|A/G	0	0
scaffold_0_1000350\|T/C	0	0
scaffold_0_1000374\|G/A	0	0
scaffold_0_1000453\|C/T	0	0
scaffold_0_1000478\|C/T	0	0
scaffold_0_1000494\|C/T	0	0
scaffold_0_1000560\|C/G	0	1
scaffold_0_1000566\|T/C	0	0
scaffold_0_1000576\|G/C	1	0
scaffold_0_1000669\|T/A	0	0

7. Running an ordination

We now have a matrix of variant loci (with details) by strain. We can run a simple ordination for now. However, first we have to decide on a distance measure. Some guidance here: https://www.ncbi.nlm.nih.gov/pubmed/20067366. They recommend that Reynolds genetic distance provides the highest sensitivity for highly divergent populations.

https://rdrr.io/cran/dartR/man/gl.pcoa.pop.html Dart R uses: euclidean“,”maximum“,”manhattan“,”canberra“,”binary" or “minkowski”. These are actually the standard measures included in R (not very good probably)

For now we will just use Eucludean since I have no real insight into better methods

testvcf <- t(fmvcf)
dist.vcf <- dist(testvcf, method = "euclidean")

And run a principle coordinate analysis

library(ape)
pcoa.vcf <- pcoa(dist.vcf) #, correction ="cailliez"

Here are the eigenvalues for the ordination (relative eignvalue ~= % variance explained)

kable(pcoa.vcf$values)

Eigenvalues	Relative_eig	Broken_stick	Cumul_eig	Cumul_br_stick
258430.93	0.3452107	0.1735885	0.3452107	0.1735885
55556.10	0.0742116	0.1259695	0.4194223	0.2995580
37557.04	0.0501685	0.1021599	0.4695908	0.4017179
34199.80	0.0456839	0.0862869	0.5152747	0.4880048
33218.81	0.0443735	0.0743822	0.5596482	0.5623870
30053.18	0.0401449	0.0648584	0.5997931	0.6272453
27744.69	0.0370612	0.0569218	0.6368543	0.6841672
26375.60	0.0352324	0.0501191	0.6720867	0.7342863
25441.12	0.0339841	0.0441667	0.7060708	0.7784531
23243.72	0.0310488	0.0388757	0.7371196	0.8173288
22633.88	0.0302342	0.0341138	0.7673538	0.8514426
20587.14	0.0275002	0.0297848	0.7948540	0.8812274
19641.39	0.0262369	0.0258166	0.8210909	0.9070440
18697.10	0.0249755	0.0221536	0.8460664	0.9291976
18201.07	0.0243129	0.0187522	0.8703793	0.9479498
18056.53	0.0241198	0.0155776	0.8944991	0.9635274
17745.93	0.0237049	0.0126014	0.9182040	0.9761288
16250.23	0.0217070	0.0098003	0.9399110	0.9859291
15654.93	0.0209118	0.0071548	0.9608228	0.9930839
14944.05	0.0199622	0.0046485	0.9807850	0.9977324
14384.72	0.0192150	0.0022676	1.0000000	1.0000000

I can run a built-in plotting (absent of metadata)

biplot(pcoa.vcf, Y=NULL, plot.axes = c(1,2))

8. Plotting with metadata

First I have to create and import a metadata table

metadata <- read.csv("../metadata.csv", header = TRUE)
kable(head(metadata))

strain	Host	Location	Lat	Long
GM11	J. nigra	Tennessee	35.74214	-83.97303
GM17	J. nigra	Tennessee	35.71088	-83.88363
GM20	J. nigra	Tennessee	35.75190	-83.97479
GM25	J. nigra	Tennessee	35.79385	-84.26230
GM39	J. nigra	Tennessee	35.94879	-83.93850
GM59	J. nigra	North Carolina	35.61705	-83.12252

Next we pull out the coordinates from the ordination and add a row with the rownames

pcoa.coords <- data.frame(pcoa.vcf$vectors)
pcoa.coords$strain <- rownames(pcoa.coords)
pcoa.coords <- pcoa.coords[,c(22,1:21)]
kable(pcoa.coords[1:5,1:5])

	strain	Axis.1	Axis.2	Axis.3	Axis.4
GM11	GM11	-114.9047	-3.256782	9.470657	-15.639465
GM17	GM17	-113.2756	-7.873132	11.979418	6.328735
GM20	GM20	-112.4888	-8.404958	11.762947	6.643858
GM25	GM25	122.6842	42.058987	31.585040	14.458111
GM39	GM39	-112.8341	-3.085305	8.760090	-13.531470
Now I b	ind in th	e metadata

library(dplyr)
pcoa.meta <- dplyr::left_join(metadata,pcoa.coords,by="strain")

## Warning: Column `strain` joining factor and character vector, coercing into
## character vector

kable(pcoa.meta[1:5,1:8])

strain	Host	Location	Lat	Long	Axis.1	Axis.2	Axis.3
GM11	J. nigra	Tennessee	35.74214	-83.97303	-114.9047	-3.256782	9.470657
GM17	J. nigra	Tennessee	35.71088	-83.88363	-113.2756	-7.873132	11.979418
GM20	J. nigra	Tennessee	35.75190	-83.97479	-112.4888	-8.404958	11.762947
GM25	J. nigra	Tennessee	35.79385	-84.26230	122.6842	42.058987	31.585040
GM39	J. nigra	Tennessee	35.94879	-83.93850	-112.8341	-3.085305	8.760090

Now I can plot desired axes and use metadata to color/shape

library(ggplot2)
cols = c("#30842a","#f07df2","#f74f4f","#70f0ff","#2657ff","#ffab1c")
ggplot(pcoa.meta, aes(x=Axis.1, y=Axis.2, color=Location, shape=Host)) + 
    geom_point(size=5, alpha=0.70) + theme_classic() + scale_color_manual(values=cols)

This creates a rather unpleasant grouping; piles of samples separated by long distances. This I assume is a result of the distance measure and the ordination measure. This does however support the general explanation that Arizona is distinct from Colorado, that Colorado has the widest degree of genetic variation, and that there have been at least two very distinct events leading to colonization of other areas into non-native regions.

NMDS

An NMDS should tease apart the clumps; as is, I cannot display names at all, we cannot make sense of the clump on the left (even if it is accurate, totally in line with the trees so far)

the vegan NMDS (function metaMDS) also calculates distances.

However, this errors out due to memory constraints, it calls for a 1.3TB vector. So NMDS is out of the question.

Alternative; zoom in on the cluster

library(ggplot2)
cols = c("#30842a","#f07df2","#f74f4f","#70f0ff","#2657ff","#ffab1c")
ggplot(pcoa.meta, aes(x=Axis.1, y=Axis.2, color=Location, shape=Host)) + 
    geom_point(size=8
               , alpha=0.70) + theme_classic() + scale_color_manual(values=cols) +
  scale_x_continuous(limits = c(-118, -108)) +
  scale_y_continuous(limits = c(-10, 0))

This will be presented as a subplot of the main ordination.

( rpubsUpload(“G.morbida.vcf.to.ordination”, “/home/rmurdoch/Dropbox/Projects/Geosmithia/dartr/1.geosmithia.vcf.to.ordnation.html”)) )