G. morbida GWAS with PLINK

1 Overview of Data Conversion for Input to PLINK
- 1.1 Convert to PLINK using custom pipeline
2 PLINK2.0
- 2.1 Preparing the vcf file
- 2.2 Preparing the phenotype file for GWAS trials
3 Workflow for loading data to PLINK
4 Running PLINK2 GWAS
5 Linkage diequalibrium (LD) decay
6 LD decay conclusions
7 polyploidy / aneuploidy
- 7.1 addressing aneuploidy based on read depth disparities
8 how many consistently variant sites are there
- 8.1 conclusions: aneuploidy

PLINK2 is a common system for performing GWAS analyses (in addition to other related analyses).

This document describes

exploring application of GWAS to the data set, with the understanding that sample size is far too low and LD decay is not described. This is intended only as exploratory
building LD decay plots
exploring polyploidy and aneuploidy in the data set

1 Overview of Data Conversion for Input to PLINK

Generally speaking, PLINK2 takes variant (SNP) data and phenotypic data. However, the specific data formats are odd. I tried using vfctools, which supposedly has a plink convert, but it did not produce anything useful.

1.1 Convert to PLINK using custom pipeline

described here: http://apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html

This also handles some of the limitations of the PLINK system.

2 PLINK2.0

I was using PLINK 1.9, lets try 2.0 instead; it says it can handle vcf format

https://www.cog-genomics.org/plink/2.0/

Required files: * multi-vcf * phenotype https://www.cog-genomics.org/plink/2.0/input#pheno

2.1 Preparing the vcf file

First I want to change the strain names in the multi-vcf. I have what looks like this: _S2_L001 (Variants, AAC), where the S**L*** is variable.

sed 's/ (Variants, AAC)//g' multi_all.vcf > clean.vcf<br>
sed 's/_L001//g' clean.vcf > clean2.vcf<br>
sed 's/_L003//g' clean2.vcf > clean3.vcf<br>
sed 's/_S[0-9][0-9]//g' clean3.vcf > clean4.vcf<br>
sed 's/_S[0-9]//g' clean4.vcf > clean5.vcf<br>
mv clean5.vcf all.vcf

Next, the scaffold names are no good for PLINK (maybe), they should just be numbered

sed 's/scaffold_//g' all.vcf > all.vcf

this will be fed to PLINK2 using

--vcf all.vcf

2.2 Preparing the phenotype file for GWAS trials

Prepared as described in PLINK2 instructions

(note that the blank cells here are actually “NA” in the file that will be fed into PLINK2)

ptype <- read.csv("PLINK/phenotype_cat.tsv", sep="\t",header = T, na.strings = "NA")
datatable(ptype)

Supposedly when running PLINK2 I can be explicit regarding which

This will be fed to PLINK using

–pheno phenotype_cat.tsv –pheno-name Cat

3 Workflow for loading data to PLINK

bcftools norm -Ou -m -any all.vcf |
bcftools norm -Ou -f G.morbida.refgenome.fa |
bcftools annotate -Ob -x ID \
-I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out output
    
bcftools norm -Ou -m -any all.vcf |
bcftools norm -Ou -f G.morbida.refgenome.fa |
bcftools annotate -Ob -x ID \
-I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out output

both plink 1.9 and plink 2 binaries are placed in /usr/bin (not sure if 1.9 is required)

First we convert the vcf to a binary format (note that we have to tell PLINK2 what our “chromosome” count is using –chr-set; it assumes human and will give an error otherwise)

plink2 --vcf all.vcf --chr-set 72 --out all_binary

PLINK cannot handle multi-allelic sites and fixing this seems to be a hassle. All I need is for different variants to be on different lines… but once the vcf is built, this is quite challenging. bcftools is generally able to do it? but bcftools needs the vcf to be “indexed” using tabix. Many solutions for this here https://www.biostars.org/p/59492/… BUT first the vcf has to be SORTED, presumably by chromosome and then by position.

test.vcf <- read.csv("PLINK/all.vcf", sep = "\t", comment.char = "#", header = F)
test.vcf <- test.vcf[with(test.vcf, order(V1, V2)),]
datatable(head(test.vcf))

write.table(test.vcf, "PLINK/all.sort.vcf", sep="\t", row.names = F, quote = F, col.names = F)

This has all of the comment lines and headers stripped, so it will have to be merged onto it from the original vcf

head -n 14 all.vcf > vcf.headers.txt<br>
cat vcf.headers.txt all.sort.vcf > all.sort.head.vcf

This appears to be correct now, sorted and with headers

Now I can use tabix to index?

/home/robert/Tools/htslib/bgzip -c all.sort.head.vcf > all.sort.head.vcg.gz<br>
/home/robert/Tools/htslib/tabix -p vcf all.sort.head.vcg.gz

to use bfctools “norm”“, I have to have a corresponding refrence genome; i have to change the scaffold names again

sed 's/scaffold_//g' G.morbida.refgenome.fa > G.morbida.refgenome.clean.fa

now use bfctools norm as described:

bcftools norm -Ou -m -any all.sort.head.vcf.gz |
bcftools norm -Ou -f G.morbida.refgenome.clean.fa |
bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
   --keep-allele-order \
   --vcf-idspace-to _ \
   --const-fid \
   --allow-extra-chr 0 \
   --split-x b37 no-fail \
   --make-bed \
   --out output

Step 1: This step splits multi-allelic sites:

bcftools norm -Ou -m -any all.sort.head.vcf.gz > split.bcf.gz

stdout: “Lines total/split/realigned/skipped: 591211/16914/0/0”

So this means that the sites are split… the second command fails due to some error which was cited as a bug in 2016. This is aimed at changing some split sites so that might be mis-interpreted by PLINK. Unfortunately I will simply have to skip this.

step 3:

bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' split.bcf.gz  > split.an.bcf.gz

step 4:

this should import into PLINK

This “correct” command failed with error “invalid centimorgan position detected”

plink --bcf split.an.bcf.gz \
  --keep-allele-order \
  --vcf-idspace-to _ \
  --const-fid \
  --chr-set 72 \
  --split-x b37 no-fail \
  --make-bed \
  --out output

This simplified command worked

plink --bcf split.an.bcf.gz \
--chr-set 72 \
--out output

reporting that “output.bed + output.bim + output.fam written”

Note that this made PLINK 1.9 format files. I can also export the de-multi-alleled bcf as a vcf file for use in PLINK2

bcftools convert -o all.demulti.vcf -O v split.an.bcf.gz

This has space characters in the alternate allele columns, which causes everything downstream to fail. These can all be bruteforce removed.

sed 's/ //g' all.demulti.vcf > all.demulti2.vcf

I now have the sorted and indexed vcf that I need to bring into PLINK2

4 Running PLINK2 GWAS

https://www.cog-genomics.org/plink/2.0/input

4.1 import the sorted de-multiplexed vcf

plink2 --vcf all.demulti2.vcf --chr-set 72 --out all_plink2

4.1.1 Import log

PLINK v2.00a2LM 64-bit Intel (2 Apr 2019)<br>
Options in effect:<br>
  --chr-set 72<br>
  --out all_plink2<br>
  --vcf all.demulti2.vcf<br>

Hostname: robert-Precision-Tower-3620<br>
Working directory: /home/robert/Dropbox/Projects/Geosmithia/PLINK<br>
Start time: Thu Apr 25 17:39:03 2019<br>

Random number seed: 1556228343<br>
64350 MiB RAM detected; reserving 32175 MiB for main workspace.<br>
Using up to 8 compute threads.<br>
--vcf: 615568 variants scanned.<br>
--vcf: all_plink2.pgen + all_plink2.pvar + all_plink2.psam written.<br>

End time: Thu Apr 25 17:39:04 2019

4.2 Running PLINK2 analysis

This runs a linear regression based analysis against canker size:

plink2 --pfile all_plink2 --pheno phenotype_cat.tsv --pheno-name Canker --glm --out canker.glm

canker.glm <- read.csv("PLINK/canker.glm.Canker.glm.linear", sep = "\t")

Applying an fdr correction

canker.glm$fdr <- p.adjust(canker.glm$P)

sorting out only those with P-values less than 0.005

canker.glm <- canker.glm[with(canker.glm, order(T_STAT)),]
#canker.glm.p <- canker.glm[canker.glm$T_STAT < .05,]
canker.glm.p <- subset(canker.glm, P < 0.005)
canker.glm.p <- canker.glm.p[with(canker.glm.p, order(X.CHROM, POS)),]
datatable(canker.glm.p)

Interesting to note that NOTHING passes fdr-correction thresholds even though the p-values get very low (0.00009). I have to assume this is due to the small sample size (n=11).

Initial manual inspection reveals a lot of exciting stuff where or near the locations of these SNPs, such as membrane trafficking, peptidases and cellulases.

4.3 Annotating variant locations with nearby gene names

I next need a clean mapping protocol for these SNPs, which will apply:

if the SNP is in a gene, return the gene name
if the closest up- and down-stream genes are the same, return one gene name and declare it in an intron
if the up- and down-stream genes are different, return both names.

gene.positions <- read.csv("../Annotations/geosmithia_annotations/gene.position.clean.txt", sep="\t",header=F)
colnames(gene.positions) <- c("scaffold","gene","start","stop","direction")

glm.005 <- data.frame()

for (x in c(1:69)) {
  test <- canker.glm.p[x,]
  gene.pos.s <- gene.positions[(gene.positions$scaffold == test$X.CHROM), ]
  gene.pos.s <- gene.pos.s[(gene.pos.s$start < test$POS), ]
  gene.pos.s <- gene.pos.s[(gene.pos.s$stop > test$POS),]
  if (nrow(gene.pos.s) == 1) { #this capture only intragenic snps
    gene.pos.s$location <- "intragenic"
    glm.005 <- rbind(glm.005,gene.pos.s)
    }
  if (nrow(gene.pos.s) == 0) {    #this captures snps with only one flanking gene, at the end of a contig
    gene.pos.fix <- gene.positions[(gene.positions$scaffold == test$X.CHROM), ]
    gene.pos.fix.upstream <- tail(gene.pos.fix[(gene.pos.fix$start < test$POS), ] , 1)
    gene.pos.fix.downstream <- head(gene.pos.fix[(gene.pos.fix$stop > test$POS),] , 1)
    if (nrow(gene.pos.fix.downstream) == 0) {
      gene.pos.fix.upstream$location <- "flanking"
      glm.005 <- rbind(glm.005, gene.pos.fix.upstream)
    }
    else {
      if (nrow(gene.pos.fix.upstream) == 0) {
      gene.pos.fix.downstream$location <- "flanking"
      glm.005 <- rbind(glm.005, gene.pos.fix.downstream)
    }
      else { # if it is not intra-genic or flanking (terminal), then it must be an intron
        gene.pos.fix <- gene.positions[(gene.positions$scaffold == test$X.CHROM), ]
        gene.pos.fix.upstream <- tail(gene.pos.fix[(gene.pos.fix$start < test$POS), ] , 1)
        gene.pos.fix.downstream <- head(gene.pos.fix[(gene.pos.fix$stop > test$POS),] , 1)
          if (gene.pos.fix.upstream$gene == gene.pos.fix.downstream$gene) {
          gene.pos.fix <- rbind(gene.pos.fix.upstream, gene.pos.fix.downstream)
          gene.pos.fix$location <- "intron"
          glm.005 <- rbind(glm.005,gene.pos.fix)
          }
          if (gene.pos.fix.upstream$gene != gene.pos.fix.downstream$gene ) {
          gene.pos.fix <- rbind(gene.pos.fix.upstream, gene.pos.fix.downstream)
          gene.pos.fix$location <- "intergenic"
          glm.005 <- rbind(glm.005,gene.pos.fix)
          }
        }
      }
}}
datatable(glm.005)

Attaching annotation information is straight-forward from this point on… however, I will stop here

4.4 Conclusions

This ultimately allowed me to get the multi-vcf into PLINK(2.0) and run one of the many types of analyses available.

Points of uncertainty:

SNP count fed into the process was very high, 600,000 for only 11 strains. This leads to stricter multiple-testing corrections to p-values (i.e. fdr). I have seen methods used for SNP filtering, such as removing “rare” variants. This I believe would take us to ~200,000 at most
There are likely still some problems with the vcf file; this is broadly described in this blog post: http://apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html. I’m not overly concerned about a small amount of data mis-handling though.
I am not sure about the utility of using these p-values or fdr correction technique. Another paper I saw, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6262169/ , did not use the PLINK p-values, rather fed the beta scores and SE values into a diffrent R package (ashr) and performed “empirical Bayesian multiple hypothesis testing” and then applied an estimate of the “local false sign rate (lfsr)”. Is this necessary? What does the p-value provided by PLINK2 represent? Is it multiple testing corrected?
variants can be filtered directly in PLINK2; you can remove SNPs in linkage disequalibrium (https://www.cog-genomics.org/plink/2.0/ld#indep) and you can remove the rare variants I think
Phenotype can also be normalized (https://www.cog-genomics.org/plink/2.0/data#variance_standardize)

5 Linkage diequalibrium (LD) decay

the challenge here is actualy that what I want to do is NOT simply linkage disequilibrium. Rather, it is a composite of many many rounds of LD, genome wide LD calculation

Goal: Graphical relationship of All two loci pairwise comparisons, r2 relative to either genetic or physical distance. ALL PAIRWISE LINKAGE COMPARISONS.

This aspect of PLINK is described here: https://www.cog-genomics.org/plink/1.9/ld . For some reason, this does not appear to be functional in PLINK 2.0. It appears fairly straight-forward, not too many commands or requirements.

5.1 importing the VCF (? this title is wrong)

This has to change because I had included GM104 in the GWAS (it has available canker data). However, it is terrible for the LD test (too many variants).

Try ignoring GM104 via the “remove” flag:

sudo plink --r2 --bfile all_plink1 --ld-window-r2 0.01 --ld-window 100 --ld-window-cm 1000 --chr-set -72 --remove remove.txt --out r2_result_v2

the file remove.txt simply looks like this:

GM104<tab>GM104

(use an actual tab character) This worked perfectly.

Basic command to use as a starting point:

sudo plink --r2 --bfile all_plink1 --ld-window-r2 0.01 --ld-window 10000 --ld-window-cm 100000 --remove remove.txt --chr-set -72 --out r2_result_v2

r2 indicates that you want to run variant association
bfile points to binary format (PLINK2) variant files (via. VCF, generated above)
ld-window-r2 is the threshold for reporting r2.
ld-window is the max number of variants between variant pairs that will allow them to be tested… kind of weird, example, variant A and variant B have 50,000 variants between them, this will not be analyzed
ld-window-cm is very important! It is the maximum distance for association testing, by defaullt only 1000bp. I really should test up to 100,000bp apart (this will be huge data file I believe)
chr-set is required if you have something other than human. You provide the number of “chromosomes”, scaffolds in our case. It must be set to negative (-72) to indicate haploid.
out is simply the output file name header

Running the command above generates a 278 mb results file with 2 million pairwise comparisons!

x<-1
r2 <- read.table("PLINK/r2_result_v3.ld",header = T)
datatable(head(r2))

Potential sign of error here: there are two variants at the same site listed as r2 = 1. this should NOT be possible. check the multivcf, what is going on? these are poly-allelic sites which appear as diploid, i.e. multiple variants in the same genome. This actually provides an interestingly quick way to screen for diploid alleles.

5.2 LD decay ignoring very close sites

5.2.1 parsing the r2 file

We simply need a two column distance vs. r2 table which excludes variants within 100bp of one-another

ld <- r2
ld$distance <- abs(ld$BP_B - ld$BP_A)
ld <- ld[c(8,7)]
ld$distance <- as.numeric(ld$distance)
ld <- subset(ld, distance > 99)

out of 2 million pairs, 18,000 were removed by the 100bp proximity test

5.2.2 plotting linkage

with 2 million points, this is computationally taxing

library(ggplot2)
ggplot(ld, aes(x=distance, y=R2)) + geom_point(size=0.1) + geom_smooth(color = "red") +
  theme_bw()

ggsave("LD_decay_point.pdf", device="pdf")
ggsave("LD_decay_point.png", device="png")
ggplot(ld, aes(x=distance, y=R2)) + geom_smooth(color = "red") +
  theme_bw()

ggsave("LD_decay.pdf", device="pdf")
ggsave("LD_decay.png", device="png")

Why would the smooth be different whether I plot with or without the points? This makes no sense… Actually this is simply due to differences in how the Y-axis is displayed, the line-only plot is largely meaningless. Check the plot with more points, 1-99 bp

These plots are messy; I should consider switching to a method employing a binned mean

5.3 LD decay including all sites aside from 0

5.3.1 parsing the r2 file

We simply need a two column distance vs. r2 table which excludes variants within 100bp of one-another

ld <- r2
ld$distance <- abs(ld$BP_B - ld$BP_A)
ld <- ld[c(8,7)]
ld$distance <- as.numeric(ld$distance)
ld <- subset(ld, distance > 0)

out of 2 million pairs, only 18,000 were removed by the 100bp proximity test

5.3.2 plotting linkage

with 2 million points, this is computationally taxing

library(ggplot2)
ggplot(ld, aes(x=distance, y=R2)) + geom_point(size=0.1) + geom_smooth(color = "red") +
  theme_bw()

ggsave("LD_decay_point_1bp.pdf", device="pdf")
ggsave("LD_decay_point_1bp.png", device="png")
ggplot(ld, aes(x=distance, y=R2)) + geom_smooth(color = "red") +
  theme_bw() + ylim(0,0.6)

ggsave("LD_decay_1bp.pdf", device="pdf")
ggsave("LD_decay_1bp.png", device="png")

Plots look essentially identical

I am still very uncomfortable with this; if the organism was truly asexual, you would expect very high r2 values throughout, not 0.35 with enormous variability. Try plotting segmental means rather than smoothing all points; I do NOT trust the smoothing algorithm here since I am not controlling the formula

5.4 binned means LD decay

library(dplyr)
library(stringr)
#I'm honestly not sure what this does, check the output table to get an idea.  Taken from https://www.biostars.org/p/300381/#300423
ld$distc <- cut(ld$distance,breaks=seq(from=min(ld$distance), to=max(ld$distance)+1,by=100))
ld.bin <- ld %>% group_by(distc) %>% summarize(mean=mean(R2),median=median(R2))
#pull out median from each interval
ld.bin <- ld.bin %>% mutate(start=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"^[0-9-e+.]+")),
                        end=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"[0-9-e+.]+$")),
                        mid=start+((end-start)/2))
ggplot()+
  geom_point(data=ld.bin,aes(x=start,y=mean),size=0.4,colour="grey20")+
  geom_line(data=ld.bin,aes(x=start,y=mean),size=0.3,alpha=0.5,colour="grey40")+
  labs(x="Distance (Megabases)",y=expression(LD~(r^{2})))+
  scale_x_continuous(breaks=c(0,2*10^6,4*10^6,6*10^6,8*10^6),labels=c("0","2","4","6","8"))+
  theme_bw()

ggsave("LD_decay_binned_means.pdf", device = "pdf")

We can see that it really didn’t change the plot. This is for real! Still nonsense

6 LD decay conclusions

https://www.g3journal.org/content/7/8/2461 This paper describes methods associated with this type of curve. They saw “almost no LD decay” in wild peas. They ran an r2v estimator analysis, described here: https://www.nature.com/articles/hdy201173/, which is essentially a correction of r2 based on a kinship matrix. They demonstrate that in a structured population, r2 for unlinked loci is strongly biased (inflated). In theory, if we can build a kinship matrix, we can move forward with at least a corrected LD decay chart.

This paper" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6220858/ reviews the current status of kinship estimation in the NGS era. It would be the correct starting point. These papers provide adequate explanation that in a highly-structured population, which we have demonstrated via the variant ordination patterns, it is not possible to obtain clear LD decay measures. Sites that are in disequilibrium will have inflated r2 values due to kinship/population-structure. This is not the appropriate set of samples on which to conduct such an analysis. In theory, we could attempt to build a kindship matrix based on ordination distances or extracted haplotype information or even from the SNP tree tip to tip distance, but we would be left with an abyssmally small sample (since so many individuals would be effectively wiped out due to close kinship). What this does tell us is that there is NO SIGN AT ALL of any sexual reproduction. If sexual reproduction takes place, it seems to be somewhat rare and/or restricted to local populations.

7 polyploidy / aneuploidy

While in this organism most loci are certianly haploid, there is a certain proportion of apparently diploid loci. How many? This is best arrived at by exploring the vcf file.

Any variant call cell which contains “/” is polyploid. To detect “" we can simply remove all other characters, i.e. numbers,”:“, and”,"

vcf <- read.csv("PLINK/all.sort.head.vcf",skip=13,header = T,sep="\t",stringsAsFactors = F)
#remove GM104
vcf <- vcf[-17]
#remove everything other than "/" and turn "/" into numeric "1"
library(stringr)
t <- vcf
for (n in c(10:31)) {
  t[n] <- str_remove_all(unlist(t[n]),"[:digit:]")
  t[n] <- str_remove_all(unlist(t[n]),"\\,")
  t[n] <- str_remove_all(unlist(t[n]),"\\:")
  t[n] <- str_remove_all(unlist(t[n]),"\\.")
  t[n] <- str_replace_all(unlist(t[n]),"\\/","1")
  t[n] <- as.numeric(unlist(t[n]))
}
#sums for each strain and site
c <- t[-c(1:9)]
site.var <- rowSums(c, na.rm = T)
strain.var <- colSums(c, na.rm = T)

#add site var to main table
vcf$diploid <- site.var

strain.var

##  GM11  GM17  GM20  GM25  GM39  GM59  GM64 GM108 GM118 GM140 GM158 GM170 
##  3981  2321  1592  6189  3983  3629  3953  3187  6664  6653  6622  5862 
## GM177 GM178 GM179 GM188 GM196 GM205 GM216 GM219 GM236 GM307 
##  5013  5657  6298  1379  3586  4193  5907  5992  3979  3835

Each strain has 1500 - 6000 diploid sites. The question now is how they are distibuted?

there are four explanations of polyploid sites I can come up with (followed by how to test, how the polyploid pattern will appear)

duplicated genes in the reference genome (randomly distributed but concentrated in single genes in all strains)
newly duplicated genes (randomly distributed but concentrated in single genes in some strains)
mistaken variant calls (random sporadic distribution)
diploid chromosomes (concentrated distribution on certain scaffolds)

A simple test for chromosomal frequency would be to summarize by scaffold.

library(dplyr)
var.scaf <- vcf %>% group_by(X.CHROM) %>% summarise(mean=mean(diploid))
var.scaf <- data.frame(var.scaf)
plot(var.scaf$X.CHROM, var.scaf$mean,
     xlab = "scaffold", ylab = "average polyploid sites per variant site")

Distribution seems non-random, but this is a very crude representation. Average polyploid count only rises significantly in the smaller scaffolds. This seems to argue against any widespread diploidy, i.e. any diploid chromosome found in all members of the population.

What we need to see is polyploidy site ratio per scaffold per strain.

library(dplyr)
library(tidyr)
var.long <- gather(t, strain, var, GM11:GM307)
var.long[is.na(var.long)] <- 0
var.long <- var.long %>% group_by(X.CHROM, strain) %>% summarise(mean=mean(var, na.rm = F))
var.long <- data.frame(var.long)
library(ggplot2)
ggplot(var.long, aes(X.CHROM, mean)) + geom_line() + facet_grid(vars(strain)) + theme(strip.text.y = element_text(angle = 0))

We see a fairly consistent pattern of polyploidy across the assembly for the strains, with the smaller strains going a bit crazy, as expected. However, there are a few key interesting patterns here:

some consistent hotspots of polyploidy in all strains
GM205, two much hotter spots around scoffolds 25-30.

Taking a closer look at GM205

GM205 <- subset(var.long, strain=="GM205")
GM205

##      X.CHROM strain        mean
## 14         0  GM205 0.001956779
## 36         1  GM205 0.001890101
## 58         2  GM205 0.002717823
## 80         3  GM205 0.002659380
## 102        4  GM205 0.002187376
## 124        5  GM205 0.001884182
## 146        6  GM205 0.008623974
## 168        7  GM205 0.001466327
## 190        8  GM205 0.002227275
## 212        9  GM205 0.010267751
## 234       10  GM205 0.001831588
## 256       11  GM205 0.010741294
## 278       12  GM205 0.002588396
## 300       13  GM205 0.002589893
## 322       14  GM205 0.008400399
## 344       15  GM205 0.012096286
## 366       16  GM205 0.002584934
## 388       17  GM205 0.001410106
## 410       18  GM205 0.002685673
## 432       19  GM205 0.002057848
## 454       20  GM205 0.002586207
## 476       21  GM205 0.001725295
## 498       22  GM205 0.013078954
## 520       23  GM205 0.081684982
## 542       24  GM205 0.002471577
## 564       25  GM205 0.001033058
## 586       26  GM205 0.002785084
## 608       27  GM205 0.003401361
## 630       28  GM205 0.041382447
## 652       29  GM205 0.251497006
## 674       30  GM205 0.293233083
## 696       31  GM205 0.002793296
## 718       32  GM205 0.000000000
## 740       33  GM205 0.248699272
## 762       34  GM205 0.061224490
## 784       35  GM205 0.000000000
## 806       36  GM205 0.000000000
## 828       38  GM205 0.000000000
## 850       39  GM205 0.000000000
## 872       40  GM205 0.000000000
## 894       41  GM205 0.000000000
## 916       42  GM205 0.000000000
## 938       43  GM205 0.058219178
## 960       44  GM205 0.000000000
## 982       45  GM205 0.000000000
## 1004      46  GM205 0.000000000
## 1026      47  GM205 0.000000000
## 1048      48  GM205 0.000000000
## 1070      49  GM205 0.000000000
## 1092      50  GM205 0.000000000
## 1114      51  GM205 0.000000000
## 1136      52  GM205 0.000000000
## 1158      53  GM205 0.000000000
## 1180      54  GM205 0.028688525
## 1202      55  GM205 0.128617363
## 1224      56  GM205 0.000000000
## 1246      57  GM205 0.000000000
## 1268      58  GM205 0.051724138
## 1290      59  GM205 0.187500000
## 1312      60  GM205 0.014084507
## 1334      61  GM205 0.133333333
## 1356      62  GM205 0.000000000
## 1378      63  GM205 0.170212766
## 1400      64  GM205 0.000000000
## 1422      65  GM205 0.000000000
## 1444      66  GM205 0.066666667
## 1466      67  GM205 0.079470199
## 1488      68  GM205 0.011494253
## 1510      69  GM205 0.000000000
## 1532      70  GM205 0.032258065
## 1554      71  GM205 0.000000000
## 1576      72  GM205 0.000000000

Three scaffolds, 29, 30, and 33, are at ~25% polyploid loci.

7.1 addressing aneuploidy based on read depth disparities

The VCF also encodes read depth. example:

1:0,38:38

The last number is the total number of reads mapping to this site. We can examine the patterns in the same way, but based on total read depth instead of polyploidy. This would be more ideally extracted simply from the bed files.

Plot as shown here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3744429/

The issue with our study is that the reference genome is essentially unknown in terms of chromosome structure, we just have a pile of scaffolds.

I have not completed this section yet

8 how many consistently variant sites are there

counting how many strains have each variant

vcf <- read.csv("PLINK/all.sort.head.vcf",skip=13,header = T,sep="\t",stringsAsFactors = F)
#remove GM104
vcf <- vcf[-17]
#remove everything other than "/" and turn "/" into numeric "1"
library(stringr)
t <- vcf
for (n in c(10:31)) {
  t[n] <- str_remove_all(unlist(t[n]),"[:digit:]")
  t[n] <- str_remove_all(unlist(t[n]),"\\:")
  t[n] <- str_remove_all(unlist(t[n]),"\\/")
  t[n] <- str_replace_all(unlist(t[n]),"\\.","0")
  t[n] <- str_replace_all(unlist(t[n]),"\\,","1")
  t[n] <- as.numeric(unlist(t[n]))
}
#sums for each strain and site
c <- t[-c(1:9)]
site.var <- rowSums(c, na.rm = T)
#brute force out parsing errors
x <- c(1:length(site.var))
c <- data.frame(site.var, x)
vars.all.strains <- subset(c, site.var > 0)
hist(site.var, breaks = 21)

#strain.var <- colSums(c, na.rm = T)



#add site var to main table
#vcf$diploid <- site.var

1749 variants are found in ALL genomes

38448 found in at least 10 strains

This is an open issue, whether if we subset into subpopulations and run LD decay again, will anything interesting emerge? We will not proceed.

8.1 conclusions: aneuploidy

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5656283/ Many examples of aneuploidy in fungi, so this wouldn’t be so unusual. Also there are some great examples of ways to plot this data.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3744429/ this paper looks at copy number variation and assocaition with pathogenicity in a fungus

It appears that strain GM205 has gone very polyploid at one location, with consistent multiple alleles on scaffolds 29, 30, and 33. This comprises about 40 Kbases though. I’m leaning towards ignoring this for the most part; perhaps it could also be evidence of a large duplication event within a chromosome rather than an entire duplicated segment. Again though, this should be addressed by looking at read depth. I would like to be able to say that there is

Some evidence of kilobase-scale duplications having occurred in the strains followed by divergent evolution. This is possibly evidence of aneuploidy but without a completed reference genome it is not possible to draw such conclusions.