G. morbida GWAS with PLINK
G. morbida GWAS with PLINK
PLINK2 is a common system for performing GWAS analyses (in addition to other related analyses).
This document describes
exploring application of GWAS to the data set, with the understanding that sample size is far too low and LD decay is not described. This is intended only as exploratory
building LD decay plots
exploring polyploidy and aneuploidy in the data set
1 Overview of Data Conversion for Input to PLINK
Generally speaking, PLINK2 takes variant (SNP) data and phenotypic data. However, the specific data formats are odd. I tried using vfctools, which supposedly has a plink convert, but it did not produce anything useful.
1.1 Convert to PLINK using custom pipeline
described here: http://apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html
This also handles some of the limitations of the PLINK system.
2 PLINK2.0
I was using PLINK 1.9, lets try 2.0 instead; it says it can handle vcf format
https://www.cog-genomics.org/plink/2.0/
Required files: * multi-vcf * phenotype https://www.cog-genomics.org/plink/2.0/input#pheno
2.1 Preparing the vcf file
First I want to change the strain names in the multi-vcf. I have what looks like this: _S2_L001 (Variants, AAC), where the S**L*** is variable.
sed 's/ (Variants, AAC)//g' multi_all.vcf > clean.vcf<br>
sed 's/_L001//g' clean.vcf > clean2.vcf<br>
sed 's/_L003//g' clean2.vcf > clean3.vcf<br>
sed 's/_S[0-9][0-9]//g' clean3.vcf > clean4.vcf<br>
sed 's/_S[0-9]//g' clean4.vcf > clean5.vcf<br>
mv clean5.vcf all.vcf
Next, the scaffold names are no good for PLINK (maybe), they should just be numbered
sed 's/scaffold_//g' all.vcf > all.vcf
this will be fed to PLINK2 using
--vcf all.vcf
2.2 Preparing the phenotype file for GWAS trials
Prepared as described in PLINK2 instructions
(note that the blank cells here are actually “NA” in the file that will be fed into PLINK2)
ptype <- read.csv("PLINK/phenotype_cat.tsv", sep="\t",header = T, na.strings = "NA")
datatable(ptype)Supposedly when running PLINK2 I can be explicit regarding which
This will be fed to PLINK using
–pheno phenotype_cat.tsv –pheno-name Cat
3 Workflow for loading data to PLINK
bcftools norm -Ou -m -any all.vcf |
bcftools norm -Ou -f G.morbida.refgenome.fa |
bcftools annotate -Ob -x ID \
-I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out output
bcftools norm -Ou -m -any all.vcf |
bcftools norm -Ou -f G.morbida.refgenome.fa |
bcftools annotate -Ob -x ID \
-I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out output
both plink 1.9 and plink 2 binaries are placed in /usr/bin (not sure if 1.9 is required)
First we convert the vcf to a binary format (note that we have to tell PLINK2 what our “chromosome” count is using –chr-set; it assumes human and will give an error otherwise)
plink2 --vcf all.vcf --chr-set 72 --out all_binary
PLINK cannot handle multi-allelic sites and fixing this seems to be a hassle. All I need is for different variants to be on different lines… but once the vcf is built, this is quite challenging. bcftools is generally able to do it? but bcftools needs the vcf to be “indexed” using tabix. Many solutions for this here https://www.biostars.org/p/59492/… BUT first the vcf has to be SORTED, presumably by chromosome and then by position.
test.vcf <- read.csv("PLINK/all.vcf", sep = "\t", comment.char = "#", header = F)
test.vcf <- test.vcf[with(test.vcf, order(V1, V2)),]
datatable(head(test.vcf))write.table(test.vcf, "PLINK/all.sort.vcf", sep="\t", row.names = F, quote = F, col.names = F)This has all of the comment lines and headers stripped, so it will have to be merged onto it from the original vcf
head -n 14 all.vcf > vcf.headers.txt<br>
cat vcf.headers.txt all.sort.vcf > all.sort.head.vcf
This appears to be correct now, sorted and with headers
Now I can use tabix to index?
/home/robert/Tools/htslib/bgzip -c all.sort.head.vcf > all.sort.head.vcg.gz<br>
/home/robert/Tools/htslib/tabix -p vcf all.sort.head.vcg.gz
to use bfctools “norm”“, I have to have a corresponding refrence genome; i have to change the scaffold names again
sed 's/scaffold_//g' G.morbida.refgenome.fa > G.morbida.refgenome.clean.fa
now use bfctools norm as described:
bcftools norm -Ou -m -any all.sort.head.vcf.gz |
bcftools norm -Ou -f G.morbida.refgenome.clean.fa |
bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' |
plink --bcf /dev/stdin \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out output
Step 1: This step splits multi-allelic sites:
bcftools norm -Ou -m -any all.sort.head.vcf.gz > split.bcf.gz
stdout: “Lines total/split/realigned/skipped: 591211/16914/0/0”
So this means that the sites are split… the second command fails due to some error which was cited as a bug in 2016. This is aimed at changing some split sites so that might be mis-interpreted by PLINK. Unfortunately I will simply have to skip this.
step 3:
bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' split.bcf.gz > split.an.bcf.gz
step 4:
this should import into PLINK
This “correct” command failed with error “invalid centimorgan position detected”
plink --bcf split.an.bcf.gz \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--chr-set 72 \
--split-x b37 no-fail \
--make-bed \
--out output
This simplified command worked
plink --bcf split.an.bcf.gz \
--chr-set 72 \
--out output
reporting that “output.bed + output.bim + output.fam written”
Note that this made PLINK 1.9 format files. I can also export the de-multi-alleled bcf as a vcf file for use in PLINK2
bcftools convert -o all.demulti.vcf -O v split.an.bcf.gz
This has space characters in the alternate allele columns, which causes everything downstream to fail. These can all be bruteforce removed.
sed 's/ //g' all.demulti.vcf > all.demulti2.vcf
I now have the sorted and indexed vcf that I need to bring into PLINK2
4 Running PLINK2 GWAS
https://www.cog-genomics.org/plink/2.0/input
4.1 import the sorted de-multiplexed vcf
plink2 --vcf all.demulti2.vcf --chr-set 72 --out all_plink2
4.1.1 Import log
PLINK v2.00a2LM 64-bit Intel (2 Apr 2019)<br>
Options in effect:<br>
--chr-set 72<br>
--out all_plink2<br>
--vcf all.demulti2.vcf<br>
Hostname: robert-Precision-Tower-3620<br>
Working directory: /home/robert/Dropbox/Projects/Geosmithia/PLINK<br>
Start time: Thu Apr 25 17:39:03 2019<br>
Random number seed: 1556228343<br>
64350 MiB RAM detected; reserving 32175 MiB for main workspace.<br>
Using up to 8 compute threads.<br>
--vcf: 615568 variants scanned.<br>
--vcf: all_plink2.pgen + all_plink2.pvar + all_plink2.psam written.<br>
End time: Thu Apr 25 17:39:04 2019
4.2 Running PLINK2 analysis
This runs a linear regression based analysis against canker size:
plink2 --pfile all_plink2 --pheno phenotype_cat.tsv --pheno-name Canker --glm --out canker.glm
canker.glm <- read.csv("PLINK/canker.glm.Canker.glm.linear", sep = "\t")Applying an fdr correction
canker.glm$fdr <- p.adjust(canker.glm$P)sorting out only those with P-values less than 0.005
canker.glm <- canker.glm[with(canker.glm, order(T_STAT)),]
#canker.glm.p <- canker.glm[canker.glm$T_STAT < .05,]
canker.glm.p <- subset(canker.glm, P < 0.005)
canker.glm.p <- canker.glm.p[with(canker.glm.p, order(X.CHROM, POS)),]
datatable(canker.glm.p)Interesting to note that NOTHING passes fdr-correction thresholds even though the p-values get very low (0.00009). I have to assume this is due to the small sample size (n=11).
Initial manual inspection reveals a lot of exciting stuff where or near the locations of these SNPs, such as membrane trafficking, peptidases and cellulases.
4.3 Annotating variant locations with nearby gene names
I next need a clean mapping protocol for these SNPs, which will apply:
- if the SNP is in a gene, return the gene name
- if the closest up- and down-stream genes are the same, return one gene name and declare it in an intron
- if the up- and down-stream genes are different, return both names.
gene.positions <- read.csv("../Annotations/geosmithia_annotations/gene.position.clean.txt", sep="\t",header=F)
colnames(gene.positions) <- c("scaffold","gene","start","stop","direction")
glm.005 <- data.frame()
for (x in c(1:69)) {
test <- canker.glm.p[x,]
gene.pos.s <- gene.positions[(gene.positions$scaffold == test$X.CHROM), ]
gene.pos.s <- gene.pos.s[(gene.pos.s$start < test$POS), ]
gene.pos.s <- gene.pos.s[(gene.pos.s$stop > test$POS),]
if (nrow(gene.pos.s) == 1) { #this capture only intragenic snps
gene.pos.s$location <- "intragenic"
glm.005 <- rbind(glm.005,gene.pos.s)
}
if (nrow(gene.pos.s) == 0) { #this captures snps with only one flanking gene, at the end of a contig
gene.pos.fix <- gene.positions[(gene.positions$scaffold == test$X.CHROM), ]
gene.pos.fix.upstream <- tail(gene.pos.fix[(gene.pos.fix$start < test$POS), ] , 1)
gene.pos.fix.downstream <- head(gene.pos.fix[(gene.pos.fix$stop > test$POS),] , 1)
if (nrow(gene.pos.fix.downstream) == 0) {
gene.pos.fix.upstream$location <- "flanking"
glm.005 <- rbind(glm.005, gene.pos.fix.upstream)
}
else {
if (nrow(gene.pos.fix.upstream) == 0) {
gene.pos.fix.downstream$location <- "flanking"
glm.005 <- rbind(glm.005, gene.pos.fix.downstream)
}
else { # if it is not intra-genic or flanking (terminal), then it must be an intron
gene.pos.fix <- gene.positions[(gene.positions$scaffold == test$X.CHROM), ]
gene.pos.fix.upstream <- tail(gene.pos.fix[(gene.pos.fix$start < test$POS), ] , 1)
gene.pos.fix.downstream <- head(gene.pos.fix[(gene.pos.fix$stop > test$POS),] , 1)
if (gene.pos.fix.upstream$gene == gene.pos.fix.downstream$gene) {
gene.pos.fix <- rbind(gene.pos.fix.upstream, gene.pos.fix.downstream)
gene.pos.fix$location <- "intron"
glm.005 <- rbind(glm.005,gene.pos.fix)
}
if (gene.pos.fix.upstream$gene != gene.pos.fix.downstream$gene ) {
gene.pos.fix <- rbind(gene.pos.fix.upstream, gene.pos.fix.downstream)
gene.pos.fix$location <- "intergenic"
glm.005 <- rbind(glm.005,gene.pos.fix)
}
}
}
}}
datatable(glm.005)Attaching annotation information is straight-forward from this point on… however, I will stop here
4.4 Conclusions
This ultimately allowed me to get the multi-vcf into PLINK(2.0) and run one of the many types of analyses available.
Points of uncertainty:
SNP count fed into the process was very high, 600,000 for only 11 strains. This leads to stricter multiple-testing corrections to p-values (i.e. fdr). I have seen methods used for SNP filtering, such as removing “rare” variants. This I believe would take us to ~200,000 at most
There are likely still some problems with the vcf file; this is broadly described in this blog post: http://apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html. I’m not overly concerned about a small amount of data mis-handling though.
I am not sure about the utility of using these p-values or fdr correction technique. Another paper I saw, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6262169/ , did not use the PLINK p-values, rather fed the beta scores and SE values into a diffrent R package (ashr) and performed “empirical Bayesian multiple hypothesis testing” and then applied an estimate of the “local false sign rate (lfsr)”. Is this necessary? What does the p-value provided by PLINK2 represent? Is it multiple testing corrected?
variants can be filtered directly in PLINK2; you can remove SNPs in linkage disequalibrium (https://www.cog-genomics.org/plink/2.0/ld#indep) and you can remove the rare variants I think
Phenotype can also be normalized (https://www.cog-genomics.org/plink/2.0/data#variance_standardize)
5 Linkage diequalibrium (LD) decay
the challenge here is actualy that what I want to do is NOT simply linkage disequilibrium. Rather, it is a composite of many many rounds of LD, genome wide LD calculation
Goal: Graphical relationship of All two loci pairwise comparisons, r2 relative to either genetic or physical distance. ALL PAIRWISE LINKAGE COMPARISONS.
This aspect of PLINK is described here: https://www.cog-genomics.org/plink/1.9/ld . For some reason, this does not appear to be functional in PLINK 2.0. It appears fairly straight-forward, not too many commands or requirements.
5.1 importing the VCF (? this title is wrong)
This has to change because I had included GM104 in the GWAS (it has available canker data). However, it is terrible for the LD test (too many variants).
Try ignoring GM104 via the “remove” flag:
sudo plink --r2 --bfile all_plink1 --ld-window-r2 0.01 --ld-window 100 --ld-window-cm 1000 --chr-set -72 --remove remove.txt --out r2_result_v2
the file remove.txt simply looks like this:
GM104<tab>GM104
(use an actual tab character) This worked perfectly.
Basic command to use as a starting point:
sudo plink --r2 --bfile all_plink1 --ld-window-r2 0.01 --ld-window 10000 --ld-window-cm 100000 --remove remove.txt --chr-set -72 --out r2_result_v2
r2 indicates that you want to run variant association
bfile points to binary format (PLINK2) variant files (via. VCF, generated above)
ld-window-r2 is the threshold for reporting r2.
ld-window is the max number of variants between variant pairs that will allow them to be tested… kind of weird, example, variant A and variant B have 50,000 variants between them, this will not be analyzed
ld-window-cm is very important! It is the maximum distance for association testing, by defaullt only 1000bp. I really should test up to 100,000bp apart (this will be huge data file I believe)
chr-set is required if you have something other than human. You provide the number of “chromosomes”, scaffolds in our case. It must be set to negative (-72) to indicate haploid.
out is simply the output file name header
Running the command above generates a 278 mb results file with 2 million pairwise comparisons!
x<-1
r2 <- read.table("PLINK/r2_result_v3.ld",header = T)
datatable(head(r2))Potential sign of error here: there are two variants at the same site listed as r2 = 1. this should NOT be possible. check the multivcf, what is going on? these are poly-allelic sites which appear as diploid, i.e. multiple variants in the same genome. This actually provides an interestingly quick way to screen for diploid alleles.
5.2 LD decay ignoring very close sites
5.2.1 parsing the r2 file
We simply need a two column distance vs. r2 table which excludes variants within 100bp of one-another
ld <- r2
ld$distance <- abs(ld$BP_B - ld$BP_A)
ld <- ld[c(8,7)]
ld$distance <- as.numeric(ld$distance)
ld <- subset(ld, distance > 99)out of 2 million pairs, 18,000 were removed by the 100bp proximity test
5.2.2 plotting linkage
with 2 million points, this is computationally taxing
library(ggplot2)
ggplot(ld, aes(x=distance, y=R2)) + geom_point(size=0.1) + geom_smooth(color = "red") +
theme_bw()ggsave("LD_decay_point.pdf", device="pdf")
ggsave("LD_decay_point.png", device="png")
ggplot(ld, aes(x=distance, y=R2)) + geom_smooth(color = "red") +
theme_bw()ggsave("LD_decay.pdf", device="pdf")
ggsave("LD_decay.png", device="png")Why would the smooth be different whether I plot with or without the points? This makes no sense… Actually this is simply due to differences in how the Y-axis is displayed, the line-only plot is largely meaningless. Check the plot with more points, 1-99 bp
These plots are messy; I should consider switching to a method employing a binned mean
5.3 LD decay including all sites aside from 0
5.3.1 parsing the r2 file
We simply need a two column distance vs. r2 table which excludes variants within 100bp of one-another
ld <- r2
ld$distance <- abs(ld$BP_B - ld$BP_A)
ld <- ld[c(8,7)]
ld$distance <- as.numeric(ld$distance)
ld <- subset(ld, distance > 0)out of 2 million pairs, only 18,000 were removed by the 100bp proximity test
5.3.2 plotting linkage
with 2 million points, this is computationally taxing
library(ggplot2)
ggplot(ld, aes(x=distance, y=R2)) + geom_point(size=0.1) + geom_smooth(color = "red") +
theme_bw()ggsave("LD_decay_point_1bp.pdf", device="pdf")
ggsave("LD_decay_point_1bp.png", device="png")
ggplot(ld, aes(x=distance, y=R2)) + geom_smooth(color = "red") +
theme_bw() + ylim(0,0.6)ggsave("LD_decay_1bp.pdf", device="pdf")
ggsave("LD_decay_1bp.png", device="png")Plots look essentially identical
I am still very uncomfortable with this; if the organism was truly asexual, you would expect very high r2 values throughout, not 0.35 with enormous variability. Try plotting segmental means rather than smoothing all points; I do NOT trust the smoothing algorithm here since I am not controlling the formula
5.4 binned means LD decay
library(dplyr)
library(stringr)
#I'm honestly not sure what this does, check the output table to get an idea. Taken from https://www.biostars.org/p/300381/#300423
ld$distc <- cut(ld$distance,breaks=seq(from=min(ld$distance), to=max(ld$distance)+1,by=100))
ld.bin <- ld %>% group_by(distc) %>% summarize(mean=mean(R2),median=median(R2))
#pull out median from each interval
ld.bin <- ld.bin %>% mutate(start=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"^[0-9-e+.]+")),
end=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"[0-9-e+.]+$")),
mid=start+((end-start)/2))
ggplot()+
geom_point(data=ld.bin,aes(x=start,y=mean),size=0.4,colour="grey20")+
geom_line(data=ld.bin,aes(x=start,y=mean),size=0.3,alpha=0.5,colour="grey40")+
labs(x="Distance (Megabases)",y=expression(LD~(r^{2})))+
scale_x_continuous(breaks=c(0,2*10^6,4*10^6,6*10^6,8*10^6),labels=c("0","2","4","6","8"))+
theme_bw()ggsave("LD_decay_binned_means.pdf", device = "pdf")We can see that it really didn’t change the plot. This is for real! Still nonsense
6 LD decay conclusions
https://www.g3journal.org/content/7/8/2461 This paper describes methods associated with this type of curve. They saw “almost no LD decay” in wild peas. They ran an r2v estimator analysis, described here: https://www.nature.com/articles/hdy201173/, which is essentially a correction of r2 based on a kinship matrix. They demonstrate that in a structured population, r2 for unlinked loci is strongly biased (inflated). In theory, if we can build a kinship matrix, we can move forward with at least a corrected LD decay chart.
This paper" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6220858/ reviews the current status of kinship estimation in the NGS era. It would be the correct starting point. These papers provide adequate explanation that in a highly-structured population, which we have demonstrated via the variant ordination patterns, it is not possible to obtain clear LD decay measures. Sites that are in disequilibrium will have inflated r2 values due to kinship/population-structure. This is not the appropriate set of samples on which to conduct such an analysis. In theory, we could attempt to build a kindship matrix based on ordination distances or extracted haplotype information or even from the SNP tree tip to tip distance, but we would be left with an abyssmally small sample (since so many individuals would be effectively wiped out due to close kinship). What this does tell us is that there is NO SIGN AT ALL of any sexual reproduction. If sexual reproduction takes place, it seems to be somewhat rare and/or restricted to local populations.
7 polyploidy / aneuploidy
While in this organism most loci are certianly haploid, there is a certain proportion of apparently diploid loci. How many? This is best arrived at by exploring the vcf file.
Any variant call cell which contains “/” is polyploid. To detect “" we can simply remove all other characters, i.e. numbers,”:“, and”,"
vcf <- read.csv("PLINK/all.sort.head.vcf",skip=13,header = T,sep="\t",stringsAsFactors = F)
#remove GM104
vcf <- vcf[-17]
#remove everything other than "/" and turn "/" into numeric "1"
library(stringr)
t <- vcf
for (n in c(10:31)) {
t[n] <- str_remove_all(unlist(t[n]),"[:digit:]")
t[n] <- str_remove_all(unlist(t[n]),"\\,")
t[n] <- str_remove_all(unlist(t[n]),"\\:")
t[n] <- str_remove_all(unlist(t[n]),"\\.")
t[n] <- str_replace_all(unlist(t[n]),"\\/","1")
t[n] <- as.numeric(unlist(t[n]))
}
#sums for each strain and site
c <- t[-c(1:9)]
site.var <- rowSums(c, na.rm = T)
strain.var <- colSums(c, na.rm = T)
#add site var to main table
vcf$diploid <- site.var
strain.var## GM11 GM17 GM20 GM25 GM39 GM59 GM64 GM108 GM118 GM140 GM158 GM170
## 3981 2321 1592 6189 3983 3629 3953 3187 6664 6653 6622 5862
## GM177 GM178 GM179 GM188 GM196 GM205 GM216 GM219 GM236 GM307
## 5013 5657 6298 1379 3586 4193 5907 5992 3979 3835
Each strain has 1500 - 6000 diploid sites. The question now is how they are distibuted?
there are four explanations of polyploid sites I can come up with (followed by how to test, how the polyploid pattern will appear)
duplicated genes in the reference genome (randomly distributed but concentrated in single genes in all strains)
newly duplicated genes (randomly distributed but concentrated in single genes in some strains)
mistaken variant calls (random sporadic distribution)
diploid chromosomes (concentrated distribution on certain scaffolds)
A simple test for chromosomal frequency would be to summarize by scaffold.
library(dplyr)
var.scaf <- vcf %>% group_by(X.CHROM) %>% summarise(mean=mean(diploid))
var.scaf <- data.frame(var.scaf)
plot(var.scaf$X.CHROM, var.scaf$mean,
xlab = "scaffold", ylab = "average polyploid sites per variant site")Distribution seems non-random, but this is a very crude representation. Average polyploid count only rises significantly in the smaller scaffolds. This seems to argue against any widespread diploidy, i.e. any diploid chromosome found in all members of the population.
What we need to see is polyploidy site ratio per scaffold per strain.
library(dplyr)
library(tidyr)
var.long <- gather(t, strain, var, GM11:GM307)
var.long[is.na(var.long)] <- 0
var.long <- var.long %>% group_by(X.CHROM, strain) %>% summarise(mean=mean(var, na.rm = F))
var.long <- data.frame(var.long)
library(ggplot2)
ggplot(var.long, aes(X.CHROM, mean)) + geom_line() + facet_grid(vars(strain)) + theme(strip.text.y = element_text(angle = 0))We see a fairly consistent pattern of polyploidy across the assembly for the strains, with the smaller strains going a bit crazy, as expected. However, there are a few key interesting patterns here:
some consistent hotspots of polyploidy in all strains
GM205, two much hotter spots around scoffolds 25-30.
Taking a closer look at GM205
GM205 <- subset(var.long, strain=="GM205")
GM205## X.CHROM strain mean
## 14 0 GM205 0.001956779
## 36 1 GM205 0.001890101
## 58 2 GM205 0.002717823
## 80 3 GM205 0.002659380
## 102 4 GM205 0.002187376
## 124 5 GM205 0.001884182
## 146 6 GM205 0.008623974
## 168 7 GM205 0.001466327
## 190 8 GM205 0.002227275
## 212 9 GM205 0.010267751
## 234 10 GM205 0.001831588
## 256 11 GM205 0.010741294
## 278 12 GM205 0.002588396
## 300 13 GM205 0.002589893
## 322 14 GM205 0.008400399
## 344 15 GM205 0.012096286
## 366 16 GM205 0.002584934
## 388 17 GM205 0.001410106
## 410 18 GM205 0.002685673
## 432 19 GM205 0.002057848
## 454 20 GM205 0.002586207
## 476 21 GM205 0.001725295
## 498 22 GM205 0.013078954
## 520 23 GM205 0.081684982
## 542 24 GM205 0.002471577
## 564 25 GM205 0.001033058
## 586 26 GM205 0.002785084
## 608 27 GM205 0.003401361
## 630 28 GM205 0.041382447
## 652 29 GM205 0.251497006
## 674 30 GM205 0.293233083
## 696 31 GM205 0.002793296
## 718 32 GM205 0.000000000
## 740 33 GM205 0.248699272
## 762 34 GM205 0.061224490
## 784 35 GM205 0.000000000
## 806 36 GM205 0.000000000
## 828 38 GM205 0.000000000
## 850 39 GM205 0.000000000
## 872 40 GM205 0.000000000
## 894 41 GM205 0.000000000
## 916 42 GM205 0.000000000
## 938 43 GM205 0.058219178
## 960 44 GM205 0.000000000
## 982 45 GM205 0.000000000
## 1004 46 GM205 0.000000000
## 1026 47 GM205 0.000000000
## 1048 48 GM205 0.000000000
## 1070 49 GM205 0.000000000
## 1092 50 GM205 0.000000000
## 1114 51 GM205 0.000000000
## 1136 52 GM205 0.000000000
## 1158 53 GM205 0.000000000
## 1180 54 GM205 0.028688525
## 1202 55 GM205 0.128617363
## 1224 56 GM205 0.000000000
## 1246 57 GM205 0.000000000
## 1268 58 GM205 0.051724138
## 1290 59 GM205 0.187500000
## 1312 60 GM205 0.014084507
## 1334 61 GM205 0.133333333
## 1356 62 GM205 0.000000000
## 1378 63 GM205 0.170212766
## 1400 64 GM205 0.000000000
## 1422 65 GM205 0.000000000
## 1444 66 GM205 0.066666667
## 1466 67 GM205 0.079470199
## 1488 68 GM205 0.011494253
## 1510 69 GM205 0.000000000
## 1532 70 GM205 0.032258065
## 1554 71 GM205 0.000000000
## 1576 72 GM205 0.000000000
Three scaffolds, 29, 30, and 33, are at ~25% polyploid loci.
7.1 addressing aneuploidy based on read depth disparities
The VCF also encodes read depth. example:
1:0,38:38
The last number is the total number of reads mapping to this site. We can examine the patterns in the same way, but based on total read depth instead of polyploidy. This would be more ideally extracted simply from the bed files.
Plot as shown here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3744429/
The issue with our study is that the reference genome is essentially unknown in terms of chromosome structure, we just have a pile of scaffolds.
I have not completed this section yet
8 how many consistently variant sites are there
counting how many strains have each variant
vcf <- read.csv("PLINK/all.sort.head.vcf",skip=13,header = T,sep="\t",stringsAsFactors = F)
#remove GM104
vcf <- vcf[-17]
#remove everything other than "/" and turn "/" into numeric "1"
library(stringr)
t <- vcf
for (n in c(10:31)) {
t[n] <- str_remove_all(unlist(t[n]),"[:digit:]")
t[n] <- str_remove_all(unlist(t[n]),"\\:")
t[n] <- str_remove_all(unlist(t[n]),"\\/")
t[n] <- str_replace_all(unlist(t[n]),"\\.","0")
t[n] <- str_replace_all(unlist(t[n]),"\\,","1")
t[n] <- as.numeric(unlist(t[n]))
}
#sums for each strain and site
c <- t[-c(1:9)]
site.var <- rowSums(c, na.rm = T)
#brute force out parsing errors
x <- c(1:length(site.var))
c <- data.frame(site.var, x)
vars.all.strains <- subset(c, site.var > 0)
hist(site.var, breaks = 21)#strain.var <- colSums(c, na.rm = T)
#add site var to main table
#vcf$diploid <- site.var1749 variants are found in ALL genomes
38448 found in at least 10 strains
This is an open issue, whether if we subset into subpopulations and run LD decay again, will anything interesting emerge? We will not proceed.
8.1 conclusions: aneuploidy
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5656283/ Many examples of aneuploidy in fungi, so this wouldn’t be so unusual. Also there are some great examples of ways to plot this data.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3744429/ this paper looks at copy number variation and assocaition with pathogenicity in a fungus
It appears that strain GM205 has gone very polyploid at one location, with consistent multiple alleles on scaffolds 29, 30, and 33. This comprises about 40 Kbases though. I’m leaning towards ignoring this for the most part; perhaps it could also be evidence of a large duplication event within a chromosome rather than an entire duplicated segment. Again though, this should be addressed by looking at read depth. I would like to be able to say that there is
- Some evidence of kilobase-scale duplications having occurred in the strains followed by divergent evolution. This is possibly evidence of aneuploidy but without a completed reference genome it is not possible to draw such conclusions.