library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Methods

Filter for max two alleles, snps only. Then filter for max 3 alleles but only if one is an indel. Add header back to indels and combine two files.

bcftools view -M2 -v snps NAM.header.chr10.vcf.gz > NAM.biallelic.snpsonly.vcf
bcftools view -M3  NAM.header.chr10.vcf.gz | grep "*" >  NAM.biallelic.indelsonly.vcf
cat ~/Downloads/B73v5.NAM-illumina_filtered-pass-only-two-round-gatk-snps.headersonly NAM.biallelic.indelsonly.vcf > NAM.biallelic.crap
mv NAM.biallelic.crap NAM.biallelic.indelsonly.vcf
bcftools concat NAM.biallelic.snpsonly.vcf NAM.biallelic.indelsonly.vcf > NAM.goodSNPs.vcf

Set hets to missing data

bcftools filter -S . -e 'GT=="0/1" || GT=="0/2" || GT=="1/2"'   NAM.goodSNPs.vcf> NAM.goodSNPs_hetstomissing.vcf

Here’s a test SNP to make sure it’s working.

grep "\s53589" NAM.goodSNPs_hetstomissing.vcf| grep "*"
chr10   53589   .       C       *,T     67.88   PASS    AC=2,1;AF=0.042,0.021;AN=48;BaseQRankSum=1.871;DP=512;ExcessHet=0.1409;FS=0.000;InbreedingCoeff=0.4937;MLEAC=1,1;MLEAF=0.021,0.021;MQ=48.20;MQRankSum=-3.026;QD=3.99;ReadPosRankSum=0.580;SOR=0.512       GT:AD:DP:GQ:PL  0/0:44,0,0:44:99:0,132,1980,132,1980,1980       0/0:54,0,0:54:99:0,163,2430,162,2381,2373 0/0:7,0,0:7:21:0,21,315,21,315,315      0/2:9,0,3:12:80:80,110,504,0,378,369    0/0:28,0,0:28:84:0,84,1260,84,1260,1260 0/0:14,0,0:14:48:0,48,720,48,720,720    ./.:.:.:.:.       0/0:25,0,0:25:80:0,81,1215,80,1150,1144 0/0:24,0,0:24:72:0,72,1080,72,1080,1080 0/0:10,0,0:10:30:0,30,450,30,450,450    0/0:1,0,0:1:3:0,3,45,3,45,45    0/0:41,0,0:41:99:0,123,1845,123,1829,1823 0/0:0,0,0:0:6:0,6,90,6,55,49    0/0:46,0,0:46:99:0,138,2070,138,2070,2070       ./.:.:.:.:.     0/0:9,0,0:9:27:0,27,405,27,405,405      0/0:8,0,0:8:24:0,24,360,24,360,360        0/0:25,0,0:25:75:0,75,1125,75,1124,1123 0/0:2,0,0:2:6:0,6,90,6,90,90    0/0:31,0,0:31:93:0,93,1395,93,1395,1395 1/1:0,5,0:5:15:225,15,0,225,15,225      0/
0:16,0,0:16:54:0,54,810,54,775,769      0/0:18,0,0:18:54:0,54,810,54,810,810    0/0:41,0,0:41:99:0,123,1845,123,1845,1845       0/0:18,0,0:18:54:0,54,810,54,810,810    0/0:16,0,0
:16:48:0,48,720,48,720,720      ./.:.:.:.:.

Our SNP before setting to missing

 0/2:9,0,3:12:80:80,110,504,0,378,369

And after setting to missing (it works!)

./.:9,0,3:12:80:80,110,504,0,378,369

Let’s do deletions in one individual, B97

bcftools view -s B97 NAM.goodSNPs_hetstomissing.vcf > b97.vcf

Get chr10 from AnchorWave deletions bedfile

grep chr10 ~/Downloads/drive-download-20220502T163928Z-001/B97.deletions.merged.bed > b97.chr10.deletions.bed

Do the intersect

bedtools intersect -a b97.chr10.deletions.bed -b b97.vcf -wao > b97.deletions.snps

Filter this to only keep nonmissing data in B97

grep -v "0$" b97.deletions.snps | grep -v "\./\." > b97.overlap.deletions.snps

Results

We find 71686 deletions on chr10 (wc -l b97.chr10.deletions.bed). Of these, 6935, or 9.7%, overlap one of our filtered SNPs (cut -f 2 b97.overlap.deletions.snps| uniq -c | wc -l). But some of these are SNPs where B97 has a deletion call! We need to find only those SNPs where the vcf has a basepair call that overlaps an anchorwave deletion. First, all sites that are 0/0. Then all sites where deletion (*) is the first alternate, and the genotype is 2/2. Then last, anyplace the genotype is 1/1 and the deletion is the 2nd alternate allele. (In both these cases B97 has a base pair call at a SNP where some other lines have an AnchorWave deletion).

grep 0/0 b97.overlap.deletions.snps > bad.deletions 
grep "\,\*" b97.overlap.deletions.snps | grep 1/1 >> bad.deletions
grep "*," b97.deletions.snps| grep "2/2" >> bad.deletions

OK, so in total 3983 (5.6%) of our AnchorWave deletions have ≥1 “good” SNP (cut -f 2 bad.deletions| sort -n | uniq -c | wc -l). That should be a liberal estimate, as some of these may be bad SNP calls and not bad AnchorWave calls. Let’s see the distribution.

cut -f 2 bad.deletions| sort -n | uniq -c | sed -e 's/^[ ]*//' | cut -d " " -f 1 | sort -n > snp.distro

Then plot:

bob<-read.table("~/Desktop/snp.distro")
hist(bob$V1,breaks=100)

That’s pretty skewed. Indeed, 1416 (cut -f 2 bad.deletions| sort -n | uniq -c | sed -e 's/^[ ]*//' | cut -d " " -f 1 | sort -n | grep -c "^1$") have only 1 SNP, so only 2567/71686=3.6% of our deletions have ≥1 SNP and are thus more likely to be wrong.

Postscript

Hmm, some of the deletions are pretty small and those are more likely to be wrong. They’re also less likely to have a SNP, so small deletions may be messing things up. Let’s go back and filter to only keep deletions >100bp.

awk '{if($3-$2 >= 100) print}' b97.chr10.deletions.bed > b97.chr10.filtered.deletions.bed

Yikes! Only 5813 (wc -l b97.chr10.filtered.deletions.bed) remain. Let’s check out accuracy of those.

bedtools intersect -a b97.chr10.filtered.deletions.bed -b b97.vcf -wao > b97.filtered.deletions.snps
grep -v "0$" b97.filtered.deletions.snps | grep -v "\./\." > b97.overlap.filtered.deletions.snps
grep 0/0 b97.overlap.filtered.deletions.snps > bad.filtered.deletions 
grep "\,\*" b97.overlap.filtered.deletions.snps | grep 1/1 >> bad.filtered.deletions
grep "*," b97.overlap.filtered.deletions.snps | grep "2/2" >> bad.filtered.deletions

Whoa. Half (2862 or 49.2% from cut -f 2 bad.filtered.deletions | sort -n | uniq | wc -l) have what appears to be a valid SNP call.

Let’s dig some more. Make a new bedfile with lengths:

awk '{OFS="\t"}{print $1,$2,$3,($3-$2)}' b97.chr10.deletions.bed > b97.chr10.deletions.length.bed

Redo above with new bedfile:

bedtools intersect -a b97.chr10.deletions.length.bed -b b97.vcf -wao > b97.deletions.length.snps
grep -v "0$" b97.deletions.length.snps | grep -v "\./\." > b97.overlap.deletions.length.snps
grep 0/0 b97.overlap.deletions.length.snps > bad.length.deletions 
grep "\,\*" b97.overlap.deletions.length.snps | grep 1/1 >> bad.length.deletions
grep "*," b97.overlap.deletions.length.snps | grep "2/2" >> bad.length.deletions

Still 3983 (cut -f 2,4 bad.length.deletions | sort -n | uniq | wc -l phew! code still works). Now make a file with number of SNPs and deletion length.

cut -f 2 bad.length.deletions| sort -n | uniq -c | sed -e 's/^[ ]*//' | tr " " "\t" > snp.length.distro

And look at it:

sue<-read.table("~/Desktop/snp.length.distro")
colnames(sue)=c("snps","del_start","del_length")
plot(sue$snps~sue$del_length)

Ugh. So some big deletions with a lot of SNPs. Unsure how to think about this – how many SNPs before we say the AnchorWave deletion is wrong?

Late night thoughts

SNP distro

Get start/end of each deletion and all SNPs in it

cut -f 2,3,4,6 bad.length.deletions > crap

Grab deletions with > 300 SNPs

x<-read.table("~/Desktop/crap")
colnames(x)=c("start","end","length","snp")
starts<-group_by(x,start) %>% summarize(n=length(snp)) %>% filter(n>300) %>% select(start)

Plot distribution of SNPs along deletion

for(i in starts$start){ b<-filter(x,start == i); s<-min(b$start); e<-min(b$end); c<-ggplot(b,aes(x=snp))+geom_histogram(bins=50)+xlim(s,e); print(c); }
## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).