library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Filter for max two alleles, snps only. Then filter for max 3 alleles but only if one is an indel. Add header back to indels and combine two files.
bcftools view -M2 -v snps NAM.header.chr10.vcf.gz > NAM.biallelic.snpsonly.vcf
bcftools view -M3 NAM.header.chr10.vcf.gz | grep "*" > NAM.biallelic.indelsonly.vcf
cat ~/Downloads/B73v5.NAM-illumina_filtered-pass-only-two-round-gatk-snps.headersonly NAM.biallelic.indelsonly.vcf > NAM.biallelic.crap
mv NAM.biallelic.crap NAM.biallelic.indelsonly.vcf
bcftools concat NAM.biallelic.snpsonly.vcf NAM.biallelic.indelsonly.vcf > NAM.goodSNPs.vcf
Set hets to missing data
bcftools filter -S . -e 'GT=="0/1" || GT=="0/2" || GT=="1/2"' NAM.goodSNPs.vcf> NAM.goodSNPs_hetstomissing.vcf
Here’s a test SNP to make sure it’s working.
grep "\s53589" NAM.goodSNPs_hetstomissing.vcf| grep "*"
chr10 53589 . C *,T 67.88 PASS AC=2,1;AF=0.042,0.021;AN=48;BaseQRankSum=1.871;DP=512;ExcessHet=0.1409;FS=0.000;InbreedingCoeff=0.4937;MLEAC=1,1;MLEAF=0.021,0.021;MQ=48.20;MQRankSum=-3.026;QD=3.99;ReadPosRankSum=0.580;SOR=0.512 GT:AD:DP:GQ:PL 0/0:44,0,0:44:99:0,132,1980,132,1980,1980 0/0:54,0,0:54:99:0,163,2430,162,2381,2373 0/0:7,0,0:7:21:0,21,315,21,315,315 0/2:9,0,3:12:80:80,110,504,0,378,369 0/0:28,0,0:28:84:0,84,1260,84,1260,1260 0/0:14,0,0:14:48:0,48,720,48,720,720 ./.:.:.:.:. 0/0:25,0,0:25:80:0,81,1215,80,1150,1144 0/0:24,0,0:24:72:0,72,1080,72,1080,1080 0/0:10,0,0:10:30:0,30,450,30,450,450 0/0:1,0,0:1:3:0,3,45,3,45,45 0/0:41,0,0:41:99:0,123,1845,123,1829,1823 0/0:0,0,0:0:6:0,6,90,6,55,49 0/0:46,0,0:46:99:0,138,2070,138,2070,2070 ./.:.:.:.:. 0/0:9,0,0:9:27:0,27,405,27,405,405 0/0:8,0,0:8:24:0,24,360,24,360,360 0/0:25,0,0:25:75:0,75,1125,75,1124,1123 0/0:2,0,0:2:6:0,6,90,6,90,90 0/0:31,0,0:31:93:0,93,1395,93,1395,1395 1/1:0,5,0:5:15:225,15,0,225,15,225 0/
0:16,0,0:16:54:0,54,810,54,775,769 0/0:18,0,0:18:54:0,54,810,54,810,810 0/0:41,0,0:41:99:0,123,1845,123,1845,1845 0/0:18,0,0:18:54:0,54,810,54,810,810 0/0:16,0,0
:16:48:0,48,720,48,720,720 ./.:.:.:.:.
Our SNP before setting to missing
0/2:9,0,3:12:80:80,110,504,0,378,369
And after setting to missing (it works!)
./.:9,0,3:12:80:80,110,504,0,378,369
Let’s do deletions in one individual, B97
bcftools view -s B97 NAM.goodSNPs_hetstomissing.vcf > b97.vcf
Get chr10 from AnchorWave deletions bedfile
grep chr10 ~/Downloads/drive-download-20220502T163928Z-001/B97.deletions.merged.bed > b97.chr10.deletions.bed
Do the intersect
bedtools intersect -a b97.chr10.deletions.bed -b b97.vcf -wao > b97.deletions.snps
Filter this to only keep nonmissing data in B97
grep -v "0$" b97.deletions.snps | grep -v "\./\." > b97.overlap.deletions.snps
We find 71686 deletions on chr10 (wc -l b97.chr10.deletions.bed
). Of these, 6935, or 9.7%, overlap one of our filtered SNPs (cut -f 2 b97.overlap.deletions.snps| uniq -c | wc -l
). But some of these are SNPs where B97 has a deletion call! We need to find only those SNPs where the vcf has a basepair call that overlaps an anchorwave deletion. First, all sites that are 0/0. Then all sites where deletion (*
) is the first alternate, and the genotype is 2/2
. Then last, anyplace the genotype is 1/1
and the deletion is the 2nd alternate allele. (In both these cases B97 has a base pair call at a SNP where some other lines have an AnchorWave deletion).
grep 0/0 b97.overlap.deletions.snps > bad.deletions
grep "\,\*" b97.overlap.deletions.snps | grep 1/1 >> bad.deletions
grep "*," b97.deletions.snps| grep "2/2" >> bad.deletions
OK, so in total 3983 (5.6%) of our AnchorWave deletions have ≥1 “good” SNP (cut -f 2 bad.deletions| sort -n | uniq -c | wc -l
). That should be a liberal estimate, as some of these may be bad SNP calls and not bad AnchorWave calls. Let’s see the distribution.
cut -f 2 bad.deletions| sort -n | uniq -c | sed -e 's/^[ ]*//' | cut -d " " -f 1 | sort -n > snp.distro
Then plot:
bob<-read.table("~/Desktop/snp.distro")
hist(bob$V1,breaks=100)
That’s pretty skewed. Indeed, 1416 (cut -f 2 bad.deletions| sort -n | uniq -c | sed -e 's/^[ ]*//' | cut -d " " -f 1 | sort -n | grep -c "^1$"
) have only 1 SNP, so only 2567/71686=3.6% of our deletions have ≥1 SNP and are thus more likely to be wrong.
Hmm, some of the deletions are pretty small and those are more likely to be wrong. They’re also less likely to have a SNP, so small deletions may be messing things up. Let’s go back and filter to only keep deletions >100bp.
awk '{if($3-$2 >= 100) print}' b97.chr10.deletions.bed > b97.chr10.filtered.deletions.bed
Yikes! Only 5813 (wc -l b97.chr10.filtered.deletions.bed
) remain. Let’s check out accuracy of those.
bedtools intersect -a b97.chr10.filtered.deletions.bed -b b97.vcf -wao > b97.filtered.deletions.snps
grep -v "0$" b97.filtered.deletions.snps | grep -v "\./\." > b97.overlap.filtered.deletions.snps
grep 0/0 b97.overlap.filtered.deletions.snps > bad.filtered.deletions
grep "\,\*" b97.overlap.filtered.deletions.snps | grep 1/1 >> bad.filtered.deletions
grep "*," b97.overlap.filtered.deletions.snps | grep "2/2" >> bad.filtered.deletions
Whoa. Half (2862 or 49.2% from cut -f 2 bad.filtered.deletions | sort -n | uniq | wc -l
) have what appears to be a valid SNP call.
Let’s dig some more. Make a new bedfile with lengths:
awk '{OFS="\t"}{print $1,$2,$3,($3-$2)}' b97.chr10.deletions.bed > b97.chr10.deletions.length.bed
Redo above with new bedfile:
bedtools intersect -a b97.chr10.deletions.length.bed -b b97.vcf -wao > b97.deletions.length.snps
grep -v "0$" b97.deletions.length.snps | grep -v "\./\." > b97.overlap.deletions.length.snps
grep 0/0 b97.overlap.deletions.length.snps > bad.length.deletions
grep "\,\*" b97.overlap.deletions.length.snps | grep 1/1 >> bad.length.deletions
grep "*," b97.overlap.deletions.length.snps | grep "2/2" >> bad.length.deletions
Still 3983 (cut -f 2,4 bad.length.deletions | sort -n | uniq | wc -l
phew! code still works). Now make a file with number of SNPs and deletion length.
cut -f 2 bad.length.deletions| sort -n | uniq -c | sed -e 's/^[ ]*//' | tr " " "\t" > snp.length.distro
And look at it:
sue<-read.table("~/Desktop/snp.length.distro")
colnames(sue)=c("snps","del_start","del_length")
plot(sue$snps~sue$del_length)
Ugh. So some big deletions with a lot of SNPs. Unsure how to think about this – how many SNPs before we say the AnchorWave deletion is wrong?
Definitely redo the above to make sure it checks out
Should check an additional individual and an additional chromosome to make sure qualitative result (holds up.
We should probably drop SNPs close enough to be in the same read as those aren’t likely independent.
Do we care where the SNPs are? e.g. if there are 100 bad SNPs but they’re all at one end of a deletion (rather than in the middle), could just be a TE one the ends of the deletion that got misaligned, but rest of deletion is good.
Get start/end of each deletion and all SNPs in it
cut -f 2,3,4,6 bad.length.deletions > crap
Grab deletions with > 300 SNPs
x<-read.table("~/Desktop/crap")
colnames(x)=c("start","end","length","snp")
starts<-group_by(x,start) %>% summarize(n=length(snp)) %>% filter(n>300) %>% select(start)
Plot distribution of SNPs along deletion
for(i in starts$start){ b<-filter(x,start == i); s<-min(b$start); e<-min(b$end); c<-ggplot(b,aes(x=snp))+geom_histogram(bins=50)+xlim(s,e); print(c); }
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).