Get allele frequency counts from all data:
tail +4 all_frq.csv | cut -d "," -f 8-50000 | tr "," "\n" | awk '{print $1}' | sort | uniq -c | sort -nr
Results in:
Number of Loci | Alelle Frequency |
---|---|
2382706 | NA |
252318 | 1 |
252318 | 0 |
2222 | 0.5 |
742 | 0.666666666666667 |
742 | 0.333333333333333 |
381 | 0.75 |
381 | 0.25 |
218 | 0.8 |
218 | 0.2 |
160 | 0.6 |
160 | 0.4 |
134 | 0.833333333333333 |
134 | 0.166666666666667 |
84 | 0.875 |
84 | 0.125 |
82 | 0.857142857142857 |
82 | 0.142857142857143 |
51 | 0.888888888888889 |
51 | 0.111111111111111 |
Because each SNP is listed twice with both major and minor freq, we drop all freq > 0.5. Similarly we have to halve the number at exactly 0.5. This leaves:
Number of Loci | Alelle Frequency |
---|---|
2382706 | NA |
252318 | 0 |
1111 | 0.5 |
742 | 0.333333333333333 |
381 | 0.25 |
218 | 0.2 |
160 | 0.4 |
134 | 0.166666666666667 |
84 | 0.125 |
82 | 0.142857142857143 |
51 | 0.111111111111111 |
Note many of these frequencies correspond to 1/2 .. 1/9 so we can effectively infer the coverage (total number of reads) from these as 1-9. We have to tweak 0.5 and 0.4 as those could represent 2/4 and 2/5 so we add some or all of those to the 1/4 and 1/5. We drop the 0’s because we can’t tell if it’s 0/1, 0/2 etc.
Number of Loci | Coverage |
---|---|
2382706 | 0 |
836 | 2 |
742 | 3 |
656 | 4 |
378 | 5 |
134 | 6 |
82 | 7 |
84 | 8 |
51 | 9 |
Make a vector of these counts
#counts 0,2,3,4,5,6,7,8,9
dats<-c(100000,286040,4463,3628,1883,930,513,285)
Maximize likelihood \(L\) of poisson by minimizing \[-log(L)=\sum_{i=1}^N(-k_ilog\lambda+\lambda)\] where \(k_i\) are the observed converage of each genotype from above
obs=0:7
lambdas=1:150/100
likes<-sapply(lambdas,function(lambda) sum(dats*(-1*obs*log(lambda)+lambda)))
plot(likes~lambdas)
abline(v=lambdas[which(likes==min(likes))])