Get allele frequency counts from all data:
tail +4 all_frq.csv | cut -d "," -f 8-50000 | tr "," "\n" | awk '{print $1}' | sort | uniq -c | sort -nr
Results in:
| Number of Loci | Alelle Frequency |
|---|---|
| 2382706 | NA |
| 252318 | 1 |
| 252318 | 0 |
| 2222 | 0.5 |
| 742 | 0.666666666666667 |
| 742 | 0.333333333333333 |
| 381 | 0.75 |
| 381 | 0.25 |
| 218 | 0.8 |
| 218 | 0.2 |
| 160 | 0.6 |
| 160 | 0.4 |
| 134 | 0.833333333333333 |
| 134 | 0.166666666666667 |
| 84 | 0.875 |
| 84 | 0.125 |
| 82 | 0.857142857142857 |
| 82 | 0.142857142857143 |
| 51 | 0.888888888888889 |
| 51 | 0.111111111111111 |
Because each SNP is listed twice with both major and minor freq, we drop all freq > 0.5. Similarly we have to halve the number at exactly 0.5. This leaves:
| Number of Loci | Alelle Frequency |
|---|---|
| 2382706 | NA |
| 252318 | 0 |
| 1111 | 0.5 |
| 742 | 0.333333333333333 |
| 381 | 0.25 |
| 218 | 0.2 |
| 160 | 0.4 |
| 134 | 0.166666666666667 |
| 84 | 0.125 |
| 82 | 0.142857142857143 |
| 51 | 0.111111111111111 |
Note many of these frequencies correspond to 1/2 .. 1/9 so we can effectively infer the coverage (total number of reads) from these as 1-9. We have to tweak 0.5 and 0.4 as those could represent 2/4 and 2/5 so we add some or all of those to the 1/4 and 1/5. We drop the 0’s because we can’t tell if it’s 0/1, 0/2 etc.
| Number of Loci | Coverage |
|---|---|
| 2382706 | 0 |
| 836 | 2 |
| 742 | 3 |
| 656 | 4 |
| 378 | 5 |
| 134 | 6 |
| 82 | 7 |
| 84 | 8 |
| 51 | 9 |
Make a vector of these counts
#counts 0,2,3,4,5,6,7,8,9
dats<-c(100000,286040,4463,3628,1883,930,513,285)
Maximize likelihood \(L\) of poisson by minimizing \[-log(L)=\sum_{i=1}^N(-k_ilog\lambda+\lambda)\] where \(k_i\) are the observed converage of each genotype from above
obs=0:7
lambdas=1:150/100
likes<-sapply(lambdas,function(lambda) sum(dats*(-1*obs*log(lambda)+lambda)))
plot(likes~lambdas)
abline(v=lambdas[which(likes==min(likes))])