Coverage

Get allele frequency counts from all data:

tail +4 all_frq.csv | cut -d "," -f 8-50000 | tr "," "\n" |  awk '{print $1}' | sort | uniq -c | sort -nr

Results in:

Number of Loci	Alelle Frequency
2382706	NA
252318	1
252318	0
2222	0.5
742	0.666666666666667
742	0.333333333333333
381	0.75
381	0.25
218	0.8
218	0.2
160	0.6
160	0.4
134	0.833333333333333
134	0.166666666666667
84	0.875
84	0.125
82	0.857142857142857
82	0.142857142857143
51	0.888888888888889
51	0.111111111111111

Because each SNP is listed twice with both major and minor freq, we drop all freq > 0.5. Similarly we have to halve the number at exactly 0.5. This leaves:

Number of Loci	Alelle Frequency
2382706	NA
252318	0
1111	0.5
742	0.333333333333333
381	0.25
218	0.2
160	0.4
134	0.166666666666667
84	0.125
82	0.142857142857143
51	0.111111111111111

Note many of these frequencies correspond to 1/2 .. 1/9 so we can effectively infer the coverage (total number of reads) from these as 1-9. We have to tweak 0.5 and 0.4 as those could represent 2/4 and 2/5 so we add some or all of those to the 1/4 and 1/5. We drop the 0’s because we can’t tell if it’s 0/1, 0/2 etc.

Number of Loci	Coverage
2382706	0
836	2
742	3
656	4
378	5
134	6
82	7
84	8
51	9

Make a vector of these counts

#counts 0,2,3,4,5,6,7,8,9
dats<-c(100000,286040,4463,3628,1883,930,513,285)

Maximize likelihood \(L\) of poisson by minimizing \[-log(L)=\sum_{i=1}^N(-k_ilog\lambda+\lambda)\] where \(k_i\) are the observed converage of each genotype from above

obs=0:7
lambdas=1:150/100
likes<-sapply(lambdas,function(lambda) sum(dats*(-1*obs*log(lambda)+lambda)))
plot(likes~lambdas)
abline(v=lambdas[which(likes==min(likes))])