Assume \(N_e \approx 10^6\) and \(\mu = 3 \times 10^{-8}\). Then \(\theta = 4N_e\mu = 0.12\):
ne=1E6
mu=3E-8
theta=4*ne*mu
theta
[1] 0.12
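Under the standard neutral model that underlies Watterson’s estimator, we expect \(S=\theta\sum_{i=1}^{n-1}\frac{1}{i}\) segregating sites per base pair in a sample of \(n\) chromosomes, where \(\theta\) is the per-site value above. Thus for 10 diploids (20 chromosomes) and 600 Mb of sequence*, we expect (in millions of SNPs):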
n=20
wfactor=sum(1/1:(n-1))
s=600E6*theta*wfactor
s/1E6
[1] 255.4373
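To see how this expectation depends on sample size, the calculation above can be wrapped in a small function. This is only a sketch: the name `expected_snps` and its arguments are illustrative and are not part of the original analysis.

```r
# Expected number of segregating sites under the standard neutral model:
# E[S] = L * theta * sum_{i=1}^{n-1} 1/i, with theta given per base pair.
# `expected_snps` is a hypothetical helper name used only for illustration.
expected_snps <- function(n_chrom, seq_bp, theta) {
  stopifnot(n_chrom >= 2)
  harmonic <- sum(1 / seq_len(n_chrom - 1))  # same quantity as wfactor
  seq_bp * theta * harmonic
}

theta <- 4 * 1e6 * 3e-8        # 4 * Ne * mu
n_values <- c(2, 10, 20, 40)   # numbers of sampled chromosomes
snps_millions <- sapply(n_values, expected_snps,
                        seq_bp = 600e6, theta = theta) / 1e6
round(snps_millions, 1)
```

Because the harmonic sum grows only logarithmically in the number of chromosomes, adding samples increases the expected SNP count slowly, while changes in the amount of usable sequence scale it linearly.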
*The sequence length should count only positions where mean quality exceeds a threshold (say 30) and >80% of samples have data.
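One way to count such positions is sketched below, assuming a per-site vector of mean quality and a sites-by-samples matrix marking which samples have data. All object names and values are made-up toy data, not output from a real pipeline.

```r
# Toy input (hypothetical): 5 sites, 10 samples.
mean_qual <- c(35, 20, 40, 50, 31)     # mean quality per site
present_counts <- c(10, 10, 7, 9, 6)   # samples with data at each site
n_samples <- 10

# Build a sites x samples logical matrix from the counts.
has_data <- t(sapply(present_counts, function(k) {
  rep(c(TRUE, FALSE), c(k, n_samples - k))
}))

qual_threshold <- 30   # "say 30" in the footnote
min_present <- 0.8     # more than 80% of samples must have data

frac_present <- rowMeans(has_data)
callable <- mean_qual > qual_threshold & frac_present > min_present

# Number of positions that would count toward the sequence length.
callable_bp <- sum(callable)
callable_bp
```

In this toy example only sites 1 and 4 pass both filters, so `callable_bp` would replace the nominal sequence length in the SNP calculation.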
To try
Filter all sites for quality (mean quality above the threshold).
Filter all sites so that >80% of samples have data.
Filter windows so that >20% of their sites have data (i.e. more than 2 kb in a 10 kb window).
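The window filter could be sketched as follows, assuming a logical per-site vector marking sites that passed the site-level filters. The vector `passed` and the other names are hypothetical toy data.

```r
window_size <- 10000   # 10 kb windows
min_frac <- 0.2        # keep windows with more than 20% of sites passing

# Toy mask for 30 kb: window 1 is 90% covered, window 2 10%, window 3 30%.
passed <- c(rep(TRUE, 9000), rep(FALSE, 1000),
            rep(TRUE, 1000), rep(FALSE, 9000),
            rep(TRUE, 3000), rep(FALSE, 7000))

# Assign each position to a non-overlapping window (0-based window index).
position <- seq_along(passed)
window_id <- (position - 1) %/% window_size

# Fraction of passing sites per window, then apply the threshold.
frac_data <- tapply(passed, window_id, mean)
keep_window <- frac_data > min_frac
keep_window
```

Here the first and third windows are kept and the second is dropped. A trailing partial window would be judged against its own length in this sketch, which may or may not be the desired behaviour.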