This is an R Markdown document reporting the msprime simulation and qpAdm analysis results of a series of human Demography Models. The DNA data generated from the msprime simulations have been processed to reflect ancient DNA conditions of the Ancient Near East. As such the tree-sequence data has been converted into Eigenstrat format and processed according to typical ancient DNA conditions such as missing data, pseudohaploid calls and ascertainment bias. The simulation includes present-day lineages and lineages from which we generated ascertainment. The individuals simulated from these lineages were not post-processed and represent Eigenstrat diploid genomes.
The simulation models (A,B,C & D) consist of a shared core demography of global human genetic variation and vary among one another on the modeling parameters of the Mesopotamian lineage outlined below (A-D). Within each model, there exists three possible variant sub-models (1-3) with each sub-model having different population admixing (40%) into the Levant around the 5th millennium BC. The source admixture lineages for the respective sub-models are:
1=Mesopotamia into the Levant
2=Caucasus into the Levant
3=Zagros into the Levant.
A=Mesopotamia clean split from Zagros.
B=Caucasus 20% Zagros 80% ancestry contributions to Mesopotamia.
C=Caucasus & Zagros 50% ancestry contributions to Mesopotamia.
D=Caucasus 80% & Zagros 20% ancestry contributions to Mesopotamia.
As such, model combinations represent two possible alternate demographic parameter scenarios: E.G. A1 represents a clean split of Mesopotamia from the Zagros and Mesopotamian admixture into the Levant whereas model D3 represents a contribution of 80% Caucasus and 20% Zagros and Zagros admixture into the Levant in the 5th millennium BC.
Before introducing the complex admixture events within the ancient Near East, we began with a simplified demography to assess the expectation of D-statistics and to assess whether the simulation parameters were performing as expected. Below is a topology plot of the simplified ANE demography without any complex admixture histories other than the modeling of Mesopotamia into the Levant at ~5KYA. Within each plot the expected patterns of ABBA and BABA sites are reported. These expectations are derived from theoretical expectations plotted below the topology plots.
Simplified Models 1:A-D Topology
Under the D-statistic of D(Mesopo10K, Zagros10K; Cau, Mbuti) the expectations are presented below. Note that there are 11 Caucasus populations from 13KYA to 4KYA.
Simplified Models 1:A-D D(Meso10K, Zagros10K; Test, Mbuti)
D-statistic of D(Mesopo10K, Zagros10K; Test, Mbuti) are presented below. Note the correspondence between our expected and observed direction (D > or < 0) for the D-statistics. For model A1, D scores for the Caucasus populations are ~ E[0] between two replicates. Model B1, the Caucasus populations are consistently greater than zero (positive) however not significant, a likely cause being the small admixture amount from the ancestral Caucasus population into Mesopotamia. Model C1, like model B1 has all positive Caucasus values, wherein some groups, Caucasus_T2 for example, are significantly positive. This may reflect the impact of a larger admixture contribution from the ancestral Caucasus population into the ancestral Mesopotmaian population and the random nature of replications. Model D1, shows even greater affinity of Caucasus to Mesopotamia with a large number of Caucasus populations significantly positive (abs Z > 3). Again this suggests that the earlier models (B-C) were not significant across a large number of Caucasus populations due to the smaller admixture contribution. Of particular interest is that the two Caucasus populations bracketing the admixture into Mesopotamia are both not consistently positive, even on Model D with 80% contribution into Mesopotamia from the Caucasus.
Also note that Zagros populations are negative for all tests, all Mesopotmaian populations are positive for all tests and groups such as Anatolia and the Aegean remain around zero for all tests in concordance with expectations.
The following analyses were performed in admixtools2 https://github.com/uqrmaie1/admixtools/ where the f2 statistics were computed for all pairs of groups and stored on disk. All subsequent f-statistics tests were performed as computations of the f2-statistics. The missing data cutoff was set to maxmiss = 1 which is equivalent to allsnps: YES in admixtools.
The qpWave/qpAdm modeling requires the pre-defining of two distinct groups of populations. First is the Right/Reference/Outgroup (hereafter referred to as Reference, not to be confused with terminology used in Haak et al. 2015) and second is the Left/Source group (hereafter referred to as Source). The Reference group are a set of populations that exhibit differing genetic relatedness to the Source populations and are selected to represent global/regional genetic diversity. Additionally, it is recommended that the same reference population be selected in the first Reference group position for all comparable models, one unlikely to be related to the admixture event being analysed and minimal missing data as it will be the base of all f4-statistics. As such, we have placed Mbuti_0 as the first population in the Reference group list.
From the theory, if all Reference populations are symmetrically related to all Source and Target populations, qpAdm will be unable to reject any models. As such, under our simple demography, all the Right populations are phylogenetically outgroups to all the ANE populations. As such, we will include a Caucasus population that dates prior to the admixture event of Zagros and Caucasus to Mesopotamia to try and increse the leverage qpAdm has to inder ancestry sources and proportions.
Reference list o9C = Mbuti_0, EastAs_0, Papuan_0, Onge_0, NthEur_0, Amer_0, Ust_45000, Kostenki_36000, (MA1 ~ missing from my data ATM), CHG1.
Simplified MA1 Single Ancestry Source Results
Simplified MB1 Single Ancestry Source Results
Simplified MB1 Double Ancestry Source Results
Simplified MC1 Single Ancestry Source Results
Simplified MC1 Double Ancestry Source Results
Simplified MD1 Single Ancestry Source Results
Simplified MD1 Double Ancestry Source Results
Model Predictions Against Simulated Values: Theory from Durrand et al. https://academic.oup.com/mbe/article/28/8/2239/1052492
Using the equation (5) from Durrand et al. and the model parameters from the simulation, we have derived the expected D-statistic values and plot them against the simulated D-statistic values.
# admixure values for each model
fMA1 <- 0
fMB1 <- 0.2
fMC1 <- 0.5
fMD1 <- 0.8
# Split time, admixture time and N for the models is shared.
tGF <- 1000
tp3 <- 15000
tp2 <- 12500
N <- 3144
# Calculation for each model
ED_MA1 <- (3*fMA1*(tp3 - tGF)) / ((3*fMA1*(tp3 - tGF)) + (4*N*(1-fMA1)*((1-1/(2*N))^(tp3 - tp2))) + (4*N*fMA1*((1-1/(2*N))^(tp3 - tGF))))
ED_MB1 <- (3*fMB1*(tp3 - tGF)) / ((3*fMB1*(tp3 - tGF)) + (4*N*(1-fMB1)*((1-1/(2*N))^(tp3 - tp2))) + (4*N*fMB1*((1-1/(2*N))^(tp3 - tGF))))
ED_MC1 <- (3*fMC1*(tp3 - tGF)) / ((3*fMC1*(tp3 - tGF)) + (4*N*(1-fMC1)*((1-1/(2*N))^(tp3 - tp2))) + (4*N*fMC1*((1-1/(2*N))^(tp3 - tGF))))
ED_MD1 <- (3*fMD1*(tp3 - tGF)) / ((3*fMD1*(tp3 - tGF)) + (4*N*(1-fMD1)*((1-1/(2*N))^(tp3 - tp2))) + (4*N*fMD1*((1-1/(2*N))^(tp3 - tGF))))
# Create data.frame for plotting. Merge with the Caucasus 9k samples from the simulated data
Model <- c("S_MA1", "S_MB1", "S_MC1", "S_MD1")
E_D <- c(ED_MA1, ED_MB1, ED_MC1, ED_MD1)
f <- c(fMA1, fMB1, fMC1, fMD1)
Simp_Exp_D_stats <- data.frame(Model, E_D, f)
# Three simulated models available till the Simulated model D is complete
Model <- c("S_MA1", "S_MB1", "S_MC1", "S_MD1")
E_D <- c(ED_MA1, ED_MB1, ED_MC1, ED_MD1)
f <- c(fMA1, fMB1, fMC1, fMD1)
Simp_Exp_D_stats <- data.frame(Model, E_D, f)
S_Models_1_CAU <- filter(S_Models_1, Y == "CHG2_9000", rep == "r1")
S_Models_1_CAU <- S_Models_1_CAU %>% select(D, Model)
S_Models_1_CAU[, f:=c(0, 0.2, 0.5, 0.8)]
Simp_Exp_D_stats_S_Models_1 <- merge(Simp_Exp_D_stats, S_Models_1_CAU[, c("Model","D")])
The simulation models (A,B,C & D) consist of a shared core demography of global human genetic variation and vary among one another on the modeling parameters of the Mesopotamian lineage outlined below (A-D). Within each model, there exists three possible variant sub-models (1-3) with each sub-model having different population admixing (40%) into the Levant around the 5th millennium BC. The source admixture lineages for the respective sub-models are: Model A=Mesopotamia clean split from Zagros.
Model B=Caucasus 20% Zagros 80% ancestry contributions to Mesopotamia.
Model C=Caucasus & Zagros 50% ancestry contributions to Mesopotamia.
Model D=Caucasus 80% & Zagros 20% ancestry contributions to Mesopotamia.
As such, model combinations represent two possible alternate demographic parameter scenarios: E.G. A1 represents a clean split of Mesopotamia from the Zagros and Mesopotamian admixture into the Levant.
The data from the msprime simulations for ancient individuals has been processed according to typical ancient DNA conditions such as missing data, pseudohaploid calls and ascertainment bias. The individuals from present-day populations were not post-processed.
Below is the link to a document outlining the parameters used in the demography simulation. https://docs.google.com/document/d/1Wc-m_-c3DYPid67PoC-SwJR-3bDe5PulytDWZbVIBbo/edit?usp=sharing
In Model A:1, Mesopotamia splits from the Zagros population 12,500 kya and admixes into the Levant with p=0.4 at 5,000 kya. The admixture arrows are not representative of timing. Below a ANE resolution topology plot will be deisplayed for each model. For a detailed analysis of the timing of global demographic events see the above linked document.
The following is the GroupNumber key for population assignment in the sample table below. All models follow the same sampling and sample-population assignment. 0 = Neanderthal, 1=Denisovan, 2=Mbuti, 3=North African, 4=Ancestral Eurasian, 5=Basal Eurasian, 6=Ust’-Ishim, 7=East Eurasian, 8=West Eurasian, 9=Australasian, 10=East Asian, 11=South Asian, 12=Onge, 13=North Eurasian, 14=Papuan, 15=Austrolasian, 16=American, 17=Ancestral Ancient Near East, 18=Aegean, 19=Zagros, 20=Levant, 21=Caucasus, 22=Mesopotamia, 23=Anatolia.
The D-statistics are reported for two replication runs for each model A-D under the subclass 1.
The following analyses were performed in admixtools2 https://github.com/uqrmaie1/admixtools/ where the f2 statistics were computed for all pairs of groups and stored on disk. All subsequent f-statistics tests were performed as computations of the f2-statistics. The missing data cutoff was set to maxmiss = 1 which is equivalent to allsnps: YES in admixtools.
The qpWave/qpAdm modeling requires the pre-defining of two distinct groups of populations. First is the Right/Reference/Outgroup (hereafter referred to as Reference, not to be confused with terminology used in Haak et al. 2015) and second is the Left/Source group (hereafter referred to as Source). The Reference group are a set of populations that exhibit differing genetic relatedness to the Source populations and are selected to represent global/regional genetic diversity. Additionally, it is recommended that the same reference population be selected in the first Reference group position for all comparable models, one unlikely to be related to the admixture event being analysed and minimal missing data as it will be the base of all f4-statistics. As such, we have placed Mbuti_0 as the first population in the Reference group list.
As it relates to the question of modeling ancient Near Eastern demography, a number of publications have utilized slightly differing population combinations. When selecting potential Source populations Harney et al. 2018 restricted the potential list of populations to those most likely to be admixture sources based on geographical and temporal proximity. For the Reference populations, we will explore the impact of the various Reference groups on the accuracy and power of qpAdm/qpWave.
From the theory, if all Reference populations are symmetrically related to all Source and Target populations, qpAdm will be unable to reject any models. Thus, many publications begin with a baseline Reference set and add additional groups to the Reference set to resolve more fine-scale population relationships. When selecting the list of Reference populations it is recommended to perform the statistic f4(SOURCEi, SOURCEj, REFERENCEk, REFERENCEl) for all combinations. If a Reference population never produces a significant f4-statistic, it is not informative.
As an example, below we have plotted the relationship between selected Source populations and the basic list of Reference populations, o9 used across published research. If two Source populations have identical/very close phylogenetic relationships to the o9 Reference populations, the below plot will display a high covariance.
Reference list o9 = Mbuti_0, EastAs_0, Papuan_0, Onge_0, NthEur_0, Amer_0, Ust_45000, Kostenki_36000, (MA1 ~ missing from my data ATM).
Plotting f4(ZagrosEpiP, Refi, Refj, Refk) against f4(Source2, Refi, Refj, Refk)
In the first plot Source2 is the Neolithic Zagros which is only 2kya younger than the Epipaleolithic Zagros. Second plot source 2 is the Caucasus HG which harbors eastern Eurasian ancestry and finally the third plot has Levant Neolithic as the Source2 who harbor inherited ancient North African ancestry.
The two Zagros populations have nearly identical phylogenetic relationships to the Reference groups which can be seen in plot1. Both the Caucasus and Levantine populations have visibly differentiated relationships to the Zagros Epipaleolithic pop as can be seen in plots 2&3.
If we add to the Reference list populations Natufian Epipaleolithic and west Eurasian, we can visually see that the introduction of Natufians and West Eurasian increases the covariance between Zagros Epipaleolithic and the Zagros Neolithic. Likewise, Caucasus HG1 which previously showed a larger difference in their relationship to the Reference populations against Zagros epipaleolithic, appears more similar. However, the Levant Neolithic which previously showed strong covariance with the Zagros Epipaleolithic in their relationship to the Reference populations, is visually differentiated with the introduction of Natufian Epipaleolithic and western Eurasians.
The following will provide the results of qpWave/qpAdm modeling of the Mesopotamian ancestry at two periods under the three models:
A = Population split from the Zagros
B = 20% ancestry from the Caucasus, 80% ancestry from the Zagros
C = 50% ancestry from the Caucasus and Zagros
D = 80% ancestry from the Caucasus, 20% ancestry from the Zagros
Model A1 = Mesopotamia splitting from the Zagros 12,500 years ago.
Model B1 = Mesopotamia 80% Zagros & 20% Caucasus 12,500 years ago.
Model C1 = Mesopotamia 50% Zagros & Caucasus 12,500 years ago.
Model D1 = Mesopotamia 20% Zagros & 80% Caucasus 12,500 years ago.
qpAdm Models MA1
Model A1 = Mesopotamia splitting from the Zagros 12,500 years ago.
qpAdm Models MB1
Model B1 = Mesopotamia 80% Zagros & 20% Caucasus 12,500 years ago.
qpAdm Models MC1
Model C1 = Mesopotamia 50% Zagros & Caucasus 12,500 years ago.
qpAdm Models MD1
Model D1 = Mesopotamia 20% Zagros & 80% Caucasus 12,500 years ago.
\(~\)
\(~\) \(~\)
\(~\)
\(~\)
The D-statistics are reported for two replication runs for each model.
\(~\)
\(~\)
\(~\)
The D-statistics are reported for two replication runs. Numbers indicate number of snps in each D-analysis.