Evolutionary Genetics

1)

Construct a phylogenetic tree using Bayesian inference and explain your results (relations between observed taxa). How does the method you used differ from the Maximum likelihood method?

The Bayesian phylogenetic tree shows the evolutionary relationships between different lizards that from two major families, Gymnophthalmidae and Alopoglossidae.

At the base of the tree, there’s a split between two main lineages: 1. Alopoglossus species, A. atriventris and A. carninicaudatus 2. Another leading to all other taxa shown

Alopoglossus species A.atriventris and A.carinicaudatus, form a well-supported monophyletic clade which means they share a recent common ancestor.

Arthrosaura reticulata and related species, Anotosaura sp., A.vanzolinia, Leposoma percarinatum and Ecpleopus gaudichausii also form a distinct cluster, and have close evolutionary relationship.

Colobosaurinae taxa, including Colobosaurus mentalis, Nothobachia ablephara, Calyptommatus and Micrablepharus, form another well-supported clade.

The most late derived groups include Prionodactylus oshaughnessyi, Prionodactylus argulus, Pantodactylus schreibersii schreibersii.

The genus Prionodactylus, including P.oshaughnessyi and P.argulus is well-separated but still related to Bachia dorbignyi, showing deeper evolutionary divergence.

The tree topology supports the relationships among these taxa, but some branches have lower support values like 0.5228, 0.3992, 0.4944.

Bayesian inference trees incorporate prior beliefs and assumptions about the evolutionary model, using the Bayes‘ theorem. The resulting initial probabilities are based on previously observed data and the constructed tree topology stems from prior assumptions (known as posterior distribution). It functions in three steps: 1. All initially constructed trees are equally probable 2. Prior information and observarions are considered, some are more probable than others 3. A posterior probability is calculated (using Bayes‘ theorem), combining the probability of prior observations and the probability of their trees

ML depends on a stohastic (random) evolutionary model for nucleotide and amino acid substitution. The ML approach estimates its own paramaters by searching for what values seem most probable and contain the fewest mutations/changes; it first draws out a tree, randomly redistributes its branches and compares the two versions, then keeps the more likely one.

BI is slower, more complex and computationally demanding, assuming BI‘s prior data is reliable and not flawed.It is preferred when dealing with small datasets, complex models, and cases where prior information is available.

ML does not incorporate prior data and is usually faster and less resource intensive, however, because of that, it‘s also more prone to biases.It is useful when computational efficiency is a concern.

2)

While working in the field you found an unknown eDNA sample. After a thorough genetic analysis, you’ve discovered the following sequence. Which species does the sample belong to?

    1 aatattatac tttatttttg ctatatgatc aggaataatt ggttcatcta taagattatt 
   61 aattcgaata gaattaagac atccaggtat atgaattaat aatgatcaaa tttataattc 
  121 tttagtaaca agacatgcat ttttaataat tttttttata gttatacctt ttataattgg 
  181 tggatttgga aattatctaa ttccattaat attaggatcc ccagatatag cttttcctcg 
  241 aataaataat attagatttt gacttctacc tccatcatta ttcatattat tattaagaaa 
  301 tatatttaca cctaatgtag gtacaggatg aactgtatat cctcctttat cttcttattt 
  361 atttcattca tcaccttcaa ttgatattgc aattttttct ttacatatat caggaatctc 
  421 ttcaattatt ggatcattaa attttatcgt tactatttta ttaataaaaa atttttcatt 
  481 aaattatgat caaattaatt tattttcatg atcagtatgt attacagtaa ttttattaat 
  541 tctatcttta ccagtattag ccggtgcaat tactatatta ttatttgatc gaaattttaa 
  601 tacttcattt tttgacccaa taggaggagg agacccaatc ctttatcaac atttattt

Doing Blast Nucleotide I got the organism Bombus terrestris.

Does it have a reference genome published?

We can see that in the entries we got, we have accession numbers that have prefix NC, which stands for RefSeq (Reference Sequence) entries and they are reference genomes or at least very reliable and complete genomes. In GenBank, we can check in description: Bombus terrestris lusitanicus mitochondrion, complete genome.

Is the species endangered?

The species is not currently at risk of extinction.

What is its whole taxonomic classification?

We can check on NCBI taxonomy.

Eukaryota – Superkingdom Opisthokonta – Clade Metazoa - Kingdom Eumetazoa, Bilateria, Protostomia, Ecdysozoa, Panarthropoda – Clades Arthropoda – Phylum Mandibulata, Pancrustacea – Clades Hexapoda – Subphylum Insecta – Class Dicondylia – Clade Pterygota – Subclass Neoptera – Infraclass Endopterygota – Cohort Hymenoptera - Order Apocrita – Suborder Aculeata – Infraorder Apoidea – Superfamily Anthophila – Clade Apidae – Family Apinae – Subfamily Bombini – Tribe Bombus – Genus Bombus - Subgenus

3)

Take a look at the provided haplotype network. What conclusions can you draw about the observed species based on the information provided by the network?

There are three major groups. Each group has a larger haplotype. These haplotypes are most common and ancestral in the regions they are from.

Moroccan haplotypes and haplotypes from La Palma and Iberia are separated by 14 substitutions, indicating genetic distance from the other groups.

La Palma shows a high level of genetic variation with multiple haplotypes, including H1, H37, H38, and others. Some haplotypes like H1, H15 are shared with Iberia which represents common ancestry.

Iberia shows a central haplotype H2 with a lot of closely related haplotypes in a star-like pattern with few mutations. This is probably caused by founder effect, followed by expansion.

Morocco also forms a distinct genetic cluster, with H47 as a central haplotype and multiple smaller ones. It is an isolated genetic lineage indicated by the large genetic distance from the other groups which is probably due to long period of isolation because of geographical barriers, like the Gibraltar which limits gene flow between Iberian and Morrocan populations.

Probably the ancestors are from Morroco, and some individuals passed the Gibraltar and arrived in Iberia which caused founder effect.The surviving individuals formed the central haplotype H2. Different individuals aquired new mutations, forming closely related, newly derived haplotypes, hence the star-shaped pattern.

4)

How would you define the phylogenetic relationships on the tree? How reliable are your results, and what threshold would you set for reliable support in your analysis? What could you do to improve the results – would you exclude any taxa/species/samples or add new data from genetic databases?

The ESS values are acceptable in Tracer. Some nodes have weakly supported clades, which indicate phylogenetic uncertainty. The confidence value threshold is 0.95, and is tolerable till 0.7 or 0.6. Any node below the threshold is not reliable.

We can improve the results with adding more genetic data to increase resolution, like closely related species or we can also remove some species to test robustness. We also can exclude taxa with poor sequencing data.

5)

Your dataset is defined by a single locus or molecular marker. Which molecular markers would you use if you wanted to examine closer or more distant phylogenetic relationships within a species or genus? Would you change the model in this case, and do you think it is possible to use multiple markers? If so, which markers are available and would be useful for this purpose?

For closer phylogenetic relationships within a species or genus we can use: microsatellites, SNPs. For distant phylogenetic relationships within a species or genus we can use rRNAs: 18S rRNA.

It is possible to use more markers and we would need to change the model.

6)

How many S (singleton), C (conserved), V (variable) and Pi (Parsimony – informative) sites did your alignment have? What do these values tell us? For the purpose of our analyses, are singleton or parsimony-informative sites more informative? Why?

S (Singleton): 42/629 C (Conserved): 280/629 V (Variable): 341/629 Pi (Parsimony informative): 299/629

S are sites where only one sequence in the dataset has a unique mutation, they occur only once in the alignment and are not useful for our analysis because they do not provide shared variation between the taxa.

C are sites that remain unchanged across all sequences. They are not informative for phylogeny, but useful for identifying highly conserved regions.

V are positions where at least two different nucleotides appear in the alignment, but not all variable sites are phylogenetically informative.

Pi are sites where at least two different nucleotides appear in at least two different sequences. These are most useful for phylogeny because they help define evolutionary relationships.

High number of Pi sites shows that the dataset contains enough information to construct a reliable tree. Low number of singletons indicates that sequencing errors or unique mutations will not be present in the dataset. The relatively high number of conserved sites (280/629) shows that some regions of the sequence are functionally fixed and not evolving rapidly.

Pi sites are more informative than singleton sites because they provide shared variation, which means they occur in multiple taxa and help build evolutionary relationships, allowing the model to detect patterns of common ancestry. Singletons do not contribute to tree structure, since they appear only in one sequence.

7)

Which is the optimal model for the construction of your phylogenetic tree according to the BIC criterium? Do the BIC and AIC criteria result in identical trees?

Population genetics

library(dartR)
# library(adegenet)
library(SNPRelate)
library(HardyWeinberg)
library(ggtern)
library(dartRverse)
library(reshape2)
library(seqinr)
library(MultiPhen)
library(tidyverse)
library(pegas)
library(hierfstat)

1.

After sequencing and bioinformatics analysis, you have a set of SNPs that you want to filter for further analysis. The species you are studying has 26 chromosome pairs. You want to exclude all loci that have more than 2% missing data and all individuals that have more than 4% missing genotype. You also want to exclude all loci that have a minor allele frequency below 2%. Construct a PLINK command that will filter your hypothetical binary file “data_SNP_4.bed” by the desired parameters.

Since this is a hypothetical file, I did the analysis on actual files we have from exercises. I include also screenshots to prove the command is working.

This is the command I used with my files.

system("plink2 --bfile bed_TEST_EXAM --geno 0.02 --mind 0.04 --maf 0.02 --make-bed --out bed_filtered_EXAM")
## [1] 3

This is the hypothetical command with the hypothetical file.

system("plink2 --bfile data_SNP_4 --geno 0.02 --mind 0.04 --maf 0.02 --make-bed --out data_SNP_4_filtered")

2.

How many individuals, populations, loci do you have in your dataset? Which populations do you have, please give the names. Which command did you use to do this? Before proceeding you need to run another command to check the validity of the data. What is this command?

We will first import all the files we have, so we have all the data prepared at start.

# Load the raw data
raw_data <- read.table("D.raw", header=TRUE, stringsAsFactors=FALSE)
map_data <- read.table("D.map", header=TRUE, stringsAsFactors=FALSE)
head(map_data)
##   X0 AX.93233744_C X0.1 X0.2
## 1  0 AX-93233745_T    0    0
## 2  0 AX-93233746_A    0    0
## 3  0 AX-93233747_C    0    0
## 4  0 AX-93233749_G    0    0
## 5  0 AX-93233752_G    0    0
## 6  0 AX-93233753_T    0    0
coord <- read.table("D_koordinate.txt", header = TRUE)
head(coord)
##    id      lat       lon
## 1 1_3 28.38351 -80.86100
## 2 2_3 28.45910 -80.79395
## 3 3_2 28.48986 -80.69970
## 4 5_1 28.52193 -80.67518
## 5 8_1 28.58668 -80.63023
## 6 9_1 28.66899 -80.60316

We can create genind object in order to check for number of individuals, populations and loci.

D_genind <- read.PLINK("D.raw", type = "genind")
## 
##  Reading PLINK raw format into a genlight object... 
## 
## 
##  Reading loci information... 
## 
##  Reading and converting genotypes... 
## .
##  Building final object... 
## 
## ...done.

Number of individuals.

nInd(D_genind)
## [1] 72

Number of populations:

length(levels(pop(D_genind)))
## [1] 6

Number of populations can be also represented like this:

nlevels(pop(D_genind))
## [1] 6

Population names:

pop(D_genind)
##  [1] Mel  Mel  Mel  Mel  Mel  Mel  Mel  Mel  Mel  Mel  Mel  Mel  PBC  PBC  PBC 
## [16] PBC  PBC  PBC  PBC  PBC  PBC  PBC  PBC  PBC  FM   FM   FM   FM   FM   FM  
## [31] FM   FM   FM   FM   FM   FM   KW16 KW16 KW16 KW16 KW16 KW16 KW16 KW16 KW16
## [46] KW16 KW16 KW16 GG   GG   GG   GG   GG   GG   GG   GG   GG   GG   GG   GG  
## [61] MV   MV   MV   MV   MV   MV   MV   MV   MV   MV   MV   MV  
## Levels: FM GG KW16 Mel MV PBC

Number of individuals in each population:

table(pop(D_genind))
## 
##   FM   GG KW16  Mel   MV  PBC 
##   12   12   12   12   12   12

Number of loci:

nLoc(D_genind)
## [1] 2000

Number of loci can be also represented like this:

length(locNames(D_genind))
## [1] 2000
validObject(D_genind)
## [1] TRUE

Since Adegenet package doesn’t work the best with SNPs it is better if we use DartR for further evaluations. We are using DartR package with genlight objects.

Loading the data

data <- read.PLINK(file="D.raw", map.file = "D.map")

This command transform plink files into a genlight object. Genlight object have data about SNPs.

Here we can see number of SNPs, individuals, genotypes and so on:

data
##  /// GENLIGHT OBJECT /////////
## 
##  // 72 genotypes,  2,000 binary SNPs, size: 325.3 Kb
##  690 (0.48 %) missing data
## 
##  // Basic content
##    @gen: list of 72 SNPbin
##    @ploidy: ploidy of each individual  (range: 2-2)
## 
##  // Optional content
##    @ind.names:  72 individual labels
##    @loc.names:  2000 locus labels
##    @chromosome: factor storing chromosomes of the SNPs
##    @position: integer storing positions of the SNPs
##    @pop: population of each individual (group size range: 12-12)
##    @other: a list containing: sex  phenotype  pat  mat

We can get information about populations, names and number.

data@pop
##  [1] Mel  Mel  Mel  Mel  Mel  Mel  Mel  Mel  Mel  Mel  Mel  Mel  PBC  PBC  PBC 
## [16] PBC  PBC  PBC  PBC  PBC  PBC  PBC  PBC  PBC  FM   FM   FM   FM   FM   FM  
## [31] FM   FM   FM   FM   FM   FM   KW16 KW16 KW16 KW16 KW16 KW16 KW16 KW16 KW16
## [46] KW16 KW16 KW16 GG   GG   GG   GG   GG   GG   GG   GG   GG   GG   GG   GG  
## [61] MV   MV   MV   MV   MV   MV   MV   MV   MV   MV   MV   MV  
## Levels: FM GG KW16 Mel MV PBC
# DartR
nPop(data)
## [1] 6
# also number, base R
nlevels(data@pop)
## [1] 6

The command we use for getting only the names of the populations is:

unique(data@pop)
## [1] Mel  PBC  FM   KW16 GG   MV  
## Levels: FM GG KW16 Mel MV PBC
# or
unique(pop(data))
## [1] Mel  PBC  FM   KW16 GG   MV  
## Levels: FM GG KW16 Mel MV PBC

Data about individuals, names(ids) and number of individuals:

data$ind.names
##  [1] "1_3"  "2_3"  "3_2"  "5_1"  "8_1"  "9_1"  "10_2" "11_2" "12_2" "13_2"
## [11] "14_2" "15_2" "3_3"  "90"   "99"   "108"  "124"  "128"  "138"  "144" 
## [21] "185"  "190"  "200"  "284"  "1_4"  "2_5"  "3_4"  "4_1"  "5_2"  "6_1" 
## [31] "8_2"  "9_2"  "10_3" "11_3" "13_3" "29"   "1_5"  "2_6"  "3_5"  "4_2" 
## [41] "6_2"  "7_2"  "18"   "20_1" "21_1" "23_1" "24_1" "26_1" "3_9"  "5_5" 
## [51] "6_6"  "7_4"  "10_5" "11_5" "12_4" "13_5" "14_4" "15_3" "18_1" "19_1"
## [61] "1_9"  "2_10" "3_10" "5_6"  "6_7"  "7_5"  "31_1" "32_1" "33"   "34"  
## [71] "37"   "38"
nInd(data)
## [1] 72

Number of loci:

nLoc(data)
## [1] 2000

We now need to check if our dataset is valid, this is the command:

data <- gl.compliance.check(data)
## Starting gl.compliance.check 
##   Processing genlight object with SNP data
##   The slot loc.all, which stores allele name for each locus, is empty. 
## Creating a dummy variable (A/C) to insert in this slot. 
##   Checking coding of SNPs
##     SNP data scored NA, 0, 1 or 2 confirmed
##   Checking locus metrics and flags
##   Recalculating locus metrics
##   Checking for monomorphic loci
##     Dataset contains monomorphic loci
##   Checking for loci with all missing data
##     No loci with all missing data detected
##   Checking whether individual names are unique.
##   Checking for individual metrics
##   Warning: Creating a slot for individual metrics
##   Checking for population assignments
##     Population assignments confirmed
##   Spelling of coordinates checked and changed if necessary to 
##             lat/lon
## Completed: gl.compliance.check

There are no errors reported.

3.

Does your data indicate the presence of inbreeding in the population? Give the results of your analysis and interpret the results.

To check for inbreeding in the population we need to check for FIS values (inbreeding coefficient) and the relation between Ho (observed heterozygosity) and He (expected heterozygosity).

If FIS is positive and Ho < He, inbreeding is present. If FIS is close to 0, data us in Hardy-Weinberg equilibrium. If FIS is greater than 0, we have inbreeding. If FIS is smaller than 0, there is no inbreeding.

gl.report.heterozygosity(data)
## Starting gl.report.heterozygosity 
##   Processing genlight object with SNP data
##   Calculating Observed Heterozygosities, averaged across 
##                     loci, for each population
##   Calculating Expected Heterozygosities

##   pop   nInd nLoc polyLoc monoLoc all_NALoc        Ho      HoSD       He
##    FM 11.923 2000    1637     363         0 0.2662636 0.2156809 0.239767
##    GG 11.953 2000    1063     937         0 0.1948894 0.2486188 0.167650
##  KW16 11.928 2000    1620     380         0 0.2706000 0.2273341 0.236782
##   Mel 11.952 2000    1658     342         0 0.2639000 0.2110112 0.240049
##    MV 11.965 2000    1084     916         0 0.1825515 0.2575590 0.146318
##   PBC 11.934 2000    1681     319         0 0.2586227 0.2001499 0.243036
##      HeSD      uHe    uHeSD         FIS
##  0.166670 0.250262 0.173965 -0.06393986
##  0.185346 0.174969 0.193437 -0.11385336
##  0.169509 0.247141 0.176926 -0.09491973
##  0.163978 0.250529 0.171137 -0.05337015
##  0.178494 0.152699 0.186278 -0.19550086
##  0.162649 0.253664 0.169762 -0.01955022
## Completed: gl.report.heterozygosity

We don’t have positive FIS values, and the observed heterozygosity is higher than the expected, so inbreeding is not present, although the value for PBC is relatively close to 0.

Now let’s check HWE deviation.

The Hardy-Weinberg principle states that in a large, randomly mating population, allele and genotype frequencies will remain constant from generation to generation if other evolutionary influences are not working. These influences include mutation, selection, gene flow, and genetic drift.

Hardy-Weinberg equilibrium allows us to identify loci that may be under selection or experiencing other evolutionary forces.

Ternary plot for each population.

# report for each population separately
gl.report.hwe(data, subset = "each")
## Starting gl.report.hwe 
##   Processing genlight object with SNP data
##   Analysing each population separately

##     Reporting significant departures from Hardy-Weinberg 
##             Equilibrium
##     NB: Departures significant at the alpha level of 0.05 are listed
##     Adjustment of p-values for multiple comparisons vary
##                     with sample size
##  Population             Locus Hom_1   Het Hom_2     N       Prob    Sig
##      <char>            <char> <int> <int> <int> <num>      <num> <char>
##         PBC   AX-93216306_G_C     9     1     2    12 0.03105590    sig
##         PBC   AX-93216317_C_C    11     0     1    12 0.04347826    sig
##         Mel   AX-93216326_A_C     9     1     2    12 0.03105590    sig
##          FM   AX-93216336_C_C     1    10     0    11 0.01813357    sig
##          FM AX-93216336_C_HET     1    10     0    11 0.01813357    sig
##         ---               ---   ---   ---   ---   ---        ---    ---
##         Mel   AX-93262226_A_C    10     0     1    11 0.04761905    sig
##         PBC   AX-93262226_A_C    11     0     1    12 0.04347826    sig
##          GG   AX-93262571_A_C     7     2     3    12 0.04374748    sig
##        KW16   AX-93262571_A_C     9     1     2    12 0.03105590    sig
##        KW16   AX-93263625_T_C     2     1     9    12 0.03105590    sig
##  Prob.adj Sig.adj  npop
##    <lgcl>  <lgcl> <int>
##        NA      NA     1
##        NA      NA     1
##        NA      NA     1
##        NA      NA     1
##        NA      NA     1
##       ---     ---   ---
##        NA      NA     2
##        NA      NA     2
##        NA      NA     2
##        NA      NA     2
##        NA      NA     1
## 
## Completed: gl.report.hwe

In these plots we can see that we have some disequilibrium of homozygotes, but also we have disequilibrium of heterozygotes present. Also what we can observe is that BB allele is nearly absent in 4 out of 6 populations.

Ternary plot for whole dataset.

gl.report.hwe(data, subset = "all")
## Starting gl.report.hwe 
##   Processing genlight object with SNP data
##   Pooling all populations for HWE calculations

##     Reporting significant departures from Hardy-Weinberg 
##             Equilibrium
##     NB: Departures significant at the alpha level of 0.05 are listed
##     Adjustment of p-values for multiple comparisons vary
##                     with sample size
##  Population             Locus Hom_1   Het Hom_2     N         Prob    Sig
##      <char>            <char> <int> <int> <int> <num>        <num> <char>
##         pop   AX-93216308_T_C    53    14     5    72 1.934411e-02    sig
##         pop   AX-93216310_T_C    45    19     8    72 2.220949e-02    sig
##         pop   AX-93216311_G_C    48    17     7    72 1.425047e-02    sig
##         pop   AX-93216317_C_C    63     5     2    70 1.887787e-02    sig
##         pop   AX-93216322_C_C    45    16     9    70 2.482724e-03    sig
##         ---               ---   ---   ---   ---   ---          ---    ---
##         pop AX-93262070_T_HET    42    30     0    72 3.031231e-02    sig
##         pop AX-93262071_G_HET    42    30     0    72 3.031231e-02    sig
##         pop   AX-93262226_A_C    59     2     8    69 3.792823e-09    sig
##         pop   AX-93262571_A_C    35    24    13    72 3.536263e-02    sig
##         pop   AX-93263224_A_C    42    14    15    71 8.647280e-06    sig
##  Prob.adj Sig.adj  npop
##    <lgcl>  <lgcl> <int>
##        NA      NA     1
##        NA      NA     1
##        NA      NA     1
##        NA      NA     1
##        NA      NA     1
##       ---     ---   ---
##        NA      NA     1
##        NA      NA     1
##        NA      NA     1
##        NA      NA     1
##        NA      NA     1
## 
## Completed: gl.report.hwe

For the whole dataset, we can observe deviation from Hardy-Weinberg equilibrium, homozygotes AA are not in equilibrium. Additionaly, BB allele is absent. Also, heterozygotes are not in equilibrium. Since FIS values are not positive, this suggests that the observed deviations are not due to inbreeding, but are likely driven by another evolutionary force like genetic drift, selection pressure, or maybe recent migration. Since it seems like allele AA is heading toward fixation and the absence of BB allele is not a matter of sample size, it might be the case that we are dealing with genetic drift.

4.

Calculate the FST between populations and explain genetic differentiation. If present, give possible causes. 2P

FST (Fixation index) is a measure of genetic differentiation between populations. Genetic differentiation is genetic divergence between populations of the same species caused by evolutionary forces.

If we use the argument “nboots,” the program performs 100 bootstrap repetitions to calculate p-values.

fst <- gl.fst.pop(data, nboots=100)
fst$Fsts

The highest genetic differentiation is between GG and MV populations.

fst_df <- melt(as.matrix(fst$Fsts))
colnames(fst_df) <- c("Population1", "Population2", "FST")
head(fst_df)
##   Population1 Population2        FST
## 1         Mel         Mel         NA
## 2         PBC         Mel 0.01292482
## 3          FM         Mel 0.02280287
## 4        KW16         Mel 0.03755873
## 5          GG         Mel 0.15355031
## 6          MV         Mel 0.19810333

Plot heatmap:

# Plot heatmap using ggplot2
ggplot(fst_df, aes(x = Population2, y = Population1, fill = FST)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red", name = "FST") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Pairwise FST Heatmap", x = "Population 1", y = "Population 2")

There is high differentiation between MV and other populations and GG and other populations.

Possible causes for genetic differentiation can be geographic distance. Populations that are geographically closer can have lower FST values since there is higher gene flow because of migration. Another possible cause can be genetic drift, which causes fluctuations of allele frequencies in small populations. Natural selection can be a cause, in which case populations undergo different environmental conditions, so it favors different alleles. Reproductive isolation is also a possible cause.

5.

Calculate the effective size of the populations. On what basis does the command calculate the effective population size? Explain what it is. Which population has the highest and which the lowest estimated effective size? Give the values. 2P

Effective population size

ne <- gl.LDNe(data)
## Starting gl.LDNe 
##   Processing genlight object with SNP data
## Starting gl2genepop 
##   Processing genlight object with SNP data
##   The genepop file is saved as:  C:\Users\aleks\AppData\Local\Temp\Rtmp4i315v/dummy.gen/
## Completed: gl2genepop

## $FM_1
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used          0+       No S*
##     Harmonic Mean Sample Size        11.8        11.8
##       Independent Comparisons     1337084      974315
##                   OverAll r^2    0.140485    0.144573
##           Expected r^2 Sample    0.113025    0.108832
##                 Estimated Ne^         8.6         6.5
##             CI low Parametric           9         6.2
##            CI high Parametric         9.3         6.4
##              CI low JackKnife         1.9         1.4
##             CI high JackKnife         Inf         Inf
## 
## $GG_3
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used          0+       No S*
##     Harmonic Mean Sample Size        11.9        11.9
##       Independent Comparisons      563466      444916
##                   OverAll r^2    0.187406     0.19991
##           Expected r^2 Sample    0.110596    0.109016
##                 Estimated Ne^           2         1.7
##             CI low Parametric           2         1.7
##            CI high Parametric           2         1.7
##              CI low JackKnife           1         0.8
##             CI high JackKnife           9         6.2
## 
## $KW16_1
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used          0+       No S*
##     Harmonic Mean Sample Size        11.8        11.8
##       Independent Comparisons     1307685      934220
##                   OverAll r^2    0.131748    0.138135
##           Expected r^2 Sample    0.108563    0.110154
##                 Estimated Ne^        11.5         8.7
##             CI low Parametric        11.1         8.8
##            CI high Parametric        11.5         9.1
##              CI low JackKnife         3.1         2.5
##             CI high JackKnife         102        97.1
## 
## $Mel_1
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used          0+       No S*
##     Harmonic Mean Sample Size        11.9        11.9
##       Independent Comparisons     1371746     1013166
##                   OverAll r^2    0.118714    0.120467
##           Expected r^2 Sample    0.110881    0.108453
##                 Estimated Ne^        34.1        25.6
##             CI low Parametric        36.2        23.1
##            CI high Parametric          39        24.6
##              CI low JackKnife        12.4         8.6
##             CI high JackKnife         Inf         Inf
## 
## $MV_1
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used          0+       No S*
##     Harmonic Mean Sample Size        11.9        11.9
##       Independent Comparisons      585995      295508
##                   OverAll r^2    0.221839    0.209289
##           Expected r^2 Sample    0.108956    0.110215
##                 Estimated Ne^         1.4         1.6
##             CI low Parametric         1.4         1.5
##            CI high Parametric         1.4         1.6
##              CI low JackKnife         0.7         1.1
##             CI high JackKnife         2.9         2.4
## 
## $PBC_3
##                     Statistic Frequency 1 Frequency 2
##  Lowest Allele Frequency Used          0+       No S*
##     Harmonic Mean Sample Size        11.8        11.9
##       Independent Comparisons     1409665     1030837
##                   OverAll r^2    0.109234    0.110855
##           Expected r^2 Sample    0.108669    0.110283
##                 Estimated Ne^      1875.5       308.7
##             CI low Parametric       373.7       350.2
##            CI high Parametric       990.9      1139.7
##              CI low JackKnife       198.3       134.5
##             CI high JackKnife         Inf         Inf
## 
##   The results are saved in: C:\Users\aleks\AppData\Local\Temp\Rtmp4i315v/genepopLD.txt 
## Completed: gl.LDNe
ne
## $FM_1
##                       Statistic Frequency 1 Frequency 2
## 1  Lowest Allele Frequency Used          0+       No S*
## 2     Harmonic Mean Sample Size        11.8        11.8
## 3       Independent Comparisons     1337084      974315
## 4                   OverAll r^2    0.140485    0.144573
## 5           Expected r^2 Sample    0.113025    0.108832
## 6                 Estimated Ne^         8.6         6.5
## 7             CI low Parametric           9         6.2
## 8            CI high Parametric         9.3         6.4
## 9              CI low JackKnife         1.9         1.4
## 10            CI high JackKnife         Inf         Inf
## 
## $GG_3
##                       Statistic Frequency 1 Frequency 2
## 1  Lowest Allele Frequency Used          0+       No S*
## 2     Harmonic Mean Sample Size        11.9        11.9
## 3       Independent Comparisons      563466      444916
## 4                   OverAll r^2    0.187406     0.19991
## 5           Expected r^2 Sample    0.110596    0.109016
## 6                 Estimated Ne^           2         1.7
## 7             CI low Parametric           2         1.7
## 8            CI high Parametric           2         1.7
## 9              CI low JackKnife           1         0.8
## 10            CI high JackKnife           9         6.2
## 
## $KW16_1
##                       Statistic Frequency 1 Frequency 2
## 1  Lowest Allele Frequency Used          0+       No S*
## 2     Harmonic Mean Sample Size        11.8        11.8
## 3       Independent Comparisons     1307685      934220
## 4                   OverAll r^2    0.131748    0.138135
## 5           Expected r^2 Sample    0.108563    0.110154
## 6                 Estimated Ne^        11.5         8.7
## 7             CI low Parametric        11.1         8.8
## 8            CI high Parametric        11.5         9.1
## 9              CI low JackKnife         3.1         2.5
## 10            CI high JackKnife         102        97.1
## 
## $Mel_1
##                       Statistic Frequency 1 Frequency 2
## 1  Lowest Allele Frequency Used          0+       No S*
## 2     Harmonic Mean Sample Size        11.9        11.9
## 3       Independent Comparisons     1371746     1013166
## 4                   OverAll r^2    0.118714    0.120467
## 5           Expected r^2 Sample    0.110881    0.108453
## 6                 Estimated Ne^        34.1        25.6
## 7             CI low Parametric        36.2        23.1
## 8            CI high Parametric          39        24.6
## 9              CI low JackKnife        12.4         8.6
## 10            CI high JackKnife         Inf         Inf
## 
## $MV_1
##                       Statistic Frequency 1 Frequency 2
## 1  Lowest Allele Frequency Used          0+       No S*
## 2     Harmonic Mean Sample Size        11.9        11.9
## 3       Independent Comparisons      585995      295508
## 4                   OverAll r^2    0.221839    0.209289
## 5           Expected r^2 Sample    0.108956    0.110215
## 6                 Estimated Ne^         1.4         1.6
## 7             CI low Parametric         1.4         1.5
## 8            CI high Parametric         1.4         1.6
## 9              CI low JackKnife         0.7         1.1
## 10            CI high JackKnife         2.9         2.4
## 
## $PBC_3
##                       Statistic Frequency 1 Frequency 2
## 1  Lowest Allele Frequency Used          0+       No S*
## 2     Harmonic Mean Sample Size        11.8        11.9
## 3       Independent Comparisons     1409665     1030837
## 4                   OverAll r^2    0.109234    0.110855
## 5           Expected r^2 Sample    0.108669    0.110283
## 6                 Estimated Ne^      1875.5       308.7
## 7             CI low Parametric       373.7       350.2
## 8            CI high Parametric       990.9      1139.7
## 9              CI low JackKnife       198.3       134.5
## 10            CI high JackKnife         Inf         Inf

The effective population size is estimated using the gl.LDNe() function. The function estimates Ne based on linkage disequilibrium (LD). Linkage disequilibrium happens when two loci are close to each other and are inherited together since they don’t undergo separation in recombination during meiosis.

In small populations LD tends to be higher because of presence of genetic drift.

Population that has the highest effective population size is PBC with value above 1750, and the ones with the lowest are GG and MV, around 0. The others are pretty close to 0 too.

6.

Calculate and then produce a PCA plot. Explain what the plot represents (interpret the results of the analysis). What can you tell about the structure of your dataset?

Calculate principle components:

pca <- gl.pcoa(data)
## Starting gl.pcoa 
##   Processing genlight object with SNP data
##   Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state

## Completed: gl.pcoa

Scree Plot explains the PCA and decides how many components need to be preserved. The first principal component (value 1 on the PCA Axis) contributes to the explanation of about 12% of the variance in our data, the second explains 9%. After PC2, the variance contribution drops sharply to PC3, which means the first three principal components capture the most genetic variation.

gl.pcoa.plot(x = data, glPca = pca)
## Starting gl.pcoa.plot 
##   Processing an ordination file (glPca)
##   Processing genlight object with SNP data
##   Plotting populations in a space defined by the SNPs
##   Preparing plot .... please wait

## Completed: gl.pcoa.plot

On the X-axis we have the 1st component, which explains 11.7% of the variance and on the Y-axis the 2nd component, which explains 8.8% of the variance. From the graph, we can see that the MV population differs the most from the other 5 populations.Also GG differs but not that much. The other 4 populations do not differ from each other,they are clustered together.

gl.pcoa.plot(x = data, glPca = pca, zaxis = 3)
## Starting gl.pcoa.plot 
##   Processing an ordination file (glPca)
##   Processing genlight object with SNP data
##   Displaying a three dimensional plot, mouse over for details for each point
##   May need to zoom out to place 3D plot within bounds
## Completed: gl.pcoa.plot

From the 3D representation we can see that MV and GG form more separate clusters than the others as in the 2D plot, with the difference that now we can see that KW16 is also being somewhat to the side and forming a separate cluster, which we couldn’t see from the 2D representation.That’s why it is good to use a 3D plot.

GG and MV are forming their own distinct clusters, and FM and KW16 are showing more admixture, forming a gradient instead of strict clusters. Mel and PBC are overlapping with FM and KW16, so probably they share ancestry or gene flow with them. MV and GG forms their own clusters, which connects us to the highest genetic differentiation between these two populations and the other populations.

NEIGHBOUR JOINING TREE

gl.tree.nj(data, type = "phylogram")
## Starting gl.tree.nj 
##   Processing genlight object with SNP data
##   Converting to a matrix of frequencies, locus by populations
##   Computing Euclidean distances

## Completed: gl.tree.nj
## 
## Phylogenetic tree with 6 tips and 4 internal nodes.
## 
## Tip labels:
##   FM, GG, KW16, Mel, MV, PBC
## 
## Unrooted; includes branch length(s).

MV and GG populations have long branches, which means they have high genetic differentiation and no recent gene flow with other populations.

KW16 and FM share a more recent common ancestor compared to most of the other populations, but a more distant one than Mel and PBC.

The branch connecting Mel and PBC is short, which means that there has been relatively little evolutionary change (genetic drift, mutation) differentiating them since they shared a common ancestor.

So, on both the dendrogram and the PCA plot Mel and PBC as the most similar pair.

Both methods also overlap on the close relationship between KW16 and FM, although they are not as tightly clustered as Mel and PBC in the PCA.

Both the dendrogram and PCA represent the genetic distinction of GG and MV from the other populations.

STRUCTURE

The STRUCTURE program allows the analysis of different populations, assigning individuals to populations, studying hybrid zones, identifying migrants and mixed individuals.

str <- gl.run.structure(data, k.range= 1:6, num.k.rep = 2, exec = "structure.exe", numreps=200, burnin=100)
## Starting gl.run.structure 
##   Processing genlight object with SNP data

## Completed: gl.run.structure

For us, the \(\Delta\)K graph is important. The command gl.evanno(str) \(\Delta\)K estimates the number of calculated subpopulations that best estimate the structure of our data. We look at \(\Delta\)K with the highest value. In this case, the highest value is 5.

gl.evanno(str)

## $df
##   k reps mean.ln.k     sd.ln.k    ln.pk   ln.ppk    delta.k
## 1 1    2 -113284.7    36.91097       NA       NA         NA
## 2 2    2 -108267.1   262.40733  5017.55  1113.15   4.242069
## 3 3    2 -104362.8    45.74981  3904.40  5791.30 126.586322
## 4 4    2 -106249.6   188.16111 -1886.90  4960.00  26.360388
## 5 5    2 -113096.6 10426.92588 -6846.90 12922.75   1.239363
## 6 6    2 -107020.7  3115.08821  6075.85       NA         NA
## 
## $plots
## $plots$mean.ln.k

## 
## $plots$ln.pk

## 
## $plots$ln.ppk

## 
## $plots$delta.k

This graph is showing the ΔK values, which are used to determine the most likely number of distinct genetic clusters/populations.

A sharp peak at K=5 means that 5 is the most likely number of clusters, so the individuals in the dataset can be divided into five groups that are genetically different from each other.

To draw a graph of population structure we use the command:

gl.evanno(str)

## $df
##   k reps mean.ln.k     sd.ln.k    ln.pk   ln.ppk    delta.k
## 1 1    2 -113284.7    36.91097       NA       NA         NA
## 2 2    2 -108267.1   262.40733  5017.55  1113.15   4.242069
## 3 3    2 -104362.8    45.74981  3904.40  5791.30 126.586322
## 4 4    2 -106249.6   188.16111 -1886.90  4960.00  26.360388
## 5 5    2 -113096.6 10426.92588 -6846.90 12922.75   1.239363
## 6 6    2 -107020.7  3115.08821  6075.85       NA         NA
## 
## $plots
## $plots$mean.ln.k

## 
## $plots$ln.pk

## 
## $plots$ln.ppk

## 
## $plots$delta.k

gl.plot.structure(str, K=2)
## Starting gl.plot.structure

## Completed: gl.plot.structure
gl.plot.structure(str, K=3)
## Starting gl.plot.structure

## Completed: gl.plot.structure
gl.plot.structure(str, K=4)
## Starting gl.plot.structure

## Completed: gl.plot.structure
gl.plot.structure(str, K=5)
## Starting gl.plot.structure

## Completed: gl.plot.structure

MV is represented as distinct population at K=5, which is confirmed by the PCA and the FST values. GG is distinct for all K values, showing no admixture, supported by PCA and FST. The populations FM and KW16 show continuous variation, which indicates common ancestry or historical admixture. Mel and PBC are mostly red, but show some variability, have recent admixture.

7.

Does the calculated isolation by distance indicate isolation of populations? Please comment.

The correlation of genetic diversity between individuals decreases as a function of geographic distance. Migration is usually localized in space, which is why individuals from nearby subpopulations are expected to be more genetically similar.The IBD command calculates the FST distance and the Euclidean distance (distance in space between individuals).

coordinates <- read.table("D_koordinate.txt",header = T)
as.data.frame(coordinates)
##      id      lat        lon
## 1   1_3 28.38351  -80.86100
## 2   2_3 28.45910  -80.79395
## 3   3_2 28.48986  -80.69970
## 4   5_1 28.52193  -80.67518
## 5   8_1 28.58668  -80.63023
## 6   9_1 28.66899  -80.60316
## 7  10_2 28.67131  -80.60089
## 8  11_2 28.68660  -80.59389
## 9  12_2 28.76809  -80.58664
## 10 13_2 28.82765  -80.57171
## 11 14_2 28.84794  -80.47549
## 12 15_2 28.85964  -80.43145
## 13  3_3 26.72200  -80.41735
## 14   90 26.75146  -80.33422
## 15   99 26.77550  -80.30058
## 16  108 26.83108  -80.23237
## 17  124 26.92929  -80.22784
## 18  128 26.96413  -80.13759
## 19  138 27.05090  -80.10414
## 20  144 27.09555  -80.09261
## 21  185 27.13165  -79.99944
## 22  190 27.14187  -79.97806
## 23  200 27.18180  -79.90452
## 24  284 27.23546  -79.87980
## 25  1_4 26.67532  -81.65894
## 26  2_5 26.72812  -81.64386
## 27  3_4 26.78103  -81.58905
## 28  4_1 26.78257  -81.50051
## 29  5_2 26.85402  -81.40285
## 30  6_1 26.85458  -81.34886
## 31  8_2 26.88110  -81.30132
## 32  9_2 26.92853  -81.21521
## 33 10_3 26.96437  -81.20300
## 34 11_3 27.01076  -81.13532
## 35 13_3 27.04317  -81.12838
## 36   29 27.12150  -81.07854
## 37  1_5 25.07614  -81.78424
## 38  2_6 25.11529  -81.73620
## 39  3_5 25.13406  -81.66986
## 40  4_2 25.14318  -81.64972
## 41  6_2 25.15786  -81.62072
## 42  7_2 25.23635  -81.55161
## 43   18 25.26566  -81.48095
## 44 20_1 25.32258  -81.41023
## 45 21_1 25.38406  -81.32932
## 46 23_1 25.40648  -81.26365
## 47 24_1 25.47800  -81.22928
## 48 26_1 25.52140  -81.20662
## 49  3_9 33.67975 -117.51249
## 50  5_5 33.70964 -117.50067
## 51  6_6 33.76716 -117.43522
## 52  7_4 33.82353 -117.41326
## 53 10_5 33.92078 -117.33045
## 54 11_5 33.92621 -117.23190
## 55 12_4 33.93673 -117.15715
## 56 13_5 33.93888 -117.13832
## 57 14_4 33.94749 -117.13420
## 58 15_3 34.00710 -117.07106
## 59 18_1 34.04740 -117.02310
## 60 19_1 34.09604 -116.96714
## 61  1_9 33.60700 -117.50743
## 62 2_10 33.65287 -117.43607
## 63 3_10 33.66657 -117.41446
## 64  5_6 33.68696 -117.31534
## 65  6_7 33.70411 -117.27106
## 66  7_5 33.72575 -117.24889
## 67 31_1 33.81423 -117.22504
## 68 32_1 33.87666 -117.19616
## 69   33 33.90794 -117.12271
## 70   34 33.93775 -117.04865
## 71   37 33.94334 -117.00918
## 72   38 33.98418 -116.96034
gl.ibd(data, coordinates = coordinates[,2-3], distance = "Fst")
## Analysis performed on the genlight object.
## 'nperm' >= set of all permutations: complete enumeration.
## Set of permutations < 'minperm'. Generating entire set.

##   Coordinates used from: data.frame lat/lon (Mercator transformed) 
##   Transformation of Dgeo: Dgeo 
##   Genetic distance: Fst 
##   Tranformation of Dgen:  Dgen 
## 
## Mantel statistic based on Pearson's product-moment correlation 
## 
## Call:
## vegan::mantel(xdis = Dgen, ydis = Dgeo, permutations = permutations,      na.rm = TRUE) 
## 
## Mantel statistic r: 0.633 
##       Significance: 0.041667 
## 
## Upper quantiles of permutations (null model):
##   90%   95% 97.5%   99% 
## 0.234 0.631 0.634 0.634 
## Permutation: free
## Number of permutations: 719
## 
## 
## Completed: gl.ibd
## $Dgen
##              FM         GG       KW16        Mel         MV
## GG   0.14540894                                            
## KW16 0.04290602 0.17089061                                 
## Mel  0.02280287 0.15355031 0.03755873                      
## MV   0.19719878 0.28493627 0.21905184 0.19810333           
## PBC  0.01961675 0.14332658 0.03340314 0.01292482 0.18688499
## 
## $Dgeo
##              FM         GG       KW16        Mel         MV
## GG   4097009.18                                            
## KW16  200846.64 4130460.12                                 
## Mel   235888.90 4133149.41  432827.36                      
## MV   4092392.80   14521.24 4125185.89 4129320.23           
## PBC   135975.40 4226547.05  260977.44  216437.83 4222067.05
## 
## $mantel
## 
## Mantel statistic based on Pearson's product-moment correlation 
## 
## Call:
## vegan::mantel(xdis = Dgen, ydis = Dgeo, permutations = permutations,      na.rm = TRUE) 
## 
## Mantel statistic r: 0.633 
##       Significance: 0.041667 
## 
## Upper quantiles of permutations (null model):
##   90%   95% 97.5%   99% 
## 0.234 0.631 0.634 0.634 
## Permutation: free
## Number of permutations: 719

The x-axis shows the geographical distance, and the y-axis shows the genetic distance. Because the slope of the line is positive (goes up), we can conclude that we have isolation with distance in our dataset. The result is statistically significant because since p < 0.05, but R² = 0.401 which means while distance plays a role, other factors like historical migration, environmental barriers and selection influence genetic structure.

Density plot.

#Prepare the matrices
Dgeo <- as.matrix(dist(coordinates[,2-3]))
Dgen <- as.matrix(dist(as.matrix(data)))

#Calculate density
dens <- kde2d(as.vector(Dgeo), as.vector(Dgen), n=300)

#Create color palette
myPal <- colorRampPalette(c("white", "blue", "gold", "orange", "red"))

#Plot all together
plot(Dgeo, Dgen, pch=20,cex=.5)
image(dens, col=transp(myPal(300),.7), add=TRUE)

# Calculate distance 
dist_lm <- lm(as.vector(Dgen) ~ as.vector(Dgeo))
abline(dist_lm)
title("Isolation by distance plot")