Genetic variation in the Scandinavian arctic fox

Christopher Alan Cockerill 7th March 2024
Exercise: genetic processes in small populations
Conservation of Populations BL7076, Stockholm University

Part of this exercise was inspired by Tsai (2021), for more information on bioinformatics and specific concepts, you can check it out here.

Aims

The aims of this exercise is to give you an introduction to how genetic data can be used to inform us about genetic diversity, evolutionary processes and demographic histories of populations.

Next Generation Sequencing (NGS) technologies have transformed genetic research by enabling rapid and accurate sequencing of DNA samples. High-throughput sequencing, a key component of NGS, allows for the efficient decoding of the entire genetic code of organisms such as plants, animals, and microbes. This process involves breaking the organism’s DNA into small fragments and labeling them with fluorescent markers. These labeled fragments are then replicated multiple times through amplification and sequenced simultaneously using specialized equipment. The “re-sequencing” of a genome typically captures information such as Single Nucleotide Polymorphisms (SNPs), INsertions and DELetions (INDELS) and Copy Number Variants (CNVs) within the genomes of individuals. It is called resequencing because you are sequencing the genomes of individuals of a given species that has already been sequenced previously. You then map your sequencing data to this “reference” genome since it has already been assembled. So you don’t need to assemble a genome each time you sequence an individual, which is not trivial.

In recent years, the widespread adoption of NGS has revolutionized the study of genetic variation in natural populations, providing researchers with unprecedented insights into biodiversity, population dynamics, and evolutionary processes. These genetic variants can be informative and used to indicate how closely related populations are, whether they are genetically similar, show metapopulation structure, how genetically diverse each population is. You can even quantify the levels of inbreeding that has occurred recently, and in the distant past using runs of homozygosity, regions of the genome that are identify by descent. This can be very useful when considering the conservation of vulnerable wild populations, and this information can be used to pinpoint which populations require increased conservation efforts and which populations may be suitable candidate populations for translocation.

We won’t go too much into the bioinformatic processing of such sequencing data, instead I have prepared a Variant Call Format (VCF) file that contains variant calls for a number of individuals. This data has been subset from whole-genome Illumina sequencing reads. In this exercise we will convert this vcf file into a genlight object, which is a file format used in R’s adegenet package that efficiently stores genetic marker data, representing genotypes of individuals across markers for population genetic analyses. We will then do some basic population genetic analyses, including a Principle Component Analysis (PCA) using adegenet, calculate observed heterozygosity and inbreeding coefficients (FIS) per population and pairwise fst values using DartR.

The population

Map showing the distribution of Arctic tundra habitat (Scandinavia: alpine tundra [60]; Kola: oro-arctic tundra [61]) in relation to study sample areas in Northern Sweden (Vindelfjällen-Arjeplog), southern Sweden (Helags), northern Norway (Saltfjellet, Øvre Dividal, Reisa nord, the Varanger Peninsula), western Russia (Kola Peninsula and Yamal), Mid-Russia (Taymyr) and eastern Russia (Indigirka, Faddeyesky Island and Wrangel Island.

The Fennoscandian arctic fox (Vulpes lagopus) population has been persecuted extensively for its fur and is in a fragmented state due to climate-change. We will conduct some basic population genetic analyses to investigate the current status of this population relative to the larger population in Russia (see Figure 1 for subpopulation locations).

While you are progressing throughout the exercise, I want you to consider the evolutionary processes that may be shaping the genetic structure of the populations, what demographic events may have caused these processes, and what you would expect from an outbreeding population relative to a small population. This subset data has been subsampled to one chromosome and randomly sampled for informative SNPs to make it lighter on your computers, it should represent the larger dataset, however, the reduced resolution in the data may affect the patterns observed in the final output.

Installation

We need to install some packages to complete the exercise, I posted these on Athena so perhaps you have them set up, but take some time to make sure you have them loaded and that they are working.

#Set your working directory
setwd("~/path/to/your/data/")

# Try install the following:

# If there's an option during the install, type yes. For example, 
# do you want to install from sources the package which needs compilation? 
# (Yes/no/cancel)
# type yes and enter
# In the more intermediate level of R, you will find that you use quite a 
# lot of packages that
# are already available for a given task. A package may also consists of 
# additional packages. So it
# likely that you will install a lot of packages in the end.

install.packages("vcfR")
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("reshape2")
install.packages("RColorBrewer")
install.packages("patchwork")
install.packages("ggrepel")
install.packages("dplyr")
install.packages("dartR")
install.packages("dartR.base")
install.packages("readxl")
install.packages("poppr")

# To be able to install dartR properly we recommend installation of a recent 
# version of R (at least version 4.0) and Rstudio (at least version 1.1.463).

# Sometimes there are additional package managers for specific disciplines which allow you to install
# packages that is not available using the install.packages command. In our case, 
# Bioconductor would be a good place to search for packages in:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# If the packages or their dependancies cannot be installed try installing 
using this command
BiocManager::install("DartR")


# for adegenet to work in Mac users, Please make sure: 
# 1. you use R-4.1.1.pkg without the ARM64 version
# 2. Also download and install gfortran 11.2 for BigSur (macOS 11) Intel
#    from https://github.com/fxcoudert/gfortran-for-macOS/releases
install.packages("adegenet")

# Mac users need to ensure Xcode (available on the App Store) is installed 
# before starting the dartR installation process.

# Alternatively you can download the package via github. This tutorial 
# explains how 
# http://georges.biomatix.org/storage/app/media/Tutorial_2_dartR_Installation.pdf

Load the packages

library(vcfR)
library(tidyverse)
library(ggplot2)
library(reshape2)
library(RColorBrewer)
library(patchwork)
library(ggrepel)
library(dplyr)
library(dartR)
library(dartR.base)
library(adegenet)
library(readxl)
library(poppr)

Data import

Now we will import the data in vcf format, see Figure 1 for how a vcf file is formatted below.

The structure of a VCF file, note the header includes descriptive information about the contents, and the body includes the chromosom, position and genotype information (from Dave Tang, https://davetang.github.io/learning_vcf_file/)

You can see from the name of the vcf file that the different filtering steps that the file has been through. scaff22 means that the file only contains variants called for sites located on scaffold 22. A scaffold is a bit like a chromosome, think of it as a file that contains the information for a given chromosome. Sometimes the chromosome may be separated into several scaffolds. The higher the quality of the assembly, the more complete the scaffold will be.

Biallelic means that sites that are not biallelic (more than two alleles at a site) are filtered out, this is because multiallelic sites are rare and typically consequence of sequencing and other errors so we exclude them.

The 0missing simply means that when we merge all individual files together, all sites need to be shared by all individuals, otherwise the sites are removed.


vcf<-read.vcfR("popgen_practical_scaff22.biallelic.0missing_subset.vcf", 
verbose= FALSE)



# Take a look at how the vcfR object is structured.

vcf

There are variant calls for 83 individuals, across 1 chromosome/scaffold, 12,116 variants, and 0 missing data. And the file size is 13.5 Mb (not super large but lets see how you all get on!).

# Covert the vcf file into a genlight object
vcf.gl <- vcfR2genlight(vcf)

# check the file
vcf.gl

# Next we use the function head() to wrap around the object, so it's only 
# printing a subset of data you can try to inspect vcf.gl@ind.names and 
# vcf.gl@pop without head
head(vcf.gl@ind.names)

# Because excel removes the leading zeros, we need to remove them in the 
# genlight object too, otherwise the metadata won't match
# The files were orgininally named in this way because they have information 
# that is useful for us ("0413", for example is a sample collected in 2004 
# and was the 13th sample to be extracted)
# Some labs use different naming conventions, if you explore the data a bit you
# will see that not all samples are named in the same way. 


vcf.gl@ind.names <- gsub("^0+", "", vcf.gl@ind.names)

# Check the names to make sure the zeros are gone
vcf.gl@ind.names

# Now we can add some metadata for the samples, this is just an excel file 
# that inclues a matching id to the vcf file with a pop column for the 
# source population

# inport the file
popgen_metadata<- readxl::read_excel("popgen_metadata.xlsx")

# In order to use the function properly, like the case of vcfR2genlight, a lot
# of parameters need 
# to be specified.
# We will add the population IDs 
# We will also set the ploidy to 2 since the arctic fox is diploid
pop(vcf.gl) <- popgen_metadata$pop
ploidy(vcf.gl) <- 2



# Check tha the populations have been attached correctly
vcf.gl@pop

Task 1 - Population structure (PCA)

# We will perform Principle Component Analysis (PCA) to take a look at how our 
# genetic data is structured. 
# This may take some time, but only a few minutes.
vcf.pca<-glPca(vcf.gl, nf = 10)

# Have a look at how it is structured
vcf.pca

The output of the PCA - The principal components are what we will plot, the eigenvalues quantify the variance captured by each principal component. The loadings are generated per SNP and will describe the correlation of ech SNP to each principle component.

Next, we will extract the scores for each PC in preparation for making some figures, and add the country information to allow us to explore the data a little better

# convert scores of vcf.pca into a tibble
vcf.pca.scores <- as_tibble (vcf.pca$scores)

# add the population data into a column of vcf.pca.scores tibble
vcf.pca.scores$pop <- popgen_metadata$pop 

# We will also determine the variance each PC contributes the data, which 
# will help us understand potential drivers of patterns in our dataset. 
# Let's plot the eigenvectors to try an understand this a bit more.

barplot(100 * vcf.pca$eig / sum(vcf.pca$eig), col="green")
title(ylab = "Percent of variance explained") 
title(xlab = "Eigenvalues")

Loading plot from the PCA, here you will see that the first few principle components explain the majority of the variation in the data. This is what we will visualize in the next few steps.

# Let's extract the variance associated with the top 4 PCs, so we can use them 
# in our plots.

# first we sum all the eigenvalues
eig.total <- sum(vcf.pca$eig)

# sum the variance
PC1.variance <- formatC(head(vcf.pca$eig)[1]/eig.total * 100)
PC2.variance <- formatC(head(vcf.pca$eig)[2]/eig.total * 100)
PC3.variance <- formatC(head(vcf.pca$eig)[3]/eig.total * 100)
PC4.variance <- formatC(head(vcf.pca$eig)[4]/eig.total * 100)

# Let's check that this has worked

PC1.variance 

[1] "8.721"

This suggests that 8.721% of the variance is explained in PC1.

Now we will plot the data using ggplot and look to see if there is any populations structure.

# plot PC1 and PC2 with each sample as a point
plot12 <- ggplot(vcf.pca.scores, aes(PC1, PC2)) + geom_point()
plot12

# We’ll add some axis labels, and incorporate the variance information 
# to describe the relative importance of the spread of the data
# paste0() essentially connects the strings together

xlabel <- paste0("PC1 variance = ",PC1.variance,"%")
ylabel <- paste0("PC2 variance = ", PC2.variance, "%")
plot12 <- plot12 + labs(x = xlabel, y = ylabel)
plot12

#We need some labels to describe the population of origin.
#We will also set some colours

cols <- colorRampPalette(brewer.pal(8, "Set1"))(17)

plot12 <- plot12 + geom_point(aes(col = pop), size =3) + 
scale_colour_manual(values=cols) +
theme_bw()
plot12

# Let's quickly look at PC3/PC4, and compare to the first plot.

plot34 <- ggplot(vcf.pca.scores, aes(PC3, PC4)) + 
    geom_point(aes(col = pop)) + 
    labs(x = paste0("PC3 variance = ", PC3.variance,"%"), 
    y = paste0("PC4 variance = ", PC4.variance, "%")) + 
    scale_colour_manual(values = cols) +
  theme_bw()

plot12 + plot34

We can see that patterns are starting to emerge regarding the genetic relatedness within and between the populations. However, it may be difficult to see some of the subtle features of the diversity that may be important. let’s explore the data in a slightly different way.

# Calculate the mean value of the principal components for each country. 
# We can use this to make some labels for our plots
means <- vcf.pca.scores %>% group_by(pop) %>% 
  summarize(meanPC1 = mean(PC1),
     meanPC2 = mean(PC2),meanPC3 = mean(PC3), meanPC4 = mean(PC4))

# Let's make a slightly different plot than our first comparison of PC1 and PC2


plot12.2 <- ggplot(vcf.pca.scores, aes(PC1, PC2, col =  pop)) + 
    labs(x = paste0("PC1 variance = ", PC1.variance, "%"), 
        y = paste0("PC2 variance = ", PC2.variance, "%")) + 
    scale_colour_manual(values = cols) +
    stat_ellipse(level = 0.95, size = 1) +
    geom_label_repel(data = means,
    aes(means$meanPC1, means$meanPC2, col = means$pop, label = means$pop)) +
    theme_bw()

plot12 + plot12.2

# You might get a few warnings but it is nothing to worry about. We have a 
# small sample size for some of these populations so this method of plot is
# not ideal, but it will serve for demonstrative purposes anyway

In our new plot, we have added an ellipse that describes how the individual samples per country are distributed in the plot. We have also added country labels, which are positioned on the plot using the mean PC values we calculated earlier. We expect that if all samples within a country are genetically similar, we should see a small ellipse. However, if samples a not genetically similar, we will see a large ellipse.

Questions

How does the PCA plot reveal patterns of genetic differentiation between the Scandinavian and Russian populations of arctic foxes?
What do the first few principal components represent in terms of genetic variation among individuals? How might historical demographic events influence these patterns?
Which subpopulations are most genetically differentiated and what contemporary processes might be driving this isolation?

Task 2 - Genetic variation

Now we are going to calculate some traditional populations genetics estimates from our data. You are probably familiar with Hardy-Weinberg equilibrium.

Briefly, Hardy-Weinberg equilibrium (HWE) is a fundamental principle in population genetics that describes the expected genotype frequencies in a population under certain conditions. It serves as a null hypothesis against which observed genotype frequencies can be compared to assess departures from expected values, which may indicate evolutionary processes such as genetic drift, migration, mutation, or natural selection.

Remember the principles?

Random Mating: Individuals in the population mate randomly with respect to their genotypes.
No Migration: There is no migration into or out of the population.
No Mutation: There is no new mutation introducing new alleles into the population.
No Selection: There is no differential survival or reproductive advantage associated with particular genotypes.
Large Population Size: The population size is large enough to prevent random sampling errors from significantly affecting genotype frequencies.

If a population is out of Hardy-Weinberg equilibrium, it suggests that one or more of these principles were violated.

# Now we will explore genetic variation within the populations with the 
# package DartR.

#For this programme, we need to combine the metadata with the genlight object 
# in a different way


vcf.gl$other$ind.metrics <- readxl::read_excel("popgen_metadata.xlsx")


#Add this as pop for the vcf.gl
pop(vcf.gl) <- vcf.gl$other$ind.metrics$pop 

#Add the id column to match the indNames in the vcf.gl
indNames(vcf.gl) <- vcf.gl$other$ind.metrics$id 

# Then we need to check that the data is in the correct format.

gl <- gl.compliance.check(vcf.gl)

# The data contains monomorphic loci, A DArT dataset should not have 
# monomorphic loci, so we will remove them.
gl_nm<-gl.filter.monomorphs(vcf.gl, verbose = NULL)

# Now we have done that we will calculate metrics for each locus that are not 
# calculatedby default in other programs and are required for various 
# functions in dartR
gl_nm.gl_2 <- gl.recalc.metrics(gl_nm)

# Now some extra formatting that will format locus metrics as a dataframe
gl_nm.gl_2$other$loc.metrics <- as.data.frame(gl_nm.gl_2$other$loc.metrics)

# Check that the monomorphic loci have been removed
gl_nm.gl_3 <- gl.compliance.check(gl_nm.gl_2) 
 

# Once the genlight object has been formatted appropriately, we can finally 
# calculate some heterozygosity estimates.
# This can take a few mninutes as the chromosomes in our dataset are quite 
# large, but it shouldn't take too long. 

gl.report.heterozygosity(
  gl_nm.gl_3,
  method ='pop',
  save2tmp = FALSE,
  verbose = 3)

Information about the heterozygosity estimates (from DartR documentation):

Heterozygosities and FIS (inbreeding coefficient) are calculated by locus within each population using the following equations:

Observed heterozygosity (Ho) = number of homozygotes / n_Ind, where n_Ind is the number of individuals without missing data.

Observed heterozygosity adjusted (Ho.adj) <- Ho * n_Loc / (n_Loc + n.invariant), where n_Loc is the number of loci that do not have all missing data andn.invariant is an estimate of the number of invariant loci to adjust heterozygosity.

Expected heterozygosity (He) = 1 - (p^2 + q^2), where p is the frequency of the reference allele and q is the frequency of the alternative allele.

Expected heterozygosity adjusted (He.adj) = He * n_Loc / (n_Loc + n.invariant)

Unbiased expected heterozygosity (uHe) = He * (2 * n_Ind / (2 * n_Ind - 1))

Inbreeding coefficient (FIS) = 1 - (mean(Ho) / mean(uHe))

This will give you an ordered barchart of observed heterozygosity (Ho), unbiased expected heterozygosity (uHe) and Inbreeding coefficients (FIS) across populations together with a table of mean observed and expected heterozygosities and FIS by population and their respective standard deviations (SD).

In the output, it is also reported by population: the number of loci used to estimate heterozygosity(nLoc), the number of polymorphic loci (polyLoc), the number of monomorphic loci (monoLoc) and loci with all missing data (all_NALoc).

Barplot for observed heterozygosity, adjusted heterozygosity, and inbreeding coefficients (FIS).

The output report heterozygosity by population.

Now lastly, It seems that these populations of arctic foxes vary in genetic variation and show substantial populations structure. Let’s take a quick look at pairwise fst values to see if we can find anything interesting.

# Now let's take a look at the genetic distance of the populations by 
# calculating Fst values. This might take a few minutes.

gl.fst.pop(gl_nm.gl_3)

Pairwise Fst values.

Questions

Describe the patterns of observed and expected heterozygosity and inbreeding coefficients within each population. What do these measures tell us about the populations?
How would you explain the reduced heterozygosity in the Scandinavian samples, what specific micro-evolutionary processes may be involved?
How does the FIS value reflect the degree of inbreeding within each population? What implications might high FIS values have for the long-term viability of populations?
The inbreeding coefficient appears to be quite high in western Russia, can you explain why this pattern may emerge, what could be causing greater inbreeding in this region?
How do the pairwise FST values between populations reflect the degree of genetic differentiation? What factors could contribute to the observed values?

Summary

In this exercise, we introduced you to population genetic analyses using Next Generation Sequencing (NGS) data. Our aims were to unravel the evolutionary processes and demographic histories that have shaped genetic variation within arctic fox populations in Scandinavia and Russia. Through population genetics analyses, we explored how genetic variants can serve as informative markers to assess genetic relatedness, population structure, and diversity. Leveraging tools like Principle Component Analysis (PCA) and calculating metrics such as observed heterozygosity and inbreeding coefficients (FIS), we gained insights into the genetic structure of arctic fox populations across Fennoscandia and Russia.

By delving into these analyses, you should now grasp how genetic data at the whole-genome level can be harnessed to evaluate genetic diversity and unravel population dynamics. These insights into the evolutionary processes shaping arctic fox populations equip us with valuable tools to guide conservation efforts and preserve biodiversity for future generations.

However, it’s important to note that many analyses using genome-wide data require specialized High-Performance Computing (HPCs) for multithreaded analyses. If this exercise piqued your interest, consider exploring further literature on the Fennoscandian arctic fox population. Topics such as genome erosion leading to fitness consequences in the Helags population Hasselgren et al., 2021 and genome erosion and lack of connectivity in the Fennoscandian arctic fox in relation the Russian population Cockerill et al., 2022 offer deeper insights into the genetic intricacies of these populations.

See below for some plots based on genome-wide data that included samples from the mentioned publications.

Here is an example of how a PCA and heterozygosity look at the genome-level, note how their is much more significant difference between each population. This is likely a result of only taking one chromosome/scaffold for our analyses. The genome can be very heterogeneous and only using one chromosome may lead to biased representation and provide an incomplete picture. In the future I would like to randomly sample across the genome, although getting the file small enough to not take hours loading in R might be challenging!