Overview

Today we will be exploring how we can analyze genetic/genomic data using various R packages. Many of the code examples are taken from the Grunwald tutorial shared in the Canvas module.

Objectives:

Using R to perform population genetic analysis

Traditionally, many calculations could be performed by hand. Today, in the genomic era, we increasingly require computers to help us perform calculations on very large genetic/genomic datasets associated with populations. Nowhere is this more apparent than analysis of human genomic information.

Fortunately, the R environment has a very large user base involved in population genetics research. Today, we will use R to calculate various population genetics parameters and perform principle component analysis on data.

Below are some packages we will use to day. Go ahead and load them.

library(poppr)
library(adegenet)
library(mmod)

Load our genetic data

Often, we need to consolidate data from varied sources. Concerning genomic data, the most popular format is the Variant Call Format (or VCF). There are many R packages to help you perform these tasks such as vcfR. For the sake of brevity, today, I have generated an R data object in the GenInd format that is used by a variety of R population genetics packages. Load that file below.

load("humans.RData") # load the human genome population variant data from Baxevanis et al.
#load("/opt/BIOL5436/humans.RData")

humans # let's look at what this object is

Locus summary statistics

When starting to work with any dataset, it is informatic to review determine some basic understanding of the underlying diversity in alleles. We can use this information to make decisions about how useful certain genetic loci might be for downstream applications such as generating phylogenetic relationships.

library("poppr")     # Make sure poppr is loaded if you haven't done so already.
library("magrittr")  # We will also use magrittr for part of this chapter
data("Pinf")         # P. infestans data set from Mexico and South America
locus_table(Pinf)

As we can see, Pi33 has a low number of alleles in this particular example population. With that in mind, we may conclude that is not a good locus for certain types of analysis.

How about our human dataset?

locus_table(humans)

This particular dataset is heavily pared down and designed to allow us to do expedited calculations we will do later. So, since there is very little genetic diversity at each locus we can conclude that this data may not be extremely useful for certain types of calculations since this overall population is not very genetically diverse.

Population structure

We often want to understand how genetically distant populations and individuals are from each other, respectively.

Gst (a genomic form of Fst)

There are some simple functions available in R to calculate Gst (an analog of Fst). Recall that Fst (or Gst) are looking to understand the amount of genetic variation within a subpopulation compared to a broader population. Gst = 0 implies that the variation observed in the subpopulation is indistiguishable from other groups whereas Gst = 1 implies that the variation in subpopulation is distinct from other populations (i.e. it is differentiating).

Here’s an example…

library("mmod")
data("nancycats")
nancycats
Gst_Nei(nancycats)

How about for our subset of humans?

Gst_Nei(humans)

On a per locus basis, we see there is variability in Gst values. As we see, some values are quite low (even negative?). These values indicate there is no differences in variation at those loci or there may be low heterozygosity. Overall, we see a Gst value on par with what we expect - around 11-15%.

Genetic distance

Sometimes, we need to take an individual-level perspective to undertand population structure. This is particularly important when we know little about any underlying distinct populations in a group of individuals. There are variety of measures to calculate genetic distance. Here, we’ll focus on Nei’s Distance - a commonly used metric.

Let’s see an example of calculating distance and then check out our humans.

library("poppr")
library("ape") # To visualize the tree using the "nj" function
library("magrittr")
data(microbov)
set.seed(10)
ten_samples <- sample(nInd(microbov), 10)
mic10       <- microbov[ten_samples]
(micdist    <- provesti.dist(mic10))

We now have distances… Where have we used distances to infer relationships before? Phylogenetic trees!

While not a formal phylogeny, we can use trees to help build intuition about possible underlying population structure. Let’s use a stock dataset of bovine data from Africa and France.

theTree <- micdist %>%
  nj() %>%    # calculate neighbor-joining tree
  ladderize() # organize branches by clade
plot(theTree)
add.scale.bar(length = 0.05) # add a scale bar showing 5% difference.

Question

What can we conclude about the populations of African and French cattle?


Let’s look at our humans.

humDist <- nei.dist(humans)

humTree <- humDist %>%
  nj() %>%
  ladderize()

plot(humTree)

Clustering

Particularly with large datasets, use of clustering techniques can help provide some insights into population structure. Clustering is a statistical approach that often employs Principle Component Analysis (PCA). Clustering is distinct from overall distance-based methods in that it allows each locus to contribute to the overall determination of similarity and differences between individuals. However, clustering methods like k-means clustering do use per locus genetic distances as the basis for PCA. This ultimately gives rise to distinct clusters of individuals if there is sufficient genetic variation in the population and similarity between individuals. That said, this approach is not good for very clonal populations.

Let’s see an example. We’ll use a fungal pathogen dataset to illustrate some concepts.

library("poppr")
data("Pinf")
Pinf

For best results, copy and past the text below into the console window.

MX <- popsub(Pinf, "North America") # here we're extracting only the samples from North America in our dataset
MXclust <- find.clusters(MX)

We can see that we can form clusters using this analysis. The utlity and meaning of these clusters are up to the investigator, however. Remember, these are just statistical outcomes based on selected parameters.

How about our humans?!

Paste the following in the console, below.

find.clusters(humans)

Discriminant Analysis of Principle Components

Another method to infer population structure is Discriminant Analysis of Principle Components. Like clustering, it is a method reliant on PCA. DAPC is a method that looks to find groups/clusters of genetically related individuals. Unlike hierarchical clustering methods like k-means (that we explored above) DAPC is able to be used on highly clonal or partially clonal populations.

Let’s explore an example to see how this works. This example examines H3N2 influenza data across epidemic years. You might need to run this twice to get plot to show up.

# DAPC requires the adegenet package. Let's load this package:
library("adegenet")
data(H3N2) # load the H3N2 influenza data. Type ?H3N2 for more info.
pop(H3N2) <- H3N2$other$epid
dapc.H3N2 <- dapc(H3N2, var.contrib = TRUE, scale = FALSE, n.pca = 30, n.da = nPop(H3N2) - 1)
scatter(dapc.H3N2, cell = 0, pch = 18:23, cstar = 0, mstree = TRUE, lwd = 2, lty = 2)

Now, how about our humans?!

DAPC.humans <- dapc(humans, var.contrib = TRUE, scale = FALSE, n.pca = 20, n.da = nPop(H3N2) - 1)
scatter(DAPC.humans)

Question

Compare these results to Figure 15.1 in Baxevanis. Do we see any similarities/differences?


---
title: "Module 12 Exercises"
output:
  html_notebook: default
  word_document: default
---

# Overview

Today we will be exploring how we can analyze genetic/genomic data using various R packages. Many of the code examples are taken from the Grunwald tutorial shared in the Canvas module.

Objectives:

- Develop familiarity with popular R population genetics analysis packages.

- Analyze some data from the 1000 Genomes Project.


# Using R to perform population genetic analysis

Traditionally, many calculations could be performed by hand. Today, in the genomic era, we increasingly require computers to help us perform calculations on very large genetic/genomic datasets associated with populations. Nowhere is this more apparent than analysis of human genomic information.

Fortunately, the R environment has a very large user base involved in population genetics research. Today, we will use R to calculate various population genetics parameters and perform principle component analysis on data.

Below are some packages we will use to day. Go ahead and load them.

```{r}
library(poppr)
library(adegenet)
library(mmod)
```


# Load our genetic data

Often, we need to consolidate data from varied sources. Concerning genomic data, the most popular format is the Variant Call Format (or VCF). There are many R packages to help you perform these tasks such as `vcfR`. For the sake of brevity, today, I have generated an R data object in the GenInd format that is used by a variety of R population genetics packages. Load that file below.

```{r}
load("humans.RData") # load the human genome population variant data from Baxevanis et al.
#load("/opt/BIOL5436/humans.RData")

humans # let's look at what this object is
```

# Locus summary statistics

When starting to work with any dataset, it is informatic to review determine some basic understanding of the underlying diversity in alleles. We can use this information to make decisions about how useful certain genetic loci might be for downstream applications such as generating phylogenetic relationships.

```{r}
library("poppr")     # Make sure poppr is loaded if you haven't done so already.
library("magrittr")  # We will also use magrittr for part of this chapter
data("Pinf")         # P. infestans data set from Mexico and South America
locus_table(Pinf)
```

As we can see, Pi33 has a low number of alleles in this particular example population. With that in mind, we may conclude that is not a good locus for certain types of analysis.

How about our human dataset?

```{r}
locus_table(humans)
```

This particular dataset is heavily pared down and designed to allow us to do expedited calculations we will do later. So, since there is very little genetic diversity at each locus we can conclude that this data may not be extremely useful for certain types of calculations since this overall population is not very genetically diverse.

# Population structure

We often want to understand how genetically distant populations and individuals are from each other, respectively. 

## Gst (a genomic form of Fst)

There are some simple functions available in R to calculate Gst (an analog of Fst). Recall that Fst (or Gst) are looking to understand the amount of genetic variation within a subpopulation compared to a broader population. Gst = 0 implies that the variation observed in the subpopulation is indistiguishable from other groups whereas Gst = 1 implies that the variation in subpopulation is distinct from other populations (i.e. it is differentiating). 

Here's an example...

```{r}
library("mmod")
data("nancycats")
nancycats
```
```{r}
Gst_Nei(nancycats)
```

How about for our subset of humans?

```{r}
Gst_Nei(humans)
```

On a per locus basis, we see there is variability in Gst values. As we see, some values are quite low (even negative?). These values indicate there is no differences in variation at those loci or there may be low heterozygosity. Overall, we see a Gst value on par with what we expect - around 11-15%.

## Genetic distance

Sometimes, we need to take an individual-level perspective to undertand population structure. This is particularly important when we know little about any underlying distinct populations in a group of individuals. There are variety of measures to calculate genetic distance. Here, we'll focus on Nei's Distance - a commonly used metric.

Let's see an example of calculating distance and then check out our humans.

```{r}
library("poppr")
library("ape") # To visualize the tree using the "nj" function
library("magrittr")
data(microbov)
set.seed(10)
ten_samples <- sample(nInd(microbov), 10)
mic10       <- microbov[ten_samples]
(micdist    <- provesti.dist(mic10))
```

We now have distances... Where have we used distances to infer relationships before? Phylogenetic trees!

While not a formal phylogeny, we can use trees to help build intuition about possible underlying population structure. Let's use a stock dataset of bovine data from Africa and France.

```{r}
theTree <- micdist %>%
  nj() %>%    # calculate neighbor-joining tree
  ladderize() # organize branches by clade
plot(theTree)
add.scale.bar(length = 0.05) # add a scale bar showing 5% difference.
```

---

**Question**

What can we conclude about the populations of African and French cattle?

---

Let's look at our humans.

```{r}
humDist <- nei.dist(humans)

humTree <- humDist %>%
  nj() %>%
  ladderize()

plot(humTree)

```

## Clustering

Particularly with large datasets, use of clustering techniques can help provide some insights into population structure. Clustering is a statistical approach that often employs Principle Component Analysis (PCA). Clustering is distinct from overall distance-based methods in that it allows each locus to contribute to the overall determination of similarity and differences between individuals. However, clustering methods like k-means clustering do use per locus genetic distances as the basis for PCA. This ultimately gives rise to distinct clusters of individuals if there is sufficient genetic variation in the population and similarity between individuals. That said, this approach is not good for very clonal populations.

Let's see an example. We'll use a fungal pathogen dataset to illustrate some concepts.

```{r}
library("poppr")
data("Pinf")
Pinf
```

For best results, copy and past the text below into the console window.
```{}
MX <- popsub(Pinf, "North America") # here we're extracting only the samples from North America in our dataset
MXclust <- find.clusters(MX)
```

We can see that we can form clusters using this analysis. The utlity and meaning of these clusters are up to the investigator, however. Remember, these are just statistical outcomes based on selected parameters.

How about our humans?!

Paste the following in the console, below.
```{}
find.clusters(humans)
```

## Discriminant Analysis of Principle Components

Another method to infer population structure is Discriminant Analysis of Principle Components. Like clustering, it is a method reliant on PCA. DAPC is a method that looks to find groups/clusters of genetically related individuals. Unlike hierarchical clustering methods like k-means (that we explored above) DAPC is able to be used on highly clonal or partially clonal populations.

Let's explore an example to see how this works. This example examines H3N2 influenza data across epidemic years. You might need to run this twice to get plot to show up.

```{r}
# DAPC requires the adegenet package. Let's load this package:
library("adegenet")
data(H3N2) # load the H3N2 influenza data. Type ?H3N2 for more info.
pop(H3N2) <- H3N2$other$epid
dapc.H3N2 <- dapc(H3N2, var.contrib = TRUE, scale = FALSE, n.pca = 30, n.da = nPop(H3N2) - 1)
scatter(dapc.H3N2, cell = 0, pch = 18:23, cstar = 0, mstree = TRUE, lwd = 2, lty = 2)
```

Now, how about our humans?!

```{r}
DAPC.humans <- dapc(humans, var.contrib = TRUE, scale = FALSE, n.pca = 20, n.da = nPop(H3N2) - 1)
scatter(DAPC.humans)
```

---

**Question**

Compare these results to Figure 15.1 in Baxevanis. Do we see any similarities/differences?

---