Studying natural selection with SNPs

Packages

Install missing package

#install.packages("kgp")
#contains metadata from the 1000 genomes project!

Load packages

library(vcfR)

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(kgp)
library(ggplot2)
library(ggpubr)

The package kgp3 has metadata stored in a nice format

head(kgp3)

## # A tibble: 6 × 10
##   fid     id      pid   mid     sex sexf   pop   reg   population         region
##   <chr>   <chr>   <chr> <chr> <int> <fct>  <chr> <chr> <chr>              <chr> 
## 1 HG00096 HG00096 0     0         1 male   GBR   EUR   British in Englan… Europe
## 2 HG00097 HG00097 0     0         2 female GBR   EUR   British in Englan… Europe
## 3 HG00099 HG00099 0     0         2 female GBR   EUR   British in Englan… Europe
## 4 HG00100 HG00100 0     0         2 female GBR   EUR   British in Englan… Europe
## 5 HG00101 HG00101 0     0         1 male   GBR   EUR   British in Englan… Europe
## 6 HG00102 HG00102 0     0         2 female GBR   EUR   British in Englan… Europe

Load data

Download the vcf file for the lesson

vcf.file <- "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
my_vcf <- vcfR::read.vcfR(vcf.file)

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 130
##   header_line: 131
##   variant count: 6877
##   column count: 2513
## 
Meta line 130 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 6877
##   Character matrix gt cols: 2513
##   skip: 0
##   nrows: 6877
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant: 6877
## All variants processed

Get population metadata into a vector

myPops <- factor(kgp3$pop)

Get the data on the SNPs from the raw vcf R object (ignore the syntax)

fixed <- data.frame(my_vcf@fix)

Calculate “Fixation Index”

Best known as “Fst”.

Alternate version: “Gst”

# calculate index of genetic distance (Fixation index)

#Gst which characterize how genetically unique the population is

my_Gst <- genetic_diff(my_vcf,
                       myPops, 
                       method = "nei")

Make a dataframe of results and add to the ID codes for SNPs

# don't worry about all of the syntax
my_df <- data.frame(SNP_i = 1:nrow(my_Gst),
                    Gst = my_Gst$Gst,
                    SNP_ID = fixed[,"ID"])

Look at the output

head(my_df)

##   SNP_i         Gst      SNP_ID
## 1     1 0.004563135 rs199561435
## 2     2 0.010048206   rs3737523
## 3     3 0.009692324 rs189523472
## 4     4 0.015376296  rs74122246
## 5     5 0.004474110 rs145194873
## 6     6 0.004851793 rs547902942

ggpubr::gghistogram(data = my_df, x = "Gst")

## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

## Warning: Removed 26 rows containing non-finite values (stat_bin).

ggpubr::ggscatter(data = my_df, 
                  y = "Gst",
                  x = "SNP_i",
                  color = "Gst",
                  main = "Manhattan Plot")

## Warning: Removed 26 rows containing missing values (geom_point).

Get SNP with maximum Gst

Gst_max   <- max(my_df$Gst, na.rm = T)
Gst_max_i <- which.max(my_df$Gst)


fixed[Gst_max_i,1:4] #rs2814778 is the unique identity of the highest SNP

##      CHROM       POS        ID REF
## 4280     1 159204893 rs2814778   T

fixed[Gst_max_i,"ID"]

## [1] "rs2814778"

Genome Scan

Looking at: https://popgen.uchicago.edu/ggv/?data=%221000genomes%22&chr=1&pos=159174683

It appears that this SNP is almost 100% present in Africa and the Carribean has a large amount of the allele. It is completely absent in Asia and in very little amounts around the rest of the world. Overall, the SNP appears to be unique to Africa and has been transported from there.

SNPedia

Looking at: https://www.snpedia.com/index.php/Rs2814778

This SNP is located within/very close to the DARC gene and is almost perfectly fixed at 100% in African populations while fixed at 0% in pretty much all other populations. This is due to Malaria being in Africa, where it is very advantageous to have this SNP (G:G) thus natural selection drove this SNP to be fixed most likely.