We will reduce our 78k SNP dataset using LD-based SNP pruning. Then we will confirm the results of our sNMF analysis using a PCA,followed by two identity by descent relatedness analyses (MoM and MLE)

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(vcfR)
library(adegenet)
library(adegraphics)
library(pegas)
library(StAMPP)
library(lattice)
library(gplots)
library(ape)
library(ggmap) 
library(LEA)
#The new packages we'll be using for these analyses (mainly gdsfmt and SNPRelate) need to be installed from Bioconductor Lite
#source("https://bioconductor.org/biocLite.R", echo=FALSE, verbose = FALSE)
#biocLite()
#biocLite(c("GenomicFeatures", "AnnotationDbi"))
#biocLite("SNPRelate")
library("devtools")
library(gdsfmt)
library(SNPRelate)
library(calibrate)

Preparing data

gdsfmt package provides the genome data structure (GDS) file format for array-oriented bioinformatic data, which is a container for annotation and SNP genotypes.

#Clear cache if necessary using
#closeAllConnections()
#snpgdsClose(genofile) #in case you need to close or re-write a gds file
#closefn.gds(test1.gds)

#Load up our vcf file (SNP data for 24 red squirrels)
vcf.fn <- "/Users/tanyalama/vcftools/stringent_revised.vcf.recode.vcf"

#Reformat VCF --> SNP GDS (genomic data structure) Format
snpgdsVCF2GDS(vcf.fn, "test.gds", method="biallelic.only")
## VCF Format ==> SNP GDS Format
## Method: exacting biallelic SNPs
## Number of samples: 23
## Parsing "/Users/tanyalama/vcftools/stringent_revised.vcf.recode.vcf" ...
##  import 73055 variants.
## + genotype   { Bit2 23x73055, 410.2K } *
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'test.gds' (872.9K)
##     # of fragments: 53
##     save to 'test.gds.tmp'
##     rename 'test.gds.tmp' (872.5K, reduced: 396B)
##     # of fragments: 20
#Summary
snpgdsSummary("test.gds")
## The file name: /Users/tanyalama/Box Sync/Squirrel/test.gds 
## The total number of samples: 23 
## The total number of SNPs: 73055 
## SNP genotypes are stored in SNP-major mode (Sample X SNP).
genofile <- snpgdsOpen("/Users/tanyalama/Box Sync/Squirrel/test.gds", readonly= FALSE)
#Add population information to your gds genofile 

add.gdsn(genofile, "pop_code", c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)) #all identified to 1 population for starters

#get sample id information
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))

Data Analysis

gdsfmt and SNPRelate accelerate two key computations in GWAS (genome wide association studies): principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures.

LD-based SNP pruning

It is suggested to use a pruned set of SNPs which are in approximate linkage equilibrium with each other to avoid the strong influence of SNP clusters in principal component analysis and relatedness analysis.

set.seed(1) #should be 1000, reduced for rmarkdown
#Try different LD thresholds for sensitivity analysis. #Need to include autosome.only = FALSE clause to run.

snpset <- snpgdsLDpruning(genofile, autosome.only = FALSE,remove.monosnp = TRUE, ld.threshold=0.1, verbose = FALSE) #38,384/72,744 markers are selected in total with .1 ld.

#Create an object to hold our ~38k selected SNPs
snpset.id <- unlist(snpset)

Principal Component Analysis (PCA)

The functions in SNPRelate for PCA include calculating the genetic covariance matrix from genotypes, computing the correlation coefficients between sample loadings and genotypes for each SNP, calculating SNP eigenvectors (loadings), and estimating the sample loadings of a new dataset from specified SNP eigenvectors.

#Run PCA using the ldpruned set of 38k SNPs (snpset.id) instead of the full set. 
#get sample id information
genofile
## File: /Users/tanyalama/Box Sync/Squirrel/test.gds (872.5K)
## +    [  ] *
## |--+ sample.id   { Str8 23 ZIP_ra(53.4%), 101B }
## |--+ snp.id   { Int32 73055 ZIP_ra(34.6%), 98.7K }
## |--+ snp.rs.id   { Str8 73055 ZIP_ra(0.14%), 111B }
## |--+ snp.position   { Int32 73055 ZIP_ra(27.5%), 78.5K }
## |--+ snp.chromosome   { Str8 73055 ZIP_ra(5.30%), 64.2K }
## |--+ snp.allele   { Str8 73055 ZIP_ra(14.2%), 40.6K }
## |--+ genotype   { Bit2 23x73055, 410.2K } *
## |--+ snp.annot   [  ]
## |  |--+ qual   { Float32 73055 ZIP_ra(62.1%), 177.2K }
## |  \--+ filter   { Str8 73055 ZIP_ra(0.23%), 831B }
## \--+ pop_code   { Float64 24, 192B }
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id")) #run

pca <- snpgdsPCA(genofile, sample.id = NULL, snp.id=snpset.id, autosome.only = FALSE, num.thread=2, verbose= TRUE)
## Principal Component Analysis (PCA) on genotypes:
## Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## Working space: 23 samples, 38,794 SNPs
##     using 2 (CPU) cores
## PCA:    the sum of all selected genotypes (0,1,2) = 1524278
## CPU capabilities: Double-Precision SSE2
## Thu Jul 26 14:08:05 2018    (internal increment: 21192)
## 
[..................................................]  0%, ETC: ---    
[==================================================] 100%, completed in 0s
## Thu Jul 26 14:08:05 2018    Begin (eigenvalues and eigenvectors)
## Thu Jul 26 14:08:05 2018    Done.
#if autosome.only = TRUE, use autosomal (non-sex chromosome) SNPs only; if it is a numeric or character value, keep SNPs according to the specified chromosome

#tried with autosome.only = TRUE and error read "There is no SNP!"

#The following code shows how to calculate the percent of variation that is accounted for by the top principal components. It is clear to see the first two eigenvectors hold the largest percentage of variance among the population, although the total variance accounted for is still less the one-quarter of the total.

# variance proportion (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
## [1] 5.84 5.62 5.42 5.29 5.24 5.08
#get population information
pop_code<- read.gdsn(index.gdsn(genofile, "pop_code"))
head(cbind(sample.id, pop_code))
##      sample.id pop_code
## [1,] "UMA2_1"  "1"     
## [2,] "UMA35_1" "1"     
## [3,] "RS32_1"  "1"     
## [4,] "UMA20_1" "1"     
## [5,] "UMA39_1" "1"     
## [6,] "UMA19_1" "1"
#make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
    pop = factor(pop_code)[match(pca$sample.id,sample.id)],
    EV1 = pca$eigenvect[,1],    # the first eigenvector
    EV2 = pca$eigenvect[,2],    # the second eigenvector
    stringsAsFactors = FALSE)
tab #view the dataframe
##    sample.id pop          EV1          EV2
## 1     UMA2_1   1 -0.063676681  0.084596181
## 2    UMA35_1   1 -0.005127551  0.069118871
## 3     RS32_1   1 -0.030475294  0.037277902
## 4    UMA20_1   1 -0.058753040 -0.020379122
## 5    UMA39_1   1 -0.053058858  0.095025643
## 6    UMA19_1   1 -0.063469156  0.079980237
## 7    RS104_1   1 -0.164887290  0.127483231
## 8    UMA13_1   1 -0.043263337 -0.004482035
## 9     UMA5_1   1 -0.039427354  0.010210736
## 10    UMA3_1   1 -0.096313791 -0.675158601
## 11   UMA26_1   1 -0.040642768 -0.001002618
## 12    RS33_1   1 -0.067450454  0.103904687
## 13   UMA23_1   1 -0.083062793  0.026626533
## 14   UMA40_1   1  0.659397709 -0.005952129
## 15    UMA8_1   1 -0.047733814  0.106796905
## 16    UMA6_1   1 -0.005187611  0.057058837
## 17    UMA9_1   1  0.015748784  0.057635036
## 18   UMA38_1   1  0.660806103 -0.005901249
## 19   UMA17_1   1 -0.192091152  0.250090104
## 20   UMA41_1   1 -0.046733023  0.058452053
## 21   UMA37_1   1 -0.069636955  0.077839051
## 22   UMA21_1   1 -0.090893932 -0.623080340
## 23   UMA42_1   1 -0.074067744  0.093860089
plot(tab$EV2, tab$EV1, xlab="eigenvector 2", ylab="eigenvector 1") #three distinct groups

##Zoom In

Estimating IBD: Maximum Likelihood Estimation

Identical by descent (IBD) describes a matching segment of DNA shared by two or more individuals that has been inherited from a common ancestor without any intervening recombination. Alleles that are IBD have been interested by a recent common ancestor (and indicate recent gene flow).

# Estimate IBD coefficients
set.seed(1) #should be 1000
ibd <- snpgdsIBDMLE(genofile, sample.id = NULL, snp.id=snpset.id,autosome.only = FALSE, remove.monosnp= TRUE, maf=0.05, missing.rate=0.05, kinship = TRUE, num.thread=2, verbose = TRUE)
## Identity-By-Descent analysis (MLE) on genotypes:
## Excluding 23,659 SNPs (monomorphic: TRUE, MAF: 0.05, missing rate: 0.05)
## Working space: 23 samples, 15,135 SNPs
##     using 2 (CPU) cores
## MLE IBD:    the sum of all selected genotypes (0,1,2) = 507583
## MLE IBD: Thu Jul 26 14:08:06 2018    0%
## MLE IBD: Thu Jul 26 14:08:11 2018    100%
# Make a dataframe
ibd.coeff <- snpgdsIBDSelection(ibd)