SNP to genepop file conversion using R

Kevin Keenan, 2014

Introduction

This document demonstrates the use of the function snp2gp from the diveRsity package in R.

Getting started

First, we need to ensure that the latest version of diveRsity is installed. It can be downloaded and installed from GitHub as follows:

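# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools", repos = "http://cran.rstudio.com", dependencies = TRUE)
}
library("devtools")
# Install the development version of diveRsity from GitHub ("user/repo" form)
install_github("kkeenan02/diveRsity")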

Input format

The snp2gp function takes either a data frame (preloaded into the R workspace) or a tab-delimited input file of the following general format, with one row per SNP and one column per individual:

SNP_ID  pop001_1    pop001_2    pop001_3    pop001_4    pop001_5    pop002_1    pop002_2    pop002_3    pop002_4    pop002_5    pop003_1    pop003_2    pop003_3    pop003_4    pop003_5
SNP1    TC  TC  TC  TC  TC  TT  TC  TC  TC  CC  TT  TT  TC  CC  TC
SNP2    TC  TC  TC  TC  TC  TC  TC  TC  TC  TT  CC  CC  TC  CC  TC
SNP3    TA  TA  TA  TA  AA  TA  TA  TT  TA  TA  AA  AA  TA  AA  TA
SNP4    AG  AG  AG  AG  AG  AG  GG  GG  AG  AG  AA  AA  GG  AA  AG
SNP5    TC  TC  TC  TC  TC  TC  TT  TT  TC  TC  CC  CC  TT  CC  TC
SNP6    AG  AG  AG  AG  AG  AG  AA  AA  AG  AG  GG  GG  AA  GG  AG
SNP7    CC  CC  CC  CC  AC  CC  AC  CC  CC  CC  CC  CC  CC  CC  CC
SNP8    --  TC  TC  TC  TC  TC  TC  TC  TC  TC  TT  TT  CC  TT  TC
SNP9    CC  TC  TC  TC  TC  TC  CC  TC  CC  CC  CC  CC  TC  CC  TC
SNP10   TC  TC  TC  TC  CC  CC  CC  CC  CC  CC  CC  CC  TC  CC  TC
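As a minimal sketch of how a file in this layout could be produced from within R, the following builds a small data frame and writes it as tab-delimited text using base R only. The toy genotypes, the object name snps and the temporary output path are assumptions made for illustration; they are not part of the diveRsity package.

```r
# Hypothetical toy data: 3 SNPs, 2 populations, 2 individuals each.
# The first column holds SNP names; every other column is one individual.
snps <- data.frame(
  SNP_ID   = c("SNP1", "SNP2", "SNP3"),
  pop001_1 = c("TC", "TC", "TA"),
  pop001_2 = c("TC", "--", "TA"),   # "--" marks a missing genotype
  pop002_1 = c("TT", "TC", "TA"),
  pop002_2 = c("CC", "TT", "AA"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row, no quotes and no row names.
out_file <- file.path(tempdir(), "snp_file.txt")
write.table(snps, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm the layout survived the round trip.
check <- read.table(out_file, header = TRUE, sep = "\t",
                    colClasses = "character")
print(check)
```

Reading every column as character avoids R silently converting genotype strings to another type.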

An example SNP data set can be obtained by typing the following into the R console:

library(diveRsity)
data(SNPs)

Running the function

To convert the above file, named “snp_file.txt”, we would execute the following command:

snp2gp(infile = "snp_file.txt", prefix_length = 6)

The argument prefix_length specifies the number of characters at the start of each individual's name that identify its population sample. As we can see from our file above, the population prefixes (pop001, pop002, pop003) are 6 characters long, so this number is 6 for our data.
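To make the role of prefix_length concrete, the sketch below shows how a fixed-length prefix can group individual names into population samples using base R. It illustrates the idea only and is not taken from the snp2gp source code; the vector ind_names is a hypothetical subset of the column names shown earlier.

```r
# Hypothetical individual names, following the naming scheme of the example file.
ind_names <- c("pop001_1", "pop001_2", "pop002_1", "pop002_2", "pop003_1")

# Number of leading characters shared by all individuals of one population.
prefix_length <- 6

# Extract the population label from each individual name.
pop_labels <- substr(ind_names, 1, prefix_length)
print(unique(pop_labels))

# Group individuals by population label.
pops <- split(ind_names, pop_labels)
print(pops)
```

If prefix_length were set too short (for example 3, giving "pop" for every individual), all individuals would fall into a single group, so it is worth checking the prefixes before converting.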

The function will write a genepop file to the same directory as infile. This file will look like this:

sample1-converted   
SNP1,   SNP2,   SNP3,   SNP4,   SNP5,   SNP6,   SNP7,   SNP8,   SNP9,   SNP10           
pop 
pop001_1 ,  0402    0402    0401    0103    0402    0103    0202    0000    0202    0402    
pop001_2 ,  0402    0402    0401    0103    0402    0103    0202    0402    0402    0402    
pop001_3 ,  0402    0402    0401    0103    0402    0103    0202    0402    0402    0402    
pop001_4 ,  0402    0402    0401    0103    0402    0103    0202    0402    0402    0402    
pop001_5 ,  0402    0402    0101    0103    0402    0103    0102    0402    0402    0202    
pop 
pop002_1 ,  0404    0402    0401    0103    0402    0103    0202    0402    0402    0202    
pop002_2 ,  0402    0402    0401    0303    0404    0101    0102    0402    0202    0202    
pop002_3 ,  0402    0402    0404    0303    0404    0101    0202    0402    0402    0202    
pop002_4 ,  0402    0402    0401    0103    0402    0103    0202    0402    0202    0202    
pop002_5 ,  0202    0404    0401    0103    0402    0103    0202    0402    0202    0202    
pop 
pop003_1 ,  0404    0202    0101    0101    0202    0303    0202    0404    0202    0202    
pop003_2 ,  0404    0202    0101    0101    0202    0303    0202    0404    0202    0202    
pop003_3 ,  0402    0402    0401    0303    0404    0101    0202    0202    0402    0402    
pop003_4 ,  0202    0202    0101    0101    0202    0303    0202    0404    0202    0202    
pop003_5 ,  0402    0402    0401    0103    0402    0103    0202    0402    0402    0402    
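Comparing the input and output above, each two-letter genotype appears to be recoded as a four-digit string with A = 01, C = 02, G = 03 and T = 04, and missing genotypes (--) written as 0000. The following sketch reproduces that mapping in base R. The mapping is inferred from the example output, and encode_genotype is a hypothetical helper written for illustration, not a function of the diveRsity package.

```r
# Allele codes inferred from the example output above (an assumption).
allele_codes <- c(A = "01", C = "02", G = "03", T = "04")

# Hypothetical helper: convert one two-letter genotype to a four-digit code.
encode_genotype <- function(gt) {
  alleles <- strsplit(gt, "")[[1]]
  # Anything that is not exactly two recognised bases is treated as missing.
  if (length(alleles) != 2 || !all(alleles %in% names(allele_codes))) {
    return("0000")
  }
  # Keep the allele order as given in the input genotype.
  paste0(allele_codes[alleles], collapse = "")
}

genotypes <- c("TC", "TA", "AG", "CC", "--")
codes <- vapply(genotypes, encode_genotype, character(1))
print(codes)
```

For instance, the SNP3 genotype TA of pop001_1 corresponds to 0401 in the converted file, and the missing SNP8 genotype of the same individual corresponds to 0000.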

This file can be used for downstream analysis with other functions of the diveRsity package, or with other software such as GENEPOP or SMOGD.