Two input files are required:

A VCF of candidate biallelic SNPs in a set of individuals. Multiallelic SNPs will cause problems.
A CSV file with a column for every individual in the VCF (no more and no less, otherwise there will be problems), and a column for the genotype of interest of each individual.

Necessary Packages:
library(tidyverse)
library(reshape2)
library(limma)

Before using this script in R, open your VCF in excel and remove all the contig information so that the first row of the spreadsheet is “#CHROM, POS, ID, REF, ALT, etc,” then save it as a .csv file. It should look like this:

Quickly set up your data with the following code.

library(limma)

matrix <- read.csv("Name_of_File.csv") #Open your VCF turned CSV

matrix <- matrix %>%
  column_to_rownames("POS") #Turn the location of each SNP into its rowname. This is so you
#can identify which loci contain SNPs correlated to your genotype at the end.

matrix2 <- matrix[-c(1:8)] #Remove everything except individuals from the dataframe so we can
#make it into a numeric count matrix:

matrix2[] <- lapply(matrix2[], function(x) substr(x, 1, 3)) #This function removes everything
#except the allelic frequency of each SNP from the dataframe by deleting everything except
#the first three characters.

matrix2[matrix2== "1/1"] <- "2"
matrix2[matrix2== "0/1"] <- "1"
matrix2[matrix2== "0/0"] <- "0"
matrix2[matrix2== "./."] <- NA #Replace allelic frequency values with numeric genotype values

matrix3 <- data.matrix(matrix2) #Convert your dataframe into a numeric matrix

Now it’s time to open your CSV file that has information about the genotype/haplotype/phenotype/whatever you’re trying to design a CAPS primer for for each individual. This analysis will only let you look at one genotype at a time. Here’s an example of what this file might look like:

Load sample information

meta <- read.csv("Name_of_File.csv") #Open the CSV
colnames(meta)[2]<-"genotype" #Renaming the column for genotype to "genotype"
colnames(meta)[1]<-"sample" #Renaming the column for sample to "sample"
meta <- meta %>% column_to_rownames("sample") #Make sample (e.g., "ANN0802, ANN0803, etc" 
#the rownames)

Next: Find the positions with the greatest correlation to the genotype of your samples by using the following code.

designmat<-model.matrix(~genotype,meta) # Make you genotype data into a "design matrix," 
#a matrix that encodes characteristic data numerically

fit<-(lmFit(matrix3, designmat) %>% eBayes()) #Fit a linear model to the SNP matrix against
#the deisgn matrix using the sample names to tie them together

top <- topTable(fit, coef=1, number = 15) #Make a table of the 15 most correlated SNPs to
#your design matrix. Increase or decrease the number of hits by changing "number = X" in 
#this command.

print(top) #View the fruits of your labour

This is an example of what will be outputed in your console (if using rstudio). The leftmost column of the table printed in your console has the position of your top SNPs in the genome.

Congrats!
You can now go back and look at your raw file to identify where each candidate SNP is and which base has been mutated. Make sure to double check that candidates make sense because this script has some flaws that may find false candidates every now and then.

This script is for finding biallelic SNPs in a VCF that are highly correlated to a genomic feature of interest.

Kathryn Lande

4/10/2019