New enzymes: The rationale

We have been having issues with a lack of uniformity on the GBS data using the Illumina HiSeq 2000 technology, where samples have been sequenced to a maximum depth of 10x per lane and others have a maximum of 1x coverage. As a consequence of this, our objective for using GBS for population analysis using SNP’s with high confidence hasn’t been reached. We need at least 4x coverage in >70% of the sequenced nucleotides to be able to obtain a good number of variants for population genomics.

According to Brett Tyler, the lack of uniformity is something expected in an illumina analysis. He reccomends us to change restriction enzymes to test for better uniformity on the reads. We will be using 2 enzymes instead of 1, reducing even more the complexity of the genome. These enzymes are PstI and Msp1 (already available on the CGRB)

The enzymes:

PstI

PstI, is a Type II restriction enzyme from Providencia stuartii. PstI recognition and cut site are as follows:

pst1 restriction site

Msp1

Msp1, is a Type II restriction enzyme. It was used for methylation analyzes as it was able to cut hydroxymethylated DNA. Msp1 restriction and cut site as follows:

Msp1 cut site

Intreaction between the two restriction enzymes.

The MspI adapter is designed as a Y-adapter. During amplification the reverse primer is identical to the Y tail of the common adapter and can only anneal if the complimentary strand has first been synthesized from the other end. This prevents the amplification of the MspI-MspI fragments and adapter dimers as fragments that have Y adapters (P2 adapters) on both ends will not be amplified via PCR.

Extracting the cut-sites coordinates using `SimRAD`

SimRAD (Lepais and Weir, 2015) is an R package that performs in silico restriction enzyme digests and fragment size selection as implemented in most restriction site associated DNA polymorphism and genotyping by sequencing methods. In silico digestion is performed on a reference genome or on a randomly generated DNA sequence when no reference genome sequence is available. SimRAD accurately predicts the number of loci under alternative protocols when a reference genome sequence is available for the targeted species (or a close relative) but may be unreliable when no reference genome is available.

Inputs:

Reference genome of P. rubi:/Users/tabimaj/Documents/P_rubi/Genomes/Genomes/rubi/Pr4671.fa
Enzymes for SimRAD: Pst1 and Msp1

#Reference Sequence
rubi.c <- ref.DNAseq("/Users/tabimaj/Documents/P_rubi/Genomes/Genomes/rubi/Pr4671.fa",subselect.contigs = F)
#Restriction Enzyme 1
 #PstI
    cs_5p1 <- "CTGCA"
    cs_3p1 <- "G"
#Restriction Enzyme 2
 #MspI
    cs_5p2 <- "C"
    cs_3p2 <- "CGG"
# double digestion:
rubi.dig <- insilico.digest(rubi.c, cs_5p1, cs_3p1, cs_5p2, cs_3p2, verbose=TRUE)

## Number of restriction sites for the first enzyme: 49371
## Number of restriction sites for the second enzyme: 239378
## Number of type AB and BA fragments:75852
## Number of type AA fragments:11445
## Number of type BB fragments:201451

#selection of AB type fragments
rubi.selected <- adapt.select(rubi.dig, type="AB+BA", cs_5p1, cs_3p1, cs_5p2, cs_3p2)

We have 75852 regions with Pse1-Msp1 flanking regions. Lets do size selection for anything greater than 100bp

Size selection (100bp ~ Max length)

rubi.size.selected <- size.select(rubi.selected,min.size = 99,max.size = 3894,graph = T)

## 51128 fragments between 99 and 3894 bp

Total nucleotides

If each of the reads have 100bp maximum, in theory we will have 100 * Number of regions = Total nucleotides, but we have regions greater than 100bp, that can lead to more than one 100 bp reads per cluster. In this case, regions greater than 200bp will have a maximum of two different 100bp reads: 200bp per region. For regions greater than 100 and lesser than 200bp, the length will be 100bp + the rest of the region:

bp_1 <- sum(rubi.size.selected@ranges@width == 100) * 100
bp_2 <- sum(rubi.size.selected@ranges@width >= 200) * 200
bp_1.5 <- sum(rubi.size.selected@ranges@width > 100 & rubi.size.selected@ranges@width < 200) + sum(abs(100 - rubi.size.selected@ranges@width[rubi.size.selected@ranges@width > 100 & rubi.size.selected@ranges@width < 200]))
total_nucs <- bp_1 + bp_2 + bp_1.5

If we have a genome with a total size of 74,863,594 base pairs, we divide the number of total nucleotides sequenced by using this method (7,795,229), we will have 10.4125765% of the genome represented.

Comparison with ApeKI

Enzyme	Number of fragments	Number of nucleotides	Percentage of genome sequenced
`ApeKI`	427,000	42,725,124	57.070629
`Pst1 + Msp1`	51,128	7,795,229	10.4125765

Pst1/Mst1 in silico assays for P. rubi/P. fragariae

Javier F. Tabima

April 28, 2015

New enzymes: The rationale

The enzymes:

PstI

Msp1

Intreaction between the two restriction enzymes.

Extracting the cut-sites coordinates using `SimRAD`

Size selection (100bp ~ Max length)

Total nucleotides

Comparison with ApeKI

Pst1/Mst1 in silico assays for P. rubi/P. fragariae

Javier F. Tabima

April 28, 2015

New enzymes: The rationale

The enzymes:

PstI

Msp1

Intreaction between the two restriction enzymes.

Extracting the cut-sites coordinates using SimRAD

Size selection (100bp ~ Max length)

Total nucleotides

Comparison with ApeKI

Extracting the cut-sites coordinates using `SimRAD`