We have been having issues with a lack of uniformity on the GBS data using the Illumina HiSeq 2000 technology, where samples have been sequenced to a maximum depth of 10x per lane and others have a maximum of 1x coverage. As a consequence of this, our objective for using GBS for population analysis using SNP’s with high confidence hasn’t been reached. We need at least 4x coverage in >70% of the sequenced nucleotides to be able to obtain a good number of variants for population genomics.
According to Brett Tyler, the lack of uniformity is something expected in an illumina analysis. He reccomends us to change restriction enzymes to test for better uniformity on the reads. We will be using 2 enzymes instead of 1, reducing even more the complexity of the genome. These enzymes are PstI and Msp1 (already available on the CGRB)
PstI, is a Type II restriction enzyme from Providencia stuartii. PstI recognition and cut site are as follows:
Msp1, is a Type II restriction enzyme. It was used for methylation analyzes as it was able to cut hydroxymethylated DNA. Msp1 restriction and cut site as follows:
The MspI adapter is designed as a Y-adapter. During amplification the reverse primer is identical to the Y tail of the common adapter and can only anneal if the complimentary strand has first been synthesized from the other end. This prevents the amplification of the MspI-MspI fragments and adapter dimers as fragments that have Y adapters (P2 adapters) on both ends will not be amplified via PCR.
SimRADSimRAD (Lepais and Weir, 2015) is an R package that performs in silico restriction enzyme digests and fragment size selection as implemented in most restriction site associated DNA polymorphism and genotyping by sequencing methods. In silico digestion is performed on a reference genome or on a randomly generated DNA sequence when no reference genome sequence is available. SimRAD accurately predicts the number of loci under alternative protocols when a reference genome sequence is available for the targeted species (or a close relative) but may be unreliable when no reference genome is available.
/Users/tabimaj/Documents/P_rubi/Genomes/Genomes/rubi/Pr4671.faPst1 and Msp1#Reference Sequence
rubi.c <- ref.DNAseq("/Users/tabimaj/Documents/P_rubi/Genomes/Genomes/rubi/Pr4671.fa",subselect.contigs = F)
#Restriction Enzyme 1
#PstI
cs_5p1 <- "CTGCA"
cs_3p1 <- "G"
#Restriction Enzyme 2
#MspI
cs_5p2 <- "C"
cs_3p2 <- "CGG"
# double digestion:
rubi.dig <- insilico.digest(rubi.c, cs_5p1, cs_3p1, cs_5p2, cs_3p2, verbose=TRUE)
## Number of restriction sites for the first enzyme: 49371
## Number of restriction sites for the second enzyme: 239378
## Number of type AB and BA fragments:75852
## Number of type AA fragments:11445
## Number of type BB fragments:201451
#selection of AB type fragments
rubi.selected <- adapt.select(rubi.dig, type="AB+BA", cs_5p1, cs_3p1, cs_5p2, cs_3p2)
We have 75852 regions with Pse1-Msp1 flanking regions. Lets do size selection for anything greater than 100bp
rubi.size.selected <- size.select(rubi.selected,min.size = 99,max.size = 3894,graph = T)
## 51128 fragments between 99 and 3894 bp
If each of the reads have 100bp maximum, in theory we will have 100 * Number of regions = Total nucleotides, but we have regions greater than 100bp, that can lead to more than one 100 bp reads per cluster. In this case, regions greater than 200bp will have a maximum of two different 100bp reads: 200bp per region. For regions greater than 100 and lesser than 200bp, the length will be 100bp + the rest of the region:
bp_1 <- sum(rubi.size.selected@ranges@width == 100) * 100
bp_2 <- sum(rubi.size.selected@ranges@width >= 200) * 200
bp_1.5 <- sum(rubi.size.selected@ranges@width > 100 & rubi.size.selected@ranges@width < 200) + sum(abs(100 - rubi.size.selected@ranges@width[rubi.size.selected@ranges@width > 100 & rubi.size.selected@ranges@width < 200]))
total_nucs <- bp_1 + bp_2 + bp_1.5
If we have a genome with a total size of 74,863,594 base pairs, we divide the number of total nucleotides sequenced by using this method (7,795,229), we will have 10.4125765% of the genome represented.
| Enzyme | Number of fragments | Number of nucleotides | Percentage of genome sequenced |
|---|---|---|---|
ApeKI |
427,000 | 42,725,124 | 57.070629 |
Pst1 + Msp1 |
51,128 | 7,795,229 | 10.4125765 |