HiSeq 3000 lanes (2x150bp), each at a different loading concentration:
Lane 3 = GBS0040-2nM (GBS) - 254bp median insert (range of fragments from approximately 180 bp to 10000+ bp), 2 nM
Lane 4 = GBS0041-3nM (GBS) - 254bp median insert (range of fragments from approximately 180 bp to 10000+ bp), 3 nM
| File | Number of reads | Sequence Length | %GC | Encoding |
|---|---|---|---|---|
| GBS0040_S2_L003_R1_001 | 342617205 | 151 | 53 | Sanger / Illumina 1.9 |
| GBS0040_S2_L003_R2_001 | 342617205 | 151 | 52 | Sanger / Illumina 1.9 |
| GBS0041_S3_L004_R1_001 | 334710708 | 151 | 53 | Sanger / Illumina 1.9 |
| GBS0041_S3_L004_R2_001 | 334710708 | 151 | 52 | Sanger / Illumina 1.9 |
4 jobs were sent using sabre
Submit start: 04/20/2015 08:27:48
Queue: Phyto
N. processors per job: 1
SGE_Batch -c '/home/bpp/tabimaj/bin/sabre-master/sabre pe -f /hts2/illumina/150408_J00107_0011_AH2JYNBBXX_1301/L134/lane4-s003-index-none-GBS0041_S3_L004_R1_001.fastq.gz -r /hts2/illumina/150408_J00107_0011_AH2JYNBBXX_1301/L134/lane4-s003-index-none-GBS0041_S3_L004_R2_001.fastq.gz -b /nfs0/Grunwald_Lab/home/tabimaj/GBS/HiSeq_3000/barcodes_sabre -u unknown_barcodes_1.fastq -w unknown_barcodes_2.fastq' -r demultiplexing_l4
SGE_Batch -c '/home/bpp/tabimaj/bin/sabre-master/sabre pe -f /hts2/illumina/150408_J00107_0011_AH2JYNBBXX_1301/L134/lane3-s002-index-none-GBS0040_S2_L003_R1_001.fastq.gz -r /hts2/illumina/150408_J00107_0011_AH2JYNBBXX_1301/L134/lane3-s002-index-none-GBS0040_S2_L003_R2_001.fastq.gz -b /nfs0/Grunwald_Lab/home/tabimaj/GBS/HiSeq_3000/barcodes_sabre -u unknown_barcodes_1.fastq -w unknown_barcodes_2.fastq' -r demultiplexing_l3
ramorum_samples)90 jobs were sent
N. processors per job: 1]
Script:
/nfs0/Grunwald_Lab/home/tabimaj/GBS/HiSeq_3000/lane_3/ramorum/bt2pe.sh
Reference:
/nfs0/Grunwald_Lab/home/tabimaj/GBS/HiSeq_3000/ramorum1.fasta
Input:
/nfs0/Grunwald_Lab/home/tabimaj/GBS/HiSeq_3000/lane_3/ramorum/reads
Output:
sams/ bams/
for i in *.e*; do printf "$i\t"; grep "reads; of these:" $i| sed 's/reads; of these://g'; printf "a\t"; grep "overall" $i; done
setwd("~/Documents/P_rubi/GBS/GBS_Data/")
lane3 <- read.table("~/Documents/P_rubi/GBS/GBS_Data/HiSeq_3000/Lane3_samout.txt", header = T,sep = "\t")
lane4 <- read.table("~/Documents/P_rubi/GBS/GBS_Data/HiSeq_3000/Lane4_samout.txt",header = T,sep = "\t")
old <- read.table("~/Documents/P_rubi/GBS/GBS_Data/HiSeq_3000/HiSeq2000.txt", header = T, sep = "\t")
There is an apparent higher number of reads on the HiSeq-3000 lanes than in the HiSeq-2000 samples. We tested for significance using an ANOVA.
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Lane | 2 | 7.911311e+13 | 3.955656e+13 | 7.55728 | 0.00064 |
| Residuals | 267 | 1.397541e+15 | 5.234234e+12 | NA | NA |
We see significant differences between treatments.
| diff | lwr | upr | p adj | |
|---|---|---|---|---|
| Lane_3-HiSeq_2000 | 816400.1 | 12588.96 | 1620211 | 0.0455805 |
| Lane_4-HiSeq_2000 | 1313003.9 | 509192.74 | 2116815 | 0.0004336 |
| Lane_4-Lane_3 | 496603.8 | -307207.39 | 1300415 | 0.3138627 |
The differences are significant between the HiSeq-2000 and the rest the treatments, while Lane 3 and Lane 4 show no significant differences between them according to the Tukey HSD test.
Let’s see if there is a correlation between the number of reads of the HiSeq-2000 and each of the lanes of the HiSeq-3000.
## Loading required package: gridExtra
## Loading required package: grid
We can see a correlation between the number of reads in both methods, so samples with higher number of reads in the HiSeq-2000 are also the most represented in HiSeq-3000. Sample concentration?
Here, we are mapping the proportion of reads mapped to the reference fragmented genome:
There is no evident change between the proportion of mapped reads between methods. To test this we are doing an analysis of variance:
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Lane | 2 | 0.08209 | 0.04104 | 0.38715 | 0.67937 |
| Residuals | 267 | 28.30681 | 0.10602 | NA | NA |
There are no significant differences between the proportion of aligned reads, indicating that the reads we are obtaining are of equal quality when compared to the reference genome. Its important to mention that even if the quality of the reads is similar between the two methods, there are still a higher number of reads in the HiSeq-3000 approach. To test the average coverage per nucleotide between treatments we are going to use samtools ::: mpileup.
A total of 40153102 nucleotides exist in the P. ramorum fragments sequenced by the HiSeq 3000 technology, we calculated this value by counting the number of cut-sites of the ApeKI enzyme on the P. ramorum reference genome and adding 200 bp to each fragment (number of bp per read X 2)
After obtaining the pileups from samtools, we assessed coverage per sample by measuring summary stats and percentage of nucleotides with X coverage per site:
| mean | median | sd | |
|---|---|---|---|
| Lane_3 | 6.363 | 5.039 | 4.779 |
| Lane_4 | 6.924 | 5.417 | 4.548 |
| HiSeq_2000 | 4.811 | 4.028 | 2.932 |
We can see that there is a difference of almost 2x of average coverage between the HiSeq-2000 and the HiSeq-3000 technology:
Analysis of variance: Coverage vs. Lane
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Lane | 2 | 215.5771 | 107.78854 | 6.20471 | 0.00232 |
| Residuals | 267 | 4638.3408 | 17.37206 | NA | NA |
We see significant differences between treatments.
Tukey’s HSD test
| diff | lwr | upr | p adj | |
|---|---|---|---|---|
| Lane_3-HiSeq_2000 | 1.5514164 | 0.0870386 | 3.015794 | 0.0349468 |
| Lane_4-HiSeq_2000 | 2.1127877 | 0.6484099 | 3.577166 | 0.0022317 |
| Lane_4-Lane_3 | 0.5613713 | -0.9030066 | 2.025749 | 0.6385623 |
We can conclude that there are significant differences between HiSeq-3000 coverage and the HiSeq-2000 technology. If combined, the HiSeq-3000 technology can reach a average coverage as high as 10x, which is impressive for 2 lanes of Illumina. More information should be obtained about the capacity of using 2 lanes instead of one.
We want to have, at least, a coverage of 4x per nucleotide to do the population analysis procedures. To assess it, we are comparing the average coverage with the percentage of nucleotides covered at least 4 times (each of the vertical lines represent the average percentage of nucleotides).
We can conclude a couple things from these results:
We can infer that the HiSeq-3000 technology yields a higher number of 4x covered nucleotides than HiSeq-2000
Nonetheless, there is a decrease of the Y axis after the average coverage surpasses 5x. This shows that, to achieve a higher percentage of reads mapped to 4x depth, we will need to increase the average coverage by more (much more) than 20x.
The recommended approach will be to aim to a depth of 15x-20x to have more than 60% of the nucleotides covered 4 times (these results are concordant to the result found in P. rubi using the HiSeq-2000 technology).
GATK)