Overview and summary stats

HiSeq 3000 lanes (2x150bp), each at a different loading concentration:

GBS_0024 stats

File Number of reads Sequence Length %GC Encoding
GBS0040_S2_L003_R1_001 342617205 151 53 Sanger / Illumina 1.9
GBS0040_S2_L003_R2_001 342617205 151 52 Sanger / Illumina 1.9
GBS0041_S3_L004_R1_001 334710708 151 53 Sanger / Illumina 1.9
GBS0041_S3_L004_R2_001 334710708 151 52 Sanger / Illumina 1.9

Demultiplexing

4 jobs were sent using sabre

Submit start: 04/20/2015 08:27:48

Queue: Phyto

N. processors per job: 1

Scripts

Lane 3

SGE_Batch -c '/home/bpp/tabimaj/bin/sabre-master/sabre pe -f /hts2/illumina/150408_J00107_0011_AH2JYNBBXX_1301/L134/lane4-s003-index-none-GBS0041_S3_L004_R1_001.fastq.gz -r /hts2/illumina/150408_J00107_0011_AH2JYNBBXX_1301/L134/lane4-s003-index-none-GBS0041_S3_L004_R2_001.fastq.gz -b /nfs0/Grunwald_Lab/home/tabimaj/GBS/HiSeq_3000/barcodes_sabre -u unknown_barcodes_1.fastq -w unknown_barcodes_2.fastq' -r demultiplexing_l4

Lane 4

SGE_Batch -c '/home/bpp/tabimaj/bin/sabre-master/sabre pe -f /hts2/illumina/150408_J00107_0011_AH2JYNBBXX_1301/L134/lane3-s002-index-none-GBS0040_S2_L003_R1_001.fastq.gz -r /hts2/illumina/150408_J00107_0011_AH2JYNBBXX_1301/L134/lane3-s002-index-none-GBS0040_S2_L003_R2_001.fastq.gz -b /nfs0/Grunwald_Lab/home/tabimaj/GBS/HiSeq_3000/barcodes_sabre -u unknown_barcodes_1.fastq -w unknown_barcodes_2.fastq'  -r demultiplexing_l3

Samples:

  • P. ramorum: 90 samples (ramorum_samples)

Number of reads and aligning to reference

90 jobs were sent

N. processors per job: 1]

Script: /nfs0/Grunwald_Lab/home/tabimaj/GBS/HiSeq_3000/lane_3/ramorum/bt2pe.sh

Reference: /nfs0/Grunwald_Lab/home/tabimaj/GBS/HiSeq_3000/ramorum1.fasta

Input: /nfs0/Grunwald_Lab/home/tabimaj/GBS/HiSeq_3000/lane_3/ramorum/reads

Output: sams/ bams/

Creating tables

for i in *.e*; do printf "$i\t"; grep "reads; of these:" $i| sed 's/reads; of these://g'; printf "a\t"; grep "overall" $i; done

Number of reads:

Reading the tables

setwd("~/Documents/P_rubi/GBS/GBS_Data/")
lane3 <- read.table("~/Documents/P_rubi/GBS/GBS_Data/HiSeq_3000/Lane3_samout.txt", header = T,sep = "\t")
lane4 <- read.table("~/Documents/P_rubi/GBS/GBS_Data/HiSeq_3000/Lane4_samout.txt",header = T,sep = "\t")
old <- read.table("~/Documents/P_rubi/GBS/GBS_Data/HiSeq_3000/HiSeq2000.txt", header = T, sep = "\t")

Comparison of different lanes (Horizontal lines are the average reads per lane)

Lines are the average number of reads per laneLines are the average number of reads per lane

Analisis of variance

There is an apparent higher number of reads on the HiSeq-3000 lanes than in the HiSeq-2000 samples. We tested for significance using an ANOVA.

Df Sum Sq Mean Sq F value Pr(>F)
Lane 2 7.911311e+13 3.955656e+13 7.55728 0.00064
Residuals 267 1.397541e+15 5.234234e+12 NA NA

We see significant differences between treatments.

Tukey’s HSD test

diff lwr upr p adj
Lane_3-HiSeq_2000 816400.1 12588.96 1620211 0.0455805
Lane_4-HiSeq_2000 1313003.9 509192.74 2116815 0.0004336
Lane_4-Lane_3 496603.8 -307207.39 1300415 0.3138627

The differences are significant between the HiSeq-2000 and the rest the treatments, while Lane 3 and Lane 4 show no significant differences between them according to the Tukey HSD test.

Correlation of reads between treatments

Let’s see if there is a correlation between the number of reads of the HiSeq-2000 and each of the lanes of the HiSeq-3000.

## Loading required package: gridExtra
## Loading required package: grid

We can see a correlation between the number of reads in both methods, so samples with higher number of reads in the HiSeq-2000 are also the most represented in HiSeq-3000. Sample concentration?


Mapping and aligning to the reference genome

Proportion of reads mapped

Here, we are mapping the proportion of reads mapped to the reference fragmented genome:

There is no evident change between the proportion of mapped reads between methods. To test this we are doing an analysis of variance:

Df Sum Sq Mean Sq F value Pr(>F)
Lane 2 0.08209 0.04104 0.38715 0.67937
Residuals 267 28.30681 0.10602 NA NA

There are no significant differences between the proportion of aligned reads, indicating that the reads we are obtaining are of equal quality when compared to the reference genome. Its important to mention that even if the quality of the reads is similar between the two methods, there are still a higher number of reads in the HiSeq-3000 approach. To test the average coverage per nucleotide between treatments we are going to use samtools ::: mpileup.


Total nucleotides

A total of 40153102 nucleotides exist in the P. ramorum fragments sequenced by the HiSeq 3000 technology, we calculated this value by counting the number of cut-sites of the ApeKI enzyme on the P. ramorum reference genome and adding 200 bp to each fragment (number of bp per read X 2)

Proportion of nucleotides mapped

After obtaining the pileups from samtools, we assessed coverage per sample by measuring summary stats and percentage of nucleotides with X coverage per site:

Coverage:

Coverage summary stats per sample
mean median sd
Lane_3 6.363 5.039 4.779
Lane_4 6.924 5.417 4.548
HiSeq_2000 4.811 4.028 2.932

We can see that there is a difference of almost 2x of average coverage between the HiSeq-2000 and the HiSeq-3000 technology:

Analysis of variance: Coverage vs. Lane

Df Sum Sq Mean Sq F value Pr(>F)
Lane 2 215.5771 107.78854 6.20471 0.00232
Residuals 267 4638.3408 17.37206 NA NA

We see significant differences between treatments.

Tukey’s HSD test

diff lwr upr p adj
Lane_3-HiSeq_2000 1.5514164 0.0870386 3.015794 0.0349468
Lane_4-HiSeq_2000 2.1127877 0.6484099 3.577166 0.0022317
Lane_4-Lane_3 0.5613713 -0.9030066 2.025749 0.6385623

We can conclude that there are significant differences between HiSeq-3000 coverage and the HiSeq-2000 technology. If combined, the HiSeq-3000 technology can reach a average coverage as high as 10x, which is impressive for 2 lanes of Illumina. More information should be obtained about the capacity of using 2 lanes instead of one.

Number of nucleotides with at least 4x coverage

We want to have, at least, a coverage of 4x per nucleotide to do the population analysis procedures. To assess it, we are comparing the average coverage with the percentage of nucleotides covered at least 4 times (each of the vertical lines represent the average percentage of nucleotides).

We can conclude a couple things from these results:

  1. The average percentage of nucleotides coverage by 4x depth are:
  • Lane_3: 39,49%
  • Lane_4: 42.95%
  • HiSeq-2000: 32.68%
  1. We can infer that the HiSeq-3000 technology yields a higher number of 4x covered nucleotides than HiSeq-2000

  2. Nonetheless, there is a decrease of the Y axis after the average coverage surpasses 5x. This shows that, to achieve a higher percentage of reads mapped to 4x depth, we will need to increase the average coverage by more (much more) than 20x.

  3. The recommended approach will be to aim to a depth of 15x-20x to have more than 60% of the nucleotides covered 4 times (these results are concordant to the result found in P. rubi using the HiSeq-2000 technology).

Conclusions

Perspectives and questions (For Aaron and Matthew)

Total time for analysis: 2 weeks and 1 day.

  • Sorting: 3 days
  • Mapping: 2 days
  • Pileups: 2 days
  • R counts, data import and scripting: 1 week
  • Importing pileups to R: 8 hours (classical), 3 hours (fread), 1 hour (Simplifying the input)
  • Analysis and results: 2 days, 7 hours.