1 Basecalling summary

Basecalling was performed using the Guppy 2.3.1 software. Called reads were classified as either pass or fail depending on their mean quality score (the ONT Guppy basecaller sets a minimum quality score of 7.0; note that these are not phred scores). For this analysis (run 1), a total of 2,794,661 reads were basecalled and of these 2,014,551 (72.1%) passed the quality filter. The passed reads contain a total of 2.26 Gb of DNA sequence, which amounts to 86.6% of the total DNA nucleotide bases sequenced.
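The pass/fail split described above can be sketched in a few lines. This is a minimal illustration, not the Guppy implementation: the read tuples are made-up toy data, the function name `pass_fractions` is hypothetical, and treating a score exactly equal to the threshold as a pass is an assumption.

```python
# Minimum mean quality score quoted in the text above.
QUALITY_THRESHOLD = 7.0

def pass_fractions(reads, threshold=QUALITY_THRESHOLD):
    """Return (fraction of reads passed, fraction of bases in passed reads).

    reads is a list of (length_in_bases, mean_quality) tuples.
    Assumption: a read passes when its mean quality is >= threshold.
    """
    passed = [(length, quality) for length, quality in reads if quality >= threshold]
    total_bases = sum(length for length, _ in reads)
    passed_bases = sum(length for length, _ in passed)
    return len(passed) / len(reads), passed_bases / total_bases

# Hypothetical toy reads, not taken from run 1.
toy_reads = [(1200, 9.5), (300, 5.1), (1500, 10.2), (200, 6.9)]
read_fraction, base_fraction = pass_fractions(toy_reads)
print(f"{read_fraction:.1%} of reads, {base_fraction:.1%} of bases passed")

# The read-level percentage for run 1 follows from the counts given in the text.
run1_read_percentage = f"{2_014_551 / 2_794_661:.1%}"
print(run1_read_percentage)
```

Because failed reads tend to be shorter, the fraction of bases that pass is usually higher than the fraction of reads that pass, which matches the 86.6% versus 72.1% reported for this run.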

2 Sequencing channel activity plot

The nanopores through which DNA is passed, and from which signal is collected, are arrayed as a 2-dimensional matrix. A heatmap can be plotted showing channel productivity against spatial position on the matrix. Such a plot enables the identification of spatial artifacts that could result from membrane damage through, for example, the introduction of an air bubble. This heatmap representation of spatial activity shows only gross spatial aberrations. Since each channel can address four different pores (mux), the activity plot below shows the number of sequences produced per channel, not per pore.
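The per-channel counting behind such a heatmap might look like the sketch below. The real mapping from channel number to physical position is specific to the flow cell and is not reproduced here; the row-by-row placement in `to_grid`, the grid dimensions and the channel list are all assumptions made for illustration.

```python
from collections import Counter

def reads_per_channel(channel_ids):
    """Count how many reads each channel produced."""
    return Counter(channel_ids)

def to_grid(counts, n_rows, n_cols):
    """Place channels 1..n_rows*n_cols row by row.

    Assumption: this simple row-major layout stands in for the real,
    device-specific channel map.
    """
    return [
        [counts.get(row * n_cols + col + 1, 0) for col in range(n_cols)]
        for row in range(n_rows)
    ]

# Hypothetical channel number for each read in a tiny 2 x 3 array.
toy_channels = [1, 1, 2, 4, 4, 4, 6]
grid = to_grid(reads_per_channel(toy_channels), n_rows=2, n_cols=3)
for row in grid:
    print(row)
```

A cluster of adjacent zeros in such a grid is the kind of gross spatial aberration the heatmap is meant to expose.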

3 Sequence quality and length

This section describes the distribution of base-called sequence lengths and their accompanying qualities. According to ONT: “N50 length is the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs.”

The distribution of sequence lengths:

The weighted read length histogram above shows the binned distribution of sequence length against number of sequence nucleotides contained within the bin. The histogram includes annotations for N50 and mean sequence sizes. N50 describes the sequence length where 50% of the sequenced bases are contained within reads of this length, or longer. The mean sequence length is the average sequence length across the whole sequence collection.
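The N50 and mean annotations described above can be computed directly from a list of read lengths. The following is a small sketch with made-up lengths; the function name `n50` is chosen for illustration.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    half_of_bases = sum(lengths) / 2
    running_total = 0
    # Walk from the longest read downwards until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half_of_bases:
            return length
    raise ValueError("n50 is undefined for an empty collection")

# Hypothetical read lengths, not taken from run 1.
toy_lengths = [2, 3, 4, 5, 6, 10]
toy_n50 = n50(toy_lengths)
toy_mean = sum(toy_lengths) / len(toy_lengths)
print(toy_n50, toy_mean)
```

In this toy set the two longest reads (10 and 6) already hold 16 of the 30 bases, so the N50 is 6 while the mean is 5. The N50 is weighted by bases and is therefore typically larger than the mean.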

A histogram of mean QV scores above reveals the relative abundance of sequences of different qualities. The distribution of sequence qualities is shaded by the QV filter pass status. This QV filter is applied during the basecalling process as a modifiable parameter (the default quality threshold for the Guppy basecaller is 7). The histogram includes annotations for N50 and mean sequence sizes, as defined for the weighted read length histogram. Note that the majority of sequences that fail QC are shorter than those that pass.

The plot above shows the distribution of mean read quality scores across the whole sequence collection. The distribution has been shaded to distinguish the sequence reads that passed or failed the basecaller's quality filter.

The density plot of mean sequence quality plotted against log10 sequence length is a useful graphic to show patterns within the broader sequence collection. The density plot shown in the figure above has been de-speckled by omitting the rarer sequence bins containing 5 reads or fewer.
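The binning and de-speckling step can be illustrated as follows. This is a sketch under stated assumptions: the bin widths (10 bins per decade of length, 2 bins per quality unit), the function name `density_bins` and the toy reads are all hypothetical; only the rule of dropping bins with 5 reads or fewer comes from the text.

```python
import math
from collections import Counter

def density_bins(reads, bins_per_decade=10, bins_per_quality_unit=2, max_sparse=5):
    """Bin reads by (log10 length, mean quality) and drop sparse bins.

    reads is a list of (length_in_bases, mean_quality) tuples.
    Bins holding max_sparse reads or fewer are omitted (de-speckling).
    """
    counts = Counter()
    for length, quality in reads:
        length_bin = math.floor(math.log10(length) * bins_per_decade)
        quality_bin = math.floor(quality * bins_per_quality_unit)
        counts[(length_bin, quality_bin)] += 1
    return {key: n for key, n in counts.items() if n > max_sparse}

# Hypothetical reads: one dense cluster and one sparse cluster.
toy_reads = [(1500, 9.2)] * 7 + [(150, 5.1)] * 2
kept = density_bins(toy_reads)
print(kept)
```

Only the dense cluster of seven reads survives; the two-read bin is treated as speckle and removed.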

4 Time/duty performance

Another key metric in the quality review of a sequencing run is an analysis of the temporal performance of the run. During a run, each sequencing channel will address a number of different pores (mux), and the individual pores may become temporarily or permanently blocked. It is therefore expected that sequencing productivity will decrease during a run. It is useful to consider whether the observed productivity decline is normal or whether it happens more rapidly than expected. A rapid pore decline could be indicative of contaminants within the sequencing library.

Plotting the number of bases generated per unit time over the course of a sequencing run can reveal unexpected behaviours. In an ideal experiment there should not be any sudden decreases in performance.

The temporal data presented in the figure above has been scaled to gigabases of sequence produced per hour.
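The scaling to gigabases per hour amounts to summing read lengths within hourly windows. The sketch below assumes each read carries a start time in seconds since the beginning of the run; the function name and the toy reads are hypothetical.

```python
from collections import defaultdict

def gigabases_per_hour(reads):
    """Sum read lengths per hour of the run and scale to gigabases.

    reads is a list of (start_time_in_seconds, length_in_bases) tuples.
    """
    bases_per_hour = defaultdict(int)
    for start_seconds, length in reads:
        bases_per_hour[int(start_seconds // 3600)] += length
    return {hour: bases / 1e9 for hour, bases in sorted(bases_per_hour.items())}

# Hypothetical reads with exaggerated lengths so the scaled values are easy to read.
toy_reads = [(100, 500_000_000), (3500, 250_000_000), (4000, 300_000_000)]
throughput = gigabases_per_hour(toy_reads)
print(throughput)
```

A sudden drop between consecutive hours in such a table is the kind of unexpected behaviour the temporal plot is meant to reveal.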

In addition to plotting the temporal production of data, the cumulative plot shown above shows how data is accumulated during the run. From this dataset, a total of 2.26 Gb of quality-passing sequence has been measured. We can identify the timepoint T50, the time by which 50% of the sequenced bases have been collected; this is 4.58 hours in this run. This is displayed on the graph along with T90, the time by which 90% of the sequenced bases have been acquired.
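T50 and T90 can be read off the cumulative sum of bases ordered by time. The following sketch assumes each read has a single timestamp in hours; the function name `time_to_fraction` and the toy reads are hypothetical.

```python
def time_to_fraction(reads, fraction):
    """Return the earliest read time at which the cumulative bases reach
    the given fraction of the run total.

    reads is a list of (time_in_hours, length_in_bases) tuples.
    """
    ordered = sorted(reads)
    target = fraction * sum(length for _, length in ordered)
    running_total = 0
    for time_hours, length in ordered:
        running_total += length
        if running_total >= target:
            return time_hours
    raise ValueError("no reads supplied")

# Hypothetical reads, not taken from run 1.
toy_reads = [(1.0, 100), (2.0, 300), (5.0, 400), (9.0, 200)]
t50 = time_to_fraction(toy_reads, 0.5)
t90 = time_to_fraction(toy_reads, 0.9)
print(t50, t90)
```

With 1,000 bases in total, the running sum first reaches 500 at hour 5 and 900 at hour 9.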

In addition to the cumulative plot of sequenced bases, an equivalent plot for the sequenced reads is shown in the figure above. Its shape is similar to that of the cumulative base plot.

The speed/time plot is a useful tool to observe any substantial changes in sequencing speed. A marked slow-down in sequencing speed can indicate challenges within the sequencing chemistry that could have been caused by the method of DNA isolation or an abundance of small DNA fragments.

The data points shown in the speed-time plot above have been filtered to mask outlying sequences (sequences beyond the 95% range). The distribution of the boxplots and their ‘whiskers’ are unchanged.

The graph presented above shows the number of sequencing channels that are actively producing data across time. A channel is defined as being active if one or more sequence reads are observed per time window (one hour for the default graph). It is expected that over the course of the run pores will block and the number of channels producing data will decrease. Changing the pore used by the channel (mux) and strategies to unblock pores mean that the number of functional channels may increase or decrease at a given timepoint but generally the number of channels producing data will decrease over time.
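The definition of an active channel given above translates into counting distinct channels per time window. A minimal sketch, assuming each read has a start time in hours and a channel number; the function name and the toy reads are hypothetical.

```python
from collections import defaultdict

def active_channels(reads, window_hours=1.0):
    """Count distinct channels producing at least one read in each time window.

    reads is a list of (start_time_in_hours, channel_number) tuples.
    """
    channels_by_window = defaultdict(set)
    for start_hours, channel in reads:
        channels_by_window[int(start_hours // window_hours)].add(channel)
    return {window: len(channels) for window, channels in sorted(channels_by_window.items())}

# Hypothetical reads: channel 1 stops after the first hour, channel 3 starts late.
toy_reads = [(0.2, 1), (0.5, 2), (0.7, 1), (1.1, 2), (2.9, 3), (2.95, 3)]
activity = active_channels(toy_reads)
print(activity)
```

A channel that yields several reads in one window is still counted once, so the result tracks functional channels rather than read output.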

5 Demultiplexing

	barcode	freq	%	Mb	min	max	mean	N50	L50
1	barcode01	179573	8.9	187	106	10200	1039	1185	54790
2	barcode02	468681	23.3	518	96	11010	1106	1279	137692
3	barcode03	504379	25.1	560	108	12578	1110	1265	151557
4	barcode04	739351	36.7	863	113	10418	1167	1337	223980
32	barcode32	335	0.0	0	203	6668	1173	1355	97
34	barcode34	440	0.0	1	200	3975	1139	1301	138
63	barcode64	236	0.0	0	279	3782	1216	1431	73
81	barcode84	200	0.0	0	197	3272	1015	1188	62
87	barcode91	220	0.0	0	219	3579	1041	1169	62
93	unclassified	119217	5.9	128	106	11703	1072	1275	33730

Sequences were demultiplexed using Guppy 2.3.1 software. The table above shows summary statistics for the barcode assignments within this sequence collection. The annotated barcode is presented along with the number of sequence reads assigned to it (freq), the percentage of reads assigned to the barcode (%), the megabases of DNA sequence (Mb), the shortest read in nucleotides (min), the longest read in nucleotides (max), the mean sequence length in nucleotides (mean), the N50 in nucleotides (N50) and the L50, which is the smallest number of reads that together contain at least 50% of the bases (L50). Note that these summary statistics are for sequences that passed the quality threshold (of 7) described in the basecalling procedure above.
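The columns of the table can be derived from per-read barcode assignments as sketched below. This is an illustration only: the function names, the dictionary layout and the toy reads are hypothetical, and the input is assumed to contain only reads that passed the quality filter.

```python
from collections import defaultdict

def n50_l50(lengths):
    """Return (N50 in bases, L50 as a number of reads) for a non-empty list."""
    half_of_bases = sum(lengths) / 2
    running_total = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running_total += length
        if running_total >= half_of_bases:
            return length, count

def barcode_summary(reads):
    """Summarise (barcode, length_in_bases) pairs into one row per barcode."""
    lengths_by_barcode = defaultdict(list)
    for barcode, length in reads:
        lengths_by_barcode[barcode].append(length)
    total_reads = sum(len(lengths) for lengths in lengths_by_barcode.values())
    rows = {}
    for barcode, lengths in sorted(lengths_by_barcode.items()):
        n50, l50 = n50_l50(lengths)
        rows[barcode] = {
            "freq": len(lengths),
            "%": round(100 * len(lengths) / total_reads, 1),
            "Mb": sum(lengths) / 1e6,
            "min": min(lengths),
            "max": max(lengths),
            "mean": sum(lengths) / len(lengths),
            "N50": n50,
            "L50": l50,
        }
    return rows

# Hypothetical barcode assignments, not taken from run 1.
toy_reads = [("barcode01", n) for n in (2, 3, 4, 5, 6, 10)] + [("barcode02", 100)] * 2
summary = barcode_summary(toy_reads)
for barcode, row in summary.items():
    print(barcode, row)
```

For the first toy barcode the two longest reads cover half of the bases, giving an N50 of 6 bases and an L50 of 2 reads, which shows why the two columns carry different units.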

The histogram above shows the abundance of different barcodes within the sequence collection. The size of each bar corresponds to the frequency of the observation, that is, the number of sequence reads observed. Barcodes with frequencies less than 500 are not represented.

ONT summary statistics and basic QC: Run 1

Report created: 2019-03-01
