1 Basecalling summary

Basecalling was performed using Guppy 2.3.1. Called reads were classified as either pass or fail according to their mean quality score; the ONT Guppy basecaller applies a minimum mean quality score of 7.0 (note that these are not phred scores). For this analysis (run 1), a total of 2,794,661 reads were basecalled, of which 2,014,551 (72.1%) passed the quality filter. The passed reads contain a total of 2.26 Gb of DNA sequence, which amounts to 86.6% of all nucleotide bases sequenced.
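The pass fractions quoted above (by read count and by base count) can be reproduced from per-read lengths and mean quality scores. The following is a minimal sketch, not the report's own code: the function name `pass_summary`, the tuple layout, and the choice to treat a score exactly equal to the threshold as a pass are all assumptions made for illustration.

```python
def pass_summary(reads, min_qscore=7.0):
    """Summarise pass/fail status for an iterable of (length, mean_qscore) tuples.

    Assumption: a read passes when its mean quality score is >= min_qscore.
    """
    total_reads = passed_reads = total_bases = passed_bases = 0
    for length, qscore in reads:
        total_reads += 1
        total_bases += length
        if qscore >= min_qscore:
            passed_reads += 1
            passed_bases += length
    return {
        "passed_reads": passed_reads,
        "read_fraction": passed_reads / total_reads if total_reads else 0.0,
        "passed_bases": passed_bases,
        "base_fraction": passed_bases / total_bases if total_bases else 0.0,
    }


# Toy data (hypothetical reads, not taken from this run)
toy_reads = [(1000, 9.5), (500, 6.2), (2000, 11.0), (300, 7.0)]
print(pass_summary(toy_reads))
```

Because failed reads tend to be shorter, the fraction of bases that pass is usually higher than the fraction of reads that pass, which matches the 86.6% versus 72.1% reported for this run.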

2 Sequencing channel activity plot

The nanopores through which DNA is passed, and from which signal is collected, are arrayed as a two-dimensional matrix. A heatmap can be plotted showing channel productivity against spatial position on the matrix. Such a plot helps to identify spatial artifacts that could result from membrane damage, for example through the introduction of an air bubble. This heatmap representation of spatial activity shows only gross spatial aberrations. Since each channel can address four different pores (mux), the activity plot below shows the number of sequences produced per channel, not per pore.

3 Sequence quality and length

This section describes the distribution of base-called sequence lengths and their accompanying qualities. According to ONT: “N50 length is the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs.”

The distribution of sequence lengths:

The weighted read length histogram above shows the binned distribution of sequence length against number of sequence nucleotides contained within the bin. The histogram includes annotations for N50 and mean sequence sizes. N50 describes the sequence length where 50% of the sequenced bases are contained within reads of this length, or longer. The mean sequence length is the average sequence length across the whole sequence collection.
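As a rough illustration of the quantities described above, the sketch below computes a base-weighted length histogram, the N50, and the mean read length from a list of read lengths. The function names and the 500-base bin width are illustrative assumptions; the report does not state the binning it uses.

```python
from collections import Counter


def weighted_length_histogram(lengths, bin_width=500):
    """Return {bin_start: total bases contributed by reads falling in that bin}."""
    bases_per_bin = Counter()
    for length in lengths:
        bin_start = (length // bin_width) * bin_width
        bases_per_bin[bin_start] += length  # weight each read by its own length
    return dict(sorted(bases_per_bin.items()))


def n50(lengths):
    """Length L such that reads of length >= L hold at least 50% of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Toy data (hypothetical read lengths)
toy_lengths = [100, 200, 300, 400, 1000]
print(weighted_length_histogram(toy_lengths))
print("N50:", n50(toy_lengths))
print("mean:", sum(toy_lengths) / len(toy_lengths))
```

In the toy data a single long read carries half of all bases, so the N50 sits well above the mean; the same effect explains why N50 exceeds the mean read length in real runs.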

A histogram of mean QV scores above reveals the relative abundance of sequences of different qualities. The distribution of sequence qualities is shaded by the QV filter pass status. This QV filter is applied during the basecalling process as a modifiable parameter (the Guppy basecaller uses a quality threshold of 7 by default). The histogram includes annotations for N50 and mean sequence sizes, as defined for the weighted read length histogram. Note that the majority of sequences that fail QC are shorter than those that pass.

The plot above shows the distribution of mean read quality scores across the whole sequence collection. The distribution is shaded according to whether the sequence reads passed or failed the basecaller's quality filter.

The density plot of mean sequence quality against log10 sequence length is a useful graphic for showing patterns within the broader sequence collection. The density plot shown in the figure above has been de-speckled by omitting the rarer sequence bins that contain 5 reads or fewer.
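The de-speckling step can be sketched as a two-dimensional binning followed by a count filter. This is an illustrative sketch only: the bin resolutions (10 bins per decade of length, 2 bins per quality unit) and the function name are assumptions, and only the rule of dropping bins with 5 reads or fewer comes from the text.

```python
import math
from collections import Counter


def despeckled_bins(reads, bins_per_decade=10, bins_per_q=2, max_speckle=5):
    """Bin (length, mean_qscore) pairs and drop bins holding max_speckle reads or fewer.

    Keys are (log10-length bin index, quality bin index); values are read counts.
    """
    counts = Counter()
    for length, qscore in reads:
        length_bin = math.floor(math.log10(length) * bins_per_decade)
        quality_bin = math.floor(qscore * bins_per_q)
        counts[(length_bin, quality_bin)] += 1
    return {key: n for key, n in counts.items() if n > max_speckle}


# Toy data (hypothetical): one dense bin of 7 reads and one sparse bin of 2 reads
toy_reads = [(1500, 9.3)] * 7 + [(200, 5.1)] * 2
print(despeckled_bins(toy_reads))
```

Only the dense bin survives the filter, which is the intended effect: isolated bins with a handful of reads no longer clutter the plot.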

4 Time/duty performance

Another key metric in the quality review of a sequencing run is the temporal performance of the run. During a run each sequencing channel will address a number of different pores (mux), and individual pores may become temporarily or permanently blocked. Sequencing productivity is therefore expected to decrease during a run. It is useful to consider whether the observed decline in productivity is normal or whether it happens more rapidly than expected. A rapid pore decline could indicate contaminants in the sequencing library.

Plotting the number of bases generated per unit time over the course of a sequencing run can reveal unexpected behaviours. In an ideal experiment there should not be any sudden decreases in performance.

The temporal data presented in the figure above has been scaled to gigabases of sequence produced per hour.

In addition to the plot of data production over time, the cumulative plot shown above shows how data accumulates during the run. From this dataset, a total of 2.26 Gb of quality-passing sequence has been measured. T50 is the timepoint by which 50% of the sequenced bases have been collected, which is 4.58 hours in this run. It is displayed on the graph along with T90, the time by which 90% of the sequenced bases have been acquired.
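T50 and T90 can be read off a cumulative sum of bases ordered by time. The sketch below assumes each read is described by a (time in hours, length) pair, where the time is whichever per-read timestamp is available (for example the read start time); the function name and data layout are hypothetical.

```python
def time_to_fraction(reads, fraction):
    """Return the earliest read time at which the cumulative bases reach
    the given fraction of the run total, or None if there are no reads.

    reads: iterable of (time_hours, length) tuples.
    """
    ordered = sorted(reads)  # sort by time
    total = sum(length for _, length in ordered)
    target = fraction * total
    running = 0
    for time_hours, length in ordered:
        running += length
        if running >= target:
            return time_hours
    return None


# Toy data (hypothetical): five equal-length reads at increasing times
toy_reads = [(0.5, 1000), (1.0, 1000), (2.0, 1000), (4.0, 1000), (8.0, 1000)]
print("T50:", time_to_fraction(toy_reads, 0.5))
print("T90:", time_to_fraction(toy_reads, 0.9))
```

The same function applied to read counts rather than bases (by setting every length to 1) gives the equivalent timepoints for a cumulative read plot.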

In addition to the cumulative plot of sequenced bases, an equivalent plot for the sequenced reads is shown in the figure above. Its overall shape is similar to that of the cumulative base plot.

The speed/time plot is a useful tool to observe any substantial changes in sequencing speed. A marked slow-down in sequencing speed can indicate challenges within the sequencing chemistry that could have been caused by the method of DNA isolation or an abundance of small DNA fragments.

The data points shown in the speed-time plot above have been filtered to mask outlying sequences (sequences beyond the 95% range). The distribution of the boxplots and their ‘whiskers’ is unchanged.

The graph presented above shows the number of sequencing channels that are actively producing data across time. A channel is defined as being active if one or more sequence reads are observed per time window (one hour for the default graph). It is expected that over the course of the run pores will block and the number of channels producing data will decrease. Changing the pore used by the channel (mux) and strategies to unblock pores mean that the number of functional channels may increase or decrease at a given timepoint but generally the number of channels producing data will decrease over time.
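The definition of an active channel given above translates directly into a count of distinct channels per time window. The following is a minimal sketch under assumed inputs: each read is a (channel, time in hours) pair, and the function name is hypothetical.

```python
from collections import defaultdict


def active_channels_per_window(reads, window_hours=1.0):
    """Count distinct channels producing at least one read in each time window.

    reads: iterable of (channel, time_hours) tuples.
    Returns {window_index: number_of_active_channels}, ordered by window.
    """
    channels_by_window = defaultdict(set)
    for channel, time_hours in reads:
        window = int(time_hours // window_hours)
        channels_by_window[window].add(channel)
    return {w: len(chs) for w, chs in sorted(channels_by_window.items())}


# Toy data (hypothetical): channel 1 is active in the first two hours,
# channel 2 only in the first hour, channel 3 only in the third hour
toy_reads = [(1, 0.2), (2, 0.7), (1, 0.9), (1, 1.5), (3, 2.1)]
print(active_channels_per_window(toy_reads))
```

Windows in which no channel produced a read are simply absent from the result, so a plotting step would need to fill those in as zero.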

5 Demultiplexing

    barcode         freq     %   Mb  min    max  mean   N50     L50
1   barcode01     179573   8.9  187  106  10200  1039  1185   54790
2   barcode02     468681  23.3  518   96  11010  1106  1279  137692
3   barcode03     504379  25.1  560  108  12578  1110  1265  151557
4   barcode04     739351  36.7  863  113  10418  1167  1337  223980
32  barcode32        335   0.0    0  203   6668  1173  1355      97
34  barcode34        440   0.0    1  200   3975  1139  1301     138
63  barcode64        236   0.0    0  279   3782  1216  1431      73
81  barcode84        200   0.0    0  197   3272  1015  1188      62
87  barcode91        220   0.0    0  219   3579  1041  1169      62
93  unclassified  119217   5.9  128  106  11703  1072  1275   33730

Sequences were demultiplexed using Guppy 2.3.1. The table above shows summary statistics for the barcode assignments within this sequence collection. The annotated barcode is presented along with the number of sequence reads assigned to it (freq), the percentage of reads assigned to the barcode (%), the megabases of DNA sequence (Mb), the shortest read in nucleotides (min), the longest read in nucleotides (max), the mean sequence length in nucleotides (mean), the N50 in nucleotides, and the L50 (the smallest number of longest reads that together contain at least 50% of the bases). Note that these summary statistics are for sequences that passed the quality threshold (of 7) described in the basecalling procedure above.
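The per-barcode columns described above can be derived from a list of (barcode, read length) pairs for the passed reads. This is a sketch under assumptions rather than the code behind the table: the function names, the dictionary keys, and the input layout are all illustrative.

```python
from collections import defaultdict


def n50_l50(lengths):
    """Return (N50, L50): the read length and the read count at which the
    longest reads first account for at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # empty input


def barcode_summary(reads):
    """Summarise (barcode, length) pairs into one row per barcode."""
    lengths_by_barcode = defaultdict(list)
    for barcode, length in reads:
        lengths_by_barcode[barcode].append(length)
    total_reads = sum(len(v) for v in lengths_by_barcode.values())
    rows = []
    for barcode in sorted(lengths_by_barcode):
        lengths = lengths_by_barcode[barcode]
        n50, l50 = n50_l50(lengths)
        rows.append({
            "barcode": barcode,
            "freq": len(lengths),
            "pct": 100 * len(lengths) / total_reads,
            "Mb": sum(lengths) / 1e6,
            "min": min(lengths),
            "max": max(lengths),
            "mean": sum(lengths) / len(lengths),
            "N50": n50,
            "L50": l50,
        })
    return rows


# Toy data (hypothetical reads)
toy_reads = [("barcode01", 1000), ("barcode01", 500),
             ("barcode01", 300), ("barcode02", 2000)]
for row in barcode_summary(toy_reads):
    print(row)
```

Note that L50 is a count of reads while N50 is a length, which is why the L50 column in the table scales with the number of reads assigned to each barcode.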

The histogram above shows the abundance of different barcodes within the sequence collection. The size of each bar corresponds to the frequency of the observation, that is, the number of sequence reads observed. Barcodes with frequencies below 500 are not represented.