1 Assessing accuracy of new flip_flop model for guppy basecaller

Flip_flop model provides extra state information on base to provide more signal to allow for more accurate calling.

As this comes at the expense of time spent basecalling we should assess whether the gain is necessary for what we want to do.

For mapping to genome/transcriptome to provide gene expression and transcript usage we may not necessarily require base-level precision. Mappers such as minmap2 can hande the higher error rate of third-generation sequencers.

If we are interested in SNP calling at low coverage we require higher accuracy.

1.1 Set the directory

setwd("~/Documents/Sync_later/Basecaller_test_20190204/")

1.2 Libraries

For reproducibility, one needs the following libraries installed and loaded into your R environment to run the following code

1.3 Reading in the data

Items include PAF format, which is a CSV file containing column meta data.

A collection for 4000 reads generated from SQK-LSK109 kit, converted to a multi-fast5 file suitable for guppy and then run through production model base caller or experimental flip_flop model basecaller. Run on ubuntu 18 using docker for ubuntu 16 so could install guppy_gpu and use on board Nvidia GeForce GTX 1060 GPU, particularly useful if wanting to use the slow flip_flop model

Pass read fastq concatenated into a single pass fastq file. 3. Reads were then mapped to hg38 using minimap2, output to PAF. simplified minimap2 cmd:

‘minimap2 -x splice –secondary=no hg38.mmi /path/to/ > output.PAF’

PAF format contains 11 main columns, detailed in the PAF_format.csv and then additional tags.

PAF_format <- read.csv("PAF_format.csv", header = TRUE, sep = ",")

# PAF_format  deciscription column is used to rename columns of read in data. 
guppy_normal <- read.delim("multi-reads/guppy_normal_cat.paf", header = FALSE, sep = "\t", col.names = PAF_format$Description)

guppy_floppie <- read.delim("multi-reads/guppy_floppie_cat.paf", header = FALSE, sep = "\t", col.names = PAF_format$Description)

# transmute necessary columns to new dataframe, Read contains the unique read name given by MinKNOW and outputted from guppy caller. Identity is generated through the division of the number of matched bases between the query and target, by the length of the target included gaps * 100. This gives an rough accuracy based on mapping similar to BASTn. Length is that of the query. Mapping quality is a phread score generated by minimap2. The last two maybe used for melt to form factors for plotting.

guppy_normal_df <- guppy_normal %>%
  mutate(., Read = Query.sequence.name, 
            Identity = (Number.of.matching.bases.in.the.mapping/ Number.bases..including.gaps..in.the.mapping) * 100,
            length = Query.sequence.length,
            MAPQ = Mapping.quality, basecaller = "guppy") %>%
  select(., Read, Identity, length, MAPQ, basecaller)

guppy_floppie_df <- guppy_floppie %>%
  mutate(., Read = Query.sequence.name, 
            Identity = (Number.of.matching.bases.in.the.mapping/ Number.bases..including.gaps..in.the.mapping) * 100,
            length = Query.sequence.length,
            MAPQ = Mapping.quality, basecaller = "guppy_floppie") %>%
  select(., Read, Identity, length, MAPQ, basecaller)

# as guppy_floppie_df has more rows this is difficult to use tidyverse so suggest as simple hack is to randomly sample the size of guppy_normal

guppy_floppie_df_2 <- sample_n(guppy_floppie_df, 3801)

dim(guppy_floppie_df_2)

## [1] 3801    5

1.4 Example plot of read_identity density for one basecaller only

guppy_normal_plot <- ggplot(guppy_normal_df, aes(x=Identity)) + geom_density(alpha=0.4) + scale_x_continuous(limits = c(70,100)) +
  ggtitle(label = "Mapped Identities of guppy called cDNA reads") + 
  bbc_style()
guppy_normal_plot

finalise_plot(plot_name = guppy_normal_plot,
source = "Source: ONS",
save_filepath = "guppy_normal_plot.png",
width_pixels = 640,
height_pixels = 550)

1.5 Tidying it up for multiple factors in ggplot

To display multiple factors with ggplot (i.e. multiple basecallers) we need to ‘tidy’ up our data frames to generate a column with character strings providing basecaller used for each row data

# create factors for plotting so can overlay densities of different groups, i.e. guppy_normal and guppy_floppie

combined_df <- rbind(guppy_normal_df, guppy_floppie_df_2)
# first provide gcombined data that shows overlap

1.6 Basic stats

basecaller_stat <- combined_df %>% group_by(basecaller) %>%
  summarize(average = mean(Identity), 
            median = median(Identity),
            stdev = sd(Identity),
            count = n(),
            stderror = stdev/count) # %>% data.frame(basecaller_stat)
kable(basecaller_stat)

basecaller	average	median	stdev	count	stderror
guppy	90.91243	91.60839	3.613084	3801	0.0009506
guppy_floppie	92.62080	93.52518	4.078241	3801	0.0010729

1.7 Plotting of data

ggplot + geom_density to create density histrogram of read identities for the two different basecallers

combined_plot <- ggplot(combined_df, aes(x=Identity, fill = basecaller)) + geom_density(alpha=0.4) + scale_x_continuous(limits = c(70,100)) +
ggtitle(label = "Mapped Identities of guppy called cDNA reads") + geom_vline(data=basecaller_stat, aes(xintercept=median, colour=basecaller), linetype="dashed", size=0.5, alpha = 0.4 ) # + geom_text(basecaller_stat$median, aes(x = 1, y = basecaller))
combined_plot

combined_bbc_plot <- ggplot(combined_df, aes(x=Identity, fill = basecaller)) + geom_density(alpha=0.4) + scale_x_continuous(limits = c(70,100)) +
ggtitle(label = "Mapped Identities of guppy called cDNA reads") + geom_vline(data=basecaller_stat, aes(xintercept=median, colour=basecaller), linetype="dashed", size=0.5, alpha = 0.4 ) + bbc_style()
combined_bbc_plot

finalise_plot(plot_name = combined_bbc_plot,
source = "Source: Your mum",
save_filepath = "combined_bbc_plot.png",
width_pixels = 640,
height_pixels = 550)

# Uncomment if you wish to find the colour indexes used in standard ggplotting, if you have not defined the colours using brewer_palette or something similar
# ggplot_build(combined_plot)$data

2 RRwick scripts for identity

I also adapted the scripts created by rrwick (https://github.com/rrwick/Basecalling-comparison/) which were designed to assess the various basecallers to date and also assembly and consensus accraucy for some bacterial genomes.

read_length.py <- Input the concatenated fastq pass reads, and the PAF file generated through minimap2 (mapping to genome in splice mode).

This is a particularly hacky way of doing it, and there are likely some mistakes. In particular I am unsure how to create the read_data.tsv file, and the id_to_fast5 file. I circumvented this issue by creating files that are similar

# Need to provide a list of fast5 file names 
fast5_mapping <- read.delim("multi-reads/filename_mapping.txt", header = FALSE, col.names = c("Fast5_name", "file_name"))

# To produce the read_data.tsv file similar to rrwick we need to get this data from the sequencing summary txt.
read_data_in <- read.delim("multi-reads/guppy/sequencing_summary.txt", header = TRUE) %>% select(., read_id, run_id, start_time, sequence_length_template) %>% cbind(fast5_mapping$Fast5_name)

# Rena e column headers to match that of the rrwick read_data.tsv and the following code in subsequent sections 
colnames(read_data_in) = c("Name", "Run_name", "Start_time", "Signal_length", "Fast5_name")

# reorder columns to match rrwick script
read_data_in <- read_data_in[,c(1,5,2,4,3)]

# Create the read_data.tsv file with the renamed and reordered columns
write.table(read_data_in, file = "results/read_data.tsv", row.names = FALSE, quote = FALSE, sep = "\t")

2.1 Additional libraries

# additional libraries required for this script
library(readr)
library(gridExtra)
library(grid)

# Create a vector containing the different basecallers, and a matched colour used for plotting
basecaller_names <- c()
basecaller_colours <- c()

basecaller_names <- c("guppy_v2.3.1_normal", "guppy_v2.3.1_floppie")
basecaller_colours <- c("#F8766D", "#00BFC4")

names(basecaller_colours) <- basecaller_names

class(basecaller_colours)

## [1] "character"

# Attach the name of the basecaller to the elements of the colour vector
names(basecaller_colours) <- basecaller_names
fill_scale <- scale_fill_manual(name = "Basecaller", values = basecaller_colours)

# rrwick specific theme I guess
my_theme <- theme_bw() + theme(panel.grid.major.x = element_blank())

# Create labels to be used in plotting later, takes the filename as source and grabs the basecaller name
basecaller_labels <- gsub(" ", "\n", basecaller_names, fixed=TRUE)
basecaller_labels <- gsub("basecRAWller", "base-\ncRAWller", basecaller_labels, fixed=TRUE)
basecaller_labels <- gsub("DeepNano", "Deep-\nNano", basecaller_labels, fixed=TRUE)

# function to read in data generated by the read_length.py code
load_tsv_data <- function(filename, column_names) {
  if(file.exists(filename)) {
    data <- read_tsv(filename, skip = 1, col_names = column_names)
  }
  else {
    data <- data.frame(matrix(ncol = 4, nrow = 0))
    colnames(data) <- column_names
  }
  return(data)
}

# All reads comprises the names of the fat5 file used for basecalling. In some cases reads will not pass the Q filter and so will be unaligned, with no identity. 
all_reads <- load_tsv_data("results/read_data.tsv", column_names=c("Name", "Fast5_name", "Run_name", "Signal_length", "Start_time"))
# Create a vector of basecaller identities from the individual basecaller_reads.tsv files.
# The same for the realtive length
basecaller_identities <- c()
basecaller_rel_lengths <- c()

# Use of for loop to pass over the _reads.tsv files generated by the read_length.py script within the /results directory. 
# Add a length column 
# Add identity column
# Add relative length column (relative to the reference length), this provides and understanding of the degree for indels to be systematically deletions of additions. 
for (basecaller in basecaller_names) {
  no_spaces <- gsub(" ", "_", basecaller)

  read_data_filename <- paste("results/", tolower(no_spaces), "_reads.tsv", sep="")

  length_column <- paste("Length_", no_spaces, sep="")
  identity_column <- paste("Identity_", no_spaces, sep="")
  rel_length_column <- paste("Rel_len_", no_spaces, sep="")

  basecaller_identities <- c(basecaller_identities, identity_column)
  basecaller_rel_lengths <- c(basecaller_rel_lengths, rel_length_column)

  column_names = c("Name", length_column, identity_column, rel_length_column)
  read_data <- load_tsv_data(read_data_filename, column_names)
  # Merge all reads (the 4000 fast5 rows) with the data imported from the _reads.tsv file
  all_reads <- merge(all_reads, read_data, by=1, all=TRUE)
}

2.2 Melt for ggplot using reshape

This is to produce different factors for ggplot to grab and plot in different colours, again which is tracked back to the basecaller_colour vector.

# Reformat data frames for ggplot

# Each read's length is the median value of the different basecallers' lengths.
read_lengths <- all_reads[grepl("Length_", names(all_reads))]
all_reads["Length"] <- round(apply(read_lengths, 1, median, na.rm = TRUE))

# If a read lacks an identity, then it's an unaligned read.
# Replace NA with 0.0 in identity columns (unless the whole column is NA, which implies that it's a basecaller for which I don't have data).
for(col_name in names(all_reads)){
  if (startsWith(col_name, "Identity_")) {
    identities <- all_reads[,col_name]
    if (!(all(is.na(identities)))) {
      identities[is.na(identities)] <- 0
      all_reads[,col_name] <- identities
    }
  }
}
# Create data frame which will have 8000 entries, each fast5 row repeated for each different basecaller
read_identities <- all_reads[,c("Name", "Length", basecaller_identities)]
colnames(read_identities) <- c("Name", "Length", basecaller_names)
read_identities <- melt(read_identities, id=c("Name", "Length"))
colnames(read_identities) <- c("Read_name", "Length", "Basecaller", "Identity")

# Same as above
read_rel_lengths <- all_reads[,c("Name", "Length", basecaller_rel_lengths)]
colnames(read_rel_lengths) <- c("Name", "Length", basecaller_names)
read_rel_lengths <- melt(read_rel_lengths, id=c("Name", "Length"))
colnames(read_rel_lengths) <- c("Read_name", "Length", "Basecaller", "Relative_length")

2.3 Plotting the data

For some reason the cmd:

‘+ scale_y_continuous(expand = c(0, 0), breaks = seq(0, 100, 5), minor_breaks = seq(0, 100, 1), labels = scales::unit_format(“%”))’ throws an error and I canot work out why. For now I have removed it to save time. Will address later. I think this is main reason the lower panel and upper panel do not align properly in gA:gB plots.

p1 <- ggplot(read_identities, aes(x = Basecaller, y = Identity, weight = Length, fill = Basecaller)) + # adding quantiles such as 0.5 allows for the median to be displayed
  geom_violin(draw_quantiles = c(0.5), width=0.8, bw=0.6, alpha = 0.7) +
  fill_scale + my_theme + guides(fill=FALSE) +
  theme(axis.ticks.x = element_blank()) + scale_y_continuous(expand = c(0, 0), breaks = seq(0, 100, 5), minor_breaks = seq(0, 100, 1)) +
scale_x_discrete(labels=NULL) + coord_cartesian(ylim=c(65, 100)) + labs(title = "Distribution of read identity", x = "" , y = "read identity", labels = scales::unit_format("%"))
p1

# 

p2 <- ggplot(read_identities, aes(x = Basecaller, y = Identity, weight = Length, fill = Basecaller)) +
  geom_violin(draw_quantiles = c(0.5), width=0.8, bw=0.6, alpha = 0.7) +
  fill_scale + my_theme + guides(fill=FALSE) + scale_y_continuous(expand = c(0, 0), breaks = seq(0, 100, 5), minor_breaks = seq(0, 100, 1)) +
  scale_x_discrete(labels=basecaller_labels) +
  coord_cartesian(ylim=c(0, 5)) +
  labs(x = "", y = "   ", labels = scales::unit_format(accuracy = 1,"%"))
p2

# , labels = scales::unit_format("%")

# To create the multi panel showing the read_identities from the top and the unaligned reads at the bottom
gA <- ggplot_gtable(ggplot_build(p1))
gB <- ggplot_gtable(ggplot_build(p2))
maxWidth = grid::unit.pmax(gA$widths[2:3], gB$widths[2:3])
gA$widths[2:3] <- as.list(maxWidth)
gB$widths[2:3] <- as.list(maxWidth)
read_identity_plot <- grid.arrange(gA, gB, ncol=1, heights=c(3, 1))

ggsave(read_identity_plot, file="results/read_identity.png", width = 12, height = 5.75)

2.4 Joy plots

Show more classical way of showing this distribution, similar to section one.

# Joyplots
library(ggjoy)
plot3 <- ggplot(read_identities, aes(x = Identity, y = Basecaller, weight = Length, fill = Basecaller)) + geom_joy(scale = 1.0, draw_quantiles = c(0.25,0.5, 0.75), alpha = 0.7) + fill_scale + theme_bw() + guides(fill=FALSE) + theme(axis.text.y = element_text(vjust = 0)) + scale_x_continuous(expand = c(0 , 0), breaks = seq(0, 100, 5), minor_breaks = seq(0, 100, 1)) + scale_y_discrete(expand = c(0.05, 0)) + coord_cartesian(xlim=c(60, 100)) + labs(title = "Read identities", x = "", y = "", labels = scales::unit_format("%"))
plot3

ggsave(plot3, filename = "results/read_identity_classic.png", width = 12, height = 5.75)

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Basecaller_accuracy

Report created: 2019-02-28