This assessment is:
Your submission will be an edited form of this document. You should make sure to:
{r} after the first ```.
To insert an image, use the following code within an R code block:
knitr::include_graphics("FILE PATH TO IMAGE")
You can include a number of extra arguments within the {r} such as…

- fig.align to control the position
- fig.cap to add a legend

When you do not need to show the code within an R code block, such as when you are just inserting a figure, include the argument echo = FALSE within the {r}.
An experiment has been running on the International Space Station (ISS) over the last few years called Microbial Tracking. Samples have been collected from the air and surfaces across the ISS, transferred back to Earth and sequenced in NASA’s laboratories. Your job is to assemble the raw sequencing data from the ISS into whole genomes, conduct analysis on these genomes and other public data and visualise some of the output.
A. Extract the contents of your compressed
part_1 file. Show the code used to do this.
```
PS C:> pscp -P 5047 part_1.tar.gz biouser@lab-49a4b677-0a18-4825-811e-14b33afb7cfa.uksouth.cloudapp.azure.com:/home/biouser/
biouser@lab-49a4b677-0a18-4825-811e-14b33afb7cfa.uksouth.cloudapp.azure.com's password:
part_1.tar.gz | 490520 kB | 70074.4 kB/s | ETA: 00:00:00 | 100%
PS C:> pscp -P 5047 part_2.tar biouser@lab-49a4b677-0a18-4825-811e-14b33afb7cfa.uksouth.cloudapp.azure.com:/home/biouser/
biouser@lab-49a4b677-0a18-4825-811e-14b33afb7cfa.uksouth.cloudapp.azure.com's password:
part_2.tar | 88064 kB | 1000.7 kB/s | ETA: 00:09:45 | 13%
```
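The transcript above copies the archives to the server but does not unpack them. A minimal sketch of the extraction step follows; it assumes GNU tar and builds a tiny stand-in archive (the `demo_src` directory and its contents are hypothetical) so that it can be run anywhere. In the assessment itself only the `tar -xzvf` line would be needed.

```shell
# Build a tiny stand-in archive so the example runs anywhere;
# in the assessment, part_1.tar.gz is already in the home directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir demo_src
echo "@read1" > demo_src/Strain_A_R1.fastq
tar -czf part_1.tar.gz -C demo_src .

# -x extract, -z decompress gzip, -v list each file, -f archive name
tar -xzvf part_1.tar.gz

# part_2.tar is tarred but not compressed, so -z would be dropped:
# tar -xvf part_2.tar
```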
B. Count the number of reads in each of your
fastq files. Show the code used to do this.
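```
# wc needs -l (letter l, not the digit 1); a FASTQ record spans four lines, so reads = lines / 4
echo $(( $(zcat Strain_A_R1.fastq.gz | wc -l) / 4 ))
```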
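To count every file in one pass rather than one at a time, the same idea can be wrapped in a loop. This is only a sketch: the two small files it creates are stand-ins for the real reads, and `read_counts.tsv` is a hypothetical output name.

```shell
# Create two small gzipped FASTQ files as stand-ins for the real reads
workdir=$(mktemp -d)
cd "$workdir"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > Strain_A_R1.fastq.gz
printf '@r1\nACGT\n+\nIIII\n' | gzip > Strain_A_R2.fastq.gz

# One FASTQ record is four lines, so reads = lines / 4
for f in *.fastq.gz; do
    lines=$(zcat "$f" | wc -l)
    printf '%s\t%d\n' "$f" $(( lines / 4 ))
done > read_counts.tsv

cat read_counts.tsv
```

Writing the counts to a tab-delimited file makes them straightforward to import into R afterwards.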
C. Assemble your read count data into a table by importing your read counts into R. Show the R code used to do this.
```{r}
name <- c("Strain A", "Strain B", "Strain C", "Strain D", "Strain E")
# Store counts as numbers (no quotes or thousands separators) so they can be used in calculations
ReadCount <- c(451170, 483435, 213716, 221353, 828309)
part_1 <- data.frame(name, ReadCount)
knitr::kable(part_1, "pipe", col.names = c("name", "ReadCount"))
```

| name | ReadCount |
|---|---|
| Strain A | 451170 |
| Strain B | 483435 |
| Strain C | 213716 |
| Strain D | 221353 |
| Strain E | 828309 |
A. Make a directory called
assessment_part_1. Show the code used to do this.
```
mkdir assessment_part_1
```
B. Move your raw sequencing reads into the directory you have just created. Show the code used to do this with a wildcard.
```
mv *.fastq.gz assessment_part_1/
ls assessment_part_1/*.fastq.gz
```
C. You have been using conda to manage
your software. Show the code used to create a conda
environment and for installing the program fastqc.
```
conda create -n my_env python=3.10
conda activate my_env
conda install -c bioconda fastqc
```
A. Run fastqc on all of your read files
using a wildcard and move the output into a new directory. Show the code
used to do this.
```
# fastqc does not create the output directory, so make it first
mkdir -p ~/fastfiles
fastqc -o ~/fastfiles/ *.fastq.gz
```
B. Run multiqc on your fastqc output.
Export the “Sequence Quality Histogram” from your multiqc
report and include it below. Give it an appropriate figure legend.
```
cd ~/fastfiles
multiqc -n fastfiles .
```
C. Import the multiqc_fastqc.txt file
into R using read_tsv(). Create a new column called
Strain using the Sample column as a template.
Using the sub() command, remove the _R1 and
_R2 from your new strain column. You can use the format
pattern1|pattern2 to replace multiple strings at once with
sub(). Repeat this for the Sample column, but
this time remove the strain information, which should just leave either
R1 or R2.
Using ggplot, plot a bar graph of the average sequence
length by strain. Colour the bars based on whether they are the R1 or R2
reads (make sure you use position = 'dodge' otherwise you
will make a stacked bar chart). Do not use the default colours, change
these to two colours of your own choosing. Rename the y-axis and change
the name of the colour key from Sample to
Read set.
INSERT R CODE HERE
A. Using trimmomatic, trim each pair of
read files to remove any adapter sequences remaining from the sequencing
process. You should end up with two paired and two
unpaired files per strain. Show the code used to do this.
```
conda install -c bioconda trimmomatic
```
B. Delete the unpaired files. Make a new directory
called trimmomatic and move your newly trimmed files into
this directory. Show the code used to do this.
```
mkdir trimmomatic
# assumes the trimmed file names contain "paired" or "unpaired"
rm *unpaired*
mv *paired* trimmomatic/
```
C. As previously, count the number of reads in each of your paired files. Calculate how many reads, if any, have been lost during the trimming process. Collate this information into a spreadsheet and import it into R. Visualise this as a table in this markdown document. Show the R code used to do this.
INSERT R CODE HERE
A. Make a directory called “spades” containing
sub-directories for each of your strains. Assemble each of your strains
into whole genomes using spades with your newly trimmed
sequencing reads. Show the code you used to run spades for
one of your strains.
    mkdir spades
    cd spades
    mkdir Strain_A Strain_B Strain_C Strain_D Strain_E
B. Make a new directory called “genomes”. Rename
your contigs.fasta files from spades and copy
them into your newly created “genomes” folder. Show the code used to do
this.
    mkdir genomes
    cp contigs.fasta genomes/Strain_A.fasta
C. Using grep, count the number of
contigs in one of your FASTA files. Show the code used to
do this.
INSERT BASH CODE HERE
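One way this could be done is sketched below. Each FASTA record begins with a `>` header line, so counting those lines counts the contigs. The file name `Strain_A.fasta` and its contents are hypothetical stand-ins for a renamed SPAdes output.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# A three-contig stand-in for a SPAdes contigs.fasta file
printf '>NODE_1\nACGTACGT\n>NODE_2\nGGCC\n>NODE_3\nTTAA\n' > Strain_A.fasta

# Every FASTA record starts with ">", so counting header lines counts contigs.
# Quote the pattern: an unquoted > would be treated as output redirection
# and would overwrite the file.
contigs=$(grep -c '^>' Strain_A.fasta)
echo "$contigs"
```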
D. Using seqkit stats, the pipe
| and cut, produce summary statistics for your
genomes, extract the columns file,
num_seqs, N50 and GC
from the output and direct this into a new file. Show the code used to
do this.
INSERT BASH CODE HERE
E. Import your genome summary statistics into R.
Using ggplot(), create a scatter plot of N50 against the
number of contigs, with a different point shape and colour per strain.
Change the size of the points based on the GC percentage. Make sure the
axes are labelled correctly and you have given the graph an appropriate
legend. Show the code used to do this.
INSERT R CODE HERE
F. Upload your genomes to PubMLST’s species identification website. Briefly summarise the results.
INSERT ANSWER HERE
G. Upload your genomes to the CARD website to
identify antibiotic resistance genes in your genomes. You can directly
highlight the table CARD produces and copy it into Excel. Add a column
for genome, then repeat for the remaining four genomes. You only need to
keep the columns for: Genome, ARO Term and Resistance mechanism. Save
this as a .csv file.
Import the spreadsheet into R. Using geom_tile() from
ggplot2, create a heat map showing the presence / absence
of each gene across your five strains. Colour each gene based on the
resistance mechanism it uses. Use a different theme for this figure.
Make sure the figure is labelled and formatted appropriately. Show the
code used to do this.
INSERT R CODE AND OUTPUT HERE
A. Use prokka to annotate your genomes.
Show the code you used to do this for one of your strains.
INSERT BASH CODE HERE
B. Using the .txt file from prokka,
create a spreadsheet containing information on the number of
CDS, genes, rRNAs and
tRNAs for each strain. Import this into R and display
the table below. Show the code used to do this.
INSERT R CODE AND OUTPUT HERE
C. For each of the .ffn files produced
by prokka, search for the 16S gene. Add this
sequence to the bottom of the 16S file you have been provided with. Make
sure you do not use the pre-16S sequence,
however it is OK if the only sequence you find is
partial-16S. Name each sequence after the strain it is
taken from so you know which sequence is from which file. Using
grep, produce a list of all the sequence header lines in
your new 16S file, and show the code you used to do this.
INSERT BASH CODE HERE
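A sketch of how the header lines could be listed, assuming the combined file is called `16S.fasta` (a hypothetical name) and that each sequence has been renamed after its strain. The three records created here are placeholders only.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Placeholder records standing in for the provided 16S file plus appended strains
printf '>Reference_16S\nACGT\n>Strain_A_16S\nACGA\n>Strain_B_16S\nACGG\n' > 16S.fasta

# Print only the lines that begin with ">", i.e. the sequence headers
headers=$(grep '^>' 16S.fasta)
echo "$headers"
```

Anchoring the pattern with `^` restricts the match to the start of a line, which is where FASTA headers sit.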
A. Use prank to perform a multiple
sequence alignment on your 16S file. Show the code you used to do
this.
INSERT BASH CODE HERE
B. Use RAxML-NG to create a
phylogenetic tree using the 16S alignment from prank. Show the code you
used to do this.
INSERT BASH CODE HERE
C. Upload the .bestTree file to iTOL.
Annotate the tree at your discretion; you can alter branch colours,
thickness, fonts, etc. to highlight aspects of the tree. It should look
professional and suitable for publication. Export the tree and include
it below. Include a figure legend.
INCLUDE FIGURE HERE
NOTE: You should aim for each answer to be no more than 200 words.
A. Using your multiqc and
fastqc data, comment on the quality of your sequencing
reads. What can you do to improve the quality of reads if they are below
Q30?
Higher Phred scores mean more accurate base calls, so reads should ideally score well above Q30, towards Q40. Q30 corresponds to an error probability of 0.1% (one incorrect base per 1,000), which still adds up to many errors across an entire genome. Read quality can be improved both in software and in the laboratory. In software, low-quality bases and adapter sequences can be trimmed, poor reads can be filtered out, and algorithms can recognise likely errors in an otherwise correct sequence and account for them. In the laboratory, up-to-date reagents and sequencing kits are likely to improve data quality and reduce error. Researchers should also compare a large number of genome sequences so that errors and outliers are noticed and accounted for.
The read count in the fastqc data varies considerably between samples, but I cannot see this affecting read quality. Lower quality could instead be due to excessive duplication, a high N (uncalled base) content or low-quality bases. A higher Phred score could be achieved with the approach mentioned above, which accounts for complications during DNA sequencing.
Dunning, M. (2019) Assessing read quality. Available at: https://sbc.shef.ac.uk/workshops/2019-01-14-rna-seq-r/read-quality.nb.html (Accessed: 08 December 2023).
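The relationship between a Phred score Q and the probability P that a base call is wrong is P = 10^(-Q/10). The short sketch below uses `awk` to tabulate this for Q20, Q30 and Q40; it is only an illustration of the arithmetic and does not read any sequencing data.

```shell
# Error probability P for a Phred score Q: P = 10^(-Q/10)
phred_table=$(for q in 20 30 40; do
    awk -v q="$q" 'BEGIN { printf "Q%d\t%g\n", q, 10^(-q/10) }'
done)
echo "$phred_table"
```

Each step of 10 in Q reduces the expected error rate tenfold, which is why Q30 (0.001) is commonly used as a quality threshold.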
B. What can you infer from your genome statistics about the quality of your assemblies? Which assembly is the best? Justify your answer.
INSERT ANSWER HERE
C. Describe the process of genome assembly. Why do we do it, what is involved at each step and how does it work?
Genome assembly is the process of piecing an organism's sequenced DNA back together into a genome. It is done to further our understanding of why certain characteristics are expressed in an organism. It can determine the origin of certain diseases or disorders and allow us to target genes in order to repair genetic errors.
Initially, a sample is required; from a patient this is usually blood, a tumour sample or bone marrow. DNA is extracted from the cells using heat and an extraction buffer. The DNA can be amplified by PCR, which has three stages: denaturation, annealing and extension. The DNA is then read by a sequencing machine, for example an Illumina instrument, which generates millions of short sequence reads. Algorithms and other techniques can be used to detect errors that occurred during sequencing, and lower-quality reads can be excluded. The reads are then assembled into a genome, either de novo from the overlaps between reads or by comparison with a reference genome. Any gaps can be filled, and the assembly is validated by comparing it with a wider range of sequences.
NHS (2020) Genomics in clinical genetics. Available at: https://www.genomicseducation.hee.nhs.uk/genomics-in-healthcare/genomics-in-clinical-genetics/
D. What can you infer from the phylogenetic tree and does this agree with the species designations from PubMLST? Which approach do you think is more accurate? Justify your answer.
INSERT ANSWER HERE
E. Why is the 16S gene so widely used in phylogenetic analysis? What are the drawbacks to using this method? Is using 16S sequences still suitable for species assignments? What characteristics does a gene need to have for it to be a good candidate for phylogenetic analysis?
The 16S rRNA gene is used because it is vital for cell function and is present in all prokaryotes (bacteria and archaea). Most of the gene is highly conserved between organisms, but it also contains variable regions that allow different species to be differentiated. The gene evolves relatively slowly, so it is useful for comparing a wide variety of organisms, and it can be amplified using PCR. Whether it is effective for determining species is arguable, as closely related organisms can have 16S genes that are very similar and difficult to distinguish.
To be a good candidate for phylogenetic analysis, a gene must carry enough evolutionary information to resolve relationships between species and to show how it was inherited. It must be amenable to PCR so that it can be amplified for sequencing. It should also have a crucial function, which ensures that it is present in a large variety of species.
Johnson, J.S., Spakowicz, D.J., Hong, BY. et al. Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nat Commun 10, 5029 (2019). https://doi.org/10.1038/s41467-019-13036-1
F. For this report you have used Illumina DNA sequencing data. What other sequencing technologies can be used for genome assembly, and how do they compare to Illumina?
There are a number of other technologies for DNA sequencing and genome assembly, including Sanger sequencing, PacBio and Oxford Nanopore. In most cases there is a trade-off between accuracy and read length. Short-read methods such as Illumina are highly accurate but provide only short reads, whereas PacBio and Nanopore produce much longer reads that have typically come with a higher error rate. Long reads are more effective at resolving complex regions of the genome, such as repetitive sequences. Long-read methods are also generally less cost-effective and have a lower output in comparison.
Niraj Rayamajhi, Chi-Hing Christina Cheng, Julian M Catchen, Evaluating Illumina-, Nanopore-, and PacBio-based genome assembly strategies with the bald notothen, Trematomus borchgrevinki, G3 Genes|Genomes|Genetics, Volume 12, Issue 11, November 2022, jkac192, https://doi.org/10.1093/g3journal/jkac192
In practice, DNA sequencing is most effective when these methods are combined so that their findings can be compared. With data from both short and long reads, a more complete and precise picture of the genome sequence can be achieved. The strengths of Illumina make up for the weaknesses of other technologies and vice versa.
The data for this scenario were collected during the AGILE Candidate Specific Trial (CST)−2 (Donovan-Banfield et al., 2022) in which humans who were infected with SARS-CoV-2 were randomized to molnupiravir treatment or placebo. SARS-CoV-2 has a positive-sense, single-stranded RNA (+ssRNA) genome (see this link or this link for explanation and this link for a genome map).
Briefly, nasopharyngeal swabs were collected at several time points, but RNA was extracted from those collected on days 1, 3 and 5 and converted to cDNA before being amplified in a process that added sequencing adapters for the Illumina platform. Sequencing was carried out on a NovaSeq 6000 with a 2 x 150 bp run (for more details see Donovan-Banfield et al., 2022 and associated supplementary information). This is referred to as ampliconic sequencing.
After being metabolised, molnupiravir acts as a nucleoside analogue, such that it can be incorporated into newly synthesised strands of the RNA genome opposite a G nucleotide. Upon incorporation it frequently changes conformation into an alternative tautomer that can base pair with an A nucleotide. This process most frequently results in G-to-A or (if occurring during positive-strand synthesis) C-to-U transitions (this is nicely explained in figure 1 of Sanderson et al., 2023 and in figure 1b of Donovan-Banfield et al., 2022). Sanderson et al. (2023) have recently shown that the characteristic mutation spectrum of molnupiravir can be detected in population samples (associated with correlates of molnupiravir treatment frequency/likelihood), suggesting that treatment with molnupiravir has contributed to the evolution of new SARS-CoV-2 variants.
Your task is to examine the ampliconic sequencing data from two patients infected with SARS-CoV-2 from the BA.1 lineage (part of the omicron group: see here or here for more information on lineages). One patient was treated with molnupiravir, while the other was treated with placebo (see metadata file). The task of detecting variation throughout the genome is referred to as variant detection, and we can identify single-nucleotide polymorphisms (SNPs), insertions/deletions (indels) and more complex variants such as multinucleotide polymorphisms (MNPs). Because our study organism is essentially haploid, we are looking at variation within a population from which we have sampled a pool. When we focus on DNA pools derived from a large population (of RNA viruses in this case), we need to consider that variation can be continuous (and is not limited to discrete values, e.g. 0, 0.5, 1.0 in a single-individual diploid pool); this is known as deep sequencing (deepSeq). You will develop and execute a deepSeq workflow.
A. Make a directory called
assessment_part_2 and move into it. Show the code used to
do this.
```
mkdir assessment_part_2
cd assessment_part_2
```
B. Extract the contents of the tarballed (but not
compressed) part_2 file. This contains 6 samples worth of
sequence data (3 timepoints each from 2 patients) as well as the
SARS-CoV-2 Wuhan reference genome and a metadata file. Show the code
used to do this.
```
tar -xvf part_2.tar
```
C. Count the number of reads in each of your
FASTQ files. Show the code used to do this for one
sample.
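```
# wc needs -l (letter l, not the digit 1); a FASTQ record spans four lines, so reads = lines / 4
echo $(( $(zcat SRR19914904_1.fastq.gz | wc -l) / 4 ))
```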
D. Which FASTQ file has the smallest
number of reads? How many reads does it have?
SRR19914996 has the fewest reads, at 247627.
A. Use the appropriate conda commands
to install fastp, bwa-mem2,
samtools, sambamba, freebayes,
vcftools, bcftools, vcflib
(v.1.0.3) and tabixpp (v.1.1.0). You can call the
environments whatever you want. Show the code used to create these.
```
conda create -n freebayesenv
conda activate freebayesenv
conda install -c bioconda freebayes
# Repeat for each of the other programs; if conda fails, try: sudo apt install <software>
```
B. Find out the version numbers of all the software installed in part A and present this as a list.
```
freebayes (1.3.6-1)
samtools (1.13-4)
sambamba (0.8.2+dfsg-2)
fastp (0.20.1+dfsg-1)
vcftools (0.1.16-3)
bcftools (1.13-1)
vcflib (v.1.0.3)
tabixpp (v.1.1.0)
bwa-mem2 (2.2.1)
```
C. Provide full bibliographic entries in the
Harvard/NTU style (as you would find in a reference list) for the
fastp, sambamba and FreeBayes
programs.
Vasimuddin, M., Misra, S., Li, H. and Aluru, S., 2019. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. IEEE Parallel and Distributed Processing Symposium (IPDPS).
Tarasov, A., Vilella, A.J., Cuppen, E., Nijman, I.J. and Prins, P., 2015. Sambamba: fast processing of NGS alignment formats. Bioinformatics.
Garrison, E. and Marth, G., 2012. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN].
A. Using the fastp program, trim and
filter every pair of FASTQ files (a separate line of code
will be required for each sample). Use default settings but add the
--html flag to generate suitably named reports. Show the
code for at least one sample (pair of read files).
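    # output and report file names are placeholders
    fastp -i SRR19914904_1.fastq.gz -I SRR19914904_2.fastq.gz -o SRR19914904_1.trimmed.fastq.gz -O SRR19914904_2.trimmed.fastq.gz --html SRR19914904_fastp.html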
B. Using the reports you’ve generated in part A, and importing data into R, prepare a table showing the percentage of reads passing the filters for every sample. Show the R code used to do this and display the table.
INSERT R CODE AND OUTPUT HERE
C. Choosing (and identifying) only one sample by accession, show a representative before and after filtering quality plot (these can also be found in the reports). Present the two figures here.
INSERT ANSWER HERE
A. The reference genome is named
Wuhan-Hu-1_NC_045512.2.fasta. Use the bwa-mem2
package’s index program to index the reference genome. Show
the code used to do this.
```
bwa-mem2 index Wuhan-Hu-1_NC_045512.2.fasta
```
B. For every sample, map the
fastp-trimmed reads to the reference and generate an output
SAM file. Show the code for one sample (pair of input read
files).
INSERT BASH CODE HERE
C. For each SAM file, use the
sambamba (or samtools) package
(view and sort programs) to convert to
BAM format and sort. In each case make sure to add
read-group data corresponding to each sample’s accession. Show the code
for one sample.
INSERT BASH CODE HERE
D. For each sorted BAM file, use the
appropriate sambamba program to create a new
BAM file in which PCR duplicates are marked. Show the code
for one sample.
INSERT BASH CODE HERE
At the end of this question you may wish to discard SAM
and unmarked BAM files to preserve disk space.
A. Use the sambamba (or
samtools) flagstat program to examine all
BAM files. Show the code for one sample.
INSERT BASH CODE HERE
B. Collate data from part A, recording the total
number of reads, the number of duplicates and the number of mapped reads
for every sample. Import into R and use
tidyverse functions to calculate the
proportion of mapped reads and duplicated reads for
every sample. Show the R code you used to do this and display the table
below.
INSERT R CODE AND OUTPUT HERE
C. Use the samtools coverage program to
generate a coverage histogram from a single BAM file.
Present this representative histogram with a legend that identifies the
sample. For this answer you may exceptionally use a screenshot because
the output is displayed on the command line. There is no need to show
the code here.
INSERT SCREENSHOT IMAGE HERE
D. Use the samtools depth program and a
single BAM file to generate a
tab-delimited (TSV format) output file with coverage data
for all positions (check options) in one sample. Show
the code used to generate this table.
INSERT BASH CODE HERE
E. Import the tab-delimited data into R and produce a plot that shows how coverage varies by position (identify the sample in your legend). Show the code used to do this and present this as a figure.
INCLUDE R CODE AND OUTPUT HERE
Now we will use the FreeBayes program to call variants
and then BCFTools to filter the VCF file
A. Use the ls command (with a suitable
option) to generate a file that lists all the (duplicate-marked)
BAM file names, one per line (with no other information).
You may give the list file any suitable name. Show the code you used to
do this.
    ls -1 *.bam > bam_list.txt
B. Use the freebayes program with
options appropriate for pooled data of type described in the background
section and with the BAM file list (generated in part A) to
call variants across all samples into a single
VCF file (with any suitable name). Executing this code may
take some time. Show the code used to do this.
INSERT BASH CODE HERE
C. Use the bcftools stats program to
derive statistics from your (currently unfiltered) VCF file. Pass in the
reference genome and instruct the program to analyze all samples and all
sites. Write the output to a suitably named file. Show the code used to
do this.
INSERT BASH CODE HERE
D. Use the plot-vcfstats helper script
to produce a graph from the file generated in part C. Find the plot that
shows the proportion of different substitution types.
Insert/display the PNG below and provide a figure
legend.
INSERT FIGURE HERE
E. Use the bcftools filter program to
retain only sites with an overall (PHRED-scaled) quality score > 20
and with at least one sample having a depth of coverage greater than 10.
Pipe the output to the bcftools +split program and separate
the output files into a (suitably named) folder as uncompressed
VCF files. Show the code used to do this.
INSERT BASH CODE HERE
Now you will display the filtered variants (from all samples).
A. Descend into the output directory from Q6 part E
and use the bcftools query program to build a (suitably
named) tab-delimited output file containing the following columns:
%CHROM, %POS, %TYPE, %QUAL, %REF, %FIRST_ALT, and from the sample
specific (FORMAT) fields: %SAMPLE, %DP and %AD{1}. You will need to
process each VCF file with a separate line of code and
append results to the output TSV file using the
>> operator. Show every line of code
used to do this (for all samples).
INSERT BASH CODE HERE
B. Import the metadata CSV file and the
tabular variant TSV file into R using tidyverse
read_*() functions and assign to suitably named objects.
Show the R code used to do this.
INSERT R CODE HERE
C. Use the tidyverse left_join()
function to add metadata to the variant tibble. Use tidyverse to filter
the output tibble to display only SNPs (that is sites that contain
only SNPs in the TYPE column). Show the R code used to
carry out these steps.
INSERT R CODE HERE
D. Using the mutate() function, add a
“MAF” (minor allele frequency) column to the tibble in which you
calculate the proportion of reads supporting the (first) alternative
allele. Additionally convert the day variable into a factor
variable. Show the code used to do this.
INSERT R CODE HERE
E. Use the ggplot2 package to create a
box and whiskers plot with day on the x axis and
MAF on the y axis. Split the plot into panels (facets) with
rows showing the reference allele and columns displaying the (first)
alternative allele. Differentiate boxes by treatment to
show the effect of treatment on allele frequencies at each time point.
Make some changes to the theme and axes to improve plot
appearance/clarity. Show the R code and output plot here.
INSERT R CODE AND OUTPUT HERE
F. Use the mutate() function again to
add a column called mutation which shows, for example
“A>G” to a change from an A base in the reference to a G for the
(first) alternative allele. You will need to use either
str_c() (tidyverse) or paste0() (base R)
functions to assist with this. (Make additional small adjustments if you
believe they improve downstream plotting). Show the code used to do
this.
INSERT R CODE HERE
G. Use the filter() tidyverse function
to select samples from day 3 onwards (eliminating day 1) and that have a
MAF between 0.2 and 0.8. Pipe the output to the tidyverse
count() function and use this to tabulate by
treatment and mutation. Assign the output of
this pipeline to a suitably named new object. Show the code used to do
this.
INSERT R CODE HERE
H. Use the ggplot2 package to create a
bar plot with treatment on the x axis and n
(the count of sites) on the y axis. Split the plot into facets again,
but using the mutation variable. In your faceting call
manually adjust the layout of the subplots for optimal display and
adjust y axes so that they are permitted to differ between subplots.
Make use of any additional axis modifications, aesthetic calls and/or
theme adjustments within your code to improve appearance. Show the code
used to do this and display the plot.
INSERT R CODE AND OUTPUT HERE
NOTE: You should aim for each answer to be no more than 200 words.
A. Explain the functions of the fastp,
sambamba and FreeBayes programs in your
workflow.
Preprocessing is done by fastp, which trims adapter sequences from the reads, can correct inaccurate bases and filters out reads that do not meet the required Phred score. Next, sambamba converts the SAM files to BAM format, sorts and indexes them, and marks PCR duplicates so that all samples can be analysed together. Finally, FreeBayes calls variants by comparing the mapped reads with the reference genome and determining the alleles actually present in each sample.
B. What is the minimum length of read below which a
read would be discarded by fastp (given the code used in
question 3 part A)?
INSERT ANSWER HERE
C. Across all your samples how many variants were
identified by FreeBayes and how many passed the
bcftools filter? How many simple SNPs were identified after
the filtering step in question 7 part c? How might these decreasing
numbers reflect the possibility of false negatives and false
positives?
INSERT ANSWER HERE
D. Can you see a signature of molnupiravir mutagenesis? Consider quantity of variants and the mutation spectrum in your answer.
INSERT ANSWER HERE
Donovan-Banfield, I.A., Penrice-Randal, R., Goldswain, H., Rzeszutek, A.M., Pilgrim, J., Bullock, K., Saunders, G., Northey, J., Dong, X., Ryan, Y. and Reynolds, H., 2022. Characterisation of SARS-CoV-2 genomic variation in response to molnupiravir treatment in the AGILE Phase IIa clinical trial. Nature communications, 13(1), p.7284.
Sanderson, T., Hisner, R., Donovan-Banfield, I.A., Hartman, H., Løchen, A., Peacock, T.P. and Ruis, C., 2023. A molnupiravir-associated mutational signature in global SARS-CoV-2 genomes. Nature, pp.1-3.