Sequence Analysis Report

Assessment Information

This assessment is:

due on Friday 8th of December, 2.30pm,
split into two different scenarios and you must answer all the questions from each scenario.

Your submission will be an edited form of this document. You should make sure to:

include your name and student number in the author line above,
check the GBA on the NOW page for further guidance.

Useful Information

To create code blocks, use ``` above and below the code.
When inserting BASH code, you do not need to add anything to the code block.
When inserting R code, add {r} after the first ```.

To insert an image, use the following code within an R code block:

knitr::include_graphics("FILE PATH TO IMAGE")

You can include a number of extra arguments within the {r} such as…

fig.align to control the position
fig.cap to add a legend

When you do not need to show the code within an R code block, such as when you are just inserting a figure, include the argument echo = FALSE within the {r}.

Scenario 1

Background

An experiment has been running on the International Space Station (ISS) over the last few years called Microbial Tracking. Samples have been collected from the air and surfaces across the ISS, transferred back to Earth and sequenced in NASA’s laboratories. Your job is to assemble the raw sequencing data from the ISS into whole genomes, conduct analysis on these genomes and other public data and visualise some of the output.

Q1. Collecting your sequences

A. Extract the contents of your compressed part_1 file. Show the code used to do this.

```
PS C:> pscp -P 5047 part_1.tar.gz biouser@lab-49a4b677-0a18-4825-811e-14b33afb7cfa.uksouth.cloudapp.azure.com:/home/biouser/
biouser@lab-49a4b677-0a18-4825-811e-14b33afb7cfa.uksouth.cloudapp.azure.com's password:
part_1.tar.gz | 490520 kB | 70074.4 kB/s | ETA: 00:00:00 | 100%

PS C:> pscp -P 5047 part_2.tar biouser@lab-49a4b677-0a18-4825-811e-14b33afb7cfa.uksouth.cloudapp.azure.com:/home/biouser/
biouser@lab-49a4b677-0a18-4825-811e-14b33afb7cfa.uksouth.cloudapp.azure.com's password:
part_2.tar | 88064 kB | 1000.7 kB/s | ETA: 00:09:45 | 13%
```
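The transfer log above copies the archives to the remote machine but does not unpack them. A minimal sketch of the extraction step follows, run on a small synthetic archive so it can be tried safely; the `demo_*` names are hypothetical stand-ins. For the real data the equivalent command would presumably be `tar -xzvf part_1.tar.gz`.

```shell
# Build a tiny gzip-compressed tarball to practise on (hypothetical demo data).
workdir=$(mktemp -d)
cd "$workdir"
mkdir demo_reads
printf '@read1\nACGT\n+\nIIII\n' > demo_reads/demo_R1.fastq
tar -czf demo_part.tar.gz demo_reads
rm -r demo_reads

# -t lists the archive contents without extracting anything.
tar -tzf demo_part.tar.gz

# -x extracts, -z decompresses gzip, -v prints each file name, -f names the archive.
tar -xzvf demo_part.tar.gz

# Confirm that the file came back out of the archive.
extracted_files=$(find demo_reads -type f | wc -l)
echo "Extracted files: $extracted_files"
```

For an archive that is tarballed but not compressed, the `-z` flag is simply left out.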

B. Count the number of reads in each of your fastq files. Show the code used to do this.

```
# wc -l counts lines; each FASTQ read occupies four lines, so divide by 4
echo $(( $(zcat Strain_A_R1.fastq.gz | wc -l) / 4 ))
```
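To count reads in every file at once, the same idea can be wrapped in a loop. The sketch below is one possible approach, not the method used in this report: it runs on two synthetic files with hypothetical names and writes a tab-separated table that could later be imported into R. `gzip -dc` is used as a portable equivalent of `zcat`.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Synthetic gzipped FASTQ files (hypothetical names): three reads and two reads.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' | gzip > Demo_A_R1.fastq.gz
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > Demo_B_R1.fastq.gz

# For each file: decompress to stdout, count lines, divide by 4 (four lines per read).
for f in *.fastq.gz; do
    lines=$(gzip -dc "$f" | wc -l)
    reads=$((lines / 4))
    printf '%s\t%s\n' "$f" "$reads"
done > read_counts.tsv

cat read_counts.tsv
```

Dividing by four assumes standard four-line FASTQ records with no wrapped sequence lines, which is what Illumina pipelines normally produce.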

C. Assemble your read count data into a table by importing your read counts into R. Show the R code used to do this.

```{r}
name <- c("Strain A", "Strain B", "Strain C", "Strain D", "Strain E")
# Read counts stored as numbers (no quotes or thousands separators)
ReadCount <- c(451170, 483435, 213716, 221353, 828309)
part_1 <- data.frame(name, ReadCount)
knitr::kable(part_1, "pipe", col.names = c("name", "ReadCount"))
```

| name     | ReadCount |
|:---------|----------:|
| Strain A |    451170 |
| Strain B |    483435 |
| Strain C |    213716 |
| Strain D |    221353 |
| Strain E |    828309 |

Q2. Preparing for the workflow

A. Make a directory called assessment_part_1. Show the code used to do this.

```
mkdir assessment_part_1
```

B. Move your raw sequencing reads into the directory you have just created. Show the code used to do this with a wildcard.

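```
# The wildcard moves every gzipped FASTQ file in one command
mv *.fastq.gz assessment_part_1/

# Check that the files arrived
ls assessment_part_1/*.fastq.gz
```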

C. You have been using conda to manage your software. Show the code used to create a conda environment and for installing the program fastqc.

```
conda create -n my_env python=3.10
conda activate my_env
conda install -c bioconda fastqc
```

Q3. Quality checking your sequence data

A. Run fastqc on all of your read files using a wildcard and move the output into a new directory. Show the code used to do this.

```
# fastqc does not create the output directory, so make it first
mkdir -p ~/fastfiles
fastqc -o ~/fastfiles/ *.fastq.gz
```

B. Run multiqc on your fastqc output. Export the “Sequence Quality Histogram” from your multiqc report and include it below. Give it an appropriate figure legend.

```
cd ~/fastfiles
# -n sets the name of the multiqc report
multiqc -n fastfiles *_fastqc.zip
```

C. Import the multiqc_fastqc.txt file into R using read_tsv(). Create a new column called Strain using the Sample column as a template. Using the sub() command, remove the _R1 and _R2 from your new strain column. You can use the format pattern1|pattern2 to replace multiple strings at once with sub(). Repeat this for the Sample column, but this time remove the strain information, which should just leave either R1 or R2.

Using ggplot, plot a bar graph of the average sequence length by strain. Colour the bars based on whether they are the R1 or R2 reads (make sure you use position = 'dodge' otherwise you will make a stacked bar chart). Do not use the default colours, change these to two colours of your own choosing. Rename the y-axis and change the name of the colour key from Sample to Read set.

INSERT R CODE HERE

Q4. Trimming your sequence reads

A. Using trimmomatic, trim each pair of read files to remove any adapter sequences remaining from the sequencing process. You should end up with two paired and two unpaired files per strain. Show the code used to do this.

```
conda install -c bioconda trimmomatic
```

B. Delete the unpaired files. Make a new directory called trimmomatic and move your newly trimmed files into this directory. Show the code used to do this.

```
# Delete the unpaired files first, so that only the paired files remain to be moved
rm *unpaired*
mkdir trimmomatic
mv *paired* trimmomatic/
```

C. As previously, count the number of reads in each of your paired files. Calculate how many reads, if any, have been lost during the trimming process. Collate this information into a spreadsheet and import it into R. Visualise this as a table in this markdown document. Show the R code used to do this.

INSERT R CODE HERE

Q5. Assembling your genomes

A. Make a directory called “spades” containing sub-directories for each of your strains. Assemble each of your strains into whole genomes using spades with your newly trimmed sequencing reads. Show the code you used to run spades for one of your strains.

```
mkdir spades
cd spades
# One sub-directory per strain
mkdir Strain_A Strain_B Strain_C Strain_D Strain_E
```

B. Make a new directory called “genomes”. Rename your contigs.fasta files from spades and copy them into your newly created “genomes” folder. Show the code used to do this.

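```
mkdir genomes
# Rename while copying so the files do not overwrite each other (repeat for each strain)
cp spades/Strain_A/contigs.fasta genomes/Strain_A.fasta
```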

C. Using grep, count the number of contigs in one of your FASTA files. Show the code used to do this.

INSERT BASH CODE HERE
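Every record in a FASTA file starts with a header line beginning with `>`, so counting those lines counts the contigs. A minimal sketch on a synthetic file follows; the file name and sequences are hypothetical, and a real run would point at one of the files in the genomes directory.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Synthetic three-contig FASTA file (hypothetical name and sequences).
printf '>NODE_1_length_8\nACGTACGT\n>NODE_2_length_4\nGGCC\n>NODE_3_length_4\nTTAA\n' > demo_contigs.fasta

# -c prints the number of matching lines; ^> anchors the match to the start of the line.
contig_count=$(grep -c '^>' demo_contigs.fasta)
echo "Contigs: $contig_count"
```

The `>` must be quoted. Unquoted, the shell treats it as a redirection and can overwrite the file being searched.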

D. Using seqkit stats, the pipe | and cut, produce summary statistics for your genomes, extract the columns file, num_seqs, N50 and GC from the output and direct this into a new file. Show the code used to do this.

INSERT BASH CODE HERE

E. Import your genome summary statistics into R. Using ggplot(), create a scatter plot of N50 against the number of contigs, with a different point shape and colour per strain. Change the size of the points based on the GC percentage. Make sure the axes are labelled correctly and you have given the graph an appropriate legend. Show the code used to do this.

INSERT R CODE HERE

F. Upload your genomes to PubMLST’s species identification website. Briefly summarise the results.

INSERT ANSWER HERE

G. Upload your genomes to the CARD website to identify antibiotic resistance genes in your genomes. You can directly highlight the table CARD produces and copy it into Excel. Add a column for genome, then repeat for the remaining four genomes. You only need to keep the columns for: Genome, ARO Term and Resistance mechanism. Save this as a .csv file.

Import the spreadsheet into R. Using geom_tile() from ggplot2, create a heat map showing the presence / absence of each gene across your five strains. Colour each gene based on the resistance mechanism it uses. Use a different theme for this figure. Make sure the figure is labelled and formatted appropriately. Show the code used to do this.

INSERT R CODE AND OUTPUT HERE

Q6. Annotating your genomes

A. Use prokka to annotate your genomes. Show the code you used to do this for one of your strains.

INSERT BASH CODE HERE

B. Using the .txt file from prokka, create a spreadsheet containing information on the number of CDS, genes, rRNAs and tRNAs for each strain. Import this into R and display the table below. Show the code used to do this.

INSERT R CODE AND OUTPUT HERE

C. For each of the .ffn files produced by prokka, search for the 16S gene. Add this sequence to the bottom of the 16S file you have been provided with. Make sure you do not use the pre-16S sequence, however it is OK if the only sequence you find is partial-16S. Name each sequence after the strain it is taken from so you know which sequence is from which file. Using grep, produce a list of all the sequence header lines in your new 16S file, and show the code you used to do this.

INSERT BASH CODE HERE
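One way the extraction and header listing could be approached is sketched below, using awk to pull out the 16S record, sed to rename it and grep to list the headers. The file names, locus tags and sequences are synthetic stand-ins, and the assumption that the relevant header contains the text `16S ribosomal RNA` should be checked against the actual .ffn files.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Synthetic stand-ins for a prokka .ffn file and the provided 16S file (hypothetical content).
printf '>DEMO_00001 hypothetical protein\nATGAAATAA\n>DEMO_00002 16S ribosomal RNA\nAGAGTTTGATCC\nTGGCTCAG\n>DEMO_00003 23S ribosomal RNA\nGGTTAAGC\n' > Demo_A.ffn
printf '>Reference_1\nAGAGTTTGATCATGGCTCAG\n' > demo_16S.fasta

# awk re-evaluates the keep flag at every header line, so the flag also
# covers the sequence lines that follow a matching header.
# sed then renames the header after the strain, and >> appends to the 16S file.
awk '/^>/ { keep = ($0 ~ /16S ribosomal RNA/) } keep' Demo_A.ffn |
    sed 's/^>.*/>Demo_A/' >> demo_16S.fasta

# List every sequence header in the combined file.
headers=$(grep '^>' demo_16S.fasta)
echo "$headers"
```

If a file contains several 16S copies, this sketch would append all of them under the same name, so the match should be inspected by eye before appending.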

Q7. Aligning 16S sequences and generating a phylogenetic tree

A. Use prank to perform a multiple sequence alignment on your 16S file. Show the code you used to do this.

INSERT BASH CODE HERE

B. Use RAxML-NG to create a phylogenetic tree using the 16S alignment from prank. Show the code you used to do this.

INSERT BASH CODE HERE

C. Upload the .bestTree file to iTOL. Annotate the tree at your discretion; you can alter branch colours, thickness, fonts, etc. to highlight aspects of the tree. It should look professional and suitable for publication. Export the tree and include it below. Include a figure legend.

INCLUDE FIGURE HERE

Q8. Interpretation

NOTE: You should aim for each answer to be no more than 200 words.

A. Using your multiqc and fastqc data, comment on the quality of your sequencing reads. What can you do to improve the quality of reads if they are below Q30?

Ideally Phred scores should be Q30 or above, with scores of around Q40 (99.99% accuracy) being better still. Q30 means that there is a 1 in 1,000 chance that a base call is wrong (99.9% accuracy); this is a widely used benchmark, but errors at this rate still add up across an entire genome. To improve read quality there are several strategies, including software fixes and practical improvements. On the software side, low-quality bases and reads can be trimmed or filtered out, and error-correction algorithms can recognise likely errors in a DNA sequence and account for them, reducing inaccuracy. In practice, improved reagents and genome sequencing kits are also likely to improve data quality and reduce error. Researchers should also be ready to compare and cross-examine a large number of genome sequences in order to notice when errors occur, and to adapt to these outliers.

The read count varies considerably between samples in the fastqc data, but I do not expect this to affect read quality. Where quality is reduced, it could be due to excessive duplication, a high N (uncalled base) content or lower-quality bases. Achieving a higher Phred score would rely on the approaches described above, which account for errors introduced during DNA sequencing.
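For reference, the relationship between a Phred score and base-call accuracy can be written out explicitly. This is the standard definition rather than anything specific to the data in this report, with $Q$ the Phred score and $P$ the probability that a base call is wrong:

$$
Q = -10 \log_{10} P \quad\Longleftrightarrow\quad P = 10^{-Q/10}
$$

So Q30 gives $P = 10^{-3}$ (1 error per 1,000 bases, 99.9% accuracy), Q40 gives $P = 10^{-4}$ (99.99%) and Q20 gives $P = 10^{-2}$ (99%).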

Dunning, M. (2019) Assessing Read quality, Assessing read quality. Available at: https://sbc.shef.ac.uk/workshops/2019-01-14-rna-seq-r/read-quality.nb.html (Accessed: 08 December 2023).

B. What can you infer from your genome statistics about the quality of your assemblies? Which assembly is the best? Justify your answer.

INSERT ANSWER HERE

C. Describe the process of genome assembly. Why do we do it, what is involved at each step and how does it work?

Genome assembly is the process of piecing together the short reads produced by sequencing an organism's DNA to reconstruct its genome. It is done to further our understanding of why certain characteristics are expressed in an organism. It can help determine the origin of certain diseases or disorders and allow us to target genes in order to repair genetic errors.

Initially, a sample is required; for a patient this is usually blood, a tumour sample or bone marrow. DNA is extracted from the cells using heat and an extraction buffer. Where amplification is needed, PCR is used, which has three stages: denaturation, annealing and extension of the DNA molecules. The DNA is then read by a sequencer, for example an Illumina instrument, which generates millions of short sequence reads. Algorithms and other techniques can be used to detect errors that may have occurred during sequencing, and lower-quality reads can be excluded. The reads are then assembled, either de novo by joining overlapping reads into contigs (as SPAdes does in this report) or by comparison with a reference genome, which can also help to fill gaps. The assembly is then validated by comparing it with a wider range of sequences.

NHS (2020) Genomics in clinical genetics. Available at: https://www.genomicseducation.hee.nhs.uk/genomics-in-healthcare/genomics-in-clinical-genetics/

D. What can you infer from the phylogenetic tree and does this agree with the species designations from PubMLST? Which approach do you think is more accurate? Justify your answer.

INSERT ANSWER HERE

E. Why is the 16S gene so widely used in phylogenetic analysis? What are the drawbacks to using this method? Is using 16S sequences still suitable for species assignments? What characteristics does a gene need to have for it to be a good candidate for phylogenetic analysis?

The 16S gene is used because it is vital for cell function and is present in all prokaryotes. The gene does not change much between organisms, but it does contain variable regions which allow different species to be differentiated. The gene evolves relatively slowly, so it is useful for comparing a wide variety of organisms, and it can be amplified using PCR. A drawback is that closely related organisms can have 16S genes that are similar and difficult to distinguish, so it is arguable whether it is effective for determining species.

A good candidate gene for phylogenetic analysis must carry enough evolutionary information (sequence variation) to resolve relationships between species and to reconstruct how it was inherited. It must be amplifiable by PCR, which requires conserved regions for primer binding, so that it can be sequenced. It should also have a crucial function, which ensures that it is present across a wide variety of species.

Johnson, J.S., Spakowicz, D.J., Hong, BY. et al. Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nat Commun 10, 5029 (2019). https://doi.org/10.1038/s41467-019-13036-1

F. For this report you have used Illumina DNA sequencing data. What other sequencing technologies can be used for genome assembly, and how do they compare to Illumina?

There are a number of other technologies for DNA sequencing and genome assembly, including Sanger sequencing, PacBio and Oxford Nanopore. In most cases, a technology provides either high accuracy with shorter reads or longer reads with lower accuracy. Illumina gives short, highly accurate reads, whereas PacBio and Nanopore give much longer reads but have traditionally had a higher error rate. Long reads are more effective at resolving complex regions of the genome, such as repeats. The long-read methods are also known to be less cost-effective and to have a lower output in comparison.

Niraj Rayamajhi, Chi-Hing Christina Cheng, Julian M Catchen, Evaluating Illumina-, Nanopore-, and PacBio-based genome assembly strategies with the bald notothen, Trematomus borchgrevinki, G3 Genes|Genomes|Genetics, Volume 12, Issue 11, November 2022, jkac192, https://doi.org/10.1093/g3journal/jkac192

In practice, genome sequencing is most effective when a combination of these methods is used so that findings can be compared. With data from both short and long reads, a more complete and precise picture of the genome sequence can be achieved. The strengths of Illumina make up for weaknesses in the other technologies and vice versa.

Scenario 2

Background

Data

The data for this scenario were collected during the AGILE Candidate Specific Trial (CST)−2 (Donovan-Banfield et al., 2022) in which humans who were infected with SARS-CoV-2 were randomized to molnupiravir treatment or placebo. SARS-CoV-2 has a positive-sense, single-stranded RNA (+ssRNA) genome (see this link or this link for explanation and this link for a genome map).

Methods

Briefly, nasopharyngeal swabs were collected at several time points, but RNA was extracted only from those collected on days 1, 3 and 5 and converted to cDNA before being amplified in a process that added sequencing adapters for the Illumina platform. Sequencing was carried out on a NovaSeq 6000 with a 2 x 150 bp run (for more details see Donovan-Banfield et al., 2022 and associated supplementary information). This is referred to as ampliconic sequencing.

Molnupiravir

After being metabolised, molnupiravir acts as a nucleoside analogue, such that it can be incorporated into newly synthesised strands of the RNA genome opposite a G nucleotide. Upon incorporation it frequently changes conformation into an alternative tautomer that can base pair with an A nucleotide. This process most frequently results in G-to-A or (if occurring during positive-strand synthesis) C-to-U transitions (this is nicely explained in figure 1 of Sanderson et al., 2023 and in figure 1b of Donovan-Banfield et al., 2022). Sanderson et al. (2023) have recently shown that the characteristic mutation spectrum of molnupiravir can be detected in population samples (associated with correlates of molnupiravir treatment frequency/likelihood), suggesting that treatment with molnupiravir has contributed to the evolution of new SARS-CoV-2 variants.

deepSeq

Your task is to examine the ampliconic sequencing data from two patients infected with SARS-CoV-2 from the BA.1 lineage (part of the omicron group: see here or here for more information on lineages). One patient was treated with molnupiravir, while the other was treated with placebo (see metadata file). The task of detecting variation throughout the genome is referred to as variant detection and we can identify single-nucleotide polymorphisms (SNPs), insertions/deletions (indels) and more complex variants such as multinucleotide polymorphisms (MNPs). Because our study organism is essentially haploid, we are looking at variation within a population from which we have sampled a pool. When we focus on DNA pools derived from a large population (of RNA viruses in this case), we need to consider that allele frequencies can vary continuously (and are not limited to discrete values, e.g. 0, 0.5, 1.0 in a single-individual diploid pool). This approach is known as deep sequencing (deepSeq). You will develop and execute a deepSeq workflow.

Workflow

Q1. Sequence collection

A. Make a directory called assessment_part_2 and move into it. Show the code used to do this.

```
mkdir assessment_part_2
cd assessment_part_2
```

B. Extract the contents of the tarballed (but not compressed) part_2 file. This contains 6 samples worth of sequence data (3 timepoints each from 2 patients) as well as the SARS-CoV-2 Wuhan reference genome and a metadata file. Show the code used to do this.

```
# No -z flag: the archive is tarballed but not compressed
tar -xvf part_2.tar
```

C. Count the number of reads in each of your FASTQ files. Show the code used to do this for one sample.

```
# wc -l counts lines; each FASTQ read occupies four lines, so divide by 4
echo $(( $(zcat SRR19914904_1.fastq.gz | wc -l) / 4 ))
```

D. Which FASTQ file has the smallest number of reads? How many reads does it have?

SRR19914996 has the fewest reads, at 247,627.

Q2. Software installation and reproducibility

A. Use the appropriate conda commands to install fastp, bwa-mem2, samtools, sambamba, freebayes, vcftools, bcftools, vcflib (v.1.0.3) and tabixpp (v.1.1.0). You can call the environments whatever you want. Show the code used to create these.

```
conda create -n freebayesenv
conda activate freebayesenv
conda install -c bioconda freebayes

# Same for all other programs; otherwise try: sudo apt install <software>
```

B. Find out the version numbers of all the software installed in part A and present this as a list.

```
freebayes (1.3.6-1)
samtools (1.13-4)
sambamba (0.8.2+dfsg-2)
fastp (0.20.1+dfsg-1)
vcftools (0.1.16-3)
bcftools (1.13-1)
vcflib (v.1.0.3)
tabixpp (v.1.1.0)
bwa-mem2 (2.2.1)
```

C. Provide full bibliographic entries in the Harvard/NTU style (as you would find in a reference list) for the fastp, sambamba and FreeBayes programs.

Chen, S., Zhou, Y., Chen, Y. and Gu, J., 2018. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), pp.i884-i890.

Tarasov, A., Vilella, A.J., Cuppen, E., Nijman, I.J. and Prins, P., 2015. Sambamba: fast processing of NGS alignment formats. Bioinformatics, 31(12), pp.2032-2034.

Garrison, E. and Marth, G., 2012. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN].

Vasimuddin, M., Misra, S., Li, H. and Aluru, S., 2019. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019.

Q3. Quality control and trimming

A. Using the fastp program, trim and filter every pair of FASTQ files (a separate line of code will be required for each sample). Use default settings but add the --html flag to generate suitably named reports. Show the code for at least one sample (pair of read files).

--html suitablynamed

B. Using the reports you’ve generated in part A, and importing data into R, prepare a table showing the percentage of reads passing the filters for every sample. Show the R code used to do this and display the table.

INSERT R CODE AND OUTPUT HERE

C. Choosing (and identifying) only one sample by accession, show a representative before and after filtering quality plot (these can also be found in the reports). Present the two figures here.

INSERT ANSWER HERE

Q4. Alignment against reference

A. The reference genome is named Wuhan-Hu-1_NC_045512.2.fasta. Use the bwa-mem2 package’s index program to index the reference genome. Show the code used to do this.

```
bwa-mem2 index Wuhan-Hu-1_NC_045512.2.fasta
```

B. For every sample, map the fastp-trimmed reads to the reference and generate an output SAM file. Show the code for one sample (pair of input read files).

INSERT BASH CODE HERE

C. For each SAM file, use the sambamba (or samtools) package (view and sort programs) to convert to BAM format and sort. In each case make sure to add read-group data corresponding to each sample’s accession. Show the code for one sample.

INSERT BASH CODE HERE

D. For each sorted BAM file, use the appropriate sambamba program to create a new BAM file in which PCR duplicates are marked. Show the code for one sample.

INSERT BASH CODE HERE

At the end of this question you may wish to discard SAM and unmarked BAM files to preserve disk space.

Q5. Alignment statistics and genome coverage

A. Use the sambamba (or samtools) flagstat program to examine all BAM files. Show the code for one sample.

INSERT BASH CODE HERE

B. Collate data from part A, recording the total number of reads, the number of duplicates and the number of mapped reads for every sample. Import into R and use tidyverse functions to calculate the proportion of mapped reads and duplicated reads for every sample. Show the R code you used to do this and display the table below.

INSERT R CODE AND OUTPUT HERE

C. Use the samtools coverage program to generate a coverage histogram from a single BAM file. Present this representative histogram with a legend that identifies the sample. For this answer you may exceptionally use a screenshot because the output is displayed on the command line. There is no need to show the code here.

INSERT SCREENSHOT IMAGE HERE

D. Use the samtools depth program and a single BAM file to generate a tab-delimited (TSV format) output file with coverage data for all positions (check options) in one sample. Show the code used to generate this table.

INSERT BASH CODE HERE
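`samtools depth` writes a three-column table (reference name, position, depth), and its `-a` option includes positions with zero coverage. Before importing such a table into R it can be sanity-checked on the command line. The sketch below does this with awk on a synthetic table; the file name and depth values are made up, and the threshold of 10 simply mirrors the depth filter used later in Q6.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Synthetic depth table: reference, 1-based position, depth (hypothetical values).
printf 'NC_045512.2\t1\t0\nNC_045512.2\t2\t12\nNC_045512.2\t3\t30\nNC_045512.2\t4\t18\n' > demo_depth.tsv

# Sum the depth column, count positions at or below a depth of 10,
# and report the mean depth over all rows.
summary=$(awk -F '\t' '{ total += $3; if ($3 <= 10) low++ }
    END { printf "positions=%d mean_depth=%.1f low_positions=%d", NR, total / NR, low }' demo_depth.tsv)
echo "$summary"
```

For SARS-CoV-2 the row count should equal the length of the reference genome if every position was reported, which is a quick check that the all-positions option was applied.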

E. Import the tab-delimited data into R and produce a plot that shows how coverage varies by position (identify the sample in your legend). Show the code used to do this and present this as a figure.

INCLUDE R CODE AND OUTPUT HERE

Q6. Variant calling and filtering

Now we will use the FreeBayes program to call variants and then BCFTools to filter the VCF file.

A. Use the ls command (with a suitable option) to generate a file that lists all the (duplicate-marked) BAM file names, one per line (with no other information). You may give the list file any suitable name. Show the code you used to do this.

```
# -1 prints one file name per line; > writes the list to a file
ls -1 *.bam > bam_list.txt
```

B. Use the freebayes program with options appropriate for pooled data of type described in the background section and with the BAM file list (generated in part A) to call variants across all samples into a single VCF file (with any suitable name). Executing this code may take some time. Show the code used to do this.

INSERT BASH CODE HERE

C. Use the bcftools stats program to derive statistics from your (currently unfiltered) VCF file. Pass in the reference genome and instruct the program to analyze all samples and all sites. Write the output to a suitably named file. Show the code used to do this.

INSERT BASH CODE HERE

D. Use the plot-vcfstats helper script to produce a graph from the file generated in part C. Find the plot that shows the proportion of different substitution types. Insert/display the PNG below and provide a figure legend.

INSERT FIGURE HERE

E. Use the bcftools filter program to retain only sites with an overall (PHRED-scaled) quality score > 20 and with at least one sample having a depth of coverage greater than 10. Pipe the output to the bcftools +split program and separate the output files into a (suitably named) folder as uncompressed VCF files. Show the code used to do this.

INSERT BASH CODE HERE

Q7. Variant display and analysis

Now you will display the filtered variants (from all samples).

A. Descend into the output directory from Q6 part E and use the bcftools query program to build a (suitably named) tab-delimited output file containing the following columns: %CHROM, %POS, %TYPE, %QUAL, %REF, %FIRST_ALT, and from the sample specific (FORMAT) fields: %SAMPLE, %DP and %AD{1}. You will need to process each VCF file with a separate line of code and append results to the output TSV file using the >> operator. Show every line of code used to do this (for all samples).

INSERT BASH CODE HERE

B. Import the metadata CSV file and the tabular variant TSV file into R using tidyverse read_*() functions and assign to suitably named objects. Show the R code used to do this.

INSERT R CODE HERE

C. Use the tidyverse left_join() function to add metadata to the variant tibble. Use tidyverse to filter the output tibble to display only SNPs (that is sites that contain only SNPs in the TYPE column). Show the R code used to carry out these steps.

INSERT R CODE HERE

D. Using the mutate() function, add a “MAF” (minor allele frequency) column to the tibble in which you calculate the proportion of reads supporting the (first) alternative allele. Additionally convert the day variable into a factor variable. Show the code used to do this.

INSERT R CODE HERE

E. Use the ggplot2 package to create a box and whiskers plot with day on the x axis and MAF on the y axis. Split the plot into panels (facets) with rows showing the reference allele and columns displaying the (first) alternative allele. Differentiate boxes by treatment to show the effect of treatment on allele frequencies at each time point. Make some changes to the theme and axes to improve plot appearance/clarity. Show the R code and output plot here.

INSERT R CODE AND OUTPUT HERE

F. Use the mutate() function again to add a column called mutation which shows, for example “A>G” to a change from an A base in the reference to a G for the (first) alternative allele. You will need to use either str_c() (tidyverse) or paste0() (base R) functions to assist with this. (Make additional small adjustments if you believe they improve downstream plotting). Show the code used to do this.

INSERT R CODE HERE

G. Use the filter() tidyverse function to select samples from day 3 onwards (eliminating day 1) and that have a MAF between 0.2 and 0.8. Pipe the output to the tidyverse count() function and use this to tabulate by treatment and mutation. Assign the output of this pipeline to a suitably named new object. Show the code used to do this.

INSERT R CODE HERE

H. Use the ggplot2 package to create a bar plot with treatment on the x axis and n (the count of sites) on the y axis. Split the plot into facets again, but using the mutation variable. In your faceting call manually adjust the layout of the subplots for optimal display and adjust y axes so that they are permitted to differ between subplots. Make use of any additional axis modifications, aesthetic calls and/or theme adjustments within your code to improve appearance. Show the code used to do this and display the plot.

INSERT R CODE AND OUTPUT HERE

Q8. Interpretation

NOTE: You should aim for each answer to be no more than 200 words.

A. Explain the functions of the fastp, sambamba and FreeBayes programs in your workflow.

Preprocessing is done by fastp, which trims adapter sequences from the reads and filters out reads that do not meet the required quality (Phred score) or length thresholds. Next, sambamba is used to convert the SAM alignments to BAM format, sort them and mark PCR duplicates, so that all samples can be analysed together. Finally, FreeBayes compares the aligned reads with the reference genome to call variants (SNPs, indels and MNPs) and to determine the alleles actually present in each sample.

B. What is the minimum length of read below which a read would be discarded by fastp (given the code used in question 3 part A)?

INSERT ANSWER HERE

C. Across all your samples how many variants were identified by FreeBayes and how many passed the bcftools filter? How many simple SNPs were identified after the filtering step in question 7 part c? How might these decreasing numbers reflect the possibility of false negatives and false positives?

INSERT ANSWER HERE

D. Can you see a signature of molnupiravir mutagenesis? Consider quantity of variants and the mutation spectrum in your answer.

INSERT ANSWER HERE

References

Donovan-Banfield, I.A., Penrice-Randal, R., Goldswain, H., Rzeszutek, A.M., Pilgrim, J., Bullock, K., Saunders, G., Northey, J., Dong, X., Ryan, Y. and Reynolds, H., 2022. Characterisation of SARS-CoV-2 genomic variation in response to molnupiravir treatment in the AGILE Phase IIa clinical trial. Nature communications, 13(1), p.7284.

Sanderson, T., Hisner, R., Donovan-Banfield, I.A., Hartman, H., Løchen, A., Peacock, T.P. and Ruis, C., 2023. A molnupiravir-associated mutational signature in global SARS-CoV-2 genomes. Nature, pp.1-3.

Sequence Analysis Report

Oliver Gamble N1002136
