ISG 5312 Final Project

Introduction

Influenza A viruses are quite known for their segmented genomes and low fidelity polymerases which in turn lead to a rapid evolution, causing major threats to animal and human health. In the paper that I chose to use, the main focus for this study is to see how the vaccine can affect the viral evolution, specifically pertaining to the quasispecies diversity. The way that they chose to study this was with next-generation sequencing of 32 SIAV genomes collected at various time points post-infection. In this study, 16 pigs were used, divided into two groups: 8 vaccinated and 8 nonvaccinated. All of the pigs were then in turn challenged with H1N2 SIAV.

Experimental Setup

From the 16 pigs, 8 of them were inoculated with RESPIPORC FLU3 (Ceva®), a trivalent inactivated vaccine including H1N2, H1N1, and H3N2 strains. After the vaccine was given to half of the study group, the whole group was challenged with Influenza A virus (A/Swine/Spain/80598LP4/2007 (H1N2)). After that samples were collected at 2, 5, and 9 days post-inoculation (dpi). For the sequencing, NGS was used to collect the complete genome with the Illumina MiSeq. Overall there were 32 genomes sequenced, corresponding to the multiple different points that the pigs were sampled. High coverage was achieved, with 88.91% of genome positions represented by at least 50 reads.

Coding Setup

For the code, as represented on GitHub, I began with the quality control checks using FastQC and MultiQC. All reads were seen to be used against the reference https://www.ncbi.nlm.nih.gov/nuccore/HF674912.1/. Alignment was completed with bwa-mem2 as well as samtools, and samblaster. This was followed by the use of bedtools and genmap. Freebayes was then used for variant calling. Variants were then filtered.

Discussion

All results are just for segment 4 of the influenza genome. This was the original results of the QC from multiQC

Sample Name % Dups % GC Length M Seqs
SRR25266111_trim.1 61.0% 44% 134 bp 0.0
SRR25266111_trim.2 59.8% 44% 133 bp 0.0
SRR25266112_trim.1 75.3% 44% 136 bp 0.1
SRR25266112_trim.2 73.1% 44% 135 bp 0.1
SRR25266113_trim.1 45.9% 44% 135 bp 0.0
SRR25266113_trim.2 43.7% 44% 134 bp 0.0
SRR25266114_trim.1 60.2% 44% 135 bp 0.0
SRR25266114_trim.2 58.3% 44% 135 bp 0.0
SRR25266115_trim.1 66.5% 44% 135 bp 0.0
SRR25266115_trim.2 65.1% 44% 135 bp 0.0
SRR25266116_trim.1 57.6% 44% 136 bp 0.0
SRR25266116_trim.2 56.3% 44% 136 bp 0.0
SRR25266117_trim.1 49.1% 43% 123 bp 0.0
SRR25266117_trim.2 49.1% 43% 123 bp 0.0
SRR25266118_trim.1 64.3% 46% 135 bp 0.0
SRR25266118_trim.2 63.5% 46% 135 bp 0.0
SRR25266119_trim.1 60.7% 44% 133 bp 0.0
SRR25266119_trim.2 58.7% 44% 132 bp 0.0
SRR25266120_trim.1 54.3% 44% 135 bp 0.0
SRR25266120_trim.2 53.5% 44% 135 bp 0.0
SRR25266121_trim.1 72.0% 44% 132 bp 0.1
SRR25266121_trim.2 69.3% 44% 131 bp 0.1
SRR25266122_trim.1 71.3% 44% 136 bp 0.1
SRR25266122_trim.2 69.8% 44% 135 bp 0.1
SRR25266123_trim.1 49.2% 44% 134 bp 0.0
SRR25266123_trim.2 43.0% 44% 132 bp 0.0
SRR25266124_trim.1 50.1% 44% 125 bp 0.0
SRR25266124_trim.2 44.2% 44% 123 bp 0.0
SRR25266125_trim.1 63.7% 44% 130 bp 0.0
SRR25266125_trim.2 62.8% 44% 130 bp 0.0
SRR25266126_trim.1 68.3% 44% 134 bp 0.0
SRR25266126_trim.2 66.8% 44% 133 bp 0.0
SRR25266127_trim.1 66.7% 44% 134 bp 0.0
SRR25266127_trim.2 65.3% 44% 134 bp 0.0
SRR25266128_trim.1 44.5% 43% 131 bp 0.0
SRR25266128_trim.2 44.0% 43% 131 bp 0.0
SRR25266129_trim.1 59.0% 45% 136 bp 0.0
SRR25266129_trim.2 56.4% 45% 135 bp 0.0
SRR25266130_trim.1 53.8% 45% 133 bp 0.0
SRR25266130_trim.2 52.8% 45% 133 bp 0.0
SRR25266131_trim.1 66.2% 44% 137 bp 0.0
SRR25266131_trim.2 64.3% 44% 136 bp 0.0
SRR25266132_trim.1 39.9% 43% 134 bp 0.0
SRR25266132_trim.2 39.4% 43% 134 bp 0.0
SRR25266133_trim.1 61.1% 44% 138 bp 0.0
SRR25266133_trim.2 59.4% 44% 137 bp 0.0
SRR25266134_trim.1 51.1% 44% 133 bp 0.0
SRR25266134_trim.2 51.2% 44% 133 bp 0.0
SRR25266135_trim.1 53.4% 46% 132 bp 0.0
SRR25266135_trim.2 52.9% 46% 132 bp 0.0
SRR25266136_trim.1 78.5% 45% 137 bp 0.1
SRR25266136_trim.2 76.0% 45% 136 bp 0.1
SRR25266137_trim.1 61.8% 45% 133 bp 0.0
SRR25266137_trim.2 61.1% 45% 133 bp 0.0
SRR25266138_trim.1 38.8% 44% 131 bp 0.0
SRR25266138_trim.2 33.4% 44% 129 bp 0.0
SRR25266139_trim.1 54.3% 44% 131 bp 0.0
SRR25266139_trim.2 53.3% 44% 131 bp 0.0
SRR25266140_trim.1 64.6% 44% 134 bp 0.0
SRR25266140_trim.2 63.2% 44% 134 bp 0.0
SRR25266141_trim.1 53.5% 43% 131 bp 0.0
SRR25266141_trim.2 39.4% 44% 125 bp 0.0
SRR25266142_trim.1 68.7% 44% 135 bp 0.1
SRR25266142_trim.2 67.5% 44% 134 bp 0.1
SRR25266143_trim.1 60.8% 44% 136 bp 0.0
SRR25266143_trim.2 59.4% 44% 135 bp 0.0

This is the multiQC results after the alignment to the reference.

The bcftools stats produced the following:

SN  0   number of samples:  33
SN  0   number of records:  44
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 38
SN  0   number of MNPs: 0
SN  0   number of indels:   6
SN  0   number of others:   0
SN  0   number of multiallelic sites:   1
SN  0   number of multiallelic SNP sites:   1
stats: no. of samples                     :         33
       no. of chromosomes                 :          1

       ========== Micro variants ==========

       no. of SNP                         :         38
           2 alleles                      :              37 (8.25) [33/4]
           3 alleles                      :               1 (1.00) [1/1]

       no. of INDEL                       :          6
           2 alleles                      :               6 (1.00) [3/3]

       no. of micro variants              :         44

       ++++++ Other useful categories +++++


       ========= General summary ==========

       no. of VCF records                        :         44

I figure that since it is such a small part of the already small genome, this is why the stats present as they are. The transition/tranversion ratio presents as:

# TSTV, transitions/transversions:
# TSTV  [2]id   [3]ts   [4]tv   [5]ts/tv    [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV    0   34  5   6.80    33  5   6.60

The HA gene was found in the article to have a lot of divergence, in which I feel I need to reassess what I did or go back into my code and go further. I tried to make some graphs in R, but none of my scripts would work in which I assume there was an issue with what I had done. I was able to get some figures similar to the paper with the code below, but could not get it separated into the vaccinated vs unvaccinated group.

df <- read.table("~/Downloads/freebayes.query.txt", header=TRUE, comment.char="")
colnames(df) <- colnames(df) %>% str_remove("^X[.0-9]+")

# filter out multi-allelic loci for simplicity. fix AB so it's numeric (b/c comma-separated for multiallelic loci)
df <- filter(df, !str_detect(ALT, ",")) %>%
    mutate(AB=as.numeric(AB))
    
Warning message:
In read.table("~/downloads/freebayes.query.txt.gz", header = TRUE,  :
  incomplete final line found by readTableHeader on '~/downloads/freebayes.query.txt.gz'

df_long <- df_wide %>%
+     pivot_longer(cols = everything(), names_to = "Sample", values_to = "MedianDepth")
> 
> library(ggplot2)
> 
> # Create a bar plot
> ggplot(df_long, aes(x = Sample, y = MedianDepth)) +
+     geom_bar(stat = "identity", show.legend = FALSE) +
+     theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
+     labs(x = "Sample", y = "Median Depth", title = "Median Depth per Sample")

df_matrix <- df %>%
+     arrange(Sample) %>%
+     select(Sample, MedianDepth)
> 
> # Convert the data frame to a matrix, using Sample as row names
> df_matrix <- as.matrix(df_matrix[, "MedianDepth", drop = FALSE])
> rownames(df_matrix) <- df$Sample
> 
> # Create heatmap
> pheatmap(df_matrix, 
+          cluster_rows = FALSE,
+          cluster_cols = FALSE,
+          color = colorRampPalette(c("white", "lightblue", "blue", "darkblue"))(100),
+          main = "HA Median Depth per Sample")

Challenges

I did not seem to have challenges with the code until the end of the project which affected how my results went. Something I did not notice til the end was that the reference genome was split into various sections, so I was completing the code for only the HA gene for hemagglutinin. I also was having trouble with using snpEff as for the last part of bcftools I did not know where to get a reference for the virus as ensembl is for animals. I tried https://www.ebi.ac.uk/genomes/virus.html but I could not find the reference there either. This was for comapring and annotating variants so I am not as concerned, but it would have been nice to have done for the project as the original paper did so.

Conclusions

Some conclusions to take from the project and the paper was that vaccination may drive viral evolution, leading to the emergence of variants with potential immune escape. This means for the future that monitoring genetic diversity for SIAV is very important for the continuous control of the disease. Future work on the variants would also be good to see the impact of these variants. On my side for coding, I would like to go back and figure out how to incorporate all of the genes in one script since they were all separated. I would also like to figure out how to finish my script up as I believe it was incomplete without snpEff which could help my results.